Article

Ligand-Based Virtual Screening Based on the Graph Edit Distance

Departament d’Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, 43007 Tarragona, Spain
* Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2021, 22(23), 12751; https://doi.org/10.3390/ijms222312751
Submission received: 5 October 2021 / Revised: 12 November 2021 / Accepted: 13 November 2021 / Published: 25 November 2021
(This article belongs to the Special Issue Recent Advances in Virtual Screening 2.0)

Abstract

Chemical compounds can be represented as attributed graphs. An attributed graph is a mathematical model of an object composed of two types of elements: nodes and edges. Nodes represent individual components, and edges represent relations between these components. In this case, nodes carry pharmacophore-type descriptions and edges represent chemical bonds. If we want to obtain the bioactivity dissimilarity between two chemical compounds, a distance between attributed graphs can be used. The Graph Edit Distance provides this distance, defined as the cost of transforming one graph into another. Nevertheless, to define this dissimilarity, the transformation costs must be properly tuned. The aim of this paper is to analyse graph-based screening methods to verify the quality of the Harper transformation cost proposal and to present an algorithm that learns these transformation costs such that the bioactivity dissimilarity is properly defined in a ligand-based virtual screening application. The goodness of the dissimilarity is measured by the classification accuracy. Six publicly available datasets (CAPST, DUD-E, GLL&GDD, NRLiSt-BDB, MUV and ULS-UDS) have been used to validate our methodology and show that, with our learned costs, we obtain the highest ratios in identifying bioactivity similarity in a structurally diverse group of molecules.

1. Introduction

The rapid growth of chemical compound data has created the need to develop computational tools that reduce the runtimes of the drug synthesis and drug testing cycle. When activity data are analysed, these tools are required to generate new models for virtual screening techniques [1,2,3]. In the drug discovery process, virtual screening is a common step in which computational techniques are used to search and filter chemical compounds in databases. There are two main types of virtual screening methods: ligand-based virtual screening (LBVS) [4] and structure-based virtual screening (SBVS) [5]. In this work, we focus only on LBVS applications. The idea of the LBVS method is to predict the unknown activity of new molecules [6] using information about the known activity of some molecules, specifically their behaviour as ligands that bind to a receptor.
Some LBVS approaches are shape-based similarity [7], pharmacophore mapping [6], fingerprint similarity and machine learning methods [8]. According to [9], structurally similar molecules are presumed to have similar activity properties. In the context of LBVS methods, the chosen molecular similarity metric is therefore important because it can determine the success of a virtual screening method in discovering proper drug candidates. Various similarity methods are used in several applications [10,11,12,13,14].
To compute molecular similarity, it is necessary to define a descriptor representing the molecule and a distance between descriptors. Hundreds of molecular descriptors have been reported in the literature [15]. For instance, one-dimensional descriptors include general molecular properties, such as size, molecular weight, logP, dipole moment or BCUT parameters [16,17,18,19]. Two-dimensional descriptors, such as 2D fingerprints [20,21,22], generate an array representation of the molecule by simplifying the atomic information within it. Finally, three-dimensional descriptors use 3D information, such as molecular volume [23,24]. Other methods, instead of representing molecules by an N-dimensional vector, use relational structures, such as trees [25] or graphs [26,27]. Regarding the representation of molecules by graphs, some methods use reduced graphs [28,29,30,31] and others extended reduced graphs (ErGs) [28]. Reduced graphs group atomic sub-structures that have related features, e.g., pharmacophoric features, ring systems or hydrogen bonding. ErGs are an extension of reduced graphs that introduce some changes to better represent the shape, size and pharmacophoric properties of the molecules. The method presented in [28] has proven to be a powerful tool for virtual screening.
To perform reduced graph comparisons, three different similarity measures have been used: in [28,29,32], the reduced graphs are mapped into a 2D fingerprint; in [33], they are mapped into sets of shortest paths; and in [34,35], the comparison is performed directly on the graphs using the Graph Edit Distance (GED). The GED considers the distance between two graphs to be the minimum cost of the modifications required to transform one graph into the other. Each modification is one of six operations: insertion, deletion and substitution of nodes and of edges [36,37,38]. The main goal of this paper is to present an algorithm that learns the edit costs of the GED to improve the classification ratio obtained when the Harper costs are used.
In an initial paper [34], the edit costs were imposed and extracted from [33], relying on the chemical expertise of the authors and considering the different node and edge types. Later, in [35], the authors presented an algorithm for optimising those edit costs based on minimising the distance between correctly classified molecules and maximising the distance between incorrectly classified molecules. That work was inspired by a similar one carried out by Birchall et al. [39], in which the authors optimise the transformation costs of a String Edit Distance method to compare molecules represented as reduced graphs.
The main problem of the algorithm in [35] was its huge computational cost, which depends on the number of edit costs to be optimised. Thus, for practical reasons, the experimental section in [35] presented four experiments, in which only one edit cost was optimised in each experiment; the other costs (126 in total) were imposed to be the ones defined in [33]. In contrast, starting from the costs defined in [33], our method learns the whole set of edit costs of the GED with a lower computational cost, obtaining higher classification ratios in the ligand-based screening application, as shown in the experimental section.
The outline of this paper is as follows. First, materials and methods are explained in detail, including the datasets, the GED and the learning algorithm. Second, the results of the practical experiments are described and discussed. Third and finally, a general discussion about the method and the results is presented.

2. Materials and Methods

2.1. Datasets

In this study, we have used six publicly available datasets: ULS-UDS [40], GLL&GDD [41], CAPST [42], DUD-E [43], NRLiSt-BDB [44] and MUV [45]. All these datasets have been formatted and standardised by the LBVS benchmarking platform developed by Skoda and Hoksza [46]. The datasets are composed of groups of active and inactive molecules arranged according to a target. Each group is split into two halves, the train and test sets, which are required when using machine learning methods. The train set is used to optimise the transformation costs, and the test set is used to evaluate the classification ratio. The targets of the datasets are shown in Table 1. In our experimentation, we have taken a subset of the first 100 active molecules and the first 100 inactive molecules per target. Some targets have fewer than 100 active molecules; in this case, all active molecules are taken, together with the same number of inactive molecules.
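As an illustration of this selection step, the following minimal Python sketch takes the first 100 actives (or all of them, if fewer), the same number of inactives, and splits each list into train and test halves. The function names and the list-based data layout are assumptions made here for illustration, not part of the benchmarking platform.

```python
def select_subset(actives, inactives, max_per_class=100):
    """Take at most the first 100 actives and the same number of inactives."""
    n = min(max_per_class, len(actives))
    return actives[:n], inactives[:n]

def split_halves(molecules):
    """Split a list of molecules into train and test halves."""
    half = len(molecules) // 2
    return molecules[:half], molecules[half:]
```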

2.2. Molecular Representation

Reduced graphs are compact representations of chemical compounds in which the main information is condensed into feature nodes that give abstractions of the chemical structures. Different versions of reduced graphs have been presented [26,28,30,32,33], depending on the features they summarise or the use to which they are put. In the virtual screening context, the structures are reduced to track down features or sub-structures that have the potential to interact with a specific receptor while, at the same time, keeping the topology and spatial distribution of those features. Figure 1 presents an example of molecule reduction.
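For illustration, a reduced graph can be stored as a small attributed-graph structure whose node labels are the pharmacophoric features of Table 2 and whose edge labels are the bond types. The following Python sketch is only an illustrative data structure, not the representation used in the original code.

```python
from dataclasses import dataclass, field

@dataclass
class ReducedGraph:
    # node id -> feature label from Table 2, e.g. "[0]" (hydrogen-bond donor),
    # "[5]" (aromatic ring system) or "[6]" (carbon link node)
    nodes: dict = field(default_factory=dict)
    # (node id, node id) -> bond label: "-", "=" or "≡"
    edges: dict = field(default_factory=dict)

# A tiny example: an aromatic ring system linked to a hydrogen-bond donor
example = ReducedGraph(nodes={0: "[5]", 1: "[0]"}, edges={(0, 1): "-"})
```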

2.3. The Proposed Method

We explain our proposed method in the next three subsections. The first one explains the classification of compounds based on structural information; in the second one, we explain the learning algorithm; and in the third one, we detail the code of the algorithm.

2.3.1. Classification of Molecules

Once the molecules are represented as ErGs, we can compare them by means of the Graph Edit Distance (GED) [47,48]. The GED is defined as the minimum cost of the transformations required to convert one graph into the other; thus, in our application, it is the cost of transforming one ErG into another. These modifications are called edit operations, and six of them have been defined: insertion, deletion and substitution of nodes and of edges. To classify a molecule, we apply the Nearest Neighbour (NN) strategy, which consists of calculating the GED between this molecule and the molecules whose class is known, and predicting its class (active or inactive) to be the class of the nearest molecule. If the molecule is equidistant from more than one classified molecule, the method arbitrarily selects one of the closest molecules.
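The NN step can be sketched as follows in Python, assuming a function ged(g1, g2, costs) that returns the normalised Graph Edit Distance between two reduced graphs; the function and container names are illustrative.

```python
def classify_nn(query, train_graphs, train_labels, costs, ged):
    """Predict the class of `query` as the class of its nearest training molecule."""
    best_label, best_dist = None, float("inf")
    for graph, label in zip(train_graphs, train_labels):
        dist = ged(query, graph, costs)
        if dist < best_dist:        # ties are broken arbitrarily (first one seen wins)
            best_dist, best_label = dist, label
    return best_label
```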
Edit costs have been introduced to quantitatively evaluate each edit operation. The aim of the edit costs is to assign each operation a coherent penalty in proportion to the extent to which it modifies the molecule. For instance, when ErGs are compared, it makes sense that substituting a "hydrogen-bond donor" feature with a joint "hydrogen-bond donor-acceptor" feature be penalised less heavily than substituting a "hydrogen-bond donor" feature with an "aromatic ring" system. Similarly, inserting a single bond should have a lower penalty than inserting a double bond, and so on. In a previous work [34], the edit costs proposed by Harper et al. [33] were used. The node and edge descriptions are shown in Table 2, and the specific costs proposed by Harper et al. are given in Table 3 and Table 4.
The final edit cost for a given transformation sequence is obtained by adding up all of the individual edit costs. Figure 2 shows a schematic example of a transformation of a molecule G_1 into another one, G_2. The operations executed in this transformation are: a deletion of a node of type [1], a deletion of a single edge, an insertion of a node of type [5], an insertion of a single edge, a substitution of a node of type [7] by a node of type [2], and a substitution of a single edge by a double edge. If we sum the values of the Harper costs associated with these operations in Table 3 and Table 4, we obtain that the cost of this transformation equals 2 + 0 + 2 + 0 + 3 + 3 = 10.
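The worked example above can be reproduced with a few lines of Python; only the Harper cost entries needed for this transformation are listed.

```python
node_ins_del = {"[1]": 2, "[5]": 2}        # insertion/deletion costs (Table 3)
node_subst   = {("[7]", "[2]"): 3}         # substitution cost [7] -> [2] (Table 3)
edge_ins_del = {"-": 0}                    # inserting/deleting a single edge is free (Table 4)
edge_subst   = {("-", "="): 3}             # single edge substituted by a double edge (Table 4)

total = (node_ins_del["[1]"]               # delete the node of type [1]
         + edge_ins_del["-"]               # delete a single edge
         + node_ins_del["[5]"]             # insert the node of type [5]
         + edge_ins_del["-"]               # insert a single edge
         + node_subst[("[7]", "[2]")]      # substitute node [7] by node [2]
         + edge_subst[("-", "=")])         # substitute a single edge by a double edge
print(total)                               # 10, as in the text
```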
Since several transformation sequences can be applied to transform one graph into another, the GED for any pair of graphs is defined as the minimum cost over all those possible transformation sequences. Usually, the final distance is normalised by the number of nodes in both graphs being compared, in order to make the measure independent of the size of the graphs.
More formally, we define the GED as follows,
GED(G_a, G_b, C_1, \ldots, C_n) = \min_{\{N_t : t = 1, \ldots, n\}} \frac{C_1 N_1 + \cdots + C_n N_n}{L}    (1)

where C_t is the imposed cost of the t-th edit operation on nodes and edges, and N_t is the number of times this edit operation is applied. Moreover, the combination N_1, ..., N_n is restricted to be one that transforms G_a into G_b. Finally, L is the sum of the numbers of nodes of both graphs, and n is the number of different edit operations on nodes and edges.
Several GED computational methods have been proposed during the last three decades. They can be classified into two groups: those returning the exact value of the GED at an exponential computational cost with respect to the number of nodes [49], and those returning an approximation of the GED at a polynomial cost [50,51,52,53]. These two groups of GED computational methods have been widely studied [54,55]. In our experiments, we used the fast bipartite graph matching method [50] (polynomial computational cost), although our learning method is independent of the matching algorithm.
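As a rough illustration of the bipartite approximation, the following Python sketch builds the usual (n1 + n2) x (n1 + n2) assignment matrix and solves it with the Hungarian algorithm. It is simplified to node operations only (edge costs are ignored, unlike the actual method of [50]), and the cost functions are passed in as parameters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_ged(nodes1, nodes2, subst_cost, ins_del_cost):
    """Approximate, node-only GED normalised by the total number of nodes."""
    n1, n2 = len(nodes1), len(nodes2)
    INF = 1e9
    C = np.zeros((n1 + n2, n1 + n2))
    for i, a in enumerate(nodes1):              # substitution block
        for j, b in enumerate(nodes2):
            C[i, j] = subst_cost(a, b)
    C[:n1, n2:] = INF                           # deletion block (diagonal only)
    for i, a in enumerate(nodes1):
        C[i, n2 + i] = ins_del_cost(a)
    C[n1:, :n2] = INF                           # insertion block (diagonal only)
    for j, b in enumerate(nodes2):
        C[n1 + j, j] = ins_del_cost(b)
    rows, cols = linear_sum_assignment(C)       # Hungarian algorithm
    return C[rows, cols].sum() / (n1 + n2)
```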
Initially, the edit costs were manually set in a trial and error process considering the application at hand [33,34]. (As previously commented, Table 3 and Table 4 show their edit cost proposal.) Nevertheless, there has been a tendency to learn these costs automatically, since a proper tuning of them has been shown to be crucial for achieving good classification ratios in virtual screening [35] and other applications [56,57,58,59,60]. In [35], the authors presented a learning algorithm that is forced to learn only one edit cost at a time due to runtime restrictions. Thus, they performed four different experiments on the same data as [34], in which they used all the costs of [34] except the one being learned. These experiments are: C1, learning the deletion/insertion cost of the carbon link node ([6]); C2, learning the cost of substituting a carbon link node ([6]) with an aromatic ring system ([5]); C3, learning the insertion/deletion cost of the single bond edge ([-]); and C4, learning the substitution cost between a single bond edge ([-]) and a double bond edge ([=]). Table 5 shows their learnt costs.
The next section presents our method, which has the advantage of learning the whole set of edit costs at once.

2.3.2. The Learning Method

We present an iterative algorithm in which, in each iteration, the NN strategy is applied and the current edit costs are modified such that one molecule that was incorrectly classified becomes correctly classified. Modifying the edit costs could cause other incorrectly classified molecules to also become properly classified but, unfortunately, it could also cause some molecules that were properly classified to become incorrectly classified. This is the reason why we want to generate the minimum modification of the edit costs. To do so, the selected molecule is the one that is easiest to move from the incorrectly classified set to the correctly classified set. In the next paragraphs, our learning algorithm is explained in detail.
Let G_j be a molecule in the learning set that has been incorrectly classified using the NN strategy and the current costs C_1, ..., C_n. We define D_j as the minimal GED between G_j and all the other molecules, restricted to the ones that have a different class:

D_j = \min_q GED(G_j, G_q, C_1, \ldots, C_n), \quad \text{where } \mathrm{class}(G_q) \neq \mathrm{class}(G_j)    (2)
Moreover, we define D'_j as the minimal GED between G_j and all the molecules of the learning set, restricted to the ones that have the same class:

D'_j = \min_p GED(G_j, G_p, C_1, \ldots, C_n), \quad \text{where } \mathrm{class}(G_p) = \mathrm{class}(G_j)    (3)
Since G_j is incorrectly classified, we can confirm that D'_j > D_j. Figure 3 schematically shows this situation: G_j and G_q belong to different classes even though the distance between them is smaller than the distance between G_j and its closest molecule of the same class, G_p.
The main idea of our method is to swap D_j and D'_j by modifying the edit costs. With this exchange, the distance between G_j and the closest molecule of its same class (G_p) becomes lower than the distance between G_j and the closest molecule of a different class (G_q); thus, G_j will be correctly classified. However, considering that adapting these distances affects the classification of all the molecules, we select the molecule G_i among the incorrectly classified ones, {G_j | D'_j > D_j}, for which the difference of distances D'_j - D_j is minimal, as shown in Equation (4). Note that in Equation (4), all the values of D'_j - D_j are positive because D'_j > D_j by definition of G_j.
G_i = \arg\min_{\{G_j \,|\, D'_j > D_j\}} \left( D'_j - D_j \right)    (4)
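A minimal Python sketch of Equations (2)-(4) is given below: for every molecule it computes the distance to its nearest neighbour of a different class (D) and of the same class (Dp), and it returns the misclassified molecule with the smallest gap Dp - D. The ged() function and the container names are assumptions, as in the previous sketches.

```python
def select_molecule(graphs, labels, costs, ged):
    """Return the index of the misclassified molecule with the smallest D' - D gap."""
    best_i, best_gap = None, float("inf")
    for j, (gj, yj) in enumerate(zip(graphs, labels)):
        dists = [(ged(gj, gq, costs), yq)
                 for q, (gq, yq) in enumerate(zip(graphs, labels)) if q != j]
        D  = min(d for d, y in dists if y != yj)   # Equation (2): nearest different class
        Dp = min(d for d, y in dists if y == yj)   # Equation (3): nearest same class
        if Dp > D and (Dp - D) < best_gap:         # misclassified, smallest gap (Equation (4))
            best_gap, best_i = Dp - D, j
    return best_i, best_gap
```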
Figure 4 illustrates this selection. What is crucial to understand is that this modification is performed on the distances, since the molecule representations are not modified; it is carried out by modifying the edit costs. Thus, the strategy is to define new edit costs such that D_i becomes D'_i and vice versa.
The rest of this section is devoted to explaining how to modify the edit costs.
Considering Equation (1), a distance is composed of the edit costs C_1, ..., C_n and the numbers of times the specific edit operations are applied, N_1, ..., N_n. Our method modifies the edit costs without altering the numbers of operations.
Thus, denoting by N_1, ..., N_n and L the operation counts and normalisation term of the edit path that yields D_i, and by N'_1, ..., N'_n and L' those of the edit path that yields D'_i, we define D_i and D'_i as follows:

D_i = \frac{C_1 N_1 + \cdots + C_n N_n}{L}, \qquad D'_i = \frac{C_1 N'_1 + \cdots + C_n N'_n}{L'}    (5)
Then, we exchange the distances D_i and D'_i and modify the edit costs by adding new terms:

D'_i = \frac{(C_1 + \alpha_1) N_1 + \cdots + (C_n + \alpha_n) N_n}{L}, \qquad D_i = \frac{(C_1 + \alpha'_1) N'_1 + \cdots + (C_n + \alpha'_n) N'_n}{L'}    (6)
Note that these new terms α_1, ..., α_n and α'_1, ..., α'_n are defined such that the new value of the first expression is D'_i instead of D_i and vice versa. Moreover, the edit costs C_1, ..., C_n are the same in both expressions. We proceed to explain below how to deduce the terms α_1, ..., α_n and α'_1, ..., α'_n.
From Equation (6), we obtain:

D'_i = \frac{C_1 N_1 + \cdots + C_n N_n}{L} + \frac{\alpha_1 N_1 + \cdots + \alpha_n N_n}{L}, \qquad D_i = \frac{C_1 N'_1 + \cdots + C_n N'_n}{L'} + \frac{\alpha'_1 N'_1 + \cdots + \alpha'_n N'_n}{L'}    (7)
We observe that the first terms in both expressions are D_i and D'_i, respectively:

D'_i = D_i + \frac{\alpha_1 N_1 + \cdots + \alpha_n N_n}{L}, \qquad D_i = D'_i + \frac{\alpha'_1 N'_1 + \cdots + \alpha'_n N'_n}{L'}    (8)
By regrouping the terms, we have:

D'_i - D_i = \frac{\alpha_1 N_1 + \cdots + \alpha_n N_n}{L}, \qquad D_i - D'_i = \frac{\alpha'_1 N'_1 + \cdots + \alpha'_n N'_n}{L'}    (9)
Finally, we divide by D'_i - D_i and by D_i - D'_i, respectively, to arrive at the following normalised expressions:

1 = \frac{\alpha_1 N_1}{(D'_i - D_i) L} + \cdots + \frac{\alpha_n N_n}{(D'_i - D_i) L}, \qquad 1 = \frac{\alpha'_1 N'_1}{(D_i - D'_i) L'} + \cdots + \frac{\alpha'_n N'_n}{(D_i - D'_i) L'}    (10)
Note that, as commented in the definition of the GED, not all of the edit operations are used to transform one molecule into another; the unused operations are the ones with N_t = 0 or N'_t = 0. Because of this, some addends in Equation (10) are null. We use m and m' to denote the numbers of edit operations that have actually been used, that is, those with N_t ≠ 0 and N'_t ≠ 0, respectively.
We want to deduce α_1, ..., α_n and α'_1, ..., α'_n such that Equation (10) is fulfilled. The easiest way is to impose that each non-null term in these expressions equals 1/m or 1/m', respectively. Then, we achieve the following expressions:

\frac{1}{m} = \frac{\alpha_t N_t}{(D'_i - D_i) L} \;\text{ with } N_t > 0, \qquad \frac{1}{m'} = \frac{\alpha'_t N'_t}{(D_i - D'_i) L'} \;\text{ with } N'_t > 0    (11)
From the previous expressions, we arrive at the definitions of α_t, which allow the modification from D_i to D'_i, and of α'_t, which allow the modification from D'_i to D_i:

\alpha_t = \frac{(D'_i - D_i) L}{m N_t}, \; N_t > 0, \qquad \alpha'_t = \frac{(D_i - D'_i) L'}{m' N'_t}, \; N'_t > 0    (12)
Note that, considering Equations (5), (6) and (12), we have, on the one hand, the new costs C̄_t = C_t + α_t and, on the other hand, C̄_t = C_t + α'_t. Since it may happen that α_t ≠ α'_t, we take the average when both terms are available:

\bar{C}_t = \begin{cases} C_t + \dfrac{\alpha_t + \alpha'_t}{2}, & \text{if } N_t > 0 \text{ and } N'_t > 0 \\ C_t + \alpha_t, & \text{if } N_t > 0 \text{ and } N'_t = 0 \\ C_t + \alpha'_t, & \text{if } N_t = 0 \text{ and } N'_t > 0 \\ C_t, & \text{if } N_t = 0 \text{ and } N'_t = 0 \end{cases}    (13)
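The cost update of Equations (12) and (13) can be sketched in Python as follows, assuming the operation counts of the two edit paths of the selected molecule (N with normaliser L towards the different-class neighbour, Np with normaliser Lp towards the same-class one) are available as dictionaries indexed by edit-operation type; these names are introduced here for illustration.

```python
def update_costs(costs, D, Dp, N, L, Np, Lp):
    """Return the updated edit costs according to Equations (12) and (13)."""
    m  = sum(1 for t in N  if N[t]  > 0)
    mp = sum(1 for t in Np if Np[t] > 0)
    new_costs = dict(costs)
    for t in costs:
        alpha  = (Dp - D) * L  / (m  * N[t])  if N.get(t, 0)  > 0 else None   # Equation (12)
        alphap = (D - Dp) * Lp / (mp * Np[t]) if Np.get(t, 0) > 0 else None
        if alpha is not None and alphap is not None:       # Equation (13), first case
            new_costs[t] = costs[t] + (alpha + alphap) / 2
        elif alpha is not None:
            new_costs[t] = costs[t] + alpha
        elif alphap is not None:
            new_costs[t] = costs[t] + alphap
        # otherwise the cost is left unchanged
    return new_costs
```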
In the next subsection, we present our algorithm.

2.3.3. Algorithm

Algorithm 1 consists of an iterative process in which, in each iteration, the edit costs are updated to correct the classification of one selected molecule. The updated costs are used in the next iteration to classify all the molecules again, select a molecule and modify the costs again.
Algorithm 1 Costs learning.
Input: (Learning set, Initial edit costs, Max_Iter)
Output: (Learnt edit costs)
  • Initialise:
      iter = 1.
      C_1, ..., C_n = Initial edit costs.
  • While iter ≤ Max_Iter:
      • Classify all molecules with the nearest neighbour and the GED (Equation (1)) using C_1, ..., C_n.
      • Compute D_j and D'_j (Equations (2) and (3)) for all incorrectly classified G_j.
      • Deduce G_i (Equation (4)).
      • Compute α_t for the m used operations and α'_t for the m' used operations (Equation (12)).
      • Compute C̄_1, ..., C̄_n (Equation (13)).
      • Update costs: C_t = C̄_t, t = 1, ..., n.
      • iter = iter + 1.
    End While
End Algorithm
This algorithm has been coded in Matlab, and it is available at https://deim.urv.cat/~francesc.serratosa/SW/ (accessed on 12 November 2021).
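Tying the previous sketches together, the learning loop of Algorithm 1 can be outlined in Python as below. This is only a compact sketch: classify_nn, select_molecule and update_costs are the illustrative helpers defined above, ged is an assumed GED routine, and edit_paths_of is a hypothetical helper that would return the distances, operation counts and normalisers of the two edit paths of the selected molecule.

```python
def learn_costs(train_graphs, train_labels, initial_costs, ged, max_iter=50):
    costs = dict(initial_costs)
    for _ in range(max_iter):
        # Select the misclassified molecule with the smallest gap D' - D
        # (this implicitly classifies every molecule with the current costs).
        i, gap = select_molecule(train_graphs, train_labels, costs, ged)
        if i is None:                 # every molecule is already correctly classified
            break
        # Recover the two edit paths of molecule i and update the costs.
        # edit_paths_of() is a hypothetical helper, not part of the published code.
        D, Dp, N, L, Np, Lp = edit_paths_of(train_graphs, train_labels, i, costs, ged)
        costs = update_costs(costs, D, Dp, N, L, Np, Lp)
    return costs
```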

3. Results

Table 6 shows the classification ratios obtained in each dataset using different edit cost configurations, algorithms and initialisations. The first row corresponds to the accuracies obtained with the costs proposed by Harper [33], and the second row to the accuracies obtained by setting all the costs to 1 (no learning algorithm). The next four rows correspond to the accuracies obtained using the costs deduced by García et al. [35] in their four experiments (C1, C2, C3 and C4). Finally, the last two rows present the accuracies obtained by our method, initialising the algorithm with the Harper costs and with all costs set to 1, respectively. Note that the costs used are the mean of the costs learned over all the databases, and that our algorithm performed 50 iterations.
We observe that in all the datasets except MUV and ULS-UDS, our costs with Harper initialisation obtain the highest classification ratios; in these two datasets, the best accuracy is obtained by the Harper costs. Note that our method initialised with all-ones costs returns lower accuracies than our method initialised with the Harper costs, except for the ULS-UDS dataset. This behaviour suggests that the initialisation point is very important in this type of algorithm. Another highlight is that we achieve better accuracies than the four experiments presented by García et al. [35] in all the tests. In the ULS-UDS dataset, our method returns an accuracy close to that of the Harper costs; nevertheless, that is not the case for the MUV dataset. To analyse this behaviour more deeply, we computed the accuracy using the costs learned from only the MUV targets. In this case, the accuracy is 64.9%, which is significantly lower than using the mean costs. This is not the normal behaviour of learning algorithms, since dataset-specific learning usually increases the classification ratio. We think there are two possible reasons for this abnormal behaviour: the small size of this dataset, and the low separability between ligands and decoys in MUV, which prevents our algorithm from converging to a proper solution.
In Figure 5, we present the classification ratio obtained in the 127 targets of the six datasets. At a glance, we see that our method achieves the highest accuracies in most of the targets of the CAPST, DUD-E, GLL&GDD and NRLiSt-BDB databases. Specifically, we point out targets 19 to 31 in the GLL&GDD dataset, where the other cost combinations obtain very low accuracies while our method achieves much higher results. We also observe that the targets of the MUV and ULS-UDS datasets, in which our method does not return the highest accuracies, show high variability: the same costs produce very different results.
Note that in [35], the authors computed costs for each of the six datasets and each target separately. Conversely, we learn the edit costs from the six datasets at once. In general, using several datasets at once makes the learnt parameters less specific to the application at hand, and thus the classification ratios tend to decrease. In spite of this possible disadvantage, our method returns better classification ratios than the one in [35] in all the datasets. Figure 6 shows the percentage of times that each cost configuration obtains the highest classification ratio over all 127 targets, considering the four configurations proposed in [35], the configuration proposed in [33] and our deduced configuration. Our method obtains the best classification ratio the highest number of times.
Table 7 and Table 8 show our learned edit costs for nodes and edges, respectively. In red and bold, we highlight the ones that differ from those proposed by Harper et al. [33]. As we can see, the results are very similar to the Harper costs because we introduce a very small modification in each step. In addition, many costs have not been modified at all: these costs were not involved in the edit paths of the improperly classified molecules selected by minimising D'_i - D_i.

4. Discussion

We have presented an iterative algorithm in which, in each iteration, the current costs are modified to properly classify one improperly classified molecule. While updating the costs, other improperly classified molecules could also become properly classified, and vice versa. This is the reason why we cannot guarantee the algorithm's convergence. To reduce the impact of non-convergence and the possible oscillation of the solution, the algorithm selects the molecule that requires the minimum modification of the costs, with the aim of moving gradually towards the best solution.
The algorithm requires some initial costs. We have initialised the algorithm with some arbitrary costs and with the costs proposed by Harper [33]. In all the tests, the highest accuracies appeared when initialising the costs with the Harper proposal. We believe this behaviour appears because the optimisation function of the learning algorithm is highly non-convex; generally, in these situations, the selected initialisation has a high impact on the solution. Finally, we have seen that the classification accuracy is highly dependent on the edit costs; that is, a slight modification of one of the costs could make the classification completely different. Considering that the computational cost of this learning problem is extremely high, sub-optimal algorithms such as the one presented here are needed to achieve an acceptable classification accuracy. Thus, any proposal that achieves better classification ratios should be considered and analysed.

5. Conclusions and Future Research

In some ligand-based virtual screening (LBVS) methods, molecules are represented by extended reduced graphs. In this case, the Graph Edit Distance can be used to compute the dissimilarity between molecules.
In this article, we have presented a new method that automatically learns the edit costs in the Graph Edit Distance. In each iteration, our method introduces slight modifications in the current costs to correct the classification of a selected molecule that had been incorrectly classified in the previous step.
The obtained costs have been tested in six publicly available datasets and have been compared to previous works published in [34,35]. We achieve better classification ratios than [35] in the six datasets and better classification ratios than [34] in four of them.
In the experimental section, we realised that small modifications in the costs could produce considerable improvement in the classification ratio.
In future work, we will analyse which types of molecules cause the algorithm to converge and which are the best initial values to obtain higher classification accuracy.

Author Contributions

Conceptualization, methodology and investigation: E.R., S.Á. and F.S.; Supervision: S.Á. and F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Martí i Franquès grant number 2018PMF-PIPF-30.

Acknowledgments

This project has received funding from the Martí i Franquès fellowship of Universitat Rovira i Virgili (URV), Spain.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kubinyi, H.; Mannhold, R.; Timmerman, H. Virtual Screening for Bioactive Molecules; John Wiley & Sons: Hoboken, NJ, USA, 2008; Volume 10. [Google Scholar]
  2. Bajorath, J. Selected concepts and investigations in compound classification, molecular descriptor analysis, and virtual screening. J. Chem. Inf. Comput. Sci. 2001, 41, 233–245. [Google Scholar] [CrossRef]
  3. Schneider, G.; Clément-Chomienne, O.; Hilfiger, L.; Schneider, P.; Kirsch, S.; Böhm, H.J.; Neidhart, W. Virtual screening for bioactive molecules by evolutionary de novo design. Angew. Chem. Int. Ed. 2000, 39, 4130–4133. [Google Scholar] [CrossRef]
  4. Cereto-Massagué, A.; Ojeda, M.J.; Valls, C.; Mulero, M.; Garcia-Vallvé, S.; Pujadas, G. Molecular fingerprint similarity search in virtual screening. Methods 2015, 71, 58–63. [Google Scholar] [CrossRef]
  5. Heikamp, K.; Bajorath, J. The future of virtual compound screening. Chem. Biol. Drug Des. 2013, 81, 33–40. [Google Scholar] [CrossRef]
  6. Sun, H. Pharmacophore-based virtual screening. Curr. Med. Chem. 2008, 15, 1018–1024. [Google Scholar] [CrossRef]
  7. Kirchmair, J.; Distinto, S.; Markt, P.; Schuster, D.; Spitzer, G.M.; Liedl, K.R.; Wolber, G. How to optimize shape-based virtual screening: Choosing the right query and including chemical information. J. Chem. Inf. Model. 2009, 49, 678–692. [Google Scholar] [CrossRef]
  8. Melville, J.L.; Burke, E.K.; Hirst, J.D. Machine learning in virtual screening. Comb. Chem. High Throughput Screen. 2009, 12, 332–343. [Google Scholar] [CrossRef]
  9. Johnson, M.A.; Maggiora, G.M. Concepts and Applications of Molecular Similarity; Wiley: Hoboken, NJ, USA, 1990. [Google Scholar]
  10. Bender, A.; Glen, R.C. Molecular similarity: A key technique in molecular informatics. Org. Biomol. Chem. 2004, 2, 3204–3218. [Google Scholar] [CrossRef]
  11. Nikolova, N.; Jaworska, J. Approaches to measure chemical similarity—A review. Mol. Inf. 2003, 22, 1006–1026. [Google Scholar] [CrossRef]
  12. Willett, P. Evaluation of molecular similarity and molecular diversity methods using biological activity data. In Chemoinformatics; Springer: Berlin/Heidelberg, Germany, 2004; pp. 51–63. [Google Scholar]
  13. Lajiness, M. Molecular similarity-based methods for selecting compounds for screening. In Computational Chemical Graph Theory; Nova Science Publishers, Inc.: Hauppauge, NY, USA, 1990; pp. 299–316. [Google Scholar]
  14. Willett, J. Similarity and Clustering in Chemical Information Systems; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1987. [Google Scholar]
  15. Xue, L.; Bajorath, J. Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening. Comb. Chem. High Throughput Screen. 2000, 3, 363–372. [Google Scholar] [CrossRef]
  16. Menard, P.R.; Mason, J.S.; Morize, I.; Bauerschmidt, S. Chemistry space metrics in diversity analysis, library design, and compound selection. J. Chem. Inf. Comput. Sci. 1998, 38, 1204–1213. [Google Scholar] [CrossRef]
  17. Pearlman, R.S.; Smith, K.M. Metric validation and the receptor-relevant subspace concept. J. Chem. Inf. Comput. Sci. 1999, 39, 28–35. [Google Scholar] [CrossRef]
  18. Schnur, D. Design and diversity analysis of large combinatorial libraries using cell-based methods. J. Chem. Inf. Comput. Sci. 1999, 39, 36–45. [Google Scholar] [CrossRef]
  19. Livingstone, D.J. The characterization of chemical structures using molecular properties. A survey. J. Chem. Inf. Comput. Sci. 2000, 40, 195–209. [Google Scholar] [CrossRef]
  20. Barnard, J.M. Substructure searching methods: Old and new. J. Chem. Inf. Comput. Sci. 1993, 33, 532–538. [Google Scholar] [CrossRef]
  21. James, C.; Weininger, D. Daylight, 4.41 Theory Manual; Daylight Chemical Information Systems Inc.: Irvine, CA, USA, 1995. [Google Scholar]
  22. McGregor, M.J.; Pallai, P.V. Clustering of large databases of compounds: Using the MDL “keys” as structural descriptors. J. Chem. Inf. Comput. Sci. 1997, 37, 443–448. [Google Scholar] [CrossRef]
  23. Güner, O.F. Pharmacophore Perception, Development, and Use in Drug Design; Internat’l University Line: Geneva, Switzerland, 2000; Volume 2. [Google Scholar]
  24. Beno, B.R.; Mason, J.S. The design of combinatorial libraries using properties and 3D pharmacophore fingerprints. Drug Discov. Today 2001, 6, 251–258. [Google Scholar] [CrossRef]
  25. Rarey, M.; Dixon, J.S. Feature trees: A new molecular similarity measure based on tree matching. J. Comput.-Aided Mol. Des. 1998, 12, 471–490. [Google Scholar] [CrossRef]
  26. Barker, E.J.; Buttar, D.; Cosgrove, D.A.; Gardiner, E.J.; Kitts, P.; Willett, P.; Gillet, V.J. Scaffold hopping using clique detection applied to reduced graphs. J. Chem. Inf. Model. 2006, 46, 503–511. [Google Scholar] [CrossRef]
  27. Takahashi, Y.; Sukekawa, M.; Sasaki, S. Automatic identification of molecular similarity using reduced-graph representation of chemical structure. J. Chem. Inf. Model. 1992, 32, 639–643. [Google Scholar] [CrossRef]
  28. Stiefl, N.; Watson, I.A.; Baumann, K.; Zaliani, A. ErG: 2D pharmacophore descriptions for scaffold hopping. J. Chem. Inf. Model. 2006, 46, 208–220. [Google Scholar] [CrossRef] [PubMed]
  29. Gillet, V.J.; Willett, P.; Bradshaw, J. Similarity searching using reduced graphs. J. Chem. Inf. Comput. Sci. 2003, 43, 338–345. [Google Scholar] [CrossRef]
  30. Gillet, V.J.; Downs, G.M.; Holliday, J.D.; Lynch, M.F.; Dethlefsen, W. Computer storage and retrieval of generic chemical structures in patents. 13. Reduced graph generation. J. Chem. Inf. Comput. Sci. 1991, 31, 260–270. [Google Scholar] [CrossRef]
  31. Fisanick, W.; Lipkus, A.H.; Rusinko, A. Similarity Searching on CAS Registry Substances. 2. 2D Structural Similarity. J. Chem. Inf. Comput. Sci. 1994, 34, 130–140. [Google Scholar] [CrossRef]
  32. Barker, E.J.; Gardiner, E.J.; Gillet, V.J.; Kitts, P.; Morris, J. Further development of reduced graphs for identifying bioactive compounds. J. Chem. Inf. Comput. Sci. 2003, 43, 346–356. [Google Scholar] [CrossRef]
  33. Harper, G.; Bravi, G.S.; Pickett, S.D.; Hussain, J.; Green, D.V.S. The reduced graph descriptor in virtual screening and data-driven clustering of high-throughput screening data. J. Chem. Inf. Comput. Sci. 2004, 44, 2145–2156. [Google Scholar] [CrossRef]
  34. Garcia-Hernandez, C.; Fernández, A.; Serratosa, F. Ligand-Based Virtual Screening Using Graph Edit Distance as Molecular Similarity Measure. J. Chem. Inf. Model. 2019, 59, 1410–1421. [Google Scholar] [CrossRef]
  35. Garcia-Hernandez, C.; Fernández, A.; Serratosa, F. Learning the Edit Costs of Graph Edit Distance Applied to Ligand-Based Virtual Screening. Curr. Top. Med. Chem. 2020, 20, 1582–1592. [Google Scholar] [CrossRef]
  36. Munkres, J. Algorithms for the Assignment and Transportation Problems. J. Soc. Ind. Appl. Math. 1957, 5, 32–38. [Google Scholar] [CrossRef]
  37. Sanfeliu, A.; Fu, K.S. A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Syst. Man Cybern. 1983, SMC-13, 353–362. [Google Scholar] [CrossRef]
  38. Gao, X.; Xiao, B.; Tao, D.; Li, X. A survey of graph edit distance. Pattern Anal. Appl. 2010, 13, 113–129. [Google Scholar] [CrossRef]
  39. Birchall, K.; Gillet, V.J.; Harper, G.; Pickett, S.D. Training similarity measures for specific activities: Application to reduced graphs. J. Chem. Inf. Model. 2006, 46, 577–586. [Google Scholar] [CrossRef]
  40. Xia, J.; Tilahun, E.L.; Reid, T.E.; Zhang, L.; Wang, X.S. Benchmarking methods and data sets for ligand enrichment assessment in virtual screening. Methods 2015, 71, 146–157. [Google Scholar] [CrossRef]
  41. Gatica, E.A.; Cavasotto, C.N. Ligand and decoy sets for docking to G protein-coupled receptors. J. Chem. Inf. Model. 2011, 52, 1–6. [Google Scholar] [CrossRef]
  42. Sanders, M.P.; Barbosa, A.J.; Zarzycka, B.; Nicolaes, G.A.; Klomp, J.P.; de Vlieg, J.; Del Rio, A. Comparative analysis of pharmacophore screening tools. J. Chem. Inf. Model. 2012, 52, 1607–1620. [Google Scholar] [CrossRef] [PubMed]
  43. Mysinger, M.M.; Carchia, M.; Irwin, J.J.; Shoichet, B.K. Directory of useful decoys, enhanced (DUD-E): Better ligands and decoys for better benchmarking. J. Med. Chem. 2012, 55, 6582–6594. [Google Scholar] [CrossRef]
  44. Lagarde, N.; Ben Nasr, N.; Jeremie, A.; Guillemain, H.; Laville, V.; Labib, T.; Zagury, J.F.; Montes, M. NRLiSt BDB, the manually curated nuclear receptors ligands and structures benchmarking database. J. Med. Chem. 2014, 57, 3117–3125. [Google Scholar] [CrossRef] [PubMed]
  45. Rohrer, S.G.; Baumann, K. Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. J. Chem. Inf. Model. 2009, 49, 169–184. [Google Scholar] [CrossRef] [PubMed]
  46. Skoda, P.; Hoksza, D. Benchmarking platform for ligand-based virtual screening. In Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2016), Shenzhen, China, 15–18 December 2016; pp. 1220–1227. [Google Scholar] [CrossRef]
  47. Solé, A.; Serratosa, F.; Sanfeliu, A. On the Graph Edit Distance Cost: Properties and Applications. Int. J. Pattern Recognit. Artif. Intell. 2012, 26, 1260004. [Google Scholar] [CrossRef]
  48. Serratosa, F. Redefining the Graph Edit Distance. SN Comput. Sci. 2021, 2, 438. [Google Scholar] [CrossRef]
  49. Blumenthal, D.B.; Gamper, J. On the exact computation of the graph edit distance. Pattern Recognit. Lett. 2018, 1–12. [Google Scholar] [CrossRef]
  50. Serratosa, F. Fast computation of bipartite graph matching. Pattern Recognit. Lett. 2014, 45, 244–250. [Google Scholar] [CrossRef]
  51. Santacruz, P.; Serratosa, F. Error-tolerant graph matching in linear computational cost using an initial small partial matching. Pattern Recognit. Lett. 2018, 1–10. [Google Scholar] [CrossRef]
  52. Serratosa, F. Speeding up Fast Bipartite Graph Matching Through a New Cost Matrix. Int. J. Pattern Recognit. Artif. Intell. 2014, 29, 1550010. [Google Scholar] [CrossRef]
  53. Serratosa, F. Computation of graph edit distance: Reasoning about optimality and speed-up. Image Vis. Comput. 2015, 40, 38–48. [Google Scholar] [CrossRef]
  54. Conte, D.; Foggia, P.; Sansone, C.; Vento, M. Thirty years of graph matching in pattern recognition. Int. J. Pattern Recognit. Artif. Intell. 2004, 18, 265–298. [Google Scholar] [CrossRef]
  55. Vento, M. A long trip in the charming world of graphs for pattern recognition. Pattern Recognit. 2015, 48, 291–301. [Google Scholar] [CrossRef]
  56. Rica, E.; Álvarez, S.; Serratosa, F. On-line learning the graph edit distance costs. Pattern Recognit. Lett. 2021, 146, 55–62. [Google Scholar] [CrossRef]
  57. Conte, D.; Serratosa, F. Interactive online learning for graph matching using active strategies. Knowl. Based Syst. 2020, 205, 106275. [Google Scholar] [CrossRef]
  58. Santacruz, P.; Serratosa, F. Learning the Graph Edit Costs Based on a Learning Model Applied to Sub-optimal Graph Matching. Neural Process. Lett. 2020, 51, 881–904. [Google Scholar] [CrossRef]
  59. Algabli, S.; Serratosa, F. Embedding the node-to-node mappings to learn the Graph edit distance parameters. Pattern Recognit. Lett. 2018, 112, 353–360. [Google Scholar] [CrossRef]
  60. Cortés, X.; Serratosa, F. Learning Graph Matching Substitution Weights Based on the Ground Truth Node Correspondence. Int. J. Pattern Recognit. Artif. Intell. 2016, 30, 1650005:1–1650005:22. [Google Scholar] [CrossRef]
Figure 1. Example of molecule reduction using ErG. The original molecule is on the top and its ErG representation is below. Elements of the same colour on the top are reduced to nodes on the ErG. R: Ring system, Ac: Acyclic components.
Figure 2. Transformation sequence from graph G 1 to graph G 2 .
Figure 3. Classification of molecule G_j. The true classes are shown in solid colours. G_j is classified in the wrong class (blue), but the correct class is the red one. The distance between G_j and G_q is lower than the distance between G_j and G_p.
Figure 4. Striped molecules have been improperly classified using the NN strategy. G_i is the one that minimises D'_j - D_j, with D'_j > D_j.
Figure 5. Classification ratio in the test set over the 127 targets available in the six datasets. The horizontal axis represents the index of the targets presented in Table 1.
Figure 6. Percentage of times that each set of costs returns the best classification ratio.
Table 1. Datasets used for the experiments. Each dataset on the left contains the targets on the right.
Dataset: Used Targets
CAPST: CDK2, CHK1, PTP1B, UROKINASE
DUD-E: COX2, DHFR, EGFR, FGFR1, FXA, P38, PDGFRB, SRC, AA2AR
GLL&GDD: 5HT1A_Agonist, 5HT1A_Antagonist, 5HT1D_Agonist, 5HT1D_Antagonist, 5HT1F_Agonist, 5HT2A_Antagonist, 5HT2B_Antagonist, 5HT2C_Agonist, 5HT2C_Antagonist, 5HT4R_Agonist, 5HT4R_Antagonist, AA1R_Agonist, AA1R_Antagonist, AA2AR_Antagonist, AA2BR_Antagonist, ACM1_Agonist, ACM2_Antagonist, ACM3_Antagonist, ADA1A_Antagonist, ADA1B_Antagonist, ADA1D_Antagonist, ADA2A_Agonist, ADA2A_Antagonist, ADA2B_Agonist, ADA2B_Antagonist, ADA2C_Agonist, ADA2C_Antagonist, ADRB1_Agonist, ADRB1_Antagonist, ADRB2_Agonist, ADRB2_Antagonist, ADRB3_Agonist, ADRB3_Antagonist, AG2R_Antagonist, BKRB1_Antagonist, BKRB2_Antagonist, CCKAR_Antagonist, CLTR1_Antagonist, DRD1_Antagonist, DRD2_Agonist, DRD2_Antagonist, DRD3_Antagonist, DRD4_Antagonist, EDNRA_Antagonist, EDNRB_Antagonist, GASR_Antagonist, HRH2_Antagonist, HRH3_Antagonist, LSHR_Antagonist, LT4R1_Antagonist, LT4R2_Antagonist, MTR1A_Agonist, MTR1B_Agonist, MTR1L_Agonist, NK1R_Antagonist, NK2R_Antagonist, NK3R_Antagonist, OPRD_Agonist, OPRK_Agonist, OPRM_Agonist, OXYR_Antagonist, PE2R1_Antagonist, PE2R2_Antagonist, PE2R3_Antagonist, PE2R4_Antagonist, TA2R_Antagonist, V1AR_Antagonist, V1BR_Antagonist, V2R_Antagonist
MUV: 466, 548, 600, 644, 652, 689, 692, 712, 713, 733, 737, 810, 832, 846, 852, 858, 859
NRLiSt_BDB: AR_Agonist, AR_Antagonist, ER_Alpha_Agonist, ER_Alpha_Antagonist, ER_Beta_Agonist, FXR_Alpha_Agonist, GR_Agonist, GR_Antagonist, LXR_Alpha_Agonist, LXR_Beta_Agonist, MR_Antagonist, PPAR_Alpha_Agonist, PPAR_Beta_Agonist, PPAR_Gamma_Agonist, PR_Agonist, PR_Antagonist, PXR_Agonist, RAR_Alpha_Agonist, RAR_Beta_Agonist, RAR_Gamma_Agonist, RXR_Alpha_Agonist, RXR_Alpha_Antagonist, RXR_Gamma_Agonist, VDR_Agonist
ULS-UDS: 5HT1F_Agonist, MTR1B_Agonist, OPRM_Agonist, PE2R3_Antagonist
Table 2. Node and edge attributes description in an ErG.
Node Attributes
Attribute: Description
[0]: hydrogen-bond donor
[1]: hydrogen-bond acceptor
[2]: positive charge
[3]: negative charge
[4]: hydrophobic group
[5]: aromatic ring system
[6]: carbon link node
[7]: non-carbon link node
[0, 1]: hydrogen-bond donor + hydrogen-bond acceptor
[0, 2]: hydrogen-bond donor + positive charge
[0, 3]: hydrogen-bond donor + negative charge
[1, 2]: hydrogen-bond acceptor + positive charge
[1, 3]: hydrogen-bond acceptor + negative charge
[2, 3]: positive charge + negative charge
[0, 1, 2]: hydrogen-bond donor + hydrogen-bond acceptor + positive charge
Edge Attributes
Attribute: Description
-: single bond
=: double bond
≡: triple bond
Table 3. Substitution, insertion and deletion costs for nodes proposed by Harper et al. [33].
Substitution Costs for Nodes
         [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [0,1] [0,2] [0,3] [1,2] [1,3] [2,3] [0,1,2]
[0]       0    2    2    2    2    2    2    3    1     1     1     2     2     2     1
[1]       2    0    2    2    2    2    2    3    1     2     2     1     1     2     1
[2]       2    2    0    2    2    2    2    3    2     1     2     1     2     1     1
[3]       2    2    2    0    2    2    2    3    2     2     1     2     1     1     2
[4]       2    2    2    2    0    2    2    3    2     2     2     2     2     2     2
[5]       2    2    2    2    2    0    2    3    2     2     2     2     2     2     2
[6]       2    2    2    2    2    2    0    3    2     2     2     2     2     2     2
[7]       3    3    3    3    3    3    3    0    3     3     3     3     3     3     3
[0,1]     1    1    2    2    2    2    2    3    0     2     2     2     2     2     2
[0,2]     1    2    1    2    2    2    2    3    2     0     2     2     2     2     2
[0,3]     1    2    2    1    2    2    2    3    2     2     0     2     2     2     2
[1,2]     2    1    1    2    2    2    2    3    2     2     2     0     2     2     2
[1,3]     2    1    2    1    2    2    2    3    2     2     2     2     0     2     2
[2,3]     2    2    1    1    2    2    2    3    2     2     2     2     2     0     2
[0,1,2]   1    1    1    2    2    2    2    3    2     2     2     2     2     2     0
Insertion/Deletion Costs for Nodes
         [0]  [1]  [2]  [3]  [4]  [5]  [6]  [7]  [0,1] [0,2] [0,3] [1,2] [1,3] [2,3] [0,1,2]
insert    2    2    2    2    2    2    1    1    2     2     2     2     2     2     2
delete    2    2    2    2    2    2    1    1    2     2     2     2     2     2     2
Table 4. Substitution, insertion and deletion costs for edges proposed by Harper et al. [33].
Substitution Costs for Edges
         -   =   ≡
-        0   3   3
=        3   0   3
≡        3   3   0
Insertion/Deletion Costs for Edges
         -   =   ≡
insert   0   1   1
delete   0   1   1
Table 5. Costs obtained in [35]. Each row corresponds to one of their experiments.
     Type of Cost        CAPST     DUD-E   GLL&GDD   MUV     NRLiSt_BDB   ULS-UDS
C1   Ins/Del [6]         0.000002  0.005   0.014     0.490   0.012        0.115
C2   Subs [5] by [6]     0.013     0.145   0.333     0.867   0.104        0.500
C3   Ins/Del [-]         0.004     0.001   0.003     0.327   0.003        0.011
C4   Subs [-] by [=]     0.017     0.186   0.206     1.005   0.024        0.607
Table 6. Accuracy (%) obtained in each dataset. In bold, the highest ones. The last column shows the mean accuracy.
                           CAPST   DUD-E   GLL&GDD   MUV     NRLiSt_BDB   ULS-UDS   Mean
Harper                     93.75   95.88   85.68     92.76   93.17        96.10     92.89
1s                         92.93   91.25   93.03     56.01   94.75        92.94     86.82
C1                         89.25   92.63   82.47     86.06   88.58        89.65     88.11
C2                         89.75   91.13   82.51     87.35   88.21        91.69     88.44
C3                         91.25   91.25   83.25     86.65   87.75        92.34     88.75
C4                         89.50   90.88   82.43     86.00   89.92        92.59     88.55
Our method (Harper init.)  95.85   96.38   93.67     88.63   95.90        94.00     94.07
Our method (1s init.)      88.15   93.50   93.30     61.76   94.98        95.25     87.82
Table 7. Substitution, insertion and deletion costs of nodes obtained with our method. In bold, the ones that are different from Table 3 and Table 4.
Substitution Costs for Nodes
          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [0,1] [0,2] [0,3] [1,2] [1,3] [2,3] [0,1,2]
[0]       0     1.99  2.02  2.00  1.99  2.04  2.05  3.00  1.06  0.99  1.00  2.00  2.00  2.00  0.97
[1]       1.99  0     2.00  2.00  1.98  1.99  1.96  3.00  1.02  1.99  2.00  1.02  1.00  2.00  1.04
[2]       2.02  2.00  0     2.00  2.00  2.00  1.99  3.00  2.00  0.99  2.00  1.00  2.00  1.00  0.98
[3]       2.00  2.00  2.00  0     2.00  2.00  2.05  3.00  1.99  2.00  1.00  2.00  1.00  1.00  2.00
[4]       1.99  1.98  2.00  2.00  0     1.99  2.01  3.00  2.01  2.01  2.00  2.00  2.00  2.00  2.00
[5]       2.04  1.99  2.00  2.00  1.99  0     1.99  3.00  1.96  1.96  2.00  2.00  2.00  2.00  2.02
[6]       2.05  1.96  1.99  2.05  2.01  1.99  0     3.00  2.00  2.01  2.00  2.00  2.00  2.00  1.98
[7]       3.00  3.00  3.00  3.00  3.00  3.00  3.00  0     3.00  3.00  3.00  3.00  3.00  3.00  3.00
[0,1]     1.06  1.02  2.00  1.99  2.01  1.96  2.00  3.00  0     2.02  2.00  2.00  2.00  2.00  2.01
[0,2]     0.99  1.99  0.99  2.00  2.01  1.96  2.01  3.00  2.02  0     2.00  2.00  2.00  2.00  2.00
[0,3]     1.00  2.00  2.00  1.00  2.00  2.00  2.00  3.00  2.00  2.00  0     2.00  2.00  2.00  2.00
[1,2]     2.00  1.02  1.00  2.00  2.00  2.00  2.00  3.00  2.00  2.00  2.00  0     2.00  2.00  2.00
[1,3]     2.00  1.00  2.00  1.00  2.00  2.00  2.00  3.00  2.00  2.00  2.00  2.00  0     2.00  2.00
[2,3]     2.00  2.00  1.00  1.00  2.00  2.00  2.00  3.00  2.00  2.00  2.00  2.00  2.00  0     2.00
[0,1,2]   0.97  1.04  0.98  2.00  2.00  2.02  1.98  3.00  2.01  2.00  2.00  2.00  2.00  2.00  0
Insertion/Deletion Costs for Nodes
          [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [0,1] [0,2] [0,3] [1,2] [1,3] [2,3] [0,1,2]
insert    1.95  1.98  2.00  2.00  1.99  1.89  0.97  1.00  2.03  2.02  2.00  1.99  2.00  2.00  1.96
delete    1.95  1.98  2.00  2.00  1.99  1.89  0.97  1.00  2.03  2.02  2.00  1.99  2.00  2.00  1.96
Table 8. Substitution, insertion and deletion costs of edges obtained with our method.
Substitution Costs for Edges
         -     =     ≡
-        0     3.00  3.00
=        3.00  0     3.00
≡        3.00  3.00  0
Insertion/Deletion Costs for Edges
         -     =     ≡
insert   0     1.02  1.00
delete   0     1.02  1.00
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
