Research Article  Open Access
Yoshifumi Fukunishi, Haruki Nakamura, "A Similarity Search Using Molecular Topological Graphs", BioMed Research International, vol. 2009, Article ID 231780, 8 pages, 2009. https://doi.org/10.1155/2009/231780
A Similarity Search Using Molecular Topological Graphs
Abstract
A molecular similarity measure has been developed using molecular topological graphs and atomic partial charges. Two kinds of topological graphs were used. One is the ordinary adjacency matrix and the other is a matrix which represents the minimum path length between two atoms of the molecule. The ordinary adjacency matrix is suitable to compare the local structures of molecules such as functional groups, and the other matrix is suitable to compare the global structures of molecules. The combination of these two matrices gave a similarity measure. This method was applied to in silico drug screening, and the results showed that it was effective as a similarity measure.
1. Introduction
The molecular similarity measure is an important tool in chemometrics and chemoinformatics. The main applications include ligandbased in silico (virtual) drug screening, ADMETox (adsorption, distribution, metabolism, excretion, and toxicity) property prediction, physical molecular property prediction (1octanolwater partition coefficient, solubility), and measurement of the diversity of chemical compounds in a library.
The molecular similarity measure generally assigns a onedimensional (1D) and/or twodimensional (2D) descriptor—that is, molecular fingerprints based on substructure, molecular mass, number of rotatable bonds, number of hydrogen donors/acceptors of the compound, and so forth—to compounds so that the similarities of the compounds can be evaluated [1–5]. Many methods have been proposed for the similarity search of chemical compounds, such as the comparison of overlapping substructures in the form of Daylight fingerprints (Daylight Chemical Information Systems Inc., Aliso Viejo, CA, USA), the chemically advanced template search (CATS) descriptor method developed by Pickett [6], and the BurdenCASUniversity of Texas (BCUT) descriptor method [7]. One of the most widely used methods is to compare the existence of fragment structures; this is the technique employed by the MACSS key, which was developed by Molecular Design Limited (MDL, Santa Clara, CA, USA). Each element of the feature vector of the molecule represents the existence of a particular fragment structure in the molecule (dictionary based fingerprinting). A rather large example of a dictionary used for this fingerprinting technique is the software program DRAGON developed by Talete SRL (geographical information), which consists of more than 3200 molecular descriptors. The affinity fingerprint approach is a new type of similarity search method based on a multiprotein/multicompound affinity matrix [8–21]. In this method, each element of the feature vector of the molecule represents the binding affinity of the molecule with a particular protein. Usually, the binding affinity is measured by calculation using a proteincompound docking program.
There are various applications for molecular similarity, and thus many types of similarity measures are needed. Most of the conventional molecular descriptors aim scaffold hopping (lead hopping) to find a compound with a different scaffold from the known active compound. However, in some cases, we want to find similar compounds with similar scaffolds. For example, in lead generation, we want to find a series of similar compounds with the same or similar scaffold instead of performing an actual synthesis. Substructure searches have been used for this purpose [22, 23]. However, a comparison of the indices of molecular topologies is much faster than a substructure search. A topological index, which is any of several numerical parameters of a molecular graph, is also widely used [5]. The Wiener index, Hosoya index, and Randic’s molecular connectivity index, are graph invariants and conventional topological indices. These topological indices show correlation to the physical or chemical properties of molecules, although these indices do not recognize atom types and they can be quite difficult to calculate.
In the current study, we proposed a new similarity measure for identifying topologically similar compounds based on their molecular topologies and evaluated this method by applying it to a ligandbased drug screening test.
2. Methods
2.1. Similarity and Distance between Compounds
First, the allatom model compound structures are converted to united atom models, in which all hydrogen atoms are omitted and the atomic charge of the hydrogen atom is added to the atomic charge of the connected heavy atom.
In this method, the adjacency matrix and the distance matrix are used, and Figure 1 shows an example of these two matrices for a simple graph. The topology of the compound can be represented by an edgeadjacency matrix [4, 5]:
where is the  element of matrix . The value of could be the bond order between the th and the th atoms (the value of could be 1.5 for an aromatic bond). Just as in the BCUT method, the diagonal part () is replaced by the converted atomic charge :
where is an atomic partial charge and is a coefficient. In this study, was set to 1.0. The value is >0 for any value.
The  matrix element of the pseudodistance matrix represents the minimum path length between the th and th atoms:
where is a shortest path length between the th and the th atoms. We also tried (shortestpath topological distance matrix [4, 5]) and and found that gave the best result among these definitions.
Let and be the th eigenvalue of the matrix or and the number of atoms of the unitedatom model of the compound. We define the eigenvalue histogram as follows:
Here ε and are the energy and the arbitral coefficient.
The distance between molecules and is defined based on the eigenvalue histogram of molecule and that of molecule as follows:
In the current in silico drug screening, candidate hit compounds are selected using the following method. Let and be the distance between and based on the adjacency matrix and that based on the distance matrix . These two distances give the consensus distance with the weight parameter λ:
Compounds that are close to the known active compounds are selected as the candidate hit compounds. For this purpose, the distance to the known active compounds is introduced. The distance from the th compound to the average position of the active compounds (Dist_{k}) is defined as
where , , and are the th compound, th active compounds, and the total number of the active compounds. When the number of active compounds is one, . We call Dist_{k} the moleculargraph (MG) distance and we call this screening procedure the moleculargraph distance (MGD) method. The eigenvalues (ε_{i}) of and of each compound of the compound library are stored in a database file a priori. For a query compound, the eigenvalues of and must be calculated, which costs less than 1 second. The database search is conducted only to perform the calculations in (4)–(7) and thus is quite fast.
3. Preparation of Materials
For the drug screening test, our target proteins were the macrophage migration inhibitory factor (MIF), cyclooxygenase2 (COX2), human immunodeficiency virus protease1 (HIV), thermolysin (THR), glutathione Stransferase (GST), the histamine H1 receptor, the adrenaline beta receptor, the serotonin receptor, and the dopamine D2 receptor. For validation of the present method, we used the same set of compounds as used in our previous study [20]. Namely, the compound set consisted of 12 inhibitors of MIF, 28 inhibitors of THR, 15 inhibitors of COX2, 20 inhibitors of HIV, 12 inhibitors of GST, 10 antagonists of the histamine H1 receptor [24], 12 agonists and 13 antagonists of the adrenaline beta receptor [25], 8 agonists and 9 antagonists of the serotonin receptor [26], and 6 agonists and 15 antagonists of the dopamine D2 receptor [27] as the active compounds, along with 11050 potentially negative compounds from the random compound library of the Coelacanth Chemical Corporation (East Windsor, NJ, USA). Typically, only one hit compound could be found out of 10^{4} randomly selected compounds; we therefore expected that there would be no more than a few, if any, hit compounds among these 11212 compounds. The 160 active compounds are listed in the Supplemental Materials available online at http://dx.doi.org/10.1155/2009/231780.
The size distribution of compounds was as follows: percentage of compounds with 019 atoms, 0.1%; with 20 29 atoms, 1.2%; with 30 39 atoms, 1.6%; with 40 49 atoms, 9.3%; with 50 59 atoms, 22.5%; with 60 69 atoms, 37.9%; with 70 79 atoms, 20.5%; and with more than 80 atoms, 7.0%. The average number of heavy atoms was 30.9.
The atomic charge of each ligand was determined by the Gasteiger method [28, 29]. To calculate the Gasteiger charge, an allatom model is necessary for each compound. The threedimensional (3D) coordinates of the 11050 random compounds above were generated to add the hydrogen atoms by the Concord program (Tripos, St. Louis, MO, USA) from 2D Sybyl SD files provided by the Coelacanth Chemical Corporation. The 3D coordinates of the active compounds (inhibitors, substrates, agonists, and antagonists) were generated by Chem3D (Cambridge Software, Cambridge, MA, USA).
4. Results
To evaluate the efficiency of this method, the leaveoneout crossvalidation test was applied; namely, the active compounds of each target protein were selected one by one as the known active compounds for this software and the other unknown active compounds were discovered by the software. The test dataset consists of these active compounds and the other approximately 10^{4} potential inactive compounds (decoy set). The total number of compounds was 11212. A total of 160 (= total 160 active compounds) database enrichment curves were calculated for these 9 target proteins and 11212 compounds and the results were averaged.
The surface area ( or “area under curve’’: AUC) under the total database enrichment curve is a measure of the database enrichment:
where and are the percentages of compounds that are selected from the total compound library and the database enrichment curve, respectively. A higher value corresponds to better database enrichment, and the value is always greater than zero and less than 100. For the random screening, .
First, the dependence of the hit ratio was examined in the MGD method. The average values and the hit ratio of the 160 trials with various values are summarized in Table 1. The coefficients for matrices and were optimized for every to maximize the hit ratio. When or , the hit ratio and the values were lower than those in the other cases. This result showed that the combination of matrices and is more effective than the single usage of either or . The optimized coefficients were used in the following study. The average eigenvalues of and were and 0.505 for the decoy set, respectively. The histograms of (4) were close to single Gaussian distributions. We show the distributions of of H1 antagonist diphenhydramine and COX2 inhibitor indomethacin in the supplementary data. For diphenhydramine, the average values of and were and , respectively. The deviations of values of and were 75.75 and 9.712, respectively. For indomethacin, the average values of and were and , respectively. The deviations of values of and were 75.88 and 9.739, respectively.

Second, the score distribution was examined by the MGD method. The average values and the standard deviations for the values are summarized in Table 1. Figure 2 shows a score distribution with using diphenhydramine as the template (see Figure 4). The template corresponds to . The frequency was normalized; the surface area under the curve was set to 1. The distribution is similar to Gaussian distribution.
Figure 3 shows the average database enrichment results of the 160 trial screening tests by the leaveoneout crossvalidation test. The MGD method worked well and showed good enrichment. In this calculation, was set to 0.25. Within the first 1%, 5%, and 10% of the database, 36.5%, 52.8%, and 60.0% of the active compounds were found by the similarity measure defined by (7), respectively. The average value by the MGD method using (7) was 82.53. This result is worse than the result by the machinelearning docking score index method reported previously. Namely, about 70% of the active compounds were found within the first 1% of the database and the average value was 98.5.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
(j)
Ten histamine H1 receptor antagonists were included in the compound library (see Figure 4). Figure 5 shows the known active compound (template) and the best ranked molecules, when , 0.25 and 1. The Dist_{k} values and scores (= (score – average score)/standard deviation) of the three topranked compounds are summarized in Table 1. These scores show that the score distribution is slightly different from the Gaussian distribution. In the Gaussian distribution, the number of compounds with a score is 0.1% of the database (10 compounds in this case). The scores of the topranked compounds were only 2, in this case. The selected molecules look similar to the template compound, diphenhydramine. For , the other H1 receptor antagonist, chlorpheniramine, was found as the third compound. Both diphenhydramine and chlorpheniramine have a diphenyl group. The other compounds (a1 and a2) are not very similar to the template. For , compound b2, which is not an H1 antagonist, is similar to the template. Compound b3 is identical to compound a1. For , the three bestranked compounds are not particularly similar to the query. Compounds a1, a2, b3, and c3 are similar to each other.
(a)
(b)
(c)
5. Discussion
Figure 3 shows that 36.5% of the active compounds were found within the first 1% of the compounds of the whole library. Our previous study showed that 12.4%, 43.4%, and 67.5% of the active compounds were found within the first 1% of the database by the docking score index (DSI), factorselection DSI (FSDSI), and machinelearning DSI (MLDSI) methods, respectively, when 180 proteins were used to calculate the affinity fingerprint [19]. The threedimensional (3D) shape and charge distribution of a compound govern the proteincompound binding energy. The DSI, FSDSI, and MLDSI methods utilize the 3D shape and charge distribution of the compound through the affinity fingerprint. On the other hand, the 2D structure of the compound does not govern the proteincompound binding energy. Thus, the current similarity measure was not better than the previously developed screening methods for in silico drug screening, when it was used as a single measure to describe the molecular similarity. However, the MGD method did have an advantage in terms of its computational speed. The MGD method can search 10 000 000 compounds within 1 hour on a Xeon 3 GHz computer, which is 1000 times faster than the MSMDSI method.
However, the current similarity measure was still effective for in silico drug screening. Our active compounds were chosen based on literature. As shown in Figure 4, some compounds were very similar to each other by humaneye inspection. The main reason for this similarity was likely that these compounds were generated from a progenitor compound by lead optimization. Diphenhydramine, chlorpheniramine, homochlorcyclizine, cetirizine, and clemastine have a diphenyllike group. In promethazine, olopatadine, mequitazine, and cyprohrptadine, the conformations of two phenyl groups are fixed. Most of these antagonists are structurally similar, which should be the reason why the current similarity measure was effective for the in silico drug screening. In other words, this method is not suitable for scaffold hopping (lead hopping) [30]. For scaffold hopping, other methods have been developed [30, 31].
If all matrix elements of the offdiagonal part are zero, the eigenvalues are equal to the values of the diagonal part. The offdiagonal part shifts the eigenvalues from the values of the diagonal part. In (1), the bond information close to the th atom can give major perturbation on the th diagonal value (atomic charge of the th atom). Thus the matrix represents the shortrange information of the molecular topology. On the other hand, in (3), the  matrix element of the matrix becomes large when the th atom is far from the th atom on the molecular topology. Thus the matrix can represent the longrange information of the molecular topology.
The information of the matrix and that of are independent of each other. For each compound in the test database of 10^{4} compounds, the values of and in (6) were calculated. The correlation coefficient of the values of and was only 0.08, indicating that there was no correlation between these values. Thus, the compounds in the compound library are widely distributed in the (, ) twodimensional (2D) space, and the MGD method selects the compounds around the query compound from the compound library in the 2D space.
6. Conclusion
We developed a similarity measure for chemical compounds that is based on the molecular topology, the atomic charge, and the minimum path length between atoms. The histograms of eigenvalues of these matrices were smoothed to generate a continuous curve. The difference and overlap between these two histograms define the distance between the two compounds.
This similarity measure was applied to ligandbased in silico drug screening. In this calculation, compounds whose molecular topology structures are similar to the given active compounds were selected by using this similarity measure. This measure actually worked to find unknown active compounds from a random compound library.
Acknowledgments
This work was supported by the New Energy and Industrial Technology Development Organization of Japan (NEDO), the Ministry of Economy, Trade, and Industry (METI) of Japan, and the Japan Biological Informatics Consortium (JBiC).
Supplementary Materials
The 160 active compounds are listed in the Supplemental Materials (compound names, SMILES and figures of g(e) of two compounds).
References
 H. van de Waterbeemd, B. Testa, and G. Folkers, Eds., ComputerAssisted Lead Finding and Optimization: Current Tools for Medicinal Chemistry, WileyVCH, Weinheim, Germany, 1997.
 A. R. Leach, Molecular Modelling: Principles and Applications, Pearson Education, Edinburgh, UK, 2nd edition, 2001.
 W. G. Richards and D. D. Robinson, “Molecular similarity in Rational Drug Design,” in Rational Drug Design, D. G. Truhlar, W. J. Howe, A. J. Hopfinger, J. Blaney, and R. A. Dammkoehler, Eds., pp. 39–49, Springer, New York, NY, USA, 1999. View at: Google Scholar
 A. Varnek and A. Tropsha, Eds., Chemoinformatics Approaches to Virtual Screening, Royal Society of Chemistry, Cambridge, UK, 2008.
 J. Gasteiger and T. Engel, Eds., Chemoinformatics: A Textbook, WileyVCH, Weinheim, Germany, 2003.
 S. Pickett, “Pharmacophore fingerprints in Proteinligand Interactions,” in ProteinLigand Interactions: From Molecular Recognition to Drug Design, H.J. Böhm, G. Schneider, R. Mannhold, H. Kubinyi, and G. Folkers, Eds., Methods and Principles in Medicinal Chemistry, pp. 88–91, WileyVCH, Weinheim, Germany, 2003. View at: Google Scholar
 R. S. Pearlman and K. M. Smith, “Metric validation and the receptorrelevant subspace concept,” Journal of Chemical Information and Computer Sciences, vol. 39, no. 1, pp. 28–35, 1999. View at: Google Scholar
 L. M. Kauvar, D. L. Higgins, H. O. Villar et al., “Predicting ligand binding to proteins by affinity fingerprinting,” Chemistry & Biology, vol. 2, no. 2, pp. 107–118, 1995. View at: Google Scholar
 H. Briem and I. D. Kuntz, “Molecular similarity based on DOCKgenerated fingerprints,” Journal of Medicinal Chemistry, vol. 39, no. 17, pp. 3401–3408, 1996. View at: Publisher Site  Google Scholar
 U. F. Lessel and H. Briem, “FlexsimX: a method for the detection of molecules with similar biological activity,” Journal of Chemical Information and Modeling, vol. 40, no. 2, pp. 246–253, 2000. View at: Publisher Site  Google Scholar
 H. Briem and U. F. Lessel, “In vitro and in silico affinity fingerprints: finding similarities beyond structural classes,” Perspectives in Drug Discovery and Design, vol. 20, pp. 231–244, 2000. View at: Publisher Site  Google Scholar
 A. Weber, A. Teckentrup, and H. Briem, “FlexsimR: a virtual affinity fingerprint descriptor to calculate similarities of functional groups,” Journal of ComputerAided Molecular Design, vol. 16, no. 12, pp. 903–916, 2002. View at: Publisher Site  Google Scholar
 N. Hsu, D. Cai, K. Damodaran et al., “Novel cyclooxygenase1 inhibitors discovered using affinity fingerprints,” Journal of Medicinal Chemistry, vol. 47, no. 20, pp. 4875–4880, 2004. View at: Publisher Site  Google Scholar
 G. P. A. Vigers and J. P. Rizzi, “Multiple active site corrections for docking and virtual screening,” Journal of Medicinal Chemistry, vol. 47, no. 1, pp. 80–89, 2004. View at: Publisher Site  Google Scholar
 Y. Fukunishi, Y. Mikami, and H. Nakamura, “Similarities among receptor pockets and among compounds: analysis and application to in silico ligand screening,” Journal of Molecular Graphics and Modelling, vol. 24, no. 1, pp. 34–45, 2005. View at: Publisher Site  Google Scholar
 Y. Fukunishi, Y. Mikami, S. Kubota, and H. Nakamura, “Multiple target screening method for robust and accurate in silico ligand screening,” Journal of Molecular Graphics and Modelling, vol. 25, no. 1, pp. 61–70, 2006. View at: Publisher Site  Google Scholar
 Y. Fukunishi, Y. Mikami, K. Takedomi, M. Yamimouehi, H. Shima, and H. Nakamura, “Classification of chemical compounds by proteincompound docking for use in designing a focused library,” Journal of Medicinal Chemistry, vol. 49, no. 2, pp. 523–533, 2006. View at: Publisher Site  Google Scholar
 Y. Fukunishi, S. Kubota, C. Kanai, and H. Nakamura, “A virtual active compound produced from the negative image of a ligandbinding pocket, and its application to insilico drug screening,” Journal of ComputerAided Molecular Design, vol. 20, no. 4, pp. 237–248, 2006. View at: Publisher Site  Google Scholar
 Y. Fukunishi, S. Kubota, and H. Nakamura, “Finding ligands for G proteincoupled receptors based on the proteincompound affinity matrix,” Journal of Molecular Graphics and Modelling, vol. 25, no. 5, pp. 633–643, 2007. View at: Publisher Site  Google Scholar
 Y. Fukunishi, S. Kubota, and H. Nakamura, “Noise reduction method for molecular interaction energy: application to in silico drug screening and in silico target protein screening,” Journal of Chemical Information and Modeling, vol. 46, no. 5, pp. 2071–2084, 2006. View at: Publisher Site  Google Scholar
 Y. Fukunishi, S. Hojo, and H. Nakamura, “An efficient in silico screening method based on the proteincompound affinity matrix and its application to the design of a focused library for cytochrome P450 (CYP) ligands,” Journal of Chemical Information and Modeling, vol. 46, no. 6, pp. 2610–2622, 2006. View at: Publisher Site  Google Scholar
 M. Hattori, Y. Okuno, S. Goto, and M. Kanehisa, “Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways,” Journal of the American Chemical Society, vol. 125, no. 39, pp. 11853–11865, 2003. View at: Publisher Site  Google Scholar
 M. Hattori, Y. Okuno, S. Goto, and M. Kanehisa, “Heuristics for chemical compound matching,” Genome Informatics, vol. 14, pp. 144–153, 2003. View at: Google Scholar
 T. Watanabe and Y. Fukui, “Chapter 7,” in Saiboumaku no Jyuyoutai, I. Takayanagi, Ed., pp. 121–131, Nanzandou, Tokyo, Japan, 1988. View at: Google Scholar
 K. Koike and T. Nagatomo, “Chapter 6,” in Saiboumaku no Jyuyoutai, I. Takayanagi, Ed., pp. 103–118, Nanzandou, Tokyo, Japan, 1998. View at: Google Scholar
 M. Sasa and K. Ishihara, “Chapter 8,” in Saiboumaku no Jyuyoutai, I. Takayanagi, Ed., pp. 135–147, Nanzandou, Tokyo, Japan, 1998. View at: Google Scholar
 Y. Nakata and A. Inoue, “Chapter 10,” in Saiboumaku no Jyuyoutai, I. Takayanagi, Ed., pp. 169–182, Nanzandou, Tokyo, Japan, 1998. View at: Google Scholar
 J. Gasteiger and M. Marsili, “Iterative partial equalization of orbital electronegativity—a rapid access to atomic charges,” Tetrahedron, vol. 36, no. 22, pp. 3219–3228, 1980. View at: Google Scholar
 J. Gasteiger and M. Marsili, “A new model for calculating atomic charges in molecules,” Tetrahedron Letters, vol. 19, no. 34, pp. 3181–3184, 1978. View at: Google Scholar
 G. Schneider, W. Neidhart, T. Giller, and G. Schmid, ““Scaffoldhopping” by topological pharmacophore search: a contribution to virtual screening,” Angewandte Chemie International Edition, vol. 38, no. 19, pp. 2894–2896, 1999. View at: Publisher Site  Google Scholar
 M. M. Ahlstrom, M. Ridderstrom, K. Luthman, and I. Zamora, “Virtual screening and scaffold hopping based on GRID molecular interaction fields,” Journal of Chemical Information and Modeling, vol. 45, no. 5, pp. 1313–1323, 2005. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2009 Yoshifumi Fukunishi and Haruki Nakamura. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.