A molecular similarity measure has been developed using molecular topological graphs and atomic partial charges. Two kinds of topological graphs were used. One is the ordinary adjacency matrix and the other is a matrix which represents the minimum path length between two atoms of the molecule. The ordinary adjacency matrix is suitable to compare the local structures of molecules such as functional groups, and the other matrix is suitable to compare the global structures of molecules. The combination of these two matrices gave a similarity measure. This method was applied to in silico drug screening, and the results showed that it was effective as a similarity measure.

1. Introduction

The molecular similarity measure is an important tool in chemometrics and chemoinformatics. The main applications include ligand-based in silico (virtual) drug screening, ADME-Tox (adsorption, distribution, metabolism, excretion, and toxicity) property prediction, physical molecular property prediction (1-octanol-water partition coefficient, solubility), and measurement of the diversity of chemical compounds in a library.

The molecular similarity measure generally assigns a one-dimensional (1D) and/or two-dimensional (2D) descriptor—that is, molecular fingerprints based on substructure, molecular mass, number of rotatable bonds, number of hydrogen donors/acceptors of the compound, and so forth—to compounds so that the similarities of the compounds can be evaluated [15]. Many methods have been proposed for the similarity search of chemical compounds, such as the comparison of overlapping substructures in the form of Daylight fingerprints (Daylight Chemical Information Systems Inc., Aliso Viejo, CA, USA), the chemically advanced template search (CATS) descriptor method developed by Pickett [6], and the Burden-CAS-University of Texas (BCUT) descriptor method [7]. One of the most widely used methods is to compare the existence of fragment structures; this is the technique employed by the MACSS key, which was developed by Molecular Design Limited (MDL, Santa Clara, CA, USA). Each element of the feature vector of the molecule represents the existence of a particular fragment structure in the molecule (dictionary based fingerprinting). A rather large example of a dictionary used for this fingerprinting technique is the software program DRAGON developed by Talete SRL (geographical information), which consists of more than 3200 molecular descriptors. The affinity fingerprint approach is a new type of similarity search method based on a multiprotein/multicompound affinity matrix [821]. In this method, each element of the feature vector of the molecule represents the binding affinity of the molecule with a particular protein. Usually, the binding affinity is measured by calculation using a protein-compound docking program.

There are various applications for molecular similarity, and thus many types of similarity measures are needed. Most of the conventional molecular descriptors aim scaffold hopping (lead hopping) to find a compound with a different scaffold from the known active compound. However, in some cases, we want to find similar compounds with similar scaffolds. For example, in lead generation, we want to find a series of similar compounds with the same or similar scaffold instead of performing an actual synthesis. Substructure searches have been used for this purpose [22, 23]. However, a comparison of the indices of molecular topologies is much faster than a substructure search. A topological index, which is any of several numerical parameters of a molecular graph, is also widely used [5]. The Wiener index, Hosoya index, and Randic’s molecular connectivity index, are graph invariants and conventional topological indices. These topological indices show correlation to the physical or chemical properties of molecules, although these indices do not recognize atom types and they can be quite difficult to calculate.

In the current study, we proposed a new similarity measure for identifying topologically similar compounds based on their molecular topologies and evaluated this method by applying it to a ligand-based drug screening test.

2. Methods

2.1. Similarity and Distance between Compounds

First, the all-atom model compound structures are converted to united atom models, in which all hydrogen atoms are omitted and the atomic charge of the hydrogen atom is added to the atomic charge of the connected heavy atom.

In this method, the adjacency matrix and the distance matrix are used, and Figure 1 shows an example of these two matrices for a simple graph. The topology of the compound can be represented by an edge-adjacency matrix [4, 5]:

where is the - element of matrix . The value of could be the bond order between the th and the th atoms (the value of could be 1.5 for an aromatic bond). Just as in the BCUT method, the diagonal part () is replaced by the converted atomic charge :

where is an atomic partial charge and is a coefficient. In this study, was set to 1.0. The value is >0 for any value.

The - matrix element of the pseudodistance matrix represents the minimum path length between the th and th atoms:

where is a shortest path length between the th and the th atoms. We also tried (shortest-path topological distance matrix [4, 5]) and and found that gave the best result among these definitions.

Let and be the th eigenvalue of the matrix or and the number of atoms of the united-atom model of the compound. We define the eigenvalue histogram as follows:

Here ε and are the energy and the arbitral coefficient.

The distance between molecules and is defined based on the eigenvalue histogram of molecule and that of molecule as follows:

In the current in silico drug screening, candidate hit compounds are selected using the following method. Let and be the distance between and based on the adjacency matrix and that based on the distance matrix . These two distances give the consensus distance with the weight parameter λ:

Compounds that are close to the known active compounds are selected as the candidate hit compounds. For this purpose, the distance to the known active compounds is introduced. The distance from the th compound to the average position of the active compounds (Distk) is defined as

where , , and are the th compound, th active compounds, and the total number of the active compounds. When the number of active compounds is one, . We call Distk the molecular-graph (MG) distance and we call this screening procedure the molecular-graph distance (MGD) method. The eigenvalues (εi) of and of each compound of the compound library are stored in a database file a priori. For a query compound, the eigenvalues of and must be calculated, which costs less than 1 second. The database search is conducted only to perform the calculations in (4)–(7) and thus is quite fast.

3. Preparation of Materials

For the drug screening test, our target proteins were the macrophage migration inhibitory factor (MIF), cyclooxygenase-2 (COX-2), human immunodeficiency virus protease-1 (HIV), thermolysin (THR), glutathione S-transferase (GST), the histamine H1 receptor, the adrenaline beta receptor, the serotonin receptor, and the dopamine D2 receptor. For validation of the present method, we used the same set of compounds as used in our previous study [20]. Namely, the compound set consisted of 12 inhibitors of MIF, 28 inhibitors of THR, 15 inhibitors of COX-2, 20 inhibitors of HIV, 12 inhibitors of GST, 10 antagonists of the histamine H1 receptor [24], 12 agonists and 13 antagonists of the adrenaline beta receptor [25], 8 agonists and 9 antagonists of the serotonin receptor [26], and 6 agonists and 15 antagonists of the dopamine D2 receptor [27] as the active compounds, along with 11050 potentially negative compounds from the random compound library of the Coelacanth Chemical Corporation (East Windsor, NJ, USA). Typically, only one hit compound could be found out of 104 randomly selected compounds; we therefore expected that there would be no more than a few, if any, hit compounds among these 11212 compounds. The 160 active compounds are listed in the Supplemental Materials available online at http://dx.doi.org/10.1155/2009/231780.

The size distribution of compounds was as follows: percentage of compounds with 019 atoms, 0.1%; with 20 29 atoms, 1.2%; with 30 39 atoms, 1.6%; with 40 49 atoms, 9.3%; with 50 59 atoms, 22.5%; with 60 69 atoms, 37.9%; with 70 79 atoms, 20.5%; and with more than 80 atoms, 7.0%. The average number of heavy atoms was 30.9.

The atomic charge of each ligand was determined by the Gasteiger method [28, 29]. To calculate the Gasteiger charge, an all-atom model is necessary for each compound. The three-dimensional (3D) coordinates of the 11050 random compounds above were generated to add the hydrogen atoms by the Concord program (Tripos, St. Louis, MO, USA) from 2D Sybyl SD files provided by the Coelacanth Chemical Corporation. The 3D coordinates of the active compounds (inhibitors, substrates, agonists, and antagonists) were generated by Chem3D (Cambridge Software, Cambridge, MA, USA).

4. Results

To evaluate the efficiency of this method, the leave-one-out cross-validation test was applied; namely, the active compounds of each target protein were selected one by one as the known active compounds for this software and the other unknown active compounds were discovered by the software. The test dataset consists of these active compounds and the other approximately 104 potential inactive compounds (decoy set). The total number of compounds was 11212. A total of 160 (= total 160 active compounds) database enrichment curves were calculated for these 9 target proteins and 11212 compounds and the results were averaged.

The surface area ( or “area under curve’’: AUC) under the total database enrichment curve is a measure of the database enrichment:

where and are the percentages of compounds that are selected from the total compound library and the database enrichment curve, respectively. A higher value corresponds to better database enrichment, and the value is always greater than zero and less than 100. For the random screening, .

  First, the dependence of the hit ratio was examined in the MGD method. The average values and the hit ratio of the 160 trials with various values are summarized in Table 1. The coefficients for matrices and were optimized for every to maximize the hit ratio. When or , the hit ratio and the values were lower than those in the other cases. This result showed that the combination of matrices and is more effective than the single usage of either or . The optimized coefficients were used in the following study. The average eigenvalues of and were and 0.505 for the decoy set, respectively. The histograms of (4) were close to single Gaussian distributions. We show the distributions of of H1 antagonist diphenhydramine and COX-2 inhibitor indomethacin in the supplementary data. For diphenhydramine, the average values of and were and , respectively. The deviations of values of and were 75.75 and 9.712, respectively. For indomethacin, the average values of and were and , respectively. The deviations of values of and were 75.88 and 9.739, respectively.

Second, the score distribution was examined by the MGD method. The average values and the standard deviations for the values are summarized in Table 1. Figure 2 shows a score distribution with using diphenhydramine as the template (see Figure 4). The template corresponds to . The frequency was normalized; the surface area under the curve was set to 1. The distribution is similar to Gaussian distribution.

Figure 3 shows the average database enrichment results of the 160 trial screening tests by the leave-one-out cross-validation test. The MGD method worked well and showed good enrichment. In this calculation, was set to 0.25. Within the first 1%, 5%, and 10% of the database, 36.5%, 52.8%, and 60.0% of the active compounds were found by the similarity measure defined by (7), respectively. The average value by the MGD method using (7) was 82.53. This result is worse than the result by the machine-learning docking score index method reported previously. Namely, about 70% of the active compounds were found within the first 1% of the database and the average value was 98.5.

Ten histamine H1 receptor antagonists were included in the compound library (see Figure 4). Figure 5 shows the known active compound (template) and the best ranked molecules, when , 0.25 and 1. The Distk values and -scores (= (score – average score)/standard deviation) of the three top-ranked compounds are summarized in Table 1. These -scores show that the score distribution is slightly different from the Gaussian distribution. In the Gaussian distribution, the number of compounds with a -score is 0.1% of the database (10 compounds in this case). The -scores of the top-ranked compounds were only 2, in this case. The selected molecules look similar to the template compound, diphenhydramine. For , the other H1 receptor antagonist, chlorpheniramine, was found as the third compound. Both diphenhydramine and chlorpheniramine have a diphenyl group. The other compounds (a1 and a2) are not very similar to the template. For , compound b2, which is not an H1 antagonist, is similar to the template. Compound b3 is identical to compound a1. For , the three best-ranked compounds are not particularly similar to the query. Compounds a1, a2, b3, and c3 are similar to each other.

5. Discussion

Figure 3 shows that 36.5% of the active compounds were found within the first 1% of the compounds of the whole library. Our previous study showed that 12.4%, 43.4%, and 67.5% of the active compounds were found within the first 1% of the database by the docking score index (DSI), factor-selection DSI (FS-DSI), and machine-learning DSI (ML-DSI) methods, respectively, when 180 proteins were used to calculate the affinity fingerprint [19]. The three-dimensional (3D) shape and charge distribution of a compound govern the protein-compound binding energy. The DSI, FS-DSI, and ML-DSI methods utilize the 3D shape and charge distribution of the compound through the affinity fingerprint. On the other hand, the 2D structure of the compound does not govern the protein-compound binding energy. Thus, the current similarity measure was not better than the previously developed screening methods for in silico drug screening, when it was used as a single measure to describe the molecular similarity. However, the MGD method did have an advantage in terms of its computational speed. The MGD method can search 10 000 000 compounds within 1 hour on a Xeon 3 GHz computer, which is 1000 times faster than the MSM-DSI method.

However, the current similarity measure was still effective for in silico drug screening. Our active compounds were chosen based on literature. As shown in Figure 4, some compounds were very similar to each other by human-eye inspection. The main reason for this similarity was likely that these compounds were generated from a progenitor compound by lead optimization. Diphenhydramine, chlorpheniramine, homochlorcyclizine, cetirizine, and clemastine have a diphenyl-like group. In promethazine, olopatadine, mequitazine, and cyprohrptadine, the conformations of two phenyl groups are fixed. Most of these antagonists are structurally similar, which should be the reason why the current similarity measure was effective for the in silico drug screening. In other words, this method is not suitable for scaffold hopping (lead hopping) [30]. For scaffold hopping, other methods have been developed [30, 31].

If all matrix elements of the off-diagonal part are zero, the eigenvalues are equal to the values of the diagonal part. The off-diagonal part shifts the eigenvalues from the values of the diagonal part. In (1), the bond information close to the th atom can give major perturbation on the th diagonal value (atomic charge of the th atom). Thus the matrix represents the short-range information of the molecular topology. On the other hand, in (3), the - matrix element of the matrix becomes large when the th atom is far from the th atom on the molecular topology. Thus the matrix can represent the long-range information of the molecular topology.

The information of the matrix and that of are independent of each other. For each compound in the test database of 104 compounds, the values of and in (6) were calculated. The correlation coefficient of the values of and was only 0.08, indicating that there was no correlation between these values. Thus, the compounds in the compound library are widely distributed in the (, ) two-dimensional (2D) space, and the MGD method selects the compounds around the query compound from the compound library in the 2D space.

6. Conclusion

We developed a similarity measure for chemical compounds that is based on the molecular topology, the atomic charge, and the minimum path length between atoms. The histograms of eigenvalues of these matrices were smoothed to generate a continuous curve. The difference and overlap between these two histograms define the distance between the two compounds.

This similarity measure was applied to ligand-based in silico drug screening. In this calculation, compounds whose molecular topology structures are similar to the given active compounds were selected by using this similarity measure. This measure actually worked to find unknown active compounds from a random compound library.


This work was supported by the New Energy and Industrial Technology Development Organization of Japan (NEDO), the Ministry of Economy, Trade, and Industry (METI) of Japan, and the Japan Biological Informatics Consortium (JBiC).

Supplementary Materials

The 160 active compounds are listed in the Supplemental Materials (compound names, SMILES and figures of g(e) of two compounds).

  1. Supplementary Material