Abstract

Molecular similarity is a pervasive concept in drug design. The basic idea underlying molecular similarity is the similar property principle, which states that structurally similar molecules will exhibit similar physicochemical and biological properties. In this paper, a new graph-based molecular descriptor (GBMD) is introduced. The GBMD is a new method of obtaining a rough description of 2D molecular structure in textual form based on the canonical representations of the molecule outline shape and it allows rigorous structure specification using small and natural grammars. Simulated virtual screening experiments with the MDDR database show clearly the superiority of the graph-based descriptor compared to many standard descriptors (ALOGP, MACCS, EPFP4, CDKFP, PCFP, and SMILE) using the Tanimoto coefficient (TAN) and the basic local alignment search tool (BLAST) when searches were carried.

1. Background

The foundation of a chemical information system is the ability to represent molecules in a computer and to compare a molecule’s structure with another. Molecular comparison has been used in the early chemical information systems, for example, structure and substructure searching [1, 2]. Structure searching involves searching a chemical database for a particular query structure to retrieve all molecules with an exact match to the query structure, whereas substructure searching retrieves all molecules that contain the query structure as a subgraph [3, 4]. The equivalence similarity between two structures can be achieved by using a graph and subgraph isomorphism algorithms. Isomorphism algorithms are time consuming because it is a combinatorial problem. Various isomorphism algorithms have been developed for efficient performance but they are too slow for large chemical databases. However, structure and substructure searching were later complemented by another searching mechanism called similarity searching [5].

Similarity searching methods may be the simplest tools for ligand based virtual screening. The basic idea underlying similarity searching is the similar property principle, which states that structurally similar molecules will exhibit similar physicochemical and biological properties [6]. Over the years, many ways of measuring the structural similarity of molecules have been introduced [79]. The 2D similarity methods can be divided into two classes: the first class is the graph-based similarity methods and the second one is the fingerprint-based similarity methods. The graph-based similarity methods directly compare the molecular structures with each other and identify the similar (or common) substructures. These methods relate parts of one molecule to parts of the other molecule, and they generate a mapping or alignment between molecules. Maximum common subgraph method (MCS) is an example of the graph-based similarity methods. Another example of the graph-based similarity methods is the feature trees. The feature trees were introduced by Rarey and Dixon [8], which are the most abstract way of representing a molecule by means of a graph. A feature tree represents hydrophobic fragments and functional groups of the molecule and the way these groups are linked together. Each node in the tree is labelled with a set of features representing chemical properties of the part of the molecule corresponding to the node. The comparison of feature trees is based on matching subtrees of two feature trees onto each other. Feature trees allow similarity searching to be performed against large database, when combined with a fast mapping algorithm [9]. However, the most common similarity approaches use molecules characterized by fingerprints that encode the presence of fragments features in a molecule. The similarity between two molecules is then computed using the number of substructural fragments that is common to a pair of structures and a simple association coefficient.

The shape similarity between two molecules can be determined by comparing the shapes of those molecules; find the overlap volume between them and then use similarity measure (e.g., Tanimoto) to calculate the similarity between the molecules. However, most of the works in shape-based similarity approaches depended on the 3D molecular shape. The shape comparison program rapid overlay of chemical structures (ROCS) [10] is used to perceive similarity between molecules based on their 3D shape. The objective of this approach is to find molecules with similar bioactivity to a target molecule but with different chemotypes, that is, scaffold hopping. However, a disadvantage of 3D similarity methods is that the conformational properties of the molecules should be considered and therefore these methods are more computationally intensive than methods based on 2D structure representation. The complexity increases considerably if conformational flexibility is taken into account. There are many 2D structure representations in a numerical form integer or real. The simplest 2D descriptors are based on simple counts of features such as hydrogen donors, hydrogen bond acceptors, ring systems (such as aromatic rings), and rotatable bonds, whereas the complex 2D descriptors are computed from complex mathematical equations such as 2D fingerprints and topological indices. Topological indices are integer or real value numbers (single value) that represent the constitution of the molecules and can be calculated from the 2D graph representation of molecules and may contain additional property information about the molecule [11]. They characterize molecular structures according to their size, degree of branching, and overall shape where the structural diagram of molecules is considered as a mathematical graph but not the contour of molecule shape.

One of the representation types of chemical structures is the line notations, which encodes the connection table and (usually) the stereochemistry of a molecule as a line of text. They are widely used for storing, representing, communicating, and checking the identity of chemical structures. Their popularity derives from one or more of the following: they encode the chemical structure in a compact form; they may be human-readable and/or human-writable; they are easily entered into software (e.g., by copying and pasting into a text entry box on a website); they may be canonical (i.e., provide a unique representation for a particular molecule), in which case they may easily be used to check identity, search databases, or even search the web. In [12, 13], we introduced a new language for writing descriptors of outline shape of molecules (LWDOSM) and LINGO for descriptors of outline shape of molecules (LINGO-DOSM) that were inspired by research in information retrieval on the use of contour-based shape descriptor for image retrieval systems. The LWDOSM is a new method to obtain a rough description of the 2D molecular structure from its outline shape and allows rigorous structure specification by the use of a very small and natural grammar. Typically, there may be many different ways to construct the connection table and the LWDOSM string for a given molecule. In a connection table, one may choose different ways to number the atoms and coordinate each atom; LWDOSM string may be written starting from the top left atom and by following a different sequence through the molecule to get final LWDOSM. For example, a number of equally valid LWDOSM can be written for a molecule. For example, C–C–O–C, C–O–C–C, O–C–C–C, and C–C–C–O all specify the same structure of ethanol C2H6O. This problem could be tackled by renumbering one of the connection tables in all possible ways and testing for identity. However this is computationally infeasible since there are N different ways of numbering a connection table consisting of N atoms. Hence, in order to deal with this situation it is usual to generate a canonical representation of the chemical structure. A canonical representation is a unique ordering of the atoms for a given molecular graph.

In this paper, an algorithm has been developed to ensure that the same representation is generated for a molecule regardless of the order of atoms in the structure. This graph-based descriptor for molecule (GBMD) provides unique representation for each structure and depends on the canonicalization algorithm. A well-known method for determining a canonical order of the atoms is the Morgan algorithm [14] which is used in this algorithm.

2. Methods

The new descriptor (GBMD) is a textual descriptor using printable characters for representing molecules. In this work, the proposed method uses a connection table to extract the information needed to represent the molecule. The GBMD is a true language, albeit with a simple vocabulary (atom and bond symbols) and only a few grammar rules. However, part of the GBMD power is that it is highly sensitive to molecular structure changes. This is because each atom and bond is recorded more than once (back and forth).

In this work, the graph denotes the 2D molecular structure. Here, only the labeled molecular graph (i.e., atoms and bonds) and all possible paths between every atom pair are taken into account. A correspondent shape to a 2D molecule structure is generally composed of a main region (representing the outline shape) and one or many internal regions (representing areas inside rings) obtained after visiting all the atoms in the connection table of a molecule. The process of generating the shape descriptor of any molecule involves few steps as follows.

Step 1. Define graph to represent the molecule M. Each vertex and edge of G represents atom and bond, respectively, in molecule M. The vertices and edges are labeled with the corresponding kind of atoms and bonds, respectively (see Figure 1).

Step 2. Construct atomic connectivity values in GBMD using Morgan algorithm as shown in Figure 2.
Once vertices’ numbering in graph G is completed, a tree graph T is constructed. T is a subgraph from graph G that contains all the nodes (but not necessarily all the edges as highlighted in Figure 2 using the blue color) of a graph G. Figure 3 shows T graph extracted from graph G represents the aspirin molecule.

Step 3. Generating the GBMD descriptor, the process of generating the shape descriptor (GBMD) for any molecule starts with head of the tree T of the graph (G). The atom name is represented in the descriptor as the grammar described below. Then, we move in a depth first algorithm to browse the tree. The bond type is represented before we visit and represent the next atom. For each breaking arc in the graph G (in case graph has cycle) we add a special character “=” to identify the cycles, as shown in Figure 4.

The same procedure is repeated until we visit again the head node/atom in the tree. Once the atom is visited again, the description of the outline shape of the molecule graph has been completed. Figure 4 shows the process of generating the descriptor (GBMD).

Specification Rules. The language used for writing the GBMD descriptor consists of series of characters and symbols. There are four generic encoding rules corresponding to the specification of atoms, bonds, ring closure, and disconnected parts. These rules are similar to the rules used in LWDOSM strings presented in [12]. The rules are described as follows.

Rule 1. Atoms are represented by their atomic symbols, usually two characters. The second character of the atomic symbol must be entered in lower case (e.g., “Br,” “C1,” “N,” and “O”).

Rule 2. The single, double, and triple bonds are represented by the symbols “–”, “=”, and “#”, respectively.

Rule 3. If the molecule graph is composed of more than one part (disconnected structures), the description of the disconnected compound is written as individual structures separated by “ . ” (period).

Rule 4. If the molecule graph is composed of rings (cycles), the special character “/” is added to each breaking arc to identify the rings.

The final GBDM for the aspirin molecule using the specification steps and the four rules described above is C–C–O–C–C–C–O–C–O–C–C–C–/C–C–C–C–O–C–O–C–C–CC–/C–C.

3. Intermolecular Similarity Calculation

The evaluation of the proposed descriptor will be measured by the performance of the similarity methods and will be compared to eight molecular descriptors which are ALOGP, MACCS, EPFP4, CDKFP, PCFP, SMILE, LWDOSM, and LINGO-DOSM. Two methods of similarity measures will be used. The first method will be used for evaluating the 2D fingerprint descriptors. In this method, the Tanimoto (TAN) coefficient was used which has been used for ligand based virtual screening for many years and is considered as a reference standard. The TAN was used for five types of descriptors (fingerprints) in this study (ALOGP, MACCS, EPFP4, CDKFP, PCFP, and LINGO-DOSM). The second similarity measure is the basic local alignment search tool (BLAST) string matching method which is used to evaluate the text-based molecular descriptors (SMILE, LWDOSM, and GBMD).

4. Experimental Design

In this section, we conduct experiments that show the usefulness of our proposed descriptor GBMD when used for similarity-based virtual screening. To evaluate the GBMD descriptor, GBMD was compared with different descriptors from Scitegic’s Pipeline Pilot [15] and PaDEL-descriptor [16] software. These were canonical SMILES (SMILE) [17], 120-bit ALOGP, 166-bit MACCS, and 1024-bit Path fingerprints (EPFP4) from Scitegic’s Pipeline Pilot and 1024-bit CDK (CDKFP) and 881-bit Pubchem fingerprints (PCFP) from the PaDEL-descriptor software.

Experiments were conducted over the most popular chemoinformatics database: the MDL drug data report (MDDR) [18] which has been used in our previous studies [12, 13, 1921]. This database consists of 102,516 molecules. For the screening experiments, three data sets (DS1–DS3) were chosen from the MDDR database. The dataset DS1 contains 11 activity classes, with some of the classes involving actives that are structurally homogeneous and with others involving actives that are structurally heterogeneous (i.e., structurally diverse). The DS2 data set contains 10 homogeneous activity classes and the DS3 data set contains 10 heterogeneous activity classes. Details of these three data sets are listed in Tables 1, 2, and 3. Each row in the tables contains an activity class, the number of molecules belonging to the class, and the diversity of the class, which was computed as the mean pairwise Tanimoto similarity calculated across all pairs of molecules in the class using ECFP6.

The screening experiments were performed with 10 reference structures selected randomly from each activity class, which are the same references structures which were used in [12, 13, 1921]. The recall results were averaged over each set of active molecules, where the recall is the percentage of the actives retrieved in the top-1% or the top-5% of the ranked list resulting from a similarity search.

5. Results and Discussion

The experiments are conducted to identify the possibility of using the GBMD descriptor in similarity-based virtual screening and then identifying the retrieval effectiveness of using such a descriptor. In this study, we compared the retrieval effectiveness of GBMD against six different descriptors using three different data sets, DS1–DS3. In addition, we compared the effectiveness of the new method with the LWDOSM and Lingo-DOSM [12, 13].

Selecting the best descriptors is based on their use in predicting the property/activity of a molecule from another molecule that is considered similar to it using a certain similarity method. For these descriptors, and for predicting the activity class of molecules, the best descriptors are these yielding the highest number of correct predictions (molecules with similar activity class), taking into account the total number of molecules having that activity class in the database used.

The results for the searches of DS1–DS3 are shown in Tables 49, respectively, using cutoffs of both 1% and 5%. Tables 4, 6, and 8 contain the results using the cutoff of 1%; and Tables 5, 7, and 9 contain the corresponding results using the cutoff of 5%. Each row in a table corresponds to one activity class and lists the recall for the top 1% and 5% of a sorted ranking when averaged over the ten searches for this activity class. The penultimate row in a table corresponds to the mean value for that descriptor when averaged over all of the activity classes for a data set. The descriptor with the best recall rate in each row is bolded. The bottom row in a table corresponds to the total number of bold cells for each descriptor type across the full set of activity classes.

Visual inspection of the recall values and the number of bold cells in Tables 49 enables comparisons to be made between the effectiveness of the GBMD descriptor and the various other descriptors. In addition, a more quantitative approach using the Kendall test of concordance [22] was used to determine which descriptor performed best. This test was developed to quantify the level of agreement between multiple sets of rankings of the same set of objects, here and in previous works [12, 13, 1921]. We used this approach to rank the effectiveness of different descriptor types. In the present context, the activity classes were considered as judges and the recall rates of the various descriptor types as objects. The outputs of the test are the value of the Kendall coefficient and the associated significance level, which indicates whether this value of the coefficient could have occurred by chance. If the value is significant (for which we used cutoff values of 0.01 or 0.05), then it is possible to give an overall ranking of the objects that have been ranked. The results of the Kendall analyses (for DS1–DS3) are reported in Table 10 and describe the top 1% and 5% ranking for the various descriptor types. In Table 10, the columns show the data set type, the value of the coefficient, the associated probability, and the ranking of the descriptor. The descriptors are ranked in decreasing order of screening effectiveness (if two descriptors have the same rank then they are ordered on the basis of the mean recall, that is, the mean values from the main tables of results).

We will use the DS1 results (in Tables 4 and 5) to illustrate the processing that took place. Here, it is shown that the GBMD descriptor has the best overall performance at the 1% and 5% cutoff. In addition, according to the total number of bold cells in Table 4, GBMD is the best performing descriptor across the 11 activity classes at the 1% and 5% cutoff. Table 10 shows that the value of the Kendall coefficient, 0.599, is significant at the 0.01 level of statistical significance; given that the result is significant, we can hence conclude that the overall ranking of the seven descriptors (for DS1 at 1% cutoff) is GBMD > CDKFP > SMILE > EPFP4 = MACCS > PCFP > ALOGP. The good performance of GBMD is not restricted to the top 1% for DS1, since it also gives the best results for the top-5% for DS1.

The results in Tables 6 and 7 show that the performance of GBMD is inferior to the best performance descriptors (CDKFP and EPFP4). The overall ranking of the 7 descriptors based on the Kendall coefficient in Table 10 (for DS2 top 1%) is CDKFP > EPFP4 > GBMD > SMILE > MACCS > ALOGP > PCFP. The results in Tables 6, 7, and 10 show that the GBMD descriptor performs least well when the active molecules being sought have a high degree of structural homogeneity. This is also the same in case of SMILE descriptor.

The best descriptor must be able to provide a scaffold-hopping capability for those cases where the actives belong to multiple structural classes. To determine to what degree GBMD descriptor obeys this principle, a GBMD descriptor was used to search for the most diverse set of active classes (DS3 data set). The search results for DS3 are shown in Tables 8 and 9. Here, the mean and total number of bold cells suggests that the GBMD descriptor is the best performing descriptor across the 10 activity classes at the 1% cutoff. GBMD also performs well and is comparable to standard descriptors (MACCS and EPFP) at the 5% cutoff.

Using the mean recall value as an evaluation criterion could be impartial to some descriptor type but not others and that is because some of the activity classes may contribute disproportionally to the overall value of mean recall. To avoid this bias, the effectiveness performance of different descriptors has been further investigated based on the total number of bold cells for each descriptor across the full set of activity classes, as shown in the bottom rows of Tables 49. These bold cell results are listed in Table 11. Visual inspection of the results in Table 11 shows clearly that the GBMD descriptor can provide a level of performance that is generally superior to the other descriptors. The only obvious exception is the DS2.

Retrieval results of top 1% and 5% of GBMD for data set DS1–DS3 compared with the LWDOSM and Lingo-DOSM are shown in Tables 1214. The results show the performance improvements obtained by using GBMD compared to these two proposed descriptors for DS1. However, comparing GBMD with the LWDOSM and LINGO-DOSM descriptors in Table 13 for DS2 shows a lower performance because the LWDOSM and LINGO-DOSM are more sensitive on the shape of molecules, so they give better results for the low diversity dataset. Moreover, in Table 13, the retrieval results of top 1% and 5% for data set DS3 show that GBMD outperform the LWDOSM and Lingo-DOSM descriptors.

From the above results, it should be noted here that the main purpose of using several types of descriptor in the experiments was not a performance comparison only but also to show that our new descriptor GBMD is capable of representing and characterising the molecule structure and to show the possibility and feasibility of using GBMD for similarity-based virtual screening. However, the retrieval performance for any descriptor depends on the type of similarity methods used. Hence, we believe that using different text similarity searching methods with the GBDM descriptor such as the longest-common-subsequence (LCS) [23] and dynamic time warping (DTW) [24], will yield different results.

6. Conclusion

In this paper, we developed a new graph-based descriptor (GBMD) based on the canonical representations of molecules to provide a unique representation for each structure. Simulated virtual screening experiments with the MDL drug data report data sets compared the GBMD to many standard descriptors. The results show that the GBMD is working well for high diverse datasets, such that it outperformed all other descriptors for DS1 and DS3. Also, it outperformed the performance of LWDOSM and Lingo-DOSM descriptors that were proposed in our previous works.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by Ministry of Higher Education (MOHE) and Research Management Centre (RMC) at the Universiti Teknologi Malaysia (UTM) under Research University Grant Category (VOT Q.J130000.2528.07H89). In addition, Faisal Saeed is a Researcher of Universiti Teknologi Malaysia under the Post-Doctoral Fellowship.