Computational Systems Biology Methods in Molecular Biology, Chemistry Biology, Molecular Biomedicine, and Biopharmacy
Prediction of Drugs Target Groups Based on ChEBI Ontology
Most drugs have beneficial as well as adverse effects and exert their biological functions by adjusting and altering the functions of their target proteins. Thus, knowledge of drug target proteins is essential for improving therapeutic effects and mitigating undesirable side effects. In this study, we propose a novel prediction method based on drug/compound ontology information extracted from ChEBI to identify the target groups of drugs, from which the kinds of functions of a drug may be deduced. By collecting data from KEGG, a benchmark dataset consisting of 876 drugs, categorized into four target groups, was constructed. To evaluate the method more thoroughly, the benchmark dataset was divided into a training dataset and an independent test dataset. The overall prediction accuracy was 83.12% on the training dataset, as evaluated by jackknife test, and 87.50% on the test dataset, indicating that the predictor generalizes well. The good performance of the method indicates that the ontology information of drugs contains rich information about their target groups, and the study may inspire solutions to problems of this sort and help bridge the gap between ChEBI ontology and drug target groups.
Identification of the target proteins of drugs is important in the drug discovery pipeline because drugs exert their functions by hitting some proteins, that is, their target proteins, in human tissues. In addition to their therapeutic effects, most drugs have some undesirable side effects, which are also caused by hitting target proteins. A drug brought to market with poorly understood side effects is a potential hazard to both pharmaceutical companies and consumers. Thus, studying the target proteins of a drug is highly beneficial to the treatment of diseases and the reduction of side effects. However, identifying drug target proteins by experiment requires considerable time and money. It is therefore necessary to establish effective computational methods that can provide useful references for tackling this problem.
Many efforts have been made to identify drug target proteins in the past few years, using approaches such as docking simulations [2, 3], literature text mining, combination of chemical structure with protein structural or functional information [5–8], and side effect similarity. In this paper, we attempted a novel method that uses the ontology information of compounds, which is similar to the gene ontology of proteins, to identify drug target proteins. With the discovery of novel candidate drugs, the number of candidate pairs of drugs and target proteins has become tremendously large, preventing researchers from carrying out an exhaustive search for drug target proteins. In view of this, a necessary step is to establish an effective method that reduces the candidate proteins for each query drug, that is, reduces the search space by deducing the kinds of functions a drug may have. According to the data in KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.jp/kegg/), the target proteins of drugs can be divided into the following five groups: (1) G Protein-coupled Receptors, (2) Cytokine Receptors, (3) Nuclear Receptors, (4) Ion Channels, and (5) Enzymes. If one can establish a method to correctly predict the target group of a query drug, the possible target proteins would be limited to the predicted group, facilitating further analyses.
In the past few years, many novel compounds have been discovered with the advance of combinatorial chemistry. To record these compounds, several online databases have been established, such as KEGG, STITCH (Search Tool for Interactions of Chemicals), and ChEBI (Chemical Entities of Biological Interest), from which users can retrieve all sorts of information about the compounds, for example, their structures, activities, and reactions. Furthermore, this information can also be used to infer the attributes of novel compounds [5, 7, 8, 13–15]. In this paper, we employed compound ontology information, named ChEBI ontology, to infer the target group of a novel drug; that is, a predictor was built to predict the target group of drugs based on ChEBI ontology. A benchmark dataset consisting of 876 drugs was established by collecting data from KEGG, and it was split into a training dataset and a test dataset. The jackknife test on the training dataset gave an overall prediction accuracy of 83.12%, and the independent test gave a prediction accuracy of 87.50%, indicating that the predictor generalizes well. We hope that the predictor may facilitate the discovery of new therapeutic or undesirable effects of existing drugs.
2. Materials and Methods
A total of 2,795 drugs were retrieved from Chen and Zeng’s study, which had downloaded them from KEGG (http://www.genome.jp/kegg/). According to their target proteins, these drugs were classified into the following five groups: (1) G Protein-coupled Receptors, (2) Cytokine Receptors, (3) Nuclear Receptors, (4) Ion Channels, and (5) Enzymes. We then screened the data with the following rules: drugs without ChEBI ontology information were excluded, resulting in 895 drugs; drugs belonging to more than one group were excluded, resulting in 879 drugs; and because there were only 3 drugs in Cytokine Receptors, which is not enough to build an effective prediction model for the group, these drugs and the group were also excluded. Thus, we obtained a benchmark dataset containing 876 drugs allocated into four groups. The distribution of these drugs is listed in column 5 of Table 1. The codes of the drugs in each group are listed in Supplementary Material I, available online at http://dx.doi.org/10.1155/2013/132724.
To evaluate the generalization of the predictor, the benchmark dataset $S$ was divided into a training dataset $S_{tr}$ and a test dataset $S_{te}$, where $S_{te}$ was constructed by randomly selecting 88 (10%) drugs in $S$ and the rest of $S$ comprised $S_{tr}$. The numbers of drugs in each group in the training and test datasets are listed in columns 3 and 4 of Table 1, respectively.
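The three screening rules can be sketched in Python as follows. This is only an illustration: the record layout (a dict with the hypothetical keys `groups` and `chebi_terms`), the function name, and the `min_group_size` threshold are assumptions, since the text only states that a group with 3 drugs was too small to model.

```python
from collections import Counter

def screen_drugs(drugs, min_group_size=4):
    """Apply the three screening rules to a dict of drug records.

    `drugs` maps a drug code to a dict with the hypothetical keys
    'groups' (set of target-group names) and 'chebi_terms' (set of ChEBI IDs).
    `min_group_size` is an assumed cutoff, not a value given in the paper.
    """
    # Rule 1: keep only drugs that have ChEBI ontology information.
    kept = {d: r for d, r in drugs.items() if r["chebi_terms"]}
    # Rule 2: keep only drugs that belong to exactly one target group.
    kept = {d: r for d, r in kept.items() if len(r["groups"]) == 1}
    # Rule 3: drop groups that are too small, together with their drugs.
    sizes = Counter(next(iter(r["groups"])) for r in kept.values())
    return {d: r for d, r in kept.items()
            if sizes[next(iter(r["groups"]))] >= min_group_size}
```

Applied to the real data, the sizes after each rule would be expected to match the counts reported above (895, 879, and 876).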
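A random 10% hold-out of this kind can be sketched as below. The seed, the function name, and the placeholder drug codes are assumptions made for reproducibility of the example; the paper does not say how the random selection was implemented or whether it was stratified by group.

```python
import random

def split_benchmark(drug_ids, test_fraction=0.10, seed=0):
    """Randomly hold out a fraction of the drugs as an independent test set."""
    ids = sorted(drug_ids)                     # fixed order so the seed is meaningful
    rng = random.Random(seed)                  # local generator, assumed seed
    n_test = round(len(ids) * test_fraction)   # 876 * 0.10 rounds to 88
    test = set(rng.sample(ids, n_test))
    train = [d for d in ids if d not in test]
    return train, sorted(test)

# Placeholder codes standing in for the 876 benchmark drugs.
train, test = split_benchmark([f"X{i:05d}" for i in range(876)])
print(len(train), len(test))  # 788 88
```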
2.2. Prediction Based on ChEBI Ontology
The term “ontology” is derived from philosophy, where it means the theory or study of the basic characteristics of all reality. Gene ontology, the established ontology information about proteins, is regarded as a very useful tool for investigating various attributes of proteins [16–21]; similarly, the ontology information of compounds may facilitate the study of various attributes of compounds.
ChEBI, a well-known compound database, contains important ontology information about compounds, named ChEBI ontology. It consists of four subontologies: (1) Molecular Structure, (2) Biological Role, (3) Application, and (4) Subatomic Particle, which may be suitable for the prediction of various attributes of compounds. The ChEBI ontology was retrieved from ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/ (“chebi.obo”, July 2012). Ontologies are controlled vocabularies that can be conceived as graph-theoretical structures in which the “terms” form the node set and the “relations” between two terms form the edge set. Based on the “terms” and “relations” (including “is_a” and “relationship”) in the obtained file, a graph with 31,813 nodes and 64,514 edges was established. For two terms, the smaller the distance between them in the graph, the closer the relation between them. Thus, the distance between terms $t_1$ and $t_2$, denoted by $d(t_1, t_2)$, was used to measure the relationship of compounds.
For two compounds $c_1$ and $c_2$, let $T(c_1)$ and $T(c_2)$ be the ontology term sets of $c_1$ and $c_2$, respectively. The following formula was used to measure the functional relationship of $c_1$ and $c_2$:
$$Q(c_1, c_2) = \min\{d(t_1, t_2) : t_1 \in T(c_1),\ t_2 \in T(c_2)\}, \tag{1}$$
where $d(t_1, t_2)$ is the distance between terms $t_1$ and $t_2$ in the ontology graph. The smaller $Q(c_1, c_2)$ is, the stronger the functional relationship shared by $c_1$ and $c_2$.
For a query drug $d_q$, its target group was predicted according to the following steps, where $Q$ is the functional relationship measure defined above.
(i) Find the drugs in the training set $S_{tr}$, say, $d_1, d_2, \ldots, d_k$, such that
$$Q(d_q, d_1) = Q(d_q, d_2) = \cdots = Q(d_q, d_k) = \min\{Q(d_q, d) : d \in S_{tr}\}. \tag{2}$$
(ii) Put the target groups of $d_1, d_2, \ldots, d_k$ into a voting system.
(iii) The target group with the most votes is deemed to be the predicted target group of $d_q$. If more than one target group receives the most votes, one of them is selected at random as the predicted result.
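The construction of the term graph and the term-to-term distance can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state: edges are treated as undirected, and "distance" is taken to be the shortest-path length. The parser reads only the `id:`, `is_a:`, and `relationship:` lines of `[Term]` stanzas, and the term IDs in the toy input are made up.

```python
from collections import defaultdict, deque

def build_graph(obo_lines):
    """Build an undirected adjacency map from the [Term] stanzas of an OBO file."""
    graph = defaultdict(set)
    term, in_term = None, False
    for raw in obo_lines:
        line = raw.split("!")[0].strip()       # drop trailing "! comment" text
        if line.startswith("["):               # a new stanza begins
            in_term, term = (line == "[Term]"), None
        elif in_term and line.startswith("id:"):
            term = line[3:].strip()
            graph[term]                        # register the node even without edges
        elif in_term and term and line.startswith("is_a:"):
            target = line[5:].split()[0]
            graph[term].add(target)
            graph[target].add(term)
        elif in_term and term and line.startswith("relationship:"):
            target = line.split()[2]           # "relationship: <type> <target id>"
            graph[term].add(target)
            graph[target].add(term)
    return graph

def term_distance(graph, a, b):
    """Shortest-path length between two terms (breadth-first search)."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")                        # terms are not connected

toy_obo = """[Term]
id: T:1

[Term]
id: T:2
is_a: T:1 ! parent

[Term]
id: T:3
is_a: T:1
relationship: has_role T:4

[Term]
id: T:4
""".splitlines()

g = build_graph(toy_obo)
print(term_distance(g, "T:2", "T:3"), term_distance(g, "T:2", "T:4"))  # 2 3
```

For the real `chebi.obo` file, the same functions would be called on the lines of the opened file; the node and edge counts obtained will depend on the release and on how duplicate or directed edges are counted.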
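One natural reading of this compound-level measure is the smallest term-to-term distance over all pairs of terms drawn from the two term sets. The sketch below assumes that form; the function name, the `dist` callback, and the toy distance table are illustrative and not taken from the paper.

```python
from itertools import product

def compound_distance(terms_a, terms_b, dist):
    """Smallest term-to-term distance between two term sets (assumed form)."""
    return min((dist(s, t) for s, t in product(terms_a, terms_b)),
               default=float("inf"))   # no terms on one side: treat as unrelated

# Toy term-to-term distances between made-up terms.
toy = {("t1", "t3"): 2, ("t1", "t4"): 3, ("t2", "t3"): 1, ("t2", "t4"): 4}
dist = lambda s, t: toy[(s, t)]

print(compound_distance({"t1", "t2"}, {"t3", "t4"}, dist))  # 1
```

With this form, a compound annotated with a single term reduces to the plain distance from that term to the nearest term of the other compound.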
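The three steps amount to a nearest-neighbour vote and can be sketched as follows. The function signature, the `score` callback, and the toy data are assumptions for illustration; in particular, the query drug is skipped if it appears in the training dictionary, which is how a leave-one-out evaluation would use the function.

```python
import random
from collections import Counter

def predict_group(query, training, score, rng=random):
    """Nearest-neighbour vote.

    training: dict mapping drug -> target group
    score(query, drug): relationship measure, smaller means closer
    """
    scores = {d: score(query, d) for d in training if d != query}
    best = min(scores.values())
    # Step (i): all training drugs that attain the minimum score.
    neighbours = [d for d, s in scores.items() if s == best]
    # Step (ii): each neighbour votes for its own target group.
    votes = Counter(training[d] for d in neighbours)
    # Step (iii): most votes wins; ties are broken at random.
    top = max(votes.values())
    winners = sorted(g for g, v in votes.items() if v == top)
    return rng.choice(winners), votes

# Made-up training drugs and scores.
training = {"d1": "GPCR", "d2": "GPCR", "d3": "Enzymes", "d4": "Ion Channels"}
toy_score = {"d1": 1, "d2": 1, "d3": 1, "d4": 3}
group, votes = predict_group("q", training, lambda q, d: toy_score[d])
print(group, dict(votes))  # GPCR {'GPCR': 2, 'Enzymes': 1}
```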
2.3. Prediction Based on Chemical Interaction
In recent years, the idea of “systems biology” has been penetrating into the prediction of various attributes of proteins and compounds and is considered to be very useful [13, 14, 23–25]. The methods constructed on this idea are all based on the fact that interactive proteins and compounds often share common features. To define interactive compounds, we downloaded the chemical interaction file (chemical_chemical.links.detailed.v3.1.tsv.gz, http://stitch.embl.de/download/chemical_chemical.links.detailed.v3.1.tsv.gz) from STITCH (http://stitch.embl.de/), a well-known database containing interaction information on proteins and chemicals. In the obtained file, each interaction is composed of two chemicals and five kinds of scores. In detail, the first four kinds of scores are estimated according to the structures, activities, reactions, and cooccurrence in the literature of the two chemicals, while the last kind of score is calculated by integrating the first four. The last kind of score was therefore adopted here to indicate the interactivity of two chemicals; that is, two chemicals are interactive if and only if the last kind of score of the interaction between them is greater than 0. For the later formulation, we denote the score of chemicals $c_1$ and $c_2$ by $Q^{I}(c_1, c_2)$. In particular, if $c_1$ and $c_2$ are noninteractive chemicals, we set $Q^{I}(c_1, c_2) = 0$.
As described above, interactive compounds are more likely to share common features than noninteractive ones. In view of this, the target group of a query drug can be determined from its interactive compounds in the training set. The procedure is nearly identical to that of the method in Section 2.2, except that, instead of (2), the following formula was used to select drugs in the first step:
$$Q^{I}(d_q, d_1) = Q^{I}(d_q, d_2) = \cdots = Q^{I}(d_q, d_k) = \max\{Q^{I}(d_q, d) : d \in S_{tr}\}, \tag{3}$$
where $d_q$ is the query drug, $S_{tr}$ is the training set, and $Q^{I}$ is the interaction score between two chemicals, taken as 0 for noninteractive chemicals.
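The interaction-based variant can be sketched in the same way, selecting the training drugs with the highest interaction score instead of the smallest distance. The dictionary layout, the function name, and the toy scores are assumptions; returning `None` when the query has no interactive compound in the training set is also an assumption, since the text does not say how that case was handled, and ties between groups are resolved here by first occurrence rather than at random to keep the example short.

```python
from collections import Counter

def predict_by_interaction(query, training, scores):
    """Vote among the training drugs with the highest interaction score.

    training: dict mapping drug -> target group
    scores: dict mapping frozenset({a, b}) -> combined interaction score;
            pairs missing from the dict are noninteractive (score 0)
    """
    s = {d: scores.get(frozenset((query, d)), 0)
         for d in training if d != query}
    best = max(s.values(), default=0)
    if best <= 0:
        return None                      # assumed: no interactive neighbour
    votes = Counter(training[d] for d, v in s.items() if v == best)
    return votes.most_common(1)[0][0]    # ties: first group encountered

# Made-up drugs and scores.
training = {"d1": "Enzymes", "d2": "Ion Channels", "d3": "Enzymes"}
scores = {frozenset(("q", "d1")): 900, frozenset(("q", "d2")): 400}
print(predict_by_interaction("q", training, scores))  # Enzymes
print(predict_by_interaction("x", training, scores))  # None
```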
2.4. Jackknife Test
In statistical prediction, three cross-validation methods are often used to evaluate the performance of classifiers: the independent dataset test, the subsampling (or $K$-fold crossover) test, and the jackknife test. Among them, the jackknife test is deemed the least arbitrary because each sample is singled out in turn as the test sample and predicted by the classifier trained on all remaining samples, so the test sample and the training samples are always kept separate. Furthermore, the jackknife test always yields a unique result for a given dataset. Accordingly, it has been widely used to examine the performance of various classifiers in recent years [13, 26–36]. Here, we also adopted it to evaluate the current method.
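The jackknife (leave-one-out) procedure can be sketched generically as below. The `predict` callback and the toy one-dimensional data are assumptions used only to make the example runnable; any predictor that takes a sample and a training dictionary can be plugged in.

```python
def jackknife_accuracy(labels, predict):
    """Leave-one-out accuracy.

    labels: dict mapping sample -> true class
    predict(sample, training): returns a class, where `training` is the
        labels dict with `sample` removed
    """
    correct = 0
    for sample, truth in labels.items():
        # Train on every sample except the one being tested.
        training = {s: g for s, g in labels.items() if s != sample}
        if predict(sample, training) == truth:
            correct += 1
    return correct / len(labels)

# Toy data: samples on a line, classified by their nearest neighbour.
positions = {"a": 0.0, "b": 0.2, "c": 1.0, "d": 1.2, "e": 0.9}
labels = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "X"}

def nearest_label(sample, training):
    nearest = min(training, key=lambda s: abs(positions[s] - positions[sample]))
    return training[nearest]

print(jackknife_accuracy(labels, nearest_label))  # 0.6
```

Because every sample is tested exactly once against a fixed complement, repeated runs on the same dataset give the same accuracy, provided the predictor itself is deterministic.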
3. Results and Discussions
As described in Section 2.1, the benchmark dataset was divided into two datasets, the training dataset $S_{tr}$ and the test dataset $S_{te}$, consisting of 788 and 88 drugs, respectively. The method based on ChEBI ontology was applied to predict the target groups of drugs in these two datasets. The detailed results are given in the following sections.
3.1. Performance of the Predictor on the Training Dataset
For the 788 drugs in the training dataset $S_{tr}$, the predictor based on ChEBI ontology was evaluated by jackknife test. The prediction results are listed in column 2 of Table 2, from which we can see that the prediction accuracies for the four target groups were 93.38%, 73.17%, 60.55%, and 84.62%, respectively, while the overall prediction accuracy was 83.12%. Since four target groups were investigated in this study, the average correct rate would be 25% if the target groups of the drugs in $S_{tr}$ were identified by random guesses, which is much lower than the overall prediction accuracy obtained by our method. Compared with the results in Chen and Zeng’s work, in which a similarity-based method was proposed to predict drug target groups, our results are also very competitive because the prediction accuracies in their work were less than 80%. All of these suggest that the proposed predictor performs fairly well on the training dataset.
3.2. Performance of the Predictor on the Test Dataset
For the 88 drugs in the test dataset $S_{te}$, the predictor was built only on the training dataset $S_{tr}$, without involving $S_{te}$. The prediction accuracies for each group and the overall accuracy are listed in column 3 of Table 2. It can be seen that the prediction accuracies for the four groups were 100%, 69.23%, 55.56%, and 90.32%, respectively, while the overall prediction accuracy was 87.50%, which is even better than that on the training dataset, indicating that the predictor generalizes well.
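As a quick sanity check on the reported percentages, the overall accuracies correspond to whole numbers of correctly predicted drugs. The counts printed below are inferred here from the percentages and the dataset sizes; they are not taken from Table 2.

```python
def implied_correct(accuracy_percent, n):
    """Number of correct predictions implied by a percentage over n samples."""
    return round(accuracy_percent / 100 * n)

results = []
for name, acc, n in [("training (jackknife)", 83.12, 788),
                     ("independent test", 87.50, 88)]:
    k = implied_correct(acc, n)
    results.append((k, n))
    print(f"{name}: {k}/{n} = {100 * k / n:.2f}%")
# training (jackknife): 655/788 = 83.12%
# independent test: 77/88 = 87.50%
```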
3.3. Comparison of the Predictors Based on ChEBI Ontology and Chemical Interaction
The method based on chemical interaction described in Section 2.3 is popular for predicting various attributes of compounds [13, 14]. Thus, we compared the performances of these two methods in identifying drugs target groups as follows.
To compare the methods on the same dataset, all samples in the benchmark dataset $S$ were used for prediction; that is, the two predictors were applied to predict the target groups of the samples in $S$ and were evaluated by jackknife test. The prediction results obtained by the two methods are listed in Table 3. The overall prediction accuracy of the predictor using ChEBI ontology was 84.70%, which is a little higher than that of the method using chemical interaction. In detail, the prediction accuracy for the target group “G Protein-coupled Receptors” obtained by the proposed method was much higher than the corresponding accuracy obtained by the method based on chemical interaction, the prediction accuracies for the target group “Enzymes” obtained by the two methods were almost the same, and the prediction accuracies for the remaining two target groups obtained by the proposed method were lower than those obtained by the method based on chemical interaction. All of these indicate that the two predictors perform at the same level on the benchmark dataset $S$. Thus, it can be inferred that strong links may exist between ChEBI ontology and chemical interactions.
3.4. Analysis of the Relationship of Drugs Ontology Information and Their Target Group
Sections 3.1–3.3 show that the ChEBI ontology information of compounds is strongly connected with their target information. In this section, some examples are presented to confirm this and to illustrate how ChEBI ontology is used to categorize drugs into their target groups.
The drug “D00146” is a sample in the training dataset $S_{tr}$. Its target group is “G Protein-coupled Receptors” and it hits the ontology term “CHEBI:3892.” According to the procedure of the method based on ChEBI ontology, 13 drugs in $S_{tr}$ (listed in Table 4) were found for which the functional relationship measure of Section 2.2 attains its minimum. Of these 13 drugs, 11 are in the target group “G Protein-coupled Receptors” and the remaining two are in the target group “Enzymes.” Thus, the target group “G Protein-coupled Receptors” got 11 votes, “Enzymes” got 2 votes, and the other target groups did not get any votes. Accordingly, the target group of “D00146” is predicted to be “G Protein-coupled Receptors,” which is indeed its true target group. Another example is the drug “D00387” in the test dataset $S_{te}$, which is in the target group “Ion Channels.” According to its ontology term “CHEBI:9674,” we found 20 drugs in $S_{tr}$ for which the same measure attains its minimum. These 20 drugs are listed in Table 5, from which we can see that 12 drugs are in the target group “Ion Channels” and 8 drugs are in the target group “G Protein-coupled Receptors.” Thus, “D00387” is predicted to be in the target group “Ion Channels,” which is also correct.
The two examples in the above paragraph show that the target information of these drugs is indeed related to their ontology information. The good performance of the predictor demonstrated the validity of using ontology information to predict drugs target groups.
This study employed ChEBI ontology to categorize drugs based on their target proteins. The good performance of the method suggests that ontologies are good indicators of drug target groups. However, only about 30% of the samples reported in KEGG were investigated in this study because most drugs lack ontology information. It is anticipated that the method will become more effective as ChEBI ontology develops, and that a multilabel classifier may be developed in the near future to allocate some drugs to more than one category.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Yu-Fei Gao and Lei Chen contributed equally to this work.
This work was supported by Grants from National Basic Research Program of China (2011CB510101, 2011CB510102), Innovation Program of Shanghai Municipal Education Commission (12YZ120, 12ZZ087), National Natural Science Foundation of China (31371335, 61202021, and 61373028), the Grant of “The First-Class Discipline of Universities in Shanghai”, Natural Science Fund Projects of Jilin Province (201215059), Development of Science and Technology Plan Projects of Jilin Province (20100733, 201101074), SRF for ROCS, SEM (2009-36), Scientific Research Foundation (Jilin Department of Science and Technology, 200705314, 20090175, and 20100733), Scientific Research Foundation (Jilin Department of Health, 2010Z068), SRF for ROCS (Jilin Department of Human Resource and Social Security, 2012–2014), and Shanghai Educational Development Foundation (12CG55).
Supplementary Material I lists 876 drug samples investigated in this study.
L. Chen and W.-M. Zeng, “A two-step similarity-based method for prediction of drugs target group,” Protein and Peptide Letters, vol. 20, pp. 364–370, 2013.
B. Smith, W. Ceusters, B. Klagges et al., “Relations in biomedical ontologies,” Genome Biology, vol. 6, no. 5, article R46, 2005.
R. Sharan, I. Ulitsky, and R. Shamir, “Network-based prediction of protein function,” Molecular Systems Biology, vol. 3, p. 88, 2007.
X. Xiao, J. Min, and P. Wang, “Predicting ion channel-drug interactions based on sequence-derived features and functional groups,” Journal of Bionanoscience, vol. 7, pp. 49–54, 2013.
R. G. Ramani and S. G. Jacob, “Prediction of P53 mutants (multiple sites) transcriptional activity based on structural (2D&3D) properties,” PLoS ONE, vol. 8, Article ID e55401, 2013.
G. S. Han, V. Anh, A. P. Krishnajith, and Y.-C. Tian, “An ensemble method for predicting subnuclear localizations from primary protein structures,” PLoS ONE, vol. 8, Article ID e57225, 2013.
Y. Matsuta, M. Ito, and Y. Tohsato, “ECOH: an enzyme commission number predictor using mutual information and a support vector machine,” Bioinformatics, vol. 29, pp. 365–372, 2013.
Z. Qiu, C. Qin, M. Jiu, and X. Wang, “A simple iterative method to optimize protein ligand-binding residue prediction,” Journal of Theoretical Biology, vol. 317, pp. 219–223, 2012.
Y.-N. Zhang, D.-J. Yu, S.-S. Li et al., “Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features,” BMC Bioinformatics, vol. 13, article 118, 2012.
L. Chen, W.-M. Zeng, Y.-D. Cai, and T. Huang, “Prediction of metabolic pathway using graph property, chemical functional group and chemical structural set,” Current Bioinformatics, vol. 8, pp. 200–207, 2013.