Machine Learning and Network Methods for Biology and Medicine
Identification of Chemical Toxicity Using Ontology Information of Chemicals
With the advance of combinatorial chemistry, the number of synthetic compounds has surged, yet our knowledge about them remains limited. Meanwhile, the design of new drugs is very slow, and one of the key causes is the unacceptable toxicity of chemicals. If the toxicity of chemicals can be identified correctly, unsuitable chemicals can be discarded at an early stage, thereby accelerating the study of new drugs and reducing R&D costs. In this study, a new prediction method based on the ontology information of chemicals was built to identify chemical toxicities. A comparison with a previous method indicates that our method is quite effective. We hope that the proposed method may give new insights into the study of chemical toxicity and other attributes of chemicals.
1. Introduction
In drug discovery, detecting the toxicity of candidate drugs is a very important procedure. Some approved drugs, such as phenacetin [1] and troglitazone [2], which had passed Phase III clinical trials, had to be withdrawn from the market because unexpected toxicities were detected, and pharmaceutical companies lost millions of dollars as a result. In view of this, it is necessary to detect the toxicity of chemicals before they are selected as candidate drugs. However, evaluating the toxicity of a given chemical requires comprehensive experimental testing, which costs millions of dollars and takes many years. Moreover, with the advance of combinatorial chemistry, the number of synthetic compounds has surged, which makes it infeasible to detect chemical toxicities through traditional methods alone. Thus, quick and effective prediction methods that do not involve animal testing are urgently needed.
In recent years, some prediction methods have been built for detecting chemical toxicities. Most of them can only deal with a single toxicity at a time [3, 4]; that is, they predict whether a given chemical is toxic or nontoxic with respect to one type of toxicity. To detect all toxicities of a chemical, these methods have to be executed many times. Recently, Chen et al. built a multiclass prediction method using chemical-chemical interaction information [5], which can provide a candidate toxicity sequence ranging from the most likely toxicity to the least likely one. Their method was applied to detect the toxicities of chemicals listed in the Accelrys Toxicity Database [6], in which six types of toxicity are reported: (1) acute toxicity; (2) mutagenicity; (3) tumorigenicity; (4) skin and eye irritation; (5) reproductive effects; (6) multiple dose effects. In this study, we employed the data from Chen et al.'s study [5] and adopted a new kind of chemical information to identify chemical toxicities. ChEBI ontology, integrated in the well-known database ChEBI (Chemical Entities of Biological Interest) [7], reports the ontology information of chemicals and is composed of the following subontologies: (1) molecular structure; (2) biological role; (3) application; (4) subatomic particle. Gene ontology [8], the ontology information for proteins, has been deemed a useful tool for investigating protein-related problems [9–12]. Likewise, it is believed that ChEBI ontology is a useful tool for studying chemicals and building effective prediction methods to identify chemical attributes. Here, we established a prediction method based on this information and compared it with the method reported in [5]. The results indicate that this information is suitable for identifying chemical toxicity. We hope that the proposed method may stimulate extensive investigation based on this information, thereby promoting the study of chemicals and drug discovery.
2. Materials and Methods
2.1. Materials
The toxicity information of chemicals was retrieved from a previous study [5], in which it was collected from the Accelrys Toxicity Database [6]. Six types of toxicity are reported in this database; they are (1) acute toxicity; (2) mutagenicity; (3) tumorigenicity; (4) skin and eye irritation; (5) reproductive effects; (6) multiple dose effects. Thus, the toxic chemicals in the Accelrys Toxicity Database can be assigned to six classes. To investigate the problem of predicting chemical toxicity more thoroughly, we also employed nontoxic chemicals, which were also retrieved from Chen et al.'s study [5]. These chemicals were collected from DrugBank (http://www.drugbank.ca/) [13] and the Human Metabolome Database (HMDB) (http://www.hmdb.ca/) [14]. In total, 174,137 chemicals were collected, and each of them was nontoxic or had at least one type of toxicity.
To obtain a well-defined dataset, the chemicals with no ontology information were excluded. This yielded a dataset $S$ consisting of 4,177 chemicals, of which 3,769 were toxic and 408 were nontoxic. As mentioned in the above paragraph, each toxic chemical has at least one type of toxicity. For convenience, let us tag the six types of toxicity, in the order listed above, using $T_1, T_2, \ldots, T_6$ and nontoxicity using $T_7$. Accordingly, the dataset $S$ can be separated into seven subsets formulated by
$$S = S_1 \cup S_2 \cup \cdots \cup S_7, \tag{1}$$
where $S_i$ consisted of the chemicals having toxicity $T_i$ ($i = 1, 2, \ldots, 7$). The number of chemicals in each subset (i.e., the number of chemicals having each type of toxicity) is listed in Table 1, column 3, from which we can see that acute toxicity was the largest class, followed by mutagenicity, multiple dose effects, and so forth, while the nontoxic class was the smallest. Since some chemicals may have more than one type of toxicity, that is, they may occur in more than one of the subsets $S_i$, the sum of the sizes of the seven subsets was larger than the total number of chemicals in $S$. Thus, it is a multilabel classification problem. Figure 1 gives the number of chemicals having 1–7 types of toxicity. Like many previous studies dealing with multilabel classification problems [5, 15, 16], the proposed method gives a series of candidate toxicities for each query chemical, ordered from the most likely toxicity to the least likely one.
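The multilabel structure of the dataset can be illustrated with a minimal Python sketch. The chemical identifiers and label assignments below are hypothetical toy values, not data from the study; the sketch only shows why the subset sizes add up to more than the number of chemicals when chemicals carry several labels.

```python
# Hypothetical toy labels: chemical id -> set of toxicity tags (T1..T6 toxic, T7 nontoxic).
labels = {
    "c1": {"T1", "T2"},
    "c2": {"T1"},
    "c3": {"T7"},
    "c4": {"T2", "T3", "T6"},
}

tags = [f"T{i}" for i in range(1, 8)]

# S_i is the set of chemicals carrying tag T_i.
subsets = {t: {c for c, ts in labels.items() if t in ts} for t in tags}

total_chemicals = len(labels)
total_in_subsets = sum(len(s) for s in subsets.values())

# With multilabel chemicals, the subset sizes sum to more than |S|.
print(total_chemicals, total_in_subsets)  # 4 7
```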
2.2. Construction of a Graph from the Ontology Information of Compounds
The ontology information of compounds was retrieved from ChEBI (http://www.ebi.ac.uk/chebi/init.do) [7]. We downloaded a file named "chebi.obo" (accessed November 2014) from its FTP site, ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/, which contains a large number of ontology terms and their descriptions. Since the ontology terms can be conceived as graph-theoretical structures, a graph can be constructed from the information of all ontology terms, in which nodes represent ontology terms and edges denote the relationship between two terms. By using the entries "is a" and "relationship" in the obtained file to indicate the relationship between two terms, we constructed a large graph with 45,206 nodes and 113,549 edges.
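The graph construction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes the usual OBO flat-file layout (`[Term]` stanzas with `id:`, `is_a:`, and `relationship:` lines), treats every such link as an undirected edge, and runs on a tiny hand-written sample instead of the real "chebi.obo". The helper names `build_graph` and `_link` are hypothetical.

```python
from collections import defaultdict

# Tiny hand-written stand-in for chebi.obo (assumption: standard OBO stanza layout).
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: CHEBI:1
name: toy parent

[Term]
id: CHEBI:2
name: toy child
is_a: CHEBI:1 ! toy parent

[Term]
id: CHEBI:3
name: toy role bearer
is_a: CHEBI:1
relationship: has_role CHEBI:2
"""


def _link(graph, a, b):
    """Add an undirected edge between terms a and b."""
    graph[a].add(b)
    graph[b].add(a)


def build_graph(obo_text):
    """Build an adjacency map from the is_a / relationship lines of [Term] stanzas."""
    graph = defaultdict(set)
    current = None   # id of the term stanza being read
    in_term = False  # False while inside the header or a non-term stanza
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_term = line == "[Term]"
            current = None
            continue
        if not in_term:
            continue
        if line.startswith("id: "):
            current = line[len("id: "):].strip()
            graph[current]  # touch the key so isolated terms become nodes
        elif current and line.startswith("is_a: "):
            # Drop the optional trailing "! name" comment.
            _link(graph, current, line[len("is_a: "):].split("!")[0].strip())
        elif current and line.startswith("relationship: "):
            # Format: "relationship: <type> <target id> [! name]".
            fields = line[len("relationship: "):].split("!")[0].split()
            if len(fields) >= 2:
                _link(graph, current, fields[1])
    return graph


graph = build_graph(SAMPLE_OBO)
n_nodes = len(graph)
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(n_nodes, n_edges)  # 3 3
```

Whether the original graph was directed, and how duplicate links between the same pair of terms were counted, is not stated in the text, so the node and edge counts of such a sketch need not match those reported above.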
2.3. Prediction Method
As mentioned in Section 2.2, a graph, denoted by $G$, was constructed according to the ontology information of compounds. The ontology terms corresponding to two adjacent nodes in $G$ are directly related. It can be further inferred that if two nodes are a small distance apart in $G$, the corresponding ontology terms are closely linked. In view of this, it is reasonable to use the distance in $G$ to quantitatively measure the relationship between two ontology terms. For two terms $o$ and $o'$, let us denote the distance between the corresponding nodes in $G$ by $d(o, o')$.
For two chemicals $c$ and $c'$, let $O(c)$ be the ontology terms of $c$ and let $O(c')$ be the ontology terms of $c'$. If $d(o, o')$ is small for some $o \in O(c)$ and $o' \in O(c')$, then $c$ and $c'$ are highly related and have a high probability of sharing the same structures, functions, and so on. Thus, we gave the following formulation to measure the common features of chemicals $c$ and $c'$:
$$Q(c, c') = \min\{d(o, o') : o \in O(c),\ o' \in O(c')\}, \tag{2}$$
where $d(o, o')$ denotes the distance between terms $o$ and $o'$ in the graph constructed in Section 2.2, which can be obtained by Dijkstra's algorithm [17]. The smaller $Q(c, c')$ is, the closer the relationship between $c$ and $c'$.
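The distance between two chemicals can be sketched in Python as below. Two assumptions are made here and are not confirmed by the text: every edge of the graph has unit weight, and the chemical-level measure takes the smallest term-to-term distance over all pairs of terms. The graph, term names, and function names are hypothetical toy values.

```python
import heapq

INF = float("inf")


def dijkstra(graph, source):
    """Shortest-path distances from source, assuming unit edge weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, INF):
            continue  # stale heap entry
        for nbr in graph.get(node, ()):
            nd = d + 1
            if nd < dist.get(nbr, INF):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def chemical_distance(graph, terms_a, terms_b):
    """Smallest term-to-term distance between two sets of ontology terms."""
    best = INF
    for o in terms_a:
        dist = dijkstra(graph, o)
        for o2 in terms_b:
            best = min(best, dist.get(o2, INF))
    return best


# Toy undirected graph: A - B - C - D, with E isolated.
toy_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "E": set()}

q_example = chemical_distance(toy_graph, {"A"}, {"C", "D"})
print(q_example)  # 2
```

With unit weights, a breadth-first search would give the same distances; Dijkstra's algorithm is used in the sketch only because it is the one named in the text. Terms in different connected components get an infinite distance.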
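The proposed prediction method relied heavily on the result of (2). To introduce the method clearly, some notation is needed. Let $S_{\mathrm{tr}}$ be a training set consisting of $n$ chemicals, say $c_1, c_2, \ldots, c_n$; that is, $S_{\mathrm{tr}} = \{c_1, c_2, \ldots, c_n\}$. The toxicity information of each $c_i$ can be represented by
$$T(c_i) = [t_{i,1}, t_{i,2}, \ldots, t_{i,7}]^{\mathrm{T}}, \tag{3}$$
where $t_{i,j}$ was defined by
$$t_{i,j} = \begin{cases} 1, & \text{if } c_i \text{ has toxicity } T_j, \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$
and $T_j$ ($j = 1, 2, \ldots, 7$) is the $j$th toxicity class defined in Section 2.1.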
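For a query chemical $c_q$, its score of having toxicity $T_j$ was calculated as follows.
(1) For each chemical $c_i$ in the training set $S_{\mathrm{tr}}$, calculate $Q(c_q, c_i)$ according to (2). Then, find all nearest neighbors, say, without loss of generality, $c_1, c_2, \ldots, c_k$, such that $Q(c_q, c_1) = Q(c_q, c_2) = \cdots = Q(c_q, c_k) = \min_{1 \le i \le n} Q(c_q, c_i)$.
(2) For each $T_j$, the score of $c_q$ having toxicity $T_j$ was calculated by
$$\Theta(c_q, T_j) = \sum_{i=1}^{k} t_{i,j}. \tag{5}$$
It is easy to observe that the score of $c_q$ having toxicity $T_j$ is the number of chemicals among $c_1, c_2, \ldots, c_k$ that have toxicity $T_j$. Since $c_1, c_2, \ldots, c_k$ are highly related to $c_q$, a larger $\Theta(c_q, T_j)$ indicates that many closely related training chemicals of $c_q$ have toxicity $T_j$, so the probability that $c_q$ has toxicity $T_j$ is high. In particular, $\Theta(c_q, T_j) = 0$ means that none of the nearest neighbors has toxicity $T_j$, so $T_j$ is not considered a possible toxicity of $c_q$.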
As mentioned in Section 2.1, the investigated problem is a multilabel classification problem, so giving only the most likely candidate toxicity is not enough. Fortunately, we can output a series of candidate toxicities according to the scores of the query chemical for the seven classes. The toxicity that receives the highest score is the most likely toxicity, the toxicity receiving the second highest score is the second most likely toxicity, and so forth. For example, if the ranking of the seven scores for a certain query chemical $c_q$ is
$$\Theta(c_q, T_1) \ge \Theta(c_q, T_4) \ge \Theta(c_q, T_2) > \Theta(c_q, T_3) = \Theta(c_q, T_5) = \Theta(c_q, T_6) = \Theta(c_q, T_7) = 0, \tag{6}$$
it suggests that $T_1$ (i.e., acute toxicity) is the most likely toxicity for $c_q$, followed by $T_4$ (i.e., skin and eye irritation) and $T_2$ (i.e., mutagenicity), while the other types of toxicity are not predicted to be candidate toxicities for $c_q$. Furthermore, $T_1$ is called the first prediction, $T_4$ the second prediction, and so forth.
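The scoring and ranking steps above can be sketched in Python as follows. The sketch assumes the distances from the query chemical to every training chemical have already been computed; the identifiers, distances, and labels are hypothetical toy values, and breaking ties by tag order is an assumption, since the text does not say how ties are resolved.

```python
TOXICITIES = ["T1", "T2", "T3", "T4", "T5", "T6", "T7"]


def rank_toxicities(distances, labels):
    """Rank candidate toxicities for one query chemical.

    distances: training chemical id -> distance to the query chemical
    labels:    training chemical id -> set of toxicity tags
    """
    nearest = min(distances.values())
    neighbors = [c for c, q in distances.items() if q == nearest]
    # Score of a toxicity = number of nearest neighbors that carry it.
    scores = {t: sum(t in labels[c] for c in neighbors) for t in TOXICITIES}
    # Keep only toxicities with a nonzero score; sorted() is stable, so ties
    # keep the T1..T7 order (an assumption, see the lead-in).
    ranked = sorted((t for t in TOXICITIES if scores[t] > 0),
                    key=lambda t: -scores[t])
    return ranked, scores


# Hypothetical toy data: three training chemicals tie for the smallest distance.
distances = {"a": 1, "b": 1, "c": 1, "d": 3}
labels = {
    "a": {"T1", "T4", "T2"},
    "b": {"T1", "T4"},
    "c": {"T1"},
    "d": {"T3"},
}

ranked, scores = rank_toxicities(distances, labels)
print(ranked)  # ['T1', 'T4', 'T2']
```

In this toy case the chemical "d" is farther away than the nearest neighbors, so its label T3 receives no votes and is not listed as a candidate.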
2.4. Accuracy Measurements
For a query chemical, the proposed method can provide a series of candidate toxicities. In view of this, we should calculate the accuracy for each prediction order. The $j$th prediction accuracy can be computed by [5, 15]
$$ACC_j = \frac{CP_j}{N} \times 100\%, \quad j = 1, 2, \ldots, 7, \tag{7}$$
where $CP_j$ is the number of chemicals whose $j$th prediction is correct and $N$ is the total number of chemicals that are predicted by the method. Since it is difficult to know the number of toxicities of a query chemical in advance, the first prediction accuracy is the most important measure for evaluating the performance of the method. In addition, an effective prediction method for a multilabel classification problem should rank the candidate toxicities well; that is, the prediction accuracies should decrease as the prediction order increases.
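The per-order accuracy can be sketched in Python as below, under the assumption that a chemical with fewer than $j$ candidate toxicities simply counts as incorrect at order $j$. The ranked predictions and true label sets are hypothetical toy values.

```python
def prediction_accuracies(predictions, truths, n_orders=7):
    """Percentage of chemicals whose j-th ranked prediction is a true toxicity."""
    n = len(predictions)
    accuracies = []
    for j in range(n_orders):
        correct = sum(
            1
            for pred, true in zip(predictions, truths)
            if j < len(pred) and pred[j] in true  # missing j-th prediction counts as wrong
        )
        accuracies.append(100.0 * correct / n)
    return accuracies


# Hypothetical ranked candidate lists and true label sets for four chemicals.
predictions = [["T1", "T2"], ["T2", "T1", "T3"], ["T1"], ["T3"]]
truths = [{"T1"}, {"T1", "T2"}, {"T3"}, {"T3"}]

acc = prediction_accuracies(predictions, truths, n_orders=3)
print(acc)  # [75.0, 25.0, 0.0]
```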
In addition, to evaluate the overall performance of the prediction method, another measurement was adopted [5, 15]. It measures the proportion of the true toxicities covered by the first $m$ predictions of the chemicals, which can be calculated by
$$\Pi_m = \frac{\sum_{i=1}^{N} TP_{i,m}}{\sum_{i=1}^{N} NT_i} \times 100\%, \tag{8}$$
where $TP_{i,m}$ is the number of true toxicities of the $i$th chemical that are listed among its first $m$ predictions and $NT_i$ is the total number of true toxicities of the $i$th chemical. Generally, $m$ is taken as the smallest integer greater than or equal to the average number of toxicities of the chemicals processed by the method; that is, $m = \lceil \frac{1}{N} \sum_{i=1}^{N} NT_i \rceil$. A larger $\Pi_m$ indicates that the true toxicities are placed near the front of the candidate lists.
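The coverage measure can be sketched in Python as follows. The function name and the toy predictions and labels are hypothetical; the cutoff $m$ is derived from the average number of true toxicities as described above.

```python
import math


def coverage(predictions, truths):
    """Share of true toxicities found among the first m predictions, in percent."""
    total_true = sum(len(true) for true in truths)
    # m is the smallest integer >= the average number of true toxicities.
    m = math.ceil(total_true / len(truths))
    covered = sum(len(set(pred[:m]) & true)
                  for pred, true in zip(predictions, truths))
    return m, 100.0 * covered / total_true


# Hypothetical ranked candidate lists and true label sets for four chemicals.
predictions = [["T1", "T2"], ["T2", "T1", "T3"], ["T1"], ["T3"]]
truths = [{"T1"}, {"T1", "T2"}, {"T3"}, {"T3"}]

m, cov = coverage(predictions, truths)
print(m, cov)  # 2 80.0
```

Here the average number of true toxicities is 1.25, so $m = 2$, and four of the five true toxicities appear among the first two predictions.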
3. Results and Discussion
3.1. Performance of the Method
For the 4,177 chemicals in $S$, the prediction method was executed to identify their toxicities, and its performance was evaluated by the jackknife test. The seven prediction accuracies thus obtained by (7) are listed in Table 2, column 2. It can be observed that the first prediction accuracy was 75.17%, the second one was 43.52%, and the third one was 28.47%. Furthermore, the seven prediction accuracies decreased as the prediction order increased, indicating that the proposed method ranked the candidate toxicities of the tested chemicals quite well. In addition, the average number of toxicities of the chemicals in $S$ was about 2.38. Thus, the first three predictions of all chemicals in $S$ were collected, giving an accuracy of 61.87% by (8); that is, 61.87% of the true toxicities of the chemicals in $S$ were covered by their first three predictions. All of these results indicate that the proposed method is quite effective for the identification of chemical toxicities.
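The jackknife (leave-one-out) evaluation can be sketched in Python as below: each chemical in turn is treated as the query, and all remaining chemicals form the training set. The sketch assumes a precomputed symmetric table of pairwise distances and breaks ties between equally voted toxicities by tag name; both the data and these choices are hypothetical, and only the first prediction accuracy is computed.

```python
def jackknife_first_accuracy(dist, labels):
    """Leave-one-out first prediction accuracy, in percent."""
    ids = list(labels)
    correct = 0
    for query in ids:
        others = [c for c in ids if c != query]  # training set without the query
        nearest = min(dist[query][c] for c in others)
        neighbors = [c for c in others if dist[query][c] == nearest]
        votes = {}
        for c in neighbors:
            for tag in labels[c]:
                votes[tag] = votes.get(tag, 0) + 1
        # Most voted tag; max() keeps the first maximum, so ties go to the
        # alphabetically smallest tag (an assumption).
        top = max(sorted(votes), key=votes.get)
        correct += top in labels[query]
    return 100.0 * correct / len(ids)


# Hypothetical symmetric pairwise distances between four chemicals.
dist = {
    "a": {"b": 1, "c": 2, "d": 3},
    "b": {"a": 1, "c": 1, "d": 3},
    "c": {"a": 2, "b": 1, "d": 2},
    "d": {"a": 3, "b": 3, "c": 2},
}
labels = {"a": {"T1"}, "b": {"T1", "T2"}, "c": {"T2"}, "d": {"T7"}}

acc1 = jackknife_first_accuracy(dist, labels)
print(acc1)  # 50.0
```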
3.2. Understanding the Method by Listing an Example
To better illustrate our method, this section gives an example. CID104975 is a chemical with toxicities $T_2$ (mutagenicity) and $T_3$ (tumorigenicity). Its ontology term is CHEBI:25957. According to the method, we computed the distance between CHEBI:25957 and the ontology terms of the other chemicals in $S$, thereby calculating the relationship between CID104975 and the other chemicals by (2). Four chemicals were found to be most closely related to CID104975: CID995, CID2236, CID6763, and CID13257. Their toxicities and ontology terms are listed in Table 3, columns 2 and 3, respectively. By the method, four types of toxicity received 3, 4, 3, and 2 votes from these neighbors, and the other toxicities received no votes. Accordingly, these four types of toxicity were the candidate toxicities for CID104975. The first and third predictions were correct, while the second prediction was incorrect.
3.3. Comparison with Other Methods
In this section, we consider another kind of chemical information, which was applied to the identification of chemical toxicities in Chen et al.'s study [5]. Their method used chemical-chemical interaction information, which has been deemed useful for studying chemical-related problems [5, 15, 18, 19], to build the prediction method, and it gave good performance.
To compare our method and Chen et al.'s method fairly, a chemical set $S'$ consisting of 3,955 chemicals was extracted from $S$ such that each chemical in $S'$ has both ontology information and interaction information; that is, each chemical can be predicted by both methods. The number of chemicals in $S'$ for each type of toxicity is listed in Table 1, column 4, from which we can see that the distribution of the 3,955 chemicals over the seven classes is similar to that of the chemicals in $S$. Again, some chemicals have two or more toxicities. Our method and Chen et al.'s method were both executed on $S'$, and their performance was evaluated by the jackknife test. The seven prediction accuracies are listed in Table 2, columns 3 and 4. It can be seen that the first prediction accuracy of our method was 75.40%, which is slightly higher than the 75.14% of Chen et al.'s method. However, for the higher prediction orders, the prediction accuracies of Chen et al.'s method were higher than those obtained by our method. A plausible reason is that the ontology information of chemicals is not yet complete, so many relations between ontology terms have not been recorded. Furthermore, we also calculated the measurement defined in (8). Since the average number of toxicities of the chemicals in $S'$ was about 2.44, the first three predictions of the chemicals in $S'$ obtained by the two methods were collected, giving an accuracy of 61.70% for our method and 65.31% for Chen et al.'s method. This difference may be caused by the same reason. Thus, when more than one toxicity is considered for a chemical, our method is not better than Chen et al.'s method. However, the first prediction accuracy of our method is higher than that of Chen et al.'s method, and this is the most important measure because one always pays most attention to the most likely toxicity of a chemical. In view of this, we believe that our method has an advantage in the identification of chemical toxicities.
4. Conclusions
This study presented a new prediction method to identify chemical toxicities. By utilizing the ontology information of chemicals reported in ChEBI, one can predict the toxicities of a given chemical with a first prediction accuracy of about 75%. We hope that this method may promote the study of chemicals.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] U. C. Dubach, B. Rosner, and T. Stürmer, “An epidemiologic study of abuse of analgesic drugs. Effects of phenacetin and salicylate on mortality and cardiovascular morbidity (1968 to 1987),” The New England Journal of Medicine, vol. 324, no. 3, pp. 155–160, 1991.
[2] “AstraZeneca Decides to Withdraw Exanta,” 2006, http://www.astrazeneca.com/Media/Press-releases/Article/20060214–AstraZeneca-Decides-to-Withdraw-Exanta.
[3] M. Zheng, Z. Liu, C. Xue et al., “Mutagenic probability estimation of chemical compounds by a novel molecular electrophilicity vector and support vector machine,” Bioinformatics, vol. 22, no. 17, pp. 2099–2106, 2006.
[4] Y. Wang, J. Lu, F. Wang et al., “Estimation of carcinogenicity using molecular fragments tree,” Journal of Chemical Information and Modeling, vol. 52, no. 8, pp. 1994–2003, 2012.
[5] L. Chen, J. Lu, J. Zhang, K.-R. Feng, M.-Y. Zheng, and Y.-D. Cai, “Predicting chemical toxicity effects based on chemical-chemical interactions,” PLoS ONE, vol. 8, no. 2, Article ID e56517, 2013.
[6] Accelrys Software Inc., Accelrys Toxicity Database 2011.4, Accelrys Software Inc., San Diego, Calif, USA, 2011.
[7] K. Degtyarenko, P. de Matos, M. Ennis et al., “ChEBI: a database and ontology for chemical entities of biological interest,” Nucleic Acids Research, vol. 36, no. 1, pp. D344–D350, 2008.
[8] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
[9] M. A. Mahdavi and Y.-H. Lin, “False positive reduction in protein-protein interaction predictions using gene ontology annotations,” BMC Bioinformatics, vol. 8, article 262, 2007.
[10] C.-S. Yu, C.-W. Cheng, W.-C. Su et al., “CELLO2GO: a web server for protein subcellular localization prediction with functional gene ontology annotation,” PLoS ONE, vol. 9, no. 6, Article ID e99368, 2014.
[11] C. Bettembourg, C. Diot, and O. Dameron, “Semantic particularity measure for functional characterization of gene sets using gene ontology,” PLoS ONE, vol. 9, no. 1, Article ID e86525, 2014.
[12] K.-C. Chou and Y.-D. Cai, “A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology,” Biochemical and Biophysical Research Communications, vol. 311, no. 3, pp. 743–747, 2003.
[13] D. S. Wishart, C. Knox, A. C. Guo et al., “DrugBank: a knowledgebase for drugs, drug actions and drug targets,” Nucleic Acids Research, vol. 36, no. 1, pp. D901–D906, 2008.
[14] D. S. Wishart, D. Tzur, C. Knox et al., “HMDB: the human metabolome database,” Nucleic Acids Research, vol. 35, no. 1, pp. D521–D526, 2007.
[15] L. Chen, W.-M. Zeng, Y.-D. Cai, K.-Y. Feng, and K.-C. Chou, “Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities,” PLoS ONE, vol. 7, no. 4, Article ID e35254, 2012.
[16] P. Du, T. Li, and X. Wang, “Recent progress in predicting protein sub-subcellular locations,” Expert Review of Proteomics, vol. 8, no. 3, pp. 391–404, 2011.
[17] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Eds., Introduction to Algorithms, MIT Press, Cambridge, Mass, USA, 1990.
[18] L.-L. Hu, C. Chen, T. Huang, Y.-D. Cai, and K.-C. Chou, “Predicting biological functions of compounds based on chemical-chemical interactions,” PLoS ONE, vol. 6, no. 12, Article ID e29491, 2011.
[19] L. Chen, J. Lu, T. Huang et al., “Finding candidate drugs for hepatitis C based on chemical-chemical and chemical-protein interactions,” PLoS ONE, vol. 9, no. 9, Article ID e107767, 2014.