iEzy-Drug: A Web Server for Identifying the Interaction between Enzymes and Drugs in Cellular Networking
With the features of extremely high selectivity and efficiency in catalyzing almost all the chemical reactions in cells, enzymes play vitally important roles for the life of an organism and hence have become frequent targets for drug design. An essential step in developing drugs by targeting enzymes is to identify drug-enzyme interactions in cells. It is both time-consuming and costly to do this by experimental techniques alone. Although some computational methods have been developed for this purpose based on knowledge of the three-dimensional structure of the enzyme, their usage is quite limited because the three-dimensional structures of many enzymes are still unknown. Here, we report a sequence-based predictor, called “iEzy-Drug,” in which each drug compound is formulated by a molecular fingerprint with 258 feature components, each enzyme by Chou’s pseudo amino acid composition generated by incorporating sequential evolution information and physicochemical features derived from its sequence, and the prediction engine is operated by the fuzzy K-nearest neighbor algorithm. The overall success rate achieved by iEzy-Drug via rigorous cross-validations was about 91%. Moreover, to maximize convenience for the majority of experimental scientists, a user-friendly web server was established, by which users can easily obtain their desired results.
1. Introduction
Enzymes are biomacromolecules that catalyze almost all the chemical reactions essential for the life of a cell. Most enzymes are proteins, although some RNA molecules have been identified to possess the function of an enzyme as well. As catalysts, enzymes possess two exceptional features: high efficiency and high selectivity. For instance, the second-order rate constant between some enzymes and their substrates was found to be surprisingly high, almost reaching the upper limit of the diffusion-controlled reaction rate according to the calculation and analysis by Chou and coworkers [4–6]. The high selectivity or specificity of enzymes has been likened to the “lock-and-key” model, implying that an accurate fit is required between the active site of an enzyme and its substrate for the catalytic reaction to occur. Owing to these unique features, enzymes play a crucial role in controlling and regulating the order of chemical reactions in cells, which is vitally important for their survival. It is also because of this that enzymes are excellent drug targets, and many drugs are in fact enzyme inhibitors. For example, some peptide inhibitors against HIV/AIDS [7–10] and SARS (severe acute respiratory syndrome) [11–13] were based on Chou’s distorted key theory, as illustrated in Figure 1, where (a) shows a good fit of a cleavable octapeptide with the active site of HIV protease and (b) shows that the peptide has become an ideal inhibitor or “distorted key” after its scissile bond is modified. For a brief introduction to Chou’s distorted key theory and its application to designing peptide drugs, see the Wikipedia article at http://en.wikipedia.org/wiki/Chou’s_distorted_key_theory_for_peptide_drugs.
To develop enzyme-targeting drugs, an essential step is to identify drug-enzyme interactions in cellular networking. The completion of the human genome project and the emergence of molecular medicine have provided an excellent opportunity to discover unknown target enzymes for drugs. Many efforts have been made in this regard by computationally analyzing drug-enzyme interactions. The most commonly used approaches are docking simulations (see, e.g., [16–19]) and protein cleavage site analysis (see, e.g., [8, 12, 13]) based on Chou’s distorted key theory. However, the latter approach is mainly used to find peptide drugs. Compared with smaller organic compounds, peptide drugs have the advantage of low toxicity to the human body, but they suffer from poor metabolic stability and low bioavailability because they cannot readily cross membrane barriers such as the intestinal and blood-brain barriers. In contrast, molecular docking is a useful vehicle for investigating the interaction of an enzyme receptor with its organic inhibitor and revealing their binding mechanism, as demonstrated by a series of studies [11, 19–23]. However, a necessary prerequisite for molecular docking is the availability of the 3D (three-dimensional) structure of the targeted enzyme. Unfortunately, the 3D structures of many enzymes are still unknown. Although X-ray crystallography is a powerful tool for determining the 3D structures of enzymes, it is time-consuming and expensive. In particular, not all enzymes can be successfully crystallized. For example, membrane enzymes are very difficult to crystallize, and most of them will not dissolve in normal solvents. Therefore, very few 3D structures of membrane enzymes have been determined so far. Although NMR is a very powerful tool for determining the 3D structures of membrane proteins, as indicated by a series of recent publications (see, e.g., [24–30]), it is also time-consuming and costly.
To acquire the structural information in a timely manner, one has to resort to various structural bioinformatics tools (see, e.g., [18, 31, 32]). Unfortunately the number of templates for developing high quality 3D structures by structural bioinformatics is very limited.
Therefore, it would save us a lot of time and money if we could identify the interactions between enzymes and drugs before carrying out any intense study in this regard. In view of this, the present study was initiated in an attempt to develop a computational method based on the sequence-derived features that can be used to predict the drug-enzyme interactions in cellular networking.
As summarized in a comprehensive review  and demonstrated by a series of recent publications [34–37], to successfully develop the desired method, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) denote the drug-enzyme samples with an effective formulation that can truly reflect their intrinsic relation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) conduct a rigorous cross-validation to objectively evaluate its anticipated accuracy; (v) establish a user-friendly web-server for the predictor that is freely accessible to the public. Next, let us elaborate how to deal with these procedures one by one.
2. Materials and Methods
2.1. Benchmark Dataset
The data used in this study were collected from the Kyoto Encyclopedia of Genes and Genomes (KEGG) at http://www.kegg.jp/kegg/, which is a database resource for understanding high-level functions and utilities of biological systems, such as the cell, the organism, and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. For the current study, the benchmark dataset $\mathbb{S}$ can be formulated as
$$\mathbb{S} = \mathbb{S}^{+} \cup \mathbb{S}^{-}, \tag{1}$$
where $\mathbb{S}^{+}$ is the positive subset that consists of the interactive enzyme-drug pairs only, while $\mathbb{S}^{-}$ is the negative subset that contains the noninteractive enzyme-drug pairs only, and the symbol $\cup$ represents the union in set theory. Here, an “interactive” pair means a pair whose two counterparts interact with each other in the drug-target networks as defined in the KEGG database, while a “noninteractive” pair means that its two counterparts do not interact with each other in the drug-target networks. The positive dataset $\mathbb{S}^{+}$ contains 2,719 enzyme-drug pairs derived from Yamanishi et al. The negative dataset $\mathbb{S}^{-}$ contains 5,438 noninteractive enzyme-drug pairs, which were derived according to the following procedures: (i) separating each of the pairs in $\mathbb{S}^{+}$ into a single drug and a single enzyme; (ii) recoupling each of the single drugs with each of the single enzymes into pairs in such a way that none of them occurs in $\mathbb{S}^{+}$; (iii) randomly picking the pairs thus formed until their number reaches twice the number of pairs in $\mathbb{S}^{+}$. The 2,719 interactive enzyme-drug pairs and 5,438 noninteractive enzyme-drug pairs are given in Online Supporting Information S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/2013/701317). All the detailed information for the compounds or drugs listed there can be found in the KEGG database via their codes.
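The three-step construction of the negative subset can be sketched as follows. This is only a minimal illustration, not the authors' code: the function name `sample_negative_pairs`, the fixed random seed, and the toy identifiers are all assumptions introduced here.

```python
import random


def sample_negative_pairs(positive_pairs, ratio=2, seed=0):
    """Draw noninteractive enzyme-drug pairs following steps (i)-(iii)."""
    positives = set(positive_pairs)
    # (i) separate the positive pairs into single enzymes and single drugs
    enzymes = sorted({enzyme for enzyme, _ in positives})
    drugs = sorted({drug for _, drug in positives})
    # (ii) recouple them, keeping only pairs absent from the positive subset
    candidates = [(e, d) for e in enzymes for d in drugs
                  if (e, d) not in positives]
    # (iii) randomly pick `ratio` times as many pairs as there are positives
    wanted = ratio * len(positives)
    if wanted > len(candidates):
        raise ValueError("not enough candidate pairs for the requested ratio")
    return random.Random(seed).sample(candidates, wanted)


# Toy identifiers, invented for illustration only.
toy_positives = [("enzA", "drug1"), ("enzB", "drug2"), ("enzC", "drug3")]
toy_negatives = sample_negative_pairs(toy_positives)
```

With 2,719 positive pairs and `ratio=2`, the same procedure would return 5,438 pairs, matching the size of the negative subset described above.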
2.2. Sample Representation
Since each of the samples in the current network system contains an enzyme (protein) and a drug, a combination of the following two approaches was adopted to represent the enzyme-drug pair samples.
(a) 2D Molecular Fingerprints. Although the number of drugs is extremely large, most of them are small organic molecules composed of a limited set of small substructures. The identification of such small molecules or substructures can be used to detect drug-target interactions. Molecular fingerprints are bit-string representations of molecular structure and properties. It should be pointed out that many types of structural representations have been suggested for the description of drug molecules, including physicochemical properties, chemical graphs, topological indices, 3D pharmacophore patterns, and molecular fields. In the current study, we use the simple and generally adopted 2D molecular fingerprints to represent drug molecules, as described below.
First, for each of the drugs concerned, we can obtain from the KEGG database, via its code, a MOL file that contains the detailed information of its chemical structure. Second, we can convert the MOL file into its 2D molecular fingerprint by using a chemical toolbox called OpenBabel, which can be downloaded from the website at http://openbabel.org/. The current version of OpenBabel can generate four types of fingerprints: FP2, FP3, FP4, and MACCS. In the current study, we used the FP2 fingerprint format. It is a path-based fingerprint that identifies small molecule fragments based on all linear and ring substructures and maps them onto a bit-string using a hash function (somewhat similar to the Daylight fingerprints [47, 48]). The fingerprint obtained from OpenBabel is a string of 256 hexadecimal digits, which we can convert to a vector with 256 components. Then, a molecular fingerprint can be formulated as a 256-D vector given by
$$\mathrm{MF} = [d_1\ d_2\ \cdots\ d_i\ \cdots\ d_{256}]^{\mathbf{T}}, \tag{2}$$
where $d_i$ is an integer between 0 and 15, and $\mathbf{T}$ is the matrix transpose operator.
In order to capture as much useful information from a molecular fingerprint as possible, we can also convert the above string of 256 hexadecimal digits into a 1024-bit binary vector (four bits per hexadecimal digit), which is a digital sequence consisting only of 0s and 1s, and consider two different digital signal characteristics of this sequence, as follows.
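The two conversions just described can be sketched in a few lines. This is a hedged illustration rather than the authors' pipeline: the helper names are invented here, whitespace in the hexadecimal string is simply discarded, and the most-significant-bit-first order within each digit is an assumption.

```python
def hex_to_digit_vector(hex_string):
    """Map each hexadecimal digit to an integer between 0 and 15."""
    digits = "".join(hex_string.split())  # drop spaces and line breaks
    return [int(ch, 16) for ch in digits]


def hex_to_bit_vector(hex_string):
    """Expand each hexadecimal digit into four bits (assumed MSB first)."""
    bits = []
    for value in hex_to_digit_vector(hex_string):
        bits.extend(int(b) for b in format(value, "04b"))
    return bits


# Toy 4-digit string; a real FP2 fingerprint would have 256 digits.
toy_digits = hex_to_digit_vector("0f3a")
toy_bits = hex_to_bit_vector("a5")
```

A 256-digit fingerprint therefore yields a 256-component integer vector and a 1024-component binary vector.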
(b) Information Entropy. Shannon proposed that any information is redundant, and that the amount of redundancy is related to the occurrence probability, or uncertainty, of each symbol (such as a number, letter, or word) in the information. The information entropy for a system with a probability distribution over two classes is defined as
$$H = -\sum_{i=0}^{1} p_i \log_2 p_i, \tag{3}$$
where $p_i$ represents the occurrence probability of the number $i$ ($i = 0$ or 1) in the aforementioned 1024-bit binary vector, and the information entropy $H$ is a measure of the amount of information. For example, the digital sequence 100100011010010 contains six 1s and nine 0s, so the value of the information entropy $H$ thus obtained is
$$H = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} \approx 0.971. \tag{4}$$
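As a small worked check of the entropy definition, the sketch below computes the information entropy of a binary string. It assumes base-2 logarithms (entropy measured in bits); the function name `information_entropy` is introduced here for illustration only.

```python
import math
from collections import Counter


def information_entropy(bits):
    """Shannon entropy (in bits) of the symbols in a digital sequence."""
    counts = Counter(bits)
    total = len(bits)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# The example sequence from the text: six 1s and nine 0s.
example_entropy = information_entropy("100100011010010")
```

Under the base-2 assumption, `example_entropy` is approximately 0.971; a sequence made of a single repeated symbol has entropy 0.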
(c) Complexity Factor. The Lempel-Ziv (LZ) complexity reflects the order that is retained in a sequence, and hence was adopted in this study. At each step, only two operations are allowed in the process of obtaining the LZ complexity: either copying the longest possible section from the part of the sequence already processed, or generating an additional symbol that ensures the uniqueness of each component. A nonempty digital sequence $S$ of length $n$ is expressed by
$$S = s_1 s_2 \cdots s_n, \tag{5}$$
where $s_1$ represents the 1st digital value, $s_2$ the 2nd value, and so forth, and its substring running from position $i$ to position $j$ is denoted by $S(i, j) = s_i s_{i+1} \cdots s_j$. The sequence is synthesized according to the following formula:
$$\mathrm{Syn}(S) = S(1, i_1) \cdot S(i_1 + 1, i_2) \cdot \cdots \cdot S(i_{m-1} + 1, n). \tag{6}$$
Suppose that $S$ has been reconstructed up to the symbol $s_r$, which is viewed as the newly inserted symbol. The substring up to $s_r$ will be denoted by $S(1, r)\,\cdot$, where the bold dot indicates that $s_r$ is a newly inserted symbol, and we check whether the rest of the sequence, $S(r+1, n)$, can be reconstructed by simple copying. At first suppose $Q = s_{r+1}$, and see whether $Q$ is a substring of $S(1, r)Q\pi$, where $\pi$ means deleting the last symbol from the string $S(1, r)Q$. If the answer is “no”, $Q$ cannot be obtained by copying, so we insert $Q$ into the sequence followed by a dot. If the answer is “yes”, no new symbol is needed, and we proceed with $Q = s_{r+1}s_{r+2}$ and repeat the same procedure. The LZ complexity is the number of dots (plus one if the string is not terminated by a dot). For example, for the sequence $S = 100100011010010$, $\mathrm{Syn}(S)$ and the corresponding complexity factor CF are
$$\mathrm{Syn}(S) = 1 \cdot 0 \cdot 01 \cdot 000 \cdot 11 \cdot 01001 \cdot 0, \qquad \mathrm{CF} = 7. \tag{7}$$
Thus, by adding the information entropy $H$ (4) and the complexity factor CF (7) to the molecular fingerprint MF (2), we obtain a total of $256 + 2 = 258$ feature elements to represent a drug compound $\mathbf{D}$; that is, it can now be formulated as a 258-D vector given by
$$\mathbf{D} = [d_1\ d_2\ \cdots\ d_{256}\ H\ \mathrm{CF}]^{\mathbf{T}}, \tag{8}$$
where $d_i$ has the same meaning as in (2), while $H$ and CF are the information entropy and complexity factor, respectively, as described in the previous two sections.
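The copy-or-insert procedure for the LZ complexity can be sketched as below. This is one reading of the procedure described above (a component keeps growing while it can still be copied from the text that precedes its last symbol), not the authors' implementation; the function name `lz_parse` is an assumption made for this example.

```python
def lz_parse(sequence):
    """Split a sequence into Lempel-Ziv components.

    A candidate component Q is extended symbol by symbol for as long as Q
    still occurs in the sequence read so far with its last symbol removed.
    """
    components = []
    start = 0
    n = len(sequence)
    while start < n:
        length = 1
        while (start + length <= n and
               sequence[start:start + length] in sequence[:start + length - 1]):
            length += 1
        # Slicing clips at the end, so an unterminated last component is kept.
        components.append(sequence[start:start + length])
        start += length
    return components


example_components = lz_parse("100100011010010")
complexity_factor = len(example_components)
```

For the example sequence this parsing gives the components 1, 0, 01, 000, 11, 01001, and 0, so the complexity factor is 7.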
The sequences of the enzymes involved in this study are given in Online Supporting Information S2. Now the problem is how to effectively represent these enzyme sequences for the current study. Generally speaking, there are two kinds of approaches to formulating enzyme sequences: the sequential model and the nonsequential or discrete model. The most typical sequential representation for an enzyme sample $\mathbf{E}$ with $L$ residues is its entire amino acid sequence; that is,
$$\mathbf{E} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \cdots \mathrm{R}_L, \tag{9}$$
where $\mathrm{R}_1$ represents the 1st residue, $\mathrm{R}_2$ the 2nd residue, and so forth. An enzyme sample thus formulated contains its most complete information. This is an obvious advantage of the sequential representation. To get the desired results, sequence-similarity-search-based tools, such as BLAST [52, 53], are usually utilized to conduct the prediction. However, this kind of approach fails when the query enzyme does not have significant homology to enzymes of known characteristics. Thus, various nonsequential representation models were proposed. The simplest nonsequential model for an enzyme is based on its amino acid composition (AAC), as defined by
$$\mathbf{E} = [f_1\ f_2\ \cdots\ f_{20}]^{\mathbf{T}}, \tag{10}$$
where $f_u$ ($u = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids [54–56] in the enzyme $\mathbf{E}$, and $\mathbf{T}$ has the same meaning as in (2) and (8). The AAC-discrete model has been widely used for identifying various attributes of proteins (see, e.g., [57–61]). However, as can be seen from (10), all the sequence-order effects are lost when using the AAC-discrete model. This is its main shortcoming. To avoid completely losing the sequence-order information, the pseudo amino acid composition [62, 63], or Chou’s PseAAC, was proposed to replace the simple AAC model.
Since the concept of PseAAC was proposed in 2001, it has penetrated into almost all the fields of protein attribute prediction and computational proteomics, such as predicting supersecondary structure, predicting metalloproteinase family, predicting membrane protein types [66, 67], predicting protein structural class, discriminating outer membrane proteins, identifying antibacterial peptides, identifying allergenic proteins, identifying bacterial virulent proteins, predicting protein subcellular location [73, 74], identifying GPCRs and their types, identifying protein quaternary structural attributes, predicting protein submitochondria locations, identifying risk type of human papillomaviruses, identifying cyclin proteins, predicting GABA(A) receptor proteins, and predicting cysteine S-nitrosylation sites in proteins, among many others (see a long list of papers cited in the References section of ). Recently, the concept of PseAAC was further extended to represent the feature vectors of DNA and nucleotides [36, 82], as well as other biological samples (see, e.g., [83, 84]). Because it has been widely and increasingly used, two powerful software packages called “PseAAC-Builder” and “propy” were recently established for generating various special modes of Chou’s pseudo amino acid composition, in addition to the web server PseAAC built in 2008. According to a recent review, the general form of Chou's PseAAC for an enzyme sample $\mathbf{E}$ can be formulated as
$$\mathbf{E} = [\psi_1\ \psi_2\ \cdots\ \psi_u\ \cdots\ \psi_{\Omega}]^{\mathbf{T}}, \tag{11}$$
where the subscript $\Omega$ is an integer, and its value as well as the components $\psi_u$ ($u = 1, 2, \ldots, \Omega$) will depend on how the desired information is extracted from the amino acid sequence of $\mathbf{E}$ (cf. (10)). Next, let us describe how to extract useful information from the benchmark dataset and Online Supporting Information S2 to define the enzyme samples concerned via (11).
To incorporate as much useful information as possible from an enzyme sample, we are to approach this problem from three different angles, followed by incorporating the feature elements thus obtained into the general form of PseAAC of (11).
(a) Amino Acid Composition. The components of amino acid composition have been widely used to predict various protein attributes [57–61]. In this study, they were also included as the first 20 elements in the general Chou’s PseAAC (cf. (11)); that is,
$$\psi_u = f_u \quad (u = 1, 2, \ldots, 20), \tag{12}$$
where $f_u$ has the same meaning as in (10).
(b) Dipeptide Composition. Dipeptide composition has been used to predict the protein secondary structural contents [88, 89] as well as various protein attributes (see, e.g., [90–93]). The number of different dipeptides is $20 \times 20 = 400$. Suppose that the normalized occurrence frequencies of the 400 dipeptides in an enzyme sample are given by
$$[f^{(2)}_1\ f^{(2)}_2\ \cdots\ f^{(2)}_{400}]^{\mathbf{T}}. \tag{13}$$
Incorporating the above 400 dipeptide components into (11), we have
$$\psi_u = f^{(2)}_{u-20} \quad (u = 21, 22, \ldots, 420). \tag{14}$$
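The amino acid and dipeptide compositions can be computed directly from a sequence, as in the following sketch. It is illustrative only: the function names are invented here, the alphabetical residue order follows the one-letter codes, and normalizing the dipeptide counts by the number of overlapping dipeptides (sequence length minus one) is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 native amino acids, one-letter codes
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Normalized occurrence frequencies of the 20 amino acids."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Normalized occurrence frequencies of the 400 overlapping dipeptides."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs containing nonstandard residues
            counts[pair] += 1
    total = len(sequence) - 1
    return [counts[pair] / total for pair in DIPEPTIDES]


# Toy sequence, for illustration only.
toy_aac = amino_acid_composition("ACAC")
toy_dpc = dipeptide_composition("ACAC")
```

Concatenating the two lists gives the first 420 components of the enzyme vector.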
(c) Sequential Evolution Information. Biology is a natural science with a historic dimension. All biological species have developed starting out from a very limited number of ancestral species. Their evolution involves changes of single residues, insertions and deletions of several residues , gene doubling, and gene fusion. With these changes accumulated for a long period of time, many similarities between initial and resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes , such as having basically the same biological function and residing at a same subcellular location. To extract the sequential evolution information and use it to define the components of (11), the PSSM (Position Specific Scoring Matrix) was used as described next.
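According to Schäffer et al., the sequence evolution information of an enzyme $\mathbf{E}$ with $L$ amino acid residues can be expressed by an $L \times 20$ matrix, as given by
$$\mathbf{P}^{0}_{\mathrm{PSSM}} = \begin{bmatrix} E^{0}_{1\to 1} & E^{0}_{1\to 2} & \cdots & E^{0}_{1\to 20} \\ E^{0}_{2\to 1} & E^{0}_{2\to 2} & \cdots & E^{0}_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E^{0}_{L\to 1} & E^{0}_{L\to 2} & \cdots & E^{0}_{L\to 20} \end{bmatrix}, \tag{15}$$
where $E^{0}_{i\to j}$ represents the original score of the $i$th amino acid residue ($i = 1, 2, \ldots, L$) in the enzyme sequence being changed to amino acid type $j$ ($j = 1, 2, \ldots, 20$) in the process of evolution. Here, the numerical codes 1, 2, …, 20 are used to represent the 20 native amino acid types denoted by A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y. The scores in (15) were generated by using PSI-BLAST to search the UniProtKB/Swiss-Prot database (Release 2013-05) through three iterations, with 0.001 as the E-value cutoff for multiple sequence alignment against the sequence of the enzyme $\mathbf{E}$. In order to scale every element in (15) from its original score range into the range between 0 and 1, we performed a conversion through the standard sigmoid function to make it become
$$\mathbf{P}_{\mathrm{PSSM}} = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix}, \tag{16}$$
where
$$E_{i\to j} = \frac{1}{1 + e^{-E^{0}_{i\to j}}}. \tag{17}$$
Now, we extract the useful information from (16) to define the components of (11) via the following approach:
$$\psi_u = \bar{E}_{u-420} \quad (u = 421, 422, \ldots, 440), \tag{18}$$
where
$$\bar{E}_j = \frac{1}{L}\sum_{i=1}^{L} E_{i\to j} \quad (j = 1, 2, \ldots, 20). \tag{19}$$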
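The sigmoid scaling of the PSSM scores, followed by a per-column summary, can be sketched as below. Averaging each column over the sequence positions is an assumed reading of how the 20 evolution-based components are obtained, and the function names and the toy 2 × 3 matrix (a real PSSM has 20 columns) are invented for illustration.

```python
import math


def sigmoid_scale(raw_scores):
    """Apply the standard sigmoid 1 / (1 + exp(-x)) to every matrix element."""
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in raw_scores]


def column_means(matrix):
    """Average each column over all sequence positions (rows)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]


# Toy raw scores: two sequence positions, three amino acid types.
toy_raw = [[0.0, 2.0, -2.0],
           [0.0, -2.0, 2.0]]
toy_scaled = sigmoid_scale(toy_raw)
toy_means = column_means(toy_scaled)
```

Because the sigmoid values of x and −x sum to 1, every column mean of the toy matrix is 0.5, which makes the example easy to verify by hand.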
(d) Grey System Model Approach. The grey system theory is quite useful in dealing with complicated systems that lack sufficient information or need to process uncertain information. According to the grey system theory, we can extract from the $j$th column ($j = 1, 2, \ldots, 20$) of (16) the grey model coefficients defined by (20) and (21).
Therefore, based on the grey system theory and (20), we can extract another $2 \times 20 = 40$ quantities from (16) to define the components of (11); that is, for each column the two coefficients given by (20) are each multiplied by a weight factor, and the weight factors were all set to 1 in the current study.
In other words, in this study (11) or Chou’s PseAAC is a 480-D vector, whose 480 components are given by (23) derived from the amino acid composition, dipeptide composition, sequential evolution information, and grey system theory.
(e) Representing Enzyme-Drug Pairs. Now the pair between an enzyme molecule and a drug compound can be formulated by combining (8) and (11), as given by
$$\mathbf{G} = \mathbf{E} \oplus \mathbf{D}, \tag{24}$$
where $\mathbf{G}$ represents the enzyme-drug pair, $\oplus$ the orthogonal sum, and each of the feature elements is given in (8) and (23).
For the convenience of the later formulation, let us use $x_i$ ($i = 1, 2, \ldots, 738$) to represent the $480 + 258 = 738$ components of (24); that is,
$$\mathbf{G} = [x_1\ x_2\ \cdots\ x_i\ \cdots\ x_{738}]^{\mathbf{T}}. \tag{25}$$
To optimize the prediction results, different weights are usually tested for each of the elements in (25). However, since this would consume a lot of computational time for a total of 738 weight factors, here let us adopt the normalization approach to deal with this problem, as done in [98, 99]; that is, each component $x_i$ in (25) is converted by the arctangent-based normalization of (26), which maps every component into a fixed bounded range. As demonstrated in [98, 99], the normalization approach via (26) was quite effective in enhancing the quality of prediction operated in a high-dimensional space. Therefore, in this study we did not go through the procedure of optimizing the weight factors, which significantly reduced the computational time.
2.3. Fuzzy K-Nearest Neighbour Algorithm
The K-NN (K-Nearest Neighbor) classifier is quite popular in the pattern recognition community owing to its good performance and ease of use. According to the K-NN rule, also called the “voting K-NN rule,” the query sample should be assigned to the subset represented by a majority of its K nearest neighbors, as illustrated in Figure 5 of .
The fuzzy K-NN classification method is a special variation of the K-NN classification family. Instead of roughly assigning the label based on a vote among the K nearest neighbors, it attempts to estimate the membership values that indicate to what degree the query sample belongs to the classes concerned. Obviously, it is impossible for any characteristic description to contain complete information, which makes the classification ambiguous. In view of this, the fuzzy principle is very reasonable and particularly useful in dealing with complicated biological systems, such as identifying nuclear receptor subfamilies, characterizing the structure of fast-folding proteins, classifying G protein-coupled receptors, predicting protein quaternary structural attributes, predicting protein structural classes [106, 107], and so forth.
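Next, let us give a brief introduction to how the fuzzy K-NN approach is used to identify the interactions between the enzymes and the drug compounds in the network concerned.
Suppose that $\{\mathbf{G}_1, \mathbf{G}_2, \ldots, \mathbf{G}_N\}$ is a set of vectors representing $N$ enzyme-drug pairs in a training set classified into two classes $\{C^{+}, C^{-}\}$, where $C^{+}$ denotes the interactive pair class, while $C^{-}$ denotes the noninteractive pair class; $\{\mathbf{G}^{*}_1, \mathbf{G}^{*}_2, \ldots, \mathbf{G}^{*}_K\}$ is the subset of the $K$ nearest neighbor pairs to the query pair $\mathbf{G}$. Thus, the fuzzy membership values for the query pair $\mathbf{G}$ in the two classes are given by
$$\mu^{\pm}(\mathbf{G}) = \frac{\sum_{j=1}^{K} \mu^{\pm}(\mathbf{G}^{*}_j)\, d(\mathbf{G}, \mathbf{G}^{*}_j)^{-2/(\varphi - 1)}}{\sum_{j=1}^{K} d(\mathbf{G}, \mathbf{G}^{*}_j)^{-2/(\varphi - 1)}}, \tag{27}$$
where $K$ is the number of the nearest neighbors counted for the query pair $\mathbf{G}$; $\mu^{+}(\mathbf{G}^{*}_j)$ and $\mu^{-}(\mathbf{G}^{*}_j)$, the fuzzy membership values of the training sample $\mathbf{G}^{*}_j$ to the classes $C^{+}$ and $C^{-}$, respectively, as will be further defined next; $d(\mathbf{G}, \mathbf{G}^{*}_j)$, the cosine distance between $\mathbf{G}$ and its $j$th nearest pair $\mathbf{G}^{*}_j$ in the training dataset; $\varphi$ ($> 1$), the fuzzy coefficient for determining how heavily the distance is weighted when calculating each nearest neighbor’s contribution to the membership value. Note that the parameters $K$ and $\varphi$ will affect the computation result of (27), and they will be optimized by a grid search, as will be described later. Also, various other metrics can be chosen for $d(\mathbf{G}, \mathbf{G}^{*}_j)$, such as the Euclidean distance, Hamming distance, and Mahalanobis distance [55, 109].
The quantitative definitions for the aforementioned $\mu^{+}(\mathbf{G}^{*}_j)$ and $\mu^{-}(\mathbf{G}^{*}_j)$ in (27) are given by
$$\mu^{+}(\mathbf{G}^{*}_j) = \begin{cases} 1, & \text{if } \mathbf{G}^{*}_j \in C^{+}, \\ 0, & \text{otherwise}, \end{cases} \qquad \mu^{-}(\mathbf{G}^{*}_j) = \begin{cases} 1, & \text{if } \mathbf{G}^{*}_j \in C^{-}, \\ 0, & \text{otherwise}. \end{cases} \tag{28}$$
Substituting (28) into (27), it follows that if $\mu^{+}(\mathbf{G}) > \mu^{-}(\mathbf{G})$ then the query pair $\mathbf{G}$ is an interactive couple; otherwise, it is noninteractive. In other words, the outcome can be formulated as
$$\mathbf{G} \in \begin{cases} C^{+}, & \text{if } \mu^{+}(\mathbf{G}) > \mu^{-}(\mathbf{G}), \\ C^{-}, & \text{otherwise}. \end{cases} \tag{29}$$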
If there is a tie between the two membership values, the query pair will be randomly assigned to one of the two classes. However, such a case is quite rare and never happened in this study.
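The fuzzy K-NN decision described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the iEzy-Drug engine: the cosine distance is taken as one minus the cosine similarity, training labels are crisp (1 for interactive, 0 for noninteractive), a small `eps` guards against a zero distance, and all names and default parameter values are invented here.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two nonzero vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def fuzzy_knn_memberships(query, train_vectors, train_labels,
                          k=3, phi=2.0, eps=1e-12):
    """Return (mu_plus, mu_minus) for the query from its k nearest neighbors."""
    neighbors = sorted(
        (cosine_distance(query, x), label)
        for x, label in zip(train_vectors, train_labels)
    )[:k]
    # Each neighbor is weighted by distance ** (-2 / (phi - 1)).
    weights = [(d + eps) ** (-2.0 / (phi - 1.0)) for d, _ in neighbors]
    total = sum(weights)
    mu_plus = sum(w for w, (_, label) in zip(weights, neighbors)
                  if label == 1) / total
    return mu_plus, 1.0 - mu_plus


# Toy 2-D training set, for illustration only.
toy_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [1, 1, 0, 0]
mu_plus, mu_minus = fuzzy_knn_memberships([1.0, 0.05], toy_train, toy_labels)
predicted_interactive = mu_plus > mu_minus
```

In the toy example the query lies very close to the two interactive samples, so nearly all of the membership goes to the interactive class even though one of the three neighbors is noninteractive.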
The predictor, thus, established is called iEzy-Drug, where “i” means identify, and “Ezy-Drug” means the interaction between enzyme and drug. To provide an intuitive overall picture, a flowchart is provided in Figure 2 to show the process of how the classifier works in identifying enzyme-drug interactions.
2.4. Criteria for Performance Evaluation
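In the literature, the following equation set is often used for examining the performance quality of a predictor:
$$\begin{aligned} \mathrm{Sn} &= \frac{TP}{TP + FN}, \\ \mathrm{Sp} &= \frac{TN}{TN + FP}, \\ \mathrm{Acc} &= \frac{TP + TN}{TP + TN + FP + FN}, \\ \mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \end{aligned} \tag{30}$$
where $TP$ represents the true positive; $TN$, the true negative; $FP$, the false positive; $FN$, the false negative; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Matthews correlation coefficient.
To most biologists, however, the four metrics as formulated in (30) are not very intuitive or easy to understand, particularly the Matthews correlation coefficient. Here, let us adopt Chou’s symbols to formulate the previous four metrics. By means of Chou’s symbols [111, 112], the rates of correct predictions for the interactive enzyme-drug pairs in dataset $\mathbb{S}^{+}$ and the noninteractive enzyme-drug pairs in dataset $\mathbb{S}^{-}$ are, respectively, defined by (cf. (1))
$$\Lambda^{+} = \frac{N^{+} - N^{+}_{-}}{N^{+}}, \qquad \Lambda^{-} = \frac{N^{-} - N^{-}_{+}}{N^{-}}, \tag{31}$$
where $N^{+}$ is the total number of the interactive enzyme-drug pairs investigated, while $N^{+}_{-}$ is the number of the interactive enzyme-drug pairs incorrectly predicted as noninteractive enzyme-drug pairs; $N^{-}$ is the total number of the noninteractive enzyme-drug pairs investigated, while $N^{-}_{+}$ is the number of the noninteractive enzyme-drug pairs incorrectly predicted as interactive enzyme-drug pairs. The overall success prediction rate is given by
$$\Lambda = \frac{\Lambda^{+} N^{+} + \Lambda^{-} N^{-}}{N^{+} + N^{-}} = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}. \tag{32}$$
It is obvious from (31)-(32) that if and only if none of the interactive enzyme-drug pairs and none of the noninteractive enzyme-drug pairs are mispredicted, that is, $N^{+}_{-} = 0$ and $N^{-}_{+} = 0$, we have the overall success rate $\Lambda = 1$. Otherwise, the overall success rate would be smaller than 1.
The symbols in (30) and those in (31)-(32) are related by
$$TP = N^{+} - N^{+}_{-}, \qquad TN = N^{-} - N^{-}_{+}, \qquad FP = N^{-}_{+}, \qquad FN = N^{+}_{-}. \tag{33}$$
Substituting (33) into (30), we obtain
$$\begin{aligned} \mathrm{Sn} &= 1 - \frac{N^{+}_{-}}{N^{+}}, \\ \mathrm{Sp} &= 1 - \frac{N^{-}_{+}}{N^{-}}, \\ \mathrm{Acc} &= 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, \\ \mathrm{MCC} &= \frac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}. \end{aligned} \tag{34}$$
Now it is easy to see from (34) that when $N^{+}_{-} = 0$, meaning none of the interactive enzyme-drug pairs was mispredicted to be a noninteractive enzyme-drug pair, we have the sensitivity $\mathrm{Sn} = 1$; while when $N^{+}_{-} = N^{+}$, meaning that all the interactive enzyme-drug pairs were mispredicted to be noninteractive enzyme-drug pairs, we have the sensitivity $\mathrm{Sn} = 0$. Likewise, when $N^{-}_{+} = 0$, meaning none of the noninteractive enzyme-drug pairs was mispredicted, we have the specificity $\mathrm{Sp} = 1$; while when $N^{-}_{+} = N^{-}$, meaning all the noninteractive enzyme-drug pairs were incorrectly predicted as interactive enzyme-drug pairs, we have the specificity $\mathrm{Sp} = 0$. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the interactive enzyme-drug pairs in the dataset $\mathbb{S}^{+}$ and none of the noninteractive enzyme-drug pairs in $\mathbb{S}^{-}$ was incorrectly predicted, we have the overall accuracy $\mathrm{Acc} = 1$; while when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that all the interactive enzyme-drug pairs in the dataset $\mathbb{S}^{+}$ and all the noninteractive enzyme-drug pairs in $\mathbb{S}^{-}$ were mispredicted, we have the overall accuracy $\mathrm{Acc} = 0$. The Matthews correlation coefficient (MCC) is usually used for measuring the quality of binary (two-class) classifications. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the interactive enzyme-drug pairs in the dataset $\mathbb{S}^{+}$ and none of the noninteractive enzyme-drug pairs in $\mathbb{S}^{-}$ were mispredicted, we have $\mathrm{MCC} = 1$; when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have $\mathrm{MCC} = 0$, meaning no better than random prediction; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, we have $\mathrm{MCC} = -1$, meaning total disagreement between prediction and observation. As we can see from the previous discussion, it is much more intuitive and easier to understand when using (34) to examine a predictor for its sensitivity, specificity, overall accuracy, and Matthews correlation coefficient. It is instructive to point out that the metrics as defined in (30) and (34) are valid for single-label systems; for multilabel systems, a set of more complicated metrics should be used as given in .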
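The equivalence between the conventional confusion-matrix metrics and their formulation in terms of the numbers of mispredicted pairs can be checked numerically. The sketch below uses invented function and argument names and toy counts; it is a hedged illustration, not part of the original evaluation.

```python
import math


def metrics_from_confusion(tp, tn, fp, fn):
    """Sn, Sp, Acc, and MCC from the four confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc


def metrics_from_error_counts(n_pos, pos_missed, n_neg, neg_missed):
    """Same metrics from totals and numbers of mispredicted pairs.

    n_pos / n_neg: totals of interactive / noninteractive pairs;
    pos_missed: interactive pairs predicted as noninteractive;
    neg_missed: noninteractive pairs predicted as interactive.
    """
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_missed / n_neg
    acc = 1 - (pos_missed + neg_missed) / (n_pos + n_neg)
    mcc = (1 - (pos_missed / n_pos + neg_missed / n_neg)) / math.sqrt(
        (1 + (neg_missed - pos_missed) / n_pos)
        * (1 + (pos_missed - neg_missed) / n_neg))
    return sn, sp, acc, mcc


# Toy counts: 100 interactive pairs (10 missed), 200 noninteractive (20 missed).
by_confusion = metrics_from_confusion(tp=90, tn=180, fp=20, fn=10)
by_errors = metrics_from_error_counts(100, 10, 200, 20)
```

Both functions return the same four values for the toy counts, and setting both error counts to zero gives a sensitivity, specificity, accuracy, and MCC of 1.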
3. Results and Discussion
How to properly examine the prediction quality is a key issue in developing a new predictor and estimating its potential application value. Generally speaking, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling or K-fold (such as 5-fold, 7-fold, or 10-fold) test, and the jackknife test. However, as elaborated by a penetrating analysis in , considerable arbitrariness exists in the independent dataset test. Also, as demonstrated by (27)–(29) in , the subsampling test (or K-fold cross-validation) cannot avoid arbitrariness either. Only the jackknife test is free of such arbitrariness, since it always yields a unique result for a given benchmark dataset. Therefore, the jackknife test has been widely recognized and increasingly utilized by investigators to examine the quality of various predictors (see, e.g., [66, 71, 74, 80]). Accordingly, the success rate by the jackknife test was also used to optimize the two uncertain parameters $K$ and $\varphi$ in (27). The result thus obtained is shown in Figure 3, which indicates the values of $K$ and $\varphi$ at which the iEzy-Drug predictor reaches its optimized status.
The success rates thus obtained by the jackknife test in identifying interactive enzyme-drug pairs or noninteractive enzyme-drug pairs on the benchmark dataset (cf. Online Supporting Information S1) are given in Table 1, where, to facilitate comparison, the corresponding result by He et al. is also given. As we can see from the table, the overall accuracy Acc achieved by iEzy-Drug was 91.03%, remarkably higher than 85.48%, the corresponding rate obtained by He et al. on the same benchmark. Furthermore, Table 1 also lists the values obtained by iEzy-Drug for the other three metrics, that is, Sn, Sp, and MCC, indicating that the accuracy of iEzy-Drug is not only very high but also quite stable.
To provide a graphical illustration to show the performance of the current binary classifier iEzy-Drug as its discrimination threshold is varied, a 2D plot, called Receiver Operating Characteristic (ROC) curve [116, 117], was also given (Figure 4). In the ROC curve, the vertical coordinate is for the true positive rate or Sn (cf. (34)), while horizontal coordinate for the false positive rate or 1-Sp. The best possible prediction method would yield a point with the coordinate (0, 1) representing 100% true positive rate (sensitivity Sn) and 0 false positive rate or 100% specificity. Therefore, the (0, 1) point is also called a perfect classification. A completely random guess would give a point along a diagonal from the point (0, 0) to (1, 1). The area under the ROC curve, also called Area Under the ROC (AUROC), is often used to indicate the performance quality of a binary classifier; the value 0.5 of AUROC is equivalent to random prediction, while 1 of AUROC represents a perfect one. As we can see from Figure 4, the AUROC value obtained by iEzy-Drug is 0.9377.
The reason why iEzy-Drug can remarkably improve the prediction quality is twofold: it has introduced the 2D molecular fingerprints to represent drug samples (see Online Supporting Information S3 for the detailed fingerprint expressions for the drugs listed in Online Supporting Information S1), and it has successfully used PseAAC to incorporate the features derived from the sequences of enzymes that are essential for identifying the interaction of enzymes with drugs in cellular networking.
To enhance the value of its practical applications, a web server for iEzy-Drug has been established and is freely accessible at http://www.jci-bioinfo.cn/iEzy-Drug/. It is anticipated that the web server will become a useful high-throughput tool for both basic research and drug development in the relevant areas, or at the very least play a complementary role to the existing methods [39, 110, 118], for which no web server has been provided so far.
3.2. The Protocol or User Guide
For the convenience of the vast majority of biologists and pharmaceutical scientists, a step-by-step guide is given below to show how users can easily get the desired results from the web server without needing to follow the complicated mathematical equations presented in this paper for developing the predictor and ensuring its integrity.
Step 1. Open the web server at the site http://www.jci-bioinfo.cn/iEzy-Drug/ and you will see the top page of the predictor on your computer screen, as shown in Figure 5. Click on the Read Me button to see a brief introduction about iEzy-Drug predictor and the caveat when using it.
Step 2. Either type or copy/paste the query pairs into the input box at the center of Figure 5. Each query pair consists of two parts: one is the enzyme sequence and the other is the drug. The enzyme sequence should be in FASTA format, and the drug should be given by its KEGG code. Examples of the query pair input can be seen by clicking on the Example button right above the input box.
Step 3. Click on the Submit button to see the predicted result. For example, if you use the four query pairs in the Example window as the input, after clicking the Submit button, you will see on your screen that the “hsa: 10056” enzyme and the “D0021” drug are an interactive pair, and that the “hsa: 100” enzyme and the “D0037” drug are also an interactive pair, but that the “hsa: 3295” enzyme and the “D00889” drug are not an interactive pair, and that the “hsa: 7366” enzyme and the “D03601” drug are not an interactive pair either. All these results are fully consistent with the experimental observations. It takes about 3 minutes before the results are shown on the screen.
Step 4. Click on the Citation button to find the relevant paper that documents the detailed development and algorithm of iEzy-Drug.
Step 5. Click on the Data button to download the benchmark dataset used to train and test the iEzy-Drug predictor.
Step 6. The program code is also available by clicking the Download button on the lower panel of Figure 5.
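Before pasting query pairs into the input box (Step 2), it may help to check locally that each enzyme is a well-formed FASTA record and each drug looks like a KEGG code. The following Python sketch is only an illustration under stated assumptions: the helper names `parse_fasta_record` and `is_kegg_drug_code` are hypothetical, the KEGG DRUG identifier is assumed to be the letter D followed by five digits (as in D00889 and D03601), only the 20 standard amino acid letters are accepted, and the short sequence used below is invented. The exact layout accepted by the server is the one shown by its Example button.

```python
import re
from typing import Tuple

# The 20 standard amino acid one-letter codes (assumption: no ambiguity codes).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# Assumed pattern for a KEGG DRUG identifier: "D" followed by five digits.
KEGG_DRUG_PATTERN = re.compile(r"D\d{5}")


def parse_fasta_record(text: str) -> Tuple[str, str]:
    """Split one FASTA record into (header, sequence) and validate the residues."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' header line")
    header = lines[0][1:].strip()
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("the FASTA record contains no sequence")
    invalid = sorted(set(sequence) - AMINO_ACIDS)
    if invalid:
        raise ValueError("unexpected residue codes: " + ", ".join(invalid))
    return header, sequence


def is_kegg_drug_code(code: str) -> bool:
    """Return True if the string matches the assumed KEGG DRUG identifier pattern."""
    return KEGG_DRUG_PATTERN.fullmatch(code.strip()) is not None


if __name__ == "__main__":
    # Hypothetical record: the header reuses an enzyme code from Step 3,
    # but the sequence itself is invented for illustration.
    record = ">hsa:3295\nMKTAYIAKQR\nQISFVKSHFS"
    print(parse_fasta_record(record))
    print(is_kegg_drug_code("D00889"))
```

A check of this kind only catches formatting problems; it says nothing about whether the enzyme or the drug is covered by the benchmark dataset.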
The authors would like to thank the three anonymous reviewers, whose constructive comments were very helpful for strengthening the presentation of the paper. This work was supported by grants from the National Natural Science Foundation of China (no. 31260273), the Jiangxi Provincial Foreign Scientific and Technological Cooperation Project (no. 20120BDH80023), the Natural Science Foundation of Jiangxi Province, China (no. 2010GZS0122, 20122BAB201020), the Department of Education of Jiangxi Province (GJJ12490), the LuoDi plan of the Department of Education of Jiangxi Province (KJLD12083), and the Jiangxi Provincial Foundation for Leaders of Disciplines in Science (20113BCB22008). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the paper.
Online Supporting Information S1. The benchmark dataset contains 8,157 enzyme-drug pair samples, of which 2,719 are interactive and 5,438 are noninteractive. The codes listed here were from the KEGG database at http://www.kegg.jp/kegg/.
Online Supporting Information S3. The fingerprints for the drug codes listed in Online Supporting Information S1. Each of these fingerprints is a 256-D vector generated by the OpenBabel software downloaded from http://openbabel.org/.
A. Bairoch, “The ENZYME database in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 304–305, 2000.
S. H. Koenig and R. D. Brown, “H2CO3 as substrate for carbonic anhydrase in the dehydration of H2CO3(−),” Proceedings of the National Academy of Sciences of the United States of America, vol. 69, no. 9, pp. 2422–2425, 1972.
K. C. Chou and S. P. Jiang, “Studies on the rate of diffusion-controlled reactions of enzymes. Spatial factor and force field factor,” Scientia Sinica, vol. 17, no. 5, pp. 664–680, 1974.
K. C. Chou, “The kinetics of the combination reaction between enzyme and substrate,” Scientia Sinica, vol. 19, no. 4, pp. 505–528, 1976.
K. C. Chou and G. P. Zhou, “Role of the protein outside active site on the diffusion-controlled reaction of enzyme,” Journal of the American Chemical Society, vol. 104, no. 5, pp. 1409–1413, 1982.
R. A. Poorman, A. G. Tomasselli, R. L. Heinrikson, and F. J. Kezdy, “A cumulative specificity model for proteases from human immunodeficiency virus types 1 and 2, inferred from statistical analysis of an extended substrate data base,” The Journal of Biological Chemistry, vol. 266, no. 22, pp. 14554–14561, 1991.
K.-C. Chou, “A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins,” The Journal of Biological Chemistry, vol. 268, no. 23, pp. 16938–16948, 1993.
J. J. Chou, “Predicting cleavability of peptide sequences by HIV protease via correlation-angle approach,” Journal of Protein Chemistry, vol. 12, no. 3, pp. 291–302, 1993.
Q. S. Du, S. Wang, D. Q. Wei, S. Sirois, and K. C. Chou, “Molecular modeling and chemical modification for finding peptide inhibitor against severe acute respiratory syndrome coronavirus main proteinase,” Analytical Biochemistry, vol. 337, no. 2, pp. 262–270, 2005.
K. C. Chou, “Review: structural bioinformatics and its impact to biomedical science,” Current Medicinal Chemistry, vol. 11, no. 16, pp. 2105–2134, 2004.
K. C. Chou, D. Q. Wei, and W. Z. Zhong, “Binding mechanism of coronavirus main proteinase with ligands and its implication to drug design against SARS,” Biochemical and Biophysical Research Communications, vol. 308, pp. 148–151, 2003, (Erratum in: Biochemical and Biophysical Research Communications, vol. 310, p. 675, 2003).
R. B. Huang, Q. S. Du, C. H. Wang, and K. C. Chou, “An in-depth analysis of the biological functional studies based on the NMR M2 channel structure of influenza A virus,” Biochemical and Biophysical Research Communications, vol. 377, no. 4, pp. 1243–1247, 2008.
B. OuYang, S. Xie, M. J. Berardi et al., “Unusual architecture of the p7 channel from hepatitis C virus,” Nature, vol. 498, pp. 521–525, 2013.
W. Z. Lin, J. A. Fang, X. Xiao, and K. C. Chou, “iLoc-animal: a multi-label learning classifier for predicting subcellular localization of animal proteins,” Molecular BioSystems, vol. 9, pp. 634–644, 2013.
X. Xiao, P. Wang, W. Z. Lin, J. H. Jia, and K. C. Chou, “iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types,” Analytical Biochemistry, vol. 436, pp. 168–177, 2013.
P. Finn, S. Muggleton, D. Page, and A. Srinivasan, “Pharmacophore discovery using the Inductive Logic Programming system PROGOL,” Machine Learning, vol. 30, no. 2-3, pp. 241–270, 1998.
I. Vogt, D. Stumpfe, H. E. Ahmed, and J. Bajorath, “Methods for computer-aided chemical biology. Part 2: evaluation of compound selectivity using 2D molecular fingerprints,” Chemical Biology and Drug Design, vol. 70, pp. 195–205, 2007.
C. E. Shannon, “A mathematical theory of communication,” ACM SIGMOBILE Mobile Computing and Communications Review, vol. 5, pp. 3–55, 2001.
V. D. Gusev, L. A. Nemytikova, and N. A. Chuzhanova, “On the complexity measures of genetic sequences,” Bioinformatics, vol. 15, no. 12, pp. 994–999, 1999.
S. F. Altschul, “Evaluating the statistical significance of multiple distinct local alignments,” in Theoretical and Computational Methods in Genome Research, S. Suhai, Ed., pp. 1–14, Plenum, New York, NY, USA, 1997.
J. C. Wootton and S. Federhen, “Statistics of local complexity in amino acid sequences and sequence databases,” Computers and Chemistry, vol. 17, no. 2, pp. 149–163, 1993.
H. Nakashima, K. Nishikawa, and T. Ooi, “The folding type of a protein is relevant to the amino acid composition,” Journal of Biochemistry, vol. 99, no. 1, pp. 153–162, 1986.
K. C. Chou and C. T. Zhang, “Predicting protein folding types by distance functions that make allowances for amino acid interactions,” The Journal of Biological Chemistry, vol. 269, no. 35, pp. 22014–22020, 1994.
I. Bahar, A. R. Atilgan, R. L. Jernigan, and B. Erman, “Understanding the recognition of protein structural classes by amino acid composition,” Proteins, vol. 29, pp. 172–185, 1997.
K. Chou, “Prediction of protein cellular attributes using pseudo-amino acid composition,” Proteins, vol. 43, pp. 246–255, 2001, (Erratum in: Proteins, vol. 44, p. 60, 2001).
M. M. Beigi, M. Behjati, and H. Mohabatkar, “Prediction of metalloproteinase family based on the concept of Chou's pseudo amino acid composition using a machine learning approach,” Journal of Structural and Functional Genomics, vol. 12, no. 4, pp. 191–197, 2011.
Y. K. Chen and K. B. Li, “Predicting membrane protein types by incorporating protein topology, domains, signal peptides, and physicochemical properties into the general form of Chou's pseudo amino acid composition,” Journal of Theoretical Biology, vol. 318, pp. 1–12, 2013.
C. Huang and J. Q. Yuan, “A multilabel model based on Chou's pseudo-amino acid composition for identifying membrane proteins with both single and multiple functional types,” The Journal of Membrane Biology, vol. 246, pp. 327–334, 2013.
M. Hayat and A. Khan, “Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou's PseAAC,” Protein and Peptide Letters, vol. 19, no. 4, pp. 411–421, 2012.
M. Khosravian, F. K. Faramarzi, M. M. Beigi, M. Behbahani, and H. Mohabatkar, “Predicting antibacterial peptides by the concept of Chou's pseudo-amino acid composition and machine learning methods,” Protein & Peptide Letters, vol. 20, pp. 180–186, 2013.
H. Mohabatkar, M. M. Beigi, K. Abdolahi, and S. Mohsenzadeh, “Prediction of allergenic proteins by means of the concept of Chou's pseudo amino acid composition and a machine learning approach,” Medicinal Chemistry, vol. 9, pp. 133–137, 2013.
L. Nanni, A. Lumini, D. Gupta, and A. Garg, “Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 467–475, 2012.
T. H. Chang, L. C. Wu, T. Y. Lee, S. P. Chen, H. D. Huang, and J. T. Horng, “EuLoc: a web-server for accurately predict protein subcellular localization in eukaryotes by incorporating various features of sequence segments into the general form of Chou's PseAAC,” Journal of Computer-Aided Molecular Design, vol. 27, pp. 91–103, 2013.
S. Zhang, Y. Zhang, H. Yang, C. Zhao, and Q. Pan, “Using the concept of Chou's pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies,” Amino Acids, vol. 34, no. 4, pp. 565–572, 2008.
R. Zia Ur and A. Khan, “Identifying GPCRs and their types with Chou's pseudo amino acid composition: an approach from multi-scale energy representation and position specific scoring matrix,” Protein & Peptide Letters, vol. 19, pp. 890–903, 2012.
X. Y. Sun, S. P. Shi, J. D. Qiu, S. B. Suo, S. Y. Huang, and R. P. Liang, “Identifying protein quaternary structural attributes by incorporating physicochemical properties into the general form of Chou's PseAAC via discrete wavelet transform,” Molecular BioSystems, vol. 8, pp. 3178–3184, 2012.
Y. Xu, J. Ding, L. Y. Wu, and K. C. Chou, “iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition,” PLoS ONE, vol. 8, Article ID e55844, 2013.
W. Chen, H. Lin, P. M. Feng, C. Ding, Y. C. Zuo, and K. C. Chou, “iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties,” PLoS ONE, vol. 7, Article ID e47843, 2012.
D. S. Cao, Q. S. Xu, and Y. Z. Liang, “Propy: a tool to generate various modes of Chou's PseAAC,” Bioinformatics, vol. 29, pp. 960–962, 2013.
W. Liu and K. C. Chou, “Prediction of protein secondary structure content,” Protein Engineering, vol. 12, no. 12, pp. 1041–1050, 1999.
A. A. Schäffer, L. Aravind, T. L. Madden et al., “Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements,” Nucleic Acids Research, vol. 29, no. 14, pp. 2994–3005, 2001.
J. Deng, “Grey entropy and grey target decision making,” Journal of Grey System, vol. 22, no. 1, pp. 1–24, 2010.
J. M. Keller, M. R. Gray, and J. A. Givens, “A fuzzy k-nearest neighbours algorithm,” IEEE Transactions on Systems, Man and Cybernetics, vol. 15, no. 4, pp. 580–585, 1985.
X. Zheng, C. Li, and J. Wang, “An information-theoretic approach to the prediction of protein structural class,” Journal of Computational Chemistry, vol. 31, no. 6, pp. 1201–1206, 2010.
K.-C. Chou and C.-T. Zhang, “Review: prediction of protein structural classes,” Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275–349, 1995.
P. C. Mahalanobis, “On the generalized distance in statistics,” Proceedings of the National Institute of Sciences of India, vol. 2, pp. 49–55, 1936.
R. M. Centor, “Signal detectability: the use of ROC curves and their analyses,” Medical Decision Making, vol. 11, no. 2, pp. 102–106, 1991.
K.-C. Chou, “Using subsite coupling to predict signal peptides,” Protein Engineering, vol. 14, no. 2, pp. 75–79, 2001.
K. C. Chou, “Prediction of protein signal sequences and their cleavage sites,” Proteins, vol. 42, pp. 136–139, 2001.
K. C. Chou, “Prediction of signal peptides using scaled window,” Peptides, vol. 22, no. 12, pp. 1973–1979, 2001.
K. C. Chou, “Some remarks on predicting multi-label attributes in molecular biosystems,” Molecular Biosystems, vol. 9, pp. 1092–1100, 2013.
M. Gribskov and N. L. Robinson, “Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching,” Computers and Chemistry, vol. 20, no. 1, pp. 25–33, 1996.
Z. He, J. Zhang, X. Shi et al., “Predicting drug-target interaction networks based on functional groups and biological features,” PLoS ONE, vol. 5, no. 3, Article ID e9603, 2010.