Abstract

Conotoxins are small disulfide-rich neurotoxic peptides, which can bind to ion channels with very high specificity and modulate their activities. Over the last few decades, conotoxins have been investigated as drug candidates for treating chronic pain, epilepsy, spasticity, and cardiovascular diseases. According to their functions and targets, conotoxins are generally categorized into three types: potassium-channel type, sodium-channel type, and calcium-channel type. With the avalanche of peptide sequences generated in the postgenomic age, it is urgent and challenging to develop an automated method for rapidly and accurately identifying the types of conotoxins based on their sequence information alone. To address this challenge, a new predictor, called iCTX-Type, was developed by incorporating the dipeptide occurrence frequencies of a conotoxin sequence into a 400-D (dimensional) general pseudoamino acid composition, followed by a feature optimization procedure that reduces the sample representation from a 400-D to a 50-D vector. The overall success rate achieved by iCTX-Type via a rigorous cross-validation was over 91%, outperforming its counterpart (RBF network). Besides, iCTX-Type is so far the only predictor in this area with a web server available, and hence is particularly useful for experimental scientists who wish to get their desired results without following the complicated mathematics involved.

1. Introduction

Conotoxins are peptides of about 10 to 30 amino acid residues that cone snails secrete for capturing prey and defending themselves. These toxins can bind to various targets, such as G protein-coupled receptors (GPCRs), nicotinic acetylcholine receptors, and neurotensin receptors. In particular, they display extremely high specificity and affinity for ion channels. Ion channels represent a class of membrane-spanning protein pores that mediate the flux of ions in a variety of cell types. There are over 300 types of ion channels in a living cell [1]. Many crucial functions in life, such as heartbeat, sensory transduction, and central nervous system response, are controlled by cell signaling via various ion channels. Ion channel dysfunction may lead to a number of diseases, such as epilepsy, arrhythmia, and type II diabetes. These kinds of diseases are primarily treated with drugs that modulate the ion channels concerned. Ion channels are also important targets for treating viral diseases (see, e.g., [2–4]). Owing to their importance to human life, ion channels have become the second most frequent targets for drug development, next only to GPCRs [5]. The following three kinds of ion channels are the usual targets of conotoxins: potassium (K) channel (Figure 1), sodium (Na) channel (Figure 2), and calcium (Ca) channel (Figure 3). Based on their functions and targets, conotoxins can be classified into the following three types: (i) K-channel-targeting type; (ii) Na-channel-targeting type; and (iii) Ca-channel-targeting type.

Although conotoxins are lethally venomous because they block the transmission of nerve impulses, they have been widely used to treat chronic pain, epilepsy, spasticity, and cardiovascular diseases. Therefore, conotoxins have been regarded as important pharmacological tools for neuroscience research.

It has been estimated that there are more than 100,000 kinds of conotoxins secreted by over 700 kinds of Conus in the world [8]. However, far fewer conotoxins (about 3,000 peptides) have been experimentally confirmed and reported in the literature and databases. Moreover, public databases hold no more than 300 records about the functions of conotoxins. Hence, developing a computational method to predict the functions of conotoxins has become a challenging task.

In a pioneering work, Mondal et al. [9] proposed a method for predicting conotoxin superfamilies by using the pseudoamino acid composition approach [10, 11]. Subsequently, a series of studies have been reported in predicting conotoxin superfamilies (see, for example, [12–15]). All these methods yielded quite encouraging results, and each of them did play a role in stimulating the development of this area. However, none of these methods can be used to predict the types of conotoxins defined according to their targeting ion-channels. For instance, both delta-conotoxin-like Ac6.1 (UniProt accession number: P0C8V5) [16] and omega-conotoxin-like Ai6.2 [17] (UniProt accession number: P0CB10) belong to the conotoxin O1 superfamily. However, the former targets the voltage-gated sodium channels, while the latter targets the voltage-gated calcium channels.

To deal with this problem, recently, a method was developed [7] to identify conotoxins among the aforementioned three types by using their sequence information alone. However, further work is needed in this regard due to the following reasons. (i) The prediction quality can be further improved. (ii) No web server for the prediction method in [7] was provided, and hence its usage is quite limited, especially for the majority of experimental scientists.

The present study was devoted to developing a new predictor for identifying the conotoxins’ types that addresses the above two aspects.

As elaborated in a comprehensive review [18] and demonstrated in a series of recent publications [19–28], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web server for the predictor that is accessible to the public. In what follows, let us describe how to deal with these procedures one by one.

2. Materials and Methods

2.1. Benchmark Dataset

The sequences of conotoxins and their functions were collected from UniProt [29]. To ensure its quality, the benchmark dataset was constructed strictly according to the following criteria. (i) Included were only those peptides annotated with “conotoxin” and with the keyword of potassium, calcium, or sodium in their functional ontologies. (ii) Included were only those conotoxins with clear functional annotations based on experimental results. In other words, we excluded those annotated with “uncertain,” “predicted,” or “inferred from homology” because such annotations lack confidence. (iii) Excluded were those annotated with “immature” because they are incomplete. (iv) Excluded were also those that contained any invalid amino acid codes, such as “B,” “X,” and “Z.” After going through the above procedures, we obtained 195 conotoxins, of which 37 belonged to the K-channel-targeting type, 86 to the Na-channel-targeting type, and 72 to the Ca-channel-targeting type.

As elaborated in a comprehensive review [18], a benchmark dataset containing many redundant samples with high similarity would lack statistical representativeness. A predictor, if trained and tested by a benchmark dataset with many homologous sequences, might yield misleading results with overestimated accuracy [30]. To remove the homologous sequences from the benchmark dataset, a cutoff threshold of 25% was recommended [31] to exclude those protein/peptide sequences from the benchmark datasets that had ≥25% pairwise sequence identity to any other sample in the same subset. However, in this study we did not use such a stringent criterion because the currently available data did not allow us to do so. Otherwise, the numbers of peptides for some subsets would be too few to have statistical significance. As a compromise, we set the cutoff threshold at 80% and used the CD-HIT software [32] to remove those conotoxin samples that had ≥80% sequence identity to any other in the same subset. After such a screening procedure, we obtained 112 conotoxin samples for the benchmark dataset $\mathbb{S}$, formulated as follows:

$$\mathbb{S} = \mathbb{S}^{\mathrm{K}} \cup \mathbb{S}^{\mathrm{Na}} \cup \mathbb{S}^{\mathrm{Ca}}, \tag{1}$$

where the subset $\mathbb{S}^{\mathrm{K}}$ contains 24 conotoxin samples of K-channel-targeting type, $\mathbb{S}^{\mathrm{Na}}$ contains 43 samples of Na-channel-targeting type, and $\mathbb{S}^{\mathrm{Ca}}$ contains 45 samples of Ca-channel-targeting type, while the symbol $\cup$ represents the union in set theory. The codes of the 112 conotoxins and their sequences are given in Supporting Information S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/2014/286419).

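Likewise, we also constructed an independent dataset $\mathbb{S}_{\mathrm{Ind}}$ as formulated by

$$\mathbb{S}_{\mathrm{Ind}} = \mathbb{S}_{\mathrm{Ind}}^{\mathrm{K}} \cup \mathbb{S}_{\mathrm{Ind}}^{\mathrm{Na}} \cup \mathbb{S}_{\mathrm{Ind}}^{\mathrm{Ca}}, \tag{2}$$

where $\mathbb{S}_{\mathrm{Ind}}^{\mathrm{K}}$ contains 12 K-conotoxins, $\mathbb{S}_{\mathrm{Ind}}^{\mathrm{Na}}$ contains 37 Na-conotoxins, and $\mathbb{S}_{\mathrm{Ind}}^{\mathrm{Ca}}$ contains 21 Ca-conotoxins. None of the samples in the independent dataset occurs in the dataset of (1), and their detailed sequences are given in Supporting Information S2.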

For simplicity, hereafter, let us use “K-conotoxin,” “Na-conotoxin,” and “Ca-conotoxin” to represent K-channel-targeting type conotoxin, Na-channel-targeting type conotoxin, and Ca-channel-targeting type conotoxin, respectively.

2.2. The Dipeptide Mode of Pseudoamino Acid Composition

Given a conotoxin peptide $\mathbf{P}$ with $L$ amino acids, how do we translate it into a mathematical expression for statistical prediction? This is one of the first important problems in developing a sequence-based predictor for identifying the type of a conotoxin. The most straightforward way to formulate the sample of a conotoxin peptide $\mathbf{P}$ with $L$ residues is to use its entire amino acid sequence, as can be formulated by

$$\mathbf{P} = \mathrm{R}_1 \mathrm{R}_2 \mathrm{R}_3 \cdots \mathrm{R}_L, \tag{3}$$

where $\mathrm{R}_1$ represents the 1st residue of the conotoxin peptide, $\mathrm{R}_2$ the 2nd residue of the peptide, and so forth. Subsequently, we can utilize various sequence similarity search based tools, such as BLAST [33], to perform statistical prediction. Although this kind of sequence model is very straightforward and intuitive, unfortunately, it fails to work when a query conotoxin peptide does not have significant similarity to any of the peptide sequences in the training dataset. Thus, investigators turned to using vectors to represent the peptide samples. Another reason for them to do so is that statistical samples in vector format are much easier to handle than those in sequence format by many existing operation engines, such as the correlation angle approach [34], covariance discriminant (CD) [27, 35–37], neural network [38–40], optimization approach [41], support vector machine (SVM) [22, 23, 42, 43], random forest [44, 45], conditional random field [20], nearest neighbor (NN) [46, 47], K-nearest neighbor (KNN) [30], OET-KNN [48–50], fuzzy K-nearest neighbor [25, 51–55], ML-KNN algorithm [56], and SLLE algorithm [36].

The simplest vector used to represent a peptide or protein sample is its amino acid composition (AAC), given as follows:

$$\mathbf{P} = [f_1\ f_2\ \cdots\ f_{20}]^{\mathbf{T}}, \tag{4}$$

where $f_i$ ($i = 1, 2, \ldots, 20$) is the normalized occurrence frequency of the $i$th type of native amino acid in the peptide chain and $\mathbf{T}$ is the transpose operator. The AAC model was used by many in predicting various attributes of proteins (see, e.g., [41, 57–59]). However, as we can see from (4), when using AAC to represent a peptide or protein sample, all its sequence order information would be completely lost, which limits the prediction quality.
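The loss of sequence order information under AAC can be illustrated with a minimal Python sketch. The function name and the two short peptides are hypothetical and chosen only for illustration; they are not taken from the benchmark dataset.

```python
from collections import Counter

# One-letter codes of the 20 native amino acids, in alphabetical order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-D vector of normalized amino acid frequencies."""
    counts = Counter(sequence)
    length = len(sequence)
    return [counts[aa] / length for aa in AMINO_ACIDS]


# Two hypothetical peptides: same residues, different order
peptide_a = "CKGKGA"
peptide_b = "AGKGKC"

# Both map to the same AAC vector, so AAC cannot tell them apart
same_vector = amino_acid_composition(peptide_a) == amino_acid_composition(peptide_b)
```

Here `same_vector` evaluates to `True`, which is the sense in which the AAC representation discards sequence order.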

How can we formulate a peptide or protein sequence with a vector yet still keep considerable sequence order information? As reported in many recent publications, in order to incorporate the sequence order information, the pseudoamino acid composition [10, 11] or Chou’s PseAAC [60] was proposed. Since the concept of PseAAC was proposed in 2001 [10], it has been penetrating into almost all the fields of protein attribute predictions (see, e.g., [61–78]). Recently, the concept of PseAAC was further extended to represent the feature vectors of DNA and nucleotides [19, 21, 23, 27, 79], as well as other biological samples (see, e.g., [80–82]). Because it has been widely and increasingly used, in addition to the web server “PseAAC” [83] built in 2008, recently three types of powerful open access software, called “PseAAC-Builder” [84], “propy” [85], and “PseAAC-General” [86], were established: the former two are for generating various modes of Chou’s special PseAAC, while the 3rd one is for those of Chou’s general PseAAC.

According to a comprehensive review [18], the general PseAAC is formulated by

$$\mathbf{P} = [\psi_1\ \psi_2\ \cdots\ \psi_u\ \cdots\ \psi_\Omega]^{\mathbf{T}}, \tag{5}$$

where the component $\psi_u$ ($u = 1, 2, \ldots, \Omega$) and the dimension $\Omega$ will depend on how to extract the features from the peptide sequences concerned. For the current study, since the conotoxin sequences are not long (about 10–30 residues), we could just consider the sequence order information between the two most contiguous amino acid residues. Thus, the dimension of the vector in (5) is $\Omega = 20 \times 20 = 400$ and each of the components therein is given by

$$\psi_1 = f(\mathrm{AA}),\quad \psi_2 = f(\mathrm{AC}),\quad \psi_3 = f(\mathrm{AD}),\quad \ldots,\quad \psi_{400} = f(\mathrm{YY}), \tag{6}$$

where A, C, D, …, Y are, respectively, the single letter codes of the 20 native amino acids, $f(\mathrm{AA})$ is the occurrence frequency for the dipeptide AA in the conotoxin sequence (see (3)), $f(\mathrm{AC})$ is that for the dipeptide AC, and so forth. The formulation defined by (5)-(6) is actually the dipeptide mode of PseAAC, which can be automatically generated by the PseAAC server [83] for a given peptide or protein sequence.
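As a rough illustration of the dipeptide mode in (5)-(6), the following Python sketch builds the 400-D vector for one sequence. It is not the PseAAC server's implementation. Two details are assumptions made here: overlapping dipeptides are counted, and counts are normalized by the number of dipeptides in the sequence (L − 1), since the text does not spell out the normalization. The function name is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs in the order AA, AC, AD, ..., YY
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-D vector of dipeptide occurrence frequencies.

    Assumption: overlapping dipeptides are counted and the counts are
    divided by the number of dipeptides in the sequence (L - 1).
    """
    total = len(sequence) - 1
    if total < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = sequence[i:i + 2]
        if pair not in counts:
            raise ValueError("invalid residue in dipeptide " + repr(pair))
        counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical 4-residue peptide: dipeptides are AC, CA, AC
vector = dipeptide_composition("ACAC")
```

For the toy input `"ACAC"`, the component for AC is 2/3, the component for CA is 1/3, and the other 398 components are zero.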

2.3. Feature Selection

The original raw features usually contain redundant information and noise that may negatively affect the prediction quality [87]. Using feature selection techniques to optimize the feature set can not only enhance the prediction accuracy but also provide useful insights for in-depth understanding of the action mechanism of conotoxins. According to the feature selection algorithm [87], the F-score function is defined by

$$F(j) = \frac{\sum_{i=1}^{3} \left( \bar{x}_j^{(i)} - \bar{x}_j \right)^2}{\sum_{i=1}^{3} \frac{1}{n_i - 1} \sum_{k=1}^{n_i} \left( x_{k,j}^{(i)} - \bar{x}_j^{(i)} \right)^2}, \tag{7}$$

where $\bar{x}_j^{(i)}$ is the average frequency of the $j$th feature in the $i$th dataset, $\bar{x}_j$ is the average frequency of the $j$th feature in all the datasets concerned, $x_{k,j}^{(i)}$ is the frequency of the $j$th feature of the $k$th sequence in the $i$th dataset, and $n_i$ is the number of peptide samples in the $i$th dataset. The program called “fselect.py” was downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools to calculate the F-score defined in (7).
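The F-score described above (squared deviations of the class means from the overall mean, divided by the summed within-class sample variances) can be sketched in a few lines of Python. This is an independent illustration and not the code of “fselect.py”; the function names and toy values are hypothetical, and the handling of a zero denominator is a choice made here.

```python
def f_score(groups):
    """F-score of a single feature.

    groups: one list of feature values per class (each with >= 2 values).
    """
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    between = 0.0  # numerator: spread of class means around the grand mean
    within = 0.0   # denominator: sum of within-class sample variances
    for group in groups:
        n = len(group)
        mean = sum(group) / n
        between += (mean - grand_mean) ** 2
        within += sum((v - mean) ** 2 for v in group) / (n - 1)
    if within == 0.0:
        # Assumption: a feature that is constant within every class scores
        # 0 if it is also constant across classes, otherwise infinity
        return 0.0 if between == 0.0 else float("inf")
    return between / within


def rank_features(vectors, labels):
    """Return feature indices sorted from highest to lowest F-score."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(vectors[0])):
        groups = [[v[j] for v, y in zip(vectors, labels) if y == c] for c in classes]
        scores.append(f_score(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Toy data: feature 0 separates the two classes, feature 1 does not
toy_vectors = [[0.0, 5.0], [2.0, 5.1], [4.0, 5.0], [6.0, 5.1]]
toy_labels = ["K", "K", "Na", "Na"]
ranking = rank_features(toy_vectors, toy_labels)
```

In the toy data, feature 0 has class means 1 and 5 around a grand mean of 3 and within-class variances of 2 each, giving an F-score of 8/4 = 2, while feature 1 has identical class means and scores 0.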

The larger the F-score of a feature is, the more likely the feature is to have a better discriminative capability [87]. Accordingly, we ranked the 400 dipeptides in (5) according to their F-scores. Subsequently, based on the ranked dipeptides, we performed the incremental feature selection (IFS) strategy to find an optimal subset of features that yielded the highest predictive accuracy. During the IFS procedure, the feature subset started with the one feature with the highest F-score. A new feature subset was composed when the feature with the second highest F-score was added. By adding the features sequentially from the higher to lower ranks, 400 feature sets would be obtained. The $\tau$th feature set can be formulated as

$$F_\tau = \{ d_1, d_2, \ldots, d_\tau \} \quad (1 \le \tau \le 400), \tag{8}$$

where $d_r$ denotes the dipeptide ranked $r$th by F-score.

For each of the 400 feature sets, a prediction model based on the proposed predictive algorithm was constructed and examined with the jackknife cross-validation on the benchmark dataset. By doing so, we obtained an IFS curve in a 2D (dimensional) Cartesian coordinate system with the index $\tau$ as the abscissa (or X-coordinate) and the overall accuracy as the ordinate (or Y-coordinate). The optimal feature set is expressed as

$$F_\Theta = \{ d_1, d_2, \ldots, d_\Theta \}, \tag{9}$$

with which the IFS curve reached its peak (here $d_r$ is the dipeptide ranked $r$th by F-score). In other words, in the 2D coordinate system, when $\tau = \Theta$, the value of the overall accuracy was the maximum. Thus, we used the $\Theta$ features to build the final predictor.
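The IFS loop described above can be sketched generically in Python. The `evaluate` callback stands for whatever routine returns the jackknife overall accuracy of a model trained on a given feature subset; it, the function name, and the toy accuracy values are hypothetical. Breaking ties in favor of the smallest subset is also an assumption made here.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Evaluate nested feature subsets built from a ranked feature list.

    ranked_features: feature indices sorted from best to worst score.
    evaluate: callable taking a list of feature indices and returning
              an accuracy in [0, 1] (e.g., from a jackknife test).
    Returns (best_size, best_accuracy, curve), where curve[t - 1] is the
    accuracy obtained with the top t features.
    """
    curve = []
    for size in range(1, len(ranked_features) + 1):
        subset = ranked_features[:size]
        curve.append(evaluate(subset))
    best_accuracy = max(curve)
    # list.index returns the first maximum, i.e., the smallest subset at the peak
    best_size = curve.index(best_accuracy) + 1
    return best_size, best_accuracy, curve


# Hypothetical accuracies for a 4-feature toy problem, keyed by subset size
toy_accuracy = {1: 0.60, 2: 0.80, 3: 0.90, 4: 0.85}
best_size, best_accuracy, curve = incremental_feature_selection(
    [2, 0, 3, 1], lambda subset: toy_accuracy[len(subset)]
)
```

With these toy values the curve peaks at three features, so `best_size` is 3 and `best_accuracy` is 0.90. Plotting `curve` against the subset size gives the kind of IFS curve the text describes.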

2.4. Support Vector Machine (SVM)

The classification algorithm used in this work was the support vector machine (SVM). The SVM has been widely used in the realm of bioinformatics (see, e.g., [19, 22, 23, 88–90]). Its basic principle is to transform the input vector into a high-dimensional Hilbert space and seek a separating hyperplane with the maximal margin in this space by using the decision function

$$f(\mathbf{X}) = \operatorname{sgn}\left( \sum_{i=1}^{N} y_i \alpha_i K(\mathbf{X}, \mathbf{X}_i) + b \right), \tag{10}$$

where $\mathbf{X}_i$ is the $i$th training vector, $y_i$ represents the type of the $i$th training vector, $K(\mathbf{X}, \mathbf{X}_i)$ is a kernel function which defines an inner product in a high-dimensional feature space, and $\alpha_i$ and $b$ are the coefficients and offset fitted during training. Because of its effectiveness and speed in nonlinear classification, the radial basis kernel function (RBF) was used in the current work. The original SVM was designed for two-class problems. For multiclass problems, several strategies such as one-versus-rest (OVR), one-versus-one (OVO), and DAGSVM have been applied to extend the traditional SVM. In the present study, we used the OVO strategy for multiclass prediction. The SVM software used (LibSVM) was downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm. A grid search method was used to optimize the regularization parameter $C$ and the kernel parameter $\gamma$ via the jackknife cross-validation. The search spaces for $C$ and $\gamma$ are and with steps of and 2, respectively. For more details about SVM, see a monograph [91].
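To make the kernel decision function and the OVO strategy more concrete, here is a minimal pure-Python sketch. It is not LibSVM and performs no training: the support vectors, coefficients, bias, and the three stub pairwise classifiers are hypothetical values supplied by hand. The RBF form exp(−γ‖x − z‖²) is the one LibSVM documents for its radial basis kernel.

```python
import math
from collections import Counter


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def binary_decision(x, support_vectors, labels, alphas, bias, gamma):
    """Sign of the kernel expansion for a two-class SVM (+1 or -1)."""
    total = bias
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        total += y * alpha * rbf_kernel(x, sv, gamma)
    return 1 if total >= 0 else -1


def one_versus_one_vote(x, pairwise_models):
    """Majority vote over pairwise classifiers.

    pairwise_models maps (positive_class, negative_class) to a callable
    that returns +1 or -1 for the query x.
    """
    votes = Counter()
    for (positive, negative), decide in pairwise_models.items():
        votes[positive if decide(x) > 0 else negative] += 1
    return votes.most_common(1)[0][0]


# Hypothetical hand-made binary model with one support vector per class
near_positive = binary_decision([0.1], [[0.0], [2.0]], [1, -1], [1.0, 1.0], 0.0, 1.0)
near_negative = binary_decision([1.9], [[0.0], [2.0]], [1, -1], [1.0, 1.0], 0.0, 1.0)

# Hypothetical stub classifiers for the three conotoxin type pairs
stub_models = {
    ("K", "Na"): lambda x: 1,
    ("K", "Ca"): lambda x: 1,
    ("Na", "Ca"): lambda x: -1,
}
predicted_type = one_versus_one_vote([0.0], stub_models)
```

With three classes, OVO needs three pairwise classifiers; in the stub example K collects two votes and Ca one, so the predicted type is K.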

3. Results and Discussion

3.1. Test Method and Criteria

In statistical prediction, the independent dataset test, subsampling or K-fold crossover test, and jackknife test are the three cross-validation methods often used to check a predictor for its accuracy [92]. However, among the three test methods, the jackknife test is deemed the least arbitrary because it always yields a unique result for a given benchmark dataset [18]. Accordingly, the jackknife test has been increasingly used and widely recognized by investigators to examine the quality of various predictors (see, e.g., [19, 21, 73, 75, 93–95]). Therefore, in this study we also adopted the jackknife test.
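The jackknife (leave-one-out) test can be sketched as follows in Python. Each sample is held out once, a model is trained on the remaining samples, and the held-out sample is predicted. The `train` and `predict` callbacks are placeholders; the nearest-neighbor stand-in and the toy data below are hypothetical and are used only to keep the sketch self-contained, not because the paper used them.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Leave-one-out accuracy: every sample is tested exactly once."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Stand-in classifier (1-nearest neighbor), NOT the SVM used in the paper
def train_nearest_neighbor(train_x, train_y):
    return list(zip(train_x, train_y))


def predict_nearest_neighbor(model, x):
    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], x))
    return min(model, key=squared_distance)[1]


# Hypothetical one-feature toy data with two well-separated classes
toy_samples = [[0.0], [0.1], [1.0], [1.1]]
toy_labels = ["K", "K", "Na", "Na"]
accuracy = jackknife_accuracy(
    toy_samples, toy_labels, train_nearest_neighbor, predict_nearest_neighbor
)
```

Because the partition into training and test samples is fully determined by the dataset, repeated runs give the same result, which is the uniqueness property the text refers to.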

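In addition to an objective test method, we also need a set of metrics to reasonably measure the test outcome. Here, let us use the criterion proposed in [96, 97] to develop a set of more intuitive and easier-to-understand metrics; that is, the correct rates $\Lambda_{\mathrm{K}}$ in predicting K-conotoxins, $\Lambda_{\mathrm{Na}}$ in predicting Na-conotoxins, and $\Lambda_{\mathrm{Ca}}$ in predicting Ca-conotoxins are defined by

$$\Lambda_{\mathrm{K}} = \frac{N_{\mathrm{K}} - N_{\mathrm{K}}^{\mathrm{Na}} - N_{\mathrm{K}}^{\mathrm{Ca}}}{N_{\mathrm{K}}}, \qquad \Lambda_{\mathrm{Na}} = \frac{N_{\mathrm{Na}} - N_{\mathrm{Na}}^{\mathrm{K}} - N_{\mathrm{Na}}^{\mathrm{Ca}}}{N_{\mathrm{Na}}}, \qquad \Lambda_{\mathrm{Ca}} = \frac{N_{\mathrm{Ca}} - N_{\mathrm{Ca}}^{\mathrm{Na}} - N_{\mathrm{Ca}}^{\mathrm{K}}}{N_{\mathrm{Ca}}}, \tag{11}$$

where $N_{\mathrm{K}}$ is the total number of the K-conotoxins investigated, while $N_{\mathrm{K}}^{\mathrm{Na}}$ is the number of the K-conotoxins incorrectly predicted as Na-conotoxins and $N_{\mathrm{K}}^{\mathrm{Ca}}$ is the number of the K-conotoxins incorrectly predicted as Ca-conotoxins; $N_{\mathrm{Na}}$ is the total number of the Na-conotoxins investigated, while $N_{\mathrm{Na}}^{\mathrm{K}}$ is the number of the Na-conotoxins incorrectly predicted as K-conotoxins and $N_{\mathrm{Na}}^{\mathrm{Ca}}$ is the number of the Na-conotoxins incorrectly predicted as Ca-conotoxins; and $N_{\mathrm{Ca}}$ is the total number of the Ca-conotoxins investigated, while $N_{\mathrm{Ca}}^{\mathrm{Na}}$ is the number of the Ca-conotoxins incorrectly predicted as Na-conotoxins and $N_{\mathrm{Ca}}^{\mathrm{K}}$ is the number of the Ca-conotoxins incorrectly predicted as K-conotoxins. From (11), it follows that

$$\mathrm{OA} = \frac{\Lambda_{\mathrm{K}} N_{\mathrm{K}} + \Lambda_{\mathrm{Na}} N_{\mathrm{Na}} + \Lambda_{\mathrm{Ca}} N_{\mathrm{Ca}}}{N_{\mathrm{K}} + N_{\mathrm{Na}} + N_{\mathrm{Ca}}}, \qquad \mathrm{AA} = \frac{\Lambda_{\mathrm{K}} + \Lambda_{\mathrm{Na}} + \Lambda_{\mathrm{Ca}}}{3}, \tag{12}$$

where OA stands for the overall accuracy and AA for the average accuracy.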
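The per-type success rates, the overall accuracy (OA), and the average accuracy (AA) can all be computed from a table of true versus predicted types, as in the Python sketch below. The function name and the confusion table are hypothetical. Only the class sizes 12, 37, and 21 come from the independent dataset described in Section 2.1; the correct counts were chosen so that the rates round to 91.7%, 91.9%, and 90.5%, and the way the errors are split between the wrong types is invented for illustration.

```python
def success_metrics(confusion):
    """Compute per-type success rates, OA, and AA.

    confusion[true_type][predicted_type] holds sample counts.
    """
    rates = {}
    total = 0
    correct = 0
    for true_type, row in confusion.items():
        n = sum(row.values())                  # samples of this type investigated
        rates[true_type] = row[true_type] / n  # fraction predicted correctly
        total += n
        correct += row[true_type]
    overall = correct / total                  # OA: pooled over all samples
    average = sum(rates.values()) / len(rates) # AA: unweighted mean of the rates
    return rates, overall, average


# Hypothetical confusion table (off-diagonal split is illustrative only)
toy_confusion = {
    "K":  {"K": 11, "Na": 1,  "Ca": 0},
    "Na": {"K": 1,  "Na": 34, "Ca": 2},
    "Ca": {"K": 0,  "Na": 2,  "Ca": 19},
}
rates, overall, average = success_metrics(toy_confusion)
```

OA weights each type by its number of samples, whereas AA treats the three types equally, so the two differ whenever the class sizes are unbalanced, as they are here.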

3.2. The Optimal Features

As mentioned above, a sample vector should contain neither too few nor too many features. This is because the former would limit the prediction quality due to lack of information, while the latter would generate a lot of noise due to redundancy. Therefore, we should find a set of optimal features, for which there is minimal redundancy among themselves but maximal relevancy to the target to be predicted. In the present study, such an optimal feature set is none other than that of (9).

Shown in Figure 4 is the IFS curve for the value of OA against the number of the counted features, as described in Section 2.3. As can be seen from there, the value of OA reached its peak of 91.1% when the top-ranked 50 dipeptides (Table 1) were taken into account.

The predictor thus obtained via the aforementioned procedures is called “iCTX-Type,” where “i” stands for “identify” and “CTX” for “conotoxin.”

A comparison of the current predictor iCTX-Type with the one in [7] (to the best of our knowledge, the only other existing predictor in this area) is given in Table 2, from which we can see the following. (i) For four of the five metrics defined in (11)-(12), iCTX-Type yielded higher scores than the method in [7]. In particular, iCTX-Type achieved higher overall accuracy (OA) and average accuracy (AA). (ii) Compared with the method of [7], which uses 70 features, the present method uses only 50 features (Table 1), indicating that iCTX-Type is more efficient in excluding redundancy and noise as well as in capturing the core features.

To further verify the performance of the current predictor, iCTX-Type was also used to identify the samples in the independent dataset (see Supporting Information S2), and the success rates (see (11)) thus obtained were 91.7%, 91.9%, and 90.5% for K-, Na-, and Ca-conotoxins, respectively. These results are fully consistent with those obtained by the jackknife test as given in Table 2, further indicating that the new predictor iCTX-Type is quite promising and holds a high potential to become a useful tool for in-depth study of ion channel-targeted conotoxins.

To enhance the value of its practical applications [98], a web server for the new iCTX-Type predictor was established as described below.

3.3. Web-Server Guide

For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided below on how to use the web server to get the desired results without the need to follow the mathematical equations, which were presented in this paper only for completeness in developing the predictor.

Step 1. Open the web server at http://lin.uestc.edu.cn/server/iCTX-Type and you will see the top page of iCTX-Type on your computer screen, as shown in Figure 5. Click on the Read Me button to see a brief introduction about the predictor and the caveat when using it.

Step 2. Either type or copy/paste the query peptide sequences into the input box at the center of Figure 5. The input sequence should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol “>” in the first column, followed by lines of sequence data. The words right after the “>” symbol in the single initial line are optional and only used for the purpose of identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. The sequence ends if another line starting with a “>” appears; this indicates the start of another sample sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box.

Step 3. Click on the Submit button to see the predicted result. For instance, when using the three peptide sequences as an input and clicking the Submit button, you will see the following shown on the screen of your computer: the outcome for the 1st query sample is “Ca-conotoxin”; the outcome for the 2nd query sample is “K-conotoxin”; the outcome for the 3rd query sample is “Na-conotoxin.” All these results are fully consistent with the experimental observations. It takes only a few seconds for the above computation before the predicted result appears on your computer screen; the more query sequences there are, the longer it usually takes.

Step 4. Click on the Data button to download the benchmark datasets used to train and test the iCTX-Type predictor.

Step 5. Click on the Citation button to find the relevant papers that document the detailed development and algorithm of iCTX-Type.

Caveats. The input query sequences must be formed by the single-letter codes of the 20 native amino acids; any other characters such as “B,” “X,” “U,” and “Z” are invalid and should not be part of the peptide sequence.
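Before submitting sequences to the web server, it can help to check the input locally. The Python sketch below parses FASTA-formatted text as described in Step 2 and reports any characters outside the 20 native amino acid codes mentioned in the caveats. It is an independent illustration, not the server's own input handling; the function names and the two example records are hypothetical.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 native amino acid codes


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may span several lines; text before the
            # first ">" line is ignored
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_residues(sequence):
    """Return the sorted list of characters that are not native residues."""
    return sorted(set(sequence) - VALID_RESIDUES)


# Hypothetical input: the second record contains invalid codes X and B
example_text = ">query1 demo peptide\nCKGK\nGA\n>query2\nACXB\n"
records = parse_fasta(example_text)
problems = {header: invalid_residues(seq) for header, seq in records}
```

For the example input, `problems` maps `"query1 demo peptide"` to an empty list and `"query2"` to `["B", "X"]`, so only the first record would be acceptable as a query.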

4. Conclusion

It is anticipated that iCTX-Type may become a useful high throughput tool for both basic research and drug development, particularly for in-depth investigation into the mechanisms of ion-channels and developing new drugs to treat chronic pain, epilepsy, spasticity, and cardiovascular diseases, among others.

It is instructive to point out that since the binding of conotoxins to ion channels is highly selective and specific, the information obtained by iCTX-Type in identifying the types of conotoxins may also be very useful for designing ion channel inhibitors according to Chou’s distorted key theory as elaborated in [99] and briefed in a Wikipedia article at http://en.wikipedia.org/wiki/Chou’s_distorted_key_theory_for_peptide_drugs.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors wish to thank the anonymous reviewers for their constructive comments, which were very helpful for strengthening the presentation of this study. This work was supported by the National Nature Scientific Foundation of China (nos. 61202256, 61301260, and 61100092), the Nature Scientific Foundation of Hebei Province (no. C2013209105), and the Fundamental Research Funds for the Central Universities (nos. ZYGX2012J113 and ZYGX2013J102).

Supplementary Materials

Supporting Information S1: The benchmark dataset 𝕊 contains 112 conotoxins, of which 24 belong to K-channel-targeting type, 43 to Na-channel-targeting type, and 45 to Ca-channel-targeting type.

Supporting Information S2: The independent dataset 𝕊Ind contains 70 conotoxins, of which 12 are of K-channel-targeting type, 37 of Na-channel-targeting type, and 21 of Ca-channel-targeting type. None of the samples listed here occurs in benchmark dataset 𝕊.
