Abstract

The conotoxin proteins are disulfide-rich small peptides. Predicting the types of ion channel-targeted conotoxins has great value in the treatment of chronic diseases, epilepsy, and cardiovascular diseases. To address the information redundancy of current methods, a new model is presented to predict the types of ion channel-targeted conotoxins based on AVC (Analysis of Variance and Correlation) and SVM (Support Vector Machine). First, the $F$ value is used to measure the significance of each feature for the result, and attributes with smaller $F$ values are filtered out by rough selection. Second, the degree of redundancy is calculated with the Pearson Correlation Coefficient, and a threshold is set to filter out attributes with weak independence, which yields the refined feature set. Finally, SVM is used to predict the types of ion channel-targeted conotoxins. The experimental results show that the proposed AVC-SVM model reaches an overall accuracy of 91.98% and an average accuracy of 92.17% with a total of 68 parameters. The proposed model provides highly useful information for further experimental research. The prediction model will be freely accessible at our web server.

1. Introduction

Conotoxin proteins have many merits, such as low relative molecular mass, stable structure, remarkable activity, high selectivity, and ease of synthesis [1]. In addition, conotoxins have a wide range of applications in disease treatment, including chronic pain, movement disorders, cramps, cancer, and stroke [2]. According to the targets they act on in the organism, conotoxins can be divided into three categories [3]: (1) those acting on voltage-gated ion channels, (2) those acting on ligand-gated ion channels, and (3) those acting on other receptors. Further, the voltage-gated ion channels, also known as voltage-sensitive channels, include potassium ion channels, calcium ion channels, and sodium ion channels.

Different machine learning algorithms perform differently depending on the prediction target. In 2014, Bakhtiarizadeh et al. used a neural network and an SVM classifier to predict lipid binding proteins (LBPs) [4]; the experiments showed that SVM was more successful than the neural network at discriminating between LBPs and non-LBPs. In 2016, Jamali et al. predicted potential druggable proteins by comparing six machine learning algorithms; the experiments showed that the neural network was the best classifier for that task [5]. In this paper, we compare the performance of several machine learning algorithms in the prediction of ion channel types of conotoxins.

There are studies on the prediction of the superfamily and family of conotoxins based on protein sequence. In 2006, Mondal et al. built an SVM model to predict conotoxin superfamilies based on PseAAC (pseudo amino acid composition) with an overall accuracy of 88.1% [6]. In 2007, Lin and Li proposed an IDQD model based on dipeptide combinations to predict the superfamily and family of conotoxins with accuracies of 87.7% and 72%, respectively [2]. However, there is little research on the prediction of ion channel types of conotoxins. In 2011, a feature selection approach based on ANOVA was used to predict the types of ion channel [7]. In 2013, Yuan et al. used an RBF model based on the Binomial Distribution feature selection method to predict three ion channel types of conotoxins with an overall accuracy of 89.3% and a total of 70 parameters [8]. However, these feature extraction methods are wrapper methods, which not only depend on the performance of the classifier but are also time consuming.

In view of the above problems in the prediction of ion channel types of conotoxins, a model named AVC-SVM, based on AVC and SVM, is proposed in this paper. First, the $F$ value is used to measure the significance of each feature for the results, and rough selection is carried out to delete the attributes that have less influence on the classification results. Second, the Pearson Correlation Coefficient [9, 10] is introduced to measure the redundancy among the attributes, and a threshold is set to filter out the features whose correlation is too strong. Finally, SVM is used as a classifier to predict the ion channel types of conotoxins, and the prediction results are used to calculate the sensitivity, average accuracy, and overall accuracy. Results of 5-fold cross-validation show that the AVC-SVM model performs better when accuracy, the total number of features, and running time are considered as a whole.

2. Preprocessing of Data Sets

The data sets used in this experiment were derived from the Universal Protein Resource (UniProt). In order to obtain a reliable benchmark database, the following steps were performed according to the literature [8]:
(1) Protein sequences must be annotated and evaluated manually.
(2) Protein sequences which contain ambiguous amino acid residues (such as X, B, and Z) should be excluded.
(3) Amino acid sequences belonging to other protein fragments should be excluded.
(4) Homologous proteins should be excluded.

We used 112 protein sequences from [8] as the basic data set, which includes 24 potassium ion channel-targeted conotoxins, 43 sodium ion channel-targeted conotoxins, and 45 calcium ion channel-targeted conotoxins. Before prediction, the protein sequences must be expressed as feature vectors of the same dimension [11]. However, the information contained in these vectors tends to be redundant. In the prediction of the ion channel types, feature selection directly affects the performance of the classifier [12]. Consequently, feature extraction is an important step.

3. Feature Extraction

The prediction of ion channel types of the conotoxins requires that the protein sequences be represented by feature vectors of the same dimension. However, general representation methods still leave redundancy in the information, which affects not only the speed of calculation but also the results of classification. Therefore, we need to choose features that are both independent and discriminative. At present, many feature selection techniques are used to optimize the feature sets, such as ReliefF [13], ReCorre [14], Binomial Distribution [8], and ANOVA [11]. However, few feature selection algorithms have both good prediction accuracy and short running time. In this paper, a novel feature extraction algorithm named AVC is designed to reduce the redundancy of attributes and improve the accuracy and speed of prediction.

3.1. Features Representation of Protein Sequences

Both amino acid combinations and dipeptide combinations are often used as parameters for feature selection. The dipeptide combination reflects not only the information of amino acid residues but also the sequence-order information of the amino acids [7]. Feature parameters based on dipeptide combinations reflect the information in a protein sequence more comprehensively [2], so we selected dipeptide combinations as parameters to represent the features of protein sequences. The total number of dipeptides is 400; therefore, there are 400 features. The protein sequence is defined as follows:

$$P = [f_1, f_2, \ldots, f_{400}], \tag{1}$$

where $f_i$ is the frequency of occurrence of the $i$th dipeptide combination in the protein sequence $P$. The calculation method is shown as follows:

$$f_i = \operatorname{count}(x_i). \tag{2}$$

In (2), $x_i$ is the $i$th dipeptide in the protein sequence, and $\operatorname{count}(x_i)$ is the number of times $x_i$ occurs in $P$.

Here, we take the protein sequence APELVVTATTTCCGYDPMTICPPCMCTHSCPPKRK as an example; the conversion process is shown in Figure 1.

According to the alphabetical order of the 20 amino acid residues, we arranged the 400 dipeptides. When $i = 1$, $x_1 = \mathrm{AA}$, and $f_1$ counts the frequency of occurrence of the dipeptide AA in the protein sequence sample $P$. Similarly, the frequencies of occurrence of all 400 dipeptides are obtained from the protein sequence sample. Finally, the feature vector of each protein sequence is determined.
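As a minimal sketch of the representation described in this subsection, the following Python function maps a sequence to its 400 dipeptide frequencies. It assumes raw counts of overlapping dipeptides and the alphabetical ordering of the one-letter amino acid codes; the names `dipeptide_counts`, `DIPEPTIDES`, and `DIPEPTIDE_INDEX` are illustrative and are not taken from the original implementation.

```python
from itertools import product

# The 20 standard amino acids in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 400 dipeptides, ordered AA, AC, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_counts(sequence):
    """Return a 400-element list of overlapping dipeptide counts.

    Assumption: raw counts are used; no length normalisation is applied.
    """
    counts = [0] * len(DIPEPTIDES)
    for left, right in zip(sequence, sequence[1:]):
        index = DIPEPTIDE_INDEX.get(left + right)
        if index is not None:  # skip pairs containing non-standard residues
            counts[index] += 1
    return counts


# Example sequence taken from the text above.
example = "APELVVTATTTCCGYDPMTICPPCMCTHSCPPKRK"
features = dipeptide_counts(example)
```

For the example sequence of length 35, the 34 overlapping dipeptides are distributed over the 400 positions, so most entries of `features` are zero.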

3.2. AVC

The process of the AVC method is described as follows. First, variance-based analysis is used to calculate, for each attribute, the ratio of the variance between groups to the variance within groups [15]. The size of the $F$ value is used to measure the recognition capability of the attributes [16]. The larger the $F$ value is, the stronger the recognition capability of the attribute is [17]. The features which have less impact on the results of classification are then deleted. Second, we introduce the Pearson Correlation Coefficient [9, 10] to measure the redundancy of attributes. A threshold is set to filter out the features whose correlation is too strong. The $F$ value of the $i$th dipeptide is calculated as follows:

$$F(i) = \frac{S_B^2(i)}{S_W^2(i)}, \tag{3}$$

where $S_B^2(i)$ represents the variance between groups and $S_W^2(i)$ represents the variance within groups [18]. The calculation methods are shown in (4) and (5), respectively [19]:

$$S_B^2(i) = \frac{SS_B(i)}{K - 1}, \tag{4}$$

$$S_W^2(i) = \frac{SS_W(i)}{N - K}, \tag{5}$$

where $K$ is the total number of classes and $N$ is the total number of samples. Here, the value of $K$ is 3 and the value of $N$ is 112. $SS_B(i)$ is the sum of squares between the groups, and $SS_W(i)$ is the sum of squares within the groups [20]. The calculation methods are shown in (6) and (7), respectively:

$$SS_B(i) = \sum_{j=1}^{K} m_j \left( \bar{f}_j(i) - \bar{f}(i) \right)^2, \tag{6}$$

$$SS_W(i) = \sum_{j=1}^{K} \sum_{s=1}^{m_j} \left( f_{sj}(i) - \bar{f}_j(i) \right)^2, \tag{7}$$

where $m_j$ denotes the total number of samples in the $j$th group (here $m_1 = 24$, $m_2 = 43$, and $m_3 = 45$), $f_{sj}(i)$ represents the frequency of the $i$th dipeptide of the $s$th sample in the $j$th group, $\bar{f}_j(i)$ is the mean of $f_{sj}(i)$ over the samples of the $j$th group, and $\bar{f}(i)$ is the mean over all $N$ samples. A threshold is set on the $F$ value. If $F(i)$ is below the threshold, the $i$th dipeptide is removed from all samples. The rough selection of attributes is then complete: the attributes that are not important to the classification result are deleted, and a new feature matrix is obtained.
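The rough selection step can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: it assumes the $F$ value is the ratio of the between-group mean square (sum of squares divided by $K - 1$) to the within-group mean square (sum of squares divided by $N - K$), and the function names, the toy data, and the threshold value of 1.0 are hypothetical.

```python
def anova_f_values(samples, labels):
    """Return one F value per feature column.

    samples: list of equal-length feature vectors.
    labels:  class label of each sample.
    """
    groups = {}
    for row, label in zip(samples, labels):
        groups.setdefault(label, []).append(row)

    n_samples = len(samples)
    n_classes = len(groups)
    f_values = []
    for i in range(len(samples[0])):
        overall_mean = sum(row[i] for row in samples) / n_samples
        ss_between = 0.0
        ss_within = 0.0
        for rows in groups.values():
            values = [row[i] for row in rows]
            group_mean = sum(values) / len(values)
            ss_between += len(values) * (group_mean - overall_mean) ** 2
            ss_within += sum((v - group_mean) ** 2 for v in values)
        ms_between = ss_between / (n_classes - 1)
        ms_within = ss_within / (n_samples - n_classes)
        if ms_within > 0:
            f_values.append(ms_between / ms_within)
        else:
            # Assumption: with no within-group variation, a feature is either
            # perfectly discriminative (infinite F) or constant (F = 0).
            f_values.append(float("inf") if ms_between > 0 else 0.0)
    return f_values


def rough_selection(f_values, threshold):
    """Indices of the features whose F value is not below the threshold."""
    return [i for i, f in enumerate(f_values) if f >= threshold]


# Toy data: feature 0 separates the three groups, feature 1 is constant.
toy_samples = [[1, 0], [2, 0], [3, 0], [2, 0], [3, 0], [4, 0], [5, 0], [6, 0], [7, 0]]
toy_labels = ["K"] * 3 + ["Na"] * 3 + ["Ca"] * 3
toy_f = anova_f_values(toy_samples, toy_labels)
kept = rough_selection(toy_f, threshold=1.0)  # hypothetical threshold
```

In the toy data the first feature has a between-group sum of squares of 26 and a within-group sum of squares of 6, so its $F$ value is $(26/2)/(6/6) = 13$, while the constant second feature is dropped.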

Variance-based analysis preserves the attributes which have strong recognition ability. However, redundancy may exist among these attributes, which is not conducive to the results of prediction. To solve this problem, the Pearson Correlation Coefficient is used to measure the correlation between attributes [9]. Its value is between −1 and 1 [10]. The correlation coefficient between dipeptides is calculated as follows:

$$r(i, j) = \frac{\frac{1}{N} \sum_{s=1}^{N} \left( f_s(i) - \bar{f}(i) \right) \left( f_s(j) - \bar{f}(j) \right)}{\sigma_i \, \sigma_j}, \tag{8}$$

where $f_s(i)$ represents the occurrence frequency of the $i$th dipeptide in the $s$th sample in the whole dataset. Similarly, $f_s(j)$ represents the occurrence frequency of the $j$th dipeptide in the $s$th sample in the whole dataset. $\bar{f}(i)$ and $\bar{f}(j)$ are the averages of the occurrence frequency of the $i$th dipeptide and the $j$th dipeptide in the whole dataset, respectively. $\sigma_i$ and $\sigma_j$ are the standard deviations of $f(i)$ and $f(j)$, respectively. The calculation method of $\sigma_i$ is shown as follows:

$$\sigma_i = \sqrt{\frac{1}{N} \sum_{s=1}^{N} \left( f_s(i) - \bar{f}(i) \right)^2}. \tag{9}$$

The obtained $r(i, j)$ is compared with a preset threshold. If $r(i, j)$ exceeds the threshold, the correlation between the $i$th attribute and the $j$th attribute is larger than the expected value, which means that there is much redundancy between them. We then compare the $F$ value of the $i$th attribute with the $F$ value of the $j$th attribute, and the attribute with the smaller $F$ value is deleted. After all attributes have been traversed, we obtain a collection of attributes which are both strongly discriminative and independent, and a new feature matrix is obtained.
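The refinement step can be sketched as below. The sketch assumes a population (divide by $N$) standard deviation in the Pearson coefficient, compares the signed coefficient with the threshold as the text describes, and treats a constant column as uncorrelated with everything; the function names, the toy data, and the threshold of 0.9 are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / (std_x * std_y)


def remove_redundant(samples, f_values, threshold):
    """Drop, from each strongly correlated pair, the feature with the smaller F value."""
    n_features = len(f_values)
    columns = [[row[i] for row in samples] for i in range(n_features)]
    kept = set(range(n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if i not in kept:
                break  # feature i was already deleted
            if j not in kept:
                continue
            if pearson(columns[i], columns[j]) > threshold:
                kept.discard(i if f_values[i] < f_values[j] else j)
    return sorted(kept)


# Toy data: feature 1 is exactly twice feature 0, feature 2 is unrelated.
toy_rows = [[1, 2, 1], [2, 4, 0], [3, 6, 1], [4, 8, 0]]
toy_f_values = [5.0, 3.0, 4.0]  # made-up F values for illustration
selected = remove_redundant(toy_rows, toy_f_values, threshold=0.9)
```

Features 0 and 1 are perfectly correlated, so the one with the smaller $F$ value (feature 1) is removed and features 0 and 2 remain.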

4. Prediction Principle of AVC-SVM

After feature selection, we need to select an appropriate algorithm to predict the types of ion channels of conotoxins. SVM is a machine learning algorithm based on statistical analysis [21]. It has great advantages in solving nonlinear, small sample and high-dimensional pattern recognition based on the principle of minimizing structural risk [22]. In addition, SVM algorithm also has many applications in bioinformatics [4, 21, 22]. In this paper, the SVM algorithm was used to predict ion channel types of the conotoxins.

The samples are divided into three categories in this paper. Therefore the method of SVM multiclassification is used to predict the ion channel types of conotoxins. There are many methods of SVM multiclassification such as OVR (one-versus-rest), OVO (one-versus-one), and DAG (Directed Acyclic Graph) [23]. We select OVO method to construct a multiclass classifier to predict the ion channel types of conotoxins. The predictive process using AVC-SVM model is shown in Figure 2.

The principle of the OVO [24] multiclassification method is that $K(K-1)/2$ classifiers are constructed for $K$ classes, with one classifier trained for each pair of classes. When classifying an unknown sample, each classifier determines its class and “votes” for the corresponding category. Finally, the category with the largest number of votes is taken as the category of the unknown sample.
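The voting scheme can be illustrated with a short Python sketch. The pairwise decision rules below are hypothetical stand-ins for trained binary SVMs (each simply picks the nearer of two made-up one-dimensional class centres); only the voting logic reflects the OVO procedure described above.

```python
from collections import Counter
from itertools import combinations


def make_pairwise_rule(class_a, class_b, centers):
    """Hypothetical stand-in for a trained binary SVM on classes a and b."""
    def decide(x):
        dist_a = abs(x - centers[class_a])
        dist_b = abs(x - centers[class_b])
        return class_a if dist_a <= dist_b else class_b
    return decide


def ovo_predict(x, classes, pairwise):
    """Let every pairwise classifier vote and return the most voted class."""
    votes = Counter(pairwise[pair](x) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]


classes = ["K", "Na", "Ca"]
centers = {"K": 0.0, "Na": 5.0, "Ca": 10.0}  # made-up values for illustration
pairwise = {
    (a, b): make_pairwise_rule(a, b, centers)
    for a, b in combinations(classes, 2)
}
prediction = ovo_predict(4.0, classes, pairwise)
```

With three classes there are $3 \times 2 / 2 = 3$ pairwise classifiers. For the input 4.0, the (K, Na) and (Na, Ca) rules vote for Na and the (K, Ca) rule votes for K, so Na wins with two votes.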

4.1. Evaluation Criteria

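In studies on the prediction of protein function, the widely used evaluation criteria are sensitivity (Sn), overall accuracy (OA), and average accuracy (AA) [25]. They are defined as follows:

$$\mathrm{Sn}(i) = \frac{TP_i}{TP_i + FN_i}, \tag{10}$$

$$\mathrm{OA} = \frac{1}{N} \sum_{i=1}^{K} TP_i, \tag{11}$$

$$\mathrm{AA} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{Sn}(i), \tag{12}$$

where $TP_i$ and $FN_i$ denote the true positives and false negatives for the $i$th class, respectively, and $N$ and $K$ denote the total number of samples and the total number of classes, respectively.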
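The three criteria can be computed from true and predicted labels as in the following sketch. It assumes the standard definition of sensitivity, $TP_i / (TP_i + FN_i)$, with OA as the fraction of correctly classified samples and AA as the unweighted mean of the per-class sensitivities; the function name and the toy labels are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return (per-class sensitivity, overall accuracy, average accuracy)."""
    classes = sorted(set(y_true))
    sensitivity = {}
    total_tp = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        sensitivity[c] = tp / (tp + fn)
        total_tp += tp
    overall_accuracy = total_tp / len(y_true)
    average_accuracy = sum(sensitivity.values()) / len(classes)
    return sensitivity, overall_accuracy, average_accuracy


# Toy labels: one K sample is misclassified as Na.
y_true = ["K", "K", "K", "K", "Na", "Na", "Ca", "Ca"]
y_pred = ["K", "K", "K", "Na", "Na", "Na", "Ca", "Ca"]
sn, oa, aa = evaluate(y_true, y_pred)
```

In the toy case the sensitivity is 0.75 for K and 1.0 for Na and Ca, so OA is 7/8 while AA is 2.75/3; the two differ because AA weights every class equally regardless of its size.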

4.2. Steps for Prediction

There are five steps to predict the types of ion channels.

Step 1. Formulae (1) and (2) are used to preprocess the data sets and obtain the feature representation of the amino acid sequences.

Step 2. The $F$ value calculated by (3) is used to measure the recognition ability of all attributes. Set a threshold on the $F$ value. If $F(i)$ is below the threshold, the $i$th attribute is deleted from all samples, and a new feature vector is obtained for each sample.

Step 3. Formulae (8) and (9) are used to calculate the correlation coefficient $r(i, j)$ between the $i$th attribute and the $j$th attribute in the feature matrix. Set a threshold on the correlation coefficient; if $r(i, j)$ exceeds it, the $F$ value of the $i$th attribute is compared with the $F$ value of the $j$th attribute, and the attribute with the smaller $F$ value is deleted.

Step 4. The 112 samples are divided into 5 subsets randomly. One of the five subsets takes turns as test set; the rest are training set. SVM multiclass method was used to train and predict types of ion channel.

Step 5. Formulae (10)–(12) are used to evaluate sensitivity, the overall accuracy, and average accuracy of the model.
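The random 5-fold split in Step 4 can be sketched as follows. This is a minimal illustration that assumes a plain (non-stratified) random partition, as the step describes; the function name and the `seed` parameter are hypothetical and only serve to make the split reproducible.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        yield train, test_fold


splits = list(k_fold_indices(112, k=5, seed=0))
```

For 112 samples this gives two test folds of 23 samples and three of 22; each sample appears in exactly one test fold, and the classifier would be trained on the remaining four folds each time.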

5. Results and Analysis

5.1. Results of Attributes Reduction Using AVC

The analysis of variance is used to calculate the $F$ values of all the attributes. The distribution of the $F$ values of the 400 dipeptides is shown in Figure 3. Figures 4 and 5 show the $F$ values of the dipeptides remaining after the rough selection and after the correlation analysis, respectively.

As we can see from Figures 3 and 4, there are fewer small $F$ values in Figure 4 than in Figure 3. Because the $F$ value measures the discriminative ability of an attribute, the features which have smaller $F$ values have less effect on the result. Consequently, these attributes are deleted from the feature set. Figure 5 shows the distribution of $F$ values for the dipeptides remaining after correlation analysis. The points in Figure 5 are fewer and sparser than those in Figure 4. Figure 5 shows not only that the features with smaller $F$ values are deleted but also that the features having a strong correlation are deleted. This indicates that the AVC feature selection method can reduce the number of dimensions effectively.

5.2. Contrastive Results Using Different Methods for Feature Selection

To further illustrate the effectiveness of our method, Table 1 shows the results of comparison of AVC and different feature selection methods. All the classification algorithms in Table 1 use the SVM method and perform 5-fold cross-validation.

In Table 1, Sn indicates the sensitivities for the three types of ion channels, OA is the overall accuracy, and AA is the average accuracy. The accuracy and sensitivity of the AVC, ANOVA (Analysis of Variance), BiDi (Binomial Distribution) [8], ReliefF [26–28], and ReCorre [14] algorithms are compared when using SVM. The AVC method, with an average accuracy of 92.17% and an overall accuracy of 91.98%, outperforms the other methods in Table 1. In addition, the sensitivities in predicting K and Na ion channels using the AVC-SVM method are the highest and reach 93.14% and 94.17%, respectively. The ANOVA method gives the best sensitivity in predicting the Ca ion channel, reaching 92.54%. Comparing the principles of AVC, ANOVA, BiDi, and ReliefF, we find that only AVC can identify redundant features with strong correlation. Comparing the principles of AVC, ReliefF, and ReCorre, we find that the ReCorre algorithm adds correlation analysis on top of ReliefF but does not solve the problem of instability caused by noise and outliers. In contrast, the weight calculation based on analysis of variance used in this paper is more robust. To compare the efficiency of feature selection, Table 2 shows the running time and the resulting number of dimensions for the different feature selection methods. SVM is used as the classification algorithm throughout Table 2.

The results in Table 2 show the running time of AVC-SVM is the shortest and reaches 0.085 s. The running times of ANOVA-SVM, BiDi-SVM, ReliefF-SVM, and ReCorre-SVM are 9.350 s, 11.939 s, 9.478 s, and 7.547 s, respectively. The method with the least dimensions is AVC-SVM with the dimensions of 68.

5.3. Comparison Using Different Multiclassification Algorithms

As the classification algorithm, this paper uses SVM, which is suitable for prediction on small-sample data [4]. In addition, the SVM algorithm does not involve probability measures or the law of large numbers, so it differs from existing statistical methods [29]. Further experiments are needed to demonstrate the superiority of SVM in accuracy and sensitivity. With AVC used for feature selection, the comparison of different prediction algorithms is shown in Table 3. To make the results more reliable, 5-fold cross-validation was used for all the methods in Table 3.

The results show that AVC-SVM is superior to the other methods, with the highest average accuracy (92.17%) and the highest overall accuracy (91.98%). The overall accuracies of Bayes [32], ELM (extreme learning machine) [33], RF (Random Forest) [34, 35], and RBF (radial basis function neural network) [36] are 82.61%, 78.70%, 76.80%, and 66.09%, respectively. Moreover, the sensitivities for the three types of ion channels predicted by SVM are the highest. Comparing SVM with Bayes, ELM, RF, and RBF neural networks, the results show that SVM is the best prediction method when AVC is used for feature selection.

5.4. Comparison Using Different Models

In recent years, there have been some studies on the prediction of ion channel types of conotoxins. The comparison with these models is shown in Table 4.

It can be seen from Table 4 that the AVC-SVM model is better than the BiDi-RBF model and the iCTX-Type model in terms of average accuracy, overall accuracy, and time efficiency. When compared with $F$-score-SVM, the average accuracy and the overall accuracy of the AVC-SVM model are not as high as those in literature [31]. However, the sensitivity of the AVC-SVM model is better than that of $F$-score-SVM in predicting the K ion channel. Moreover, the AVC-SVM model uses fewer features and less running time than the $F$-score-SVM model.

The $F$ value used in our method and the $F$-score proposed in the literature [30] are different. The $F$-score in the literature [30] is the ratio of the variance between groups to the variance within groups, where the variance between groups is calculated using the sum of squares of deviations. The $F$ value in our paper is the ratio of the mean square deviation between groups to the mean square deviation within groups. In this paper, the mean square deviation is the sum of squares of deviations divided by the degrees of freedom, which can eliminate the impact caused by the imbalance in the number of samples between groups.

6. Conclusions

In this paper, the proposed model, based on AVC feature selection and SVM prediction, is used to predict the type of ion channels. The results of 5-fold cross-validation show that our model reaches high prediction accuracies and that the feature selection method in this paper has two advantages over other feature selection methods. First, correlation analysis of the features is used to further reduce the information redundancy among strongly correlated features. Second, the calculation of the attribute weights is robust. However, it is necessary to declare the data set which is mined for analysis. We will further expand the data set in follow-up work for in-depth analysis.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 61173071), the Science and Technology Research Project of Henan Province (no. 122102210079), the 2013 Program of China Scholarship Council Countries about Senior Research Scholar and Visiting Scholar (no. 201308410018), the Innovation Talent Support Program of Henan Province Universities (no. 2012HASTIT011), the Doctoral Started Project of Henan Normal University (no. 1039), and the International Training Project of High-Level Talents (no. 17) of Henan Administration of Foreign Experts Affairs in 2016.