Abstract

Protein aggregation is a biological phenomenon caused by the aggregation of misfolded proteins and is associated with a wide variety of diseases, such as Alzheimer's, Parkinson's, and prion diseases. Many studies indicate that protein aggregation is mediated by short "aggregation-prone" peptide segments. Thus, the prediction of aggregation-prone sites plays a crucial role in the research of drug targets. Compared with labor-intensive and time-consuming experimental approaches, the computational prediction of aggregation-prone sites is highly desirable due to its convenience and efficiency. In this study, we introduce two computational approaches, Aggre_Easy and Aggre_Balance, for predicting aggregation residues from sequence information; here, the protein samples are represented by the composition of k-spaced amino acid pairs (CKSAAP). We then use a hybrid classification approach to predict aggregation-prone residues, which integrates naïve Bayes classification to reduce the number of features and two undersampling approaches, EasyEnsemble and BalanceCascade, to deal with the sample-imbalance problem. Aggre_Easy achieves a promising performance with a sensitivity of 79.47%, a specificity of 80.70%, and an MCC of 0.42; the sensitivity, specificity, and MCC of Aggre_Balance reach 70.32%, 80.70%, and 0.42, respectively. Experimental results show that the Aggre_Easy and Aggre_Balance predictors perform better than several other state-of-the-art predictors. A user-friendly web server for aggregation-prone residue prediction is freely accessible to the public at http://202.198.129.220:8080/AggrePrediction/.

1. Introduction

Protein aggregation is a phenomenon caused by the aggregation of misfolded proteins. Many studies indicate that protein aggregation can produce amyloid fibrils, which are associated with a wide variety of diseases, such as Alzheimer's, Parkinson's, and prion diseases [1]. Although amyloidogenic proteins share neither sequence homology nor common native fold patterns, they are remarkably similar in β structure [1]. Experiments demonstrate that protein aggregation is mediated by short "aggregation-prone" peptide segments, so the identification of aggregation-prone regions in protein sequences is the key to understanding the protein aggregation phenomenon. As is well known, traditional experimental identification and characterization of aggregation-prone regions are labor-intensive and expensive. Therefore, the computational prediction of aggregation residues has attracted more and more attention in the past few years.

Over the past ten years, a large number of computational approaches have been developed to analyze and predict aggregation-prone regions. Broadly, from the perspective of feature extraction, these approaches can be divided into three categories: experiment-based methods, structure-based methods, and physical-chemical attribute-based methods. For example, Aggrescan [2], proposed by Conchillo-Solé et al., and the saturation mutagenesis analysis [3] performed by López de la Paz and Serrano were both validated by experiments. Among structure-based methods, Galzitskaya et al. [4] used a new parameter, "mean packing density," to detect both amyloidogenic and disordered regions in a protein sequence; SALSA [5], Hexapeptide Conf. Energy [1], and SecStr [6] were built on β-sheet structure analysis; and NetCSSP [7], developed by Kim et al., used the CSSP algorithm and 3D structure to predict amyloid fibril formation. On the other hand, physical-chemical attribute-based methods such as PaFigure [8], proposed by Tian et al., and Tango [9], developed by Fernandez-Escamilla et al., take physical-chemical principles into account to predict aggregation-prone regions. Recently, Tsolis et al. developed two methods named AMYLPRED [10] and AMYLPRED2 [11], which integrate 5 predictors and 11 predictors, respectively.

However, the above methods did not consider that the dataset for aggregation-prone prediction is imbalanced, and some of them rely on structural information, which entails high computational complexity. For these reasons, we develop two approaches, Aggre_Easy and Aggre_Balance, that predict aggregation residues from sequence information alone. In this study, the protein samples are represented by the composition of k-spaced amino acid pairs (CKSAAP) [12–14]. Then, we use a hybrid classification approach to address the sample-imbalance problem: it integrates naïve Bayes classification to reduce the number of features with an undersampling strategy to deal with class imbalance. Two undersampling algorithms, EasyEnsemble and BalanceCascade, are both utilized in this paper. Aggre_Easy achieves a promising performance with a sensitivity of 79.47%, a specificity of 80.70%, and an MCC of 0.42; the sensitivity, specificity, and MCC of Aggre_Balance reach 70.32%, 80.70%, and 0.42, respectively. Experimental results show that the Aggre_Easy and Aggre_Balance predictors perform better than several other state-of-the-art predictors. A user-friendly web server for aggregation-prone residue prediction is freely accessible to the public at the following website: http://202.198.129.220:8080/AggrePrediction/.

2. Materials and Methods

2.1. Datasets

In this paper, we select 33 amyloidogenic proteins to predict "aggregation-prone" peptides. All proteins are extracted from UniProt/Swiss-Prot (March 20, 2013). Moreover, to facilitate comparison with AMYLPRED2, we select the same dataset. For aggregation-prone peptide prediction, 25 proteins are used for training and the remaining 8 proteins for testing. Similar to [11], all experimentally verified aggregation sites are regarded as positive samples, and the other nonaggregation sites in the same proteins are taken as negative samples (see Supporting Information Text S1 in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/857325). The number of protein samples in each dataset is shown in Table 1.

We define a possible aggregation-prone peptide as a fragment in which the aggregation site is flanked by "w" residues upstream and "w" residues downstream, giving a window size of 2w + 1. In this paper, we select four different values of w (2, 3, 4, and 5), so the window sizes are 5, 7, 9, and 11. If the aggregation site is located near the N- or C-terminus of the protein and the length of the peptide is smaller than 2w + 1, one or more "O" characters are added to pad the peptide to the full window size.
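
To make the windowing concrete, here is a minimal Python sketch (our illustration; the function name and the 0-based site indexing are assumptions, not details from the paper):

```python
def extract_window(sequence: str, site: int, w: int = 3) -> str:
    """Return the (2*w + 1)-residue peptide centered on `site`,
    padded with 'O' where the window runs past either terminus."""
    return "".join(
        sequence[i] if 0 <= i < len(sequence) else "O"
        for i in range(site - w, site + w + 1)
    )

# A site one residue from the N-terminus with w = 3 gets two leading pads:
# extract_window("MKVLAAGIT", 1, 3) == "OOMKVLA"
```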

2.2. Protein Encoding Schema

To develop a powerful predictor, an effective mathematical expression formulating the protein sequences plays an important role, since it must truly reflect their intrinsic correlation with the attribute to be predicted [15, 16]. In this research, we use the encoding scheme based on the composition of k-spaced amino acid pairs (CKSAAP) [12–14], which has been successfully used for predicting several types of posttranslational modification (PTM) sites (e.g., palmitoylation sites [13], ubiquitination sites [12], and phosphorylation sites [17]). We describe the detailed procedure as follows.

Generally, we define an aggregation-prone sample as a sequence fragment of 2w + 1 amino acids. There are 441 possible amino acid pair types (i.e., 21 × 21, where the alphabet comprises the 20 standard amino acids plus the padding character "O"). Note that the pairs are extended to k-spaced amino acid pairs (i.e., pairs that are separated by k other amino acids). For each k, we use the vector (N_AA, N_AC, …, N_OO) of 441 components as a feature vector, where N_AC, for instance, denotes the number of times the pair (A, C) occurs with amino acid A separated by k other amino acids from amino acid C in the sequence fragment. In this study, based on previous experience, the amino acid pairs for k = 3, 4, and 5 are jointly considered, so the total dimension of the proposed feature vector is 441 × 3 = 1323.
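
As an illustration, a minimal CKSAAP encoder might look as follows (a sketch under the description above; the identifier names and the treatment of "O" as an ordinary 21st symbol are our assumptions):

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "O"        # 20 amino acids plus the pad
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 441 pair types
PAIR_INDEX = {p: i for i, p in enumerate(PAIRS)}

def cksaap(peptide: str, ks=(3, 4, 5)) -> list:
    """Concatenate one 441-dimensional count vector per k
    (441 * 3 = 1323 features for k = 3, 4, 5)."""
    features = []
    for k in ks:
        counts = [0] * len(PAIRS)
        # A k-spaced pair joins residue i with residue i + k + 1,
        # i.e., the two residues are separated by k other residues.
        for i in range(len(peptide) - k - 1):
            counts[PAIR_INDEX[peptide[i] + peptide[i + k + 1]]] += 1
        features.extend(counts)
    return features
```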

2.3. Hybrid Classification Approach

From Section 2.1, we can see that there are about five times as many negative samples as positive samples, so traditional learning algorithms such as SVM cannot achieve good performance on this kind of imbalanced dataset. Moreover, because the CKSAAP encoding produces a huge number of features, feature selection methods are required to reduce the dimension of the feature space. In this paper, we design a hybrid classification approach integrating naïve Bayes classification and two undersampling methods, EasyEnsemble and BalanceCascade, to predict aggregation sites; a similar combination has been successfully used for text document classification [18, 19]. It takes advantage of both the simplicity of the Bayes technique and the efficiency of undersampling in dealing with the class-imbalance problem. In Figure 1, the black frame illustrates the workflow of the hybrid classification approach. Firstly, all training proteins or peptides are represented by the CKSAAP encoding scheme. Secondly, the composition-of-k-spaced-amino-acid-pairs features are used as the input data for Bayes classification, which compresses them into a number of dimensions equal to the number of classes in the classification task [20]. Finally, we apply the undersampling approaches to build the predictors. For a query protein, the trained model is used to predict whether or not it is an aggregation protein. The details are given in the following sections.

2.3.1. The Bayes Classification Approach

The naïve Bayes classifier [21] starts with the initial step of encoding each sample by extracting the composition of k-spaced amino acid pairs (CKSAAP). The list of AAPs (amino acid pairs) is constructed with the assumption that the input data contains L AAPs, where L is the CKSAAP encoding length. This list is used to create a table containing the probabilities of each amino acid pair (AAP) in each class; Table 2 shows the details.

Based on the list of AAP numbers, the trained probabilistic classifier calculates the posterior probability that a particular AAP of the sample is annotated to a particular class by using formula (1), since each AAP in the input sample contributes to the sample's class probability:

$$\Pr(\text{Class} \mid \text{AAP}) = \frac{\Pr(\text{AAP} \mid \text{Class}) \cdot \Pr(\text{Class})}{\Pr(\text{AAP})}. \tag{1}$$

The prior probability, Pr(Class), can be computed from

$$\Pr(\text{Class}) = \frac{\text{number of training samples in Class}}{\text{total number of training samples}}. \tag{2}$$

Meanwhile, we calculate the evidence, Pr(AAP), which is the probability of each AAP across all classes; it is expressed as

$$\Pr(\text{AAP}) = \frac{\text{occurrences of AAP in all classes}}{\text{total number of AAP occurrences in all classes}}. \tag{3}$$

The total occurrence of a particular AAP in every class can be calculated by searching the training database, which is composed of the lists of AAP occurrences for every class. As previously mentioned, the list of AAP numbers for a class is generated from the analysis of all training samples in that class during the initial training stage. The same method can be used to retrieve the sum of the numbers of all samples in every class in the training database.

To calculate the likelihood of a particular AAP with respect to a particular class, the lists of AAP numbers from the training database are searched to retrieve the number of occurrences of the AAP in the class and the sum of all AAP occurrences in that class. This information contributes to the value of Pr(AAP | Class) given in

$$\Pr(\text{AAP} \mid \text{Class}) = \frac{\text{occurrences of AAP in Class}}{\text{sum of all AAP occurrences in Class}}. \tag{4}$$

Based on the derived Bayes' formula for classification and the values of the prior probability Pr(Class), the likelihood Pr(AAP | Class), and the evidence Pr(AAP), the posterior probability Pr(Class | AAP) of each AAP in the input data being annotated to each class can be measured.

The probability for an input sample to be annotated to a particular class is calculated by dividing the sum of the entries of the "Probability" column by the length of the query, L, as shown in

$$P(\text{Class}) = \frac{1}{L}\sum_{i=1}^{L}\Pr(\text{Class} \mid \text{AAP}_i), \tag{5}$$

where AAP_1, AAP_2, …, AAP_L are the AAPs extracted from the input sample.

P(Class) is the probability value for a sample to be annotated to a class. If we have a class list Class_1, Class_2, …, Class_n, each sample has n associated probability values: P(Class_1), P(Class_2), …, P(Class_n). All the probability values of a sample are combined to construct a multidimensional array, which represents the probability distribution of the sample in the vector space. In this way, all the training samples are vectorized into their probability distributions in vector space, in the format of numerical multidimensional arrays, with the number of dimensions depending on the number of classes.
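
To tie formulas (1)–(5) together, the following NumPy sketch vectorizes CKSAAP count vectors into per-class probabilities (our own illustration; the add-one smoothing and all identifier names are assumptions not stated in the paper):

```python
import numpy as np

def nb_vectorize(X_counts, y, n_classes=2):
    """Compress each CKSAAP count vector into one probability per class
    (e.g., 1323 features -> 2 dimensions for a two-class task)."""
    X_counts, y = np.asarray(X_counts, dtype=float), np.asarray(y)
    # Formula (2): Pr(Class) as the fraction of training samples per class.
    prior = np.array([(y == c).mean() for c in range(n_classes)])
    # Formula (4): Pr(AAP | Class) from per-class AAP occurrence counts
    # (add-one smoothing avoids zero probabilities).
    occ = np.array([X_counts[y == c].sum(axis=0) + 1 for c in range(n_classes)])
    likelihood = occ / occ.sum(axis=1, keepdims=True)
    # Formula (3): Pr(AAP) from the pooled occurrences over all classes.
    evidence = occ.sum(axis=0) / occ.sum()
    # Formula (1): posterior Pr(Class | AAP) for every AAP type.
    posterior = likelihood * prior[:, None] / evidence   # (classes, AAPs)

    def transform(counts):
        counts = np.atleast_2d(np.asarray(counts, dtype=float))
        L = np.maximum(counts.sum(axis=1, keepdims=True), 1)  # AAPs per sample
        # Formula (5): average the posteriors of the sample's L AAPs.
        return counts @ posterior.T / L
    return transform
```

A call such as `transform = nb_vectorize(X_train, y_train)` then maps both training and test count vectors into the low-dimensional probability space used by the downstream ensemble classifiers.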

2.3.2. EasyEnsemble and BalanceCascade

Liu et al. proposed the EasyEnsemble and BalanceCascade algorithms [22], two undersampling algorithms that are widely used in classification tasks. The EasyEnsemble algorithm independently extracts several subsets from the majority-class examples; for each subset, a classifier is built with AdaBoost [23], and all generated classifiers are combined into an ensemble learning system for the final decision. The BalanceCascade algorithm relies on supervised learning to guide the sampling: it likewise extracts balanced subsets from the majority class and trains ensemble classifiers, but after each round it removes the majority-class examples that the current classifier handles correctly [24]. The pseudocodes for EasyEnsemble and BalanceCascade are shown in Algorithms 1 and 2.

Input: Training dataset with minority (positive) set $P$ and majority (negative) set $N$, the number of individual ensembles $T$, the number of AdaBoost iterations $s_i$
(1) Begin
(2) For $i = 1$ to $T$
(3) Create a subset $N_i$ from the negative dataset $N$ by using the bootstrap sampling technique, with the number of samples $|N_i|$ equal to $|P|$
(4) Use AdaBoost with the weak classifiers $h_{i,j}$ and corresponding weights $\alpha_{i,j}$ to train the individual model $H_i$ on $P \cup N_i$; the ensemble's threshold is $\theta_i$, i.e.
    $H_i(x) = \mathrm{sgn}\left(\sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \theta_i\right)$.
(5) End For
(6) Output: An ensemble:
    $H(x) = \mathrm{sgn}\left(\sum_{i=1}^{T} \sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \sum_{i=1}^{T} \theta_i\right)$
(7) End
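
A minimal Python sketch of Algorithm 1 follows (our illustration, assuming a recent scikit-learn; averaging the sub-ensembles' class probabilities stands in for the threshold-based combination in the pseudocode):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble(X, y, n_subsets=4, n_rounds=10, seed=0):
    """Train one AdaBoost ensemble per balanced bootstrap subset
    (positive class labeled 1, negative class labeled 0)."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):
        # Bootstrap a negative subset the same size as the positive set.
        neg_i = rng.choice(neg, size=pos.size, replace=True)
        idx = np.concatenate([pos, neg_i])
        clf = AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=3),  # CART weak learner
            n_estimators=n_rounds, random_state=0)
        models.append(clf.fit(X[idx], y[idx]))
    return models

def easy_predict(models, X, threshold=0.5):
    # Average the sub-ensembles' positive-class probabilities.
    score = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (score >= threshold).astype(int)
```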

Input: Training dataset with minority (positive) set $P$ and majority (negative) set $N$, the number of individual ensembles $T$, the number of AdaBoost iterations $s_i$
(1) Begin
(2) $f = (|P|/|N|)^{1/(T-1)}$ is the false positive rate (the error rate of misclassifying a majority class example to the minority class) that each $H_i$ should achieve
(3) For $i = 1$ to $T$
(4) Create a subset $N_i$ from the negative dataset $N$ by using the bootstrap sampling technique, with the number of samples $|N_i|$ equal to $|P|$
(5) Use AdaBoost with the weak classifiers $h_{i,j}$ and corresponding weights $\alpha_{i,j}$ to train the individual model $H_i$ on $P \cup N_i$; the ensemble's threshold is $\theta_i$, i.e.
    $H_i(x) = \mathrm{sgn}\left(\sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \theta_i\right)$
(6) Adjust $\theta_i$ such that $H_i$'s false positive rate is $f$
(7) Remove from $N$ all examples that are correctly classified by $H_i$
(8) End For
(9) Output: A single ensemble:
    $H(x) = \mathrm{sgn}\left(\sum_{i=1}^{T} \sum_{j=1}^{s_i} \alpha_{i,j} h_{i,j}(x) - \sum_{i=1}^{T} \theta_i\right)$
(10) End
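
A corresponding sketch of Algorithm 2 (again our illustration under the same assumptions; calibrating the threshold via a score quantile is one practical way to realize step (6)):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def balance_cascade(X, y, n_subsets=4, n_rounds=10, seed=0):
    """Each round trains an AdaBoost ensemble on a balanced subset and
    removes the negatives the ensemble already rejects correctly."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    # False-positive rate each H_i must keep so that roughly |P| negatives
    # survive to the last round (step (2) of Algorithm 2).
    f = (pos.size / neg.size) ** (1.0 / (n_subsets - 1))
    models = []
    for _ in range(n_subsets):
        if neg.size == 0:
            break  # all negatives have been filtered out already
        neg_i = rng.choice(neg, size=pos.size, replace=True)
        idx = np.concatenate([pos, neg_i])
        clf = AdaBoostClassifier(n_estimators=n_rounds, random_state=0)
        clf.fit(X[idx], y[idx])
        # Step (6): choose theta_i so that a fraction f of the remaining
        # negatives still scores positive; step (7): drop the rest.
        scores = clf.predict_proba(X[neg])[:, 1]
        theta = np.quantile(scores, 1.0 - f)
        models.append((clf, theta))
        neg = neg[scores >= theta]
    return models
```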

2.4. Evaluation

In this study, we adopt 10-fold cross-validation. The dataset is randomly divided into ten equal sets, of which nine are used for training and the remaining one for testing. This procedure is repeated ten times, and the final prediction result is the average accuracy over the ten testing sets [25–32].

Four measures, sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), are used to assess the performance of our model. They are defined by the following formulas:

$$\text{Sn} = \frac{TP}{TP + FN}, \qquad \text{Sp} = \frac{TN}{TN + FP}, \qquad \text{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. For a given dataset, all these values can be obtained from the decision function with a fixed cutoff [33–37].
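
For concreteness, a small helper (our own sketch, not from the paper) that computes the four measures from confusion-matrix counts:

```python
import math

def evaluate(tp: int, tn: int, fp: int, fn: int):
    """Return (Sn, Sp, Acc, MCC) computed from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc
```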

3. Results and Discussion

3.1. The Performance in the Testing Dataset

In this research, we select 33 amyloidogenic proteins for the prediction of "aggregation-prone" peptides. Twenty-five amyloidogenic proteins are selected for training, giving 923 positive samples and 5074 negative samples; the remaining 8 amyloidogenic proteins are used for testing, giving 335 positive samples and 1499 negative samples. The details are shown in Table 1. As described in Section 2.1, each candidate aggregation-prone peptide is a window centered on the aggregation site, with w set to 2, 3, 4, and 5 and window sizes of 5, 7, 9, and 11. Next, we use the encoding scheme based on the composition of k-spaced amino acid pairs (CKSAAP) to formulate the aggregation-prone peptides, with k set to 3, 4, and 5. In Tables 3 and 4, we compare the values of MCC to determine the best values of w and k.

We use the hybrid classification approach (the naïve Bayes vectorizer and the two undersampling algorithms, EasyEnsemble and BalanceCascade) to improve the classification accuracy and performance on the imbalanced dataset. For the EasyEnsemble approach, CART is used to train the weak classifiers, the number of subsets is 4, and the number of iterations is 10 in each AdaBoost ensemble; the same parameters are used for the BalanceCascade approach. Meanwhile, we perform 10-fold stratified cross validation. Within each fold, the classification method is repeated 10 times, considering that the sampling of subsets introduces randomness. The whole cross-validation process is repeated 10 times, and the average over these 10 cross validations is taken as the final performance of the method.

The average performance for the different parameters is summarized in Tables 3 and 4. When the window size is 7 and the value of k is 4, the MCC is the highest: 0.0827 for the EasyEnsemble learning algorithm and 0.0738 for the BalanceCascade learning algorithm on the testing dataset. Thus, we select a window size of 7 and k = 4 as the final parameters of the classifier, which is then compared with other predictors by 10-fold cross validation on all datasets.

The average Sn of the EasyEnsemble and BalanceCascade learning algorithms is shown in Figures 2 and 4. When the window size is smaller, the value of Sn is higher: for window sizes of 5 and 7, Sn is about 0.39~0.41 for EasyEnsemble and 0.27~0.32 for BalanceCascade, whereas for window sizes of 9 and 11, Sn is about 0.34~0.38 for EasyEnsemble and 0.24~0.31 for BalanceCascade. Figures 3 and 5 summarize the average Sp of the two algorithms: it is about 0.66~0.70 for EasyEnsemble and 0.73~0.77 for BalanceCascade when the window size is 5 or 7, and about 0.69~0.75 for EasyEnsemble and 0.76~0.80 for BalanceCascade when the window size is 9 or 11. This indicates that a smaller window size is beneficial for predicting positive samples and that a larger window size brings more redundant information. Moreover, Sn is about 10% higher for EasyEnsemble than for BalanceCascade, whereas Sp is about 7% lower; this illustrates that EasyEnsemble improves the prediction sensitivity while BalanceCascade improves the prediction specificity.

3.2. Comparison with Other Predictors

As shown in Table 5, the prediction sensitivity and MCC of Aggre_Easy and Aggre_Balance are the highest among the compared methods: Sn is 79.46% and MCC is 0.42 for Aggre_Easy, and Sn is 70.32% and MCC is 0.42 for Aggre_Balance. This indicates that our predictors perform well on the positive samples of the imbalanced dataset. However, their specificity is lower than that of some other methods. For Aggre_Easy, the specificity (Sp = 74.43%) is lower than that of Amyloidogenic Pattern, Average Packing Density, Beta-strand contiguity, SecStr, Tango, AMYLPRED, and AMYLPRED2, slightly lower than that of Aggrescan, AmyloidMutants, and Hexapeptide Conf. Energy, and higher than that of NetCSSP, PaFigure, and Waltz. For Aggre_Balance, the specificity (Sp = 80.70%) is lower than that of Amyloidogenic Pattern, Average Packing Density, Beta-strand contiguity, SecStr, Tango, AMYLPRED, and AMYLPRED2 and higher than that of the other methods. More importantly, the reasonably good performance of Aggre_Easy and Aggre_Balance reflects that the method effectively captures the information of aggregation sites; we attribute this to the hybrid classification approach, which combines the simplicity of the Bayes technique with the sensitivity of the undersampling ensemble learning algorithms.

In Table 5, the number of false positives (FP) is large, mainly because only a relatively small portion of the candidate regions have been studied and confirmed experimentally to be amyloidogenic [11]. In future work, we plan to propose a window redirection operator to further improve the prediction performance.

3.3. Web Server for Aggregation-Prone Prediction

Prediction servers for Aggre_Easy and Aggre_Balance are available at http://202.198.129.220:8080/AggrePrediction/. They are hosted on an Apache 2.2 web server in a Windows Server 2003 environment. In the web server, the models trained on the datasets with the optimal parameters are used to predict sites in submitted sequences. As displayed in Figures 6 and 7, users can submit uncharacterized sequences in FASTA format, and the system returns the prediction results. A region of the polypeptide sequence is considered aggregation-prone if it contains 5 or more consecutive residues predicted to be aggregation-prone.

4. Conclusion

Accurate identification of aggregation residues could help fully decipher the molecular mechanisms of protein aggregation. Though some researchers have focused on this problem, the overall prediction performance is still not satisfactory. In this paper, we develop two approaches, Aggre_Easy and Aggre_Balance, to predict aggregation-prone residues from primary sequence information. Aggre_Easy achieves a promising performance with a sensitivity of 79.47%, a specificity of 80.70%, and an MCC of 0.42; the sensitivity, specificity, and MCC of Aggre_Balance reach 70.32%, 80.70%, and 0.42, respectively. Experimental results show that the Aggre_Easy and Aggre_Balance predictors perform better than several other state-of-the-art predictors, and our methods are helpful for the prediction of aggregation-prone residues.

Conflict of Interests

The authors declare no conflict of interests.

Acknowledgments

This research is partially supported by the National Natural Science Foundation of China (61403077 and 61403076), the Fundamental Research Funds for the Central Universities (14QNJJ029), and the Postdoctoral Science Foundation of China (2014M550166).

Supplementary Materials

Text S1: the dataset for the prediction of aggregation-prone residues, consisting of the 33 proteins and their site information.

Text S2: comparison of the aggregation-prone prediction results of Aggre_Easy, Aggre_Balance, AMYLPRED, and AMYLPRED2. For simplicity, single-residue positive predictions are removed; in the future, we will propose a window redirection operator to improve the prediction performance.
