Abstract

Protein fold classification plays an important role in both protein functional analysis and drug design. The number of proteins in the PDB is very large, but only a small fraction has been categorized and stored in the SCOPe database. Therefore, it is necessary to develop an efficient method for protein fold classification. In recent years, a variety of classification methods have been applied to protein fold classification. In this study, we propose a novel classification method called proFold. We exploit protein tertiary structure during feature extraction and employ a novel ensemble strategy during classifier training. Compared with existing similar ensemble classifiers on the same widely used dataset (DD-dataset), proFold achieves 76.2% overall accuracy. Two other commonly used datasets, EDD-dataset and TG-dataset, are also tested, on which the accuracies are 93.2% and 94.3%, respectively, higher than those of the existing methods. ProFold is available to the public as a web-server.

1. Introduction

Protein fold classification is a crucial problem in structural bioinformatics. Protein folding information is helpful in identifying the tertiary structure and functional information of a protein [1]. In recent years, many protein fold classification studies have been performed. The methods proposed by researchers can be roughly divided into two categories: one is the template-based method [2–7], and the other is the taxonomy-based method [8–15]. Recently, taxonomy-based methods have attracted more attention due to their relatively excellent performance.

The taxonomy-based method was first proposed by Dubchak et al. [8, 9] in 1995. Many taxonomy-based methods classify a query protein into a known folding type. This automatic labeling approach contributes to the growth of the number of proteins in the Structural Classification of Proteins (SCOP) [16] and could narrow the gap between the number of proteins in SCOP and the Protein Data Bank (PDB). In this paper, the taxonomy-based method is treated as a classification problem in machine learning. There are two significant problems in classification tasks: one is feature extraction, and the other is the choice of machine learning method.

In terms of feature extraction, most researchers extract multidimensional numerical feature vectors from amino acid sequences. In 1995, Dubchak et al. [8, 9] extracted a global description of the amino acid sequence for the first time. Since then, in order to improve the classification accuracy, researchers have put forward other feature extraction methods, such as pseudoamino acid composition [12, 17], pairwise frequency information [18], the Position Specific Scoring Matrix (PSSM) [17], structural properties of amino acid residues and amino acid residue pairs [19], and the hidden Markov model structural alphabet [20, 21]. Besides extracting features from the amino acid sequence directly, some features are extracted from evolutionary information combining the functional domain and the sequential evolution information [22] and from predicted secondary structure [14, 23, 24]. Although the classification accuracy can be improved by combining these features [20, 25], it is still not good enough.

For protein fold classification, many classifiers have been used, such as neural networks (NNs) [8, 13], SVMs [10, 13, 18–21, 24, 26–33], k-nearest neighbors (k-NN) [12], probabilistic multiclass multikernel classifiers [25], random forest [23, 34–37], rotation forest [38], and a variety of ensemble classifiers [11, 12, 14, 18, 22, 39–41].

As of 28 April 2016, the PDB contained 109850 protein structures (http://www.rcsb.org/pdb/home/home.do), whereas Structural Classification of Proteins-extended (SCOPe) [42] contained only 77439 PDB entries (http://scop.berkeley.edu/statistics/ver=2.06). Therefore, a great number of protein structures still lack structure classification labels in the SCOPe database. Moreover, most protein structures in SCOPe are classified manually, which requires a lot of manual labor. In this study, we start from the 3D structure in the PDB file to study protein fold classification. In terms of feature extraction, we use a new feature extraction method combining the existing methods of the global description of the amino acid sequence [13], PSSM [43], and protein functional information [22] proposed by other researchers. The new feature extraction method additionally extracts eight types of secondary structure states from PDB files using the Definition of Secondary Structure in Proteins (DSSP) software [44]. In terms of machine learning classifiers, we propose a novel ensemble strategy. With the newly added feature extracted from DSSP and the novel ensemble strategy we propose, our method achieves 1–3% higher accuracy than similar methods.

As demonstrated by a series of recent publications [45–55] in compliance with Chou’s 5-step rule [56], to establish a really useful machine learning classifier for a biological system, we should follow five guidelines: (a) construct or select a benchmark dataset for training and testing the model; (b) extract features from the biological sequence samples with effective methods that can truly reflect their intrinsic correlation with the target to be predicted; (c) introduce or develop a powerful algorithm (or engine) to operate the classifier; (d) properly perform cross-validation tests and tests on an independent dataset to objectively evaluate the anticipated accuracy of the classifier; (e) establish a user-friendly web-server (http://binfo.shmtu.edu.cn/profold/) for the classifier that is accessible to the public. In the following, we describe how we deal with these steps one by one.

2. Materials and Methods

2.1. Data Sets

In this study, three benchmark datasets are used: Ding and Dubchak (DD) [13], Taguchi and Gromiha (TG) [58], and Extended DD (EDD) [10]. The DD-dataset was proposed by Ding and Dubchak in 2001 and modified by Shen and Chou in 2006 [12]. Since then, the DD-dataset has been used in many protein fold classification studies [11, 18, 20–24, 26, 32–36, 38, 40, 57, 59]. There are 311 protein sequences in the training set and 386 protein sequences in the testing set, with no two proteins having more than 35% sequence identity. The protein sequences in the DD-dataset were selected comprehensively from 27 SCOP [35] folds, which belong to the four structural classes all-α, all-β, α/β, and α+β.

The TG-dataset contains 30 SCOP folds and 1612 protein sequences, with no two proteins having more than 25% sequence identity.

The EDD-dataset contains the same 27 SCOP folds as the DD-dataset. There are 3418 protein sequences, with no two proteins having more than 40% sequence identity.

These three datasets can be downloaded directly from our website (http://binfo.shmtu.edu.cn/profold/benchmark.html).

2.2. Feature Extraction Method

With the rapid growth of biological sequences in the postgenomic age, one of the most important but also most difficult problems in computational biology is how to represent a biological sequence with a discrete model or a vector. Therefore, Chou’s PseAAC [60–62] was proposed. Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, three web-servers [63–65] were developed for generating various feature vectors for DNA/RNA sequences. Particularly, recently a powerful web-server called Pse-in-One [66] has been established that can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users’ studies. Inspired by this, in this study, we extract four feature groups, including the DSSP feature, the amino acid composition and physicochemical properties (AAsCPP) feature, the PSSM feature, and the functional domain (FunD) composition feature. These feature extraction methods are described as follows.

2.2.1. Definition of Secondary Structure in Proteins

The DSSP program was designed by Kabsch and Sander [44] to standardize protein secondary structure assignment. The DSSP program works by calculating the most likely protein secondary structure given the protein's 3-dimensional structure. Specifically, the DSSP program calculates the H-bond energy between pairs of atoms from the atomic positions in a PDB file, and then the most likely secondary structure class for each residue is determined from the best two H-bonds of each residue.

The DSSP feature extraction process is as follows. Firstly, DSSP entries are calculated from PDB entries by the DSSP program. Secondly, the corresponding DSSP sequences are obtained from the DSSP entries. A DSSP sequence contains eight states (T, S, G, H, I, B, E, —), which can be divided into four groups, as shown in Table 1. Finally, according to the eight states and four groups, a 40D feature vector is extracted from each DSSP sequence. The details of the description and dimensions of the features are shown in Table 2.
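The composition part of this process can be sketched in plain Python as follows. This is a minimal sketch: the state-to-group mapping below is an illustrative assumption standing in for the paper's Table 1, and the full 40D descriptor adds further terms per Table 2.

```python
from collections import Counter

# Assumed state-to-group mapping; the paper's Table 1 defines the actual grouping.
GROUPS = {"helix": "HGI", "strand": "EB", "turn": "TS", "coil": "-"}
STATES = "TSGHIBE-"

def dssp_composition(dssp_seq):
    """Percentage composition of the 8 DSSP states and the 4 groups.

    This 12D slice is only part of the descriptor; the paper's full
    40D vector adds the further terms listed in Table 2.
    """
    n = len(dssp_seq)
    counts = Counter(dssp_seq)
    state_comp = [100.0 * counts.get(s, 0) / n for s in STATES]
    group_comp = [100.0 * sum(counts.get(s, 0) for s in members) / n
                  for members in GROUPS.values()]
    return state_comp + group_comp

# Toy DSSP string: a helix, turns, a strand, and coil regions
feats = dssp_composition("HHHHTT--EEEE-TTGGG--")
```

Both the state and the group percentages sum to 100 for a sequence drawn entirely from the eight states, which is a quick sanity check on the mapping.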

2.2.2. Amino Acids Composition and Physicochemical Properties

As effective features for describing a protein, the amino acid composition and physicochemical properties have each achieved good prediction results [13, 34, 35]. Ding and Dubchak [13] integrated these features for the first time and achieved a good result. Later, many other researchers proposed other feature integration methods. In 2013, Lin et al. [41] used a 188D feature vector combining amino acid composition and physicochemical properties. The 188D feature extraction method is also used in this paper.

The eight physicochemical properties of amino acids are hydrophobicity, van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure, and solvent accessibility. Different kinds of amino acids have different physicochemical properties, so for each property the amino acids can be divided into three groups [13, 41], as shown in Table 3.

The percentage composition of the 20 amino acids in the query protein forms a 20D feature vector. From each physicochemical property, the group composition of the amino acids (3D), the pairwise transition frequency between every two groups (3D), and the distribution pattern of each group (the positions at which the first residue and 25%, 50%, 75%, and 100% of the residues of a given group are contained) (5 × 3D) are extracted. Therefore, we obtain a 168D feature vector from a protein sequence according to the eight physicochemical properties. Adding the 20D amino acid composition feature to the 168D physicochemical feature, we obtain a 188D feature vector altogether. The names and dimensions of the features are listed in Table 4.
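The composition/transition/distribution scheme above can be sketched for a single property as follows. This is a minimal sketch in plain Python; the three-group split for hydrophobicity is an assumed, illustrative grouping, and the actual groups for all eight properties are defined in the paper's Table 3.

```python
import math

# Assumed three-group split for hydrophobicity (polar / neutral / hydrophobic);
# the paper's Table 3 gives the actual groupings.
HYDRO_GROUPS = ["RKEDQN", "GASTPHY", "CLVIMFW"]

def ctd_features(seq, groups):
    """Composition (3D), transition (3D), and distribution (5 x 3D) for one
    three-group physicochemical property: a 21D slice of the 188D vector."""
    label = {aa: g for g, members in enumerate(groups) for aa in members}
    idx = [label[aa] for aa in seq]
    n = len(idx)
    # Composition: percentage of residues falling in each group
    comp = [100.0 * idx.count(g) / n for g in range(3)]
    # Transition: percentage of adjacent residue pairs crossing two groups
    pairs = list(zip(idx, idx[1:]))
    trans = [100.0 * sum(1 for x, y in pairs if {x, y} == {a, b}) / (n - 1)
             for a, b in ((0, 1), (0, 2), (1, 2))]
    # Distribution: sequence positions (as % of length) containing the first,
    # 25%, 50%, 75%, and 100% of the residues of each group
    dist = []
    for g in range(3):
        pos = [i + 1 for i, v in enumerate(idx) if v == g]
        if not pos:
            dist.extend([0.0] * 5)
            continue
        for frac in (0, 0.25, 0.5, 0.75, 1.0):
            k = max(1, math.ceil(frac * len(pos)))
            dist.append(100.0 * pos[k - 1] / n)
    return comp + trans + dist

feats = ctd_features("MKVLATGRRKEDW", HYDRO_GROUPS)
```

Repeating this over the eight properties yields the 168D block, to which the 20D amino acid composition is appended for the full 188D vector.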

2.2.3. Position Specific Scoring Matrix

PSSM is a relatively common feature. In addition to the protein fold classification research area, some studies on protein structural class prediction [67, 68] have used this feature. PSSM is derived from PSI-BLAST (Position Specific Iterative Basic Local Alignment Search Tool) [43] by taking the multiple sequence alignment of sequences in the nonredundant protein sequence database (nrdb90) [69]. The iteration number is 3 and the cutoff E-value is 0.001. Two L × 20 matrices can be obtained by PSI-BLAST, in which L represents the length of the query amino acid sequence and 20 represents the 20 amino acids. One of the two matrices contains conservation scores of a given amino acid at a given position in the sequence, and the other provides the probability of occurrence of a given amino acid at a given position in the sequence. The PSSM feature is extracted from the former matrix. Suppose that the element in row i and column j of the matrix is p(i, j). Then the feature can be calculated by
f(j) = (1/L) Σ_{i=1}^{L} p(i, j),  j = 1, 2, …, 20. (1)
That is, the average value of each column of the matrix is calculated, forming a 20D feature vector.
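The column averaging in (1) amounts to the following minimal sketch, where the PSSM is assumed to be already available as a list of L rows of 20 scores:

```python
def pssm_feature(pssm):
    """Average each of the 20 columns of an L x 20 PSSM matrix,
    giving the 20D feature vector of equation (1)."""
    L = len(pssm)
    return [sum(row[j] for row in pssm) / L for j in range(20)]

# Toy 2 x 20 matrix: each column averages to 2.0
feat = pssm_feature([[1] * 20, [3] * 20])
```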

2.2.4. Functional Domain Composition

Proteins often contain modules or domains, which involve different evolutionary origins and functions. Therefore, we can extract features from FunD databases. Several different FunD databases exist: SMART [70], Pfam [71], COG [72], KOG [72], and CDD [73]. In 2009, Shen and Chou [22] considered CDD a relatively complete functional domain database and used it to extract features. In this study, we used CDD (version 2.11), which contains 17402 common protein domains and families. Taking each protein domain as a vector base, we can extract a 17402D feature vector. The specific process is as follows. Firstly, the RPS-BLAST program [74] is used to compare the protein sequence with each of the 17402 domain sequences. Secondly, if the significance threshold value (expect value) is no more than 0.001, the corresponding component of the 17402D feature vector is assigned 1; otherwise, it is assigned 0. In this way, we extract a 17402D feature vector in which each component is either 1 or 0.
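This thresholding step can be sketched as follows, assuming the RPS-BLAST hits have already been parsed into a mapping from domain index to best E-value (the parsing itself, and the index convention, are illustrative assumptions):

```python
def fund_vector(hits, n_domains=17402, e_cutoff=0.001):
    """Binary functional-domain composition vector: component d is 1 iff
    the query hit domain d with E-value <= cutoff.

    `hits` maps a 0-based domain index to the best RPS-BLAST E-value
    found for that domain (an assumed, already-parsed representation).
    """
    vec = [0] * n_domains
    for d, e in hits.items():
        if e <= e_cutoff:
            vec[d] = 1
    return vec

# Small example with 10 hypothetical domains: only the 1e-5 hit passes
v = fund_vector({3: 1e-5, 7: 0.01}, n_domains=10)
```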

2.3. The Proposed Ensemble Classifier

In this study, we propose a novel ensemble strategy which includes 5 individual steps. Step 1: 10 widely used machine learning classifiers, LMT [75], RandomForest [34], LibSVM [76], SimpleLogistic [75], RotationForest [38], SMO [77], NaiveBayes [78], RandomTree [79], FT [80], and SimpleCart [81], are selected, and a 5-fold cross validation is implemented on the DD-dataset. Step 2: the classifier with the highest accuracy on each feature group is chosen. Step 3: the corresponding models are obtained by training each feature group with its chosen classifier. The four models are the DSSP classification model, the AAsCPP classification model, the PSSM classification model, and the FunD classification model. The detailed process is shown in Figure 1. Step 4: features are extracted from the test dataset and the classification probability P(i, j) is obtained from the corresponding models, where i denotes a classification model, ranging from 1 to 4, and j denotes a fold index, ranging from 1 to the total number of fold classes (e.g., j ranges from 1 to 27 on the DD-dataset). Step 5: the average of the probabilities of the four models for each fold class is calculated. The fold class with the highest average probability is chosen as the classification result. The detailed process is shown in Figure 2.
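The final probability-averaging step can be sketched as follows. This is a minimal sketch; `model_probs`, holding one row of per-fold probabilities per model, is an assumed representation of the four models' outputs.

```python
def ensemble_predict(model_probs):
    """Average the per-fold probabilities of the individual models and
    return the 1-based index of the fold with the highest mean probability,
    together with the averaged probabilities."""
    n_models = len(model_probs)
    n_folds = len(model_probs[0])
    avg = [sum(p[j] for p in model_probs) / n_models for j in range(n_folds)]
    best = max(range(n_folds), key=lambda j: avg[j])
    return best + 1, avg

# Four models, two fold classes: fold 2 wins on average
fold, avg = ensemble_predict([[0.1, 0.9], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])
```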

The machine learning tool we used is WEKA (Waikato Environment for Knowledge Analysis) [56], a collection of machine learning classifiers for data mining tasks based on Java.

2.4. Measurement

In this study, the standard percentage accuracy is used to evaluate the proposed classification method, which allows us to compare our results with other researchers’ results [12, 13, 34]. The standard percentage accuracy is defined as
Q_i = c_i / n_i,  Q = C / N, (2)
where n_i represents the number of proteins belonging to class i, c_i represents the number of correctly classified proteins of class i in the test data, Q_i represents the classification accuracy of class i, k represents the total number of classes, N = Σ_{i=1}^{k} n_i represents the total number of tests, C = Σ_{i=1}^{k} c_i represents the total number of correctly classified data, and Q represents the overall classification accuracy.
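These accuracy definitions can be sketched directly:

```python
def accuracies(y_true, y_pred):
    """Per-class accuracy Q_i = c_i / n_i and overall accuracy Q = C / N."""
    per_class = {}
    for c in sorted(set(y_true)):
        members = [(t, p) for t, p in zip(y_true, y_pred) if t == c]
        per_class[c] = sum(t == p for t, p in members) / len(members)
    overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return per_class, overall

# Four test proteins from two fold classes; one class-1 protein misclassified
per_class, overall = accuracies([1, 1, 2, 2], [1, 2, 2, 2])
```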

3. Results and Discussion

3.1. Performance of ProFold

In order to test the performance of proFold, we first select the widely used DD-dataset for evaluation. The overall accuracy is 76.2%. A comparison with existing ensemble learning methods on the DD-dataset is shown in Table 5. From Table 5, we can see that the accuracies of the other methods are under 75%, and the accuracy of our method is about 3% higher than that of PFPA (2015) [40], the best of the other methods.

In order to further evaluate the performance of proFold, we also select two larger scale datasets: the EDD-dataset and the TG-dataset. Training and testing sets are not predefined for these two datasets, so a 10-fold cross validation is implemented on them.

We calculated the classification accuracy on the EDD-dataset by 10-fold cross validation repeated 10 times and compared the result with other methods. The results are shown in Table 6. We can see from the table that only the accuracies of Paliwal et al. and Lyons et al. exceed 90%, and both are lower than that of proFold. The results show that the advantage of proFold is obvious when larger scale datasets are used for validation.

Regarding the TG-dataset, we also performed experiments by 10-fold cross validation repeated 10 times and compared the results with other methods. The results are shown in Table 7. We can see from the table that, among the other methods, HMMFold (2015) achieved the highest accuracy, 93.8%. The accuracy of our method is 94.3%, which is higher than that of HMMFold. The TG-dataset has three more fold classes than the DD-dataset, and its scale is more than twice as large. The results show that the advantage of proFold is obvious when a dataset with more fold classes is tested.

3.2. Performance of the Proposed Ensemble Classifier

In the field of protein fold classification, many researchers have used ensemble learning methods [11, 18, 22, 23, 34–36, 38, 46, 51, 54, 79, 82–89]. The typical process of those ensemble strategies is as follows: integrate all features; select several basic classifiers for training; build an ensemble classifier according to the classification probabilities of each basic classifier. In this study, we find that redundancy among the features degrades the performance of such methods. Therefore, we propose a novel ensemble strategy.

We performed experiments on the DD-dataset. Firstly, the four feature groups are extracted and tested with the 10 basic classifiers by cross validation. The detailed test results are listed in Table 8. We can see from Table 8 that the best classifier is RandomForest for both the DSSP feature group and the AAsCPP feature group, while the best classifiers are RotationForest and FT for the PSSM and FunD features, respectively. Secondly, the four feature groups are trained with their corresponding basic classifiers to obtain four models. Finally, the models are tested on the DD-dataset. The overall accuracy is 76.2%. Our method improves the accuracy effectively compared with other existing ensemble learning methods.

In order to compare our ensemble strategy with the traditional ensemble strategy, we also performed experiments on the four feature groups with the traditional strategy: integrate the four feature groups, train models with RandomForest, RotationForest, and FT, respectively, and test the models on the DD-, EDD-, and TG-datasets. The classification accuracy of our ensemble strategy is 3% to 4% higher, as shown in Table 9. The results show that our ensemble strategy has better classification performance.

3.3. Accuracy Improvements with the DSSP Feature

In order to evaluate the influence of importing the DSSP feature, we calculated the classification accuracy of each fold class with and without the DSSP feature using the DD-dataset. The accuracies are shown in Table 10. From the table, we can see that the accuracies of some fold classes, such as Folds number 2, 4, 6, 12, 23, and 26, increased obviously after importing the DSSP feature. The overall accuracy increased from 71.3% to 76.2%. For example, the protein chain 1FAPB in the DD-dataset was incorrectly classified into Fold number 5 before importing the DSSP feature and was correctly reclassified into Fold number 4 afterwards. The results show that the DSSP feature has a significant effect on protein fold classification.

Since PDB files contain protein 3D structure information, we started from the PDB file of each protein in this study. The DSSP feature is extracted from the 3D structure in the PDB file, and the 3D structure of a protein is more stable than sequence-derived descriptions, which explains why the DSSP feature has a significant effect on protein fold classification.

4. Conclusion

In this study, we proposed a novel method called proFold. ProFold is an ensemble classifier combining protein structural and functional information. In terms of feature extraction, we imported the DSSP feature into protein fold classification for the first time. Experiments showed that the classification accuracy increases by about 5% on the DD-dataset after importing the DSSP feature. In terms of classification method, we proposed a novel ensemble classifier and improved the classification accuracy with it. The classification accuracies of proFold on the DD-, EDD-, and TG-datasets are 76.2%, 93.2%, and 94.3%, respectively, higher than those of existing similar methods. The results show that proFold is a relatively better classifier.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by grants from National Natural Science Foundation of China (Grant no. 61303099).