Abstract

Prediction of secreted protein types based solely on sequence data remains a challenging problem. In this study, we extract long-range correlation information and linear correlation information from the position-specific score matrix (PSSM). A total of 6800 features are extracted at 17 different gaps; then, 309 features are selected by a filter feature selection method based on the training set. To verify the performance of our method, jackknife and independent dataset tests are performed on the test set, and the reported overall accuracies are 93.60% and 100%, respectively. Comparison with the existing method shows that our method performs favorably for secreted protein type prediction.

1. Introduction

Protein secretion is a universal and important biological process that occurs in both eukaryotes and prokaryotes. In recent years, several secreted proteins have been identified as markers for disease typing and staging [1, 2] or for the development of drugs [3]. Most bacteria are able to secrete proteins, such as toxins and hydrolytic enzymes, into the extracellular environment. In this process, Gram-negative bacterial proteins have to be transported across two lipid bilayers, the cytoplasmic membrane (CM) and the outer membrane (OM) [4]. Proteins, including virulence factors involved in invasion, colonization, and survival within a host organism, are produced in pathogenic Gram-negative bacteria and are secreted to the cell exterior [5]. They play different roles in invaded eukaryotic cells and cause various diseases [4], so studying them is important for understanding the pathogenesis of diseases and for the development of drugs.

Secretion systems are capable of specifically recognizing their substrates and facilitating secretion without disturbing the barrier function of the cell envelope. However, they differ tremendously with respect to their functional mechanism and complexity. So far, eight secretion systems have been found in Gram-negative bacteria and named from the type I (T1SS) to the type VIII secretion system (T8SS) according to the OM secretion mechanisms [4]. Correspondingly, proteins released via the T1SS are called type I secreted proteins (T1SPs), and other types of proteins are named by analogy with this.

In fact, prediction tasks on protein datasets, such as protein structural class prediction and subcellular localization prediction, are typical and traditional pattern recognition problems. Generally, such a task is performed in three main steps: feature extraction, feature selection, and model selection for classification. Among the three steps, feature extraction is the most critical and challenging. Amino acid composition (AAC) [6–9], pseudoamino acid composition (PseAAC) [10–12], polypeptide composition [13], functional domain composition [14], and PSI-BLAST profile [15, 16] are all widely used feature extraction methods. To reduce the computational complexity and pick out the more informative features, a feature selection step is necessary. Principal component analysis (PCA) [17], SVM-RFE [18], and correlation-based feature selection (CFS) [19] have performed well in feature selection. Finally, choosing a powerful classification tool is also very important. Neural networks [8], support vector machines (SVM) [9, 20], fuzzy clustering [21], and rough sets [22] are commonly used.

In 2013, Yu et al. constructed a dataset of Gram-negative bacterial secreted proteins which contains 839 secreted proteins [23]. The proteins are collected from three data sources, namely, SwissProt, TrEMBL [24], and RefSeq [25]. They used an improved PseAAC consisting of amino acid composition (AAC) and autocovariance (AC) to extract information from PSI-BLAST profile. The support vector machine (SVM) is used to distinguish different types of secreted proteins in their paper and the reported highest overall accuracy of their method is 90.12%.

Recently, some researchers have tried to improve the prediction accuracy on protein datasets by combining the dipeptide composition with the PSI-BLAST profile [15, 16, 26–28]. These methods mainly focus on single-column information extraction, which rests on the hypothesis that two neighboring amino acids are independent; as a result, the correlation information between neighboring residues may be lost.

In this study, we also extract the evolutionary information from the PSI-BLAST profile, using a correlation method, to predict Gram-negative bacterial secreted protein types. A feature set consisting of 309 features is selected by the correlation-based feature selection (CFS) method based on the training set. With the selected 309 features, the jackknife test and the independent dataset test are performed on the test set using SVM. The results show that our method is reliable for secreted protein type prediction.

2. Materials and Methods

2.1. Materials

Yu et al. constructed a dataset of Gram-negative bacterial secreted proteins which contains 839 secreted proteins at a 25% sequence similarity threshold. The dataset is divided into a training set and a test set: 667 secreted proteins belong to the training set and the other 172 belong to the test set. The numbers of proteins of each type are listed in Table 1. In fact, 16 T6SPs and 24 T8SPs were also collected from several data sources, as shown in the paper of Yu et al.; however, owing to their small numbers and high sequence similarity, they are only suitable for phylogenetic analysis to understand the evolutionary history [23]. Hence, only six types of Gram-negative bacterial secreted proteins are considered. The datasets can be downloaded from http://web.xidian.edu.cn/slzhang/paper.html.

2.2. Feature Extraction

The PSI-BLAST profile is usually represented by a position-specific score matrix (PSSM), which carries abundant evolutionary information. The PSSM is calculated by running PSI-BLAST [29] against the SwissProt dataset with three iterations and a fixed cutoff value. Given a protein sequence, the PSSM gives a substitution score for each of the 20 amino acids at every position along the sequence. The PSSM is a log-odds matrix of size $L \times 20$, where $L$ is the length of the query amino acid sequence and 20 corresponds to the 20 amino acids. The $(i, j)$th entry of the matrix represents the score of the amino acid in the $i$th position of the query sequence being mutated to amino acid type $j$ during the evolution process.

In this study, the PSSM elements are scaled to the range from 0 to 1 using the following sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}},$$

where $x$ is the original PSSM value.
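The scaling step can be sketched as follows. This is a minimal illustration that assumes the sigmoid is the standard logistic function $1/(1 + e^{-x})$; the function name `scale_pssm` and the toy scores are hypothetical and are not taken from the paper.

```python
import math

def scale_pssm(pssm):
    """Map every raw PSSM score x to 1 / (1 + exp(-x)), which lies in (0, 1)."""
    return [[1.0 / (1.0 + math.exp(-score)) for score in row] for row in pssm]

# Toy 2 x 3 fragment of raw scores (a real PSSM row has 20 columns).
raw = [[-3, 0, 5], [2, -1, 0]]
scaled = scale_pssm(raw)
```

A raw score of 0 maps to 0.5, negative scores map below 0.5, and positive scores map above it, so the ordering of the scores is preserved.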

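For convenience, we denote $P = (P_1, P_2, \ldots, P_{20})$ as the PSSM of the query sequence $S$ with length $L$, where, for example, $P_j = (p_{1,j}, p_{2,j}, \ldots, p_{L,j})^{T}$, $T$ is the transpose operator, and $p_{i,j}$ ($i = 1, 2, \ldots, L$; $j = 1, 2, \ldots, 20$) denotes the score of the amino acid in the $i$th position of $S$ being mutated to the $j$th amino acid during the evolution process.

In our previous study, we combined the long-range correlation information and linear correlation information of $P_i$ and $P_j$ ($i, j = 1, 2, \ldots, 20$) to perform the feature extraction, and the linear correlation coefficient of $P_i$ and $P_j$ is used to reflect the average correlation between two residues separated by a gap of $g$ along the sequence [30]. For convenience, for a fixed $g$, we list the formulae as follows:

$$\bar{P}_i^{g} = \frac{1}{L-g}\sum_{k=1}^{L-g} p_{k,i}, \qquad \bar{Q}_j^{g} = \frac{1}{L-g}\sum_{k=1}^{L-g} p_{k+g,j}.$$

Then, we define

$$r_{i,j}^{g} = \frac{\sum_{k=1}^{L-g}\left(p_{k,i}-\bar{P}_i^{g}\right)\left(p_{k+g,j}-\bar{Q}_j^{g}\right)}{\sqrt{\sum_{k=1}^{L-g}\left(p_{k,i}-\bar{P}_i^{g}\right)^{2}}\,\sqrt{\sum_{k=1}^{L-g}\left(p_{k+g,j}-\bar{Q}_j^{g}\right)^{2}}}.$$

For a fixed $g$, we define

$$F^{g} = \left(r_{1,1}^{g}, r_{1,2}^{g}, \ldots, r_{1,20}^{g}, r_{2,1}^{g}, \ldots, r_{20,20}^{g}\right).$$

$F^{g}$ is a 400-dimensional vector, where $g = 0, 1, 2, \ldots$.

Suppose that the maximum value of $g$ is $G$; then the feature vector can be denoted by

$$F = \left(F^{0}, F^{1}, \ldots, F^{G}\right).$$

The dimension of the feature vector $F$ is $400 \times (G + 1)$. However, there may exist some irrelevant and redundant information among the extracted features, which can lead to poor prediction. Hence, a feature selection method is used.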
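The feature construction described above can be sketched as follows. This is a minimal illustration under one assumption: the linear correlation at gap g is taken to be the Pearson coefficient between column i at positions k and column j at positions k + g. The function names, the zero-variance fallback, and the toy matrix are hypothetical.

```python
import math

def lagged_correlation(col_i, col_j, g):
    """Pearson correlation between col_i[k] and col_j[k + g] for k = 0 .. L-g-1."""
    n = len(col_i) - g
    if n < 2:
        raise ValueError("sequence too short for this gap")
    a = col_i[:n]
    b = col_j[g:g + n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant column contributes no correlation
    return cov / math.sqrt(var_a * var_b)

def extract_features(pssm, max_gap):
    """Concatenate the per-gap blocks of column-pair correlations for g = 0 .. max_gap."""
    n_cols = len(pssm[0])
    columns = [[row[j] for row in pssm] for j in range(n_cols)]
    features = []
    for g in range(max_gap + 1):
        for i in range(n_cols):
            for j in range(n_cols):
                features.append(lagged_correlation(columns[i], columns[j], g))
    return features

# Toy scaled "PSSM" with 30 positions and 20 columns (made-up values in [0, 1]).
toy_pssm = [[((r * 7 + c * 3) % 11) / 10 for c in range(20)] for r in range(30)]
features = extract_features(toy_pssm, 16)
```

With 20 columns and gaps 0 through 16, the sketch yields 400 × 17 = 6800 values per protein, which matches the feature count quoted in the abstract.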

2.3. Feature Selection and the Selection of $G$

Feature selection reduces the dimensionality of the data and may allow learning algorithms to operate faster and more effectively. Wrapper and filter methods are the two main approaches to feature selection. To determine the value of the maximum gap $G$, the CFS method [19] is applied to the features to filter out poorly informative ones, with $G$ varying from 0 to 16. As shown in Hall’s paper, CFS, a filter method, in many cases gave results comparable to the wrapper and, in general, outperformed the wrapper on small datasets [19].
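To make the filter idea concrete, the sketch below scores a feature subset with the merit heuristic from Hall's CFS, $k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$, and grows the subset greedily. It is a simplified illustration rather than the procedure used in the paper: the correlations are assumed to be precomputed non-negative values, plain forward selection stands in for the search strategy, and all names and toy numbers are hypothetical.

```python
import math

def cfs_merit(feature_class_corr, feature_feature_corr, subset):
    """Merit of a subset: k * mean(r_cf) / sqrt(k + k * (k - 1) * mean(r_ff))."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(feature_class_corr[f] for f in subset) / k
    if k == 1:
        mean_ff = 0.0
    else:
        pairs = [(a, b) for idx, a in enumerate(subset) for b in subset[idx + 1:]]
        mean_ff = sum(feature_feature_corr[a][b] for a, b in pairs) / len(pairs)
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)

def greedy_cfs(feature_class_corr, feature_feature_corr):
    """Forward selection: keep adding the feature that raises the merit the most."""
    selected, best = [], 0.0
    remaining = list(range(len(feature_class_corr)))
    while remaining:
        scored = [(cfs_merit(feature_class_corr, feature_feature_corr, selected + [f]), f)
                  for f in remaining]
        merit, feature = max(scored)
        if merit <= best:
            break  # no candidate improves the merit any further
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best

# Toy correlations: features 0 and 1 are nearly redundant, feature 2 is complementary.
r_cf = [0.8, 0.7, 0.6]
r_ff = [[1.0, 0.95, 0.1],
        [0.95, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
selected, best_merit = greedy_cfs(r_cf, r_ff)
```

In this toy case the search keeps features 0 and 2 and rejects feature 1, because its high correlation with feature 0 lowers the merit even though it is individually predictive.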

Then, the jackknife test is performed on the training set based on the selected features. The overall accuracies on the training set at different values of $G$ are shown in Figure 1, from which we find that the highest overall accuracy on the training set is achieved at $G = 16$. Hence, in this paper, $G$ is set to 16. The numbers of selected features at each gap $g$ when $G = 16$ are listed in Table 2. Table 2 shows that the largest number of features selected at a single gap is 45, while the smallest is only 18. When $g$ is bigger than 10, the long-range correlation of residues becomes weaker as $g$ increases. This is consistent with the trend in Figure 1, where the overall accuracy becomes stable when $G$ is bigger than 10.
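The jackknife (leave-one-out) procedure itself can be sketched as follows. To keep the example self-contained, a 1-nearest-neighbor rule stands in for the SVM used in this paper; the function names and the toy samples are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier (not the SVM from the text): label of the closest sample."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    closest = min(range(len(train_x)), key=lambda k: sq_dist(train_x[k], query))
    return train_y[closest]

def jackknife_accuracy(samples, labels, predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    correct = 0
    for k in range(len(samples)):
        rest_x = samples[:k] + samples[k + 1:]
        rest_y = labels[:k] + labels[k + 1:]
        if predict(rest_x, rest_y, samples[k]) == labels[k]:
            correct += 1
    return correct / len(samples)

# Toy data: two tight clusters plus one sample whose label disagrees with its neighbors.
samples = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.5, 0.45]]
labels = ["A", "A", "B", "B", "B"]
accuracy = jackknife_accuracy(samples, labels, nearest_neighbor_predict)
```

Repeating this evaluation for each candidate maximum gap and keeping the value with the highest accuracy mirrors the selection step described above.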

2.4. Classification Algorithm Construction

SVM can often achieve superior classification performance in comparison with other classification algorithms. In this study, the support vector machine (SVM) classifier is employed as the classification algorithm. The radial basis function (RBF) is selected as the kernel function, which is defined as

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right),$$

where $\gamma$ is a kernel parameter and $x_i$ and $x_j$ are the feature vectors of the $i$th and $j$th proteins, respectively. The regularization parameter $C$ (used to control the trade-off between allowing training errors and forcing rigid margins) and the kernel parameter $\gamma$ are optimized by tenfold cross-validation on the training set. Candidate values of $C$ and $\gamma$ are taken from predefined grids; various $(C, \gamma)$ pairs are tried and the one with the best cross-validation accuracy is picked. The final classifier uses the selected pair.
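The kernel and the parameter search can be sketched as follows. The kernel is the standard RBF form $\exp(-\gamma \lVert x - y \rVert^2)$. The grids are hypothetical powers of two, since the ranges used in the paper are not available here, and `toy_cv_accuracy` is a made-up placeholder for a real tenfold cross-validation run.

```python
import math
from itertools import product

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def grid_search(c_values, gamma_values, cv_accuracy):
    """Return the (C, gamma) pair with the highest cross-validation score."""
    return max(product(c_values, gamma_values), key=lambda pair: cv_accuracy(*pair))

# Hypothetical grids; these are NOT the ranges used in the paper.
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

def toy_cv_accuracy(c, gamma):
    """Placeholder score peaking at C = 8, gamma = 0.125; replace with real tenfold CV."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 3)

best_c, best_gamma = grid_search(c_grid, gamma_grid, toy_cv_accuracy)
```

In practice the scoring callback would train an SVM on nine folds and evaluate it on the tenth for every pair; the exhaustive search is affordable because the grid is small.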

3. Prediction Assessment

The independent dataset test, subsampling test, and jackknife test are commonly used to examine the effectiveness of a predictor in statistical prediction. Here, the jackknife test and the independent dataset test are used to examine the power of our method. The standard performance measures, including the sensitivity (Sens), specificity (Spec), overall accuracy (OA), and Matthews correlation coefficient (MCC), are used to evaluate the prediction accuracy. The MCC value ranges between −1 and 1, where 0 represents random correlation, and bigger positive (negative) values indicate better (lower) prediction quality for a given class [31]. Explicitly, they are defined by the following formulas:

$$\mathrm{Sens} = \frac{TP}{TP + FN}, \qquad \mathrm{Spec} = \frac{TN}{TN + FP},$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

$$\mathrm{OA} = \frac{\text{number of correctly predicted proteins}}{\text{total number of proteins}},$$

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives for a given class.
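As an illustration of these measures, the sketch below computes them from true and predicted labels in a one-class-versus-rest fashion. It uses the standard definitions of sensitivity, specificity, and MCC, takes OA as the fraction of correctly predicted proteins, and returns 0 when a denominator is zero (an assumption made here for convenience). The function names and toy labels are hypothetical.

```python
import math

def per_class_metrics(true_labels, predicted_labels, target):
    """Sensitivity, specificity, and MCC for one class treated as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == target and p == target:
            tp += 1
        elif t != target and p == target:
            fp += 1
        elif t == target:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc

def overall_accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted type equals the true type."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

# Toy labels for three types; one T1 protein is misassigned to T2.
true_types = ["T1", "T1", "T2", "T2", "T3", "T3"]
pred_types = ["T1", "T2", "T2", "T2", "T3", "T3"]
sens_t2, spec_t2, mcc_t2 = per_class_metrics(true_types, pred_types, "T2")
oa = overall_accuracy(true_types, pred_types)
```

For the T2 class in this toy case, TP = 2, FP = 1, TN = 3, and FN = 0, so the sensitivity is 1 while the specificity drops to 0.75 because of the single false positive.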

4. Results

To evaluate the performance of our method, the jackknife test was performed on the training set and the test set, respectively. The detailed prediction results are listed in Table 3. Both overall accuracies are higher than 85%. Comparing the six types with each other, the prediction accuracies of the T1SP and T5SP types are both higher than 90% for the training set. For the training set, the prediction accuracy of T4SP is only 67.74%, which may be due to the class imbalance of this dataset. For the test set, the accuracies of the four types other than T1SP and T4SP are all higher than 90%. Excluding the T4SP type, the MCC values of the other five types are all higher than 0.9, which shows that our method is effective for predicting Gram-negative bacterial secreted protein types.

In addition, the independent dataset test is performed on the test set. The SVM model is trained on the training set and then used to predict the test set. All the types are predicted correctly, and the result is shown in Table 4; that is, our method obtains an overall accuracy of 100% on the test data. Compared with the result of Yu et al. [23] obtained by the “one-to-one” algorithm, the overall accuracy obtained by our method is 9.88 percentage points higher. Compared with the “one-to-the-rest” algorithm result of Yu’s method (2013), the overall accuracy of our method is 13.95 percentage points higher.

The result shows that the extracted information, especially the information extracted from different columns of PSSM, plays an important role in the improvement of the prediction accuracy. In addition, the combined information extracted at different gaps can provide more useful information for the prediction.

5. Conclusions

In recent years, more and more secreted proteins have been discovered in a variety of Gram-negative bacteria. Hence, determining the type of a newly discovered Gram-negative bacterial secreted protein is becoming an urgent research task. A dataset containing six types of Gram-negative bacterial secreted proteins was constructed by Yu et al. in 2013. In this paper, the long-range correlation information and linear correlation information are extracted from the position-specific score matrix (PSSM). The optimal residue distance is determined based on the training set. Results of the jackknife test and the independent dataset test on the test set show that our method is effective in predicting Gram-negative bacterial secreted protein types.

Competing Interests

The authors have declared that no conflict of interests exists.

Acknowledgments

The authors express their thanks to Dr. Yanzhi Guo for her kind help. This work is supported by Tianyuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426056) and the Scientific Research Fund of Liaoning Provincial Education Department (Grant no. L2014538) and the Independent Foundation of Dalian Nationalities University (Grant no. DC201502050401).