Special Issue: Application of Machine Learning Method in Genomics and Proteomics
acACS: Improving the Prediction Accuracy of Protein Subcellular Locations and Protein Classification by Incorporating the Average Chemical Shifts Composition
The chemical shift is sensitive to changes in the local environment and can therefore report structural changes. The structure information of a protein can be represented by the average chemical shift (ACS) composition, which has been broadly applied to enhance prediction accuracy in protein subcellular location and protein classification. However, different kinds of ACS composition suit different problems. We established an online web server named acACS, which converts the secondary structure into average chemical shifts and then composes the vector representing a protein by using the autocovariance algorithm. The server is easy to use and accepts protein sequences together with their secondary structure in batch mode.
1. Introduction
Knowledge of the subcellular localization of a protein may help to unravel its normal cellular function. Proteins within different compartments have different biological activities and functions; in turn, knowing the subcellular localization of a given protein helps in elucidating its functional role.
Recently, many computational approaches for subcellular localization prediction have been developed, and many methods for improving prediction accuracy have been applied. A predictor can be described from two aspects. One is the prediction algorithm, such as support vector machine (SVM) [2–11], neural network, increment of diversity (ID), random forest (RF), K-nearest neighbor (K-NN) [15, 16], generating algorithm, and so on, or a combination of them [16, 18]. The other is the information source: widely used sequence-based information, such as amino acid composition (AAC) and sorting signals [19–21]; textual descriptions of proteins [22, 23]; and protein physicochemical properties, gene ontology (GO), and so on. The structure information of a protein is also very important, especially for representing its subcellular location. However, the structure information of a protein cannot be easily described, and to our knowledge few methods make use of it.
In NMR spectroscopy, however, the chemical shift is an important parameter that is sensitive to changes in the local environment and can report structural changes. Sibley et al., Mielke and Krishnan, Spera and Bax, and Zhao et al. found that the ACS of a protein has an intrinsic correlation with the protein's secondary structure, and the function of a protein is determined by its structure. From this point of view, there should be a relationship among the averaged chemical shift, protein structure, and function [30, 31]. Wishart developed a web server, CS23D, for rapidly generating accurate 3D protein structures using only assigned NMR chemical shifts. More than 100 proteins from BMRB were tested, and the resulting structures generally exhibited good geometry and chemical shift agreement. There are also algorithms that predict chemical shifts from protein sequence and conformation [34–37]; however, few works have used chemical shifts to determine a protein's function [38, 39]. Therefore, how to make use of the chemical shift remains an important and urgent question.
In this paper, a benchmark data set of chemical shifts was constructed from 1,552 proteins derived from the BMRB website, and the chemical shift values of the $^{15}$N, $^{13}$C$_{\alpha}$, $^{1}$H$_{\alpha}$, and $^{1}$H$_{N}$ backbone atoms were extracted for the 20 amino acid residues. Four types of average chemical shift for the 20 amino acid residues were then calculated, and the autocovariance algorithm was used to convert the average chemical shifts into a vector describing the protein sample. The resulting algorithm, acACS (autocovariance of averaged chemical shifts), has been used to enhance prediction accuracy in protein subcellular location. The proposed acACS descriptor can be considered a mode of generalized pseudoamino acid composition. Recently, the generalized pseudoamino acid composition methods have been systematically implemented in two software packages, PseAAC-Builder and PseAAC-General. For the readers' convenience in using the current method, the acACS descriptor may be integrated into this software in future work. The details of the calculation and of how to use the method are given below.
2. Material and Methods
2.1. Data Sets
Electrons circulating around a nucleus, for example a proton, induce a small local magnetic field that partially shields the nucleus from the external magnetic field. Thus, the absorption frequency of a proton in a given chemical environment is shifted relative to its absorption frequency under standard conditions. The chemical shift is this relative shift of the resonance frequency between a given chemical environment and the standard, and it can be measured by NMR spectroscopy. Due to its sensitivity to the local environment, such as the backbone dihedral angles and the secondary structure type [26, 27, 29], the chemical shift can serve as an indicator of changes in local conformation.
In order to find the correlation between chemical shift and the secondary structure of a protein, we constructed a high-quality working data set by the following steps: (1) the protein star files with NMR spectroscopy data were downloaded from BMRB; (2) proteins with fewer than 50 residues or not matched to PDB entries were discarded; (3) proteins with sequence identity higher than 40% were excluded by CD-HIT. The final benchmark data set contains 1,552 protein sequences together with their BMRB star files, which are the original chemical shift data files for all kinds of backbone atoms of each protein. The data set is available at our website. We analyzed the averaged chemical shifts for every amino acid type and secondary structure type in order to find the rules relating them, and then used the autocovariance algorithm to calculate the feature vectors of the protein sequences from the statistical results. The feature vectors representing the protein sequences can be used in subcellular location prediction or other protein classification problems. Researchers may also develop better algorithms for protein representation using the data set.
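Steps (1) and (2) of the data set construction can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Entry` record and its field names are hypothetical, and step (3), the 40% identity clustering, is performed by the external CD-HIT tool and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Proteins with fewer than 50 residues are discarded (step 2 in the text).
MIN_LENGTH = 50


@dataclass
class Entry:
    """Hypothetical record for one BMRB entry (field names are assumptions)."""
    bmrb_id: str
    sequence: str
    pdb_id: Optional[str]  # None when no matching PDB entry exists


def keep_entry(entry: Entry) -> bool:
    """Return True if the entry is long enough and matched to a PDB entry."""
    return len(entry.sequence) >= MIN_LENGTH and entry.pdb_id is not None


def filter_entries(entries: List[Entry]) -> List[Entry]:
    """Apply the length and PDB-match filters, preserving input order."""
    return [entry for entry in entries if keep_entry(entry)]
```

The surviving sequences would then be passed to CD-HIT for redundancy removal before the chemical shift statistics are computed.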
2.2. Averaged Chemical Shift (ACS)
In order to find the rule relating chemical shifts to structure information, statistics of the averaged chemical shift with respect to secondary structure and amino acid type were compiled.
Firstly, the chemical shift values of four backbone atoms, $^{15}$N, $^{13}$C$_{\alpha}$, $^{1}$H$_{\alpha}$, and $^{1}$H$_{N}$, were extracted for every amino acid residue from the BMRB star file for further calculation. The BMRB star file gives the amino acid residues, the four kinds of protein backbone atoms of each amino acid residue, and the matched PDB file. For example, "bmr447.str" was extracted into four files: N_447.txt, Ca_447.txt, Ha_447.txt, and Hn_447.txt, which correspond to the $^{15}$N, $^{13}$C$_{\alpha}$, $^{1}$H$_{\alpha}$, and $^{1}$H$_{N}$ protein backbone atoms.
Secondly, the secondary structure information was extracted from the PDB file matched to the BMRB star file. The secondary structure type of each amino acid residue is denoted by H, E, or C. Then the averaged chemical shifts for all the residues were calculated.
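For protein backbone atom "$k$" of amino acid type "$i$" with secondary structure type "$j$," the averaged chemical shift (ACS) is defined as

$$\bar{C}_{i,j}^{k} = \frac{1}{N_{i,j}^{k}} \sum_{n=1}^{N_{i,j}^{k}} CS_{i,j}^{k}(n). \quad (1)$$

Here $k$ = $^{15}$N, $^{13}$C$_{\alpha}$, $^{1}$H$_{\alpha}$, or $^{1}$H$_{N}$; $i$ is one of the 20 amino acids; and $j$ stands for the secondary structure type (H, E, or C) from DSSP (H = helix, E = strand, and C = the rest). $CS_{i,j}^{k}(n)$ is the $n$th chemical shift value extracted from the BMRB star file for this combination, and $N_{i,j}^{k}$ is the number of such values.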
By calculating the residues' ACS with (1) for the 1,552 proteins, we found that the ACS varies regularly with the secondary structure type and the residue type. The statistical results of the averaged chemical shifts are listed in four tables, which can be accessed from our website. As an example, the ACS of one backbone atom for each of the 20 native amino acid residues in the three types of secondary structure is shown in Figure 1. According to Figure 1, the ACS can be used to represent a protein's secondary structure. To illustrate the algorithm, the flowchart of the ACS calculation is given in Figure 2.
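The averaging in (1) amounts to grouping the extracted chemical shifts by backbone atom, amino acid type, and secondary structure type and taking the mean of each group. The following is a minimal sketch under stated assumptions: the record layout `(atom, residue, dssp_state, shift)` and the function names are hypothetical, and the three-state reduction follows the text literally (H = helix, E = strand, everything else = C).

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (backbone atom, amino acid one-letter code, three-state secondary structure)
Key = Tuple[str, str, str]
# Hypothetical input record: (atom, residue, DSSP state, chemical shift in ppm)
Record = Tuple[str, str, str, float]


def reduce_dssp(state: str) -> str:
    """Map a DSSP state to H, E, or C (C = everything that is not H or E)."""
    return state if state in ("H", "E") else "C"


def average_chemical_shifts(records: Iterable[Record]) -> Dict[Key, float]:
    """Return the mean chemical shift for every (atom, residue, structure) group."""
    totals: Dict[Key, float] = defaultdict(float)
    counts: Dict[Key, int] = defaultdict(int)
    for atom, residue, dssp_state, shift in records:
        key = (atom, residue, reduce_dssp(dssp_state))
        totals[key] += shift
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

With four backbone atoms, 20 amino acids, and three structure types, a complete table produced this way has at most 240 entries, which matches the four tables of statistics mentioned above (one per atom).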
2.3. Algorithm of Autocovariance of Average Chemical Shift (acACS)
In order to obtain the correlation information between amino acids of a protein, the autocovariance of ACS was calculated. For a protein $P$,

$$P = R_1 R_2 R_3 \cdots R_L. \quad (2)$$

Here, $L$ is the sequence length and $R_i$ is the amino acid in position $i$.
Then, the amino acid $R_i$ in protein $P$ was replaced by its ACS "$C_i^k$" according to its secondary structure type $j$. When $P$ was redefined as $P^k$, $P^k$ can be expressed as

$$P^k = C_1^k C_2^k C_3^k \cdots C_L^k. \quad (3)$$

Then, the autocovariance algorithm was used to calculate the correlation factor $\phi_\lambda^k$ between the amino acids $R_i$ and $R_{i+\lambda}$, taken over all positions $i$ of the sequence.
After the above calculation, the protein $P$ can be expressed as follows:

$$P = \left[\phi_0^k, \phi_1^k, \phi_2^k, \ldots, \phi_\lambda^k\right]^{T}.$$

Here, $\phi_\lambda^k$ is the correlation factor of average chemical shift $C_i^k$ with average chemical shift $C_{i+\lambda}^k$. In particular, for $\lambda = 0$, the factor $\phi_0^k$ was replaced by the average chemical shift in order to make use of the ACS itself. The factor $\lambda$ is a nonnegative integer and reflects the rank of correlation. Depending on the problem, a suitable value of $\lambda$ and a suitable choice of backbone atom $k$ should be selected to obtain the best result.
To give a pictorial representation of the technique, a flow diagram showing how acACS works is given in Figure 3.
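The construction described in this section can be sketched as follows. This is an illustration under explicit assumptions, not the authors' implementation: the lag statistic used here is the textbook mean-centred autocovariance, which may differ from the equation used by acACS, and the lag-0 entry is taken to be the mean ACS over the sequence, which is one reading of "replaced by the average chemical shift." The function names and the `acs_table` layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# (backbone atom, amino acid one-letter code, three-state secondary structure)
Key = Tuple[str, str, str]


def acs_profile(sequence: str, structure: str, atom: str,
                acs_table: Dict[Key, float]) -> List[float]:
    """Replace every residue by its ACS for the given atom and structure type."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and secondary structure differ in length")
    return [acs_table[(atom, aa, ss)] for aa, ss in zip(sequence, structure)]


def lag_autocovariance(values: Sequence[float], lag: int) -> float:
    """Textbook autocovariance at the given lag (an assumed form, see lead-in)."""
    n = len(values)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < sequence length")
    mean = sum(values) / n
    products = ((values[i] - mean) * (values[i + lag] - mean)
                for i in range(n - lag))
    return sum(products) / (n - lag)


def acacs_vector(values: Sequence[float], max_lag: int) -> List[float]:
    """Return [mean ACS, factor at lag 1, ..., factor at lag max_lag]."""
    mean = sum(values) / len(values)  # assumed stand-in for the lag-0 term
    return [mean] + [lag_autocovariance(values, lag)
                     for lag in range(1, max_lag + 1)]
```

For one backbone atom the vector has `max_lag + 1` components; when several atoms are selected, the per-atom vectors would be concatenated, so the feature dimension grows with both the lag and the number of atoms.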
3. Results and Discussion
Using the acACS algorithm, we represented the protein samples and accurately predicted submitochondria locations. We tested the model on the SML3-983 data set, which was released with SubMito-PSPCP. The data set has 983 protein sequences divided into three locations: 661 sequences from the inner membrane, 177 sequences from the matrix, and 145 sequences from the outer membrane. We selected acACS combined with AAC, DC, PSSM, GO, and reduced physicochemical properties (Hn) as feature vectors for representing the proteins and then trained the model. An accuracy of 90.74% was obtained for the SML3-983 data set with jackknife cross-validation, which is 1.63 percentage points higher than SubMito-PSPCP. To assess the contribution of acACS, the feature vector was recombined from AAC, DC, PSSM, GO, and Hn, without acACS. We then trained the model again and obtained a prediction accuracy of 89.52%, a drop of about 1.2 percentage points.
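Jackknife cross-validation, as used for the accuracy figures above, holds out each protein in turn, trains on all the others, and counts how often the held-out protein is predicted correctly. The sketch below illustrates the procedure only: the classifier actually used is not specified in this section, so a one-nearest-neighbour rule is used purely as a stand-in, and all names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
# A classifier takes training vectors, training labels, and a query vector.
Classifier = Callable[[List[Vector], List[str], Vector], str]


def nearest_neighbour(train_x: List[Vector], train_y: List[str],
                      query: Vector) -> str:
    """Stand-in classifier: label of the closest training vector."""
    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(train_x)),
               key=lambda i: squared_distance(train_x[i], query))
    return train_y[best]


def jackknife_accuracy(features: List[Vector], labels: List[str],
                       classify: Classifier) -> float:
    """Leave each sample out once and return the fraction predicted correctly."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Comparing feature sets, as done above for the vectors with and without acACS, then means calling `jackknife_accuracy` once per feature set on the same proteins and labels.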
The acACS algorithm has also been checked in our previous works [49–52]. In subcellular location prediction, we compared the results with and without acACS for submitochondria locations and for mycobacterial protein subcellular locations and obtained better results with acACS, as listed in Tables 1 and 2. The acACS feature vector can also be used for other kinds of protein prediction problems. In the prediction of acidic and alkaline enzymes and in the discrimination of bioluminescent and nonbioluminescent proteins, it improved the prediction accuracy by about 1.3%, as listed in Table 3.
The functions of a protein, including its subcellular location, are largely determined by its structure. Developing novel methods for improving the prediction of protein subcellular locations is therefore urgent. In the past, however, the feature vectors used by prediction methods were almost all sequence-based, and almost all state-of-the-art methods tried to incorporate further sequence-based information as a complement. Our method provides structure-based information; it can serve as a good complement to the sequence-based methods and can be used for other kinds of protein-related problems. In practice, these methods can work side by side and support each other.
The chemical shift incorporates structure information directly, so it can represent the protein sample better. How best to make use of the chemical shift is still a hot topic for biologists and chemists. In this work, we used the autocovariance algorithm to process the averaged chemical shifts and obtained better results, but there are certainly improvements that could be made to acACS, and other algorithms may be adopted in the future to find a better representation of the protein samples. At the present stage, the method is not fully convenient for the user, because both the protein sequence and its secondary structure information are needed to calculate the chemical shift. In the future, the user will not need to supply the secondary structure information, since it will be integrated into the algorithm.
4. Conclusions
In this work, a raw chemical shift data set and an averaged chemical shift data set were constructed. The averaged chemical shifts were then calculated and the acACS algorithm was presented. To check the performance of acACS, it was applied to protein submitochondria locations, mycobacterial protein subcellular locations, bioluminescent protein discrimination, and acidic and alkaline enzyme classification. Based on the results obtained, acACS can improve the prediction accuracy by roughly 1% to 2%, and its performance depends on the correlation factor $\lambda$ and on the backbone atoms used. Some recent studies showed that profile-based features [54–56], pseudoamino acid composition (PseAAC), and features based on physicochemical properties of amino acids were able to improve the performance of many computational predictors for protein remote homology detection, protein binding site identification, and so forth. Therefore, these features should be studied for protein subcellular location prediction in future work.
We have developed a web server, acACS, which automatically produces the vectors of proteins when a user submits the protein sequences along with their secondary structure in batch mode. The data set can be a useful resource for biomolecular NMR spectroscopists, and acACS will be of benefit to proteomic research. We expect the current work to advance the prediction of protein subcellular locations and to promote study in related areas.
5. Web Server and User Guide
To enhance the value of its practical applications, a web server for the acACS generator was established. For the convenience of the user, a step-by-step guide on how to use the web server to get the desired results is provided here.
Step 1. Open the web server at http://wlxy.imu.edu.cn/college/biostation/fuwu/acACS/index.asp and you will see the top page of the acACS on your computer screen, as shown in Figure 4. Click on the Read Me button to see a brief introduction about the acACS.
Step 2. Either type or copy/paste the query protein sequence into the input box at the center of Figure 4, and then copy/paste the secondary structure of the protein sequence on the next line. The input sequence should be in "ONE LINE" format. For examples of sequences in ONE LINE format, click the "?" button above the input box.
Step 3. Enter the Lambda value in the input box to the right of the Lambda label.
Step 4. Check the boxes for the backbone atoms whose chemical shifts should be used.
Step 5. Click on the Submit button to see the result page. For example, if you use the default example sequences, Lambda, and atoms in the window, after clicking the Submit button you will see messages on the screen beginning "The lamda you have chosen is 12", followed by a message listing the atoms of chemical shift you have chosen and then "The acACSs of the proteins you submitted are......". For the first protein, the acACS of the first selected atom is given, followed by the acACS of the next selected atom; the acACS of the second protein follows, then the third, and so forth.
Step 6. Click the ACS of atoms and data set button to download the benchmark dataset used to calculate the ACS.
Step 7. Click the Citation button to find the relevant papers that document the detailed development and algorithm of acACS.
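Before submitting a batch in Step 2, it can help to check the input locally. The sketch below assumes, as a reading of Step 2, that the batch consists of alternating lines (a sequence line followed by its secondary structure line) and that the structure line uses the three-state alphabet H, E, C; the server's exact format is shown by its "?" button and may differ. The function name is hypothetical.

```python
from typing import List, Tuple

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 native amino acids
STRUCTURE_STATES = set("HEC")              # helix, strand, the rest


def parse_batch(text: str) -> List[Tuple[str, str]]:
    """Split a batch into (sequence, structure) pairs and validate each pair."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("each sequence line needs a structure line after it")
    pairs = []
    for sequence, structure in zip(lines[0::2], lines[1::2]):
        sequence, structure = sequence.upper(), structure.upper()
        if len(sequence) != len(structure):
            raise ValueError("sequence and structure differ in length")
        if not set(sequence) <= AMINO_ACIDS:
            raise ValueError("sequence contains a non-standard residue code")
        if not set(structure) <= STRUCTURE_STATES:
            raise ValueError("structure line must use only H, E, and C")
        pairs.append((sequence, structure))
    return pairs
```

A batch that passes this check has one secondary structure symbol per residue, which is the property the ACS substitution relies on.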
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors would like to thank the reviewers for their helpful comments on the paper. This work was supported by grants from the National Natural Science Foundation of China (61063016 and 31160188), the Scientific Research Program at Universities of Inner Mongolia Autonomous Region of China (NJZY13014), the Natural Science Foundation of Inner Mongolia Autonomous Region of China (2013MS0504 and 2013MS0503), and the Program of Higher-level Talents of Inner Mongolia University (135147).
References
Y. C. Zuo, Y. Peng, L. Liu, W. Chen, L. Yang, and G. L. Fan, “Predicting peroxidase subcellular location by hybridizing different descriptors of Chou' pseudo amino acid patterns,” Analytical Biochemistry, vol. 458, pp. 14–19, 2014.
Y. D. Cai and K. C. Chou, “Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition,” Biochemical and Biophysical Research Communications, vol. 305, no. 2, pp. 407–411, 2003.
S. Brady and H. Shatkay, “EPILOC: a (working) text-based system for predicting protein subcellular location,” in Proceedings of the 13th Pacific Symposium on Biocomputing (PSB '08), pp. 604–615, January 2008.
P. Du, S. Gu, and Y. Jiao, “PseAAC-general: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets,” International Journal of Molecular Sciences, vol. 15, no. 3, pp. 3495–3506, 2014.
B. Liu, J. Xu, and Q. Zou, “Using distances between Top-n-gram and residue pairs for protein remote homology detection,” BMC Bioinformatics, vol. 15, article S3, 2014.