Special Issue: Scalable Data Mining Algorithms in Computational Biology and Biomedicine
Research Article | Open Access
Identification of Bacterial Cell Wall Lyases via Pseudo Amino Acid Composition
Abstract
Owing to the abuse of antibiotics, drug resistance in pathogenic bacteria is becoming increasingly serious, and more rational ways of addressing it are needed. Because bacterial cell wall lyases can destroy the bacterial cell structure and thereby kill the infectious bacterium, they are suitable candidates for antibacterial agents. An accurate and efficient computational method for predicting lyases is therefore needed. In this paper, an objective and rigorous dataset was collected from the Universal Protein Resource (UniProt), after which a feature selection technique based on the analysis of variance (ANOVA) was used to obtain the optimal feature subset. Finally, a support vector machine (SVM) was used to perform the prediction. Jackknife cross-validation showed that the optimal average accuracy of 84.82% was achieved, with a sensitivity of 76.47% and a specificity of 93.16%. For the convenience of other scholars, we built a free online server called Lypred. We believe that Lypred will become a practical tool for research on cell wall lyases and the development of antimicrobial agents.
1. Introduction
Bacteria are widely distributed on the earth, and a significant proportion of them can cause disease. Antibiotics can efficiently treat infectious diseases caused by pathogens. However, the abuse of antibiotics may cause bacterial drug resistance. Thus, there is an ever-increasing need to find new ways to address this important issue [1, 2]. In the search for more effective therapeutic strategies, great effort has been placed on the study and development of lyases, which benefit from highly potent activity against drug-resistant strains and a low inherent susceptibility to the emergence of new resistance phenotypes [3–7].
In 1896, the British bacteriologist Hankin found that the bacteriophage has antibacterial activity [3]. Subsequently, in 1921, Brunoghe and Maisin used bacteriophage to treat staphylococcal skin disease in France, which was the first reported application of bacteriophage to the treatment of infectious diseases [8]. Maxted [9], Krause [10], and Fischetti et al. [11] found that the lysates of Group C streptococci infected with C1 bacteriophage contain an enzyme that can lyse streptococci and their isolated cell walls. This enzyme, called endolysin, is encoded by a bacteriophage gene and kills bacteria by degrading the cell wall. It has been reported that 10 ng of endolysin can lyse $10^{7}$ bacteria within 30 seconds [4, 12].
Autolysins are another kind of lyase. They are functionally similar to endolysins, except that they are encoded by bacteria [13]. It has been reported that autolysins play important roles in several fundamental biological phenomena, such as cell wall enlargement, genetic transformation, flagella extrusion, cell division, and lysis induced by β-lactam antibiotics, as well as in the “suicidal tendencies” of pneumococci [14–16].
Owing to their special biological activity, lyases have been applied in antibacterial drug development. Thus, intensive research on lyases is necessary to understand their antibacterial mechanism. Although wet experiments are an objective approach for accurately recognizing lyases, they are often time-consuming and costly. Because of their convenience and high efficiency, computational methods have attracted more and more attention. Many algorithms have been developed for protein function prediction, such as the common support vector machine (SVM) [17–19], structured SVM [20], artificial neural network (ANN) [21], Random Forest (RF) [22], K-nearest neighbor (KNN) [23–25], Bayesian classifier [26, 27], Mahalanobis discriminant [28, 29], LibD3C [30], genetic algorithm [31], imbalanced classifier [32], learning to rank [33], and ensemble learning [34, 35]. Various sequence feature descriptors, such as amino acid composition [36, 37], pseudo amino acid composition (PseAAC) [38], physicochemical properties [39], secondary structure features [40], and N-peptide composition [41], have been proposed to represent protein sequences [42].
To address the problem of lyase prediction, a method was recently developed to identify cell wall enzymes by using PseAAC and the Fisher discriminant [43]. A maximum overall accuracy of 80.4% was obtained, with a sensitivity of 66.7% and a specificity of 88.6% [43]. However, further work is needed for the following reasons. (i) The prediction quality can be further improved. (ii) No web server was provided for the prediction method in [43], and hence its usage is quite limited, especially for the majority of experimental scientists.
The present study was devoted to the development of a new predictor for identifying lyases. For this purpose, an objective and strict benchmark dataset was constructed for training and testing the proposed model, in which protein sequences were formulated by using an improved PseAAC. For the convenience of other scholars, a free online server called Lypred (at http://lin.uestc.edu.cn/server/Lypred/) was established.
2. Materials and Methods
2.1. Benchmark Dataset
A high-quality dataset is the key to building a robust and accurate predictor. The lyases in bacteria or bacteriophage were regarded as positive samples and were derived from UniProt [44]. Negative samples, namely, the nonlyases, were also derived from bacteriophage and downloaded from UniProt. To guarantee the reliability of the benchmark dataset, we filtered the data according to the following standards. Firstly, sequences annotated as “Inferred from homology” or “Predicted” were excluded. Secondly, sequences that are fragments of other proteins were removed. Thirdly, protein sequences containing unknown or ambiguous residues, that is, symbols other than the 20 standard amino acids, were eliminated. Fourthly, to avoid overestimating the prediction model because of high sequence identity, the CD-HIT program [45] was used to eliminate redundant sequences with the sequence identity cutoff set to 40%. As a result, a total of 68 lyases and 307 nonlyases were obtained to form the final benchmark dataset.
2.2. Features Extraction
A sequence can be represented in two different forms: the sequential form and the discrete form [46]. The most common and straightforward way to characterize a protein is to use all the residues in its sequence, written as follows:
$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L, \tag{1}$$
where $R_1$, $R_2$, and $R_L$ are the 1st, 2nd, and $L$th amino acid residues of protein $\mathbf{P}$, respectively. Based on this information, a query protein can be predicted by the BLAST or FASTA program. The results are usually good for a query sequence that has highly similar sequences in the benchmark dataset; however, the approach fails when no similar sequence is found in the training dataset [47]. Therefore, the similarity-based method is not suitable when no homologue exists in the benchmark dataset. The discrete form overcomes this shortcoming and is easy to handle in statistical prediction. Thus, it has been widely used in protein and DNA formulation [48, 49]. The PseAAC is a typical discrete form that has been widely used for protein function prediction [46, 50, 51].
It is well known that polypeptide chains fold into tertiary structures according to the physicochemical properties of their residues. Thus, analyzing only the residue composition of protein molecules is not enough. Hence, we proposed to represent protein samples by using an improved PseAAC that includes not only the $g$-gap dipeptide composition but also the correlation of physicochemical properties between two residues.
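According to the concept of PseAAC, a protein $\mathbf{P}$ with the length of $L$ can be formulated in a $(400 + 9\lambda)$ dimension space as given by
$$\mathbf{P} = \left[x_1, x_2, \ldots, x_{400}, x_{401}, \ldots, x_{400+9\lambda}\right]^{T}, \tag{2}$$
where
$$x_u = \begin{cases} \dfrac{f_u^{g}}{\sum_{i=1}^{400} f_i^{g} + \sum_{j=1}^{9\lambda} \theta_j}, & 1 \le u \le 400, \\[2ex] \dfrac{\theta_{u-400}}{\sum_{i=1}^{400} f_i^{g} + \sum_{j=1}^{9\lambda} \theta_j}, & 401 \le u \le 400 + 9\lambda, \end{cases} \tag{3}$$
where $f_u^{g}$ denotes the normalized occurrence frequency of the $u$th kind of $g$-gap dipeptide in protein $\mathbf{P}$, formulated as
$$f_u^{g} = \frac{n_u^{g}}{L - g - 1}, \tag{4}$$
where $n_u^{g}$ ($u = 1, 2, \ldots, 400$) denotes the number of the $u$th $g$-gap dipeptide in $\mathbf{P}$.
$\theta_j$ in (3) is the $j$-tier sequence correlation factor calculated by the following formulas:
$$\theta_{9(k-1)+\xi} = \frac{1}{L-k} \sum_{i=1}^{L-k} H_{i,i+k}^{\xi}, \quad \xi = 1, 2, \ldots, 9; \; k = 1, 2, \ldots, \lambda; \; \lambda < L. \tag{5}$$
The correlation of physicochemical property between two residues is given by
$$H_{i,j}^{\xi} = h^{\xi}(R_i) \cdot h^{\xi}(R_j), \tag{6}$$
where $h^{\xi}(R_i)$ denotes the $\xi$th physicochemical value of amino acid residue $R_i$. The value is obtained by
$$h^{\xi}(R_i) = \frac{h_0^{\xi}(R_i) - \left\langle h_0^{\xi} \right\rangle}{\mathrm{SD}\left(h_0^{\xi}\right)}, \tag{7}$$
where $h_0^{\xi}(R_i)$ is the $\xi$th physicochemical original value of amino acid $R_i$, and the mean $\langle h_0^{\xi} \rangle$ and standard deviation $\mathrm{SD}(h_0^{\xi})$ are taken over the 20 amino acids.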
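The feature construction described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that the dipeptide frequency is the count divided by the number of residue pairs separated by the gap, that each correlation factor is the mean product of standardized property values for residues a fixed number of positions apart, and that standardization uses the population standard deviation over the 20 amino acids. The function names and `TOY_SCALE` are hypothetical; `TOY_SCALE` is a made-up property scale standing in for the nine real scales of Table 1. The final normalization that combines the two parts into one vector is left out.

```python
from itertools import product
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def gap_dipeptide_freqs(seq, g):
    """Frequency of each of the 400 dipeptides whose residues are g positions apart."""
    n_pairs = len(seq) - g - 1
    if n_pairs <= 0:
        raise ValueError("sequence too short for this gap")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        counts[seq[i] + seq[i + g + 1]] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]


def standardize(scale):
    """Z-score a property scale over the 20 amino acids (population SD, an assumption)."""
    mu = mean(scale.values())
    sd = pstdev(scale.values())
    return {aa: (value - mu) / sd for aa, value in scale.items()}


def correlation_factors(seq, scales, lam):
    """One factor per (tier, property): mean product of standardized values."""
    if lam >= len(seq):
        raise ValueError("lam must be smaller than the sequence length")
    standardized = [standardize(s) for s in scales]
    thetas = []
    for k in range(1, lam + 1):        # tier: residues k positions apart
        n = len(seq) - k
        for h in standardized:         # one factor per property scale
            thetas.append(sum(h[seq[i]] * h[seq[i + k]] for i in range(n)) / n)
    return thetas


def improved_pseaac(seq, scales, g, lam):
    """Concatenate dipeptide frequencies and correlation factors (unnormalized)."""
    return gap_dipeptide_freqs(seq, g) + correlation_factors(seq, scales, lam)


# Hypothetical property scale used only to make the example runnable.
TOY_SCALE = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
vector = improved_pseaac("ACDEFGHIKL", [TOY_SCALE] * 9, g=4, lam=7)
print(len(vector))  # 400 dipeptide features plus 9 * 7 correlation factors
```

With nine property scales, the vector length is 400 plus nine times the number of tiers, which is why larger tier counts increase the feature space that the later selection step has to prune.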
2.3. Feature Selection
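Some features are noisy or redundant and reduce the predictive performance of classification models. Thus, it is very important to evaluate the contribution of every feature to the classification. Here, we used ANOVA [52] to rank features. The score is defined as
$$F(\xi) = \frac{s_B^2(\xi)}{s_W^2(\xi)}, \tag{8}$$
with the between-group and within-group variances
$$s_B^2(\xi) = \frac{1}{K-1} \sum_{i=1}^{K} m_i \left( \frac{\sum_{j=1}^{m_i} f_\xi(i,j)}{m_i} - \frac{\sum_{i=1}^{K} \sum_{j=1}^{m_i} f_\xi(i,j)}{\sum_{i=1}^{K} m_i} \right)^{2}, \qquad s_W^2(\xi) = \frac{1}{N-K} \sum_{i=1}^{K} \sum_{j=1}^{m_i} \left( f_\xi(i,j) - \frac{\sum_{j=1}^{m_i} f_\xi(i,j)}{m_i} \right)^{2},$$
where $F(\xi)$ represents the $F$-score of the $\xi$th feature type, $f_\xi(i,j)$ is the feature value of the $\xi$th feature type of the $j$th sample in the $i$th protein type, $m_i$ is the number of samples in the $i$th protein type, $K$ is the number of protein types (here 2), and $N$ is the total number of samples. The larger the $F(\xi)$ value, the better the discriminative capability of the $\xi$th feature.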
To eliminate the redundant features, we first ranked all features by their $F$-score from high to low. The first feature subset contained only the feature with the largest $F$-score; a new feature subset was then generated by adding the feature with the second largest $F$-score. The process was repeated until all features had been added. The SVM was used to evaluate the performance of each feature subset. The feature subset with the best performance was deemed the optimal feature subset, free of redundant features.
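The ranking and incremental selection procedure can be sketched as below. This is an illustrative sketch under stated assumptions: the score is the standard one-way ANOVA ratio of between-group to within-group variance, and `evaluate` is a hypothetical callback that returns a performance value (for example, a cross-validated accuracy from an SVM) for a given list of feature indices. All function names are illustrative.

```python
from statistics import mean


def f_score(groups):
    """One-way ANOVA F for one feature; groups holds one list of values per class."""
    k = len(groups)
    n = sum(len(values) for values in groups)
    grand = sum(sum(values) for values in groups) / n
    between = sum(len(v) * (mean(v) - grand) ** 2 for v in groups) / (k - 1)
    within = sum((x - mean(v)) ** 2 for v in groups for x in v) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(X, y):
    """Feature indices sorted from the highest to the lowest F-score."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        groups = [[row[j] for row, label in zip(X, y) if label == c] for c in classes]
        scores.append(f_score(groups))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def incremental_selection(X, y, evaluate):
    """Add ranked features one at a time and keep the best-scoring subset."""
    order = rank_features(X, y)
    best_score, best_subset = float("-inf"), []
    for size in range(1, len(order) + 1):
        subset = order[:size]
        score = evaluate(subset)
        if score > best_score:  # strict comparison keeps the smaller subset on ties
            best_score, best_subset = score, subset
    return best_subset, best_score
```

The strict comparison means that, when two subsets perform equally well, the smaller one is retained, which matches the preference for fewer features as a guard against overfitting.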
2.4. Support Vector Machine
The SVM is a linear-classifier-based supervised machine learning method that has been successfully used in many bioinformatics fields [48–51, 53–57]. To achieve classification, the SVM uses a kernel function to perform a nonlinear transformation, so that a linearly inseparable problem can be converted into a linearly separable one in a high-dimensional Hilbert space. In this work, the software LIBSVM [58] was used to execute the SVM.
2.5. Performance Standard
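To provide a more intuitive and easier-to-understand evaluation of the prediction performance, we used the following criteria: the sensitivity ($S_n$), the specificity ($S_p$), Matthews correlation coefficient (MCC), the overall accuracy (OA), and the average accuracy (AA), which were defined as
$$\begin{aligned} S_n &= \frac{TP}{TP + FN}, \\ S_p &= \frac{TN}{TN + FP}, \\ \mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \\ \mathrm{OA} &= \frac{TP + TN}{TP + FN + TN + FP}, \\ \mathrm{AA} &= \frac{S_n + S_p}{2}, \end{aligned} \tag{9}$$
where $TP$ is the number of lyases that were correctly predicted, $FN$ denotes the number of lyases that were predicted as nonlyases, $TN$ is the number of nonlyases that were correctly predicted, and $FP$ denotes the number of nonlyases that were predicted as lyases.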
In addition, we chose the receiver operating characteristic (ROC) curve to measure the performance of the proposed model. The ROC curve is a comprehensive index drawn by using $1 - S_p$ as the abscissa and $S_n$ as the ordinate. Thus, it shows how $S_n$ and $S_p$ vary continuously with the decision threshold. Generally, only the area under the ROC curve (AUC) needs to be calculated. The greater the AUC, the better the discriminative capability of the prediction model.
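As a worked check of these criteria, the sketch below computes them from confusion-matrix counts. The counts used in the example (52, 16, 286, and 21) are not stated in the text; they are back-calculated here from the reported sensitivity of 76.47% and specificity of 93.16% on 68 lyases and 307 nonlyases, and should be read as an inference. The AUC helper uses the rank interpretation (the probability that a positive sample receives a higher score than a negative one), which is one common way to compute the area and not necessarily how it was computed in this work.

```python
from math import sqrt


def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, MCC, overall accuracy, and average accuracy."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    oa = (tp + tn) / (tp + fn + tn + fp)
    aa = (sn + sp) / 2
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "MCC": mcc, "OA": oa, "AA": aa}


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Counts inferred from the reported Sn and Sp (assumption, see lead-in).
m = metrics(tp=52, fn=16, tn=286, fp=21)
print({name: round(100 * value, 2) for name, value in m.items()})
```

Because the dataset is imbalanced, the overall accuracy is dominated by the nonlyases, while the average accuracy weights both classes equally.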
3. Results and Discussion
3.1. Forecasting Accuracy
In this work, nine physicochemical properties were selected for the improved PseAAC: hydrophobicity, hydrophilicity, rigidity, flexibility, irreplaceability, side chain mass, pI at 25°C, pK of the α-COOH group, and pK of the α-NH₃⁺ group. The original values of the physicochemical properties for the 20 amino acids are listed in Table 1. According to (2)–(7), each protein sample can be formulated by a $(400 + 9\lambda)$ dimension vector that includes 400 $g$-gap dipeptide compositions and $9\lambda$ correlation factors based on physicochemical properties between two residues. From (3)–(5), the prediction performance of our method is influenced by two parameters, namely, $g$ and $\lambda$, where $g$ describes the local sequence-order effect and $\lambda$ represents the global sequence-order effect. The current study searched for the optimal values of the two parameters according to the following standard:
$$0 \le g \le 9 \text{ with step } 1, \qquad 1 \le \lambda \le 10 \text{ with step } 1. \tag{10}$$
Three tests are often used to measure the performance of a prediction model: $n$-fold cross-validation, jackknife cross-validation, and the independent dataset test. The jackknife cross-validation is deemed the most objective because it always yields a unique result for a given benchmark dataset [59, 60], and it has been used more and more widely. However, it also has an obvious drawback: it is computationally expensive and time-consuming. Hence, 5-fold cross-validation was adopted in this work to search for the optimal parameters and the optimal feature subset. Once the optimal feature subset was determined, we used jackknife cross-validation for further verification.
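The difference in cost between the two schemes can be seen from how the samples are partitioned. The sketch below generates train and test index lists for both, assuming contiguous, unshuffled folds for simplicity; in practice the samples would usually be shuffled or stratified first. The function names are illustrative.

```python
def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def jackknife_splits(n):
    """Leave-one-out: every sample serves as the test set exactly once."""
    return k_fold_splits(n, n)


n_samples = 68 + 307  # benchmark dataset size
print(len(list(k_fold_splits(n_samples, 5))))   # number of models trained in 5-fold CV
print(len(list(jackknife_splits(n_samples))))   # number of models trained in the jackknife test
```

For the benchmark dataset, the jackknife test trains one model per sample, whereas 5-fold cross-validation trains only five, which is why the cheaper scheme suits the parameter search and the jackknife test suits the final verification.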
Based on (10), a total of 10 × 10 = 100 groups of parameters $(g, \lambda)$ were investigated. For each parameter group $(g, \lambda)$, there are $400 + 9\lambda$ feature subsets. We then used the feature selection technique defined in (8) to find the best subset in each parameter group. Thus, we obtained the 100 highest OAs for the 100 groups of parameters $(g, \lambda)$. To provide an overall and intuitive analysis, the best OAs were drawn as a heat map, in which the two axes represent the parameters $g$ and $\lambda$. Each element in the heat map represents one of the 100 groups of parameters $(g, \lambda)$ and is colored according to its highest overall accuracy in the feature selection process. From Figure 1, we noticed that several elements are red, indicating the maximum overall accuracy of 91.73%, when $g$ equals 0 or 4 and $\lambda$ equals 7, 8, 9, or 10. Generally, a model with a small number of features reduces the risk of overfitting. After checking the feature selection results, we found that, for the parameter group ($g = 4$ and $\lambda = 7$), the optimal feature subset contains 63 features, which is fewer than in the optimal feature subsets of the other groups. Thus, the final model was established on the 63 features from the parameter group ($g = 4$ and $\lambda = 7$).
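The size of this search can be estimated with a few lines of arithmetic. The sketch below assumes that the gap parameter runs from 0 to 9 and the tier parameter from 1 to 10 (ranges inferred from the 10 × 10 grid and the reported optimal values), and that each group has 400 dipeptide features plus nine correlation factors per tier. The variable names are illustrative.

```python
# Assumed parameter ranges: gap 0..9, tiers 1..10 (inferred, not stated explicitly).
grid = [(g, lam) for g in range(0, 10) for lam in range(1, 11)]

# One candidate subset per added feature, so as many subsets as features.
subsets_per_group = {(g, lam): 400 + 9 * lam for g, lam in grid}
total_subsets = sum(subsets_per_group.values())

print(len(grid))                  # parameter groups
print(subsets_per_group[(4, 7)])  # candidate subsets for the selected group
print(total_subsets)              # candidate subsets over the whole grid
print(total_subsets * 5)          # SVM fits needed under 5-fold cross-validation
```

Under these assumptions, the full search evaluates tens of thousands of feature subsets, which illustrates why the cheaper 5-fold scheme was used for the search.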
Because our benchmark dataset is imbalanced, the average accuracy and the ROC curve were employed to evaluate the model. We set a series of different classification thresholds to seek the maximum average accuracy. The maximum AA and the corresponding $S_n$, $S_p$, MCC, and OA are listed in Table 2. The ROC curve demonstrates the predictive capability of the proposed method across the entire range of SVM decision values, and it is plotted in Figure 2. The AUC is 0.926, indicating that our model can discriminate cell wall lyases from nonlyases.
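The threshold search can be sketched as a scan over candidate cutoffs on the classifier's decision values. This is an illustrative sketch: the scores and labels in the example are made-up values, a sample is predicted as a lyase when its score is at least the threshold, and the first threshold reaching the maximum is kept.

```python
def best_threshold(scores, labels):
    """Return (threshold, AA) maximizing the average accuracy; label 1 means lyase."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (None, -1.0)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        aa = (tp / n_pos + tn / n_neg) / 2
        if aa > best[1]:
            best = (t, aa)
    return best


# Hypothetical decision values for two lyases and four nonlyases.
scores = [0.9, 0.4, 0.35, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0]
print(best_threshold(scores, labels))
```

Maximizing the average accuracy instead of the overall accuracy prevents the threshold from drifting toward the majority class.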
To investigate whether other algorithms have the same or higher discriminative capability in the same feature space, the performances of Random Forest, Naïve Bayes, and LibD3C were examined by using jackknife cross-validation. Random Forest and Naïve Bayes were executed with the free package WEKA [61]. LibD3C, a selective ensemble algorithm, is a hybrid model of ensemble pruning based on $k$-means clustering and a framework of dynamic selection and circulation in combination with a sequential search method [30]. We used the default parameters of LibD3C to perform classification.
The jackknife cross-validated results are also recorded in Table 2 for clear comparison. Note that the result for each algorithm in Table 2 was calculated at its maximum AA. As can be seen from the table, although Random Forest and Naïve Bayes are no lower than the SVM on one indicator, the SVM achieves the best values on the other five indicators.
3.2. Online-Server Guide
A user-friendly online server called Lypred was established. A simple guide to the server is given below for the convenience of users.
Lypred has five pages. Users can browse the server at http://lin.uestc.edu.cn/server/Lypred/ and see the home page on the screen, as shown in Figure 3. The Read Me page provides a brief introduction to Lypred and caveats for its use. The Data page gives a brief description of the benchmark dataset and the optimal feature subset used in this work and provides links for downloading them. The relevant paper describing the development and algorithm of Lypred can be found by clicking the Citation button. Example sequences in FASTA format can be found by clicking the Example button right above the input box.
Users can either type or copy and paste the query protein sequences into the input box, or upload a FASTA/txt file containing the query protein sequences at the center of the Lypred home page. Note that Lypred imposes some constraints to guarantee the reliability of the results. Firstly, protein sequences must be in FASTA format, which consists of a single initial line beginning with a greater-than symbol (“>”) in the first column, followed by lines of sequence data; a sequence is deemed to end when another line starting with “>” appears. Secondly, the query protein sequence should contain only the 20 standard amino acids. Thirdly, the length of a query protein sequence should be no less than eight.
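Users who want to check their input before submission could apply the three constraints locally with a sketch like the one below. This is not the server's code; the function names are hypothetical, and converting the sequence to upper case is an assumption made for convenience.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
MIN_LENGTH = 8


def parse_fasta(text):
    """Split FASTA text into (header, sequence) records."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_sequence(seq):
    """Return a list of constraint violations; an empty list means the sequence is acceptable."""
    problems = []
    if len(seq) < MIN_LENGTH:
        problems.append("shorter than 8 residues")
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        problems.append("nonstandard residues: " + "".join(bad))
    return problems


example = ">query1\nMKTAYIAK\nQRQISFVK\n>query2\nACDX\n"
for name, seq in parse_fasta(example):
    print(name, check_sequence(seq) or "ok")
```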
4. Conclusions
With the growing drug resistance of pathogenic bacteria, great effort has been placed on the study and development of lyases. Effective identification of lyases will facilitate the development of new antimicrobials. In this work, we used an improved PseAAC, which includes $g$-gap dipeptide compositions and correlation factors of physicochemical properties, to extract the characteristics of protein sequences. A feature selection technique based on ANOVA was used to optimize the features. The AA of 84.82% and the AUC of 0.926 lead us to believe that Lypred will become a powerful and useful tool for the experimental study of bacterial cell wall lyases.
Competing Interests
The authors declare that there are no competing interests regarding the publication of this paper.
Acknowledgments
This work was supported by the Applied Basic Research Program of Sichuan Province (nos. 2015JY0100 and LZ-LY-45), the Scientific Research Foundation of the Education Department of Sichuan Province (11ZB122), the Nature Scientific Foundation of Hebei Province (no. C2013209105), the Fundamental Research Funds for the Central Universities of China (nos. ZYGX2015J144 and ZYGX2015Z006), and the Program for the Top Young Innovative Talents of Higher Learning Institutions of Hebei Province (no. BJ2014028).
References
- D. Trudil, “Phage lytic enzymes: a history,” Virologica Sinica, vol. 30, no. 1, pp. 26–32, 2015.
- Y. Li, C. Wang, Z. Miao et al., “ViRBase: a resource for virus–host ncRNA-associated interactions,” Nucleic Acids Research, vol. 43, no. 1, pp. D578–D582, 2015.
- E. Hankin, “L'action bactericide des eaux de la Jumna et du Gange sur le vibrion du cholera,” Annales de l'Institut Pasteur, vol. 10, pp. 511–523, 1896.
- V. A. Fischetti, “Bacteriophage lytic enzymes: novel anti-infectives,” Trends in Microbiology, vol. 13, no. 10, pp. 491–496, 2005.
- D. C. Osipovitch, S. Therrien, and K. E. Griswold, “Discovery of novel S. aureus autolysins and molecular engineering to enhance bacteriolytic activity,” Applied Microbiology and Biotechnology, vol. 99, no. 15, pp. 6315–6326, 2015.
- C. C. Kietzman, G. Gao, B. Mann, L. Myers, and E. I. Tuomanen, “Dynamic capsule restructuring by the main pneumococcal autolysin LytA in response to the epithelium,” Nature Communications, vol. 7, article 10859, 2016.
- H. Oliveira, L. D. R. Melo, S. B. Santos et al., “Molecular aspects and comparative genomics of bacteriophage endolysins,” Journal of Virology, vol. 87, no. 8, pp. 4558–4570, 2013.
- R. Brunoghe and J. Maisin, “Essais de therapeutique au moyen du bacteriophage du staphylocoque,” Journal des Comptes Rendus de la Société de Biologie, vol. 85, pp. 1029–1121, 1921.
- W. R. Maxted, “The active agent in nascent phage lysis of streptococci,” Microbiology, vol. 16, no. 3, pp. 584–595, 1957.
- R. M. Krause, “Studies on the bacteriophages of hemolytic streptococci. II. Antigens released from the streptococcal cell wall by a phage-associated lysin,” The Journal of Experimental Medicine, vol. 108, no. 6, pp. 803–821, 1958.
- V. A. Fischetti, E. C. Gotschlich, and A. W. Bernheimer, “Purification and physical properties of group C streptococcal phage-associated lysin,” The Journal of Experimental Medicine, vol. 133, no. 5, pp. 1105–1117, 1971.
- R. Schuch, D. Nelson, and V. A. Fischetti, “A bacteriolytic agent that detects and kills Bacillus anthracis,” Nature, vol. 418, no. 6900, pp. 884–889, 2002.
- O. Salazar and J. A. Asenjo, “Enzymatic lysis of microbial cells,” Biotechnology Letters, vol. 29, no. 7, pp. 985–994, 2007.
- H. J. Rogers, H. R. Perkins, and J. B. Ward, Microbial Cell Walls and Membranes, Chapman and Hall, London, UK, 1980.
- M. McCarty, The Transforming Principle: Discovering That Genes Are Made of DNA, W. W. Norton & Company, New York, NY, USA, 1986.
- J. M. Sanchez-Puelles, C. Ronda, J. L. Garcia, P. Garcia, R. Lopez, and E. Garcia, “Searching for autolysin functions. Characterization of a pneumococcal mutant deleted in the lytA gene,” European Journal of Biochemistry, vol. 158, no. 2, pp. 289–293, 1986.
- K.-C. Chou and H.-B. Shen, “Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms,” Nature Protocols, vol. 3, no. 2, pp. 153–162, 2008.
- M. K. Leong and T.-H. Chen, “Prediction of cytochrome P450 2B6-substrate interactions using pharmacophore ensemble/support vector machine (PhE/SVM) approach,” Medicinal Chemistry, vol. 4, no. 4, pp. 396–406, 2008.
- B. Liu, D. Zhang, R. Xu et al., “Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection,” Bioinformatics, vol. 30, no. 4, pp. 472–479, 2014.
- D. Li, Y. Ju, and Q. Zou, “Protein folds prediction with hierarchical structured SVM,” Current Proteomics, vol. 13, no. 2, pp. 79–85, 2016.
- A. Reinhardt and T. Hubbard, “Using neural networks for prediction of the subcellular location of proteins,” Nucleic Acids Research, vol. 26, no. 9, pp. 2230–2236, 1998.
- X. Zhao, Q. Zou, B. Liu, and X. Liu, “Exploratory predicting protein folding model with random forest and hybrid features,” Current Proteomics, vol. 11, no. 4, pp. 289–299, 2014.
- H. Shen and K.-C. Chou, “Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types,” Biochemical and Biophysical Research Communications, vol. 334, no. 1, pp. 288–292, 2005.
- C. Yan, J. Hu, and Y. Wang, “Discrimination of outer membrane proteins using a K-nearest neighbor method,” Amino Acids, vol. 35, no. 1, pp. 65–73, 2008.
- T.-L. Zhang, Y.-S. Ding, and K.-C. Chou, “Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern,” Journal of Theoretical Biology, vol. 250, no. 1, pp. 186–193, 2008.
- A. Bulashevska, M. Stein, D. Jackson, and R. Eils, “Prediction of small molecule binding property of protein domains with Bayesian classifiers based on Markov chains,” Computational Biology and Chemistry, vol. 33, no. 6, pp. 457–460, 2009.
- A. Bulashevska and R. Eils, “Using Bayesian multinomial classifier to predict whether a given protein sequence is intrinsically disordered,” Journal of Theoretical Biology, vol. 254, no. 4, pp. 799–803, 2008.
- H. Lin and Q.-Z. Li, “Using pseudo amino acid composition to predict protein structural class: approached by incorporating 400 dipeptide components,” Journal of Computational Chemistry, vol. 28, no. 9, pp. 1463–1466, 2007.
- H. Lin, “The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition,” Journal of Theoretical Biology, vol. 252, no. 2, pp. 350–356, 2008.
- C. Lin, W. Chen, C. Qiu, Y. Wu, S. Krishnan, and Q. Zou, “LibD3C: ensemble classifiers with a clustering and dynamic selection strategy,” Neurocomputing, vol. 123, pp. 424–435, 2014.
- X. Zeng, S. Yuan, X. Huang, and Q. Zou, “Identification of cytokine via an improved genetic algorithm,” Frontiers of Computer Science, vol. 9, no. 4, pp. 643–651, 2015.
- L. Song, D. Li, X. Zeng, Y. Wu, L. Guo, and Q. Zou, “nDNA-prot: identification of DNA-binding proteins based on unbalanced classification,” BMC Bioinformatics, vol. 15, article 298, 2014.
- B. Liu, J. Chen, and X. Wang, “Application of learning to rank to protein remote homology detection,” Bioinformatics, vol. 31, no. 21, pp. 3492–3498, 2015.
- T. G. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems, pp. 1–15, Springer, 2000.
- T. G. Dietterich, “Ensemble learning,” in The Handbook of Brain Theory and Neural Networks, vol. 2, pp. 110–125, MIT Press, 2002.
- M. H. Smith, “The amino acid composition of proteins,” Journal of Theoretical Biology, vol. 13, pp. 261–282, 1966.
- J. Cedano, P. Aloy, J. A. Perez-Pons, and E. Querol, “Relation between amino acid composition and cellular location of proteins,” Journal of Molecular Biology, vol. 266, no. 3, pp. 594–600, 1997.
- K.-C. Chou, “Prediction of protein cellular attributes using pseudo-amino acid composition,” Proteins: Structure, Function, and Bioinformatics, vol. 43, no. 3, pp. 246–255, 2001.
- S. Saha and G. P. S. Raghava, “BcePred: prediction of continuous B-cell epitopes in antigenic sequences using physico-chemical properties,” in Artificial Immune Systems, G. Nicosia, V. Cutello, P. J. Bentley, and J. Timmis, Eds., vol. 3239 of Lecture Notes in Computer Science, pp. 197–204, Springer, New York, NY, USA, 2004.
- L. Wei, M. Liao, X. Gao, and Q. Zou, “An improved protein structural classes prediction method by incorporating both sequence and structure information,” IEEE Transactions on NanoBioscience, vol. 14, no. 4, pp. 339–349, 2015.
- C.-S. Yu, C.-J. Lin, and J.-K. Hwang, “Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions,” Protein Science, vol. 13, no. 5, pp. 1402–1406, 2004.
- B. Liu, F. Liu, X. Wang, J. Chen, L. Fang, and K. Chou, “Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences,” Nucleic Acids Research, vol. 43, no. W1, pp. W65–W71, 2015.
- H. Ding, L. Luo, and H. Lin, “Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition,” Protein and Peptide Letters, vol. 16, no. 4, pp. 351–355, 2009.
- A. M. Bairoch, R. Apweiler, C. H. Wu et al., “The universal protein resource (UniProt),” Nucleic Acids Research, vol. 33, pp. D154–D159, 2005.
- W. Li and A. Godzik, “Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, no. 13, pp. 1658–1659, 2006.
- K.-C. Chou, “Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes,” Bioinformatics, vol. 21, no. 1, pp. 10–19, 2005.
- H. Tang, W. Chen, and H. Lin, “Identification of immunoglobulins using Chou's pseudo amino acid composition with feature selection technique,” Molecular BioSystems, vol. 12, no. 4, pp. 1269–1275, 2016.
- P.-P. Zhu, W.-C. Li, Z.-J. Zhong et al., “Predicting the subcellular localization of mycobacterial proteins by incorporating the optimal tripeptides into the general form of pseudo amino acid composition,” Molecular BioSystems, vol. 11, no. 2, pp. 558–563, 2015.
- W.-C. Li, E.-Z. Deng, H. Ding, W. Chen, and H. Lin, “IORI-PseKNC: a predictor for identifying origin of replication with pseudo k-tuple nucleotide composition,” Chemometrics and Intelligent Laboratory Systems, vol. 141, pp. 100–106, 2015.
- M. Esmaeili, H. Mohabatkar, and S. Mohsenzadeh, “Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses,” Journal of Theoretical Biology, vol. 263, no. 2, pp. 203–209, 2010.
- C. Chen, X. Zhou, Y. Tian, X. Zou, and P. Cai, “Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network,” Analytical Biochemistry, vol. 357, no. 1, pp. 116–121, 2006.
- M. J. Anderson, “A new method for non-parametric multivariate analysis of variance,” Austral Ecology, vol. 26, no. 1, pp. 32–46, 2001.
- R. Wang, Y. Xu, and B. Liu, “Recombination spot identification based on gapped k-mers,” Scientific Reports, vol. 6, article 23934, 2016.
- J. Chen, X. Wang, and B. Liu, “IMiRNA-SSF: improving the identification of microrna precursors by combining negative sets with different distributions,” Scientific Reports, vol. 6, article 19062, 2016.
- P. Feng, H. Lin, W. Chen, and Y. Zuo, “Predicting the types of J-proteins using clustered amino acids,” BioMed Research International, vol. 2014, Article ID 935719, 8 pages, 2014.
- W. Chen and H. Lin, “Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine,” Computers in Biology and Medicine, vol. 42, no. 4, pp. 504–507, 2012.
- W. Chen and H. Lin, “Prediction of midbody, centrosome and kinetochore proteins based on gene ontology information,” Biochemical and Biophysical Research Communications, vol. 401, no. 3, pp. 382–384, 2010.
- C.-C. Chang and C.-J. Lin, “LIBSVM: a Library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011.
- K.-C. Chou and C.-T. Zhang, “Prediction of protein structural classes,” Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275–349, 1995.
- Y.-C. Wang, X.-B. Wang, Z.-X. Yang, and N.-Y. Deng, “Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature,” Protein and Peptide Letters, vol. 17, no. 11, pp. 1441–1449, 2010.
- M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
Copyright © 2016 Xin-Xin Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.