BioMed Research International
Volume 2013 (2013), Article ID 625403, 8 pages
Application of Improved Three-Dimensional Kernel Approach to Prediction of Protein Structural Class
1School of Chemistry & Chemical Engineering, Guangxi University, Guangxi Province, Nanning 530004, China
2State Key Laboratory of Medical Genomics, Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai Jiao Tong University School of Medicine, Shanghai 200240, China
3Graduate School of the Chinese Academy of Sciences, Beijing 100049, China
4College of Pharmacy, Guangxi University of Chinese Medicine, Nanning 530001, China
Received 25 March 2013; Revised 4 May 2013; Accepted 10 May 2013
Academic Editor: Bing Niu
Copyright © 2013 Xu Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Kernel methods, such as kernel PCA, kernel PLS, and support vector machines, are widely known machine learning techniques in biology, medicine, chemistry, and material science. Based on nonlinear mapping and the Coulomb function, two 3D kernel approaches were improved and applied to the prediction of the four protein tertiary structural classes of domains (all-α, all-β, α/β, and α + β) and of five membrane protein types, with satisfactory results. In a benchmark test, the performance of the improved 3D kernel approach was compared with that of neural networks, support vector machines, and ensemble algorithms. Leave-one-out cross-validation on working datasets constructed by previous investigators indicated that the new kernel approaches outperformed the other predictors. The 3D kernel approaches may therefore hold high potential for improving the prediction of other protein features as well or, at the very least, may play a complementary role to many of the existing algorithms in this regard.
1. Introduction
Owing to the rapid development of genome and protein science, the volume of biological information has expanded dramatically. It is therefore important and highly desirable for computers to manage, organize, and interpret this information. The study of protein structural classes has become an active topic in biochemistry, for both experimental and theoretical reasons. Artificial neural networks, support vector machines, kernel methods, and ensemble algorithms are widely known machine learning techniques in biology, medicine, chemistry, and material science [1–10]. In this work, two classification problems, the tertiary structural classes of protein domains and the types of membrane proteins, were studied with machine learning techniques.
Several motifs pack together to form compact, local, and semi-independent units called domains. The details of protein domain structures are extremely complicated and irregular, but their overall structural frames are simple, regular, and truly elegant [11–13]. Many protein domains have similar or identical folding patterns even when their sequences are quite different [14–16]. The overall 3D structure of the polypeptide chain is referred to as the protein’s tertiary structure. Levitt and Chothia proposed classifying protein tertiary structures into the following four structural classes based on the secondary structural content of the domains. All-α: this class is formed essentially by α-helices and is dominated by small folds, many of which form a simple bundle with helices running up and down. All-β: this class has a core composed of antiparallel β-sheets, usually two sheets packed against each other. α/β: this class contains both α-helices and β-strands, which are largely interspersed and form mainly parallel β-sheets. α + β: this class also contains both secondary structure elements, which, however, are largely segregated and form mainly antiparallel β-sheets.
This concept of structural class has since been widely used as an important attribute for characterizing the overall folding type of protein domains. Many methods have been developed to predict the structural classes from protein sequences.
Research on membrane protein types is also important because of the special biological functions of these proteins. A biomembrane usually contains specific proteins and lipid components that enable it to perform its unique roles in the cell and organelle.
Furthermore, several studies show that many membrane proteins, particularly membrane channel proteins, are key targets of drug discovery [18–20]. Membrane proteins can be further classified into five types [21–23]: (a) a type A membrane protein is a single-pass transmembrane protein that has an extracellular (or luminal) N-terminus and a cytoplasmic C-terminus for a cell (or organelle) membrane; (b) a type B membrane protein is a single-pass transmembrane protein that has an extracellular (or luminal) C-terminus and a cytoplasmic N-terminus for a cell (or organelle) membrane; (c) a type C membrane protein is a multipass transmembrane protein, in which the polypeptide crosses the lipid bilayer multiple times; (d) type D membrane proteins are lipid chain-anchored membrane proteins, which are bound to the membrane by one or more covalently attached fatty acid chains or other types of lipid chains called prenyl groups; (e) a type E membrane protein is a GPI-anchored membrane protein, which is bound to the membrane by a glycosylphosphatidylinositol (GPI) anchor.
Researchers have applied classification algorithms to predict the types of membrane proteins from their amino acid composition [24, 25]. Figure 1 shows the forms and locations of the different membrane proteins.
The first goal of this paper is to illustrate the application of the 3D kernel approach as a relatively new classification tool in the field of protein domains. The second goal is to show that the new approach can also be applied to the analysis of membrane protein types.
2. Materials and Methods
2.1. Kernel Function
Kernel functions were originally a class of functions used in the study of integral operators. Vapnik later adopted them in his newly invented SVM method. The use of a kernel function enables SVMs to treat nonlinear data processing problems with linear algorithms. The basic idea is to map the data into a higher-dimensional feature space via a nonlinear mapping and then to perform classification and regression in that space. There are four commonly used kernel functions: the linear kernel (1), the polynomial kernel (2), the Gaussian (RBF) kernel (3), and the sigmoid kernel (4). The elegance of using a kernel function lies in the fact that one can deal with feature spaces of arbitrary dimensionality without having to compute the map explicitly. Any function that satisfies Mercer’s condition can be used as a kernel function.
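As an illustration, the four kernels named above can be written in their common textbook forms. This is only a minimal sketch: the function names, the parameter names (`degree`, `coef0`, `sigma`, `gamma`), and the exact parameterization are assumptions and need not match the forms used in (1)–(4).

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # k(x, y) = x . y
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # k(x, y) = (x . y + coef0) ** degree  (assumed two-parameter form)
    return (dot(x, y) + coef0) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))  (assumed width convention)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # k(x, y) = tanh(gamma * x . y + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)
```

Each function takes two samples in the input space and returns the inner product of their images in the feature space, so the nonlinear map itself never has to be evaluated.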
2.2. Kernel PCA
Principal component analysis (PCA) is a versatile and easy-to-use multivariate statistical method for data analysis and for the extraction of maximal information [27, 28]. It is a linear transformation approach that compresses high-dimensional data with minimum loss of information. PCA is performed in the original sample space, whereas kernel PCA (KPCA) applies kernel functions in the input space to achieve the same effect as the expensive nonlinear mapping.
As Figure 2 shows, the basic idea of KPCA is to map the original dataset into a higher-dimensional feature space. In this feature space, PCA can be applied to establish a linear relationship that is nonlinear in the original input space [29, 30]. For the special case of the linear kernel, KPCA is equivalent to linear PCA. From this viewpoint, KPCA can be regarded as a generalized version of linear PCA.
For PCA, one first computes the covariance matrix of the data. A principal component is then computed by solving the eigenvalue problem of this covariance matrix. Each eigenvector with a nonzero eigenvalue lies in the span of the samples and can therefore be written as a linear combination of them. With this expansion, the eigenvalue problem can be represented in a simple form that involves only the linear kernel matrix, that is, the matrix of inner products between samples. To derive KPCA, one first maps the data into a feature space. A nonlinear kernel matrix can then be generated directly by means of a specific kernel function ((1), (2), (3), and (4)). To extract features of a new sample with KPCA, one simply projects the mapped sample onto the first few projection directions.
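The derivation outlined above follows the standard dual formulation of PCA. It can be sketched as follows in assumed notation: the symbols $X$ (centered data, one of the $n$ samples per row), $C$, $v$, $\alpha$, $K$, $\kappa$, and the $1/n$ normalization are illustrative choices, not necessarily those of the original equations.

```latex
\begin{align}
C &= \frac{1}{n} X^{\mathsf T} X, \\
\lambda v &= C v = \frac{1}{n} X^{\mathsf T} X v, \\
v &= \frac{1}{n\lambda} X^{\mathsf T} X v = X^{\mathsf T} \alpha,
  \qquad \alpha = \frac{1}{n\lambda} X v, \\
n\lambda\,\alpha &= K \alpha, \qquad K = X X^{\mathsf T}, \\
t_a(x_{\mathrm{new}}) &= \sum_{i=1}^{n} \alpha_i^{(a)}\, \kappa(x_i, x_{\mathrm{new}}),
  \qquad a = 1, \dots, A.
\end{align}
```

Replacing the linear kernel matrix $K = X X^{\mathsf T}$ by a matrix with entries $\kappa(x_i, x_j)$ for a nonlinear kernel $\kappa$ gives KPCA; the last line is the score of a new sample on the $a$-th of $A$ retained components.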
KPCA thus maps the original data, which have nonlinear features in the input space, into a kernel feature space in which the linear PCA algorithm is then performed. KPCA is therefore suitable for describing the nonlinear structure of a data set.
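A compact way to see these steps in practice is the following NumPy sketch of KPCA with three retained components. It is an illustration under stated assumptions, not the authors' MATLAB implementation: the function names, the dictionary-based model, the Gaussian width convention, and the scaling of the coefficient vectors are all choices made here.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kpca_fit(X, kernel, n_components=3):
    """Fit KPCA; assumes the leading eigenvalues are strictly positive."""
    n = X.shape[0]
    K = kernel(X, X)
    J = np.full((n, n), 1.0 / n)
    Kc = K - J @ K - K @ J + J @ K @ J        # centering in feature space
    vals, vecs = np.linalg.eigh(Kc)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]
    alphas = vecs[:, idx] / np.sqrt(vals[idx])  # unit-norm feature-space axes
    return {"X": X, "K": K, "alphas": alphas, "kernel": kernel}

def kpca_transform(model, X_new):
    """Project new samples onto the retained kernel principal components."""
    K = model["K"]
    n = K.shape[0]
    k = model["kernel"](X_new, model["X"])    # shape (m, n)
    J = np.full((n, n), 1.0 / n)
    Jm = np.full((k.shape[0], n), 1.0 / n)
    kc = k - Jm @ K - k @ J + Jm @ K @ J      # same centering as in training
    return kc @ model["alphas"]
```

With a linear kernel (`lambda A, B: A @ B.T`) the returned scores coincide, up to sign, with ordinary PCA scores, which is the equivalence stated in the text.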
2.3. Generalized Discriminant Analysis
Generalized discriminant analysis (GDA) is a method designed for nonlinear classification [31–33]. It is a nonlinear extension of linear discriminant analysis (LDA) based on a kernel function that transforms the original space into a new high-dimensional feature space. GDA is defined in terms of the within-class (or total) scatter matrix and the between-class scatter matrix of the nonlinearly mapped data. The between-class scatter matrix is built from the mean of each class in the feature space, weighted by the number of samples belonging to that class. The aim of GDA is to find the projection matrix that maximizes the Fisher criterion, that is, the ratio of the between-class scatter to the within-class (or total) scatter of the projected data. From the theory of reproducing kernels, any solution must lie in the span of all training samples in the feature space, so each projection vector is a sum of the mapped training samples multiplied by real weights. The weights are obtained by solving an eigenvalue problem that involves only the kernel matrix, which is composed of the dot products of the nonlinearly mapped data, and a block-diagonal matrix with one block per class, whose entries are all equal to the reciprocal of the number of samples in that class.
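For reference, the quantities described above are commonly written as follows in the kernel discriminant analysis literature (for example, in the formulation of Baudat and Anouar [31]). The notation here is assumed rather than recovered from the original equations: $L$ classes, $n_l$ samples in class $l$, $n$ samples in total, mapped data centered in the feature space, and $x_{lk}$ the $k$-th sample of class $l$.

```latex
\begin{align}
S_T^{\Phi} &= \frac{1}{n}\sum_{l=1}^{L}\sum_{k=1}^{n_l}
  \Phi(x_{lk})\,\Phi(x_{lk})^{\mathsf T}, \\
S_B^{\Phi} &= \frac{1}{n}\sum_{l=1}^{L} n_l\,\bar{\Phi}_l\,\bar{\Phi}_l^{\mathsf T},
  \qquad \bar{\Phi}_l = \frac{1}{n_l}\sum_{k=1}^{n_l}\Phi(x_{lk}), \\
J(\nu) &= \frac{\nu^{\mathsf T} S_B^{\Phi}\,\nu}{\nu^{\mathsf T} S_T^{\Phi}\,\nu}, \\
\nu &= \sum_{l=1}^{L}\sum_{k=1}^{n_l}\alpha_{lk}\,\Phi(x_{lk}), \\
\lambda\,K K \alpha &= K W K \alpha,
  \qquad W = \operatorname{diag}(W_1,\dots,W_L),
  \quad W_l = \frac{1}{n_l}\,\mathbf{1}\mathbf{1}^{\mathsf T}.
\end{align}
```

Substituting the expansion of $\nu$ into the stationarity condition $\lambda S_T^{\Phi}\nu = S_B^{\Phi}\nu$ and multiplying on the left by the matrix of mapped samples yields the last line, in which only the kernel matrix $K$ appears.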
2.4. New Improved 3D Kernel Approach: 3D KPCA and 3D GDA
Traditional KPCA and GDA are typical two-dimensional multivariate statistical methods. In this work, KPCA and GDA are improved with a three-dimensional projection and the concept of electric field intensity.
Firstly, the training samples are projected onto a three-dimensional space by the KPCA or GDA algorithm, whichever gives a satisfactory classification effect. The three coordinate axes are, respectively, the first, second, and third kernel principal components, or the direction vectors of generalized discriminant analysis.
Secondly, the unknown class of a new projection point, such as the membrane protein type of a test sample, needs to be estimated. Two estimation methods are used in this work: the K-Nearest Neighbor (KNN) algorithm and the class intensity model.
KNN algorithm estimation: a new projection point (test sample) is classified by a majority vote of its neighbors (training samples in the three-dimensional kernel space).
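The majority vote can be sketched in a few lines of Python. The function name, the Euclidean distance, and the default `k=3` are assumptions made for this illustration; ties are resolved in favor of the label that appears first among the nearest neighbors.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most frequent.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```

Here `train_points` would hold the 3D coordinates of the training samples after projection and `query` the projected test sample.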
Class intensity model estimation: the projection point of each training sample is regarded as a point charge. The species of the charge is determined by the class of the sample, and the Electric Quantity of Point Charge (EQPC) is determined by the number of samples that belong to that class.
The value of the EQPC is negatively related to the number of samples in the same class. Based on Coulomb’s law and the formula for electric field intensity, the Intensity of Electric Field of one Point (IEFP) in 3D space is computed from the EQPC and the distance between the point charge and the space point.
Therefore, in the class intensity model, the IEFP is the criterion of classification. For example, suppose there are four classes in the training data: class 1, class 2, class 3, and class 4 (Figure 3). After projection with kernel methods, the projected class charge points of the training data form an electric field in space. The test sample is projected onto this space with the same kernel methods. Figure 3 illustrates the relationship between the point charges of different classes and the corresponding IEFP. At the projected position of the test sample, if the IEFP of class 1 is greater than the IEFPs of class 2, class 3, and class 4, the test sample should belong to class 1.
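The class intensity rule can be illustrated with the sketch below. Because the exact EQPC and IEFP formulas are not reproduced here, two assumptions are made and should be read as such: the charge of each training point is taken as the reciprocal of its class size (one simple choice that is negatively related to the class size), and the IEFP of a class is taken as the sum of the scalar Coulomb-like magnitudes q / r² contributed by its points. The function name and the small constant `eps` are also hypothetical.

```python
from collections import Counter

def class_intensity_predict(train_points, train_labels, query, eps=1e-12):
    """Return (predicted class, per-class intensity) at the query position."""
    counts = Counter(train_labels)
    intensity = {label: 0.0 for label in counts}
    for point, label in zip(train_points, train_labels):
        # Squared distance between the point charge and the query point.
        r2 = sum((a - b) ** 2 for a, b in zip(point, query))
        q = 1.0 / counts[label]               # ASSUMED charge: 1 / class size
        intensity[label] += q / (r2 + eps)    # ASSUMED field: q / r^2, summed
    # The class with the largest intensity at the query position wins.
    best = max(intensity, key=intensity.get)
    return best, intensity
```

Under these assumptions a large class does not dominate merely through its number of points, since each of its charges is proportionally smaller.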
3. Results and Discussion
3.1. System and Software Used for Data Analysis
The calculations were carried out on an Intel(R) Core(TM) Duo CPU T5870 GHz computer running the Windows XP operating system. All the learning input data were range-scaled to [0, 1] in this work. The improved 3D kernel approach software package, including 3D kernel PCA and 3D GDA, was programmed in our laboratory with reference to the literature [29, 31] and is based on the statistical pattern recognition toolbox for MATLAB.
3.2. Application of Improved 3D Kernel Approach to Protein’s Tertiary Structure Classes of Domains
The protein datasets studied here were taken from Niu and his coworkers. In dataset A, there are 277 protein domains, of which 70 are all-α domains, 61 all-β, 81 α/β, and 65 α + β. In dataset B, there are 498 protein domains, of which 107 are all-α domains, 126 all-β, 136 α/β, and 129 α + β. The amino acid composition was used to represent the sample of a protein domain.
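A minimal sketch of this representation is given below, assuming that the amino acid composition is the vector of the 20 normalized residue frequencies. The alphabetical ordering of the one-letter codes and the decision to skip non-standard characters are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return the 20 residue frequencies of a protein sequence."""
    # Keep only standard residues (assumed handling of ambiguous symbols).
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]
```

Each domain is thus mapped to a point in a 20-dimensional space whose coordinates sum to one, which is then passed to the kernel projection.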
To demonstrate the power of the 3D kernel methods, computations were performed by Leave-One-Out Cross-Validation (LOOCV), which is widely used by an increasing number of investigators to test the power of various predictors. In each round, the data set was divided into two disjoint subsets: a training set (all samples but one) and a test set (the single remaining sample). After each model was developed on the training set, the omitted sample was predicted and the difference between the experimental value and the predicted value was calculated [36–38].
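The LOOCV procedure can be sketched generically as below. The helper names are hypothetical, and the one-nearest-neighbor predictor is included only to make the example runnable; any predictor with the same signature, including a 3D kernel model, could be passed instead.

```python
def loocv_success_rate(samples, labels, predict):
    """Leave each sample out once, train on the rest, and score the hits."""
    correct = 0
    for i in range(len(samples)):
        # Training set: every sample except the i-th one.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def nearest_neighbor(train_x, train_y, query):
    """Toy predictor: label of the closest training sample."""
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))
    best = min(range(len(train_x)), key=lambda j: sq_dist(train_x[j]))
    return train_y[best]
```

The returned fraction corresponds to an overall success rate; `samples` and `labels` are expected to be lists so that slicing produces the reduced training set.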
Based on dataset A, it was found that projection with the Gaussian kernel function (see (3)) combined with KNN algorithm estimation was suitable for building a 3D kernel PCA model with better success rates.
Based on dataset B, it was found that projection with the polynomial kernel function (see (2)) combined with class intensity model estimation was suitable for building a 3D GDA model with better success rates. Figure 4 illustrates the distribution of the protein domain classes of dataset B (498 samples) in 3D kernel space with the GDA model. It can be seen that the data points belonging to all-α, all-β, α/β, and α + β domains, respectively, are located in different regions, giving a correct classification result.
The success rates thus obtained are given in Table 1, where, to facilitate comparison, the corresponding rates obtained by the component-coupled algorithm, neural networks, support vector machines (SVMs), and the AdaBoost Learner are also listed.
As can be seen from Table 1, in the LOOCV test on dataset A (277 domains) the improved 3D kernel model outperformed the component-coupled, neural network, and SVM models but was slightly worse than the AdaBoost model. On dataset B (498 domains), the improved 3D kernel learner was superior to all the other predictors in identifying the structural class.
3.3. Application of Improved 3D Kernel Approach to Classification of Membrane Proteins
The membrane protein dataset studied here was collected from the literature. The dataset contains 2059 prokaryotic proteins (type A membrane proteins: 435; type B membrane proteins: 152; type C multipass transmembrane proteins: 1311; type D lipid chain-anchored membrane proteins: 51; type E GPI-anchored membrane proteins: 110). The amino acid composition was selected as the input of the classification algorithm, and the computations were performed by LOOCV to test the power of the various predictors. The classification flow chart obtained for the membrane protein dataset is shown in Figure 5.
As Figure 5 shows, the classification model is built in two steps. Firstly, a 3D KPCA model with projection through the polynomial kernel function (see (2)) and KNN algorithm estimation was built to separate the multipass transmembrane proteins (type C) from the other membrane proteins (type A, type B, type D, and type E). Figure 6 illustrates the data distribution of type C and the other membrane proteins in 3D kernel space with the KPCA model.
Secondly, a 3D GDA model with the Gaussian kernel function (see (3)) and class intensity model estimation was built to classify the type A, type B, type D, and type E membrane proteins.
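The two-step flow can be expressed as a small wrapper around two already trained predictors. The function names and the label strings ("C", "other") are hypothetical; `step1_predict` stands for the first model (type C versus the rest) and `step2_predict` for the second model (types A, B, D, and E).

```python
def split_for_two_steps(samples, labels):
    """Derive the training targets for the two models from the five types."""
    # Step 1 target: type C versus everything else, on all samples.
    step1_labels = ["C" if y == "C" else "other" for y in labels]
    # Step 2 data: only the samples that are not type C, with original types.
    rest = [(x, y) for x, y in zip(samples, labels) if y != "C"]
    step2_samples = [x for x, _ in rest]
    step2_labels = [y for _, y in rest]
    return step1_labels, step2_samples, step2_labels

def two_step_predict(sample, step1_predict, step2_predict):
    """Apply the first model, then the second one only if needed."""
    if step1_predict(sample) == "C":
        return "C"
    return step2_predict(sample)
```

Splitting off the largest class first means that the second model only has to discriminate among the four remaining, smaller types.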
Figure 7 illustrates the data distribution of the type A, type B, type D, and type E membrane proteins in 3D kernel space with the GDA model. The 3D kernel method was compared with other machine learning classification methods: the covariant discriminant algorithm, neural networks, support vector machines, and Bagging, as shown in Table 2.
As can be seen from Table 2, the correct classification rate of the 3D kernel algorithm in the LOOCV test was higher than those of the other algorithms. This indicates that the 3D kernel method learned the membrane protein training data very well.
4. Conclusion
The 3D kernel approach is a very useful machine learning classifier. It remarkably outperformed the powerful neural network and SVM classifiers in predicting the protein domain structural classes for the two datasets and the membrane protein types for the dataset, all constructed by previous investigators. It is thus anticipated that the 3D kernel classifier can also be used to predict other protein attributes, such as subcellular localization [39–41], enzyme family and subfamily classes, and active sites of enzymes. The concepts of EQPC and IEFP can be readily extended to spaces of four or more dimensions.
It can be concluded that the 3D kernel approach is a robust and highly accurate classification technique that can be successfully applied to derive statistical models with good statistical quality and predictive capability for protein location and function. The 3D kernel algorithm should be a complementary tool to the existing pattern recognition methods in chemometrics and bioinformatics.
Xu Liu and Yuchao Zhang contributed equally to this work.
The project is financially supported by the National Natural Science Foundation of China (nos. 20373040, 20973108, 20942005, and 21262005), the Innovation Foundation of Guangxi University (no. XBZ120947), and the Innovation Foundation of Shanghai University (no. A.10-0101-10-006). The work was also supported by the Guangxi Key Laboratory of Traditional Chinese Medicine Quality Standards (Guangxi Institute of Traditional Medical and Pharmaceutical Sciences) (guizhongzhongkai0802).
- V. Brusic, G. Rudy, M. Honeyman, J. Hammer, and L. Harrison, “Prediction of MHC class II-binding peptides using an evolutionary algorithm and artificial neural network,” Bioinformatics, vol. 14, no. 2, pp. 121–130, 1998.
- L. Xu, L. Wencong, J. Shengli, L. Yawei, and C. Nianyi, “Support vector regression applied to materials optimization of sialon ceramics,” Chemometrics and Intelligent Laboratory Systems, vol. 82, no. 1-2, pp. 8–14, 2006.
- I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
- B. M. Nicolaï, K. I. Theron, and J. Lammertyn, “Kernel PLS regression on wavelet transformed NIR spectra for prediction of sugar content of apple,” Chemometrics and Intelligent Laboratory Systems, vol. 85, no. 2, pp. 243–252, 2007.
- Y. Qu, B.-L. Adam, Y. Yasui et al., “Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients,” Clinical Chemistry, vol. 48, no. 10, pp. 1835–1843, 2002.
- B. Niu, X.-C. Yuan, P. Roeper et al., “HIV-1 protease cleavage site prediction based on two-stage feature selection method,” Protein and Peptide Letters, vol. 20, no. 3, pp. 290–298, 2013.
- B. Niu, Q. Su, X.-C. Yuan, W. Lu, and J. Ding, “QSAR study on 5-lipoxygenase inhibitors based on support vector machine,” Medicinal Chemistry, vol. 8, no. 6, pp. 1108–1116, 2012.
- C.-R. Peng, W.-C. Lu, B. Niu, M.-J. Li, X.-Y. Yang, and M.-L. Wu, “Predicting the metabolic pathways of small molecules based on their physicochemical properties,” Protein & Peptide Letters, vol. 19, pp. 1250–1256, 2012.
- Q. Su, W.-C. Lu, B. Niu, X. Liu, and T.-H. Gu, “Classification of the toxicity of some organic compounds to tadpoles (Rana Temporaria) through integrating multiple classifiers,” Molecular Informatics, vol. 30, no. 8, pp. 672–675, 2011.
- B. Niu, W.-C. Lu, J. Ding et al., “Site of O-glycosylation prediction based on two stage feature selection,” Chemometrics and Intelligent Laboratory Systems, vol. 2, pp. 142–145, 2011.
- A. V. Finkelstein and O. B. Ptitsyn, “Why do globular proteins fit the limited set of folding patterns?” Progress in Biophysics and Molecular Biology, vol. 50, no. 3, pp. 171–190, 1987.
- K.-C. Chou and L. Carlacci, “Energetic approach to the folding of α/β barrels,” Proteins: Structure, Function and Genetics, vol. 9, no. 4, pp. 280–295, 1991.
- K.-C. Chou, “Progress in protein structural class prediction and its impact to bioinformatics and proteomics,” Current Protein & Peptide Science, vol. 6, no. 5, pp. 423–436, 2005.
- K. Oxenoid and J. J. Chou, “The structure of phospholamban pentamer reveals a channel-like architecture in membranes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 31, pp. 10870–10875, 2005.
- J. S. Richardson, “β sheet topology and the relatedness of proteins,” Nature, vol. 268, no. 5620, pp. 495–500, 1977.
- O. B. Ptitsyn and A. V. Finkelstein, “Similarities of protein topologies: evolutionary divergence, functional convergence or principles of folding?” Quarterly Reviews of Biophysics, vol. 13, no. 3, pp. 339–386, 1980.
- B. Niu, Y.-D. Cai, W.-C. Lu, G.-Z. Li, and K.-C. Chou, “Predicting protein structural class with AdaBoost Learner,” Protein and Peptide Letters, vol. 13, no. 5, pp. 489–492, 2006.
- D. A. Doyle, J. M. Cabral, R. A. Pfuetzner et al., “The structure of the potassium channel: molecular basis of K+ conduction and selectivity,” Science, vol. 280, no. 5360, pp. 69–77, 1998.
- J. R. Schnell and J. J. Chou, “Structure and mechanism of the M2 proton channel of influenza A virus,” Nature, vol. 451, no. 7178, pp. 591–595, 2008.
- A. L. Stouffer, R. Acharya, D. Salom et al., “Structural basis for the function and inhibition of an influenza virus proton channel,” Nature, vol. 451, no. 7178, pp. 596–599, 2008.
- M. D. Resh, “Myristylation and palmitylation of Src family members: the fats of the matter,” Cell, vol. 76, no. 3, pp. 411–413, 1994.
- K.-C. Chou and D. W. Elrod, “Protein subcellular location prediction,” Protein Engineering, vol. 12, no. 2, pp. 107–118, 1999.
- K.-C. Chou and D. W. Elrod, “Prediction of membrane protein types and subcellular locations,” Proteins, vol. 34, pp. 137–153, 1999.
- K.-C. Chou, “A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space,” Proteins: Structure, Function and Genetics, vol. 21, no. 4, pp. 319–344, 1995.
- B. Niu, Y.-H. Jin, K.-Y. Feng et al., “Predicting membrane protein types with bagging learner,” Protein & Peptide Letters, vol. 15, no. 6, pp. 590–594, 2008.
- V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
- D. L. Massart, B. G. M. Vandeginste, S. N. Deming, Y. Michotte, and L. Kaufman, Chemometrics: A Textbook, Elsevier Science Publishers B.V., Amsterdam, The Netherlands, 1988.
- R. Bro, “PARAFAC. Tutorial and applications,” Chemometrics and Intelligent Laboratory Systems, vol. 38, no. 2, pp. 149–171, 1997.
- W. Wu, D. L. Massart, and S. de Jong, “The kernel PCA algorithms for wide data. Part I: theory and algorithms,” Chemometrics and Intelligent Laboratory Systems, vol. 36, no. 2, pp. 165–172, 1997.
- D.-S. Cao, Y.-Z. Liang, Q.-S. Xu, Q.-N. Hu, L.-X. Zhang, and G.-H. Fu, “Exploring nonlinear relationships in chemical data using kernel-based methods,” Chemometrics and Intelligent Laboratory Systems, vol. 107, no. 1, pp. 106–115, 2011.
- G. Baudat and F. Anouar, “Generalized discriminant analysis using a kernel approach,” Neural Computation, vol. 12, no. 10, pp. 2385–2404, 2000.
- H. Yamamoto, H. Yamaji, Y. Abe et al., “Dimensionality reduction for metabolome data using PCA, PLS, OPLS, and RFDA with differential penalties to latent variables,” Chemometrics and Intelligent Laboratory Systems, vol. 98, no. 2, pp. 136–142, 2009.
- H. Wang, Z. Hu, and Y. Zhao, “An efficient algorithm for generalized discriminant analysis using incomplete Cholesky decomposition,” Pattern Recognition Letters, vol. 28, no. 2, pp. 254–259, 2007.
- B. S. Kim and S. B. Park, “A fast k nearest neighbor finding algorithm based on the ordered partition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 761–766, 1986.
- S. Sonnenburg, G. Rätsch, S. Henschel et al., “The Shogun machine learning toolbox,” The Journal of Machine Learning Research, vol. 11, pp. 1799–1802, 2010.
- S. R. Amendolia, G. Cossu, M. L. Ganadu, B. Golosio, G. L. Masala, and G. M. Mura, “A comparative study of K-nearest neighbour, support vector machine and multi-layer perceptron for Thalassemia screening,” Chemometrics and Intelligent Laboratory Systems, vol. 69, no. 1-2, pp. 13–20, 2003.
- M. Kearns and D. Ron, “Algorithmic stability and sanity-check bounds for leave-one-out cross-validation,” in Proceedings of the 10th Annual Conference on Computational Learning Theory, pp. 152–162, ACM Press, July 1997.
- S. B. Holden, “PAC-like upper bounds for the sample complexity of leave-one-out cross-validation,” in Proceedings of the 9th Annual Conference on Computational Learning Theory, pp. 41–50, Desenzano del Garda, Italy, July 1996.
- G.-P. Zhou and K. Doctor, “Subcellular location prediction of apoptosis proteins,” Proteins: Structure, Function and Genetics, vol. 50, no. 1, pp. 44–48, 2003.
- Y.-X. Pan, Z.-Z. Zhang, Z.-M. Guo, G.-Y. Feng, Z.-D. Huang, and L. He, “Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach,” Journal of Protein Chemistry, vol. 22, no. 4, pp. 395–402, 2003.
- K.-C. Chou and Y.-D. Cai, “Predicting protein localization in budding yeast,” Bioinformatics, vol. 21, no. 7, pp. 944–950, 2005.
- K.-C. Chou and Y.-D. Cai, “Predicting enzyme family class in a hybridization space,” Protein Science, vol. 13, no. 11, pp. 2857–2863, 2004.