Abstract

Kernel methods, such as kernel PCA, kernel PLS, and support vector machines, are widely known machine learning techniques in biology, medicine, chemistry, and material science. Based on nonlinear mapping and Coulomb function, two 3D kernel approaches were improved and applied to predictions of the four protein tertiary structural classes of domains (all-α, all-β, α/β, and α + β) and five membrane protein types with satisfactory results. In a benchmark test, the performances of improved 3D kernel approach were compared with those of neural networks, support vector machines, and ensemble algorithm. Demonstration through leave-one-out cross-validation on working datasets constructed by investigators indicated that new kernel approaches outperformed other predictors. It has not escaped our notice that 3D kernel approaches may hold a high potential for improving the quality in predicting the other protein features as well. Or at the very least, it will play a complementary role to many of the existing algorithms in this regard.

1. Introduction

Due to the rapid development of genome and protein science, the biological information has expanded dramatically. Therefore, it is very important and highly desirable for computers to manage, organize, and interpret the information. As a part of biochemistry, study of protein structure classes has become a hot topic, because of experimental and theoretical purposes. Artificial neural networks, support vector machines, kernel methods, and ensemble algorithms are widely known machine learning techniques in biology, medicine, chemistry, and material science [110]. In this work, two classification problems, protein’s tertiary structure classes of domains and membrane protein types, were researched with some machine learning techniques.

Several motifs pack together to form compact, local, and semiindependent units called domains. The details of proteins domains structures are extremely complicated and irregular. But their overall structural frames are simple, regular, and truly elegant [1113]. Many protein domains often have similar or identical folding patterns even if they are quite different according to their sequences [1416]. The overall 3D structure of the polypeptide chain is referred to as the protein’s tertiary structure. Levitt and Chothia proposed to classify protein tertiary structures into the following four structural classes based on the secondary structural content of the domains. All-α: it is formed essentially by α-helices. This class is dominated by small folds, many of which form a simple bundle with helices running up and down. All-β: this class has a core composed of antiparallel β-sheets, usually two sheets pack against each other.   α/β: this class contains both α-helices and β-strands that are largely interspersed in forming mainly parallel β-sheet;   α + β: this class also contains both of the two secondary structure elements that, however, are largely segregated in forming mainly antiparallel β-sheets.

This concept of structural class has ever since been widely used as an important attribute for characterizing the overall folding type of proteins domains. Lots of methods have been made to predict the structural classes based on the knowledge of protein sequences [17].

The research of membrane protein type is also important because of the special biological functions. The biomembrane usually contains some specific proteins and lipid components that enable it to perform its unique roles in the cell and organelle.

Furthermore, several studies show that many membrane proteins are also the key targets of drug discovery, particularly membrane channel proteins [1820]. Membrane proteins can be further classified into the five types [2123]: (a) type A membrane protein is single-pass transmembrane protein which has an extracellular (or luminal) N-terminus and cytoplasmic C-terminus for a cell (or organelle) membrane; (b) type B membrane protein is single-pass transmembrane protein which has an extracellular (or luminal) C-terminus and cytoplasmic N-terminus for a cell (or organelle) membrane; (c) type C is multipass transmembrane protein: the polypeptide crosses the lipid bilayer multiple times; (d) type D membrane proteins are lipid chain-anchored membrane proteins: they are bound to the membrane by one or more covalently attached fatty acid chains or other types of lipid chains called prenyl groups; (e) type E is GPI-anchored membrane protein which is bound to the membrane by a glycosylphosphatidylinositol (GPI) anchor.

Researchers have applied classification algorithm to predict the types of membrane proteins based on their amino acid composition [24, 25]. Figure 1 shows the forms and the locations of different membrane proteins.

The first goal of this paper is to illustrate the application of 3D kernel approach as a relatively new tool in proteins domains field for classification purposes. And the second goal is to show that the new approach can be applied to analysis of membrane protein types.

2. Materials and Methods

2.1. Kernel Function

Kernel function was originally a kind of functions used in integral operator research. However, Vapnik implemented this function in his newly invented SVMs method [26]. The use of kernel function makes SVMs able to treat nonlinear data processing problems by using linear algorithms. The basic idea of kernel function is to map the data into a higher-dimensional feature space via a nonlinear mapping and then to do classification and regression in this space. There are four commonly used kernel functions: linear kernel polynomial kernel Gaussian (RBF) kernel sigmoid kernel The elegance of using kernel function lied in the fact that one can deal with feature spaces of arbitrary dimensionality without having to compute the map . Any function that satisfies Mercer’s condition can be used as kernel function.

2.2. Kernel PCA

Principal component analysis (PCA) is a versatile and easy-to-use multivariate mathematical-statistical method in multivariate data analysis and the extraction of maximal information [27, 28]. It is a linear transformation approach that compresses high-dimensional data with minimum loss of data information. PCA is performed in the original sample space, whereas kernel PCA (KPCA) applies kernel functions in the input space to achieve the same effect of the expensive nonlinear mapping.

From Figure 2, it is found that the basic idea of KPCA is to map the original dataset into some higher dimensional feature space. In this complex space, PCA can be applied to establish a linear relationship which is nonlinear in the original input space [29, 30]. For the special case in which , KPCA is equivalent to linear PCA. From this viewpoint, KPCA can be regarded as a generalized version of linear PCA.

For PCA, with data , one can first compute the covariance matrix :

A principal component is computed by solving the following eigenvalue problem:

Thus, the eigenvectors can be written as

Then, the eigen value problem can be represented by the following simple form: where is a linear kernel matrix. To derive KPCA, one firstly needs to map the data into a feature space . Hence, a nonlinear kernel matrix can be directly generated by means of specific kernel function ((1), (2), (3), and (4)). For extracting features of a new sample with KPCA, one simply projects the mapped sample onto the first projections ,

KPCA is to map the original data (in the input space) with nonlinear features into kernel feature space in which the linear PCA algorithm is then performed. Therefore, KPCA, being suitable to describe the nonlinear structure of data set, can be regarded as a generalized version of linear PCA.

2.3. GDA

Generalized discriminant analysis (GDA) is a method designed for nonlinear classification [3133]. It is a nonlinear extension of linear discriminant analysis (LDA) based on a kernel function which transforms the original space to a new high dimensional feature space . The within-class (or total) scatter and between-class scatter () matrixes of the nonlinearly mapped data are as follows:

In (11), is the mean of class and is the number of samples belonging to . The aim of the GDA is to find such projection matrix that maximizes the following Fisher criterion:

From the theory of reproducing kernels, any solution must lie in the span of all training samples in : where are some real weights and is the th sample of the class . The solution is obtained by solving (, ; , ): is the kernel matrix composed of the dot products of nonlinearly mapped data. And , where is a matrix with entries all equal to .

2.4. New Improved 3D Kernel Approach: 3D KPCA and 3D GDA

Traditional KPCA and GDA are typical multivariate two-dimension statistical methods. In this work, KPCA and GDA are improved with three-dimensional projection and the concept of electric field intensity.

Firstly, the data of training samples are projected onto three-dimensional space by KPCA or GDA algorithm with satisfactory classification effect. The three-dimensional coordinate axes are, respectively, the first kernel principal component, second kernel principal component, and third kernel principal component or the direction vectors of generalized discriminant analysis.

Secondly, we need to estimate the class (unknown) of new projection points, such as membrane protein types of test sample data. There are two estimation methods in this work: K-Nearest Neighbor algorithm (KNN) [34] and class intensity model.

KNN algorithm estimation: new projection point (test sample) is classified by a majority vote of its neighbors (training samples in kernel three-dimensional space).

Class intensity model estimation: the projection point of one training data can be considered as point charge. The species of charge is related to the class of sample. And the Electric Quantity of Point Charge (EQPC) is related to the number of samples () which belongs to some class:

The value of EQPC is negative related with the sample amount of same class. Based on the Coulomb law and formula of intensity of electric field, the Intensity of Electric Field of one Point (IEFP) in 3D space is where is distance between point charge and the space point.

Therefore, in class intensity model, IEFP is a criterion of classification. For example, there are four classes in training data: class 1, class 2, class 3, and class 4 in Figure 3. After projecting with kernel methods, all projection class charge points of training data can form a space electric field. The test sample can be projected onto this space with the same kernel methods. Figure 3 illustrates the relationship between point charge of different class and corresponding IEFP. To project position of test sample, if there exist ,  and , test sample should belong to class 1.

3. Results and Discussion

3.1. System and Software Used for Data Analysis

The calculations were carried out using the Intel(R) Core(TM) Duo CPU T5870 GHz computer running Windows XP operating system. All the learning input data were range-scaled to [0~1] in this work. The improved 3D kernel approach software package including 3D kernel PCA and 3D GDA was programmed in our laboratory referring to the literature [29, 31] based on statistical pattern recognition toolbox for MATLAB [35].

3.2. Application of Improved 3D Kernel Approach to Protein’s Tertiary Structure Classes of Domains

The protein datasets studied here were taken from Niu and his coworkers [17]. In dataset A, there are 277 protein domains, of which 70 are all-α domains, 61 all-β, 81 α/β, and 65 α + β. In dataset B, there are 498 protein domains, of which 107 are all-α domains, 126 all-β, 136 α/β, and 129 α + β. The amino acid composition was used to represent the sample of a protein domain.

To demonstrate the power of 3D kernel methods, computations were performed by the Leave-One-Out Cross-Validation (LOOCV), which are widely used by more and more investigators in testing the power of various predictors. As such, the data set of samples was divided into two disjoint subsets including a training data set ( samples) and a test data set (only 1 sample). After developing each model based on the training set, the omitted data was predicted and the difference between experimental value and predicted value was calculated [3638].

Based on dataset A, it was found that the projection with Gaussian (see (3), ) kernel function and KNN () algorithm estimation was suitable for building 3D kernel PCA model with the better success rates.

Based on dataset B, it was found that the projection with polynomial (see (2), , ) kernel function and class intensity model estimation was suitable for building 3D GDA model with the better success rates. Figure 4 illustrates the protein domains classes distribution of dataset B (498 samples) in 3D kernel space with GDA model. It can be seen that the data points, which belong to all-α domains, all-β domains, α/β domains, and α + β domains respectively, are located in different regions with a correct classification result.

The success rates thus obtained are given in Table 1, where, for facilitating comparison, the corresponding rates obtained by component-coupled algorithm, neural networks, support vector machines (SVMs), and AdaBoost Learner [17] are also listed.

As it can be seen from Table 1, the performance of improved 3D kernel model outperforms those of component-coupled, neural networks, SVMs models but was a little worse than that of AdaBoost model for the dataset A (277 domains) available in LOOCV test. Based on dataset B (498 domains), improved 3D kernel learner is superior to all the other predictors in identifying the structural classification.

3.3. Application of Improved 3D Kernel Approach to Classification of Membrane Proteins

The membrane proteins dataset studied here was collected from the literature [25]. The dataset contains 2059 prokaryotic proteins (type A membrane proteins: 435; type B membrane proteins: 152; type C Multi-pass transmembrane proteins: 1311; type D lipid chain-anchored membrane proteins: 51; type E GPI-anchored membrane proteins: 110). The amino acid composition was selected as the input of the classification algorithm, and the computations were performed by LOOCV to test the power of various predictors. Based on dataset of membrane proteins, the classification flow chart (Figure 5) was obtained as follows.

From Figure 5, there are two steps in building classification model. Firstly, the 3D KPCA model with projection through polynomial (see (2), , ) kernel function and KNN () algorithm estimation was built to classify the multipass transmembrane proteins (type C) and the other membrane proteins (type A, type B, type D, and type E). Figure 6 illustrates the data distribution of type C and other membrane proteins in 3D kernel space with KPCA model.

Secondly, the 3D GDA model with Gaussian (see (3), ) kernel function and class intensity model estimation was built to classify type A, type B, type D, and type E membrane proteins.

Figure 7 illustrates the data distribution of the type A, type B, type D, and type E membrane proteins in 3D kernel space with GDA model. 3D kernel method was compared with other machine learning classification methods: the covariant discriminant algorithm [23], neural networks, support vector machines, and Bagging [25], as is shown in Table 2.

As we can see from Table 2, correct classification rate of the LOOCV test applied 3D kernel algorithm outperformed other algorithms. It also means that 3D kernel method has learned very well through the membrane proteins training process.

4. Conclusions

The 3D kernel approach is very useful machine learning classifier. It has remarkably outperformed the powerful neural network, SVM classifiers, in predicting the protein domain structural classes for the two datasets constructed and membrane protein types for the same dataset constructed by previous investigators. It is thus anticipated that the 3D Kernel classifier can also be used to predict other protein attributes, such as sub-cellular localization [3941], enzyme family and subfamily classes [42], and active sites of enzyme. The concepts of EQPC and IEFP can be easily extended to many-dimensional space and could be improved to use four or more dimensions.

It could be concluded that 3D kernel approach is a robust and highly accurate classification technique that can be successfully applied to derive statistical models with statistical qualities and predictive capabilities for the protein location and function. The 3D kernel algorithm should be a complementary tool to the existing pattern recognition in chemometrics and bioinformatics.

Authors’ Contribution

Xu Liu and Yuchao Zhang contributed equally to this work.

Acknowledgments

The project is financially supported by National Natural Science Foundation of China (nos. 20373040, 20973108, 20942005, and 21262005), Innovation Foundation of Guangxi University (nos. XBZ120947), and Innovation Foundation of Shanghai University (nos. A.10-0101-10-006). The work was supported by Guangxi Key Laboratory of Traditional Chinese Medicine Quality Standards (Guangxi Institute of Traditional Medical and Pharmaceutical Sciences) (guizhongzhongkai0802).