Correlation Kernels for Support Vector Machines Classification with Applications in Cancer Data
High-dimensional bioinformatics data sets pose an excellent and challenging research problem in machine learning. In particular, gene expression data generated by DNA microarrays are high dimensional and contain a significant level of noise. Supervised kernel learning with an SVM classifier has been successfully applied in biomedical diagnosis, for example, in discriminating different kinds of tumor tissues. The correlation kernel has recently been applied to classification problems with support vector machines (SVMs). In this paper, we develop a novel and parsimonious positive semidefinite kernel. The proposed kernel is shown experimentally to perform better than the usual correlation kernel. In addition, we propose a new kernel based on the correlation matrix that incorporates techniques for dealing with indefinite kernels. The resulting kernel is shown to be positive semidefinite, and it exhibits superior performance to the two kernels mentioned above. We then apply the proposed method to cancer data sets to discriminate different tumor tissues, providing information for the diagnosis of diseases. Numerical experiments indicate that our method outperforms existing methods such as the decision tree method and the KNN method.
Support vector machines (SVMs) currently serve as benchmarks in various disciplines such as text categorization and time series prediction, and they have gradually become popular tools for analyzing DNA microarray data. SVMs were first used in gene function prediction problems and were later applied to cancer diagnosis based on tissue samples. The effectiveness of SVMs depends on the choice of kernel. Recently, the correlation kernel has been applied successfully with SVMs in classification. The correlation matrix gives the correlation coefficients among all the columns of a given matrix. To be precise, the (i, j)th entry of a correlation matrix measures the correlation between the ith column and the jth column of the given matrix. The diagonal entries are all equal to one because they measure the correlation of each column with itself. Furthermore, the correlation matrix is symmetric because the correlation between the ith column and the jth column is the same as the correlation between the jth column and the ith column. There are several possible correlation coefficients, the most popular of which is the Pearson correlation coefficient. In the case of a perfect positive linear correlation, the Pearson correlation coefficient is +1, while −1 indicates a perfect negative linear correlation. In general, the correlation coefficients lie in the interval [−1, 1], indicating the degree of linear dependence between the variables. An important property of the correlation matrix is that it is always positive semidefinite.
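For reference, the Pearson correlation coefficient between two vectors can be written as follows. The symbols u, v, and m are illustrative choices for this sketch and are not taken from the original notation.

```latex
r(u, v) =
\frac{\sum_{k=1}^{m} (u_k - \bar{u})(v_k - \bar{v})}
     {\sqrt{\sum_{k=1}^{m} (u_k - \bar{u})^{2}}\;
      \sqrt{\sum_{k=1}^{m} (v_k - \bar{v})^{2}}},
\qquad
\bar{u} = \frac{1}{m}\sum_{k=1}^{m} u_k,
\quad
\bar{v} = \frac{1}{m}\sum_{k=1}^{m} v_k .
```

By the Cauchy-Schwarz inequality, r(u, v) lies in [−1, 1], and r(u, u) = 1 whenever u is not constant.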
The use of the correlation kernel with SVMs is recent in biological research. It can be used effectively for the classification of noisy Raman spectra, see for instance [4, 5]. The construction of the correlation kernel relies on a problem-specific distance metric, which is uncommon in kernel methods. The correlation kernel is self-normalizing and is suitable for the classification of Raman spectra with minimal preprocessing. The similarity metric defined in the kernel describes the similarity between two data instances. The usual correlation kernel is positive semidefinite whenever the correlation matrix itself is positive semidefinite.
The kernel matrices arising in many practical applications are indefinite and are therefore not suitable for kernel learning. This problem has been addressed by various researchers, see for instance [6–9]. A popular and straightforward approach is to transform the spectrum of the indefinite kernel in order to generate a positive semidefinite one. Representative approaches include the denoising method, which treats negative eigenvalues as ineffective; the flipping method, which flips the sign of the negative eigenvalues of the kernel matrix; the diffusion method, which transforms the eigenvalues to their exponential form; and the shifting method, which applies a positive shift to the eigenvalues.
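The four spectrum transformations can be summarized compactly as below. This is a sketch in assumed notation: the symbols U, λ_i, f, β, and η are introduced here for illustration, since the text describes the methods only in words.

```latex
K = U \,\mathrm{diag}(\lambda_1, \dots, \lambda_n)\, U^{\top}
\quad\longmapsto\quad
\hat{K} = U \,\mathrm{diag}\bigl(f(\lambda_1), \dots, f(\lambda_n)\bigr)\, U^{\top},
\qquad
f(\lambda) =
\begin{cases}
\max(\lambda, 0) & \text{(denoising)},\\
|\lambda| & \text{(flipping)},\\
e^{\beta \lambda},\ \beta > 0 & \text{(diffusion)},\\
\lambda + \eta,\ \eta \ge \max\bigl(0, -\min_i \lambda_i\bigr) & \text{(shifting)}.
\end{cases}
```

In each case every transformed eigenvalue is nonnegative, so the resulting matrix is positive semidefinite.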
Since a correlation matrix is positive semidefinite, we can construct a parsimonious kernel matrix for which positive semidefiniteness is satisfied automatically. To date, this is the first application of this kernel to classification problems. In addition, we propose a kernel whose expression is similar to that of the usual correlation kernel, and we apply the denoising method to it to construct a novel positive semidefinite kernel matrix. We choose the denoising method because the technique has been used successfully in protein classification problems. This suggests that it may play an important role in the classification of other biological data sets.
The remainder of this paper is structured as follows. In Section 2, we introduce the construction of the usual correlation kernel. We then propose the parsimonious positive semidefinite kernel as well as a novel kernel obtained by denoising a kernel with properties similar to those of the usual correlation kernel. A theoretical proof of the positive semidefiniteness of the parsimonious kernel is provided, and we also explain particular properties of the related kernels. In Section 3, publicly available data sets are used to evaluate the performance of the proposed methods and to compare them with state-of-the-art methods such as the KNN method and the decision tree method. A discussion of the results is given in Section 4. Finally, concluding remarks are given in Section 5.
2. The Proposed Parsimonious Positive Semidefinite Kernel Method
In this section, we first introduce the usual correlation kernel. Based on the positive semidefinite property of the usual correlation kernel, we then propose a parsimonious positive semidefinite kernel. Finally, we present our novel kernel, namely, the denoised correlation-based (DCB) kernel.
2.1. The Usual Correlation Kernel
In this section, we assume that there are data instances in the data set. The number of features used to describe a data instance is . Then the data matrix can be expressed as a matrix which we denote as follows:
It is straightforward to obtain the correlation matrix of . Here we suppose the correlation matrix is . Then we have where and is the sample mean of data matrix .
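A minimal sketch of the correlation matrix between data instances is given below. It assumes the standard Pearson coefficient and a data matrix whose columns are the instances and whose rows are the features; the function name `instance_correlation_matrix` and the synthetic data are hypothetical and not part of the original.

```python
import numpy as np

def instance_correlation_matrix(X):
    """Pearson correlation between the columns (data instances) of X.

    X is assumed to have shape (n_features, n_instances), so the result
    has shape (n_instances, n_instances).
    """
    X = np.asarray(X, dtype=float)
    # Center each instance (column) by its own mean over the features.
    centered = X - X.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=0)
    if np.any(norms == 0):
        raise ValueError("a constant column has undefined correlation")
    Y = centered / norms          # unit-norm, mean-centered columns
    return Y.T @ Y                # Gram matrix of the normalized columns

# Synthetic example: 50 features, 6 instances (assumed data, for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
C_demo = instance_correlation_matrix(X_demo)
```

The result has unit diagonal, is symmetric, and agrees with `np.corrcoef(X_demo, rowvar=False)` up to round-off.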
Correlation is a mean-centered distance metric that is not common in kernel constructions. However, it is an important, problem-specific metric. The usual correlation kernel is constructed based on the correlation matrix defined above. The kernel value between and is
This kernel definition appropriately describes the similarity between two data instances. It is also straightforward to see that the kernel matrix is symmetric. For a better understanding of the kernel matrix, we can describe it as follows: where , . The following proposition presents the relationship with .
Proposition 1. The usual correlation kernel is positive semidefinite if is positive semidefinite.
Proof. The correlation kernel is symmetric and we have , . If we denote , , then we have the following description of the kernel matrix: Because , we have Moreover, has the same definiteness as . Using the kernel trick from machine learning, we can see that if is positive semidefinite, then the usual correlation kernel is also positive semidefinite.
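The explicit formula of the usual correlation kernel is not preserved in this text, so the following sketch assumes a hypothetical entrywise exponential form, exp(−γ(1 − C_ij)) with γ ≥ 0, purely for illustration. Under that assumption, positive semidefiniteness of C carries over to the kernel, because entrywise powers of a positive semidefinite matrix are positive semidefinite (Schur product theorem) and the exponential is a nonnegative combination of such powers. The function name, γ, and the synthetic data are all assumptions.

```python
import numpy as np

def exponential_correlation_kernel(C, gamma):
    """Entrywise kernel exp(-gamma * (1 - C)).

    ASSUMED form for illustration only; the source formula is not available.
    """
    if gamma < 0:
        raise ValueError("gamma must be nonnegative")
    C = np.asarray(C, dtype=float)
    return np.exp(-gamma * (1.0 - C))

# Synthetic example: 40 features, 8 instances (assumed data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 8))
C_demo = np.corrcoef(X_demo, rowvar=False)   # correlation between instances
K_demo = exponential_correlation_kernel(C_demo, gamma=2.0)

# Smallest eigenvalue; nonnegative up to round-off if K_demo is PSD.
min_eig = np.linalg.eigvalsh(K_demo).min()
```

With this form the diagonal entries equal one and all entries lie in (0, 1], which matches the self-normalizing behavior described for the correlation kernel.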
2.2. A Parsimonious Correlation Kernel
To deal with the positive semidefinite requirement of a kernel matrix, in this subsection, we propose a parsimonious kernel which is simply the correlation matrix . The proposition below shows that the proposed kernel is positive semidefinite.
Proposition 2. The matrix is a positive semidefinite matrix.
Proof. From (3), we know that the (i, j)th entry of is given by
Alternatively, we may write
If we denote from the separability of the kernel matrix, we can rewrite . Then for any we have
If we further assume then
This demonstrates that itself is a parsimonious kernel matrix that satisfies the positive semidefinite property automatically.
Therefore, can be employed as a kernel matrix for training classifiers in the machine learning framework. Together with Proposition 1, this also establishes the positive semidefiniteness of the usual correlation kernel matrix.
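One standard way to see the positive semidefiniteness of a Pearson correlation matrix is the Gram matrix argument sketched below. The notation (x_i for the ith data instance, its scalar mean, the all-ones vector, y_i, Y, and z) is assumed here for illustration and may differ from the original symbols.

```latex
C_{ij} = y_i^{\top} y_j,
\quad
y_i = \frac{x_i - \bar{x}_i \mathbf{1}}{\lVert x_i - \bar{x}_i \mathbf{1} \rVert}
\;\Longrightarrow\;
C = Y^{\top} Y,
\quad
Y = [\, y_1, \dots, y_n \,],
\qquad
z^{\top} C z = z^{\top} Y^{\top} Y z = \lVert Y z \rVert^{2} \ge 0
\quad \text{for all } z \in \mathbb{R}^{n}.
```

Since the quadratic form is a squared norm, it cannot be negative for any z.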
2.3. Denoised Correlation-Based Kernel
Motivated by the success of the usual correlation kernel in Raman spectra classification, we construct a novel kernel that builds on its advantages. The construction of the denoised correlation-based kernel involves two steps. First, we formulate a kernel matrix that shares similar properties with the usual correlation kernel. Second, a denoising technique is applied to obtain a positive semidefinite kernel matrix. The two steps are detailed as follows.
Step 1 (a new kernel)
Here we propose a new kernel with properties equivalent to those of the usual correlation kernel. It is defined as follows:
Since we can write it in another way as follows: where has an expression similar to that of the usual correlation kernel.
Step 2 (the denoising strategy)
To avoid a kernel matrix that is not positive semidefinite, we incorporate a denoising strategy into the kernel construction. Because , where is the matrix composed of all the eigenvectors of the matrix and is a diagonal matrix whose diagonal entries are the eigenvalues of the matrix , we denote it by
The denoising strategy is to transform the diagonal matrix to another diagonal matrix , where
Finally, is a positive semidefinite kernel matrix.
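The denoising step can be sketched as follows, assuming the usual reading of the method in which negative eigenvalues are set to zero and the matrix is rebuilt from the same eigenvectors. The function name `denoise_kernel` and the small example matrix are hypothetical.

```python
import numpy as np

def denoise_kernel(K):
    """Clip the negative eigenvalues of a symmetric matrix to zero."""
    K = np.asarray(K, dtype=float)
    K_sym = (K + K.T) / 2.0                 # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(K_sym)
    clipped = np.maximum(eigvals, 0.0)      # negative eigenvalues become zero
    # Rebuild U diag(clipped) U^T; broadcasting scales column j by clipped[j].
    return (eigvecs * clipped) @ eigvecs.T

# Small indefinite example: eigenvalues are 3 and -1.
K_indefinite = np.array([[1.0, 2.0],
                         [2.0, 1.0]])
K_psd = denoise_kernel(K_indefinite)
```

For this example only the eigenvalue 3 survives, so the result is the rank-one matrix with all entries equal to 1.5.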
We prepared three publicly available data sets from libsvm related to three types of cancer.
The first data set is related to colon cancer. It contains 22 normal and 40 tumor colon tissues. Each tissue is characterized by the intensities of the 2,000 genes with the highest minimal intensity across the samples. The data were preprocessed by instance-wise normalization to the standard normal distribution, followed by feature-wise normalization to the standard normal distribution. In total, there are 62 data instances with 2,000 features: 40 are positive, that is, they exhibit colon cancer, while 22 are normal.
The second data set is related to breast cancer. The same preprocessing as for the first data set was applied. Initially, there were 49 tumor samples, derived from the Duke Breast Cancer SPORE tissue resource. They were divided into two groups, estrogen receptor-positive and estrogen receptor-negative, via immunohistochemistry. However, for 5 of the samples the classification results from immunohistochemistry and the protein immunoblotting assay conflicted, and these samples were removed. Therefore, there are 44 data instances in total, of which 21 are negative and 23 are positive. Each tumor sample is described by 7129 genes.
The third data set is related to leukemia. The preprocessing is exactly the same as for the previous two data sets. The data set is composed of 38 bone marrow samples, 27 of which are acute myeloid leukemia and the remaining 11 acute lymphoblastic leukemia. Each sample is described by the expression levels of 7129 genes.
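The two-step normalization described for the three data sets can be sketched as below. The sketch assumes a matrix with one row per data instance and one column per feature, and it interprets "normalization to the standard normal distribution" as standardization to zero mean and unit standard deviation; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def standardize(A, axis):
    """Shift to zero mean and scale to unit standard deviation along `axis`."""
    mean = A.mean(axis=axis, keepdims=True)
    std = A.std(axis=axis, keepdims=True)
    std = np.where(std == 0, 1.0, std)   # leave constant rows/columns at zero
    return (A - mean) / std

def preprocess(X):
    """Instance-wise standardization followed by feature-wise standardization.

    X is assumed to have shape (n_instances, n_features).
    """
    X = np.asarray(X, dtype=float)
    X = standardize(X, axis=1)   # each instance (row) across its features
    X = standardize(X, axis=0)   # each feature (column) across the instances
    return X

# Synthetic stand-in with 62 instances and 20 features (assumed data).
rng = np.random.default_rng(2)
X_raw = rng.normal(loc=5.0, scale=3.0, size=(62, 20))
X_pre = preprocess(X_raw)
```

Because the feature-wise step is applied last, the columns of the result are exactly standardized, while the rows are only approximately so.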
3. Numerical Experiments
We compare our proposed methods with the following three state-of-the-art methods.
(i) Decision Tree
Decision tree learning is a method commonly used in data mining. It employs a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications.
(ii) K-Nearest Neighbors (KNN)
The k-nearest neighbor algorithm is among the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor.
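A minimal sketch of the majority-vote rule follows. It assumes Euclidean distance, which the text does not specify, and the function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=1):
    """Classify each test row by majority vote among its k nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_test:
        # Euclidean distance to every training instance (an assumed choice).
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists, kind="stable")[:k]
        votes = Counter(y_train[nearest].tolist())
        # On a tie, the label encountered first among the neighbors wins.
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Toy data: two well-separated groups in the plane.
X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_train = [0, 0, 1, 1]
X_test = [[0.2, 0.1], [5.5, 5.2]]
pred_k1 = knn_predict(X_train, y_train, X_test, k=1)
pred_k3 = knn_predict(X_train, y_train, X_test, k=3)
```

An odd k avoids ties in two-class problems such as the tumor data considered here.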
(iii) Support Vector Machines (SVMs)
A support vector machine constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification. A good separation is achieved by the hyperplane that has the largest functional margin, that is, the largest distance to the nearest training data points of any class.
In this study, we employed the KNN method with and the decision tree algorithm for comparison with our proposed parsimonious correlation kernel and denoised correlation-based kernel with SVM. The aim is to demonstrate superiority of our proposed kernels to the usual correlation kernel.
Tables 1, 2, and 3 present the prediction accuracies of the different algorithms. For the purpose of comparison, we include two state-of-the-art models: the decision tree method and the KNN algorithm. We employ 5-fold cross-validation in the study. To obtain relatively stable results, 5-fold cross-validation was performed 10 times, and the accuracy was measured as the average over the 10 runs. The best performance is marked in bold in the tables.
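The evaluation protocol, 5-fold cross-validation repeated 10 times with the accuracy averaged over the runs, can be sketched as follows. The sketch uses unstratified random folds and a nearest-centroid classifier as a stand-in; both are assumptions made to keep the example self-contained, and all function names are hypothetical.

```python
import numpy as np

def repeated_kfold_accuracy(X, y, fit_predict, n_splits=5, n_repeats=10, seed=0):
    """Average accuracy over n_repeats runs of n_splits-fold cross-validation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    run_accuracies = []
    for _ in range(n_repeats):
        # Fresh random partition of the instances for every run.
        folds = np.array_split(rng.permutation(len(y)), n_splits)
        fold_accuracies = []
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
            fold_accuracies.append(np.mean(y_pred == y[test_idx]))
        run_accuracies.append(float(np.mean(fold_accuracies)))
    return float(np.mean(run_accuracies)), run_accuracies

def nearest_centroid(X_train, y_train, X_test):
    """Stand-in classifier (not one of the compared methods): closest class mean."""
    labels = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in labels])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]

# Synthetic two-class data with well-separated class means (assumed data).
rng = np.random.default_rng(3)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
                    rng.normal(10.0, 1.0, size=(20, 5))])
y_demo = np.array([0] * 20 + [1] * 20)
mean_acc, per_run = repeated_kfold_accuracy(X_demo, y_demo, nearest_centroid)
```

Any classifier with the same `fit_predict(X_train, y_train, X_test)` signature, for example an SVM with a precomputed kernel, could be substituted for the stand-in.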
For the colon cancer data set, the decision tree exhibits inferior performance compared with the KNN algorithm. However, neither the decision tree nor the KNN algorithm does better than the usual correlation kernel when . For different values of , the performance of the usual correlation kernel differs widely. The best performance is achieved when is adopted. But when or , only around accuracy was obtained.
For the breast cancer data, the decision tree performed better than the KNN algorithm. The accuracy for the decision tree is , but for the KNN algorithm, the best result obtained is when which is significantly less than . Still, neither can match the usual correlation kernel when that is . Similar to the colon cancer data set, when and , the usual correlation kernel performed poorly; the accuracies are only and , respectively. In conclusion, the parsimonious correlation kernel and the denoised correlation-based kernel are in general the best two.
Finally, for the leukemia data set, the accuracies of the decision tree method and the KNN algorithm are higher than that of the usual correlation kernel. They are all over while the best performance of the usual correlation kernel is when , less than . However, both the parsimonious correlation kernel and the denoised correlation-based kernel achieve over 0.9000 accuracy.
We conclude that, for the usual correlation kernel, ensures the best performance. Hence we choose in the following studies. Figures 1, 2, and 3 show the performance of 10 runs of 10-time 5-fold cross-validation for the 3 data sets. The x-axis indicates the run, and the y-axis indicates the averaged accuracy of each 10-time 5-fold cross-validation. We compare the decision tree method, the KNN algorithm, the usual correlation kernel, and the 2 proposed positive semidefinite kernels: the parsimonious correlation kernel and the denoised correlation-based kernel. The figures clearly demonstrate the superiority of our 2 proposed kernels (shown as green stars and yellow diamonds in the figures) over all the other algorithms compared.
Table 4 presents the dominant eigenvalues of the parsimonious correlation (PC) kernel and the denoised correlation-based (DCB) kernel. For the breast cancer data set, the dominant eigenvalues of the PC kernel and the DCB kernel are very close to each other, with a gap of only 0.0168. This explains why the two kernels exhibit similar performance on this data set. For the colon cancer data set, the difference in dominant eigenvalues is 0.1421, while for the leukemia data set the difference is the largest: 0.3048. There the performance difference is also the largest, and the superiority of the DCB kernel over the PC kernel is the clearest. The difference in the dominant eigenvalues is thus consistent with the difference in classification performance: the larger the difference in eigenvalues, the larger the difference in classification performance.
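The quantity compared in Table 4 can be computed as in the sketch below, which takes the dominant eigenvalue of a symmetric kernel matrix to be its largest eigenvalue. The two small matrices are made-up examples, not the kernels from the experiments, and the function name is hypothetical.

```python
import numpy as np

def dominant_eigenvalue(K):
    """Largest eigenvalue of a symmetric matrix."""
    K = np.asarray(K, dtype=float)
    # eigvalsh returns the eigenvalues in ascending order.
    return float(np.linalg.eigvalsh((K + K.T) / 2.0)[-1])

# Made-up symmetric matrices for illustration only.
K_a = np.array([[2.0, 1.0],
                [1.0, 2.0]])   # eigenvalues 1 and 3
K_b = np.array([[2.0, 0.5],
                [0.5, 2.0]])   # eigenvalues 1.5 and 2.5
gap = abs(dominant_eigenvalue(K_a) - dominant_eigenvalue(K_b))
```

For a positive semidefinite kernel matrix, the largest eigenvalue is also the eigenvalue of largest magnitude.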
From the tables, one can see the consistent superiority of the denoised correlation-based kernel for classification. It achieves the best accuracy on all 3 tested data sets. The positive semidefinite parsimonious kernel is the second best among all the algorithms compared. Moreover, we observe that neither the decision tree nor the KNN algorithm is dominantly superior to the other.
As for the usual correlation (UC) kernel, on the colon cancer data it performs better than the decision tree and the KNN algorithm: its average accuracy is around while the decision tree and the KNN algorithm cannot exceed in general. On the breast cancer data, similar conclusions can be drawn for the usual correlation kernel. Second only to our proposed PC kernel and DCB kernel, it ranks third among all the investigated methods. On the leukemia data set, however, the UC kernel has the lowest accuracy and cannot compete with any of the other methods presented. We conclude that the UC kernel also has no dominant advantage over the decision tree method and the KNN algorithm.
If we focus on the comparison of the 2 proposed positive semidefinite kernels, the PC kernel and the DCB kernel, we can also reach some conclusions. For the breast cancer data, the two show comparable performance. For the colon cancer data and the leukemia data, the DCB kernel demonstrates its superiority, which is much clearer on the leukemia data set. A possible explanation for the difference is given by dominant eigenvalue theory. In finance, the largest eigenvalue gives a rough idea of the largest possible risk of an investment in the market. The dominant eigenvalue is the one that provides the most valuable information about the dynamics from which the matrix came.
In this study, two positive semidefinite kernels, which we call the parsimonious correlation kernel and the denoised correlation-based kernel, have been proposed for discriminating different tumor tissues and offering diagnostic suggestions. We have provided a theoretical justification of the positive semidefinite property of the usual correlation kernel. Taking the positive semidefiniteness of the correlation matrix into consideration, we have proposed 2 positive semidefinite kernels. The robustness of the 2 proposed kernels in conjunction with support vector machines is demonstrated on 3 publicly available cancer data sets for tumor discrimination. Comparisons are made with state-of-the-art methods such as the decision tree method and the KNN algorithm. The performance of the 2 proposed positive semidefinite kernels is analyzed with the support of eigenvalue theory. The proposed kernels highlight the importance of positive semidefiniteness in kernel construction. As novel kernels built on a distance metric, which is uncommon in the machine learning framework, the proposed kernels will hopefully be applied in a wider range of areas.
The authors would like to thank the anonymous referees for their helpful comments and suggestions. Research supported in part by GRF Grant and HKU CERG Grants, National Natural Science Foundation of China Grant no. 10971075.
N. Cristianini and B. Schölkopf, “Support vector machines and kernel methods: the new generation of learning machines,” AI Magazine, vol. 23, no. 3, pp. 31–41, 2002.
A. Kyriakides, E. Kastanos, K. Hadjigeorgiou, and C. Pitris, “Support vector machines with the correlation kernel for the classification of Raman spectra,” in Advanced Biomedical and Clinical Diagnostic Systems IX, vol. 7890 of Proceedings of SPIE, pp. 78901B-1–78901B-7, San Francisco, Calif, USA, January 2011.
J. Chen and J. Ye, “Training SVM with indefinite kernels,” in Proceedings of the 25th International Conference on Machine Learning, pp. 136–143, Helsinki, Finland, July 2008.
Y. M. Ying, C. Campbell, and M. Girolami, “Analysis of SVM with indefinite kernels,” in Proceedings of the Neural Information Processing Systems Conference (NIPS '09), pp. 1–9, Vancouver, Canada, 2009.
R. Luss and A. d'Aspremont, “Support vector machine classification with indefinite kernels,” Mathematical Programming Computation, vol. 1, pp. 97–118, 2009.
E. Pekalska, P. Paclik, and R. P. W. Duin, “A generalized kernel approach to dissimilarity-based classification,” Journal of Machine Learning Research, vol. 2, pp. 175–211, 2002.
T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer, “Classification on pairwise proximity data,” NIPS, vol. 53, pp. 438–444, 1998.
R. I. Kondor and J. D. Lafferty, “Diffusion kernels on graphs and other discrete input spaces,” in Proceedings of the International Conference on Machine Learning (ICML '02), pp. 315–322, Sydney, Australia, 2002.
G. Wu, E. Y. Chang, and Z. H. Zhang, “An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines,” in Proceedings of the 22nd International Conference on Machine Learning (ICML '05), pp. 315–322, Bonn, Germany, 2005.
“Libsvm data set,” http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
U. Alon, N. Barka, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
R. Couillet and M. Debbah, Random Matrix Methods for Wireless Communications, Cambridge University Press, New York, NY, USA, 2011.
E. A. Gonzalez, “Determination of the dominant eigenvalue using the trace method,” IEEE Multidisciplinary Engineering Education Magazine, vol. 1, pp. 1–2, 2006.