Multiple Kernel Spectral Regression for Dimensionality Reduction
Traditional manifold learning algorithms, such as locally linear embedding, Isomap, and Laplacian eigenmap, only provide the embedding results of the training samples. To address the out-of-sample extension problem, spectral regression (SR) casts the learning of an embedding function as a regression problem, which avoids eigen-decomposition of dense matrices. Motivated by the effectiveness of SR, we incorporate multiple kernel learning (MKL) into SR for dimensionality reduction. The proposed approach (termed MKL-SR) seeks an embedding function in the Reproducing Kernel Hilbert Space (RKHS) induced by the multiple base kernels. An MKL-SR algorithm is proposed to further improve the performance of kernel-based SR (KSR). Furthermore, the proposed MKL-SR algorithm can be applied in supervised, unsupervised, and semi-supervised settings. Experimental results on supervised classification and semi-supervised classification demonstrate the effectiveness and efficiency of our algorithm.
In real applications, the resulting data representations are generally high dimensional. Practical algorithms usually behave badly when faced with many unnecessary features. Hence, finding a way of transforming them into a unified space of lower dimension can facilitate the underlying tasks such as pattern recognition or regression problems. Dimensionality reduction (DR) techniques, which have been widely used in many fields of information processing, include unsupervised, supervised, and semisupervised methods due to different assumptions about the data distribution or the availability of the data labeling.
In order to handle data sampled from a nonlinear low-dimensional manifold, many manifold learning techniques, such as ISOMAP, Locally Linear Embedding (LLE), and Laplacian Eigenmap, have been proposed in recent years. These methods reduce the dimensionality of a fixed training set in a way that maximally preserves certain interpoint relationships. One of their major limitations is that they do not generally address the out-of-sample problem. Although some methods explicitly require an embedding function, either linear or in RKHS, when minimizing the objective function [4, 5], their computation involves eigen-decomposition of dense matrices, which is expensive in both time and memory. Spectral regression (SR), which is fundamentally based on regression and spectral graph analysis [6–10], avoids eigen-decomposition of dense matrices and achieves better performance at a faster learning speed. Moreover, it can be performed in supervised, unsupervised, or semisupervised settings. Kernel SR (KSR) is the kernelized version of SR in the reproducing kernel Hilbert space (RKHS), which can further improve the performance of SR. While KSR is based on a single kernel, in practice it is often hard to select a suitable kernel. A common way to select kernels automatically is to learn a linear combination of base kernels. Motivated by the effectiveness of SR, we introduce a framework called MKL-SR that incorporates multiple kernel learning (MKL) into the training process of SR. We illustrate the formulation of MKL-SR with graph embedding, which provides a unified view for a large family of DR methods. Any DR technique expressible by graph embedding can therefore be generalized by MKL-SR to boost its power by automatically selecting optimal kernels.
As the corresponding SR algorithm would do, the proposed approach not only solves the out-of-sample extension problem but also improves the performance of kernel-based SR (KSR) for the supervised, semisupervised, and unsupervised learning problems.
The paper is structured as follows. In Section 2, we briefly introduce the related work. We provide the MKL-SR framework and present the optimization process in Section 3. The experimental results are shown in Section 4. Finally, we give the related conclusions in Section 5.
2. Related Work
Since the relevant literature is quite extensive, our survey instead emphasizes the key concepts crucial to the establishment of the proposed framework.
2.1. Spectral Regression Algorithm
In traditional spectral dimensionality reduction algorithms, seeking an embedding function that minimizes the objective function involves eigen-decomposition of dense matrices, which has a high computational cost in both time and memory. The SR algorithm instead uses the least squares method to obtain the best projection directions, avoiding the eigen-decomposition of dense matrices, so it learns much faster. An affinity graph over both labeled and unlabeled points is constructed to discover the intrinsic geometric structure and to learn the responses for the given data. Then, with these responses, ordinary regression is applied to learn the embedding function.
Given a training set with $l$ labeled samples $x_1, \ldots, x_l$ and $m - l$ unlabeled samples $x_{l+1}, \ldots, x_m$ in $\mathbb{R}^n$, where each labeled sample belongs to one of $c$ classes, let $l_k$ be the number of labeled samples in the $k$th class (the sum of the $l_k$ is equal to $l$). The SR algorithm is summarized as follows.
Step 1. Constructing the adjacency graph. Let $X$ be the training set and let $G$ denote a graph with $m$ nodes, where the $i$th node corresponds to the sample $x_i$. In order to model the local structure as well as the label information, the graph is constructed through the following three steps.
(1) If $x_i$ is among the $p$-nearest neighbors of $x_j$ or $x_j$ is among the $p$-nearest neighbors of $x_i$, then nodes $i$ and $j$ are connected by an edge.
(2) If $x_i$ and $x_j$ are in the same class (i.e., have the same label), then nodes $i$ and $j$ are also connected by an edge.
(3) Otherwise, if $x_i$ and $x_j$ are not in the same class, then the edge between nodes $i$ and $j$ is deleted.
Step 2. Constructing the weight matrix. Let $W$ be the sparse symmetric $m \times m$ matrix, where $W_{ij}$ represents the weight of the edge joining vertices $i$ and $j$.
(1) If there is no edge between nodes $i$ and $j$, then $W_{ij} = 0$.
(2) Otherwise, if both $x_i$ and $x_j$ belong to the $k$th class, then $W_{ij} = 1/l_k$; else $W_{ij} = \delta \cdot s(i, j)$, where $0 < \delta \le 1$ is a given parameter that adjusts the weight between supervised and unsupervised neighbor information.
Here $s(i, j)$ is a similarity evaluation function between $x_i$ and $x_j$. We consider two variations, the simple-minded function and the heat kernel function:
$$s(i, j) = 1, \qquad s(i, j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right), \tag{1}$$
where $\sigma \in \mathbb{R}$.
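Steps 1 and 2 can be sketched in Python with NumPy as follows. This is only an illustrative sketch: the function name `build_weight_matrix`, the argument names, and the convention of marking unlabeled samples with the label -1 are our assumptions, and a dense matrix is used for brevity where a practical implementation would store $W$ in a sparse format.

```python
import numpy as np


def build_weight_matrix(X, labels, p=5, delta=1.0, sigma=None):
    """Affinity matrix W from Steps 1 and 2.

    X      : (m, n) array, one sample per row.
    labels : length-m integer array; -1 marks an unlabeled sample
             (a convention assumed for this sketch).
    p      : number of nearest neighbours, assumed smaller than m.
    delta  : weight on unsupervised neighbour edges, 0 < delta <= 1.
    sigma  : heat kernel width; None selects the simple-minded weight 1.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.shape[0]

    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Rule (1): connect i and j if either is among the other's p nearest neighbours.
    d2_no_self = d2.copy()
    np.fill_diagonal(d2_no_self, np.inf)
    nn = np.argsort(d2_no_self, axis=1)[:, :p]
    knn = np.zeros((m, m), dtype=bool)
    knn[np.repeat(np.arange(m), nn.shape[1]), nn.ravel()] = True
    knn |= knn.T

    # Neighbour edges get weight delta * s(i, j).
    s = np.ones((m, m)) if sigma is None else np.exp(-d2 / (2.0 * sigma ** 2))
    W = np.where(knn, delta * s, 0.0)

    labeled = labels >= 0
    both = labeled[:, None] & labeled[None, :]
    same = both & (labels[:, None] == labels[None, :])
    # Rule (3): delete edges between labeled samples of different classes.
    W[both & ~same] = 0.0
    # Rule (2): samples of the same class k get weight 1 / l_k
    # (this also fills the diagonal of each class block).
    for k in np.unique(labels[labeled]):
        idx = np.flatnonzero(labels == k)
        W[np.ix_(idx, idx)] = 1.0 / idx.size
    return W
```

With `sigma=None` the neighbor edges use the simple-minded weight; passing a positive `sigma` switches to the heat kernel.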
Step 3. Eigen-decomposition. Let $D$ be the diagonal matrix whose $(i, i)$th element is the sum of the $i$th column (or row) of $W$. Find $y_0, y_1, \ldots, y_{c-1}$, which are the $c$ largest generalized eigenvectors of the eigenproblem
$$W y = \lambda D y, \tag{2}$$
where the first eigenvector $y_0$ is a vector of all ones with eigenvalue 1.
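Step 4. Calculate $c - 1$ vectors $a_1, \ldots, a_{c-1} \in \mathbb{R}^n$. Each $a_k$ ($k = 1, \ldots, c-1$) is the solution of the regularized least squares problem
$$a_k = \arg\min_{a} \left( \sum_{i=1}^{m} \left(a^{T} x_i - y_i^{k}\right)^2 + \alpha \|a\|^2 \right), \tag{3}$$
where $y_i^{k}$ is the $i$th element of $y_k$.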
Step 5. Let $A = [a_1, \ldots, a_{c-1}]$ be an $n \times (c-1)$ transformation matrix. The testing samples or a new sample $x$ can be embedded into the $(c-1)$-dimensional subspace by
$$x \mapsto z = A^{T} x. \tag{4}$$
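Steps 3 to 5 can be sketched as follows for small problems. The function name `spectral_regression` and its arguments are our assumptions. For brevity the sketch uses a dense symmetric eigensolver on the normalized matrix $D^{-1/2} W D^{-1/2}$, which has the same eigenvalues as the generalized problem, whereas SR itself relies on sparse solvers. It also assumes that every node has positive degree and that the eigenvalue 1 is simple, so that only the first eigenvector is trivial.

```python
import numpy as np


def spectral_regression(X, W, n_components, alpha=1.0):
    """Responses Y and linear projection A for a given affinity matrix W.

    X : (m, n) data matrix, one sample per row.
    W : (m, m) symmetric nonnegative affinity matrix (dense array here).
    Returns A with shape (n, n_components) and Y with shape (m, n_components).
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                      # diagonal of D
    d_isqrt = 1.0 / np.sqrt(d)

    # W y = lambda D y  is equivalent to  S u = lambda u  with y = D^{-1/2} u.
    S = d_isqrt[:, None] * W * d_isqrt[None, :]
    evals, U = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]        # largest eigenvalues first
    # Skip the trivial all-ones eigenvector and keep the next n_components.
    Y = d_isqrt[:, None] * U[:, order[1:n_components + 1]]

    # Regularized least squares: (X^T X + alpha I) A = X^T Y.
    n = X.shape[1]
    A = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ Y)
    return A, Y
```

New samples stored as rows of an array `X_new` would then be embedded by `X_new @ A`.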
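Next, we briefly discuss kernel spectral regression. Suppose we choose a nonlinear function in RKHS, that is, $f(x) = \sum_{i=1}^{m} a_i K(x, x_i)$, where $K(\cdot, \cdot)$ is the Mercer kernel of the RKHS $\mathcal{H}_K$. Equation (3) can be rewritten as
$$a_k = \arg\min_{a} \left( \|K a - y_k\|^2 + \alpha\, a^{T} K a \right), \tag{5}$$
where $K$ is the $m \times m$ gram matrix with $K_{ij} = K(x_i, x_j)$. Find $c - 1$ vectors $a_1, \ldots, a_{c-1} \in \mathbb{R}^m$, where $a_k$ is the solution of the following linear equations system:
$$(K + \alpha I)\, a_k = y_k. \tag{6}$$
Let $A = [a_1, \ldots, a_{c-1}]$; $A$ is an $m \times (c-1)$ transformation matrix. The samples can be embedded into the $(c-1)$-dimensional subspace by
$$x \mapsto z = A^{T} K(:, x), \tag{7}$$
where $K(:, x) = [K(x_1, x), \ldots, K(x_m, x)]^{T}$.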
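The kernel variant amounts to one regularized linear solve per response vector, followed by a matrix product for new samples. A minimal sketch is given below; the function names `ksr_fit` and `ksr_embed` are hypothetical, and the Gram matrices are assumed to be precomputed with any Mercer kernel.

```python
import numpy as np


def ksr_fit(K, Y, alpha=1.0):
    """Solve (K + alpha I) A = Y for all response vectors at once.

    K : (m, m) Gram matrix on the training samples.
    Y : (m, d) matrix whose columns are the response vectors.
    Returns the (m, d) coefficient matrix A.
    """
    K = np.asarray(K, dtype=float)
    m = K.shape[0]
    return np.linalg.solve(K + alpha * np.eye(m), np.asarray(Y, dtype=float))


def ksr_embed(A, K_new):
    """Embed new samples.

    K_new : (m, q) matrix with K_new[i, j] = K(x_i, z_j) for training
            sample x_i and new sample z_j.
    Returns a (q, d) array, one embedded sample per row.
    """
    return np.asarray(K_new, dtype=float).T @ A
```

Because $K + \alpha I$ is symmetric positive definite for $\alpha > 0$, a Cholesky-based solver could be substituted for `np.linalg.solve` on larger problems.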
2.2. Multiple Kernel Learning
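MKL learns a kernel machine with multiple kernel functions or kernel matrices. Recent studies have shown that MKL not only increases the recognition accuracy but also enhances the interpretability of the resulting classifiers. Given a set of base kernel functions $\{k_t\}_{t=1}^{M}$, an ensemble kernel function $k$ is defined by
$$k(x_i, x_j) = \sum_{t=1}^{M} \beta_t k_t(x_i, x_j), \quad \beta_t \ge 0. \tag{8}$$
Consequently, an often-used MKL decision function derived from binary-class SVM is
$$f(x) = \sum_{i} \alpha_i y_i \sum_{t=1}^{M} \beta_t k_t(x_i, x) + b, \tag{9}$$
where $y_i \in \{-1, +1\}$ is the label of $x_i$ and $b$ is the bias. The training process of MKL generally optimizes over both the coefficients $\{\alpha_i\}$ and $\{\beta_t\}$.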
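To make the ensemble kernel concrete, the following sketch combines a linear, a polynomial, and a Gaussian base kernel with nonnegative weights. The helper names, the polynomial degree, and the offset `coef0` are illustrative assumptions rather than settings taken from this paper.

```python
import numpy as np


def linear_kernel(X1, X2):
    return X1 @ X2.T


def polynomial_kernel(X1, X2, degree=2, coef0=1.0):
    return (X1 @ X2.T + coef0) ** degree


def gaussian_kernel(X1, X2, width=1.0):
    sq1 = np.sum(X1 ** 2, axis=1)[:, None]
    sq2 = np.sum(X2 ** 2, axis=1)[None, :]
    d2 = np.maximum(sq1 + sq2 - 2.0 * (X1 @ X2.T), 0.0)
    return np.exp(-d2 / (2.0 * width ** 2))


def ensemble_kernel(base_kernels, beta):
    """Weighted sum of base kernel matrices with nonnegative weights beta."""
    beta = np.asarray(beta, dtype=float)
    if np.any(beta < 0):
        raise ValueError("kernel weights must be nonnegative")
    # Contract the weight vector with the stacked (M, m, m) kernel array.
    return np.tensordot(beta, np.stack(base_kernels), axes=1)
```

Since each base kernel matrix is positive semidefinite and the weights are nonnegative, the combined matrix is again a valid kernel matrix.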
In recent years, dimensionality reduction methods based on multiple kernels have been proposed to improve the performance of those using single kernel. In , kernel learning was first incorporated into DR methods. Then, a multiple kernel DR framework was designed in . Recently, Zhu et al. proposed a dimensionality reduction method by Mixed Kernel Canonical Correlation Analysis (CCA) [14, 15]. In this method, the high dimensional data space is mapped into the reproducing kernel Hilbert space (RKHS) with a linear combination between a local kernel and a global kernel. Kernel CCA is further improved by performing Principal Component Analysis (PCA) followed by CCA for effective dimensionality reduction, which can be implemented in supervised learning, semisupervised learning, and transfer learning. Motivated by their work, we aim to incorporate the MKL optimization into SR to yield more flexible dimensionality reduction schemes.
3. The MKL-SR Framework
We first explain how to integrate MKL and SR for dimensionality reduction. Then, we propose an optimization procedure to complete the framework.
3.1. MKL-SR Model
Suppose that the ensemble kernel in MKL-SR is generated by linearly combining the base kernels as in (8). Selecting a nonlinear function in RKHS induced by the kernel function , we have . The constrained optimization problem for 1-D MKL-SR is defined as follows: where
Observe from (10) that the one-dimensional projection of MKL-SR is specified by a sample coefficient vector and a kernel weight vector . The two vectors, respectively, account for the relative importance among the samples and the base kernels in the construction of the projection. To generalize the formulation to uncover a multidimensional projection, we consider a set of sample coefficient vectors, denoted by The resulting projection will map samples to a ()-dimensional Euclidean space. Similar to the 1-D case, a projected sample can be written as
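As an illustration of how the sample coefficient vectors and the kernel weight vector jointly determine a projection, the sketch below evaluates an embedding function of the form $f(x) = \sum_i a_i \sum_t \beta_t k_t(x_i, x)$ for several output dimensions at once. The function name, the array layout, and the symbol $P$ for the number of output dimensions are our assumptions; the form follows from placing the embedding function in the RKHS induced by the ensemble kernel.

```python
import numpy as np


def mkl_project(A, beta, base_kernels_new):
    """Project new samples given coefficients A and kernel weights beta.

    A                : (m, P) array, one sample coefficient vector per column.
    beta             : (M,) array of kernel weights.
    base_kernels_new : (M, m, q) array; entry [t, i, j] is the t-th base
                       kernel evaluated between training sample i and
                       new sample j.
    Returns a (q, P) array, one projected sample per row.
    """
    A = np.asarray(A, dtype=float)
    beta = np.asarray(beta, dtype=float)
    Ks = np.asarray(base_kernels_new, dtype=float)
    # Ensemble kernel values between training and new samples, shape (m, q).
    K_new = np.tensordot(beta, Ks, axes=1)
    return K_new.T @ A
```

For a single new sample $x$, the same result is obtained as $A^{T} K^{(x)} \beta$, where $K^{(x)}$ is the $m \times M$ matrix of base kernel values between the training samples and $x$.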
The optimization problem (10) can now be extended to multidimensional MKL-SR as
3.2. Optimization Algorithm
Since direct optimization of (16) is difficult, we instead adopt an iterative, two-step strategy to alternately optimize $A$ and $\beta$. At each iteration, one of $A$ and $\beta$ is optimized while the other is fixed, and then the roles of $A$ and $\beta$ are switched. Iterations are repeated until convergence or until a maximum number of iterations is reached.
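The control flow of such a two-step strategy can be sketched generically as below. The callables `update_A`, `update_beta`, and `objective` are hypothetical placeholders for the two subproblem solvers and the objective value; they are not an implementation of the MKL-SR subproblems, and the stopping rule based on the change of the objective is an assumption.

```python
import numpy as np


def alternate(update_A, update_beta, objective, beta0, max_iter=20, tol=1e-6):
    """Alternate between two partial minimizers until the objective stalls.

    update_A(beta) -> A     : minimizer over A with beta fixed (placeholder).
    update_beta(A) -> beta  : minimizer over beta with A fixed (placeholder).
    objective(A, beta)      : scalar objective value.
    Returns the final A, the final beta, and the number of iterations used.
    """
    beta = np.asarray(beta0, dtype=float)
    A = update_A(beta)
    prev = objective(A, beta)
    it = 0
    for it in range(1, max_iter + 1):
        beta = update_beta(A)      # step 1: optimize beta with A fixed
        A = update_A(beta)         # step 2: optimize A with beta fixed
        cur = objective(A, beta)
        if abs(prev - cur) <= tol * max(1.0, abs(prev)):
            break
        prev = cur
    return A, beta, it
```

When each step exactly minimizes its subproblem, the objective value cannot increase from one iteration to the next, which is why a small change in the objective is a natural stopping signal.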
3.2.1. On Optimizing A
We can indirectly utilize 1-D MKL-SR to solve multidimensional MKL-SR. By fixing $\beta$, the problem (10) can be transformed into the following optimization problem: where . The optimal ’s are the eigenvectors corresponding to the maximum eigenvalue of the eigenproblem Consequently, the columns of the optimal $A$ in (16) are the eigenvectors corresponding to the first smallest eigenvalues in (19).
Solving the problem (19) directly involves eigen-decomposition of dense matrices, which has the high computational cost in both time and memory. In order to solve the eigenproblem in (19) efficiently, we use the following theorem.
Theorem 1 shows that, instead of solving the eigenproblem (19), the embedding functions can be acquired through two steps.(1)Solve the eigenproblem in (2) to get .(2)Find which satisfies . Similar to SR, a possible way is to find which can best fit the equation in the least squares sense as where is the th element of .
Since the matrix $D$ is guaranteed to be positive definite, the eigenproblem in (2) can be stably solved. Moreover, both $W$ and $D$ are sparse matrices. The top eigenvectors of the eigenproblem in (2) can be efficiently calculated with Lanczos algorithms. In addition, techniques for solving least squares problems are mature, and there exist many efficient iterative algorithms that can handle very large scale least squares problems.
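As an illustration of these two ingredients, the sketch below uses SciPy's Lanczos-based `eigsh` on the sparse normalized affinity matrix and the iterative `lsqr` solver for a damped least squares problem. The function names are our own, and the choice of these particular SciPy routines is an assumption, since the experiments in this paper were run in MATLAB.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh, lsqr


def top_generalized_eigenvectors(W, k):
    """k largest eigenpairs of W y = lambda D y for a sparse affinity W.

    Uses the equivalent symmetric problem S u = lambda u with
    S = D^{-1/2} W D^{-1/2} and y = D^{-1/2} u; assumes positive degrees.
    """
    W = csr_matrix(W, dtype=float)
    d = np.asarray(W.sum(axis=1)).ravel()
    d_isqrt = diags(1.0 / np.sqrt(d))
    S = d_isqrt @ W @ d_isqrt
    vals, U = eigsh(S, k=k, which="LA")    # Lanczos iteration (ARPACK)
    order = np.argsort(vals)[::-1]
    return vals[order], d_isqrt @ U[:, order]


def ridge_lsqr(X, y, alpha):
    """Minimize ||X a - y||^2 + alpha ||a||^2 iteratively with LSQR."""
    return lsqr(X, y, damp=np.sqrt(alpha), atol=1e-12, btol=1e-12)[0]
```

The `damp` argument of `lsqr` adds the Tikhonov term, so `damp` is set to the square root of the regularization parameter.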
3.2.2. On Optimizing β
By fixing $A$, the optimization problem (16) becomes where .
Because of the additional constraints, the optimization problem (22) can no longer be transformed into a generalized eigenvalue problem. It is in fact a nonconvex quadratically constrained quadratic programming (QCQP) problem, which is NP-hard. Thus, we instead consider solving its convex relaxation by adding an auxiliary variable $B$ of size $M \times M$ as where $e_t$ in (26) is a column vector whose elements are 0 except that its $t$th element is 1. To obtain the convex relaxation of the nonconvex QCQP problem (22), we relax the equation $B = \beta\beta^{T}$ to $B \succeq \beta\beta^{T}$, which can be equivalently expressed by the constraint in (27) according to the Schur complement lemma. The optimization problem (24) is a semidefinite program (SDP), which can be solved efficiently. Note that the numbers of constraints and variables in (24) are linear and quadratic in $M$, the number of base kernels, respectively. In practice, the value of $M$ is often small. Thus, the proposed MKL-SR algorithm listed in Algorithm 1 mainly consists of a sequence of SR training steps.
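The Schur complement step can be written out explicitly. Writing $B$ for the auxiliary matrix variable and $\beta$ for the kernel weight vector (the symbol $B$ is our notation for this sketch), the relaxed matrix inequality is equivalent to a single linear matrix inequality, which is the form an SDP solver accepts:

```latex
B \succeq \beta\beta^{T}
\quad\Longleftrightarrow\quad
\begin{bmatrix} 1 & \beta^{T} \\ \beta & B \end{bmatrix} \succeq 0 .
```

The equivalence holds because the Schur complement of the positive scalar block $1$ in the block matrix is exactly $B - \beta\beta^{T}$.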
3.3. Novel Sample Embedding
After accomplishing the training procedure of MKL-SR, we can project a testing sample $z$ into the learned subspace by
$$z \mapsto A^{T} K^{(z)} \beta,$$
where $K^{(z)}$ is the matrix whose $(i, t)$th entry is $k_t(x_i, z)$, the $t$th base kernel evaluated between the $i$th training sample and $z$. Several algorithms such as the nearest neighbor rule or $k$-means clustering can then be used to complete classification or clustering tasks.
4. Experiments
In the experiments of this paper, we specifically discuss the effectiveness of MKL-SR in different learning tasks, including unsupervised learning for clustering and supervised and semisupervised learning for face recognition.
We used seven data sets, drawn from the ionosphere, letter, digit, and satellite collections of the UCI machine learning repository, to perform the unsupervised learning task. For the letter and satellite data sets, we only used their first two classes. Several multiclass data sets were created from the digits data. The experiments on supervised and semisupervised classification were performed on the CMU PIE face data set and the extended Yale B data set [17, 18], respectively. All the face images are manually aligned and cropped. The pixel values are scaled to . The basic information about these data sets is listed in Table 1. All the experiments were performed in the MATLAB 7.14.0 environment on a 3.10 GHz Intel Core i5-2400 machine with 3 GB RAM.
4.1. Experiments on Unsupervised Learning
To validate that MKL-SR is effective for an unsupervised dimensionality reduction task, we applied the proposed algorithm as a tool to learn an appropriate kernel function for KSR. Each data set was reduced by SR, single kernel based SR, kernel principal component analysis (KPCA), and MKL-SR, respectively. The normalized cut spectral clustering (NC) algorithm was adopted to evaluate the clustering performance on the reduced data. We set the number of clusters equal to the true number of classes and compared the clusters generated by these algorithms with the true classes by computing the clustering accuracy measure as
$$\text{Accuracy} = \frac{\max \sum_{C_k, L_m} T(C_k, L_m)}{N},$$
where $N$ is the total number of samples, the maximum is taken over one-to-one pairings of clusters and classes, $C_k$ denotes the $k$th cluster in the final results, $L_m$ is the true $m$th class, and $T(C_k, L_m)$ is the number of entities which belong to class $m$ and are assigned to cluster $k$.
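A common way to compute a clustering accuracy of this kind is to count the overlaps between clusters and classes and then pick the one-to-one pairing with the largest total overlap. The sketch below does this with SciPy's `linear_sum_assignment`; treating the measure as a best one-to-one matching is our assumption, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of samples matched under the best cluster-to-class pairing."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes, ti = np.unique(true_labels, return_inverse=True)
    clusters, ci = np.unique(cluster_labels, return_inverse=True)

    # T[k, m] = number of samples of class m assigned to cluster k.
    T = np.zeros((clusters.size, classes.size), dtype=int)
    np.add.at(T, (ci, ti), 1)

    # One-to-one pairing of clusters and classes with maximal total overlap.
    rows, cols = linear_sum_assignment(T, maximize=True)
    return T[rows, cols].sum() / true_labels.size
```

The measure is invariant to how the clusters are numbered, so a clustering that recovers the classes under any relabeling scores 1.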
To obtain stable results, for each data set, we computed the average results of each algorithm over 20 runs. For comparison, we also performed the NC algorithm in the original data space (Baseline). For SR, KSR, and MKL-SR, the dimension of the subspace is the number of categories. For KPCA, we tested its performance with all the possible dimensions and report the best result. For SR, KSR, and MKL-SR, we simply set the value of the parameter as 1. For KSR and KPCA, the Gaussian function with width 1 was selected. For MKL-SR, we use a linear kernel function, a polynomial kernel function, and a Gaussian kernel function.
Table 2 lists the mean over 20 random repetitions as well as the standard deviation. From Table 2, we observe that the kernel based algorithms perform much better than SR, which indicates that the performance of linear DR algorithms can be improved by virtue of nonlinear kernel functions. MKL-SR significantly surpasses KSR and KPCA, which are single kernel based approaches. This is because MKL-SR is able to learn a better kernel by MKL, which is considerably more effective than a single Gaussian kernel. The performance of KSR is very close to that of KPCA, but the number of reduced dimensions of KPCA has to be determined by testing many candidate values. In addition to the fixed number of reduced dimensions, we also examined how the compared algorithms work when KPCA is applied to obtain projected data with a varied number of dimensions. Thus, MKL-SR is easy to implement and performs better than the other algorithms.
4.2. Experiments on Supervised Learning
In this experiment, we mainly compared MKL-SR with the following approaches: KPCA, LDA, SR, and KSR. In order to evaluate the performance of these algorithms, we performed the SVM algorithm in the original face image space (baseline) and KPCA, LDA, SR, KSR, and MKL-SR subspace. The kernels and parameters are set in the same way as in the unsupervised learning. From each class of the CMU PIE face data sets, we randomly selected (the number of training samples per class) samples for training.
For each given , we averaged the results over 30 random splits and computed the mean as well as the standard deviation, which are listed in Table 3. As can be seen from Table 3, the performance of KPCA and LDA is even worse than that of the baseline method, which results from the limitations of KPCA and LDA. KPCA is unsupervised and thus cannot effectively exploit the label information, which leads to the worst performance in the supervised case. LDA does not use regularization to control the model complexity, so it cannot avoid over-fitting in the small sample size case. In contrast, SR, KSR, and MKL-SR take advantage of the Tikhonov regularizer to improve the smoothness of the projection functions, and they perform better than KPCA and LDA. By replacing the linear embedding functions with nonlinear ones, KSR and MKL-SR both outperform SR. The performance of MKL-SR is better than that of KSR based on a single kernel, which indicates that MKL-SR can select an appropriate kernel and validates the effectiveness of our method.
The key parameter in MKL-SR is the regularization parameter $\alpha$, which controls the smoothness of the embedding function based on multiple kernels. Next, we discuss the impact of $\alpha$ on the performance of MKL-SR. Figure 1 shows the performance of MKL-SR as a function of $\alpha$. For convenience, the $x$-axis is plotted as $\alpha/(1+\alpha)$, which lies strictly in the interval $[0, 1]$. As can be seen from Figure 1, MKL-SR obtains the best performance near the middle of the interval. When $\alpha/(1+\alpha)$ decreases to zero or increases to one, the performance of MKL-SR decreases sharply. Fortunately, good performance can be achieved over a wide range of $\alpha$, which shows that parameter selection is not a crucial problem in the MKL-SR algorithm. In practice, we can use cross validation to choose the best parameter or simply select a value between 0.1 and 1.
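When scanning the regularization parameter, it can be convenient to lay out candidates evenly on the rescaled axis and map them back to parameter values. The short sketch below shows this mapping, assuming the rescaling is the ratio of the parameter to one plus the parameter; the helper names and the nine-point grid are illustrative choices, not settings reported in the experiments.

```python
def axis_from_alpha(alpha):
    """Map a regularization value alpha >= 0 to the rescaled axis in [0, 1)."""
    return alpha / (1.0 + alpha)


def alpha_from_axis(t):
    """Inverse map: a position 0 <= t < 1 on the rescaled axis back to alpha."""
    return t / (1.0 - t)


# Candidate values evenly spaced on the rescaled axis, e.g. for cross validation.
axis_grid = [i / 10 for i in range(1, 10)]
alpha_grid = [alpha_from_axis(t) for t in axis_grid]
```

The midpoint of the rescaled axis corresponds to a regularization value of 1.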
4.3. Experiments on Semisupervised Learning
In the semisupervised case, we compared the performance of MKL-SR with KPCA and semisupervised KSR. For comparison, we performed the SVM algorithm in the original face image space (baseline) and in the KPCA, semisupervised KSR, and MKL-SR subspaces. For KSR and MKL-SR, we simply set the value of the parameter as 1. In the semisupervised MKL-SR, the parameter was selected by cross validation. The kernels and parameters are set in the same way as in the unsupervised learning. For the extended Yale B face data set, a random subset with images per individual was first taken to form the training set and the rest of the data set was used as the testing set. In the training set, we use only one half of the data as labeled data and the rest as unlabeled data. KPCA only uses unlabeled data and the SVM algorithm is also performed on the reduced data based on KPCA. KSR and MKL-SR use both labeled and unlabeled data. The number of nearest neighbors is set to 7 for the nearest-neighbor graph over all the training samples in KSR and MKL-SR.
We average the classification accuracy over 30 random splits for each given . The mean as well as the standard deviation is shown in Table 4. From Table 4, we can observe that KSR and MKL-SR can efficiently exploit both labeled and unlabeled data to discover the intrinsic geometry structure in the data; that is, the reduced data can preserve the original intrinsic geometry structure very well. Thus, they outperform the baseline method and KPCA, which cannot utilize all the available data. The performance of MKL-SR is much better than that of KSR, which indicates that the final kernel matrix learned by MKL-SR is still better than the one based on a single kernel in the semisupervised case. Overall, the proposed MKL-SR algorithm can achieve better performance in the supervised, semisupervised, and unsupervised case.
5. Conclusions
In this paper, we propose a new dimensionality reduction framework called MKL-SR. By means of SR, we solve the out-of-sample extension problem by seeking an embedding function in the RKHS induced by multiple kernels. Thus, this method can not only construct the nonlinear embedding function in the form of a convex combination of base kernels but also improve the performance of single kernel based SR in the supervised, semisupervised, and unsupervised cases. Experimental results validate the effectiveness and efficiency of the MKL-SR algorithm. In the near future, we will further explore how to integrate different MKL methods into our model.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the Fundamental Research Funds for the Central Universities under Grant no. 2013XK10.
References
S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: spatial pyramid matching for recognizing natural scene categories,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), pp. 2169–2178, USA, June 2006.
A. C. Berg, T. L. Berg, and J. Malik, “Shape matching and object recognition using low distortion correspondences,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), pp. 26–33, June 2005.
X. He and P. Niyogi, “Locality Preserving Projections,” Advances in Neural Information Processing Systems, MIT Press, 2003.
K. Q. Weinberger, F. Sha, and L. K. Saul, “Learning a kernel matrix for nonlinear dimensionality reduction,” in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 839–846, Banff, Canada, July 2004.
X. Zhu, Z. Huang, Y. Yang, H. T. Shen, C. Xu, and J. Luo, “Self-taught dimensionality reduction on the high-dimensional small-sized data,” Pattern Recognition, vol. 46, no. 1, pp. 215–229, 2013.