Abstract

Traditional multiple kernel dimensionality reduction models are generally based on graph embedding and the manifold assumption. But such an assumption might be invalid for some high-dimensional or sparse data due to the curse of dimensionality, which has a negative influence on the performance of multiple kernel learning. In addition, some models might be ill-posed if the rank of the matrices in their objective functions is not high enough. To address these issues, we extend the traditional graph embedding framework and propose a novel regularized embedded multiple kernel dimensionality reduction method. Different from the conventional convex relaxation technique, the proposed algorithm directly takes advantage of a binary search and an alternating optimization scheme to obtain optimal solutions efficiently. The experimental results demonstrate the effectiveness of the proposed method for supervised, unsupervised, and semisupervised scenarios.

1. Introduction

Dimensionality reduction (DR) methods for supervised, unsupervised, and semisupervised learning tasks have attracted much attention in computer vision and pattern recognition [1–6]. These methods are often used as feature extraction methods for high-dimensional signals from various application fields, such as transportation, communications, industrial plants, and mines. Unsupervised dimensionality reduction, such as principal component analysis (PCA) [7], does not utilize any label information. Linear discriminant analysis (LDA) is a popular supervised dimensionality reduction method, which derives a projection by simultaneously maximizing the between-class scatter and minimizing the within-class scatter. Semisupervised dimensionality reduction, such as semisupervised discriminant analysis (SDA) [8], makes good use of labeled data while preserving the intrinsic geometric structure of unlabeled data.

In order to handle data sampled from a low-dimensional manifold, some nonlinear dimensionality reduction methods, such as isometric feature mapping (ISOMAP) [9], locally linear embedding (LLE) [10], and Laplacian Eigenmap (LE) [11], introduce the manifold assumption into dimensionality reduction and aim to maximally preserve certain interpoint relationships. However, these methods cannot address the out-of-sample extension problem. Thus, locality preserving projections (LPP), a linear approximation of LE [12], was proposed to both uncover the data manifold and provide out-of-sample extensions. These dimensionality reduction methods can be unified under a framework called graph embedding [13]. To achieve significant improvements, it is feasible to kernelize certain types of linear methods into nonlinear ones [14–18]. However, the performance of the kernelized versions heavily relies on the selection of the kernel function; with an inappropriate kernel, performance degrades considerably.

Recently, the advantage of using multiple kernels instead of a single kernel for dimensionality reduction has been demonstrated [15, 19]. Multiple kernel learning for dimensionality reduction (MKL-DR) was proposed to simultaneously learn an appropriate kernel from multiple base kernels and a transformation into a lower-dimensional space [20]. However, MKL-DR relaxes a nonconvex quadratically constrained quadratic program (QCQP) into a semidefinite program (SDP), which is very time-consuming and has a negative effect on its performance. More recently, a multiple kernel learning method called MKL-TR was proposed to improve the performance of MKL-DR [21]. MKL-TR formulates multiple kernel learning for dimensionality reduction as a trace ratio maximization problem. However, both MKL-DR and MKL-TR need to iteratively compute generalized eigendecompositions of dense matrices. Motivated by the efficiency of spectral regression, a fast multiple kernel dimensionality reduction method, termed MKL-SRTR, was presented to avoid the generalized eigendecomposition of dense matrices [22]. It is more efficient than MKL-DR and MKL-TR by virtue of spectral regression. Since MKL-DR, MKL-TR, and MKL-SRTR are all based on graph embedding and the manifold assumption, they cannot cope with cases in which the manifold assumption fails. In addition, MKL-DR and MKL-SRTR might be ill-posed if the rank of the matrices in their objective functions is not high enough [21].

Since spectral clustering and multiple kernel dimensionality reduction share the same form of optimization based on the manifold assumption, and motivated by the spectral embedded clustering framework proposed in [26], we first extend the traditional graph embedding framework by incorporating linear regularization terms into its model, termed extended graph embedding (EGE). Second, we introduce multiple kernel learning into EGE (termed MKL-EGE) to improve the performance of single kernel DR. Compared with traditional multiple kernel dimensionality reduction methods, such as MKL-SRTR, the proposed method not only avoids the ill-posed problem but is also more robust to high-dimensional or sparse data. Furthermore, our method directly utilizes a binary search and an alternating optimization scheme to obtain optimal solutions. The experimental results demonstrate that the proposed method achieves better or similar performance compared to other algorithms in supervised, unsupervised, and semisupervised settings.

The remainder of the paper is structured as follows. In Section 2, we briefly introduce the related work. We present the MKL-EGE framework and the optimization process in Section 3. The experimental results are reported in Section 4. Finally, we draw conclusions in Section 5. To avoid confusion, a list of the main notations used in this paper is given in the Notations section.

2. Graph Embedding and Its Extension

2.1. Graph Embedding

Specifically, denote an undirected weighted graph by $G=\{X,W\}$, where $X$ is the vertex set and $W\in\mathbb{R}^{n\times n}$ represents an affinity matrix. Each entry of the symmetric matrix $W$ is the edge weight that characterizes the similarity between a pair of vertices of $G$. A dimensionality reduction scheme aims at finding a subspace whose dimensionality is much smaller than that of the input space by means of a complete graph whose vertices are the data points in $X$. The purpose of graph embedding is to represent each vertex of the graph as a low-dimensional vector while preserving the similarities between vertex pairs. The optimal embedding $\mathbf{y}^{*}$ can be obtained by solving
$$\mathbf{y}^{*}=\arg\min_{\mathbf{y}^{T}L'\mathbf{y}=1}\ \mathbf{y}^{T}L\mathbf{y},$$
where $L=D-W$ is the graph Laplacian matrix of $G$ and $D$ is a diagonal matrix with the diagonal elements defined as $D_{ii}=\sum_{j}W_{ij}$. $L'$ is the graph Laplacian matrix of another weighted graph $G'$ (the penalty graph).
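To make this construction concrete, the following minimal sketch builds the graph Laplacian from an affinity matrix and solves the relaxed embedding as a generalized eigenproblem. Treating the constraint matrix as a penalty-graph Laplacian and adding a small ridge for invertibility are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def graph_laplacian(W):
    """Return L = D - W for a symmetric affinity matrix W."""
    D = np.diag(W.sum(axis=1))
    return D - W

def graph_embedding(W, W_penalty, dim):
    """Relaxed graph embedding: minimize y^T L y subject to y^T L' y = const,
    solved as a generalized eigenproblem (smallest eigenvalues kept)."""
    L = graph_laplacian(W)
    Lp = graph_laplacian(W_penalty)
    n = W.shape[0]
    # A small ridge keeps the (singular) penalty Laplacian positive definite.
    vals, vecs = eigh(L, Lp + 1e-8 * np.eye(n))
    return vecs[:, :dim]   # embedding coordinates of the n vertices
```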

By specifying the affinity matrix $W$ and the corresponding constraint (penalty) matrix, PCA, ISOMAP, LLE, LPP, LDA, local discriminant embedding (LDE), marginal Fisher analysis (MFA) [13], and spectral regression (SR) [23, 24] can all be expressed as graph embedding. Since the trace form of the objective and the orthogonality constraint on the embedding are the most commonly used, in this paper we mainly discuss the following form of graph embedding:
$$\min_{Y}\ \operatorname{tr}(Y^{T}LY)\quad\text{s.t. } Y^{T}Y=I,$$
which is usually relaxed and solved through the eigenvectors of $L$ associated with its smallest eigenvalues.

2.2. Extended Graph Embedding

The term $\operatorname{tr}(Y^{T}LY)$ in problem (4) is actually derived based on the manifold assumption [25]. However, for high-dimensional or sparse data, this assumption may not hold because of the bias caused by the curse of dimensionality. Thus, the low-dimensional manifold structure cannot be exploited through the resulting inaccurate similarity matrix, which degrades the performance of graph embedding.

To address this issue, we try to improve the traditional graph embedding framework. Noticing that the term $\operatorname{tr}(Y^{T}LY)$ can be regarded as the objective function of spectral clustering, we use the spectral embedded clustering method proposed in [26] to extend the graph embedding framework. Specifically, we minimize the following objective function:
$$\min_{Y^{T}Y=I,\,P,\,\mathbf{b}}\ \operatorname{tr}(Y^{T}LY)+\mu\bigl(\|X^{T}P+\mathbf{1}\mathbf{b}^{T}-Y\|_{F}^{2}+\gamma\|P\|_{F}^{2}\bigr),$$
where $\mu$ and $\gamma$ are two regularization parameters, $\mathbf{1}$ denotes the vector of all ones, and the second term characterizes the mismatch between the low-dimensional feature matrix $Y$ and the low-dimensional representation $X^{T}P+\mathbf{1}\mathbf{b}^{T}$ of the data.

Theorem 1. The optimization problem (4) can be transformed into the following minimization problem:
$$\min_{Y^{T}Y=I}\ \operatorname{tr}\bigl(Y^{T}(L+\mu L_{G})Y\bigr),$$
where $L_{G}=H-HX^{T}(XHX^{T}+\gamma I_{d})^{-1}XH$ and $H=I_{n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{T}$. $I_{n}$ and $I_{d}$ represent the identity matrix of size $n\times n$ and size $d\times d$, respectively.

Proof. By setting the derivatives of the objective function (4) with respect to $P$ and $\mathbf{b}$ to zero, we have
$$P=(XHX^{T}+\gamma I_{d})^{-1}XHY,\qquad \mathbf{b}=\frac{1}{n}\,(Y-X^{T}P)^{T}\mathbf{1}.$$
By substituting $P$ and $\mathbf{b}$ in (4) by (6), the optimization problem (4) becomes
$$\min_{Y^{T}Y=I}\ \operatorname{tr}\bigl(Y^{T}(L+\mu L_{G})Y\bigr),$$
where $L_{G}=H-HX^{T}(XHX^{T}+\gamma I_{d})^{-1}XH$. This completes the proof of Theorem 1.

From problem (5), we can see that the form of EGE is similar to that of GE and that GE is a special case of EGE when $\mu=0$. The matrix $L+\mu L_{G}$ can be regarded as a corrected graph Laplacian matrix for high-dimensional data.
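The correction matrix is inexpensive to form explicitly. The sketch below assumes the spectral-embedded-clustering form $L_{G}=H-HX^{T}(XHX^{T}+\gamma I_{d})^{-1}XH$ used in the reconstruction of Theorem 1 (following [26]); the function and variable names are illustrative rather than the paper's notation.

```python
import numpy as np

def ege_laplacian(X, L, mu, gamma):
    """Corrected Laplacian L + mu * L_G used by extended graph embedding.

    X : (d, n) data matrix (features x samples)
    L : (n, n) graph Laplacian of the affinity graph
    Assumes L_G = H - H X^T (X H X^T + gamma * I_d)^(-1) X H,
    where H = I_n - (1/n) 11^T is the centering matrix.
    """
    d, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    XH = X @ H                                    # (d, n); H is idempotent
    inner = XH @ XH.T + gamma * np.eye(d)         # X H X^T + gamma * I_d
    L_G = H - XH.T @ np.linalg.solve(inner, XH)   # H - H X^T (...)^{-1} X H
    return L + mu * L_G
```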

Under the constraint $Y^{T}Y=I$, problem (5) can further be transformed into an equivalent trace ratio form, which we adopt as the basis for the multiple kernel extension in the next section.

3. Multiple Kernel Learning Based on EGE and Trace Ratio Maximization

Since MKL-DR, MKL-TR, and MKL-SRTR can be viewed as multiple kernel versions of graph embedding, it is natural to establish a multiple kernel learning framework for dimensionality reduction based on EGE.

3.1. Formulation

Suppose the ensemble kernel $K$ is generated by linearly combining the base kernels $\{K_{m}\}_{m=1}^{M}$; that is, $K=\sum_{m=1}^{M}\beta_{m}K_{m}$, where $\beta_{m}\ge 0$ and $\sum_{m=1}^{M}\beta_{m}=1$. We can find a sample coefficient matrix $A$ and a kernel weight vector $\beta$ by solving a trace ratio optimization problem based on extended graph embedding (problem (9)), whose numerator and denominator are built from the ensemble kernel and the graph matrices of EGE. It should be noted that dimensionality reduction based on trace ratio optimization tends to overfit [27, 28]. To address this issue, a regularization term is added to the denominator of problem (9) to ensure that the denominator matrix is of full rank, which yields the regularized objective function (11). Compared with MKL-SRTR, the proposed method is based on the extended graph embedding framework and is therefore more robust to high-dimensional or sparse data. In addition, our method avoids the ill-posed problem.
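As an illustration of the kernel combination step, the sketch below forms the ensemble kernel from precomputed base kernel matrices given the current weight vector; the simplex constraint on the weights is assumed from the formulation above.

```python
import numpy as np

def ensemble_kernel(base_kernels, beta):
    """K = sum_m beta_m * K_m for base kernels stacked as an (M, n, n) array.

    beta is assumed to lie on the simplex (beta_m >= 0, sum_m beta_m = 1).
    """
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and abs(beta.sum() - 1.0) < 1e-8
    return np.tensordot(beta, base_kernels, axes=1)   # (n, n) weighted sum
```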

3.2. Method

To optimize our objective function, an auxiliary function $f(\lambda)$ that satisfies constraints (13)–(15) is defined by replacing the trace ratio with the difference between its numerator and $\lambda$ times its regularized denominator, maximized over the feasible solutions.

The optimal value $\lambda^{*}$ of the objective function in (15) is the root of $f(\lambda)=0$ [27, 28]. Based on (15), we update $A$, $\beta$, and $\lambda$ alternately.

On Optimizing $A$ and $\lambda$. By fixing $\beta$, optimization problem (11) is simplified to a trace ratio problem in $A$ alone. Thus, a binary search (maintaining a lower bound and an upper bound) is used to seek $\lambda$ such that $f(\lambda)=0$. The value of $f(\lambda)$ can be easily calculated as the sum of the first $c$ largest eigenvalues of the corresponding trace difference matrix. The optimal $A$ is finally obtained by performing the eigenvalue decomposition of this matrix at $\lambda=\lambda^{*}$.
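The sketch below illustrates this binary search for a generic trace ratio problem $\max_{A^{T}A=I}\operatorname{tr}(A^{T}S_{1}A)/\operatorname{tr}(A^{T}S_{2}A)$; the matrices `S1` and `S2` are placeholders for the paper's numerator and regularized denominator matrices, and the bracketing of the search interval assumes `S1` is positive semidefinite and `S2` positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def trace_ratio_binary_search(S1, S2, c, eps=1e-3, max_iter=100):
    """Find lambda* such that f(lambda) = sum of the c largest eigenvalues of
    S1 - lambda * S2 is (approximately) zero, and return the projection A."""
    def f(lam):
        vals = eigh(S1 - lam * S2, eigvals_only=True)
        return np.sum(vals[-c:])          # sum of the c largest eigenvalues

    lo, hi = 0.0, 1.0
    while f(hi) > 0:                      # grow the upper bound until it brackets the root
        hi *= 2.0
    lam = 0.5 * (lo + hi)
    for _ in range(max_iter):
        if hi - lo < eps:
            break
        if f(lam) > 0:
            lo = lam                      # root lies to the right
        else:
            hi = lam
        lam = 0.5 * (lo + hi)
    vals, vecs = eigh(S1 - lam * S2)
    A = vecs[:, -c:]                      # eigenvectors of the c largest eigenvalues
    return lam, A
```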

On Optimizing $\beta$. By fixing $A$ and $\lambda$, $\beta$ can be obtained by solving the remaining optimization problem in $\beta$. We define an auxiliary function $g(\beta)$ with $A$ and $\lambda$ fixed and compute its gradient with respect to $\beta$; $\beta$ is then updated along this gradient direction. Finally, a quadratic programming problem is solved to project the updated $\beta$ back onto the feasible set $\{\beta:\beta_{m}\ge 0,\ \sum_{m}\beta_{m}=1\}$.
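A sketch of this kernel weight update, assuming $\beta$ is moved along the gradient of the fixed-$A$ objective and then returned to the feasible set. The gradient computation is schematic (`grad_fn` is a placeholder), and the feasibility step uses the standard Euclidean projection onto the simplex rather than the paper's exact quadratic program.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def update_beta(beta, grad_fn, step=0.5):
    """Move beta along the gradient direction, then restore feasibility."""
    g = grad_fn(beta)    # placeholder: gradient of g(beta) with A and lambda fixed
    return project_to_simplex(np.asarray(beta) + step * np.asarray(g))
```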

3.3. Algorithms

The proposed algorithm based on EGE and the regularized trace ratio, termed MKL-EGE, is described in Algorithm 1. As can be seen from Algorithm 1, MKL-EGE uses a binary search in the inner iterations to speed up convergence and alternately updates the coefficient matrix and the kernel weights in the outer iterations to seek the optimal solution. Since the proposed algorithm cannot guarantee finding the exact optimal solution, we terminate it after a maximum number of iterations and choose the best result.

Input: The matrix of data points X, the number of classes c, the step length, the maximum number of iterations iter1,
   the regularization parameters μ and γ, and an error constant ε.
Output: The sample coefficient matrix A and the kernel weight vector β.
(1) Initialize β, construct the weighted matrix W, and calculate the graph Laplacian L and the correction matrix L_G.
(2) Repeat
   (3) Calculate the numerator and denominator matrices of the trace ratio for the current β.
   (4) Find the upper bound λ_2 from the sum of the first c largest eigenvalues and the lower bound λ_1 from the sum of the first c smallest eigenvalues of the current matrices.
   (5) Let λ = (λ_1 + λ_2)/2.
   (6) while λ_2 − λ_1 > ε do
     (7) Compute f(λ) as the sum of the first c largest eigenvalues of the trace difference matrix.
     (8) If f(λ) > 0 then λ_1 = λ, else λ_2 = λ.
     (9) λ = (λ_1 + λ_2)/2.
   (10) Obtain A = [a_1, …, a_c], where a_1, …, a_c are the c eigenvectors corresponding to the c largest eigenvalues of the trace difference matrix.
   (11) Set λ* = λ.
   (12) Update β.
  until iter1 is reached
(13) Output A and β; calculate the embedding result of the training data.
3.4. Computational Complexity

For MKL-EGE, the computational complexity of the inner iterations grows with the maximum number of inner iterations iter2, and the complexity of the whole algorithm additionally scales with the maximum number of outer iterations iter1. MKL-DR needs to solve an SDP problem in each iteration, whose computational cost is much higher [20].

The computational complexity of MKL-TR is lower than that of MKL-DR [21]. Since MKL-EGE only needs a small number of iterations to converge, the computational cost of our method is much lower than that of MKL-DR and MKL-TR.

3.5. Unseen Sample Embedding

After accomplishing the training procedure of MKL-EGE, we can project a new sample $\mathbf{x}$ into the learned subspace by
$$\mathbf{z}=A^{T}\mathbf{k}(\mathbf{x}),\qquad \mathbf{k}(\mathbf{x})=\Bigl[\textstyle\sum_{m=1}^{M}\beta_{m}k_{m}(\mathbf{x},\mathbf{x}_{1}),\dots,\sum_{m=1}^{M}\beta_{m}k_{m}(\mathbf{x},\mathbf{x}_{n})\Bigr]^{T}.$$
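A sketch of this out-of-sample projection under the assumptions above, using RBF base kernels as in the experimental setup below; the width parameterization of the base kernels is an assumption of this sketch.

```python
import numpy as np

def embed_new_sample(x_new, X_train, A, beta, kernel_widths):
    """Project an unseen sample with the learned A and beta.

    k(x)_i = sum_m beta_m * k_m(x, x_i), with RBF base kernels assumed here.
    X_train : (n, d) training samples, x_new : (d,) new sample, A : (n, c).
    """
    sq_dist = np.sum((X_train - x_new) ** 2, axis=1)        # (n,) squared distances
    k = sum(b * np.exp(-sq_dist / (2.0 * s ** 2))
            for b, s in zip(beta, kernel_widths))            # (n,) ensemble kernel column
    return A.T @ k                                           # (c,) embedded coordinates
```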

4. Experiments

We compared the proposed MKL-EGE algorithm with MKL-DR [20], MKL-TR [21], and MKL-SRTR [22] on UCI datasets (Sonar, Ionosphere, and Isolet), face recognition datasets (Yale, PIE, and ORL), digit recognition datasets (USPS and MNIST), an object recognition dataset (COIL-20), and text datasets (20 Newsgroups). We randomly selected 300 samples from each digit for the USPS dataset and used digits 3, 6, and 8 for the MNIST dataset. For the 20 Newsgroups data, the four largest topics (comp, rec, sci, and talk) were selected as high-dimensional datasets. For all datasets, we randomly split the samples into training and testing sets with a 1 : 1 ratio. The basic information of the datasets is listed in Table 1. All experiments were performed in MATLAB R2013a on a 3.10 GHz Intel Core i5-2400 with 4 GB RAM.

For all datasets, we first normalized the values of the data vectors to a common range and used 10 RBF base kernels, whose values are set as 0.10, 0.20, 0.40, 0.80, 1.60, 3.20, 6.40, 12.80, 25.60, and 51.20, respectively. In all experiments, we set t_1 = 0.5 and ε = 0.001. The parameter k of the k-nearest-neighbor graph is set to 5 empirically. For fair comparison, we set the regularization parameter to 0.5 for MKL-TR and MKL-EGE. For MKL-SRTR and MKL-EGE, we searched for the optimal parameters by fivefold cross-validation and reported the best experimental results.
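For concreteness, here is a sketch of how the 10 RBF base kernel matrices could be precomputed for the listed values. Whether these values play the role of the kernel width or of a scale inside the exponent is not stated explicitly in the text, so the width interpretation below is an assumption, as is the [0, 1] normalization.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

SIGMAS = [0.10, 0.20, 0.40, 0.80, 1.60, 3.20, 6.40, 12.80, 25.60, 51.20]

def build_base_kernels(X, sigmas=SIGMAS):
    """Return an (M, n, n) stack of RBF kernel matrices for data X of shape (n, d).

    Data are assumed to be normalized (e.g., to [0, 1] per feature) beforehand.
    """
    sq = squareform(pdist(X, metric="sqeuclidean"))   # pairwise squared distances
    return np.stack([np.exp(-sq / (2.0 * s ** 2)) for s in sigmas])
```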

4.1. Experiments on Supervised Learning

The maximum number of iterations for all algorithms is set to 20. For MKL-DR, MKL-SRTR, and MKL-EGE, the affinity matrix is defined as
$$W_{ij}=\begin{cases}1/n_{c}, & \text{if } \mathbf{x}_{i} \text{ and } \mathbf{x}_{j} \text{ both belong to class } c,\\ 0, & \text{otherwise},\end{cases}$$
where $n_{c}$ is the number of samples in class $c$. For MKL-TR, the graph matrices are built from the class indicator matrix, whose entry is 1 if the sample belongs to the corresponding class and 0 otherwise. For MKL-DR, the elements of the other affinity matrix are all set to the same constant value. The final reduced dimension is the same for all algorithms. We used libSVM [29] with a linear kernel to classify the embedded data. All experiments were independently repeated 20 times.
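A sketch of the supervised affinity matrix as defined above ($W_{ij}=1/n_{c}$ when two samples share class $c$, 0 otherwise), which is the common LDA-style graph used in this family of methods.

```python
import numpy as np

def supervised_affinity(labels):
    """W_ij = 1 / n_c if labels[i] == labels[j] == c, else 0."""
    labels = np.asarray(labels)
    n = len(labels)
    W = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)   # uniform weight within each class
    return W
```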

The mean classification accuracies and standard deviations of the different algorithms are displayed in Table 2. As can be seen from Table 2, MKL-EGE significantly outperforms MKL-DR, MKL-TR, and MKL-SRTR on most datasets, achieving the best recognition rate on 11 of the 13 datasets. In particular, the performance of MKL-EGE is much better than that of the other algorithms on high-dimensional datasets such as Yale, PIE, ORL, and COIL-20. This is because MKL-EGE incorporates EGE and linear regularization terms into its model, which is effective for handling high-dimensional data and helps to avoid overfitting. Consequently, MKL-EGE is more robust than the other algorithms based on traditional graph embedding. In addition, the performance of MKL-EGE is very close to that of MKL-TR and MKL-SRTR on low-dimensional datasets such as Ionosphere, which shows that the proposed method is effective for both low-dimensional and high-dimensional data. The performance of MKL-DR is the worst among all algorithms, which validates that the SDP relaxation technique applied in MKL-DR has a negative influence on the performance of dimensionality reduction. The performance of MKL-TR is similar to that of MKL-SRTR, since MKL-SRTR only uses spectral regression to improve the speed of MKL-TR.

We used all samples from each class of ORL as training data and applied the different algorithms to obtain the corresponding two-dimensional embeddings. To further validate and compare the results, we also tested the algorithms on PIE, which has the largest number of samples. The final embedding results are shown in Figures 1 and 2, respectively. As can be seen from Figures 1 and 2, the embeddings obtained by MKL-DR, MKL-TR, and MKL-SRTR overlap more severely than that obtained by MKL-EGE. The embedding obtained by MKL-EGE has the best separability, which demonstrates that MKL-EGE is more effective than the other algorithms for high-dimensional face data. Consequently, SVM classification based on the MKL-EGE embedding performs best among all algorithms.

To compare the computational time of the different algorithms, we used all data samples of each dataset as training data for the different multiple kernel dimensionality reduction methods. The results are displayed in Figure 3. From Figure 3, we can see that MKL-SRTR and MKL-EGE are much faster than MKL-DR and MKL-TR. Although MKL-EGE uses a binary search in the inner iterations to speed up convergence, it is slightly slower than MKL-SRTR owing to the eigenvalue decomposition of dense matrices. The convergence curves of MKL-EGE and MKL-SRTR are displayed in Figure 4. As can be seen from Figure 4, MKL-EGE converges faster than MKL-SRTR; this is because MKL-SRTR needs a predefined step length for its parameter update and does not adjust the step length adaptively in each iteration. To compare the approximation quality of the different algorithms, Figure 5 shows the histograms of the final values of $f(\lambda)$ obtained by all algorithms in 100 runs. As can be seen from Figure 5, the approximate solutions of MKL-EGE are more concentrated near zero than those of the other algorithms, which validates that our algorithm finds the root of $f(\lambda)$ more accurately. Overall, the proposed method is the most cost-efficient among all algorithms.

4.2. Experiments on Unsupervised Learning

To evaluate the performance of MKL-EGE in unsupervised settings, we first used all algorithms to project the original data onto a subspace, where the normalized cut spectral clustering (NC) [30] algorithm was then applied to evaluate the clustering performance. For MKL-TR, the graph matrices are built from the same affinity matrix that is used for MKL-EGE, MKL-SRTR, and MKL-DR. In the unsupervised case, we set the number of clusters to the number of classes in each dataset. To evaluate the clustering performance, the normalized mutual information (NMI) and Rand index (RI) [31] were adopted.
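A sketch of this evaluation pipeline: spectral clustering on the embedded data followed by NMI and Rand index scores. scikit-learn's `SpectralClustering` and `normalized_mutual_info_score` are used here as stand-ins for the normalized cut and NMI implementations referenced in the paper, and the plain Rand index is computed from pairwise agreements.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

def rand_index(labels_true, labels_pred):
    """Plain Rand index: fraction of sample pairs on which the two labelings agree."""
    t = np.asarray(labels_true)
    p = np.asarray(labels_pred)
    same_true = t[:, None] == t[None, :]
    same_pred = p[:, None] == p[None, :]
    iu = np.triu_indices(len(t), k=1)           # each unordered pair once
    return np.mean(same_true[iu] == same_pred[iu])

def evaluate_clustering(Y_embedded, labels_true, n_clusters):
    """Cluster the embedded data with spectral clustering, then score the result."""
    pred = SpectralClustering(n_clusters=n_clusters,
                              affinity="nearest_neighbors").fit_predict(Y_embedded)
    return (normalized_mutual_info_score(labels_true, pred),
            rand_index(labels_true, pred))
```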

We used the same datasets and the same preprocessing procedure as in the supervised learning experiments. For unsupervised MKL-DR, initializing the kernel weights first yielded more stable performance, so this strategy was adopted in the experiments. To obtain stable results, for each dataset, we averaged the results of each algorithm over 20 runs.

The NMI and RI values obtained by these algorithms are reported in Tables 3 and 4, respectively. From Tables 3 and 4, we can see that MKL-EGE performs better than the other algorithms on most datasets, which demonstrates that using EGE and the regularization terms improves the performance of dimensionality reduction. Consequently, it is able to find a more effective combination of base kernels in unsupervised settings. MKL-TR and MKL-SRTR evidently outperform MKL-DR, which indicates that the SDP relaxation used in MKL-DR also has a negative effect on the performance of dimensionality reduction in this case.

4.3. Experiments on Semisupervised Learning

In the semisupervised case, MKL-DR, MKL-TR, MKL-SRTR, and MKL-EGE are essentially multiple kernel extensions of semisupervised discriminant analysis (SDA) [32–34]. Given labeled and unlabeled data, SDA can be specified by two affinity matrices [34]: one encodes the label information of the labeled samples, and the other encodes the neighborhood relations of all samples, with a parameter that adjusts the weight between the label information and the unsupervised neighbor information. For MKL-TR, the two graph matrices are set accordingly. The trade-off parameter is set to 0.1 for all algorithms.
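A sketch of an SDA-style graph construction: a label-based block for the labeled samples combined with a k-nearest-neighbor graph over all samples, weighted by the trade-off parameter. The exact definitions of the two affinity matrices are not fully legible in the source, so this follows the common SDA construction in [34] and should be read as an assumption.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def sda_affinity(X, labels, n_labeled, alpha=0.1, k=5):
    """Combine label information (first n_labeled samples) with a kNN graph.

    W = W_label + alpha * W_knn, where W_label[i, j] = 1/n_c for labeled samples
    of the same class and W_knn is a symmetrized k-nearest-neighbor graph.
    """
    n = X.shape[0]
    W_label = np.zeros((n, n))
    lab = np.asarray(labels[:n_labeled])
    for c in np.unique(lab):
        idx = np.where(lab == c)[0]
        W_label[np.ix_(idx, idx)] = 1.0 / len(idx)
    knn = kneighbors_graph(X, n_neighbors=k, mode="connectivity").toarray()
    W_knn = np.maximum(knn, knn.T)          # symmetrize the kNN graph
    return W_label + alpha * W_knn
```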

In the semisupervised setting, the same datasets and parameter initialization were used. We randomly selected one-half of the training data as labeled data for each dataset. Each algorithm was independently run 20 times. The average classification accuracies and standard deviations are reported in Table 5. As can be seen from Table 5, the proposed MKL-EGE algorithm performs better than MKL-SRTR, MKL-TR, and MKL-DR. Our algorithm, which effectively takes advantage of EGE and regularized trace ratio optimization, can automatically learn the weights of the base kernels and combine them to improve the performance of dimensionality reduction. With the same prior information, the proposed algorithm achieves the best result on 10 of the 13 datasets compared with these state-of-the-art methods.

To visualize the semisupervised dimensionality reduction results, we used all samples from the first 10 classes of PIE and projected them into a two-dimensional subspace for graphical representation, as shown in Figure 6. From Figure 6, we can observe that the embeddings obtained by MKL-EGE and MKL-SRTR are more clearly separated than those obtained by MKL-DR and MKL-TR. The embedding obtained by MKL-EGE has the best separability, which further validates that the performance of MKL-EGE is much better than that of the other algorithms in the semisupervised case.

4.4. Experiments on Real World Datasets

To evaluate the effectiveness of MKL-EGE on real world data, we used it as a feature extraction method for bearing vibration signals, which were collected by bearing accelerometer sensors under different operating loads and bearing conditions in mines. The vibration signals were acquired with a 16-channel digital audio tape (DAT) recorder at a sampling frequency of 12 kHz. Following the experimental settings in [35], the vibration data were divided into four datasets, named D_IRF, D_ORF, D_BF, and D_MIX, as shown in Table 6, where "07," "14," "21," and "28" indicate fault diameters of 0.007, 0.014, 0.021, and 0.028 inches, respectively. We used one-half of the vibration data as training samples and the other half as testing samples.

Following the experimental settings in [35], we first transformed the vibration signals into 10 time domain features, 3 frequency domain features, and 16 time-frequency domain features. Second, low-dimensional features were extracted for bearing fault diagnosis or prognosis. Finally, SVM was used to evaluate the performance of the different DR methods. The first three extracted features, corresponding to the largest eigenvalues, were employed as the input features of SVM. The classification accuracy rates are reported in Table 7. It can be observed that MKL-EGE achieves much better results than the other algorithms on all datasets, which further demonstrates the effectiveness of our method for feature extraction from vibration signals in real applications.
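As an illustration of the feature-extraction stage, the sketch below computes a few common time-domain statistics from a vibration segment. The specific 10 time-domain, 3 frequency-domain, and 16 time-frequency features used in [35] are not listed in this paper, so the statistics below (RMS, crest factor, kurtosis, skewness, and so on) are representative assumptions rather than the exact feature set.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def time_domain_features(segment):
    """A few representative time-domain statistics of a 1-D vibration segment."""
    x = np.asarray(segment, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    peak = np.max(np.abs(x))
    return np.array([
        np.mean(x),          # mean value
        np.std(x),           # standard deviation
        rms,                 # root mean square
        peak,                # peak amplitude
        peak / rms,          # crest factor
        kurtosis(x),         # kurtosis
        skew(x),             # skewness
        np.mean(np.abs(x)),  # mean absolute value
    ])
```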

5. Conclusion

In this paper, we propose a new multiple kernel dimensionality reduction method called MKL-EGE. By means of EGE and regularized trace ratio maximization, the proposed method not only avoids the SDP relaxation of MKL-DR but also further improves the performance of multiple kernel dimensionality reduction. Moreover, the proposed algorithm makes good use of a binary search and an alternating optimization scheme to efficiently find optimal solutions. Experimental results validate the effectiveness of this method. In the future, we plan to incorporate pairwise constraints into our framework and to explore multiple kernel dimensionality reduction via convex optimization.

Notations

$\mathbb{R}^{d}$: The input $d$-dimensional Euclidean space
$n$: The number of total data points
$c$: The number of classes that the samples belong to
$X$: $X=[\mathbf{x}_{1},\dots,\mathbf{x}_{n}]$ is the training data matrix
$\mathbf{y}_{i}$: The 0-1 label vector; $y_{i}$ is the label of $\mathbf{x}_{i}$
$k(\mathbf{x}_{i},\mathbf{x}_{j})$: Kernel function of data vectors $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$
$K$: Kernel matrix
$\{K_{m}\}_{m=1}^{M}$: Base kernels
$\beta$: $\beta=[\beta_{1},\dots,\beta_{M}]^{T}$, representing nonnegative coefficients of the base kernels
$K(\beta)$: The ensemble kernel
$\operatorname{tr}(S)$: The trace of the matrix $S$, that is, the sum of the diagonal elements of $S$.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61403394 and 71573256) and the Fundamental Research Funds for the Central Universities (2014QNA46).