Abstract

Traditional supervised multiple kernel learning (MKL) methods for dimensionality reduction are generally extensions of kernel discriminant analysis (KDA), which carries some restrictive assumptions, and they are typically built on the graph embedding framework. A more general multiple kernel-based dimensionality reduction algorithm, called multiple kernel marginal Fisher analysis (MKL-MFA), is presented for supervised nonlinear dimensionality reduction combined with a ratio-trace optimization problem. MKL-MFA aims at relaxing the restrictive assumption that the data of each class follows a Gaussian distribution and at finding an appropriate convex combination of several base kernels. To improve the efficiency of multiple kernel dimensionality reduction, the spectral regression framework is incorporated into the optimization model. Furthermore, the optimal weights of the predefined base kernels can be obtained by solving a convex optimization problem. Experimental results on benchmark datasets demonstrate that MKL-MFA outperforms state-of-the-art supervised multiple kernel dimensionality reduction methods.

1. Introduction

Recently, multiple kernel dimensionality reduction methods have attracted many researchers, and a series of methods have been proposed based on the graph embedding framework [17]. These methods generally transform the primal data into high-dimensional feature spaces induced by a set of base kernels, where a linear transformation is sought to perform dimensionality reduction. Consequently, these nonlinear dimensionality reduction methods not only deal with high-dimensional data effectively but also automatically select optimal kernels from a predefined set of base kernels. It has been demonstrated that multiple kernel methods perform better than single kernel-based methods in dimensionality reduction.

Although existing multiple kernel dimensionality reduction methods are significantly superior to single kernel-based dimensionality reduction methods, they still face challenging issues. Firstly, these algorithms have to iteratively solve a time-consuming generalized eigenvalue problem as part of an alternating optimization procedure. Secondly, they generally transform the primal model into a simpler form by relaxing it into a semidefinite programming (SDP) problem, or use gradient descent algorithms that only reach local optima, both of which can degrade performance. To overcome these shortcomings, some multiple kernel dimensionality reduction algorithms based on spectral regression were proposed recently [8-10]. They transform the eigendecomposition of dense matrices into a linear regression problem by means of spectral regression. However, they still rely on convex relaxation or gradient descent to optimize the kernel weights. Instead of convex relaxation, a multiple kernel learning framework was recently proposed that avoids relaxing the primal problem [11]; it learns a transformation into a lower-dimensional space by converting a ratio-trace maximization problem into a semi-infinite linear program. But this method still needs to iteratively compute the generalized eigendecomposition of dense matrices. In addition, these methods are multiple kernel versions of KDA and are unified under the graph embedding framework. Thus, they all assume that the distribution of each class is a unimodal Gaussian. This property often does not hold in real-world applications, and the separability of different classes cannot be well characterized by the interclass scatter. Although kernel marginal Fisher analysis (KMFA) has been developed to overcome this limitation by using an intrinsic graph and a penalty graph [12], it has to choose the kernel type and determine its parameters beforehand.

Motivated by these methods, in this paper a new multiple kernel dimensionality reduction algorithm, called multiple kernel marginal Fisher analysis (MKL-MFA), is presented for supervised nonlinear dimensionality reduction. MKL-MFA not only removes the restrictive assumption of existing multiple kernel dimensionality reduction methods but also automatically constructs appropriate kernels for nonlinear dimensionality reduction by means of the ratio-trace model. Furthermore, spectral regression is used to avoid the decomposition of dense matrices and to speed up the learning of MKL-MFA. Finally, like other multiple kernel-based dimensionality reduction methods, it can also handle the out-of-sample extension problem.

2.1. Marginal Fisher Analysis

The graph embedding framework is a general platform designed for dimensionality reduction algorithms, and ISOMAP, LLE, and Laplacian eigenmaps can all be derived from it. Within this framework, we develop a new dimensionality reduction algorithm that avoids the limitations of traditional linear discriminant analysis regarding its data distribution assumption and the number of available projection directions.

The linear discriminant analysis (LDA) algorithm assumes that the data of each class follows a Gaussian distribution, which usually does not hold in practical problems. Without this property, the separability of different classes cannot be characterized by the interclass scatter. This limitation of LDA can be overcome by developing new criteria that characterize intraclass compactness and interclass separability. To this end, we use the graph embedding framework to build the algorithm called marginal Fisher analysis (MFA). It designs an intrinsic graph that characterizes intraclass compactness and a penalty graph that characterizes interclass separability. Specifically, the intrinsic graph describes the adjacency relationships of intraclass points, connecting each sample to its nearest neighbors in the same class. The penalty graph describes the adjacency relationships of interclass marginal points, connecting the marginal point pairs of different classes.

By following the graph embedding formulation, intraclass compactness is characterized from the intrinsic graph by the term [11-13]
$$\tilde{S}_c=\sum_{i}\sum_{j\in N_{k_1}^{+}(i)}\left\|w^{\top}x_i-w^{\top}x_j\right\|^2=2\,w^{\top}X(D-W)X^{\top}w,$$
where
$$W_{ij}=\begin{cases}1, & \text{if } i\in N_{k_1}^{+}(j)\ \text{or}\ j\in N_{k_1}^{+}(i),\\ 0, & \text{otherwise.}\end{cases}$$
Here, $N_{k_1}^{+}(i)$ indicates the index set of the $k_1$ nearest neighbors of the sample $x_i$ in the same class. Interclass separability is characterized by a penalty graph with the term [11-13]
$$\tilde{S}_p=\sum_{i}\sum_{(i,j)\in P_{k_2}(c_i)}\left\|w^{\top}x_i-w^{\top}x_j\right\|^2=2\,w^{\top}X(D^{p}-W^{p})X^{\top}w,$$
where
$$W_{ij}^{p}=\begin{cases}1, & \text{if } (i,j)\in P_{k_2}(c_i)\ \text{or}\ (i,j)\in P_{k_2}(c_j),\\ 0, & \text{otherwise.}\end{cases}$$

Here, $P_{k_2}(c)$ is a set of data pairs that are the $k_2$ nearest pairs among the set $\{(i,j)\mid i\in\pi_c,\ j\notin\pi_c\}$, where $\pi_c$ denotes the index set of the samples belonging to the $c$th class. The procedure of the marginal Fisher analysis algorithm is formally stated as follows [11-13]:

Firstly, project the data set into a PCA subspace by preserving a certain number of dimensions or a certain proportion of the energy. The PCA transformation matrix is denoted by $W_{\mathrm{PCA}}$.

Construct the intraclass compactness and interclass separability graphs. In the intraclass compactness graph, for each sample $x_i$, set the adjacency matrix $W_{ij}=W_{ji}=1$ if $x_j$ is among the $k_1$-nearest neighbors of $x_i$ in the same class. In the interclass separability graph, for each class $c$, set the similarity matrix $W_{ij}^{p}=W_{ji}^{p}=1$ if the pair $(i,j)$ is among the $k_2$ shortest pairs of the set $\{(i,j)\mid i\in\pi_c,\ j\notin\pi_c\}$.

Marginal Fisher Criterion. From the linearization of the graph embedding framework, we have the Marginal Fisher Criterion
$$w^{*}=\arg\min_{w}\ \frac{w^{\top}X(D-W)X^{\top}w}{w^{\top}X(D^{p}-W^{p})X^{\top}w},$$
which is a special linearization of the graph embedding framework with the intrinsic graph Laplacian $L=D-W$ and the penalty graph Laplacian $L^{p}=D^{p}-W^{p}$.

Output the final linear projection direction as
$$w_{\mathrm{final}}=W_{\mathrm{PCA}}\,w^{*}.$$
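A minimal sketch of this procedure is given below, assuming NumPy/SciPy and omitting the PCA step; the function name `mfa`, the neighbor counts `k1` and `k2`, and the per-sample selection of out-of-class neighbors (an approximation of the $k_2$ shortest interclass pairs) are illustrative choices rather than the exact implementation used here.

```python
# Sketch of marginal Fisher analysis (MFA); names and defaults are illustrative.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def mfa(X, y, k1=5, k2=10, n_components=2):
    """X: (n, d) data (e.g., after PCA); y: (n,) class labels."""
    n = X.shape[0]
    dist = cdist(X, X)                     # pairwise Euclidean distances
    W = np.zeros((n, n))                   # intrinsic (intraclass) graph
    Wp = np.zeros((n, n))                  # penalty (interclass) graph
    for i in range(n):
        same = np.where(y == y[i])[0]
        same = same[same != i]
        nn = same[np.argsort(dist[i, same])[:k1]]
        W[i, nn] = W[nn, i] = 1            # k1 nearest in-class neighbors
        diff = np.where(y != y[i])[0]      # approximate the k2 shortest
        nn_p = diff[np.argsort(dist[i, diff])[:k2]]   # interclass pairs
        Wp[i, nn_p] = Wp[nn_p, i] = 1
    L = np.diag(W.sum(1)) - W              # intrinsic graph Laplacian
    Lp = np.diag(Wp.sum(1)) - Wp           # penalty graph Laplacian
    # Maximize separability over compactness via a generalized eigenproblem.
    num = X.T @ Lp @ X
    den = X.T @ L @ X + 1e-6 * np.eye(X.shape[1])   # small regularizer
    vals, vecs = eigh(num, den)
    return vecs[:, np.argsort(vals)[::-1][:n_components]]
```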

2.2. Ratio-Trace Optimization Problem

For any two symmetric positive semidefinite matrices $S_1$ and $S_2$, the ratio-trace problem is defined as [14]
$$\max_{P}\ \operatorname{tr}\!\left[\left(P^{\top}S_2P\right)^{-1}P^{\top}S_1P\right].$$

For a given kernel function $k(\cdot,\cdot)$, the kernelized versions of these algorithms solve the following ratio-trace problem:
$$\max_{\Theta}\ \operatorname{tr}\!\left[\left(\Theta^{\top}\left(KS_2K+\epsilon I\right)\Theta\right)^{-1}\Theta^{\top}KS_1K\,\Theta\right],\tag{9}$$
where $\Theta$ is a transformation matrix, $K$ is the kernel matrix with $K_{ij}=k(x_i,x_j)$, and $\epsilon\in(0,1)$ is a regularization parameter used to prevent overfitting. $S_1$ and $S_2$ are (algorithm-dependent) symmetric positive semidefinite matrices. The optimal solution to (9) is given by the generalized eigenvectors corresponding to the nonzero generalized eigenvalues of
$$KS_1K\,\theta=\lambda\left(KS_2K+\epsilon I\right)\theta.$$
Once $\Theta$ is obtained, the new representation for a data sample $x$ can be computed using
$$z=\Theta^{\top}k_x,\qquad (k_x)_i=k(x_i,x).$$
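The solution of (9) can be illustrated with the sketch below, assuming NumPy/SciPy; `S1`, `S2`, and `eps` stand for the algorithm-dependent matrices and the regularization parameter above, and the function name is an assumption rather than the paper's implementation.

```python
# Hedged sketch: kernelized ratio-trace solved as a generalized eigenproblem.
import numpy as np
from scipy.linalg import eigh

def kernel_ratio_trace(K, S1, S2, eps=1e-3, n_components=2):
    """K: (n, n) kernel matrix; S1, S2: (n, n) PSD matrices."""
    n = K.shape[0]
    num = K @ S1 @ K                       # numerator matrix
    den = K @ S2 @ K + eps * np.eye(n)     # regularized denominator matrix
    vals, vecs = eigh(num, den)            # generalized eigendecomposition
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order]                  # columns: leading eigenvectors

# A sample x with kernel vector k_x = [k(x_1, x), ..., k(x_n, x)] is then
# embedded as z = Theta.T @ k_x.
```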

3. Multiple Kernel Marginal Fisher Analysis Dimensionality Reduction via Ratio-Trace

3.1. Kernel Marginal Fisher Analysis via Ratio-Trace

The kernel trick is widely used to improve the separation ability of a linear supervised dimensionality reduction algorithm, and marginal Fisher analysis can be further improved by using it. By replacing $S_1$ and $S_2$ with the penalty graph Laplacian $L^{p}=D^{p}-W^{p}$ and the intrinsic graph Laplacian $L=D-W$, respectively, problem (9) can be rewritten as follows:
$$\max_{A}\ \operatorname{tr}\!\left[\left(A^{\top}\left(KLK+\epsilon I\right)A\right)^{-1}A^{\top}KL^{p}K\,A\right].$$

Note that the graphs of kernel marginal Fisher analysis (KMFA) may differ from those of MFA, because the nearest neighbors of each sample in KMFA are different from those in MFA. The $k_1$ nearest in-class neighbors of each sample and the $k_2$ closest out-of-class sample pairs are measured through the kernel mapping from the original feature space to the higher-dimensional Hilbert space. The distance between sample $x_i$ and sample $x_j$ in this space can be obtained by the following formula:
$$d(x_i,x_j)=\sqrt{k(x_i,x_i)+k(x_j,x_j)-2k(x_i,x_j)}.$$
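A small helper, sketched below under the assumption that the full kernel matrix `K` has already been evaluated, computes these feature-space distances so that the KMFA neighbor graphs can be built from them.

```python
# Kernel-induced distances: d(x_i, x_j)^2 = k(x_i,x_i) + k(x_j,x_j) - 2 k(x_i,x_j).
import numpy as np

def kernel_distances(K):
    """K: (n, n) kernel matrix; returns (n, n) feature-space distances."""
    diag = np.diag(K)
    sq = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.maximum(sq, 0.0))    # clip tiny negatives from rounding
```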

3.2. Multiple Kernel Marginal Fisher Analysis Dimensionality Reduction

In this section, a multiple kernel marginal Fisher analysis framework is presented that incorporates spectral regression and the ratio-trace formulation into multiple kernel learning for dimensionality reduction. On the one hand, spectral regression improves efficiency without sacrificing accuracy. On the other hand, the ratio-trace optimization avoids the conventional convex relaxation or gradient descent optimization methods. The formulation of multiple kernel learning with MFA and the ratio-trace is given below; it not only combines multiple kernel dimensionality reduction with MFA but also selects optimal kernels more effectively than other multiple kernel dimensionality reduction methods via a semi-infinite linear program (SILP).

In the MKL framework, the kernel function is parametrized as a linear combination of $M$ predefined base kernels $k_1,\dots,k_M$:
$$k(x_i,x_j)=\sum_{m=1}^{M}\beta_m k_m(x_i,x_j),$$
where $\beta_m\geq 0$ and the weights $\beta=(\beta_1,\dots,\beta_M)^{\top}$ are learned from the data. Under the ratio-trace kernel marginal Fisher analysis framework, a multiple kernel variant of KMFA, termed MKL-MFA, is derived by combining MFA with MKL; it is formulated as the following optimization problem:
$$\max_{A,\,\beta}\ \operatorname{tr}\!\left[\left(A^{\top}\left(K_{\beta}LK_{\beta}+\epsilon I\right)A\right)^{-1}A^{\top}K_{\beta}L^{p}K_{\beta}A\right],\qquad K_{\beta}=\sum_{m=1}^{M}\beta_m K_m,\ \ \beta_m\geq 0.$$
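A minimal sketch of the combined kernel is given below, assuming the base kernel matrices have already been evaluated on the training data; the helper name `combined_kernel` is illustrative.

```python
# Convex combination of predefined base kernel matrices with non-negative weights.
import numpy as np

def combined_kernel(base_kernels, beta):
    """base_kernels: list of (n, n) matrices K_1..K_M; beta: (M,) weights."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0), "kernel weights must be non-negative"
    return sum(b * K for b, K in zip(beta, base_kernels))
```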

Given the input data points $\{(x_i,y_i)\}_{i=1}^{n}$, where $x_i\in\mathbb{R}^{d}$ and $y_i$ is the class label of $x_i$, denote $X=[x_1,\dots,x_n]$ as the training data matrix. The detailed steps of MKL-MFA are given as follows:

Step 1. Construct the intraclass compactness graph $G$ and the interclass separability graph $G^{p}$.

Step 2. We extend the Marginal Fisher Criterion to the multiple kernel case in the following way:
Firstly, intraclass compactness is characterized from the intrinsic graph by the term
$$\tilde{S}_c=\operatorname{tr}\!\left(A^{\top}K_{\beta}\left(D-W\right)K_{\beta}A\right),$$
where $K_{\beta}=\sum_{m=1}^{M}\beta_m K_m$, $L=D-W$, and $D$ is a diagonal matrix with the diagonal elements defined as $D_{ii}=\sum_{j}W_{ij}$.

Secondly, interclass separability is characterized by a penalty graph with the term
$$\tilde{S}_p=\operatorname{tr}\!\left(A^{\top}K_{\beta}\left(D^{p}-W^{p}\right)K_{\beta}A\right),$$
where $D^{p}$ is the degree matrix of $W^{p}$.

To obtain a multidimensional projection, we consider a set of $c$ sample coefficient vectors, denoted by $A=[\alpha_1,\dots,\alpha_c]$. Finally, the Multiple Kernel Marginal Fisher Criterion can be written as follows:
$$\max_{A,\,\beta}\ \operatorname{tr}\!\left[\left(A^{\top}\left(K_{\beta}LK_{\beta}+\epsilon I\right)A\right)^{-1}A^{\top}K_{\beta}L^{p}K_{\beta}A\right],\tag{18}$$
where $\epsilon$ is a regularization parameter used to prevent overfitting.
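For reference, the criterion (18) can be evaluated for candidate `A` and `beta` as in the sketch below; the Laplacians `L` and `Lp` and the regularizer `eps` follow the notation above, and the function name is an illustrative assumption.

```python
# Evaluate the multiple-kernel Marginal Fisher trace-ratio objective (18).
import numpy as np

def mkmfa_objective(base_kernels, beta, A, L, Lp, eps=1e-3):
    """base_kernels: list of (n, n) matrices; beta: (M,); A: (n, c)."""
    Kb = sum(b * K for b, K in zip(beta, base_kernels))   # combined kernel
    num = A.T @ Kb @ Lp @ Kb @ A                          # separability term
    den = A.T @ (Kb @ L @ Kb + eps * np.eye(Kb.shape[0])) @ A
    return np.trace(np.linalg.solve(den, num))            # tr(den^{-1} num)
```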

Step 3. Assume the ranks of $K_{\beta}L^{p}K_{\beta}$ and $K_{\beta}LK_{\beta}$ are $r_p$ and $r_c$, respectively, and let $\{(\lambda_i,u_i)\}_{i=1}^{r_p}$ and $\{(\gamma_j,v_j)\}_{j=1}^{r_c}$ be the corresponding nonzero eigenvalue-eigenvector pairs. The optimal kernel weights $\beta$ can be obtained by solving the following semi-infinite linear program [15]:
$$\max_{\theta,\,\beta}\ \theta\quad\text{s.t.}\quad \beta_m\geq 0,\ \ \sum_{m=1}^{M}\beta_m=1,\ \ \sum_{m=1}^{M}\beta_m S_m(A)\geq\theta\ \ \text{for all feasible }A,\tag{19}$$
where $S_1(A),\dots,S_M(A)$ are $M$ functions determined by the eigenvalue-eigenvector pairs above for $m=1,\dots,M$.

Step 4. Solve the ratio-trace problem (18) using spectral regression to obtain the optimal $A$. Since $L$ and $L^{p}$ are both sparse matrices, we can use spectral regression to obtain $A$ in the following way:

Find the $c$ largest generalized eigenvectors $\bar{y}_1,\dots,\bar{y}_c$ of the following eigen-problem:
$$L^{p}\bar{y}=\lambda L\,\bar{y}.$$

Find $A=[\alpha_1,\dots,\alpha_c]$ by solving the following least squares regression:
$$\alpha_k=\arg\min_{\alpha}\ \sum_{i=1}^{n}\left(\alpha^{\top}K_{\beta}(:,i)-\bar{y}_k^{(i)}\right)^{2}+\delta\left\|\alpha\right\|^{2},\qquad k=1,\dots,c,$$
where $\bar{y}_k^{(i)}$ is the $i$-th element of $\bar{y}_k$.
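A minimal sketch of this regression step is shown below, assuming the kernelized spectral-regression form in which each coefficient vector solves a ridge-regularized linear system; the ridge parameter `delta` and the function name are illustrative assumptions.

```python
# Spectral-regression step: replace a dense eigendecomposition with a
# regularized least-squares (ridge) solve for each response vector.
import numpy as np

def spectral_regression(Kb, Y_bar, delta=1e-2):
    """Kb: (n, n) combined kernel; Y_bar: (n, c) graph eigenvectors."""
    n = Kb.shape[0]
    # Each column of the result solves (Kb + delta I) alpha = y_bar.
    return np.linalg.solve(Kb + delta * np.eye(n), Y_bar)  # (n, c) matrix A
```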

Algorithm 1 summarizes the procedure for solving (18). This iterative algorithm is referred to as MKL-MFA. The alternating algorithm for solving the proposed SILP belongs to a family of algorithms for general semi-infinite programming problems called exchange methods, in which constraints are exchanged at each iteration. These methods are guaranteed to converge [16].

Input: training data $X$, class labels $y$, base kernel matrices $K_1,\dots,K_M$, parameters $k_1$, $k_2$, $\epsilon$.
Output: coefficient matrix $A$, kernel weights $\beta$.
Step 1: Initialization: $\beta^{(0)}=(1/M,\dots,1/M)^{\top}$, $t=0$.
Step 2: Compute $K_{\beta}L^{p}K_{\beta}$ and $K_{\beta}LK_{\beta}$, where $\{(\lambda_i,u_i)\}$ and $\{(\gamma_j,v_j)\}$ are the non-zero eigenvalue-eigenvector
   pairs of $K_{\beta}L^{p}K_{\beta}$ and $K_{\beta}LK_{\beta}$, respectively.
Step 3: Solve the SILP (19) to obtain $\beta$ as follows:
   While not converged
     Compute $A^{(t)}$ by solving the ratio-trace problem (18) with $\beta$ fixed to $\beta^{(t)}$
     For $m=1,\dots,M$
       Compute $S_m(A^{(t)})$
     end
     If $\left|1-\sum_{m=1}^{M}\beta_m^{(t)}S_m(A^{(t)})/\theta^{(t)}\right|\leq\epsilon$ break;
     else
      Add $A^{(t)}$ to the constraint set. Update $\beta^{(t+1)}$ and $\theta^{(t+1)}$ by solving the restricted version of (19) using only the collected constraints
     end
     $t=t+1$;
   end
Step 4: Solve the ratio-trace problem (18) using spectral regression with $\beta$
   to get the optimal $A$.
Step 5: The new non-linearly transformed representation for a data sample $x$ is computed by
   $z=A^{\top}k_x$ and $(k_x)_i=\sum_{m=1}^{M}\beta_m k_m(x_i,x)$.
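The exchange strategy in Step 3 can be sketched as follows, assuming hypothetical helpers `best_A(beta)` (solving the ratio-trace problem for fixed weights) and `S_funcs` (the functions $S_m$); each restricted problem is solved as an ordinary linear program. This is a schematic of the exchange method under those assumptions, not the exact routine used here.

```python
# Sketch of the exchange (cutting-plane) method for the SILP over kernel weights.
import numpy as np
from scipy.optimize import linprog

def silp_exchange(S_funcs, best_A, M, max_iter=50, tol=1e-4):
    beta = np.ones(M) / M                          # uniform initial weights
    constraints = []                               # rows of S_m(A^(t)) values
    theta_prev = None
    for _ in range(max_iter):
        A = best_A(beta)                           # oracle for current beta
        s = np.array([S_m(A) for S_m in S_funcs])  # new constraint row
        if theta_prev is not None and abs(1 - beta @ s / theta_prev) <= tol:
            break                                  # no sufficient violation
        constraints.append(s)
        # Restricted LP: maximize theta s.t. sum(beta)=1, beta>=0,
        # and beta @ s_t >= theta for every stored constraint row s_t.
        # Variables: [beta_1..beta_M, theta]; linprog minimizes, so negate theta.
        c = np.zeros(M + 1); c[-1] = -1.0
        A_ub = np.hstack([-np.array(constraints), np.ones((len(constraints), 1))])
        b_ub = np.zeros(len(constraints))
        A_eq = np.hstack([np.ones((1, M)), np.zeros((1, 1))])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * M + [(None, None)])
        beta, theta_prev = res.x[:M], res.x[-1]
    return beta
```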

Compared with existing supervised multiple kernel dimensionality reduction methods based on LDA [14, 17-23], the proposed method has the following advantages:
(1) The number of projection directions available to MFA is much greater than that of LDA, and the dimension is determined by $k_2$, that is, the number of shortest pairs between in-class and out-of-class samples.
(2) It makes no assumption on the data distribution of each class, and intraclass compactness is characterized by the sum of the distances between each sample and its nearest neighbors in the same class; therefore, the discriminant analysis is more general.
(3) Without prior information on the data distribution, the interclass margin characterizes the separability of different classes better than the interclass variance in LDA.
(4) It avoids the conventional convex relaxation or gradient descent optimization, and the optimal kernels can be obtained more effectively than with other multiple kernel dimensionality reduction methods.

4. Experiments

To validate the effectiveness of the proposed method, all algorithms are evaluated on UCI (University of California, Irvine) datasets and on digit recognition and face recognition datasets. The characteristics of the datasets are summarized in Table 1. For fair comparison, the final reduced dimension is set equal to the number of classes of each dataset for all algorithms, and the libSVM tool is used to classify the reduced data. For each dataset, training and testing sets are selected randomly with a ratio of 1:1, and the sample values are then normalized to a common range. Finally, we analyze and compare our method with the other algorithms by repeating each algorithm for 20 runs.

For fair comparison, 10 RBF base kernels are predefined to construct the ensemble kernel, as in the MKL-TR algorithm [12], and the values of the kernel width parameter are set to 0.10, 0.22, 0.46, 1.00, 2.15, 4.46, 10.00, 21.54, 46.42, and 100.00, respectively. Based on these base kernels, MKL-MFA is compared mainly with EMFA [10], MKL-DR [11], MKL-TR [12], and MKL-SRTR [13] in supervised settings. The parameters $k_1$ and $k_2$ of MKL-MFA and EMFA are all set to 10, while the remaining parameters are specified by cross-validation. For MKL-TR, the two graph matrices are set following [12], where $\dagger$ denotes the pseudoinverse and $Y$ is the indicator matrix with $Y_{ij}=1$ if $x_i$ belongs to class $j$, and $Y_{ij}=0$ otherwise. Following the settings of MKL-SRTR and MKL-DR, the corresponding affinity matrices are defined in the same way. For EMFA, the remaining parameter and the number of hidden nodes are obtained by 10-fold cross-validation. The performance of each algorithm is evaluated in terms of the mean classification accuracy and standard deviation and reported in Table 2.
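The base kernels could be generated as in the sketch below, assuming a standard Gaussian RBF parametrization of the listed width values; the exact parametrization used in the experiments is an assumption.

```python
# Build the 10 predefined RBF base kernels from the listed width values.
import numpy as np
from scipy.spatial.distance import cdist

WIDTHS = [0.10, 0.22, 0.46, 1.00, 2.15, 4.46, 10.00, 21.54, 46.42, 100.00]

def rbf_base_kernels(X):
    """X: (n, d) training data; returns a list of 10 (n, n) kernel matrices."""
    sq = cdist(X, X, metric="sqeuclidean")
    return [np.exp(-sq / (2.0 * s ** 2)) for s in WIDTHS]
```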

From Table 2, the performance of MKL-MFA is significantly superior to that of KMFA, EMFA, MKL-TR, MKL-DR, and MKL-SRTR on all datasets. The results of KMFA are worse than those of the other methods, which reflects the fact that combining multiple base kernels can improve on single kernel-based algorithms. Our method outperforms MKL-DR, MKL-TR, and MKL-SRTR because MKL-MFA uses the penalty graph to characterize the adjacency relationships of interclass marginal points. Without prior information on the data distributions, the interclass margin characterizes the separability of different classes better than the interclass variance in KDA. The performance of our model also exceeds that of EMFA, which can be attributed to the fact that our model finds the global optimum by converting the ratio-trace maximization problem into a semi-infinite linear program instead of using convex relaxation or gradient descent. These results demonstrate that MKL-MFA combines ratio-trace optimization, MKL, and MFA to achieve strong discriminant power and yields the best results.

To analyze the performance of our model further, we test MKL-TR and MKL-MFA in 30 runs of experiments on Ionosphere with different splits of the training and testing sets. The mean values of the kernel weights are reported in Table 3. For comparison, Table 3 also lists the best classification accuracies of KFDA for each of the 10 base kernels. As can be seen from Table 3, K3, K4, K5, and K6 are more suitable for KFDA than the other kernels, and our method tends to assign larger weights to K3, K4, K5, and K6. As a result, our method combines base kernels with appropriate weights and clearly outperforms KFDA with the best single kernel.

To validate the effectiveness of our model on high-dimensional data, we select 40 samples from 9 classes of the PIE face dataset and project them into a two-dimensional space, as displayed in Figure 1. The projected results of MKL-MFA and MKL-SRTR show better separability than those of MKL-DR and MKL-TR, and the separability of the data embedded by MKL-MFA is much clearer than that of the other models. Consequently, our model is also superior to the others on high-dimensional data.

To further evaluate the effectiveness of MKL-MFA, we used bearing vibration signals collected by accelerometer sensors under different operating loads as a real-world dataset. The vibration signals were collected using a 16-channel digital audio tape (DAT) recorder at a sampling frequency of 12 kHz. The vibration data were divided into four datasets, named D_IRF, D_ORF, D_BF, and D_MIX, as shown in Table 4, where "07", "14", "21", and "28" indicate fault diameters of 0.007, 0.014, 0.021, and 0.028 inches, respectively [24]. Signals were selected randomly to form training and testing sets with a ratio of 1:1.

Firstly, the vibration signals were transformed into 10 time-domain features, 3 frequency-domain features, and 16 time-frequency-domain features [24]. Secondly, different dimensionality reduction algorithms were applied to extract low-dimensional features from the transformed signals. Finally, we used an SVM to train and test on the low-dimensional features to compare our method with the other DR methods. The classification accuracy rates are reported in Table 5. Compared with the other algorithms, MKL-MFA achieves much better performance on all datasets, which further validates the effectiveness of MKL-MFA for feature extraction from vibration signals in real applications.

5. Conclusions

In this paper, we extend the Marginal Fisher Criterion to the multiple kernel case. Based on the extended criterion, a new multiple kernel-based dimensionality reduction algorithm termed MKL-MFA is proposed for supervised nonlinear dimensionality reduction. Without prior information on data distributions, MKL-MFA provides a more general form of multiple kernel discriminant analysis. Experimental results on benchmark and real-world datasets validate the promising performance of MKL-MFA. In the near future, we intend to improve our model by introducing deep kernel networks and to study nonlinear dimensionality reduction methods via deep models.

Data Availability

The benchmark data used to support the findings of this study have been deposited in the UCI and face image repository (http://archive.ics.uci.edu/ml/datasets.html, http://www.face-rec.org/databases/, http://www.ri.cmu.edu/projects/project_418.html, http://web.mit.edu/emeyers/www/face_databases.html, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html, http://yann.lecun.com/exdb/mnist/, http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php). The bearing vibration signals of accelerometer sensors under different operating loads, provided by the Bearing Data Center of the Case Western Reserve University, have been validated in many research works and become a standard dataset for bearing studies (http://csegroups.case.edu/bearingdatacenter/home).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Natural Science Foundation of Jiangsu Province [grant numbers BK20170273 and BK20180174] and the National Natural Science Foundation of China [grant numbers 61801198, 41672324, and 41704115].