Abstract

Different kernels induce different geometrical structures of the data in the feature space and therefore lead to different class discriminations. In this paper, a method that optimizes the kernel by maximizing a measure of class separability in the empirical feature space and combines it with a sparse representation-based classifier (SRC) is proposed to address the problem of automatically choosing kernel functions and their parameters in kernel learning. The proposed method first adopts a so-called data-dependent kernel to build an efficient kernel optimization algorithm. Then, a constrained optimization problem is solved with a general gradient descent method to find combination coefficients that vary with the input data. Next, an optimized kernel PCA (KOPCA) is constructed from these combination coefficients to extract features. Finally, the sparse representation-based classifier is used to perform the pattern classification task. Experimental results on MSTAR SAR images show the effectiveness of the proposed method.

1. Introduction

Recently, kernel learning, or kernel machines, has aroused broad interest in pattern recognition. For classification problems based on supervised kernel learning, different kernel geometrical structures yield different class discriminations. However, the separability of the data in the feature space can even deteriorate if an inappropriate kernel is chosen, since the geometrical structure of the mapped data in the feature space is entirely determined by the kernel matrix. The selection of the kernel therefore greatly influences the performance of kernel learning, and optimizing the kernel is an effective way to improve classification performance. Merely tuning the parameters of a fixed kernel function cannot change the geometrical structure of the data in the feature space [1, 2] and hence cannot by itself improve the performance of kernel learning. In this context, Scholkopf et al. [3] proposed an empirical kernel map which maps the original input data space into a subspace of the empirical feature space. Since the training data have the same geometrical structure in both the empirical feature space and the feature space, and the former is easier to access than the latter, it is easier to study the adaptability of a kernel to the input data, and to improve it, in the empirical feature space. Cristianini et al. [4] and Lanckriet et al. [5] first proposed choosing the kernel by optimizing a measure of data separation in the feature space; they employ, respectively, the alignment and the margin as the measure of data separation to evaluate the adaptability of a kernel to the input data. Zhang et al. proposed several variants of KPCA [6, 7] for fault diagnosis of nonlinear processes. They then utilized the improved kernel learning techniques for the statistical analysis of nonlinear fault detection [8], large-scale fault diagnosis [9], and the monitoring of dynamic processes [10].

Sparse representation has also gained great interest in pattern recognition and computer vision recently. Wright et al. [11] presented a sparse representation-based classification method [12] and applied it to real-world face recognition problems [11, 12]. The method proved very effective and robust for face recognition under varying expression and illumination, as well as under occlusion and disguise.

The paper is organized as follows. In Section 2, we first introduce the concepts of the data-dependent kernel and the empirical feature space; we then optimize the kernel in the empirical feature space by seeking the optimal combination coefficients of the data-dependent kernel based on the Fisher criterion. In Sections 3 and 4, the optimized kernel PCA (KOPCA) is applied to MSTAR SAR images to obtain dimensionality-reduced empirical features, on which the sparse representation-based classifier performs pattern classification. Finally, in Section 5, experiments are carried out on MSTAR SAR images to demonstrate the improvement in classification performance obtained by using the optimized kernel together with the sparse representation-based classifier.

2. Kernel Optimization in the Empirical Feature Space

2.1. Data-Dependent Kernel

Since different kernels create different geometrical structures of the data in the feature space and lead to different class discriminations [13], no single kernel function can be adapted to all datasets in kernel learning. A data-dependent kernel is therefore needed to deal with this problem. In this paper, we employ the data-dependent kernel proposed by Amari and Wu [14] as the objective kernel function for kernel optimization. Note that the data-dependent kernel is a conformal transformation of a basic kernel.

Given a set of training samples $\{x_1, x_2, \ldots, x_m\}$, the data-dependent kernel is defined as
$$k(x, y) = q(x)\, q(y)\, k_0(x, y),$$
where $k_0(x, y)$ is a basic kernel, such as the polynomial kernel or the Gaussian kernel, and $q(\cdot)$ is a positive real-valued factor function; different choices of $q(\cdot)$ give the data-dependent kernel different properties. Amari and Wu [14] expand the spatial resolution in the margin of an SVM by taking
$$q(x) = \sum_{i \in SV} \alpha_i e^{-\|x - \tilde{x}_i\|^2 / 2\sigma^2},$$
where $\tilde{x}_i$ is the $i$th support vector and $SV$ is the set of support vectors. In this paper we adopt the more general form
$$q(x) = \alpha_0 + \sum_{i=1}^{n} \alpha_i k_1(x, a_i), \qquad k_1(x, a_i) = e^{-\|x - a_i\|^2 / 2\sigma^2},$$
where the set $\{a_1, \ldots, a_n\}$, called the "empirical cores," can be determined according to the distribution of the training data, $\sigma$ is a free parameter, and $\alpha_0, \alpha_1, \ldots, \alpha_n$ are positive combination coefficients regarded as contribution weights of the corresponding empirical cores. The data-dependent kernel is a valid kernel function since it satisfies the Mercer condition [3].

Let $K$ and $K_0$ be the kernel matrices of $k(x, y)$ and $k_0(x, y)$ on the training samples, respectively. It is then easy to see that
$$K = Q K_0 Q,$$
where $Q = \mathrm{diag}\bigl(q(x_1), q(x_2), \ldots, q(x_m)\bigr)$ is the diagonal matrix with elements $q(x_i)$.

We denote the vectors $(q(x_1), q(x_2), \ldots, q(x_m))^T$ and $(\alpha_0, \alpha_1, \ldots, \alpha_n)^T$ by $q$ and $\alpha$, respectively. Then we have
$$q = K_1 \alpha,$$
where $K_1$ is the $m \times (n+1)$ matrix whose $i$th row is $\bigl(1, k_1(x_i, a_1), \ldots, k_1(x_i, a_n)\bigr)$.
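
As an illustration, the following minimal NumPy sketch (our own code; the function names gaussian_kernel, factor_matrix, and data_dependent_kernel are not from the paper, and a Gaussian basic kernel is assumed) builds the factor values $q(x_i)$, the matrix $K_1$ with $q = K_1 \alpha$, and the data-dependent kernel matrix $K = Q K_0 Q$.

import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Basic Gaussian kernel k0(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * sigma**2))

def factor_matrix(X, cores, alpha, sigma=1.0):
    # q(x) = alpha_0 + sum_i alpha_i * k1(x, a_i); returns q for every row of X
    # together with the matrix K1 that realizes the relation q = K1 alpha.
    K1 = np.hstack([np.ones((X.shape[0], 1)), gaussian_kernel(X, cores, sigma)])
    return K1 @ alpha, K1

def data_dependent_kernel(X, cores, alpha, sigma=1.0):
    # K = Q K0 Q with Q = diag(q(x_1), ..., q(x_m)).
    K0 = gaussian_kernel(X, X, sigma)
    q, _ = factor_matrix(X, cores, alpha, sigma)
    return (q[:, None] * K0) * q[None, :]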

2.2. Empirical Feature Space

Different kernels cause different class discriminations owing to the different geometrical structures they induce on the data in the feature space. However, it is often inconvenient to compute directly in the feature space. Hence, the concept of the empirical feature space is introduced.

Let $\{x_1, \ldots, x_m\}$ be a $d$-dimensional training dataset and let $K = \bigl(k(x_i, x_j)\bigr)_{m \times m}$ denote its kernel matrix, with rank $r$. Since $K$ is a symmetric positive semidefinite matrix, it can be decomposed as
$$K = P \Lambda P^T,$$
where $\Lambda$ is the $r \times r$ diagonal matrix of the positive eigenvalues of $K$ in descending order and $P$ is the $m \times r$ matrix whose columns are the eigenvectors corresponding to these positive eigenvalues.

On this basis, we can define a map from the input data space into the $r$-dimensional Euclidean space $\mathbb{R}^r$ and obtain the so-called empirical kernel map defined in [3], that is,
$$\Phi^e : x \mapsto \Lambda^{-1/2} P^T \bigl(k(x, x_1), k(x, x_2), \ldots, k(x, x_m)\bigr)^T.$$

The embedding space $\mathbb{R}^r$ is called the empirical feature space.
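
Continuing the NumPy sketch above, the empirical kernel map can be realized directly from the eigendecomposition of the kernel matrix; the helper below is our own illustration, not the authors' implementation.

def empirical_kernel_map(K_train, K_new, tol=1e-10):
    # Map samples into the empirical feature space: Lambda^{-1/2} P^T k(x, .).
    # K_train: (m, m) kernel matrix of the training set.
    # K_new:   (n, m) kernel values between the n new samples and the training set.
    eigvals, eigvecs = np.linalg.eigh(K_train)
    order = np.argsort(eigvals)[::-1]                 # descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > tol                              # positive eigenvalues only (rank r)
    Lam, P = eigvals[keep], eigvecs[:, keep]
    return (K_new @ P) / np.sqrt(Lam)                 # (n, r) empirical-feature coordinates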

We can prove that the training data have the same geometric structure in both the empirical feature space and the feature space. Let $\Phi^e(X) = [\Phi^e(x_1), \ldots, \Phi^e(x_m)]$; then the dot product matrix in the empirical feature space can be calculated as
$$\bigl(\Phi^e(X)\bigr)^T \Phi^e(X) = K P \Lambda^{-1} P^T K = P \Lambda P^T = K.$$
Notice that $P^T P = I_r$ and $K = P \Lambda P^T$, so the result is exactly the dot product matrix of the mapped data in the feature space; therefore, the empirical feature space preserves the geometric structure of the feature space.

2.3. Fisher Criterion Based Kernel Optimization

As illustrated in Section 2.2, the training data have the same geometric structure in both the empirical feature space and the feature space, and the former is easier to access than the latter, so it is preferable to measure class separability in the empirical feature space. In this paper, we choose the well-known Fisher criterion to measure class separability:
$$J = \frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)},$$
where $S_b$ is the between-class scatter matrix, $S_w$ is the within-class scatter matrix, and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Notice that $J$ measures the class separability in the feature space and is independent of the projections in the common projection subspace, so it is a suitable objective for kernel optimization.

Up to now, the kernel optimization problem has been transformed into maximizing the Fisher scalar $J$. Suppose the training set contains $c$ classes, class $i$ has $m_i$ training samples, and $m = \sum_{i=1}^{c} m_i$ denotes the total number of training samples. Moreover, let $\bar{\phi}_i$ and $\bar{\phi}$ denote the center of the training samples of class $i$ and the center of all training samples in the empirical feature space, respectively, that is,
$$\bar{\phi}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} \Phi^e(x_{ij}), \qquad \bar{\phi} = \frac{1}{m} \sum_{i=1}^{c} \sum_{j=1}^{m_i} \Phi^e(x_{ij}),$$
where $x_{ij}$ denotes the $j$th training sample of the $i$th class. Then we can define
$$S_b = \frac{1}{m} \sum_{i=1}^{c} m_i \bigl(\bar{\phi}_i - \bar{\phi}\bigr)\bigl(\bar{\phi}_i - \bar{\phi}\bigr)^T, \qquad S_w = \frac{1}{m} \sum_{i=1}^{c} \sum_{j=1}^{m_i} \bigl(\Phi^e(x_{ij}) - \bar{\phi}_i\bigr)\bigl(\Phi^e(x_{ij}) - \bar{\phi}_i\bigr)^T.$$

For convenience of calculation and representation, we rewrite the kernel matrix in block form as $K = (K_{pq})_{p, q = 1, \ldots, c}$, where $K_{pq}$ is the submatrix of $K$ formed by the rows belonging to class $p$ and the columns belonging to class $q$, so that the size of $K_{pq}$ is $m_p \times m_q$.

Let the matrices
$$B = \mathrm{diag}\!\left(\frac{1}{m_1} K_{11}, \ldots, \frac{1}{m_c} K_{cc}\right) - \frac{1}{m} K, \qquad W = \mathrm{diag}\bigl(k_{11}, \ldots, k_{mm}\bigr) - \mathrm{diag}\!\left(\frac{1}{m_1} K_{11}, \ldots, \frac{1}{m_c} K_{cc}\right)$$
be called the "between-class" and "within-class" kernel scatter matrices, respectively, where $k_{ii} = k(x_i, x_i)$.

We also use $B_0$ and $W_0$ to denote the "between-class" and "within-class" kernel scatter matrices corresponding to the basic kernel matrix $K_0$.

Now we establish the relation between the Fisher scalar and the proposed kernel scatter matrices. Let $1_m$ be the $m$-dimensional vector whose elements are all equal to 1. Then we can get
$$J = \frac{1_m^T B\, 1_m}{1_m^T W\, 1_m}.$$
The proof is given in the appendix.
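
Assuming the training samples are grouped by class label, the kernel scatter matrices and the Fisher scalar can be computed as in the following sketch (our own code with hypothetical helper names, reusing NumPy as imported above).

def kernel_scatter_matrices(K, labels):
    # "Between-class" B and "within-class" W kernel scatter matrices for a kernel
    # matrix K and an array of class labels (one label per training sample).
    m = K.shape[0]
    block = np.zeros_like(K)                          # diag(K_11 / m_1, ..., K_cc / m_c)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        block[np.ix_(idx, idx)] = K[np.ix_(idx, idx)] / len(idx)
    B = block - K / m
    W = np.diag(np.diag(K)) - block
    return B, W

def fisher_scalar(K, labels):
    # J = (1^T B 1) / (1^T W 1).
    B, W = kernel_scatter_matrices(K, labels)
    ones = np.ones(K.shape[0])
    return (ones @ B @ ones) / (ones @ W @ ones)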

To maximize $J$, we adopt the general gradient method and use the relation $q = K_1 \alpha$. Define
$$J_1 = 1_m^T B\, 1_m, \qquad J_2 = 1_m^T W\, 1_m.$$
Then, since $K = Q K_0 Q$ and $1_m^T Q = q^T$,
$$J_1 = q^T B_0 q = \alpha^T K_1^T B_0 K_1 \alpha, \qquad J_2 = q^T W_0 q = \alpha^T K_1^T W_0 K_1 \alpha.$$
Thus, denoting $M_0 = K_1^T B_0 K_1$ and $N_0 = K_1^T W_0 K_1$, maximizing $J$ is equivalent to maximizing
$$J(\alpha) = \frac{\alpha^T M_0 \alpha}{\alpha^T N_0 \alpha}.$$

Considering that $N_0$ is in practice rarely invertible because of the limited number of training samples in real-world applications, the general gradient descent method is adopted to obtain an approximate value of the optimal $\alpha$. The updating equation to maximize $J(\alpha)$ is defined as follows:
$$\alpha^{(t+1)} = \alpha^{(t)} + \eta(t)\, \frac{\partial J(\alpha)}{\partial \alpha}\bigg|_{\alpha = \alpha^{(t)}}, \qquad \frac{\partial J(\alpha)}{\partial \alpha} = \frac{2}{\alpha^T N_0 \alpha}\bigl(M_0 \alpha - J(\alpha)\, N_0 \alpha\bigr).$$

To guarantee the convergence of this updating rule, the learning rate $\eta(t)$ is defined as a function of the iteration, that is,
$$\eta(t) = \eta_0 \left(1 - \frac{t}{T}\right),$$
where $\eta_0$ is a predefined initial value, $T$ denotes the total number of iterations, and $t$ represents the current iteration number.
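
A minimal sketch of the resulting iterative optimization follows, reusing kernel_scatter_matrices from the previous sketch and the matrix $K_1$ from the relation $q = K_1 \alpha$; the explicit gradient expression follows the derivation above, while the all-ones initialization of $\alpha$ is our own assumption.

def optimize_alpha(K0, K1, labels, eta0=0.1, T=200):
    # Gradient-based maximization of J(alpha) = (alpha^T M0 alpha) / (alpha^T N0 alpha),
    # with the decaying learning rate eta(t) = eta0 * (1 - t / T).
    B0, W0 = kernel_scatter_matrices(K0, labels)
    M0, N0 = K1.T @ B0 @ K1, K1.T @ W0 @ K1
    alpha = np.ones(K1.shape[1])                      # initial combination coefficients
    for t in range(T):
        num, den = alpha @ M0 @ alpha, alpha @ N0 @ alpha
        J = num / den
        grad = (2.0 / den) * (M0 @ alpha - J * (N0 @ alpha))
        alpha = alpha + eta0 * (1.0 - t / T) * grad
    return alpha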

After the combination coefficient vector $\alpha$ is calculated, we obtain $q = K_1 \alpha$ and hence $Q$, and thus the optimized data-dependent kernel $K = Q K_0 Q$ is easily obtained.

3. Optimizing Kernel PCA (KOPCA)

In this section, we employ the optimized kernel function obtained above to construct the optimizing kernel PCA and to extract features in the empirical feature space.

Given a set of training samples $\{x_1, \ldots, x_m\}$ and the empirical kernel map $\Phi^e$, let the input data space be mapped into the empirical feature space. The covariance operator on the empirical feature space can be constructed as
$$C = \frac{1}{m} \sum_{i=1}^{m} \bigl(\Phi^e(x_i) - \bar{\phi}\bigr)\bigl(\Phi^e(x_i) - \bar{\phi}\bigr)^T,$$
where $\bar{\phi} = \frac{1}{m}\sum_{i=1}^{m} \Phi^e(x_i)$ is defined as above. It is easy to prove that all nonzero eigenvalues of $C$ are positive and that every eigenvector $v$ of $C$ can be linearly expanded as
$$v = \sum_{i=1}^{m} w_i \bigl(\Phi^e(x_i) - \bar{\phi}\bigr).$$
To obtain these expansion coefficients, we form the $m \times m$ Gram matrix $K$ whose elements are determined by the optimized kernel, that is, $K_{ij} = k(x_i, x_j)$. Note that this kernel matrix is the same as the optimized kernel matrix $K = Q K_0 Q$ defined in Section 2.

Centralize $K$ by
$$\tilde{K} = K - \frac{1}{m}\, 1_m 1_m^T K - \frac{1}{m}\, K\, 1_m 1_m^T + \frac{1}{m^2}\, 1_m 1_m^T K\, 1_m 1_m^T,$$
where $1_m$ is defined as above, that is, the $m$-dimensional vector whose elements are all equal to 1.

Let $w_1, \ldots, w_p$ be the eigenvectors of $\tilde{K}$ corresponding to its $p$ largest positive eigenvalues $\lambda_1 \geq \cdots \geq \lambda_p$. Then the eigenvectors $v_1, \ldots, v_p$ of $C$ corresponding to its $p$ largest positive eigenvalues are
$$v_j = \sum_{i=1}^{m} w_{ji} \bigl(\Phi^e(x_i) - \bar{\phi}\bigr), \quad j = 1, \ldots, p.$$
By projecting a mapped sample $\Phi^e(x)$ onto the eigenvectors $v_1, \ldots, v_p$, we obtain the optimizing kernel PCA transformed feature vector
$$z = \bigl(v_1^T (\Phi^e(x) - \bar{\phi}), \ldots, v_p^T (\Phi^e(x) - \bar{\phi})\bigr)^T,$$
whose $j$th entry, $z_j = v_j^T (\Phi^e(x) - \bar{\phi})$, is the $j$th optimizing kernel PCA component.
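
The following sketch shows one way to implement KOPCA feature extraction on top of the optimized kernel matrix (our own illustration; the centering and eigenvector normalization follow standard kernel PCA, and the function names are hypothetical).

def kopca_fit(K, n_components=100):
    # Centralize the (optimized) kernel matrix and return the normalized expansion
    # coefficients of its leading eigenvectors; assumes the top n_components
    # eigenvalues are positive.
    m = K.shape[0]
    J = np.ones((m, m)) / m
    K_c = K - J @ K - K @ J + J @ K @ J
    eigvals, eigvecs = np.linalg.eigh(K_c)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, W = eigvals[order], eigvecs[:, order]
    return W / np.sqrt(lam)                           # scale so projections are unit-normalized

def kopca_transform(K_new, K_train, W):
    # Project new samples given their kernel values K_new (n, m) against the m training samples.
    m = K_train.shape[0]
    On = np.ones((K_new.shape[0], m)) / m
    Om = np.ones((m, m)) / m
    K_new_c = K_new - On @ K_train - K_new @ Om + On @ K_train @ Om
    return K_new_c @ W                                # (n, n_components) KOPCA feature vectors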

Up to now, the essence of optimizing kernel PCA has been revealed: we first maximize a measure of class separability in the empirical feature space by means of the Fisher criterion to form the required data-dependent kernel, and we then use the optimizing kernel PCA to extract features in the empirical feature space.

4. Sparse Representation-Based Classifier (SRC)

Let $A_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,m_i}] \in \mathbb{R}^{p \times m_i}$ be the matrix formed by the (KOPCA feature vectors of the) training samples of the $i$th class. Define a new matrix $A$ for the total training set with $c$ classes as the concatenation of the class-wise training matrices: $A = [A_1, A_2, \ldots, A_c] \in \mathbb{R}^{p \times m}$.

Given a test sample $y$ from the $i$th class, $y$ can be approximately represented by a linear combination of the training samples of the corresponding class, that is,
$$y \approx \beta_{i,1} a_{i,1} + \beta_{i,2} a_{i,2} + \cdots + \beta_{i,m_i} a_{i,m_i},$$
where $\beta_{i,j}$ are the corresponding coefficients.

Then the linear representation of $y$ can be rewritten in terms of $A$ as
$$y = A x_0,$$
where $x_0 = (0, \ldots, 0, \beta_{i,1}, \ldots, \beta_{i,m_i}, 0, \ldots, 0)^T \in \mathbb{R}^m$ is a coefficient vector whose entries are zero except those associated with the $i$th class.

At this point, we should take the numbers of rows and columns of $A$ into consideration. If the row number $p$ is larger than the column number $m$, the system of equations $y = Ax$ is overdetermined and the correct $x_0$ can usually be found as its unique solution. Nevertheless, this is not what we need, since sparse representation involves an underdetermined system of linear equations $y = Ax$ with $p < m$. It is motivated by the following fact: given a test sample $y$, its representation is naturally sparse if the training sample size (column number) is large enough, and the sparser the recovered coefficient vector is, the easier it is to accurately determine the identity of the test sample [12].

Consequently, the dimension of the feature vector (row number) must be smaller than the training sample size (column number). Since we have already obtained dimensionality-reduced empirical features via KOPCA before applying sparse representation, this requirement is met.

The above discussion motivates us to seek the sparsest solution by solving the following optimization problem:
$$\hat{x}_0 = \arg\min_x \|x\|_0 \quad \text{subject to} \quad A x = y,$$
where $\|\cdot\|_0$ denotes the $\ell_0$-norm, which counts the number of nonzero entries in a vector.

However, this $\ell_0$-minimization problem is NP-hard and time-consuming to solve. Recent research on sparse representation and compressed sensing [15, 16] proves that if the solution is sparse enough, the solution of the $\ell_0$-minimization problem is equivalent to that of the $\ell_1$-minimization problem
$$\hat{x}_1 = \arg\min_x \|x\|_1 \quad \text{subject to} \quad A x = y.$$
This problem can be solved in polynomial time by standard linear programming algorithms [17].

After obtaining the sparsest solution $\hat{x}_1$, we can form a sparse representation-based classifier (SRC) in the following way. For each class $i$, let $\delta_i(\cdot)$ be the function that selects the coefficients associated with the $i$th class; then $\delta_i(\hat{x}_1)$ is a vector whose only nonzero entries are the entries of $\hat{x}_1$ associated with class $i$. Using these coefficients, one can reconstruct the given test sample as $\hat{y}_i = A\, \delta_i(\hat{x}_1)$, which is often called the prototype of class $i$ with respect to the sample $y$. The residual between $y$ and its prototype of class $i$ is defined as
$$r_i(y) = \bigl\|y - A\, \delta_i(\hat{x}_1)\bigr\|_2.$$
The SRC decision rule is then to minimize the residual, that is, if $r_j(y) = \min_i r_i(y)$, then $y$ is assigned to class $j$. It should be noted that our implementation minimizes the $\ell_1$-norm via the basis pursuit denoising (BPDN) algorithm for linear programming based on [17-19].
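
A compact sketch of the SRC decision rule is given below. Note that it solves the $\ell_1$ problem with SciPy's linear-programming solver as a stand-in for the BPDN homotopy package used in our experiments, so it illustrates the classifier rather than reproducing that exact implementation; the function names are our own.

import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    # Solve min ||x||_1 subject to A x = y via the standard LP reformulation
    # x = u - v, u >= 0, v >= 0, minimizing sum(u) + sum(v).
    p, m = A.shape
    c = np.ones(2 * m)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None), method="highs")
    return res.x[:m] - res.x[m:]

def src_classify(A, labels, y):
    # Sparse representation-based classification: keep the coefficients of each class
    # in turn and assign y to the class with the smallest reconstruction residual.
    x_hat = basis_pursuit(A, y)
    best_class, best_res = None, np.inf
    for c in np.unique(labels):
        delta = np.where(labels == c, x_hat, 0.0)     # only the class-c coefficients
        res = np.linalg.norm(y - A @ delta)
        if res < best_res:
            best_class, best_res = c, res
    return best_class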

5. Experimental Results

In this section, experiments are designed to evaluate the performance of the proposed algorithm. The first experiment shows that in some cases the class separability can be worse in the feature space than in the input space and demonstrates that the proposed kernel optimization algorithm enlarges class separability. The second experiment is carried out on MSTAR SAR images, using KOPCA, compared with conventional KPCA, to extract features and the nearest neighbor (NN) classifier to perform pattern classification. The sparse representation-based classifier (SRC) is then applied to verify its superiority and effectiveness for pattern classification compared with other classifiers. Finally, to verify the sparsity obtained via BPDN, we randomly choose a test sample and show its representation coefficients on the training set.

5.1. Kernel Optimization on Synthetic Gaussian Distributed Dataset

Before concentrating on optimizing the kernel in the empirical feature space, we use a simple two-class Gaussian-distributed dataset generated by computer to gain intuition about the embedding of data from the feature space into the empirical feature space. More information about data embedding can be found in [20]. Figure 1(a) shows a 2-class 2-dimensional dataset containing 400 samples whose coordinates are uncorrelated. Each class contains 200 samples, and both classes follow Gaussian distributions with different means and covariance matrices. As the figure shows, there is some overlap between the two classes. Figure 1(b) shows the projection of the data into the empirical feature space onto the first two significant dimensions, corresponding to the two largest eigenvalues of the kernel matrix, when the polynomial kernel is used. Figure 1(c) shows the corresponding projection when the Gaussian kernel is employed. Both basic kernels are of the form introduced in Section 2.1. It is seen from Figures 1(b) and 1(c) that the class separability is worse in the feature space than in the input space for both the polynomial kernel and the Gaussian kernel. Therefore, it is important to conduct kernel optimization. We carry out an experiment below to demonstrate that applying the kernel optimization algorithm of Section 2.3 indeed enlarges the measure of class separability.

In this experiment, the free parameter of the factor function $q(\cdot)$ is fixed to a predefined value for both the given polynomial kernel and the given Gaussian kernel. One-third of the synthetic data are randomly chosen to form the "empirical core" set. The initial learning rate and the total number of iterations are set to 0.1 and 200, respectively, for both the polynomial kernel and the Gaussian kernel. Figures 2(a) and 2(b) show the projection of the data into the empirical feature space onto the first two significant dimensions, corresponding to the two largest eigenvalues of the optimized kernel matrix, for the polynomial kernel and the Gaussian kernel mentioned above. It is seen from Figure 2 that the proposed kernel optimization algorithm substantially improves the class separability of the data in the empirical feature space and, hence, in the feature space.

5.2. KOPCA Criterion on MSTAR SAR Dataset

This experiment is conducted on MSTAR SAR images provided by the Defense Advanced Research Projects Agency and the Air Force Research Laboratory (DARPA/AFRL). The data is the MSTAR public release subset, collected for the Moving and Stationary Target Acquisition and Recognition (MSTAR) project, which has provided a unique opportunity to promote and assess progress in SAR ATR algorithm development.

Since the characteristics of SAR images change greatly with aspect angle, a large number of images were collected for each target class, with poses ranging from 0 to 360 degrees.

The vehicles in the MSTAR SAR dataset comprise the BMP2 (sn-c21, sn-9563, sn-9566) tracked armored personnel carrier, the BTR70 (sn-c71) wheeled armored personnel carrier, and the T72 (sn-132, sn-812, sn-s7) main battle tank. Different serial numbers within one target class denote variants with small differences in configuration and articulation under extended operating conditions (EOC) [21]. The scattering centers of the SAR images therefore change so strongly that recognition ability decreases greatly; in this sense, recognizing variants in SAR images is difficult.

In this experiment, we select images of BMP2 sn-c21, BTR70 sn-c71, and T72 sn-132 at a 17-degree depression angle as the training samples (233, 233, and 233 samples per class, respectively). The testing samples are BMP2 sn-9563, BMP2 sn-9566, T72 sn-812, and T72 sn-s7 at a 15-degree depression angle (195, 196, 195, and 191 samples per class, respectively). The testing targets have small configuration differences from the training targets. Note that in this paper KOPCA extracts features directly from all MSTAR images with different aspect angles, so the process does not need to form separate aspect windows; before recognition, the images are chipped to a fixed pixel size.

The free parameter of the factor function $q(\cdot)$ is fixed to a predefined value. The kernel functions are chosen as the polynomial kernel, with degree varied from 1 to 10, and the Gaussian kernel, with its width parameter varied over a range of values. One-third of the training data are chosen to form the "empirical core" set. The initial learning rate and the total number of iterations are set to 0.1 and 200, respectively, for both the polynomial kernel and the Gaussian kernel. Moreover, the feature dimension in the empirical feature space is set to 100 for both the KPCA and KOPCA criteria. To reflect the performance of the optimized kernel in a real-world application, the simplest nearest neighbor (NN) classifier is selected.

Suppose the distance between two samples $z_1$ and $z_2$ is defined by $d(z_1, z_2) = \|z_1 - z_2\|_1$, where $\|\cdot\|_1$ denotes the $\ell_1$-norm.

Then, if a test sample $z$ satisfies $d(z, z_k) = \min_j d(z, z_j)$ and $z_k$ belongs to class $i$, $z$ is assigned to class $i$.
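
For completeness, a one-function sketch of this $\ell_1$-distance nearest neighbor rule over the extracted feature vectors (our own illustration, assuming NumPy as above).

def nn_classify(train_feats, train_labels, test_feat):
    # Nearest neighbor rule with l1 distance over the extracted feature vectors.
    d = np.sum(np.abs(train_feats - test_feat), axis=1)
    return train_labels[np.argmin(d)]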

Tables 1 and 2 show the recognition rates of KPCA and KOPCA with the polynomial kernel and the Gaussian kernel, respectively, using the nearest neighbor (NN) classifier. From the tables, we can see that the proposed data-dependent kernel optimization algorithm with the KPCA criterion increases the recognition rate by 10%-15% for both the polynomial kernel and the Gaussian kernel compared with conventional KPCA. The class separability in the empirical feature space is improved, and thus the recognition rate is improved.

We now conduct another experiment to examine the sparse representation-based classifier (SRC). In the first part, we validate its effectiveness for the pattern classification task after extracting features via KOPCA (KOPCA: SRC), compared with other classifiers such as the $k$-nearest neighbor (KNN) classifier, the support vector classifier (SVC), and the linear regression classifier (LRC) [22]. In the second part, to verify the sparsity obtained by SRC via BPDN, we randomly choose a testing sample and show its representation coefficients on the training set.

Tables 3 and 4 show the recognition rates of KOPCA with the KNN, SVC, LRC, and SRC classifiers for the polynomial kernel and the Gaussian kernel, respectively.

From Table 3, we see that the sparse representation-based classifier (SRC) outperforms the other classifiers regardless of the order of the polynomial kernel. Throughout the experiment, the recognition rate of KOPCA: SRC stays above 95% while the others remain below 95%. Only KOPCA: LRC comes close to KOPCA: SRC; the others are 10% or even 20% lower. Meanwhile, an interesting phenomenon can be noticed: although the order of the polynomial kernel varies from 1 to 10, the variation in recognition rate of every algorithm is less than 10%. This, however, does not carry over to the Gaussian kernel. From Table 4, we see that KOPCA: SRC is superior and effective in the following respects: (1) when the Gaussian kernel parameter lies in the lower part of the tested range, the recognition rate of KOPCA: SRC is slightly lower than that of KOPCA: LRC, with a difference of no more than 2%, but it is better than those of KOPCA: KNN and KOPCA: SVC; (2) although KOPCA: LRC performs well, it has limitations once the kernel parameter grows beyond a certain value, where its recognition rate degrades quickly and significantly, while KOPCA: SRC remains stable and superior; (3) for still larger parameter values, the performance of all algorithms decreases rapidly, yet KOPCA: SRC still holds about 70% while the others fall to 50% or lower.

The basis pursuit (BP) method is used to solve the l1-norm minimization problem in our experiment. Here, we randomly choose a testing sample from the third class. Intuitively, most nonzero representation coefficients for this testing sample should lie in the index range from 301 to 450 (since each class has 150 training samples, the indices of the third class run from 301 to 450). Our result confirms this. From Figure 3, we see that the representation coefficients are sparse with respect to the basis, that is, the training set, and the nonzero coefficients are mostly located in the range from 301 to 450. Finally, the BPDN algorithm is fast enough for the recognition and classification of our SAR images. The BPDN software package that we use is from the "L1 Homotopy" homepage: http://users.ece.gatech.edu/~sasif/homotopy/. The running time is on the order of seconds, the same level as LRC and SVC.

From the discussion above, we conclude that optimizing kernel PCA can indeed enhance the class separability in the empirical feature space compared with conventional KPCA and thus improve the recognition rate. Meanwhile, the sparse representation-based classifier is robust and highly efficient for classification compared with the other classifiers considered.

6. Conclusion

In this paper, we proposed an efficient pattern classification method named kernel-optimized PCA with sparse representation-based classifier (KOPCA: SRC). From the experimental results, we draw the following conclusions. (1) We have employed the empirical feature space, in which the data are embedded in a way that preserves the geometrical structure of the data in the feature space. (2) We have presented a general form of the data-dependent kernel and derived an effective algorithm for optimizing the kernel by maximizing the class separability of the dataset in the empirical feature space via the Fisher criterion. (3) We have, for the first time, applied the sparse representation-based classifier to pattern classification on MSTAR SAR images, and the experimental results show that it is more effective and robust than the other classifiers considered.

Appendix

Proof. Recall the empirical feature mapping $\Phi^e : x \mapsto \Lambda^{-1/2} P^T (k(x, x_1), \ldots, k(x, x_m))^T$ and note that the dot product matrix $(\Phi^e(X))^T \Phi^e(X) = K$ has exactly $r$ positive eigenvalues.
Write $y_{ij} = \Phi^e(x_{ij})$ for the image of the $j$th training sample of class $i$, $\bar{y}_i = \frac{1}{m_i}\sum_{j=1}^{m_i} y_{ij}$, and $\bar{y} = \frac{1}{m}\sum_{i=1}^{c}\sum_{j=1}^{m_i} y_{ij}$. Then, we have
$$\mathrm{tr}(S_b) = \frac{1}{m}\sum_{i=1}^{c} m_i \|\bar{y}_i - \bar{y}\|^2, \qquad \mathrm{tr}(S_w) = \frac{1}{m}\sum_{i=1}^{c}\sum_{j=1}^{m_i} \|y_{ij} - \bar{y}_i\|^2.$$
As the empirical feature space preserves the dot product, that is, $y_{ij}^T y_{pl} = k(x_{ij}, x_{pl})$, expanding the squared norms gives
$$\mathrm{tr}(S_b) = \frac{1}{m}\left(\sum_{i=1}^{c} \frac{1}{m_i} \sum_{j,l=1}^{m_i} k(x_{ij}, x_{il}) - \frac{1}{m}\sum_{p,q=1}^{m} k(x_p, x_q)\right), \qquad \mathrm{tr}(S_w) = \frac{1}{m}\left(\sum_{p=1}^{m} k(x_p, x_p) - \sum_{i=1}^{c} \frac{1}{m_i} \sum_{j,l=1}^{m_i} k(x_{ij}, x_{il})\right).$$
From the definitions of the kernel scatter matrices $B$ and $W$, we easily get $\mathrm{tr}(S_b) = \frac{1}{m}\, 1_m^T B\, 1_m$ and, simultaneously, $\mathrm{tr}(S_w) = \frac{1}{m}\, 1_m^T W\, 1_m$. Hence,
$$J = \frac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)} = \frac{1_m^T B\, 1_m}{1_m^T W\, 1_m}.$$

Acknowledgments

The work is supported by the Major Program of the Natural Science Foundation of China (no. 61033012), the Natural Science Foundation of China (nos. 611003177 and 61272371), the Fundamental Research Funds for the Central Universities (no. DUT12JR07), and the Specialized Research Fund for the Doctoral Program of Higher Education (no. 20120041120046).