Abstract

The Kernel Locality Preserving Projection (KLPP) algorithm can effectively preserve the neighborhood structure of a database using the kernel trick. It is known that supervised KLPP (SKLPP) can preserve within-class geometric structures by using label information. However, the conventional SKLPP algorithm suffers from the kernel selection problem, which has a significant impact on its performance. To overcome this limitation, a method named supervised kernel optimized LPP (SKOLPP) is proposed in this paper, which maximizes the class separability in kernel learning. The proposed method maps the data from the original space to a higher-dimensional kernel space using a data-dependent kernel. The adaptive parameters of the data-dependent kernel are calculated automatically by optimizing an objective function. Consequently, the nonlinear features extracted by SKOLPP have greater discriminative ability than those of SKLPP and are more adaptive to the input data. Experimental results on the ORL, Yale, AR, and Palmprint databases show the effectiveness of the proposed method.

1. Introduction

In recent years, kernel methods have been widely studied for feature extraction and pattern recognition. They map the input data into a kernel space, where a nonlinear problem can be transformed into a linear one and conveniently solved by linear algorithms. However, different kernel geometrical structures yield different class discriminations, and an inappropriate choice of kernel function can have disastrous effects, because the kernel matrix determines the geometrical structure of the mapped data in the kernel space. It is therefore necessary to use an adaptively optimized kernel function to improve classification performance. It is known that optimizing kernel parameters cannot change the geometrical structure of the kernel in the feature space [1, 2]. Schölkopf et al. [3] proposed an empirical kernel map which maps the original input data onto a subspace of the empirical feature space; the training data have the same geometrical structure in both the empirical feature space and the kernel space, and the former is easier to access than the latter. Cristianini et al. [4] and Lanckriet et al. [5] employed the alignment and the margin, respectively, as measures of data separation to evaluate the adaptability of a kernel to the input data. He and Niyogi [6] pointed out that locality preserving projection (LPP) can be extended to nonlinear feature extraction with the kernel trick. To exploit the merits of LPP, Wang and Lin [7] proposed supervised kernel LPP (SKLPP), which improves kernel LPP (KLPP) by using class information in the kernel feature extraction process. Li et al. [8] extended LPP with a nonparametric similarity measure and then optimized the kernel with the maximum margin criterion for feature extraction and recognition. Sun and Zhao [9] proposed a normalized-Laplacian-based optimal LPP method. Lu et al. [10] proposed a regularized generalized discriminant LPP approach. Lu and Tan [11] proposed a parametric regularized LPP. Pang and Yuan [12] proposed substituting the L2-norm with the L1-norm to improve the robustness of LPP against outliers. Although SKLPP showed good performance in [7], the selection of the kernel function has a significant influence on kernel feature extraction, and this problem was widely studied in previous works [7, 13–15]. In [15], we proposed Kernel Optimized PCA (KOPCA) with a sparse representation-based classifier (SRC). Although both KOPCA and SKOLPP aim to improve the recognition rate by optimizing the kernel function, they employ different feature extraction methods: KOPCA extracts features by PCA, whereas SKOLPP extracts features by LPP. In [16], we proposed a Supervised Gabor-wavelet-based Kernel Locality Preserving Projections (SGKLPP) method, which integrates the Gabor-wavelet representation of face images and the SKLPP method to improve the recognition rate: Gabor wavelets capture the features induced by illumination and facial expression changes, and SKLPP handles the nonlinear feature extraction and classification problem.

In [14], Pan et al. applied the optimizing kernel [17] to kernel discriminant analysis (KDA), yielding adaptive quasiconformal kernel discriminant analysis (AQKDA). Different from kernel optimization based on the Fisher Criterion [18], the maximum margin criterion (MMC) was chosen to extract features by maximizing the average margin between different classes of data in the quasiconformal kernel mapping space.

Li et al. proposed class-wise locality preserving projection (CLPP), which utilizes class information for feature extraction [19]. In CLPP, a nonparametric similarity measure for LPP was proposed, and the kernel optimized with the maximum margin criterion was then used for feature extraction. Based on this nonparametric similarity measure, the local structure of the original data is constructed, taking into account both the local information and the class label information. Moreover, Li et al. applied the kernel trick to CLPP to improve its performance on nonlinear feature extraction.

In [8], Li et al. proposed Kernel Self-optimized Locality Preserving Discriminant Analysis (KSLPDA), which integrates CLPP [19] and the data-dependent-kernel-based MMC [14] into a constrained optimization problem.

In [20], Li et al. proposed Quasiconformal Kernel Common Locality Discriminant Analysis (QKCLDA), in which the quasiconformal kernel based on the Fisher Criterion is used for breast cancer diagnosis. The procedure of QKCLDA has two steps: first, the original data are mapped to a low-dimensional space via a quasiconformal kernel locality projection; second, the low-dimensional data are mapped to a common space.

In SKOLPP, we first construct a data-dependent kernel [18] to maximize the class separability based on the Fisher Criterion. Then, we use a gradient method to optimize the objective function and obtain the combination coefficients. Finally, integrating supervised kernel locality preserving projections [21], the optimized kernel LPP is used to extract features. SKOLPP thus aims to optimize the kernel function: by retaining the local information and maximizing the between-class distance through kernel optimization, SKOLPP surpasses the above methods.

The paper is organized as follows. In Section 2, we optimize the kernel in the empirical feature space by seeking the optimal combination coefficients of a data-dependent kernel based on the Fisher Criterion. In Section 3, we employ the optimized kernel function to construct the supervised kernel optimized LPP (SKOLPP). Finally, in Section 4, experiments are conducted on the ORL, Yale, AR, and Palmprint databases to demonstrate the effectiveness of the optimized kernel for classification.

2. Kernel Optimization in the Empirical Feature Space

2.1. Data-Dependent Kernel

The geometrical structure of the data in the feature space is determined by the kernel function, which means that choosing different kernels may lead to different class discrimination performance [4]. Because no single kernel function is suitable for all databases, it is necessary to use a data-dependent kernel to address this problem. In this paper, a data-dependent kernel similar to that used in [13] is employed as the objective kernel to be optimized.

Considering a set of training data $\{x_1, x_2, \ldots, x_N\}$, we apply the conformal transformation kernel [13] as our data-dependent kernel function:
$$k(x, y) = f(x)\, f(y)\, k_0(x, y), \tag{1}$$
where $x, y \in \mathbb{R}^d$ and $k_0(x, y)$ is an ordinary kernel function, called the basic kernel. $f(\cdot)$ is the factor function, which determines the properties of the data-dependent kernel:
$$f(x) = \alpha_0 + \sum_{n=1}^{M} \alpha_n\, e(x, \tilde{x}_n), \tag{2}$$
where $e(x, \tilde{x}_n) = \exp\bigl(-\gamma \|x - \tilde{x}_n\|^2\bigr)$, $\gamma$ is a free parameter, and $\alpha_n$ ($n = 0, 1, \ldots, M$) are the combination coefficients. The set $\{\tilde{x}_n\}_{n=1}^{M}$ is called the "empirical cores" and can be chosen from the training data. In [13], the support vectors are chosen as the empirical cores in order to enlarge the spatial resolution around the class boundary. In this paper, we choose the mean of each class as an empirical core. Apparently, the data-dependent kernel satisfies the Mercer condition [3].

Supposing that $K_0 = [k_0(x_i, x_j)]_{N \times N}$ and $K = [k(x_i, x_j)]_{N \times N}$, then we have
$$K = \Lambda K_0 \Lambda, \tag{3}$$
where $K_0$ is the basic kernel matrix, $K$ is the data-dependent kernel matrix, and $\Lambda = \operatorname{diag}\bigl(f(x_1), f(x_2), \ldots, f(x_N)\bigr)$. Letting $f = \bigl(f(x_1), f(x_2), \ldots, f(x_N)\bigr)^{\mathrm{T}}$, then we have
$$f = E\alpha, \tag{4}$$
where $E$ is the $N \times (M+1)$ matrix whose $i$th row is $\bigl(1, e(x_i, \tilde{x}_1), \ldots, e(x_i, \tilde{x}_M)\bigr)$ and $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_M)^{\mathrm{T}}$.
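For illustration, a minimal Python sketch of this construction is given below; it is ours, not part of the original paper, and it assumes a Gaussian basic kernel with the class means used as empirical cores. All function names and default parameter values are illustrative only.

```python
# Minimal sketch of the data-dependent (conformal) kernel of Section 2.1.
# Assumptions: Gaussian basic kernel k0, factor e(x, a) = exp(-gamma*||x - a||^2),
# class means as empirical cores; names and defaults are illustrative only.
import numpy as np

def basic_kernel(X, Y, sigma=1.0):
    """Basic kernel matrix k0(x, y) = exp(-||x - y||^2 / sigma^2)."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / sigma**2)

def factor_matrix(X, cores, gamma=1.0):
    """E with rows (1, e(x_i, a_1), ..., e(x_i, a_M)), so that f = E @ alpha (eq. (4))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(cores**2, 1)[None, :] - 2.0 * X @ cores.T
    return np.hstack([np.ones((X.shape[0], 1)), np.exp(-gamma * d2)])

def data_dependent_kernel(X, alpha, cores, gamma=1.0, sigma=1.0):
    """K = Lambda K0 Lambda with Lambda = diag(f(x_1), ..., f(x_N)) (eq. (3))."""
    f = factor_matrix(X, cores, gamma) @ alpha
    return f[:, None] * basic_kernel(X, X, sigma) * f[None, :]

# Usage: the mean of each class serves as an empirical core.
X = np.random.randn(60, 4)
y = np.repeat([0, 1, 2], 20)
cores = np.vstack([X[y == c].mean(0) for c in np.unique(y)])
alpha = np.ones(cores.shape[0] + 1)        # initial combination coefficients
K = data_dependent_kernel(X, alpha, cores)
```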

2.2. Fisher Criterion Based Kernel Optimization

In [18], we note that the geometrical structure of the data is the same in the kernel feature space and in the empirical feature space; that is, the optimized kernel parameters cannot change the geometrical structure of the kernel in the feature space. It is better to measure class separability in the empirical feature space, because it is easier to access than the kernel feature space. Specifically, we use the Fisher Criterion to measure the class separability:
$$J = \frac{\operatorname{tr}(S_B)}{\operatorname{tr}(S_W)}, \tag{5}$$
where $J$ is the well-known Fisher scalar, $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $S_B$ is the between-class scatter matrix, and $S_W$ is the within-class scatter matrix of the data in the empirical feature space. $J$ measures the class separability in the feature space rather than in the projection subspace, which makes it a good choice for kernel optimization, as it is independent of the projections. Optimizing the data-dependent kernel therefore means maximizing the Fisher scalar $J$.

We call the matrices $B$ and $W$ the between-class and within-class kernel scatter matrices, respectively. They can be written as
$$B = \operatorname{diag}\Bigl(\tfrac{1}{n_1}K_{11}, \tfrac{1}{n_2}K_{22}, \ldots, \tfrac{1}{n_c}K_{cc}\Bigr) - \tfrac{1}{N}K, \tag{6}$$
$$W = \operatorname{diag}\bigl(k_{11}, k_{22}, \ldots, k_{NN}\bigr) - \operatorname{diag}\Bigl(\tfrac{1}{n_1}K_{11}, \tfrac{1}{n_2}K_{22}, \ldots, \tfrac{1}{n_c}K_{cc}\Bigr), \tag{7}$$
where $K = [k_{ij}]_{N \times N}$ is the data-dependent kernel matrix, $n_i$ is the number of samples in class $i$, and $K_{ii}$ denotes the diagonal block of $K$ corresponding to the samples in class $i$. Apparently, $K_{ii}$ is the kernel matrix of the samples in class $i$.
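The short sketch below (ours) shows one way to compute the kernel scatter matrices of (6)-(7) from a kernel matrix and the class labels; applying it to $K_0$ yields the matrices $B_0$ and $W_0$ used next.

```python
# Sketch of the between-class (B) and within-class (W) kernel scatter matrices
# of (6)-(7). Pass the data-dependent kernel matrix K to obtain B and W, or the
# basic kernel matrix K0 to obtain B0 and W0.
import numpy as np

def kernel_scatter_matrices(K, labels):
    n = K.shape[0]
    block = np.zeros_like(K)                      # block-diagonal part (1/n_i) K_ii
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        block[np.ix_(idx, idx)] = K[np.ix_(idx, idx)] / idx.size
    B = block - K / n                             # equation (6)
    W = np.diag(np.diag(K)) - block               # equation (7)
    return B, W
```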

For the basic kernel $k_0$, the matrices $B_0$ and $W_0$ are defined analogously to formulae (6) and (7) with $K$ replaced by $K_0$. Now the relationship between the Fisher scalar and the kernel scatter matrices can be established:
$$J(\alpha) = \frac{\mathbf{1}_N^{\mathrm{T}} B\, \mathbf{1}_N}{\mathbf{1}_N^{\mathrm{T}} W \mathbf{1}_N} = \frac{f^{\mathrm{T}} B_0 f}{f^{\mathrm{T}} W_0 f} = \frac{\alpha^{\mathrm{T}} E^{\mathrm{T}} B_0 E\, \alpha}{\alpha^{\mathrm{T}} E^{\mathrm{T}} W_0 E\, \alpha}, \tag{8}$$
where $\mathbf{1}_N$ is the $N$-dimensional vector of all ones.

The proof is given in the Appendix.

We use the standard gradient approach to maximize $J(\alpha)$. Let $M_0 = E^{\mathrm{T}} B_0 E$ and $N_0 = E^{\mathrm{T}} W_0 E$. Then we have
$$J(\alpha) = \frac{\alpha^{\mathrm{T}} M_0 \alpha}{\alpha^{\mathrm{T}} N_0 \alpha} \quad\text{and}\quad \frac{\partial J}{\partial \alpha} = \frac{2}{\alpha^{\mathrm{T}} N_0 \alpha}\bigl(M_0 \alpha - J N_0 \alpha\bigr).$$

Thus, the iteration algorithm is as follows:
$$\alpha^{(t+1)} = \alpha^{(t)} + \eta\, \frac{\partial J}{\partial \alpha}\bigg|_{\alpha = \alpha^{(t)}} = \alpha^{(t)} + \eta\, \frac{2}{\alpha^{(t)\mathrm{T}} N_0 \alpha^{(t)}}\bigl(M_0 - J N_0\bigr)\alpha^{(t)}. \tag{9}$$

In order to maximize $J(\alpha)$, we let $\partial J / \partial \alpha = 0$, and then $N_0^{-1} M_0 \alpha = J \alpha$; that is, the optimal $\alpha$ is the eigenvector of $N_0^{-1} M_0$ corresponding to its largest eigenvalue.

However, in real-world applications the number of training samples is often insufficient, so it is hard to guarantee that $N_0$ is invertible. We therefore use the general gradient method to obtain an approximate $\alpha$. The updating equation for maximizing the class separability is then given by
$$\alpha^{(t+1)} = \alpha^{(t)} + \eta(t)\, \frac{2}{\alpha^{(t)\mathrm{T}} N_0 \alpha^{(t)}}\bigl(M_0 - J N_0\bigr)\alpha^{(t)}, \tag{10}$$
where $\eta(t)$ is the learning rate, $\eta(t) = \eta_0 \bigl(1 - t/T\bigr)$, $T$ is the total number of iterations, $t$ denotes the current iteration number, and $\eta_0$ is the initial learning rate.

Once $\alpha$ is obtained, $f$ can be calculated by $f = E\alpha$, and the data-dependent kernel $K = \Lambda K_0 \Lambda$ is easy to compute.
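A compact sketch of the whole optimization loop of this subsection is given below (ours). It reuses `basic_kernel` and `factor_matrix` from the Section 2.1 sketch and `kernel_scatter_matrices` from above; the initial $\alpha$, $\eta_0$, and $T$ are illustrative assumptions.

```python
# Sketch of the Fisher-Criterion kernel optimization of Section 2.2:
# gradient ascent on J(alpha) = (alpha^T M0 alpha) / (alpha^T N0 alpha)
# with the decaying learning rate eta(t) = eta0 * (1 - t / T) of (10).
import numpy as np

def optimize_alpha(X, labels, cores, gamma=1.0, sigma=1.0, eta0=0.1, T=200):
    K0 = basic_kernel(X, X, sigma)
    E = factor_matrix(X, cores, gamma)
    B0, W0 = kernel_scatter_matrices(K0, labels)
    M0, N0 = E.T @ B0 @ E, E.T @ W0 @ E
    alpha = np.ones(E.shape[1])                   # initial combination coefficients
    for t in range(T):
        den = alpha @ N0 @ alpha
        J = (alpha @ M0 @ alpha) / den
        grad = (2.0 / den) * (M0 @ alpha - J * (N0 @ alpha))   # dJ / d(alpha)
        alpha = alpha + eta0 * (1.0 - t / T) * grad            # update of (10)
    return alpha

# Once alpha is found, f = E @ alpha and K = diag(f) @ K0 @ diag(f) as in (3)-(4).
```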

3. Locality Preserving Projections and Supervised Kernel Optimized LPP

In this section, the LPP algorithm is first reviewed briefly, and then the optimized kernel function described above is used to construct the supervised kernel optimized LPP.

3.1. Locality Preserving Projections

Locality Preserving Projections (LPP) [6] is a linear manifold learning method which seeks an embedding that retains local information and obtains a face subspace that best preserves the crucial face manifold structure [22].

Given a data matrix $X = [x_1, x_2, \ldots, x_n]$ with each point $x_i \in \mathbb{R}^D$, LPP, like other subspace learning algorithms, uses a transformation matrix $A$ with basis vectors $a$ to map the high-dimensional points $x_i$ to low-dimensional points $y_i$: $y_i = A^{\mathrm{T}} x_i$. The objective function of LPP used to compute the optimal basis vector $a$ is defined as
$$\min_{a} \sum_{i,j} \bigl(a^{\mathrm{T}} x_i - a^{\mathrm{T}} x_j\bigr)^2 S_{ij}, \tag{11}$$
where $S_{ij}$ measures the similarity of $x_i$ and $x_j$. The heat kernel is frequently used to define $S_{ij}$:
$$S_{ij} = \exp\Bigl(-\frac{\|x_i - x_j\|^2}{t}\Bigr), \tag{12}$$
where the parameter $t$ is predefined. In (12) the similarity increases monotonically as the distance between $x_i$ and $x_j$ decreases. It is worth noting that, in the supervised setting, if $x_i$ and $x_j$ do not belong to the same class, the value of $S_{ij}$ is set to zero.

The minimization problem of (11) can be reduced to the eigendecomposition problem [6]
$$X L X^{\mathrm{T}} a = \lambda X D X^{\mathrm{T}} a, \tag{13}$$
where $D$ is a diagonal matrix with $D_{ii} = \sum_j S_{ij}$ and $L = D - S$ is the Laplacian matrix [23]. The following constraint is imposed:
$$a^{\mathrm{T}} X D X^{\mathrm{T}} a = 1. \tag{14}$$
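For reference, a minimal sketch (ours) of the supervised variant of LPP described above: the heat-kernel similarity is zeroed for pairs from different classes, and the generalized eigenproblem (13) is solved under the constraint (14). The small ridge added to $X D X^{\mathrm{T}}$ is our own safeguard against singularity.

```python
# Sketch of supervised LPP (SLPP) in the input space. Rows of X are samples,
# so X.T @ L @ X here corresponds to X L X^T in the paper's column convention.
import numpy as np
from scipy.linalg import eigh

def slpp(X, labels, t=1.0, dim=2, ridge=1e-6):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    S = np.exp(-d2 / t) * (labels[:, None] == labels[None, :])   # supervised heat kernel (12)
    D = np.diag(S.sum(1))
    L = D - S                                                    # Laplacian matrix
    # Generalized eigenproblem of (13); the eigenvectors of the smallest
    # eigenvalues span the projection.
    XDX = X.T @ D @ X + ridge * np.eye(X.shape[1])
    vals, vecs = eigh(X.T @ L @ X, XDX)
    A = vecs[:, :dim]                 # projection matrix, columns are basis vectors a
    return X @ A                      # low-dimensional embeddings y_i = A^T x_i
```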

3.2. Supervised Kernel Optimized Locality Preserving Projections

We utilize a nonlinear mapping $\varphi$ to map the input data into a Hilbert space $F$; that is, $\varphi: \mathbb{R}^D \to F$, $x \mapsto \varphi(x)$. We then extend LPP to the new space $F$. The objective function is
$$\min_{v} \sum_{i,j} \bigl(v^{\mathrm{T}} \varphi(x_i) - v^{\mathrm{T}} \varphi(x_j)\bigr)^2 \tilde{S}_{ij}, \tag{15}$$
where $\tilde{S}_{ij}$ measures the similarity of $\varphi(x_i)$ and $\varphi(x_j)$. The optimal transformation can be obtained as in (13). The eigenvector $v$ can be expressed as follows:
$$v = \sum_{i=1}^{n} \beta_i \varphi(x_i) = \varphi(X)\beta, \tag{16}$$
where $\varphi(X) = [\varphi(x_1), \ldots, \varphi(x_n)]$, $\beta = (\beta_1, \ldots, \beta_n)^{\mathrm{T}}$, and $\tilde{S}$ is the similarity matrix in the Hilbert space. In this way, SLPP is generalized to the nonlinear case with a kernel function. The basic kernel in (1) is $k_0(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$. The Gaussian kernel $k_0(x, y) = \exp\bigl(-\|x - y\|^2 / \sigma^2\bigr)$ and the polynomial kernel $k_0(x, y) = (x \cdot y)^d$, $d = 1, 2, \ldots$, are the most popular kernel functions used with the kernel trick.

We can simplify the objective function (15) to
$$\min_{\beta} \beta^{\mathrm{T}} K L K \beta, \tag{17}$$
where $K$ is the data-dependent kernel matrix defined in (3), $L = D - \tilde{S}$, and $D$ is a diagonal matrix with $D_{ii} = \sum_j \tilde{S}_{ij}$. The local structure information of the data in the original space is captured by the matrix $D$: the more important $y_i$ is, the bigger the value of $D_{ii}$ is. Considering the constraint $y^{\mathrm{T}} D y = 1$, that is, $\beta^{\mathrm{T}} K D K \beta = 1$, the minimization problem can be transformed into
$$K L K \beta = \lambda K D K \beta. \tag{18}$$

We can obtain the optimal $\beta$ by solving (18). The essence of SKOLPP is therefore clear: we first use the Fisher Criterion to maximize the class separability and obtain a data-dependent kernel; then we seek the optimal projection matrix of the kernel optimized SLPP to extract features; finally, a classifier is adopted for classification. In this paper, we use the nearest neighbor classifier for recognition.
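Putting the pieces together, the sketch below (ours) outlines an SKOLPP training step under the same assumptions as the earlier sketches. It reuses `optimize_alpha` and `data_dependent_kernel`, and, for simplicity, computes the supervised similarity from input-space distances rather than in the Hilbert space as the paper does; the small ridge is our own numerical safeguard. A nearest neighbor classifier would then operate on the projected features.

```python
# Sketch of SKOLPP training (Section 3.2), reusing the earlier helper functions.
import numpy as np
from scipy.linalg import eigh

def skolpp_fit(X, labels, gamma=1.0, sigma=1.0, t=1.0, dim=30, ridge=1e-6):
    # Step 1: optimize the data-dependent kernel (Section 2.2).
    cores = np.vstack([X[labels == c].mean(0) for c in np.unique(labels)])
    alpha = optimize_alpha(X, labels, cores, gamma, sigma)
    K = data_dependent_kernel(X, alpha, cores, gamma, sigma)     # equation (3)
    # Step 2: supervised similarity, Laplacian, and the eigenproblem (18).
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
    S = np.exp(-d2 / t) * (labels[:, None] == labels[None, :])
    D = np.diag(S.sum(1))
    L = D - S
    vals, vecs = eigh(K @ L @ K, K @ D @ K + ridge * np.eye(K.shape[0]))
    beta = vecs[:, :dim]                  # expansion coefficients of (16)
    return beta, K

# Training features are the rows of K @ beta; a nearest neighbor classifier
# compares a projected test sample against these features.
```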

More information about LPP can be obtained from [6, 19, 21].

4. Experimental Results

In this section, we first verify that, in some cases, the classification performance in the kernel feature space can be worse than that obtained without the kernel trick, and we demonstrate that our proposed kernel optimization algorithm achieves better classification performance. We then test the proposed SKOLPP against other methods on the ORL, Yale, AR, and Palmprint databases.

4.1. Kernel Optimization on Synthetic Gaussian Distributed Database

In this part, we generated two simple synthetic datasets with Gaussian distributions. Figure 1(a) shows a two-dimensional dataset with 600 samples whose coordinates are uncorrelated. The samples are separated into two classes, each containing 300 samples drawn from a Gaussian distribution with its own mean and variance.

From this figure, we can see that some samples of the two classes overlap. We use a polynomial kernel to project the data into the empirical feature space, and Figure 1(b) shows the projection of the data in the empirical feature space onto the first three significant dimensions, corresponding to the three largest eigenvalues of the kernel matrix. From Figure 1(b), we observe that the class separability is worse in the feature space than in the input space. Figure 1(c) shows the corresponding results when a Gaussian kernel is used to project the data into the empirical feature space; similarly, the class separability is not improved. Consequently, a kernel optimization algorithm is necessary to overcome this problem. To show the effectiveness of the optimization algorithm, we carried out another experiment in which the polynomial kernel and the Gaussian kernel are used as basic kernels, and one-third of the samples are randomly selected to form the empirical core set.
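The projections shown in Figures 1 and 2 can be reproduced in principle with the following sketch (ours): the samples' coordinates in the empirical feature space are obtained from the eigendecomposition of the kernel matrix, keeping the dimensions associated with its largest eigenvalues.

```python
# Sketch of the empirical-feature-space projection used for visualization:
# with K = P Gamma P^T, the training samples' coordinates in the empirical
# feature space are the rows of P_r Gamma_r^(1/2) for the r largest eigenvalues.
import numpy as np

def empirical_feature_projection(K, r=3):
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1][:r]                 # indices of the r largest eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```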

For both the polynomial kernel and the Gaussian kernel, the initial learning rate of the algorithm is 0.1 and the number of iterations is 200. Figure 2(a) shows the projections of the data in the empirical feature space when the third-order polynomial kernel is used as the basic kernel; the corresponding results for the Gaussian kernel are shown in Figure 2(b). From Figure 2, we can see that the class separability of the data in the feature space is improved significantly when our kernel optimization algorithm is used.

4.2. SKOLPP on ORL and Yale Databases

This experiment is conducted on the well-known face image databases (ORL and Yale).

The ORL database contains 40 individuals, each with 10 different images showing variations in facial expression (smiling or not smiling), facial details (glasses or no glasses), and pose. The Yale database is more challenging than ORL; it contains 165 grayscale images of 15 individuals, showing variations in lighting condition (left-light, center-light, and right-light), facial expression (normal, happy, sad, sleepy, and surprised), and facial details (glasses or no glasses). Sample images of one individual from the ORL and Yale datasets are shown in Figures 3 and 4.

Experiment 1. In this part, we compare the proposed method with PCA, KPCA, KOPCA, KFD, SVM, KMSVM, SLPP, and SKLPP on the ORL and Yale databases. The Gaussian kernel is used, with one value of the kernel parameter for SVM, SKLPP, and KFD, another for KPCA, and another for KOPCA, KMSVM, and SKOLPP. The dimension of the eigenvectors is 60 here.
In the experiment, a given number of images are randomly selected from the image gallery of each individual to form the training set, and the corresponding remaining images form the testing set. The results are averaged over 5 random splits. Table 1 presents the top recognition accuracy of PCA, KPCA, KOPCA, KFD, SVM, KMSVM, SLPP, SKLPP, and SKOLPP for different numbers of training samples on the ORL database.
Table 1 shows the top recognition rate of all the methods on the ORL database. It is clear that SKOLPP performs the best. Moreover, SKOLPP still performs well when the number of training samples is small. It is worth noting that SKOLPP works better than KOPCA and KMSVM even though all three use kernel optimization. One reason is that SKOLPP retains the local information and obtains a face subspace that best preserves the crucial face manifold structure.
Table 2 shows the results of all the algorithms on the Yale database. SKOLPP clearly performs better than the other methods for every number of training samples.

Experiment 2. This experiment tests the performance of all the algorithms under different values of the parameter $\sigma$ in the Gaussian kernel. The value of $\sigma$ ranges from $10^4$ to $10^8$. As before, we select five images of each class as training samples; the results are shown in Figures 5 and 6.
From Figure 5, we can see that SKOLPP performs best compared with the other methods when $\sigma$ is $10^7$, whereas the result is not as good when $\sigma = 10^4$–$10^5$. Small values of $\sigma$ are more suitable for KPCA and KFD on the ORL database. It is worth noting that SKOLPP always performs better than the other methods, including KMSVM, on the ORL database (Figure 5). Figure 6 shows the corresponding results on the Yale database. Although SKOLPP works slightly worse than KMSVM for small values of $\sigma$ in Figure 6, it surpasses KMSVM and achieves the highest recognition result as $\sigma$ becomes larger. The recognition rate of SKOLPP reaches 95.5% and 92.9% on the ORL and Yale databases, respectively.

Experiment 3. The polynomial kernel is used in this part to test the performance of the proposed method. Polynomial kernels of three different orders are adopted to compare the different kernel methods, such as KPCA, KOPCA, KFD, SVM, KMSVM, and KLPP.

Five images of each class are selected as training samples and the remaining images are used for testing. Table 3 shows the performance of the different methods under the polynomial kernels mentioned above.

From Table 3, we can see that the performance of KFD is unsatisfactory, whereas SKLPP and SKOLPP achieve better results than the other methods. Not surprisingly, SKOLPP achieves the best result, reaching the highest recognition rate of 94.4%. The corresponding results on the Yale database are presented in Table 4.

4.3. SKOLPP on AR Database

This experiment is conducted on the AR face database. The AR database [24] contains over 4000 color images of 126 people (70 men and 56 women). The images are frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). The images of each person were taken in two sessions separated by two weeks, and the same pictures were taken in both sessions. In this experiment, we take 100 individuals (50 men and 50 women) and use the first 13 images of each person to test the performance of all the algorithms, so the total number of images used is 1300. All images are grayscale with 256 levels. To simplify the computation, we cropped each image manually and resized it to 48 × 48 pixels. Figure 7 shows the samples of one person. To fully evaluate the performance of SKOLPP, we conduct three tests based on variations in facial expression, lighting condition, and occlusion. The Gaussian kernel is used for SKOLPP.

4.3.1. Facial Expressions

In this test, we randomly select two of images 1–4 in Figure 7 as training samples; the remaining two images are used for testing. Therefore, the total number of training samples is 200. These images have different facial expressions. Table 5 shows the top recognition rates of the different algorithms. From Table 5, we can see that SKOLPP achieves better results than the other methods (PCA, 2DPCA [25], LDA [26], NPE [27], SLPP, and SKLPP). In particular, SKOLPP outperforms SKLPP by 6.5% in recognition rate.

4.3.2. Lighting Conditions

To test SKOLPP and the other methods under varying lighting conditions, we selected images 1, 3, and 6 in Figure 7 as training samples and images 2, 4, 5, and 7 as testing samples. Thus the number of training samples is 300, while the number of testing samples is 400. The recognition rates are summarized in Table 6. SKOLPP is clearly the most effective of the listed methods in dealing with illumination variation, exceeding SKLPP by 6.6% in recognition rate.

4.3.3. Occlusions

In this part, we test the recognition rate under varying occlusions. We took images 1–7 in Figure 7 as training samples, so the number of training samples is 700, and the remaining images 8–13 in Figure 7 as test samples. Table 7 shows the top recognition rates of all the methods. SKOLPP clearly delivers the best result, outperforming SKLPP by nearly 20% in recognition rate. SKLPP, SLPP, and NPE also achieve good results.

4.4. SKOLPP on Palmprint Database

The PolyU Palmprint database contains 7752 grayscale images of 386 different palms in BMP format (http://www4.comp.polyu.edu.hk/~biometrics/). Around twenty samples were collected from each palm in two sessions: ten in the first session and ten in the second. The average interval between the two sessions was two months. In this experiment, we took 200 different palms and used the first 5 images from each of the two sessions, so the total number of images used is 2000. All images are grayscale with 256 levels and of size 384 × 284 pixels. To simplify the computation, we cropped each image manually and resized it to 64 × 64 pixels. Figure 8 shows the samples of one palm.

The maximal recognition rate of each method and the corresponding dimension are given in Table 8. For each palm, a random subset of images is taken with labels to form the training set, and the remaining images form the testing set. From Table 8, we notice that SKOLPP consistently outperforms the other methods in all cases. In some cases, SKOLPP boosts the recognition rate by over 5% compared with SKLPP and by nearly 10% compared with PCA. In addition, SKLPP obtains the second best results, the performance of SLPP is slightly better than that of NPE, and PCA performs the worst among all the methods.

From these experimental results, we conclude that SKOLPP indeed improves class discrimination in the empirical feature space compared with SKLPP and that it is robust to variations in illumination, facial expression, and occlusion.

5. Conclusion

In this paper, we proposed an efficient classification method, supervised kernel optimized LPP (SKOLPP), which maximizes a measure of class separability in the feature space. Based on the Fisher Criterion, our method achieves satisfactory classification performance while preserving the geometrical structure of the data in the kernel feature space. SKOLPP combines the merits of kernel optimization and SKLPP to improve nonlinear feature extraction and classification, and it is robust to variations in illumination, facial expression, and occlusion. Several experiments were conducted to demonstrate the effectiveness of SKOLPP.

Appendix

Proof. Consider the empirical feature mapping $\Phi_r^e: x \mapsto \Gamma_r^{-1/2} P_r^{\mathrm{T}}\bigl(k(x, x_1), \ldots, k(x, x_N)\bigr)^{\mathrm{T}}$, where $K = P \Gamma P^{\mathrm{T}}$ is the eigendecomposition of the kernel matrix and $r = \operatorname{rank}(K)$; the dot product matrix $K$ has exactly $r$ positive eigenvalues.
Let $m_i$ denote the mean of the mapped samples of class $i$, $m$ the mean of all mapped samples, and $n_i$ the number of samples in class $i$. Then we have
$$S_B = \sum_{i=1}^{c} \frac{n_i}{N}(m_i - m)(m_i - m)^{\mathrm{T}}, \qquad S_W = \frac{1}{N}\sum_{i=1}^{c} \sum_{x \in \text{class } i} \bigl(\Phi_r^e(x) - m_i\bigr)\bigl(\Phi_r^e(x) - m_i\bigr)^{\mathrm{T}}.$$
The empirical feature space preserves the dot product; that is,
$$\Phi_r^e(x_i)^{\mathrm{T}} \Phi_r^e(x_j) = k(x_i, x_j).$$
Therefore,
$$\operatorname{tr}(S_B) = \frac{1}{N}\Bigl(\sum_{i=1}^{c} \frac{1}{n_i}\mathbf{1}_{n_i}^{\mathrm{T}} K_{ii} \mathbf{1}_{n_i} - \frac{1}{N}\mathbf{1}_N^{\mathrm{T}} K \mathbf{1}_N\Bigr) = \frac{1}{N}\,\mathbf{1}_N^{\mathrm{T}} B\, \mathbf{1}_N, \qquad \operatorname{tr}(S_W) = \frac{1}{N}\Bigl(\sum_{j=1}^{N} k_{jj} - \sum_{i=1}^{c}\frac{1}{n_i}\mathbf{1}_{n_i}^{\mathrm{T}} K_{ii} \mathbf{1}_{n_i}\Bigr) = \frac{1}{N}\,\mathbf{1}_N^{\mathrm{T}} W \mathbf{1}_N.$$
Noting formula (3), we easily get $\mathbf{1}_N^{\mathrm{T}} K \mathbf{1}_N = f^{\mathrm{T}} K_0 f$ and $\mathbf{1}_{n_i}^{\mathrm{T}} K_{ii} \mathbf{1}_{n_i} = f_i^{\mathrm{T}} (K_0)_{ii} f_i$, where $f_i$ collects the entries of $f$ belonging to class $i$; simultaneously, $\mathbf{1}_N^{\mathrm{T}} B \mathbf{1}_N = f^{\mathrm{T}} B_0 f$ and $\mathbf{1}_N^{\mathrm{T}} W \mathbf{1}_N = f^{\mathrm{T}} W_0 f$. Hence,
$$J = \frac{\operatorname{tr}(S_B)}{\operatorname{tr}(S_W)} = \frac{f^{\mathrm{T}} B_0 f}{f^{\mathrm{T}} W_0 f} = \frac{\alpha^{\mathrm{T}} E^{\mathrm{T}} B_0 E\, \alpha}{\alpha^{\mathrm{T}} E^{\mathrm{T}} W_0 E\, \alpha}.$$
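As a quick numerical sanity check (ours, reusing the helper functions from the Section 2 sketches), the identities $\mathbf{1}^{\mathrm{T}} K \mathbf{1} = f^{\mathrm{T}} K_0 f$, $\mathbf{1}^{\mathrm{T}} B \mathbf{1} = f^{\mathrm{T}} B_0 f$, and $\mathbf{1}^{\mathrm{T}} W \mathbf{1} = f^{\mathrm{T}} W_0 f$ used in the second half of the proof can be verified on random data:

```python
# Numerical check of the kernel-matrix identities used in the Appendix,
# reusing basic_kernel, factor_matrix, and kernel_scatter_matrices above.
import numpy as np

X = np.random.randn(30, 3)
labels = np.repeat([0, 1, 2], 10)
cores = np.vstack([X[labels == c].mean(0) for c in np.unique(labels)])
E = factor_matrix(X, cores)
alpha = np.random.rand(E.shape[1])
f = E @ alpha

K0 = basic_kernel(X, X)
K = f[:, None] * K0 * f[None, :]                     # K = Lambda K0 Lambda
ones = np.ones(len(X))
assert np.isclose(ones @ K @ ones, f @ K0 @ f)       # 1^T K 1 = f^T K0 f

B0, W0 = kernel_scatter_matrices(K0, labels)
B, W = kernel_scatter_matrices(K, labels)
assert np.isclose(ones @ B @ ones, f @ B0 @ f)       # 1^T B 1 = f^T B0 f
assert np.isclose(ones @ W @ ones, f @ W0 @ f)       # 1^T W 1 = f^T W0 f
```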

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.