Abstract

Constructing a powerful graph that can effectively depict the intrinsic connections among data points is the critical step for graph-based semisupervised learning algorithms to achieve promising performance. Among popular graph construction algorithms, low-rank representation (LRR) is a very competitive one that can simultaneously explore the global structure of data and recover the data from noisy environments; the learned low-rank coefficient matrix of LRR can therefore be used to construct the data affinity matrix. Considering that the essentially linear nature of LRR makes it ill-suited to the possible nonlinear structure of data, and that learning performance can be greatly enhanced by exploiting the structure information of data, we propose a new manifold kernelized low-rank representation (MKLRR) model that performs LRR in a manifold adaptive kernel space. Specifically, the manifold structure is incorporated into the kernel space by using the graph Laplacian, so that the underlying geometry of data is reflected by the warped kernel space. Experimental results on semisupervised image classification tasks show the effectiveness of MKLRR. For example, MKLRR can, respectively, obtain 96.13%, 98.09%, and 96.08% accuracies on the ORL, Extended Yale B, and PIE data sets when given 5, 20, and 20 labeled face images per subject.

1. Introduction

Since it is usually not easy to collect a large number of labeled samples to train learning models, the semisupervised learning (SSL) paradigm, which can harness both labeled and unlabeled samples to improve learning performance, has drawn a lot of attention in recent studies [1–7]. Among existing SSL algorithms, graph-based algorithms are among the most popular approaches, in which label propagation can be performed on a graph [8–11]. The underlying idea of graph-based algorithms is to characterize the relationship between data pairs by an affinity matrix. Although researchers have pointed out that sparsity, high discriminative power, and adaptive neighborhood are desirable properties of a good graph [12], how to learn a good graph that can accurately uncover the latent relationships in data is still a challenging problem.

Among existing graph construction methods, the k-nearest neighbors and ε-neighborhood methods are the two most widely used. However, they are usually sensitive to noisy environments, especially when the data contain outliers. To construct more effective graphs, many new algorithms have been proposed. The sparse graph [8] is parameter-free and insensitive to outliers; it is derived by encoding each datum as a sparse representation of the remaining samples. The sparse graph can automatically select the most informative neighbors for each datum. However, since sparse representation encodes each datum individually, the resulting sparse graph only emphasizes the local structure of data while neglecting the global structure, which deteriorates its performance especially when data are grossly corrupted [13]. Different from sparse representation, which enforces the representation coefficients to be sparse [14], low-rank representation aims to learn the data affinities jointly, which can reveal the global structure of data and preserve the membership of samples that belong to the same class in noisy environments [15, 16]. The learned LRR graph can capture the global mixture-of-subspaces structure via the low-rankness property and thus is both generative and discriminative for semisupervised learning tasks [9].

Apart from the conventional LRR model, many advanced variants have been proposed recently. To efficiently explore the structure information of data, Zheng et al. imposed a locality constraint on the representation coefficients and thus formulated the low-rank representation with local constraint (LRRLC) model [10]. Lu et al. proposed the graph regularized LRR (GLRR), which introduces a graph regularizer to enforce the local consistency of data [17]. Zhuang et al. proposed incorporating sparse and nonnegative constraints into low-rank representation and formulated the NNLRS model [9]. The manifold low-rank representation (MLRR) [18] first uses a sparse learning objective to identify the data manifold and then incorporates the manifold information into low-rank representation as a regularizer. Additionally, [19] proposed preserving the structure information of data from two aspects: local affinity and distant repulsion. Li and Fu proposed constructing a graph based on low-rank coding and a b-matching constraint to obtain a sparse and balanced graph [20]. All the above-mentioned low-rank models are linear; therefore, they inevitably have limitations in modeling complex data distributions, which do not strictly follow a linear model but a nonlinear one. To make the low-rank model deal effectively with the nonlinear structure of data, [11] proposed the kernel low-rank representation (KLRR) graph for semisupervised classification by using the kernel trick. As a nonlinear extension of LRR, KLRR has also shown excellent performance in face recognition [21].

Recent studies [22–26] have shown that learning performance can be greatly enhanced by considering the geometrical structure and the local invariance idea [27]. It is obvious that this idea should be considered in both the original data space and the reproducing kernel Hilbert space (RKHS). However, no existing LRR variant takes into account the intrinsic manifold structure in RKHS. In this paper, we propose a novel manifold adaptive kernelized LRR for semisupervised classification. By using a data-dependent norm on RKHS proposed by [28], we can warp the structure of the RKHS to reflect the underlying geometry of the data. Then, the conventional low-rank representation can be performed in the manifold adaptive kernel space. The main contributions of this paper can be briefly summarized as follows:
(1) We construct the manifold adaptive kernel space, where the underlying geometry of data can be reflected by the graph Laplacian.
(2) We give the model formulation, the optimization method, and the complexity analysis of MKLRR in detail.
(3) We conduct extensive experiments on semisupervised image classification tasks to evaluate the effectiveness of MKLRR, and the experimental results show that MKLRR achieves very promising performance.

The remainder of this paper is organized as follows. In Section 2, we give a brief review on the conventional LRR model and the semisupervised learning framework to be used in our work. Section 3 describes the model formulation, optimization method, and complexity analysis of the manifold adaptive kernelized LRR model in detail. Experimental studies of MKLRR on semisupervised image classification task will be introduced in Section 4. Section 5 concludes the whole paper and presents an extension of MKLRR as our future work.

2. Preliminaries

In this section, we give a brief review of the conventional low-rank representation model [15] and the semisupervised classification framework based on Gaussian Fields and Harmonic Functions (GHF) [1].

2.1. LRR

Given a set of samples $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, LRR aims to represent each sample as a linear combination of the bases in $X$ by $X = XZ$, where $Z = [z_1, z_2, \ldots, z_n] \in \mathbb{R}^{n \times n}$ is the matrix in which each $z_i$ is the representation coefficient corresponding to sample $x_i$. Therefore, each entry $Z_{ij}$ can be viewed as the contribution of $x_i$ to the reconstruction of $x_j$ with $X$ as the dictionary. LRR seeks the lowest rank solution by solving the following optimization problem [15]:
$$\min_{Z} \operatorname{rank}(Z), \quad \text{s.t. } X = XZ. \tag{1}$$
It is NP-hard to directly optimize the rank function. Therefore, the trace norm (also called the nuclear norm) is usually used as the closest convex surrogate of the rank function, which leads to the following objective [29]:
$$\min_{Z} \|Z\|_{*}, \quad \text{s.t. } X = XZ, \tag{2}$$
where $\|Z\|_{*}$ denotes the sum of the singular values of $Z$ [30]. Considering the fact that samples are usually noisy or even grossly corrupted, a more reasonable objective for LRR can be expressed as
$$\min_{Z, E} \|Z\|_{*} + \lambda \|E\|_{2,1}, \quad \text{s.t. } X = XZ + E, \tag{3}$$
where $\lambda > 0$ is a trade-off parameter and $\|E\|_{2,1} = \sum_{j=1}^{n} \sqrt{\sum_{i=1}^{d} E_{ij}^{2}}$. The second term in (3) characterizes the error by modeling sample-specific corruptions. Some existing studies employ the $\ell_{1}$-norm to measure the error term instead [31, 32]. The optimal solution can be obtained via the inexact augmented Lagrange multiplier method [31].
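For intuition, the noiseless problem (2) has a well-known closed-form solution: if $X = U\Sigma V^{\top}$ is the skinny SVD of $X$, the minimizer is the shape interaction matrix $Z^{*} = VV^{\top}$ [15]. The short NumPy sketch below (an illustration of this property, not code used in the paper) samples two independent subspaces and checks that $Z^{*}$ is approximately block diagonal, which is exactly why an LRR graph captures the global subspace membership of the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent 3-dimensional subspaces in R^50, 40 samples each.
U1 = np.linalg.qr(rng.standard_normal((50, 3)))[0]
U2 = np.linalg.qr(rng.standard_normal((50, 3)))[0]
X = np.hstack([U1 @ rng.standard_normal((3, 40)),
               U2 @ rng.standard_normal((3, 40))])      # d x n data matrix, n = 80

# Closed-form solution of min ||Z||_* s.t. X = XZ (shape interaction matrix).
_, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-8 * s[0]))                        # numerical rank of X
Z = Vt[:r].T @ Vt[:r]                                   # n x n, approximately block diagonal

# Fraction of the coefficient mass inside the two diagonal blocks (close to 1 here).
A = np.abs(Z)
block_mass = A[:40, :40].sum() + A[40:, 40:].sum()
print(block_mass / A.sum())
```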

2.2. GHF

Assume that we have a data set $X = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\}$ from $c$ classes, where $\{x_i\}_{i=1}^{l}$ and $\{x_i\}_{i=l+1}^{n}$ are the labeled and unlabeled samples, respectively. The label indicator matrix $Y \in \mathbb{R}^{n \times c}$ is defined as follows: for each sample $x_i$, $y_i$ (the $i$th row of $Y$) is its label vector. If $x_i$ is from the $j$th ($1 \le j \le c$) class, then only the $j$th entry of $y_i$ is one and all the other entries are zeros. If $x_i$ is an unlabeled sample, then $y_i = \mathbf{0}$.

GHF is a well-known graph-based semisupervised learning framework in which the predicted label matrix $F \in \mathbb{R}^{n \times c}$ is estimated on the graph with respect to the label fitness and the manifold smoothness. Let $f_i$ and $y_i$, respectively, denote the $i$th rows of $F$ and $Y$. GHF tries to minimize the following objective:
$$\min_{F} \; \lambda_{\infty} \sum_{i=1}^{l} \|f_i - y_i\|^2 + \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \|f_i - f_j\|^2, \tag{4}$$
where $\lambda_{\infty}$ is a very large value such that $f_i = y_i$ ($1 \le i \le l$) can be approximately satisfied and $W$ is an affinity matrix that depicts the pairwise similarity of samples. Obviously, (4) can be rewritten in the compact matrix form
$$\min_{F} \; \operatorname{tr}\left(F^{\top} L F\right) + \operatorname{tr}\left((F - Y)^{\top} U (F - Y)\right), \tag{5}$$
where the graph Laplacian matrix can be calculated as $L = D - W$; $D$ is a diagonal degree matrix with $D_{ii} = \sum_{j} W_{ij}$ (or $\sum_{j} W_{ji}$ since $W$ is usually a symmetric matrix). $U$ is also a diagonal matrix with the first $l$ and the remaining $n - l$ diagonal entries being $\lambda_{\infty}$ and 0, respectively.
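Since (5) is quadratic in $F$, its minimizer has the closed form $F = (L + U)^{-1} U Y$; each sample is then assigned to the class whose entry in the corresponding row of $F$ is largest. The NumPy sketch below is a minimal illustration of this propagation step (not the authors' code); the affinity matrix $W$ is assumed to be given, and in this paper it is built from the learned low-rank coefficients.

```python
import numpy as np

def ghf_propagate(W, Y, labeled_mask, lambda_inf=1e6):
    """GHF label propagation: minimize tr(F'LF) + tr((F-Y)'U(F-Y)), cf. (5).

    W            : (n, n) symmetric nonnegative affinity matrix
    Y            : (n, c) label indicator matrix (zero rows for unlabeled samples)
    labeled_mask : (n,) boolean array, True for labeled samples
    """
    L = np.diag(W.sum(axis=1)) - W                        # graph Laplacian L = D - W
    U = lambda_inf * np.diag(labeled_mask.astype(float))  # lambda_inf on labeled entries
    F = np.linalg.solve(L + U, U @ Y)                     # stationarity: (L + U) F = U Y
    return F.argmax(axis=1)                               # predicted class per sample
```

With $\lambda_{\infty}$ large, the labeled rows of $F$ are effectively clamped to their given labels, while the unlabeled rows are smoothed along the graph.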

3. Manifold Adaptive Low-Rank Representation

3.1. Manifold Adaptive Kernel

In this section, we show how to incorporate the manifold structure into the reproducing kernel Hilbert space (RKHS), which leads to the manifold adaptive kernel space.

The kernel trick is usually applied with the hope of discovering the nonlinear structure in data by mapping the original nonlinear observations into a higher dimensional linear space [33]. The most commonly used kernels are the Gaussian and polynomial kernels. However, the nonlinear structure captured by such data-independent kernels may not be consistent with the intrinsic manifold structure, such as geodesic distance, curvature, and homology [34, 35].

In this work, we adopt the manifold adaptive kernel proposed by [28]. Let $\mathcal{V}$ be a linear space with a positive semidefinite inner product (quadratic form) and let $S: \mathcal{H} \rightarrow \mathcal{V}$ be a bounded linear operator, where $\mathcal{H}$ is the original RKHS with kernel $k(\cdot, \cdot)$. We define $\widetilde{\mathcal{H}}$ to be the space of functions from $\mathcal{H}$ with the manifold inner product
$$\langle f, g \rangle_{\widetilde{\mathcal{H}}} = \langle f, g \rangle_{\mathcal{H}} + \langle Sf, Sg \rangle_{\mathcal{V}}. \tag{6}$$
$\widetilde{\mathcal{H}}$ is still an RKHS [28].

Given samples $x_1, \ldots, x_n$, let $S: \mathcal{H} \rightarrow \mathbb{R}^{n}$ be the evaluation map
$$S(f) = \left(f(x_1), f(x_2), \ldots, f(x_n)\right)^{\top}. \tag{7}$$
Denote $\mathbf{f} = Sf$ and $\mathbf{g} = Sg$. Note that $\langle Sf, Sg \rangle_{\mathcal{V}}$ is a quadratic form in $\mathbf{f}$ and $\mathbf{g}$; thus we have
$$\langle f, g \rangle_{\widetilde{\mathcal{H}}} = \langle f, g \rangle_{\mathcal{H}} + \gamma \mathbf{f}^{\top} M \mathbf{g}, \tag{8}$$
where $M$ is a positive semidefinite matrix and $\gamma \ge 0$. For a data vector $x$, we define
$$\mathbf{k}_{x} = \left(k(x, x_1), k(x, x_2), \ldots, k(x, x_n)\right)^{\top}. \tag{9}$$
It can be shown that the reproducing kernel in $\widetilde{\mathcal{H}}$ is
$$\tilde{k}(x, z) = k(x, z) - \gamma \mathbf{k}_{x}^{\top} (I + \gamma M K)^{-1} M \mathbf{k}_{z}, \tag{10}$$
where $I$ is an $n \times n$ identity matrix, $K$ is the kernel matrix in $\mathcal{H}$ (i.e., $K_{ij} = k(x_i, x_j)$), and $\gamma$ is a constant controlling the smoothness of the functions. The key issue now is the choice of $M$, so that the deformation of the kernel induced by the data-dependent norm reflects the intrinsic geometry of the data.

Without loss of generality, we assume that there are $m$ data points to be utilized to derive the linear space $\mathcal{V}$. It is easy to rewrite formulation (10) in the compact matrix form
$$\widetilde{K} = K - \gamma K (I + \gamma M K)^{-1} M K, \tag{11}$$
where the matrices $K$, $M$, and $\widetilde{K}$ are all in $\mathbb{R}^{m \times m}$. Here, $I$ is an identity matrix with the same size as $K$. $\widetilde{K}$ is referred to as the kernel matrix in the warped RKHS.
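In code, the warped kernel matrix (11) is a single matrix expression. The sketch below (a plain NumPy illustration of the formulation above, not the authors' implementation) computes $\widetilde{K} = K - \gamma K (I + \gamma M K)^{-1} M K$ for a given Gram matrix $K$ and a positive semidefinite deformation matrix $M$; in Section 3.2, $M$ is taken to be the graph Laplacian.

```python
import numpy as np

def warped_kernel(K, M, gamma=1.0):
    """Manifold adaptive (warped) kernel matrix of (11).

    K     : (m, m) Gram matrix of a data-independent kernel (e.g., Gaussian)
    M     : (m, m) positive semidefinite deformation matrix (graph Laplacian in Section 3.2)
    gamma : constant controlling the smoothness of the functions
    """
    m = K.shape[0]
    T = np.linalg.solve(np.eye(m) + gamma * (M @ K), M @ K)   # (I + gamma*M*K)^{-1} (M K)
    return K - gamma * (K @ T)                                # K - gamma*K(I + gamma*M*K)^{-1}MK
```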

As mentioned above, the manifold structure of data can be discovered by the graph Laplacian associated with the data points, which motivates the choice of $M$.

3.2. The Objective Function

From [11], the objective of kernel low-rank representation (KLRR) was formulated as
$$\min_{Z, E} \operatorname{rank}(Z) + \lambda \|E\|_{2,1}, \quad \text{s.t. } \phi(X) = \phi(X) Z + E, \tag{12}$$
where $\phi(\cdot)$ is an implicit nonlinear mapping from the input space into the RKHS $\mathcal{H}$. In order to learn a low-rank representation that is consistent with the manifold geometry, it is natural to take advantage of the manifold adaptive kernel in KLRR.

In order to model the manifold structure, we construct a $k$-nearest-neighbor graph $G$. For each data point $x_i$, we find its $k$ nearest neighbors, denoted by $\mathcal{N}_k(x_i)$, and put an edge between $x_i$ and its neighbors. There are many choices for the weight matrix $W$ on the graph, and we use the "0-1" form defined as follows:
$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in \mathcal{N}_k(x_j) \text{ or } x_j \in \mathcal{N}_k(x_i), \\ 0, & \text{otherwise}. \end{cases} \tag{13}$$
The graph Laplacian [36] is defined as $L = D - W$, where $D$ is a diagonal degree matrix given by $D_{ii} = \sum_{j} W_{ij}$ (or $\sum_{j} W_{ji}$ since $W$ is symmetric). The graph Laplacian provides the following smoothness penalty on the graph: for any function values $\mathbf{f} = (f(x_1), \ldots, f(x_n))^{\top}$,
$$\mathbf{f}^{\top} L \mathbf{f} = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left(f(x_i) - f(x_j)\right)^{2}. \tag{14}$$
Therefore, it is natural to substitute $M$ with the graph Laplacian $L$. For convenience, we make use of all the available data points to derive the linear space $\mathcal{V}$ in the warped RKHS (i.e., $m = n$); then (11) can be rewritten as
$$\widetilde{K}_{M} = K - \gamma K (I + \gamma L K)^{-1} L K, \tag{15}$$
where the subscript $M$ indicates that this kernel matrix lies in the manifold adaptive RKHS.
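The graph side of this construction is equally short. The NumPy sketch below builds the symmetric 0-1 $k$-nearest-neighbor weight matrix of (13) and the Laplacian $L = D - W$; the returned $L$ is what gets plugged in as the deformation matrix when computing (15). It is an illustrative implementation consistent with the description above, not the authors' code.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """0-1 k-nearest-neighbor graph and its (unnormalized) Laplacian.

    X : (n, d) data matrix, one sample per row
    k : neighborhood size
    Returns (W, L) with W the symmetric 0-1 weight matrix and L = D - W.
    """
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    np.fill_diagonal(dist2, np.inf)                       # exclude self as a neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist2[i])[:k]] = 1.0              # the k nearest neighbors of x_i
    W = np.maximum(W, W.T)                                # x_i in N_k(x_j) or x_j in N_k(x_i)
    L = np.diag(W.sum(axis=1)) - W
    return W, L
```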

Using the nuclear norm to replace the rank function, we arrive at the following objective of manifold adaptive kernelized LRR:
$$\min_{Z, E} \|Z\|_{*} + \lambda \|E\|_{2,1}, \quad \text{s.t. } \phi_{M}(X) = \phi_{M}(X) Z + E, \tag{16}$$
where $\phi_{M}(\cdot)$ denotes the (implicit) feature mapping associated with the manifold adaptive kernel, that is, $\langle \phi_{M}(x_i), \phi_{M}(x_j) \rangle = (\widetilde{K}_{M})_{ij}$.

Figure 1 shows the connection between MKLRR and LRR as well as its variants. As we can see, LRR variants such as GLRR, LRRLC, and MLRR can be reached by incorporating manifold information. By using the kernel trick, the KLRR model can find the lowest rank representation in RKHS. Further, by considering the geometric structure of data in RKHS, we can formulate the MKLRR model. Both KLRR and MKLRR are nonlinear models, since an implicit nonlinear mapping is employed.

3.3. Optimization

To make objective (16) separable, we introduce an auxiliary variable $J$ with respect to $Z$ and then we have the following objective:
$$\min_{Z, E, J} \|J\|_{*} + \lambda \|E\|_{2,1}, \quad \text{s.t. } \phi_{M}(X) = \phi_{M}(X) Z + E, \; Z = J. \tag{17}$$
The corresponding augmented Lagrangian function is
$$\mathcal{L} = \|J\|_{*} + \lambda \|E\|_{2,1} + \operatorname{tr}\!\left(Y_{1}^{\top} \left(\phi_{M}(X) - \phi_{M}(X) Z - E\right)\right) + \operatorname{tr}\!\left(Y_{2}^{\top} (Z - J)\right) + \frac{\mu}{2} \left(\left\|\phi_{M}(X) - \phi_{M}(X) Z - E\right\|_{F}^{2} + \|Z - J\|_{F}^{2}\right), \tag{18}$$
where $Y_{1}$ and $Y_{2}$ are Lagrange multipliers and $\mu > 0$ is a penalty parameter. The inexact augmented Lagrange multiplier (ALM) algorithm is employed to optimize objective (18) [31]. The detailed optimization process is summarized in Algorithm 1.

Input: data points X, regularization parameters, and the ALM parameters (initial values of Z, E, J, the Lagrange multipliers, and μ; μ_max; ρ; convergence tolerance ε);
Output: the low-rank representation coefficient matrix Z.
(1) while not converged do
(2) Fix the other variables and update by
(3) Fix the others and update by
(4) Fix the others and update by
(5) Update the multipliers
(6) Update the parameter by
(7) Check the convergence conditions
,
(8) end while

The updating rule for the auxiliary variable J is based on the singular value thresholding (SVT) operator, which is given by the following theorem [30].

Theorem 1. Let $Q$ be a given matrix and let $Q = U \Sigma V^{\top}$ be the SVD of $Q$, where $U$ and $V$ have orthonormal columns, $\Sigma$ is diagonal, and $\tau \ge 0$. Then the optimal solution to $\min_{J} \tau \|J\|_{*} + \frac{1}{2} \|J - Q\|_{F}^{2}$ is given by $J^{*} = U \mathcal{S}_{\tau}(\Sigma) V^{\top}$, where $\mathcal{S}_{\tau}(\Sigma)$ is diagonal with $(\mathcal{S}_{\tau}(\Sigma))_{ii} = \max(\Sigma_{ii} - \tau, 0)$.
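A direct NumPy transcription of Theorem 1 is given below; it is the proximal operator of the nuclear norm, and in the standard inexact ALM for LRR-type models it is applied to the auxiliary variable with threshold $1/\mu$ at each iteration.

```python
import numpy as np

def svt(Q, tau):
    """Singular value thresholding: argmin_X tau*||X||_* + 0.5*||X - Q||_F^2."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)        # soft-threshold the singular values
    return (U * s_shrunk) @ Vt                 # rebuild with the shrunken spectrum
```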

The updating rule for E can be obtained by the column-wise soft-shrinkage operator [15], which is given below.

Theorem 2. Let $Q = [q_1, q_2, \ldots, q_n]$ be a given matrix and let $\|\cdot\|_{F}$ be the Frobenius norm. If the optimal solution to $\min_{E} \eta \|E\|_{2,1} + \frac{1}{2} \|E - Q\|_{F}^{2}$ is $E^{*}$, then the $i$th column of $E^{*}$ is
$$E^{*}(:, i) = \begin{cases} \dfrac{\|q_i\| - \eta}{\|q_i\|} \, q_i, & \text{if } \|q_i\| > \eta, \\ 0, & \text{otherwise}. \end{cases}$$
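Theorem 2 likewise has a compact implementation, sketched below; in the standard inexact ALM for LRR-type models this column-wise shrinkage is applied in the E-update with threshold $\lambda/\mu$.

```python
import numpy as np

def l21_shrink(Q, eta):
    """Column-wise shrinkage: argmin_E eta*||E||_{2,1} + 0.5*||E - Q||_F^2."""
    E = np.zeros_like(Q)
    norms = np.linalg.norm(Q, axis=0)             # Euclidean norm of each column
    keep = norms > eta
    E[:, keep] = Q[:, keep] * ((norms[keep] - eta) / norms[keep])
    return E
```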

3.4. Algorithm Workflow and Complexity Analysis

As a whole, we summarize the manifold adaptive kernelized low-rank representation-based semisupervised classification algorithm as follows (a minimal end-to-end sketch is given after this list):
(i) Construct the graph Laplacian: construct a $k$-nearest-neighbor graph with the weight matrix defined in (13) and then calculate the graph Laplacian by $L = D - W$.
(ii) Calculate the manifold adaptive kernel: assume that the kernel matrix $K$ in $\mathcal{H}$ can be induced from any data-independent kernel (e.g., the Gaussian kernel or the linear kernel); then calculate the manifold adaptive kernel $\widetilde{K}_{M}$ in the warped RKHS according to (15).
(iii) Manifold kernel low-rank representation: optimize the MKLRR model and obtain the low-rank representation coefficient matrix $Z$ via Algorithm 1; shrink the small values in $Z$ and then make it symmetric and nonnegative as $(|Z| + |Z^{\top}|)/2$, which serves as the affinity matrix in (5).
(iv) Semisupervised classification: calculate the Laplacian matrix of the learned affinity and do semisupervised classification based on (5).
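To make the workflow concrete, a minimal end-to-end sketch on a toy two-cluster problem is given below. It is an illustration under stated assumptions rather than the paper's implementation: the Gaussian kernel is taken as $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ with $\sigma$ equal to the mean pairwise distance (our reading of Section 4.1); $\gamma = 1$, $k = 5$, $\lambda = 1$, $\lambda_{\infty} = 10^{6}$, and the shrinking threshold $10^{-4}$ are illustrative values; and instead of the kernelized updates of Algorithm 1, step (iii) is realized by building an explicit feature map $\Phi$ with $\Phi^{\top}\Phi = \widetilde{K}_{M}$ and running the standard linear inexact ALM of LRR [15, 31] on $\Phi$, which is equivalent to solving (16) since the objective depends on the samples only through their inner products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian clusters in R^2 with 3 labeled samples per class.
n_per = 60
X = np.vstack([rng.normal(0.0, 0.5, (n_per, 2)), rng.normal(3.0, 0.5, (n_per, 2))])
y = np.repeat([0, 1], n_per)
n, c = 2 * n_per, 2
labeled = np.zeros(n, dtype=bool)
labeled[[0, 1, 2, n_per, n_per + 1, n_per + 2]] = True
Y = np.zeros((n, c))
Y[labeled, y[labeled]] = 1.0

# Step (i): 0-1 kNN graph and its Laplacian, cf. (13).
sq = np.sum(X**2, axis=1)
dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
np.fill_diagonal(dist2, np.inf)
W0 = np.zeros((n, n))
for i in range(n):
    W0[i, np.argsort(dist2[i])[:5]] = 1.0
W0 = np.maximum(W0, W0.T)
Lap = np.diag(W0.sum(axis=1)) - W0
np.fill_diagonal(dist2, 0.0)

# Step (ii): Gaussian kernel (bandwidth = mean pairwise distance), then warped kernel (15).
sigma = np.sqrt(dist2)[np.triu_indices(n, 1)].mean()
K = np.exp(-dist2 / (2.0 * sigma**2))
gamma = 1.0
Kt = K - gamma * K @ np.linalg.solve(np.eye(n) + gamma * Lap @ K, Lap @ K)
Kt = (Kt + Kt.T) / 2.0                      # symmetrize against round-off

# Step (iii): LRR in the warped kernel space, solved on an explicit feature map
# Phi with Phi.T @ Phi = Kt via the standard linear inexact ALM of LRR.
w, V = np.linalg.eigh(Kt)
Phi = np.sqrt(np.clip(w, 0.0, None))[:, None] * V.T
lam, mu, mu_max, rho, eps = 1.0, 1e-2, 1e6, 1.3, 1e-6
Z, J, E = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
Y1, Y2 = np.zeros((n, n)), np.zeros((n, n))
G = Phi.T @ Phi                             # equals Kt; reused in the Z-update
for _ in range(300):
    # J-update: singular value thresholding (Theorem 1) with threshold 1/mu.
    U_, s, Vt_ = np.linalg.svd(Z + Y2 / mu, full_matrices=False)
    J = (U_ * np.maximum(s - 1.0 / mu, 0.0)) @ Vt_
    # Z-update: closed-form solution of its quadratic subproblem.
    Z = np.linalg.solve(np.eye(n) + G, G - Phi.T @ E + J + (Phi.T @ Y1 - Y2) / mu)
    # E-update: column-wise shrinkage (Theorem 2) with threshold lam/mu.
    Q = Phi - Phi @ Z + Y1 / mu
    norms = np.linalg.norm(Q, axis=0)
    E = Q * (np.maximum(norms - lam / mu, 0.0) / np.maximum(norms, 1e-12))
    # Multipliers, penalty parameter, and convergence check.
    R1, R2 = Phi - Phi @ Z - E, Z - J
    Y1, Y2 = Y1 + mu * R1, Y2 + mu * R2
    mu = min(rho * mu, mu_max)
    if max(np.abs(R1).max(), np.abs(R2).max()) < eps:
        break

# Post-processing: shrink small coefficients, then build a symmetric nonnegative affinity.
A = (np.abs(Z) + np.abs(Z).T) / 2.0
A[A < 1e-4] = 0.0

# Step (iv): GHF label propagation on the learned graph, cf. (5).
Lg = np.diag(A.sum(axis=1)) - A
U_fit = 1e6 * np.diag(labeled.astype(float))
F = np.linalg.solve(Lg + U_fit + 1e-9 * np.eye(n), U_fit @ Y)  # small ridge for stability
pred = F.argmax(axis=1)
print("accuracy on unlabeled samples:", (pred[~labeled] == y[~labeled]).mean())
```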

Below we give a brief analysis of the computational complexity of MKLRR, where n denotes the number of samples and d the data dimensionality. Constructing the nearest-neighbor graph in the first step of MKLRR needs O(n²d). In the second step, computing the data-independent kernel matrix needs O(n²d) and computing the manifold adaptive kernel matrix needs O(n³) because of the matrix inversion in (15). In the fourth step, the complexity of semisupervised learning based on GHF is O(n³). Regarding Algorithm 1 itself, the main computational burden of MKLRR lies in the update of the auxiliary variable J, since it involves a singular value decomposition (SVD). Specifically, this SVD is operated on an n × n matrix, which is time-consuming if the number of samples n is large. As noted in [37], by substituting the dictionary with its orthogonal basis, the computation can be reduced to roughly O(rn²), where r is the rank of the dictionary. The computational complexity of updating E is trivial owing to its simple closed-form solution, and the complexity of updating Z is O(n³). Thus, the computational complexity of MKLRR-based semisupervised learning is O(tn³) in general, where t is the number of iterations of the loop in Algorithm 1.

4. Experiments

This section evaluates the effectiveness of the proposed MKLRR algorithm on semisupervised classification tasks. Specifically, we compare the performance of MKLRR with some state-of-the-art graph construction methods on four representative image data sets. All experiments are conducted on a platform with an Intel(R) Core(TM) i7-4700MQ CPU @ 2.40 GHz, 16.0 GB RAM, Windows 8.1, and Matlab 2013a.

4.1. Experimental Settings

We compare MKLRR with several baseline methods, including some state-of-the-art graph-based semisupervised learning methods:
(i) kNN: if one sample is among the k nearest neighbors of the other, then these two samples are viewed as connected. In kNN-1, k is set to 5, and in kNN-2, k is set to 8. The distance information is measured by the "heat kernel" function, where the variance is the average of the squared Euclidean distances over all edged pairs on the graph.
(ii) ℓ1 graph [8, 38]: the ℓ1-norm regularized least squares problem is optimized by the solver package of [39]. The regularization parameter that enforces sparsity is tuned over a set of candidate values.
(iii) LNP (linear neighborhood propagation): we follow the pipeline in [40] to construct the graph. The neighborhood size in LNP is set to 40.
(iv) SPG (sparse probability graph) [41]: we implement the SPG algorithm with its parameters set as suggested by [41], one as one-quarter of the size of the data set and the other as 0.001.
(v) LRR (low-rank representation) [15]: for all data sets, we tune the regularization parameter over a range of candidate values to achieve the best performance.
(vi) GLRR (graph regularized low-rank representation) [17]: in [17], the accelerated gradient method [42] was employed to optimize GLRR by updating an auxiliary variable with respect to the coefficient matrix, while in our implementation the GLRR objective function is relaxed as described in [10] and the variable is updated by using the SVT operator [31].
(vii) NNLRS (nonnegative low-rank and sparse graph) [9]: we construct the LRR graph with nonnegative and sparse properties. The weighting parameters are set as suggested in [9].
(viii) LRCB (low-rank representation with b-matching constraint) [20]: as suggested in [20], we set the two regularization parameters as 2 and 0.03 for all four data sets. The parameter b is set as 5, 5, 10, and 5 on the ORL, Extended Yale B, PIE, and USPS data sets, respectively.
(ix) KLRR (kernel low-rank representation) [11].

For both KLRR and MKLRR, we use the Gaussian kernel function, and the bandwidth parameter is set as the mean value of the distances over all data pairs. When constructing the weight matrix, the number of nearest neighbors k is set as 5. The regularization parameter in the conventional LRR model is searched over a set of candidate values. Similar to the usage in [43], we fix the parameter γ = 1 in all experiments below.
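For concreteness, one way to realize the kernel described above is sketched below; the exact functional form $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ is our assumption, while setting the bandwidth $\sigma$ to the mean pairwise distance follows the description in this section.

```python
import numpy as np

def gaussian_kernel_mean_bandwidth(X):
    """Gaussian kernel with the bandwidth set to the mean pairwise distance.

    X : (n, d) data matrix, one sample per row
    Returns the (n, n) kernel matrix exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    """
    sq = np.sum(X**2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    sigma = np.sqrt(dist2)[np.triu_indices(X.shape[0], 1)].mean()   # mean pairwise distance
    return np.exp(-dist2 / (2.0 * sigma**2))
```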

4.2. Experiment on Synthetic Data

Similar to the studies [11, 15], a synthetic data set is constructed as follows. We construct 5 independent subspaces whose bases $\{U_i\}_{i=1}^{5}$ are computed by $U_{i+1} = T U_i$, $1 \le i \le 4$, where $T$ is a random rotation and $U_1$ is a random orthogonal matrix of dimension 100 × 100. Therefore, each subspace has a dimension of 100. We sample 200 data vectors from each subspace by $X_i = U_i Q_i$, $1 \le i \le 5$, with $Q_i$ being a 100 × 200 i.i.d. matrix. We randomly choose 30% of the total samples to corrupt. For example, if a data vector $x$ is chosen, its observed vector is computed by adding Gaussian noise with zero mean.

We select different numbers of labeled samples to evaluate the performance of different graph construction methods. Table 1 shows the classification accuracies of different graphs on the synthetic data set. The results are obtained from ten independent runs. From the results, we can find that all LRR variants can achieve good performance even when given only a few labeled samples. KLRR is slightly better than MKLRR by 0.26% when there is only one labeled sample per class. MKLRR obtains the best results in all the remaining cases.

Since GLRR is also related to incorporating the structure information of data into LRR, we show the block diagonal structures learned by LRR, GLRR, KLRR, and MKLRR, respectively, in Figure 2. Generally, although the visual discrepancies between MKLRR and its counterparts are minor, the block diagonal structure obtained by MKLRR is clearer than the others. Most of the values within each block of the MKLRR graph are obviously larger than those of the KLRR graph.

4.3. Experiment on ORL Data Set

The ORL data set contains ten different images of each of 40 distinct subjects. The images were taken at different times, with varying lighting, facial expressions, and facial details. Each image is manually cropped and normalized to the same size. Figure 3 shows some example images of two subjects from the ORL data set.

We repeat all the experiments ten times. Each time, we randomly select a subset of images from each subject to create the labeled sample set. In this experiment, 1, 2, 3, 4, and 5 images per subject are randomly selected as labeled samples and the remaining images are regarded as unlabeled samples. The random indices are kept the same for all compared algorithms. The classification accuracies of different algorithms with different numbers of labeled samples on the ORL data set are shown in Table 2, in which MKLRR outperforms all the compared algorithms. For example, when we select 1, 2, 3, 4, and 5 images per person as labeled samples, the accuracies of MKLRR are higher than those of the second best algorithm by 1.88% (LRCB), 3.18% (KLRR), 2.18% (SPG), 0.85% (SPG), and 1.28% (SPG), respectively.

4.4. Experiment on Extended Yale B Data Set

The Extended Yale B data set consists of 2414 human face images of 38 subjects. Each subject has about 64 images taken under different illuminations. Half of the images are corrupted by shadows or reflections. Each image is cropped to the same size. Figure 4 shows some images of two subjects from the Extended Yale B data set.

We use the first 20 subjects of the Extended Yale B data set, 1262 images in total, to evaluate the different methods. In this experiment, 4, 8, 12, 16, and 20 images per subject are randomly selected as labeled samples and the remaining images are regarded as unlabeled samples. The random indices are kept the same for all compared algorithms. Table 3 shows the classification accuracies of different algorithms with different numbers of labeled samples on the Extended Yale B data set. We can easily find that, with an increasing number of labeled samples, all algorithms obtain better classification results. Although the results are close to saturation, MKLRR still makes some improvements. For example, it obtains an accuracy of 98.09% when given 20 labeled images per person, which is 1.13% higher than that of KLRR. In particular, when given a small number of labeled samples, MKLRR shows great superiority over the remaining algorithms: there is about a 3% improvement over KLRR when the number of labeled samples per person is only 4. Since there are some noises in this data set, the performance of the basic kNN algorithm decreases greatly.

4.5. Experiment on PIE Data Set

The CMU PIE data set contains 41368 images of 68 subjects with different poses, illuminations, and expressions. We only use the images in five near-frontal poses (C05, C07, C09, C27, and C29) under different illuminations and expressions. The first 15 subjects are selected, giving 2550 face images in total. Each image is manually cropped and resized to the same size. Figure 5 shows some images of two subjects from the PIE data set.

As with the Extended Yale B data set, we also select 4, 8, 12, 16, and 20 images per subject as labeled samples and let the remaining images be unlabeled samples. The random indices are kept the same for all compared algorithms. Table 4 shows the classification accuracies of different algorithms with different numbers of labeled samples on the PIE data set. It is obvious that MKLRR outperforms the other algorithms in all cases. In particular, MKLRR performs much better than the others when given 4 labeled samples per person.

4.6. Experiment on USPS Data Set

The USPS digit database [44] consists of 9298 handwritten digit images of 10 numbers (0–9). The size of each image is 16 × 16 pixels. We select 200 samples from each class and thus the resultant data set has 2000 images in total. Figure 6 shows some images of the 10 numbers from the USPS data set.

In this experiment, we randomly select 10%, 20%, 30%, 40%, and 50% samples per digit as labeled samples and let the remaining images be unlabeled samples. The random indices are kept the same for all compared algorithms. Table 5 shows the classification accuracies of different algorithms with different numbers of labeled samples on the USPS data set. All algorithms can obtain excellent performance on this data set including the simple KNN algorithm. The classification accuracies of MKLRR are higher than other algorithms in most cases.

4.7. Parameter Sensitivity Analysis

There are two important parameters in MKLRR: the parameter γ used to construct the manifold adaptive kernel and the regularization parameter λ that controls the impact of the noise term. It is obvious that MKLRR boils down to KLRR when we set γ to zero. In our previous experiments, we empirically fixed the value of γ to one, following similar ideas in [43, 45]. In this section, we analyze the parameter sensitivity of γ and λ by investigating one while fixing the other.

Figure 7 shows how the performance of MKLRR varies with the change of γ on the Extended Yale B and PIE data sets, respectively, where λ is fixed. Here, four images per person are labeled and the remaining are unlabeled. To make the comparison easier, we also include the results of KLRR in the figure. We can find that there is a fairly wide interval from which the value of γ can be reasonably selected.

Figure 8 shows how the performance of MKLRR varies with the change of λ on the Extended Yale B and PIE data sets, respectively, where γ is fixed. There are also four labeled samples for each person. Generally, MKLRR is insensitive to the variation of λ if it is set to a slightly large value.

For the remaining data sets, the parameter sensitivities of MKLRR with respect to γ and λ have similar tendencies to those shown in Figures 7 and 8.

5. Conclusion and Future Work

In this paper, we have proposed a new low-rank representation model for semisupervised image classification, called manifold adaptive kernelized low-rank representation (MKLRR). Different from most existing LRR variants that consider the structure information in the original data space, our proposed model explicitly takes into consideration the intrinsic manifold structure depicted by a nearest-neighbor graph. The graph Laplacian corresponding to the local geometry of the data is incorporated into the manifold adaptive kernel space, in which the low-rank representation is then computed. Extensive experiments performed on both synthetic and benchmark data sets have shown the excellent performance of the MKLRR-based graph for semisupervised classification when given limited labeled samples.

As a limitation of general two-stage graph-based semisupervised learning methods, the information of labeled samples is neglected in the graph construction stage. Therefore, it is necessary to take this point into consideration in order to construct a more discriminative graph. This will be our future work, and one possible approach is to introduce into the LRR model a constraint matrix that can depict the partial label information of data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by National Natural Science Foundation of China (61602140, 61971193, 61633010, 61473110, and 61502129), Zhejiang Science and Technology Program (2017C33049, 2018C04012, and LQ16F020004), China Postdoctoral Science Foundation (2017M620470), Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, NJUPT (BDSIP201804), and Co-Innovation Center for Information Supply & Assurance Technology, Anhui University (ADXXBZ201704).