Abstract

Recently, dictionary learning has become an active research topic. However, most dictionary learning methods directly employ original or predefined handcrafted features to describe the data, which ignores the intrinsic relationship between the dictionary and the features. In this study, we present a method called jointly learning the discriminative dictionary and projection (JLDDP) that can simultaneously learn the discriminative dictionary and projection for both image-based and video-based face recognition. The dictionary realizes a tight correspondence between atoms and class labels. Simultaneously, the projection matrix extracts discriminative information from the original samples. By adopting the Fisher discrimination criterion, the proposed framework enables a better fit between the learned dictionary and projection. With the representation error and coding coefficients, the classification scheme further improves the discriminative ability of our method. An iterative optimization algorithm is proposed, and its convergence is proved mathematically. Extensive experimental results on seven image-based and video-based face databases demonstrate the validity of JLDDP.

1. Introduction

Face recognition (FR) is an important problem in the fields of image processing and computer vision. Recently, plenty of face recognition methods have been proposed [1–5]. However, occlusion, illumination, pose, and small sample size remain major challenges for face recognition [6–8]. Currently, sparse representation-based classification (SRC) [9] has been successfully employed, in which an overcomplete dictionary can represent the query face image well. Notably, the dictionary designed for SRC utilizes all training images. SRC has shown favorable properties in FR, particularly when images are partly occluded. Nevertheless, the uncertain and noisy components may make the dictionary ineffective in representing query samples. Moreover, the dictionary's size grows with the number of training images, so the computational cost of solving the sparse representation coefficients increases when the number of training samples is large. Finally, the dictionary does not take the structure of the training set or the class labels into account, so it lacks discriminative information. To address these issues, predefined dictionaries that use bases such as Haar or Gabor wavelets instead of training samples have been presented [10, 11], but none of these bases was designed for SRC [12].

Dictionary learning (DL) is significant for SRC because it can suppress useless information and thus promote representation and discrimination [13]. To learn a discriminative and small-sized dictionary, a substantial number of methods have been presented [14–16], which can be roughly divided into two categories: unsupervised and supervised. Unsupervised DL methods have achieved satisfactory results by minimizing the representation error. The method of optimal directions (MOD) [17] was proposed for unsupervised DL. MOD updated the dictionary by minimizing the representation error and achieved convergence through an iterative strategy. However, the computation of the matrix inverse in MOD was very expensive. The K-singular value decomposition (K-SVD) [18] method was proposed based on MOD; it performs SVD on the representation error term and selects the decomposition terms as the updated dictionary atoms and the corresponding coding coefficients. The most substantive difference between MOD and K-SVD is the dictionary updating strategy: K-SVD updates one atom and its corresponding coding coefficients at a time until all atoms are updated. Therefore, MOD can be considered a simplified version of K-SVD. Although K-SVD improves performance, the computational complexity of updating atoms is still high. To enhance the efficiency of DL, an effective reconstructive DL method was presented in [19], which was based on alternating optimization over two subsets of variables. Skretting and Engan [20] introduced a forgetting factor into the DL algorithm to make it less dependent on the initial dictionary. In [21], metafaces were learned from the training samples, which promoted the representation ability of the dictionary. Although unsupervised DL methods have achieved impressive recognition results, a limitation remains in their practical applications: due to the absence of label information, the dictionaries obtained by unsupervised DL methods often lack discriminative ability. To overcome this problem, many supervised DL methods that utilize the label information have been proposed. In [22], a discriminative K-SVD algorithm was proposed to ensure both the representative and discriminative abilities of the learned dictionary. To better utilize the correspondence between the dictionary and labels, the label consistent K-SVD (LC-KSVD) [23] algorithm, which associates the label information with each atom to promote the discriminative ability of the dictionary, was put forward. Recently, the Fisher discrimination dictionary learning (FDDL) [24] algorithm was proposed to learn a class-specific dictionary for FR. Based on the Fisher discrimination criterion [25], the representation error associated with each class was employed for classification. Ding and Ji [26] applied a kernel-based robust disturbance dictionary to significantly enhance the recognition accuracy of occluded faces. Since supervised DL methods exploit the label information of training samples to promote the discriminative ability of the learned dictionary, they have achieved good performance for FR. Recent progress in SRC has made video-based face recognition a growing research topic. A video can be treated as a set of images obtained under different poses, illuminations, and expressions. The main difficulty is how to effectively use the multiframe information.
In [27], a video dictionary was adopted to encode different types of video information, i.e., pose, temporal, and illumination variations. In [28], a multivariate sparse representation method was suggested for video-based face recognition, which was robust to noise and occlusion. These two methods learned the dictionary for FR, but they did not consider the impact of other constraints on algorithm performance. Xu et al. [29] proposed a method to learn a structured dictionary for video-based face recognition, which adopted the nuclear norm to make the coding coefficient matrix low-rank. However, this method did not enhance the discriminative ability of the representation coefficients. In addition, it utilized the samples in the original space to learn the dictionary and the coding coefficient matrix, which ignored the influence of noise and other irrelevant information.

Dimensionality reduction (DR) is an essential step for decreasing the cost of data computation and storage. It also eliminates irrelevant information to enhance the discriminative ability of features [30–33]. Zhang et al. [34] proposed a novel unsupervised algorithm to obtain an orthogonal projection, which ensures that the samples are well reconstructed in the projected subspace. Clemmensen et al. [35] utilized a sparseness criterion to realize linear discriminant analysis so that classification and feature selection can be achieved concurrently. In [36], a linear discriminative projection was learned by maximizing the ratio of the between-class representation error to the within-class representation error in the projected space. In [37], the sparsity criterion and the maximum margin criterion [38] were combined to obtain a discriminant projection. Although these SRC-based DR methods yielded notable results, they only acquired low-dimensional features of the samples and failed to supply an explicit discriminative dictionary.

To overcome this limitation, a series of methods have been suggested to combine DR and DL into a unified framework. By combining the sparseness criterion with PCA, Nguyen et al. [39] presented a sparse embedding method for simultaneously solving the DR and DL problems. The projection matrix was learned to retain the sparse structure of samples, and the dictionary was learned in the reduced space simultaneously. However, it ignored the separability of samples from different classes in the subspace. In [40], the sigmoid function and the ratio of the intraclass representation error to the interclass representation error were utilized to learn the discriminative dictionary and projection simultaneously, but both the intraclass and interclass scatter matrices of the coefficients and of the low-dimensional samples were ignored. To address this problem, Feng et al. [41] introduced an orthogonal projection matrix, obtained by maximizing the total scatter and the between-class scatter of the training set, into a framework that learns the projection and dictionary simultaneously. Liu et al. [42] utilized discriminative graph constraints to achieve nonnegative feature projection and dictionary learning simultaneously. Lu et al. [43] also presented a framework that simultaneously learns low-dimensional features and dictionaries to deal with the video-based face recognition problem. Although these joint learning methods have achieved success, they did not exploit the discriminative relationship between the low-dimensional features and the dictionary. To address this issue, a novel method called jointly learning the discriminative dictionary and projection (JLDDP), which simultaneously learns the dictionary and projection in a unified framework, is proposed for FR in this paper. Compared with existing methods, JLDDP has four characteristics. First, the discriminative ability of the dictionary is enhanced by imposing the Fisher discrimination criterion on the coding coefficients. Second, the projection learned by our approach keeps samples from the same class close while keeping samples from different classes far apart in the low-dimensional subspace. Third, JLDDP combines projection learning and DL into a unified framework, so the dictionary and projection can be optimized jointly. Finally, we design an iterative optimization algorithm to solve our model and provide a theoretical proof of its convergence.

The remainder of this paper is organized as follows. Related work is briefly reviewed in Section 2. The details of JLDDP are provided in Section 3. Experiments and comparisons are presented in Section 4, and conclusions are drawn in Section 5.

2. Related Work

2.1. SRC

SRC was proposed by Wright et al. [9] for face recognition. Assume there are n classes of samples, and the training set can be expressed as , where denotes the subset of the training samples that contains ni samples of class i. Let represent the m-dimensional vector formed by stacking the j-th sample of class i. SRC assumes that a testing sample can be well estimated by a linear combination of the training samples from the same class; thus, a testing sample of class i, denoted by , can be expressed as , where is the corresponding coding coefficient. If we utilize the whole training set to represent y, the entries of the coefficient vector should be zero except for those related to the i-th class. In SRC, l1-minimization is applied to compute the coefficient vector, i.e., , where is a tradeoff parameter. denotes the representation error of class i, where selects the coefficients of class i. The classification criterion is .
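To make the pipeline concrete, the following sketch illustrates SRC-style classification. It is an illustration only: scikit-learn's Lasso is used as a stand-in for the l1-ls solver mentioned later in the experiments, and all function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(X_train, labels, y, lam=0.01):
    """Predict the class of test sample y (shape (m,)).
    X_train: (m, N) matrix whose columns are training samples; labels: (N,) class indices."""
    # Approximate the l1-minimization with an l1-regularized least-squares solver.
    solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    solver.fit(X_train, y)
    a = solver.coef_
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        a_c = np.where(labels == c, a, 0.0)              # keep only the class-c coefficients
        residuals.append(np.linalg.norm(y - X_train @ a_c))  # class-wise representation error
    return classes[int(np.argmin(residuals))]            # assign the class with the smallest residual
```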

2.2. Dictionary Learning

In this section, the DL methods, including unsupervised K-SVD [18] and supervised FDDL [24], will be reviewed.

2.2.1. K-SVD

In the K-SVD algorithm [18], an overcomplete dictionary is learned from the training set for image compression and denoising. The objective function of K-SVD is formulated as equation (1), where X is the training set, D is the dictionary, is the sparse coding coefficient matrix of X over D, and T is the parameter that adjusts the sparsity. To optimize equation (1), the sparse coding coefficients and the dictionary D are updated iteratively. However, there is no correspondence between the class labels and the dictionary atoms. Thus, K-SVD is unsuitable for classification problems.
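For reference, the K-SVD objective from [18] can be written as follows (we use A for the coefficient matrix and \alpha_i for its i-th column, since the original symbols are not reproduced above):

\min_{D, A} \ \| X - D A \|_F^2 \quad \text{s.t.} \quad \| \alpha_i \|_0 \le T, \ \ \forall i.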

2.2.2. FDDL

Different from K-SVD, FDDL [24] combines the class label information and the Fisher discrimination criterion to learn a structured discriminative dictionary, which performs classification using the representation error of each class. The FDDL model is formulated as equation (2), where X is the training set, and are tradeoff parameters, and each column of D is normalized to a unit vector. is the discriminative term, is the sparse regularization term, and is the discriminative coefficient term that enforces the discriminative ability of the sparse representation coefficients. The objective function of FDDL can be optimized by updating the dictionary and the sparse representation coefficients iteratively. Although FDDL has achieved good performance for FR, the process is time-consuming. Therefore, in FDDL, PCA is first applied to extract features from all samples.
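For reference, the FDDL model of [24] can be summarized as follows (writing A for the coefficient matrix, A_i for the coefficients of the class-i samples, A_i^j for the part of A_i associated with subdictionary D_j, and λ1, λ2 for the tradeoff parameters mentioned above):

\min_{D, A} \ r(X, D, A) + \lambda_1 \|A\|_1 + \lambda_2 f(A),

r(X, D, A) = \sum_{i} \Big( \|X_i - D A_i\|_F^2 + \|X_i - D_i A_i^i\|_F^2 + \sum_{j \ne i} \|D_j A_i^j\|_F^2 \Big),

f(A) = \mathrm{tr}\big(S_W(A)\big) - \mathrm{tr}\big(S_B(A)\big) + \eta \|A\|_F^2,

where S_W(A) and S_B(A) are the within-class and between-class scatter matrices of the coefficients and η weights the elastic term.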

3. Methodology

In this section, we first describe the proposed JLDDP, which incorporates DL and projection learning into a unified framework. Second, the iterative update algorithm of JLDDP is derived. Third, the convergence analysis is given. Fourth, we present the classification scheme, which exploits the class-specific representation error for FR. Finally, we analyze the guidelines for parameter setting.

3.1. Modeling

Let denote the set of d-dimensional training samples with c classes, where is the i-th class subset of Y. Let be the projection that reduces the feature dimension of the samples. The structured (class-specific) dictionary is denoted by , where is the i-th class subdictionary. The coding coefficient matrix of the projected training set over D is denoted by X, which can be written as , where is the i-th class submatrix of X. Furthermore, can also be expressed as , where is the coding coefficient of over the subdictionary . In JLDDP, the projection, dictionary, and coding coefficients are jointly learned with the model in equation (3), where denotes the representation error term, is the l1-regularization on X, is the coding coefficient term imposing discriminative label information on DL, and S(P) is the projection learning term that projects the samples into a more discriminative space. ω1, ω2, and ω3 are the tradeoff parameters, and each atom dk in the dictionary has unit norm. Next, more detailed descriptions of the terms in equation (3) are given.
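For orientation, equation (3) has the following schematic form, consistent with the description above (the exact equation is not reproduced here):

\min_{P, D, X} \ R(Y, P, D, X) + \omega_1 \|X\|_1 + \omega_2 f(X) + \omega_3 S(P) \quad \text{s.t.} \ \|d_k\|_2 = 1, \ \forall k,

where R(·) is the representation error term, f(X) is the coding coefficient term, and S(P) is the projection learning term detailed below.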

3.1.1. Representation Error Term

When the training samples are represented by a dictionary, we expect the dictionary to have both strong reconstructive ability and strong discriminative ability. In addition, the samples can be reconstructed not only by the whole dictionary but also by the subdictionary from the same class. Therefore, the representation error term is expressed as equation (4).

The representation error term is designed to obtain a small representation error between the low-dimensional training samples and the structured dictionary D. First, each class of low-dimensional training samples should be well represented by the structured dictionary D, i.e., . Second, each class of low-dimensional training samples should be well represented by the subdictionary of the same class rather than those of other classes, which means that should be represented by as well as possible, but not by . Hence, should have some significant coefficients, while should have nearly zero coefficients.
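One natural instantiation of this term, consistent with the description above and with the FDDL-style fidelity term of [24] (we write Y_i for the class-i training samples and X_i^j for the coefficients of the class-i samples over subdictionary D_j), is:

R(Y, P, D, X) = \sum_{i=1}^{c} \Big( \|P Y_i - D X_i\|_F^2 + \|P Y_i - D_i X_i^i\|_F^2 + \sum_{j \ne i} \|D_j X_i^j\|_F^2 \Big).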

3.1.2. Coding Coefficient Term

We can make the dictionary discriminative by constraining the coding coefficients [24]. According to the Fisher discrimination criterion, the within-class scatter should be minimized and the between-class scatter should be maximized, which makes the coding coefficients discriminative. Hence, the coding coefficient term is formulated as equation (5), where is the within-class scatter of X, is the mean vector of , is the between-class scatter of X, is the number of samples in class i, and is the mean vector of X. We impose the Fisher discrimination criterion on X to improve its discriminative ability, which requires that the within-class scatter Sw(X) be minimized and the between-class scatter Sb(X) be maximized. is an elastic term, and the convexity of equation (5) is proved in [24].
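Written out with the quantities defined above, the Fisher-style coding coefficient term takes the following form (m_i and m denote the mean coefficient vectors of class i and of all classes, n_i is the number of class-i samples, and η weights the elastic term; the symbols are ours):

f(X) = \mathrm{tr}\big(S_W(X)\big) - \mathrm{tr}\big(S_B(X)\big) + \eta \|X\|_F^2,

S_W(X) = \sum_{i=1}^{c} \sum_{x_k \in X_i} (x_k - m_i)(x_k - m_i)^{\top}, \qquad S_B(X) = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^{\top}.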

3.1.3. Projection Learning Term

The projection matrix P should preserve the energy of the samples as much as possible and make the samples from different classes separable in the low-dimensional space. Therefore, the projection learning term is expressed as equation (6), where and are the within-class scatter and the between-class scatter of , respectively. denotes the k-th sample of class i in the low-dimensional space. and denote the corresponding mean vectors, respectively. We adopt the Fisher discrimination criterion on the low-dimensional samples, i.e., , to enhance the discriminative ability of the features. Moreover, we minimize the term to guarantee that the energy of the samples is well preserved.
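With \tilde{Y} = PY denoting the projected training set, a plausible instantiation of this term is shown below. The Fisher part follows directly from the description above, while the reconstruction-style energy term \|Y - P^{\top} P Y\|_F^2 and its weight β are assumptions modeled on related joint DR-DL formulations [39, 41]:

S(P) = \mathrm{tr}\big(S_W(\tilde{Y})\big) - \mathrm{tr}\big(S_B(\tilde{Y})\big) + \beta \|Y - P^{\top} P Y\|_F^2.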

By incorporating equations (4)–(6), we obtain the JLDDP model as shown in equation (3). The iterative update scheme is adopted to optimize the objective function, and the detailed optimization process of JLDDP is presented in the following section.

3.2. Optimization

The objective function of JLDDP is not convex for P, D, and X jointly, but it is convex with regard to each of them when the others are fixed. Thus, equation (3) can be divided into three subproblems and optimized by an iterative update scheme.

3.2.1. Updating X with Fixed P and D

Suppose that P and D are fixed; then we can update the coding coefficients class by class, i.e., the coefficients of all the other classes are fixed while those of one class are updated. Therefore, the simplified form of equation (3) can be obtained as equation (7), where and are the mean vector matrices of class k and class i, respectively, and M is the mean vector matrix of all classes. Except for the l1 term, the other terms in equation (7) are differentiable. Since equation (7) is strictly convex, we can employ the iterative projection method (IPM) [44] to solve it.
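As a rough illustration of the iterative-projection idea (not the exact solver of [44]), the sketch below shows a proximal-gradient, soft-thresholding loop for the l1-regularized least-squares core of such a subproblem. The smooth Fisher terms of equation (7) are omitted, and all names are ours.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entry-wise soft-thresholding, the proximal map of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_ls_prox_grad(A, b, lam, n_iter=200):
    """Solve min_x ||b - A x||_2^2 + lam * ||x||_1 by proximal gradient descent."""
    L = 2.0 * np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)             # gradient of the data-fit term
        x = soft_threshold(x - grad / L, lam / L)  # gradient step followed by the l1 proximal map
    return x
```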

3.2.2. Updating D with Fixed P and X

To obtain the optimal structured dictionary D, we need to update the subdictionaries class by class, while P, X, and all the other subdictionaries are fixed. Then, equation (3) can be simplified as equation (8), where represents the coding coefficients of over the subdictionary . We can employ the algorithm in [19] to solve equation (8), i.e., update the subdictionary atom by atom.

3.2.3. Updating P with Fixed D and X

When the dictionary D and the coding coefficient matrix X are fixed, equation (3) can be simplified to equation (9).

Equation (10) can be obtained from equation (9) by mathematical derivation.

Setting the derivative with respect to P in equation (10) to zero, we obtain equation (11).

For convenience, we define , , , , and to replace the corresponding parts of equation (11). Then, we obtain the explicit solution of the projection matrix P, as shown in equation (12).

The above iterative optimization process of JLDDP stops when the algorithm converges or the maximum number of iterations is reached. Algorithm 1 summarizes the whole optimization process, and a schematic sketch of the loop is given after the listing.

(1) Input: the training set Y, the maximum number of iterations, and parameters ω1, ω2, and ω3.
(2) Initialize: the projection matrix P and the structured dictionary D.
(3) Repeat steps 4–6 until convergence or the maximum number of iterations is reached.
(4) Update X with P and D fixed by equation (7).
(5) Update D with P and X fixed by equation (8).
(6) Update P with D and X fixed by equation (12).
(7) Output: projection matrix P, structured dictionary D, and coding coefficient matrix X.
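A minimal sketch of this alternating loop is given below. The three update callables and the objective are placeholders standing for the subproblem solvers of equations (7), (8), and (12); only the control flow of Algorithm 1 is shown, not its concrete implementation.

```python
def jlddp_train(Y, P0, D0, X0, update_X, update_D, update_P, objective,
                max_iter=20, tol=1e-4):
    """Alternating optimization skeleton for Algorithm 1 (placeholder solvers)."""
    P, D, X = P0, D0, X0
    prev = float("inf")
    for _ in range(max_iter):
        X = update_X(P, D, X)   # step 4: update the coding coefficients with P and D fixed
        D = update_D(P, X, D)   # step 5: update the structured dictionary with P and X fixed
        P = update_P(D, X, P)   # step 6: closed-form update of the projection with D and X fixed
        obj = objective(P, D, X)
        if prev - obj < tol:    # stop once the objective stops decreasing appreciably
            break
        prev = obj
    return P, D, X
```
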
3.3. Convergence

The optimization process of JLDDP is divided into three subproblems that are solved iteratively, as formulated in equations (7), (8), and (12). It has been proved in [24] that the subproblem in equation (7) is convex. Equation (8) is a quadratic program, so it is also convex. In each iteration, the objective value decreases after solving X and D via equations (7) and (8), respectively, as proved in [21, 44]. Moreover, the subproblem in equation (12) has an explicit solution. Thus, to justify the convergence of JLDDP, we need to demonstrate that the value of equation (3) is nonincreasing during the optimization. For convenience, let J(P, D, X) denote the objective function of JLDDP. Before proving the convergence of Algorithm 1, we establish Theorem 1.

Theorem 1. If Algorithm 1 is used to solve the JLDDP model in equation (3), the objective function value is nonincreasing.

Proof. Let J(P^{t}, D^{t}, X^{t}) denote the objective function value in the t-th iteration.
When solving the subproblem of X, we utilize the method in [44] to obtain the optimal value of X with P and D fixed. This subproblem is convex, so we can obtain

J(P^{t}, D^{t}, X^{t+1}) \le J(P^{t}, D^{t}, X^{t}). \quad (13)

When solving the subproblem of D, we employ the method in [21] to obtain the optimal value of D with P and X fixed. It is still a convex problem, so we have

J(P^{t}, D^{t+1}, X^{t+1}) \le J(P^{t}, D^{t}, X^{t+1}). \quad (14)

When solving the subproblem of P, we obtain the explicit solution with D and X fixed based on equation (12). Therefore,

J(P^{t+1}, D^{t+1}, X^{t+1}) \le J(P^{t}, D^{t+1}, X^{t+1}). \quad (15)

Combining equations (13)–(15), we have

J(P^{t+1}, D^{t+1}, X^{t+1}) \le J(P^{t}, D^{t}, X^{t}).

Now, the theorem has been proved.
Since each term in equation (3) is nonnegative, the objective function value has a lower bound. According to Theorem 1 and the Cauchy convergence criterion [45], the optimization algorithm presented for JLDDP is convergent.

3.4. Classification

The learned projection P reduces the dimension of a testing sample , and the resulting low-dimensional feature can be coded over the learned dictionary D. Therefore, we can obtain the coding coefficient by solving equation (17), where is the coding coefficient and is the coding coefficient vector associated with class i. is a tradeoff parameter.

The structured dictionary D is learned so that the coding coefficients of samples from the same class are similar and those of different classes are dissimilar. In addition, the coding coefficients gain a stronger discriminative ability through the constraints of the Fisher discrimination criterion. Therefore, not only the representation error but also the distance information of the coding coefficients obtained by equation (17) is useful for classification. We classify the testing sample as follows, where is the mean coefficient vector related to class i and is a tradeoff parameter.
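A sketch of this classification rule is given below. It is an illustration under our own naming: scikit-learn's Lasso stands in for the l1 solver of equation (17), and gamma plays the role of the tradeoff parameter between the two cues.

```python
import numpy as np
from sklearn.linear_model import Lasso

def jlddp_classify(y, P, D, atom_labels, class_means, gamma=0.01, lam=0.001):
    """Classify a test sample y with the learned projection P and structured dictionary D.
    atom_labels: (K,) array giving the class index of each dictionary atom (column of D);
    class_means: list of per-class mean coefficient vectors computed from training."""
    x = P @ y                                                  # project the test sample
    solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    solver.fit(D, x)                                           # l1-regularized coding of x over D
    alpha = solver.coef_
    scores = []
    for i, m_i in enumerate(class_means):
        mask = (atom_labels == i)
        err = np.linalg.norm(x - D[:, mask] @ alpha[mask]) ** 2   # class-i representation error
        dist = np.linalg.norm(alpha - m_i) ** 2                   # distance to the class-i mean coefficients
        scores.append(err + gamma * dist)                         # combine both cues
    return int(np.argmin(scores))
```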

3.5. Parameter Analysis

There are three parameters in the proposed JLDDP, i.e., ω1, ω2, and ω3, so how to set their values properly is important. Fortunately, each parameter has a clear physical meaning, which provides a guideline for setting its value. The parameter ω1 controls the sparsity of the coding coefficient matrix and should be set to a moderate value. The parameter ω2 adjusts the coding coefficient term based on the Fisher discrimination criterion and should be set neither too small nor too large: an extremely small ω2 leads to the loss of latent discriminative information, while an overly large ω2 causes the other terms to be neglected. The parameter ω3 constrains the projection learning term based on the Fisher discrimination criterion. Analogous to ω2, a relatively small ω3 weakens the effect of the projection learning term, whereas a relatively large ω3 makes the objective function dominated by the projection learning term so that the role of the other terms is neglected.

3.6. Comparison with the Existing Work

To highlight the novelty of our work, we compare the proposed JLDDP method with some related studies. First, although some terms in the objective function of FDDL [24] are similar to those in our JLDDP, the two methods are different. Specifically, FDDL utilizes PCA to project the original features into a low-dimensional subspace, a step that is separated from the process of dictionary learning. Thus, FDDL does not exploit the relationship between the low-dimensional features and the learned dictionary and cannot effectively learn features appropriate for the discriminative dictionary learning task. To solve this problem, our proposed JLDDP simultaneously learns the feature projection matrix and the dictionary in a unified framework, which ensures that the learned projection matrix is most beneficial for discriminative dictionary learning. That is, the learned projection matrix and dictionary in JLDDP are relevant and mutually beneficial, so jointly optimizing them can achieve better performance for face recognition. Second, the proposed JLDDP also resembles the dictionary learning methods in [46–48]. However, there exist some significant differences between them. To be specific, (1) the methods in [46–48] each learn multiple class-specific subdictionaries and a common subdictionary shared by all classes, and then combine the learned class-specific subdictionaries and the common subdictionary to achieve the recognition task. In our JLDDP, we only need to learn a subdictionary for each class and combine all subdictionaries into a whole dictionary. Therefore, there is no need to learn and update a common dictionary during the model optimization, which gives our model a fast convergence speed and high computational efficiency. (2) Similar to FDDL, the methods in [46–48] do not consider feature projection matrix learning in the process of dictionary learning. Thus, the feature projection is separated from dictionary learning, and the best combination of low-dimensional features and dictionary for face recognition cannot be learned. (3) The regularization criteria in the objective functions adopted in [46–48] are different from those in our proposed JLDDP; e.g., [46, 48] used the l1-norm and [47] used the l2,1-norm to enforce sparsity of the learned dictionary coefficients, while our JLDDP utilizes the intraclass and interclass scatter of the coefficients as constraints, which improves the discrimination of the model. Third, Lin et al. [49] proposed an RDCDL method which utilizes low-rank and sparse constraints to extract the disturbance components (e.g., noise, outliers, and occlusion) in the training samples. In RDCDL, a set of training samples and a set of alternative training samples with simulated facial variations are employed to build a dictionary learning model with a complex and comprehensive dictionary, which includes a class-shared dictionary, a class-specific dictionary, a simulated disturbance dictionary, and a real disturbance dictionary. The main difference between our JLDDP and RDCDL is that we only adopt class-specific dictionaries to construct the whole dictionary, which is simpler than Lin's model and greatly reduces the computational complexity. Besides, RDCDL utilizes PCA to reduce the feature dimension of the samples, which is separated from the process of dictionary learning.
In contrast, our JLDDP combines feature projection and dictionary learning into a unified framework to obtain a more suitable low-dimensional feature, which is quite different from RDCDL. Moreover, it is worth noting that RDCDL only adopts the intraclass scatter of the coefficients as the discrimination constraint and neglects the interclass scatter, while our JLDDP utilizes both the intraclass and interclass scatter to improve the discriminative ability of the learned dictionary. Fourth, Zhang et al. [40] proposed an SS-DSPP model which can simultaneously learn the dictionary and the projection matrix, but it is still very different from our JLDDP in the following aspects. SS-DSPP takes advantage of the relationship between the reconstruction error of the training samples over the same-class dictionary and that over different-class dictionaries; nevertheless, it does not consider a discrimination constraint on the coefficients. In addition, SS-DSPP ignores the class information of the low-dimensional features obtained after projection and only imposes an orthogonal constraint on the projection matrix, which reduces the discrimination capability of the model to some extent. To solve these problems, our JLDDP utilizes the Fisher discrimination criterion to constrain the intraclass and interclass scatter of both the coefficients and the low-dimensional samples, which ensures the discrimination ability of the JLDDP model. In summary, although the proposed method shares several similarities with the aforementioned approaches [24, 40] and [46–49], our JLDDP differs from them in the dictionary learning process, the projection learning process, or the coefficient constraints. Specifically, JLDDP simultaneously learns the dictionary and the projection matrix in a unified framework by adopting the intraclass and interclass scatter as constraints on the coefficients and the samples. Thus, JLDDP can explore the intrinsic relationship between dictionary and feature learning, which improves the classification performance for both image-based and video-based face recognition.

4. Experimental Results

We conduct extensive experiments on image-based and video-based face databases to confirm the validity of JLDDP.

4.1. Image-Based Face Recognition Results and Analysis
4.1.1. Image Database Description

ORL [50], CMU PIE [51], FERET [52], and LFW [53] databases are used to prove the validity of JLDDP for image-based face recognition. Some examples from the ORL, CMU PIE, FERET, and LFW databases are shown in Figure 1.

The ORL face database includes 400 images of 40 subjects. The images reflect variations in illumination, pose, expression, and the presence of glasses. The CMU PIE face database includes 41,368 images of 68 subjects, captured under 43 different illumination conditions, across 13 poses, and with 4 expressions. We adopt a subset of 24 images per person in this experiment. The FERET database was collected in a real-world environment and contains a large number of images: 14,051 face images of more than 1,000 subjects with different expressions, poses, and illuminations. In addition, the images in the FERET database were acquired over a long time span. We adopt a subset containing 1,400 images of 200 subjects in this experiment. The LFW database was collected in unconstrained environments and is very challenging. It contains 13,233 face images of 5,749 subjects; however, most subjects have only one image. Therefore, we select 158 subjects from LFW, each with at least 10 distinct images, to verify the effectiveness of the algorithms. In [54], a sparse representation-based alignment method was proposed for real-world images, which can eliminate variations in orientation, expression, and other factors as much as possible. We use this method to preprocess the original LFW images for all recognition methods. Table 1 provides the detailed database information. All images are cropped according to manually selected eye coordinates and normalized to 32 × 32 pixels.

4.1.2. Experiment Setting

In the image-based face recognition task, we compare our method with several representative methods, including SRC [9] with PCA and LDA, LC-KSVD [23], FDDL [24], DRSRC [34], LSD [29], DSRC [40], JDDRDL [41], and JNPDL [42]. The l1-ls toolbox [55] is adopted to handle the l1-minimization problem in the SRC-related algorithms. The source code of the l1-ls toolbox can be found at http://web.stanford.edu/∼boyd/l1_ls/. The source code of FDDL can be found at http://www4.comp.polyu.edu.hk/cslzhang/code/FDDL.zip. The source code of LC-KSVD can be found at http://users.umiacs.umd.edu/∼zhuolin/projectlcksvd.html. The other methods are based on our own implementations, and their parameters are tuned according to the settings reported in the corresponding papers. We set the number of atoms for each class of the dictionary in JLDDP to half of the number of training samples. With randomly chosen training and testing samples, the experiments are repeated 10 times, and the average recognition accuracies and standard deviations are reported. All the methods are developed in MATLAB and run on a computer with an Intel Core i3-2100 CPU at 3.2 GHz and 8 GB of memory.

We first compare the recognition performance under various feature dimensions, and then under various numbers of training samples. For convenience, the numbers of training and testing samples are denoted by l and h, respectively. Tables 2 and 3 give the data descriptions.

We also compare the recognition performance under different parameter values. The parameter values are tuned by searching the grid {0, 0.0001, 0.001, 0.01, 0.1, 1} in an alternating manner to obtain the optimal parameter combination. Finally, we provide the convergence evaluation.

4.1.3. Recognition Results and Analysis

(1) Recognition Performance under Different Feature Dimensions. In the first experiment, we employ different feature dimensions to verify the performance of the various methods. Table 2 shows the number of training samples and the reduced feature dimensions. The reduced feature dimension of LDA is at most one less than the number of classes, so we cannot vary the feature dimension as with the other methods; thus, the results of LDA + SRC are not shown in this experiment. In LC-KSVD and FDDL, PCA is adopted to reduce the sample dimension. Tables 4–7 report the recognition accuracies on the four databases for various feature dimensions. In most instances, the performance of JLDDP is better than that of the other methods. Moreover, several points can be seen from the tables. First, DRSRC is an unsupervised DR method designed based on SRC, so its accuracy is higher than that of PCA + SRC in most cases. This illustrates that a well-designed projection is more suitable for classification. Second, compared with PCA + SRC and DRSRC, the average recognition accuracies of LC-KSVD, FDDL, and LSD are higher. The reason is that, after the sample dimension is reduced with PCA, LC-KSVD, FDDL, and LSD can learn a representative and discriminative dictionary, which plays a key role in SRC. Third, LC-KSVD, FDDL, and LSD enhance the discriminative ability of the dictionary, but they do not jointly learn a projection that preserves much discriminative information. Therefore, their performance is not as good as that of JDDRDL, DSRC, JNPDL, and JLDDP under different feature dimensions. Fourth, JLDDP outperforms JDDRDL, DSRC, and JNPDL significantly under different feature dimensions on the four databases, except when the feature dimension is 250 on the CMU PIE database, where the best average recognition result of JDDRDL is only 0.07% higher than that of JLDDP. Nevertheless, the experimental results indicate that JLDDP achieves relatively stable and high recognition accuracy in general under different feature dimensions. The superiority of our approach is due to the fact that JLDDP can discover the latent discriminative ability of samples in the low-dimensional space and learn the class-specific dictionary simultaneously.

(2) Recognition Performance under Various Numbers of Training Samples. The effectiveness of JLDDP under various numbers of training samples is compared with that of the other methods on the ORL, CMU PIE, FERET, and LFW databases. The numbers of training and testing samples are listed in Table 3. Tables 8–11 show the recognition accuracies, with the corresponding feature dimensions annotated in parentheses. When there are only 2 training samples per subject, JDDRDL, DSRC, JNPDL, and JLDDP, which learn the dictionary and projection jointly, obtain better performance than the other methods. When the number of training samples increases, the performance of all the methods generally improves, except for LDA + SRC and LC-KSVD on the FERET database. Compared with the other methods, JLDDP achieves the best average recognition accuracies with a relatively small feature dimension, which demonstrates its suitability for practical applications.

(3) Recognition Performance under Different Parameter Values. We test the impact of various parameter values on the four image-based face recognition databases. Since there are three parameters in the proposed JLDDP, we fix two of them and analyze the influence of the remaining one. The physical meaning of the parameters is described in Section 3. For the ORL, CMU PIE, FERET, and LFW databases, the number of training samples is set to 5, 7, 4, and 5, respectively. The top average recognition results obtained by JLDDP under various parameter values are shown in Figure 2. When the values of ω1, ω2, and ω3 are equal to zero, the recognition accuracy of JLDDP is relatively low, which indicates that each term in the objective function of JLDDP is significant for classification. As each parameter value increases, the performance of JLDDP improves gradually. When ω1 = 0.0001, ω2 = 0.0001 or 0.001, and ω3 = 0.001 or 0.01, the proposed JLDDP performs best on the four databases. However, after the best performance is reached, the recognition accuracy decreases dramatically as each parameter value increases further. Hence, ω1, ω2, and ω3 should be set to moderate values to obtain good performance, which conforms to our analysis in Section 3. That is, if a parameter value is too large, the corresponding term in equation (3) plays a dominant role and the other terms are neglected; in contrast, if a parameter value is too small, the corresponding term loses its constraining ability.

To further evaluate the role of each term in our model, we set the parameter values of ω1, ω2, and ω3 to zero in turn and test the performance of JLDDP. Here, the number of training samples is set to 5, 7, 4, and 5 for the ORL, CMU PIE, FERET, and LFW databases, respectively. The top average recognition results obtained by JLDDP in these situations are shown in Table 12. In this table, the baselines are the results obtained with the optimal parameter combinations in Tables 9–11. From the experimental results, we can see that the proposed method cannot achieve its best recognition accuracy when one of the parameters ω1, ω2, and ω3 is equal to zero, which indicates that the sparse constraint term, the coding coefficient term, and the projection learning term are all essential to improving the recognition performance of JLDDP. Besides, the recognition accuracies decrease dramatically when ω1 is set to zero, i.e., when the sparse constraint term is omitted, which indicates that the sparse constraint on the dictionary representation is very important for the discriminative ability of our model. Furthermore, the recognition accuracies are very close when ω2 or ω3 is set to zero, but they are much lower than the baselines. This means the coding coefficient term and the projection learning term are also indispensable in JLDDP, since they bring the intraclass and interclass information into our model to ensure the discrimination of the coefficients and the low-dimensional features.

(4) Convergence Evaluation. Figure 3 shows the convergence curves of JLDDP on the ORL, CMU PIE, FERET, and LFW databases. In each subfigure, the x-axis represents the iteration number, and the y-axis represents the value of the objective function. From this figure, we can see that the proposed iterative updating algorithm of JLDDP converges, which is consistent with our convergence analysis in Section 3.

4.2. Video-Based Face Recognition Results and Analysis
4.2.1. Classification Scheme

To further evaluate the performance of JLDDP, we perform face recognition experiments on videos. Here, we suppose is a testing face video, where is the j-th () frame and is the total number of frames. Following Lu et al. [43], we project each frame into a low-dimensional feature space with the learned projection P and then obtain the corresponding coding coefficients by equation (17). The class label of each frame is obtained by the following equation, as in [42], where is the pseudo-inverse of and is the projection of onto the span of atoms in [26]. Finally, we apply majority voting to determine the testing video's label after obtaining the labels of all frames, as follows, where denotes the total number of votes for the i-th class.
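The per-frame rule and the voting step can be sketched as follows. This is an illustration with our own naming: for each subdictionary D_i, the product of D_i and its pseudo-inverse projects the projected frame onto the span of the class-i atoms, and the residual of that projection drives the per-frame label.

```python
import numpy as np

def classify_video(frames, P, sub_dicts):
    """frames: list of raw frame vectors; P: learned projection (p x d);
    sub_dicts: list of per-class subdictionaries D_i (p x k_i).
    Returns the index of the winning class after majority voting."""
    votes = np.zeros(len(sub_dicts), dtype=int)
    for y in frames:
        x = P @ y                                     # project the frame into the learned subspace
        residuals = []
        for D_i in sub_dicts:
            proj = D_i @ np.linalg.pinv(D_i) @ x      # projection of x onto the span of class-i atoms
            residuals.append(np.linalg.norm(x - proj))
        votes[int(np.argmin(residuals))] += 1         # per-frame label: smallest residual
    return int(np.argmax(votes))                      # majority voting over all frames
```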

4.2.2. Video Database Description

The Honda [56], MoBo [57], and YTC [58] databases are employed to verify the performance of JLDDP. All the videos in the Honda database were recorded indoors under normal lighting conditions and include different facial expressions and a large range of head movements. The Honda database contains 59 videos of 20 subjects, and each video clip comprises 12 to 645 frames. The MoBo database was designed for identifying people at a distance and was captured with fixed-position cameras. It comprises 96 videos of 24 subjects and includes large head-pose variations; each subject has 4 videos, with about 300 frames per video. The YTC database was collected from YouTube and contains 1,910 videos of 47 subjects, who are politicians, actors, or actresses. It is a large, highly compressed, low-resolution video database for face recognition, and each video contains 8 to 400 frames. In the experiments, the cascaded face detector [59] is used to detect the faces, and then all the faces are resized to grayscale images of 30 × 30 pixels.

4.3. Experiment Setting

We compare the proposed JLDDP with several classical video-based face recognition methods, including MSM [60], DCC [61], MMD [62], MDA [63], AHISD [64], CHISD [64], SANP [65], DFRV [27], LSD [29], and SFDL [43]. The source code of DCC can be found at http://mi.eng.cam.ac.uk/∼tkk22. The source code of AHISD and CHISD can be found at http://mlcv.ogu.edu.tr/softwareimageset.html. Since the source codes of the other methods are not provided by their authors, we implement them ourselves and follow the parameter settings in the corresponding papers. In the video-based experiments, the parameters ω1, ω2, and ω3 of JLDDP are empirically set to 0.0001, 0.0005, and 0.005, respectively. The number of atoms per class for the Honda, MoBo, and YTC databases is set to 20, 25, and 40, respectively. We report the best accuracy that JLDDP achieves with the projected dimension chosen from {50, 100, 150, 200, 300}. All results are averaged over 10 independent runs with different training set selections.

In the first experiment, the proposed JLDDP is compared with the state-of-the-art methods. The training set of the Honda and MoBo databases contains one video of each subject, and the testing set contains the remaining videos. If a subject has only one video, we split the video into two clips and randomly select one clip for training and the other for testing. The training set of the YTC database contains 3 videos of each subject, and the testing set contains 6 videos of each subject. In the second experiment, the influence of different numbers of training and testing frames on the performance of the various methods is tested. We randomly choose 50, 100, and 200 frames from each video as the training set and another 50, 100, and 200 frames as the testing set.

4.4. Recognition Results and Analysis
4.4.1. Comparison with the Contrast Methods

In the first experiment, our JLDDP is compared with several existing methods. Table 13 tabulates the recognition accuracies of these methods on the Honda, MoBo, and YTC databases. The recognition accuracies of MDA, LSD, SFDL, and JLDDP are higher than those of MSM, DCC, MMD, AHISD, CHISD, SANP, and DFRV in most cases. Therefore, we can infer that the supervised methods exploit more discriminative information than the unsupervised ones. Moreover, our JLDDP surpasses the compared methods. The main reason is that JLDDP projects the frames into a discriminative low-dimensional subspace, which is beneficial for obtaining discriminative coding coefficients with the class-specific dictionary.

4.4.2. Comparison under Various Number of Frames

In the second experiment, various numbers of frames are selected as the training set to compare the robustness of JLDDP with that of the other methods. Figure 4 shows the top average recognition accuracies of the different methods on the Honda, MoBo, and YTC databases with various numbers of frames. The recognition accuracies improve as the number of frames increases, and JLDDP achieves the best recognition accuracy for every number of frames. This is because jointly learning the projection and dictionary enables JLDDP to obtain more discriminative information.

5. Conclusions

This paper presents a JLDDP method for sparse representation-based face recognition. By combining DL and DR into a unified framework, our JLDDP obtains the adaptive projection and dictionary. The proposed JLDDP achieves commendable performance and robustness on seven benchmark image-based and video-based databases. Moreover, an effective iterative algorithm is proposed to solve the optimization problem, and the convergence is strictly proven.

Data Availability

The data are derived from public domain resources.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant nos. 61602221, 61672150, and 61806126, in part by the Fund of the Jilin Provincial Science and Technology Department under Grant nos. 20200201199JC, 20180201089GX, 20190201305JC, 20200401081GX, and 20200401086GX, in part by the Fund of Education Department of Jilin Province under Grant nos. JJKH20190294KJ and JJKH20190291KJ, in part by the Natural Science Foundation of Jiangxi Province under Grant no. 20171BAB212009, in part by the Science and Technology Research Project of Jiangxi Provincial Department of Education under Grant no. GJJ160333, and in part by the Funds for the Central Universities under Grant nos. 2412018QD029, 2412019FZ049, and 2412020FZ031.