Abstract

We investigate a novel approach to robust face image feature extraction that adopts methods based on Unsupervised Linear Subspace Learning to extract a small number of good features. Firstly, the face image is divided into blocks with a specified size, and then we propose and extract a pooled Histogram of Oriented Gradients (pHOG) over each block. Secondly, an improved Earth Mover's Distance (EMD) metric is adopted to measure the dissimilarity between blocks of one face image and the corresponding blocks of the remaining face images. Thirdly, considering the limitations of the original Locality Preserving Projections (LPP), we propose Block Structure LPP (BSLPP), which effectively preserves the structural information of face images. Finally, an adjacency graph is constructed and a small number of good features of a face image are obtained by methods based on Unsupervised Linear Subspace Learning. A series of experiments have been conducted on several well-known face databases to evaluate the effectiveness of the proposed algorithm. In addition, we construct noise, geometric distortion, slight translation, and slight rotation variants of the AR and Extended Yale B face databases, and we verify the robustness of the proposed algorithm when faced with a certain degree of these disturbances.

1. Introduction

Although many sophisticated algorithms have been proposed, face recognition is still a challenging problem affected by many external factors such as occlusion, illumination, noise, geometric distortion, translation, and rotation of face images. Recently, face recognition algorithms based on deep learning have achieved good performance [1–6]. The stacked autoencoder (SAE) [7] is an unsupervised neural network approach in which the input and target values are the same. In SAE, the deepest hidden layer carries the features we are interested in. The input layer and the deepest hidden layer are connected by multiple encoding layers, and the deepest hidden layer and the output layer are connected by multiple decoding layers. The activation values of the deepest hidden layer nodes are essentially the deep representation features, which are used to perform classification tasks by feeding them to a corresponding classifier such as Softmax. In order to obtain more robust features, random noise can be added to the input layer of SAE. This method is called Stacked Denoising Autoencoders (SDAE) [8]. In practical applications, the values of the input layer nodes can be set to 0 with a certain probability, which allows more robust features to be extracted. However, SAE and SDAE both adopt fully connected layers between the input layer (or a hidden layer) and the next hidden layer. The disadvantage is that a large number of parameters need to be learned when training SAE and SDAE. Taking SDAE as an example, even with only 100 hidden nodes, the number of network weights and biases to be learned becomes very large when extracting deep features directly over raw pixel images. In order to overcome the limitations of the fully connected network, Convolutional Neural Networks (CNN) [9, 10] were proposed, where local connectivity effectively reduces the computational complexity of model training. In addition, another important advantage of using a locally connected network is that we can extract local information in the input space, which is consistent with the mechanism of the visual center. The LeNet-5 [10] model is one of the most classic CNN models; its convolution kernels are learned during model training by back propagation. Another way to learn convolution kernels is the unsupervised approach: stacked autoencoders [7, 8] are used to extract convolution kernels of the corresponding size. In UFLDL_CNN [11], convolution kernels are learned by the unsupervised Stacked Autoencoder (SAE) [7, 8] and used to perform the convolution operations at the convolutional layer. However, in addition to learning a large number of network weights and biases, these methods need to choose several hyperparameters, for example, the sparsity penalty coefficient (as in the autoencoder algorithm [12, 13]) and the weight penalty coefficient (as in the regularization of deep neural networks) [14]. Furthermore, these parameters need to be selected by cross validation, so complexity and high computational cost are attached to these algorithms [14]. On the other hand, algorithms based on keypoints, such as SIFT [15] and SURF [16], have demonstrated good performance in face recognition, and they are robust to scale changes, rotation, illumination, and other disturbances [15–17]. But the disadvantage of these methods is that they may require a large number of keypoints, and the number of queries between keypoints is even larger. This makes it difficult to perform face recognition or retrieval tasks effectively on large-scale face image datasets.
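As an illustration of the SDAE input corruption step just described, a minimal NumPy sketch that masks input values to zero with a fixed probability might look as follows; the corruption probability and image size here are illustrative assumptions, not values taken from the cited papers:

```python
import numpy as np

def corrupt_input(x, p=0.3, rng=None):
    """Set each input entry to 0 with probability p, as in SDAE training."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p   # keep each entry with probability 1 - p
    return x * mask

# Usage: corrupt a batch of flattened (hypothetical) 64x64 grayscale faces;
# the autoencoder is then trained to reconstruct the clean `batch`.
batch = np.random.rand(32, 64 * 64)
noisy_batch = corrupt_input(batch, p=0.3)
```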

Linear Subspace Learning is a kind of linear projection method which assumes that the high-dimensional data are located in a low-dimensional manifold that is linearly or approximately linearly embedded in the ambient space [18–22]. Linear Subspace Learning is often used in pattern recognition and computer vision tasks. Iosifidis et al. [20] proposed an optimal class representation algorithm based on linear discriminant analysis, which increases discrimination between classes. F. Liu and X. Liu [23] proposed the Locality Enhanced Spectral Embedding (LESE) and a novel Spatially Smooth Spectral Regression (SSR) method for face recognition, which not only construct a good locality-preserving mapping but also make full use of the spatial locality information of the face image matrix. Tzimiropoulos et al. [21] proposed a subspace learning algorithm based on image gradient orientations, which has shown good performance in appearance-based object recognition. Zhang et al. [24] proposed a Linear Subspace Learning method that uses sparse coding to learn a dictionary, with the aim of fully exploiting different image components; both unsupervised and supervised criteria are proposed for learning the corresponding subspace. Cai et al. [22] proposed a spatially smooth subspace approach for face recognition which takes full account of the spatial correlation of face images and uses a Laplacian penalty to learn the corresponding spatially smooth subspace. In addition, some kernel-based technologies [25–28] have also been applied to face recognition, mainly to explore the nonlinear relationships between face images; when nonlinear information is contained in the dataset, kernel-based techniques exhibit good properties. These Linear Subspace Learning methods mainly fall into two categories, supervised and unsupervised. For supervised methods, sample labels are used in model training, provided that these labels have been manually marked. Although some supervised methods can learn good subspaces, the disadvantage is that the samples need to be labeled; so in practice, unsupervised methods are more commonly used. In this research, we focus on unsupervised methods and are committed to building a good subspace by unsupervised learning algorithms. In Linear Subspace Learning, the high-dimensional input data are mapped onto a low-dimensional space by linear projection to achieve dimensionality reduction; this is a common and critical processing module in pattern recognition. Preprocessing, feature selection, feature extraction, pooling operations, and so on are implicitly or explicitly attached to dimensionality reduction operations [29]. Furthermore, the discrimination process itself can be viewed as a dimensionality reduction operation in which high-dimensional input data are mapped onto low-dimensional class data (binary vectors consisting of 0s and 1s) [29].

Raw data in reality are often high-dimensional. High-dimensional data, on the one hand, increase the computational burden of a recognition system; on the other hand, they have a negative impact (arising from noise or outliers) on robust recognition tasks with limited training sample sets [29]. Importantly, raw data are often unlabeled. Therefore, this research focuses on algorithms based on Unsupervised Linear Subspace Learning. Different from deep learning algorithms, Unsupervised Linear Subspace Learning does not require the selection of complex hyperparameters. Different from keypoint-based methods, grids of HOG [17] or grids of pHOG are extracted over each face image, and all of them are collected to form the final descriptor.

However, the raw data and the “features” extracted from them, such as grids of HOG or grids of pHOG, are still high-dimensional, so it is necessary to further learn their linear subspace. More importantly, the data from the subspace are guaranteed to have the same dimension after Linear Subspace Learning is complete, which enables subsequent classifier training, such as Softmax or SVM. In this research, in order to best evaluate the performance of the subspace learning algorithm itself, we adopt the nearest neighbor (NN) rule as the classifier.
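As a concrete illustration of this evaluation protocol, a minimal nearest-neighbor classifier over an embedded subspace might be sketched as follows (the variable names and the use of Euclidean distance in the subspace are our assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def nn_classify(Y_train, labels, Y_test):
    """1-nearest-neighbor classification; columns of Y_* are embedded samples."""
    d = cdist(Y_test.T, Y_train.T)        # pairwise Euclidean distances in the subspace
    return labels[np.argmin(d, axis=1)]   # label of the closest training sample
```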

One of the most important tasks of Linear Subspace Learning is to construct the adjacency graph, which is used to describe the nearest neighbor relationships between samples. In order to calculate the dissimilarity, a sample is usually flattened into a row or column vector, which ignores the structural information of the sample. Then the $\ell_2$ metric (Euclidean distance) is often used to measure the dissimilarity between any two samples.

Face images taken from cameras often suffer from noise and geometric distortions [30, 31], and sometimes complex geometric distortion can occur during shooting, storage, and transmission. The most common geometric distortion is radial distortion, which includes barrel distortion and pincushion distortion. Some examples of face images suffering from noise, geometric distortions, slight translation, and rotation changes are shown in Figure 1.

In this research, we consider algorithms that are robust to a certain degree of noise, geometric distortion, slight translation, and slight rotation changes. We construct noise, geometric distortion, slight translation, and slight rotation variants of the AR and Extended Yale B face databases, and we verify the robustness of our proposed algorithms to a certain degree of these disturbances. See Section 4 for detailed information about these databases.

The main contributions of our proposed algorithm are as follows.

In order to reduce the computational complexity and enhance robustness, we propose and perform pooling operations at the “granularity” of the cell over each block. That is to say, we accumulate the histograms of all cells over a block and obtain one pHOG histogram for the block. Then, an improved EMD metric instead of the $\ell_2$ metric is adopted to compare any two pHOG histograms over corresponding blocks from two different face images. It can effectively deal with the quantization problem of rigid binning.

We attach great importance to the structural information of samples. In order to effectively preserve the structural information of the sample, each face image is divided into blocks with a specified size, and we propose the Block Structure LPP (BSLPP) algorithm based on the improved EMD metric, which overcomes the limitation of the original LPP.

We construct noise, geometric distortion, slight translation, and slight rotation variants of the AR and Extended Yale B face databases and verify the robustness of the algorithm against a certain degree of these disturbances.

The rest of this paper is organized as follows: we first review related work in Section 2. In Section 3, we present our improved EMD-based dissimilarity metric for Unsupervised Linear Subspace Learning. Experiments and results are reported in Section 4, followed by conclusions in Section 5.

2. Related Work

Earth Mover’s Distance (EMD) [32] is a metric proposed for vision problems, and it can measure the dissimilarity between two distributions. EMD has been successfully applied in image retrieval, and with EMD the quantitative measure of dissimilarity between any two samples is defined by the dissimilarity of two distributions, which correlates with human perception to some extent [32].

An intuitive explanation of EMD is as follows: given two distributions (normalized histograms), one is taken as the “supply”, with a mass of earth properly spread in space, and the other is regarded as the “demand”, a collection of holes. The solution is then the minimal work (cost) that must be done to fill the holes with earth [32]. The formula of EMD defined by Rubner et al. [32] is given as follows:

$$\mathrm{EMD}(P, Q) = \frac{\sum_{i,j} f_{ij} d_{ij}}{\sum_{i,j} f_{ij}} \tag{1}$$

subject to

$$f_{ij} \ge 0, \quad \sum_{j} f_{ij} \le p_i, \quad \sum_{i} f_{ij} \le q_j, \quad \sum_{i,j} f_{ij} = \min\Big(\sum_{i} p_i, \sum_{j} q_j\Big), \tag{2}$$

where $P = (p_1, \ldots, p_n)$ and $Q = (q_1, \ldots, q_m)$ are the two histograms, $f_{ij}$ denotes the flow from the $i$-th “supply” bin to the $j$-th “demand” bin, and $d_{ij}$ is the ground distance between them.

The variables involved in Formula (2) are consistent with the ones in Formula (3). Compared with the bin-by-bin ($\ell_1$ or $\ell_2$) histogram matching technique, EMD (the Cross-Bin Dissimilarity Measure, as shown in Figure 2(b)) can not only effectively deal with the quantization problem of rigid binning (the Bin-by-Bin Dissimilarity Measure, as shown in Figure 2(a)) but also demonstrate robustness to shape deformation.

We explain the results of the dissimilarity measure in Figure 2: in Figure 2(a), the $\ell_1$ or $\ell_2$ metric is adopted to measure the dissimilarity. For simplicity and intuitive display, we choose the $\ell_1$ metric, under which two histograms that differ only by a one-bin shift are maximally distant, since no mass in corresponding bins overlaps; in Figure 2(b), EMD is adopted to measure the dissimilarity (according to Formula (1)), and the same one-bin shift yields a small distance. So it is not difficult to see that the “Cross-Bin Dissimilarity Measure” can effectively deal with the quantization problem of rigid binning and correlates with human perception.
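The shifted-histogram example of Figure 2 can be reproduced numerically. The sketch below compares the bin-by-bin $L_1$ distance with SciPy's 1-D EMD (scipy.stats.wasserstein_distance, which applies to normalized histograms); the bin values are illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Two normalized 8-bin histograms that differ only by a one-bin shift.
h1 = np.array([0., 1., 0., 0., 0., 0., 0., 0.])
h2 = np.array([0., 0., 1., 0., 0., 0., 0., 0.])
bins = np.arange(len(h1))

l1 = np.abs(h1 - h2).sum()                       # bin-by-bin: 2.0, the maximum possible
emd = wasserstein_distance(bins, bins, h1, h2)   # cross-bin: 1.0, i.e., one bin of shift
print(l1, emd)
```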

However, the EMD metric can only be used for normalized histograms. More importantly, it suffers from a high computational burden, and the worst-case time complexity of the algorithm is exponential [33]. In order to avoid the limitations of EMD, Pele and Werman proposed an EMD variant [33]: an improved EMD-based dissimilarity measure with a thresholded ground distance. It is a metric for nonnormalized histograms and shows robustness to quantization, shape deformation, and occlusion. Furthermore, it is a linear time algorithm with $O(N)$ time complexity in the number of bins [33]. Pele and Werman’s EMD variant is given as follows [33]:

$$\widehat{\mathrm{EMD}}_{\alpha}(P, Q) = \Big(\min_{\{f_{ij}\}} \sum_{i,j} f_{ij} d_{ij}\Big) + \Big|\sum_{i} p_i - \sum_{j} q_j\Big| \cdot \alpha \max_{i,j} d_{ij}, \tag{3}$$

subject to the constraints of Formula (2), where $P$ and $Q$ are the two (nonnormalized) histograms and $F = \{f_{ij}\}$ is the flow, with each $f_{ij}$ denoting the amount of mass flowing from the $i$-th “supply” to the $j$-th “demand”. $d_{ij}$ represents the thresholded ground distance, which is set to zero for corresponding bins, one for adjacent bins, and two for all other bins, including the extra mass in the histogram [33]. The thresholded ground distance is just a thresholded modulo metric; see [33] for detailed definitions. Parameter $\alpha$ in Formula (3) controls the weight of the second term when the total masses of $P$ and $Q$ are not equal.
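The following sketch illustrates Formula (3) by solving the flow problem directly as a generic linear program; it is a didactic implementation under our own assumptions (SciPy's LP solver and a ground-distance threshold of 2), not the linear-time algorithm of [33]:

```python
import numpy as np
from scipy.optimize import linprog

def thresholded_ground_distance(n, t=2):
    """d(i, j) = min(|i - j|, t): 0 for the same bin, 1 for adjacent bins, t otherwise."""
    idx = np.arange(n)
    return np.minimum(np.abs(idx[:, None] - idx[None, :]), t).astype(float)

def emd_hat(p, q, d, alpha=1.0):
    """Pele-Werman EMD variant (Formula (3)) for nonnormalized histograms, as an LP."""
    n, m = len(p), len(q)
    c = d.ravel()                       # objective: sum_ij f_ij * d_ij
    A_ub = np.zeros((n + m, n * m))     # supply rows: sum_j f_ij <= p_i
    for i in range(n):
        A_ub[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                  # demand rows: sum_i f_ij <= q_j
        A_ub[n + j, j::m] = 1.0
    b_ub = np.concatenate([p, q])
    A_eq = np.ones((1, n * m))          # total flow equals the smaller total mass
    b_eq = [min(p.sum(), q.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    # Second term of Formula (3): penalty for the unmatched mass.
    return res.fun + alpha * d.max() * abs(p.sum() - q.sum())

p = np.array([3., 0., 1., 0.])
q = np.array([0., 2., 0., 1.])
print(emd_hat(p, q, thresholded_ground_distance(4)))
```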

The improved EMD metric is a metric for nonnormalized histograms. So, in order to measure the dissimilarity between two images with the improved EMD algorithm, we first obtain corresponding histograms of the images, and one option is the Histogram of Oriented Gradients (HOG). HOG features [17] possess a certain degree of invariance to local geometric and photometric deformations, and the local shape of objects in an image can be characterized by capturing edge or gradient structure [17]. Dalal and Triggs [17] applied HOG descriptors to human detection, where they performed much better than other feature sets. They also explored the influence of fine-scale gradients, orientation binning, spatial binning, local contrast normalization, and so on, finally obtaining HOG descriptors for robust visual object recognition. Zhu et al. [34] adopted a cascade of histograms of oriented gradients for fast human detection. They used the AdaBoost algorithm to select the best blocks and then built a rejector-based cascade, which is not only a near real-time human detection method but also performs well in terms of accuracy. Freeman and Roth [35] presented histograms of local orientation for hand gesture recognition. Newell and Griffin [36] extended HOG and proposed multiscale histogram of oriented gradient descriptors for robust character recognition. Monzo et al. [37] compared the novel face recognition algorithm HOG-EBGM with GABOR-EBGM; the experiments showed that HOG-EBGM was more robust to illumination and rotation of images. Déniz et al. [38] employed HOG features for face recognition: they first normalized the face images and then acquired the HOG descriptors using a regular grid, and they also implemented a fusion strategy to combine information from patches of different sizes.

The main process of extracting the HOG features is illustrated in Figure 3.
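As a rough code sketch of this pipeline combined with the block-level pooling introduced later (Section 3.2), the following NumPy function computes one pooled orientation histogram per block; the block size, the unsigned 0–180° binning, and the simple gradient filter are illustrative assumptions:

```python
import numpy as np

def phog_blocks(img, block=(16, 16), bins=12):
    """Pooled HOG sketch: summing the cell histograms within a block is equivalent
    to one magnitude-weighted orientation histogram over the whole block."""
    img = img.astype(float)
    gy, gx = np.gradient(img)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned gradient orientations
    bh, bw = block
    hists = []
    for r in range(0, img.shape[0] - bh + 1, bh):
        for c in range(0, img.shape[1] - bw + 1, bw):
            a = ang[r:r + bh, c:c + bw].ravel()
            m = mag[r:r + bh, c:c + bw].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            hists.append(hist)                     # one pooled 12-bin histogram per block
    return np.array(hists)                         # shape: (num_blocks, bins)

face = np.random.rand(64, 64)                      # stand-in for a grayscale face image
print(phog_blocks(face).shape)                     # (16, 12) with 16x16 blocks
```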

The dimensionality of “features” is always high and contains redundant information (e.g., noise or outliers). Therefore, many features are not necessary and we aim to extract a small number of good features. Linear Subspace Learning [29, 39, 40] is one of the most powerful tools to perform dimensionality reduction. According to whether the labeled samples are used in training process, Linear Subspace Learning can be divided into three categories: the first category is Unsupervised Linear Subspace Learning [41], where no labeled samples are used; the second one is Semisupervised Linear Subspace Learning [42], where part of labeled samples are used; the last one is Supervised Linear Subspace Learning [41, 43] where all labeled samples are used.

The most typical unsupervised, semisupervised, and supervised algorithms in face recognition are Locality Preserving Projections (LPP) [18], Semisupervised Discriminant Analysis (SDA) [44], and Locality Sensitive Discriminant Analysis (LSDA) [45], respectively. However, raw data are often unlabeled, so in this research we focus on algorithms based on Unsupervised Linear Subspace Learning. We therefore adopt the typical Unsupervised Linear Subspace Learning method, LPP [18], to reduce the dimensionality of the “features” of face images. More importantly, in order to make better use of the structural information of face images, we propose a novel algorithm named Block Structure LPP (BSLPP), which we also use to reduce the dimensionality of the “features” of face images.

The adjacency graph building method plays an important role in LPP and BSLPP. We adopt a dissimilarity metric based on the improved EMD rather than the $\ell_2$ metric (Euclidean metric) to conduct Unsupervised Linear Subspace Learning, where we expect to achieve better recognition rates and robustness to illumination, occlusion, noise, geometric distortion, and other disturbances.

3. Unsupervised Linear Subspace Learning Based on the Improved EMD

In this section, we describe our improved EMD-based dissimilarity metric for Unsupervised Linear Subspace Learning. First of all, we describe the Locality Preserving Projections (LPP) algorithm. Then, we elaborate our first algorithm (Algorithm 1): the improved EMD metric for LPP. Finally, we further introduce our second algorithm (Algorithm 2): the improved EMD metric for BSLPP.

Input: the sample set $X$ with $n$ samples, parameter $\alpha$, block size $b$, pHOG bins, nearest neighbors parameter $k$
Output: adjacency graph $G$, weight matrix $W$, transformation matrix $A$, eigenvalues $\lambda$, and subspace $y$
While $i \le n$
Extract the HOG histogram over each block of face image $x_i$
Carry out the pooling operation over each block and then get the pHOG histogram
Obtain the grids-of-pHOG vector $P_i$ for face image $x_i$ and the grids-of-pHOG vectors $P_j$ for the rest of the face images, $j \ne i$
Compute the dissimilarity between $P_i$ and $P_j$ by Equations (8) and (9)
Obtain the $k$ nearest neighbors of face image $x_i$: $N_k(x_i)$
EndWhile
Build the adjacency graph $G$ and calculate the corresponding weight matrix $W$ by Equation (10)
Begin // compute the projection
Get the diagonal matrix $D$
Solve the generalized eigenvector problem of Equation (11) on the sample set
Get the eigenvectors $a_0, a_1, \ldots, a_{d-1}$ with respect to eigenvalues $\lambda_0 < \lambda_1 < \cdots < \lambda_{d-1}$
End // compute the projection
Obtain the transformation matrix $A = [a_0, a_1, \ldots, a_{d-1}]$
Obtain the subspace $y$ for the sample set by Equation (13)
Perform face recognition by the classifier
Input: the sample set $X$ with $n$ samples, parameter $\alpha$, sub-adjacency graph weight parameter $\beta$, block size $b$, pHOG bins, nearest neighbors parameter $k$
Output: adjacency graph $G$, weight matrix $W$, transformation matrix $A$, eigenvalues $\lambda$, and subspace $y$
While $m \le M$
While $i \le n$
Extract the HOG histogram over the $m$-th block of face image $x_i$
Carry out the pooling operation over the block and then get the pHOG histogram
Obtain the pHOG vector $H_i^{(m)}$ over the $m$-th block of face image $x_i$ and the pHOG vectors $H_j^{(m)}$ over the $m$-th blocks of
the rest of the face images, $j \ne i$
Compute the dissimilarity between $H_i^{(m)}$ and $H_j^{(m)}$ by Equations (3) and (2)
Obtain the $k$ nearest neighbors for the $m$-th block of face image $x_i$
EndWhile
Obtain the sub-adjacency graph $G^{(m)}$ and the corresponding weight matrix $W^{(m)}$
EndWhile
Merge these sub-adjacency graphs over blocks by Equation (15)
Build the adjacency graph $G$ and calculate the corresponding weight matrix $W$ by Equation (16)
Begin // compute the projection
Get the diagonal matrix $D$
Solve the generalized eigenvector problem of Equation (11) on the sample set
Get the eigenvectors $a_0, a_1, \ldots, a_{d-1}$ with respect to eigenvalues $\lambda_0 < \lambda_1 < \cdots < \lambda_{d-1}$
End // compute the projection
Obtain the transformation matrix $A = [a_0, a_1, \ldots, a_{d-1}]$
Obtain the subspace $y$ for the sample set by Equation (13)
Perform face recognition by the classifier
3.1. Locality Preserving Projections (LPP)

Locality Preserving Projections (LPP) is a linear dimensionality reduction method that falls into the graph embedding framework [46–48]. The adjacency graph building method [18, 44–48] plays an important role in the performance of LPP. The detailed steps of LPP are as follows:

(a) Use the $k$-neighborhoods to build the adjacency graph $G$: nodes $x_i$ and $x_j$ will be connected if one of the two nodes is among the $k$ nearest neighbors of the other one, and the edge value is set to be 1; otherwise 0.

(b) Choose the weights. The two commonly used methods are the heat kernel and the simple-minded weighting [18]. We apply the K-nearest neighbor (KNN) rule to build the adjacency graph, which can well represent the local geometrical structure of the data manifold. Let $N_k(x_i)$ be the set of the $k$-nearest neighbors of $x_i$. We choose the simple-minded weight, so the adjacency graph and the corresponding weight matrix are defined as

$$W_{ij} = \begin{cases} 1, & \text{if } x_j \in N_k(x_i) \text{ or } x_i \in N_k(x_j), \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

(c) Compute the projection. We solve the following generalized eigenvector problem to get the eigenvectors in accordance with the eigenvalues:

$$XLX^{T}a = \lambda XDX^{T}a, \tag{5}$$

where $X = [x_1, x_2, \ldots, x_n]$ and $n$ denotes the number of samples; $L = D - W$ is the Laplacian matrix, and $D$ is a diagonal matrix whose entries are the row (or column, since $W$ is symmetric) sums of the sparse symmetric weight matrix $W$ [18], that is,

$$D_{ii} = \sum_{j} W_{ij}. \tag{6}$$

And $a_0, a_1, \ldots, a_{d-1}$ are the eigenvectors with respect to eigenvalues $\lambda_0 < \lambda_1 < \cdots < \lambda_{d-1}$.

(d) LPP embedding: $A = [a_0, a_1, \ldots, a_{d-1}]$ is the transformation matrix, and the original samples can be embedded into the $d$-dimensional subspace through the following embedding:

$$x_i \rightarrow y_i = A^{T}x_i. \tag{7}$$
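To make the above steps concrete, a minimal LPP sketch (simple-minded weights, $\ell_2$ kNN graph, and the generalized eigenproblem of step (c)) might look as follows; the small ridge term added for numerical stability is our own implementation detail:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, k=5, d=10):
    """Minimal LPP sketch; X holds one sample per column."""
    n = X.shape[1]
    dist = cdist(X.T, X.T)                        # pairwise Euclidean distances
    knn = np.argsort(dist, axis=1)[:, 1:k + 1]    # k nearest neighbors, excluding self
    W = np.zeros((n, n))
    for i in range(n):
        W[i, knn[i]] = 1.0
    W = np.maximum(W, W.T)        # connect if either node is among the other's kNN
    D = np.diag(W.sum(axis=1))
    L = D - W
    M1 = X @ L @ X.T
    M2 = X @ D @ X.T + 1e-6 * np.eye(X.shape[0])  # small ridge for stability
    vals, vecs = eigh(M1, M2)     # generalized problem, eigenvalues in ascending order
    return vecs[:, :d]            # transformation matrix A = [a_0, ..., a_{d-1}]

X = np.random.rand(100, 60)       # 100-dimensional features, 60 samples
A = lpp(X)
Y = A.T @ X                       # 10-dimensional embedded subspace
```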

3.2. The Improved EMD Metric for LPP

The original Linear Subspace Learning method adopts the $\ell_2$ metric to calculate the dissimilarity between two samples. However, measuring the dissimilarity between two nonnormalized histograms (such as HOG histograms) with the $\ell_2$ metric may suffer from the quantization problem of rigid binning, while the improved EMD metric calculates a dissimilarity between two nonnormalized histograms that correlates with human perception and tolerates quantization, distortion, occlusion, and other disturbances. The improved EMD metric was briefly introduced in Section 2. In order to preserve the structural information of samples, we divide each face image into blocks with a specified size and extract a Histogram of Oriented Gradients over each block. In order to reduce the computational complexity and improve robustness, we perform pooling operations at the “granularity” of the cell over each block. The main process of extracting the pHOG features is illustrated in Figure 4. The detailed steps of our first algorithm (Algorithm 1) are given as follows:

(a) Calculate the dissimilarity [17, 33]. The improved EMD metric is a linear time histogram metric with a low computational cost. We use this metric instead of the original EMD to calculate the dissimilarity of the pHOG histograms (vectors) over blocks. Each face image is divided into $M$ blocks, and a pooled histogram of oriented gradients (pHOG) with 12 bins is obtained over each block. We compare any two pHOG histograms over corresponding blocks from two different face images by the improved EMD metric, and the sum of these dissimilarities is taken as the dissimilarity between the two face images. Then we use this dissimilarity measure to obtain the $k$-nearest neighbors of each face image to build the adjacency graph $G$. So the final dissimilarity measure metric is as follows:

$$D(X_a, X_b) = \sum_{m=1}^{M} \widehat{\mathrm{EMD}}_{\alpha}\big(H_a^{(m)}, H_b^{(m)}\big) = \sum_{m=1}^{M} \Big(\min_{\{f_{ij}^{(m)}\}} \sum_{i,j} f_{ij}^{(m)} d_{ij} + \Big|\sum_{i} H_{a,i}^{(m)} - \sum_{j} H_{b,j}^{(m)}\Big|\,\alpha \max_{i,j} d_{ij}\Big) \tag{8}$$

subject to, for each block $m$,

$$f_{ij}^{(m)} \ge 0, \quad \sum_{j} f_{ij}^{(m)} \le H_{a,i}^{(m)}, \quad \sum_{i} f_{ij}^{(m)} \le H_{b,j}^{(m)}, \quad \sum_{i,j} f_{ij}^{(m)} = \min\Big(\sum_{i} H_{a,i}^{(m)}, \sum_{j} H_{b,j}^{(m)}\Big). \tag{9}$$

In the above, $D(X_a, X_b)$ denotes the dissimilarity of face images $X_a$ and $X_b$, where $H_a^{(m)}$ denotes the pHOG histogram over the $m$-th block of $X_a$ and, similarly, $H_b^{(m)}$ denotes the pHOG histogram over the $m$-th block of $X_b$. $f_{ij}^{(m)}$ denotes the flow for the pHOG histograms over the $m$-th block: the amount transported from the $i$-th bin (supply) to the $j$-th bin (demand) is represented by $f_{ij}^{(m)}$. $d_{ij}$ denotes the ground distance from the $i$-th bin to the $j$-th bin. According to Pele and Werman's EMD variant [33], $\widehat{\mathrm{EMD}}_{\alpha}$ is a metric when $\alpha \ge 0.5$ and the ground distance is also a metric. $\alpha$ is usually set to 1, and we adopt the same setting in this research. The detailed process of calculating the dissimilarity between any two face images of the AR face database is shown in Figure 5.
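A sketch of step (a) in code: the per-block dissimilarities are summed as in Formula (8), and the resulting kNN graph yields the 0/1 weights of Formula (10). Here block_dist is assumed to be an improved-EMD routine (such as the emd_hat sketch in Section 2) and phogs a list of per-image (num_blocks × bins) pHOG arrays:

```python
import numpy as np

def image_dissimilarity(Ha, Hb, block_dist):
    """Formula (8): sum the improved-EMD dissimilarities of corresponding blocks."""
    return sum(block_dist(ha, hb) for ha, hb in zip(Ha, Hb))

def knn_adjacency(phogs, block_dist, k=5):
    """Build the 0/1 weight matrix of Formula (10) from pairwise dissimilarities."""
    n = len(phogs)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = image_dissimilarity(phogs[i], phogs[j], block_dist)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # k nearest neighbors, excluding self
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)              # connect if either node is a kNN of the other
```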

(b) Choose the weights. We use the improved EMD metric (as described by (8) and (9)) to calculate the $k$-nearest neighbors. Let $N_k(X_i)$ be the set of the $k$-nearest neighbors of face image $X_i$ calculated by the improved EMD metric. The adjacency graph and the corresponding weight matrix are defined below:

$$W_{ij} = \begin{cases} 1, & \text{if } X_j \in N_k(X_i) \text{ or } X_i \in N_k(X_j), \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

(c) Compute the projection. Solve the following generalized eigenvector problem to obtain the eigenvectors in accordance with the eigenvalues:

$$XLX^{T}a = \lambda XDX^{T}a, \tag{11}$$

where $X = [X_1, \ldots, X_n]$, $n$ denotes the number of training samples, $L = D - W$, and $D$ is a diagonal matrix whose entries are the row or column sums of $W$, as shown below:

$$D_{ii} = \sum_{j} W_{ij}. \tag{12}$$

$a_0, a_1, \ldots, a_{d-1}$ are the eigenvectors with respect to eigenvalues $\lambda_0 < \lambda_1 < \cdots < \lambda_{d-1}$.

(d) LPP embedding: $A = [a_0, a_1, \ldots, a_{d-1}]$ is the transformation matrix, and the original samples can be embedded into a $d$-dimensional subspace through the following embedding:

$$X_i \rightarrow y_i = A^{T}X_i. \tag{13}$$

3.3. The Improved EMD Metric for Block Structure LPP

In the original LPP algorithm, in order to calculate the dissimilarity, a sample is flattened into a row or column vector, which ignores the structural information of the sample; this information plays an important role in Linear Subspace Learning. In order to preserve the structural information of samples, we divide each face image into several blocks with a specified size and propose a novel algorithm named Block Structure LPP (BSLPP) based on the improved EMD metric. The main difference between Algorithms 1 and 2 is the way the adjacency graph is constructed. So in this section we only elaborate the detailed process of building the adjacency graph; the other steps of Algorithm 2 are consistent with Algorithm 1.

The process of building the affinity graph in our proposed algorithm includes three main steps.

We firstly calculate the dissimilarity between the $m$-th block of one face image and the corresponding $m$-th blocks of the rest of the face images with the improved EMD metric.

Secondly, we get the $k$-nearest neighbors for the corresponding block and build the sub-adjacency graph over the $m$-th blocks, denoted by $G^{(m)}$. Let $N_k(B_i^{(m)})$ be the set of the $k$-nearest neighbors of block $B_i^{(m)}$ calculated by the improved EMD metric, where $B_i^{(m)}$ denotes the $m$-th block of face image $X_i$. The sub-adjacency graph and the corresponding weight matrix over the $m$-th blocks of all face images are defined below:

$$W_{ij}^{(m)} = \begin{cases} 1, & \text{if } B_j^{(m)} \in N_k(B_i^{(m)}) \text{ or } B_i^{(m)} \in N_k(B_j^{(m)}), \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$

Finally, we obtain the final adjacency graph by merging these sub-adjacency graphs over blocks. The merge function is as follows:

$$S_{ij} = \sum_{m=1}^{M} \beta_m W_{ij}^{(m)}. \tag{15}$$

Among them, parameter $\beta_m$ denotes the weight of the sub-adjacency graph $G^{(m)}$. In this research, we simply set this parameter to be equal for every block, that is, $\beta_m = 1/M$. The final adjacency graph and the corresponding weight matrix are defined below:

$$W_{ij} = S_{ij} = \frac{1}{M}\sum_{m=1}^{M} W_{ij}^{(m)}. \tag{16}$$
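A minimal sketch of the merge step, under our uniform-weight reading of Formulas (15) and (16) (equal $\beta_m$ for every block); the exact merged form in the original derivation may differ:

```python
import numpy as np

def merge_subgraphs(sub_W):
    """Average the per-block sub-adjacency weight matrices with beta_m = 1/M."""
    sub_W = np.asarray(sub_W)       # shape: (M, n, n), one 0/1 matrix per block
    return sub_W.mean(axis=0)       # merged weight matrix W of Formula (16)

# Usage: three hypothetical 4x4 sub-adjacency matrices.
rng = np.random.default_rng(0)
subs = (rng.random((3, 4, 4)) > 0.5).astype(float)
W = merge_subgraphs(subs)
```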

When the final adjacency graph and the corresponding weight matrix are obtained, we can conduct the Block Structure LPP subspace learning. When features of face images are mapped onto a subspace, we will get the final “features” for each face image.

In this paper, we present a dissimilarity metric based on the improved EMD for Unsupervised Linear Subspace Learning. The dissimilarity between two samples is calculated by an improved EMD-based dissimilarity metric, which is a variant of the original EMD [33]. For simplicity, we refer to this dissimilarity metric as “the improved EMD metric” from now on. The whole process is described as follows.

Firstly, the $\ell_2$ metric suffers from the quantization problem of rigid binning. So, the improved EMD metric [33] instead of the $\ell_2$ metric is adopted to compare any two pHOG histograms over corresponding blocks from two different face images, and the sum of these dissimilarities is taken as the final dissimilarity between the two face images. The aim of the pooling operation is to reduce the computational complexity of calculating the improved EMD metric and to enhance robustness against occlusion, noise, and other disturbances.

Secondly, in order to preserve the structural information of samples, each face image is divided into blocks with a specified size, and then the pHOG histogram over each block is obtained. In one way (which we call Algorithm 1), an adjacency graph is constructed by comparing $k$-nearest neighbors among whole face images. In another way (which we call Algorithm 2), we first obtain the sub-adjacency graphs, denoted by $G^{(m)}$, over blocks and then get the final adjacency graph by merging these sub-adjacency graphs.

Finally, a small number of good “features” of face images are obtained by Unsupervised Linear Subspace Learning which includes Algorithm 1 (LPP based on the improved EMD metric, named LPP_IEMD) and Algorithm 2 (BSLPP based on the improved EMD metric, named BSLPP_IEMD). When “features” of face images are mapped onto a subspace, we will get the final “features” for each face image. Among them, the “features” include the grayscale face image, grids of pHOG, and grids of HOG. See Section 4 for more detailed information about these “features.”

4. Experiments and Results

In this section, firstly, we introduce the face databases used in this research as well as detailed experimental settings on these face databases, including training set, test set, and the choice of parameters. Secondly, we describe the experimental setups and the corresponding results for Unsupervised Linear Subspace Learning.

4.1. Face Databases
4.1.1. The AR Face Database

The AR face database has a total of over 4,000 frontal images of 126 individuals (males and females), with 26 images for each person, of which the first 13 and the last 13 were taken in two sessions separated by 14 days. All images have the same resolution. Partial occlusions by sunglasses and scarves, illumination variation, and facial expressions occur in this database. In order to verify the effectiveness of the proposed algorithms against a certain degree of noise, geometric distortion, slight translation, and slight rotation, we construct the noise, geometric distortion, slight translation, and slight rotation AR face databases.

In order to reduce the difficulty of introducing the noise, geometric distortion, slight translation, and slight rotation into the AR face database, we chose the first 15 males and the first 15 females, with the first 13 images (we do not consider the time factor) of each person, to construct the subAR database, giving a total of 390 images for our experiments. We add salt and pepper noise with a noise density of 0.02 to the AR face database. We use Adobe Photoshop CS6 to simulate the geometric distortions of the face images, including barrel distortion, pincushion distortion, and complex geometric distortion. We also use Adobe Photoshop CS6 to simulate the slight translation and slight rotation of face images. The 2nd, 6th, and 9th images of each person in our subAR face database are modified; the aim is to consider the fusion of the simulated interference factors (noise, geometric distortion, slight translation, and slight rotation) and the inherent interference factors (occlusions, illumination, and facial expressions).

For the noise AR face database, we add salt and pepper noise with noise density of 0.02 to the 2nd, 6th, and 9th images of each person on our subAR face database.

For the geometric distortion AR face database, we add three variants, barrel distortion, pincushion distortion, and the complex geometric distortion, respectively, to the 2nd, 6th, and 9th images of each person on our subAR face database.

For the slight translation AR face database, we add slight translation to the 2nd, 6th, and 9th images of each person on our subAR face database.

For the slight rotation AR face database, we add slight rotation to the 2nd, 6th, and 9th images of each person on our subAR face database.

The specific details for our constructing subAR face database are shown in Figure 6.

The specific experimental settings for subAR and the noise, geometric distortion, slight translation, and slight rotation AR face databases, including training set, test set, and the choice of parameters, are as follows.

Five groups (G4/P9, …, G8/P5) of different training and testing sets are selected, each group is run 20 times, and the average of the 20 trials is taken as the recognition rate. G$g$/P$p$ denotes $g$ images of each person for training and $p$ images for testing, where $g + p = 13$. The parameters of this experiment are as follows: each face image is divided into blocks of a specified pixel size, and a pHOG with 12 bins is extracted over each block.

4.1.2. The Extended Yale B Face Database

The second face database used in the experiment is the Extended Yale B. The Extended Yale B face database has 2,414 face images in total, containing 38 individuals with 64 images of each person under 64 illumination conditions; all images have the same resolution. In order to reduce the difficulty of introducing the noise, geometric distortion, slight translation, and slight rotation into the Extended Yale B face database, we chose the first 30 persons with 16 images of each person (the first one in every four of the 64 face images of each person) to construct the sub Extended Yale B database, giving a total of 480 images for our experiments. We add salt and pepper noise with a noise density of 0.02 to the Extended Yale B face database. We use Photoshop to simulate the geometric distortions of the face images, including barrel distortion, pincushion distortion, and complex geometric distortion. We also use Photoshop to simulate the slight translation and slight rotation of face images.

For the noise Extended Yale B face database, we add the noise (salt and pepper noise with noise density of 0.02) to 2nd, 6th, 10th, and 14th images of each person on our sub Extended Yale B database.

For the geometric distortion Extended Yale B face database, we add three variants, barrel distortion, pincushion distortion, and the complex geometric distortion, respectively, to the 2nd, 6th, 10th, and 14th images of each person on our sub Extended Yale B database.

For the slight translation Extended Yale B face database, we add slight translation to the 2nd, 6th, 10th, and 14th images of each person on our sub Extended Yale B database.

For the slight rotation Extended Yale B face database, we add slight rotation to the 2nd, 6th, 10th, and 14th images of each person on our sub Extended Yale B database.

The specific details of our constructed sub Extended Yale B face database are shown in Figure 7.

The specific experimental settings for the sub Extended Yale B and the noise, geometric distortion, slight translation, and slight rotation Extended Yale B face database, including training set, test set, and the choice of parameters, are as follows.

Five groups (G6/P10, …, G10/P6) of different training and testing sets are selected, each group is run 20 times, and the average of the 20 trials is taken as the recognition rate. G$g$/P$p$ denotes $g$ images of each person for training and $p$ images for testing, where $g + p = 16$. The parameters of this experiment are the same as above: each face image is divided into blocks of a specified pixel size, and a pHOG with 12 bins is extracted over each block.

4.2. Comparison of Experiments with Other Approaches

Before conducting the Unsupervised Linear Subspace Learning, we conducted several comparison experiments in order to assess the effectiveness of our algorithms. In addition to the algorithms proposed in this paper, those involved in the comparative experiments include deep learning based approaches, keypoint based approaches, and kernel based approaches. The deep learning based approaches include Stacked Denoising Autoencoders (SDAE) [7, 8, 49], LeNet-5 [10, 49], and UFLDL_CNN [11]. For SDAE and LeNet-5, we use the same settings as in [49]. For UFLDL_CNN, at the convolutional layer, 400 convolution kernels are learned by the unsupervised Stacked Autoencoder (SAE) algorithm, and then at the pooling layer we choose a pool size of 5 to conduct the pooling operation. The keypoint based approach we adopt is the SIFT [15] algorithm, with the same specific parameters as in [15]. Kernel PCA [50] is the kernel based approach. Among them, LeNet-5 [10, 49] is a supervised algorithm, while UFLDL_CNN [11] is unsupervised because its 400 convolution kernels are learned by SAE.

Firstly, we conducted several comparative experiments to assess the effectiveness of our algorithms on the subAR face database. We randomly selected 7 images of each face for training and the rest for testing, and we conducted the comparative experiments on a PC with an Intel(R) Core(TM) i7-4790 3.60 GHz CPU and 8 GB memory, running Windows 8. We recorded the corresponding “cputime” (including the training and testing time) for each approach. The final results of the experiments are shown in Table 1.

From Table 1, we can see that our proposed algorithm obtains higher accuracy while consuming relatively less cputime. Although the SIFT approach achieves high accuracy, it consumes almost the second-longest cputime. The original face images were resized to two different resolutions for UFLDL_CNN1 and UFLDL_CNN2, respectively. The reason why the cputime is longer than 437 seconds is that we need to use SAE to learn about 400 convolution kernels, and the same holds for UFLDL_CNN2. We also point out that our algorithms learn a subspace, which means we obtain a relatively small number of good features, and therefore our algorithms spend less cputime when unseen samples need to be tested. This is essentially an advantage of subspace learning methods over deep learning based and keypoint based ones.

Secondly, we conducted several comparative experiments to verify the effectiveness of our algorithms on the sub Extended Yale B face database. We randomly selected 7 images of each face for training, and the rest for testing. Other configurations are similar to the comparison experiments on subAR face database. The final results of comparative experiments are shown in Table 2.

From Table 2, we can see that our proposed algorithm obtains higher accuracy while consuming relatively less cputime. However, the SIFT approach achieves lower accuracy and consumes the longest cputime. LeNet-5 and Kernel PCA obtained low accuracy, which may reveal that the supervised LeNet-5 and the kernel based Kernel PCA approaches do not perform well when faced with heavy illumination variation.

4.3. Experiments and Results on Unsupervised Linear Subspace Learning

In this subsection, we will further demonstrate a certain degree of robustness of our proposed algorithms against partial occlusions, illumination variation, noise, geometric distortion, slight translation, and slight rotation on our constructed face databases compared with the original one.

4.3.1. Experiments and Results on the AR Face Database

First of all, we report the recognition rates on the subAR face database. In this experiment, we conduct the Unsupervised Linear Subspace Learning over features including the grayscale face image, grids of pHOG, and grids of HOG, denoted by “F1”, “F2”, and “F3”, respectively. We show the effectiveness of the improved EMD metric, the pooled HOG operation, and the BSLPP. Then, compared with the experiments on the subAR face database, we obtain the experimental results on the noise, geometric distortion, slight translation, and slight rotation AR face databases.

The parameter “bins” plays an important role in the pooling operation at the “granularity” of the cell over each block, so we explore the impact of the number of “bins” on Algorithms 1 and 2 on our subAR face database. We vary the number of “bins” over a range of values, and the impact of the “bins” size on our subAR face database is shown in Figure 8.

From Figure 8, we can see that the recognition rates vary little with different numbers of “bins”. We hypothesize that it is the good performance of the improved EMD metric that leads to this stability. We selected a “bins” size of 12 in this experiment.

The recognition rates on the subAR face database are shown in Table 3 and Figure 9. In Table 3, we compare three different algorithms, namely, Baseline, LPP, and Algorithm 1 (LPP_IEMD), where Baseline represents the nearest neighbor algorithm over the original “features” space. In particular, Algorithm 1 (F2)_nonpooled means that nonpooled HOG with 192 bins (versus 12 bins for the pHOG) is adopted to measure the dissimilarity between two blocks for our Algorithm 1 over the original “F2” features.

From Table 3 and Figure 9, we can see that our Algorithm 1 achieves the highest recognition rates over “F1” and “F2” features (except for one group over “F2” features). The reported dimensionality is the one corresponding to the highest recognition rate over the 20 iterations of each group. This comparison experiment verifies the effectiveness of Algorithm 1. For brevity, we only compare three algorithms, Baseline, LPP, and our Algorithm 1, on the noise, geometric distortion, slight translation, and slight rotation AR face databases.

The experimental results on the noise AR face database are shown in Table 4 and Figure 10(a). As we can see from Table 4 and Figure 10(a), for the noise AR face database, our Algorithm 1 achieves the best results over “F1” features. As for “F2” features, our Algorithm 1 obtains the best results in some of the experiments. It is worth noting that, although our Algorithm 1 over “F2” features does not always achieve the best results, it can speed up the recognition of unseen samples thanks to the smaller dimensionality (a small number of good features) at only a slightly lower recognition rate.

The experimental results on barrel distortion AR face database are shown in Table 5 and Figure 10(b). As we can see from Table 5 and Figure 10(b), our Algorithm 1 achieves the best results over both “F1” and “F2” features. So, it shows that our Algorithm 1 is robust to the barrel distortion (the most common geometric distortion) to a certain degree.

The experimental results on complex geometric distortion AR face database are shown in Table 6 and Figure 10(c). As we can see from Table 6 and Figure 10(c), our Algorithm 1 achieves the best results over “F1” features. As for “F2” features, our Algorithm 1 achieves a lower recognition rate than the Baseline over “F2” features. For the complex geometric distortion, our Algorithm 1 over “F2” features may lose some discriminative information which may affect recognition rates to some extent. However, the advantage of our Algorithm 1 is that it can speed up face recognition with lower dimensionality.

The experimental results on pincushion distortion AR face database are shown in Table 7 and Figure 10(d). As we can see from Table 7 and Figure 10(d), our Algorithm 1 achieves the best results over “F1” features. As for “F2” features, our Algorithm 1 achieves a lower recognition rate than the Baseline over “F2” features. The advantage of our Algorithm 1 is that it can speed up face recognition with lower dimensionality, while the disadvantage is that our Algorithm 1 loses some discriminative information which can improve the recognition performance.

The experimental results on the slight translation AR face database are shown in Table 8 and Figure 10(e). As we can see from Table 8 and Figure 10(e), our Algorithm 1 achieves the best results over both “F1” and “F2” features. So, it shows that our Algorithm 1 is robust to slight translation to a certain degree.

The experimental results on slight rotation AR face database are shown in Table 9 and Figure 10(f). As we can see from Table 9 and Figure 10(f), our Algorithm 1 achieves the best results over both “F1” and “F2” features. So, it shows that our Algorithm 1 is robust to the slight rotation to a certain degree.

As shown in Tables 4, 6, and 7, our Algorithm 1 does not have an obvious advantage over “F2” features. “F3” features are the more robust ones, so in order to better learn the linear subspace, we adopt Algorithm 2 (BSLPP_IEMD) to conduct the Unsupervised Linear Subspace Learning over “F3” features. The experimental results on subAR are shown in Table 10 and Figure 11. In Table 10, we compare three different algorithms, namely, Baseline, LPP, and our Algorithm 2, where Baseline represents the nearest neighbor algorithm over the original “F3” feature space. In particular, Algorithm 2 (F3)_$\ell_2$ means that the $\ell_2$ metric is adopted to measure the dissimilarity between two blocks for our Algorithm 2 over the original “F3” features. Algorithm 2 (F3)_nonpooled means that nonpooled HOG with 192 bins (versus 12 bins for the pHOG) is adopted to measure the dissimilarity between two blocks for our Algorithm 2 over the original “F3” features.

As one can see from Table 10 and Figure 11, Algorithm 2 achieves the highest recognition rates over “F3” features. This comparison experiment verifies the effectiveness of Algorithm 2. For brevity, we only compare four algorithms, Baseline, LPP, Algorithm 2 (F3)_$\ell_2$, and Algorithm 2, on the noise, geometric distortion, slight translation, and slight rotation AR face databases.

As we can see from Tables 11–15 and Figure 12, our Algorithm 2 achieves the best results over “F3” features. It shows that our Algorithm 2 is robust to noise, geometric distortion, slight translation, and slight rotation to a certain degree, which validates the effectiveness of our algorithm. More importantly, our Algorithm 2, with much lower dimensionality, provides an effective guarantee for face recognition in terms of both speed and accuracy.

The experimental results on noise AR face database over “F3” features are shown in Table 11 and Figure 12(a).

The experimental results on barrel distortion AR face database over “F3” features are shown in Table 12 and Figure 12(b).

The experimental results on the complex geometric distortion AR face database over “F3” features are shown in Table 13 and Figure 12(c).

The experimental results on pincushion distortion AR face database over “F3” features are shown in Table 14 and Figure 12(d).

The experimental results on the slight translation and slight rotation AR face databases over “F3” features are shown in Table 15 and Figures 12(e) and 12(f).

4.3.2. Experiments and Results on the Extended Yale B

Similar to the experiments on the AR face database, we obtain the experimental results on the sub, noise, geometric distortion, slight translation, and slight rotation Extended Yale B face databases. The recognition rates are shown in Tables 16–21 and Figure 13. In Tables 16–21, we compare three different algorithms, namely, Baseline, LPP, and our Algorithm 1.

As we can see from Tables 16–21 and Figure 13, Algorithm 1 achieves the best results over “F1” features. It is worth noting that Algorithm 1 over “F1” features is even better than over “F2” features. As for “F2” features, Algorithm 1 achieves the best results for part of the groups on the noise, complex geometric distortion, slight translation, and slight rotation Extended Yale B face databases. So our conclusion is that Algorithm 1 over “F2” features is less effective than over “F1” features when suffering from heavily varying illumination.

The recognition rates on sub Extended Yale B face database are shown in Table 16 and Figure 13(a). The recognition rates on noise Extended Yale B face database are shown in Table 17 and Figure 13(b). The recognition rates on barrel distortion Extended Yale B face database are shown in Table 18 and Figure 13(c). The recognition rates on complex geometric distortion Extended Yale B face database are shown in Table 19 and Figure 13(d). The recognition rates on pincushion distortion Extended Yale B face database are shown in Table 20 and Figure 13(e). The recognition rates on slight translation and rotation Extended Yale B face databases are shown in Table 21 and Figures 13(f) and 13(g).

In order to better deal with the problem of heavily varying illumination, we adopt Algorithm 2 to conduct the Unsupervised Linear Subspace Learning over “F3” features. The experimental results on the sub, noise, geometric distortion, slight translation, and slight rotation Extended Yale B face databases are shown in Tables 22–27 and Figure 14, and the specific setting of the experiment is the same as that of Tables 16–21.

As we can see from Tables 22–27 and Figures 14(a)–14(g), Algorithm 2 achieves the best results on the slight translation and slight rotation Extended Yale B databases over “F3” features, and the effectiveness of our algorithm against slight translation and rotation is well validated. Algorithm 2 (F3)_$\ell_2$ achieves the best results for part of the groups on the sub, noise, complex geometric distortion, and pincushion distortion Extended Yale B face databases. The Baseline method achieves the best results on the sub (except for one group over “F3” features) and barrel distortion Extended Yale B databases. In spite of slightly lower recognition results on the sub and barrel distortion Extended Yale B databases, Algorithm 2, with much lower dimensionality, provides an effective guarantee for face recognition in terms of both speed and accuracy.

5. Conclusions and Future Work

In this research, in order to reduce the computational complexity and improve robustness when calculating the improved EMD metric between numerous blocks under various disturbances, we first carry out the pooling operation over each block to extract the pHOG features and then adopt the improved EMD metric instead of the $\ell_2$ metric as the dissimilarity measure to conduct Unsupervised Linear Subspace Learning, which demonstrates a certain degree of robustness against partial occlusions, illumination variation, noise, geometric distortion, slight translation, slight rotation, and other disturbances. The experimental results on well-known databases confirm the effectiveness of our Unsupervised Linear Subspace Learning algorithms: Algorithm 1 (LPP_IEMD) and Algorithm 2 (BSLPP_IEMD).

Although our proposed algorithms achieve higher performance and demonstrate good robustness against some disturbances, there are still some limitations:

Although the improved EMD metric is a linear time algorithm with $O(N)$ time complexity, the training time of the model is longer than with the $\ell_2$ metric. However, the model training process is offline, which still makes it acceptable.

Heavy illumination variation is really a challenging problem. Unfortunately, our algorithms do not show a distinct advantage as they suffer from heavy illumination variation and this is an issue for future investigation.

Our future work will focus on how to measure the dissimilarity between samples more effectively, such as further refining the improved EMD metric, and on how to better represent the neighborhood relationships between samples beyond KNN. Determining the weights of the sub-adjacency graphs by an adaptive weight learning method is another concern. Finally, we will also be committed to extracting more robust features and making further improvements to the HOG algorithm.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the Science and Technology Developing Project of Jilin Province, China (Grant no. 20150204007GX), and the Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education.