Complexity

Volume 2018, Article ID 8917393, 24 pages

https://doi.org/10.1155/2018/8917393

## An Improved EMD-Based Dissimilarity Metric for Unsupervised Linear Subspace Learning

^{1}College of Computer Science and Technology, Jilin University, Changchun 130012, China

^{2}Department of Computing Science, University of Aberdeen, Aberdeen AB24 3UE, UK

^{3}College of Software, Jilin University, Changchun 130012, China

Correspondence should be addressed to Zhezhou Yu; yuzz@jlu.edu.cn

Received 4 July 2017; Revised 20 November 2017; Accepted 4 December 2017; Published 18 February 2018

Academic Editor: Danilo Comminiello

Copyright © 2018 Xiangchun Yu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We investigate a novel approach to robust face image feature extraction that uses Unsupervised Linear Subspace Learning to extract a small number of good features. Firstly, the face image is divided into blocks of a specified size, and we propose and extract the pooled Histogram of Oriented Gradient (pHOG) over each block. Secondly, an improved Earth Mover’s Distance (EMD) metric is adopted to measure the dissimilarity between the blocks of one face image and the corresponding blocks of the remaining face images. Thirdly, considering the limitations of the original Locality Preserving Projections (LPP), we propose the Block Structure LPP (BSLPP), which effectively preserves the structural information of face images. Finally, an adjacency graph is constructed, and a small number of good features of a face image are obtained by methods based on Unsupervised Linear Subspace Learning. A series of experiments has been conducted on several well-known face databases to evaluate the effectiveness of the proposed algorithm. In addition, we construct noisy, geometrically distorted, slightly translated, and slightly rotated versions of the AR and Extended Yale B face databases, and we verify the robustness of the proposed algorithm under a certain degree of these disturbances.

#### 1. Introduction

Although many sophisticated algorithms have been proposed, face recognition remains a challenging problem that is affected by many external factors, such as occlusion, illumination, noise, geometric distortion, translation, and rotation of face images. Recently, face recognition algorithms based on deep learning have achieved good performance [1–6]. The stacked autoencoder (SAE) [7] is an unsupervised neural network in which the input and target values are the same. In an SAE, the deepest hidden layer carries the features of interest. The input layer is connected to the deepest hidden layer by multiple encoding layers, and the deepest hidden layer is connected to the output layer by multiple decoding layers. The activation values of the deepest hidden layer nodes are the deep representation features, which can be fed to a classifier such as Softmax to perform classification. To obtain more robust features, random noise can be added to the input layer of the SAE; this method is called the Stacked Denoising Autoencoder (SDAE) [8]. In practice, the value of each input node is set to 0 with a certain probability, which encourages the network to extract more robust features. However, SAE and SDAE both use full connections between the input layer (or a hidden layer) and the next hidden layer, so a large number of parameters must be learned during training. Taking SDAE as an example, with 100 hidden nodes the number of network weights and biases grows in proportion to the number of pixels in the input image. To overcome this limitation of fully connected networks, Convolutional Neural Networks (CNN) [9, 10] were proposed, in which local connectivity effectively reduces the computational complexity of model training.
Another important advantage of a locally connected network is that it extracts local information from the input space, which is consistent with the mechanism of the visual cortex. The LeNet-5 [10] model is one of the most classic CNN models; its convolution kernels are learned during model training by backpropagation. Convolution kernels can also be learned in an unsupervised way, with stacked autoencoders [7, 8] used to extract kernels of the corresponding size. In UFLDL_CNN [11], convolution kernels are learned by an unsupervised SAE [7, 8] and then used to perform the convolution operations at the convolutional layer. However, in addition to learning a large number of network weights and biases, these methods require several hyperparameters to be chosen, for example, the sparsity penalty coefficient (as in the autoencoder algorithm [12, 13]) and the weight penalty coefficient (as in the regularization of deep neural networks) [14]. Furthermore, these hyperparameters need to be selected by cross validation, so these algorithms carry high complexity and computational cost [14]. On the other hand, algorithms based on keypoints, such as SIFT [15] and SURF [16], have demonstrated good performance in face recognition and are robust to scale changes, rotation, illumination, and other disturbances [15–17]. Their disadvantage is that they may require a large number of keypoints, and the number of comparisons between keypoints is even larger. This makes it difficult to perform face recognition or retrieval effectively on large-scale face image datasets.
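The parameter-count argument above can be made concrete with a small sketch. The 100 hidden nodes follow the example in the text; the 32×32 input size, the 5×5 kernel size, and both function names are illustrative assumptions and are not taken from the paper.

```python
def autoencoder_param_count(n_inputs: int, n_hidden: int) -> int:
    """Weights and biases of a fully connected autoencoder with one hidden layer."""
    encoder = n_inputs * n_hidden + n_hidden  # encoder weights plus hidden biases
    decoder = n_hidden * n_inputs + n_inputs  # decoder weights plus output biases
    return encoder + decoder


def conv_layer_param_count(kernel_h: int, kernel_w: int, n_kernels: int) -> int:
    """Weights and biases of one convolutional layer on a single-channel input."""
    return kernel_h * kernel_w * n_kernels + n_kernels


# Assumed sizes for illustration only: a 32x32 grayscale image, 5x5 kernels.
n_pixels = 32 * 32
print(autoencoder_param_count(n_pixels, 100))  # 205924
print(conv_layer_param_count(5, 5, 100))       # 2600
```

The fully connected count scales with the number of pixels, while the convolutional count depends only on the kernel size and the number of kernels, which is why local connectivity keeps the model small as images grow.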

Linear Subspace Learning is a family of linear projection methods that assume the high-dimensional data lie on a low-dimensional manifold that is linearly, or approximately linearly, embedded in the ambient space [18–22]. Linear Subspace Learning is often used in pattern recognition and computer vision tasks. Iosifidis et al. [20] proposed an optimal class representation algorithm based on linear discriminant analysis, which increased the discrimination between classes. F. Liu and X. Liu [23] proposed the Locality Enhanced Spectral Embedding (LESE) and the novel Spatially Smooth Spectral Regression (SSR) methods for face recognition, which not only construct a good locality preserving mapping but also make full use of the spatial locality information of the face image matrix. Tzimiropoulos et al. [21] proposed a subspace learning algorithm based on image gradient orientations, which has shown good performance on appearance-based object recognition. Zhang et al. [24] proposed a Linear Subspace Learning method that uses sparse coding to learn a dictionary, with the aim of fully exploiting different image components; both unsupervised and supervised criteria are proposed to learn the corresponding subspace. Cai et al. [22] proposed a spatially smooth subspace approach for face recognition, which takes full account of the spatial correlation of the face image and uses a Laplacian penalty to learn the corresponding spatially smooth subspace. In addition, some kernel-based techniques [25–28] have also been applied to face recognition, mainly to explore the nonlinear relationships between face images. When the dataset contains nonlinear information, kernel-based techniques exhibit good properties. These Linear Subspace Learning methods fall mainly into two categories: supervised and unsupervised. Supervised methods use sample labels in model training, which requires that the labels have been annotated manually.
Although supervised methods can sometimes learn good subspaces, they require labeled samples, so in practice unsupervised methods are more commonly used. In this research, we focus on unsupervised methods and aim to build a good subspace with unsupervised learning algorithms. In Linear Subspace Learning, the high-dimensional input data are mapped onto a low-dimensional space by a linear projection to achieve dimensionality reduction, which is a common and critical processing module in pattern recognition. Preprocessing, feature selection, feature extraction, pooling operations, and so on all involve dimensionality reduction, implicitly or explicitly [29]. Furthermore, the discrimination process itself can be viewed as a dimensionality reduction operation in which high-dimensional input data are mapped onto low-dimensional class data (a binary vector consisting of 0 and 1) [29].

In practice, raw data are often high-dimensional. On the one hand, high-dimensional data increase the computational burden of the recognition system; on the other hand, with limited training sample sets they harm robust recognition because of noise and outliers [29]. Importantly, raw data are often unlabeled. Therefore, this research focuses on algorithms based on Unsupervised Linear Subspace Learning. Unlike deep learning algorithms, Unsupervised Linear Subspace Learning does not require a complex hyperparameter selection process. Unlike keypoint-based methods, our approach extracts grids of HOG [17] or grids of pHOG over each face image and collects all of them to form the final descriptors.

However, the raw data and the “features” extracted from them, such as grids of HOG or grids of pHOG, are still high-dimensional, so it is necessary to learn a linear subspace of them. More importantly, after Linear Subspace Learning the data in the subspace are guaranteed to have the same dimension, which enables the subsequent training of a classifier such as Softmax or SVM. In this research, in order to best evaluate the performance of the subspace learning algorithm itself, we adopt the nearest neighbor (NN) classifier.
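As a minimal sketch of this pipeline, the code below projects samples into a low-dimensional subspace with a linear map and then classifies a query with a 1-nearest-neighbor rule. The projection matrix `W`, the toy data, and the function names are hypothetical; in practice `W` would come from the subspace learning algorithm rather than being fixed by hand.

```python
import math


def project(x, W):
    """Linear projection y = W^T x, where W is given as d rows of k columns."""
    k = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(k)]


def nearest_neighbor_label(query, train_points, train_labels):
    """Return the label of the training point closest to `query` (Euclidean distance)."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(query, train_points[i]))
    return train_labels[best]


# Hypothetical projection from 3 to 2 dimensions that simply drops the last axis.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
train = [[0.0, 0.0, 5.0], [4.0, 4.0, -3.0]]
labels = ["subject_a", "subject_b"]

train_low = [project(x, W) for x in train]
query_low = project([3.5, 3.0, 9.0], W)
print(nearest_neighbor_label(query_low, train_low, labels))  # subject_b
```

Because the NN rule has no parameters of its own to tune, recognition accuracy mainly reflects the quality of the learned subspace.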

One of the most important tasks in Linear Subspace Learning is to construct the adjacency graph, which describes the nearest neighbor relationships between samples. To calculate the dissimilarity, each sample is usually flattened into a row or column vector, which ignores the structural information of the sample. The L2 metric (Euclidean distance) is then often used to measure the dissimilarity between any two samples.
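The conventional construction described above can be sketched as follows: each image is stacked into a vector, pairwise Euclidean distances are computed, and each sample is linked to its nearest neighbors. The choice of a symmetric k-nearest-neighbor rule, the toy 2×2 images, and the function names are assumptions made for illustration.

```python
import math


def flatten(image):
    """Stack the rows of a 2-D image into one vector; spatial structure is discarded."""
    return [pixel for row in image for pixel in row]


def knn_adjacency(samples, k):
    """Symmetric 0/1 adjacency matrix: i and j are linked if either is among
    the k nearest neighbors of the other under Euclidean distance."""
    n = len(samples)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(samples[i], samples[j]))
        for j in others[:k]:
            adj[i][j] = adj[j][i] = 1
    return adj


# Three toy 2x2 "images" (hypothetical data).
images = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[9, 9], [9, 9]],
]
vectors = [flatten(img) for img in images]
graph = knn_adjacency(vectors, k=1)
print(graph)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

Since `flatten` treats every pixel as an independent coordinate, two images that differ only by a small shift can end up far apart under this distance, which is the limitation the text points to.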

Face images taken by cameras often suffer from noise and geometric distortions [30, 31], and complex geometric distortion can sometimes occur during shooting, storage, and transmission. The most common geometric distortion is radial distortion, which includes barrel distortion and pincushion distortion. Some examples of face images suffering from noise, geometric distortion, slight translation, and slight rotation are shown in Figure 1.
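For intuition about radial distortion, the sketch below applies a common first-order radial model to a single point. This is a generic textbook model and not necessarily the one used to build the distorted databases in this paper; the coefficient `k1`, the normalized coordinates, and the function name are assumptions.

```python
def radial_distort(x, y, k1, cx=0.0, cy=0.0):
    """Map an undistorted point to its distorted position with a first-order
    radial model: the offset from the center (cx, cy) is scaled by 1 + k1 * r^2."""
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy
    scale = 1.0 + k1 * r2
    return cx + dx * scale, cy + dy * scale


# With this forward-mapping convention, k1 < 0 pulls points toward the center
# (barrel distortion) and k1 > 0 pushes them outward (pincushion distortion).
print(radial_distort(1.0, 0.0, k1=-0.1))
print(radial_distort(1.0, 0.0, k1=0.1))
```

The displacement grows with the squared distance from the center, so the image corners are affected far more than the central face region.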