Abstract

A novel facial expression recognition algorithm based on discriminant neighborhood preserving nonnegative tensor factorization (DNPNTF) and extreme learning machine (ELM) is proposed. A discriminant constraint is adopted according to manifold learning and graph embedding theory. The constraint exploits the spatial neighborhood structure and the predefined discriminant properties. The parts-based representations obtained by our algorithm vary smoothly along the geodesics of the data manifold and have good discriminant properties. To guarantee convergence, the projected gradient method is used for optimization. The features extracted by DNPNTF are then fed into ELM, a training method for single hidden layer feed-forward networks (SLFNs). Experimental results on the JAFFE database and the Cohn-Kanade database demonstrate that the proposed algorithm extracts effective features and achieves good performance in facial expression recognition.

1. Introduction

Facial expression recognition plays an important role in human-computer interaction: 55% of the information in face-to-face human communication is conveyed by facial expression [1]. Although many methods have been proposed, recognizing facial expressions remains challenging because facial expressions are complex, variable, and subtle.

One effective family of methods for facial expression recognition is the subspace-based algorithms [2–5]. They aim to project the samples into a lower dimensional space which preserves the needed information and discards the redundant information. Widely used subspace-based algorithms include principal component analysis (PCA) [2], linear discriminant analysis (LDA) [3], neighborhood preserving embedding (NPE) [4], singular value decomposition (SVD) [5], and locality-preserving projection (LPP) [6].

Recently, nonnegative matrix factorization (NMF) was introduced into facial expression recognition [7]. NMF decomposes the face samples into two nonnegative parts: the basis images and the corresponding weights. Because many entries in the bases and weights degenerate to values close to zero, NMF yields parts-based sparse representations. For facial expression recognition, localized subtle features, such as the corners of the mouth, upward or downward eyebrows, and changes of the eyes, are critical for recognition performance. Since NMF yields parts-based representations, it outperforms the subspace-based models. To further improve NMF, several variants have been presented by introducing different constraints into the objective function. Li et al. put forward local NMF (LNMF), which adds a locality constraint to the basis images to learn a localized, parts-based representation [8]. Hoyer gave sparse-constrained NMF (SNMF) by incorporating a sparseness constraint into both the bases and the weights [9]. Cai et al. developed graph-constrained NMF (GNMF) by adding a graph preserving constraint to the weights [10]. Zafeiriou et al. used discriminant NMF (DNMF) for frontal face verification [11]. Wang et al. extended NMF to PNMF with a PCA constraint and to FNMF with a Fisher constraint [12].
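To make the decomposition concrete, the following minimal numpy sketch implements the standard Lee-Seung multiplicative updates for the Frobenius-norm NMF objective; it illustrates plain NMF, not the code of any of the variants above, and the variable names and iteration count are our own choices.

import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9):
    # Factor a nonnegative matrix V (m x n) into basis images W (m x r)
    # and weights H (r x n) by Lee-Seung multiplicative updates.
    m, n = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update weights
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update basis images
    return W, H

Here each column of V would be a vectorized face image, the columns of W act as parts-based basis images, and the columns of H hold the corresponding weights.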

For facial expression recognition, NMF and its variants vectorize the samples before factorization, which may lose local geometric structure. However, the spatial neighborhood relationships among pixels are critical for image representation, understanding, and recognition [13]. Another drawback of NMF is that it cannot generate a unique decomposition. Welling and Weber developed a positive tensor factorization (PTF) algorithm, which handles images directly as 2D matrices [14]. Shashua and Hazan proposed nonnegative tensor factorization (NTF), which implements the factorization in the rank-one tensor space [15]. Factorization in the tensor space preserves the local structures and guarantees the uniqueness of the decomposition.

On the other hand, the choice of classifier plays an important role in recognition. For facial expression recognition, the nearest neighbor (NN) classifier and the support vector machine (SVM) are commonly used [16]. The sparse representation classifier (SRC) was adopted in [17]. Recently, the extreme learning machine (ELM), a training method for single hidden layer feed-forward networks (SLFNs), was proposed for classification [18]. Conventional methods need a long time to converge or may lose generalization ability due to overfitting. In ELM, by contrast, the input weights and biases are randomly assigned, and the output weights can be simply calculated by the generalized inverse of the hidden layer output matrix; therefore ELM converges extremely fast and attains excellent generalization capability. Many variants of ELM have been proposed for different applications [19–26], including kernel-based ELM [21] and incremental ELM (I-ELM) [23], which lead to state-of-the-art results in different applications.

In this paper, we propose a novel facial expression recognition algorithm based on discriminant neighborhood preserving nonnegative tensor factorization (DNPNTF) and ELM. It operates in the rank-one tensor space. The basic ELM is adopted to verify its effectiveness for facial expression recognition [18]. Our algorithm consists of two stages: feature extraction and classification. First, to extract discriminant features, a neighborhood preserving constrained form of NTF is used. The constraint is derived according to manifold learning and graph embedding theory [27–29]. Since the columns of the weighting matrix have a one-to-one correspondence with the columns of the original sample matrix, the discriminant constraint is added to the weighting matrix. With the neighborhood preserving constraint, the obtained parts-based representations vary smoothly along the geodesics of the data manifold and are more discriminant. Second, the discriminant features extracted by DNPNTF are fed into the ELM classifier to perform recognition.

The rest of this paper is organized as follows. The mathematical notation is given in Section 2. In Section 3, we give a detailed analysis of DNPNTF and its optimization procedure. ELM is introduced in Section 4, and the experiments are reported in Section 5. Finally, conclusions are drawn in Section 6.

2. Basic Algebra and Notations

In this paper, a tensor is represented as $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, whose order is $N$. The element of $\mathcal{A}$ is denoted by $a_{i_1 i_2 \cdots i_N}$, where $1 \le i_n \le I_n$, $n = 1, 2, \ldots, N$.

Definition 1 (inner product and tensor product [30]). The inner product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is defined as
$\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} a_{i_1 i_2 \cdots i_N} b_{i_1 i_2 \cdots i_N}. \quad (1)$
The tensor product of two tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$ is
$(\mathcal{A} \otimes \mathcal{B})_{i_1 \cdots i_N j_1 \cdots j_M} = a_{i_1 \cdots i_N} b_{j_1 \cdots j_M}. \quad (2)$

Definition 2 (rank-one tensor [30]). An $N$th-order tensor $\mathcal{A}$ could be represented as a tensor product of $N$ vectors $u^{(1)}, u^{(2)}, \ldots, u^{(N)}$ as
$\mathcal{A} = u^{(1)} \otimes u^{(2)} \otimes \cdots \otimes u^{(N)}. \quad (3)$
Here $\mathcal{A}$ is called a rank-one tensor, and $a_{i_1 i_2 \cdots i_N} = u^{(1)}_{i_1} u^{(2)}_{i_2} \cdots u^{(N)}_{i_N}$.

Definition 3 (mode product [30]). The mode-$n$ product of $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a matrix $U \in \mathbb{R}^{J \times I_n}$ is
$(\mathcal{A} \times_n U)_{i_1 \cdots i_{n-1} j i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} a_{i_1 \cdots i_n \cdots i_N} u_{j i_n}, \quad (4)$
where $1 \le i_k \le I_k$ for $k \ne n$ and $1 \le j \le J$.
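To make Definitions 1–3 concrete, the short numpy sketch below evaluates each operation on random 3rd-order data; it illustrates the standard tensor operations, with shapes chosen arbitrarily.

import numpy as np

rng = np.random.default_rng(0)

# Rank-one 3rd-order tensor (Definition 2): tensor product of three vectors.
u1, u2, u3 = rng.random(4), rng.random(5), rng.random(6)
A = np.einsum('i,j,k->ijk', u1, u2, u3)  # a_{ijk} = u1_i * u2_j * u3_k

# Inner product (Definition 1): sum of all elementwise products.
B = rng.random((4, 5, 6))
inner = np.sum(A * B)

# Mode-1 product (Definition 3): contract the first index with a matrix U.
U = rng.random((3, 4))
C = np.einsum('ji,ikl->jkl', U, A)  # result has shape (3, 5, 6)

print(inner, C.shape)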

3. The DNPNTF Algorithm

In this section, we give a detailed description of the proposed DNPNTF algorithm. Instead of converting the samples into vectors, it processes them in the rank-one tensor space. The objective function of NTF is adopted, which can learn parts-based representations with the sparse property. To discover the spatial local geometric structure and the discriminant class-based information, a constraint derived from manifold learning and graph embedding analysis is added to the objective function. To guarantee convergence, the projected gradient method is used.

3.1. The Analysis of DNPNTF

Given an image database $V$ containing $K$ sample images $V_1, V_2, \ldots, V_K$, the dimension of each sample is $m \times n$. In NTF, the database is organized as a 3rd-order tensor $\mathcal{D} \in \mathbb{R}^{m \times n \times K}$, which is obtained by stacking $V_1, \ldots, V_K$ sequentially. The objective function of NTF is
$F(X, Y, Z) = \frac{1}{2} \Big\| \mathcal{D} - \sum_{j=1}^{r} x^j \otimes y^j \otimes z^j \Big\|^2, \quad (5)$
where $X = [x^1, \ldots, x^r] \in \mathbb{R}^{m \times r}$, $Y = [y^1, \ldots, y^r] \in \mathbb{R}^{n \times r}$, and $Z = [z^1, \ldots, z^r] \in \mathbb{R}^{K \times r}$ describe the first, second, and third modes of $\mathcal{D}$, respectively. Each sample is approximated by
$V_k \approx \sum_{j=1}^{r} z^j_k \, x^j (y^j)^T. \quad (6)$

By minimizing (5), the bases $x^j (y^j)^T$ and the corresponding weights $Z$ are obtained. The inner product of the bases and a sample image is calculated to derive the low-dimensional parts-based representation.
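As a sketch of this feature extraction step, assuming factors X and Y have been learned by minimizing (5), the representation of an image is the vector of its inner products with the r rank-one bases; for matrices, the tensor inner product reduces to an elementwise sum:

import numpy as np

def ntf_features(V_img, X, Y):
    # Feature j of an m x n image is the inner product <x^j (y^j)^T, V_img>.
    r = X.shape[1]
    return np.array([np.sum(np.outer(X[:, j], Y[:, j]) * V_img)
                     for j in range(r)])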

To incorporate more properties into NTF, different constraints can be added to the objective function. The constrained form of the objective function is
$F = \frac{1}{2} \Big\| \mathcal{D} - \sum_{j=1}^{r} x^j \otimes y^j \otimes z^j \Big\|^2 + \alpha J(X, Y) + \beta R(Z), \quad (7)$
where $X \ge 0$, $Y \ge 0$, $Z \ge 0$. $J(X, Y)$ is the constraint function on the bases $X$ and $Y$, and $R(Z)$ is the constraint on the weights $Z$; $\alpha$ and $\beta$ are the corresponding positive coefficients. To encode the spatial structure and discriminant class-based information into the sparse representations, we propose a constraint function according to manifold learning and graph embedding analysis. In NTF, the columns of the weighting matrix have a one-to-one correspondence with the columns of the original image matrix. Therefore, we add the discriminant constraint to $Z$, and $R(Z)$ is defined as
$R(Z) = \sum_{g} \lambda_g \, \mathrm{tr}(Z^T L_g Z), \quad (8)$
where the $L_g$ denote the Laplacian matrices of graphs with different properties and the $\lambda_g$ are the corresponding coefficients. By deriving different $L_g$, the graph embedding model can have different properties, such as the neighborhood preserving property and the discriminant property.

Now, we discuss the selection of the graphs. The most commonly used graph is the Laplacian graph, whose similarity matrix $W$ is calculated in the form of the heat kernel function as
$W_{ij} = \begin{cases} e^{-\|v_i - v_j\|^2 / t}, & v_i \in N_k(v_j) \ \text{or} \ v_j \in N_k(v_i), \\ 0, & \text{otherwise}, \end{cases} \quad (9)$
where $N_k(v_i)$ denotes the $k$ nearest neighbors of sample $v_i$ and $t$ is the kernel width.

Here, $W_{ij}$ measures the similarity between a pair of vertices and has the neighborhood preserving property. To further incorporate the class-based discriminant information, we derive a universal penalty graph [27], whose similarity matrix is defined as
$W^p_{ij} = \begin{cases} 1, & (i, j) \in P_k(c_i) \ \text{or} \ (i, j) \in P_k(c_j), \\ 0, & \text{otherwise}, \end{cases} \quad (10)$
where $P_k(c)$ represents the $k$ nearest pairs of samples between class $c$ and the other classes. The purpose of the penalty graph is to separate marginal samples between different classes.
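A minimal sketch of the two graphs, assuming vectorized samples stored as rows of a matrix, k-nearest-neighbor adjacency for the intrinsic graph in (9), and, as a simple approximation of (10), the globally nearest between-class pairs for the penalty graph; all parameter names are ours:

import numpy as np

def graph_laplacians(samples, labels, k=5, t=1.0):
    # Heat-kernel Laplacian L (cf. (9)) and penalty-graph Laplacian Lp (cf. (10)).
    n = samples.shape[0]
    d2 = np.sum((samples[:, None, :] - samples[None, :, :]) ** 2, axis=2)

    # Intrinsic graph: heat-kernel weights over k nearest neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]  # skip the sample itself
        W[i, nn] = np.exp(-d2[i, nn] / t)
    W = np.maximum(W, W.T)  # symmetrize

    # Penalty graph: connect the nearest between-class pairs with weight 1.
    Wp = np.zeros((n, n))
    pairs = sorted((d2[i, j], i, j) for i in range(n) for j in range(i + 1, n)
                   if labels[i] != labels[j])
    for _, i, j in pairs[:k * n]:
        Wp[i, j] = Wp[j, i] = 1.0

    L = np.diag(W.sum(axis=1)) - W
    Lp = np.diag(Wp.sum(axis=1)) - Wp
    return L, Lp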

Now, the objective function of the constrained NTF becomes
$F = \frac{1}{2} \Big\| \mathcal{D} - \sum_{j=1}^{r} x^j \otimes y^j \otimes z^j \Big\|^2 + \frac{\beta}{2} \frac{\mathrm{tr}(Z^T L Z)}{\mathrm{tr}(Z^T L^p Z)}, \quad (11)$
where $L = D - W$ and $L^p = D^p - W^p$ are the Laplacian matrices of the intrinsic graph and the penalty graph, with $D$ and $D^p$ the corresponding diagonal degree matrices.

By solving the generalized eigenvalue decomposition problem, the graph embedding criterion in (11) can be converted to
$\min_{Z} \ \mathrm{tr}\big(Z^T (L - \lambda L^p) Z\big), \quad (12)$
where $\lambda$ is the generalized eigenvalue. And the final objective function of DNPNTF is
$F = \frac{1}{2} \Big\| \mathcal{D} - \sum_{j=1}^{r} x^j \otimes y^j \otimes z^j \Big\|^2 + \frac{\gamma}{2} \mathrm{tr}(Z^T Q Z), \quad (13)$
where $Q = L - \lambda L^p + \sigma I$ and the shift $\sigma > 0$ is chosen to make sure (13) is nonnegative.

3.2. Projected Gradient Method of DNPNTF

The most popular approach to minimizing NMF or NTF is the multiplicative update method. However, it cannot guarantee convergence for the constrained forms of NMF or NTF. In this paper, the projected gradient method is therefore used to solve DNPNTF.

The objective function of DNPNTF can be stated as
$F(X, Y, Z) = \frac{1}{2} \Big\| \mathcal{D} - \sum_{j=1}^{r} x^j \otimes y^j \otimes z^j \Big\|^2 + \frac{\gamma}{2} \mathrm{tr}(Z^T Q Z), \quad (14)$
where $\gamma$ is a positive constant. The goal of (14) is to find $X$, $Y$, and $Z$ by solving the following problem:
$\min_{X \ge 0, \, Y \ge 0, \, Z \ge 0} F(X, Y, Z). \quad (15)$

To find the optimal solution, (15) is divided into three subproblems: first, we fix $Y$ and $Z$ and update $X$ to arrive at the conditional optimal value of the subminimization problem; second, we fix $X$ and $Z$ and update $Y$; last, we fix $X$ and $Y$ and update $Z$. Three functions are defined as $F_X(X) = F(X, Y, Z)$ with $Y, Z$ fixed, $F_Y(Y) = F(X, Y, Z)$ with $X, Z$ fixed, and $F_Z(Z) = F(X, Y, Z)$ with $X, Y$ fixed. The update rules are defined as
$X \leftarrow \max(X - \eta_X \nabla F_X, 0), \quad Y \leftarrow \max(Y - \eta_Y \nabla F_Y, 0), \quad Z \leftarrow \max(Z - \eta_Z \nabla F_Z, 0), \quad (16)$
where $\eta_X$, $\eta_Y$, and $\eta_Z$ are the update steps. Now the task is calculating $\nabla F_X$, $\nabla F_Y$, and $\nabla F_Z$.
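The alternating scheme in (16) can be sketched as the skeleton below, where grad_X, grad_Y, and grad_Z stand for the gradients derived in Sections 3.2.1 and 3.2.2 and the step sizes are placeholders:

import numpy as np

def projected_gradient_step(M, grad, eta):
    # Move against the gradient, then project back onto the nonnegative orthant.
    return np.maximum(M - eta * grad, 0.0)

# One sweep of the alternating scheme (16):
#   X = projected_gradient_step(X, grad_X(X, Y, Z), eta_X)
#   Y = projected_gradient_step(Y, grad_Y(X, Y, Z), eta_Y)
#   Z = projected_gradient_step(Z, grad_Z(X, Y, Z), eta_Z)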

3.2.1. The Calculation of $\nabla F_X$ and $\nabla F_Y$

Firstly, we discuss the calculation of $\nabla F_X$ and $\nabla F_Y$. The objective function could be written as
$F = \frac{1}{2} \Big\langle \mathcal{D} - \sum_{j=1}^{r} x^j \otimes y^j \otimes z^j, \ \mathcal{D} - \sum_{j=1}^{r} x^j \otimes y^j \otimes z^j \Big\rangle + \frac{\gamma}{2} \mathrm{tr}(Z^T Q Z). \quad (17)$

The differential of $F$ along $x^j$ is
$dF = -\Big\langle \mathcal{D} - \sum_{l=1}^{r} x^l \otimes y^l \otimes z^l, \ dx^j \otimes y^j \otimes z^j \Big\rangle. \quad (18)$

And the partial differential for $x^j_i$ is
$\frac{\partial F}{\partial x^j_i} = -\Big\langle \mathcal{D} - \sum_{l=1}^{r} x^l \otimes y^l \otimes z^l, \ e_i \otimes y^j \otimes z^j \Big\rangle, \quad (19)$
where the $i$th element in $e_i$ is 1 and the others are 0; that is, $e_i(i) = 1$ and $e_i(k) = 0$ for $k \ne i$. According to Definition 1, for any $N$th-order rank-one tensors $u^{(1)} \otimes \cdots \otimes u^{(N)}$ and $v^{(1)} \otimes \cdots \otimes v^{(N)}$, there is $\langle u^{(1)} \otimes \cdots \otimes u^{(N)}, v^{(1)} \otimes \cdots \otimes v^{(N)} \rangle = \prod_{n=1}^{N} \langle u^{(n)}, v^{(n)} \rangle$. Then (19) could be written as
$\frac{\partial F}{\partial x^j_i} = -\langle \mathcal{D}, e_i \otimes y^j \otimes z^j \rangle + \sum_{l=1}^{r} x^l_i \langle y^l, y^j \rangle \langle z^l, z^j \rangle. \quad (20)$

According to (16), the update rule for $x^j_i$ is
$x^j_i \leftarrow x^j_i - \eta_X \frac{\partial F}{\partial x^j_i}. \quad (21)$

To ensure the nonnegativity of $x^j_i$, $\eta_X$ is set to be
$\eta_X = \frac{x^j_i}{\sum_{l=1}^{r} x^l_i \langle y^l, y^j \rangle \langle z^l, z^j \rangle}. \quad (22)$
Then the update function is
$X_{i:} \leftarrow X_{i:} \odot \frac{\big[\langle D_i, y^j (z^j)^T \rangle\big]_{j=1}^{r}}{X_{i:} \big[(Y^T Y) \odot (Z^T Z)\big]}, \quad (23)$
where $X_{i:}$ represents the $i$th row of $X$, $\odot$ is the matrix Hadamard product, and $D_i \in \mathbb{R}^{n \times K}$ represents the matrix which fixes the first mode of $\mathcal{D}$ at index $i$ and traverses the other two modes. It is defined as
$(D_i)_{st} = d_{ist}. \quad (24)$

Similarly, the update rule of the $s$th row of $Y$ is
$Y_{s:} \leftarrow Y_{s:} \odot \frac{\big[\langle E_s, x^j (z^j)^T \rangle\big]_{j=1}^{r}}{Y_{s:} \big[(X^T X) \odot (Z^T Z)\big]}, \quad (25)$
where $Y_{s:}$ represents the $s$th row of $Y$, $\odot$ is the matrix Hadamard product, and $E_s \in \mathbb{R}^{m \times K}$ represents the matrix which fixes the second mode of $\mathcal{D}$ at index $s$ and traverses the other two modes. It is defined as
$(E_s)_{it} = d_{ist}. \quad (26)$

Now, $\nabla F_X$ and $\nabla F_Y$ are calculated.
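Under the reconstruction above, the row-wise update (23) can be sketched in numpy as follows (the Y-update (25) is symmetric); D is the m x n x K data tensor, and eps guards against division by zero:

import numpy as np

def update_X(D, X, Y, Z, eps=1e-9):
    # Row-wise multiplicative update of X following (22)-(24).
    G = (Y.T @ Y) * (Z.T @ Z)  # r x r matrix (Y^T Y) Hadamard (Z^T Z)
    for i in range(X.shape[0]):
        Di = D[i, :, :]  # fixes the first mode of D at index i, as in (24)
        num = np.einsum('sj,st,tj->j', Y, Di, Z)  # <D_i, y^j (z^j)^T> for each j
        X[i, :] *= num / (X[i, :] @ G + eps)
    return X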

3.2.2. The Calculation of $\nabla F_Z$

Then we discuss the calculation of $\nabla F_Z$. The differential of $F$ along $z^j$ is
$dF = -\Big\langle \mathcal{D} - \sum_{l=1}^{r} x^l \otimes y^l \otimes z^l, \ x^j \otimes y^j \otimes dz^j \Big\rangle + \gamma \, (Q Z)_{:j}^T \, dz^j. \quad (27)$

For the reconstruction term, the partial differential for $z^j_k$ is
$-\Big\langle \mathcal{D} - \sum_{l=1}^{r} x^l \otimes y^l \otimes z^l, \ x^j \otimes y^j \otimes e_k \Big\rangle, \quad (28)$
where the $k$th element in $e_k$ is 1 and the others are 0; that is, $e_k(k) = 1$ and $e_k(i) = 0$ for $i \ne k$. Then the partial differential of $F$ for $z^j_k$ is
$\frac{\partial F}{\partial z^j_k} = -\langle \mathcal{D}, x^j \otimes y^j \otimes e_k \rangle + \sum_{l=1}^{r} z^l_k \langle x^l, x^j \rangle \langle y^l, y^j \rangle + \gamma (Q Z)_{kj}. \quad (29)$

According to (16), the update rule for $z^j_k$ is
$z^j_k \leftarrow z^j_k - \eta_Z \frac{\partial F}{\partial z^j_k}. \quad (30)$

The update step is set as
$\eta_Z = \frac{z^j_k}{\sum_{l=1}^{r} z^l_k \langle x^l, x^j \rangle \langle y^l, y^j \rangle + \gamma (Q^+ Z)_{kj}}. \quad (31)$

And the final update rule of $Z$ is
$Z_{k:} \leftarrow Z_{k:} \odot \frac{\big[\langle G_k, x^j (y^j)^T \rangle\big]_{j=1}^{r} + \gamma (Q^- Z)_{k:}}{Z_{k:} \big[(X^T X) \odot (Y^T Y)\big] + \gamma (Q^+ Z)_{k:}}, \quad (32)$
where $Z_{k:}$ represents the $k$th row of $Z$; $(Q^+ Z)_{k:}$ and $(Q^- Z)_{k:}$ represent the $k$th rows of $Q^+ Z$ and $Q^- Z$, respectively; $\odot$ is the Hadamard product; $Q^+ = (|Q| + Q)/2$ and $Q^- = (|Q| - Q)/2$, so that $Q = Q^+ - Q^-$; and $G_k \in \mathbb{R}^{m \times n}$ is the matrix which fixes the third mode of $\mathcal{D}$ at index $k$ and traverses the other two modes. It is defined as
$(G_k)_{is} = d_{isk}. \quad (33)$

Now $\nabla F_X$, $\nabla F_Y$, and $\nabla F_Z$ in the objective function are all calculated.
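Again under our reconstruction, the Z-update (32) adds the graph term through the split Q = Q+ − Q−; a sketch, applying the multiplicative rule row by row with the rows of Q+Z and Q−Z evaluated from the current Z:

import numpy as np

def update_Z(D, X, Y, Z, Q, gamma, eps=1e-9):
    # Row-wise multiplicative update of Z following (31)-(33).
    Qp = (np.abs(Q) + Q) / 2.0  # Q^+
    Qm = (np.abs(Q) - Q) / 2.0  # Q^-, so that Q = Q^+ - Q^-
    G = (X.T @ X) * (Y.T @ Y)   # r x r matrix (X^T X) Hadamard (Y^T Y)
    QpZ, QmZ = Qp @ Z, Qm @ Z   # evaluated once per sweep (Jacobi style)
    for k in range(Z.shape[0]):
        Dk = D[:, :, k]  # fixes the third mode of D at index k, as in (33)
        num = np.einsum('ij,is,sj->j', X, Dk, Y)  # <G_k, x^j (y^j)^T> for each j
        Z[k, :] *= (num + gamma * QmZ[k, :]) / (Z[k, :] @ G + gamma * QpZ[k, :] + eps)
    return Z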

4. Extreme Learning Machine

ELM was proposed by Huang et al. [18] for SLFNs. Unlike traditional feedforward neural network training methods, such as gradient-descent methods, standard optimization methods, and least-square based methods, ELM does not need to tune the hidden layer of the SLFN, a step which can make learning complicated and inefficient. It can reach the smallest training error and has better generalization performance. The learning speed of ELM is fast, and its parameters do not have to be tuned manually. In our proposed algorithm, the features extracted by DNPNTF are fed into ELM for classification.

Given a training set $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is the input feature vector and $t_i \in \mathbb{R}^{c}$ is the target vector, ELM with $\tilde{N}$ hidden nodes and activation function $g(\cdot)$ is modeled as
$\sum_{j=1}^{\tilde{N}} \beta_j \, g(w_j \cdot x_i + b_j) = o_i, \quad i = 1, \ldots, N, \quad (34)$
where $w_j$ represents the input weight vector connecting the $j$th neuron in the hidden layer and the input layer; $\beta_j$ is the weight vector between the $j$th hidden neuron and the output layer; $b_j$ is the bias of the $j$th hidden neuron; and $o_i$ is the network output for the $i$th input. In the training step, ELM aims to approximate the $N$ training samples with zero error, which means $\sum_{i=1}^{N} \| o_i - t_i \| = 0$. Then there exist $w_j$, $b_j$, and $\beta_j$ satisfying
$\sum_{j=1}^{\tilde{N}} \beta_j \, g(w_j \cdot x_i + b_j) = t_i, \quad i = 1, \ldots, N. \quad (35)$

Equation (35) can be reformulated compactly as
$H \beta = T, \quad (36)$
where
$H = \big[ g(w_j \cdot x_i + b_j) \big]_{i=1,\ldots,N; \ j=1,\ldots,\tilde{N}}, \quad \beta = [\beta_1, \ldots, \beta_{\tilde{N}}]^T, \quad T = [t_1, \ldots, t_N]^T. \quad (37)$
$H$ is called the hidden layer output matrix of the neural network, and the $j$th column of $H$ is the $j$th hidden neuron output with respect to the inputs $x_1, \ldots, x_N$. It is proved by Huang et al. [18] that the weights $w_j$ and biases $b_j$ need not be adjusted and can be arbitrarily given. Therefore, the output weights can be determined by finding the least-square solution
$\hat{\beta} = H^{\dagger} T, \quad (38)$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$. Furthermore, the smallest training error is obtained by $\hat{\beta}$ as
$\| H \hat{\beta} - T \| = \| H H^{\dagger} T - T \| = \min_{\beta} \| H \beta - T \|. \quad (39)$
As analyzed by Huang, ELM can obtain good generalization performance with a dramatically increased learning speed by solving (39).
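A minimal numpy sketch of (36)–(39), assuming a sigmoid activation and one-hot target rows in T; the random input weights and biases are drawn once, and only the output weights are solved for:

import numpy as np

def elm_train(Xtr, T, n_hidden, seed=0):
    # Random input weights and biases; least-squares output weights via (38).
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (Xtr.shape[1], n_hidden))  # input weights w_j
    b = rng.uniform(-1, 1, n_hidden)                  # hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(Xtr @ W + b)))          # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                      # beta = H^dagger T
    return W, b, beta

def elm_predict(Xte, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(Xte @ W + b)))
    return np.argmax(H @ beta, axis=1)  # predicted class indices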

5. Experiments

In this section, we apply DNPNTF with ELM to facial expression recognition. We compare DNPNTF with NMF [7], DNMF [11], and NTF [15] and give the experimental results employing ELM, NN, SVM [16], and SRC [17]. Two facial expression databases are used: the JAFFE database [31] and the Cohn-Kanade database [32]. Raw facial images are cropped according to the position of the eyes and normalized to 32 × 32 pixels. Figure 1 shows an example of an original face image and the corresponding cropped image. According to the rank-one tensor theory, the gray level images are encoded in the tensor space.

Since the results of ELM may vary between executions, we repeat each execution 5 times and take the average value as the final result. It has been shown by theoretical analysis and experiments that the classification performance of ELM is affected by the hidden activation function and the number of hidden nodes [23]. However, in this paper we focus on the application of ELM to facial expression recognition. The activation function used in our algorithm is the simple sigmoid function, and the number of hidden nodes is set to the number of facial expression classes (i.e., 7 for the JAFFE database and 6 for the Cohn-Kanade database). For SVM, the radial basis function (RBF) kernel is used, with its kernel parameter set to 3 as an empirical value. For SRC, the Homotopy algorithm is used to solve the $\ell_1$-norm constrained minimization.

5.1. Experiments on JAFFE Database

The JAFFE database [31] is an expression database which contains 213 static facial images captured from 10 Japanese females. Each person provides 2 to 4 examples for each of the 6 prototypic expressions (anger, disgust, fear, happiness, sadness, and surprise) plus the neutral face. To evaluate the algorithms, we randomly partition all images into 10 groups, with roughly 21 samples in each group. We take any 9 groups for training and calculate the recognition rate on the remaining one, repeating this for all 10 possible choices. Finally, the average result over the 10 runs is taken.

The average recognition rates of the different feature extraction algorithms are shown in Figure 2, where the vertical axis represents the correct recognition rate in percentage and the horizontal axis represents the corresponding dimension (from 1 to 120). Here, only the NN classifier is used. In the lower range of dimensions, the recognition rates of DNPNTF are similar to those of the other algorithms. This is because DNPNTF extracts parts-based sparse representations, and only a few features can be generated for recognition in the low range of dimensions. In the higher range of dimensions, DNPNTF outperforms the others: with the increase of the extracted parts-based features, DNPNTF achieves good recognition performance. Since additional constraints or tensor structure are exploited, the improved methods DNMF and NTF outperform the conventional NMF.

The top recognition rates of the different algorithms with the corresponding dimensions are illustrated in Table 1. NMF reaches its top rate at a low dimension, while DNMF reaches its top rate at a high dimension. Although more dimensions are needed, DNPNTF achieves the highest recognition rate of all the compared algorithms. This is because the constraints on the manifold structure and the discriminant information, which are critical for classification, are considered.

Figure 3 shows the basis images obtained on the JAFFE database by NMF, NTF, and DNPNTF. Based on the principle of NMF, the face images are represented by combining multiple basis images with addition only, and the basis images are expected to represent facial parts. On this database, the basis images calculated by NMF are not sparse. NTF and DNPNTF, which operate in the tensor space, generate parts-based sparse representations. Since more constraints are adopted, DNPNTF generates sparser basis images which reflect distinct features for recognition.

Next, we conduct experiments to verify the effectiveness of DNPNTF with ELM. The average recognition rates of DNPNTF with ELM, NN, SVM, and SRC are given in Figure 4, where the vertical axis represents the correct recognition rate in percentage and the horizontal axis represents the corresponding dimension (from 1 to 120). ELM and SRC achieve better recognition performance than NN and SVM, and ELM achieves the highest recognition rate. The top recognition rates with the corresponding dimensions are given in Table 2.

5.2. Experiments on Cohn-Kanade Database

The Cohn-Kanade database [32] consists of a large number of image sequences starting from the neutral face and ending with the peak of the corresponding expression. 104 subjects with different ages, genders, and races were instructed to pose a series of 23 facial displays, including the 6 prototypic expressions. In our experiments, for every image sequence, we take 2 to 8 continuous frames near the peak expression as static samples. We use the face images of all subjects. We partition the subjects into 3 exclusive groups, and in each group, for each of the prototypic expressions, we select 100 samples; that is, there are 600 samples in each group, and the size of the total set is 1800. We adopt a leave-one-group-out, 3-fold cross-validation strategy: each time, two groups are taken as the training set and the remaining group is left for testing. This procedure is repeated 3 times.

The average recognition rates of the different algorithms on the Cohn-Kanade database are shown in Figure 5, where the vertical axis represents the correct recognition rate in percentage and the horizontal axis represents the corresponding dimension (from 1 to 120). Here, only the NN classifier is used. Table 3 shows the top recognition rates with the corresponding dimensions. The recognition rates obtained on the Cohn-Kanade database are lower than those obtained on the JAFFE database. This can be explained by the fact that the experiments on the Cohn-Kanade database are person-independent, which is more difficult than the person-dependent experiments on the JAFFE database. From Figure 5, we can see that the performance of DNPNTF is superior to the others at nearly all dimensions, and its recognition rate improves with the increase of dimension.

Lastly, we report the experiments with different classifiers on the Cohn-Kanade database. The average recognition rates of DNPNTF with ELM, NN, SVM, and SRC are shown in Figure 6, and the top recognition rates are given in Table 4. SVM and SRC achieve better performance than NN. ELM achieves the best recognition accuracy among all tested classifiers at almost all dimensions, which means that ELM exploits the information contained in the extracted features better than the other classifiers.

6. Conclusions

In this paper, a novel DNPNTF algorithm with application to facial expression recognition was proposed, which adopts ELM as the classifier. To incorporate the spatial information and the discriminant class information, a discriminant constraint is added to the objective function according to manifold learning and graph embedding theory. To guarantee convergence, the projected gradient method is used for optimization. Theoretical analysis and experimental results demonstrate that DNPNTF achieves better performance than NTF, NMF, and its variants. The discriminant features generated by DNPNTF are fed into ELM to learn an optimal model for recognition. In our experiments, DNPNTF with ELM achieves a higher recognition rate than NN, SVM, and SRC.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was supported partly by the National Natural Science Foundation of China (61370127 and 61472030), the Fundamental Research Funds for the Central Universities (2013JBM020 and 2014JBZ004), and the Beijing Higher Education Young Elite Teacher Project (YETP0544).