Abstract

Aiming at the problem of facial expression recognition under unconstrained conditions, a facial expression recognition method based on an improved capsule network model is proposed. First, the expression image is illumination-normalized using an improved Weber face, and facial key points are detected with a Gaussian process regression tree. Then, a 3D Morphable Model (3DMM) is introduced: a 3D face shape consistent with the face in the image is obtained by iterative estimation, further improving the image quality of face pose standardization. We argue that the convolutional features used for facial expression recognition should be trained from scratch, with as many diverse samples as possible included during training. Finally, this paper combines traditional deep learning techniques with the capsule configuration, adds an attention layer after the primary capsule layer of the capsule network, and proposes an improved capsule structure suitable for expression recognition. Experimental results on the JAFFE and BU-3DFE datasets show that the recognition rate reaches 96.66% and 80.64%, respectively.

1. Introduction

Human facial expression is a representational language that is revealed, naturally or deliberately, under the complex stimulation of environment, context, and mood in the process of communication, and that can be perceived by the visual system [1–3]. It is also the movement that human facial muscles produce in response to certain semantic stimuli, or actively under the drive of consciousness. It is generally believed that human facial expressions are both consciously controllable and stress-convergent. Conscious controllability means that human beings (especially actors) can make or suppress any expression at will [4, 5]. Stress convergence means that most people make similar expressions under the stimulation of a specific semantic environment. For example, when people hear interesting events or face beautiful things, they naturally show happiness; when people face expired, smelly food or ugly scenes, they generally show disgust; when people face sudden emergencies, they usually show surprise [6].

When a human stress expression is contrary to the normal situation, it usually reflects the inhibition of the expression by some psychological factor. Because human facial expression carries rich psychological and emotional information, exhibits stress convergence, and is controllable by consciousness, it has attracted broad attention from scholars in psychology and pattern recognition [7, 8]. Under existing technical conditions, pattern recognition techniques are used to establish the mapping between a face image and the facial expression it contains; the automatic judgment of human facial expressions by computer defines the research field of expression recognition. In a broad sense, facial expression recognition is a process of automatic analysis of face image data by computer, and automatic image analysis by computer is precisely the main content of computer vision [9]. As a discipline born out of traditional pattern recognition, computer vision studies the difficulties that traditional pattern recognition methods face with image data and refines image data to facilitate indexing, classification, and automatic analysis [10].

In this paper, we hold that research on facial expression recognition from a single face image has stronger practical significance than expression recognition based on image sequences. Compared with sequence-based methods, recognition from a single image more directly exposes the defects of existing image processing and recognition techniques in a specific application and is therefore helpful for improving their applicability. Accordingly, this paper focuses on expression recognition from a single face image under unconstrained conditions. For illumination and pose standardization of face images, existing methods tend to lose texture details and are therefore unsuitable for facial expression recognition. We hold that, for unconstrained expression recognition, the degree to which expression details are preserved must be further improved so that the subsequent discriminant model can work effectively. To this end, a facial expression recognition method using an improved capsule network model is proposed. The main contributions of this paper are as follows.

We argue that the convolutional features used for facial expression recognition should be trained from scratch, with as many diverse samples as possible included during training. The proposed method focuses on temporal attention. The attention module uses the sigmoid as its activation function, which not only selects important features but also suppresses irrelevant information; it also helps smooth the mismatch between the training and test sets and improves the final recognition rate.

2. Related Work

For facial expression recognition under unconstrained conditions, scholars have proposed many methods. For example, reference [11] proposed a hybrid expression recognition method using the High-order Joint Derivative Local Binary Pattern (HJDLBP) and the Local Binary Pattern (LBP); model efficiency is improved by removing unwanted areas and preserving the facial area. The study in [12] proposed a facial expression recognition framework combining two-dimensional Gabor features and local binary patterns, improving model efficiency by extracting salient features of facial expression. The study in [13] proposed an adaptive model parameter initialization method based on the multilayer maxout network linear activation function, which improved model performance by extracting highly relevant features of the image sequence. The study in [14] proposed an expression recognition method based on the Wasserstein generative adversarial network, which improved model efficiency by suppressing slight changes of the face. The study in [15] proposed a Deep Cascaded Peak-piloted Network, which extracts key and subtle details in the image through peak-piloted feature transformation to improve model accuracy. However, these methods do not consider the edge characteristics of the image.

The study in [16] proposed a facial expression recognition method combining multiple facial features and support vector machines; by extracting important facial features and reducing image noise, model accuracy is improved. The study in [17] proposed a deep convolutional BiLSTM fusion network for facial expression recognition, which extracts spatial features from each frame through a convolutional neural network and then models the temporal dynamics; feature fusion improves the recognition rate. The study in [18] proposed a facial expression recognition method based on facial video sequences; by extracting features represented by temporal local binary patterns, model efficiency was improved. The study in [19] proposed a conditional convolutional neural network enhanced random forest expression recognition method, which reduces the noise points of the dataset and improves model accuracy. However, when the training data are scarce, these methods are prone to underfitting.

The study in [20] proposed facial expression recognition based on incremental active learning, which improves model accuracy by reducing the noise points of the image. The study in [21] proposed a multifeature fusion facial expression recognition method based on the Extreme Learning Machine (ELM), which improved accuracy by fusing multiple features. The study in [22] proposed a facial expression recognition method based on feature space and principal component analysis; the method encodes known images through the feature space to improve accuracy. The study in [23] proposed a facial expression recognition method based on the Two-Stream Convolutional Neural Network (T-SCNN), which improved accuracy by fusing RGB images and temporal features. However, when the amount of data is large, these methods are prone to overfitting. The study in [24] proposed a multilayer perceptron algorithm for facial expression recognition, which increased accuracy by adding hidden neurons, but the parameters of the model are difficult to tune. The study in [25] proposed a facial expression recognition method based on hidden Markov models, which improved efficiency by extracting the more important features of the image; however, under unconstrained conditions the model is less robust.

Based on the above analysis, deep learning has good modeling and processing ability for facial expression images, but effective recognition is achieved only when face illumination and pose are constrained. Aiming at facial expression recognition under unconstrained conditions, a facial expression recognition method based on an improved capsule network model is proposed. The improved capsule model can effectively classify facial expressions under unconstrained conditions, makes up for the deficiency of pure deep convolutional networks in acquiring the sparse features hidden in discriminative texture, and improves the generalization ability of existing expression classification models to differences in illumination and pose.

3. Overall Architecture of the Proposed Method

From an analysis of existing deep convolutional neural networks applied to expression recognition, we argue that illumination and pose correction techniques have great application value in alleviating the dependence of deep convolutional networks on sample quantity and in improving the quality of the perceived weights. Illumination processing can be used to alter the lighting conditions of a face and generate expression samples under different illumination; likewise, pose correction can generalize the face pose to generate expression samples under different poses. Following the idea of local dense sampling, this drives the final trained model closer to the true model. The framework of facial expression recognition based on deep learning is shown in Figure 1. In this paper, illumination and projection analysis techniques are used to analyze, correct, and then perceive the face pattern; under existing technical conditions, this kind of preprocessing remains a necessary step. In the implementation, a batch of dense sample images is generated and input into the deep convolutional model for weight training, which substantially alleviates the problem of insufficient samples. In the recognition stage, illumination and projection analysis are again used to analyze, correct, and perceive the face pattern. We add an attention layer after the primary capsule layer of the capsule network and propose a capsule structure suitable for expression feature extraction.

4. The Process of the Proposed Method

4.1. Illumination Normalization

A new illumination normalization method based on the Weber face (WF) [26] is proposed, which not only extracts illumination-insensitive features effectively but also suppresses the boundary artifacts that appear at abrupt changes of light. Assume that the image contains a given lighting component. In the improved WF, all local intensity ratios are multiplied by a combination coefficient before the arctangent coding is applied (equation (1)).

At this point, while the effective information is enhanced, the noise is enhanced as well, and the coding values of the region most affected by light fall in a particular interval. To reduce this noise, that interval is multiplied by a suppression factor, giving the revised WF definition in equation (2). Here, the suppression coefficient attenuates the influence of light, and a second coefficient adjusts (increases or decreases) the difference between the WF coding values of adjacent pixels; the interval in question is the one most affected by light and therefore contains the most noise. The following properties can be derived from equation (2).

According to Weber's law, the minimum perceivable ratio is constant. The subinterval in which the ratio falls below this minimum perceptible ratio is therefore called the low-perception interval: changes within it cannot be perceived by the human eye. That is, even if the pixels in this interval are affected by light, the change is small and can be ignored. The suppression factor is defined over this interval.

The low-frequency component is regarded as a large-scale feature, the part mainly affected by light; the high-frequency component is regarded as a small-scale feature, which is the light-invariant part. Coding values close to the upper coding bound change rapidly and can be regarded as high-frequency components, while values close to 0 change slowly and can be regarded as low-frequency components. The interval of slowly changing values is treated as the low-frequency interval, and the remaining intervals as high-frequency intervals. What must be suppressed is the appropriate low-frequency interval, and the suppression factor is defined accordingly.
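For illustration, a minimal Python sketch of this coding is given below. It implements the standard WF arctangent coding with a suppression factor applied to the low-perception interval described above; the 3 × 3 neighborhood, the coefficient values, and the threshold are illustrative assumptions, not the exact improved-WF formula of this paper.

import numpy as np

def weber_face(img, alpha=0.2, beta=0.5, low_thresh=0.1):
    """Illumination normalization in the spirit of the Weber face (WF).

    alpha -- combination coefficient scaling the intensity ratios
    beta  -- suppression factor applied to the noisy low-frequency interval
    """
    I = img.astype(np.float64) + 1e-6          # avoid division by zero
    ratios = np.zeros_like(I)
    # accumulate (I_center - I_neighbor) / I_center over the 3x3 neighborhood
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(I, dy, axis=0), dx, axis=1)
            ratios += (I - shifted) / I
    wf = np.arctan(alpha * ratios)             # illumination-insensitive coding
    # suppress small coding magnitudes: the low-frequency interval that is
    # illumination-dominated and noisy according to Section 4.1
    low = np.abs(wf) < low_thresh
    wf[low] *= beta
    return wf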

4.2. Key Point Detection

For key point detection [27], a model based on the Gaussian process regression tree is proposed, together with a purpose-built random partition kernel. Given a random partition, the kernel between two samples is defined through the indicator function of whether the partition assigns the two samples to the same cluster: the kernel value is the expectation, over random partitions, of this same-cluster indicator.

According to Mercer's theorem, if a function is positive semidefinite, it is a valid kernel function. We therefore first prove that the proposed function is a reasonable kernel, defining it as the expectation of the same-cluster indicator over random partitions.

To prove that the kernel is positive semidefinite, the expectation is decomposed into a limit of sums, and each single term is shown to be positive semidefinite.

For any dataset of a given size, the covariance matrix induced by a single partition can be arranged, by grouping the samples belonging to the same cluster, into a block-diagonal matrix.

Each such matrix is positive semidefinite; therefore, for any dataset, the expected covariance matrix is also positive semidefinite, and the proposed function is a reasonable kernel. Analogously, the kernel defined on random partitions is applied to the random forest: the kernel of the Gaussian process regression tree is composed of the trees and the distribution of samples over the nodes of each tree, as given in the following formulas.

Here, the split function determines the leaf node that a sample reaches, and the scaling parameters control the magnitude of the kernel.
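As an illustration, the sketch below computes such a forest-induced kernel in Python: k(x, x') is estimated as the fraction of trees whose partition places x and x' in the same leaf. The use of scikit-learn and the toy data are assumptions for demonstration; the paper's scaling hyperparameters are omitted.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_kernel(forest, Xa, Xb):
    """Random-partition kernel induced by a trained random forest."""
    leaves_a = forest.apply(Xa)                # (n_a, n_trees) leaf indices
    leaves_b = forest.apply(Xb)                # (n_b, n_trees)
    # indicator of "same leaf", averaged over the trees of the forest
    same = leaves_a[:, None, :] == leaves_b[None, :, :]
    return same.mean(axis=2)                   # (n_a, n_b) kernel matrix

# usage: fit a forest on key point coordinates, then use the kernel in a GP
X = np.random.rand(100, 10)                    # toy image features
y = np.random.rand(100)                        # toy key point coordinate
forest = RandomForestRegressor(n_estimators=50).fit(X, y)
K = forest_kernel(forest, X, X)                # positive semidefinite by construction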

The hyperparameters of the Gaussian process regression tree are estimated by maximum likelihood, where the likelihood is given by the probability density function of the training samples.

Taking the logarithm of this likelihood gives the log marginal likelihood.

The maximum of the likelihood is found by setting the derivative of the log-likelihood with respect to each hyperparameter to zero.
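For reference, a minimal sketch of the standard GP log marginal likelihood, the objective maximized here, is shown below; the noise term and the Cholesky-based evaluation are the usual implementation choices, not details taken from the paper.

import numpy as np

def gp_log_marginal_likelihood(K, y, noise=1e-3):
    """Standard GP log marginal likelihood for kernel matrix K and targets y."""
    n = len(y)
    Kn = K + noise * np.eye(n)                 # add observation noise
    L = np.linalg.cholesky(Kn)                 # stable inversion via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))       # equals 0.5 * log|Kn|
            - 0.5 * n * np.log(2 * np.pi))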

The nonparametric nature of Gaussian process regression leads to a large amount of computation: exact inference over n training samples has O(n³) complexity. A reduced-rank approximation of the kernel matrix reduces this cost and simplifies the predictive formula accordingly.

Due to the special construction of the kernel, the kernel matrix factorizes through a sparse indicator matrix: each sample is mapped to the index of the leaf node it falls into in each tree. With this factorization, the computational complexity drops from cubic in the number of samples to linear in the number of samples for a fixed number of leaves.

4.3. Face Posture Standardization

A new face pose normalization method based on the 3D Morphable Model (3DMM) is proposed [28]. A face shape is modeled as the average three-dimensional face shape plus a linear combination of the principal components obtained by PCA on a dataset of 3D face vertices; the combination coefficient of each eigenvector follows a Gaussian distribution with mean 0 and variance equal to the corresponding eigenvalue. Facial feature points extracted from the model preserve the same linear combination relationship, so the feature points satisfy the same generative model.

Given a face image, the two-dimensional feature point estimates are first obtained. Together with the position of the camera coordinate system origin in the world coordinate system, the matrix formed by stacking the eigenvectors by column, the third row of the rotation matrix, the internal parameter matrix of the camera, and the common scale factor of projection imaging, a point energy equation is formed and combined with the deformation coefficient estimate.

In this energy, the left term penalizes the estimated residual, and the right term constrains the combination coefficients according to the probability prior. The mean of the coefficients' probability distribution is no longer zero once the average shape has been updated.

The energy function is reduced by alternating optimization [29]: with the combination coefficients fixed, the projection parameters are estimated by minimizing the energy; with the projection parameters fixed, the combination coefficients are obtained by minimization.
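The sketch below illustrates this alternating scheme in Python under simplifying assumptions that are not from the paper: an affine (scaled-orthographic) camera fitted by least squares in the pose step, and a ridge-regularized linear solve using the PCA eigenvalues as Gaussian prior variances in the shape step.

import numpy as np

def fit_3dmm_landmarks(w2d, mean3d, basis, eigvals, n_iters=5, lam=1.0):
    # w2d:    (2, n) observed 2D feature points
    # mean3d: (3, n) average 3D face shape at the feature points
    # basis:  (3*n, k) PCA eigenvectors, rows ordered [x_1..x_n, y_1..y_n, z_1..z_n]
    # eigvals:(k,) PCA eigenvalues (prior variances of the coefficients)
    n = w2d.shape[1]
    coeffs = np.zeros(len(eigvals))
    P = None
    for _ in range(n_iters):
        # pose step: fix the shape, fit an affine camera P (2x4) by least squares
        shape = mean3d + basis.dot(coeffs).reshape(3, n)
        homog = np.vstack([shape, np.ones((1, n))])
        P = np.linalg.lstsq(homog.T, w2d.T, rcond=None)[0].T
        # shape step: fix P, solve the prior-regularized linear system for coeffs
        homog_mean = np.vstack([mean3d, np.ones((1, n))])
        target = (w2d - P.dot(homog_mean)).reshape(-1)
        A = P[:, :3]                                   # linear part of the camera
        proj_basis = np.einsum('ij,jnk->ink', A,
                               basis.reshape(3, n, -1)).reshape(2 * n, -1)
        prior = lam * np.diag(1.0 / eigvals)           # Gaussian shape prior
        coeffs = np.linalg.solve(proj_basis.T @ proj_basis + prior,
                                 proj_basis.T @ target)
    return P, coeffs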

After the position and rotation matrix of each facial feature point in the camera coordinate system have been updated, the deformation estimation technique of point mapping can be used to compute the eigenvector combination coefficients of the revised three-dimensional model from the mean of the coefficients' probability distribution; that mean is then updated in turn. Once the projection parameters and the 3DMM deformation coefficient vector have been determined, the solved 3D face is established, together with the color reference relationship between the model and the faces in the image.

Given the projection parameter combination of the standard pose, a standard-pose face image can then be generated through the color reference relationship between the three-dimensional model of the face and the original image.

The normalized face should be symmetric; the symmetric counterpart of each point in the three-dimensional face shape is obtained through a diagonal reflection matrix. The quality of a generated pixel is inversely related to the number of source pixels it references; when each pixel is referenced at least once, the following correspondence can be given.

By the symmetry of the standard-pose face image, reference colors of higher quality can then be assigned.

4.4. Dense Sampling and Preprocessing

Following the idea of local dense sampling [30], illumination and pose correction techniques have great application value in alleviating the dependence of deep convolutional neural networks on sample quantity and in improving the quality of the perceived weights. After illumination and projection analysis of a face sample, four small-range random affine transformations are first applied to the illumination-analyzed image; the three Euler angles of the three-dimensional rotation matrix, the scale parameter, and the origin coordinates are then randomly perturbed to generate a dense mini-batch of 16 sample images for weight training of the deep convolutional model. This training scheme expands the sample size by a factor of 16, which substantially alleviates the problem of insufficient samples. Moreover, the denseness within each batch reduces the probability that the perceptual model overfits the data. During recognition, since face images with arbitrary illumination and pose may appear, illumination and projection analysis are still used to analyze and correct the face pattern before perception.
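A sketch of this 4 × 4 sampling scheme is given below. The render_pose function is a hypothetical stand-in for the paper's 3DMM-based pose re-rendering, and the perturbation ranges are illustrative assumptions.

import numpy as np
from scipy.ndimage import affine_transform

def dense_minibatch(img, render_pose, n_affine=4, n_pose=4, rng=None):
    """Generate a dense mini-batch of 16 samples from one image (Section 4.4)."""
    rng = rng or np.random.default_rng()
    batch = []
    for _ in range(n_affine):
        # small random affine: rotation up to ~5 degrees, mild scale and shift
        theta = rng.uniform(-0.09, 0.09)
        s = rng.uniform(0.95, 1.05)
        M = s * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
        warped = affine_transform(img, M, offset=rng.uniform(-2, 2, size=2))
        for _ in range(n_pose):
            # render_pose is hypothetical: it stands in for 3DMM re-rendering
            yaw, pitch, roll = rng.uniform(-15, 15, size=3)   # Euler angles (deg)
            batch.append(render_pose(warped, yaw, pitch, roll,
                                     scale=rng.uniform(0.9, 1.1)))
    return np.stack(batch)                     # 16 samples per source image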

4.5. Attention Capsule Network Model

The proposed attention capsule network model has five gated convolution modules. Each module consists of two gated convolutional layers followed by max pooling, and each gated convolutional layer combines a linear function with a sigmoid activation. Compared with a traditional CNN, the gated convolutional network replaces the rectified linear unit with a gated linear unit, whose learnable gate controls how much information passes from the current layer to the next. Gated linear units mitigate vanishing gradients: the sigmoid activation preserves the nonlinearity of the network while the linear function provides a linear path for the gradient. The max pooling operation reduces the spatial dimension of the features.
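A minimal PyTorch sketch of one such module is shown below; the channel and kernel sizes are illustrative assumptions.

import torch
import torch.nn as nn

class GatedConvBlock(nn.Module):
    """One gated convolution module: two gated conv layers plus max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)   # linear path
        self.gate1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)   # learnable gate
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.gate2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.conv1(x) * torch.sigmoid(self.gate1(x))      # gated linear unit 1
        x = self.conv2(x) * torch.sigmoid(self.gate2(x))      # gated linear unit 2
        return self.pool(x)                                   # reduce spatial dims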

The output features of the five gated convolution modules are sent to the primary capsule layer, which consists of a convolution module, a reshaping module, and a squashing module. After the input features pass through the convolutional layer, the bias is added and a ReLU nonlinearity is applied; the result is reshaped into a three-dimensional tensor and compressed with the squashing function. The first dimension is the time dimension before reshaping, the second is inferred from the remaining variables, and the third is the capsule size. The output of the primary capsule layer therefore consists of time slices; each time slice contains a set of capsules, and each capsule is a vector.
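The squashing step is the standard capsule nonlinearity; a sketch follows, in which short vectors are shrunk toward zero and long vectors approach unit length.

import torch

def squash(s, dim=-1, eps=1e-8):
    """Standard capsule squashing: v = (|s|^2 / (1 + |s|^2)) * s / |s|."""
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)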

The capsules of each time slice are input into the advanced capsule layer, and the computation between the primary and advanced capsule layers is performed by a dynamic routing algorithm. Dynamic routing matches low-level capsules representing image frames with high-level capsules representing expression categories: when multiple image frames predict the same category, that expression category is determined for the image. Feedback then increases the weights between the image frames related to that category and decreases the weights of unrelated frames, so that the weights between all image frames and all expression categories are learned accurately. The routing weights are updated in every training step, and the final weights are saved when the algorithm ends. The dynamic routing algorithm computes the output vectors, whose Euclidean lengths are then calculated; the vector of category lengths at each time step forms the output of the advanced capsule layer.
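A sketch of routing-by-agreement between the two capsule layers is given below, reusing the squash function above. The tensor layout is an assumption; n_iters = 3 matches the best value found in Section 5.3.

import torch
import torch.nn.functional as F

def dynamic_routing(u_hat, n_iters=3):
    """Routing between primary and advanced capsules.

    u_hat: (batch, n_primary, n_classes, dim) prediction vectors produced by
    the primary capsules for each expression category.
    """
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)    # routing logits
    for _ in range(n_iters):
        c = F.softmax(b, dim=2)                               # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)              # weighted sum over primaries
        v = squash(s)                                         # (batch, n_classes, dim)
        # increase the weight of primary capsules that agree with the output
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v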

The capsules of each time slice are also input into the attention layer, which lets the network focus on finding the salient frames of the input related to the expression category. The sigmoid activation of this layer predicts the importance of each frame: the attention layer outputs, at each time step, an attention factor between 0 and 1, selecting salient frames while suppressing frames irrelevant to the expression category. The temporal attention mechanism is thus realized through the output of the attention layer. Finally, the fusion layer combines the output of the advanced capsule layer with the output of the attention layer: time slices with large attention factors correspond to class-related salient image frames, and time slices with small attention factors correspond to class-irrelevant frames. The final prediction is obtained as the weighted sum, over time, of the advanced capsule layer outputs and the attention factors, where the attention factor at each time step weights the Euclidean length of the corresponding capsule output vector; the attention factors control which salient image frames transmit information. A probability threshold is chosen, and when the prediction exceeds it, the corresponding expression category is output. The overall framework of the attention capsule network model is shown in Figure 2.
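The fusion step can be sketched as follows; normalizing the attention factors over time is an implementation assumption.

import torch

def attention_fusion(v_lengths, att):
    """Fuse advanced-capsule outputs with temporal attention factors.

    v_lengths: (batch, T, n_classes) Euclidean lengths of the advanced
               capsule outputs at each time slice
    att:       (batch, T, n_classes) sigmoid attention factors in (0, 1)
    Returns the attention-weighted prediction per class; thresholding it
    (e.g., at 0.5) yields the final decision, as described in the text.
    """
    weights = att / (att.sum(dim=1, keepdim=True) + 1e-8)     # normalize over time
    return (v_lengths * weights).sum(dim=1)                   # (batch, n_classes)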

5. Experimental Results and Analysis

To verify the effectiveness of the proposed facial expression recognition method using a CNN and an improved capsule network model, experimental evaluation was performed on the BU-3DFE and JAFFE datasets. The proposed algorithm is compared experimentally with the methods of references [23], [15], and [19].

5.1. Experimental Datasets
5.1.1. JAFFE

The JAFFE dataset contains 213 images in total: 10 Japanese female subjects, each displaying 7 expressions. In the preprocessing stage, all images are uniformly normalized to 150 × 110 pixels before feature extraction. Figure 3 shows example images from the JAFFE dataset.

5.1.2. BU-3DFE

The dataset provides 2D images together with 3D models of 7 typical expressions. It includes 100 subjects, of whom 56% are female and 44% are male, varying in age and ethnic origin. Figure 4 shows example images from the BU-3DFE dataset.

5.2. Experimental Setup

In the proposed facial expression recognition method using a CNN and an improved capsule network model, the feature bin is formed by merging multiple subsequent residual blocks, with the single-layer convolution at the end of the residual block chain (RBC) truncated. The single-layer convolution consists of 32 convolution kernels of size 256 × 7 × 7 with a stride of 2, so the dimensions of the 8 parallel convolutional layers in the feature bin are all 32 × 8 × 8. Eight parallel convolutional layers are used in order to establish mid-level convolutional features for each expression: the number of parallel layers should be greater than, yet close to, the number of expression classes (7); requiring exactly 7 would be overly restrictive, so the constraint is relaxed to 8. The length of each class vector in the class vector bin is 16. To reconstruct from the activated class vector, it is first converted into an 8 × 8 × 32 feature block through a fully connected mapping, and the image is then restored by distributed interleaved convolution modules with 128, 32, 32, and 32 kernels of sizes 6 × 6 × 32, 9 × 9 × 128, 6 × 6 × 32, and 10 × 10 × 32 and strides of 2, 1, 2, and 2. Width and height are set to the model's uniform input scale of 128 × 128.

5.3. Analysis of Parameter Performance

To determine the values of the dynamic routing iteration count, the illumination suppression coefficient, and the combination coefficient of the proposed method, experiments were performed on the JAFFE and BU-3DFE datasets. The number of dynamic routing iterations ranges over 1–10, and both the illumination coefficient and the combination coefficient range over 0.1–1. After extensive experiments, the most representative results are shown in Figures 5–7.

It can be seen from Figures 5–7 that the recognition rates on both the JAFFE and BU-3DFE datasets peak when the number of dynamic routing iterations is 3. The model works best when the illumination coefficient and the combination coefficient are 0.2 and 0.4, respectively. Therefore, in the following experiments, the number of routing iterations is set to 3, the illumination coefficient to 0.2, and the combination coefficient to 0.4.

6. Results of Key Point Detection

To illustrate key point detection, the proposed method is compared with existing face key point detection models. Figure 8 compares the key point detection results of the algorithms of references [23], [15], and [19] and the proposed method on the JAFFE and BU-3DFE datasets.

It can be seen from Figure 8 that the proposed method significantly outperforms the three existing methods of references [23], [15], and [19]. This shows that when training on original face images for facial expression recognition, using rough geometric constraints as ground-truth key points introduces large errors; mapping the features to a high-dimensional space by random partitioning is therefore needed.

6.1. Effect Verification of Illumination and Posture Preprocessing

To verify the effects of illumination and posture preprocessing, training was performed on the JAFFE and BU-3DFE datasets, and validation was performed on the CK+ and Multi-PIE datasets. In each sampling step, a single expression image is processed by the illumination and posture techniques to convert the camera projection perspective and illumination conditions, generating a mini-batch of 32 training samples. The experimental results are shown in Table 1.

Table 1 shows that, after illumination and posture preprocessing, the cross-dataset recognition accuracy of several deep learning methods improves greatly, verifying the effectiveness of the proposed illumination and pose normalization method.

6.2. Recognition Result

To verify the effectiveness and superiority of the proposed algorithm, Tables 2 and 3 show the recognition rates of the various expressions at different pose angles on the BU-3DFE and Multi-PIE datasets. Figures 9 and 10 show the best-performing confusion matrices of the proposed method for each expression on the JAFFE and BU-3DFE datasets.

Tables 2 and 3 show that the recognition rate of each expression at different poses on the JAFFE dataset exceeds 90%. On the BU-3DFE dataset, the highest recognition rate for a single expression reaches 86.13%, and at nonzero posture angles the recognition rate of each expression remains higher than at 0°, because the proposed method improves the smoothness of the image and reduces the distortion of the face texture, increasing the recognition rate.

As can be seen from Figures 9 and 10, the accuracy rates on the JAFFE and BU-3DFE datasets reach 96.66% and 80.64%, respectively. Disgust is the most difficult expression to identify on the JAFFE dataset, and fear on the BU-3DFE dataset, with correct recognition rates of 95.23% and 75.83%, respectively. The reason is that these expressions produce similar texture changes around the eyes.

To further verify the effectiveness and superiority of the proposed algorithm, a comprehensive comparison is made with existing methods on the JAFFE and BU-3DFE datasets. During the experiments, training and test subjects are kept independent as far as possible (a subject-independent protocol), and an SVM is used as the classifier. The experimental results are shown in Tables 4 and 5.

Tables 4 and 5 show that, under the same classifier, the proposed method obtains higher recognition accuracy than the other expression recognition methods. The reason is that the proposed method fully extracts light-insensitive features, suppresses boundary artifacts at abrupt changes of light, and reduces image noise. Meanwhile, mapping features to a high-dimensional space through random partitioning helps distinguish similar-looking expressions. The proposed method can therefore effectively improve the recognition accuracy of CNN models.

7. Conclusion

A new facial expression recognition method based on an improved capsule network model is proposed. It reduces image noise by adaptive preprocessing of the image illumination, reduces model complexity, and improves accuracy by using random partitioning. The improved model adds an attention layer after the primary capsule layer of the capsule network, which increases attention to salient parts by weighting: it automatically selects the image frames most relevant to the expression class and ignores irrelevant frames (such as background). The attention layer realizes the attention mechanism by selecting the salient time slices, thereby reducing overfitting of the model. Experimental results show that the improved capsule model can effectively classify facial expressions under unconstrained conditions, makes up for the deficiency of pure deep convolutional networks in acquiring the sparse features hidden in discriminative texture, and improves the generalization ability of existing expression classification models to differences in illumination and pose.

In future facial expression recognition work, we plan to integrate the attention matrix into the attention capsule network, to use the attention capsule network for weakly labeled, semisupervised expression image detection, and to apply it to other large-scale data problems with low discriminability.

Data Availability

The data included in this paper are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.