Abstract

Recognizing facial expressions accurately and efficiently is of great significance to medicine and other fields. Aiming at the low accuracy of traditional facial expression recognition methods, an improved recognition method is proposed. The proposed method conducts continuous adversarial training between the discriminator and the generator of a generative adversarial network (GAN) to enhance the extraction of image features from the dataset under test, and thereby achieves high-accuracy recognition of facial expressions. To reduce the amount of computation, the GAN generator is improved based on the idea of the residual network: the image is first reduced in dimension and then processed, which preserves the high accuracy of the recognition method while improving real-time performance. The experimental part of the paper uses the JAFFE, CK+, and FER2013 datasets for simulation verification. The proposed recognition method shows clear advantages on datasets of different sizes, with average recognition accuracy rates of 96.6%, 95.6%, and 72.8%, respectively, which demonstrates its generalization ability.

1. Introduction

Recognizing facial expressions can provide a more comprehensive understanding of people’s inner world [1]. It has many applications in medicine, transportation, culture, and education [2–5]. Therefore, the recognition and analysis of facial expressions have important research significance and value.

At present, researchers have carried out extensive work on image-based expression recognition, the purpose of which is to accurately classify and recognize the seven basic emotional expressions in facial images [6, 7]: anger, disgust, fear, happiness, sadness, surprise, and neutrality.

Traditional facial expression feature extraction uses mathematical methods to calculate and process facial expression images. It can be divided into two situations: processing static images and processing dynamic images. Statistical methods, Gabor wavelets, and local binary patterns belong to the feature extraction of static images [8, 9]. Geometric methods, optical flow methods, and model-based methods belong to the feature extraction of dynamic images [10–12]. However, image acquisition is diverse and complex, and traditional facial expression recognition methods face the problem of nonlinear uncertainty of sample data. The features selected in facial expression feature extraction have poor representation ability [13] and need to be extracted manually according to human experience. These problems greatly affect the recognition accuracy of the model, resulting in poor generalization ability.

It should be noted that the essence of facial expression recognition research is to optimize and analyze massive data [14, 15]. Benefiting from the development of artificial intelligence and big data technology, a deep network model can extract effective image features from massive multidimensional image data through continuous iterative learning of a multilayer network. Based on its strong learning ability, it can classify facial expressions more accurately and quickly than traditional facial expression recognition methods [16–19]. Reference [20] analyzed time-series facial expression information with a part-based hierarchical bidirectional recurrent neural network and extracted facial temporal features from the dataset, enabling comprehensive analysis of facial expressions. Reference [21] proposed a method based on the fusion of a deep belief network (DBN) and local features. This method extracts the eyebrows, eyes, and mouth, which carry rich expression information, as local expression images. It also combines the Log-Gabor feature, which carries texture information, with the second-order histogram of the gradient direction feature, which carries shape information, to realize facial expression recognition. Reference [22] combined spatio-temporal features and used deep residual networks to extract features. Reference [23] used three channels to extract features of expression images, respectively; the extracted features are then concatenated and sent to the next layer for processing.

Considering the previous work, the network designed in this paper mainly has the following innovations:

(1) Through continuous adversarial training of the generator and discriminator in the GAN, deep extraction of the features of the processed expression dataset is realized, thereby improving the performance of facial expression classification and recognition.

(2) The performance and speed of the network are improved. Introducing the idea of the residual network into the generator improves operational efficiency while preserving accuracy.

In addition, the second section introduces related theoretical methods. The third section introduces the improved network and explains the structure of the discriminator and generator. The fourth section presents the experimental results and analysis on different datasets. The fifth section is the conclusion.

2. Related Theoretical Methods

2.1. General Steps

Face recognition technology includes four steps: face detection, face alignment, face representation, and face matching, as shown in Figure 1. The face detection module detects the position of the face in the input image. The face alignment module automatically locates key points of the face in the input, such as the eyebrows, eyes, corners of the mouth, nose tip, and contour points. Face characterization locates a face picture from the above two steps and extracts from it, or converts it into, a feature vector. In the face matching part, the extracted feature vector is compared with those in the database; based on the similarity between the two, it is possible to determine whether they belong to the same person in the database.

The image needs to be preprocessed to improve the accuracy of judgment [24]. The advantages of traditional face key point detection algorithms are a clear architecture and easy interpretability. However, their operational efficiency is not high, and they are not suitable for processing a large number of images.

For face characterization processing and analysis, the data feature vector often contains information such as the positions of the eyebrows, nose, and eyes, and even additional information such as contour and shape. The more classic methods include the HOG method, the Haar wavelet method, and the eigenface method. However, traditional methods are designed to extract features from frontal views of a human face, and their performance degrades on profile views [25].

Face matching generally compares the extracted face feature vector with those in the database. If the distance between feature vectors is small, the identity information is output. If no face in the database matches, the output is "unrecognized."

2.2. Convolutional Neural Network (CNN)

CNNs are good at processing images [26]. Traditional face recognition methods show poor results when facing complex scenes, whereas CNN-based deep learning methods can automatically extract features from a large amount of image data and perform well in complex scenes.

The CNN model is essentially a deep feedforward model that updates its parameters through backpropagation. To obtain better results, the kernels of the convolutional and pooling layers generally need to be designed and continuously combined to obtain better image features.
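As an illustration of this stacking of convolution and pooling, the following is a minimal sketch of a small CNN classifier in TensorFlow/Keras. The layer widths and the 48 × 48 × 3 input are assumptions for illustration, not the network used later in this paper:

```python
import tensorflow as tf

# Minimal CNN sketch: alternating convolution and max-pooling blocks,
# followed by fully connected layers, as described above.
def build_small_cnn(num_classes: int = 7) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(48, 48, 3)),  # assumed input size
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),           # halve the spatial size
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

model = build_small_cnn()
model.summary()
```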

2.3. Generative Adversarial Network

Due to its multilayer network structure, the convolutional neural network also has many parameters to set, which makes the CNN face recognition training process fragile [27]. For face recognition research and analysis, subtle changes in the CNN structure or small adjustments of parameters can lead to deviations in the recognition results.

As a deep learning model that is widely used in current image analysis, GAN can solve the problem of instability in the training process through adversarial learning methods.

A typical GAN consists of two parts, namely, a generator G and a discriminator D. During training, these two subnetworks play a game, as shown in Figure 2.

First, the generated image and the real image are input into the discriminator at the same time, and the discriminator is trained. As training progresses, the pictures generated by the generator become more and more realistic, and the classification ability of the discriminator gradually improves. Finally, the training process reaches a state of convergence: the discriminator cannot distinguish the true and false of the input image, and the generated image is the same as the real image; that is, the Nash equilibrium state is reached. The training process of the entire game can be described by the following value function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

where $\mathbb{E}_{x \sim p_{\text{data}}(x)}$ and $\mathbb{E}_{z \sim p_z(z)}$ are the expectation functions, x is the real image, and z is the input of the generator. G converts the variable z into a generated image G(z), and D(G(z)) is the probability that the generated image is judged to be a real image. The variable z is a sample from the distribution $p_z(z)$, and the ideal generated distribution $p_g$ should converge to the data distribution $p_{\text{data}}$. Practice has proved that in the generator, maximizing the logarithm $\log D(G(z))$ is better than minimizing the logarithm $\log(1 - D(G(z)))$.

Since the GAN network has two models, the loss of the discriminator is as follows:

$$L_D = -\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] - \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2}$$

When training the loss function of the generator, the discriminator is assumed by default to have the best discriminating ability. The $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]$ part is then a constant, so the loss of the generator is as follows:

$$L_G = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{3}$$
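To make these losses concrete, below is a minimal sketch of the two loss functions in TensorFlow, assuming a discriminator that outputs a real/fake probability. It implements the standard non-saturating variant (maximizing log D(G(z)), as noted above) and is an illustrative sketch, not the exact training code of this paper:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()  # cross-entropy on probabilities

def discriminator_loss(d_real: tf.Tensor, d_fake: tf.Tensor) -> tf.Tensor:
    # L_D = -E[log D(x)] - E[log(1 - D(G(z)))]
    real_loss = bce(tf.ones_like(d_real), d_real)    # real images -> label 1
    fake_loss = bce(tf.zeros_like(d_fake), d_fake)   # generated images -> label 0
    return real_loss + fake_loss

def generator_loss(d_fake: tf.Tensor) -> tf.Tensor:
    # Non-saturating form: maximize log D(G(z)), i.e., minimize -E[log D(G(z))],
    # which trains better in practice than minimizing log(1 - D(G(z))).
    return bce(tf.ones_like(d_fake), d_fake)
```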

3. Method

3.1. Discriminator Network

For the discriminator network in the GAN, this paper uses the VGG-16 network as the backbone structure [28]; the network structure is shown in Figure 3. The discriminator takes either the real image x or the generated image G(z) as input. When the input is x, the correct output of the discriminator is 1, and when the input is G(z), the correct output is 0. Leaky-ReLU is used as the nonlinear activation function in each convolutional layer of the discriminator.

First, two convolution-and-pooling operations are performed on the image, each consisting of two convolutions and one max pooling. Then, three convolution-and-pooling operations are performed, each consisting of three convolutions and one max pooling. Finally, there are three fully connected layers and one Softmax layer. Similar to the traditional generative adversarial network, the discriminator mainly judges the authenticity of its input image. The input image has the same size and dimension as the generated image, both 3 × 48 × 48. The adversarial loss is defined as follows:

$$L_{adv} = \ell_D\big(D\big(S(F(x; \theta_F); \theta_S); \theta_D\big)\big) \tag{4}$$

where x is the real image, F is the feature extractor, $\theta_F$ is the parameter of the feature extractor, S is the feature synthesizer, $\theta_S$ is the parameter of the feature synthesizer, D is the discriminator, $\theta_D$ is the parameter of the discriminator, and $\ell_D$ is the loss calculation function of the discriminator. Then, the total loss function is as follows:

$$L = L_{cls} + L_{adv} \tag{5}$$
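The following sketch shows a discriminator of the kind described above: VGG-16-style blocks with Leaky-ReLU, three fully connected layers, and a Softmax output over real/fake. The channel widths are assumptions for illustration; Figure 3 defines the exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters: int, n_convs: int):
    # n_convs 3x3 convolutions with Leaky-ReLU, then one 2x2 max pooling.
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    return layers.MaxPooling2D(2)(x)

def build_discriminator() -> tf.keras.Model:
    inp = layers.Input(shape=(48, 48, 3))           # same size as generated image
    x = conv_block(inp, 64, 2)                      # two blocks of 2 convs + pool
    x = conv_block(x, 128, 2)
    x = conv_block(x, 256, 3)                       # three blocks of 3 convs + pool
    x = conv_block(x, 512, 3)
    x = conv_block(x, 512, 3)
    x = layers.Flatten()(x)
    x = layers.Dense(1024)(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Dense(256)(x)
    x = layers.LeakyReLU(0.2)(x)
    out = layers.Dense(2, activation="softmax")(x)  # real vs. generated
    return tf.keras.Model(inp, out)
```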

3.2. Generator Network

The generator network in the GAN takes the facial expression image x as input, where x has dimensions 3 × 48 × 48. The network structure is shown in Figure 4. Some previous segmentation methods use an encoder-decoder structure [29], which first down-samples and then gradually up-samples.

This paper uses a U-shaped structure for the generator. The feature extractor is used to extract the features of the input image. The input image resolution is 3 × 48 × 48, and the backbone network uses ResNet-18 [30]. Unlike the traditional generative adversarial network, the generator input is not random noise but a facial expression image. First, the feature extractor performs a 3 × 3 convolution operation with a stride of 1 on the input image, followed by batch normalization and ReLU. Second, the convolution operations of 4 residual modules are performed in turn. Then, an average pooling operation with a 2 × 2 window is performed after the convolutions, followed by dropout. Finally, the extracted features are input to two fully connected layers and one Softmax layer. The 512-dimensional feature vector is classified into 7 types of facial expressions, and the facial expression recognition results are obtained. The classification loss of the classifier is defined as

$$L_{cls} = \ell_{cls}\big(C(F(x; \theta_F); \theta_C), y\big) \tag{6}$$

where x is the original input image, $\theta_F$ is the parameter of the feature extractor, F is the feature extractor, $\theta_C$ is the parameter of the classifier, C is the classifier, y is the real label, and $\ell_{cls}$ is the classification loss.
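A minimal sketch of such a ResNet-18-style feature extractor with a 7-class classifier head is given below. The residual block widths and counts are simplified assumptions for illustration; Figure 4 and [30] define the actual backbone:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters: int):
    # Standard residual block: two 3x3 convolutions plus an identity shortcut.
    shortcut = x
    if shortcut.shape[-1] != filters:                 # match channel width
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_extractor_classifier(num_classes: int = 7) -> tf.keras.Model:
    inp = layers.Input(shape=(48, 48, 3))             # 3 x 48 x 48 input
    x = layers.Conv2D(64, 3, strides=1, padding="same")(inp)  # 3x3 conv, stride 1
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for filters in (64, 128, 256, 512):               # 4 residual modules
        x = residual_block(x, filters)
    x = layers.AveragePooling2D(2)(x)                 # 2x2 average pooling
    x = layers.Dropout(0.5)(x)                        # dropout after pooling
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)       # 512-dim feature vector
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```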

At the same time, this paper adds a residual module to the generator, as shown in Figure 5(a). The structure of its forward-propagation convolution unit is shown in Figure 5(b).

Through the adversarial training between the generator and the discriminator, the feature extractor’s ability to extract features and the discriminator’s recognition ability are both improved. The feature synthesizer has a structure symmetrical to the feature extractor and is mainly composed of convolutional layers and upsampling layers. After continuous convolution and upsampling operations, the generated output image is restored to the original size.
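As a sketch, such a feature synthesizer can be written as a stack of upsampling and convolution layers that mirrors the extractor and restores the 3 × 48 × 48 resolution. The feature shape and layer widths here are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_synthesizer(feature_shape=(6, 6, 512)) -> tf.keras.Model:
    # Decoder symmetrical to the extractor: repeated upsampling + convolution
    # until the 48 x 48 x 3 image resolution is restored.
    inp = layers.Input(shape=feature_shape)
    x = inp
    for filters in (256, 128, 64):
        x = layers.UpSampling2D(2)(x)                 # 6 -> 12 -> 24 -> 48
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    out = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)  # RGB output
    return tf.keras.Model(inp, out)
```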

4. Experimental Results and Analysis

The experiment uses the TensorFlow framework to train the network model on the simulation datasets. Python is used as the programming language, and NVIDIA CUDA 9.0 is used for GPU-accelerated computing. The specific system development environment of the face recognition simulation experiment is shown in Table 1.

4.1. Parameter Setting

When the face recognition network is trained, the optimization method uses SGD with the momentum parameter set to 0.9 and a weight decay rate of $10^{-4}$. The learning rate follows a polynomial reduction strategy: the initial rate lr = $3 \times 10^{-3}$ is multiplied by $(1 - \text{current\_iter}/\text{max\_iter})^{\text{power}}$, where power = 0.9, current_iter is the current number of iterations, and max_iter is the maximum number of iterations in the training process. For the discriminator network, the Adam optimization method is used with betas = (0.9, 0.99) and an initial lr = $1 \times 10^{-4}$; its learning rate reduction strategy is the same as that used for the recognition network. Taking into account the GPU memory limitation, the image size in the experiment is set to 3 × 48 × 48 pixels.
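This polynomial schedule corresponds directly to tf.keras.optimizers.schedules.PolynomialDecay; a sketch of the optimizer setup under the stated hyperparameters follows (max_iter is a placeholder value, and the weight decay handling is left as a comment since its API location varies across TensorFlow versions):

```python
import tensorflow as tf

max_iter = 20000  # placeholder: total training iterations

# lr = initial_lr * (1 - current_iter / max_iter) ** power
def poly(initial_lr: float) -> tf.keras.optimizers.schedules.PolynomialDecay:
    return tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=initial_lr,
        decay_steps=max_iter,
        end_learning_rate=0.0,
        power=0.9,
    )

# SGD for the recognition network: momentum 0.9 (weight decay 1e-4 applied
# separately, e.g., via kernel regularizers or a decoupled-decay optimizer).
gen_opt = tf.keras.optimizers.SGD(learning_rate=poly(3e-3), momentum=0.9)

# Adam for the discriminator: betas (0.9, 0.99), initial lr 1e-4.
disc_opt = tf.keras.optimizers.Adam(
    learning_rate=poly(1e-4), beta_1=0.9, beta_2=0.99)
```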

4.2. Evaluation Index

To measure the recognition performance of our method, an objective and fair evaluation index should be used. Accuracy (AC), precision (P), and recall (R) are commonly used indicators in big data image classification research and can be used to analyze the performance of face recognition results. The calculation formulas are shown in formulas (7)–(9) below.

P represents how many of the samples that the model predicts to be positive truly belong to the positive category. R represents how many of the samples whose true category is positive are predicted as positive by the model.

For classification problems, the combination of the model prediction result and the true category of the sample can be divided into true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The three indices are then computed as

$$AC = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

The precision and recall rate can be represented by a confusion matrix, as shown in Table 2.
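These quantities can be read off a confusion matrix directly; the following sketch computes AC and per-class P and R from predicted and true labels with NumPy (the label arrays are illustrative):

```python
import numpy as np

def confusion_and_metrics(y_true: np.ndarray, y_pred: np.ndarray, n: int):
    # Build an n x n confusion matrix: rows = true class, cols = predicted class.
    cm = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    ac = np.trace(cm) / cm.sum()                     # AC = correct / total
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # P = TP / (TP + FP)
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # R = TP / (TP + FN)
    return cm, ac, precision, recall

y_true = np.array([0, 1, 2, 2, 1, 0])                # illustrative labels
y_pred = np.array([0, 1, 2, 1, 1, 0])
cm, ac, p, r = confusion_and_metrics(y_true, y_pred, n=3)
print(cm, ac, p, r, sep="\n")
```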

At the same time, the loss value function c is used to evaluate the model and to measure the quality of the training performance of the GAN model; the appropriate number of iterations is determined during discriminator training. In this paper, the cross-entropy loss is used to express the probability that an input sample is predicted to belong to each class, and its expression is as follows:

$$c = -\sum_{i} y_i \log a_i \tag{10}$$

where y is the true classification value, a is the predicted value, and c represents the loss value.

4.3. Training Process

We analyze the recognition and classification performance of the GAN model on the facial expression data and explore the convergence of the training process. Figure 6 shows the convergence behavior of the training process on the expression dataset.

By the 10th iteration, the recognition accuracy on the training set samples has reached 95%; at the end of 15 iterations, the accuracy is close to 100%. At the same time, analysis of the loss function value of each iteration shows that the training loss decays quickly and effectively before the 10th iteration, and at the 18th iteration the loss function value is close to 0. In summary, the improved GAN expression recognition method has good convergence performance.

4.4. Simulation Analysis of General Experimental Dataset

The experimental simulation analysis is carried out using the methods proposed in references [21–23] and in this paper. To verify the generalization performance of the proposed recognition method on datasets of different sizes, the JAFFE, CK+, and FER2013 datasets are selected as the small, medium, and large experimental simulation datasets, in turn.

The JAFFE dataset was created by the Michael Lyons team. The image data collected in this dataset contain the expressions of 10 Japanese female participants, with a total of 213 facial images. The JAFFE dataset contains 7 types of basic expressions: anger, happiness, sadness, surprise, fear, disgust, and neutral.

The CK+ dataset comes from the Patrick Lucey team’s expansion of the Cohn–Kanade dataset. The CK+ dataset collected facial expression images of 123 different people, with a total of 593 expression sequences and 951 image samples. The image pixel size is 3 × 48 × 48.

The FER2013 dataset comes from the Kaggle competition and consists of 35886 facial expression pictures: 28708 training samples, 3589 public verification samples, and 3589 private verification samples. Each image is a 48 × 48 grayscale image.

4.4.1. JAFFE Dataset Experiment

This paper chooses the JAFFE dataset as the small dataset to simulate and verify the performance of the proposed GAN facial expression recognition method. Table 3 shows the results on the JAFFE dataset under different methods.

It can be seen from Table 3 that, in terms of facial expression recognition on the JAFFE dataset, the accuracy of our method is 96.6%, which is 0.9%, 2.1%, and 2.3% higher than references [21–23], respectively. The proposed method has no obvious advantage over the comparative methods in terms of simulation runtime. Nevertheless, when performing expression recognition on small datasets, the method proposed in this paper can be selected for accurate discrimination.

4.4.2. CK + Dataset Experiment

The CK+ dataset is used as the medium-sized dataset for facial expression recognition in this paper, and the other methods are again used for comparative analysis against ours. The face recognition performance on the CK+ dataset under different methods is shown in Table 4.

It can be seen from Table 4 that our method achieves an accuracy of 95.6% on the CK+ dataset, which is 5.3% higher than that of reference [23]. The PCNN used in reference [23] has more network layers and suffers from vanishing gradients during training, which causes a large gap in recognition accuracy compared with our method. The simulation time of the identification method in this paper is 84.23 s, which is 5.3 s shorter than that of reference [21]. Compared with reference [23], the simulation time is close, but reference [23] has no advantage in recognition accuracy. Therefore, the GAN has good accuracy and real-time performance for facial expression recognition on medium-sized datasets.

4.4.3. FER2013 Dataset Experiment

Table 5 shows the simulation analysis results on the large dataset using different facial expression recognition and classification methods.

From Table 5, the expression classification and recognition accuracy of all methods on the FER2013 dataset is below 75%. This is because there are a certain number of erroneous labels in the FER2013 dataset, which lowers the accuracy of every recognition method. However, our method has the highest accuracy, 72.8%. The running time of our method is 134.23 s, which is more than 10 s shorter than those of references [21–23]. Therefore, the improved GAN expression recognition method proposed in this paper can also be used for large-scale dataset analysis.

To further illustrate the recognition performance, a confusion matrix is used to display the recognition results obtained by our method, as shown in Figure 7. The accuracy of the method for the recognition of anger, disgust, fear, happiness, sadness, surprise, and neutral expressions is 65%, 62%, 57%, 88%, 58%, 85%, and 67%, respectively.

Figure 7 shows that the method performs well in identifying “happy” and “surprised,” with accuracy rates reaching 88% and 85%, respectively. In addition, it can be noticed that the generative adversarial network’s ability to recognize the “fear” expression is low, with an accuracy rate of 57%. This is attributable to the label noise in the FER2013 dataset.

In summary, compared with other methods, our method has higher accuracy and operational efficiency on datasets of different sizes, which shows that it has excellent generalization ability.

5. Conclusion

This paper proposes a facial expression recognition method based on GAN. The method relies on continuous adversarial training between the generator and discriminator structures of the GAN, which realizes accurate extraction of dataset features and ensures accurate recognition of facial expressions. By improving the generator structure of the GAN with the residual network idea, the amount of calculation of the recognition model is reduced. Finally, on general datasets of different sizes, our method is validated for efficient facial expression recognition, and it is shown to have clear advantages in recognition accuracy and processing speed.

In the future, we plan to add an attention mechanism to the network to further improve accuracy, and to prune the network to improve efficiency, striving for practical deployment.

Data Availability

The data included in this paper are available without any restriction.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the “Software Engineering” (Key Subjects’ Construction Project) in Guangdong University of Education.