Abstract

With the growing integration of the Internet and industry, facial expression recognition (FER) technology is widely applied in wireless communication and mobile edge computing. Sparse representation-based classification is a hot topic in computer vision and pattern recognition and has become a commonly used family of image classification algorithms for FER in recent years. To improve the accuracy of FER systems, this study proposes a sparse representation classifier embedding subspace mapping and support vector (SRC-SM-SV). Building on the traditional sparse representation model, SRC-SM-SV maps the training samples into a subspace and extracts rich, discriminative features by using the structural and label information of the training samples. SRC-SM-SV also integrates a support vector machine to enhance the classification performance of sparse representation coding. The model is solved by alternating iteration, which keeps the optimization process simple and efficient. Experiments on the JAFFE and CK+ datasets demonstrate the effectiveness of SRC-SM-SV in FER.

1. Introduction

At present, many countries are actively developing intelligent technologies centered on mobile edge computing [1]. The data architectures and data forms in mobile edge computing are complex and diverse, which greatly limits applications. The main successes of artificial intelligence have come from image processing, natural language processing, social networks, robotics, and so on, and companies in many fields build their applications on artificial intelligence technology. In mobile edge computing, vision-based human-computer interaction is a very active research field, within which facial expression recognition (FER) technology is widely used [2]. Recognizing facial expressions with artificial intelligence can promote the rapid development of human-computer interaction technology. The main function of FER is to recognize expressions in natural environments in order to judge people’s emotions and inner activities, and a series of FER systems has been built for this purpose [3, 4].

The applications of FER include the following: (1) Security monitoring. Face recognition systems are already widely used. They can recognize specific faces in a complex crowd, predict the behavior of the recognized people from their facial state, and analyze their intentions. To prevent dangerous situations, FER systems can be placed in specific public places, and facial expressions can be used to help determine whether someone is engaging in illegal activities or entering restricted areas. If an abnormality is found, the system sounds an alarm to avoid emergencies in public places. (2) Medical care assistance. Many hospitals have introduced robots that can detect the facial expressions of patients and use them to judge whether a patient may have a physical problem. For example, when the patient’s facial expression is stable, the patient’s physical condition is likely good; when the patient’s face shows painful or uncomfortable expressions, the patient may have a problem. When a problem occurs, the ward monitor immediately sends an alarm to inform the medical staff that a patient in the ward needs emergency treatment. This type of robot can also be used in the homes of elderly people who live alone. (3) Safe driving. To prevent traffic accidents, driver-assistance safety systems based on FER have begun to spread around the world. Driving for a long time causes fatigue, which may lead to improper driving, so the driver needs to stop and rest for a period. A camera installed in the vehicle monitors the driver’s facial expression in real time. When signs of fatigue appear, the system reminds the driver to rest and helps the driver to stop. Such a system greatly improves the driver’s safety and reduces avoidable traffic accidents. (4) Entertainment. The development of interactive games has enriched people’s lives. Such games respond to changes in the player’s facial expression. For example, when the expression shows panic, the game, after receiving feedback from the expression-monitoring camera, immediately adds relaxing elements to ease the player’s nervous mood. When the expression shows inattention, the game adds stimulating elements to make itself more engaging. (5) Criminal investigation. Analyzing subtle changes in a suspect’s expression can help determine whether the suspect is lying and assist the police in solving the case.

Generally speaking, FER extracts and analyzes the emotion-related features of facial images and judges the emotion they contain by using prior knowledge of human emotional information. In human-computer intelligent interaction, good and reliable FER technology enables the computer to understand people’s emotional states and to perceive and understand human social behavior, which makes the human-computer interaction system intelligent. In general, an FER system includes three stages: face detection and preprocessing, feature extraction, and facial expression classification [5]. In a complete FER process, the image is first obtained through external image acquisition equipment, and the face is detected and segmented to produce the facial expression image. Then, the expression-related features are extracted to obtain a good emotional feature representation of the facial expression image. Finally, a classification model is trained on these emotional features and used for recognition.

There are four commonly used kinds of facial expression feature extraction algorithms: geometry, texture, representation, and deep learning algorithms [6]. Geometry feature extraction algorithms aim to represent the structural changes of the face as a whole, mainly using the geometric relationships among facial points to extract facial expression features. For example, Lee [7] built a shape model and a texture model for the training samples at the same time and then combined them into active appearance models to obtain reliable expression features. Texture feature extraction methods use the characteristics of face image pixels to represent local, subtle changes in the facial expression image. Representative methods include the local binary pattern (LBP) [8] and the scale-invariant feature transform (SIFT) [9].
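To make the texture-feature idea concrete, the following is a minimal sketch of the basic 8-neighbour LBP operator with region-wise histograms. It is an illustration under stated assumptions rather than the feature extractor used in the experiments: the 3 x 3 grid, the toy image size, and the function names `lbp_code` and `regional_lbp_feature` are all hypothetical. With nine regions and 256 bins per region, the concatenated feature has 9 x 256 = 2304 entries, which matches the LBP dimensionality reported in Section 4.1.

```python
# Illustrative sketch only: basic 8-neighbour LBP with region-wise histograms.
# The 3 x 3 grid and the toy image size are assumptions, not the paper's settings.

NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """Return the 8-bit LBP code of pixel (r, c).

    Bit k is set when the k-th neighbour is at least as bright as the centre.
    """
    centre = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def regional_lbp_feature(img, grid=(3, 3)):
    """Concatenate one 256-bin LBP histogram per grid region."""
    height, width = len(img), len(img[0])
    rows, cols = grid
    hists = [[0] * 256 for _ in range(rows * cols)]
    # Border pixels are skipped because they lack a full 8-neighbourhood.
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            region = (r * rows // height) * cols + (c * cols // width)
            hists[region][lbp_code(img, r, c)] += 1
    return [count for hist in hists for count in hist]


# Toy 12 x 12 grayscale image with synthetic intensities.
toy_image = [[(7 * r + 13 * c) % 256 for c in range(12)] for r in range(12)]
feature = regional_lbp_feature(toy_image)
print(len(feature))
```

In practice the histograms are often normalised per region, and uniform-pattern LBP variants give shorter features; both choices are left out here to keep the sketch short.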

The last type uses deep learning to automatically learn and extract facial expression image features. For example, Kuo et al. [10] used a convolutional neural network (CNN) for FER and achieved good recognition performance. Li and Deng [11] used a deep bimanifold CNN (DBM-CNN) to learn the discriminative features of facial expression images.

The last stage of the FER process is facial expression classification. At this stage, the extracted feature representation and the labels of the training data are used to train a classifier for the FER system. Sparse representation-based classification (SRC) was proposed in 2009 [12]. This algorithm has been successfully applied to face recognition, especially when the samples are corrupted or occluded. However, because the SRC algorithm involves $\ell_1$-norm optimization, its computational cost grows sharply when the number of samples in the linear combination is large, which makes it unsuitable for practical applications. Researchers have proposed a large number of improved algorithms to solve this problem. These algorithms fall roughly into two types: one selects representative training samples, and the other refines the training sample information through dictionary learning. Selecting representative training samples compresses the data and reduces the computation scale, thereby speeding up sparse decomposition. Li et al. [13] selected the nearest neighbor samples as the representation data. Hui et al. [14] combined SRC with a local linear embedding strategy to speed up sparse decomposition. Ortiz and Becker [15] developed a sparse representation classification algorithm that uses linear regression to filter the training samples before sparse optimization, thereby reducing computation time.

Sparse classification methods based on dictionary learning can also accelerate sparse decomposition effectively. Dictionary learning yields a dictionary that is small in scale but rich in information [16, 17]. The most classical dictionary learning algorithm is the K-SVD algorithm proposed by Aharon et al. [18]. Zhang and Li [19] introduced a classification error term into the K-SVD algorithm and proposed the discriminative K-SVD (D-KSVD) algorithm, which learns a dictionary with discriminative ability. Similarly, Jiang et al. [20] made full use of the label information of the training samples and proposed the LC-KSVD algorithm with label consistency constraints. Because of the label constraints, the coding coefficients of training samples from the same class are similar, which improves the discriminative ability. Xu et al. [21] developed a within-class-similar discriminative dictionary learning algorithm, which improves the discriminative ability by restricting the intraclass scatter of the coding coefficients.

To reduce the computation scale, we consider projecting the original image into a low-dimensional subspace and embedding a multiclass support vector machine into the sparse representation classification algorithm. Based on this idea, this paper proposes a sparse representation classifier embedding subspace mapping and support vector machine (SRC-SM-SV). In detail, when learning the sparse codes, the proposed algorithm uses a Laplacian regularization term and principal component analysis to mine the geometric structure information of the sample features in a low-dimensional subspace. At the same time, the algorithm uses the label information to introduce a multiclass support vector machine, so that the learned sparse representation has better discriminative ability. A series of experiments on the Japanese female facial expression (JAFFE) [22] and extended Cohn-Kanade (CK+) [23] datasets compares the proposed algorithm with existing methods and demonstrates its effectiveness.

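The sparse representation classification algorithm based on sparse constraints comes from classical sparse theory. With the rapid development of mathematics-related fields, sparse representation methods based on sparse constraints have been widely used in image processing and other related fields and have achieved success in practical applications such as image restoration, classification, and segmentation. Each sample is expressed as a sparse linear combination of dictionary atoms. There are $N$ image samples with $C$ classes. Each sample is represented by the feature vector $\mathbf{x}_i$, and the feature dimension is $d$. All training samples are collected in a matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$, where $\mathbf{X} \in \mathbb{R}^{d \times N}$. $\mathbf{Y}$ is the class label matrix of $\mathbf{X}$, and $\mathbf{y}_i$ is the class label of $\mathbf{x}_i$. $\mathbf{D} \in \mathbb{R}^{d \times K}$ is the learned dictionary matrix with $K$ atoms. For the matrix $\mathbf{X}$, $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N] \in \mathbb{R}^{K \times N}$ is the sparse coding matrix on the dictionary $\mathbf{D}$. The objective function of sparse representation can be expressed as

$$\min_{\mathbf{D}, \mathbf{A}} \|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \|\mathbf{A}\|_0, \tag{1}$$

where $\|\cdot\|_F$ is the Frobenius norm and $\|\cdot\|_0$ is the $\ell_0$ norm. The second term of Equation (1) counts the number of nonzero entries of $\mathbf{A}$. $\lambda$ is an adjustable parameter that balances the coding coefficients and the sample reconstruction term. Given the dictionary $\mathbf{D}$, the minimization problem over the sparse coding coefficient matrix $\mathbf{A}$ is NP-hard. Equation (1) can be solved approximately by replacing the $\ell_0$ norm with the $\ell_1$ norm. To deal with the classification problem, some algorithms add a classification error term to the sparse representation framework. One representative algorithm is D-KSVD. Its objective function is

$$\min_{\mathbf{D}, \mathbf{W}, \mathbf{A}} \|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \alpha \|\mathbf{Y} - \mathbf{W}\mathbf{A}\|_F^2 + \beta \|\mathbf{W}\|_F^2 \quad \text{s.t. } \|\mathbf{a}_i\|_0 \le T, \tag{2}$$

where $\mathbf{W}$ is the parameter of a linear classifier, $\alpha$ and $\beta$ are two regularization parameters, and $T$ bounds the number of nonzero coefficients in each code.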
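The relaxation described above, in which the nonzero-count penalty of Equation (1) is replaced by an $\ell_1$ penalty, can be sketched for a single sample with the iterative shrinkage-thresholding algorithm (ISTA). This is a minimal, dependency-free illustration and not the solver used in this paper; the names `ista_sparse_code`, `soft_threshold`, `lam`, and the toy dictionary are assumptions introduced for the example.

```python
# Illustrative sketch only: ISTA for  min_a ||x - D a||_2^2 + lam * ||a||_1.
# This is a generic l1 solver, not the optimisation scheme of SRC-SM-SV.


def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.|: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista_sparse_code(D, x, lam, n_iter=500):
    """Return an approximate l1-regularised code of x on dictionary D.

    D is a list of d rows with K entries each; x is a list of length d.
    """
    d, K = len(D), len(D[0])
    # The gradient of ||x - D a||^2 has Lipschitz constant 2 * sigma_max(D)^2.
    # The squared Frobenius norm upper-bounds sigma_max(D)^2, so this step is safe.
    fro_sq = sum(D[i][k] ** 2 for i in range(d) for k in range(K))
    step = 1.0 / (2.0 * fro_sq)
    a = [0.0] * K
    for _ in range(n_iter):
        residual = [sum(D[i][k] * a[k] for k in range(K)) - x[i] for i in range(d)]
        grad = [2.0 * sum(D[i][k] * residual[i] for i in range(d)) for k in range(K)]
        a = [soft_threshold(a[k] - step * grad[k], step * lam) for k in range(K)]
    return a


def l1_objective(D, x, a, lam):
    """Evaluate ||x - D a||_2^2 + lam * ||a||_1."""
    d, K = len(D), len(D[0])
    err = sum((x[i] - sum(D[i][k] * a[k] for k in range(K))) ** 2 for i in range(d))
    return err + lam * sum(abs(v) for v in a)


# Toy overcomplete dictionary with three unit-norm atoms in two dimensions.
toy_D = [[1.0, 0.0, 0.6],
         [0.0, 1.0, 0.8]]
toy_x = [3.0, 0.5]
toy_code = ista_sparse_code(toy_D, toy_x, lam=2.0)
print(toy_code, l1_objective(toy_D, toy_x, toy_code, 2.0))
```

The conservative step size trades speed for simplicity; accelerated variants such as FISTA or coordinate descent are common choices when the dictionary is large.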

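As can be seen from Equation (2), the D-KSVD algorithm integrates dictionary learning and a linear classifier into one framework. To improve the discriminative ability of sparse representation classification algorithms, some studies embed the idea of the support vector machine into the sparse representation framework. For example, the objective function of the support vector-guided dictionary learning algorithm [24] is

$$\min_{\mathbf{D}, \mathbf{A}, \mathbf{W}, \mathbf{b}} \|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda_1 \|\mathbf{A}\|_F^2 + 2\lambda_2 \sum_{c=1}^{C} \mathcal{L}(\mathbf{A}, \mathbf{y}^c, \mathbf{w}_c, b_c), \tag{3}$$

where $\mathbf{y}^c$ and $(\mathbf{w}_c, b_c)$ are the label vector and classifier parameters of the $c$th class, respectively, and $\lambda_1$ and $\lambda_2$ are regularization parameters. $\mathcal{L}(\mathbf{A}, \mathbf{y}^c, \mathbf{w}_c, b_c)$ is the SVM term, $\mathcal{L}(\mathbf{A}, \mathbf{y}^c, \mathbf{w}_c, b_c) = \|\mathbf{w}_c\|_2^2 + \theta \sum_{i=1}^{N} \ell(\mathbf{a}_i, y_i^c, \mathbf{w}_c, b_c)$. The function $\ell(\cdot)$ is the loss function in SVM. $\theta$ is the penalty parameter.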
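As a rough illustration of what an SVM term of this kind measures, the sketch below evaluates a soft-margin objective for one binary (one-class-versus-rest) classifier on a set of sparse codes. It assumes the usual form, a squared weight norm plus a penalty parameter times the summed hinge loss; the exact loss used in [24] may differ (a squared hinge is a common alternative), and the names `svm_term`, `theta`, and the toy data are hypothetical.

```python
# Illustrative sketch only: value of a soft-margin SVM term on sparse codes,
# assuming the form ||w||^2 + theta * sum_i hinge(y_i * (w . a_i + b)).


def hinge(margin):
    """Standard hinge loss: zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def svm_term(codes, labels, w, b, theta):
    """Evaluate the SVM term for one class.

    codes  : list of sparse code vectors a_i
    labels : +1 if sample i belongs to the class, -1 otherwise
    w, b   : linear classifier weights and bias for this class
    theta  : penalty parameter weighting the hinge losses
    """
    regulariser = sum(v * v for v in w)
    loss = 0.0
    for a, y in zip(codes, labels):
        score = sum(wk * ak for wk, ak in zip(w, a)) + b
        loss += hinge(y * score)
    return regulariser + theta * loss


toy_codes = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
toy_labels = [1, -1, 1]
value = svm_term(toy_codes, toy_labels, w=[1.0, -1.0], b=0.0, theta=0.5)
print(value)
```

Only the third toy sample falls inside the margin, so it is the only one that contributes to the loss; samples like it are the ones that pull the codes toward better class separation when such a term is added to a dictionary learning objective.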

3. Sparse Representation Classifier Embedding Subspace Mapping and Support Vector

3.1. The Objective Function

The features of facial expression images are mostly high-dimensional. Therefore, the SRC-SM-SV algorithm tries to find a suitable subspace to reduce the feature dimensions and redundant information, so as to obtain more effective feature representation. In the subspace, the SRC-SM-SV algorithm fully utilizes label information of training data and adopts the Laplacian regularization term and principal component analysis term. Thus, the sparse representation is obtained by maximizing the interclass separability and minimizing the intraclass discreteness.

Denoting as a projection matrix, the new feature representation of image dataset can be written as . is the dictionary learned in the new feature space. The data in the real world is easily polluted by noise. In order to better mine the data structure, we establish the Laplacian regularization term based on dictionary atoms. The element of the similarity matrix is expressed as where means the -nearest neighbor function and is the -nearest neighbor parameter.

The Laplacian regularization term can be written as where is the Laplacian matrix, , and .
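The graph construction behind a Laplacian regularizer can be sketched as follows. The example builds a symmetric nearest-neighbor similarity matrix over a few toy "atoms" and forms the unnormalised Laplacian as the degree matrix minus the similarity matrix. Binary edge weights, the symmetrisation rule, and the names `knn_similarity` and `graph_laplacian` are assumptions made for illustration; the weighting actually used in Equation (4) may differ.

```python
# Illustrative sketch only: k-nearest-neighbour graph and unnormalised Laplacian.
# Binary weights and symmetrisation are assumptions, not the paper's definition.


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_similarity(atoms, k):
    """Return S with S[i][j] = 1 if j is among the k nearest neighbours of i
    (or i among those of j), and 0 otherwise."""
    n = len(atoms)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: squared_distance(atoms[i], atoms[j]))
        for j in others[:k]:
            S[i][j] = 1.0
            S[j][i] = 1.0  # keep the graph undirected
    return S


def graph_laplacian(S):
    """Return L = G - S, where G is the diagonal degree matrix of S."""
    n = len(S)
    return [[(sum(S[i]) if i == j else 0.0) - S[i][j] for j in range(n)]
            for i in range(n)]


# Four toy atoms forming two well-separated pairs.
toy_atoms = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
toy_S = knn_similarity(toy_atoms, k=1)
toy_L = graph_laplacian(toy_S)
for row in toy_L:
    print(row)
```

Every row of the resulting Laplacian sums to zero, which is why a quadratic form built from it penalises differences between quantities attached to neighboring nodes and ignores their common offset.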

Based on the above idea, the objective function of SRC-SM-SV is defined as

The first term in the objective function is the reconstruction error term. The second term is the Laplacian regularization term. The third term is the mapping reconstruction term. A similarity matrix with label information is constructed; the element in can be defined by

The fourth term in the objective function is the principal component analysis (PCA) term, which is used to mine the structure information of the data. The last term is the SVM term.

To simplify term, we define a matrix , where and . We obtain the following equation:

Then, the objective function of SRC-SM-SV is re-written as

3.2. Solution of Optimization Variables

The alternating optimization method is used to tune the variables of {}.

(1) First, we tune the variables . Following the Proposition in [25], let where ,

Equation (9) can be re-expressed as where .

When the variables , , are fixed, Equation (9) can be rewritten as

Equation (13) has a closed-form solution, which has the form as where and is the optimal solution of the following problem, where .

Then, according to the Proposition in [25], can be tuned as .

(2) Tune : with the other parameters fixed, Equation (9) can be rewritten as

Using the Lagrange dual method, can be computed as where is a diagonal matrix constructed from all the optimal dual variables. Then, matrix can be computed by , where denotes the pseudoinverse matrix.

(3) Tune : with the other parameters fixed, Equation (9) can be rewritten as

Equation (18) can be expanded as where .

We can obtain a closed-form solution of as where is the class label of in the th SVM.

(4) Tune and : with the other variables fixed, Equation (9) is transformed into a multiclass SVM classification problem

Here, we utilize the multiclass SVM [26] to solve Equation (21).

When the optimization procedure is completed, we obtain the optimal variables {}. For the testing sample , its sparse coding vector can be computed by Equation (20). Then, the sample is classified by the multiclass SVM.
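The final classification step can be sketched as a linear decision rule applied to the sparse code of a test sample. The example assumes that the learned multiclass SVM provides one weight vector and one bias per class and that the predicted class is the one with the largest score; this is the usual decision rule for linear multiclass SVMs, but the exact formulation in [26] is not reproduced here, and the names `predict_class`, `W`, and `b` are illustrative.

```python
# Illustrative sketch only: classify a sparse code with per-class linear scores,
# assuming the decision rule  argmax_c (w_c . a + b_c).


def predict_class(code, W, b):
    """Return (predicted class index, list of per-class scores).

    code : sparse coding vector of the test sample
    W    : list of per-class weight vectors
    b    : list of per-class biases
    """
    scores = [sum(wk * ak for wk, ak in zip(w_c, code)) + b_c
              for w_c, b_c in zip(W, b)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical three-class classifier acting on a three-dimensional code.
toy_W = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]
toy_b = [0.0, 0.0, -0.5]
toy_test_code = [0.1, 0.9, 1.2]
label, class_scores = predict_class(toy_test_code, toy_W, toy_b)
print(label, class_scores)
```

Because the rule is linear in the code, the cost of classifying a test sample is dominated by computing its sparse code, not by the SVM evaluation.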

The optimization of the SRC-SM-SV algorithm is shown in Algorithm 1.

Input: labeled training data , regularization parameters ,,,, and .
Output: the optimal variables {}
1. Initialize the dictionary using K-SVD algorithm
  
While not convergence or
2. Compute the similarity matrix via Equation (4);
3. Tune the mapping matrix via Equations (10)–(15);
4. Tune the dictionary via Equation (17);
5. Tune sparse coefficient matrix via Equation (20);
6. Tune the via Equation (21);
  
end while
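The overall structure of such an alternating scheme (fix all variables but one block, update that block, repeat until the objective stops improving) can be sketched on a much simpler problem. The example below fits a rank-one factorization by alternating closed-form updates. It is not the SRC-SM-SV update rule, and the stopping rule (a relative tolerance on the objective decrease combined with an iteration cap) is an assumption, since the loop condition in Algorithm 1 is not fully specified here.

```python
# Illustrative sketch only: alternating minimisation on a toy problem,
#   min_{u, v} sum_ij (M[i][j] - u[i] * v[j])^2,
# with an assumed stopping rule (tolerance on the decrease plus iteration cap).


def rank1_objective(M, u, v):
    return sum((M[i][j] - u[i] * v[j]) ** 2
               for i in range(len(u)) for j in range(len(v)))


def alternate_minimise(M, max_iter=100, tol=1e-10):
    """Alternate closed-form updates of u (v fixed) and v (u fixed)."""
    m, n = len(M), len(M[0])
    u = [1.0] * m
    v = [1.0] * n
    history = [rank1_objective(M, u, v)]
    for _ in range(max_iter):
        # Update u with v fixed: a least-squares problem per entry of u.
        vv = sum(x * x for x in v)
        u = [sum(M[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Update v with u fixed.
        uu = sum(x * x for x in u)
        v = [sum(M[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
        history.append(rank1_objective(M, u, v))
        # Stop when the objective no longer decreases noticeably.
        if history[-2] - history[-1] <= tol * max(1.0, history[-2]):
            break
    return u, v, history


toy_M = [[2.0, 4.0],
         [1.0, 2.0],
         [3.0, 6.0]]
u_hat, v_hat, objective_history = alternate_minimise(toy_M)
print(objective_history)
```

Each block update solves its subproblem exactly, so the objective cannot increase from one iteration to the next; the same monotonicity argument is what makes alternating schemes attractive for multi-variable objectives.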

4. Experiment

4.1. Datasets and Experiment Setting

The FER experiments in this study are conducted on the JAFFE [22] and CK+ [23] datasets. JAFFE contains 213 facial expression images covering a variety of expressions posed by 10 Japanese women. The CK+ samples are facial expressions of subjects from different countries, ethnicities, and genders, and it is currently one of the more complete public datasets. Figures 1 and 2 show some examples of facial expression images in the JAFFE and CK+ databases, respectively. In the experiment, we select 183 images from JAFFE and 210 images from CK+. Table 1 shows the distribution of the six types of expressions in the two databases. We crop and convert these facial images to pixels and extract LBP features. Specifically, each face image is divided into 9 () regions, from which a 2304-dimensional LBP feature is extracted. In addition, we use a fine-tuned ResNet-50 model [27] to obtain a 2048-dimensional deep feature. The six basic expressions shared by the two databases, namely, anger, disgust, fear, happiness, sadness, and surprise, are selected as the targets of the classification task. We use data augmentation to expand both the JAFFE and CK+ datasets to 3 times their original size.

In the experiment, the SRC-SM-SV algorithm is compared with several algorithms, including SRC [12], K-SVD [18], LC-KSVD [20], FDDL [28], SVGDL [24], SDDL [21], and LCDL-SV [29]. In the SRC-SM-SV algorithm, the dimension of the mapping subspace is set to 500, and the learned dictionary has 420 atoms. SRC-SM-SV needs to adjust the regularization parameters , , , , and . The value range of regularization parameters is set . We empirically find no clear rule governing how parameter changes affect recognition accuracy. Therefore, we tune the regularization parameters by grid search. The parameters of the comparison algorithms are set to their default values. All experiments are conducted with MATLAB R2019b.
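Grid search over regularization parameters can be sketched as an exhaustive loop over the Cartesian product of candidate values. The parameter names (`reg_1`, `reg_2`), the candidate values, and the scoring function below are all hypothetical placeholders: in a real experiment the score would be the validation accuracy of a model trained with those parameters, and the grid would cover all five regularization parameters.

```python
# Illustrative sketch only: exhaustive grid search with a stand-in score.
# Parameter names, candidate values, and toy_score are hypothetical.
import math
from itertools import product


def grid_search(evaluate, grid):
    """Return (best_params, best_score) over every combination in grid.

    evaluate : callable mapping a parameter dict to a score (higher is better)
    grid     : dict mapping parameter names to lists of candidate values
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for validation accuracy; peaks at reg_1 = 0.1, reg_2 = 1."""
    return -(math.log10(params["reg_1"]) + 1.0) ** 2 - math.log10(params["reg_2"]) ** 2


toy_grid = {"reg_1": [0.01, 0.1, 1.0, 10.0],
            "reg_2": [0.01, 0.1, 1.0, 10.0]}
best_params, best_score = grid_search(toy_score, toy_grid)
print(best_params, best_score)
```

The number of evaluations grows as the product of the grid sizes, so with five parameters even a modest grid requires many training runs; logarithmically spaced candidates are a common way to keep each grid short.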

4.2. Results and Analysis

Tables 2 and 3 show the confusion matrices of the proposed SRC-SM-SV algorithm on the JAFFE dataset using LBP and deep features, respectively. Table 2 shows that SRC-SM-SV achieves its highest recognition accuracies on the happy and surprise expressions, reaching 95% and 95.5%, respectively. The facial features of these two expressions are exaggerated and involve a large range of motion, so the model extracts their features more easily. The recognition results of SRC-SM-SV on the disgust, fear, and sad expressions are slightly poorer, but the recognition rates are still about 87.5%. Fear and sad both involve parted lips and a tense forehead, while disgust and sad share similar eyebrow shapes and wrinkled mouth corners, so these three expressions resemble one another and are prone to misclassification. In particular, disgust and sad are easily confused with each other.

Tables 4 and 5 show the confusion matrices of the proposed algorithm on the CK+ dataset using LBP and deep features, respectively. The confusion matrices of the model on the CK+ dataset are similar to those on the JAFFE dataset. As can be seen from Tables 4 and 5, the happy expression is the easiest for SRC-SM-SV to recognize, with a recognition rate of 98% when using the deep feature. Surprise has the second best recognition rate, 97% when using the deep feature, because it is an exaggerated expression whose features are easier to learn. Disgust and sad are confused with each other because the two expressions are similar, especially around the mouth. Fear recognition is relatively weak, mainly because the number of fear samples is small and few features can be learned from them.
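For readers who want to relate a confusion matrix to the per-expression recognition rates quoted above, the following sketch computes both the per-class rate and the overall accuracy. The matrix below uses made-up counts for three classes purely for illustration; it does not reproduce any table in this paper, and rows are assumed to index the true class.

```python
# Illustrative sketch only: recognition rates from a confusion matrix.
# The counts are hypothetical; rows are true classes, columns are predictions.


def per_class_recognition_rate(confusion):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def overall_accuracy(confusion):
    """Fraction of all samples on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


toy_confusion = [[19, 1, 0],
                 [2, 16, 2],
                 [0, 1, 19]]
rates = per_class_recognition_rate(toy_confusion)
accuracy = overall_accuracy(toy_confusion)
print(rates, accuracy)
```

The off-diagonal entries of a row show which classes absorb the errors, which is how pairwise confusions such as disgust versus sad are read from the tables.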

The comparison results on the JAFFE dataset using LBP and deep features are shown in Figures 3 and 4, respectively. First, Figure 3 shows that SRC-SM-SV obtains the best performance with the LBP feature. Second, Figure 4 shows that SRC-SM-SV also obtains the best performance with the deep feature. The average recognition rate of the SRC-SM-SV algorithm over all expressions is 94.18%, which is an improvement of 6.29% over SRC and 1.28% over the second best algorithm, LCDL-SV. On the one hand, this shows that automatically learned deep features are well suited to recognition and classification. On the other hand, it shows the advantages of the proposed SRC-SM-SV. Compared with the LCDL-SV algorithm, the objective function of the proposed algorithm additionally includes the subspace mapping and PCA terms, which suggests that these two terms benefit FER tasks. The comparison results on the CK+ dataset using the LBP feature and the deep feature are shown in Figures 5 and 6, respectively. The results are similar to those in Figures 3 and 4 and likewise indicate that the proposed algorithm is effective.

To further compare the recognition accuracy on each expression, Tables 6 and 7 show the recognition accuracy of each algorithm on the six expressions of the JAFFE dataset using LBP and deep features, respectively. Tables 8 and 9 show the corresponding results on the CK+ dataset. These results show that the SRC-SM-SV algorithm achieves good recognition performance on every expression and outperforms the other algorithms. On the JAFFE dataset in particular, the recognition rate of SRC-SM-SV is 3 to 6 percentage points higher than that of the other algorithms, which also shows that this algorithm is suitable for FER.

5. Conclusion

As one of the most important tasks in the field of affective computing, FER has been widely applied in many practical areas, such as computer vision, multimedia entertainment, and machine intelligence. To improve the recognition performance of FER technology in practical applications, this paper studies FER systems based on sparse representation classification and develops a sparse representation classifier embedding subspace mapping and support vector machine. In this algorithm, subspace learning and the support vector machine are combined within the framework of sparse representation classification to obtain a discriminative sparse representation in a low-dimensional subspace. At the same time, the algorithm incorporates a Laplacian regularization term and a PCA term into the model to better minimize the intraclass scatter and maximize the separability between classes. Although this study has achieved some results in FER, many problems in practical applications remain worthy of further research. For example, the SRC-SM-SV algorithm can be extended to more complex FER tasks, such as multisource cross-database scenes, multimodal/cross-modal scenes, multiview scenes, and facial action unit recognition. In addition, the computational complexity of our algorithm is relatively high, which currently makes it difficult to apply to large-scale datasets. Therefore, exploring fast sparse optimization is a direction of our future research.

Data Availability

Two public datasets, JAFFE and CK+, are used in this study. The JAFFE dataset can be downloaded from https://zenodo.org/record/3451524#.YZT9EFVByM8. The CK+ dataset can be downloaded from https://www.kaggle.com/shawon10/ckplus.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Project of Jiangsu Education Science in the 13th Five-Year Plan in 2018 under Grant No. B-a/2018/01/41, the Future Network Scientific Research Fund Project (No. FNSRFP-2021-YB-36), and the Science and Technology Project of Changzhou City (No. CE20215032).