Abstract

Emotion recognition is a trending research field involved in several applications, most notably robotic vision and interactive robotic communication. Human emotions can be detected using both speech and visual modalities, and facial expressions are an ideal means of detecting a person's emotions. This paper presents a real-time approach for emotion detection and its deployment in robotic vision applications. The proposed approach consists of four phases: preprocessing, key point generation, key point selection and angular encoding, and classification. The main idea is to generate key points using the MediaPipe face mesh algorithm, which is based on real-time deep learning. The generated key points are then encoded using a sequence of carefully designed mesh generator and angular encoding modules. Furthermore, feature decomposition is performed using Principal Component Analysis (PCA) to enhance the accuracy of emotion detection. Finally, the decomposed features are fed to a Machine Learning (ML) technique that depends on a Support Vector Machine (SVM), k-Nearest Neighbor (KNN), Naïve Bayes (NB), Logistic Regression (LR), or Random Forest (RF) classifier. Moreover, we deploy a Multilayer Perceptron (MLP) as an efficient deep neural network technique. The presented techniques are evaluated on different datasets with different evaluation metrics. The simulation results reveal a superior performance, with a human emotion detection accuracy of 97%.

1. Introduction

Recognition of human emotions is a vital task involved in several applications such as augmented and virtual reality [1, 2], advanced driver assistance systems [3], human-computer interaction [4], and security systems [5-7]. Humans have several ways of interpreting the emotions of others, such as speech and linguistic aspects [8] and facial expressions [9-11]. Furthermore, emotions can be detected based on gaze direction [12] and biosignals, including the electroencephalogram (EEG) and electrocardiogram (ECG). Emotional expressions are used for intelligent Human-Robot Interaction (HRI). Emotion analysis can also be used to track students' emotions and thereby enhance the learning environment so that students learn better. Such information is also useful for monitoring the overall mood of a group of persons to identify any destructive events [13]. In human interaction, 7% of the affective information is conveyed by words, 38% by speech tone, and 55% by facial expressions [14]. Therefore, facial emotion analysis can be a dependable approach to recognize human emotions for HRI applications.

The robot vision issue can be handled using thermal images [15-17] and RGB images [18]. This paper presents a real-time study of emotion detection and its deployment in robotic vision applications. The proposed approach consists of four phases: preprocessing, feature extraction and selection, feature decomposition, and classification. Feature extraction and selection is carried out with the MediaPipe face mesh algorithm, which is based on real-time deep learning. In addition, the feature decomposition phase is performed with PCA, which decomposes the extracted features by means of the Singular Value Decomposition (SVD) and is deployed to enhance the accuracy of emotion detection. Finally, the obtained features are fed to a selected classifier. In addition, an MLP deep neural network is utilized. The introduced techniques are assessed on different datasets with the help of different evaluation metrics. Moreover, this paper introduces a hardware implementation of the proposed models. The main contributions of this work can be summarized as follows:

(1) A novel fast and robust emotion detection framework for robotic vision applications is proposed.
(2) An emotion face mesh is introduced based on automatic key point determination from face images.
(3) Key point angular encoding is presented to generate sensitive and distinguishable angular features.
(4) Emotion classification is performed using various machine learning techniques.
(5) A brief comparison is made between the deployed techniques in terms of accuracy, scalability, and processing time.

The remaining parts of this paper are organized as follows. Section 2 covers the works introduced in the literature. Section 3 describes the datasets utilized in this work. The proposed methodology is discussed in Section 4, and its simulation results are given in Section 5. Moreover, Section 6 discusses the results and positions the performance of the proposed approach among the works in the literature. Finally, the concluding remarks are given in Section 7.

2. Related Work

Several researchers have presented frameworks to handle the issue of HRI. The work in [19] offers a conditional-generative-adversarial-network-based (CGAN-based) framework to reduce intraclass variances by managing facial expressions individually, while simultaneously learning generative and discriminative representations. This architecture consists of a generator G and three discriminators (Di, Da, and Dexp). The generator G transforms any query face image into a prototypic facial expression form while keeping certain factors. An accuracy of 81.83% was achieved. A CNN-based model was proposed in [20] for smile detection, emotion recognition, and gender classification; it is therefore considered a multi-task model. It achieved an accuracy of 71.03%.

Some efforts have been presented for emotion detection using deep learning. The work in [21] introduced a deep CNN to deploy a facial expression recognition system. This system can automatically extract the features of facial expressions to allow automatic recognition, and it consists of input, preprocessing, recognition, and output modules. Furthermore, it was used to simulate and assess the recognition performance under the effect of several aspects such as network structure, learning rate, and preprocessing on both the Japanese Female Facial Expression (JAFFE) dataset and the Extended Cohn–Kanade (CK+) dataset. To make the results more convincing, the authors also used the k-Nearest Neighbor (KNN) technique. For the JAFFE and CK+ datasets, the accuracies are 76.7442% and 80.303%, respectively. Another model was proposed in [22]. It was tested on a facial expression dataset of HDR images, considering a collection of faces under different lighting conditions. It is based on an SVM, Local Binary Patterns (LBPs), appearance features, and the Speeded-Up Robust Feature (SURF) transform for the emotion recognition task. This model revealed accuracy levels up to 80%. In [23], the authors presented a model submitted to the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition subchallenge. They deployed a CNN to extract features from the detected face images. The CNN was trained for the face identification task rather than being pretrained on emotion recognition problems, as is traditionally done. In the final pipeline, an ensemble of Random Forest (RF) classifiers was learned to predict an emotion score using the available training set. This model achieved an accuracy of 75.4% on the validation data.

Another trend in this field is to detect emotions from videos. The authors of [24] presented a hybrid deep learning model for emotion detection from videos. A spatial CNN processes static facial images, while a temporal CNN processes optical flow images. These two processing branches learn high-level spatial and temporal features on video segments separately, and both CNNs are fine-tuned using pretrained CNN models and target video facial expression datasets. A deep fusion network, implemented as a Deep Belief Network (DBN) model, fuses the collected features from the segment-level spatial and temporal branches. The obtained fused features are fed to a linear SVM for facial expression classification. The authors achieved an accuracy of 75.39%. Moreover, another video-based emotion detection algorithm was presented in [25]. The authors investigated different ways of pooling spatial and temporal data and found that, for video-based facial expression identification, pooling spatial and temporal information together is more efficient. Unlike the framework given in [24], this work is end-to-end trainable for whole-video recognition. The goal of this framework is to create a trainable deep neural network for pattern identification that integrates spatial and temporal information from video using CNNs and LSTMs. This framework achieved an accuracy of 65.72%.

3. Dataset Description

The proposed models are evaluated on three datasets: Cohn–Kanade (CK+) [26], Japanese Female Facial Expression (JAFFE) [27], and Real-world Affective Faces Database (RAF-DB) [28]. A description of each of them is given below.

3.1. Cohn–Kanade [CK+]

The CK+ dataset [26] consists of 593 video sequences from 123 participants. Each sequence contains images beginning from the onset (neutral frame) and progressing to the peak expression (last frame). The label associated with each sequence is derived from the peak expression. The dataset contains images for seven different expressions: anger, contempt, fear, disgust, happiness, surprise, and sadness. The images have a resolution of 640 × 480 pixels. In this work, the images are cropped to 48 × 48 pixels to focus on the subject's face. Figure 1 shows sample images for each expression.

3.2. Japanese Female Facial Expression (JAFFE)

The JAFFE dataset [27] has 213 photos of ten different female actors posing for seven different facial expressions. There are six primary expressions: happiness, sadness, surprise, anger, disgust, and fear, plus one neutral expression. The images have a resolution of 256 × 256 pixels. Figure 2 shows sample images for each expression.

3.3. Real-World Affective Face Database (RAF-DB)

RAF-DB [28] contains 15,339 facial images with uncontrolled poses and illumination from thousands of individuals of different ages and races. The images within the RAF-DB are labeled by approximately 40 annotators. The database includes six basic expressions plus a neutral expression. Sample images from RAF-DB are shown in Figure 3.

4. Proposed Methodology

This paper presents an emotion detection approach based on deep and machine learning techniques. The main idea of this approach is to deploy deep learning as an automatic key point generator using the MediaPipe technique. Then, a sensitive mathematical process is performed to encode the generated key points into a set of distinguishable features. In addition, different machine learning techniques are applied to the extracted features to perform the classification task. The proposed approach consists of four main phases. The first phase is image preprocessing, in which a super-resolution task is carried out using SRGAN. In the second phase, we deploy MediaPipe to generate key landmarks on the face images. Furthermore, we present a key landmark analysis and angular encoding module. This module contains three subphases (key landmark selection, emotional mesh generation, and mesh angular encoding). The main idea of this module is to generate an emotional mesh that connects the selected key landmarks. The obtained mesh is then encoded into angular values to generate a feature map. Moreover, the generated feature map is fed to a classifier to be discriminated into six categories. Figure 4 represents the proposed framework.

4.1. Preprocessing

Generally, the images captured by robotic vision devices have limited resolution due to the hardware limitations of the cameras involved in such systems. Furthermore, most of the available datasets for human emotion recognition are downsized because of storage limitations. Therefore, the first module in the proposed approach is super-resolution. In addition, the proposed approach involves angular feature extraction from the geometry of the face images, which requires a clear representation of the landmarks and boundaries of the face to allow proper facial emotion recognition. SRGAN [29], a Generative Adversarial Network (GAN) for image Super-Resolution (SR), is employed in the current research to increase the perceptual quality of images prior to further processing. With SRGAN, the images are super-resolved with a 4x upscaling factor, while minimizing the Mean Square Error (MSE) between the super-resolved and original images and maximizing the Peak Signal-to-Noise Ratio (PSNR).

Figure 5 illustrates the preprocessing step by employing the SRGAN. The figure displays an original image selected from the CK+ dataset and the corresponding super-resolved image after SRGAN. The original image size is 48 × 48 pixels, and the super-resolved image size is 192 × 192 pixels.
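For illustration, the snippet below sketches the preprocessing flow under the assumption that a plain 4x bicubic upscale stands in for the trained SRGAN generator (which is not distributed with this paper); the file names are hypothetical, and cv2.PSNR is used only to show how the PSNR criterion of this section could be reported.

```python
import cv2
import numpy as np

def super_resolve_4x(img_lr):
    # Stand-in for the SRGAN generator: a plain 4x bicubic upscale.
    # The trained SRGAN model would replace this call; the rest of the
    # flow (48 x 48 input -> 192 x 192 output) stays the same.
    h, w = img_lr.shape[:2]
    return cv2.resize(img_lr, (4 * w, 4 * h), interpolation=cv2.INTER_CUBIC)

img_lr = cv2.imread("ck_sample_48x48.png")      # hypothetical 48 x 48 CK+ frame
img_sr = super_resolve_4x(img_lr)               # 192 x 192 output, as in Figure 5

# If a 192 x 192 reference frame is available, the reconstruction quality
# can be reported with the MSE and PSNR criteria mentioned above.
img_hr = cv2.imread("ck_sample_192x192.png")    # hypothetical reference image
mse = np.mean((img_sr.astype(np.float64) - img_hr.astype(np.float64)) ** 2)
print(f"MSE: {mse:.2f}, PSNR: {cv2.PSNR(img_sr, img_hr):.2f} dB")
```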

4.2. Key Landmark Generation

The process of key landmark generation is performed using the deep MediaPipe technique. MediaPipe [30] is an open-source ML framework developed by Google and devoted to building real-life computer vision applications. MediaPipe allows developers to focus on algorithm or model development, while iteratively improving their applications with results that are consistent across different devices and platforms [31]. Solutions currently implemented with MediaPipe include face detection, face mesh annotation, iris localization, hand detection, pose estimation, hair segmentation, object detection and tracking, and 3D object detection (Objectron). These solutions are released on different platforms: mobile (Android and iOS), C++, Python, and JS. Real-life examples of ML solutions in MediaPipe are shown in Figure 6.

In the current work, the face mesh solution from the MediaPipe framework is employed to annotate the landmarks and boundaries of the face. Face mesh calculates 468 3D face landmarks in real time. It uses ML to infer the 3D surface geometry from just a single camera input, without a specialized depth sensor [32]. The solution provides real-time performance, even on mobile devices. Figure 7 displays an image selected from the JAFFE dataset with the 468 facial landmarks annotated on the image.
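As a concrete illustration of this step, the following minimal sketch uses the MediaPipe Python face mesh solution (as in version 0.8.x) to obtain the 468 landmarks for a single image; the image path is hypothetical.

```python
import cv2
import mediapipe as mp

image = cv2.imread("jaffe_sample.png")          # hypothetical JAFFE image path
h, w = image.shape[:2]

# Static-image mode runs the full detector on every call (no video tracking).
with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                     max_num_faces=1,
                                     min_detection_confidence=0.5) as face_mesh:
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

points = []
if results.multi_face_landmarks:
    # Landmark coordinates are normalized to [0, 1]; convert them to pixels.
    for lm in results.multi_face_landmarks[0].landmark:
        points.append((int(lm.x * w), int(lm.y * h)))
print(len(points))                               # 468 when a face is detected
```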

4.3. Proposed Key Landmark Analysis and Angular Encoding

This paper presents a key landmark analysis and angular encoding module. This module contains three subphases (key landmark selection, emotional face mesh generation, and mesh angular encoding). The main idea of this module is to generate an emotional mesh that connects the selected key landmarks. The obtained mesh is then encoded into angular values to generate a feature map. In the following subsections, each step in this module is discussed.

4.3.1. Key Landmark Selection

As discussed earlier, the MediaPipe face mesh solution provides face detection capability and 468 facial landmarks spread over the face, along with their locations (x and y coordinates for each detected landmark). In the proposed model, only 27 key landmarks are selected from the 468 detected landmarks. These key landmarks are used later to define the vertices of the emotion face mesh. Table 1 describes the selected key landmarks and the corresponding MediaPipe landmark IDs. The 27 key landmarks and their locations on a test face image are shown in Figure 8.
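Given the IDs listed in Table 1, the selection step reduces to simple indexing into the 468 detected landmarks. The sketch below (which reuses the `points` list from the previous snippet) illustrates this; the IDs shown are placeholders, not the authors' exact Table 1 list.

```python
# Placeholder subset standing in for Table 1 (27 MediaPipe landmark IDs
# covering the eyebrows, eyes, nose, mouth, and face contour); these are
# illustrative only, not the authors' exact selection.
KEY_LANDMARK_IDS = [70, 63, 105, 66, 107, 336, 296, 334, 293, 300,
                    33, 133, 362, 263, 1, 61, 291, 0, 17, 78,
                    308, 13, 14, 152, 234, 454, 10]

def select_key_points(points):
    """Keep the 27 key landmarks out of the 468 returned by the face mesh."""
    return [points[i] for i in KEY_LANDMARK_IDS]

key_points = select_key_points(points)   # 27 (x, y) pixel coordinates
```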

The selection of the key landmarks and their locations is based on the Facial Action Coding System (FACS) [33, 34], which encodes movements of individual facial muscles. It can be used to describe the facial actions that make up an expression based on changes in facial muscles, regardless of emotion. FACS encodes the movements of particular facial muscles, known as Action Units (AUs), which correspond to unique instantaneous changes in facial appearance [35]. Table 2 describes the facial emotion-related AUs and the corresponding FACS names. A graphic-based demonstration of FACS with isolated AUs is illustrated in [36]. Hence, facial emotions can be represented using reliable combinations of different AUs, as demonstrated in Table 3. Each key landmark location is chosen such that it is most probably affected by a specific emotion-related AU, which facilitates better recognition of facial expressions.

4.3.2. Emotional Mesh Generation

After selection of the key landmarks, the emotion face mesh is created, consisting of 27 vertices inferred from the selected key landmarks. The edges of the emotion face mesh, which define the connections between vertices, are drawn to establish a closed mesh structure. Table 4 defines the edges that constitute the emotion face mesh, as well as the start and end vertices for each edge. The vertex IDs are defined in Table 1. The resulting mesh has 27 vertices and 38 edges. Deformation of the emotion face mesh, measured by the deviation of angles between edges, reflects facial muscle contraction and relaxation and is used to identify facial emotions. Figure 9 displays the emotion face mesh for sample images selected from the JAFFE dataset with different emotions.
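A short sketch of how such a mesh can be overlaid on a face image is shown below; the edge list is an illustrative fragment only, since the full 38-edge definition is given in Table 4, and `image` and `key_points` come from the previous snippets.

```python
import cv2

# Illustrative fragment of the edge list (vertex-index pairs into the 27 key
# points); the complete 38-edge definition comes from Table 4 of the paper.
EMOTION_MESH_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4),     # e.g., an eyebrow arc
                      (10, 11), (12, 13),                  # eye corners
                      (15, 17), (17, 19), (19, 16)]        # part of the mouth outline

def draw_emotion_mesh(image, key_points, edges=EMOTION_MESH_EDGES):
    """Overlay the emotion face mesh (vertices and edges) on a copy of the image."""
    canvas = image.copy()
    for x, y in key_points:
        cv2.circle(canvas, (x, y), 2, (0, 255, 0), -1)     # vertices
    for start, end in edges:
        cv2.line(canvas, key_points[start], key_points[end], (255, 0, 0), 1)
    return canvas

mesh_view = draw_emotion_mesh(image, key_points)            # as in Figure 9
```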

4.3.3. Mesh Angular Encoding

After acquiring the key landmarks and establishing the emotion face mesh, we use the mesh to extract the relevant features for emotion classification. The employed features are geometric, since most emotions can be detected from geometric changes. Ten features are extracted, defining angles between specific edges of the emotion face mesh. The angles are represented in degrees in the range of (0°, 360°). These features are then fed to the ML classifiers, which learn from them to identify each emotion. The low dimensionality of the features (10 features) makes them more resistant to local facial changes. In addition, the classifiers can be trained in a much shorter time, and the overall complexity of the proposed framework is significantly reduced. The list of angles taken as discriminant features for emotion classification, and the three vertex IDs forming each angle, are given in Table 5. An example depicting the angular features and their locations on a test face image is shown in Figure 10.

The angle between the three vertices can be computed as follows (consider Figure 11).

The unknown is the angle θ at vertex P2 between the edge connecting P2 and P3 and the edge connecting P2 and P1, where (xi, yi) denotes the image coordinates of point Pi.

The angle β between the line P2-P3 and the X-axis can be computed as

β = tan⁻¹((y3 − y2)/(x3 − x2)).

Similarly, the angle α between the line P2-P1 and the X-axis can be computed as

α = tan⁻¹((y1 − y2)/(x1 − x2)).

Hence, the angle θ will be

θ = β − α.

Using the above procedure, ten angles between prescribed edges in the emotion face mesh are computed and then used for classification. All angle values are kept positive; negative values are avoided by adding 360° to them. Furthermore, the generated feature maps are redistributed using PCA to enhance their distribution.
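A minimal sketch of the angular encoding and the subsequent PCA step is given below. The vertex triplets are placeholders for Table 5, retaining all ten PCA components is an assumption, and atan2 is used so that the signs of both coordinate differences are handled before the 360° correction.

```python
import math
import numpy as np
from sklearn.decomposition import PCA

def mesh_angle(p1, p2, p3):
    """Angle at vertex P2 between edges P2-P3 and P2-P1, in degrees within [0, 360)."""
    beta = math.degrees(math.atan2(p3[1] - p2[1], p3[0] - p2[0]))   # edge P2-P3 vs X-axis
    alpha = math.degrees(math.atan2(p1[1] - p2[1], p1[0] - p2[0]))  # edge P2-P1 vs X-axis
    theta = beta - alpha
    return theta + 360.0 if theta < 0 else theta                    # keep the angle positive

# Placeholder vertex triplets standing in for Table 5 (ten angles); indices
# refer to the 27 key points and are illustrative only.
ANGLE_TRIPLETS = [(0, 2, 4), (5, 7, 9), (10, 1, 11), (12, 6, 13), (15, 17, 16),
                  (19, 18, 20), (21, 14, 22), (23, 8, 24), (25, 3, 26), (1, 14, 6)]

def encode_angles(key_points):
    """Encode one face as the ten angular features of its emotion mesh."""
    return [mesh_angle(key_points[a], key_points[b], key_points[c])
            for a, b, c in ANGLE_TRIPLETS]

# After encoding every training image, the 10-D feature maps are redistributed
# with PCA; keeping all ten components (a pure decorrelating rotation) is an
# assumption, since the retained dimensionality is not stated in the paper.
features = np.array([encode_angles(kp) for kp in all_key_points])   # all_key_points: per-image 27-point sets
features_pca = PCA(n_components=10).fit_transform(features)
```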

4.4. Classification

In this work, we develop an automated facial expression identifier to recognize human emotions for robotic vision applications. The discriminant features extracted from a face (Section 4.3) are fed to classifiers to recognize the emotion in the given face. Decision Tree (DT), KNN, multiclass SVM [37], Gaussian NB, MLP with backpropagation, Quadratic Discriminant Analysis (QDA), RF, and LR classifiers are used for classification. Trial and error and grid search [38] are conducted to identify the optimal structure and hyperparameters of the classifiers. In addition, 10-fold cross-validation is employed to estimate the optimal hyperparameter combinations and avoid overfitting. The optimal hyperparameters of the classifiers adopted in the current work are listed in Table 6.

The images in the dataset are divided into two parts: training part and testing part. The training part is used to train/validate the classifier, and the testing part is used to test the performance of the classifier. The splitting scheme is 80/20, as shown in Figure 12. The 10-fold cross-validation adopted in the current model employs further splitting of the training part into ten folds (subsets). After that, nine folds are used to train the classifier, while the remaining fold is used to validate the training. This process continues until each of the ten folds is used exactly once for validation. The optimal configurations identified in the training stage are then applied in the testing stage.
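The splitting and tuning procedure can be reproduced with scikit-learn as sketched below; the KNN parameter grid, the stratified split, and the random seed are illustrative assumptions rather than the exact settings of Table 6, and `features_pca` and `labels` denote the encoded features and emotion labels from the previous snippets.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 80/20 split of the angular feature vectors and their emotion labels.
X_train, X_test, y_train, y_test = train_test_split(
    features_pca, labels, test_size=0.2, stratify=labels, random_state=42)

# Grid search with 10-fold cross-validation on the training part only; the
# KNN grid below is an illustrative example, not the paper's exact grid.
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```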

5. Experimental Results

Experiments are performed on an Intel Core i3 machine with 8 GB RAM. Python 3.9 is used as the development environment. The OpenCV 4.5 and SRGAN libraries are used for image preprocessing. The MediaPipe 0.8.6 library is used as the building block for key feature extraction. Scikit-learn 0.24.2 [39] is used for implementing the machine learning classifiers and computing the evaluation metrics for the proposed model. NumPy, Pandas, Math, OS, and Matplotlib are used as supplementary libraries. Accuracy, precision, recall, F1-score, and training time are the five metrics used to evaluate the proposed framework. The training time is recorded as the average of five runs. The proposed model is evaluated using two different datasets: CK+ (6 classes) and JAFFE (6 classes), which are benchmark datasets for facial expression classification. For CK+, 784 images are used for training, while 197 images are used for testing. For JAFFE, the training set contains 164 images, and the testing set contains 42 images.

To evaluate the performance of the proposed model, eight classifiers are employed to classify facial expressions across two benchmark datasets. The hyperparameters employed for each classifier are presented in Table 6. The classification is based on ten features extracted from images in each dataset using the procedure described in Section 4.
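A compact evaluation loop over the eight classifiers, in the spirit of Tables 7-9, might look like the sketch below; scikit-learn defaults are used in place of the tuned hyperparameters of Table 6, and the data split comes from the previous snippet.

```python
import time
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Default hyperparameters are used here; the tuned values are given in Table 6.
classifiers = {
    "KNN": KNeighborsClassifier(), "SVM": SVC(), "Gaussian NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000), "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(), "QDA": QuadraticDiscriminantAnalysis(),
    "MLP": MLPClassifier(max_iter=2000),
}

for name, clf in classifiers.items():
    # Training time is averaged over five runs, as in the reported results.
    run_times = []
    for _ in range(5):
        start = time.perf_counter()
        clf.fit(X_train, y_train)
        run_times.append(time.perf_counter() - start)
    print(f"{name}: mean training time {np.mean(run_times):.3f} s")
    print(classification_report(y_test, clf.predict(X_test)))
```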

Learning curves, which determine cross-validation scores and behaviors for different training sizes for the adopted classifiers in case of CK+, are shown in Figure 13.

The confusion matrix for each classifier on the CK+ dataset using the proposed model is shown in Figure 14. It shows that the Anger, Happy, and Surprise classes have higher per-class accuracies with all classifiers than the other emotions, while the Contempt and Sadness classes have lower per-class accuracies. Moreover, the confusion matrices for the classifiers on the JAFFE dataset are shown in Figure 15.

The performances of the proposed framework with eight classifiers on the CK+, JAFFE, and RAF-DB [28] datasets are presented in Tables 7-9. The illustrated results show the classification report, including accuracy, precision, recall, and F1-score, as well as the training time taken for each classifier. A visual comparison between the classifier accuracies across the used datasets is shown in Figure 16.

The results reveal that the KNN classifier outperforms the other classifiers in terms of accuracy, precision, recall, and F1-score, achieving the best accuracies of 97% and 95% on the CK+ and JAFFE datasets, respectively. The accuracies of the Gaussian NB, QDA, DT, LR, RF, MLP, and SVM classifiers on CK+ are 84%, 86%, 86%, 87%, 89%, 94%, and 94%, respectively, and those on JAFFE are 90%, 79%, 90%, 86%, 93%, 90%, and 88%, respectively. In addition, the time required to train the KNN and Gaussian NB classifiers is 0.005 sec on CK+, which is the lowest among all classifiers. The MLP and RF classifiers have the highest training times of 1.82 sec and 0.74 sec, respectively. Moreover, the proposed models are evaluated on RAF-DB. The results of this evaluation reveal that the proposed MLP and SVM models can be considered good emotion detection models for this database, each with an accuracy of 67%. Therefore, the proposed approach provides a variety of models that are suitable for robust emotion detection environments.

6. Discussion

The simulation results reveal that the proposed approach achieves high performance in human emotion detection. Furthermore, they show that the proposed encoding module performs well with the deployed classifiers, including KNN, SVM, and MLP. In this section, a brief comparison is presented between the proposed approach and the works in the literature, as illustrated in Table 10. It can be observed that the proposed approach has a superior performance among the efforts in this field.

7. Conclusion

The issue of Human-Robot Interaction (HRI) has been discussed in this paper. As a solution, the paper has presented a novel approach for facial expression recognition. The proposed approach consists of four phases: key points are extracted from facial images using a real-time algorithm (MediaPipe), and these key points are then passed through a sequence of selection, mesh generation, and angular encoding modules. The generated feature maps are classified using several classification algorithms, including SVM, KNN, RF, QDA, NB, LR, DT, and MLP. The novelty of the proposed approach lies in the proposed key point analysis and angular encoding algorithm. This algorithm is efficient, because it generates only ten features (angular values), which are discriminative for the different emotion classification categories. The proposed approach has been evaluated on the CK+, JAFFE, and RAF-DB datasets. It reveals a superior performance in terms of detection accuracy and processing time. Furthermore, the low dimensionality of the extracted features enables the ML-based approaches to reach an optimum performance in a short time and with a much lower computational cost than those of DL-based approaches, which require more time and computation for convergence.

Future work deduced from this paper includes introducing methods for emotion detection from other modalities such as videos, spoken words, and written text. Furthermore, a hardware implementation of the proposed approach is a research direction that we are currently working on. Moreover, further machine learning techniques, such as dictionary learning and semi-supervised learning, can be applied to this problem.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-Track Research Funding Program to support publication in the top Journal (Grant no. 42-FTTJ-38).