Abstract

To improve human-computer interaction (HCI) until it matches the quality of human-human interaction, an efficient approach for human emotion recognition is required. These emotions could be inferred by fusing several modalities such as facial expressions, hand gestures, acoustic data, and biophysiological data. In this paper, we address the frame-based perception of the universal human facial expressions (happiness, surprise, anger, disgust, fear, and sadness) with the help of several geometrical features. Unlike many other geometry-based approaches, the frame-based method does not rely on prior knowledge of a person-specific neutral expression; such knowledge is gained through human intervention and is not available in real-world scenarios. Additionally, we provide a method to investigate the performance of geometry-based approaches under various facial point localization errors. From an evaluation on two public benchmark datasets, we found that, using only eight facial points, we can match the state-of-the-art recognition rate, even though the state-of-the-art geometry-based approach exploits features derived from 68 facial points and requires prior knowledge of the person-specific neutral expression. The expression recognition rate using geometrical features is adversely affected by errors in facial point localization, especially for expressions with subtle facial deformations.

1. Introduction

The human mental state could be inferred using various modalities such as facial expressions, hand gestures, acoustic data, and biophysiological data [1–5]. The importance of knowing this mental state appears in different disciplines. For example, HCI should be improved until it is as good as human-human interaction; hence, recognizing human emotions by machines is considered an important step forward. Pantic et al. [6] argued that facial expressions are more important than body gestures and vocal expressions for the judgment of human behavior. For example, in our companion-based assistant system, facial expression is considered a complementary aspect to hand gestures and other modalities [7]. In addition, a human emotion recognizer can provide feedback for different services. As a case in point, one-to-one tutoring outperforms conventional group methods of instruction; consequently, adapting one-to-one tutoring to student performance through a cognitive process (nonverbal behavior recognition) is crucial [8]. Many other applications are built on facial expression recognition [9–11]. In this paper, we propose an approach to perceive human facial expressions (happiness, surprise, anger, disgust, fear, and sadness) from captured face images. Additionally, we synthesize facial points with several uncertainties matching facial points detected with errors. These synthesized facial points are used to investigate the performance of our approach under inaccurate facial point localization. The errors in the facial point locations are drawn from independent and identically distributed normal distributions with zero mean and five different standard deviation values.

To recognize the facial expressions, Ekman and Friesen [12] broke the facial expression down into smaller action units (AUs), where each AU codes small visible changes in facial muscles. Each facial expression is then defined as a composition of several AUs occurring simultaneously with different intensities. Instead of explicitly building an approach to recognize the facial expressions from their corresponding AUs, one can directly use geometry and appearance features for expression recognition, where these features implicitly encode the aforementioned AUs.

By exploring the state-of-the-art approaches for human facial expression recognition, we can sort them into two categories. The first category regards prior knowledge of the person-specific neutral expression as essential for the approach. In other words, each facial expression is inferred by comparing features from a face image with those of the same face in the neutral expression [13–15]. These approaches have limitations, such as requiring human intervention to define the neutral expression of the considered person. Several methods were proposed to estimate the neutral expression automatically; for example, the average over many frames for each person is assumed to be the person-specific neutral expression, or a model that best fits all neutral samples is considered a general neutral model. However, these methods are error prone and cannot provide hand-annotation accuracy.

As a sample of the first category, Lucey et al. [13] manually labeled 68 facial points in keyframes within each image sequence and then used a gradient descent Active Appearance Model (AAM) to fit these points in the remaining frames. Several features extracted from the displacements of those points are then fed into a multiclass support vector machine (SVM) classifier to infer the human facial expression. In another example on 3D facial data, Niese et al. [14] extracted dynamic and geometrical features from facial points and specific regions associated with the 3D face model of each subject. These points are initially annotated or detected on the neutral-state image and tracked over the remaining sequence. Moreover, many approaches employed spatiotemporal information of the image sequence, such as Valstar et al. [16], who utilized the motion history inside the face image. Zhu et al. [17] used a hidden Markov model (HMM) along with moment invariants for facial expression recognition. Zhang and Ji [18] used a dynamic Bayesian network (DBN) to model the temporal behavior of the facial expressions; they used IR illumination and Kalman filtering to assist the facial point detection and tracking. Baltrušaitis et al. [5] proposed a dynamic system with three levels of inference on progressively longer time scales to understand human mental states from facial expressions and upper-body gestures, where they employed both DBNs and HMMs. Lörincz et al. [19] used time-series kernels to analyze the spatiotemporal process of the facial points, where the points’ movements in 3D space are classified with kernels derived from time-warping similarity measures. Some approaches utilized texture dynamics for facial expression recognition [20–22]. Many other approaches also exploit the facial point dynamics to recognize the corresponding expression [23–27]. Obviously, these approaches work with image sequences that usually start with the neutral expression.

By contrast, the second category does not require prior knowledge of the considered person’s neutral expression. For example, Littlewort et al. [28] convolved the registered detected face image with a filter bank of 72 Gabor filters with eight orientations and nine spatial frequencies, where each filter output value is considered a feature. All these features are input into individual SVM classifiers for the smaller facial action units (AUs). Finally, they built a multivariate logistic regression (MLR) classifier on top of the outputs of the AU classifiers to recognize the human facial expressions. Shan et al. [29] used local binary patterns (LBP) for facial expression recognition. Several modified versions of LBP were also proposed for facial expression recognition, for example, local normal binary patterns (LNBP) [30], local phase quantization (LPQ) [31], and local sign directional pattern (LSDP) [32]. This category generates feature vectors of larger size, which increases classifier training and testing time.

Our proposed approach follows the idea of the second category by not exploiting prior knowledge about the person-specific neutral state or any temporal information. We infer the facial expressions from features utilizing the locations of just eight facial points inside a bounding rectangle around the detected face. Those features represent the shape and location of three facial components (eye, eyebrow, and mouth). Off-the-shelf facial point detectors do not provide hand-annotation accuracy when localizing facial parts in images. To decouple the recognition rate analysis from the choice of a specific facial point detector, we provide a method to synthesize facial expression data with different uncertainties matching the errors in the detection of the used facial points. Finally, we evaluate our approach on these synthesized data. This work is an extension of our previous paper [33]. Here we further enhance the approach’s performance by employing a point distribution model (PDM) to avoid shape distortions, which could be caused by noisy detection. In addition, more analysis and experiments are carried out. Finally, detailed results and comparisons are reported.

In this proposed approach, we provide frame-level facial expression recognition. Besides its usefulness for analyzing single images, it could be exploited in facial expression recognition using spatiotemporal methods. For example, Valstar and Pantic [34] used a combined SVM and HMM to model the temporal dynamics of facial AUs. They showed that the accuracy of AU classifiers is improved by using a hybrid of SVM and HMM, where the SVM provides frame-level information employed as emission probabilities for the HMMs.

The remainder of this paper is structured as follows. In Section 2, we describe our proposed approach for the facial expression recognition. Three experiments are discussed in Section 3, where a comprehensive evaluation of the performance of our method, including a comparison with a state-of-the-art method, is provided. Finally, the conclusion and future perspectives are given in Section 4.

2. Proposed Facial Expression Recognition Approach

In this work, we investigate the ability to perceive human facial expressions using geometrical features without any prior knowledge of the person-specific neutral expression, since the neutral expression is usually manually annotated.

Disregarding the annotated neutral expression offered by most databases is a step forward in the direction of fully automatic facial expression recognition. To this end, our extracted geometrical features do not entail prior knowledge of the person-specific neutral expression. It has been argued that robust computer vision algorithms for face analysis and recognition should be based on configural and shape features [35]. These features are defined as distances between facial components (mouth, eye, eyebrow, nose, and jaw line). In this paper, the facial expressions are inferred from the relative locations of eight facial points within the detected face, in addition to other geometrical features.

The structure of the proposed facial expression recognition approach is pictured in Figure 1. First, a human face is detected inside the input image. Then, we locate the eight facial points inside the face, either by manually annotating/automatically detecting the points in the first frame and then tracking them over the rest of the sequence (Sections 3.1 and 3.2), or by altering the tracked facial points to simulate the errors of a facial point detection stage (Section 3.3). To cope with deficiencies in facial point localization, we project the facial points onto a facial point subspace with the help of a trained PDM; hence, we ensure that these points fall within the variance of the training set. Following this, we extract two geometrical feature types from the projected points. Finally, we classify each normalized feature vector into one of the basic facial expressions, where we employ two machine learning algorithms for the facial expression recognition.

2.1. Face Detection

The human face is detected using a well-trained Haar cascade classifier [36, 37]. This classifier employs Haar-like features, which are defined as differences of summed pixel intensities over adjacent rectangular regions. Interestingly, this face detector does not rely on skin color and is trained under several illuminations. On the other hand, it only detects frontal upright human faces with up to approximately 20 degrees of rotation around any axis. Samples of the face detector output are shown in Figures 2(b)–2(e) and 2(h)–2(l).
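
For illustration, a minimal face detection sketch using OpenCV's pretrained frontal-face Haar cascade is given below; the cascade file, parameters, and the largest-box heuristic are assumptions rather than the exact configuration used in this work.

import cv2

# Load OpenCV's pretrained frontal-face Haar cascade (illustrative choice).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection as the face bounding box (x, y, w, h).
    return max(faces, key=lambda r: r[2] * r[3])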

2.2. Feature Extraction

The used 2D facial points sample three main facial components (mouth, eye, and eyebrow): two points for the eyebrows, two points for the eye corners, and four points for the mouth. Two feature sets are extracted from the selected facial points. Figure 2(b) shows the eight facial points within the bounding box returned by the employed face detector.

2.2.1. Facial Points Location Features

Unlike Lucey et al.’s approach [13], which used 68 facial points, we use just eight. These points have been shown to perform well in facial expression recognition [15, 33]. Moreover, they represent corner and edge points that can be efficiently detected and tracked [38]. The location and size of a face detected as in Section 2.1 are invariant to mouth deformations and eyebrow movements, as shown in Figures 2(b)–2(e) and 2(h)–2(l). Hence, the location of the eight points relative to the face position and size yields a useful 16-dimensional feature vector, generated from both the x- and y-coordinates of each point. The mean and standard deviation of the positions of these points for the six basic facial expressions, along with the neutral expression, based on the CK+ database are summarized in Table 1.
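
A minimal sketch of how such a 16-dimensional location feature vector could be computed is given below (NumPy-based; the point ordering and the normalization by the face box are assumptions).

import numpy as np

def point_location_features(points, face_rect):
    # points: (8, 2) array of (x, y) facial point coordinates in image space.
    # face_rect: (x, y, w, h) bounding box returned by the face detector.
    x, y, w, h = face_rect
    relative = (points - np.array([x, y])) / np.array([w, h])
    return relative.reshape(-1)   # 16-dimensional location feature vector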

2.2.2. Geometrical Features

Geometrical features describe the relative positions of the facial points to each other. Six distances are extracted from the eight points, as shown in Figure 2(f). To ensure scale-invariant features, the distances are normalized by the detected face width. Two of the distances represent the average of two mirrored measurements taken on the left and right sides of the face.
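
A possible implementation of these scale-invariant distances is sketched below; the concrete point pairs must follow Figure 2(f) and are treated as assumptions here.

import numpy as np

def geometrical_features(points, face_width, pairs):
    # pairs: six (i, j) index pairs defining the distances of Figure 2(f);
    # the concrete pairing depends on the point layout and is assumed here.
    dists = np.array([np.linalg.norm(points[i] - points[j]) for i, j in pairs])
    return dists / float(face_width)   # normalize by face width for scale invariance

# Mirrored left/right measurements can be replaced by their average,
# e.g. d = 0.5 * (d_left + d_right), as done for two of the six distances.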

Then, the two feature sets are concatenated to produce a vector of length 22. To remove the dominant effect of large-valued features before passing the feature vector into a machine learning algorithm, each feature f_i is normalized to the range [0, 1] as follows:

\hat{f}_i = \frac{f_i - \mu_i + 2\sigma_i}{4\sigma_i},  (2)

where \mu_i and \sigma_i are the mean and standard deviation of the ith feature across the training data, respectively. If we assume f_i is normally distributed, (2) guarantees that 95% of the values of \hat{f}_i fall in the range [0, 1]. Then, we truncate the out-of-range components to either 0 or 1.
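
The following sketch applies the normalization in (2) with statistics estimated on the training data, followed by truncation to [0, 1]; the small epsilon guard against zero variance is an addition for numerical safety.

import numpy as np

def fit_normalizer(F_train):
    # Per-feature mean and standard deviation estimated on the training data.
    mu = F_train.mean(axis=0)
    sigma = F_train.std(axis=0) + 1e-12
    def transform(F):
        F_hat = (F - mu + 2.0 * sigma) / (4.0 * sigma)   # maps mu +/- 2*sigma to [0, 1]
        return np.clip(F_hat, 0.0, 1.0)                  # truncate out-of-range values
    return transform

The returned transform is fitted once on the training features and then applied unchanged to both training and test features.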

2.3. Point Distribution Model (PDM)

The features extracted in the previous section rely on the facial point locations, and state-of-the-art facial point detectors still do not provide manual-annotation accuracy, especially when the face is not in the neutral expression. In this work, we apply a PDM to the detected facial points; hence, we guarantee that the facial points fall within the variance of the training set. The first step in building the PDM is to align the facial points of all training samples. We consider only frontal faces; therefore, normalizing the facial point positions to the detected face dimensions is assumed to satisfy the PDM alignment requirements. The normalized eight facial points of each sample are concatenated to produce a vector of length 16:

\mathbf{x} = (x_1, y_1, x_2, y_2, \ldots, x_8, y_8)^T.  (3)

Then, we group all facial expression samples into one matrix:

\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N],  (4)

where N is the total number of facial expression samples. Next, we calculate the covariance matrix over all samples:

\mathbf{C} = \frac{1}{N} \sum_{j=1}^{N} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})^T,  (5)

where \bar{\mathbf{x}} represents the mean of \mathbf{x} across all training samples. Following this, we apply the singular value decomposition (SVD) to the covariance matrix (5), which can then be written as

\mathbf{C} = \mathbf{U}\mathbf{S}\mathbf{V}^{*},  (6)

where \mathbf{U}, \mathbf{S}, and \mathbf{V} are matrices of size 16 \times 16, \mathbf{U} and \mathbf{V} are unitary matrices, and \mathbf{V}^{*} denotes the conjugate transpose of \mathbf{V}. \mathbf{S} is a diagonal matrix whose entries are the eigenvalues \lambda_k of \mathbf{C}, and the eigenvectors \mathbf{u}_k of \mathbf{C} make up the columns of \mathbf{U}. Each eigenvector describes a principal direction of variation within the training set with corresponding standard deviation \sigma_k = \sqrt{\lambda_k}. Finally, each detected facial point configuration should satisfy the following linear combination of the eigenvectors:

\mathbf{x} = \bar{\mathbf{x}} + \mathbf{U}\mathbf{b},  (7)

where \mathbf{b} is a vector of scaling values for the principal components. Simply, to guarantee that the facial points fall within the variance of the training set, we truncate each element b_k of \mathbf{b} as follows:

b_k \leftarrow \min(\max(b_k, -3\sigma_k), 3\sigma_k).  (8)
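
A minimal PDM sketch consistent with (3)–(8) is given below; the ±3σ truncation follows the reconstructed constraint (8) and is a common PDM choice rather than a setting confirmed by this paper.

import numpy as np

class SimplePDM:
    # Point distribution model over the 16-D normalized shape vectors:
    # PCA via SVD of the covariance matrix, with each shape parameter
    # clamped to +/- 3 standard deviations before reconstruction.
    def __init__(self, shapes):                      # shapes: (N, 16) training matrix
        self.mean = shapes.mean(axis=0)
        cov = np.cov(shapes, rowvar=False)           # 16 x 16 covariance matrix
        U, S, _ = np.linalg.svd(cov)
        self.U = U                                   # eigenvectors as columns
        self.sd = np.sqrt(S)                         # per-direction standard deviations

    def project(self, shape):
        b = self.U.T @ (shape - self.mean)           # shape parameters b
        b = np.clip(b, -3.0 * self.sd, 3.0 * self.sd)
        return self.mean + self.U @ b                # corrected facial point vector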

2.4. Machine Learning Algorithms

To solve the facial expression recognition from its representing feature vector, we employed two machine learning algorithms. In the experimental results (Section 3), we reported the recognition rates that stem from both algorithms.

2.4.1. Support Vector Machine (SVM)

We formulate the facial expression recognition task as a multiclass learning process, where one class is assigned to each expression. SVM is a classifier well known for its generalization capability. In the case of a binary classification task with training data \mathbf{x}_i, i = 1, \ldots, n, having corresponding classes y_i \in \{-1, +1\}, the decision function can be formulated as

f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x} + b),  (9)

where \mathbf{w}^T\mathbf{x} + b = 0 denotes a separating hyperplane, b is the bias or offset of the hyperplane from the origin in input space, and \mathbf{w} is a weight vector normal to the separating hyperplane. Two hyperplanes, called canonical hyperplanes, pass through the support vectors and satisfy \mathbf{w}^T\mathbf{x} + b = +1 and \mathbf{w}^T\mathbf{x} + b = -1, respectively, as shown in Figure 3. The region between the canonical hyperplanes is called the margin band, with width

\frac{2}{\|\mathbf{w}\|_2},  (10)

where \|\cdot\|_2 denotes the 2-norm. Finally, choosing the optimal values (\mathbf{w}, b) is formulated as a constrained optimization problem, where (10) is maximized subject to the following constraints:

y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n.  (11)

Several one-versus-all SVM classifiers are incorporated to handle the multiclass expression recognition. For this purpose, we employed LIBSVM [39].
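
As an illustration, the one-versus-all scheme can be reproduced with scikit-learn, whose SVC wraps LIBSVM internally; the RBF kernel and hyperparameters below are assumptions, not the settings tuned in this work.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def train_expression_svm(X_train, y_train):
    # One-versus-all SVMs over the 22-D normalized feature vectors;
    # one class per basic facial expression.
    clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
    clf.fit(X_train, y_train)
    return clf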

2.4.2. K-Nearest-Neighbor (kNN)

kNN classification is one of the simplest classification methods: a test sample is classified based on the closest previously known samples in the feature space [40]. This classification method does not depend on the underlying joint distribution of the training samples and their classes. Specifically, the k-nearest-neighbor rule assigns to the test sample the class most represented among its k closest neighbors.
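
A corresponding kNN sketch is shown below; the value of k is an assumption, as it is not fixed at this point in the text.

from sklearn.neighbors import KNeighborsClassifier

def train_expression_knn(X_train, y_train, k=5):
    # k-nearest-neighbor classifier over the same 22-D feature vectors.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    return knn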

3. Experimental Results

To assess the reliability of our approach, we compared our results with those of Lucey et al. [13] on the extended Cohn-Kanade dataset (CK+). Then, we evaluated our approach on the Binghamton University 3D Dynamic Facial Expression Database (BU-4DFE) [41]. Next, we provide a method to synthesize facial points with several uncertainties that are supposed to simulate facial point detector errors. Finally, we investigate the influence of the point detection error on our approach’s results.

3.1. The Extended Cohn-Kanade Dataset (CK+) [13]

We compared our results with those of Lucey et al.’s approach, which relied on features extracted from 68 fiducial points, taking into consideration their prior knowledge of the person-specific neutral expression. The comparison was carried out on the CK+ database. This database contains 593 sequences across 123 subjects. Each image sequence starts with an onset (neutral expression) and ends with a peak expression (last frame). The offered peak expression is fully coded by the Facial Action Coding System (FACS) using the FACS investigator guide. After applying perceptual judgment to the facial expression labels, only 327 of the sequences were labeled with one of the human facial expressions: 45 for anger (An), 18 for contempt (Co), 59 for disgust (Di), 25 for fear (Fe), 69 for happiness (Ha), 28 for sadness (Sa), and 83 for surprise (Su). Keyframes within each image sequence were manually labeled with 68 points, after which a gradient descent active appearance model (AAM) was used to fit these points in the remaining frames.

In our work, we use eight of the offered 68 points. Each facial expression sequence is represented by only one frame, which carries the apex of the expression. Due to the lack of training samples, both we and Lucey et al. employed a leave-one-subject-out cross-validation (LOOCV) strategy. As the name suggests, one subject is left out for testing, and the samples of the remaining subjects are used for training.
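
A sketch of this leave-one-subject-out protocol is given below; the subjects array holding one subject ID per apex frame and the classifier settings are assumptions.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def loocv_accuracy(X, y, subjects):
    # Leave-one-subject-out cross-validation: all apex frames of one subject
    # form the test fold; subjects[i] is the subject ID of sample i.
    accs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))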

The confusion matrix depicting the results obtained by the proposed approach compared to the published results of Lucey et al. [13] is shown in Table 2. We achieved an average recognition rate of 83.01% compared to their 83.15%, taking into consideration that removing the contempt expression (Co) from their classification algorithm could lead to an improvement in their results. Our proposed geometrical features (from eight fiducial points) provide results as good as those derived from 68 points; however, we do not utilize prior knowledge of the considered subject’s neutral expression. Similarly to Lucey et al.’s approach, we achieved high recognition rates for the expressions that cause distinctive facial deformations (happiness, surprise, and disgust). The recognition of the other expressions (anger, sadness, and fear) suffers from confusions due to their subtle facial deformations.

Our proposed neutral-independent approach allows us to make a frame-based decision on the facial expressions. Employing temporal information, such as prior knowledge of the considered person’s neutral expression, would enhance the recognition rate. Figure 4 shows the recognition rate of the facial expressions in the neutral-independent case compared to the neutral-dependent case; in the latter, we normalize each feature by its corresponding value at the neutral expression. The average recognition rate is improved by approximately 6%. Expressions with subtle facial deformations, such as sadness and fear, are improved the most compared to the others. Happiness and surprise are recognized at a higher rate in both cases, most likely due to the larger facial deformations they cause, which can be easily measured by our method.

In a fully automatic setting, it is not feasible to classify images into only the six basic expressions without also automatically recognizing the neutral expression, which is a pitfall of the approaches that use annotated prior knowledge of the person-specific neutral expression. To this end, we dedicated a separate class to the neutral expression. Moreover, we use two machine learning algorithms (SVM and kNN) to classify the neutral-independent feature vector generated as described in Section 2. Table 3 shows the confusion matrix of our approach for the six basic expressions plus the neutral expression (Ne) for both machine learning algorithms.

A number of points can be drawn from this matrix. The happiness and surprise expressions are still recognized with high rates of 98.55% and 98.75% with SVM and 92.75% and 98.75% with kNN, respectively. On the other hand, the perception of the other expressions, particularly sadness, is confused with neutral. The neutral expression itself is recognized with high rates of 90.35% and 79.18% using SVM and kNN, respectively. We achieved average recognition rates of 73.63% and 67.12% using SVM and kNN, respectively. These results indicate that the SVM classifier outperforms kNN for facial expression recognition. Considering the neutral expression as a separate class introduces many confusions with the other expressions, especially the subtle ones: fear, anger, and sadness. Similarly, an appearance-based approach suffers from comparable confusions with the neutral expression [28].

3.2. Binghamton University 3D Facial Expression Database (BU-4DFE) [41]

To assess the effectiveness of our approach, we evaluated it on a second database (the BU-4DFE database) to recognize the six basic facial expressions plus neutral. First, we extracted 2D frontal face image sequences from this 3D database. After that, the eight fiducial points were detected in the first frame (neutral expression) with the help of Valstar et al.’s approach [42] and then tracked over the rest of the sequence using a dense optical flow tracking algorithm [43]. Next, we extracted frame-based feature vectors from the apex frames. As in Section 3.1, we represented each expression sequence by one apex frame, used the LOOCV strategy, and employed both machine learning algorithms, SVM and kNN.

The recognition results are summarized in a confusion matrix, as shown in Table 4. Due to their distinctive facial deformations, which are easier to detect, the happiness and surprise expressions are recognized with high rates: 88.4% and 93.7% with SVM, and 85.36% and 86.2% with kNN, respectively. Similarly, confusions of the subtle expressions with neutral are present. In contrast with our evaluation on the CK+ database, we achieved a lower recognition rate for the neutral expression and a higher one for sadness. We achieved average recognition rates of 68.04% and 57.92% using the SVM and kNN classifiers, respectively. These rates are lower than those on CK+, which is reasonable given the higher facial point localization error introduced by the detection and tracking methods employed on this database. Once again, the SVM classifier outperforms kNN for facial expression recognition.

3.3. Approach Evaluation with the Uncertainty in Facial Point Detection

Locating the eight facial points in the aforementioned experiments involves human intervention, either by annotating keyframes within each image sequence (Section 3.1) or by selecting frames with the neutral expression in which to detect the facial points and then track them (Section 3.2). Therefore, these approaches cannot run fully automatically.

This experiment is carried out to set up a method for investigating the performance of geometry-based facial expression recognition approaches in fully automatic frame-based scenarios, where the facial point detector is applied to each frame. This method helps to decouple the analysis of geometry-based approaches from the performance of specific facial point detectors. To this end, we synthesize facial points with different uncertainties, which are supposed to match the error of any selected facial point detector. Certainly, prior knowledge about the distribution of point localization errors would help in selecting more robust geometrical features. Studying this issue is beyond the scope of this work, since state-of-the-art point detectors do not provide these error distributions [38, 44–46].

For example, the off-the-shelf facial point detector [38] reported a 2–5% mean error (MEr) for the eight facial points used in our approach. MEr is defined as the mean of the Euclidean distance between the detected point and the ground truth, divided by the face width. The behavior of the detection error is not reported; hence, it is not clear whether there are correlations among the point detection errors. Additionally, we have no information about the error distribution in the x- and y-coordinates. Let \sigma_x and \sigma_y denote the error standard deviations in the x- and y-coordinates for a specific facial point, with corresponding errors e_x \sim \mathcal{N}(0, \sigma_x^2) and e_y \sim \mathcal{N}(0, \sigma_y^2). Then, the MEr provided by state-of-the-art point detectors can be calculated as

\mathrm{MEr} = E\left[\sqrt{e_x^2 + e_y^2}\right].  (12)

Figure 5 shows three possible normal distributions for the same MEr in (12). In this experiment, we assume the worst case, where the errors are independent, both between the x- and y-coordinates and among the facial points. Furthermore, we assume \sigma_x = \sigma_y; hence, no directional information can be used to bias the geometrical features. Additionally, in this simulation, all facial points are exposed to the same detection error, which may not reflect real scenarios. We use the normal distribution to model the error.

The normal distribution is popular due to the central limit theorem, and it allows further results, such as least-squares fitting errors, to be derived analytically. Moreover, the normal distribution has the maximum entropy for a given mean and variance [47]. Consequently, the error in the position of the detected points is modeled by a bivariate Gaussian distribution, where each facial point location is altered as follows:

\tilde{\mathbf{p}} = \mathbf{p} + \mathbf{e}, \quad \mathbf{e} \sim \mathcal{N}\!\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}\sigma^2 & 0\\ 0 & \sigma^2\end{pmatrix}\right),  (13)

where \tilde{\mathbf{p}} is the synthesized point that is passed to the PDM and \mathbf{p} is the ground-truth facial point location.
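
The following sketch synthesizes noisy point sets according to (13) under the isotropic, independent-error assumption stated above; relating the per-axis sigma to a target MEr via sigma*sqrt(pi/2) is a consequence of that assumption, not a calibration reported in this paper.

import numpy as np

def synthesize_noisy_points(points, mer, n_samples=1000, seed=0):
    # points: (8, 2) ground-truth facial points normalized by the face width.
    # mer: target mean Euclidean error. For isotropic i.i.d. Gaussian noise,
    # E[sqrt(ex^2 + ey^2)] = sigma * sqrt(pi / 2), which fixes the per-axis sigma.
    sigma = mer / np.sqrt(np.pi / 2.0)
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples,) + points.shape)
    return points[None, :, :] + noise      # (n_samples, 8, 2) perturbed point sets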

This experiment is conducted on the CK+ database for the six basic facial expressions. The expressions are classified using SVM, which provided better results than kNN in the previous two experiments. Similarly to the previous experiments, we use the LOOCV strategy, where each test sample is altered by (13) to generate 1000 new test samples with a specific uncertainty. Then, these samples are processed by the PDM module before extracting the features and finally classifying them into facial expressions. Figure 6 shows the distribution of synthesized facial points generated from a sample of the neutral expression. We use the facial points at the apex of each facial expression, which stem from gradient descent AAM fitting, as the ground truth for (13). Figure 7 shows the correct recognition rates of the six basic facial expressions versus MEr. As expected, an increase in MEr lowers the recognition rate.

By far, the surprise expression is recognized with the highest rate even under noisy detection, which is plausible given its distinctive facial deformation, as can be noticed from Table 1. The recognition rates of the anger, sadness, and fear expressions drop below 60% in the case of a 5% error, which can be attributed to the small deformations they cause, which our geometry-based approach cannot accurately capture.

To fully understand the behavior of our proposed approach under facial point localization errors, we depict the most frequent confusions between the facial expressions in Figure 8. The sadness expression is mostly confused with anger, and this confusion increases dramatically for higher uncertainties of the facial point locations. Confusing disgust with anger also grows with increasing uncertainty of the facial point positions. As an additional remark on this experiment, the state-of-the-art facial point detector reported a 2–5% error for the eight selected facial points. This error was measured on a challenging database, which includes non-frontal faces and occluded facial parts. Furthermore, that error was measured against manually annotated ground truth, while in this experiment the ground-truth data are the facial points tracked through gradient descent AAM fitting, which already suffer from errors.

Finally, the following steps summarize how to generalize the usage of this experiment.
(i) Proposed facial point detectors should report not only each point location but also the error distribution in the x- and y-coordinates, as well as the error covariances among different points; with this information, we can activate the off-diagonal elements in (13).
(ii) Researchers working on geometrical facial features can synthesize facial points with the help of the aforementioned information. As a result, we avoid reimplementing unavailable state-of-the-art facial point detectors. Additionally, we save the time of applying facial point detectors to huge databases when only the geometrical features are being optimized.
(iii) The process of geometrical feature extraction should be optimized with respect to the known distributions of point localization errors.

4. Conclusions and Future Work

Several approaches have been proposed for facial expression recognition. These approaches can be grouped into two main categories: geometry based and appearance based. In this paper, we considered the geometry-based case. The state-of-the-art geometry-based approaches entail prior knowledge of the person-specific neutral expression; however, such information is not available in real-world scenarios. In contrast, we extract geometrical features from just eight facial points. These features do not rely on the person-specific neutral expression or any temporal information. Two databases were used to evaluate our approach. We achieved an average (frame-level) recognition rate of 83% on the CK+ database, which is on par with Lucey et al.’s approach; however, they used 68 facial points along with prior knowledge of the person-specific neutral expression. Our average recognition rate can be improved by approximately 6% when information about the person-specific neutral expression is utilized. When we add the neutral expression as a new class to the expression classifier in the frame-based case, the recognition rate drops to 73.6% due to additional confusions between the subtle expressions and the neutral one.

On the other hand, we achieved a 68% average recognition rate on the BU-4DFE database. The geometry-based approach strongly depends on the facial point detector. Interestingly, we provide a method to decouple the analysis of geometry-based approaches from specific point detectors. We synthesized facial points with various uncertainties that are supposed to match the errors of existing facial point detectors. Then, we investigated our approach under errors from 1% to 5%, which is the range reported for the state-of-the-art detectors. As expected, the recognition rate is adversely affected by increasing the facial point localization error.

The next step in our research is to generalize the facial expression recognition to non-frontal face poses. Combining geometry- and appearance-based approaches also deserves more attention.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is part of the project done within the Transregional Collaborative Research Centre SFB/TRR 62 Companion-Technology for Cognitive Technical Systems funded by the German Research Foundation (DFG).