Abstract

Facial expression recognition plays an important role in communicating the emotions and intentions of human beings. Facial expression recognition in an uncontrolled environment is more difficult than in a controlled environment because of changes in occlusion, illumination, and noise. In this paper, we present a new framework for effective facial expression recognition from real-time facial images. Unlike other methods, which spend much time dividing the image into blocks or processing the whole face image, our method extracts discriminative features from salient face regions and then combines them with texture and orientation features for a better representation. Furthermore, we reduce the data dimensionality by selecting only the most discriminative features. The proposed framework provides a high recognition accuracy even in the presence of occlusion, illumination changes, and noise. To show its robustness, we evaluated it on three publicly available challenging datasets. The experimental results show that the proposed framework outperforms existing techniques, which indicates the considerable potential of combining geometric features with appearance-based features.

1. Introduction

Facial expression recognition (FER) has emerged as an important research area over the last two decades. Facial expression is one of the most immediate, natural, and powerful means for humans to communicate their intentions and emotions. FER systems can be used in many important applications such as driver safety, health care, video conferencing, virtual reality, and cognitive science.

Generally, facial expressions can be classified as neutral, anger, disgust, fear, surprise, sad, and happy. Recent research shows that the ability of young people to read the feelings and emotions of other people is declining because of the extensive use of digital devices [1]. Therefore, it is important to develop a FER system that accurately recognizes facial expressions in real time.

An automatic FER system commonly consists of four steps: preprocessing, feature extraction, feature selection, and classification of facial expressions. In the preprocessing step, the face region is first detected and then extracted from the input image, because it is the area that contains the expression-related information. The most widely used algorithm for face detection is the Viola–Jones object detection algorithm [2]. Subsequently, in the feature extraction step, distinguishable features are extracted from the face image. The two popular approaches for feature extraction are geometric-based and appearance-based. In geometric-based techniques, the facial landmark points are first detected and then combined into a feature vector that encodes the geometric information of the face through positions, distances, and angles [3]. Appearance-based techniques characterize the appearance changes produced by different facial movements. Next, in the feature selection step, a subset of relevant features with the most discriminatory power for separating the classes is selected. In the final classification step, classifiers such as the K-nearest neighbor (KNN) [4] and the support vector machine (SVM) [5] are first trained and then used to classify the input data.

Although much work has been done to develop robust FER systems, several common problems still hinder their deployment in real-time environments: (i) the extracted features are sensitive to changes in illumination, occlusion, and noise, so even a slight change may affect the recognition accuracy; (ii) the large data dimension further deteriorates the performance of such systems.

The contributions of the proposed work are as follows:
(i) A dual-feature fusion technique is proposed for effective and efficient classification of facial expressions in unconstrained environments.
(ii) The proposed framework is based on both local and global features, which makes it robust to changes in occlusion, illumination, and noise.
(iii) A feature selection process is used to obtain the discriminative features and discard the redundant ones. The reduction in feature vector length also reduces the time complexity, which makes the proposed framework suitable for real-time applications.

The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 describes the materials and methods. Experimental results are presented in Section 4. Finally, the conclusion is provided in Section 5.

2. Related Work

Numerous methods for facial expression recognition have been developed owing to its increasing importance. Based on the type of feature extraction, these methods are mainly categorized into geometric-based and appearance-based methods.

In geometric-based methods, information such as the shape of the face and its components is used for feature extraction. The first important and challenging step in these methods is to initialize a set of facial points as the facial expression evolves over time. The study presented in [6] employed the elastic bunch graph matching (EBGM) algorithm to initialize the facial points and selected discriminative features from triangle and line features with a multiclass AdaBoost algorithm. Sun et al. [7] proposed an effective method for selecting optimized active face regions and used a convolutional neural network (CNN) to extract features from them. The method of Hsieh et al. [8] was based on the active shape model (ASM), which they employed to extract different facial expression regions. Similarly, Zangeneh and Moradi [9] first used the active appearance model (AAM) to locate the important facial points and then extracted differential geometric features from those points. In geometric-based feature extraction, it is difficult to track and initialize facial feature points in real time, and any error in the facial point initialization deteriorates the overall feature extraction process.

On the contrary, appearance-based feature extraction methods encode the face appearance variations without taking muscle motion into account. Chen et al. [10] introduced the multithreading cascade of Speeded Up Robust Features (McSURF), which improves the recognition accuracy. Cruz et al. [11] exploited temporal derivatives and adjacent frames using a new framework known as temporal patterns of oriented edge magnitudes. Out-of-plane head rotations are handled with the rotation-reversal invariant HOD presented by Chen et al. [12], who also developed a cascade learning model to boost the classification process. Alphonse and Dharma [13] employed the maximum response-based directional texture pattern and number pattern for feature extraction and tested the performance in both constrained and unconstrained environments. Recently, the work proposed in [14] employed spatiotemporal convolutions to jointly extract the temporal dynamics and multilevel appearance features of facial expressions. Another promising method, which enhances the performance of random forests, was proposed in [15]; it reduces the influence of distortions such as occlusion and illumination by extracting robust features from salient facial patches. Sajjad et al. [16] presented a model integrating the histogram of oriented gradients with a uniform local ternary operator for the extraction of facial features and tested it on facial expression images containing noise and partial occlusions. In another interesting approach, the authors proposed a framework named local binary image cosine transform for computationally efficient feature extraction and selection [17]. Munir et al. [18] proposed a merged binary pattern code (MBPC) to represent the face texture information and performed experiments on real-time images; to normalize illumination effects, they preprocessed the images with the fast Fourier transform and contrast-limited adaptive histogram equalization. Liu et al. [19] used a deep network to learn a midlevel representation of the face and tested their method on both wild-environment images and lab-controlled data.

Apart from purely appearance-based or geometric-based feature extraction, the fusion of these two types of features is also a promising trend. Zhang et al. [20] combined texture and geometric features to maintain a reasonable amount of tolerance against noise and occlusion, using an active shape model and SIFT for the geometric and appearance-based features, respectively. To inherit the advantages of geometric and appearance information, Yang et al. [21] fused deep geometric features with LBP-based appearance features and also proposed an improved random forest classifier for effective and efficient recognition of facial expressions. In the method of Tsai and Chang [22], features are extracted via Gabor filters, the discrete cosine transform, and the angular radial transform. In the work of Ghimire et al. [23], specific local face regions were first selected; normalized central moments and a local binary pattern descriptor were then used to extract the geometric and appearance-based features, respectively.

In this paper, unlike other methods, we select informative local face regions instead of dividing the face image into nonoverlapping blocks; such a representation can improve the classification performance compared with block-based image representation. The appearance-based features are computed from the local face regions as well as from the whole face area, and these features are then fused to provide a more robust representation.

3. Materials and Methods

The working of the proposed framework, based on dual-feature fusion, is illustrated in Figure 1. Initially, the face portion is detected and extracted from the input images using the Viola–Jones algorithm [2]. For dual-feature fusion, we first detect the facial landmark points on the face image and then locate the important local regions. The Weber local descriptor (WLD) excitation and orientation images are also generated from the input images. In the next step, the DCT is used to select the high-variance features from the local regions as well as from the WLD excitation and orientation images. To improve the performance, both types of features are then fused using score-level fusion.

3.1. Face Detection and Landmark Position Estimation

In order to extract the region of interest (i.e., the face portion), we use the Viola–Jones algorithm [2], which is among the most widely cited in the literature and is considered a fast and accurate object detection algorithm [24].
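For concreteness, a minimal Python/OpenCV sketch of this detection step is given below; the Haar cascade file and the detection parameters are illustrative choices, not the exact configuration used in this work.

# Face detection with the OpenCV implementation of the Viola-Jones cascade.
import cv2

def detect_face(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection as the region of interest.
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])
    return gray[y:y + h, x:x + w]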

Spatial misalignment usually occurs because of expression and pose variations in the face image, and neither dividing the face image into nonoverlapping blocks nor exploiting holistic features can resolve this issue [25]. Admittedly, the intraclass difference increases as the face appearance varies with expressions and poses; in that case, local features are more robust to these changes than holistic features, and some reliable and stable regions preserve more useful information for dealing with them. That is why, in this study, we extract features around the inner facial landmarks rather than from the whole face image.

For this purpose, we use the method presented by Kazemi and Sullivan [26], in which the facial landmark positions are estimated from a subset of pixel intensities using an ensemble of regression trees. This method is highly effective at locating the landmark positions not only in faces with a neutral expression but also in faces showing a variety of expressions.

After landmark position estimation, we use the facial point locations to divide the face image into 29 local regions, and the local features are extracted from all of these regions. To reduce the data dimensionality, we do not require an exhaustive search over subsets of the 29 local regions, as performed in [23], because our feature selection method is more efficient and effective.
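The following Python sketch shows how landmark-centered local regions can be obtained with dlib's ensemble-of-regression-trees shape predictor, an implementation of the Kazemi-Sullivan method [26]. The 68-point model file and the fixed crop size around each landmark are assumptions for illustration; the exact layout of the 29 regions used in this work is not reproduced here.

# Landmark localization and simple landmark-centered patches with dlib.
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def local_regions(gray, crop=16):
    rects = detector(gray, 1)
    if not rects:
        return []
    shape = predictor(gray, rects[0])
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    regions = []
    for (x, y) in points:
        # Crop a small patch around each landmark as one local region.
        patch = gray[max(y - crop, 0):y + crop, max(x - crop, 0):x + crop]
        if patch.size:
            regions.append(patch)
    return regions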

3.2. Construction of WLD Excitation and Orientation Image

The Weber local descriptor was proposed by Chen et al. [27] and is inspired by Weber's law. WLD consists of two main components, namely, differential excitation and gradient orientation. The differential excitation component represents the intensity differences between the neighboring pixels and the center pixel, whereas the gradient orientation component describes the gradient orientation at the center pixel. Together, the two components provide a local texture description of an image.

Formally, the differential excitation component can be defined as

$$\xi(x_c) = \arctan\left[\sum_{i=0}^{p-1} \frac{x_i - x_c}{x_c}\right], \qquad (1)$$

where the arctangent is used to suppress the side effect of noise and to keep the output from becoming too large, the neighboring pixels are denoted by $x_i$ ($i = 0, \dots, p-1$), and $x_c$ represents the center pixel. Similarly, the gradient orientation component of an image can be defined as

$$\theta(x_c) = \arctan\left(\frac{v_s^{11}}{v_s^{10}}\right), \qquad (2)$$

where $v_s^{11}$ and $v_s^{10}$ denote the intensity differences between opposite neighboring pixels in the y and x directions, respectively.

Figures 2 and 3 illustrate the WLD excitation and orientation component images.
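A minimal NumPy sketch of constructing the two WLD component images according to equations (1) and (2) is given below; the neighbor indexing convention and the small constant added to avoid division by zero are implementation choices and may differ from the authors' code.

import numpy as np

def wld_components(gray):
    # Differential excitation (equation (1)) and gradient orientation
    # (equation (2)) computed over the 3 x 3 neighborhood of every pixel.
    img = gray.astype(np.float64)
    h, w = img.shape
    pad = np.pad(img, 1, mode="edge")
    diff_sum = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            diff_sum += pad[1 + dy:1 + dy + h, 1 + dx:1 + dx + w] - img
    excitation = np.arctan(diff_sum / (img + 1e-6))      # equation (1)
    v_vertical = pad[2:, 1:-1] - pad[:-2, 1:-1]          # difference in y
    v_horizontal = pad[1:-1, 2:] - pad[1:-1, :-2]        # difference in x
    orientation = np.arctan2(v_vertical, v_horizontal)   # equation (2)
    return excitation, orientation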

3.3. DCT-Based Feature Selection and Fusion

We can compute the DCT $C(u,v)$ of an input image $f(x,y)$ of size $M \times N$ using equation (3) [28], which must be evaluated for all values of $u = 0, 1, \dots, M-1$ and $v = 0, 1, \dots, N-1$. Given $C(u,v)$, the image $f(x,y)$, for $x = 0, 1, \dots, M-1$ and $y = 0, 1, \dots, N-1$, can be recovered by the inverse DCT in equation (4). Equations (3) and (4) form a two-dimensional DCT pair, where $x$ and $y$ are spatial coordinates and $u$ and $v$ are frequency variables:

$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1}\sum_{y=0}^{N-1} f(x,y)\cos\left[\frac{(2x+1)u\pi}{2M}\right]\cos\left[\frac{(2y+1)v\pi}{2N}\right], \qquad (3)$$

$$f(x,y) = \sum_{u=0}^{M-1}\sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\,C(u,v)\cos\left[\frac{(2x+1)u\pi}{2M}\right]\cos\left[\frac{(2y+1)v\pi}{2N}\right], \qquad (4)$$

where $\alpha(u) = \sqrt{1/M}$ for $u = 0$ and $\alpha(u) = \sqrt{2/M}$ otherwise, and $\alpha(v)$ is defined analogously with $N$. The power spectrum $P(u,v)$ of the image is defined as

$$P(u,v) = \left|C(u,v)\right|^{2}. \qquad (5)$$
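A short Python sketch of this DCT-based feature computation is shown below, assuming that the low-frequency (top-left) coefficients are retained as the high-variance features; this selection rule is a common convention and an assumption here, not necessarily the paper's exact criterion.

import numpy as np
from scipy.fft import dctn

def dct_features(patch, k=8):
    # 2D DCT of a local region or whole-face image (equation (3)) and its
    # power spectrum (equation (5)); keep the k x k low-frequency corner.
    coeffs = dctn(patch.astype(np.float64), norm="ortho")
    power = coeffs ** 2
    return coeffs[:k, :k].ravel(), power[:k, :k].ravel()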

After extracting the appearance-based and geometric-based features, we employ a score-level fusion strategy to combine them. Feature-level fusion and score-level fusion are the two fusion strategies widely used in the literature. In feature-level fusion, the different feature vectors are simply concatenated after a normalization process. In contrast, score-level fusion uses a distance-based classifier to compute the distance between the feature vectors of the training and testing samples. Because feature-level fusion generally produces a large data dimension [29], we prefer score-level fusion in this study. In score-level fusion, the extracted appearance- and geometric-based DCT features are stored in two separate feature sets, computed for all training and testing samples. Afterward, two score vectors are produced by computing the distances between the training samples and all the testing samples of the appearance and geometric feature sets, respectively. The scores are normalized with the min-max method [30]:

$$S' = \frac{S - \min(S)}{\max(S) - \min(S)}, \qquad (6)$$

where $S$ is an original score entry and $\min(S)$ and $\max(S)$ denote the minimum and maximum values of the score vector. Finally, the normalized score vectors are combined using the product rule or the sum rule [30].

The procedure of feature extraction and fusion is presented in Algorithm 1.

Input: Training sample images
 Testing sample images
Output: Fused score vectors for expression classification
Procedure
(1) For each training sample image do
(2)  Construct the WLD excitation and orientation images
(3)  For each local face region do
(4)   Compute the DCT coefficients using equations (3) and (4)
(5)   Select the high-variance geometric-based features
(6)   Select the high-variance appearance-based features
(7)  End For
(8) End For
(9) For each testing sample image do
(10)  Construct the WLD excitation and orientation images
(11)  For each local face region do
(12)   Compute the DCT coefficients using equations (3) and (4)
(13)   Select the high-variance geometric-based features
(14)   Select the high-variance appearance-based features
(15)  End For
(16) End For
(17) For each testing sample do
(18)  Compute the appearance-based score vector against all training samples
(19)  Compute the geometric-based score vector against all training samples
(20) End For
(21) For each pair of score vectors do
(22)  Normalize the scores with equation (6) and fuse them using equation (7)
(23) End For
(24) Return the fused scores
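A minimal Python sketch of the score-level fusion outlined in Algorithm 1 is given below, assuming Euclidean distances as the matching scores and the sum rule for combination; the variable names are illustrative.

import numpy as np

def min_max(scores):
    # Min-max normalization of a score vector, as in equation (6).
    smin, smax = scores.min(), scores.max()
    return (scores - smin) / (smax - smin + 1e-12)

def fuse_scores(train_app, train_geo, test_app, test_geo):
    # Distance of one test sample to every training sample, computed
    # separately for the appearance- and geometric-based DCT features,
    # then normalized and combined with the sum rule.
    d_app = np.linalg.norm(train_app - test_app, axis=1)
    d_geo = np.linalg.norm(train_geo - test_geo, axis=1)
    return min_max(d_app) + min_max(d_geo)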
3.4. Support Vector Machine (SVM) for Expression Classification

For both binary and multiclass classification problems, the SVM [31] is a powerful tool. The SVM separates two classes with a hyperplane that maximizes the margin between the hyperplane and the closest points of each class. The decision function for class labels $y_i$ and training data $x_i$ can be formulated as [23]

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right),$$

where $K(\cdot,\cdot)$ is the kernel function, $\alpha_i$ are the learned weights, and $b$ denotes the offset of the separating hyperplane. To handle the multiclass problem, we use an SVM with a radial basis function kernel, implemented in libsvm [32], which is publicly available.
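As a sketch, an equivalent classifier can be set up with scikit-learn's SVC, which wraps libsvm; the hyperparameter values shown are illustrative rather than the values tuned in this study.

from sklearn.svm import SVC

def train_expression_svm(X_train, y_train):
    # X_train: fused feature vectors; y_train: expression labels.
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(X_train, y_train)
    return clf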

4. Experimental Results and Discussions

To evaluate the performance of the proposed framework, we used three publicly available benchmark databases, namely, the MMI database, the extended Cohn-Kanade (CK+) database, and Static Facial Expressions in the Wild (SFEW).
(i) MMI database: this database [33] contains both video sequences and static images, including head movements and posed expressions. It consists of high-resolution images of 88 subjects and over 2900 videos of males and females. For our experiments, we selected different video sequences and extracted a total of 273 images from them.
(ii) Extended Cohn-Kanade (CK+): this database contains 593 video sequences of 123 subjects [34]. The subjects are of Latino, Asian, and African-American origin and aged from 18 to 30 years. We selected different video sequences and obtained 540 static images of the six basic expressions.
(iii) Static Facial Expressions in the Wild (SFEW): SFEW [35] contains real-world movie images captured in unconstrained settings, with variations such as noise, pose changes, and strong illumination changes. We used 291 of the 1394 images available in the database.

Sample images from each database are shown in Figure 4, and Table 1 lists the number of images taken from the MMI, CK+, and SFEW databases.

To make maximum use of the available data, we employed 5-fold and 10-fold cross-validation in all experiments. To give a clearer picture of the facial expression recognition performance, average accuracy rates and confusion matrices are reported for all three datasets.
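A typical way to realize this protocol is stratified k-fold cross-validation, sketched below with scikit-learn; this is an illustrative evaluation harness, not the authors' original code.

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def cross_validated_accuracy(X, y, folds=10):
    # Average accuracy over stratified folds (folds = 5 or 10).
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(kernel="rbf", gamma="scale"), X, y, cv=cv)
    return scores.mean()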

4.1. Experiment on MMI, CK+, and SFEW Database

This section presents the results obtained on the MMI, CK+, and SFEW datasets. The MMI dataset contains mostly spontaneous expressions. The proposed framework achieved average recognition accuracies of 96% and 98.62% on the MMI and CK+ databases, respectively. The confusion matrices for classifying the 7 facial expressions of the MMI dataset and the 6 basic expressions of CK+ are shown in Tables 2 and 3, respectively.

In Table 2, among the seven facial expressions, neutral and sad are the easiest, with an average recognition accuracy of 100%, followed by happy and surprised. In contrast, angry and fear are the most difficult expressions to classify. As shown in the table, the fear expression is mostly confused with neutral and surprised, which is expected because of their structural similarities [36]. Furthermore, the anger expression is most often misclassified as disgust or neutral, probably because the forehead wrinkles of the anger expression are also characteristic of the disgust expression.

The confusion matrix in Table 3 shows that the disgust, sad, and happy expressions are classified with 100% accuracy, followed by the surprised and anger expressions. The recognition accuracy for the fear expression is slightly lower at 95%; the misclassified fear samples are labeled as either anger or disgust, because these three expressions involve similar muscle activities [37]. Moreover, the average recognition accuracy on the CK+ dataset is slightly higher than on the MMI dataset, because the CK+ dataset contains more expressive emotions.

The confusion matrix for the SFEW results is shown in Table 4. The performance on the SFEW database is lower than on the MMI and CK+ databases because the SFEW images are captured in an uncontrolled environment (real-world images) and are therefore more challenging to classify. An average recognition accuracy of 50.2% is obtained on the SFEW database. Inspecting the per-expression accuracies, we observe that the sad, fear, and happy expressions are recognized most accurately, whereas the disgust expression obtains the lowest accuracy of 31.7%.

Table 5 presents a comparative assessment of the proposed method against existing state-of-the-art methods [6, 10-14]. Among these, the FER system presented in [11], which works on nonoverlapping patches, achieved the highest recognition accuracy of 93.66%; however, the length of its code is controlled by a new coding scheme, which makes the process too complex for real-time FER systems. The results show that our proposed method is superior to the existing techniques in terms of average recognition accuracy, and its per-expression recognition accuracy is also higher than that of the other methods.

In Table 6, the results for the CK+ database are compared with the state-of-the-art methods. The average recognition accuracy of our method is highly competitive. Although the method presented in [14] performs 1.11% better than ours, its use of a 3D convolutional neural network makes it computationally much more expensive.

Figure 5 presents a comparative assessment of the proposed method against other methods on the SFEW database. The results show that our proposed method achieves better results than the existing methods in the literature: its average recognition accuracy is 50.2%, whereas the methods in [13, 19, 20, 38-40] reported 26.1%, 30.14%, 33.8%, 44.0%, 49.31%, and 48.3%, respectively, on the same dataset. These results indicate that our dual-feature fusion strategy is well suited to FER in uncontrolled environments. The recognition accuracy on SFEW is significantly lower than on MMI and CK+ because of its challenging conditions, e.g., illumination changes and large pose variations.

4.2. Robustness against Noise and Occlusions

In an uncontrolled environment, noise and occlusions are the main factors that degrade image quality and reduce the facial expression recognition accuracy. Any FER system is therefore required to perform well in the presence of noise and partial occlusions. In this section, we examine the robustness of our proposed method under these conditions.

To check the robustness against noise, we randomly added salt-and-pepper noise at different levels to the images of the MMI and CK+ databases. This type of noise is composed of two components: salt noise, which appears as bright spots in the image, and pepper noise, which appears as dark spots. As shown in Figure 6, the noise density was increased up to a level of 0.05, because this is the average noise level normally observed in real-time systems [16].
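A simple way to inject such noise is sketched below; it mirrors the standard salt-and-pepper model (comparable to skimage.util.random_noise with mode 's&p'), and 0.05 is only the upper bound of the densities used in these experiments.

import numpy as np

def salt_and_pepper(gray, density=0.05):
    # Corrupt a grayscale image: pepper pixels become dark, salt pixels bright.
    noisy = gray.copy()
    mask = np.random.rand(*gray.shape)
    noisy[mask < density / 2] = 0
    noisy[mask > 1 - density / 2] = 255
    return noisy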

The results illustrated in Figure 7 show that the recognition accuracy of our proposed method does not decrease significantly as the density of the salt-and-pepper noise increases. We also observe that the recognition accuracy on the CK+ database is more stable under noise than on the MMI database, because the CK+ expressions are more representative.

To assess the performance of the proposed method in the presence of occlusions, we added a block of random size to the test images. Blocks with sizes ranging from 15 × 15 to 55 × 55 pixels, randomly placed on the face images, are shown in Figure 8.
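This occlusion test can be simulated as sketched below; the use of a black square and uniform random placement are assumptions made for illustration.

import numpy as np

def occlude(gray, min_size=15, max_size=55):
    # Place a randomly located black block of random side length on the image.
    size = np.random.randint(min_size, max_size + 1)
    h, w = gray.shape
    y = np.random.randint(0, max(h - size, 1))
    x = np.random.randint(0, max(w - size, 1))
    out = gray.copy()
    out[y:y + size, x:x + size] = 0
    return out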

The average recognition accuracy rates for both MMI and CK+ are reported in Table 7. For MMI, the accuracy decreases by up to 3.6% as the block size increases from 15 × 15 to 45 × 45, but it drops by 17% when a 55 × 55 block is used, because most of the important facial points are then hidden by the large block. In contrast, the recognition accuracy on the CK+ database decreases by only 7.5% with the 55 × 55 block. As expected, the recognition accuracy would approach zero under total occlusion.

To further demonstrate the robustness of our proposed method against noise and occlusions, we also compared its performance with the existing method of [16], as shown in Figures 9 and 10. The method in [16] was selected because of its state-of-the-art performance on the MMI and CK+ databases and because it uses a similar range of noise densities and block sizes. From the results, we can conclude that our dual-feature fusion method is more robust to noise and occlusions than the method of [16], since its recognition accuracy declines less.

5. Conclusion and Future Work

Facial expression recognition in real-world conditions is a long-standing problem. Low image quality, partial occlusions, and illumination variations in real-world environments make the feature extraction process more challenging. In this paper, we exploit both texture and geometric features for effective facial expression recognition. Effective geometric features are derived from facial landmark detection, which captures changes in the facial configuration. Since geometric feature extraction may fail under various conditions, adding texture features is useful for capturing minor changes in expression; WLD is used to extract the texture features because it captures subtle facial changes effectively. Furthermore, we employ score-level fusion of the geometric and texture features, which reduces the number of features. The performance of the proposed approach is evaluated on the standard MMI, CK+, and SFEW databases, and the results are compared with state-of-the-art approaches. The experimental results verify the effectiveness of our dual-feature fusion strategy.

Although WLD works well for extracting salient features from face images, the standard WLD cannot effectively represent local intensity variations because it neglects the different orientations of the neighboring pixels. In future work, we plan to address this issue and to experiment with ethnographic datasets.

Data Availability

The authors confirm that the data generated or analyzed and the information supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

All the co-authors have made significant contributions to the conceptualization, data analysis, experimentation, scientific discussions, preparation of the original draft, and revision and organization of the paper.

Acknowledgments

This study was supported by the Deanship of Scientific Research, King Saud University, Riyadh, Saudi Arabia, through the Research Group under Project RG-1439-039.