Abstract

Humans use facial expressions to convey personal feelings. Facial expressions need to be recognized automatically in order to design control and interactive applications. Accurate feature extraction is one of the key steps in an automatic facial expression recognition system. Current frequency domain facial expression recognition systems have not fully utilized the facial elements and muscle movements for recognition. In this paper, the stationary wavelet transform is used to extract features for facial expression recognition because of its good localization characteristics in both the spectral and spatial domains. More specifically, a combination of the horizontal and vertical subbands of the stationary wavelet transform is used, as these subbands contain muscle movement information for the majority of facial expressions. Feature dimensionality is further reduced by applying the discrete cosine transform to these subbands. The selected features are then fed to a feed forward neural network that is trained through the back propagation algorithm. Average recognition rates of 98.83% and 96.61% are achieved for the JAFFE and CK+ datasets, respectively, and an accuracy of 94.28% is achieved for a locally recorded MS-Kinect dataset. The results show that the proposed technique is very promising for facial expression recognition when compared with other state-of-the-art techniques.

1. Introduction

Emotions are a natural and powerful way of communication among living beings. Humans express their emotions through voice, face, body gestures, and behavioral changes. A reliable emotion perception scheme is required in order to translate human expressions and behavioral changes into useful commands for control systems. Emotion recognition is a challenging task because humans do not always express themselves through words and gestures. Automatic human emotion recognition is a multidisciplinary area involving psychology, speech analysis, computer vision, and machine learning. Facial expression is considered a powerful means of one-to-one communication, second only to speech, and plays a pivotal role in human computer interaction (HCI). The human emotional state is provoked by external stimuli, resulting in changes in facial appearance. The most commonly used system for describing facial expressions was developed by Ekman and Friesen and is famously known as the Facial Action Coding System (FACS) [1, 2]. In [3], Ekman defined six basic classes of facial expression, that is, anger, disgust, fear, happiness, sadness, and surprise, which are commonly used by researchers working in this area. Automatic facial expression recognition and analysis play a vital role in different application areas, such as human machine interaction, surveillance, information security, robotics, and video summarization.

Development of an affective facial expression recognition system still remains a challenging task. Facial images and videos are affected by illumination conditions, human age, and variations in how an emotion is expressed. A first attempt was made in 1978 by Suwa et al. [4] to develop an automatic facial expression system by analyzing twenty known areas of the image. Since then, many attempts have been made to develop automatic emotion recognition systems that can interpret affective human feelings. An automatic facial expression recognition system is divided into three main stages, that is, face detection and tracking, feature extraction, and emotion classification [1, 2]. In the first stage, the human face is cropped using face detection, head tracking, and head pose estimation. In the second stage, relevant image features are extracted that describe the changes in facial expression. These features can be extracted in a holistic or an anatomical fashion. In the holistic approach, features are extracted from the whole face and are based on the image texture or its transformation [5, 6]. In the anatomical approach, on the other hand, features are extracted from subportions of the face and are based on distance measures and geometric transformations [7, 8]. For facial expression recognition systems, the extracted facial features have generally been based on geometry [9, 10] or appearance [11, 12]. Appearance based features use skin texture variations such as furrows and wrinkles to study facial expression, and these features have been applied to the complete face as well as to selected face regions. Donato et al. [6] worked on in-plane image transforms, applying Gabor wavelets as texture descriptors to posed data of 24 subjects with a nearest neighbor classifier. Wu et al. [13] used Gabor motion energy as a texture descriptor with a support vector machine (SVM) as the emotion classifier and Cohn-Kanade (CK) [14] as the database. Texture deformation dynamics can also be incorporated for feature extraction. Geometric features, on the other hand, analyze the variation of human face components such as the nose, eyebrows, lips, and eyes in terms of location, distance, and shape. Yacoob and Davis [15] applied a rule-based classifier and region based optical flow methods to posed data of 32 subjects for emotion recognition. Wang et al. [16] recognized emotions using geometric B-spline curves with a Euclidean distance classifier and applied the algorithm to posed data of 8 different subjects. Valstar and Pantic [17] applied an affine transformation registration technique to the predefined MMI [18] and CK [14] databases by modeling the dynamics of 20 facial points, with a probabilistic actively learned support vector machine as the emotion classifier. Hybrid approaches make use of both geometrical and appearance based features for emotion recognition. A comprehensive survey of existing methods adopted in facial expression recognition systems is presented in [19-21].

The most important aspect of a facial emotion recognition system is to extract relevant features from either the spatial or the frequency domain. A number of algorithms based on frequency domain features have been proposed in the literature, using the discrete cosine transform (DCT), discrete wavelet transform (DWT), Discrete Fourier Transform (DFT), Gabor wavelet transform, and curvelet transform [22-28]. In [22], three different types of features based on DCT, FFT, and singular value decomposition are extracted and then fed to an SVM for facial expression recognition. A comparative study of DCT and two-dimensional principal component analysis (2D-PCA) for feature dimension reduction in facial expression recognition is presented in [23], where it is found that DCT gives a better recognition rate than 2D-PCA for the same degree of feature reduction. In [24], wavelet based features are provided to a bank of seven parallel SVMs. Each SVM recognizes one particular facial expression, and the outputs are then combined using a maximum function. DWT, DFT, and DCT are used as a unique combination in [25] to extract features for a face recognition system that are invariant in terms of pose, translation, and illumination. In [26], 18 filtered images are obtained by convolving the input image with 18 Gabor filters. The amplitude values of each filtered image at selected points are used as features, which are then classified using a Bayes classifier, SVM, and AdaBoost. In [27], facial element and muscle movements are used as features for facial expression recognition, which are obtained from patch based 3D Gabor filters. In [28], the entropy, standard deviation, and mean of the curvelet transform coefficients of each face region are used as features for facial expression recognition. The extracted features are passed to an online sequential extreme learning machine with radial basis function nodes.

Current frequency domain approaches have not fully utilized the facial elements and muscle movements for recognition. Therefore, hybrid frequency domain features are required to fully cover these aspects, which are necessary for facial expression recognition. In this paper, we aim to improve the performance of a facial expression recognition system by extracting features in the frequency domain using the stationary wavelet transform (SWT). The main reason for utilizing this transform is its good localization characteristics in both the spectral and spatial domains [29]. Once SWT is applied to the detected face, the feature vector length is further reduced by applying DCT and selecting a small number of DCT coefficients. For classification of facial expressions, an artificial neural network is trained on the features extracted from the images. The major contributions of this study are as follows: (1) the stationary wavelet transform is used for facial expression recognition, with a special emphasis on the horizontal and vertical subbands; (2) a new dataset of videos with seven general emotions is generated using MS-Kinect; (3) a significant increase in classification accuracy is achieved when compared with other frequency based emotion classifiers.

The remainder of the paper is organized as follows. Section 2 presents an overview of the stationary wavelet transform. Section 3 describes the proposed methodology in detail. Section 4 presents the facial expression recognition results of the proposed method and a comparison with state-of-the-art methods, followed by the conclusion in Section 5.

2. Stationary Wavelet Transform

Conventionally, the Fourier Transform (FT) is used as a signal analysis tool that decomposes a signal into constituent sinusoids of different frequencies. The major drawback of the Fourier Transform is the loss of time information. The Short Time Fourier Transform (STFT) is a compromise between time and frequency information: a window is applied to the signal and the Fourier Transform of the windowed segment is computed. The precision of the STFT depends on the window shape and size. The wavelet transform preserves both time and frequency information by decomposing the signal in a hierarchy of increasing resolution. The wavelet transform of a signal $x(t)$ is represented as
$$W(a,b) = \int_{-\infty}^{\infty} x(t)\,\psi_{a,b}(t)\,dt,$$
where $\psi_{a,b}(t)$ is the dilated and translated version of the mother wavelet $\psi(t)$ and is calculated as
$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right),$$
where $a$ and $b$ are real and positive numbers representing dilation and translation, respectively. Similarly, the discrete wavelet transform of a signal $x(n)$ is represented as
$$W(j,k) = \sum_{n} x(n)\,\psi_{j,k}(n),$$
where $\psi_{j,k}(n)$ is the dilated and translated version of the mother wavelet and is calculated as
$$\psi_{j,k}(n) = 2^{-j/2}\,\psi\!\left(2^{-j}n - k\right).$$
The discrete wavelet transform (DWT) can be implemented using a filter bank approach or a lifting scheme. In the filter bank approach, the input signal is passed through low pass and high pass filters and then decimated by a factor of two to obtain approximation and detail coefficients. In the case of an image, the DWT is applied in each dimension separately. Figure 1 shows the single-level wavelet decomposition of an image, which results in four subbands, that is, LL, LH, HL, and HH, representing the low resolution approximation, horizontal, vertical, and diagonal information of the input image. The DWT is a spatially variant transform, which means that the DWT of a shifted version of a signal is not equivalent to the shift of the DWT of the signal. This spatial variance arises from the decimation operation, which can be carried out by choosing either the even-indexed or the odd-indexed elements.

The stationary wavelet transform (SWT) solves this shift variance problem. SWT differs from conventional DWT in terms of decimation and shift invariance, which makes it suitable for change detection, pattern recognition, and feature extraction. In conventional DWT, at each level of the transform the input signal is first convolved with the low pass and high pass filters and then decimated by a factor of two to obtain the wavelet transform coefficients, so the total number of DWT coefficients is the same as the number of samples in the input signal. In SWT, the input signal is convolved with the low pass and high pass filters in the same manner as in DWT, but no decimation is performed to obtain the wavelet coefficients of the different subbands. Since there is no decimation in SWT, the number of coefficients at each level is twice the number of samples in the input signal. Figure 2 shows the single-level decomposition of an input image using SWT. It is clear from Figure 2 that the image is decomposed into four subbands, that is, LL, LH, HL, and HH, representing the low resolution approximation, horizontal, vertical, and diagonal information of the input image, all having the same resolution as the input.
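
To make the decimation difference concrete, the following minimal Python sketch (assuming the PyWavelets package, pywt, and the Haar wavelet as an illustrative choice) decomposes the same image with a single-level DWT and SWT and prints the subband sizes; the DWT subbands are half the input size in each dimension, while the SWT subbands keep the full resolution.

```python
# Minimal sketch (assumes numpy and PyWavelets are installed):
# compare subband sizes of a single-level DWT and SWT of the same image.
import numpy as np
import pywt

image = np.random.rand(128, 128)           # stand-in for a face image

# Single-level 2D DWT: subbands are decimated (64 x 64 here).
cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')
print('DWT subband size:', cA.shape)        # (64, 64)

# Single-level 2D SWT: no decimation, subbands keep the input size (128 x 128).
cA_s, (cH_s, cV_s, cD_s) = pywt.swt2(image, 'haar', level=1)[0]
print('SWT subband size:', cA_s.shape)      # (128, 128)
```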

3. Proposed Methodology

Figure 3 illustrates the proposed methodology for facial expression recognition, which is composed of three stages, namely, preprocessing, feature extraction, and classification. A detailed discussion is presented in the following subsections.

3.1. Preprocessing

The preprocessing stage is divided into two steps, that is, face detection and normalization. In the first step, the human face is detected in the input image using the Viola-Jones algorithm [30]. The main advantage of this algorithm is its capability to detect faces quickly with a high detection rate. The algorithm is composed of three phases.

First, the input image $i(x, y)$ is represented in the form of an integral image $ii(x, y)$, which is calculated as
$$ii(x,y) = \sum_{x' \le x,\; y' \le y} i(x', y').$$
The computation is performed on the entire image in a single pass using the following pair of recurrences:
$$s(x,y) = s(x, y-1) + i(x,y),$$
$$ii(x,y) = ii(x-1, y) + s(x,y),$$
where $s(x,y)$ represents the cumulative row sum, $s(x,-1) = 0$, and $ii(-1,y) = 0$. The use of the integral image allows rapid feature computation in image subregions and is independent of the size of the neighborhood selected. Secondly, the AdaBoost learning algorithm is used to select a few critical features from the complete set of features. Thirdly, the classifiers are combined in a cascade, resulting in rapid discarding of background regions so that more computation is spent on face-like regions. The features used in these classifiers are based on sums over rectangular neighborhoods of pixels. For a rectangular area in the integral image with four corner values $A$ (top left), $B$ (top right), $C$ (bottom left), and $D$ (bottom right), the sum over the rectangle is calculated as
$$\text{sum} = D - B - C + A.$$
In the second step, image normalization and histogram equalization are performed on the detected face to remove unrelated and unwanted parts that are present in the background of the dataset. The normalized image is obtained as
$$I_{N}(x,y) = \frac{I(x,y) - \min(I)}{\max(I) - \min(I)},$$
where $I$ is the subimage detected as the face region and $\min(\cdot)$ and $\max(\cdot)$ are functions used to find the minimum and maximum pixel values, respectively. Image normalization maps the intensity of the image to the new intensity range $[0, 1]$. Histogram equalization is then used to enhance the global contrast of the image, giving a better dynamic range.
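
A minimal preprocessing sketch along these lines is given below, assuming OpenCV (cv2) with its bundled Haar cascade as a stand-in for the Viola-Jones detector; the cascade file, the 96 × 96 output size, and the example file name are illustrative assumptions, not values taken from the paper.

```python
# Minimal preprocessing sketch (assumes opencv-python and numpy):
# Viola-Jones style face detection, min-max normalization, histogram equalization.
import cv2
import numpy as np

def preprocess_face(gray_image, out_size=(96, 96)):
    # Haar cascade shipped with OpenCV approximates the Viola-Jones detector.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = cascade.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                        # keep the first detected face
    face = gray_image[y:y + h, x:x + w]

    # Min-max normalization to [0, 1], then back to 8-bit for equalization.
    face = face.astype(np.float32)
    face = (face - face.min()) / (face.max() - face.min() + 1e-8)
    face = (face * 255).astype(np.uint8)

    face = cv2.equalizeHist(face)                # enhance global contrast
    return cv2.resize(face, out_size)

# Example usage (file name is illustrative):
# img = cv2.imread('subject01_happy.png', cv2.IMREAD_GRAYSCALE)
# face = preprocess_face(img)
```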

3.2. Feature Extraction

The process of feature extraction is shown in Figure 4. The detected and preprocessed face from the input image is first decomposed into different subbands using the stationary wavelet transform. SWT differs from conventional DWT in terms of decimation and shift invariance, at the cost of redundant information. A mathematical treatment of the shift invariance of SWT when the subbands are not decimated is given in [31].

In SWT, the input image is convolved with low pass and high pass filters to obtain approximation and detail coefficients without decimation. For a detected face image of size M × N, the SWT at the $j$th level is given as
$$LL_{j}(m,n) = \sum_{k}\sum_{l} h_{j}(k)\, h_{j}(l)\, LL_{j-1}(m+k,\, n+l),$$
$$LH_{j}(m,n) = \sum_{k}\sum_{l} h_{j}(k)\, g_{j}(l)\, LL_{j-1}(m+k,\, n+l),$$
$$HL_{j}(m,n) = \sum_{k}\sum_{l} g_{j}(k)\, h_{j}(l)\, LL_{j-1}(m+k,\, n+l),$$
$$HH_{j}(m,n) = \sum_{k}\sum_{l} g_{j}(k)\, g_{j}(l)\, LL_{j-1}(m+k,\, n+l),$$
where $1 \le m \le M$, $1 \le n \le N$, $LL_{0}$ is the input face image, and $h_{j}$ and $g_{j}$ represent the low pass and high pass filters at level $j$. $LL_{j}$, $LH_{j}$, $HL_{j}$, and $HH_{j}$ represent the approximate, horizontal, vertical, and diagonal subbands, respectively. Each SWT subband retains different information about the image: the LL subband is the overall image approximation, while LH, HL, and HH contain the horizontal, vertical, and diagonal information, respectively, as shown in Figure 4. This information helps in recognizing facial expressions, which depend on the changes that occur along these orientations. For example, for a smile the major changes in the face occur in the horizontal direction due to the movement of the lips. Each subband of the SWT decomposition has the same size as the original image, which results in four times as many coefficients as in the input image for the two-dimensional data. Therefore, some mechanism is required to reduce the features obtained from the nondecimated wavelet coefficients.
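
As a sketch of this decomposition step (again assuming PyWavelets; the Haar wavelet and single-level decomposition are illustrative assumptions, since the mother wavelet and number of levels are not stated here), a preprocessed face can be decomposed as follows and its detail subbands collected:

```python
# Minimal sketch: single-level SWT of a preprocessed face (assumes pywt, numpy).
import numpy as np
import pywt

def swt_subbands(face, wavelet='haar'):
    # Pad to even dimensions, since swt2 requires sizes divisible by 2**level.
    h, w = face.shape
    face = np.pad(face, ((0, h % 2), (0, w % 2)), mode='edge').astype(np.float32)
    cA, (cH, cV, cD) = pywt.swt2(face, wavelet, level=1)[0]
    # cH/cV/cD are the horizontal, vertical, and diagonal detail subbands
    # (LH, HL, HH in the notation used above); all keep the image size.
    return {'LL': cA, 'LH': cH, 'HL': cV, 'HH': cD}
```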

To reduce the feature vector length, block DCT is applied to the LH, HL, and HH subbands of the SWT. The DCT applied to each $N \times N$ block $f(x,y)$ is calculated as
$$D(u,v) = \alpha(u)\,\alpha(v)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} f(x,y)\cos\!\left[\frac{(2x+1)u\pi}{2N}\right]\cos\!\left[\frac{(2y+1)v\pi}{2N}\right],$$
where
$$\alpha(u) = \begin{cases} \sqrt{1/N}, & u = 0,\\ \sqrt{2/N}, & u \ne 0. \end{cases}$$
The DC coefficient $D(0,0)$ of each block is selected as a feature for each subband because it represents the majority of the energy of that block. We have combined features from different subbands, that is, LH, HL, LH + HL, and LH + HL + HH, to obtain more descriptive features in terms of the horizontal, vertical, and diagonal directions, which are then fed to an artificial neural network for facial expression recognition. This combination of SWT and DCT improves the classification results by utilizing the redundant information from different SWT subbands while keeping the feature vector length small by retaining only the DC coefficient of each DCT block.
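
The following sketch illustrates this dimensionality reduction step, assuming SciPy for the DCT; the 8 × 8 block size is an illustrative assumption, as the block size is not specified here.

```python
# Minimal sketch: block DCT on an SWT subband, keeping only DC coefficients
# (assumes numpy and scipy).
import numpy as np
from scipy.fft import dctn

def dc_features(subband, block=8):
    h, w = subband.shape
    feats = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            d = dctn(subband[r:r + block, c:c + block], norm='ortho')
            feats.append(d[0, 0])             # DC coefficient of the block
    return np.array(feats)

# Combine horizontal (LH) and vertical (HL) subband features, as in the
# best-performing LH + HL configuration:
# bands = swt_subbands(face)
# feature_vector = np.concatenate([dc_features(bands['LH']),
#                                  dc_features(bands['HL'])])
```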

3.3. Artificial Neural Network

The selected features are fed into a neural network that is trained to classify the seven common facial expressions. The network consists of three fully connected layers, as shown in Figure 5. It has $d$ inputs, where $d$ is the length of the feature vector, and seven outputs that correspond to the emotions being recognized. The training data is organized in pairs $(x, y)$, where $x$ is the input feature vector and $y$ is the corresponding target output. The network is a feed forward network and is trained using the back propagation algorithm. The actual output of the network during training, $\hat{y}$, differs slightly from the target output $y$. The output layer uses the softmax function, given as
$$\hat{y}_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{7} e^{z_{j}}},$$
where $z_{i}$ denotes the $i$th input to the output layer and $i = 1, \dots, 7$. This gives a probability distribution over the classified emotions, and the class with the highest probability is picked as the recognized emotion. All other neurons use the sigmoid activation function, given as
$$a = \frac{1}{1 + e^{-z}},$$
where $z$ and $a$ represent each neuron's input and output, respectively. The network is parametrized by the connection weights $w$ and biases $b$. These network parameters are initialized from a Gaussian distribution and are trained using back propagation to minimize the negative log probability of the target class, which is used as the cost function.
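
A compact NumPy sketch of the forward pass of such a network is shown below; the hidden layer sizes and the feature vector length are illustrative assumptions, since they are not reported in this section.

```python
# Minimal sketch (numpy): forward pass of a 3-layer fully connected network
# with sigmoid hidden activations and a softmax output over 7 emotions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

def init_params(sizes, rng=np.random.default_rng(0)):
    # Gaussian-initialized weights and zero biases, one pair per layer.
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    *hidden, (W_out, b_out) = params
    a = x
    for W, b in hidden:
        a = sigmoid(W @ a + b)               # sigmoid hidden layers
    return softmax(W_out @ a + b_out)        # probabilities over the 7 emotions

# Example: feature vector of length 288 (illustrative), three weight layers.
params = init_params([288, 100, 50, 7])
probs = forward(np.random.rand(288), params)
predicted_emotion = int(np.argmax(probs))
```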

The optimization is performed using stochastic gradient descent (SGD) on the following regularized cost function:
$$C(w, b) = -\frac{1}{n}\sum_{x} \ln \hat{y}_{L}(x) + \frac{\lambda}{2n}\sum_{w} w^{2},$$
where $n$ is the total number of training inputs, $\hat{y}_{L}(x)$ is the output of the final layer for the neuron corresponding to the target class, and $\lambda$ is the regularization parameter. The error is back-propagated, and using SGD the algorithm converges towards an optimal state. During back propagation, the gradient values are used to update the network parameters. The weights are updated as
$$w \leftarrow w - \eta\,\frac{\partial C}{\partial w},$$
where $\eta$ is the learning rate and $\partial C / \partial w$ is the error gradient with respect to the weights. The L2 regularization term in the cost function is included to avoid overfitting. The system is thus designed to solve the following minimization problem to find the best weights:
$$w^{*}, b^{*} = \arg\min_{w,\, b} C(w, b).$$
This design was validated for the classification of the seven typical facial expressions of the JAFFE and CK+ databases. The same experiment was also conducted on our own collected database, where the same seven common facial expressions are classified. The L2 penalty on the weight terms helps the neural network generalize better and avoid overfitting to sampling errors.
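
A hedged sketch of this training setup is given below, using PyTorch's built-in SGD and cross-entropy loss as a stand-in for the hand-written back propagation described above; the learning rate, weight decay (L2) coefficient, and layer sizes are illustrative assumptions.

```python
# Minimal training sketch (assumes PyTorch): SGD with an L2 penalty
# (weight_decay) minimizing the negative log probability of the target class.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(288, 100), nn.Sigmoid(),       # sigmoid hidden layers
    nn.Linear(100, 50), nn.Sigmoid(),
    nn.Linear(50, 7),                        # logits; softmax is inside the loss
)
criterion = nn.CrossEntropyLoss()            # softmax + negative log likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

def train_step(features, labels):
    # features: (batch, 288) float tensor; labels: (batch,) long tensor in 0..6
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()                          # back-propagate the error
    optimizer.step()                         # SGD update of weights and biases
    return loss.item()
```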

4. Simulation Results

In this section, we present the recognition performance of the proposed approach and its comparison with state-of-the-art frequency domain facial expression recognition methods. The recognition performance is computed in terms of the correct recognition rate (CRR). Three different datasets were used to evaluate the facial expression recognition results: the Japanese Female Facial Expression (JAFFE) dataset, the Cohn-Kanade (CK+) dataset, and our own dataset acquired with an MS-Kinect device. The JAFFE dataset consists of 213 gray scale images of seven different expressions posed by 10 females. The seven emotions present are anger (AN), happiness (HA), neutral (NE), sadness (SA), disgust (DI), fear (FE), and surprise (SU). The spatial resolution of each image is 256 × 256, and each image is rated for the different facial expressions by 60 subjects. The CK+ dataset consists of both posed and nonposed facial expressions of 210 adults; since supervised learning is used in this work, the posed data from the CK+ dataset is used. The dataset covers the same seven emotions, and the spatial resolution of the frames used is 640 × 480. The MS-Kinect dataset was created for the same seven emotions, with the aim of using facial expression recognition in MS-Kinect based applications. It consists of 210 images of the seven expressions performed by 5 males and 5 females. The spatial resolution of each image is 640 × 480, and each image is rated for the different emotions by 10 subjects.

In this work, features were extracted using the stationary wavelet transform. Applying DCT to each subband of the SWT reduces the overall feature vector length. Four different feature sets are created by combining DCT coefficients from different subbands, that is, LH, HL, LH + HL + HH, and LH + HL, capturing horizontal, vertical, horizontal + vertical + diagonal, and horizontal + vertical information, respectively. Twenty images for each expression are used for training, and the remaining images are used for testing. Table 1 shows the CRRs of the seven emotions on the JAFFE and MS-Kinect datasets for the different feature vectors based on the selection of SWT subbands. It is evident from Table 1 that the horizontal and vertical SWT subbands contribute the most to accurately recognizing the facial expressions. The combination of features from the LH and HL subbands, representing the horizontal and vertical features of the face, gives the best CRRs of 98.83%, 96.61%, and 94.28% for the JAFFE, CK+, and MS-Kinect datasets, respectively.

The confusion matrix of the seven emotions on the JAFFE dataset is presented in Table 2. Anger and disgust are more difficult to recognize than the other expressions, with CRRs of 96.6% and 96.4%, respectively, whereas the remaining facial expressions are recognized perfectly. In terms of misrecognition rate, anger contributes the most to the reduction of the overall performance. Table 3 presents the confusion matrix for the CK+ dataset. Anger and surprise are recognized with higher recognition rates, while disgust, fear, and happiness are recognized with lower recognition rates. The confusion matrix of the seven emotions on the MS-Kinect dataset is presented in Table 4. It is evident that the sad, fear, and happy facial expressions are the most difficult to recognize, with CRRs of 86.6%, 90%, and 93%, respectively, whereas the remaining facial expressions are recognized perfectly. In terms of misrecognition rate, sadness contributes the most to the reduction of the overall performance.

Table 5 presents a comparison of the proposed facial expression recognition scheme with state-of-the-art methods. These schemes were selected because they use frequency domain features, a similar testing strategy, and the same dataset. It is evident from the table that the proposed scheme outperforms the state-of-the-art approaches on the JAFFE dataset. The CRR of the proposed scheme is 19.53, 7.83, 5.9, and 3.66 percentage points higher than those of [23], [26], [27], and [28], respectively.

5. Conclusion

This study investigates a facial expression recognition system for images using stationary wavelet transform features. These features have been compared with other state-of-the-art frequency domain features in terms of correct recognition rate. The simulation results reveal meaningful performance improvements due to the use of SWT features. Different emotions generate varying muscle movements; however, the majority of emotions produce horizontal and vertical muscle movements on the face. Therefore, features are combined from the LH and HL subbands of the SWT, representing the horizontal and vertical directions, respectively. Since no decimation is involved in the SWT, a large number of coefficients results; hence, DCT is performed to reduce the feature dimension. The results indicate that the SWT based features give recognition results 19.53, 7.83, 5.9, and 3.66 percentage points better than DCT based, Gabor based, patch based, and curvelet based features, respectively, when applied to the JAFFE dataset. Moreover, the highest accuracy of the proposed scheme is 98.83%, 96.61%, and 94.28% for JAFFE, CK+, and MS-Kinect, respectively, when the LH + HL information is utilized. The proposed facial expression recognition scheme can play a vital role in HCI and Kinect based applications. In the future, we intend to use the proposed facial expression recognition scheme for the generation of personalized video summaries.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.