Abstract

Human emotion recognition from videos involves accurately interpreting facial features under face alignment, occlusion, and shape and illumination problems. Recognising emotions dynamically from video is even more demanding, and the situation becomes more challenging with multiple persons and rapid movement of faces. In this work, an ensemble max rule method is proposed. To obtain the results of the ensemble method, three primary methods, CNNHOG-KLT, CNNHaar-SVM, and CNNPATCH, are developed in parallel to detect human emotions from the keyframes extracted from videos. The first method uses HOG and KLT algorithms for face detection and tracking. The second method uses a Haar cascade and SVM to detect the face. Template matching is used for face detection in the third method. A convolutional neural network (CNN) is used for emotion classification in CNNHOG-KLT and CNNHaar-SVM. To handle occluded images, a patch-based CNN is introduced for emotion recognition in CNNPATCH. Finally, all three methods are ensembled based on the Max rule. The resulting CNNENSEMBLE for emotion classification achieves 92.07% recognition accuracy, considering both occluded and nonoccluded facial videos.

1. Introduction

Human emotions are inevitable in day-to-day interactions; they act as a catalyst for improving communication. Generally, humans use face, hand, voice, and body gestures to express their feelings. Among all these, the human face has been the most prominent and expressive medium for conveying emotions during interactions. Facial emotion recognition is the technology used to reveal information about one's emotional state or sentiments by analysing facial expressions from both static images and videos. This is a part of affective computing.

Emotions are integral to human communication: smiling to greet and show respect to others, frowning in confusion, raising the voice during arguments, and so on. They are the best means of nonverbal communication, irrespective of culture, religion, or race. Emotion recognition assists in determining how someone feels by obtaining information about their emotional state, so it can also be used for verification and recognition purposes.

In conventional market research, companies use surveys and customer reviews (verbal methods) to understand the demands and needs of customers. The other method is behavioural, in which companies record video feeds of users interacting with a product. They manually analyse the video to observe a user’s reactions and emotions. This method is useful, but it is time-consuming and tedious. Furthermore, it raises the overall cost. Market research firms can now easily automate video analysis and detect their users’ facial expressions using artificial intelligence (AI)-enabled facial emotion recognition systems. This saves time and labour while also lowering costs. Market research firms can use facial recognition systems to scale their data collection efforts.

1.1. Benefits of Emotion Detection
1.1.1. Assess Personality Traits in Interviews

Personal interviews are an excellent way to interact with potential candidates and determine whether they are a good fit for the position. However, analysing a candidate’s personality in such a short period of time is not always possible. Furthermore, many categories of discussion and judgement add to the complexity. Through facial expressions, emotion detection can assess and measure a candidate’s emotions. It assists interviewers in comprehending a candidate’s mood and personality traits. Human resources can use this technology to develop recruiting strategies and policies to get the most out of their employees.

1.1.2. Product Testing and Client Feedback

When customers try a product, emotion detection technology can help the product industry understand their genuine emotions. Companies can set up a product testing session, record it, and then analyse it to detect and assess the facial emotions that emerge during the session. Because AI powers emotion detection, it can evaluate user reactions to new product launches.

1.1.3. Enhances Customer Service

Emotion detection improves the user experience in nearly every industry. This technology enables retailers to create more personalised customer offers by analysing their browsing and purchasing habits. Furthermore, healthcare providers can use facial recognition to create better care plans and deliver services much more quickly.

1.1.4. Psychology and Crime Prediction

Human emotion recognition has applications in psychology. Emotions, which are active most of the time, control our behaviours unconsciously. Emotions can significantly influence criminal behaviour. According to criminal psychologists [1], there are nine levels of emotional motivation for criminal conduct: bothered, annoyed, indignant, frustrated, infuriated, hostile, wrath, fury, and rage. Using emotion analysis, the psychologist can understand a person's emotions and emotional fluctuations. As a result, emotion recognition can be used to predict crime.

1.2. Classes of Emotion

Emotions are generally classified as positive or negative. The six basic emotions are anger, happiness, fear, disgust, sadness, and surprise. Other emotions include embarrassment, interest, pain, shame, shyness, anticipation, smile, laugh, sorrow, hunger, and curiosity. Emotions can be discrete or dimensional [2]. According to the Natya Sastra [3], nine basic emotions are identified: love, laughter, sorrow, anger, courage, fear, disgust, surprise, and peace.

If a person is angry, the eyebrows are drawn together and lowered on the inside while the outer ends are pulled outward, in one or both eyebrows; vertical lines, generally two, appear between the eyebrows; the lower lid is raised; the eyes focus at the centre, staring or bulging; the lips are tightly pressed together with the corners down or in a square shape; the upper part of the mouth (or moustache) is curved; the nostrils may be dilated; and the lower jaw projects out. If a person is happy, the corners of the lips are drawn wide and upwards; the lips are parted and the teeth exposed, or the lips are widened with wrinkles running from the outer nose to the outer lip; the portion of the cheeks below the eyes is raised; the lower lid may show wrinkles or be tense; and crow's feet appear near the corners of both eyes. Therefore, identifying facial expressions is the key to emotion recognition.

For facial expression recognition [4], extraction of facial features for capturing the changes in appearance is achieved via harvesting deep feature semantics. To achieve this, traditional CNNs are optimised using a soft-max loss, which penalises the misclassified samples, forcing the features of different classes to stay apart.

Deep convolutional neural networks (CNNs) otherwise require massive training data to produce better accuracy. Due to the limited public databases available for facial expressions, data augmentation mechanisms must be employed. Cropping sample images at varied angles produces images at various positions and scales, which further reduces the sensitivity of the overall system. Therefore, if data are augmented for experimental purposes, the utmost care should be taken to preserve the model's robustness.

Deep CNNs can hierarchically learn the features from samples to represent all possible complex variations of input images [5]. The max pooling layers only consider input features’ first-order statistics, which limits learning deep semantic features. This becomes further complicated when pose variations and/or partial facial occlusions are included. Occlusions and variant poses are two major factors causing the significant change in facial appearance. Removing occluded regions is not practical when real-time video emotion recognition is considered. Real-world occlusion is yet another difficult task for emotion detection research. Using CNN to ignore occlusion and pose variations might lead to inaccurate facial features.

In contrast to the above claim, human intelligence can exploit both local facial regions and the holistic face for better perception of emotions under partial or complete facial occlusion. Due to the dynamism in the variation of local parts such as the eyes, nose, and mouth, the vital issue rests in the robust detection of such variations in every keyframe. Directly feeding the keyframes leads to underutilising prior knowledge hidden in consecutive frames, and hand-crafted facial descriptors are unsuitable for interpreting the powerful temporal features in facial images.

Several mathematical models capable of operating under adverse conditions have been proposed to address these challenges. Facial expression classifiers [6] can analyse constrained frontal faces. Subspace analysis techniques [7] require extensive training and are not suitable. Recognition systems based on local feature representations [8] respond better under facial illumination variations; however, occlusion and pose changes reduce their accuracy. Stacked supervised autoencoders [9] are better at solving the above problems; however, they need accurate, occlusion-free training data. The proposed work concentrates on emotion recognition from videos performed via ensemble-based approaches, since approaches such as [6] are less robust in processing video data streams. Though the work achieves dynamic emotion recognition via keyframe extraction, face recognition, and further image-based emotion detection, the idea of SSPP (single sample per person) is not addressed, as it might be computationally expensive and lead to delayed recognition.

Therefore, the proposed work uses KLT tracking to accurately track the recognised faces in video. The extraction of visual facial features for facial recognition is crucial since the colour and shape of faces in video are similar. In this work, HOG features are used to precisely capture facial characteristics such as the directions and edges of the face and facial intensities, which are later fed to the SVM classifier for robust facial recognition. Avoiding a very deep architecture and the need for colossal training, the proposed work uses a nine-layer CNN with extensive data augmentation. The primary goal is to analyse the emotion of all persons in a given video.

1.3. Challenges

A facial expression is representative of a specific emotion, yet it is not easy even for humans to recognise emotions accurately. Studies show that different people recognise different emotions in the same facial expression, and it is even more challenging for AI to differentiate between these emotions.

1.3.1. Technical Challenges

Emotion recognition shares many challenges. Identifying an object, continuous detection, and incomplete or unpredictable actions are the most widespread technical challenges of implementing emotion recognition.

Face occlusion is one of the main challenges in captured videos and pictures. Another commonly seen challenge is lighting issues. Identifying facial features and recognising unfinished emotions are crucial challenges in emotion recognition.

1.3.2. Psychological Challenges

Psychologists have studied the connection between facial expressions and emotions since the middle of the 19th century. Cultural differences in emotional expression are one of the main challenges. Infants and children indicate feelings differently than adults, so identifying children’s emotions is another challenge.

An ensemble method of image classification is proposed to improve emotion recognition. Ensemble learning aims to assemble diverse models or multiple predictions to boost prediction performance. The ensemble combines numerous learning algorithms to exploit their collective performance, improving upon the existing individual models and resulting in one reliable model. For image classification, both occluded and nonoccluded facial videos are used.

Online social networks produce billions of visual items, which are useful for recognising sentiment. A proverb found in many languages goes, "A picture is worth a thousand words," meaning that a single image can convey many ideas more effectively than a spoken description. Visual sentiment analysis of social network content helps us understand user behaviour and provides useful information for related data analysis. The majority of users of social networking platforms prefer images and emoticons to typing long sentences, and the Twitter platform encourages communication between users through short texts or images. The main objective of this work is to classify the sentiment behind messages represented in the form of images on social networks, using state-of-the-art machine learning and deep learning methods. The objectives are as follows:
(i) To propose new techniques for face detection
(ii) To develop a system with high accuracy of face emotion detection
(iii) To propose an ensemble convolutional neural network for face emotion classification

The primary goal of this paper is to examine social media posts in the form of image data to identify user attitudes regarding a particular topic of discussion. Utilising attitudes toward a topic of discussion on social media can help to identify and predict sentiments. Additionally, it aids in assessing personality traits in interviews, product testing and client feedback, and enhancing customer service to adapt quickly to constantly changing needs. This paper proposes an ensemble deep learning classification algorithm, CNNENSEMBLE, which combines the outcomes of CNNHOG-KLT, CNNHaar-SVM, and CNNPATCH for the analysis of sentiments. The proposed algorithm performs classification over occluded and nonoccluded social media post images to improve the accuracy of emotion classification.

Therefore, the main contributions of this paper are as follows:
(i) First, three different face detection methods, HOG-KLT, Haar-SVM, and PATCH, are used in parallel for emotion analysis, effectively identifying occluded and nonoccluded faces
(ii) Second, an efficient CNN-based emotion classification model highlights the impact of handling occlusion and nonocclusion for improving the classification of emotions
(iii) Finally, an ensemble deep learning algorithm named CNNENSEMBLE is proposed in this work for performing effective emotion recognition

These sentiment classification algorithms have been evaluated on the extended CK+ dataset for emotion recognition. The results obtained from this work show that CNNENSEMBLE emerges as the most accurate model for emotion recognition on both occluded and nonoccluded faces.

The rest of this article is organized as follows: Section 2 surveys related work on emotion classification in the literature and compares it with the proposed work. Section 3 explains the methods used and the algorithms proposed in this paper. Section 4 discusses the results obtained from this work and performs a comparison with existing work. Section 5 gives the conclusions drawn from this work and lists some future enhancements.

2. Related Work

In this section, we survey facial expression recognition, occlusion-aware facial expression recognition, and the techniques used for facial emotion recognition.

2.1. Facial Expression Recognition

Nowadays, distance learning and e-classrooms are part of our lives; in an e-learning scenario, teachers can understand students' engagement in learning by analysing their facial expressions [10]. Viola–Jones and Haar cascade algorithms are used for object detection and feature extraction, and CNN is used for expression classification. Facial expressions have been analysed for various applications. However, not all facial regions contribute to expression detection, and a few areas do not change with different expressions [11, 12]. Determining human behaviour from facial expressions is an excellent application in healthcare, tourism and hospitality, and the retail industry. Sajjad et al. [13] proposed a framework for analysing human behaviour via facial expressions from video. The face is initially detected using the Viola–Jones algorithm and then tracked via KLT. Viola–Jones has various stages, including Haar feature selection, which selects the most important facial features [14, 15], AdaBoost training, and cascading classifiers.

The eyes and nose are localised using Haar cascades [16, 17]. The algorithm first identifies the rectangular areas at the eye and nose positions, and the eye centres are then computed. If there is a mismatch in detection, anthropometric statistics are used. Face alignment is achieved using the position of the eyes, since the eyes do not move with expressions. After the nose position is extracted, the mouth region is identified using the nose region as a reference. The curves in the upper lips can be detected using horizontal edge detection techniques. The position of the eyes is also used for identifying the eyebrow region of interest.

Second, the detected face (registered into the database if not already present) is recognised using the SVM classifier, followed by facial expression recognition using CNN. The proposed work is constructed along these lines with additional semantic feature extraction using CNN. Bounding box approaches have been combined with confidence score and class prediction parameters within the layers of CNN to achieve improved detection accuracy in video surveillance. The proposed work also uses bounding box approaches for improved face detection accuracy.

2.2. Occlusion-Aware Facial Expression Recognition

Occlusion is the primary issue in handling real-time videos. Partial facial occlusion has been widely addressed in the literature. However, real-life occlusion detection is essential for applications in the healthcare and hospitality industries.

In handling occlusions, patch-based approaches [18, 19] have emerged as the state-of-the-art in real time. VGGNet is used to represent the input image as feature maps. ACNN decomposes the input image feature maps into subfeature maps, and this decomposition into multiple subfeature maps results in the identification of local patches. The feature maps are then sent to a gating (GG) unit to identify the location of the facial occlusion.

Patch-based ACNN (pACNN) performs region decomposition and occlusion perception. Region decomposition uses an exclusive approach [20] to select 24 out of 69 facial landmarks. The local feature maps are cropped and fed to the respective convolution layers using selected landmarks without compromising spatial resolution. After sufficient learning, the feature maps are converted to vector-shaped local features, provided to the attention layer. The attention net determines the scalar weight as a means of quantifying the importance of the identified local patch. Global-local-based ACNN (gACNN) takes care of the later stages of processing. It takes the full-face region and extracts the local details of the patches and their respective global context cues.

2.3. Datasets for Facial Emotion Recognition

Table 1 summarises publicly available video datasets and the addressed emotion categories.

The proposed work uses the CK+ and ISED databases and proceeds to facial expression detection using deep learning models. Additionally, the proposed work concentrates on extracting emotions for the basic categories.

2.4. Machine and Deep Learning Approaches for Facial Emotion Recognition

Emotions related to e-learning, such as boredom, confusion, contempt, curiosity, disgust, eureka, delight, and frustration, have mainly been identified in the recent literature [32–39]. Deep learning models, mainly convolutional neural networks, are used for emotion classification. Different deep learning models such as VGGNet [34, 39] and ResNet [35] are used for the implementation. A variant of CNN, DCFA-CNN [36], has been tested with different image datasets and achieved excellent classification results. Yolcu et al. [40] present a deep learning-based system for customer behaviour monitoring applications; the system uses a 3-cascade CNN for head pose estimation, facial component segmentation, and expression classification. GoogLeNet and AlexNet, which consist of two consecutive CNN layers, are widely used in facial expression recognition [41]. Table 2 presents various classifiers used for facial emotion recognition.

3. Ensemble Framework for Facial Emotion Recognition

The framework in Figure 1 presents facial emotion recognition from videos using ensemble CNN classifiers. It also highlights the ensemble CNN for robust video emotion detection with late multiple-feature fusion using deep semantic facial features. The video is the input used for the recognition of emotion. The video may contain a single person or multiple people, and emotion can be identified for both occluded and nonoccluded faces. The input video contains a sequence of frames. Initially, the frames are extracted from the video; these frames may contain faces or nonfaces. Using a keyframe extraction method, the keyframes are identified. The idea is to create a model that ensembles the inherent emotional information within the video frames. In this work, three different methods are used for emotion recognition and for improving accuracy, and all the methods are fused based on an ensemble strategy. Before emotion recognition, the first step is identifying the faces in the video frames.

In the first method, the face is detected through the Haar cascade algorithm and tracked using KLT tracking. The detected face image is fed into a CNN for further emotion classification. Similarly, face detection is achieved using HOG features and SVM in the second method, and the images classified as face images are the input of the CNN emotion classifier. In the third method, template matching is used to detect the faces in the frame, and the emotion of the image is then recognised using a patch-based CNN. This end-to-end trainable Patch-Gated Convolutional Neural Network (PG-CNN) can automatically detect and focus on the most discriminative nonoccluded areas of the face. After the emotions are identified by the three distinct methods, ensemble max rule-based emotion recognition accurately classifies the emotions.

3.1. Keyframe Extraction

Keyframe extraction is used to remove redundant frames, which leads to dimensionality reduction of the feature vector for classification. The input video is processed for keyframe extraction, where multiple keyframes are extracted in this module. This work uses the histogram difference method for keyframe extraction: the difference is calculated between each pair of frames, and a threshold value is obtained. Consider two frames fj and fj+1. If any changes or differences are found in fj+1 relative to fj, then fj+1 is taken into account. If there are no changes, the next subsequent frame is taken for examination, and the process continues until the last frame.

The process has two main phases. In the first phase, the threshold (TH) value is computed using the mean and standard deviation of the histogram of the absolute difference of successive image frames. In the second phase, the absolute histogram difference of consecutive image frames is compared against the threshold (TH).

The video frames are first extracted one by one. The histogram difference between two successive frames is calculated for each video frame. To determine the threshold, the mean (M) and standard deviation (SD) of the absolute differences of the histograms are calculated. The threshold can then be computed as TH = Md + a × SDd, where Md is the mean of the absolute differences, SDd is the standard deviation of the absolute differences, and a is an arbitrary constant. After obtaining the threshold, the next phase determines the keyframes by comparing the absolute difference of the histograms to this threshold. The process of histogram-difference-based keyframe selection is described in Algorithm 1.

(1) Extract the video frames
(2) Find the histogram difference between two adjacent frames (fj, fj+1)
(3) Calculate M and SD of the absolute differences
(4) Compute the threshold, TH
(5) Compare the difference (d) with TH
if d > TH
 select the frame as a keyframe
Else
 go to step 2
(6) Continue the process till the end of the video
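A minimal Python sketch of Algorithm 1 is given below, assuming greyscale histograms computed with OpenCV and the threshold TH = mean + a × standard deviation of the frame-to-frame histogram differences; the multiplier a and the 256-bin histogram are illustrative assumptions.

```python
import cv2
import numpy as np

def extract_keyframes(video_path, a=1.0):
    """Histogram-difference keyframe selection (sketch of Algorithm 1).

    `a` is the arbitrary multiple of the standard deviation used to set
    TH = mean + a * std; its exact value is an assumption.
    """
    cap = cv2.VideoCapture(video_path)
    frames, diffs = [], []
    ok, prev = cap.read()
    while ok:
        ok, curr = cap.read()
        if not ok:
            break
        # 256-bin greyscale histograms of two adjacent frames
        h1 = cv2.calcHist([cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)], [0], None, [256], [0, 256])
        h2 = cv2.calcHist([cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)], [0], None, [256], [0, 256])
        diffs.append(np.abs(h1 - h2).sum())   # absolute histogram difference
        frames.append(curr)
        prev = curr
    cap.release()

    diffs = np.array(diffs)
    th = diffs.mean() + a * diffs.std()       # threshold TH = M + a * SD
    return [f for f, d in zip(frames, diffs) if d > th]
```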
3.2. Method 1: CNNHOG-KLT

Method 1, developed inside the ensemble framework, consists of two main steps: (1) face detection and tracking and (2) emotion recognition.

3.2.1. Face Detection and Tracking

Method 1 uses the Haar cascade and the Kanade–Lucas–Tomasi (KLT) algorithm to detect and track faces in the frames extracted from videos. The steps used for face detection and tracking are as follows.
Step 1: Input the keyframes.
Step 2: Identify the relevant features using the Haar cascade algorithm, which locates the face; the located feature points need to be reliably tracked.
Step 3: Use the KLT method, which computes the displacement of the tracked points from one keyframe to another. It finds the traceable feature points in the first frame and then follows the detected features in the succeeding frames using the calculated displacement.

(1) Haar Cascade Algorithm. The Haar cascade is used to recognise faces in keyframes. It essentially identifies adjacent rectangular regions in a detection window at a specific location. The calculation entails adding the pixel intensities in each area and subtracting the sums. These features can be challenging to compute for a large image. To overcome the difficulty, integral images are used, reducing the number of operations compared to working on the larger original images. The integral image returns, at any (a, b) location, the sum of all pixel values above and to the left of the current pixel. Instead of computing sums at each pixel, it allows subrectangles to be evaluated with a few array references, and the Haar features are then computed from these. Haar features are primarily of three distinct kinds: line, edge, and rectangle features. The representation of line, edge, and rectangle features is shown in Figure 2.

The Haar features are applied to determine the facial features using the line, edge, and rectangle features. The value of a feature is calculated as the sum of the pixel values in the black area minus the sum of the pixel values in the white area. A threshold is set for each feature: initially, the average sum of each feature is calculated; the difference is then computed and checked against the threshold, and if the value meets or exceeds the threshold, it is detected as a relevant feature. During the creation of the integral image, the sum of pixel values in an image or a rectangular part of an image is computed by equation (2), in which I(a′, b′) is the intensity of the original image:

ii(a, b) = Σ_{a′ ≤ a, b′ ≤ b} I(a′, b′).

The integral image can be calculated in a single pass using the following equations, in which csum(a, b) is the cumulative row sum:

csum(a, b) = csum(a, b − 1) + I(a, b),
ii(a, b) = ii(a − 1, b) + csum(a, b),

with csum(a, −1) = 0 and ii(−1, b) = 0. After generating the integral image, each feature can be calculated in constant time.
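As an illustration, the short NumPy sketch below computes an integral image and reads off rectangle sums with four array references; haar_two_rect is a hypothetical two-rectangle edge feature, shown only to illustrate the black-minus-white computation.

```python
import numpy as np

def integral_image(img):
    # ii(a, b) = sum of all pixels above and to the left of (a, b), inclusive
    return np.asarray(img, dtype=np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    # Sum of pixels in the rectangle [r0..r1, c0..c1] via four array references
    total = ii[r1, c1]
    if r0 > 0: total -= ii[r0 - 1, c1]
    if c0 > 0: total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

def haar_two_rect(ii, r, c, h, w):
    # Hypothetical edge feature: left (dark) half minus right (light) half
    left = rect_sum(ii, r, c, r + h - 1, c + w // 2 - 1)
    right = rect_sum(ii, r, c + w // 2, r + h - 1, c + w - 1)
    return left - right
```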

(2) Kanade Lucas Tomasi Algorithm. Face detection requires tracking of faces across keyframes. Kanade–Lucas–Tomasi (KLT) is an effective feature-based face-tracking algorithm that continuously tracks human faces in the keyframes extracted from videos. The method finds the parameters that minimise the dissimilarity measure between feature points under the original translational motion model. For tracking the face, it finds the traceable feature points in the first keyframe and then follows the detected features in the succeeding keyframes based on the computed displacement value.

Let us assume that initially one of the corner points is (a, b). If (a, b) is displaced by a displacement vector d = (da, db) in the next frame, the displaced corner point can be calculated as (a′, b′) = (a + da, b + db), so the coordinates of the new point are a′ = a + da and b′ = b + db. The translational warp function W((a, b); d) = (a + da, b + db) is used to calculate the coordinates. The alignment is found by minimising the sum of squared differences between the template T and the image I warped back onto the template,

E(d) = Σ(a,b) [I(W((a, b); d)) − T(a, b)]²,

where d is the displacement parameter. Assuming an initial estimate of d is known, the increment ∆d is found by minimising

Σ(a,b) [I(W((a, b); d + ∆d)) − T(a, b)]².

The displacement ∆d is calculated by taking the first-order Taylor expansion of this expression and differentiating it with respect to ∆d, which gives equation (8),

∆d = H⁻¹ Σ(a,b) [∇I (∂W/∂d)]ᵀ [T(a, b) − I(W((a, b); d))],

in which H = Σ(a,b) [∇I (∂W/∂d)]ᵀ [∇I (∂W/∂d)] is called the Hessian matrix.
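In practice, the Haar-cascade detection and KLT tracking pipeline of Method 1 can be sketched with OpenCV as below; the cascade file, corner-detection parameters, and tracker settings are illustrative assumptions rather than the exact configuration used in this work.

```python
import cv2
import numpy as np

# Haar cascade face detector bundled with OpenCV (path is an assumption)
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_track(keyframes):
    """Detect a face in the first keyframe, then track corner points with KLT."""
    first = cv2.cvtColor(keyframes[0], cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(first, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return []
    x, y, w, h = faces[0]
    mask = np.zeros_like(first)
    mask[y:y + h, x:x + w] = 255
    # Traceable feature points inside the detected face region
    pts = cv2.goodFeaturesToTrack(first, maxCorners=100, qualityLevel=0.01,
                                  minDistance=7, mask=mask)
    if pts is None:
        return []
    tracks, prev = [pts], first
    for frame in keyframes[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Pyramidal Lucas-Kanade: displacement of the tracked points
        pts, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
        pts = pts[status.flatten() == 1].reshape(-1, 1, 2)
        if len(pts) == 0:
            break
        tracks.append(pts)
        prev = gray
    return tracks
```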

3.2.2. Emotion Recognition by CNN

This work uses CNN to achieve high precision in emotion recognition. CNN has two primary functions: feature extraction and classification. CNN has multiple layers, each performing a specific transformation. The goal of the CNN is to reduce the images so that they are easier to process without losing the valuable features needed for accurate prediction.

The first layer, which extracts features from the input image, is convolutional. Convolution can perform edge detection, blurring, and sharpening operations by applying filters to an image. When the image is too large, the pooling layer is used to reduce the number of parameters; spatial pooling, such as average pooling, reduces the size of each map while retaining important information. The fully connected layer flattens the resulting matrix into a vector and feeds it into a fully connected neural network.

During emotion recognition, the convolutional layers recognise features in the pixels, the pooling layers make these features more abstract, and the fully connected layers are responsible for the classification of emotions. The first layer is convolutional with a kernel size of 5 × 5 pixels and 16 output channels. The second layer is a max pooling layer with a 2 × 2 kernel size. In the same manner, nine convolution layers are used in this work. The final three layers are fully connected layers with 100, 50, and 5 neurons, respectively.
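The exact widths and spatial sizes of the intermediate layers are not fully specified in the text, so the PyTorch sketch below only mirrors the stated pieces (a 5 × 5, 16-channel first convolution, 2 × 2 max pooling, nine convolution layers in total, and fully connected layers of 100, 50, and 5 neurons); the remaining channel counts, the interleaving of pooling, and the 48 × 48 greyscale input are assumptions.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Sketch of the nine-convolution-layer emotion classifier (sizes partly assumed)."""
    def __init__(self, num_classes=5, in_channels=1):
        super().__init__()
        chans = [in_channels, 16, 16, 32, 32, 64, 64, 128, 128, 128]  # assumed widths
        layers = []
        for i in range(9):                      # nine convolution layers
            k = 5 if i == 0 else 3              # 5x5 kernel in the first layer (as stated)
            layers += [nn.Conv2d(chans[i], chans[i + 1], k, padding=k // 2), nn.ReLU()]
            if i % 2 == 0:                      # interleave 2x2 max pooling (assumption)
                layers += [nn.MaxPool2d(2)]
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(chans[-1], 100), nn.ReLU(),   # fully connected: 100, 50, 5
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# e.g. a batch of 48x48 greyscale face crops (input size is an assumption)
logits = EmotionCNN()(torch.randn(8, 1, 48, 48))
```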

In the convolution layer, image pixels are used directly as input, as in standard feed-forward neural networks for emotion recognition. In emotion classification, one or more 2D matrices Xi are fed into the convolutional layer, and multiple 2D matrices Yj are generated as output according to

Yj = f(Σi Xi ∗ Kij + bj).

Each input matrix Xi is convolved with a corresponding kernel matrix Kij; the sum of all convolved matrices is then computed, and a bias value bj is added to each element of the resulting matrix. Finally, a nonlinear activation function f is applied to each element to produce one output matrix Yj. Each set of kernel matrices represents a local feature extractor that extracts regional features from the input matrices. The learning procedure aims to find groups of kernel matrices that extract good discriminative features to be used for emotion recognition. Backpropagation, a neural network connection-weight optimisation algorithm, can train the kernel matrices and biases as shared neuron connection weights.

The pooling layer is used to reduce the feature dimension. It reduces the number of output neurons from the convolutional layer by combining neighbouring elements of the convolution output matrices. Max pooling is used for dimensionality reduction: a max pooling layer with a 2 × 2 kernel chooses the highest value among four adjacent input matrix elements to generate one element of the output matrix. During error backpropagation, the gradient signal is routed back only to the neurons that contributed to the pooling output. In our CNN model, the ReLU activation function f(x) = max(0, x) is used in the convolutional layers, which significantly improves both learning speed and emotion recognition performance.

Batch learning is used to accelerate learning and improve accuracy. Instead of updating the connection weights after each backpropagation step, we process 128 input samples in a batch and update the weights once per batch. To further speed up learning, a momentum term combined with weight decay is applied. The weight is updated by

w(t + 1) = w(t) − η ∇E(w(t)) + α ∆w(t) − λ η w(t).

The −η ∇E(w(t)) part is the backpropagation term, where w(t) is the current weight vector, ∇E(w(t)) is the error gradient with respect to the weight vector, and η is the learning rate. The α ∆w(t) part is the momentum term, where α is the momentum rate and ∆w(t) is the previous weight update; the momentum update speeds up learning. The −λ η w(t) part is the weight decay term, where λ is the weight decay rate. It slightly shrinks the weight vector towards zero in each learning iteration, which helps stabilise the learning process. The working process of the CNN during emotion recognition is shown in Figure 3: the type of emotion is detected after the CNN analyses the features extracted from the face image.
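A minimal training-loop sketch of this batch learning scheme is shown below, reusing the EmotionCNN class from the earlier sketch; the learning-rate, momentum, and weight-decay values follow those reported for PG-CNN training in Section 4.4.1 and are illustrative here, and the tensors stand in for real keyframe face crops.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

model = EmotionCNN()                       # class from the earlier sketch
criterion = nn.CrossEntropyLoss()
# eta (lr), alpha (momentum) and lambda (weight decay) mirror Section 4.4.1; illustrative only.
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005)

# Dummy data stands in for keyframe face crops; batch size 128 as in the text.
loader = DataLoader(TensorDataset(torch.randn(512, 1, 48, 48),
                                  torch.randint(0, 5, (512,))), batch_size=128)

for x, y in loader:                        # one epoch of batch learning
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()                        # backpropagation of the error gradient
    optimizer.step()                       # momentum + weight-decay weight update
```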

3.3. Method 2: CNNHaar-SVM

Method 2, developed inside the ensemble framework, consists of two main steps: (1) face detection and (2) emotion recognition.

3.3.1. Face Detection

Face recognition is a two-step process. Initially, HOG-based normalised face features are extracted. After generating normalised feature vectors, all the features are given as the input to the SVM classifier for face recognition.

(1) Histogram of Oriented Gradients (HOG) Face Feature Extraction. The HOG feature descriptor is used for emphasising face structures or shapes. The gradient magnitude and gradient angle are used to compute the features in this descriptor, and it outperforms other edge descriptors. It generates histograms for the areas of the face image based on the magnitude and direction of the gradient. Using HOG, each face image is first divided into small square cells; a histogram of oriented gradients is then computed for each cell, the result is normalised using a block-wise pattern, and a descriptor for each cell is output. Figure 4 depicts the flow of HOG feature computation from the input face image.

Seven significant steps are used to compute the HOG features from the input face image; they are explained below.
Step 1: Input the face image and perform preprocessing. The images are first preprocessed to bring the width-to-height ratio to 1 : 2; most preferably, the input face image is resized to 64 × 128. The resized images are then used for the optimal extraction of features.
Step 2: Compute the gradients. The horizontal and vertical gradients are calculated for each pixel of the input image: one is computed by equation (11) and the other by equation (12), in which R and C represent the row and column of the image matrix A. After the gradient computation for each pixel, the gradient magnitude and angle are computed by equations (13) and (14).
Step 3: Divide the image into 8 × 8 cells. The image is divided into 8 × 8 cells, and the histogram-of-oriented-gradients feature descriptor is computed for each 8 × 8 cell.
Step 4: Identify the histogram bins for each 8 × 8 cell. A 9-point histogram is computed for each 8 × 8 cell, with the gradient angles quantised into nine bins over the chosen angle range.
Step 5: Construct overlapping blocks by grouping 2 × 2 cells. After the histogram computation for all the cells is complete, four 9-point histogram matrices are combined to create a new (2 × 2) block. This grouping is carried out in an overlapping fashion with an 8-pixel stride.
Step 6: Generate a feature vector for each block. Thirty-six feature values are generated by concatenating the constructed 9-point histograms of each block.
Step 7: Perform feature vector normalisation.

The gradients of an image are sensitive to the overall lighting, so some portions of the image appear noticeably brighter than others, and the HOG features generated for individual 8 × 8 cells inherit this lighting fluctuation. To reduce this dependence, the gradients are normalised over blocks: a block is formed by joining four 8 × 8 cells. Since the histogram of each 8 × 8 cell in step 4 is a 9 × 1 matrix, each block contains four 9 × 1 matrices, i.e., a single 36 × 1 vector V, as represented in equation (15).

For normalisation, each value of the block vector is divided by k, the square root of the sum of squares of the values (the L2 norm of V), as per equations (16) and (17).

This completes the process of creating HOG features for the image: the features for the complete image are built by integrating the features of its 16 × 16 blocks. The generated features are then given as input to the SVM classifier for detecting faces.
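A short sketch of this descriptor using scikit-image is given below; it follows the cell, block, bin, and normalisation settings described above, while the use of skimage.feature.hog itself (rather than a hand-rolled implementation) is our assumption.

```python
import cv2
from skimage.feature import hog

def hog_features(face_img):
    """HOG descriptor for one face image, following the steps above."""
    gray = cv2.cvtColor(face_img, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (64, 128))            # width:height ratio 1:2
    return hog(gray,
               orientations=9,                     # 9-bin histograms
               pixels_per_cell=(8, 8),             # 8x8 cells
               cells_per_block=(2, 2),             # 2x2 cells per block, 8-pixel stride
               block_norm="L2")                    # normalise by the L2 norm k
```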

(2) Support Vector Machine (SVM) for Face Recognition. A support vector machine (SVM) solves the classic two-class recognition problem. It transforms the data using a kernel trick and then finds an optimal boundary between the possible outputs based on the transformation. This work uses SVM for face recognition by adapting the interpretation of the SVM classifier's output and devising a representation of facial images suited to a two-class problem. The SVM selects the decision boundary that maximises the margin to the classes' closest data points; this maximum margin classifier, or maximum margin hyperplane, is the decision boundary generated by SVMs.

Let {xi} be the HOG feature vectors and {yi} the class labels of the training data, where face images are labelled +1 and nonface images −1. The SVM algorithm considers this input during training and then finds the optimal decision surface defined by a number of support vectors si. The linear decision surface can be calculated by equation (18), in which αi is the coefficient weight, yi is the class label of support vector si, and the weight vector is the weighted summation w = Σi αi yi si; the computation of this weighted summation used in equation (18) is given by equation (19).

Here, x is a facial image representation vector in the face space, where the face space can be another feature space or the vectorised original pixel values. The SVM classifier function is the function calculated in equation (18).

To build a classifier for an image "A", a training set with two classes, one of nonfacial images and the other of facial images, is fed to the SVM. The SVM algorithm creates a linear decision surface to determine whether the face image "A" is a face or not. According to the following equation, image "A" is a face image if the input image A meets the requirement that the decision function is non-negative.

If the input image A satisfies the condition given in the following equation, i.e., the decision function is negative, then image "A" is a nonface image.
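The decision rule above can be sketched with scikit-learn as follows; the linear kernel, the placeholder training arrays, and the 3780-dimensional HOG vectors (64 × 128 input, 8 × 8 cells, 2 × 2 blocks, 9 bins) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

# X: HOG feature vectors (e.g. from hog_features above); y: +1 face, -1 nonface.
# The arrays below are placeholders for a real labelled training set.
X_train = np.random.rand(200, 3780)
y_train = np.random.choice([+1, -1], size=200)

svm = LinearSVC()                      # maximum-margin linear decision surface
svm.fit(X_train, y_train)

def is_face(hog_vec):
    # decision_function(A) >= 0 -> face; otherwise nonface
    return svm.decision_function([hog_vec])[0] >= 0
```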

After detecting the frames with faces, all the face images are passed to the CNN for emotion recognition.

3.3.2. Emotion Recognition by CNN

Emotion recognition is carried out by convolution with different filter sizes and the pooling layers of the CNN. The flow of detecting emotion from the face with CNN is shown in Figure 5. The working process of the CNN has already been discussed in Section 3.2.2.

3.4. Method 3: CNNPATCH

In this method, the face is detected by a template-based method, and the emotion is detected by analysing patches from the face with a patch-based CNN.

3.4.1. Template-Based Face Detection

Template matching locates faces using the correlation between predefined face templates and the input images. For instance, a human face can initially be broken down into its eyes, face contour, nose, and mouth, and an edge-rich face model can then be created using an edge detection technique. Template matching is a method of searching for and finding a template within a larger image: it determines whether the input face image and the template (training) images are similar. The presence of full-face features can then be ascertained by analysing the correlation between the input face images and the standard patterns stored for the full-face parts. The input images are examined at various scales to achieve shape and scale invariance. Algorithm 2 explains the process of template-based face detection in keyframes.

Input: the keyframe f(x, y) and the template image t(x, y)
Output: the keyframe with face
Read the keyframe f
Read the template image t
Apply template matching to detect the face image
 Slide t over f
 Compare t and f and find the correlation value (cv) using equation (1)
 Normalise the correlation value using equation (2)
 Compute the correlation threshold (T)
  Add the mean and an arbitrary number of standard deviations
 Compare cv and T
 If cv > T then
  The segment is marked as a face
 Else
  The segment is marked as nonface
 End If

Let f(x, y) and t(x, y) denote the keyframe and the template image, respectively. During the matching of t and f, the correlation value (cv) is calculated using equation (22). The correlation value is then normalised using equation (23). The correlation threshold (T) is computed by adding the mean and an arbitrary number of standard deviations. After that, cv is compared with T; if the value of cv exceeds T, then that segment is marked as a face.
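A compact OpenCV sketch of this template-matching detector is shown below; the use of the normalised correlation coefficient method and the value of the standard-deviation multiplier k are assumptions.

```python
import cv2
import numpy as np

def detect_face_by_template(keyframe, template, k=2.0):
    """Template-matching face detection (sketch of Algorithm 2).

    `k` is the arbitrary number of standard deviations added to the mean
    to form the correlation threshold T; its value is an assumption.
    """
    gray = cv2.cvtColor(keyframe, cv2.COLOR_BGR2GRAY)
    tmpl = cv2.cvtColor(template, cv2.COLOR_BGR2GRAY)
    # Slide the template over the keyframe; normalised correlation values
    cv_map = cv2.matchTemplate(gray, tmpl, cv2.TM_CCOEFF_NORMED)
    T = cv_map.mean() + k * cv_map.std()          # T = mean + k * std
    _, max_cv, _, max_loc = cv2.minMaxLoc(cv_map)
    if max_cv > T:                                 # segment marked as a face
        h, w = tmpl.shape
        return (*max_loc, w, h)                    # (x, y, width, height) of the match
    return None                                    # segment marked as nonface
```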

3.4.2. Patch-Based CNN for Emotion Recognition

For the effective handling of occlusion in face images, a Patch-Gated CNN (PG-CNN) [59] is used in this work. The primary reason for using patches instead of the entire face is to increase the number of training samples for effective and optimal feature-based CNN learning. The second reason is that a traditional CNN needs to resize faces when using full-face images as input, which significantly reduces the discriminative information. Using local patches maintains the native resolution of the original face images, which increases discriminative ability. The framework of the patch-based CNN during emotion recognition is shown in Figure 5.

This approach uses facial landmarks for region decomposition to generate the input image patches. An end-to-end trainable Patch-Gated Convolutional Neural Network (PG-CNN) [59, 60] automatically perceives the occluded regions of the face and focuses on the most discriminative unoccluded areas. According to the locations of the facial landmarks, PG-CNN divides an intermediate feature map into 24 patches to identify potential regions of interest on the face. A Patch-Gated Unit in PG-CNN then computes a weight from each patch itself and reweights the patch according to its relevance. The working of the Patch-Gated Unit, followed by the CNN, during partial facial occlusion is represented in Figure 6. The algorithm for occlusion detection from the patched image is described in Algorithm 3.

Input: Keyframes
Output: Representation of occluded face
Input the extracted keyframe as a face image
Generate a feature map (FM) from each keyframe
 Decompose the feature map into 24 subfeature maps and return the 24 local patches
 For each local patch
  Encode a weighted vector (wv) of the local feature (lf) by a PG-Unit
  The PG-Unit computes the weight by an attention net based on the patch's obstructed-ness
 End For
 Concatenate the weighted local features
 Return the representation of the occluded face.

The keyframes from the keyframe extraction phase are considered as the input images of the network. The network receives the input and represents it as feature maps. The feature maps of the entire face are then divided into 24 subfeature maps for 24 local patches via PG-CNN. A Patch-Gated Unit (PG-Unit) encodes each local patch as a weighted vector of local features. By taking into account each patch's obstructed-ness, i.e., how much of the patch is blocked, the PG-Unit determines the weight of each patch using an Attention Net. The occluded face is finally represented by concatenating the weighted local features. The face is assigned to one of the emotional categories through three fully connected layers, and the soft-max loss function value is minimised to optimise the PG-CNN. For handling the occlusion issue, PG-CNN uses two key schemes: region decomposition and occlusion perception.

(1) Region Decomposition. The patches are extracted based on the locations of the facial landmarks of each individual to identify the facial regions usually associated with expression. First, n facial landmark points are detected. The informative facial area, consisting of the two eyes, nose, mouth, cheeks, and eyebrows, is then covered by a newly computed set of m points. The selected patches are defined as the regions centred on each of the m points. The following steps are used in this study to compute the region decomposition for creating the feature maps.
Step 1: Detect 68 facial landmark points.
Step 2: Select or recompute 24 points that cover the informative regions of the face, such as the eyes, nose, mouth, cheeks, and eyebrows.
Step 3: Consider each of the 24 points as a centre and define 24 patches.
Step 4: Based on the feature maps of size 512 × 28 × 28 and the 24 local region centres, 24 local regions or patches (512 × 6 × 6) are obtained.

(2) Occlusion Perception with PG-Unit. The PG-Unit embedded in the PG-CNN automatically perceives blocked facial patches and pays attention mainly to the unblocked and informative patches. In each patch-specific PG-Unit, the cropped local feature maps are fed to two convolution layers without decreasing the spatial resolution, which is more effective at preserving information when learning region-specific patterns. The last 512 × 6 × 6 feature maps are processed in two branches. The first branch encodes the input feature maps as a vector-shaped local feature. The second branch consists of an attention net that estimates a scalar weight denoting the importance of the local patch. The local feature is then weighted by the computed weight.

Each local patch is encoded as a weighted vector of local features by a Patch-Gated Unit (PG-Unit). The PG-Unit computes the weight of each patch by an attention net, considering its obstructed-ness (to what extent the patch is occluded). Finally, the weighted local features are concatenated and serve as a representation of the occluded face. Three fully connected layers follow to assign the face to one of the emotional categories, and PG-CNN is optimised by minimising the soft-max loss. The following steps are used to identify occlusion with the PG-Unit and perform further emotion recognition.
Step 1: Input the feature map of patch i to the PG-Unit.
Step 2: Calculate the weighted feature as per equation (24), the importance (unobstructed-ness) αi as per equation (25), and the local feature vector as per equation (26), in which the last feature map ahead of the two branches feeds both of them, the weighted feature is the product of the local feature vector and αi, and the attention net comprises pooling, convolution, inner-product, and sigmoid-activation operations.
Step 3: The sigmoid activation forces the output αi to range in [0, 1], where 1 indicates the most salient unobstructed patch and 0 indicates a completely blocked patch.
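A PyTorch sketch of a single PG-Unit along these lines is given below: two convolutions over a cropped 512 × 6 × 6 patch feature map, one branch encoding the vector-shaped local feature and an attention branch producing a sigmoid weight in [0, 1] that scales it. The internal layer widths and pooling choices are assumptions, not the exact configuration of [59, 60].

```python
import torch
import torch.nn as nn

class PGUnit(nn.Module):
    """Patch-Gated Unit sketch: weight a local patch feature by its un-obstructed-ness."""
    def __init__(self, in_ch=512, feat_dim=64):
        super().__init__()
        # Two convolutions over the cropped 512 x 6 x 6 patch, keeping spatial resolution
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
        )
        # Branch 1: encode the patch as a vector-shaped local feature
        self.encode = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(in_ch, feat_dim), nn.ReLU())
        # Branch 2: attention net -> scalar weight alpha in [0, 1]
        self.attention = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(in_ch, 1), nn.Sigmoid())

    def forward(self, patch):                  # patch: (B, 512, 6, 6)
        x = self.convs(patch)
        feat = self.encode(x)                  # local feature vector
        alpha = self.attention(x)              # 1 = salient unblocked patch, 0 = blocked
        return alpha * feat                    # weighted local feature

# Concatenate 24 weighted patch features to represent a (possibly occluded) face
units = nn.ModuleList(PGUnit() for _ in range(24))
patches = [torch.randn(2, 512, 6, 6) for _ in range(24)]   # cropped sub-feature maps
face_repr = torch.cat([u(p) for u, p in zip(units, patches)], dim=1)  # (2, 24*64)
```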

3.5. Ensemble Max Rule Method for Emotion Recognition: CNNENSEMBLE

The idea is to create a model that ensembles the inherent emotional information within the video frames. Two basic approaches are proposed to achieve this purpose: (1) maximum emotion ensembles (MSE) (2) late multiple feature fusion (LMFF).

In maximum emotion ensembles, three models are explored: (1) Max. Emotions, (2) Max. Emotion Intensity, and (3) Max. Emotion Sustenance [19, 61]. All three models work on the extracted keyframes of the given video. The Maximum Emotions model counts the maximum probability related to each emotion across the extracted keyframes and reports it as the final emotion. The Maximum Emotion Intensity model measures the intensity of emotions for every keyframe and recommends the most intensified emotion. The Maximum Emotion Sustenance model is more accurate than the above two models [19, 61]; it measures the emotion in every keyframe and looks globally at the emotion that has occurred repeatedly for the longest sequence of keyframes.

Late multiple-feature fusion operates independently in three ways. The first method performs face detection and tracking from the input video and then uses CNN for emotion detection over the face bounding box. The other two methods perform video emotion recognition via image-based approaches. The input video sequence is split into multiple keyframes. The keyframes are fed to the face recognition module, which identifies the sample; the corresponding face set from the database is identified, features are extracted from the input face and matched with the trained sample features using SVM, and the result is then fed to CNN for emotion detection in the second method. The last approach uses patch-based ACNN for occlusion-aware emotion recognition. All three emotion recommendations are later fused in the ensembling setting to recommend the emotion at the output. The pseudocode for ensemble classification is stated in Algorithm 4.

Data: Training Set
BC: Number of base classifiers
SR: Ratio of samples that need replacement
 A parameter used to reduce the distance between training and synthetic data
 The ith attribute
 The standard deviation
r: sampling value from the normal distribution N(0, 1)
Training Phase:
For a = 1: BC
 Copy the original dataset, i.e., Data
 Identify the number of training samples that need replacement, i.e., TS
 For b = 1: TS
  Randomly pick a sample z from the copied dataset
   If z is a majority class sample, then
    Generate a neighbourhood of z and replace z in the copied dataset
   Else if
    z is a minority class sample, then compute m = Round(·)
    Replace m neighbourhoods of z in the copied dataset
  End For
  Build the base classifier from the copied dataset
End For
Classification Phase:
For a given z
 evaluate ensemble to classify the sample z based on the majority voting strategy
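Algorithm 4 gives the training and voting phases in pseudocode. As a complement, the sketch below shows the Max-rule fusion over the per-emotion probability vectors of the three base classifiers described in Section 3.5; the softmax-style probability inputs and the seven-emotion label order are assumptions.

```python
import numpy as np

EMOTIONS = ["anger", "contempt", "disgust", "fear", "happiness", "sadness", "surprise"]

def max_rule_ensemble(p_hog_klt, p_haar_svm, p_patch):
    """Fuse three per-emotion probability vectors with the Max rule."""
    stacked = np.vstack([p_hog_klt, p_haar_svm, p_patch])   # shape (3, 7)
    fused = stacked.max(axis=0)                              # element-wise maximum
    return EMOTIONS[int(fused.argmax())]

# Placeholder softmax outputs of the three base classifiers for one keyframe
p1 = np.array([0.05, 0.02, 0.03, 0.10, 0.70, 0.05, 0.05])
p2 = np.array([0.10, 0.05, 0.05, 0.05, 0.60, 0.05, 0.10])
p3 = np.array([0.04, 0.01, 0.05, 0.05, 0.75, 0.05, 0.05])
print(max_rule_ensemble(p1, p2, p3))   # -> "happiness"
```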

4. Experimental Results and Discussion

This section discusses the dataset used in this work and the evaluation results of emotion recognition with the ensemble method.

4.1. Video Emotion Dataset

The primary dataset used in this work is the Extended Cohn-Kanade (CK+) dataset, which contains 593 video sequences from a total of 123 subjects. Of these, 327 videos are labelled with anger, contempt, disgust, fear, happiness, sadness, and surprise. A detailed description of the dataset is tabulated in Table 3. The CK+ database is one of the most widely used facial expression classification databases. We have also included some camera-recorded participants' facial expressions without disturbing their natural emotional display. The videos are 1–10 seconds long, and each video contains an average of 10–15 keyframes. A total of 1830 videos were used for the experiment, of which 80% were taken for training and 20% for testing.

4.2. Keyframe Extraction

A keyframe extraction approach [61] uses the histogram difference method with deep learning to extract the pertinent keyframes from the video sequence. The keyframe extraction achieves the highest recall and precision values for all the video sequences. In most cases, the highest value of a single metric is insufficient. The precision metric assesses a method's capacity to obtain the most accurate outcomes; a high precision value indicates more substantial keyframe relevance. However, a high precision value can be obtained by choosing just a few keyframes from a video sequence. A keyframe extraction algorithm depends heavily on both accuracy and speed: if the algorithm is slow, the throughput of the system is affected, and it is also necessary that the extracted keyframes are relevant and accurate, since they affect subsequent processes such as object detection, classification, and object description. The Precision in equation (27) and the Recall in equation (28) are evaluated during keyframe extraction and tabulated in Table 4.

The Precision and Recall values achieved using the keyframe extraction method are high, so the model gives unique frames without replicas. The measured CPU time to extract the keyframes (0.50) also shows that the extraction speed is good.

4.3. Face Detection

Facial detection plays a significant role in facial identification and emotion recognition. The method of face detection in photographs is complicated due to the variability across human faces, including pose, expression, position and orientation, skin colour, glasses or facial hair, differences in camera gain, lighting conditions, and image resolution. This method’s strength is to concentrate computational resources on the area of an image holding a face.

One of the computer technologies involved in image processing and computer vision is called object detection, and it deals with finding instances of objects like people, cars, buildings, and trees. Finding out if there is a face in the image is the primary goal of face detection algorithms. In this paper, we employ two face-detection techniques. Face detection allows us to gather the data required for emotion analysis.

4.3.1. Face Detection Using Haar Cascade and KLT Algorithm

The video may contain a single person or multiple people, and emotion can be identified for both occluded and nonoccluded faces. Initially, the face is detected through the Haar cascade algorithm and tracked using KLT tracking, which accurately tracks the detected face. The sample results of face detection using Haar and KLT are displayed in Figure 7 and tabulated in Table 5.

In Method 1, the keyframes with faces are identified and tracked by Haar cascading and the KLT algorithm. The model accurately detected the faces in the frames, but partially occluded or side-angled faces were missed. In our experiment, the model achieved an accuracy of 92.6%.

4.3.2. Face Detection Using HOG and SVM

Feature extraction for facial emotion recognition is performed through HOG features: the video is fed to keyframe extraction, and all the information and details about the face in each keyframe (i.e., face directions, edges, intensities, and colour) are extracted and saved in a separate file. This information is fed to the SVM classifier for accurate face detection. The sample results of face detection using HOG and SVM are displayed in Figure 8 and tabulated in Table 6.

The effectiveness of the face detection model is evaluated based on Precision (equation (29)), Recall (equation (30)), and Accuracy (equation (31)). The Precision and Recall values in Table 7 and the Accuracy values in Table 8 show that more accurate face detection is possible when using the HOG-SVM model.
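Equations (29)–(31) (and later (32)–(35)) are not reproduced in the text; the standard definitions below, in terms of true/false positives and negatives, are assumed to be what they denote.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{Recall}    &= \frac{TP}{TP + FN}, \\
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, &
F\text{-measure} &= \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align}
```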

Face detection with HOG and SVM offers more accurate results than Haar with KLT because it detects faces with angle changes or faces that are partially covered to some extent.

4.4. Emotion Recognition

After identifying the face, the emotions are detected using CNN and patch-based CNN. The performance of emotion recognition is discussed below.

4.4.1. Patch-Based CNN for Emotion Recognition

In this study, CNN is used as the base classifier by PG-CNN because of its straightforward structure and strong object categorisation performance. Twenty-four PG-Units are attached after the first nine convolution layers, whose output is selected as the feature map for region decomposition. The model was initialised using a model pretrained on the ImageNet dataset. For each dataset, both the training and test corpora are mixed with occluded images at a ratio of 1 : 1. We adopt a batch-based stochastic gradient descent method to optimise the model. The base learning rate was set to 0.001 and was reduced by the polynomial policy with a gamma of 0.1. The momentum was set to 0.9, and the weight decay was set to 0.0005. The training of the models was completed on a Titan-X GPU with 12 GB of memory. During the training stage, we set the batch size to 128 and the maximum number of iterations to 50 K. It took about 1.5 days to finish optimising the model.

The CNN disintegrates the feature maps into multiple subfeature maps for region decomposition. The facial image is aligned by fixing the 68 facial landmarks around the face, and the region is decomposed by splitting the facial landmarks into 24 patches that span the entire informative area. Patches are then extracted based on the locations of the landmarks on each subject's face. The following procedure is used to choose the facial patches:
(i) Sixteen points are picked from the original 68 facial landmarks to cover the eyebrows, eyes, nose, and mouth. The selected points are indexed as 19, 22, 23, 26, 39, 37, 44, 46, 28, 30, 49, 51, 53, 55, 59, and 57
(ii) In addition, 4 points are picked around the eyes and eyebrows, and the midpoint of each point pair is then computed as the delegate point

Based on the 512 × 28 × 28 feature maps and the 24 local region centres, a total of 24 provincial regions are obtained, each with a size of 512 × 6 × 6.
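As an illustration of this cropping step, the sketch below cuts 24 regions of size 512 × 6 × 6 out of a 512 × 28 × 28 feature map around landmark-based centres; the input image size, the coordinate scaling, and the zero padding near borders are assumptions.

```python
import torch
import torch.nn.functional as F

def crop_patches(feature_map, centres, img_size=224, patch=6):
    """Crop 6x6 local regions from a 28x28 feature map around landmark centres.

    feature_map: (512, 28, 28) tensor; centres: list of 24 (x, y) pixel
    coordinates in the original image. Image size and padding are assumptions.
    """
    c, h, w = feature_map.shape
    # Pad so that patches centred near the border stay inside the map
    padded = F.pad(feature_map, (patch // 2,) * 4)
    patches = []
    for (x, y) in centres:
        # Map image coordinates onto feature-map coordinates
        fx = int(round(x / img_size * (w - 1))) + patch // 2
        fy = int(round(y / img_size * (h - 1))) + patch // 2
        patches.append(padded[:, fy - patch // 2: fy + patch // 2,
                                 fx - patch // 2: fx + patch // 2])
    return torch.stack(patches)      # (24, 512, 6, 6)

fmap = torch.randn(512, 28, 28)
centres = [(50 + 5 * i, 80 + 3 * i) for i in range(24)]   # placeholder landmark centres
print(crop_patches(fmap, centres).shape)                  # torch.Size([24, 512, 6, 6])
```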

The inbuilt PG-CNN detects blocked face patches automatically and focuses mainly on unblocked and informative patches. The cropped local feature maps are given to two convolution layers, the attention layer and the encoding layer, in each patch-specific PG unit. Figure 9 depicts the regional features.

Table 9 shows the performance on both nonoccluded and occluded images during emotion recognition. For both occluded and nonoccluded scenarios, the overall accuracy on the seven facial expression categories is evaluated by performing a 10-fold evaluation.

A 10-fold accuracy test has been performed on the CK+ and ISED datasets with synthetic occlusions. The sizes of occlusion are 8 × 8, 16 × 16, and 24 × 24, represented by R8, R16, and R24, respectively, while the full image size is 48 × 48. The input images (size 48 × 48) without occlusion achieve a high accuracy of 97.02%. In the same image set, synthetic occlusion was applied at the different scales (S8, S16, S24). The accuracy on occluded images varies with the amount of occlusion, but Table 10 shows that the accuracy on the occluded images is also high.

4.4.2. Performance of CNN vs. Patch-Based CNN vs. Ensemble

Table 9 shows the different sets of experiments conducted for emotion detection from facial expressions. In the first method, CNNHOG-KLT, face detection is performed by the HOG and KLT methods, and the detected frames are classified by CNN. Similarly, in CNNHaar-SVM, Haar cascade and SVM techniques are used for face detection in the extracted frames, and again CNN is used for emotion recognition. In the last method, template matching is used for face detection, and patch-based CNN is used for emotion classification.

Tables 11–13 present the confusion matrices generated for emotion classification by Method 1 (CNNHOG-KLT), Method 2 (CNNHaar-SVM), and Method 3 (CNNPATCH), respectively. All three methods correctly identified all seven emotions.

From the confusion matrices, performance measures such as Precision (equation (32)), Recall (equation (33)), Accuracy (equation (34)), and F-measure (equation (35)) are evaluated and tabulated in Tables 14–17. These show that the CNNENSEMBLE method is the most accurate for emotion recognition.

4.5. Comparison with Existing Emotion Classification Methods

The comparison of existing emotion classification methods with the proposed model is tabulated in Table 18. Based on the comparison, the proposed ensemble CNN (CNNENSEMBLE) is more suitable for identifying the emotion class.

5. Conclusion

An ensemble of CNN methods performs robust emotion recognition of faces using multiple facial features. The proposed CNNENSEMBLE approach is suitable for both a single person and multiple persons in a video. Despite partial occlusions, the proposed work responds much better than previous CNN-based approaches. All the faces with emotions within keyframes are initially detected using the CNNHOG-KLT, CNNHaar-SVM, and CNNPATCH methods. After that, the CNNENSEMBLE method ensembles the detected emotions by the Max rule and achieves a maximum accuracy of 92.07%. In addition, other performance measures such as Precision, Recall, and F-measure also prove that the ensemble increases the emotion recognition rates. The system can detect emotions even under occlusion.

The emotion recognition system has to be further improved to handle more severe partial and complete occlusions. In addition, there is a plan to consider contextual information along with facial images to recognise human emotions. The features extracted from the face and from the context region surrounding a person can therefore be fused to produce richer labels and classify emotions in the future.

Data Availability

The emotion datasets used to support the findings of the study are available at https://www.kaggle.com/code/shawon10/ck-facial-expression-detection/data and https://sites.Google.com/site/iseddatabase/download.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

All authors contributed equally to this work.