Abstract

Facial emotion expressions are among the most potent, natural, and powerful means of human communication. Due to the COVID-19 pandemic, educational institutions worldwide were forced to switch rapidly to remote and online learning. Students found themselves in an emergency situation and had to adapt to various readily accessible learning methods, such as mobile learning applications and e-learning systems. A systematic literature review (SLR) is conducted to extract and synthesize information such as the emotion classifiers used in facial expression recognition (FER) systems, the datasets used, the preprocessing techniques applied, the feature extraction approaches used, and the strengths and limitations of previous studies. Based on the search criteria, 701 publications were initially retrieved from five different digital databases, of which 48 studies were chosen as primary studies for further analysis. Based on the findings of this study, the deep learning approach is the most frequently adopted approach for classifying student emotions during online learning. FER-2013 is the most commonly used dataset in FER studies, while DAiSEE is the most used academic emotion dataset. Moreover, the support vector machine (SVM) is the conventional learning emotion classifier most widely used in FER systems, while the convolutional neural network (CNN) is the most frequently used deep learning classifier. Next, it was found that real-time FER systems are less numerous than non-real-time FER systems. Finally, a top-1 accuracy of 94.6% was achieved by a long-term recurrent convolutional network on an academic emotion dataset, with the main limitation being the dataset's low illumination and lack of frontal poses.

1. Introduction

Facial emotion expressions are generally considered to be a powerful means of communication among humans. However, research has shown that cultural and individual differences can exist in how people interpret and respond to them [1]. Hence, automatic emotion recognition based solely on facial emotion expressions should be approached with caution, since it may not account for the diverse and subtle ways in which emotions are expressed and interpreted by individuals and across cultures.

Online learning can be defined as learning that takes place over the internet at the learner’s own pace or in real time, depending on the platform used [2]. According to a global survey, 85% of universities and other educational institutions adopted online learning as their primary teaching mode during the COVID-19 pandemic [3]. Students found themselves in an emergency situation and had to adapt to various readily accessible learning methods, such as mobile learning applications and e-learning systems. Distance learning and e-learning are not novel concepts for learners. Nevertheless, the pandemic has highlighted the importance of exploring online teaching and learning opportunities [4]. This situation has demonstrated the pros and cons of educational systems when confronted with the challenges of digitalization.

According to the study by Krithika and Priya, emotion plays a crucial role in online learning as it can impact the student’s interest in lectures [5]. An enthusiastic student has a higher chance of successful learning compared to a bored student, making it crucial to adapt learning to students’ emotions [6]. Furthermore, students may experience a variety of emotional or mental states in online learning, which can impact their learning process. During the pandemic, many students suffered from depression, academic stress, or anxiety in e-learning because they struggled to adapt to the new norm. Therefore, an emotion recognition system could assist educators in perceiving their students’ emotional state in online learning.

Facial emotions are among the most potent, natural, and universal messages for individuals to convey their emotions and thoughts irrespective of gender, ethnicity, and nationality [7]. Emotions are described as physical and mental states that help people deal with various circumstances. An individual is said to be attentive when his mind is focused on a certain topic. Educators interpret attention as a mental state in which learners focus on something since it is a prerequisite for learning and motivation in a classroom.

According to a meta-analysis by Camacho-Morles et al., the enjoyment of learning is positively correlated with academic performance, while negative emotions, such as anger and boredom, are negatively correlated with academic performance [8]. Students’ emotions and motivation affect their learning ability, and positive emotions and moods can sometimes lead to more precise decision-making, creativity, and adaptability [9–11]. Hence, individuals must be prepared on all levels, including emotionally, physically, socially, and mentally, to effectively absorb information during learning. Students who fail to understand and process information while learning may experience emotions like confusion, fatigue, boredom, or exhaustion [9]. This can lead to a change in student behaviour, leading them to skip online lectures or drop out of the course. Consequently, students struggling with unstable or negative emotions can find it difficult to learn effectively through online learning.

Identifying the emotion of students can bring many benefits, such as determining which type of teaching style or teaching materials have the potential to boost students’ positive feelings. Hence, the lecture materials or teaching style can be tailored to make students interested in every online lecture. Rothkrantz stated that positive emotions in students could favour the intention to engage in online learning [12]. In contrast, negative emotions can lead to negative learning outcomes, influence students’ educational path, and cause them to lose interest in learning. An enthusiastic student has a higher chance of successful learning compared to a bored student. Hence, learning must adapt to the emotions of the students [6].

A systematic literature review (SLR) is conducted to obtain an overview of what has been done on the student FER system in the education field. An SLR indicates the potential research gaps in a specific problem area and provides guidance to researchers and practitioners who are interested in carrying out new studies in that area. All associated research papers are retrieved from various digital sources, integrated, and discussed to answer the stated research questions. The SLR yields new insights and helps new researchers better understand the state of the art.

In this paper, the sections are structured as follows. Following the introduction is the background section, which details the facts regarding the problems of FER in online learning and the taxonomy of FER, which is split into preprocessing, feature extraction, and emotion recognition. After that, the methodology for identifying the relevant research studies will be discussed in the methodology section. The result section will summarize the results of completing the SLR phases, while the approaches used in FER in online learning will be reviewed and analyzed in the discussion section. Then, the research challenges and limitations will be discussed. Last but not least, this review paper will be concluded in the conclusion section.

2. Background

In this section, the background and significance of facial emotion recognition (FER) in online learning will be included. Firstly, the online learning problem will be discussed, followed by the significance of FER, the main stages of FER, and finally, the taxonomy of FER.

2.1. Online Learning Problem

Distance and self-paced learning methods appear to have shattered the bonds of friendship and social interaction between students in the classroom. Video conferencing platforms like Google Meet, Webex, and Microsoft Teams have therefore been used to facilitate real-time interaction between students and educators in online learning. Nevertheless, educators are unable to determine the students’ interest level in online learning via video conferencing platforms [13]. This restriction can be attributed to several reasons. Firstly, there is a possibility that interactive learning may not be supported by online applications. Features such as muting the microphone and disabling or restricting the camera or webcam safeguard students’ privacy during online classes on these platforms, and most students prefer to use these features. Another drawback is that these tools have a narrow field of view, so educators can only observe the students’ faces but not their postures and surroundings. Consequently, educators are not able to fully grasp the true emotions of their students as they can in traditional classroom settings. Besides that, students use a variety of gadgets, including tablets, smartphones, and computers, all of which have the potential to degrade visual quality.

There is a critical lack of academic engagement in large-scale online education such as massive open online courses (MOOCs), which are provided by many organizations, including some of the world’s best-known institutions, offer a wide range of courses, and have millions of students enrolled [14]. MOOCs are able to reach a large scale, but this comes with a significant drawback: the proportionately smaller number of educators cannot accommodate the large number of students they work with. As a result, educators are unable to monitor and participate in their students’ progress and engagement, which frequently results in a waterfall-style, one-way transfer of knowledge from educators to students.

2.2. Facial Emotion Recognition (FER)

In human communication, facial emotion is crucial to assist humans in understanding people’s intentions. The fluency, precision, and truthfulness of the interaction or communication can be enhanced by facial emotion recognition. This method of recognition is functional when it comes to deciphering the interactions between humans and computers.

Previous studies have shown that two-thirds of interpersonal communication is conveyed by nonverbal elements. Among these nonverbal elements, facial emotions are a crucial source of information in human communication [15]. The ability to recognize facial emotions is fundamental to effective interpersonal communication. In fact, emotion recognition is crucial to the experience of empathy, the prediction of prosocial behaviour, and the ability model of emotional intelligence. It is a difficult task because of problems such as similarities between facial actions and large head pose variations. The study by Sariyanidi et al. revealed that various components contribute to the effectiveness of FER approaches, including factors such as precise face registration, effective representation methods, and accurate emotion recognition algorithms [16].

There are several approaches to recognizing an individual’s emotions, and the acquisition of facial-based features is the most commonly used approach [17]. Among the channels used for automatic emotion recognition, the face is widely regarded as the dominant channel of emotional expression in humans, making facial emotion recognition the most researched of the various channels of emotional expression.

A number of approaches can be applied to assess the student’s emotions in online learning, such as through students’ facial expressions, eye movements, gestures, and posture, as well as feedback checklists from students [18]. Nonetheless, the educator can only view the students’ faces using the currently available technologies for online classes [19]. Besides that, sensors can be utilized for FER inputs, such as a camera, electroencephalograph (EEG), electrocardiogram (ECG), and electromyography (EMG). However, the camera is a promising sensor since it does not have to be worn and gives FER the most detailed indication [15].

2.3. Main Stages of FER

In this section, the main stages of FER will be discussed. The process typically involves three stages in FER: preprocessing, feature extraction, and emotion recognition. These stages are all essential for building an accurate and effective FER system.

2.3.1. Preprocessing

Preprocessing is a step that may be performed before feature extraction in order to enhance the FER system’s overall performance [20]. Preprocessing reduces noise and redundant data so that the feature extraction step can be better tailored [21].

In FER, face detection, dimension reduction, and normalization are the crucial preprocessing steps before heading to the feature extraction step [22]. Face detection, the first prerequisite phase in FER, involves detecting a face inside a frame or an image and removing any pixels that do not contribute significantly. Face detection is a challenging task since human faces come in a variety of shapes and sizes, so the choice of face detection algorithm has a significant impact on how well this challenge is handled. Examples of face detection algorithms include Viola-Jones, genetic algorithms, linear discriminant analysis (LDA), and principal component analysis (PCA).

Next, an approach called dimension reduction is applied to narrow down the number of variables to a set of principal variables [22]. When there are more features to consider, it is more difficult to visualize the training set and perform the necessary steps to improve it. In this context, PCA and LDA are useful algorithms for handling the mentioned issue. Besides that, normalization is a term that is interchangeably used with feature scaling. Following the dimension reduction process, the reduced features are normalized in a manner that does not misrepresent the variations in the range of features’ values. In order to speed up the training process and enhance the numerical stability of a model, many different normalization approaches, such as unit vector normalization, min-max normalization, and Z normalization, can be applied.
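
To make the normalization step concrete, the following is a minimal sketch of min-max and Z normalization applied to a matrix of reduced features, using NumPy; the feature matrix, its dimensions, and the small epsilon guard are illustrative assumptions rather than settings from any reviewed study.

```python
# Minimal sketch of min-max and Z normalization on reduced feature vectors.
import numpy as np

def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Rescale each feature column to the [0, 1] range."""
    f_min = features.min(axis=0)
    f_max = features.max(axis=0)
    return (features - f_min) / (f_max - f_min + 1e-8)  # epsilon avoids division by zero

def z_normalize(features: np.ndarray) -> np.ndarray:
    """Standardize each feature column to zero mean and unit variance."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)

if __name__ == "__main__":
    # 100 samples, 64 reduced features (e.g., after PCA); values are synthetic
    rng = np.random.default_rng(0)
    reduced = rng.normal(loc=128, scale=40, size=(100, 64))
    print(min_max_normalize(reduced).min(), min_max_normalize(reduced).max())
    print(z_normalize(reduced).mean().round(3), z_normalize(reduced).std().round(3))
```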

2.3.2. Feature Extraction

The process of extracting the important features, including geometric features, appearance-based features, or physiological features for FER, can result in smaller and more detailed attribute sets. These attribute sets consist of features like the distance between a pair of eyes, the distance between eyes and lips, and the edges, diagonal, and corners of a face, which aid in more rapid learning of previously trained data.

Appearance-based extraction and geometric-based extraction are the two approaches for feature extraction. Features such as corner and edge features may be extracted using the geometric-based extraction method. In order to extract geometric features, the position of the face components must first be recognized and then depicted using a set of feature points, also known as landmarks or contours [23]. Subsequently, the (x, y) coordinates are used to generate a feature vector. The feature vector, which contains geometric information of face components, is computed using the landmark points’ distance, the points’ arrangement, and the slope of connected lines.
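
As an illustration of this idea, the sketch below builds a small geometric feature vector from landmark (x, y) coordinates; the 68-point landmark indexing convention, the chosen distances, and the synthetic coordinates are assumptions for demonstration only.

```python
# Minimal sketch: geometric features from facial landmark coordinates.
import numpy as np

def pairwise_distance(p1: np.ndarray, p2: np.ndarray) -> float:
    """Euclidean distance between two landmark points."""
    return float(np.linalg.norm(p1 - p2))

def geometric_features(landmarks: np.ndarray) -> np.ndarray:
    """Compute a small feature vector (distances and a slope) from 68 landmarks."""
    left_eye_center = landmarks[36:42].mean(axis=0)
    right_eye_center = landmarks[42:48].mean(axis=0)
    mouth_center = landmarks[48:68].mean(axis=0)
    nose_tip = landmarks[33]

    eye_distance = pairwise_distance(left_eye_center, right_eye_center)
    eye_mouth_distance = pairwise_distance((left_eye_center + right_eye_center) / 2, mouth_center)
    nose_mouth_distance = pairwise_distance(nose_tip, mouth_center)
    # Slope of the line connecting the two eye centers (captures head roll)
    dx, dy = right_eye_center - left_eye_center
    eye_slope = dy / (dx + 1e-8)

    return np.array([eye_distance, eye_mouth_distance, nose_mouth_distance, eye_slope])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    fake_landmarks = rng.uniform(0, 48, size=(68, 2))  # synthetic coordinates for illustration
    print(geometric_features(fake_landmarks))
```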

Meanwhile, for the appearance-based extraction technique, salient point features are used in order to maintain the location of the eyes and the form of the lips and eyebrows, as well as other key facial features. This technique does not require face points; instead, it determines the texture information of facial images based on the grey level values of the pixels, along with the connection between each pixel and its set of neighbouring pixels. Furthermore, this technique is often accomplished through a variety of texture descriptors or image filters.

The specific features that are extracted can vary depending on the approach used for FER, and the choice of features can greatly impact the performance of the emotion recognition system. Therefore, feature extraction is an important step in the FER process and requires careful consideration and evaluation.

2.3.3. Emotion Classification

The stage after feature extraction in FER is emotion classification, in which the classifier sorts different expressions into the appropriate categories. Various classification algorithms, including conventional learning algorithms and deep learning algorithms, are widely used in emotion classification. CNN is the most widely applied classification algorithm nowadays. The fact that it can be applied directly to the input image without requiring separate facial detection or feature extraction algorithms makes it the most efficient algorithm [24]. Despite this, it still achieves a high level of accuracy on the input data.

Human emotions are inconstant since they go through cycles of highs and lows. Thus, classifying emotions based on context is very challenging.

2.4. Taxonomy of FER

This section introduces the taxonomy of FER based on the technology used, specifically conventional and deep learning-based approaches. An illustration of the FER comprehensive taxonomy is presented in Figure 1.

2.4.1. Conventional/Traditional FER

This approach uses handcrafted features extracted from facial emotion expressions, which are then classified using machine learning algorithms [25]. Conventional FER can be further classified as machine learning-based FER. This approach uses machine learning algorithms such as support vector machines (SVM), decision trees, random forests, Naïve Bayes, and K-nearest neighbors to recognize and classify facial expressions.

2.4.2. Deep Learning-Based FER

This approach applies deep learning algorithms that allow the automatic extraction of features and classification [26]. Deep learning-based FER can be further classified into hybrid FER and deep neural network-based FER. Hybrid FER combines conventional FER methods like feature extraction and selection with deep learning techniques to improve the model’s accuracy [27]. In deep neural network-based FER, on the other hand, deep neural networks such as CNNs and RNNs learn and extract features directly from raw facial images, which are subsequently classified into various categories of emotions.

3. Methodology

A review protocol is established before the SLR is performed. The SLR was carried out in accordance with the prominent SLR guidelines published in 2007 [28]. A review protocol specifies the approaches used to perform the SLR. In order to minimize the potential for publication bias, a predefined protocol is required. First of all, the research questions are identified. Once the research questions are ready, digital databases are used to find relevant research papers. Databases such as Science Direct, IEEE Xplore, Springer Link, Google Scholar, and Scopus were used in this SLR. There are three phases in the systematic literature review: the planning phase, the conducting phase, and the reporting phase, as presented in Figure 2.

In the first phase, the need for the SLR is identified, the research questions are specified, a review protocol is developed, and the review protocol is evaluated. The review protocol evaluation was recursive: the search string and domain list were revised repeatedly until the search results covered each identified domain. Next, the publications were identified and chosen by searching the available databases during the conducting phase. Data extraction was then performed, in which the authors’ details, publication types, publication year, and other details related to the research questions were collected. After the proper extraction of all the relevant data, a data synthesis was performed to present an overview of the related studies published to date. The review was concluded during the final stage by reporting the findings and answering the research questions. The review must be reported in adequate detail in order for readers to evaluate the comprehensiveness of the search. The unfiltered search results should be stored in case they need to be analyzed again.

Five research questions are specified to guide this SLR:

RQ1: What is the most frequently adopted approach in classifying student emotion during online learning for the student FER systems in recent years? (Conventional machine learning/deep learning/hybrid)

RQ2: Which datasets are used for the student FER systems in online learning?

RQ3: What is the most frequently used emotion classifier in the student FER systems?

RQ4: Do the existing student FER systems work in real time?

RQ5: What are the accuracy and limitations of previous studies that used the academic emotion dataset?

A systematic search was carried out in five digital databases, which are IEEE Xplore, ScienceDirect, Springer Link, Scopus, and Google Scholar, to answer the research questions presented in the previous subsection. The initial search input was “Facial emotion recognition” AND “Online learning”. The final search string was as follows: ((“facial emotion recognition” OR “facial emotion detection” OR “facial emotion classification”) AND (“e-learning” OR “online learning”) AND (“deep learning” OR “machine learning”)). There were 701 papers initially retrieved following the execution of the stated search string.

Exclusion criteria (EC) were used for study evaluation and assessment to determine the boundaries of the SLR to exclude irrelevant studies. Six ECs are listed as follows:

EC1: publication is a survey or review paper

EC2: publication has been published before 2018

EC3: publication without full text available

EC4: duplicate publication from multiple sources

EC5: publication not written in English

EC6: publication is not computer science-related

Following the application of the listed ECs, only 48 studies have been left for further review. In order to answer the research questions accordingly, the data from the selected publications were extracted and synthesized. Figure 3 illustrates the diagram of the study selection process.

An accurate and effective FER model can assist educators in evaluating the emotion of students in online learning. Various approaches were applied to classify student emotions in both classroom and online learning environments. This review article aims to investigate how machine learning and deep learning are used in student emotion recognition systems based on facial expressions in previous studies.

Review or survey papers are one of the exclusion criteria during the analysis of retrieved publications. Those omitted publications constitute related work and are addressed in this section. Dewan et al. performed a review study on engagement detection in online learning in 2019 [18]. The paper concluded that computer vision-based approaches have some constraints, although they are found to be effective in engagement detection. For example, the existing algorithms face difficulties in analyzing facial occlusions and head movements, so features cannot be extracted from certain video segments, resulting in data loss. In addition, very few available online datasets can be used to detect student emotions in online learning.

Li and Deng published a survey paper about deep facial expression recognition [29]. Their survey presents a detailed review of deep facial expression recognition, covering algorithms and datasets and clarifying fundamental issues such as overfitting due to insufficient training data and expression-unrelated variations. Besides that, the established deep neural networks and associated FER training methods are addressed based on static images and dynamic image sequences, along with their pros and cons. Furthermore, the challenges and opportunities in the FER field and the prospects for developing robust deep FER systems are also reviewed in their survey paper.

4. Results

In this section, the publications and information related to the FER approach used by the researchers to classify student emotions in online learning will be reviewed and discussed. In addition, the performance of the machine learning algorithms used in the FER will be examined and investigated.

A total of 48 publications were selected to be included in this review paper. Table 1 indicates the number of publications initially retrieved and the number of publications following the application of exclusion criteria. Table 2 summarizes the important information from each publication, such as the emotion classifier and dataset used, the accuracy of the classifier, preprocessing approach, the feature extraction method, strengths, and limitations.

Figure 4 illustrates the year-wise distribution of the primary studies over the previous five years. The statistics presented in Figure 4 indicate the publication patterns in the research evaluated over the years. It was discovered that most papers relevant to the topic of this review were published in 2020. In recent years, most studies (70.83%) applied deep learning algorithms to classify emotions, while 18.75% used machine learning algorithms and 10.42% applied hybrid algorithms. Figure 5 illustrates the classification approaches used in the previous studies from 2018 to 2022.

The type of approach used in the retrieved publications was investigated and extracted into Table 2 to address the first research question (RQ1). Besides that, the dataset used in the reviewed publications was also summarized in Table 2 to answer RQ2. FER-2013 is the most frequently used dataset among the reviewed FER systems. Besides that, popular datasets such as CK+ and JAFFE were also used in previous studies. A dataset focusing on academic emotions, DAiSEE, was used in seven of the 48 primary studies.

Furthermore, it was found that CNN is the most frequently used deep learning algorithm as an emotion classifier in the FER systems. On the other hand, SVM is commonly used in conventional machine learning-based FER systems. The percentages of real-time and non-real-time FER systems are summarized in Figure 6 to answer RQ4. The percentage of non-real-time FER systems was found to be higher than that of real-time FER systems. Other than that, DAiSEE and OL-SFED are examples of datasets that consist of academic emotions. The overall analysis of FER in online learning was performed by comparing the crucial components in each relevant research study, such as the dataset, emotion classifier, preprocessing method, feature extraction approach, results of the research experiment, and strengths and limitations.

According to the analysis of research papers collected for this SLR, 34 papers used deep learning algorithms as the emotion classifier in their research experiment. Out of 34 studies, a study conducted by Rao et al. in 2020 achieved the highest accuracy of 99.95% using CNN as the emotion classifier and CK+ as the dataset [61].

Furthermore, nine papers applied conventional machine learning algorithms for emotion classification. The study conducted by Sabri et al. in 2020 achieved the highest accuracy of 99.16% using SVR as the emotion classifier and JAFFE as their dataset [57]. Out of the five papers that used hybrid algorithms in facial emotion classification, a study by Shi et al. in 2019 achieved the highest accuracy of 93.80% [42]. They used a combination of CNN and SVM as their emotion classifier, together with a dataset consisting of 82 students taking different online courses.

Table 2 includes the strengths and limitations of the papers, which are not confined solely to the strengths and limitations of the emotion classifier but also encompass the datasets used in these studies. Hence, a summary of the advantages and disadvantages of each emotion classifier algorithm will be presented in Table 3.

Ultimately, a quantitative comparison may be preferable when the objective is to select the model with the best performance on classification tasks or to identify the model with the highest accuracy. However, it is essential to note that quantitative measures cannot necessarily capture all the relevant aspects of a model. Hence, qualitative and quantitative factors could be significant in picking the ideal FER model. A summary of the FER classification methods is presented in Table 4.

5. Discussion

In this section, the commonly used FER datasets and academic emotion datasets used in the selected publications are discussed. Besides that, the conventional learning emotion classifiers and deep learning classifiers applied are also presented in detail. Finally, this section concludes with a critical review of the publications that used academic emotion datasets and deep learning algorithms.

5.1. Commonly Used FER Datasets

There are multiple online datasets available for the FER field, including FER-2013, JAFFE, CK+, KDEF, DISFA, and DISFA+. Most available datasets are constructed from 2D video sequences or static images; nonetheless, 3D images can be found in certain datasets. The six basic emotions (happiness, disgust, anger, fear, surprise, and sadness), plus neutral, are labeled in most datasets. Some datasets were built in controlled environments, while others were created in the wild. This section presents several well-known and commonly used datasets in the reviewed works. The commonly used FER datasets are summarized in Table 5.

5.1.1. FER-2013

FER-2013 is a dataset generated using the Google Search API, which covers the six basic emotions plus neutral, collected by matching 184 emotion-related keywords [91]. Detailed information on the race or ethnicity of the individuals is not provided in the dataset. It consists of roughly 30,000 grayscale facial images scaled to 48 × 48 pixels with various facial expressions, labeled with seven emotion categories. FER-2013 is the largest publicly accessible dataset of facial emotions in the wild. Nevertheless, it is challenging for facial landmark detectors to extract landmarks due to the image resolution and quality.

5.1.2. Japanese Female Facial Expression (JAFFE)

JAFFE is a dataset of 213 images of facial expressions from ten Japanese female individuals [92]. Each woman posed seven facial expressions, and 60 annotators annotated the images with average semantic scores for each facial expression. Each facial image has a resolution of 256 × 256 pixels.

5.1.3. Extended Cohn-Kanade (CK+)

The CK+ dataset comprises 593 video sequences covering 123 participants of different genders and ethnic backgrounds, aged between 18 and 50 years old [93]. There were 81% Euro-Americans, 13% Afro-Americans, and 6% from other groups; 69% were female, while 31% were male. Each video depicts a transformation in facial emotion from neutral to a specific peak emotion, captured at 30 FPS with a resolution of either 640 × 490 or 640 × 480 pixels. There are 327 videos labeled with seven classes of facial expressions, including happiness, surprise, disgust, sadness, anger, fear, and contempt.

5.1.4. Karolinska Directed Emotional Faces (KDEF)

This dataset contains 4900 photos of facial emotions from 70 individuals, captured from five different angles and labeled with the six basic facial expressions plus neutral [94]. The subjects are 35 males and 35 females aged between 20 and 30 years old. This dataset was created in Sweden and is designed to represent Sweden’s population. Photographs of individuals from various backgrounds and ethnicities were included, but the exact demographics of the individuals are not explicitly stated on the official dataset website. During the photography session, subjects without beards, eyeglasses, earrings, moustaches, or noticeable make-up were preferred. Each image has a resolution of 562 × 762 pixels.

5.1.5. Denver Intensity of Spontaneous Facial Action Database (DISFA)

This dataset contains spontaneous facial emotions that can be used to automatically detect the action units and intensities described by FACS [95]. It includes videos of 12 females and 15 males of various ethnic groups: twenty-one were Euro-American, three were Asian, two were Hispanic, and one was African-American. Sixty-six facial landmark points are included in each picture. The pictures in the DISFA dataset were captured at a high resolution (1024 × 768 pixels) using the PtGrey stereo imaging system.

5.1.6. Extended Denver Intensity of Spontaneous Facial Action Database (DISFA+)

The DISFA+ database is an extension of the DISFA database and comprises a massive amount of data on posed and spontaneous facial emotions [96]. The participants in the DISFA+ dataset are from the same population as those in the original DISFA dataset. Besides that, it also includes metadata as well as manually labeled frame-based annotations of 5-level intensity for 12 FACS facial actions.

5.2. Academic Emotion Datasets

Besides FER datasets focusing on seven facial expressions (6 basic expressions and one neutral expression), only a few publicly accessible academic emotion datasets were recorded in an online learning environment. This section will discuss some academic emotion datasets.

5.2.1. Dataset for Affective States in E-Environments (DAiSEE)

This dataset comprises 9068 video clips of 112 Indian students, collected to detect the individual affective states of frustration, engagement, confusion, and boredom [97]. The videos in the dataset were captured in dormitories, laboratories, and crowded classrooms where the students focus on their educational tasks on a computer screen. Each affective state is further annotated with an intensity level ranging from 0 to 3.

5.2.2. Online Learning Spontaneous Facial Expression Database (OL-SFED)

OL-SFED is a dataset containing 30,184 images and 1274 video clips of 82 Chinese students in an online learning environment [46]. The dataset comprises spontaneous facial expressions corresponding to 5 typical academic emotions, namely enjoyment, fatigue, neutral, distraction, and confusion, with samples thoroughly annotated by participants and external coders.

5.3. Preprocessing Methods

According to the selected 48 retrieved publications, most of the studies applied the Viola-Jones algorithm, also known as the Haar cascades classifier, during the preprocessing phase.

The Viola-Jones algorithm is the most commonly implemented face detection algorithm; it searches through an image using a sliding window to seek features of a human face [98]. A face is assumed to be included inside a certain window of an image if and only if these features are identified and assigned a value that is unique to faces. This method consists of four steps, as presented in Figure 7.

The first step in this method is to read the image of the person’s face facing the camera [99]. Next, the Haar-like features interpret the captured facial image by processing it into boxes that signify dark and bright areas of the facial image. Some features of the human face are universal, such as the fact that the area around the eyes is always darker than its surrounding pixels and the area around the nose is always lighter.

The following step is to compute all of the pixels contained within a specific feature. The calculation of an integral image can be carried out in a time-efficient manner for every point in the image. After obtaining the integral image of all points, the pixel intensity of every subwindow in the image can be calculated with a maximum of four memory references. Adaptive Boosting, or AdaBoost, is one of the most popular boosting approaches; it merges a number of underperforming classifiers to create a strong classifier. It chooses several weak classifiers and combines them into a single model, giving each classifier a weight to produce a robust classifier.

The final step in the Viola-Jones method is the cascade classifier. Haar cascades are a kind of classifier that can detect an object in a previously trained image or video [100]. In the initial stage of classification, each subwindow has a specific feature assigned to it in order to determine its classification. The output of the feature is rejected if it does not satisfy the desired requirements. After that, the algorithm moves to the subsequent subwindow, where it performs another calculation to determine the value of the feature. The algorithm proceeds to the subsequent phase once the acquired result meets the prerequisite criteria. Finally, a subwindow is regarded as containing a face if it passes through all phases of the classification process.
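
A minimal sketch of this detection pipeline using OpenCV’s pretrained Haar cascade is shown below; the image path and the detection parameters (scaleFactor, minNeighbors, minSize) are illustrative assumptions, not values taken from the reviewed studies.

```python
# Minimal sketch: Viola-Jones (Haar cascade) face detection with OpenCV.
import cv2

def detect_faces(image_path: str):
    # The pretrained frontal-face cascade ships with OpenCV
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_cascade = cv2.CascadeClassifier(cascade_path)

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # optional contrast enhancement before detection

    # scaleFactor and minNeighbors control the sliding-window search over subwindows
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in faces]

if __name__ == "__main__":
    crops = detect_faces("student_frame.jpg")  # hypothetical webcam frame
    print(f"Detected {len(crops)} face(s)")
```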

Based on the retrieved publications, Ayvaz et al. [101], Candra Kirana et al. [34], Dewan et al. [35], El Hammoumi et al. [31, 32], Shi et al. [42], Dash et al. [45], Sabri [57], Murugappan et al. [59], and Murugappan et al. [60] have applied the Viola-Jones algorithm in the preprocessing step. Moreover, Ma et al. [32], Yang et al. [33], Hingu [62], and Zakka and Vadapalli [63] have used the Haar cascades method, while Hung et al. [39], Lasri et al. [40], and Alrayassi and Shilbayeh [50] have implemented the AdaBoost algorithm in the preprocessing phase.

Apart from that, histogram equalization is also applied in the preprocessing stage by G. Li and Wang [30], Mao et al. [38], and Rao et al. [61]. Histogram equalization is a computer image processing method applied to enhance image contrast. This method is applied because of the specialized nature of learning activities, where there is more light in the scene and less variation in the learner’s head angle. Besides the commonly used preprocessing methods, Kumar et al. [58] applied the Kanade-Lucas-Tomasi algorithm for face detection in the preprocessing phase. This algorithm has achieved a high true-positive rate under various exposure settings, and it can effectively detect facial parts like the eyes, nose, and mouth. Next, Wang et al. [7] used IntraFace, a preprocessing tool that integrates algorithms for facial attribute detection, head pose estimation, facial feature tracking, and more. As a result, the face’s essential features, such as the mouth, nose tip, eyes, and eyebrows, are easily detected, and the emotions can be recognized by rectangular outlines accordingly.

In the study by Krithika and Priya [5], colour conversion is used in the preprocessing phase. Firstly, the RGB image is transformed into the L*a*b* colour space. Then, the a and b channels are chosen and transformed into binary images. Finally, the AND operation is carried out on the binary images to produce a matrix with the values 0 and 1, where 0 represents the absence of a face and 1 represents the presence of a face. Meanwhile, Pise et al. [56] performed face detection and preprocessing such as scaling, aligning, and normalizing the samples. According to their research findings, alignment and normalization of the input samples can assist the deep neural network in learning relevant facial features. Besides that, another preprocessing technique, the conversion of a grayscale image into a string, was performed by Tang et al. [41]. The grayscale image of the face, which is 48 by 48 pixels in size, is converted into a string for each individual image. This technique helps reduce the dimensionality and model complexity.

In conclusion, preprocessing enhances the performance of facial emotion recognition since it helps reduce the noise present in the images. Therefore, it is an essential stage of image processing in the computer vision field.

5.4. Feature Extraction Methods

Based on the selected 48 retrieved publications, most of the publications applied a convolutional neural network (CNN) in the feature extraction phase.

The feature extraction part and the classification part are the two fundamental components that make up a CNN. The feature extraction network is made up of multiple convolutional and pooling layers. In image processing, convolution is an effective feature extraction method adept at lowering the dimensionality of data and yielding a less redundant data set, also known as a feature map [102]. In addition, each kernel acts as an identifier for a feature and may thus filter out locations in the original picture where the feature is present. Eventually, it generates a feature map whose values indicate where these features are distributed in the image.

A convolutional layer condenses the input data by extracting features of interest within it and constructing feature maps in response to various feature detectors [102]. The neurons in the first convolutional layer are responsible for filtering out basic features such as edges. In the subsequent convolutional layers, the neurons learn to combine this information to obtain a more comprehensive view of the image, achieving high-order feature detection. The most common pooling techniques utilized in CNNs are max-pooling and average-pooling. The purpose of pooling is to reduce the dimensionality of the data in order to minimize overfitting by concentrating local data within a pooling window. Additionally, appropriate pooling results in invariance with respect to translation, scale, and rotation, as minor dislocations or scalings no longer have an effect.
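
The sketch below illustrates this convolution-and-pooling feature extraction in PyTorch; the 48 × 48 grayscale input mirrors datasets such as FER-2013, while the filter counts and layer depths are arbitrary assumptions.

```python
# Minimal sketch: convolution + pooling as a feature extractor.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),  # low-level edge features
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),   # 48x48 -> 24x24, reduces dimensionality
    nn.Conv2d(32, 64, kernel_size=3, padding=1),                          # higher-order features
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),   # 24x24 -> 12x12
)

if __name__ == "__main__":
    batch = torch.randn(8, 1, 48, 48)       # 8 synthetic grayscale face crops
    feature_maps = feature_extractor(batch)
    print(feature_maps.shape)               # torch.Size([8, 64, 12, 12])
```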

Based on the selected reviewed publications, it was found that El Hammoumi et al. [31], Ma et al. [32], Hung et al. [39], Lasri et al. [40], Tang et al. [41], Shi et al. [42], Dash et al. [45], Bian et al. [46], Tang et al. [51], Wang et al. [7], Pise et al. [56], Hingu [62], and Zakka and Vadapalli [63] have applied CNN for the feature extraction.

Besides CNN, the local binary pattern (LBP) was also used in the studies conducted by Mao et al. [38] and Zatarain Cabada et al. [53]. LBP can describe both the shape and the texture of a digital image [103]. This is accomplished by first segmenting the image into multiple small parts, from which the features are extracted. Furthermore, Dewan et al. [35] made use of a feature extraction technique known as the local directional pattern (LDP) to extract person-independent edge features for the various facial emotions. It is robust enough to generate consistent representations even when nonmonotonic illumination changes and random noise are present.
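
For illustration, a minimal LBP feature sketch using scikit-image is given below; the uniform LBP variant, the radius and number of sampling points, and the synthetic face crop are assumptions chosen for brevity.

```python
# Minimal sketch: local binary pattern (LBP) texture histogram for a face crop.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face: np.ndarray, points: int = 8, radius: int = 1) -> np.ndarray:
    """Compute a normalized histogram of uniform LBP codes for one face region."""
    lbp = local_binary_pattern(gray_face, P=points, R=radius, method="uniform")
    n_bins = points + 2  # uniform LBP yields P + 2 distinct codes
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    face = rng.integers(0, 256, size=(48, 48)).astype(np.uint8)  # synthetic face crop
    print(lbp_histogram(face))
```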

Other than that, Ayvaz et al. [101] proposed facial landmark localization as the feature extraction method in their study. Facial landmarking refers to the process of detecting and locating specific points and features on a person’s face. The algorithm generates 68 facial landmarks on the detected face, each of which indicates the boundary of a facial feature.

Wibawanto and Kirana [6] applied the median fisher’s face in their research, a technique that combines the linear feature extraction and reduction approaches of principal component analysis (PCA) and linear discriminant analysis (LDA). This approach employs LDA in PCA space after detecting the face to obtain the fisher’s face. Following that, the fisher’s face is transformed into the fisher’s median. Moreover, Haar cascade extraction was used in the previous studies conducted by El Hammoumi et al. [31, 32]. Based on the research findings, the Haar cascade extraction approach would be the ideal method for high-performance images and high-resolution faces. This is because the Gabor filter and wavelet transform only extract particular facial features, leaving out features when the face is not correctly aligned with the camera.

Besides that, Liang [44] proposed an enhanced active shape model (ASM) approach, one of the most widely used face feature point localization algorithms, for extracting face feature points. This approach is not overly complicated and can be comprehended with little effort. However, there is room for improvement in this technique’s speed, accuracy, and overall effectiveness in applications. In the study conducted by Alrayassi and Shilbayeh [50], principal component analysis (PCA) was used for facial feature extraction. Besides feature extraction, the PCA algorithm also performs dimensionality reduction.

Next, Kumar [58] applied the Gabor filter to identify the key features of the faces. Utilizing a Gabor filter bank makes it possible to extract various features. Zhu and Chen [54] used Face++ Detect API to accurately extract 106 facial landmark points, including 20 landmark points of the mouth, 20 for two eyes, 15 for the nose, 18 for eyebrows, and 33 for facial contour. Furthermore, the gray-level cooccurrence matrix (GLCM) was used for the feature extraction in the research made by Sabri [57]. GLCM is used to train the grayscale images of the eyes and mouth regions for texture analysis.

To sum up, feature extraction is one of the crucial phases in FER because it determines the overall efficiency of the FER process. Furthermore, since this phase can minimize the dimensionality of data by eliminating redundant data, it can facilitate faster training and inference, which in turn increases the accuracy of learned models.

5.5. Conventional Learning Emotion Classifier

In 2016, a student emotion recognition system for online learning was presented [5]. It can capture students’ emotions that shift dynamically while listening to lectures in an e-learning environment. The local binary patterns (LBP) and Viola-Jones methods were applied for the detection of students’ faces and the classification of students’ emotions. The authors claimed that the quality could be further improved by also using eye and head movements based on the recognized concentration level.

A facial emotion recognition system was developed in 2017, which detects the students’ emotional states and motivation in the video conference form of e-learning based on facial expressions [101]. Machine learning approaches such as SVM, KNN, CART, and random forest were applied in the classification, and SVM and KNN algorithms achieved the best accuracy rates.

Besides that, a facial recognition model that detects emotions in a virtual learning environment was proposed in 2018 [33]. The authors used the Haar cascades approach to detect faces in the input image and extract the mouth and eye regions, working on the JAFFE dataset to detect emotions. The average accuracy for each emotion is 82.58% for fear, 84.32% for disgust, 91.22% for anger, 95.25% for happiness, 93.26% for surprise, and 78.54% for sadness. However, the authors stated there was uncertainty about how the image’s illumination and pose would influence the final emotion recognition because those two factors were not considered. This section will further discuss commonly used conventional learning emotion classifiers, including SVM, decision tree, and random forest.

5.5.1. Support Vector Machine (SVM)

This algorithm is a common regression and classification prediction algorithm, which applies machine learning theory to boost prediction accuracy while avoiding data overfitting. Nevertheless, since choosing the proper kernel function is crucial, SVMs are costly, difficult to tune, and do not perform efficiently on large databases. The SVM is also known as a system that applies a linear function hypothesis space in a high-dimensional feature space and is trained using an optimization theory-based learning approach that includes a statistical learning bias [104].
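
A minimal sketch of training an SVM emotion classifier on precomputed facial features with scikit-learn is shown below; the synthetic feature matrix, the seven-class labels, and the RBF kernel settings are placeholder assumptions.

```python
# Minimal sketch: SVM emotion classifier on precomputed facial features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 64))    # 500 samples, 64-dim feature vectors (synthetic)
y = rng.integers(0, 7, size=500)  # 7 emotion classes (0..6)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The kernel choice matters (see text); an RBF kernel is a common default
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```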

5.5.2. Decision Tree

Similar to SVM, the decision tree is also typically applied in regression and classification tasks, mainly in classification. Data features are represented as nodes in a tree-like structure, where each branch symbolizes a decision rule and each leaf node indicates the result. The threshold value of the neutral emotion is used in generating the decision tree, which provides rules to classify the data. The tree is constructed in a top-down manner without backtracking, determining how to make decisions in order to detect various facial expressions [105].

5.5.3. Random Forest

This algorithm is a classification algorithm containing a large number of individual decision trees. Every decision tree produces a prediction, and a majority vote decides the final prediction, ensuring that the final prediction is the most frequently predicted class. Overfitting is one of the major machine learning problems but can be mitigated using a random forest classifier; the overfitting problem will not occur as long as sufficient trees are available in the forest [106]. On the other hand, having many trees in the random forest makes the algorithm slow and inefficient for real-time predictions.
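
The following minimal sketch, using scikit-learn, illustrates the effect of the number of trees discussed above; the synthetic features and labels are placeholders, so the absolute accuracies are meaningless and only the setup is of interest.

```python
# Minimal sketch: random forest emotion classifier with varying numbers of trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 32))    # 400 samples, 32-dim feature vectors (synthetic)
y = rng.integers(0, 7, size=400)  # 7 emotion classes

for n_trees in (10, 100, 300):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()  # majority vote over the trees
    print(f"{n_trees:>3} trees: mean CV accuracy = {score:.3f}")
```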

5.6. Deep Learning Emotion Classifier

A deep learning approach reduces the dependency on image preprocessing and feature extraction in terms of FER [107]. It is robust in environments with different components, such as occlusion or illumination, which enables them to outperform the conventional approaches. Furthermore, the deep learning approach is capable of handling large datasets. In this section, several commonly applied deep-learning emotion classifiers will be further discussed.

5.6.1. Convolutional Neural Network (CNN)

CNN has achieved state-of-the-art performance in multiple fields such as FER, face recognition, and object recognition because the dependency on physics-based models or other preprocessing approaches can be greatly reduced or completely removed by enabling “end-to-end” learning from input pictures [15]. The main advantage of CNN is that geometric transformation, deformation, and illumination changes have only a slight effect on the recognition result.

A CNN is made up of convolution, pooling, and fully connected layers [108]. The convolution layer extracts features by combining linear and nonlinear operations, such as the convolution operation and an activation function. In order to let the neural network learn efficiently with high accuracy, the linear combination of features is transformed into a nonlinear one by the activation function. Next, the dimensionality of each feature map is reduced by the pooling layer while the important information is kept. The images are partitioned into overlapping or nonoverlapping regions, and each region is downsampled by a nonlinear function like max-pooling or average-pooling. After that, the output feature maps of the final convolution or pooling layer are often flattened and connected to at least one fully connected layer, in which every input is associated with every output by a learnable weight.
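
Putting these layers together, the sketch below defines a small end-to-end CNN emotion classifier in PyTorch; the 48 × 48 grayscale input and seven output classes follow FER-2013-style data, while the specific layer sizes are assumptions.

```python
# Minimal sketch: a small CNN emotion classifier (conv, pooling, fully connected layers).
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 48 -> 24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 24 -> 12
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                       # flatten final feature maps
            nn.Linear(64 * 12 * 12, 128), nn.ReLU(),
            nn.Linear(128, num_classes),        # logits for the seven emotion classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = EmotionCNN()
    logits = model(torch.randn(4, 1, 48, 48))   # 4 synthetic grayscale face crops
    print(logits.shape)                          # torch.Size([4, 7])
```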

CNN has the benefit of automatically detecting the essential features without human intervention, as opposed to its predecessors. In addition, CNN is computationally efficient since special convolution and pooling operations are used and the parameters are shared. This allows CNN models to run on almost any device, which makes them universally attractive. Moreover, CNN is good at distinguishing between two similar emotions because it processes more granular elements within an image.

Many recent studies have used deep learning algorithms such as CNN to infer emotions. One of them is a facial emotion recognition model that combines CNN with specific image preprocessing steps [109]. CNN was used to detect the seven basic emotions. In addition, 96.76% accuracy on the CK+ dataset was obtained using preprocessing techniques such as the generation of synthetic samples and intensity and spatial normalization. Nevertheless, the authors stated that their study is limited to frontal-face input images captured in a controlled environment.

Furthermore, Ma et al. presented an emotion recognition model using CNN [32]. The proposed model captures students’ images through a web camera, analyzes students’ real-time learning emotions, and gives the lecturer feedback. The authors found that lecturers who used the proposed model instinctively picked up on their students’ emotional states, in contrast to lecturers who did not use it. Nevertheless, some lecturers refused to use the emotion analysis model because the sentiment scores updating in real time interfered with their teaching approach.

In 2019, a deep learning-based facial emotion recognition model was proposed to evaluate the classroom teaching effect [41]. The authors trained the CNN model and used it to predict the students’ emotional states using the FER-2013 dataset. The proposed model demonstrates high accuracy and robustness in detection and can evaluate the students’ performance in the classroom in real time so that teachers can get immediate feedback. However, the authors stated that some misclassifications still occur, such as students who wore spectacles or had bushy whiskers being predicted as angry.

Moreover, a model to recognize learning emotions based on CNNs and transfer learning was proposed [39]. Transfer learning of the basic emotional model using FER-2013 showed the significance of data complexity in the deep learning model’s design process, achieving an accuracy rate of 84.59%. Nonetheless, the study focused only on demonstrating emotional data in a laboratory setting without appraising uncommon circumstances that could happen in the classroom environment.

In a previous study by Lasri et al., a model was proposed to integrate emotion recognition in education based on CNN [40]. Haar cascades were used for face detection, while CNN was used for emotion recognition, with normalization, on the FER-2013 dataset. The proposed model achieved an accuracy of 70%. Although the model effectively detects happy and surprised emotions, the emotion of fear is poorly detected, as the model confuses fear with sadness.

Recently, the combination of CNN and the geometric feature-based method has been used to enhance the performance of the models [55]. The proposed model can monitor the students’ faces through web cameras, and their facial expressions will be translated into learner engagement levels. Two tests have been carried out, and 90% of the CNN models have achieved an average accuracy of 95% for most subjects. The authors found that under certain circumstances, such as people having obstructions on their faces or moving their heads significantly, it was difficult to predict the levels of learning engagement.

Apart from that, genetic algorithms were also used to optimize CNN’s hyperparameters to determine an individual’s affective state in a previous study [53]. The results from the study demonstrate that the genetic algorithm improves accuracy compared with CNNs tuned using other machine learning algorithms or trial and error. The authors stated that the study results could be enhanced if the database were filtered and underwent class distribution balancing.

5.6.2. Long Short-Term Memory (LSTM)

LSTM networks are a special kind of RNN with a distinct and complex neural cell structure. They are specifically designed to overcome the RNN’s long-term dependency problem. Despite the different structure of the repeating modules, the LSTM is still structured like a chain.

FER based on LSTM has previously been proposed for video sequences because long-range context modelling can enhance the accuracy of emotion analysis. There are several advantages of the LSTM model compared to standalone approaches for modelling sequential images. When integrated with other models, LSTM allows straightforward end-to-end fine-tuning. The LSTM can also handle both variable-length and fixed-length inputs or outputs [110]. Furthermore, retaining the former cell unit information and updating the node information to the current cell state value allows LSTM to learn long-distance information in the dataset and retain relevant information by forgetting irrelevant information [111].
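
As a concrete illustration of sequence modelling for video-based FER, the sketch below combines a per-frame CNN feature extractor with an LSTM over the frame sequence in PyTorch; the clip length, frame size, hidden size, and class count are assumptions and do not reproduce any specific reviewed model.

```python
# Minimal sketch: per-frame CNN features followed by an LSTM over the frame sequence.
import torch
import torch.nn as nn

class CnnLstmFER(nn.Module):
    def __init__(self, num_classes: int = 7, hidden_size: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Flatten(),                                                  # 32 * 12 * 12 per frame
        )
        self.lstm = nn.LSTM(input_size=32 * 12 * 12, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        frame_feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)   # last hidden state summarizes the clip
        return self.head(h_n[-1])

if __name__ == "__main__":
    model = CnnLstmFER()
    logits = model(torch.randn(2, 16, 1, 48, 48))  # 2 synthetic 16-frame clips
    print(logits.shape)                             # torch.Size([2, 7])
```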

Recent studies have shown that LSTM has been successfully applied in the FER field. A model that combined temporal convolution, bidirectional LSTM, and an attention mechanism was proposed to determine the degree of engagement in online learning [47]. A top-1 accuracy of 60% was achieved for the four-class task using DAiSEE.

Next, a study that used the LSTM model to detect academic emotions such as boredom and frustration was conducted [52]. The findings indicated that, despite the higher accuracy of the FaceNet embeddings model, the facial landmark points model is more efficient at distinguishing between incidences of occurrence and nonoccurrence of boredom and frustration.

5.6.3. Convolutional LSTM (ConvLSTM)

ConvLSTM was created as an extended and modified version of LSTM to address the limitations of LSTM [112]. It has an LSTM-like structure and can be applied for modelling long-term dependencies in either the spectral domain or the time domain. ConvLSTM combines a CNN’s capability for local feature extraction with the ability of an RNN to use temporal context. The convolutional layers are used for feature extraction, while the transitions in image sequences are captured by the LSTM layers.

In ConvLSTM, convolutional structures are present in both the input-to-state and state-to-state transitions. The conventional fully connected LSTM does not take spatial correlation into consideration; in ConvLSTM, the matrix multiplications in the LSTM formulation are replaced by convolution operators so that spatiotemporal relationships can be modelled. Moreover, the future state of a given cell is determined by the inputs and the past states of its local neighbours, which allows the model to capture long-term temporal relationships. Hence, ConvLSTM combines LSTM's ability to handle temporal information with CNN's ability to handle spatial information.
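The sketch below illustrates this idea using Keras's ConvLSTM2D layer; the clip length, frame size, and number of classes are assumptions for illustration rather than details of any reviewed study.

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 7                  # assumed number of emotion classes
FRAMES, H, W, C = 16, 48, 48, 1  # assumed clip length and frame size

model = models.Sequential([
    # ConvLSTM2D replaces the LSTM's matrix multiplications with
    # convolutions, so spatial structure is preserved across time steps.
    layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                      return_sequences=False,
                      input_shape=(FRAMES, H, W, C)),
    layers.BatchNormalization(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```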

A deep learning approach based on ConvLSTM was proposed for affective analysis [113]. The results indicated that the proposed approach outperformed the conventional baseline and achieved state-of-the-art performance. Furthermore, a study conducted by Miyoshi et al. used an enhanced ConvLSTM to automatically recognize facial emotions from videos. Their findings indicated that the proposed technique achieved an accuracy of 49.26% on the eNTERFACE05 dataset and 95.72% on the CK+ dataset.

5.7. Evaluation Method of FER

FER is typically evaluated using metrics such as accuracy, precision, recall, and F1-score, depending on the specific task and dataset [15]. The choice of evaluation metric is crucial during the training phase, as it is used to distinguish among candidate models and select the optimal classifier. These four metrics are described in this section.

5.7.1. Accuracy

Accuracy is the number of correctly classified data instances divided by the total number of data instances. The issue with using accuracy as the primary performance metric is that it may be unreliable when there is a severe class imbalance. The formula for accuracy is as follows:
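\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.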

5.7.2. Precision

Precision is the classifier's ability to avoid labeling a negative instance as positive; it is the ratio of true positives (TP) to the sum of TP and false positives (FP) for each class. Precision is a useful metric that can convey more information than accuracy alone. In FER, precision is the proportion of automatic annotations for a specific action unit (AU) i that are correctly identified by the model [114]. The formula for precision is as follows:
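\[ \text{Precision} = \frac{TP}{TP + FP} \]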

5.7.3. Recall

Recall is the classifier's ability to correctly find all positive instances; it is the ratio of TP to the sum of TP and false negatives (FN). This metric ranges from 0 to 1, with 1 being the best value. In FER, recall refers to the proportion of images containing a specific action unit (AU) i that the model correctly identifies, calculated as the number of correct AU i recognitions over the total number of images containing AU i [114]. The formula for recall is as follows:
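\[ \text{Recall} = \frac{TP}{TP + FN} \]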

5.7.4. F1-Score

The F1-score is the harmonic mean of precision and recall, with 0.0 representing the worst and 1.0 representing the best score. Because it combines precision and recall, it penalizes classifiers that miss emotion classes as well as those that over-predict them. The formula for the F1-score is as follows:
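\[ F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

As a brief illustration of how these four metrics are computed in practice, the snippet below uses scikit-learn with placeholder labels; the label values and the use of macro averaging are assumptions for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Placeholder ground-truth and predicted emotion labels.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1-score :", f1_score(y_true, y_pred, average="macro"))
```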

5.8. Critical Review

In a previous study, Dewan et al. proposed a deep learning approach that uses learners' facial expressions to detect their engagement [35]. Nonlinear correlations in the extracted features were captured using kernel principal component analysis (KPCA), while person-independent edge feature extraction for multiple facial expressions was performed using the local directional pattern (LDP). Experiments conducted on DAiSEE demonstrated that two-level engagement detection achieved better accuracy (90.89%) than three-level engagement detection (87.25%). Nevertheless, the direct correlation between engagement and actual task performance is unknown.

In 2019, Bian et al. established a spontaneous facial expression dataset for the online learning environment named OL-SFED [46]. The dataset covers five emotions: confusion, distraction, enjoyment, fatigue, and neutral. In their study, a CNN-based algorithm was applied and achieved an accuracy of 91.6%. Furthermore, a comparison of pre- and postadoption assessment indicators demonstrated that the method can significantly boost inference performance. However, the number of samples in the dataset used in this study is limited.

In 2020, Leong conducted a study using DAiSEE that focused only on detecting negative emotions such as boredom and frustration [52]. The findings showed that the facial landmark points model is more efficient at distinguishing between occurrences and nonoccurrences of boredom and frustration, even though the FaceNet embeddings model's detection accuracy is higher. In our view, the authors should also include positive emotions such as engagement so that educators can maintain students' interest and motivation.

In 2020, Rao and Rao proposed a hybrid CNN model to detect the cognitive state of the learner [61]. Datasets such as DAiSEE, JAFFE, and CK+ were used, and an SVM was used as the classifier. The model achieved accuracies of 53.4%, 71.4%, and 99.95% on DAiSEE, JAFFE, and CK+, respectively. However, the recognition rate for the "frustration" emotion on DAiSEE is very low, since most of the images with a "frustration" cognitive state are predicted as "confusion".

Most of the existing studies that used academic emotion datasets are not real-time systems. Although Huang et al. built a real-time emotion recognition system using DAiSEE, its accuracy is quite low at only 60% [47]. Furthermore, two studies that used DAiSEE did not specifically classify student emotions: one focused only on negative emotions, while the other grouped the DAiSEE emotions into two or three engagement levels. The previous studies that used academic emotion datasets are summarized in Table 6 to highlight the gaps in the literature.

6. Challenges and Limitation

In recent years, FER in online learning has become an increasingly active research field. A variety of works have demonstrated remarkable outcomes and precisely classified emotions. The precision of classified emotions is typically evaluated in relation to a preestablished benchmark or baseline, such as the performance of other emotion classifiers or the accuracy of human annotators. For example, Huang et al. (2019) compared their model's performance against state-of-the-art approaches and benchmarks on the dataset, while Mohan et al. (2021) tested their method on five benchmark datasets and performed a comparative evaluation against 21 state-of-the-art methods. Although deep learning-based FER techniques have been successful in experimental assessments, a number of challenges and issues still need to be investigated further. This section discusses these challenges and limitations.

Firstly, relatively few publicly accessible datasets are currently suitable for FER in the context of online learning. Nevertheless, the importance of academic emotion datasets has recently been recognized, and researchers are devoting increasing attention to producing such datasets and making them publicly available. According to previous research, it was difficult to establish a connection between certain facial emotions and learning tasks such as attending an online lecture, watching online video tutorials, writing, and reading [18].

Next, deep learning models require a large amount of memory, and training them is time-consuming [115]. Hence, deep learning models are not well suited for deployment on platforms with limited resources because of their high memory and computational requirements. To obtain models that can be executed quickly without loss of accuracy, methods for reducing model complexity need to be investigated.

Furthermore, the algorithms are developed to exploit data and extrapolate stereotypical traits, which prevents them from considering exceptional cases and uncommon configurations [116]. According to current theoretical perspectives, emotions are a nuanced and dynamic phenomenon that varies along many dimensions that have not yet been fully formalized in theory.

Moreover, recent developments in machine learning, particularly deep learning methods such as CNNs, require larger datasets than are currently available [18]. Gathering and evaluating behavioural data in naturalistic settings is in itself a challenging task. Therefore, additional efforts are required to address the open challenges imposed by the constraints of real-world learning environments.

7. Conclusion

This paper presents a systematic literature review of multiclass student emotion classification in online learning. The scientific literature of the past five years was systematically searched to identify the type of FER approach used, the algorithms used as emotion classifiers, and the datasets used in previous studies. It can be concluded that deep learning algorithms, such as CNN, are applied more frequently than other algorithms. On the other hand, SVM, random forest, decision tree, and KNN are examples of conventional machine learning algorithms used to classify facial emotions.

Based on the findings of this study, the deep learning approach is the most frequently adopted approach in classifying student emotions during online learning for the student FER systems in recent years. Furthermore, FER-2013 is the most commonly used FER dataset in FER studies, while DAiSEE is the most used academic emotion dataset. Moreover, support vector machine (SVM) is the conventional learning classifier that is widely used in the FER systems, while convolutional neural network (CNN) is the most frequently used deep learning classifier, followed by LSTM. Next, it was found that the number of real-time FER systems is less than the number of non-real-time FER systems. Finally, the top-1 accuracy of 94.6% was achieved by the long-term recurrent convolutional network on the academic emotion dataset in previous studies [97]. The limitation is that the dataset has low illumination and a lack of frontal poses.

In conclusion, emotion recognition has come a long way over the years, and a significant number of approaches have been established. This has led to the emergence of new research issues, opportunities, and challenges, and as a result the field has advanced a step further in both the recognition and the understanding of emotions.

8. Future Directions

Although numerous studies on FER have been carried out previously, FER performance has improved dramatically in recent years through the adoption of deep learning algorithms. Future studies should focus on developing annotation standards for labeling benchmark datasets. Academic affective states, such as boredom, frustration, and engagement, are more challenging to measure than the emotions commonly investigated in other domains of emotion recognition [117]. By personalizing the datasets, models can be trained specifically to recognize students' academic emotions. This could lead to better insights into students' emotions and more personalized feedback and support, thereby enhancing their learning outcomes.

Moreover, emotions can vary throughout time and are influenced by various internal and external circumstances. Long-term monitoring of student emotions can provide insightful information about their emotional states and assist in detecting patterns and trends. Future systems could explore approaches for long-term monitoring of facial emotions, such as wearable sensors or continuous video recording.

Besides that, several different ethical concerns are raised by emotion recognition systems, including privacy, consent, and potential biases [118]. These concerns could be addressed in future systems through transparent data collection and processing, informed consent, and unbiased algorithms.

In conclusion, the potential future directions for student FER systems are numerous and exciting. Advancements in dataset personalization, integration with other technologies, and ethical considerations hold great promise for improving students’ learning experiences and outcomes. It is crucial to thoroughly consider the potential consequences of these advancements and implement them responsibly and ethically. This will allow us to utilize the potential of FER systems to create more engaging, personalized, and effective learning environments for students.

Data Availability

The data supporting this systematic review are from previously reported studies and datasets, which have been cited.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The corresponding author is supported by a research grant from the Ministry of Higher Education, Malaysia (Fundamental Research Grant Scheme (FRGS), Dana Penyelidikan, Kementerian Pengajian Tinggi, FRGS/1/2019/ICT02/UMS/01/1). The APC is funded by Universiti Malaysia Sabah.