Abstract

Autism spectrum disorder is a neurodevelopmental disorder characterized by repetitive behavior patterns, impaired social interaction, and impaired verbal and nonverbal communication. The ability to recognize mental states from facial expressions plays an important role in both social interaction and interpersonal communication. Thus, in recent years, several proposals have been presented that aim to contribute to the improvement of emotional skills and thereby improve social interaction. In this paper, a game is presented to support the development of emotional skills in people with autism spectrum disorder. The software helps to develop the ability to recognize and express six basic emotions: joy, sadness, anger, disgust, surprise, and fear. Based on the theory of the Facial Action Coding System and digital image processing techniques, it is possible to detect facial expressions and classify them into one of the six basic emotions. Experiments were performed using four public domain image databases (CK+, FER2013, RAF-DB, and MMI) and a group of children with autism spectrum disorder to evaluate their existing emotional skills. The results showed that the proposed software contributed to the improvement of the skills of detecting and recognizing the basic emotions in individuals with autism spectrum disorder.

1. Introduction

The representations obtained through a person's facial expressions are important for social interactions in society [1]. Emotional expression and eye gaze direction play crucial roles in nonverbal communication [2]. According to the authors in [3], facial emotion recognition is a fundamental concept within the field of social cognition. Individuals with autism spectrum disorder (ASD) may present characteristics related to repetitive behavior patterns, impaired social interaction, impaired verbal and nonverbal communication, and limited social-emotional reciprocity from childhood [4]. People with ASD can have difficulty improving and developing important social-emotional skills throughout life [5]. Facial emotion recognition is an important skill for adult life, but traditional methods and group interactions may not be promising for these tasks, as described in [6].

Several types of research have been developed to present tools based on information and communication technology (ICT) for improving emotion-related skills in an individualized way [6–9]. These solutions give the user more comfort, entertainment, and insertion in the real context safely and playfully, as described in [10]. Moreover, these tools can be used on several devices and platforms, such as tablets, desktops, and smartphones. The development of ICT-based techniques in digital game environments is being adopted in our society, according to data published in the study presented in [4]. One of the approaches employed in this context is the use of serious games (SGs). According to the authors in [11], SGs provide entertainment and the means to improve skills related to facial emotion recognition. Moreover, the computational techniques employed for facial emotion recognition can contribute to tools such as SGs.

There remain challenges in this area for new tools that combine methodologies for improving emotion-related skills, SGs, and facial emotion recognition techniques in a way that quantifies each individual's data during the treatment sessions.

2. Contributions of This Work

In this paper, we present an SG to improve emotional skills through facial expression recognition methods for individuals with ASD. Firstly, a camera captures the user's face. Then, the Haar-like feature algorithm is applied to detect regions of interest in the user's face. The dlib library is employed to detect the facial keypoints in the image. With the detected keypoints, the optical flow algorithm is employed to keep the locations independent of the user's face movement. Descriptors obtained with the histogram of oriented gradients are extracted from the regions of interest. A convolutional neural network model is also applied to the regions of interest. The resulting data in the flattening layer are combined with the handcrafted information and keypoints for classification in the softmax layer. The game includes characters that assist in the practice activities. Our SG explores multimodal aspects, such as facial expressions, vocal prosody, and body language. The user receives positive feedback after properly representing the emotion, as well as the current status and faults. A group of public domain datasets was employed to evaluate the facial emotion recognition methods. In a second step, a group of volunteers with ASD was selected to evaluate the game and investigate the contribution of the tool to the improvement of emotional skills.

The contributions of this paper can be summarized as follows:
(i) We present an approach for feature extraction and emotion classification based on a hybrid model (handcrafted and learned features)
(ii) We developed an SG involving characters that express emotions and can assist in improving facial expression skills for individuals with ASD
(iii) We designed a tool that captures information from intervention sessions to help specialists improve the skills of individuals with ASD

The remainder of the article is organized as follows: Section 3 describes the theoretical background. Section 4 presents the related work on strategies to improve facial expression skills. Section 5 shows the methodologies used in the various stages of the proposed approach. Section 6 presents the results and discussion regarding the performance of the various steps of the work. Finally, the conclusions are presented in Section 7.

3. Theoretical Background

3.1. Facial Expressions of People with Autism

Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects social communication and behavior in children and adults. Lorna Wing and Judith Gould were the first researchers to observe and define the term ASD, in 1979 [12]. According to data published by the World Health Organization, one in 160 children has characteristics that may be related to ASD [13]. In 2020 in the United States, the Centers for Disease Control and Prevention reported an increase of 178% in the number of children diagnosed with ASD compared to data published in 2000 [14]. In Brazil, with a population of over 200 million, there are an estimated 2 million individuals with ASD [15].

The perception of faces is one of the most important skills developed in humans [1]. Factors such as behavioral difficulties that occur due to a lack of visual attention compromise the socialization of children with ASD [16]. Depending on the level, individuals with ASD can also have difficulty interacting in collaborative environments and performing actions such as teamwork or public speaking due to the difficulty of analyzing facial muscles and recognizing facial expressions [17, 18]. The development of software aimed at contributing to this task for individuals with ASD has been increasingly employed in our society, according to data from an investigation into the global panorama of solutions presented by [4]. Software that employs SGs for the process of improving these skills is being explored in this area.

3.2. Serious Games to Improve Skill of Facial Expression

Serious games are primarily aimed at learning or developing a skill [19]. The term "serious games" was proposed by Clark [19] in 1970 for games developed with an educational objective and not only for fun and entertainment. This does not mean that serious games (SGs) should not be fun for their users, but rather that this type of application should combine these essential elements. According to the authors in [11], SGs provide the user with entertainment while improving skills related to emotional expression. Software for this task has been investigated in several studies in the literature, as highlighted in [20–22].

3.3. Facial Emotion Recognition

Computational techniques for facial emotion recognition (FER) based on computer vision and image processing can contribute to tools such as SGs. These techniques involve face information acquisition, feature extraction, and classification steps. The information acquisition step involves the detection of areas of the face from an image or video. After this step, regions must be selected to determine the area of interest and eliminate the background of the image. From these regions, features related to morphological or nonmorphological information are extracted. According to [23], the performance of a system for human FER depends on the feature extraction methods used. The information extracted must be highly discriminative between classes and weakly discriminative within the same class. Moreover, the features should be described in a low-dimensional space and be robust against variations in illumination and noise [24].

Tools for FER can be categorized into methods that explore geometric information, brightness, or texture [25]. Recently, strategies based on deep learning have been employed for FER and facial emotion expression problems in computer vision [26–28]. Methods based on deep learning have achieved remarkable success rates in FER studies [27, 29]. The high accuracy achieved in these recognition tasks can be attributed to large-scale labeled datasets such as AffectNet [26] and EmotioNet [30], which enable convolutional neural networks (CNNs) to learn generalizable representations [31]. These CNN-based methods have contributed to addressing the challenges involved in new approaches for FER.

4. Related Work

The use of computer technologies to improve skills has increased in recent years. The work of [32] shows, through a systematic review, the use of technology to improve the conceptual, practical, social, and general skills of people with ASD. This study shows that people with ASD tend to be motivated by new technologies and that these solutions can make the process of improving skills fun. The use of technological advancements such as artificial intelligence and augmented reality provides a comfortable environment that promotes constant learning for people with ASD. Moreover, the study in [33] presents a systematic mapping of the use of technologies for emotion recognition in children with ASD. That study analyzed the main techniques employed and gave recommendations for the user interface and equipment. These strategies can provide flexibility, accessibility, and easy adaptation to real-world scenarios. This section presents the main contributions that have investigated SG strategies for individuals with ASD.

The authors in [34] proposed a system to improve facial expression skills for individuals with ASD using 3D animations overlaid on the face. In this tool, images obtained from the frontal and lateral regions of the user's face were used to build a 3D face mask. Using facial action coding system (FACS) theory, representations of facial micromovements were captured from the participant and rendered in the 3D animations. The overlay of the 3D face representation made it possible to evaluate how the participants represented each emotion. This tool was evaluated with three participants over seven sessions, which were aimed at collecting skills related to emotion recognition based on facial expressions. The results showed an accuracy of 89.94% for recognition of the basic emotions. According to the authors, the 3D representation contributed to the evaluation by allowing the participants to maintain their attention and focus. The authors highlight that the system still has restrictions due to the small number of participants and the small number of quantitative metrics, which makes further analysis necessary to evaluate its effectiveness.

In 2016, a digital storybook with characters, video modeling, and augmented reality (AR) was proposed for emotion recognition in the study of [35]. Computational technologies were adopted in order to attract children's attention and focus during sessions. The features available in AR made it possible to extend the social features of the story and to direct participants' attention to the important information present in the digital book. The tracking of each individual's eye movement was analyzed by the expert. In the sessions, a printed book with visual images was initially presented. Then, AR-modeled characters from the scenes in the book were presented with facial expressions of emotions to assist the individual in representing each emotion. The evaluation was conducted in three stages: baseline, intervention, and maintenance. The authors observed that the system helped the children to recognize and understand emotions from facial expressions. The experts reported that the use of AR raised the attention of the users. The authors also reported that quantitative measures of the improvement in emotion performance were not obtained.

An SG for smartphone devices capable of assisting in the development of emotional skills for individuals with ASD was presented by [36]. The tool was composed of a communication interface for the smartphone (app) and a server with services for requests, emotion processing, and emotion classification. The captured images were sent to the server for analysis and classification of the emotions. The server ran a convolutional neural network trained with 15,000 images of the basic emotions. The tool was evaluated with nine users, aged 18 to 25, during four intervention periods with two-week intervals. According to the specialists, the results showed that the participants improved their ability to recognize and express emotions. However, the authors did not employ metrics for quantitative analysis of the improvement in this skill, nor did they analyze the performance of the tool with respect to the image processing and classification algorithms.

In [20], the authors developed a game called Emotiplay to assist in improving the emotional skills of children with ASD. The game was designed to run in a desktop environment through a browser, using HTML5, CSS, and JavaScript technologies. This game was built with animated characters that represented everyday scenes from social life and questionnaires to be answered about the emotions observed in the scenes. Among the proposed activities, the user could recognize emotions through facial expressions, gestures, speech, and other sociobehavioral characteristics. The evaluation with volunteers from different countries analyzed the performance of the tool with people of different cultures and sociobehavioral contexts. This investigation provided quantitative results, and the data showed an evolution of the emotional vocabulary of the participants with ASD.

In the work of [21], the authors developed an application using machine learning techniques to help individuals with ASD work on only four basic emotions: neutral, joy, sadness, and anger. The representations were captured on video and analyzed by the Viola–Jones technique to detect the facial keypoints (FKs) of users' faces, which were then used for classification with the random forest algorithm. To make the game attractive, the authors developed scenarios involving social situations that demanded the representation of emotions. A quantitative evaluation was not performed; only questionnaires were used to analyze the level of satisfaction regarding the use of the tool. The authors state that new evaluations should be carried out in future work.

In [37], the authors developed an application that simulates a mirror with a webcam, in which convolutional neural networks analyze the images captured by the camera and compare them with the expression that the patient should perform, detecting five basic emotions. The tool was evaluated with people with autism and monitored by experts. In the experiments, the professionals were able to evaluate the acceptance rate and various usability aspects of the tool. The tool made it possible to carry out treatment sessions during the period of social isolation. As limitations, the proposed method requires specific hardware and proper configurations, which for certain people can be an obstacle to use during the daily routine. The authors also report that the approach should be improved with respect to the quality of the interfaces.

In the literature, different works have demonstrated the importance of adapting and evaluating tools in environments more suitable for individuals with ASD. Developing tools that can capture information from the user's face in a process of emotion detection and recognition can generate important data for the specialist. Specialists can track the progress of this skill at each session and further explore customization for each user's limitations during the skill development process.

5. Methodology

The proposed tool is free software developed to improve emotional skills through the recognition of facial expressions in people with ASD. Figure 1 presents the modules of this tool: the SG module, the detection and classification module, and the data module. Firstly, the initial instructions of the SG are presented to the user in the SG module. Then, the knowledge evaluation step is presented to the user to assess the existing emotional skills. After this step, the user receives new information and expresses emotions following the instructions of the virtual character.

The facial expressions of the user are captured by a webcam, which can belong to a computer or smartphone. In this work, a 720p Microsoft H5D00013 webcam was used to capture the expressions, which were sent to the detection and classification module. In this step, the information from the regions of interest (ROIs) on the user's face is extracted and classified. The data obtained are stored for evaluation and analysis by the specialist. Information on the attention level, the number of hits, the number of errors, and the time required to express each emotion is stored (see extra parameters, Figure 1). An interface allows the specialist to analyze the data from each session in order to evaluate the improvement of the skills of individuals with ASD.

The proposed game was developed for children aged 6 to 13 years old, but it can also be applied to adults. According to the authors in [32, 33], when a game is developed for children with ASD, some characteristics need to be considered, such as the user interface and elements such as color tones and sounds. Narration during matches and characters with customization options, such as clothing and accessories chosen by the user, are important in the game. These features make the SG more attractive and consequently improve the process of skill improvement.

Our scheme was developed using HTML version 5.2 [38], JavaScript, and the Python programming language, version 3. In addition, the web application was developed in a responsive manner, allowing it to run on desktop or smartphone devices. In this SG, no predefined game engine from the literature was used, in order to allow for customization based on the particular characteristics of the user. The algorithms were implemented on a cloud computing architecture with 4 GB of RAM, a 1 TB HD, and a GPU with 5 GB of memory. The videos were captured at 30 frames per second in the RGB color model.

5.1. SG Module

Before the SG is launched, the user chooses one of the available characters, as shown in Figure 2. For this software, the characters Ana and Juninho were provided. Figure 2 presents the character Ana that helps the user during the stages of the game.

In this SG, animated characters are used to express the six basic emotions. These animations were developed based on an analysis of the videos in the dataset [39], which contains 2,900 videos and a set of images of 75 people expressing each of the six emotions, according to FACS theory. The characters take between 40 and 45 seconds, on average, to express these emotions. The animations and the quality of the representations of the emotions were evaluated by specialists who work with ASD individuals. Figure 3 presents one of the characters used to express emotions.

After this character has been defined, an interface allows the accessories to be customized (see Figure 4). This customization step promotes interaction so that the user has more motivation and attention, according to [40].

The user is then directed to the first stage of the game. At this stage, the user must identify the emotions shown in a sequence of three images of people. The individual watches six one-minute videos, each showing a person representing one of the basic emotions. The videos with the basic emotions are presented randomly. At the end of each video, the images are shown for the user to select the emotion, as shown in Figure 5. The user has five minutes to choose the image that represents the emotion. In this phase, no feedback is presented to the individual with ASD, and the data captured during this stage are shown only at the end of this phase. This strategy was adopted in order not to demotivate the user during this assessment. The participant's score is displayed in the data module so that specialists can analyze and track the participant's progress.

5.2. Evaluating Serious Games with People with ASD

After the evaluation step, the user begins the SG, following the path in the world of emotions as shown in Figure 6. In this step, the basic emotions will be presented, and the character will assist the user.

A camera captures the video of the user expressing the emotion, and the detection and recognition algorithms analyze this representation. The user's performance during the emotion expression process can be followed through a status bar: each facial muscle expressed adequately fills part of the bar. After the status bar has been filled completely, the system allows the user to proceed to the next phase, and a new emotion is presented. In Figure 7, the SG interface is illustrated with the target emotion (upper right, yellow borders); in the upper left (purple borders), the character expresses the emotion based on information from the user (center of the screen). The points marked on the face allow the user to monitor the expression of the emotion.

When a user is playing a match of this game, the participant expresses the target emotion represented by the avatar shown in Figure 7 (see the right side of the interface). In Figure 7, the tutor (left side of the interface) shows the movements needed to represent the facial expression of the required emotion. These movements are shown as represented in Figure 3. The movements of the user's face and the keypoints (marked in white) are presented in the central region of the interface. Information such as the time, the eye-tracking of the regions of the screen viewed by the user, the muscles activated in the face, and the numbers of hits and misses is stored. This information can be analyzed by the healthcare professional during the follow-up of a match.

When the user makes a sequence of errors in the expression of an emotion, hints are presented with the objective of helping the user correctly represent that emotion during the match. However, when the participant does not manage to represent the emotion, the tool does not allow the user to advance to the next phase, and the participant is guided to the training phase with images and animations. In the training phase, the tutor expresses the emotion, and a voice speaks the name of the emotion presented. There is also an image of a person showing that emotion (see Figure 8). The six emotions are presented in this phase, and this process is repeated twice for each emotion. After this step, a message is displayed asking whether to continue training or to return to the game phase. In this process, the training information is stored for the specialists, to assist them in making decisions and helping the participants.

5.3. Detection and Classification Module

This module contains the methods responsible for processing and classifying emotions, which are presented in Figure 9.

Firstly, the video is captured in the RGB color model and converted to grayscale. This process reduces the color scale present in the frames, as reported in the studies proposed by [41, 42]. The frames are analyzed by the algorithm proposed by Viola and Jones to detect features of regions of the user's face [43]. This method analyzes the appearance of the objects and searches for descriptors [43, 44]. Initially, it is necessary to perform a computation over all of the pixels of the face. For this, an integral image is computed in this stage, where $ii(x, y)$ represents the value in the integral image and $i(x', y')$ represents the input image, which is given by:

$$ii(x, y) = \sum_{x' \leq x,\ y' \leq y} i(x', y')$$
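As a minimal illustration of the integral image described by this equation, the cumulative sums of a grayscale frame can be computed with NumPy as sketched below; the function names are ours and are only illustrative, not part of the tool's implementation:

```python
import numpy as np

def integral_image(gray: np.ndarray) -> np.ndarray:
    """Compute ii(x, y) = sum of i(x', y') for all x' <= x and y' <= y."""
    # Cumulative sums along rows and then columns yield the integral image.
    return gray.astype(np.float64).cumsum(axis=0).cumsum(axis=1)

def region_sum(ii: np.ndarray, top: int, left: int, bottom: int, right: int) -> float:
    """Sum of pixel intensities inside a rectangle using only four lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return float(total)
```

The rectangle sums obtained this way are what make the Haar-like feature responses cheap to evaluate at every window position.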

After obtaining this information, the method is employed to obtain edge features, line features, and center-surround features (see Figure 10). In this method, we also classify the features using the AdaBoost algorithm (Figure 11) and a tree structure defined by a cascade of classifiers [45]. If all of the classifiers accept the image of the user's face, it is displayed with the detected regions (see Figure 12). In this stage, the ROIs detected are the eyebrows, eyes, lips, nose, and jaw.
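The cascade-based detection step can be reproduced with OpenCV's pretrained Haar cascades. The sketch below is only an illustration of the idea, assuming the standard cascades shipped with OpenCV; the exact cascades and parameters used in the tool are not specified in the text:

```python
import cv2

# Pretrained Haar cascades shipped with OpenCV (assumed here as stand-ins
# for the detectors used by the tool).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

frame = cv2.imread("frame.png")                  # one frame from the webcam video (BGR in OpenCV)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # grayscale conversion, as in Section 5.3

# The cascade of AdaBoost classifiers rejects non-face windows early and
# accepts a window only if every stage accepts it.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    face_roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face_roi)
    # face_roi and its sub-regions are the ROIs passed on to the keypoint,
    # HOG, and CNN stages described below.
```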

A facial keypoint (FK) detection model based on the dlib library was employed in these face regions [48–50]. In this stage, the dlib function shape_predictor is employed. This function takes an ROI as input and outputs a set of 68 FK locations, which are shown in Figure 13 (a minimal usage sketch is given after the list below):
(i) Right eyebrow: points 18, 19, 20, 21, and 22
(ii) Left eyebrow: points 23, 24, 25, 26, and 27
(iii) Right eye: points 37, 38, 39, 40, 41, and 42
(iv) Left eye: points 43, 44, 45, 46, 47, and 48
(v) Nose: points 28, 29, 30, 31, 32, 33, 34, 35, and 36

Mouth:
(i) Upper outer lip: points 49, 50, 51, 52, 53, 54, and 55
(ii) Upper inner lip: points 61, 62, 63, 64, and 65
(iii) Lower inner lip: points 61, 65, 66, 67, and 68
(iv) Lower outer lip: points 49, 55, 56, 57, 58, 59, and 60
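A minimal sketch of the keypoint extraction with dlib's shape_predictor is given below, assuming the publicly available 68-landmark model file (the exact model file used in the tool is not stated):

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Publicly available 68-point model (an assumption; any compatible
# shape_predictor model could be used instead).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def facial_keypoints(gray: np.ndarray) -> np.ndarray:
    """Return an array of shape (68, 2) with (x, y) keypoints, or empty if no face."""
    rects = detector(gray, 1)
    if not rects:
        return np.empty((0, 2), dtype=np.int32)
    shape = predictor(gray, rects[0])   # ROI -> 68 facial keypoints
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)],
                    dtype=np.int32)
```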

The Euclidean distance is computed among the detected FKs for the detected face regions. Because the facial muscles move, we apply the algorithm proposed in [51] to compute the optical flow and estimate the movement of the face across the images of the video sequence. If the user moves during the emotion investigation process, the algorithm can describe these movements. Finally, the information from the FKs is stored in the flattened vector before the fully connected layers of the CNN.
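The text does not name the optical flow algorithm of [51]; purely as an illustration, a pyramidal Lucas–Kanade tracker (available in OpenCV) could keep the keypoints anchored between frames, after which the pairwise Euclidean distances among keypoints can be recomputed:

```python
import cv2
import numpy as np
from scipy.spatial.distance import pdist

def track_keypoints(prev_gray: np.ndarray, next_gray: np.ndarray,
                    keypoints: np.ndarray) -> np.ndarray:
    """Propagate facial keypoints from one frame to the next with sparse optical flow."""
    pts = keypoints.astype(np.float32).reshape(-1, 1, 2)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    # Keep only the points that were successfully tracked.
    return new_pts.reshape(-1, 2)[status.ravel() == 1]

def keypoint_distances(keypoints: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distances among the detected keypoints."""
    return pdist(keypoints, metric="euclidean")
```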

The histogram of oriented gradients (HOG) method was applied to the detected ROIs. The HOG method is a feature descriptor that determines the occurrences of gradient orientations in localized portions of an image for extracting object features [52]. Gradients are computed for each image per block, where each block is obtained from a grid of pixels, using the magnitude and direction of the change in pixel intensities:

$$g = \sqrt{g_x^2 + g_y^2}, \qquad \theta = \arctan\!\left(\frac{g_y}{g_x}\right)$$

The terms $g_x$ and $g_y$ are the horizontal and vertical components of the pixel intensity change, respectively. The features were calculated in fixed-size pixel blocks within each ROI. The values for each block are quantized into 9 bins based on the gradient directions and the magnitudes of the pixels. These values are stored in the feature vector before the fully connected layer.
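A sketch of the HOG extraction with scikit-image is shown below. The 9 orientation bins follow the text, while the cell and block sizes are common defaults and only assumptions, since the exact sizes are not given:

```python
from skimage.feature import hog

def hog_descriptor(roi_gray):
    """HOG feature vector for one grayscale ROI (9 orientation bins)."""
    return hog(
        roi_gray,
        orientations=9,          # 9 bins, as stated in the text
        pixels_per_cell=(8, 8),  # assumed cell size
        cells_per_block=(2, 2),  # assumed block size
        block_norm="L2-Hys",
        feature_vector=True,
    )
```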

The ROIs are also given as input to a CNN approach based on the MobileNet architecture [53, 54]. This model employs depthwise separable convolution (DSC) layers and hyperparameters called width and resolution multipliers that address the computational resource limitations (latency and size) of the applications [53, 54]. These features split the convolution step into two operations, a depth-wise convolution and a point-wise ($1 \times 1$) convolution, thus reducing the set of parameters in the convolutional layers. Figure 14 shows the difference between a traditional convolution filter and a DSC filter, where $D_K$ is the kernel size, $M$ is the number of input channels, and $N$ is the number of output channels.
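A MobileNet-style depthwise separable block can be written in Keras as below; this is a generic sketch of the operation just described, not the exact configuration used in the tool:

```python
from tensorflow.keras import layers

def depthwise_separable_block(x, filters: int, stride: int = 1):
    """Depth-wise 3x3 convolution followed by a point-wise 1x1 convolution,
    each with batch normalization and ReLU (MobileNet-style)."""
    x = layers.DepthwiseConv2D(kernel_size=3, strides=stride, padding="same",
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, kernel_size=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x
```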

Batch normalization and activation operations are employed after the convolutional layers. Batch normalization is a technique used to standardize the inputs and contributes to the training stage of the model. For the activation stage, the rectified linear unit (ReLU) operation [53] was used. The pooling layer was employed to simplify the information in the output of the convolutional layers. In this step, the max-pooling operation was used, in which only the largest value is passed to the output. This data summarization reduces the number of weights to be learned and helps avoid overfitting. In the final stage, the flattening and softmax layers were employed [53]. In the proposed approach, an adaptation was performed in which three fully connected (FC) layers were inserted (as shown in Figure 9). The features obtained by the convolutional layers and the handcrafted features (FK and HOG) were employed in the classification. According to the authors in [55], the strategy of using handcrafted features can give the CNN architecture the ability to learn problem-specific features and consequently improve results. The model was trained using 70% of the ROIs from each dataset for 1000 epochs, using the Adam optimizer and a batch size of one image patch. The configuration for the Adam optimizer used the value provided by the Keras framework (default learning rate of 0.001).
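The fusion of learned and handcrafted features described above could be assembled roughly as follows in Keras; the input shape, handcrafted-vector length, and FC layer widths are illustrative assumptions, not the authors' exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 6
HANDCRAFT_DIM = 2000   # assumed length of the HOG + keypoint feature vector

# Convolutional branch: MobileNet backbone without its classification head.
image_in = layers.Input(shape=(128, 128, 3))           # assumed ROI size
backbone = tf.keras.applications.MobileNet(
    include_top=False, weights=None, input_tensor=image_in)
cnn_features = layers.Flatten()(backbone.output)

# Handcrafted branch: HOG descriptors and facial-keypoint distances.
handcraft_in = layers.Input(shape=(HANDCRAFT_DIM,))

# Concatenate both branches before the three fully connected layers.
merged = layers.Concatenate()([cnn_features, handcraft_in])
x = layers.Dense(512, activation="relu")(merged)        # assumed layer widths
x = layers.Dense(256, activation="relu")(x)
x = layers.Dense(128, activation="relu")(x)
output = layers.Dense(NUM_EMOTIONS, activation="softmax")(x)

model = models.Model(inputs=[image_in, handcraft_in], outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
```

Training with a batch size of one image patch, as stated above, would then amount to calling model.fit on the paired image and handcrafted inputs with batch_size=1.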

5.4. Data Module

This tool also has a module for storing game match data. This module has an interface that allows the specialist to analyze the data captured during the emotion improvement process (see Figure 15). The specialist can use these data to evaluate the improvement of skills across the treatment sessions of each individual. Moreover, the specialist can evaluate the number of sessions, the emotions with which the individual has difficulty, the time spent expressing the emotions, and the muscles activated for each emotion.

5.5. Evaluation Strategies

The hold-out method was employed to evaluate the proposed methods in the training and test stages. The image dataset was split into subsets: training with 70% of the images and testing with 30%. The classification performance was evaluated using the accuracy, recall, precision, and F1-score metrics. The accuracy metric indicates the performance of the model in relation to the instances that the model classified correctly. Recall evaluates the number of positive samples that have been correctly classified. Precision can be defined as the ability of the model to correctly predict positive values. The F1-score metric is defined by:

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

The F1-score is used when it is desirable to find a balance between precision and recall, which suits this solution, as can be observed in the studies of [56–58].
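The evaluation protocol (70/30 hold-out split and the four metrics) can be reproduced with scikit-learn as in the minimal sketch below; the stratified split and macro averaging are our assumptions, since the paper does not specify them:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def holdout_split(X, y):
    """70% training / 30% test split (stratification by label is an assumption)."""
    return train_test_split(X, y, test_size=0.30, stratify=y, random_state=42)

def report_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score (macro-averaged over the six emotions)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1_score": f1_score(y_true, y_pred, average="macro"),
    }
```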

To evaluate the proposed SG, we employed a multiple baseline design across participants to demonstrate emotional skill. This strategy is divided into baseline, intervention, and maintenance phases. This assessment model is widely used in medical and psychological research and with individuals with ASD, as reported in [34, 35, 59–61].

Information about the skill level of each individual is collected in the baseline stage. In each session, six one-minute videos are shown to each participant, each with a person representing one of the basic emotions. The videos are presented in random order to prevent the participant from memorizing the sequence. At the end of each video, three images with different emotions are presented, and the participant must choose the emotion that was presented in the video. Participants had five minutes to choose the image that corresponded to the emotion shown in the video. The baseline sessions occurred once per weekday, up to five times a week, and lasted approximately 30 minutes.

In the first session of the intervention phase, participants were informed by the researchers that the character would express the target emotion. Participants were also briefed on the scoring and the functionality of the interface. After the instructions, participants were asked to express the emotion presented in the video. The facial expressions of the participant were captured by a camera and reproduced by the character. The participant had five minutes to represent each emotion. The individual followed the representation of the emotion through a progress bar, and when the emotion was correctly expressed, a reward was provided to the player. This phase was also conducted once per weekday, up to five times a week, in sessions of approximately 30 minutes. The videos with the basic emotions were presented randomly. The characters used in the intervention sessions were different from those used in the baseline and maintenance phases to avoid memorization by the participants.

Finally, maintenance sessions were conducted four weeks after the intervention phase to assess the emotional skills of the individuals. The time interval between intervention and maintenance was intended to check whether the improved or developed skills remained. During this step, all individuals watched the videos with the emotions presented randomly and chose the image corresponding to the emotion shown in each video. These sessions occurred once per weekday, up to five times a week, and lasted approximately 30 minutes.

6. Experiments and Results

This section presents the public datasets and the evaluation of the computational algorithms for emotion recognition (see Section 6.1), as well as the experiments performed to evaluate the improvement of emotional skills through the use of the game with individuals with ASD (see Section 6.2).

6.1. Evaluation of the Emotion Recognition and Detection Methods
6.1.1. Public Datasets

For the evaluation of the detection and recognition algorithms, the following public domain datasets were employed: CK+ [62], FER2013 [63], the Real-world Affective Faces Database (RAF-DB) [64], and the MMI Facial Expression Database [65].

The CK+ database [62] has images of people expressing basic emotions, divided into joy (324), sadness (253), anger (183), surprise (328), disgust (182), and fear (182). This database has been employed in the literature for evaluating emotion detection and recognition algorithms [56–58]. It is composed of videos/images from over 200 Euro-American and African-American adults between the ages of 18 and 50. The images were captured in sequences, beginning with the face in a neutral state and covering the transition to the required emotion. These images were captured in grayscale or RGB with a resolution of 640 × 490 or 640 × 480 pixels. The second database investigated was FER2013 [63], which has a total of 35,887 grayscale images of faces, divided into the following categories: anger (4593), disgust (547), fear (5121), joy (8989), sadness (6077), surprise (4002), and neutral (6198). The faces were captured so that they are centered and occupy the same proportion of space in each image. The images are not organized into temporal sequences; only the apex state of each expression is considered. The third database investigated was RAF-DB [64]. This is a large-scale facial expression image base with approximately 30,000 facial images obtained from the Internet in the RGB color standard. From this dataset, a total of 12,271 images are divided into surprise (1290), fear (281), disgust (717), joy (4772), sadness (1982), anger (705), and neutral (2524). An important feature of this dataset is that the images have great variability in the age of the participants as well as variations in ethnicity, head pose, lighting conditions, occlusions (e.g., glasses, facial hair, or self-occlusion), postprocessing operations (e.g., various filters and special effects), etc. The last dataset evaluated was the MMI Facial Expression Database [65], which consists of over 2,900 high-resolution RGB samples. In this database, only 174 samples contained the annotations necessary for emotion classification. The images used are distributed as follows: joy (35), disgust (25), fear (24), anger (27), surprise (35), and sadness (28).

Figure 16 shows images of individuals expressing emotions from each of the databases used: (a) CK+, (b) FER2013, (c) RAF, and in (d) MMI. In the images presented in Figure 16, it is possible to observe a good distribution of characteristics related to gender, age, and ethnicity of the individuals. These features contribute to the evaluation of the robustness and generalization of the algorithms over the various contexts.

6.1.2. Analysis of the Face Emotional Recognition Algorithms

In this experiment, an investigation was performed on the association of the extracted features employed for emotion classification on the images. Table 1 shows the average accuracy for each image dataset. The values in Table 1 indicate that the association between the CNN-learned features and the handcrafted descriptors contributed to increasing the accuracy. The best accuracy was obtained on the CK+ dataset. Table 2 shows the results for the proposed approach and other CNN architectures. As the best results for the proposed approach were obtained with the CK+ dataset, this dataset was chosen for comparison with the other models: the proposed model was 2.03% and 5.03% better than the Inception architecture and the MobileNet model, respectively.

The quantitative metrics for evaluating the detection and recognition steps for each of the emotions on the CK+ dataset are presented in Table 3. These results show that the worst measurements occurred for the sadness, fear, and disgust emotions.

In general, the results obtained by the proposed algorithm were relevant, with an average accuracy of 0.98. The values obtained for the recall metric were satisfactory, with an average of 0.99, showing that the model was able to detect a relevant number of cases from the CK+ database. The F1-score metric considers false positives and false negatives and is generally more informative than measures such as precision alone, especially when the class distribution is uneven, as in our experiments.

Figure 17 presents the confusion matrix of the results obtained in the experiments with the CK+ dataset. In the matrix, it is possible to observe that the emotions with the highest number of errors are sadness, disgust, and fear. These are emotions for which some systems in the literature show classification difficulties similar to those of the proposed method, as observed in the studies proposed by Jain et al. [66] and Wu and Lin [67]. In general, the proposed method shows promising results compared with studies in the literature. Therefore, these methods were employed as part of the SG for the detection and recognition of emotions during the process of improving emotional skills.

Several methods have been developed for the detection and recognition of emotions (DRE) in images from the CK+ dataset, but none of them has used our proposed association of features for DRE. An illustrative overview is therefore useful to show the quality of our method. Table 4 lists several accuracy values from the literature, including those presented in this work.

6.2. Evaluating Serious Games with People with ASD
6.2.1. Participants

A group of eight individuals diagnosed with ASD was selected to evaluate the proposed SG. The individuals were selected at the university hospital of the Federal University of Uberlândia (UFU), located in Uberlândia, Minas Gerais, Brazil. Data collection was carried out after approval by the Research Ethics Committee of UFU (Opinion No. 82555417 0 0000 5152); during data collection, the subjects were asked to sign the informed consent form, and this work followed Resolution 196/96.

The following criteria were considered to select the group: (1) being diagnosed with ASD, (2) having no motor limitations, and (3) having undergone clinical evaluation. This clinical evaluation adopted the parameters of the Diagnostic and Statistical Manual of Mental Disorders [74]. The participants were literate and without cognitive delays, so the application of the game during the experiments was not compromised. Although there was no severe impairment in cognitive skills, all participants had social deficits and difficulty recognizing emotions, as reported by the specialists.

A visit was made to the clinic, and the tool was presented to the specialists. The specialists were then trained to use the tool, and doubts about its usability were answered. A meeting with the parents presented the tool, and their doubts were clarified. To assess the participants' cognitive, social, and communication skills, interviews were conducted with the parents and the specialists. The selected participants with ASD were between 6 and 12 years old (four boys and four girls). The SG was evaluated with children with different intensity levels of ASD (from level 1 to level 3). During the application of the multiple baseline design (baseline, intervention, and maintenance phases), a psychologist accompanied the participants in order to support the use of the tool. For this group, the tool was applied on an individualized basis with psychologists in therapy sessions. For the application of the SG, a desktop computer with a webcam was used.

6.2.2. Analysis of the Effectiveness of the SG

The second stage of the investigation aimed to evaluate the performance of individuals with ASD using the SG. The group of participants was analyzed in the three phases of the investigation. This stage helps show whether the intervention was effective and how much each individual with ASD improved. Figures 18 and 19 show the results for the eight participants with ASD. Table 5 shows the number of times the SG alerted the user due to loss of focus in each of the evaluation steps (eye-tracking).

In the baseline phase, the specialists observed that participants with ASD did not direct their focus to the main part of the tool (see Table 5); usually, their focus was directed to other parts of the SG interface. The specialists reported that some participants were able to correctly speak the word that represented the emotion but were not able to activate the facial muscles to express it. These sessions contributed to improving the representations and also served as usability training for the tool. In the intervention phase (see Figures 18 and 19), the specialists guided participants to focus on the animations while keeping their attention on the facial expressions for representing emotions. Participants showed an improvement in emotion identification scores when compared to the baseline stage. It is possible to observe from the results that some participants improved their hit rates at each session (P1–P8); this is mainly observed in the intervention phase. In the maintenance phase, all participants managed to increase their performance scores over the previous phases. Table 5 shows that all participants had higher rates of loss of focus in the baseline phase and that these values decreased during the sessions. It can also be noted that P6 presented the highest rates of inattention, which may be related to this participant's low performance in the initial (baseline) phase.

Table 6 presents the correct-answer rates for each stage of the multiple baseline design. The participants showed an improvement in their emotion identification scores compared to the baseline stage. The most significant increase was seen for individual P6.

7. Conclusions

The computational techniques employed in the SG allowed us to build a tool aimed at individuals with ASD to improve emotion detection and recognition skills without the use of popular game engines or specialized hardware. The concepts employed in the SG allowed the development of an interactive and dynamic application capable of awakening the interest of the users involved in the interventions in an interactive and relaxed manner during the game's evaluation phases. For people with ASD and with problems related to emotional skills (recognition and expression of emotions), this tool is an aid system that can support specialists in this area of treatment and can be employed in combination with other techniques.

This SG has a module that employs emotion recognition and detection algorithms for individuals with ASD. This characteristic helps participants develop or improve their emotional skills regarding joy, sadness, anger, fear, disgust, and surprise. The Viola–Jones technique, the dlib library, and the features learned by the convolutional layers of the CNN contributed to the detection of the features captured on the face. The softmax layer enabled the classification of the features for the detection of emotion. The dashboard interface enabled the specialist to analyze the data, assisting in decisions to improve the strategies for the emotional skills of individuals with ASD during the sessions.

The experiments showed that in the baseline stage, the participants with ASD were not yet familiar with the tool and did not make significant progress during the sessions. Participants were not focused on the game scenes, as also occurs in everyday situations, as reported by parents and specialists. In the intervention and maintenance phases, the participants were motivated by the technological resources of the tool, which allowed for greater concentration on the SG and improved their emotional skills.

This study provided important results for the detection and recognition of emotions in public domain databases. The results showed that the system can contribute to the treatment of individuals with ASD to improve emotional skills. One limitation of this work is the need for a hardware device with a webcam to capture images and an internet connection for the evaluation and storage of the data obtained during sessions with individuals with ASD. It is also noted that evaluations of microexpressions and the investigation of balanced datasets may improve the model. In further work, we intend to investigate the data collected in the baseline and intervention phases and employ CNN architectures to provide recommendations to participants in real time, enabling feedback capable of helping in the improvement of emotional skills.

Data Availability

Application results and face data used to support the conclusions of this study are available upon request to the corresponding author. Participant data may not be released due to the restrictions of the research ethics committee.

Conflicts of Interest

The authors declare that they have no conflict of interest.

Acknowledgments

The authors thank everyone who took part in this survey, autistic and non-autistic people, the parents who allowed their children's participation, the multidisciplinary team from the schools where these children study, and all the specialists who gave their opinions about this work. This study was financed in part by the Coordination of Improvement of Higher-Level Personnel (CAPES), finance code 88882.429122/2019-01. The authors gratefully acknowledge the financial support of the National Council for Scientific and Technological Development (CNPq) (grant #304848/2018-2).