The task of estimating child engagement during interaction with a social robot in a special education procedure is studied. A multimodal machine learning-based methodology is proposed for estimating the engagement of children with learning difficulties who participate in appropriately designed educational scenarios. For this purpose, visual and audio data are gathered during the child-robot interaction and processed to decide whether the child is engaged or not. Six single and three ensemble machine learning models are examined for their accuracy in providing confident decisions on in-house developed data. The conducted experiments revealed that, using multimodal data and the AdaBoost Decision Tree ensemble model, the children’s engagement can be estimated with 93.33% accuracy. Moreover, an important outcome of this study is the need for explicitly defining the different engagement meanings for each scenario. The results are very promising and advance the research on closed-loop, human-centric special education activities using social robots.

1. Introduction

Nowadays, we are witnessing the fourth industrial revolution, commonly known in Europe as Industry 4.0 [1]. One of the most important parts of this revolution is the extension of robot usage beyond industrial environments to social activities that involve direct interaction with humans. This new kind of robot, named the social robot, shows increased interaction capabilities, is characterized by a certain degree of intelligence, and is safe to interact with children in any type of education.

Our interest here is the case of special education, which draws increased attention from modern societies aiming at providing equal opportunities to children with special needs to develop their skills. Recent studies have demonstrated the positive role of social robots in delivering special education in person [2, 3] as well as in distance [4].

The ultimate goal of an advanced child-robot interaction is the establishment of a high-level intelligent communication channel, in a closed-loop configuration with the child at the center of the educational scenario. This goal can be achieved by developing efficient sensing mechanisms on the robot side, such as automatic engagement measuring, which will permit the robot to adapt its behavior or even the execution of the educational scenario [5], towards increasing the success of the education delivery (increased knowledge transfer and achievement of the learning objectives). Therefore, the development of a robust methodology for measuring the engagement state of children in special education constitutes a challenging problem to tackle.

Children with learning disabilities (LD) are identified as having typical intelligence but manifest specific difficulties that interfere with their task performance and academic achievement [6]. The repeated failure and frustration experienced by children with LD reduce their self-efficacy, leading to a sense of helplessness, which is associated with lack of motivation and academic disengagement [7–9]. While academic engagement refers to active participation, attention, and focus on the task during the learning process, disengagement refers to apathy and lack of interest. The degree to which students are engaged is a critical precursor to learning, as without academic engagement, students are unlikely to benefit from instruction [10]. In other words, the more students are engaged, the more they learn [11]. Therefore, the development of a robust methodology for measuring the engagement state of children with LD constitutes a challenge.

Although several methods [12] for measuring the engagement level during child-robot interaction have been presented in the literature, all these attempts focused on children with Autism Spectrum Disorder (ASD), and their experimental studies were limited to a small number of children.

Taking into consideration the fact that children with learning disabilities are associated with maladaptive engagement compared with their typically developing peers [13, 14], it would be of great importance to know each child’s engagement level during interaction with social robots, so that the robots can be used in intervention programs that aim to promote the child’s learning by increasing their involvement in all kinds of learning tasks. Indeed, research has shown that interventions using social robots as a tool to support the learning process enhance students’ motivational skills, maintenance of engagement, and compliance during instructional interactions [15, 16]. The research of Pistoia et al. [17], which is one of the first attempts to investigate the use of a social robot with students with dyslexia, confirms that the presence of the robot to support the learning process led to high levels of response and engagement during child-robot interaction.

In this context, this work contributes along the following directions:
(1) A definition of the “Intelligent Interaction” based on psychology is provided
(2) A machine learning-based methodology that allows a social robot to interact intelligently with the child is proposed
(3) The proposed methodology is evaluated with a large amount of in-house developed real data
(4) For the first time, the case of children with learning difficulties is considered for measuring their engagement state during interaction with the social robot
(5) The need for customized engagement measuring methods based on the characteristics of the deployed scenario is raised for the first time

The rest of the paper is organized as follows: Section 2 provides a snapshot of the related work and Section 3 presents the definition of the “Intelligent Interaction,” the information of the designed educational scenarios, and the details of the proposed methodology. Section 4 provides the experimental study with the corresponding results. Section 5 discusses the results, concludes this study, and lays out the future work.

2. Related Work

Sidner et al. [18] proposed and Ahmad et al. [19] rephrased a general definition of the concept of engagement during human-robot interaction: “Engagement is the process by which interactors start, maintain, and end their perceived connection to each other during the interaction.” Measuring the engagement of humans executing a specific activity constitutes a highly informative indication for analyzing the effectiveness of the activity design. This measurement can help improve the design towards achieving the desired outcomes of the executed activity.

For this purpose, several methodologies have been proposed to measure the engagement of a user playing a video game [20], of a person when working [21], of students in a classroom [22], of TV viewers [22], of a consumer when purchasing products [23], and so on. Measuring engagement of a child with special needs during an educational process and/or intervention is very challenging due to the specially designed scenarios and interaction schemes, which must attract their attention and maintain engagement.

Early outstanding work on measuring the engagement of children with a game companion was proposed by Castellano et al. [24, 25], using a multimodal processing scheme based on visual and contextual information. Moreover, Hernandez et al. [26] proposed a method to measure the engagement of children who were difficult to engage during social interactions. In [26], wearable sensors were used to measure the electrodermal activity of the children and a Support Vector Machine (SVM) classifier was applied to classify the children as engaged or not. In [27], acoustic and linguistic data were utilized to detect social engagement in conversational interactions between children with ASD and their parents, using an SVM classifier. The first in-depth study of measuring the engagement of children when interacting with social robots was proposed by Anzalone et al. [28]. In this work, the researchers analyzed visual information from a static and a dynamic perspective, in several case studies of ASD child-robot interaction. Rudovic et al. [29] presented a very interesting study regarding engagement measuring across cultures, which revealed that the engagement level of 30 ASD children can be increased by taking into account the cultural differences.

Recently, with the advent of deep learning technology, several attempts have been made to measure engagement during child-robot interaction using advanced intelligent models. Rudovic et al. [30] proposed the CultureNet model, based on the typical ResNet-50 architecture, for estimating whether children of different cultures are engaged or not while interacting with the NAO robot in robot-assisted therapy for children with Autism Spectrum Condition (ASC). In [31], Hadfield et al. proposed a deep learning model consisting of three fully connected layers and a single LSTM layer, with features computed from visual data describing the position of the child’s body parts. The reported accuracy was almost 80%; the limited number (3) of Typically Developed (TD) children may explain this quite low accuracy. In a very recent work, Del Duchetto et al. [32] tried to measure the engagement level in human-robot interaction utilizing Convolutional Neural Networks (CNNs) and an LSTM model. The novelty of the work in [32] is that it tackles engagement estimation as a regression problem, aiming at providing a scalar value for the engagement level during human-robot interaction. The reported results were very promising, with a Mean Squared Error (MSE) of 0.126.

Although the previous approaches have contributed significantly to engagement estimation in human-robot interaction, they have some limitations: (1) they were applied mostly to adults or to children with TD or ASD, without examining other categories of children with special needs, such as children with learning difficulties; (2) they were evaluated with a limited number of children; and (3) they did not study engagement estimation in the framework of appropriately designed intervention scenarios, or the designed scenarios were few and very simple.

It is important to realize the need to analyze and measure the engagement of children with learning difficulties. Considering dyslexia as the most frequent learning difficulty, Uta Frith [33] proposed a three-level theoretical framework for the interpretation of dyslexia, namely, behavioral, cognitive, and biological. In this context, Frith also distinguished the role of the environmental level, which interacts with the abovementioned three levels. Students with dyslexia interacting with a social robot can therefore learn more easily, owing to its interactive and fun performance, which also allows students to take their time during a learning task. In addition, a social robot engages pupils in mental information processing and captures their attention [34].

After reviewing the applications of social robots in special education in the international literature, we found that the usage of social robots in supporting the educational procedure of children with learning difficulties is limited. This observation is at odds with the educational needs of a large part of the world’s population, since learning difficulties affect 10–15% of it [35]. We believe that such a high percentage makes this part of the population an important potential application field for social robots.

The current study aims to complement the previous works by investigating the engagement measuring when children with learning difficulties are interacting with the social robot NAO. The number of children that participated in the experiments was 10, while child psychologists carefully designed 10 scenarios, executed by each child.

3. Materials and Methods

3.1. Intelligent Interaction: A Definition

In order to understand the real needs for an engagement measuring methodology, it is crucial to define what is meant by an “Intelligent Interaction.”

Considering the work of the psychologist Howard Gardner [36] regarding the types of intelligence, nine different types of intelligence can be considered. Of these nine types, the following five deal with the interaction of a human with the surrounding environment:
(1) Linguistic intelligence: ability to find the right words to express what one means
(2) Visual-spatial intelligence: having awareness of the surrounding environment
(3) Bodily-kinesthetic intelligence: coordinating the mind with the body
(4) Interpersonal intelligence: sensing children’s feelings and motives
(5) Logical-mathematical intelligence: quantifying things, making hypotheses, and proving them

From the engineering point of view though, the previous interaction-oriented types of intelligence can be summarized in the following two levels of intelligence:
(1) 1st level of intelligence: ability to analyze the sensory data in order to understand the surrounding environment
(2) 2nd level of intelligence: establishing a human-like closed-loop communication with the child

The above two levels of intelligence enclose the aforementioned five types of intelligence defined in terms of psychology and can be the ultimate goals of any research dealing with human-robot interaction.

An important part of the above two levels of intelligence is the measuring of the child’s engagement state by processing the sensory data (1st level) for adapting the robot’s behavior and/or the educational scenario towards establishing a closed-loop communication channel (2nd level).

3.2. Educational Scenarios

For the sake of this study, five child psychologists (three from the “Family Center KPG, Thessaloniki, Greece” and two from the “Department of Clinical Psychology, Papageorgiou General Hospital, Thessaloniki, Greece”) of our research team designed ten different educational scenarios for children with learning difficulties, as part of the national project titled “Social Robots as Tools in Special Education (SRTSE)” [37]. It is worth noting that each child executed each scenario on different days. More precisely, each child executed two scenarios per week and the average duration of each scenario was 35 minutes.

Table 1 shows what types of activities are included in each scenario.

The scenarios include the following types of activities:
(i) Meet/greet
(ii) Text decoding, comprehension, and reading
(iii) Phonology composition, decomposition, discrimination, and addition
(iv) Memory
(v) Robot-child relaxation game
(vi) Story listening and telling
(vii) Sentence structuring
(viii) Strategic visual representation

3.3. Proposed Methodology

The proposed methodology has two main features: (1) the usage of multimodal data consisting of visual and audio modalities and (2) the usage of a machine learning model that provides the decision about the engagement state of the child. In the following subsections, the modules of the designed methodology depicted in Figure 1 are described in detail.

3.3.1. Multimodal Sensing

The sensing capabilities of the used social robot mainly control the type of sensory data to process in order to decide the engagement state of the child during the interaction. Our study considers the well-known NAO robot as the robot that is able to interact with the child, but other social robots [38] could also be used. This robot is equipped with two identical RGB video cameras located in the forehead and a microphone; thus, it can provide visual and audio sensing capabilities.

(1) Visual Sensing. The visual sensing capabilities of the NAO robot permit the acquisition of video frames that include the child’s body and face. From each video frame, the body pose is extracted using the library [39], consisting of 25 key points (2 on the torso, 6 on the hands, 12 on the legs, and 5 on the head), as depicted in Figure 2(a). In addition, 68 key points called facial landmarks are extracted (see Figure 2(b)), from the child’s face using the OpenFace library [40]. It is worth noting that the computed facial landmarks are used to define the child’s emotional state in compliance with the Facial Action Coding System (FACS) [41]. Finally, the eye contact between the child and the robot is detected using the OpenGaze library [42] and following the methodology proposed by Xucong Zhang et al. [43].

(2) Audio Sensing. During the interaction, the robot needs to keep facing the child at all times so that it can record and analyze the child’s speech, which provides additional information related to the engagement state of the child.

3.3.2. Feature Extraction

The abovementioned multimodal sensing mechanism aims at collecting raw sensory data. The visual data, which have the form of 2D Cartesian points belonging to the child, are further processed to construct more informative descriptions named features. The feature extraction procedure is applied to the video frames (640 × 480 pixels resolution) captured every 0.7 secs (1.4 fps), using nonoverlapping sliding windows of 60 secs. Although the camera of the NAO robot delivers 2.5 fps at 640 × 480 video resolution in WiFi connection mode, the real-time performance of our system is 1.4 fps due to the execution of the algorithms. Moreover, the processing time window was set to 60 secs in order to include enough event transitions and to help the manual annotation of the data. The features that are finally computed are the following:
(1) Feature 1: number of blinks: the blinks count of the child on average
(2) Feature 2: mean movement of the body in pixels
(3) Feature 3: if the child’s body was turned away from the robot (0 or 1)
(4) Feature 4: percentage of the time window within which there was eye contact by the child
(5) Feature 5: emotion (happy, sad, surprised, fear, anger, disgust, or contempt)
(6) Feature 6: emotion intensity (0–5)
(7) Feature 7: if the child’s head was turned away from the robot (0 or 1)
(8) Feature 8: mean response time (set to −1 if the scenario did not require a response from the child)
(9) Feature 9: mean voice level (in RMS)
(10) Feature 10: percentage of the time window within which the child was silent
(11) Feature 11: percentage of the time window within which the child was speaking
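As an illustration of the windowing described above, the following minimal sketch groups frame timestamps into nonoverlapping 60 secs windows. The function name, the use of timestamps in seconds, and the 3-minute example recording are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from collections import defaultdict

FRAME_PERIOD_S = 0.7  # one frame every 0.7 secs (about 1.4 fps), as stated in the text
WINDOW_S = 60.0       # length of each nonoverlapping window in seconds


def group_into_windows(timestamps, window_s=WINDOW_S):
    """Map each frame index to the nonoverlapping window its timestamp falls in."""
    windows = defaultdict(list)
    for frame_index, t in enumerate(timestamps):
        windows[int(t // window_s)].append(frame_index)
    return dict(windows)


# Hypothetical recording of about 3 minutes, sampled every 0.7 secs.
timestamps = [i * FRAME_PERIOD_S for i in range(257)]
windows = group_into_windows(timestamps)
frames_per_window = {w: len(frames) for w, frames in windows.items()}
```

At this sampling rate a full window holds roughly 85 to 86 frames, and each such window yields one 11-dimensional feature vector.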

It should be noted that almost all the above visual features are computed by tracking and processing the extracted key points. For example, for a specific frame, the emotion is determined by combining the FACS action units corresponding to each feeling (Table 2), averaging their intensities, and choosing the emotion that has the highest intensity. In addition, features 3 and 7 are determined by counting the occurrences of each state (0 and 1) in the time window and choosing the one with the highest number of occurrences. Lastly, to determine whether the child is speaking or not, we check if the voice volume is higher than 350 RMS, and the mean voice level is computed only over the levels at which the child is speaking.
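The aggregation rules described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary layout of the per-emotion intensities, and the example values are hypothetical, and only the 350 RMS threshold, the majority rule, and the "highest averaged intensity" rule come from the text.

```python
from collections import Counter
from statistics import mean

SPEECH_RMS_THRESHOLD = 350  # voice level above which the child counts as speaking


def majority(states):
    """Most frequent binary state (0 or 1) in the window (features 3 and 7).

    On a tie, the state seen first is returned (an arbitrary choice made here).
    """
    return Counter(states).most_common(1)[0][0]


def dominant_emotion(intensities):
    """Average the intensities listed for each emotion and keep the strongest.

    `intensities` maps an emotion name to the intensities of its associated
    facial action units for one frame (hypothetical input layout).
    """
    averaged = {emo: mean(vals) for emo, vals in intensities.items() if vals}
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


def voice_features(rms_levels):
    """Features 9 to 11: mean level while speaking, % silent, % speaking."""
    speaking = [v for v in rms_levels if v > SPEECH_RMS_THRESHOLD]
    pct_speaking = 100.0 * len(speaking) / len(rms_levels)
    return {
        "mean_voice_rms": mean(speaking) if speaking else 0.0,
        "pct_silent": 100.0 - pct_speaking,
        "pct_speaking": pct_speaking,
    }


# Toy values for one window (illustrative only).
body_away = majority([0, 0, 1, 0, 1, 0])
emotion = dominant_emotion({"happy": [3.0, 2.0], "sad": [1.0, 0.5, 0.0]})
voice = voice_features([100, 400, 600, 200])
```

In this toy window the body is mostly facing the robot, "happy" has the highest averaged intensity, and the child speaks in half of the audio samples.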

To summarize, a feature vector is assigned to each 60 secs time window, and each window is also manually annotated by three experienced child psychologists as an engaged time slot or not. The features extracted from the educational scenarios are used to train the machine learning model, so that it will be able to detect the engagement state of the child.

3.3.3. Machine Learning Models

Herein, the detection of the child’s engagement state (engaged or not engaged) is accomplished by solving a typical two-class classification problem with a machine learning classifier. Machine learning has proven to be an efficient technology in many disciplines such as signal processing [45] and computer vision [46]. More precisely, six traditional machine learning models are examined: the Support Vector Machine (SVM) model with two different kernels (RBF and poly), the Decision Tree, the K-NN, the Naïve Bayes (NB), the Multilayer Perceptron (MLP), and the Extreme Learning Machine (ELM) classifiers.

Additionally, three ensemble models are considered: the Random Forest (RF), the AdaBoost Decision Tree, and the AdaBoost Naïve Bayes. The advantage of ensemble classifiers is that they combine several “weak learners” into a stronger one, reducing the bias and/or variance of the resulting learner. The first ensemble uses the Bagging [47] training technique and the last two use the AdaBoost [48] training technique.

Most machine learning models have a set of configuration parameters that adjust their performance. These parameters depend on the considered problem and must be carefully selected.

4. Experimental Study

In order to study the performance of the proposed methodology, a set of experiments was arranged. The experiments were carried out using the scikit-learn [49] Machine Learning Library for Python, on Python version 2.7. Moreover, the experiments were conducted on a laptop computer equipped with Intel i7-6700HQ CPU, 8 GB DDR4 RAM, and GTX 960M GPU.

4.1. Dataset Design

For the sake of the experiments, 10 children (2 girls and 8 boys, aged from 9 to 10 years) participated in the ten scenarios (see Table 1). Each scenario is executed in a classroom with the participation of a child, the NAO robot, and a child psychologist sitting behind the NAO robot. The robot also needs to keep facing the child at all times so that its speech recognition module works more accurately, since in this position the microphones are oriented towards the source of the sound [50]. From the recorded video files, a dataset of 819 samples, with 11 features for each sample, was built. Of these samples, 99 corresponded to children being engaged, while 720 corresponded to children being not engaged. Three experienced child psychologists derived the ground truth data by manual annotation. Since this dataset is imbalanced, an oversampling technique called Synthetic Minority Oversampling Technique (SMOTE) [51] was employed to balance the dataset, so that it contains the same number of samples for each class. The final balanced dataset includes 1440 samples (720 per class).
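To make the balancing step concrete, the sketch below shows the core idea of SMOTE: each synthetic sample is an interpolation between a minority sample and one of its nearest minority neighbours. It is a simplified illustration and not the implementation used in the study; the function name, the choice of k = 5, the random toy data, and the treatment of all 11 features as numeric are assumptions. With 99 minority and 720 majority samples, 720 − 99 = 621 synthetic samples are needed.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: dist2(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class: 99 random samples with 11 numeric features (illustrative only).
data_rng = random.Random(1)
minority = [[data_rng.random() for _ in range(11)] for _ in range(99)]
synthetic = smote(minority, n_new=720 - len(minority))
balanced_minority_size = len(minority) + len(synthetic)
```

Because every synthetic sample lies on a segment between two real minority samples, the oversampled class stays inside the region already covered by the annotated data.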

4.2. Settings of the Experiments

A 10-fold cross-validation grid search technique [52] was applied in order to select the best parameters set for each model. The resulting parameters that optimize the accuracy of each model are presented in Table 3.

The performance of each model was evaluated using the Precision, Recall, Accuracy, and F-measure indices [53]. These measures are widely used in machine learning to evaluate the performance of a model. They take into account the True Positive (TP) and True Negative (TN) cases, which correspond to those cases correctly identified as positive or negative, respectively, and the False Positive (FP) and False Negative (FN) cases, which are falsely identified as positive or negative, respectively.

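Accuracy is the proportion of all predictions that are correct and is calculated from the equation

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \]

Precision is the proportion of predicted positive results that are correct and is calculated from the equation

\[ \mathrm{Precision} = \frac{TP}{TP + FP}. \]

Recall is the proportion of actual positive cases that are correctly identified and is calculated from the equation

\[ \mathrm{Recall} = \frac{TP}{TP + FN}. \]

F-measure combines both Precision and Recall and is the harmonic mean of those indices, calculated as follows:

\[ F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \]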
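The four indices can be computed directly from the confusion counts, as in the following short sketch. The function name and the example counts are hypothetical and are not results of this study; the zero-division guards are a convention chosen here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for a small test fold of 20 samples.
metrics = classification_metrics(tp=8, tn=9, fp=2, fn=1)
```

For these counts the Accuracy is 17/20 = 0.85, the Precision is 8/10 = 0.8, the Recall is 8/9, and the F-measure is 16/19.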

4.3. Results

A k-fold (with k = 10) cross-validation technique is followed for the evaluation of each machine learning model in estimating the children’s engagement state. According to this training and testing protocol, the initial dataset of 1440 samples is divided into 10 equal and nonoverlapping subsets of 144 samples. Each one of these subsets is used to test the model trained with the remaining nine subsets. The process is repeated k times, with each subset used for testing exactly once. The results of the k experiments are averaged in order to assess the generalization ability of each model. Table 4 summarizes the prediction performance of each model.
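The splitting protocol can be sketched as follows. This is a minimal illustration with an assumed function name; shuffling before splitting is an assumption, since the text does not state whether the samples were shuffled or stratified.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k nonoverlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 1440 balanced samples and k = 10 give 144 test and 1296 training samples per fold.
splits = list(k_fold_indices(1440, k=10))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

Each sample appears in exactly one test fold, so averaging the ten per-fold scores uses every sample once for testing.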

The results of Table 4 reveal two important conclusions. The first one is that the initial hypothesis that the children’s engagement can be measured by using multimodal data consisting of combined behavioral, pose, and emotional information is justified experimentally since the accuracy of the models is very high (up to 93.33%).

The second conclusion is that the AdaBoost Decision Tree model outperforms the other models, by a significant factor in some cases, followed by the SVM (RBF). In addition to the high accuracy of the AdaBoost Decision Tree model, its high Precision, Recall, and F-measure are evidence that the model is able to estimate the children’s engagement on unseen data with minimal False Positive (FP) and False Negative (FN) decisions. The outperformance of the ensemble methods was expected, since these models are more complex and provide the final decision by considering the outcomes of multiple single classifiers working in a complementary way. Among the ensemble models, AdaBoost shows the best performance, a result that reveals the ability of the sequential topology of boosting to improve the classification performance. On the other hand, the Bagging topology of the Random Forest model implies that the weak classifiers of the model operate with similar data, meaning that the dataset is quite homogeneous, without significant variations.

Moreover, in order to examine the bias of the machine learning models towards the data of a specific scenario, a modified leave-one-out training strategy was applied: in each case, the samples left out were all the samples corresponding to a specific scenario, and the training was done with the samples of all the other scenarios. For each test case, the training data were again augmented to tackle the imbalanced number of samples per class. For example, in the first fold, the models were trained with the samples corresponding to all the scenarios except the first one and then tested with the samples corresponding to the first scenario, and so on for each fold. Figure 3 depicts the performance of the machine learning models for each scenario when its data samples are used to test the models.
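The scenario-wise split can be sketched as below. The function name and the toy scenario labels are assumptions made for illustration; in the protocol described above, the oversampling would be applied to the training indices of each fold only.

```python
def leave_one_scenario_out(scenario_ids):
    """Yield (held_out_scenario, train_indices, test_indices) per scenario."""
    for held_out in sorted(set(scenario_ids)):
        train = [i for i, s in enumerate(scenario_ids) if s != held_out]
        test = [i for i, s in enumerate(scenario_ids) if s == held_out]
        yield held_out, train, test


# Toy labels: the scenario each of seven samples belongs to (illustrative only).
scenario_ids = [1, 1, 2, 3, 3, 3, 2]
scenario_splits = list(leave_one_scenario_out(scenario_ids))
```

With ten scenarios this produces ten folds, and the test samples of each fold come from a scenario that the model has never seen during training.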

The results presented in Figure 3 reveal that the SVM (poly) model shows the lowest bias in the training data since it has the highest detection accuracy in 7 out of 10 training folds. Moreover, an interesting observation of this experiment is the different “definitions” of children’s engagement state in each scenario, since the performance of each model varies with the scenario type.

5. Discussion and Conclusion

The task of engagement detection of a child with learning difficulties interacting with a social robot for establishing a two-way intelligent interaction was studied in this work. The detection procedure was tackled as a two-class classification problem solved with high success by applying a machine learning model. The proposed methodology uses multimodal data (visual and audio) that describe the behavior of the child during the interaction. The initial hypothesis that an engaged child with learning difficulties can be identified by processing the body and head poses, the facial expressions, the eye contact, and the speech was accepted following the proposed method. However, this study brought to light the possible different “definitions” of engagement that apply in each educational scenario. This outcome is very important since it paves the way for more customized engagement measuring techniques oriented to the specific scenarios under deployment, towards providing an optimal interaction strategy.

In addition to investigating scenario-based engagement measuring methods, future work will consider the time parameter of each extracted feature and handle the features as time series, by deploying regression ML models such as Long Short-Term Memory (LSTM) for predicting the engagement level at discrete time steps.

Data Availability

The data used in this research will be provided upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This research has been cofinanced by the European Union and Greek National Funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (Project code T1EDK-00929).