Abstract

We seek to model the users’ experience within an interactive learning environment. More precisely, we are interested in assessing the relationship between learners’ emotional reactions and three trends in the interaction experience, namely, flow: the optimal interaction (a perfect immersion within the task), stuck: the nonoptimal interaction (difficulty maintaining focused attention), and off-task: the noninteraction (dropping out of the task). We propose a hierarchical probabilistic framework using a dynamic Bayesian network to model this relationship and to simultaneously recognize the probability of experiencing each trend as well as the emotional responses occurring subsequently. The framework combines diagnostic variables from three modalities that sense the learner’s experience (physiology, behavior, and performance), predictive variables that represent the current context and the learner’s profile, and a dynamic structure that tracks the evolution of the learner’s experience. An experimental study, with a protocol specifically designed to elicit the targeted experiences, was conducted to validate our approach. Results revealed that multiple concurrent emotions can be associated with the experiences of flow, stuck, and off-task and that the same trend can be expressed differently from one individual to another. The evaluation of the framework showed promising results in predicting learners’ experience trends and emotional responses.

1. Introduction

Modeling and understanding the users’ interaction experience is an important challenge in the design and development of adaptive intelligent systems [1]. Ongoing advances in human-computer interaction (HCI), cognitive science, psychology, and neuroscience have greatly enhanced the ability of such systems to effectively diagnose users’ behaviors and to provide appropriate assistance and adjustment [2–7]. In this context, particular attention is paid to the modeling of users’ affect and emotional reactions, as they play a critical role in users’ cognitive performance and decisively influence their perception, concentration, decision-making, memorization, and problem-solving abilities [8–11]. In the field of computer-based learning and intelligent tutoring systems (ITS), a growing interest has been devoted to obtaining and monitoring information about learners’ emotions. The combination of multimodal affect sensing technologies with artificial intelligence (AI) techniques has proved effective in inferring learners’ emotional states [12–18]. Physiological monitoring using wearable noninvasive biofeedback devices holds a prominent place, as it provides valuable quantitative and objective information compared to traditional evaluation methods such as questionnaires or self-reports [19–21].

Nevertheless, the integration of the affective dimension within ITS has raised much debate about which emotions should be assessed. No clear consensus has been reached on which emotions should be fostered or avoided within tutoring interactions [22–24]. Indeed, the relationship between emotions and learning is far more complex than a linear association stating that positive emotions enhance learning while negative emotions obstruct it [25]. Some emotions, considered a priori as negative, are not only inevitable within technology-mediated learning [26, 27] but can also contribute positively to the learning experience. For example, stress can have two opposite effects: “positive” stress (or eustress) is known to stimulate cognitive abilities, while “negative” stress (or distress) penalizes concentration and decreases cognitive performance [28, 29]. Similarly, confusion can represent a positive challenging aspect of the learning experience or might, conversely, signal a cognitive lock or impasse [23, 30, 31]. Therefore, the assessment of learners’ emotions may not provide, in itself, an explicit evaluation of their interaction experience. For instance, beyond which level does stress become harmful to the learning experience? This is obviously a challenging question, given the highly contextualized, person-dependent, and dynamic nature of emotions.

Hence, the goal of this research is not only to assess learners’ emotional responses but also to determine how emotions impact their learning experience, whilst taking account of both contextual and individual differences and tracking the dynamics of the learners’ states over time. More precisely, we propose modeling the relationship between learners’ emotions and the tendency that characterizes the quality of their interaction experience (e.g., positive/favorable or negative/unfavorable). We identify three extreme trends in the interaction, namely, flow or the optimal experience: a state in which the learner is completely focused and involved within the task, stuck or the unfavorable interaction: a state in which the learner has trouble maintaining focused attention, and off-task or the noninteraction: a state in which the learner is no longer involved in the task. Our hypothesis is that these trends can be associated with multiple overlapping emotional responses and that this relationship can be specific to each learner. We propose a hierarchical probabilistic framework using a dynamic Bayesian network to model this relationship and to simultaneously recognize the trend that characterizes the learner’s interaction experience and the emotional responses occurring subsequently. The framework involves three modalities to diagnose the interaction (physiology, behavior, and performance), the learner’s profile and context-dependent variables to account for individual differences and environmental factors, and a dynamic structure to track the evolution of the interaction experience over time.

An experimental study was conducted to test our hypothesis and validate our approach. A protocol was established to manipulate the learners’ interaction experience and elicit the three targeted trends as they used three computer-based learning environments involving different cognitive tasks, namely, problem solving, memorization, and reasoning. Forty-four participants were recruited for this experiment; their physiological activity was monitored using three biofeedback devices (electroencephalogram, skin conductance, and blood volume pulse), along with behavioral variables tracking patterns of their interactions and their performance during the tasks. The evaluation of the proposed framework shows its capability to efficiently recognize the learners’ experience. We demonstrate that our approach outperforms conventional nondynamic modeling methods using static Bayesian networks as well as three nonhierarchical formalisms, namely, naive Bayes classifiers, decision trees, and support vector machines.

The remainder of the paper is organized as follows. A brief literature review is outlined in Section 2. Section 3 describes the proposed hierarchical framework for assessing learners’ interaction trends and emotional responses. Section 4 details our experimental setup and methodology. Finally, Section 5 discusses the experimental results, and Section 6 concludes and presents directions for future work.

2. Related Work

Improving the interaction between users and computers requires both a means of qualitatively measuring the users’ experience and a set of adaptive mechanisms to automatically adjust the interaction. A large body of work has been devoted to evaluating the users’ experience by analyzing their emotions, as they play a key role in mirroring the users’ internal state. Approaches for measuring emotions—especially in the fields of HCI and ITS—are typically concerned with the recognition of a single emotional state. Two distinct strategies are mainly adopted: either a specific emotion is considered in isolation or several emotions are considered but treated as mutually exclusive. In the first case, the system is designed to identify a specific class of emotion such as frustration [17, 32, 33], stress [34–36], confusion [15, 37, 38], or fatigue [39–41]. In the second case, the system is capable of representing and recognizing several classes of emotions that vary over time, but at a given time the user is characterized by a unique emotional state (e.g., [12, 14, 18, 22, 42]). These approaches clearly restrict the evaluation of the user’s experience as they provide only a limited insight into the user’s actual state. Indeed, several emotions can be experienced at the same time; these emotions can have either the same or opposite valence [7, 43]. For instance, at a given time, a user can be both interested in and engaged within the current task but also stressed and confused. Hence, representing and recognizing a combination of overlapping states provides a more holistic and comprehensive view of the user’s experience [13].

Current approaches to affective modeling can also be categorized according to the machine learning techniques used to recognize the users’ emotional states. The first category uses conventional classification algorithms including rule-based reasoning [44], support vector machines [42, 45], neural networks [46, 47], and decision trees [48, 49]. These approaches rely mostly on a low-level mapping between the manifesting features of affect and the targeted emotional states. This mapping is often inadequate to represent complex dependencies involving contextual features or person-related characteristics, which could interfere in the experience of affect. Besides, the classification of the user’s state is commonly achieved on an ad hoc and static basis, independently of the history, that is, without taking into account past knowledge regarding the user’s state. Another limitation of these approaches is that they are often unable to represent and manage the uncertainty associated with both the sensory measurements and the expression of affect. To overcome these limitations, the second category of approaches uses hierarchical probabilistic methods such as dynamic Bayesian networks (DBN) and hidden Markov models (HMM). DBN are particularly used for affect recognition (e.g., [13, 41, 50, 51]) as they provide a powerful tool to model complex causal relationships at different levels of abstraction and to capture the dynamics and the temporal evolution of the user’s state, while efficiently handling uncertainty through probabilistic representation and reasoning formalisms. For instance, Conati and Maclaren [13] use a DBN to monitor learners’ emotions within an educational game, using bodily expression-related features, personality traits, and patterns of the interaction. Liao et al. [51] infer users’ stress levels using a DBN that combines physiological measures, physical observable changes, and performance and interaction features. Ji et al. [41] use observable clues including facial expressions, gaze direction, and head and eye movement, in conjunction with context-related information, to assess human fatigue.

In this paper, we propose a hierarchical probabilistic framework to dynamically track the users’ experience while interacting with a learning environment. Our approach differs fundamentally from previous work in that we not only recognize concurrent emotional states but also explicitly measure the tendency that characterizes the quality of the interaction experience. More precisely, our objective is to assess the relationship between emotions and the type of the interaction, that is, how emotions impact the learning experience or, in other words, how a favorable (or an unfavorable) interaction is manifested emotionally. We propose evaluating the learners’ experience with regard to three extreme key trends, namely, the states of flow, stuck, and off-task, which characterize the learners’ interaction along the dimensions of involvement and control (or mastery) regarding the task at hand and which determine whether a tutoring intervention is required. Flow is the optimal trend: a positive experience where the learner is perfectly focused and involved within the task. A feeling of being in control prevails, as an equilibrium is found between the challenge at hand and the learner’s skills [52]. It is hence the moment where a tutoring intervention should be avoided, so as not to interrupt the learner and risk disturbing his cognitive flow. Stuck is a nonoptimal trend: a negative experience where the learner has trouble maintaining focused attention. The learner feels out of control, as a pronounced disequilibrium is perceived between the challenge at hand and his skills [53]. In this case, a supportive intervention should be performed to help the learner overcome the encountered difficulty and pursue the task. The off-task trend (or the “noninteraction”) can be seen as an extremely negative experience where the learner totally loses his focus and drops out of the task. The notion of control is no longer applicable, as the learner has given up and “disconnected” from the interaction. The off-task trend should therefore be carefully monitored, and, if detected, a more radical intervention would be performed to motivate the learner and get him involved again in the interaction, such as changing the current task or presenting a different topic.

Although there have been significant attempts to model these trends, especially within ITS [16, 53–56] and video game environments [57–59], there is still no unifying framework to systematically assess, in a dynamic way, both the three types of interaction (i.e., flow, stuck, and off-task) and the emotional responses that occur subsequently. Indeed, these states have mainly been approached in an isolated manner and mostly associated with a single emotion within constrained interactions, such as predicting whether a learner is about to quit a Towers of Hanoi activity as he presses a button labeled “I am frustrated” while solving the task [16] or detecting whether the user is avoiding learning the materials by guessing or abusing hint features [54].

To summarize, the research presented in this paper extends prior work in the following ways. First, we combine the recognition of the three interaction experience trends with that of the emotional responses. We assume that a learner’s experience can be associated with several overlapping emotions and that the same trend can be expressed differently from one learner to another. Second, we propose a hierarchical probabilistic framework based on a DBN to model and learn the relationship between learners’ emotions and the targeted trends. The framework combines multimodal channels of affect with the learner’s profile and context-dependent variables, to automatically recognize the probability of experiencing flow, stuck, and off-task and to assess the emotional responses occurring during the interaction. Finally, we validate our approach through an experimental study in which we elicit the three interaction trends as learners perform different cognitive tasks.

3. The Proposed Approach

In this section we describe our framework for modeling a user’s experience while interacting with a computer-based learning environment. The framework uses a dynamic Bayesian network [60] to automatically track the learner’s emotional changes, where concurrent emotions are represented, and assess the probability of experiencing flow, stuck, and off-task. A macromodel of the framework is given in Figure 1; it includes two main portions to represent the factors (causes) and the manifesting features (effects) of a learner’s state, namely, a predictive component and a diagnostic component.

Predictive Component.  The predictive (upper) part of the network describes the factors that could cause or alter the experience of the interaction. These factors represent the current context, which includes environmental variables that can directly influence the learner’s experience such as the level of difficulty of the task at hand, the relevance of the hints or help provided, and the imposed time constraints. The predictive portion also includes the learner’s own characteristics (profile) that can directly or indirectly influence the learning experience. These include the learner’s goals, preferences, personality, skills, and computer usage frequency.

Diagnostic Component.  The diagnostic (lower) part denotes the evidence, that is, the sensory observations used to infer the learner’s state. Three modality channels can be included, namely, physiology, behavior, and performance. (1) Physiological features can be used to track bodily changes associated with emotions. For instance, galvanic skin response (GSR) is widely known to vary linearly with emotional arousal [61, 62]. Heart rate (HR) is extensively used to understand autonomic nervous system function and has shown a close correlation with emotional valence [61, 63, 64]. Electroencephalogram (EEG) can provide neural indexes related to cognitive changes such as alertness, attention, workload, executive function, or verbal and spatial memory [2, 65–67]. In particular, Pope and colleagues at NASA developed a mental engagement index [68]. This index proved highly reliable for switching between manual and automated piloting modes and was used as an alertness criterion for adaptive and automated task allocation [69]. It was also used within an educational context, providing an efficient assessment of learners’ mental vigilance and cognitive attention [70].

(2) Behavioral features comprise key aspects of the interaction between the learner and the environment, which may give clues about the learners’ levels of involvement (or activity/inactivity) within the task. These variables include the rate of requesting help, the hints used, mouse or keyboard pressing, click frequency, and character input speed. Additional devices can be used to assess learners’ behaviors during the interaction such as a video camera, an eye tracker, and a posture sensitive chair. (3) Performance features involve objective measures that can be influenced by changes in the learner’s experience and could provide an indication about the level of mastery of the task. These features include correctness, errors made in the task, and time spent before answering. More complex features can be used to track the learner’s skill acquisition process such as the content that the learner knows and the practiced skills.

The middle part of the model represents the learner’s actual state. The first layer represents the concurrent emotional responses. Each emotion is represented by a separate random variable with different possible outcomes. In this work, we model four classes of emotions pertaining to learning and frequently observed during computer tutoring, namely, stress, confusion, boredom, and frustration [22, 23, 25–27, 71]. For instance, the node associated with stress can have the following outcomes: calm (no stress), low, moderate, and high stress. Similarly, confusion can range from confidence (no confusion) to high confusion, boredom can range from interest (no boredom) to high boredom, and frustration can range from satisfaction (no frustration) to high frustration. The second layer represents the learner’s interaction experience trend with the following possible outcomes: flow, stuck, and off-task. The recognition is achieved through a probabilistic inference from the available diagnostic measures (bottom-up) to update the learner’s emotional responses (i.e., the probability of each emotion node’s outcome). This inference is, in turn, combined with a predictive (top-down) inference from the current context and personal variables and propagated to update the learner’s interaction trend (i.e., the probability of experiencing flow, stuck, and off-task).

This two-layered abstraction is aimed at quantifying the learner’s experience trend on the one hand and identifying the emotional responses that occur subsequently on the other. More precisely, the goal is to determine the emotions that occur when the probability of a positive interaction (flow) tends to decrease and the probability of a negative interaction (stuck and off-task) tends to increase, so that an effective intervention can be initiated and targeted according to the predominant emotional states. In addition, the model includes a dynamic structure representing the temporal evolution of the learner’s interaction trends and emotional responses. This structure is described by the dashed arcs shown in Figure 1. Each random node at time t is influenced by the observable variables at time t as well as by its corresponding random node’s outcomes at time t − 1. The resulting network is made up of interconnected time slices of static Bayesian networks, each describing a particular state of the learner. The relationship between two neighboring time slices is represented by a hidden Markov model (HMM). That is, the inference made at time t − 1 is used in conjunction with the sensory data observed at time t to update the learner’s current emotions and the probability of each trend.
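
To make the temporal update concrete, the following sketch (in Python, with illustrative numbers) shows one forward-filtering step of the kind described above. It is a deliberate simplification: the actual model couples several hidden nodes and was trained from data, so a single trend variable stands in for the full network here.

```python
import numpy as np

TRENDS = ["flow", "stuck", "off-task"]

def filter_step(prior, transition, likelihood):
    """One forward-filtering step over the trend variable.

    prior:      P(trend at t-1 | evidence up to t-1), shape (3,)
    transition: P(trend at t | trend at t-1),         shape (3, 3)
    likelihood: P(evidence at t | trend at t),        shape (3,)
    returns:    P(trend at t | evidence up to t),     shape (3,)
    """
    predicted = transition.T @ prior    # temporal (dashed-arc) propagation
    posterior = likelihood * predicted  # diagnostic (bottom-up) evidence
    return posterior / posterior.sum()  # normalize

# Toy numbers: a learner previously in flow, with new evidence that
# favors "stuck" (e.g., long response time and help requests).
prior = np.array([0.80, 0.15, 0.05])
transition = np.array([[0.85, 0.10, 0.05],
                       [0.10, 0.80, 0.10],
                       [0.05, 0.15, 0.80]])
likelihood = np.array([0.1, 0.7, 0.2])
print(dict(zip(TRENDS, filter_step(prior, transition, likelihood).round(3))))
```

With a prior favoring flow and evidence typical of a blocked learner, the posterior shifts toward stuck, illustrating how the temporal arcs temper abrupt changes in the inferred state.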

4. Methodology and Experimental Design

An experimental protocol was established to deliberately manipulate the learners’ interaction experience, while recording their physiological activities, behavioral patterns, and performance during the tasks. Data were collected from 44 participants of different ages, genders, and qualifications to validate our approach. Three devices were used to record participants’ physiological activities, namely, electroencephalogram (EEG), skin conductance (SC), and blood volume pulse (BVP) sensors. EEG was recorded using a 6-channel headset. The SC and BVP sensors were placed on the fingers of the resting left hand. Data were synchronized using time markers, to automatically integrate the recorded signals with the rest of the instrumental setup. In addition, two video cameras were used to record the users’ face and the onscreen activity, so as not to miss any feature of their interactions.

Three environments were used for our experiments, namely, trigonometry, backward digit span (BDS), and logic. The goal was to study the learners’ experience within different contexts and cognitive tasks. BDS and logic involve strictly cognitive tasks under controlled laboratory conditions, namely, memorizing digits and logical exercises. The trigonometry session is a more complex learning environment, with less controlled conditions. It comprises a learning session with an introductory course covering some basic trigonometric properties and relationships, followed by a problem-solving activity. Figure 2 depicts a screenshot of each environment.

One of the key points of this study was to acquire accurate data related to the learners’ interaction experience trends and emotional responses. Thus the three environments were carefully designed to intentionally elicit the three types of interaction (i.e., flow, stuck, and off-task). Each session began with relatively simple tasks; everything was designed to get the learners involved in the activity (e.g., easy problems, figures clarifying the problem statements, help/hints provided if needed, no time limit imposed, etc.). As the learner progressed within the session, the tasks became more challenging and the level of difficulty increased gradually. Different parameters were manipulated to deliberately vary the difficulty level and foster the states of stuck and off-task. These included the complexity of the task to be performed, the time limits, and the provided help. Some additional parameters (e.g., unreasonable time limits, deliberate bugs, etc.) were adjusted to systematically get the learners puzzled or even discouraged from pursuing the activity.

Trigonometry.  For this session, we used the trigonometry tutoring system developed by Chaouachi et al. [72]. The tutoring content, which originally covered six basic problem-solving tasks, was enhanced with additional tasks (16 in total) structured in three series of incrementally increasing difficulty, as described below. The session started with a trigonometry lesson explaining several fundamental trigonometric properties and relationships. Basic definitions as well as their mathematical demonstrations were given. The environment provided schemas and examples for each presented concept and a calculator to perform the needed computations. Learners were then asked to complete a problem-solving activity, which involved applying, generalizing, and reasoning about the trigonometric properties. No prerequisites were required to solve the problems beyond the concepts previously seen. However, a good level of concentration was needed to successfully achieve the tasks. Three series of gradual difficulty were designed for this activity; several parameters were considered, namely, the time constraints, the presence/absence of help, and the complexity of the task. In particular, each trigonometric problem required some intermediate steps to reach the solution, and the complexity was increased by raising the number of required steps.

Series 1 involved six rudimentary multiple-choice questions, without any time limit. The problems consisted mainly of applying simple trigonometric properties and required few intermediate steps (e.g., calculating the measure of an acute angle within a right triangle given the lengths of the hypotenuse and the opposite side). The environment provided a limited number of hints for each problem. The hints (if used) provided relevant and detailed information leading to the solution (e.g., “remember that the sine = opposite/hypotenuse”). Schemas illustrating the problems and the necessary recalls were presented as well, to make the task easier. Series 2 consisted of five multiple-choice questions. The problems of this series were more complex and required an increased number of intermediate steps to reach the solution. For example, to compute the sine of an angle, learners had to first compute the cosine, then square the result, subtract it from 1, and finally compute the square root. A geometrical figure was given to illustrate the statements, and reasonable time limits, varying according to the difficulty, were set for each problem. Some hints were given for the most difficult problems. However, the information provided was very vague compared to series 1 (e.g., “the sum of the angles of a triangle is equal to 180 degrees”). Series 3 involved five open response questions (i.e., without offering potential options to choose from). The problems involved more elaborate statements, and further concentration was needed to translate the statements into a trigonometric formulation (e.g., “a 50-foot pole (height = 50 feet), perpendicular to the ground, casts a shadow of 20 feet (length = 20 feet) at a given time. Find the elevation angle (in degrees) of the sun at that moment”). Very strict, gradually decreasing time limits were imposed, and no hints or illustrations were given for this series.
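
For illustration, the two worked examples above reduce to the following elementary steps (our own restatement, not taken from the original materials):

```latex
% Series 2 example: recovering the sine from the cosine
\sin\theta = \sqrt{1 - \cos^{2}\theta}

% Series 3 example: the pole-and-shadow problem
\tan\theta = \frac{\text{height}}{\text{shadow}} = \frac{50}{20} = 2.5,
\qquad \theta = \arctan(2.5) \approx 68.2^{\circ}
```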

Backward Digit Span (BDS).  This activity mainly involves working memory and attention abilities. A series of single digits is presented successively on the screen for a short time. Learners are asked to memorize the whole sequence and then instructed to enter the digits in the reverse order of presentation. Two levels were considered for this session, namely, BDS 1 and BDS 2. Each level involved three tasks (tasks one to three for BDS 1 and tasks four to six for BDS 2) of gradually increasing difficulty, obtained by increasing the number of digits in the displayed sequence. The difficulty was further increased in BDS 2 by gradually decreasing the digits’ display periods (from 700 ms to 600, 500, and 300 ms). Task one consisted of a series of 12 sets of 3 digits, task two of 8 sets of 4 digits, task three of 7 sets of 5 digits, task four of 5 sets of 6 digits, task five of 4 sets of 8 digits, and task six of 4 sets of 9 digits. Participants were instructed to use the mouse to enter the digits on a virtual keyboard displayed on the screen. No additional time constraints were imposed for this activity.

Logic.  This activity involves inferential skills on information series and is typically used in brain training exercises or tests of reasoning. No prerequisites are needed, but a high level of concentration is required. The goal is to teach learners how to infer a logical rule from a series of data in order to find a missing element. The tutoring environment is composed of three modules. Each module is concerned with a specific form of data: the first module deals with geometrical shapes (Geo.), the second with numbers (Num.), and the third with letters (Lett.). The session started with a tutorial giving instructions and warm-up examples to get the learners accustomed to the user interface and the types of questions; then a series of multiple-choice question tasks related to each module was given. For instance, in the Geo. module, three shapes were successively presented in the interface. The first shape was a black triangle, the second a white rectangle, and the third a black pentagon. Learners were asked to deduce the fourth, missing element, which would be, in this case, a white hexagon. That is, the logical rule to be inferred was to alternate between black and white colors and to add a side to each shape. Two levels of increasing difficulty were considered for each module, namely, Geo. 1 and 2, Num. 1 and 2, and Lett. 1 and 2. The difficulty was manipulated by increasing the complexity of the logical rule underlying the data. In addition, for the first level, the environment provided a limited number of hints to help the learners find the logical rule to be inferred, and no time constraint was imposed on answers. In the second level, the hints were increasingly scarce or even omitted, and a gradually decreasing time delay was imposed on answers. Besides, some tasks were designed to systematically mislead the learner. For instance, in the Num. 2 module, two perpendicular data series were presented. In the vertical series all the numbers were multiples of seven, and in the horizontal series all the numbers were multiples of five. In this task, one had to deduce the missing crossing element, which should be a multiple of both five and seven. But no such element was given among the possible answers. Some disturbing bugs were also intentionally provoked to distract learners and make them lose their focus (e.g., freezes, hidden statements or materials, very unreasonable time limits, etc.). A total of 20 tasks were given in this session: each subactivity consisted of 3 tasks, except for Num. 2 and Lett. 2, which involved 4 tasks each.

4.1. Sensory Measurements

Three modality measures were monitored, namely, behavioral variables, performance, and physiological features. Behavioral variables included the mouse movement rate (Mouse_mvt) and the frequency of requesting help/hints (Help_req). Performance measures included response time (Resp_time), answer to the current task (correct, incorrect, or no answer), and the overall accuracy rate. Physiological features involved galvanic skin response (GSR), heart rate (HR), and mental engagement (EEG_Engag). We discuss below the methodology used to extract and preprocess the physiological data.

Physiological Features.  Three devices were used to record learners’ physiological activities, namely, skin conductance (SC), blood volume pulse (BVP), and electroencephalogram (EEG) sensors. The acquired signals were digitized using the ProComp Infinity multichannel data acquisition system [73]. The SC device measured galvanic skin response (GSR), that is, changes in the resistance of the skin produced by perspiration gland activity. A tiny voltage is applied through two electrodes strapped to the first and middle fingers on the palm side. This establishes an electric circuit that quantifies the skin’s ability to conduct electricity, which increases as the skin becomes sweaty (e.g., when one is experiencing stress). The SC data were recorded at a sampling rate of 1024 Hz. The BVP device is a photoplethysmograph sensor, which measures the amount of light reflected by the surface of the skin. This amount varies with the quantity of blood present in the skin and thus with each heartbeat. The BVP signals were recorded at a sampling rate of 1024 Hz. Heart rate (HR) was calculated as the inverse of the interbeat intervals (i.e., the time between successive pulse peaks).
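
As an illustration of the HR computation just described, the following sketch (Python; the peak-detection settings are our assumptions, not the study’s actual preprocessing) derives instantaneous heart rate from a raw BVP trace:

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_from_bvp(bvp, fs=1024.0):
    """Instantaneous heart rate (bpm) from a BVP trace: HR is the
    inverse of the interbeat interval between successive pulse peaks.
    The minimum peak spacing (a ~180 bpm ceiling) is an assumption."""
    peaks, _ = find_peaks(bvp, distance=int(fs / 3))
    ibi = np.diff(peaks) / fs  # interbeat intervals in seconds
    return 60.0 / ibi          # beats per minute
```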

EEG was recorded using an Electro-Cap that measures the electrical brain activity produced by the synaptic excitation of neurons. Signals were received from sites P3, C3, Pz, and Fz as defined by the international 10–20 electrode placement system [74]. Each site was referenced to Cz and grounded at Fpz. Two additional active sites were used, namely, A1 and A2 (i.e., the left and right earlobes, resp.). This setup, known as the “referential linked-ear montage,” is illustrated in Figure 3. In this montage, roughly speaking, the EEG signal is equally amplified throughout both hemispheres. Moreover, the linked-ear setup calibrates each scalp signal to the average of the left and right earlobe sites, which yields a cleaner and more precise signal. For example, the calibrated C3 signal is given by C3 − (A1 + A2)/2. Each scalp site was filled with a nonsticky proprietary gel from Electro-Cap, and impedance was maintained below 5 kΩ. Any impedance problems were corrected by gently rotating a blunted needle inside the electrode until an adequate signal was obtained. The sampling rate was 256 Hz.
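
The re-referencing step amounts to a one-line computation; a minimal sketch (Python, with hypothetical array inputs) follows:

```python
import numpy as np

def linked_ear_reference(scalp, a1, a2):
    """Calibrate a scalp channel to the average of the earlobe sites,
    e.g., calibrated C3 = C3 - (A1 + A2) / 2. Inputs are 1D arrays of
    equal length (hypothetical raw channel recordings)."""
    return np.asarray(scalp) - (np.asarray(a1) + np.asarray(a2)) / 2.0
```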

Due to its weakness (on the order of a few microvolts), the EEG signal needs to be amplified and filtered. Besides, the brain’s electrical signal is usually contaminated by external noise such as environmental interference caused by surrounding devices. Such artifacts clearly degrade the quality of the signal. Thus a 60 Hz notch filter was applied during data acquisition to remove these artifacts. In addition, the acquired EEG signal easily suffers from noise caused by body movements or frequent eye blinks. Thus a band-pass denoising filter (1 Hz high pass and 48 Hz low pass) was applied. The engagement index was derived using three EEG frequency bands, namely, Theta (4–8 Hz), Alpha (8–13 Hz), and Beta (13–22 Hz). A fast Fourier transform (FFT) was applied to transform the EEG signal from each active site into a power spectrum. The transformed signal was partitioned to estimate the power within each band. The band powers were then summed across the measured scalp sites in order to compute the EEG band ratio Beta/(Alpha + Theta) [68]. The EEG engagement index was then smoothed using a sliding moving average window: at each instant t, the engagement index is computed by averaging the ratios within a 40 s sliding window preceding t. This procedure is repeated every 2 s, with a new 40 s sliding window used to update the index.
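
A sketch of the engagement-index computation described above is given below (Python/NumPy; the band-power estimator and windowing details are simplified assumptions, not the authors’ exact pipeline):

```python
import numpy as np

def band_power(signal, fs, lo, hi):
    """Estimated power of `signal` in the [lo, hi) Hz band via FFT."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    return spectrum[(freqs >= lo) & (freqs < hi)].sum()

def engagement_index(site_signals, fs=256.0):
    """Beta / (Alpha + Theta) ratio, with band powers summed over the
    recorded sites [68]. `site_signals` is a list of 1D arrays, one
    per site, covering a single analysis window."""
    theta = sum(band_power(s, fs, 4, 8) for s in site_signals)
    alpha = sum(band_power(s, fs, 8, 13) for s in site_signals)
    beta = sum(band_power(s, fs, 13, 22) for s in site_signals)
    return beta / (alpha + theta)

# Smoothing as described in the text: average the ratios computed over
# the 40 s window preceding each instant, updating every 2 s.
```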

4.2. Participants and Protocol

Forty-four participants (31 males), aged between 19 and 52 (M = 28.61 ± 8.40), were recruited for this research. Participation was compensated with 20 dollars. Upon arrival at the laboratory, participants were briefed about the experimental objectives and procedure and asked to sign a consent form. They were then outfitted with the biofeedback devices and familiarized with the materials and environments. Next, participants filled in demographic information (age, gender, qualification, frequency of computer usage per day, etc.). They were also asked about their preferences regarding the three activities (i.e., whether or not they liked trigonometry, digit recall, and logical reasoning, resp.) and their perceived skill levels (low, moderate, or high) in each of the three activities. Then, the Big Five Inventory (BFI) was administered to assess learners’ personality traits, namely, openness, conscientiousness, extraversion, agreeableness, and neuroticism [75]. After that, participants completed a 5-minute eyes-open baseline followed by a 5-minute eyes-closed baseline to establish a neutral reference for the physiological variables.

Participants were then instructed to complete the trigonometry session, followed by BDS and logic. To make the tasks more stimulating, participants were informed that a correct answer was rewarded with 4 points, an incorrect answer with −1 point, and no answer with 0 points, and that they could, if they chose to, get their score and ranking relative to the other participants at the end of the three sessions. All participants completed the levels of the three activities in the same order, namely, series 1 to 3 for trigonometry, then BDS 1-2, and then Geo. 1-2, Num. 1-2, and Lett. 1-2 for logic. They were allowed to work at their own pace on each task and were given breaks and rest periods between the three sessions and levels. Before starting each level, participants were asked about their goals for the next tasks by choosing among the following: “realizing the highest score/fewest incorrect answers possible,” “learning or discovering new concepts,” or just “finishing the task.” The experiment ended with a debriefing interview.

Subjective Measurement Collection.  After completing each task, participants reported how they had experienced the last trial. Participants were instructed to select the trend that best characterized their overall state during the last task (i.e., flow, stuck, or off-task) and to rate their experienced levels of stress, confusion, frustration, and boredom. A definition of each trend was given to the participants to help them choose the description that best matched their experience. Flow was defined as follows: “I felt like I was immersed in the activity. I was totally involved, and I was focused and attentive. I was totally controlling the task, and I felt that I had the necessary skills to fulfill it.” Stuck was defined as follows: “I felt that I was blocked. I had trouble maintaining focused attention. I was not totally controlling the task and felt like I could not make it.” Off-task was defined as follows: “I was likely to drop out. I could not (or did not want to) concentrate and I was no longer involved in the task. I felt like I gave up or that I did not want to continue.” If participants reported an off-task trend, they were given a short break before resuming the session.

A definition of each of the four emotions was provided as well. Stress was defined as follows: a reaction from a state of calm (relaxed) to an excited state and a feeling of tension or worry due to environmental pressure or constraint. Confusion was defined as follows: having doubts or uncertainty which may be due to a lack of knowledge or understanding. Frustration was defined as annoyance, irritation, or dissatisfaction. Boredom was defined as being wearied or listless due to a lack of interest. Four graduated scroll bars ranging from 0 to 100 were used to rate the intensity of each emotion. The bars included the following subdivisions: 0 = no negative emotion (i.e., calm, confident, satisfied, or interested, resp., for stress, confusion, frustration, or boredom), ]0; 35] = a low level, ]35; 65] = moderate, and ]65; 100] = high. For instance, if a participant rated 17 for stress, 52 for confusion, 0 for frustration, and 0 for boredom, we get the following overlapping states: low stress, moderate confusion, satisfied, and interested.
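
The mapping from ratings to categorical levels can be written directly from the subdivisions above; a minimal sketch in Python:

```python
def emotion_level(rating):
    """Map a 0-100 scroll-bar rating to the categorical levels above:
    0 -> none (calm/confident/satisfied/interested, depending on the
    emotion), ]0; 35] -> low, ]35; 65] -> moderate, ]65; 100] -> high."""
    if rating == 0:
        return "none"
    if rating <= 35:
        return "low"
    if rating <= 65:
        return "moderate"
    return "high"

# The example from the text: 17 (stress) and 52 (confusion).
assert emotion_level(17) == "low" and emotion_level(52) == "moderate"
```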

5. Results and Discussion

A total of 1848 samples (42 tasks * 44 participants) were collected from the experiment. Results are organized as follows. First we describe the statistical analysis conducted to validate our experimental design. Then, we study the relationship between the reported emotions and the experience trends. Finally, we evaluate our framework for recognizing learners’ interaction experience trends and emotional responses.

5.1. Analysis of the Reported Experiences

A preliminary statistical analysis was performed to examine the experience trends with regard to the task design. More precisely, the goal was to investigate how participants perceived their interactions throughout the sessions: What was the distribution of the targeted trends (i.e., flow, stuck, and off-task) across the different activities? Did the reported experiences vary in line with the established experimental process?

A two-way repeated measures ANOVA was conducted to evaluate the incidence (occurrence) and the variation (increase or decrease) of flow, stuck, and off-task across the levels of difficulty of the three sessions (i.e., trigonometry, BDS, and logic). The within-subject dependent variable was the proportion of the interaction trends, and the independent variables were (i) the type of the trend (flow, stuck, or off-task) and (ii) the testing time (i.e., series 1–3 for trigonometry, levels 1-2 for BDS, and Geo. 1-2, Num. 1-2, and Lett. 1-2 for logic). Results revealed a statistically significant main effect of the trend: F(1.80, 77.71) = 61.85; degrees of freedom were corrected using Huynh-Feldt estimates of sphericity (epsilon = 0.89), as the assumption of sphericity had been violated (chi-square = 6.79). Post hoc tests with a Bonferroni adjustment indicated that the state of flow was overall (i.e., across the three environments) the most prominent trend (M = 0.59 (0.028)), the state of stuck was less frequent (M = 0.27 (0.022)), and off-task was the least prevalent state (M = 0.14 (0.021)).
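
For readers wishing to reproduce this kind of analysis, a hypothetical sketch using statsmodels is shown below. Note that AnovaRM reports uncorrected F tests, so the Huynh-Feldt correction used above would have to be applied separately; the data file and column names are illustrative.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format table: one row per participant x trend x
# testing time, holding the observed proportion of that trend.
df = pd.read_csv("trend_proportions.csv")  # participant, trend, time, prop

# Two-way repeated measures ANOVA (trend x testing time). Sphericity
# corrections (Huynh-Feldt, Greenhouse-Geisser) are not performed here.
res = AnovaRM(df, depvar="prop", subject="participant",
              within=["trend", "time"]).fit()
print(res)
```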

The interaction effect (trend * testing time) showed that the rates of occurrence of flow, stuck, and off-task differed significantly across the 11 subactivities: F(10.24, 440.40) = 20.07; degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (epsilon = 0.59), as the assumption of sphericity had been violated (chi-square = 441.62). Bonferroni-corrected post hoc tests yielded the following patterns at the 0.05 significance level: flow > stuck > off-task for the beginning of the trigonometry session (series 1) and flow > (stuck = off-task) for the beginning of BDS (level 1) and logic (Geo. 1). Hence in the first tasks of each of the three environments, experiences of flow were the most common and experiences of stuck were either on a par with or higher than off-task. For instance, an average occurrence of 79% for flow, 17% for stuck, and 4% for off-task was found in series 1 of trigonometry. These patterns were reversed towards the end of each activity: (flow = off-task) < stuck for trigonometry (series 3) and flow = off-task = stuck for BDS (level 2) and logic (Lett. 2). That is, experiences of stuck were either as likely as or more likely than experiences of off-task and flow. For instance, for the last subactivity in logic (Lett. 2), flow occupied 38% of the interaction time, stuck 30% of the time, and off-task 32% of the time.

Figure 4 shows the estimated marginal means of each trend over the 11 subactivities of the three learning environments. The proportions of flow were significantly lower at the end of the trigonometry session (series 3) than at the beginning (series 1), and the proportions of stuck and off-task were significantly higher (a mean difference of 0.258 for off-task). The same pattern was observed from level 1 to level 2 in BDS (mean differences of 0.258 for stuck and 0.265 for off-task) as well as from Geo. 1 to Lett. 2 in logic (0.218 for stuck and 0.316 for off-task). Hence for each of the three environments, the further the learners got within the tasks, the more the occurrences of flow decreased and the occurrences of stuck and off-task increased. This pattern was inverted between the end of the trigonometry session (series 3) and the beginning of BDS (level 1): the incidence of flow increased, and the incidence of stuck and off-task decreased (−0.273 for off-task). The same variations were observed from BDS (level 2) to logic (Geo. 1): −0.250 for stuck and −0.280 for off-task. Besides, within the logic session, the occurrences of flow increased from Geo. 2 to Num. 1 and from Num. 2 to Lett. 1, that is, as a different type of material (numbers or letters) was presented, with a lower difficulty level. The difference was not significant for stuck and off-task.

To sum up, the experience trends accurately tracked the intended experimental design. At the beginning of their interactions, learners were more likely to experience flow. The occurrences of negative interactions (stuck and off-task) were also probable but with very low proportions. As the level of difficulty of the task increased (i.e., more complex tasks, imposed time constraints, scarcer hints, provoked bugs, etc.), the incidence rate of flow decreased significantly and both stuck and off-task behaviors were more likely to be experienced. Particularly towards the end of the sessions, stuck and off-task became more common: a negative interaction trend was as likely as or more likely than a positive interaction experience. Switching the learning environment (i.e., starting a new activity with a lower level of difficulty) reversed this pattern. That is, the incidence of stuck and off-task decreased and the state of flow became dominant again.

5.2. Emotional Expressions of the Experience Trends

Our next investigation was to analyze the learners’ emotional responses with regard to the states of flow, stuck, and off-task. More precisely, we were interested in answering the following questions. (1) Are there significant differences in terms of stress, confusion, boredom, and frustration depending on whether the learners’ interaction was optimal, problematic, or completely inhibited? (2) If so, is there a particular emotional pattern associated with each trend? That is, which emotion(s) could potentially characterize or contribute to each state, and how? (3) Did all the learners share the same pattern?

Three MANOVAs were conducted to test the relationship between the interaction experience trends and the emotional responses reported in each of the three environments. The dependent variable was the combined intensities of the four emotions (i.e., stress, confusion, boredom, and frustration), and the independent variable was the interaction trend (i.e., flow, stuck, and off-task). Each of the three MANOVAs was statistically significant, showing a significant interplay between the combined expressed emotions and the interaction trend: F(8, 1398) = 73.81, Pillai’s Trace = 0.59, partial η2 = 0.29 for trigonometry; F(8, 518) = 27.32, Pillai’s Trace = 0.59, partial η2 = 0.29 for BDS; and F(8, 1750) = 101.66, Pillai’s Trace = 0.63, partial η2 = 0.31 for logic. Hence the emotional responses do seem to significantly characterize the type of the interaction. Each emotion was then analyzed separately using distinct ANOVAs (4 emotions * 3 trends). The results were statistically significant for all the ANOVAs; a summary is given in Table 1. Bonferroni post hoc tests showed that the three trends were significantly different in terms of the four emotions. The state of flow was characterized by a low level (]0; 35]) of stress (between 17 and 25 across environments) and confusion (between 16 and 19) and a very low level of boredom (between 6 and 9) and frustration (between 10 and 15). The state of stuck was marked by a moderate level (]35; 65]) of stress (44 to 46), confusion (about 57), and frustration (36 to 50) and a low level of boredom (20 to 25). The off-task trend coincided with the highest level of stress (still moderate: 50 to 57 but more intense than for stuck), a high level (]65; 100]) of confusion (70 to 77), a moderate level of boredom (49 to 64), and a moderate to high level of frustration (about 58 to 68).

From these analyses, it can be said that there was not a unique emotion behind the nature of the interaction; rather, the four concurrent emotions (stress, confusion, boredom, and frustration) all seemed to contribute significantly to the expression of flow, stuck, and off-task. Overall (i.e., across all the participants), low stress and confusion seemed more likely to be associated with a positive trend of interaction. Frustration was also experienced, but to a much smaller degree, and boredom was practically absent during flow. The state of stuck was characterized by significantly higher levels of stress, confusion, frustration, and boredom. The off-task behavior was associated with the worst emotional responses (i.e., the highest levels of stress, confusion, boredom, and frustration). However, the case-by-case analysis showed that this pattern was not shared by all the study subjects.

Separate analyses were run for each participant. MANOVA results revealed a statistically significant effect of the experience trends for all the participants, but with different emotional reactions. Figure 5 depicts an example of three distinct patterns. For the first participant (a), a significant effect was found for the four emotions (F(2, 39) = 22.43 for stress, 23.18 for confusion, 20.56 for boredom, and 19.73 for frustration), showing that all four emotions significantly contribute to the expression of flow, stuck, and off-task. Bonferroni post hoc tests showed a statistically significant increase in the intensity of the four emotions from the state of flow to stuck and from the state of stuck to off-task, that is, the typical case discussed above. For the second subject (b), there were no significant differences in stress between the three types of interaction (F(2, 39) = 0.71, n.s.); as a matter of fact, this subject did not seem to experience much stress during the experiment (Max = 33.75). A significant contribution of boredom was found (F(2, 39) = 3.32), with the highest values (still a low level) for the off-task trend (M = 32.5 (10.35)), but there were no reliable differences between flow and stuck. Significant contributions of confusion (F(2, 39) = 7.30) and frustration (F(2, 39) = 6.85) were also found. But unlike the overall pattern, the highest values were associated with the state of stuck rather than off-task (M = 75 (9.36) and M = 71.67 (9.99), resp., for confusion and frustration). Besides, there were no significant differences between the states of flow and off-task in terms of confusion and frustration. For the third subject (c), a totally different pattern was found: there was no significant contribution of stress (F(2, 39) = 2.00, n.s.) or confusion (F(2, 39) = 1.11, n.s.) and a tendency towards significance for frustration (F(2, 39) = 3.12). The only significant effect was found for boredom (F(2, 39) = 17.64), with a low level in the off-task trend (M = 17.5 (2.80); values were close to zero for flow and stuck).

In summary, emotions do seem to be key indicators of a user’s learning experience. This relationship was shown to involve several emotions that may differ from one person to another, which confirmed our expectations about the person-specific nature of emotional expression. Indeed, the case-by-case analysis showed that the emotional responses associated with the states of flow, stuck, and off-task can be specific to each learner. Some learners can experience the same stress, confusion, or frustration whether they are immersed within a task or get stuck, and weaker reactions when they drop out. Besides, some subjects seemed to have a calmer temperament and showed little emotional activation, that is, no considerable emotional changes between a positive and a negative interaction. Different factors such as the learner’s goals, personality, or skills could intervene, so that not all learners react the same way as they are fully involved within a task, get stuck, or are about to give up; hence the importance of accounting for these individual differences in the assessment of learners’ experience.

5.3. Learners’ Interaction Experience Modeling

Our last objective was to implement and validate our framework for recognizing a learner’s interaction experience trend and emotions, based on the observable diagnostic features, the current context, and the learner’s characteristics. More precisely, given the macromodel described in Figure 1, the diagnostic component involved the following modalities: (1) physiological features including EEG_Engag, GSR, and HR, (2) behavioral variables including Help_req and Mouse_mvt, and (3) performance measures including Resp_time, answer, and accuracy. The context involved three variables, namely, the difficulty of the task being executed (task_diff), the presence/absence of hints or help (help_given), and time constraints (time_const). The profile involved the learner’s goal regarding the importance of performing the task (having the best score, learning new concepts, or just finishing the task), preference (e.g., whether the learner likes trigonometry or not), skill level, frequency of computer usage (computer_use), conscientiousness personality trait (perso_consc), and age. Although beyond the scope of this paper, it should be mentioned that these particular variables were selected because they showed statistically significant associations with the experience trends. For instance, no significant correlation was found for the gender variable, which was therefore not included in the model.
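
For concreteness, the instantiated variables can be grouped as plain data (a Python sketch; the shorthand names are those introduced above, and the grouping simply mirrors the predictive/diagnostic split of Figure 1):

```python
# Hypothetical summary of the instantiated network variables.
PREDICTIVE = {
    "context": ["task_diff", "help_given", "time_const"],
    "profile": ["goal", "preference", "skill",
                "computer_use", "perso_consc", "age"],
}
DIAGNOSTIC = {
    "physiology":  ["EEG_Engag", "GSR", "HR"],
    "behavior":    ["Help_req", "Mouse_mvt"],
    "performance": ["Resp_time", "answer", "accuracy"],
}
HIDDEN = ["stress", "confusion", "boredom", "frustration",
          "experience_trend"]
```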

Once the structure of the DBN had been defined, the next step was to train the model parameters that quantify the relationships between the connected nodes. These parameters are given by the a priori probabilities of each predictive node (e.g., P(skill) over the values “low,” “moderate,” and “high”), the conditional probability distribution of each node given the outcomes of its parents (e.g., P(experience trend ∣ goal) over the values “flow,” “stuck,” and “off-task” given each of the values “having the best score,” “learning new concepts,” and “finishing the task”), and the transition probabilities between the time slices (e.g., P(boredom(t) ∣ boredom(t − 1), experience trend(t)) over the values “interest,” “low,” “moderate,” and “high” boredom given the corresponding values at time t − 1 and the current parent’s outcomes “flow,” “stuck,” and “off-task”). We used an iterative approach, namely, the EM algorithm [76], to automatically train the model parameters from the collected data. Starting from a random parameter initialization, EM alternates between two steps. The E-step (Expectation) computes the likelihood of the completed data given the current parameter estimate and the observed data; unobserved data are filled in with their expected probability distributions. Such unobserved data included missing information such as corrupted readings due to sensor failure. The M-step (Maximization) updates the current parameters by maximizing the data likelihood, that is, computing the model parameters that best fit the data. The two steps are iterated until the parameters converge to a locally optimal solution. A 10-fold cross-validation technique was used to train the parameters and evaluate the model inference for categorizing both the interaction trends and the four concurrent emotions (i.e., stress, confusion, boredom, and frustration). The data set was divided into 10 subsets, where 9 subsets were used for training and the remaining subset was used for evaluation. The process was repeated 10 times and the accuracy estimates were averaged to yield the overall model inference accuracy reported in Table 2 (DBN). The accuracy results were compared to a static approach (i.e., without the temporal dependencies) using static Bayesian networks (SBN) as well as to three nonhierarchical static formalisms, namely, naive Bayes (NB) classifiers [77], decision trees (DT) [78], and support vector machines (SVM) [79].
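
The baseline comparison can be sketched with scikit-learn’s 10-fold cross-validation; the snippet below uses placeholder data standing in for the 1848 collected samples, and the classifier settings are our assumptions (the DBN/SBN models were trained separately with EM as described above):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: a flat feature matrix (predictive + diagnostic
# variables) and trend labels (0 = flow, 1 = stuck, 2 = off-task).
rng = np.random.default_rng(0)
X = rng.normal(size=(1848, 16))
y = rng.integers(0, 3, size=1848)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [("NB", GaussianNB()),
                  ("DT", DecisionTreeClassifier(random_state=0)),
                  ("SVM", make_pipeline(StandardScaler(), SVC()))]:
    scores = cross_val_score(clf, X, y, cv=cv)  # accuracy by default
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```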

As shown in Table 2, two test cases were considered. The first case (top) categorizes three outcomes for the interaction experience trend, namely, flow, stuck, and off-task, and four outcomes for each emotion (e.g., the target variable stress has the following possible outcomes: calm (no stress), low, moderate, or high stress). The second case (bottom) shows the accuracy of a binary categorization, where two outcomes are considered for both the experience trend and the emotion labels. Although there is a loss of information, this setting is intended to focus on two opposite behaviors in a learner’s experience and emotional responses. Thereby the experience trend is either positive/favorable (i.e., flow) or negative/unfavorable (i.e., stuck or off-task). In the same way, each emotion is either positive to low or moderate to high negative (e.g., calm to low stress vs. moderate to high stress). In both cases, DBN yielded the highest accuracy rates for experience trend and emotion recognition as compared to SBN, NB, DT, and SVM. For the first test case, an accuracy rate of 75.63% was achieved for assessing the experience trend, and accuracies ranging from 60.02% (confusion) to 79.95% (boredom) were achieved for discriminating between four levels of emotions (e.g., confidence, low confusion, moderate confusion, and high confusion). For the second (binary) case, an accuracy of 82.25% was reached for categorizing between a positive and a negative interaction, and accuracies ranging from 81.88% (confusion) to 90.97% (boredom) were achieved for discriminating between two outcomes for each emotion (e.g., confidence to low confusion vs. moderate to high confusion).

These results suggest that a learner’s interaction experience can be accurately inferred through probabilistic reasoning over the three modality measures (physiology, behavior, and performance) in conjunction with the context and person-dependent (profile) variables. The dynamic approach using a DBN outperformed the static approaches (SBN, NB, DT, and SVM), which do not track the temporal evolution of the learner’s states. Moreover, the nonhierarchical formalisms (NB, DT, and SVM) cannot distinguish the input variables according to their causal relationships with the learner’s states (the predictive variables of the interaction experience on one side and the diagnostic variables on the other): all features enter the three classifiers on an equal footing. In addition, these techniques recognize only the interaction experience trends. Unlike Bayesian networks (SBN and DBN), whose two-layered hierarchical structure makes the simultaneous inference of several target nodes possible, they do not allow a direct representation of several unknown classes at once.
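The contrast can be made concrete as follows: a flat formalism such as an SVM is trained on a single undifferentiated feature vector and requires one model per target variable, whereas a single query on the two-layered network returns all posteriors at once. The sketch below is illustrative only; it uses scikit-learn’s SVC, and the feature grouping is an assumption.

```python
from sklearn.svm import SVC

def flat_features(profile, context, physiology, behavior, performance):
    # predictive and diagnostic variables are concatenated into one vector;
    # their causal roles in the two-layered network are lost
    return [*profile, *context, *physiology, *behavior, *performance]

def train_flat_models(X, labels_by_target):
    """One independently trained SVM per target (trend plus four emotions)."""
    return {target: SVC().fit(X, y) for target, y in labels_by_target.items()}
```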

The underlying inference of a learner’s level of stress, confusion, boredom, and frustration through the DBN can serve as a dashboard for real-time adaptation: by continuously monitoring the learner’s state and assessing the potential cause of a favorable versus unfavorable interaction, an effective intervention can be undertaken. For instance, in case of a favorable interaction (i.e., a high probability of flow), the tutoring system would leave the learner free to go through the materials without interruption. Implicit interventions such as affective or cognitive priming can be made to enhance the interaction experience without interrupting the learner’s immersion (see [80] for more details). If the learner is about to get stuck (i.e., a high probability of stuck), an explicit intervention would be initiated, taking the learner’s emotional changes into account. For instance, in case of high boredom, a more challenging task could be proposed. In case of frustration, hints could be made available to the learner. In case of high stress, the time constraints could be alleviated, and, in case of confusion, a piece of advice or help could be proposed to guide the learner. Similarly, if the learner is about to give up (i.e., a high probability of the off-task trend), a different activity can be proposed with a varying level of challenge, constraints, or help, depending on the predominant emotional states. A minimal rule-based sketch of this logic is given below.
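The sketch assumes the DBN posteriors are available as dictionaries; the decision rules follow the interventions listed above, but the thresholds and action strings are ours, not the paper’s.

```python
# Minimal rule-based sketch of the adaptation logic; illustrative only.
def dominant(distribution):
    """Most probable outcome of a posterior, e.g. 'high' for boredom."""
    return max(distribution, key=distribution.get)

def intervene(trend_posterior, emotion_posteriors):
    # trend_posterior: e.g. {"flow": 0.1, "stuck": 0.7, "off-task": 0.2}
    # emotion_posteriors: per-emotion level distributions, e.g.
    #   {"boredom": {"interest": 0.1, "low": 0.2, "moderate": 0.1, "high": 0.6}, ...}
    trend = dominant(trend_posterior)
    if trend == "flow":
        # no interruption; at most implicit affective/cognitive priming
        return "continue (optional implicit priming)"
    if trend == "stuck":
        if dominant(emotion_posteriors["boredom"]) == "high":
            return "propose a more challenging task"
        if dominant(emotion_posteriors["frustration"]) in ("moderate", "high"):
            return "make hints available"
        if dominant(emotion_posteriors["stress"]) == "high":
            return "alleviate the time constraints"
        if dominant(emotion_posteriors["confusion"]) in ("moderate", "high"):
            return "offer advice or guided help"
        return "explicit intervention (default)"
    # off-task: vary challenge, constraints, or help with a new activity
    return "propose a different activity adapted to the dominant emotions"
```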

Figure 6 depicts such an example, where a learner’s emotional responses and interaction experience trend are inferred using the trained DBN as new evidence is introduced into the model (predictive and diagnostic nodes). The task at hand is the last trigonometric problem in series 3. The predictive variables are given by the current context (a high level of difficulty, no help provided, and a time constraint imposed), the learner’s current goal (finishing the activity), and other characteristics (less than 30 years old, conscientious, low computer usage, low skills, and a dislike of trigonometry). The diagnostic evidence is given by the learner’s physiological activity (cerebral: low EEG engagement; dermal: moderate GSR; cardiac: low HR), behavioral variables (no help request and a low mouse movement rate), and performance measures (a high response time, no answer to the given problem, and a low accuracy rate). The inference yields the following outcomes: a low level of stress (with a probability P = 58%), moderate confusion (P = 41%), a high level of boredom (P = 58%), and low frustration (P = 60%). The predominant inferred experience trend is an off-task behavior (P = 77%). In this case, the system could, for instance, interrupt the learner to propose a break and change the type of activity to a more challenging task, as a state of high boredom is detected together with a high probability of giving up.
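The query of Figure 6 can be reconstructed schematically as follows. The node names and value encodings are assumptions made for illustration, and `dbn.infer` stands in for the actual inference call of the SMILE engine used in the paper.

```python
# Hypothetical reconstruction of the Figure 6 query.
evidence = {
    # predictive nodes: context, goal, and profile
    "difficulty": "high", "help_provided": "no", "time_constraint": "yes",
    "goal": "finishing the task",
    "age": "<30", "personality": "conscientious", "computer_usage": "low",
    "skill": "low", "likes_trigonometry": "no",
    # diagnostic nodes: physiology, behavior, and performance
    "eeg_engagement": "low", "gsr": "moderate", "hr": "low",
    "help_request": "no", "mouse_movement_rate": "low",
    "response_time": "high", "answer": "none", "accuracy": "low",
}
posteriors = dbn.infer(evidence)  # hypothetical call on the trained model
# Expected outcome per Figure 6: P(off-task) = 0.77, P(stress = low) = 0.58,
# P(confusion = moderate) = 0.41, P(boredom = high) = 0.58,
# P(frustration = low) = 0.60
```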

6. Conclusion

In this paper we described a hierarchical probabilistic framework to model the user’s experience while interacting with a computer-based learning environment. The framework uses a dynamic Bayesian network to recognize three trends of the interaction experience, namely, flow or the optimal interaction (a total involvement within the task), stuck or the nonoptimal interaction (a difficulty to maintain focused attention), and off-task or the noninteraction (a dropout from the task), as well as the emotional responses occurring subsequently. The network integrates three modality measurements to diagnose the learner’s experience, namely, physiology, behavior, and performance, predictive variables including contextual features and the learner’s personal characteristics (profile), and a dynamic structure to track the temporal changes of the learner’s state. An experimental protocol was conducted in which 44 participants performed different cognitive tasks (trigonometry, backward digit span, and logic) with a gradually increasing difficulty level to provoke the three targeted trends and analyze their relationship with the reported emotional responses. Three biofeedback devices were used to record participants’ physiological activities, including skin conductance, heart rate, and EEG engagement. Behavioral variables included help use and mouse movement rate, and performance measures included response time, answer, and accuracy.

The statistical analysis supported our hypothesis about the complexity of the relationship between emotions and learners’ experiences. Results showed that concurrent emotional responses can be associated with the experiences of flow, stuck, and off-task and that the same trend could be expressed with different emotional patterns for different participants, confirming the importance of accounting for overlapping emotional changes and individual differences in the assessment of the learners’ interaction experience. The evaluation of the proposed framework demonstrated its capability to efficiently assess the probability of experiencing flow, stuck, and off-task, as well as the emotional responses associated with each trend. The experimental results showed that our framework outperformed conventional nondynamic modeling approaches using static Bayesian networks, as well as three nonhierarchical formalisms, namely, naive Bayes classifiers, decision trees, and support vector machines. An accuracy rate of 82% was reached in characterizing a positive versus a negative experience, and accuracies ranging from 81% to 90% were achieved in assessing the four emotions related to the interaction, namely, stress, confusion, frustration, and boredom.

Our findings have implications for intelligent tutoring systems in particular, and for human-computer applications more generally, that seek a precise monitoring of the user’s state by simultaneously identifying the concurrent emotional responses occurring during the interaction and the trend that characterizes the experience within the task. As our next steps, we plan to enhance the proposed framework with a decision-theoretic formalism and to incorporate it within a real-time interactive tutoring system, so that timely interventions can be formulated on the basis of the user’s inferred state. Further diagnostic variables will be included within the model to track additional features of the user’s experience, such as keyboard interaction patterns and facial expressions. A cognitive component will also be added to monitor the learner’s skill acquisition process, including the history of the presented concepts and the practiced skills, in order to optimally adapt the pedagogical content and strategies to the learner’s state.

Conflict of Interests

There is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). The models described in this paper were implemented using SMILE, a reasoning engine for graphical probabilistic models, and the GeNIe modeling environment, both developed at the Decision Systems Laboratory, University of Pittsburgh (http://genie.sis.pitt.edu/).