Abstract

The COVID-19 pandemic heavily influenced human life by constricting human social activity. As the pandemic spread, people had no choice but to change their lifestyles. Education changed considerably, and schools began hosting online classes as an alternative to face-to-face classes. However, concentration levels are lower in online classes, and students’ learning rates decrease. We devise a framework for recognizing and estimating students’ concentration levels to help lecturers. Previous studies are limited in that they classified attention levels using only discrete states. Because discrete states carry only partial information, the concentration levels could not be recognized well. This research aims to estimate more subtle levels as specified states by using a minimum amount of body movement data. A deep neural network is used to continuously recognize human concentration levels, and the levels are then predicted and estimated by a Kalman filter. Using our framework, we successfully extracted the concentration levels, which can aid lecturers and can be expanded to other areas. To implement the framework, we recruited participants to take online classes. Data were collected and preprocessed using pose points, and the framework predicted the concentration level with an accuracy of 90.62%. Furthermore, the concentration level was approximated based on the Kalman filter. We found that webcams can be used to quantitatively measure student concentration during online classes. Our framework helps instructors measure concentration levels, which can increase learning efficiency. As future work, if emotion data and skin thermal data are comprehensively considered, a student’s concentration level can be measured more precisely.

1. Introduction

Since its outbreak in December 2019, the coronavirus has spread worldwide and caused much confusion in society [1]. It is causing chaos in many parts of society and has a great impact on daily life. Education is one of the most affected sectors, as the coronavirus has persisted without any signs of improvement [2]. Most classes in elementary school, middle school, high school, and university are now conducted online. In many schools that were not sufficiently prepared for online learning, educational effectiveness is declining because of insufficient technical preparation and lack of operational experience [3].

Online learning is classified into synchronous distance education and asynchronous distance education [4]. In synchronous distance education, lectures are conducted in real time using tools such as Zoom or Google Meet [5]. In asynchronous distance education, instructors upload recorded videos to the system, and students take the course at a time of their choosing. Because synchronous distance education is delivered in real time, if students turn on their cameras and show their faces, it is possible to determine the minimum level of participation in the class. For asynchronous distance education, many universities develop and use learning management systems such as Moodle and Blackboard [6]. With these systems, student participation can be roughly determined by calculating the learning rate.

Although students participate in a class, they might not be focused on its content. Xu and Yang found that when students learn online, the dropout rate can go up to 95% because they want to use their time for purposes other than education [7]. Even if students participate in synchronous distance education, the instructor cannot correctly determine their concentration because students may do other things or turn off the camera while attending class. Even if the learning rate is calculated using multiple learning systems, asynchronous distance education has many weaknesses. Students start the class and then engage in other activities, or they exploit the system’s vulnerabilities to adjust the speed of lectures and finish them faster [8].

Therefore, appropriate measures should be taken by determining the students’ concentration in class. Typically, lecturers have determined students’ concentration levels based on their own experiences in online learning. For example, they would make inferences about whether students were concentrating on a lecture or not through various visual cues, such as the focus of students’ eyes or their body movements during interactions. However, according to a study by Erol and Tekdal, when it comes to distance education, teachers currently do not have sufficient resources to supervise and evaluate students [9]. Therefore, an automated method for determining students’ concentration levels is needed.

There have been many attempts to measure students’ concentration levels using various methods, such as taking skin temperature [10], recognizing visual attention and students’ emotions [11], and detecting electroencephalogram (EEG) signals [12–14]. However, these methods often do not work well in online classes because teachers cannot promptly interact with each student. In addition, these attempts lack detail because the concentration levels are classified as discrete states [15]. Because students’ concentration levels change continuously, finer-grained information may aid lecturers.

Here, we develop a new framework that consists of a concentration level recognition network (CLRN) and a Kalman filter (KF) [16] to overcome the limitations of existing methods. The CLRN is based on supervised learning and is trained with the standard deviations of designated pose points of a human body together with classified labels. The CLRN provides the concentration levels as the probability of “high concentration.” The concentration levels can be obtained by the CLRN simultaneously, and the KF identifies the patterns in the fluctuating levels. In addition, a future concentration level can be estimated by applying the KF. Ultimately, the concentration levels can be quantified by the CLRN with the KF, which can help lecturers better understand the concentration levels of their students.

We implemented this framework in practice using videos of participants taking online lectures. First, the standard deviations of the pose points were extracted as a preprocessing step. Then, the CLRN was constructed, and a loss function was defined. Based on this, the concentration level of the participants was predicted, and an accuracy of 90.62% was obtained. Moreover, the concentration level was finalized by smoothing and approximating the result with the KF. The abbreviations used in this paper are summarized in Table 1.

The motivation for our study is that a person’s body movements can be a factor in recognizing their condition. Extracting the status of a human being, such as their emotional state, from body movements is an interesting research topic that has recently become more important [17]. Several studies have extracted meaningful factors from the movements of individuals. They have examined whether students are concentrating on a lecture by checking various visual cues, such as the focus of the eyes or body movements of the students [18]. Generally, eye movement is a strong indicator for estimating the degree of concentration [19]. Research has demonstrated that concentration is amplified when the eye movements of participants maintain a central fixation [20]. Body movement has also been researched; for example, a model using the joints of the human body can estimate the pose of individuals [21]. Kinetic movement of an individual’s body has been used to assess a patient’s recovery process [22]. Previous studies have extracted high-value features from dynamic movements such as dance and aerobics [23, 24]. The pose of individuals has also been researched using video data, which can then be used to present a visual flow of poses [25]. Furthermore, emotion has been recognized from body movement via machine learning, and public data sets are available [26].

The studies mentioned above suggest that the relationship between the movement of individuals and effect is very close, and the relationship should be examined via a bidirectional rather than a unidirectional cause-effect approach [27]. Research to classify the various states of individuals by human body posture has been conducted; however, only binary states have been suggested as results [15]. Similar to previous research, we propose that the standard deviations of designated physical points comprise a core factor in measuring concentration levels. We use OpenPose as a backbone package; it is a well-known tool for analyzing body movements by detecting designated points of a human body. Several studies have used OpenPose; for example, sign languages were recognized by a transfer learning algorithm that utilized OpenPose [28]. When people are focused on a subject, the standard deviations of their movements become lower because they make fewer unnecessary movements. Therefore, we designed the CLRN based on deep learning to find the subtle changes in the standard deviations. There has been similar research to recognize human states, such as emotion [29], via deep learning. However, the approach is limited in the sense that it does not provide simultaneous results. To address this problem, we apply a KF to deal with continuous and simultaneous data.

3. Proposed Framework

Figure 1 shows the overview of our framework. In the first step of the framework, the student’s video data recorded by a webcam is preprocessed. The preprocessed data are labeled by the two states based on the participants’ self-reported intent. The labeled data are used for training the CLRN in the recognition step. The CLRN is devised with supervised learning for binary classification, and the data are prepared with a binary class (the data are labeled as zero or one).

The trained CLRN recognizes the continuous concentration levels, which are defined as the recognition levels. The KF is used for smoothing and filtering the highly fluctuating recognition levels. In the estimation step, the KF provides an approximation of the concentration levels, which are called the estimation levels.

Our method to recognize and estimate human concentration levels consists of three steps: preprocessing, recognition, and estimation.

3.1. Step 1: Preprocessing

The first step of our framework is to extract the standard deviations of the pose points from the video data. The standard deviations of the X and Y coordinates are calculated for the top and middle parts, respectively. Note that we assume the standard deviations of the points are the core factor in measuring the concentration levels. Table 2 shows the notations of the results in the preprocessing step.

Algorithm 1 shows the process of the preprocessing step. The standard deviations are obtained through this algorithm and become the input data for the CLRN, which is discussed in the following section.

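Algorithm 1: Preprocessing.
Input: top.X, top.Y, mid.X, mid.Y
for each 50-frame window do
  for each of top.X, top.Y, mid.X, mid.Y do
    compute the standard deviation of the coordinates in the window
  end for
end for
Output: the standard deviations of top.X, top.Y, mid.X, and mid.Y for each window
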
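
The preprocessing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the pose points are already available as per-frame lists of ten normalized (x, y) pairs, with indices 0–4 for the top part and 5–9 for the middle part, and that each window holds 50 frames. Averaging the per-point standard deviations within each part is also an assumption, and the names `part_std` and `window_features` are hypothetical.

```python
from statistics import mean, pstdev

TOP = range(0, 5)    # pose-point indices of the top part
MID = range(5, 10)   # pose-point indices of the middle part
WINDOW = 50          # frames per window


def part_std(frames, indices, axis):
    """Average, over the points of one part, of each point's std along one axis."""
    return mean([pstdev([frame[i][axis] for frame in frames]) for i in indices])


def window_features(frames, window=WINDOW):
    """Return one [top.X, top.Y, mid.X, mid.Y] std vector per full window."""
    features = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        features.append([
            part_std(chunk, TOP, 0), part_std(chunk, TOP, 1),
            part_std(chunk, MID, 0), part_std(chunk, MID, 1),
        ])
    return features


# Synthetic example: a subject who sways horizontally but not vertically.
swaying = [[(0.5 + 0.1 * (t % 2), 0.5)] * 10 for t in range(100)]
features = window_features(swaying)
```

Each returned vector has four entries, which matches the four inputs listed in Algorithm 1; frames left over after the last full window are dropped in this sketch.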
3.2. Step 2: Recognition

Algorithm 2: Recognition.
Input: the standard deviations obtained in the preprocessing step
Output: S

Algorithm 2 shows the overall structure of the recognition step. Through the CLRN, the recognition levels are obtained. The CLRN consists of four layers: one input layer, two hidden layers, and one output layer. The role of the hidden layers is to find the hidden features in the data. Using more than two hidden layers did not improve the performance of the framework. The rectified linear unit (ReLU) is used as the activation function in both hidden layers. A sigmoid is applied at the output so that the output is a probability between zero and one. The ADAptive Moment estimation (ADAM) optimizer [30] is applied, and the initial learning rate is set to 0.1, which is the optimal value for the CLRN. The binary cross-entropy loss is chosen as the loss function of the CLRN and is defined as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right)\right],$$

where the number of data items is $N$, the labels are $y_i$, and the prediction values from our deep neural network (DNN) are $\hat{y}_i$. Note that the values of $y_i$ are obtained from the participants’ self-reported intent. As the output value is a probability for binary classification, the binary cross-entropy is an appropriate loss for determining continuous concentration levels.
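
The structure described above can be illustrated with a small forward pass and loss computation. This is a sketch under assumptions rather than the trained CLRN: the hidden width of 8, the random initial weights, and the helper names (`dense`, `init_layer`, `clrn_forward`, `binary_cross_entropy`) are hypothetical, and training with the ADAM optimizer is omitted.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(values, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Hypothetical initialization: small uniform weights, zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def clrn_forward(features, params):
    """Input -> hidden (ReLU) -> hidden (ReLU) -> sigmoid output in (0, 1)."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(dense(features, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return sigmoid(dense(h2, w3, b3)[0])


def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Mean binary cross-entropy over N samples; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


rng = random.Random(0)
params = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
# Four inputs: the standard deviations of top.X, top.Y, mid.X, and mid.Y.
level = clrn_forward([0.01, 0.02, 0.03, 0.01], params)
```

The sigmoid output is what the text calls the probability of “high concentration”; with untrained weights, as here, its value carries no meaning beyond lying strictly between zero and one.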

3.3. Step 3: Estimation

Algorithm 3: Estimation.
Input: the recognition levels
for all recognition levels do
  /* Predicting */
  /* Updating */
  /* Estimating */
end for
/* Analyzing */
Output: the estimation levels

The estimation step applies a KF to the recognition levels to establish the estimation levels. Algorithm 3 shows the overall process. There are three states in the algorithm: the prediction state, the estimation state, and the measurement state. The error covariance matrix and the transition weight matrix are also defined. In the predicting step, an external noise matrix is also used, and lecturers can modify these matrices. In the ideal case, these matrices are set to fixed values.

In the updating step, the Kalman gain is obtained at each update. A scale matrix is also used, which is set to a fixed value to simplify the problem. The estimation state and the error covariance matrix are then updated with the Kalman gain. Finally, in the estimating step, the next estimated state is recurrently updated.
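
The predicting, updating, and estimating steps can be sketched with a scalar Kalman filter. This is an illustration under assumptions, not the configuration used in the paper: the state is treated as a single number, and the initial state `x0`, initial covariance `p0`, transition weight `a`, process noise `q`, scale `h`, and measurement noise `r` are all hypothetical values chosen for the example.

```python
def kalman_smooth(measurements, x0=0.5, p0=1.0, a=1.0, q=1e-3, h=1.0, r=0.05):
    """Scalar Kalman filter: returns one estimated level per measurement."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predicting: propagate the state and its error covariance.
        x_pred = a * x
        p_pred = a * p * a + q
        # Updating: compute the Kalman gain from the predicted covariance.
        k = p_pred * h / (h * p_pred * h + r)
        # Estimating: correct the prediction with the measurement residual.
        x = x_pred + k * (z - h * x_pred)
        p = (1.0 - k * h) * p_pred
        estimates.append(x)
    return estimates


# Hypothetical, widely fluctuating recognition levels.
noisy_levels = [0.9, 0.2, 0.8, 0.3, 0.85, 0.25, 0.8, 0.3]
smoothed = kalman_smooth(noisy_levels)
```

A larger `r` relative to `q` makes the filter trust the measurements less, so the estimated levels follow the fluctuating recognition levels more slowly.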

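We assume that each distribution of the levels can be decomposed into two dominant levels with a certain function. The function is a bimodal distribution $f(x)$, which is written as

$$f(x) = \frac{1}{\sigma_1\sqrt{2\pi}}\exp\left(-\frac{(x-\mu_1)^2}{2\sigma_1^2}\right) + \frac{1}{\sigma_2\sqrt{2\pi}}\exp\left(-\frac{(x-\mu_2)^2}{2\sigma_2^2}\right),$$

where $\sigma_1$ and $\sigma_2$ are the standard deviations, $\mu_1$ and $\mu_2$ are the mean values, and $x$ is the input data.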
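
As a sketch of this decomposition, the bimodal function can be evaluated in Python, assuming it is the sum of two Gaussian components parameterized only by the two means and two standard deviations named in the text. The exact normalization used in the paper is not recoverable from the text, so the form below is an assumption, and the parameter values in the example are hypothetical.

```python
import math


def gaussian(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def bimodal(x, mu1, sigma1, mu2, sigma2):
    """Sum of two Gaussian components (assumed form of the bimodal function)."""
    return gaussian(x, mu1, sigma1) + gaussian(x, mu2, sigma2)


# Hypothetical parameters: one low and one high dominant level.
MU1, SIGMA1, MU2, SIGMA2 = 0.2, 0.05, 0.8, 0.05
curve = [bimodal(i / 100, MU1, SIGMA1, MU2, SIGMA2) for i in range(101)]
```

With well-separated means, the curve has one peak near each mean and a trough between them, which is the shape the two dominant levels are meant to capture.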

4. Implementation

Three participants were recruited for this experiment. Each participant was recorded while viewing an online lecture and was required to mark the times when they were concentrating on the lecture. This work involved human subjects in its research. Approval of all ethical and experimental procedures and protocols was granted by the Institutional Review Board of the Korea University Center for Gifted Education.

A webcam was used to record the 25-fps video data. For recording video data for the distraction (nonconcentration) case, the participants also marked the times when they were distracted.

The data from the three participants are merged into one dataset because estimating the levels for each participant separately could be biased by individual characteristics. Moreover, we expect to find general properties of concentration levels by using the merged data with our models. The merged dataset is labeled as two cases based on the markers of the participants. In total, 12 hours of video data (consisting of 1M images) were recorded. A total of eight hours of data was marked for the concentration case; the other four hours of data were taken for the distraction case.

For step 1 (preprocessing), certain pose points in the images are detected to measure the distribution of participant poses. Ten points of the human body are measured every 50 frames, which are classified as the top part (0–4) and the middle part (5–9). Figure 2 shows the points, and the coordinate data of the points range from zero to one. To detect the points, OpenPose [31–34] is used. OpenPose [31] is a recent open-source package for detecting the keypoints of human poses. OpenPose is a real-time system for body, foot, hand, and facial keypoint detection and is an appropriate package for continuously detecting these points. In our case, we only used the upper body of individuals as captured in the video data.

We then check the distributions of the pose points when the individuals were concentrating or not, as shown in Figure 3. The distribution in Figure 3(a) shows that the entries are gathered more closely around the body points, while those in Figure 3(b) are spread more widely. The difference is visually noticeable in this example, but it cannot be easily quantified to identify the concentration levels. In the preprocessing step, the input data are already divided into 50 frames, so the input data for the CLRN are not separated into minibatches.

For step 2 (recognition), the CLRN performs the task of predicting what participants marked while viewing the online lecture. K-fold cross-validation is applied to compensate for the limited amount of data. The accuracy of 5-fold training ranged from 85% to 95% with a median of 90.62%.
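
The 5-fold procedure can be sketched as follows. This is a generic illustration of k-fold cross-validation, not the authors' evaluation code: the fold construction (contiguous, unshuffled folds) is an assumption, and `k_fold_indices`, `cross_validate`, and the caller-supplied `evaluate` function are hypothetical names.

```python
from statistics import median


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(n_samples, evaluate, k=5):
    """Return the per-fold scores and their median.

    `evaluate(train, test)` is supplied by the caller; in practice it would
    train a model on `train` and return its accuracy on `test`.
    """
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return scores, median(scores)


# Placeholder scorer that only reports the test-fold fraction, to show the flow.
scores, mid_score = cross_validate(10, lambda train, test: len(test) / 10)
```

Every sample appears in exactly one test fold, so each of the five models is evaluated on data it was not trained on, and the median summarizes the five accuracies.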

Figure 4 shows the differences in the standard deviations among the groups. Nevertheless, there remain unexplained aspects, such as ambiguous patterns, whose correlation with the concentration levels is unclear. For this reason, neural networks are applied, as they are an appropriate method for obtaining nonlinear combinations of features. This allows us to identify hidden features that we cannot otherwise describe.

For step 3 (estimation), the trained CLRN recognizes the continuous concentration level, which is then smoothed and filtered using the Kalman filter and finally approximated. The students’ state starts from an assumed initial concentration level. The system error comes from the DNN, which was described in Section 3.

Figure 5 shows the estimation and measurement results for 2.5-second intervals. It indicates that the students maintained their concentration levels and that there were no external disturbances while they watched the lectures. Even though the measurements fluctuate widely every 2.5 seconds, the KF enables users to track the levels smoothly; the levels are shown as green and red dots, indicating the low and high concentration levels, respectively.

Figure 6 shows the distribution of the levels. The two parameters of the bimodal fit are obtained as 0.09 and 0.16, respectively, which indicates that the students are entangled in two concentration levels.

5. Conclusion and Future Work

Many schools have been semi-compelled to adopt distance education due to the coronavirus. However, distance education is economical in terms of cost and can educate many students simultaneously. Furthermore, with distance education, cooperative learning can be performed in an interactive learning environment, and since home-based classes are possible, the time and effort of commuting to school are reduced. If the major disadvantage of distance education, namely low concentration levels, can be overcome through this study, more effective classes will be possible. We address this problem by developing a novel framework consisting of a concentration level recognition network and a Kalman filter. We devised the model to aid lecturers in estimating students’ concentration levels using webcams as part of online classes. Our system presents the level every 2.5 seconds with 90.62% accuracy and estimates the next level of concentration by using the KF. In contrast to previous research, such as VGG16 [35], our model takes a different approach to quantifying the levels by capturing the variance of the detected pose points on individuals in the current state. Additionally, we estimate and track the level for the next time window. Our model offers a practical tool to monitor the level more precisely and aid lecturers in estimating the level. Academically, our model applies a novel approach to analyzing complex human states, in this specific case, concentration level. As future work, we plan to use not just body movement data but also emotion data [36] and skin thermal data [10, 37] to enhance the prediction of human concentration levels. We will combine the measuring method used here with conventional techniques using deep learning. We expect this work to provide helpful information on students’ concentration levels and thus assist lecturers.

Data Availability

No data are available because of privacy issues.

Disclosure

An earlier version of this manuscript was posted as a preprint on arXiv [38], and several students participated in the earlier version.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to show their gratitude to Jakyung Koo, Nokyung Park, and Pilgu Kang at Korea University for their assistance in this research, which greatly improved the manuscript. Even though they are not included in authorship in this paper, their broader assistance lays the foundation of our research.