#### Abstract

With the rapid development of big data technology, online courses are becoming an increasingly important means of education. To objectively evaluate the teaching quality of online classrooms, a teaching quality evaluation system based on facial feature recognition is proposed. An improved multitask cascaded convolutional neural network (MTCNN) is used to determine the face region, and the eye and mouth regions are then located according to facial proportion relationships. A lightweight AlexNet classifier based on the Ghost module detects the open and closed states of the eyes and mouth, and these states are combined with the PERCLOS (percentage of eye closure) index to achieve fatigue detection. Pose estimation over a wide range of pitch, yaw, and roll angles is achieved using easily located facial feature points. Finally, the fuzzy comprehensive evaluation method is used to evaluate students’ learning concentration. Simulation experiments show that the proposed system can objectively evaluate the teaching quality of online courses according to students’ facial feature recognition.

#### 1. Introduction

Online education, as a new education model, has gradually been welcomed by students and parents because of its openness and diversity [1]. The online education market has developed vigorously, but numerous shortcomings remain in actual use. The biggest deficiency is that teachers cannot supervise students effectively [2]. In traditional teaching, teachers and students share a limited space, and the podium offers a good vantage point over the whole classroom: with just a glance, a teacher can take in the whole room and supervise students effectively. In online education, these spatial limits are removed. Most of the teacher’s attention goes to the lecture, and what remains is not enough to keep track of every student’s state, let alone to supervise them. For students, online education offers many choices and flexible hours, but it demands a high degree of self-discipline. Statistical data from some online education platforms show that learners in online classrooms often have poor autonomy, which results in high enrolment rates but low pass rates across all kinds of courses [3, 4]. During the COVID-19 pandemic in 2020, some parents reported that their children learned very little from online classes at home. This not only causes many students to waste valuable learning time but also undermines teachers’ control over teaching progress and quality.

The COVID-19 pandemic prompted the first large-scale use of online education as a replacement for traditional education within a short period of time, although online education cannot yet completely replace traditional education [5]. With the emergence and application of 5G technology, online education has gradually become an important trend that complements traditional education. Because teachers in online education cannot effectively supervise students, developing a teaching quality evaluation system for online classrooms has become an important task.

Online education is characterized by a loose structure and an open, distance teaching environment, and evaluating the online learning effect is an important part of it. These characteristics make it very difficult to evaluate the teaching quality of a course objectively [6]. Literature [7] discusses how to improve learners’ learning effect and efficiency through positive emotion in online education. Literature [8] analyses how teachers channel and dissolve students’ negative emotions through positive emotions, so as to improve students’ learning enthusiasm. Literature [9] builds an emotional interaction model of a social learning network, which can identify learners’ emotional states in online education. The emotional states fall into six categories: pleasure, pain, tension, calm, surprise, and disgust. Literature [10] designs a method for constructing teaching feedback strategies based on an emotion learning ontology. In the current research environment, quantitative analysis of student behaviour is an important part of online education. Based on a hybrid teaching mode that combines online and offline teaching, paper [11] established online and offline evaluation systems. These evaluation systems focus on the learning process, and teachers can intervene according to early warnings.

This paper presents an online classroom teaching quality evaluation system based on facial feature recognition. The innovations and contributions of this paper are listed below.

(1) Face regions are determined by an improved multitask cascaded convolutional neural network, and the eye and mouth regions are then located according to facial proportion relationships. A lightweight AlexNet classifier based on the Ghost module detects the open and closed states of the eyes and mouth, and these states are combined with the PERCLOS index to achieve fatigue detection.

(2) Face pose estimation over a wide range of pitch, yaw, and roll angles is completed using easily located facial feature points.

(3) The fuzzy comprehensive evaluation method is used to quantitatively evaluate students’ learning concentration.

The simulation results show that the system can effectively evaluate students’ concentration in class according to the results of face detection, so as to evaluate the teaching quality of online education.

The remainder of this paper is organized as follows. Section 2 presents the proposed model. Section 3 describes the experiment and analysis. Section 4 concludes the paper.

#### 2. The Proposed Model in This Paper

##### 2.1. Fatigue Detection Model

Fatigue detection based on facial features usually requires locating the face, the eyes, and the mouth and detecting the open and closed states of the eyes and mouth. To achieve these functions, this paper uses an improved MTCNN (Multitask Cascaded Convolutional Networks) to complete face detection and key point positioning. An improved AlexNet is then used to identify the open and closed states of the eyes and mouth. Finally, fatigue is judged according to the PERCLOS and PMOT (percentage of mouth-open time) parameters. The fatigue detection process is shown in Figure 1.

###### 2.1.1. Key Point Positioning Based on Improved MTCNN

MTCNN is a face detection and alignment algorithm based on deep learning, which uses an image pyramid to detect faces at various scales [12]. Unlike multitarget face detection, fatigue detection only needs to locate a single face region precisely. Therefore, the image pyramid part of the MTCNN network is improved to achieve fast and accurate face detection and key point location. After the key points are located, the eye and mouth regions are obtained using the “three courtyards and five eyes” proportion relationship of the human face. “Three courtyards” refers to the proportion along the length of the face, which is divided into three equal parts. “Five eyes” refers to the proportion along the width of the face, which is divided into five equal parts with the length of one eye as the unit.

MTCNN consists of the Proposal Network (P-Net), the Refine Network (R-Net), and the Output Network (O-Net). It follows the idea of candidate boxes plus classifiers to detect faces quickly and efficiently. To detect faces of different scales, MTCNN shrinks the original image to different sizes to generate an image pyramid. The original image is reduced by a fixed scale, and the image is traversed with a candidate box of fixed size 12 pixels × 12 pixels. This is repeated until the height or width of the reduced image is smaller than the side length of the candidate box. Each traversal yields patches of 12 pixels × 12 pixels with 3 channels as the input of P-Net. P-Net uses 3 × 3 convolution kernels to extract image features, and max pooling removes redundancy. Face classification determines whether a region is a face, and bounding box regression and facial landmark localization give an initial location of the face region. These candidate regions are fed into R-Net, which repeats the operations of P-Net. Most of the interference is filtered out; the reliable face regions are retained and input into O-Net. P-Net is fully convolutional, with an output of size 1 pixel × 1 pixel and 32 channels, whereas R-Net uses a 128-dimensional fully connected layer after the last convolutional layer, which preserves more image features. O-Net performs more refined face discrimination, bounding box regression, and key point positioning and finally outputs the coordinates of the face region and five feature points: the left eye midpoint, the right eye midpoint, the nose tip, the left corner of the mouth, and the right corner of the mouth.

The image pyramid constructed by MTCNN contains a large number of images, and sending all of them through the network to detect every face in the image takes a lot of time. In the fatigue detection scene, it is necessary to eliminate the interference of redundant faces and locate the subject’s face accurately. Because the subject’s face occupies a large proportion of the pixels in each video frame, this paper enlarges the minimum face size in the MTCNN network according to the proportion of the face in the image. The fixed reduction factor is decreased, and the first reduction and traversal pass is skipped. This allows the improved MTCNN network to locate the facial area accurately while filtering out irrelevant faces in the background and greatly reducing the face detection time.
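The effect of enlarging the minimum face size can be illustrated with a minimal sketch of how an MTCNN-style pyramid is usually built. The function name `pyramid_scales`, the reduction factor 0.709, and the default minimum face size of 20 pixels are assumptions taken from common MTCNN implementations, not values stated in this paper.

```python
def pyramid_scales(width, height, min_face_size=20, factor=0.709, cell=12):
    """Return the scale factors of an MTCNN-style image pyramid.

    min_face_size and factor are assumed defaults, not values from the paper.
    """
    scales = []
    # The first scale maps the smallest face of interest onto the 12 x 12 box.
    scale = cell / min_face_size
    min_side = min(width, height) * scale
    # Stop once the reduced image is smaller than the 12 x 12 candidate box.
    while min_side >= cell:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


default = pyramid_scales(640, 480, min_face_size=20)
enlarged = pyramid_scales(640, 480, min_face_size=120)
print(len(default), len(enlarged))
```

A larger minimum face size starts the pyramid at a smaller image and produces fewer levels, which is why fewer candidate boxes have to be evaluated and small background faces are ignored.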

Let the coordinates of the left eye and the right eye obtained by MTCNN be $(x_1, y_1)$ and $(x_2, y_2)$, respectively. The eye region is extracted according to the facial proportion relationship. With $d_1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ denoting the distance between the eyes, the width $w_1$ and the height $h_1$ of the eye region are set in fixed proportion to $d_1$.

Let the coordinates of the left and right corners of the mouth obtained by MTCNN be $(x_3, y_3)$ and $(x_4, y_4)$, respectively. The mouth region is extracted according to the facial proportion relationship. With $d_2 = \sqrt{(x_3 - x_4)^2 + (y_3 - y_4)^2}$ denoting the distance between the left and right corners of the mouth, the width $w_2$ and the height $h_2$ of the mouth region of interest are set in fixed proportion to $d_2$.
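The region extraction can be sketched as follows. This is an illustration only: the helper names and all four proportion constants (`eye_w`, `eye_h`, `mouth_w`, `mouth_h`) are hypothetical placeholders, since the exact proportions are not recoverable from the text.

```python
import math


def box_around(center, width, height):
    """Axis-aligned box (x_min, y_min, x_max, y_max) centred on a point."""
    cx, cy = center
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def eye_and_mouth_rois(left_eye, right_eye, mouth_left, mouth_right,
                       eye_w=0.5, eye_h=0.25, mouth_w=1.0, mouth_h=0.5):
    """Derive eye and mouth regions from four MTCNN key points.

    The ratio arguments are hypothetical; they only show that each region
    is sized in proportion to a key-point distance.
    """
    d1 = math.dist(left_eye, right_eye)      # distance between the eyes
    d2 = math.dist(mouth_left, mouth_right)  # distance between mouth corners
    mouth_center = ((mouth_left[0] + mouth_right[0]) / 2,
                    (mouth_left[1] + mouth_right[1]) / 2)
    return {
        "left_eye": box_around(left_eye, eye_w * d1, eye_h * d1),
        "right_eye": box_around(right_eye, eye_w * d1, eye_h * d1),
        "mouth": box_around(mouth_center, mouth_w * d2, mouth_h * d2),
    }


rois = eye_and_mouth_rois((100, 100), (160, 100), (110, 170), (150, 170))
```

Because every box is defined relative to a key-point distance, the regions scale with the apparent size of the face in the frame.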

Key points were located for examples from the NTHU-DDD dataset [13], and the regions of interest of the eyes and mouth were obtained according to the proportion of three courtyards and five eyes, as shown in Figure 2.

Figure 2: Regions of interest of the eyes and mouth obtained for NTHU-DDD examples, panels (a) to (d).

###### 2.1.2. Eye and Mouth State Recognition

The method of eye and mouth state recognition based on manual feature extraction is affected by the shooting angle, shooting distance, and individual differences and has poor robustness. In this paper, the improved AlexNet is used to recognize the open and closed state of the eyes and mouth, which avoids the complex preprocessing operation of the image and has strong robustness.

The Ghost module [14] is a lightweight neural network unit. To give the fatigue detection model better real-time performance on edge devices, the Ghost module is used in this paper to replace convolution operations in AlexNet. The feature maps output by a deep convolutional neural network are redundant, and these similar feature maps strengthen the feature extraction ability of the network. The Ghost module therefore uses simple linear operations to obtain more of these similar feature maps and thus improve CNN performance. It first applies a small number of conventional convolutions to obtain intrinsic feature maps and then applies a simple linear operation, depthwise convolution, to the intrinsic feature maps to generate Ghost feature maps. Finally, the intrinsic feature maps and the Ghost feature maps are concatenated to form the final output. Compared with using conventional convolution directly, the Ghost module greatly reduces the amount of computation while maintaining accuracy.
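The saving in computation can be estimated with a rough multiply-accumulate count. The sketch below assumes that a fraction 1/s of the output maps are intrinsic and that each Ghost map is produced by one d × d depthwise filter; the ratio `s = 2`, the kernel size `d = 3`, and the layer dimensions in the example are illustrative assumptions rather than values from this paper.

```python
def conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate count of a conventional k x k convolution."""
    return c_out * h_out * w_out * c_in * k * k


def ghost_flops(c_in, c_out, k, h_out, w_out, s=2, d=3):
    """Multiply-accumulate count of a Ghost module (s and d are assumed)."""
    intrinsic = c_out // s
    # Conventional convolution produces only the intrinsic feature maps.
    primary = conv_flops(c_in, intrinsic, k, h_out, w_out)
    # Each intrinsic map yields (s - 1) Ghost maps via d x d depthwise filters.
    cheap = (s - 1) * intrinsic * h_out * w_out * d * d
    return primary + cheap


standard = conv_flops(64, 128, 3, 12, 12)
ghost = ghost_flops(64, 128, 3, 12, 12)
print(standard / ghost)
```

Under these assumptions the reduction approaches a factor of s when the number of input channels is much larger than 1, because the depthwise step is cheap compared with the primary convolution.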

AlexNet [15] showed excellent results in image classification with an 8-layer network structure. In this paper, AlexNet is modified to classify the open and closed states of the eyes and mouth. Because eye and mouth images occupy few pixels, the input size of AlexNet is reduced from 224 pixels × 224 pixels with 3 channels to 24 pixels × 24 pixels with 3 channels. The 11 × 11 and 5 × 5 convolution kernels are replaced with 3 × 3 kernels, and the stride of the max pooling operations is adjusted to keep the feature maps from becoming too small. Because the network is only used for four-class eye and mouth state classification, only the first fully connected layer of AlexNet is retained, and its output dimension is changed from 2048 to 128. Finally, the Softmax regression function outputs the probability that the sample belongs to each state (for example, the open-mouth state). The model retains the first conventional convolution layer to extract image features comprehensively, and the other convolution operations are replaced by Ghost modules to make the network lightweight. The improved AlexNet architecture is shown in Figure 3.

###### 2.1.3. Fatigue Detection

A person in a state of fatigue shows a series of physiological reactions, such as prolonged eye closure and yawning. By calculating the PERCLOS and PMOT parameters over consecutive video frames classified by the improved AlexNet model, the fatigue state can be judged against threshold values.

PERCLOS is a measure of fatigue proposed by Carnegie Mellon Institute. A correlation analysis of 9 fatigue parameters, including the PERCLOS parameter, blinking frequency, and yawning parameters, shows that the PERCLOS parameter has the highest correlation with the fatigue state [16]. The PERCLOS parameter represents the percentage of eye closure time per unit time and is calculated as follows:

$$\mathrm{PERCLOS} = \frac{\sum_{i=1}^{N} f_i}{N} \times 100\% \quad (3)$$

where $N$ is the total number of video frames per unit time, $f_i$ marks a closed-eye frame (1 if the eyes are closed in frame $i$ and 0 otherwise), and $\sum_{i=1}^{N} f_i$ is the total number of closed-eye frames per unit time. In the normal state, the value of the PERCLOS parameter is small. When a person is fatigued, the number of closed-eye frames increases, and so does the value of the PERCLOS parameter.

The PMOT parameter is similar to the PERCLOS parameter and represents the percentage of open-mouth time per unit time. It is calculated as follows:

$$\mathrm{PMOT} = \frac{\sum_{i=1}^{N} m_i}{N} \times 100\% \quad (4)$$

where $N$ is the total number of video frames per unit time, $m_i$ marks an open-mouth frame (1 if the mouth is open in frame $i$ and 0 otherwise), and $\sum_{i=1}^{N} m_i$ is the total number of open-mouth frames per unit time. In the normal state, the value of the PMOT parameter is small. When a person yawns, the number of open-mouth frames increases, and so does the value of PMOT.
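Both parameters reduce to the fraction of flagged frames in a window, which the following sketch makes concrete. The per-frame flags are assumed to come from the eye and mouth classifier, and the two threshold values are hypothetical placeholders because the paper does not state them here.

```python
def frame_ratio(flags):
    """Fraction of frames flagged 1 in a window (PERCLOS or PMOT)."""
    return sum(flags) / len(flags) if flags else 0.0


def is_fatigued(eye_closed, mouth_open, perclos_thr=0.25, pmot_thr=0.2):
    """Threshold test on PERCLOS and PMOT; both thresholds are assumed."""
    perclos = frame_ratio(eye_closed)
    pmot = frame_ratio(mouth_open)
    return perclos > perclos_thr or pmot > pmot_thr


# One flag per video frame: 1 = eyes closed / mouth open, 0 = otherwise.
eye_closed = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
mouth_open = [0] * 10
print(frame_ratio(eye_closed), is_fatigued(eye_closed, mouth_open))
```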

##### 2.2. Face Pose Estimation Algorithm

Face deflection axis is shown in Figure 4. Among them, the deflection around the *X* axis is called pitch, the deflection around the *Y* axis is called yaw, and the deflection around the *Z* axis is called roll.

In this paper, the facial feature point locator of Adrian Bulat is used to locate facial feature points [17], as shown in Figure 5. The locator is suitable for locating features of faces rotated in the image plane. In addition to the visible feature points, occluded or invisible facial feature points can also be located.

The face pose estimation process adopted in this paper, which is based on facial features, is shown in Figure 6. To reduce the number of parameters estimated in the loss function, the algorithm transforms the estimation of face rotation around three axes into a search for the best rotation angles of the sparse model around the *X* and *Y* axes, carried out within a certain range of rotation around the *Z* axis. The roll angle parameter is thus eliminated from the loss function.

By aligning the subnasal point on the 3D model with the subnasal point on the image, the model is constrained to rotate only about the subnasal point, which eliminates the translation parameters from the loss function. The loss function then retains only the scaling factor, the pitch angle, and the yaw angle.

Let *s* be the global scale parameter of the 3D model, and let $t_x$ and $t_y$ be the translation parameters in the *X* and *Y* directions of the 3D model after parallel projection onto the *XY* plane. If the face roll angle *γ* is known, the depth-direction deflection angles $\alpha$ (pitch) and $\beta$ (yaw) are estimated as follows. The subnasal point of the 3D model is made to coincide with the subnasal point of the face in the image and is fixed there. Then *s*, $\alpha$, and $\beta$ are adjusted so that the other feature points on the image align with the corresponding points of the 3D model after 2D projection, in the sense of minimizing the sum of squared distances:

$$E = \min_{s, \alpha, \beta} \sum_{i=1}^{t} \left\| p_i - \left( O S R P_i + T \right) \right\|^2 \quad (5)$$

where *t* is the number of alignment points, $p_i$ is an alignment point on the face image, and $P_i$ is the corresponding alignment point on the model $M$. Fixing the subnasal points gives the translation vector

$$T = p_0 - O S R P_0 \quad (6)$$

where $p_0$ and $P_0$ are the subnasal points on the face image and on $M$, respectively, and $T = (t_x, t_y)^{\mathsf{T}}$ is the translation vector after the projection of $M$. $O$ is the orthographic projection matrix, $R$ is the rotation matrix, and $S$ is the global scaling matrix.
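The alignment idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the rotation order (roll, then yaw, then pitch composition), the choice of index 0 as the subnasal point, and all function names are hypothetical, and a coarse grid search stands in for the modified Newton method used in the paper. The scale is held fixed for brevity.

```python
import math


def rotation(pitch, yaw, roll):
    """R = Rz(roll) * Ry(yaw) * Rx(pitch), angles in radians (assumed order)."""
    cx, sx = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    cz, sz = math.cos(roll), math.sin(roll)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def project(points3d, s, pitch, yaw, roll):
    """Scaled orthographic projection onto the XY plane."""
    r = rotation(pitch, yaw, roll)
    return [tuple(s * sum(r[row][k] * p[k] for k in range(3))
                  for row in range(2)) for p in points3d]


def pose_loss(image_pts, model_pts, s, pitch, yaw, roll):
    """Sum of squared distances with point 0 (subnasal) pinned together."""
    proj = project(model_pts, s, pitch, yaw, roll)
    loss = 0.0
    for (u, v), (x, y) in zip(image_pts[1:], proj[1:]):
        du = (u - image_pts[0][0]) - (x - proj[0][0])
        dv = (v - image_pts[0][1]) - (y - proj[0][1])
        loss += du * du + dv * dv
    return loss


def search_pitch_yaw(image_pts, model_pts, s, roll, limit_deg=60, step_deg=5):
    """Grid search over pitch and yaw at a fixed roll (not Newton's method)."""
    best = (float("inf"), 0, 0)
    for pitch_deg in range(-limit_deg, limit_deg + 1, step_deg):
        for yaw_deg in range(-limit_deg, limit_deg + 1, step_deg):
            loss = pose_loss(image_pts, model_pts, s,
                             math.radians(pitch_deg), math.radians(yaw_deg), roll)
            if loss < best[0]:
                best = (loss, pitch_deg, yaw_deg)
    return best
```

Pinning the subnasal point means the translation never appears as a free parameter: only differences relative to that point enter the loss, which is the effect the alignment step is meant to achieve.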

Substituting equation (6) into equation (5) and applying the interior point penalty function method yields the augmented objective function (loss function), equation (9), in which $r$ denotes the barrier factor. The modified Newton method [18] is used to calculate the pitch and yaw deflection parameters $\alpha$ and $\beta$ of the face in the image that satisfy equation (9) at the specified roll angle, as well as the scaling coefficient *s* of the 3D face model $M$.

In this paper, the estimates of the in-plane rotation angle and the out-of-plane deflection angles of the face are combined. The final estimate of the face deflection angle around each coordinate axis is obtained by searching for the best deflection angles $\alpha$ and $\beta$ over a range of candidate roll angles determined from the angle of the line joining the two eye centers.

##### 2.3. Fuzzy Comprehensive Evaluation

In practical problems, objective decision is often a comprehensive decision of many factors. Some attributes are ambiguous, cannot be quantified, and cannot be judged simply by “good” or “bad.” Fuzzy comprehensive evaluation decision is a comprehensive evaluation method based on fuzzy mathematics [19]. The basic idea is to use fuzzy linear transformation principle and maximum membership principle. Fuzzy attributes are quantified by membership function and evaluated by traditional quantitative evaluation method. In this way, the quantitative and qualitative factors in the problem can be uniformly dealt with, and the differences between fuzzy attributes can be taken into account.

The evaluation objective of this paper is learning concentration. It has no agreed quantitative definition in the literature and is therefore a fuzzy objective. Facial orientation and degree of fatigue are also fuzzy when evaluated, so fuzzy comprehensive evaluation is adopted in this paper. The mean horizontal facial deflection angle, the mean vertical facial deflection angle, the number of eye closures, and the number of yawns per unit time are the first-level factors. Facial orientation and fatigue are the second-level factors. This paper therefore adopts fuzzy comprehensive evaluation over multiple layers of factors, that is, multilevel fuzzy comprehensive evaluation.

###### 2.3.1. Factor Evaluation of the First Layer

The head posture score is divided into a head-turning score and a head-raising score, and the smaller of the two is taken as the comprehensive score. First, the mean left (right) head-turning angle and the mean head-raising (lowering) angle are calculated. The maximum left (right) head rotation angle is chosen according to the left-right rotation range of the human cervical joint. The smaller the turning angle is, the higher the score is.

The head-raising (lowering) angle of the human cervical joint covers a range from a negative (lowered) limit to a positive (raised) limit. When the angle is greater than 0, the head is judged to be raised, and when the angle is less than 0, the head is judged to be lowered. The closer the angle is to 0, the closer the head is to level, indicating that the learner is looking straight ahead, and the higher the score is.

The comprehensive head posture score takes the minimum of the two scores:

$$S_{\text{head}} = \min\left(S_{\text{turn}}, S_{\text{raise}}\right)$$

where $S_{\text{turn}}$ is the head-turning score and $S_{\text{raise}}$ is the head-raising score.
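A minimal sketch of the head posture score follows. The linear mapping from angle to a 0 to 100 score and the two maximum angles (60° for turning, 45° for raising) are assumptions made for illustration; only the final minimum is taken from the text.

```python
def angle_score(mean_angle_deg, max_angle_deg):
    """Map a mean deflection angle to a 0-100 score (linear form is assumed)."""
    ratio = min(abs(mean_angle_deg) / max_angle_deg, 1.0)
    return 100.0 * (1.0 - ratio)


def head_posture_score(mean_turn_deg, mean_raise_deg,
                       max_turn_deg=60.0, max_raise_deg=45.0):
    """Comprehensive score: the smaller of the turning and raising scores.

    max_turn_deg and max_raise_deg are hypothetical limits.
    """
    turn = angle_score(mean_turn_deg, max_turn_deg)
    raise_ = angle_score(mean_raise_deg, max_raise_deg)
    return min(turn, raise_)


print(head_posture_score(15.0, -9.0))
```

Taking the minimum means that a large deflection in either direction is enough to lower the posture score, even if the other angle is close to zero.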

In this paper, a comprehensive score combining the PERCLOS value [20], the average eye-closure duration, and the yawning frequency is used to evaluate fatigue. Each comprehensive evaluation of concentration comprises repeated detections of eye closure, yawning, and head rotation, which together span the total detection time. The PERCLOS value is calculated from the number of eye closures T1 and the number of detections.

The average eye-closure duration is calculated from the total detection time.

The yawning frequency is calculated from the number of yawns.

The weights of the three parameters are determined according to their importance. With weights of 1, 0.8, and 0.5 for the PERCLOS value, the average eye-closure duration, and the yawning frequency, respectively, the comprehensive fatigue score is calculated as their weighted combination.
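One possible form of the weighted combination is sketched below. The weights 1, 0.8, and 0.5 come from the text; the normalization of each indicator to the range 0 to 1, the caps `max_closure_s` and `max_yawns_per_min`, and the conversion to a 0 to 100 score are assumptions, since the exact scoring formula is not recoverable here.

```python
# Weights for PERCLOS, mean eye-closure duration, yawning frequency (from the text).
WEIGHTS = (1.0, 0.8, 0.5)


def fatigue_score(perclos, mean_closure_s, yawns_per_min,
                  max_closure_s=3.0, max_yawns_per_min=3.0):
    """Weighted fatigue score in [0, 100]; higher means less fatigued.

    The normalization caps and the 0-100 scale are hypothetical.
    """
    indicators = (
        min(perclos, 1.0),
        min(mean_closure_s / max_closure_s, 1.0),
        min(yawns_per_min / max_yawns_per_min, 1.0),
    )
    weighted = sum(w * x for w, x in zip(WEIGHTS, indicators)) / sum(WEIGHTS)
    return 100.0 * (1.0 - weighted)


print(fatigue_score(0.2, 0.5, 1.0))
```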

###### 2.3.2. Second Factor Evaluation

The multilevel fuzzy comprehensive evaluation model is used to calculate the comprehensive concentration score. The scores of the first-level factors have already been calculated. The facial orientation score $S_1$ and the fatigue score $S_2$ form the evaluation factor set $U = \{u_1, u_2\}$. The analytic hierarchy process is used to calculate the weight of each factor.

Let the resulting weight vector (eigenvector) be $W = (w_1, w_2)$. The comprehensive concentration score is the sum of the products of the two factor scores and their weights:

$$S = w_1 S_1 + w_2 S_2$$
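The weight calculation can be illustrated with a small analytic hierarchy process sketch that approximates the principal eigenvector of a pairwise comparison matrix by power iteration. The comparison value 2 (one factor judged twice as important as the other) and the example scores are hypothetical; the weights actually used in the paper are not recoverable from the text.

```python
def ahp_weights(matrix, iterations=50):
    """Normalized principal eigenvector of a pairwise comparison matrix."""
    n = len(matrix)
    w = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply by the matrix, then renormalize so the weights sum to 1.
        w = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w)
        w = [x / total for x in w]
    return w


def concentration_score(scores, weights):
    """Weighted sum of the second-level factor scores."""
    return sum(s * w for s, w in zip(scores, weights))


# Hypothetical judgment: first factor is twice as important as the second.
comparison = [[1.0, 2.0],
              [0.5, 1.0]]
weights = ahp_weights(comparison)
print(weights, concentration_score((75.0, 90.0), weights))
```

For a perfectly consistent comparison matrix such as this one, the iteration settles on weights proportional to the judged importance.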

#### 3. Experiment

##### 3.1. System Development

The system is developed on Windows 10 in Python 3.8, using PyCharm as the development tool. The system performs a detection every second, and every five detections form one detection period.

###### 3.1.1. Image Preprocessing

The system uses OpenCV functions to open the computer camera and acquire images. If acquisition fails, it is retried. The resolution of each successfully acquired image is checked: if its length (width) is greater than 700, the image is reduced to 1/3 of the original size. The image is then converted to grayscale to reduce the complexity of subsequent calculations. Facial feature points are detected using the preloaded dlib library. If detection succeeds, 68 feature points and the grayscale image are returned. If detection fails, a new image is acquired and the above steps are repeated.
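The resolution rule can be written as a small helper. The function name `target_size` is hypothetical, and the sketch assumes that "1/3 of the original" means each side is divided by three and that the threshold of 700 is in pixels; camera capture and landmark detection are left out so that the example runs without OpenCV or dlib.

```python
def target_size(width, height, limit=700, factor=3):
    """Return the size an acquired frame should be reduced to.

    Assumes the rule applies when either side exceeds the limit and that
    both sides are divided by the same factor.
    """
    if width > limit or height > limit:
        return width // factor, height // factor
    return width, height


print(target_size(1280, 720), target_size(640, 480))
```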

###### 3.1.2. Head Posture Assessment

This module first obtains the facial feature points output by the image acquisition and preprocessing module. The feature points at the corners of both eyes, the corners of the mouth, the tip of the nose, and the subnasal point are selected. The inclination of the line between the eyes, the scaling factor, and all candidate roll angles are then calculated. For each candidate roll angle, the pitch, the yaw, and the loss function value are calculated using the modified Newton iteration method. The pitch, yaw, and roll angles corresponding to the minimum loss are selected. Finally, the evaluation value of the face pose is output.

###### 3.1.3. Eye Closure Test and Yawn Test

This module first obtains the facial feature points output by the image acquisition and preprocessing module. The feature points of the left and right eyes and of the mouth are selected. The mean closure of the left and right eyes is calculated and compared with a threshold: if it is less than or equal to the threshold, the number of eye closures is increased by 1. The mouth closure is calculated from the feature points of the mouth and compared with a threshold: if it is greater than or equal to the threshold, the number of yawns is increased by 1.
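The threshold tests can be sketched with an aspect-ratio measure computed from landmark coordinates. The eye aspect ratio used here is a commonly used stand-in and is an assumption, as the paper does not define its closure measure; the thresholds `eye_thr` and `mouth_thr` and the point ordering are likewise hypothetical.

```python
import math


def eye_aspect_ratio(eye):
    """eye: six (x, y) points ordered corner, top, top, corner, bottom, bottom."""
    p1, p2, p3, p4, p5, p6 = eye
    return (math.dist(p2, p6) + math.dist(p3, p5)) / (2.0 * math.dist(p1, p4))


def mouth_opening(left, right, top, bottom):
    """Vertical lip gap relative to mouth width."""
    return math.dist(top, bottom) / math.dist(left, right)


def update_counts(counts, left_eye, right_eye, mouth,
                  eye_thr=0.2, mouth_thr=0.6):
    """Increment the eye-closure and yawn counters for one detection."""
    closure = (eye_aspect_ratio(left_eye) + eye_aspect_ratio(right_eye)) / 2.0
    if closure <= eye_thr:            # eyes judged closed
        counts["eye_closures"] += 1
    if mouth_opening(*mouth) >= mouth_thr:   # mouth judged wide open
        counts["yawns"] += 1
    return counts
```

A ratio is used rather than a raw pixel distance so that the test does not depend on how far the learner sits from the camera.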

###### 3.1.4. Fuzzy Comprehensive Evaluation

The fuzzy comprehensive evaluation module first determines whether the detection cycle is over. If not, monitoring continues. When the cycle ends, the head posture score and the fatigue score are calculated from the detected head-raising angle, left-right head-turning angle, PERCLOS value, average eye-closure duration, and yawning frequency. The learning concentration score is then calculated by fuzzy comprehensive evaluation.
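The cycle logic can be sketched as a small accumulator that collects one sample per detection and emits a summary after five detections. The class name `DetectionCycle`, the field names, and the use of mean absolute angles are assumptions made for illustration.

```python
class DetectionCycle:
    """Collect per-detection results and summarize them once per cycle."""

    def __init__(self, size=5):
        self.size = size      # detections per cycle (five in the described system)
        self.samples = []

    def add(self, yaw_deg, pitch_deg, eye_closed, yawning):
        """Store one detection; return a summary when the cycle is complete."""
        self.samples.append((yaw_deg, pitch_deg, eye_closed, yawning))
        if len(self.samples) < self.size:
            return None       # cycle not over yet, keep monitoring
        n = self.size
        summary = {
            "mean_turn_deg": sum(abs(s[0]) for s in self.samples) / n,
            "mean_raise_deg": sum(abs(s[1]) for s in self.samples) / n,
            "perclos": sum(1 for s in self.samples if s[2]) / n,
            "yawns": sum(1 for s in self.samples if s[3]),
        }
        self.samples = []     # start the next cycle
        return summary
```

The summary produced at the end of each cycle contains the quantities that the head posture and fatigue scoring steps take as input.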

##### 3.2. System Experiment

The effectiveness of the proposed teaching quality evaluation system needs to be tested in practical application, so a detection scheme is designed in this paper. Subjects watched a 40-minute teaching video on a computer while the detection system was running. During this process, the subjects randomly simulated the normal state, nodding state, yawning state, mild head deflection state, and severe head deflection state, as shown in Figure 7.

Figure 7: Learner states simulated by the subjects, panels (a) to (e).

The detection system calculates the mean learning concentration score that it gives to learners in each state. The subjects were divided into four groups of 20 people each. The experimental results are shown in Table 1. For the same learning state simulated by different subjects, there was no significant difference between the scores given by the system. Compared with the normal state, the learning concentration scores in all other states decreased to different degrees. The decrease was greater in the severe head deflection state and the sleepy state, which agrees with the expectation that severe head deflection and drowsiness have a very negative impact on learners’ concentration. A slight tilt of the head indicates that the learner’s attention has shifted, but only to a small degree, so the system score in this state is lower than in the normal state, but only slightly. Yawning indicates that the learner is fatigued, which has a large negative impact on learning concentration, so the system score in this state is much lower than in the normal state.

The simulation experiment shows that the system designed in this paper can detect learners’ learning concentration to a certain extent, and that the quality of online course teaching can be evaluated through it. Teachers can also obtain students’ online learning concentration remotely in real time and keep track of the classroom status. To improve the quality of teaching, teachers can then adjust the teaching plan to help students learn better.

#### 4. Conclusion

As a new education model, online education is gradually being welcomed by students and parents because of its openness and diversity. To objectively evaluate the teaching quality of online classrooms, a teaching quality evaluation system based on facial feature recognition is proposed. First, the improved MTCNN network is used to obtain the key points of the face, and the eye and mouth regions are obtained according to the proportion of three courtyards and five eyes. The improved AlexNet based on the Ghost module is used to classify the states of the eyes and mouth and to judge fatigue. Then, face pose estimation is completed using the facial feature point locator of Adrian Bulat. Finally, the effectiveness of the system is verified by simulation experiments. The experimental results show that the system is effective at evaluating learning concentration and can objectively evaluate the teaching quality of online courses. Analysis and optimization of the algorithm’s efficiency are left for future work.

#### Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

#### Ethical Approval

The authors declare that they obtained image data from public datasets or image data collected by their team, and they obtained authorization to use portrait images in publicly published articles.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This paper was supported by the Fundamental Research Funds for the Central Universities (Project name: The Study of Need Supportive Teaching under Ubiquitous Learning, Project no. 2016TS009) and the research project of the Youth Foundation of Humanities and Social Sciences, MOE (Ministry of Education in China) (Project name: The Study of the Historical Evolution of College Teaching Paradigms and Its Transformational Paths, Project no. 19YJC880124). Yuan Fang is a member of the “One Belt, One Road” countries’ education developmental strategy research team, The Youth Innovation Team of Shaanxi Universities.