Abstract

With the continuous development of online technology, online education has become a trend. To improve the quality of online education, a comprehensive and effective analysis of educational behaviour is necessary. In this paper, we propose a network model based on the ResNet50 network fused with a bilinear hybrid attention mechanism, together with an adaptive pooling weight algorithm built on average pooling, to address the feature loss and blurring that traditional pooling algorithms cause during image feature extraction. At the same time, the hyperparameters of the convolutional neural network model are adaptively adjusted with the particle swarm optimization algorithm to further improve recognition accuracy. Through experimental validation on the NTU-RGB + D and NTU-RGB + D120 data sets, the recognition accuracy of the proposed method is 88.8% for cross-subject (CS) and 94.7% for cross-view (CV) on NTU-RGB + D, and 83.2% for cross-subject (CSub) and 84.3% for cross-setup (CSet) on NTU-RGB + D120. The experimental results show that the proposed algorithm is an effective method for educational behaviour recognition.

1. Introduction

As governments, education departments, and academic accreditation institutions begin to encourage schools to build evidence-based decision-making and innovation systems, learning analytics technology shows great advantages in decision support and teaching evaluation [1]. Learning analytics has achieved increasingly high analysis accuracy, and students’ learning engagement can help schools better understand the quality of students’ learning. The core element in evaluating a university’s education quality is students’ learning engagement [2]. As an important part of learning engagement, students’ classroom behaviour has long been a concern of researchers. Traditional evaluation of students’ classroom behaviour relies on manual observation and recording, which is inefficient [3]. At present, understanding students’ learning behaviour and learning state during classroom learning has become an important topic for educational development, one that will promote the intelligent, efficient, and all-round development of educational analysis systems [4].

Educational behaviour analysis is an important part of assessing teaching quality [5]. Traditional teaching evaluation mainly records the frequency of teachers’ and students’ classroom behaviours on a scale while observing the course and then carries out manual analysis and evaluation [6]. This evaluation method can capture more complex teaching behaviours, such as gestures and gaze, but the behaviour records lack time-dimension information. At the same time, the rater’s recognition and scoring of behaviour are relatively subjective, so the value of unified comparison and analysis is low. Modern quantitative classroom analysis began with the proposal of Flanders’ interaction analysis system (FIAS) [7], after which a variety of optimized analysis systems were designed [8]. However, their data sources still rely on other teachers’ classroom records, questionnaires, and other manual methods, so traditional teaching evaluation lacks an automatic way to identify teachers’ and students’ educational behaviour.

In recent years, with the continuous maturation of computer vision based behaviour recognition technology, more and more researchers have chosen to transfer behaviour recognition algorithms to the education sector. Literature [9] proposes a system for identifying the extent to which children’s naturally occurring postures are related to their interest in learning on a computer: postures are collected by sensors mounted on the seat and the back of the chair, pose features are extracted using a Gaussian mixture model and fed into a feedforward neural network, which classifies nine postures and achieves an overall accuracy of 87.6% when testing postures from new subjects. Literature [10] constructs the student classroom behaviour database SCBID, puts forward a student classroom behaviour recognition method based on skeletons and deep learning, and uses the OpenPose skeleton detection model to extract skeletons of students’ classroom behaviour, with an average recognition rate of 97.92%. Literature [11] uses the OpenPose model to detect key points of the human skeleton, extracts behavioural features of students in class, and trains a classification model with a polynomial-kernel SVM, with an accuracy of 95.34%. Literature [12] studies the real-time detection technology required by an intelligent classroom system from the two aspects of behaviour detection and expression recognition: for behaviour detection, two key point detection technologies are combined; for expression recognition, face positions are detected in real time, the traditional CNN is improved, a real-time image classification network architecture is proposed, and good results are obtained. Literature [13] proposes a new method to automatically evaluate students’ attention in classroom teaching; the best subject-independent three-level attention classifier achieves an accuracy of 0.753, which is comparable to other results in the field of engagement research. Literature [14] uses a Kinect sensor to acquire images, detects face positions with a machine learning method, extracts histogram of oriented gradients (HOG) features, and trains a support vector machine (SVM) model. The trained model is applied in a sliding-window manner to locate all faces in the picture, and the perspective-n-point algorithm is then used to solve for the position and orientation of each face, yielding objective indicators of students’ in-class state to support classroom teaching evaluation. To quantify students’ participation in class, earlier work measured students’ head-raising rate. Literature [15] proposes a new method to identify students’ head-raising rate (HRR) in the classroom: a method for extracting salient facial features of students is developed, and a multitask CNN is constructed to detect the HRR; experiments show the effectiveness of this method. Literature [16] proposes an improved depthwise separable convolutional neural network based on MobileNet, with recognition accuracies of 99.25% and 99.70% on the CK+ and KDEF data sets, respectively. The attention-layered bilinear pooled residual network proposed in literature [17] achieves recognition accuracies of 73.24% and 98.79% on the FER2013 and CK+ data sets, respectively. Literature [18] proposes a dynamic weighted decision algorithm based on the Gini index; experiments on the FER2013 and CK+ data sets show classification accuracies of 73.36% and 97.59%, respectively. Literature [19] uses an improved YOLOv4 algorithm, designs a VGG-M network through network optimization and multiscale feature fusion, and obtains a good accuracy of 97.02% on the self-made student behaviour data set Stu_obj.

In this paper, an improved convolutional neural network model based on a bilinear mixed attention module is proposed to improve educational behaviour recognition. The improved model selects ResNet50 [20] as the backbone network, integrates the bilinear hybrid attention module into the last bottleneck, and uses the improved pooling algorithm in the first convolution layer. Integrating the bilinear mixed attention module improves the convergence speed of the model and increases its generalization ability. Compared with the traditional pooling algorithm, the recognition accuracy of the network model with the adaptive pooling weight algorithm is improved. Finally, the particle swarm optimization algorithm is used to optimize the hyperparameters of the improved network model; it automatically selects the best hyperparameters, removes the uncertainty of manual parameter tuning, and improves recognition accuracy. Experiments show that the recognition accuracy of this model is improved to a certain extent, which gives the approach research value.

2. Methodology

2.1. Overview of Model Design
2.1.1. ResNet Network

As the number of layers of a convolutional neural network increases, the training accuracy eventually declines, that is, the degradation phenomenon appears. To address this bottleneck of convolutional neural network models, literature [21] constructs the deep residual network ResNet. The main idea is to use a residual structure and design a skip connection that can bypass one or more layers; stacking multiple residual units forms a deep residual network. Depending on the depth and the residual block structure, the family mainly includes ResNet18, ResNet50, ResNet101, ResNet152, and other network models. This paper selects ResNet50 as the backbone network, loads its pretrained model, and improves on it.
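For readers less familiar with the residual design, the following minimal PyTorch sketch of a ResNet50-style bottleneck unit shows how the skip connection bypasses the stacked convolutions (layer names and defaults are illustrative, not the exact torchvision implementation):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Minimal ResNet50-style bottleneck: 1x1 -> 3x3 -> 1x1 with a skip connection."""
    expansion = 4

    def __init__(self, in_channels, mid_channels, stride=1):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the spatial size or channel count changes.
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)  # the skip connection bypasses the stacked layers
```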

2.1.2. Bilinear Convolutional Neural Network

The B–CNN (bilinear convolutional neural network) model mainly uses a dual-stream CNN to complete feature extraction. The model has a simple and efficient structure, can be trained end to end, and its bilinear form simplifies gradient calculation. Since conventional convolutional neural network classification is mainly used to distinguish widely different targets, it gives little consideration to different subclasses within the same category, whereas fine-grained image classification pays more attention to discriminative local regions and is therefore better suited to educational behaviour recognition and classification. The B–CNN model is shown in Figure 1.
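The bilinear combination of the two feature streams can be sketched in PyTorch as follows (the tensor shapes and the small stabilising constant are illustrative assumptions; the two streams are assumed to share the same spatial size):

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling of two feature maps of shape (N, C_a, H, W) and (N, C_b, H, W)."""
    n, c_a, h, w = feat_a.shape
    c_b = feat_b.shape[1]
    a = feat_a.reshape(n, c_a, h * w)
    b = feat_b.reshape(n, c_b, h * w)
    # Outer product at every location, averaged over locations: (N, C_a, C_b).
    phi = torch.bmm(a, b.transpose(1, 2)) / (h * w)
    phi = phi.reshape(n, c_a * c_b)
    # Signed square root followed by L2 normalisation, as in B-CNN.
    phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-12)
    return F.normalize(phi, dim=1)
```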

2.1.3. Mixed Attention Mechanism Module

A mixed attention mechanism includes a channel attention mechanism and a spatial attention mechanism. Assuming that the input feature map is $F \in \mathbb{R}^{C \times H \times W}$, it first passes through a one-dimensional channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and then through a two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$.

The whole module can be summarized as the following equations:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F',$$

where $\otimes$ denotes element-wise multiplication with broadcasting.
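A minimal PyTorch sketch of such a module, following the published CBAM design (the reduction ratio and 7 × 7 spatial kernel are conventional choices, not values taken from this paper), is given below:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))     # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))      # max-pooled descriptor
        return torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # (N, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)       # (N, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """F' = Mc(F) * F, then F'' = Ms(F') * F'."""
    def __init__(self, channels):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x):
        x = x * self.channel(x)
        return x * self.spatial(x)
```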

2.2. Bilinear Hybrid Attention Mechanism
2.2.1. M-SE Module

Based on the idea of the attention mechanism, a bilinear mixed attention mechanism module (M-SE) adapted to educational behaviour recognition is designed. Extracting image features with convolution kernels of different sizes makes it possible to capture image features more comprehensively and accurately, which facilitates later learning and classification. After the feature map is output from the convolution layer, 3 × 3 and 5 × 5 convolution kernels are applied to the same output feature map for feature extraction, and the extracted features are each combined with the mixed attention mechanism. Finally, a bilinear transformation of the two attention-enhanced channels yields second-order feature information.

Assume that the input and output of the network module are

$$X \in \mathbb{R}^{H \times W \times C}, \qquad Y_k \in \mathbb{R}^{H' \times W' \times N},$$

where $H$ and $W$ are the length and width and $C$ is the number of channels. The 512-channel feature map output by the convolution layer in the last bottleneck of ResNet50 is convolved with a $k \times k$ convolution kernel to obtain

$$Y_k = W_k * X, \qquad k \in \{3, 5\},$$

where $W_k$ represents the filter whose convolution kernel is $k \times k$, $H'$ and $W'$ are the length and width of the feature map output after convolution, and $N$ is the number of convolution kernels.

$Y_3$ and $Y_5$ are each sent into CBAM (Convolutional Block Attention Module) to model the correlations along the channel and spatial dimensions. After the attention mechanism has been introduced in both dimensions, the attended feature maps are obtained as

$$Z_k = \mathrm{CBAM}(Y_k) = M_s\bigl(M_c(Y_k) \otimes Y_k\bigr) \otimes \bigl(M_c(Y_k) \otimes Y_k\bigr), \qquad k \in \{3, 5\}.$$

The feature maps output by the two channels are combined bilinearly, and the formula is

$$b(l, I) = z_3(l, I)^{\mathrm{T}}\, z_5(l, I),$$

where $z_k(l, I)$ is the feature of feature map $Z_k$ at position $l$ of image $I$.

The bilinear features at all positions of each channel are added element by element to obtain

$$\xi(I) = \sum_{l} b(l, I).$$

Then, $\xi(I)$ is reshaped into vector form to obtain the bilinear vector $x$.

A regularization operation is performed on $x$ to obtain $z$:

$$y = \operatorname{sign}(x)\sqrt{|x|}, \qquad z = \frac{y}{\|y\|_2},$$

where $\operatorname{sign}(\cdot)$ is the sign function.
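Combining the previous sketches, the whole M-SE branch could look roughly like the following (the 512-channel input, the branch width, and the reuse of the CBAM class from the earlier sketch are assumptions made for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSE(nn.Module):
    """Sketch: 3x3 and 5x5 branches -> CBAM on each -> bilinear fusion -> normalised vector."""
    def __init__(self, in_channels=512, branch_channels=256):
        super().__init__()
        self.branch3 = nn.Conv2d(in_channels, branch_channels, 3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, branch_channels, 5, padding=2)
        self.att3 = CBAM(branch_channels)   # CBAM as defined in the earlier sketch
        self.att5 = CBAM(branch_channels)

    def forward(self, x):                   # x: (N, 512, H, W)
        z3 = self.att3(self.branch3(x))
        z5 = self.att5(self.branch5(x))
        n, c, h, w = z3.shape
        # Bilinear (outer-product) fusion of the two attended branches.
        phi = torch.bmm(z3.reshape(n, c, h * w),
                        z5.reshape(n, c, h * w).transpose(1, 2)) / (h * w)
        phi = phi.reshape(n, c * c)
        phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-12)  # signed square root
        return F.normalize(phi, dim=1)                               # L2 normalisation
```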

2.2.2. IPM-SE Module

Compared with the M-SE module, which first introduces the attention mechanism and then performs the bilinear transformation, the IPM-SE (Indicate Premise Mixture-Squeeze and Excitation) module first performs the bilinear transformation on the pixel values of the feature map and then introduces the mixed attention mechanism module to weight each channel and spatial position of the resulting bilinear feature map, highlighting the regions of the image where the important information is located. The attention-enhanced bilinear convolutional neural network model is shown in Figure 2.

Firstly, the feature map $Y \in \mathbb{R}^{H \times W \times C}$ is output from the convolution layer; each channel consists of $H \times W$ pixel values, and there are $C$ channels in total. The pixel values at the corresponding position $l$ in each channel are then taken out to form the vector $y(l) \in \mathbb{R}^{C}$.

The vector at each position is multiplied by its transpose to obtain the bilinear feature $B(l) = y(l)\, y(l)^{\mathrm{T}}$, whose entries can be viewed as a feature map with $C \times C$ channels.

$B$ is sent into the mixed attention mechanism module to obtain the attention-weighted feature map $\tilde{B}$.

The attention-weighted pixel values at corresponding positions of the channels are added and averaged to obtain $\xi$.

Finally, $\xi$ is brought into formula (11) to obtain its vector form, and the regularization is applied to complete the construction of the attention-enhanced bilinear model.
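For contrast with M-SE, the reversed ordering of IPM-SE can be sketched roughly as follows (the attention module over the C × C bilinear channels and the averaging step are schematic assumptions based on the description above, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def ipm_se(feat, attention):
    """feat: (N, C, H, W); attention: a CBAM-like module expecting C*C channels."""
    n, c, h, w = feat.shape
    v = feat.reshape(n, c, h * w)
    # Per-position outer product y(l) y(l)^T, kept as an (N, C*C, H, W) map.
    # Note: C*C channels can be large; this is only a schematic sketch.
    b = torch.einsum('ncl,ndl->ncdl', v, v).reshape(n, c * c, h, w)
    b = attention(b)                       # weight channels and spatial positions
    phi = b.mean(dim=(2, 3))               # average over spatial positions -> (N, C*C)
    phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-12)
    return F.normalize(phi, dim=1)
```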

2.2.3. Adaptive Pooling Weight Algorithm

(1) The Deficiency of Traditional Pooling Algorithms. The main function of pooling is to extract feature information and reduce the size of the output features [22]. As shown in Figure 3, assume that the pixel value most strongly correlated with the correct classification of an educational behaviour is 0. If maximum pooling is adopted, this important pixel feature is lost; if mean pooling is adopted, the feature information is easily weakened. Random pooling determines the retained content according to the magnitudes of the feature pixel values, but it cannot reliably preserve the feature information either. None of these pooling algorithms can dynamically select the pooling weight parameters according to the features, resulting in a loss of image features after pooling. The three pooling processes are illustrated in Figure 3.

(2) The Adaptive Pooling Weight Algorithm. Learnable parameters are introduced into the average pooling layer so that the pooling weight parameters can be dynamically updated according to the features in the target domain and the image feature information relevant to classification is better preserved. Assuming the input feature map of layer $l$ is $a^{l-1}$, the improved pooling is

$$a^{l} = \beta^{l} \cdot \operatorname{down}(a^{l-1}) + b^{l},$$

where $\operatorname{down}(\cdot)$ denotes the average-pooling (downsampling) operation and $\beta^{l}$ and $b^{l}$ are the learnable pooling weight and bias.

Softmax is selected as the classifier, and the cross-entropy cost function is

$$J = -\sum_{i} y_i \log \hat{y}_i,$$

where $\hat{y}$ is the predicted output value and $y$ is the actual label value. The partial derivatives of the pooling parameters with respect to the cost function $J$ are calculated, and the weight parameters $\beta$ and $b$ are then updated as follows.

Step 1. Initialize the pooling weight parameters $\beta^{l}$ and $b^{l}$.

Step 2. Calculate the state value of each layer after pooling with the improved pooling formula and add the nonlinear factor:

$$a^{l} = f\!\left(\beta^{l} \cdot \operatorname{down}(a^{l-1}) + b^{l}\right),$$

where $f$ is the ReLU activation function.

Step 3. Compute the output-layer error $\delta^{T}$ by back propagation, and then calculate $\delta^{l}$ for the pooling layers in sequence from layer $T-1$ to layer 2.

Step 4. Calculate the partial derivatives $\partial J / \partial \beta^{l}$ and $\partial J / \partial b^{l}$ of the cost function for the current training data.

Step 5. Update the parameters $\beta^{l}$ and $b^{l}$.

Finally, the updated pooling parameters are used in the next iteration, and the process repeats until the pooling weight parameters minimize the cost function $J$.
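A minimal sketch of such a pooling layer, assuming a per-channel multiplicative weight β and bias b on top of ordinary average pooling (automatic differentiation then supplies the partial derivatives of Steps 3–5, so no manual back-propagation code is shown), is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveWeightedAvgPool(nn.Module):
    """Average pooling with learnable scale beta and bias b, updated by back-propagation."""
    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))   # pooling weight
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))  # pooling bias

    def forward(self, x):
        pooled = F.avg_pool2d(x, self.kernel_size, self.stride)
        return F.relu(self.beta * pooled + self.bias)  # a^l = f(beta * down(a^{l-1}) + b)
```

During training, the cross-entropy loss is back-propagated through this layer, so β and b are updated together with the convolution weights at every iteration, which is the behaviour the step-by-step derivation above describes.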

2.3. Convolutional Neural Network Model
2.3.1. Particle Swarm Optimization Algorithm

In the particle swarm optimization algorithm, if the population size is $N$, the position of the $i$-th particle can be recorded as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ and its velocity as $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$. The algorithm steps are as follows:

Step 1. Initialize the particle swarm. Initialize the position and velocity of each particle: the position is the initial coordinate of each particle, and the velocity represents the change of that coordinate in the next round, which can be positive or negative. The historical best position of each particle and the global historical best position are initialized with random numbers.

Step 2. Cycle through the following three steps until the termination conditions are met:
(1) Calculate the fitness of each particle.
(2) Update the historical best fitness and corresponding position of each particle, and update the current global best fitness and its corresponding position.
(3) Update the velocity and position of each particle:

$$v_i \leftarrow \omega v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i), \qquad x_i \leftarrow x_i + v_i,$$

where $\omega$ is the inertia factor, $c_1$ and $c_2$ are the learning factors, $r_1$ and $r_2$ are random numbers in $[0, 1]$, $p_i$ is the historical best position of particle $i$, and $g$ is the global best position.
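A compact, generic Python implementation of these update rules (population size, bounds, and coefficient values are placeholder choices) is sketched below:

```python
import numpy as np

def pso(fitness, dim, n_particles=20, iters=50, bounds=(-1.0, 1.0),
        w=0.7, c1=1.5, c2=1.5):
    """Minimise `fitness` with standard particle swarm updates."""
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))                # positions
    v = np.random.uniform(-(hi - lo), hi - lo, (n_particles, dim))   # velocities
    pbest, pbest_val = x.copy(), np.array([fitness(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()

    for _ in range(iters):
        r1, r2 = np.random.rand(n_particles, dim), np.random.rand(n_particles, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)    # velocity update
        x = np.clip(x + v, lo, hi)                                   # position update
        vals = np.array([fitness(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()
```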

2.3.2. Network Model Hyperparameters

A convolutional neural network model has many hyperparameters; manual adjustment is slow, and it is difficult to achieve the best effect. A stable and efficient convolutional neural network structure is affected by many factors, and the network performs well only when the hyperparameters are in a certain balance, so selecting the best-performing hyperparameter combination among many candidate structures is a huge undertaking [23]. The particle swarm optimization algorithm is simple and efficient and needs few parameters to tune. The hyperparameters of the above network model are used as the components of the PSO particles to increase the diversity of candidate network structures and widen the range of automatic selection. Each particle represents a convolutional neural network configuration, the error of the trained network is used as the fitness value, and the final result obtained by the PSO algorithm is the automatically selected optimal configuration. The network model hyperparameter settings are shown in Table 1.
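To connect the hyperparameters of Table 1 with the PSO sketch above, a hypothetical fitness function could decode a particle (assumed to lie in [0, 1] per dimension) into candidate hyperparameters and return the validation error; `build_model` and `train_and_validate` below are placeholder helpers, not functions from this paper:

```python
def cnn_fitness(particle):
    """Decode one particle into hyperparameters and return validation error as fitness."""
    learning_rate = 10 ** (-4 + 3 * particle[0])   # e.g. 1e-4 .. 1e-1 (assumed range)
    batch_size = int(16 + 112 * particle[1])       # e.g. 16 .. 128 (assumed range)
    dropout = 0.2 + 0.5 * particle[2]              # e.g. 0.2 .. 0.7 (assumed range)
    model = build_model(dropout=dropout)           # hypothetical model constructor
    error = train_and_validate(model, lr=learning_rate,
                               batch_size=batch_size, epochs=5)  # hypothetical helper
    return error                                   # PSO minimises this value
```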

2.3.3. Algorithm Flow of Optimizing Network Model Parameters
Step 1. Initialize parameters: set the population size, the velocity range and search space of the particles, and the number of iterations.

Step 2. Define the individual extremum and the global optimal solution.

Step 3. According to formulas (28) and (29), update the velocity and position of the particles, and update the fitness values corresponding to each particle's historical best position and the global best position.

Step 4. Judge the termination conditions: check whether the maximum number of iterations or the minimum threshold of the global optimal position has been reached. The algorithm flow chart is shown in Figure 4.
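Under the same assumptions as the earlier sketches, the flow above amounts to a single call to the PSO routine, for example:

```python
# Search a 3-dimensional hyperparameter space in [0, 1]^3 (illustrative settings only).
best_particle, best_error = pso(cnn_fitness, dim=3, n_particles=10,
                                iters=20, bounds=(0.0, 1.0))
```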

The optimal hyperparameters of the convolutional neural network model obtained by the particle swarm optimization algorithm are shown in Table 2.

3. Result Analysis and Discussion

3.1. Data Set
3.1.1. Student Online Classroom Behaviour Identification Data Set

For the online classroom teaching environment, data collection focused on six behaviours that students frequently exhibit in class: eating, talking with others, playing with a phone, reading, sleeping, and writing. Considering that students' online classroom environments are complex and changeable, being affected by the placement of the computer and the students' sitting posture, these influencing factors were included when collecting data, which gives the research on students' online classroom behaviour recognition more practical significance.

This experiment selects 50 college students as research subjects. By simulating the online classroom environment, data were collected from all 50 students. Each student performed two groups of recordings for each behaviour, ensuring that at least one group was collected under influencing factors such as sitting posture. The data for each behaviour are video files, and a total of 600 groups of video files were collected.

3.1.2. NTU-RGB + D and NTU-RGB + D120 Data Set

The NTU-RGB + D data set contains 60 actions performed by 40 subjects, giving a total of 56880 action samples. The NTU-RGB + D120 data set is an extension of NTU-RGB + D; it contains 120 actions performed by 106 subjects, with a total of 114480 action samples. The authors of NTU-RGB + D120 recommend two evaluation protocols, cross-subject (CSub) and cross-setup (CSet). CSub divides the data set into training and test samples across subjects, while CSet divides the samples by camera setup number: samples from even-numbered setups are used for training and samples from odd-numbered setups for testing.

3.2. Experimental Environment and Configuration

The experimental hardware environment is configured as follows: two Tesla T4 GPUs with 32 GB of video memory in total, an Intel(R) Xeon(R) Gold 5218 CPU, and 128 GB of RAM. The programming language used in the experiment is Python, with the PyTorch framework and the OpenPose (PyTorch version) open-source library.

3.3. Experimental Results and Analysis
3.3.1. Experimental Results and Analysis of NTU-RGB + D Data Set

The training results on the data set are shown in Figures 5 and 6. Figure 5 is the training accuracy curve of the data set, and Figure 6 is the loss rate curve. In Figure 6, the loss values of literature [24–28] and the algorithm in this paper differ greatly in the first 30 rounds because the loss functions are calculated differently: literature [24–28] adopts cross-entropy loss functions, whereas the loss function used by this algorithm is as follows:

The performance comparison on the data set is shown in Table 3. The analysis shows that the test accuracy of this model on the NTU-RGB + D data set is 88.8% (CS) and 94.7% (CV), respectively.

3.3.2. Experimental Analysis Results of NTU-RGB + D120 Data Set

The training results on the data set are shown in Figures 7 and 8. Figure 7 is the training accuracy curve of the data set, and Figure 8 is the loss rate curve. The loss values differ in the loss rate chart because the selected loss functions are calculated differently. Observation of the training process shows that the model in this paper adapts well to the NTU-RGB + D120 (CSub) data set and converges faster. The reason is that, after integrating the attention mechanism, the model can learn more important feature information and has a strong ability to distinguish different actions whose joint coordinates change only slightly.

The performance comparison of the data set is shown in Table 4. The analysis shows that the test accuracy of the model in this paper is 83.2% (CSub) and 84.3% (CSet), respectively.

3.3.3. Experimental Analysis Results of Student’s Online Classroom Behaviour Recognition Data Set

To verify the effect of the model, the DeepSORT target tracking algorithm is applied to the video data to track the students in the recognition videos, and the trained student online classroom behaviour recognition model is then called to run the experiment on the data set. The analysis results are shown in Figure 9: the analysis accuracy of the proposed model is 96.5%. The comparison shows that the proposed model has a better recognition effect and is more capable of recognizing actions that are difficult to distinguish.

4. Conclusion

In this paper, an improved convolutional neural network model based on a bilinear mixed attention module is proposed to improve educational behaviour recognition. Aiming at the shortcomings of traditional pooling algorithms, an adaptive pooling weight algorithm is proposed. At the same time, the hyperparameters of the model are adjusted adaptively to improve recognition accuracy. The effectiveness of the improved network fused with the attention mechanism is verified on the two public data sets NTU-RGB + D and NTU-RGB + D120, and a good recognition effect is also obtained on the student online classroom behaviour recognition data set. Although the model provides good recognition ability on the student online classroom behaviour recognition data set, the number of network parameters is large, and recognition takes some time. The next step will be to find a student online classroom behaviour recognition method that maintains high accuracy with a lighter network model, so as to have more application value.

Data Availability

The labeled data set used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported by the Hebi Polytechnic.