Abstract

By reversing the order of knowledge transfer and knowledge internalisation, the flipped classroom (FC) changes the roles of teachers and students in traditional teaching and reorganizes the use of classroom time, innovating on the traditional teaching mode. This research uses a network teaching platform to build an online supporting environment suited to FC teaching in universities, enabling reform of the university classroom teaching mode. To detect students' learning state in this environment, an LSTM (Long Short-Term Memory) eye movement analysis algorithm is proposed. By analyzing sequences of eye movements, the algorithm predicts students' learning behavior and identifies their current learning state. In practice, the design and application presented here achieved good teaching results and offer ideas for flipped classroom teaching design and for the design and development of course resources.

1. Introduction

With the rapid development of information technology and ongoing reform and innovation in teaching, the traditional teaching mode can no longer meet society's needs in the information age of education. The flipped classroom (FC) emerged in this context, changing the traditional teaching situation and bringing new vitality to education [1]. As FC has spread in China, an increasing number of educators have tried using it in their teaching, which has greatly enhanced knowledge transmission and improved teaching quality.

Traditional teaching emphasizes lecturing and neglects students' subjectivity. Teaching large classes is difficult: teachers must impart knowledge within a fixed time while also answering questions from all students [2, 3]. In China, improvements in educational information infrastructure have fully met the basic conditions for implementing FC and other innovative classroom practices. Under the influence of new educational thinking, teaching now places students' autonomous learning, self-construction of knowledge, and cooperation in a prominent position. As a result, task-driven FC teaching has developed rapidly and been widely implemented nationwide in a short time [4]. The FC teaching mode frees students from rigid, fixed-pattern instruction. Flexible knowledge transfer, engaging collaborative inquiry, and relatively timely Q&A feedback all bring new vitality to the traditional classroom, so that students can exercise their autonomy in a relaxed and enjoyable atmosphere, while teaching resources such as classrooms and teachers are used more efficiently [5].

At present, research on university FC focuses mainly on the FC teaching mode itself and on designing and producing the resources it requires, while little research constructs the FC teaching mode from the perspective of information technology [6]. Therefore, grounded in teaching practice, this study builds a university FC teaching mode on a network teaching platform combined with LSTM (Long Short-Term Memory) analysis, and integrates the platform organically with FC teaching, with the aim of reforming current university classroom teaching and innovating its teaching mode.

The main characteristics and innovations of this study are that it discusses the design, development, and application of university courses based on the flipped classroom concept, provides instructional design strategies for flipped classroom teaching in universities, constructs a general mode for universities to carry out such teaching, and offers reference ideas for university teaching reform.

In recent years, video-based human behavior recognition has shown broad application prospects and attracted growing attention from researchers in China and abroad, and it has produced internationally recognized achievements in computer vision [7–9]. One study uses the 3D coordinates of human joints collected from depth data and designs a new motion feature descriptor based on the differences between skeleton joints, which also encodes static posture [10]. Literature [11] represents video frames with HOOF (Histogram of Oriented Optical Flow) and identifies behavior categories by classifying HOOF time series. The multiscale feature extraction algorithms based on spatiotemporal interest points in literature [12, 13] have achieved satisfactory results. However, most traditional behavior recognition methods rely on hand-designed features, which are complicated and laborious to build.

Multiple visual cues in video are becoming increasingly important for recognizing complex human behavior, prompting many researchers to adopt multistream architectures. Literature [14] uses improved trajectories instead of optical flow to extract temporal information while keeping the spatial network unchanged: convolutional features are learned with a CNN (Convolutional Neural Network) and aggregated with trajectory-constrained sampling and pooling to form TDDs (Trajectory-Pooled Deep-Convolutional Descriptors); the local TDDs of the whole video are then aggregated into a global super vector with a Fisher vector and classified with an SVM (Support Vector Machine). Literature [15] treats pose information, motion information, and the original image as important visual cues and uses a Markov chain model that adds the cues sequentially to classify and detect behavior. Reference [16] proposed the C3D network, based on 3D convolution, and applied it to video behavior recognition. Another study tracks multiple people within a joint framework and identifies individual behaviors, interactions among people, and group activities in a unified way. Literature [17] proposes an architecture that combines a recurrent neural network (RNN) with an attention mechanism: the attention mechanism assigns each person a different weight in each frame to determine which two people are most relevant when an action occurs and thus identify the key figures, and the RNN then classifies the group behavior. Literature [18, 19] proposes a recurrent interactive context modeling scheme based on LSTM networks, which exploits the aggregation ability of LSTM to model individual dynamics, intragroup interactions, and intergroup interactions in a unified way.

To address these problems, this paper analyzes learners' eye movements to judge their learning state. Compared with traditional learning state detection methods, this approach can analyze the details of the learning process in depth and yield more accurate judgments.

3. Research Method

3.1. Working Principle of LSTM Model

In this paper, LSTM is used to capture long-term temporal information about group behavior in video sequences and to generate a comprehensive contextual feature representation for the final discrimination of group behavior. The working principle of the LSTM network is first introduced briefly; its schematic diagram is shown in Figure 1.

The memory cell of an LSTM network is controlled mainly by three gates that regulate the flow of information into and out of the cell: the input gate, the forget gate, and the output gate. These gates help the network avoid erroneous gradient updates, such as the vanishing gradients that affect plain RNNs on long sequences.

In the LSTM network, the forget gate decides what information the network discards, that is, how much of the previous cell state $C_{t-1}$ is retained in the current cell state $C_t$. The gate reads $h_{t-1}$ and $x_t$ and outputs a value between 0 and 1 for each element of $C_{t-1}$, where 1 means “completely retained” and 0 means “completely discarded.” The forget gate is computed as

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right),$$

in which $h_{t-1}$ represents the output of the LSTM network at the previous time step, $x_t$ represents the input at the current time step, $W_f$ represents the weight matrix of the forget gate, $b_f$ represents its bias, $\sigma$ represents the sigmoid function, and $[\cdot,\cdot]$ represents concatenation.

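The input gate decides what new information to store in the cell state $C_t$. This takes two steps.

First, a sigmoid layer determines which values need to be updated. The input gate is computed as

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right),$$

where $W_i$ is the input gate weight and $b_i$ is the input gate bias.

Then, a tanh layer generates a new candidate vector $\tilde{C}_t$, which is used to update the cell state, that is, $C_{t-1}$ is updated to $C_t$:

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right), \qquad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$

where $W_C$ and $b_C$ are the weight and bias of the candidate layer and $\odot$ denotes element-wise multiplication.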

The output gate determines what value to output. First, a sigmoid layer decides which parts of the cell state $C_t$ will be output:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right).$$

Then, the cell state $C_t$ is passed through tanh to obtain values between -1 and 1, which are multiplied by the output of the sigmoid gate, so that only the selected parts of $C_t$ are output:

$$h_t = o_t \odot \tanh\left(C_t\right),$$

where $W_o$ represents the weight of the output gate, $b_o$ represents its bias, and $h_t$ represents the final output of the LSTM network.
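The forget, input, and output gate computations described in this section can be made concrete with a single LSTM time step in NumPy. This is a minimal sketch: the function name `lstm_step`, the dictionary layout of the weights, the toy sizes, and the random initialization are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.

    W maps gate names 'f', 'i', 'c', 'o' to weight matrices of shape
    (hidden, hidden + input); b maps the same names to bias vectors (hidden,).
    """
    z = np.concatenate([h_prev, x_t])       # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate: how much of C_{t-1} to keep
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate: which values to update
    c_hat = np.tanh(W["c"] @ z + b["c"])    # candidate vector
    c_t = f_t * c_prev + i_t * c_hat        # new cell state C_t
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                # output / hidden state h_t
    return h_t, c_t


# Toy sizes and random weights, for illustration only
rng = np.random.default_rng(0)
hidden, n_in = 4, 3
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + n_in)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(hidden), np.zeros(hidden), W, b)
```

Because $h_t$ is a sigmoid output times a tanh output, every element of `h` lies strictly between -1 and 1.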

When representing individual behavior, the hidden state $h_t$ can be used to express the action performed by a single person at time $t$. The output of each cell changes over time based on its past memory content. Because gates control the information flow, the hidden state can retain a person's recent behavior over a short period. Therefore, the output of each LSTM unit can simply be passed to a softmax classification layer to predict each person's behavior at the individual level, forming a track of every person.
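The per-person softmax classification layer can be sketched as follows. The hidden states, the weight matrix `W_s`, the bias `b_s`, and the three action classes are hypothetical values chosen only to show the computation.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def classify_people(H, W_s, b_s):
    """H: (num_people, hidden) hidden states h_t of each person at one time step.

    Returns per-person class probabilities and predicted class indices.
    """
    probs = softmax(H @ W_s.T + b_s)        # (num_people, num_classes)
    return probs, probs.argmax(axis=1)


H = np.array([[0.9, -0.2], [-0.5, 0.7]])                  # two people, hidden size 2
W_s = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # three hypothetical action classes
b_s = np.zeros(3)
probs, labels = classify_people(H, W_s, b_s)
```

Each row of `probs` sums to 1, and `labels` holds one predicted action per person for this time step.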

3.2. Eye Movement Analysis Algorithm

Research has shown that when people acquire and process information, their eyes do not sweep rapidly across what they are viewing; instead, the gaze stays at a specific position long enough to fully process the information there and then jumps to the next position.

Based on this pattern, a series of eye movements can be described with two terms, “Fixation” and “Saccade” [20]. They are defined as follows. A “Fixation” is a state in which the line of sight is relatively stationary. A “Saccade” is the movement of the line of sight from one Fixation to the next.
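The paper does not state how frames are separated into Fixations and Saccades. One common option is velocity-threshold identification (I-VT), sketched below under that assumption: a frame is labelled a Saccade when the iris offset moves faster than a threshold between consecutive frames. The function name, the threshold value, and the sample offsets are hypothetical.

```python
import math


def label_ivt(offsets, dt, threshold):
    """Velocity-threshold (I-VT) labelling of gaze frames.

    offsets:   list of (x, y) iris-center offsets relative to the eye corner, one per frame
    dt:        sampling interval between frames, in seconds
    threshold: velocity threshold in offset units per second (hypothetical value)
    The first frame has no predecessor and is labelled 'fixation' by convention here.
    """
    labels = ["fixation"]
    for (x0, y0), (x1, y1) in zip(offsets, offsets[1:]):
        velocity = math.hypot(x1 - x0, y1 - y0) / dt
        labels.append("saccade" if velocity > threshold else "fixation")
    return labels


offsets = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (3.0, 1.0), (3.1, 1.0)]
frame_labels = label_ivt(offsets, dt=1 / 30, threshold=20.0)
```

Here only the jump from (0.1, 0.1) to (3.0, 1.0) exceeds the threshold, so the fourth frame is labelled a Saccade and the rest are Fixation frames.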

In this paper, eye movements are classified according to their characteristics: features of the eye movements are extracted with a deep learning method to classify them, from which the learner's current learning state is obtained. Eye movement behavior is divided into the two states “Fixation” and “Saccade,” which are used to form the eye movement quantity [21].

Eye movements are defined relative to the current learner's eye corner position: for each captured frame, the offset of the iris center relative to the eye corner is analyzed to determine the current eye movement behavior.

Definition: an eye movement is $e = (x, y, t)$, where $x$ represents the horizontal displacement of the current Fixation state relative to the previous Fixation state, $y$ represents the corresponding vertical displacement, and $t$ represents the duration of this eye movement behavior.

$e_i = (x_i, y_i, t_i)$ denotes the $i$th captured eye movement quantity, and each component of $e_i$ is calculated as follows:

$$x_i = \frac{1}{n_i}\sum_{g \in F_i} g_{dx} - \frac{1}{n_{i-1}}\sum_{g \in F_{i-1}} g_{dx}, \qquad y_i = \frac{1}{n_i}\sum_{g \in F_i} g_{dy} - \frac{1}{n_{i-1}}\sum_{g \in F_{i-1}} g_{dy}, \qquad t_i = T_i = \sum_{g \in F_i} g_{dt},$$

in which $g$ represents the line-of-sight (gaze) state calculated from the collected images. The gaze state of each frame is represented by the triplet $g = (g_{dx}, g_{dy}, g_{dt})$, where $g_{dx}$ and $g_{dy}$ represent the change in distance of the iris center relative to the eye corner in the horizontal and vertical directions, respectively, and $g_{dt}$ represents the time required to sample each frame.

Each Fixation state consists of several consecutive frames with similar gaze states, and $F_i$ represents the set of all gaze records in the $i$th Fixation state. The number of records in $F_i$ is $n_i$, and $T_i$ represents the duration of this Fixation state.
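One way to turn per-frame gaze records into eye movement quantities is sketched below. It assumes that the displacement of a Fixation is the difference between its mean iris offset and that of the previous Fixation, and that the duration is the sum of the per-frame sampling times; the function name and sample records are hypothetical.

```python
from statistics import mean


def eye_movements(fixations):
    """fixations: Fixation states in temporal order; each is a list of gaze
    records (dx, dy, dt) for consecutive frames.

    Returns one eye movement (x_i, y_i, t_i) for every Fixation after the first,
    since the first Fixation has no predecessor to be displaced from.
    """
    # Mean horizontal and vertical iris offset of each Fixation
    centers = [(mean(g[0] for g in F), mean(g[1] for g in F)) for F in fixations]
    # Duration of each Fixation: sum of per-frame sampling times
    durations = [sum(g[2] for g in F) for F in fixations]
    moves = []
    for i in range(1, len(fixations)):
        x_i = centers[i][0] - centers[i - 1][0]
        y_i = centers[i][1] - centers[i - 1][1]
        moves.append((x_i, y_i, durations[i]))
    return moves


F1 = [(0.0, 0.0, 0.04), (0.2, 0.0, 0.04)]
F2 = [(2.0, 1.0, 0.04), (2.2, 1.0, 0.04), (2.0, 1.0, 0.04)]
moves = eye_movements([F1, F2])
```

The resulting list of triples is the kind of sequence that would be fed to the LSTM classifier, one triple per time step.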

Eye movement analysis takes a sequence of eye movement quantities as input, and the network outputs the eye movement pattern to which the sequence belongs, i.e., a multi-input, single-output structure. The inputs are the extracted eye movement quantities; the specific model structure is shown in Figure 2.

As Figure 2 shows, a series of eye movements is obtained by extracting features from successive frames of the video stream, which spares the network from searching for low-level features and lets it focus on analyzing eye movement features. The consecutive eye movements are then used as the input sequence for training the recurrent (LSTM) network.

$h_t$ is the hidden state at step $t$, $h_0$ is the initial state, $\theta$ denotes the LSTM parameters, which are shared across all steps during training, and $y$ is the final output. The output $y$ takes one of three values, 0, 1, or 2, representing the reading, searching, and distraction states, respectively. Reading and searching are states of concentration, while distraction is a state of inattention.
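The multi-input, single-output structure can be sketched as a loop that runs one set of shared LSTM parameters over every eye movement and classifies only from the last hidden state. This is a minimal NumPy sketch with untrained random weights; the stacked weight layout, the hidden size, and the parameter names `W`, `b`, `V`, `c` are assumptions for illustration.

```python
import numpy as np

STATES = {0: "reading", 1: "searching", 2: "distraction"}
CONCENTRATED = {"reading", "searching"}


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def classify_sequence(seq, params):
    """Many-to-one LSTM over eye movements of shape (T, 3), i.e. (x, y, t) per step.

    params: 'W' (4H, H + 3) stacked gate weights, 'b' (4H,) biases,
            'V' (3, H) and 'c' (3,) for the final classification layer.
    """
    H = params["V"].shape[1]
    h, c = np.zeros(H), np.zeros(H)                  # initial states h_0, c_0
    for x_t in seq:                                  # same parameters at every step
        z = params["W"] @ np.concatenate([h, x_t]) + params["b"]
        f, i, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        c = f * c + i * np.tanh(z[3 * H:])           # cell state update
        h = o * np.tanh(c)                           # hidden state h_t
    y = int(np.argmax(params["V"] @ h + params["c"]))  # 0, 1 or 2
    return y, STATES[y], STATES[y] in CONCENTRATED


rng = np.random.default_rng(1)
H = 8
params = {"W": rng.normal(scale=0.1, size=(4 * H, H + 3)), "b": np.zeros(4 * H),
          "V": rng.normal(scale=0.1, size=(3, H)), "c": np.zeros(3)}
y, state, focused = classify_sequence(rng.normal(size=(5, 3)), params)
```

With trained weights, `focused` would report whether the predicted state counts as concentration (reading or searching) or inattention (distraction).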

3.3. FC Design of University Courses

This study starts from college classroom teaching and draws conclusions from the complete process of designing, implementing, testing, and gathering feedback on the relevant courses. Building on existing research results, the study mainly adopts the following research methods: literature research, case study, and quasi-experimental study.

3.3.1. Task-Driven FC Teaching Model Analysis

Unlike the traditional teaching mode, the task-driven FC teaching mode takes the analysis and partitioning of teaching content as the starting point of teaching activities. It divides the teaching content into several task stages according to their logical relationships and, for each stage, sets several levels of increasing difficulty according to students' learning level and needs [22, 23]. The task-driven FC teaching model is shown in Figure 3.

In the task-driven FC teaching mode, the teaching process is divided into two relatively independent but closely related parts. In the first part, the teaching task is broken down into several specific learning problems, and students collect materials and study around these problems, forming a problem-solving learning model. In the second part, students share their learning achievements, discuss the problems they encountered while producing them, and summarize good learning experiences and problem-solving methods, while teachers help students discuss and solve problems and summarize and consolidate their knowledge.

The task-driven FC teaching model mainly includes the following modules: problem and task setting; provision of teaching resources and student learning; classroom activity design; and teacher summary and knowledge consolidation. The FC structure is shown in Figure 4.

When setting problems and teaching tasks, teachers should fully consider students' learning level, cognitive structure, and existing knowledge; the requirements of the teaching content and of teaching development; the availability of teaching resources; and their own control of the teaching process and provision of teaching resources.

3.3.2. Design of Microvideo

In the traditional classroom, knowledge is transferred through the teacher's face-to-face lecturing, whereas in FC the teaching microvideo replaces this lecturing and becomes a critical factor in the success of FC. Teaching microvideos are designed and developed in two main ways: adapting existing videos and pairing them with corresponding discussion questions and exercises, or recording original microvideos with screen recording software or a video camera. Microvideos used in FC usually have the following features:
(1) Short duration, generally about 10 minutes. Based on the cognitive characteristics of primary and secondary school students, some educational theorists have concluded that students can concentrate for about ten minutes; to ensure learning efficiency, microvideos should be kept within this span.
(2) Targeted content. Because a microvideo is short, it must be concise and to the point. Only when its goal and theme are clear can students have a focused learning experience [24].
(3) Flexible playback. Microvideos should use a streaming media format that supports online playback, so that they can be watched online or downloaded to meet the needs of self-paced learning.

In this research, the author mainly used screen recording and, after comparing various screen recording tools, chose Camtasia Studio. Based on practical experience, the microvideo production process can be summarized as shown in Figure 5.

The first step in microvideo production is to determine the theme. A microvideo generally covers one knowledge point, which should be of moderate difficulty, explainable clearly in a short time, and easy for students to learn and master quickly.

The second step is preliminary preparation, which includes four parts: material collection, instructional design, courseware production, and script writing. Courseware production and script writing must be rigorous and meticulous, which saves considerable trouble during later editing.

The third step is video production, comprising recording and editing. Recording means capturing the screen of the courseware while the teacher explains it according to the script, which yields a preliminary teaching microvideo. This video is then edited: unnecessary parts are cut, improper recordings are corrected, and the video is polished, producing the final teaching microvideo.

3.3.3. Evaluation Design

FC uses two evaluation methods: process evaluation and summative evaluation. First, the author identified and screened index content according to the three main stages of FC (knowledge transfer, knowledge internalisation, and knowledge summarization), then determined the weight coefficients of the indices using an expert determination method, yielding the final evaluation index system. Figure 6 shows the evaluation architecture in detail.
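The paper does not detail the expert determination method. A simple variant, assumed here purely for illustration, has each expert assign a raw importance score to each index; the scores are averaged across experts and normalized so the weight coefficients sum to 1. The expert names and scores below are hypothetical.

```python
def expert_weights(scores):
    """scores: dict mapping expert -> dict of index -> raw importance score.

    Returns the index weights averaged over experts and normalized to sum to 1.
    """
    indices = next(iter(scores.values())).keys()
    avg = {k: sum(s[k] for s in scores.values()) / len(scores) for k in indices}
    total = sum(avg.values())
    return {k: v / total for k, v in avg.items()}


scores = {
    "expert_A": {"knowledge transfer": 3, "knowledge internalisation": 5, "knowledge summarization": 2},
    "expert_B": {"knowledge transfer": 4, "knowledge internalisation": 4, "knowledge summarization": 2},
}
stage_weights = expert_weights(scores)
```

Averaged scores of 3.5, 4.5, and 2 out of a total of 10 give weights of 0.35, 0.45, and 0.2. Methods such as Delphi rounds or AHP would replace the simple average if a more formal procedure were used.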

Unlike in traditional teaching, students must watch teaching videos, study related courseware materials, complete preclass exercises, and interact with teachers and classmates on the network platform before class. In class, students actively participate through collaborative learning: they solve problems encountered in preclass learning through discussion, exchange learning results, share what they have learned, raise questions and new opinions, and help other group members complete their learning.

Therefore, process evaluation should include two parts: online learning and classroom learning. Among them, online learning is divided into three indicators: video learning, preclass test, and topic discussion, and classroom learning is divided into three indicators: attendance, classroom performance, and process homework.

In FC teaching, the final exam should not be the only assessment method; innovative and creative works, such as products, patents, plans, reports, videos, programs, and papers, can also be used. Therefore, the summative evaluation index should be either the final exam or a creative work.

The evaluation methods of this course mainly include teacher evaluation, group peer evaluation, and objective evaluation (i.e., automatic grading by the online teaching platform). In the process evaluation stage, the classroom learning part is evaluated by teachers and groups, and the online learning part is evaluated automatically through tests on the online teaching platform. The summative evaluation stage is evaluated by the teacher.

Two basic principles govern the whole FC learning evaluation process. First, emphasize the process over the result: for example, process evaluation accounts for 60% of the final grade and summative evaluation for 40%. Second, emphasize achievements over examination papers: for example, in the final assessment, students mainly submit innovative and creative works instead of answering examination papers.
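The 60/40 split between process and summative evaluation can be written as a small grading function. The six process indicators follow the course's evaluation scheme (video learning, preclass test, topic discussion, attendance, classroom performance, process homework), but the equal weights within the process part and the sample scores are assumptions for illustration.

```python
PROCESS_WEIGHT, SUMMATIVE_WEIGHT = 0.6, 0.4   # 60% process, 40% summative


def final_score(process_scores, process_weights, summative_score):
    """process_scores and process_weights are dicts keyed by indicator;
    scores are on a 0-100 scale and the (hypothetical) weights must sum to 1."""
    if abs(sum(process_weights.values()) - 1.0) > 1e-9:
        raise ValueError("process weights must sum to 1")
    process = sum(process_scores[k] * w for k, w in process_weights.items())
    return PROCESS_WEIGHT * process + SUMMATIVE_WEIGHT * summative_score


indicators = ["video learning", "preclass test", "topic discussion",
              "attendance", "classroom performance", "process homework"]
weights = {k: 1 / 6 for k in indicators}      # equal weights: an assumption
scores = dict(zip(indicators, [90, 80, 85, 100, 75, 90]))
grade = final_score(scores, weights, summative_score=88)
```

Here the process score averages to about 86.7, so the final grade is 0.6 × 86.7 + 0.4 × 88 = 87.2.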

4. Analysis and Discussion

4.1. LSTM Behavior Recognition Analysis

In this paper, the collected video data are converted into behavior key sentences through steps such as target detection and key frame extraction. The extended behavior database combines part of the KTH database, data from the CASIA database, behavior data from earlier behavior recognition research, self-recorded data, and data generated through data augmentation.

The extended database contains behavior key sentences for six behaviors: walking, running, jumping, falling, squatting, and stooping. It contains 10651 behavior key sentences in total, with more than 1500 samples per behavior.

In this paper, RNN, LSTM, CNN, and SVM are used to classify the behavior key sentences, and two test methods are used to evaluate the LSTM results.

Sample set test: for each behavior in the data set, part of the data is randomly selected as test data and the rest is used as training data. Each behavior has about 1600 samples, of which about 1200 are randomly selected for training and about 360 for testing. The experimental results are shown in Figure 7.
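The sample set test amounts to a per-class random split. A minimal sketch follows; the placeholder sample IDs, the exact per-class counts (1600 samples, 1200 for training, 360 for testing), and the fixed seed are illustrative assumptions.

```python
import random


def per_class_split(samples_by_class, n_train, n_test, seed=0):
    """Randomly split each behavior class into disjoint training and test subsets.

    samples_by_class: dict mapping behavior label -> list of samples.
    Returns (train, test) as lists of (sample, label) pairs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_class.items():
        shuffled = samples[:]                 # copy so the input list is not modified
        rng.shuffle(shuffled)
        train += [(s, label) for s in shuffled[:n_train]]
        test += [(s, label) for s in shuffled[n_train:n_train + n_test]]
    return train, test


behaviors = ["walking", "running", "jumping", "falling", "squatting", "stooping"]
data = {b: [f"{b}_{i}" for i in range(1600)] for b in behaviors}   # placeholder IDs
train, test = per_class_split(data, n_train=1200, n_test=360)
```

Because every class contributes the same number of training and test samples, this split is balanced by construction.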

Cross-validation test: all data in the data set are divided into five equal parts; four parts are used as training data and the remaining part as test data. After cross-validation, the trained model parameters are saved. In the final test, a small amount of additional generated data is added for verification. The experimental results are shown in Figure 8.
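Five-fold cross-validation can be sketched as generating index splits over the whole data set, so that each sample appears in exactly one test fold. The shuffling seed and the round-robin fold assignment are illustrative choices; 10651 is the data set size reported in the text.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]     # k parts whose sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(10651, k=5))
```

Unlike the per-class split, this procedure does not force each fold to be balanced across behaviors; a stratified variant would assign folds within each class.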

The two tests show that the recognition rate of the cross-validation test is lower than that of the sample set test. However, the cross-validation results are more reliable and closer to performance on real data, because the sample set test uses balanced data for every behavior type in both training and testing.

Figure 9 shows the ROC (receiver operating characteristic) curves of the LSTM on the extended database in its best state after repeated training. The ROC curves show that the LSTM accurately classifies the six behaviors in the extended database, with an average recognition rate of 99%.
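For a multiclass classifier, an ROC curve is usually drawn per behavior in a one-vs-rest fashion. The sketch below computes ROC points by sweeping the decision threshold over one behavior's confidence scores and integrates them with the trapezoidal rule. The scores are hypothetical, tied scores are processed one at a time rather than jointly, and both positive and negative samples must be present.

```python
def roc_points(scores, positives):
    """One-vs-rest ROC for a single behavior.

    scores:    classifier confidences that each sample belongs to the behavior
    positives: booleans, True if the sample truly belongs to the behavior
    Returns (FPR, TPR) points obtained by lowering the threshold sample by sample.
    """
    pairs = sorted(zip(scores, positives), key=lambda p: -p[0])
    P = sum(positives)
    N = len(positives) - P
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in pairs:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


pts = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
area = auc(pts)
```

In this example one of the four positive/negative pairs is ranked in the wrong order, so the area is 0.75; a perfect ranking gives 1.0.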

Figure 10 shows the confusion matrices for abnormal behaviors in the extended database, giving the recognition rate of each classifier.
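A confusion matrix and the per-behavior recognition rates derived from it can be computed as sketched below. The labels and predictions are hypothetical; rows are true behaviors and columns are predicted behaviors, so each recognition rate is a diagonal entry divided by its row total.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true behavior, columns the predicted behavior."""
    index = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m


def recognition_rates(m, labels):
    """Per-behavior recognition rate: correct predictions / samples of that behavior."""
    return {lab: m[i][i] / sum(m[i]) for i, lab in enumerate(labels)}


labels = ["walking", "running"]
y_true = ["walking", "walking", "running", "running"]
y_pred = ["walking", "running", "running", "running"]
m = confusion_matrix(y_true, y_pred, labels)
rates = recognition_rates(m, labels)
```

An off-diagonal entry such as `m[0][1]` counts walking samples misclassified as running, which is the kind of confusion discussed for CNN below.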

These results show that all algorithms achieve high recognition rates for squatting and fainting, two behaviors that are also easy to tell apart visually. The other behaviors are also classified well. Under the same test set, the deep learning algorithms are generally more accurate, whereas SVM is relatively less accurate.

SVM classification depends mainly on the boundary data (the support vectors), whereas deep learning algorithms classify according to the spatial distribution of all data in the data set. Therefore, when a large amount of data is available, a model trained with deep learning generalizes better than one trained with SVM.

CNN's recognition rate is lower. Its errors are concentrated on “running” and “walking,” which has a cause at the data level: CNN excels at extracting and classifying visual image features, so it is understandable that it performs poorly on data whose visual image features are not distinctive.

LSTM and RNN both handle time series data well because both have a degree of “memory.” RNN performs well on short sequences, whereas LSTM performs well on both short and long sequences. This explains why RNN and LSTM achieved higher recognition rates in this experiment.

4.2. Analysis of Application Effect of FC Teaching Mode

To obtain a more objective evaluation of FC teaching and better capture students' feedback on FC teaching methods, after the second round of action research the author surveyed students' views on FC teaching with a questionnaire. The questions and the analysis of the results are as follows.

The first question was “Do you like FC learning?” The results are shown in Figure 11.

The results indicate that students generally like FC learning, suggesting that the FC teaching adopted in this action research achieved good teaching results.

For the question “What do you think of the development prospects of the FC teaching method in education?”, the results are shown in Figure 12.

Most students believe that this teaching method has good development prospects in education; only 4.49 percent believe its prospects are limited. Students also agreed with designing this course around the FC teaching method, which shows a positive attitude toward it.

For the question “Do you believe FC can improve the teaching effect compared with traditional classroom teaching?”, 71.28 percent of students chose “it can be greatly improved.” This indicates that students consider this teaching method highly beneficial for improving the teaching effect. Figure 13 shows the survey results.

The implementation effect of the scheme can be evaluated by combining quantitative and qualitative methods. Using stage tests and final exams, the test scores of the class taught with the FC mode are compared and statistically analyzed against those of the class taught with the traditional mode to determine the influence of the FC mode on the teaching effect, and the teaching effect and remaining problems are then analyzed and summarized.
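The paper does not name the statistical test used to compare the two classes. One common choice for two independent groups with possibly unequal variances is Welch's t-test, sketched below with the standard library; the score lists are hypothetical and do not come from this study.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)     # squared standard error of group a
    vb = variance(b) / len(b)     # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


fc_scores = [82, 88, 75, 91, 85]      # hypothetical FC class test scores
trad_scores = [78, 80, 72, 84, 76]    # hypothetical traditional class test scores
t, df = welch_t(fc_scores, trad_scores)
```

For a p-value, the statistic is compared against a t distribution with `df` degrees of freedom; SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test directly.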

Based on these suggestions and opinions, the author concludes that FC teaching effectively promotes students' learning and improves their learning skills, and that the FC teaching method is very popular among students. A shortcoming of this study's practice is that the allocation of teaching content and time needs adjustment. More hands-on practice time could be added so that students have more opportunities to communicate with teachers or peers when they encounter difficulties, and so that teachers can learn which problems students are likely to encounter and improve the next round of teaching.

5. Conclusion

After consulting a large number of documents and materials, the author gained a better understanding of cloud platforms, FC, and related theories. This research uses a cloud platform as the technical support for FC to develop an LSTM-based flipped classroom teaching model. The FC network course was developed, the experimental process was designed, and the FC mode based on the network teaching platform was implemented in the course according to the instructional design. Building on eye movement behavior analysis, this paper proposes an LSTM-based eye movement analysis algorithm that detects whether learners are in a reading, searching, or distracted state while looking at the screen. After a semester of teaching practice, students' overall learning quality and satisfaction were investigated through questionnaires and interviews, the findings were analyzed, and the research conclusions were drawn.

This study aims to develop a teaching model, with emphasis on its analysis, construction, and concrete implementation; a formal validity verification of the teaching model is not part of the research process. Instead, students' learning effect was assessed through their enthusiasm for participating in the learning process, their completion of learning tasks, and how they solved learning problems.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares that there are no conflicts of interest.