Abstract

With the wide application of information technologies such as big data, the Internet of Things, and cloud computing, a large amount of personal information and daily behavior data has accumulated from college students’ daily study and life. An important open problem is how to dynamically integrate multidimensional student information, including multi-indicator behavior data and comment texts, to build accurate student portraits and identify students with abnormal behavior. To address abnormal behavior prediction for college students in the big data environment, this paper proposes an abnormal behavior prediction method integrating multiple indicators of student behavior and text information (ABPM-IMISBTI). First, to handle the multidimensionality, timeliness, and dynamics of student behavior information fusion in the construction of student behavior portraits, students’ objective tags and subjective tags are integrated, and an optimized K-means algorithm for a cloud platform environment is proposed. Second, to address the insufficient analysis of text information in existing work on students’ abnormal behavior, the ABPM-IMISBTI method predicts abnormal behavior with long short-term memory (LSTM) networks combined with multi-index student behavior data and text information. Finally, student achievement prediction is used as an example for verification. The experimental results show that, compared with other prediction methods, ABPM-IMISBTI improves the accuracy of student behavior prediction and thus helps identify students’ abnormal behavior quickly, which can improve education management in universities and promote the development of safe campuses, smart campuses, and smart education.

1. Introduction

In recent years, emerging information technologies represented by big data, cloud computing, and the Internet of Things have further deepened education reform. The information revolution brought by big data is changing people’s daily lives, ways of thinking, and work [1, 2]. Since the State Council of China issued the “Outline of Action for Promoting the Development of Big Data” in 2015, the State Council and the Ministry of Education have issued more than 10 documents that discuss the use of big data to promote educational reform, covering precise governance in colleges and universities and the reform of teaching models. The development and utilization of educational big data have become an important part of college informatization. Studying the patterns of students’ academic, psychological, and consumption behavior in the big data environment is an important research direction for big data technology in education informatization.

At present, many information systems, such as dormitory return records, behavior analysis, book borrowing, and class attendance statistics, have been built on university campuses. They collect a large amount of behavioral data on students’ daily study and life, and a relatively complete campus big data environment has been formed [3]. However, with the wide application of new media, college students generate a large amount of network behavior data when surfing the Internet, and these data have multidimensional characteristics. Students face a more complex and diverse social network environment, which makes their behavior more diverse and increases the difficulty of student behavior analysis, prediction, early warning, and guidance. In colleges and universities, students with academic difficulties, psychological barriers, and life difficulties, especially those who may experience extreme events, are the groups that need the attention of college administrators. The key questions are therefore how to integrate multidimensional student behavior information, uncover potential student behavior patterns, realize abnormal behavior analysis and early warning, give college administrators a comprehensive understanding of students’ learning and living conditions, and provide accurate services and management.

Research on multidimensional student information fusion and abnormal student behavior prediction has been carried out both in China and abroad. Luo et al. [4] studied a cross-curricular teaching method combining network security and a big data analysis course, and used big data analysis technology for network log analysis. Jing et al. [5] proposed a mental health intelligence assessment method based on multisource information fusion; combining the data description of UPI and SCL-90 with the corresponding task demand analysis, they designed and implemented a multisource information fusion visualization based on the Circos graph. Cao et al. [6] quantitatively analyzed the relationship between student behavior and academic performance. By collecting the behavior records of undergraduates, they proposed two high-level behavioral characteristics, namely, orderliness and diligence. The experimental results showed that orderliness is a predictor of academic performance and therefore an important feature for performance prediction. Yue et al. [7] addressed the lack of multidimensional computing in current research: they analyzed learner engagement in a digital learning environment, proposed an integrated framework, and identified learning engagement from three aspects (affective, behavioral, and cognitive states) to accurately predict the effect of student learning. Hongwei et al. [8] proposed a multisegment semantic spatiotemporal graph convolutional network (MFSTGCN) model based on the spatiotemporal graph network structure, considering the campus activity time series, hierarchical correlation, and spatial semantic feature correlation. However, data richness, data granularity, and data preprocessing are not considered sufficiently. Nam and Samson [9] integrated student behavior and incoming profiles to achieve student learning prediction in higher education STEM courses. However, the model does not fully consider profile factors conveyed by additional informative features, such as student study notes. Yao et al. [10] constructed a multitask prediction method based on a learning-to-rank algorithm for academic performance prediction, which considered factors such as school-term correlation and interprofessional correlation, and integrated students’ similarity to predict students’ performance. However, this method does not adequately integrate students’ network behavior information, analyze and mine the characteristics of each dimension of students’ behavior, or realize real-time prediction of abnormal student behavior.

At present, multidimensional information fusion methods focus on extracting single-dimensional features of text, ignore the multidimensional features of the semantic representation of students’ behavior, and do not fully consider factors such as the timeliness and weight of student behavior information fusion [11–13]. Therefore, an urgent problem is how to dynamically integrate the multidimensional information of students, consider the timeliness and dynamics of student characteristics and the similarity of students’ behaviors, construct accurate student portraits and ideal student classification groups, and then improve the accuracy of students’ abnormal behavior prediction. At present, the main problems faced by students’ abnormal behavior prediction are as follows:

(1) Student behavior portraits with deep integration of multidimensional information. Student behavioral portraits are centered on students. By mining various campus behavioral data, such as students’ learning and life behaviors, students are given behavioral portrait labels [14]. With the continuous development of informatization in colleges and universities, many information systems related to student behavior have been formed. However, student behavior information is often scattered across multiple data sources and different dimensions, and a single data source or dimension contains only fragments of a student’s behavior. It is therefore difficult to accurately mine the complete behavior patterns of students, which results in low accuracy of abnormal behavior prediction and early warning [15, 16]. Meanwhile, the deep fusion of multidimensional information also faces problems such as data dynamics, diversity, and the complexity of feature extraction. The purpose of deep fusion of multidimensional information is to organically combine different types of information after feature extraction through several fusion methods, make full use of student feature attributes, fuse multimodal information such as images and text, and comprehensively mine student behavior rules. Therefore, given the multidimensional information on student behavior in the campus big data environment, using the similarity data of students to conduct in-depth fusion is a challenging problem for student behavior portraits and abnormal student behavior prediction.

(2) Low accuracy and poor timeliness in predicting abnormal student behavior. In the era of big data, the amount of data has increased dramatically, the types of data have diversified, and the time-sensitive features contained in the data have received increasing attention [17]. Traditional student data analysis and early warning focus on the analysis and mining of historical data and lack timeliness, and the behavioral early warning mechanism still relies on manually set thresholds, which is not in line with the development direction of educational big data [18]. Text information is an important dimension affecting student behavior analysis, and it is difficult to analyze directly with traditional natural language processing techniques. In addition, multiple dimensions such as class attendance rate, book borrowing, one-card consumption, and dormitory return records are all important factors that constitute abnormal student behavior scenarios. Traditional student abnormal behavior prediction therefore suffers from low timeliness and accuracy. Based on the analysis of student behavior portraits and the deep integration of multidimensional information, how to accurately predict abnormal student behavior by considering the similarity of students’ behavior and text information is another problem that needs to be addressed.

To solve these two problems, we propose the ABPM-IMISBTI method, which integrates the multidimensional information of students and constructs accurate student behavior portraits. The advantages of the proposed method are as follows: it makes full use of multidimensional and multimodal information to construct an accurate student behavior profile, which provides a basis for the prediction of abnormal student behavior; and it integrates student text behavior information into the prediction, thereby improving the accuracy of abnormal behavior prediction. The main contributions are as follows:

(1) We propose a method for constructing student behavior portraits based on the deep fusion of multidimensional information. After data preprocessing, all kinds of student behavior data are clustered according to different levels and dimensions to form a portrait feature library. Cluster analysis is performed on the Spark platform, and an accurate student behavior portrait is finally constructed.

(2) We propose the ABPM-IMISBTI method, which uses a neural factorization machine and a long short-term memory neural network to predict abnormal student behavior by combining multi-index data and text information of student behavior. It further improves prediction performance, so that students with abnormal behavior can be detected in time and timely intervention and supervision can be provided.

(3) We designed the analysis of behavior prediction results on the Spark platform to further verify the accuracy of the ABPM-IMISBTI method, and analyzed the relationships among students’ behaviors as well as the relationship between students’ physical exercise and GPA.

After this Introduction, Section 2 introduces some related works. Section 3 describes the abnormal behavior prediction method. We propose the ABPM-IMISBTI method, which combines multiple indicators of student behavior and text information. Section 4 describes experimental results and analysis; we use real data to conduct a predictive analysis of student-behavior relationships and abnormal behaviors. Conclusions and possible future works are given in Section 5.

2. Related Work

The work related to this paper covers two aspects: (1) student behavior portrait research; (2) student abnormal behavior research.

2.1. Research on Student Behavior Portrait Based on Deep Fusion of Multidimensional Information

Multidimensional data fusion is the premise of applying big data technology in specific fields, and early behavioral data fusion methods mainly focus on semantics [19] and similarity [20]. Inspired by multisource data fusion theory, many researchers have integrated various data sources on campus, including data on motion trajectories, student learning, student consumption, and classroom attendance, and used machine learning (ML) classifiers such as support vector machines (SVM) to predict students’ course grades [21–23]. Multidimensional student information fusion combines the data generated by different data sources to mine student behavior rules more comprehensively [24]. Because student behavior data are dynamic, complex, and multidimensional, groups of students with abnormal behavior cannot be identified from a single-dimensional perspective. Multiple dimensions are involved, such as dormitory return records. The dormitory return dimension alone has multiple subfeatures, such as how often a student returns early, how often a student returns late, and how often a student does not return to the dormitory at night. Only by integrating this multidimensional information can students with abnormal behavior be identified more accurately.

Researchers in China and abroad have also studied student behavior portraits based on multidimensional information fusion [25–27]. Li et al. [28] proposed an adaptive Web API recommendation method that integrates multidimensional information to recommend Web APIs for Mashup creation. Zhu et al. [29] proposed a hybrid recommendation model that combines network structural features with neural networks, as well as user interaction activities and tensor factorization. Chen et al. [30] designed a visual analysis method for college students’ mental health based on multisource questionnaire data to effectively mine the connections between multisource questionnaires and reduce the uncertainty of mental health analysis. However, this method covers only a limited set of questionnaires and does not consider combining students’ online behaviors or integrating more qualitative and quantitative data to analyze students’ psychological problems. Liu et al. [31] obtained a low-dimensional vector representation of students through a LINE-based network embedding method to calculate the similarity between students, but this method loses the semantic information contained in the original data and cannot be extended to the fusion of multiple data sources. Bo et al. [32] proposed an eigenvalue extraction model based on the ID3 machine learning algorithm by establishing the characteristics and feature attribute groups of student behavior portraits, and constructed student behavior portraits based on big data. Zhang et al. [33] established a portrait of college students based on Hadoop big data processing technology, used HDFS for data storage, and applied Canopy and K-means-based clustering algorithms to perform multidimensional analysis of student data. Li et al. [34] proposed a user model with five dimensions (students’ basic information, learning ability, consumption level, daily habits, and interests and preferences) and extracted the characteristic attributes of students through data collection, processing, and mining. Ding et al. [35] proposed a hybrid neural network model to mine the data of college students and build student portraits, thereby helping students’ personal development and improving the quality of school teaching.

At present, multidimensional information fusion methods focus on extracting single-dimensional features of texts, ignore the multidimensional features of the semantic representation of student behavior, and do not fully consider factors such as the timeliness and weight of student behavior information fusion [36]. Zhang et al. [37] proposed a student portrait construction algorithm based on K-means optimization. The Canopy algorithm was used for preliminary clustering, which eliminated the uncertainty of K value selection, and the resulting samples were used as the initial centers of K-means. However, the algorithm only conducts a preliminary analysis on the two dimensions of student consumption and learning; the scale of student data in the experiment is relatively small, and the analysis dimensions are insufficient. Mainstream student behavior profiling methods are generally based on technologies such as machine learning, support vector machines, and supervised learning [38, 39]. However, these methods do not fully consider the security and timeliness of students’ personal information or the integration of multimodal information such as images and texts. Therefore, an urgent problem is how to dynamically integrate the multidimensional information of students, consider the timeliness and dynamics of student characteristics and the similarity of students’ behaviors, construct accurate student portraits and ideal student classification groups, and then improve the accuracy of students’ abnormal behavior prediction.

2.2. Research on Abnormal Behaviors of Students

Deng et al. [40] used a Hadoop-based K-means clustering algorithm to cluster user behaviors, and the mined association rules were used as users’ preferences for accessing the campus network, thereby improving the efficiency and accuracy of student behavior analysis and prediction. You et al. [41] proposed a hybrid neural network method based on a high-order attention mechanism, using generative adversarial networks to simulate students’ learning behavior, mine missing data, and quickly classify students’ studies, but the model does not handle online learning detection well. Zeng et al. [42] proposed an attentive prediction model for academic abnormalities. Nie et al. [43] addressed the problem of precise poverty alleviation in colleges and universities: based on student behavior data and the time series characteristics of college data, they extracted multidimensional features from students’ basic information and behavior data and proposed a CW-LSTM algorithm based on deep learning theory for prediction. Yu et al. [44] analyzed the decision tree, a single-classifier algorithm, and the random forest (RF), an ensemble learning algorithm, and constructed an online student achievement prediction model using the RF algorithm. Xu et al. [45] proposed a novel hybrid IDA-SVR-based model to predict student performance, in which an improved decision algorithm (IDA) is used to optimize support vector regression (SVR). Maksimova et al. [46] combined a decision tree and a rule model to establish a classification rule set and constructed a learning behavior diagnosis model combining a decision tree and a deep neural network, but this model is limited by factors such as missing and imbalanced data. When predicting academic performance, the learning behavior data in that study fail to cover a variety of other factors that affect academic performance in the university study scenario. Meanwhile, with the diversification of students’ online behavior, text information and images have become important factors in student behavior analysis. Traditional prediction and early warning models of students’ abnormal behavior do not make full use of this information, resulting in ambiguity, uncertainty, and large errors. Therefore, it is a challenging scientific problem to consider students’ real-time behavior data and text information, improve the prediction accuracy of students’ abnormal behavior, and move from after-the-fact response to early warning of abnormal student behavior.

However, in data processing, traditional analysis and early warning of abnormal student behavior focus on the analysis and mining of historical data, which lacks dynamics and timeliness [47]. Cao et al. [48] established a student portrait evaluation index system, collected various data on college students’ academic performance, normalized the collected data, determined the weight of each evaluation index through the analytic hierarchy process (AHP), and then used a fuzzy evaluation model based on big data to evaluate various dimensions of college students’ academic performance. Zhao et al. [49] integrated various data on college students’ movement trajectories, consumption, and social behavior, analyzed the correlation between students’ performance and social relations, and used a support vector machine (SVM) classifier to predict college students’ English grades. Li et al. [50] established a combined data mining model by integrating a decision tree, a neural network, and the naive Bayes algorithm, and built a Spark-based college student behavior analysis and prediction platform. Filvà et al. [51] collected and analyzed the behavior data generated by students in Scratch-based programming activities, and conducted a predictive analysis of student interest to help teachers conduct assessments. Yang et al. [52] analyzed the characteristics of campus big data and, using the traditional K-means clustering algorithm on the mainstream Hadoop open source platform, proposed a college student behavior early warning system based on the Internet of Things and big data environment. Xie et al. [53] proposed a deep learning algorithm to evaluate college students’ classroom posture, using K-means clustering (KMC) to cluster different student groups and identify the characteristics of each group. Liu et al. [54] proposed a deep learning-based method for identifying the abnormal behavior of students in classroom videos, aiming at the complex and slow process of traditional human behavior recognition. Shi et al. [55] proposed a prediction algorithm based on the BP neural network, which predicts the situation of borrowing books through course performance, and built an early warning model of student performance. However, the data sample of that study is not large enough, and the borrowing information is insufficiently utilized, which is not conducive to predicting students’ performance. Yang et al. [56] used students’ homework data to predict students’ course grades in Moodle through students’ procrastination behavior, but the data used for grade prediction were not comprehensive; more student activity data could be used, such as student text data, student learning resources, and network access records. Meanwhile, existing methods still have deficiencies in introducing deep learning, integrating student behavior text information, and intelligently identifying unknown abnormal behaviors in the network [57–59]. Many scholars have analyzed the abnormal behavior prediction and early warning of students. The comparison of abnormal student behavior methods is shown in Table 1.

3. Abnormal Behavior Prediction by Integrating Multiple Indexes of Students

3.1. Data Preprocessing

At present, many business systems that collect student images, such as face-recognition access control in student apartments, intersection monitoring, and classroom attendance, have been built on college campuses. When searching for student trajectories, student classroom behaviors, abnormal behavior trajectories in specific areas, blacklists, and special students, these image data need to be processed effectively and in real time. In scenarios such as public opinion analysis, the relevant student text information also needs to be analyzed. Therefore, for student image and text data, this paper uses data preprocessing based on deep learning.

Data preprocessing is an important step in data mining and knowledge discovery, and it is also the basis for extracting student behavior characteristics. In this study, data preprocessing refers to a series of data processing steps: data collection, data cleaning, behavioral feature extraction, and data transformation. Data collection provides the basic data for students’ behavioral portraits. Data cleaning handles missing values and outliers in the students’ basic data. Behavioral feature extraction obtains the data required for behavior analysis from the basic data. Finally, data transformation eliminates the differences in magnitude between different kinds of student behavior data, which provides a unified standard for establishing the student behavior portrait model.
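The cleaning and transformation steps described above can be illustrated with a minimal sketch. The paper does not specify the exact rules, so the choices here are assumptions made only for illustration: missing values are filled with the median, outliers are clipped to a plausible range, and magnitudes are aligned with min-max scaling. The function names and sample records are hypothetical.

```python
from statistics import median


def clean_column(values, lower, upper):
    """Fill missing values (None) with the median of the valid values and
    clip outliers to the plausible range [lower, upper] (assumed rule)."""
    valid = [v for v in values if v is not None and lower <= v <= upper]
    fill = median(valid)
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(fill)
        else:
            cleaned.append(min(max(v, lower), upper))
    return cleaned


def min_max_scale(values):
    """Map values to [0, 1] so indicators of different magnitude are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical weekly records for five students.
borrow_counts = [2, None, 5, 0, 40]          # 40 is treated as implausibly high
spend = [120.0, 95.5, None, 300.0, 80.0]     # one-card consumption

borrow_clean = clean_column(borrow_counts, lower=0, upper=20)
spend_clean = clean_column(spend, lower=0.0, upper=1000.0)

# One feature vector per student, with both indicators on the same scale.
features = list(zip(min_max_scale(borrow_clean), min_max_scale(spend_clean)))
```

After scaling, book borrowing counts and consumption amounts both lie in [0, 1], so neither dominates a distance-based method such as K-means.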

This paper adopts a parallel processing framework to improve the efficiency of data preprocessing. It uses a nondestructive cleaning algorithm based on a distributed processing framework (Hadoop) and uses Map-Reduce to process unstructured and semistructured data. Take student library borrowing data as an example. Structured data are fixed information, such as student borrowing information, browsing information, and library book information, which can be expressed in two-dimensional form in a relational database. Unstructured data are items such as browsing, commenting, and page retention time when a student borrows from the library; these data are random and have no fixed form. Semistructured data lie in between; they follow a data model suitable for database integration, that is, suitable for describing data contained in two or more databases.

This paper adopts VGG16 for image data preprocessing and for identifying abnormal images of students. Like image data, text information must be converted into vectors before analysis, and this paper uses the Word2Vec method for text vectorization. With the cleaning algorithm described above, the filtered data are stored separately according to the reason they were filtered out, so no data are lost outright. The number of records filtered out by each cleaning rule can be queried from the corresponding storage path. When a large amount of data is discarded, the abnormal data can be inspected, the proportion of each type of abnormal data can be counted, and the distribution of abnormal data over the cycle can be analyzed. Meanwhile, the daily data volume is continuously monitored. After the abnormal data have been analyzed, a reasonable abnormal data threshold is set; once the proportion exceeds the threshold, the early warning mechanism is called and an email is sent so that the problem can be handled. The preprocessing of student behavior data is shown in Figure 1.
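The nondestructive cleaning idea in this paragraph can be sketched in a few lines. This is an illustrative, single-machine sketch and not the paper's Hadoop implementation: the rule names, the record fields, and the 0.2 threshold are assumptions, and the email step is reduced to a boolean flag.

```python
from collections import defaultdict

# Hypothetical cleaning rules: name -> predicate that is True when a record is INVALID.
RULES = {
    "missing_student_id": lambda r: not r.get("student_id"),
    "negative_amount": lambda r: r.get("amount", 0) < 0,
}


def clean(records, rules=RULES):
    """Split records into kept rows and per-rule reject buckets.
    Rejected rows are stored by reason instead of being discarded."""
    kept, rejected = [], defaultdict(list)
    for record in records:
        reason = next((name for name, is_bad in rules.items() if is_bad(record)), None)
        if reason is None:
            kept.append(record)
        else:
            rejected[reason].append(record)
    return kept, dict(rejected)


def abnormal_ratio(kept, rejected):
    """Proportion of records that were filtered out by any rule."""
    total = len(kept) + sum(len(rows) for rows in rejected.values())
    return 0.0 if total == 0 else (total - len(kept)) / total


def should_alert(kept, rejected, threshold=0.2):
    """True when the abnormal proportion exceeds the (assumed) threshold."""
    return abnormal_ratio(kept, rejected) > threshold


records = [
    {"student_id": "s001", "amount": 12.5},
    {"student_id": "", "amount": 8.0},
    {"student_id": "s002", "amount": -3.0},
    {"student_id": "s003", "amount": 20.0},
]
kept, rejected = clean(records)
counts = {reason: len(rows) for reason, rows in rejected.items()}  # rows filtered per rule
alert = should_alert(kept, rejected)
```

Keeping one bucket per rule is what makes the per-rule counts and the abnormal-data proportions in the text cheap to compute; in a distributed setting each bucket would correspond to a separate storage path.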

3.2. Construction of Student Behavior Portrait Based on Deep Integration of Multidimensional Information

At present, the further mining and analysis of educational big data have a profound impact on university management, the training model, and even the entire education system. Establishing student behavioral portraits has become one of the important ways to apply big data to guide students’ behavioral norms. The student behavior portrait in the big data environment is used to classify the various behavior data of students; it is therefore a collection of data and semantic behavior characteristics. This paper starts from the original data and obtains objective labels through statistical analysis. Objective labels can be obtained with data preprocessing technology and reflect the real data. To better describe a student’s comprehensive ability, the objective labels are further modeled and analyzed to obtain subjective labels, for example by analyzing, from different dimensions, the characteristics of students’ online behavior, consumption amount, dormitory return, course attendance, and book borrowing. In other words, student behavior portraits are constructed by describing students’ study habits and living habits.
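The step from objective labels (statistics computed from raw data) to subjective labels (an interpretation of those statistics) can be sketched for the dormitory return dimension. The cut-off hours, the limits, and the label names below are hypothetical and chosen only to make the example concrete; the paper does not state them.

```python
def dorm_return_features(return_hours, early_before=21.0, late_after=23.0):
    """Objective labels: statistics over one student's nightly return times.
    Each entry is the return hour (e.g. 22.5 = 22:30) or None if the student
    did not return that night. The cut-off hours are assumed values."""
    nights = len(return_hours)
    early = sum(1 for h in return_hours if h is not None and h < early_before)
    late = sum(1 for h in return_hours if h is not None and h > late_after)
    absent = sum(1 for h in return_hours if h is None)
    return {
        "early_rate": early / nights,
        "late_rate": late / nights,
        "absent_nights": absent,
    }


def subjective_label(features, late_limit=0.3, absent_limit=2):
    """Subjective label: a coarse interpretation of the objective statistics
    (the limits are illustrative, not taken from the paper)."""
    if features["absent_nights"] >= absent_limit or features["late_rate"] > late_limit:
        return "irregular_routine"
    return "regular_routine"


week = [20.5, 22.0, 23.5, None, 23.8, 21.5, None]  # hypothetical week of records
objective = dorm_return_features(week)
label = subjective_label(objective)
```

In the paper's setting such fixed limits would be replaced by the cluster-based labels described in the following paragraphs; the sketch only shows how the two label levels relate.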

Student behavior portrait analysis refers to the process of dividing the preprocessed behavior data of students into different behavior categories according to different levels and dimensions. Take the portrait characteristics of outstanding students and of ordinary students as an example. First, by analyzing the clustering categories of the various types of student behavior data, a data-based behavior label set is constructed to describe different types of students, and the behavioral characteristics of outstanding students and of ordinary students are identified. Second, the data-based label set is analyzed to obtain situational behavioral characteristics, and the behavior labels are verified and corrected. Finally, the behavioral characteristics of outstanding and ordinary students are checked against real student groups for accuracy; the student behavior labels are modified and improved according to the feedback from this verification, and the verified behavior characteristics are integrated into the behavior label set, which also provides a basis for the analysis of abnormal student behavior scenarios.

The traditional K-means algorithm selects its initial cluster centers at random, and the quality of its clustering results depends heavily on this selection, which makes the algorithm unstable. Its time complexity is also relatively high, so the algorithm needs to be optimized to improve its efficiency. This paper clusters the various behaviors of students with an optimized K-means algorithm in the cloud platform environment to construct a set of student behavior labels. The construction process of the student behavior portrait is shown in Figure 2.

The basic idea of optimizing the K-means algorithm in the cloud platform environment is as follows. K data samples are randomly selected from the data set as the initial cluster centers, and each iteration corresponds to one Map-Reduce task. The computation of the distance from each data sample to the cluster centers and the assignment of the sample to a cluster are written into the Map function, and the cluster center information is then updated through the Reduce function. After repeated iterations, the algorithm converges to stable cluster centers. The flow of the optimized K-means algorithm in the cloud platform environment is shown in Figure 3.
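The Map and Reduce roles described above can be sketched on a single machine. This is an illustrative sketch rather than the paper's Hadoop or Spark implementation: `map_assign`, `reduce_update`, and the toy data are hypothetical names and values, and the shuffle between the two phases is simulated with an in-memory dictionary.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(sample, centers):
    """Map step: emit (index of the nearest center, sample)."""
    nearest = min(range(len(centers)), key=lambda k: squared_distance(sample, centers[k]))
    return nearest, sample


def reduce_update(samples):
    """Reduce step: the new center is the component-wise mean of one cluster."""
    n = len(samples)
    return tuple(sum(dim) / n for dim in zip(*samples))


def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)            # random initial centers, as in the text
    for _ in range(max_iter):                # one iteration = one Map-Reduce task
        groups = defaultdict(list)
        for sample in data:                  # map phase
            key, value = map_assign(sample, centers)
            groups[key].append(value)        # stands in for the shuffle
        new_centers = [
            reduce_update(groups[j]) if groups[j] else centers[j]  # keep center of an empty cluster
            for j in range(k)
        ]
        if new_centers == centers:           # assignments are stable: converged
            break
        centers = new_centers
    labels = [map_assign(sample, centers)[0] for sample in data]
    return centers, labels


# Two hypothetical groups of students described by two normalized indicators.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(data, k=2)
```

Because the map step only needs the current centers and one sample, it can run independently on each data partition, which is what makes the iteration suitable for a distributed platform.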

3.3. Abnormal Behavior Prediction Integrating Multiple Indicators of Student Behavior and Text Information

Traditional student behavior prediction methods still have shortcomings in using multi-index data and comment texts on student behavior and in mining these data, especially when semantic analysis is required, as in student network behavior analysis, public opinion analysis, and psychological analysis. When students search for or post about topics such as loans, suicide, and cults, they may leave comment information, and the text information of such students needs to be analyzed. Because the text or annotations on student behavior are short, and not all words and topics in them are relevant to student behavior analysis, the topic factors usually extracted from comments with latent Dirichlet allocation topic distributions cannot fully reflect the relevance to student behavior. In addition, the bag-of-words models used often ignore word order. In response to these problems, this paper proposes the ABPM-IMISBTI method, which uses a neural factorization machine and LSTM to predict abnormal student behavior by combining multi-index student behavior data and text information, and further improves the prediction performance.

Student behavior multi-index data and text information are different manifestations of student behavior, so a model that uses both can reflect student behavior better and improve early warning accuracy. The input is a dataset of students, student behavioral scenarios, predicted values, multi-index data, and student comments. For each student and behavioral scenario, the dataset records the student’s predicted value for the scenario (for example, in student academic analysis), the set of indicator data of the student for the scenario, the behavior performance value of the student on each indicator of the scenario, the behavior similarity between pairs of students, and the student’s comment on the scenario, which is a sequence of words. Abnormal behavior prediction that integrates multiple indicators of student behavior and text information predicts a student behavior that has not yet occurred (such as student performance) from the above multiple indicator data, predicted values, and student behavior similarity.

As shown in Figure 4, the text information of students cannot be used directly in the prediction stage, so a neural factorization machine and a long short-term memory network are used to generate the behavior prediction value and the comment text. The model contains three components: a neural multi-index regression prediction component, a related behavior comment text generation component, and a student behavior similarity decomposition component. The neural multi-index regression prediction component produces the predicted value through the neural factorization machine. To improve the accuracy of student behavior prediction, <student, behavioral scene, index data> is first one-hot encoded to construct the input feature vector. The Bi-Interaction layer then pools the element-wise products of pairs of feature embedding vectors, and a neural factorization regression model maps the input feature vector to the student’s predicted total score through a nonlinear transformation. In the text information generation component, this paper adopts a gated neural network, that is, a decoder based on long short-term memory networks, which converts the combined representation of students, behavioral scenes, and multiple indicator data into the sequence of words of the comment text. The prediction component and the comment generation component are connected through the shared hidden layers of the neural factorization machine, which capture the high-dimensional interactions among features.
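To make the Bi-Interaction layer concrete, the following plain-Python sketch one-hot encodes a <student, behavioral scene, index data> triple and pools the pairwise element-wise products of the active feature embeddings, following the usual formulation of neural factorization machines. The vocabulary sizes, the embedding values, and the function names are hypothetical, and the subsequent hidden layers and the LSTM decoder are omitted.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def bi_interaction(x, V):
    """Bi-Interaction pooling over feature embeddings.

    x: list of n feature values (here a concatenation of one-hot codes).
    V: list of n embedding vectors, each a list of k floats.
    Returns the k-dimensional sum over all pairs i < j of the element-wise
    product of (x_i * v_i) and (x_j * v_j), computed in O(n * k) time as
    0.5 * ((sum of terms)^2 - sum of squared terms).
    """
    k = len(V[0])
    total = [0.0] * k
    total_sq = [0.0] * k
    for x_i, v_i in zip(x, V):
        if x_i == 0.0:
            continue  # inactive features contribute nothing
        for d in range(k):
            term = x_i * v_i[d]
            total[d] += term
            total_sq[d] += term * term
    return [0.5 * (total[d] ** 2 - total_sq[d]) for d in range(k)]


# Hypothetical sizes: 3 students, 2 behavioral scenes, 4 indicators.
x = one_hot(1, 3) + one_hot(0, 2) + one_hot(2, 4)
# Hypothetical 2-dimensional embeddings, one per input feature.
V = [[0.1 * (i + 1), 0.2 * (i + 1)] for i in range(len(x))]
pooled = bi_interaction(x, V)  # this vector would feed the hidden layers
```

The linear-time identity matters in practice because the one-hot input is very sparse: only the three active features take part in the sum, regardless of how many students, scenes, and indicators exist.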

4. Experiment Study

4.1. Data Set and Experimental Environment

Behavioral data of 14,800 sophomore and junior undergraduates at a university were collected for behavioral analysis, including dormitory return records, on-campus behavioral trajectories, academic performance, campus network behavior, one-card consumption, and library borrowing. About 75 dimensions were extracted to form student behavior profiles for different scenarios, such as academic abnormalities, dormitory return, abnormal canteen consumption, abnormal campus network behavior, and abnormal behavior in specific areas. The main research object is the behavior data of 612 students in the 2020 and 2021 classes of the School of Big Data and Software.

The experimental data set covers 14,800 college students and comes from the following sources. From March 2020 to March 2022, there were 7,202,811 student campus card consumption records and 1,468,128 library book loan records. From January 2020 to March 2022, there were 7,595,783 classroom attendance records from educational affairs. From September 2019 to March 2022, there were 1,970,764 library access control records. From January 2019 to December 2021, the behavior analysis system was formed by analyzing more than 44.19 million on-campus face-swiping records of students, which yielded 56,000 late return records, 147,000 nonreturn records, and 235,000 dormitory return records. No activity was recorded for 24 hours. From December 2019 to February 2022, the online data of students amounted to about 43 TB; the campus network behavior of 15,800 students was analyzed, and a total of 142.332 million Internet access records were obtained. The data set also includes the students’ GPA, physical exercise records, and other data.

4.2. Evaluation Indicators

The traditional K-means algorithm generally evaluates the clustering effect by the sum of squared distances from the sample points within each class to the cluster center. To verify the superiority of the clustering method based on density optimization, this paper adopts a more effective evaluation standard that jointly considers intraclass similarity and interclass dissimilarity, as shown in Equation (1). The intraclass similarity, shown in Equation (2), is the average distance between the samples in each class and the center of the cluster to which they belong. The interclass dissimilarity, shown in Equation (3), measures the degree of separation between different clusters and is the average distance between cluster centers.

The value of the clustering evaluation standard function lies in [-1, 1]. The closer the value is to 1, the higher the similarity of the samples within each class and the dissimilarity between classes, and the better the clustering effect. Conversely, the closer the value is to -1, the worse the clustering effect.
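As an illustration only, a score with the stated properties can be sketched in plain Python. The two distance terms follow the verbal definitions of intraclass similarity and interclass dissimilarity given above. The way they are combined, dividing their difference by the larger of the two as in silhouette-style scores, is an assumption chosen so that the result lies in [-1, 1]; it may differ from Equation (1). All function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intra_class_distance(clusters, centers):
    """Mean over clusters of the average sample-to-center distance."""
    per_cluster = [
        sum(euclidean(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centers)
    ]
    return sum(per_cluster) / len(per_cluster)


def inter_class_distance(centers):
    """Average distance over all pairs of centers (needs two or more)."""
    pairs = list(combinations(centers, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


def clustering_score(clusters, centers):
    """Assumed combination: (inter - intra) / max(inter, intra)."""
    intra = intra_class_distance(clusters, centers)
    inter = inter_class_distance(centers)
    denom = max(intra, inter)
    return 0.0 if denom == 0 else (inter - intra) / denom


# Two tight, well-separated clusters give a score close to 1.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
centers = [(0.0, 1.0), (10.0, 1.0)]
score = clustering_score(clusters, centers)
```

Under this assumed form, a negative score arises when samples lie farther from their own center, on average, than the centers lie from each other, which matches the poor-clustering case described in the text.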

To verify the prediction accuracy of the model, this paper measures the quality of the prediction results with several evaluation indicators. Accuracy is an important indicator for measurement, forecasting, and early warning. The specific indicators are Relative Error (RE), Mean Relative Error (MRE), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE).

Equation (4) defines the Relative Error (RE) from the predicted value on each dimension and the student’s actual value on the corresponding dimension index attribute. The Mean Relative Error (MRE) averages, over the number of predicted students, the relative error of each student’s behavior prediction result.

The MAE is calculated as shown in Equation (5), and the RMSE is calculated as shown in Equation (6). Both are computed over the dimensions of the student behavior characteristics and the number of predicted students, from the predicted value and the true value of each student under each feature, which further verifies the accuracy of predicting student behavior.
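The four indicators can be sketched in plain Python using their standard textbook definitions. For simplicity, the sketch treats the predictions as one flat list of values rather than indexing them by student and feature dimension, which is an assumption about layout and not the notation of Equations (4) to (6). The function names and the sample values are hypothetical.

```python
import math


def relative_error(predicted, actual):
    """|predicted - actual| / |actual| for one value; actual must be nonzero."""
    return abs(predicted - actual) / abs(actual)


def mean_relative_error(predicted, actual):
    """Average of the relative errors over all predictions."""
    errors = [relative_error(p, a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mean_square_error(predicted, actual):
    """Square root of the average squared difference."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical predicted and actual GPA values for three students.
predicted_gpa = [3.0, 2.0, 4.0]
actual_gpa = [2.0, 2.0, 5.0]
mre = mean_relative_error(predicted_gpa, actual_gpa)
mae = mean_absolute_error(predicted_gpa, actual_gpa)
rmse = root_mean_square_error(predicted_gpa, actual_gpa)
```

RMSE penalizes large individual errors more heavily than MAE because the differences are squared before averaging, so reporting both gives a fuller picture of the error distribution.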

4.3. Experimental Results and Analysis
4.3.1. Efficiency Analysis

As shown in Figure 5, when the student density threshold is set high and the amount of student data is large, the students are divided into fewer categories, the difference between classes is small, and the aggregation within each class is very poor, so the clustering evaluation function takes a negative value. As the student density threshold is adjusted, the clustering effect improves and the value of the evaluation function increases. When the student density is set to 300, the value of the evaluation function is the largest. When the student density drops below a certain level, although the interclass dissimilarity decreases, the intraclass aggregation effect decreases significantly, and the evaluation function decreases. Compared with the traditional K-means algorithm, the clustering effect of the clustering method based on density optimization is significantly improved. This paper takes the clustering of students’ consumption patterns as an example.

As shown in Figure 5, the clustering effect of students’ consumption patterns is analyzed. The optimized K-means algorithm is run on the Spark platform and assessed with the clustering evaluation standard function. The number of clusters is denoted as K. In this paper, when the student density is set to 300, the clustering evaluation criterion is optimal. The clustering results are evaluated by this standard, which reflects the similarity within classes and the dissimilarity between classes, and are compared with those of the traditional K-means clustering algorithm. The evaluation value of the density-optimized K-means clustering algorithm (Optimizing the K-means) is higher than that of the traditional K-means algorithm (K-Means), so the optimized K-means clustering algorithm achieves better results.

For the problem of student behavior segmentation, the K-means clustering algorithm based on density optimization is used for clustering analysis on the Spark cluster, and good experimental results are obtained. The time efficiency of parallelization is also analyzed by comparing serial execution time on a single machine with parallel execution time on a Spark cluster for different numbers of students. As shown in Figure 6, as the number of students increases, the running time of the density-optimized K-means algorithm on a single computer increases significantly, whereas on the Spark cluster it increases only slightly. When the number of students is small, the running time on the Spark cluster is higher than on the single machine, because too much time is spent submitting tasks and allocating resources, and scheduling on the Spark cluster is slower than on a single machine. However, when the amount of student data increases, the execution time on the cluster grows little, so the Spark cluster is better suited to processing large amounts of data.

Figure 7 shows the time efficiency of parallelization on Spark. The parallelism of the early warning model is evaluated by the speedup ratio for different data volumes and different numbers of cluster worker nodes. This paper uses 10%, 50%, and 100% of the student data as data sets and gradually increases the number of worker nodes in the Spark cluster to train the early warning model. As worker nodes are added, the speedup ratio of the algorithm improves. However, due to factors such as internode communication and data dependence, the running time of the algorithm increases with the amount of student data. When the amount of data is small, that is, 10% of the students, task submission, resource allocation, and internode communication on the Spark cluster take too long for the advantages of parallelization to show. When the amount of data increases, especially when 100% of the student data set is used, the speedup ratio increases significantly. Spark is therefore better suited to computing on large data volumes and to distributed big data computing.
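The speedup ratio used in this comparison is conventionally the single-machine running time divided by the running time on the cluster. The short sketch below states that definition in Python, together with the derived parallel efficiency. The timing values are hypothetical placeholders and are not measurements from this paper.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup ratio: serial running time divided by parallel running time."""
    if parallel_seconds <= 0:
        raise ValueError("parallel_seconds must be positive")
    return serial_seconds / parallel_seconds


def parallel_efficiency(serial_seconds, parallel_seconds, workers):
    """Speedup per worker node; 1.0 would mean perfectly linear scaling."""
    return speedup(serial_seconds, parallel_seconds) / workers


# Hypothetical timings in seconds, for illustration only.
serial_time = 120.0
parallel_time = 40.0
ratio = speedup(serial_time, parallel_time)
efficiency = parallel_efficiency(serial_time, parallel_time, workers=4)
```

A ratio below 1 corresponds to the small-data case described above, where task submission and communication overhead make the cluster slower than a single machine.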

Figure 8 shows the efficiency analysis of the ABPM-IMISBTI method. In the experiment, MAE and RMSE are measured at different student scales. MAE and RMSE vary with the number of predicted students, but the change is small, which indicates that the ABPM-IMISBTI method has good scalability and that its prediction accuracy does not decrease as the number of students increases. At the same time, MAE and RMSE decrease as the student data set grows, indicating that the student scale has a certain influence on prediction accuracy and that a relatively large data set is needed for better prediction, which suits multidimensional prediction of student behavior.

4.3.2. Student Behavior Analysis

To improve the physical fitness of college students, cultivate perseverance, and show their youthful vitality, the university established an information system for physical exercise programs such as Sunshine 60 Points and the morning sun plan. For example, the university organized the 2021 “Sunshine 60 Points” health walk activity, which started on April 7, 2021, and ended on May 23, 2021. More than 2,000 students participated, and their records were statistically analyzed. This paper analyzes the relationship between the number of times students exercised and their GPA scores.

The relationship between students’ GPA and physical exercise is shown in Figure 9. In the Sunshine 60 Points activity, a student completes three physical exercise check-in points within one hour, and the number of check-ins is counted. The horizontal coordinate represents the student’s GPA, the vertical coordinate represents the number of times the student checked in, and each blue dot represents one student. As shown in Figure 9, there is no clear relationship between students’ GPA scores and the number of physical exercises. The experiment also found that junior and senior students exercised less often, while freshmen exercised most often. Therefore, upper-year students should be encouraged to participate in such activities.

4.3.3. Student Abnormal Behavior Prediction

This paper proposes the ABPM-IMISBTI method, which combines multiple indicators of student behavior with text information. To verify the accuracy of the method in predicting abnormal student behavior, student performance prediction is taken as an example, so that students with academic abnormalities can be identified quickly. For student performance prediction, a prediction model is constructed by integrating study diligence indicators, including book borrowing, class attendance rate, GPA, regular diet, study time, physical exercise, and dormitory return. The students’ historical GPA scores before the spring semester of 2020-2021 are used as the initial data, which better reflect the students’ learning foundation. The model predicts the students’ GPA in the fall of 2020-2021, and the predicted and actual values are compared across different methods.

The ABPM-IMISBTI method proposed in this paper is compared with six other methods. The first is MTLTR-APP [10], a multitask prediction method based on a learning-to-rank algorithm that is used for learning performance prediction. The second is CW-LSTM [43], a prediction algorithm based on deep learning theory. The third is the Bayesian method [57], which is suitable for small-scale data and can handle multiclassification tasks. The fourth is the Deep Neural Network (DNN) [58], in which the feature vector reaches the output layer through hidden layer transformations and the classification result is obtained from the output layer. The fifth is random forest (RF) [44], which is used here for learning performance prediction. The sixth is IDA-SVR [45], which uses an improved decision algorithm (IDA) to optimize support vector regression (SVR), a method that finds a regression plane such that all data in a set are closest to the plane.

The performance of the grade prediction model is judged by comparing the MAE and RMSE between the predicted student grades and the true values. The larger the MAE and RMSE, the less accurate the model; the smaller the values, the higher the accuracy. To verify the correctness of the comparative analysis, the test data are the predicted GPAs of the 612 students from the School of Big Data and Software and of the 14,800 sophomores and juniors.

The performance comparisons of different prediction methods on the Big Data and Software College student datasets are shown in Table 2. The performance comparisons of different prediction methods on sophomore and junior datasets are shown in Table 3.

The ABPM-IMISBTI method integrates multidimensional information to construct accurate student portraits. It considers students’ objective and subjective labels and combines student behavior multi-index data with text information to predict students’ abnormal behavior accurately; it has the lowest errors on both evaluation indicators and the best performance. MTLTR-APP is a multitask model based on matrix decomposition that takes into account the various types of student behavior and the similarity of students and predicts students’ multitask learning ranking and academic performance, but it does not incorporate factors such as students’ classroom performance and physical exercise, which affects the accuracy of its academic performance prediction. CW-LSTM exploits the time series characteristics of college data and extracts multidimensional characteristics of students; however, its data sample is small, and it does not integrate data such as classroom attendance and Internet behavior, so its accuracy needs further improvement. DNN cannot use historical information to assist in classifying students’ grades, requires too many parameters for classification, and has a mediocre classification effect. Compared with DNN, the Bayesian classifier has a sounder mathematical foundation, requires few parameters, and is less sensitive to missing data, but in actual operation its classification effect is not ideal. RF is an ensemble learning classification algorithm that samples with replacement from the original data set, and its classification effect is mediocre.
When IDA-SVR performs multiclassification prediction of grades based on study diligence indicators and living habits, it has the worst classification effect, whereas the overall accuracy of ABPM-IMISBTI and MTLTR-APP is relatively stable, and both have a strong mathematical interpretation of the data. Compared with the other prediction methods, the proposed model has high prediction accuracy and good scalability, and its MAE and RMSE values are the smallest. Therefore, it is feasible to use multiple indicators of student behavior together with text information to predict student achievement. Because the School of Big Data and Software manages its students strictly, the data on class attendance rate, study time, physical exercise, and dormitory return reflect the students’ actual situation more closely, and the predictions on this data set are more accurate. In summary, the model performance is ranked: ABPM-IMISBTI > MTLTR-APP > CW-LSTM > DNN > Bayesian > RF > IDA-SVR.

5. Conclusion

This paper proposes the ABPM-IMISBTI method. First, to address the multidimensionality, timeliness, and dynamics of student behavior information fusion in constructing student behavior portraits, a method of constructing student behavior portraits based on the deep fusion of multidimensional information is proposed. Second, an abnormal behavior prediction method is proposed that integrates multiple indicators of student behavior data with text information to address the problems of accuracy and lag in predicting abnormal student behavior. Finally, real student behavior data are used to analyze the relationship between the clustering effect and student behavior, an abnormal behavior prediction model based on the Spark platform is constructed, and student achievement prediction is used as a verification example. The experimental results show that the ABPM-IMISBTI method can provide early warning for students with abnormal behaviors, which makes it easier for student administrators to intervene in time and thereby promotes the development of safe and smart campuses in colleges and universities.

Although the ABPM-IMISBTI method can address the problems of accuracy and lag in predicting abnormal student behavior, this paper still has some deficiencies. First, the number of samples needs to be expanded: this paper studies the behavior of undergraduates, and the next step is to study and classify the behavioral data of master’s and doctoral students. Second, this paper analyzes the relationship between the number of times students exercise and their GPA scores, but students’ scores are related to many other factors, and further quantitative analysis is needed. Third, data such as students’ offline class information, content access, and URLs can be integrated and used for student performance prediction and personalized learning recommendation to further improve the efficiency of the algorithm. Further research is needed on these three problems.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

The work was supported by the Chongqing Natural Science Foundation of China (No. cstc2021jcyj-msxmX0515), the Chongqing Federation of Social Sciences Planning Project (No. 2021NDYB110), the National Science Foundation of China (No. 62072060 and No. 72074036), and the Chongqing Educational Science Planning Project (No. K22YA201002).