Abstract

The identification of students with financial difficulties is one of the main problems in campus data research. Effective and timely identification not only eases the work of campus administrators but also helps students who are genuinely in financial hardship. The widespread use of smart cards makes it possible to identify students with financial difficulties through big data. In this paper, we collect behavioural records from undergraduate students’ smart cards and propose five features to associate with students’ poverty level. Based on these features, we propose the Apriori Balanced Algorithm (ABA) to mine the relationship between poverty level and students’ daily behaviour. The association rules show that students’ poverty level is most closely related to their academic performance, followed by consumption level, diligence level, and life regularity. Finally, we adopt a semisupervised K-means algorithm to identify students with financial difficulties more accurately. When tested with classical classification algorithms, our method achieves a higher identification rate, which helps university administrators discover students in real financial hardship effectively.

1. Introduction

Nowadays, college students are gradually becoming the main labour force in society and have an important impact on the country’s economic and social development [1]. In recent years, thanks to the rapid development of the digital campus, college students’ daily behaviours can be recorded in the campus smart-card system, so researchers are paying increasing attention to the study of campus big data [2–5]. As a branch of campus behaviour research, finding students with financial difficulties not only helps those who really need support but also gives school administrators a way to locate these students and provide financial support. Therefore, it is imperative for educators to discover students with financial difficulties.

In China, the selection of students with financial difficulties mostly adopts the method of “students proclaiming + advisers assessing” [6], in which students’ qualification is usually evaluated manually with respect to their family background, daily expenditure, and academic performance. When there are a large number of students, this scheme can be time-consuming and tends to involve subjective judgments as well. Fortunately, thanks to the rapid construction of the smart campus, the student one-card system, also known as the smart-card system, has been designed to record students’ daily-life behaviours. These behaviours include consumption in the canteen, Internet login records [7], book-borrowing records [8], check-in records, and so on. The increasing amount of such data gives us the opportunity to analyse students’ behaviour through novel information technologies.

Several previous studies have focused on the analysis of students’ behaviour. Some studies use records from the smart-card system to explore the relationship between students’ academic performance and their daily behaviour [7]. These records also give a multiaspect view of students’ daily campus life, revealing the changing trends over their learning career and showing the different living habits of different genders [9]. Moreover, there has been growing interest in online education systems and online learning platforms [10, 11]. Learning records generated by these tools have revealed the dependencies among learning time, subject, activity type, activity complexity, and performance, which suggest behavioural changes that can optimize the learning experience. Besides, by examining students’ learning modality, the trends and deficiencies in their use of learning management systems (LMS) can easily be detected, which helps to quickly grasp their learning status.

The above works demonstrate the feasibility of mining the daily records generated by smart cards to identify students’ behaviour patterns. However, few studies have used campus behavioural data to examine students’ economic status. Existing studies can mainly be divided into two branches, namely, the prediction of students’ financial hardship and the discovery of students with financial difficulties. The former has been treated as a multilabel classification problem using features such as smart-card usage, Internet usage, and trajectories on campus [12]. However, only the pairwise correlation of students was studied, not the correlation between poverty level and behaviour characteristics. The BP neural network was also used to construct a nonlinear mapping between the economic conditions of college students and the identification of needy students [13], but this work lacked a fine-grained analysis of how different behaviours influence students’ economic status. The latter has been studied through active learning [6], but this method requires the intervention of human knowledge. Although accuracy was improved, human intervention can involve too much personal will. In addition, the correlation of different behaviours with financial hardship has not been analysed.

Hence, in this paper, we propose the Apriori Balanced Algorithm (ABA) to explore the relationship between students’ financial difficulties and their behaviours. In addition, a semisupervised K-means algorithm is established to identify students with financial difficulties while decreasing human intervention. To be specific, we extract “consumption level,” “GPA,” “GPA_percentage,” “life regularity,” and “diligence” from the smart-card data to describe students’ behaviours. Then, we apply the ABA to obtain the relationship between students’ poverty level and their behaviour features. Finally, we adopt the semisupervised K-means algorithm to identify the financially difficult students.

Overall, the contributions of this paper are summarized as follows:
(1) Faced with complicated data exported from the smart-card system, we propose five behavioural features, namely, GPA, GPA_percentage, consumption level, life regularity, and diligence, which can better reflect the behavioural characteristics of students in financial hardship.
(2) We propose the Apriori Balanced Algorithm (ABA), which modifies the original Apriori algorithm by replacing Support with Balanced_support through a balance factor C. With this modification, association rules containing items that make up only a small proportion of the dataset are found more accurately. This is useful for mining the association between students’ economic status and behavioural features, since financially difficult students account for only a small share of all students. Test results on the Groceries dataset demonstrate the adaptability of the ABA, and the relationship between the proportion of poor students and the different behavioural features shows that the association rules we obtained are consistent with the ground truth.
(3) We propose a method based on semisupervised K-means to identify students in financial hardship. Previous works have used methods such as active learning to discover financially difficult students, but these may involve too much personal will. Our method decreases human intervention without losing identification performance. Experiments with four classical classification models on the dataset processed by our method indicate a higher prediction performance.

2. Materials and Methods

2.1. Motivation

In this section, we describe the motivation of our research in detail. The first is the motivation for proposing the Apriori Balanced Algorithm. In traditional campus big data research, personal will may be involved in the experiment. In this paper, we expect to find the relationship between students’ financial difficulties and other behavioural characteristics through data mining, so as to reduce this disadvantage. However, traditional algorithms for mining association rules, such as Apriori and FP-growth, are based on support and confidence. When they are used to mine rules containing items with a small proportion, the result may not reflect the truth hidden in the dataset. We observed this phenomenon in our earlier data mining, and other papers may not have paid attention to it; for this reason, we propose the Apriori Balanced Algorithm (ABA).

The second is the motivation for identifying students in financial hardship. Our earlier data mining on the original poor student list provided by the university showed that some students labelled as having financial difficulties have behavioural features different from the others, while some students without that label have the same behavioural features as most students with economic difficulties. Based on this phenomenon, we think that, in terms of behavioural characteristics, some students in the original poor student list are not actually in financial hardship. Meanwhile, a small proportion of students are not labelled as “Poor” by the university but have the same behavioural characteristics as “Poor” students. These students are not accurately identified in the poor student list. Therefore, to solve the above problems, we propose a method based on semisupervised K-means to relabel students in the poor student list according to their behavioural characteristics. In this way, university administrators can identify students in financial hardship more accurately and provide targeted funding.

Figure 1 shows the basic work flow of our framework, which includes four major parts. The entire framework mainly focuses on identifying financially difficult students and finding out the hidden poor students. Firstly, behavioural characteristics of students including consumption level, academic performance, diligence, and life regularity are extracted from records of the smart-card system. Secondly, the Apriori Balanced Algorithm (ABA) is proposed and used to correlate the poverty level with other behavioural features, by which 2-item set and 3-item set consisting of students’ behavioural feature labels are obtained. Thirdly, labelled data and unlabelled data are selected based on the predefined rule and are then input to the semisupervised learning algorithm to label the unlabelled data and build the new datasets. Finally, new datasets are used to train different models for prediction to verify the effectiveness of the framework.

2.2. Dataset

Two datasets are used in this paper. The first dataset is exported from the database management system provided by the Information Center of our university and covers the period from Sep. 1st, 2013, to Jun. 30th, 2014. It consists of three parts: students’ consumption records in the canteen, the GPA for the spring and autumn semesters, and the records of the poor students’ list. The students selected were enrolled in 2012 and 2013. Not all students have all three kinds of records, so after combining the different data tables and removing erroneous data, the records of 6224 students remained for the experiment. The data statistics are given in Table 1.

The second dataset used in this paper is the Groceries dataset, which is often used for association analysis with the Apriori, FP-growth, and Eclat algorithms. It consists of the real transaction records of a grocery store within one month, with 9835 consumption records and 169 products. The data format of the Groceries dataset is shown in Table 2.

2.3. Experiment Tool

All the experiments were conducted in Python 3.6 on a 64-bit Windows 8 machine with 16 GB of memory and a 2.3 GHz CPU.

2.4. Feature Extraction

Traditional research usually obtains information about students’ family situation by means of the elaborate rules and regulations of the funding system [12]. However, this approach has shortcomings. For example, to obtain financial support from the school, students may deliberately describe their family as financially difficult. In addition, dealing with case-study assessments manually puts a lot of pressure on the staff [12]. Big data technology gives researchers a fast, efficient, and accurate way to analyse students’ behaviour. In this paper, we use campus data to identify students with financial difficulties by their behaviours. Initially, we propose four assumptions.

Assumption 1. Financially difficult students tend to consume less.
The most direct intuition about students in financial difficulties is that they lack money. In their normal campus life, they may spend less than others on lunch and dinner, which is reflected by generally smaller consumption amounts in the smart-card records. Therefore, we propose the consumption level to describe students’ consumption behaviour.

Assumption 2. Financially difficult students tend to perform better in academic activities.
Students with financial difficulties may have a deep understanding of their own situation, so they cherish learning opportunities more than others and perform better academically. In the smart-card records, their grades are generally high in all subjects, so we propose the academic performance level to describe students’ learning behaviour.

Assumption 3. Financially difficult students are more irregular in life.
Students with financial difficulties may be less self-disciplined in life. Lacking money, they may not eat breakfast on time. Also, they may not attend classes on time every day because of part-time jobs. So, we put forward the life regularity to describe students’ life behaviour.

Assumption 4. Financially difficult students may skip breakfast to save money.
A major challenge facing students in financial hardship is their limited available money. Consequently, they may skip breakfast to save as much money as possible. In the smart-card system of our university, a record is generated once a student swipes his/her student card on the card reader device, so each consumption is recorded with a timestamp every time he/she comes to the canteen for meals. Therefore, we regard the time of students’ first meal as a rough reflection of their diligence level.

2.5. Consumption Level

The consumption data selected for this research comprise 2108250 records in total. The data format is shown in Table 3, in which Stu_ID is the ID of each student (similarly hereinafter) and Location is the place where he/she buys food. In our university, food is served from different buttery hatches in four canteens; for example, “Canteen1, BH1” means the first buttery hatch in Canteen1. Time is the time at which he/she buys food, Consum_amount is the amount of money spent in this transaction, and Card_balance is the balance of his/her student card after the transaction. After exploratory data analysis, we found that the transaction amount of each breakfast is much lower than that of lunch and dinner because porridge and pancakes are served at a low price during breakfast time. In addition, some students do not eat breakfast every day, so including breakfast in the consumption-level statistics would introduce classification errors. Therefore, only lunch and dinner consumption is considered for mining consumption behaviour.

Along this line, we need to ensure that each student is counted as eating lunch and dinner only once per day. First of all, we define the time intervals for the different types of meals. According to the dining rules of our university, we set 11:00–13:00 as lunch time and 16:00–18:00 as dinner time. It is important to note that a student may swipe the card for food more than once during each meal, for example, to buy some snacks during lunch. So, we propose lunch-time consumption (LTC) and dinner-time consumption (DTC), which, respectively, represent how much money a student spends during lunch time or dinner time. LTC is defined in formula (1), and DTC is defined similarly:

$$\mathrm{LTC} = \sum_{i=1}^{n} c_i, \tag{1}$$

where $c_i$ represents the amount of the i-th consumption record of the total n records between 11:00 and 13:00 of a day, so that LTC represents the total consumption amount during 11:00–13:00.

Next, a key problem is how to convert consumption records into indicators of consumption. Previous work [9] simply counted the number of consumption records in the canteen within one-hour time windows. However, in this way, students with more consumption records are more likely to be considered to spend more, while those with fewer records tend to be considered to spend less. To avoid this situation, we propose the average consumption and the consumption speed. Average consumption is defined as the average consumption amount during one LTC or DTC, denoted by Avg_consum in formula (2). Consumption speed is defined as the number of consumption records needed to spend 100 yuan, denoted by Spd_consum in formula (3):

$$\mathrm{Avg\_consum}_t = \frac{S_t}{N_t}, \tag{2}$$

where $S_t$ represents the total consumption of student t within a semester, while $N_t$ represents the total number of LTC and DTC meals of student t in that semester:

$$\mathrm{Spd\_consum}_t = \frac{M_t}{S_t} \times 100, \tag{3}$$

where $M_t$ represents the total number of consumption records of student t in a semester and $S_t$ has the same definition as in formula (2).
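The computation of LTC, DTC, average consumption, and consumption speed described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the record layout `(stu_id, time string, amount)` are hypothetical, the meal windows are treated as half-open intervals, and only records inside the lunch and dinner windows are counted towards the consumption speed.

```python
from collections import defaultdict
from datetime import datetime

# Meal windows from the text: lunch 11:00-13:00, dinner 16:00-18:00.
# Assumption: windows are half-open, so 13:00:00 itself is excluded.
MEAL_WINDOWS = {"lunch": (11, 13), "dinner": (16, 18)}


def meal_of(ts):
    """Return 'lunch', 'dinner', or None for a datetime."""
    for meal, (start, end) in MEAL_WINDOWS.items():
        if start <= ts.hour < end:
            return meal
    return None


def consumption_features(records):
    """records: iterable of (stu_id, 'YYYY-MM-DD HH:MM:SS', amount)."""
    meal_totals = defaultdict(float)  # (student, date, meal) -> LTC or DTC
    record_counts = defaultdict(int)  # student -> swipes inside meal windows
    for stu_id, time_str, amount in records:
        ts = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
        meal = meal_of(ts)
        if meal is None:
            continue  # breakfast and off-hours records are left out
        meal_totals[(stu_id, ts.date(), meal)] += amount
        record_counts[stu_id] += 1

    spent = defaultdict(float)  # student -> total amount over all meals
    meals = defaultdict(int)    # student -> number of LTC/DTC meals
    for (stu_id, _, _), total in meal_totals.items():
        spent[stu_id] += total
        meals[stu_id] += 1

    return {
        s: {
            "avg_consum": spent[s] / meals[s],
            "spd_consum": 100.0 * record_counts[s] / spent[s],
        }
        for s in spent
        if spent[s] > 0  # skip students with no positive spending
    }
```

Summing the swipes of one meal window before averaging is what keeps a student who buys a snack in addition to a main dish from being counted as eating two cheap meals.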

2.6. Poverty Level

We exported data of financially difficult students of the 2013–2014 academic year from the database, totalling 3400 items. The data format is given in Table 4, in which the Semester column indicates the corresponding semester of the record (“2013-20141” means the first semester, and “2013-20142” means the second semester). Besides, for each of the 3400 records, the Financial_status column indicates whether the corresponding student in this entry is in financial hardship. If so, the value will be “Poor,” otherwise it will be “Not_poor.” Such labels can be convenient for the subsequent processing.

2.7. Academic Performance
2.7.1. GPA

The academic performance data selected for this research are composed of 251,055 records, which contain the score of each course taken by each undergraduate student in two semesters. The data format of the grades is shown in Table 5. In this table, all the students are grouped by Stu_ID, and for each student, each course he/she attended during the two semesters is recorded as one entry. The Score column is the score he/she obtained for that course, Credit is the credit assigned to that course, and Course_type indicates whether the course is compulsory or elective. Generally, in the student management system, a student’s academic performance is measured by the GPA (grade point average). In this research, we propose a metric similar to the original GPA to measure students’ academic performance. This metric is defined in the following formula:

$$\mathrm{grade}_{\mathrm{sum}} = \frac{\sum_{i=1}^{m} \mathrm{grade}_i \times \mathrm{credit}_i}{\sum_{i=1}^{m} \mathrm{credit}_i}. \tag{4}$$

For each student, $\mathrm{grade}_i$ denotes the score of a single course i, $\mathrm{credit}_i$ is the credit for that course, and m is the number of courses in a specific semester. Through this formula, the score per credit of each student is obtained, which can later be used to divide all students into two groups. Concretely, after obtaining $\mathrm{grade}_{\mathrm{sum}}$ for every student and sorting the values in descending order, the top 50 percent of students are labelled as GPA_high, and the remaining 50 percent are labelled as GPA_low. These two labels are used as features of students’ academic performance.
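As an illustration of the score-per-credit metric and the 50/50 split described above, a minimal sketch might look like this. The function names are hypothetical, and the handling of an odd number of students (the extra student falls into GPA_low) is an assumption rather than something stated in the text.

```python
def credit_weighted_score(courses):
    """courses: list of (score, credit) pairs for one student in one semester."""
    total_credit = sum(credit for _, credit in courses)
    return sum(score * credit for score, credit in courses) / total_credit


def label_gpa(score_by_student):
    """Label the top half by weighted score as GPA_high, the rest as GPA_low."""
    ranked = sorted(score_by_student, key=score_by_student.get, reverse=True)
    cutoff = len(ranked) // 2  # assumption: with an odd count, the extra student is GPA_low
    return {
        student: ("GPA_high" if position < cutoff else "GPA_low")
        for position, student in enumerate(ranked)
    }
```

For example, a student with 90 points in a 3-credit course and 80 points in a 1-credit course obtains (90 × 3 + 80 × 1) / 4 = 87.5.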

2.7.2. GPA_Percentage

Although GPA is generally considered a metric for evaluating students’ academic performance, it is a rather coarse-grained measure. This is because in China, the difficulty of different subjects varies with majors. Courses of liberal arts majors tend to be given higher marks because certain exam questions allow flexible answers, while in science and engineering majors it is much harder to get an A because of complex calculation, analysis, deduction, and reasoning. Despite the differences between courses, students of the same major face exams in the same subjects. Therefore, it is necessary to work out the ranking of students within their respective majors. To this end, we propose GPA_percentage, which is defined in the following formula:

$$\mathrm{GPA\_percentage}_s = \frac{R_s}{N_s}, \tag{5}$$

where $R_s$ represents the GPA ranking of student s within his/her major and $N_s$ represents the total number of students in his/her major.

According to the criteria for evaluating personal scholarships in our university, students who rank in the top 20% of their major win the “first-level scholarship” and “second-level scholarship,” while those who rank between the top 30% and the top 50% are awarded the “third-level scholarship.” Therefore, students whose GPA_percentage is between 0 and 0.2 are labelled as Gper_A, those with a GPA_percentage of 0.2 to 0.5 are labelled as Gper_B, and the rest are labelled as Gper_C.
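The ranking within a major and the three Gper labels can be sketched as below. The function names are hypothetical, and treating the boundaries 0.2 and 0.5 as inclusive upper bounds is an assumption, since the text does not say to which side the boundary values belong.

```python
def gpa_percentage_within_major(gpa_by_student):
    """Map each student of ONE major to rank / major size (rank 1 = highest GPA)."""
    ranked = sorted(gpa_by_student, key=gpa_by_student.get, reverse=True)
    size = len(ranked)
    return {student: (position + 1) / size for position, student in enumerate(ranked)}


def gper_label(percentage):
    """Assumption: 0.2 and 0.5 are inclusive upper bounds of Gper_A and Gper_B."""
    if percentage <= 0.2:
        return "Gper_A"
    if percentage <= 0.5:
        return "Gper_B"
    return "Gper_C"
```

Applying `gpa_percentage_within_major` separately to each major is what removes the difference in grading difficulty between majors.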

2.8. Life Regularity

The regularity of students’ behaviour can be expressed with the regularity of eating breakfast [7]. In order to describe the regularity of different students as much as possible, we regard a student’s first record in the smart card as his/her first activity every day. So, we select 5:00–11:00 as the time interval for the regularity of students’ behaviour. Therefore, our processing steps are as follows.

Firstly, we divide the time interval from 5:00 to 11:00 in the morning into 12 bins, each of which spans 30 minutes, encoded from 1 to 12. Then, inspired by the concept of information entropy [14], we define a life entropy (LE) to express the life regularity of students, which is calculated by the following formula:

$$\mathrm{LE}(X) = -\sum_{i=1}^{12} p_i \log p_i, \tag{6}$$

where $p_i$ represents the probability that the first record falls in the i-th time interval. By the definition of entropy, LE in effect measures how spread out an arbitrary student X’s eating times are over a semester. Therefore, the larger the LE is, the more scattered and irregular the breakfast eating period is, while the smaller the LE is, the more concentrated and regular the period is.
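A minimal sketch of the life entropy computation is given below. The function names are hypothetical, the natural logarithm is an assumption (the text does not specify the base), and records outside 5:00–11:00 are simply ignored.

```python
import math
from collections import Counter


def time_bin(hour, minute):
    """Map a time in [05:00, 11:00) to a 30-minute bin numbered 1..12, else None."""
    minutes = hour * 60 + minute
    if not (5 * 60 <= minutes < 11 * 60):
        return None
    return (minutes - 5 * 60) // 30 + 1


def life_entropy(first_record_times):
    """first_record_times: list of (hour, minute), one entry per day of a semester."""
    bins = [b for b in (time_bin(h, m) for h, m in first_record_times) if b is not None]
    if not bins:
        raise ValueError("no first records between 05:00 and 11:00")
    total = len(bins)
    counts = Counter(bins)
    # Assumption: natural logarithm; another base only rescales LE.
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A student whose first record always falls in the same bin has an entropy of zero, while a student spread evenly over two bins has an entropy of log 2.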

Next, a threshold value needs to be determined to label the regularity for different students according to LE. This can be considered as a problem of one-dimensional data clustering. Therefore, we sort LE first and then use the K-means clustering to obtain a threshold H. According to the threshold H, students can be divided into Regular or Irregular.
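One way to obtain such a threshold is a one-dimensional K-means with two clusters, sketched below. The function name is hypothetical, and taking the midpoint between the two final centroids as the threshold H is an assumption, as the text does not state how H is read off the clusters.

```python
def two_means_threshold(values, max_iter=100):
    """1-D K-means with k = 2; returns the midpoint between the two centroids."""
    low_c, high_c = min(values), max(values)
    if low_c == high_c:
        raise ValueError("all values are identical; no threshold exists")
    for _ in range(max_iter):
        # Assign each value to the nearer centroid (ties go to the low cluster).
        low = [v for v in values if abs(v - low_c) <= abs(v - high_c)]
        high = [v for v in values if abs(v - low_c) > abs(v - high_c)]
        new_low_c = sum(low) / len(low)
        new_high_c = sum(high) / len(high)
        if (new_low_c, new_high_c) == (low_c, high_c):
            break  # centroids are stable
        low_c, high_c = new_low_c, new_high_c
    return (low_c + high_c) / 2


def regularity_label(le, threshold):
    """A smaller life entropy means a more regular first-record time."""
    return "Regular" if le <= threshold else "Irregular"
```

Initializing the centroids at the minimum and maximum value guarantees that neither cluster is ever empty in the one-dimensional case.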

2.9. Diligence

It has been said that the first smart-card record in each day can be regarded as a surrogate of students’ bedtime [15]. Inspired by this, we use students’ first smart-card record in each day as their first daily activity. Since meal consumption in the canteen accounts for a large majority of all consumption records, we calculate the time of the first meal for each student. This is then used as a measure of students’ diligence level. Specifically, we transformed the raw date-time format into Unix timestamps. After obtaining the time of the first meal consumption for each student in each day, the diligence level can be calculated as follows:

$$\mathrm{Diligence}_i = \frac{1}{D_i} \sum_{d=1}^{D_i} t_{i,d}, \tag{7}$$

where $D_i$ is the total number of days on which student i has consumption records and $t_{i,d}$ is the time of his/her first meal on day d. In this way, we obtain the average time of each student’s first meal within one semester. Subsequently, we cluster all the students into two groups according to the diligence value, labelling those with a smaller diligence value as “Early” and those with a larger value as “Late”.
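The averaging step can be sketched as follows. The function name and record layout are hypothetical. Note one explicit assumption: the time of the first meal is expressed as seconds since midnight rather than as a raw Unix timestamp, so that values from different days are comparable before averaging.

```python
from collections import defaultdict
from datetime import datetime


def diligence(records):
    """records: iterable of (stu_id, 'YYYY-MM-DD HH:MM:SS').

    Returns student -> mean time of the first record of each day,
    in seconds since midnight (an assumption made for comparability).
    """
    first = {}  # (student, date) -> earliest seconds since midnight
    for stu_id, time_str in records:
        ts = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
        secs = ts.hour * 3600 + ts.minute * 60 + ts.second
        key = (stu_id, ts.date())
        first[key] = min(secs, first.get(key, secs))

    sums = defaultdict(int)
    days = defaultdict(int)
    for (stu_id, _), secs in first.items():
        sums[stu_id] += secs
        days[stu_id] += 1
    return {s: sums[s] / days[s] for s in sums}
```

A student whose first records are at 07:00 and 09:00 on two days obtains a diligence value of 28800 seconds, that is, 08:00.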

2.10. Apriori Balanced Algorithm

The Apriori algorithm is one of the most popular and widely used algorithms in both data mining and educational data analysis [16]. Previous research has applied Apriori to user behaviour prediction, for example, to mine study-related rules that provide a basis for optimizing educational decisions [17] and to mine association rules in enrollment information to explore the factors affecting college enrollment [18]. However, the traditional Apriori algorithm may fail to mine rules for items with a small proportion. This problem is mainly caused by the different proportions of the various labels in the datasets. People tend to accept rules with high support and high confidence, but labels with a low proportion generate rules with low support, which are easily ignored. In our datasets, the labels have different proportions. Therefore, Apriori is not suitable for mining the association rules hidden in students’ poverty level and daily behaviour.

Based on the above problems, the Apriori Balanced Algorithm (ABA) is proposed.

Given a dataset D, N is the number of data in D, L = {l1, l2,…,ln} is the set of different items in D, P = {p1, p2,…, pn} is the proportion of different data items.

Let U = {li, lj,…,lt} be a rule of expectation. The support of U is defined as follows:

$$\mathrm{Support}(U) = \frac{|B|}{N}, \tag{8}$$

where B is a subset of D that represents the data items in D that contain U, and |B| represents the number of data items in B. The confidence of U is defined as follows:

To make sure that the items in U are closely related to each other, compute in the following formula:where x is the index of the terms in U and m is the number of items of U.

The reasonable range of is [0, 1], so the range of is [1, +∞]. If normalized, its range will be [0, 1], which has the same distribution as . Therefore, is replaced with . Since we focus on how to deal with the imbalanced proportions of different behavioural labels, it has nothing to do with the calculation of Support, so here we just set m = 1. To adapt the support to different numbers of item sets, a balance factor C is defined as follows:where T is the number of the type of labels that belong to the same behaviour category as the item x. (For example, x is the item “Consumption Low,” since the labels of consumption behaviour are “High,” “Medium,” and “Low,” in such situation, T = 3.) represents the maximum number of samples of the labels in each behaviour category. Similarly, is the minimum number of samples of the labels in each behaviour category.

With the balance factor C, Balanced_support(U) is defined as follows:

Table 6 shows the different values of C for each label in each behaviour category. The labels are listed in the first column, and the “Proportion” column shows the number of samples for each label. Different labels are delimited by a colon. Some behaviour categories have 2 labels, while some have 3, so the last column for behaviours with 2 labels is left blank.

Algorithm 1 takes two parameters: the dataset D and the Balanced_support threshold value S, which is set by the experimenter. Different values of S yield different numbers of frequent item sets. The algorithm first scans the whole dataset and regards the generated set as the candidate frequent 1-item set. Next, it calculates the Balanced_support of each candidate and removes the items whose Balanced_support is lower than S; the remaining items are used to form the 2-item sets. It then calculates the Balanced_support of the 2-item sets. These steps are repeated until the frequent k item set is empty or contains only one item, at which point the program ends.

Input: The dataset D, Balanced_support threshold value S
Output: Maximum frequent k item set
(1) Scan the whole dataset and take every item that appears as the candidate frequent 1-item set.
(2) k = 1; the frequent 0-item set is considered an empty set.
(3) While true do:
(4)   Scan the data to calculate the Balanced_support of each candidate frequent k item set.
(5)   Remove the candidate k item sets whose Balanced_support is lower than the threshold value S to get the frequent k item sets.
(6)   If the frequent k item set is empty Then:
(7)     Return the frequent (k − 1) item sets as the result; ABA ends.
      End if
(8)   If the number of items in the frequent k item set equals 1 Then:
        Return the frequent k item set as the result; ABA ends.
      End if
(9)   k = k + 1; build the candidate frequent k item sets from the frequent (k − 1) item sets.
(10) End while
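The level-wise loop of Algorithm 1 can be sketched in Python as follows. This is not the authors’ implementation: the function names are hypothetical, the candidate-generation step (joining frequent item sets that differ in one item) is the standard Apriori join and is an assumption here, and the scoring function defaults to plain support. To obtain the ABA, a function computing Balanced_support with the balance factor C would be passed as `score`.

```python
from itertools import combinations


def plain_support(itemset, transactions):
    """Fraction of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def levelwise_frequent_itemsets(transactions, threshold, score=plain_support):
    """Return the largest frequent item sets found by a level-wise search.

    `score` is a placeholder for the paper's Balanced_support; plain
    support is used by default because this sketch does not reproduce
    the balance factor C.
    """
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    previous = set()  # the frequent 0-item set is empty
    while True:
        frequent = {c for c in candidates if score(c, transactions) >= threshold}
        if not frequent:
            return previous  # fall back to the frequent (k - 1) item sets
        if len(frequent) == 1:
            return frequent
        k = len(next(iter(frequent))) + 1
        # Assumption: standard Apriori join of item sets differing in one item.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        previous = frequent
```

Each transaction would hold one label per behaviour category for one student, so labels need to be unique across categories (for example, a consumption label “Low” has to be distinguishable from any other “Low”).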

Table 7 gives an example of the input data for Algorithm 1, which has 7 columns. The first column shows the Stu_ID, the second column gives the financial status of that student, and the remaining columns record the labels of the different behaviours, as defined previously. Since D is a parameter, different behavioural labels can influence the algorithm. This influence will be analysed in later sections.

2.11. Semisupervised K-Means

Semisupervised learning is an important method in the field of pattern recognition and machine learning, which carries out pattern recognition using a large amount of unlabelled data and a small amount of labelled data. This method has received increasing attention from various areas of research, including predicting the dropout rate based on behavioural features [19] and predicting students’ academic performance by constructing students’ social relationships from their campus behaviour [20].

The basic idea of semisupervised learning is to label the unlabelled samples by creating a learner using the model hypothesis of data distribution. Its basic setting is as follows.

Given a labelled sample set L = {(x1, y1), (x2, y2), …, (x|L|, y|L|)} with unknown distribution and an unlabelled sample set U = {x1, x2, …, x|U|}, it is expected to learn a function f: X ⟶ Y that can assign labels to the samples in U. Here, each xi is a d-dimensional vector, and yi ∈ Y is the label of sample xi. |L| and |U| are the sizes of set L and set U, respectively.

The semisupervised learning includes two kinds of hypotheses, and clustering is one of them. It derives from the intuition that two samples are more likely to have the same label when they are in the same cluster. Based on this idea, the semisupervised K-means algorithm calculates the centroid of each cluster using the labelled data. Then, for each cluster, the Euclidean distance between each unlabelled sample and the centroid is calculated according to formula 13. These unlabelled samples are gradually incorporated into the labelled ones. The above process is iterated until each cluster of unlabelled data is stable.

The following formula shows how the Euclidean distance is calculated:

$$d(X, C) = \sqrt{\sum_{i=1}^{m} (x_i - c_i)^2}, \tag{13}$$

where X and C are both vectors, X = {x1, x2, …, xm} and C = {c1, c2, …, cm}, and m is the number of components in each vector.

Based on its core idea, here is the label propagation process of semisupervised K-means, as shown in Algorithm 2.

Input: Labelled data array L and unlabelled data array U
Output: Label array LS
(1) Combine L and U into a new array LU.
(2) Calculate the centroid of each cluster, appending them into a set C.
(3) Set the loop Flag ← Changed.
(4) While Flag ≡ Changed do:
(5)   Flag ← Unchanged.
(6)   For lu ∈ LU:
(7)     Calculate the distance between lu and each Ci as Di.
(8)     Put Di in the array D.
(9)     Get the minimum of D and record the label of the corresponding cluster as Lc.
(10)     If the label of lu ≠ Lc Then:
(11)       Change the label of lu to Lc.
(12)       Flag ← Changed.
(13)     End if
(14)   End for
(15) End while

The input of Algorithm 2 consists of two parts of data, including the array of labelled data L and that of unlabelled data U. Lines 1 to 3 show the initialization part of the algorithm. Firstly, L and U are combined into a new array LU, and at the same time, the centroids of each cluster are calculated. These centroids are then put into a set C, C = {C1, C2, …, Cs}, where s is the number of centroids. Then, a flag is set denoting whether the data label is stable. After that, an iterative loop is entered to judge whether each element in LU has greater distance to the centroids of other clusters than that to the centroid of its belonging cluster. The cluster of data is updated based on the above process, until all the clusters are in a stable state.
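The label propagation process described above can be sketched in Python as follows. The function names are hypothetical, the initial centroids are computed from the labelled samples only, and recomputing the centroids after each pass is an assumption: the pseudocode does not show this step explicitly, although the description of clusters being updated until stable suggests it.

```python
import math


def euclidean(x, c):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def semisupervised_kmeans(labelled, unlabelled, max_iter=100):
    """labelled: list of (vector, label); unlabelled: list of vectors.

    Returns the final labels of the labelled rows followed by the
    unlabelled rows, in input order.
    """
    vectors = [tuple(x) for x, _ in labelled] + [tuple(x) for x in unlabelled]
    labels = [y for _, y in labelled] + [-1] * len(unlabelled)  # -1 = no label yet
    classes = sorted({y for _, y in labelled})
    centroids = {
        c: centroid([tuple(x) for x, y in labelled if y == c]) for c in classes
    }
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(vectors):
            nearest = min(classes, key=lambda c: euclidean(x, centroids[c]))
            if labels[i] != nearest:
                labels[i] = nearest
                changed = True
        if not changed:
            break  # every cluster is stable
        # Assumption: centroids are refreshed after each full pass.
        for c in classes:
            members = [x for x, y in zip(vectors, labels) if y == c]
            if members:
                centroids[c] = centroid(members)
    return labels
```

Because every row of the combined array is compared with the centroids, an originally labelled sample can also change its label, which corresponds to relabelling students in the poor student list according to their behaviour.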

The specific format of array L and array U is shown in Table 8, where the top half is the labelled data and the bottom half is the unlabelled data. For both L and U, the first column (Stu_ID) shows the ID of students, and the next six columns successively show the value of each behaviour according to our previous definition. Finally, the Label column indicates the financial status of students, which is converted from the Financial_status column of Table 6, with Poor denoted as 1 and Not_poor denoted as 0. The only difference for L and U is that the Label column is initially set to −1 for U, meaning there is no label assigned at the beginning.

3. Results and Discussion

After data preprocessing, behavioural features including students’ GPA, GPA_percentage, life regularity, diligence, and consumption level are obtained. Combining these features with the financial status, the processed dataset is produced. As is shown in Table 9, except for Student_ID (the first column), each column shows different behaviour labels of students.

Figure 2 illustrates the proportion of different labels for different behaviours. It is obvious that the data to be processed are significantly imbalanced. Therefore, the problems involved in this paper are suitable for the ABA.

3.1. Application of ABA and Results

The results of the ABA are shown in Tables 10 and 11, which list the 2-item sets and 3-item sets describing the correlation between behavioural characteristics and poverty level. From the last column of the 2-item set table, we first observe that the Balanced_supports of the rules “Poor, Regular” and “Poor, Irregular” are almost the same, which indicates that students’ financial hardship has no obvious correlation with their life regularity. A difference emerges for the diligence level: the Balanced_support of “Poor, Late” (0.2000) is higher than that of “Poor, Early” (0.1489). A likely explanation is that the diligence level is measured by the time of students’ first meal every day, and students may skip breakfast or eat it only rarely in order to save money, which leads to a later first meal. For consumption level, the Balanced_supports of “Poor, Medium” and “Poor, Low” are both higher than that of “Poor, High,” indicating that students in financial hardship spend less money on average, which matches reality. For the academic level, the Balanced_support of “Poor, GPA_high” is higher than that of “Poor, GPA_low.” This suggests that students in financial difficulty generally score higher; they may cherish every opportunity to study hard, resulting in better grades. Likewise, for GPA_percentage, the rule “Poor, Gper_A” has a higher Balanced_support than “Poor, Gper_B” and “Poor, Gper_C,” which further supports the finding that students in financial hardship generally perform better academically.

A similar conclusion can be drawn from the 3-item set table. For instance, the Balanced_support of “Poor, Low, Good” (0.05756) is clearly higher than that of “Poor, Medium, GPA_high” (0.032088), “Poor, Medium, GPA_low” (0.022446), and “Poor, High, GPA_high” (0.032985). This further suggests that students in financial hardship tend to spend less money and get higher grades. Moreover, students in financial hardship have a relatively lower diligence level, because the Balanced_support of item sets containing “Late” is generally higher than that of item sets containing “Early.” As with the 2-item sets, there is no obvious difference in terms of life regularity.

It is worth noting that if Support (the metric of the original Apriori algorithm) is used for association rule mining, the Support of “Poor, Medium” is larger than that of “Poor, Low.” However, the results of the previous step show that students in financial hardship have a low rather than a medium consumption level. The traditional Support therefore cannot reflect the patterns hidden in the original, imbalanced data distribution, whereas Balanced_support addresses this problem.

To prove the validity of the Balanced_support, we did the following steps.

Firstly, for each behavioural feature, we figured out two proportions. One is the proportion of students labelled with each specific behavioural feature in all of the students and the other is the similar proportion in those students in financial hardship.

Secondly, the two proportions are compared. As shown in Figure 3, the share of some behavioural labels is higher among poor students than among all students, for example, GPA_high, Gper_A, Irregular, and Late, with increases of 8%, 6.6%, 2.9%, and 1.6%, respectively. These results indicate that the group of students in financial hardship has a different distribution of behavioural labels from that of all students; that is, students in financial hardship behave differently. Specifically, they tend to work harder at their studies, with a better academic level and lower consumption. In addition, based on our definitions of life regularity and diligence, students in financial hardship live slightly more irregular lives and tend to be less diligent. Overall, the difference between the two proportions suggests that our proposed Balanced_support is reasonable, because a higher proportion of a given label among poor students indicates that poor students are more likely to exhibit that behaviour.
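The two-proportion comparison described above can be illustrated with a short Python sketch. It is only an illustration on made-up records: the helper names `label_shares` and `share_shift` and the column name `GPA_level` are hypothetical, while `Financial_status`, `Poor`, and `Not_poor` follow the naming used in this paper. The sketch reports the difference in shares in percentage points; whether Figure 3 reports an absolute or a relative increase is not stated here, so that choice is an assumption.

```python
from collections import Counter


def label_shares(records, feature):
    """Share of each label of `feature` within `records`."""
    counts = Counter(r[feature] for r in records)
    total = len(records)
    return {label: n / total for label, n in counts.items()}


def share_shift(records, feature, status_key="Financial_status", poor_value="Poor"):
    """Share among poor students minus share among all students, per label."""
    overall = label_shares(records, feature)
    poor_only = [r for r in records if r[status_key] == poor_value]
    poor = label_shares(poor_only, feature)
    return {label: poor.get(label, 0.0) - share for label, share in overall.items()}


# Made-up records: 4 poor students (3 with GPA_high) out of 10 students.
records = (
    [{"GPA_level": "GPA_high", "Financial_status": "Poor"}] * 3
    + [{"GPA_level": "GPA_low", "Financial_status": "Poor"}] * 1
    + [{"GPA_level": "GPA_high", "Financial_status": "Not_poor"}] * 2
    + [{"GPA_level": "GPA_low", "Financial_status": "Not_poor"}] * 4
)
for label, shift in share_shift(records, "GPA_level").items():
    print(f"{label}: {shift:+.2%}")
```

A positive shift for a label means that the label is over-represented among poor students relative to the whole population, which is the pattern the comparison looks for.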

We have introduced that the input parameter of the Apriori Balanced Algorithm (ABA) contains a dataset D, as shown in Table 7, where six behaviour features are listed as columns. Next, we dive deeper into the fine-grained relationship between the proportion of poor students and different types of behaviour features to see whether the previous conclusions drawn through Balanced_support are in accordance with the behaviour characteristic of poor students in our dataset.

Firstly, we explored the distribution of scores among students in financial hardship. In our university, the maximum score for each subject is 100 points. Dividing all students into 5 categories according to their scores, we calculated the proportion of poor students in each category. As seen in Figure 4, among students who score higher than 90 points, 28.57% are poor. For lower-score categories, this proportion generally follows a downward trend, which indicates that students in financial hardship generally achieve higher academic performance.

We also carried out a similar experiment on GPA_percentage. In Figure 5, students are divided into 10 categories by GPA ranking, and the proportion of poor students is calculated for each category. For example, 0–0.1 means that a student ranks in the top 10% of his/her own major, and 0.1–0.2 corresponds to a ranking between the top 10% and the top 20%. Among students ranking in the top 10%, 32.3% are in financial hardship; as the GPA ranking value increases (meaning decreasing academic performance), the proportion of poor students declines. In the last 10%, poor students account for only 17.3%. These results provide further evidence that students in financial hardship generally perform better in their studies.
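The per-category proportions used for scores and GPA ranking amount to binning a numeric feature and counting poor students in each bin. The following Python sketch shows one way to do this. The function name `poor_share_by_bin`, the bin edges, and the toy values are hypothetical, and the half-open bin convention (each bin includes its lower edge, and the last bin also includes its upper edge) is an assumption, since the exact boundaries are not given here.

```python
from bisect import bisect_right


def poor_share_by_bin(values, is_poor, edges):
    """Proportion of poor students in each bin.

    values  -- numeric feature per student (e.g. score or GPA ranking)
    is_poor -- booleans of the same length
    edges   -- ascending bin edges; bin i covers [edges[i], edges[i + 1])
               and the last bin also includes its upper edge
    Returns one proportion per bin, or None for an empty bin.
    """
    n_bins = len(edges) - 1
    totals = [0] * n_bins
    poor = [0] * n_bins
    for value, flag in zip(values, is_poor):
        if value < edges[0] or value > edges[-1]:
            continue  # outside the binned range
        i = min(bisect_right(edges, value) - 1, n_bins - 1)
        totals[i] += 1
        poor[i] += int(flag)
    return [p / t if t else None for p, t in zip(poor, totals)]


# Toy GPA rankings (0 = best) split into two halves instead of ten deciles.
rankings = [0.05, 0.2, 0.4, 0.6, 0.9, 1.0]
poor_flags = [True, True, False, False, True, False]
print(poor_share_by_bin(rankings, poor_flags, [0.0, 0.5, 1.0]))
```

Passing ten equally spaced edges from 0 to 1 would reproduce the decile grouping described for the GPA ranking.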

For consumption level, using a similar method, we first divided the average consumption amount into seven categories. In Figure 6, among all students whose average consumption amount is 0–5 yuan and 5–6 yuan, those in financial hardship account for 63.6% and 40.6%, respectively. Thus, poor students are concentrated at rather low consumption levels, which is also evidenced by the fact that they make up only 14.2% of the “10 or more” group.

Since consumption speed is also a component of our definition of consumption level, we also examined the relationship between the distribution of poor students and the average number of purchases they need to spend 100 yuan. Students are grouped into seven categories, and we place particular emphasis on those who need more than 10 purchases. According to Figure 7, among students who need 20 or more purchases to spend 100 yuan (meaning a low amount per purchase), 63.6% are students in financial hardship. The last three groups all show a high percentage of poor students, which further supports the low consumption level of students in financial hardship.

Previously, we defined life entropy (LE) to represent students’ life regularity. Here, we focus on poor students and explore their distribution across different values of LE. As shown in Figure 8, when students are grouped into seven categories by LE, the proportion of poor students increases as LE becomes larger. According to our definition, a larger LE corresponds to lower regularity. This can be considered additional evidence for our previous conclusion about the life regularity of poor students.

Finally, we studied the relationship between students’ average time of first meal and the proportion of poor students, which reflects the diligence level. As shown in Figure 9, after dividing the first-meal time into 6 categories, we find that the proportion of poor students is highest in the 8:00–9:00 group, followed by the 7:00–8:00 and 9:00–10:00 groups. This means that poor students tend to eat their first meal later, showing a relatively lower diligence level according to our definition.

The above analysis further explained the detailed distribution of poor students relating to different behavioural features, and the obtained results basically conform to the association rules we have found using the proposed Balanced_support. Therefore, the Apriori Balanced Algorithm (ABA) can be used for mining the relationship between students’ poverty level and their daily behaviour.

3.2. Validation on the Groceries

To further verify the effectiveness and adaptability of the ABA algorithm, using Apriori as the comparison algorithm, we tested our algorithm on the public dataset Groceries. The data format of Groceries is shown in Table 2. The comparison results of the Apriori and ABA algorithm are shown in Table 12. Besides, the number of different products in the dataset is shown in Table 13.

As seen in Table 12, for most item sets, a larger Support reflects a stronger association among the items. In the comparison of Group 1 and Group 2, item sets with a higher Support also have a higher Balanced_support. For example, the Support of “soda, sausage” is higher than that of “soda, pastry,” and so is its Balanced_support. We also know from Table 13 that the count of pastry is 875, which is close to that of sausage (924). That is, when comparing item sets whose items have similar quantities, the proportion of the items in the whole dataset does not influence the association rules obtained. In this situation, Support and Balanced_support yield the same association rules, i.e., the association between soda and sausage is stronger than that between soda and pastry.

In contrast, when the quantities of the items being compared differ greatly, the Support of the item set containing the more frequent item is clearly higher than that of the item set containing the less frequent one. For instance, in Group 3, the Support of “whole milk, soda” (0.040061) is clearly higher than that of “whole milk, shopping bags” (0.024504), since the quantity of soda is 1715, while that of shopping bags is 969. Because the quantities of these two items differ markedly, their proportions in the whole dataset also differ. However, the Balanced_support of “whole milk, shopping bags” is higher than that of “whole milk, soda.” That is, once the difference in proportions is accounted for, the association of “whole milk, shopping bags” is actually stronger. This conclusion also has practical significance. For example, if a store intends to increase the sales volume of whole milk by promoting either “shopping bags” or “soda,” the ABA result suggests that “shopping bags” is the better choice.
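The effect of item frequency on raw Support can be reproduced on a toy example. The Python sketch below computes the standard Apriori Support of an item set and, purely for contrast, the standard confidence of a rule, which divides by the frequency of the antecedent. Confidence is used here only as a familiar frequency-normalised measure; it is not the Balanced_support proposed in this paper, whose definition is given in the method section. The basket contents are invented and are not taken from the Groceries dataset.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Standard rule confidence: support(A and B together) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Invented baskets: soda is twice as frequent as shopping bags.
baskets = [
    {"whole milk", "soda"}, {"whole milk", "soda"}, {"whole milk", "soda"},
    {"soda"}, {"soda"}, {"soda"},
    {"whole milk", "shopping bags"}, {"whole milk", "shopping bags"},
    {"shopping bags"},
    {"bread"},
]

# Raw Support ranks the pair with the more frequent item higher ...
print(support(baskets, {"whole milk", "soda"}))
print(support(baskets, {"whole milk", "shopping bags"}))
# ... while the frequency-normalised measure reverses the ranking.
print(confidence(baskets, {"soda"}, {"whole milk"}))
print(confidence(baskets, {"shopping bags"}, {"whole milk"}))
```

In this toy data the raw Support favours “whole milk, soda” simply because soda appears in more baskets, while a larger fraction of the shopping-bag baskets contain whole milk, which mirrors the reversal described for Group 3.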

3.3. Semisupervised K-Means Application and Results
3.3.1. Data Preparation

Based on the association rules obtained in the previous sections, i.e., poor students tend to study better and spend less money, we constructed the labelled dataset and the unlabelled dataset, which are the input parameters of Algorithm 2. Labelled data refer to the data with a label that indicates a student’s poverty level, so the key problem is how to select poor students from the whole dataset. Results from Figures 4–9 have shown that financially hard students have higher academic performance, lower consumption, and irregular life. Based on such principle, we defined 4 rules for choosing financially difficult students, as shown in Table 14.

Different rules correspond to different amounts of labelled data and unlabelled data, since the criteria for different behavioural labels vary. As is shown in Table 15, if we set R1 as rule, for example, the labelled data will contain 9 Poor students and 30 Not_poor students, and the amount of unlabelled data will be 3151. Similarly, rules R2, R3, and R4 correspond to different amounts of labelled data and unlabelled data, respectively.

Through the semisupervised K-means experiment, we finally realized the identification of poor students, which includes identifying students on the poor student list provided by the university who are not really in financial hardship and finding hidden poor students who are not on that list.

3.3.2. Evaluation Metric

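In this paper, predicting students with financial difficulties is formulated as a binary classification problem. To validate the effectiveness, four commonly used metrics are selected:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP is the number of students with financial difficulties that are classified correctly, TN is the number of students without financial difficulties that are classified correctly, and FN and FP are the numbers of students with financial difficulties and of normal students, respectively, that are classified incorrectly.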
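The four metrics can be computed directly from the counts TP, TN, FP, and FN defined above. The sketch below uses the standard definitions of accuracy, precision, recall, and F1 score; the function name `classification_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion counts.

    Assumption: a metric whose denominator is zero is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts: 10 poor students (8 found), 90 others (5 flagged wrongly).
for name, value in classification_metrics(tp=8, tn=85, fp=5, fn=2).items():
    print(f"{name}: {value:.3f}")
```

Because poor students are a minority, accuracy alone can look high even when few of them are found, which is why precision, recall, and F1 are reported alongside it.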

3.3.3. The Rules' Influence on Model

In the previous sections, we defined different rules for choosing students in financial hardship. The amounts of labelled and unlabelled data vary with the rule, and these amounts in turn influence the prediction performance. To explore the impact of the different rules on prediction performance, we use each rule to generate a dataset and conduct comparison experiments with the logistic regression model. The results are compared in Table 16.

From Table 16, we find that R2 has better performance than other rules, so in the next experiment, we use R2 as the main rule for choosing the amount of labelled data and unlabelled data in Algorithm 2.

3.3.4. The Process of Label Propagation

The data processed by the semisupervised K-means algorithm contain six dimensions. We selected GPA as X-axis and Avgconsum as Y-axis. The process of label propagation is displayed in Figure 10.

Semisupervised learning requires both unlabelled data and labelled data. Here, the initialized data contain 20% labelled data and 80% unlabelled data. As shown in Figure 10(a), blue points represent students with financial difficulties, orange points represent students without financial difficulties, and gray points represent data without labels. In Figure 10(b), the proportion of labelled data has increased from 20% to 40%, and that of unlabelled data has decreased from 80% to 60%. This is because, after processing by the semisupervised K-means algorithm, a further 20% of the data, previously unlabelled, were assigned to categories according to their Euclidean distance to the centre points of the two categories. Figures 10(c)–10(e) successively show the labels propagating to 60%, 80%, and 100% of the total data, respectively. Figure 10(f) shows the classification of all data by means of SVC; SVC fits a clear classification curve, indicating a good propagation effect.

During the propagation, a number of points change from blue to orange, which represents identifying students who are on the poor student list but not really in financial hardship. Some points change from orange to blue, which represents finding students with financial difficulties. As shown in Figure 10(a), the blue point X1 represents a student with financial difficulty. After identification by the model, X1 was found not to conform to the behavioural characteristics of students with financial difficulties, so it was relabelled as orange during propagation. The orange point X2 in Figure 10(a) represents a student without financial difficulties. After model identification, the behavioural characteristics of X2 were found to match those of students with financial difficulties, so it was relabelled as blue during propagation. This process represents the identification of students with hidden financial difficulties. Therefore, this model can be used to identify students without financial difficulties on the poor student list and to discover hidden students with financial difficulties among all students.

3.3.5. Label Propagation’s Influence on Prediction

In this section, four classical classification algorithms are applied to the new dataset produced by the proposed method. The input format of all algorithms is shown in Table 17, and their performances are compared in Table 18. Compared with the model trained on the original data with the old labels, the model trained on the new dataset with the new labels performs significantly better, which means that label propagation has greatly improved the prediction effect of the model. Logistic regression achieves an accuracy of 0.96, much higher than the other algorithms. It also achieves a relatively high F1 score of 0.94 and the highest recall of 0.96, despite having the lowest precision. This suggests that our method is best suited to logistic regression when used for the identification of students in financial hardship.

3.3.6. The Influence of Different Behavioural Features on the Model

Although we have found that logistic regression achieves the best result on our new dataset, how its performance changes with different features remains to be explored. In this section, we test different behavioural features on logistic regression, and the performance is shown in Table 19. Among all the behavioural features, GPA_percentage is the most outstanding, with an accuracy of 0.88, a precision of 0.61, a recall of 0.95, and an F1 score of 0.74. GPA also achieves high accuracy and high precision. This suggests that GPA_percentage and GPA are the more distinguishing features for the identification of students in financial hardship. In contrast, Avg_consum, Regular, and Diligence achieve relatively lower accuracy and very low precision, especially Regular and Diligence, whose precision, recall, and F1 score are all 0. This indicates that these behavioural features, if used independently, cannot determine whether a student is in financial hardship. As for Spd_consum, although it achieves fairly low precision and F1 score, it still contributes to the identification of students in financial hardship. Therefore, we conclude that GPA_percentage, GPA, and Spd_consum contribute substantially to the identification of students in financial hardship, while the remaining features contribute less.
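A precision, recall, and F1 score that are all 0 imply that the model produced no true positives for that feature. One way this can happen, offered here as an assumption rather than as something reported in Table 19, is that a single-feature model predicts Not_poor for every student. The Python sketch below, with hypothetical function names and invented labels, shows that such a model still obtains a non-zero accuracy on imbalanced data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels, with `positive` as Poor."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def scores(y_true, y_pred):
    """(accuracy, precision, recall, F1); zero denominators give 0.0."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, precision, recall, f1


# Invented labels: 2 poor students out of 10, and a model that never says Poor.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(scores(y_true, y_pred))
```

The accuracy in this toy case equals the share of Not_poor students, which is why accuracy on its own is not a reliable indicator for a weak single feature.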

Finally, the model is trained using all six behavioural features, and the result is shown in the Total row in Table 19, with the highest accuracy, recall, and F1 score. That is to say, identifying students in financial hardship is a comprehensive process determined by multiple behavioural features, and our proposed features are effective for such a process.

4. Conclusions

In this work, we proposed the Apriori Balanced Algorithm (ABA) and carried out association rule mining for students in financial hardship through a new measure, Balanced_support, which is used to represent correlation strength and better find out how students’ poverty level is correlated with their daily behaviour. In addition, through association rule mining, we found that students in financial hardship have better academic performance and lower consumption level with relatively lower life regularity and diligence level. Next, based on the obtained association rules, we noticed that some students selected in the poor student list do not conform to the above rules. Therefore, we used semisupervised K-means to identify students in real financial hardship, as well as finding out the students who are not really poor. Tested by classical classification algorithms, the proposed method displays better identification performance compared with the original assessment approach.

In the future, we need to further optimize our framework. For example, data with other dimensions are required to describe student behaviour more comprehensively, such as water consumption records, book-borrowing records, and Internet login records. Also, we will incorporate knowledge from other research areas to explore the behavioural characteristics of financially hard students more deeply, such as combining with psychology to study psychological problems of students in poverty.

Data Availability

The original data of precise behavioural records cannot be released in order to preserve the privacy of individuals.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Higher Education Research Project of Jilin Province (grant no. ZD18027).