Abstract

Massive Open Online Courses (MOOCs) have boomed in recent years because learners can arrange learning at their own pace. However, the high dropout rate is a universal and still unsolved problem in MOOCs, and dropout prediction has therefore received much attention. A previous study reported that discrepancies in learning behavior lead to widely fluctuating prediction results. Moreover, previous methods require iterative training, which is time intensive. To address these problems, we propose DT-ELM, a novel hybrid algorithm combining a decision tree and an extreme learning machine (ELM) that requires no iterative training. The decision tree selects features with good classification ability. Furthermore, it determines enhanced weights of the selected features to strengthen their classification ability. To achieve accurate prediction results, we optimize the ELM structure by mapping the decision tree to the ELM based on entropy theory. Experimental results on the benchmark KDD 2015 dataset demonstrate the effectiveness of DT-ELM, which is 12.78%, 22.19%, and 6.87% higher than the baseline algorithms in terms of accuracy, AUC, and F1-score, respectively.

1. Introduction

MOOCs emerged as a natural solution for offering distance education as online learning has changed enormously over the past years. MOOCs are widely used because of their potentially unlimited enrollment, lack of geographical limitation, free access to the majority of courses, and structural resemblance to traditional lectures [1, 2]. Simply put, they allow learners to learn anytime and anywhere at their own pace. As MOOCs have boomed in popularity [3, 4], the number of enrolled participants has increased rapidly from 8 million in 2013 to 101 million in 2018 [5, 6].

However, one critical problem that should not be neglected is that only an extremely low percentage of participants complete their courses [7–10]. Meanwhile, due to the high learner-to-instructor ratio in the online learning environment [8], it is unrealistic for instructors to track the learning behavior that leads to dropout or retention. Many educational institutions would benefit from accurate dropout prediction, which helps to improve course design, content, and teaching quality [11–13]. On the other hand, it also helps instructors supply learners with effective interventions, such as personalized recommendations of educational resources and guiding suggestions.

Dropout prediction has recently received much attention. Previous studies applied traditional machine learning algorithms to it, including logistic regression [14–18], support vector machines [19], decision trees [20], boosted decision trees [2, 21, 22], and hidden Markov models [23]. However, their accuracy is low, leading to misidentification of at-risk learners, i.e., those who may quit courses.

Most recently, deep learning has become the state-of-the-art machine learning technique and shows great potential for dropout prediction [24]. Jacob et al. applied a deep, fully connected feedforward neural network that capitalizes on automatically learned nonlinear feature representations. Fei et al. utilized a recurrent neural network with long short-term memory (LSTM) cells, which encodes features into continuous states [25]. Although deep learning achieves higher accuracy than traditional machine learning methods, deep neural networks need iterative training and a large amount of training data.

Moreover, due to design discrepancies among MOOC platforms, current research utilizes different learning behaviors for dropout prediction [26]. The lack of a uniform definition and understanding of learning behaviors in the online learning environment [27, 28] leads to inconsistent conclusions about which behavior features have better classification ability, and prediction results fluctuate widely as a result of this learning behavior discrepancy. Feature selection is therefore essential in dropout prediction. Nevertheless, little attention has been devoted to it, and most related studies utilize as many features as possible. The genetic algorithm is one of the commonly used feature selection methods, offering good scalability and easy combination with other algorithms [29, 30]. However, it needs iterative training.

The goal of our approach is to incorporate feature selection and fast training to realize accurate dropout prediction. To address feature selection, we adopt the decision tree algorithm because of its tree structure and theoretical basis. Furthermore, the selected features are enhanced with different weights depending on the decision tree structure, with the aim of strengthening the features with good classification ability.

To realize fast training, we choose the ELM algorithm for dropout prediction. ELM is a single hidden layer feedforward neural network (SLFN) that improves on gradient-based algorithms and requires no updating of parameters through repeated iterations [31, 32]. However, a theoretical guiding rule to determine the structure of ELM is lacking, and different structures lead to different prediction results.

To achieve accurate dropout prediction results, we map the decision tree structure to the ELM structure based on entropy theory. The mapping rule takes full account of the impact of internal nodes on leaf nodes in the decision tree. It determines not only the number of neurons in each layer of the ELM, but also the connections between the input layer and the hidden layer. In this way, reasonable information assignment is realized at the initial stage of the ELM.

In line with common practice in dropout prediction, we extract behavior features from raw learning records. Unlike past approaches, feature selection and enhancement are realized by the decision tree, which is then incorporated with ELM to realize fast training and accurate prediction. It is noteworthy that we utilize the same tree structure to solve these different problems. The core of the proposed algorithm is the design of the mapping rule that determines the structure of the ELM.

The main contributions of this paper can be summarized as follows. Firstly, we define and extract several interpretable behavior features from raw learning behavior records. Secondly, we propose a novel hybrid algorithm combining a decision tree and ELM for dropout prediction. It solves the problems of behavior discrepancy, iterative training, and structure initialization of ELM, and it makes full use of the same decision tree structure as a warm start for the whole algorithm. Finally, we verify the effectiveness of the proposed algorithm through experiments on the benchmark KDD 2015 dataset, where it performs much better than the baseline algorithms on multiple criteria.

2. Method

2.1. Problem Statement

There are three definitions of MOOC dropout prediction in current studies. The first is whether a learner will still participate in the last week of the course [33–35]. The second is whether the current week is the last week in which a learner has activities [17, 19, 36]. These two definitions are similar because they are related to the final state of a learner, and the dropout label cannot be determined until the end of the course. The third definition is whether a learner will still participate in the coming week of the course, which is related to the ongoing state of a learner [25, 37]. Under this definition, the dropout label can be determined from the behavior of the current week, which helps instructors make timely interventions. Thus, the third definition is used in this paper.

The expectation confirmation model explains why users continue to use an information system [38], and it has been extended to explain why learners continue to use MOOCs [39]. These studies find several significant factors that influence continued usage, such as confirmation of prior use, perceived usefulness, and learners' satisfaction. Accordingly, the current week of learning may have the greatest impact on the intention of continued usage: for most learners, if they confirm the usefulness of and feel satisfied with the current week's learning, they are likely to have a strong intention to continue learning in the next week.

Therefore, the goal of this paper is to predict who may stop learning in the coming week based on the learning behaviors of the current week, which helps instructors better track the learning state of the learner and take corresponding interventions. Assume there are $m$ behavior features extracted from a learner's learning behavior records for the current week, represented as an $m$-dimensional vector $x_i$, and let $y_i$ be the corresponding dropout label. If there are activities associated with learner $i$ in the coming week, the dropout label of this learner is $y_i = 0$, which indicates that the learner will continue to learn. Otherwise, the dropout label is $y_i = 1$, which means the learner will quit the course next week.

2.2. Framework of MOOC Dropout Prediction

To address the problem of MOOC dropout prediction, we propose the framework shown in Figure 1. The first module designs and extracts several features from learners' learning behavior records. Feature quantification is realized by counting the records of each behavior, which reflects the engagement of learners. The outputs of this module are the feature matrix and the label matrix.

The second module implements dropout prediction using the DT-ELM algorithm based on the extracted behavior features. The decision layer is designed to select features and determine the ELM structure based on the decision tree; it outputs the tree structure to the mapping layer and the selected features to the enhancement layer. The enhancement layer strengthens the classification ability of the selected features and outputs the enhanced features to the improved ELM. The mapping layer determines the ELM structure according to the tree structure; it outputs the number of neurons in each layer and the connections between layers. The improved ELM then outputs dropout or retention.

2.3. Feature Extraction

We extract features from learning behavior records. Courses generally launch weekly, so it is better to utilize the numbers of learning behavior records per week as features [40]. The set of records for the $j$-th type of learning behavior is represented as $R_j$, where $j = 1, \ldots, m$ and $m$ denotes the number of behavior types. The record counts of the $j$-th type of learning behavior over the duration of a course are expressed as a vector $c_i^j = (c_{i1}^j, \ldots, c_{iT}^j)$, where $i$ represents the learner, $c_{it}^j$ is the number of learning behavior records in the $t$-th week, and $T$ is the number of weeks the course lasts. The feature extraction process is shown in Algorithm 1, which outputs the feature matrix and the label matrix.

Inputs:
$R$: learning behavior records of a course
$N$: enrollment number of learners
$m$: number of behavior feature types
$T$: duration of the course in weeks
Outputs:
$X$: feature matrix of size $N \times m$
$Y$: label matrix of size $N \times 1$
1: Group $R$, the set of learning behavior records of the course, by behavior type. Let $R_j$ be the record set of the $j$-th type of learning behavior, where $j = 1, \ldots, m$.
2: Divide the duration of the course into $T$ weeks.
3: For each learning behavior record in $R_j$
4:   If this record occurred in week $t$ and was generated by learner $i$
5:     $c_{it}^j \leftarrow c_{it}^j + 1$
6: For each learner $i$, the learning behavior feature vector $x_i = (c_{it}^1, \ldots, c_{it}^m)$ of the current week $t$ is obtained.
7: Form the feature matrix $X = [x_1, \ldots, x_N]^T$, where $x_i$ contains the $m$ types of behavior features of the $i$-th learner.
8: Form the label matrix $Y = [y_1, \ldots, y_N]^T$, where $y_i \in \{0, 1\}$.

After feature extraction, the feature matrix $X = [x_1, \ldots, x_N]^T$ is obtained, where $N$ is the number of enrolled learners and $x_i$ represents the behavior features of the $i$-th learner. $Y = [y_1, \ldots, y_N]^T$ is the label matrix, where $y_i$ is the dropout label of the $i$-th learner.

Effective learning time is another kind of behavior feature and represents the actual time that a learner spends on learning. In practice, a learner may click a video and then leave for something else. Therefore, we set a threshold on the interval between two activity clicks; time exceeding the threshold is not counted. A small sketch of this extraction step is given below.
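To make the extraction concrete, the following sketch (our illustration, not the original implementation) counts weekly records per behavior type and accumulates effective learning time with an assumed 30-minute gap threshold; record fields such as 'learner_id', 'behavior_type', and 'timestamp' are hypothetical names.

```python
from collections import defaultdict
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed threshold between two clicks

def extract_features(records, course_start, n_weeks):
    """Return {learner_id: {week: {behavior_type or 'time': value}}}."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    clicks = defaultdict(list)
    for r in records:
        week = (r['timestamp'] - course_start).days // 7
        if 0 <= week < n_weeks:
            counts[r['learner_id']][week][r['behavior_type']] += 1
            clicks[r['learner_id']].append(r['timestamp'])
    # Effective learning time: sum the gaps between consecutive clicks,
    # discarding any gap longer than the threshold (the learner likely left).
    for learner, stamps in clicks.items():
        stamps.sort()
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev <= SESSION_GAP:
                week = (prev - course_start).days // 7
                counts[learner][week]['time'] += (cur - prev).total_seconds()
    return counts
```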

2.4. Dropout Prediction Based on DT-ELM Algorithm

Decision Layer. The decision layer implements feature selection using a decision tree based on the maximum information gain ratio [41]. The feature matrix $X$ is the input of the decision layer. Each instance in $X$ is represented as $x_i = (x_{i1}, \ldots, x_{im})$, and $y_i$ is the class label of $x_i$. The output of the decision layer is $X'$, which contains only the selected features. Each instance in $X'$ is represented as $x_i' = (x_{i1}', \ldots, x_{is}')$, which means there are $s$ selected features in $X'$.

The decision tree is constructed by recursively partitioning the data into smaller subsets until a specified stopping criterion is reached, for example, that all the subsets belong to a single class. A single feature split is recursively defined for all nodes of the tree using some criterion, and the information gain ratio is one of the most widely used criteria for decision trees. The entropy, which comes from information theory, is described as
$$\mathrm{Entropy}(D) = -\sum_{k=1}^{K} p_k \log_2 p_k, \tag{1}$$

where $K$ represents the number of classes and $p_k$ is the probability that an instance belongs to the $k$-th class. The split rule is defined by the information gain, which represents the expected reduction in entropy after the split according to a given feature $a$. The information gain is described as follows:
$$\mathrm{Gain}(D, a) = \mathrm{Entropy}(D) - \mathrm{Entropy}(D \mid a). \tag{2}$$

$\mathrm{Entropy}(D \mid a)$ is the conditional entropy, which represents the entropy of $D$ based on the partitioning by feature $a$. It is computed by the weighted average over all subsets $D_v$ resulting from the split, as shown in (3), where $|D_v|/|D|$ acts as the weight of the $v$-th partition:
$$\mathrm{Entropy}(D \mid a) = \sum_{v=1}^{V} \frac{|D_v|}{|D|}\,\mathrm{Entropy}(D_v). \tag{3}$$

The information gain ratio extends the information gain by applying a kind of normalization using a "split information" value:
$$\mathrm{GainRatio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{SplitInfo}(D, a)}, \quad \mathrm{SplitInfo}(D, a) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}. \tag{4}$$

The feature with the maximum information gain ratio is selected as the splitting feature, which is defined as follows:
$$a^{*} = \arg\max_{a \in A} \mathrm{GainRatio}(D, a). \tag{5}$$

The decision tree is constructed in this way. Each internal node of the decision tree corresponds to one of the selected features, and each terminal (leaf) node represents a specific class of the categorical target variable.
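As an illustration only (not the authors' code), the following Python snippet computes the entropy, information gain, and gain ratio of equations (1)–(4) for a single candidate feature, assuming the feature values have been discretized into a few bins.

```python
import numpy as np

def entropy(labels):
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def gain_ratio(feature_values, labels):
    feature_values, labels = np.asarray(feature_values), np.asarray(labels)
    total = entropy(labels)                  # Entropy(D), equation (1)
    cond, split_info = 0.0, 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        w = float(mask.mean())               # |D_v| / |D|
        cond += w * entropy(labels[mask])    # conditional entropy, equation (3)
        split_info -= w * np.log2(w)         # "split information"
    gain = total - cond                      # information gain, equation (2)
    return gain / split_info if split_info > 0 else 0.0
```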

Enhancement Layer. Because the classification ability of each selected feature is different, the impact of each feature on the leaf nodes is different. The root node has the best classification ability and connects to all leaf nodes, which means it has the greatest impact on all leaf nodes. Each internal node other than the root connects to fewer leaf nodes and has less impact on the connected leaf nodes. Each value of the $j$-th selected feature is therefore multiplied by a number $n_j$, which equals the number of leaf nodes its internal node connects to in the decision tree. This is represented as follows:
$$\hat{x}_{ij} = n_j \cdot x_{ij}'. \tag{6}$$

By this step, we enhance the impact of the selected features on leaf nodes based on the tree structure.
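A minimal sketch of this enhancement step is given below; `X_sel` and `leaf_counts` are hypothetical names for the selected-feature matrix and the per-feature leaf counts derived from the tree.

```python
import numpy as np

def enhance(X_sel, leaf_counts):
    # Multiply each selected feature column by the number of leaf nodes its
    # internal node connects to, so features split nearer the root (with
    # better classification ability) get proportionally larger values (eq. (6)).
    return np.asarray(X_sel, dtype=float) * np.asarray(leaf_counts, dtype=float)
```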

Mapping Layer. In the mapping layer, inspired by the entropy net [42], we map the decision tree to the ELM. Table 1 shows the corresponding mapping rules between nodes in decision tree and neurons in ELM.

The number of internal nodes in the decision tree equals the number of neurons in the input layer of the ELM. Each leaf node in the decision tree is mapped to a corresponding neuron in the hidden layer of the ELM. The number of distinct classes in the decision tree equals the number of neurons in the output layer of the ELM. The paths between nodes in the decision tree decide the connections between the input layer and the hidden layer of the ELM. This mapping principle determines the number of neurons in each layer and, meanwhile, improves the ELM with fewer connections.

Figure 2 shows an illustration of mapping the decision tree to ELM. The first neuron of the input layer is mapped from the root node of the decision tree. It connects to all hidden neurons mapped from all leaf nodes. That means the first neuron has impact on every hidden neuron. The second neuron of the input layer connects to the four hidden neurons according to the decision tree structure. That means the second neuron has impact on the four hidden neurons. The dashed lines show that there exist no corresponding paths between the internal nodes and leaf nodes in the decision tree. Therefore, there exist no connections between the corresponding neurons in the input layer and the corresponding neurons in the hidden layer of ELM.
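The mapping is not specified at code level in the paper; the sketch below shows one possible way to derive the input-to-hidden connection mask from a fitted scikit-learn DecisionTreeClassifier, under the assumption that each internal node becomes an input neuron and each leaf a hidden neuron. Note that scikit-learn's tree uses information gain ('entropy') rather than the gain ratio described above, so this is only an approximation of the decision layer.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def connection_mask(tree):
    t = tree.tree_
    leaves = [i for i in range(t.node_count) if t.children_left[i] == -1]
    internal = [i for i in range(t.node_count) if t.children_left[i] != -1]
    leaf_index = {leaf: k for k, leaf in enumerate(leaves)}
    mask = np.zeros((len(internal), len(leaves)))

    def descend(node, ancestors):
        if t.children_left[node] == -1:              # leaf node
            for row in ancestors:                    # connect every ancestor
                mask[row, leaf_index[node]] = 1.0    # internal node to this leaf
            return
        row = internal.index(node)
        descend(t.children_left[node], ancestors + [row])
        descend(t.children_right[node], ancestors + [row])

    descend(0, [])
    # Rows correspond to internal nodes; t.feature gives the selected
    # feature tested at each internal node.
    return mask, t.feature[np.array(internal)]

# Hypothetical usage:
# clf = DecisionTreeClassifier(criterion='entropy').fit(X, y)
# mask, selected_features = connection_mask(clf)
```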

Improved ELM. Once the structure of the ELM is determined, the enhanced features are input into the ELM. The connectionless weights between the input layer and the hidden layer are initialized with zero or extremely small values close to zero. The other connection weights, as well as the biases of the hidden layer, are initialized randomly. A unique optimal solution can be obtained once the number of hidden neurons and the initialized parameters are determined.

There are $N$ arbitrary training instances $(x_i, t_i)$, where $x_i \in \mathbb{R}^{s}$ and $t_i \in \mathbb{R}^{c}$. An SLFN with $L$ hidden neurons can be represented as follows:
$$\sum_{j=1}^{L} \beta_j\, g(w_j \cdot x_i + b_j) = o_i, \quad i = 1, \ldots, N, \tag{7}$$

where $g(\cdot)$ is the activation function of the hidden neurons, $w_j$ is the weight vector connecting the input neurons to the $j$-th hidden neuron, $w_j \cdot x_i$ is the inner product of $w_j$ and $x_i$, $b_j$ is the bias of the $j$-th hidden neuron, and $\beta_j$ is the weight vector connecting the $j$-th hidden neuron to the output neurons.

The target of a standard SLFN is to approximate these instances with zero error, which is represented as (8), where the desired output is $t_i$ and the actual output is $o_i$:
$$\sum_{i=1}^{N} \| o_i - t_i \| = 0. \tag{8}$$

In other words, there exist proper $\beta_j$, $w_j$, and $b_j$ such that
$$\sum_{j=1}^{L} \beta_j\, g(w_j \cdot x_i + b_j) = t_i, \quad i = 1, \ldots, N. \tag{9}$$

Equation (9) can be represented compactly as follows:
$$H\beta = T, \tag{10}$$

where
$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}, \quad \beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_L^{T} \end{bmatrix}_{L \times c}, \quad T = \begin{bmatrix} t_1^{T} \\ \vdots \\ t_N^{T} \end{bmatrix}_{N \times c}. \tag{11}$$

Once $w_j$ and $b_j$ are determined, the output matrix $H$ of the hidden layer is uniquely calculated. The training process can then be transformed into finding a least-squares solution $\hat{\beta} = H^{\dagger} T$ of the linear system in (10). The algorithm can be described as in Algorithm 2.

1: Given the training set $\{(x_i, t_i) \mid x_i \in \mathbb{R}^{s}, t_i \in \mathbb{R}^{c}, i = 1, \ldots, N\}$, the activation
function $g(\cdot)$, and the number of hidden neurons $L$.
2: Randomly assign the input weight vectors $w_j$ and the biases $b_j$, except the
connectionless weights between the input layer and the hidden layer, which are set to zero.
3: Calculate the hidden layer output matrix $H$.
4: Calculate the output weight matrix $\hat{\beta} = H^{\dagger} T$, where $H^{\dagger}$ is
the Moore-Penrose generalized inverse of the matrix $H$.
5: Obtain the predicted values based on the input variables.
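For illustration, a minimal numpy sketch of this noniterative training step follows; it assumes `mask` is the connection matrix produced by the mapping layer and that the labels `T` are one-hot encoded. It is a simplified reading of Algorithm 2, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, mask, rng=np.random):
    """mask: (n_inputs, n_hidden) 0/1 matrix from the mapping layer."""
    n_in, n_hidden = mask.shape
    W = rng.uniform(-1, 1, size=(n_in, n_hidden)) * mask  # zero the missing links
    b = rng.uniform(-1, 1, size=n_hidden)
    H = sigmoid(X.dot(W) + b)            # hidden layer output matrix H
    beta = np.linalg.pinv(H).dot(T)      # Moore-Penrose solution, no iteration
    return W, b, beta

def predict(X, W, b, beta):
    return sigmoid(X.dot(W) + b).dot(beta).argmax(axis=1)
```

Because the output weights are obtained in a single pseudo-inverse computation, no gradient-based iteration is needed, which is the source of the short training times reported later.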

3. Experimental Results

3.1. Dataset and Preprocess

The effectiveness of our proposed algorithm is tested on the benchmark dataset KDD 2015 which contains five kinds of information. The detailed description of the dataset is shown in Table 2. From the raw data, we define and extract several behavior features. The description is shown in Table 3.

The next step is to label each record according to the behavior features. If a learner has activities in the coming week, the dropout label is 0; otherwise, the dropout label is 1. A learner may begin learning in a later week rather than the first week; in that case the learner is not labeled as a dropout for the preceding weeks. The week in which the learner begins learning is treated as that learner's first actual learning week, the data of the preceding weeks are deleted, and the remaining weeks are labeled.
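A small sketch of this labeling rule is shown below, assuming `active_weeks` lists the weeks in which a learner has any recorded activity.

```python
def label_weeks(active_weeks, n_weeks):
    """Return {week: dropout label} for one learner, starting from the first
    week with activity: 0 if active in the coming week, 1 (dropout) otherwise."""
    first = min(active_weeks)
    active = set(active_weeks)
    labels = {}
    for week in range(first, n_weeks - 1):  # the last week has no coming week
        labels[week] = 0 if (week + 1) in active else 1
    return labels
```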

3.2. Experimental Setting and Evaluation

Experiments are carried out in MATLAB R2016b and Python 2.7 on a desktop computer with an Intel 2.5 GHz CPU and 8 GB RAM. The LIBSVM library [43] and the Keras library [44] are used to implement the support vector machine and the LSTM, respectively.

In order to evaluate the effectiveness of the proposed algorithm, accuracy, area under curve (AUC), F1-score, and training time are used as evaluation criteria. Accuracy is the proportion of correct prediction including dropout and retention. Precision is the proportion of dropout learners predicted correctly by the classifier in all predicted dropout learners. Recall is the proportion of dropout learners predicted correctly by the classifier in all real dropout learners. F1-score is the harmonic mean of precision and recall.

AUC depicts the degree to which a classifier distinguishes between positive and negative samples and is robust to imbalanced data [45]. The receiver operating characteristic (ROC) curve plots the trained classifier's true positive rate against its false positive rate. The AUC is the area under the ROC curve; the closer it is to 1, the better the classification performance.
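For reference, these criteria can be computed with standard scikit-learn metrics, as in the short sketch below (`y_score` denotes a predicted dropout probability; the names are ours).

```python
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

def evaluate(y_true, y_pred, y_score):
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'auc': roc_auc_score(y_true, y_score),  # threshold-free, robust to imbalance
        'f1': f1_score(y_true, y_pred),         # harmonic mean of precision and recall
    }
```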

3.3. Overall Performance

We choose ten courses for the experiments, with enrollments ranging from several hundred to about ten thousand. We divide the baselines into three categories: traditional machine learning, deep learning, and optimization algorithms. The traditional machine learning algorithms include logistic regression, support vector machine, decision tree, back propagation neural network, and entropy net. LSTM is adopted as the deep learning algorithm. The genetic algorithm combined with ELM (GA-ELM) serves as the optimization algorithm aiming to improve the ELM.

The results of overall performance in terms of accuracy, AUC, and F1-score are shown in Figure 3. The results of overall average training time are shown in Table 4.

Although the course enrollments vary widely, the proposed DT-ELM algorithm performs much better than the three categories of baseline algorithms. DT-ELM achieves 89.28%, 85.86%, and 91.48% in overall accuracy, AUC, and F1-score, respectively, which is about 12.78%, 22.19%, and 6.87% higher than the baseline algorithms.

To be specific, the traditional machine learning algorithms perform the worst on the three criteria. The results of the deep learning algorithm are much better; however, it has the longest training time. Although the optimization algorithm performs better than the deep learning algorithm, it does not perform as well as DT-ELM. DT-ELM performs the best in terms of accuracy, AUC, and F1-score, and it requires the least training time because of its noniterative training process. The results show that DT-ELM reaches the goal of predicting dropout accurately and in a timely manner.

Another observation is that the last two weeks achieve better performance than the other weeks. To identify the reason, we perform a statistical analysis of the weekly dropout rate and find that, compared to the first three weeks, the average dropout rate of the last two weeks of courses is higher, meaning that learner behavior is more likely to follow a pattern. This also illustrates the importance of dropout prediction: since dropout rates in the later stage of courses are generally higher than in the initial stage, it is better to find at-risk learners early in order to make effective interventions.

3.4. Impact of Feature Selection

To verify the effectiveness of feature selection, we compare DT-ELM with ELM. The results are shown in Figure 4. DT-ELM is about 2.78%, 2.87%, and 2.41% higher than ELM in terms of accuracy, AUC, and F1-score, respectively, which shows that feature selection improves the prediction results. Choosing as many features as possible may not be appropriate for dropout prediction: according to the entropy theory mentioned previously, features with different gain ratios have different classification ability, and features with low gain ratios may weaken the overall classification ability.

Although each course has different behavior features, two conclusions can be drawn. The average number of selected features is 12, which is smaller than the number of extracted features; this shows that using fewer, well-chosen features for dropout prediction can achieve better results than using all extracted features. Moreover, we find that discussion, active days, and time consumption are the three most important factors affecting the prediction results.

3.5. Impact of Feature Enhancement and Connection Initialization

To verify the impact of feature enhancement, we compare DT-ELM with a variant without feature enhancement (Without-FE). Similarly, to verify the impact of connection initialization, we compare DT-ELM with a variant without connection initialization (Without-IC). The results of the three algorithms are shown in Figure 5.

The results of Without-FE and Without-IC are not as good as those of DT-ELM. DT-ELM is about 0.94%, 1.21%, and 0.9% higher than Without-IC in terms of accuracy, AUC, and F1-score, respectively, and about 2.13%, 2.25%, and 1.98% higher than Without-FE. Without-IC achieves higher values than Without-FE on the three criteria, which indicates that feature enhancement plays a more important role than connection initialization. That is because features with better classification ability are enhanced in proportion to how much impact each feature has on the leaf-node neurons.

3.6. Comparison of Different Numbers of Enrollments

To verify the effectiveness of DT-ELM with different numbers of enrollments, two groups of experiments are conducted, and the results are shown in Figure 6. The first group (DT-ELM-SE) contains courses with smaller enrollments, ranging from several hundred to about one thousand. The second group (DT-ELM-LE) contains courses with larger enrollments, ranging from several thousand to about ten thousand.

Generally, the more data used for training, the better the classification results. DT-ELM-LE is about 2.59%, 3.41%, and 2.54% higher than DT-ELM-SE in terms of accuracy, AUC, and F1-score, respectively. The average training times of DT-ELM-SE and DT-ELM-LE are 1.6014 s and 1.6752 s, respectively. Although courses with fewer enrollments achieve lower values on the three criteria than courses with more enrollments, the proposed algorithm still performs better than the other algorithms. This means that dropout can be predicted accurately and in a timely manner for courses with different numbers of enrollments.

3.7. Performance with Different Algorithms

The detailed results of the different algorithms are shown in Tables 5–7. Logistic regression and support vector machine achieve lower accuracy, AUC, and F1-score than the other algorithms. The back propagation neural network and decision tree perform better than logistic regression and support vector machine. Entropy net utilizes a decision tree to optimize its performance and performs better than the back propagation neural network; different from entropy net, we utilize the decision tree to optimize ELM. Although the performance of LSTM is much better than that of the traditional algorithms, it is time intensive, as shown by the previous results.

It is obvious that the ELM-based algorithms perform better than the other algorithms, because ELM obtains the smallest training error by calculating the least-squares solution for the network output weights. GA-ELM achieves good results due to its feature selection. However, it still needs iterative training and lacks structure initialization of the ELM. DT-ELM optimizes the structure of ELM and performs much better than ELM and GA-ELM.

4. Discussion

Sufficient experiments are designed and implemented from various perspectives. In terms of overall performance, DT-ELM achieves the best results compared to the different baseline algorithms, and we also explain why the last two weeks achieve better performance than the other weeks. The experimental results on feature selection demonstrate its importance: features with different gain ratios have different classification ability, which helps achieve higher performance. Feature enhancement and connection initialization both contribute to the results, with feature enhancement playing the more important role given its larger improvement on the three criteria. The results with different numbers of enrollments prove the universality of our algorithm.

The experimental results have demonstrated the effectiveness and universality of our algorithm. However, one important question still needs to be considered: do instructors need to intervene for all dropout learners? Our goal is to find at-risk learners who may stop learning in the coming week and help instructors take corresponding interventions. From the behavioral perspective, a break in activity means dropout. However, interventions would not be taken for all dropout learners, because, besides behavior, other factors [46], such as age, occupation, motivation, and learner type [47], should be taken into consideration. Our future work is to design interventions for at-risk learners based on learners' behaviors and background information.

5. Conclusion

Dropout prediction is an essential prerequisite for making interventions for at-risk learners. To address this issue, we propose a hybrid algorithm that combines a decision tree and ELM and successfully settles several previously unsolved problems, including behavior discrepancy, iterative training, and structure initialization.

Compared to evolutionary methods, DT-ELM selects features based on entropy theory. Different from neural-network-based methods, it analytically determines the output weights without updating parameters through repeated iterations. Experiments on the benchmark dataset demonstrate the effectiveness of DT-ELM on multiple criteria, and the results show that our algorithm can make dropout predictions accurately.

Data Availability

The dataset used to support this study is the open dataset KDD 2015 which is available from http://data-mining.philippe-fournier-viger.com/the-kddcup-2015-dataset-download-link/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The research for this paper is supported by the National Natural Science Foundation of China (Grant no. 61877050), the open project fund of the Shaanxi Province Key Lab of Satellite and Terrestrial Network Technology, and the Shaanxi Province financed projects for scientific and technological activities of overseas students (Grant no. 202160002).