#### Abstract

Educational Data Mining (EDM) is a rich research field in computer science. Tools and techniques in EDM are useful for predicting student performance, which gives practitioners insights for developing appropriate intervention strategies to improve pass rates and increase retention. The performance of state-of-the-art machine learning classifiers depends heavily on the task at hand. Support vector machines have been used extensively in classification problems; however, the extant literature shows a gap in the application of linear support vector machines as a predictor of student performance. The aim of this study was to compare the performance of linear support vector machines with that of state-of-the-art classical machine learning algorithms in order to determine the algorithm that would improve prediction of student performance. In this quantitative study, an experimental research design was used. Experiments were set up using feature selection on a publicly available dataset of 1000 alpha-numeric student records. Linear support vector machines, benchmarked against ten classical machine learning algorithms, showed superior performance in predicting student performance. The results showed that features such as race, gender, and lunch influence performance in mathematics, whilst access to lunch was the primary factor influencing reading and writing performance.

#### 1. Introduction

Predicting the factors impacting student performance early in the academic programme can assist in combating the high dropout rate experienced by higher education institutions. Such predictions can guide lecturers in adjusting their lessons to assist students who are at risk of failing. Several studies have been conducted on using machine learning algorithms for early prediction of student performance. These studies fall into the field of Educational Data Mining (EDM), which provides great value to educational institutions [1]. EDM is an emerging field in data mining and lies at the intersection of education, computer science, and statistics [1]. The underlying data mining techniques are not limited to education; they are also applied in fields such as transportation, sports, marketing, and sales [2].

Several data mining tools are used in EDM to analyse and predict student performance to the benefit of educational institutions. Interventions based on such predictions have improved pass rates, curbed dropout rates, and increased retention rates [3]. Data manipulation and feature engineering tools include Microsoft Excel, EDM Workbench, Python and Jupyter notebook, and Structured Query Language (SQL). No single tool suffices for EDM, as different tools suit different tasks [4].

A wide range of classification algorithms can be used to predict processes and performance, namely, random forest, support vector machines, AdaBoost, decision tree, Naïve Bayes, and K-nearest neighbour [3]. Kumar et al. [2] used EDM effectively to improve retention rates by predicting slow learners in a high school class and providing them with interventions to improve. They found Naïve Bayes, Multilayer Perceptron, SMO, J48, and REPtree to be the most widely used techniques in EDM research. This paper explores other techniques in the field of EDM.

This research paper attempts to answer the following research question in the field of EDM: What are the strong impacting factors for school-based learners’ performance in reading, writing, and mathematics? The paper is organized as follows. Section 2 presents a review of related literature. Section 3 introduces the materials and methods used in this research. Section 4 applies the data mining algorithms to the selected dataset and presents the experimental results. Finally, Section 5 concludes the work.

#### 2. Related Works

Acharya and Sinha in [5] showed, in a study of students majoring in computer science, that the best results for early prediction were obtained with the decision tree class of algorithms. In this study, the training set consisted of 309 instances whilst the testing set comprised 104 instances. The study found that caste and religion played a vital role in student performance in India. Other strong factors related to academic performance were family size and family income.

In another study, by Koutina and Kermanidis in [6], early prediction of the performance of postgraduate master’s students was found useful to assist tutors. Data on a total of 117 instances were collected from three courses, namely, Advanced Language Technology, Computer Networks, and Information Systems Management. Naïve Bayes and 1-NN achieved the best prediction results compared to other well-known classification algorithms. The results show that, in small datasets, Naïve Bayes and 1-NN can perform better than decision trees [6]. Support for the better performance of Naïve Bayes on smaller datasets was also shown in a study by Belachew and Gobena in [7]. The study found that the size of the dataset and the imbalance and distribution of class values are the main challenges in obtaining better accuracies. In this study, predictors of student performance were presence in class and in-term performance. It was also found that students’ occupation, the type of degree, and their possession of another master’s degree were not good indicators for early prediction of student performance.

A study by Kotsiantis et al. [8] on the performance of distance-learning students showed that Naïve Bayes was able to produce more than satisfactory accuracy when compared to other state-of-the-art algorithms. Two experiments were conducted using datasets of 354 and 28 instances. The results showed that 28 instances were too few and that 70 instances were needed for better accuracy.

El Aissaoui et al. [9] implemented an adaptive e-learning system that takes learning styles into account so that suitable content can be provided to enhance learning. This study automatically detected learning styles using machine learning. *K*-means clustering was used to extract learning styles from login sequences. Thereafter, using a supervised learning algorithm, namely, Naïve Bayes, the learning style for a new learner could be predicted. A real dataset extracted from the e-learning system’s log file was used in the experiments.

Xu et al. [10] used machine learning for tracking student performance in a degree programme, so that students’ future performance could be predicted from ongoing information. The following predictors were cited in the study: the backgrounds of the students, the courses selected by students, the information provided by the courses to predict future performance, and the progress made by the students. This study asserts that predicting students’ performance in degree programmes is not a one-off task but a progressive process of tracking and updating. They developed a novel algorithm to predict students’ performance based on a bilayered structure comprising a base predictor layer and an ensemble predictor layer. Base predictors are trained using offline data (past performance) and ensemble predictors are trained using online student data (current performance). The dataset had 1169 anonymous undergraduate students. The data contained students’ precollege information (high school GPA and SAT), the courses that students take each quarter, the course credits, and the obtained grades. The study concluded that the proposed method achieved superior performance over benchmarked approaches. Mao et al. in [11] used Bayesian Knowledge Tracing and Long Short-Term Memory (LSTM) to predict posttest scores and learning gains. Two training datasets, namely, Pyrenees and Cordillera, were employed. The findings showed that the learning environment can predict students’ performance and learning gains.

Cui et al. in [12] investigated student performance prediction by assessing whether a common set of student activity variables predicts performance and which machine learning classifiers perform well across different courses, with a view to a model that could predict performance from learning management system (LMS) data. There were 18 features extracted from data processing on the LMS activity. The recursive feature elimination (RFE) algorithm was employed to select the most important features. The features were ranked and the top 5 features were selected. In the experiment, 8 classical machine learning algorithms were used, namely, logistic regression (LR), Naïve Bayes (NB), neural network (NN), support vector machine (SVM), decision tree (DT), *k*-nearest neighbours (kNN), random forest (RF), and gradient boosting machine (GBM), together with 1 ensemble classifier, namely, ensemble model (EM). The results showed that LMS usage could predict student performance and can provide feedback to support students and higher education institutions.

Gray and Perkins in [13] showed that students “At Risk” can be identified as early as week three of the semester with approximately 95% accuracy. The dataset consisted of 4970 students, 32 features, and 5 classes, namely, pass, supplementary, repeat year, fail, and repeat semester. Zohair in [14] explored the identification of key features in a small dataset using clustering algorithms. Records of 50 students were broken up into 2 small datasets. The classifiers *K*-nearest neighbour (KNN), Naïve Bayes (NB), LDA, support vector machines (SVM), and MLP-ANN were used for the training. A comparative analysis showed which algorithm was superior for prediction on small datasets. The study concluded that support vector machines produced the most acceptable results on small datasets.

Linear support vector machine (LSVM) is a variant of support vector machine (SVM) which is one of the most popular supervised machine learning methods. It has been proven in the literature that LSVM can guarantee global optimization for regression or classification problems in small-to-large datasets [15, 16] and deals with predictive binary classification, that is, the assignment of class labels to unlabeled data [17]. Based on the benefits LSVM proffers, this paper adopted LSVM for the statistical analysis of student performance.

The aim of this study is to compare the performance of linear support vector machines with state-of-the-art classical machine learning algorithms in order to determine the algorithm that would improve prediction of student performance.

#### 3. Materials and Methods

The experiments were conducted on a computer running the Windows 10 operating system with an Intel® Core™ i5-8250U CPU @ 1.60 GHz (8 CPUs), 1.8 GHz, 8 GB of RAM, and a 500 GB hard disk drive. The dataset used in this study was selected from a set of categorical data obtained from the public domain, hosted on Kaggle and available online (https://www.kaggle.com/spscientist/students-performance-in-exams/activity). The database consists of 1000 high school learner records based on student performance in mathematics, reading, and writing. No preprocessing of the data was required. The features of the dataset include the following: (1) gender, (2) race/ethnicity, (3) parental level of education, (4) access to lunch, (5) test preparation, (6) mathematics score, (7) reading score, and (8) writing score.
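For readers reproducing the study outside MATLAB, the categorical features may first need to be converted to numbers, since an SVM operates on numeric vectors. The following is a minimal sketch of one common option, one-hot encoding, and is not a step reported in this study. The records, the helper names `build_vocabulary` and `one_hot`, and the exact feature keys are illustrative assumptions.

```python
# Illustrative records only: hypothetical values, not rows copied from the dataset.
records = [
    {"gender": "female", "race/ethnicity": "group C", "lunch": "standard"},
    {"gender": "male", "race/ethnicity": "group E", "lunch": "free/reduced"},
    {"gender": "female", "race/ethnicity": "group C", "lunch": "free/reduced"},
]


def build_vocabulary(rows):
    """Collect the sorted distinct values observed for every categorical feature."""
    vocab = {}
    for row in rows:
        for feature, value in row.items():
            vocab.setdefault(feature, set()).add(value)
    # Sort features and values so the encoding is reproducible.
    return {feature: sorted(values) for feature, values in sorted(vocab.items())}


def one_hot(row, vocab):
    """Encode one record as a flat 0/1 vector, one slot per (feature, value) pair."""
    return [
        1 if row[feature] == value else 0
        for feature, values in vocab.items()
        for value in values
    ]


vocab = build_vocabulary(records)
encoded = [one_hot(row, vocab) for row in records]
```

Each record becomes a vector with one slot per observed category, so no artificial ordering is imposed on values such as the race/ethnicity groups.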

A machine learning approach was adopted to establish the contributing factors that impact student performance. The training dataset comprised all 1000 student records of the Kaggle data, and the testing dataset likewise comprised the 1000 student records. In the training stage, the classification rules were derived. The testing stage was used to test the accuracy of the classification rules [5]. All results are provided for the testing of the dataset. The machine learning library in MATLAB R2020a was used to implement the algorithms and obtain statistical results.

Figure 1 shows the architecture of the student performance testing used in this study.

##### 3.1. Classification

A popular machine learning algorithm, the linear support vector machine (LSVM) classifier, was used as a supervised learning model to analyse data for classification. SVM is one of the most actively developed classification and regression methodologies in data mining and machine learning. It provides salient properties such as margin maximization and nonlinear classification via kernel tricks and has proven to be effective in many real-world applications [18].
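To make the margin-maximization idea concrete, the following is a minimal sketch of a linear SVM fitted by batch subgradient descent on the regularised hinge loss. It is not the MATLAB solver used in this study; the function names, the toy data, and the hyperparameters `lam`, `lr`, and `epochs` are illustrative assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by batch subgradient descent.

    Minimises (lam / 2) * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b))).
    X is a list of feature vectors; y holds labels in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularisation term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # the point lies inside the margin or is misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return the class label (+1 or -1) given by the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
X = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

Only points with a margin below 1 contribute to the update, which is what pushes the separating hyperplane away from the closest training points.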

According to Foody and Mathur in [19], a linear support vector machine is often able to classify a dataset to a higher accuracy than conventional statistical classifiers. Based on this advantage, we used the linear support vector machine classifier to classify and predict factors that impact student performance [20, 21]. The linear SVM classifier was benchmarked against ten other algorithms, namely, coarse decision tree, medium decision tree, fine decision tree, logistic regression, Gaussian Naive Bayes, Kernel Naive Bayes, quadratic SVM, cubic SVM, fine Gaussian SVM, and medium Gaussian SVM.

##### 3.2. Performance Measures

The standard performance metrics, namely, accuracy, total misclassification cost (TMC), prediction speed (PS), training time (TT), and area under the Receiver Operating Characteristic (ROC) curve, were calculated in the experimental comparison of classifiers [22]. A parallel coordinate plot was used to visualise the multivariate data. A parallel coordinate plot maps each row in the data table as a line, or profile, and allows for the visualisation of data points across many dimensions.
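As an illustration of the first two metrics, the sketch below computes accuracy and total misclassification cost from a confusion table. The class names, the toy predictions, and the unit cost matrix are assumptions made for the example. With a unit cost, TMC is simply the number of misclassified records, which appears consistent with the figures reported in Section 4 (90.1% accuracy on 1000 records and a cost of 99).

```python
def confusion_counts(y_true, y_pred, labels):
    """Count how often each true label (outer key) was predicted as each label."""
    counts = {t: {p: 0 for p in labels} for t in labels}
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def accuracy(counts):
    """Fraction of predictions on the diagonal of the confusion table."""
    total = sum(sum(row.values()) for row in counts.values())
    correct = sum(counts[label][label] for label in counts)
    return correct / total


def total_misclassification_cost(counts, cost):
    """Sum cost[true][pred] over all predictions."""
    return sum(counts[t][p] * cost[t][p] for t in counts for p in counts[t])


labels = ["pass", "fail"]  # hypothetical class names
y_true = ["pass", "pass", "fail", "fail", "pass"]
y_pred = ["pass", "fail", "fail", "pass", "pass"]

# Assumed unit cost: every error costs 1, every correct prediction costs 0.
unit_cost = {t: {p: 0 if t == p else 1 for p in labels} for t in labels}

counts = confusion_counts(y_true, y_pred, labels)
acc = accuracy(counts)
tmc = total_misclassification_cost(counts, unit_cost)
```

A non-uniform cost matrix could be substituted for `unit_cost` when one kind of error, for example missing an at-risk student, is considered more serious than the other.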

#### 4. Results and Discussion

The LSVM was used to analyse the performance by extracting useful knowledge from the student dataset. The usefulness of the algorithm was to interpret relationships among variables and to determine factors that affect student performances in their mathematics, reading, and writing score. Table 1 presents the frequency and percentage of all variables for a student population of *n* = 1000.

The data summarised in Table 1 were used to train the linear SVM, with 5-fold cross-validation as the validation scheme. The linear SVM (LSVM) was benchmarked against fine decision tree (FDT), coarse decision tree (CDT), medium decision tree (MDT), logistic regression (LR), Gaussian Naive Bayes (GNB), Kernel Naive Bayes (KNB), quadratic SVM (QSVM), cubic SVM (CSVM), fine Gaussian SVM (FGSVM), and medium Gaussian SVM (MGSVM). Table 2 shows the results for all the benchmarked algorithms for the area under ROC curve (AUC), accuracy, total misclassification cost (TMC), prediction speed (PS), and training time (TT).
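The 5-fold scheme can be sketched as follows: the records are divided into five folds, and each fold serves once as the held-out set while the other four are used for training. This is an illustrative sketch only; the helper `k_fold_indices` is hypothetical, and shuffling, which toolboxes typically apply before splitting, is omitted to keep the example deterministic.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# With 1000 records, each fold holds out 200 records and trains on the other 800.
folds = list(k_fold_indices(1000, k=5))
```

Averaging a metric such as accuracy over the five held-out folds gives an estimate that does not rely on records the model saw during training.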

Table 2 shows that LSVM has the highest accuracy of 90.1% and, correspondingly, the lowest total misclassification cost of 99. On the time metrics, LSVM was outperformed by other classifiers: GNB has the best prediction speed, while CDT has the fastest training time.

##### 4.1. Strong Predictors for Mathematics

Figure 2 illustrates the parallel coordinate plot with five dimensions (*N* = 5: race/ethnicity, parental level of education, lunch, test preparation, and math score), such that every vertical axis, that is, every factor impacting the mathematics score, appears exactly once. Scores are plotted in units of standard deviation, with a score of 65% shown as point 0 on the parallel coordinate plot. Values from 0 to 2 correspond to scores greater than or equal to 65%, values between −1 and 0 correspond to scores greater than or equal to 52% but less than 65%, and values between −4 and −1 correspond to scores between 0% and 49%.
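Assuming the axis in Figure 2 shows standardised values, as the description above suggests, the mapping from raw scores to plot coordinates can be sketched as a z-score transformation. The scores below are hypothetical, and the use of the population standard deviation is an assumption; a sample standard deviation would scale the values slightly differently.

```python
from statistics import mean, pstdev


def standardise(values):
    """Return z-scores: the distance of each value from the mean in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


scores = [35, 50, 65, 80, 95]  # hypothetical scores with a mean of 65
z = standardise(scores)
```

In this toy example the score of 65 maps to 0, scores above it map to positive values, and scores below it map to negative values, mirroring the bands described for the plot.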

In Figure 2, the topmost plot for mathematics score shows that a standard deviation greater than 0 represents scores greater than or equal to 65% and enables us to predict factors that impacted student performance for the mathematics score. The five vertical axes indicate that a high level of parental education does not necessarily improve student performance, as a student with a parent with some college or a high school degree performed better than a student whose parent has an associate’s or master’s degree.

The parallel coordinate plot in Figure 2 shows that another important factor that affected student performance in math was the type of lunch. Students with a standard lunch performed better than students with a free or reduced lunch. Test preparation did not affect the performance of students who scored more than 65% in math. The bottom row in Figure 2, with a standard deviation of less than −1, shows math scores lower than 50%. Students with a free/reduced lunch performed poorly in math, as did female students in race/ethnicity group C, while the level of parents’ education was not an important factor.

##### 4.2. Strong Predictors for Reading

In Figure 3, the topmost plot for reading score shows a standard deviation greater than 0 represents reading scores greater than or equal to 65%, which enables us to predict factors that impact student performance for reading score.

In Figure 3, the five vertical axes indicate that a high level of parental education still does not improve student performance, as a student with a parent with some college or a high school degree performed better than a student whose parent has an associate’s or master’s degree. However, the reading results differ from the math results in one respect: females in group C race/ethnicity performed better in reading than in math. The bottom row in Figure 3, with a standard deviation of less than −1, shows reading scores lower than 50%. The free/reduced lunch factor has a significant impact on the performance of students. Furthermore, females in group E race/ethnicity recorded the lowest reading scores.

##### 4.3. Strong Predictors for Writing

Figure 4 shows the parallel coordinates with five dimensions represented by *N* = 5 vertical lines impacting the writing score.

In Figure 4, the topmost plot for writing score reiterates that parental level of education does not impact student performance, as a student with a parent with some college or a high school degree performed better than a student whose parent has an associate’s or master’s degree. Females in group C race/ethnicity performed better in writing than in math. The bottom row in Figure 4, with a standard deviation of less than −1, shows writing scores lower than 50%. The free/reduced lunch factor has a significant impact on the writing score of students. Furthermore, females in group E race/ethnicity recorded the lowest writing scores.

##### 4.4. Receiver Operating Characteristic (ROC)

The Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary classifiers. It displays the trade-off between true positives and false positives, which the area under the curve (AUC) summarises in a single value [3]. Table 3 shows the summary of results of AUC values for the linear SVM versus the ten benchmarked algorithms.
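How an ROC curve and its AUC are obtained from classifier scores can be sketched as follows. This is a generic illustration rather than the MATLAB routine used in this study; the labels, the scores, and the function names `roc_points` and `auc` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points for a falling threshold.

    y_true holds 1 for the positive class and 0 for the negative class.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last record of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical classifier scores
curve = roc_points(y_true, scores)
area = auc(curve)
```

An AUC of 1 corresponds to a classifier that ranks every positive record above every negative one, while 0.5 corresponds to random ranking.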

Figure 5 shows the ROC curve for the linear SVM compared to the ten benchmarked algorithms used in this study.

The results depicted in Figure 5 show that the linear SVM outperformed the other algorithms, reiterating the accuracy of the linear SVM in classifying and predicting factors that impact student performance.

#### 5. Conclusions

The results show that linear support vector machines achieved superior performance in predicting students’ performance when applied to student data. The algorithm predicted that the parental level of education does not influence students’ performance, whilst factors such as race, gender, and lunch have a bearing on student performance. In future work, we will consider using an ensemble of classical machine learning algorithms to boost accuracy in the prediction of student performance.

#### Data Availability

The data is available publicly at https://www.kaggle.com/spscientist/students-performance-in-exams/activity.

#### Conflicts of Interest

The authors declare no conflicts of interest.

#### Acknowledgments

The authors acknowledge the Durban University of Technology for making funding opportunities and materials for experiments available for this research project.