Abstract

The main purpose of educational institutions is to provide quality education to their students. However, large volumes of student data are difficult to analyze manually. Educational data mining can be more effective than the statistical methods traditionally used to explore data in educational settings and analyze students’ performance. The objective of this study is to apply different data mining techniques, compare their performance, and measure the impact of different features on students’ academic performance. The dataset was collected from the Kaggle repository. To analyze the dataset, several classification algorithms were applied: decision tree, random forest, SVM, SGD, AdaBoost, and logistic regression (LR) classifiers. Random forest achieved the highest score (98%). The scores of decision tree, AdaBoost, logistic regression, SVM, and SGD are 90%, 89%, 88%, 86%, and 84%, respectively. The results show that technology greatly influences student performance. Students who use social media throughout the week showed lower performance than students who use it only at weekends. The impact of other features on student performance is also measured.

1. Introduction

Student performance modeling is one of the challenging and popular research topics in educational data mining (EDM) [1]. Multiple factors influence academic performance in nonlinear ways. The widespread availability of educational datasets has made educational data mining more attractive to researchers. EDM is a field in which data mining algorithms are applied to educational data in order to predict and improve the performance of students in educational institutes [2].

Information technology (IT) is an important part of the learning process [3]. It greatly influences the performance and GPA of online students. The study [4] argues that the use of technology such as the internet is one of the most important factors that can influence students’ educational performance, positively or negatively. Students who spend too much time on social media sites like Facebook do not have enough time to study. This behavior leads to poor performance during high school, and such students find it difficult to cope with higher studies. EDM can detect this behavior pattern at the right time, which helps to maximize student grades and minimize the failure rate of weak students.

Social media greatly influences school-age students, in both their academic and personal lives. Students use social media for academic purposes to improve their performance. Teachers and students can both use social media as a teaching and learning tool to ease and improve the learning and teaching process [5].

The objective of this study is to predict student performance from the features use of technology, weekday-social-media-use, and weekend-social-media-use. It further aims to identify in advance the students who wish to pursue higher education and to find how parents’ education influences student performance.

In this paper, the data was collected from the Kalboard 360 LMS (learning management system). We used six machine learning techniques (DT, SVM classifier, SGD classifier, RF classifier, AdaBoost, and LR classifier) to determine the patterns in the student performance data.

The results show that technology greatly influences student performance. Six classifiers are used to predict performance on the basis of features such as romantic status, use of technology, weekday-social-media-use, weekend-social-media-use, parent education, and living area. This helps teachers to identify fair, good, and poor students.

The proposed work analyzes performance and identifies desirable and undesirable student behaviors, which will help to group students into classes based on their performance capabilities and, furthermore, to predict students’ social activities.

2. Related Work

Educational institutes face various problems, such as identifying the reasons for drop-out, late graduation, the pass-to-fail ratio, the effect of parents’ involvement on student performance, the effect of attendance on student performance, and predicting a student’s performance on the basis of previous marks. Several studies present machine learning and statistical solutions to such problems.

The study [4] argues that the use of technology such as the internet is one of the factors that can influence students’ educational performance, positively or negatively. Students who spend too much time on social media websites like Facebook do not have enough time to study, which leads to poor performance during high school and ultimately makes it difficult for them to cope with higher studies. EDM can detect this behavior pattern at the right time, which helps to maximize student grades and minimize the failure rate. Another study [5] shows that social media influences school-age students positively. Students use social media for academic purposes to improve their performance. Teachers and students can both use social media as a teaching and learning tool to ease and improve the learning and teaching process.

Study [6] used a decision tree, a nonlinear classifier, to generate a tree and rules. The J48 algorithm was used as the analysis tool. The dataset was collected through surveys of master’s and Ph.D. students.

Study [7] determines students’ academic results based on cluster groups and uses standard statistical algorithms to collect and manage their score data according to their level of performance. The K-means clustering algorithm is used to analyze academic performance. Another study [8] used knowledge discovery and data mining tools to extract useful information from a data repository in order to enhance the quality of education. DM methods are used for decision making in educational systems. A decision tree (DT) algorithm, which searches the data using a divide-and-conquer rule, is applied to students’ past academic data to predict and analyze their academic performance. These performance measures help to find students at risk of dropping out and to identify those who need special coaching and the allocation of an instructor for suitable advice and counseling.

Study [9] states that the decision tree is the most widely used supervised classification algorithm. It is fast and easy to construct, and the DT classifier can be applied in any field. A qualitative student dataset from the educational data mining domain is used. Different decision tree algorithms (CART, C4.5, and ID3) are applied to the dataset and their performance is compared. The comparison sets the Gini index of CART against the information gain of ID3 and the gain ratio of C4.5. The performance and correctness of CART are greater than those of ID3 and C4.5, and the DT results indicate that students’ performance is influenced by qualitative features.

Reference [10] states that data mining techniques are increasingly being merged with the educational field. The combination of data mining and education is called educational data mining, and it helps to identify the features and information of students. The study predicts and analyzes the performance of bachelor’s and master’s students at university level. The performance is analyzed with two algorithms: decision tree and fuzzy genetic algorithms. The dataset contains the features internal-marks, sessional-marks, and admission-marks, which are used to identify the results. Internal-marks comprise attendance-marks, AVG-marks, sessional-marks, and assignment-marks. Weighted marks are obtained from matric and interclass. For the master’s degree, examination marks are also included. A systematic model is used to enhance the performance of students at an early stage and in time. Identifying results and solutions at an early stage leads to good results in the final examination. Students can also view their results and new updates. Many companies connect with educational organizations to find students who match their needs.

Reference [11] states that a large amount of data is stored in different technological spaces and that new data is produced quickly and easily, daily or even every second. Data mining is combined with these technologies, and with the help of data mining techniques, important information can be extracted from ordinary data to provide meaningful knowledge. An educational database contains a huge amount of student-related data to which data mining methods can be applied. The study describes how to use DM algorithms such as KNN, naïve Bayes, and DT: these algorithms are applied to raw student data and the best result is identified.

Reference [12] reports that college students have ready access to the internet, which supports them in their living and learning. The study examines the connection between internet use behavior and the educational performance of undergraduate students using machine learning algorithms. The dataset of 4000 students has the attributes online-duration, internet-traffic-volume, and connection-frequency, which were extracted, calculated, and normalized from real internet usage. DT, NN, and SVM were used to predict student educational performance from these attributes. The connection-frequency attribute is positively linked, and the internet-traffic-volume attribute is negatively linked, with the academic performance of students. The online-time and internet-time attributes produce surprisingly different performance across datasets. Accuracy improves as the number of features increases. The results indicate that internet usage can be used to distinguish and analyze students’ academic performance.

Reference [13] reports that data mining approaches are used in higher education and form an attractive part of educational research. These approaches identify meaningful data within large volumes of otherwise meaningless data. A supervised data mining method is used to predict student progress, which is helpful for current educational organizations. The basic purpose of the study is to build a model, using classification methods, that analyzes student performance in Malaysia and finds the most important features in a large dataset. KNN, naïve Bayes, DT, and logistic regression are used to analyze student academic results. These approaches are evaluated on accuracy, precision, recall, and the ROC curve. The output shows that the naïve Bayes algorithm performs best. NB reveals the important attributes that are used to find excellent students whose grades are A+ and A.

Reference [14] reports that the large number of students who drop out is a major worry for higher education organizations. It affects students’ fees and wastes public resources. It is necessary to find the students who are in danger of dropping out and the attributes that cause a higher dropout rate. Educational data mining methods are used to address this problem. The study considers computer science undergraduates of Universiti Teknologi MARA after three years. DT, logistic regression, random forest, KNN, and NN algorithms are compared to analyze student performance, and several machine learning methods are combined to make an efficient model. Logistic regression is the best algorithm for analyzing and predicting dropout students.

3. Educational Data Mining Model

This study evaluates the impact of technology on students’ educational performance. It proposes an educational data mining (EDM) model that is divided into five major stages: dataset collection, dataset preprocessing, feature extraction, classifier selection, and model evaluation (see Figure 1). Each stage may contain more than one substep. First, the dataset was cleaned and checked for missing values. After cleaning, the required features were extracted from the dataset, such as “use of technology,” “weekly-social-media-use,” and “weekday-social-media-use.” Next, different learning models were used to predict students’ final grade performance. The models were then compared on accuracy score to select the best learner for the problem. The algorithms used in the study are DT, SVM classifier, SGD classifier, random forest classifier, AdaBoost, and logistic regression classifier.

3.1. Dataset Collection

The data used for the analysis was collected from an e-learning system called Kalboard 360 and is publicly available at https://www.kaggle.com/d50stuck/kalboard-360-use-case. The features of the student dataset and their categories are listed in Table 1.

3.2. Preprocessing

In the preprocessing phase, we first make sure that no irrelevant or unacceptable values exist in the dataset. This process is called cleaning. After cleaning, we analyzed the data and removed the fields that are not relevant to our research objective, which makes the data more refined and focused. During preprocessing, we also handled the null values in the dataset.
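As an illustration of the cleaning step, the following minimal sketch keeps only the relevant fields and discards records that contain a null value. The paper does not state how null values were handled, so dropping the record is an assumption made here; the field names and the helper `clean_records` are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def clean_records(records: List[Record], keep_fields: List[str]) -> List[Record]:
    """Keep only the relevant fields and drop rows with a missing value."""
    cleaned = []
    for row in records:
        # Project the row onto the fields relevant to the research objective.
        projected = {field: row.get(field) for field in keep_fields}
        # Treat None and empty or whitespace-only strings as null values
        # (assumption: such rows are dropped rather than imputed).
        if any(v is None or str(v).strip() == "" for v in projected.values()):
            continue
        cleaned.append(projected)
    return cleaned


# Hypothetical records; "student_id" plays the role of an irrelevant field.
rows = [
    {"technology": "yes", "living_area": "urban", "student_id": "1"},
    {"technology": None, "living_area": "rural", "student_id": "2"},
    {"technology": "no", "living_area": "", "student_id": "3"},
]
cleaned = clean_records(rows, keep_fields=["technology", "living_area"])
print(cleaned)
```

Only the first record survives, because the other two each contain a null value in a retained field.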

3.3. Selection of Classifier

After obtaining the required features, different classifiers were trained on the dataset. The algorithms used in the study are DT, SVM classifier, SGD classifier, random forest classifier, AdaBoost, and LR classifier.

3.4. Decision Tree

The DT classifier is simple and easily understood by analysts and end users. It is a tree-shaped model built on the features, see Figure 2. These features are WSM, DSM, living area, romantic status, parent education, technology, and desire-higher-education; they form the nodes of the tree and influence the student’s final score. Every node is divided into subnodes, and every node makes its decision on the basis of a numeric value.
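To illustrate how a node can decide on a numeric value, the sketch below searches for the threshold that minimizes the weighted Gini impurity of the two subnodes. The paper does not state which split criterion was used; Gini impurity (as in CART) is assumed here, and the toy data and function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values: Sequence[float], labels: Sequence[str]) -> Tuple[float, float]:
    """Return (threshold, weighted Gini) of the best split 'value <= threshold'."""
    best = (float("inf"), float("inf"))
    n = len(values)
    # Splitting at the largest value would leave the right subnode empty.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical social media usage levels and final grade categories.
usage = [1, 2, 2, 4, 5, 5]
grades = ["good", "good", "good", "poor", "poor", "poor"]
print(best_threshold(usage, grades))
```

On this toy data the split "usage <= 2" separates the two categories perfectly, so its weighted impurity is zero.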

3.5. SVM Classifier

This classifier is a linear algorithm that is suitable for small datasets. Support vector machines are less suitable for large datasets because they need long training times on such data. We used this classifier because our dataset is small; it correctly classifies the features that influence student performance. It divides features into two classes: for example, living area is divided into rural and urban, and the use of technology is divided into yes and no. The study divides internet use into two classes, “low use” and “high use”. If a person uses the internet on 1-2 days, the use is considered “low use” and a weight of -1 is given to the user. If a person uses it on 4-5 days, the use is considered “high use” and a weight of 1 is given to the user. In the case of 3 days, a weight of 0 is given. All categories are shown in Figure 3.

Here, the inputs are the feature vectors, and the output is the class to which the features belong.
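The weighting of internet use described in this section can be written as a small encoding function. This is a sketch only: the function name `usage_weight` is hypothetical, and values outside 1 to 5 days are rejected because the text does not define a weight for them.

```python
def usage_weight(days: int) -> int:
    """Map days of internet use to the weight described in the text."""
    if days in (1, 2):
        return -1  # "low use"
    if days == 3:
        return 0   # neither low nor high use
    if days in (4, 5):
        return 1   # "high use"
    raise ValueError(f"no weight is defined for {days} days of use")


print([usage_weight(d) for d in range(1, 6)])  # [-1, -1, 0, 1, 1]
```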

3.6. Random Forest

The training data is divided into small random samples. The random forest classifier builds a small tree for each sample using features like weekly-social-media-use, weekday-social-media-use, living area, romantic status, technology, and parent education. Each tree analyzes student performance by splitting its nodes in the same way as a decision tree, so the random forest consists of multiple decision trees. At the end, a voting process over the trees is performed for every sample to determine the performance of the student. The random forest algorithm provides better results than the decision tree algorithm and the other algorithms used here. The working of the random forest classifier is shown in Figure 4.
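The sampling and voting steps can be sketched as follows. Drawing each sample with replacement (a bootstrap sample) is the usual random forest convention and is assumed here, since the text does not specify the sampling scheme; the tree predictions are hypothetical stand-ins for trained trees.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence


def bootstrap_sample(rows: Sequence, rng: random.Random) -> List:
    """Draw len(rows) rows with replacement (one training sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
rows = list(range(10))      # stand-in for ten student records
sample = bootstrap_sample(rows, rng)

# Hypothetical predictions of five trees for one student.
tree_predictions = ["good", "fair", "good", "poor", "good"]
print(majority_vote(tree_predictions))  # good
```

Because every tree sees a different sample, the individual trees disagree on some students, and the vote averages out their individual errors.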

3.7. Logistic Regression Classifier

The logistic regression classifier is another type of supervised learning algorithm. It is also called the logit model. It is a statistical model used to find the probability of a class such as pass or fail. It uses a basic logistic function to model a binary dependent variable. It is easy to implement and provides a baseline for binary classification. It also defines the link between the dependent variable and the independent variables. The LR outcome is a discrete class label. The equation of logistic regression is

$$y = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}},$$

where $y$ is the predicted output, $b_0$ is the intercept term, and $b_1$ is the coefficient of the single input value ($x$).
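The logistic function for a single input can be evaluated as in the sketch below. The symbols `b0` (intercept) and `b1` (coefficient) follow the usual textbook notation, and the coefficient values in the example are hypothetical, not fitted to the study's data.

```python
import math


def logistic_predict(x: float, b0: float, b1: float) -> float:
    """Probability of the positive class: e^z / (1 + e^z) with z = b0 + b1 * x."""
    z = b0 + b1 * x
    if z >= 0:
        # Algebraically equal form that avoids overflow for large positive z.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients: z = -1.0 + 0.5 * 2.0 = 0, so the probability is 0.5.
p = logistic_predict(2.0, b0=-1.0, b1=0.5)
label = "pass" if p >= 0.5 else "fail"
print(p, label)
```

Thresholding the probability at 0.5 turns the continuous output into the binary class, which is how the model acts as a classifier.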

3.8. AdaBoost

The AdaBoost classifier is a meta-algorithm of machine learning. A meta-algorithm merges several low-accuracy classifiers into a single, highly predictive model to increase performance. This classifier is sensitive to error-prone data and outliers. It is less prone to overfitting than other algorithms. AdaBoost combines weak classifiers to build a strong classifier with high accuracy.

As shown in Figure 5, the AdaBoost classifier works in the following steps:
(1) AdaBoost first selects training samples randomly.
(2) It trains the learning algorithm iteratively, selecting the samples based on the accuracy of the previous round of training.
(3) It assigns a higher weight to wrongly classified samples so that they have a higher probability of being classified correctly in the next iteration.
(4) It assigns a weight to the trained classifier in each iteration according to the accuracy of that classifier, so that a more accurate classifier contributes more.
(5) This process repeats until the complete training data is fitted without any error.
(6) Finally, a voting process is performed over all of the classifiers that were generated.
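Steps (3) and (4) can be made concrete with the weight update of the standard discrete AdaBoost formulation for two classes. This is an assumption: the paper does not give its update rule, and the function name `adaboost_round` and the example weights are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def adaboost_round(weights: Sequence[float], correct: Sequence[bool]) -> Tuple[float, List[float]]:
    """One boosting round: return (classifier weight alpha, new sample weights)."""
    total = sum(weights)
    # Weighted error of the weak classifier trained in this round.
    err = sum(w for w, ok in zip(weights, correct) if not ok) / total
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    # Step (4): a more accurate classifier receives a larger weight.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Step (3): raise the weights of misclassified samples, lower the others.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


# Four equally weighted samples; the weak classifier gets the last one wrong.
alpha, new_weights = adaboost_round([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(alpha, new_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.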

3.9. SGD Classifier

Stochastic Gradient Descent (SGD) is an optimization technique. In the SGD approach, the complete dataset is not used for each update; instead, samples are selected randomly. The set of samples used for one update is called a batch, and it is drawn from the complete dataset. Using the whole dataset as the batch is computationally expensive when the dataset is big, so SGD uses a single sample of data per update. In the next iteration, the sample is replaced with another randomly selected sample, and the process repeats. This increases the efficiency of the classifier, and the method is easy to implement.
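The single-sample update can be illustrated on a deliberately simple problem. The sketch below applies stochastic gradient descent to a one-feature linear model with squared error, which is not the classifier used in the study but shows the same update pattern; the learning rate, epoch count, and data are hypothetical.

```python
import random
from typing import Sequence, Tuple


def sgd_fit(xs: Sequence[float], ys: Sequence[float],
            lr: float = 0.05, epochs: int = 500, seed: int = 0) -> Tuple[float, float]:
    """Fit y = w * x + b by updating on one randomly ordered sample at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)            # visit the samples in random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # Gradient of 0.5 * error**2 with respect to w and b, for ONE sample.
            w -= lr * error * xs[i]
            b -= lr * error
    return w, b


# Noise-free toy data generated from y = 2x + 1.
w, b = sgd_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 3), round(b, 3))
```

Each update touches only one sample, so the cost per step does not grow with the size of the dataset.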

4. Results

The results of the educational data mining model are presented as follows.

4.1. Model Evaluation

The evaluation phase of our model analyzes the outcomes of every classifier on the basis of the following factors. A confusion (or error) matrix is used to evaluate the performance of a classifier, see Figure 6.

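Accuracy is the basic evaluation metric used to measure the rate of correct predictions. It is computed from the entries of the confusion matrix as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives.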
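For more than two classes, accuracy is the sum of the diagonal of the confusion matrix divided by the sum of all entries. The sketch below computes it for a three-class matrix; the counts are hypothetical and are not the study's results.

```python
from typing import Sequence


def accuracy_from_confusion(matrix: Sequence[Sequence[int]]) -> float:
    """Correct predictions (diagonal) divided by all predictions."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Rows are true classes, columns are predicted classes: good, fair, poor.
confusion = [
    [30, 2, 1],
    [3, 40, 2],
    [1, 1, 20],
]
print(accuracy_from_confusion(confusion))  # 0.9
```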

4.2. Correlation Heat Map

A heat map is a simple and useful tool for finding the useful attributes in a dataset. The diagram represents the correlation between different features. A value of 1 shows that two features are positively correlated; when the correlation is close to -1, an increase in one variable goes with a decrease in the other. The main advantage of a heat map is that it shows how useful a feature is for the problem, which helps to clean the dataset before it is used. The correlation heat map is shown in Figure 7.
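Each cell of such a heat map holds a correlation coefficient between two features. The sketch below computes it assuming the Pearson coefficient, which the text does not name explicitly; the feature values are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (spread_x * spread_y)


# Hypothetical feature values: one rising and one falling relationship.
rising = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # close to +1
falling = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # close to -1
print(rising, falling)
```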

4.3. Final Grade Distribution

We classify students into three categories, “good,” “fair,” and “poor,” according to their final exam performance, and then analyze a few features that have a significant influence on students’ final performance, including romantic status, parent education level, frequency of going out, desire for higher education, and living area. The three categories of students are shown in Figure 8.

4.4. Final Grade by Frequency of Technology Usage

Figure 9 shows five levels of technology understanding and use. The use of technology depends on how well students understand the devices they use. Good students understand the technology and use it for their studies; their performance is high because they do not waste their time. Poor students are not capable of using technology: when they use devices, they do not understand what they are doing, so they waste their time and energy and their performance is low. Fair students are a mixed group: some understand the technology and some do not, and some use technology but cannot improve their performance because they have no proper guidelines or training.

4.5. Final Grade by Social Media Consumption

Figure 10 shows social media use among good (or intelligent), fair (or normal), and poor (or weak) students. The use of social media is divided into five levels (1, 2, 3, 4, and 5). Low social media use on the weekend is represented by levels 1 and 2, medium use by level 3, and the highest use by levels 4 and 5. Poor students, who are already weak in their studies, use social media, which lowers their performance further. When fair and good students use social media heavily, their performance also decreases. So, high use of social media influences student performance.

4.6. Final Grade by Parents’ Education Level

Parents’ education level influences student performance.

4.6.1. Father Education Level

A parent’s education level has a positive correlation with a student’s final score. The father’s education level influences the student’s grade, although the mother’s education level has a greater effect. Figure 11 shows how the father’s education influences the student grade.

4.6.2. Mother Education Level

Comparatively, the mother’s education level has a bigger influence than the father’s education level. Mothers guide their children and support them in their studies more than fathers do, so the mother’s education strongly influences the student’s final score. Most mothers are uneducated, and student performance at an early stage depends on the mother’s education. Some mothers are not serious about their child’s studies because they are not educated, so their child cannot perform well at school level. Figure 12 shows how much the mother’s education influences student performance.

4.7. Feature Effect

Figure 13 shows that increasing the number of features increases the prediction accuracy of a classifier. Multiple features help the classifier to train and produce accurate results, on the condition that the features are relevant to the problem: relevant features greatly improve the accuracy, whereas irrelevant features decrease it. On the other hand, multiple features can make the classifiers more complex, and some classifiers such as SVM do not cope well with a large number of features because of memory limits. Deep learning can handle a large number of features. So, a large number of relevant features improves the accuracy but also increases the complexity of a classifier. Sometimes adding features does not increase the prediction accuracy because the features are irrelevant to the problem, and sometimes a small number of features greatly influences the prediction accuracy.

4.8. Final Results of Classifiers

Supervised learning algorithms were used to predict student academic performance from the use of technology. These algorithms are DT, random forest, SVM, logistic regression, AdaBoost, and SGD. The decision tree score is 0.90, the random forest score is 0.98, the support vector classifier score is 0.86, the logistic regression score is 0.88, the AdaBoost score is 0.89, and the SGD classifier score is 0.84. The scores show that the random forest classifier has the best results compared to the other classifiers, and the DT classifier is second. The decision tree score is lower than that of random forest because the decision tree suffers from overfitting. The Stochastic Gradient Descent classifier obtains the lowest score. The comparison of classifiers is shown in Table 2.

5. Discussion

In the last few years, educational data mining has received great attention. Many data mining approaches extract knowledge from educational databases. The information extracted from educational data helps educational institutes to improve the teaching and learning process, which in turn improves the output of both students and institutes. Student behavioral attributes also influence student performance: the accuracy of classifiers is greater with behavioral features than without them. The decision tree classifier shows an accuracy of 75% with behavioral features, compared with 55% without them. EDM is a newly emerging field of research that combines data mining and educational data [15–18]. It helps students to improve their performance and their learning activities. The educational data can come from any education repository, for example, a learning management system, web-based education, or online data. Ensemble methods are also used to obtain higher classifier performance. These techniques divide the data into samples of equal size and use a voting process; the result with the most votes is taken as the conclusion. Ensemble algorithms include bagging, boosting, random forest, etc. In traditional models, a single model is trained on the data, whereas in ensemble models, more than one model is trained and the models are combined through voting. The advantage of the ensemble method is that its accuracy is higher than that of a single model [19–22].

Assessment and activity data can affect students’ educational performance. Four algorithms (decision tree, random forest, multilayer perceptron, and logistic regression) were used to identify the important features that affect students’ academic performance. The results show that the most important features are assessment data such as final exam and assignment marks. Decision tree performs well, while random forest achieves the highest accuracy [23–26].

Technology use such as social media usage (weekday and weekly), living area, parent education, desire to receive higher education, and romantic status are features that greatly influence student performance. Negative use of social media and spending more time on it can decrease student performance. With the random forest ensemble method, the accuracy (98%) is higher than that of the other classifiers, such as decision tree, AdaBoost, logistic regression, and Stochastic Gradient Descent. The technology feature greatly influences student performance [27]. Behavioral features, assessment marks, parent involvement, living area, and many other features can influence student education, but in the modern world, technology plays a vital role in every student’s life. This paper focuses on technology such as social media use. The main benefit of an ensemble algorithm is that it achieves higher accuracy than a single-model classifier such as the SVM classifier.

6. Conclusion

Academic achievement is the biggest concern of every educational institute. This paper describes the importance and impact of technology on student education. The study used six machine learning classifiers (DT, SVM, SGD, random forest, AdaBoost, and logistic regression) to determine the patterns in student performance records. Our objective was to evaluate and analyze the impact of technology on student education, so we used attributes including technology features, weekday-social-media-use, and weekend-social-media-use to analyze the student data. All six classifiers performed better when technology features were added alongside the other features. Notably, random forest achieved the highest accuracy, 98%. The scores of decision tree, AdaBoost, logistic regression, SVM, and SGD are 90%, 89%, 88%, 86%, and 84%, respectively. This research shows that technology is now a very important factor in achieving better performance in educational institutes. Social media greatly influences student education, and including this feature increases the accuracy of the classifiers. In a changing world, online and home-based education has become very important. Further analysis of the different factors of technology, the negative impact of technology, and the impact of home-based learning could be key directions for future research.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Acknowledgments

We appreciate the partial collaboration with Universiti Malaysia Sabah, Malaysia, in this work.