Abstract

Chronic kidney disease (CKD) has become a widespread disease among people. It is related to various serious risks like cardiovascular disease, heightened risk, and end-stage renal disease, which can be feasibly avoidable by early detection and treatment of people in danger of this disease. The machine learning algorithm is a source of significant assistance for medical scientists to diagnose the disease accurately in its outset stage. Recently, Big Data platforms are integrated with machine learning algorithms to add value to healthcare. Therefore, this paper proposes hybrid machine learning techniques that include feature selection methods and machine learning classification algorithms based on big data platforms (Apache Spark) that were used to detect chronic kidney disease (CKD). The feature selection techniques, namely, Relief-F and chi-squared feature selection method, were applied to select the important features. Six machine learning classification algorithms were used in this research: decision tree (DT), logistic regression (LR), Naive Bayes (NB), Random Forest (RF), support vector machine (SVM), and Gradient-Boosted Trees (GBT Classifier) as ensemble learning algorithms. Four methods of evaluation, namely, accuracy, precision, recall, and F1-measure, were applied to validate the results. For each algorithm, the results of cross-validation and the testing results have been computed based on full features, the features selected by Relief-F, and the features selected by chi-squared feature selection method. The results showed that SVM, DT, and GBT Classifiers with the selected features had achieved the best performance at 100% accuracy. Overall, Relief-F’s selected features are better than full features and the features selected by chi-square.

1. Introduction

The present era, especially the last two decades, can be named the era of big data where digital data is turning out to be very crucial more and more in various fields such as science, healthcare, technology, and society. Huge data volumes have been produced and generated from multiple sensor networks and mobile applications in almost all fields, including healthcare in specific, and this multitude of data volumes is what we call big data [1]. Wide variety of data sources such as streaming machines, high-end output instruments, visualizing, and knowledge extraction across these vast and diverse types of data pose a significant challenge when sufficient cutting-edge technologies and tools are not used. One of the most eminent technological challenges facing big data analytics lays in exploring ways that are adequate to obtain useful and relevant information for different user categories in an effective manner.

Nowadays, the different forms and types of data sources in healthcare are being gathered in both clinical and nonclinical environments, where the most crucial data in healthcare analytics is the digital copy of a patient’s medical history. On that account, the process of designing and making up a distributed data system to handle big data is challenged by three main issues. The first challenge is that it is difficult to collect data from distributed locations because of the diverse and large data volume. The second challenge is that storage is the chief issue for heterogeneous and enormous datasets as big data system requires to store while allowing performance guarantee. The third challenge is more connected to big data analytics, specifically to enormous mining datasets in real time, and this includes visualization, prediction, and optimization [2].

Considering the difficulty imposed by these challenges, they require an up-to-date and advanced processing paradigm provided that the present data management systems do not provide adequate efficiency in handling the heterogeneous nature of data or the real-time aspect. Traditional database management systems cannot support the continuous increase in huge data size. To address these issues related to enormous and heterogeneous data storage, the research community has proposed a number of research works, such as Apache Spark, Apache Hadoop [3], Apache Kafka [4], and Apache Storm [5], to solve healthcare problems [68].

Chronic kidney disease (CKD) has received a lot of interest due to its high death rate. Chronic diseases have become a major hazard to emerging countries, according to the World Health Organization (WHO) [9]. CKD is a kidney illness that can be treated in its early stages, but it eventually leads to renal failure if not treated early. In 2016, chronic kidney disease claimed the lives of 753 million individuals globally, accounting for 336 million male deaths and 417 million female deaths [10]. Chronic renal disease can be prevented from progressing to kidney failure if diagnosed and treated early. Diagnosing chronic kidney disease early is the best method to treat it, while delaying treatment until it is too late may lead to renal failure, which necessitates dialysis or kidney transplantation to live normally. Therefore, global strategies for early detection and treatment of people with CKD are required. To mine hidden patterns from data for effective decision-making and to help doctors in making more accurate diagnoses, a computer-aided diagnosis system based on artificial intelligence strategies is needed for clinical information. Artificial intelligence techniques (machine learning and deep learning) have been used in the health field, namely, in disease prediction and diagnosis.

Chronic kidney disease (CKD) is a condition that affects the kidney’s ability to function. In general, CKD is separated into phases, with renal failures occurring when the kidneys are no longer able to complete their roles of blood purification and mineral balance in the body [11]. According to the current estimates, CKD is more common in adults over 65 years old (38%) than in people aged 45–64 years (12%) and people aged 18–44 years (6%). Women have a rather higher rate of CKD (14%) than males [12].

Machine learning is an exciting field that focuses on studying huge amounts of data with multiple variables. Machine learning has basically developed from studying the theory of pattern recognition and computational learning in artificial intelligence; it presupposes computational methods, algorithms, and analysis techniques. From the perspective of Medical Sciences, machine learning undertakes to aid health specialists and doctors in carrying out scintillate and flawless diagnoses, choosing the best-fit medicines for patients, determining patients at high risk, and, most importantly, improving patients’ physical condition with minimal cost.

Machine learning (ML) has demonstrated remarkable performance across a range of applications, such as speech recognition [13], computer vision [14], medical diagnostics [15], and engineering [16].

Being a constituent of the ML process, feature selection (FS) is a crucial preprocessing step that determines the most relevant attributes within a dataset. Removing unimportant and unnecessary attributes can result in less complicated and more accurate models. In this paper, two feature selection methods based on Apache Spark are used, namely, Relief-F [17] and chi-squared [18] feature selection method. Some of the research works have used ML techniques to predict CKD. For example, Charleonnan [19] et al. used four ML algorithms, K-nearest neighbors (KNN), support vector machine (SVM), logistic regression (LR), and decision tree (DT), to predict CKD. Other research works used hybrid ML algorithms that are integrated between feature selection methods and ML to predict CKD. Feature selection methods have been used to reduce the number of features and select the optimal subsets of features from the dataset. For example [20], authors used chi-square, correlation-based feature selection (CFS), and Lasso feature selection to select the essential features from the database. They applied artificial neural network (ANN), C5.0, LR, SVM, KNN, and RF to both full features and the selected features.

Recently, researchers have been using big data platforms such as Apache Spark [21] which is a large-scale data processing engine with a unified analytics engine. Spark is 100 times quicker than Hadoop in running workloads on large-scale clusters. It includes Java, Scala, Python, and R high-level APIs, as well as an efficient engine that supports broad execution graphs. It also includes a number of higher-level tools such as Spark SQL for SQL and structured data processing, MLlib, GraphX, and Structured Streaming.

Spark’s machine learning (ML) [21] library is called MLlib. Its purpose is to make scalable and simple machine learning a reality. It provides, at a high level, tools such as classification, regression, clustering, and collaborative filtering as examples of machine learning algorithms. It also provides feature extraction, transformation, dimensionality reduction, and selection as examples of featurization.

The previous studies of CKD prediction have not used big data platforms to solve this problem. The goal of this work is to predict CKD using hybrid ML techniques based on Apache Spark to predict CKD. Our contribution can be summarized as follows:Developing hybrid ML techniques based on Apache Spark to predict CKDApplying feature selection algorithms to select the important features from the datasetApplying optimization techniques, including grid search with cross-validation to optimize ML algorithms to enhance performanceApplying different ML classification algorithms to both full features and the selected featuresApplying ensemble learning such as Gradient-Boosted Trees based on Apache Spark to predict CKD.

The rest of this paper is structured as follows: Section 2 presents the previous studies to predict CKD. Section 3 presents the main stages of a developing system to predict CKD based on Apache Spark. Section 4 presents the experimental results. Finally, conclusions are presented in Section 5.

Many authors have used different ML techniques for the diagnosis and prediction of chronic kidney disease as shown in Table 1.

For example, in [27], the authors proposed a hybrid model that combines LR and RF to predict CKD disease. They compared their proposed model with six ML algorithms, LR, RF, SVM, KNN, Naive Bayes (NB), and feedforward neural network (FNN). Their proposed model has registered the highest accuracy at 99.83%. In [29], NB, K-Star, SVM, and J48 classifiers were used to predict CKD. Performance comparison was made using WEKA software. J48 algorithm had better performance with 99% accuracy than the other algorithms.

Some authors used ML algorithms with feature selection methods to predict CKD. In [22], the recursive feature elimination (RFE) feature selection method has been used to select the essential features from the chronic kidney disease (CKD) dataset. Four classification algorithms have been applied (SVM, KNN, DT, and RF) to both full features and selected features. The results showed that RF outperformed all other algorithms. In [20], the authors used chi-square, CFS, and Lasso feature selection to select the essential features from the database. They applied ANN, C5.0, LR, LSVM, KNN, and RF to both full features and the selected features. The results showed that LSVM with full features has registered the highest accuracy at 98.86%. In [23], five feature selection methods, Random Forest feature selection (RF-FS), forward selection (FS), forward exhaustive selection (FES), backward selection (BS), and backward exhaustive (BE), have been used to select the most important features from the database. Four ML algorithms, RF, SVM, NB, and LR, have been used to predict CKD. The results showed that RF with Random Forest feature selection had achieved the best performance with 98.8% accuracy. In [26], the genetic search algorithm has been used to select the most important features from the CKD dataset. Decision Table, J48, Multilayer Perceptron (MLP), and NB have been applied to both full features and the selected features. Using genetic search algorithm enhanced the performance. The MLP classifier has achieved the best performance and outperformed the other classifiers. In [30], the number of important features has been selected using a correlation-based feature selection (CFS). AdaBoost, KNN, NB, and SVM have been used to detect CKD. The proposed CFS with AdaBoost achieved the best performance at 98.1% accuracy. In [25], the authors used two ensembles techniques which are Bagging and Random Subspace methods and three base-learners, KNN, NB, and DT, to predict CKD. The random subspace has achieved the best performance than Bagging on KNN classifier.

Previous studies just applied ML techniques to study and analyze data about CKD; they did not use big data platforms. Therefore, this motivates us to use big data platform (Spark) to study and analyze data about CKD including hybrid approaches (feature selection methods with ML classification algorithms and feature selection methods with ensemble algorithms).

3. Methodology

The proposed system of predicting chronic kidney disease consists of two main approaches, as shown in Figure 1. The first approach uses feature selection methods to select the essential features from the chronic kidney disease datasets. The second approach applies ML techniques: DT, LR, RF, SVM, NB, and ensemble learning on the selected features and full features to predict CKD. The proposed system is composed of 6 steps: in the first step (data collection), the CKD dataset from the UCI machine learning repository will be used. In the second step (data preprocessing step), null values will be handled. In the third step, the feature methods will be used to select the essential features. In the fourth step, a grid search with stratified cross-validation is used to optimize the parameters of ML and ensemble learning techniques. Each step is described in detail in the following subsections.

3.1. Data Collection

The chronic kidney disease (CKD) dataset used in this study was obtained from the UCI machine learning repository [31]. The CKD dataset includes 400 samples, 24 features, and 1 class label. The dataset contains 400 samples. The class label has two values: ckd (sample with CKD) and notckd (sample without CKD). The details of each feature are described in Table 2.

3.2. Data Preprocessing

The dataset included outliers and noise. Therefore, it needs to be cleaned up and unblemished in a preprocessing stage. The preprocessing stage incorporated the estimation of the missing values and noise elimination, like outliers, normalization, and unbalanced data checking, because certain measures may be lost when patients are being tested, resulting in missing values. There were 158 completed cases in the dataset, with the remainder occurrences having missing values. Ignoring the record is the simplest way of dealing with the missing values, although this strategy is ineffective in small datasets. Instead of removing records, we can apply algorithms to estimate the missing data as an alternative approach. The missing values of nominal features have been filled by mode. The missing values of numerical features have been filled by mean.

3.3. Feature Selection Methods

The main benefits of using feature selection algorithms are determining the important features in the dataset. The classifier approach with feature selection produces better results and reduces the model’s execution time. Relief-F and chi-squared feature selection method were used to select the subset of important features from the database. This study has applied two feature selection strategies based on Apache Spark.RelieF [32] is a frequently used feature weighting technique that assigns weights to each feature in a dataset to determine the quality of the features [33]A chi-squared test is used a statistical hypothesis test to get ranks for each feature [18]

3.4. Splitting the Dataset

The CKD datasets are split into 80% training set and 20% testing set. We used stratified cross-validation to train and optimize the models using the training set and the result of cross-validation is registered. We evaluated the models using the testing set, and the results of the testing set are registered.

3.5. Models’ Optimization and Training
3.5.1. Optimization Methods

Grid search with stratified K-Fold cross-validation is used to optimize the models and tune the hyperparameters. The most common method for hyperparameter optimization is grid search. For each hyperparameter, the users must first define a set of values. The model then evaluates all possible values for each hyperparameter and chooses the one that provides the best performance.

K-Fold cross-validation: the dataset is divided into k folds of equal size. The training is done in k-1 groups, with the remaining time being used to test the classifiers. This procedure is repeated until each of the ten folds has been provided as a testing set. The performance of the classifiers is also measured for each k. Finally, depending on the average performance, the evaluation classifier is created.

3.5.2. Machine Learning Models

The classification models used in the research are as follows:Decision tree (DT): it could be a supervised rule for learning in classification issues that contains a predefined target variable which is generally used. Decision tree works for each specific and continuous input and output variables. During this methodology, decision tree will be applied to each classification and regression issue that divides the population or sample into two or additional same sets known as subpopulation supporting the foremost necessary splitter within the input variable [34].Random forest (RF): it is a type of supervised ML technique. Basically, it accumulates a lot of trees and integrates them for more accurate prediction [23].Logistic regression (LR): it solved binary classification problems. A logistic or sigmoid function is used in LR to predict the probabilities of various labels for an unlabeled observation [35].Support vector machine (SVM): it is a type of supervised ML technique. It segregates dataset into classes using the hyperplane [22].Naïve Bayes (NB): the Bayes theorem is used to train a classifier in the Nave Bayes algorithm. In other words, it is a probabilistic classifier that has been trained using the Nave Bayes algorithm. It calculates a probability distribution over a set of classes for a given observation [29].Gradient-Boosted Trees (GBTs): it is also possible to train an ensemble of decision trees using the Gradient-Boosted Trees (GBTs) algorithm. However, each decision tree is trained sequentially. This makes use of the previously trained tree information to optimize each new tree. As a result, the model improves with every new tree. Since GBT trains one tree at a time, it can take longer time to train a model using GBT. In addition, if many trees are used in an ensemble, it is prone to overfitting. In a GBT ensemble, each tree can, however, be shallow, making it easier to train. Gradient boosting is a technique for iteratively training a series of decision trees. On each iteration, the method predicts the label of each training sample using the current ensemble and then compares the prediction to the true label [36].

3.6. Evaluating the Models

As shown in Equations 1-4, the models are evaluated using four standard metrics: accuracy, precision, recall, and F1-score, where TP stands for true positive, TN stands for true negative, FP stands for false positive, and FN stands for false negative.

4. Experiments and Results

This section discusses the results of applying chi-square and Relief-F to the dataset to select the most important features. Also, it discusses the performance of cross-validation and the testing results of applying ML algorithms, SVM, LR, NB, RF, DT, and GBT Classifier, to the full features and the selected features. In addition, it demonstrates the best values of parameters for each ML algorithm that was optimized by grid search. Two feature selection methods were used; the CKD dataset was split into 80% training set and 20% testing set. The cross-validation results were registered for the training set, and the testing results were registered for the testing set. ML algorithms and features selection methods were implemented using PySpark.

4.1. Results of Chi-Square Feature Selection Method and ML Algorithms

In this subsection, the essential features were selected by chi-square algorithm to pass into ML models for predicting CKD. The 12 most important features which have the highest scores and were thus used to predict CKD chi-square are wc, bgr, bu, sc, pcv, al, haem, age, su, htn, dm, and bp, as shown in Figure 2. It can be noticed that wc has the highest score at 12733.72, while bp has the lowest score at 80.02. The second highest score is registered by bgr at 2428.327. Sc and pcv have the same score at 354.410 and 324.706, respectively. Also, htn and dm have approximately the same score at 86.29 and 80.44, respectively. Table 3 displays the scores of all features that chi-square has selected. The highest score is registered by wc at 12733.72, while the lowest is registered by sg at 0.0050.

The performance of cross-validation and the testing results of applying ML to the selected features by chi-square are described in Table 4. For cross-validation result, RF registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB has registered the lowest performance (AC = 81%, PR = 85%, RE = 82%, FS = 82%). LR and SVM have the same performance (AC = 97%, PR = 97%, RE = 97%, FS = 97%). For the testing results, SVM registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB registered the lowest performance (AC = 82%, PR = 88%, RE = 82%, FS = 82%). The second highest performance is registered by LR (AC = 97%, PR = 98%, RE = 97%, FS = 97%).

For optimization ML models, some of values of parameters are adapted and the best setting of ML’s parameters is shown in Table 5.

4.2. Results of Relief-F Feature Selection Method and ML Algorithms

In this subsection, the essential features were selected by Relief-F algorithm to pass into Ml models for predicting CKD. The 12 most important features which have the highest weights selected by Relief-F and were used to predict CKD are shown in Figure 3. It can be noticed that rbc has the highest weight at 0.4551, while appe has the lowest weight at 0.062875. The second highest weight is registered by haem at 0.365745. Al and dm have approximately the same weights at 0.257775 and 0.24085, respectively.

Table 6 displays weights of all features that are selected by Relief-F. The highest weight is registered by rbc at 0.4551, while the lowest weight is registered by bp at -0.01584. The performance of cross-validation and the testing results of applying ML to the features selected by Relief-F are described in Table 7. For cross-validation results, DT, RF, and GBT Classifier registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB registered the lowest performance (AC = 88%, PR = 89%, RE = 89%, FS = 89%). LR and SVM have the same performance (AC = 99%, PR = 99%, RE = 99%, FS = 99%).

For the testing results, DT and GBT Classifier registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB registered the lowest performance (AC = 95%, PR = 95%, RE = 95%, FS = 95%). LR and SVM have the same performance (AC = 98%, PR = 99%, RE = 99%, FS = 99%).

For optimization ML models, some of values of parameters are adapted and the best setting of ML’s parameters is shown in Table 8.

4.3. The Performance of ML with Full Features

Table 9 presents the result of cross-validation and the testing of applying ML to full features. Overall, RF achieved the best performance for cross-validation and the testing results. For cross-validation results, RF registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB has the lowest performance (AC = 84%, PR = 88%, RE = 84%, FS = 84%). LR, SVM, and GBT Classifier have the same performance (AC = 99%, PR = 99%, RE = 99%, FS = 99%). For the testing results, RF and SVM registered the highest performance (AC = 100%, PR = 100%, RE = 100%, FS = 100%), while NB has the lowest performance (AC = 87%, PR = 91%, RE = 88%, FS = 88%). For optimization ML models, some of values of parameters are adapted and the best setting of ML’s parameters is shown in Table 10.

4.4. Discussion

Table 11 presents models that have achieved the highest cross-validation results. The performance of cross-validation of applying ML to the features selected by Relief-F has achieved the best value by three models: DT, RF, and GBT Classifiers. In comparisons, the cross-validation performance of applying ML to full features and features selected by chi-square has achieved the best value by 1 model: RF.

Table 12 presents the best models for the testing results. The performance of testing applying ML to the features selected by Relief-F has achieved the best value by two models: DT and GBT Classifiers. The testing performance of applying ML to full features has achieved the best value by two models: RF and SVM Classifiers. However, the testing performance of applying ML to the features selected by chi-square has achieved the best value by 1 model: SVM Classifier.

The results showed that SVM, DT, and GBT Classifier with the selected features have achieved the best performance. Overall, the performance with Relief-F feature selection is better than chi-square feature selection and full features.

Table 13 presents the comparison of performance between the previous studies and our work on the same dataset. In our work, the Relief-F feature selection methods have achieved the best performance for the testing results and cross-validation results using DT and GBT Classifier compared to the other existing works [23, 24, 26, 27, 30]. Also, our work is different from the other existing works [22, 25] because it registered the results for both the training set and the testing set, and it has achieved the best performance.

5. Conclusion

In this paper, the hybrid ML techniques integrating feature selection methods and classification ML algorithms based on big data platforms (Apache Spark) were used to predict CKD. Relief-F and chi-squared feature selection techniques were used to select the important features from the dataset. ML algorithms, DT, LR, NB, RF, SVM, and GBT Classifier as ensemble learning algorithm, were applied to benchmark chronic kidney disease dataset. Also, they were applied to the full features and to the selected features. Grid search with cross-validation was used to optimize the parameters of ML. In addition. Four methods of evaluation, accuracy, precision, recall, and F1-measure, were applied to validate the results and the results of cross-validation and the testing data were registered. The results showed that SVM, DT, and GBT Classifier with the selected features have achieved the best performance. Overall, the performance of Relief-F feature selection is better than that achieved by chi-square feature selection and the full features.

Data Availability

Chronic kidney disease dataset is downloaded from https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease.

Conflicts of Interest

All authors declare that they have no conflicts of interest.

Acknowledgments

The authors would want to express their gratitude to the UCI machine learning repository for providing CKD dataset.