Abstract

Approximately 17.9 million people lose their lives to cardiovascular disease every year, accounting for 32% of all deaths worldwide, making it a major global concern. Fortunately, the mortality rate due to heart disease can be reduced by early treatment, for which early-stage detection is a crucial issue. This study is aimed at building a potential machine learning model to predict heart disease at an early stage, employing several feature selection techniques to identify significant features. Three different approaches were applied for feature selection, namely, chi-square, ANOVA, and mutual information, and the selected feature subsets were denoted as SF1, SF2, and SF3, respectively. Then, six different machine learning models, namely, logistic regression (C1), support vector machine (C2), K-nearest neighbor (C3), random forest (C4), Naive Bayes (C5), and decision tree (C6), were applied to find the most promising model along with the best-fit feature subset. Finally, we found that random forest provided the best performance for the SF3 feature subset, with 94.51% accuracy, 94.87% sensitivity, 94.23% specificity, 94.95% area under the ROC curve (AUROC), and 0.31 log loss. The performance of the applied model along with the selected features indicates that the proposed model has strong potential for clinical use in predicting heart disease at an early stage with low cost and less time.

1. Introduction

Nowadays, machine learning algorithms are used widely all over the world. In the healthcare industry, machine learning is widely used for predicting disease at an early stage, saving many lives worldwide. Even so, every year, thousands of people are affected by and die from heart disease. If machines can detect the disease at an early stage, such prediction should reduce the risk of death from heart disease. The heart is a vital organ of the human body, and heart disease is a leading cause of death in the present world. When the heart is unable to perform properly, other organs are obstructed; the brain and several organs then stop working, and a person can die within a few minutes. Heart disease is one of the foremost diseases that most commonly affects middle-aged or older people and creates severe complications in the human body [1]. It is difficult to diagnose because of the number of risk factors involved. The main symptoms of heart disease are physical weakness, chest pain, shortness of breath, and rapid or irregular heartbeat [2]. The incidence of heart disease is much higher in the United States (US), where one person dies from heart disease every 34 seconds [3]. Almost 26 million people all over the world are affected by heart disease [4]. Every year, 17.9 million people are affected by heart disease, and it accounts for 32% of deaths worldwide [5]. The World Health Organization (WHO) estimates that from 2005 to 2015, India lost up to $237 billion due to heart-related diseases [5]. Both males and females suffer from heart disease (HD) [6]. Heart disease also appears in middle and older age because of exposure to unhealthy lifestyles over many years. With this research, we aim to predict heart disease at an early stage; such prediction can help millions of heart disease patients worldwide and save many lives.
Heart disease also causes a huge loss to the global economy, and predicting it at an early stage will save billions of dollars. For prediction, six machine learning algorithms are used, their accuracies are compared, and a conclusion is drawn as to which algorithm performs best among them.

In this section, previous heart disease-related studies using machine learning methods, which motivated this work, are discussed. Ramalingam et al. [7] employed machine learning approaches on several medical datasets and experimented with numerous data. Their paper covers various model-based algorithms and techniques, using supervised algorithms such as Naive Bayes (NB), random forest (RF), decision tree (DT), support vector machine (SVM), and K-nearest neighbor (KNN). The implementations of the various techniques were compared based on accuracy. NB achieved an accuracy of 84.1584% with SVM-RFE (recursive feature elimination) selecting the 10 most significant features. According to Pouriyeh et al. [8], using 13 attributes, the NB algorithm achieved an accuracy of 83.49%, while DT and KNN achieved 82.17% and 83.16%, respectively. In 1951, Fix and Hodges [9] proposed a nonparametric method for pattern classification that is popularly known as the KNN rule. Palaniappan and Awang [10] built an intelligent heart disease prediction system with ML algorithms, using DT, NB, and neural network (NN) techniques to predict HD; the accuracies of DT, NB, and NN were 80.4%, 86.12%, and 85.68%, respectively. Rabbi et al. [11] used the Cleveland standard heart disease dataset and compared three techniques, SVM, KNN, and artificial neural network (ANN), as computer-based prediction algorithms. KNN achieved 82.963% accuracy and ANN 73.3333%, and they proposed SVM as the best classification algorithm with the highest accuracy for predicting heart disease. Haq et al. [12] used the UCI dataset and developed models using popular algorithms, the cross-validation method, three feature selection (FS) algorithms, and seven classifier performance evaluation metrics such as classification accuracy, specificity, Matthews' correlation coefficient, sensitivity, and execution time. They studied the impact of feature selection on classifier performance in terms of accuracy and execution time. The three feature selection algorithms, mRMR, relief, and LASSO, were used to select the important features and to improve specificity, sensitivity, and accuracy.

Considering those previous studies: Ramalingam et al. [7] surveyed heart disease prediction using machine learning techniques, where the best data yields the best performance for each algorithm. Pouriyeh et al. [8] worked on the UCI dataset with a comprehensive comparison of machine learning techniques in the heart disease domain; however, the performance of those techniques depends on the feature selection algorithms. Palaniappan and Awang [10] used data mining techniques to predict heart disease on 909 patients' records; however, data mining is much more effective with larger amounts of data. Rabbi et al. [11] applied several algorithms with the same techniques, which gave less than 90% accuracy; those algorithms were implemented in MATLAB, and using Python with feature selection techniques could perform better. Haq et al. [12] used much better techniques but still did not achieve more than 90% accuracy; handling the data more carefully might give better accuracy. Finally, it can be said that those studies tried to find the best accuracy for predicting heart disease from the clinical information of patients in the UCI dataset and, on average, correctly predicted fewer than 80% of heart disease patients. They sought the best accuracy using all of the features or a specific feature selection algorithm for a specific machine learning algorithm, and they did not visualize any correlations between features. Also, most earlier studies report only the prediction score of an algorithm and do not describe other performance evaluation metrics such as sensitivity, specificity, and log loss.

In this study, the heart disease (HD) dataset from the UCI Machine Learning Repository [13] is used. This work addresses a supervised machine learning problem. Although there has been a lot of research on heart disease using different algorithms, it is a complex problem that cannot be solved well with a single simple machine learning algorithm. In this work, several algorithms are applied, including logistic regression (LR) and decision tree (DT), and several feature selection methods are applied to the dataset to determine which classifier gives the best accuracy for heart disease. In addition, machine learning algorithms play vital roles in predicting various health-related diseases at early stages. A visual representation of the sequential steps of the heart disease prediction workflow used in this study is shown in Figure 1.

3. Methodology

In this study, Python 3.8 was used to perform the experiment because it is more accessible to everyone, and it makes it easier to perform rapid testing of algorithms. The workflow of the study is mentioned in Figure 1. The following subsections briefly describe the research methods used in this study.

3.1. Dataset

In this study, the UCI Cleveland dataset [13] is used. This dataset has been used in a great deal of research and analysis; here, it is used for predicting heart disease. The UCI heart disease dataset contains 303 patient records, and each record has 13 features. Two classes in the target label represent heart patients and normal cases. The dataset matrix information is given in Table 1.

3.2. Data Preprocessing

In this study, the data were preprocessed after collection. In the Cleveland dataset, 4 records of NMV and 2 records of TS have incorrect values; all records with incorrect values were replaced with appropriate values. Next, StandardScaler was used to ensure that every feature has mean 0 and variance 1, bringing all features onto a comparable scale.
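As an illustration of the scaling step, here is a minimal sketch using scikit-learn's StandardScaler on a made-up toy matrix; the study's actual record cleaning and values are not reproduced here:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix standing in for a few Cleveland-style records (rows = patients).
X = np.array([[63.0, 145.0, 233.0],
              [37.0, 130.0, 250.0],
              [41.0, 130.0, 204.0],
              [56.0, 120.0, 236.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and variance 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

The same fitted scaler should later be applied (via `transform`) to the test split so that the test data are scaled with the training statistics.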

3.3. Feature Selection

Feature selection plays an important role in the machine learning process because a dataset often contains many irrelevant features that reduce the accuracy of the algorithms. Feature selection helps to remove those irrelevant features and improve the performance of the algorithms [14]. Different feature ranking techniques [15] rank the most important features based on their relevance. In this study, three well-known feature selection algorithms are used to identify important features based on their scores.

3.3.1. ANOVA F-Value

The ANOVA F-test is a statistical technique that measures how well a feature separates the classes; it is used to reduce high-dimensional data, identify the important features in the feature space, and improve classification accuracy. The F-statistic [16] is the ratio of the between-group variance to the within-group variance:

F = MSB / MSW = (SSB / (k − 1)) / (SSW / (N − k)),

where SSB and SSW are the between-group and within-group sums of squares, k is the number of groups, and N is the total number of samples.

3.3.2. Chi-Square

This test is a statistical hypothesis test, also written as the χ² test. It is calculated between the observed values and the expected values. The formula [17] is given below:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ,

where Oᵢ is the observed frequency and Eᵢ is the expected frequency of category i.

3.3.3. Mutual Information (MI)

Over the past couple of decades, mutual information has acquired considerable attention for its applications in machine learning. MI is calculated between two variables, here a feature and the target [18]; the mathematical equation for the mutual information between variables X and Y is

I(X; Y) = Σₓ Σᵧ p(x, y) log ( p(x, y) / (p(x) p(y)) ).
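The three selection criteria above map directly onto scikit-learn's univariate selectors. The sketch below scores a synthetic stand-in dataset (not the real Cleveland data) with all three; the choice k = 9 is illustrative, not the study's exact subset size:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif

# Synthetic stand-in for the 13-feature UCI matrix.
X, y = make_classification(n_samples=200, n_features=13, n_informative=5, random_state=0)
X_pos = X - X.min(axis=0)  # shift so chi2's non-negativity requirement holds

anova_scores = f_classif(X_pos, y)[0]                       # FST1: ANOVA F-value
chi2_scores = chi2(X_pos, y)[0]                             # FST2: chi-square
mi_scores = mutual_info_classif(X_pos, y, random_state=0)   # FST3: mutual information

# Keep the k highest-scoring features under one criterion, as done per SF set.
sf = SelectKBest(f_classif, k=9).fit_transform(X_pos, y)
print(sf.shape)
```

Each scorer returns one score per feature; ranking those scores reproduces the feature-importance tables (Tables 3-5) for the chosen criterion.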

As previously mentioned, six ML algorithms were used in this experiment: LR, SVM, KNN, RF, NB, and DT.

3.4. Classification and Modeling

The models used for predicting heart disease are described sequentially. Each algorithm is applied following that sequence. Various types of classification algorithms are available for data analysis. In this study, six types of classification algorithms are used. A brief discussion of each algorithm is given below.

3.4.1. Logistic Regression

The logistic regression model estimates probabilities for classification problems with two possible outcomes, where 0 denotes the negative class and 1 the positive class [12], and a hypothesis h_θ(x) is designed on this basis. If the hypothesis value h_θ(x) ≥ 0.5, the model predicts y = 1; if h_θ(x) < 0.5, it predicts y = 0. The logistic regression sigmoid function is written as

h_θ(x) = g(z) = 1 / (1 + e^(−z)), where z = θᵀx.
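A short numeric sketch of the sigmoid and the 0.5 decision rule described above, using toy inputs rather than the study's fitted model:

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Decision rule from the text: predict 1 when h >= 0.5, else 0.
z = np.array([-2.0, 0.0, 3.0])   # toy values of theta^T x
probs = sigmoid(z)
preds = (probs >= 0.5).astype(int)
print(probs.round(3), preds)
```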

3.4.2. Support Vector Machine

SVM creates an effective decision boundary (hyperplane) between the two classes [19]. The decision boundary is drawn to maximize the distance to the nearest data points of both classes. When the radial basis function (RBF) is used as the kernel, SVM automatically determines the centers, weights, and thresholds and reduces the upper bound of the expected test error. In this study, the radial basis function is used as the kernel:

K(x, x′) = exp(−γ ‖x − x′‖²).

Here, ‖x − x′‖² is the squared Euclidean distance between the vectors x and x′, and γ > 0 controls the width of the kernel.
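The RBF kernel is simple to evaluate directly; the sketch below computes it by hand and cross-checks against scikit-learn's rbf_kernel (γ = 1.0 is an arbitrary illustrative value, not the study's tuned parameter):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, x_prime, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([1.0, 2.0])
xp = np.array([1.0, 2.0])
print(rbf(x, xp))  # identical points -> kernel value 1.0

# Cross-check against scikit-learn's implementation on a different pair.
print(rbf_kernel([x], [np.array([0.0, 0.0])], gamma=1.0))
```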

3.4.3. K-Nearest Neighbor

KNN uses the training set directly to classify the test data, where K refers to the number of nearest neighbors considered. To classify each test sample, it calculates the distance between that sample and all the training data; the test sample is then assigned a class label by majority voting among its K nearest neighbors. The Euclidean distance measure equation is given below:

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² ).
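A minimal sketch of the distance computation and the majority vote using scikit-learn's KNeighborsClassifier on toy points (K = 3 here is illustrative, not the value tuned in the study):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def euclidean(a, b):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return np.sqrt(np.sum((a - b) ** 2))

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3 neighbors vote
knn.fit(X_train, y_train)
print(knn.predict([[5.5, 5.0]]))  # two of its three nearest neighbors are class 1
print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0
```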

3.4.4. Random Forest

Random forest is one of the most powerful supervised machine learning algorithms and is principally used for classification problems. Just as a forest is made up of many trees, and more trees make a more robust forest, this algorithm builds an ensemble of decision trees on bootstrapped data samples and aggregates their votes. Here, it is used to obtain efficient heart disease predictions.
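A sketch of the ensemble idea with scikit-learn's RandomForestClassifier on synthetic 13-feature data (a stand-in for the Cleveland records; n_estimators = 100 is an assumed, not reported, setting):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 13-feature data standing in for the Cleveland records.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1)  # 100 bootstrapped trees
rf.fit(X, y)

# Each tree votes; the forest also exposes an importance score per feature.
print(len(rf.estimators_))
print(rf.feature_importances_.shape)
```

The `feature_importances_` attribute offers an additional, model-based view of feature relevance alongside the univariate selectors used in Section 3.3.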

3.4.5. Naive Bayes

Naive Bayes uses the Bayes theorem to calculate probabilities and conditional probabilities. A patient may have certain symptoms (side effects); the probability that a proposed diagnosis is true given those symptoms can be computed using the Bayes theorem. For a hypothesis A and evidence B, the formula is given below:

P(A | B) = P(B | A) P(A) / P(B).
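Bayes' theorem itself is one line of code; the numbers below are purely hypothetical (a made-up symptom prevalence for illustration), not estimates from the dataset:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical values: P(symptom|disease) = 0.9, P(disease) = 0.05, P(symptom) = 0.2.
posterior = bayes(0.9, 0.05, 0.2)
print(posterior)  # 0.225
```

Naive Bayes classifiers apply this rule feature by feature under a conditional-independence assumption, multiplying per-feature likelihoods to score each class.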

3.4.6. Decision Tree

Decision trees are a powerful method for classification problems. In this method, the entropy of each attribute is calculated, the data set is split into two or more similar sets based on the most predictive values, and the split with the minimum entropy or maximum information gain is chosen.

The entropy and information gain formulas are given as follows:

Entropy(S) = − Σᵢ pᵢ log₂ pᵢ,

Gain(S, A) = Entropy(S) − Σ_v (|S_v| / |S|) Entropy(S_v),

where pᵢ is the proportion of class i in set S and S_v is the subset of S for which attribute A takes value v.
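The two formulas can be checked with a few lines of NumPy; the labels below are toy values chosen so the results are easy to verify by hand:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, splits):
    """Entropy(parent) minus the size-weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - weighted

y = np.array([1, 1, 0, 0])
print(entropy(y))  # 1.0 bit for a 50/50 class split
print(information_gain(y, [np.array([1, 1]), np.array([0, 0])]))  # a pure split gains 1.0
```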

Multiple evaluation metrics such as accuracy, sensitivity, specificity, AUROC, and log loss were evaluated to present the results of the different algorithms and to compare their performance. These metrics were computed from the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts. The section below describes these metrics in more detail. After completing the analysis, the algorithm that achieves the highest outcomes is identified as the best.

3.4.7. Performance Evaluation Metrics

(1) Accuracy. Accuracy is determined from the confusion matrix, a 2 × 2 matrix used for assessing the performance of a classification model. The formula used to calculate accuracy is

Accuracy = (TP + TN) / (TP + TN + FP + FN).

(2) Sensitivity. It measures the proportion of actual positive cases that are correctly predicted as positive. For calculating sensitivity, the formula used is

Sensitivity = TP / (TP + FN).

(3) Specificity. It measures the proportion of actual negative cases that are correctly predicted as negative. The formula used to calculate specificity is

Specificity = TN / (TN + FP).

(4) AUROC. This evaluation metric is used for checking classification model performance. It is the area under the ROC curve, which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) over all classification thresholds:

AUROC = ∫₀¹ TPR d(FPR).

(5) Log Loss. This is a classification loss function used to evaluate the performance of machine learning algorithms; the closer its value is to zero, the more accurate the model. For N samples with true labels yᵢ and predicted probabilities pᵢ, the formula used to calculate log loss is

Log loss = −(1/N) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ].
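All five metrics can be computed from a confusion matrix and predicted probabilities. The sketch below uses toy predictions and scikit-learn's helpers; the numbers are illustrative, not results from the study:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, log_loss, roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.3, 0.2, 0.1, 0.4, 0.7, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

print(accuracy, sensitivity, specificity)
print(roc_auc_score(y_true, y_prob), log_loss(y_true, y_prob))
```

Note that AUROC and log loss are computed from the predicted probabilities, not from the thresholded labels.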

4. Experimental Setting

In this analysis, a Jupyter notebook is used to perform heart disease prediction on the dataset; it helps to create documents with live code and makes it easy to visualize relationships in the data. First, the UCI HD dataset was cleaned using the Pandas 1.1 and NumPy 1.19.0 libraries of Python and then preprocessed using the StandardScaler algorithm from the Scikit-learn [20] library. Second, the feature selection algorithms were applied to find the feature importance, and three different selected feature (SF) sets were made. Third, the dataset was split into train and test sets: 70% of the data was used as the train set, and the rest was used as the test set. Finally, the 70% training split was used to train six different machine learning algorithms, and the algorithm with the highest performance was used for predicting heart disease. All computations were performed on an Intel(R) Core(TM) i5-7200U @ 2.50 GHz PC.
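The end-to-end workflow just described (scale, select features, 70/30 split, train) can be sketched as a single scikit-learn Pipeline. The data below are synthetic stand-ins for the 303 × 13 Cleveland matrix, and k = 9 is an illustrative subset size, not the study's exact SF3:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 303-record, 13-feature UCI matrix.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(mutual_info_classif, k=9)),   # MI-based selection, as in FST3
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(round(acc, 4))
```

Wrapping the steps in a Pipeline ensures the scaler and selector are fitted only on the training split, avoiding leakage into the 30% test set.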

4.1. Experimental Results

In this study, the Scikit-learn package of Python [20] is used for the feature selection and classification tasks. First, the different algorithms, logistic regression, support vector machine, K-nearest neighbor, random forest, Gaussian NB, and decision tree (denoted as C1, C2, C3, C4, C5, and C6, respectively), were applied to the processed dataset using all the features, and their performance was checked. Second, the Matplotlib and seaborn libraries of Python were used to visualize the correlation matrix heat map and other correlations between different features. Third, the different univariate feature selection methods, ANOVA F-value, chi-square, and mutual information (MI), given in Table 2 (denoted as FST1, FST2, and FST3, respectively), were applied. Fourth, the performance of the different algorithms was evaluated for the selected features. Accuracy, sensitivity, specificity, AUROC, and log loss were used to report the results of those analyses. All features were standardized using StandardScaler before applying them to the algorithms.

4.2. Result of Different Feature Selection Techniques

The ANOVA F-value method scores each feature by its F-value with respect to the target, based on the weights of the features. The ANOVA F-value scores are given in Table 3. By this score, the three most important features are EIA, CPT, and OP, and the least important features are RES, CM, and FBS, respectively. Another method is chi-square, which calculates the chi-square score between every feature and the target. The chi-square scores are given in Table 4. In this method, the three most important features are MHR, OP, and NMV, and the least important features are TS, REC, and FBS, respectively. The ranks of the features under the FST1 and FST2 methods are shown in Figure 2. The third method, FST3, is mutual information (MI), which calculates the mutual information between each feature and the target, measuring the dependency between them. If the score is zero, the two variables are independent; the higher the score, the stronger the dependency. The mutual information scores are given in Table 5. Here, the three most dependent features are CPT, TS, and NMV, and the most independent features are FBS and REC. The rank of the features under the FST3 method is shown in Figure 3. These three tables present significant features for the prediction of heart disease. Because FBS, REC, RBP, and CM have overall lower scores across all three FSTs, those features are not used with the different algorithms in this study. From the remaining features, three different sets of features were selected based on their scores, denoted SF1, SF2, and SF3, respectively. The selected feature sets are shown in Table 6.

4.3. Visualizing Correlation between Features

Firstly, a clustered heat map is visualized, as shown in Figure 4. This heat map shows the correlation among the different features of the dataset. The correlation values show that almost all of this dataset's features are only weakly correlated with each other, which implies that only a few features can be eliminated. In this heat map, CPT, MHR, and PES show the highest positive correlation with the target, and EIA, OP, and NMV show the highest negative correlation with the target attribute. However, FBS, CM, RBP, and REC show the lowest correlation with the target. This agrees with the feature scores from the other feature selection techniques, and these features are eliminated in the different SFs.

Secondly, the relation between age and the target attribute is shown in Figure 5. It shows that around nine patients each aged 41, 51, and 52, and 11 patients aged 54, suffered from heart disease. This suggests that people between the ages of 41 and 54, mostly the middle-aged, suffered from heart disease.

Thirdly, the relation between MHR and the target is shown in Figure 6. It shows that older people have a lower maximum heart rate than younger people, and a higher heart rate slightly increases the possibility of heart disease.

4.4. Experimental Analysis of Accuracy

The processed dataset was analyzed using the different algorithms, and Table 7 shows the accuracy of each algorithm. The highest accuracy (94.51%) was achieved by C4 for SF3; C4 also gave 90.11% and 89.01% accuracy for SF1 and SF2. The second highest accuracy (93.41%) was achieved by C1 for all three SFs. On the other hand, the poorest accuracy (75.82%) was given by C2 for SF3; C2 also gave low accuracies (78.02% and 76.92%) for SF1 and SF2. The accuracies of the other algorithms were between 84.61% and 92.31%. In addition, the result shows that the best algorithm for the dataset is C4 with SF3. All the accuracies of the different algorithms for the different SFs are shown in Figure 7.

4.5. Experimental Analysis of Sensitivity

In this analysis, the sensitivity was analyzed for all of those algorithms; the sensitivity scores are shown in Table 8. The poorest sensitivity (69.38) was given by C2 for SF2; C2 also gave scores of 70.83 and 71.42 for SF1 and SF3. The highest sensitivity (94.87) was given by C4 for SF3, and the second highest sensitivity (94.74) was given by C1 for all the SFs. The sensitivities of the other algorithms were between 80.49 and 94.6. In addition, the result shows that C4 gave the best score for SF3. All the sensitivity scores of the different algorithms for the different SFs are shown in Figure 8.

4.6. Experimental Analysis of Specificity

The specificity was explored for all of those algorithms, and the specificity scores for the different algorithms are shown in Table 9. In this analysis, C2 gave the poorest score (79.69) for SF3, and C4 gave the highest score (94.23) for SF3; C4 also gave specificity scores of 87.50 and 87.27 for SF1 and SF2. C1 gave the second highest score (92.45) for all of the SFs. The other algorithms gave scores between 87.27 and 92.0. In addition, the result shows that C4 gave the best score for SF3. All the specificity scores of the different algorithms for the different SFs are shown in Figure 9.

4.7. Experimental Analysis of AUROC

AUROC was analyzed to evaluate the predictions made for the heart disease dataset; the AUROC scores for the different algorithms are shown in Table 10. In this analysis, the poorest AUROC score (76.27) was given by C2 for SF2; C2 also gave scores of 76.54 and 79.48 for SF1 and SF3. C1 gave the highest score (96.08) for SF3; C1 also gave AUROC scores of 94.56 and 96.03 for SF1 and SF2. C5 gave the second highest score (95.54) for SF2. The other algorithms gave AUROC scores between 91.81 and 95.49. In addition, the result shows that C1 gave the best score for SF3. All the AUROC scores of the different algorithms for the different SFs are shown in Figures 10-12.

4.8. Experimental Analysis of Log Loss

In this analysis, the log loss was explored; the results given by the different algorithms are shown in Table 11. In this experiment, C2 gave the highest (worst) log loss (8.35) for SF3; C2 also gave 7.59 and 7.97 for SF1 and SF2. In contrast, the lowest log loss value (0.27) was given by C1 for both SF2 and SF3. The other algorithms gave log loss scores between 0.29 and 1.02. All the log loss scores of the different algorithms for the different SFs are shown in Figure 13.

5. Discussion

In this research, various machine learning algorithms were used for the early detection of heart disease, and the UCI Cleveland dataset was used for training and testing purposes. Specifically, six well-known algorithms, LR, SVM, KNN, RF, Gaussian NB, and DT, were used with the different selected features, and the univariate selection algorithms ANOVA F-value, chi-square, and mutual information (MI) were used to identify the significant features that matter most for predicting heart disease. To check the performance of the different algorithms, the evaluation metrics accuracy, sensitivity, specificity, AUROC, and log loss were used. The experimental results show that the algorithm C4 achieves the highest accuracy (94.51%) for SF3, and C1 achieved the second highest accuracy (93.41%) for all three SFs, as shown in Table 7. In terms of sensitivity and specificity, C4 also achieved the highest sensitivity (94.87) and specificity (94.23) for SF3, as shown in Tables 8 and 9. For AUROC, C1 gave the highest score (96.08) for SF3, as shown in Table 10, and for log loss, C1 gave the lowest value (0.27) for both SF2 and SF3, as shown in Table 11. Because of the highest performance of C4 with SF3, it is the best predictive model in terms of accuracy, sensitivity, and specificity; for AUROC and log loss, C1 is the better predictive model for SF2 and SF3, making it the second-best predictive model overall. In this analysis, we find that random forest gave the best performance for accuracy, sensitivity, and specificity, and LR gave the best performance for AUROC and log loss. Consequently, it is reasonable to judge that random forest is an efficient algorithm for heart disease prediction; compared with the other machine learning algorithms, it performed above 90 percent accuracy most of the time.

5.1. Comparisons with Other Work

Comparing our analysis with previous studies, we found that Mohan et al. [21] developed a heart disease prediction model using the HRFLM method; their model achieved 88.47% accuracy, 92.8% sensitivity, and 82.6% specificity on the UCI heart disease dataset, and they used all thirteen features. Amin et al. [22] predicted heart disease with 87.41% accuracy using Naive Bayes and logistic regression algorithms. A previous study [23] reported 56.76% accuracy using J48 with the reduced error pruning algorithm. More previous studies are shown in Table 12, where the overall accuracies are between 83.70% and 87.41%. Besides, no earlier study evaluated heart disease prediction in such detail, while in our study, a range of metrics (accuracy, sensitivity, specificity, AUROC, and log loss) is evaluated, and different feature selection algorithms are used to select important features, which also improves the performance of the algorithms.

6. Conclusion

In summary, we implemented different feature selection techniques to find the most significant features, which are highly valuable for heart disease prediction, and then applied six different machine learning algorithms to those selected features. Each algorithm achieved a different score with the different selected features; the performances of RF and LR were the most significant among all the algorithms. However, the amount of heart disease data available was not large enough for a better predictive model, and this experiment would be more accurate if the same analysis were performed on large, real-world patient data. In the future, more experiments will be performed with more efficient algorithms, such as deep learning algorithms, and more effective feature selection techniques to achieve better predictive performance.

Data Availability

The data are available by contacting the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Authors’ Contributions

K.A. and M.A.M. provided the idea and designed the experiments; N.B., M.M.A., M.A.R., M.R.M., M.I., and K.A. analyzed the data and wrote the manuscript. N.B., M.M.A., M.A.R. M.I., F.M.B., S.A., F.A.A., and M.R.M. helped perform the experimental analysis with constructive discussions. F.M.B. and F.A.A. supported the funding. All authors discussed the results and contributed to the manuscript.

Acknowledgments

The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through project number: IFP22UQU4170008DSR06 and also in part by funding from the Natural Sciences and Engineering Research Council of Canada (NSERC).