Abstract

Cardiovascular disease (CVD) is a life-threatening disease whose prevalence is rising considerably worldwide. Early detection and prediction of CVD and other heart diseases could save many lives. This requires careful analysis of clinical data. Predictive machine learning algorithms that support the doctor’s judgement are valuable to all stakeholders in the health sector because they can augment doctors’ efforts to provide a better environment for patient diagnosis and treatment. We used machine learning (ML) algorithms to support accurate prediction and decision making for CVD patients. Simple random sampling was used to select heart disease patients from the Khyber Teaching Hospital and Lady Reading Hospital, Pakistan. ML methods such as decision tree (DT), random forest (RF), logistic regression (LR), Naïve Bayes (NB), and support vector machine (SVM) were implemented for classification and prediction purposes for CVD patients in Pakistan. We performed exploratory analysis and experimental output analysis for all algorithms. We also estimated the confusion matrix and receiver operating characteristic (ROC) curve for all algorithms. The performance of the ML algorithms was evaluated under several criteria to identify the most suitable algorithm in the class of models. The RF algorithm had the highest accuracy of prediction, sensitivity, and area under the ROC curve of 85.01%, 92.11%, and 87.73%, respectively, for CVD. It also had the lowest specificity and misclassification errors of 43.48% and 8.70%, respectively, for CVD. These results indicated that the RF algorithm is the most appropriate algorithm for CVD classification and prediction. Our proposed model can be implemented in all settings worldwide in the health sector for disease classification and prediction.

1. Introduction

The heart is a major organ of the human or animal body and plays an essential role in the life of mammals. The heart pumps blood throughout the body, thereby supplying oxygen to all parts of the body and controlling blood pressure. The heart performs its function together with the nervous system and the endocrine system. The nervous system helps to control the heart rate, while the endocrine system sends hormones that influence blood pressure by causing the blood vessels to either constrict or relax. Depending on whether the body is at rest or under stress, the brain transmits signals telling the heart to beat more slowly or more quickly. In stressful situations, the heart beats faster than usual, which can lead to serious heart problems. Aside from stress, heart problems escalate with excessive alcohol consumption, smoking, and high fat intake [1, 2]. The rate of health hazards in humans rises as a function of unhealthy dietary habits, excessive stress, lack of good sleep, and lifestyle changes [2].

Cardiovascular disease (CVD) is one of the most noticeable heart diseases which has affected people of all ages. CVD is caused by excessive intake of alcohol, smoking, high blood pressure, high cholesterol level, poor diet, and family history [3]. Del Paoli et al. [4] showed that high blood pressure, unhealthy arguments, and alcohol are highly correlated with CVD. It has been proven that men are at a higher risk of CVD compared to women [5]. Age is one of the most significant factors for heart disease [6].

In addition to CVD, coronary disease, myocarditis, congenital heart disease, arrhythmias, cardiomyopathy, congestive heart failure, angina pectoris, and myocardial infarction have been classified as acute heart diseases. Each type of heart disease has its own symptoms. However, it is very difficult to distinguish these heart diseases because they share common high-risk factors such as cholesterol level, blood pressure, diabetes, abnormal pulse rate (PR), and many more [7].

The lack of physical fitness due to lifestyle changes may also lead to heart disease in all age groups. A survey reported that seventeen million people in recent years lost their lives due to heart failure [8]. The early detection of heart disease may save many lives, provided the patients take their treatment and medication seriously and on time [8]. The estimated global number of deaths from CVD in 2015 was 17.7 million, of which 7.4 million were due to coronary heart disease and 6.7 million to stroke. According to the World Health Organization (WHO), approximately 54% of deaths from non-communicable diseases in Pakistan are due to cardiovascular problems [9]. Although 17.3 million deaths were caused by heart disease in 2008, studies by the WHO in 2018 estimated deaths due to heart disease to be around 56.9 million globally [10].

Deep learning models like the backpropagation neural network (BNN) are highly effective for predicting diseases [11]. Likewise, machine learning classifiers like decision tree (DT), logistic regression (LR), random forest (RF), Naïve Bayes (NB), and support vector machine (SVM) have been observed to be equally effective in disease prediction [12, 13]. Soni et al. [14] used predictive data mining techniques for the prediction of cardiovascular disease and found the highest accuracy for the DT among a class of predictive machine learning models such as K-nearest neighbour algorithms, neural network classification, and Bayesian classification algorithms [15, 16].

Data mining techniques are very essential in effective healthcare delivery as they can assist in determining whether a patient has a disease or not in healthcare centres (hospitals or clinics). Additionally, it can be employed to rapidly and automatically diagnose people with diseases with great satisfaction [17]. The prediction approach of these techniques may enable all participants in making rational decisions, especially professionals who must make decisions about how to treat patients [18].

Hybrid machine learning models have been applied to predict heart diseases as well as perform optimum classification methods for prediction. Hybrid models give a better optimum output depending on the machine learning method implemented for the execution [8]. Similarly, random forest, decision trees, and hybrid algorithms have been used to predict diseases with high accuracy. The hybrid algorithms were found to have a high accuracy in the neighbourhood of 88.7% for the prediction of disease compared to other models [8].

Nyaga et al. [19] created CVD models by summarizing available information on aetiology, rates, treatment, covariates, and mortality prevalence arising from heart failure in sub-Saharan Africa. Prasad et al. [20] implemented procedures geared towards predicting heart problems by recapitulating recent studies that utilized artificial intelligence procedures. Wu et al. [21] initiated new CVD forecasting structures by incorporating several procedures in a single hybridized protocol. Their results validated the diagnostic accuracy obtained by combining approaches from all the methods.

In modern medicine, a lot of information on diseases is generated from numerous sources. These data need to be cleaned as quickly as possible with appropriate preprocessing techniques so that the required information can fast-track the diagnosis of diseases. This study seeks to develop and propose new methodologies that use machine learning algorithms to increase the accuracy of the detection of CVD. We investigated and predicted CVD using hybrid machine learning models and identified the optimum classification method for the predictions. Our models and approach can be applied in all hospital settings across the world for effective prediction and diagnosis of CVD and other heart diseases. We are hopeful that our suggested technique will be utilized for the detection and prediction of other diseases in general.

We discuss the materials and methods in the following section, followed by the results and discussion. The paper ends with the conclusions of the study.

2. Materials and Methods

2.1. Data

The data were collected from the two largest teaching hospitals, the Lady Reading Hospital (LRH) and the Khyber Teaching Hospital (KTH), in Khyber Pakhtunkhwa (KPK), one of the four provinces of Pakistan. Ethical approval for the inclusion of heart disease patients was sought from the Human Ethical Committees of the two teaching hospitals. The ethics approval certificate number for the Lady Reading Hospital is B371/12/07/2022, while that of the Khyber Teaching Hospital is A418/12/07/2022. A simple random sampling technique was employed to select the sample units included in the survey. The sample data consisted of a total of 518 randomly selected heart disease patients.

2.1.1. Variables in the Study

The CVD data included the individual outcome with the corresponding factors. The full dataset contained the following attributes: age, gender, height, weight, systolic, diastolic, cholesterol, glucose, smoke, alcohol intake, physical activity, cardiovascular disease, and body mass index (BMI). The response variable, CVD, was classified into two categories, “presence” and “absence.” Furthermore, the data were cleaned of noise, inconsistencies, and missing observations. We found a few missing observations in the data because some of the patients were discharged from the ward without any proper residential address or mobile/telephone numbers to trace them. As a result, it was very difficult to contact them. Since our analysis is based on complete data, we replaced the missing values using standard statistical imputation: the median for numerical variables and the mode for categorical variables. Thus, data cleaning was completed at the preprocessing stage using these statistical tools.
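The median/mode replacement described above can be sketched as follows. This is a minimal illustration in Python (the study itself used R), and the column values are hypothetical toy data, not the study data; the helper name `impute` is likewise an assumption made for this sketch.

```python
from statistics import median, mode

def impute(values, kind):
    """Replace None entries with the median (numeric) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if kind == "numeric" else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical toy columns with missing entries marked as None.
cholesterol = [180, None, 240, 200, None]
smoke = ["no", "yes", None, "no", "no"]

print(impute(cholesterol, "numeric"))   # [180, 200, 240, 200, 200]
print(impute(smoke, "categorical"))     # ['no', 'yes', 'no', 'no', 'no']
```

The fill value is computed from the observed entries only, so the imputed values do not shift the median or mode of the column.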

Different data mining techniques were utilized in association, classification, clustering, pattern evaluation, and prediction. In the methods section below, we have discussed the techniques extensively.

2.2. Methods
2.2.1. Classification

Classification is the process of categorizing a given set of data into classes. Classification can be performed for both structured and unstructured data. The procedure begins with predicting the class of the provided data points [22]. Common names for the classes include target, label, and category. Different statistical and mathematical procedures such as linear programming, decision trees, and neural networks are used for classification [23]. CVD detection can be treated as a classification problem because it has two categories, that is, a patient either has CVD or does not [24].

2.2.2. Decision Tree (DT) Algorithm

The decision tree (DT) is one of the most important predictive modelling and classification methods in learning algorithms that are widely used in practical approaches in supervised learning techniques [25, 26]. It utilizes algorithms that can detect different ways of splitting datasets based on numerous situations. In the classification tree, the response variable is considered a discrete set of values for tree models [26]. DT is a useful contemporary approach to solving decision-making challenges by building models that can be used for prediction through systematic analysis. Internal nodes of a DT indicate a test of the features, branches represent the result, and leaves reflect the decisions that are produced after further computation [27, 28]. We performed our DT as follows:
(I) Divide the dataset into two subdata, that is, training and testing datasets.
(II) In the initial stage, the entire training data are considered the root.
(III) Continuous values are discretized before the model building, whereas categorical values are preferable for feature values.
(IV) Establish subsets such that each subset includes data with the aforementioned feature attributes.
(V) Finally, steps I–IV are repeated for each subset until we get the tree leaves.

In the DT, the prediction of a record’s class label begins at the root. The value of the root attribute is compared with the corresponding attribute of the record. Based on this comparison, the branch corresponding to that value is followed to the next node [29–31].
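To make the splitting step concrete, the sketch below searches for the threshold on a single numeric feature that gives the purest pair of subsets. It assumes the Gini impurity as the splitting criterion, which the text does not specify, and it uses hypothetical toy values; it is written in Python for illustration, whereas the analysis was run in R.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) of the best split 'feature <= threshold'."""
    best = (None, float("inf"))
    # Skip the largest value: splitting there would leave the right subset empty.
    for t in sorted(set(feature))[:-1]:
        left = [y for x, y in zip(feature, labels) if x <= t]
        right = [y for x, y in zip(feature, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: age in years and CVD status (1 = presence, 0 = absence).
age = [35, 42, 50, 58, 61, 64]
cvd = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(age, cvd)
print(threshold, impurity)  # 50 0.0
```

Applying the same search recursively to each resulting subset, until the subsets are pure or too small, yields the tree leaves.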

2.2.3. Random Forest (RF) Algorithm

A random forest (RF) is a classifier consisting of a collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, 2, \ldots\}$, where the $\{\Theta_k\}$ are independent and identically distributed random vectors and each tree casts a unit vote for the most popular class at the input of the predictor, $x$ [32–35].

The RF is an ensemble learning approach for regression or classification that builds a large number of decision trees at training time. The average prediction of the individual trees is returned for regression, while in classification the RF output is the class selected by most trees. The RF algorithm developed by Ho [36] used a random subspace approach and was reintroduced as a technique for the implementation of a collection of tree predictors by Breiman [37]. RF uses bootstrapping to randomly draw each tree’s training sample from the original data. The observations left out of a bootstrap sample, called the out-of-bag (OOB) observations, are used to estimate the goodness of fit [37].

In the growing phase of the RF, classification and regression tree (CART) techniques are used to grow each tree by splitting the local training set at each node according to the value of one variable from a randomly selected subset of the predictor variables. Each tree is grown to the largest extent possible, without pruning. The bootstrapping and tree-growing phases require independent random inputs. We assumed that these inputs are independent and identically distributed among trees. In that manner, each tree can be viewed as independently sampled for a given training data [37, 38].

For prediction purposes, each terminal node of every tree in the forest is assigned to a class. The trees predict by voting, so that the forest returns the class with the maximum number of votes [39].
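The two RF ingredients described above, bootstrap sampling with out-of-bag observations and majority voting, can be sketched as follows. This is a minimal Python illustration under stated assumptions (the study used R); the function names are hypothetical, and no trees are actually fitted here.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; rows never drawn form the OOB set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

def majority_vote(votes):
    """Return the class with the most votes (ties go to the class seen first)."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_indices(10, rng)
print(len(in_bag), out_of_bag)

# Hypothetical votes from five trees for one patient (1 = CVD, 0 = no CVD).
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

Each tree would be trained on its own `in_bag` rows and evaluated on its `out_of_bag` rows, which is what allows the OOB error to be estimated without a separate test set.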

2.2.4. Logistic Regression (LR) Algorithm

The logistic regression (LR) model is well suited to a dichotomous categorical response variable [40]. In machine learning (ML), the LR model can be used for classification purposes [40, 41]. We used the LR model for the classification problem of identifying the cardiovascular-affected respondents. It is based on the idea of likelihood: observations are assigned to a discrete class according to the estimated class probability [42]. The logistic function is utilized for output transformation, so the LR hypothesis restricts the output to a range between 0 and 1. Consequently, linear functions cannot be used here because they can take values >1 or <0. We classified and predicted the CVD patients with the machine learning LR model [43] using the logistic function.
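For reference, the standard logistic function that maps a linear predictor onto the interval (0, 1) can be written as below. The notation ($\beta_0, \ldots, \beta_p$ for the coefficients and $x_1, \ldots, x_p$ for the predictors) is assumed here for illustration and is not taken from the study.

```latex
P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
```

Under this form, a patient would typically be assigned to the "presence" class when the estimated probability exceeds a chosen cut-off such as 0.5; the cut-off is also an assumption of this note.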

2.2.5. Naïve Bayes (NB) Algorithm

The Naïve Bayes (NB) method is a supervised learning approach that is based on the Bayes theorem. The NB machine learning method applies probabilistic techniques in solving classification problems [44]. The main assumption of the NB is that the predictors fitted in the probabilistic model are mutually independent (free from multicollinearity) [45]. Naïve Bayes classifiers are a family of classification algorithms built on the Bayes theorem. All members of the family follow the same guiding principle, namely that every pair of features being classified is independent of each other [46]. In our case, we used the NB classifier to classify heart disease patients according to the response variable, that is, whether or not they have CVD [44, 47].
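The independence assumption means that the score of each class is simply the class prior multiplied by one likelihood term per feature. The sketch below illustrates this for categorical features in Python (the study used R). The toy rows, the function names, and the add-one smoothing for binary features are all assumptions made for this illustration.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class frequencies of each feature value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (class, feature index)
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(y, j)][v] += 1
    return class_counts, value_counts

def predict_nb(row, class_counts, value_counts):
    """Pick the class maximizing prior times the product of feature likelihoods."""
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total  # class prior
        for j, v in enumerate(row):
            # Add-one smoothing, assuming each feature has two possible values.
            score *= (value_counts[(y, j)][v] + 1) / (n_y + 2)
        scores[y] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: (smoke, alcohol intake) and CVD status.
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]
labels = [1, 1, 0, 0]
model = train_nb(rows, labels)
print(predict_nb(("yes", "yes"), *model))  # 1
print(predict_nb(("no", "no"), *model))    # 0
```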

2.2.6. Support Vector Machine (SVM) Algorithm

Among the different classification techniques, the support vector machine (SVM) is well known for its discriminative power for classification. The SVM has been widely adopted in recent times because it is more efficient than most other pattern classification techniques [48]. It has numerous applications ranging from bioinformatics to automatic speech recognition as well as handwritten character recognition, with considerable success. Kim et al. [49] showed that the SVM displays exceptional performance in the classification for prognostic prediction of class III malocclusion. Based on [50], we give a brief mathematical account of the SVM below.

Assuming binary classification of our response variable, CVD, and linear separability of the training samples, we have the training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$, such that the design matrix $X$ belongs to the $d$-dimensional input space, and the response variable, CVD, is represented by $y_i$, which has a binary class in the vector $Y$ with $y_i \in \{-1, +1\}$ in our study. The appropriate discriminating equation is given by
$$Z^{\top} X + \beta = 0.$$

Here, Z is the vector that determines the orientation of the hyperplane (discriminating plane), X is the input vector, and β is the offset [48, 51, 52]. Infinitely many hyperplanes can correctly classify the training data, and any of them can be applied to the validation dataset. The optimal classifier identifies the hyperplane that generalizes best, namely the one that lies farthest from the nearest objects of each class [53]. The input set of coordinates is considered optimally separated by the hyperplane if it is separated without error and the distance between the hyperplane and the nearest points, the support vectors, is maximal, which identifies a specific hyperplane [53, 54].
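As a small illustration of how a linear separating hyperplane is used once Z and β are known, the Python sketch below classifies a point by the sign of the discriminant and reports the margin width. The numeric values are hypothetical, the class coding of +1 and -1 is an assumption, and the margin formula 2/||Z|| assumes the usual canonical scaling in which the support vectors satisfy |Z·x + β| = 1. Fitting Z and β is not shown.

```python
import math

def decision_value(z, x, beta):
    """Signed value of the discriminant Z.x + beta for one input vector x."""
    return sum(zi * xi for zi, xi in zip(z, x)) + beta

def classify(z, x, beta):
    """Assign +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if decision_value(z, x, beta) >= 0 else -1

def margin_width(z):
    """Width of the margin, 2 / ||Z||, under the canonical scaling."""
    return 2.0 / math.sqrt(sum(zi * zi for zi in z))

# Hypothetical hyperplane in two dimensions.
z, beta = [3.0, 4.0], -5.0
print(classify(z, [2.0, 1.0], beta))  # 1
print(classify(z, [0.0, 0.0], beta))  # -1
print(margin_width(z))                # 0.4
```

Maximizing the margin is therefore equivalent to minimizing the norm of Z subject to all training points being classified correctly.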

We used R version 4.1.2 for all our analyses.

3. Results and Discussion

The descriptive analysis of the attributes at the aggregate and age levels of the responses of all randomly selected patients with heart disease in the study is represented in Table 1. The table illustrates the numerical output of the cardiovascular disease-associated risk factors. Table 1 indicates the variability in the age proportion of the CVD-affected patients. The exploratory analysis revealed that almost 52.1% of the respondents had CVD at an aggregate level. Furthermore, there was a noticeable variation in the proportion of heart disease concerning different factors such as gender, physical activity, smoking, and so on that correlated with CVD. For instance, a maximum of 4.25% of 60-year-old patients were estimated to have CVD, whereas a maximum of 0.19% of 45-year-old patients had it.

Figure 1 shows the gender, cholesterol level, and glucose levels for all randomly selected CVD patients in the study. The figure shows that a greater proportion of the patients had CVD. Figure 2 presents a line graph for the proportion of gender with respect to the age of patients. The figure shows that CVD is predominant in males compared to females since a greater proportion of the males had the disease. Moreover, the proportion of CVD patients increases from forty years to sixty-one years, which confirms the result of Gulfam Ahmad and Jasim Shah [6].

To achieve our goal, we employed binary classifiers based on supervised machine learning algorithms to predict the appropriate class for each patient [55–57], as proposed by Ramesh et al. [58] and Boukhatem [42]. Table 2 shows the output of the predictive models that were used for the prediction of CVD.

All five ML algorithms (i.e., DT, SVM, NB, LR, and RF) were used to build the CVD prediction model in two different stages. In the initial stage, the data were split into two separate 70% and 30% groups for training and validation, respectively. In the second stage, the data were split into 75% and 25% for training and validation, respectively. The RF model had the highest accuracy of 85.01% with a 95% confidence interval of (0.6608, 0.8043), followed by DT with 83.72% accuracy and a 95% confidence interval of (0.654, 0.7986). The SVM and LR algorithms had the same accuracy of 83.08%, with a 95% confidence interval of (0.654, 0.7986). The NB had the lowest accuracy of 74.74% with a 95% confidence interval of (0.567, 0.7221). This shows that the RF algorithm is the best predictor of CVD patients. Our outcome confirms the results obtained by the authors in [6, 55–58].
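The splitting step and an accuracy confidence interval can be sketched as below, in Python for illustration (the analysis was run in R). The interval method used in the study is not stated in the text, so the Wilson score interval here is an assumption, as are the function names, the seed, and the count of correct predictions.

```python
import math
import random

def train_validation_split(n_rows, train_fraction, seed):
    """Shuffle row indices and cut them into training and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

# 518 patients, as in the study; 75%/25% split as in the second stage.
train_idx, valid_idx = train_validation_split(518, 0.75, seed=1)
print(len(train_idx), len(valid_idx))  # 388 130

# Hypothetical count of correct predictions on a validation set.
low, high = wilson_interval(correct=110, total=130)
print(round(low, 4), round(high, 4))
```

The width of the interval depends on the size of the validation set, which is why an accuracy estimated on roughly 130 to 156 patients carries an uncertainty of several percentage points in either direction.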

Sensitivity, mathematically defined as the ratio of the number of true-positive patients to the sum of the numbers of true-positive and false-negative patients, was used to find the proportion of patients with CVD who were correctly identified [59, 60]. Similarly, specificity describes the respondents who are not affected by cardiovascular disease. Specificity, mathematically defined as the ratio of the number of true negatives to the sum of the numbers of true-negative and false-positive patients [61], was used to determine the proportion of patients without CVD who were correctly identified [62]. The RF algorithm estimated sensitivity and specificity as 86.11% and 65.48%, respectively. That is, our algorithm correctly classified 86.11% of the patients with CVD as having CVD but failed to identify the remaining 13.89%. Similarly, it correctly classified 65.48% of the patients without CVD as not having CVD, while 34.52% of them were misclassified. Although the DT was not the best in terms of accuracy of prediction, it had the highest sensitivity (90.28%). Our results confirm those of Boukhatem et al. [63]. Figure 3 shows the visualization of all ML algorithm outputs, thereby confirming the superiority of the RF.
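The two definitions above translate directly into code. In the Python sketch below, the four counts are hypothetical values chosen only so that the resulting rates match the percentages quoted in the text; they are not the counts reported by the study.

```python
def sensitivity(tp, fn):
    """True-positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts, chosen to reproduce the quoted rates.
tp, fn, tn, fp = 62, 10, 55, 29
print(round(100 * sensitivity(tp, fn), 2))  # 86.11
print(round(100 * specificity(tn, fp), 2))  # 65.48
```

The complements of the two rates, 13.89% and 34.52%, are the shares of patients with and without CVD, respectively, that are misclassified.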

Table 3 presents the confusion matrix of the predictive models for the 25% validation data. The confusion matrix is used to evaluate the performance of a classification algorithm by comparing the actual target values of the response variable, CVD, with the values predicted by the machine learning model. As expected, the RF had the best performance on all evaluation metrics derived from the confusion matrix. The confusion matrix also provides the misclassification error rates for all our ML algorithms. The misclassification error rates for the respondents who are affected were 0.087, 0.1228, 0.1719, 0.1778, and 0.1818 for the RF, DT, SVM, NB, and LR, respectively, in decreasing order of performance. Thus, the RF performed the best among all competing algorithms, while the LR had the poorest performance among them. Our results are similar to those obtained by O’Kelly et al. [64].
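A confusion matrix is obtained by tallying each pair of actual and predicted labels. The Python sketch below does this for hypothetical toy labels and computes the overall misclassification rate, defined here as the share of all cases that are predicted incorrectly; whether the study used this overall rate or a class-specific one is not clear from the text, so this definition is an assumption.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive=1):
    """Tally true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return counts

def misclassification_rate(counts):
    """Share of all cases that were predicted incorrectly."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total

# Hypothetical toy labels (1 = CVD present, 0 = CVD absent).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(dict(counts))
print(misclassification_rate(counts))  # 0.25
```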

Furthermore, the receiver operating characteristic (ROC) curve was used to visualize the accuracy. The ROC curve summarizes the performance of a classification algorithm by plotting the true-positive rate against the corresponding false-positive rate, thereby highlighting the sensitivity and specificity of the classifier. Figure 4 shows the ROC curves for the different classifiers.

The ROC curves also indicate that the RF algorithm’s performance is the best among all the ML algorithms considered. The area under the ROC curve ranges from 0 to 1, where a value near 0 indicates a poor classifier, whereas a value near 1 indicates a more capable one. The ROC value is 0.8737 for the RF algorithm, which signifies good prediction and classification. The highest ROC value, obtained for the RF algorithm, implies a better ability to discriminate between the classes, while the highest accuracy signifies good overall predictive performance, just as in [15, 42, 56].
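The single ROC value quoted for each classifier is commonly the area under the ROC curve. One way to compute it, sketched below in Python with hypothetical scores, uses the equivalent rank interpretation: the probability that a randomly chosen patient with CVD receives a higher predicted score than a randomly chosen patient without CVD. The function name and the toy data are assumptions of this sketch.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of CVD for six patients.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(round(roc_auc(labels, scores), 4))  # 0.8889
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.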

4. Conclusion

Heart diseases are considered a significant concern in medical data analysis. Predictive machine learning algorithms that support the doctor’s judgement are valuable to all stakeholders in the health sector because they can augment doctors’ efforts to provide a better environment for patient diagnosis and treatment. This study investigated the performance of predictive ML algorithms for CVD patients. CVD is one of the leading causes of mortality worldwide. We used data from the Lady Reading Hospital and the Khyber Teaching Hospital in Khyber Pakhtunkhwa Province, Pakistan. Ethical approval for the inclusion of heart disease patients was sought from the Human Ethical Committees of the two teaching hospitals. Five machine learning algorithms (i.e., DT, RF, LR, NB, and SVM) were implemented for the classification and prediction of CVD. We performed exploratory analysis and experimental output analysis for all algorithms. We also estimated the confusion matrix and receiver operating characteristic (ROC) curve for all algorithms. The performance of the ML algorithms was evaluated under several criteria to identify the most suitable algorithm in the class of models. The RF algorithm had the highest accuracy of prediction, sensitivity, and area under the ROC curve of 85.01%, 92.11%, and 87.73%, respectively, for CVD. It also had the lowest specificity and misclassification errors of 43.48% and 8.70%, respectively, for CVD. These results indicated that the RF algorithm is the most appropriate for CVD classification and prediction. Our proposed model can be implemented in all settings worldwide in the health sector for disease classification and prediction. It can also be implemented in other sectors with a similar function. The main limitation of the study is that detailed patient data and clinical datasets from across the globe may be required to build more powerful prediction models.
For improving the accuracy of the ML models and algorithm, high-dimensional data would be more suitable. The ML algorithms used are limited to heart disease prediction studies. Future studies should look into exploring other ML techniques in selecting significant characteristics.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Arsalan Khan, Moiz Qureshi, Muhammad Daniyal, and Kassim Tawiah were responsible for conceptualization, methodology, validation, and visualization. Arsalan Khan, Moiz Qureshi, and Muhammad Daniyal were responsible for data curation, formal analysis, and original draft preparation. Kassim Tawiah and Muhammad Daniyal were responsible for review and editing.

Acknowledgments

We are grateful to the authorities of the Lady Reading Hospital and the Khyber Teaching Hospital in Khyber Pakhtunkhwa (KPK) Province, Pakistan, for the opportunity to conduct the study and providing us with the ethical approval certificate and waiving the consent. We appreciate all participants for taking time to contribute to this study.