Abstract
Heart diseases are a leading cause of death worldwide and have attracted considerable interest in the scientific community. Because of the high number of sudden deaths associated with them, early detection is critical. This study proposes a boosting Support Vector Machine (SVM) technique as the backbone of computer-aided diagnostic tools for forecasting heart disease risk levels more accurately. The dataset, which contains 13 attributes such as gender, age, blood pressure, and chest pain, is taken from the Cleveland Clinic. In total, there were 303 records, 6 of which had missing values. To clean the data, we deleted the 6 incomplete records using listwise deletion. Given the size of the data and the fact that the deleted records form a purely random subset, this approach had no significant effect on the experiment because it introduced no bias. Salient features are selected using the boosting technique to speed up training and improve accuracy. Using the train/test split approach, the data is then partitioned into training and testing sets. SVM is then used to train and test the data. The C parameter is set at 0.05 and the linear kernel function is used. Logistic regression, Naïve Bayes, decision trees, Multilayer Perceptron, and random forest were used to compare the results. The proposed boosting SVM performed exceptionally well, making it a better tool than the existing techniques.
1. Introduction
Heart disease refers to a variety of conditions that affect the heart, from infections to genetic defects and blood-vessel diseases. These conditions are among the top causes of death globally for all races. In 2016, about 28.2 million adults in the United States were diagnosed with heart disease [1], and in 2015 nearly 634,000 people died of it [2], making it the foremost cause of death. According to the American Heart Association, a nonprofit organization that funds cardiovascular medical research, one American has a heart attack every 40 seconds [3]. Per the data, there are 720,000 new heart attacks and 335,000 recurrent attacks in the United States each year. The pattern of heart or cardiovascular disease (CVD) related morbidity and mortality has been particularly striking in Sub-Saharan Africa, an area thought to have the world’s youngest population. Sub-Saharan Africa remained the only region in the world where heart-disease-related fatalities increased between 1990 and 2013 [4]. The World Health Organization (WHO), for example, has listed heart disease as one of the top two causes of death in Ghana, after diarrheal infections [5]. In 2008, heart disease was the leading cause of death in Ghana among all noncommunicable diseases (NCDs) and the major cause of institutional deaths, accounting for 14.5 percent of all deaths reported [6].
Traditionally, a patient’s knowledge of the status of his or her heart condition depended on the doctor’s judgment. Before ordering any test, the doctor will usually perform a few physical checks and question the patient about his or her medical history, regardless of the severity of the cardiac problem. In addition to blood tests and chest X-rays, a heart disease diagnosis may involve an electrocardiogram (ECG), which records electrical signals that help reveal anomalies in the heart’s rhythm and structure. Holter monitoring, echocardiogram, stress test, cardiac catheterization, cardiac computerized tomography (CT) scan, and cardiac magnetic resonance imaging (MRI) are some of the other tests. A Holter monitor is a small, wearable device that captures an ECG over a 24- to 72-hour period. Holter monitoring detects heart rhythm abnormalities that are not noticeable on a standard ECG. The echocardiogram uses ultrasound of the chest to produce detailed images of the heart’s structure and function. A stress test, often known as a treadmill test or an exercise test, is used by doctors to determine how well the patient’s heart can endure workload. For this test, the patient engages in some physical activity or takes drugs to raise the heart rate. After that, the examination is performed and images of the heart are taken to assess its condition. When a patient asks a doctor whether he or she has heart disease, the standard procedure is for the doctor to assess the likelihood based on risk factors. Age, diabetes, smoking, high blood pressure, being male, and cholesterol are all significant risk factors. According to previous studies, nearly half of those who had coronary attacks had two risk factors: being male and being over 60 [7]. It is therefore encouraging that technology has made early diagnosis and risk assessment straightforward before people develop the disease.
Owing to the increased risk of heart disease and the fact that current research points toward computer-assisted diagnosis, this study makes two contributions to the problem. First, we offer an improved algorithm that enhances diagnosis; second, we show how the proposed method outperforms earlier proposed techniques by demonstrating a real implementation of the technique. Tables 1, 2, 3, and 4 and Figure 1 show that the suggested method outperforms earlier proposed methods. The remaining part of the study is structured as follows: previous related studies and their challenges are presented in Section 2. The proposed technique, the data preprocessing, and the previous algorithms employed to solve the problem are discussed in Section 3. The results of the study are discussed in Section 4. The conclusions are drawn in Section 5.
2. Related Studies
Several methods have been used to predict the risk of developing heart disease. Genetic algorithms, for example, have been used in a variety of applications. According to [8], the neuro-fuzzy system combines neuro-adaptive learning and fuzzy logic reasoning for the prediction of the heart disease risk level. Genetic algorithms are generally used for weight optimization when training the model, but they have a serious drawback: they do not guarantee an optimal solution; hence, the weight optimization may not be completely accurate. In comparison to SVM, Naive Bayes, decision tree, and random forest, genetic algorithms are more complicated to implement and require a large number of parameters to be set in order to achieve a result that is close to optimal. As a result, for small datasets like the Cleveland dataset utilized in this investigation, the genetic algorithm is not appropriate.
The Iterative Dichotomiser 3 (ID3) algorithm, a type of decision tree building algorithm [9], is a relatively simple algorithm that has proven to be effective in other areas but has the drawback of only handling categorical data, so it cannot be applied directly to the Cleveland dataset, which is also affected by missing values. If the sample data tested is small, this approach is prone to overfitting. As a result, it cannot be used for this research.
Deep neural networks [10], which have shown greater performance in prediction, were also excluded from this study because what is learned with deep neural nets is difficult to interpret. Furthermore, because learning is progressive, deep neural nets require a large amount of data to train the learning algorithms [11]. When compared to random forest, logistic regression, Naïve Bayes, neural networks, and decision trees, the proposed boosting SVM algorithm utilized in this study performed well. On small datasets, these solution approaches are among the best-performing algorithms, and they are also a lot easier to understand.
Miranda et al. [12] used the Naive Bayes algorithm to forecast this health concern and looked at the related risk levels for adults in their study. In that study, blood and urine test results from the clinical laboratory were used as training datasets. The difficulty with that study is that the authors did not explore ECG and echocardiography analysis, both of which are crucial in detecting cardiovascular diseases, and the accuracy of 80% obtained is comparatively poor. Moreover, since all the attributes in Naive Bayes are assumed to be mutually independent, using this predictor for heart disease is challenging because finding a collection of predictors that are totally independent of one another is extremely difficult in real life.
In addition, neural networks are widely employed [13, 16]. To predict cardiovascular heart disease, Nandy et al. [14] employed a swarm-artificial neural network. The goal of the research was to increase accuracy. While the study’s findings were promising, the accuracy of 95.78% leaves room for improvement, especially when compared to the method we propose. Sayad and Halkarnikar [17] proposed a data mining and artificial neural network-based detection approach for cardiac disease. A multilayer perceptron neural network (MLPNN) and a backpropagation algorithm were used in this investigation. The residual dataset was separated into two parts after preprocessing. The MLPNN with backpropagation approach had a 92% accuracy, which is below average. Kim and Kang [18] developed a neural network-based technique for predicting the risk of heart disease using the Korea National Health and Nutritional Examination Survey (KNHANES-VI) dataset [19]. This method consists of two steps. A feature sensitivity-based feature selection is the first phase, followed by a neural network-based prediction model. Of 4146 people, 3031 were judged to be at low risk, whereas 1115 were found to be at high risk. Dutta et al. [20] suggested a convolutional neural network for predicting heart disease by classifying clinical data that was highly class-imbalanced. The study’s findings, on the other hand, were not encouraging.
While neural networks are gaining popularity and appear to be realistic, they suffer from data overfitting and high time complexity. When dimensionality is low, neural networks may also fail to converge.
For the same purpose, the random forest has been employed in various investigations [21]. Javeed et al. [22] used the Cleveland dataset to construct a random search algorithm (RSA) for feature selection and a random forest model for heart failure prediction. To improve the suggested diagnostic system, the grid search method was applied. Two types of testing were conducted to determine the accuracy of the proposed approach. The first trial only builds a random forest model, whereas the second trial builds the specified RSA-based random forest model. The proposed method has a classification accuracy of 93.33%, which leaves room for improvement. Jabbar et al. [23] also proposed a random forest-based classification with feature selection by chi-square and genetic algorithm to predict the risk of heart disease on the Cleveland dataset. The proposed technique outperformed other methods such as Naïve Bayes, decision tree, and neural networks. However, the study’s accuracy was only 84%, which limits its suitability for actual deployment. Decision tree prediction for heart disease has also been proposed [24, 25]. Decision trees, on the other hand, do not work well with the missing attributes in the Cleveland dataset unless these are treated with considerable care, making the outcome inaccurate. The use of logistic regression techniques in the prediction of cardiac disorders is very common. For example, Soleimani and Neshati [26] utilized three logistic regression models with 28 features to predict heart disease risk using data from 711 patients with factors such as severe chest pain, back pain, cold chills, shortness of breath, nausea, and vomiting. However, the study’s accuracy of 94.9% was not particularly noteworthy.
The Support Vector Machine (SVM) has also become highly popular. The SVM with sequential minimal optimization strategies was investigated in 2015, with prediction accuracies ranging from 82% to 90%, which was not promising. However, new research into SVM algorithms is yielding better results. Harimoorthy and Thangavelu [27], for example, recently used an SVM with a radial basis kernel in RStudio to predict heart disease with 98.7% accuracy.
Based on the favorable results with SVM, we were encouraged to do further examination to improve the technique in the proposed study.
3. Materials and Methods
3.1. Datasets Description
The Cleveland dataset was used in this study. It is a Cleveland Clinic Foundation dataset containing 14 variables related to patients’ vital signs in relation to heart disease. Thirteen of the fourteen attributes are used as predictor variables, and the remaining attribute is used as the target (predicted class). The 13 predictor variables are sex, age, type of chest pain, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiography, maximum heart rate, exercise-induced angina, ST depression, slope of the ST segment, number of vessels colored by fluoroscopy, and thallium test result; the target is the diagnosis. There were 303 records in total, 6 of which had missing values. The 303 records were reduced to 297 by deleting the 6 incomplete tuples through listwise deletion. Given the size of the data and the fact that the deleted records form a purely random subset, this had no significant effect on the rest of the data used for the experiment because it introduced no bias. Table 5 contains descriptions of the dataset.
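The listwise deletion described above can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the records, field names, and values are hypothetical stand-ins for Cleveland rows, and `None` is assumed to mark a missing value.

```python
# Hypothetical miniature records (not real Cleveland rows); None marks a
# missing value, as in the 6 incomplete tuples of the 303-record file.
records = [
    {"age": 63, "sex": 1, "ca": 0, "thal": 6, "target": 0},
    {"age": 67, "sex": 1, "ca": 3, "thal": 3, "target": 1},
    {"age": 52, "sex": 1, "ca": None, "thal": 7, "target": 0},  # missing ca
    {"age": 38, "sex": 1, "ca": 0, "thal": None, "target": 1},  # missing thal
]


def listwise_delete(rows):
    """Keep only the rows in which every field is present."""
    return [row for row in rows if all(v is not None for v in row.values())]


clean = listwise_delete(records)
print(len(records), "->", len(clean))
```

Listwise deletion drops a whole record as soon as one attribute is missing, which is why it is only harmless when, as assumed in the text, the incomplete records are a random subset.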
3.2. The Proposed Framework
The proposed framework for the study is shown in Figure 2.
The framework demonstrates the whole methodology of the proposed technique. The explanations are as follows.
3.3. Feature Importance Estimation
The feature importance score assigns a numerical value to each data feature; the higher the score, the more significant the feature is to the output variable. We extracted the top features of the dataset using the Extra Tree Classifier. Within a single decision tree, the importance of an attribute is the amount by which each of its split points improves the performance measure, weighted by the number of observations the node is responsible for. The purity (Gini index) was used to choose the split points. The importance of each attribute is then summed across all decision trees in the model. The Gini index in Algorithm 1 is presented as follows:

The entire method is developed with the goal of maximizing purity in each split. Purity is defined in (1) as the degree to which the groupings are homogeneous:

$$G = 1 - \sum_{j} p_j^{2} \qquad (1)$$

where $p_j$ is the probability of an object being classified to the class with label $j$. Figure 3 shows the degree of importance of each feature.
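To make the weighting by node size concrete, the sketch below computes the Gini impurity of a node and the weighted impurity decrease of one split, which is the quantity summed per attribute to obtain an importance score. It is an illustration under stated assumptions: the function names and toy labels are hypothetical, and the study itself relied on the Extra Tree Classifier rather than this code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_j p_j^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Impurity reduction of one split, weighted by the node's share of samples."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


# Toy node: a split that separates the two classes perfectly.
parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1], [0, 0, 0]
score = impurity_decrease(parent, left, right, n_total=len(parent))
print(score)
```

A balanced two-class node has impurity 0.5, and a perfect split removes all of it, so the attribute used for this split would be credited with 0.5.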
3.4. Feature Correlation Matrix
Correlation describes how features are related to one another. A heatmap makes it simple to see which features are most closely associated with the target variable. Using the seaborn library, we created a heatmap of correlated features. Pearson’s correlation coefficient was used in this study. This coefficient measures the strength of the linear relationship between two numerical sequences and ranges from −1 to +1. We plotted Pearson’s heatmap to see the correlation of the independent variables. With AdaBoost as the feature selection algorithm, only features whose correlation exceeds 0.5 in absolute value were selected. The seaborn functions automatically perform the statistical estimation required to complete the operation. The cells in deep blue in Figure 4 show the highest correlations, namely, maximum heart rate with age and ST depression with maximum heart rate, indicating that both “age” and “max. heart rate” will play a significant role in predicting heart disease.
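The thresholding on Pearson's coefficient can be illustrated with a short sketch. It assumes the 0.5 cut-off is applied to the absolute correlation between each feature and the target, which the text does not state explicitly; the feature values below are hypothetical, and the study used seaborn for the heatmap rather than this hand-written function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)  # undefined for constant sequences


# Hypothetical feature columns and target (not real Cleveland values).
features = {
    "age": [63, 67, 41, 56, 62, 57],
    "max_heart_rate": [150, 108, 172, 178, 160, 148],
    "chol": [233, 286, 204, 236, 268, 354],
}
target = [0, 1, 0, 0, 1, 0]

# Keep features whose absolute correlation with the target exceeds 0.5.
selected = [name for name, column in features.items()
            if abs(pearson(column, target)) > 0.5]
print(selected)
```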
3.5. Boosting SVM Classification
Boosting is an ensemble meta-algorithm that, in essence, reduces bias in machine learning algorithms and upgrades weak learners to strong learners. The goal of the boosting strategy is to enhance prediction accuracy. The following is a description of the adaptive boosting algorithm that was used:
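Let the training set consist of positive and negative samples, and let each sample be $(x_i, y_i)$, where $y_i \in \{+1, -1\}$ represents the corresponding class label. Let $a$ and $b$ denote the numbers of positive and negative samples. The feature selection algorithm is formulated as follows:
Step 1: initialize the sample distribution by weighting every training sample equally within its class, such that the initial weights become $w_{1,i} = 1/(2a)$ and $w_{1,i} = 1/(2b)$ for $y_i = +1$ and $y_i = -1$, respectively. For each iteration $t = 1, \dots, T$, where $T$ is the final iteration, execute the following.
Step 2: normalize the weights, $w_{t,i} \leftarrow w_{t,i} / \sum_{k} w_{t,k}$, so that $w_t$ is a probability distribution.
Step 3: for each feature $j = 1, \dots, N$, where $N$ is the total number of features, train a weak classifier $h_j$ that uses that single feature. The training error is estimated with respect to $w_t$ as stated in the following equation:

$$\varepsilon_j = \sum_{i} w_{t,i}\, \mathbb{1}\left[h_j(x_i) \neq y_i\right]$$

Step 4: select the hypothesis $h_t$ with the most discriminating information, that is to say, the hypothesis with the least classification error $\varepsilon_t$ on the weighted samples.
Step 5: compute the weight $\alpha_t$ that weights $h_t$ by its classification performance as in the following equation:

$$\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$$

Step 6: the weight distribution is then updated and normalized with the following equation:

$$w_{t+1,i} = \frac{w_{t,i}\, \exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{\sum_{k} w_{t,k}\, \exp\!\left(-\alpha_t\, y_k\, h_t(x_k)\right)}$$

Step 7: the final feature selection hypothesis $H(S)$, which is a function of the selected features $S$, is denoted by the following equation:

$$H(S) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)$$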
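The adaptive boosting steps can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' implementation: it assumes the standard AdaBoost form with labels in {+1, −1}, uses single-feature threshold stumps as the weak classifiers, and runs on hypothetical toy data. The names `stump_predict`, `adaboost`, and `predict` are introduced here for illustration only.

```python
import math


def stump_predict(x, feature, threshold, polarity):
    """Single-feature weak classifier: returns +1 or -1."""
    return polarity if x[feature] > threshold else -polarity


def adaboost(X, y, rounds):
    """Return a list of (alpha, feature, threshold, polarity) tuples."""
    pos = sum(1 for label in y if label == 1)
    neg = len(y) - pos
    # Step 1: equal weights within each class.
    w = [1 / (2 * pos) if label == 1 else 1 / (2 * neg) for label in y]
    ensemble = []
    for _ in range(rounds):
        total = sum(w)
        w = [wi / total for wi in w]  # Step 2: normalize
        best = None
        for j in range(len(X[0])):  # Step 3: one stump per feature/threshold
            for threshold in sorted({x[j] for x in X}):
                for polarity in (1, -1):
                    err = sum(wi for wi, xi, yi in zip(w, X, y)
                              if stump_predict(xi, j, threshold, polarity) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, threshold, polarity)  # Step 4
        err, j, threshold, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # Step 5
        ensemble.append((alpha, j, threshold, polarity))
        # Step 6: raise the weight of misclassified samples.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, j, threshold, polarity))
             for wi, xi, yi in zip(w, X, y)]
    return ensemble


def predict(ensemble, x):
    """Step 7: sign of the alpha-weighted vote."""
    score = sum(a * stump_predict(x, j, t, p) for a, j, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 1.0], [5.0, 6.0], [6.0, 2.0]]
y = [-1, -1, -1, 1, 1, 1]
model = adaboost(X, y, rounds=3)
selected_features = sorted({j for _, j, _, _ in model})
print(selected_features)
```

The indices collected in `selected_features` are the features used by the chosen weak classifiers, which is the sense in which boosting acts as a feature selector here.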
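The input is the Cleveland training dataset, represented by $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{+1, -1\}$, where $a$ records have $y_i = +1$ and $b$ records have $y_i = -1$. The $b$ records represent the 0 class of the dataset. The $x_i$ are the feature vectors selected by the AdaBoost algorithm and the $y_i$ are their labels. Finding the maximal-margin separating hyperplane becomes the optimization problem shown in the following equation:

$$\min_{w,\, w_0,\, \xi}\ \frac{1}{2}\lVert w \rVert^{2} + c \sum_{i=1}^{n} \xi_i$$

subject to the constraints in the following equation:

$$y_i\,(w \cdot x_i + w_0) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n$$

Here $c$ is the regularization parameter and the bias is written $w_0$ to avoid a clash with the count $b$. Since $w \cdot x + w_0 = 0$ and $k\,(w \cdot x + w_0) = 0$ define the same plane for any $k \neq 0$, $w$ and $w_0$ can be scaled so that $w \cdot x^{+} + w_0 = +1$ and $w \cdot x^{-} + w_0 = -1$, where $x^{+}$ and $x^{-}$ are the respective positive and negative support vectors. The margin is then denoted by the following equation:

$$\text{margin} = \frac{2}{\lVert w \rVert}$$

The optimal plane is found by solving the convex quadratic programming problem in the following equation:

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le c$$

for $i = 1, \dots, n$. The decision boundary of the classifier is expressed as the sum over the support vectors in the following equation:

$$f(x) = \operatorname{sign}\!\left(\sum_{i} \alpha_i\, y_i\, (x_i \cdot x) + w_0\right)$$

where $x_i$ is the support vector data, $\alpha_i$ is the Lagrange multiplier, and $y_i$ is the label of the membership class with $y_i \in \{+1, -1\}$. The product $x_i \cdot x$ represents a linear kernel function, given by the following equation:

$$K(x_i, x) = x_i \cdot x$$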
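In general, a kernel function corresponds to a transformation $\phi$ of the original data space into a new space, usually of higher dimension, with $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$; the reason is to make the transformed data more easily separable. For the linear kernel used here, $\phi$ is the identity, so the dot product is taken directly in the original feature space.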
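The decision rule expressed as a sum over the support vectors with a linear kernel can be evaluated directly, as in the sketch below. The support vectors, multipliers, and bias are hypothetical values chosen so that the toy hyperplane is easy to verify by hand; they correspond to a hard-margin solution and are therefore not bounded by the C = 0.05 used in the study.

```python
def linear_kernel(u, v):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def decide(x, support_vectors, labels, alphas, bias):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + bias."""
    total = sum(alpha * label * linear_kernel(sv, x)
                for sv, label, alpha in zip(support_vectors, labels, alphas))
    return 1 if total + bias >= 0 else -1


# Hypothetical two-support-vector solution: the hyperplane is
# 0.5 * x1 + 0.5 * x2 - 2 = 0, with margins passing through (1, 1) and (3, 3).
support_vectors = [(1.0, 1.0), (3.0, 3.0)]
labels = [-1, 1]
alphas = [0.25, 0.25]
bias = -2.0

print(decide((4.0, 4.0), support_vectors, labels, alphas, bias))
print(decide((0.0, 0.0), support_vectors, labels, alphas, bias))
```

Only the support vectors enter the sum, which is why a trained SVM can discard the remaining training records at prediction time.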
3.6. Model Evaluation Metrics
An important component of the study is to assess the performance of the proposed method. This is accomplished by comparing the performance of the proposed technique to that of some standard techniques using some acceptable measures. The confusion matrix, classification report, Receiver Operating Characteristic (ROC) curve, and Area under the Curve (AUC) data were used to evaluate the model’s performance. The model’s test and training accuracies must also be assessed.
3.6.1. Receiver Operating Characteristic Curve
A Receiver Operating Characteristic curve is a graph that depicts a classification model’s performance over all classification thresholds. The curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), defined in the following equations:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

where TP, FP, FN, and TN represent true positives, false positives, false negatives, and true negatives, respectively.
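As an illustration of how the TPR and FPR trace out the curve, the sketch below computes one ROC point per threshold. The labels and scores are hypothetical, and the function name `roc_point` is introduced here only for the example.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)


# Hypothetical labels (1 = disease) and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Sweeping the threshold from high to low moves the point from (0, 0) to (1, 1).
curve = [roc_point(y_true, scores, t) for t in (1.1, 0.75, 0.5, 0.35, 0.0)]
print(curve)
```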
3.6.2. Area under the Curve
The Area under the Curve (AUC) is the best-known quantitative index for describing the accuracy of a classifier.
The AUC is computed as follows:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\ d(\mathrm{FPR})$$
Generally, an area of 1 means a perfect test and area of 0.5 represents a worthless test. The general acceptable interpretation of AUC values is displayed in Table 6.
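In practice the area is usually approximated from a finite set of ROC points. The sketch below uses the trapezoidal rule, which is an assumption about the numerical method rather than something stated in the study; the points are hypothetical.

```python
def auc(fpr, tpr):
    """Trapezoidal area under ROC points sorted by increasing FPR."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area


perfect = auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # perfect test
chance = auc([0.0, 1.0], [0.0, 1.0])             # diagonal, worthless test
example = auc([0.0, 0.0, 0.5, 1.0], [0.0, 0.5, 1.0, 1.0])
print(perfect, chance, example)
```

The first two calls reproduce the reference values quoted above: an area of 1 for a perfect test and 0.5 for a test that is no better than chance.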
3.7. Comparative Algorithms
3.7.1. Comparing SVM with Boosted SVM
A preliminary experiment was conducted using the Support Vector Machine (SVM) and the boosted SVM with the same linear kernel function to determine whether the proposed boosted SVM has significant advantages over the traditional SVM. The traditional SVM achieved training and testing accuracies of 86.83% and 83.41%, against 99.92% and 99.75% for the boosting SVM. This result is statistically significant. Thus, we follow up by comparing the results of the proposed method against logistic regression, Naïve Bayes, decision tree, Multilayer Perceptron, and random forest, which are extensively used in this domain.
3.7.2. Logistic Regression
Logistic regression is an appropriate regression analysis to use when the dependent variable or response variable is binary [28]. It works by combining the input variables (X) in a linear form and using coefficients to predict an output variable (Y), which is a binary value of 0 or 1. The logistic regression technique models the chance of an outcome based on the individual characteristics or input variables (X). It is represented mathematically as follows:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$

where $p$ indicates the probability of an event, $\beta_0, \dots, \beta_k$ represent the regression coefficients associated with the variables, estimated via maximum likelihood, and $x_1, \dots, x_k$ indicate the input variables.
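The mapping from a linear combination of inputs to a probability can be sketched as follows. The coefficients and inputs are hypothetical placeholders, not fitted values from the study.

```python
import math


def predict_proba(x, intercept, coefficients):
    """Logistic model: probability that the outcome is 1 for input vector x."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two standardized inputs.
intercept, coefficients = -1.0, [2.0, 0.5]
p_low = predict_proba([0.0, 0.0], intercept, coefficients)
p_high = predict_proba([1.5, 1.0], intercept, coefficients)
label = 1 if p_high >= 0.5 else 0  # threshold the probability to get Y
print(p_low, p_high, label)
```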
3.7.3. Naïve Bayes
A Naïve Bayes classifier is a simple probabilistic classifier based on the application of Bayes’ theorem with strong (naïve) independence assumptions [29]. A Naïve Bayes classifier can be trained very efficiently in the context of supervised learning. The Bayesian rule is given in the following equation:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}$$

Here, $P(H \mid X)$ is a conditional probability, that is, the likelihood of event H occurring given that X is true, and $P(X \mid H)$ is the likelihood of X given H. $P(X)$ and $P(H)$ are the probabilities of observing X and H independently of each other.
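A short worked example of the Bayesian rule is given below. The prior and likelihoods are hypothetical numbers chosen for illustration, and the evidence P(X) is expanded with the law of total probability.

```python
def posterior(prior_h, likelihood_x_given_h, likelihood_x_given_not_h):
    """P(H | X) from Bayes' rule, with P(X) from the law of total probability."""
    evidence = (likelihood_x_given_h * prior_h
                + likelihood_x_given_not_h * (1.0 - prior_h))
    return likelihood_x_given_h * prior_h / evidence


# Hypothetical values: H = "patient has heart disease", X = "symptom observed".
p = posterior(prior_h=0.3, likelihood_x_given_h=0.8, likelihood_x_given_not_h=0.2)
print(round(p, 4))
```

With these inputs the posterior is 0.24 / 0.38, so observing the symptom roughly doubles the probability assigned to H compared with the prior of 0.3.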
3.7.4. Decision Tree
The decision tree used for comparison with the proposed method selects its splits using the Gini impurity, which evaluates the chance of a randomly chosen item being incorrectly classified. The related term “information gain” refers to the process of determining which characteristic or attribute provides the most information about a class. The Gini impurity is calculated by summing, over the classes, the probability $p_i$ of the class with label $i$ times the probability $1 - p_i$ of a mistake in categorizing that item. The computation is given in the following equation:

$$G = \sum_{i} p_i\,(1 - p_i) = 1 - \sum_{i} p_i^{2}$$

where $p_i$ is the probability of an object being classified to a particular class.
3.7.5. Multilayer Perceptron
The Multilayer Perceptron (MLP) network is trained using backpropagation [30], which uses data to adjust the network’s weights and thresholds to minimize the error in its predictions on the training set. First, each unit computes the total weighted input $x_i$ using the following equation:

$$x_i = \sum_{j} y_j\, w_{ij}$$

where $y_j$ is the activity level of the jth unit in the previous layer and $w_{ij}$ is the weight of the connection between the ith and the jth unit. Next, the unit calculates its activity $y_i$ using the sigmoid function, $y_i = 1 / (1 + e^{-x_i})$.
3.7.6. Random Forest
The training algorithm used is bootstrap aggregating (bagging) of trees. This creates an ensemble of trees in which multiple training sets are generated by sampling with replacement, meaning that a data instance can be repeated. The algorithm is represented as follows.
Given a training set $X = x_1, \dots, x_n$ with responses $Y = y_1, \dots, y_n$, bagging repeatedly (B times) selects a random sample of the training set and fits trees to these samples:
For $b = 1, \dots, B$:
(i) Sample, with replacement, n training examples from X, Y; call these $X_b, Y_b$.
(ii) Train a classification tree $f_b$ on $X_b, Y_b$.
When training is done, predictions for an unseen sample $x'$ are made by averaging the predictions from all the individual trees on $x'$ (for classification trees, by taking the majority vote), as stated in the following equation:

$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$
The process above depicts the original tree bagging algorithm. Random forest, on the other hand, differs in only one way: its algorithm chooses a random subset of features at each candidate split in the learning process (ensemble learning method that tries to reduce the correlation between estimators in an ensemble by training them on random samples of features rather than the entire feature set), also known as feature bagging. The Gini impurity was employed as the criterion because the random forest is based on decision tree and the study is based on classification.
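The bagging procedure can be sketched as follows. To keep the example short, a 1-nearest-neighbour rule stands in for the classification tree as the base learner, so this illustrates bootstrap sampling and majority voting only; it is not a random forest and does not include feature bagging. The data and function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap(X, Y, rng):
    """Draw len(X) examples with replacement; instances may repeat."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [Y[i] for i in idx]


def fit_1nn(Xb, Yb):
    """Stand-in base learner (not a tree): predict the label of the nearest sample."""
    def model(x):
        nearest = min(range(len(Xb)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(Xb[i], x)))
        return Yb[nearest]
    return model


def bagging(X, Y, B, seed=0):
    """Fit B base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_1nn(*bootstrap(X, Y, rng)) for _ in range(B)]


def predict(models, x):
    """Majority vote over the ensemble."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated toy data.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
Y = [0, 0, 0, 1, 1, 1]
models = bagging(X, Y, B=25)
print(predict(models, [5.5, 5.5]), predict(models, [0.2, 0.3]))
```

A random forest would additionally restrict each split to a random subset of the features, which is the feature bagging step described above.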
4. Results and Discussion
The results of the study are presented as follows: Table 1 shows the different models’ training and testing accuracies and their processing times when run on a machine with 4 CPUs, a ∼2.2 GHz processor, and 8192 MB of RAM. Table 2 shows the confusion matrices and Table 3 shows the classification report.
For each method, the value at the upper left corner is the true positive and the one at the upper right corner is the false positive. The lower right corner is the true negative and the lower left corner is the false negative.
Precision is the proportion of positive predictions that are correct, and recall is the proportion of actual positive samples that the model identifies. The upper row values represent the likelihood of heart illness, whereas the lower row values indicate the likelihood of a decision. The F1 score is the harmonic mean of precision and recall and serves as a single performance-based statistical measure. Figure 1 compares the performance of all of the solution models and Table 4 shows the performances of different methods on the Cleveland dataset. We conducted a one-way ANOVA on the results to find whether there is a statistically significant difference between the outcome of the proposed technique and the others, that is, boosting SVM versus random forest, boosting SVM versus Multilayer Perceptron, boosting SVM versus decision tree, boosting SVM versus Naïve Bayes, and finally boosting SVM versus logistic regression. The analysis of variance, followed by a Tukey simultaneous plot at 95% CI, shows that the corresponding means are significantly different, with boosting SVM performing best. Tests of the training speed were also conducted, and the results again show a statistically significant difference between groups. A further Tukey post hoc analysis shows that the processing time for the boosting SVM was significantly smaller than that of all the other techniques when pairing boosting SVM with random forest, Multilayer Perceptron, decision tree, Naïve Bayes, and logistic regression. All comparisons show that the boosting SVM methodology is extremely promising.
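The quantities in the classification report follow directly from the confusion-matrix counts, as the sketch below shows. The counts are hypothetical and are not taken from Table 2.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the positive (heart disease) class.
precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=5)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the F1 score is a harmonic mean, it stays low whenever either precision or recall is low, which makes it a stricter summary than the arithmetic mean of the two.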
Figures 5 and 6 demonstrate the test application as a proof of concept using the boosting SVM algorithm.
5. Conclusion
The study emphasizes the seriousness of cardiac disease and the need for detecting early warning signs. Many machine learning algorithms based on random forest, logistic regression, Multilayer Perceptron, Naive Bayes, and decision trees are being investigated in light of recent studies that call for the automatic detection of risk. This study proposed a boosting SVM technique to further investigate how to improve prediction accuracy. The technique is evaluated on the Cleveland dataset, which has been utilized successfully and extensively in earlier studies. To reduce misclassification, we preprocessed the data by normalizing it and removing redundant records. The feature importance is also computed, which assigns a score to each feature in the data; the greater the score, the more relevant the feature is to the output variable. A heatmap of correlated features is also produced. The heatmap indicates that the most important factors in predicting heart disease are age and maximum heart rate. Finally, classification is performed using the proposed boosting SVM. For the analysis, confusion matrices, classification reports, ROC, and AUC are all used, and the findings reveal that the proposed method performed best. The proposed method has a recognition accuracy of 99.75%, which is much higher than that of previous studies. The algorithm has now been implemented and has proven to be useful. In the future, we plan to develop a new ensemble model that combines SVM and AdaBoost to improve accuracy and speed, as well as to release the app on both Android and iOS.
Data Availability
The data for this study are publicly available at https://archive.ics.uci.edu/ml/datasets/heart+disease.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.