Abstract
Coronary heart disease has an intense impact on human life. Diagnosis of heart disease based on medical history has long been practiced but is considered unreliable. Machine learning algorithms are more reliable and efficient at classifying patients, e.g., as having or not having cardiac disease. Heart disease detection must be precise and accurate to prevent loss of life. However, previous research studies have several shortcomings: for example, some techniques take a long time to compute, while others are quick but not accurate. This research study addresses that problem by constructing an accurate machine learning model for predicting heart disease. Our model is evaluated using five feature selection algorithms and performance metrics such as accuracy, precision, recall, F1-score, MCC, and execution time. The proposed work has been tested on all of the dataset's features as well as on a subset of them. Reducing the number of features affects the performance of the classifiers in terms of both the evaluation metrics and the execution time. With reduced features, the support vector machine, K-nearest neighbor, and logistic regression classifiers achieve accuracies of 97.5%, 95%, and 93%, with reduced computation times of 4.4, 7.3, and 8 seconds, respectively.
1. Introduction
Chronic heart diseases are among the most dangerous and life-threatening diseases worldwide. The fundamental cause of heart failure, in which the heart fails to supply enough blood to other organs, is narrowing and blockage of the coronary arteries [1, 2]. The coronary arteries must remain open to supply blood to the heart. According to a recent study, heart disease is the most common disease in the United States and worldwide, with a high percentage of the population affected [3]. Common symptoms are shortness of breath, swollen feet, and tiredness [4]. Junk food high in cholesterol, smoking, poor nutrition, high blood pressure, and physical inactivity increase the risk of heart disease [5]. Heartburn, stroke, and heart attack are all symptoms of coronary artery disease (CAD). Other heart disorders include heart rhythm problems, congenital heart disease, congestive heart failure, and cardiovascular disease. Traditional methods have been used to detect cardiac disease [6]. Because of a lack of medical expertise and diagnostic instruments, timely detection and treatment of heart disease is very difficult in poor countries [7, 8]. The main motivation behind this research study is to propose a comprehensive and precise diagnosis technique for heart disease in order to avoid loss of life. Cardiovascular disease is the leading cause of death in both developed and developing countries. According to the WHO, 17.90 million people died from cardiovascular disease (CVD) in 2016, accounting for 30% of all deaths globally. Moreover, 0.2 million people in Pakistan die each year, and the death count is still rising. According to the European Society of Cardiology (ESC), 26.5 million people in Europe suffer from heart disease, with 3.8 million new cases discovered each year. Heart disease kills 50–55% of patients in the first year, and treatment accounts for 4% of yearly healthcare expenditure [9].
Invasive diagnostic procedures relied on a patient's medical history, physical examination results, and an examination of symptoms to diagnose heart disease [10]. Traditional methods such as angiography are regarded as the most precise way of detecting heart abnormalities, but they have certain limitations: they are expensive, have various side effects, require a high level of technical expertise, are computationally difficult, and take time to assess [11, 12]. To overcome the limitations of conventional invasive approaches for detecting cardiac disease, predictive machine learning and deep learning algorithms such as KNN, SVM, NB, DT, LR, RF, and ANN [17–22] have been used to construct noninvasive Internet of Medical Things (IoMT) smart healthcare systems [13–16]. As a result, the death rate among individuals with heart disease has dropped year after year.
The main objectives of this research study are as follows:
(i) To develop an intelligent medical decision system for the timely identification of cardiac illness.
(ii) To select the best model for early heart disease diagnosis from machine learning classification methods such as decision tree (DT), stochastic gradient descent (SGD), K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), logistic regression (LR), and support vector machine (SVM).
(iii) To apply feature selection methods such as LASSO, ANOVA, MultiSURF, variance threshold, and mutual information to identify the most important and related features that properly reflect the pattern of the desired target.
(iv) To utilize the Cleveland hospital datasets related to heart disease.
The rest of the paper is organized as follows: Section 2 provides a literature review, Section 3 explains the materials and methods, Section 4 presents the results and discussion, and Section 5 provides a conclusion.
2. Literature Review
Over time, experts and practitioners have shown keen interest in diagnosing heart disease with classical machine learning techniques. Researchers usually use a classification approach to create a heart disease diagnosis model [5, 23–38]. According to the preliminary computational results shown in Table 1, a machine learning model can diagnose heart failure with 99% accuracy.
Current research is unbalanced: some approaches are accurate but require a long computation time, while others respond quickly but are not accurate enough to diagnose such a serious disease. As a result, there is considerable room to improve performance in this area.
3. Materials and Methods
The suggested approach aims to distinguish patients with cardiac disease from those without it. Predictive models are investigated on both the complete feature set and selected features. Important features are identified using LASSO, ANOVA, MultiSURF, variance threshold, and mutual information. K-nearest neighbor (KNN), support vector machine (SVM), decision tree (DT), random forest (RF), logistic regression (LR), stochastic gradient descent (SGD), and naive Bayes (NB) machine learning algorithms are deployed in the system for classification. The approach is structured in four steps: exploratory data analysis, feature selection, ML classification, and performance evaluation. Algorithm 1 and Figure 1 depict the proposed system's framework.

3.1. Preprocessing
Cleaning the data is very important for achieving the best accuracy and efficiency from machine learning algorithms. Different data preparation techniques are used to ensure that every feature is on a comparable scale. The standard scaler gives each feature zero mean and unit variance, the min-max scaler rescales the data to lie between 0 and 1, and rows with missing values are removed.
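The three preprocessing steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy rows are assumptions made for the example, and the paper does not state which library was used.

```python
import math

def drop_missing(rows):
    """Remove every row that contains a missing value (None or NaN)."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [row for row in rows if not any(is_missing(v) for v in row)]

def standardize(column):
    """Standard scaling: zero mean and unit variance for one feature column."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0  # constant column: avoid division by zero
    return [(v - mean) / std for v in column]

def min_max(column):
    """Min-max scaling: map one feature column onto the interval [0, 1]."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # constant column: avoid division by zero
    return [(v - lo) / span for v in column]

# Hypothetical toy rows (two features each); the second row has a missing value.
rows = [[63.0, 145.0], [37.0, None], [41.0, 130.0], [56.0, 120.0]]
clean = drop_missing(rows)
first_feature = [r[0] for r in clean]
scaled = min_max(first_feature)
zscores = standardize(first_feature)
```

Whether a feature is standardized, min-max scaled, or both is a design choice that depends on the classifier; distance-based methods such as KNN are particularly sensitive to it.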
3.2. Feature Selection
Precise and accurate feature selection is a very important step because it improves classification accuracy while keeping execution time to a minimum. The LASSO, ANOVA, MultiSURF, variance threshold, and mutual information feature selection algorithms are used to select features from the dataset.
In the LASSO algorithm, the coefficients of some features become zero and those features are removed from the feature subset, as derived from equations (1)–(6), while ANOVA tests whether the means of two or more groups are statistically distinct, as derived from equations (7)–(11). MultiSURF, described in equations (12) and (13), is the most reliable feature selection algorithm and can be used to explicitly detect pure 2-way interactions across a wide range of problems. Variance threshold eliminates all features whose variance falls below a certain threshold, evaluated from equation (20). Lastly, we used mutual information in the feature selection phase; it is a dimensionless quantity, usually expressed in bits, that measures “how much one random variable provides information about another.” The mathematical formulation behind mutual information is explained in equations (15)–(20).
In linear regression we have N samples {(xᵢ, yᵢ)}, i = 1, …, N, where each xᵢ = (xᵢ₁, …, xᵢp) is a p-dimensional vector of features and each yᵢ ∈ ℝ is the corresponding response variable. Our goal is to approximate the response variable yᵢ with a linear combination of the features. The best-fit line is then determined by minimizing a cost (loss) function, here the mean squared error (MSE).
The following equation shows the closed-form solution that determines the coefficients for the aforesaid cost function. LASSO shrinks the coefficients of redundant variables to zero, which performs feature selection directly. The LASSO cost function is as follows:
In equation (6), argmin finds the values of β for which the expression E(β) + R(β) is minimal. The sparsity of a model is defined by the number of parameters in β that are exactly equal to zero. In real-world problems, we need the model to retain only the most useful features. LASSO regularization yields sparse solutions, which select features automatically.
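To illustrate how an L1 penalty drives some coefficients to exactly zero, the following sketch minimizes (1/2N)·Σ(yᵢ − xᵢ·β)² + λ·Σ|βⱼ| by coordinate descent with soft thresholding. It is a minimal sketch under stated assumptions: the solver choice, the scaling of the objective, the absence of an intercept (centered data), and the toy data are all assumptions for the example and are not taken from the paper.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimize (1/2N) * sum (y_i - x_i . beta)^2 + lam * sum |beta_j| (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Prediction using every coefficient except the j-th one.
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta

# Hypothetical toy data: the response depends only on the first feature.
X = [[1.0, 1.0], [-1.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
y = [3.0, -3.0, 6.0, -6.0]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

Features whose entry in `beta` is exactly zero would be dropped from the feature subset; increasing `lam` makes the solution sparser.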
ANOVA uses traditional, standardized terminology. In the equations, the divisor is called the degrees of freedom (DF), the summation is called the sum of squares (SS), the result is called the mean square (MS), and the squared terms are deviations from the sample mean. As a starting point, the total SS is partitioned into components that correspond to the effects in the model.
Similarly, the number of degrees of freedom (DF) can be partitioned: one of these components (that for error) specifies a chi-squared distribution that describes the associated sum of squares, and the same is true for “treatments” if there is no treatment effect.
For the traditional one-way ANOVA, each of these quantities can be expressed in the following form.
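The SS, DF, and MS quantities named above combine into the one-way ANOVA F statistic, sketched below. This is a generic textbook computation, not the authors' code; the function name and the toy groups are assumptions. For feature selection, each group would hold one feature's values for one class, and features with larger F values separate the classes better.

```python
def one_way_anova(groups):
    """Return (ss_between, ss_within, f_statistic) for a list of value groups."""
    all_values = [v for g in groups for v in g]
    n_total, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Sum of squares explained by group membership ("treatments").
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual sum of squares inside each group ("error").
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between  # mean square = SS / DF
    ms_within = ss_within / df_within
    return ss_between, ss_within, ms_between / ms_within

# Hypothetical feature values for two classes.
ss_b, ss_w, f_stat = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
```

The two sums of squares add up to the total sum of squares around the grand mean, which is the partition described in the text.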
In the case of the MultiSURF algorithm, each instance in the dataset belongs to one of two classes. Within the dataset, each feature should be scaled to the interval 0–1. The process is repeated m times, starting with a p-long weight vector (W) of zeros. At each iteration, the algorithm takes the feature vector (X) of a random instance and the feature vectors of the instances closest to X by Euclidean distance. The closest same-class instance is called the near-hit, whereas the closest different-class instance is called the near-miss. In equation (13) we compute a two-tailed p-value using the cumulative distribution function to determine the number of cases that are close or distant.
An information-theoretic formula is used by the variance threshold algorithm to reduce the dataset's features. For a given feature subset Q, there are a variety of truth-value assignments. A feature set Q partitions the training data into groups of instances that share the same truth-value assignment. The entropy of the positive and negative class values is calculated using the equation below.
Mutual information, as opposed to correlation coefficients, captures all linear and nonlinear dependencies. However, if the joint distribution of X and Y is bivariate normal and both marginal distributions are normal, there is an exact relationship between the mutual information I and the correlation coefficient ρ.
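Separately from the entropy formulation above, the basic filtering idea stated earlier for this method, eliminating every feature whose variance is below a chosen threshold, can be sketched as follows. The function name, the threshold value, and the toy matrix are assumptions made for illustration only.

```python
def variance_threshold(X, threshold):
    """Return the indices of the columns whose population variance exceeds threshold."""
    n = len(X)
    kept = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        if variance > threshold:
            kept.append(j)
    return kept

# Hypothetical data: column 0 is constant, column 2 barely varies.
X = [[0.0, 1.0, 5.0],
     [0.0, 2.0, 5.1],
     [0.0, 3.0, 4.9]]
kept_columns = variance_threshold(X, threshold=0.1)
```

Because variance depends on the scale of a feature, the threshold is only meaningful once the features have been brought onto a comparable scale.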
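For discrete variables, mutual information in bits can be estimated directly from co-occurrence counts, as in the sketch below. The second function encodes the standard closed form for the bivariate normal case, I = −½·log₂(1 − ρ²), which is the exact relationship referred to above. Function names and toy data are assumptions for the example; continuous features would need binning or another estimator.

```python
import math
from collections import Counter

def mutual_information_bits(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equally long discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_joint = c / n
        p_indep = (count_x[x] / n) * (count_y[y] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

def gaussian_mi_bits(rho):
    """Mutual information of a bivariate normal pair with correlation rho, in bits."""
    return -0.5 * math.log2(1.0 - rho ** 2)

# Hypothetical binary feature and target.
identical = mutual_information_bits([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information_bits([0, 0, 1, 1], [0, 1, 0, 1])
```

A feature that fully determines a balanced binary target carries one bit of information about it, while an independent feature carries none.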
3.3. Classification
Machine learning classification methods are used to separate heart patients from healthy individuals. In this section, we review a few prominent classification approaches and the theoretical basis of those methods.
3.3.1. Support Vector Machine (SVM)
SVM is an ML technique mainly used to solve classification problems. It uses a maximum-margin strategy, which leads to a quadratic optimization problem, and is employed in a variety of applications because of its high classification performance. SVM seeks the best hyperplane to separate the data, as shown in equations (21)–(23).
3.3.2. Naïve Bayes (NB)
The NB method uses the conditional probability theorem, as can be seen in equation (24), to classify new feature vectors and to find their conditional probability values. The conditional probability of each vector is used to determine the class of the new vector. NB is commonly used for text-related classification problems.
3.3.3. Decision Tree (DT)
DT is an ML approach structured as a tree of connected internal and leaf (external) nodes. The internal nodes make decisions and pass instances on to their child nodes, whereas a leaf node has no child nodes and is assigned a class label, as derived from the following equations:
3.3.4. KNearest Neighbor (KNN)
KNN predicts the class label of a new input from its similarity to the samples in the training set, as shown in the following equation:
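As an illustration of the similarity-based prediction described here, the sketch below classifies a query point by majority vote among its k nearest training samples. Euclidean distance, k = 3, and the toy points are assumptions made for the example; the paper reports trying several K values without listing them.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbours."""
    # Pair every training sample's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set: label 0 near the origin, label 1 near (5, 5).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(train_X, train_y, [0.2, 0.1])
near_far_cluster = knn_predict(train_X, train_y, [5.5, 5.5])
```

An odd k avoids tied votes in a two-class problem such as this one.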
3.3.5. Logistic Regression (LR)
Binary classification problems are solved using logistic regression, which predicts a value between 0 and 1 and classifies each instance into one of two groups: negative (0) or positive (1). A threshold of 0.5 is applied to the predicted decimal value to assign one of the two classes, 0 or 1. If the hypothesis output is ≥ 0.5, the model predicts 1, indicating that the patient has heart disease (cardiomyopathy). The mathematical representation of logistic regression is explained in the following equations:
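The thresholding rule described above can be sketched as follows: a linear score is passed through the sigmoid function and the resulting probability is compared against 0.5. The weights, bias, and inputs are hypothetical values chosen for illustration, not fitted parameters from the study.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score onto the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Estimated probability that x belongs to the positive class (1)."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return sigmoid(score)

def predict_label(weights, bias, x, threshold=0.5):
    """Return 1 (heart disease) if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# Hypothetical model with two features.
weights, bias = [1.0, -1.0], 0.0
positive = predict_label(weights, bias, [2.0, 1.0])  # score = 1.0
negative = predict_label(weights, bias, [1.0, 2.0])  # score = -1.0
```

Lowering the threshold below 0.5 would trade precision for recall, which can matter when missing a diseased patient is the costlier error.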
3.3.6. Random Forest
A random forest, explained in equations (37) and (38), is a meta estimator that fits a number of decision trees on subsamples of the dataset and uses averaging to improve prediction accuracy while limiting overfitting. The subsample size is controlled by the max_samples option when bootstrapping is enabled; otherwise each tree uses the entire dataset.
3.3.7. Stochastic Gradient Descent (SGD)
Although SGD has a long history in machine learning, it has received significant attention recently in the context of large-scale learning. SGD is used to minimize convex loss functions such as those of SVM and LR, and it provides a quick and simple way to fit linear classifiers and regressors. Equations (39)–(41) describe the SGD technique.
3.3.8. Performance Metrics
Several performance metrics are explained in equations (42)–(46), including accuracy, recall, precision, F1-score, and the Matthews correlation coefficient (MCC). These evaluation metrics are used to compare the performance of our proposed approach with that of other algorithms.
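To make the idea concrete, the sketch below fits a linear classifier by stochastic gradient descent on the logistic loss, updating the weights after every single sample. The learning rate, number of epochs, seed, and toy data are assumptions made for the example and are not taken from the paper.

```python
import math
import random

def sgd_logistic(X, y, lr=0.1, epochs=200, seed=0):
    """Fit weights and bias of a logistic classifier with per-sample SGD updates."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order each epoch
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y[i]  # gradient of the logistic loss w.r.t. the score z
            w = [wj - lr * error * xj for wj, xj in zip(w, X[i])]
            b -= lr * error
    return w, b

# Hypothetical one-feature data: negative values are class 0, positive values class 1.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w, b = sgd_logistic(X, y)
predictions = [1 if w[0] * x[0] + b >= 0 else 0 for x in X]
```

Replacing the logistic loss with the hinge loss in the update would give a linear SVM trained the same way.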
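The five metrics can be computed from the entries of a binary confusion matrix, as in the sketch below. These are the standard definitions; the counts are hypothetical and are not results from the study, and the paper does not state whether its precision and recall values are averaged over both classes.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics from true/false positive/negative counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical confusion-matrix counts for 100 test patients.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix and ranges from −1 to 1, which makes it more informative when the classes are imbalanced.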
4. Results and Discussion
This section presents the classification models and their statistical analysis. In the first phase, we compare the performance of LR, KNN, SGD, RF, SVM, NB, and DT on the Cleveland heart disease dataset. In the second phase, we employ LASSO, ANOVA, MultiSURF, variance threshold, and mutual information to pick relevant features. To evaluate the classifiers, all features were normalized and standardized before being supplied to them.
In this experiment, the selected machine learning classifiers were tested on all features of the dataset, with the data split in a 7 : 3 ratio into training (70%) and testing (30%) sets.
In Table 2 and Figure 2, the SVM shows good performance with 75% accuracy, 75.5% precision, 75.5% recall, 75% F1-score, 53% MCC, and an execution time of 10.4 seconds. Different K values were tested for the KNN classifier, and the best performance among all rounds was 67% accuracy, 67.6% precision, 67.5% recall, 67% F1-score, 41% MCC, and an execution time of 16.7 seconds. The LR classifier achieved 71% accuracy, 69.5% precision, 71% recall, 70.5% F1-score, 37.5% MCC, and an execution time of 12.2 seconds. The DT classifier achieved 61% accuracy, 61% precision, 61% recall, 60% F1-score, 29.5% MCC, and an execution time of 19.9 seconds. The NB classifier achieved 70% accuracy, 70.5% precision, 70% recall, 70% F1-score, 40% MCC, and an execution time of 24.7 seconds. The RF classifier achieved 65% accuracy, 65% precision, 64.5% recall, 64.5% F1-score, 28.5% MCC, and an execution time of 17.1 seconds. The SGD classifier achieved 69% accuracy, 69% precision, 69% recall, 68.5% F1-score, 41.5% MCC, and an execution time of 14.4 seconds.
Based on feature weights, LASSO and ANOVA select different features from the complete dataset. LASSO selects the five most important features, namely SEX, RES, MHR, VCA, and THA. ANOVA selects SEX, RBP, SCH, RES, and THA, as can be seen in Table 3 and Figure 3. We evaluated the classifiers on the selected feature subsets, and they performed efficiently.
The five most relevant features are likewise selected by the second group of feature selection algorithms, namely MultiSURF, variance threshold, and mutual information, as shown in Table 3 and Figure 4. MultiSURF selects the RBP, MHR, EIA, OPK, and THA features from the dataset. RES, MHR, EIA, OPK, and PES are the most prominent features for variance threshold. Moreover, RES, MHR, PES, VCA, and THA are chosen by mutual information, which is the final and most essential feature selection algorithm.
As demonstrated in Figures 3 and 4, after feature selection the five most important features are tested on the different machine learning classifiers, again with a 7 : 3 ratio for training (70%) and testing (30%). In Table 4 and Figure 5, SVM shows good performance, with confusion-matrix-based results of 97.5% accuracy, 97% precision, 97% recall, 97% F1-score, 95% MCC, and an execution time of 4.4 seconds. Different K values were applied for the KNN classifier, and the best among them gave 95% accuracy, 95% precision, 95% recall, 95% F1-score, 88.5% MCC, and an execution time of 7.3 seconds. The LR classifier achieved 93% accuracy, 93.5% precision, 93.5% recall, 93% F1-score, 87.5% MCC, and an execution time of 8 seconds. The DT classifier achieved 90% accuracy, 90.5% precision, 90.5% recall, 90.5% F1-score, 82.5% MCC, and an execution time of 11 seconds. The NB classifier achieved 88% accuracy, 88% precision, 87.5% recall, 88% F1-score, 75.5% MCC, and an execution time of 13.9 seconds. The RF classifier achieved 89% accuracy, 89% precision, 89.5% recall, 88.5% F1-score, 79.5% MCC, and an execution time of 10 seconds. The SGD classifier achieved 90% accuracy, 91.5% precision, 91% recall, 90.5% F1-score, 83% MCC, and an execution time of 12 seconds.
Figure 6 compares the execution time of each classifier on the full feature set and on the five selected features. The SVM algorithm takes 4.4 seconds on the selected features and 10.4 seconds on all features in the dataset. KNN takes 7.3 and 16.7 seconds, respectively. The LR algorithm takes 8 and 12.2 seconds with and without feature selection, the DT algorithm takes 11 and 19.9 seconds, and the NB algorithm takes 13.9 and 24.7 seconds. RF takes 10 and 17.1 seconds to classify the dataset, and lastly, SGD takes 12 and 14.4 seconds, respectively.
Table 5 illustrates an increase in SVM classification accuracy from 75% to 97.5% on the reduced features. Similarly, with reduced features the accuracy of KNN improved from 67% to 95%, LR increased from 71% to 93%, DT increased from 61% to 90%, NB increased from 70% to 88%, RF increased from 65% to 89%, and SGD increased from 69% to 90%. As a result, the feature selection algorithms select significant features that boost classifier performance and reduce execution time, allowing heart disease to be diagnosed effectively.
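A 7 : 3 partition of the kind described here can be sketched as follows. Shuffling before the split and the fixed seed are assumptions made for reproducibility of the example; the paper does not say whether the data were shuffled or stratified.

```python
import random

def train_test_split_70_30(rows, seed=42):
    """Shuffle a copy of rows and split it into 70% training and 30% testing."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical dataset of ten row identifiers.
rows = list(range(10))
train_rows, test_rows = train_test_split_70_30(rows)
```

Scaling parameters and selected features should be derived from the training part only, so that the test part remains unseen during model construction.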
4.1. Comparative Analysis
We employed several feature selection and machine learning approaches in the classification phase. The results demonstrate that our suggested methods produce efficient outcomes on all performance metrics with minimum computational time. Based on these results, we conclude that our proposed approach improves the overall performance of the algorithms, as can be seen in Table 6.
5. Conclusion
This research study proposed a machine-learning-based cardiac disease classification system. Decision tree (DT), stochastic gradient descent (SGD), K-nearest neighbor (KNN), naive Bayes (NB), random forest (RF), logistic regression (LR), and support vector machine (SVM) were used to classify the Cleveland heart disease dataset collected from Cleveland hospitals. The novelty of this proposed work is the development of a diagnosis system for heart disease patients. Feature selection algorithms such as LASSO, ANOVA, MultiSURF, variance threshold, and mutual information are applied before the data are supplied to the training and test phases; the main motivation behind this approach is to improve the response time of each algorithm. Performance evaluation metrics, e.g., accuracy, precision, recall, F1-score, and MCC, were used to compare the performance of the different classifiers. In addition, the proposed approach was evaluated with five feature selection algorithms, seven classifiers, and five performance evaluation metrics and showed efficient performance (refer to Section 4). A machine learning classification model is used in this study. The SVM, KNN, and LR models all perform well with the selected features and improve classification accuracy while also reducing the overall processing time. The findings are consistent with earlier research. In the future, we will apply federated learning and blockchain algorithms to build an effective and efficient diagnosis system.
Data Availability
The data used to support the findings of the study are available at https://www.kaggle.com/datasets/aavigan/clevelandclinicheartdiseasedataset.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research work was supported by the Information Systems Department, Faculty of Management, Comenius University in Bratislava, Odbojárov 10, 82005 Bratislava 25, Slovakia.