Abstract

Heart disease is recognized as one of the leading causes of death worldwide. Biomedical instruments and various hospital systems generate massive quantities of clinical data. Therefore, understanding the data related to heart disease is very important for improving prediction accuracy. This article presents an experimental evaluation of the performance of models created using classification algorithms and relevant features selected using various feature selection approaches. In the exploratory analysis, ten feature selection techniques, i.e., ANOVA, Chi-square, mutual information, ReliefF, forward feature selection, backward feature selection, exhaustive feature selection, recursive feature elimination, Lasso regression, and Ridge regression, and six classification approaches, i.e., decision tree, random forest, support vector machine, K-nearest neighbor, logistic regression, and Gaussian naive Bayes, were applied to the Cleveland heart disease dataset. The feature subset selected by the backward feature selection technique achieved the highest classification accuracy of 88.52%, precision of 91.30%, sensitivity of 80.76%, and f-measure of 85.71% with the decision tree classifier.

1. Introduction

With the advancement of the information era, computer-aided systems generate massive amounts of raw data. Acquiring important knowledge from this form of data is a challenging task for practitioners. Data mining, artificial intelligence, machine learning, and deep learning are relatively modern and promising technologies for discovering relationships and significant patterns in databases using advanced statistical approaches. Medical data mining and knowledge exploration constitute a relatively new and developing domain that is of interest to many researchers [1].

With the improvement of medical data gathering, physicians have the potential to diagnose ailments better. In addition, computational biomedical systems can speed up the decision-making process and achieve greater prediction accuracy in different diseases such as heart diseases, cancers, diabetes, skin diseases, and kidney diseases. Among these ailments, cardiovascular diseases have been identified as having the highest mortality rate in most countries worldwide [14].

According to the World Health Organization (WHO), cardiovascular disease mortality will rise to almost 30 million by 2040 [2]. Electrocardiogram (ECG), echocardiogram (heart ultrasound), cardiac magnetic resonance imaging (MRI), stress tests (exercise stress test, stress ECG, nuclear cardiac stress test), and angiography are tests commonly used by physicians to help identify cardiovascular problems. However, diagnostic and treatment costs for cardiovascular disease are relatively high and not affordable for the whole community. Data mining methods make it possible to determine efficiently, at an early stage, whether patients are at increased risk of heart disease and thus help reduce the costs of diagnosis and treatment.

In this respect, researchers investigated feature selection approaches and various classifiers on various heart disease datasets, including Statlog, Cleveland, Hungary, VA Long Beach, and Switzerland datasets from the UCI Machine Learning Repository, as well as the Z-Alizadeh Sani datasets.

When diagnosing various diseases using a computer-based system, high-dimensional and heterogeneous data must be processed and analyzed. High-dimensional data frequently results in model overfitting and increases the training time of the model. Feature selection is a dimensionality reduction technique that reduces a large set of data features to a manageable one by eliminating irrelevant and redundant features that have no significant impact on classifier performance. Numerous effective feature selection methods have been developed recently to mitigate the curse of dimensionality. Feature selection algorithms can be classified as supervised, unsupervised, or semisupervised [3]. Supervised feature selection is the earliest and most prevalent technique and selects features using labeled data. Three approaches are utilized in supervised feature selection: filter, wrapper, and embedded approaches.

Filter-based techniques are used at the preprocessing stage, independently of the learning algorithm being used. Using statistical measures, this approach calculates a score for each feature based on its dependency on the class label [5–7]. All features are then ranked by these scores, and a predetermined number of relevant features is chosen. The filter technique can employ various feature selection criteria, including correlation coefficients (Pearson, Spearman, and Kendall Tau), Relief, Chi-square, information gain, and Fisher score.

Wrapper techniques use a particular learning algorithm to find out which features are best suited to a certain dataset. The selected features are determined by the prediction accuracy of the learning algorithm: features are added to or eliminated from the subset based on the accuracy of the model, and this loop continues until the optimal subset is discovered. The complexity of these approaches can be relatively high if the dataset has many dimensions. Forward feature selection, backward feature elimination, recursive feature elimination, and exhaustive feature selection are examples of wrapper-based feature selection techniques [5–7].

The embedded techniques combine the advantages of filter and wrapper techniques. Here, algorithms incorporate their own feature selection criteria. By adding attribute selection to the training process, these techniques minimize computational complexity. This approach is typically considerably more accurate at prediction and more efficient than filter and wrapper methods. Lasso regression, Ridge regression, and memetic algorithms are examples of embedded methods [5–7].

The purpose of this study is to determine the effect of several feature selection algorithms classified as filter, wrapper, and embedded techniques on improving the prediction of heart disease. An experimental approach that performs comprehensive experiments on real datasets relevant to heart disease that have been made available through the University of California, Irvine (UCI) data repository was employed to accomplish the goal.

We develop many predictive models, including feature selection, and evaluate them against various performance measures, including accuracy, precision, and recall, to identify the most successful ones that might be utilized for heart disease prediction and benefit the medical community.

2. Literature Survey

Ritu et al. [4] presented a sequential feature selection method for identifying mortality events in patients with heart disease during treatment to find the most critical features. Numerous machine learning methods are utilized, including LDA, KNN, RF, SVM, DT, and GBC. In addition, the confusion matrix, receiver operating characteristic curve, precision, recall rate, and F1-score are also generated to validate the SFS algorithm's results. The experimental findings indicated that the sequential feature selection technique achieves an accuracy of 86.67% for the random forest classifier.

Gao et al. [5] suggested a model using ensemble approaches (boosting and bagging) with feature extraction algorithms (LDA and PCA) for predicting heart disease. The authors compared ensemble techniques (bagging and boosting) with five classifiers (SVM, KNN, RF, NB, and DT) on selected features from the Cleveland heart disease dataset. The results of the experiments indicated that the bagging ensemble learning method with DT and PCA feature extraction obtained the most outstanding performance.

Takci [6] used twelve classification algorithms from various categories and four feature selection methods for heart attack prediction. The models were evaluated based on the accuracy of the model, the processing time, and the findings of the ROC analysis. The result shows that, without feature selection, the maximum accuracy value was 82.59%; it increased to 84.81% with feature selection. Model accuracy of 84.81% was obtained using naive Bayes and linear SVM. Additionally, the processing time was decreased from 359 to 187 milliseconds. The ReliefF algorithm provides the best model accuracy among the four alternative feature selection techniques according to the mean accuracy value. Thus, the author mentioned that when the appropriate combinations are considered, feature selection has a beneficial effect in heart attack prediction research.

Latha et al. [7] proposed a model to predict heart disease risk using ensemble classification techniques and feature selection techniques. The study findings showed that ensemble approaches like bagging and boosting are useful in increasing the prediction accuracy of weak classifiers and perform well in predicting heart disease risk. Ensemble classification improved the accuracy of weak classifiers by up to 7%. The performance was further increased by adding a feature selection implementation, which resulted in a substantial increase in prediction accuracy.

Gárate-Escamila et al. [8] proposed a hybrid dimensionality reduction technique combining Chi-square and principal component analysis (CHI-PCA) to predict heart disease. Their study was conducted on three different datasets gathered from the UCI Machine Learning Repository: the Hungarian, Cleveland, and combined Hungarian-Cleveland datasets. The performance of the suggested technique was evaluated with five different classifiers: random forest, gradient-boosted tree, decision tree, multilayer perceptron, and logistic regression. CHI-PCA with random forest (RF) showed the most remarkable accuracy: 98.7% for the Cleveland dataset, 99.0% for the Hungarian dataset, and 99.4% for the Cleveland–Hungarian (CH) dataset.

Spencer et al. [9] conducted experiments on four frequently used heart disease datasets using four different feature selection techniques: principal component analysis, Chi-squared testing, ReliefF, and symmetrical uncertainty. As noted by the authors, the benefits of feature selection differ depending on the machine learning approach employed for the cardiac datasets. For example, one of the most accurate models discovered had an accuracy of 85.0%, a precision of 84.73%, and a recall of 85.56% when Chi-squared feature selection was combined with the BayesNet classifier.

Semen et al. [10] constructed a diagnostic model for chronic kidney disease detection utilizing a dataset of 400 patients with 24 features. Recursive feature elimination (RFE) was used to select the most significant features. The k-nearest neighbors (KNN), support vector machine (SVM), decision tree, and random forest classification methods were used in this work. All classification methods achieved excellent performance, and the random forest method beat all other algorithms, achieving 100% accuracy, recall, precision, and F1-score.

Almansour et al. [11] used machine learning methods to assist in the early detection of chronic kidney disease by comparing two classifiers, SVM and ANN, tuning their parameters using a random exhaustive search method. The 400-instance dataset from the UCI repository was preprocessed, and features were chosen using correlation coefficients. The classifiers' performance was compared for various numbers of best features (2, 3, 6, 12, and all) and for the computing time required to train the model. Finally, the twelve best features were used to predict renal illness using SVM and ANN, with ANN outperforming SVM in accuracy.

From the literature, it is evident that classifier training using relevant features selected by different feature selection algorithms enhances the accuracy of the classifier.

3. System Framework

The proposed study focuses on increasing classification accuracy by decreasing the number of features in a dataset of cardiac disease. The framework for classifying cardiac diseases is depicted in Figure 1. The key components of the framework include data collection, data preprocessing, feature selection, data splitting, model training with classifiers, and model evaluation. The following sections describe the building blocks of the suggested framework.

3.1. Data Collection

This study used the Cleveland heart disease dataset from the UCI (University of California, Irvine) repository, which is available online for research purposes. The dataset originally contains 303 instances and 75 features. However, published studies [2, 4, 6, 7, 12] typically use a subset of 14 features, comprising biographic, clinical, and habitual features, along with one output feature. A detailed description of the dataset is provided in Table 1. The label of the output feature (num) is divided into two classes to denote the presence and absence of heart disease.

3.2. Data Preprocessing

Preprocessing the dataset is necessary to ensure data quality. The dataset has been preprocessed using techniques such as handling missing values, StandardScaler (SS), and MinMaxScaler. Missing value handling is a preprocessing step used to create a complete dataset; thus, we first determined whether the dataset contains any missing values [2, 5]. Missing values can be handled in various ways, including ignoring the instance entirely, replacing the value with a constant, replacing it with the most frequently occurring (mode) value of that feature, or replacing it with the mean value of the feature. The heart disease dataset used in this article contains six missing values, which were replaced with the mean or mode values of the corresponding features.

3.3. Feature Selection

In this stage, experiments were conducted with and without feature selection to assess its effect. Feature selection aims to identify the most significant features of cardiac disease. Furthermore, feature selection helps construct a more accurate model by eliminating or down-weighting less relevant features, reducing training time, and enhancing learning performance [6]. The behavior of several feature selection approaches from the three major categories (filter, wrapper, and embedded) is assessed in this experiment. As illustrated in Figure 2, feature selection techniques belonging to the three categories were applied separately to the initial dataset. Each technique begins by generating candidate feature subsets, and how these subsets are generated depends on the type of approach. Figure 2 illustrates the process each of the three categories follows to identify optimal feature subsets.

As illustrated in Figure 2, the filter technique chooses the best subset immediately before passing it to the learning algorithm. The remaining two approaches, wrapper and embedded, create the optimal subset in combination with the learning algorithm. In contrast to the other techniques, the embedded method incorporates the benefits of both the filter and wrapper methods.

3.3.1. Feature Selection Using Filter Methods

Filter-based selection techniques utilize statistical methods to determine the dependence or association among independent features (input attributes) and the dependent feature (target attribute).

In the filter method, features are assessed based on the general properties of the data, independently of any learning algorithm, as represented in Figure 3. Features are then chosen based on the scores produced by the statistical methods used to evaluate them.

It primarily quantifies the properties of features using distinct types of measurement criteria, including information, distance, dependence, consistency, similarity, and statistical measures. These attributes are compared for their interdependence with the output attribute to choose the appropriate subset [13–16].

This proposed work mainly uses four filter-based feature selection algorithms, namely, Chi-square, mutual information, ReliefF, and the ANOVA F-test.

(1) Chi-Square. The Chi-square test follows a univariate statistical approach in feature selection to identify the dependence (or correlation) of each feature with the target variable in a dataset [8]. Features that are independent of the target variable are removed, while dependent features are considered highly important for predicting the target variable. The steps involved in this process are given as follows.

Step 1: select all the features from the original dataset.

Step 2: calculate the Chi-square score of each feature using the chi2() function from the scikit-learn library. Equation (1) is used to determine the Chi-square score for each feature:

$\chi^{2} = \sum_{i} \dfrac{(O_i - E_i)^{2}}{E_i}$,   (1)

where $O_i$ is the observed frequency and $E_i$ is the expected frequency. When two features are independent, the observed frequency is close to the expected frequency, resulting in a lower Chi-square score.

Step 3: a feature with a larger Chi-square value depends more strongly on the target feature and is thus chosen for model construction. SelectKBest() was used to select the ten features with the highest Chi-square scores, shown in Table 2; K denotes the number of features to be included in the final dataset.

Step 4: based on the top ten ranking features in Table 2, a threshold n, i.e., the number of features with the highest Chi-square scores to be utilized, is selected to create feature subsets. The search process starts with a subset of n = 5 features, i.e., only the five features with the highest Chi-square scores, to create the initial feature subset.

Step 5: in the second iteration, another subset with n = 6 is generated by picking the first six features with the greatest Chi-square scores. The same procedure is performed with n = 7, 8, 9, 10 to create distinct feature subsets.
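As an illustration of steps 2 and 3, the following sketch applies scikit-learn's chi2 score with SelectKBest; the file name cleveland.csv and the DataFrame variables X and y are placeholders for the preprocessed dataset described in Sections 3.1 and 3.2.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical file name for the preprocessed Cleveland dataset (Sections 3.1-3.2)
df = pd.read_csv("cleveland.csv")
X, y = df.drop(columns=["num"]), df["num"]

# chi2 requires non-negative inputs, so apply it to raw or min-max scaled features
selector = SelectKBest(score_func=chi2, k=10)      # keep the ten highest-scoring features
X_top10 = selector.fit_transform(X, y)

# Rank the features by their Chi-square scores (as in Table 2)
ranked = sorted(zip(X.columns, selector.scores_), key=lambda t: t[1], reverse=True)
for name, score in ranked[:10]:
    print(f"{name}: {score:.2f}")
```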

(2) Mutual Information. The mutual information (MI) between two random variables is a metric measuring their interdependence, i.e., the reduction in uncertainty about one variable given the other. The actions necessary to complete this process are listed below.

Step 1: select all the features from the original dataset.

Step 2: calculate mutual information scores between the features and the target feature using the mutual_info_classif() function. The following principles and formulas are used when calculating mutual information scores. In predicting the unknown class of samples, entropy and conditional entropy can be used to assess the efficiency of a given feature [15, 17, 18]. The entropy H(X) of a variable X with values {x1, x2, ..., xn} is given as

$H(X) = -\sum_{i=1}^{n} p(x_i)\log p(x_i)$,

where $p(x_i)$ is the probability mass function of X. The conditional entropy of the two discrete random variables X and Y is given as

$H(X \mid Y) = -\sum_{i}\sum_{j} p(x_i, y_j)\log p(x_i \mid y_j)$,

where $p(x_i, y_j)$ is the joint probability of X = xi and Y = yj. The mutual information MI(X; Y) of the variables X and Y can then be computed from the entropy H(X) and the conditional entropy H(X|Y) as

$MI(X;Y) = H(X) - H(X \mid Y)$.

Step 3: a higher score indicates a stronger dependence on the target variable. SelectKBest() was used to choose the ten features with the highest mutual information scores for model construction, as indicated in Table 2.

Step 4: considering the best ten ranking features in Table 2, the next step is to determine the optimal number of features (n) with the highest mutual information scores to be used for feature subsets. The search procedure began with a subset of n = 5 features, i.e., the five features with the greatest mutual information scores.

Step 5: another subset with n = 6 is created in the second iteration by selecting the first six features with the highest mutual information scores. The same technique is used for n = 7, 8, 9, 10 to produce separate feature subsets.
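A comparable sketch for step 2 using scikit-learn's mutual_info_classif, assuming the same X and y as above; the fixed random_state is only for reproducibility.

```python
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the target (higher = stronger dependence)
mi_scores = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(X.columns, mi_scores), key=lambda t: t[1], reverse=True)

# First candidate subset with n = 5; the search repeats this for n = 6..10
top_5 = [name for name, _ in ranking[:5]]
print(top_5)
```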

(3) ReliefF Algorithm. ReliefF is a supervised feature weighting approach that assigns a weight to each feature to indicate its importance to the target variable; it is applicable to both binary and multiclass problem domains. The steps required to accomplish this procedure are summarized below.

Step 1: select all the features from the original dataset.

Step 2: calculate the weight of each feature using the ReliefF() function. The weight is determined by the nearest neighbors of the same class and of different classes [6, 9, 12]. For a sampled instance, the values of the nearest-neighbor features are compared and the weight of every feature is updated. The weight update for a feature f in binary classification is given in the following equation:

$W_f = W_f - |x_{i,f} - \text{nearHit}_f| + |x_{i,f} - \text{nearMiss}_f|$,

where nearHit is the nearest instance of the same class and nearMiss is the nearest instance of a different class. The Manhattan (L1) distance is used to find the nearHit and nearMiss instances rather than the Euclidean (L2) distance, and the weights are updated using the absolute differences between xi and nearHit (or nearMiss) rather than the squared differences.

Step 3: compute each feature's importance to the target variable from the weights using feature_importances_.

Step 4: based on the feature importance values, features with higher scores, indicating a stronger dependence on the target variable, are selected for model development. SelectKBest() is used to choose the ten features with the highest importance values; K denotes the number of features for model construction, as indicated in Table 2.

Step 5: after selecting the top ten ranking features in Table 2, the next step is to identify the ideal number of features (n) to be employed for each feature subset. The search method began with a subset of n = 5 features, i.e., the five features having the highest importance values.

Step 6: in the second iteration, another subset with n = 6 is produced by choosing the first six features with the greatest importance values. The same approach is used for n = 7, 8, 9, 10 to generate distinct feature subsets.
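The ReliefF() function referenced above is not part of scikit-learn; a minimal NumPy sketch of the Relief-style weight update with Manhattan distance, as described in step 2, is shown below. It is illustrative only and assumes a binary target.

```python
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Illustrative Relief-style feature weights using Manhattan (L1) distance."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(X))                     # sample a random instance
        xi, yi = X[i], y[i]
        dists = np.abs(X - xi).sum(axis=1)           # Manhattan distance to every instance
        dists[i] = np.inf                            # exclude the sampled instance itself
        same, diff = np.flatnonzero(y == yi), np.flatnonzero(y != yi)
        near_hit = X[same[np.argmin(dists[same])]]   # nearest neighbor of the same class
        near_miss = X[diff[np.argmin(dists[diff])]]  # nearest neighbor of the other class
        # reward features that separate the classes, penalize those that do not
        w += np.abs(xi - near_miss) - np.abs(xi - near_hit)
    return w / n_iter
```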

(4) ANOVA Method. ANOVA (Analysis of Variance) is a statistical test that formulates a ratio between variances, such as the variances of two different samples. ANOVA is beneficial in classification when the data have numeric input features and a categorical target variable [11, 19]. The steps necessary to complete this process are listed below.

Step 1: select all the features from the original dataset.

Step 2: calculate the ANOVA F-scores between each feature and the target feature using the f_classif() function from the scikit-learn library. The one-way ANOVA F-test is calculated as

$F = \dfrac{\sum_{i=1}^{K} n_i (\bar{Y}_i - \bar{Y})^2 / (K - 1)}{\sum_{i=1}^{K}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 / (N - K)}$,

where $\bar{Y}_i$ is the ith group sample mean, $n_i$ is the number of observations in the ith group, $\bar{Y}$ is the overall mean of the data, K is the number of groups, $Y_{ij}$ is the jth observation in the ith of the K groups, and N is the overall sample size.

Step 3: the score of this test is used to perform feature selection, which allows elimination of features that are not related to the target variable. SelectKBest() was used in this experiment to choose the top ten features with the highest F-scores; K denotes the number of features we want in the final dataset. Table 2 lists the initial top ten features for model building.

Step 4: after selecting the highest-ranking features in Table 2, the ideal number of features (n) with the highest ANOVA F-scores is determined to create various feature subsets. The search method began with a subset of n = 5 features, i.e., the five features with the highest ANOVA F-scores.

Step 5: in the second iteration, a subset with n = 6 is formed by choosing the top six features with the greatest ANOVA F-scores. The same approach is used for n = 7, 8, 9, 10 to generate distinct feature subsets.
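Steps 2 to 5 can be sketched with f_classif and the n = 5..10 search loop common to all four filter methods; the decision tree estimator and 10-fold cross-validation mirror Sections 3.4 and 3.5 but are otherwise illustrative choices.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = {}
for n in range(5, 11):                              # candidate subset sizes n = 5..10
    selector = SelectKBest(score_func=f_classif, k=n)
    X_sub = selector.fit_transform(X, y)
    clf = DecisionTreeClassifier(random_state=0)    # any of the six classifiers could be used
    scores[n] = cross_val_score(clf, X_sub, y, cv=10).mean()

best_n = max(scores, key=scores.get)                # subset size with the best mean accuracy
```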

3.3.2. Feature Selection Using Wrapper Methods

This method mainly utilizes a search strategy to evaluate subsets of the independent features S′ ⊆ S by feeding S′ to the chosen learning algorithm and then assessing the learning algorithm's performance. When a dataset has N features, 2^N subsets are possible, and the steps are performed iteratively until the required suboptimal subsets are found. Thus, when the cardinality is high, the number of subsets grows exponentially. Various search techniques can be applied; this experiment mainly used four: forward feature selection, backward feature elimination, recursive feature elimination, and exhaustive feature selection, described below [13, 14, 20, 21]. The parameter settings for the wrapper-based algorithms are included in Table 3.

(1) Forward Feature Selection (FFS). Forward feature selection is an iterative procedure that begins with the feature having the highest performance with respect to the target feature. When selecting the feature subset, a support vector machine is used as the learning algorithm, and stratification is required to guarantee that each class is well represented [4]. Each subset is evaluated independently, and the creation of subsets is determined by the search method. The following are the steps required to complete the feature selection procedure.

Step 1: all the features were selected from the original dataset.

Step 2: start the feature selection process by initializing the n_features_to_select parameter to 5.

Step 3: the process starts by adding the single feature that obtains the highest cross-validation score. In each subsequent iteration, the feature that performs best in combination with the already selected features is added. This procedure is repeated until the desired number of features has been chosen.

Step 4: select the optimal feature subset according to the highest average score of the classifier in Step 3.

Step 5: steps 2 and 3 are repeated with n_features_to_select set to 6, 7, 8, 9, 10, respectively.

Step 6: using Step 4, the optimal feature subsets for the numbers of features assigned in Step 5 are identified.
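A minimal sketch of this procedure using scikit-learn's SequentialFeatureSelector with a linear SVM, assuming the X and y defined earlier; the number of cross-validation folds is illustrative.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

sfs = SequentialFeatureSelector(
    SVC(kernel="linear"),          # SVM used as the learning algorithm
    n_features_to_select=5,        # repeated for 6, 7, 8, 9, 10 as in steps 5-6
    direction="forward",
    cv=10,
)
sfs.fit(X, y)
print(X.columns[sfs.get_support()])   # the selected feature subset
```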

(2) Backward Feature Elimination (BFE). Backward elimination is the complete opposite of forward feature selection [4]. The following steps are followed to finish the feature selection procedure.

Step 1: all the features are selected from the original dataset.

Step 2: start the feature selection process by initializing the n_features_to_select parameter to 5.

Step 3: the process begins by fitting a model on all available features. Each iteration then eliminates the least significant feature, thereby improving the performance of the model. This method continues until the desired number of features to be selected is reached.

Step 4: select the optimal feature subset according to the highest average score of the classifier in Step 3.

Step 5: steps 2 and 3 are repeated with n_features_to_select set to 6, 7, 8, 9, 10, respectively.

Step 6: using Step 4, the optimal feature subsets for the numbers of features assigned in Step 5 are identified.
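The same scikit-learn class covers backward elimination via direction="backward"; the sketch below pairs it with a decision tree and the six-feature subset size reported as best in Section 4, purely for illustration.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

bfs = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=6,        # the subset size of the best-performing model in Section 4
    direction="backward",          # start from all features and drop the weakest each round
    cv=10,
)
bfs.fit(X, y)
print(X.columns[bfs.get_support()])
```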

(3) Recursive Feature Elimination (RFE). Recursive feature elimination iteratively trains the model and removes the least significant feature, using the algorithm's weights as the elimination criterion in each iteration. It aims to choose features by recursively examining smaller and smaller sets of features. The following steps are used to complete the feature selection method [10, 21].

Step 1: all the features are selected from the original dataset.

Step 2: start the feature selection process by initializing the n_features_to_select parameter to 5.

Step 3: at the beginning, the estimator is trained on the original set of features, and the significance of each feature is determined using either the coef_ or feature_importances_ attribute.

Step 4: the least significant features in the current set are trimmed. This is repeated until the trimmed set contains the required number of features to be selected.

Step 5: steps 3 and 4 are repeated with n_features_to_select set to 6, 7, 8, 9, 10, respectively.

Step 6: using Step 4, the optimal feature subsets for the numbers of features assigned in Step 5 are identified.
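A short sketch of RFE in scikit-learn, using logistic regression so that coef_ supplies the importance criterion; the estimator choice is an assumption.

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(X.columns[rfe.support_])     # retained features
print(rfe.ranking_)                # 1 = selected; larger numbers were pruned earlier
```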

(4) Exhaustive Feature Selection (EFS). The exhaustive feature selection approach uses a brute-force strategy to determine the optimum feature subset. The performance of the machine learning algorithm is assessed against all potential feature combinations in the dataset, and the subset that produces the best performance is selected. The exhaustive search algorithm is the greediest of the wrapper techniques since it attempts all possible combinations of features and chooses the best [11]. Its disadvantage is that it is slower than the step-forward and step-backward approaches since it examines all possible feature combinations [11]. The steps to select the optimal feature subset using EFS are outlined below.

Step 1: all the features are selected from the original dataset.

Step 2: start the feature selection process by initializing min_features to 1 and max_features to 5.

Step 3: the performance of the machine learning algorithm is assessed against all feature combinations within the specified range, and the combination that produces the best performance is retained.

Step 4: step 3 is repeated with max_features set to 6, 7, 8, 9, 10, respectively.

Step 5: using Step 4, the optimal feature subsets for the numbers of features assigned in Step 4 are identified.
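Exhaustive search is not available in scikit-learn itself; the sketch below assumes the mlxtend library's ExhaustiveFeatureSelector, with KNN as an example estimator.

```python
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

efs = ExhaustiveFeatureSelector(
    KNeighborsClassifier(),
    min_features=1,
    max_features=5,                # raised to 6..10 in later runs, as in step 4
    scoring="accuracy",
    cv=5,
)
efs = efs.fit(X, y)
print(efs.best_idx_, efs.best_score_)   # best feature combination and its CV accuracy
```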

3.3.3. Feature Selection Using Embedded Methods

Embedded techniques integrate the feature selection process into the development of the machine learning algorithm. This is a hybrid approach that combines the filter and wrapper methods: the algorithms include their own feature selection strategy, which contributes to generating the optimal subset during model training. In other words, they pick features while the model is being trained, which is why they are referred to as embedded techniques. These techniques need less computation than wrapper methods. The design of embedded feature selection approaches is algorithm dependent; however, developing a mathematical model for a feature-selection-embedded classifier is quite difficult [13, 14, 20].

(1) Lasso Regression. The abbreviation “Lasso” refers to the Least Absolute Selection and Shrinkage Operator. In Lasso, the penalty is defined as the sum of the absolute values of the coefficients (the L1 penalty). Thus, the primary goal of Lasso is to shrink coefficients toward zero using the absolute-value (L1) penalty rather than the sum of squares (L2) penalty [18, 22]. The following steps explain the procedure for selecting the optimal feature subset using Lasso.

Step 1: all the features are selected from the original dataset.

Step 2: make a pipeline object using StandardScaler and Lasso in scikit-learn.

Step 3: tune the α hyperparameter, which controls the strength of the Lasso regularization, over values from 0.1 to 10 in steps of 0.1.

Step 4: calculate the mean squared error with fivefold cross-validation and choose the α value that minimizes this average performance metric.

Step 5: obtain the absolute values of the Lasso regression coefficients and select the features with nonzero coefficients as the best feature subset.
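A sketch of this procedure, assuming the X and y defined earlier; GridSearchCV over alpha stands in for the tuning loop in steps 3 and 4.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scaler", StandardScaler()), ("lasso", Lasso())])
grid = GridSearchCV(
    pipe,
    {"lasso__alpha": np.arange(0.1, 10.1, 0.1)},    # alpha from 0.1 to 10 in 0.1 steps
    scoring="neg_mean_squared_error",               # fivefold CV on mean squared error
    cv=5,
)
grid.fit(X, y)

coefs = grid.best_estimator_.named_steps["lasso"].coef_
selected = X.columns[np.abs(coefs) > 0]             # keep features with nonzero coefficients
print(selected)
```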

(2) Ridge Regression. Ridge regression employs the L2 penalty, equal to the sum of the squares of the coefficients of the variables. This form of regression helps reduce the variance caused by multicollinearity among variables and mitigates the effect of nonlinear correlations between two independent variables [18]. The following steps describe how Ridge is used to choose the optimal feature subset.

Step 1: all the features are selected from the original dataset.

Step 2: create a pipeline object in scikit-learn using StandardScaler and Ridge.

Step 3: tune the alpha hyperparameter, which controls the strength of the Ridge regularization, over values ranging from 0.1 to 10 in steps of 0.1.

Step 4: using fivefold cross-validation, calculate the mean squared error and select the alpha value that minimizes this average performance metric.

Step 5: calculate the squared Ridge regression coefficients and, for the best feature subset, select the features with nonzero values.
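The Ridge variant is nearly identical; because Ridge rarely drives coefficients exactly to zero, the sketch below applies a small hypothetical cut-off on the squared coefficients rather than a strict nonzero test.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
grid = GridSearchCV(pipe, {"ridge__alpha": np.arange(0.1, 10.1, 0.1)},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)

coef_sq = grid.best_estimator_.named_steps["ridge"].coef_ ** 2
selected = X.columns[coef_sq > 1e-3]    # hypothetical threshold standing in for "nonzero"
print(selected)
```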

3.4. Data Splitting

The heart disease dataset is split into an 80% training set and a 20% testing set in this phase. The training set was used to develop the models, while the testing set was used to assess them. Additionally, a 10-fold cross-validation technique was used for validating the training phase of the classifier.
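A minimal sketch of this splitting and validation scheme, assuming X and y as before; the stratification and random seed are illustrative choices not specified in the text.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)       # 80/20 split

clf = RandomForestClassifier(random_state=42)
cv_scores = cross_val_score(clf, X_train, y_train, cv=10)   # 10-fold CV on the training set
clf.fit(X_train, y_train)
print(cv_scores.mean(), clf.score(X_test, y_test))          # validation vs. held-out accuracy
```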

3.5. Training Models

Classification is a method for labeling a particular set of data with different classes and is carried out on structured and unstructured data. The aim of classification predictive modeling is to approximate the mapping between input variables and discrete output variables. Several classification algorithms are available, but no single algorithm can be concluded to be better than the others; this depends on the problem domain and the nature of the available dataset [10].

In this analysis, we used six different supervised learning algorithms: decision tree, random forest, support vector machine, K-nearest neighbor, logistic regression, and Gaussian naive Bayes.
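For reference, the six classifiers can be instantiated in scikit-learn as below; the hyperparameters from Table 4 are not reproduced here, so library defaults (plus the linear SVM kernel noted in Section 3.5.1) serve as placeholders.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "DT":  DecisionTreeClassifier(random_state=0),
    "RF":  RandomForestClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),          # linear kernel, as stated in Section 3.5.1
    "KNN": KNeighborsClassifier(),
    "LR":  LogisticRegression(max_iter=1000),
    "GNB": GaussianNB(),
}
```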

3.5.1. Support Vector Machines (SVM)

This classifier has shown great empirical success in classification tasks among supervised machine learning techniques. It separates data into different classes by one or more hyperplanes and can handle multidimensional data. SVM separates the data with a hyperplane defined by the following formula:

$w^{T}x + b = 0$,

where x is the vector point, w is the weight vector, and b is the bias.

SVMs minimize the empirical classification error and maximize the classification margin. SVM models can be categorized into four distinct categories according to the error function: Nu-SVM regression, C-SVM classification, Nu-SVM classification, and Epsilon-SVM regression. The support vector machine algorithm is deployed with kernel functions such as radial basis, polynomial, linear, and sigmoid kernels, which map nonlinearly separable data into a linearly separable form [2, 4, 10]. In this experiment, the linear kernel function is used for classification.

3.5.2. Decision Tree (DT)

A decision tree algorithm can be categorized under supervised learning; it is represented as a tree of nodes and edges and is well suited to classification problems. Every node in the tree represents a test on a feature, and every edge indicates the choice made based on the evaluated result. This classifier can be considered a predictive machine learning model that displays the correlation between dataset values and features. Each branch in the decision tree indicates a possible value of a certain attribute. The nodes are identified based on entropy measurements of the dataset attributes; the attribute that yields the greatest reduction in entropy (i.e., the highest information gain) is chosen as the root node. C4.5, Iterative Dichotomiser 3 (ID3), and Classification and Regression Trees (CART) are widely used decision tree models [2, 4, 10].

3.5.3. Random Forest (RF)

One of the most widely used and advanced supervised ensemble classification methods is random forest. It generates a large number of trees during the training stage, building a forest of decision trees on multiple data subsets. At the testing stage, each tree in the forest assigns a class label to each instance. Once every tree has predicted a class label, the final decision for each test instance is made by majority vote: the class label receiving the greatest number of votes is taken as the predicted label, which enhances prediction accuracy. This procedure is performed for each instance in the dataset [2, 10].

3.5.4. K-Nearest Neighbor (KNN)

K-nearest neighbor is a supervised learning technique that recognizes patterns in the dataset and determines the class of an instance based on its nearest neighbors. It assigns the class of an instance by considering the distances to the nearest neighbors and their class consistency. KNN can use different distance functions, such as Manhattan, Euclidean, Hamming, and Minkowski, when identifying the neighboring points. This method is called a memory-based approach because all the instance points are stored in memory [2, 4, 10]. A weight can be allocated to each point depending on its distance from the query instance to improve the algorithm's performance. The following equation is used to calculate the Euclidean distance between two points x and y with k features:

$d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$.

3.5.5. Logistic Regression (LR)

Logistic regression is a widely used machine learning method that falls into the category of supervised learning. It is used to forecast a categorical dependent variable based on a set of independent variables; for example, it is used to extract significant statistical relationships from the model or to forecast the tendency of data [2, 18]. The dependent variable in logistic regression is a binary variable coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). The logistic regression algorithm is based on the logistic function shown below:

$p(x) = \dfrac{1}{1 + e^{-f(x)}}$,

where f(x) is a function consisting of the features $x_k$ and their corresponding weights/coefficients $\beta_k$ in linear form:

$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$,

where $x, \beta \in \mathbb{R}^k$ and ε represents the random error (process noise) inevitably occurring in the data-generating process. The sigmoid function is the mathematical function that converts linear regression to logistic regression; it maps any real number to a value between 0 and 1.

3.5.6. Gaussian Naive Bayes (GNB)

This algorithm, based on Bayes' theorem, assumes independence between the features. The construction of a Bayesian model is simple and particularly practical for massive datasets. Naive Bayes' simplicity and its ability to compete with more complex approaches are well established [2, 8].

Gaussian naive Bayes is used when all of the data values in a dataset are numeric. The probability density function is defined using the mean and standard deviation. The algorithm computes the mean and standard deviation for each of the dataset's features and then, for any test data pattern, calculates the class probabilities using these means and standard deviations. Finally, it assigns to the test data the class label whose probability is highest. The following equation is used to determine the likelihood:

$P(x_i \mid y) = \dfrac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\dfrac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$,

where $\mu_y$ is the mean and $\sigma_y$ is the standard deviation.

The parameter setting of the learning algorithms is described in Table 4.

3.6. Evaluating Models

Various assessment metrics are applied to determine the efficiency of the feature selection techniques in this study, namely, precision, F1-score, accuracy, and recall, with the help of elements in the confusion matrix. The confusion matrix allows researchers to identify the performance rate of the result in classification problems depending on four main factors: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).

Accuracy measures the proportion of correctly classified instances, i.e., how well the model identifies patients at high risk of heart disease. Accuracy is calculated as

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$.

Precision is the proportion of patients diagnosed with heart disease who actually had the disease. Precision is given by

$\text{Precision} = \dfrac{TP}{TP + FP}$.

Recall indicates the proportion of patients who actually had heart disease that the algorithm diagnosed as having the disease and is defined by

$\text{Recall} = \dfrac{TP}{TP + FN}$.

The F-score combines precision and recall into a single measure of test accuracy and is evaluated as

$F\text{-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.
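These four metrics can be computed directly from the confusion matrix with scikit-learn; the sketch assumes the fitted classifier and held-out split from Section 3.4.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# clf, X_test, and y_test come from the splitting sketch in Section 3.4
y_pred = clf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy :", accuracy_score(y_test, y_pred))    # (TP + TN) / (TP + TN + FP + FN)
print("precision:", precision_score(y_test, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_test, y_pred))      # TP / (TP + FN)
print("f1-score :", f1_score(y_test, y_pred))          # 2 * P * R / (P + R)
```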

4. Results and Discussion

Python was used as the programming language in this comparative analysis to build the analytical models in a Jupyter (Anaconda) Notebook, which facilitates dataset exploration and effective pattern identification. In addition, scikit-feature, developed at Arizona State University, is one of the most comprehensive open-source feature selection libraries in Python, containing around 40 feature selection algorithms, and was used during development.

The Cleveland heart disease dataset gathered from the UCI ML Repository contains 303 samples: 164 samples indicate the absence of heart disease, and 139 samples indicate the presence of heart disease. The optimal feature subsets for classification were selected by using ten feature selection algorithms, namely, Chi-square, ANOVA, mutual information, ReliefF, forward feature selection (FFS), backward feature selection (BFS), exhaustive feature selection (EFS), recursive feature elimination (RFE), Lasso (L1) regression, and Ridge (L2) regression, categorized under the filter, wrapper, and embedded methods. The performance is evaluated using six classifiers: random forest, support vector machine, decision tree, K-nearest neighbor, logistic regression, and Gaussian naive Bayes.

4.1. Feature Selection Result

Table 5 shows the reduced feature subsets using different feature selection algorithms. Each feature in the table is denoted by values of 1 or 0. If the feature appears in the feature subset, it is denoted by 1; otherwise, it is denoted by 0 in accordance with the following attribute sequence.

According to the statistics in Figure 3, the most significant feature for predicting heart disease is F12, followed by F13, F3, F9, and F10, although the ordering of these features differs across feature selection methods.

4.2. Classification Result

The accuracies obtained by the analysis, on the Cleveland heart datasets, of the different feature subsets selected by filter, wrapper, and embedded methods are shown in Figures 4–6.

Figure 4 summarizes the heart disease prediction accuracies using the optimal feature sets identified by the ANOVA, Chi-square, mutual information, and ReliefF techniques on the various learning algorithms.

The classification accuracies of the feature set identified by FFS, BFS, EFS, and RFE techniques categorized under the wrapper-based methods are presented in Figure 5. The accuracies of embedded feature selectors with learning algorithms are illustrated in Figure 6.

According to the results of the analysis, all models constructed from the feature subsets selected by the various feature selection approaches outperform the models built on the original feature set. The original feature set achieved a maximum accuracy of 63.92%, precision of 58.33%, sensitivity of 53.84%, and f-measure of 55.99% with the KNN classifier. After applying the feature selection approaches, the accuracy of the classifiers significantly improved.

Table 6 summarizes the best accuracy performance of each feature selection technique. The feature subset selected by the backward feature selection technique has achieved the highest classification accuracy of 88.52%, precision of 91.30%, sensitivity of 80.76%, and f-measure of 85.71% with the DT classifier. The selected features are F2, F3, F7, F9, F12, and F13.

As shown in Table 7, all the models created with feature selection techniques achieved more than 80% accuracy. Furthermore, among the feature selection categories, the wrapper-based techniques, namely, FFS, BFS, EFS, and RFE, obtained higher classification performance (above 83%) than the other two categories.

5. Conclusion and Future Work

The primary objective of this article is to examine the effect of feature selection approaches on the accuracy of heart disease prediction. This analysis was conducted against a collection of distinctive features extracted from frequently used Cleveland heart disease datasets available at the University of California, Irvine using various feature selection algorithms.

Experiments were performed with and without feature selection to determine the effect of feature selection. ANOVA, Chi-square, mutual information, Relief, forward feature selection, backward feature selection, exhaustive feature selection, recursive feature elimination, Lasso regression, and Ridge regression were employed as feature selection algorithms. The analysis was conducted on six classification algorithms: decision tree, random forest, support vector machine, K-nearest neighbor, logistic regression, and Gaussian naive Bayes.

Without feature selection, the highest result was 63.92% model accuracy using the KNN classifier. The experiment was then conducted using feature selection. Prediction accuracy improved for the models developed with all of the feature selection algorithms. Without feature selection, the maximum accuracy value was 63.92%; this value increased to 88.52% using backward feature selection and a decision tree classifier. The experimental findings suggest that with feature selection the disease can be classified well using a small number of features.

The improvements over using the original dataset vary significantly depending on the feature selection approach and learning algorithm employed; hence, it is important to evaluate various combinations of feature selection strategies and learning algorithms to obtain the best feasible model. However, without extensive experiments and analyses, it is impossible to predict which combination will be beneficial.

In the future, multiple feature selection techniques can be combined as ensemble (hybrid) techniques to extract optimal feature subsets for model development. Also, real-world medical datasets gathered from different countries can be used for model development. This could further improve the accuracy of heart disease prediction.

Data Availability

The data for this analysis were derived from previously published studies and datasets that have been cited in this paper. The datasets used in this study are accessible at University of California, Irvine's Machine Learning website.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.