Abstract

Many fields, such as healthcare, business, industry, and agriculture, require the analysis of large amounts of data. Feature selection (FS) techniques are therefore needed in many areas of science, especially computer science, and an FS technique that is well suited to a particular learning algorithm is of great help to researchers. This paper proposes a hybrid feature selection (HFS) based disease diagnostic model for Breast Cancer, Hepatitis, and Diabetes. HFS combines the positive aspects of the Filter and Wrapper FS approaches. The proposed model adopts the weighted least squares twin support vector machine (WLSTSVM) as the classifier, sequential forward selection (SFS) as the search strategy, and correlation feature selection (CFS) to evaluate the importance of each feature. The model not only selects a relevant feature subset but also deals efficiently with the data imbalance problem. The effectiveness of the HFS based WLSTSVM approach is examined on three well-known disease datasets taken from the UCI repository in terms of predictive accuracy, sensitivity, specificity, and geometric mean. In the experiments, the proposed HFS based WLSTSVM disease diagnostic model achieves predictive accuracies of 89.71%, 87.50%, and 98.55% on the Diabetes, Hepatitis, and Breast Cancer datasets, respectively.

1. Introduction

Advances in computer hardware for storing data and the computerization of all fields have resulted in the generation and collection of huge amounts of data. In addition, the internet provides seemingly endless information. The need to turn this enormous volume of data into useful patterns and knowledge has increased the demand for Data Mining among information system researchers. Large datasets often suffer from poor quality and redundant information, which also affects the process of knowledge discovery. Hence, feature selection (FS) methods have been used to reduce large datasets to useful and reliable data [1, 2]. Feature selection is used to select a minimal and relevant feature subset for a given dataset. A feature is said to be relevant if the class is conditionally dependent on it, that is, if the feature is helpful in predicting the class attribute. Another important criterion for the usefulness of a feature is redundancy: a feature is redundant if it is highly associated with other features. A good feature subset includes those features which are highly correlated with the class attribute or decision function but are uncorrelated with each other [1–3]. The task of FS is therefore to search for an optimal feature subset for the problem to be solved. FS performs better when paired with an appropriate learning algorithm.

The use of classification techniques in medical diagnosis is steadily increasing. The purpose of this research is to develop an effective disease diagnostic model for Breast Cancer, Hepatitis, and Diabetes patients. Breast Cancer is the leading cause of death among women between 40 and 55 years of age [4]. According to a World Health Organization (WHO) report, about 1.2 million women suffer from Breast Cancer every year [5]. Diabetes is a disease found in all age groups. It is associated with a high glucose level in the blood and with insufficient secretion of insulin, or with body cells that do not react properly to insulin. Insulin is a hormone which helps convert sugar, starch, and other food into energy. When insulin is in short supply, glucose is not properly converted to energy and is excreted in the urine [6, 7]. Hepatitis is caused by high alcohol consumption, chemicals, and different viruses or drugs, and it leads to liver problems [8]. Better diagnostic techniques and more effective treatments have reduced the mortality rate of these diseases [9].

Class imbalance is also a major problem with real healthcare data. In class imbalance, the data points of one class outnumber the data points of the other class. In the medical domain, different classes usually contain different numbers of data points. For example, with two classes, healthy and sick, the two classes may contain different numbers of records. As a result, a classifier may be biased towards the majority class (the class with the larger number of data points) and may misclassify the data points of the minority class (the class with the smaller number of data points). A disease diagnosis system must therefore be able to handle the class imbalance problem. Over the years, various approaches have been proposed to deal with this issue, such as undersampling, oversampling, and algorithm adjustment approaches [10, 11]. In undersampling, data points of the majority class are removed until both classes have the same number of data points. Useful information may sometimes be lost because undersampling deletes data points of the majority class [12–16]. In oversampling, data points are added to the minority class until both classes have an equal number of data points. The problem with oversampling is that it generates an unnatural bias towards the minority class. An experimental study found that undersampling performs better than oversampling [15]. Another approach to handling class imbalance is to assign different costs to different training data points [17–21]. Ensemble methods such as Boosting Support Vector Machine and Random Forest have also been proposed [10, 11]. Weighted SVM and weighted least squares SVM (LSSVM) were proposed by Yang et al. and Suykens et al., respectively [22, 23]. All these studies concluded that a weighted classification approach improves the predictive performance of a classification system.
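
To make the undersampling idea concrete, the following is a minimal sketch of random undersampling. The function name `random_undersample` and the toy data are illustrative assumptions and are not taken from the cited works, which may select the removed points differently.

```python
import random

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        # The minority class keeps all of its rows; larger classes are sampled.
        kept = rng.sample(rows, n_min)
        X_bal.extend(kept)
        y_bal.extend([label] * n_min)
    return X_bal, y_bal

# Toy data: 7 records of class 1 (majority) and 3 of class 0 (minority).
X = [[i] for i in range(10)]
y = [1] * 7 + [0] * 3
X_bal, y_bal = random_undersample(X, y)
print(len(X_bal), y_bal.count(1), y_bal.count(0))
```

The four discarded majority records illustrate the information loss mentioned above.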

WLSTSVM is the weighted version of the recently proposed least squares twin support vector machine (LSTSVM), in which distinct weights are assigned to the data points of each class to handle the class imbalance problem. We select LSTSVM because it has shown better generalization ability and faster computational speed. Since healthcare data is imbalanced in nature and contains many irrelevant features, this research work utilizes the advantages of correlation feature selection and the sequential forward selection search strategy and proposes an HFS based WLSTSVM diagnostic model for Breast Cancer, Hepatitis, and Diabetes. In this way, the paper deals not only with redundant and irrelevant features but also with the class imbalance problem.

The paper is organized into six sections as follows. Section 2 summarizes the methods and results of previous work on the diagnosis of Breast Cancer, Diabetes, and Hepatitis. A brief introduction to WLSTSVM is given in Section 3. The proposed disease diagnostic model and the experimental results are discussed in Sections 4 and 5, respectively. Finally, conclusions are drawn in Section 6.

2. Related Work

Several techniques have been proposed for disease diagnosis, and most of them obtained high predictive accuracies [24–42]. Quinlan used the C4.5 decision tree classification model for Breast Cancer diagnosis and achieved 94.74% accuracy with 10-fold cross validation [24]. Hamilton et al. achieved 96% accuracy using the RIAC approach, while Ster and Dobnikar reached 96.8% using Linear Discriminant Analysis [26, 27]. In another work, Peña-Reyes and Sipper obtained 97.36% accuracy with a Fuzzy Genetic Algorithm approach [28]. Several studies have addressed Breast Cancer diagnosis using feature selection based classifier models. Akay and Huang et al. used an F-score feature selection based SVM model, while Chen et al. used SVM with rough-set based FS for the diagnosis of Breast Cancer [29–31]. Rathore and Agarwal predicted the survivability of Breast Cancer patients using an ensemble approach [32]. Polat and Güneş developed a Breast Cancer diagnosis system using least squares SVM [33]. Karabatak and Ince combined association rules and a neural network to develop an expert system for the diagnosis of Breast Cancer, and applied 3-fold cross validation to measure its performance [34]. Übeyli also used a neural network for the same purpose [35].

Temurtas et al. performed a comparative study on the diagnosis of diabetes using neural networks [36]. They compared a multilayer neural network (MLNN) trained by the Levenberg-Marquardt (LM) approach, a probabilistic neural network (PNN), and other existing methods, and found that MLNN with LM achieved the highest accuracy for the diagnosis of diabetes among all methods. Liu and Fu proposed a PSO based SVM with the Cuckoo search technique for the diagnosis of heart disease and Breast Cancer. Cuckoo search was used to select better initial parameters of the kernel function, and the PSO approach searched for the best parameters of the SVM [37]. Ganji and Abadeh developed a fuzzy classification system based on ant colony optimization, referred to as FCS-ANTMINER, for the diagnosis of Diabetes. FCS-ANTMINER generated fuzzy rules for the diagnosis of Diabetes and achieved 84.24% accuracy [38]. Ashraf et al. developed a disease diagnosis system using Correlation based feature selection and the Naïve Bayes approach to diagnose Thyroid, Hepatitis, and Breast Cancer diseases [39]. They found that the feature selection based disease diagnosis model produced more promising results. Polat and Güneş proposed a Hepatitis diagnosis system based on feature selection and an artificial immune recognition system with a fuzzy resource allocation mechanism [40].

Yang et al. proposed ensemble based wrapper methods for feature selection from imbalanced data. They generated multiple balanced datasets from the original imbalanced dataset by sampling and evaluated each feature subset using an ensemble of base classifiers [41]. In another work, Al-Shahib et al. focused on two issues: identifying discriminatory features and overcoming the data imbalance problem [42]. They applied feature subset selection followed by undersampling of the majority class and built an SVM classifier based system for the prediction of protein function from amino acid sequence. They found that this combined approach outperformed other competitive learning algorithms. Taken together, the above research indicates that feature selection plays an important role in the development of disease diagnosis systems.

3. Weighted Least Squares Twin Support Vector Machine

The literature indicates that the support vector machine (SVM) performs classification tasks more efficiently than the decision tree, artificial neural network, and Naïve Bayes [9]. SVM solves a complex quadratic programming problem (QPP) to construct a maximum margin hyperplane for classification. Recently, Jayadeva et al. proposed the twin support vector machine (TWSVM), which handles classification tasks by generating two nonparallel planes, one plane for each class [43]. Unlike SVM, TWSVM solves a pair of simpler QPPs and obtains two nonparallel planes such that each plane is close to the data points of one class while being as far as possible from the data points of the other class. TWSVM is about four times faster than traditional SVM and produces results comparable with existing methods. However, TWSVM still has to solve two QPPs. Kumar and Gopal therefore proposed the least squares twin support vector machine (LSTSVM), which solves a pair of systems of linear equations instead of the two QPPs of TWSVM [44]. LSTSVM takes less computational time and has better generalization ability than traditional TWSVM. However, the above classification techniques are unable to handle the class imbalance problem. In this research work we therefore propose a weighted LSTSVM approach in which distinct weights are applied to the training data points. The number of data points in each class is used to select appropriate weight parameters: a lower weight is assigned to the class with the larger number of data points, while a higher weight is assigned to the class with the smaller number of data points [45]. Let $m_1$ and $m_2$ represent the sizes of the positive and negative classes, respectively. The weight of each class is computed from these class sizes according to (1).
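
As an illustration only, the sketch below assigns each class a weight equal to the relative size of the opposite class, so that the smaller class receives the larger weight. This inverse-frequency rule and the name `class_weights` are assumptions made for the example; they are not the weighting formula (1) of the paper.

```python
def class_weights(m_pos, m_neg):
    """Illustrative inverse-frequency weights (an assumption, not formula (1)).

    Each class is weighted by the share of the *other* class, so the
    minority class gets the higher weight and the two weights sum to 1.
    """
    total = m_pos + m_neg
    return m_neg / total, m_pos / total

# Example with a 32-vs-123 split, the class sizes of the Hepatitis dataset.
w_pos, w_neg = class_weights(32, 123)
print(w_pos, w_neg)
```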

3.1. Linear WLSTSVM

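Let the data points of the positive and negative classes be represented by two matrices $A \in \mathbb{R}^{m_1 \times n}$ and $B \in \mathbb{R}^{m_2 \times n}$, whose rows are data points in the $n$-dimensional real space. For linearly separable data points, the WLSTSVM is formulated as

$$\min_{w_1, b_1, \xi} \; \frac{1}{2}\,\|A w_1 + e_1 b_1\|^2 + \frac{c_1}{2}\,\xi^T W \xi \quad \text{s.t.} \quad -(B w_1 + e_2 b_1) + \xi = e_2,$$

$$\min_{w_2, b_2, \eta} \; \frac{1}{2}\,\|B w_2 + e_2 b_2\|^2 + \frac{c_2}{2}\,\eta^T W \eta \quad \text{s.t.} \quad (A w_2 + e_1 b_2) + \eta = e_1,$$

where $W$ is a diagonal matrix containing the weight of each data point as per (1), $\xi$ and $\eta$ are slack variables, $c_1$ and $c_2$ are penalty parameters, and $e_1$ and $e_2$ are two vectors of 1's. $w_1$, $w_2$ are the normal vectors to the hyperplanes and $b_1$, $b_2$ represent the bias terms. The Lagrangians of the above problems are obtained as

$$L_1 = \frac{1}{2}\,\|A w_1 + e_1 b_1\|^2 + \frac{c_1}{2}\,\xi^T W \xi - \alpha^T\big(-(B w_1 + e_2 b_1) + \xi - e_2\big),$$

$$L_2 = \frac{1}{2}\,\|B w_2 + e_2 b_2\|^2 + \frac{c_2}{2}\,\eta^T W \eta - \beta^T\big((A w_2 + e_1 b_2) + \eta - e_1\big),$$

where $\alpha$ and $\beta$ are Lagrangian multipliers. Setting the derivatives of the Lagrangians to zero returns the normal vectors and biases as follows:

$$\begin{bmatrix} w_1 \\ b_1 \end{bmatrix} = -\,(E^T E)^{-1} F^T \alpha, \qquad \begin{bmatrix} w_2 \\ b_2 \end{bmatrix} = (F^T F)^{-1} E^T \beta,$$

where $E = [A \;\; e_1]$ and $F = [B \;\; e_2]$. The Lagrangian multipliers are obtained as

$$\alpha = \Big(F (E^T E)^{-1} F^T + \frac{1}{c_1} W_1^{-1}\Big)^{-1} e_2, \qquad \beta = \Big(E (F^T F)^{-1} E^T + \frac{1}{c_2} W_2^{-1}\Big)^{-1} e_1,$$

where $W_1$ and $W_2$ are the diagonal matrices $W$ of the first and second problem, holding the weights of the data points in $B$ and $A$, respectively. The normal vectors and biases are used to construct the two nonparallel planes

$$x^T w_1 + b_1 = 0, \qquad x^T w_2 + b_2 = 0.$$

A new data point $x$ is assigned to the class whose plane lies nearest to it:

$$\text{class}(x) = \arg\min_{k = 1, 2} \; |x^T w_k + b_k|.$$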
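
The following NumPy sketch shows how a linear weighted least squares twin SVM of this kind can be trained by solving two linear systems. It is a minimal illustration under stated assumptions, not the authors' implementation: the augmented matrices `E = [A e1]` and `F = [B e2]`, the per-point weight vectors `w_A` and `w_B`, the small ridge term `eps` added for numerical stability, and the toy data are all choices made for this example.

```python
import numpy as np

def fit_linear_wlstsvm(A, B, w_A, w_B, c1=1.0, c2=1.0, eps=1e-8):
    """Return ((w1, b1), (w2, b2)) for the two nonparallel planes.

    A, B     : rows are the positive / negative class points.
    w_A, w_B : per-point weights for the rows of A and B.
    eps      : small ridge term for stability (an assumption of this sketch).
    """
    E = np.hstack([A, np.ones((A.shape[0], 1))])  # E = [A  e1]
    F = np.hstack([B, np.ones((B.shape[0], 1))])  # F = [B  e2]
    I = np.eye(E.shape[1])
    EtE_inv = np.linalg.inv(E.T @ E + eps * I)
    FtF_inv = np.linalg.inv(F.T @ F + eps * I)
    # Multipliers follow from substituting the stationarity conditions
    # into the equality constraints of each problem.
    alpha = np.linalg.solve(F @ EtE_inv @ F.T + np.diag(1.0 / (c1 * w_B)),
                            np.ones(B.shape[0]))
    beta = np.linalg.solve(E @ FtF_inv @ E.T + np.diag(1.0 / (c2 * w_A)),
                           np.ones(A.shape[0]))
    z1 = -EtE_inv @ F.T @ alpha  # stacked [w1; b1], plane close to class A
    z2 = FtF_inv @ E.T @ beta    # stacked [w2; b2], plane close to class B
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def predict(X, plane1, plane2):
    """Assign +1 if the first plane is nearer (smaller |x.w + b|), else -1."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1)
    d2 = np.abs(X @ w2 + b2)
    return np.where(d1 <= d2, 1, -1)

# Toy data: class A lies near the line x2 = 0, class B near x2 = 2.
A = np.array([[0.0, 0.0], [1.0, 0.2], [2.0, -0.2], [3.0, 0.0]])
B = np.array([[0.0, 2.0], [1.0, 2.2], [2.0, 1.8], [3.0, 2.0]])
plane1, plane2 = fit_linear_wlstsvm(A, B, np.full(4, 0.5), np.full(4, 0.5))
preds = predict(np.array([[1.5, -1.0], [1.5, 3.0]]), plane1, plane2)
print(preds)
```

Only matrix inversions and linear solves are needed, which is the source of the speed advantage of least squares twin SVMs over QPP-based formulations.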

3.2. Nonlinear WLSTSVM

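WLSTSVM also works well for nonlinearly separable data points with the help of a kernel function. The kernel function transforms the data points into a higher dimensional space in which they are easier to separate, and WLSTSVM then constructs the nonparallel planes in that space [44]. If $K$ is any kernel function, then the equations of the kernel generated surfaces are obtained as

$$K(x^T, C^T)\,u_1 + \gamma_1 = 0, \qquad K(x^T, C^T)\,u_2 + \gamma_2 = 0,$$

where $C = [A; B]$ stacks the data points of both classes. WLSTSVM for nonlinearly separable data points is formulated as

$$\min_{u_1, \gamma_1, \xi} \; \frac{1}{2}\,\|K(A, C^T) u_1 + e_1 \gamma_1\|^2 + \frac{c_1}{2}\,\xi^T W \xi \quad \text{s.t.} \quad -(K(B, C^T) u_1 + e_2 \gamma_1) + \xi = e_2,$$

$$\min_{u_2, \gamma_2, \eta} \; \frac{1}{2}\,\|K(B, C^T) u_2 + e_2 \gamma_2\|^2 + \frac{c_2}{2}\,\eta^T W \eta \quad \text{s.t.} \quad (K(A, C^T) u_2 + e_1 \gamma_2) + \eta = e_1.$$

The normal vectors and biases in the higher dimensional space are obtained as follows:

$$\begin{bmatrix} u_1 \\ \gamma_1 \end{bmatrix} = -\,(G^T G)^{-1} H^T \alpha, \qquad \begin{bmatrix} u_2 \\ \gamma_2 \end{bmatrix} = (H^T H)^{-1} G^T \beta,$$

where $G = [K(A, C^T) \;\; e_1]$ and $H = [K(B, C^T) \;\; e_2]$, and the multipliers $\alpha$ and $\beta$ are computed as in the linear case with $G$ and $H$ in place of the augmented data matrices of the positive and negative classes. A class is assigned to a new data point $x$ according to

$$\text{class}(x) = \arg\min_{k = 1, 2} \; |K(x^T, C^T)\,u_k + \gamma_k|.$$

That is, for a new data point, its distance from each kernel surface is calculated and the point is assigned to the class whose kernel surface lies nearest to it. If $x_i$ and $x_j$ represent vectors in the input space, then the Gaussian kernel function is formulated as

$$K(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right).$$

We use the Gaussian kernel function to develop a nonlinear WLSTSVM classifier model. Hybrid feature selection and Grid Search parameter selection are also used with this classifier model, which further enhances its performance for disease diagnosis. The algorithm of the WLSTSVM classifier for both linearly and nonlinearly separable data points is given in Table 1.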
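
As a small illustration of the kernel step, the sketch below evaluates a Gaussian kernel and builds the kernel matrix between two sets of points. The parameterization with $2\sigma^2$ in the denominator and the helper names `gaussian_kernel` and `kernel_matrix` are assumptions of this example; other conventions for the width parameter exist.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel value exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(X, C, sigma=1.0):
    """Matrix with entry [i][j] = kernel between row i of X and row j of C."""
    return [[gaussian_kernel(x, c, sigma) for c in C] for x in X]

points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = kernel_matrix(points, points, sigma=1.0)
for row in K:
    print([round(v, 4) for v in row])
```

Each point has kernel value 1 with itself, and the value decays towards 0 as the distance between two points grows relative to `sigma`.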

4. Methodology and Experiments

4.1. Datasets

This research work proposes a disease diagnosis model for Breast Cancer, Hepatitis, and Diabetes. The datasets are taken from the UCI Machine Learning Repository [46], and their details are shown in Table 2. Table 2 shows that all three datasets contain a different number of instances in each class and are therefore imbalanced. The imbalance ratio is calculated as the ratio of the number of data points in the majority class to the number in the minority class. For Breast Cancer, one class contains 241 instances and the other contains 458. For the Diabetes dataset, one class contains 500 instances and the other contains 268, and for Hepatitis one class contains 123 instances and the other contains 32. The numbers of features in Breast Cancer, Diabetes, and Hepatitis are 9, 8, and 19, respectively. The Diabetes dataset is widely used for predicting diabetes in women during pregnancy, while the Breast Cancer dataset is used to predict whether a tumor is benign or malignant. The Hepatitis dataset is also useful for evaluating the performance of a classifier and is used to predict patient survival.
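
The imbalance ratio follows directly from the class counts quoted above. The short sketch below computes it for the three datasets; the dictionary layout and the function name `imbalance_ratio` are illustrative choices.

```python
# Class counts as stated in the text (majority, minority).
class_counts = {
    "Breast Cancer": (458, 241),
    "Diabetes": (500, 268),
    "Hepatitis": (123, 32),
}

def imbalance_ratio(counts):
    """Ratio of majority-class size to minority-class size."""
    return max(counts) / min(counts)

ratios = {name: round(imbalance_ratio(c), 2) for name, c in class_counts.items()}
print(ratios)
```

Hepatitis is the most imbalanced of the three, with almost four majority-class records for every minority-class record.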

4.2. Hybrid Feature Selection Method

The main purpose of feature selection is to select a minimal and relevant feature subset for a given dataset while maintaining its original representation. FS not only reduces the dimensionality of the data but also enhances the performance of classifiers. There are two categories of feature selection: Filter and Wrapper methods. A Filter method is useful for ranking features according to their importance, but the main difficulty is deciding the stopping criterion. A Wrapper method is not suitable for large datasets because of its exhaustive search. Since the Filter approach is computationally less expensive and the Wrapper approach is more accurate, a hybrid approach combines the advantages of both. In this paper, we use correlation feature selection as the Filter approach. It considers the correlation of each feature with the target feature and selects only those features which show a strong correlation with the target feature and a weak correlation with each other. The correlation coefficient between two features $X$ and $Y$ is calculated as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$$

where $\sigma$ and $E$ represent the standard deviation and the expected value, respectively, and $\mu_X$, $\mu_Y$ are the means of $X$ and $Y$. The correlation coefficient indicates how strongly two features are related or associated with each other; that is, when the value of one feature is able to predict the value of another feature, the two are said to be strongly correlated. The correlation can be estimated from $n$ training data points as

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y},$$

where $s_x$ and $s_y$ indicate the standard deviations of the training data points, $\bar{x}$ and $\bar{y}$ are the sample means, and $x_i$ and $y_i$ represent the values of features $X$ and $Y$, respectively. This value is used to rank the features according to their individual association with the target feature. A feature set is optimal only when its features show a strong correlation with the target feature and a weak correlation with each other, so a high rank is given to a feature that satisfies the strong correlation criterion. Using this condition, the merit of a subset $S$ of features is given as

$$\text{Merit}_S = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k - 1)\,\bar{r}_{ff}}},$$

where $k$ denotes the number of features, $r$ is the sample correlation coefficient, $c$ is the class, and $f$ is a predictive feature. $\bar{r}_{cf}$ and $\bar{r}_{ff}$ represent the average feature-class correlation and the average feature-feature correlation, respectively.
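
A minimal sketch of the Filter stage is given below, assuming the standard correlation-based merit $k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$ with absolute Pearson correlations. The function names and the toy feature vectors are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation; the (n - 1) factors cancel out.

    Assumes neither x nor y is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cfs_merit(features, target):
    """Merit of a feature subset from average absolute correlations."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

target = [1.0, 2.0, 3.0, 4.0]
informative = [1.0, 2.0, 3.0, 4.0]     # perfectly correlated with the target
uninformative = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with the target

merit_one = cfs_merit([informative], target)
merit_two = cfs_merit([informative, uninformative], target)
print(merit_one, merit_two)
```

Adding a feature that carries no information about the target lowers the merit, which is the behavior the ranking relies on.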

The search strategy in a Wrapper method either removes features from or adds features to a candidate feature subset and finds the feature subset that maximizes the performance of the learning algorithm. For a classifier, for example, the optimal feature subset maximizes its accuracy. Two common search strategies used in Wrapper methods are forward selection and backward elimination. In this research work we use the sequential forward selection (SFS) approach because backward elimination does not work well when the initial number of features is very large. In that situation it takes more time and becomes infeasible, whereas forward selection is faster when only a small number of features needs to be selected. SFS starts with an empty feature set and then iteratively adds features one by one. In each iteration, the performance of the learning algorithm is evaluated using the generated feature subset. The search continues as long as adding a new feature improves the performance of the learning algorithm and stops when there is no improvement with respect to the current feature subset.

This paper proposes a 2-stage feature selection algorithm that combines the CFS and SFS feature selection techniques, as shown in Figure 1. In the first stage, CFS is used to calculate the importance of each feature and then rank the features in descending order. In the second stage, SFS is used for feature selection. SFS constructs a candidate feature subset by selecting one feature at a time and iteratively adding it to the subset. A temporary WLSTSVM classifier is constructed using the currently selected feature subset, and its performance is evaluated using 10-fold cross validation. The feature subset for which WLSTSVM performs best is selected as the optimal feature subset and is then used to construct the final classifier.
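
The wrapper stage can be sketched as a greedy loop. In the example below, `score_fn` stands in for "train a temporary classifier on this subset and return its cross-validated accuracy"; it, the feature names, and the toy scores are hypothetical placeholders, and the loop stops as soon as no candidate improves the score.

```python
def sequential_forward_selection(ranked_features, score_fn):
    """Greedy SFS over features given in descending filter rank.

    score_fn(subset) must return a number where higher is better,
    for example a cross-validated accuracy (hypothetical callback).
    """
    selected, best_score = [], float("-inf")
    remaining = list(ranked_features)
    while remaining:
        # Score every candidate extension of the current subset.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])  # ties keep rank order
        if score <= best_score:
            break  # no improvement: stop the search
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Toy scoring: useful features add to the score, the others subtract a little.
gain = {"glucose": 0.5, "bmi": 0.2, "age": 0.1}

def toy_score(subset):
    return sum(gain.get(f, -0.05) for f in subset)

ranked = ["glucose", "bmi", "skin", "age", "bp"]
subset, score = sequential_forward_selection(ranked, toy_score)
print(subset, round(score, 2))
```

In practice the cost is dominated by `score_fn`, since every candidate subset requires training and validating a classifier.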

4.3. Proposed Model

Figure 2 shows the block diagram of the proposed hybrid feature selection based WLSTSVM model. The model uses the hybrid feature selection (CFS + SFS) approach to select a relevant feature subset. First, the original dataset is preprocessed by normalization. Next, the dataset is divided into $k$ parts using the $k$-fold cross validation method. One part of the partitioned dataset is used for testing and the remaining parts are used for training the classifier. CFS is used to rank the features according to their relevance to the class, and the features are then arranged according to their rank. Next, SFS selects the single best feature, and this feature is used to train and test the WLSTSVM classifier. After that, SFS iteratively adds one feature to the previous feature subset, and the new feature subset is again used to train and test the model. This process is repeated until all features appear in the subset. Finally, we obtain the WLSTSVM disease diagnosis model which has the highest accuracy.
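
The preprocessing and partitioning steps can be sketched as follows. The text does not state which normalization is applied, so min-max scaling to [0, 1] is an assumption of this example, as are the helper names and the unshuffled, interleaved assignment of samples to folds.

```python
def min_max_normalize(column):
    """Scale one feature column to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

scaled = min_max_normalize([2.0, 4.0, 6.0])
splits = list(k_fold_indices(10, 5))
print(scaled)
print(splits[0])
```

Every sample appears in exactly one test fold, so averaging the per-fold accuracies uses each record for testing once.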

4.4. Performance Evaluation Parameters

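In this paper, we use predictive accuracy, sensitivity, specificity, and geometric mean to evaluate the performance of the proposed disease diagnosis model. These measures are defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$

$$\text{Geometric mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}},$$

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.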
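
These measures follow directly from the four confusion-matrix counts, as the short sketch below shows. The function name and the example counts are hypothetical and are not results from the paper.

```python
import math

def diagnostic_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and geometric mean from counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
    }

# Hypothetical confusion-matrix counts for illustration only.
metrics = diagnostic_metrics(tp=90, tn=40, fp=10, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

Unlike accuracy, the geometric mean drops sharply when either class is poorly recognized, which is why it is informative for imbalanced data.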

5. Results and Discussion

The WLSTSVM classifier model contains several parameters, namely the kernel width sigma and the penalty parameters $c_1$ and $c_2$, which require initial values. In this study, we use the grid search approach to select appropriate parameters: the penalty parameters and sigma are each selected from a predefined range of candidate values. The accuracy comparison of the different methods for Diabetes, Hepatitis, and Breast Cancer is shown in Table 3. We evaluate the performance of the proposed model and compare it with the traditional LSTSVM classifier without feature selection, the WLSTSVM classifier without feature selection, LSTSVM with undersampling and without feature selection, HFS based LSTSVM, and HFS based LSTSVM with undersampling. 10-fold cross validation is used for the performance comparison. We use undersampling rather than oversampling to balance the classes because undersampling has shown better performance than oversampling and does not introduce an unnatural bias towards the minority class.

First, LSTSVM is implemented using MATLAB R2012a and its accuracy is checked for all the given disease datasets without FS. The accuracies of LSTSVM without FS for the Diabetes, Hepatitis, and Breast Cancer datasets are 75.33%, 84.28%, and 93.28%, respectively. Then, we apply undersampling to balance the data of both classes and check the predictive accuracy of LSTSVM with the balanced data. LSTSVM with undersampling obtains 74.31%, 83.33%, and 92.83% accuracy for Diabetes, Hepatitis, and Breast Cancer, respectively. Next, we check the performance of the proposed WLSTSVM, which deals with the class imbalance problem, without any FS approach. As shown in Table 3, WLSTSVM performs well compared with the LSTSVM + undersampling approach. We then apply the HFS approach to LSTSVM, which enhances its accuracy for all the given disease datasets. In the same way, HFS based LSTSVM with undersampling also performs well compared with LSTSVM with undersampling. Finally, we apply HFS to the proposed WLSTSVM classifier model and achieve the best accuracy for the Diabetes, Hepatitis, and Breast Cancer datasets. The results indicate that HFS based WLSTSVM is a better choice for the construction of the disease diagnosis model. The proposed disease diagnosis model not only selects the relevant and useful features but also handles the class imbalance problem well.
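
A grid search over the two penalty parameters and sigma can be sketched as an exhaustive loop over candidate values. In the example below, the candidate grids (powers of 2) and the `evaluate` callback are hypothetical; in practice `evaluate` would return the cross-validated accuracy of the classifier trained with the given parameters.

```python
import itertools
import math

def grid_search(evaluate, c_values, sigma_values):
    """Try every (c1, c2, sigma) combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for c1, c2, sigma in itertools.product(c_values, c_values, sigma_values):
        score = evaluate(c1, c2, sigma)
        if score > best_score:  # strict comparison keeps the first of any ties
            best_params, best_score = (c1, c2, sigma), score
    return best_params, best_score

# Hypothetical candidate grids; the ranges used in the paper may differ.
c_values = [2.0 ** e for e in range(-3, 4)]
sigma_values = [2.0 ** e for e in range(-3, 4)]

def toy_evaluate(c1, c2, sigma):
    """Stand-in for cross-validated accuracy, peaking at (2, 0.5, 1)."""
    return -((math.log2(c1) - 1) ** 2 + (math.log2(c2) + 1) ** 2 + math.log2(sigma) ** 2)

best_params, best_score = grid_search(toy_evaluate, c_values, sigma_values)
print(best_params, best_score)
```

The number of evaluations grows with the product of the grid sizes, so coarse logarithmic grids are a common starting point.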

The performance of the proposed WLSTSVM classifier model is checked with and without the HFS approach. Table 4 shows that the predictive accuracy and geometric mean of WLSTSVM with reduced features are much better than those of WLSTSVM with the original features. The proposed model selects 5 features for the Diabetes dataset, 12 features for Hepatitis, and 5 features for Breast Cancer, as shown in Table 4. For Breast Cancer patients, the proposed model selects the Bare Nuclei, uniformity of cell shape, uniformity of cell size, clump thickness, and marginal adhesion features. The proposed model predicts Diabetes in pregnant women using their Plasma Glucose Level, Body Mass Index, number of pregnancies, Pedigree, and age. For Hepatitis, it selects Bilirubin, age, Histology, Antivirals, Alk Phosphatase, SGOT, Liver Big, Liver Firm, Steroid, Spleen Palpable, Varices, and Ascites to predict whether a patient suffering from hepatitis will die or survive. Sensitivity and specificity also improve with feature selection. The sensitivity of the proposed approach increases from 0.8709 to 0.9694 for Diabetes, from 0.9167 to 0.9514 for Hepatitis, and from 0.9800 to 1 for the Breast Cancer dataset. In the same manner, specificity increases from 0.6957 to 0.8311 for Diabetes, from 0.7554 to 0.8296 for Hepatitis, and from 0.9355 to 0.9737 for the Breast Cancer dataset.

Tables 5, 6, and 7 compare the predictive accuracy of the proposed HFS based WLSTSVM model with other existing classifiers for Diabetes, Hepatitis, and Breast Cancer, respectively [47]. The proposed WLSTSVM with HFS obtains 89.71% accuracy for Diabetes, which is better than the other existing approaches. It obtains 87.50% accuracy for Hepatitis and 98.55% accuracy for Breast Cancer diagnosis, which is comparable to the other existing models.

6. Conclusion

This paper is concerned with two issues: the class imbalance problem and the need for feature selection. We propose a WLSTSVM-based disease diagnosis model with HFS for the diagnosis of Breast Cancer, Diabetes, and Hepatitis. HFS combines the positive aspects of the Filter and Wrapper approaches. The model is useful not only for selecting significant features but also for handling the class imbalance problem effectively. The predictive accuracy for the Pima Diabetes dataset is 89.71%, which is better than other existing approaches. For Hepatitis and Breast Cancer, the predictive accuracy is 87.50% and 98.55%, respectively, which is comparable with other existing models. Relative to LSTSVM with undersampling, accuracy improves from 74.31% to 89.71% for the Diabetes dataset, from 92.83% to 98.55% for the Breast Cancer dataset, and from 83.33% to 87.50% for the Hepatitis dataset. These experimental results show the effectiveness of the proposed disease diagnosis model and indicate that feature selection combined with proper handling of the class imbalance problem may significantly enhance the accuracy of classifiers, which is necessary when handling real-world data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.