Abstract

We introduce a novel hybrid feature selection method based on rough conditional mutual information and the naive Bayesian classifier. Conditional mutual information is an important metric in feature selection, but it is hard to compute. We introduce a new measure, called rough conditional mutual information, which is based on rough sets; it is shown that the new measure can substitute for Shannon's conditional mutual information. Thus rough conditional mutual information can also be used to filter out irrelevant and redundant features. Subsequently, to further reduce the number of features and improve classification accuracy, a wrapper approach based on the naive Bayesian classifier is used to search for the optimal feature subset within the candidate feature subset selected by the filter model. Finally, the proposed algorithm is tested on several UCI datasets and compared with other classical feature selection methods. The results show that our approach obtains not only high classification accuracy but also the smallest number of selected features.

1. Introduction

With the increase of data dimensionality in many domains such as bioinformatics, text categorization, and image recognition, feature selection has become one of the most important preprocessing steps in data mining. The aim of feature selection is to find a minimal subset of the original features that best characterizes the data. Since feature selection brings many advantages, such as avoiding overfitting, facilitating data visualization, reducing storage requirements, and reducing training time, it has attracted considerable attention in various areas [1].

In the past two decades, many techniques have been proposed to address these challenging tasks. Dash and Liu [2] point out that there are four basic steps in a typical feature selection method, namely, subset generation, subset evaluation, stopping criterion, and validation. Most studies focus on the two major steps of feature selection: subset generation and subset evaluation. According to the subset evaluation function, feature selection methods can be divided into two categories: filter methods and wrapper methods [3]. Filter methods are independent of the predictor, whereas wrapper methods use its predictive power as the evaluation function. The merits of filter methods are their high computational efficiency and their generality. However, the results of filter methods are not always satisfactory, because the filter model separates feature selection from classifier learning and selects feature subsets independently of the learning algorithm. On the other hand, wrapper methods usually give good results, but they are very slow when applied to large datasets.

In this paper, we propose a new algorithm that combines rough conditional mutual information and the naive Bayesian classifier to select features. First, in order to decrease the computational cost of the wrapper search, a candidate feature set is selected by using rough conditional mutual information. Second, the candidate feature subset is further refined by a wrapper procedure. In this way we take advantage of both the filter and the wrapper models. The main goal of our research is to obtain a small number of features while keeping the classification accuracy high. This approach makes it possible to apply the filter-wrapper model efficiently to datasets from the UCI repository [4], obtaining better results than other classical feature selection approaches.

In the remainder of the paper, related work is first discussed in the next section. Section 3 presents the preliminaries on Shannon’s entropy and rough sets. Section 4 introduces the definitions of rough uncertainty measure and discusses their properties and interpretation. The proposed hybrid feature selection method is delineated in Section 5. The experimental results are presented in Section 6. Finally, a brief conclusion is given in Section 7.

2. Related Work

In filter-based feature selection techniques, a number of relevance measures have been applied to assess the usefulness of features for predicting decisions. These relevance measures can be divided into four categories: distance, dependency, consistency, and information. The most prominent distance-based method is Relief [5], which uses Euclidean distance to select relevant features. Since Relief works only for binary classes, Kononenko generalized it to multiple classes as Relief-F [6, 7]. However, Relief and Relief-F are unable to detect redundant features. Dependence or correlation measures quantify the ability to predict the value of one variable from the value of another. Hall's correlation-based feature selection (CFS) algorithm [8] is a typical representative of this category. Consistency measures try to preserve the discriminative power of the data in the original feature space; rough set theory is a popular technique of this sort [9]. Among these measures, mutual information (MI) is the most widely used one for computing relevance. MI is a well-known concept from information theory and has been used to capture the relevance and redundancy among features. In this paper, we focus on the information-based measure in the filter model.

The main advantages of MI are its robustness to noise and to data transformations. In contrast to other measures, MI is not limited to linear dependencies but captures nonlinear ones as well. Since Battiti proposed the mutual information feature selector (MIFS) [10], more and more researchers have studied information-based feature selection. MIFS selects the feature that maximizes the information about the class, corrected by subtracting a quantity proportional to its average MI with the previously selected features. Battiti demonstrated that MI can be very useful in feature selection problems and that MIFS can be used with any classification system, whatever the learning algorithm may be, because of its simplicity. Kwak and Choi [11] analyzed the limitations of MIFS and proposed a method called MIFS-U, which in general gives a better estimate of the MI between input attributes and output classes than MIFS. They showed that MIFS does not work well in nonlinear problems and proposed MIFS-U to improve on MIFS in such cases. Another variant of MIFS is the min-redundancy max-relevance (mRMR) criterion [12]. Its authors presented a theoretical analysis of the relationships among max-dependency, max-relevance, and min-redundancy and proved that mRMR is equivalent to max-dependency for first-order incremental search.
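To make the incremental criteria concrete, the following Python sketch implements a MIFS-style greedy selection for discrete features. The function names, the plug-in MI estimate, and the default value of the redundancy coefficient beta are our own illustrative choices, not the exact formulation of [10].

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (xv, yv), c in pxy.items():
        p_joint = c / n
        mi += p_joint * np.log2(p_joint / ((px[xv] / n) * (py[yv] / n)))
    return mi

def mifs_select(features, target, k, beta=0.5):
    """Greedy MIFS-style selection: relevance minus beta * accumulated redundancy.

    `features` maps feature names to discrete columns; `target` is the class column.
    """
    remaining = dict(features)
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(remaining[name], target)
            redundancy = sum(mutual_information(remaining[name], features[s])
                             for s in selected)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

The coefficient beta is exactly the configurable redundancy parameter criticized below.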

The limitations of the MIFS, MIFS-U, and mRMR algorithms are as follows. Firstly, they are all incremental search schemes that select one feature at a time. At each pass, these methods select the single feature with the maximum criterion value, without considering the interaction between groups of features. In many classification problems, a group of several features occurring together is relevant even though no individual feature in the group is relevant on its own, as in the XOR problem. Secondly, the redundancy coefficient is a configurable parameter that must be set experimentally. Thirdly, these criteria are not accurate enough to quantify the dependency among features with respect to a given decision.

Assume we are given an input feature set and a target class; our task is to select features from the pool such that their joint mutual information with the target is maximized. However, estimating mutual information from the available data is a great challenge, especially multivariate mutual information. Martínez Sotoca and Pla [13] and Guo et al. [14] proposed different methods to approximate multivariate conditional mutual information. Nevertheless, their proofs rest on the same inequality, which does not hold in general; it holds only if the random variables involved satisfy the Markov property. Many researchers have tried various methods to estimate mutual information. The most common are the histogram [15], kernel density estimation (KDE) [16], and k-nearest neighbor estimation (k-NN) [17]. The standard histogram partitions the axes into distinct bins of fixed width and counts the number of observations in each bin; this estimate is therefore highly dependent on the choice of bin width. Although KDE is better than the histogram, the bandwidth and kernel function are difficult to choose. The k-NN approach uses a fixed number of nearest neighbors to estimate the MI, but it is more suitable for continuous random variables.
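The following Python sketch illustrates the bin-width sensitivity of the histogram (plug-in) MI estimator on synthetic data; the sample size, noise level, and bin counts are arbitrary illustrative choices.

```python
import numpy as np

def histogram_mi(x, y, bins):
    """Histogram (plug-in) estimate of I(X;Y) in bits for continuous samples."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x + 0.5 * rng.normal(size=2000)        # a correlated pair
for bins in (5, 20, 80):                   # the estimate drifts with the bin choice
    print(bins, round(histogram_mi(x, y, bins), 3))
```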

This paper computes multivariate mutual information and multivariate conditional mutual information from a new perspective. Our method is based on rough entropy uncertainty measures. Several authors [18–21] have used Shannon's entropy and its variants to measure uncertainty in rough set theory. In this work, we propose several rough entropy-based metrics and establish some important properties of and relationships among these uncertainty measures. We then find a candidate feature subset by using rough conditional mutual information to filter out irrelevant and redundant features in the first stage. To overcome the limitations of the filter model, in the second stage we use the wrapper model with a sequential backward elimination scheme to search for an optimal feature subset within the candidate feature subset.

3. Preliminaries

In this section we briefly introduce some basic concepts and notations of information theory and rough set theory.

3.1. Entropy, Mutual Information, and Conditional Mutual Information

Shannon's information theory, first introduced in 1948 [22], provides a way to measure the information of random variables. The entropy is a measure of uncertainty of random variables [23]. Let $X$ be a discrete random variable and let $p(x)$ be the probability of $X = x$; the entropy of $X$ is defined by

$H(X) = -\sum_{x} p(x) \log p(x).$

Here the base of the logarithm is 2 and the unit of entropy is the bit. If $X$ and $Y$ are two discrete random variables, the joint probability is $p(x, y)$, where $x \in X$ and $y \in Y$. The joint entropy of $X$ and $Y$ is

$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y).$

When certain variables are known and others are not, the remaining uncertainty is measured by the conditional entropy

$H(Y \mid X) = -\sum_{x}\sum_{y} p(x, y) \log p(y \mid x).$

The information shared by two random variables is also of importance, and it is defined as the mutual information between the two variables,

$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$

If the mutual information between two random variables is large (small), the two variables are closely (weakly) related. If the mutual information is zero, the two random variables are independent. The mutual information and the entropy have the following relation:

$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$

For continuous random variables, the entropy and mutual information are defined analogously with integrals in place of sums:

$H(X) = -\int p(x) \log p(x)\, dx, \qquad I(X; Y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}\, dx\, dy.$

Conditional mutual information is the reduction in the uncertainty of $X$ due to knowledge of $Y$ when $Z$ is given. The conditional mutual information of random variables $X$ and $Y$ given $Z$ is defined by

$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z).$

Mutual information satisfies a chain rule; that is,

$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, X_2, \ldots, X_{i-1}).$
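As a minimal illustration of these discrete definitions, the following Python sketch computes entropy, mutual information, and conditional mutual information from empirical frequencies; the identity $I(X;Y \mid Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)$ used below follows from the definitions above, and the function names are ours.

```python
import numpy as np
from collections import Counter

def entropy(*columns):
    """Joint Shannon entropy H(X1,...,Xk) in bits from discrete sequences."""
    counts = Counter(zip(*columns))
    n = sum(counts.values())
    p = np.array([c / n for c in counts.values()])
    return float(-np.sum(p * np.log2(p)))

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(x) + entropy(y) - entropy(x, y)

def conditional_mutual_information(x, y, z):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)
```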

3.2. Rough Sets

Rough set theory, introduced by Pawlak [24], is a mathematical tool to handle imprecision, uncertainty, and vagueness. It has been applied in many fields [25], such as machine learning, data mining, and pattern recognition.

The notion of an information system provides a convenient basis for the representation of objects in terms of their attributes. An information system is a pair $S = (U, A)$, where $U$ is a nonempty finite set of objects called the universe and $A$ is a nonempty finite set of attributes; that is, every $a \in A$ is a mapping $a : U \to V_a$, where $V_a$ is called the domain of $a$. A decision table is a special case of an information system $S = (U, C \cup \{d\})$, where the attributes in $C$ are called condition attributes and $d \notin C$ is a designated attribute called the decision attribute.

For every set of attributes $B \subseteq A$, an indiscernibility relation $IND(B)$ is defined in the following way: two objects $x_i$ and $x_j$ are indiscernible by the set of attributes $B$ if $a(x_i) = a(x_j)$ for every $a \in B$. An equivalence class of $IND(B)$ is called an elementary set in $B$ because it represents the smallest discernible group of objects. For any element $x_i$ of $U$, the equivalence class of $x_i$ in the relation $IND(B)$ is denoted by $[x_i]_B$. For $B \subseteq A$, the indiscernibility relation $IND(B)$ constitutes a partition of $U$, which is denoted by $U/B$.

Given an information system $S = (U, A)$, for any subset $X \subseteq U$ and equivalence relation $IND(B)$ with $B \subseteq A$, the $B$-lower and $B$-upper approximations of $X$ are defined, respectively, as follows:

$\underline{B}X = \{x \in U : [x]_B \subseteq X\}, \qquad \overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\}.$
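A minimal Python sketch of these constructions is given below; objects are represented as dictionaries of attribute values with an "id" key, which is our own convention for the example.

```python
from collections import defaultdict

def partition(universe, attributes):
    """U/IND(B): group objects that agree on every attribute in B."""
    blocks = defaultdict(list)
    for obj in universe:
        key = tuple(obj[a] for a in attributes)
        blocks[key].append(obj["id"])
    return list(blocks.values())

def approximations(universe, attributes, target_ids):
    """B-lower and B-upper approximations of the set X given by target_ids."""
    target = set(target_ids)
    lower, upper = set(), set()
    for block in partition(universe, attributes):
        block_set = set(block)
        if block_set <= target:     # block entirely inside X
            lower |= block_set
        if block_set & target:      # block overlaps X
            upper |= block_set
    return lower, upper

# toy decision table: four objects described by attributes "a" and "b"
U = [{"id": 1, "a": 0, "b": 1}, {"id": 2, "a": 0, "b": 1},
     {"id": 3, "a": 1, "b": 0}, {"id": 4, "a": 1, "b": 1}]
print(approximations(U, ["a"], {1, 2, 3}))   # lower {1, 2}, upper {1, 2, 3, 4}
```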

4. Rough Entropy-Based Metrics

In this section, the concept of rough entropy is introduced to measure the uncertainty of knowledge in an information system, and then some rough entropy-based uncertainty measures are presented. Some important properties of these uncertainty measures are deduced, and the relationships among them are discussed as well.

Definition 1. Given a set of samples $U = \{x_1, x_2, \ldots, x_{|U|}\}$ described by the attribute set $A$, let $B \subseteq A$ be a subset of attributes. Then the rough entropy of the sample $x_i$ is defined by

$RH_B(x_i) = -\log \frac{|[x_i]_B|}{|U|},$

and the average entropy of the set of samples is computed as

$RH(B) = \frac{1}{|U|} \sum_{i=1}^{|U|} RH_B(x_i) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_B|}{|U|},$

where $|U|$ is the cardinality of $U$.

Since $1 \le |[x_i]_B| \le |U|$ for all $x_i \in U$, we have $0 \le RH(B) \le \log |U|$. $RH(B) = \log |U|$ if and only if $|[x_i]_B| = 1$ for all $x_i \in U$; that is, $U/B = \{\{x_1\}, \{x_2\}, \ldots, \{x_{|U|}\}\}$. $RH(B) = 0$ if and only if $[x_i]_B = U$ for all $x_i \in U$; that is, $U/B = \{U\}$. Obviously, when the knowledge $B$ can distinguish any two objects, the rough entropy is the largest; when the knowledge $B$ cannot distinguish any two objects, the rough entropy is zero.

Theorem 2. Consider $RH(B) = H(B)$, where $H(B)$ is Shannon's entropy of the partition $U/B$.

Proof. Suppose $U/B = \{X_1, X_2, \ldots, X_m\}$ and $|X_j| = k_j$, where $\sum_{j=1}^{m} k_j = |U|$; then $H(B) = -\sum_{j=1}^{m} \frac{k_j}{|U|} \log \frac{k_j}{|U|}$. Because $[x_i]_B = X_j$ for every $x_i \in X_j$ and hence $|[x_i]_B| = k_j$ for any $x_i \in X_j$, we have

$RH(B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_B|}{|U|} = -\frac{1}{|U|} \sum_{j=1}^{m} k_j \log \frac{k_j}{|U|} = -\sum_{j=1}^{m} \frac{k_j}{|U|} \log \frac{k_j}{|U|} = H(B).$

Theorem 2 shows that the rough entropy equals Shannon’s entropy.
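As a quick numerical illustration of Definition 1 and Theorem 2 under the formulations given above, the following Python sketch computes the rough entropy from equivalence-class sizes together with the Shannon entropy of the corresponding partition; the toy table is our own.

```python
import math
from collections import Counter

def rough_entropy(universe, attributes):
    """RH(B) = -(1/|U|) * sum_i log2(|[x_i]_B| / |U|)  (Definition 1)."""
    n = len(universe)
    keys = [tuple(obj[a] for a in attributes) for obj in universe]
    class_size = Counter(keys)                 # |[x_i]_B| shared by each block
    return -sum(math.log2(class_size[k] / n) for k in keys) / n

def shannon_entropy_of_partition(universe, attributes):
    """H(B) computed from block frequencies, for comparison with Theorem 2."""
    n = len(universe)
    sizes = Counter(tuple(obj[a] for a in attributes) for obj in universe)
    return -sum((k / n) * math.log2(k / n) for k in sizes.values())

U = [{"a": 0, "b": 1}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
print(rough_entropy(U, ["a"]), shannon_entropy_of_partition(U, ["a"]))  # both 1.0
```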

Definition 3. Suppose $B, C \subseteq A$ are two subsets of attributes; the joint rough entropy is defined as

$RH(B, C) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_{B \cup C}|}{|U|}.$

Due to $[x_i]_{B \cup C} = [x_i]_B \cap [x_i]_C$, therefore, $RH(B, C) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_B \cap [x_i]_C|}{|U|}$. According to Definition 3, we can observe that $RH(B, C) = RH(C, B)$.

Theorem 4. Consider $RH(B, C) \ge RH(B)$ and $RH(B, C) \ge RH(C)$.

Proof. For all $x_i \in U$ we have $[x_i]_{B \cup C} \subseteq [x_i]_B$ and $[x_i]_{B \cup C} \subseteq [x_i]_C$, and then $|[x_i]_{B \cup C}| \le |[x_i]_B|$ and $|[x_i]_{B \cup C}| \le |[x_i]_C|$. Therefore, $RH(B, C) \ge RH(B)$ and $RH(B, C) \ge RH(C)$.

Definition 5. Suppose $B, C \subseteq A$ are two subsets of attributes; the conditional rough entropy of $C$ to $B$ is defined as

$RH(C \mid B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_{B \cup C}|}{|[x_i]_B|}.$

Theorem 6 (chain rule). Consider $RH(B, C) = RH(B) + RH(C \mid B) = RH(C) + RH(B \mid C)$.

Proof. Consider

$RH(B) + RH(C \mid B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_B|}{|U|} - \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_{B \cup C}|}{|[x_i]_B|} = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_{B \cup C}|}{|U|} = RH(B, C).$

The equality $RH(B, C) = RH(C) + RH(B \mid C)$ follows in the same way.

Definition 7. Suppose $B, C \subseteq A$ are two subsets of attributes; the rough mutual information of $B$ and $C$ is defined as

$RI(B; C) = \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|U| \, |[x_i]_{B \cup C}|}{|[x_i]_B| \, |[x_i]_C|}.$

Theorem 8 (the relation between rough mutual information and rough entropy). Consider (1) $RI(B; C) = RI(C; B)$; (2) $RI(B; C) = RH(B) + RH(C) - RH(B, C)$; (3) $RI(B; C) = RH(B) - RH(B \mid C) = RH(C) - RH(C \mid B)$.

Proof. The conclusions of (1) and (3) are straightforward; here we give the proof of property (2).
(2) Consider

$RH(B) + RH(C) - RH(B, C) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \left( \log \frac{|[x_i]_B|}{|U|} + \log \frac{|[x_i]_C|}{|U|} - \log \frac{|[x_i]_{B \cup C}|}{|U|} \right) = \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|U| \, |[x_i]_{B \cup C}|}{|[x_i]_B| \, |[x_i]_C|} = RI(B; C).$

Definition 9. The rough conditional mutual information of $B$ and $C$ given $D$ is defined by

$RI(B; C \mid D) = \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_D| \, |[x_i]_{B \cup C \cup D}|}{|[x_i]_{B \cup D}| \, |[x_i]_{C \cup D}|}.$

Theorem 10. The following equations hold: (1) $RI(B; C \mid D) = RH(B \mid D) - RH(B \mid C \cup D)$; (2) $RI(B \cup D; C) = RI(D; C) + RI(B; C \mid D)$.

Proof. (1) Consider

$RH(B \mid D) - RH(B \mid C \cup D) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_{B \cup D}|}{|[x_i]_D|} + \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_{B \cup C \cup D}|}{|[x_i]_{C \cup D}|} = \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_D| \, |[x_i]_{B \cup C \cup D}|}{|[x_i]_{B \cup D}| \, |[x_i]_{C \cup D}|} = RI(B; C \mid D).$

(2) Consider

$RI(D; C) + RI(B; C \mid D) = \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|U| \, |[x_i]_{C \cup D}|}{|[x_i]_D| \, |[x_i]_C|} + \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|[x_i]_D| \, |[x_i]_{B \cup C \cup D}|}{|[x_i]_{B \cup D}| \, |[x_i]_{C \cup D}|} = \frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|U| \, |[x_i]_{B \cup C \cup D}|}{|[x_i]_{B \cup D}| \, |[x_i]_C|} = RI(B \cup D; C).$
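All of the rough measures in this section can be computed directly from equivalence-class sizes. The following Python sketch is written against the formulations given above (data represented as a list of dictionaries of discrete attribute values, our own convention) and is one possible implementation, not a definitive one.

```python
import math
from collections import Counter

def _class_sizes(universe, attrs):
    """|[x_i]_B| for every object, where B is the attribute list `attrs`."""
    keys = [tuple(obj[a] for a in attrs) for obj in universe]
    counts = Counter(keys)
    return [counts[k] for k in keys]

def rough_entropy(universe, B):
    n = len(universe)
    return -sum(math.log2(s / n) for s in _class_sizes(universe, B)) / n

def rough_joint_entropy(universe, B, C):
    return rough_entropy(universe, list(B) + list(C))                 # RH(B, C)

def rough_conditional_entropy(universe, B, C):
    # RH(B | C) = RH(B, C) - RH(C), by the chain rule (Theorem 6)
    return rough_joint_entropy(universe, B, C) - rough_entropy(universe, C)

def rough_mutual_information(universe, B, C):
    # RI(B; C) = RH(B) + RH(C) - RH(B, C)  (Theorem 8)
    return (rough_entropy(universe, B) + rough_entropy(universe, C)
            - rough_joint_entropy(universe, B, C))

def rough_conditional_mutual_information(universe, B, C, D):
    # RI(B; C | D) = RH(B | D) - RH(B | C ∪ D)  (Theorem 10)
    return (rough_conditional_entropy(universe, B, D)
            - rough_conditional_entropy(universe, B, list(C) + list(D)))
```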

5. A Hybrid Feature Selection Method

In this section, we propose a novel hybrid feature selection method based on rough conditional mutual information and naive Bayesian classifier.

5.1. Feature Selection by Rough Conditional Mutual Information

Given a set of samples described by the feature set $F = \{f_1, f_2, \ldots, f_n\}$, in terms of mutual information the purpose of feature selection is to find a feature subset $S$ with $k$ features which jointly have the largest dependency on the target class $c$. This criterion, called max-dependency, has the following form:

$\max_{S \subseteq F, \, |S| = k} I(S; c).$

According to the chain rule for information,

$I(f_1, f_2, \ldots, f_k; c) = \sum_{i=1}^{k} I(f_i; c \mid f_1, f_2, \ldots, f_{i-1});$

that is to say, we can select the feature which produces the maximum conditional mutual information, formally written as

$f^{*} = \arg\max_{f \in F \setminus S} I(f; c \mid S),$

where $S$ represents the set of already selected features.

Figure 1 illustrates the validity of this criterion. Here, represents a feature highly correlated with , and is much less correlated with . The mutual information between vectors and is indicated by a shadowed area consisting of three different patterns of patches; that is, , where , , and are defined by different cases of overlap. In detail,(1) is the mutual information between and , that is, ;(2) is the mutual information between and , that is, ;(3) is the mutual information between and , that is, ;(4) is the conditional mutual information between and given , that is, ;(5) is the mutual information between and , that is, .

This illustration clearly shows that the features maximizing the mutual information with the class not only depend on their individual predictive information but also need to take the redundancy between them into account. In this example, the feature with the largest mutual information with the class should be selected first; among the remaining features, the one that provides more complementary information to the already selected feature for predicting the class should have priority, even though another feature has larger individual mutual information with the class (as in Figure 1). That is to say, in each round we should select the feature that maximizes the conditional mutual information. From Theorem 2, we know that rough entropy equals Shannon's entropy; therefore, we can instead select the feature that produces the maximum rough conditional mutual information.

We adopt a forward feature selection algorithm. At each step, a single input feature is added to the selected feature set based on maximizing the rough conditional mutual information; that is, given the selected feature set $S$, we choose the feature $f$ from the remaining feature set that maximizes the rough conditional mutual information of $f$ and the target class $c$ given $S$. In order to apply the rough conditional mutual information measure well in the filter model, a numerical threshold value is set for it. This helps the algorithm to be resistant to noisy data and to overcome the overfitting problem to a certain extent [26]. The procedure is performed until this stopping criterion is satisfied. The filter algorithm can be described by the following procedure.
(1) Initialization: set $F$ = "initial set of all features," $S$ = "empty set," and $c$ = "class outputs."
(2) Computation of the rough mutual information of the features with the class outputs: for each feature $f \in F$, compute $RI(f; c)$.
(3) Selection of the first feature: find the feature $f$ that maximizes $RI(f; c)$; set $F \leftarrow F \setminus \{f\}$ and $S \leftarrow \{f\}$.
(4) Greedy selection: repeat until the termination condition is satisfied:
(a) computation of the rough conditional mutual information $RI(f; c \mid S)$ for each feature $f \in F$;
(b) selection of the next feature: choose the feature $f$ that maximizes $RI(f; c \mid S)$; set $F \leftarrow F \setminus \{f\}$ and $S \leftarrow S \cup \{f\}$.
(5) Output the set $S$ containing the selected features.
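A compact Python sketch of this filter procedure is given below. It reuses the rough measures sketched in Section 4 and uses an assumed stopping threshold delta on the gain (our naming, not necessarily the paper's exact criterion).

```python
def rcmi_filter(universe, features, target, delta=0.0, max_features=None):
    """Greedy forward selection by rough (conditional) mutual information.

    `universe` is a list of dicts holding discrete feature values and the class
    label under the key `target`; `delta` is the stopping threshold on the gain.
    Requires rough_mutual_information / rough_conditional_mutual_information
    from the sketch in Section 4.
    """
    remaining = list(features)
    selected = []
    while remaining and (max_features is None or len(selected) < max_features):
        def gain(f):
            if not selected:
                return rough_mutual_information(universe, [f], [target])
            return rough_conditional_mutual_information(
                universe, [f], [target], selected)
        best = max(remaining, key=gain)
        if gain(best) <= delta:          # termination condition
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```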

5.2. Selecting the Best Feature Subset on Wrapper Approach

The wrapper model uses the classification accuracy of a predetermined learning algorithm to determine the goodness of the selected subset. It searches for features that are better suited to the learning algorithm, aiming at improving its performance; therefore, the wrapper approach generally outperforms the filter approach in terms of the final predictive accuracy of the learning machine. However, it is more computationally expensive than the filter model. Although many wrapper methods do not perform an exhaustive search, most of them still incur a time complexity that grows rapidly with the number of features of the dataset [27, 28]. Hence, it is worth reducing the search space before using the wrapper feature selection approach. Starting from the filter result reduces the computational cost and helps avoid the local maximum problem, so the final subset of features obtained contains few features while the predictive accuracy remains high.

In this work, we propose reducing the search space from the original feature set to the best candidate subset, which reduces the computational cost of the wrapper search effectively. Our method uses the sequential backward elimination technique to search the space of subsets of the candidate features.

The features are ranked according to the average accuracy of the classifier, and then features are removed one by one from the candidate feature subset only if such exclusion improves or does not change the classifier accuracy. Different kinds of learning models can be applied in wrappers, and different learning machines have different discrimination abilities. The naive Bayesian classifier is widely used in machine learning because it is fast and easy to implement. Rennie et al. [29] show that its performance is competitive with state-of-the-art models like SVM, while the latter has many parameters to tune. Therefore, we choose the naive Bayesian classifier as the core of the fine-tuning stage. The decrement selection procedure for selecting an optimal feature subset based on the wrapper approach is shown in Algorithm 1.

Input: data set D, candidate feature set C
Output: an optimal feature set C
(1) acc ← Classperf(D, C)
(2) set L ← empty list
(3) for all f ∈ C do
(4)   Score(f) ← Classperf(D, C − {f})
(5)   append (f, Score(f)) to L
(6) end for
(7) sort L in ascending order according to the Score value
(8) while |C| > 1 do
(9)   for all f ∈ L, taken in order, do
(10)    acc(f) ← Classperf(D, C − {f})
(11)    if (acc(f) − acc)/acc > λ1 then
(12)      C ← C − {f}, acc ← acc(f)
(13)      go to Step 8
(14)    end if
(15)    select the feature f* with the maximum acc(f)
(16)  end for
(17)  if acc(f*) ≥ acc then
(18)    C ← C − {f*}, acc ← acc(f*)
(19)    go to Step 8
(20)  end if
(21)  if (acc − acc(f*))/acc ≤ λ2 then
(22)    C ← C − {f*}, acc ← acc(f*)
(23)    go to Step 8
(24)  end if
(25)  go to Step 27
(26) end while
(27) Return the optimal feature subset C

There are two stages in the wrapper algorithm, as shown in Algorithm 1. In the first stage, we compute the classification accuracy of the candidate feature set produced by the filter model (Step 1), where Classperf(D, C) denotes the average classification accuracy on dataset D with candidate features C; the results are obtained by 10-fold cross-validation. For each feature f ∈ C, we compute the average accuracy obtained when f is excluded, and the features are then ranked according to this score (Steps 3–6). In the second stage, we traverse the list of ordered features from the first to the last ranked feature (Steps 8–26). For each feature in the list, we consider the average accuracy of the naive Bayesian classifier when that feature is excluded. If a feature is found whose exclusion improves the relative accuracy [30] by more than λ1 (Steps 11–14), the feature is removed immediately. Otherwise, every possible feature is considered and the feature whose exclusion gives the largest average accuracy is chosen (Step 15); it is removed if the average accuracy improves or remains unchanged (Steps 17–20) or if the relative accuracy degrades by no more than λ2 (Steps 21–24). In general, λ1 should take a value in [0, 0.1] and λ2 should take a value in [0, 0.02]; in the following experiments, fixed values in these ranges are used unless otherwise specified.

This decrement selection procedure is repeated until the termination condition is satisfied. Usually, sequential backward elimination is more computationally expensive than incremental sequential forward search, but it can yield a better result with respect to local maxima. In addition, a sequential forward search that adds one feature at each pass does not take the interaction between groups of features into account [31]. In many classification problems, the class variable may be affected by a group of several features but not by any individual feature alone; a forward search is therefore unable to find the dependencies between groups of features, and its performance can sometimes degrade.
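As an illustration of the wrapper stage, the following Python sketch performs a simplified sequential backward elimination with a naive Bayesian classifier and 10-fold cross-validation from scikit-learn. The tolerance parameter lam2 mirrors the relative-accuracy threshold λ2 discussed above, and the control flow is condensed rather than a line-by-line transcription of Algorithm 1.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def classperf(X, y, feature_idx):
    """Average 10-fold CV accuracy of naive Bayes on the given feature columns."""
    return cross_val_score(GaussianNB(), X[:, list(feature_idx)], y, cv=10).mean()

def wrapper_backward(X, y, candidate_idx, lam2=0.01):
    """Simplified sequential backward elimination over the candidate subset."""
    current = list(candidate_idx)
    acc = classperf(X, y, current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        # accuracy obtained when each single feature is excluded
        scores = {f: classperf(X, y, [g for g in current if g != f])
                  for f in current}
        best_f = max(scores, key=scores.get)
        best_acc = scores[best_f]
        # drop the feature if accuracy improves, stays equal, or degrades
        # by no more than the relative tolerance lam2
        if best_acc >= acc or (acc - best_acc) / acc <= lam2:
            current.remove(best_f)
            acc = best_acc
            improved = True
    return current, acc
```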

6. Experimental Results

This section evaluates our method in terms of classification accuracy and the number of selected features in order to see how well the filter-wrapper approach performs on datasets with medium to large numbers of features. In addition, the performance of the rough conditional mutual information (RCMI) algorithm is compared with three typical feature selection methods based on three different evaluation criteria: correlation-based feature selection (CFS), the consistency-based algorithm, and min-redundancy max-relevance (mRMR). The results illustrate the efficiency and effectiveness of our method.

In order to compare our hybrid method with these classical techniques, 10 datasets were downloaded from the UCI repository of machine learning databases. All these datasets are widely used by the data mining community for evaluating learning algorithms. The details of the 10 UCI datasets are listed in Table 1. The sizes of the datasets vary from 101 to 2310 samples, the numbers of original features vary from 12 to 279, and the numbers of classes vary from 2 to 19.

6.1. Unselect versus CFS, Consistency-Based Algorithm, mRMR, and RCMI

In Section 5, rough conditional mutual information is used to filter out redundant and irrelevant features. In order to compute the rough mutual information, we employ Fayyad and Irani's MDL discretization algorithm [32] to transform continuous features into discrete ones.
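Fayyad and Irani's MDL discretizer is not part of standard scikit-learn; purely as a stand-in to illustrate this preprocessing step, the sketch below uses equal-frequency binning, which is not the MDL method used in our experiments.

```python
from sklearn.preprocessing import KBinsDiscretizer

def discretize(X_continuous, n_bins=5):
    """Equal-frequency binning as a simple stand-in for MDL discretization."""
    disc = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    return disc.fit_transform(X_continuous).astype(int)
```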

We use the naive Bayesian and CART classifiers to test the classification accuracy of the features selected by the different feature selection methods. The results in Tables 2 and 3 show the classification accuracies and the numbers of selected features obtained with the original feature set (unselect), RCMI, and the other feature selectors. According to Tables 2 and 3, the features selected by RCMI give the highest average accuracy for both naive Bayes and CART. It can also be observed that RCMI achieves the smallest average number of selected features, the same as mRMR. This shows that RCMI is better than CFS and the consistency-based algorithm and is comparable to mRMR.

In addition, to illustrate the efficiency of RCMI, we experiment on the Ionosphere, Sonar, and Wine datasets. Different numbers of features selected by RCMI and mRMR are tested with the naive Bayesian classifier, as presented in Figures 2, 3, and 4. In Figures 2–4, the classification accuracies are the results of 10-fold cross-validation with naive Bayes, and the value on the x-axis is the number of top-ranked features taken in the order selected by each method. The results in Figures 2–4 show that the average accuracy of the classifier with RCMI is comparable to mRMR in the majority of cases. We can see that the maximum value of the plot for each dataset is higher with RCMI than with mRMR. For example, the highest accuracy achieved on Ionosphere by RCMI is 94.87%, while the highest accuracy achieved by mRMR is 90.60%. At the same time, RCMI reaches its top accuracies at more points along the curve than mRMR. This shows that RCMI is an effective measure for feature selection.

However, the number of features selected by RCMI is still large for some datasets. Therefore, to improve performance and further reduce the number of selected features, the wrapper method is applied. With the redundant and irrelevant features removed, the classifier at the core of the wrapper can perform the fine tuning much faster.

6.2. Filter Wrapper versus RCMI and Unselect

Similarly, we also use naive Bayes and CART to test the classification accuracy of the features selected by the filter wrapper, RCMI, and unselect. The results in Tables 4 and 5 show the classification accuracies and the numbers of selected features.

Now we analyze the performance of these selected features. First, although most features are removed from the raw data, the classification accuracies do not decrease; on the contrary, they increase for the majority of datasets. The average accuracies obtained by RCMI and by the filter-wrapper method are both higher than those on the unselect datasets for naive Bayes and CART. With the naive Bayesian learning algorithm, the average accuracy is 91.47% for the filter wrapper versus 89.86% for unselect, a relative increase of 1.8%. With the CART learning algorithm, the average accuracy is 89.94% for the filter wrapper versus 88.08% for unselect, a relative increase of 2.1%. The average number of selected features is 5.3 for the filter wrapper, compared with 9.3 for RCMI and 50.6 for unselect, a reduction of 43% and 89.5%, respectively. Therefore, the average classification accuracy and number of features obtained by the filter-wrapper method are better than those obtained by RCMI and unselect. In other words, using RCMI and the wrapper method as a hybrid improves classification efficiency and accuracy compared with using the RCMI method alone.
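The relative changes quoted above can be checked with a few lines of Python.

```python
nb_unselect, nb_fw = 89.86, 91.47
cart_unselect, cart_fw = 88.08, 89.94
print(round(100 * (nb_fw - nb_unselect) / nb_unselect, 1))        # ~1.8 (%)
print(round(100 * (cart_fw - cart_unselect) / cart_unselect, 1))  # ~2.1 (%)

n_fw, n_rcmi, n_unselect = 5.3, 9.3, 50.6
print(round(100 * (1 - n_fw / n_rcmi), 1))       # ~43.0 (%) fewer than RCMI
print(round(100 * (1 - n_fw / n_unselect), 1))   # ~89.5 (%) fewer than unselect
```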

7. Conclusion

The main goal of feature selection is to find a feature subset that is as small as possible while retaining high prediction accuracy. A hybrid feature selection approach that takes advantage of both the filter model and the wrapper model has been presented in this paper. In the filter model, measuring the relevance between features plays an important role, and a number of measures have been proposed; mutual information is widely used for its robustness. However, mutual information is difficult to compute, especially multivariate mutual information. We proposed a set of rough entropy-based metrics to measure the relevance between features and analyzed some important properties of these uncertainty measures. We proved that RCMI can substitute for Shannon's conditional mutual information; thus, RCMI can be used as an effective measure to filter out irrelevant and redundant features. Based on the candidate feature subset produced by RCMI, the naive Bayesian classifier is applied in the wrapper model, and the accuracy of the naive Bayesian and CART classifiers is used to evaluate the goodness of feature subsets. The performance of the proposed method was evaluated on ten UCI datasets. The experimental results show that the filter-wrapper method outperforms CFS, the consistency-based algorithm, and mRMR in most cases. Our technique not only chooses a small subset of features from the candidate subset but also provides good predictive accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (70971137).