Abstract

Feature selection is a key step in the analysis of high-dimensional small sample data. Its core task is to quantify the correlation between features and class labels and the redundancy between features. However, most existing feature selection algorithms consider only the classification contribution of individual features and ignore the influence of interfeature redundancy and correlation. Therefore, this paper proposes a nonlinear dynamic conditional relevance feature selection algorithm (NDCRFS) based on a study and analysis of the ideas and methods of existing feature selection algorithms. Firstly, the redundancy and relevance between features and between features and class labels are discriminated by mutual information, conditional mutual information, and interaction mutual information. Secondly, the selected features and candidate features are dynamically weighted using an information gain factor. Finally, to evaluate its performance, NDCRFS was compared with 6 other feature selection algorithms on three classifiers, using 12 different data sets, in terms of both the variability between the selected feature subsets and the resulting classification metrics. The experimental results show that the NDCRFS method can improve the quality of the feature subsets and obtain better classification results.

1. Introduction

In the era of big data, the dimensionality of small sample data has increased dramatically, leading to the curse of dimensionality. High-dimensional data contain many irrelevant and redundant features, which not only increase computational complexity but also reduce the accuracy and efficiency of classification methods; such features therefore need to be removed with data dimensionality reduction techniques in the preprocessing stage. Feature selection [1–5] differs from other data dimensionality reduction techniques (e.g., feature extraction) [6] in that it focuses on analysing the relevance and redundancy in high-dimensional data, removing as many irrelevant and redundant features as possible while retaining the relevant original physical features. This approach not only improves data quality and classification performance but also reduces model training time and makes the model more interpretable [7–9].

Feature selection methods can be classified into three types: filter methods [10, 11], wrapper methods [12], and embedded methods [13]. Due to their high computational efficiency and generality, filter methods are easily applied even to ultra-high-dimensional data sets, and the filter approach is adopted in this paper. Filter feature selection methods can be further classified, according to their metrics, into rough-set-based [14], statistics-based [15], and information-based [16] methods. Among these, information-theoretic feature selection algorithms are currently the most active research direction for filter methods. Feature selection algorithms based on information theory are usually further divided into mutual information metrics [17, 18], conditional mutual information metrics [1, 19], interaction mutual information metrics [20–22], and so on. However, these methods judge whether features are redundant and relevant under only a single condition, so they cannot obtain the optimal feature subset. The main differences between feature extraction in deep learning and information-theoretic filter feature selection are twofold: (1) from a business perspective, feature selection algorithms can analyse features, whereas feature extraction can only perform pattern mapping and not correlation analysis; (2) from an efficiency perspective, feature extraction requires more computational resources and longer training time, whereas feature selection can be performed on a low-performance server.

In a high-dimensional small sample environment, dynamically searching for redundancy and correlation between features has become a pressing problem, given the diversity and high dimensionality of the data. This paper proposes a nonlinear dynamic conditional relevance feature selection algorithm (NDCRFS). The innovations and contributions of this paper are as follows:

(1) Firstly, the correlation between individual features and class labels is calculated by mutual information. Secondly, the correlation between the candidate features and the selected features under the class label is calculated using conditional mutual information. Finally, the correlation and redundancy between features are judged by the interaction information. This method solves the problem of how to measure the relevance and redundancy between selected features and candidate features.

(2) The interaction information is normalized by an information gain factor, which dynamically balances the interaction information values.

(3) Experimental comparisons on 12 benchmark data sets with k-nearest neighbour (KNN), decision tree (C4.5), and support vector machine (SVM) classifiers show that the NDCRFS algorithm outperforms other feature selection algorithms (Mutual Information Maximization (MIM) [23], Interaction Gain-Recursive Feature Elimination (IG-RFE) [24], Interaction Weight Feature Selection (IWFS) [21], Conditional Mutual Information Maximization (CMIM) [25], Dynamic Weighting-based Feature Selection (DWFS) [26], and Conditional Infomax Feature Extraction (CIFE) [23]). The experimental results demonstrate that the proposed NDCRFS criterion is effective and can select feature subsets with good classification performance.

The rest of the paper is organised as follows. Section 2 reviews mutual information and conditional mutual information. Section 3 reviews related work on filter feature selection algorithms. Section 4 discusses how to define independent feature relevance and redundancy, new classification information relevance, and the dependency relevance and redundancy of interaction features. Section 5 describes the process and implementation details of the NDCRFS algorithm. Section 6 validates the effectiveness of the NDCRFS algorithm through a comprehensive evaluation on 12 data sets from ASU and UCI, together with a discussion of the results. Section 7 concludes the paper and points out the shortcomings and future development of the NDCRFS algorithm.

2. Mutual Information and Conditional Mutual Information

Let $X$, $Y$, and $Z$ be three discrete random variables [27], where $x \in X$, $y \in Y$, and $z \in Z$. The mutual information between $X$ and $Y$ is defined as follows:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}. \qquad (1)$$

In the above equation, $p(x, y)$ refers to the joint distribution of $X$ and $Y$, and $p(x)$ and $p(y)$ refer to the marginal distributions.

Also, the conditional mutual information of $X$ and $Y$ given $Z$ is defined as follows:

$$I(X; Y \mid Z) = \sum_{x \in X} \sum_{y \in Y} \sum_{z \in Z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}. \qquad (2)$$
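For readers who wish to experiment with these quantities, the following minimal Python sketch (not part of the original paper) estimates both definitions from the empirical distributions of discrete, integer-coded variables; the function names and the use of base-2 logarithms are our own choices.

```python
# Minimal sketch (not the authors' code): empirical estimates of equations (1)
# and (2) for discrete, integer-coded 1-D arrays.
import numpy as np
from collections import Counter

def mutual_info(x, y):
    """I(X;Y) in bits, estimated from the empirical joint distribution."""
    n = len(x)
    pxy = Counter(zip(x, y))                      # joint counts
    px, py = Counter(x), Counter(y)               # marginal counts
    return sum((c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = sum_z p(z) * I(X;Y | Z=z), estimated empirically."""
    x, y, z = map(np.asarray, (x, y, z))
    n = len(z)
    return sum((np.sum(z == v) / n) * mutual_info(x[z == v], y[z == v])
               for v in np.unique(z))
```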

3. Related Work

A large number of filter feature selection algorithms have been proposed. They mainly use forward search to find the optimal subset of features, evaluating the relevance between features and class labels and the redundancy between features with their respective criteria. Let $F$ be the original feature set and $S$ the selected feature subset, let $J(\cdot)$ represent the evaluation criterion, and let $f_k$ denote a candidate feature and $f_j$ a selected feature, where $f_k \in F - S$ and $f_j \in S$.

Lewis et al. proposed the MIM algorithm, which selects from $F$ the features most relevant to the class label $C$. The MIM criterion is as follows:

$$J_{\mathrm{MIM}}(f_k) = I(f_k; C). \qquad (3)$$

Lin et al. studied the limitations of the MIM algorithm and proposed the CIFE algorithm, which is evaluated with the following criterion:

$$J_{\mathrm{CIFE}}(f_k) = I(f_k; C) - \sum_{f_j \in S} \left[ I(f_k; f_j) - I(f_k; f_j \mid C) \right]. \qquad (4)$$

In $J_{\mathrm{CIFE}}$, in addition to the redundancy $I(f_k; f_j)$ between features, the conditional redundancy $I(f_k; f_j \mid C)$ under the class label $C$ is also measured.

Yang et al. [28] proposed the Joint Mutual Information (JMI) algorithm, which is evaluated with the following criterion:

$$J_{\mathrm{JMI}}(f_k) = I(f_k; C) - \frac{1}{|S|} \sum_{f_j \in S} \left[ I(f_k; f_j) - I(f_k; f_j \mid C) \right], \qquad (5)$$

where $J_{\mathrm{JMI}}$ has only one additional weighting factor, $1/|S|$, over $J_{\mathrm{CIFE}}$, and $|S|$ represents the number of selected features.

Fleuret proposed the CMIM algorithm according to the maximum-minimum criterion, which is evaluated as follows:

$$J_{\mathrm{CMIM}}(f_k) = \min_{f_j \in S} I(f_k; C \mid f_j). \qquad (6)$$

The difference between $J_{\mathrm{CMIM}}$ and $J_{\mathrm{JMI}}$ is that $J_{\mathrm{CMIM}}$ uses a nonlinear (minimum-based) criterion, while $J_{\mathrm{JMI}}$ uses a linear cumulative summation criterion.
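The four criteria above can be written directly on top of the `mutual_info` and `cond_mutual_info` helpers from the earlier sketch. The following hypothetical functions score one candidate column `k` of an integer-coded feature matrix `X` against the selected index set `S` and labels `C`; they illustrate equations (3)–(6) only and are not taken from the compared implementations.

```python
# Illustrative scoring functions for equations (3)-(6); X is an integer-coded
# feature matrix (numpy array), C the label vector, k a candidate column index,
# and S a list of already-selected column indices.
def j_mim(X, C, k, S):
    return mutual_info(X[:, k], C)

def j_cife(X, C, k, S):
    penalty = sum(mutual_info(X[:, k], X[:, j]) - cond_mutual_info(X[:, k], X[:, j], C)
                  for j in S)
    return mutual_info(X[:, k], C) - penalty

def j_jmi(X, C, k, S):
    if not S:                       # no selected features yet: reduces to MIM
        return mutual_info(X[:, k], C)
    penalty = sum(mutual_info(X[:, k], X[:, j]) - cond_mutual_info(X[:, k], X[:, j], C)
                  for j in S) / len(S)
    return mutual_info(X[:, k], C) - penalty

def j_cmim(X, C, k, S):
    if not S:
        return mutual_info(X[:, k], C)
    return min(cond_mutual_info(X[:, k], C, X[:, j]) for j in S)   # "max-min" form
```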

Sun et al. considered nonlinear criteria with low computational cost and therefore proposed the DWFS algorithm, whose evaluation criterion (7) dynamically weights each candidate feature so as to distinguish relevant features from redundant ones.

Hu et al. [29] proposed the Dynamic Relevance and Joint Mutual Information Maximization (DRJMIM) algorithm, criterion (8), based on the DWFS and JMIM algorithms. It mainly addresses the definition of feature relevance, that is, how to distinguish the relevance of candidate features from the relevance of selected features.

Xiao et al. [30] believed that the redundancy between features can be used to further improve classification accuracy and, on this basis, proposed the Dynamic Weights Using Redundancy (DWUR) algorithm. Its evaluation criterion (9) contains one additional term compared with the earlier dynamic weighting criteria.

In summary, the analysis of equations (3) to (9) shows that the existing feature selection algorithms suffer from some of the following problems: (1) redundant and irrelevant features are not completely eliminated; (2) interdependent features are often removed as redundant features because they are highly correlated with each other, that is, these algorithms ignore judgements about the relevance and redundancy of interdependent features; (3) the dependency relevance and redundancy of interaction features, which can be judged from the difference between conditional mutual information and mutual information, are not fully exploited. Therefore, the study of better feature selection criteria remains an urgent problem.

4. Evaluation Basis for Feature Selection

Bennasar et al. [31] argued that a feature $f_i$ is considered useful if it is related to the class label $C$; otherwise, $f_i$ is considered useless. This assumption treats the features as completely independent of each other. In reality, the correlation between a feature and the class label varies as different features are added, from which it can be concluded that there are interdependencies between features and that the correlation and redundancy between features and class labels change dynamically. In this section, the relevance and redundancy of independent and dependent features are analysed and discussed.

4.1. Independent Feature Relevance and Redundancy Analysis

Mutual information $I(f_i; C)$ is often used to assess the correlation between a feature $f_i$ and the class label $C$: the stronger the correlation, the larger $I(f_i; C)$ is; the weaker the correlation, the closer $I(f_i; C)$ is to 0. If $I(f_i; C) > I(f_j; C)$, then the correlation between feature $f_i$ and the class label is stronger than the correlation between feature $f_j$ and the class label; if $I(f_i; C) < I(f_j; C)$, the correlation of $f_i$ with the class label is weaker than that of $f_j$.

The mutual information $I(f_i; f_j)$ is often used to assess the correlation between feature $f_i$ and feature $f_j$. If the correlation between $f_i$ and $f_j$ is high, the redundancy between the two features is strong; conversely, the redundancy is weak. When $I(f_i; f_j) = 0$, the features $f_i$ and $f_j$ are independent of each other. When $I(f_i; f_j)$ is sufficiently large, features $f_i$ and $f_j$ are regarded as redundant, and one of them ($f_i$ or $f_j$) is deleted.
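A small sketch of these two checks follows; the relevance ranking mirrors the $I(f_i; C)$ comparison above, while the redundancy test uses a pairwise threshold that is a hypothetical, data-dependent choice rather than a value prescribed by the paper.

```python
# Illustrative only: rank features by relevance to the label and flag
# highly redundant feature pairs; `threshold` is a hypothetical cut-off.
import numpy as np

def rank_by_relevance(X, C):
    """Feature indices sorted by decreasing I(f_i; C)."""
    scores = np.array([mutual_info(X[:, i], C) for i in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores

def redundant_pairs(X, threshold):
    """Pairs (i, j) whose pairwise mutual information exceeds the threshold."""
    d = X.shape[1]
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if mutual_info(X[:, i], X[:, j]) > threshold]
```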

4.2. Relevance Analysis of New Classification Information

If $I(f_k; C \mid f_j) > 0$, the candidate feature $f_k$ can provide new classification information beyond the selected feature $f_j$. If $I(f_k; C \mid f_j) = 0$, the candidate feature $f_k$ cannot provide any useful classification information, and $f_k$ and the class label $C$ are conditionally independent given $f_j$.

If $I(f_k; C \mid f_j) > I(f_i; C \mid f_j)$, feature $f_k$ provides more new classification information than feature $f_i$.
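Under this reading, the new classification information of a candidate is a single conditional mutual information value; the hypothetical helper below (reusing `cond_mutual_info` from the earlier sketch) compares two candidate columns on this quantity.

```python
# Sketch: "new classification information" of candidate f_k given selected f_j,
# read here as I(f_k; C | f_j); an illustration, not the paper's code.
def new_class_info(X, C, k, j):
    return cond_mutual_info(X[:, k], C, X[:, j])

# Example comparison: prefer the candidate that adds more information beyond f_j.
# better = k1 if new_class_info(X, C, k1, j) > new_class_info(X, C, k2, j) else k2
```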

4.3. Relevance and Redundancy of Interaction Feature Dependencies

According to the literature [6, 18, 29], if the relevance of the selected feature $f_j$ to the class label $C$ becomes stronger after the candidate feature $f_k$ is added, that is, $I(f_j; C \mid f_k) > I(f_j; C)$, the candidate feature $f_k$ can provide more classification information.

If $I(f_j; C \mid f_k) < I(f_j; C)$, the correlation between the selected feature $f_j$ and the class label $C$ weakens after the candidate feature $f_k$ is added, indicating that the candidate feature $f_k$ and the selected feature $f_j$ are redundant with respect to each other.
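This dependency test reduces to the sign of a single gap. The following one-line sketch (an illustration, not the NDCRFS criterion itself) computes it with the helpers defined earlier.

```python
# Sketch of the interaction-dependency test of Section 4.3: positive values
# suggest f_k adds class information about the selected feature f_j; negative
# values suggest f_k and f_j are mutually redundant.
def interaction_gap(X, C, k, j):
    return cond_mutual_info(X[:, j], C, X[:, k]) - mutual_info(X[:, j], C)
```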

5. NDCRFS Algorithm Description and Pseudocode Implementation

The feature selection algorithm seeks to search for sets of features that are closely related to the class labels. To measure the relevance of features to class labels more accurately, the NDCRFS algorithm measures the relevance and redundancy of features in three ways:

(1) $I(f_k; C)$, to measure the relevance of the candidate feature $f_k$ to the class label $C$

(2) $I(f_k; f_j \mid C)$, to measure the relevance of the candidate feature $f_k$ to the selected feature $f_j$ under the class label $C$

(3) the interaction information between $f_k$ and $f_j$ under the class label $C$, to measure their interaction correlation and redundancy (Section 4.3)

Therefore, the evaluation criterion of the NDCRFS algorithm combines these three quantities into a single score, given by equation (10). In equation (10), an information gain factor is used to normalize the interaction information; $f_k$ denotes a candidate feature and $f_j$ a selected feature, where $f_k \in F - S$ and $f_j \in S$.

According to equation (10), the NDCRFS algorithm first selects the least redundant features from $F$ based on the correlation analysis between the selected features $f_j$ and the candidate features $f_k$; it then iteratively adds the features most relevant to the optimal feature subset $S$. Its pseudocode is as follows.

Input: original feature set $F$; class label set $C$; threshold $\delta$
Output: optimal feature subset $S$
(1) initialization: $S = \varnothing$; $k = 1$;
(2) for each feature $f_i \in F$ do
(3)  calculate the mutual information value $I(f_i; C)$ of feature $f_i$ and label $C$;
(4)  if $I(f_i; C) < \delta$ then
(5)   remove $f_i$ from $F$ and continue;
(6)  end
(7) end
(8) $f^{*} =$ the feature in $F$ with the largest $I(f_i; C)$;
(9) $F = F - \{f^{*}\}$;
(10) $S = S \cup \{f^{*}\}$;
(11) while $F \neq \varnothing$ do
(12)  calculate the value of $I(f_k; C)$ for each candidate feature $f_k \in F$;
(13)  if $I(f_k; C) > 0$ then
(14)   calculate the value of $I(f_k; f_j \mid C)$;
(15)   calculate the value of the interaction information between $f_k$ and $f_j$;
(16)   update $J(f_k)$ using equation (10);
(17)   find the candidate feature $f^{*}$ with the largest $J(f_k)$;
(18)  end
(19)  $S = S \cup \{f^{*}\}$;
(20)  $F = F - \{f^{*}\}$;
(21)  $k = k + 1$;
(22) end

From Algorithm 1, line 1 initializes the set $S$ and the counter $k$. In lines 2 to 7, the mutual information $I(f_i; C)$ of each feature in the set $F$ is calculated, and features whose relevance falls below the threshold $\delta$ are removed. In lines 8 to 10, the selected optimal feature $f^{*}$ is removed from set $F$ and added to set $S$; at this point, the candidate feature becomes a selected feature $f_j$. In lines 11 to 18, the values of $I(f_k; C)$, $I(f_k; f_j \mid C)$, and the interaction information are calculated, and the criterion $J(f_k)$ is updated by equation (10).
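For orientation, the following Python sketch mirrors the structure of Algorithm 1 only; it is not the authors' implementation, and the `score` argument is a placeholder standing in for the NDCRFS criterion of equation (10), so any concrete `score` (for example, the `j_cmim` sketch above) only approximates the real behaviour.

```python
# Schematic rendering of Algorithm 1 (sketch, not the authors' code). `X` is an
# integer-coded feature matrix, `C` the labels, `delta` the relevance threshold,
# and `score(X, C, k, S)` a stand-in for the NDCRFS criterion of equation (10).
def ndcrfs_rank(X, C, delta, score):
    d = X.shape[1]
    # Lines 2-7: drop features whose relevance to the class label is below delta.
    candidates = [i for i in range(d) if mutual_info(X[:, i], C) >= delta]
    if not candidates:
        return []
    # Lines 8-10: seed S with the single most relevant remaining feature.
    S = [max(candidates, key=lambda i: mutual_info(X[:, i], C))]
    candidates.remove(S[0])
    # Lines 11-22: repeatedly move the highest-scoring candidate into S.
    while candidates:
        best = max(candidates, key=lambda c: score(X, C, c, S))
        S.append(best)
        candidates.remove(best)
    return S
```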

The NDCRFS algorithm consists of 2 “for” loops and 1 “while” loop. Therefore, the time complexity of the NDCRFS algorithm grows with the number of selected features $K$, the total number of features $N$, and the number of samples $M$, where $K \ll N$. The complexity of the NDCRFS algorithm is higher than that of the MIM, IWFS, CMIM, DWFS, and CIFE algorithms, mainly because NDCRFS also needs to calculate the conditional mutual information and interaction terms, but it is lower than that of the IG-RFE algorithm.

6. Experiments and Results

6.1. Introduction to the Data Set

In order to verify the effectiveness of the NDCRFS algorithm, a total of 12 data sets were used in the experiments. The experimental data sets were selected from the internationally known UCI [3] and ASU [14] general data sets, which are described in detail in Table 1. From Table 1, the sample sizes range from 60 to 7494, the numbers of features range from 16 to 19,993, and the numbers of class labels range from 2 to 20. The experimental data sets cover biomedical data (Lymphography, Dermatology, Lung, Cardiotocography, Lymphoma, Nci9, SMK-CAN-187, and Carcinom), face image data (COIL20 and Pixraw10P), and text data (PCMAC and Pendigits).

6.2. Experimental Environment Setup

NDCRFS was compared with six feature selection algorithms, MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE, to verify its effectiveness. The experiments were conducted with KNN, SVM, and C4.5 on the same feature subsets. The number of selected features was set as $K$; for example, $K = 10$ for Lymphography and Pendigits, and larger values were used for the remaining data sets. The experimental environment was an Intel i7 processor with 8 GB RAM, and the simulation software was Python 2.7. A 5-fold cross-validation method was used to obtain the average classification accuracy of each classifier for each feature selection algorithm. In the experiments, incomplete samples were deleted, and, following Kurgan [32], the class-attribute interdependence maximization method was used to discretize continuous data.
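A hypothetical reproduction of this protocol, written with scikit-learn rather than the original Python 2.7 environment, is sketched below; the classifier settings (default KNN and SVM, an entropy-based decision tree standing in for C4.5) are our assumptions, not the paper's exact configuration.

```python
# Sketch of the evaluation protocol: 5-fold cross-validated accuracy of a
# selected feature subset under three classifiers (KNN, a C4.5-like tree, SVM).
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

def evaluate_subset(X, y, selected):
    classifiers = {
        "KNN": KNeighborsClassifier(),
        "C4.5-like": DecisionTreeClassifier(criterion="entropy"),
        "SVM": SVC(),
    }
    return {name: cross_val_score(clf, X[:, selected], y, cv=5).mean()
            for name, clf in classifiers.items()}
```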

6.3. Discussion and Analysis of Experimental Results
6.3.1. Comparison of Algorithm Variability

This paper measures the difference between two selected feature subsets using the Jaccard method, where $S_1$ represents the feature subset selected by the NDCRFS algorithm and $S_2$ represents the feature subset selected by another feature selection algorithm. The specific formula (11) is as follows:

$$\mathrm{Diff}(S_1, S_2) = 1 - \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}. \qquad (11)$$
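A two-line sketch of this measure, assuming formula (11) is the standard Jaccard distance as written above, is:

```python
# Difference between two selected feature-index sets: 1 minus Jaccard similarity.
def jaccard_difference(s1, s2):
    s1, s2 = set(s1), set(s2)
    return 1.0 - len(s1 & s2) / len(s1 | s2)
```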

As can be seen in Table 2, the mean values of the difference between NDCRFS and MIM, NDCRFS and IG-RFE, NDCRFS and IWFS, NDCRFS and CMIM, NDCRFS and DWFS, and NDCRFS and CIFE are 0.355, 0.389, 0.261, 0.222, 0.286, and 0.166, respectively, indicating that, by taking the relationships between features into account when ranking them, the NDCRFS algorithm selects feature subsets that differ significantly from those of the other feature selection algorithms.

6.4. Comparison of Classification Accuracy

Tables 3 to 5 show the average classification accuracy on the 12 data sets using KNN, C4.5, and SVM, respectively. Bold indicates the highest accuracy among the feature selection algorithms for that data set. Tables 3–5 show that the NDCRFS algorithm achieved the highest average classification accuracy of 88.734%, 81.574%, and 79.213%, respectively. “Wins/Ties/Losses” describes the number of wins/ties/losses of NDCRFS against MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE.

From Table 3, it is clear that the NDCRFS algorithm outperforms the MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE algorithms on 12, 12, 12, 12, 12, and 12 of the data sets, respectively. In Figure 1(a), the classification accuracy of the NDCRFS algorithm is the highest compared with the six other feature selection algorithms (97.769%, with 23 required features), exceeding them by 5.605%, 5.605%, 9.257%, 6.979%, 1.089%, and 10.63%, respectively. In Figure 1(b), the classification accuracy of the NDCRFS algorithm is the highest (98.589%, with 5 required features), exceeding the others by 0.188%, 0.188%, 0.188%, 0.188%, 0.0%, and 0.188%, respectively. In Figure 1(c), the classification accuracy of the NDCRFS algorithm is the highest (76.69%, with 28 required features), exceeding the others by 1.25%, 2.678%, 7.666%, 0.571%, 28.261%, and 19.44%, respectively. In Figure 1(d), the classification accuracy of the NDCRFS algorithm is the highest (70.014%, with 15 required features), exceeding the others by 1.621%, 1.01%, 0.014%, 4.267%, 1.593%, and 11.138%, respectively.

From Table 4, the NDCRFS algorithm is superior to the MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE algorithms on 11, 11, 11, 10, 10, and 11 of the data sets, respectively. In Figure 2(a), the classification accuracy of the NDCRFS algorithm is the highest compared with the six other feature selection algorithms (43.935%, with 7 required features), exceeding them by 2.042%, 2.462%, 2.588%, 1.613%, 0.933%, and 1.613%, respectively. In Figure 2(b), the classification accuracy of the NDCRFS algorithm is the highest (94.569%, with 10 required features), exceeding the others by 0.226%, 0.373%, 0.787%, 0.801%, 0.347%, and 0.894%, respectively. In Figure 2(c), the classification accuracy of the NDCRFS algorithm is the highest (87.774%, with 30 required features), exceeding the others by 7.856%, 2.661%, 11.81%, 3.932%, 3.617%, and 10.538%, respectively. In Figure 2(d), the classification accuracy of the NDCRFS algorithm is the highest (87.75%, with 4 required features), exceeding the others by 8.0%, 7.75%, 18.222%, 4.944%, 18.333%, and 0.833%, respectively.

From Table 5, the NDCRFS algorithm is superior to the MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE algorithms on 10, 12, 12, 11, 10, and 11 of the data sets, respectively. In Figure 3(a), the classification accuracy of the NDCRFS algorithm is the highest compared with the six other feature selection algorithms (87.964%, with 28 required features), exceeding them by 36.966%, 62.936%, 37.517%, 36.419%, 32.191%, and 67.049%, respectively. In Figure 3(b), the classification accuracy of the NDCRFS algorithm is the highest (85.589%, with 20 required features), exceeding the others by 0.001%, 0.102%, 3.394%, 0.255%, 0.206%, and 5.194%, respectively. In Figure 3(c), the classification accuracy of the NDCRFS algorithm is the highest (92%, with 5 required features), exceeding the others by 1%, 1%, 1%, 1%, 1%, and 1%, respectively. In Figure 3(d), the classification accuracy of the NDCRFS algorithm is the highest (68.352%, with 24 required features), exceeding the others by 4.466%, 6.285%, 15.528%, 12.419%, 19.714%, and 27.447%, respectively.

6.5. Runtime Analysis of the Algorithm

The running time of a feature selection algorithm is also one of the criteria used to evaluate it. Here, the running times of the NDCRFS, MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE algorithms are compared. Table 6 reports the total runtimes of these feature selection algorithms for ranking all features of the 12 data sets. The runtimes of the NDCRFS algorithm are well within acceptable limits.

The results of the 5-fold cross-validation experiments on the ASU and UCI data sets show that the proposed NDCRFS algorithm is able to select feature subsets with better classification performance, which further improves the discriminative ability of the data while compressing its dimensionality.

7. Conclusion

Feature selection is an important tool in the data preprocessing phase for high-dimensional small sample data. The main objective of feature selection is to select an optimal feature subset that yields high classification accuracy. Therefore, in this paper, a nonlinear dynamic conditional relevance feature selection algorithm (NDCRFS) is proposed. The algorithm first uses mutual information, conditional mutual information, and interaction mutual information to identify the relevance and redundancy of independent and dependent features. Secondly, the “max-min” principle is used to iteratively eliminate redundant and irrelevant features from the original feature set. Finally, the effectiveness of the algorithm is verified through experiments, which demonstrate that the NDCRFS algorithm significantly outperforms the MIM, IG-RFE, IWFS, CMIM, DWFS, and CIFE feature selection algorithms on most of the data sets.

However, the NDCRFS algorithm still selects unsatisfactory feature subsets on some data sets. In the future, it will be necessary to optimize NDCRFS further and to verify the proposed method in additional application fields.

Data Availability

The experimental data sets were selected from the well-known UCI general data sets (https://archive.ics.uci.edu/ml/datasets.html) and the well-known ASU general data sets (http://featureselection.asu.edu/datasets.php).

Conflicts of Interest

The author declares that he has no conflicts of interest.

Authors’ Contributions

The author wrote, read, and approved the final manuscript.

Acknowledgments

This work was supported by Jiangsu University of Technology Doctoral Research Start-Up Fund: KYY19042.