Abstract

For high-dimensional data with a large number of redundant features, existing feature selection algorithms still suffer from the “curse of dimensionality.” In view of this, this paper studies a new two-phase evolutionary feature selection algorithm, called the clustering-guided integer brain storm optimization algorithm (IBSO-C). In the first phase, an importance-guided feature clustering method is proposed to group similar features, so that the search space in the second phase is reduced substantially. The second phase focuses on finding an optimal feature subset by using an improved integer brain storm optimization. Moreover, a new encoding strategy and a time-varying integer update method for individuals are proposed to improve the search performance of brain storm optimization in the second phase. Since the number of feature clusters is far smaller than the number of original features, IBSO-C can find an optimal feature subset quickly. Compared with several existing algorithms on real-world datasets, experimental results show that IBSO-C can find feature subsets with high classification accuracy at a lower computational cost.

1. Introduction

Feature selection (FS), as an important dimension reduction method, has been applied to various real-world problems, such as image processing and text classification [1, 2]. In general, a large number of irrelevant/redundant features slows down learning algorithms and can even reduce their learning accuracy. The purpose of FS is to eliminate those irrelevant and/or redundant features, thus shortening the learning time while improving the learning accuracy [3–6].

Swarm intelligence (SI) is a branch of evolutionary computation [5]. Since it can find optimal or suboptimal solutions through global search strategies, SI has become an effective approach to solving FS problems. Many SI-based algorithms have been applied to FS problems, such as particle swarm optimization [7–12], differential evolution [13–15], the artificial bee colony algorithm [16–18], the firefly algorithm [19, 20], the salp swarm algorithm [21], ant colony optimization [6], and the whale optimization algorithm [22].

Brain storm optimization (BSO) is a new swarm intelligence algorithm simulating the collective behavior of human beings [23, 24]. It has been applied to many real-world problems, including system identification and electromagnetic antenna design [25–33]. Recently, BSO-based FS algorithms have received much attention. Zhang et al. applied BSO to FS problems for the first time and proposed a continuous BSO-based FS algorithm (CBSO) [34]. They also developed an improved discrete BSO [35], where new idea clustering and idea updating mechanisms were proposed to improve the performance of BSO. Liang et al. proposed a hybrid FS algorithm by combining ant colony optimization and BSO [36]. Combining the fuzzy min-max neural network with BSO to undertake feature selection and classification, Pourpanah et al. also developed a hybrid BSO-based FS method [37]; furthermore, they presented an improved hybrid BSO-based FS method [38] by combining a fuzzy ARTMAP model with BSO. Papa et al. introduced an improved binary BSO-based FS algorithm, where a real-valued solution is mapped onto a Boolean hypercube by using different transfer functions [39]. All of the above methods enhance the capability of BSO in solving FS problems. However, as the feature space grows exponentially, the search capability of existing BSO-based methods is inevitably reduced because of the lack of effective space reduction strategies.

For high-dimensional data, this paper develops a new evolutionary feature selection algorithm, called the clustering-guided integer BSO algorithm (IBSO-C). IBSO-C works in two phases. In the first phase, an importance-guided feature clustering method is developed to group all features into multiple clusters according to their redundancy. Following that, the second phase selects the most representative feature from each feature cluster by employing an improved integer BSO, and all representative features form the final feature subset. Applying IBSO-C to several high-dimensional FS problems, experimental results show its superiority and effectiveness over some state-of-the-art methods, including one filter method and three evolutionary wrapper feature selection methods.

The main contributions of this paper are as follows:
(1) The paper proposes a new two-phase hybrid evolutionary FS framework, which effectively combines the fast dimensionality reduction of a clustering-based method with the global search ability of an evolutionary algorithm. Since the number of feature clusters is far smaller than the number of original features, the second phase can find an optimal feature subset quickly.
(2) The paper proposes a new feature clustering method, called the importance-guided feature clustering method. By effectively fusing feature importance and feature correlation, the proposed method can group all features into multiple clusters according to their redundancy at a relatively small computational cost.
(3) The paper proposes an improved integer BSO (IBSO) for feature selection problems. Several new strategies, including the integer encoding strategy and the time-varying integer update strategy, improve the search performance of IBSO.

The remainder of this paper is organized as follows. Section 2 introduces basic concepts. The proposed BSO-based FS algorithm is described in Section 3. Section 4 provides experimental analyses, and Section 5 concludes the paper.

2. Basic Concepts

2.1. Feature Selection Problem

Consider a dataset with D features and H instances. The objective of feature selection is to select d features (d < D) from the original feature set, Fset, so that the classification accuracy, AC, is as high as possible. Using a binary string, X, to represent a feature subset, we have

X = (x_1, x_2, …, x_D), x_j ∈ {0, 1}, j = 1, 2, …, D,  (1)

where x_j = 1 indicates that the j-th feature is selected into the feature subset X; otherwise, it is not selected. A feature selection problem can then be stated as

max_X AC(X)  s.t.  X = (x_1, x_2, …, x_D), x_j ∈ {0, 1}.  (2)

To deal with FS problems, existing methods can be divided into three categories [8]: filter, wrapper, and embedded. The filter first calculates the importance degree of every feature with a specified measure, such as information gain, a distance measure, or a dependency measure; all features are then ranked by their importance degrees. This kind of approach has a low computational cost, but its classification accuracy is often worse than that of the other two kinds of methods. The wrapper utilizes a learning algorithm to evaluate feature subsets and uses a search method to find good feature subsets. Because new feature subsets (or solutions) must be repeatedly evaluated by a classifier, this kind of method has a high computational cost, but its classification accuracy is often better than that of the filter. The embedded method carries out feature selection automatically while training the classifier; since the selected features are closely tied to the learning algorithm used, the embedded method is not robust to a change of algorithm. Since the proposed FS algorithm also uses a classifier to evaluate new solutions (i.e., feature subsets) and utilizes BSO to search for feature subsets, it belongs to the wrapper category. The purpose of this paper is to study a new BSO-based feature selection algorithm for high-dimensional data.

2.2. Brain Storm Optimization Algorithm

In BSO, an idea (i.e., an individual) represents a potential solution of the optimized problem; an idea plays the role of an individual in other evolutionary optimization algorithms. BSO continually generates new ideas by repeatedly executing three phases: individual clustering, individual update, and elite selection.

In the phase of individual clustering, individuals are first grouped into multiple clusters; the most commonly used clustering method is K-means. Next, the phase of individual update uses cluster centers or normal ideas from one or two clusters to generate new solutions, where two preset probability values determine whether a cluster center or normal ideas are used. After selecting the two solutions, a new individual is generated by a crossover-like strategy:

X_new = Rand · X_1 + (1 − Rand) · X_2 + ξ(t) · N(0, 1),  (3)

where X_1 and X_2 are two ideas/individuals or cluster centers from the two clusters, Rand is a random number within [0, 1], and N(0, 1) is a Gaussian random value. The disturbance factor, ξ(t), is utilized to enhance the diversity of new ideas; t is the current iteration and T is the maximal number of iterations.
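
To make this update concrete, the following Python sketch implements one step of the conventional BSO update under common assumptions from the BSO literature (a logsig-shaped step size with slope parameter k, which is not specified above and is an assumption here):

import numpy as np

def bso_update(x1, x2, t, T, k=20.0):
    # crossover-like combination of the two selected ideas/centers, as in eq. (3)
    r = np.random.rand()                                   # Rand in [0, 1]
    x_sel = r * x1 + (1.0 - r) * x2
    # disturbance factor xi(t); the logsig form and k are assumptions
    xi = np.random.rand() / (1.0 + np.exp(-(0.5 * T - t) / k))
    return x_sel + xi * np.random.randn(len(x1))           # Gaussian perturbation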

Finally, the elite selection compares each new idea with the corresponding old one, and the better of the two is kept in the population.

2.3. Symmetric Uncertainty

Compared with raw mutual information, symmetric uncertainty (SU) corrects its bias by normalizing the value to [0, 1], so that correlations between features can be compared fairly. SU has been successfully used in FS problems [7, 40].

Taking two random variables, X and Y, as an example, their SU value is calculated by

SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y)),  (4)

where H(X|Y) is the conditional entropy, which evaluates the uncertainty of X when Y is given; H(X) is the entropy of X; and IG(X|Y) = H(X) − H(X|Y) is the information gain, which evaluates how much the uncertainty of X decreases when Y is known.
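
For illustration, a minimal Python sketch of the SU computation for discrete (or discretized) variables is given below; the helper names are ours, not from the paper:

import numpy as np
from collections import Counter

def entropy(values):
    # Shannon entropy of a discrete variable
    n = len(values)
    return -sum((c / n) * np.log2(c / n) for c in Counter(values).values())

def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG(X|Y) = H(X) - H(X|Y)
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    ig = hx - (hxy - hy)                # H(X|Y) = H(X, Y) - H(Y)
    return 2.0 * ig / (hx + hy) if (hx + hy) > 0 else 0.0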

3. The Proposed IBSO-C Algorithm

3.1. Framework

To deal with the problem of the “curse of dimensionality,” this paper proposes a two-phase evolutionary FS algorithm, called the clustering-guided integer BSO algorithm (IBSO-C). Figure 1 shows its framework, which includes two phases: clustering features and selecting representative features. The first phase groups all features into multiple clusters according to their similarity, using the proposed importance-guided feature clustering method. After that, the second phase selects representative features from these feature clusters by employing an improved integer BSO, thereby generating the final feature subset. The main contributions of this paper are marked with red dotted lines in Figure 1.

3.2. Importance-Guided Feature Clustering Strategy

A good clustering method should be able to group similar features into the same cluster at a low computational cost. The frequently used K-means method can partition data accurately, but it usually has a high computational cost. For this reason, we propose a new feature clustering method, the importance-guided feature clustering (IFC) method.

Algorithm 1 shows the implementation steps of IFC. Firstly, the SU measure is used to evaluate the importance of each feature in Fset (step 1): the greater the SU value between a feature and the class labels, the more important the feature. Secondly, all features in Fset are sorted in decreasing order of SU (step 2), and the sorted result is denoted by Fsorted. After that, the following steps (steps 4–12) are executed repeatedly until all features have been assigned to clusters: (1) the first feature in Fsorted is set to be a new cluster center, Centeri, and the i-th feature cluster Ci is initialized to contain only Centeri. (2) All the remaining features in Fsorted are checked against the new cluster center; if the correlation between a feature and the cluster center is greater than a threshold η, the feature is put into the cluster Ci. Repeating this check for every feature in Fsorted yields the new feature cluster Ci. (3) After that, the features in Ci are removed from Fsorted. If Fsorted is not empty, return to step 5; otherwise, stop the clustering method and output all feature clusters. Note that SU is also used to calculate the correlation between a feature and a cluster center.

Input: The original feature set, Fset;
Output: The feature clustering result, i.e., all feature clusters Ci;
(1) Use the SU measure to evaluate the importance of each feature in Fset;
(2) Sort all features in Fset in the decreasing order of SU values, denoted the sorted result by Fsorted;
(3) Set i = 1;
(4) While |Fsorted| > 0   % |Fsorted| is the size of Fsorted
(5)  Set the first feature in Fsorted to be the i-th cluster center, Centeri
(6)  Initialize Ci = {Centeri};
(7)  For j = 1: |Fsorted|
(8)    Calculate the correlation between Centeri and the j-th feature in Fsorted;
(9)   If the correlation > η, then save the j-th feature into Ci;
(10)  Endfor
(11)  Remove the features in Ci from Fsorted, and set i = i + 1.
(12) Endwhile
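
The following Python sketch mirrors Algorithm 1; it assumes features are stored as named columns and reuses the symmetric_uncertainty helper sketched in Section 2.3 (both the data layout and the helper are our assumptions):

def importance_guided_clustering(features, labels, eta):
    # features: dict {name: column vector}; eta: correlation threshold
    # steps 1-2: rank features by SU with the class labels
    ranked = sorted(features,
                    key=lambda f: symmetric_uncertainty(features[f], labels),
                    reverse=True)
    clusters = []
    while ranked:                                          # step 4
        center = ranked[0]                                 # step 5: new cluster center
        cluster = [center]                                 # step 6
        for f in ranked[1:]:                               # steps 7-10
            if symmetric_uncertainty(features[f], features[center]) > eta:
                cluster.append(f)                          # redundant with the center
        ranked = [f for f in ranked if f not in cluster]   # step 11
        clusters.append(cluster)
    return clusters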

Compared with existing clustering algorithms, IFC has the following advantages: (1) IFC does not need the number of clusters to be set in advance; setting the correlation threshold η is easier than setting the number of clusters. (2) IFC has a lower computational complexity. In the worst case, each execution of line 11 of Algorithm 1 removes only two features from Fsorted, so the SU measure is run about D(D − 1)/4 times to calculate the correlations between features and cluster centers, where D is the number of features. Moreover, evaluating the importance degrees of all features requires running the SU measure D more times. Therefore, IFC runs the SU measure about (D² + 3D)/4 times in total. In most existing clustering methods, determining the redundancy between all pairs of features requires running the SU measure D(D − 1)/2 times. For example, with D = 1000 features, IFC needs roughly (1000² + 3 · 1000)/4 ≈ 2.5 × 10⁵ SU evaluations, versus about 1000 · 999/2 ≈ 5 × 10⁵ pairwise evaluations.

3.3. Selecting Representative Features by an Improved Integer BSO

In this section, an improved integer BSO is proposed to select representative features from those feature clusters, thus generating the final feature subset. Firstly, we give the encoding method of idea/individual and the fitness evaluation strategy of idea.

3.3.1. Encoding and Fitness Evaluation Strategies

By implementing the method in Section 3.2, all features can be divided into Z feature clusters, C1, C2, …, CZ. The goal of this section is to produce a good feature subset, X, by selecting one representative feature from each feature cluster, so that the classification performance is maximized. Taking an integer vector to construct a solution, the optimization model is

max_X AC(X),  X = (x_1, x_2, …, x_Z), x_i ∈ {1, 2, …, |C_i|},  (5)

where x_i = s indicates that the s-th feature in the i-th feature cluster is selected into the feature subset, X. Following the above model, the integer vector X is directly used to represent an individual in the population.
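
As a small illustration of this encoding (the identifiers below are ours), an integer idea is decoded into a feature subset by picking one feature per cluster:

def decode_idea(idea, clusters):
    # idea[i] = s means: take the s-th feature (1-based) of cluster C_i
    return [clusters[i][s - 1] for i, s in enumerate(idea)]

# e.g. decode_idea([2, 1, 3], [['f1', 'f4'], ['f2'], ['f3', 'f5', 'f6']])
# returns ['f4', 'f2', 'f6']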

This paper adopts the leave-one-out cross-validation (LOOCV) of k-NN to calculate the fitness of an idea in BSO. Because it is easy to implement, the one-nearest-neighbor (1-NN) method is used as the classifier in the following experiments; the k-NN classifier has been used by many FS methods [7, 8, 10]. In LOOCV with 1-NN, a single instance from the original dataset is selected as the testing sample, and the remaining ones are used as training samples; the 1-NN then predicts which class this instance belongs to. The process is repeated so that each instance in the original dataset is used once as the testing sample. Based on this, the classification accuracy of an idea Xi is

AC(Xi) = h / H,  (6)

where H is the number of instances in the original dataset and h is the number of instances correctly predicted by the 1-NN classifier. In the proposed method, the AC value of an idea is its fitness.
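
A brute-force sketch of this fitness evaluation is shown below (the cluster/column layout and function names are our assumptions; a KD-tree or scikit-learn's KNeighborsClassifier could be used instead of the explicit distance loop):

import numpy as np

def fitness_loocv_1nn(idea, clusters, X_data, y):
    # decode the integer idea into column indices, one feature per cluster
    cols = [clusters[i][s - 1] for i, s in enumerate(idea)]
    Xs = X_data[:, cols]
    correct = 0
    for i in range(len(y)):                 # each instance is the test sample once
        d = np.linalg.norm(Xs - Xs[i], axis=1)
        d[i] = np.inf                       # exclude the test sample itself
        correct += int(y[int(np.argmin(d))] == y[i])
    return correct / len(y)                 # AC = h / H, as in equation (6)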

3.3.2. Time-Varying Integer Update of Idea

Analyzing (3), one can see that since two random weights are assigned to the two selected normal ideas or cluster centers, X_1 and X_2, the newly generated idea may swing back and forth between them. If X_1 and X_2 remain far apart during the iterations of BSO, the ideas corresponding to the two clusters will be difficult to converge, which reduces the convergence performance of BSO.

Moreover, since traditional BSO is proposed for continuous optimization problems, we must develop a new integer update rule for solving the optimization model described in (5).

To overcome the above problems, this section proposes a time-varying integer update strategy for ideas (TVIU). In this strategy, the better of the two selected normal ideas or cluster centers receives a larger learning weight, and this weight increases as the number of iterations increases. The new update rule is given in equation (7), where rand2 and rand3 are two random numbers within [0, 1], ⌈·⌉ is the round-up (ceiling) function, w1 and w2 are two weights that control how much the new solution learns from each of the two ideas or cluster centers, T is the maximum number of iterations, t is the current iteration, and AC1 and AC2 are the AC values of the two normal ideas or cluster centers, respectively.

From (7) we can see that (1) the larger the AC value of a normal idea or cluster center, the higher the weight of this idea or center, so that the new solution can learn more from it; this improves the quality of new solutions to a certain extent. (2) As the number of iterations increases, the influence of the AC value of a normal idea or cluster center on the weight becomes stronger. This means that the degree of learning from the better idea grows with the iterations, which speeds up the convergence of the population in the later stage of the algorithm.
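
Because the exact form of equation (7) is not fully specified above, the Python sketch below is only one plausible realization of TVIU consistent with the properties just described (the fitter parent receives the larger weight, and the influence of fitness grows with t/T); the formula and identifiers are our assumptions, not the paper's definitive rule:

import math, random

def tviu_update(x1, x2, ac1, ac2, t, T, cluster_sizes):
    new = []
    for i, (a, b) in enumerate(zip(x1, x2)):
        r2, r3 = random.random(), random.random()
        # weights: random early on, proportional to the AC values as t -> T
        w1 = r2 * (1 - t / T) + (ac1 / (ac1 + ac2 + 1e-12)) * (t / T)
        w2 = r3 * (1 - t / T) + (ac2 / (ac1 + ac2 + 1e-12)) * (t / T)
        v = math.ceil((w1 * a + w2 * b) / (w1 + w2))   # integer combination (round up)
        new.append(min(max(v, 1), cluster_sizes[i]))   # keep inside the cluster's range
    return new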

3.3.3. Disturbance Operator

In the proposed IBSO, a new disturbance operator is utilized to improve the diversity of new ideas. Each element of an idea is checked in turn: if a random number is smaller than the probability pm, the element is reinitialized within its search range. In this paper, we set pm = 1/D, where D is the number of features.
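
A minimal sketch of this operator under the integer encoding above (the helper names are ours):

import random

def disturb(idea, cluster_sizes, pm):
    # re-initialize each element with probability pm (pm = 1/D in the paper)
    return [random.randint(1, cluster_sizes[i]) if random.random() < pm else v
            for i, v in enumerate(idea)]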

3.4. Implement Steps of IBSO-C

Like traditional BSO algorithms [41], the proposed IBSO-C still includes three main steps: clustering ideas, updating ideas, and selecting elite ideas. In IBSO-C, an idea represents a solution of the optimized problem. The feature clustering strategy proposed in Section 3.2 is used to cluster features, and the improved integer BSO proposed in Section 3.3 is used to update the ideas.

In the first step (clustering ideas), all ideas are grouped into several clusters. In traditional BSO, K-means clustering is the commonly used method; because the whole population must be clustered repeatedly, K-means still has the disadvantage of a high computational cost. To address this, Cao et al. [42] introduced random grouping to reduce the clustering cost: after the population is randomly grouped into M clusters, the fittest idea in each cluster is selected as its center. Compared with K-means, this method significantly reduces the computational cost of population clustering.
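
The random grouping step can be sketched as follows (a rough illustration of the scheme in [42]; the function names are ours, and a population size of at least M is assumed):

import random

def random_grouping(population, fitness, M):
    # assumes len(population) >= M so that every cluster is non-empty
    idx = list(range(len(population)))
    random.shuffle(idx)
    clusters = [idx[i::M] for i in range(M)]                          # M roughly equal groups
    centers = [max(c, key=lambda j: fitness[j]) for c in clusters]    # fittest idea per group
    return clusters, centers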

In the second step (updating ideas), new ideas are generated based on two cluster centers or normal ideas from two clusters. The time-varying integer update strategy of Section 3.3 is used to generate new solutions, and the proposed disturbance operator is utilized to improve the diversity of new ideas.

In the third step, elite selection is carried out. For the i-th idea in the population, if the classification accuracy of the new idea is better than that of the old one, the old idea is replaced by the new one. If the two ideas have the same classification accuracy but the new idea contains fewer features, the new idea also replaces the old one.
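
This selection rule can be written compactly as below (the argument names are ours):

def elite_select(old_idea, old_ac, old_fn, new_idea, new_ac, new_fn):
    # keep the new idea if it is more accurate, or equally accurate with fewer features
    if new_ac > old_ac or (new_ac == old_ac and new_fn < old_fn):
        return new_idea, new_ac, new_fn
    return old_idea, old_ac, old_fn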

Moreover, Algorithm 2 shows the pseudocode of the proposed IBSO-C. Note that there are two clustering operations in Algorithm 2, i.e., lines 2 and 6. In line 2, the method proposed in Section 3.2 is used to cluster features; its input is the data to be processed, and its output is the set of feature clusters. Line 6 clusters the N ideas (individuals), and its input is the population of BSO.

Input: The data set to be solved;
Output: The optimal feature subset;
(1) Set related parameters, including the population size N, the maximal number of iterations T, the probability parameters, and so on; set t = 0;
(2) Cluster all the features into K clusters by using the method in Section 3.2.
(3) Randomly generate N integer ideas or individuals.
(4) Evaluate the fitness of each idea by equation (6);
(5)While t<T
(6) Group all the N ideas into M clusters by the method in [42];
(7) Select the best idea from each cluster as the cluster center; % the phase of updating ideas %
(8)For i = 1: N % from the first idea to the last one
(9)  If a random number rand() is smaller than the preset probability of selecting one cluster, then
(10)   Randomly select a cluster and determine its cluster center;
(11)   If a random number rand() is smaller than the preset probability of selecting the cluster center,
(12)    Select the cluster center;
(13)    Generate a new idea by the equation (7);
(14)     Implement the proposed disturbance operator;
(15)   Else
(16)    Randomly select a normal idea from this cluster
(17)    Generate a new idea by the equation (7);
(18)     Implement the proposed disturbance operator;
(19)   End if
(20)  Else
(21)   Randomly select two clusters;
(22)   If a random number rand() is smaller than the preset probability of selecting the two cluster centers, then
(23)    Select two cluster centers;
(24)    Generate a new idea by the equation (7);
(25)     Implement the proposed disturbance operator;
(26)   Else
(27)    Randomly select two normal ideas from the two clusters respectively;
(28)    Generate a new idea by the equation (7);
(29)     Implement the proposed disturbance operator;
(30)   End if
(31)  End if % the phase of selecting elite ideas %
(32)  Evaluate the new idea by equation (6) and update the corresponding old idea via elite selection;
(33)  End for
(34)End while

4. Experiments and Analyses

This section verifies the effectiveness of IBSO-C. First, we analyze the effects of the two key proposed operators, i.e., the clustering strategy and the time-varying integer update strategy, on the performance of IBSO-C. Second, IBSO-C is compared with four existing FS algorithms.

4.1. Experimental Preparation

Eight real-world datasets are used to verify the performance of IBSO-C; Table 1 shows their basic information. These datasets have been used in many studies [5, 7, 10] and can be downloaded from http://www.ics.uci.edu/mlearn/MLRepository.html and http://gems-system.org/.

Four representative feature selection algorithms are used for comparison: the ReliefF algorithm (ReliefF) in [43], the binary PSO algorithm (BPSO) in [44], the binary BSO-based algorithm (BBSO) in [35], and the self-adaptive PSO algorithm (SaPSO) in [12]. For a fair comparison, all population-based algorithms use the same swarm/population size (50) and the same maximal number of iterations (1000). The other parameters are set following the original literature, as shown in Table 2.

Three performance indexes are used to evaluate the quality of an algorithm: the classification accuracy (AC), the number of selected features (FN), and the running time (Time). This paper employs 10-fold cross-validation: nine parts are used as training data in turn, the remaining part is used as test data, and the average value over the 10 runs is taken as the final result. All experiments are carried out on an Intel(R) Core(TM) i7-8700 CPU at 3.2 GHz with 16.00 GB RAM.

4.2. Analysis on the Proposed Clustering Strategy

The proposed clustering strategy plays a key role in improving the performance of IBSO-C. We analyze the effectiveness of the strategy in this section. Here, the conventional BSO without feature clustering [34] (CBSO) is selected as a comparison method. IBSO-C and CBSO use the same parameters shown in Table 2.

Table 3 shows the AC, FN, and running time values obtained by IBSO-C and CBSO. We can see the following: (1) with the help of the feature clustering strategy proposed in Section 3.2, IBSO-C obtained the best average AC values on all eight datasets, which are significantly higher than those of CBSO. (2) More importantly, compared with CBSO, IBSO-C needs only very few features to reach such good AC values, as shown by their average FN values. Taking the dataset CNS as an example, IBSO-C used fewer than 90 features to reach a classification accuracy of 86.04%. (3) Because only very few features are used, the running time of IBSO-C is also significantly shorter than that of CBSO. Again taking CNS as an example, IBSO-C needs only 1.5936 minutes to obtain a good solution, while the running time of CBSO is more than two hours.

4.3. Analysis on the Time-Varying Integer Update Strategy

The time-varying integer update strategy (TVIU) proposed in Section 3.3 also plays a key role in improving the performance of IBSO-C, and this subsection analyzes its effectiveness. Here, an integer update version of the conventional rule (3) is selected as the comparison method and is denoted as equation (8).

For convenience, the IBSO-C with (8) is called IBSO-S. Both IBSO-C and IBSO-S use the same parameters shown in Table 2.

Figures 2 and 3 show the AC and FN values obtained by IBSO-C and IBSO-S, respectively. We can see the following: (1) on all datasets except leukemia_small, IBSO-C obtained higher classification accuracy than IBSO-S, with the help of the proposed TVIU strategy. Taking the dataset CNS as an example, the AC value of IBSO-C is 5 percentage points higher than that of IBSO-S. (2) In terms of the number of selected features (FN), IBSO-C and IBSO-S obtained very similar values.

4.4. Comparison Analyses

The proposed IBSO-C algorithm is compared with the four existing algorithms in terms of AC, FN, and the running time. Table 4 lists the average AC values obtained by the five FS algorithms with KNN, and Table 5 shows the average FN values obtained by the five FS algorithms. In addition, we employ the Mann–Whitney U test to investigate whether there is a significant difference between IBSO-C and another algorithm. Here, “+” indicates that IBSO-C is obviously superior to the comparison algorithm, “=” indicates that there is no significant difference between them, and “−” indicates that IBSO-C is obviously inferior to the comparison algorithm.
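
As an illustration of how the “+/=/−” marks can be produced (the 0.05 significance level and function name are our assumptions), the test can be run per dataset on the two algorithms' per-run AC values:

from scipy.stats import mannwhitneyu

def significance_mark(ac_ibso_c, ac_other, alpha=0.05):
    # two-sided Mann-Whitney U test on per-run AC values
    _, p = mannwhitneyu(ac_ibso_c, ac_other, alternative='two-sided')
    if p >= alpha:
        return '='                       # no significant difference
    mean_a = sum(ac_ibso_c) / len(ac_ibso_c)
    mean_b = sum(ac_other) / len(ac_other)
    return '+' if mean_a > mean_b else '-'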

The following can be seen from the two tables: (1) for 6 out of the 8 datasets, IBSO-C obtained the highest average AC values; for 4 of the 8 datasets, i.e., Colon, WrapAR10P, DBWorld, and CNS, the AC values of IBSO-C are significantly superior to those of all four comparison algorithms. (2) For the dataset GFE01, IBSO-C obtained the third best AC value, while SaPSO had the best AC value; however, the number of features selected by IBSO-C is significantly smaller than that of SaPSO (the FN values of IBSO-C and SaPSO are 14.0 and 73.2, respectively). (3) As for GFE01, SaPSO obtained the best AC value on the dataset SRBCT, but its FN value is significantly larger than that of IBSO-C. (4) For all 8 datasets, IBSO-C obtained the smallest FN values among the compared algorithms.

4.5. Running Time

Table 6 shows the average running time of IBSO-C and the three evolutionary FS algorithms. It reports that the running time of IBSO-C is significantly less than that of BPSO, SaPSO, and BBSO. For all the 8 datasets, the average running time of IBSO-C is 5.33 minutes, while the average running time of BPSO, SaPSO, and BBSO is 145.48, 127.39, and 60.742 minutes, respectively. Overall, IBSO-C is a highly competitive FS algorithm, which can obtain relatively good classification accuracy at less computational cost.

5. Conclusions

This paper studied a new two-phase evolutionary feature selection algorithm, called the clustering-guided integer BSO algorithm (IBSO-C), for high-dimensional data. In IBSO-C, the feature clustering strategy proposed in the first phase obviously reduces the search space of the integer BSO in the second phase; since the number of feature clusters is far smaller than the number of original features, IBSO-C can find an optimal feature subset quickly. Moreover, the proposed importance-guided feature clustering method can effectively group features at a relatively small computational cost, and the proposed encoding strategy and time-varying integer update strategy improve the search performance of IBSO-C. IBSO-C was compared with four existing FS algorithms, i.e., ReliefF, BPSO, SaPSO, and BBSO, on several datasets. The experimental results showed that IBSO-C is a highly competitive FS algorithm that can obtain relatively good classification accuracy at a lower computational cost.

A more sophisticated feature clustering method that does not need to set any threshold or parameters manually will be one of our future research directions. In addition, applying multi- or many-objective evolutionary algorithms to cost-sensitive feature selection problems will be another research direction in the future.

Data Availability

Some or all data, models, or code generated or used during this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the Scientific Innovation 2030 Major Project for New Generation of AI, Ministry of Science and Technology of the People’s Republic of China (No. 2020AAA0107300).