Abstract

The feature selection problem is a fundamental issue in many research fields. In this paper, the feature selection problem is regarded as an optimization problem and addressed with a large-scale many-objective evolutionary algorithm. Considering the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance, a large-scale many-objective feature selection model is constructed. Because such a model is difficult to optimize with traditional evolutionary algorithms, this paper proposes a modified vector angle-based large-scale many-objective evolutionary algorithm (MALSMEA). The proposed algorithm uses polynomial mutation based on variable grouping instead of naive polynomial mutation to improve the efficiency of solving large-scale problems. In addition, a novel worst-case solution replacement strategy using shift-based density estimation replaces the worse of two individuals with similar search directions to enhance convergence. The experimental results show that MALSMEA is competitive and can effectively optimize the proposed model.

1. Introduction

Feature selection involves selecting a specific number of features from the existing features to optimize specific objectives [1]. Feature selection can be regarded as a multiobjective optimization problem and solved using evolutionary algorithms. It has attracted wide attention from scholars and has been applied in gene expression analysis [2], face recognition [3], and drug discovery [4]. For example, the two-stage heuristic minimal redundancy maximal relevance (mRMR) algorithm [5] optimizes relevance and redundancy simultaneously. A filter-based algorithm [6] considers an entropy-based correlation measure together with a combined measure of the redundancy and cardinality of the selected subset. A decomposition algorithm based on a weighting method optimizes interclass and intraclass distances [7]. Gulsah et al. [8] proposed two algorithms, W-QEISS and F-QEISS, which use nondominated sorting based on classification accuracy, feature number, relevance, and redundancy. Li et al. [9] established a model with feature number, classification performance, interclass distance, and intraclass distance as objectives and proposed a decomposition-based large-scale algorithm (DMEA-FS).

However, some unsolved problems remain when feature selection is performed with traditional evolutionary algorithms. The first problem is that selecting from a large number of features can be regarded as a large-scale optimization problem [1] or a large-scale multiobjective optimization problem (LSMOP) [10], which traditional evolutionary algorithms cannot solve effectively. The second problem is that feature number and accuracy are only two basic objectives, and additional objectives are needed to exploit the potential information that guides the evolution in feature selection [1]. Correspondingly, more objectives lead to many-objective optimization problems (MaOPs) [11, 12].

Current algorithms fall into three main categories. They are designed mainly to solve LSMOPs or MaOPs, but they perform poorly on large-scale many-objective problems (LSMaOPs) [13], which involve more than 3 objectives and over 100 decision variables [14, 15].

The first category is based on Pareto dominance and improves the convergence pressure by modifying the dominance relation. Proposed relaxations and modifications of Pareto dominance include those in [16-18], as well as simplex dominance [19] and grid dominance [20, 21]. An algorithm using shift-based density estimation (SDE) was proposed in [22], which allows individuals with poor convergence to obtain higher density.

The second category is based on performance indicators, such as the hypervolume (HV) adaptive grid algorithm (HAGA) [23], the evolutionary algorithm using the inverted generational distance (IGD) indicator (MaOEA/IGD) [24], the indicator-based algorithm with boundary protection (MaOEA-IBP) [25], and the R2 indicator and weight vector-based method (R2-WVEA) [26]. Most of these algorithms are many-objective evolutionary algorithms (MaOEAs), but their computational costs are high.

The third category is composed of decomposition-based methods. The most classic are the multiobjective evolutionary algorithm based on decomposition (MOEA/D) [27] and its variants [28-30]. The nondominated sorting-based algorithm NSGA-III [31] uses evenly distributed reference points to assist environmental selection. Based on NSGA-III, Gu and Wang [10] introduced an information feedback model to solve LSMaOPs. The reference vector-guided evolutionary algorithm (RVEA) [32] uses reference vectors to guide the optimization.

To describe the large-scale feature selection problem more comprehensively and solve it more effectively, this paper studies the existing evolutionary multiobjective feature selection models, combines their objectives, formulates feature selection as an LSMaOP, and applies an improved large-scale many-objective evolutionary algorithm (LSMaOEA) for optimization.

The main contributions of this paper are summarized as follows:
(1) A novel worst-case solution replacement strategy based on SDE is proposed. This strategy allows the conditional replacement of solutions that are poor in terms of convergence and diversity compared with other solutions, thereby maintaining a balance between convergence and diversity.
(2) A modified vector angle-based large-scale many-objective evolutionary algorithm (MALSMEA) is proposed, which uses variable grouping-based polynomial mutation instead of naive polynomial mutation to improve the efficiency of solving large-scale problems. In the environmental selection process, the proposed worst-case solution replacement strategy is used to improve diversity.
(3) A large-scale many-objective feature selection optimization model is constructed, and MALSMEA is used to optimize it. The optimization objectives of this model are the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance.

The remainder of this paper is arranged as follows. Section 2 introduces the related works. Section 3 describes the proposed model and MALSMEA in detail. In Section 4, we compare and analyze the experimental results of MALSMEA and four advanced algorithms on benchmark LSMaOPs, as well as the performance of MALSMEA and three feature selection algorithms in optimizing the proposed feature selection model. Section 5 summarizes the paper and discusses prospects for future research.

2. Related Works

2.1. Large-Scale Many-Objective Optimization Problem

An LSMaOP can be described as

\min F(x) = (f_1(x), f_2(x), \ldots, f_M(x))^T, \quad \text{s.t.} \quad x \in \Omega,

where \Omega = \prod_{i=1}^{D} [l_i, u_i] is the decision space, D is the number of decision variables (D > 100), and l_i and u_i are the lower and upper bounds of the decision variables in the i-th dimension, respectively. x = (x_1, x_2, \ldots, x_D) is the D-dimensional decision vector in \Omega, M is the objective number (M > 3), and F(x) is the objective vector of x. If no other solution dominates x, then x is a Pareto optimal solution [33]. The objective vectors corresponding to all Pareto optimal solutions constitute the Pareto optimal front (PF) [34, 35].
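
As a concrete illustration of the dominance relation used in this definition, the following minimal Python sketch checks Pareto dominance and extracts the nondominated solutions of a small objective matrix. Minimization is assumed throughout, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """Return True if objective vector a Pareto-dominates b (minimization)."""
    return bool(np.all(a <= b) and np.any(a < b))

def nondominated(F: np.ndarray) -> np.ndarray:
    """Indices of nondominated rows in an (N, M) objective matrix."""
    N = F.shape[0]
    keep = []
    for i in range(N):
        if not any(dominates(F[j], F[i]) for j in range(N) if j != i):
            keep.append(i)
    return np.array(keep)

# Tiny example: three 2-objective vectors; the third is dominated by the first.
F = np.array([[1.0, 3.0], [2.0, 1.0], [2.5, 3.5]])
print(nondominated(F))  # -> [0 1]
```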

2.2. Shift-Based Density Estimation

We use SDE [22] with the k-th nearest neighbor [36] to estimate the density of all individuals. For an individual p, the following method is used to calculate the density value D(p):
(i) First, the normalized objective vectors of the other individuals in the population P are shifted: in each objective i, an individual q (q \neq p) is shifted to \max\{f_i(q), f_i(p)\}.
(ii) Then, the Euclidean distances between the shifted normalized objective vectors and that of the considered individual p are calculated, expressed as d(p, q).
(iii) Next, the k-th minimum value d_p^k in the set \{d(p, q) \mid q \in P, q \neq p\} is found, where k = \lfloor \sqrt{N} \rfloor and N is the size of the population.
(iv) Finally, D(p) is calculated as follows:

D(p) = \frac{1}{d_p^k + 2}.

From the above density estimation process, we can observe that the smaller the density value of an individual is, the better its performance. Therefore, this paper uses this strategy, which considers both diversity and convergence, to judge a pair of individuals with similar search directions and remove the worse-performing one.
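
The following Python sketch illustrates the SDE procedure above. It assumes normalized minimization objectives, and it uses the common choices k = ⌊√N⌋ and the SPEA2-style density 1/(d_k + 2) that the cited SDE formulation builds on; it is an illustration, not the paper's exact implementation.

```python
import numpy as np

def sde_density(F: np.ndarray) -> np.ndarray:
    """Shift-based density estimation with the k-th nearest neighbor.

    F is an (N, M) matrix of normalized objective vectors (minimization).
    For each individual p, every other individual is shifted component-wise
    to max(F[q], F[p]) before measuring distance, so poorly converged
    neighbors appear farther away and p obtains a lower density.
    """
    N = F.shape[0]
    k = max(1, int(np.sqrt(N)))  # k-th nearest neighbor, k = floor(sqrt(N))
    D = np.empty(N)
    for p in range(N):
        shifted = np.maximum(F, F[p])                 # shift all individuals
        d = np.linalg.norm(shifted - F[p], axis=1)    # distances to p
        d[p] = np.inf                                 # exclude p itself
        dk = np.sort(d)[k - 1]                        # k-th minimum distance
        D[p] = 1.0 / (dk + 2.0)                       # SPEA2-style density
    return D
```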

2.3. Information Theory Criterion Based on Entropy

The feature selection model uses an entropy-based information theory criterion [8] to measure relevance and redundancy. For a given discrete random variable X, its entropy is determined as follows:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),

where p(x) = P(X = x) and \mathcal{X} is the set of all possible values of X. Then, the joint entropy of X and Y is determined as follows:

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 p(x, y),

where Y is a discrete random variable, \mathcal{Y} is the set of all possible values of Y, and p(x, y) = P(X = x, Y = y). Then, the mutual information between X and Y is determined as follows:

I(X; Y) = H(X) + H(Y) - H(X, Y).

Symmetric uncertainty is used to scale the value range of mutual information to [0, 1] [37], which is defined as follows:

SU(X, Y) = \frac{2 I(X; Y)}{H(X) + H(Y)}.
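
These entropy-based quantities are straightforward to compute for discrete samples. The sketch below implements H(X), H(X, Y), I(X; Y), and SU(X, Y) exactly as defined above; base-2 logarithms are assumed.

```python
import numpy as np
from collections import Counter

def entropy(x) -> float:
    """Shannon entropy H(X) of a discrete sample, in bits."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def joint_entropy(x, y) -> float:
    """Joint entropy H(X, Y) of two aligned discrete samples."""
    return entropy(list(zip(x, y)))

def mutual_information(x, y) -> float:
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def symmetric_uncertainty(x, y) -> float:
    """SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), scaled into [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    return 2.0 * mutual_information(x, y) / (hx + hy) if hx + hy > 0 else 0.0

# Example: a feature perfectly informative about the labels gives SU = 1.
print(symmetric_uncertainty([0, 0, 1, 1], [0, 0, 1, 1]))  # -> 1.0
```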

3. Proposed Model and Algorithm

3.1. Model Design

The optimization objectives of the feature selection model include the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance, which are described as follows:
(1) The Number of Selected Features. It is minimized to ensure the simplification of feature selection:

f_1 = |S|,

where |S| represents the cardinality of the selected feature set S.
(2) Accuracy. The accuracy of the learning algorithm is measured by the classification performance; the higher the classification performance is, the greater the accuracy. In this paper, the extreme learning machine (ELM) classifier [8] is used to calculate the accuracy:

f_2 = \frac{TP + TN}{TP + TN + FP + FN},

where TN, TP, FN, and FP represent the numbers of true negatives, true positives, false negatives, and false positives, respectively.
(3) Relevance. The relevance between the features and the categorical variable reflects the recognition ability of the selected features; the greater the relevance is, the stronger the recognition ability:

f_3 = \sum_{x_i \in S} SU(x_i, c),

where x_i represents the i-th selected feature and c represents the target categorical variable. This objective is normalized by the subset cardinality |S|.
(4) Redundancy. The redundancy quantifies the level of similarity between the selected features; the smaller the redundancy is, the smaller the similarity:

f_4 = \sum_{x_i, x_j \in S,\, i < j} SU(x_i, x_j),

where x_i and x_j represent selected features. This objective is normalized by the number of selected feature pairs.
(5) Interclass Distance. The interclass distance is the distance between the mean sample of each class and the average of the mean samples of all classes, which reflects the ability to discriminate samples of different classes. In the evolutionary process, a better sample distribution is obtained by maximizing the interclass distance:

f_5 = \sum_{k=1}^{c} d(m_k, \bar{m}),

where c is the total number of classes, m_k is the mean vector of all samples in class k over the selected features, and \bar{m} is the average of the mean samples of all classes. This objective is normalized by the number of classes.
(6) Intraclass Distance. By calculating the distances between the samples with the selected features and the mean of all samples of the same class, this objective reflects the cohesion of samples of the same class and can improve the accuracy to a certain extent:

f_6 = \sum_{k=1}^{c} \sum_{x_j^k \in \text{class } k} d(x_j^k, m_k),

where x_j^k is the j-th sample in class k. This objective is normalized by the total number of samples.

Therefore, the feature selection optimization model in this paper is defined as

\min F(s) = (f_1(s), -f_2(s), -f_3(s), f_4(s), -f_5(s), f_6(s)), \quad \text{s.t.} \quad s \in \{0, 1\}^D,

where s is the binary mask encoding the candidate feature subset, and the maximized objectives (accuracy, relevance, and interclass distance) are negated so that all six objectives are minimized.
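
To make the model concrete, the following sketch evaluates the non-classifier objectives for one binary feature mask. The normalization constants (subset size, number of feature pairs, number of classes, and number of samples) and the layout of the precomputed SU matrix are illustrative assumptions, and the accuracy objective f_2 is omitted because it requires training the ELM classifier.

```python
import numpy as np

def feature_subset_objectives(X, y, mask, su):
    """Evaluate the non-classifier objectives for one binary feature mask.

    X    : (n_samples, n_features) data matrix
    y    : (n_samples,) integer class labels
    mask : boolean feature mask selecting at least one feature
    su   : assumed layout: su[i, j] = SU(feature_i, feature_j) and
           su[i, -1] = SU(feature_i, y)
    """
    sel = np.flatnonzero(mask)
    k = len(sel)
    f1 = k                                         # number of selected features
    f3 = su[sel, -1].sum() / k                     # relevance, normalized by |S|
    pairs = [(i, j) for a, i in enumerate(sel) for j in sel[a + 1:]]
    f4 = (sum(su[i, j] for i, j in pairs) / len(pairs)) if pairs else 0.0
    classes = np.unique(y)
    Xs = X[:, sel]
    means = np.array([Xs[y == c].mean(axis=0) for c in classes])
    overall = means.mean(axis=0)                   # average of the class means
    f5 = np.linalg.norm(means - overall, axis=1).sum() / len(classes)
    f6 = sum(np.linalg.norm(Xs[y == c] - means[t], axis=1).sum()
             for t, c in enumerate(classes)) / len(X)
    return f1, f3, f4, f5, f6
```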

3.2. The Proposed Algorithm: MALSMEA

In this paper, a modified vector angle-based large-scale many-objective evolutionary algorithm is proposed, termed MALSMEA. MALSMEA mainly uses a mutation operator based on variable grouping and the environmental selection method of VaEA [38]. Figure 1 shows the program flowchart of MALSMEA. The main process of MALSMEA is as follows (a sketch of the grouped mutation operator in Step 2 is given after this list):
(i) Step 1. Randomly initialize a population P with N individuals in the whole decision space \Omega, and set the parameters.
(ii) Step 2. Apply the variable grouping-based mutation operator, with ordered grouping as the grouping method, to mutate the population P and generate the offspring population Q.
(iii) Step 3. Combine the offspring population Q with the parent population P to obtain the joint population U. Then, the environmental selection in Steps 4-9 is adopted to select promising individuals from U.
(iv) Step 4. Normalize the individuals in U, and calculate the fitness and density values of each individual as well as the vector angle between every two individuals.
(v) Step 5. Rank U with the nondominated sorting method, and determine the last layer F_l.
(vi) Step 6. According to the vector angles between individuals in the layer F_l and the fitness value of each individual, the individuals with the largest vector angles and those with the smallest fitness values are selected to join the new population P' to ensure diversity.
(vii) Step 7. If |P'| < N, select the individual in F_l with the largest vector angle to the individuals already in P' to join P'; otherwise, go to Step 9.
(viii) Step 8. To maintain the balance between convergence and diversity, use the worst-case solution replacement strategy to replace poor individuals. Repeat from Step 7 while |P'| < N.
(ix) Step 9. Obtain the new population P'.
(x) Step 10. Repeat from Step 2, and stop when the maximum number of generations is reached.
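
Step 2 relies on variable grouping-based polynomial mutation. The sketch below shows one common GLMO-style variant under ordered grouping, in which a randomly chosen group of consecutive variables is mutated together; the simplified delta computation and the choice to mutate exactly one group per call are assumptions, not necessarily the paper's exact operator.

```python
import numpy as np

rng = np.random.default_rng(0)

def grouped_polynomial_mutation(x, lower, upper, n_groups=4, eta=20.0):
    """Polynomial mutation applied group-by-group (ordered grouping).

    The D decision variables are split into n_groups consecutive groups;
    one group is picked at random and all variables in it are mutated,
    which perturbs many related variables at once on large-scale problems.
    """
    D = len(x)
    child = x.copy()
    groups = np.array_split(np.arange(D), n_groups)   # ordered grouping
    g = groups[rng.integers(n_groups)]                # mutate one whole group
    for i in g:
        u = rng.random()
        if u < 0.5:                                   # simplified delta term
            delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
        else:
            delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
        child[i] = np.clip(x[i] + delta * (upper[i] - lower[i]),
                           lower[i], upper[i])
    return child
```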

3.3. The Worst-Case Solution Replacement Strategy Based on SDE

As the extreme individuals have already been selected according to the vector angle and fitness value, the worst-case solution replacement strategy in the environmental selection process uses SDE to calculate the density of individuals, which accounts for convergence and diversity simultaneously. Using this method, we can replace poor individuals whose search directions are similar to those of better ones. The specific process is as follows: if the angle between an individual q in F_l and an individual p in P' is smaller than the norm angle between two neighboring solutions of an ideally distributed population (a threshold determined by the population size N), then they have similar search directions. In this case, if D(q) < D(p), then individual p is replaced by q. After each replacement, the angles between the individuals and the new population are updated.
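
A sketch of this replacement step is given below. The angle threshold gamma, the density function, and all names are placeholders for the quantities described above; the function returns the index of the population member to be replaced, if any.

```python
import numpy as np

def worse_replacement(F_new, F_pop, density_fn, gamma):
    """SDE-based worst-case solution replacement (sketch).

    F_new      : (M,) normalized objective vector of a last-front candidate
    F_pop      : (N, M) normalized objectives of the partially filled population
    density_fn : callable returning SDE densities for a stacked population
    gamma      : angle threshold below which search directions count as similar
    """
    def angle(a, b):
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return np.arccos(np.clip(cos, -1.0, 1.0))

    for j, f in enumerate(F_pop):
        if angle(F_new, f) < gamma:                  # similar search direction
            D = density_fn(np.vstack([F_pop, F_new]))
            if D[-1] < D[j]:                         # candidate is better by SDE
                return j                             # replace the worse one
    return None
```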

3.4. The Wrapper Structure of MALSMEA

MALSMEA is applied to the feature selection model, and the pseudocode of the wrapper structure of MALSMEA is shown in Algorithm 1. The main steps are as follows:
(i) First, the input dataset D is divided into training and test datasets.
(ii) Then, in the initialization process, MALSMEA allocates random feature vectors selected from the data feature matrix FM. The selected feature vectors are encoded as solutions by using the coding technology of [9] to reduce the amount of computation in the evolutionary process; the mask of each feature vector is regarded as the decision variables, and the population P is formed.
(iii) Then, in the wrapper structure, the population P is evaluated via the six objective functions to obtain the objective vectors, yielding the evaluated population P'. The feature number is calculated from the decision variables of the solutions, the accuracy is obtained from the decoded feature subset and the corresponding ELM classifier [8], and the other objectives are calculated according to the corresponding equations.
(iv) Then, the population P' is optimized by MALSMEA.
(v) Finally, the optimal set PS is obtained.

Input: Dataset with labels, D; the maximal number of generations, g_max; the population size, N;
Output: The Pareto subset, PS;
(1) Divide D into training and test datasets;
(2) FM ← the feature matrix of the training dataset;
(3) P ← Initialization(FM, N); // encode the selected feature vectors as masks
(4) P' ← Evaluation(P); // the six objective functions of Section 3.1
(5) PS ← MALSMEA(P', g_max);
(6) Return PS;
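
The evaluation step of the wrapper (line (4) of Algorithm 1) can be sketched as follows. The helper names objective_fn and classifier_fn are placeholders for the objective computations of Section 3.1 and the ELM classifier, respectively, and the sign convention for the maximized objectives follows the model definition above.

```python
import numpy as np

def evaluate_population(pop_masks, X_train, y_train, objective_fn, classifier_fn):
    """Evaluate each binary feature mask with the six model objectives.

    pop_masks     : (N, D) boolean matrix, one feature mask per individual
    objective_fn  : computes the non-classifier objectives (f1, f3, f4, f5, f6)
    classifier_fn : returns the accuracy f2 of a classifier (the ELM in the
                    paper) trained on the decoded feature subset
    """
    objectives = []
    for mask in pop_masks:
        if not mask.any():                    # guard: empty subsets are invalid
            objectives.append(None)
            continue
        f1, f3, f4, f5, f6 = objective_fn(X_train, y_train, mask)
        f2 = classifier_fn(X_train[:, mask], y_train)
        # All six objectives are minimized: negate the maximized ones.
        objectives.append((f1, -f2, -f3, f4, -f5, f6))
    return objectives
```
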
3.5. Time Complexity Analysis

The time complexity of MALSMEA is composed mainly of the following parts: the variable grouping-based mutation operation, nondominated sorting [31], the worst-case solution replacement strategy based on SDE, which requires pairwise density and angle computations over the population, and the remaining operations of environmental selection. Taking these parts together and comparing with the four competing algorithms, the time complexity of MALSMEA is similar to that of the grouped and linked polynomial mutation operator (GLMO) [39] but greater than those of the linear combination-based search algorithm (LCSA) [40], the vector angle-based evolutionary algorithm (VaEA) [38], and RVEA [32].

4. Experimental Studies

In this section, DTLZ1-DTLZ6 in the Deb, Thiele, Laumanns, and Zitzler (DTLZ) test suite [41] and LSMOP1-LSMOP9 in the Large-Scale Multi- and Many-Objective Problems (LSMOP) test suite [42] are selected to evaluate the performance of MALSMEA, and four datasets in the University of California at Irvine (UCI) machine learning library [43] are selected to evaluate the ability of MALSMEA to optimize the proposed feature selection model, among which Heart is a two-class dataset, Zoo and Iris are two multiclass datasets, and Musk1 is a high-dimensional dataset. For LSMaOPs, MALSMEA is compared with GLMO [39], LCSA [40], VaEA [38], and RVEA [32]. GLMO and LCSA are large-scale multiobjective evolutionary algorithms. GLMO uses mutation operators based on variable grouping, and LCSA uses a linear combination to reduce dimensionality. VaEA and RVEA are many-objective evolutionary algorithms that use vector angles and reference vectors, respectively. For the proposed six-objective feature selection model, MALSMEA is compared with W-MOSS [44], W-QEISS, and F-QEISS [8].

In the following subsections, we introduce the performance indicators and the parameter settings of the experiments. For all algorithms, when the objective numbers are 5 and 10, the population sizes are 126 and 275, respectively, and the numbers of decision variables are 500 and 1000. Each algorithm runs 20 times independently and stops when the number of function evaluations (FEs) reaches 90,000. The performance of MALSMEA is verified by comparing the average IGD values obtained by the five algorithms; in each test instance, the best average IGD value is highlighted in bold. Finally, on the four datasets, MALSMEA and the three feature selection algorithms are applied to the proposed six-objective feature selection optimization model, for which the maximum number of FEs is 100 and each algorithm runs independently 10 times. The optimization ability of MALSMEA is verified by comparing the HV indicator values and optimization results.

4.1. Experimental Settings

(1) Performance Indicators. In the experiments, IGD [45] and HV [46] are used as evaluation indicators. The smaller the IGD value (or the larger the HV value) is, the better the performance of the algorithm. The IGD indicator evaluates the algorithm by calculating the average of the minimum distances from the points sampled on the true PF to the obtained solution set (see the sketch after this list). The HV indicator quantifies the algorithm performance by calculating the volume enclosed by the obtained nondominated solution set and a reference point.
(2) Parameter Settings for the Crossover and Mutation Operators. In the performance verification experiments, MALSMEA and GLMO use the variable grouping-based mutation operator to generate offspring, while the other algorithms use simulated binary crossover (SBX) [32] and polynomial mutation [47], with crossover probability p_c = 1.0, mutation probability p_m = 1/D, and distribution index 20, where D is the number of decision variables. In the experiment verifying the superiority of MALSMEA on the proposed model, the operator parameters are set according to [9].
(3) Other Parameter Settings. In MALSMEA and GLMO [39], the number of groups is set to 4, and the ordered grouping method is adopted. For RVEA [32], the penalty index α and the adaptation frequency f_r are set to 2 and 0.1, respectively. The parameters of W-QEISS and F-QEISS are set according to [8], and the search method is based on r-NSGA-II [48]. The parameters of W-MOSS are set according to [44].
(4) Datasets. The details of the 4 UCI datasets utilized are shown in Table 1.
(5) ELM Classifier. For the proposed model, the ELM classifier [8] is utilized to evaluate the accuracy of the current solution, following the criterion given in [46] for the hidden-layer activation function and the number of hidden neurons. The target classification variable and the (input) features are normalized into the ranges [0, 1] and [−1, 1], respectively, in each dataset. To minimize the accuracy deviation, the k-fold cross-validation approach is utilized, and the average accuracy is used for comparison [9].
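
For reference, the IGD indicator used throughout the experiments can be computed as in the sketch below, which averages, over the sampled true PF points, the distance to the closest obtained solution; lower values are better.

```python
import numpy as np

def igd(pf_samples: np.ndarray, solutions: np.ndarray) -> float:
    """Inverted generational distance.

    pf_samples : (P, M) points sampled from the true Pareto front
    solutions  : (N, M) objective vectors obtained by the algorithm
    """
    # Pairwise distances between every PF sample and every obtained solution.
    dists = np.linalg.norm(pf_samples[:, None, :] - solutions[None, :, :], axis=2)
    # For each PF sample, keep the distance to its closest solution, then average.
    return float(dists.min(axis=1).mean())

# Example: a solution set that covers the sampled front exactly gives IGD = 0.
pf = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
print(igd(pf, pf))  # -> 0.0
```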

4.2. Performance Comparison of Algorithms on DTLZ

Table 2 describes the IGD indicator values obtained by the five algorithms on the 5- and 10-objective DTLZ1-DTLZ6 with 500 and 1000 decision variables. As shown in Table 2, MALSMEA is competitive with the other four algorithms. Specifically, MALSMEA achieves the best result on 18 out of 24 test instances, and its performance on the 10-objective DTLZ problems is significantly better than that of the other algorithms. The experimental results are analyzed in detail below.

DTLZ1 reflects the convergence of the algorithm. MALSMEA outperforms the other algorithms on the 5- and 10-objective DTLZ1. These results demonstrate that MALSMEA has better convergence on the large-scale high-dimensional DTLZ1. DTLZ2 is generally used to test the scalability of algorithms with respect to the number of objectives. The performance of MALSMEA on the 5-objective DTLZ2 is better than that of LCSA but slightly inferior to that of GLMO, VaEA, and RVEA. The performance of MALSMEA on the 10-objective DTLZ2 is better than that of the other four algorithms. Thus, MALSMEA has better scalability to the objective number.

DTLZ3 is a highly multimodal problem similar to DTLZ1. MALSMEA obtains the smallest IGD indicator value on DTLZ3 with 500 and 1000 decision variables. DTLZ4 is used to test the ability of the algorithm to maintain the diversity of the population. MALSMEA obtains the smallest IGD indicator value on the 10-objective DTLZ4 with 500 and 1000 decision variables, while VaEA outperforms the other algorithms on the 5-objective DTLZ4. Thus, MALSMEA exhibits greater diversity on the large-scale 10-objective DTLZ4.

For the 5-objective DTLZ5, MALSMEA outperforms LCSA with 500 and 1000 decision variables but is inferior to GLMO, VaEA, and RVEA. For the 10-objective DTLZ5, MALSMEA outperforms its counterparts. For DTLZ6, the overall performance of MALSMEA is optimal on instances with up to 1000 decision variables.

To further test the performance of MALSMEA, the nonparametric Friedman test [49] is employed. Based on the average IGD indicator values of the five algorithms on DTLZ, Table 3 lists their average rankings. The average ranking of MALSMEA is the smallest, which indicates that MALSMEA performs the best, while the average ranking of LCSA is the largest, so its performance is the worst.

To verify the efficiency of MALSMEA, Table 4 presents the running times of MALSMEA and the four other algorithms on the 10-objective DTLZ1 with 1000 decision variables. The running times of MALSMEA and GLMO are quite similar but greater than those of the other three algorithms.

4.3. Performance Comparison of Algorithms on LSMOP

The LSMOP test suite is designed to test the performance of algorithms on LSMaOPs. Table 5 lists the IGD indicator values obtained by the five algorithms on the 5- and 10-objective LSMOP1-LSMOP9 with 500 and 1000 decision variables. MALSMEA achieves the best result on 26 out of 36 test instances. Therefore, compared with the other four algorithms, MALSMEA performs better in solving LSMaOPs.

Specifically, for the LSMOP test suite with 500 decision variables, MALSMEA outperforms the other algorithms on the 5- and 10-objective LSMOP2, LSMOP4, LSMOP5, LSMOP8, and LSMOP9. MALSMEA is inferior to LCSA on LSMOP3. MALSMEA outperforms the other algorithms on the 10-objective LSMOP1 and LSMOP7, but LCSA obtains the smallest IGD indicator value on the 5-objective LSMOP1 and LSMOP7. MALSMEA obtains the smallest IGD indicator value on the 5-objective LSMOP6, while RVEA performs better on the 10-objective LSMOP6.

For the LSMOP test suite with 1000 decision variables, MALSMEA outperforms the other algorithms on the 5- and 10-objective LSMOP2, LSMOP4, LSMOP5, LSMOP8, and LSMOP9. MALSMEA is inferior to LCSA on LSMOP3. LCSA obtains the best performance on the 5-objective LSMOP1 and LSMOP7, and MALSMEA outperforms the other algorithms on the 10-objective LSMOP1 and LSMOP7. The performance of MALSMEA on the 5-objective LSMOP6 is better than that of the other algorithms, but it is slightly inferior to that of LCSA and RVEA on the 10-objective LSMOP6.

4.4. Comparison of the Optimization Results on the Proposed Model

Table 6 shows the HV indicator values and objective values of the four algorithms after optimization on the four datasets. The results demonstrate that MALSMEA obtains the maximum HV indicator values, showing that MALSMEA has certain advantages in feature selection. As noted in Table 6, the optimization performance of MALSMEA is better on Iris and Musk1 among the four datasets. MALSMEA is slightly inferior to the other three algorithms in relevance and redundancy but exhibits better performance on the other four objectives. In addition, W-QEISS and F-QEISS are relatively better than the other algorithms in terms of relevance and redundancy but worse on the other objectives.

5. Conclusion

In this paper, a modified vector angle-based large-scale many-objective evolutionary algorithm called MALSMEA is proposed. In MALSMEA, polynomial mutation based on variable grouping is used in place of naive polynomial mutation to improve the efficiency of solving large-scale optimization problems, and a novel worst-case solution replacement strategy based on SDE is proposed to replace the worse of two individuals with similar search directions to increase diversity. MALSMEA is compared with four typical algorithms on optimization problems with up to 10 objectives and 1000 decision variables, and the experimental results indicate that MALSMEA outperforms the four algorithms on the DTLZ and LSMOP test suites. Furthermore, by studying the existing feature selection models and taking the number of selected features, accuracy, relevance, redundancy, interclass distance, and intraclass distance as the optimization objectives, a six-objective optimization model is constructed and solved by MALSMEA. Compared with the other three feature selection algorithms, MALSMEA has certain advantages in solving this model.

Future studies will proceed in two directions. The first direction is to add a parallel strategy to MALSMEA to improve efficiency or to further modify its environmental selection method. Another research direction is to solve LSMaOPs in other fields using MALSMEA.

Data Availability

The details of the four UCI datasets utilized are shown in Table 1.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant no. 61976242, in part by the Fundamental Scientific Research Funds for Interdisciplinary Team of Hebei University of Technology under Grant no. JBKYTD2002, and in part by the Guangdong Provincial Key Laboratory under Grant no. 2020B121201001.