Abstract
Microarray gene expression data provide a promising way to diagnose disease and classify cancer. However, in bioinformatics, the gene selection problem, i.e., how to select the most informative genes from thousands of genes, remains challenging. It is a specific feature selection problem with high-dimensional features and small sample sizes. In this paper, a two-stage method combining a filter feature selection method and a wrapper feature selection method is proposed to solve the gene selection problem. In contrast to common methods, the proposed method models the gene selection problem as a multiobjective optimization problem. Both stages employ the same multiobjective differential evolution (MODE) as the search strategy but incorporate different objective functions. The three objective functions of the filter method are mainly based on mutual information. The two objective functions of the wrapper method are the number of selected features and the classification error of a naive Bayes (NB) classifier. Finally, the performance of the proposed method is tested and analyzed on six benchmark gene expression datasets. The experimental results verify that the proposed approach offers a novel and effective way to solve the gene selection problem with a multiobjective optimization algorithm.
1. Introduction
Gene selection is an important issue in bioinformatics [1]. A gene is the basic functional unit of heredity. Gene expression is the process in which the instructions encoded in genes are used to synthesize gene products [2] such as proteins. Then, the gene products dictate cellular function. Therefore, abnormal gene expression is usually correlated with different types of disease, such as cancer [3]. Usually, many diseases correspond to unique gene expression profiles that can be revealed by DNA microarray technology [4]. Typically, microarray data corresponding to a certain disease consist of a set of biological samples. From each sample, the expression of thousands of genes at each position can be measured. As a result, microarray data are usually in the form of a matrix. However, it is not an easy task for researchers to check which genes are responsible for a given disease because of the high dimensionality of microarray data. Thus, determining how to select the most significant genes effectively for further analysis becomes urgent and vital.
The gene selection problem is intrinsically a feature selection problem with high-dimensional features and small sample sizes. Since gene expression data can be labeled (whether the sample is malignant or not), partially labeled, or unlabeled, three categories of methods are applied to solve the gene selection problem in the literature [5]: supervised, semi-supervised, and unsupervised feature selection methods. Because labeled data are the most common type of data in reality, supervised feature selection methods are the most widely used and most practical methods for the gene selection problem. We refer to feature (gene) selection methods as supervised feature (gene) selection methods in the following context.
In the field of machine learning, feature selection, also known as attribute selection, is defined as the process by which the best subset of relevant features is selected from a large set of features [6], and the performance of classifiers is improved by the optimal feature subset when compared with the utilization of all features. However, it is difficult to execute feature selection by retaining relevant features and removing irrelevant and redundant features. There are two main obstacles in feature selection. First, the size of the search space is quite large: given a dataset with n features, there are 2^n candidate subsets (solutions) [7]. Specifically, as big data continues to grow [8], n becomes increasingly large. Thus, in most cases, an exhaustive search for feature selection is impossible. Second, the feature interaction problem makes feature selection complex. For example, a feature as a single entity may be irrelevant to the target, but when combined with another feature, it may become significantly relevant. In fact, there are many interaction patterns among features. As a result, "the m best features are not the best m features" [9]. Therefore, the performance of a feature selection method depends on two key factors: (1) effective evaluation criteria to measure the quality of a feature subset and (2) an efficient search strategy to explore the large search space [10].
Regarding evaluation criteria, feature selection methods can be roughly classified into two categories: filter methods and wrapper methods [10]. The main difference between them is that wrapper methods use a classifier to evaluate a feature subset, while filter methods do not. Filter methods are independent of any classifier and focus on the intrinsic characteristics of the dataset. The common metrics used in filter methods are correlation [11] and mutual information [12]. Filter methods that examine each feature separately are considered univariate; they ignore the feature interaction problem and lead to redundant feature subsets. Thus, multivariate filter methods such as minimum redundancy-maximum relevance (mRMR) [13] are considered better choices. Wrapper methods select discriminative feature subsets to improve the classification performance. Most popular classifiers can be incorporated into wrapper methods, e.g., the naive Bayes (NB), K-nearest neighbors, support vector machine, and neural network classifiers [14]. Filter methods are generally faster, but their accuracy is relatively lower; wrapper methods are the opposite because they must bear the computational costs of the involved classifiers. Thus, combining them into a hybrid method is an alternative and promising approach for feature selection problems, especially for the gene selection problem [15].
There are two main categories of search strategies applied in feature selection. The first category is sequential search. Sequential forward selection and sequential backward selection [16] are considered conventional methods but suffer from the "nesting effect" [17] because only one feature is added or removed at a time. The second category comprises randomized search strategies that start by randomly selecting some features and then executing a heuristic search. It has been verified that methods based on randomized search are better than methods based on sequential search because they can escape local optima more easily [10]. In particular, applying evolutionary computation (EC) techniques such as genetic algorithms (GAs) [18], particle swarm optimization (PSO) [19, 20], and differential evolution (DE) [21, 22] to feature selection has attracted the attention of researchers in recent years.
Regarding the gene selection problem, numerous methods based on EC techniques have been proposed in the literature [5]. The pertinent experiments have shown that EC techniques can achieve very competitive performance compared with traditional methods. For example, Mohamad et al. proposed an improved binary PSO as a wrapper method and obtained positive results [23]. Shreem et al. proposed a Markov blanket-embedded harmony search algorithm as a wrapper method to solve the gene selection problem [24], and Elyasigomari et al. proposed a filter method based on the cuckoo optimization algorithm and shuffling [25], where a clustering technique was involved. In addition, a modified artificial bee colony algorithm was applied to solve the gene selection problem in the work of Alshamlan et al. [26], where the search method was enhanced by combining two EC algorithms. Note that most current methods based on EC techniques treat the gene selection problem as a single-objective optimization problem. On the other hand, recent work [22, 27] suggests that multiobjective optimization techniques are promising alternatives for solving the gene selection problem. This is because the transformation from a single objective to multiple objectives can lead to improvements in both the search strategy and the evaluation criteria; thus, more competitive results can be obtained. However, to the best of the authors' knowledge, employing an effective multiobjective differential evolution (MODE) approach to address the gene selection problem has not yet been well explored.
Thus, in this study, a two-stage method based on multiobjective optimization is proposed. The first stage is a multivariate filter method in which three objective functions based on mutual information are incorporated. The second stage is a conventional wrapper method involving the NB classifier; the number of selected features and the classification error are the two objective functions of this stage. In addition, both stages employ the same search strategy: a well-designed MODE. Finally, six benchmark datasets are used to test and analyze the performance of the proposed method. The experimental results are statistically compared with those of widely used feature selection methods.
The remainder of the paper is organized as follows. Section 2 introduces three important concepts: multiobjective optimization, differential evolution, and mutual information. Section 3 describes the proposed method. Section 4 provides the experimental results and analysis. Finally, Section 5 draws the conclusion of this paper.
2. Materials
2.1. Multiobjective Optimization Problem
Many real-world problems involve multiple conflicting objectives that should be optimized simultaneously [28]. A MOOP is a multiobjective minimization problem that involves more than one objective function to be optimized, and it can be mathematically stated as follows:

minimize F(x) = (f1(x), f2(x), ..., fm(x)), subject to x ∈ Ω,

where x = (x1, x2, ..., xn) is the n-dimensional decision vector and Ω is the decision space. F consists of m real-valued objective functions f1, f2, ..., fm. In normal cases, no single solution can optimize all the objective functions because of the conflicts among these objectives. Four important definitions referring to MOOPs are given as follows.
Definition 1. (Pareto dominance). Let u = (u1, ..., um) and v = (v1, ..., vm) be two objective vectors. u is said to dominate v, represented as u ≺ v, if ui ≤ vi for every i ∈ {1, ..., m} and uj < vj for at least one index j.
Definition 2. (Pareto optimal solution). For a given MOOP, a vector x* ∈ Ω is called a Pareto optimal solution if there is no x ∈ Ω such that F(x) ≺ F(x*).
Definition 3. (Pareto optimal set). All Pareto optimal solutions compose the Pareto optimal set PS, which can be described as follows: PS = {x ∈ Ω | x is a Pareto optimal solution}.
Definition 4. (Pareto front). The image of the Pareto optimal set is called the Pareto front PF, which is composed of objective vectors and is defined as follows: PF = {F(x) | x ∈ PS}. For a real-world MOOP, the Pareto front PF is usually unreachable and may contain infinitely many points. Therefore, the goal of an optimization method [29–31] is to obtain an approximation of PF that is as convergent and diverse in the objective space as possible. In addition, an excellent approximation of PF is crucial for a decision maker to select the final solutions.
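These definitions map directly onto code. The following sketch (illustrative Python; the function names are ours, not from the paper) tests Pareto dominance for minimization problems and extracts the nondominated subset of a list of objective vectors:

```python
import numpy as np

def dominates(u, v):
    """Return True if objective vector u Pareto-dominates v (minimization):
    u is no worse in every objective and strictly better in at least one."""
    u, v = np.asarray(u), np.asarray(v)
    return bool(np.all(u <= v) and np.any(u < v))

def pareto_front(points):
    """Extract the nondominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

This quadratic-time scan is adequate for the small archives used here; faster non-dominated sorting exists for large populations.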
2.2. Standard Differential Evolution
DE is a simple but powerful stochastic optimization algorithm that was first proposed by Storn and Price in the 1990s [32]. Recent research has further improved its efficiency in solving many real-world problems [33–35]. The characteristic of DE is using the difference between two candidate solutions to generate a new candidate solution. The algorithm is population based and works through a cycle of computational steps similar to those employed in common evolutionary algorithms. The flowchart of standard DE is shown in Figure 1; it can be separated into four stages: initialization, mutation, crossover, and selection.
DE optimizes a problem by maintaining a population of candidate solutions and evolving them with specific formulas within the search space. An individual, also called a genome, is represented as a vector forming a candidate solution for a specific problem as follows:

x_i^G = (x_i,1^G, x_i,2^G, ..., x_i,D^G), i = 1, 2, ..., NP,

where D is the dimension of the search space and x_i^G represents the i-th individual in the NP-sized population at generation G.
Initially, all individuals x_i^0, also called target vectors, are randomly initialized within a problem-specific range. Then, standard DE starts its main loop, and every individual evolves in the following steps. First, for each individual x_i^G, the differential mutation operator generates a donor vector v_i^G as follows:

v_i^G = x_r1^G + F · (x_r2^G − x_r3^G),

where F is the mutation scale factor that controls the scaled difference and r1, r2, and r3 are three distinct integers randomly chosen from the range [1, NP]. Note that the three integers must also be different from the current index i.
Next, the trial vector u_i^G is generated by crossing over the target vector x_i^G and the donor vector v_i^G. The binomial crossover operation employed in standard DE exchanges components between x_i^G and v_i^G as follows:

u_i,j^G = v_i,j^G if rand_j ≤ CR or j = j_rand; otherwise, u_i,j^G = x_i,j^G,

where u_i,j^G is the j-th element of u_i^G and rand_j is a uniformly distributed random real number in [0, 1]. CR is the crossover rate that controls how many elements of u_i^G are inherited from v_i^G. j_rand, which ensures that u_i^G obtains at least one element from v_i^G, is a random integer in [1, D].
Then, the selection process is executed to update each individual as follows:

x_i^(G+1) = u_i^G if f(u_i^G) ≤ f(x_i^G); otherwise, x_i^(G+1) = x_i^G,

where f is the single-objective function of DE.
Finally, the DE terminates when the stopping criterion is met.
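Putting the four stages together, standard DE (the DE/rand/1/bin scheme described above) can be sketched as follows. This is an illustrative Python implementation; the parameter defaults and the test function are our choices, not values from the paper:

```python
import numpy as np

def de(f, bounds, NP=20, F=0.5, CR=0.9, max_gen=200, seed=0):
    """Standard DE/rand/1/bin over a box-constrained search space."""
    rng = np.random.default_rng(seed)
    D = len(bounds)
    lo, hi = np.array(bounds, dtype=float).T
    pop = rng.uniform(lo, hi, (NP, D))          # initialization
    fit = np.array([f(x) for x in pop])
    for _ in range(max_gen):
        for i in range(NP):
            # mutation: three distinct indices, all different from i
            r1, r2, r3 = rng.choice([j for j in range(NP) if j != i],
                                    3, replace=False)
            donor = pop[r1] + F * (pop[r2] - pop[r3])
            # binomial crossover: inherit at least one donor component
            mask = rng.random(D) <= CR
            mask[rng.integers(D)] = True
            trial = np.where(mask, donor, pop[i])
            # greedy one-to-one selection
            ft = f(trial)
            if ft <= fit[i]:
                pop[i], fit[i] = trial, ft
    best = np.argmin(fit)
    return pop[best], fit[best]
```

On a smooth unimodal function such as the sphere function, this loop converges to near-zero error within a few hundred generations.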
2.3. Mutual Information
In information theory [36], the mutual information of two variables quantifies the mutual dependence between them. This metric is a powerful measure of the correlation between two variables and is not sensitive to noise in sampling [37]. Given two continuous variables X and Y, their mutual information can be defined as follows:

I(X; Y) = ∫∫ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy,

where p(x) and p(y) are the marginal probability density functions of X and Y, respectively, and p(x, y) is their joint probability density function. Therefore, if two variables are strictly independent, their mutual information is equal to 0. Similarly, for two discrete variables X and Y, mutual information has the following form:

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ].
Given two variables X and Y, the range of the mutual information between them is [0, min(H(X), H(Y))], where H(·) is the function that calculates the entropy of a variable.
Although mutual information has been considered an excellent indicator to quantify the dependence between two variables, its calculation is not easy because estimating probability density functions is a complex task. If both variables are discrete, the calculation of mutual information is straightforward: the samples in different categories are counted to construct the joint and marginal probability tables. However, if at least one of the two variables is continuous, the calculation becomes difficult. In this work, we use entropy estimation based on the K-nearest-neighbor distance [38] to calculate mutual information.
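For the discrete case, the mutual information sum above can be evaluated directly from empirical frequencies. The sketch below is a naive plug-in estimator (natural logarithm, so the result is in nats); the K-nearest-neighbor estimator of [38] used for continuous variables is considerably more involved and is not reproduced here:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for two discrete label arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))   # joint frequency
            px, py = np.mean(x == xv), np.mean(y == yv)  # marginals
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi
```

For identical variables this returns the entropy H(X), and for independent variables it returns 0, matching the bounds stated above.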
3. Methodology
In this section, the proposed two-stage method based on MODE is described. Figure 2 illustrates the flowchart of the proposed method, which consists of two stages: a filter stage and a wrapper stage. In the latter stage, a novel wrapper method based on MODE is proposed. In addition, two single-objective wrapper methods based on DE are proposed in this stage. These two single-objective methods serve as baselines to test the performance of the MODE-based wrapper method and help us investigate the following: (1) whether it is necessary to consider the number of selected features in the wrapper method and (2) whether the method based on multiobjective optimization outperforms the methods based on single-objective optimization.
3.1. Multiobjective Differential Evolution
Due to the effectiveness of DE for solving single-objective optimization problems, extending DE to solve MOOPs has attracted the interest of researchers in the literature [34]. Two important issues must be overcome in extending DE into MODE. The first issue is how to order two candidate solutions. Ordering is straightforward when one solution dominates the other; however, if two candidate solutions do not dominate each other, an additional strategy to assign a complete order must be provided. Second, an effective scheme for maintaining a set of nondominated solutions during the optimization process is necessary. In contrast to single-objective optimization problems, where only one global optimal solution is generated, the goal of MOOPs is to obtain a set of nondominated solutions; therefore, the convergence and diversity of this set should be ensured. A widely used method is to adopt an external archive coupled with the current population [30].
The proposed MODE follows the framework of standard DE, which is shown in Figure 1. The external archive stores the nondominated solutions that interact with the current population. In addition, the mutation operator and the selection operator are modified and differ from those of the standard DE algorithm. The key components of the proposed MODE are described below.
3.1.1. External Archive

Adopting an external archive to store nondominated solutions is a common and effective method in numerous multiobjective evolutionary algorithms [39, 40]. Similarly, an archive A with limited size is maintained in the optimization process of the proposed MODE. A solution s will be added into A if any one of the following criteria is met: (1) A is empty. (2) A is not full, and s is not dominated by any solution in A. (3) s dominates at least one solution in A; note that in this case, the solutions dominated by s will be removed from A. (4) A is full, and s is nondominated with respect to every solution in A. In this extreme condition, s is first added into A, and a density estimation operation is executed to assign each solution a crowding distance value (see Section 3.1.2). Then, the solution in the most crowded region is removed from A.
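As a sketch, the four archive-update rules can be expressed as follows. This is illustrative Python, not the paper's exact procedure: `dominates` and `crowding_distance` are assumed helper callables, and the nondominance checks of rules (2)–(4) are folded into two tests:

```python
def update_archive(archive, s, dominates, crowding_distance, max_size):
    """Insert candidate s into the bounded archive per the four rules above.
    `crowding_distance` maps an archive to a list of per-member distances."""
    if any(dominates(a, s) or tuple(a) == tuple(s) for a in archive):
        return archive  # s is dominated by (or duplicates) an archive member
    # rule (3): drop archive members dominated by s, then add s
    archive = [a for a in archive if not dominates(s, a)]
    archive.append(s)
    if len(archive) > max_size:
        # rule (4): archive overflow -- remove the most crowded solution
        d = crowding_distance(archive)
        del archive[d.index(min(d))]
    return archive
```

Rules (1) and (2) need no explicit branch: an empty archive rejects nothing, and a nondominated candidate always survives the filtering step.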
The archive A interacts with the current population in two aspects. First, the equation for generating a donor vector (see equation (7)) is modified to

v_i^G = x_arc^G + F · (x_r1^G − x_r2^G),

where x_arc^G is a solution that is randomly selected from the external archive rather than from the current population. This handling method is inspired by the standard DE/best/1 mutation strategy [41]; x_arc^G can be regarded as one of the best solutions stored in archive A. Second, the selection operator of MODE (see equation (9)) is modified, as illustrated in Algorithm 1, which enhances the interaction between the current population and archive A. Since the updating scheme of archive A is based on crowding information, the archive is updated in a timely manner at each iteration, and the convergence and diversity of the nondominated solutions in A can be ensured.
3.1.2. Density Estimation

Many density estimation methods have been proposed in the literature [29, 30]. In our proposed method, a parameter-independent method called the crowding distance is used to assist Pareto dominance in assigning a complete order. The basic idea is that the degree of crowding of a solution in objective space is quantified by the distance between its neighbors. For a given solution and an archive A, the crowding distance of the solution can be calculated by Algorithm 2. This method is similar to the one used in the non-dominated sorting genetic algorithm II (NSGA-II) [29], where the crowding distance of a solution is the perimeter of the cuboid formed by its neighbors.
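One common NSGA-II-style realization of this idea is sketched below (illustrative Python; Algorithm 2 in the paper may differ in details such as normalization):

```python
import numpy as np

def crowding_distances(front):
    """NSGA-II-style crowding distance for each objective vector in `front`.
    Boundary solutions get infinite distance so they are always retained."""
    F = np.asarray(front, dtype=float)
    n, m = F.shape
    d = np.zeros(n)
    for k in range(m):                       # accumulate per objective
        order = np.argsort(F[:, k])
        d[order[0]] = d[order[-1]] = np.inf  # extremes of this objective
        span = F[order[-1], k] - F[order[0], k]
        if span == 0:
            continue                         # degenerate objective: skip
        for i in range(1, n - 1):
            # normalized gap between the two neighbors along objective k
            d[order[i]] += (F[order[i + 1], k] - F[order[i - 1], k]) / span
    return d
```

The solution with the smallest distance sits in the most crowded region and is the one removed when the archive overflows.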
3.1.3. Parameter Control
The mutation scale factor F (see equation (7)) and the crossover rate CR (see equation (8)) are the two main control parameters in DE. A well-tuned setting of F and CR is crucial to the performance of DE [41]. However, suitable values of F and CR are problem-specific. To select them automatically, we follow the idea of self-adaptive differential evolution (SaDE) [42] and use a self-adaptive strategy to control the two parameters in MODE.
The employed parameter control strategy is described as follows. At each iteration, a set of F values is regenerated from a normal distribution with a mean of 0.5 and a standard deviation of 0.3, following SaDE [42]. These values are applied in order in equation (12) to generate the donor vectors. In this way, both exploitation (small F values) and exploration (large F values) are ensured during the evolution process. Furthermore, the crossover rate CR is gradually adjusted according to previous experience during the evolutionary process. Specifically, CR is assumed to obey a normal distribution with a mean of CRm and a standard deviation of 0.1 but is restricted to [0, 1]. Initially, an empty pool is created, and CRm is set to 0.5. At each iteration, a set of CR values is regenerated and applied to generate the trial vectors, as shown in equation (8). If a trial vector successfully replaces its target vector in the selection process, the corresponding CR value enters the pool. At the end of each iteration, CRm is reset to the median of the pool, and the pool is then emptied.
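The strategy can be sketched as follows (illustrative Python; the clipping range for F and the use of clipping to restrict CR to [0, 1] are our assumptions, since SaDE-style schemes vary in how they handle out-of-range samples):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_F(n):
    """Per-generation F values ~ N(0.5, 0.3), clipped to a usable range."""
    return np.clip(rng.normal(0.5, 0.3, n), 0.05, 1.0)

def sample_CR(crm, n):
    """CR values ~ N(CRm, 0.1), restricted to [0, 1]."""
    return np.clip(rng.normal(crm, 0.1, n), 0.0, 1.0)

def update_CRm(success_pool, crm):
    """Reset CRm to the median of CR values that produced winning trials;
    keep the old mean if no trial vector succeeded this generation."""
    return float(np.median(success_pool)) if success_pool else crm
```

The success pool is emptied after each `update_CRm` call, so CRm tracks only the most recent generation's successful crossover rates.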
3.2. Implementation of MODE in Feature Selection

The proposed MODE is an optimization method over continuous spaces, whereas the landscape of feature selection problems is discrete. To implement MODE in feature selection, a binary decoding strategy is incorporated in the proposed method. For a given dataset with D features {f1, f2, ..., fD}, a candidate solution in MODE is represented as

x = (x1, x2, ..., xD),

where D is the number of dimensions of x and is equal to the dimensionality of the data points. Consequently, a feature subset is determined by x and a preset threshold parameter θ, as shown in Algorithm 3. This strategy is also employed in the two single-objective methods based on DE.
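Assuming the thresholding reading of Algorithm 3 (feature j is selected exactly when its component exceeds θ, which is consistent with the later setting of the threshold to 0.5), the decoding step is one line:

```python
def decode(x, theta):
    """Map a continuous individual to a feature subset (our reading of
    Algorithm 3): feature index j is selected iff x[j] > theta."""
    return [j for j, xj in enumerate(x) if xj > theta]
```

A larger θ therefore biases the search toward smaller feature subsets.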
3.3. Three Objectives of the Filter Stage
The first stage of the proposed method is a multivariate filter method in which the intrinsic characteristics of the raw data are considered. Three objective functions to be minimized are defined in the filter stage to evaluate a feature subset. The first objective function is the number of selected features, which is a prime motivation of feature selection. Previous works [27] have shown that incorporating the number of selected features as an objective is necessary in feature selection. For a given feature subset S, the first objective function is defined as

f1(S) = |S|.
The second objective function strives to select the features with the highest relevance to the target class variable (labeled as malignant or not). Since all three objectives are minimized, the relevance is averaged over the subset, which makes it independent of the number of selected features in S, and negated:

f2(S) = −(1/|S|) Σ_{fi ∈ S} I(fi; c),

where c is the target class variable and I(fi; c) is the mutual information between feature fi and target class c.
In addition, the redundancy among each pair of the selected features should be reduced because redundant information does little to improve the accuracy of a classifier [43]. The third objective function aims at minimizing the redundancy of the feature subset, and it is defined as follows:

f3(S) = (1/|S|^2) Σ_{fi ∈ S} Σ_{fj ∈ S} I(fi; fj).
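Assuming precomputed mutual-information values, the three filter objectives can be sketched together as follows (illustrative Python; the averaging in f2 and the 1/|S|^2 normalization in f3 follow the mRMR convention and are our reading of the elided equations):

```python
def filter_objectives(subset, mi_fc, mi_ff):
    """Three filter-stage objectives (all minimized) for a feature subset.
    subset     : list of selected feature indices
    mi_fc[i]   : mutual information I(f_i; c) with the class variable
    mi_ff[i][j]: pairwise mutual information I(f_i; f_j)"""
    k = len(subset)
    f1 = k                                                   # subset size
    f2 = -sum(mi_fc[i] for i in subset) / k                  # negated mean relevance
    f3 = sum(mi_ff[i][j] for i in subset for j in subset) / (k * k)  # mean redundancy
    return f1, f2, f3
```

Precomputing `mi_fc` and `mi_ff` once per dataset keeps each evaluation cheap, which matters because MODE evaluates thousands of candidate subsets.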
3.4. Two Objectives of the Wrapper Stage
The second stage of the proposed method is a wrapper method, in which the employed classifier must be considered. As shown in Figure 2, a set of nondominated solutions is generated by the filter stage. Although any of these solutions could serve as the starting point of the second stage, it is more reasonable to select representative solutions among them according to computational cost. Since the aim of the filter stage is to select a small number of informative features, we select the solution with the smallest number of features as the input of the wrapper stage. Minimizing the classification error rate of a classifier is the main goal of the wrapper stage. In this study, the well-known and effective Gaussian NB [44] classifier is applied. The NB classifier is a supervised learning method for classification that is based on Bayes' theorem and assumes that every pair of features is independent. The Gaussian NB classifier is the variant of the NB classifier for continuous data, in which the continuous values of each feature are assumed to follow a Gaussian distribution. After selecting a suitable classifier, the two objective functions of the wrapper stage can be defined as follows.
According to the guidance of Xue et al. [27], the first objective function to be minimized is defined as the number of selected features; in the following content, we investigate whether it is necessary to take the number of selected features as an objective in the wrapper stage. Moreover, the average classification error rate of a selected feature subset is defined as the second objective function, which is evaluated by 5-fold cross-validation on the training data. A more detailed description of how the 5-fold cross-validation is performed on training data is given in [45].
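A sketch of the two wrapper objectives is given below. The Gaussian NB here is a minimal stand-in written from the description above (per-class Gaussian feature likelihoods combined with class priors), not a library classifier, and the contiguous fold split is a simplification of whatever splitting scheme [45] prescribes:

```python
import numpy as np

class TinyGaussianNB:
    """Minimal Gaussian naive Bayes: per-class feature means and variances."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = {c: X[y == c].mean(0) for c in self.classes}
        self.var = {c: X[y == c].var(0) + 1e-9 for c in self.classes}  # smoothed
        self.prior = {c: np.mean(y == c) for c in self.classes}
        return self

    def predict(self, X):
        def loglik(c):  # log prior + sum of per-feature Gaussian log-densities
            return (np.log(self.prior[c])
                    - 0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                                   + (X - self.mu[c]) ** 2 / self.var[c], axis=1))
        scores = np.stack([loglik(c) for c in self.classes], axis=1)
        return self.classes[np.argmax(scores, axis=1)]

def wrapper_objectives(subset, X, y, k=5):
    """Two wrapper-stage objectives: (|S|, mean k-fold CV error of the NB)."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        tr = np.setdiff1d(idx, fold)
        clf = TinyGaussianNB().fit(X[np.ix_(tr, subset)], y[tr])
        pred = clf.predict(X[np.ix_(fold, subset)])
        errs.append(np.mean(pred != y[fold]))
    return len(subset), float(np.mean(errs))
```

On well-separated synthetic data, the CV error of this sketch is near zero, which is the behavior the wrapper objective rewards.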
3.5. Two Single-Objective Feature Selection Methods
Two single-objective feature selection methods based on DE are also proposed in the wrapper stage for comparison. The main difference between the two methods is the choice of fitness function. The fitness function of the first method (DE1) is the same as the second objective function of MODE in the wrapper stage, i.e., the average classification error rate Error(S) of 5-fold cross-validation on the training data:

Fitness_DE1(S) = Error(S).
The aim of DE1 is to minimize the classification error rate during the training process. The other method (DE2), however, also considers the number of selected features. The fitness function of DE2 is defined as follows:

Fitness_DE2(S) = α · |S|/D + (1 − α) · Error(S),

where α is a scaling parameter determining the relative importance of the two terms and Error(S) is the average classification error rate of 5-fold cross-validation on the training data.
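One plausible reading of equation (19) is a convex combination of the normalized subset size and the CV error; with the value α = 0.2 used later, the error term dominates:

```python
def de2_fitness(subset_size, total_features, cv_error, alpha=0.2):
    """Scalarized DE2 fitness (our reading of equation (19)):
    weighted sum of the selected-feature ratio and the CV error rate."""
    return alpha * subset_size / total_features + (1 - alpha) * cv_error
```

Note that unlike MODE, this scalarization commits to one trade-off up front: changing the preference between error and subset size requires rerunning DE2 with a different α.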
To devise a fair comparison with the proposed MODE, the procedure of the two DE-based methods is kept similar to that of the proposed MODE described above. The differences between the MODE-based method and the DE-based methods lie in the selection process and the updating strategy of the external archive. The selection process of the two DE-based methods is the same as that of standard DE, as shown in equation (9). The updating strategy of the external archive of the two DE-based methods is based on tournaments with a limited archive size.
4. Experimental Studies
All the algorithms in this study are implemented in the C and Python languages. The programs are executed on a Linux 64-bit system with a 3.4 GHz Core i5 CPU and 8 GB RAM. In addition, the parameters of MODE used in the two stages, which have been discussed above, are listed in Table 1. To assess the performance of the proposed two-stage feature selection method, six widely used benchmark microarray datasets are selected for our experiments. The details of these datasets are summarized in Table 2. Note that all of the datasets are binary. The reason for excluding multiclass datasets is that binary microarray datasets are more common in the field of gene selection [46].
Because the numbers of samples in microarray datasets are relatively small, 5-fold cross-validation is applied to each dataset to evaluate the effectiveness of feature selection [46]. Specifically, the samples of each dataset are randomly partitioned into five equal subsamples. Four subsamples are used as the training data, and the remaining subsample is used as the test data. The cross-validation process is then repeated five times. The flowchart of the 5-fold cross-validation experiment is presented in Figure 3. The training data are used by the feature selection methods to select a feature subset. Then, the selected feature subset is used to reduce the dimensions of the training data and the test data. Finally, the goodness of the selected feature subset is evaluated on the test data.
4.1. Results of the Filter Stage
The proposed MODE is first applied to these benchmark datasets in the filter stage. The threshold θ is problem-specific and is set appropriately for each dataset. Since MODE obtains a set of nondominated solutions in each independent run, five independent sets of nondominated solutions with three objectives are generated. We collect the five sets of nondominated solutions into a union set and report its statistics in Table 3. Clearly, diverse solutions are obtained because the values of the three objectives fluctuate significantly. In addition, the small values of f1 indicate that few features are selected, which demonstrates the effectiveness of the filter stage.
Figure 4 shows the nondominated solutions of the Colon dataset in one experiment, mapped onto the objective space. Similar results are obtained for the remaining datasets. Figure 4 shows that the objectives strongly conflict along a curve, which strengthens the rationale of treating them as separate objectives for optimization. Moreover, a common dominance pattern can be found in Figure 4: two nondominated solutions may differ such that one selects fewer features while the other achieves better values on the mutual-information objectives. This finding supports our premise that simply reducing the number of features of a subset may diminish its compactness. Therefore, it is necessary to use different criteria to measure the quality of a feature subset.
The first objective f1 is used to direct the search procedure toward fewer selected features. To observe the changes in this objective during the evolutionary procedure, Figure 5 shows the convergence curves of the average value of f1 over the solutions stored in archive A for each dataset. The average number of selected features converges quickly and finally stabilizes near a certain value. This means that the filter results are not sensitive to the number of iterations, provided the maximum number of iterations is set sufficiently large. In addition, the convergence speed and the stable number of selected features depend on the intrinsic characteristics of each dataset. For example, Leukemia and Prostate have similar scales but converge to different values; Prostate has the largest number of features, yet its convergence speed is the fastest.
4.2. Results of the Wrapper Stage
Next, the proposed MODE is adopted in the wrapper stage. Inspired by recent works [22, 47], the threshold θ is set to 0.5 for all datasets. Finally, five independent sets of nondominated solutions with two objectives are generated. In addition, to analyze the performance of the proposed multiobjective approach, the two single-objective approaches mentioned above are executed in the same setting; they also generate five independent sets of solutions for each dataset. Note that for DE2, the classification performance is more important; thus, α is set to 0.2 in equation (19).
Since each method generates five sets of solutions, comparing the performance of these methods directly is difficult. We therefore use the comparison method adopted in previous works [22, 27]. Note that the classification performance is evaluated and compared on the test data rather than the training data. Specifically, the five sets of nondominated solutions achieved by the proposed MODE in 5-fold cross-validation are first collected into a union set. Then, the test classification error of each solution is calculated, and the test classification errors of solutions with the same number of features are averaged. The set of these "average" solutions is defined as the "average" front. The set of solutions in the union set that are nondominated with respect to the number of features and the test classification error is defined as the "best" front. The same processing is applied to the union sets collected for the two single-objective methods. Finally, the performance of the three methods can be compared on these three union sets.
The experimental results of the three methods on the benchmark datasets in the wrapper stage are shown in Figures 6 and 7. The horizontal axis represents the number of selected features of a solution, and the vertical axis represents the test classification error rate. The dashed line crossing each chart represents the average classification error rate of 5-fold cross-validation using all features. Moreover, in each chart, the label "Avg" in the legend refers to the average front obtained by each method, and the label "Best" refers to the best front.
According to Figures 6 and 7, the average fronts of the three methods lie under the dashed line in most cases. This suggests that all the methods work effectively because their solutions achieve lower test classification error rates while selecting fewer features. Moreover, the fluctuation in the curves of the average fronts means that solutions with similar numbers of features can have different test classification error rates, which implies that the feature subset search space is relatively complex.
When we compare DE1 with the other two methods, it is obvious that the classification performance of DE1 is similar to that of the other two methods on most datasets, but the number of features selected by DE1 is considerably larger. This is because no term in the fitness function of DE1 (equation (18)) considers the number of selected features. The experimental results strongly suggest the necessity of considering both the classification accuracy of a classifier and the number of features in feature selection.
Both MODE and DE2 consider the number of features in their fitness functions. However, the former uses a multi-objective technique, while the latter uses a single-objective technique. As shown in the left charts of Figures 6 and 7, both methods successfully achieve low classification error rates and select fewer features. When we compare these two methods, it can be observed that MODE outperforms DE2. In terms of the “average” fronts, MODE achieves significantly lower test classification error rates on most datasets except Leukemia, and it also obtains fewer features. In terms of the “best” fronts, the performance of MODE is again better than that of DE2 because MODE obtains both fewer features and a lower test classification error rate. Although fine-tuning the weighting parameter in equation (19) could improve the performance of DE2, this requires prior knowledge and the parameter must be predefined properly. The results demonstrate the advantage of the proposed MODE in the wrapper stage.
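The difference between the two approaches can be illustrated as follows. Since equations (18) and (19) are not reproduced here, the weighted-sum form and the weight `alpha` are hypothetical placeholders for the DE2-style scalarization, while the dominance check reflects the MODE-style comparison.

```python
# Contrast between a single-objective weighted-sum fitness (DE2-style, with a
# hypothetical trade-off weight alpha) and Pareto dominance (MODE-style).
def weighted_fitness(error, n_feat, n_total, alpha=0.9):
    """Scalarization: collapses both objectives into one number, so the
    trade-off is fixed in advance by the choice of alpha."""
    return alpha * error + (1 - alpha) * (n_feat / n_total)

def dominates(a, b):
    """a = (error, n_features) dominates b if a is no worse in both
    objectives and strictly better in at least one."""
    return (a[0] <= b[0] and a[1] <= b[1]) and (a[0] < b[0] or a[1] < b[1])

a, b = (0.08, 12), (0.10, 30)             # (error, number of selected features)
print(dominates(a, b))                    # True: better in both objectives
print(dominates((0.08, 30), (0.10, 12))) # False: an incomparable trade-off
```

Under dominance, incomparable trade-offs are all kept in the front for the user to choose from, whereas the weighted sum forces a single ranking determined by `alpha`.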
4.3. Comparison with Other Methods
To further evaluate the performance of the proposed two-stage method based on MODE, we compare it with seven widely used feature selection methods. GainRatio [48] and ReliefF [49] are two univariate feature selection methods that rank each feature according to its relevance to the target class. We retain the top 10, top 20, and top 40 features to evaluate the performance of these two methods. mRMR [13] is a classical feature selection method based on mutual information that returns a subset of features with a predefined size; we set the returned number of features to 10, 20, and 40. Correlation-based feature selection (CFS) [50] is a classical multivariate feature selection method that also returns a subset of features. WrapperNB [45] is a wrapper method coupled with the NB classifier whose search strategy is greedy hill climbing augmented with a backtracking facility. In addition, two wrapper methods based on GA and PSO are compared. Based on the parameter settings in the literature [51, 52], the population size and the maximum number of iterations of the two methods are set to 50 and 100, respectively. The key parameters of GA are set as follows: the crossover rate , the mutation rate , and the number of elites . The key parameters of PSO are set as follows: the inertia weight and the acceleration constants .
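To make the mRMR baseline concrete, the following is a minimal sketch of its greedy criterion (maximum relevance to the class, minimum redundancy among selected features) for discrete data. It is a simplified illustration under toy data, not the reference implementation of [13].

```python
# Minimal mRMR sketch: greedily add the feature that maximizes
# (relevance to the class) - (mean redundancy with already-selected features).
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) in bits for two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def mrmr(features, target, k):
    """Return indices of k features chosen by the greedy mRMR criterion."""
    selected, remaining = [], list(range(len(features)))
    relevance = [mutual_information(f, target) for f in features]
    while len(selected) < k and remaining:
        def score(j):
            red = (sum(mutual_information(features[j], features[s])
                       for s in selected) / len(selected)) if selected else 0.0
            return relevance[j] - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy expression profiles: gene 2 duplicates gene 0, gene 1 is uninformative.
genes = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 1, 1]]
labels = [0, 0, 1, 1]
print(mrmr(genes, labels, 2))  # [0, 1]: the redundant copy (gene 2) is skipped
```

The toy run shows the redundancy term at work: although gene 2 is as relevant as gene 0, it is perfectly redundant with it, so the less relevant but complementary gene 1 is preferred as the second pick.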
We use 5-fold cross-validation and follow the workflow in Figure 3 to perform the experiments. The final classifier is the NB classifier. To provide a fair comparison, for the proposed method, we select the solutions in the training Pareto front of the union set because the test data cannot be seen until the final performance evaluation. Specifically, the training Pareto front of the union set is constructed according to the training classification performance and the number of features. The comparison is performed on test data, and the results are listed in Table 4, which reports the average test classification accuracy and the number of selected genes (features). As illustrated in Table 4, the proposed method obtains the best classification performance on three (out of six) problems. Moreover, it selects a small number of features and thus meets the goal of gene selection.
We further conduct the Wilcoxon signed-rank test to determine the significance of the differences between the proposed method and the other methods. The significance level is set to 0.05, and the p values are listed in Table 4. It is clear that the proposed method significantly outperforms eight (out of fourteen) methods because the corresponding p values are smaller than 0.05. For the remaining six methods, the p values are larger than 0.05, which indicates that the proposed method is not significantly better but still obtains competitive results. Therefore, we can conclude that the proposed method is very competitive relative to classical methods. The comparison results suggest that the proposed two-stage method based on MODE is a promising way to solve the gene selection problem.
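The statistic underlying this comparison can be sketched as follows. This is a simplified Wilcoxon signed-rank implementation (average ranks for ties, normal approximation for the two-sided p value, no continuity correction); the paired accuracy values in the example are hypothetical, not the paper's results.

```python
# Simplified Wilcoxon signed-rank test for paired samples: rank the absolute
# differences, sum the ranks of positive differences, and approximate the
# two-sided p value with a normal distribution.
from math import sqrt, erf

def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p) for paired samples; zero differences dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average ranks over tied |diff|
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return w_plus, p

# Hypothetical paired accuracies of two methods on six datasets
acc_a = [0.91, 0.92, 0.93, 0.94, 0.95, 0.96]
acc_b = [0.90, 0.90, 0.90, 0.90, 0.90, 0.90]
w, p = wilcoxon_signed_rank(acc_a, acc_b)
print(w, p)  # W+ = 21.0; p below 0.05 since a wins every pairing
```

In practice a library routine with exact small-sample p values would be preferred for n as small as six; the sketch only shows why consistent per-dataset wins drive the p value below the 0.05 threshold.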
5. Conclusion
The gene selection problem is a specific feature selection problem and remains challenging in bioinformatics. In this paper, a two-stage feature selection method was proposed to solve the gene selection problem. The first stage was a multivariate filter method, and the second stage was a wrapper method. Both stages were based on the same MODE but used different objective functions. The objective functions of the filter stage were mainly based on mutual information, while the classification error of the NB classifier and the number of selected features served as the two objective functions in the wrapper stage. In our experiments, six common benchmark datasets were used to test and analyze the performance of the proposed method. In addition, the effectiveness of the proposed method for solving the gene selection problem was verified by comparing it with seven classical methods. Since the main difference between the two stages (filter and wrapper) lies in the objective functions, the proposed method is an easily understood implementation.
This study provided a new perspective for solving the gene selection problem by using multi-objective optimization, whose solution concept differs substantially from that of methods based on single-objective optimization. In the future, we plan to apply the proposed method to more gene expression datasets to verify its effectiveness. To improve the performance of the method, the search strategy and the evaluation criteria will also be investigated further.
Data Availability
The data used to support the findings of this study are included within the article and can be obtained from the corresponding authors upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This study was supported by the Research Foundation for Talented Scholars of Changzhou University (grant no. ZMF20020459) and JSPS KAKENHI (grant no. 19K12136).