Abstract

Multilabel feature selection involves the selection of relevant features from multilabeled datasets, resulting in improved multilabel learning accuracy. Evolutionary search-based multilabel feature selection methods have proved useful for identifying a compact feature subset by successfully improving the accuracy of multilabel classification. However, conventional methods frequently violate budget constraints or result in inefficient searches due to ineffective exploration of important features. In this paper, we present an effective evolutionary search-based feature selection method for multilabel classification with a budget constraint. The proposed method employs a novel exploration operation to enhance the search capabilities of a traditional genetic search, resulting in improved multilabel classification. Empirical studies using 20 real-world datasets demonstrate that the proposed method outperforms conventional multilabel feature selection methods.

1. Introduction

Multilabel classification has emerged as a promising technique for various applications, including lifelong structure monitoring [1], functional proteomics [2], and sentiment analysis [3]. These applications produce a series of labels to describe complicated concepts, which are compounded when high-level concepts are composed of multiple subconcepts, such as the environmental and operational conditions of structures [1, 4, 5]. Let $X$ denote a set of patterns constructed from a set of features $F$. Each pattern $x_i \in X$, where $1 \le i \le |X|$, is assigned a certain label subset $\lambda_i \subseteq L$, where $L$ is a finite set of labels. Therefore, the task of multilabel classification is to identify a function $h: X \to 2^{L}$ that maps a given instance to one of the possible label subsets based on its input feature values.

In practice, there may be a maximum number of features allowed because of limits on data acquisition rates or energy consumption [6–8]. For example, this problem arises in music applications on lightweight mobile devices. Such applications typically have limited computational capacity, so there is a maximum number of features that can be extracted [9, 10]. An excessive number of extracted features on a mobile device degrades the user experience through unacceptable waiting times or battery consumption.

Given input data with an original feature set $F$ and label set $L$, the goal of our multilabel feature selection problem is to identify a feature subset containing at most $n$ features that yields the best multilabel classification accuracy [11, 12], where $n$ is the maximum number of features allowed. This problem is known as budgeted feature selection [13] or feature selection with test cost constraints [8, 14, 15]. However, most studies have been conducted from the perspective of traditional single-label learning. It should be noted, especially when the given constraint $n$ is small, that our multilabel feature selection problem becomes more challenging in terms of classification accuracy because a small number of features must support multiple labels simultaneously [16–19].
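In compact form, and using $A(S;L)$ as a shorthand we introduce here for the multilabel classification accuracy achieved with a candidate subset $S$, the problem can be written as the following constrained optimization (a minimal formalization of the statement above, not a reproduction of any equation in the cited works):

```latex
% Budgeted multilabel feature selection (requires amsmath).
% S: candidate feature subset, F: original feature set, L: label set,
% A(S;L): multilabel accuracy using only the features in S (our shorthand),
% n: maximum number of features allowed by the budget.
\begin{equation*}
  S^{\ast} = \operatorname*{arg\,max}_{S \subseteq F,\; |S| \le n} A(S; L)
\end{equation*}
```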

Multilabel feature selection methods can be categorized according to how they assess the importance of candidate feature subsets [16, 20–22]. Filter-based multilabel feature selection methods identify a final feature subset by focusing on the intrinsic discriminative power of features [21, 23–25]. Some multilabel learning algorithms have a feature selection process embedded in their learning process [26, 27]. In contrast, wrapper-based multilabel feature selection methods assess the importance of feature subsets through a search process by using a multilabel classifier directly. This typically results in better classification accuracy [11, 12]. For this reason, we focus on a multilabel feature wrapper based on an evolutionary search process [28].

During the search process, each chromosome represents a feature subset and selects a number of features less than or equal to $n$. As a result, most features remain unselected by any chromosome in the population. This can lead to an ineffective search because important features may be continuously neglected. Without weakening the evolutionary search, this problem can be solved by adding chromosomes that convey promising unselected features to the population. In this study, we propose an effective multilabel feature wrapper that respects the constraint on the feature subset size. Experimental results demonstrate that the proposed method identifies an effective feature subset for multilabel classification with the aid of an enhanced evolutionary search process.

2. Related Work

In traditional single-label feature selection, the budgeted feature selection problem is treated as a special case of the feature selection problem in which the algorithm must consider the effectiveness of the feature subset and the acquisition cost of gathering each feature simultaneously. To solve this problem, Zhang et al. [29] proposed a feature selection algorithm based on bare bones particle swarm optimization, which avoids the algorithmic complexity introduced by additional parameters. Because the acquisition cost of each feature can be unequal, a multiobjective particle swarm optimization approach for cost-based feature selection and a return-cost-based binary firefly algorithm for feature selection have also been studied [30, 31]; these methods include an additional objective of minimizing the total cost of the selected features.

In multilabel feature selection studies, one of the major trends is the application of a feature selection method for single-label problems by transforming multilabel datasets into single-label datasets [32, 33]. Although this strategy facilitates the use of conventional methods, which has advantages in terms of ease of use [34], algorithm adaptation strategies that directly manage multilabel problems have also been considered [35]. In these approaches, which are largely filter-based, a feature subset is obtained by optimizing a specific criterion, such as a joint learning criterion that involves simultaneous feature selection and multilabel learning [27, 36], -norm function optimization [37], label ranking error [26], Hilbert-Schmidt independence criterion [23], -statistics [21], or mutual information [16, 24, 38]. However, these methods commonly suffer from low multilabel classification accuracy because of a lack of interaction with multilabel classifiers.

As a notable multilabel feature wrapper study, Zhang et al. [12] proposed a multilabel feature selection method based on a genetic algorithm (GA), which is the most common choice in evolutionary feature wrapper studies [28]. Specifically, their method combined instance- and label-based evaluation metrics [39] into a fitness function to capture label dependency. However, in the original proposal, the maximum number of features to be selected was not considered during the genetic search process. The multilabel classification performance when this constraint is enforced was later demonstrated for comparison purposes [11]. During initialization, this method creates chromosomes by selecting no more than $n$ features. During the genetic search process, the constraint is continuously satisfied by employing restrictive crossover and mutation operators [40] that immediately discard randomly chosen features whenever the number of selected features exceeds $n$. Although this approach satisfies the constraint, important features may be discarded, resulting in an ineffective feature subset.
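To make the repair behavior of such restrictive operators concrete, the following Python sketch (our own illustration, not the implementation of [40]) shows a one-point crossover followed by a repair step that randomly unselects features whenever the budget $n$ is exceeded:

```python
import numpy as np

def repair(chromosome: np.ndarray, n: int, rng) -> np.ndarray:
    """Randomly unselect features until at most n bits remain set.

    This mimics the repair step of a restrictive operator: the offspring
    becomes feasible, but the dropped features are chosen blindly, so an
    important feature may be discarded.
    """
    selected = np.flatnonzero(chromosome)
    if selected.size > n:
        drop = rng.choice(selected, size=selected.size - n, replace=False)
        chromosome = chromosome.copy()
        chromosome[drop] = 0
    return chromosome

def restrictive_crossover(p1: np.ndarray, p2: np.ndarray, n: int, rng):
    """One-point crossover on binary selection vectors, followed by repair."""
    point = rng.integers(1, p1.size)
    c1 = np.concatenate([p1[:point], p2[point:]])
    c2 = np.concatenate([p2[:point], p1[point:]])
    return repair(c1, n, rng), repair(c2, n, rng)
```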

Recent multilabel feature wrapper methods have treated the number of features to be selected as a secondary objective to be achieved by the evolutionary search process (i.e., multiobjective optimization [28]). This is achieved through a ranking method specifically designed for multiobjective optimization problems, known as nondominated sort [41], in which the rank of each chromosome is based on the number of times it dominates other chromosomes in terms of two fitness values: the multilabel classification accuracy and the number of selected features. Because the ranking of the chromosomes can be determined, it can be used directly in the natural selection process of a GA. Although the most common approach using nondominated sorting is NSGA-II [42], nondominated sorting has also been employed in other evolutionary search methods, including particle swarm optimization (PSO) [43]. A common drawback of these methods is that no solution may satisfy the feature number constraint if such a solution is not included in the final Pareto front. Additionally, they may suffer from unnecessary searches of infeasible solutions that convey an unacceptable number of features.
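For readers unfamiliar with nondominated sorting, the following sketch shows the dominance test on which such rankings are built when the two objectives are the classification error and the number of selected features (a simplified illustration of the principle, not the NSGA-II implementation of [42]):

```python
def dominates(a, b):
    """a, b: (classification_error, num_selected_features) tuples.

    a dominates b if it is no worse in both objectives and strictly
    better in at least one; rank 0 (the Pareto front) contains the
    chromosomes that no other chromosome dominates.
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1]
    better = a[0] < b[0] or a[1] < b[1]
    return no_worse and better

def pareto_front(objectives):
    """Return the indices of nondominated chromosomes."""
    return [i for i, a in enumerate(objectives)
            if not any(dominates(b, a) for j, b in enumerate(objectives) if j != i)]
```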

Our review indicates that conventional multilabel feature wrappers can fail to identify a final solution that satisfies a given constraint. To remedy this limitation, in addition to the evolutionary process, it is necessary to devise a new process, namely, an exploration operation, that finds important features within the large set of unselected features with the aid of an effective filter and supplies them to the population to enhance the evolutionary search. We summarize the resulting design issues and the corresponding choices in our approach as follows.
(i) The exploration operation must be able to identify promising features in the large set of features left unselected by the population. To achieve this, we employ a criterion that measures the relevance score of features.
(ii) The exploration operation must be computationally efficient to avoid degrading the performance of the entire search process. To achieve this, we employ a multilabel feature filter that is known to be efficient because it only requires the dependency between two variables [16].
(iii) The exploration operation is designed to incur no additional parameters that may cause complicated parameter control issues or increase the overall complexity of the algorithm [11, 44]. Based on the number of features given by the evolutionary search, it automatically identifies an effective feature subset composed only of novel features.

3. Proposed Method

3.1. Motivation and Approach

In this study, we enhance the performance of a population-based search, such as a GA, for multilabel feature selection with a budget constraint by introducing novel chromosomes that inject promising unselected features into the population. Figure 1 illustrates several key issues that should be considered when introducing novel features into the evolutionary search-based multilabel feature selection process under a budget constraint. In the original feature set $F$, there may be a subset of important features that are strongly dependent on multiple labels and therefore grant excellent discriminative power to the multilabel classifier if they are included in the final feature subset. After a random initialization process is completed, such important features may remain unselected by every chromosome (feature subset) because each chromosome covers only a small number of features under the budget constraint $n$. It should be noted that, to guarantee that all features are considered at least once, a large number of chromosomes would have to be evaluated even if the chromosomes were forced to select disjoint feature subsets, which incurs an expensive computational cost. Instead, the proposed method identifies promising features with the help of the employed filter, without explicit evaluation of candidate feature subsets.

Next, genetic operators, such as crossover and mutation, are applied to the population to create new chromosomes. However, unselected important features may still not be considered because new chromosomes are created by exchanging the alleles of their ancestors. This means that if none of the ancestors selects a feature, their offspring will not select that feature either. The only chance to add neglected features during offspring creation is the mutation operation. However, this is computationally inefficient because mutation selects features randomly and, additionally, the mutation rate is set to a small value in order to achieve convergence. Thus, a large number of iterations or generations must be spent before important features are randomly introduced into the population.

In the proposed method, the exploration operator is applied to each new offspring to create novel chromosomes that contain promising features not considered by the original offspring. During each exploration operation, we calculate the dependency of the unselected features on the multiple labels. After the ranking of each feature is computed, a new chromosome that selects the most promising features is created. Finally, the exploration-based and genetic-operation-based chromosomes are merged into a single offspring population.

This paper presents an effective evolutionary search method that remedies the aforementioned issues. In Section 3.2, we discuss the procedural steps of the proposed method and how to handle the issues associated with the exploration operation and the creation of new chromosomes. Section 3.3 presents a mutual-information-based search method for efficiently capturing the relationships between features and labels.

3.2. Algorithm

Algorithm 1 outlines the pseudocode for the proposed method. The terms used to describe the algorithm are summarized in the “Terms Used in This Study and Meanings” section. The feature selection vector in a chromosome is a binary string in which each bit represents an individual feature, with values of one and zero representing selected and unselected features, respectively. In the initialization step, the algorithm generates chromosomes by randomly setting at most $n$ bits to one. The feature subset encoded in each chromosome is then evaluated using a fitness function; we use the multilabel classification error of the selected feature subset as the fitness function. Because each chromosome must be evaluated to obtain its fitness value, the initialization step consumes one fitness function call (FFC) per chromosome.

procedure PROPOSED ALGORITHM(β)            ▷ β: maximum number of allowed FFCs
    t ← 1                                  ▷ t: generation index
    initialize population P_t of the t-th generation
    evaluate P_t
    while b < β do                         ▷ b: number of spent FFCs
        create O_g using genetic operators
        create O_e using the exploration operator based on O_g
        O_t ← O_g ∪ O_e                    ▷ offspring set
        evaluate O_t using a multilabel classifier
        add O_t to P_t
        select P_{t+1} from P_t by natural selection
        update b based on the spent FFCs
        t ← t + 1
    end while
end procedure

After performing the initialization process, the proposed method performs a reproduction process that can be divided into two parts: reproduction via genetic operators and reproduction via the exploration operator. First, the proposed method creates an offspring set $O_g$ using restrictive crossover and mutation operators that control the number of selected features [40]. Next, the exploration operator identifies unselected promising features from the perspective of each chromosome in $O_g$ and encodes them into a new chromosome in $O_e$. To balance the genetic and exploration operations, we set the size of $O_e$ to the same value as that of $O_g$, because each chromosome in $O_e$ must also be evaluated to determine its fitness. These two sets of chromosomes are then combined to form the offspring set $O_t$ of the $t$-th generation. To evaluate the fitness of the offspring set, the proposed method therefore uses $|O_g| + |O_e|$ FFCs in one generation. Next, $O_t$ is added to $P_t$, and the chromosomes with higher fitness values are selected to form $P_{t+1}$. This procedure is repeated until the algorithm exhausts its allowed FFCs; this limit is denoted by $\beta$ and is chosen by the user. The output of Algorithm 1 is the best feature subset obtained during the evolution.
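The following Python-style sketch mirrors one generation of Algorithm 1; the helpers genetic_offspring, explore, and evaluate are placeholders for the operators and fitness function described above, not the authors' implementation:

```python
def one_generation(scored_population, n, evaluate, explore, genetic_offspring, rng):
    """One iteration of the proposed search (illustrative sketch).

    scored_population : list of (chromosome, fitness) pairs already evaluated
    evaluate          : fitness function; each call corresponds to one FFC
    """
    parents = [c for c, _ in scored_population]

    # Offspring from restrictive crossover/mutation (O_g).
    o_g = genetic_offspring(parents, n, rng)

    # For each genetic offspring, the exploration operator builds a companion
    # chromosome of the same size from the features that offspring left
    # unselected, ranked by the filter criterion of Section 3.3 (O_e).
    o_e = [explore(child) for child in o_g]

    # Evaluate the combined offspring set O_t = O_g ∪ O_e (spends FFCs).
    scored_offspring = [(c, evaluate(c)) for c in o_g + o_e]

    # Natural selection: keep the best chromosomes among parents and offspring.
    merged = scored_population + scored_offspring
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:len(scored_population)]
```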

3.3. Exploration Operator

Because a feature subset selects only a small number of features within the budget $n$ and most features remain unselected, the exploration operator is needed to explore the large set of unselected features. Algorithm 2 outlines the pseudocode for the proposed exploration operator. For each offspring generated by the genetic operators, we iteratively select relevant features that maximize the objective function and that were not selected by the offspring, until the size of the novel subset reaches $k$, where $k$ is the number of features selected by that offspring. Thus, the proposed exploration operation does not incur an additional parameter for determining the number of features to be selected.

procedure EXPLORE(O_g)
    for each offspring c ∈ O_g do
        initialize novel feature subset S ← ∅
        for i ← 1 to k do                  ▷ k: size of the feature subset selected by c
            find the best unselected feature f that maximizes the objective function
            add f to S
        end for
        add S to O_e as a chromosome
    end for
end procedure
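One way to realize Algorithm 2 in Python is sketched below; relevance is a placeholder name for the objective function of Section 3.3 and receives the features already placed in the novel subset:

```python
import numpy as np

def explore(offspring: np.ndarray, relevance) -> np.ndarray:
    """Build a novel chromosome from features the offspring did not select.

    offspring : binary selection vector produced by the genetic operators
    relevance : callable scoring a candidate feature given the features
                already chosen for the novel subset (Section 3.3)
    """
    k = int(offspring.sum())                  # size of the offspring's subset
    candidates = set(np.flatnonzero(offspring == 0))
    chosen = []
    for _ in range(k):                        # greedy forward selection
        best = max(candidates, key=lambda f: relevance(f, chosen))
        chosen.append(best)
        candidates.remove(best)

    novel = np.zeros_like(offspring)
    novel[chosen] = 1
    return novel
```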

To implement our exploration operation, we employ an effective filter method called the scalable criterion for large label sets (SCLS) [16] as the objective function, where $L$ denotes the label set. When the $i$-th feature is selected for the novel subset $S$, which already contains $i-1$ features, the selection is performed by identifying the candidate feature $f$ that maximizes a relevance evaluation of the form proposed in [17], namely the dependency of $f$ on $L$ minus the dependency of $f$ on the features already selected in $S$ (1). Following [17], (1) can be reformulated in terms of the mutual information $I(\cdot\,;\cdot)$ between two variables and the joint entropy of their probability functions (2), and from (2) both dependency terms can be calculated explicitly ((3) and (4)).

To calculate the dependency of $f$ on $L$ while remaining robust to the scaling of the label set and avoiding repetitive calculations over $S$ and $L$, the relevance term is rewritten in terms of a quantity that must be estimated and that determines the reduction of the relevance to $L$ caused by the already selected features, circumventing the repetitive calculation of this reduction for each label (5). According to [16], this quantity can be approximated efficiently (6); as a result, the relevance evaluation for a candidate feature is obtained as in (7).

Equation (7) represents the relevance evaluation when no feature has yet been selected. By considering the previously selected features in $S$, the final relevance evaluation is obtained as in (8).

Equation (8) is the objective function used by our exploration operation for selecting relevant features from the set of unselected features.
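The exact expressions (1)-(8) follow [16, 17]; as a rough illustration of the relevance-minus-redundancy structure they share, the sketch below scores a candidate feature by its summed mutual information with the labels, discounted by its mutual information with the features already selected for the novel subset. It is an approximation we provide for readability, not the SCLS formula itself:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def relevance_score(f, selected, X, Y):
    """Illustrative relevance-minus-redundancy score (not the exact SCLS form).

    X : (patterns x features) matrix of discretized feature values
    Y : (patterns x labels) binary label matrix
    f : index of the candidate feature; selected : indices already chosen
    """
    label_relevance = sum(mutual_info_score(X[:, f], Y[:, l])
                          for l in range(Y.shape[1]))
    redundancy = sum(mutual_info_score(X[:, f], X[:, s]) for s in selected)
    return label_relevance - redundancy
```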

3.4. Experimental Settings

We experimented on 20 different datasets from various domains. The Birds dataset is audio data containing examples of multiple bird calls. The Emotions dataset is music data classified into six emotional clusters. The Enron, Language Log (LLog), and Slashdot datasets were generated from text mining applications, where each feature corresponds to the occurrence of a word and each label represents the relevancy of a text pattern to a specific subject. The Genbase and Yeast datasets come from the biological domain and include information about the functions of genes and proteins. The Mediamill dataset is video data from an automatic detection system. The Medical dataset was sampled from a large corpus of suicide letters obtained from the natural language processing of clinical free text. The Scene dataset is related to the semantic indexing of still scenes, where each scene may contain multiple objects. The TMC2007 dataset contains safety reports of a complex space system. The remaining nine datasets come from the Yahoo dataset collection. We performed unsupervised dimensionality reduction on the text datasets, including TMC2007 and the Yahoo collection, which are composed of more than 10,000 features. Specifically, the top 2% and 5% of features with the highest document frequency were retained for TMC2007 and the Yahoo datasets, respectively [45]. In the text mining domain, existing studies report that classification performance does not suffer significantly from retaining only 1% of the features based on document frequency [46].

Table 1 contains the standard statistics of the multilabel datasets employed in our experiments, including the number of patterns $|X|$, the number of features $|F|$, the type of features, and the number of labels $|L|$. When the feature type is numeric, we discretize the features by using the supervised discretization method of [47] for the multilabel naïve Bayes (MLNB) classifier [12]. Specifically, each observed numeric value is assigned to one of several bins that are automatically determined by the discretization method. The label cardinality Card represents the average number of labels per instance. The label density Den is the label cardinality divided by the total number of labels. The number of distinct label sets Distinct indicates the number of unique label subsets in the dataset. Domain represents the application from which each dataset was extracted.
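As a simple stand-in for the supervised discretization of [47], whose cut points are chosen using the labels, the following sketch shows the basic bin-assignment step with equal-frequency bins; it only illustrates how numeric columns become discrete inputs for MLNB and is not the method used in our experiments:

```python
import numpy as np

def equal_frequency_discretize(column: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Map a numeric feature column to bin indices 0..n_bins-1 (illustrative)."""
    cut_points = np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(column, cut_points)
```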

We measured the mean size of the selected feature subsets for both the proposed method and the conventional multilabel feature selection methods (GA with restrictive genetic operators [40] (RGA), NSGA-II [42], and MPSOFS [43]) to determine which methods succeeded in selecting fewer than 10 features. Specifically, we provide the detailed parameter settings to support reproducibility as follows.
(i) RGA creates 20 initial solutions by randomly selecting fewer than $n$ features for each chromosome. Each solution in the initial population $P_1$ is evaluated using the employed multilabel classifier. Next, RGA creates an offspring set using genetic operators. To apply the crossover operator, two solutions in $P_t$ are randomly selected and mated; thereafter, one solution in $P_t$ is randomly selected and mutated. In this study, we employed restrictive crossover and restrictive mutation operators with both the crossover rate and the mutation rate set to 1.0. Therefore, in each iteration, the GA creates three new solutions to compose $O_t$. Each newly created solution is evaluated using the multilabel classifier. To create $P_{t+1}$, $O_t$ is added to $P_t$, and the 20 solutions with the highest fitness values are selected. This procedure is repeated until RGA has spent 100 FFCs.
(ii) NSGA-II creates 20 initial solutions randomly, the same number as RGA. The maximum number of allowed features is set to the total number of features $|F|$ because NSGA-II naturally minimizes the number of selected features as one of its objectives. Each solution in $P_1$ is evaluated using the employed multilabel classifier and the number of selected features. NSGA-II then creates an offspring set $O_t$ with $|O_t| = 3$, which is the same setting as RGA. To create $P_{t+1}$, $O_t$ is added to $P_t$, and the superiority of each solution is determined by the nondominated sort method. After the superiority among the solutions is determined, the top 20 solutions are selected to form $P_{t+1}$. This procedure is repeated until NSGA-II has spent 100 FFCs.
(iii) MPSOFS creates 20 initial solutions randomly, the same number as RGA. Each solution in $P_1$ is evaluated using the employed multilabel classifier and the number of selected features, and ranked using the nondominated sort method. MPSOFS then preserves the best solution of $P_t$, called the global best solution. In addition, the best solution that each chromosome has experienced is also preserved; this is called the individual best solution, and therefore there are 20 individual best solutions. Thereafter, MPSOFS updates the representation of each chromosome based on the global best solution and its own individual best solution using a velocity update with an inertia weight of 0.7298 and two acceleration coefficients of 1.4962, as suggested in [48] (a sketch of this update is given after this list). After all chromosomes in $P_t$ are modified, they are evaluated and regarded as the next population. This procedure is repeated until MPSOFS has spent 100 FFCs.
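The velocity update referred to in item (iii) can be sketched as follows, with the inertia weight and acceleration coefficients listed above; the sigmoid-based binarization is a common convention that we assume here for illustration, not a detail taken from [43]:

```python
import numpy as np

def pso_update(position, velocity, personal_best, global_best, rng,
               w=0.7298, c1=1.4962, c2=1.4962):
    """One binary-PSO step (sketch); position vectors are 0/1 arrays."""
    r1, r2 = rng.random(position.shape), rng.random(position.shape)
    velocity = (w * velocity
                + c1 * r1 * (personal_best - position)
                + c2 * r2 * (global_best - position))
    prob = 1.0 / (1.0 + np.exp(-velocity))            # sigmoid binarization
    position = (rng.random(position.shape) < prob).astype(int)
    return position, velocity
```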

Although different parameter settings may result in better performance, we fixed the population size to 20 and the number of spent FFCs to 100 for all methods to ensure a fair comparison. To evaluate the quality of the feature subsets obtained by each method, we used the MLNB classifier because it outputs a predicted label subset based on the intrinsic characteristics of a given dataset without requiring any complicated parameter-tuning process that might influence the final multilabel classification performance [39]. For the sake of fairness, we used the hold-out cross-validation method for each experiment [11, 49]: 80% of the samples in a given dataset were randomly chosen as the training set for multilabel feature selection and classifier training, while the remaining 20% were used as the test set to obtain the multilabel classification performance. For both RGA and the proposed method, we set the population size to 20 and the maximum number of allowed FFCs to 100. Each experiment was repeated 10 times, and the average value was used to represent the classification performance of each feature selection method.

We employed four evaluation metrics: Hamming loss, multilabel accuracy, ranking loss, and normalized coverage. Let $T = \{(x_i, Y_i) \mid 1 \le i \le |T|\}$ be a given test set, where $Y_i \subseteq L$ is the correct label subset of $x_i$. For a given test sample $x_i$, a classifier such as MLNB outputs a confidence value $c(x_i, l)$ for each label $l \in L$. If a confidence value is larger than a predefined threshold, such as 0.5, the corresponding label is included in the predicted label subset $\hat{Y}_i$. Based on the ground truth $Y_i$, the confidence values $c(x_i, \cdot)$, and the predicted label subset $\hat{Y}_i$, multilabel classification performance can be measured with each evaluation metric [33, 45, 50].

Multilabel accuracy is defined as follows:
$$\mathrm{Accuracy} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}.$$
Hamming loss is defined as follows:
$$\mathrm{Hamming\ loss} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{|Y_i \,\triangle\, \hat{Y}_i|}{|L|},$$
where $\triangle$ denotes the symmetric difference between two sets. Ranking loss is defined as follows:
$$\mathrm{Ranking\ loss} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{\bigl|\{(l_a, l_b) \in Y_i \times \bar{Y}_i \mid c(x_i, l_a) \le c(x_i, l_b)\}\bigr|}{|Y_i|\,|\bar{Y}_i|},$$
where $\bar{Y}_i$ is the complementary set of $Y_i$. Therefore, ranking loss measures the average fraction of pairs with $c(x_i, l_a) \le c(x_i, l_b)$ over all possible relevant and irrelevant label pairs. Finally, normalized coverage is defined as follows:
$$\mathrm{Coverage} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{1}{|L|} \max_{l \in Y_i} \mathrm{rank}(x_i, l),$$
where $\mathrm{rank}(x_i, l)$ returns the rank of the corresponding relevant label according to $c(x_i, \cdot)$ sorted in nonincreasing order. Therefore, normalized coverage measures how many labels must be marked as positive for all relevant labels to be positive. Higher values of multilabel accuracy and lower values of Hamming loss, ranking loss, and normalized coverage indicate good classification performance.
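The four metrics can be computed directly from the confidence matrix and the ground-truth label matrix; the code below is our own straightforward reading of the definitions above, with a 0.5 threshold for the predicted subset:

```python
import numpy as np

def multilabel_metrics(conf: np.ndarray, Y: np.ndarray, threshold: float = 0.5):
    """conf: (samples x labels) confidence values; Y: binary ground truth."""
    pred = conf >= threshold
    Yb = Y.astype(bool)

    # Multilabel (Jaccard) accuracy and Hamming loss.
    inter = np.logical_and(pred, Yb).sum(1)
    union = np.logical_or(pred, Yb).sum(1)
    accuracy = np.mean(np.where(union > 0, inter / np.maximum(union, 1), 1.0))
    hamming = np.mean(np.logical_xor(pred, Yb).mean(1))

    # Ranking loss and normalized coverage.
    rloss, cov = [], []
    for c, y in zip(conf, Yb):
        rel, irr = np.flatnonzero(y), np.flatnonzero(~y)
        if rel.size and irr.size:
            bad = sum(c[a] <= c[b] for a in rel for b in irr)
            rloss.append(bad / (rel.size * irr.size))
        if rel.size:
            order = np.argsort(-c)                    # nonincreasing confidences
            ranks = np.empty_like(order)
            ranks[order] = np.arange(1, c.size + 1)   # rank 1 = most confident
            cov.append(ranks[rel].max() / c.size)     # normalized by |L|
    return accuracy, hamming, float(np.mean(rloss)), float(np.mean(cov))
```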

Additionally, because we are interested in the superiority of the proposed method over conventional multilabel feature selection methods, we performed the Wilcoxon signed-rank test [51] to validate its performance. Let $d_i$ be the difference between the performance of the two methods on the $i$-th dataset. The differences are ranked according to their absolute values, with the smallest difference assigned the first rank; if ties occur, average ranks are assigned. Let $R^{-}$ denote the sum of the ranks of the cases in which the compared method outperforms the proposed method.

Let $R^{+}$ denote the sum of the ranks of the cases in which the proposed method outperforms the compared method. Then, based on the critical values of the Wilcoxon test at a confidence level of $\alpha = 0.05$, the difference between the compared methods is significant if the smaller of the two rank sums is less than or equal to 8. In this case, the null hypothesis of equal performance is rejected.
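In practice, the same significance check can be run with SciPy; the snippet below is a sketch with hypothetical paired accuracy values, not the exact procedure used to produce Table 5:

```python
from scipy.stats import wilcoxon

# Hypothetical multilabel accuracy of the two methods over repeated runs.
proposed = [0.61, 0.54, 0.48, 0.72, 0.66, 0.59, 0.63, 0.70, 0.57, 0.65]
baseline = [0.58, 0.50, 0.47, 0.69, 0.61, 0.55, 0.60, 0.66, 0.53, 0.62]

# Two-sided Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(proposed, baseline)
print(f"W = {stat:.1f}, p = {p_value:.4f}")  # reject equal performance if p < 0.05
```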

4. Experimental Results

4.1. Comparison Results

Table 2 contains the mean size and standard deviation of the feature subsets selected by the proposed method and the conventional multilabel feature selection methods when the evaluation metric is multilabel accuracy. Methods that failed to satisfy the given constraint for the corresponding dataset are marked in the table. The proposed method and RGA both selected fewer than 10 features for all datasets. NSGA-II and MPSOFS failed to select fewer than 10 features for all datasets, other than the Mediamill dataset for NSGA-II, despite having objective functions that minimize the feature subset size. Because NSGA-II and MPSOFS failed to select fewer than 10 features for most datasets, we compared the performance of the proposed method only with that of RGA in the subsequent experiments. It should be noted that $n$ can be set to a value larger than 10, such as 30 or 50; however, the experimental results in Table 2 suggest that NSGA-II and MPSOFS would still fail to satisfy such constraints because their final feature subsets are composed of tens or hundreds of features in most experiments.

Tables 3 and 4 contain the experimental results of the proposed method and RGA on the 20 multilabel datasets, presented as the average performance over the hold-out cross-validation runs with the corresponding standard deviations. Table 3 reports multilabel accuracy and Hamming loss, and Table 4 reports ranking loss and normalized coverage. The better performance of the two methods is indicated by bold font and a ✓ symbol. Finally, Table 5 contains the results of the Wilcoxon signed-rank test of the proposed method against RGA on the Genbase dataset with a significance threshold of 0.05. For each evaluation metric, the winner of each comparison is indicated in bold font, and the corresponding sum of winning ranks over the total rank sum and the $p$ values are given in parentheses. We observed a similar tendency in the same experiments on the other multilabel datasets.

As shown in Tables 3 and 4, the proposed method outperformed RGA for most multilabel datasets. Specifically, the proposed method achieved the best performance for 90% of the datasets in terms of multilabel accuracy, 95% of the datasets in terms of Hamming loss, 95% of the datasets in terms of ranking loss, and 100% of the datasets in terms of normalized coverage. Thus, the proposed method significantly outperforms RGA for all evaluation metrics. This is evident from the experimental results shown in Table 5, which clearly demonstrate that the proposed method is statistically superior to RGA.

4.2. Analysis

Figure 2 shows the convergence behaviors of RGA and the proposed method according to the number of spent FFCs ($b$) in terms of multilabel accuracy; the horizontal axis represents $b$, and the vertical axis indicates the multilabel accuracy. Because the convergence behavior may differ between runs owing to the stochastic nature of population-based search methods, we used the same initial population for both algorithms and averaged the multilabel accuracy of the best chromosome in the population over 10 repetitions of the experiment. Figure 2 shows that the multilabel accuracy improves monotonically with $b$. Because the initialization step consumes 20 FFCs and the two methods share the same randomly created initial population, both methods improve the multilabel accuracy gradually at first. However, the experimental results indicate that the multilabel accuracy of the proposed method improves dramatically after the initialization, when the exploration operator begins to be applied to the population. Thus, Figure 2 indicates that the proposed method can efficiently locate a good feature subset among the unselected features.

The goal of our exploration operation is to introduce novel promising features that effectively improve the multilabel classification performance. To validate its effectiveness, we conducted an additional experiment comparing the fitness values of offspring sets created by the proposed exploration operation and by a random operation, respectively. Specifically, 50 chromosomes that each select 10 or fewer features, generated by the same initialization procedure as RGA, were used, and 50 new chromosomes were then created by applying the proposed exploration operation to each of these chromosomes to form the first offspring set. Thereafter, for comparison, novel features were selected randomly with regard to each chromosome and introduced to create the second offspring set. Finally, the fitness values of the first and second offspring sets were measured in terms of the four performance measures. Figure 3 shows box plots of the fitness values of the two offspring sets for the Genbase dataset. The experimental results indicate that the fitness values of the first offspring set (Proposed) are much better than those of the second offspring set (Random) for all measures, indicating that the proposed exploration operation has a much better search capability than random search.

5. Conclusion

We proposed an effective evolutionary search-based feature selection method with a budget constraint for multilabel classification. Because a feature subset selects only a small number of features within the maximum allowed number and most features remain unselected in the budget-constrained problem, we employed a novel exploration operation to find relevant features in the large set of unselected features. Our experiments on 20 real-world datasets demonstrated that the proposed exploration operator successfully enhances the search capability of the genetic search, resulting in improved multilabel classification. The results also showed that the proposed method successfully finds a feature subset that does not violate the budget constraint. Statistical tests showed that our method outperformed conventional methods on four performance measures. Although the proposed exploration operation improves the effectiveness of the evolutionary search without incurring additional parameters, it cannot be applied directly to certain types of evolutionary search algorithms, such as particle swarm optimization, that do not depend on offspring sets. Thus, additional consideration is needed to design a new exploration operation for such cases.

A future research direction is the extension to other evolutionary algorithms. The proposed method is based on a genetic algorithm; however, its exploration operation can in principle be applied to other evolutionary algorithms, such as the estimation of distribution algorithm. We would like to study this issue further.

Terms Used in This Study and Meanings

Constants
$T$: Number of generations
$|P|$: The size of the population
$n$: Maximum number of allowed features selected by a chromosome
$c$: A chromosome in the population
$s$: A selected feature subset represented by $c$
$\beta$: Maximum number of allowed fitness function calls (FFCs)
$b$: Number of spent FFCs, $b \le \beta$
Sets
$P_t$: The population at the $t$-th generation
$O_g$: A set of newly created solutions from the genetic operators
$O_e$: A set of newly created solutions from the exploration operator
$O_t$: A set of newly created solutions, $O_t = O_g \cup O_e$.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the Chung-Ang University Research Grants in 2017 and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2016R1C1B1014774).