Computational Intelligence in Modeling Complex Systems and Solving Complex Problems (Special Issue)
Research Article | Open Access
Effective Evolutionary Multilabel Feature Selection under a Budget Constraint
Multilabel feature selection involves the selection of relevant features from multilabeled datasets, resulting in improved multilabel learning accuracy. Evolutionary search-based multilabel feature selection methods have proved useful for identifying a compact feature subset by successfully improving the accuracy of multilabel classification. However, conventional methods frequently violate budget constraints or result in inefficient searches due to ineffective exploration of important features. In this paper, we present an effective evolutionary search-based feature selection method for multilabel classification with a budget constraint. The proposed method employs a novel exploration operation to enhance the search capabilities of a traditional genetic search, resulting in improved multilabel classification. Empirical studies using 20 real-world datasets demonstrate that the proposed method outperforms conventional multilabel feature selection methods.
1. Introduction
Multilabel classification has emerged as a promising technique for various applications, including lifelong structure monitoring [1], functional proteomics [2], and sentiment analysis [3]. These applications produce a series of labels for describing complicated concepts, which are compounded when high-level concepts are composed of multiple subconcepts, such as the environmental and operational conditions of structures [1, 4, 5]. In this setting, each pattern in a dataset is described by a set of features and is assigned a label subset drawn from a finite set of labels. The task of multilabel classification is therefore to identify a function that maps a given instance to one of the possible label subsets based on its input feature values.
In practice, there can be a maximum number of features allowed because of limits on data acquisition rates or energy consumption [6–8]. For example, this problem can arise in music applications on lightweight mobile devices. Applications for mobile devices typically have limited computational capacity, so there is a maximum number of features that can be extracted [9, 10]. Extracting an excessive number of features on a mobile device degrades the user experience through unacceptable waiting times or battery consumption.
Given input data with an original feature set and a label set, the goal of our multilabel feature selection problem is to identify a feature subset, containing no more than the maximum allowed number of features, that yields the best multilabel classification accuracy [11, 12]. This problem is known as budgeted feature selection or feature selection with test cost constraints [8, 14, 15]. However, most studies have been conducted from the perspective of traditional single-label learning. It should be noted that, especially when the given budget is small, our multilabel feature selection problem becomes more challenging in terms of classification accuracy because a small number of features must support multiple labels simultaneously [16–19].
Multilabel feature selection methods can be categorized according to how they assess the importance of candidate feature subsets [16, 20–22]. Filter-based multilabel feature selection methods identify a final feature subset by focusing on the intrinsic discriminative power of features [21, 23–25]. Some multilabel learning algorithms have a feature selection process embedded in their learning process [26, 27]. In contrast, wrapper-based multilabel feature selection methods assess the importance of feature subsets through a search process by using a multilabel classifier directly. This typically results in better classification accuracy [11, 12]. For this reason, we focus on a multilabel feature wrapper based on an evolutionary search process .
During the search process, each chromosome represents a feature subset and selects no more than the maximum allowed number of features. As a result, most features remain unselected by any chromosome in the population. This can lead to an ineffective search because important features can be continuously neglected. Without negatively affecting the strength of the evolutionary search, this problem can be solved by adding to the population additional chromosomes that convey promising unselected features. In this study, we propose an effective multilabel feature wrapper that takes the constraint on feature subset size into account. Experimental results demonstrate that the proposed method is able to identify an effective feature subset for multilabel classification with the aid of an enhanced evolutionary search process.
2. Related Work
In traditional single-label feature selection, the budgeted feature selection problem is treated as a special case of the feature selection problem in which the algorithm must simultaneously consider the effectiveness of the feature subset and the acquisition cost of gathering each feature. To solve this problem, Zhang et al. proposed a feature selection algorithm based on bare bones particle swarm optimization, which considers the complexity of an algorithm due to additional parameters. Because the acquisition cost of each feature can be unequal, a multiobjective particle swarm optimization approach for cost-based feature selection and a return-cost-based binary firefly algorithm for feature selection have also been studied [30, 31]; both have an additional objective function that minimizes the total cost of the selected features.
In multilabel feature selection studies, one of the major trends is the application of a feature selection method for single-label problems by transforming multilabel datasets into single-label datasets [32, 33]. Although this strategy facilitates the use of conventional methods, which has advantages in terms of ease of use, algorithm adaptation strategies that directly manage multilabel problems have also been considered. In these approaches, which are largely filter-based, a feature subset is obtained by optimizing a specific criterion, such as a joint learning criterion that involves simultaneous feature selection and multilabel learning [27, 36], -norm function optimization, label ranking error, Hilbert-Schmidt independence criterion, -statistics, or mutual information [16, 24, 38]. However, these methods commonly suffer from low multilabel classification accuracy because of a lack of interaction with multilabel classifiers.
As a notable multilabel feature wrapper study, Zhang et al. proposed a multilabel feature selection method based on a genetic algorithm (GA), which is the most common choice in evolutionary feature wrapper studies. Specifically, their method combined instance- and label-based evaluation metrics as a fitness function to determine label dependency. However, in the original proposal, a maximum number of features to be selected was not considered during the genetic search process. The multilabel classification performance when considering the number of features to be selected was later demonstrated for comparison purposes. During initialization, this method creates chromosomes by selecting fewer features than the allowed maximum. During the genetic search process, this constraint is continuously satisfied by employing restrictive crossover and mutation operators that immediately discard features at random if the number of selected features exceeds that maximum. Although this method satisfies the constraint, important features may be discarded, resulting in an ineffective feature subset.
Recent multilabel feature wrapper methods have treated the number of features to be selected as a secondary objective to be achieved by the evolutionary search process (i.e., multiobjective optimization). This is achieved through a specifically designed ranking method for multiobjective optimization problems, known as nondominated sort, where the rank of each chromosome is based on the number of times it dominates other chromosomes in terms of two fitness values: multilabel classification accuracy and the number of selected features. Because the ranking of the chromosomes can be determined, it can be directly used in the natural selection process of a GA. Although the most common approach using a nondominated sorting method is NSGA-II, nondominated sorting has also been employed in other evolutionary search methods, including particle swarm optimization (PSO). A common drawback of these methods is that no solution may satisfy the feature number constraint if such a solution is not included in the final Pareto front. Additionally, they may waste effort searching infeasible solutions that convey an unacceptable number of features.
Our review indicates that conventional multilabel feature wrappers can fail to identify a final solution that satisfies a given constraint. To remedy this limitation, in addition to the evolutionary process, it is necessary to devise a new process, namely, an exploration operation, that finds important features in a large set of novel features with the aid of an effective filter and supplies them to the population to enhance the evolutionary search process. We summarize the issues that follow and the corresponding reasons for our approach:
(i) The exploration operation must be able to identify promising features in a large set of unselected features. To achieve this, we employ a criterion that measures the relevance score of features.
(ii) The exploration operation must be computationally efficient to circumvent performance degradation of the entire search process. To achieve this, we employ a multilabel feature filter that is confirmed to be efficient because it only requires the dependency between two variables.
(iii) Our exploration operation is designed to incur no additional parameter that may cause complicated parameter control issues and increase the overall complexity of the algorithm [11, 44]. Based on the number of features given by the evolutionary search, it automatically identifies an effective feature subset that is composed only of novel features.
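The restrictive operators described above can be pictured as a repair step applied after crossover or mutation. The following is a minimal sketch under that assumption; the function name `repair_to_budget` and the parameter `n_max` (standing for the maximum number of allowed features) are hypothetical and are not taken from the cited work.

```python
import random

def repair_to_budget(chromosome, n_max, rng=random):
    """Randomly unselect features until at most n_max bits are set.

    chromosome: list of 0/1 flags, one per feature.
    n_max: assumed name for the maximum number of allowed features.
    """
    repaired = list(chromosome)
    selected = [i for i, bit in enumerate(repaired) if bit == 1]
    excess = len(selected) - n_max
    if excess > 0:
        # Features are dropped at random, regardless of their importance.
        for i in rng.sample(selected, excess):
            repaired[i] = 0
    return repaired
```

Because the dropped features are chosen without regard to relevance, a repair of this kind can remove exactly the features that matter, which is the weakness noted in the text.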
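To illustrate the drawback mentioned above, the following sketch computes a Pareto front over hypothetical (classification error, number of selected features) pairs and checks whether any member satisfies a budget. The numbers, the budget `n_max`, and the function names are invented for illustration only; the sketch shows plain Pareto dominance rather than the full nondominated sort used by NSGA-II.

```python
def dominates(a, b):
    """True if a dominates b; each solution is (error, n_features), both minimized."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(solutions):
    """Return the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# Hypothetical (classification error, number of selected features) pairs.
population = [(0.30, 40), (0.25, 55), (0.20, 120), (0.35, 60), (0.28, 45)]
front = pareto_front(population)

n_max = 10  # assumed budget
feasible = [s for s in front if s[1] <= n_max]
print(front)
print(feasible)
```

In this toy population the front contains four solutions, yet none of them selects ten or fewer features, so the multiobjective search returns no feasible answer.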
3. Proposed Method
3.1. Motivation and Approach
In this study, we enhance the performance of a population-based search, such as a GA, for multilabel feature selection with a budget constraint by introducing novel chromosomes that inject promising unselected features into the population. Figure 1 reveals several key issues that should be considered when introducing novel features into the evolutionary search-based multilabel feature selection process with a budget constraint. In the original feature set, there may be a subset of important features that are strongly dependent on multiple labels, leading to excellent discriminative power in the multilabel classifier if they are included in the final feature subset. After a random initialization process is completed, some important features may remain unselected by any chromosome (feature subset) because each chromosome only covers a small number of features under the budget constraint. It should be noted that a large number of chromosomes would have to be evaluated to consider every feature at least once, even if all chromosomes were forced to select disjoint feature subsets, which incurs an expensive computational cost. Instead, the proposed method identifies promising features with the help of the employed filter without explicit evaluation of candidate feature subsets.
Next, genetic operators, such as crossovers and mutations, are applied to the population to create new chromosomes. However, unselected important features may not be considered because new chromosomes are created by exchanging the alleles of their ancestors. This means that if the ancestors all leave a feature unselected, then their offspring will also leave that feature unselected. The only chance to add neglected features during offspring creation is through the mutation operation. However, this is computationally inefficient because the mutation operation selects features randomly and, additionally, the mutation rate is set to a small value to achieve convergence. Thus, a large number of iterations or generations must be spent before important features are introduced into the population by chance.
In the proposed method, the exploration operator is applied to each of the new offspring to create novel chromosomes that contain promising features that were not considered by the original offspring. During each exploration operation, we calculate the dependency of unselected features on multiple labels. After the ranking of each feature is computed, a new chromosome that selects the most promising features is created. Finally, the exploration-based and genetic operation-based chromosomes are merged into a single population.
This paper presents an effective evolutionary search method that remedies the aforementioned issues. In Section 3.2, we discuss the procedural steps of the proposed method and how to handle the issues associated with the exploration operation and the creation of new chromosomes. Section 3.3 presents a mutual-information-based search method for efficiently capturing the relationships between features and labels.
Algorithm 1 outlines the pseudocode for the procedures used in the proposed method. The terms used for describing the algorithm are summarized in the “Terms Used in This Study and Meanings” section. The feature selection vector in a chromosome is a binary string where each bit represents an individual feature, with values of one and zero representing selected and unselected features, respectively. In the initialization step, the algorithm generates the initial chromosomes by randomly setting no more than the maximum allowed number of bits to one. The selected feature subset encoded in each chromosome is then evaluated using a fitness function. We use multilabel classification error as the fitness function for the selected feature subset. Because every chromosome must be evaluated in order to obtain its fitness value, the initialization step spends one fitness function call (FFC) per chromosome.
After performing the initialization process, the proposed method performs a reproduction process that can be divided into two parts: reproduction via genetic operators and reproduction via the exploration operator. First, the proposed method creates an offspring set using restrictive crossover and mutation operators to control the number of selected features. Next, the exploration operator identifies unselected promising features from the perspective of each chromosome in this offspring set and encodes them into a new chromosome in a second offspring set. For balance between the genetic and exploration operations, we set the size of the second set to the same value as that of the first because every new chromosome must be evaluated in order to determine its fitness. These two sets of chromosomes are then combined to form the offspring set of the current generation. To evaluate the fitness of the offspring set, the proposed method spends one FFC per offspring chromosome in each generation. Next, the offspring set is added to the current population, and the chromosomes with higher fitness values are selected to form the next population. This procedure is repeated until the algorithm uses all of its allowed FFCs. This limit is chosen by the user. The output of Algorithm 1 is the best feature subset obtained during evolution.
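As an illustration of the encoding and initialization step, the sketch below builds binary chromosomes whose number of set bits never exceeds the budget. The names `random_chromosome`, `init_population`, and `n_max` are assumptions of this sketch, as is the choice to select at least one feature per chromosome.

```python
import random

def random_chromosome(n_features, n_max, rng=random):
    """Binary feature-selection vector with between 1 and n_max bits set."""
    k = rng.randint(1, min(n_max, n_features))
    chosen = set(rng.sample(range(n_features), k))
    return [1 if i in chosen else 0 for i in range(n_features)]

def init_population(pop_size, n_features, n_max, rng=random):
    """Create pop_size chromosomes that all respect the budget n_max."""
    return [random_chromosome(n_features, n_max, rng) for _ in range(pop_size)]

population = init_population(pop_size=20, n_features=100, n_max=10)
# Features covered by at least one chromosome; typically far fewer than 100.
covered = {i for c in population for i, bit in enumerate(c) if bit}
print(len(covered))
```

With 20 chromosomes of at most 10 features each, no more than 200 feature slots exist in the population, so on datasets with thousands of features most features are necessarily left untouched after initialization.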
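The overall loop can be sketched as follows. This is an illustrative outline rather than a transcription of Algorithm 1: the function `evolve`, its parameters, the set-based representation, and the toy `fitness` and `explore` callbacks are all assumptions, and fitness is treated as an error to be minimized. Each generation creates three genetic offspring and three explored offspring, so six FFCs are spent per generation.

```python
import random

def evolve(fitness, explore, n_features, n_max, pop_size=20, max_ffc=100, seed=0):
    """Budgeted search sketch; returns (best_error, best_subset)."""
    rng = random.Random(seed)

    def random_subset():
        k = rng.randint(1, min(n_max, n_features))
        return frozenset(rng.sample(range(n_features), k))

    def repair(subset):
        # Restrictive step: randomly drop features beyond the budget.
        items = sorted(subset)
        if len(items) > n_max:
            items = rng.sample(items, n_max)
        return frozenset(items)

    def crossover(a, b):
        child = frozenset(f for f in sorted(a | b) if rng.random() < 0.5)
        return repair(child or a)

    def mutate(a):
        flipped = a ^ {rng.randrange(n_features)}
        return repair(flipped or a)

    scored = [(fitness(s), s) for s in (random_subset() for _ in range(pop_size))]
    spent = pop_size  # one FFC per initial chromosome
    while spent < max_ffc:
        p1, p2, p3 = (s for _, s in rng.sample(scored, 3))
        genetic = [crossover(p1, p2), crossover(p2, p1), mutate(p3)]
        # One explored chromosome per genetic offspring, of the same size.
        explored = [frozenset(explore(c, len(c))) for c in genetic]
        offspring = (genetic + explored)[: max_ffc - spent]
        scored += [(fitness(s), s) for s in offspring]
        spent += len(offspring)
        scored = sorted(scored, key=lambda t: t[0])[:pop_size]
    return scored[0]

# Toy problem: three hypothetical relevant features among 50.
relevant = {3, 17, 42}

def toy_error(subset):
    return 1.0 - len(relevant & subset) / len(relevant)

def toy_explore(subset, k):
    # Stand-in for the filter: prefers relevant features the subset lacks.
    unselected = [f for f in range(50) if f not in subset]
    unselected.sort(key=lambda f: (f not in relevant, f))
    return unselected[:k]

best_error, best_subset = evolve(toy_error, toy_explore, n_features=50, n_max=5)
print(best_error, sorted(best_subset))
```

Keeping the explored set the same size as the genetic set means the exploration operator needs no parameter of its own, at the price of doubling the FFCs spent per generation.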
3.3. Exploration Operator
Because a feature subset selects only a small number of features within the budget and most features will remain unselected, the exploration operator is needed in order to explore a large set of unselected features. Algorithm 2 outlines the pseudocode for the proposed exploration operator. For each offspring generated by the genetic operators, we iteratively select relevant features that maximize the objective function and that were not selected by the offspring, until the new subset reaches the same size as the feature subset of that offspring. Thus, the proposed exploration operation does not incur an additional parameter for determining the number of features to be selected.
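The greedy selection described above might look like the following sketch. It is not a transcription of Algorithm 2; the function name `explore` and the callback `score(f, chosen)`, which stands in for the filter criterion, are assumptions.

```python
def explore(offspring, n_features, score):
    """Greedy sketch of an exploration step.

    offspring: set of feature indices selected by a genetic offspring.
    score(f, chosen): relevance of candidate f given the features chosen
        so far (assumed signature for the filter criterion).
    Returns a subset of the same size as the offspring that contains only
    features the offspring did not select.
    """
    candidates = set(range(n_features)) - set(offspring)
    chosen = []
    target = min(len(offspring), len(candidates))
    while len(chosen) < target:
        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(sorted(candidates), key=lambda f: score(f, chosen))
        chosen.append(best)
        candidates.remove(best)
    return set(chosen)

# Toy score: features closer to index 5 are treated as more relevant.
novel = explore({5, 6}, n_features=10, score=lambda f, chosen: -abs(f - 5))
print(sorted(novel))  # [3, 4]
```

The size of the returned subset is inherited from the offspring, which is how the operator avoids a separate parameter for the number of features to select.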
To implement our exploration operation, we employ an effective filter method called the scalable criterion for large label sets (SCLS)  as an objective function , where is the label set. The selection of the th feature from the set , where is a feature subset with features when selecting th feature, is performed by identifying that maximizes the value of the following relevance evaluation :where and denote the dependency of on and the dependency of on the selected features of , respectively. From , (1) can be reformulated as follows:where is the mutual information between variables and and is the joint entropy of the probability functions , , and . Following from (2), can be calculated as follows:As (2), can be calculated as
In order to calculate while considering adaptability against the scaling of and avoiding repetitive calculations by and , let be represented as follows:where , which must be estimated, determines the reduction with relevance to based on , while circumventing the repetitive calculations for reduction against each label. According to , can be approximated as follows:As a result, the relevance evaluation for is performed as follows:
Equation (7) represents how the relevance evaluation can be performed when . By considering the previously selected features in , the final relevance evaluation can be represented as follows:
Equation (8) is the objective function for selecting relevant features from the unselected feature subset used by our exploration operation.
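To make the ingredients of such a criterion concrete, the sketch below estimates entropy and mutual information from discrete samples and combines them into a score that rewards dependency on the labels and penalizes dependency on already selected features. The combined form used here, label relevance scaled by one minus the entropy-normalized redundancy, is our assumption about how an SCLS-style score can be written; the equations in this section and the cited SCLS publication remain the authoritative definition.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def relevance(candidate, selected, labels):
    """Assumed SCLS-style score for one candidate feature.

    candidate: discrete values of an unselected feature.
    selected: list of value sequences for already selected features.
    labels: list of value sequences, one per label.
    """
    rel = sum(mutual_information(candidate, l) for l in labels)
    h = entropy(candidate)
    if h == 0:
        return 0.0  # a constant feature carries no information
    red = sum(mutual_information(candidate, s) for s in selected) / h
    return (1.0 - red) * rel

f1 = [0, 0, 1, 1]     # identical to the label
f2 = [0, 1, 0, 1]     # independent of the label
label = [0, 0, 1, 1]
print(relevance(f1, [], [label]))    # 1.0
print(relevance(f2, [], [label]))    # 0.0
print(relevance(f1, [f1], [label]))  # 0.0, fully redundant with itself
```

Only pairwise quantities are needed, which is consistent with the efficiency argument made for the employed filter.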
3.4. Experimental Settings
We experimented on 20 different datasets from various domains. The Birds dataset is audio data containing examples of multiple bird calls. The Emotions dataset is music data classified into six emotional clusters. The Enron, Language Log (LLog), and Slashdot datasets were generated from text mining applications, where each feature corresponds to the occurrence of a word and each label represents the relevancy of each text pattern to a specific subject. The Genbase and Yeast datasets come from the biological domain and include information about the functions of genes and proteins. The Mediamill dataset is video data from an automatic detection system. The Medical dataset was sampled from a large corpus of suicide letters obtained from the natural language processing of clinical free text. The Scene dataset is related to the semantic indexing of still scenes, where each scene may contain multiple objects. The TMC2007 dataset contains safety reports of a complex space system. The remaining nine datasets come from the Yahoo dataset collection. We performed unsupervised dimensionality reduction on text datasets, including the TMC2007 and Yahoo collections, which were composed of more than 10,000 features. Specifically, the top 2% and 5% of features with the highest document frequency were retained for TMC2007 and the Yahoo datasets, respectively. In the text mining domain, existing studies report that classification performance will not suffer significantly from the retention of 1% of features based on document frequency.
Table 1 contains the standard statistics for the multilabel datasets employed in our experiments, including the number of patterns in the dataset, the number of features, the type of features, and the number of labels. When the feature type is numeric, we discretize the features by using the supervised discretization method for the multilabel naïve Bayes classifier (MLNB). Specifically, each observed numeric value is assigned to one of several bins that are automatically determined by the discretization method. The label cardinality Card represents the average number of labels for each instance. The label density Den is the label cardinality divided by the total number of labels. The number of distinct label sets Distinct indicates the number of unique label subsets in the dataset. Domain represents the application that each dataset was extracted from.
We measured the mean size of the selected feature subsets for both the proposed method and the conventional multilabel feature selection methods (GA with restrictive genetic operators (RGA), NSGA-II, and MPSOFS) to determine which methods succeeded in selecting fewer than 10 features. Specifically, we provide the detailed parameter settings to support reproducibility as follows:
(i) RGA creates initial solutions by randomly selecting, for each chromosome, fewer features than the allowed maximum. Each solution in the initial population is evaluated using an employed multilabel classifier. Next, the RGA creates an offspring set by using genetic operators. To apply the crossover operator, two solutions in the population are randomly selected and mated; thereafter, one solution in the population is randomly selected and mutated. In this study, we employed restrictive crossover and restrictive mutation operators with both the crossover rate and the mutation rate set to 1.0. Therefore, in each iteration, the GA creates three new solutions to compose the offspring set. Each newly created solution is evaluated using the multilabel classifier. To create the next population, the offspring set is added to the current population, and the 20 solutions with higher fitness values are selected. This procedure is repeated until the RGA spends 100 FFCs.
(ii) NSGA-II creates initial solutions randomly, the same number as RGA creates. The maximum number of allowed features is set to because NSGA-II naturally minimizes the number of selected features. Each solution in the population is evaluated using an employed multilabel classifier and the number of features. The NSGA-II then creates an offspring set with the same setting as RGA. To create the next population, the offspring set is added to the current population, and the superiority of each solution is determined by the nondominated sort method. After the superiority among solutions is determined, the top 20 solutions are selected to form the next population. This procedure is repeated until the NSGA-II spends 100 FFCs.
(iii) MPSOFS creates 20 initial solutions randomly, the same number as RGA creates. Each solution in the population is evaluated using an employed multilabel classifier and the number of features and is ranked using the nondominated sort method. The MPSOFS then preserves the best solution in the population, called the global best solution. In addition, the best solution that each chromosome has experienced is also preserved; this is called the individual best solution, and therefore there are 20 individual best solutions. Thereafter, the MPSOFS updates the representation of each chromosome based on the global best solution and its own individual best solution using a velocity with an inertia weight of 0.7298 and two acceleration coefficients of 1.4962, as suggested in a previous study. After all chromosomes in the population are modified, they are evaluated and regarded as the next population. This procedure is repeated until the MPSOFS spends 100 FFCs.
Although different parameter settings may result in better performance, we fixed the size of the population to 20 and the number of spent FFCs to 100 for all the methods to ensure a fair comparison. To evaluate the quality of the feature subsets obtained by each method, we used the MLNB classifier because it outputs a predicted label subset based on the intrinsic characteristics of a given dataset without requiring any complicated parameter-tuning process that might influence the final multilabel classification performance. For the sake of fairness, we used the hold-out cross-validation method for each experiment [11, 49]. In each experiment, 80% of the samples in a given dataset were randomly chosen as the training set for multilabel feature selection and classifier training, while the remaining 20% of the samples were used as the test set to obtain the multilabel classification performance. Each experiment was repeated 10 times and the average value was used to represent the classification performance of each feature selection method.
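The three label statistics defined above can be computed directly from the label subsets. The following sketch uses a hypothetical function name and a toy dataset of four instances over four labels.

```python
def label_statistics(label_sets, n_labels):
    """Return (Card, Den, Distinct) for a multilabel dataset.

    label_sets: one collection of label indices per instance.
    n_labels: total number of labels in the dataset.
    """
    card = sum(len(s) for s in label_sets) / len(label_sets)
    den = card / n_labels
    distinct = len({frozenset(s) for s in label_sets})
    return card, den, distinct

# Toy example: four instances, four possible labels.
toy = [{0, 1}, {1}, {0, 1}, {2}]
print(label_statistics(toy, n_labels=4))  # (1.5, 0.375, 3)
```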
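The evaluation protocol, an 80/20 random split repeated 10 times with averaged results, can be sketched as follows. The callback `evaluate(train_idx, test_idx)` is hypothetical and would wrap feature selection, classifier training, and testing; the population form of the standard deviation is also an assumption of this sketch.

```python
import random

def holdout_split(n_samples, train_fraction=0.8, rng=random):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_samples))
    return indices[:cut], indices[cut:]

def repeated_holdout(evaluate, n_samples, repeats=10, seed=0):
    """Average a score over several independent hold-out splits.

    evaluate(train_idx, test_idx) -> float is a hypothetical callback.
    Returns (mean, standard deviation) over the repeats.
    """
    rng = random.Random(seed)
    scores = [evaluate(*holdout_split(n_samples, 0.8, rng)) for _ in range(repeats)]
    mean = sum(scores) / repeats
    std = (sum((s - mean) ** 2 for s in scores) / repeats) ** 0.5
    return mean, std

# Dummy evaluation that only reports the test-set fraction.
mean, std = repeated_holdout(lambda tr, te: len(te) / (len(tr) + len(te)), 50)
print(mean, std)
```

Feature selection must see only the training indices in each repeat; otherwise the reported test performance would be optimistically biased.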
We employed four evaluation metrics: Hamming loss, multilabel accuracy, ranking loss, and normalized coverage. Let be a given test set where is a correct label subset. For a given test sample , a classifier, such as MLNB, should output a set of confidence values for each label . If a confidence value is larger than a predefined threshold value, such as 0.5, the corresponding label will be included in the predicted label subset . Based on the ground truth , confidence values , and predicted label subset , multilabel classification performance can be measured with each evaluation metric [33, 45, 50].
Multilabel accuracy is defined as follows:Hamming loss is defined as follows:where denotes the symmetric difference between two sets. Ranking loss is defined as follows:where is a complementary set of . Therefore, ranking loss measures the average fraction of pairs with over all possible relevant and irrelevant label pairs. Finally, normalized coverage is defined as follows:where returns the rank of the corresponding relevant label according to in nonincreasing order. Therefore, normalized coverage measures how many labels must be marked as positive for all relevant labels to be positive. Higher values of multilabel accuracy and lower values of Hamming loss, ranking loss, and normalized coverage indicate good classification performance.
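The four measures can be sketched in code as follows, using the verbal descriptions above. The function names and data layout are assumptions. For normalized coverage, this sketch divides the worst rank among the relevant labels by the number of labels; some definitions subtract one from the rank first, so the convention here should be treated as an assumption.

```python
def multilabel_accuracy(true_sets, pred_sets):
    """Mean Jaccard overlap; two empty sets count as a perfect match."""
    total = 0.0
    for y, p in zip(true_sets, pred_sets):
        union = y | p
        total += len(y & p) / len(union) if union else 1.0
    return total / len(true_sets)

def hamming_loss(true_sets, pred_sets, n_labels):
    """Mean size of the symmetric difference, divided by the label count."""
    diff = sum(len(y ^ p) for y, p in zip(true_sets, pred_sets))
    return diff / (n_labels * len(true_sets))

def ranking_loss(true_sets, confidences):
    """Mean fraction of (relevant, irrelevant) pairs that are misordered."""
    total = 0.0
    for y, conf in zip(true_sets, confidences):
        irrelevant = [l for l in range(len(conf)) if l not in y]
        if not y or not irrelevant:
            continue  # no relevant/irrelevant pairs to compare
        bad = sum(1 for a in y for b in irrelevant if conf[a] <= conf[b])
        total += bad / (len(y) * len(irrelevant))
    return total / len(true_sets)

def normalized_coverage(true_sets, confidences):
    """Mean worst rank of a relevant label, divided by the label count."""
    total = 0.0
    for y, conf in zip(true_sets, confidences):
        order = sorted(range(len(conf)), key=lambda l: -conf[l])
        rank = {l: r + 1 for r, l in enumerate(order)}  # 1 = most confident
        total += max(rank[l] for l in y) / len(conf) if y else 0.0
    return total / len(true_sets)

# Two test samples, three labels, threshold 0.5 for the predicted subsets.
truth = [{0, 1}, {2}]
conf = [[0.9, 0.4, 0.6], [0.2, 0.1, 0.8]]
pred = [{l for l, c in enumerate(row) if c > 0.5} for row in conf]
print(multilabel_accuracy(truth, pred), hamming_loss(truth, pred, 3))
print(ranking_loss(truth, conf), normalized_coverage(truth, conf))
```

In the toy example the first sample misses label 1 and wrongly predicts label 2, which lowers accuracy and raises all three loss measures, while the second sample is predicted perfectly.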
Additionally, because we are interested in the superiority of the proposed method over conventional multilabel feature selection methods, we perform the Wilcoxon signed-rank test  to validate the performance of the proposed method. Let be the difference between the performance of the two methods for the th dataset. The differences are ranked based on their absolute values and the smallest is assigned to the first rank. If ties occur, average ranks are assigned. Let be the sum of the ranks for the datasets on which the compared method outperforms the proposed method, defined as follows:
Let be the sum of the ranks for the datasets on which the proposed method outperforms the compared method. Then, based on the critical values from the Wilcoxon test, for a confidence level of and , the difference between the compared methods is significant if is less than or equal to 8. In this case, the null hypothesis of equal performance is rejected.
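The rank sums used by the test can be computed as in the sketch below. The function name is hypothetical, and splitting the ranks of zero differences evenly between the two sums is a common convention that is assumed here rather than taken from the text. The paired scores are invented integers, chosen so that ties are exact.

```python
def wilcoxon_rank_sums(proposed, compared, higher_is_better=True):
    """Return (r_plus, r_minus) for paired scores.

    r_plus: rank sum of the pairs where the proposed method is better.
    r_minus: rank sum of the pairs where the compared method is better.
    """
    diffs = [(p - c) if higher_is_better else (c - p)
             for p, c in zip(proposed, compared)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank shared by a group of ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    zero = 0.5 * sum(r for r, d in zip(ranks, diffs) if d == 0)
    r_plus = zero + sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = zero + sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus

# Hypothetical paired scores from 10 runs (higher is better).
proposed = [10, 12, 9, 15, 11, 14, 13, 16, 18, 17]
compared = [9, 10, 10, 11, 6, 8, 6, 8, 9, 7]
r_plus, r_minus = wilcoxon_rank_sums(proposed, compared)
print(r_plus, r_minus)             # 53.5 1.5
print(min(r_plus, r_minus) <= 8)   # True: significant under the stated threshold
```

The two rank sums always add up to n(n + 1)/2, which is 55 for ten pairs, and this provides a quick sanity check on any implementation.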
4. Experimental Results
4.1. Comparison Results
Table 2 contains the results for the mean size and standard deviation of the selected feature subsets of the proposed method and conventional multilabel feature selection methods when the evaluation metric is multilabel accuracy. The ✘ symbol indicates methods that failed to satisfy the given constraint for the corresponding dataset. The proposed method and RGA both selected fewer than 10 features for all datasets. The NSGA-II and MPSOFS methods failed to select fewer than 10 features on every dataset, with the single exception of NSGA-II on the Mediamill dataset, despite having objective functions that minimize feature subset size. Because NSGA-II and MPSOFS failed to select fewer than 10 features for most datasets, we compared the performance of the proposed method only with the performance of the RGA in subsequent experiments. It should be noted that the budget can be set to a value larger than 10, such as 30 or 50. The experimental results in Table 2 suggest that NSGA-II and MPSOFS would still fail to satisfy such constraints because their final feature subsets are composed of tens or hundreds of features in most experiments.
Tables 3 and 4 contain the experimental results for the proposed method and RGA on 20 multilabel datasets, presented as the average performances for hold-out cross-validation with corresponding standard deviations. Table 3 contains the performance results for multilabel accuracy and Hamming loss, and Table 4 contains the performance results for ranking loss and normalized coverage. The best performance between the two methods is indicated by bold font and a ✓ symbol. Finally, Table 5 contains the results of the Wilcoxon signed-rank test for the proposed method against RGA for the Genbase dataset with a significance threshold of . For each evaluation metric, the winner of each comparison is indicated with bold font and the corresponding sum of the outperformed rank over the total rank and values are presented in parentheses. We observed a similar tendency in the same experiments on the other multilabel datasets.
As shown in Tables 3 and 4, the proposed method outperformed RGA for most multilabel datasets. Specifically, the proposed method achieved the best performance for 90% of the datasets in terms of multilabel accuracy, 95% of the datasets in terms of Hamming loss, 95% of the datasets in terms of ranking loss, and 100% of the datasets in terms of normalized coverage. Thus, the proposed method significantly outperforms RGA for all evaluation metrics. This is evident from the experimental results shown in Table 5, which clearly demonstrate that the proposed method is statistically superior to RGA.
Figure 2 shows the convergence behaviors of the GA and proposed method according to the number of spent FFCs () in terms of the multilabel accuracy; the horizontal axis represents , and the vertical axis indicates the multilabel accuracy performance. Because the convergence behaviors may differ according to each experiment owing to the stochastic nature of the population-based search methods, we set the same initialized population in both algorithms and averaged the multilabel accuracy performance of the top elitist in the population after conducting the experiment 10 times. Figure 2 shows that the multilabel accuracy performance monotonically improves with . Because the initialization steps consume 20 FFCs and the two methods have the same initialized population that is randomly created, both methods gradually improve the multilabel accuracy initially. However, the experimental results indicate that the multilabel accuracy value of the proposed method is dramatically improved when because the exploration operator is applied to the population after the initialization. Thus, Figure 2 indicates that the proposed method can efficiently locate a good feature subset from unselected features.
Figure 2: Convergence behavior in terms of multilabel accuracy versus the number of spent FFCs for the (a) Genbase, (b) Slashdot, (c) Arts, (d) Education, (e) Entertainment, and (f) Social datasets.
The goal of our exploration operation is to introduce novel promising features that effectively improve the multilabel classification performance. To validate the effectiveness of our exploration operation, we conducted an additional experiment comparing the fitness values of the offspring sets created by the proposed exploration operation and by a random operation. Specifically, 50 chromosomes that each select 10 or fewer features were created using the same initialization procedure as RGA, and 50 new chromosomes were then created by applying the proposed exploration operation to each of these chromosomes to form the first offspring set. Thereafter, for the sake of comparison, novel features with regard to each of the original chromosomes were selected randomly and introduced to create the second offspring set. Finally, the fitness values of the first and second offspring sets were measured in terms of the four performance measures. Figure 3 shows the box plots of the fitness values given by the two offspring sets on the Genbase dataset. The experimental results indicate that the fitness values of the first offspring set (Proposed) are much better than those of the second offspring set (Random) for all measures, indicating that the proposed exploration operation has a much better search capability than random search.
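Figure 3: Box plots of the fitness values of the two offspring sets (Proposed and Random) on the Genbase dataset: (a) multilabel accuracy, (b) Hamming loss, (c) ranking loss, and (d) normalized coverage.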
We proposed an effective evolutionary search-based feature selection method with a budget constraint for multilabel classification. Because a feature subset under a budget constraint contains only a small number of features, bounded by the maximum allowed number, most features remain unselected. We therefore employ a novel exploration operation to find relevant features in the large unselected feature subset. Our experiments on 20 real-world datasets demonstrated that the proposed exploration operation successfully enhances the search capability of genetic search, resulting in improved multilabel classification. The results also showed that the proposed method successfully finds feature subsets that do not violate the budget constraint. Statistical tests showed that our method outperformed conventional methods on the four performance measures. Although the proposed exploration operation improves the effectiveness of evolutionary search without introducing additional parameters, it cannot be applied directly to certain types of evolutionary search algorithms, such as particle swarm optimization, that do not depend on offspring sets. Thus, additional consideration is needed to design a new exploration operation for such cases.
One future research direction is to extend the approach to other evolutionary algorithms. Although the proposed method is a genetic algorithm-based feature selection method, the exploration operation could also be applied to other evolutionary algorithms, such as the estimation of distribution algorithm. We would like to study this issue further.
Terms Used in This Study and Meanings
Constants
|:||Number of generations|
|:||The size of the population,|
|:||Maximum number of allowed features selected by|
|:||A chromosome in|
|:||A selected feature subset represented by|
|:||Maximum number of allowed fitness function calls (FFCs)|
|:||Number of spent FFCs,|
|:||The population at the th generation|
|:||A set of newly created solutions from genetic operator|
|:||A set of newly created solutions from exploration operator|
|:||A set of newly created solutions from , .|
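To illustrate how the terms listed above could fit together, the following Python sketch runs a population-based search until a fixed number of fitness function calls (FFCs) has been spent. It is a generic skeleton under assumptions, not the algorithm of this study: the function and variable names, the toy operators, and the survivor selection rule (keep the best solutions, with fitness maximized) are all hypothetical.

```python
import random


def run_budgeted_search(init_population, genetic_op, exploration_op,
                        fitness, max_ffcs):
    """Evolve a population until `max_ffcs` fitness evaluations are spent.

    Returns the scored final population, the number of spent FFCs, and
    the number of completed generations.
    """
    spent = 0
    scored = []
    for chrom in init_population:
        if spent >= max_ffcs:
            break
        scored.append((fitness(chrom), chrom))
        spent += 1
    pop_size = len(scored)
    generations = 0
    while spent < max_ffcs:
        parents = [chrom for _, chrom in scored]
        # Offspring from the genetic operator and the exploration operator
        # are merged into one set of newly created solutions.
        offspring = list(genetic_op(parents)) + list(exploration_op(parents))
        if not offspring:
            break
        for child in offspring:
            if spent >= max_ffcs:
                break
            scored.append((fitness(child), child))
            spent += 1
        # Assumed survivor selection: keep the pop_size fittest solutions.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        scored = scored[:pop_size]
        generations += 1
    return scored, spent, generations


def toy_genetic_op(parents, budget, rng):
    """Hypothetical crossover: mix two parents, trimmed to the budget."""
    a, b = rng.sample(parents, 2)
    pool = sorted(a | b)
    return [frozenset(rng.sample(pool, min(budget, len(pool))))]


def toy_exploration_op(parents, num_features, budget, rng):
    """Hypothetical stand-in: swap in one random unselected feature."""
    parent = rng.choice(parents)
    unselected = [f for f in range(num_features) if f not in parent]
    if not unselected:
        return [parent]
    child = set(parent)
    if len(child) >= budget:
        child.remove(rng.choice(sorted(child)))
    child.add(rng.choice(unselected))
    return [frozenset(child)]
```

Both toy operators keep every offspring within the feature budget, so the budget constraint holds throughout the search regardless of how survivors are chosen.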
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This research was supported by the Chung-Ang University Research Grants in 2017 and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (NRF-2016R1C1B1014774).
Copyright © 2018 Jaesung Lee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.