Research Article  Open Access
Liwen Peng, Yongguo Liu, "Feature Selection and Overlapping Clustering-Based Multilabel Classification Model", Mathematical Problems in Engineering, vol. 2018, Article ID 2814897, 12 pages, 2018. https://doi.org/10.1155/2018/2814897
Feature Selection and Overlapping Clustering-Based Multilabel Classification Model
Abstract
Multilabel classification (MLC) learning, which is widely applied in real-world applications, is a very important problem in machine learning. Some studies show that a clustering-based MLC framework performs effectively compared to a nonclustering framework. In this paper, we explore the clustering-based MLC problem. Multilabel feature selection also plays an important role in classification learning because many redundant and irrelevant features can degrade performance, and a good feature selection algorithm can reduce computational complexity and improve classification accuracy. In this study, we consider feature dependence and feature interaction simultaneously, and we propose a multilabel feature selection algorithm as a preprocessing stage before MLC. Typically, existing cluster-based MLC frameworks employ a hard clustering method, which assigns each instance of a multilabel dataset to a single cluster; however, because multilabel instances overlap by nature, an instance in a real-life application may not belong to only a single class. Therefore, we propose an MLC model that combines feature selection with an overlapping clustering algorithm. Experimental results demonstrate that various clustering algorithms show different performance for MLC, and the proposed overlapping clustering-based MLC model may be more suitable.
1. Introduction
The multilabel classification (MLC) problem, which is applicable to a wide variety of domains, such as music classification and bioinformatics [1], has received increasing attention. However, situations where single instances are associated with multiple labels remain challenging. Most algorithms treat such MLC tasks as multiple binary classification tasks. However, this approach may not consider potential correlations among features and labels.
A good MLC solution must be effective and efficient; however, a large number of redundant and irrelevant attributes may increase computational costs and the time required to learn and test a multilabel classifier, which reduces classification performance. Feature selection, which is an important technique in data mining and machine learning, has been widely used in classification models to enhance performance. Selecting features before applying classification methods to original datasets has many advantages, such as refining the data, reducing computational costs, and improving classification accuracy [2, 3]. Therefore, we utilise a feature selection algorithm to improve the quality of MLC.
Various feature selection methods have been proposed, for example, statistics, rough set methods, information gain, and mutual information (MI). A wide variety of research has shown that no single feature selection method can handle all situations. Many studies have demonstrated that MI-based feature selection methods are effective and efficient because the MI can handle different types of attributes, does not make any assumptions, and can measure nonlinear relations between variables [4]. Recently, many algorithms to select significant features for MLC have been proposed. However, most of these methods do not consider that a single attribute may affect various labels differently. The concept of interaction information has become more relevant because it can reflect the relevance, redundancy, and complementarity among attributes and labels; thus, it is an effective basis for feature selection. In this study, we propose an algorithm to improve MLC performance by selecting significant attributes based on interaction information between attributes and labels.
Some studies have shown that clustering-based MLC methods can improve predictive performance and reduce time costs; however, those studies used nonoverlapping clustering methods to handle multilabel datasets. In MLC, one object may belong to multiple classes; however, algorithms based on nonoverlapping clustering, that is, hard division methods, do not consider such situations. In contrast, overlapping clustering-based methods consider this situation when they handle datasets. Therefore, we propose an overlapping clustering-based MLC (OCBMLC) model.
The remainder of this paper is organised as follows: Section 2 describes related work, Section 3 provides background information, Section 4 describes the proposed multilabel feature selection algorithm and MLC model, Section 5 introduces experimental data, evaluation criteria, and experimental results, and conclusions and suggestions for future work are presented in Section 6.
2. Related Work
Currently, a variety of algorithms have been developed to handle MLC problems [5–15]. In traditional classification methods, each instance has a single label; however, in MLC, an instance can have more than one label. MLC algorithms can be divided into problem transformation methods (PT) and algorithm adaptation methods (AA) [9].
PT methods convert multilabel data to single-label data; thus, a single-label classification method can be used. Label powerset, binary relevance [10], and random ensemble learning with k-label sets [11] are classic PT methods. The AA approaches extend single-label algorithms to process multilabel data directly. BP-MLL [12] and ML-KNN are two popular AA methods. BP-MLL is a widely used MLC backpropagation algorithm. An important characteristic of this algorithm is the introduction of an error function that considers multiple labels. The ML-KNN AA method [13] determines the labels of a new object using the maximum a posteriori principle. The ML-KNN algorithm obtains a label set based on the statistical information of the label sets of the nearest neighbours of a test instance.
Many studies have shown that redundant and irrelevant features can increase computational costs, reduce performance, and result in overfitting. These problems also exist in MLC. Many feature selection methods have been proposed to handle these problems and improve MLC. Battiti [14] proposed the Mutual Information Feature Selection algorithm, which selects the maximum relevance term, to address these problems. Peng et al. [15] introduced an improved algorithm called Minimal-Redundancy and Max-Relevance, and Lin et al. [16] proposed a multilabel feature selection algorithm that combines MI with max-dependency and min-redundancy. In addition, over the past few years, unsupervised, clustering, and other technologies have been used to reduce dimensionality. For example, Li et al. [17] proposed a clustering-guided sparse structural learning algorithm that integrates clustering and a sparse structure in a unified framework to select the most useful features. They also proposed an algorithm [18] that employs nonnegative spectral clustering and controls the redundancy between features to select significant features. Cai et al. [19] presented the Unified Sparse Subspace Learning (USSL) framework, which employs a dimension reduction technique that incorporates a subspace learning method. The USSL framework has demonstrated good performance. Li et al. [20] proposed the Robust Structured Subspace Learning (RSSL) framework that combines subspace learning theory and feature learning. Their experimental results demonstrated that the RSSL framework performed well for image understanding tasks.
Recently, Kommu et al. [21] proposed two methods based on probabilistic theory to solve multilabel learning problems. In the first method, their algorithm uses logistic regression and a nearest neighbour classifier for MLC. Note that Partial Information is used in this approach. In the second method, their algorithm deals with the concept of grouping related labels. Association Rules are also introduced in the second approach. Guo and Li [22] proposed the Improved Conditional Dependency Networks framework for MLC. This method uses label correlations in the training stage and CDNs in the testing stage. Yu et al. [23] used a rough sets approach for MLC that considers the associations between labels. They evaluated the performance of their approach using seven multilabel datasets.
Nasierding et al. [24, 25] presented an effective CBMLC framework that combines a clustering algorithm with an MLC algorithm. Various clustering methods, such as k-means, EM, and Sequential Information Bottleneck, are used for training. Note that, with this framework, labels are ignored during the training phase. Nasierding et al. [26] compared clustering and nonclustering MLC methods for image and video annotations. Tahir et al. [27] proposed a method that combines a multilabel learning approach with fusion techniques. They used various multilabel learners to select a label set and demonstrated that ensemble techniques can avoid the disadvantages of different learners.
3. Background Theory
3.1. Entropy and Mutual Information
In this section, we introduce the theories of entropy and MI. Here, we assume that all variables are discrete or that continuous attributes can be discretised. Shannon's entropy [28] measures the uncertainty of a random variable, and it has been widely used in various domains. Let $X$ be a discrete random variable and $p(x)$ be its probability mass function. Formally, the entropy of $X$ is defined as follows:
$$H(X) = -\sum_{x} p(x) \log p(x).$$
Assume that $X$ and $Y$ are two discrete random variables and $p(x, y)$ is the joint probability of $X$ and $Y$. The joint entropy is defined as follows:
$$H(X, Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x, y).$$
If the value of random variable $Y$ is known and that of variable $X$ is not, the remaining uncertainty of variable $X$ can be measured by the conditional entropy, defined as follows:
$$H(X \mid Y) = -\sum_{x}\sum_{y} p(x, y) \log p(x \mid y).$$
The minimum value of $H(X \mid Y)$ is zero, which occurs when random variable $X$ is completely determined by random variable $Y$. The maximum conditional entropy value, $H(X \mid Y) = H(X)$, occurs when the two variables are statistically independent.
The relationship between conditional and joint entropy can be expressed as follows:
$$H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y).$$
MI is the amount of information shared by two variables and is defined as follows:
$$I(X; Y) = \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}.$$
Note that the two random variables are statistically independent when $I(X; Y)$ is zero. The relation between MI and entropy can be expressed as follows:
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$
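As a minimal sketch of the quantities introduced above, the following Python code estimates entropy, joint entropy, conditional entropy, and MI from aligned sequences of discrete samples by plugging empirical frequencies into the definitions. The function names are illustrative and do not come from the paper, and base-2 logarithms are an assumption (the text does not fix the base).

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Empirical Shannon entropy H(X) in bits of a sequence of discrete samples."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def joint_entropy(xs, ys):
    """H(X, Y): entropy of the paired samples."""
    return entropy(list(zip(xs, ys)))


def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return joint_entropy(xs, ys) - entropy(ys)


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)


x = [0, 0, 1, 1]
y = [0, 1, 0, 1]  # every (x, y) pair occurs once, so x and y are independent here
print(mutual_information(x, y))  # 0.0
print(mutual_information(x, x))  # 1.0
```

In this toy sample the MI of two independent variables is zero, and the MI of a variable with itself equals its entropy, which matches the relations stated above.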
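Let $Z$ be a third random variable. The conditional MI and joint MI represent the information of two variables in the context of a third variable and are defined as follows:
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z),$$
$$I(X, Y; Z) = I(X; Z) + I(Y; Z \mid X).$$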
Multi-information, which was introduced by McGill [29], extends MI from two random variables to the interaction among more than two random variables. Mathematically, multi-information is defined as follows:
$$I(X; Y; Z) = I(X, Y; Z) - I(X; Z) - I(Y; Z) = I(X; Y \mid Z) - I(X; Y).$$
Multi-information can be positive, negative, or zero [30]. If the multi-information value is zero, the random variables are independent in the context of the third variable. If the value is negative, the variables carry redundant information, whereas a positive value indicates that together the random variables provide more information than each variable taken individually.
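The sign behaviour described above can be checked on two toy cases. The sketch below assumes the three-variable form in which multi-information equals the conditional MI minus the unconditional MI, estimated from sample frequencies with base-2 logarithms; the helper names are illustrative only.

```python
from collections import Counter
from math import log2


def H(*cols):
    """Empirical joint entropy (bits) of one or more aligned discrete columns."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mi(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return H(x) + H(y) - H(x, y)


def cond_mi(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)


def interaction(x, y, z):
    """Three-way multi-information, taken here as I(X; Y | Z) - I(X; Y)."""
    return cond_mi(x, y, z) - mi(x, y)


x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
xor = [a ^ b for a, b in zip(x, y)]

print(interaction(x, y, xor))  # 1.0: x and y are only informative about xor together
print(interaction(x, x, x))    # -1.0: three copies of one variable are fully redundant
```

The XOR case is the standard example of complementary variables (positive value), and the duplicated variable is the standard example of redundancy (negative value).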
3.2. Overlapping Clustering Algorithm
Fuzzy C-Means (FCM) algorithms are widely used in fuzzy clustering learning. Fuzzy clustering, which is a type of overlapping clustering, differs from hard clustering. The FCM clustering algorithm assigns data points (examples) to clusters, and the fuzzy membership of a data point indicates the extent to which it belongs to each cluster [31].
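Suppose $X = \{x_1, x_2, \ldots, x_n\}$ is a set of $n$ vectors for clustering, where vector $x_i$ represents the attributes of object $i$. Here, a fuzzy partition matrix for $c$ clusters is defined as
$$U = [u_{ij}]_{n \times c},$$
where $u_{ij} \in [0, 1]$ and $\sum_{j=1}^{c} u_{ij} = 1$ for every object $i$.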
Note that examples can belong to more than one cluster with different degrees of membership. The FCM algorithm minimises the following objective function [32]:
$$J_m(U, V) = \sum_{i=1}^{n}\sum_{j=1}^{c} u_{ij}^{m}\, d_{ij}^{2},$$
where $U$ is the membership degree matrix, parameter $m > 1$ is the weight exponent that defines the fuzziness of the resulting clusters, and $d_{ij} = \lVert x_i - v_j \rVert$ is the Euclidean distance between object $x_i$ and the cluster centre $v_j$.
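The objective function is minimised by alternately updating the partition matrix and the cluster centres, where the centres are updated as follows:
$$v_j = \frac{\sum_{i=1}^{n} u_{ij}^{m}\, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}.$$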
The FCM membership function is defined as follows:
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{ik} \right)^{2/(m-1)}}.$$
Here, $u_{ij}$ is the membership value of the $i$th object in the $j$th cluster, $c$ is the number of clusters, and $v_j$ is the cluster centre of the $j$th cluster.
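The alternating updates of FCM can be sketched in plain Python as follows. This is an illustrative implementation of the standard algorithm, not the MATLAB code used in the experiments; the random initialisation, the stopping tolerance, the default weight exponent of 2, and the function name `fcm` are all assumptions made for the example.

```python
import random


def fcm(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Fuzzy C-Means: returns (memberships, centres).

    memberships[i][j] is the degree to which point i belongs to cluster j;
    each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centres = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by u_ij ** m.
        centres = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            total = sum(w)
            centres.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total for d in range(dim)]
            )

        # Membership update from the distance ratios to all centres.
        shift = 0.0
        for i in range(n):
            dist = [
                max(sum((a - b) ** 2 for a, b in zip(points[i], v)) ** 0.5, 1e-12)
                for v in centres
            ]
            new_row = [
                1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for j in range(c)
            ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row

        if shift < tol:  # memberships have stopped changing
            break
    return u, centres


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
memberships, centres = fcm(points, c=2)
for row in memberships:
    print([round(v, 3) for v in row])
```

Unlike a hard partition, every point keeps a nonzero membership in every cluster, which is what allows an instance to contribute to more than one cluster later on.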
4. Proposed Multilabel Classification Model
4.1. Proposed Multilabel Feature Selection Method
In information-theoretic feature selection, $F$ is the original feature set, and $S$ is the compact feature subset, where $S \subseteq F$. The selected subset $S$ should maximise the joint information between the subset with compact dimension and the class label $C$.
Such a method is impractical because it is difficult to calculate the high-dimensional probability density function. Therefore, some efficient methods have been proposed to approximate the ideal solution [14–16]. Generally, most multilabel feature selection methods based on MI consider the relevance and redundancy terms. In practice, such methods and their variants calculate the MI between a candidate feature and the selected feature subset; however, they do not sufficiently consider interaction information among attributes and class labels, ignore feature cooperation, and treat all features as competitors.
We know that a candidate feature for multilabel feature selection should have one of the highest MI values for all class labels. This is referred to as the relevance term. Multilabel feature relevance terms have been defined previously, and we use the following definition.
Definition 1. Let $f$ denote a candidate feature and $c_i$ be a class label. The relevance term is expressed as follows:
We can obtain two properties according to this definition.
Property 2. If candidate feature $f$ and each class label $c_i$ are mutually independent, then the MI between $f$ and the class labels is minimum.
Property 3. If each class label $c_i$ is determined completely by $f$, then the MI between $f$ and the class labels is maximum.
According to the above properties, we can use Definition 1 to select relevant candidate features. However, classes combined with previously selected features may produce interaction. Therefore, we should consider the interaction information among the candidate feature, the selected features, and the classes during feature selection. Differing from existing feature selection methods, we consider the interaction information between a feature and a single class and the pairwise interaction between features and all class labels. Our interaction metric is defined as follows:
Here, $S$ is the selected feature subset, $C$ denotes the label set, and $f$ denotes the candidate feature.
It is well known that multilabel feature selection attempts to select a set of features with the highest discrimination power for all labels. According to the above discussion, we combine (14) and (15) using the maximum-of-the-minimum criterion for feature interaction to propose a new goal function (referred to as max-dependence and interaction (MDI)) for multilabel feature selection. Here, the candidate features are considered to have the highest relevance and beneficial interaction with all class labels. The proposed MDI goal function is expressed as follows:
With this function, the first term is the relevance between the candidate features and all class labels, and the second term focuses on the interaction information among the candidate feature, the selected features, and the class labels. The proposed goal function can select features with the greatest discrimination power. The pseudocode of the proposed algorithm is given in Pseudocode 1.
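A greedy forward search built on a relevance term plus an interaction term can be sketched as follows. The exact MDI formula is not reproduced in this text, so the score below is an assumed stand-in rather than the authors' goal function: relevance is taken as the sum of MI between the candidate and each label, and the interaction term is taken as the minimum, over already selected features, of the summed three-way interaction with each label. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def H(*cols):
    """Empirical joint entropy (bits) of aligned discrete columns."""
    rows = list(zip(*cols))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mi(x, y):
    return H(x) + H(y) - H(x, y)


def interaction(x, y, z):
    """I(X; Y | Z) - I(X; Y); positive values indicate complementary features."""
    return (H(x, z) + H(y, z) - H(x, y, z) - H(z)) - mi(x, y)


def select_features(features, labels, k):
    """Greedy forward selection with an ASSUMED relevance + interaction score.

    features: list of discrete feature columns; labels: list of label columns.
    Returns the indices of the k selected features in selection order.
    """
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:

        def score(j):
            relevance = sum(mi(features[j], c) for c in labels)
            if not selected:
                return relevance
            # Assumed max-min style term: worst-case interaction with a selected feature.
            worst = min(
                sum(interaction(features[j], features[s], c) for c in labels)
                for s in selected
            )
            return relevance + worst

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


f0 = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]               # exact duplicate of f0 (redundant)
f2 = [0, 1, 0, 1]               # complementary to f0
label_a = [0, 0, 1, 1]          # equals f0
label_b = [0, 1, 1, 0]          # equals f0 XOR f2
print(select_features([f0, f1, f2], [label_a, label_b], k=2))  # [0, 2]
```

In this toy example a relevance-only criterion would rank the duplicate `f1` above `f2`, whereas the interaction term penalises the duplicate and rewards the complementary feature, which is the behaviour the text attributes to considering interaction information.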

4.2. Proposed Multilabel Classification Model
Some experimental results show that CBMLC methods can improve predictive classification performance and reduce algorithm training time compared to existing popular multilabel methods [24–26]. The results of those models show that clustering-based methods classify effectively. However, those models applied nonoverlapping clustering methods, such as EM and k-means, prior to MLC. Therefore, with nonoverlapping methods, the original data are partitioned into several disjoint data clusters.
Clustering methods are usually classified into hard clustering and fuzzy clustering. In hard clustering, each instance is assigned to a single cluster. However, due to the overlapping nature of instances, they generally do not belong to only a single class in real-world applications. This property limits the practical application of hard clustering, especially for MLC.
FCM is an effective classic fuzzy clustering method based on an objective function concept and is widely used in clustering. The FCM approach uses alternating optimisation strategies to solve nonlinear and unsupervised clustering problems. In multilabel data, one instance may have multiple classes, and the FCM algorithm can handle an instance that belongs to more than one cluster simultaneously. This allows the use of a fuzzy clustering method that assigns a single object to several clusters. Therefore, we propose the OCBMLC model in combination with the FCM algorithm to improve performance. Figure 1 shows the basic procedure of the proposed OCBMLC model.
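One way to turn fuzzy memberships into overlapping training clusters is sketched below. The text does not state how the OCBMLC model converts FCM memberships into cluster assignments, so the rule used here (keep an instance in its top cluster and in every cluster whose membership reaches a threshold) and the threshold value of 0.3 are purely hypothetical choices for illustration.

```python
def overlapping_clusters(memberships, threshold=0.3):
    """Group instance indices into possibly overlapping clusters.

    memberships[i][j] is the fuzzy membership of instance i in cluster j.
    An instance joins its highest-membership cluster and any other cluster
    whose membership is at least `threshold` (a hypothetical rule).
    """
    n_clusters = len(memberships[0])
    clusters = [[] for _ in range(n_clusters)]
    for i, row in enumerate(memberships):
        top = row.index(max(row))
        for j, u in enumerate(row):
            if j == top or u >= threshold:
                clusters[j].append(i)
    return clusters


memberships = [
    [0.90, 0.10],  # clearly cluster 0
    [0.55, 0.45],  # borderline: belongs to both clusters
    [0.20, 0.80],  # clearly cluster 1
]
print(overlapping_clusters(memberships))  # [[0, 1], [1, 2]]
```

A multilabel classifier could then be trained on each cluster separately, with borderline instances such as the second one contributing to more than one cluster.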
5. Experiments and Results
5.1. Datasets
In our experiments, we used three public multilabel datasets, that is, the emotions, yeast, and scene datasets. These datasets were taken from the Mulan Library. The emotions dataset contains songs labelled according to people's emotions [33]. The yeast dataset includes information about gene functions [34], and the scene dataset [35] includes a series of landscape patterns. Table 1 shows the statistics of the three multilabel benchmark datasets.

In Table 1, “Domain” denotes the dataset domain, “Instances” is the number of instances in the dataset, “Features” is the number of attributes, “Labels” is the number of labels in the datasets, and “Cardinality” is the average number of labels associated with each instance.
5.2. Experimental Setting
At the multilabel feature selection stage, to simplify the calculation of MI, we discretise continuous features into 10 bins using an equal-width strategy. The evaluation approaches for MLC differ from those for traditional single-label classification. Note that the Hamming loss and micro F-measure evaluation criteria are widely used for MLC; thus, we used these criteria in our experiments.
Note that the non-OCBMLC models use the k-means and EM algorithms to cluster the original datasets, and the OCBMLC model uses the FCM algorithm on the data after dimension reduction. The overlapping and nonoverlapping frameworks both employ ML-KNN as the classifier. For k-means, EM, and FCM, the number of clusters is varied from 2 to 7. In this study, a cross-validation strategy was used for each combination of algorithm framework and dataset. All experiments used MATLAB 2012 on an Intel Core i5 2.3 GHz processor with 8 GB memory.
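Equal-width discretisation into 10 bins can be sketched as follows. The handling of a constant feature and of the maximum value (which is placed in the last bin) are assumptions of this example, since the text only names the strategy and the number of bins.

```python
def equal_width_bins(values, n_bins=10):
    """Map each continuous value to a bin index in 0 .. n_bins - 1.

    The range [min, max] is split into n_bins intervals of equal width;
    the maximum value is assigned to the last bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: everything falls into one bin
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


print(equal_width_bins([0.0, 2.5, 5.0, 10.0]))  # [0, 2, 5, 9]
```

The resulting integer columns can be fed directly to frequency-based estimates of entropy and MI.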
5.3. Evaluation Metrics
The evaluation of an MLC system differs from that of a single-label classification system. Note that some criteria that evaluate the performance of an MLC system have been employed previously [36]. Among such evaluation metrics, we employed the Hamming loss and micro F-measure criteria.
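Here, let $T = \{(x_i, Y_i) \mid 1 \le i \le n\}$ be a set of $n$ test examples and $Z_i$ be the predicted label set for the test instance $x_i$. $Y_i$ is the ground truth label set for $x_i$. The Hamming loss is the ratio of erroneous labels to the total number of labels, where a smaller Hamming loss value indicates better classification performance. The Hamming loss value is calculated as follows:
$$\mathrm{HammingLoss} = \frac{1}{n}\sum_{i=1}^{n} \frac{\lvert Y_i \,\triangle\, Z_i \rvert}{q},$$
where $q$ is the total number of labels and $\triangle$ denotes the symmetric difference of two sets.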
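The Hamming loss can be computed directly from label sets, as in the following minimal sketch. Representing each label set as a Python set of label indices is an assumption made for the example.

```python
def hamming_loss(true_sets, pred_sets, n_labels):
    """Fraction of instance-label pairs that are predicted incorrectly.

    true_sets / pred_sets: one set of label indices per test instance.
    n_labels: total number of labels in the dataset.
    """
    errors = sum(len(set(y) ^ set(z)) for y, z in zip(true_sets, pred_sets))
    return errors / (len(true_sets) * n_labels)


truth = [{0, 1}, {2}]
predicted = [{0}, {2, 3}]
print(hamming_loss(truth, predicted, n_labels=4))  # 0.25
```

Here one label is missed for the first instance and one is wrongly added for the second, giving 2 errors out of 8 instance-label pairs.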
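The micro F-measure represents the harmonic mean of precision and recall, and it is calculated from the counts of true positives, false positives, and false negatives. The F-measure and microaveraging are evaluated as follows:
$$F = \frac{2\,tp}{2\,tp + fp + fn}, \qquad F_{\mathrm{micro}} = F\!\left(\sum_{\lambda=1}^{q} tp_{\lambda},\ \sum_{\lambda=1}^{q} fp_{\lambda},\ \sum_{\lambda=1}^{q} fn_{\lambda}\right).$$
Here, $tp_{\lambda}$ denotes true positives and $fp_{\lambda}$ denotes false positives, and $tn_{\lambda}$ and $fn_{\lambda}$ are true and false negatives, respectively, for label $\lambda$ after a separate binary evaluation is performed. Note that a greater micro F-measure value indicates better classification performance of a multilabel algorithm.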
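Microaveraging pools the per-label counts before the F-measure is computed, as the following sketch shows. Label sets are again represented as Python sets of label indices, and returning 0.0 when no positive label is present or predicted is an assumed convention.

```python
def micro_f_measure(true_sets, pred_sets, labels):
    """Micro-averaged F-measure over all labels.

    True positives, false positives, and false negatives are summed over
    every label and every instance, then combined as 2tp / (2tp + fp + fn).
    """
    tp = fp = fn = 0
    for label in labels:
        for y, z in zip(true_sets, pred_sets):
            in_truth, in_pred = label in y, label in z
            if in_truth and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_truth:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


truth = [{0, 1}, {2}]
predicted = [{0}, {2, 3}]
print(micro_f_measure(truth, predicted, labels=range(4)))  # 0.666...
```

In this example the pooled counts are 2 true positives, 1 false positive, and 1 false negative, so the score is 4/6.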
5.4. Results
In this study, we used Hamming loss and the micro F-measure as experimental evaluation metrics and employed ML-KNN as the multilabel classifier. Note that, in all cases, the best results are shown in bold in Tables 2 to 9.
5.4.1. Comparisons of Feature Selection Methods
To demonstrate the efficacy of the proposed feature selection algorithm, we evaluated it with the clustering-based MLC models using the emotions dataset. We also compared the proposed algorithm, in which interaction information among features and classes is considered, with a feature selection method that considers only the dependence between features and classes. Here, we refer to the criterion that considers only dependence as the max-dependence criterion, where . This criterion was used to select candidate features.
In this experiment, “DEP_max” represents the features selected by the max-dependence criterion, which ignores interaction information when selecting candidate features, and “MDI” represents features selected by the proposed algorithm, which considers dependence and interaction information among the candidate features and each class simultaneously. Here, we selected the top features for MLC according to the MDI and DEP_max criteria, and we compared the average Hamming loss and micro F-measure values obtained with the selected top feature subsets against the values obtained with the original feature sets.
Table 2 shows the Hamming loss values obtained with the features selected according to MDI and DEP_max and with the original features from the emotions dataset, and Table 3 shows the corresponding micro F-measure values. In terms of the feature selection methods, we found that DEP_max performed no better than the other configurations, including the original feature set. In contrast, MDI performed better and more stably when the number of clusters ranged from 2 to 7. It is likely that features selected by considering only Max-Relevance contain abundant redundancy, which means that the dependence among those features could be large. Therefore, the proposed feature selection function may be better suited for MLC, which is consistent with the experimental results.


5.4.2. Comparisons of Multilabel Classification Models
The Hamming loss and micro F-measure results obtained by the MLC models with the emotions dataset are shown in Tables 4 and 5, respectively. We selected the top in the selected feature subset as the final feature subset for use with the proposed model. Table 4 demonstrates that the proposed OCBMLC framework achieved the lowest Hamming loss value () with the emotions dataset. Table 5 shows that the proposed framework achieved the highest micro F-measure () with the emotions dataset. As shown in Figures 2 and 3, the predictive performance of the proposed model achieved the best results with the emotions dataset when . As shown in Figures 2 and 3, respectively, the Hamming loss reaches its minimum value and the micro F-measure reaches its maximum value when we used the MDI feature selection method to select the top features as the classification attribute subset.


To demonstrate the classification performance of the proposed model, we also selected the top in the selected feature subset as an experimental feature subset. The Hamming loss and micro F-measure results of the MLC model with the yeast dataset are shown in Tables 6 and 7. As shown in Figures 4 and 5, the Hamming loss and micro F-measure show the best results when with 40% of the features selected from the original data attributes. In addition, the evaluation criterion values of MLC decreased as the number of clusters increased.


Tables 8 and 9 show that the OCBMLC model achieved the top predictive performance (Hamming loss = ; micro F-measure = ) with the scene dataset. Figures 6 and 7 show that, in terms of Hamming loss and micro F-measure, the OCBMLC model outperformed the “EM and ML-KNN” and “k-means and ML-KNN” models when and 30% of the features of the original data attributes were selected for the scene dataset. Thus, we conclude that the proposed OCBMLC model outperforms the other classification models.
