Abstract

Label imbalance is an inherent characteristic of multilabel data, and imbalanced data seriously affect the performance of classifiers. In multilabel classification, resampling methods are the most common way to deal with imbalance. Existing resampling methods balance the data by either undersampling or oversampling alone, which causes overfitting or information loss, and resampling an instance affects every minority label that the instance carries. Furthermore, the high concurrency of majority labels and minority labels in many instances also degrades classification performance. In this study, we propose a bidirectional resampling method that decouples multilabel datasets. On the one hand, the concurrency of labels is reduced by setting a termination condition for decoupling; on the other hand, the loss of instance information and overfitting are alleviated by combining oversampling and undersampling. By measuring the minority labels of the instances, the instances that have less impact on other minority labels are selected for resampling. The number of resampled instances is limited so that the original distribution of the data is preserved during the resampling phase. Experiments on seven benchmark multilabel datasets demonstrate the effectiveness of the algorithm, especially on datasets with high concurrency of majority and minority labels.

1. Introduction

With the advent of the era of big data, data classification has received much attention in recent years. Imbalanced data frequently arise in data classification, including in medical data. Data imbalance means that some categories contain far more or far fewer instances than others. Compared with their behavior on balanced datasets, most algorithms perform poorly when dealing with imbalanced data: the classifier is biased toward the majority class, and a higher error rate occurs on the minority class. In practical applications, we tend to care more about correctly classifying the minority classes; as a result, it is more important to correctly identify minority classes than majority classes. For example, in tumor classification, nontumor patients are the majority class and tumor patients are the minority class [1], but we are more concerned about the minority of tumor patients. The same problem arises in medical imaging classification, credit card fraud detection [2], network intrusion identification, and other fields.

The multilabel imbalance problem differs from the traditional imbalance problem. In multilabel data, each instance is associated with a set of labels rather than the single label of binary classification. The class with a larger number of instances is called the majority class, which corresponds to the majority label, and the class with a smaller number of instances is called the minority class, which corresponds to the minority label [3–5]. For example, in drug target prediction, each drug molecule can correspond to multiple targets and each target to multiple drug molecules, but some targets have far fewer instances than the rest, which greatly increases the difficulty of classification.

In traditional binary classification, each sample corresponds to only one class, so interactions between classes within a sample do not arise. Multilabel classification, however, faces a new challenge: some instances contain both majority labels and minority labels. These two kinds of labels are highly concurrent, which makes it harder to classify multilabel data correctly. Multilabel data imbalance and the concurrency of majority and minority labels often coexist and usually need to be addressed together.

Imbalanced data can be divided into single-label imbalanced data and multilabel imbalanced data according to the number of labels per instance. This section introduces resampling methods for traditional single-label imbalanced data and for multilabel imbalanced data and describes their advantages and disadvantages in detail.

Traditional methods for single-label imbalanced data fall into three groups: algorithm-level methods, cost-sensitive learning methods [6–8], and data-level methods. Algorithm-level methods adapt the classification algorithm to imbalanced datasets, usually by moving the decision boundary to strengthen the presence of minority class instances. Hong et al. [9] optimized the distribution of imbalanced datasets by improving the kernel classifier. Liu et al. [10] applied the weighted Gini index (WGI) to choose a subset of features, which aids the accurate identification of the minority class. Cost-sensitive learning achieves correct classification by penalizing misclassifications. Data-level methods primarily focus on resampling, including undersampling [11], oversampling [12], and the Synthetic Minority Oversampling Technique (SMOTE): the data are balanced by deleting majority class instances or adding minority class instances. Galar et al. [13] compared common imbalanced learning algorithms and showed that data preprocessing combined with other classification methods is an effective approach to imbalanced classification. Kang et al. [14] proposed a Noise-filtered Undersampling Scheme (NUS) that applies a noise filter before executing the resampling process.
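To make the data-level approach concrete, the following minimal sketch (ours, for illustration only, not from the cited works) shows random oversampling and random undersampling for binary single-label data; X is assumed to be a NumPy feature matrix and y a label vector:

```python
import numpy as np

def random_oversample(X, y, minority_class, rng=None):
    """Duplicate random minority-class instances until the classes are balanced."""
    rng = rng if rng is not None else np.random.default_rng(0)
    min_idx = np.flatnonzero(y == minority_class)
    maj_idx = np.flatnonzero(y != minority_class)
    # Draw with replacement so the minority class reaches the majority size.
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([min_idx, maj_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_class, rng=None):
    """Drop random majority-class instances until the classes are balanced."""
    rng = rng if rng is not None else np.random.default_rng(0)
    min_idx = np.flatnonzero(y == minority_class)
    maj_idx = np.flatnonzero(y != minority_class)
    # Keep only as many majority instances as there are minority ones.
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([min_idx, kept_maj])
    return X[keep], y[keep]
```

Oversampling duplicates existing instances and therefore risks overfitting, while undersampling discards instances and loses information; this is exactly the trade-off that the multilabel methods below inherit.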

Although the processing of multilabel imbalanced data is also based on data-level and algorithm-level methods, traditional imbalanced data processing methods are not directly applicable to multilabel imbalanced datasets. Among existing approaches, algorithm-level methods mainly adjust existing classification methods to accommodate the imbalanced data [15, 16]. Traditional multilabel classification [17] converts multilabel problems into binary [18, 19] or multiclass problems [20, 21], as done by Label Powerset (LP) [22] and Binary Relevance (BR) [4]. Zhang et al. [23] improved the traditional classification algorithm and proposed the COCOA algorithm, which converts the original multilabel dataset into one binary dataset and several multiclass datasets for each label and resamples each of them to build classifiers for the imbalanced data.

Data-level methods change the distribution of instances to balance the dataset. They mainly focus on resampling: oversampling generates new instances of minority labels, and undersampling removes instances of majority labels. Among multilabel imbalanced data processing methods, data-level methods deserve particular attention because (1) they are independent of the classification process and can be applied without disturbing the classification algorithm and (2) the separation of tasks allows different algorithms to exploit their respective strengths. Hence, several researchers have studied this direction. Charte et al. proposed the data-level LP-RUS (LP-based Random Undersampling) and LP-ROS (LP-based Random Oversampling) [24] algorithms and, in 2015, their improved versions ML-RUS (Multilabel Random Undersampling) and ML-ROS (Multilabel Random Oversampling) [25].

Both LP-RUS and LP-ROS decide how to resample by considering the label sets in the dataset: LP-RUS deletes instances belonging to the most frequent label sets, while LP-ROS clones instances of the least frequent label sets. During resampling, LP-RUS and LP-ROS may introduce new imbalance for some labels. To balance the dataset at the label level, ML-ROS randomly clones instances associated with minority labels to increase the frequency of minority labels, while ML-RUS randomly deletes instances associated with majority labels to reduce their frequency in the dataset.

ML-ROS and ML-RUS resample the dataset and improve classification performance. However, they have several disadvantages: (1) using oversampling or undersampling alone results in redundant minority label information or the loss of majority label information; (2) these methods destroy the original distribution of the dataset, which adversely affects classification [26]; (3) they cannot balance instances in which majority labels and minority labels are highly concurrent.

To alleviate the problem of minority and majority labels being highly concurrent, Charte et al. [27] proposed REMEDIAL, REMEDIAL-HwR-ROS (REMEDIAL Hybridizing with Random Oversampling), and REMEDIAL-HwR-HUS (REMEDIAL Hybridizing with Heuristic Undersampling). The REMEDIAL algorithm is independent of the resampling algorithms and can be combined with various resampling algorithms to decouple majority and minority labels, reducing the degree of concurrency among labels [28]. REMEDIAL-HwR-ROS decouples the highly concurrent labels, then looks for instances linked to minority labels and generates clones of them; REMEDIAL-HwR-HUS decouples the highly concurrent labels and applies undersampling. However, these algorithms have two problems: (1) they do not fundamentally remove the drawbacks of one-directional resampling and may still cause serious overfitting or loss of information; (2) they split qualifying instances into two parts, and even after the high-concurrency problem has been resolved, decoupling continues, which may lead to overfitting in the classification process.

3. Our Approach

In this section, we propose a Multilabel Decoupling Bidirectional Resampling algorithm (ML-DBR).

3.1. Related Definitions

To measure the degree of imbalance in multilabel data, two indicators are used to distinguish the labels of a multilabel imbalanced dataset: the Imbalance Ratio per Label (IR) and the Mean Imbalance Ratio (MeanIR).

Let D = {(Xi, Li) | 1 ≤ i ≤ n, Li ⊆ Y} be a multilabel dataset, where Xi is the i-th instance, Li is the label set of Xi, and Y is the set of all labels in the dataset.
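IR measures the imbalance of a single label: it is the number of occurrences of the most frequent label divided by the number of occurrences of label y. We restate it here in the standard form of [25] (equation (1)):

$$\mathrm{IR}(y)=\frac{\max_{y'\in Y}\sum_{i=1}^{n}h(y',L_i)}{\sum_{i=1}^{n}h(y,L_i)},\qquad h(y,L_i)=\begin{cases}1, & y\in L_i\\ 0, & \text{otherwise,}\end{cases}\tag{1}$$

so the rarer a label is, the higher its IR.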

3.2. MeanIR

MeanIR represents the average level of imbalance in the dataset and is the mean of the IR values of all labels, as shown in equation (2):
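$$\mathrm{MeanIR}=\frac{1}{|Y|}\sum_{y\in Y}\mathrm{IR}(y).\tag{2}$$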

According to MeanIR and IR, we can define majority and minority labels. If the IR value of a label is higher than MeanIR, it is a minority label; otherwise, it is a majority label. For label y, if IR(y) > MeanIR, it belongs to minBag; otherwise, it belongs to majBag.

3.3. SCUMBLE [29]

In addition, we use the SCUMBLE metric to assess the degree of concurrency between majority and minority labels; its values lie in the (0, 1) range. The higher the value, the more instances containing both minority and majority labels exist in the dataset:
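$$\mathrm{SCUMBLE}(D)=\frac{1}{n}\sum_{i=1}^{n}\mathrm{SCUMBLE}_{ins}(i),\tag{3}$$

$$\mathrm{SCUMBLE}_{ins}(i)=1-\frac{1}{\overline{\mathrm{IR}}_i}\left(\prod_{l=1}^{k}\mathrm{IR}_{il}\right)^{1/k},\tag{4}$$

where $\overline{\mathrm{IR}}_i$ is the mean of the IR values in $\mathrm{IR}_i$; the equations are restated in the standard form of [29].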

In equation (3), n is the number of instances in the dataset; in equation (4), k is the number of labels of Xi, and IRi is the set of IR values of the labels in Li.

3.4. Min-SCUMBLE

Relevant studies [27] have shown that resampling an instance of a minority label has the greatest impact on the other minority labels contained in that instance: resampling one label inevitably resamples the other minority labels in the same instance, which interferes with their own resampling. Based on the SCUMBLE metric, we therefore propose the Min-SCUMBLE metric, which measures only the minority labels of an instance during resampling.
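Min-SCUMBLE takes the same form as equation (4), with the IR terms restricted to the minority labels of the instance:

$$\mathrm{MinSCUMBLE}_{ins}(i)=1-\frac{1}{\overline{\mathrm{IR}}_i^{\,\min}}\left(\prod_{l=1}^{k}\mathrm{IR}_{il}^{\,\min}\right)^{1/k},\tag{5}$$

where $\mathrm{IR}_i^{\,\min}$ is the set of IR values of the minority labels in $L_i$ and k is the number of minority labels of the instance.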

3.5. MeanSamples

In addition, MeanSamples is used in ML-DBR. MeanSamples represents the number of instances a label needs in order to reach the balance level of MeanIR. It is calculated by dividing the number of instances of the most frequent label by the MeanIR value:
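$$\mathrm{MeanSamples}=\frac{\max_{y\in Y}|y|}{\mathrm{MeanIR}},\tag{6}$$

where |y| denotes the number of instances that contain label y.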

3.6. Proposed Algorithm

The pseudocode of ML-DBR is shown in Algorithm 1. The algorithm is divided into two stages, i.e., decoupling and resampling. In the first stage, the decoupling strategy decouples high concurrency labels and prevents decoupling of instances with low concurrency labels (Steps 4–10 in Algorithm 1). In the second stage, oversampling and undersampling are combined to select instances that have less impact on the minority label for resampling (Steps 11–24 in Algorithm 1).

Algorithm ML-DBR:
Input: A multilabel dataset D, resampling rate P, threshold t, sample size m
Output: Preprocessed dataset D
Decoupling strategy
   (1)  Calculate samplesToResampling = |D| × P, IR, MeanIR, and MeanSamples
   (2)  Calculate SCUMBLEIns for each instance in D and set the SCUMBLE(D) as SCUMBLE(D)1
   (3)  For each Di in D,
   (4)    If SCUMBLEIns(i) > SCUMBLE(D)1, then
   (5)      clone Di as Di′; Li′ = Li[IR(y) > MeanIR], Li = Li[IR(y) ≤ MeanIR] (Li′ is the label set of Di′)
   (6)      D′ = D′ + Di′ (D′ is the decoupled dataset)
   (7)        For every 1% of decoupled samples
   (8)          Recalculate the SCUMBLE(D) as SCUMBLE(D)j
   (9)          If SCUMBLE(D)j−1 − SCUMBLE(D)j < t, then stop decoupling
   (10)  D = D + D′
Resampling strategy
   (11)  While samplesToResampling > 0
   (12)    randomly select a label y
   (13)    if y ∈ minBag, then
   (14)      x = Random(0, MeanSamples − |y|)
   (15)      While x > 0
   (16)        randomly get m samples from the instances containing y
   (17)        let the max Min-SCUMBLEIns sample of the m samples be Z, clone Z
   (18)        D = D + Z, x = x − 1, samplesToResampling = samplesToResampling − 1
   (19)    if y ∈ majBag, then
   (20)      x = Random(0, |y| − MeanSamples)
   (21)      While x > 0
   (22)        randomly get m samples from the instances containing y
   (23)        let the max Min-SCUMBLEIns sample of the m samples be Z, set label y of Z to 0
   (24)        x = x − 1, samplesToResampling = samplesToResampling − 1
   (25)    Recalculate MeanIR; if MeanIR ≤ 1.5, then stop the algorithm
   (26)  return D

For each label, IR and MeanIR are calculated to determine which category the label belongs to. The resampling rate P represents the proportion of the dataset to be increased or decreased. In ML-ROS and ML-RUS, P causes the dataset to swell or shrink by that proportion; in ML-DBR, P is not the proportion by which the dataset grows or shrinks but is used to calculate the number of instances that need to be adjusted, samplesToResampling = |D| × P. Next, we introduce the strategies used in ML-DBR.

3.6.1. Decoupling Strategy

ML-DBR calculates the SCUMBLEIns value of each instance in the dataset, sets the initial SCUMBLE(D) of the dataset as SCUMBLE(D)1, and decouples the instances that exceed it, so as to reduce the number of instances with highly concurrent labels. If SCUMBLEIns(i) > SCUMBLE(D)1, the instance Di is cloned as Di′; Li is the label set of Di and Li′ is the label set of Di′, with the clone keeping the minority labels, Li′ = Li[IR(y) > MeanIR], and the original keeping the majority labels, Li = Li[IR(y) ≤ MeanIR]. Then, each time a further 1% of the instances in the dataset has been decoupled, the SCUMBLE(D) of the dataset is recalculated, where j means that j% of the instances have been decoupled and j − 1 means that (j − 1)% have been decoupled. When SCUMBLE(D)j−1 − SCUMBLE(D)j < t, decoupling is no longer reducing concurrency appreciably, the high concurrency of the dataset is considered resolved, and decoupling stops (Step 9 in Algorithm 1). If SCUMBLE(D)j−1 − SCUMBLE(D)j ≥ t, the instances with SCUMBLEIns > SCUMBLE(D)1 continue to be decoupled.

Decoupling separates instances with highly concurrent labels into a minority-label part and a majority-label part: the highly concurrent instances are identified in Step 4, and in Steps 5 and 6 each of them is decoupled into two instances. Although the features of the two decoupled instances are identical, their label sets are different.
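A minimal sketch of this decoupling pass is shown below. It assumes that instances are (features, label_set) pairs with label sets stored as Python sets and that ir and scumble_ins implement equations (1) and (4); all helper names are ours, for illustration only, not the authors' implementation.

```python
import copy

def scumble_dataset(D, scumble_ins):
    # Equation (3): mean instance-level SCUMBLE over the dataset.
    return sum(scumble_ins(labels) for _, labels in D) / len(D)

def decouple(D, ir, mean_ir, scumble_ins, t):
    """Decoupling pass of ML-DBR (illustrative sketch)."""
    scumble_1 = scumble_dataset(D, scumble_ins)     # SCUMBLE(D)_1
    prev = scumble_1
    step = max(1, len(D) // 100)                    # re-check after every 1% of instances
    decoupled = []                                  # the decoupled dataset D'
    for i, (x, labels) in enumerate(D):
        if scumble_ins(labels) <= scumble_1:
            continue                                # low concurrency: leave instance as is
        minority = {y for y in labels if ir[y] > mean_ir}
        majority = labels - minority
        if not minority or not majority:
            continue                                # nothing to split
        D[i] = (x, majority)                        # original keeps the majority labels
        decoupled.append((copy.copy(x), minority))  # clone keeps the minority labels
        if len(decoupled) % step == 0:
            cur = scumble_dataset(D, scumble_ins)
            if prev - cur < t:                      # SCUMBLE gain below threshold t: stop
                break
            prev = cur
    return D + decoupled                            # Step 10: D = D + D'
```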

3.6.2. Resampling Strategy

Firstly, this strategy randomly selects a label y. MeanSamples limits the number of resampled instances so that, during resampling, a label neither exceeds nor falls below the number of instances required to reach balance. Next, a random number x is generated, and m instances containing y are randomly picked. The Min-SCUMBLEIns values of the m instances are compared in order to select an instance with less impact on the other minority labels: as in Steps 17 and 23 of Algorithm 1, the instance with the maximum Min-SCUMBLEIns among the m samples is chosen. If y belongs to minBag, x = Random(0, MeanSamples − |y|), and the selected instance is cloned; if y belongs to majBag, x = Random(0, |y| − MeanSamples), and the label y of the selected instance is set to 0.

At the end of each resampling round, MeanIR and IR are recalculated; MeanSamples keeps its initial value and is not recalculated during the resampling process, so that the original distribution of the dataset is not disturbed too much. Previous work has shown that when MeanIR ≤ 1.5, resampling the dataset yields only a limited improvement in classifier performance [25], so ML-DBR stops when MeanIR ≤ 1.5.
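The per-label resampling round can be sketched as follows, continuing the conventions of the previous sketch; instances_of and min_scumble_ins are hypothetical helpers returning the indices of instances carrying label y and the equation (5) value of a label set, respectively.

```python
import random

def resample_label(D, y, in_min_bag, m, mean_samples, min_scumble_ins, instances_of):
    """One resampling round of ML-DBR for label y (illustrative sketch)."""
    pool = instances_of(y)
    if not pool:
        return 0
    # Steps 14 and 20: x bounds how far label y may move toward MeanSamples.
    x = random.randint(0, abs(mean_samples - len(pool)))
    for _ in range(x):
        candidates = random.sample(pool, min(m, len(pool)))
        # Steps 17 and 23: pick the candidate with the maximum Min-SCUMBLE_ins.
        z = max(candidates, key=lambda i: min_scumble_ins(D[i][1]))
        if in_min_bag:
            D.append((D[z][0], set(D[z][1])))   # minority label: clone the instance
        else:
            D[z][1].discard(y)                  # majority label: remove label y
    return x                                    # caller decrements samplesToResampling by x
```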

4. Results and Discussion

4.1. Evaluation Metrics

The performance of multilabel classifiers can be measured in a variety of ways, which fall into three types: example-based, label-based, and ranking-based. To better evaluate the performance of the different methods, we use label-based evaluation, which better reflects the correct classification of majority and minority labels. Label-based evaluation comes in two flavors: macromeasurement and micromeasurement. Accuracy, macro-F, and micro-F were selected as evaluation indicators [30] to obtain a comprehensive evaluation. For a label, TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.

Accuracy is the ratio of the number of correctly predicted instances to the total number of predicted instances, regardless of whether the instances are positive or negative. The accuracy is calculated as follows:
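$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}.\tag{7}$$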

Macro-F and micro-F inherit the advantages of the F-measure and better reflect the classification performance on minority labels.

Macro-F averages each statistical indicator over all categories with equal weight. The calculation of macro-F is shown in equation (10), where equations (8) and (9) define macro-Precision (macro-P) and macro-Recall (macro-R); in equations (8) and (9), Pj and Rj denote the precision and recall of the j-th label:
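$$\text{macro-}P=\frac{1}{k}\sum_{j=1}^{k}P_j,\tag{8}$$

$$\text{macro-}R=\frac{1}{k}\sum_{j=1}^{k}R_j,\tag{9}$$

$$\text{macro-}F=\frac{2\times\text{macro-}P\times\text{macro-}R}{\text{macro-}P+\text{macro-}R},\tag{10}$$

where k is the number of labels; the equations are restated here in their standard form.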

Micro-F builds a single global confusion matrix over all instances regardless of category. Micro-F is calculated as in equation (11), and equations (12) and (13) give micro-Precision (micro-P) and micro-Recall (micro-R):
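$$\text{micro-}F=\frac{2\times\text{micro-}P\times\text{micro-}R}{\text{micro-}P+\text{micro-}R},\tag{11}$$

$$\text{micro-}P=\frac{\sum_{j=1}^{k}TP_j}{\sum_{j=1}^{k}\left(TP_j+FP_j\right)},\tag{12}$$

$$\text{micro-}R=\frac{\sum_{j=1}^{k}TP_j}{\sum_{j=1}^{k}\left(TP_j+FN_j\right)},\tag{13}$$

where TPj, FPj, and FNj are counted for the j-th label.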

4.2. Datasets

As shown in Table 1, seven benchmark multilabel datasets, yeast, enron, tmc2007, cal500, Corel16k, Corel5k, and mediamill, were selected as experimental datasets [31]. Multilabel classification performance is related not only to the number of labels but also to other characteristics of the dataset. To measure these characteristics, we use Dens, Card, and TCS [32]. Dens indicates the density of the labels, as shown in equation (14); the higher the value, the denser the labels. Card represents the average number of labels per instance, as shown in equation (15); the higher the value, the more labels an instance carries on average. TCS evaluates the complexity of a dataset, as shown in equation (16); a higher value indicates a more complex dataset on which it is harder for a classifier to predict the correct result. In equations (14)–(16), n is the number of instances, f is the number of input features, k is the number of labels, and ls is the number of distinct label sets.
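The three measures are restated below in the form given by [32]:

$$\mathrm{Dens}=\frac{1}{nk}\sum_{i=1}^{n}|L_i|,\tag{14}$$

$$\mathrm{Card}=\frac{1}{n}\sum_{i=1}^{n}|L_i|,\tag{15}$$

$$\mathrm{TCS}=\log\left(f\times k\times ls\right).\tag{16}$$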

4.3. Optimal Values of t and m

The parameters t and m in ML-DBR directly affect its performance, so it is important to explore appropriate values for them. t is the decoupling threshold: if t is too high, some instances are never decoupled; if t is too low, instances keep being decoupled even after label concurrency is already balanced. t should be lower than the SCUMBLE values of the datasets; the lowest SCUMBLE value among the datasets is 0.1, so t ≤ 0.1. m is the number of instances drawn in each resampling round: if m is too high, the frequency of certain instances may be increased excessively; if m is too low, instances with a greater impact on minority labels may be selected. In addition, m needs to be smaller than the minimum number of instances of any minority label. In ML-DBR, t was set to 0.1, 0.01, and 0.001 and m to 3, 5, and 7 for comparison. In traditional multilabel classification, the most common approach is to convert the multilabel problem into binary or multiclass problems, as BR and LP do. In this paper, LP, BR, and ML-kNN [33] were selected for classification, with C4.5 as the base classifier for BR and LP. All algorithm parameters were left at their defaults; the resampling rate in the experiments was set to 0.1, which is the best resampling rate reported for ML-ROS, and the number of neighbors for ML-kNN was set to 10. Tenfold cross-validation was used. Yeast, enron, tmc2007, cal500, and Corel16k were selected as the experimental datasets.

The experimental results for the m and t values are shown in Figures 1–5. m = 3 performs better across the datasets than m = 5 and m = 7; in terms of micro-F, m = 3 far exceeds the other two values. The main reason is that with m = 5 or m = 7, the frequency of some instances with high Min-SCUMBLEIns increases, and overfitting is more serious than with m = 3. Therefore, m = 3 is an appropriate value for ML-DBR. The experiments also show that t = 0.01 performs better than t = 0.001 and t = 0.1. When t = 0.001, the threshold is too low: almost all instances with SCUMBLEIns > SCUMBLE(D)1 are decoupled, and decoupling cannot terminate once the dataset is balanced. When t = 0.1, the threshold is too high, and decoupling terminates before the dataset is balanced. The figures show that t = 0.01 and m = 3 obtain the best results on most datasets and that the combination of ML-DBR and ML-kNN performs best, exceeding LP and BR on the different measures, which indicates that ML-kNN is the most suitable classifier for ML-DBR.

4.4. Experiment and Analysis

The proposed ML-DBR algorithm was compared with three algorithms: REMEDIAL-HwR-HUS, REMEDIAL-HwR-ROS [28], and the combination of REMEDIAL [27] and LP-ROS. REMEDIAL-HwR-HUS and REMEDIAL-HwR-ROS have achieved good results in previous experiments, especially on imbalanced datasets. The LP, BR, and ML-kNN classifiers were used with tenfold cross-validation. In ML-DBR, m was set to 3 and t to 0.01. The resampling rate of all algorithms was 0.1, and all other parameters were left at their defaults. Ten runs were performed on each dataset, and the results were averaged.

Tables 2–4 show the experimental results assessed with accuracy, macro-F, and micro-F, respectively; the best results are highlighted in bold. As shown in Table 2, ML-DBR achieves the best accuracy compared with the other algorithms. In Tables 3 and 4, ML-DBR also has the best macro-F and micro-F values. ML-DBR is far ahead of the other algorithms on the Corel16k, enron, Corel5k, and mediamill datasets, which indicates that the proposed algorithm obtains the best results when SCUMBLE and TCS are high. In addition, ML-DBR also has certain advantages on datasets with lower SCUMBLE and TCS, achieving the best results on the tmc2007 dataset. On the yeast dataset, ML-DBR did not beat REMEDIAL-HwR-ROS on some metrics; this happens because yeast has low SCUMBLE and MeanIR values, so there is no marked difference between the two algorithms when preprocessing it. On the cal500 dataset, the accuracy of ML-DBR is not significantly improved, but its macro-F and micro-F values are superior to those of the other algorithms, which indicates that the classification of minority labels has improved on cal500. In general, the combination of ML-DBR and the ML-kNN classifier achieves the best performance.

Table 5 shows the new SCUMBLE and MeanIR values of each dataset after ML-DBR was applied. Both values decreased compared with Table 1, which verifies the effectiveness of the decoupling and resampling strategies adopted by the proposed ML-DBR algorithm.

In summary, the experiments show that ML-DBR performs best among the compared multilabel resampling algorithms. It effectively balances multilabel imbalanced data at the data level, handles multilabel data with high concurrency between minority and majority labels, and significantly improves the classification performance on minority labels.

5. Conclusions

Multilabel data suffer from imbalance and from the high concurrency of majority and minority labels. This paper proposed the data-level ML-DBR algorithm. By decoupling instances in which majority and minority labels are highly concurrent and by measuring the influence on minority labels during resampling, label imbalance is reduced and the independence of instances is preserved. ML-DBR therefore has the following advantages: (1) its decoupling strategy is more effective and reasonable; (2) it combines undersampling and oversampling, which reduces both the redundancy of minority label information caused by oversampling and the loss of majority label information caused by undersampling, making the instance distribution more balanced and reducing the impact on minority labels during sampling; (3) the original distribution of the dataset is largely maintained. Experiments show that ML-DBR effectively improves classification performance, with outstanding results on datasets with high TCS values, large numbers of labels, and high label concurrency (high SCUMBLE values). Finding more suitable m and t values for different datasets is the focus of our future work.

Data Availability

The datasets used to support the findings of this study have been deposited in the Mulan repository (http://mulan.sourceforge.net/datasets-mlc.html).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China under Grant no. 61373057, Natural Science Foundation of Zhejiang Province under Grant no. LY20F020009, and Science and Technology Planning Project of Lishui City under Grant no. 2019RC05.