Multilabel classification is a key research topic in machine learning. In this study, we propose ATDCC-OS, a two-layer (double-layer) chain classification algorithm with optimal sequence based on the attention mechanism. By introducing an attention mechanism, the model identifies the key attributes for each classification task. To address the accuracy degradation caused by the order of the classifiers in the chain, we adopt the optimal sequence selection (OSS) algorithm to find the optimal sequence of labels. Test results on real datasets show that ATDCC-OS performs well on all evaluation metrics: its average precision is over 80%, its microaveraged AUC reaches 0.96, its coverage is below 10%, its overall one-error performance is the best among the compared methods, and its ranking loss is about 0.03. The ATDCC-OS algorithm is intended to improve the accuracy of multilabel classification so that more effective information can be obtained from data.

1. Introduction

Multilabel classification, a commonly used method in big data analysis, aims to associate multiple labels with a sample at the same time. The ubiquity of multilabel data in real-life scenarios makes multilabel classification a popular research topic. However, in real-life applications, the completeness of the labels is usually not guaranteed: because of poor data collection, the high cost of labeling, and other reasons, only part of the labels of each sample is annotated. There are also many ambiguous examples in the real world, in which an instance may be assigned to different classes with a certain probability. Many multilabel classification algorithms have therefore been developed, although it is usually very challenging to extend the theory of single-label classification to multilabel classification. With the development of machine learning, multilabel classification algorithms have been applied to imaging, recommendation systems, medical diagnosis, information retrieval, and many other fields [1–8]. In recent years, a large body of work accepted by top conferences (e.g., ACL, AAAI, COLING, KDD, NIPS, ICDM, CIKM, INTERSPEECH, ICML, and IJCAI) has proposed technologies and solutions for multilabel classification, which has made it a hot topic in data mining and attracted wide attention in the machine learning community.

There are two commonly used methods to construct a multilabel classification model: algorithm adaptation and problem transformation. The algorithm adaptation method adjusts existing algorithms (e.g., AdaBoost and decision trees) so that they can handle multilabel classification problems, but its performance often remains poor. The problem transformation method splits a multilabel classification task into several single-label classification tasks. Classical single-label classification theory is then used to solve each task, and the trained single-label classifiers are brought together as a super-classifier through linear combination. In this study, we investigate multilabel classification algorithms based on problem transformation.

There are many existing problem transformation methods, such as the BR method [9], the CC method [10], the MBR model [11], and the DLMC-OS algorithm [12]. However, these methods usually ignore the correlation among labels, the randomness of label sequences, or the redundancy of interactive label information, which reduces the accuracy of classification. The problem transformation method uses extended attributes to mine the correlation between labels, but the importance of individual feature attributes for different classification tasks is usually ignored in the process, which decreases the sensitivity of the classifiers. Therefore, we introduce an attention mechanism into these methods. The attention mechanism [13] is a bionic process modeled on how the human brain works. It is widely used in machine learning, in areas such as speech recognition, image recognition, and natural language processing. The attention mechanism usually calculates the probability mapping from an input to different outputs, and the result with the largest probability is chosen as the output, which makes it well suited to capturing the correlation between multiple attributes and labels. On this basis, we propose an attention mechanism-based multilabel classification algorithm with a double-layer chain structure.

In the proposed algorithm (two-layer, or double-layer, chain classification with optimal sequence based on the attention mechanism, ATDCC-OS), we integrate three multilabel classification frameworks (BR, MBR, and CC) and an attention mechanism into a chain structure with two layers. This structure exploits a binary relevance classification framework. Layer one carries out the initial classification. In layer two, a chain-based classifier uses an updating process to complete the final classification, interacting with the label information that comes from the output of layer one. In particular, we place an attention mechanism before layer two and use the output of layer one to calculate the probability of the final classification results. This identifies the important information among different attributes and improves the accuracy of the final classifier for different tasks. However, ATDCC-OS faces a random chain order problem, which we solve with the optimal sequence selection (OSS) algorithm. OSS integrates several techniques (the hierarchical traversal algorithm, PageRank, Kruskal’s algorithm, and mutual information) to decide the priority of the labels. The priority rank is then used by ATDCC-OS to assign the classifiers and construct the chain classification model.

In this study, the main contributions are as follows: (1) A double-layer multilabel classification model is constructed to integrate the advantages of three classical classification models. At the same time, an attention mechanism is introduced to analyze the influence of key attributes on the classification results and thereby optimize traditional classification. (2) The OSS algorithm is proposed to solve the problem of low classification accuracy caused by the random chain order in the chain classification model. It is applied to improve the second layer of the chained classification model. The classification model does not depend on any particular base classification algorithm. Experiments on benchmark datasets validate the effectiveness of the proposed approach by comparing its predictive performance with that of state-of-the-art methods.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the proposed ATDCC-OS method. Section 4 introduces the datasets used in the experiments, reports the simulations performed to verify the proposed method, and discusses the experimental results. Section 5 concludes our work.

2. Related Work

2.1. Multilabel Classification Method

Multilabel classification has received much attention and is widely used in various fields, including text classification, scene and video classification, and bioinformatics. It includes two common families of methods: problem transformation and algorithm adaptation. The former changes a multilabel problem into one or several single-label problems [11] and solves them with basic classification algorithms, such as naive Bayes, the support vector machine [14], and the k-nearest neighbor algorithm. The latter adapts existing algorithms so that they can solve the multilabel classification problem directly, e.g., the ML-RBF method [15, 16], the ML-kNN approach [17, 18], rank-SVM classification [9], and associative classification algorithms [19, 20].

BR (binary relevance) [9] is a common problem transformation method. It transforms the multilabel classification problem into several binary relevance problems and trains one binary classification model for each label. However, BR is often overlooked because it cannot effectively use the correlation between labels. The MBR model, which is based on BR, was proposed in [11] and is constructed as a two-layer model. The output of layer one in MBR is taken as a sample attribute in the input of layer two so that label correlation is considered. However, the redundancy of the label values is ignored in the training process of layer two.

The CC method was proposed in [10], where a chain is exploited to build the correlation among all labels. It arranges all classifiers into a randomly ordered linear chain, adds the output of the previous classifiers to the attribute set of the sample, and takes the result as the input of the next classifier. However, the random chain has several disadvantages. First, in the CC training process, the output of each classifier is fed into the next classifier as a new attribute together with the original attributes. The earlier classifiers in the chain therefore have a greater impact on classification than the later ones, and the order of the classifiers in the chain affects the classification result. Second, CC considers the correlation of attributes, but two linked classifiers can only use the correlation between adjacent attributes, and the other correlations between attributes cannot be used. Finally, the order of the classifiers in the chain is assigned randomly, so the CC model is not unique, which makes the model strongly random and harms the stability of the algorithm [21, 22].
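
The chaining idea described above can be illustrated with a minimal Python sketch. It is not the implementation of [10]: the function names `chain_training_sets` and `predict_chain` are hypothetical, `order` is an arbitrary label order supplied by the caller, and each classifier is assumed to be any callable that maps a feature list to 0 or 1.

```python
from typing import Callable, List, Sequence, Tuple

def chain_training_sets(
    X: Sequence[Sequence[float]], Y: Sequence[Sequence[int]], order: Sequence[int]
) -> List[Tuple[List[List[float]], List[int]]]:
    """Build one binary training set per chain position.

    The classifier at position k sees the original attributes plus the
    true values of the labels placed before it in the chain.
    """
    sets = []
    for pos, label in enumerate(order):
        previous = order[:pos]
        x_ext = [list(x) + [row[p] for p in previous] for x, row in zip(X, Y)]
        target = [row[label] for row in Y]
        sets.append((x_ext, target))
    return sets

def predict_chain(
    models: Sequence[Callable[[List[float]], int]], order: Sequence[int], x: Sequence[float]
) -> List[int]:
    """Predict along the chain, feeding each prediction to the next classifier."""
    history: List[int] = []
    for model in models:
        history.append(model(list(x) + history))
    result = [0] * len(order)
    for label, value in zip(order, history):
        result[label] = value  # put predictions back in label-index order
    return result
```

Because every classifier after the first receives earlier predictions as attributes, an early mistake is passed down the chain, which is why the order of the chain matters.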

The two-layer classification model DLMC-OS was proposed to solve the multilabel classification problem [12]. In this model, the output of the first-layer classifier is forwarded to the second-layer classifier as an extended feature. Each classifier in the second layer passes the latest classification results backward through a chain so that the correlation between labels is considered. This approach reduces the randomness of the classifier chain, but it cannot obtain a unique classifier sequence for the chain.

2.2. Attention Mechanism

The attention mechanism [23–25] is derived from the study of human vision and simulates the way human vision focuses on regions of interest when observing a scene. When the human eye scans a global image, the part of the information that assists judgment is tracked dynamically, and irrelevant information is ignored. By paying more attention to part of the information, this process effectively reduces the amount of information to be processed when the eyes recognize images. The modern attention mechanism was adopted for machine translation, where it greatly improved the performance of the model [26]. In 2014, Google Brain published an article on the attention mechanism [27]. The article pointed out that when viewing an image, people do not first look at every pixel but pay more attention to specific parts of the image according to their needs. In addition, humans decide where to focus attention next on the basis of previous observations of the image. Later, the Transformer architecture was proposed, in which self-attention mechanisms are used extensively to build text representations [28], breaking away from the traditional RNN/CNN structure. In recent years, Transformer-style models have achieved good results in various tasks. Attention mechanisms have since become more common and are widely used in classification tasks, such as sentiment classification [29], musical instrument recognition [30], visual recommender systems [31], multilabel text classification [32], and multiple protein subcellular location prediction [33].

3. The Proposed ATDCC-OS Method


3.1. Preliminaries

We set $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{y_1, y_2, \ldots, y_L\}$ as the input domain and output domain, respectively. There are d-dimensional attributes in the input domain and L labels in the output domain. An instance $x \in \mathcal{X}$ is described by its attribute values. We use an L-dimensional vector $Y = (y_1, y_2, \ldots, y_L)$ to represent the output for the input $x$. If the label $y_i$ is related to $x$, then $y_i = 1$; otherwise, $y_i = 0$. The set $D = \{(x_j, Y_j) \mid 1 \le j \le N\}$ represents the training data of the multilabel classification model, where $x_j$ is an attribute vector with d dimensions and $Y_j$ indicates the label set corresponding to $x_j$. To construct a multilabel classifier $h: \mathcal{X} \rightarrow \{0, 1\}^L$, we let $h^1$ and $h^2$ be the first and second layers of the multilabel classifier, respectively, and $\hat{Y}^1$ and $\hat{Y}^2$ be the outputs of the first and second layers.

3.2. The ATDCC Framework

Following the DLMC-OS algorithm, we construct a double-layer chain classification model based on the attention mechanism (ATDCC). ATDCC converts the multilabel classification problem into a series of binary classification problems, each of which is independent of the others. In layer one, ATDCC performs a binary transformation on the labels and constructs classifiers between the attributes and each label. After training, the classifier of each binary classification model is obtained [12]. ATDCC completes the binary classification of instances in layer one and then passes the classification results to layer two as extended attributes. In layer two, ATDCC constructs a chain-structured classification method with a dynamic feedback updating process. It uses the classifier chain to transfer and update the label values, which realizes the interaction among labels and optimizes the classification result. In this way, ATDCC exploits the correlation among all labels through label information interaction within a layer and label information transfer between layers.

3.2.1. ATDCCFirst-Layer

ATDCCFirst-layer follows the idea of the binary relevance classification model. It constructs one binary classifier for each label. These binary classifiers are combined to form the first-layer classifier, as shown in Figure 1.

In step one, assume there is an annotated dataset with L labels. ATDCCFirst-layer constructs a binary training set for each label $y_i$ by using the following equation:

$$D_i^1 = \{(x_j, y_{ji}) \mid 1 \le j \le N\}, \quad i = 1, 2, \ldots, L. \quad (1)$$

In step two, a binary learning algorithm B (such as SMO) is used to create the binary classifier of each training set:

$$h_i^1 \leftarrow B(D_i^1). \quad (2)$$

In step three, we use the obtained binary classifiers to classify and predict the unseen instance $x$.

Finally, the prediction result of each classifier (i.e., $\hat{y}_i^1 = h_i^1(x)$, as shown in Figure 1(b)) is the output of the unseen instance in the first layer of ATDCC. These outputs are integrated with the attribute set of the sample to build a new attribute set $x' = (x_1, \ldots, x_d, \hat{y}_1^1, \ldots, \hat{y}_L^1)$. Let $x'$ be the input of layer two in ATDCC.
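
The first layer can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this study: `nearest_centroid` is a toy stand-in for the binary learner B (the study uses SMO), and `train_first_layer` and `extend_attributes` are hypothetical names.

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float]], int]

def nearest_centroid(X: Sequence[Sequence[float]], y: Sequence[int]) -> Predictor:
    """Toy binary learner (stand-in for SMO): predict the class of the closer centroid."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    if not pos or not neg:  # only one class observed: always predict it
        constant = 1 if pos else 0
        return lambda x: constant

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    c_pos, c_neg = mean(pos), mean(neg)
    return lambda x: 1 if dist2(x, c_pos) <= dist2(x, c_neg) else 0

def train_first_layer(X, Y, learner=nearest_centroid) -> List[Predictor]:
    """Steps one and two: one binary training set and one classifier per label."""
    n_labels = len(Y[0])
    return [learner(X, [row[i] for row in Y]) for i in range(n_labels)]

def extend_attributes(models: Sequence[Predictor], x: Sequence[float]) -> List[float]:
    """Step three and the final step: append the L first-layer outputs to x."""
    return list(x) + [model(x) for model in models]
```

For a sample with d attributes and L labels, the extended vector has d + L entries, which is what layer two receives.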

3.2.2. ATDCCAT-Layer

The attention mechanism is usually exploited in sequence-to-sequence learning paradigms. For different multilabel classification tasks, the attribute mapping weights between the two layers of ATDCC are different. The attention mechanism can learn a weight for every attribute of a sample according to the requirements of the task, which improves the final accuracy of the classification results.

ATDCCAT-layer uses this attention mechanism to dynamically compute the weights of the extended attributes. By adjusting the weights of the attributes transferred between the two layers of ATDCC, the layer two model can adapt to the requirements of the current classification task.

In step one, a weight matrix W is defined according to the dimension of the sample attributes received from the first layer, and the tanh function is used to train ATDCCAT-layer to capture the correlations between the input attributes and label i. The trained model can be expressed as

$$u_i = \tanh(W x' + b), \quad (3)$$

where $x'$ is the extended attribute vector received from layer one, and W and b denote the weight matrix and the bias of the model, respectively.

In step two, ATDCCAT-layer uses a softmax function to transform the output of equation (3) into probability values and thus obtains the attention weights:

$$\alpha_{ik} = \frac{\exp(u_{ik})}{\sum_{k'} \exp(u_{ik'})}. \quad (4)$$

Finally, the extended attribute set is weighted by the attention weights obtained from equation (4):

$$\tilde{x}_i = \alpha_i \odot x'. \quad (5)$$

The parameters of the model are optimized by minimizing the loss function. The cross-entropy loss in equation (6) is used as the loss function. It accumulates the loss derived from the actual and predicted labels of each instance:

$$\mathcal{L} = -\sum_{i=1}^{L} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right], \quad (6)$$

where $y_i$ and $\hat{y}_i$ are the actual and predicted values of label i.
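
The attention layer described above (a tanh score, a softmax normalization, attribute weighting, and a cross-entropy loss) can be sketched in plain Python. The sketch is illustrative only: the function names are hypothetical, W is assumed to be a square matrix with one row per attribute so that one score is produced per attribute, and training of W and b is omitted.

```python
import math
from typing import List, Sequence, Tuple

def attention_scores(x: Sequence[float], W: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """u = tanh(W x + b): one score per attribute (assumes W has len(x) rows)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + bias) for row, bias in zip(W, b)]

def softmax(u: Sequence[float]) -> List[float]:
    """Turn scores into weights that are positive and sum to 1."""
    shift = max(u)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in u]
    total = sum(exps)
    return [e / total for e in exps]

def apply_attention(x, W, b) -> Tuple[List[float], List[float]]:
    """Return the attention weights and the attribute vector scaled by them."""
    alpha = softmax(attention_scores(x, W, b))
    return alpha, [a * v for a, v in zip(alpha, x)]

def binary_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float], eps: float = 1e-12) -> float:
    """Cross-entropy accumulated over the labels of one instance."""
    return -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1.0 - p, eps))
        for t, p in zip(y_true, y_prob)
    )
```

With W and b set to zero, every attribute receives the same weight, which gives a convenient sanity check before any training is done.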

3.2.3. ATDCCSecond-Layer

ATDCCSecond-layer is the second layer of the ATDCC model (Figure 2). It uses a chain classification structure and an updating process to classify instances a second time. To create the chain structure of classifiers, the attribute set of each binary model is extended with the classification results of the labels that precede it in the chain. The attribute set of every binary model is augmented with the 0/1 label estimates obtained in layer one as well as all prior binary relevance estimates from layer two. In this way, the correlations between the labels are fully exploited in the second layer. Given the attribute set, each classifier in the chain learns and predicts the binary relevance of its label.

In step one, ATDCCSecond-layer creates the extended attribute vector O for each class label according to equation (7), where W represents the set of attribute weights.

In step two, a binary learning algorithm B (such as SMO) is trained on the constructed extended attribute vectors O to create the binary classifier of each label.

In step three, the constructed binary classifiers are used to classify and predict the unseen instance x.

In the model training process, we use the latest predicted label value to update the label values in the attribute set of each sample. For example, for the third classifier in a chain, the attribute vector of the sample contains the latest second-layer predictions of the first two labels instead of their first-layer estimates.

Finally, ATDCC evaluates the classification prediction result of each classifier as the final classification for the unseen instance.
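
The updating process of the second layer can be sketched as follows. This is one plausible reading of the description above rather than the authors' code: `second_layer_predict` is a hypothetical name, `classifiers[i]` is assumed to be an already trained callable for label i, and `order` is the chain order.

```python
from typing import Callable, List, Sequence

def second_layer_predict(
    x_weighted: Sequence[float],
    first_out: Sequence[int],
    classifiers: Sequence[Callable[[List[float]], int]],
    order: Sequence[int],
) -> List[int]:
    """Walk the chain and overwrite layer-one estimates with layer-two predictions."""
    labels = list(first_out)  # start from the first-layer 0/1 estimates
    for label in order:
        features = list(x_weighted) + labels  # attributes plus latest label values
        labels[label] = classifiers[label](features)  # update before moving on
    return labels
```

A small usage example shows why the chain order matters: if the classifier of label 1 simply copies the current value of label 0, it produces a different result depending on whether label 0 has already been updated.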

3.3. OSS Method

In the MBR model, the sequence of the classifiers in the chain is randomly arranged. If the classification accuracy of a classifier near the head of the chain is very low, its error is propagated along the classifier chain and decreases the accuracy of the classifiers that follow it. This in turn lowers the classification accuracy of the whole chain. As the number of labels increases, the randomness of the classifier chain also increases rapidly. The DLMC-OS algorithm can reduce the randomness of the classifier chain, but the optimal label recognition sequence cannot be determined because the root node is not unique. The most effective remedy is to sort the chain: the classifiers need to be ranked according to the attributes and the characteristics of the chain classification model. For this reason, the following constraints are proposed to search for the optimal chain sequence:
(A) The label list is ordered according to a sequence that contains all label information
(B) The label sequence satisfies the greatest correlation of labels
(C) The label list sequence is optimal under current conditions

Under these design rules, we propose the OSS algorithm, which integrates mutual information and PageRank with Kruskal’s algorithm and the hierarchical traversal method to find an optimal label sequence. The chain classification model uses this sequence as the rule for assigning the order of the classifiers, so the second layer of ATDCC is optimized with the OSS algorithm.

3.3.1. Subalgorithm Related with OSS

(1) Mutual Information (MI) Theory. In information theory and probability theory, mutual information (MI) is used to evaluate the interdependence between two random variables, that is, the “amount of information” obtained about one random variable by observing the other. Equation (9) shows the MI of two variables. MI is widely used in research: in machine learning, it can be used to select features [34, 35]; search engines often use the MI between phrases and contexts to discover semantic clusters [36]; and in statistical mechanics, MI is used in connection with Loschmidt’s paradox [37, 38]. Based on these applications, we evaluate the correlation between labels by computing the MI among labels and then use it as the edge weight in a fully connected graph.

(2) PageRank. PageRank (PR), proposed in [39], addresses the page ranking problem in link analysis. Its key idea is that the importance of a page is related to the number and the quality of the other pages that point to it. The algorithm is applied in Google’s search engine [40]. The importance of a Webpage is quantified from the link structure rather than from specific search requests. Twitter uses a personalized PageRank to recommend other accounts to users [41]. In this study, we combine PageRank with priority search to build a customized PageRank algorithm that selects the most important label as the first node of the chain. This overcomes the nonuniqueness of the chain head.

(3) Edge Weight-Based Graph Algorithm. The connections between different entities can usually be formulated as a graph with edge weights [42]. The weight of an edge may represent cost, length, or capacity, depending on the problem to be solved [43–46]. In the model, we use this weighted graph method to create the fully connected graph of the labels in Algorithm 1.

(4) Kruskal’s Algorithm. Kruskal’s algorithm finds a minimum spanning tree [47]. We use it to find the label spanning tree with the largest total weight, which provides the basis for creating a sequence in which the association between labels is the largest. The designed algorithm is shown in Algorithm 2.

(5) Breadth-First Search Method. Breadth-first search (BFS) is an algorithm for traversing or searching tree or graph data structures. We use PageRank to find the starting point and use BFS to traverse the maximum-weight label spanning tree to construct the resulting label order, as shown in Algorithm 3.

Algorithm 1: Construct a weighted label graph from the label values.
Input: label values Y
Output: G
G ← {}
G.V ← the set of labels in Y
For each pair (u, v) in G.V
  Calculate the mutual information MI(u, v) according to Definition 1
  G.E(u, v) ← MI(u, v)
G ← G(V, E)
Return G
Algorithm 2: Construct the spanning tree of labels from the graph built by Algorithm 1.
Input: G(V, E)
Output: MWT
MWT ← {}
For each v in G.V:
  Make-Set(v)
For each (u, v) in G.E, ordered by weight(u, v) in increasing order:
  If Find-Set(u) ≠ Find-Set(v):
    MWT ← MWT ∪ {(u, v)}
    Union(u, v)
Return MWT
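Algorithm 3: Hierarchical traversal to get the label sequence.
Input: MWT(V, E), root node r selected by PageRank
Output: OS
Queue Q ← {}, OS ← {}
Mark r as visited and enqueue r in Q
While Q is not empty:
  u ← dequeue(Q)
  Append u to OS
  For each v ∈ MWT.V adjacent to u and not visited:
    Mark v as visited and enqueue v in Q
  end for
end while
Return OS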
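
Algorithms 2 and 3 can be sketched in Python as shown below. This is a hedged illustration, not the authors' Java code: the names `max_spanning_tree` and `bfs_order` are hypothetical, edges are given as a dictionary that maps a pair (u, v) to its weight, and the maximum-weight tree is obtained by sorting the edges in decreasing order, which is equivalent to running Kruskal's algorithm on inverted weights.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[int, int]

def max_spanning_tree(n: int, weights: Dict[Edge, float]) -> List[Edge]:
    """Kruskal's algorithm with union-find, taking the heaviest edges first."""
    parent = list(range(n))

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    tree: List[Edge] = []
    for (u, v), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

def bfs_order(n: int, tree: List[Edge], root: int) -> List[int]:
    """Level-by-level traversal of the tree, starting from the chosen root."""
    adjacent: Dict[int, List[int]] = {i: [] for i in range(n)}
    for u, v in tree:
        adjacent[u].append(v)
        adjacent[v].append(u)
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(adjacent[u]):  # fixed tie-breaking keeps the order unique
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order
```

Visiting neighbors in sorted order is a design choice of this sketch: it makes the resulting sequence deterministic for a given tree and root.
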
3.3.2. The OSS’s Detailed Design Framework

The detailed design steps for the OSS algorithm in this study are shown in Figure 3.

Step 1. Calculate the MI of the correlation between labels. Assuming that there are N labels $y_1, y_2, \ldots, y_N$, we use formula (9) to calculate the MI of any two labels $y_i$ and $y_j$. The MI is always nonnegative.

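Definition 1. The formula of MI calculation is

$$I(y_i; y_j) = \sum_{a \in \{0, 1\}} \sum_{b \in \{0, 1\}} p(y_i = a, y_j = b) \log \frac{p(y_i = a, y_j = b)}{p(y_i = a)\, p(y_j = b)}. \quad (9)$$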

Step 2. Construct a fully connected graph G from the labels, where the labels are the vertices of the graph and the MI between labels acts as the edge weights. Kruskal’s algorithm is used to build the label tree with the maximum weight. Because Kruskal’s algorithm finds a minimum spanning tree, the mutual information values are inverted so that the tree it returns is the maximum weight spanning tree.

Step 3. Use the PageRank algorithm to rank the labels in the dataset by “voting” and select the label node whose PR value is the highest. This node acts as the root node of the maximum weight tree and as the first node of the hierarchical traversal algorithm. This overcomes the problem that the head label of the chain is not unique.

Step 4. Use Kruskal’s algorithm to generate the maximum weight label tree (MWT) of the fully connected graph G. The label tree contains all label nodes and the edges that connect them, and the sum of its edge weights is the largest among all spanning trees.

Step 5. Traverse the MWT with BFS, starting from the label node selected by PageRank, to obtain the label sequence. Use this sequence as the guide for ordering the classifiers in the chain, which removes the uncertainty of the classification, as shown in Algorithm 4.

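Algorithm 4: Find the optimized chain order of the labels.
Input: label variables Y
Output: OS
For each label u in Y:
  For each label v in Y, v ≠ u:
    Calculate the mutual information MI(u, v) according to Definition 1
  End for
End for
Make a fully connected graph G (Algorithm 1)
Determine the root node by PageRank
Get the maximum weight label tree MWT (Algorithm 2)
Get the optimal sequence OS by hierarchical traversal (Algorithm 3)
Return OS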
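
Steps 1 and 3 of OSS can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: probabilities are estimated by empirical frequencies, the natural logarithm is used, PageRank is run by power iteration on the undirected MI-weighted graph with a damping factor of 0.85, and the customized priority search mentioned above is not modeled. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

def mutual_information(a: Sequence[int], b: Sequence[int]) -> float:
    """Empirical mutual information (in nats) between two label columns."""
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    mi = 0.0
    for (u, v), c in joint.items():
        p_uv = c / n
        mi += p_uv * math.log(p_uv / ((count_a[u] / n) * (count_b[v] / n)))
    return mi

def label_graph(Y: Sequence[Sequence[int]]) -> Dict[Tuple[int, int], float]:
    """Fully connected label graph: one MI-weighted edge per pair of labels."""
    columns = [[row[i] for row in Y] for i in range(len(Y[0]))]
    return {(u, v): mutual_information(columns[u], columns[v])
            for u, v in combinations(range(len(columns)), 2)}

def weighted_pagerank(weights: Dict[Tuple[int, int], float], n: int,
                      damping: float = 0.85, iters: int = 100) -> List[float]:
    """Power iteration; each node shares its rank in proportion to edge weight."""
    w = [[0.0] * n for _ in range(n)]
    for (u, v), value in weights.items():
        w[u][v] = w[v][u] = value
    strength = [sum(row) for row in w]
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n + damping * sum(
                    rank[u] * w[u][v] / strength[u] for u in range(n) if strength[u] > 0)
                for v in range(n)]
    return rank

def choose_root(Y: Sequence[Sequence[int]]) -> int:
    """Index of the label with the highest PageRank value (first one on ties)."""
    rank = weighted_pagerank(label_graph(Y), len(Y[0]))
    return max(range(len(rank)), key=rank.__getitem__)
```

In this sketch a label that is uncorrelated with all others has no weighted edges and keeps only the baseline rank, so it is unlikely to be selected as the head of the chain.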

3.4. The ATDCC-OS Framework

The ATDCC-OS design framework is shown in Figure 4, in which ATDCCFirst-layer and ATDCCSecond-layer are the first and second layers. We use the OSS approach to optimize the chain structure in the ATDCC-OS framework and thus obtain an optimal sequence of labels. Each classifier in the model is trained according to this optimal sequence. The attention mechanism layer between layer one and layer two finds the important attributes and features for the current task, so that a better classifier can be built in layer two, as shown in Algorithms 5 and 6.

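Algorithm 5: Training of ATDCC-OS.
Input: D is the training set, L is the number of labels
Train the first-layer classifier:
for i = 1 to L:
  Build the binary training set of label i and train the first-layer classifier of label i
end for
Use the OSS algorithm to obtain the label priority order to guide the training of the second-layer classifier
Compute attribute value weights using the attention mechanism
for each label i in the order given by OSS:
  Build the extended attribute vector of label i and train the second-layer classifier of label i
End for

Algorithm 6: Classify(x): classify a new instance x.
Classify x for the first time using the first-layer classifier:
for i = 1 to L:
  Predict label i with the first-layer classifier of label i
End for
Classify x for the second time using the second-layer classifier:
for each label i in the order given by OSS:
  Predict label i with the second-layer classifier of label i and update its value in the attribute set
End for
Get the final classification result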

4. Experiments

To validate the method, we perform simulations and use the experimental results to analyze the performance of the proposed algorithm. In the simulations, we compare the algorithm presented in this study (ATDCC-OS) with other multilabel classification algorithms (DLMC-OS, BR, CC, and MBR) on five metrics. Seven datasets are used as the multilabel benchmark.

4.1. Test Datasets

We use the standard datasets provided on the Mulan [48] platform as the multilabel benchmark. Table 1 describes each dataset and its statistics. N, F, and L represent the number of instances, the number of attributes of each instance, and the number of labels in the dataset, respectively. Label cardinality (LCard) is a standard measure of how multilabeled a dataset is [49]; it denotes the average number of labels associated with an instance.

4.2. Evaluation Methods

The evaluation indicators measure the performance of an algorithm. To evaluate the method, we use average precision, coverage, one-error, ranking loss, and microaveraged AUC to analyze the performance of ATDCC-OS. In the following, $\{(x_i, Y_i) \mid 1 \le i \le N\}$ denotes the test set, $Y_i$ is the set of labels relevant to $x_i$, $\bar{Y}_i$ is its complement in the label set, and $f(x, y)$ is the real-valued score that the classifier assigns to label $y$ for instance $x$.

(1) Average precision [12]: average precision combines recall and precision to evaluate a ranking of labels. It is the average fraction of relevant labels ranked at or above a particular relevant label. The larger the value of the average precision is, the better the classifier will be. The average precision can be expressed as

$$\mathrm{AvgPrec}(f) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{|\{y' \in Y_i \mid \mathrm{rank}_f(x_i, y') \le \mathrm{rank}_f(x_i, y)\}|}{\mathrm{rank}_f(x_i, y)},$$

where $\mathrm{rank}_f$ is a sort function that ranks the labels in descending order of $f$.

(2) Coverage [12]: coverage describes how far, on average, we need to go down the ranked label list to include all the labels relevant to an instance. It is loosely related to precision at the level of perfect recall. The smaller the value of coverage is, the better the algorithm will be. The coverage can be calculated as

$$\mathrm{Coverage}(f) = \frac{1}{N} \sum_{i=1}^{N} \max_{y \in Y_i} \mathrm{rank}_f(x_i, y) - 1,$$

where $\mathrm{rank}_f$ denotes the ranking function related to the classifier $f$.

(3) One-error metric [50]: the one-error metric indicates the proportion of examples whose top-ranked label does not belong to the relevant label set. The bigger this metric is, the worse the algorithm will be. The one-error metric can be expressed as

$$\mathrm{OneError}(f) = \frac{1}{N} \sum_{i=1}^{N} [\![ \arg\max_{y} f(x_i, y) \notin Y_i ]\!],$$

where $f$ stands for the scoring function associated with the multilabel classifier.

(4) Ranking loss metric [12]: the ranking loss metric measures the situations in which the labels of a sample are ranked in the wrong order, that is, a label that is not relevant to the instance is ranked above a relevant label. The smaller this indicator is, the better the algorithm performance will be. The ranking loss metric can be expressed as

$$\mathrm{RankLoss}(f) = \frac{1}{N} \sum_{i=1}^{N} \frac{|\{(y', y'') \in Y_i \times \bar{Y}_i \mid f(x_i, y') \le f(x_i, y'')\}|}{|Y_i|\,|\bar{Y}_i|}.$$

(5) Microaveraged AUC [50]: this metric is the area under the ROC curve. Its value is between 0 and 1, and it directly reflects the performance of the classifier. The smaller this metric is, the worse the algorithm will be. The microaveraged AUC metric is

$$\mathrm{AUC}_{\mathrm{micro}} = \frac{|\{(x', x'', y', y'') \mid f(x', y') \ge f(x'', y''),\ (x', y') \in S^{+},\ (x'', y'') \in S^{-}\}|}{|S^{+}|\,|S^{-}|},$$

where $f$ is a real-valued function [51] and the following equations can be obtained:

$$S^{+} = \{(x_i, y) \mid y \in Y_i,\ 1 \le i \le N\}, \quad S^{-} = \{(x_i, y) \mid y \notin Y_i,\ 1 \le i \le N\},$$

where they denote the sets of instance-label pairs that are related or unrelated.
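
The four ranking-based metrics can be computed with a short Python sketch. The definitions follow their common form in the multilabel literature, which may differ in normalization from the exact variants used in the experiments; the function names are hypothetical. `S` holds one list of label scores per instance, `Y` holds the matching 0/1 label vectors, and every instance is assumed to have at least one relevant and one irrelevant label.

```python
from typing import List, Sequence

def _ranks(scores: Sequence[float]) -> List[int]:
    """Rank 1 is given to the label with the highest score."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    ranks = [0] * len(scores)
    for position, k in enumerate(order, start=1):
        ranks[k] = position
    return ranks

def one_error(S, Y) -> float:
    """Fraction of instances whose top-ranked label is not relevant."""
    misses = sum(1 for s, y in zip(S, Y) if y[max(range(len(s)), key=s.__getitem__)] != 1)
    return misses / len(S)

def coverage(S, Y) -> float:
    """Average depth in the ranked list needed to cover all relevant labels."""
    total = 0
    for s, y in zip(S, Y):
        r = _ranks(s)
        total += max(r[k] for k, t in enumerate(y) if t == 1) - 1
    return total / len(S)

def ranking_loss(S, Y) -> float:
    """Average fraction of (relevant, irrelevant) pairs ranked in the wrong order."""
    total = 0.0
    for s, y in zip(S, Y):
        pos = [k for k, t in enumerate(y) if t == 1]
        neg = [k for k, t in enumerate(y) if t == 0]
        wrong = sum(1 for p in pos for q in neg if s[p] <= s[q])
        total += wrong / (len(pos) * len(neg))
    return total / len(S)

def average_precision(S, Y) -> float:
    """Average fraction of relevant labels ranked at or above each relevant label."""
    total = 0.0
    for s, y in zip(S, Y):
        r = _ranks(s)
        pos = [k for k, t in enumerate(y) if t == 1]
        total += sum(sum(1 for j in pos if r[j] <= r[k]) / r[k] for k in pos) / len(pos)
    return total / len(S)
```

Larger values are better for average precision, and smaller values are better for the other three metrics.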

4.3. Experimental Setting

We use the datasets provided by the Mulan platform to evaluate all algorithms. Mulan [48] is an open-source Java library for multilabel learning that is built on Weka. In this study, we use SMO as the base classification algorithm. Four other classifiers are used for comparison: the DLMC-OS algorithm, the MBR algorithm, the CC algorithm, and the BR algorithm. We select 80% of the instances of every dataset as the training set and use the rest as the testing set. We adopt Adam [52] as the optimizer during the training process with its default hyperparameters: alpha is 0.001, beta1 is 0.9, beta2 is 0.999, and epsilon is $10^{-8}$. Our simulation platform consists of an Intel(R) Xeon(R) E5-2630 CPU, 128 GB of RAM, and the CentOS 7.6 operating system. We design and implement the algorithms in the Java (JDK 1.8) running environment.
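
For reference, a single Adam update with the hyperparameters listed above can be sketched in Python. This shows the standard Adam rule on a plain list of parameters, not the training loop of this study; `adam_step` is a hypothetical name and `t` is the 1-based step counter.

```python
import math
from typing import List, Sequence, Tuple

def adam_step(
    theta: Sequence[float], grad: Sequence[float],
    m: Sequence[float], v: Sequence[float], t: int,
    alpha: float = 0.001, beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update; returns the new parameters and moment estimates."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]      # first moment
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]  # second moment
    m_hat = [mi / (1 - beta1 ** t) for mi in m]                       # bias correction
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - alpha * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

On the first step the bias-corrected update has a magnitude close to alpha regardless of the gradient scale, which is one reason the default step size works across differently scaled attributes.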

4.4. Results and Discussion

Figures 5–10 show the performance comparison among the ATDCC-OS, DLMC-OS, MBR, CC, and BR algorithms in terms of average precision, coverage, one-error, ranking loss, and microaveraged AUC. We use the average rank (Ave. rank) to summarize the classification results of the algorithms [53]. In these figures, each color represents an algorithm, and the names of the algorithms are listed in the upper left corner of the graph. The number on the top of each bar is the performance rank of the algorithm on that dataset. In Figures 5–9, the y-axis denotes the results of the evaluation, while the x-axis stands for the names of the datasets. In Figure 10, the x-axis denotes the name of the algorithm, while the y-axis shows the average rank of the algorithms over all datasets.
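
The average rank can be computed as in the following Python sketch. The function name `average_ranks` is hypothetical, ties are broken by dictionary order rather than averaged, and the numbers in the usage example are invented toy values, not results from this study.

```python
from typing import Dict

def average_ranks(scores: Dict[str, Dict[str, float]], higher_is_better: bool = True) -> Dict[str, float]:
    """scores maps dataset -> {algorithm: metric value}; returns the mean rank per algorithm."""
    totals: Dict[str, float] = {}
    for per_algorithm in scores.values():
        ordered = sorted(per_algorithm, key=per_algorithm.get, reverse=higher_is_better)
        for rank, name in enumerate(ordered, start=1):  # rank 1 is the best on this dataset
            totals[name] = totals.get(name, 0.0) + rank
    return {name: total / len(scores) for name, total in totals.items()}

# Toy values for illustration only (not taken from the experiments).
toy = {"d1": {"A": 0.9, "B": 0.8, "C": 0.5}, "d2": {"A": 0.7, "B": 0.75, "C": 0.6}}
toy_ranks = average_ranks(toy)
```

For metrics where smaller values are better, such as coverage, `higher_is_better=False` reverses the ordering.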

Figure 5 shows the accuracy of each algorithm on each dataset. The ATDCC-OS method proposed in this study has the best overall performance. Except on the yeast dataset, where its accuracy is the lowest, its accuracy is the highest among the compared methods on every dataset. On the flags, emotions, and medical datasets, its accuracy is over 80%.

Figure 6 compares the microaveraged AUC performance of the algorithms. ATDCC-OS is also the best and most stable algorithm in terms of microaveraged AUC. Its performance is the best on every dataset except the birds dataset, and it reaches 0.96 on the medical dataset.

Figure 7 compares the coverage performance of each algorithm. The lower the coverage, the better the performance of the algorithm. The coverage of the ATDCC-OS algorithm proposed in this study is the best on all datasets, and it is less than 10% on the flags, emotions, birds, medical, and yeast datasets.

The single error performance of each algorithm is shown in Figure 8. On this metric, the proposed ATDCC-OS algorithm is less stable than the other algorithms. Overall, however, its performance is still good: it is the best on the flags, birds, Enron, and BibTeX datasets, and on the emotions dataset it is second only to the MBR algorithm.
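Assuming that the single error metric corresponds to the standard one-error measure, it can be sketched as the fraction of instances whose top-ranked label is not a relevant label. The class name OneErrorSketch is hypothetical, and ties for the top score are resolved in favor of the first label, which is an arbitrary choice.

```java
/** Sketch of the one-error metric (hypothetical helper class). */
final class OneErrorSketch {
    /**
     * scores[i][l] is the predicted score of label l for instance i;
     * relevant[i][l] is true when label l truly applies to instance i.
     * Lower values are better.
     */
    static double oneError(double[][] scores, boolean[][] relevant) {
        int errors = 0;
        for (int i = 0; i < scores.length; i++) {
            int top = 0; // index of the highest-scoring label (first one wins on ties)
            for (int l = 1; l < scores[i].length; l++) {
                if (scores[i][l] > scores[i][top]) top = l;
            }
            if (!relevant[i][top]) errors++;
        }
        return (double) errors / scores.length;
    }
}
```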

From Figures 5–9, we can see that ATDCC-OS shows the best classification performance, DLMC-OS comes second, and the other methods perform worse. Among the evaluation metrics, the mean precision and microaveraged AUC metrics directly indicate the performance of the classifiers: the larger the value, the better the algorithm. According to Figures 5 and 6, the ATDCC-OS algorithm proposed in this study and the DLMC-OS method perform much better than the other algorithms. This is because they use a two-layer classification structure and label information interaction to build refined classifiers, a design that takes the interrelationship between labels into account. In addition, ATDCC-OS exploits the classical attention mechanism to improve the sensitivity of the classifiers and adapt them to a variety of tasks. Three indicators, namely, coverage, ranking loss, and one-error, are often used to identify irrelevant labels in the classification results. As shown in Figures 7–9, ATDCC-OS and the earlier DLMC-OS algorithm again perform better than the rest of the algorithms, the BR approach shows medium performance, and the MBR and CC approaches are the worst on these metrics. This is because ATDCC-OS and DLMC-OS use an optimization algorithm to determine the order in which the classifiers are trained, whereas the random label order in the CC and MBR methods directly leads to poorer performance. The BR method, by contrast, does not depend on the label order at all and therefore performs better than CC and MBR.

The loss (ranking loss) performance of each algorithm is shown in Figure 9. The ATDCC-OS algorithm performs best on this metric: on all datasets it outperforms the other algorithms, and on the medical dataset its loss is about 0.03.
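The ranking loss can be illustrated with the following sketch, which assumes the usual definition: for each instance, the fraction of (relevant, irrelevant) label pairs in which the irrelevant label scores at least as high as the relevant one, averaged over the instances that have both kinds of labels. The class name RankingLossSketch is hypothetical, and counting ties as errors is one common convention rather than a detail taken from the experiments.

```java
/** Sketch of the ranking loss metric (hypothetical helper class). */
final class RankingLossSketch {
    /**
     * scores[i][l] is the predicted score of label l for instance i;
     * relevant[i][l] is true when label l truly applies to instance i.
     * Lower values are better.
     */
    static double rankingLoss(double[][] scores, boolean[][] relevant) {
        double total = 0.0;
        int counted = 0; // instances with at least one relevant and one irrelevant label
        for (int i = 0; i < scores.length; i++) {
            int pairs = 0;
            int misordered = 0;
            for (int r = 0; r < scores[i].length; r++) {
                if (!relevant[i][r]) continue;
                for (int s = 0; s < scores[i].length; s++) {
                    if (relevant[i][s]) continue;
                    pairs++;
                    // A tie is counted as a misordered pair.
                    if (scores[i][r] <= scores[i][s]) misordered++;
                }
            }
            if (pairs > 0) {
                total += (double) misordered / pairs;
                counted++;
            }
        }
        return counted == 0 ? 0.0 : total / counted;
    }
}
```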

Figure 10 gives the overall performance ranking of the compared algorithms on the various metrics. The ATDCC-OS algorithm ranks best on all of them. The DLMC-OS algorithm is second only to ATDCC-OS overall, while the relative order of the remaining algorithms varies from metric to metric.

Figure 10 shows the mean rank of the five classifiers for mean accuracy, coverage, single error, ranking loss, and microaverage AUC.

From our simulations, we find that the ATDCC-OS algorithm outperforms the other algorithms on most datasets, while it performs relatively poorly on yeast and birds. It is well known that an algorithm cannot be expected to achieve the best performance on all types of test datasets [10]. Algorithm performance depends not only on the structure of the algorithm but also on the type and size of the dataset and on the balance of its labels.

Figures 11 and 12 plot average precision and ranking loss against the percentage of training data. These two figures illustrate how changing the percentage of training data affects performance. In this experiment, we take the emotions dataset as an example for both comparisons.
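One way to produce the training subsets for such a comparison is sketched below. It shuffles the instance indices with a fixed seed and takes the first fraction as training data; the use of the remaining instances for testing, the seed, and the class name TrainFractionSplit are assumptions, since the splitting procedure is not described in detail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of a seeded train/test split by fraction (hypothetical helper class). */
final class TrainFractionSplit {
    /**
     * Returns two index lists: element 0 holds the training indices and
     * element 1 holds the remaining (test) indices.
     */
    static List<List<Integer>> split(int nInstances, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < nInstances; i++) {
            indices.add(i);
        }
        // A fixed seed keeps the split reproducible across runs.
        Collections.shuffle(indices, new Random(seed));
        int nTrain = (int) Math.round(nInstances * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, nTrain)));
        parts.add(new ArrayList<>(indices.subList(nTrain, nInstances)));
        return parts;
    }
}
```

Calling split with trainFraction values from 0.1 upward yields the training sets for each point on the curves, and split(n, 0.8, seed) corresponds to the 80%/20% protocol used in the main experiments.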

Figure 11 shows how the average precision of the compared classifiers changes with the percentage of training data. From Figure 11, we observe that the average precision of the classifiers increases as the percentage of training data increases. When the percentage of training data is between 10% and 30%, the curves of all algorithms fluctuate. Above 30%, the average precision of ATDCC-OS and DLMC-OS rises steadily, while MBR rises steadily only from 40% and CC and BR only from 60%. Overall, as the training data increase, ATDCC-OS performs better than DLMC-OS, followed by MBR and BR, while CC is the worst.

Figure 12 shows the results of the comparison in terms of ranking loss. As the percentage of training data increases, the ranking loss of ATDCC-OS and DLMC-OS tends to decrease more steadily than that of MBR, CC, and BR. When the percentage of training data is between 10% and 40%, the ranking loss of each algorithm is unstable: MBR fluctuates the most, followed by CC and BR, while ATDCC-OS and DLMC-OS are more stable. When the percentage of training data exceeds 40%, the ranking loss curves of all algorithms show a downward trend, and ATDCC-OS still has the lowest loss.

5. Conclusion

In this study, we propose a simple and effective multilabel classification model (ATDCC-OS) that integrates three classic problem-transformation frameworks for multilabel classification. It combines the advantages of each method to overcome the problem that correlations among labels are not considered during classification. To further improve classification performance, the algorithm solves the problem of non-real-time label information interaction in the second-layer chained classification model by introducing the idea of “update replacement.” At the same time, the algorithm dynamically computes the weights of all feature attributes through an attention mechanism so that each classifier gives more weight to the attributes that matter most for its current classification target. This increases the sensitivity of the classifiers and greatly improves classification precision. Five metrics are used to evaluate the algorithms on seven datasets. The experimental results show that the proposed method achieves high predictive performance compared with state-of-the-art multilabel classification methods in most cases. In terms of average accuracy, the ATDCC-OS algorithm is the highest on nearly all datasets, and its accuracy on the flags, emotions, and medical datasets is more than 80%. In terms of microaverage AUC, the ATDCC-OS algorithm is the best on all datasets except birds, and it reaches 0.96 on the medical dataset. In terms of coverage, the ATDCC-OS algorithm is the best on all datasets, and its coverage is less than 10% on some datasets. In terms of single error, the algorithm has the best overall performance. In terms of loss, the algorithm achieves about 0.03 on the medical dataset.
Based on the above results, we conclude that the proposed ATDCC-OS algorithm performs best overall among the compared algorithms. These are only preliminary results. In the future, we will further optimize the algorithm to reduce the time complexity caused by the model structure, and we will also try to apply the algorithm to classification problems in everyday work and life. Finally, we hope that this work can provide some reference and assistance to researchers in the field of problem-transformation multilabel classification.

Data Availability

The data used and/or analyzed during the current study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This research was supported by the Key Research Project of Education Department of Sichuan Province of China (18ZA319).