Abstract
Multilabel classification is a key research topic in machine learning. In this study, we propose a two-layer (double-layer) chain classification algorithm with optimal sequence based on the attention mechanism (ATDCCOS). The algorithm is a classification model with a two-layer structure. By introducing an attention mechanism, it identifies the key attributes for each classification task. To counter the loss of accuracy caused by the order of classifiers in the chain, we adopt the optimal sequence selection (OSS) algorithm to find the optimal sequence of labels. Test results on real datasets show that ATDCCOS performs well on all evaluation metrics. Its average precision exceeds 80% on several datasets, its microaverage AUC reaches 0.96, its coverage is below 10%, its overall one-error performance is the best among the compared algorithms, and its ranking loss is about 0.03. ATDCCOS is intended to improve the accuracy of multilabel classification so that more useful information can be extracted from data.
1. Introduction
Multilabel classification, a commonly used method in big data analysis, aims to associate multiple labels with a sample at the same time. The ubiquity of multilabel data in real-life scenarios makes multilabel classification a popular research topic. However, in real-life applications, the completeness of the labels is usually not guaranteed. Because of poor data collection, the high cost of labeling, and other reasons, only some of the labels of a sample are marked. Real-world data also contain many ambiguous examples, in which an instance may be assigned to different labels with a certain probability. Many multilabel classification algorithms have been proposed in response, although extending the theory of single-label classification to multilabel classification is usually very challenging. With the development of machine learning, multilabel classification algorithms have been applied to imaging, recommendation systems, medical diagnosis, information retrieval, and many other fields [1–8]. In recent years, a large body of work accepted by top conferences (e.g., ACL, AAAI, COLING, KDD, NIPS, ICDM, CIKM, INTERSPEECH, ICML, and IJCAI) has proposed technologies and solutions for multilabel classification. Multilabel classification is thus an active topic in data mining that has attracted wide attention in the machine learning community.
There are two commonly used approaches to constructing a multilabel classification model: algorithm adaptation and problem transformation. Algorithm adaptation adjusts existing algorithms (e.g., AdaBoost and decision trees) to handle multilabel classification directly, but its performance often remains poor. Problem transformation splits a multilabel classification task into several single-label classification tasks, solves each with classical single-label classification theory, and then brings the trained single-label classifiers together as a superclassifier through linear combination. In this study, we investigate multilabel classification algorithms based on problem transformation.
There are many existing problem transformation methods, such as the BR method [9], the CC method [10], the MBR model [11], and the DLMCOS algorithm [12]. However, these methods usually ignore the correlation among labels, the randomness of label sequences, or the redundant interactive label information, which reduces classification accuracy. Problem transformation methods use extended attributes to mine the correlation between labels, but they usually ignore the fact that the importance of feature attributes differs from one classification task to another, which decreases the sensitivity of the classifiers. Therefore, we introduce an attention mechanism into these methods. The attention mechanism [13] is a bionic technique modeled on how the human brain works. It is widely used in machine learning in areas such as speech recognition, image recognition, and natural language processing. The attention mechanism usually calculates the probability mapping from an input to different outputs and chooses the result with the largest probability as the output, which helps capture the correlation between multiple attributes and labels. On this basis, we propose an attention mechanism-based multilabel classification algorithm with a double-layer chain structure.
In the proposed algorithm (two/double-layer chain classification with optimal sequence based on attention mechanism, ATDCCOS), we integrate three multilabel classification frameworks (BR, MBR, and CC) and an attention mechanism into a chain structure with two layers. This structure follows the binary relevance classification framework. Layer one carries out the initial classification. In layer two, a chain-based classifier uses an updating process to complete the final classification, interacting with the label information that comes from the output of layer one. In particular, we place an attention mechanism before layer two and use the output of layer one to calculate the probability of the final classification results. This identifies the important information among different attributes and improves the accuracy of the final classifier for different tasks. However, the chain in this structure still suffers from a random chain order problem. We use the optimal sequence selection (OSS) algorithm to solve this issue. OSS integrates several methods (the hierarchical traversal algorithm, PageRank, Kruskal’s algorithm, and mutual information) to decide the priority of the labels. The priority rank is then used by ATDCCOS to assign the classifiers and construct the chain classification model.
In this study, the main contributions are as follows: (1) A double-layer multilabel classification model is constructed to integrate the advantages of three classical classification models. At the same time, an attention mechanism is introduced to analyze the influence of key attributes on the classification results and thereby improve on traditional classification. (2) The OSS algorithm is proposed to solve the problem of low classification accuracy caused by the random chain order in the chain classification model. It is applied to improve the second layer of the chained classification model. The resulting classification model does not depend on any particular base classification algorithm. Experiments on benchmark datasets validate the effectiveness of the proposed approach by comparing its predictive performance with that of state-of-the-art methods.
The rest of this study is organized as follows: Section 2 reviews related work. Section 3 presents the proposed ATDCCOS method. Section 4 introduces the datasets used in the experiments, reports the simulations that verify the proposed method, and discusses the experimental results. We conclude our work in Section 5.
2. Related Work
2.1. Multilabel Classification Method
Multilabel classification has received much attention and is widely used in various fields, including text classification, scene and video classification, and bioinformatics. It includes two common approaches: problem transformation and algorithm adaptation. The former changes a multilabel problem into one or several single-label problems [11] and solves them with basic classification algorithms, such as naive Bayes, the support vector machine [14], and the k-nearest neighbor algorithm. The latter adapts existing algorithms so that they can solve the multilabel classification problem directly, e.g., the MLRBF method [15, 16], the MLkNN approach [17, 18], rankSVM [9], and associative classification algorithms [19, 20].
BR (binary relevance) [9] is a common problem transformation method. It transforms the multilabel classification problem into several binary classification problems by training one binary classification model for each label. However, BR is often dismissed because it cannot effectively use the correlation between labels. The MBR method, which builds on BR, was proposed in [11] as a two-layer model. The output of layer one in MBR is passed to layer two as additional sample attributes so that label correlation is considered. However, the problem of label value redundancy is ignored in the training process of layer two.
The CC method was proposed in [10], where a chain is used to build the correlation among all labels. It arranges all classifiers into a randomly ordered linear chain, adds the output of the previous classifiers to the attribute set of the data sample, and takes the result as the input to the next classifier. However, the random chain has several disadvantages. First, in the CC training process, the output of each classifier is passed to the next classifier as a new attribute together with the original attributes, so the earlier classifiers in the chain have a greater impact on classification than the later ones, and the order of classifiers in the chain affects the classification result. Second, CC does consider correlation, but two linked classifiers can only exploit the correlation between adjacent labels in the chain, and the remaining correlations go unused. Finally, the order of classifiers in the chain is randomly assigned, so the CC model is not unique, which gives the model strong randomness and undermines the stability of the algorithm [21, 22].
The two-layer classification model DLMCOS was proposed to address these problems [12]. In this model, the output of the first-level classifier is forwarded to the second-level classifier as an extended feature. Each classifier in the second layer of the model passes the latest classification results backward through a chain to consider the correlation between labels. This approach reduces the randomness of the classifier chain, but it cannot obtain a unique classifier sequence in the chain.
2.2. Attention Mechanism
The attention mechanism [23–25] is derived from the study of human vision and simulates how human visual attention is focused during observation. When the human eye scans a whole image, the parts that assist judgment are tracked dynamically, and irrelevant information is ignored. By paying more attention to part of the information, this process effectively reduces the amount of information to be processed when the eyes recognize images. The modern attention mechanism was adopted for machine translation, where it greatly improved model performance [26]. In 2014, Google Brain published an article on the attention mechanism [27]. The article pointed out that when viewing an image, people do not first look at every pixel but pay more attention to specific parts of the image according to their needs. In addition, humans decide where to focus attention next based on previous observations of the image. Later work introduced a new architecture named the transformer, in which self-attention mechanisms are used extensively to compute text representations [28], breaking away from the traditional RNN/CNN. In recent years, transformer-style models have achieved good results in various tasks. Attention mechanisms have since become more common and are widely used in classification tasks, such as sentiment classification [29], musical instrument recognition [30], visual recommender systems [31], multilabel text classification [32], and multiple protein subcellular location prediction [33].
3. ATDCCOS
3.1. Preliminaries
We set $\mathcal{X}$ and $\mathcal{Y}$ as the input domain and output domain, respectively. There are d-dimensional attributes in the input domain and L-dimensional labels in the output domain. An instance $x \in \mathcal{X}$ is an attribute vector $x = (x_1, \ldots, x_d)$. We use the L-vector $y = (y_1, \ldots, y_L)$ to represent the output for the input $x$. If the label $y_j$ is related to $x$, then $y_j = 1$; otherwise, $y_j = 0$. The set $D = \{(x_i, y_i) \mid 1 \le i \le N\}$ represents the training set of the multilabel classification model, where $x_i$ is an attribute vector with d dimensions and $y_i$ indicates the label set corresponding to $x_i$. To construct a multilabel classifier, we let $h: \mathcal{X} \to \mathcal{Y}$, with $h^{1}$ and $h^{2}$ as the first and second layers of the multilabel classifier, respectively. $\hat{y}^{1}$ and $\hat{y}^{2}$ are the outputs of the first and second layers.
3.2. The ATDCC Framework
Following the DLMCOS algorithm, we construct the double-layer chain classification model based on the attention mechanism (ATDCC). ATDCC converts the multilabel classification problem into a series of binary classification problems. In layer one, ATDCC performs a binary transformation on the labels and constructs one classifier between the attributes and each label; these binary problems are independent of one another. After training, the classifier of each binary classification model is obtained [12]. ATDCC completes the binary classification of instances in layer one and then transfers the classification results to layer two as extended attributes. In layer two, ATDCC constructs a chain-structured classification method with a dynamic feedback updating process. It uses the classifier chain to transfer and update the label estimates, which realizes the interaction among labels and optimizes the classification result. ATDCC thus exploits the correlation among all labels through label information interaction within layers and label information transfer between layers.
3.2.1. ATDCC^{FirstLayer}
ATDCC^{Firstlayer} follows the idea of the binary relevance classification model. It constructs one binary classifier for each label. These binary classifiers are combined to form the layer-one classifier, as shown in Figure 1.
[Figure 1: the first layer of ATDCC, panels (a) and (b).]
In step one, assume there is an annotated dataset with a size being L. ATDCC^{Firstlayer} constructs an attribute set for all labels by using the following equation:
In step two, some binary algorithms B (such as SMO) are utilized to create the binary classifier of the training instance: .
In step three, we use the obtained binary classifier to classify and predict the unseen instance .
Finally, the prediction result of each classifier (i.e., , as shown in Figure 1(b)) is the output of the unseen instance in the first layer of ATDCC, integrating these output with the attribute set of samples to build a new attribute set . Let be the input of layer two in ATDCC.
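The first-layer steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-nearest-neighbor learner is a toy stand-in for the binary algorithm B (the study uses SMO), and the names `nearest_neighbor_learner`, `train_first_layer`, and `extend_attributes` are hypothetical.

```python
def nearest_neighbor_learner(X, y):
    """Toy stand-in for the base algorithm B (assumption: the study uses SMO)."""
    def predict(x):
        # Squared Euclidean distance from x to every training instance.
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
        return y[dists.index(min(dists))]
    return predict


def train_first_layer(X, Y, learner=nearest_neighbor_learner):
    """Binary relevance: train one independent binary classifier per label."""
    n_labels = len(Y[0])
    classifiers = []
    for j in range(n_labels):
        y_j = [row[j] for row in Y]  # 0/1 targets for label j only
        classifiers.append(learner(X, y_j))
    return classifiers


def extend_attributes(x, classifiers):
    """Append the layer-one 0/1 predictions to the original attribute vector."""
    return list(x) + [clf(x) for clf in classifiers]


# Three training instances with two attributes and two labels each.
X = [[0.1, 1.0], [0.4, 0.2], [0.9, 0.7]]
Y = [[1, 0], [1, 1], [0, 0]]

layer_one = train_first_layer(X, Y)
extended = extend_attributes([0.5, 0.5], layer_one)
print(extended)  # original attributes followed by one prediction per label
```

The extended vector is what the sketch would hand to layer two; any binary learner with the same call shape could replace the toy one.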
3.2.2. ATDCC^{ATLayer}
The attention mechanism is usually used in sequence-to-sequence learning paradigms. For different multilabel classification tasks, the attribute mapping weights between the two layers of ATDCC are different. The attention mechanism can capture the weight of every attribute in the samples according to the requirements of the task, which can improve the final accuracy of the classification results.
ATDCC^{ATlayer} uses the attention mechanism mentioned above to dynamically compute the extended attributes’ weights. The layer two model can adapt to the requirement of the current classification task by adjusting the weight value of the transfer attributes between the two layers in ATDCC.
In step one, the weight matrix W is defined according to the dimension of the original sample attributes of the first layer, and the tanh function is used to train ATDCC^{ATlayer} to capture the correlations between the input attributes and label i. The trained model is expressed by equation (3), where W and b denote the weight matrix and the bias of the model, respectively.
In step two, ATDCC^{ATlayer} uses a softmax function to transform the output of equation (3) to a probability value and then obtains the weight value of the attention scores.
Finally, the extended attribute set is weighted based on the attention weights obtained from equation (4):
The parameters of our model are optimized by minimizing the loss function. The cross-entropy loss in equation (6) is used as the loss function. The following equation calculates the accumulated loss derived from the actual and predicted labels for each instance:
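The attention steps described above (tanh scoring, softmax normalization, attribute weighting, and a cross-entropy loss) can be sketched as follows. This is a minimal sketch and not the authors' exact formulation: the square shape of `W`, the one-score-per-attribute layout, and all function names are assumptions made for illustration.

```python
import math


def attention_weights(x, W, b):
    """Score each attribute with tanh(W x + b), then normalize with softmax."""
    scores = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
              for row, bi in zip(W, b)]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_attention(x, weights):
    """Weight every extended attribute by its attention weight."""
    return [w * xi for w, xi in zip(weights, x)]


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Accumulated binary cross-entropy over the labels of one instance."""
    loss = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


x = [1.0, 2.0, 0.0]                      # hypothetical extended attributes
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    # assumed identity weights
b = [0.0, 0.0, 0.0]

alpha = attention_weights(x, W, b)
weighted = apply_attention(x, alpha)
loss = cross_entropy([1, 0], [0.9, 0.2])
```

In a trained model, `W` and `b` would be learned by minimizing the loss; here they are fixed only to show the data flow.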
3.2.3. ATDCC^{SecondLayer}
ATDCC^{Secondlayer} is the second layer of the ATDCC model (Figure 2). It uses a chain classification structure and an updating process to classify instances a second time. The attribute set of each binary model is extended with the label estimates of the classifiers that precede it, which creates the chain structure of classifiers. Specifically, the attribute set of every binary model is augmented with the 0/1 label estimates obtained in layer one as well as all prior binary relevance estimates from layer two. In this way, the second layer makes full use of the correlations between labels. Given its attribute set, each classifier in the chain learns and predicts the binary relevance of its label.
[Figure 2: the second layer of ATDCC, panels (a) and (b).]
In step one, ATDCC^{Secondlayer} creates the extended attribute vector for each class label as shown in the following equation.where W represents the set of attributes’ weight value.
In step two, we use a binary approach B (such as SMO) to learn from the constructed extended attribute vector (O) and create the binary classifier.
In step three, we use the constructed binary classifier to classify and predict the unseen instance X.
In the model training process, we use the latest predicted label value to change each sample attribute set’s label value. For example, for the third classifier in a chain, the next sample’s attribute variable is instead of .
Finally, ATDCC evaluates the classification prediction result of each classifier as the final classification for the unseen instance.
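The updating process of the second layer can be illustrated with a short sketch. It assumes that each chain classifier is a callable on the extended vector (attributes followed by the latest label estimates); the three rule-based classifiers are hypothetical and exist only to show that the chain order changes the result.

```python
def second_layer_predict(x, layer_one_preds, chain, order):
    """Run the chain once, overwriting label estimates as they are refined.

    chain[j] is a callable on the extended vector for label j, and
    order lists the label indices in chain order.
    """
    current = list(layer_one_preds)       # start from the layer-one 0/1 estimates
    for j in order:
        extended = list(x) + current      # attributes + latest label estimates
        current[j] = chain[j](extended)   # replace the estimate for label j
    return current


# Hypothetical rule-based classifiers for three labels.
# Index 0 of the extended vector is the single attribute;
# indices 1..3 hold the latest estimates of labels 0..2.
chain = [
    lambda v: 1 if v[0] > 0.1 else 0,   # label 0 depends on the attribute
    lambda v: v[1],                     # label 1 copies the estimate of label 0
    lambda v: 1 - v[2],                 # label 2 negates the estimate of label 1
]

x = [0.2]
layer_one_preds = [1, 0, 0]

forward = second_layer_predict(x, layer_one_preds, chain, order=[0, 1, 2])
backward = second_layer_predict(x, layer_one_preds, chain, order=[2, 1, 0])
print(forward, backward)
```

The two calls differ only in the order of the chain, and they return different label vectors, which is the sensitivity to chain order that the OSS method is meant to remove.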
3.3. OSS Method
In the CC model, the sequence of the classifiers in the chain is randomly arranged. If the classification accuracy of a classifier early in the chain is very low, its error is propagated along the chain to the subsequent classifiers, which lowers the classification accuracy of the whole chain. As the number of labels increases, the randomness of the classifier chain also increases rapidly. The DLMCOS algorithm can reduce the randomness of the classifier chain, but it cannot determine the optimal label recognition sequence because the root node is not unique. The most effective remedy is to sort the chain: the classifiers need to be ranked according to the attributes and the characteristics of the chain classification model. For this reason, the following constraints are proposed to search for the optimal chain sequence:
(A) The label list is ordered according to a sequence that contains all label information
(B) The label sequence satisfies the greatest correlation of labels
(C) The label list sequence is optimal under current conditions
Under these design rules, we propose OSS in the model, which integrates mutual information and PageRank with the Kruskal algorithm and the hierarchical traversal method to find an optimal label sequence. The chain classification model uses sequences as the rules to assign the order of each classifier, and the second layer will optimize the ATDCC with the OSS algorithm.
3.3.1. Subalgorithm Related with OSS
(1) Mutual Information (MI) Theory. In information theory and probability theory, mutual information (MI) measures the interdependence between two random variables, that is, the amount of information obtained about one random variable by observing the other. Equation (9) gives the MI of two variables. MI is widely used in current research. In machine learning, MI can be used for feature selection [34, 35]. Search engines often use the MI between phrases and contexts to discover semantic clusters [36]. In statistical mechanics, Loschmidt’s paradox can be expressed in terms of MI [37, 38]. Following these applications, we evaluate the correlation between labels by computing the MI among labels and use it as the edge weight in the fully connected label graph.
(2) PageRank. PageRank (PR), proposed in [39], addresses the page ranking problem in link analysis. Its key idea is that the importance of a page depends on the number and quality of the other pages that point to it. The algorithm is applied in Google’s search engine [40]. The importance of a web page is quantified from the link structure rather than from specific search requests. Twitter uses a personalized PageRank to recommend other accounts to users [41]. In this study, we combine PageRank with priority search to build a customized PageRank algorithm that selects the most important label as the first node of the chain. This overcomes the nonuniqueness of the chain head.
(3) Edge Weight-Based Graph Algorithm. The connections between different entities can usually be formulated as a graph with edge weights [42]. The weight of an edge may represent cost, length, or capacity, depending on the problem to be solved [43–46]. In the model, we use this weighted graph method to create the fully connected graph over the labels in Algorithm 1.
(4) Kruskal’s Algorithm. Kruskal’s algorithm finds a minimum spanning tree [47]. We adapt it to find the maximum spanning tree over the labels, which provides the basis for a sequence in which the association among labels is the largest. The designed algorithm is shown in Algorithm 2.
(5) Breadth-First Search Method. Breadth-first search (BFS) is an algorithm for traversing or searching tree or graph data structures. We use PageRank to find the starting point and then use BFS to traverse the maximum spanning tree of the labels to construct the resulting label order, as shown in Algorithm 3.



3.3.2. The OSS’s Detailed Design Framework
The detailed design steps for the OSS algorithm in this study are shown in Figure 3.
Step 1. Calculate the MI that measures the correlation between labels. Assuming that there are N labels y_{1}, …, y_{N}, we use formula (9) to calculate the MI of any two labels y_{i} and y_{j}. The MI is always nonnegative.
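Definition 1. The MI of two labels y_{i} and y_{j} is calculated as
$$I(y_i; y_j) = \sum_{a \in \{0,1\}} \sum_{b \in \{0,1\}} p(y_i = a, y_j = b) \log \frac{p(y_i = a, y_j = b)}{p(y_i = a)\, p(y_j = b)}. \quad (9)$$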
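For two binary label columns, the MI can be estimated from co-occurrence counts. The following is a minimal sketch that uses empirical frequencies and the natural logarithm; the function name `mutual_information` and the toy label columns are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two label columns."""
    n = len(a)
    joint = Counter(zip(a, b))   # counts of each (a, b) value pair
    count_a = Counter(a)
    count_b = Counter(b)
    mi = 0.0
    for (u, v), c in joint.items():
        p_uv = c / n
        p_u = count_a[u] / n
        p_v = count_b[v] / n
        mi += p_uv * math.log(p_uv / (p_u * p_v))
    return mi


# Two identical label columns share all of their information.
same = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
# Two statistically independent columns share none.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Pairs that never co-occur contribute nothing to the sum, so only observed value pairs are visited.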
Step 2. Construct a fully connected graph G over the labels, where the labels are the vertices of the graph and the MI between labels acts as the edge weights. Because Kruskal’s algorithm finds a minimum spanning tree, invert the mutual information values so that the algorithm can be used to obtain the maximum weight spanning tree of the labels.
Step 3. Use the PageRank algorithm to rank each label in the dataset by “voting” and select the label node whose PR value is the highest. This node acts as the root node of the maximum weight tree and as the first node of the hierarchical traversal algorithm. This overcomes the problem that the head label of the chain is not unique.
Step 4. Use Kruskal’s algorithm to generate the maximum weight label tree (MWT) of the fully connected graph G. The label tree contains all labels and the edges that connect the label nodes, and the sum of its edge weights is the largest.
Step 5. Traverse the MWT by BFS, starting from the root label selected by PageRank, to obtain the label sequence. Use this sequence as a guide for ordering the classifiers in the chain, which removes the uncertainty of the classification order, as shown in Algorithm 4.
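Steps 2 to 5 can be sketched end to end as follows. This is an illustrative sketch under several assumptions, not the authors' Algorithms 1 to 4: a plain weighted PageRank stands in for the customized PageRank, BFS visits neighbors in descending edge weight, and the MI matrix `mi` is made up for the example.

```python
from collections import deque


def max_spanning_tree(weights):
    """Kruskal's algorithm on negated weights yields a maximum spanning tree."""
    n = len(weights)
    edges = sorted((-weights[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    tree = {i: [] for i in range(n)}
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # the edge joins two components
            parent[ri] = rj
            tree[i].append(j)
            tree[j].append(i)
    return tree


def pagerank(weights, damping=0.85, iters=100):
    """Weighted PageRank by power iteration (assumed variant)."""
    n = len(weights)
    out = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - damping) / n + damping * sum(
                    rank[u] * weights[u][v] / out[u]
                    for u in range(n) if out[u] > 0)
                for v in range(n)]
    return rank


def optimal_sequence(weights):
    """Root = highest PageRank label; order = BFS over the maximum spanning tree."""
    tree = max_spanning_tree(weights)
    rank = pagerank(weights)
    root = max(range(len(weights)), key=lambda v: rank[v])
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        # Assumption: visit more strongly correlated neighbors first.
        for v in sorted(tree[u], key=lambda v: -weights[u][v]):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order


# Hypothetical symmetric MI matrix for four labels (zero diagonal).
mi = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.8, 0.2],
    [0.1, 0.8, 0.0, 0.7],
    [0.1, 0.2, 0.7, 0.0],
]
order = optimal_sequence(mi)
print(order)
```

Because the root is fixed by PageRank and the tree by Kruskal's algorithm, the same MI matrix always produces the same label sequence, which is the property the chain needs.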

3.4. The ATDCCOS Framework
The ATDCCOS design framework is shown in Figure 4, where ATDCC^{Firstlayer} and ATDCC^{Secondlayer} are the first and second layers. We use the OSS approach to optimize the chain structure in the ATDCCOS framework and obtain an optimal sequence of labels. We then train each classifier in our model according to this optimal sequence. The attention mechanism layer between layer one and layer two identifies the important attributes and features for the current task, which allows a better classifier to be built in layer two, as shown in Algorithms 5 and 6.


4. Experiments
To validate the method, we perform simulations and use the experimental results to analyze the performance of the proposed algorithm. In the simulations, we compare the algorithm presented in this study (ATDCCOS) with other multilabel classification algorithms (DLMCOS, BR, CC, and MBR) on five metrics, using seven datasets as the multilabel benchmark.
4.1. Test Datasets
We use the standard datasets provided on the Mulan [48] platform as the multilabel benchmark. Table 1 describes each dataset and its statistics. N, F, and L represent the number of instances, the number of attributes in each instance, and the number of labels in the dataset, respectively. Label cardinality (LCard) is a standard measure described in [49]; it denotes the average number of labels associated with an instance.
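As a small worked example of LCard, the average number of labels per instance can be computed directly from a 0/1 label matrix. The function name and the toy matrix below are illustrative assumptions.

```python
def label_cardinality(Y):
    """Average number of relevant labels per instance in a 0/1 label matrix."""
    return sum(sum(row) for row in Y) / len(Y)


# Three instances with 2, 1, and 3 relevant labels, respectively.
Y = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
lcard = label_cardinality(Y)
print(lcard)  # (2 + 1 + 3) / 3
```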
4.2. Evaluation Methods
An evaluation indicator is a measure that directly reflects the performance of an algorithm. To evaluate the method, we use average precision, coverage, one-error, ranking loss, and microaverage AUC to analyze the performance of ATDCCOS. In the standard definitions below, $\{(x_i, Y_i)\}_{i=1}^{p}$ is the test set, $Y_i$ is the set of labels relevant to $x_i$, $\bar{Y}_i$ is its complement, $f(x, y)$ is the real-valued score that the classifier assigns to label $y$ for instance $x$, and $\mathrm{rank}_f(x, y)$ is the rank of $y$ when the labels are sorted by descending score.
(1) Average precision [12]: average precision combines recall and precision to assess a ranking. It evaluates the average fraction of relevant labels ranked at or above a particular relevant label. The larger the value of the average precision is, the better the classifier will be. The average precision can be expressed as
$$\mathrm{avgprec}(f) = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{\left|\{y' \in Y_i : \mathrm{rank}_f(x_i, y') \le \mathrm{rank}_f(x_i, y)\}\right|}{\mathrm{rank}_f(x_i, y)},$$
where $\mathrm{rank}_f$ is the ranking function.
(2) Coverage [12]: coverage describes how far, on average, we need to go down the ranked label list to include all the labels related to an instance. It is loosely related to precision at the level of perfect recall. The smaller the value of coverage is, the better the algorithm will be. The coverage can be calculated as
$$\mathrm{coverage}(f) = \frac{1}{p} \sum_{i=1}^{p} \max_{y \in Y_i} \mathrm{rank}_f(x_i, y) - 1,$$
where $\mathrm{rank}_f$ denotes the ranking function related to the classifier $f$.
(3) One-error metric [50]: the one-error metric indicates the proportion of examples whose top-ranked label does not fall into the relevant label set. The bigger this metric is, the worse the algorithm will be. The one-error metric can be expressed as
$$\text{one-error}(f) = \frac{1}{p} \sum_{i=1}^{p} \left[\!\left[ \arg\max_{y} f(x_i, y) \notin Y_i \right]\!\right],$$
where $f$ stands for the scoring function associated with the multilabel classifier.
(4) Ranking loss metric [12]: the ranking loss metric measures the fraction of label pairs that are ordered incorrectly, that is, pairs in which a label that is not related to the instance is ranked above a related label. The smaller this indicator is, the better the algorithm performance will be. The ranking loss metric can be expressed as
$$\mathrm{rloss}(f) = \frac{1}{p} \sum_{i=1}^{p} \frac{\left|\{(y', y'') \in Y_i \times \bar{Y}_i : f(x_i, y') \le f(x_i, y'')\}\right|}{|Y_i|\,|\bar{Y}_i|}.$$
(5) Microaveraged AUC [50]: this metric is the area under the ROC curve, and its value ranges from 0 to 1. It directly reflects the performance of the classifier. The larger this metric is, the better the algorithm will be. The microaveraged AUC metric is
$$\mathrm{AUC}_{\mathrm{micro}} = \frac{\left|\{(x', x'', y', y'') : f(x', y') \ge f(x'', y''),\ (x', y') \in S^{+},\ (x'', y'') \in S^{-}\}\right|}{|S^{+}|\,|S^{-}|},$$
where $f$ is a real-valued function [51] and
$$S^{+} = \{(x_i, y) : y \in Y_i,\ 1 \le i \le p\}, \qquad S^{-} = \{(x_i, y) : y \notin Y_i,\ 1 \le i \le p\}$$
denote the sets of related and unrelated instance-label pairs, respectively.
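The four ranking-based metrics can be checked on a single instance with a short sketch. It follows the standard definitions from the multilabel literature, which may differ in detail from the formulas used in the study; the function names, the scores, and the relevant set are illustrative assumptions. Dataset-level values are the averages of these per-instance values.

```python
def ranks(scores):
    """Rank labels by descending score; the top-scored label has rank 1."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    r = [0] * len(scores)
    for pos, j in enumerate(order, start=1):
        r[j] = pos
    return r


def one_error(scores, relevant):
    """1 if the top-ranked label is not relevant, else 0."""
    top = max(range(len(scores)), key=lambda j: scores[j])
    return 0.0 if top in relevant else 1.0


def coverage(scores, relevant):
    """Steps down the ranked list needed to cover every relevant label."""
    r = ranks(scores)
    return max(r[j] for j in relevant) - 1


def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) pairs that are ordered incorrectly."""
    irrelevant = [j for j in range(len(scores)) if j not in relevant]
    bad = sum(1 for j in relevant for k in irrelevant
              if scores[j] <= scores[k])
    return bad / (len(relevant) * len(irrelevant))


def average_precision(scores, relevant):
    """Mean, over relevant labels, of the precision at that label's rank."""
    r = ranks(scores)
    total = 0.0
    for j in relevant:
        at_or_above = sum(1 for k in relevant if r[k] <= r[j])
        total += at_or_above / r[j]
    return total / len(relevant)


scores = [0.9, 0.2, 0.6, 0.4]   # hypothetical classifier scores for 4 labels
relevant = {0, 3}               # labels 0 and 3 are the true labels

oe = one_error(scores, relevant)
cov = coverage(scores, relevant)
rl = ranking_loss(scores, relevant)
ap = average_precision(scores, relevant)
```

In this example the only misordered pair is label 3 (relevant, score 0.4) below label 2 (irrelevant, score 0.6), which accounts for the nonzero ranking loss.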
4.3. Experimental Setting
We use the datasets provided by the Mulan platform to evaluate all algorithms. Mulan [48] is an open-source Java library for multilabel learning, which is based on Weka. In this study, we use SMO as the base classification algorithm. Four baseline algorithms are used for comparison: DLMCOS, MBR, CC, and BR. We select 80% of the instances of every dataset as the training set and use the rest as the test set. We adopt Adam [52] as the optimizer during training with its default hyperparameters: alpha is 0.001, beta1 is 0.9, beta2 is 0.999, and epsilon is 10^{−8}. Our simulation platform consists of an Intel(R) Xeon(R) E5-2630 CPU, 128 GB of RAM, and the CentOS 7.6 operating system. We design and implement the algorithms in the Java (JDK 1.8) running environment.
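To make the optimizer settings concrete, a single-parameter Adam update with the hyperparameters listed above can be sketched as follows. The quadratic objective and the function name `adam_step` are assumptions for illustration only; the study's implementation is in Java.

```python
import math


def adam_step(param, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at time step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
w_first, _, _ = adam_step(w, 2 * (w - 3), m, v, t=1)

for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)
```

With bias correction, the first step has magnitude close to alpha regardless of the gradient scale, which is why `w_first` is approximately 0.001 here.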
4.4. Results and Discussion
Figures 5–10 show the performance comparison among ATDCCOS, DLMC, MBR, CC, and BR algorithms, using mean accuracy, coverage metric, single error metric, ranking loss metric, and microaverage AUC metric. We use the metric of the mean ranking (Ave. rank) parameter to review different classification results of the algorithms [53]. In these figures, each color represents an algorithm and the name of the algorithm has been listed in the upper left corner of the graph. The number on the top of each bar is the performance rank of the algorithms in the dataset. In Figures 5–9, the ordinate yaxis denotes the results of the evaluation, while the abscissa xaxis stands for the names of the dataset. In Figure 10, xaxis denotes the name of the algorithm, while yaxis shows the average rank of algorithms in all datasets.
Figure 5 shows the accuracy of each algorithm in each dataset. The ATDCCOS method proposed in the study has the best performance in the dataset. Compared with other methods, except for the lowest accuracy in the yeast dataset, the accuracy in other datasets is the highest. Among them, the accuracy in the datasets of flags, emotions, and the medical dataset is over 80%.
In Figure 6, we can see the comparison of the microaverage AUC performance of the algorithms. The ATDCCOS algorithm is also the most excellent and stable in terms of microaverage AUC performance. The performance of this algorithm is the best except for that in the birds dataset, and the performance in the medical dataset is 0.96.
Figure 7 shows the comparison of the coverage performance of each algorithm. The lower the coverage, the better the performance of the algorithm. The coverage performance of the ATDCCOS algorithm proposed in the study is optimal in all datasets, and its coverage performance is less than 10% in flags, emotions, birds, medical datasets, and yeast datasets.
The single error performance of each algorithm is shown in Figure 8. The performance of the proposed ATDCCOS algorithm in this graph is relatively unstable compared with other algorithms. However, from a comprehensive perspective, the performance of this algorithm is still good, and the performance in the flags, birds, Enron, and BibTeX datasets is the best. In the emotion dataset, the performance of this algorithm is second only to that of the MBR algorithm.
From Figures 5–9, we can see that ATDCCOS shows the optimal classification performance, while algorithm DLMCOS presents better performance. However, other methods indicate worse performance. For all reviewing metrics, the mean precision metric and microaveraged AUC metric directly indicate the performance of the classifiers. The larger the values, the better the performance of the algorithms. According to Figures 5 and 6, we can see that the algorithm ATDCCOS proposed in this study and the method DLMCOS demonstrate much better performance compared with other algorithms. This is because they utilize a twolayer classification structure and the label information interaction to create detailed classifiers. This design structure takes into consideration the interrelationship between labels. At the same time, the algorithm ATDCCOS also exploits the classical attention mechanism theory to improve the sensibility of classifiers and adapt them to a variety of tasks. Three indicators, namely, coverage, ranking loss, and the oneerror metric are often exploited to decide and find irrelevant labels in classification results. As shown in Figures 7–9, we find that the algorithm ATDCCOS and the previous algorithm DLMCOS also demonstrate better performance compared with the rest of the algorithms, while the BR approach presents a medium performance. The MBR method and the CC approach are the worst in this metric. This is because the algorithm ATDCCOS and the previous approach DLMCOS utilize optimization algorithms to train all classifiers in order. The randomness of serials in the CC method and the MBR approach directly leads to poorer performance. On the contrary, the BR method does not take into account the sequence of the labels, while it shows better performance.
The ranking loss of each algorithm is shown in Figure 9. ATDCCOS achieves the best ranking loss among the compared algorithms: on all datasets it is clearly better than the others, and on the medical dataset its ranking loss is about 0.03.
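As a reference for how such a value is obtained, the following is a minimal sketch of ranking loss under its usual definition: for each sample, the fraction of (relevant, irrelevant) label pairs in which the irrelevant label is scored at least as high as the relevant one, averaged over samples. Counting ties as errors, skipping samples without both kinds of label, and all names and toy values are assumptions of this sketch.

```python
def ranking_loss(scores, relevant):
    """Average fraction of misordered (relevant, irrelevant) label pairs."""
    total = 0.0
    counted = 0
    for sample_scores, true_labels in zip(scores, relevant):
        rel = [s for j, s in enumerate(sample_scores) if j in true_labels]
        irr = [s for j, s in enumerate(sample_scores) if j not in true_labels]
        if not rel or not irr:
            continue  # no pairs to compare for this sample
        # A pair is misordered when the irrelevant label scores >= the relevant one.
        bad = sum(1 for r in rel for i in irr if i >= r)
        total += bad / (len(rel) * len(irr))
        counted += 1
    return total / counted


# Toy data, invented for illustration only.
scores = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.5, 0.6, 0.1]]
relevant = [{0, 2}, {0}, {0, 1}]
print(ranking_loss(scores, relevant))  # only the second sample is misordered
```

A ranking loss of about 0.03 therefore means that, on average, roughly 3% of the relevant/irrelevant label pairs of a sample are ordered incorrectly.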
Figure 10 gives the overall performance ranking of the compared algorithms on each metric. ATDCCOS ranks best on all metrics, DLMCOS ranks second, and the order of the remaining algorithms varies from metric to metric.
Figure 10 shows the mean rank of the five classifiers on average accuracy, coverage, one-error, ranking loss, and micro-averaged AUC.
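A mean rank of this kind can be computed as sketched below: on each dataset the algorithms are ranked by their metric value (rank 1 is best), and the ranks are then averaged over datasets. Giving tied algorithms the average of their ranks is an assumption of this sketch, and the algorithm names, dataset names, and numbers are placeholders rather than results from this study.

```python
def mean_ranks(results, higher_is_better=True):
    """results: {dataset: {algorithm: score}} -> {algorithm: mean rank}.

    Every algorithm is assumed to have a score on every dataset.
    """
    totals = {}
    for per_algo in results.values():
        ordered = sorted(per_algo, key=per_algo.get, reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            # Group algorithms with identical scores and share the average rank.
            j = i
            while j + 1 < len(ordered) and per_algo[ordered[j + 1]] == per_algo[ordered[i]]:
                j += 1
            avg_rank = ((i + 1) + (j + 1)) / 2
            for name in ordered[i:j + 1]:
                totals[name] = totals.get(name, 0.0) + avg_rank
            i = j + 1
    return {name: total / len(results) for name, total in totals.items()}


# Placeholder scores, invented for illustration only.
results = {
    "d1": {"A": 0.8, "B": 0.7, "C": 0.6},
    "d2": {"A": 0.5, "B": 0.9, "C": 0.5},
}
print(mean_ranks(results))
```

For metrics where smaller values are better, such as coverage, one-error, and ranking loss, the same function would be called with `higher_is_better=False`.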
In our experiments, ATDCCOS outperforms the other algorithms on most datasets, although it performs comparatively poorly on yeast and birds. It is well known that no single algorithm can obtain the best performance on every type of test dataset [10]. Performance depends not only on the structure of the algorithm but also on the type and size of the dataset and on the balance of its labels.
Figures 11 and 12 plot average precision and ranking loss, respectively, against the percentage of training data. These two figures illustrate how changing the amount of training data affects performance. In this experiment, we take the emotions dataset as an example for both comparisons.
Figure 11 shows how the average precision of the classifiers changes with the percentage of training data. The average precision of all classifiers increases as the percentage of training data increases. When the percentage is between 10% and 30%, the average precision of all algorithms fluctuates. Above 30%, the average precision of ATDCCOS and DLMCOS rises steadily, whereas MBR needs to reach 40%, and CC and BR need to reach 60%, before they do the same. Overall, as the training data increase, ATDCCOS performs better than DLMCOS, followed by MBR and BR, while CC is the worst.
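The average precision plotted on the vertical axis can be computed as in the following sketch, which uses the common multilabel definition: for each true label of a sample, take the fraction of labels ranked at or above it that are also true, average over the true labels, and then average over samples. The function name, input layout, and toy values are assumptions for illustration only.

```python
def average_precision(scores, relevant):
    """Multilabel average precision over samples (larger is better).

    Every sample must have at least one true label.
    """
    total = 0.0
    for sample_scores, true_labels in zip(scores, relevant):
        # Walk the labels from highest to lowest score.
        order = sorted(range(len(sample_scores)), key=lambda j: -sample_scores[j])
        hits = 0
        precision_sum = 0.0
        for pos, label in enumerate(order, start=1):
            if label in true_labels:
                hits += 1
                precision_sum += hits / pos  # precision at this true label
        total += precision_sum / len(true_labels)
    return total / len(scores)


# Toy data, invented for illustration only.
scores = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]]
relevant = [{0, 2}, {0}]
print(average_precision(scores, relevant))  # per-sample values 1 and 1/3, averaged
```

The first sample ranks both of its true labels on top and scores 1, while the second ranks its only true label third and scores 1/3.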
Figure 12 compares the algorithms in terms of ranking loss. As the percentage of training data increases, the ranking loss of ATDCCOS and DLMCOS decreases more steadily than that of MBR, CC, and BR. When the percentage of training data is between 10% and 40%, the ranking loss of every algorithm is unstable: MBR fluctuates the most, followed by CC and BR, while ATDCCOS and DLMCOS are more stable. When the percentage exceeds 40%, the ranking loss curves of all algorithms show a downward trend, and ATDCCOS still has the lowest loss.
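Curves such as those in Figures 11 and 12 can be produced with a simple harness like the one sketched below, which trains on growing fractions of the training set and records a metric for each fraction. The callback `fit_and_score` is hypothetical: it stands for whatever routine fits a classifier on the given subset and returns a metric measured on a fixed test set. Shuffling once so that the subsets are nested, and the fixed seed, are also assumptions of this sketch rather than details reported in this study.

```python
import random


def learning_curve(train, fractions, fit_and_score, seed=0):
    """Score a model trained on growing random subsets of `train`.

    fit_and_score is a hypothetical callback: it receives a training subset,
    fits a classifier, and returns a metric measured on a fixed test set.
    """
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)  # shuffle once, so each subset contains the previous one
    curve = []
    for frac in fractions:
        size = max(1, round(frac * len(shuffled)))
        curve.append((frac, fit_and_score(shuffled[:size])))
    return curve


# Stand-in callback that only reports the subset size, for illustration.
train = list(range(100))
curve = learning_curve(train, [0.1, 0.3, 0.6, 1.0], fit_and_score=len)
print(curve)
```

Using nested subsets keeps the comparison between neighboring points on the curve from being dominated by which samples happened to be drawn, although some fluctuation at small training fractions is still to be expected.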
5. Conclusion
In this study, we propose a simple and effective multilabel classification model (ATDCCOS) that integrates three classic problem-transformation frameworks for multilabel classification. It combines the advantages of each method to address the problem that the correlation among labels is not considered during classification. To further improve classification performance, the algorithm solves the problem of non-real-time label information interaction in the second-layer chained classification model by introducing the idea of “update replacement.” At the same time, the algorithm dynamically calculates the weights of all feature attributes through an attention mechanism, so that each classifier gives more weight to the attributes that matter most for its current classification target. This increases the sensitivity of the classifiers and greatly improves the precision of classification.
Five metrics are used to evaluate the algorithms on seven datasets. The experimental results show that, in most cases, the proposed method obtains high predictive performance compared with state-of-the-art multilabel classification methods. In terms of average accuracy, ATDCCOS is the highest on almost all datasets, and its accuracy on the flags, emotions, and medical datasets is more than 80%. In terms of micro-averaged AUC, ATDCCOS is the best on all datasets except the birds dataset, and its value on the medical dataset is 0.96. In terms of coverage, ATDCCOS is the best on all datasets, and its coverage is less than 10% on some datasets. In terms of one-error, the algorithm has the best overall performance. In terms of ranking loss, the algorithm achieves about 0.03 on the medical dataset.
Based on the above results, we conclude that the proposed ATDCCOS algorithm performs best overall among the compared algorithms. These are only preliminary results. In future work, we will further optimize the algorithm to reduce the time complexity introduced by the model structure, and we will try to apply it to classification problems in everyday work and life. Finally, we hope that this work can serve as a reference for researchers working on problem-transformation approaches to multilabel classification.
Data Availability
The data used and/or analyzed during the current study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the Key Research Project of Education Department of Sichuan Province of China (18ZA319).