Multilabel classification is one of the most challenging tasks in natural language processing and poses greater technical difficulties than single-label classification. At the same time, it has more natural applications. Each label draws on a different focus or a different part of the text, so a model must make full use of the local information in a sentence. Attention, a widely adopted mechanism in natural language processing, is a natural choice for this problem. This paper proposes a multilayer self-attention model that handles aspect category and word attention at different granularities. Combined with the pretrained BERT model, it achieves competitive performance in aspect category detection and electronic medical record classification.

1. Introduction

The multilabel classification (MLC) problem is to determine whether an input instance belongs to one or more predefined categories. It arises in various scenarios [1], such as electronic medical record classification (EMRC) in the medical domain and aspect category detection (ACD) in aspect-level sentiment analysis.

Sentiment analysis is at the heart of many business and social applications [2, 3]. ACD is a subtask of aspect category sentiment analysis [4] and can be treated as an MLC task. Its main purpose is to locate aspect information in comments. For example, the comment “While it was large and a bit noisy, the drinks were fantastic, and the food was superb” evaluates the restaurant’s “environment” and “food,” which belong to the predefined categories in the dataset. Mining such information from comments helps to improve user experience and product quality, which makes ACD a meaningful task in comment analysis.

Modern biomedical research often faces the problem of underutilized text data, such as electronic medical records. Electronic medical records are common text data in clinical medicine and have great value when correctly utilized. Automatic classification of electronic medical records not only improves the work efficiency of medical workers but also benefits research on various diseases. An electronic medical record may cover diagnosis, treatment, surgery, and many other aspects, for example: “Before admission, the patient was admitted to our hospital due to increased bowel movements. Colonoscopy showed rectal polyps and recommended rectal polypectomy. The patient did not undergo surgery... Today, the patient was admitted to our hospital for rectal polypectomy and was admitted to our hospital for treatment. Since the onset of the disease, the patient has normal appetite, conscious mind, good spirits, good sleep, normal bowel movements, normal urination, and no significant changes in weight.” This record includes “diagnosis” and “surgical treatment” of the patient, two of the six predefined categories in the dataset. Classifying electronic medical records is therefore also a multilabel text classification task.

In previous work, the earliest approach to MLC converts it into multiple single-label binary classification problems [5], but this ignores the correlation between labels. To retain correlation information, the classifier chain [6] was applied to the MLC problem. When the data volume is large, however, the computational cost of the classifier chain is also quite high. In addition, some machine learning algorithms have been adapted to MLC, such as multilabel K-nearest neighbor (ML-KNN) [7] and the RANK support vector machine (RANK SVM) [8]. With the development of deep neural networks, some representative deep learning models have also been applied to MLC, especially after the introduction of the attention mechanism [9], whose excellent feature extraction capabilities are widely used in various fields of natural language processing. Most recently, pretrained language models including ELMo [10], OpenAI-GPT [11], and BERT [12] have proved effective in reducing the effort of feature engineering. However, direct use of the pretrained BERT model in the MLC task does not show significant improvement. We believe that the vanilla BERT model is unable to capture the key information of each category, especially when the correlation between labels is strong.
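The first transformation mentioned above, turning one multilabel problem into several independent binary problems, can be sketched as follows. This is a minimal illustration only: the function name `binary_relevance_targets` and the toy annotations are hypothetical and do not come from the cited work.

```python
def binary_relevance_targets(label_sets, all_labels):
    """Turn multilabel annotations into one binary target list per label.

    Each label then gets its own independent binary classifier, which is
    exactly why correlations between labels are lost.
    """
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in all_labels
    }


# Hypothetical toy annotations for three restaurant comments.
label_sets = [{"food", "ambience"}, {"price"}, {"food", "service"}]
all_labels = ["food", "price", "service", "ambience"]

targets = binary_relevance_targets(label_sets, all_labels)
print(targets["food"])   # [1, 0, 1]
print(targets["price"])  # [0, 1, 0]
```

Because every target list is built in isolation, nothing in this representation records that "food" and "ambience" co-occur in the first comment, which is the limitation the classifier chain tries to address.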

In this paper, we propose a BERT-based multi-self-attention model (BERT-MSA) for MLC. The self-attention mechanism is used to capture the information of each category. Although a single attention head can obtain part of the important information in the text, a sentence often belongs to multiple categories in MLC. It is therefore necessary to use multiple attention modules to obtain the relevant information for multiple categories, and a neural network provides an efficient framework for tuning them.

Two tasks, ACD and EMRC, are used to verify the effectiveness of our model. For the ACD task, extensive experiments on subtask 3 of SemEval-2014 Task 4 (http://alt.qcri.org/semeval2014/task4/) [4] indicate that our BERT-MSA model is superior to the other baseline methods in aspect category sentiment analysis. For the EMRC task, we use subtask 1 of CCKS-2019 Task 1 (http://www.ccks2019.cn/?page_id=62) [13]. The results show that our model also exceeds the benchmark scores on the medical dataset. Good performance in two completely different fields indicates the generalization ability of our model.

2. Related Work

Early methods of MLC are mostly based on traditional machine learning [7, 8]. Recently, neural network models have also been applied to the MLC task and have made important progress. For example, Zhang and Zhou [14] apply a fully connected neural network with a pairwise ranking loss function, Kurata et al. [15] recommend convolutional neural networks (CNN) for classification, and Kurata et al. and Chen et al. [15, 16] use both CNN and long short-term memory networks (LSTM) [17] to capture the semantic information of text.

Different models are used to improve the quality of text feature extraction. Pretrained language models and the attention mechanism [9] were adopted in MLC soon after their emergence because of their excellent representation and feature extraction capabilities. These general MLC models have also been applied to specific fields, such as medical record processing and aspect-level sentiment analysis.

In aspect-level sentiment analysis, ACD aims at identifying the aspects about which users express their sentiments. A popular aspect detection method is based on the frequency of single nouns and compound nouns [18]. Beyond frequency, syntax-based methods detect aspects through syntactic relations [19, 20]. In general, these models operate in an unsupervised manner.

To improve the performance of aspect detection, deep learning methods based on word embedding [21] have been applied, using the grammatical and semantic information embedded in the distributed representation [22]. CNN is also used in aspect detection because of its excellent feature extraction capabilities [23, 24]. LSTM with attention (LSTM-Attention) [25] applies the attention mechanism to sequential text input for accurate aspect detection. Ensemble CNN-RNN networks have been applied to the MLC task [16]; these networks capture both the global and the local textual semantics and model high-order label correlations. Recently, CNN-stacked bidirectional LSTM networks with a multiplicative attention mechanism have been proposed for the MLC task [26].

More recently, a hybrid Siamese-convolutional neural network [27] with additional technical attributes has been applied to the MLC task. It is based on single and Siamese multitask architecture networks and calculates the category-specific similarity in the Siamese structure. Another approach uses an RNN and a tree structure [28] to represent the relationships among labels and develops an efficient max-product algorithm for exact inference of label prediction in the MLC task.

3. Model

3.1. Input and Embedding Layers

Both ACD and EMRC can be formulated as a multilabel text classification problem. A sentence consists of a sequence of words. In ACD, we need to predict the categories of the sentence from the label set Y = {food, price, service, ambience, anecdote}. In EMRC, the label set is Y = {diagnosis and disease, image inspection, laboratory inspection, surgery, medical treatment, anatomy}.

BERT is a language representation model that uses bidirectional Transformers [9] for pretraining on a large corpus and is then fine-tuned on other tasks. To obtain a fixed-dimensional pooled representation of the input sequence, we use the final hidden state of the fine-tuned BERT as the input to the following layers.

3.2. Model Description

As shown in Figure 1, BERT-MSA is the key component of our model. First, we use the hidden-layer vectors output by BERT to capture the important information about each label in the sentence; we then reuse the attention scores and the attention output to obtain the most important information in the sentence.

The first step in calculating self-attention [29] is to create three vectors from the representations output by the fine-tuned BERT. For each word, we create a query vector, a key vector, and a value vector. These vectors are generated by multiplying the word embeddings by three parameter matrices, which are randomly initialized before training and continuously updated through gradient descent during training.

The input sentence contains n elements, one vector per word, and self-attention computes a new sequence of the same length.

The compatibility function is the scaled dot product, which measures the relationship or similarity of two input elements. Linearly transforming the inputs beforehand adds expressive power.

The attention score of each word in the sentence is then obtained by applying the softmax function to the compatibility values.

Each output element is the weighted sum of the linearly transformed sentence vectors output by the fine-tuned BERT, with the softmax attention scores as weights.
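Under the standard scaled dot-product formulation, which the description of these three steps matches, the computation can be written as follows. The notation is assumed here for illustration: x_i is the BERT output for word i, W^Q, W^K, and W^V are the query, key, and value parameter matrices, d_z is the dimension of the transformed vectors, e_{ij} is the compatibility of words i and j, \alpha_{ij} is the attention score, and z_i is the output element.

```latex
\begin{align}
e_{ij} &= \frac{(x_i W^Q)\,(x_j W^K)^{\top}}{\sqrt{d_z}}, \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}, \\
z_i &= \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V).
\end{align}
```

For each word i, the scores \alpha_{ij} are nonnegative and sum to one over j, so z_i is a convex combination of the value vectors.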

After calculating the attention scores and the attention output vector from BERT's output, we take the product of the first-layer attention output vector and the attention scores to compensate for text features that the first attention layer fails to capture. The result is fed into the second self-attention layer, and the corresponding attention output vector is calculated again. The output vectors of the two attention layers are then concatenated. The two attention layers are calculated in the same way but use different parameters.
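A minimal pure-Python sketch of this two-layer procedure is given below. It rests on several assumptions that the text does not settle: the "product" of the first-layer output and the attention scores is read as a matrix product, concatenation is done per token along the feature axis, and the toy sizes stand in for the real hidden size of 768. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention; returns (scores, output)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    logits = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    scores = [softmax(row) for row in logits]  # n x n, each row sums to 1
    return scores, matmul(scores, v)           # output is n x d


def two_layer_attention(x, params1, params2):
    """Stack two attention layers and concatenate their outputs."""
    scores1, out1 = self_attention(x, *params1)
    # Assumed reading: reweight the first-layer output by its own scores.
    reweighted = matmul(scores1, out1)
    _, out2 = self_attention(reweighted, *params2)
    # Concatenate the two outputs token by token along the feature axis.
    return [r1 + r2 for r1, r2 in zip(out1, out2)]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d = 4, 8  # toy sizes: 4 tokens, 8 features
x = random_matrix(n, d, rng)  # stands in for BERT's hidden states
params1 = [random_matrix(d, d, rng) for _ in range(3)]
params2 = [random_matrix(d, d, rng) for _ in range(3)]

features = two_layer_attention(x, params1, params2)
print(len(features), len(features[0]))  # 4 16
```

The two layers share the same code path and differ only in their parameter matrices, which mirrors the statement that they are calculated in the same way with different parameters.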

After that, a fully connected layer projects the concatenated vector to the dimension of the label set. Finally, the activation goes through a sigmoid function [20, 30] to generate the probability that a sample belongs to the corresponding class. The instance is assigned to a category if the probability is over the threshold of 0.5.
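The decision rule can be illustrated with a short sketch. The logits below are hypothetical values standing in for the output of the fully connected layer, and `predict_labels` is an illustrative name rather than part of the model's code.

```python
import math


def predict_labels(logits, labels, threshold=0.5):
    """Apply a per-label sigmoid and keep labels whose probability exceeds the threshold."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [label for label, p in zip(labels, probs) if p > threshold]


labels = ["food", "price", "service", "ambience", "anecdote"]
logits = [2.1, -1.3, -0.4, 0.8, -3.0]  # hypothetical fully connected outputs

print(predict_labels(logits, labels))  # ['food', 'ambience']
```

Because each label has its own sigmoid, the probabilities need not sum to one, so a sentence can receive several labels or none.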

The loss function is the binary cross entropy between the predicted probability and the true label.
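As a sketch of this loss for a single instance, the function below sums the per-label terms -(y log p + (1 - y) log(1 - p)) and averages them over the labels. Averaging rather than summing, and the small clipping constant `eps`, are assumptions made here for illustration.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross entropy over the labels of one instance."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical predictions for the five ACD labels and their true labels.
probs = [0.9, 0.2, 0.4, 0.7, 0.05]
targets = [1, 0, 0, 1, 0]
print(binary_cross_entropy(probs, targets))
```

The loss is small when confident predictions agree with the true labels and grows without bound as a confident prediction approaches the wrong extreme.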

4. Experiments

4.1. Datasets

In the ACD experiment, we use the SemEval-2014 ABSA challenge dataset [4] for performance comparison. Table 1 shows the details of the dataset, including the number of samples in each category. The dataset is from subtask 3 of SemEval-2014 Task 4.

We use a Chinese electronic medical record dataset for the medical text classification experiment. This dataset is from subtask 1 of CCKS2019 Task 1 [13]. Table 2 shows the details of the dataset, with statistics separated by categories.

4.2. Baseline Methods

Using the datasets described above, BERT-MSA is compared with several baseline systems using the standard evaluation metrics from SemEval-2014 Task 4 and CCKS2019 Task 1.

RANK SVM [8] uses TF-IDF to extract text features. It is a basic machine learning algorithm based on pointwise sorting.

ML-KNN [7] is also based on TF-IDF features. It finds the K-nearest neighbor samples and uses the Bayesian conditional probability formula to calculate the probability of the current label.

TEXTCNN [23] uses GloVe vectors [31] for text representation. It applies multiple convolution kernels to extract text features and then feeds them to a linear transformation layer. The sigmoid function is used to output the probability distribution over the label space.

Bi-LSTM [11, 12] represents the basic bidirectional LSTM model.

Bi-LSTM-Attention is based on the basic Bi-LSTM. The hidden layer output of LSTM is fed to self-attention for classification.

XRCE [32] achieved the highest score in the SemEval-2014 competition.

Attention-XML [33] uses the label tree-based deep learning model for multilabel text classification.

BERT-base feeds the hidden-layer output of the pretrained BERT to a fully connected layer.

BERT-Attention applies a one-layer attention network [9] over BERT to obtain information in each text category.

4.3. Hyperparameters

In the ACD task, we fine-tune the English pretrained BERT-base model (https://huggingface.co/BERT-base-uncased), and in the EMRC task, we fine-tune the Chinese pretrained BERT-base model (https://github.com/ymcui/Chinese-BERT-wwm). The number of Transformer blocks is 12, the size of the hidden layer is 768, the number of self-attention heads is 12, and the total number of parameters of the pretrained model is about 110M. The learning rate is set to 5e-5, and the batch size is 32, with a maximum sentence length of 80. The Adam optimizer is used to tune the model. For the EMRC task, most parameters are the same, but the batch size is set to 16. Sentences in the CCKS2019 collection are relatively long, and BERT supports a maximum sequence length of 512, so we truncate the sentences to length 512.
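For reference, the settings above can be collected into a small configuration sketch. The class and field names are hypothetical, and the sketch assumes that the maximum length of 80 applies only to the ACD task while 512 applies to EMRC; the values themselves are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the fine-tuning settings of one task."""
    task: str
    pretrained_model: str        # descriptive name, not a library identifier
    max_seq_length: int
    batch_size: int
    learning_rate: float = 5e-5
    optimizer: str = "adam"
    num_transformer_blocks: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12


ACD_CONFIG = FineTuneConfig(
    task="ACD",
    pretrained_model="English BERT-base (uncased)",
    max_seq_length=80,
    batch_size=32,
)

EMRC_CONFIG = FineTuneConfig(
    task="EMRC",
    pretrained_model="Chinese BERT-wwm",
    max_seq_length=512,  # long records are truncated to BERT's limit
    batch_size=16,
)

print(ACD_CONFIG)
print(EMRC_CONFIG)
```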

4.4. Evaluation Metrics

We use precision, recall, and Micro-F1 [34] to evaluate the performance in both tasks.
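A minimal sketch of micro-averaged scoring is shown below: true positives, false positives, and false negatives are pooled over all instances and labels before precision, recall, and F1 are computed. The function name and the toy label sets are hypothetical.

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over sets of labels."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted labels for three comments.
true_sets = [{"food", "ambience"}, {"price"}, {"service"}]
pred_sets = [{"food"}, {"price", "service"}, {"service"}]

print(micro_prf(true_sets, pred_sets))  # (0.75, 0.75, 0.75)
```

In this toy case, three predicted labels are correct, one is spurious, and one gold label is missed, so precision and recall are both 3/4.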

4.5. Results and Analysis

Table 3 shows that BERT-MSA achieves clear improvements on the ACD dataset over BERT-base, BERT-Attention, and the other baselines based on convolutional and LSTM networks. Introducing self-attention improves performance, and multiple self-attention yields a much larger gain. We believe that the multiple self-attention mechanism makes up for the defects of the feature extraction layer and helps the model to find the words that best reflect each category.

At the same time, Table 4 shows that our multi-self-attention mechanism still obtains promising results on the medical dataset. Its recall is slightly lower than that of BERT-Attention, but its precision is much higher, resulting in a small increase in the F1 value.

Compared with the EMRC task, the improvement in the ACD task is small, even though the baseline is much lower. We attribute this to the shorter sentences in the restaurant dataset, which leave less context for the attention mechanism. Better feature extraction methods are required to achieve better results on such short texts.

5. Conclusion

To provide a general solution to the MLC task across multiple applications, we propose the BERT-MSA model. It introduces a BERT-based multiple self-attention mechanism, which can obtain more comprehensive features of each word in a sentence. Because self-attention can update the representation of the current word from any context word in the sentence, it can potentially learn the full semantic information of a sentence. For its applications, we select the ACD task in the online review domain and the EMRC task in the medical field, two areas with large differences in text representation. Experiments in these diverse areas show that our model achieves good results. On some datasets, our model achieves the best results among the compared models. Moreover, our model shows good results on MLC tasks in both Chinese and English. This shows that our model is suitable for processing MLC tasks in a specific field.

The prompt framework [35], a recent concept for text classification, converts the text classification task into a cloze task, making full use of the masked language model’s representation power. It is an innovative paradigm for pretrained language models, and we will try modeling the MLC task in that framework for richer in-depth semantic representation.

Data Availability

All data supporting this study are from previously reported studies and datasets, which have been cited within the article. The processed data are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

This work was supported in part by the Research Innovation Team Fund (Award no. 18TD0026) from the Department of Education, Sichuan Province, China, and Sichuan Science and Technology Program (Project no. 2020YFG0168).