Special Issue: Fusion of Big Data Analytics, Machine Learning and Optimization Algorithms for Internet of Things
Named Entity Recognition of Hazardous Chemical Risk Information Based on Multihead Self-Attention Mechanism and BERT
An approach based on the self-attention mechanism and the pretrained model BERT is proposed to solve the problems of entity recognition and relationship recognition in hazardous chemical risk information. The text of hazardous chemical risk information is encoded at the character level with the pretrained language model BERT, which, when paired with a multihead self-attention mechanism, improves the model's ability to mine both global and local features of the text. The experimental results show that the model's F1 value is 94.57%, which is significantly higher than that of other standard models.
1. Introduction
Learning human knowledge is one of the research directions of artificial intelligence. Knowledge representation and reasoning inspired by human problem-solving behavior enable an intelligent system to accurately describe and classify the knowledge it acquires and to solve complex problems. The knowledge graph, a structured form of human knowledge introduced by Google in 2012, has attracted both academic and industry interest. Named entity recognition plays a key role in the process of constructing knowledge graphs.
Deep learning has advanced tremendously as hardware has improved. Compared with traditional methods, deep learning shows significant advantages in many tasks. Mainstream named entity recognition approaches are built on deep learning, which converts named entity recognition into a sequence labeling problem in order to predict entity labels. Since BERT was proposed, the performance of named entity recognition has improved greatly. At the same time, multihead self-attention has been used to mine the semantic information between sentences, which gives the model more features and improves recognition performance to a certain extent. However, many studies place the self-attention layer before the BiLSTM layer to extract features from the original sentences. Because BERT already uses the self-attention mechanism, this structure brings little benefit. Furthermore, there is no relevant research on named entity recognition of hazardous chemical risk information. Because labeled data for hazardous chemical risk information are scarce, applying existing methods directly to this domain gives poor results, and entity recognition accuracy is low.
To solve these problems, we first crawl and filter web information to form a dataset and then design an annotation tool suited to the annotation characteristics of the dataset to label it. Finally, this work proposes the BERT-BiLSTM-Self-Attention-CRF model, which will aid in the creation of a hazardous chemical knowledge graph. The character-level encoding of text in the field of hazardous chemical risk information is obtained using the BiLSTM-CRF model, and the context-based word embedding is obtained using the pretrained language model BERT. The self-attention layer is added after the BiLSTM layer to mine the features of the BiLSTM output vector, which enhances the model's ability to mine global features of the text.
2. Related Works
The goal of named entity recognition is to extract entities that belong to predefined semantic types from the target text. It is a basic task in natural language processing that supports downstream tasks. Currently, there are four main approaches to recognizing named entities: (1) The rule-based method does not need labeled data but relies on handcrafted rules. Dictionaries built from these rules can improve recognition. However, because the rules are domain specific and domain dictionaries are often incomplete, these systems usually have high precision and low recall, and they cannot be transferred to other fields. (2) The unsupervised learning method does not need manually labeled data. The typical technique is clustering, which extracts named entities from clusters according to context similarity. (3) The feature-based statistical machine learning method relies on feature selection. Using supervised learning, the named entity recognition task is transformed into a multiclass classification or sequence labeling task. To obtain the target model, features must be selected from the annotated text and used to train a machine learning algorithm. (4) The deep learning method is currently the most widely used.
Compared with feature-based methods, deep learning algorithms discover hidden features automatically and perform better. In 1997, the long short-term memory (LSTM) network was introduced; it can selectively forget unnecessary information from the preceding text, which improves named entity recognition accuracy. To address the issue that the LSTM model uses only past information to predict the output at a given time step, without taking the following context into account, Strubell et al. combined the iterated dilated convolutional neural network (IDCNN) with the conditional random field (CRF), and Young et al. combined bidirectional long short-term memory (BiLSTM) with the CRF. To obtain richer general semantic representations, Devlin et al. proposed BERT, in which pretrained word embeddings with good representation capacity are obtained by training on a large-scale corpus. To strengthen the model's ability to capture domain semantic information, Xie et al. increased entity recognition accuracy using the BiLSTM-CRF architecture together with the BERT model. However, the LSTM unit is limited to local information, and its ability to capture long-distance information is insufficient, which degrades the overall performance of the model.
In many fields, such as cybersecurity, biomedicine [7, 8], and social media [9, 10], named entity recognition based on deep learning algorithms is frequently employed. Nevertheless, there is no relevant research on named entity recognition of hazardous chemical risk information, and there is no large-scale labeled dataset in this field. Entity recognition for hazardous chemical risk information still faces several difficulties: the data are voluminous, are stored in different formats, and differ greatly from one source to another. In many sources the entities of interest do not appear within a single sentence, and the entities overlap with entity types from other fields and are inconsistent with common named entities, so the conceptual entities have to be redefined. Labeled data are scarce because manual labeling is expensive and time-consuming and requires experienced staff, which makes it difficult to label entities. As a result, this paper constructs a hazardous chemical risk information dataset by crawling and filtering web information and then proposes the BERT-BiLSTM-Self-Attention-CRF model. By connecting the self-attention layer after the BiLSTM layer, the model strengthens its ability to mine global information in a sentence, so that it processes the hazardous chemical risk information dataset more effectively.
3. The Structure of BERT-BiLSTM-Self-Attention-CRF
The model proposed in this paper is based on BiLSTM-CRF. It replaces the word2vec pretrained word embedding at the BiLSTM input with the BERT pretrained word embedding to produce a more informative word vector; in addition, a self-attention layer is connected after the BiLSTM layer to mine the semantic information between characters at a deeper level.
The characters are first fed into BERT, which generates a word vector by combining token embedding, segment embedding, and position embedding. The word vector fused with semantics is then used as the input of the BiLSTM network. Through the BiLSTM, the model uses both the preceding and the following context to predict the output at each time step. The self-attention layer then processes the local features extracted by the BiLSTM network, captures the interactions within the output feature vectors, strengthens their global features, and supplements the features of the BiLSTM output. Finally, the CRF layer learns the constraints between tags for the tag predictions produced by the preceding layers, such as not placing B-SUBJECT directly after I-SUBJECT, which makes the predicted tags more consistent and allows the model to output the optimal tag sequence. Figure 1 depicts the model structure.
3.1. Word Embedding Based on BERT
In natural language processing tasks, frequently used word embedding models include Word2Vec, proposed by Google, and GloVe, proposed by Stanford University. Word2Vec derives the vector of a word from its context; GloVe enriches the semantic information of the word vector by using a co-occurrence matrix and considering local and global information at the same time. However, these word embedding models do not handle long-distance dependencies well.
The network architecture of BERT uses the transformer structure proposed by Vaswani et al. Its most notable feature is that it forgoes conventional RNNs and CNNs in favor of the attention mechanism, which reduces the distance between any two positions to 1 and thereby largely solves the long-distance dependency problem. As shown in Figure 2, BERT's input word embedding is made up of token embedding, segment embedding, and position embedding.
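The composition of the input embedding can be illustrated with a minimal NumPy sketch. The table sizes and the random lookup tables below are toy assumptions chosen for illustration, not BERT's actual parameters, and the layer normalization and dropout that BERT applies after the sum are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, MAX_LEN, HIDDEN = 100, 16, 8  # toy sizes (assumed), not BERT's real ones

# Randomly initialised lookup tables standing in for learned embeddings.
token_table = rng.normal(size=(VOCAB_SIZE, HIDDEN))
segment_table = rng.normal(size=(2, HIDDEN))
position_table = rng.normal(size=(MAX_LEN, HIDDEN))

def bert_input_embedding(token_ids, segment_ids):
    """Sum token, segment, and position embeddings for one sequence."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])

# Four characters, all belonging to the first segment.
emb = bert_input_embedding([5, 17, 42, 9], [0, 0, 0, 0])
print(emb.shape)  # (4, 8)
```

Each row of `emb` is the vector that would be passed on to the first transformer layer for the corresponding character.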
When calculating self-attention, BERT derives three vectors for each token: the query vector, the key vector, and the value vector. These are produced by multiplying the word embedding by three trained weight matrices. The self-attention layer is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ are the matrices of the query, key, and value vectors and $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large, which prevents the softmax from saturating, mitigates vanishing gradients, and keeps training stable.
Finally, using the softmax function, the score is normalized so that the output word vector may completely learn the relationship between the word and other words, enriching the semantic representation of the word.
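The scaled dot-product self-attention described above can be sketched in a few lines of NumPy. The input size, the projection size, and the random weight matrices are illustrative assumptions; in BERT these matrices are learned during pretraining.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise scores between all positions
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # 5 characters, 8-dimensional embeddings (toy)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)                         # (5, 4)
print(weights.sum(axis=-1))              # every row sums to 1 (up to rounding)
```

Row `i` of `weights` shows how strongly character `i` attends to every other character, which is the sense in which each output vector "learns the relationship between the word and other words".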
3.2. BiLSTM Layer
RNNs contain three layers: input, hidden, and output. However, as the input sequence grows longer, typical RNNs suffer from exploding and vanishing gradients. LSTM is an improvement on the RNN. It tackles the vanishing and exploding gradient problems by adding three gating units: the forget gate, the input gate, and the output gate. This also speeds up the model's convergence. Figure 3 depicts the structure of the LSTM.
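The forget gate, input gate, and output gate formulae for the LSTM unit structure are as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
where $f_t$, $i_t$, and $o_t$ stand for the forget gate, input gate, and output gate, respectively; $W_f$, $W_i$, and $W_o$ represent the corresponding weight matrices; $b_f$, $b_i$, and $b_o$ represent the corresponding offset vectors; $\sigma$ represents the sigmoid activation function; $x_t$ represents the encoded input embedding at time $t$; and $h_{t-1}$ represents the state of the hidden layer at time $t-1$.
The cell state formula is as follows:
$$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
where the Hadamard product is represented by $\odot$, $\tanh$ stands for the hyperbolic tangent activation function, $W_c$ represents the weight matrix, and $b_c$ represents the offset vector of the update state.
The update formula of the hidden layer state is as follows:
$$h_t = o_t \odot \tanh(c_t)$$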
In the named entity recognition task, both the preceding and the following context affect the label of the current entity. Because of its structure, a forward LSTM can only take into account the preceding text. This paper therefore uses the BiLSTM model, which runs one LSTM in each direction, learns the preceding and following context of the text simultaneously, and concatenates the two output vectors. This overcomes the incomplete semantic information of a single LSTM and improves the accuracy of the experimental results.
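As a rough illustration of the gate computations and of the bidirectional concatenation, the following NumPy sketch runs one LSTM forward and one backward over a toy sequence. The parameter names (`W_f`, `b_f`, and so on), the dimensions, and the random initialisation are assumptions made for this example; a real model would learn the parameters and would normally use a deep learning framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, input_dim, hidden):
    """Random toy parameters for the forget, input, output gates and the cell candidate."""
    params = {}
    for gate in "fioc":
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, hidden + input_dim))
        params[f"b_{gate}"] = np.zeros(hidden)
    return params

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step on the concatenated vector [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate cell state
    c = f * c_prev + i * c_tilde                # new cell state
    h = o * np.tanh(c)                          # new hidden state
    return h, c

def run_lstm(X, p, hidden):
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(X, p_fwd, p_bwd, hidden):
    """Concatenate forward and backward hidden states at every position."""
    fwd = run_lstm(X, p_fwd, hidden)
    bwd = run_lstm(X[::-1], p_bwd, hidden)[::-1]  # re-align to original order
    return np.concatenate([fwd, bwd], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                       # 6 characters, 8-dim embeddings (toy)
H = bilstm(X, init_params(rng, 8, 4), init_params(rng, 8, 4), hidden=4)
print(H.shape)                                    # (6, 8): 4 forward + 4 backward units
```

The backward pass is re-reversed before concatenation so that row `t` of `H` combines what the forward LSTM has seen up to position `t` with what the backward LSTM has seen from the end of the sequence down to position `t`.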
3.3. Multihead Self-Attention Mechanism Layer
In named entity recognition, the label of a character is heavily influenced by certain words in its context, so the same character may receive very different labels in different contexts. Because BERT is trained on general-domain text, the word embedding it produces cannot fully express the semantics of hazardous chemical risk information. When capturing context, the BiLSTM network tends to capture local information; it cannot fully express the global information of the input text sequence, that is, the importance of each character in the sentence to the output at the current time step. Therefore, this study adds the multihead self-attention mechanism as an additional module after the BiLSTM module, which enhances the model's ability to mine global information and relevance within the sentence, so that the model is better suited to the field of hazardous chemical risk information.
In the multihead self-attention layer, the query vector $Q$, key vector $K$, and value vector $V$ are each mapped by independent linear projections with different weight matrices and then fed into $h$ parallel heads, each of which performs the self-attention operation. In this manner, each head captures distinct semantic features of every character of the input text sequence in its own representation subspace. The outputs of all heads are then concatenated and passed through a final linear mapping, which yields the output. The formulas are as follows:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ represent the weight matrices used in the linear transformations, $\mathrm{head}_i$ represents the $i$th head in the multihead self-attention module, and Concat represents the vector concatenation operation.
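A minimal NumPy sketch of multihead self-attention applied to a matrix `H` of per-character feature vectors (standing in for the BiLSTM output) is given below. The number of heads, the dimensions, and the random weights are assumptions for illustration only; the hyperparameters used in the experiments are those listed in Table 4.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(H, W_q, W_k, W_v, W_o):
    """Multihead self-attention with Q = K = V = H.

    W_q, W_k, W_v have shape (num_heads, d_model, d_head);
    W_o has shape (num_heads * d_head, d_model).
    """
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = H @ Wq_i, H @ Wk_i, H @ Wv_i          # independent projections per head
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(weights @ V)                        # head_i
    return np.concatenate(heads, axis=-1) @ W_o          # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_head = 5, 8, 2, 4         # toy sizes (assumed)
H = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(num_heads, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(num_heads * d_head, d_model))

out = multihead_self_attention(H, W_q, W_k, W_v, W_o)
print(out.shape)                                         # (5, 8)
```

Because the output has the same shape as the input, the layer can be placed between the BiLSTM and the CRF without changing the interface of either.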
3.4. CRF Layer
The BiLSTM layer and the multihead self-attention layer can learn the local and global feature information of the context and output, for each word, the label with the maximum probability. However, they cannot learn the relationships between labels, so consecutive output labels may be logically inconsistent: labels of the same type may appear out of order, or different labels may be wrongly paired, for example B-SUBJECT directly following I-SUBJECT. To fully learn the dependency between adjacent tags, a CRF is used at the end of the model to decode the feature information output by the multihead self-attention layer and obtain the tag sequence of the text.
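The kind of label inconsistency the CRF is meant to prevent can be made concrete with a small validity check. The rules encoded below are one strict reading of the BIEO scheme and are an assumption of this sketch: every entity opens with B-, continues with I- of the same type, and closes with E- of the same type. The RISK label is also a hypothetical tag name used only for illustration.

```python
def is_valid_bieo(tags):
    """Return True if the tag sequence is well formed under a strict BIEO reading."""
    open_label = None                      # type of the entity currently open, if any
    for tag in tags:
        if tag == "O":
            if open_label is not None:     # an entity was opened but never closed
                return False
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            if open_label is not None:     # e.g. B-SUBJECT right after I-SUBJECT
                return False
            open_label = label
        elif prefix in ("I", "E"):
            if open_label != label:        # no matching B-, or a type mismatch
                return False
            if prefix == "E":
                open_label = None
        else:
            return False
    return open_label is None              # sequence must not end inside an entity

print(is_valid_bieo(["B-SUBJECT", "I-SUBJECT", "E-SUBJECT", "O"]))          # True
print(is_valid_bieo(["B-SUBJECT", "I-SUBJECT", "B-SUBJECT", "E-SUBJECT"]))  # False
```

A per-character classifier can emit the second sequence because it scores each position independently; modelling transitions between adjacent tags is what rules it out.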
In a linear-chain conditional random field, the feature functions are mainly classified into two types. The first type is the state feature function defined on node $i$, which is only related to the current node; the other is the transition feature function defined on the node context, which is only related to the current node and the previous node. For a given input sequence $x = (x_1, x_2, \ldots, x_n)$, the output tag sequence $y = (y_1, y_2, \ldots, y_n)$ can be obtained.
The scoring function of the tag sequence can be expressed as
$$\mathrm{score}(x, y) = \sum_{i} \sum_{k=1}^{K} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i} \sum_{l=1}^{L} \mu_l s_l(y_i, x, i)$$
where $t_k$ represents the transition (local) feature function; $s_l$ represents the state (node) feature function; $\lambda_k$ and $\mu_l$ are the weight coefficients of $t_k$ and $s_l$, respectively; $K$ stands for the number of transition feature functions; and $L$ stands for the number of state feature functions.
The scores of all feasible tag sequences are calculated for a given input sequence $x$. The following is the normalizing formula:
$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y' \in Y_x} \exp(\mathrm{score}(x, y'))}$$
where $\mathrm{score}(x, y')$ represents the score of the scoring function for a candidate prediction sequence $y'$, $y$ represents the real annotation sequence, and $Y_x$ represents all possible tag sequences.
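The normalization can be checked on a toy example by enumerating every candidate tag sequence. The sketch below makes a simplifying assumption that is common in neural CRF layers but is not stated in the text: the weighted sums of state and transition feature functions are collapsed into one emission score per position and tag and one transition score per tag pair. All numbers are made up, and brute-force enumeration is only feasible for very short sequences.

```python
import itertools
import math

def sequence_score(emissions, transitions, tags):
    """Score of one tag sequence: emission scores plus transition scores."""
    emit = sum(emissions[i][t] for i, t in enumerate(tags))
    trans = sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return emit + trans

def sequence_probability(emissions, transitions, tags):
    """P(y | x): exponentiated score normalised over all candidate sequences."""
    n_tags = len(transitions)
    candidates = itertools.product(range(n_tags), repeat=len(emissions))
    z = sum(math.exp(sequence_score(emissions, transitions, c)) for c in candidates)
    return math.exp(sequence_score(emissions, transitions, tags)) / z

# Toy scores for 3 positions and 2 tags (index 0 and 1); values are illustrative.
emissions = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]
transitions = [[0.5, -0.2], [0.1, 0.4]]

total = sum(
    sequence_probability(emissions, transitions, seq)
    for seq in itertools.product(range(2), repeat=3)
)
print(round(total, 6))  # probabilities over all 8 sequences sum to 1.0
```

In practice the denominator is computed with the forward algorithm in time linear in the sequence length rather than by enumeration.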
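Finally, the Viterbi algorithm is used to obtain the optimal predicted tag sequence $y^*$, that is, the candidate with the highest score among the set $Y_x$ of all possible tag sequences:
$$y^* = \arg\max_{y' \in Y_x} \mathrm{score}(x, y')$$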
By adding a CRF layer at the end of the model, some constraints are added to the last predicted label to ensure that the predicted label is legal, so as to improve the accuracy of the predicted label.
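The effect of such constraints on decoding can be sketched with a small pure-Python Viterbi decoder. This is an illustrative example under stated assumptions, not the implementation used in the experiments: the emission scores are invented, only one entity type is used, and the constraints are hard-coded as a large negative transition score (`FORBIDDEN`), whereas a trained CRF learns its transition scores from data.

```python
TAGS = ["O", "B-SUBJECT", "I-SUBJECT", "E-SUBJECT"]
FORBIDDEN = -1e4  # assumed penalty standing in for a learned, very low transition score

def allowed(prev_tag, next_tag):
    """Strict BIEO transitions for a single entity type."""
    inside = prev_tag.startswith(("B-", "I-"))
    if inside:
        return next_tag.startswith(("I-", "E-"))
    return next_tag == "O" or next_tag.startswith("B-")

transitions = [[0.0 if allowed(a, b) else FORBIDDEN for b in TAGS] for a in TAGS]

def viterbi(emissions, transitions):
    """Return the highest-scoring tag index path and its score."""
    n_tags = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_best, pointers = [], []
        for j in range(n_tags):
            cand = [best[i] + transitions[i][j] for i in range(n_tags)]
            i_best = max(range(n_tags), key=cand.__getitem__)
            new_best.append(cand[i_best] + emit[j])
            pointers.append(i_best)
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_tags), key=best.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1], max(best)

# Per-position argmax would give B, I, B, E, which is not a valid sequence.
emissions = [
    [0.1, 2.0, 0.3, 0.2],
    [0.2, 0.1, 1.5, 0.4],
    [0.1, 1.2, 1.0, 0.3],
    [0.3, 0.2, 0.1, 2.0],
]
path, score = viterbi(emissions, transitions)
print([TAGS[i] for i in path], score)
# ['B-SUBJECT', 'I-SUBJECT', 'I-SUBJECT', 'E-SUBJECT'] 6.5
```

At the third position the decoder gives up the locally best tag (B-SUBJECT, 1.2) for I-SUBJECT (1.0) because the transition from I-SUBJECT to B-SUBJECT is penalised, which yields a valid entity span.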
4. Experiments
4.1. Dataset and Annotation System
Because hazardous chemical risk information comes in varied text styles and formats, this paper takes the 2828 hazardous chemicals recorded in the Catalog of Hazardous Chemicals (2015 Edition) as its object. The material safety data sheets (MSDS) corresponding to these chemicals are crawled, and useless content such as pictures and repeated information is filtered out through data cleaning and preprocessing, yielding a corpus in the field of hazardous chemical risk information.
Because the crawled data contain many relationships, each relationship is treated as a distinct entity, so that the usual two steps of building triples (first identifying the entities and then extracting the relationships) are merged into one. This makes full use of the characteristics of both entities and relationships in corpus sentences and speeds up the construction of triples. Table 1 shows the definition of entity types in the field of hazardous chemical risk information.
The BIEO labeling scheme is used to label the data in this study: B-Label marks the first character of a labeled entity, I-Label marks the characters in the middle of the entity, E-Label marks the last character of the entity, and O marks unrelated text. Table 2 shows the definition of entity labels.
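As an illustration of the scheme, the following sketch converts character spans into BIEO tags. The function name, the span format `(start, end_exclusive, label)`, the example sentence, and the RISK label are all hypothetical; the actual label set is the one defined in Table 2. The text does not say how single-character entities are tagged, so this sketch simply rejects them.

```python
def bieo_tags(text, spans):
    """Tag every character of `text` given (start, end_exclusive, label) spans."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if end - start < 2:
            # Assumption: BIEO has no single-character tag, so such spans are refused.
            raise ValueError("entity spans must cover at least two characters")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{label}"
        tags[end - 1] = f"E-{label}"
    return tags

text = "methanol is flammable"
spans = [(0, 8, "SUBJECT"), (12, 21, "RISK")]   # hypothetical annotations
tags = bieo_tags(text, spans)

for ch, tag in zip(text, tags):
    print(repr(ch), tag)
```

The first and last characters of "methanol" receive B-SUBJECT and E-SUBJECT, the six characters between them receive I-SUBJECT, and the characters of " is " receive O.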
YEDDA is a lightweight, efficient, and complete open-source tool for text span annotation that provides a systematic solution from collaborative user annotation through administrator assessment and analysis. This paper removes the parts of YEDDA that are unnecessary for this experiment and builds an auxiliary entity annotation platform on it. Part of the dataset is manually labeled and divided into a training set and a test set at a ratio of 9 : 1. Then, at a replacement proportion of 20%, entities in the training set are replaced with other entities of the same type to generate new sentences, which are added to the dataset. Finally, a model trained by supervised learning labels the remaining sentences, and manual review ensures that they are labeled correctly. The composition of the dataset is shown in Table 3.
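The split and the entity-replacement augmentation can be sketched as follows. Several details are assumptions of this example rather than facts from the text: sentences are represented as lists of `(text, label)` segments, the 20% proportion is interpreted as the number of new sentences relative to the training set size, and the entity pool, the chemical names, and the RISK label are invented for illustration.

```python
import random

def split_train_test(sentences, train_ratio=0.9, seed=0):
    """Shuffle and split labeled sentences into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def augment_by_replacement(train, entity_pool, ratio=0.2, seed=0):
    """Create new sentences by swapping entities for others of the same type."""
    rng = random.Random(seed)
    candidates = [s for s in train if any(label for _, label in s)]
    n_new = min(int(len(train) * ratio), len(candidates))
    new_sentences = []
    for sentence in rng.sample(candidates, n_new):
        new_sentences.append([
            (rng.choice(entity_pool[label]), label) if label else (text, label)
            for text, label in sentence
        ])
    return train + new_sentences

# Hypothetical entity pool and labeled sentences.
entity_pool = {
    "SUBJECT": ["methanol", "ethanol", "acetone", "benzene", "toluene"],
    "RISK": ["flammable", "toxic"],
}
sentences = [
    [(name, "SUBJECT"), (" is ", None), (risk, "RISK")]
    for name in entity_pool["SUBJECT"]
    for risk in entity_pool["RISK"]
]

train, test = split_train_test(sentences)
augmented = augment_by_replacement(train, entity_pool)
print(len(train), len(test), len(augmented))  # 9 1 10
```

Only the training set is augmented, so the test set still contains sentences that were annotated by hand.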
4.2. Experimental Details
The code used in our experiment is based on the code published by Xu et al., to which several models were added for the comparative experiments. Table 4 shows the experimental hyperparameters of each model.
4.3. Experimental Results and Analysis
To compare and validate the recognition performance of the BERT-BiLSTM-Self-Attention-CRF model on each kind of entity, comparative experiments were run with IDCNN-CRF, BiLSTM-CRF, BERT-CRF, and BERT-BiLSTM-CRF. Table 5 shows the results. The BERT-BiLSTM-Self-Attention-CRF model outperforms the other models on condition entities, on risk entities, and on average. Notably, the IDCNN-CRF model outperforms the other models on the SUBJECT entity category, which has a limited number of entities, owing to IDCNN's stronger ability than BiLSTM to extract corpus information from small samples. Because most consequence entities in the corpus are long, the BERT-BiLSTM-CRF model recognizes consequence entities better than the other models, including the model with a multihead self-attention layer. In long entities, the attention information carried by each character is superimposed and interferes across characters in the multihead self-attention layer, which affects the category assigned to the recognized entity. The condition entities and risk entities are mostly short, and on them the BERT model with a multihead self-attention layer performs better.
5. Conclusion
Based on the hazardous chemical risk information dataset, a named entity recognition model, BERT-BiLSTM-Self-Attention-CRF, is proposed. The pretrained language model BERT is introduced into the BiLSTM-CRF model to enrich the semantic features of the initial word vector. Using BERT to obtain the initial word vector compensates for the lack of corpus in the field of hazardous chemicals. The multihead self-attention mechanism captures the internal correlation between word vectors and pays more attention to important, highly correlated information. Experimental results on the hazardous chemical dataset show that BERT-BiLSTM-Self-Attention-CRF identifies hazardous chemical entities well, with a precision of 94.03%, a recall of 95.11%, and an F1 value of 94.57%.
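The reported F1 value is the harmonic mean of the reported precision and recall, which can be verified with a short calculation. The function name below is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.9403, 0.9511)
print(f"{f1:.2%}")  # 94.57%
```

The result agrees with the F1 value stated above, which confirms that the three reported figures are mutually consistent.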
At present, there are too few datasets related to hazardous chemicals, and there is no relatively complete knowledge graph of hazardous chemicals. In future research, it is necessary to further expand and improve the dataset of hazardous chemicals, extract events on this basis, and build a knowledge graph for hazardous chemicals.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no competing interest.
Acknowledgments
This work is supported by the Zhejiang Science and Technology Plan Project of China (No. 2020C03091) and the Zhejiang Provincial Central Government Guided Local Science and Technology Development Project (No. 2020ZY1010).
References
T. Xie, J. Yang, and L. Hui, “Chinese entity recognition based on BERT-BiLSTM-CRF model,” Computer Systems & Applications, vol. 29, no. 7, pp. 48–55, 2020.
A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
L. Ben and J. Donghong, “Joint extraction of drug entity and drug-drug interaction,” Computer Engineering and Design, vol. 365, no. 5, pp. 1377–1381, 2017.