Abstract

Because the electronic medical record data of patients with cerebrovascular disease are difficult to process, there is little mature technology capable of recognizing named entities in this domain. Excellent results have been achieved in the field of named entity recognition (NER), but several problems remain in the preprocessing of Chinese named entities with multiple meanings, one of which is neglecting contextual information. Therefore, to extract five categories of key entity information (diseases, symptoms, body parts, medical examinations, and treatments) from electronic medical records, this paper proposes a BERT-BiGRU-CRF named entity recognition method and applies it to the field of cerebrovascular diseases. The BERT layer first converts the electronic medical record text into low-dimensional vectors, the BiGRU layer then takes these vectors as input to capture contextual features, and finally a conditional random field (CRF) captures the dependencies between adjacent tags. The experimental results show that the F1 score of the model reaches 90.38%.

1. Introduction

Named entity recognition (NER) aims to extract entities with actual meaning from massive unstructured text data [1, 2]. In the medical field, medical entities mainly include symptoms, examinations, diseases, drugs, treatments, operations, body parts, etc., and they are an important part of building a medical knowledge base. The Chinese electronic medical record (CMR) [3] combines structured and unstructured text; it generally contains not only patient information but also a large amount of medical knowledge, yet it is difficult to process. With the development of deep learning technology, entity recognition algorithms have been applied in many fields, but applications in the field of cerebrovascular disease (CVD) are lacking [4].

Cerebrovascular disease has become one of the diseases most threatening to human health worldwide due to its four characteristics [5, 6]. The treatment of cerebrovascular disease depends highly on the doctor's experience. As the number of patients with CVD increases, so does the demand for cerebrovascular physicians; since the training cycle of professional doctors is relatively long [7, 8], this causes a supply-demand imbalance of "more patients and fewer doctors." With the introduction of the concept of "AI + Medical," machine learning technology is used to assist diagnosis and treatment: a complex model is constructed, a feedback mechanism continuously optimizes its parameters, and the hospital's existing clinical and neuroimaging data are then used to diagnose and treat cerebrovascular disease or predict recurrence. On the one hand, assisting diagnosis and treatment decisions helps improve the professional level of doctors and the quality of CVD medical services; on the other hand, it can mitigate the uneven distribution of medical resources [9]. At present, research on machine learning in the CVD field mainly focuses on two aspects, diagnosis and prognosis prediction: (1) From the perspective of CVD diagnosis, most scholars use structured data nested in machine learning models to complete disease diagnosis. The literature [10–12] established a joint diagnosis model based on logistic regression and the XGBoost machine learning method by collecting clinical data on demographic characteristics. (2) From the perspective of prognosis prediction, the use of machine learning methods for risk prediction has gradually become the trend in disease prediction, and methods such as random forest, decision tree, and SVM have achieved certain research results in the prediction of cerebrovascular disease. The literature [13–15] constructed logistic regression, k-NN, random forest, decision tree, and SVM models based on follow-up data and verified the advantages of machine learning models in cerebrovascular disease risk prediction, showing that neural network models score better. In short, in terms of clinical data sources, CVD medical data include cerebrovascular imaging data, follow-up data, electronic medical records, and other data. Most scholars still focus on structured data such as follow-up data and neuroimaging data, while attention to electronic medical record data in the CVD field is comparatively low. At present, the growing number of CVD patients is accompanied by an ever-increasing number of electronic medical records, which can provide scholars with more data resources. For processing the unstructured text in electronic medical records, named entity recognition (NER) is a key step, and relatively little research is dedicated to named entity recognition in the cerebrovascular field.

Current research on named entity recognition focuses on three aspects: (1) From the perspective of traditional entity recognition methods: traditional methods include those based on dictionaries and rules [16–23]. These methods rely heavily on domain dictionaries and domain experts; features are selected manually, so subjectivity and labor costs are relatively high. With the development of machine learning technology [19, 20], more and more scholars have turned to models such as the conditional random field (CRF), the Hidden Markov Model (HMM), and the Support Vector Machine (SVM). However, NER based on traditional machine learning has high requirements for feature selection [21–23], and the quality of the selected features directly affects the recognition results. (2) From the perspective of deep learning methods: with the development of deep learning technology, the literature [24–27] confirmed the advantages of deep neural networks by comparison with traditional CRF models; that is, deep neural networks require less manual feature engineering than traditional methods and can achieve higher precision and recall. Deep learning can automatically extract word features, reduce the subjectivity of feature selection, and help further improve recognition accuracy, so it outperforms traditional statistical algorithms such as CRF and HMM. A common single neural network for entity recognition, however, generally considers only the sample input and lacks modeling of the relationships among outputs. Based on the idea of model fusion, most scholars therefore use LSTM-CRF as the main framework to remedy the deficiencies of single neural network models. The literature [28–32] uses traditional word-vector methods such as word2vec and GloVe, takes a BiLSTM-CRF model as the core, and adds a CNN model, an attention mechanism, an RNN model, etc., to this core framework; with continuous fine-tuning of the word-vector parameters, the final recognition becomes more accurate. Parameter tuning is used to set the hyperparameters of the model; however, the BiLSTM-CRF model has many parameters to set, and its training time is long. The literature [26–31, 33, 34] proposed the BiGRU neural model, which has a simple structure and high computational efficiency, can make full use of context information to eliminate entity ambiguity, and has achieved good results in entity recognition. (3) From the perspective of pre-training models: the preprocessing models described above all use traditional word-vector methods such as word2vec and GloVe. These methods focus on feature extraction between words and often ignore contextual information. To address this problem, following the proposal of Google's BERT pre-training model, the literature [35–38] combined the BERT word embedding model with the traditional BiLSTM-CRF model and handled the polysemy of a word in combination with its context; the P value, R value, and F1 score were all improved. It can be seen that BERT has strong semantic analysis capabilities.

To solve the problems of ignored context information, low model efficiency, and susceptibility to word segmentation errors in entity recognition for CVD electronic medical records, we propose a BERT-BiGRU-CRF neural network model to identify named entities in electronic medical records of cerebrovascular disease. Specifically, the BERT layer first converts the electronic medical record text into low-dimensional vectors, the BiGRU layer then takes these vectors as input to capture contextual features, and finally a CRF captures the dependencies between adjacent tags. The entity extraction model proposed in this paper achieves good recognition results.

2. BERT-BiGRU-CRF Model Construction

In the NER field, deep neural network models have become the mainstream for entity recognition. This article uses BiGRU-CRF as the backbone to extract named entities in the field of cerebrovascular disease. The BERT pre-training language model is chosen because the model input is a text vector, and Chinese text can be divided at either character level or word level: existing research shows that character-level pretraining schemes yield better results [37, 39], and BERT is a character-level pretraining scheme. That is, each character in the text is converted into a vector by querying the vector table and used as the model input; the model output is a vector representation that incorporates the context.

The overall structure of the BERT-BiGRU-CRF model is shown in Figure 1. The model is mainly divided into 3 layers. The first layer is the BERT layer. Through the BERT pretraining language model, each word in the sentence is converted into a low-dimensional vector form. The second layer is the BiGRU layer, which aims to automatically extract semantic and temporal features from the context. The third layer is the CRF layer, which aims to solve the dependency between the output tags to obtain the global optimal annotation sequence of the text.
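To make the three-layer structure concrete, the following is a minimal sketch of the wiring in TensorFlow/Keras, the framework used in Section 3.3. The use of the HuggingFace transformers package, the bert-base-chinese checkpoint, and the GRU hidden size of 128 are illustrative assumptions, not details given in the paper; the CRF transition matrix and Viterbi decoding of Section 2.3 operate on top of the emitted tag scores.

```python
import tensorflow as tf
from transformers import TFBertModel  # assumption: HuggingFace transformers

NUM_TAGS = 11  # O plus B/I tags for the five entity categories (Section 3.2)

# BERT layer: a character-level Chinese checkpoint (illustrative choice)
bert = TFBertModel.from_pretrained("bert-base-chinese")
bert.trainable = False  # BERT parameters are kept fixed (Section 3.3)

input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# Layer 1: context-aware character vectors from BERT
embeddings = bert(input_ids, attention_mask=attention_mask).last_hidden_state

# Layer 2: BiGRU extracts forward and backward contextual features
features = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(128, return_sequences=True))(embeddings)

# Emission scores per tag; the CRF layer (Section 2.3) decodes these
logits = tf.keras.layers.Dense(NUM_TAGS)(features)

model = tf.keras.Model([input_ids, attention_mask], logits)
```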

In this study, the named entity recognition model was used to identify medical named entities in the electronic medical records of cerebrovascular disease. The specific steps are as follows (a decoding sketch of the output format follows this list):

(1) EMR data preprocessing: process the original electronic medical record text data set and express the text set as $D = \{d_1, d_2, \ldots, d_n\}$, where the $i$-th electronic medical record text is expressed as $d_i$. Predefined entity categories are divided and annotated at character level, with the characters and predefined categories separated by spaces.

(2) Construct the electronic medical record text training data set.

(3) Model training: train the BERT-BiGRU-CRF named entity recognition model. Take the electronic medical record test text collection as input, and take each entity together with its corresponding category as output: $\{\langle e_j, c_j \rangle\}$, where the entity $e_j = d_i[s_j : t_j]$ represents an entity that appears in the document; $s_j$ and $t_j$, respectively, represent the start and end positions of $e_j$ in $d_i$; entities are required not to overlap, i.e., $t_j < s_{j+1}$; and $c_j$ represents the predefined category of the entity $e_j$. Then calculate the F1 score from the precision rate and the recall rate, and use the F1 score as the model's comprehensive evaluation index.
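As a concrete illustration of the output format in step (3), the sketch below converts a character-level BIO tag sequence into ⟨entity, start, end, category⟩ tuples; the helper name and the sample sentence are hypothetical.

```python
def bio_to_entities(chars, tags):
    """Convert character-level BIO tags into <entity, start, end, category> tuples."""
    entities, start, category = [], None, None
    for i, tag in enumerate(tags):
        if tag.endswith("-B"):                      # e.g. "Symptom-B" opens an entity
            if start is not None:
                entities.append(("".join(chars[start:i]), start, i - 1, category))
            start, category = i, tag[:-2]
        elif tag.endswith("-I") and category == tag[:-2]:
            continue                                # still inside the current entity
        else:                                       # "O" or inconsistent tag closes it
            if start is not None:
                entities.append(("".join(chars[start:i]), start, i - 1, category))
            start, category = None, None
    if start is not None:
        entities.append(("".join(chars[start:]), start, len(tags) - 1, category))
    return entities

# e.g. bio_to_entities(list("头晕三天"), ["Symptom-B", "Symptom-I", "O", "O"])
# -> [("头晕", 0, 1, "Symptom")]
```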

2.1. BERT Pre-training Language Model

Bidirectional Encoder Representations from Transformers (BERT) [40] is an unsupervised, deep bidirectional language representation model for pre-training. In order to accurately represent the context-related semantic information in the EMR, the model's interface is called to obtain the embedded representation of each word in the electronic medical record. BERT uses a deep bidirectional Transformer encoder as its main structure. The Transformer introduces the self-attention mechanism and also draws on the residual mechanism of convolutional neural networks, so the model trains quickly and has strong expressive ability; it also abandons the recurrent structure of RNNs. The overall structure of the BERT model is shown in Figure 2.

$E_n$ is the coded representation of the word, Trm is the Transformer structure, and $T_n$ is the word vector of the target word after training. The model uses the Transformer structure to construct a multi-layer bidirectional encoder network that reads the entire text sequence at once, so that each layer can integrate contextual information. The input of the BERT model is the element-wise sum of three embeddings, Token Embeddings, Segment Embeddings, and Position Embeddings, which supports the pre-training objectives, including next-sentence prediction. In Chinese electronic medical record text, characters or words in different positions carry different semantics. The Transformer embeds the relative or absolute position information of each token in the sequence, as shown in the following formulae:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad (1)$$

$$PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad (2)$$

where $pos$ is the position of the word in the text, $i$ represents the dimension, and $d_{model}$ is the dimension of the encoded vector. Odd positions are encoded using the cosine function, and even positions are encoded using the sine function.
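For reference, a minimal NumPy sketch of formulae (1) and (2); the vectorized layout is an implementation choice, not a detail from the paper.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding of formulae (1)-(2)."""
    pos = np.arange(max_len)[:, np.newaxis]       # position of the word in the text
    i = np.arange(d_model)[np.newaxis, :]         # embedding dimension index
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])          # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])          # odd dimensions: cosine
    return pe
```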

In order to better capture word-level and sentence-level information, the BERT pre-training language model is jointly trained by two tasks: Masked Language Model and Next Sentence Prediction. The Masked LM model [36] is similar to cloze filling. 15% of the words in the random mask corpus are marked with the “MASK” form, and then the BERT model is used to correctly predict the masked words. The strategy adopted in the training is that for 15% of the words, only 80% of the words are actually replaced with [mask], 10% of the words will be randomly replaced with other words, and the remaining 10% are unchanged. The Next SP model is to train the model to understand the relationship between sentences, that is, to judge whether the next sentence is the next sentence of the previous sentence. The specific method is to randomly select 50% correct sentence pairs from the text corpus, and 50% randomly select sentence pairs to judge the correctness of the sentence pairs. The Masked LM word processing and Next SP sentence processing are jointly trained to ensure that the information is represented by the vector of each word, so the model is comprehensive and semantically accurate. It fully depicts the characteristics of the character‐level, word-level, sentence‐level and even the relationship between sentences and increases the generalization ability of the BERT model.
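The 80%/10%/10% masking strategy can be sketched as follows; the function and the vocabulary handling are illustrative assumptions rather than BERT's actual training code.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Masked LM corruption: of the ~15% selected tokens, 80% become [MASK],
    10% become a random word, and 10% are left unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                      # model must predict the original
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # random replacement
            # else: keep the original token
    return masked, labels
```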

2.2. BiGRU Layer

The Gated Recurrent Unit (GRU) [34] is a variant of the long short-term memory (LSTM) neural network. The LSTM structure includes a forget gate, an input gate, and an output gate. In training traditional recurrent neural networks (RNNs), the gradient often vanishes or explodes; LSTM solves the vanishing gradient problem only to a certain extent, and its computation is time-consuming. The GRU structure includes an update gate and a reset gate: GRU merges the forget gate and the input gate of the LSTM into a single update gate. Therefore, GRU not only retains the advantages of LSTM but also simplifies the network structure. In the entity recognition task for cerebrovascular disease electronic medical records, GRU can extract features effectively. Its network structure is shown in Figure 3.

In the GRU structure, the update gate is $z$ and the reset gate is $r$. The update gate calculates how much electronic medical record information from the previous hidden state $h_{t-1}$ needs to be transmitted to the current hidden state $h_t$: $z_t$ takes a value in [0, 1], where a value close to 1 means the information is transmitted and a value close to 0 means it is ignored. The reset gate is computed analogously to the update gate but with a different weight matrix. The calculation of $z_t$ and $r_t$ is shown in formulae (3) and (4): the electronic medical record input at time $t$ and the hidden state of the previous time step are multiplied by their corresponding weights, summed, and passed through the $\sigma$ function. After $z_t$ and $r_t$ are computed, the content to be memorized at time $t$ can be calculated. The reset gate determines how much of the hidden state at $t-1$ should be ignored at time $t$; then $x_t$, $r_t$, and $h_{t-1}$ are combined and the tanh function is used to compute the candidate hidden state $\tilde{h}_t$. Finally, the retained cerebrovascular disease information is passed to the next unit; that is, at time $t$, the product of $z_t$ and $\tilde{h}_t$ represents the new information the hidden unit needs to retain, while the product of $(1 - z_t)$ and $h_{t-1}$ determines how much past information is preserved or forgotten. See formulae (5) and (6) for details:

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right), \qquad (3)$$

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right), \qquad (4)$$

$$\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right), \qquad (5)$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t, \qquad (6)$$

where $x_t$ is the input of the cerebrovascular disease electronic medical record at time $t$ and $h_{t-1}$ is the hidden state at the previous time step; $h_t$ is the hidden state at time $t$; $W$ is the weight matrix of the candidate hidden state; $W_z$ is the update gate weight matrix and $W_r$ is the reset gate weight matrix; $\sigma$ is the sigmoid nonlinear transformation function and tanh is the activation function; and $\tilde{h}_t$ is the candidate hidden state.
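A single GRU step following formulae (3)-(6) can be written directly in NumPy; bias terms are omitted to match the formulae above, and the weight layout (acting on the concatenation of $h_{t-1}$ and $x_t$) is an implementation assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU step per formulae (3)-(6); weights act on [h_{t-1}; x_t]."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx)                                      # update gate (3)
    r_t = sigmoid(Wr @ hx)                                      # reset gate (4)
    h_cand = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate (5)
    return (1.0 - z_t) * h_prev + z_t * h_cand                  # new state (6)
```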

From the operating principle of the GRU unit, it can discard some useless information, and its simple structure reduces computational complexity. However, a unidirectional GRU cannot fully utilize the context of the electronic medical record. Therefore, this paper adds a backward GRU to learn the backward semantics, so that the network extracts the key features of named entities in cerebrovascular disease electronic medical records in both the forward and backward directions; this is the BiGRU model, whose structure is shown in Figure 4. Based on the GRU principle, the forward GRU obtains the preceding semantic features $\overrightarrow{h_t}$, the backward GRU obtains the following semantic features $\overleftarrow{h_t}$, and finally the two are combined to obtain $h_t$. Refer to formulae (7) and (8) for details:

$$\overrightarrow{h_t} = \mathrm{GRU}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad (7)$$

$$\overleftarrow{h_t} = \mathrm{GRU}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad (8)$$

where $\overrightarrow{h_t}$ is the hidden state that obtains the preceding information from the forward GRU, representing front-to-back features; $\overleftarrow{h_t}$ is the hidden state that obtains the following information from the backward GRU, representing back-to-front features; and the final hidden state $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ is the feature representation of the electronic medical record.
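Building on the gru_step sketch above, the bidirectional combination of formulae (7) and (8) can be illustrated as follows; for brevity this sketch shares one set of weights between directions, whereas in practice the forward and backward GRUs have separate parameters.

```python
import numpy as np

def bigru(xs, h0, Wz, Wr, W):
    """Run gru_step forward and backward over the sequence (formulae (7)-(8))
    and concatenate the two hidden states at each position."""
    fwd, h = [], h0
    for x in xs:                       # left-to-right: preceding context
        h = gru_step(x, h, Wz, Wr, W)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):             # right-to-left: following context
        h = gru_step(x, h, Wz, Wr, W)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```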

2.3. CRF Layer

The NER problem can be regarded as a sequence labeling problem. The BiGRU layer outputs the hidden-state context feature sequence, denoted as

$$h = (h_1, h_2, \ldots, h_n). \qquad (9)$$

This vector only considers the context information in the electronic medical record and does not consider the dependencies between labels. Therefore, this paper adds a CRF layer to produce the globally optimal label sequence, converting the hidden state sequence into the optimal label sequence $y = (y_1, y_2, \ldots, y_n)$. The CRF calculation principle [34] is as follows: first, for the specified electronic medical record input sequence $h$, calculate the score of each candidate label sequence, as shown in formula (10); second, calculate the normalized probability of a sequence $y$ through the Softmax function, as shown in formula (11); finally, compute the label sequence with the highest score using the Viterbi algorithm, as shown in formula (12):

$$\mathrm{score}(h, y) = \sum_{i=1}^{n} \left( A_{y_{i-1}, y_i} + P_{i, y_i} \right), \qquad (10)$$

$$p(y \mid h) = \frac{\exp\left(\mathrm{score}(h, y)\right)}{\sum_{\tilde{y} \in Y_h} \exp\left(\mathrm{score}(h, \tilde{y})\right)}, \qquad (11)$$

$$y^{*} = \arg\max_{\tilde{y} \in Y_h} \mathrm{score}(h, \tilde{y}), \qquad (12)$$

where $A$ is the transition score matrix between tags; $\mathrm{score}(h, y)$ is the score of label sequence $y$; $P$ is the emission score matrix output by the BiGRU layer; $Y_h$ represents all possible tag sequences; and formula (10) computes the score of each sequence from the output feature matrix of the BiGRU layer and the CRF transition matrix.
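Formula (12) is computed with the Viterbi algorithm; a minimal NumPy sketch over the BiGRU emission scores and the transition matrix $A$ is shown below (start/stop transition scores are omitted for brevity).

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence (formula (12)).
    emissions: (n, k) BiGRU output scores; transitions: (k, k) CRF matrix A."""
    n, k = emissions.shape
    score = emissions[0].copy()        # best score ending in each tag so far
    backpointers = []
    for t in range(1, n):
        # total[i, j] = score of ending in tag i at t-1 and moving to tag j
        total = score[:, np.newaxis] + transitions + emissions[t]
        backpointers.append(np.argmax(total, axis=0))
        score = np.max(total, axis=0)
    best = [int(np.argmax(score))]     # trace back the optimal path
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]
```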

2.4. Training Process

The process of training a deep network model is a process of repeatedly adjusting parameters so that the loss reaches a minimum. However, because deep network models have strong learning ability, generalization problems easily occur: under-fitting and over-fitting make the model adapt poorly to new sample data. Regularization methods produce models with small parameter values; such models have strong anti-interference ability, can adapt to different data sets and different "extreme conditions," and thus gain generalization ability during network training. This paper uses the L2 regularization method to avoid the over-fitting problem; that is, a regularization term is added to the cost function, as shown in the following formula:

$$J = J_0 + \frac{\lambda}{2} \sum_{w} w^{2}, \qquad (13)$$

where $J_0$ is the training sample error that does not include the regularization term, $\lambda$ is the adjustable regularization parameter, and $w$ represents the weight parameters.
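A minimal sketch of formula (13): the L2 penalty is added to the unregularized loss. In TensorFlow/Keras the same effect can be obtained with tf.keras.regularizers.l2 on each layer; the helper below is illustrative.

```python
import numpy as np

def l2_regularized_loss(base_loss, weights, lam):
    """Formula (13): J = J0 + (lambda / 2) * sum of squared weights."""
    penalty = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    return base_loss + penalty
```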

3. Experimental Design

3.1. Data Preparation

The experimental data in this article were obtained from the Ai'ai medical electronic medical records website: a total of 1,300 electronic medical records related to cerebrovascular disease, each composed of general patient information, chief complaint, medical history, physical examination, and diagnosis. In addition, this article sorts out the entity types used in published papers on named entities in electronic medical records, as shown in Table 1. According to the frequency with which entities occur in the published literature, the electronic medical record entities for cerebrovascular disease are divided into five types: disease, symptom, body part, examination, and treatment, which are also the categories proposed by CCKS.

3.2. Data Preprocessing

The electronic medical record text is preprocessed, that is, line breaks, invalid characters, etc., are removed, yielding 36,400 sentences. The data are split into training and testing sets at a ratio of 8 : 2. The labeling scheme used in this article is BIO labeling. With five entity types (disease, symptom, body part, examination, and treatment), there are 11 labels: O, Disease-B, Disease-I, Body-B, Body-I, Symptom-B, Symptom-I, Examination-B, Examination-I, Treatment-B, and Treatment-I. Named entity labeling was conducted together with doctors. For 300 of the medical records, two designated annotators annotated the same records simultaneously, and Cohen's kappa was used to measure annotation consistency, yielding a kappa value of 0.8. The labels to be predicted are shown in Table 2.
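The character-level BIO format described above can be illustrated with a hypothetical sentence, "患者头晕三天" ("the patient has been dizzy for three days"), in which "头晕" (dizziness) is a symptom entity; the file name is illustrative.

```python
# Hypothetical annotated sentence; "头晕" (dizziness) is a symptom entity.
chars = list("患者头晕三天")
tags = ["O", "O", "Symptom-B", "Symptom-I", "O", "O"]

# Write one "character<space>tag" pair per line, as in the annotated corpus.
with open("train.txt", "w", encoding="utf-8") as f:
    for ch, tag in zip(chars, tags):
        f.write(f"{ch} {tag}\n")
```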

3.3. Experimental Settings

The experimental model in this article is built using the TensorFlow deep learning framework and the Python programming language. The parameter update scheme updates the parameters of the BiGRU-CRF part, while the BERT parameters are kept fixed. Table 3 lists the hyperparameter values of the experimental model; these values were set according to the relevant literature [34–38, 41] and were not further tuned on the cerebrovascular electronic medical record data set used in this article. Model parameters are optimized with stochastic gradient descent (SGD); the initial learning rate is 0.015, and the learning rate is updated with a step decay method with a decay rate of 0.05. The model achieved good experimental results on the training and test sets.
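The paper does not give the exact step decay formula; a common form consistent with the stated values (initial rate 0.015, decay rate 0.05) is sketched below as an assumption.

```python
def step_decay_lr(epoch, lr0=0.015, decay=0.05):
    """Assumed step decay: multiply the learning rate by (1 - decay) each epoch."""
    return lr0 * (1.0 - decay) ** epoch

# e.g. step_decay_lr(0) -> 0.015, step_decay_lr(3) -> ~0.0129
```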

3.4. Evaluation

This article uses the most common evaluation indices in the field of named entity recognition: precision (P), recall (R), and F1 score. P is the proportion of recognized named entities that are correct, R is the proportion of named entities in the test set that are correctly recognized, and F1 is the harmonic mean of P and R, serving as the comprehensive evaluation index of the model. Higher P and R values indicate higher precision and recall, but the two are in tension in some cases; therefore, the F1 score is often used to evaluate the overall performance of the model. The calculation formulae are:

$$P = \frac{N_c}{N_a} \times 100\%, \qquad (14)$$

$$R = \frac{N_c}{N_t} \times 100\%, \qquad (15)$$

$$F1 = \frac{2 \times P \times R}{P + R}, \qquad (16)$$

where $N_c$ is the number of correctly identified entities, $N_a$ is the total number of entities identified, and $N_t$ is the number of entities in the test set.
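Computed over entity tuples (such as those produced by the bio_to_entities sketch earlier), the three indices of formulae (14)-(16) reduce to a few lines; the helper name is illustrative.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over entity tuples (formulae (14)-(16))."""
    correct = len(set(predicted) & set(gold))   # correctly identified entities
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```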

4. Experimental Results

4.1. Comparative Experiment Analysis

In order to demonstrate the entity recognition performance of the BERT-BiGRU-CRF model, this article uses BiLSTM-CRF as the baseline model, and the selected comparison models are BiGRU-CRF, BiLSTM-CRF, and BERT-BiLSTM-CRF:

(1) BiGRU-CRF model: this model inputs trained word vectors into a BiGRU-CRF network for training.

(2) BiLSTM-CRF model: this model is a classic model in the NER field; it uses trained word vectors and then applies the BiLSTM-CRF model to extract entities.

(3) BERT-BiLSTM-CRF model: this model is based on the Google BERT model; many scholars have embedded BERT in the BiLSTM-CRF model and achieved better recognition results in NER research.

4.2. Model Performance Comparison

The comparison model proposed in this paper first uses electronic medical record data for training and then uses a test set for testing. The specific comparison results are shown in Table 4.

From the comparison results in Table 4 and Figure 5, in terms of the comprehensive evaluation indices of precision, recall, and F1 score, the BERT-BiGRU-CRF model proposed in this article improves on the BiGRU-CRF model by 2.9%, 5.0%, and 3.95%, respectively. The difference between these two models is the embedding of BERT, which shows that BERT embedding can improve entity recognition. Compared with the BiLSTM-CRF model, the improvements are 3.14%, 4.40%, and 4.34%, respectively; compared with the BERT-BiLSTM-CRF model, the improvements are 1.25%, 0.77%, and 1.01%, respectively. P, R, and F1 are therefore all improved relative to the baseline model, indicating that the BERT-BiGRU-CRF model is better suited to electronic medical record recognition in the CVD field. On the one hand, this is due to the stronger feature extraction ability of the embedded BERT, which enables the word vectors to fuse contextual information; on the other hand, the BiGRU-CRF model can take in bidirectional information from before and after each position in the sequence, which effectively avoids entity ambiguity.

Figure 6 compares the recognition of each entity type horizontally across the different models: relative to BiLSTM-CRF, BERT-BiLSTM-CRF, and BiGRU-CRF, the scores on disease entities increased by 9.87%, 2.73%, and 9.63%, respectively; on symptom entities by 1.62%, −0.33%, and 3.29%; on body part entities by 2.85%, 0.45%, and 3.10%; on examination entities by 0.76%, −0.25%, and 0.49%; and on treatment entities by 3.8%, 2.46%, and 3.29%. Comparing the recognition of the different entity types longitudinally, examination entities are recognized better than in the comparison models, with an F1 score reaching 90%. However, treatment entities are recognized relatively poorly because these entities are relatively long and their boundaries cannot be clearly identified. In short, the recognition performance of the BERT-BiGRU-CRF model proposed in this paper is higher than that of the control group.

4.3. Model Training Time

Model training is the process of updating parameters. This article analyzes the relationship between epoch and F1 score over the first 10 epochs for the four models. As can be seen from Figure 7, the F1 score of the neural network models without BERT rises continuously from a lower level, while the F1 score of the models with BERT stays at a higher level and requires fewer iterations to reach its optimum. Moreover, the F1 score of the BERT-BiGRU-CRF model is the highest overall. Regarding training time, Table 5 lists the time required per iteration for each model: one round of training the BERT-BiGRU-CRF model is 37 seconds shorter than that of the BERT-BiLSTM-CRF model, owing to the simpler structure and higher computational efficiency of the BiGRU-CRF part. In addition, comparing the BiGRU-CRF and BERT-BiGRU-CRF models, it is worth noting that adding BERT with whole word masking to the neural network model improves the overall training efficiency.

In summary, the BERT-BiGRU-CRF entity recognition model proposed in this paper has a better recognition effect than the control group. This model can make full use of context information, further avoid ambiguity, and effectively avoid repetition between entities, and the granularity of word segmentation in this article is small, which can improve the accuracy of entity recognition.

4.4. Entity Recognition Result

Using the BERT-BiGRU-CRF named entity recognition model, this paper identified 9,393 entities (without deduplication). Among them, body parts are described most frequently in the electronic medical records, followed by symptom and examination entities, while treatment and disease entities are fewer. The specific results are shown in Figure 8.

5. Conclusions

For the text data of cerebrovascular disease electronic medical records, this paper proposes a BERT-BiGRU-CRF entity recognition model to identify five types of key entities in the field of cerebrovascular disease: disease, symptom, body part, examination, and treatment. The model obtains word vectors that incorporate context information through the BERT layer and then obtains the optimal annotation sequence through the BiGRU-CRF neural network. It not only guarantees a simple network structure and fast training but also resolves ambiguity by combining context information. In future work, on the one hand, we will study the construction of high-quality dictionaries; on the other hand, we will extract the relationships between different entities based on NER to construct a knowledge graph in the field of cerebrovascular disease, which will help further mine the latent information in electronic medical records in the CVD field.

Data Availability

The Chinese electronic medical record data used to support the findings of this study have been deposited in the Ai’ai medical repository.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the Beijing Municipal Commission of Science and Technology Project (Z131100005613017), "Model and Demonstration Application of Collaborative Prevention and Treatment of Major Diseases in Beijing's Medical Reform."