As the biomedical literature grows exponentially, biomedical named entity recognition (BNER) has become an important task in biomedical information extraction. In previous deep learning studies, pretrained word embeddings have become an indispensable part of neural network models and effectively improve their performance. However, the biomedical literature typically contains numerous polysemous and ambiguous words, so fixed pretrained word representations are not appropriate. Therefore, this paper adopts embeddings from language models (ELMo) to generate dynamic word embeddings according to context. In addition, to mitigate the problem of insufficient training data in specific fields and to introduce richer input representations, we propose a multitask learning multichannel bidirectional gated recurrent unit (BiGRU) model. Multiple feature representations (e.g., word-level, contextualized word-level, and character-level) are fed, separately or jointly, into different channels. Because BiGRU captures features automatically, manual participation and feature engineering can be avoided. In the merge layer, multiple methods are designed to integrate the outputs of the multichannel BiGRU. We combine BiGRU with a conditional random field (CRF) to model the dependence between labels in sequence labeling. Moreover, in the multitask learning framework, we introduce auxiliary corpora with the same entity types as the main corpora to be evaluated, then train our model on these separate corpora with shared parameters. Our model obtains promising results on the JNLPBA and NCBI-disease corpora, with F1-scores of 76.0% and 88.7%, respectively. The latter is the best performance among reported feature-based models.

1. Introduction

Named entity recognition (NER) aims to identify and extract specific entities (persons, places, organizations, and so on) from massive unstructured text data, and it has become a primary task for information extraction, text analysis, text mining, etc. Similarly, how to effectively extract valuable information has become a serious challenge for researchers in the biomedical field. Biomedical named entity recognition (BNER) is an indispensable step in addressing this challenge. Biomedical entities include genes, proteins, diseases, drugs, chemicals, and so on.

In the past, conventional machine learning methods were widely used for NER, such as support vector machine (SVM), conditional random field (CRF), and maximum entropy model (MEM). Finkel et al. [1] combined distant resources and additional features to identify the biomedical entities. Tsuruoka et al. [2] employed MEM to develop a BNER system named GENIA Tagger. ABNER [3] was a biomedical entity extraction system based on CRF. Chang et al. [4] adopted biomedical word embeddings as external features to significantly improve the performance of CRF. Liao et al. [5] adopted the Skip-Chain CRF model to recognize entities, which effectively captured the features of the distant context. Tang et al. [6] used a CRF model with three different types of word representations to identify biological entities. According to the above studies, CRF had become the mainstream model in BNER [7]. Nevertheless, feature engineering is an essential element of conventional machine learning methods: researchers must manually design complex templates, which requires domain knowledge and is time-consuming.

Driven by artificial intelligence and pattern recognition, labor-saving and advanced technologies have been developed in natural language processing, computer vision, and other emerging fields [8–17]. For example, deep learning can substantially reduce the high cost of feature engineering. The widely employed neural networks include convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and gated recurrent unit networks (GRUs). Yao et al. [18] first built a multilayer neural network to obtain biomedical word embeddings on large-scale corpora. To extract disease and chemical entities, Zhao et al. [19] constructed a CNN model. In this work, BNER was treated as text classification, and a multilabel mechanism was designed to obtain contiguous labels. Zhu et al. [20] adopted a CNN structure in BNER with n-gram local character and word embeddings. The resulting GRAM-CNN obtained the best performance (F1-score: 87.3) among the single-task models on the NCBI-disease corpus. Li et al. [21] made connections between the twin word embeddings and sentence vectors. Furthermore, they adopted the bidirectional LSTM (BiLSTM) to identify biomedical entities and significantly improved the performance. Limsopatham et al. [22] proposed an end-to-end model based on BiLSTM and orthographic features. It was designed to improve the extraction of complex biomedical terms. SBLC was developed by Xu et al. [23] based on word embeddings and the BiLSTM-CRF structure. Dang et al. [24] also proposed a BNER model based on the BiLSTM-CRF structure and adopted various fine-tuned linguistic embeddings. The model showed high performance on multiple corpora. Lyu et al. [25] adopted the BiLSTM-RNN model and combined biomedical word embeddings with character embeddings to recognize entities. In addition, multitask learning and transfer learning have been widely used in BNER and have achieved competitive performance. Wang et al. [26] jointly trained different types of entities in multiple data sets and shared both word and character representations among relevant entities. The multitask model achieved promising performance on 15 biomedical corpora. Yoon et al. [27] proposed a multitask framework termed CollaboNet. It connected multiple submodels trained on different corpora. The large performance gains came from taking turns training the target and collaborator submodels. Sachan et al. [28] designed a pretrained BiLSTM model. They first trained a language model of the same structure on unlabeled corpora and then used its parameters to initialize the BNER model through transfer learning. This not only substantially improved the performance but also alleviated the lack of high-quality labeled training data.

From the above studies, word embeddings have clearly become indispensable representations. They can effectively represent the semantic features of the original text sequences. However, the naming rules of biomedical entities are vague, and there are many polysemous and ambiguous words in the biomedical literature. For example, in “This cohort underwent follow-up for cancer incidence through the Finnish cancer registry to the end of 1995.”, the first “cancer” refers to a disease and the second is part of the name of an institution. In addition, it is difficult to address the lack of sufficient training samples in specific fields. As a result, biomedical entities are harder to recognize than entities in the general field. Because the traditional fixed word embeddings cannot accurately represent polysemous and ambiguous words in the biomedical literature, language models pretrained on a large number of unlabeled open corpora have drawn more and more attention. The contextualized word embeddings they generate can improve the feature representations of polysemous and ambiguous words. In the general field, Peters et al. [29] designed a feature-based language model named ELMo, which consists of a bidirectional LSTM. This pretrained language model achieves state-of-the-art performance in multiple downstream tasks.

We aim to optimize the representations of polysemic words and ambiguous words in biomedical sequences and make the model fully capture richer features. This paper proposes a multitask learning multichannel BiGRU-CRF model with feature-based contextualized word representations. The main contributions of this paper are as follows.

1) We propose a multichannel BiGRU-CRF model. Three kinds of feature representations based on the biomedical pretrained dictionary, ELMo, and CNN are generated, including word-level, contextualized word-level, and character-level representations. These representations are used as inputs either separately or in combination, and each set of inputs is fed into a BiGRU-CRF model as a single channel. In the merge layer, multiple methods are designed to integrate the outputs of the multichannel BiGRU.

2) In order to address the lack of sufficient training data in specific fields, we adopt a multitask learning strategy, employing auxiliary corpora to provide richer training samples and relevant information for the main corpora to be evaluated.

3) The multitask learning multichannel BiGRU-CRF model clearly strengthens the capability of recognizing entities without any manual intervention. It obtains competitive results on the JNLPBA and NCBI-disease corpora.

The rest of this paper is divided into the following four sections. Section 2 describes the methods. Section 3 shows the experimental settings. Section 4 reports the evaluative results in a detailed manner. Section 5 provides the conclusion.

2. Methods

Figure 1 shows the multitask learning multichannel BiGRU-CRF framework. The framework is divided into five parts: input layer, embedding layer, BiGRU, merge layer, and CRF layer, where the input layer represents the original sentence in the corpora. First, the three feature representations are obtained through the biomedical pretrained dictionary, CNN, and the ELMo language model, respectively. Then, the multichannel BiGRU is used to capture features; in Figure 1, the forward and backward single-channel GRUs are marked separately. Next, we integrate the output of each channel in the merge layer. Finally, the labels are parsed by CRF. This section describes the remaining four parts in detail.

2.1. Embedding Layer

To ensure the maximum coverage of the input information, the pretrained word embeddings, contextualized word embeddings, and character embeddings are used for the input layer for feature representations.

2.1.1. Pretrained Word Embedding

We represent the text sequence with word embeddings. They map words to dense vectors according to semantic relevance. The word embedding method alleviates the curse of dimensionality that affects the conventional one-hot method. With the development of natural language processing, word embeddings have become the most important input feature representations. The widely adopted word embedding computing tools include Word2Vec [30] and GloVe [31].

Previous biomedical studies have provided related open source word embeddings pretrained on large-scale unlabeled corpora. We initialize the word embeddings by a “look up” operation. Inspired by Quan et al. [32], this paper adopts the word embeddings pretrained on PMC and PubMed biomedical corpora.

2.1.2. Contextualized Word Embedding

This paper directly transfers the pretrained ELMo language model proposed by Peters et al. [29] to obtain the contextualized word embeddings. The main motivation is that the contextualized word representations should be able to contain rich syntactic and semantic information. The conventional word embeddings (e.g., word2vec) are context-independent, whereas ELMo can generate dynamic word embeddings based on context. We adopt the 2-layer ELMo to obtain the contextualized word representations as part of the multichannel BiGRU-CRF model’s input, which is shown in Figure 2. ELMo consists of a bidirectional LSTM language model. Its training objective jointly maximizes the log-likelihood of the forward and backward sub-models. For the $k$-th word, a set of contextualized word representations can be computed by ELMo as follows:

$$R_k = \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L\}, \qquad \mathrm{ELMo}_k = \sum_{j=0}^{L} s_j h_{k,j}^{LM}$$

where $x_k^{LM}$ denotes the original embeddings layer, with $h_{k,0}^{LM} = x_k^{LM}$. $\overrightarrow{h}_{k,j}^{LM}$ and $\overleftarrow{h}_{k,j}^{LM}$ denote the forward and backward LSTM layer, respectively, and $h_{k,j}^{LM}$ is their concatenation for $j \geq 1$. $s_j$ denotes the softmax-normalized weights, and $L$ denotes the number of layers. ELMo generates word representations based on the above formula, which is a weighted sum of the hidden states of the bidirectional language model. They can be directly concatenated with other feature inputs. The contextualized word embeddings not only reflect complex semantic and grammatical features but also adapt accurately to different contexts.
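The layer combination described above can be illustrated with a small sketch. It assumes the standard ELMo formulation of Peters et al. [29] without the task-specific scale factor; the function names `softmax` and `combine_layers` and the toy vectors are hypothetical and not part of the original model code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_layers(layer_states, raw_weights):
    """Weighted sum of per-layer vectors for one token.

    layer_states: L+1 vectors (embedding layer plus L LSTM layers),
                  all of the same dimension.
    raw_weights:  L+1 unnormalized scalars; softmax turns them into s_j.
    """
    s = softmax(raw_weights)
    dim = len(layer_states[0])
    return [sum(s[j] * layer_states[j][d] for j in range(len(layer_states)))
            for d in range(dim)]

# Toy token for a 2-layer model (3 vectors including the embedding layer).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = combine_layers(states, [0.0, 0.0, 0.0])  # equal raw weights -> s_j = 1/3
```

With equal raw weights every layer contributes one third, so the result is simply the average of the three layer vectors.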

2.1.3. Character Embedding

Character representations capture the morphological information of a word from all the characters that make it up. Combining them with other feature representations can better describe the morphological features of a word [33, 34]. Previous studies have shown the effectiveness of character representations in NER. This paper adopts CNN to compute the character vectors of words in biomedical sequences. The structure of CNN is shown in Figure 3, including the randomly initialized character embeddings, the convolutional layer, and the pooling layer. First, the embedding matrix of a word is built from the embeddings of its characters, and padding is applied so that words of different lengths have the same size. Then, the local features of the initialized character embedding matrix are captured by a convolution operation. Finally, the character representations are obtained by performing a max-pooling operation.
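The three steps (padding, convolution, max-pooling) can be sketched in plain Python. This is only an illustration: the padding token, the tiny embedding table, and the single hand-written filter are assumptions chosen for readability, whereas the real model learns its embeddings and filters.

```python
def pad_chars(word, max_len, pad_char="<pad>"):
    """Truncate or right-pad a word to exactly max_len character slots."""
    chars = list(word[:max_len])
    return chars + [pad_char] * (max_len - len(chars))

def conv_max_pool(char_vectors, filters):
    """Slide each filter over the character axis, then max-pool over positions.

    char_vectors: list of T character vectors (dimension D).
    filters:      list of filters, each a list of `width` vectors (dimension D).
    Returns one pooled value per filter.
    """
    features = []
    for filt in filters:
        width = len(filt)
        responses = []
        for start in range(len(char_vectors) - width + 1):
            window = char_vectors[start:start + width]
            responses.append(sum(w * x
                                 for fv, cv in zip(filt, window)
                                 for w, x in zip(fv, cv)))
        features.append(max(responses))
    return features

# Hypothetical 2-dimensional character embeddings.
table = {"a": [1.0, 0.0], "b": [0.0, 1.0], "<pad>": [0.0, 0.0]}
vectors = [table[c] for c in pad_chars("ab", 4)]
# One width-2 filter that responds most strongly to the bigram "ab".
features = conv_max_pool(vectors, [[[1.0, 0.0], [0.0, 1.0]]])
```

The output has one entry per filter regardless of word length, which is what allows the character vector to be concatenated with the word-level embeddings.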

2.2. Multichannel BiGRU

Recently, to solve the exploding or vanishing gradient problem, a variety of improved models based on RNN have been proposed, such as LSTM [35] and GRU [36–38]. They capture distant information and address the vanishing or exploding gradient by designing memory units and gate mechanisms. Therefore, these improved models have become the major option for sequence labeling tasks such as BNER. The difference between LSTM and GRU is the structure of the gate mechanisms. GRU maintains the performance of LSTM while making the gate structures simpler [39, 40]. Because we need to train multiple identical networks at the same time, this paper adopts GRU, which has lower computational complexity. Figure 4 shows the GRU units. The relevant formulas are as follows.

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]), \qquad r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t]), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $\sigma$ denotes the sigmoid function. $z_t$ and $r_t$ denote the update and reset gate. $x_t$ denotes the feature vectors. $W$ denotes the weights of the gate mechanism. $\tilde{h}_t$ denotes the current state. $\tanh$ denotes the hyperbolic tangent function. $h_t$ denotes the final output.
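A single GRU time step can be sketched as follows. The sketch assumes the commonly used formulation in which the gates are computed from the concatenation of the previous state and the current input, the new state interpolates between the previous state and the candidate state, and bias terms are omitted; `gru_step` and `matvec` are hypothetical helper names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step without biases (assumed standard form).

    Each weight matrix has len(h_prev) rows and len(h_prev) + len(x_t) columns.
    """
    concat = h_prev + x_t                                   # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, concat)]           # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]           # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t    # [r * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(W_h, gated)]    # candidate state
    return [(1 - zi) * hp + zi * ht
            for zi, hp, ht in zip(z, h_prev, h_tilde)]

# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new state is half of the previous one.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
h = gru_step([0.3], [1.0, -2.0], zeros, zeros, zeros)
```

A bidirectional layer would run this step once left to right and once right to left with separate weights, then concatenate the two states at each position.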

However, a unidirectional GRU only considers the forward information of texts and ignores the backward information, which also contains important features. For this reason, the bidirectional GRU is employed in our model. BiGRU captures feature representations in both directions of each sequence and obtains the complete representations by concatenating them. In our model, we propose a multichannel BiGRU to obtain richer representations. The multichannel mechanism feeds different kinds of input representations into multiple independent networks with the same structure. Each channel uses a separate BiGRU to capture features, so the channels do not interfere with each other and information can be extracted more adequately. A total of 7 channels are designed to capture features of different representations, as follows.

(1) 1st channel: pretrained word embeddings $\oplus$ contextualized word embeddings $\oplus$ character embeddings
(2) 2nd channel: pretrained word embeddings
(3) 3rd channel: pretrained word embeddings $\oplus$ character embeddings
(4) 4th channel: character embeddings
(5) 5th channel: contextualized word embeddings $\oplus$ character embeddings
(6) 6th channel: contextualized word embeddings
(7) 7th channel: pretrained word embeddings $\oplus$ character embeddings

where $\oplus$ denotes the concatenate operation.
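As an illustration of how channel inputs are assembled, the sketch below concatenates per-token representations in a given order. The dimensions follow the hyperparameter settings reported for this model (200, 30, and 1024); the names `DIMS` and `channel_input` and the zero-filled vectors are hypothetical placeholders.

```python
# Per-token representation sizes taken from the reported hyperparameters.
DIMS = {"word": 200, "char": 30, "elmo": 1024}

def channel_input(token_features, names):
    """Concatenate the named representations of one token, in order."""
    vector = []
    for name in names:
        vector.extend(token_features[name])
    return vector

# Placeholder features for a single token.
token = {name: [0.0] * dim for name, dim in DIMS.items()}

ch1 = channel_input(token, ("word", "elmo", "char"))  # 1st channel
ch2 = channel_input(token, ("word",))                 # 2nd channel
ch5 = channel_input(token, ("elmo", "char"))          # 5th channel
print(len(ch1), len(ch2), len(ch5))  # prints: 1254 200 1054
```

Because each channel has its own input width, each channel also needs its own BiGRU input weights, which is consistent with the use of independent networks per channel.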

2.3. Merge Layer

The purpose of the merge layer is to integrate the outputs of multiple channels from BiGRU. A good merge scheme can effectively integrate the potentially valuable information in the multichannel BiGRU. As shown in Figure 1, the multichannel BiGRU is adopted to capture features from different representations. Let $H$ denote the multichannel BiGRU’s output. For a given text sequence $X = (x_1, x_2, \ldots, x_n)$, $n$ denotes the length of the sequence, and $d$ denotes the number of BiGRU units. We design four merge methods: addition, connection, unit-level attention, and channel-level attention.

1) Addition. This method additively integrates the output of each channel, and each single BiGRU does not interfere with others when capturing features. It can be obtained as follows: where denotes element-wise addition, . denotes the single-channel BiGRU’s output, , and , respectively, denote the pretrained word embeddings, the contextualized word embeddings from ELMo, and the character embeddings from CNN.

2) Connection. This method directly performs the concatenate operation on the single-channel BiGRU’s output. It can be obtained as follows: where denotes the concatenate operation, . , and , respectively, denote the 3 different embeddings.

3) Unit-level attention. This method adopts the multihead self-attention mechanism to redistribute the weights of units in BiGRU. It can be obtained as follows: where denotes the concatenate operation. denotes the single-channel BiGRU’s output, , , .

4) Channel-level attention. This method first connects the feature representations of all channels, then computes the weights of each channel and finally integrates them. It can be obtained as follows: where denotes matrix multiplication, , .
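The two simplest merge methods, addition and connection, can be sketched as follows. The sketch assumes every channel produces one vector per token and, for addition, that all channels share the same output size; the attention-based variants are not reproduced here. `merge_add` and `merge_concat` are hypothetical names.

```python
def merge_add(outputs):
    """Element-wise sum over channels.

    outputs[c][t] is the output vector of channel c at token t.
    """
    return [[sum(vals) for vals in zip(*token_vecs)]
            for token_vecs in zip(*outputs)]

def merge_concat(outputs):
    """Concatenate the channel outputs along the feature axis, token by token."""
    return [[v for vec in token_vecs for v in vec]
            for token_vecs in zip(*outputs)]

# Two toy channels, two tokens, two units each.
channel_a = [[1.0, 2.0], [3.0, 4.0]]
channel_b = [[10.0, 20.0], [30.0, 40.0]]

added = merge_add([channel_a, channel_b])
joined = merge_concat([channel_a, channel_b])
```

Addition keeps the feature size fixed regardless of the number of channels, while connection grows it linearly with the number of channels, which affects the size of the layer that follows.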

2.4. CRF Layer

After BiGRU outputs the feature representations, a conventional decision function computes the prediction label of each token. However, the output sequence labels have strong dependence in BNER. For example, in the BIO labeling scheme, “I-disease” cannot directly follow “O”. The conventional decision function is insufficient to address this issue effectively.

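In our model, CRF [41] is employed after the merge layer; hence, the dependence between the output labels can be effectively considered. A sentence $X = (x_1, x_2, \ldots, x_n)$ is input into BiGRU. $P$ denotes the probability matrix output from the merge layer, $P \in \mathbb{R}^{n \times k}$, where $n$ denotes the length of the sequence and $k$ denotes the number of labels. $P_{i,j}$ denotes the probability of the $j$-th label for the $i$-th token. $y$ denotes the prediction labels, where $y = (y_1, y_2, \ldots, y_n)$. Its score can be obtained as:

$$s(X, y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $A$ denotes the transfer matrix and $A_{i,j}$ denotes the transition score from label $i$ to label $j$. The probability of a label sequence among all possible label sequences can be computed as follows:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

where $y$ denotes the truth labels and $\tilde{y}$ ranges over the candidate label sequences.

The likelihood function is:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

where $Y_X$ denotes all legal label sequences. The final prediction label sequence with the maximum probability can be obtained as follows:

$$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$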
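Finding the highest-scoring label sequence is usually done with Viterbi decoding. The sketch below assumes a sequence score equal to the sum of per-token label scores and label-to-label transition scores, with no special start or end labels; `path_score` and `viterbi` are hypothetical names and the toy scores are made up for illustration.

```python
def path_score(emissions, transitions, path):
    """Score of one label sequence: token scores plus transition scores."""
    score = sum(emissions[i][label] for i, label in enumerate(path))
    score += sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return score

def viterbi(emissions, transitions):
    """Return the label sequence with the maximum score.

    emissions[i][j]:   score of label j at token i.
    transitions[a][b]: score of moving from label a to label b.
    """
    k = len(emissions[0])
    score = list(emissions[0])
    back = []
    for i in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + transitions[p][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[i][j])
        score = new_score
        back.append(pointers)
    path = [max(range(k), key=lambda j: score[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Labels: 0 = O, 1 = B-disease, 2 = I-disease.
emissions = [[2.0, 0.0, 0.0],    # token 1 looks like O
             [0.0, 1.0, 1.5]]    # token 2 looks slightly more like I than B
transitions = [[0.0, 0.0, -1e4],  # O -> I is heavily penalized
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
best = viterbi(emissions, transitions)
```

A per-token argmax would choose O followed by I-disease here, which is an invalid sequence; the transition penalty makes the decoder prefer O followed by B-disease.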

2.5. Multitask Learning

In order to provide more training data and valuable information for our model, we adopt the multitask learning strategy. The basic idea of multitask learning is to learn multiple tasks at the same time and use related information between tasks to improve model performance. The neural network-based multitask learning method mainly adopts a parameter-sharing learning mode to learn a shared representation for multiple tasks. In this paper, we introduce two auxiliary corpora with the same entity types as the main corpora to be evaluated, then train the multichannel BiGRU-CRF model on these separate corpora with shared parameters.

Given a set of $M$ training corpora $D_m = \{(x_i^m, y_i^m)\}$, $m = 1, \ldots, M$, where $x_i^m$ and $y_i^m$ represent the samples and corresponding labels in each corpus, respectively, the loss function of the model based on multitask learning is as follows:

$$\mathcal{L} = \sum_{m=1}^{M} \lambda_m \mathcal{L}_m$$

where $\mathcal{L}_m$ is the loss on the $m$-th corpus and $\lambda_m$ is a hyperparameter that reflects the weight of each corpus. It represents the contribution and importance of each participating corpus in the whole. Through a large number of experiments, we find that the model reaches the highest performance when every $\lambda_m$ is 1, that is, when the weights are not distinguished, which is also consistent with the conclusion of Wang et al. [26].
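The weighted combination of per-corpus losses can be written in a few lines. This is a minimal sketch under the assumption that the total loss is a weighted sum of the losses computed on each corpus; `multitask_loss` is a hypothetical name and the loss values are placeholders rather than measured results.

```python
def multitask_loss(corpus_losses, weights=None):
    """Weighted sum of per-corpus losses; any missing weight defaults to 1."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * loss
               for name, loss in corpus_losses.items())

# Placeholder loss values for a main corpus and its auxiliary corpus.
losses = {"NCBI-disease": 0.5, "BC5CDR-disease": 0.25}

equal = multitask_loss(losses)                                  # all weights 1
down_weighted = multitask_loss(losses, {"BC5CDR-disease": 0.5})
```

With the default weights the call reduces to a plain sum, which corresponds to the undistinguished-weight setting described above.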

This paper adopts the fully-shared mode, which means that all parameters of the model are completely shared except that a corresponding output layer is set for each corpus. We provide an auxiliary corpus for the main corpus. The fully shared multichannel BiGRU can capture shared feature representations for multiple corpora, which are fed into their respective output layers to generate prediction sequences.

3. Experimental Settings

In this section, the experimental settings are reported clearly, including optimizer and regularization, hyperparameters, corpora, and evaluation measures.

3.1. Optimizer and Regularization

Adam [42] (Adaptive Moment Estimation) is adopted as the optimizer of our model during training. It is an adaptive optimization method that dynamically updates the learning rate by computing the gradient’s 1st moment estimate and 2nd moment estimate. Each adjusted learning rate is limited to a clear range, which ensures that the parameters are steadily updated.
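For reference, one Adam update for a single scalar parameter can be sketched as below. The learning rate of 0.001 matches the setting used in this paper, while the decay rates and epsilon are the commonly used defaults and are assumptions here; `adam_step` is a hypothetical name.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # 1st moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # 2nd moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from theta = 1.0 with gradient 2.0.
theta, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate no matter how large the gradient is.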

We use dropout during model training to prevent overfitting. Dropout [43] is designed to randomly filter some hidden layer nodes according to the preset dropout rate so that they do not participate in the back propagation to update parameters. The above operations can effectively prevent overfitting. They make the model more generalized.

3.2. Hyperparameters

Table 1 reports the experimental hyperparameter settings. The dimensions of the pretrained word embeddings, character embeddings, and contextualized word embeddings are set to 200, 30, and 1024, respectively. We adopt Adam to optimize our model during training. The dimension of GRU units is 100, and the dropout rate is 0.5. We set the learning rate to 0.001 and the batch size to 32. In this paper, the BIO labeling schema is employed to preprocess the original samples. B denotes the first token of an entity, I denotes a token inside an entity, and O denotes a token not belonging to any entity.
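To make the labeling schema concrete, the following sketch converts a BIO tag sequence into entity spans. It assumes tags of the form `B-type`, `I-type`, and `O`, and it silently ignores an `I-` tag that does not continue an open entity of the same type; `bio_to_spans` is a hypothetical helper.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (start, end, type) spans with an exclusive end."""
    spans = []
    start = etype = None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start = etype = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

tags = ["O", "B-disease", "I-disease", "O", "B-disease"]
spans = bio_to_spans(tags)
# spans == [(1, 3, "disease"), (4, 5, "disease")]
```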

3.3. Corpora

JNLPBA [44] and NCBI-disease [45] are the main corpora in our experiments. They are representative biomedical corpora for multiclass and single-class entity recognition, respectively. JNLPBA contains 5 types of entity: DNA, RNA, cell type, cell line, and protein. Its training set contains 2000 Medline abstracts, and its test set contains 404 Medline abstracts. The NCBI-disease corpus consists of 793 Medline abstracts, of which 593, 100, and 100 are used as the training set, development set, and test set, respectively. It labels the disease name and the corresponding disease concept ID (the concept ID can be mapped to the ID in the MeSH or OMIM database). In addition, in the multitask learning framework, we use two other corpora as auxiliary data sets, namely BC2GM [46] and BC5CDR-disease [47]; the entity types contained in these two corpora are consistent with the main corpora. Table 2 provides the details of the above corpora.

3.4. Evaluation Measures

To evaluate the performance of our method, we adopt three conventional evaluation measures: precision ($P$), recall ($R$), and F1-score ($F1$). The calculation formulas are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

where $TP$ denotes the number of true positive samples. $TN$ denotes the number of true negative samples. $FP$ denotes the number of false-positive samples. $FN$ denotes the number of false-negative samples.
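The three measures can be computed from sets of predicted and gold entities. The sketch assumes exact-match scoring at the entity level, where an entity counts as correct only if its boundaries and type both match; this matching rule and the name `precision_recall_f1` are assumptions for illustration.

```python
def precision_recall_f1(gold, pred):
    """Precision, recall, and F1 over sets of (start, end, type) entities."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)        # predicted and correct
    fp = len(pred - gold)        # predicted but wrong
    fn = len(gold - pred)        # missed gold entities
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(1, 3, "disease"), (4, 5, "disease")]
pred = [(1, 3, "disease"), (7, 8, "disease")]
p, r, f1 = precision_recall_f1(gold, pred)   # one hit, one miss, one false alarm
```

True negatives do not appear in any of the three formulas, which is why the function never counts them.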

4. Results and Discussions

The described multitask learning multichannel BiGRU-CRF model is evaluated on NCBI-disease and JNLPBA, which are representative biomedical corpora for single-class and multiclass entity recognition, respectively. We first compare the performance of each merge method and feature representation, as shown in Tables 3 and 4. Then, we evaluate the hyperparameter settings, including the GRU dimension, optimizers, and dropout, as shown in Tables 5, 6, and 7. Table 8 shows the effect of the CRF layer in our architecture, and Table 9 shows the effect of the multitask learning strategy. Lastly, we compare the performance of the multichannel BiGRU with some existing feature-based methods in BNER.

4.1. Performance Comparison of Merge Methods

The merge method affects how well the captured features are used. In the merge layer, an inappropriate method for integrating feature representations can result in information repetition and redundancy, which has a negative impact on integrating information. Therefore, we evaluate the performance of the different merge methods we designed: addition, connection, unit-level attention, and channel-level attention. As Table 3 shows, the model obtains the highest F1-score when the unit-level attention method is adopted. The probable reason is that the unit-level attention method can fully integrate the important features captured by each channel without the channels interfering with each other; thus, we use the unit-level attention method in the merge layer.

4.2. Performance Comparison of Each Representation

This paper proposes a multichannel BiGRU-CRF model to capture richer feature information by sending multiple representations individually or collectively into BiGRU. We evaluate the performance of each channel based on different representations while verifying the effectiveness of our multichannel method. The experimental results are shown in Table 4. It can be seen that the multichannel representations can provide richer potential information, and the concatenate representations are superior to the single representations. In summary, we compare the performance between each representation on the same corpus. Our merge-based multiple representations method achieves optimal performance, with the F1-scores of 76.0 and 88.7 on the JNLPBA and NCBI-disease corpora, respectively.

4.3. Performance Comparison of GRU Units Dimensions

The dimension of the GRU units affects the ability to learn features and the performance of the classifier. Too few hidden units can result in insufficient feature capture. Conversely, too many hidden units may lead to information redundancy and increase the computational burden. Both cases have a negative impact on model performance. Therefore, we evaluate the performance of different unit dimensions to obtain the best hyperparameters. We set the size of GRU units to 50, 100, 150, and 200 and evaluate each setting. As the results in Table 5 show, the model achieves the best performance when the dimension is 100. Therefore, the dimension of the GRU units is set to 100.

4.4. Performance Comparison of Combining CRF Layer

The CRF layer can capture the dependence between adjacent labels by transition probability. This paper evaluates the effectiveness of the CRF layer. The experimental results are shown in Table 8. After combining BiGRU with the CRF layer, the model performance has been significantly improved on the JNLPBA and NCBI-disease corpora. It proves the validity of the CRF layer.

4.5. Performance Comparison of Adopting Multitask Learning

As Table 9 shows, the multitask learning strategy we adopted is effective. The auxiliary corpora provide more training samples and valuable information for the main corpora. According to the evaluation results on the main corpora, the multitask learning framework improves the performance on JNLPBA less than on NCBI-disease. The possible reason is that the entity type of NCBI-disease is completely consistent with the auxiliary corpus BC5CDR-disease, whereas the auxiliary corpus BC2GM contains only “protein”, so the training samples and relevant information of the other four entity types in the main corpus JNLPBA have not been supplemented.

4.6. Performance Comparison of Optimization Methods

The optimization method determines the convergence speed and performance of the model training process. This paper evaluates three different optimization methods: Adam, SGD, and AdaGrad. SGD is one of the commonly used optimizers during training. It randomly draws fixed-size batches of training samples to calculate gradients and update parameters, but it may converge to a local minimum. Compared to SGD, AdaGrad adaptively adjusts the learning rate of each parameter during training instead of relying on a single fixed rate. It is well suited to handling sparse data, but its effective learning rate can shrink toward zero as training proceeds. The experimental results are shown in Table 6. Compared with the other two optimization methods, Adam achieves the fastest convergence speed and highest performance under the same conditions. Therefore, this paper uses Adam as the optimizer.

4.7. Performance Comparison of Using Dropout

This paper evaluates the effectiveness of dropout. The experimental results are shown in Table 7. After setting the dropout rate, the model performance has been significantly improved on the JNLPBA and NCBI-disease corpora. It demonstrates the validity of dropout.

4.8. Performance Comparison with Existing Feature-Based Methods

Lastly, we draw a comparison between our model and existing models. In order to ensure the fairness and rationality of the experiment, we have divided the existing models into two kinds according to the different training patterns. One kind is feature-based, which applies specific input representations to task-specific different architectures, such as the approaches listed in Table 10; while another kind is fine-tuning, which trains various downstream tasks with fine-tuning parameters in fixed model architectures, such as BERT [55]. This paper reports the performance comparison with existing models of feature-based representations.

The performance comparison results on the JNLPBA corpus are shown on the left side of Table 10. In these studies, the early methods (dictionary based and rule based) and the conventional machine learning models also obtained reasonable results in BNER, including Finkel et al. [1], Settles [3], Tsuruoka et al. [2], Tang et al. [6], Chang et al. [4], and Liao et al. [5]. NERBio [53] was the best rule-based system on the JNLPBA corpus, with an F1-score of 73.0. The Skip-Chain CRF adopted by Liao et al. [5] was the state-of-the-art conventional machine learning model. It obtained a reasonable F1-score of 73.2. Compared with the best early method and the best conventional machine learning method, our model increases the F1-score by 3.0 and 2.8, respectively. We obtain these results without any feature engineering and with a simple architecture. Compared with existing deep learning studies, the performance of our model is better than that of Li et al. [33], who proposed a CNN-BLSTM-CRF model with word embeddings and character embeddings. Our model increases the recall and F1-score by 9.7 and 1.6, respectively. Gridach et al. [54] proposed a BiLSTM-CRF model with pretrained word embeddings and character embeddings. They computed the character vectors by a bidirectional LSTM. This model significantly enhanced the best performance of single-task BNER models. The performance of our model is close to theirs. In summary, our method obtains promising results compared with existing feature-based models while using merge-based multiple features and a simple architecture.

The performance comparison on the NCBI-disease corpus is shown in Table 10 (right side). In these studies, Leaman et al. [45, 49, 50] first adopted conventional machine learning methods to obtain competitive performance on the NCBI-disease dataset. They developed multiple BNER systems (e.g., DNorm and TaggerOne) in subsequent studies. The recent deep learning methods achieved satisfactory results in BNER. In addition to some of the related works described in the first section, including Limsopatham et al. [22], Dang et al. [24], Zhao et al. [19], Wang et al. [26], Xu et al. [23], Yoon et al. [27], Zhu et al. [20], and Sachan et al. [28], Xu et al. [48] proposed a three-layer neural network to identify disease entities. BiLSTMs with the same structure were used to generate character-level embeddings and to capture features. The entity labels were predicted through the CRF layer. Wei et al. [51] designed a hybrid model combining conventional machine learning methods with neural networks, in which a bidirectional RNN and CRF were employed as submodels to extract features. Then, the output was merged and fed into SVM for classification. Habibi et al. [52] achieved reasonable performance on multiple biomedical datasets based on word embeddings and an LSTM-CRF model. GRAM-CNN [20], which is based on CNN, was the best single-task system on the NCBI-disease corpus. It obtained an F1-score of 87.3. BiLM-NER [28] was the best feature-based model and was developed by the transfer learning method; its F1-score was 87.3. Our model outperforms both of these state-of-the-art systems and obtains the best performance among reported feature-based models.

4.9. Error Analysis

We analyzed the error cases of the model on our corpora and summarize the main causes of these errors in the following two points.

The boundary is blurred. There are three main reasons for this error. First, biomedical entities are generally long and complex. For example, the entity “Kappa B-specific DNA binding proteins” contains five words, whereas entities in the general domain usually contain no more than three words. In addition, it contains the word “DNA”, although the entity itself is of type “protein”. Second, function words and conjunctions within biomedical entities interfere with the judgment of the boundary. For example, some biomedical entities contain fixed conjunctions, but these are often mislabeled as “O”. Finally, an entity in biomedical corpora may be nested inside another entity of a different type. For example, “MZF-1” is part of “Recombinant MZF-1”, but the two are labeled “DNA” and “protein”, respectively. These issues affect our model to a certain extent.
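Boundary errors of this kind can be separated from other error types by comparing predicted and gold entity spans. The following is a minimal sketch, assuming a BIO tagging scheme; the function names `bio_to_spans` and `classify_errors`, the four error categories, and the predicted tag sequence in the example are illustrative assumptions, not part of our evaluation code.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert a BIO tag sequence into (start, end, type) spans.

    An I- tag that does not continue an open span of the same type is ignored.
    """
    spans: List[Span] = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" closes a final span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def classify_errors(gold: List[str], pred: List[str]) -> Dict[str, List[Span]]:
    """Sort predicted spans into correct, type, boundary, and spurious."""
    gold_spans = bio_to_spans(gold)
    gold_set = set(gold_spans)
    report: Dict[str, List[Span]] = {
        "correct": [], "type": [], "boundary": [], "spurious": []
    }
    for p in bio_to_spans(pred):
        if p in gold_set:
            report["correct"].append(p)
        elif any(p[0] == g[0] and p[1] == g[1] for g in gold_spans):
            report["type"].append(p)      # same boundaries, different type
        elif any(p[0] < g[1] and g[0] < p[1] for g in gold_spans):
            report["boundary"].append(p)  # overlaps a gold span, wrong boundary
        else:
            report["spurious"].append(p)  # no overlap with any gold span
    return report


# Tokens: Kappa / B-specific / DNA / binding / proteins
gold = ["B-protein", "I-protein", "I-protein", "I-protein", "I-protein"]
# Hypothetical prediction that misses the first token of the entity.
pred = ["O", "B-protein", "I-protein", "I-protein", "I-protein"]

report = classify_errors(gold, pred)
print(report["boundary"])  # [(1, 5, 'protein')]
```

Counting the entries in each category over a test set gives a rough picture of how many errors stem from boundaries rather than from entity types.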

The corpus annotation is inconsistent. For example, “wild-type” is labeled as “O” in “gave nearly wild-type levels of gene expression in phorbol ester-treated Jurkat cells but not in phorbol ester-treated HeLa or U937 cells.”, but it is labeled as “DNA” in “as a wild-type but not a mutant TSAP-binding site of the sea urchin functions only in transfected B cells as an upstream promoter element.”. In addition, some biomedical sequences contain abbreviations of entities, which our model has difficulty identifying. For example, “IL-2” in “Under the same conditions, Lck did not stimulate IL-2 promoter unless it was activated by mutation” and “Interleukin-2” in “The proteasome regulates receptor-mediated endocytosis of interleukin-2” refer to the same entity, but our model has difficulty recognizing this.
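Inconsistencies such as the “wild-type” case can be screened for automatically before training. The sketch below is one possible heuristic, assuming each sentence is available as a list of (token, BIO tag) pairs; the function name `find_inconsistent_tokens` and the shortened example sentences are illustrative assumptions, and the exact tags in the corpus may differ.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Sentence = List[Tuple[str, str]]  # (token, BIO tag) pairs


def find_inconsistent_tokens(corpus: List[Sentence]) -> Dict[str, Set[str]]:
    """Return lowercased tokens that occur with more than one label.

    The B-/I- prefix is stripped, so a token is flagged only when it appears
    with different entity types, or both inside and outside an entity.
    """
    labels: Dict[str, Set[str]] = defaultdict(set)
    for sentence in corpus:
        for token, tag in sentence:
            etype = tag.split("-", 1)[1] if "-" in tag else tag
            labels[token.lower()].add(etype)
    return {tok: types for tok, types in labels.items() if len(types) > 1}


# Shortened, hypothetical excerpts of the two sentences discussed above.
corpus = [
    [("nearly", "O"), ("wild-type", "O"), ("levels", "O")],
    [("a", "O"), ("wild-type", "B-DNA"), ("but", "O")],
]

flagged = find_inconsistent_tokens(corpus)
print({tok: sorted(types) for tok, types in flagged.items()})
# {'wild-type': ['DNA', 'O']}
```

A token-level check of this kind over-flags, because many words legitimately occur both inside and outside entities, so its output is only a list of candidates for manual review.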

These analyses indicate that the complexity of biomedical entities and the annotation inconsistency of biomedical corpora are major sources of error. To address these issues, we could disambiguate entities through entity linking during corpus preprocessing or adopt additional external representations.

5. Conclusion

In this paper, we propose a multitask learning multichannel BiGRU-CRF model based on contextualized word representations. First, we obtain word, character, and contextualized word representations through a pretrained biomedical dictionary, convolutional neural networks, and the ELMo pretrained language model, respectively. The character representations describe the morphological features of words, and the contextualized word representations better represent polysemous and ambiguous words according to their context. Then, we train multiple BiGRU submodels at the same time, each of which is viewed as a channel. The three representations are fed into different channels, either separately or in combination. Next, we design multiple methods to integrate the outputs of the channels in the merge layer. Finally, considering the dependence between labels, a CRF layer is adopted to decode the label sequence, which avoids outputting non-compliant label sequences. In addition, a multitask learning strategy is adopted to address the problem of insufficient training samples in specific fields. Auxiliary corpora with the same entity types supply additional training samples and relevant information for the main corpora to be evaluated. Our model has a simple architecture and avoids feature engineering. The multitask learning multichannel BiGRU-CRF achieves promising results on the JNLPBA and NCBI-disease corpora, with F1-scores of 76.0% and 88.7%, respectively. In the future, we plan to introduce richer additional features (e.g., domain knowledge bases and structured ontologies) to further enhance performance.
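To make concrete what a non-compliant label sequence is, the following minimal sketch checks a tag sequence against the BIO scheme. It assumes BIO tagging, and the function name `is_valid_bio` is an illustrative assumption. A CRF layer does not apply such a hard rule; it learns transition scores that make these sequences unlikely, so the checker only illustrates the constraint.

```python
from typing import List


def is_valid_bio(tags: List[str]) -> bool:
    """Return True if every I-X tag directly follows a B-X or I-X tag."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            # An I- tag must continue an entity of the same type.
            if prev not in ("B-" + etype, "I-" + etype):
                return False
        prev = tag
    return True


print(is_valid_bio(["B-protein", "I-protein", "O"]))  # True
print(is_valid_bio(["O", "I-protein", "O"]))          # False: I- after O
print(is_valid_bio(["B-DNA", "I-protein", "O"]))      # False: type changes mid-entity
```

A token-wise softmax classifier can emit either of the two invalid sequences above, which is one motivation for decoding labels jointly.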

Data Availability

The datasets used in this paper are all publicly available. References for the datasets that support the findings of this study are included within this paper.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61976124).