Abstract

In order to improve the disambiguation accuracy of biomedical words, this paper proposes a disambiguation method based on an attention neural network. The biomedical word is viewed as the center, and morphology, part of speech, and semantic information from its 4 adjacent lexical units are extracted as disambiguation features. An attention layer is used to generate a feature matrix, from which average asymmetric convolutional neural networks (Av-ACNN) and bidirectional long short-term memory (Bi-LSTM) networks extract features. The softmax function is applied to determine the semantic category of the biomedical word. For comparison, CNN, LSTM, and Bi-LSTM are also applied to biomedical WSD. The MSH corpus is adopted to optimize CNN, LSTM, Bi-LSTM, and the proposed method and to evaluate their disambiguation performance. Experimental results show that the proposed method improves on the average disambiguation accuracy of CNN, LSTM, and Bi-LSTM, reaching 91.38%.

1. Introduction

The volume of biomedical text is now so large that automated tools are needed to process it effectively. However, automatic processing of biomedical text is difficult, because the biomedical field contains many ambiguous words. Determining the semantic categories of biomedical words is therefore helpful for the automatic processing of biomedical articles. Biomedical word sense disambiguation (WSD) is now widely applied in biomedical natural language processing tasks such as text indexing, text categorization, and named entity extraction.

In the field of biomedicine, professional vocabulary is often polysemous. For example, the biomedical word “ADA” has two senses: “American Dental Association” and “Adenosine Deaminase.” The correct meaning of a biomedical word must be determined from relevant information in its context.

Biomedical WSD methods can be divided into 3 categories: supervised, unsupervised, and knowledge-based. In supervised methods, a labeled dataset along with lexical and syntactic information in context is used to train a classifier that predicts the correct senses of biomedical words in the test dataset. In unsupervised methods, unlabeled biomedical texts are used to provide sense choices for biomedical words. In knowledge-based methods, no corpus is used; thesauri and sense inventories such as WordNet and the Unified Medical Language System (UMLS) [1], which provide brief definitions of different senses and corresponding synonyms, are applied to determine the semantic categories of biomedical words.

This paper combines neural networks and linguistic knowledge to improve the performance of biomedical WSD. The context around a biomedical word contains a great deal of linguistic information that can be used to determine its semantics, but only some of this information is helpful; the rest is noisy. Neural networks are often used to extract discriminative information, and each network architecture has its own advantages and disadvantages. It is a challenge for a biomedical WSD system to combine multiple neural networks to extract effective discriminative information from contexts. This paper combines Av-ACNN and Bi-LSTM to extract discriminative features from the context around the biomedical word and determine its semantics, which improves the disambiguation accuracy of the biomedical WSD system.

In this paper, we take the morphology, part of speech, and semantic information of the four adjacent lexical units around the biomedical acronym as disambiguation features. Word embeddings [24] are used as representations for the biomedical WSD problem. Discriminative information embedded in the word units is extracted by an attention mechanism to obtain higher-level features. Based on these features, average asymmetric convolutional neural networks (Av-ACNN), bidirectional long short-term memory (Bi-LSTM) networks, and the softmax function are used to determine the semantic category of the biomedical acronym. The main contributions of this paper are summarized as follows:
(1) Morphology, part of speech, and semantic information of the 4 adjacent lexical units around the biomedical acronym are used as disambiguation features, and word embeddings are used to generate the feature vectors.
(2) An attention mechanism is adopted to generate weights dynamically by capturing relationships between the left and right contiguous words of the biomedical acronym.
(3) A multiscale asymmetric convolutional neural network reduces the amount of computation and obtains more feature information, while bidirectional long short-term memory networks gather useful information in both the forward and backward directions.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 describes the extraction of disambiguation features and the generation of disambiguation feature vectors. The attention neural network is presented in Section 4, and its training process in Section 5. Experimental analysis is given in Section 6, and Section 7 concludes.

2. Related Work

There are 3 kinds of biomedical WSD methods: supervised, unsupervised, and knowledge-based.

2.1. Supervised Methods

Supervised methods use labeled data to train the biomedical WSD classifier. Liu applies 3 machine learning algorithms to biomedical WSD, including Naive Bayes and decision lists, an adaptation of decision lists, and a mixed supervised learning method. Experiments show that the hybrid supervised method with Naive Bayes performs best in biomedical WSD [5]. Stevenson extracts domain-independent linguistic features around the ambiguous word from the text. These features have been adapted for biomedical text disambiguation by adding CUIs [6] and Medical Subject Heading terms [7]. Son proposes a support vector machine with example-wise weights to solve the WSD task, in which the weights of training instances are adjusted according to their similarity to the test data [8]. Moon builds a clinical sense inventory with 440 common abbreviations. By comparing this inventory with UMLS, Adam, and Stedman, he analyzes these clinical abbreviations and acronyms across diverse resources [9]. Yepes uses word embeddings to improve traditional features and applies a recurrent neural network based on LSTM nodes to biomedical WSD [10]. Festag applies word embeddings and recurrent convolutional neural networks to medical term disambiguation, mapping medical terms to multiple concepts in UMLS [11]. Wang proposes an interactive learning algorithm in which an expert’s domain knowledge is used to build a medical WSD model. The expert can provide domain knowledge in 3 ways: labeling instances, specifying indicative words of a sense, and highlighting supporting evidence in a labeled instance [12]. Bis proposes a novel deep neural network for biomedical WSD, in which a layered bidirectional LSTM network and a max-pooling layer along multiple time steps create a dense representation of context [13]. Wei applies CNNs and LSTMs to capture semantic and syntactic features for bioconcept disambiguation [14]. However, supervised biomedical WSD methods require a large human-annotated corpus.

2.2. Unsupervised Methods

Unsupervised methods use an unlabeled corpus to provide a sense choice for a word in context. Agirre uses relations in UMLS to create a graph and applies a personalized PageRank algorithm [15] to rank the semantic categories of ambiguous words based on their structural importance in the graph and their relation to words in context [16]. Duan proposes a graph-based algorithm to cluster words into groups, adopting the principle of finding the maximum margin between clusters [17]. Wanton gives a kernel-based method for biomedical WSD: information in a knowledge base is used to construct an affinity matrix, and kernels are defined based on the matrix [18]. Rab applies six relation types in UMLS to build a graph for ambiguous words and gives a graph-based algorithm to disambiguate terms in biomedical text [19]. Fernandez proposes a graph-based unsupervised algorithm to solve the WSD problem in the biomedical domain, in which the contexts of ambiguous terms are considered when the knowledge base is built [20]. Li proposes a language model based on Bi-LSTM, in which word order is considered and the entire sentential context is described; it generates high-quality context representations in an unsupervised manner [21]. Pesaranghader computes sense embeddings based on their text definitions in the Unified Medical Language System and, at the same time, proposes a network to determine the semantic category of the ambiguous term [22]. However, the performance of unsupervised biomedical WSD methods is low.

2.3. Knowledge-Based Methods

Knowledge-based approaches apply external lexical resources to biomedical WSD, such as machine-readable dictionaries, thesauri, and ontologies. Rais considers the terms in context to have the same weight and gives a modified SenseRelate algorithm [23]. He applies semantic similarity and relatedness measures to biomedical WSD and evaluates the influence of context window size on WSD [24]. Yepes compares 4 knowledge-based WSD methods in the biomedical domain; the method that uses the semantic categories assigned to concepts in the Metathesaurus performs best [25]. Plaza studies the influence of 3 WSD algorithms on biomedical summarization, in which documents are mapped onto concepts in UMLS. The three WSD algorithms are, respectively, journal descriptor indexing, machine-readable dictionary, and automatically extracted corpus [26]. McInnes uses semantic similarity and relatedness measures to determine the semantic categories of biomedical terms, which requires no human-annotated corpus and yields high disambiguation accuracy [27]. Garla adopts a directed concept graph to compute semantic similarity based on UMLS, where vertices represent concepts and edges denote taxonomical relationships [28]. Based on neural word and concept embeddings, Sabbir combines cosine similarity, projection magnitude proportion, and a prior knowledge-based approach to determine the semantic category of the biomedical term [29]. Antunes uses unlabeled MEDLINE abstracts to generate word embeddings, which are applied to compute embedding vectors for the context surrounding the ambiguous term; the meaning of the ambiguous term is determined according to the similarity between the context vector and the concept vectors [30]. However, it is difficult to extract correct knowledge from lexical resources and apply it to biomedical WSD.

These 3 kinds of methods have their own shortcomings. Supervised WSD methods achieve better performance but need a large annotated biomedical corpus. Unsupervised WSD methods do not require manually labeled corpora, but their disambiguation accuracy is not high. Knowledge-based methods apply external lexical resources to biomedical WSD, but it is difficult to extract correct knowledge from those resources and apply it correctly.

3. Feature Extraction

3.1. Preprocessing Text

Punctuation marks in the context of the biomedical word carry little semantic information and have little influence on determining its semantic category; at the same time, they introduce noise into the estimation of the model’s parameters. Python regular expressions are therefore used to delete punctuation from sentences containing the biomedical word. Part of speech refers to the grammatical features of a class of words, that is, their grammatical function, and helps determine the relationship between two words. The Python NLTK package is adopted to label each word in the sentence with its part of speech. Semantics refers to the sense of a word; words with the same or similar senses are classified into one category, which decreases data sparsity during parameter estimation. The NLTK package is also used to label each word in the sentence with its sense.

In the sentence “A message from ADA president Feldman.”, “ADA” is a biomedical word. First, the punctuation “.” is deleted. Second, every word is labeled with its part of speech: for “A,” it is DT; for “message” and “president,” NN; for “from,” IN; and for “ADA” and “Feldman,” NNP. Third, every word is labeled with its sense: “A” is annotated with angstrom.n.01, “message” with message.n.01, “ADA” with adenosine_deaminase.n.01, and “president” with president.n.01. “Feldman” and “from” are annotated with “−1.”
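A minimal Python sketch of this preprocessing pipeline, assuming NLTK's default tokenizer, tagger, and WordNet interface; taking the first WordNet synset as the sense label is a simplifying stand-in, since the text does not specify which NLTK sense labeler is used.

```python
# Requires (once): nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("wordnet").
import re
import nltk
from nltk.corpus import wordnet

def preprocess(sentence):
    # Step 1: delete punctuation with a regular expression.
    cleaned = re.sub(r"[^\w\s]", "", sentence)
    tokens = nltk.word_tokenize(cleaned)
    # Step 2: label each word with its part of speech.
    tagged = nltk.pos_tag(tokens)
    # Step 3: label each word with a WordNet sense; "-1" when none exists.
    labeled = []
    for word, pos in tagged:
        synsets = wordnet.synsets(word)
        sense = synsets[0].name() if synsets else "-1"
        labeled.append((word, pos, sense))
    return labeled

print(preprocess("A message from ADA president Feldman."))
# e.g. [('A', 'DT', 'angstrom.n.01'), ('message', 'NN', 'message.n.01'), ...]
```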

3.2. Disambiguation Feature Extraction

Words near the ambiguous word have more impact on its sense, while words far away from it have less. In this paper, the biomedical word is viewed as the center, and the morphology, part of speech, and semantic information of its left and right lexical units are extracted as disambiguation features to determine its meaning. When the number of left or right contiguous vocabulary units is less than 2, the corresponding disambiguation features are set to −1, which ensures that each biomedical word has 12 disambiguation features.

For an English sentence containing the biomedical word “ADA,” the process of extracting disambiguation features is as follows:
English sentence: A message from “ADA” president Feldman.
Part of speech tagging: A/DT message/NN from/IN ADA/NNP president/NN Feldman/NNP
Semantic annotation: A/DT/angstrom.n.01 message/NN/message.n.01 from/IN/-1 ADA/NNP/adenosine_deaminase.n.01 president/NN/president.n.01 Feldman/NNP/-1
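A sketch of the 12-feature extraction under the conventions above; the unit list and the padding triple are illustrative.

```python
# For the 2 left and 2 right lexical units around the target word, take the
# (morphology, part of speech, sense) triple, padding with "-1" when a side
# has fewer than 2 units, so each biomedical word yields 12 features.
def extract_features(units, target_index):
    pad = ("-1", "-1", "-1")
    features = []
    for offset in (-2, -1, 1, 2):
        j = target_index + offset
        unit = units[j] if 0 <= j < len(units) else pad
        features.extend(unit)  # morphology, part of speech, sense
    return features

units = [("A", "DT", "angstrom.n.01"), ("message", "NN", "message.n.01"),
         ("from", "IN", "-1"), ("ADA", "NNP", "adenosine_deaminase.n.01"),
         ("president", "NN", "president.n.01"), ("Feldman", "NNP", "-1")]
print(extract_features(units, 3))  # 12 features: ML2, PL2, SL2, ..., SR2
```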

The process of extracting disambiguation features is shown in Figure 1.

3.3. CBOW Model

Word2vec’s CBOW model [2] is used to generate feature vectors. The input of the CBOW model is the word vectors corresponding to the context words of a word, and its output is a probability distribution whose dimension is the same as that of the input. Gradient descent is used to update the input and output weights. After training, each word in the input layer is multiplied by the input weights to get its word vector. The size of the word window is 4: the window contains the 2 left word units and the 2 right ones. The twelve features from these 4 word units are input to the CBOW model. The word vector of “ADA” is computed as shown in Figure 2.
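A minimal CBOW training sketch with gensim's Word2Vec (sg=0 selects CBOW); the toy corpus and the min_count setting are stand-ins for the preprocessed MSH sentences.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the preprocessed MSH feature sequences.
sentences = [["A", "message", "from", "ADA", "president", "Feldman"],
             ["ADA", "deaminates", "adenosine", "in", "purine", "metabolism"]]
# vector_size=100 and window=4 follow the settings in the text.
model = Word2Vec(sentences, vector_size=100, window=4, sg=0, min_count=1)
vec = model.wv["ADA"]  # the 100-dimensional word vector of "ADA"
print(vec.shape)       # (100,)
```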

3.4. Generation of Disambiguation Feature Matrix

A feature vector is a real-valued vector that maps a high-dimensional space into a low-dimensional one and can represent a large amount of latent information about a word. Twelve disambiguation features are extracted from a sentence containing the biomedical word “ADA,” and the CBOW model is used to convert each feature into a 100-dimensional feature vector.

In this paper, we design 3 optional methods to construct a feature matrix (sketches of methods (2) and (3) follow this list):
(1) The first uses the morphology of the 4 adjacent vocabulary units around the biomedical word “ADA” to construct feature vectors, which form a 4 × 100 feature matrix F.
(2) The second considers the positions of the left and right adjacent words. Generally, the closer a context word is to the biomedical word “ADA,” the more important it is for determining the category of “ADA,” and the bigger its weight. The weighted sum of the 12 feature vectors is used as the feature vector of the biomedical word “ADA,” whose dimension is 100; this vector is reshaped into a 10 × 10 feature matrix M.
(3) The third does not consider the positions of the left and right adjacent words. The disambiguation features of the biomedical word “ADA” are denoted as N = {O(ML2), O(PL2), O(SL2), O(ML1), O(PL1), O(SL1), O(MR1), O(PR1), O(SR1), O(MR2), O(PR2), O(SR2)}, where O is the output of the Word2vec tool. The feature matrix V = (V1, V2, …, V12) is constructed, where V1 = O(ML2), V2 = O(PL2), V3 = O(SL2), V4 = O(ML1), V5 = O(PL1), V6 = O(SL1), V7 = O(MR1), V8 = O(PR1), V9 = O(SR1), V10 = O(MR2), V11 = O(PR2), and V12 = O(SR2).
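A numpy sketch of methods (2) and (3); the random vectors and the position weights in method (2) are illustrative assumptions, since the text does not give concrete weight values.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = [rng.normal(size=100) for _ in range(12)]  # stand-ins for O(ML2)...O(SR2)

# Method (3): stack the 12 feature vectors into the 12 x 100 matrix V.
V = np.stack(vectors)                                 # shape (12, 100)

# Method (2): position-weighted sum, nearer units weighted more heavily,
# reshaped into the 10 x 10 matrix M. The weight values here are assumed.
weights = np.repeat([0.15, 0.35, 0.35, 0.15], 3)      # one weight per lexical unit
M = (weights[:, None] * V).sum(axis=0).reshape(10, 10)
print(V.shape, M.shape)                               # (12, 100) (10, 10)
```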

4. Disambiguation Model Based on Attention Neural Network

4.1. Attention Layer

The attention mechanism is used to generate weights dynamically to capture the relationships between the left and right adjacent words of the biomedical word. Feature matrix S is generated according to the weight parameters. Feature matrix V is input to the attention layer, and feature vector sm is calculated as follows:

$$a_{mn} = \frac{(W_Q V_m)(W_K V_n)^{T}}{\sqrt{d}}, \tag{1}$$

$$w_{mn} = \frac{\exp(a_{mn})}{\sum_{n=1}^{12} \exp(a_{mn})}, \tag{2}$$

$$s_m = \sum_{n=1}^{12} w_{mn}\,(W_L V_n), \tag{3}$$

where WQ, WK, and WL are weight matrices, wmn is the weight coefficient, and sm is the output feature vector. Here, d is the dimension of the input, which adjusts the scale of the inner product. The values of m and n range over 1, 2, …, 12.

Equation (1) computes the correlation strength of two elements. The weight coefficient wmn is calculated from the correlation strength amn as shown in (2), and sm is computed as shown in (3). Feature matrix S is constructed as follows:

$$S = (s_1, s_2, \ldots, s_{12}). \tag{4}$$
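A numpy sketch of the attention layer under the reconstruction of equations (1)–(3) above; the randomly initialized W_Q, W_K, and W_L are placeholders for learned parameters.

```python
import numpy as np

def attention(V, W_Q, W_K, W_L):
    d = V.shape[1]
    Q, K, L = V @ W_Q, V @ W_K, V @ W_L
    A = Q @ K.T / np.sqrt(d)                              # correlation strengths a_mn, eq. (1)
    W = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # weights w_mn, eq. (2)
    return W @ L                                          # row m is s_m, eq. (3)

rng = np.random.default_rng(1)
V = rng.normal(size=(12, 100))
W_Q, W_K, W_L = (rng.normal(size=(100, 100), scale=0.1) for _ in range(3))
S = attention(V, W_Q, W_K, W_L)
print(S.shape)  # (12, 100): the rows stacked are S, eq. (4)
```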

For the biomedical word “ADA,” attention operation is used to process feature matrix S as shown in Figure 3.

4.2. Convolutional Layer

The convolutional neural network (CNN) is a deep learning model comprising an input layer, convolutional layers, pooling layers, a dropout layer, and a fully connected layer. A CNN extracts local features from data through convolution operations and has representation learning ability, classifying input information shift-invariantly based on its hierarchical structure. A CNN shares convolution kernel weights across multiple feature maps, which allows it to process high-dimensional data. A series of convolution and pooling operations reduces the number of model parameters and the risk of overfitting. The convolutional neural network can capture the local correlation of space-time structure and has achieved excellent performance in natural language processing, computer vision, and image processing.

Here, the asymmetric convolution proposed by Szegedy [31] is introduced, in which a ki × h convolution is split into a ki × 1 convolution and a 1 × h convolution. The biggest advantage of this method is that it reduces the amount of computation while achieving an effect similar to that of two-dimensional convolution. The sizes of the convolution kernels are ki × 1, i = 1, 2, 3, and multiple convolution kernels of different sizes are set to obtain different features. The first convolution operation, corresponding to the 1 × h convolution kernel, is applied to sm and generates the corresponding feature as follows:

$$z_m^{1,i} = R\left(W^{1,i} * s_m + b^{1,i}\right), \tag{5}$$

where m = 1, 2, …, 12, $W^{1,i}$ is the convolution kernel in which 1 denotes the first convolution and i distinguishes the 3 parallel asymmetric convolution operations, $b^{1,i}$ denotes the bias, R(∙) is the activation function, and c represents the net activation of the convolutional layer. The ReLU activation function R(∙) is

$$R(c) = \max(0, c). \tag{6}$$

After the 1 × h convolution kernel is applied as shown in equations (5) and (6), 12 feature values are obtained. The feature mapping constructed from these 12 feature values is

$$Z_i = \left(z_1^{1,i}, z_2^{1,i}, \ldots, z_{12}^{1,i}\right). \tag{7}$$

The second convolution operation, corresponding to the ki × 1 convolution kernel, is applied to Zi, and the feature values are computed as follows:

$$c_m^{2,i} = R\left(\left(W^{2,i} * Z_i\right)_m + b^{2,i}\right), \tag{8}$$

where m = 1, 2, …, 12, $W^{2,i}$ is the convolution kernel in which 2 denotes the second convolution and i distinguishes the 3 parallel asymmetric convolution operations, $b^{2,i}$ denotes the bias, R(∙) is the activation function, and e represents the net activation of the convolutional layer. The ReLU activation function R(∙) is

$$R(e) = \max(0, e). \tag{9}$$

In the second convolution, the 3 asymmetric operations use the same number of convolution kernels, but the kernel sizes differ. Therefore, the 3 asymmetric convolution operations in the second convolution output the same number of feature mappings Ci:

$$C_i = \left(c_1^{2,i}, c_2^{2,i}, \ldots, c_{12}^{2,i}\right). \tag{10}$$

Features with the same index in the convolution window are averaged to generate feature mapping D:

$$d_j = \frac{1}{3}\sum_{i=1}^{3} c_j^{2,i}, \tag{11}$$

$$D = (d_1, d_2, \ldots, d_{12}), \tag{12}$$

where j = 1, 2, …, 12.

According to the index in the convolution window, the feature mappings Ci of the 3 asymmetric convolution operations are concatenated with feature mapping D to generate E:

$$E_j = \left[c_j^{2,1}, c_j^{2,2}, c_j^{2,3}, d_j\right], \tag{13}$$

$$E = (E_1, E_2, \ldots, E_{12}). \tag{14}$$

The above process is called Av-ACNN. If feature mappings C1, C2, and C3 are concatenated directly and input into Bi-LSTM, the process is called ACNN.
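A numpy sketch of the Av-ACNN fusion step in equations (11)–(14); the three feature mappings and the kernel count of 64 are illustrative stand-ins for the outputs of the second convolution.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-ins for the three second-convolution feature mappings C1, C2, C3;
# 64 kernels per asymmetric convolution is an assumed setting.
C1, C2, C3 = (rng.normal(size=(12, 64)) for _ in range(3))

D = (C1 + C2 + C3) / 3.0                       # index-wise average, eqs. (11)-(12)
E = np.concatenate([C1, C2, C3, D], axis=1)    # index-wise concatenation, eqs. (13)-(14)
E_acnn = np.concatenate([C1, C2, C3], axis=1)  # plain ACNN skips the average D
print(E.shape, E_acnn.shape)                   # (12, 256) (12, 192)
```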

For feature matrix S in the above example, the process of semantic classification is shown in Figure 4.

The cell structure of LSTM is shown in Figure 5.

4.3. Bi-LSTM Layer

LSTM is a special recurrent neural network (RNN) that can capture longer-distance information. It uses a set of gate controllers to effectively mitigate the vanishing and exploding gradients of RNNs. Bi-LSTM is composed of a forward LSTM and a backward LSTM, which makes up for the shortcomings of LSTM.

LSTM infers the semantics of the biomedical word from the preceding input information but cannot take the subsequent information into account, which affects WSD accuracy. In fact, both the left and right contexts around the biomedical word influence WSD, so accessing the right context in the same way as the left one is very beneficial for biomedical WSD. Bi-LSTM is composed of two LSTMs, a forward LSTM and a backward LSTM, which represent, respectively, the left context and the right context of the biomedical word. Bi-LSTM is well suited to sequence annotation tasks with a top-down relationship; it is often used to model context information in natural language processing and helps biomedical WSD.

In this paper, LSTM receives input over multiple time steps and outputs results at the last time step; the results at the last time step are then summed and input into a fully connected layer. An LSTM unit contains a memory cell Ct and 3 gate controllers: input gate it, forget gate ft, and output gate ot. These 3 gates control the update of the memory cell Ct and the output of the hidden state ht, which is computed as follows:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \tag{15}$$

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right), \tag{16}$$

$$\tilde{C}_t = \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right), \tag{17}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \tag{18}$$

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \tag{19}$$

$$h_t = o_t \odot \tanh(C_t), \tag{20}$$

where Wi, Wf, Wc, and Wo are, respectively, the weight matrices of the input gate, forget gate, candidate state, and output gate; bi, bf, bc, and bo are their bias terms; t = 1, 2, …, 12; σ(∙) is the sigmoid function; tanh(∙) is the hyperbolic tangent activation function; and ⊙ denotes the product of corresponding elements.
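A numpy sketch of one LSTM cell step implementing equations (15)–(20); the input and hidden dimensions are assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])    # input gate, eq. (15)
    f_t = sigmoid(W["f"] @ z + b["f"])    # forget gate, eq. (16)
    c_hat = np.tanh(W["c"] @ z + b["c"])  # candidate state, eq. (17)
    c_t = f_t * c_prev + i_t * c_hat      # memory update, eq. (18)
    o_t = sigmoid(W["o"] @ z + b["o"])    # output gate, eq. (19)
    h_t = o_t * np.tanh(c_t)              # hidden state, eq. (20)
    return h_t, c_t

rng = np.random.default_rng(3)
dim_x, dim_h = 256, 128                   # assumed input and hidden sizes
W = {k: rng.normal(size=(dim_h, dim_h + dim_x), scale=0.1) for k in "ifco"}
b = {k: np.zeros(dim_h) for k in "ifco"}
h, c = np.zeros(dim_h), np.zeros(dim_h)
for t in range(12):                       # one step per convolution window index
    h, c = lstm_step(rng.normal(size=dim_x), h, c, W, b)
print(h.shape)                            # (128,): the hidden state at the last step
```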

At the last time step, the output of the hidden layer in the forward LSTM is $\overrightarrow{h}_l$, and the output of the hidden layer in the backward LSTM is $\overleftarrow{h}_l$.

4.4. Fully Connected Layer and Semantic Classification

The outputs $\overrightarrow{h}_l$ and $\overleftarrow{h}_l$ of Bi-LSTM are input to a fully connected layer. The softmax function is applied to map the output of each neuron to the interval (0, 1), as shown in equation (21), in order to determine the semantic category of the biomedical word.

$$Q = \mathrm{softmax}(Y), \quad Y = \overrightarrow{W}\,\overrightarrow{h}_l + \overleftarrow{W}\,\overleftarrow{h}_l + B. \tag{21}$$

In (21), Q = (P(x1 | c), P(x2 | c), …, P(xr | c)), and Y = (Y1, Y2, …, Yr) is the net activation. $\overrightarrow{W}$ and $\overleftarrow{W}$ denote weight parameters, and B is the bias term. $\overrightarrow{h}_l$ is the output of the hidden layer of the forward LSTM, and $\overleftarrow{h}_l$ is that of the backward LSTM.

P(xi | c) is the probability of biomedical word c belonging to semantic category xi:

$$P(x_i \mid c) = \frac{\exp(Y_i)}{\sum_{j=1}^{r} \exp(Y_j)}. \tag{22}$$

In equation (22), c has r semantic categories, and xi represents the ith semantic class, i = 1, 2, ..., r.

From the probability distribution P(x1 | c), P(x2 | c), …, P(xr | c) of biomedical word c, the semantic category s with the highest probability is selected as the prediction:

$$s = \arg\max_{1 \le i \le r} P(x_i \mid c). \tag{23}$$
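A numpy sketch of the classification step in equations (21)–(23); the hidden states and weights are random placeholders for the trained Bi-LSTM outputs and fully connected parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
r, dim_h = 2, 128                           # e.g. "ADA" has r = 2 senses
h_fwd, h_bwd = rng.normal(size=dim_h), rng.normal(size=dim_h)
W_fwd, W_bwd = (rng.normal(size=(r, dim_h), scale=0.1) for _ in range(2))
B = np.zeros(r)

Y = W_fwd @ h_fwd + W_bwd @ h_bwd + B       # net activation, eq. (21)
Q = np.exp(Y) / np.exp(Y).sum()             # P(x_i | c), eq. (22)
s = int(np.argmax(Q))                       # predicted sense index, eq. (23)
print(Q, s)
```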

5. Biomedical WSD Based on Attention Neural Network

The attention mechanism, Av-ACNN, and Bi-LSTM are combined to disambiguate biomedical words. Their training process includes forward propagation and backpropagation: semantic classification is the forward propagation, while gradient calculation and parameter optimization constitute the backpropagation.

The training process of attention mechanism, Av-ACNN, and Bi-LSTM is shown as follows.

Train_Attention_Av-ACNN_Bi-LSTM( ).
Input: feature vector N of the biomedical word and the manually labeled semantic category hs.
Output: optimized parameter set θ = {$\overrightarrow{W}$, $\overleftarrow{W}$, B, Wi, bi, Wf, bf, Wc, bc, Wo, bo, $W^{1,i}$, $b^{1,i}$, $W^{2,i}$, $b^{2,i}$, WQ, WK, WL}.
Step 1: initialize the iteration number n and parameter set θ.
Step 2: matrix V is constructed based on N.
Step 3: for (m = 1; m ≤ n; m++) {
Forward_Propagation( );
Compute_Gradient( );
Update_Parameter( );
}
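A runnable skeleton of this training loop's control flow; the three stub routines below are deliberately simplified stand-ins (a plain softmax classifier) for the full forward propagation, gradient computation, and parameter update detailed in the following listings.

```python
import numpy as np

def train(V, hs, theta, n, forward, gradient, update):
    """Step 3: iterate forward propagation, gradient computation, and update."""
    for _ in range(n):
        pred = forward(V, theta)
        grads = gradient(V, pred, hs)
        theta = update(theta, grads)
    return theta

# Stub routines: a softmax classifier over a fixed feature vector stands in
# for the attention/Av-ACNN/Bi-LSTM forward and backward passes.
def forward(V, theta):
    y = theta @ V
    return np.exp(y) / np.exp(y).sum()

def gradient(V, pred, hs):
    return np.outer(pred - hs, V)          # softmax cross-entropy gradient

def update(theta, grads, eta=0.1):
    return theta - eta * grads             # gradient descent step

V = np.random.default_rng(5).normal(size=100)
hs = np.array([1.0, 0.0])                  # one-hot label for a 2-sense word
theta = train(V, hs, np.zeros((2, 100)), 50, forward, gradient, update)
```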

Forward propagation is shown as follows.

Forward_Propagation( ).
Step 1: initialize the iteration number n and parameter set θ.
Step 2: matrix V is constructed based on N and input into the attention layer. Feature matrix S is built according to equation (4).
Step 3: according to equation (7), feature mappings Z1, Z2, and Z3 are generated by the convolution operation whose kernel size is 1 × h.
Step 4: according to equation (10), feature mappings C1, C2, and C3 are generated by the convolution operation whose kernel size is ki × 1.
Step 5: according to equation (14), feature mapping E is constructed and input into the Bi-LSTM layer. The hidden layer output ht is calculated according to equation (20).
Step 6: according to equation (23), the category of biomedical word c is determined.

Backward propagation includes gradient calculation and parameter optimization. The process of gradient computation is shown as follows.

Compute_Gradient( ).
Step 1: loss value J is computed as the cross-entropy

$$J = -\sum_{i=1}^{r} y_i \log P(x_i \mid c),$$

where y1, y2, ..., yr are the one-hot codes of hs.
Step 2: the gradient of the output layer is calculated by differentiating J with respect to the net activation Y (⊙ denotes the product of corresponding elements).
Step 3: the gradients at the last time step of the forward LSTM and backward LSTM are, respectively, calculated, where l represents the last time step, $h_l = \{\overrightarrow{h}_l, \overleftarrow{h}_l\}$, and $C_l = \{\overrightarrow{C}_l, \overleftarrow{C}_l\}$.
Step 4: the gradients of ht and Ct at any time step of the LSTM are computed by propagating the gradient of Ct+1 at time t + 1 backward, where $h_t = \{\overrightarrow{h}_t, \overleftarrow{h}_t\}$ and $C_t = \{\overrightarrow{C}_t, \overleftarrow{C}_t\}$.
Step 5: the gradient $\delta^{2,i}$ of the second convolutional layer is calculated, where i indexes the ith asymmetric convolution and 2 denotes the second convolutional layer.
Step 6: the gradient $\delta^{1,i}$ of the first convolutional layer is computed as

$$\delta^{1,i} = \delta^{2,i} * \mathrm{rot180}\left(W^{2,i}\right),$$

where 1 denotes the first convolutional layer and rot180(·) rotates the convolution kernel $W^{2,i}$ by 180 degrees.

The process of parameter update is shown as follows.

Update_Parameter( ).
Step 1: update weights $\overrightarrow{W}$ and $\overleftarrow{W}$ and bias B of the fully connected layer by gradient descent, for example

$$\overrightarrow{W} \leftarrow \overrightarrow{W} - \eta\,\frac{\partial J}{\partial \overrightarrow{W}},$$

and analogously for $\overleftarrow{W}$ and B, where η is the learning rate and $H = \{\overrightarrow{h}_l, \overleftarrow{h}_l\}$.
Step 2: update the weight matrices Wy and Wc and bias terms by and bc in the LSTM cell in the same way, where l is the last time step, η is the learning rate, and y ∈ {i, f, o}.
Step 3: update weight $W^{2,i}$ and bias $b^{2,i}$ in the second convolutional layer, where Zi is the output of the first convolutional layer.
Step 4: update weight $W^{1,i}$ and bias $b^{1,i}$ in the first convolutional layer.
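A minimal illustration of the plain gradient-descent pattern shared by all four steps above (no momentum or adaptive scaling is assumed): each parameter moves against its gradient, scaled by the learning rate η.

```python
import numpy as np

def sgd_update(param, grad, eta=0.01):
    return param - eta * grad  # move against the gradient, scaled by eta

W = np.ones((3, 3))
dW = np.full((3, 3), 0.5)
print(sgd_update(W, dW))       # every weight decreases by eta * its gradient
```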

When the semantic category of biomedical word c is to be determined, disambiguation features are extracted from its 4 adjacent lexical units. Feature matrix V is constructed and input into the attention layer. According to equation (4), the attention layer outputs feature matrix S, which is input into the convolutional layer. Based on equation (10), feature mappings C1, C2, and C3 of the asymmetric convolutions are computed. According to the indices of the convolution window, C1, C2, and C3 are fused twice to obtain feature mapping E, which is input into the Bi-LSTM layer. The output of the Bi-LSTM layer is computed based on equation (20). According to equation (22), the probability distribution P(xi | c) of biomedical word c under semantic category xi is calculated, and according to equation (23), its semantic category is determined.

6. Experiments

6.1. Training and Test Corpus

The MSH dataset is a WSD dataset in the biomedical domain. It consists of 106 ambiguous abbreviations, 88 ambiguous terms, and 9 entries that are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term and abbreviation, the dataset contains a maximum of 100 sentences per sense obtained from MEDLINE.

51 biomedical words are selected from the MSH dataset. There are 38 biomedical words with two semantic categories: AA, ADA, ADP, ALS, BAT, BLM, CDR, Cilia, DI, eCG, EM, EMS, FAS, Fish, FTC, GAG, Ganglion, HCl, HR, IA, INDO, IP, ITP, JP, MCC, Medullary, MRS, NBS, NM, OCD, ORI, PAF, Phosphorylase, RB, SARS, SCD, THYMUS, and WT1. There are 11 biomedical words with 3 semantic categories: Ala, Cold, Cortical, CP, DDS, Ice, Lens, Lupus, PCP, RA, and TAT. There is only one biomedical word with 4 semantic categories and one with 5 semantic categories in the MSH dataset; they are, respectively, Ca and PCA, and both are selected. Sentences containing these 51 biomedical words are used as the training and test corpora to measure the proposed method's performance.

6.2. Experiment Analysis

Ten groups of experiments are carried out, and average disambiguation accuracy is used to evaluate the performance of the WSD classifier. It is defined as

$$p_i = \frac{m_i}{n_i}, \quad p_{avg} = \frac{1}{N}\sum_{i=1}^{N} p_i,$$

where N is the number of all biomedical words, mi is the number of correctly classified test sentences for the ith biomedical word, ni is the number of all test sentences containing the ith biomedical word, pi is the disambiguation accuracy of the ith biomedical word, and pavg is the average disambiguation accuracy.
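The same computation as a short Python function, with illustrative counts:

```python
def disambiguation_accuracy(m, n):
    """m[i]: correctly classified test sentences of word i; n[i]: all of them."""
    p = [mi / ni for mi, ni in zip(m, n)]  # per-word accuracy p_i
    return p, sum(p) / len(p)              # p_avg over N words

p, p_avg = disambiguation_accuracy([90, 85], [100, 100])  # illustrative counts
print(p, p_avg)  # [0.9, 0.85] 0.875
```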

In Experiments 1 to 3, method (1) is used to construct feature matrix F. In Experiment 1, CNN determines the semantic category of the biomedical word; in Experiment 2, LSTM; and in Experiment 3, Bi-LSTM. In Experiment 4, the morphology, part of speech, and semantic information of the 4 adjacent vocabulary units of the biomedical word are used as disambiguation features, feature matrix V is constructed by method (3), and the proposed framework determines the semantic class. Disambiguation accuracies of Experiments 1 to 4 are shown in Table 1.

The average disambiguation accuracy of Experiment 2 is 1.19% higher than that of Experiment 1, showing that LSTM is more suitable than CNN for disambiguation on the MSH corpus. The average disambiguation accuracy of Experiment 3 is 1.35% higher than that of Experiment 2, because Bi-LSTM takes contextual information from two directions into account. In Experiment 4, the proposed network is used to disambiguate biomedical words; its disambiguation accuracy is 5.6% higher than that of Experiment 3. The reason is that morphology, part of speech, and semantic information are all extracted as disambiguation features in Experiment 4, whereas Experiment 3 considers only morphology. The proposed network is thus better than Bi-LSTM.

In Experiments 5, 6, and 7, the morphology, part of speech, and semantic information in the 4 left and right lexical units of the biomedical word are selected as disambiguation features, and feature matrix M is constructed by method (2). In Experiment 5, ACNN and LSTM are combined to determine the semantic class of the biomedical word; in Experiment 6, ACNN and Bi-LSTM; and in Experiment 7, Av-ACNN and Bi-LSTM. The MSH training corpus is used to optimize the WSD classifier, and the optimized WSD model is applied to the MSH test corpus. Disambiguation accuracies of Experiments 5 to 7 are shown in Table 2.

The average disambiguation accuracy of Experiment 6 is 1.25% higher than that of Experiment 5. The reason is that Bi-LSTM is composed of a forward LSTM and a backward LSTM, which obtain feature information from the context in two directions, whereas LSTM obtains feature information from only one direction; therefore, the disambiguation accuracy of Bi-LSTM is higher than that of LSTM. The average disambiguation accuracy of Experiment 7 is 0.72% higher than that of Experiment 6. The reason is that in Experiment 7 the average D of C1, C2, and C3 is computed according to the indices of the convolution window, and then C1, C2, C3, and D are concatenated according to those indices to get feature mapping E. Because E includes more disambiguation information, the disambiguation effect is improved.

In Experiment 8, morphology, part of speech, and semantic information are extracted as disambiguation features from the 4 adjacent lexical units of the biomedical word. Feature matrix V is generated by method (3), and Av-ACNN and Bi-LSTM are combined to determine the semantic class of the biomedical word. Disambiguation accuracies of Experiments 4, 7, and 8 are shown in Table 3.

The disambiguation effect of Experiment 7 is better than that of Experiment 8, which shows that the feature matrix constructed by method (2) has stronger disambiguation ability than that constructed by method (3). The reason is that method (2) considers the positions of the adjacent words: generally, the closer a context word is to the biomedical word, the more important it is for determining the category of the biomedical word, and the bigger its weight. The disambiguation accuracy of Experiment 4 is 2.26% higher than that of Experiment 7, because the attention layer in Experiment 4 dynamically generates the weight coefficients used to construct the feature matrix, whereas in Experiment 7 the weight coefficients are set manually. Therefore, the disambiguation effect of Experiment 4 is better than that of Experiment 7. The disambiguation accuracy of Experiment 4 is also 3.14% higher than that of Experiment 8, in which the attention layer is not used. This shows that the attention layer helps improve the disambiguation effect.

The size of the convolution kernel has a great influence on feature extraction: if the kernel is too small, it cannot extract effective features; if it is too large, the amount of computation increases. Three groups of experiments were conducted in which 3 convolution kernels of different sizes are used to obtain discriminative features. In Experiments 4, 9, and 10, the morphology, part of speech, and semantic information in the 4 left and right lexical units of the biomedical word are selected as disambiguation features; method (3) is used to construct feature matrix V; and the attention network proposed in this paper determines the semantic class of the biomedical word. The second convolution kernels of the 3 asymmetric convolution operations are, respectively, of sizes 2, 3, and 5 in Experiment 4; 3, 4, and 5 in Experiment 9; and 1, 2, and 3 in Experiment 10. The disambiguation accuracies of Experiments 4, 9, and 10 are shown in Table 4.

The average disambiguation accuracy of Experiment 4 is higher than those of Experiments 9 and 10, which shows that the convolution kernel sizes in Experiment 4 are suitable.

From Experiment 1 to Experiment 8, the average disambiguation accuracies of biomedical words with 2, 3, 4, and 5 semantic categories are calculated. Experimental results are shown in Figure 6.

From Figure 6, it can be seen that the average disambiguation accuracy of the proposed method is better than that of the other methods for 2, 3, 4, and 5 semantic classes.

The average disambiguation accuracy under 2 semantic categories is higher than that under 3 and 4 categories, because the fewer the semantic classes, the easier biomedical WSD is. However, the disambiguation accuracy under five categories is relatively high. The reason is that there is only one ambiguous word with five categories, and the distribution of its training corpus may match that of its test corpus.

Finally, the time cost of the proposed model is analyzed. Here, n is the sequence length, d is the representation dimension, k is the kernel size of the convolutions, c is the number of categories, and s is the number of support vectors. The run-time complexity of CNN is O(k·n·d); that of LSTM and Bi-LSTM is O(n·d²); and that of attention is O(n²·d). So, the run-time complexity of the proposed method is O(n²·d + k·n·d + n·d²). The run-time complexity of KNN is O(n·d), of logistic regression O(d), of Naive Bayes O(n·d), and of SVM O(s). Although the run-time complexity of the proposed method is the highest, its average disambiguation accuracy is also the highest.

7. Conclusions

In this paper, morphology, part of speech, and semantic information are extracted as disambiguation features from the 4 adjacent lexical units of the biomedical word. The attention mechanism is used to generate a feature matrix, from which Av-ACNN and Bi-LSTM extract discriminative features. Based on these discriminative features, the softmax function is applied to determine the category of the biomedical word. Experimental results show that the proposed method is more suitable for biomedical WSD than the other methods.

Data Availability

MSH data can be downloaded from https://wsd.nlm.nih.gov/collaboration.shtml.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grants nos. 61502124 and 60903082), the Fundamental Research Foundation for Universities of Heilongjiang Province (Grant no. LGYC2018JC014), and the Natural Science Foundation of Heilongjiang Province of China (Grant no. F2017014).