Abstract

Social media in medicine, where patients can express their personal treatment experiences by personal computers and mobile devices, usually contains plenty of useful medical information, such as adverse drug reactions (ADRs); mining this information from social media has attracted increasing attention from researchers. In this study, we propose a deep neural network (called LSTM-CRF) combining long short-term memory (LSTM) neural networks (a type of recurrent neural network) and conditional random fields (CRFs) to recognize ADR mentions from social media in medicine and investigate the effects of three factors on ADR mention recognition. The three factors are as follows: (1) representation for continuous and discontinuous ADR mentions: two novel representations, that is, “BIOHD” and “Multilabel,” are compared; (2) subject of posts: each post has a subject (i.e., drug here); and (3) external knowledge bases. Experiments conducted on a benchmark corpus, that is, CADEC, show that (1) LSTM-CRF achieves a better F-score than CRF; (2) “Multilabel” is better in representing continuous and discontinuous ADR mentions than “BIOHD”; and (3) both subjects of posts and external knowledge bases are individually beneficial to ADR mention recognition. To the best of our knowledge, this is the first study to investigate deep neural networks for mining continuous and discontinuous ADRs from social media.

1. Introduction

With rapid growth of online health social networks, such as DailyStrength.com [1], Askapatient.com [2], MedHelp.org [3], and PatientsLikeMe.com [4], more and more patients share their personal health-related information through social media posts. This information can be utilized for public health monitoring, particularly for pharmacovigilance via mining adverse drug reactions (ADRs) using natural language processing techniques.

ADRs, defined as noxious and unintended responses to medicinal products occurring at doses normally used in man for the prophylaxis, diagnosis, or therapy of disease or for the restoration, correction, or modification of physiological function [5], are an essential part of drug safety. ADRs are usually discovered in one of two ways: some are discovered during phase III clinical trials for drug development, and some are revealed during postmarketing surveillance. In postmarketing surveillance, ADRs are traditionally identified by individual patients and their physicians using official adverse event reporting systems (AERS). In recent years, there have been some attempts at mining ADRs and drug-drug interactions [6, 7] from unstructured text in clinical records, literature, and health-related social media, and their experimental results have shown the effectiveness of mining ADRs from unstructured text. In this study, we focus on recognizing ADR mentions, including continuous and discontinuous ADR mentions, from social media according to all comments on specific drugs. For example, given a user’s post “I still have pain in arms and legs with much stiffness.” our goal is to extract three ADR mentions, namely, “pain in arms,” “pain in…legs,” and “stiffness,” where “pain in arms” and “stiffness” are continuous ADR mentions composed of continuous words and “pain in…legs” is a discontinuous ADR mention composed of discontinuous words.

Although there have been a number of methods proposed for recognizing ADR mentions from social media, most of them used traditional machine learning methods such as conditional random fields (CRFs) to recognize continuous ADR mentions. In this work, we propose a deep neural network (called LSTM-CRF), which combines long short-term memory (LSTM) neural networks (a type of recurrent neural networks, RNNs) and conditional random fields (CRFs), to recognize both continuous and discontinuous ADR mentions from social media. Compared to CRFs, the advantages of LSTM-CRF lie in that LSTM neural networks have strong expressive ability to capture long context without time-intensive feature engineering. It has shown better performance than CRFs for some sequence labeling tasks such as part-of-speech (POS) tagging and named entity recognition (NER) ‎[8].

In order to comprehensively investigate LSTM-CRF on ADR mention recognition, we compare two novel unified representations (i.e., “BIOHD” and “Multilabel”) for continuous and discontinuous ADR mentions and study the effects of post subjects and external knowledge bases. All models are evaluated on a benchmark corpus, that is, CADEC, composed of 1250 forum posts for 12 drugs taken from AskaPatient.com, where each post has been manually annotated with ADR mentions. Our results show that LSTM-CRF performs better than CRF, “Multilabel” is more suitable than “BIOHD” to represent continuous and discontinuous ADR mentions, and both subject of posts and external knowledge bases are individually beneficial to ADR mention recognition.

On the whole, the contributions of this work can be summarized as follows: (1) we introduce, for the first time, a deep neural network that combines LSTM neural networks and CRFs to recognize continuous and discontinuous ADR mentions; (2) we compare two unified representations for continuous and discontinuous ADR mentions; (3) we investigate the effects of post subjects and knowledge bases; and (4) we conduct an empirical evaluation of all models on a benchmark corpus.

This paper is organized as follows: Section 2 surveys related work; Section 3 introduces the LSTM-CRF model; Section 4 describes our experiments in detail; Section 5 provides discussion; and Section 6 concludes the paper. An earlier version of the paper was presented at the 3rd China Health Information Processing Conference (CHIP-2017).

2. Related Work

In recent years, social media has been increasingly used for medical research, especially for pharmacovigilance via mining ADRs from health-related posts. A number of studies have addressed ADR mention recognition, from corpus construction to methods. Posts from DailyStrength.com [1], Yahoo Wellness Groups [9], Askapatient.com [2], Medications.com [10], WebMD.com [11], MedHelp.org [3], SteadyHealth.com [12], PatientsLikeMe.com [4], parenting websites [13], various disease-specific forums such as Diabetes and Cancer [14], Twitter [15], Facebook [16], and other websites or forums [17] have been collected to mine ADRs. On these data, a variety of methods have been employed to recognize ADR mentions. They fall into three categories: lexicon-based [18–29], pattern-based [30, 31], and machine learning-based [32, 33]. The earliest work, by Leaman et al. in 2010 [18], utilized a lexicon-based method to recognize ADR mentions from user posts regarding six drugs from DailyStrength.com. In this work, 450 out of 3600 posts were used for system development and the remaining posts for system evaluation. Although lexicon-based methods can successfully recognize ADR mentions using several extensive and available ADR resources, they cannot address challenges such as idiomatic expressions and misspelled mentions. To overcome some of these challenges, pattern-based methods were proposed on top of lexicon-based methods to detect inexact-match ADR mentions. For example, Yates et al. [34] designed seven patterns to recognize ADR mentions from posts regarding five drugs for breast cancer from Askapatient.com, Drugs.com, and DrugRatingZ.com. The limitation of pattern-based methods is the need for large amounts of data to generate patterns.

Recently, with some annotation data available, machine learning-based methods, such as CRFs, are becoming more and more popular with promising performances, where ADR mention recognition is considered as a sequence-labeling problem. Sarker and Gonzalez ‎[35] made a comprehensive review of text mining techniques for ADR mining before 2015. As mentioned in this review, most state-of-the-art machine learning methods are based on CRFs with rich hand-crafted features, and only continuous ADR mentions are taken into account. In 2016, Pacific Symposium on Biocomputing (PSB) launched a shared task on mining social media to exploit natural language processing techniques for ADR extraction in tweets, where subtask 2 is to automatically extract ADR mentions in user posts. In this subtask, only continuous ADR mentions were considered, and machine learning methods based on CRFs achieved best results again ‎[33].

Actually, discontinuous entity mentions are very common in the medical domain. As reported in ‎[36], discontinuous disorder mentions in clinical text accounted for about 10%. In social media, discontinuous ADR mentions also usually appear. Karimi et al. ‎[37] annotated a corpus of adverse drug events including both continuous and discontinuous ADR mentions, that is, CADEC. To recognize continuous and discontinuous ADR mentions simultaneously, Metke-Jimenez and Karimi ‎[38] followed Tang et al.’s ‎[36] way to represent them in a unified schema and used CRFs with baseline features, including bag-of-words, character n-grams, and word shapes. To the best of our knowledge, this is the only study that considered both continuous and discontinuous ADR mentions in the task of ADR mention recognition. There are also some other studies conducted on this corpus; however, all of them only consider continuous ADR mentions or convert every discontinuous ADR mention into one or more continuous ADR mentions. For example, Tutubalina and Nikolenko ‎[39] proposed a method, uniting RNNs and CRFs, to recognize ADR mentions on CADEC. They excluded overlaps between spans of discontinuous ADRs by selecting the longest continuous span and combining these ADRs into a single continuous ADR.

Deep learning methods have been increasingly applied to solve NLP tasks in the medical domain and achieve better performance than CRFs. In the case of ADR mention recognition, Stanovsky et al. ‎[40] employed RNNs with word embeddings trained on a Blekko medical corpus in conjunction with entity embeddings trained on DBpedia. If an entity mention was a lexical match with one of DBpedia entities, then the entity embeddings trained on DBpedia replaced word embeddings of all words in the entity mention. Tutubalina and Nikolenko ‎[39] utilized multilayer RNNs (LSTM and GRU) with CRFs and achieved better performance than RNNs and CRFs individually. However, no study focuses on applying deep learning methods to recognize both continuous and discontinuous ADR mentions simultaneously.

3. Methods

Before recognizing continuous and discontinuous ADR mentions, we need to decide how to represent them. Therefore, in this section, we first introduce the representation schemas for both continuous and discontinuous ADR mentions and then the machine learning methods.

3.1. Representations

Two novel representations are adopted in our study: “BIOHD” and “Multilabel.” “BIOHD” is an extension of the traditional named entity representation schema “BIO” (B-beginning of an ADR mention, I-inside of an ADR mention, O-outside of an ADR mention) by introducing two additional tags: “H,” a part shared by multiple medical mentions (e.g., ADR mentions), and “D,” a part of a discontinuous medical mention not shared by other mentions. “Multilabel” allows a token to be labeled with more than one tag, and each tag corresponds to the position in one mention. The number of tags a token has is determined by how many mentions it appears in. Table 1 gives examples of continuous and discontinuous ADR mentions represented by “BIOHD” and “Multilabel,” respectively. In sentence 1, there is one discontinuous ADR mention, “pain in…left shoulder,” while in sentence 2 there are two continuous ADR mentions, “stiffness” and “pain in arms,” and one discontinuous ADR mention, “pain in…legs.” The ADR mentions “pain in arms” and “pain in…legs” in sentence 2 share the part “pain in.” For convenience, a token’s multiple tags can be combined into one tag. For example, sentence 2 can be tagged with “I/O still/O have/O pain/B-B in/I-I arms/I-O and/O legs/O-I with/O much/O stiffness/B-O ./O,” where each token and its tag(s) are separated by “/,” and multiple tags are joined by “-.” In this study, the maximum number of tags a token has is set to 4 according to statistics from the training corpus.
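
To make the “Multilabel” tagging concrete, the following is a minimal sketch that reproduces the tagged form of sentence 2 from mention token indices. The function name `multilabel_tags`, the representation of a mention as a list of token indices, and the greedy rule that places each mention in the first tag slot it does not collide with are assumptions made for illustration; they are not taken from the original system.

```python
def multilabel_tags(tokens, mentions):
    """Assign combined Multilabel tags to tokens.

    `mentions` is a list of token-index lists, one per ADR mention
    (hypothetical input format). Each mention is written into the first
    slot whose positions are still free; a new slot is opened otherwise.
    """
    slots = []  # each slot holds one B/I/O tag per token
    for mention in mentions:
        idx = sorted(mention)
        for slot in slots:
            if all(slot[i] == "O" for i in idx):
                break  # this slot has room for the mention
        else:
            slot = ["O"] * len(tokens)
            slots.append(slot)
        slot[idx[0]] = "B"
        for i in idx[1:]:
            slot[i] = "I"

    tags = []
    for i in range(len(tokens)):
        column = [slot[i] for slot in slots]
        # Tokens outside every mention keep the single tag "O".
        tags.append("-".join(column) if any(t != "O" for t in column) else "O")
    return tags


tokens = "I still have pain in arms and legs with much stiffness .".split()
mentions = [[3, 4, 5], [3, 4, 7], [10]]  # pain in arms, pain in...legs, stiffness
for token, tag in zip(tokens, multilabel_tags(tokens, mentions)):
    print(f"{token}/{tag}", end=" ")
```

With these inputs the sketch prints the same token/tag sequence as the example for sentence 2 in the text.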

3.2. LSTM-CRF

When continuous and discontinuous ADR mentions are represented by “BIOHD” or “Multilabel,” recognizing them still can be formulated as a sequence labeling problem. In this study, we use LSTM-CRF to model this problem. Figure 1 illustrates the architecture of LSTM-CRF, which is composed of three layers as follows: input layer, LSTM-layer, and CRF layer.

The input layer takes in different types of embeddings of each token. The embeddings used in this study include word embeddings, char-level embeddings, subject-related embeddings, and knowledge-based embeddings.

The LSTM layer uses bidirectional LSTM neural networks to generate a hidden context representation at each position. Given a sentence $s = w_1 w_2 \ldots w_n$, where each word $w_t$ ($1 \le t \le n$) is represented by $x_t$ (i.e., the concatenation of word embeddings, char-level embeddings, subject-related embeddings, and knowledge-based embeddings of word $w_t$), the bidirectional LSTM neural networks take a sequence of embeddings $x = x_1 x_2 \ldots x_n$ as input and output a sequence of hidden context representations $h = h_1 h_2 \ldots h_n$, where $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ is a concatenation of the outputs of both forward and backward LSTM neural networks.

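The CRF layer takes a sequence of hidden context representations $h = h_1 h_2 \ldots h_n$ as input, estimates the probabilities of label sequences (from a predefined set), and returns the one of highest probability. As shown in [41], the conditional probability of a label sequence $y = y_1 y_2 \ldots y_n$ for $h$ is computed as follows (only taking the first-order linear chain CRF as an example here):

$$p(y \mid h) = \frac{\prod_{t=1}^{n} \exp\left(A_{y_{t-1}, y_t} + E_{t, y_t}\right)}{\sum_{y' \in Y(h)} \prod_{t=1}^{n} \exp\left(A_{y'_{t-1}, y'_t} + E_{t, y'_t}\right)},$$

where $A_{y_{t-1}, y_t}$ is the transition score for $(y_{t-1}, y_t)$, with $y_0$ denoting a start label, $E_{t, y_t}$ is the emission score generating $y_t$ given $h_t$, and $Y(h)$ represents all possible label sequences for $h$.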
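
As an illustration of the first-order linear-chain CRF probability described above, the following sketch computes the score of a label sequence, the normalizer over all label sequences (with the forward algorithm in log space), and their ratio. It is a toy example in pure Python under stated assumptions: the emission scores, transition scores, and start scores are made-up numbers, and the function names are hypothetical; in LSTM-CRF the emission scores would come from the LSTM layer.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, start, labels):
    """Sum of transition and emission scores along one label sequence."""
    score = start[labels[0]] + emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def log_partition(emissions, transitions, start):
    """Log of the normalizer, computed with the forward algorithm."""
    k = len(start)
    alpha = [start[j] + emissions[0][j] for j in range(k)]
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(k)])
            + emissions[t][j]
            for j in range(k)
        ]
    return logsumexp(alpha)


def sequence_probability(emissions, transitions, start, labels):
    """Conditional probability of `labels` under the toy CRF."""
    return math.exp(
        sequence_score(emissions, transitions, start, labels)
        - log_partition(emissions, transitions, start)
    )


# Toy scores for 3 positions and 2 labels (illustrative values only).
emissions = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]
transitions = [[0.5, -0.2], [-1.0, 0.7]]
start = [0.0, 0.0]

total = sum(
    sequence_probability(emissions, transitions, start, list(y))
    for y in product(range(2), repeat=3)
)
print(round(total, 6))  # probabilities over all label sequences sum to 1
```

The forward recursion avoids enumerating every label sequence; the explicit enumeration at the end is only a sanity check that the probabilities are normalized.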

4. Experiments

4.1. Corpus

We use a publicly available annotated corpus called CSIRO Adverse Drug Event Corpus (CADEC) from AskaPatient.com to evaluate the performance of LSTM-CRF. On AskaPatient.com, all comments are grouped by drugs. CADEC contains 1250 posts about 12 drugs (including Cataflam, Voltaren-XR, Arthrotec, Pennsaid, Solaraze, Flector, Cambia, Zipsor, Diclofenac Sodium, and Diclofenac Potassium), and the posts are manually annotated with five types of ADR-related events (including ADR, Drug, Disease, and Symptom). In CADEC, there are 6318 ADR mentions, 1000 of which are discontinuous, accounting for 15.83%. Among the 1000 discontinuous ADR mentions, 918 share some parts with other mentions, that is, they overlap.

4.2. Evaluation Metrics

We use precision ($P$), recall ($R$), F-score ($F$), and accuracy (Acc) to evaluate ADR mention recognition systems. They are defined as follows:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2 \times P \times R}{P + R}, \quad \text{Acc} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$ is the number of ADR mentions correctly predicted by a system, $FP$ is the number of ADR mentions predicted by a system but not in the gold standard corpus, $FN$ is the number of ADR mentions in the gold standard corpus but not predicted by a system, and $TN$ is the number of nonentities (tagged as “O”) correctly predicted.

Two kinds of criteria, that is, strict and relaxed, are adopted to calculate $P$, $R$, $F$, and Acc. Under the strict criterion, an ADR mention is correctly predicted only when it is exactly the same as the gold-standard one. Under the relaxed criterion, an ADR mention is correctly predicted as long as it overlaps with the gold-standard one.
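
The difference between the two criteria can be sketched as follows for precision, recall, and F-score. This is an illustrative example, not the evaluation script used in the experiments: mentions are assumed to be tuples of token indices, the function name `prf` is hypothetical, and under the relaxed criterion matched predictions and matched gold mentions are counted separately, which is one possible convention. Accuracy is omitted because it also needs token-level counts.

```python
def prf(gold, predicted, relaxed=False):
    """Precision, recall, and F-score over mentions given as index tuples."""

    def match(a, b):
        if relaxed:
            return bool(set(a) & set(b))  # any shared token counts
        return set(a) == set(b)           # all tokens must agree

    matched_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# "I still have pain in arms and legs with much stiffness ."
gold = [(3, 4, 5), (3, 4, 7), (10,)]   # pain in arms, pain in...legs, stiffness
predicted = [(3, 4, 5), (7,), (10,)]   # "legs" predicted without "pain in"

print(prf(gold, predicted))                # strict: the partial mention is an error
print(prf(gold, predicted, relaxed=True))  # relaxed: the partial mention counts
```

In this toy case the strict scores are all 2/3, while the relaxed scores are all 1.0, because the prediction “legs” overlaps the gold mention “pain in…legs” without matching it exactly.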

4.3. Experimental Results

We reimplement Metke-Jimenez and Karimi’s CRF-based system [38], select LSTM-CRF using only word embeddings and char-level embeddings as a baseline, compare LSTM-CRF with CRF, and investigate the effects of ADR mention representations, post subjects, and knowledge bases on LSTM-CRF. The word embeddings are initialized by GloVe [42] and 100-dimensional pretrained embeddings on a large-scale unlabeled dataset from Wikipedia [43]. We use bidirectional LSTM neural networks to extract char-level embeddings. The LSTM neural networks take the character sequence of each word (each character is represented by a character embedding) as input and output two hidden sequence representations. The last two outputs of the bidirectional LSTM neural networks are simply concatenated into char-level embeddings. The character embeddings are randomly initialized from a uniform distribution, and their dimension is set to 25. The dimension of the char-level embeddings is also set to 25. The subject-related embeddings are randomly initialized from a uniform distribution, and their dimension is set to 10. We label each token in a sentence with BIOES (B-beginning of an ADR mention, I-inside of an ADR mention, O-outside of an ADR mention, E-end of an ADR mention, and S-a single-token ADR mention) through knowledge base lookup and utilize 10-dimensional embeddings, randomly initialized from a uniform distribution, to represent each token’s label. The SIDER database (http://sideeffects.embl.de/) and the ADR lexicon (http://diego.asu.edu/downloads/publications/ADRMine/ADR_lexicon.tsv) are the two knowledge bases used in this study. All embeddings are fine-tuned during training.
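
The knowledge-based BIOES labeling can be illustrated with a short sketch. The text states only that tokens are labeled by looking them up in the knowledge bases; the greedy longest-match strategy, the lowercasing, the `max_len` limit, the function name `bioes_lookup`, and the tiny in-memory lexicon below are all assumptions for illustration and do not reflect the actual contents of SIDER or the ADR lexicon.

```python
def bioes_lookup(tokens, lexicon, max_len=5):
    """Label tokens with BIOES by greedy longest match against a lexicon.

    `lexicon` is a set of lowercase token tuples (hypothetical format).
    """
    lowered = [token.lower() for token in tokens]
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in lexicon:
                matched = n
                break
        if matched == 1:
            labels[i] = "S"
        elif matched > 1:
            labels[i] = "B"
            for j in range(i + 1, i + matched - 1):
                labels[j] = "I"
            labels[i + matched - 1] = "E"
        i += max(matched, 1)
    return labels


# Toy lexicon entries, not taken from SIDER or the ADR lexicon.
lexicon = {("pain", "in", "arms"), ("pain",), ("stiffness",)}
tokens = "I still have pain in arms and legs with much stiffness .".split()
print(list(zip(tokens, bioes_lookup(tokens, lexicon))))
```

Each resulting label would then be mapped to its 10-dimensional embedding and concatenated with the other embeddings of the token.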

As there is no fixed way to divide CADEC into two parts, one for system development and the other one for system evaluation, we adopt 10-fold cross-validation. Following the previous study ‎[8], we set other parameters of LSTM-CRF as follows: dimension of LSTM hidden states: 100, optimizer: SGD, learning rate: 0.005, dropout rate: 0.5, and maximum number of epochs: 200. The results of different methods are shown in Table 2, where the highest values are highlighted in bold.
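
For reference, the setup above can be written down as a small configuration plus a 10-fold split. This is a sketch under assumptions: the dictionary keys, the function name `k_fold_indices`, splitting at the post level, and the fixed random seed are illustrative choices that the text does not specify; only the listed hyperparameter values and the number of folds come from the description above.

```python
import random

# Hyperparameter values as listed in the text; the key names are hypothetical.
HYPERPARAMS = {
    "lstm_hidden_dim": 100,
    "optimizer": "SGD",
    "learning_rate": 0.005,
    "dropout": 0.5,
    "max_epochs": 200,
}


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# CADEC contains 1250 posts, so each fold holds out 125 of them.
for fold, (train, test) in enumerate(k_fold_indices(1250)):
    print(fold, len(train), len(test))
```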

Table 2 shows that the methods using “Multilabel” outperform those using “BIOHD.” For example, the strict F-score of CRF using “Multilabel” is higher than that of CRF using “BIOHD” by 0.76% (0.6060 versus 0.5984). Compared with CRF, LSTM-CRF using only word embeddings and char-level embeddings (denoted as LSTM-CRF (baseline)) achieves better performance. When continuous and discontinuous ADR mentions are represented by “BIOHD,” LSTM-CRF achieves a higher strict F-score than CRF by 4.92% (0.6476 versus 0.5984), while when continuous and discontinuous ADR mentions are represented by “Multilabel,” LSTM-CRF achieves a higher strict F-score than CRF by 4.99% (0.6559 versus 0.6060). Both subject-based embeddings (denoted by “subject” in Table 2) and knowledge-based embeddings (denoted by “knowledge” in Table 2) are individually beneficial to LSTM-CRF. When subject-based embeddings are added, the strict F-score of LSTM-CRF using “Multilabel” is improved from 0.6559 to 0.6636, which is the highest F-score. When knowledge-based embeddings are added, the strict F-score of LSTM-CRF using “Multilabel” is improved from 0.6559 to 0.6593. When both of them are added, LSTM-CRF achieves a strict F-score of 0.6614, which is slightly lower than the highest one (i.e., 0.6636). The differences between the strict F-scores and relaxed F-scores of the same methods exceed 20%, indicating that exactly detecting the boundaries of ADR mentions is not easy.

In addition, we analyze the performance of different methods on continuous and discontinuous ADR mentions separately, as shown in Table 3, where the highest values are highlighted in bold. LSTM-CRF using the “Multilabel” representation and subject-based embeddings achieves the highest strict F-score of 69.94% for continuous ADR mentions, while LSTM-CRF using the “Multilabel” representation and knowledge-based embeddings achieves the highest strict F-score of 41.87% for discontinuous ADR mentions. The difference between the two highest F-scores is about 28% (69.94% versus 41.87%). The baseline LSTM-CRF achieves much higher F-scores than CRF, by about 5% on continuous ADR mentions and by about 8% on discontinuous ADR mentions. The methods using “Multilabel” almost always outperform those using “BIOHD” on continuous ADR mention recognition. The strict F-score difference between the methods using the two different representations ranges from 0.2% to 0.59%. For discontinuous ADR mentions, the methods using “Multilabel” always outperform those using “BIOHD,” by 4.93% in average strict F-score, which is much larger than the strict F-score difference between the methods using the two different representations for continuous ADR mentions.

5. Discussion

In this study, we propose a deep neural network (i.e., LSTM-CRF) to recognize continuous and discontinuous ADR mentions from medical social media, compare it with CRF, and investigate the effects of different factors on the proposed method. Similar to other related tasks such as NER and POS tagging, LSTM-CRF outperforms CRF on continuous and discontinuous ADR mention recognition. The methods using “Multilabel” outperform those using “BIOHD.” Both subject-based embeddings and knowledge-based embeddings are individually beneficial to ADR mention recognition, but when both of them are added simultaneously, the performance is not further improved.

The reason why the methods using “Multilabel” outperform those using “BIOHD” may be that “Multilabel” has better representation ability than “BIOHD,” especially for discontinuous ADR mentions. In theory, “Multilabel” is perfect (with a coverage of 100%), while “BIOHD” is imperfect [36]. The coverage of “BIOHD” on CADEC is 89.36%. Because of this, the strict F-score difference between systems using “Multilabel” and “BIOHD” for discontinuous ADR mentions is much larger than that for continuous ADR mentions, although the distributions of continuous and discontinuous ADR mentions also affect the performance. For example, there are four ADR mentions, “Extremely bad pains in hands,” “Extremely bad pains in…arms,” “Extremely bad pains in…muscles,” and “Extremely bad pains in…quivering,” recognized by “BIOHD” in “Extremely/HB bad/HI pains/HI in/HI hands/DB ,/O arms/DB ,/O and/O muscles/DB are/O constantly/O quivering/DB ./O”; however, in fact, there are only three ADR mentions, “Extremely bad pains in hands,” “Extremely bad pains in…arms,” and “muscles…quivering.” Since different drugs have different ADRs, adding subject-based embeddings amounts to adding the relations between drugs and their ADRs, similar to the relations in knowledge bases. This may be the reason for the improvement from subject-based embeddings and the reason why simultaneously adding both the subject-based embeddings and knowledge-based embeddings does not bring further improvements.
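
The ambiguity in the “BIOHD” example can be reproduced with a simplified decoder. This is a sketch under assumptions, not the decoding procedure of the original system: it only covers the shared-head case, pairs every “H” chunk with every “D” chunk in the sentence, and uses hypothetical function names and the tag inventory B, I, HB, HI, DB, and DI.

```python
def chunks(tags, begin, inside):
    """Return token-index lists for chunks of the form `begin inside*`."""
    found, current = [], []
    for i, tag in enumerate(tags):
        if tag == begin:
            if current:
                found.append(current)
            current = [i]
        elif tag == inside and current:
            current.append(i)
        else:
            if current:
                found.append(current)
            current = []
    if current:
        found.append(current)
    return found


def decode_biohd(tokens, tags):
    """Decode mentions, combining each shared head with each D chunk."""
    mentions = [" ".join(tokens[i] for i in c) for c in chunks(tags, "B", "I")]
    heads = chunks(tags, "HB", "HI")
    parts = chunks(tags, "DB", "DI")
    for head in heads:
        for part in parts:
            mentions.append(" ".join(tokens[i] for i in sorted(head + part)))
    return mentions


tokens = ("Extremely bad pains in hands , arms , and muscles "
          "are constantly quivering .").split()
tags = ["HB", "HI", "HI", "HI", "DB", "O", "DB", "O", "O",
        "DB", "O", "O", "DB", "O"]
for mention in decode_biohd(tokens, tags):
    print(mention)
```

Because the tags carry no information about which “D” chunks belong together, the decoder attaches “muscles” and “quivering” to the shared head separately and yields four mentions, whereas the gold annotation contains “muscles…quivering” as its own mention.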

As the distributions of continuous and discontinuous ADR mentions are imbalanced, it is easy to understand that the strict F-score difference of the same methods for continuous and discontinuous ADR mentions is not small. How to tackle data imbalance is a possible direction for further improvement, which will be considered in the future.

Although LSTM-CRF shows much better performance than CRF, the performance of LSTM-CRF is not very good, indicating that recognizing continuous and discontinuous ADR mentions from medical social media is still challenging. The main challenge is exactly determining all words or tokens of mentions, not just some of them. The errors of LSTM-CRF fall into the following three categories. (1) Some modifiers are missing. For example, there are three continuous ADR mentions, “long time flatulence,” “Achilles tendon tightness,” and “dizziness,” in the post “long time flatulence, Achilles tendon tightness, and dizziness.” The first one is wrongly recognized as “flatulence.” (2) Some discontinuous ADR mentions are wrongly recognized as continuous mentions by combining the words or tokens between all parts. For example, the discontinuous ADR mention “hair…thinning” in the sentence “I took this drug a few years ago and went off it because my hair started thinning.” is wrongly recognized as a continuous ADR mention “hair started thinning.” (3) There are some combination errors between continuous ADR mentions and discontinuous ADR mentions. For example, there are three ADR mentions, “Severe pain in buttocks,” “Severe pain in…left leg,” and “sciatica like symptoms,” in “Severe pain in buttocks and left leg sciatica like symptoms.” The last two mentions are wrongly recognized as “left leg sciatica like symptoms.” Some of these errors may be corrected by using the structures of sentences, which is another direction for our future work. The proposed representations actually provide two ways to connect different parts of discontinuous entities; therefore, the proposed methods may have potential use for relation extraction, such as drug-drug interaction extraction.

6. Conclusions

In this paper, we investigate deep neural network-based ADR mention recognition. A deep neural network (called LSTM-CRF) combining long short-term memory (LSTM) neural networks (a type of recurrent neural networks) and conditional random fields (CRFs) is proposed to recognize continuous and discontinuous ADR mentions from social media in medicine and analyze effects of ADR mention representations, subject-based embeddings, and knowledge-based embeddings. Experiments conducted on a benchmark corpus show that LSTM-CRF outperforms CRF; “Multilabel” representation is more suitable for continuous and discontinuous ADR mention recognition than “BIOHD”; both subject-based embeddings and knowledge-based embeddings are individually beneficial for continuous and discontinuous ADR mention recognition. Moreover, some possible directions for further improvement are also presented.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This paper is supported in part by the following grants: National 863 Program of China (2015AA015405), National Natural Science Foundation of China (NSFC) (61573118, 61402128, 61473101, and 61472428), Special Foundation for Technology Research Program of Guangdong Province (2015B010131010), Strategic Emerging Industry Development Special Funds of Shenzhen (JCYJ20160531192358466 and JCYJ20170307150528934), Innovation Fund of Harbin Institute of Technology (HIT.NSRIF.2017052), and CCF-Tencent Open Research Fund (RAGR20160102).