Journal of Healthcare Engineering

A Corrigendum for this article has been published.

Research Article | Open Access

Volume 2017 | Article ID 7575280

Siriwon Taewijit, Thanaruk Theeramunkong, Mitsuru Ikeda, "Distant Supervision with Transductive Learning for Adverse Drug Reaction Identification from Electronic Medical Records", Journal of Healthcare Engineering, vol. 2017, Article ID 7575280, 21 pages, 2017.

Distant Supervision with Transductive Learning for Adverse Drug Reaction Identification from Electronic Medical Records

Academic Editor: Peggy Peissig
Received 05 May 2017
Accepted 19 Jul 2017
Published 26 Sep 2017


Information extraction and knowledge discovery regarding adverse drug reactions (ADRs) from large-scale clinical texts are useful and much-needed processes. Two major difficulties of this task are the lack of domain experts for labeling examples and the intractable processing of unstructured clinical texts. Most previous works address these issues by applying semisupervised learning for the former and a word-based approach for the latter, but they face complexity in acquiring initial labeled data and ignore the structured sequence of natural language. In this study, we propose automatic data labeling by distant supervision, where knowledge bases are exploited to assign an entity-level relation label to each drug-event pair in texts, and we then use patterns to characterize the ADR relation. Multiple-instance learning with the expectation-maximization method is employed to estimate model parameters. The method applies transductive learning to iteratively reassign the probability of an unknown drug-event pair at training time. In experiments with 50,998 discharge summaries, we evaluate our method by varying a large number of parameters, that is, pattern types, pattern-weighting models, and initial and iterative weightings of relations for unlabeled data. Based on these evaluations, our proposed method outperforms the word-based feature for NB-EM (iEM), MILR, and TSVM, with F1 score improvements of 11.3%, 9.3%, and 6.5%, respectively.

1. Introduction

Data-driven approaches for knowledge extraction from electronic medical records (EMRs) have gained much attention in recent years. An EMR repository contains a collection of tacit knowledge [1] (e.g., professionals’ experiences, know-how) and explicit knowledge (e.g., diagnosis procedure, patient information) in a digital form of structured and unstructured data. Such a repository offers insight into significant healthcare problems: patient mortality prediction [2], patient risk identification [3, 4], drug-disease relation extraction [5], and drug-drug interaction prediction [6, 7]. One potential application is automatic adverse drug reaction (ADR) identification from EMRs. An ADR is an unpleasant event (e.g., symptom, disease, or finding) associated with a medication given at recommended dosages [8]. Even though ADRs can be identified by premarketing clinical trials, only a portion of ADRs is reported. Postmarketing surveillance over a large population is necessary for monitoring the remaining ADRs. To this end, there are two multidisciplinary tasks of ADR surveillance: ADR identification and ADR prediction. The former targets the retrieval of unrecognized ADRs that may exist in data but are not explicitly described as knowledge, while the latter aims to construct a model for predicting unknown ADRs that have not been reported anywhere.

In earlier research, the statistical co-occurrence method was broadly employed to quantify the strength of the relationship between a drug and an event. While the method is simple, its result might present no explicit clinical relevance for a derived drug-event pair [9] because it disregards the relational context that expresses the exact nature of a clinical event, such as a drug treating a symptom or a drug causing a symptom. To fill this research gap, many researchers consider the contexts surrounding drug and event entities within clinical texts and represent such data using either a pattern-based method [10–15] or a feature-based method [16–18]. A potential ADR is then identified by training either a supervised or a semisupervised learning [19] model. However, there are two main difficulties when dealing with unstructured texts using such learning models. The first is the scarcity of labeled instances derived by human annotation to form gold-standard examples, and the second is the intractable processing of unstructured clinical texts. To address the insufficiency of labeled instances, several studies use heuristics or rules (distant supervision [20, 21]), that is, they map a sentence that contains an entity pair from a knowledge base and tag that sentence with the relation label of the pair to form a training set. For the second problem, a word-based approach [22–24], the most commonly used method for text representation, has been introduced; however, this method ignores both grammatical and semantic dependency among words. Therefore, pattern-based methods [10, 11, 14] have been promoted to either extend or substitute for word-based text representation. Recently, the distant supervision paradigm has been introduced to replace the hand-labeling process by obtaining the label of an instance from a knowledge base [20, 21].
For example, suppose the knowledge bases contain the drug-event relations (“ramipril-allergy,” “ADR”) and (“aspirin-fever,” “IND”), the so-called entity-level relations. By distant supervision, we can automatically derive labeled data for a sentence associated with such a drug-event pair, for example, (“His ramipril were discontinued due to allergy and added to list in our medical records,” “ADR”), which is known as an instance-level relation. Therefore, the multiple-instance learning (MIL) paradigm [25] is introduced into the classifier-building process to handle these two-level relations.
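The labeling idea in the example above can be illustrated with a minimal Python sketch. The dictionary `KNOWLEDGE_BASE`, the function `label_sentence`, and the naive lowercase substring matching are illustrative assumptions only; the framework described later relies on NER and CUI normalization rather than string matching.

```python
from typing import Dict, List, Tuple

# Hypothetical entity-level knowledge base: (drug, event) -> relation label.
KNOWLEDGE_BASE: Dict[Tuple[str, str], str] = {
    ("ramipril", "allergy"): "ADR",
    ("aspirin", "fever"): "IND",
}


def label_sentence(
    sentence: str, kb: Dict[Tuple[str, str], str] = KNOWLEDGE_BASE
) -> List[Tuple[str, str, str]]:
    """Return (drug, event, label) for every seed pair mentioned in the sentence.

    Assumption: a pair is "mentioned" when both names occur as substrings.
    """
    text = sentence.lower()
    return [(d, e, y) for (d, e), y in kb.items() if d in text and e in text]


labeled = label_sentence(
    "His ramipril were discontinued due to allergy and added to list "
    "in our medical records."
)
```

Every sentence that mentions a seed pair inherits the entity-level label, which is exactly why some instance-level labels can be wrong and why a MIL formulation is needed afterwards.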

This paper introduces an ADR identification framework that aims to classify the entity-level relation of a drug-event pair. Our work differs from prior related works in the following aspects: (i) we propose a key phrasal pattern-based bootstrapping method for characterizing ADR and IND, (ii) we introduce alternative parameter learning for a generative model, and (iii) we enhance the proposed method by incorporating transductive learning.

The rest of this paper is organized as follows. A brief literature review and fundamental knowledge are given in Section 2. Then, Section 3 introduces problem formulation and our proposed framework. Section 4 presents the experimental results. Finally, the conclusion is discussed in Section 5.

2. Background

2.1. Adverse Drug Reaction Identification from Unstructured Texts

Recently, narrative notes in EMRs have been demonstrated as a promising data source and widely utilized for improving detection of patients experiencing adverse reactions, across drugs and indication areas [10–13, 26]. There are at least three common subprocesses for dealing with unstructured texts in EMRs: (i) named entity recognition (NER) (particularly, named entities of drug and event) and normalization, (ii) relation generation (drug-event candidates), and (iii) relation classification (ADR identification).

As the first subprocess, medical NER aims to recognize clinical terms mentioned in EMRs. As an extended task, normalization unifies a discovered clinical term into a conventional lexicon based on an identical semantic meaning or concept, which can be referred to through a UMLS concept unique identifier (CUI). Many researchers have dealt with medical NER and normalization by developing computational tools such as cTAKES, FreeLing-Med, MetaMap, MedLEE, tmChem, DNorm, GATE, or the Stanford CoreNLP tool. In Figure 1, by employing medical NER and normalization, we can identify two drugs (i.e., ramipril and bacterium) and five events (i.e., allergy, facial swelling, HTN (hypertension), respiratory infection, and viral infection) from the given clinical texts. Then, the normalization task replaces a drug term or an event term with its CUI. For example, the drug term ramipril is replaced with C0072973, and the event term HTN is replaced with C0020538, which refers to the concept of hypertensive disease (NCI).

As the next subprocess, drug-event candidates are generated using the windowing technique [27–29]. A drug-event pair tends to form a relation if the two entities are located in the same sentence, the same section, or, more practically, within the same window of size n. In general, this boundary detection (BD) task aims to detect the beginning and ending points within the given texts between which a drug and an event tend to be semantically related. The challenges of the BD task [30–32] depend on the boundary of interest and the domain of the given texts. Many previous works define the potential boundary of a drug-event candidate as the sentence, and sentence boundary detection (SBD) in clinical texts is recognized as a challenging, noise-prone task. One of the major issues is the ambiguous usage of the period or full stop (“.”). Typically, the period has several possible functions, such as a sentence boundary marker, a floating-point marker (e.g., “0.08,” “40.5 mg”), a marker for a numeric bullet of an enumerated list, or a separator within an abbreviation (e.g., “y.o.,” “h.s.”). Other punctuation marks such as the colon (“:”) increase the complexity of SBD as well. Additionally, grammatical dependency is a potential means of improving window-based relation generation because it considers the more specific semantic dependency of the surrounding contexts.
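To make the period ambiguity concrete, the following Python sketch splits text into sentences while skipping the non-boundary uses of the period listed above. It is a rule-based illustration under simplifying assumptions (a two-entry abbreviation list taken from the examples, space-delimited tokens), not the SBD program used in this work; `split_sentences` and `ABBREVIATIONS` are hypothetical names.

```python
import re
from typing import List

# Abbreviations quoted in the text; a realistic list would be much longer.
ABBREVIATIONS = {"y.o.", "h.s."}


def split_sentences(text: str) -> List[str]:
    """Split on periods that act as sentence boundaries only."""
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        i = match.start()
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Floating-point marker, e.g. "40.5 mg".
        if prev_ch.isdigit() and next_ch.isdigit():
            continue
        # Space-delimited token that contains this period.
        tok_start = text.rfind(" ", 0, i) + 1
        tok_end = text.find(" ", i)
        token = text[tok_start: tok_end if tok_end != -1 else len(text)]
        # Separator within a known abbreviation, e.g. "y.o.".
        if token.lower() in ABBREVIATIONS:
            continue
        # Numeric bullet ("1.") that opens the current sentence.
        if token[:-1].isdigit() and not text[start:tok_start].strip():
            continue
        # A boundary must be followed by whitespace or the end of the text.
        if next_ch and not next_ch.isspace():
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


example = "Pt is a 65 y.o. male. Given 40.5 mg of drug. 1. Continue ramipril."
result = split_sentences(example)
```

Each rule covers one of the period functions named in the paragraph; real clinical notes need many more rules or a learned model, which is why SBD is described as noise prone.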

Lastly, the generated drug-event candidates are identified as ADR or IND using supervised, semisupervised, or unsupervised learning methods. Notable works on ADR identification from unstructured texts are summarized in Table 1. Statistical association is one of the pioneering approaches to ADR identification: it considers the co-occurrence of a drug and an event within a specified window size to form association hypotheses, and a 2 × 2 contingency table is then computed for hypothesis testing. Although the method is simple, it disregards the semantic dependency among surrounding contexts that might express real clinical evidence. On the other hand, pattern-based methods [14, 15] have been shown to achieve more accurate clinical relation extraction because they rely on cues or trigger words that usually imply a semantic relation. Although a pattern-based method is more efficient than the window-based method, it requires a set of predefined patterns or human filtering of redundant patterns. In our previous work [13], a pattern-based method was proposed to utilize labels weakly suggested by a set of simple rules (distant supervision), and the pattern distribution was investigated for characterizing ADR relations. In contrast, in [10–12, 18, 37], patterns are used as a feature representation, and machine learning methods such as support vector machine (SVM), decision tree C4.5 (DT), random forest (RF), or naïve Bayes (NB) serve as the classifier. Kang et al. [36] deploy a graph-based method and apply the shortest-path preference to ADR identification. With regard to the efficacy of word embedding [40] in NLP, Henriksson et al. [26] examine the distributional semantic model derived by the word-embedding method for NER, concept attribute labeling, and relation classification. In their work, the high-dimensional semantic space of each word is used as a feature for model learning.
The distributional semantic model is shown to improve classifier performance for all tasks. In another work, Nikfarjam et al. [17] apply word embedding in a similar manner. However, to generalize the semantic space, the authors employ a clustering method on the semantic vectors.
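As an illustration of the statistical association approach mentioned above, the following Python sketch builds a 2 × 2 contingency table from co-occurrence counts and computes Pearson's chi-square statistic. The record format (one set of mentioned terms per note or window) and the function names are assumptions made for this example; the cited works may use other association measures.

```python
from typing import Iterable, Set, Tuple


def contingency_table(
    records: Iterable[Set[str]], drug: str, event: str
) -> Tuple[int, int, int, int]:
    """Count (a, b, c, d): both present, drug only, event only, neither."""
    a = b = c = d = 0
    for mentions in records:
        has_drug, has_event = drug in mentions, event in mentions
        if has_drug and has_event:
            a += 1
        elif has_drug:
            b += 1
        elif has_event:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denominator == 0 else n * (a * d - b * c) ** 2 / denominator


# Toy records: each set holds the terms mentioned in one window or note.
records = [
    {"ramipril", "allergy"},
    {"ramipril", "allergy"},
    {"ramipril"},
    {"allergy"},
    {"aspirin"},
    {"aspirin", "fever"},
]
table = contingency_table(records, "ramipril", "allergy")
statistic = chi_square(*table)
```

The statistic only measures how strongly the two terms co-occur; it cannot tell whether the drug causes or treats the event, which is the limitation the paragraph points out.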

| Learning type | Data source | Literature | Year | Size | Label number | Labeling method | NER | Method |
|---|---|---|---|---|---|---|---|---|
| Supervised | EMR | Aramaki et al. [10] | 2010 | 3012 notes | A, O (2) | H | CRF | Pattern-based |
| Supervised | EMR | Sohn et al. [11] | 2011 | 237 notes | A, O (2) | H | cTAKES | Pattern-based, DT C4.5 |
| Supervised | EMR | Henriksson et al. [26] | 2015 | 400 notes | A, I, O (3) | H | CRF | Word embedding, RF |
| Supervised | EMR | Casillas et al. [12] | 2016 | n/a | A, O (2) | H | FreeLing-Med | Pattern-based, SVM, RF |
| Supervised | Literature | Peng et al. [16] | 2016 | 18,410 abstracts | A, O (2) | H, DS | Dictionary, tmChem, DNorm | Feature-based, SVM |
| Supervised | Social media | Segura-Bedmar et al. [33] | 2015 | 84,000 messages | A, I (2) | DS | GATE | Shallow linguistic kernel, distant supervision |
| Supervised | Social media | Nikfarjam et al. [17] | 2015 | 8800 blog sentences, 3200 tweets | A, I, O (3) | H | CRF | Word embedding, CRF |
| Supervised | Social media | Jenhani et al. [18] | 2016 | 80,000 tweets | A, O (2) | R, ODIN | Dictionary, Stanford CoreNLP | Rule-based, feature-based, DT, SVM, LR, NB |
| Supervised | Social media | Liu et al. [34] | 2016 | 1800 blog sentences, 500 tweets | A, O (2) | H | MetaMap | Feature-based, tree kernel-based, ensemble method |
| Semisupervised | EMR | Taewijit and Theeramunkong [13] | 2016 | 1.5 M sentences | A, I (2) | DS | MetaMap | Distant supervision, OpenIE [35], pattern-based |
| Semisupervised | Literature | Kang et al. [36] | 2014 | 1644 abstracts | A, O (2) | H | Peregrine | Hierarchical graph-based, shortest path |
| Semisupervised | Social media | Liu and Chen [37] | 2015 | 400 sentences | A, I, O (3) | H | MetaMap | Dependency tree, TSVM [38] |
| Unsupervised | EMR | Wang et al. [39] | 2009 | 25,074 notes | None | None | MedLEE | Co-occurrence |
| Unsupervised | Literature | Xu and Wang [14] | 2014 | 119 M sentences | None | None | Parse tree | Pattern-based, ranking |
| Unsupervised | Social media | Feldman et al. [15] | 2015 | 0.1~1 M messages | None | None | Dictionary, pattern | HPSG-based parser, postprocessing of relation merging |

Labels: A = ADR; I = IND; O = other (ADR cause, ADR outcome, non-ADR, negated ADR, others). Labeling method: DS = distant supervision; H = human; R = rule-based.

2.2. Distant Supervision and Multiple Instance Learning

The main objective of distant supervision is to alleviate the problem of hand-labeled training data, which are time-consuming to produce, rare, and costly, by relying on a knowledge base. Such a knowledge base is reliable, cheap, and ubiquitously available. Distant supervision was first introduced by Craven and Kumlien [20]. In their work, the term weakly labeled data is used for biomedical relation extraction from MEDLINE. Later, Mintz et al. [21] proposed an equivalent paradigm under the name distant supervision to extract relations using Freebase. Their assumption is that “if the two entities participate in a relation, any sentence that contains those two entities might express that relation.” Distant supervision has recently been applied to the relation extraction problem [41–45] by mapping relations between pairs of entities from knowledge bases (e.g., Freebase, YAGO) to sentences in a large-scale text corpus (e.g., New York Times). Similarly, in previous works on emotion classification from social media (i.e., tweets, microblog text) [46–48], the authors use distant supervision to map emoticons or smileys from lexicons and knowledge bases (i.e., Wikipedia, Weibo) to large-scale noisy texts. In the medical domain, distant supervision for ADR identification [33, 49] is leveraged to automatically assign adverse reaction relations by mapping drug-event pairs from knowledge bases to health-related texts. Yates et al. [49] utilize SIDER as the knowledge base for English tweets and posted messages from a breast cancer forum, and Segura-Bedmar et al. [33] deploy the SpanishDrugEffectDB database on Spanish health-related texts.

As mentioned in the previous section, applying distant supervision to a text corpus usually involves a two-level relation concept: the entity-level and the instance-level relations. This mapping procedure may produce noisily labeled data [50, 51], and the MIL paradigm [25] is widely used as a solution [41, 42, 52, 53] to this wrong-label problem. Fundamentally, MIL aims to handle the situation in which training labels are associated with sets of instances rather than with individual instances [54]. MIL considers two levels of data, namely, bag-level and instance-level relations. Let $\mathcal{X}$ be an instance space, $\mathcal{Y}$ be a set of labels, and $D = \{(x_i, y_i)\}_{i=1}^{n}$ be a training set, where $x_i \in \mathcal{X}$ is an instance and $y_i \in \mathcal{Y}$ is the known label of $x_i$; standard supervised learning trains a classifier function $f: \mathcal{X} \rightarrow \mathcal{Y}$. In MIL, on the other hand, a given training set consists of bags and bag labels, $D = \{(B_i, Y_i)\}_{i=1}^{m}$, where $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ is a set of multiple instances with $x_{ij} \in \mathcal{X}$, $Y_i \in \mathcal{Y}$ is the label of bag $B_i$, and the bag size $n_i$ can differ across bags; the goal of MIL is to learn a bag classifier $F: 2^{\mathcal{X}} \rightarrow \mathcal{Y}$. For the ADR identification problem, bag-level and instance-level relations in MIL are equivalent to the entity-level and instance-level relations of the drug-event relation obtained by distant supervision, respectively.

2.3. Transductive Learning

In semisupervised learning, prediction methods vary along three parameters: (i) the predictive model, (ii) the use of a single model or a collaborative model, and (iii) the handling of test instances. For the first parameter, recent works [55–57] have proposed various predictive models, such as generative models [22, 58], low-density separation models [59], and graph-based models [60]. For the second parameter, at least two alternatives, namely, self-training [61, 62] and cotraining [63], can be applied to assign a label to an unlabeled instance by either a single predictive model or an ensemble of multiple predictive models. The last parameter concerns how to handle test instances, where the two choices are (i) to treat the test instances separately from the unlabeled instances (inductive learning) or (ii) to treat them as unlabeled instances in the training step (transductive learning). Regardless of the choices made for these three parameters, semisupervised learning requires a few labeled instances for constructing an initial model, which introduces complexity in acquiring such initial labeled data. The main idea of transductive learning is to take advantage of the information in unlabeled data at training time, while inductive learning ignores such information even though it is available [19].

3. Methods

This section presents the proposed ADR identification framework, which aims to overcome two shortcomings of existing research: (i) the lack of domain experts for labeling instances and (ii) the intractable processing of large-scale unstructured clinical texts. Our proposed framework contains three main tasks (Figure 2). First, a set of drug-event candidates is generated from EMR texts. Next, silver-standard data and unseen data are prepared. Finally, we explore alternative parameter learning schemes for generative models to identify potential drug-event relations.

To solve the first issue, we assign a label to an unlabeled instance by exploiting facts in knowledge bases (i.e., SIDER and DrugBank) and consider two labels, ADR and IND, as classification outputs. While distant supervision can supply a label to an unlabeled instance by simply looking it up in the knowledge bases, the data set labeled by this method forms a MIL problem, in which training labels are associated with sets of instances rather than with individual instances. As for the latter issue, applying a phrase-based method and a dependency representation may improve model performance. In our work, the main idea is that a sentence regarding a harmful (ADR) or beneficial (IND) clinical event can be simplified into three key elements, a drug, a key phrasal pattern, and an event, and that the dependency among these three elements is significant. The key phrasal pattern implies a semantic relation between a pair of drug and event entities. We employed a key phrasal pattern-based method for ADR identification in our previous work [13]. The method exhibits high precision; its drawback, however, is a low recall rate due to the limited number of key phrasal patterns and the use of simple models. In this work, we extend the key phrasal pattern-based method with more sophisticated models, which are expected to retain the high precision and improve retrieval performance. EM, an iterative method, is combined with a Markov property assumption to draw the conditional probability distribution of the pattern-based feature (dEM). Finally, we leverage unlabeled data through transductive learning, as a form of semisupervised learning, to enhance the performance of the proposed framework.
For performance evaluation, we construct EM with the independence assumption through NB (iEM) as the baseline and also compare our proposed methods to multiple advanced methods: multiple-instance support vector machine (MISVM), multiple-instance naïve Bayes (MINB), multiple-instance logistic regression (MILR), and transductive support vector machine (TSVM). Multiple parameters, such as pattern types, pattern-weighting models, and initial and iterative weightings of relation labels for unlabeled data, are investigated across three alternative MIL models: iEM with the transductive learning setting (baseline), dEM with supervised learning, and dEM with transductive learning.

3.1. Problem Formulation

We first present the formal definition of distant supervision and then formulate the problem using the MIL concept. Let $KB$ denote the knowledge bases regarding ADR and IND that are obtained from SIDER and DrugBank, $Z$ be a set of seeds, where $Z \subseteq KB$, and $\mathcal{Y}$ be a set of labels, where $\mathcal{Y} = \{\text{ADR}, \text{IND}\}$. The data set of seeds in the knowledge bases, or entity-level set, can be defined as $D_Z = \{(z_k, y_k)\}_{k=1}^{K}$, where $z_k \in \mathcal{E}^2$ is a seed, $\mathcal{E}^2$ is the 2-dimensional entity space consisting of a drug entity $d$ and an event entity $e$ that are defined in $KB$, $y_k \in \mathcal{Y}$ is the label corresponding to $z_k$, and $K$ is the total number of seeds. Therefore, the data set of seeds can be derived as $D_Z = \{((d_1, e_1), y_1), \ldots, ((d_K, e_K), y_K)\}$. For instance, suppose $KB$ states that the drug ramipril is associated with the adverse event allergy and that the drug ibuprofen is used to treat the event arthritis as a symptom. We can then derive a data set of seeds to serve as a source of distant supervision as $D_Z = \{((\text{ramipril}, \text{allergy}), \text{ADR}), ((\text{ibuprofen}, \text{arthritis}), \text{IND})\}$. These seeds are entity-level data that are used as knowledge in later processes.

Let $\mathcal{C}$ be a clinical-record corpus from MIMIC, which contains a set of discharge summary sentences $S = \{s_1, s_2, \ldots, s_N\}$. We transform each sentence into three key elements, that is, a drug entity $d$, a key phrasal pattern entity $p$, and an event entity $e$, while the semantics of the simplified text is retained. Given that $x = (d, p, e)$ is a tuple obtained from an input sentence $s$ and $\mathcal{E}^3$ is the 3-dimensional entity space, in order to automatically generate labeled examples using distant supervision, the goal is to obtain a mapping function $g$ that relates the drug-event pair $(d, e)$ of $x$ to a relation label $y$, where $(d, e)$ exists in the set of seeds, $x \in \mathcal{E}^3$, and $y \in \{\text{ADR}, \text{IND}\}$. Finally, we can derive a set of labeled data $D_L = \{(x_i, y_i)\}_{i=1}^{M}$, namely, an instance-level data set, where $M$ is the total number of mapped sentences.

For example, suppose the sentence “His ramipril were discontinued due to allergy and added to list in our medical records.” exists in the corpus. Using a dependency tree, the sentence can be simplified into the three key elements of a drug $d$ = ramipril, a key phrasal pattern $p$ = be-discontinue-due-to, and an event $e$ = allergy, where the key phrasal pattern is expressed in either the syntactically lemmatized lexicon or the surface lexicon (e.g., was-discontinued-due-to) and can be employed in either word or phrase form (discussed later in Section 3.3.1). With the mapping function, we can project the sentence onto the seed (ramipril, allergy) and transfer the corresponding label ADR to the sentence. Therefore, we can derive labeled data by distant supervision as ((ramipril, be-discontinue-due-to, allergy), ADR). As another example, suppose the sentence “The allergy improved despite ongoing treatment with ramipril.” also exists in the corpus. Its transformed tuple again contains the drug ramipril and the event allergy, so the mapping function assigns it the label of the same entity pair, and the derived labeled data also carry the label ADR. However, this sentence might not express a clinical event of adverse reaction. This is known as a noisy label and needs to be handled by a particular technique such as MIL.

In the MIL concept, bag-level and instance-level relations are equivalent to the entity-level and instance-level relations of the drug-event relation derived by distant supervision, respectively. Following the definition in Section 2.2, let $\mathcal{X}$ be an instance space and $\mathcal{Y}$ be a set of labels, where $\mathcal{Y} = \{\text{ADR}, \text{IND}\}$. The labeled data set can be rewritten in the MIL form as $D_L = \{(B_i, Y_i)\}_{i=1}^{m}$, where $B_i = \{x_{i1}, \ldots, x_{in_i}\}$ is a set of multiple sentences in which all sentences correspond to the same drug $d_i$ and event $e_i$, $m$ is the number of bags, and $n_i$ is the number of sentences in bag $B_i$, which can vary across bags. Unlabeled instances are formed as a group of bags in a similar way but without labels, $D_U = \{B_{m+1}, \ldots, B_{m+u}\}$, with $u$ unlabeled bags. Our goal is both to train an instance classifier function $f: \mathcal{X} \rightarrow \mathcal{Y}$ in the instance-space paradigm from $D_L$ only (supervised learning) and to infer an accurate label for each instance in $D_U$ during the training process (transductive learning). The bag label can eventually be derived from an aggregation function over the instance level, and the model is assessed through its performance at the entity level. Because distant supervision yields noisy labels, the collective assumption and the standard assumption with logical-OR aggregation are rather improper for judging the bag label. Our proposed framework instead uses a relaxed version of the MIL standard assumption: positive and negative bags may contain a mixture of positive and negative instances, but the probability of at least one positive instance should be the maximum for a positive bag, and vice versa. Consequently, to learn the bag classifier $F$, the estimated bag label from the instance classifier $f$ can be computed using (1), where $\hat{Y}_i$ is the estimated label of bag $B_i$ (the entity-level label), $y_{ij}$ is the instance-level label, which may differ for each sentence instance $x_{ij}$ within the same bag $B_i$, and $n_i$ is the total number of sentences in the bag.

$$\hat{Y}_i = \arg\max_{y \in \mathcal{Y}} \; \max_{1 \le j \le n_i} P(y_{ij} = y \mid x_{ij}) \tag{1}$$
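One possible reading of the relaxed assumption described above is sketched below in Python: for each label, take the highest instance-level posterior in the bag, and assign the bag the label whose best instance is the most confident. The function name `bag_label` and the dictionary-based posterior format are assumptions for illustration, and the bag is assumed to be non-empty.

```python
from typing import Dict, List, Sequence


def bag_label(
    instance_posteriors: List[Dict[str, float]],
    labels: Sequence[str] = ("ADR", "IND"),
) -> str:
    """Aggregate instance-level posteriors into one bag-level label.

    instance_posteriors holds one {label: probability} dict per sentence
    in the bag (assumed non-empty).
    """
    best_per_label = {
        y: max(posterior[y] for posterior in instance_posteriors) for y in labels
    }
    return max(labels, key=lambda y: best_per_label[y])


# A bag whose first sentence weakly suggests ADR but whose second strongly
# suggests IND: the most confident instance decides the bag label.
mixed_bag = [{"ADR": 0.55, "IND": 0.45}, {"ADR": 0.10, "IND": 0.90}]
mixed_label = bag_label(mixed_bag)
```

Under a strict logical-OR rule, the first sentence alone (ADR above 0.5) would make the whole bag ADR; the relaxed rule lets a more confident IND instance override it, which is more tolerant of noisy distant labels.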

Generally, the training data are not sufficient for parameter training. In order to learn the classifier function $f$, we make use of the iterative EM technique in a transductive learning setting to estimate the posterior probability $P(y \mid x)$ through the two parameters of the generative model, that is, the prior probability $P(y)$ and the class-conditional density $P(x \mid y)$.

3.2. Medical Named Entity Recognition and Relation Candidate Generation

Figure 3 displays information extraction from sentences in the MIMIC corpus, with drug-key phrasal pattern-event tuples as output candidates for the ADR or IND relation. This process involves NER, SBD, and parsing. Here, MetaMap [64] is used for NER, our in-house program for SBD, and Stanford CoreNLP’s OpenIE for parsing. After extracting relation candidate tuples (entity1, predicate, entity2), we select only the tuples that include a drug name and an event name as entity1 and entity2, or vice versa. The output is in the form (a drug, a key phrasal pattern, an event).
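The tuple selection step can be sketched in Python as follows. The function `select_candidates` and the CUI sets are hypothetical; in particular, reordering an (event, predicate, drug) triple into (drug, pattern, event) is an assumption of this sketch, since the text only states that the output has that form.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def select_candidates(
    triples: Iterable[Triple], drugs: Set[str], events: Set[str]
) -> List[Triple]:
    """Keep triples that link a drug and an event, as (drug, pattern, event)."""
    candidates = []
    for entity1, predicate, entity2 in triples:
        if entity1 in drugs and entity2 in events:
            candidates.append((entity1, predicate, entity2))
        elif entity1 in events and entity2 in drugs:
            # "Vice versa" case: reorder so the drug always comes first.
            candidates.append((entity2, predicate, entity1))
    return candidates


# Toy OpenIE-style output with entities already normalized to CUIs.
triples = [
    ("C0033487", "be-hold-due-to", "C0020649"),
    ("C0020649", "improve-with", "C0031469"),
    ("patient", "arrive-at", "hospital"),
]
drugs = {"C0033487", "C0031469"}
events = {"C0020649"}
candidates = select_candidates(triples, drugs, events)
```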

The automatic labeling process using distant supervision is illustrated in Figure 4. First, each pair of drug and event from the set of seeds in the knowledge bases is used to extract drug-event pairs from the set of sentences; then, we assign the seed label to all sentences that mention that pair. However, to reduce the ambiguity of the ground truth obtained from knowledge base supervision, a drug-event pair $(d, e)$ that is found to exhibit both ADR and IND semantic relations is excluded. Given a set of sentences $S$, the training set is in the form $D_L = \{(B_i, Y_i)\}$, where $Y_i \in \{\text{ADR}, \text{IND}\}$. In Block 1 of Figure 4, the first bag (Bag1) consists of two sentences that correspond to the same entity-level drug-event pair. The second bag (Bag2) contains only one sentence, which is relevant to another drug-event pair.

Finally, all sentences that can be assigned a label by distant supervision are referred to as the set of labeled data $D_L$, and the remaining data that are not matched by distant supervision are used as unlabeled data $D_U$.
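The bag construction described in this subsection can be sketched in Python as follows. The function `build_bags` and its data layout are illustrative assumptions; in particular, sentences whose drug-event pair is ambiguous in the knowledge bases are dropped entirely here, which is one way of reading "excluded".

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Instance = Tuple[str, str, str]  # (drug, pattern, event)


def build_bags(tuples: List[Instance], seeds: List[Tuple[Pair, str]]):
    """Group instances into labeled and unlabeled bags keyed by (drug, event)."""
    labels_by_pair = defaultdict(set)
    for pair, label in seeds:
        labels_by_pair[pair].add(label)
    # Seeds with a single label supervise; seeds with both labels are ambiguous.
    kb = {p: next(iter(ys)) for p, ys in labels_by_pair.items() if len(ys) == 1}
    ambiguous = {p for p, ys in labels_by_pair.items() if len(ys) > 1}

    labeled: Dict[Pair, List[Instance]] = defaultdict(list)
    unlabeled: Dict[Pair, List[Instance]] = defaultdict(list)
    for drug, pattern, event in tuples:
        pair = (drug, event)
        if pair in ambiguous:
            continue  # assumption: ambiguous pairs are dropped entirely
        target = labeled if pair in kb else unlabeled
        target[pair].append((drug, pattern, event))
    labeled_bags = {pair: (insts, kb[pair]) for pair, insts in labeled.items()}
    return labeled_bags, dict(unlabeled)


seeds = [
    (("ramipril", "allergy"), "ADR"),
    (("ibuprofen", "arthritis"), "IND"),
    (("aspirin", "fever"), "ADR"),
    (("aspirin", "fever"), "IND"),
]
tuples = [
    ("ramipril", "be-discontinue-due-to", "allergy"),
    ("ramipril", "cause", "allergy"),
    ("ibuprofen", "be-start-for", "arthritis"),
    ("aspirin", "be-give-for", "fever"),
    ("propofol", "be-hold-due-to", "hypotension"),
]
labeled_bags, unlabeled_bags = build_bags(tuples, seeds)
```

In this toy run, the two ramipril sentences form one ADR bag, the aspirin sentence disappears because its seed is ambiguous, and the propofol sentence becomes an unlabeled bag.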

3.3. Document Representation
3.3.1. Feature Extraction for Clinical Textual Data

To recognize a relation between a drug and an event, our approach generates a set of relation candidates (drug-event pairs) from medical records in the form (drug, pattern, event). Table 2 shows examples of the multiple types of feature extraction and the resulting drug-event candidates. Our work considers several parameters for representing such relation candidates. The relation boundary constraint defines the extent to which surrounding context is used for determining drug-event relations, while the syntactic lemmatization and pattern granularity constraints relate to the patterns used to detect drug-event relations, as follows.
(i) Syntactic lemmatization: for syntactic word forms, the two possibilities are syntactically lemmatized lexicons (L) and surface lexicons (S).
(ii) Pattern granularity: in terms of pattern units, the two options are word form (W) and phrase form (P).

| Sentence | Type | Example of feature representation | Example of drug-event candidate (d, p, e, y) |
|---|---|---|---|
| On arrival here, propofol was held due to hypotension. | L–P | C0033487 be-hold-due-to C0020649 | (C0033487, be-hold-due-to, C0020649, ADR) |
| | L–W | C0033487 be hold due to C0020649 | NA |
| | S–P | C0033487 was-held-due-to C0020649 | (C0033487, was-held-due-to, C0020649, ADR) |
| | S–W | C0033487 was held due to C0020649 | NA |
| | BOW | On arrival here, propofol was held due to hypotension. | NA |
| Phenylephrine drip was started for hypotension. | L–P | C0031469 be-start-for C0020649 | (C0031469, be-start-for, C0020649, IND) |
| | L–W | C0031469 be start for C0020649 | NA |
| | S–P | C0031469 was-started-for C0020649 | (C0031469, was-started-for, C0020649, IND) |
| | S–W | C0031469 was started for C0020649 | NA |
| | BOW | Phenylephrine drip was started for hypotension. | NA |
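The four pattern-based representations in Table 2 can be generated from a single tuple as in the Python sketch below. The function `represent` is a hypothetical helper, and the lemmatized tokens are supplied by the caller rather than computed, since the choice of lemmatizer is outside the scope of this illustration.

```python
from typing import List, Sequence


def represent(
    drug: str,
    event: str,
    surface_tokens: Sequence[str],
    lemma_tokens: Sequence[str],
    lexicon: str = "L",      # "L" = lemmatized lexicon, "S" = surface lexicon
    granularity: str = "P",  # "P" = phrase form, "W" = word form
) -> List[str]:
    """Return the feature terms for one (drug, pattern, event) tuple."""
    tokens = lemma_tokens if lexicon == "L" else surface_tokens
    if granularity == "P":
        # Phrase form: the whole pattern is a single hyphen-joined term.
        return [drug, "-".join(tokens), event]
    # Word form: every pattern word is its own term.
    return [drug, *tokens, event]


surface = ["was", "held", "due", "to"]
lemmas = ["be", "hold", "due", "to"]
lp = represent("C0033487", "C0020649", surface, lemmas, "L", "P")
sw = represent("C0033487", "C0020649", surface, lemmas, "S", "W")
```

The phrase form keeps the pattern as one unit, so the term be-hold-due-to is distinct from be-start-for, whereas the word form shares individual words such as "be" across patterns.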

3.3.2. Pattern-Weighting Models

(i) Bernoulli (binary) document model: a document (hereinafter referred to as a sentence, denoted by $s$) can be represented as a vector in which each element corresponds to a term (i.e., word or phrase) denoted by $w_i$, with a value of 1 or 0 for the presence or absence of that term, respectively:
$$\mathbf{s} = (b_1, b_2, \ldots, b_{|V|}),$$
where $\mathbf{s}$ represents a sentence in the form of a binary vector, $b_i = 1$ when $w_i$, the $i$th term, occurs in the sentence (otherwise 0), and $w_i$ is a term in the universe $V$.
(ii) Multinomial (frequency) document model: a sentence is expressed by a vector of term frequencies as
$$\mathbf{s} = (tf_1, tf_2, \ldots, tf_{|V|}), \qquad tf_i = \frac{N(w_i, s)}{|s|},$$
where $\mathbf{s}$ is a sentence in the form of a vector, $tf_i$ expresses the frequency of the $i$th term normalized by the sentence size $|s|$, and $N(w_i, s)$ is the number of times the term $w_i$ occurs in the sentence $s$. As another option, a sentence can also be expressed by a vector of term frequency-inverse document frequency (TFIDF) values as
$$\mathbf{s} = (tf_1 \cdot idf_1, \ldots, tf_{|V|} \cdot idf_{|V|}), \qquad idf_i = \log \frac{|S|}{|\{s' \in S : w_i \in s'\}|},$$
where $S$ is the set of sentences (the document universe), $\mathbf{s}$ is a sentence in the form of a vector, and $idf_i$ expresses the inverse document frequency, corresponding to the logarithm of the ratio of the total number of sentences in the universe to the number of sentences that contain the $i$th term $w_i$.
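The three weighting schemes can be illustrated with a short Python sketch. The function names and the toy two-sentence corpus are assumptions made for this example; the natural logarithm is used for IDF, and a term that appears in no sentence is given an IDF of 0 to avoid division by zero.

```python
import math
from collections import Counter
from typing import List, Sequence


def bernoulli(sentence: Sequence[str], vocab: Sequence[str]) -> List[int]:
    """1 if the term occurs in the sentence, otherwise 0."""
    present = set(sentence)
    return [1 if term in present else 0 for term in vocab]


def tf(sentence: Sequence[str], vocab: Sequence[str]) -> List[float]:
    """Term frequency normalized by the sentence size."""
    counts, size = Counter(sentence), len(sentence)
    return [counts[term] / size if size else 0.0 for term in vocab]


def idf(corpus: Sequence[Sequence[str]], vocab: Sequence[str]) -> List[float]:
    """log(total sentences / sentences containing the term)."""
    n = len(corpus)
    weights = []
    for term in vocab:
        df = sum(1 for sentence in corpus if term in sentence)
        weights.append(math.log(n / df) if df else 0.0)
    return weights


def tfidf(sentence, corpus, vocab) -> List[float]:
    return [a * b for a, b in zip(tf(sentence, vocab), idf(corpus, vocab))]


corpus = [
    ["C0033487", "be-hold-due-to", "C0020649"],
    ["C0031469", "be-start-for", "C0020649"],
]
vocab = sorted({term for sentence in corpus for term in sentence})
binary_vector = bernoulli(corpus[0], vocab)
tfidf_vector = tfidf(corpus[0], corpus, vocab)
```

In this toy corpus the event C0020649 occurs in every sentence, so its IDF, and therefore its TFIDF weight, is zero, while the drug and the pattern keep a positive weight.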

3.4. Probabilistic Classification Modeling

This section describes two EM-based probabilistic classification models, one with the independence assumption (iEM) and the other with the dependency representation assumption (dEM).

3.4.1. EM Model with Naïve Bayes Independent Assumption (iEM)

Let be a set of sentences, be a sentence that includes terms, and be the set of possible classes. The probability that the sentence has as its class can be formulated as

While in most situations, it is possible to obtain the class simply from the training set, and the generative probability of given a class usually suffers with insufficient training data. As done by several works, the assumption of independence, usually called naïve Bayes (NB), can be applied to alleviate this sparseness problem as expressed in

Therefore, the NB text classifier can be rewritten in the form

Here, two sets of parameters must be estimated with the expectation-maximization (EM) algorithm. The first parameter set is the class-conditional probability of any term given the class, while the other is the probability of the class. The parameter set is defined by

In the expectation step (E-step), for each iteration, the parameter of the previous step is applied to re-estimate the model probability. In our experiment, the convergence threshold is 10^-7 and the maximum number of iterations is set to 50.

For the maximization step (M-step), with a Laplace smoothing factor, the (t+1)th-iteration probabilities of both parameter sets can be estimated from the tth-iteration probability. The maximum likelihood estimate for NB is simply computed from an empirical corpus, in which the total number of terms in the universe is used for smoothing.
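The E-step and M-step can be sketched as a small semisupervised naïve Bayes EM loop. This is a sketch under stated assumptions rather than the authors' implementation: the function names, the additive smoothing form, and the use of the change in log-likelihood as the convergence test are all illustrative choices. Only the threshold (10^-7) and the iteration cap (50) come from the text.

```python
import math
from collections import Counter

def m_step(docs, resp, vocab, n_classes, alpha=1.0):
    """Re-estimate P(c) and P(w|c) from soft class memberships (Laplace smoothing)."""
    prior = [(alpha + sum(r[c] for r in resp)) / (n_classes * alpha + len(docs))
             for c in range(n_classes)]
    cond = []
    for c in range(n_classes):
        counts = Counter()
        for doc, r in zip(docs, resp):
            for w in doc:
                counts[w] += r[c]
        total = sum(counts.values())
        cond.append({w: (alpha + counts[w]) / (alpha * len(vocab) + total)
                     for w in vocab})
    return prior, cond

def e_step(docs, prior, cond):
    """Compute P(c | doc) under the independence assumption, in log space."""
    resp, loglik = [], 0.0
    for doc in docs:
        logs = [math.log(prior[c]) + sum(math.log(cond[c][w]) for w in doc)
                for c in range(len(prior))]
        top = max(logs)
        norm = top + math.log(sum(math.exp(v - top) for v in logs))
        resp.append([math.exp(v - norm) for v in logs])
        loglik += norm
    return resp, loglik

def nb_em(labeled, labels, unlabeled, n_classes=2, tol=1e-7, max_iter=50):
    """Fit on labeled data, then iterate E/M steps over the unlabeled data."""
    vocab = sorted({w for d in labeled + unlabeled for w in d})
    fixed = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    prior, cond = m_step(labeled, fixed, vocab, n_classes)
    previous = -math.inf
    for _ in range(max_iter):
        resp_u, loglik = e_step(unlabeled, prior, cond)
        prior, cond = m_step(labeled + unlabeled, fixed + resp_u, vocab, n_classes)
        if abs(loglik - previous) < tol:  # assumed convergence criterion
            break
        previous = loglik
    return prior, cond
```

Labeled sentences keep their fixed memberships throughout, while the memberships of unlabeled sentences are re-estimated in every E-step.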

The following demonstrates an example of applying the above formulations with the key phrasal pattern-based feature. Given the L–P feature representation (C0033487, be-hold-due-to, C0020649), which corresponds to the relation tuple obtained from an input sentence where the pattern is in the phrase form, we can estimate the class probability as expressed in

As another example, the L–W feature representation of the same sentence, {C0033487, be, hold, due, to, C0020649}, corresponds to the relation tuple where the pattern is in the word form. We can compute the class probability of the given text as
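The difference between the two representations can be made concrete with a short sketch. The helper names and the fallback probability for unseen features are hypothetical; only the example tuple comes from the text.

```python
import math

def phrase_features(drug, pattern, event):
    """L-P style: the key phrasal pattern is kept as a single feature."""
    return [drug, pattern, event]

def word_features(drug, pattern, event):
    """L-W style: the pattern is split into its component words."""
    return [drug, *pattern.split("-"), event]

def nb_log_score(features, class_prior, class_cond, unseen=1e-6):
    """log P(c) + sum of log P(feature | c); `unseen` is an assumed floor value."""
    return math.log(class_prior) + sum(
        math.log(class_cond.get(f, unseen)) for f in features)

relation = ("C0033487", "be-hold-due-to", "C0020649")
lp = phrase_features(*relation)
lw = word_features(*relation)
```

With the phrase form the classifier multiplies three factors per class, whereas the word form multiplies six, one per word of the pattern plus the two entities.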

3.4.2. EM Model with Dependency Representation (dEM)

We introduce a dependency representation as an alternative model representation that is based on the same intuitions as the NB model but is less restricted by the implicitly strong independence assumptions. This dependency representation is an efficient factorization of the joint probability distribution over a set of three random variables, where each variable takes values from one domain, that is, drug, key phrasal pattern, or event. We extend the dependency representation with iterative learning by the EM approach in order to align the model assumption with natural language and to estimate an unseen random variable using probability estimation based on existing prior knowledge. This dependency representation is also known as a Bayesian network (BN), and the conditional probability of the variables given a class can be derived by the chain rule

Therefore, the BN text classifier can be rewritten in the form

According to the core of the BN representation, a random variable is represented by a node in a directed acyclic graph (DAG), and an edge between any two nodes is drawn as an arrow that implies a direct influence of one node on the other. Given a sentence with three elements in the form of a relation tuple, there are six (3!) ways to factorize it through the chain rule, each serving as an alternative model skeleton of the dependency representation. We, hence, propose linear interpolation in order to weigh and combine the probability estimates from all possible dependency representations, as defined by the following, such that the weights sum to 1
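The 3! = 6 chain-rule orderings of a relation tuple can be enumerated mechanically. The sketch below prints each factorization as a string; the variable names d, p, e, and c (drug, pattern, event, class) are assumed shorthand, not the article's notation.

```python
from itertools import permutations

def chain_rule_factorizations(variables=("d", "p", "e"), cls="c"):
    """Return one chain-rule factorization of P(d, p, e | c) per variable ordering."""
    results = []
    for order in permutations(variables):
        factors = []
        for i, var in enumerate(order):
            # Each variable is conditioned on all earlier variables and the class.
            given = list(order[:i]) + [cls]
            factors.append(f"P({var} | {', '.join(given)})")
        results.append(" * ".join(factors))
    return results

factorizations = chain_rule_factorizations()
```

Every ordering yields the same joint probability in theory, but with sparse counts the estimated factors differ, which is the motivation for combining them.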

Generally, the linear interpolation estimate for three random variables can be obtained from the combination of the two-variable estimates and the individual-variable estimates. Similarly, a two-variable estimate can be approximated from individual-variable estimates as well. For instance, given two history terms in a sentence, the interpolation is estimated from the individual random variable and the two random variables as shown in the following, such that the weights sum to 1

As another instance, when three history terms in a sentence are given, the likelihood estimate shown in (18) can be derived, as with the previous estimator, by interpolating the individual-variable, two-variable, and three-variable estimators, such that the weights sum to 1

Finally, we compute the remaining two-variable estimates in a manner similar to (17) and the remaining three-variable estimates in the same way as shown in (18).

In the same manner as the NB model, it is necessary to estimate four sets of parameters for any terms in the universe.

Iterative learning using the EM approach is applied to estimate the parameters. In the E-step, for each iteration, the parameter is applied to re-estimate the model probability as shown in (20) and (21). This process repeats until convergence. The same settings as in the iEM model, that is, a convergence threshold of 10^-7 and a maximum of 50 iterations, are applied to the dEM model as well.

For the M-step, the Laplace smoothing factor is applied, as in the NB model, to avoid the zero-count issue. However, with the BN dependency representation, there are four parameter estimates of the (t+1)th-iteration probability, which can be estimated from the tth-iteration probability as expressed in the following, where the total number of terms in the universe is used for smoothing.

Then, we can derive the remaining single-variable estimates using a calculation similar to (22). The dependency representations of two random variables can be computed by following an approach similar to (23). Similarly, the three-variable estimates can be obtained in the same way as shown in (24). Finally, the coefficients γ, β, and α of the interpolation approach are employed in order to weigh the knowledge from multiple dependency representations. Algorithm 1 gives pseudocode for the iEM model, and Algorithm 2 expresses our proposed dEM method.
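The interpolation itself is a weighted sum of estimates of different orders. A minimal sketch, assuming the estimates have already been computed and that the weights are supplied by parameter tuning (the function name and the example values are hypothetical):

```python
def interpolate(estimates, weights, tol=1e-9):
    """Linearly combine probability estimates with weights that sum to 1."""
    if len(estimates) != len(weights):
        raise ValueError("one weight is required per estimate")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, estimates))

# Hypothetical three-, two-, and one-variable estimates of the same event.
combined = interpolate([0.9, 0.5, 0.1], [0.5, 0.3, 0.2])
```

Because the weights sum to 1 and every input lies in [0, 1], the result is again a valid probability, and a sparse higher-order estimate can be backed off towards a better-supported lower-order one.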

Algorithm 1: iEM model.
Input: the number of labels; the maximum number of iterations
Output: parameter
repeat
  for each index up to n do
    E-step:
      Estimate model probability: (9)
    M-step:
      Update class-conditional probability: (10)
      Update class probability: (11)
until convergence or the maximum number of iterations is reached

3.5. The Incorporation of Unlabeled Data

When labeled data are insufficient, SSL is one solution that utilizes an inexpensive and ubiquitous source of data. Transductive learning [65], one type of SSL, begins by using a limited amount of labeled data to build a rough model and then incorporates a large amount of unlabeled data (the test set) to revise and improve the model iteratively. In the experiment, we investigated three alternative approaches to the initialization and iterative weighting of relation labels for unlabeled data incorporation.

(i) Supervised model: This method is equivalent to general transductive learning, in which the labels of the test set are derived by a classifier that is trained on the labeled data. Then, the labeled data augmented with the newly labeled test set are used for further iterations.

(ii) Equal probability: The class probability of the unlabeled data is assigned equally to each label and used as the initial probability. In this approach, the unlabeled data can be included earlier and integrated into the training process from the first iteration. Therefore, in the next iteration, the model is not strictly guided by the labeled data. The revision process then proceeds in the same manner as the previous method, combining both data sets for further iterations.

(iii) Random probability: Similarly, the initial probability of the unlabeled data is assigned randomly rather than fixed at 0.5. The degree of likelihood for each label can vary from 0 to 1, while the total probability of the ADR and IND labels equals 1.
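The three initialization schemes differ only in how the first class probabilities of the unlabeled pairs are produced. A minimal sketch, in which the function name, the method keywords, and the `classifier` callable are hypothetical stand-ins:

```python
import random

def initial_weights(unlabeled, method, classifier=None, seed=0):
    """Return an initial (P(ADR), P(IND)) pair for every unlabeled instance."""
    if method == "supervised":
        # Labels come from a model trained on the labeled data only.
        if classifier is None:
            raise ValueError("the supervised scheme needs a trained classifier")
        return [classifier(x) for x in unlabeled]
    if method == "equal":
        # Every instance starts undecided between the two labels.
        return [(0.5, 0.5) for _ in unlabeled]
    if method == "random":
        # A random split of the probability mass that still sums to 1.
        rng = random.Random(seed)
        weights = []
        for _ in unlabeled:
            p_adr = rng.random()
            weights.append((p_adr, 1.0 - p_adr))
        return weights
    raise ValueError(f"unknown initialization method: {method}")
```

Whichever scheme is chosen, the resulting pairs serve only as the starting point; later iterations overwrite them with re-estimated probabilities.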

Algorithm 2: dEM model.
Input: the number of labels; the maximum number of iterations
Output: parameter
repeat
  for each index up to n do
    E-step:
      Estimate model probability: (21)
    M-step:
      Update class-conditional probability: (22)
      Update class probability: (25)
until convergence or the maximum number of iterations is reached

In order to evaluate our proposed method, three types of text representation are investigated across three settings of unlabeled data incorporation. Finally, our proposed method and its enhancement, the MIL-dEM-S-S (supervised learning) and MIL-dEM-T-S (transductive learning) methods, are compared to TSVM and three MIL models, MISVM, MINB, and MILR, which are implemented in WEKA [66].

4. Evaluation

We assess our proposed method using various parameter settings, as shown in Table 3, and evaluate it by hold-out evaluation through k-fold cross-validation. The three main measures defined by (26), (27), and (28), that is, precision, recall, and F1, are used for model evaluation, with the ADR label as the positive class. In our experiment, we use the MetaMap Java API for NER and the Stanford CoreNLP Java API for OpenIE and implement a Python program for the EM-based methods. For model comparison, we execute the WEKA Java-based software and SVMlight, which is implemented in the C programming language, on Mac OS with an Intel Core i5 processor running at 2.5 GHz and 8 GB of physical memory.
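The three measures follow their standard definitions with ADR as the positive class. A minimal sketch (function names are illustrative; returning 0.0 when a denominator is empty is an assumed convention):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(gold, predicted, positive="ADR"):
    """Precision, recall, and F1 for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)
```

In this setting a false negative is an ADR pair that the model labels as IND, so recall directly reflects how many adverse reactions are missed.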

| Parameter group | Parameter type | Parameter subtype | Parameter name | Variable name |
|---|---|---|---|---|
| Document representation | Syntactic lemmatization | | Syntactically lemmatized lexicon | L |
| | | | Surface lexicon | S |
| | Pattern granularity | | Phrase form | P |
| | | | Word form | W |
| | Pattern-weighting models | Bernoulli | Binary | B |
| | | Multinomial | TF (term frequency) | TF |
| | | | TFIDF (TF-inverse document frequency) | TFIDF |
| Model assumption | Independent assumption | | EM with naïve Bayes | iEM |
| | Dependency representation assumption | | EM with Bayesian network | dEM |
| Model decision method | | | Soft decision making | S |
| | | | Hard decision making | H |
| Learning method | Supervised learning | | | SL |
| | Transductive learning | Initial weight method for unlabeled data | Supervised model | |
| | | | Equal probability | |
| | | | Random probability | |

4.1. Data

Our proposed framework is examined on unstructured texts from the EMRs of intensive care units, derived from MIMIC-III [67]. The data are freely available at PhysioNet and were accessed on Apr 25, 2016 (version 1.3). The data source covers over 58,000 hospital admissions for 38,645 adults and 7875 neonates, spanning up to 12 years from June 2001. In our work, two main sections of the discharge summary, that is, the brief hospital course (BHC) and the history of present illness (HPI), are preliminarily explored. For data preparation, we employ SBD, stop word removal, tokenization, NER, and normalization. We consider two semantic types of UMLS CUI, CHEM and DISO, for drug and event entities, respectively. As a result, nearly 1.6 million sentences are extracted and used as our corpus.

4.2. Results and Discussion

We conduct four main experiments in order to evaluate the effectiveness of our proposed method: (i) the key phrasal pattern analysis, (ii) the evaluation on the effectiveness of the key phrasal patterns, (iii) the effectiveness of the pattern-based feature with MIL-iEM and MIL-dEM, and (iv) the evaluation on overall performance with advanced machine learning methods.

4.2.1. Key Phrasal Pattern Analysis

We initially analyze the discovered key phrasal patterns to investigate how well they characterize the relation labels. Given a key phrasal pattern, we compute the pattern score by inverting the conditional entropy and adjusting for polarity, in order to visualize the performance of the extracted key phrasal patterns.

In Figure 5, a pattern that is located far from the middle line (score 0) and close to the top left or the top right corner discriminates strongly between the relation labels. For example, the key phrasal patterns “be-hold-in,” “contribute-to,” “be-think,” and “improve-with” are strongly relevant to the ADR label, and “be-add-for,” “be-initial-for,” and “be-on” are rather associated with IND. In contrast to these key phrasal patterns, “be” and “be-with” appear near the middle line in the graph, which marks them as the fuzziest patterns.

Additionally, the figure clearly illustrates that the patterns relevant to ADR are more discriminative than the patterns relevant to IND: only a small number of ADR patterns are located near the origin, and most of the ADR patterns are spread out at a distance from it. On the other hand, patterns relevant to IND are densely clustered at the location of nearly zero score and zero frequency. Table 4 presents examples of sentences that are relevant to key phrasal patterns and pattern directions. Finally, the key phrasal patterns with a pattern score above the threshold are selected for further processing.
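One plausible reading of this score can be sketched as follows. The exact formula is an assumption: here the score is one minus the binary entropy of the label distribution for a pattern, signed positive when the pattern co-occurs more often with ADR and negative when it co-occurs more often with IND. The function name, the count-based inputs, and the sign convention are all hypothetical.

```python
import math

def pattern_score(n_adr, n_ind):
    """Signed (1 - entropy) of the label distribution for one pattern (assumed form)."""
    total = n_adr + n_ind
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in (n_adr, n_ind):
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    # Assumed polarity: positive towards ADR, negative towards IND.
    polarity = 1.0 if n_adr >= n_ind else -1.0
    return polarity * (1.0 - entropy)
```

Under this reading, a pattern seen equally often with both labels scores 0, and a pattern seen with only one label scores +1 or -1.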

| Drugs (d) | Key phrasal patterns (p) | Events (e) | Pattern direction | Sentences |
|---|---|---|---|---|
| C0020261 (hydrochlorothiazide) | be-hold-in | C0020625 (hyponatremia) | d → e | However the patient’s sodium was 131 on discharge thus the patient’s HCTZ was-held-in the setting of hyponatremia. |
| C0000970 (acetaminophen) | be-think | C0002871 (anemia) | e → d | Her anemia is-thought to be due to direct effects of acetaminophen on marrow or indirect via kidneys. |
| C0020223 (hydrallazine) | be-give-for | C0020538 (hypertension) | d → e | Hydrallazine 20 mg IV was-given-for isolated episode of hypertension and emesis ensued. |
| C0043031 (warfarin) | be-initiate-for | C0004238 (atrial fibrillation) | e → d | Warfarin was-initiated-for his atrial fibrillation with an initial heparin bridge. |

Pattern direction: d → e is drug-event; e → d is event-drug.

4.2.2. Evaluation on the Effectiveness of the Pattern-Based Feature

To examine the effectiveness of the pattern-based feature, we compare multiple feature types across varying initial weightings of relation labels for unlabeled data incorporation, using MIL-iEM. We divide the experiments into two parts based on the decision method in the EM algorithm. The former is soft decision making (MIL-iEM-S), in which the predicted result is given directly by the estimated class probability. The latter is hard decision making (MIL-iEM-H), in which a cutoff is applied to the probability and a class label is assigned instead of a likelihood ratio. We initially perform the experiments under the traditional independence assumption through the MIL-iEM model.

Table 5 presents an assessment of five text transformations across three alternative document representations and three initial weightings of unlabeled data, based on soft decision making and hard decision making. In the table, the pattern-based features are listed in the top four rows of each experimental setting, that is, SP, SW, LP, and LW. From the experimental results, we found that the pattern-based feature outperformed the traditional bag-of-words (BOW). The highest F1 score, 0.841, is obtained by the MIL-iEM-SP-TF-S-T model, which outperformed the baseline MIL-iEM-BOW-TF-S-T by up to 4.4%. In addition, the B and TF document representations perform slightly better than TFIDF for all types of initial weighting method. Similar results are found for the hard decision making approach as well: the pattern-based feature performs better than the BOW feature. The MIL-iEM-LW-TFIDF-H-T model obtains the highest F1 score, 0.807, a 3.3% improvement over the MIL-iEM-BOW-TFIDF-H-T baseline model. However, hard decision making results in poorer performance than the soft version.
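The two decision methods differ in what is passed on to the next EM iteration. A minimal sketch, assuming posteriors are stored as dictionaries over the two labels and that the cutoff is 0.5 (both are assumptions, as are the function names):

```python
def soft_decision(posteriors):
    """Keep the estimated class probabilities as they are."""
    return [dict(p) for p in posteriors]

def hard_decision(posteriors, cutoff=0.5, positive="ADR", negative="IND"):
    """Replace each posterior by a one-hot label using a probability cutoff."""
    decided = []
    for p in posteriors:
        is_positive = p[positive] >= cutoff
        decided.append({positive: 1.0 if is_positive else 0.0,
                        negative: 0.0 if is_positive else 1.0})
    return decided

posteriors = [{"ADR": 0.55, "IND": 0.45}, {"ADR": 0.10, "IND": 0.90}]
```

The first instance shows what is lost: a borderline 0.55 becomes a full-weight ADR example under the hard scheme, so early mistakes can be reinforced in later iterations.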

Models | Soft decision making | Hard decision making


B: binary frequency; TF: term frequency; TFIDF: term frequency-inverse document frequency.

The performance comparison across the number of features is shown in Figure 6. The number of pattern-based features ranges from 737 to 1322 dimensions, and the number of BOW features is 1853 dimensions. From the graph, even though our proposed pattern-based features with MIL-iEM-T and MIL-iEM-T provide slightly different F1 scores from the BOW feature, their number of dimensions is less than half that of BOW, especially for the S–W and L–W features. Therefore, our proposed pattern-based feature is more efficient than the BOW feature, since it uses a smaller number of features but yields similar model performance.

Accordingly, the experimental results support the view that a simplified sentence, expressed as a relation tuple of a drug, a key phrasal pattern, and an event, is a promising feature transformation for the relation classification task. Moreover, ignoring insignificant contexts can reduce feature redundancy and avoid the computational cost that is frequently caused by the curse of dimensionality.

4.2.3. Evaluation on the Effectiveness of MIL-dEM-SL and MIL-dEM-T

In this experiment, we compare our proposed method based on SL (MIL-dEM-SL) and on transductive learning (MIL-dEM-T) across varying parameters such as feature types, pattern-weighting models, and the initial weight methods for unlabeled data incorporation. Our proposed method is based on a dependency representation of texts, and the posterior estimation is based on the interpolation of the Markov property. The experiment is set up with a supervised learning-based model and three transductive learning-based models with different initial weight methods of incorporation. Two types of pattern-based features, surface lexicon-based (S–P) and syntactically lemmatized lexicon-based (L–P), are examined. Parameter tuning is also performed for all approaches.

As shown in Table 6, among the transductive learning models, the performance of the S–P feature differs only slightly from that of the L–P feature for all models. The simple binary (B) weighting model yields a higher F1 score than TF and TFIDF. Moreover, the MIL-dEM-S-T model exhibits higher performance than the fuzzy guidelines given by the MIL-dEM-S-T and MIL-dEM-S-T models for all evaluation metrics.


Supervised learningMIL-dEM-S-SL1

Transductive learningMIL-dEM-S-T

1,2γ = , β = , α = ; 3,4γ = , β = , α = . B: binary frequency; TF: term frequency; TFIDF: term frequency-inverse document frequency.

On the other hand, the F1 score of the MIL-dEM-SP-S-SL surface lexicon-based feature is better than that of the MIL-dEM-LP-S-SL syntactically lemmatized lexicon-based feature by 1% and 0.8% for the TF- and TFIDF-weighting models, respectively.

Similarly, the F1 score of the pattern-based feature S–P differs only slightly across the three types of pattern-weighting model, that is, the B, TF, and TFIDF models: 0.928 for MIL-dEM-SP-B-S-SL, 0.946 for MIL-dEM-SP-TF-S-SL, and 0.938 for MIL-dEM-SP-TFIDF-S-SL. Among the models within the MIL-dEM-S-SL setting, the highest F1 score, 0.946, is obtained by the TF-weighting model.

One interesting result is that incorporating unlabeled data increases the model performance. The highest F1 score, 0.954, is obtained by the MIL-dEM-SP-B-S-T model, which uses the simple binary weighting model. It shows 2.6%, 1.6%, and 0.8% improvement over MIL-dEM-SP-B-S-SL, MIL-dEM-SP-TFIDF-S-SL, and MIL-dEM-SP-TF-S-SL (the best of our proposed supervised learning models), respectively.

According to the results of the parameter optimization of our proposed method, the model performance is strongly relevant to the following dependency representations of random variables: (i) an event and the clinical outcome and (ii) a pattern, a drug, and the clinical outcome. In contrast, the model shows less relevance between a drug and an event or between a pattern and an event.

4.2.4. Evaluation on Overall Performance with Advanced Machine Learning Methods

The comparison of our proposed method and advanced machine learning methods is presented in Table 7. The best models of each set of models are used for assessment. The well-known MIL methods, that is, MISVM, MINB, and MILR, are executed using WEKA. In addition, we customize the original TSVM using the source code from the author and incorporate the MIL assumption as discussed in the previous section (see Section 2.2). We divide the discussion into three parts: the effectiveness of the supervised learning model, the effectiveness of the transductive learning model, and the overall performance.


Supervised learning

Transductive learning

1,4γ = , β = , α = . 2Polynomial kernel, C = 10. 3Collective MI assumption, geometric mean for posteriors.

Firstly, the experimental results among the baseline supervised learning methods, that is, MISVM-TFIDF, MINB-B, and MILR-B, show that the BOW feature works well for all MIL methods; conversely, the pattern-based feature S–P contributes a dramatic improvement when combined with our proposed method MIL-dEM-TF-S-SL. The TFIDF-weighting model yields high performance for MISVM with an F1 score of 0.901, while the binary weighting model (B) improves the performance of MINB and MILR with F1 scores of 0.880 and 0.861, respectively. However, our proposed MIL-dEM-TF-S-SL with the S–P feature outperforms all MIL methods, and its F1 score is 4.5% better than the highest performance among the advanced machine learning methods, which is obtained by MISVM-TFIDF with the BOW feature. The precision of MIL-dEM-TF-S-SL with the S–P feature is slightly lower than that of MISVM-TFIDF with BOW, but the recall is significantly improved. Accordingly, our proposed method contributes to reducing the type II error, which is always a concern in the medical domain.

Secondly, in the comparison among transductive learning methods, the BOW feature with TSVM-B achieves an F1 score of 0.889, while applying the pattern-based feature S–P degrades its performance by around 2%. Conversely, the pattern-based feature S–P with the MIL generative method enhances the effectiveness of the models. The accuracy of the MIL-iEM-TF-S-T model increases by up to 6.3% when the pattern-based feature is deployed instead of the BOW feature.

Lastly, in the overall evaluation, the generative models with dependency representation, that is, MIL-dEM-TF-S-SL and MIL-dEM-B-S-T, outperform all other models. The highest performance is exhibited by our transductive learning MIL-dEM-B-S-T method with 0.934 precision, 0.975 recall, 0.954 F1 score, and 0.949 accuracy. Moreover, improving the generative model by substitute assumption of word-dependency MIL-dEM-B-S-T model to word-independency MIL-iEM-TF-S-T