Abstract

Medical entity recognition, a basic task in the language processing of clinical data, has been extensively studied for admission notes written in alphabetic languages such as English. However, much less work has been done on nonstructural texts written in Chinese, or in settings that require distinguishing traditional Chinese medicine drug names from Western medicine drug names. Here, we propose a novel cascade-type Chinese medication entity recognition approach that integrates a support vector machine-based sentence category classifier with conditional random field-based medication entity recognition. We hypothesized that this approach could avoid the side effects of abundant negative samples and improve the performance of named entity recognition in admission notes written in Chinese. We applied this approach to a test set of 324 Chinese admission notes manually annotated by medical experts. Our data demonstrate that this approach achieved 94.2% precision, 92.8% recall, and 93.5% F-measure for the recognition of traditional Chinese medicine drug names and 91.2% precision, 92.6% recall, and 91.7% F-measure for the recognition of Western medicine drug names. The differences in F-measure were significant compared with those of the baseline systems.

1. Introduction

The existing natural languages are mainly divided into alphabetic and logographic languages. Logographic languages contain written characters representing words or phrases, and the best known logographic language is Chinese, while alphabetic languages are made of written characters representing sounds or sound combinations rather than concepts [1]. Electronic health records (EHRs) have been widely adopted worldwide. In the health care reform in China, the National Health and Family Planning Commission of the PRC has considered the nationwide application of EHRs one of its priorities and launched a set of modestly interoperable Chinese EHR Template Guides (CEHRTG) in 2010 [2–4]. In 2013, the Chinese Hospital Information Management Association conducted an investigation of EHRs using data from healthcare information systems in China. The study, which involved 1004 Chinese medical institutions, showed that electronic medical records (EMRs) are prevalently applied in these hospitals (77.8%) and that a portion of these hospitals have incorporated their EMR systems into regional or state-level EHRs [5]. A large number of studies on data mining and knowledge discovery have been devoted to the clinical data accumulated in EHRs. However, structured data mining methods cannot be directly used to mine EHRs stored as nonstructural texts; such texts first need to be preprocessed and structured [6]. Named entity recognition (NER) in free narrative clinical texts aims to recognize relevant named entities (NEs) from nonstructural clinical data and is a basic task of medical language processing (MLP) [7]. Research on medical entity recognition (MER) in admission notes (ANs) written in logographic languages, such as Chinese, lags behind research on MER in ANs written in alphabetic languages, such as English. This gap is largely due to the huge differences between alphabetic and logographic languages [8] and between modern Western medicine (WM) and traditional Chinese medicine (TCM) [9], which hinder the transfer of alphabetic-language research results to logographic languages, both for entity recognition in general and for MER in particular. In particular, to study nonstructural Chinese admission notes involving TCM-WM combined treatment with supervised algorithms based on discriminative models [10], we have to answer several urgent questions:

(1) Is there a generalized approach for efficient MER in nonstructural EHRs written in a logographic language?

(2) Is it feasible to selectively extract concrete medical features from free narrative clinical texts so that supervised algorithms can fulfill MER tasks in clinical ANs written in a logographic language?

(3) Among the supervised algorithms widely applied in NER [e.g., the support vector machine (SVM) and the conditional random field (CRF) with specific feature types], which are applicable for MER in ANs written in a logographic language?

Here, we propose a new cascade-type MER approach called cascade-type Chinese medication entity recognition (CCMER) for processing admission notes in Chinese. In contrast to MER studies on clinical texts in English (an alphabetic language), we investigated the roles of 4 feature types based on supervised algorithms [6] for MER in nonstructural Chinese admission notes with TCM-WM combined treatment. To our knowledge, this is the first time MER has been applied in such a context. Moreover, we designed and conducted a series of detailed experiments using CCMER on ANs with TCM-WM combined therapy written in both Chinese and English and compared the performance of CCMER with that of three baseline systems (one based on rules, one based on simple local context features, and one based on a simplified CCMER with the potential hot-sentence filtration module removed). Based on the soft scores of correctly recognized drug named entities, we statistically compared the results of CCMER with those of baseline system 2.

The results showed that the performance was significantly improved. This study provides guiding and reference values for MLP of nonstructural logographic EHRs.

2. Related Work

Across the world, computer-based search for key information in abundant nonstructural ANs has been used to support clinical decisions as early as 1967. Depending on the purpose and context, key information search can be realized either by information retrieval (IR) at the document level (recognizing and identifying relevant AN documents from massive free-text ANs) [11–13] or through finer-grained recognition, which identifies various descriptions with identical meaning in clinical texts and maps them to the same concept. MER is a typical concept-level information extraction (IE) task and also a major and basic area of MLP research [6].

Early MLP systems usually adopted rule-based methods for MER [14]. In recent years, machine learning- (ML-) based MER methods have gradually been applied to processing clinical narrative texts. One important reason for this trend is the availability of abundant anonymized clinical ANs and of annotated corpora serving as gold standards, both issued by Informatics for Integrating Biology and the Bedside (I2B2) [15]. For instance, the theme of the 3rd I2B2 challenge was the extraction of information about drug names, doses, usage, and duration from discharge summaries [7]; one subtheme of the 4th I2B2 challenge was the extraction of concepts about illness state, examination, and treatment [16]. These clinical AN datasets subsequently inspired a number of studies on NER in English clinical notes. The NER tasks were usually cast as sequence annotation and solved by supervised ML algorithms, such as the conditional random field and the structural support vector machine (SSVM). These preliminary works reconfirmed the three major factors that affect the performance of supervised ML algorithms based on discriminative models for NER tasks: the input features describing the dataset, the choice of classification model, and the learning process that determines the model parameters [17]. We think that appropriate feature selection is especially important for MLP tasks, because current techniques for parameter determination are ineffective for clinical AN texts with highly redundant and noisy features. Studies on processing clinical texts in alphabetic languages other than English, such as French, German, and Dutch, have also been carried out, mapping clinical descriptions to Systematized Nomenclature of Medicine codes [18].

However, most of the studies described above are limited to clinical ANs written in alphabetic languages; studies on NER in logographic languages are relatively few. With the wide application of EHRs in China, it is urgent to test the reliability of the secondary use of the abundant Chinese clinical ANs. So far, little effort has been devoted to NER in ANs written in logographic languages, especially in Chinese. Several ML-based algorithms, including CRF, SVM, and maximum entropy, have been used to identify symptom descriptions and pathogenic mechanisms in TCM EMRs [19]. CRF has also been used to identify biomedical named entities in Chinese research abstracts [20]. A parallel implementation of word segmentation and NER has been proposed to replace the serial application of the two tasks, in order to improve NER performance on discharge summaries written in Chinese [21]. Studies on the performance of several ML algorithms for NER in WM clinical ANs written in Chinese are also highly valuable [22]. However, there has been no research on MER in Chinese AN texts with TCM-WM combined treatment, especially on WM and TCM drug name recognition, even though such texts are the most commonly produced in Chinese hospitals.

Here, building on document-level IR of ANs [23], we propose a new cascade-type MER algorithm (called CCMER) and validate it on ANs written in Chinese. In contrast to MER studies on English clinical texts (English being a representative alphabetic language), we investigate the roles of 4 feature types based on supervised algorithms [6] for MER in nonstructural Chinese ANs with TCM-WM combined treatment. To our knowledge, the MER approach presented here is the first of its kind. This study may provide guiding and reference values for MLP of logographic languages in general.

3. Dataset and Annotation

3.1. Description and Definition of Drug Named Entities

Based on the standard writing formats of WM and TCM prescriptions in medical orders [24], combined with the specific format of medical orders in our AN dataset and the characteristics of drug named entities, we compiled a list of WM/TCM drug-related treatment events and their definitions in ANs (see Table 1) [25].

In the actual clinical environment, medical orders regarding TCM or WM may contain only WM and TCM drug named entities without any other descriptive terms of the related drug use events.

3.2. Annotation Rules on Word Library of Drug Named Entities

To improve the reliability of manually annotated drug names, we established detailed annotation rules based on the above basic definitions. We subdivided medications into Western medication and traditional Chinese medication and further improved TCM-related annotation rules. The name annotation rules of core drugs are listed in Table 2.

3.3. Dataset Description

In total, 1000 of the 120,000 CEHRTG-based ANs recorded at SAHZU between January 2011 and December 2012 were randomly extracted. After deletion of incomplete ANs, 972 ANs were kept and then anonymized before manual annotation: private information including ID, name, sex, age, and reception department was deleted. Two native Chinese-speaking nurses then annotated the names of clinical WM and TCM drugs according to the predefined guideline. To measure the interannotator agreement (IAA), the nurses independently annotated the same 100 ANs, and a senior doctor arbitrated disagreements. Potential issues were identified, and the predefined annotation guideline was modified when necessary. The most difficult step in concept annotation of Chinese drug names is determining the boundary of an expression. Using the modified annotation guideline, the nurses manually annotated drug names in the remaining 872 admission notes. Thus, 100 notes were annotated by both nurses, and the remaining 872 ANs were evenly divided, with each half annotated by one nurse.

The 972 ANs were then divided into 2 sets: 2/3 (648 ANs) as a training set and 1/3 (324 ANs) as a test set. The feature set yielding a classifier's optimal performance was determined by 10-fold cross-validation on the training set, and the performance was then assessed on the test set. The statistics of the data used are listed in Table 3.

The 972 ANs contain 61,046 sentences, 2739 of which mention drug names. There are a total of 2599 drug named entities, including 1903 WM entities (73.2%) and 696 TCM entities (26.8%). Based on the 100 ANs annotated by both experts, the IAA measured by the kappa statistic was 0.968, which indicates that the annotation was reliable.
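To illustrate how an agreement figure of this kind can be computed, the following is a minimal sketch (not the authors' code) that applies scikit-learn's Cohen's kappa to token-level BIO labels produced by two annotators; the label sequences below are hypothetical.

```python
# Minimal sketch: inter-annotator agreement via Cohen's kappa over
# token-level BIO labels from two annotators (labels are hypothetical).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["O", "B", "I", "I", "O", "O", "B", "I", "O"]
annotator_b = ["O", "B", "I", "I", "O", "O", "B", "O", "O"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")  # values near 1.0 indicate strong agreement
```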

4. Methods

4.1. CCMER Framework

CCMER is a new pipelined, cascade-type framework. It locates hot text chunks based on sentence classification and recognizes drug named entities in the candidate target sentences via a supervised sequence annotation algorithm. The system structure (Figure 1) includes a preprocessing module, a filtration module for potential hot-sentence classification, a recognition module for drug named entity identification, and a postprocessing module.

4.2. Preprocessing Module

The preprocessing module runs directly on the original document set. Because of specific writing habits, ANs with TCM-WM combined treatment contain abundant figures, Chinese characters, and English words. Such AN documents therefore need to be normalized first to reconstruct the input sentences so that they can be processed by standard natural language processing (NLP) tools. For instance, numbers and Greek letters are normalized to "9" and "@," respectively. Then, with the character-to-pinyin function in Microsoft Office Word 2011, we transformed all Chinese characters into pinyin. Finally, the sentence splitter from the adjusted ICTCLAS system [26] was used to segment the text into sentences. Notably, in this process, we did not use a stopword list [27], which is commonly used in the general NLP field. Because many WM drug names are transliterations, their Chinese forms contain abundant function-word characters, and directly deleting function words could destroy the completeness of drug named entities. Unlike English texts, Chinese texts do not have spaces between words, which further complicates the identification of word boundaries. Because of the characteristics of Chinese clinical notes (e.g., telegraphic style, pervasive abbreviations), automatic segmentation of Chinese clinical texts into words or phrases is extremely difficult, not to mention the recognition of details from these fragments. Early research results suggest that general-purpose segmenters trained on general Chinese texts cannot handle clinical notes well, especially clinical terms, which undermines their effectiveness [28]. Therefore, we did not filter user-defined function words determined by statistical indicators (e.g., term frequency-inverse document frequency). Instead, we delivered the preprocessed texts directly to the next module.
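The normalization and pinyin conversion steps can be sketched roughly as follows. This is an illustrative Python example, not the authors' implementation: the open-source pypinyin package stands in for the Word character-to-pinyin function, and a simple regex splitter stands in for the adjusted ICTCLAS sentence splitter.

```python
# Illustrative preprocessing sketch (assumptions: pypinyin replaces the Word
# pinyin function; a regex splitter replaces the ICTCLAS sentence splitter).
import re
from pypinyin import lazy_pinyin  # pip install pypinyin

GREEK = "αβγδεζηθικλμνξοπρστυφχψω"

def normalize(text: str) -> str:
    """Normalize digits to '9' and Greek letters to '@' as described above."""
    text = re.sub(r"\d", "9", text)
    text = re.sub(f"[{GREEK}]", "@", text)
    return text

def to_pinyin(text: str) -> str:
    """Map Chinese characters to pinyin, leaving other symbols as-is."""
    return " ".join(lazy_pinyin(text))

def split_sentences(text: str) -> list[str]:
    """Very rough sentence splitting on Chinese/ASCII terminators."""
    return [s for s in re.split(r"[。；;！!？?\n]", text) if s.strip()]

if __name__ == "__main__":
    raw = "予左氧氟沙星片0.5g，每日1次，共7天。"
    for sent in split_sentences(normalize(raw)):
        print(sent, "=>", to_pinyin(sent))
```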

4.3. Filtration Module for Classifications of Potential Hot Sentence

Though CRF-based recognizers have achieved outstanding performance in various sequence annotation tasks [29], CRF depends heavily on the local contexts surrounding named entities (NEs) and assumes that similar local contexts lead to the same judgments [28]. However, this assumption is basically untenable for clinical ANs. For instance, a phrase that refers to a specific test name in the "test results" section might represent a drug name in the "medications" section; the same phrase can be labeled differently under different local contexts. A training set with abundant misleading information of this kind can largely reduce the performance of a CRF-based annotator.

To remove such confusing data from the dataset, we used an SVM-based [30] user-defined sentence category classifier to filter out irrelevant sentences that do not contain drug name information. The potential hot-sentence classifier divides the sentences contained in ANs into two parts: sentences that may contain drug names and the remaining sentences. Only the first part is transmitted to the subsequent CRF-based recognition module for identifying drug names, thereby reducing noise and running time, alleviating the interference caused by the imbalance between negative and positive samples in the corpus, and improving system performance. The underlying hypothesis is that every sentence in an AN is an instance of one of two classes: "sentences with drug names" and "sentences without drug names."

When an NLP system is trained on a text set composed of characters, it usually describes the texts by first selecting phrase features; in the bag-of-words model, the occurrence of a phrase is considered the most important phrase feature. However, accurate segmentation of Chinese words or phrases is not easy, especially in clinical texts [31], which also reduces the accuracy of phrase-based analysis. Thus, we used a feature extraction function for dimension reduction, which can be interpreted as a transformation from a high-dimensional to a low-dimensional vector space. Feature extraction uses as feature elements both the components of the sentence instance and the pattern templates hidden in combinations of components; a pattern template covers repeatability and concurrent events. This approach leads to a more general model for character-composed cases and further reduces the probability of model overfitting.

In this feature extraction function, each sentence is described as a vector of features. Definition 1. For a sentence s with features (f_1, f_2, ..., f_n), let w_i be the weight measuring the occurrence frequency of feature f_i in sentence s. The vector of sentence s is represented as s = (w_1, w_2, ..., w_n). Note that w_i is positively related to the occurrence frequency of f_i in sentence s: if feature f_i occurs in sentence s, then w_i > 0; otherwise, w_i = 0. Nevertheless, the exact value of w_i depends on the corresponding situation.

Here, we use a 6-dimensional feature vector comprising features based on clinical knowledge, statistics, and linguistics. It is specifically defined as follows:

4.3.1. In Symbol List (SF1)

SF1 is the number of formal symbols (e.g., "<," ">," "(," and ")," commonly used in clinical medication descriptions) contained in the current sentence. If SF1 = 0, the sentence does not contain such descriptive symbols.

4.3.2. In Chinese Drug Name Dictionaries (SF2)

Do the Chinese drug name terms contained in Chinese drug name dictionaries appear in the current sentence?

4.3.3. In Pinyin Dictionaries (SF3)

Does the pinyin corresponding to the drug name terms in the drug name dictionaries appear in the current sentence? The idea behind this feature is that an input Chinese character string is mapped into a voice-coding sequence, which is the pinyin pronunciation or a rough approximation of the input string, since the Chinese translation of English WM drug names is mainly based on transliteration. In actual clinical EHR writing, because the Chinese pinyin input method is widely used, even when the Chinese-character description of a drug name in an AN contains writing or printing errors, the corresponding pinyin spelling is often still correct. Different Chinese character strings with the same pronunciation are therefore mapped to the same voice-encoded string because they share the same pinyin spelling. This approach is essentially a form of feature clustering and can correct many writing and printing errors.
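A minimal sketch of this pronunciation-based clustering is given below: two differently written (e.g., mistyped) character strings that share the same pinyin collapse onto the same key, so a pinyin-level dictionary lookup still matches. The drug entry, the typo example, and the pypinyin dependency are illustrative assumptions, not the authors' resources.

```python
# Minimal sketch of pronunciation-based clustering for SF3 (not the authors' code).
# Two different character strings with the same pronunciation map to the same
# pinyin key, so a mistyped drug name can still hit the pinyin dictionary.
from pypinyin import lazy_pinyin

def pinyin_key(text: str) -> str:
    return "".join(lazy_pinyin(text))

# Hypothetical pinyin dictionary built from the drug name dictionary.
pinyin_dict = {pinyin_key("阿莫西林"): "阿莫西林"}  # amoxicillin

query = "阿莫西临"  # hypothetical typo pronounced the same as 阿莫西林
print(pinyin_key(query) in pinyin_dict)  # True: matched via pinyin despite the typo
```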

4.3.4. Positive Set of Statistical Word Features (SF4) and Negative Set of Statistical Word Features (SF5)

The features for SF4 and SF5 are represented by the sum of the frequency-weighted values of statistical word features (WFs) in the current sentence. An n-gram-based feature extraction algorithm [32] is used to extract 2- to 4-character keywords from the set of positive sentences (i.e., sentences with medication information) and from the set of negative sentences (i.e., sentences without medication information), respectively. To segment sentences into bigrams, this algorithm tallies the occurrence frequencies not only of the current bigram Chinese character combination but also of the adjacent combination preceding it, and determines which combinations occur frequently and continuously. We empirically kept the top 30% of statistical feature phrases ranked by weight. SF4 is mined from the positive set of sentences with medication information and SF5 from the negative set. In total, 119 positive feature words and 313 negative feature words were extracted and kept. The feature words were then weighted according to the frequencies of all selected feature words appearing in the positive or negative word library. For instance, "yuyi," a positive feature word meaning "give help or treatment to," appears 114 times in the positive word library, while all selected positive feature words appear 2524 times in total, so the weight of "yuyi" is 0.045.
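A rough sketch of this frequency weighting follows: keywords extracted from the positive sentence set are weighted by their relative frequency, so a keyword occurring 114 times out of 2524 total keyword occurrences receives a weight of about 0.045. The keyword extraction itself is simplified here to plain character n-gram counting; the actual algorithm cited as [32] is more involved, and the example sentences are hypothetical.

```python
# Simplified sketch of the SF4/SF5 statistical word-feature weighting
# (plain n-gram counting; the paper's algorithm [32] is more elaborate).
from collections import Counter

def char_ngrams(sentence: str, n_min: int = 2, n_max: int = 4):
    for n in range(n_min, n_max + 1):
        for i in range(len(sentence) - n + 1):
            yield sentence[i:i + n]

def keyword_weights(sentences: list[str], keep_ratio: float = 0.3) -> dict[str, float]:
    counts = Counter(g for s in sentences for g in char_ngrams(s))
    # Keep the top 30% of n-grams ranked by frequency (a stand-in for the
    # weight-ranked selection described above).
    kept = counts.most_common(max(1, int(len(counts) * keep_ratio)))
    total = sum(freq for _, freq in kept)
    return {gram: freq / total for gram, freq in kept}

# Hypothetical positive sentences (those that mention medications).
positive = ["予以左氧氟沙星片口服", "予以阿莫西林胶囊口服"]
weights = keyword_weights(positive)
print(weights.get("予以", 0.0))  # relative-frequency weight of the keyword "予以" (yuyi)
```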

4.3.5. In Event Patterns (SF6)

Does a predefined drug use event phrase collocation template appear in the current sentence? Through observation of drug use events, we defined some phrase patterns commonly used in drug use events, such as

[NUMBER + DOSEPATTERN];

[MODEPATTERN];

[FREQUENCYPATTERN];

[MODEPATTERN + CANDIDATE MEDICATION NAME];

[CANDIDATE MEDICATION NAME + FREQUENCYPATTERN].

The collocation of these phrases is a favorable indicator of drug use events.
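As an illustration of how such phrase patterns can be operationalized, the sketch below encodes a few of them (e.g., [NUMBER + DOSEPATTERN] and [FREQUENCYPATTERN]) as regular expressions. The concrete dose-unit, frequency, and administration-mode vocabularies here are assumptions for illustration, not the authors' exact templates.

```python
# Illustrative regex encoding of a few drug-use event patterns (SF6).
# The unit/frequency/mode vocabularies below are assumptions for illustration.
import re

EVENT_PATTERNS = {
    "NUMBER+DOSE": re.compile(r"\d+(\.\d+)?\s*(mg|g|ml|片|粒|袋)"),
    "FREQUENCY":   re.compile(r"(qd|bid|tid|q\.d\.?|每日\s*\d+\s*次)", re.IGNORECASE),
    "MODE":        re.compile(r"(口服|静滴|静脉滴注|肌注|po|iv)", re.IGNORECASE),
}

def count_event_patterns(sentence: str) -> int:
    """SF6: number of predefined drug-use event patterns matched in the sentence."""
    return sum(len(p.findall(sentence)) for p in EVENT_PATTERNS.values())

print(count_event_patterns("予左氧氟沙星片 0.5 g 每日 1 次 口服"))
```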

For example, the AN of one patient contains "Anti-inflammation treatment with provision of (Ofloxacin) Levofloxacin tablets Q.D 0.5 g for 7 continuous days in a local clinic," which the feature transformation function converted into a 6-dimensional feature vector:

< 2, 2, 2, 0.045, 0, 4 > ➔ MEDICATION.

The first feature indicates that the number of special characters is 2.

The second feature indicates that the candidate drug name terms appear 2 times in the current sentence.

The third feature indicates that the pinyin description corresponding to the candidate drug name terms appears 2 times in the current sentence.

The fourth and fifth features indicate that positive feature words, but not negative feature words, appear in the current sentence, with a frequency weight of positive feature words equal to 0.045.

The sixth feature indicates that the phrase templates commonly used in <MEDICATION> appear 4 times in the current sentence.

We then used a stochastic gradient descent module [33] to implement the stochastic gradient descent learning model for a support vector machine (SVM). With the above feature set, a linear SVM sentence classification model was trained on the training dataset and evaluated on the test set. Finally, the sentence classifier filtered out irrelevant sentences and transmitted the medical sentences (the clauses containing drug names) to the next module.
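A minimal sketch of this step using scikit-learn's SGDClassifier is shown below (hinge loss yields a linear SVM trained by stochastic gradient descent). Whether the authors used this exact library is not stated; treat this only as an illustration of training a linear SVM on the 6-dimensional sentence feature vectors, with made-up training rows.

```python
# Sketch of the hot-sentence filter: a linear SVM trained by stochastic
# gradient descent on the 6-dimensional sentence feature vectors.
# (Illustrative only; the actual SGD module used by the authors is cited as [33].)
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical training data: each row is <SF1..SF6>, label 1 = "MEDICATION".
X_train = np.array([
    [2, 2, 2, 0.045, 0.0, 4],   # sentence with a medication order
    [0, 0, 0, 0.0,   0.12, 0],  # sentence without medication information
])
y_train = np.array([1, 0])

clf = SGDClassifier(loss="hinge", max_iter=1000, random_state=0)  # hinge => linear SVM
clf.fit(X_train, y_train)

X_test = np.array([[1, 1, 1, 0.02, 0.0, 2]])
print(clf.predict(X_test))  # 1 means: forward this sentence to the CRF recognizer
```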

4.4. Recognition Module for Identifying Drug NER

Chinese ANs are special in several ways. For instance, most Chinese names of WM drugs are transliterations of foreign words, so an extra disambiguation step is needed during word segmentation, which is much more complicated than the automatic segmentation of general Chinese named entities. Existing tools and methods for Chinese word segmentation therefore cannot be applied directly to clinical ANs without customization for the medical field. For this reason, we did not perform the word segmentation and part-of-speech analysis commonly used in text processing. Instead, we used individual Chinese characters as the basic annotation units, because Chinese characters are the most basic sentence-composing units and also carry semantics.

The annotation of drug named entities can be transformed into a sequence annotation task over a character sequence X = (x_1, x_2, ..., x_n), where x_i is a Chinese character.

The objective is to construct an annotator p that, for a Chinese character sequence X, accurately assigns a reliable label sequence Y = (y_1, y_2, ..., y_n), where y_i is the label of character x_i (e.g., B-WM, I-WM, or O; see "Postprocessing Module").

The actual annotation is performed by a CRF-based supervised ML sequence annotator trained on an annotated corpus, where the training set is composed of sequence pairs (X, Y). CRF is widely applied in MLP of English ANs [34] with outstanding performance. Here, we used CRF++, a tool that provides an efficient CRF implementation [35] and uses L-BFGS for optimization.
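The authors used CRF++ with L-BFGS; as an illustration of the same kind of character-level sequence annotation, the sketch below uses the open-source sklearn-crfsuite wrapper (also L-BFGS-based) with toy per-character features and one hypothetical training sentence. It is a stand-in, not the CRF++ setup itself.

```python
# Illustrative character-level CRF training (sklearn-crfsuite as a stand-in
# for CRF++; both optimize with L-BFGS).  pip install sklearn-crfsuite
import sklearn_crfsuite

def char_features(sent: str, i: int) -> dict:
    """Toy per-character features: the character itself plus a +/-1 window."""
    feats = {"char": sent[i]}
    if i > 0:
        feats["prev_char"] = sent[i - 1]
    if i < len(sent) - 1:
        feats["next_char"] = sent[i + 1]
    return feats

# One hypothetical training sentence with BIO labels (WM drug "恩度").
sentence = "静滴恩度复查"
labels = ["O", "O", "B-WM", "I-WM", "O", "O"]

X = [[char_features(sentence, i) for i in range(len(sentence))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y)
print(crf.predict(X)[0])
```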

In this module, we use a feature set containing 5 types of features (see Table 4).

4.4.1. The Local Context Feature (F1)

F1 is composed of 2 dimensions: text and pronunciation. ANs with TCM-WM combined treatment contain relatively simple short narrations (e.g., short subsentences) and abbreviations of technical terms. The average length of a Chinese phrase is usually 2 Chinese characters [36]. Thus, we used Chinese character-based n-gram patterns as the basic units, with n ∈ {1, 2, 3}. The relevant literature shows that the context window size (CWS) in NER tasks should not be preset too large [37]. Our analysis of ANs likewise shows that the fragment extent for acquiring text features, that is, the context window, should not be too large; otherwise, more noise is introduced, which reduces annotation accuracy. Thus, we set CWS < 4. For instance, (x_{i-1}, x_i) is treated as a 2-gram, including the current and the previous Chinese characters.
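A rough sketch of how F1 can be constructed is given below: for each character position, character n-grams (n = 1, 2, 3) and their pinyin counterparts are emitted within a context window of at most 3 on each side. The exact template layout (F1-1 through F1-6) is not reproduced; this is an assumed simplification, with pypinyin again standing in for the pinyin conversion.

```python
# Simplified sketch of the local context feature F1: character and pinyin
# n-grams (n = 1..3) within a context window of size <= 3 around position i.
from pypinyin import lazy_pinyin

def local_context_features(sent: str, i: int, cws: int = 3) -> dict:
    # Per-character pinyin, so the pinyin list stays aligned with the characters.
    pinyin = ["".join(lazy_pinyin(ch)) for ch in sent]
    feats = {}
    for offset in range(-cws, cws + 1):          # window positions
        for n in (1, 2, 3):                       # n-gram lengths
            start, end = i + offset, i + offset + n
            if 0 <= start and end <= len(sent):
                feats[f"char[{offset}]:{n}gram"] = sent[start:end]
                feats[f"py[{offset}]:{n}gram"] = "".join(pinyin[start:end])
    return feats

print(local_context_features("静滴恩度复查", 2))  # features around the character 恩
```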

4.4.2. Feature of Drug Name Dictionary (F2)

F2 has two dimensions including texts and pronunciation, namely, the Chinese characters in the current context window and the corresponding pinyin appearing in the drug name dictionary. This simple dictionary lookup approach uses the forward maximum match algorithm to search the drug name dictionary (defined in section “drug name dictionaries and lists of relevant terms”).
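The forward maximum match lookup used for F2 (and later F3) can be sketched as follows; the small dictionary here is an illustrative assumption, not the authors' drug name dictionary.

```python
# Sketch of forward maximum matching against a drug name dictionary (F2/F3).
# The dictionary entries are illustrative assumptions.
DRUG_DICT = {"左氧氟沙星", "阿莫西林", "恩度"}
MAX_LEN = max(len(w) for w in DRUG_DICT)

def forward_max_match(sentence: str, dictionary=DRUG_DICT):
    """Greedily match the longest dictionary entry starting at each position."""
    i, hits = 0, []
    while i < len(sentence):
        for length in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary:
                hits.append((i, candidate))
                i += length
                break
        else:
            i += 1  # no entry starts here; advance by one character
    return hits

print(forward_max_match("予左氧氟沙星片口服"))  # [(1, '左氧氟沙星')]
```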

4.4.3. Feature of Drug Named Entity-Related Terms (F3)

Do the Chinese characters or letters in the current context window match terms in the lists related to drug named entities? This feature also uses the forward maximum match algorithm to search the term lists (defined in the section "drug name dictionaries and lists of relevant terms").

4.4.4. Feature of Mode (F4)

F4 covers other relevant pattern features contained in the n-gram texts; for example, does the text of the current context window contain English letters, numbers, special symbols, or time descriptions?

4.4.5. Feature of Global AN Structure (F5)

F5 indicates whether the name of the section that contains the current n-gram texts appears in the predefined section list. One major characteristic of clinical ANs is that the natural language description is context-based. For instance, the medical term “digoxin” in the “Laboratory Results” section is the name of a clinical examination, while the same term in the “Past Medical History” section is the name of a therapeutic drug. We manually reviewed some notes and some articles [38] and defined 15 different section headers (e.g., “History of Illness”).
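A minimal sketch of how the global structure feature F5 can be derived: scan the note line by line, track the most recent section header from a predefined list, and tag each sentence with that section. The header list below is a small illustrative subset, not the authors' full list of 15 headers.

```python
# Sketch of the global AN structure feature (F5): each sentence is tagged with
# the most recent predefined section header seen above it.
# The header list is an illustrative subset of the 15 predefined headers.
SECTION_HEADERS = ["主诉", "现病史", "既往史", "辅助检查"]  # e.g., chief complaint, past history, lab results

def assign_sections(lines: list[str]) -> list[tuple[str, str]]:
    current = "UNKNOWN"
    tagged = []
    for line in lines:
        header = next((h for h in SECTION_HEADERS if line.strip().startswith(h)), None)
        if header:
            current = header
        tagged.append((current, line))
    return tagged

note = ["既往史：长期口服地高辛", "辅助检查：地高辛 0.7 ng/ml"]
for section, line in assign_sections(note):
    print(section, "->", line)  # digoxin counts as a drug only under 既往史 (past history)
```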

4.4.6. Feature of Category Annotation (F6)

F6 is the annotation category of the first 3 characters before the current character.

4.5. Postprocessing Module

In the annotated narrative text, the "BIO" annotation scheme is interpreted as follows: "B" indicates that the character is at the beginning of a drug named entity, "I" indicates that the character is in the middle or at the end of a drug named entity, and "O" indicates that the character does not belong to a drug named entity. To guarantee the consistency of character labels and the integrity of name recognition, we also used some simple heuristic rules (see Table 5).
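The kind of heuristic label repair listed in Table 5 can be sketched as follows: an "I" tag that does not follow a compatible "B" or "I" is promoted to "B", and contiguous B/I runs are then collapsed into entity spans. These two rules are illustrative assumptions; the authors' full rule set is given in Table 5.

```python
# Illustrative BIO postprocessing: repair inconsistent labels, then collapse
# B/I runs into entity spans.  (Simplified stand-in for the rules in Table 5.)
def repair_bio(labels: list[str]) -> list[str]:
    fixed = []
    for i, lab in enumerate(labels):
        if lab.startswith("I-") and (i == 0 or fixed[i - 1] == "O"
                                     or fixed[i - 1][2:] != lab[2:]):
            lab = "B-" + lab[2:]   # an I without a matching B/I before it becomes B
        fixed.append(lab)
    return fixed

def extract_entities(chars: str, labels: list[str]) -> list[tuple[str, str]]:
    entities, buf, typ = [], "", None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):
            if buf:
                entities.append((buf, typ))
            buf, typ = ch, lab[2:]
        elif lab.startswith("I-") and buf:
            buf += ch
        else:
            if buf:
                entities.append((buf, typ))
            buf, typ = "", None
    if buf:
        entities.append((buf, typ))
    return entities

chars = "静滴恩度复查"
labels = ["O", "O", "I-WM", "I-WM", "O", "O"]      # inconsistent: I without B
print(extract_entities(chars, repair_bio(labels)))  # [('恩度', 'WM')]
```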

5. Experiments

5.1. Baseline Systems

First, we constructed baseline system 1, which used the maximum matching algorithm [39] and professional drug dictionaries; this is the most mature and commonly used MER method in China. We then added maximum matching between drug name terms and their Chinese pinyin. Based on WM and TCM dictionaries, we extracted the drug names from the segmented texts. Like rule-based term-matching systems that extract names of genes and proteins from the scientific literature, this baseline system recognizes named entities by using terminological dictionaries, facilitating subsequent IE or IR. Since we focus on the performance of ML-based combined systems, this baseline system adopts the same preprocessing step as the main system. Second, we used a simplified CCMER adopting the simple local feature set combination F1-1 as baseline system 2 and compared system performance under different feature sets. Finally, we used a simplified CCMER (excluding the filtration module for potential hot-sentence classification) adopting the full feature set combination (F1 + F3 + F4 + F5 + F6) as baseline system 3 and compared it with the full-version CCMER system under the same feature set.

5.2. Experiment Evaluation Methods

Given the characteristics of and difficulties in MER for TCM-WM combined Chinese ANs, we used a type of soft evaluation indicator [40] that is widely applied in research on Chinese named entities, and we based the evaluation on standard precision, recall, and F-measure [41]. The main idea is that we score recognized medical entities from three aspects: detection, classification, and boundary. The detailed rules are listed in Table 6.

It should be noted that in clinical practice, "静/B-WM 滴/I-WM 恩/I-WM 度/I-WM 复/O 查/O 有/O 进/O 展/O" is more reliable than "静/O 滴/O 恩/B-WM 度/I-WM 复/I-WM 查/I-WM 有/O 进/O 展/O," so we assign a higher score to the start position of a medical entity.
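As a reminder of how the three standard measures combine with soft scoring, the sketch below sums per-entity soft scores in place of exact-match counts. The scoring rules of Table 6 (including the start-position weighting mentioned above) are not reproduced, so the per-entity scores here are hypothetical.

```python
# Sketch of soft-scored precision/recall/F-measure: per-entity soft scores
# (from rules like those in Table 6) replace binary exact-match counts.
def soft_prf(prediction_scores: list[float], n_predicted: int, n_gold: int):
    credit = sum(prediction_scores)           # summed soft scores of predicted entities
    precision = credit / n_predicted if n_predicted else 0.0
    recall = credit / n_gold if n_gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical: 4 predicted entities with soft scores, against 4 gold entities.
print(soft_prf([1.0, 1.0, 0.5, 0.0], n_predicted=4, n_gold=4))
```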

6. Results

6.1. Results from Baseline System 1

We first tested the performance of baseline system 1. As shown in Table 7, the overall performance is obviously low, and the F-measures of TCM and WM drug name recognition are both below 55%. After preliminary analysis, we hypothesized that the performance could be further enhanced if the scale of the professional drug name dictionaries were enlarged.

6.2. Results from CCMER

Table 8 shows the confusion matrix of the SVM-based filtration module for potential hot-sentence classification on the test set, using the predefined feature set (defined in the "Filtration Module for Classifications of Potential Hot Sentence" section). Clearly, most sentences containing drug name(s) were correctly classified and transferred to the subsequent recognition module for identifying drug named entities, supporting accurate, comprehensive, and automatic annotation (i.e., categorization) of drug names.

Figure 2 shows the performance of the CRF-based MER system in terms of n-gram local context feature set of Chinese characters and pinyin. The feature set was tested with the candidate hot-sentence subset from the training set. Clearly, the local context feature set helped to improve the performance of MER. In most cases, the use of a larger number of features yielded a higher recognition performance. We compared the system performances between CWS = 3 and CWS = 1. The F-measure of drug named entity recognition using only the feature set of Chinese characters was improved only by 3.1%, while that using the local feature sets of both Chinese characters and pinyin was improved by 4%. Thus, in the subsequent tests, F1-3 + F1-6 (CWS = 3) was used as a feature set (F1) under local text context.

With various feature sets, the performances of the CRF-based MER system with the candidate hot-sentence subset are shown in Figure 3. Clearly, the optimal performance is found with F1 + F3 + F4 + F5 + F6. Compared with the only use of F1, the WM drug named entity recognition of ANs has F-measure = 91.2%, while TCM drug named entity recognition of ANs has F-measure = 93.5%, with an increase of 2.5% and 2.7%, respectively. The overall drug named entity recognition has a 2.6% higher F-measure.

The new approach obviously outperforms baseline system 1 based on professional drug dictionaries. As shown in Figure 4, the F-measures of TCM and WM drug named entity recognitions are increased by 45.8% and 38.2%, respectively. Since the filtration module of potential hot-sentence classification also deletes abundant irrelevant sentences, the operation efficiency of the system is largely improved and the operation time is shortened.

Moreover, we built baseline system 3 (see Table 9) and conducted experiments to determine the contribution of the filtration module for potential hot-sentence classification to CCMER performance. The value of hot-sentence detection for CCMER can be seen by comparing the results of baseline system 3 with those of the full-version CCMER; the only difference between the two systems is the use of hot-sentence detection. The difference in F-measure between the two systems is as large as 20.6%, with the full-version CCMER achieving an F-measure of 92.2%, whereas baseline system 3, without the filtration module for potential hot-sentence classification, achieves an F-measure of only 71.6% (Figure 4).

7. Discussion

Here, we manually built a dataset of 972 annotated ANs containing TCM-WM combined treatment. Based on this dataset, we tested a new approach, CCMER, and investigated its performance under different feature allocations. The performance of CCMER is significantly improved versus that of baseline system 1, as the F-measures of TCM and WM drug named entity recognition are increased by 45.8% and 38.2%, respectively. The deletion of abundant irrelevant sentences from the dataset also results in largely improved operational efficiency.

The optimal performance occurs with the feature set F1 + F3 + F4 + F5 + F6, as the F-measure of overall drug name recognition is 41.9% higher than that of the baseline system. This indicates that the feature sets of different dimensions are modestly complementary and also shows that the results from Meystre et al. [6] can be applied to the processing of ANs written in logographic languages.

We then conducted a preliminary study of the contributions of individual features to drug named entity annotation. First, the use of small-scale medical drug name dictionaries (F2) does not improve system performance. This is not surprising, because the same type of information is already captured by F1, and certain drug name entries in F2 lack comprehensive and detailed information about the drugs. Unfortunately, our self-compiled drug name dictionaries are of small scale. System performance could be further improved if external resources such as a Chinese version of RxNorm [42] were incorporated. In the future, we will refer to Xu et al.'s approach [43] to enrich and improve Chinese drug dictionaries. Moreover, F5 helps to reduce the false-positive (FP) rate. For instance, although blood drug concentration results, such as "digoxin 0.7 ng/ml" in the "LAB Results" section of ANs, resemble conventional medical orders on medication, the system can automatically discard such results when they originate from the "LAB Results" section, because it has learned from the training set that there are no therapeutic drug names in the context of the "LAB Results" section. However, false positives still exist. The medical orders section contains not only orders on drugs but also other content such as test items. Some test items are expressed as drug names, such as "folic acid + vitamin B12," and can be mistakenly recognized as drug named entities; such false positives make drug NER more complicated. Error analysis of the result set shows that F6 helps to determine the end position of drug names rather than to find new drug names. However, most characters in the final result set correspond to the label "O," so this observation remains largely an intuitive evaluation.

Moreover, we found hot-sentence detection in AN texts to be a key factor affecting system performance. The hot-sentence detection technique is a way to locate the focus areas of the texts and thus filter out a large amount of noise. Removing the filtration module for potential hot-sentence classification alone would largely reduce system performance.

Meanwhile, during TCM drug name recognition we found that TCM can be subdivided into Chinese medicinal herbs and Chinese patent drugs. The names of Chinese medicinal herbs are usually composed of 2 to 3 (mean 2.57) Chinese characters, whereas Chinese patent drugs are preparations made from TCM materials through modern pharmaceutical processes complying with quality standards. Their named entities combine the characteristics of both Chinese medicinal herbs and WM drugs; thus, the recognition rates for these drug names are very low. For instance, for Heartleaf Houttuynia injection liquid, a Chinese patent drug, "Heartleaf Houttuynia" was recognized as traditional Chinese medicine, while "injection liquid" was recognized as Western medicine; Radix Salvia miltiorrhizae tablets is another example. In general, the boundaries of TCM drug names are harder to identify than those of WM drug names.

7.1. Error Sources

Among the annotated results whose soft evaluation scores for medical entities equal zero, the major error source is the recognition of general drug terms, such as anticoagulants, antibiotics, compound vitamins, and antihypertensive drugs. These general terms were included in the gold standard because, like specific drug named entities, they can indicate important drug use events; however, owing to a lack of support from fine-grained information sources and medical knowledge, the current system cannot recognize them. This is also a direction for future research.

Another common type of error occurs with test-set samples that do not appear in the training set. One advantage of a supervised ML system is that it can accurately capture drug names in the test dataset that are not in the training set; this robustness is attributed to the system's ability to capture context information. However, although we annotated 648 ANs as a training set, an annotated dataset of this scale still cannot fully cover the test set. For instance, the system detects "Amoxy" in "Amoxy (Amoxicillin) 0.5 g PO TID" as a drug name even though this drug name is not in the training set; we think the system has learned the context pattern "<drug name><dosage><drug use approach><frequency>" from the training set. On the other hand, the system cannot detect "Amoxy" in "Amoxy tests negative," because this context pattern does not appear in the training set.

Owing to the time pressure and urgency of clinical work, doctors often abbreviate some drug names; for example, "vitamin A and vitamin C" is often abbreviated as "vitamin A, C." These two drug names share the common beginning characters "维生素 (vitamin)," and the combination of the two drug names is abbreviated into a new, shorter combination. Such abbreviated drug descriptions, which omit the shared beginning or ending characters, do not contain "and" or "or." Thus, unlike when processing compound descriptions in general texts, recognizing the common beginning or ending characters of compound drug names in these abbreviated descriptions can only yield the correct recognition of the first or last drug name in the combination, while all the other drug names are missed.

Moreover, standard diagnosis-based clinical treatment guidelines have not been extensively followed in the medical institution where our samples were collected. Doctors in this institution prescribe medication for most diseases according to their previous experience. Thus, for the same disease with the same symptoms, doctors may prescribe different drugs, leading to low occurrence frequencies of individual drug named entities. Taking transfusion medicines as an example, excluding solvents such as glucose infusion and normal saline solutions, about 50.3% of the medicines used for transfusion appear only once, which is one cause of the low recognition rates of WM drug names.

7.2. Limits of Application

Our approach also has some limitations. First, we only tested ANs from one data source with one pattern from one medical center. Although CEHRTG is an HL7 CDA R2-based [44] generalized interoperable EHR AN document framework, MER results are also correlated with clinicians' writing styles, habits, and quality. Thus, the versatility of this approach for other medical institutions or other types of ANs remains to be validated [45]. Second, most features used here are only at the level of Chinese characters; features at the semantic level are beyond the scope of this study. Thus, common problems in clinical ANs, such as coreference resolution and subject main-body examination, cannot be well solved at present. Third, the drug name dictionaries built here are incomplete and small and do not contribute much to system performance. In the future, we will try other machine learning techniques feasible for MER, such as other supervised ML algorithms (e.g., the popular SSVM), and gradually expand the experimental scope and the types of ANs. To improve system effectiveness and accuracy, we should also consider including semantic features.

8. Conclusions

Here, we targeted the analysis of texts written in Chinese, a typical logographic language; attempted MER in nonstructural texts concerning TCM-WM combined treatment; and proposed a new cascade-type approach, CCMER. This approach avoids the side effects of abundant negative samples and improves the recognition performance for drug named entities in logographic (Chinese) descriptions. We think that this approach may provide reference values for MLP of other logographic languages. We also conducted a series of detailed experiments. We found that character- and pinyin-based n-gram information and section information help to improve MER performance, whereas the contribution of small-scale professional dictionaries is small. The optimal MER performance was obtained on the test set of 324 manually annotated ANs with TCM-WM combined treatment, where the F-measures of TCM and WM drug named entity recognition are 93.5% and 91.7%, respectively, significantly higher than those of the baseline systems.

Abbreviations

MER: Medication entity recognition
AN: Admission note
TCM: Traditional Chinese medicine
WM: Western medicine
CCMER: A cascade-type Chinese medication entity recognizer
SVM: Support vector machine
CRF: Conditional random field
I2B2: Informatics for Integrating Biology and the Bedside
NTCIR: NII Testbeds and Community for Information access Research
NER: Named entity recognition
EHR: Electronic health record
CEHRTG: Chinese Electronic Medical Record Template Guide
MLP: Medical language processing
ML: Machine learning
IR: Information retrieval
SAHZU: Second Affiliated Hospital, Zhejiang University School of Medicine
NLP: Natural language processing
IAA: Interannotator agreement
CWS: Context window size
IE: Information extraction
WF: Word feature
FP: False positive
SF1: In symbol list
SF2: In Chinese drug name dictionaries
SF3: In pinyin dictionaries
SF4: Positive set of statistical word features
SF5: Negative set of statistical word features
SF6: In event patterns
F1: The local context feature
F2: Feature of drug name dictionary
F3: Feature of drug named entity-related terms
F4: Feature of mode
F5: Feature of global AN structure
F6: Feature of category annotation.

Disclosure

The authors alone are responsible for the content and writing of the paper.

Conflicts of Interest

The authors report no conflicts of interest.

Authors’ Contributions

Jianbo Lei developed the conceptual framework and research protocol for the study. Jun Liang drafted the manuscript and Xiaojun He and Jianbo Lei made major revisions. Xuemei Xian and Meifang Xu annotated the corpus, Sheng Dai evaluated the annotations, and Jun Liang, Jianbo Lei, and Xiaojun He conducted the experiment and data analysis. Jun’yi Xin, Jie Xu, and Jian Yu provided comments and made revisions to the manuscript. All authors read and approved the final manuscript.

Acknowledgments

The authors are grateful to Professor Wei Zhu, Dr. Ting Chen, and Ms. TingXue Sun for their helpful discussion and suggestions. This work was supported by the Medical and Health Planning Project of Zhejiang Province (no. 2017KY386) and Medical and Health Planning Project of China (no. 2015109528) from the National Health and Family Planning Commission, the Key Project of the National Research Program of China Grant (no. 2015BAH07F01), and the National Natural Science Foundation of China (NSFC) (no. 81171426 and no. 81471756).