Abstract

Background. The modernization of traditional Chinese medicine (TCM) demands systematic data mining of medical records. However, this process is hindered by the fact that many TCM symptoms have the same meaning but different literal expressions (i.e., TCM synonymous symptoms). This problem can be solved by using natural language processing algorithms to construct a high-quality TCM symptom normalization model that normalizes TCM synonymous symptoms to unified literal expressions. Methods. Four types of TCM symptom normalization models, based on natural language processing, were constructed to identify a high-quality one: (1) a text sequence generation model based on a bidirectional long short-term memory (Bi-LSTM) neural network with an encoder-decoder structure; (2) a text classification model based on a Bi-LSTM neural network and a sigmoid function; (3) a text sequence generation model based on bidirectional encoder representation from transformers (BERT) with the sequence-to-sequence training method of the unified language model (BERT-UniLM); and (4) a text classification model based on BERT and a sigmoid function (BERT-Classification). The performance of the models was compared using four metrics: accuracy, recall, precision, and F1-score. Results. The BERT-Classification model outperformed the models based on Bi-LSTM and BERT-UniLM with respect to all four metrics. Conclusions. The BERT-Classification model has superior performance in normalizing expressions of TCM synonymous symptoms.

1. Introduction

Traditional Chinese medicine (TCM) symptoms are recorded by TCM practitioners, who sometimes use different words when recording the same symptom as a consequence of their diverse experience and educational backgrounds. These variations in wording lead to the phenomenon known as “one symptom with different literal expressions,” which is prevalent in TCM medical records. Wang et al. [1] reported that approximately 80% of TCM symptoms were recorded with multiple expressions. Although the literal expressions of these symptoms differ, they have the same meaning, and their use affects neither understanding nor pathogenesis diagnosis. TCM symptoms that have the same meaning but different literal descriptions are known as TCM synonymous symptoms. For example, the symptom “lack of appetite” (纳减) can also be expressed as “loss of appetite” (纳差) or “decreased appetite” (食欲减低). All three mean a reduced desire to eat and are used in descriptions of spleen Qi deficiency (脾气虚).

It is essential to explore and analyze TCM medical records for the purpose of TCM modernization [2, 3]. However, the abundance of synonymous symptoms in TCM medical records hinders systematic scientific knowledge discovery. Referring to the TCM terminology [4] published by relevant authorities, it is possible to establish a TCM thesaurus and then normalize each symptom in TCM medical records to a symptom that has the same meaning in the thesaurus, so that TCM synonymous symptoms would have uniform literal expressions. That is, TCM symptom normalization is a feasible method for handling TCM synonymous symptoms. However, manual TCM symptom normalization is time-consuming and labor-intensive because of the large and growing quantity of TCM electronic medical records.

Natural language processing (NLP), which has experienced extraordinary development in recent years, provides valuable support for the automatic processing of text data, such as language translation [5], question answering [6], and information processing of medical texts [7–10]. This success suggests that NLP technology will be effective for normalizing the expressions of TCM synonymous symptoms.

In previous work, researchers have proposed NLP-based normalization models for biomedical fields from the perspective of similarity matching, such as Word2Vec [11], Jaccard similarity [12], DNorm [13], and BERT-based ranking [14]. In addition, from the perspective of named entity recognition (NER), there are transition-based models [15] and Bi-LSTM-CNNs-CRF [16]. Although the performance of these models is satisfactory according to the published reports, two problems are worthy of further exploration, concerning their applicability to normalizing TCM symptoms and the modeling concepts of the NLP models: (1) With regard to applicability, the above models are used for normalizing multiple synonymous terms to one term. However, they are not suitable for cases in which synonymous symptoms correspond to multiple normalized symptoms. For example, “less white sputum and difficult to expectorate” (痰少色白难咳) and “less white phlegm and not easy to expectorate” (少量白痰且不易咳出) are synonymous symptoms that should be normalized to “less phlegm” (痰少), “white phlegm” (痰白), and “expectoration difficulties” (痰难咳出). (2) With regard to the modeling concept, approaches from the perspectives of similarity matching and NER have been reported. However, many models constructed from the perspectives of sequence generation and text classification have also shown excellent performance and applicability in NLP tasks [17, 18]. Therefore, it is necessary to explore the applicability of sequence generation and text classification to this normalization task and investigate whether better performance can be achieved.

Accordingly, the objective of this study is to develop models for normalizing the expressions of TCM synonymous symptoms from the perspectives of sequence generation and text classification and to compare and analyze the applicability and performance of the models, so as to select the best one.

2. Methods

The workflow of this study is shown in Figure 1. It can be divided into three parts: (1) collecting TCM symptoms from medical records (referred to as sample collection), (2) preparing training, development, and test data sets (referred to as division of data sets), and (3) constructing models for normalizing expressions of TCM synonymous symptoms (referred to as model construction).

2.1. Data Sources and Labeling

In total, 3,252 medical records, recorded by 22 TCM doctors on the platform of the “Heritage Program of Chinese Well-Known Experts” [19], were collected. The symptoms in the medical records were regarded as the original symptoms, each of which was then labeled with the corresponding normalized symptom according to the TCM Thesaurus (from the Beijing University of Chinese Medicine TCM Information Science Research Center). Two researchers, both qualified TCM practicing physicians trained by the provider of the TCM Thesaurus, performed the labeling. Two additional experts in the TCM Thesaurus checked the labeling results independently, and inconsistent labels were submitted to a third expert for review and discussion to ensure consistency.

There are two forms of original symptoms in medical records: single symptoms and complex symptoms. A single symptom is an original symptom that corresponds to only one clinical manifestation; such a symptom was labeled as one normalized symptom by referring to the TCM Thesaurus. For example, “thinning and shapeless stool” was labeled as “loose stool.” A complex symptom is an original symptom that corresponds to multiple clinical manifestations; such a symptom was labeled as multiple normalized symptoms. For example, “dry and itchy throat” was labeled as “dry throat” and “itchy throat” by referring to the TCM Thesaurus.

In total, 16,808 nonrepetitive original symptoms were collected from the 3,252 medical records, corresponding to 1,501 normalized symptoms, of which 339 appeared only once. The collected original symptoms and labeled normalized symptoms served as the input and output data, respectively, of TCM symptom normalization models.

2.2. Partition of Data Sets

Two strategies were used to divide the collected data into training, development, and test data sets. The first strategy was to divide the medical records randomly by source doctor. The nonrepetitive original symptoms recorded by one randomly selected doctor and their corresponding normalized symptoms were used as a development set to tune the parameters of the models. The nonrepetitive original symptoms recorded by another randomly selected doctor and their corresponding normalized symptoms were used as a test set to observe the ability of the models to normalize the expressions of TCM symptoms. The nonrepetitive original symptoms recorded by the other 20 doctors and their corresponding normalized symptoms were used as the training set. These data sets were called the total data sets (TDS). This division is suitable for evaluating the performance of TCM symptom normalization models in practical applications.

The second strategy for dividing the collected data into training, development, and test data sets was based on high-frequency normalized symptoms. These data sets were called the high-frequency data sets (HFDS). According to Zipf's law [20], the threshold between high-frequency and low-frequency terms is N = (−1 + √(1 + 8I₁))/2, where I₁ is the number of normalized symptoms that appeared only once. With I₁ = 339, N ≈ 25.5; therefore, normalized symptoms occurring above this threshold (i.e., 26 or more times) were defined as high-frequency normalized symptoms. The high-frequency normalized symptoms and their corresponding original symptoms were included in the HFDS. The ten most frequent normalized symptoms and their corresponding numbers of original symptoms are shown in Figure 2. In the HFDS, 70% of the data (6,768 original symptoms and their corresponding normalized symptoms) were randomly selected as the training set, 15% (1,471 original symptoms and their corresponding normalized symptoms) were used as the development set, and 15% (1,425 original symptoms and their corresponding normalized symptoms) were used as the test set. The numbers of samples in the HFDS and TDS are shown in Table 1.
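As a quick arithmetic check of this threshold (a minimal sketch, assuming the reconstructed formula above):

```python
# Recomputing the Zipf's-law threshold from the counts reported above;
# I1 = 339 is the number of normalized symptoms that appeared only once.
I1 = 339
N = (-1 + (1 + 8 * I1) ** 0.5) / 2
print(round(N, 2))  # ~25.54, so symptoms occurring 26 or more times are high-frequency
```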

2.3. Model Construction

From the perspective of text sequence generation, the bidirectional long short-term memory (Bi-LSTM) recurrent neural network (RNN) with the encoder-decoder structure [21], combined with the Luong attention mechanism [22], was used to establish four models for TCM symptom normalization. (1) Encoder (Char)-Decoder (Char) model: the input of the original symptom and the output of the normalized symptom were in character form (multiple output normalized symptoms were separated by “,”). (2) Encoder (Word)-Decoder (Word) model: the input of the original symptom and the output of the normalized symptom were in word form. (3) Encoder (Char)-Decoder (Label) model: the input of the original symptom was in character form, and the output of the normalized symptom was in label form. (4) Encoder (Word)-Decoder (Label) model: the input of the original symptom was in word form, and the output of the normalized symptom was in label form. The structure of the four models was consistent; only the input and output forms were different, as shown in Figure 3(a).
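For illustration, the following is a minimal tf.keras sketch of the Encoder (Char)-Decoder (Char) variant. The vocabulary size and the choice of 256 memory cells are illustrative assumptions, not the authors' exact configuration, and the sketch omits details such as initializing the decoder with the encoder's final states.

```python
# A minimal sketch of a Bi-LSTM encoder-decoder with Luong-style attention.
import tensorflow as tf

VOCAB_SIZE = 3000   # assumed character-vocabulary size
EMB_DIM = 300       # embedding dimension (Section 2.4)
UNITS = 256         # LSTM memory cells (one of the searched values)

# Encoder: embeds the original symptom and runs a Bi-LSTM over it.
enc_in = tf.keras.Input(shape=(None,), name="original_symptom_chars")
enc_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(enc_in)
enc_out = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(UNITS, return_sequences=True))(enc_emb)

# Decoder: generates the normalized symptom character by character
# (teacher forcing at training time).
dec_in = tf.keras.Input(shape=(None,), name="normalized_symptom_chars")
dec_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(dec_in)
dec_out = tf.keras.layers.LSTM(2 * UNITS, return_sequences=True)(dec_emb)

# Luong-style (dot-product) attention over the encoder states.
context = tf.keras.layers.Attention()([dec_out, enc_out])
merged = tf.keras.layers.Concatenate()([dec_out, context])
logits = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(merged)

model = tf.keras.Model([enc_in, dec_in], logits)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")
```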

This study also applied a Bi-LSTM and a fully connected layer with a sigmoid function to explore the feasibility of TCM symptom normalization from the perspective of text classification. In this case, the model output was in label form, and the input was in character or word form (see Figure 3(b)). In the Encoder (Char)-Classification model, the input was in character form; in the Encoder (Word)-Classification model, the input was in word form. The words input to the model were obtained from the original symptoms by a word segmentation tool [23].
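A minimal sketch of the Encoder (Char)-Classification variant is shown below; the label count follows Section 2.1, while the vocabulary size and the hyperparameters shown are illustrative assumptions.

```python
# A minimal sketch of a Bi-LSTM classifier: the sigmoid output layer predicts,
# for each normalized symptom in the thesaurus, whether it applies.
import tensorflow as tf

VOCAB_SIZE = 3000    # assumed character-vocabulary size
NUM_LABELS = 1501    # normalized symptoms (Section 2.1)
EMB_DIM, UNITS = 300, 256

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(UNITS)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid"),
])
# Binary cross-entropy per label lets one original symptom map to
# several normalized symptoms at once.
model.compile(optimizer=tf.keras.optimizers.Adam(0.01),
              loss="binary_crossentropy")
```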

Chinese-language pretraining weights, trained on large Chinese corpora, can help a model achieve better results on downstream tasks. Therefore, this study further used the unified language model (UniLM) based on the Chinese pretraining weights of bidirectional encoder representation from transformers (BERT) [18, 24] to construct a TCM symptom normalization model. The training process consisted of first loading the Chinese pretraining weights of BERT (https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) and then training with the sequence-to-sequence method of UniLM [18]. This training method is based on text sequence generation. Two output forms were used in training: a character-based output form, namely the BERT-UniLM (Char) model, and a label-based output form, namely the BERT-UniLM (Label) model, as shown in Figure 4(a). BERT and a fully connected layer with a sigmoid function were also used to construct a TCM symptom normalization model, namely the BERT-Classification model, as shown in Figure 4(b). Because the input of the pretrained BERT weights was in character form, the input of the BERT-based models was also in character form.
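A minimal sketch of the BERT-Classification head follows. It uses the Hugging Face transformers library and its "bert-base-chinese" weights as a stand-in for loading Google's Chinese checkpoint directly in TensorFlow, which is what the original work did; all names here are illustrative assumptions.

```python
# A minimal sketch: pretrained Chinese BERT plus a sigmoid multi-label head.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

NUM_LABELS = 1501  # normalized-symptom vocabulary (Section 2.1)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = TFBertModel.from_pretrained("bert-base-chinese")

input_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)
attn_mask = tf.keras.Input(shape=(None,), dtype=tf.int32)
# The pooled [CLS] representation summarizes the original symptom.
pooled = bert(input_ids, attention_mask=attn_mask).pooler_output
probs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid")(pooled)
model = tf.keras.Model([input_ids, attn_mask], probs)
model.compile(optimizer=tf.keras.optimizers.Adam(3e-4),
              loss="binary_crossentropy")

# Usage: one probability per normalized symptom for an original symptom.
enc = tokenizer("咽干咽痒", return_tensors="tf")
print(model([enc["input_ids"], enc["attention_mask"]]).shape)  # (1, 1501)
```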

2.4. Model Parameters

The encoder-decoder models had initialization weights sampled from a uniform distribution in the range of −0.05 to 0.05, the embedding dimension was 300, and the training batch size was 256. The optimizer was Adam [25]. According to the F1-score of the encoder-decoder models on the development set, the best parameter combination was selected for the learning rate (from 0.0001, 0.0003, and 0.0005), dropout rate (from 0.3 and 0.5), and number of memory cells (from 128, 256, and 512).
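The following sketch illustrates this development-set grid search; train_and_score is a hypothetical placeholder for training one model configuration and returning its development F1-score.

```python
# Grid search over the listed hyperparameters, keeping the combination
# with the best development-set F1-score.
import itertools
import random

def train_and_score(lr, dropout, cells):
    # Placeholder: a real implementation would train and evaluate here.
    random.seed(hash((lr, dropout, cells)))
    return random.random()

grid = itertools.product([0.0001, 0.0003, 0.0005],  # learning rate
                         [0.3, 0.5],                # dropout rate
                         [128, 256, 512])           # memory cells
best_lr, best_dr, best_mc = max(grid, key=lambda cfg: train_and_score(*cfg))
print(best_lr, best_dr, best_mc)
```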

For the encoder-classification models, the training batch size was 256. According to the F1-score of the models on the development set, the best parameter combination was selected for the learning rate (from 0.005, 0.01, and 0.03), dropout rate (from 0.3 and 0.5), and number of memory cells (from 128, 256, and 512).

For the BERT-UniLM and BERT-Classification models, the training batch size was 16, the optimizer was Adam [25], and the learning rate was 0.0003. The other parameters were the default settings of the BERT neural network [24].

The TensorFlow neural network framework (http://www.tensorflow.org/), developed by Google, was used to implement the above models, and an NVIDIA GeForce RTX 2080 (11 GB memory) was used to train them. Training was terminated when the F1-score of a model on the development set had not improved for 20 epochs. Even with a fixed random seed, results varied across machines; therefore, after setting the model parameters, the modeling process was repeated 10 times, and model performance was evaluated by four metrics expressed as mean ± standard deviation (SD). The four metrics were accuracy, precision, recall, and F1-score: Accuracy = P/T; Precision = TP/(TP + FP); Recall = TP/(TP + FN); and F1-score = (2 × Precision × Recall)/(Precision + Recall). Here, P (the correct normalized symptoms of model prediction) is the number of all correct results output by the model, and T is the total number of correct normalized symptoms corresponding to the test set. TP (true positive) is the number of results produced by the model that were consistent with the actual results, FN (false negative) is the number of correct results that the model failed to output, and FP (false positive) is the number of results produced by the model that were incorrect. The key model parameters and development set results are shown in Tables 2 and 3.
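The following illustrative helper computes the four metrics as defined above from per-sample sets of predicted and true normalized symptoms; the set-based representation is an assumption for illustration, not the authors' evaluation script.

```python
# Illustrative metric computation; P in the text corresponds to tp here,
# and T to the total number of correct normalized symptoms in the test set.
def evaluate(predicted, truth):
    """predicted, truth: lists of sets of normalized symptoms, one per sample."""
    tp = sum(len(p & t) for p, t in zip(predicted, truth))  # correct outputs
    fp = sum(len(p - t) for p, t in zip(predicted, truth))  # incorrect outputs
    fn = sum(len(t - p) for p, t in zip(predicted, truth))  # missed outputs
    total = sum(len(t) for t in truth)                      # T
    accuracy = tp / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# One complex symptom: two of its three normalized symptoms were predicted.
print(evaluate([{"痰少", "痰白"}], [{"痰少", "痰白", "痰难咳出"}]))
```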

2.5. Statistical Analysis

IBM SPSS 20.0 was used to analyze the results. When analyzing the metrics of each group, one-way ANOVA was used if the variance between groups was homogeneous and the data were normally distributed; otherwise, the Kruskal–Wallis test was used.
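A sketch of this decision procedure using SciPy in place of SPSS is shown below; the Shapiro-Wilk and Levene tests are illustrative choices, and the group data are placeholders.

```python
# Choose one-way ANOVA or Kruskal-Wallis based on normality and
# homogeneity-of-variance checks.
from scipy import stats

groups = [
    [0.90, 0.91, 0.89, 0.92],   # e.g., F1-scores of model A over repeats
    [0.87, 0.88, 0.86, 0.88],   # model B
    [0.84, 0.85, 0.83, 0.86],   # model C
]

normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)
homogeneous = stats.levene(*groups).pvalue > 0.05

if normal and homogeneous:
    result = stats.f_oneway(*groups)      # one-way ANOVA
else:
    result = stats.kruskal(*groups)       # Kruskal-Wallis H test
print(result)
```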

3. Results

3.1. Performance of Models on Test Data Sets

Generally, the models performed better on the HFDS test set than on the TDS test set. With regard to model structure, the BERT-UniLM models outperformed the Encoder-Decoder models, as shown in Tables 4 and 5. In addition, the BERT-Classification model outperformed the BERT-UniLM models. That is, the BERT-Classification model was the best model for normalizing expressions of TCM synonymous symptoms in this study, on both the HFDS and TDS test sets.

The performance of the three classification models (the two Encoder-Classification models and BERT-Classification) with different threshold values on the HFDS and TDS was explored. On the HFDS, the performance of both BERT-Classification and Encoder-Classification was generally best at a threshold of 0.2, as shown in Figure 5. On the TDS, the best threshold was 0.1, as shown in Figure 6. The BERT-Classification model achieved better results than the Encoder-Classification models: its accuracy and F1-score were 0.9051 and 0.9073 on the HFDS and 0.8568 and 0.8574 on the TDS, respectively.

The classification-based models can adjust the output threshold to change the recall. We believe this capability can be used for the retrieval of normalized symptoms, because retrieval prioritizes recall, that is, whether the outputs contain the correct normalized symptoms. By lowering the output threshold, the models can output the top 5 and top 10 normalized symptoms above the threshold. Therefore, retrieval ability was evaluated by top-5 and top-10 recall, and the results are shown in Table 6.
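The sketch below illustrates the two output modes from the sigmoid probabilities of a classification model; the probabilities, labels, and helper names are illustrative.

```python
# Thresholded normalization output versus top-k retrieval.
import numpy as np

def predict_labels(probs, labels, threshold=0.2):
    """Normalization: output every label whose probability clears the threshold."""
    return [labels[i] for i in np.flatnonzero(probs >= threshold)]

def retrieve_top_k(probs, labels, k=5):
    """Retrieval: lower the threshold implicitly by taking the k highest scores."""
    order = np.argsort(probs)[::-1][:k]
    return [labels[i] for i in order]

probs = np.array([0.93, 0.41, 0.08, 0.02])
labels = ["痰少", "痰白", "咽干", "便溏"]
print(predict_labels(probs, labels))        # ['痰少', '痰白']
print(retrieve_top_k(probs, labels, k=3))   # ['痰少', '痰白', '咽干']
```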

3.2. Performance of Models in Normalizing Single and Complex Symptoms

In evaluating the various models for normalizing single symptoms (the original symptoms corresponding to one normalized symptom) and complex symptoms (the original symptoms corresponding to multiple normalized symptoms), we found that the performance of the BERT-Classification model was comprehensively superior, not only on HFDS but also on TDS, as shown in Figures 7 and 8.

3.3. Comparison with Other Normalization Models

We also compared the BERT-Classification model with several other models that perform well in normalization, including state-of-the-art models reported by other researchers: the Jaccard similarity algorithm [12], Word2Vec with cosine similarity [11], DNorm [13], the transition-based model [15], Bi-LSTM-CNNs-CRF [16], and BERT-based ranking [14]. These models were not designed for the normalization of complex symptoms; therefore, we only compared the performance of the models on single symptoms (4,555 single symptoms) taken from the HFDS. The 4,555 single symptoms and their corresponding normalized symptoms were divided into a training set (70%), a development set (15%), and a test set (15%). The development set was used to select the parameters of each model, except the Jaccard method, which has no parameters to select. The test results showed that the BERT-Classification model performed better than the other methods, as shown in Table 7.

We note that Jaccard similarity, Word2Vec with cosine similarity, DNorm, and BERT-based ranking can output a score for each normalized symptom. Therefore, these models can output the top 5 and top 10 normalized symptoms by score ranking to achieve retrieval. We used recall to evaluate retrieval ability, as shown in Table 8. The results show that the BERT-Classification model also has advantages in retrieval.
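For concreteness, the following toy sketch shows how such a similarity-matching baseline ranks thesaurus entries for retrieval, using character-level Jaccard similarity; the miniature thesaurus is illustrative.

```python
# Rank thesaurus entries by Jaccard similarity of their character sets.
def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def rank(original, thesaurus, k=5):
    scored = sorted(thesaurus, key=lambda t: jaccard(original, t), reverse=True)
    return scored[:k]

print(rank("纳差", ["纳减", "便溏", "咽干", "咽痒", "痰白"], k=2))
```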

To further demonstrate the advantages of our model, we summarized the test results on HFDS. According to the results, we comprehensively compared the performance and applicability of our model with that of existing models, as shown in Table 9.

4. Discussion

The normalization of expressions of TCM synonymous symptoms plays an important role in the collation of medical records, statistical mining, construction of TCM knowledge databases, and construction of TCM medical assistant decision-making systems [9]. The application of NLP technology improves the efficiency of normalization processing. NLP algorithms based on neural networks have been applied in normalizing biomedical texts [13, 14] but not in normalizing the expressions of TCM synonymous symptoms. In this study, multiple models were constructed with NLP algorithms based on Bi-LSTM and the BERT neural network to explore the normalization of expressions of TCM synonymous symptoms.

In TCM synonymous symptom normalization, both normalization performance and the ability to handle one symptom that corresponds to multiple normalized symptoms are crucial to a normalization model. The test results show that our BERT-Classification model outperforms previous models and has this ability, which previous models lack. In addition, the model supports the retrieval of candidate normalized symptoms: when it does not provide a suitable normalized symptom, it can retrieve other candidates according to the original symptom.

These advantages of the model provide technical support for the efficient normalization of TCM synonymous symptoms and make the model highly adaptable in medical situations.

In this study, the accuracy, recall, precision, and F1-score metrics were used to evaluate the performance of each model. The results show that the BERT-Classification model outperformed the other models with respect to these metrics, including the Encoder-Decoder, Encoder-Classification, and BERT-UniLM models designed in this study. This is because the performance of NLP models based on neural networks is strongly related to the extracted semantic features, and BERT excels at extracting semantic features [24]. Therefore, the BERT-Classification model, which extracts semantic features using BERT, is advantageous for normalization tasks. BERT-Classification, BERT-UniLM, and BERT-based ranking are all based on the BERT neural network; they differ only in their output layers, owing to their different modeling concepts. The results suggest that BERT-Classification performs best; therefore, the classification-based modeling concept may be the most conducive to normalizing TCM symptoms.

With regard to applicability, our proposed BERT-Classification model supports both the processing of original symptoms that correspond to multiple normalized symptoms and the retrieval of normalized symptoms. We use sigmoid as the output function to handle the situation in which each original symptom corresponds to multiple normalized symptoms; this method is effective and outperforms sequence generation methods. Moreover, supporting the retrieval of normalized symptoms requires a higher recall. Our BERT-Classification model can increase recall by reducing the output threshold of the sigmoid function and thereby support retrieval.

In contrast to BERT-Classification, the other reported models cannot support both of the above applications simultaneously. Jaccard similarity, DNorm, Word2Vec with cosine similarity, and BERT-based ranking pair an original symptom with each normalized symptom and rank the normalized symptoms by their pairing scores. Although these models can output multiple normalized symptoms by ranking them for retrieval, when the multiple normalized symptoms corresponding to an original symptom need to be output precisely, it is difficult to decide whether any result other than the normalized symptom with the highest score should be output. The Bi-LSTM-CNNs-CRF model is designed to output only a single normalized symptom. In addition, because the model is based on the NER modeling concept, it cannot produce multiple candidate normalized symptoms as the above models can, and therefore cannot be applied to the retrieval task. Although the Encoder-Decoder and BERT-UniLM models support the output of multiple normalized symptoms, they suffer from the same limitation as Bi-LSTM-CNNs-CRF and are not suitable for the retrieval of normalized symptoms.

The HFDS contained only high-frequency samples for modeling and testing, reflecting the performance of the BERT-Classification model under ideal conditions. Conversely, the TDS included both high-frequency and low-frequency samples, reflecting the performance of the model in practical applications. Comparing the results of the model on the two data sets, the performance on TDS was lower than that on HFDS. This suggests that the performance of the model can be improved by increasing the number of low-frequency samples.

5. Conclusions

This study constructed models to normalize TCM synonymous symptoms from the NLP perspectives of text classification and sequence generation. The optimal model is the BERT-Classification model, which outperforms existing reported models in dealing with original symptoms that correspond to a single normalized symptom. Moreover, it also supports original symptoms that correspond to multiple normalized symptoms, and it can retrieve normalized symptoms. A limitation of this study is that the normalization models address only symptoms. Whether the models can be used for normalizing other synonymous terms, such as TCM treatment terms and TCM disease terms, remains to be studied. In addition, the pretrained BERT model, based on large-scale corpora, plays an important role in improving model performance; a BERT model trained on corpora from professional medical fields is likely to achieve better results for the normalization of medical terms. Therefore, the use of large TCM literature corpora to construct the pretrained model, to improve normalization performance, also needs further research.

Abbreviations

BERT: Bidirectional encoder representation from transformers
Bi-LSTM: Bidirectional long short-term memory
DR: Dropout rate
FN: False negative
FP: False positive
HFDS: High-frequency data sets
LR: Learning rate
MC: Memory cell
N/A: Not applicable
NLP: Natural language processing
RNN: Recurrent neural network
SD: Standard deviation
TCM: Traditional Chinese medicine
TDS: Total data sets
TP: True positive
UniLM: Unified language model.

Data Availability

All the data and materials used in the current study are available from the corresponding author on reasonable request.

Ethical Approval

Not applicable.

Disclosure

The funder has no role in study design, data collection, analysis, decision to publish, or manuscript preparation.

Conflicts of Interest

The authors declare that they have no competing interests.

Authors’ Contributions

YH Li and FQ Xu guided the whole work; L Zhou and SQ Liu developed all models; CY Li, YD Li, FQ Xu, and YM Sun collected medical data; SQ Liu and YZ Zhang performed data labeling; Y Sun and YH Li checked all labels; Y Sun and HM Yuan calculated all metrics. All the authors read and approved the final manuscript. L Zhou and SQ Liu contributed equally to this work.

Acknowledgments

The National Key R&D Program of China supported this study (2017YFC1700303).