Early Prediction of Organ Failures in Patients with Acute Pancreatitis Using Text Mining
Establishing an assessment model for organ failures in the early stage of admission is of great significance in acute pancreatitis (AP), yet clinical notes remain underutilized. To predict organ failures in AP patients using early in-hospital clinical notes, early text features obtained from the pretrained Chinese Bidirectional Encoder Representations from Transformers (BERT) model and an attention-based LSTM were combined with early structured features (laboratory tests, vital signs, and demographic characteristics) to predict organ failures (respiratory, cardiovascular, and renal) in 12,748 AP inpatients at West China Hospital, Sichuan University, from 2008 to 2018. The fusion model of text plus structured features was compared with a baseline model that used only structured features. The model with text features added outperformed the model that included only structured features.
Organ failure is a serious complication in patients with acute pancreatitis (AP). Acute renal failure is one of the most common causes of death in patients with severe AP. In a study of 563 patients, acute renal failure in the setting of AP was associated with a 10-fold increase in mortality. Approximately one-third of patients with severe pancreatitis develop acute lung injury and acute respiratory distress syndrome, which account for 60% of all deaths within the first week. Patients with AP and congestive heart failure (CHF) have significantly higher mortality than those without CHF. Therefore, it is of great significance to establish an assessment model for organ failures in the early stage of admission in AP.
According to the 2012 revised classification of AP, three organ systems should be assessed to define organ failure: respiratory, cardiovascular, and renal. In this study, organ failures included respiratory failure, circulatory failure, and renal failure. Previous studies [6, 7] used a single indicator or a linear model of several indicators to predict the risk of organ failure within 12 hours of admission. Their sample sizes were small, and their generalization ability was relatively weak. Additionally, their methods did not make full use of the patient’s clinical manifestations and medical history.
The electronic medical record (EMR) of the hospital information system contains structured and unstructured data but does not include bio-omics data [8–10]. Integrating the clinical narrative would be highly useful because rich information is buried in unstructured text, so mining and utilizing unstructured data is of great value. In clinical notes (an example from West China Hospital, Sichuan University, is shown in Figure S1), the patient’s history of present illness (HPI) recorded by the clinician may contain information that cannot be extracted from structured data, such as the onset of AP, the recurrence time of AP symptoms, the description of early symptoms before admission, and the clinician’s speculations about certain complications or conditions.
Therefore, we will make full use of clinical notes rather than manually extracting only parts of them as features. We will use the AP patient’s HPI, early laboratory tests, early vital signs, and demographic data to assess the risk of organ failures between 72 hours after admission and discharge. The pretrained Chinese Bidirectional Encoder Representations from Transformers (BERT) model will be used to transform each sentence into a feature matrix, and the characteristics of the sentence will be captured by an attention-based long short-term memory (LSTM) network so that a unified representation of the text information can be obtained for all patients.
2. Materials and Methods
2.1. Comparison to Recent Works on Medical Records
Nguyen et al. developed Deepr, a deep neural network based on convolutional neural networks, for representing EMRs and predicting unplanned readmission. Deepr sequences the EMR into a “sentence,” or equivalently, a sequence of “words.” Each word represents a discrete object or event such as a diagnosis or procedure, or a derived object such as a time interval or hospital transfer. Their “word” is different from ours: our words are extracted from the text sentences of clinical notes rather than formed by combining discrete objects.
Soysal et al. developed CLAMP, a toolkit for building customized clinical natural language processing pipelines. Although the default CLAMP pipeline achieved good performance on named entity recognition and concept encoding, it has limitations in handling other languages such as Chinese.
Krishnan and Kamath modeled unstructured physician notes with hybrid word embeddings to generate quality features, which were used to train a deep neural network model that predicts disease groups based on ICD codes. Because too many types of disease are involved, it is not possible to obtain important characteristics specific to a certain disease.
Yuwono et al. proposed a novel neural network, the convolutional residual recurrent neural network, that learns to diagnose acute appendicitis from doctors’ free-text emergency department notes and optional real-valued features (from the structured fields) without any feature engineering. The effectiveness of their method in the diagnosis or prediction of other diseases or adverse outcomes needs to be further explored.
Akbilgic et al. applied a text mining approach (hybrid model) to preoperative notes to obtain a text-based risk score for predicting death within 30 days of surgery in children. Including the text-based risk score in addition to structured data improved the C-statistic from 0.76 to 0.92.
Therefore, this study fully considers the disease characteristics of AP, collects and analyzes as many relevant features as possible, and uses the attention mechanism to output important words, which may inform the treatment of AP. We compare two strategies: a baseline model with only structured features and a fusion model with text plus structured features.
2.2. Our Methods
To implement this task (Figure 1), we will do the following:
(1) Extract the HPI text, laboratory tests, vital signs, and demographic data within the first 72 h of admission
(2) Clean the HPI, including but not limited to removing punctuation marks and meaningless symbols, and then perform word segmentation
(3) Use the pretrained BERT model for Chinese to convert words into word vectors, which are then converted into uniform representation feature vectors by the attention-based LSTM
(4) Combine structured features (laboratory tests, vital signs, and demographic data) and unstructured features (HPI) and feed them into a softmax function to predict organ failures in AP
(5) Estimate the parameters by minimizing the cross entropy
The input of the model is $(X_i, T_i, y_i)$, where $i$ is the index of the sample, $X$ is the structured variables such as vital signs and demographics, $T$ is the text description, and $y$ is the occurrence of a specific target event, such as organ failure. $T$ can be expressed as a string of Chinese words. Each Chinese word is first converted into a one-hot code according to the dictionary and then into a word vector by using the word embedding matrix of the pretrained BERT model, as shown in formula (1). Then, the word vector of each word is input into the attention-based LSTM network in sentence order. The last hidden layer is taken as the input of the next layer and is recorded as $h^{\text{text}}$. The structured information $X$ is mapped to a hidden-layer vector $h^{\text{struct}}$ through a neural network, and then $h^{\text{text}}$ and $h^{\text{struct}}$ are concatenated to get the vector $h$. Finally, $h$ is used as the input of the softmax function, and $\hat{y}$ is used as the output for the specific target event. The attention-based LSTM builds on the existing standard LSTM network: when the last hidden layer is output, an attention weight, recorded as $\alpha_t$, is added, where $h_t$ represents the hidden-layer vector mapped from the word vector of the $t$-th word.
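The exact formula is not reproduced in this text. As a hedged sketch, one widely used form of attention pooling over LSTM hidden states, followed by fusion with the structured features, is written out below. Here $h_t$ is the hidden vector of the $t$-th word, $\alpha_t$ is its attention weight, $h^{\text{text}}$ and $h^{\text{struct}}$ are the text and structured representations, and $\hat{y}$ is the predicted output. The scoring vector $w$, the sentence length $n$, the mapping $g$, and the output parameters $W$ and $b$ are assumptions introduced for illustration and may differ from the definition used in the study.

```latex
\begin{aligned}
\alpha_t &= \frac{\exp\left(w^{\top} h_t\right)}{\sum_{j=1}^{n} \exp\left(w^{\top} h_j\right)}, \\
h^{\text{text}} &= \sum_{t=1}^{n} \alpha_t \, h_t, \\
h^{\text{struct}} &= g(X), \qquad h = \left[\, h^{\text{text}} ;\, h^{\text{struct}} \,\right], \\
\hat{y} &= \operatorname{softmax}\left(W h + b\right).
\end{aligned}
```

Under this form the weights $\alpha_t$ are nonnegative and sum to one, so words with larger weights contribute more to the text representation, which is what allows important words to be read off from the attention mechanism.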
Unlike other ways of using the pretrained BERT model, we did not treat the prediction of organ failure as a subtask of the BERT model. Instead, we used the embedding matrix of the BERT model to convert the sparse one-hot word vector into a dense real vector of dimension 200. Therefore, in this study, a sentence is a matrix, which can also be regarded as a sequence of vectors. We then input this vector sequence into the attention-based long short-term memory model step by step and took the final hidden layer, with a length of 10 neurons, as the feature representation of the text information.
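The conversion from one-hot codes to dense vectors can be illustrated with a minimal sketch. It shows that multiplying a one-hot vector by an embedding matrix simply selects one row, so a sentence becomes a matrix with one row per word. The three-word dictionary, the two-dimensional embedding values, and the function names are hypothetical and chosen only to keep the example small; the study uses the embedding matrix of the pretrained Chinese BERT model.

```python
def one_hot(index, vocab_size):
    """Sparse code: 1.0 at the word's dictionary index, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def embed(one_hot_vec, embedding_matrix):
    """Multiply a one-hot row vector by a (vocab_size x dim) embedding matrix."""
    dim = len(embedding_matrix[0])
    return [
        sum(one_hot_vec[v] * embedding_matrix[v][d] for v in range(len(one_hot_vec)))
        for d in range(dim)
    ]


# Hypothetical toy dictionary: abdominal pain, vomiting, fever.
vocab = {"腹痛": 0, "呕吐": 1, "发热": 2}
# Hypothetical 2-dimensional embeddings (assumed values, not from BERT).
embedding_matrix = [[0.1, 0.3], [0.7, -0.2], [-0.4, 0.5]]

# A segmented sentence becomes a matrix: one dense row per word.
sentence = ["腹痛", "发热"]
sentence_matrix = [
    embed(one_hot(vocab[word], len(vocab)), embedding_matrix) for word in sentence
]
```

In practice the multiplication is replaced by a direct row lookup, which gives the same result without materializing the sparse vector.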
Patients diagnosed with AP according to ICD codes in the hospital information system of West China Hospital, Sichuan University, from 2008 to 2018 were included. Initially, 15,813 AP patients were included. We extracted the patients’ demographic data, laboratory tests, and vital signs within 72 hours of admission, together with their HPI. Respiratory failure was defined as a partial pressure of oxygen in blood gas analysis of less than 60 mmHg or the use of a ventilator. Circulatory failure was defined as a diastolic blood pressure of less than 60 mmHg or a systolic blood pressure of less than 90 mmHg and the use of vasoactive drugs. Kidney failure was defined as a creatinine level greater than 177 μmol/L. The research protocol was approved by the ethics review board of West China Hospital, Sichuan University, and the need for informed consent was waived owing to the retrospective nature of the study.
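The three outcome definitions can be written as simple labeling rules. The sketch below is illustrative only: the function and parameter names are hypothetical, and the circulatory rule assumes the reading "(diastolic below 60 mmHg or systolic below 90 mmHg) together with vasoactive drugs", which is one possible interpretation of the definition.

```python
def respiratory_failure(pao2_mmhg, on_ventilator):
    """PaO2 below 60 mmHg, or ventilator use. PaO2 may be None if not measured."""
    low_oxygen = pao2_mmhg is not None and pao2_mmhg < 60
    return low_oxygen or on_ventilator


def circulatory_failure(dbp_mmhg, sbp_mmhg, vasoactive_drugs):
    """Assumed reading: low blood pressure AND use of vasoactive drugs."""
    low_pressure = dbp_mmhg < 60 or sbp_mmhg < 90
    return low_pressure and vasoactive_drugs


def renal_failure(creatinine_umol_l):
    """Creatinine strictly greater than 177 umol/L."""
    return creatinine_umol_l > 177
```

If the intended definition treats low diastolic pressure alone as sufficient, the parentheses in `circulatory_failure` would need to change accordingly.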
2.4. Data Preprocessing
Features with a missing rate of more than 20% in the structured data and patients without a complete HPI description were removed, leaving a total of 12,748 patients for prediction. We used the mice package in R to perform multiple linear interpolation on the structured data. There was no statistical difference in the distribution of the data before and after interpolation (see Figure S2).
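The two filtering rules can be sketched as follows. This is a minimal illustration, not the study's code: the column names, the use of `None` for missing values, and the "non-empty text" test for a complete HPI are assumptions, and the imputation step performed with the mice package in R is not reproduced here.

```python
def missing_rate(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(value is None for value in column) / len(column)


def drop_sparse_features(table, max_missing=0.20):
    """Keep only columns whose missing rate does not exceed max_missing."""
    return {
        name: column
        for name, column in table.items()
        if missing_rate(column) <= max_missing
    }


def keep_patients_with_hpi(patients):
    """Keep patients whose HPI text is present and non-empty (assumed criterion)."""
    return [p for p in patients if p.get("hpi") and p["hpi"].strip()]


# Hypothetical structured data: one list per feature, None marks a missing value.
table = {
    "creatinine": [88.0, None, 95.0, 120.0, 76.0],  # 20% missing, kept
    "lipase": [None, None, 300.0, None, 410.0],     # 60% missing, dropped
}
kept = drop_sparse_features(table)

patients = [{"id": 1, "hpi": "abdominal pain for 2 days"}, {"id": 2, "hpi": "  "}]
eligible = keep_patients_with_hpi(patients)
```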
2.5. Word Embedding
A word can be represented by a sparse one-hot vector or a dense real vector. Sparse word vectors not only require large storage space and computing resources but also fail to reflect the correlation between words. Because dense vectors solve both problems, dense real vectors are more commonly used. In the classification and prediction tasks of natural language processing, that is, sequence-to-sequence problems, the quality of the word vector often has a great impact on the predictive ability of the model. A large number of word vectors have already been trained based on continuous bag-of-words (CBOW) and skip-gram models. This study adopted an analogical reasoning task on Chinese. To ensure the accuracy of Chinese medical word segmentation, the Tsinghua Open Chinese Lexicon was applied.
2.6. Training and Testing
The data were then divided into a training dataset and a testing dataset in proportions of 70% and 30%. Because of class imbalance and the importance of identifying positive cases in medicine, we used $w_k$ as the weight of the $k$-th class in the cross-entropy loss function to compensate for the problems caused by class imbalance. We used a pretrained word vector network with a word vector length of 60. The batch size was set to 500 and the learning rate to 0.005. The number of hidden layer neurons of the LSTM was set to 200. Models were optimized using a gradient descent approach. Training performance was averaged over 1000 epochs. Our model took mixed features, including structured features and text features, as input and combined them with the attention mechanism; it was compared with a model that used only structured features as input. The experiments were implemented in the PyTorch framework on a Dell T640 GPU server.
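A class-weighted cross-entropy loss can be sketched in a few lines. The weighting scheme used in the study is not specified in this text, so the sketch assumes inverse-frequency weights, a common choice; the function names are hypothetical, and a PyTorch implementation would normally pass such weights to the framework's built-in loss instead.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Assumed scheme: weight_k = N / (K * count_k), so rare classes weigh more."""
    n = len(labels)
    counts = [labels.count(k) for k in range(num_classes)]
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights."""
    loss_sum = 0.0
    weight_sum = 0.0
    for logits, y in zip(batch_logits, labels):
        p = softmax(logits)
        loss_sum += -weights[y] * math.log(p[y])
        weight_sum += weights[y]
    return loss_sum / weight_sum


# Hypothetical imbalanced batch: three negatives (0) and one positive (1).
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels, num_classes=2)
loss = weighted_cross_entropy([[0.0, 0.0]] * 4, labels, weights)
```

With these weights a misclassified positive case contributes three times as much to the loss as a misclassified negative case in this toy batch, which is the intended effect when positive cases are rare.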
3. Results and Discussion
This study included 12,748 AP patients with an average age of 47.58 years, 60.9% of whom were male. Respiratory failure, circulatory failure, and renal failure occurred in 14.2%, 21.6%, and 4.5% of patients, respectively. Early vital signs and early laboratory tests are shown in Table 1.
Table 2 contains the results of four models. Model 1 uses structured features as input to predict each of the three organ failures separately. Model 2 uses text features as input to predict each of the three organ failures separately. Model 3 uses mixed features, including structured features and text features, as input combined with the attention mechanism to predict each of the three organ failures separately. Model 4 uses mixed features as input combined with the attention mechanism to predict the three organ failures together as a multitask problem. On the training dataset, model 2 performed best in predicting organ failures. On the testing dataset, the accuracies of model 3 and model 4 were higher than those of model 1 and model 2 in predicting respiratory failure, circulatory failure, and renal failure, which shows that adding text features can effectively improve the accuracy of predicting organ failures in AP.
In a classification task, a good classifier should perform well on every class, so recall and specificity should be considered together when evaluating a classifier. Model 3 had the highest specificity for predicting respiratory failure and circulatory failure, which shows that model 3 can effectively learn some characteristics of AP patients without organ failures. Although model 1 had the highest accuracy, precision, and specificity for renal failure, its recall was lower than that of model 3 and model 4. The highest recall for respiratory failure came from model 3. Based on a comprehensive evaluation of the performance metrics, the model with text features added is superior to the models that include only structured features or only text features. The 30 most important Chinese word segments were obtained through the attention mechanism of model 4 and are shown in Table S1.
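For reference, the four measures discussed above can be computed from the confusion counts of a binary classifier. The sketch below uses hypothetical labels and predictions, not data from the study, and returns 0.0 when a denominator is zero, which is a convention chosen here for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and specificity for binary labels (1 = failure)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),    # predicted failures that are real
        "recall": ratio(tp, tp + fn),       # real failures that are detected
        "specificity": ratio(tn, tn + fp),  # non-failures correctly ruled out
    }


# Hypothetical example: 3 patients with organ failure, 3 without.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

When the positive class is rare, as with renal failure here, accuracy and specificity can be high even if many true cases are missed, which is why recall is examined alongside them.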
Mentula et al. used some laboratory tests and the APACHE II score of 351 AP patients within 12 h of admission to predict organ failure through logistic regression. Their results showed a recall of 0.82 for predicting respiratory failure and renal failure, without evaluation on a test set, which is lower than the performance on the training dataset for respiratory failure and renal failure with text features added in this study (0.850 and 0.894, respectively). Khanna et al. used various scores and a few laboratory tests of 72 AP patients to predict organ failure, with a maximum recall of 1 using procalcitonin and a minimum recall of 0.652 using the CT severity index. It is difficult to believe that a single score or indicator can achieve such good predictive performance, and their findings need to be further verified.
Koutroumpakis et al. used three laboratory tests and the APACHE II score of 1,612 AP patients to predict persistent organ failure. The best recall was 0.684 using the admission APACHE II score, and the lowest was 0.249 using admission creatinine. Although these results seem reasonable, they were not evaluated on a test set and are lower than the results on the training dataset after adding text features in this study. Hyland et al. developed a machine learning method to predict circulatory failure in ICU patients. Although an AUC of 0.94 was obtained, the data used were routinely collected structured data. In addition, ICU data are monitored in real time, so the onset and duration of organ failure can be observed and a time series model can be used. In routine inpatients, however, laboratory tests and vital signs are monitored irregularly. This is why this study did not use earlier data and time series models to predict persistent organ failure.
In addition to predicting each organ failure separately, we also used multitask prediction to output the results for the three organ failures simultaneously after adding text features. In the multitask prediction model, the loss function of each task can be regarded as a constraint on the other tasks. The prediction results for the three organ failures can be obtained in a shorter time, which may meet different clinical needs. We want the model not to be limited to learning a single target task but to adapt to multiple task scenarios, which can greatly increase its generalization ability.
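One common way to realize this coupling, given here only as an assumed sketch because the combined objective is not stated in this text, is to sum the per-task cross-entropy losses over a shared representation. The task weights $\lambda_k$ and the shared parameters $\theta$ are illustrative symbols introduced for this example.

```latex
\mathcal{L}_{\text{multi}}(\theta)
  = \sum_{k \in \{\text{resp},\, \text{circ},\, \text{renal}\}}
    \lambda_k \, \mathcal{L}_k(\theta),
\qquad \lambda_k \ge 0 .
```

Because all three terms depend on the same shared parameters, a gradient step that lowers one task's loss is constrained by the other two, which is the sense in which each task regularizes the others.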
Because this is a single-center study, all patients came from West China Hospital, Sichuan University, a large general hospital in China with 4300 beds. Doctors in West China Hospital, Sichuan University, may describe the patient’s current medical history in more detail than doctors in other hospitals, which would increase the predictive contribution of the text information. For the same reason, the text of the current medical history recorded in other hospitals may differ from the text used in this study, so the proposed model should be applied elsewhere with caution.
We performed single-task and multitask prediction of organ failures in AP using a joint representation of structured features and text features. To the best of our knowledge, this is the first study to use clinical notes to predict organ failures in AP. Our methods achieve superior accuracy compared to traditional techniques and uncover the underlying structure of the disease and intervention space.
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Jiawei Luo and Lan Lan contributed equally.
This work was supported by the Postdoctoral Research Project, West China Hospital, Sichuan University (no. 2019HXBH039), the 1·3·5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (no. ZYJC18010), and the Center of Excellence-International Collaboration Initiative Grant (no. 139170052).
The supplementary materials contain Figures S1 and S2 and Table S1.
P. A. Banks, T. L. Bollen, C. Dervenis et al., “Classification of acute pancreatitis--2012: revision of the Atlanta classification and definitions by international consensus,” Gut, vol. 62, no. 1, pp. 102–111, 2013.
P. Mentula, M. L. Kylänpää, E. Kemppainen et al., “Early prediction of organ failure by combined markers in patients with acute pancreatitis,” British Journal of Surgery, vol. 92, no. 1, pp. 68–75, 2005.
S. Yuwono, H. Ng, and K. Ngiam, “Learning from the experience of doctors: automated diagnosis of appendicitis based on clinical notes,” in Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 11–19, Florence, Italy, August 2019.
A. K. Khanna, S. Meher, S. Prakash et al., “Comparison of Ranson, Glasgow, MOSS, SIRS, BISAP, APACHE-II, CTSI scores, IL-6, CRP, and procalcitonin in predicting severity, organ failure, pancreatic necrosis, and mortality in acute pancreatitis,” HPB Surgery, vol. 2013, 10 pages, 2013.
E. Koutroumpakis, B. U. Wu, O. J. Bakker et al., “Admission hematocrit and rise in blood urea nitrogen at 24 h outperform other laboratory markers in predicting persistent organ failure and pancreatic necrosis in acute pancreatitis: a post hoc analysis of three large prospective databases,” American Journal of Gastroenterology, vol. 110, no. 12, pp. 1707–1716, 2015.
S. Han, Y. Zhang, Y. Ma et al., “THUOCL: Tsinghua open Chinese lexicon,” Journal of Chinese Linguistics, vol. 2020, 5 pages, 2020.