Scientific Programming Towards a Smart World 2020
Research Article | Open Access
Wenping Guo, Zhuoming Xu, Xijian Ye, Shiqing Zhang, Xiaoming Zhao, Xue Li, "A Time-Critical Topic Model for Predicting the Survival Time of Sepsis Patients", Scientific Programming, vol. 2020, Article ID 8884539, 13 pages, 2020. https://doi.org/10.1155/2020/8884539
A Time-Critical Topic Model for Predicting the Survival Time of Sepsis Patients
Sepsis is a leading cause of mortality in intensive care units and costs hospitals billions of dollars annually worldwide. Predicting survival time for sepsis patients is a time-critical prediction problem. Considering the useful sequential information for sepsis development, this paper proposes a time-critical topic model (TiCTM) inspired by the latent Dirichlet allocation (LDA) model. The proposed TiCTM approach takes into account the time dependency structure between notes, measurement, and survival time of a sepsis patient. Experimental results on the public MIMIC-III database show that, overall, our method outperforms the conventional LDA and linear regression model in terms of recall, precision, accuracy, and F1-measure. It is also found that our method achieves the best performance by using 5 topics when predicting the probability for 30-day survival time.
Predicting the survival time of patients is an active research area for both clinicians and scientists [1–5]. It can significantly contribute to making decisions about clinical treatment, allocation of medical resources, and hospice care for patients. Sepsis is a disease of life-threatening organ dysfunction caused by a dysregulated host response to infection. Without timely treatment, sepsis can rapidly lead to tissue damage, organ failure, and death. Common signs and symptoms include fever, increased heart rate, increased breathing rate, and confusion. In the US, the cost of treating patients with sepsis accounted for more than $20 billion (5.2%) of total hospital costs in 2011. In 2001–2010, one in twenty deaths in England was associated with sepsis based on information recorded on death certificates. Although clinicians have made efforts to improve sepsis patient survival time, the mortality rate of sepsis is still very high [9, 10]. Thus, accurate prediction of survival time for sepsis patients could help clinicians conduct prevention, provide early warning and effective treatment, and reduce the mortality rate. Unfortunately, the pathogenesis of sepsis remains unclear. Predicting the survival time for specific diseases, such as sepsis, is still a challenging problem.
To help clinicians understand a patient’s overall body condition, ICUs have introduced many mechanisms for describing the patient’s condition and progress in the ICU. Various severity-of-illness scores have been used to predict sepsis or mortality risk in the ICU. The most widely used score-based methods include the acute physiology and chronic health evaluation (APACHE III), the simplified acute physiology score (SAPS II), the modified early warning score (MEWS), the sepsis-related organ failure assessment (SOFA), and quick SOFA (qSOFA). In addition, Zhang and Hong proposed a novel score for predicting hospital mortality for severe sepsis. These score-based methods utilize a set of easily obtainable measurements from various patients to generate risk scores. Although they allow clinicians to assess a patient rapidly, the obtained results are not satisfactory [16, 17]. Additionally, because the development of sepsis is a time-sensitive process, these score-based approaches, which only evaluate the patient’s condition at specific times, cannot predict the survival time of sepsis patients.
Sepsis is a life-threatening condition that arises when the body’s response to infection causes injury to its own tissues and organs [18, 19]. It is difficult to predict the development of sepsis based on a small number of measurements. Conventional topic models such as latent Dirichlet allocation (LDA) are unsupervised machine learning methods that can recognize latent topic information in massive document collections [20, 21]. Lehman et al. proposed a novel approach for ICU patient risk stratification using a topic model. Ghassemi et al. proposed a mortality model using a topic model to predict in-hospital mortality, 30-day postdischarge mortality, and 1-year postdischarge mortality. Vairamani proposed an approach for mortality prediction in ICU patients based on LDA. Zhang et al. proposed a novel survival topic model inspired by LDA for trauma patients. However, these topic models overlook the useful sequential information for disease development, thereby reducing the prediction accuracy for sepsis patient survival time. The development and treatment of sepsis is a time-critical process that is highly correlated with the order of words in measurements and notes.
To address this issue, this paper proposes a time-critical topic model (TiCTM) inspired by the LDA model to predict the survival time of sepsis patients. The proposed TiCTM approach takes into account the time dependency structure between notes, measurement, and survival time of a sepsis patient. We consider the time-critical dynamic process of sepsis patients as an approximately linear variation under clinician treatment. Therefore, the linear change in the parameter of TiCTM can reflect the time-critical dynamic process, whereas the parameter of LDA is fixed. Our experimental results on the public MIMIC-III database show that, overall, TiCTM outperforms the conventional LDA and linear regression model in terms of recall, precision, accuracy, and F1-measure. In particular, TiCTM obtains the best performance when predicting the probability for 30-day survival time using 5 topics.
The remainder of this paper is organized as follows. In Section 2, we describe the proposed TiCTM for predicting survival time for adult sepsis patients in the ICU. The experiments and evaluations are discussed in Section 3. Finally, we conclude our research in Section 4.
In this section, we first present a brief review of the classical LDA model and then describe our proposed TiCTM approach in detail.
2.1. Brief Review of LDA
LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. The LDA model is represented as a probabilistic graphical model, as shown in Figure 1. The meaning of each notation for LDA is shown in Table 1.
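The generative story reviewed above can be made concrete with a small simulation. The following is a minimal sketch of the standard LDA generative process, not code from this paper; the function names, the symmetric hyperparameters `alpha` and `eta`, and the toy sizes are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, eta=0.5, seed=0):
    """Generate word-id documents following the LDA generative process."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the vocabulary.
    topics = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Each document has its own mixture over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]        # topic assignment
            w = rng.choices(range(vocab_size), weights=topics[z])[0]  # word id
            doc.append(w)
        corpus.append(doc)
    return topics, corpus

topics, corpus = generate_corpus(n_docs=3, doc_len=10, n_topics=2, vocab_size=8)
print(corpus[0])
```

Each word is drawn independently given the document's topic mixture, so reordering the words of a generated document does not change its probability. This is why plain LDA carries no information about word order.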
The LDA model treats a document as an unordered collection of words and therefore overlooks word order. To address this issue, we propose a time-critical topic model inspired by LDA to predict the survival time of sepsis patients, as described below.
2.2. Our Proposed Method
Our proposed TiCTM approach considers the time dependency structure between notes, measurement, and survival time of a sepsis patient. The TiCTM model is depicted in Figure 2.
Assume that the sequential notes and measurement submodel has several phases. For phase 1, the meaning of each notation is shown in Table 2.
For phase , the meaning of each notation is shown in Table 3.
For phase , the meaning of each notation is shown in Table 4.
As shown in Figure 2, the proposed TiCTM model consists of two submodels: (1) sequential notes and measurement and (2) survival time prediction. This is because the development of sepsis is a time-critical process. The main idea of the TiCTM model is that we consider the time-critical dynamic process of sepsis patients as approximately linear variation under clinician treatment. Therefore, we employ the linear change in the parameter of TiCTM to reflect this time-critical dynamic process, whereas the parameter of LDA is fixed.
2.2.1. Sequential Notes and Measurement
For a sepsis patient m, the parameter represents the patient’s initial body situation. In everyday clinician treatment, the patient’s body situation changes. Therefore, after days of clinician treatment, the parameter of patient m will transition to , which is used to reflect the sequential changing process. The parameter determines the disease probability distribution . The parameter determines the disease for patient m. (probability of bigrams appearing in the topics of notes dictionary) and (topic variable) generate words in the notes of patient m. (probability of measurements appearing in the topics of measurement dictionary) and generate measurements in the measurement result of patient m.
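The idea of a body situation that drifts roughly linearly over days of treatment can be illustrated with a toy calculation. This is only a sketch under stated assumptions: the per-day drift vector `delta`, the function name `body_state_after`, and the clip-and-renormalize step that keeps the result a valid probability vector are illustrative choices, not the paper's exact formulas.

```python
def body_state_after(theta0, delta, days, floor=1e-9):
    """Shift an initial topic mixture linearly in time, then renormalize.

    theta0: initial topic proportions (sums to 1).
    delta:  assumed per-day change of each proportion (hypothetical parameter).
    days:   number of days of treatment elapsed.
    """
    if len(theta0) != len(delta):
        raise ValueError("theta0 and delta must have the same length")
    # Linear change, clipped so that no proportion becomes non-positive.
    shifted = [max(t + days * d, floor) for t, d in zip(theta0, delta)]
    total = sum(shifted)
    return [s / total for s in shifted]

theta0 = [0.6, 0.3, 0.1]       # toy initial body situation
delta = [-0.05, 0.02, 0.03]    # toy daily drift
for day in (0, 1, 5):
    print(day, [round(x, 3) for x in body_state_after(theta0, delta, day)])
```

By contrast, plain LDA would keep `theta0` fixed for the whole stay, which corresponds to `delta` being all zeros in this sketch.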
For each word , and , we draw a topic assignment ∼ , where represents the probability of topic appearing, which is determined by parameter .
Draw a word ∼ , where represents the probability of word appearing, which is determined by topic and parameter .
For each phase i, the measurement is shown in Figure 2. For each measurement , and .
Draw a topic assignment ∼ , where represents the probability of topic appearing, which is determined by parameter .
Draw a measurement ∼ , where represents the probability of measurement appearing, which is determined by topic and parameter .
2.2.2. Survival Time Prediction
For each patient m, denotes the real survival time, and represents the predicted survival time with TiCTM. Body situation and regression coefficient are used for predicting the survival time for patient m. The objective of the problem is to minimize the difference between and . As shown in Figure 2, we define the time to death from formula (5). ∼ , where
The details for predicting survival time for sepsis patients using the TiCTM model are further described in the following subsection.
2.3. Details of Survival Time Prediction Using TiCTM
2.3.1. Definition of Real Survival Time Function
Assume that there are patients, is defined as the one-variable function of , whereas is defined as the one-variable function of . These two functions are further described later. The objective of survival time prediction for sepsis patients is to minimize the difference (Diff) between and :
We hypothesize as the real survival time for patient m after the second phase. is the function for real survival time analysis.
Assume that patient m has the same probability to survive every day; then, we can calculate the probability of death on the day using
To maximize , it must be satisfied with
where is a form of sigmoid function as
To make = 0 (Death) and = 1 (Survive), we modify to
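The constant-daily-survival assumption in this subsection leads to a geometric distribution over the day of death. The sketch below uses the textbook form P(death on day t) = p^(t-1) (1 - p), where `p` is the daily survival probability; the day indexing and the function names are assumptions and may differ from the paper's formulas. Setting the derivative of (t - 1) ln p + ln(1 - p) to zero gives the maximizer p = (t - 1)/t, which the code checks numerically.

```python
def death_prob_on_day(p_survive, day):
    """Probability of dying exactly on `day` given a constant daily survival probability."""
    if day < 1:
        raise ValueError("day must be at least 1")
    if not 0.0 <= p_survive <= 1.0:
        raise ValueError("p_survive must lie in [0, 1]")
    # Survive the first (day - 1) days, then die on the last one.
    return p_survive ** (day - 1) * (1.0 - p_survive)

def best_daily_survival(day):
    """Closed-form p that maximizes death_prob_on_day(p, day)."""
    return (day - 1) / day

day = 10
p_closed_form = best_daily_survival(day)
# Brute-force check of the closed form on a fine grid of candidate probabilities.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: death_prob_on_day(p, day))
print(p_closed_form, p_grid)
```

In this form, a longer observed survival time corresponds to a daily survival probability closer to 1, which is the kind of mapping a sigmoid-shaped function can express.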
2.3.2. Definition of the Survival Time Prediction Function
To verify the effectiveness of the proposed TiCTM model, this paper uses a two-phase survival model to analyze survival time for adult sepsis patients. The first phase covers the measurements and notes recorded within 24 hours of the patient’s admission. The second phase covers the patient’s measurements and notes from 24 hours after admission until the last record.
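The two-phase split can be sketched as a simple partition of time-stamped records. The record layout (a list of `(timestamp, payload)` tuples), the function name, and the choice to place a record at exactly 24 hours in the second phase are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def split_two_phases(admit_time, events, first_phase_hours=24):
    """Partition time-stamped records into first-phase and second-phase lists.

    Records before admission are ignored in this sketch.
    """
    cutoff = admit_time + timedelta(hours=first_phase_hours)
    phase1 = [e for e in events if admit_time <= e[0] < cutoff]
    phase2 = [e for e in events if e[0] >= cutoff]
    return phase1, phase2

admit = datetime(2020, 1, 1, 8, 0)
events = [
    (admit + timedelta(hours=2), "note: fever"),
    (admit + timedelta(hours=23, minutes=59), "lactate"),
    (admit + timedelta(hours=24), "note: intubated"),
    (admit + timedelta(days=3), "heart rate"),
]
phase1, phase2 = split_two_phases(admit, events)
print(len(phase1), len(phase2))
```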
To calculate the probability of measurement and note in the first phase by body condition , we use
where
To calculate the probability of measurement and note in the second phase by body condition (), we use
where
The likelihood function of the two phases for patient m can be obtained from
The log of the likelihood function of the two phases for patient m can be obtained from
where
After removing the constant term in formula (21), the log of the total likelihood function of the two phases for all patients can be obtained from the following formula:
To maximize the log of the total likelihood function, formula (23) is obtained as follows:
Under the given constraints, , the following formulas are obtained:
A body condition transition diagram is shown in Figure 3. Therefore, we can obtain the patient’s body condition at discharge.
We hypothesize as the predicted survival time for patient m after the second phase. is a function for predicting survival time analysis.
We predict the probability for patient m to survive with
where is the indicator function. When the notes of patient m contain the -th bigram word of the notes dictionary, the value is 1; otherwise, the value is 0. is the indicator function. When the measurement of patient m contains the -th measurement of the measurement dictionary, the value is 1; otherwise, the value is 0. is the regression coefficient. The -th bigram word of the dictionary represents a danger factor when , which indicates a shorter survival time. Otherwise, the -th bigram word of the notes dictionary represents a protective factor when , which indicates a longer survival time. is the regression coefficient. The -th measurement of the measurement dictionary represents a danger factor when , which indicates a shorter survival time. Otherwise, the -th measurement of the measurement dictionary represents a protective factor when , which indicates a longer survival time.
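The role of the indicator functions and regression coefficients can be illustrated with a small logistic-style score. This is a sketch under assumptions, not the paper's formula: the sigmoid link, the sign convention (negative coefficient for a danger factor, positive for a protective factor), and all dictionary entries and coefficient values below are hypothetical.

```python
import math

def indicator_vector(dictionary, present_items):
    """Return 1 for each dictionary entry found in the patient's record, else 0."""
    present = set(present_items)
    return [1 if term in present else 0 for term in dictionary]

def survival_probability(note_ind, meas_ind, note_coef, meas_coef):
    """Squash a weighted sum of indicators into a probability in (0, 1)."""
    score = sum(c * x for c, x in zip(note_coef, note_ind))
    score += sum(c * x for c, x in zip(meas_coef, meas_ind))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical dictionaries and coefficients, for illustration only.
notes_dict = ["septic shock", "stable overnight"]
meas_dict = ["lactate_high", "map_normal"]
note_coef = [-1.2, 0.8]   # assumed: negative = danger factor
meas_coef = [-0.9, 0.5]   # assumed: positive = protective factor

p_risky = survival_probability(
    indicator_vector(notes_dict, ["septic shock"]),
    indicator_vector(meas_dict, ["lactate_high"]),
    note_coef, meas_coef)
p_stable = survival_probability(
    indicator_vector(notes_dict, ["stable overnight"]),
    indicator_vector(meas_dict, ["map_normal"]),
    note_coef, meas_coef)
print(round(p_risky, 3), round(p_stable, 3))
```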
2.3.3. Objective Optimization
To minimize the difference between and , we define as follows:
To find the optimal to minimize , we obtain
Then, we obtain the gradient update as the following formulas:
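The general pattern of minimizing a squared difference by gradient updates can be sketched as follows. The linear predictor, the learning rate, and the toy data are assumptions for illustration; the paper's own gradient formulas depend on its survival functions and are not reproduced here.

```python
def squared_diff(weights, features, targets):
    """Mean squared difference between targets and linear predictions."""
    total = 0.0
    for x, y in zip(features, targets):
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(targets)

def gradient_step(weights, features, targets, lr=0.1):
    """One gradient-descent update of the weights on the mean squared difference."""
    n = len(targets)
    grads = [0.0] * len(weights)
    for x, y in zip(features, targets):
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grads[j] += 2.0 * err * xi / n
    return [w - lr * g for w, g in zip(weights, grads)]

# Toy indicator features and targets (hypothetical values).
features = [[1, 0], [0, 1], [1, 1]]
targets = [2.0, 3.0, 5.0]
weights = [0.0, 0.0]
initial_loss = squared_diff(weights, features, targets)
for _ in range(500):
    weights = gradient_step(weights, features, targets)
final_loss = squared_diff(weights, features, targets)
print([round(w, 4) for w in weights], final_loss < initial_loss)
```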
3.1. Experimental Design
3.1.1. Evaluation Tasks
Task 1: Evaluating the performance of the TiCTM model with different topics.
Task 2: Performance comparisons of TiCTM with LDA and linear regression models.
Task 3: F1 comparisons for different survival days with TiCTM (K = 5).
Task 4: Comprehensive performance comparisons of TiCTM (different topics) with LDA (K = 5, 10) and linear regression models using ROC.
represents the predicted survival time with linear regression. is defined as
where is a function for predicting survival time analysis. is defined as
where a and b are regression coefficients.
The objective of survival time prediction for sepsis patients is to minimize the loss function between and :
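A linear baseline with two coefficients a and b can be fitted in closed form by ordinary least squares. The sketch below assumes a single scalar input per patient (the variable `xs`, a hypothetical feature) and a plain squared-error loss; the exact inputs and loss used in the paper's baseline may differ.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a * x + b for one scalar feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Toy data lying exactly on y = 2x + 1 (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```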
3.1.3. Evaluation Criteria
To evaluate the performance of the proposed method, we use a 3-fold cross-validation scheme. Evaluation metrics, such as recall, F1, and FPR, are adopted in this paper. They are defined as follows:
Recall (TPR) = TP/(TP + FN).
Precision = TP/(TP + FP).
Accuracy = (TP + TN)/(TP + TN + FP + FN).
F1 = 2 × precision × recall/(precision + recall).
FPR = FP/(FP + TN).
Here, TP indicates true positive, which means predicting a survival time less than or equal to a given time, while the true survival time is less than or equal to a given time; FP indicates false positive, which means predicting a survival time less than or equal to a given time but the true survival time is greater than a given time; TN indicates true negative, which means predicting a survival time greater than a given time, while the true survival time is greater than a given time; and FN indicates false negative, which means predicting a survival time greater than a given time but the true survival time is less than or equal to a given time.
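The definitions above translate directly into code. The following sketch counts TP, FP, TN, and FN for a given time horizon and derives the five metrics; the function names and the toy survival times are illustrative, and returning 0.0 for an empty denominator is an assumed convention.

```python
def confusion_counts(pred_days, true_days, horizon):
    """Count TP, FP, TN, FN where 'positive' means survival time <= horizon."""
    tp = fp = tn = fn = 0
    for pred, true in zip(pred_days, true_days):
        pred_pos = pred <= horizon
        true_pos = true <= horizon
        if pred_pos and true_pos:
            tp += 1
        elif pred_pos and not true_pos:
            fp += 1
        elif not pred_pos and not true_pos:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Compute recall, precision, accuracy, F1, and FPR from confusion counts."""
    def safe_div(num, den):
        return num / den if den else 0.0
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "fpr": safe_div(fp, fp + tn),
    }

# Toy predicted and true survival times in days (hypothetical values).
pred_days = [5, 40, 10, 3, 60]
true_days = [6, 20, 45, 2, 90]
counts = confusion_counts(pred_days, true_days, horizon=30)
result = metrics(*counts)
print(counts, result)
```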
The dataset used in the experiments is the public Medical Information Mart for Intensive Care (MIMIC-III). MIMIC-III had been used in 845 publications as of the end of August 2019. The version used is MIMIC-III v1.4, which comprises over 58,000 hospital admissions for 38,645 adults and 7,875 neonates. The data span June 2001 to October 2012. We used a dataset of 2,487 deceased adult (age > 14) sepsis patient records from MIMIC-III. The data processing flowchart is shown in Figure 4.
Patient features include text data, class data, and numerical data.
(1) Text data processing: clinical notes are text data. This paper treats the clinical notes as a set of bigrams. We calculate the TF-IDF value of every bigram after removing the stopwords. Then, we sort the bigrams by TF-IDF and select the top 3,000 bigrams as the words for the notes dictionary.
(2) Numerical data processing: we calculate the mean and standard deviation (std) of the numerical data and divide the data into five intervals: , , , , and . Each interval is used as a word in the measurement dictionary.
(3) Class data processing: each class of data is used as a word in the measurement dictionary.
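The text and numerical preprocessing steps can be sketched with the standard library alone. Several details here are assumptions rather than the paper's choices: the tiny stopword list, the smoothed IDF formula, ranking bigrams by TF-IDF summed over all notes, and the cut points at mean ± 1 std and mean ± 2 std used to form five intervals.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "and", "in", "is", "of", "the", "to", "was", "with"}

def bigrams(text):
    """Lowercase, drop stopwords, and return adjacent word pairs."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def top_bigrams(notes, k):
    """Rank bigrams by TF-IDF summed over all notes and keep the top k."""
    docs = [Counter(bigrams(note)) for note in notes]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    scores = Counter()
    for doc in docs:
        total = sum(doc.values())
        for term, count in doc.items():
            idf = math.log((1 + len(docs)) / (1 + doc_freq[term])) + 1.0
            scores[term] += (count / total) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]

def interval_word(name, value, mean, std):
    """Map a numeric value to one of five interval 'words' (assumed cut points)."""
    cuts = [mean - 2 * std, mean - std, mean + std, mean + 2 * std]
    index = sum(value >= cut for cut in cuts)
    return f"{name}_bin{index}"

notes = ["patient with septic shock", "septic shock improving", "patient stable"]
print(top_bigrams(notes, 2))
print(interval_word("heart_rate", 118, mean=85, std=15))
```

In the paper's setting `k` would be 3,000 and the resulting bigrams and interval words would populate the notes and measurement dictionaries, respectively.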
3.2. Experimental Results
3.2.1. Evaluating the Performance of the TiCTM Model with Different Topics
To evaluate the performance of our proposed TiCTM model, we choose different numbers of topics, that is, K = 5, 10, 15, 20, to predict the survival time for adult sepsis patients in the ICU. The results are shown in Tables 6–9. In Table 9, we can see that 5 topics (K = 5) achieve the best performance for predicting the probability for 30-day survival time. The number of topics is chosen by a 3-fold cross-validation scheme: we use the average value over the 3 folds to select the best number of topics. It can be seen that the best number of topics for sepsis patients is 5. The F1 score with 5 topics is higher than that with 20 topics by 3.55% (in-hospital death), 3.88% (7 days), 2.56% (14 days), and 1.99% (30 days). A possible reason is that there are many words in the dictionary, so the number of topic and bigram combinations increases sharply as the number of topics increases. This leads to model overfitting, and the F1 score decreases.
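Selecting the number of topics by the average score over 3 folds follows a generic pattern, sketched below. The callback `score_fn` is a hypothetical stand-in for training a model with a given number of topics on the training indices and returning its F1 on the test indices; the toy callback used in the example is not a result from the paper.

```python
def k_fold_indices(n_items, k=3):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def select_num_topics(n_items, candidates, score_fn, k=3):
    """Return the candidate with the highest mean score across the k folds."""
    mean_scores = {}
    for num_topics in candidates:
        scores = [score_fn(num_topics, train, test)
                  for train, test in k_fold_indices(n_items, k)]
        mean_scores[num_topics] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def toy_score(num_topics, train, test):
    """Placeholder scorer that simply prefers fewer topics (not real data)."""
    return 1.0 / num_topics

best_k, mean_scores = select_num_topics(9, [5, 10, 15, 20], toy_score)
print(best_k, mean_scores)
```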
3.2.2. Performance Comparisons of TiCTM with LDA and Linear Regression Model
Performance comparisons of TiCTM (K = 5, 10) with LDA (K = 5, 10) and the linear regression model are shown in Tables 10–13. Overall, TiCTM outperforms LDA and the linear regression model in terms of recall, precision, accuracy, and F1-measure. Considering sequential information when predicting the survival time of sepsis patients increases the prediction accuracy.
3.2.3. F1 Comparisons for Different Survival Days with TiCTM (K = 5)
The F1 score increases across the four prediction targets (in-hospital death, 7-day survival, 14-day survival, and 30-day survival), as shown in Table 14. For longer survival times, the patient’s condition remains steady, so the clinician intervenes less. Therefore, prediction is more accurate, as measured by the F1 score, for longer survival times.
3.2.4. Comprehensive Performance Comparisons Using ROC
Figure 5 presents comprehensive performance comparisons of TiCTM (different topics) with LDA (K = 5, 10) and the linear regression model using ROC. In Figure 5, we can see that the performance of our proposed TiCTM is better than that of the classic LDA and linear regression models.
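An ROC comparison of this kind rests on sweeping a decision threshold over model scores and recording the (FPR, TPR) pairs. The sketch below computes the ROC points and the area under the curve by the trapezoidal rule; the scores and labels are toy values, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Process tied scores together so they share one threshold.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy scores (higher = more likely positive) and binary labels.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
points = roc_points(scores, labels)
print(points, auc(points))
```

A model whose curve lies closer to the top-left corner, and whose area is therefore larger, separates the two classes better at every threshold.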
In this paper, we propose a time-critical topic model (TiCTM), which combines a patient’s measurement with notes, to predict the survival time for adult sepsis patients in the ICU. We consider useful sequential information for predicting the survival time of sepsis patients, thereby increasing the prediction accuracy. Our experimental results show that the proposed TiCTM has the best performance when predicting the probability for 30-day survival time using 5 topics. In addition, the performance of our proposed TiCTM is better than that of the classic LDA and linear regression models. In the future, our study will focus on the explainable machine learning model for predicting survival time in the ICU [28–31].
The data used to support the findings of this study are available from https://mimic.physionet.org.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The authors are grateful to the Institute of Big Data and Artificial Intelligence in Medicine, Taizhou University, for providing computing resources. This research was funded by the Key Projects of Zhejiang Province’s Educational Science Planning (2017SB068), the Science and Technology Program of Taizhou (1803gy07), the Humanities and Social Science Project of the Chinese Ministry of Education (20YJAZH033), and the National Natural Science Foundation of China (61976149).
References
1. L. Zhou, J. Cui, J. Lu, B. Wee, and J. Zhao, “Prediction of survival time in advanced cancer: a prognostic scale for Chinese patients,” Journal of Pain and Symptom Management, vol. 38, pp. 578–586, 2009.
2. M. Zhou, L. O. Hall, D. B. Goldgof, R. A. Gatenby, and R. J. Gillies, “A texture feature ranking model for predicting survival time of brain tumor patients,” in Proceedings of 2013 IEEE International Conference on Systems, Man, and Cybernetics, IEEE Computer Society, Manchester, United Kingdom, October 2013.
3. M. Zhou, L. O. Hall, D. B. Goldgof, R. A. Gatenby, and R. J. Gillies, “Exploring brain tumor heterogeneity for survival time prediction,” in Proceedings of 22nd International Conference on Pattern Recognition, Institute of Electrical and Electronics Engineers Inc, Stockholm, Sweden, August 2014.
4. K. B. Ahmed, L. O. Hall, R. Liu, R. A. Gatenby, and R. J. Gillies, “Neuroimaging based survival time prediction of GBM patients using CNNs from small data,” in Proceedings of 2019 IEEE International Conference on Systems, Man, and Cybernetics, IEEE Computer Society, Bari, Italy, October 2019.
5. A. N. V. Dehkordi, A. Kamali-Asl, N. Wen, T. Mikkelsen, I. J. Chetty, and H. Bagher-Ebadian, “DCE-MRI prediction of survival time for patients with glioblastoma multiforme: using an adaptive neuro-fuzzy-based model and nested model selection technique,” NMR in Biomedicine, vol. 30, Article ID e3739, 2017.
6. M. Singer, C. S. Deutschman, C. W. Seymour et al., “The third international consensus definitions for sepsis and septic shock (sepsis-3),” JAMA, vol. 315, no. 8, pp. 801–810, 2016.
7. C. M. Torio and R. M. Andrews, “National inpatient hospital costs: the most expensive conditions by payer,” 2011, https://www.ncbi.nlm.nih.gov/books/NBK169005/.
8. D. McPherson, C. Griffiths, M. Williams et al., “Sepsis-associated mortality in England: an analysis of multiple cause of death data from 2001 to 2010,” BMJ Open, vol. 3, pp. 1–7, 2013.
9. L. Ou, J. Chen, K. Hillman et al., “The impact of post-operative sepsis on mortality after hospital discharge among elective surgical patients: a population-based cohort study,” Critical Care, vol. 21, p. 34, 2017.
10. J. C. Yébenes, J. C. Ruiz-Rodriguez, R. Ferrer et al., “SOCMIC (Catalonian Critical Care Society) Sepsis Working Group. Epidemiology of sepsis in Catalonia: analysis of incidence and outcomes in a European setting,” Annals of Intensive Care, vol. 7, p. 19, 2017.
11. W. A. Knaus, D. P. Wagner, E. A. Draper et al., “The APACHE III prognostic system,” Chest, vol. 100, no. 6, pp. 1619–1636, 1991.
12. J.-R. Le Gall, S. Lemeshow, and F. Saulnier, “A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study,” JAMA, vol. 270, no. 24, pp. 2957–2963, 1993.
13. C. P. Subbe, A. Slater, D. Menon, and L. Gemmell, “Validation of physiological scoring systems in the accident and emergency department,” Emergency Medicine Journal, vol. 23, no. 11, pp. 841–845, 2006.
14. J.-L. Vincent, R. Moreno, J. Takala et al., “The SOFA (sepsis-related organ failure assessment) score to describe organ dysfunction/failure,” Intensive Care Medicine, vol. 22, no. 7, pp. 707–710, 1996.
15. Z. Zhang and Y. Hong, “Development of a novel score for the prediction of hospital mortality in patients with severe sepsis: the use of electronic healthcare records with LASSO regression,” Oncotarget, vol. 8, no. 30, pp. 49637–49645, 2017.
16. J.-L. Vincent and R. Moreno, “Clinical review: scoring systems in the critically ill,” Critical Care, vol. 14, no. 2, p. 207, 2010.
17. A. B. Nielsen, H. C. Thorsen-Meyer, K. Belling et al., “Survival prediction in intensive-care units based on aggregation of long-term disease history and acute physiology: a retrospective study of the Danish national patient registry and electronic patient records,” The Lancet Digital Health, vol. 1, no. 2, pp. e78–89, 2019.
18. J.-L. Vincent, S. M. Opal, J. C. Marshall, and K. J. Tracey, “Sepsis definitions: time for change,” The Lancet, vol. 381, no. 9868, pp. 774-775, 2013.
19. T. J. Iwashyna, E. W. Ely, D. M. Smith, and K. M. Langa, “Long-term cognitive impairment and functional disability among survivors of severe sepsis,” JAMA, vol. 304, no. 16, pp. 1787–1794, 2010.
20. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
21. Z. Gou, Z. Huo, Y. Liu, and Y. Yang, “A method for constructing supervised topic model based on term frequency-inverse topic frequency,” Symmetry, vol. 11, no. 12, p. 1486, 2019.
22. L. W. Lehman, M. Saeed, W. Long, J. Lee, and R. Mark, “Risk stratification of ICU patients using topic models inferred from unstructured progress notes,” in Proceedings of the American Medical Informatics Association Annual Symposium, pp. 505–511, Chicago, IL, USA, November 2012.
23. M. Ghassemi, T. Naumann, F. Doshi-Velez et al., “Unfolding physiological state: mortality modelling in intensive care units,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery, New York, NY, USA, August 2014.
24. P. Vairamani, “Mortality prediction in ICU patients. Report of Project for CSE8803 Big data analytics for healthcare,” Spring, 2016.
25. Y. Zhang, R. Jiang, and L. Petzold, “Survival topic models for predicting outcomes for trauma patients,” in Proceedings of 33rd IEEE International Conference on Data Engineering, IEEE Computer Society, San Diego, CA, United States, April 2017.
26. A. E. Johnson, T. J. Pollard, L. Shen et al., “MIMIC-III, a freely accessible critical care database,” Scientific Data, vol. 3, Article ID 160035, 2016.
27. Z. Shi, W. Zuo, S. Liang, X. Zuo, L. Yue, and X. Li, “IDDSAM: an integrated disease diagnosis and severity assessment model for intensive care units,” IEEE Access, vol. 8, pp. 15423–15435, 2020.
28. L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, “Explaining explanations: an overview of interpretability of machine learning,” in Proceedings of IEEE 5th International Conference on Data Science and Advanced Analytics, Institute of Electrical and Electronics Engineers Inc, Turin, Italy, October 2018.
29. A. Holzinger, G. Langs, H. Denk, K. Zatloukal, and H. Müller, “Causability and explainability of artificial intelligence in medicine,” Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery, vol. 9, no. 4, Article ID e1312, 2019.
30. E. Tjoa and C. Guan, “A survey on explainable artificial intelligence (XAI): towards medical XAI,” Article ID 07374, 2019, https://arxiv.org/abs/1907.07374.
31. A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser et al., “Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI,” Information Fusion, vol. 58, pp. 82–115, 2020.
Copyright © 2020 Wenping Guo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.