#### Abstract

Sepsis is a leading cause of mortality in intensive care units and costs hospitals billions of dollars annually worldwide. Predicting survival time for sepsis patients is a time-critical prediction problem. Considering the useful sequential information for sepsis development, this paper proposes a time-critical topic model (TiCTM) inspired by the latent Dirichlet allocation (LDA) model. The proposed TiCTM approach takes into account the time dependency structure between notes, measurement, and survival time of a sepsis patient. Experimental results on the public MIMIC-III database show that, overall, our method outperforms the conventional LDA and linear regression model in terms of recall, precision, accuracy, and F1-measure. It is also found that our method achieves the best performance by using 5 topics when predicting the probability for 30-day survival time.

#### 1. Introduction

Predicting the survival time of patients is an active research area for both clinicians and scientists [1–5]. It can significantly contribute to making decisions about clinical treatment, allocation of medical resources, and hospice care for patients [1]. Sepsis is a disease of life-threatening organ dysfunction caused by a dysregulated host response to infection [6]. Without timely treatment, sepsis can rapidly lead to tissue damage, organ failure, and death. Common signs and symptoms include fever, increased heart rate, increased breathing rate, and confusion. In US health systems, the cost for patients with sepsis accounted for more than $20 billion (5.2%) of total US hospital costs in 2011 [7]. In 2001–2010, one in twenty deaths in England was associated with sepsis based on information recorded on death certificates [8]. Although clinicians have made efforts to improve sepsis patient survival time, the mortality rate of sepsis is still very high [9, 10]. Thus, accurate prediction of survival time for sepsis patients could help clinicians conduct prevention, provide early warning and effective treatment, and reduce the mortality rate. Unfortunately, the pathogenesis of sepsis remains unclear. Predicting the survival time for specific diseases, such as sepsis, is still a challenging problem.

To help clinicians understand a patient’s overall body situation, ICUs have introduced many different mechanisms for describing the patient’s body situation and progress in the ICU. Various severity-of-illness scores have been used to estimate sepsis or mortality risk in the ICU. The most widely used score-based methods include the acute physiology and chronic health evaluation (APACHE III) [11], the simplified acute physiology score (SAPS II) [12], the modified early warning score (MEWS) [13], the sepsis-related organ failure assessment (SOFA) [14], and quick SOFA (qSOFA) [6]. In addition, Zhang and Hong [15] proposed a novel score for predicting hospital mortality for severe sepsis. These score-based methods use a set of easily obtainable measurements from various patients to generate risk scores. Although they allow clinicians to assess a patient rapidly, the obtained results are not satisfactory [16, 17]. Additionally, these score-based approaches evaluate the patient’s body situation only at specific times; because the development of sepsis is a time-sensitive process, they cannot predict the survival time of sepsis patients.

Sepsis is a life-threatening condition that arises when the body’s response to infection causes injury to its own tissues and organs [18, 19]. It is difficult to predict the development of sepsis based on a small number of measurements. Conventional topic models such as latent Dirichlet allocation (LDA) are unsupervised machine learning methods that can recognize latent topic information in massive document collections [20, 21]. Lehman et al. [22] proposed a novel approach for ICU patient risk stratification using a topic model. Ghassemi et al. [23] proposed a mortality model using a topic model to predict in-hospital mortality, 30-day postdischarge mortality, and 1-year postdischarge mortality. Vairamani [24] proposed an approach for mortality prediction in ICU patients based on LDA. Zhang et al. [25] proposed a novel survival topic model inspired by LDA for trauma patients. However, these topic models overlook the sequential information in disease development, which reduces the prediction accuracy for sepsis patient survival time. The development and treatment of sepsis is a time-critical process that is highly correlated with the order of words in the measurements and notes.

To address this issue, this paper proposes a time-critical topic model (TiCTM) inspired by the LDA model to predict the survival time of sepsis patients. The proposed TiCTM approach takes into account the time dependency structure between notes, measurement, and survival time of a sepsis patient. We consider the time-critical dynamic process of sepsis patients as an approximately linear variation under clinician treatment. Therefore, the linear change in the parameter of TiCTM can reflect the time-critical dynamic process, whereas the parameter of LDA is fixed. Our experimental results on the public MIMIC-III database show that, overall, TiCTM outperforms the conventional LDA and linear regression model in terms of recall, precision, accuracy, and F1-measure. In particular, TiCTM obtains the best performance when predicting the probability for 30-day survival time using 5 topics.

The remainder of this paper is organized as follows. In Section 2, we describe the proposed TiCTM for predicting survival time for adult sepsis patients in the ICU. The experiments and evaluations are discussed in Section 3. Finally, we conclude our research in Section 4.

#### 2. Methodology

In this section, we first present a brief review of the classical LDA model and then describe our proposed TiCTM approach in detail.

##### 2.1. Brief Review of LDA

LDA is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [20]. The LDA model is represented as a probabilistic graphical model, as shown in Figure 1. The meaning of each notation for LDA is shown in Table 1.
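The generative process described above can be sketched in a few lines of Python. This is a minimal illustration of standard LDA, not code from this paper: the function name, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and the toy corpus sizes are all assumptions chosen for the example.

```python
import numpy as np

def generate_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                        alpha=0.5, beta=0.1, seed=0):
    """Sample a toy corpus from the standard LDA generative process."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from a symmetric Dirichlet prior.
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Per-document mixture over topics.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # One topic assignment per word position, then one word per assignment.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(words)
    return phi, docs

phi, docs = generate_lda_corpus(n_docs=3, doc_len=20, n_topics=4, vocab_size=50)
print(phi.shape, [len(d) for d in docs])
```

Each document is reduced to word indices drawn independently given the topic mixture, so shuffling the words of a document leaves its probability unchanged.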

The LDA model treats a document as a collection of words and therefore overlooks the order of words. To address this issue, we propose a time-critical topic model inspired by LDA to predict the survival time of sepsis patients, as described below.

##### 2.2. Our Proposed Method

Our proposed TiCTM approach considers the time dependency structure between notes, measurement, and survival time of a sepsis patient. The TiCTM model is depicted in Figure 2.

Assume that the sequential notes and measurement submodel has several phases. For phase 1, the meaning of each notation is shown in Table 2.

For phase , the meaning of each notation is shown in Table 3.

For phase , the meaning of each notation is shown in Table 4.

As shown in Figure 2, the proposed TiCTM model consists of two submodels: sequential notes and measurement and survival time prediction. This is because the development of sepsis is a time-critical process. The main idea of the TiCTM model is that we consider the time-critical dynamic process of sepsis patients as approximately linear variation under clinician treatment. Therefore, we employ the linear change in the parameter of TiCTM to reflect this time-critical dynamic process, whereas the parameter of LDA is fixed.

###### 2.2.1. Sequential Notes and Measurement

For a sepsis patient *m*, the parameter represents the patient’s initial body situation. In everyday clinician treatment, the patient’s body situation changes . Therefore, after days of clinician treatment, the parameter of patient *m* will transit to , which is used to reflect the sequential changing process. The parameter determines the disease probability distribution . The parameter determines the disease for patient *m*. (probability of bigrams appearing in the topics of notes dictionary) and (topic variable) generate words in the notes of patient *m*. (probability of measurements appearing in the topics of measurement dictionary) and generate measurements in the measurement result of patient *m*.

For each word , and , we draw a topic assignment ∼ , where represents the probability of topic appearing, which is determined by parameter .

Draw a word ∼ , where represents the probability of word appearing, which is determined by topic and parameter .

For each phase *i*, the measurement is shown in Figure 2. For each measurement , and .

Draw a topic assignment ∼ , where represents the probability of topic appearing, which is determined by parameter .

Draw a measurement ∼ , where represents the probability of measurement appearing, which is determined by topic and parameter .
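To make the idea of a linearly changing parameter concrete, the following sketch draws topic assignments and tokens for successive phases. It is an illustration under stated assumptions, not the paper's formulation: the names `theta0` (initial state) and `delta` (per-phase change), the softmax used to turn the drifting state into topic probabilities, and the toy topic-token matrix `phi` are all hypothetical.

```python
import numpy as np

def phase_topic_weights(theta0, delta, phase):
    """Topic probabilities at a phase: softmax of a linearly drifting state (assumed form)."""
    scores = np.asarray(theta0, dtype=float) + phase * np.asarray(delta, dtype=float)
    scores -= scores.max()          # shift for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def sample_phase(theta0, delta, phase, phi, n_tokens, rng):
    """Draw a topic assignment for each token, then the token itself, for one phase."""
    weights = phase_topic_weights(theta0, delta, phase)
    z = rng.choice(len(weights), size=n_tokens, p=weights)
    tokens = np.array([rng.choice(phi.shape[1], p=phi[k]) for k in z])
    return z, tokens

rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(30), size=3)   # hypothetical topic-token distributions
theta0 = [0.0, 0.0, 0.0]                   # hypothetical initial state
delta = [0.5, 0.0, -0.5]                   # hypothetical per-phase linear change
for phase in range(3):
    print(phase, phase_topic_weights(theta0, delta, phase).round(3))
```

In this toy setting the first topic gains weight and the third loses weight from phase to phase, whereas a fixed-parameter model such as LDA would use the same topic probabilities in every phase.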

###### 2.2.2. Survival Time Prediction

For each patient *m*, denotes the real survival time, and represents the predicted survival time with TiCTM. Body situation and regression coefficient are used for predicting the survival time for patient *m.* The objective of the problem is to minimize the difference between and . As shown in Figure 2, we define the time to death from formula (5). ∼ , where

The details for predicting survival time for sepsis patients using the TiCTM model are further described in the following subsection.

##### 2.3. Details of Survival Time Prediction Using TiCTM

###### 2.3.1. Definition of Real Survival Time Function

Assume that there are patients, is defined as the one-variable function of , whereas is defined as the one-variable function of . These two functions are further described later. The objective of survival time prediction for sepsis patients is to minimize the difference (Diff) between and :

We hypothesize as the real survival time for patient *m* after the second phase. is the function for real survival time analysis .

Assume that patient *m* has the same probability to survive every day; then, we can calculate the probability of death on the day using
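The constant-daily-survival assumption can be illustrated with a small numerical sketch. It uses the standard geometric form, in which a patient survives the first t − 1 days with probability p each day and dies on day t; whether this matches the paper's exact expression is an assumption, and the function name and the value p = 0.9 are hypothetical.

```python
def death_on_day(p_survive, t):
    """Probability of surviving days 1..t-1 and dying on day t, with a constant daily survival probability."""
    if not 0.0 <= p_survive <= 1.0:
        raise ValueError("p_survive must lie in [0, 1]")
    if t < 1:
        raise ValueError("t must be a positive day index")
    return p_survive ** (t - 1) * (1.0 - p_survive)

# With a hypothetical daily survival probability of 0.9:
for day in (1, 7, 30):
    print(day, death_on_day(0.9, day))
```

Under this form the probabilities over all days sum to 1, and a higher daily survival probability shifts the mass toward later days.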

To maximize , the following condition must be satisfied, where is a form of the sigmoid function:

To make = 0 (Death) and = 1 (Survive), we modify to

Then, we can obtain formula (12) for and formula (13) for :

###### 2.3.2. Definition of the Survival Time Prediction Function

To verify the effectiveness of the proposed TiCTM model, this paper uses a two-phase survival model to analyze survival time for adult sepsis patients. The first phase is limited to the measurements and notes recorded within 24 hours of the patient’s admission. The second phase uses the patient’s measurements and notes from after the first 24 hours up to the last record.

To calculate the probability of measurement and note in the first phase by body condition , we use the following formula and definitions:

To calculate the probability of measurement and note in the second phase by body condition (), we use the following formula and definitions:

The likelihood function of the two-phase of patient *m* can be obtained from

The log of the likelihood function of the two phases of patient *m* can be obtained from the following formula and definitions:

To simplify the calculation, we use formula (21) to replace the log of the likelihood function of formula (19):

After removing the constant term in formula (21), the log of the total likelihood function of the two-phase for all patients can be obtained from the following formula:

To maximize the log of the total likelihood function, formula (23) is obtained as follows:

Under the given constraints, , the following formulas are obtained:

A body condition transition diagram is shown in Figure 3. Therefore, we can obtain the patient’s body condition at discharge.

We hypothesize as the predicted survival time for patient *m* after the second phase. is a function for predicting survival time analysis .

We predict the probability for patient *m* to survive with the following formula, where is the indicator function. When the notes of patient *m* contain the -th bigram word of the notes dictionary, the value is 1; otherwise, the value is 0. is the indicator function. When the measurement of patient *m* contains the -th measurement of the measurement dictionary, the value is 1; otherwise, the value is 0. is the regression coefficient. The -th bigram word of the notes dictionary represents a danger factor when , which indicates a shorter survival time. Otherwise, the -th bigram word of the notes dictionary represents a protective factor when , which indicates a longer survival time. is the regression coefficient. The -th measurement of the measurement dictionary represents a danger factor when , which indicates a shorter survival time. Otherwise, the -th measurement of the measurement dictionary represents a protective factor when , which indicates a longer survival time.
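The role of the indicator functions and regression coefficients can be sketched as follows. This is an illustration under assumptions, not the paper's formula: it assumes a logistic (sigmoid) link over the summed coefficients of the dictionary entries present for a patient, and it adopts the convention that a negative coefficient lowers the survival probability. The names `beta`, `gamma`, and `survival_probability` are hypothetical, and the paper's own sign convention may differ.

```python
import math

def survival_probability(note_ids, meas_ids, beta, gamma):
    """Sigmoid of the summed coefficients of the dictionary entries present for a patient.

    note_ids / meas_ids: indices of bigrams / measurements observed for the patient.
    beta / gamma: coefficient per notes-dictionary / measurement-dictionary entry.
    """
    # Indicator semantics: an entry counts once if present, however often it occurs.
    score = sum(beta[j] for j in set(note_ids)) + sum(gamma[j] for j in set(meas_ids))
    return 1.0 / (1.0 + math.exp(-score))

beta = [-1.2, 0.4, 0.0]   # hypothetical bigram coefficients
gamma = [0.8, -0.3]       # hypothetical measurement coefficients
print(survival_probability([0], [1], beta, gamma))   # entries with negative coefficients
print(survival_probability([1], [0], beta, gamma))   # entries with positive coefficients
```

With this convention, entries with negative coefficients act as danger factors and entries with positive coefficients act as protective factors.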

Then, we can obtain formula (27) for and formula (28) for :

###### 2.3.3. Objective Optimization

To minimize the difference between and , we define as follows:

To find the optimal to minimize , we obtain

Then, we obtain the gradient update as the following formulas:

#### 3. Experiments

##### 3.1. Experimental Design

###### 3.1.1. Evaluation Tasks

- Task 1: Evaluating the performance of the TiCTM model with different topics.
- Task 2: Performance comparisons of TiCTM with LDA and linear regression models.
- Task 3: F1 comparisons for different survival days with TiCTM (*K* = 5).
- Task 4: Comprehensive performance comparisons of TiCTM (different topics) with LDA (*K* = 5, 10) and linear regression models using ROC.

###### 3.1.2. Baselines

In this experiment, we use two methods as our baselines: LDA [20] and linear regression. For the linear regression model, the meaning of each notation is shown in Table 5.

represents the predicted survival time with linear regression. is defined as follows, where is a function for predicting survival time analysis . is defined as follows, where *a* and *b* are regression coefficients.

The objective of survival time prediction for sepsis patients is to minimize the loss function between and :

Then, we obtain the gradient update as formulas (36), (37), and (38):
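As a generic illustration of fitting a linear regression baseline by gradient updates, the sketch below minimizes a mean squared loss with plain gradient descent. The baseline's exact parameterization and loss are not reproduced here; the weight vector `w`, intercept `b`, learning rate, and toy data are assumptions made for the example.

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, n_iter=5000):
    """Gradient descent on the loss (1 / 2n) * sum((X @ w + b - y) ** 2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        resid = X @ w + b - y          # prediction error per sample
        w -= lr * (X.T @ resid) / n    # gradient step for the weights
        b -= lr * resid.mean()         # gradient step for the intercept
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical values).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear_regression(X, y)
print(w, b)
```

The learning rate must be small relative to the scale of the features for the updates to converge; on real measurements, standardizing the features first is a common choice.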

###### 3.1.3. Evaluation Criteria

To evaluate the performance of the proposed method, we use a 3-fold cross-validation scheme. The evaluation metrics adopted in this paper are recall, precision, accuracy, F1, and FPR, defined as follows:

- Recall (TPR) = TP/(TP + FN)
- Precision = TP/(TP + FP)
- Accuracy = (TP + TN)/(TP + TN + FP + FN)
- F1 = 2 × precision × recall/(precision + recall)
- FPR = FP/(FP + TN)

Here, a positive case is a survival time less than or equal to a given time. TP indicates true positive: the predicted survival time is less than or equal to the given time, and the true survival time is also less than or equal to it. FP indicates false positive: the predicted survival time is less than or equal to the given time, but the true survival time is greater than it. TN indicates true negative: the predicted survival time is greater than the given time, and the true survival time is also greater than it. FN indicates false negative: the predicted survival time is greater than the given time, but the true survival time is less than or equal to it.
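The metric definitions above can be computed directly from true and predicted survival times. The sketch below is illustrative: the function name, the dictionary output, and the toy survival times are assumptions, and undefined ratios (zero denominators) are reported as 0.0 by convention in this example.

```python
def survival_metrics(true_days, pred_days, horizon):
    """Recall, precision, accuracy, F1, and FPR, where positive means survival time <= horizon."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_days, pred_days):
        actual_pos = t <= horizon
        pred_pos = p <= horizon
        if pred_pos and actual_pos:
            tp += 1
        elif pred_pos and not actual_pos:
            fp += 1
        elif not pred_pos and not actual_pos:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        return num / den if den else 0.0   # convention for an empty denominator

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }

# Hypothetical survival times in days, evaluated at a 30-day horizon.
print(survival_metrics([3, 10, 40, 5, 60], [5, 35, 20, 6, 90], horizon=30))
```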

###### 3.1.4. Dataset

The dataset used in the experiments is the public Medical Information Mart for Intensive Care (MIMIC-III) [26]. MIMIC-III had been used in 845 publications as of the end of August 2019 [27]. The version used is MIMIC-III v1.4, which comprises over 58,000 hospital admissions for 38,645 adults and 7,875 neonates. The data span June 2001 to October 2012. We used a dataset of 2,487 deceased adult (age > 14) sepsis patient records from MIMIC-III. The data processing flowchart is shown in Figure 4.

Patient features include text data, class data, and numerical data.

(1) Text data processing: clinical notes are text data. This paper treats the clinical notes as a set of bigrams. We calculate the TF-IDF value of every bigram after removing the stopwords. Then, we sort the bigrams by TF-IDF and select the top 3,000 bigrams as the words of the notes dictionary.

(2) Numerical data processing: we calculate the mean and standard deviation (std) of the numerical data and divide it into five intervals: , , , , and . Each interval is used as a word in the measurement dictionary.

(3) Class data processing: each class of data is used as a word in the measurement dictionary.
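The text-processing step can be sketched with the Python standard library. This is a simplified illustration rather than the pipeline used in the experiments: the tiny stopword list, whitespace tokenization, the choice to sum TF-IDF over notes when ranking bigrams, and the toy notes are all assumptions.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "to", "with"}  # hypothetical tiny list

def bigrams(text):
    """Lowercase, drop stopwords, and return adjacent word pairs."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return list(zip(tokens, tokens[1:]))

def top_bigrams(notes, k):
    """Rank bigrams by TF-IDF summed over all notes and keep the top k (assumed aggregation)."""
    docs = [Counter(bigrams(n)) for n in notes]
    df = Counter(bg for d in docs for bg in d)        # document frequency per bigram
    n_docs = len(docs)
    scores = Counter()
    for d in docs:
        total = sum(d.values())
        for bg, count in d.items():
            scores[bg] += (count / total) * math.log(n_docs / df[bg])
    return [bg for bg, _ in scores.most_common(k)]

notes = ["septic shock", "septic shock", "renal failure"]   # toy notes
print(top_bigrams(notes, 1))
```

In the setting described above, `k` would be 3,000 and the selected bigrams would form the notes dictionary.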

##### 3.2. Experimental Results

###### 3.2.1. Evaluating the Performance of the TiCTM Model with Different Topics

To evaluate the performance of our proposed TiCTM model, we vary the number of topics (*K* = 5, 10, 15, 20) when predicting the survival time for adult sepsis patients in the ICU. The results are shown in Tables 6–9. Table 9 shows that 5 topics (*K* = 5) achieve the best performance for predicting the probability for 30-day survival time. The number of topics is chosen by a 3-fold cross-validation scheme: we use the average value over the 3 folds to select the best number of topics, which is 5 for sepsis patients. The *F*1 score with 5 topics is higher than that with 20 topics by 3.55% (in-hospital death), 3.88% (7 days), 2.56% (14 days), and 1.99% (30 days). A possible reason is that the dictionary contains many words, so the number of combinations of topics and bigrams increases sharply as the number of topics increases. This leads to model overfitting and a lower *F*1 score.
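Selecting the number of topics by the average score over 3 folds can be sketched as follows. The sketch is generic: `fit_and_score` stands for any hypothetical routine that trains a model with `k` topics on the training indices and returns a score such as F1 on the held-out indices, and the dummy scorer at the end exists only to make the example runnable.

```python
import random
from statistics import mean

def three_fold_indices(n, seed=0):
    """Shuffle patient indices and split them into 3 disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::3] for i in range(3)]

def select_num_topics(n_patients, candidates, fit_and_score, seed=0):
    """Return the candidate with the highest average held-out score, plus all averages."""
    folds = three_fold_indices(n_patients, seed)
    averages = {}
    for k in candidates:
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(k, train_idx, test_idx))
        averages[k] = mean(scores)
    best = max(averages, key=averages.get)
    return best, averages

# Dummy scorer for illustration only: pretends smaller k scores higher.
best, averages = select_num_topics(30, [5, 10, 15, 20], lambda k, tr, te: 1.0 / k)
print(best, averages)
```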

###### 3.2.2. Performance Comparisons of TiCTM with LDA and Linear Regression Model

Performance comparisons of TiCTM (*K* = 5, 10) with LDA (*K* = 5, 10) and the linear regression model are shown in Tables 10–13. Overall, our proposed TiCTM outperforms LDA and the linear regression model in terms of recall, precision, accuracy, and *F*1-measure. Considering sequential information when predicting the survival time of sepsis patients increases the prediction accuracy.

###### 3.2.3. F1 Comparisons for Different Survival Days with TiCTM (*K* = 5)

As shown in Table 14, the F1 score increases as the prediction target moves from in-hospital death to 7-day, 14-day, and 30-day survival. For longer survival times, the patient’s condition remains steady, so the clinician intervenes less. Therefore, the F1 score is higher when predicting longer survival times.

###### 3.2.4. Comprehensive Performance Comparisons Using ROC

Figure 5 presents comprehensive performance comparisons of TiCTM (different topics) with LDA (*K* = 5, 10) and the linear regression model using ROC. In Figure 5, we can see that the performance of our proposed TiCTM is better than that of the classic LDA and linear regression models.
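For reference, an ROC curve plots TPR against FPR as the decision threshold varies, and the area under it summarizes performance in a single number. The sketch below shows one way to compute both; the function names and toy scores are assumptions, and it requires at least one positive and one negative label.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by lowering the threshold over the sorted scores.

    labels: 1 for positive, 0 for negative; scores: higher means more likely positive.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all samples tied at this score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

pts = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])   # toy labels and scores
print(pts, auc(pts))
```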

#### 4. Conclusions

In this paper, we propose a time-critical topic model (TiCTM), which combines a patient’s measurement with notes, to predict the survival time for adult sepsis patients in the ICU. We consider useful sequential information for predicting the survival time of sepsis patients, thereby increasing the prediction accuracy. Our experimental results show that the proposed TiCTM has the best performance when predicting the probability for 30-day survival time using 5 topics. In addition, the performance of our proposed TiCTM is better than that of the classic LDA and linear regression models. In future work, we will focus on explainable machine learning models for predicting survival time in the ICU [28–31].

#### Data Availability

The data used to support the findings of this study are available from https://mimic.physionet.org.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

The authors are grateful to the Institute of Big Data and Artificial Intelligence in Medicine, Taizhou University, for providing computing resources. This research was funded by the Key Projects of Zhejiang Province’s Educational Science Planning (2017SB068), the Science and Technology Program of Taizhou (1803gy07), the Humanities and Social Science Project of the Chinese Ministry of Education (20YJAZH033), and the National Natural Science Foundation of China (61976149).