Computational and Mathematical Methods in Medicine

Volume 2015, Article ID 576413, 6 pages

http://dx.doi.org/10.1155/2015/576413

## Application of Random Forest Survival Models to Increase Generalizability of Decision Trees: A Case Study in Acute Myocardial Infarction

^{1}Regional Knowledge Hub and WHO Collaborating Center for HIV Surveillance, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman 7616911317, Iran

^{2}Department of Epidemiology, University of Tehran, Tehran, Iran

^{3}Research Center for Modeling in Health, Institute for Futures Studies in Health, Kerman University of Medical Sciences, Kerman 7616911317, Iran

Received 12 September 2015; Revised 23 November 2015; Accepted 24 November 2015

Academic Editor: Issam El Naqa

Copyright © 2015 Iman Yosefian et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

*Background*. Tree models provide an easily interpretable prognostic tool but unstable results. Two approaches to enhance the generalizability of the results are pruning and random survival forest (RSF). The aim of this study is to assess the generalizability of the saturated tree (ST), the pruned tree (PT), and RSF. *Methods*. Data from 607 patients were randomly divided into training and test sets using 10-fold cross-validation. All three models were fitted on the training sets. The ST was constructed by searching for optimal cutoffs with the Log-Rank test. The PT was selected by plotting the error rate against the minimum sample size in terminal nodes. To construct the RSF, 1000 bootstrap samples were drawn from the training set. The *C*-index and the integrated Brier score (IBS) were used to compare the models. *Results*. ST provided the most overoptimized statistics. The mean difference between the *C*-index in the training and test sets was 0.237. The corresponding figures for PT and RSF were 0.054 and 0.007. In terms of IBS, the difference was 0.136 in ST, 0.021 in PT, and 0.0003 in RSF. *Conclusion*. Pruning a tree and assessing its performance on a test set partially improve the generalizability of decision trees. RSF provides results that are highly generalizable.

#### 1. Introduction

The prediction of survival rate is a major aim in survival analysis. For time-to-event data, the Log-Rank test and Cox regression models are the most frequently used methods. The Cox model can be used to identify the variables that significantly affect the outcome of interest and presents the results in terms of the Hazard Ratio (HR) [1]. However, this model does not provide an easily interpretable decision rule to be used in clinical practice. In addition, exploring high-order interactions requires including interaction terms in the model, which makes the results more difficult to interpret [2].

An alternative strategy which easily handles both of these problems is decision tree analysis [3]. A tree consists of a root node, internal (daughter) nodes, and terminal nodes. At the first step, all subjects are placed in the root node. The subjects are then divided into two daughter nodes with the maximum difference between them. This is achieved by an extensive search among all independent variables to find the variable (and cutoff) that maximizes the difference [4]. All possible cutoffs of all independent variables are tried to find the one that leads to the highest Log-Rank statistic (corresponding to the lowest *P* value). Once the first split is created, a similar approach is applied to each internal node. This leads to a tree structure which divides the subjects into the final terminal nodes [5–8]. These models provide pictorial decision rules and therefore can be easily used in medical decision making.
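The exhaustive search for a split can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names `logrank_statistic` and `best_split`, the `x <= cutoff` splitting rule, and the toy data are assumptions made purely for illustration.

```python
def logrank_statistic(times, events, in_left):
    """Two-group Log-Rank chi-square statistic (1 degree of freedom)."""
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({u for u, e in zip(times, events) if e == 1}):
        n = sum(1 for u in times if u >= t)                       # at risk, both groups
        n_left = sum(1 for u, g in zip(times, in_left) if u >= t and g)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d_left = sum(1 for u, e, g in zip(times, events, in_left)
                     if u == t and e == 1 and g)
        observed_minus_expected += d_left - d * n_left / n
        if n > 1:
            variance += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    return observed_minus_expected ** 2 / variance if variance > 0 else 0.0


def best_split(covariates, times, events):
    """Try every cutoff of every variable; return (statistic, variable index, cutoff)."""
    best = (0.0, None, None)
    for j in range(len(covariates[0])):
        values = sorted({row[j] for row in covariates})
        for cutoff in values[:-1]:          # drop the maximum so both nodes are non-empty
            in_left = [row[j] <= cutoff for row in covariates]
            stat = logrank_statistic(times, events, in_left)
            if stat > best[0]:
                best = (stat, j, cutoff)
    return best


# Hypothetical toy data: columns are [age, smoker]; event = 1 means death.
covariates = [[45, 0], [50, 1], [55, 0], [58, 1], [62, 0], [66, 1], [70, 0], [75, 1]]
times = [20, 18, 22, 19, 5, 4, 6, 3]
events = [0, 1, 0, 0, 1, 1, 1, 1]

stat, variable, cutoff = best_split(covariates, times, events)
print(f"best split: variable {variable} at cutoff {cutoff} (Log-Rank = {stat:.2f})")
```

Applying `best_split` recursively to each daughter node would grow the tree. Because the maximum is taken over many candidate splits, the selected statistic is optimistic, which is one source of the overoptimization discussed in this paper.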

Once a model has been created, some measures of model performance are required. For example, in the case of logistic regression, sensitivity and specificity, or the area under the ROC curve, should be reported. These statistics show how well the model discriminates between cases and controls.

In the case of survival analysis, the *C*-index and Brier statistics are usually reported. The *C*-index is a generalization of the area under the ROC curve which compares the survival rate of those who experienced the event with that of those who did not [9]. The Brier score (BS) compares the predicted survival rate with the actual status of patients [10]. A high *C*-index and a low BS indicate adequate fit of the model to the data.

In the process of model building, researchers usually fit a model using a given data set and then assess its performance using the same data set. Regardless of the method of model building, an important aim in risk prediction is to construct models which accurately predict the risk for future patients. It has been argued that using the training set both to construct the model and to assess its performance leads to overoptimized statistics with low generalizability [11]. The level of overoptimization in the case of decision tree models is even higher, due to the extensive search at each node [12].

One of the easiest approaches to tackle the problem of overoptimized statistics is to randomly divide the data into training and test sets. In this case, the model is constructed on the training set and then applied to the test set to calculate the performance statistics [11]. This approach, however, leads to a decrease in sample size and power.

Alternative approaches suggest bootstrap aggregation of the results [13, 14]. That is, the model is constructed on a number of randomly drawn bootstrap samples (say 1000), each model is tested on the same sample, and the mean and standard deviation of the statistics of interest are reported.

One of the aggregation methods which has been proposed is the random survival forest. This method controls for overoptimization by two mechanisms [15]. Firstly, it draws multiple bootstrap samples from the initial data. Secondly, to construct each tree, a random sample of the independent variables is selected and used. It has been argued that using two forms of randomization in growing the trees, and then combining the trees, considerably reduces the instability of a single tree. The objective of this study was to compare the performance of the survival tree and the random survival forest for predicting the survival probability of patients admitted with acute myocardial infarction.

#### 2. Material and Methods

We used information on 607 acute myocardial infarction (AMI) patients aged >25 years, admitted to the CCU of Imam Reza Hospital, Mashhad, Iran, in 2007. Patients were identified according to the International Classification of Diseases (ICD-10) with 12.0 to 12.9 codes. In the current study, the main outcome was death due to AMI. Time from admission to discharge or death was considered as follow-up time. Information on 11 predictor variables was available: age (in years), sex, hypertension (no and yes) (patients with systolic blood pressure ≥140 mmHg or diastolic blood pressure ≥90 mmHg were considered as “yes”), hyperlipidemia (no and yes), history of ischemic heart disease at admission (no and yes), diabetes (no and yes), smoking status (no and yes), family history of AMI, Q wave status (presence or absence of pathologic Q waves in the electrocardiogram (ECG)), streptokinase treatment (no and yes), and intervention (angioplasty, pacemaker surgery, bypass surgery, and drug therapy).

We compared three methods: the saturated survival tree, the pruned survival tree, and the random survival forest (RSF) (see details below). We randomly divided our data set into two parts, training and test sets, by using 10-fold cross-validation; the models were then constructed using the training set. For the saturated and pruned survival trees, performance was assessed on both the training and test sets. For the random survival forest, performance was assessed on the out-of-bag and test sets (explained later).
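The data-splitting scheme can be sketched in Python as follows. The 20 repetitions are those described in Section 2.3; the function name `repeated_kfold`, the random seed, and the interleaved fold assignment are assumptions of this sketch rather than details reported for the study.

```python
import random


def repeated_kfold(n_subjects, n_folds=10, n_repeats=20, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Each repeat shuffles the subjects, cuts them into n_folds folds, and uses
    every fold once as the test set and the remaining folds as the training set.
    """
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_subjects))
        rng.shuffle(order)
        folds = [order[k::n_folds] for k in range(n_folds)]
        for k in range(n_folds):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            splits.append((train, test))
    return splits


splits = repeated_kfold(n_subjects=607)
print(len(splits), "training/test pairs")
print("test set sizes:", sorted({len(test) for _, test in splits}))
```

With 607 subjects, each test fold holds 60 or 61 subjects and each training set holds the remaining 546 or 547.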

##### 2.1. Saturated Survival Tree

To construct the survival tree on the training set, the Log-Rank statistic was used as the split criterion. A saturated tree was constructed under the restriction that each terminal node has at least 1 death. The performance of the final tree (in terms of IBS and *C*-index) was tested on both the training and test samples.

##### 2.2. Pruned Survival Tree

Secondly, the tree constructed using the training sample was pruned. The tree size was plotted against the error in the test set (1 − *C*-index) to select the optimal tree. Sampling variation was addressed as explained above.

##### 2.3. Random Survival Forest

RSF is an ensemble method that introduces 2 forms of randomization into the tree growing process: bootstrap sampling from the data and selection of a limited number of independent variables to construct the tree [16].

Using the training set, the RSF procedure was applied. Its performance was then assessed using the OOB subjects of the training set and the test set. This procedure was repeated 1000 times, as explained below.

First, an independent bootstrap sample is used for growing each tree. Second, to split each node of the tree into 2 daughter nodes, a limited number of covariates are selected. It has been shown that each subject is selected in about 63% of the bootstrap samples. The subjects not selected in a given bootstrap sample are referred to as its out-of-bag (OOB) sample. This means that, in 1000 bootstrap samples, each subject is part of the OOB sample about 370 times. We followed the procedure below:

(1) 1000 bootstrap samples were drawn.

(2) In each sample, a survival tree was constructed. At each node of the tree, a random subset of candidate variables was selected. The node was split using the candidate variable that maximizes the survival difference between daughter nodes.

(3) Based on the rules derived from the trees, survival curves for OOB patients were plotted.

(4) For each subject, the survival curves were averaged, and the average was taken as the subject's final survival curve.

In all three approaches, 10-fold cross-validation was applied. To capture additional variation, the process of cross-validation was repeated 20 times, therefore creating 200 training and 200 test data sets for each method.
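The out-of-bag proportion quoted in this section can be checked with a short simulation. The sketch below is illustrative only: the training-set size of 546 (about nine tenths of 607) and the square-root rule for the number of candidate variables are assumptions, since the number of candidates per node is not stated here, and all function names are hypothetical.

```python
import math
import random


def bootstrap_indices(n, rng):
    """Draw n subject indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def oob_indices(n, in_bag):
    """Subjects that never appear in the bootstrap sample."""
    bag = set(in_bag)
    return [i for i in range(n) if i not in bag]


def candidate_variables(n_vars, n_candidates, rng):
    """Random subset of variables considered at one node."""
    return rng.sample(range(n_vars), n_candidates)


rng = random.Random(2015)
n_subjects = 546                              # assumed size of one training set
n_vars = 11                                   # number of predictors in the study
n_candidates = math.ceil(math.sqrt(n_vars))   # assumed rule, not taken from the paper
n_trees = 1000

oob_counts = [0] * n_subjects
for _ in range(n_trees):
    bag = bootstrap_indices(n_subjects, rng)
    for i in oob_indices(n_subjects, bag):
        oob_counts[i] += 1

mean_oob_fraction = sum(oob_counts) / (n_subjects * n_trees)
theoretical = (1 - 1 / n_subjects) ** n_subjects   # tends to exp(-1) for large n

print(f"simulated OOB fraction:   {mean_oob_fraction:.3f}")
print(f"theoretical OOB fraction: {theoretical:.3f}")
print("candidate variables at one node:", candidate_variables(n_vars, n_candidates, rng))
```

The probability that a subject is left out of a bootstrap sample of size n is (1 − 1/n)^n, which approaches exp(−1) ≈ 0.368. Each subject is therefore in-bag for about 63% of the trees and out-of-bag for roughly 370 of 1000 trees.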

##### 2.4. Performance Statistics

###### 2.4.1. *C*-Index

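Let $(T_{1,h}, \delta_{1,h}), \ldots, (T_{n(h),h}, \delta_{n(h),h})$ be the survival times and the censoring status for the subjects in a terminal node $h$. Also, let $t_{1,h} < t_{2,h} < \cdots < t_{N(h),h}$ be the distinct event times in terminal node $h$. Define $d_{l,h}$ and $Y_{l,h}$ to be the number of deaths and subjects at risk at time $t_{l,h}$. The cumulative hazard function (CHF) estimate for terminal node $h$ is the Nelson-Aalen estimator

$$\hat{H}_h(t) = \sum_{t_{l,h} \le t} \frac{d_{l,h}}{Y_{l,h}}.$$

For the subject $i$ with a $p$-dimensional covariate $\mathbf{x}_i$,

$$H(t \mid \mathbf{x}_i) = \hat{H}_h(t), \quad \text{if } \mathbf{x}_i \in h.$$

In the RSF procedure, to estimate the CHF of subject $i$, define $I_{i,b} = 1$ if $i$ is an OOB case for the $b$th bootstrap sample; otherwise, $I_{i,b} = 0$. Let $H_b^{*}(t \mid \mathbf{x}_i)$ denote the CHF for subject $i$ in a tree grown from the $b$th bootstrap sample. The ensemble CHF for $i$ is

$$H_e^{**}(t \mid \mathbf{x}_i) = \frac{\sum_{b=1}^{B} I_{i,b}\, H_b^{*}(t \mid \mathbf{x}_i)}{\sum_{b=1}^{B} I_{i,b}},$$

where $B$ is the number of bootstrap samples.

The *C*-index is calculated using the following steps:

(1) Form all possible pairs of subjects.

(2) Consider only permissible pairs, by eliminating those pairs whose shorter survival time is censored and by eliminating pairs $i$ and $j$ with $T_i = T_j$ unless at least one is a death.

(3) For each permissible pair where $T_i \ne T_j$, count 1 if the shorter survival time has the higher predicted risk; count 0.5 if the predicted risks are tied. For each permissible pair where $T_i = T_j$ and both are deaths, count 1 if the predicted risks are tied; otherwise, count 0.5. For each permissible pair where $T_i = T_j$ but at least one is not a death, count 1 if the death has the higher predicted risk; otherwise, count 0.5. Let Concordance denote the sum over all permissible pairs.

(4) *C*-index = Concordance/permissible, where permissible is the number of permissible pairs.

In the survival tree, we say that $i$ has a higher predicted risk than $j$ if

$$\sum_{l=1}^{m} H(t_l \mid \mathbf{x}_i) > \sum_{l=1}^{m} H(t_l \mid \mathbf{x}_j),$$

where $t_1, \ldots, t_m$ are the unique event times in the data set. In RSF, the ensemble CHF $H_e^{**}(t \mid \mathbf{x}_i)$ is used instead of $H(t \mid \mathbf{x}_i)$ [16].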
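The calculation for a single survival tree can be sketched in Python. This is an illustration, not the code used in the study: the function names and the two hypothetical terminal nodes are assumptions. Tied survival times are kept when at least one subject of the pair died, which is the rule of [16] and the one the counting in step (3) requires.

```python
from itertools import combinations


def nelson_aalen(times, events):
    """Return H(t), the Nelson-Aalen cumulative hazard of one terminal node."""
    increments = []
    for t in sorted({u for u, e in zip(times, events) if e == 1}):
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        at_risk = sum(1 for u in times if u >= t)
        increments.append((t, deaths / at_risk))
    return lambda t: sum(inc for u, inc in increments if u <= t)


def c_index(times, events, risk):
    """Concordance / permissible; a higher risk value means a worse prognosis."""
    concordance, permissible = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] != times[j]:
            short, long_ = (i, j) if times[i] < times[j] else (j, i)
            if events[short] == 0:
                continue                      # shorter time censored: not permissible
            permissible += 1
            if risk[short] > risk[long_]:
                concordance += 1.0
            elif risk[short] == risk[long_]:
                concordance += 0.5
        else:
            if events[i] == 0 and events[j] == 0:
                continue                      # tied times without a death: not permissible
            permissible += 1
            if events[i] == 1 and events[j] == 1:
                concordance += 1.0 if risk[i] == risk[j] else 0.5
            else:
                dead, alive = (i, j) if events[i] == 1 else (j, i)
                concordance += 1.0 if risk[dead] > risk[alive] else 0.5
    if permissible == 0:
        raise ValueError("no permissible pairs")
    return concordance / permissible


# Hypothetical tree with two terminal nodes: (survival times, death indicators).
node_a = ([1, 2, 3], [1, 1, 0])
node_b = ([4, 5, 6], [1, 0, 1])
times = node_a[0] + node_b[0]
events = node_a[1] + node_b[1]
event_times = sorted({t for t, e in zip(times, events) if e == 1})

# Predicted risk = node CHF summed over the unique event times of the data set.
h_a, h_b = nelson_aalen(*node_a), nelson_aalen(*node_b)
risk_a = sum(h_a(t) for t in event_times)
risk_b = sum(h_b(t) for t in event_times)
risk = [risk_a] * len(node_a[0]) + [risk_b] * len(node_b[0])

print(f"C-index = {c_index(times, events, risk):.3f}")
```

Subjects in the same terminal node share one CHF, so their predicted risks are tied and each such permissible pair contributes 0.5. For RSF, the same `c_index` function would be applied with the risk computed from the ensemble CHF instead.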

A *C*-index of 0.5 is no better than random guessing, and a value of 1 denotes perfect discriminative ability. The 2.5th and 97.5th percentiles were taken as the lower and upper bounds of the CI for the final statistics.
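A percentile interval of this kind can be computed as in the following minimal sketch. The linear interpolation between order statistics is an assumed convention, since the exact percentile definition is not stated here, and the replicate values are synthetic placeholders, not results of this study.

```python
import random


def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


# Synthetic stand-ins for 200 replicate C-index values (NOT study results).
rng = random.Random(0)
replicates = [min(1.0, max(0.5, rng.gauss(0.70, 0.03))) for _ in range(200)]

ci_lower = percentile(replicates, 2.5)
ci_upper = percentile(replicates, 97.5)
print(f"95% percentile interval: ({ci_lower:.3f}, {ci_upper:.3f})")
```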

###### 2.4.2. IBS Statistics

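The Brier score at time $t$ is given by

$$\mathrm{BS}(t) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\hat{S}(t \mid \mathbf{x}_i)^2 \, I(T_i \le t, \delta_i = 1)}{\hat{G}(T_i)} + \frac{\left(1 - \hat{S}(t \mid \mathbf{x}_i)\right)^2 I(T_i > t)}{\hat{G}(t)} \right],$$

where $N$ is the number of subjects, $\hat{S}(t \mid \mathbf{x}_i)$ is the predicted survival probability of subject $i$ at time $t$, and $\hat{G}$ denotes the Kaplan-Meier estimate of the censoring survival function [17, 18].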

The prediction error curve is obtained by calculating the Brier score across time points. In addition, the integrated Brier score (IBS), which accumulates the prediction error curve over time, is given by

$$\mathrm{IBS} = \frac{1}{\max(t)} \int_{0}^{\max(t)} \mathrm{BS}(t)\, dt.$$

Lower values of IBS indicate better predictive performance. The 2.5th and 97.5th percentiles were taken as the lower and upper bounds of the CI for the final statistics.
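A minimal Python sketch of these two statistics is given below. It assumes the usual inverse-probability-of-censoring weighting, in which deaths up to time t are weighted by the censoring survival estimate at their own event time and subjects still at risk by its value at t, and it integrates the prediction error curve with the trapezoidal rule. The function names and the toy data are hypothetical; the analyses in this study were run with the R packages named in Section 2.6.

```python
def km_censoring(times, events):
    """Kaplan-Meier estimate G(t) of the censoring survival function.

    Censoring (event == 0) is treated as the event of interest.
    """
    steps, surv = [], 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        censored = sum(1 for u, e in zip(times, events) if u == t and e == 0)
        surv *= 1 - censored / at_risk
        steps.append((t, surv))

    def G(t):
        value = 1.0
        for u, s in steps:
            if u > t:
                break
            value = s
        return value

    return G


def brier_score(t, times, events, surv_at_t, G):
    """Weighted Brier score at time t; surv_at_t[i] is the predicted S(t | x_i)."""
    total = 0.0
    for T, d, s in zip(times, events, surv_at_t):
        if T <= t and d == 1:
            total += s ** 2 / G(T)            # died by t: survival should be 0
        elif T > t:
            total += (1 - s) ** 2 / G(t)      # still at risk: survival should be 1
        # subjects censored before t contribute nothing directly
    return total / len(times)


def integrated_brier_score(grid, scores):
    """Trapezoidal integral of the error curve divided by the length of the grid."""
    area = sum((grid[k + 1] - grid[k]) * (scores[k] + scores[k + 1]) / 2
               for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])


# Hypothetical toy data: event = 1 for death, 0 for censoring.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
G = km_censoring(times, events)

# An uninformative model that predicts a survival probability of 0.5 throughout.
grid = [0, 1, 2, 3, 4]
scores = [brier_score(t, times, events, [0.5] * len(times), G) for t in grid]
ibs = integrated_brier_score(grid, scores)
print("prediction error curve:", [round(s, 3) for s in scores])
print("IBS:", round(ibs, 3))
```

The weights compensate for the subjects lost to censoring, so that the score remains comparable to the one that would be obtained without censoring.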

##### 2.5. Impact of Method of Tree Construction and Data Set on Performance Statistics

As explained above, three methods were applied to construct the tree (ST, PT, and RSF). In addition, two data sets (training and test) were used to assess the performance. These two factors together created six scenarios with 200 replications in each. In each of the 1200 samples, the values of IBS and *C*-index were recorded. Two-way ANOVA was applied to assess the impact of the method of tree construction and of the data used for validation on the performance statistics.

##### 2.6. Software

We used the randomForestSRC and pec R packages for the analyses in this study.

#### 3. Results

Our data set comprised 607 patients with a mean age of 61.34 years (SD = 13.46). In total, 204 patients experienced the outcome of interest (death due to AMI). Table 1 provides information for the other 10 independent variables collected.