Computational and Mathematical Methods in Medicine

Volume 2017 (2017), Article ID 7340565, 19 pages

https://doi.org/10.1155/2017/7340565

## Integration of Multiple Genomic Data Sources in a Bayesian Cox Model for Variable Selection and Prediction

^{1}EXCO, Penzberg, Germany
^{2}Department of Statistics, TU Dortmund University, Dortmund, Germany
^{3}Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway

Correspondence should be addressed to Manuela Zucknick

Received 10 February 2017; Revised 23 April 2017; Accepted 11 May 2017; Published 30 July 2017

Academic Editor: Elisabeth Waldmann

Copyright © 2017 Tabea Treppmann et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Bayesian variable selection is becoming increasingly important in statistical analyses, in particular for variable selection in high dimensions. For survival time models in the presence of genomic data, however, this methodology is still relatively unexplored. One of the more recent approaches suggests a Bayesian semiparametric proportional hazards model for right censored time-to-event data. We extend this model to directly include variable selection, based on a stochastic search procedure within a Markov chain Monte Carlo sampler for inference. This equips us with an intuitive and flexible approach and provides a way to integrate additional data sources and further extensions. We make use of the possibility of implementing parallel tempering to help improve the mixing of the Markov chains. In our examples, we use this Bayesian approach to integrate copy number variation data into a gene-expression-based survival prediction model. This is achieved by formulating an informed prior based on copy number variation. We perform a simulation study to investigate the model’s behavior and prediction performance in different situations before applying it to a dataset of glioblastoma patients and evaluating the biological relevance of the findings.

#### 1. Introduction

In cancer research, we often deal with time-to-event endpoints. As advances in technology enable the systematic collection of different genome-wide data, interest grows in integrative statistical analyses, that is, analyses that use more than one information source to obtain a more comprehensive understanding of the biology of diseases and to improve the performance of risk prediction models.

Recently, a lot of research has been done in the following three areas:

1. Cox proportional hazards models for survival (or time-to-event) data in high dimensions
2. Variable selection in high-dimensional problems
3. Integrative analyses of several data sources

The novelty of our approach is the combination of recent advances in these three areas in one Bayesian model, as outlined below.

To model *survival data*, Cox (1972) [1] developed the semiparametric proportional hazards regression model, which relates covariates to the hazard function. The Cox model has been widely used and analyzed in low-dimensional settings for this purpose; see, for example, Harrell Jr. (2001) [2], Klein et al. (2013) [3], or Ibrahim et al. (2005) [4]. In biological applications with genomic data, however, we are often in a high-dimensional setting, that is, having more variables than subjects. Therefore, we need a high-dimensional survival time model. One recent approach in this context was suggested by Lee et al. (2011) [5], who use a Bayesian version of the Cox model for right censored survival data, where high dimensions are handled by regularization of the regression coefficient vector imposed by Laplace priors. This corresponds to the lasso penalty (see Tibshirani (1997) [6] or Park and Casella (2008) [7]), which shrinks regression coefficients towards zero and thus allows parameter inference in problems where the number of variables is larger than the number of subjects. Since the automatic variable selection property of the lasso is lost in fully Bayesian inference, Lee et al. (2011) [5] adopted a post hoc approach to identify the most important variables by thresholding based on the Bayesian Information Criterion.

Since *variable selection* is a core question in many statistical applications, it has been the subject of a lot of research, and many approaches exist, especially for linear models. In low-dimensional settings and for frequentist inference, the most common procedures are best subset selection and backward or forward selection (Harrell Jr. (2001) [2], Hocking (1976) [8]). Best subset selection requires evaluating $2^p$ candidate models, which becomes infeasible in higher dimensions (large $p$). In high dimensions, classical backward selection cannot be applied since the full model is not identified, and both backward and forward selection will typically only explore a very small proportion of all possible models. In addition, none of these approaches incorporates shrinkage in the estimation procedure. Bayesian approaches offer a good alternative: they stochastically search over the whole parameter space, implicitly taking model uncertainty into account; see Held et al. (2016) [9] for a recent evaluation study in the context of Cox regression models. One appealing approach often used in regression analyses is the stochastic search variable selection (SSVS) of George and McCulloch (1993) [10], a flexible and intuitive method which makes use of data augmentation for the selection task and incorporates shrinkage.

For biological information on the molecular level, many *different data sources* exist nowadays, and they often provide shared information. For example, expressed genes are transcribed and translated into proteins, which carry out the different functions of the cells or the body; if unexpected or unusual changes in the expression levels occur, the functionality of the cells can be disturbed. Cancer is often caused by changes in the DNA, for example, single-base mutations or copy number changes in larger genomic regions, which in turn will have an effect on gene expression. Therefore, jointly including such data sources in the analyses can lead to more accurate results, and Bayesian approaches offer a natural framework for doing so.

*In our approach*, we combine the three tasks mentioned above in one model: variable selection in a high-dimensional survival time model based on an integrative analysis. In particular, we integrate copy number variation (CNV) data with gene expression data, aiming to jointly use their respective advantages to achieve sparse, well-interpretable models and good prediction performance. We combine the variable selection procedure of George and McCulloch (1993) [10] with the Cox proportional hazards model of Lee et al. (2011) [5] and use CNV data for the construction of an informed prior. We investigate the use of parallel tempering methods to improve the mixing of the Markov chains and to circumvent the manual tuning of hyperprior parameters.

In the following, we describe the details of the model in Section 2, including the technical details, the sampler with extensions, and diagnostics. Afterwards, we describe the synthetic data as well as the real dataset on glioblastoma, and we state the prior settings chosen for the simulation study and for the real data analysis. In Section 3, we describe the most important findings for the application to synthetic and real data, including findings regarding the extracted genes for glioblastoma patients, and discuss the results, before drawing conclusions in Section 4.

#### 2. Materials and Methods

##### 2.1. Model and MCMC Sampling Procedure

Based on the general semiparametric proportional hazards model introduced by Cox (1972) [1], Lee et al. (2011) [5] developed a Bayesian version for right censored survival time data in high dimensions ($p > n$), with $p$ being the number of variables, $n$ the number of subjects, $T$ the survival time of a subject with covariable vector $x$, $\beta$ the vector of regression parameters, and $h_0(t)$ the unspecified arbitrary baseline hazard function. Lee et al. (2011) constructed a grouped likelihood for their model based on a finite partition of the time axis, $0 = s_0 < s_1 < \dots < s_J$, in this case choosing the breaks as the points at which at least one event occurred and defining the last interval so that the last event lies in its middle. This leads to the grouped data likelihood introduced by Burridge (1981) [11],

$$L(D \mid \beta, h) \propto \prod_{j=1}^{J}\exp\Bigl(-h_j \sum_{k \in R_j - D_j} \exp\bigl(x_k^T \beta\bigr)\Bigr) \prod_{l \in D_j}\Bigl[1 - \exp\bigl(-h_j \exp\bigl(x_l^T \beta\bigr)\bigr)\Bigr]. \tag{1}$$

Here, $D$ denotes the observed data, $R_j$ and $D_j$ are the risk set and the event set corresponding to the $j$th interval, and $h_j$ is the increment of the cumulative baseline hazard in this interval. The $h_j$ are assigned independent $G(\alpha_j, \lambda)$ priors, where $G(\alpha, \lambda)$ describes a Gamma distribution with shape $\alpha$ and scale $\lambda$, with $\alpha_j = c_0\,[H^*(s_j) - H^*(s_{j-1})]$ and $\lambda = 1/c_0$, and where $H^*$ is a monotonously increasing function with $H^*(0) = 0$. $H^*$ represents an initial estimate for the cumulative baseline hazard function $H_0$. The constant $c_0$ specifies how strong the belief in this initial estimate of the cumulative baseline hazard function is. Mostly, a known parametric function for $H^*$ is used, for example, the cumulative hazard of the Weibull distribution, which then leads to the following form:

$$H^*(t) = \eta t^{\kappa}. \tag{2}$$

The hyperparameters have to be carefully chosen, though, to avoid convergence problems within the MCMC sampling [5].
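As a minimal sketch of this construction (the paper's own implementation is in R and available from the authors; here we use Python purely for illustration), the following builds the interval partition from the event times and the shape parameters of the gamma increment priors under a Weibull initial guess. Our reading of the last-interval rule, and all parameter values, are assumptions for illustration.

```python
import numpy as np

def weibull_cum_hazard(t, eta, kappa):
    """Weibull cumulative baseline hazard H*(t) = eta * t**kappa."""
    return eta * np.asarray(t, dtype=float) ** kappa

def make_partition(event_times):
    """Breaks 0 = s_0 < s_1 < ... < s_J placed at the distinct event
    times, with the last break extended so that the final event lies
    at the midpoint of the last interval (our reading of the rule)."""
    e = np.unique(np.asarray(event_times, dtype=float))
    if len(e) < 2:
        raise ValueError("need at least two distinct event times")
    return np.concatenate(([0.0], e[:-1], [2.0 * e[-1] - e[-2]]))

def gamma_prior_shapes(s, eta, kappa, c0):
    """Shape parameters c0 * (H*(s_j) - H*(s_{j-1})) of the independent
    gamma priors for the cumulative hazard increments h_j."""
    return c0 * np.diff(weibull_cum_hazard(s, eta, kappa))
```

Note how a larger $c_0$ inflates all shape parameters while the increment means stay fixed, encoding stronger belief in the initial Weibull guess.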

The implicit shrinkage of the model and the variable selection are achieved through the stochastic search variable selection procedure of George and McCulloch (1993) [10]. Assuming equal variances for the regression coefficients of all variables which are included in the model, the prior distribution for $\beta_j$ conditioned on $\gamma_j$ is as follows:

$$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\,N(0, \tau^2) + \gamma_j\,N(0, c^2\tau^2), \tag{3}$$

where the variance parameter $\tau^2$ is small, $c \gg 1$, and $\gamma = (\gamma_1, \dots, \gamma_p)$ represents an indicator vector, analogous to the concept of data augmentation (Tanner and Wong, 1987 [12]), giving the state of each variable of being in the model or not.
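This normal mixture ("spike-and-slab") prior can be written down directly; the following Python sketch evaluates its density for a given indicator value. The values tau = 0.1 and c = 10 are illustrative placeholders, not the settings used in the paper.

```python
import math

def spike_slab_density(beta_j, gamma_j, tau=0.1, c=10.0):
    """Density of the normal mixture prior for beta_j given gamma_j:
    N(0, tau^2) if the variable is excluded (gamma_j = 0),
    N(0, c^2 * tau^2) if it is included (gamma_j = 1)."""
    sd = c * tau if gamma_j == 1 else tau
    return math.exp(-0.5 * (beta_j / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
```

With a small tau, the spike concentrates excluded coefficients near zero, while the slab (variance $c^2\tau^2$) leaves included coefficients nearly unshrunken.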

As in Lee et al. (2011) [5], we compared three possible samplers for updating $\beta_j$ from its full conditional distribution $\pi(\beta_j \mid \beta_{-j}, \gamma, h, D)$ ($j = 1, \dots, p$): the Adaptive Rejection Sampling algorithm proposed by Gilks (1992) [13], the Adaptive Rejection Metropolis Sampler from Gilks et al. (1995) [14], and the special random walk Metropolis-Hastings (RW-MH) method with adaptive jumping rules proposed by Lee et al. (2011) [5]. We also found the adaptive random walk Metropolis-Hastings sampler to perform best in our applications, which are high-dimensional with more variables than samples ($p > n$). We therefore only report the results for the adaptive RW-MH sampler.

The $\gamma_j$ are assumed to be independent Bernoulli($\pi_j$) a priori; that is,

$$P(\gamma_j = 1) = \pi_j, \quad P(\gamma_j = 0) = 1 - \pi_j, \quad j = 1, \dots, p. \tag{4}$$

The conditional distributions of the $\gamma_j$ for the MCMC sampler are determined by

$$P(\gamma_j = 1 \mid \beta_j) = \frac{a_j}{a_j + b_j}, \tag{5}$$

where $a_j = \varphi(\beta_j; 0, c^2\tau^2)\,\pi_j$ and $b_j = \varphi(\beta_j; 0, \tau^2)\,(1 - \pi_j)$, with $\varphi(\cdot\,; 0, \sigma^2)$ being the density of the normal distribution with mean $0$ and variance $\sigma^2$ corresponding to (3).
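The resulting Gibbs update for each indicator is thus a Bernoulli draw whose success probability compares the slab and spike densities at the current coefficient value. A Python sketch (parameter values are again purely illustrative):

```python
import math

def normal_pdf(x, sd):
    """Density of N(0, sd^2) at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def inclusion_prob(beta_j, pi_j=0.5, tau=0.1, c=10.0):
    """P(gamma_j = 1 | beta_j): slab density weighted by the prior
    inclusion probability, normalised against the spike alternative."""
    a = normal_pdf(beta_j, c * tau) * pi_j
    b = normal_pdf(beta_j, tau) * (1.0 - pi_j)
    return a / (a + b)
```

A coefficient currently near zero favours the spike and hence exclusion, while a coefficient far from zero is far more likely under the slab, so the indicator is switched on with high probability.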

According to Ibrahim et al. (2005) [4], the full conditional distribution of $h_j$, with $j = 1, \dots, J$, can be well approximated by a gamma distribution whose parameters involve the number of censorings in interval $j$.

Finally, we use a Gibbs sampler to update $\beta$, $\gamma$, and $h$ iteratively according to the full conditional distributions described above.

##### 2.2. Extension of MCMC Sampling Procedure

For multimodal posterior distributions, problems may occur during the MCMC sampling, because the areas of the model space with higher posterior probability might be separated by a low-probability region which the MCMC sampler might not manage to cross. Therefore, there is a risk that important values are never sampled, because the MCMC sampler never visits the relevant region of the model space. Parallel tempering [15, 16] can alleviate this problem. Even in unimodal situations, parallel tempering can help by broadening the area of the sampling. This is done through the parallel generation of different MCMC chains with their own stationary distributions, where at regular intervals (after a predetermined number of MCMC iterations) a swap of states (i.e., of the current values of all parameters in the model) between two neighboring chains is proposed. The distributions of all chains have the same basic form as the original but are flatter. This is achieved by raising the original density function to a power $1/T$ with values between 0 and 1, with 0 (for $T \to \infty$) corresponding to a complete flattening of the distribution and 1 (for $T = 1$) corresponding to the desired target. This can improve the sampling performance in two ways: (a) the flattened probability distribution covers more of the parameter space with sufficiently large probability to be reached by the sampler in a given number of iterations, and (b) the “hills” and “valleys” of a multimodal probability density will be less steep, thus reducing the likelihood that the sampler gets stuck in local optima (which in turn improves its mixing performance). For historical reasons, the parameter $T$ is usually referred to as a *temperature parameter*.

At regular intervals (in our applications after every tenth MCMC iteration), two neighboring chains are selected randomly, and the Metropolis-Hastings acceptance probability is calculated based on the target distributions and the current states of the chains to determine whether a swap of the states between these two chains is accepted.

Let $\pi_i$ and $\pi_j$ be the respective target distributions of the selected chains with current parameter states $\theta_i$ and $\theta_j$. The acceptance probability of swapping the states is given by

$$\alpha = \min\Bigl(1, \frac{\pi_i(\theta_j)\,\pi_j(\theta_i)}{\pi_i(\theta_i)\,\pi_j(\theta_j)}\Bigr).$$

Within the Metropolis update, $\alpha$ is compared with a uniform random variable $u$ on the interval $[0, 1]$, where $u \le \alpha$ means that the swap will be accepted. The probability of a chain to swap to another state therefore only depends on the current states of the compared chains [17].
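This swap step can be sketched in a few lines of Python, working on the log scale for numerical stability (the function names are ours):

```python
import math

def swap_accept_prob(logp_i, logp_j, theta_i, theta_j):
    """Metropolis acceptance probability for exchanging the current
    states of two chains with log target densities logp_i and logp_j."""
    log_ratio = (logp_i(theta_j) + logp_j(theta_i)
                 - logp_i(theta_i) - logp_j(theta_j))
    return math.exp(min(0.0, log_ratio))  # min(1, ratio), on the log scale
```

The probability equals 1 whenever the swap moves each state to a chain under which it is at least as probable, so exchanges of similar states are almost always accepted.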

In this manuscript, we use log-linear temperature scales, that is, temperatures $T_1 < T_2 < \dots < T_L$ that are equally spaced on the logarithmic scale. The original, untempered chain is hence given by $T_1 = 1$. The distributions of the tempered versions are determined so that the standard deviation of the normal mixture prior of $\beta$ (equation (3)) will be broadened, which is achieved by multiplying the parameter $\tau$ in the prior by a temperature-dependent factor for the chains $l = 2, \dots, L$.
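A log-linear ladder is straightforward to generate; the maximum temperature below is an arbitrary illustrative choice, not a value from the paper:

```python
def log_linear_temperatures(num_chains, t_max):
    """Temperatures T_1, ..., T_L equally spaced on the log scale,
    from the untempered chain (T_1 = 1) up to t_max."""
    return [t_max ** (l / (num_chains - 1)) for l in range(num_chains)]
```

For example, four chains with t_max = 8 give temperatures 1, 2, 4, and 8; the spacing can then be tuned until the observed swap acceptance rate falls into the recommended range discussed below.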

It is recommended to choose the temperatures so that the acceptance rate lies between 20% and 50%, since different studies have shown that rates in this range deliver the most satisfactory results (e.g., [16, 18, 19]).

##### 2.3. Prior Settings

For the application of the Bayesian model, several prior specifications are needed. We start with the hyperparameters $\eta$ and $\kappa$, which are chosen so that $H^*(t)$ in (2) is similar to the Nelson-Aalen estimator of the cumulative hazard function, which therefore provides an initial guess for $H_0$. For this, we determine the scale and shape parameters of the Weibull distribution from the survival model estimated on the event times of the training data without covariable information. For the update of the cumulative baseline hazard within the iterations of the MCMC chains, the hyperparameter $c_0$, which describes the level of certainty associated with $H^*$, has to be specified. We follow the suggestion by Lee et al. (2011) [5] for the choice of $c_0$. We have previously performed a sensitivity analysis to investigate the influence of the choice of $c_0$ (Zucknick et al., 2015 [20]), where we found that while there was a notable influence on the posterior estimates of the baseline hazard $h$, the posterior distributions of $\beta$ were nearly unchanged.
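The Nelson-Aalen estimator used for this initial guess accumulates, at each distinct event time, the number of events divided by the number still at risk. A minimal Python sketch for right censored data:

```python
import numpy as np

def nelson_aalen(times, status):
    """Nelson-Aalen estimate of the cumulative hazard: at each distinct
    event time t, add d(t) / n(t), the number of events at t divided by
    the number at risk at t. status: 1 = event, 0 = censored."""
    times = np.asarray(times, dtype=float)
    status = np.asarray(status, dtype=int)
    estimate, cum = [], 0.0
    for t in np.unique(times[status == 1]):
        d = int(np.sum((times == t) & (status == 1)))
        n_at_risk = int(np.sum(times >= t))
        cum += d / n_at_risk
        estimate.append((float(t), cum))
    return estimate
```

Fitting the Weibull form $\eta t^{\kappa}$ to these step values then yields the initial guess for the hyperparameters.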

The parameters $\tau$ and $c$ of the normal mixture distribution of $\beta$ in (3), conditioned on $\gamma$ as in (4), are set to fixed small and large values, respectively. This implies a standard deviation of $c\tau$ for the slab component of $\beta_j$ and a corresponding 95% probability interval of approximately $\pm 1.96\,c\tau$.

The specifications of the prior probabilities for the selection of the variables are described in Section 2.5, separately for the simulation scenarios and for the glioblastoma data application.

##### 2.4. Posterior Estimation and Prediction

We report the posterior distributions of $\beta$ and $\gamma$ in terms of their posterior means and standard deviations. In order to select the most relevant variables, we choose an inclusion criterion in an automated, data-dependent way, which respects the prior model setup instead of applying one cutoff to all cases. This is done by first calculating the mean model size $k$ (by rounding the average number of selected variables per iteration) and then choosing the $k$ variables with the highest marginal posterior selection probabilities.
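Given the sampled indicator vectors, this selection rule reduces to a few lines; here is a Python sketch (variable names are ours):

```python
import numpy as np

def select_variables(gamma_samples):
    """gamma_samples: (iterations x p) 0/1 array of sampled inclusion
    indicators. Returns the indices of the k variables with the highest
    marginal selection probabilities, where k is the rounded mean model
    size, together with the probabilities themselves."""
    g = np.asarray(gamma_samples)
    k = int(round(float(g.sum(axis=1).mean())))  # mean model size
    probs = g.mean(axis=0)                       # marginal P(gamma_j = 1)
    top = np.argsort(-probs)[:k]
    return sorted(top.tolist()), probs
```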

We used the empty model, with $\gamma_j = 0$ for all $j$, as the starting value of the MCMC chains.

The results of the simulation study are based on single MCMC chains with 100,000 iterations each, after removal of 20,000 iterations (“burn-in”). The results for the glioblastoma data application are based on a combined analysis of five Markov chains, each of length 90,000 after removal of 10,000 initial iterations (“burn-in”). For the parallel tempering (only applied to the simulated data), we included four chains with 30,000 iterations each and log-linear temperature scales.

We evaluated the mixing and convergence properties of the Markov chains in several ways. We used graphical evaluations of running mean plots of the individual parameters as well as trace plots for summary measures such as the norm of the $\beta$ vector, the model size, and the log likelihood. Additionally, we calculated the effective sample sizes ([21]) for each $\beta_j$. The R package coda [22] offers a wide variety of graphics and diagnostic measures to assess both the mixing and convergence of MCMC chains.

We evaluate the prediction accuracy of the models chosen this way by prediction error curves and by computing the integrated Brier score (IBS) [23, 24] and comparing them with the reference approach, which is the Kaplan-Meier estimator without any covariates. The Brier score is a strictly proper scoring rule, since it takes its minimum when the true survival probabilities are used as predictions [24, 25]. It therefore measures both discrimination and calibration, contrary to other common measures of evaluation such as Harrell’s C-index (which only measures discrimination) and the calibration slope (which measures calibration); see, for example, Held et al., 2016 [9].
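As a simplified illustration of this evaluation (ignoring the inverse-probability-of-censoring weights that the cited references use to handle censored observations), the Brier score at a time point and its integral over a time grid can be computed as follows:

```python
import numpy as np

def brier_score(t, obs_times, surv_pred):
    """Brier score at time t for uncensored data: mean squared difference
    between the survival indicator 1{T_i > t} and the predicted
    survival probability P(T_i > t)."""
    y = (np.asarray(obs_times, dtype=float) > t).astype(float)
    return float(np.mean((y - np.asarray(surv_pred, dtype=float)) ** 2))

def integrated_brier_score(grid, obs_times, surv_pred_fn):
    """Trapezoidal integral of the Brier score over a time grid,
    normalised by the length of the time window."""
    grid = np.asarray(grid, dtype=float)
    bs = np.array([brier_score(t, obs_times, surv_pred_fn(t)) for t in grid])
    area = float(np.sum(0.5 * (bs[1:] + bs[:-1]) * np.diff(grid)))
    return area / (grid[-1] - grid[0])
```

The Kaplan-Meier reference corresponds to plugging in the same covariate-free survival curve for every subject.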

The implementation of the model and the evaluations were done in the statistical computing environment R [26] and are available upon request from the authors.

##### 2.5. Data

###### 2.5.1. Simulated Data

To obtain simulated data for our survival time model, we generated two different datasets, representing a sparse and a nonsparse scenario for the true predictors. For the simulation of the survival data, we used the procedure described in Zucknick et al. (2015) [20] for the high-dimensional case. This setup is based on the approach of Bender et al. (2005) [27], following a Cox-Weibull survival model with known regression coefficients and an arbitrary nonzero baseline hazard rate, taking into account the general relation between the hazard function and the survival time in the Cox model. We simulated blockwise correlated variables, with a constant pairwise correlation between variables within blocks of a given size and independence across blocks.
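Blockwise correlated covariates can be generated from the Cholesky factor of a compound-symmetric correlation matrix per block; a Python sketch (the block layout and all values are illustrative, not those of the paper):

```python
import numpy as np

def blockwise_corr_covariates(n, p, block_size, rho, rng):
    """Draw an n x p covariate matrix with pairwise correlation rho
    between variables inside consecutive blocks of size block_size
    and independence across blocks."""
    X = np.empty((n, p))
    for start in range(0, p, block_size):
        b = min(block_size, p - start)
        Sigma = np.full((b, b), rho)
        np.fill_diagonal(Sigma, 1.0)
        L = np.linalg.cholesky(Sigma)
        X[:, start:start + b] = rng.standard_normal((n, b)) @ L.T
    return X
```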

In short, we first simulate the hypothetical survival times $T_i$ ($i = 1, \dots, n$) that would be observed without the presence of censoring, and the censoring times $C_i$, which are generated to be uninformative as a mixture of uniform administrative censoring and exponential loss to follow-up. Note that the scale and shape parameters $\eta$ and $\kappa$ are chosen such that the survival probabilities at 12 and 36 time units are 0.9 and 0.5, respectively. For more details, we refer to Zucknick et al. (2015) [20].

Then, for each subject $i$, the individual observed time to event or censoring and the corresponding survival status are defined as

$$t_i = \min(T_i, C_i), \qquad \delta_i = \mathbb{1}(T_i \le C_i).$$
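The whole generating mechanism can be sketched by inverting the Weibull cumulative hazard as in Bender et al. (2005); for brevity, censoring is drawn here from an exponential distribution only, rather than the paper's uniform/exponential mixture, and all parameter values are illustrative:

```python
import numpy as np

def simulate_cox_weibull(X, beta, eta, kappa, cens_mean, rng):
    """Draw survival times from a Cox model with Weibull baseline
    cumulative hazard H0(t) = eta * t**kappa by inversion:
    T = (-log(U) / (eta * exp(x'beta)))**(1/kappa), U ~ Uniform(0, 1).
    Returns observed times min(T, C) and event indicators 1{T <= C}."""
    u = rng.uniform(size=X.shape[0])
    T = (-np.log(u) / (eta * np.exp(X @ beta))) ** (1.0 / kappa)
    C = rng.exponential(cens_mean, size=X.shape[0])  # simplified censoring
    return np.minimum(T, C), (T <= C).astype(int)
```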

For both scenarios, we generate a training dataset for model fitting and a test dataset to evaluate the prediction performance of the final models. The generated datasets comprise $p$ genomic variables and $n$ subjects. In the *sparse setting*, a small number of prognostic variables have nonzero true effects, analogous to the setup of Zucknick et al. (2015) [20]; these first variables are simulated to be related to the response (called “predictors” throughout the manuscript). For the *nonsparse setting*, we randomly generated a larger number of effects within a fixed range, split equally between negative and positive values; in this setting, the first variables of the dataset again represent the true predictors. See Tables 1 and 2 for an overview of all simulation scenarios.