Computational and Mathematical Methods in Medicine

Volume 2019, Article ID 7037230, 15 pages

https://doi.org/10.1155/2019/7037230

## A Simulation Study Comparing Different Statistical Approaches for the Identification of Predictive Biomarkers

Technical University of Munich, School of Medicine, Institute of Medical Informatics, Statistics and Epidemiology, Ismaninger Str. 22, 81675 Munich, Germany

Correspondence should be addressed to Bernhard Haller; ed.mut@rellah.drahnreb

Received 1 February 2019; Accepted 22 May 2019; Published 13 June 2019

Guest Editor: Tomas Krilavičius

Copyright © 2019 Bernhard Haller et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identification of relevant biomarkers that are associated with a treatment effect is one requirement for adequate treatment stratification and consequently to improve health care by administering the best available treatment to an individual patient. Various statistical approaches were proposed that allow assessing the interaction between a continuous covariate and treatment. Nevertheless, categorization of a continuous covariate, e.g., by splitting the data at the observed median value, appears to be very prevalent in practice. In this article, we present a simulation study considering data as observed in a randomized clinical trial with a time-to-event outcome performed to compare properties of such approaches, namely, Cox regression with linear interaction, Multivariable Fractional Polynomials for Interaction (MFPI), Local Partial-Likelihood Bootstrap (LPLB), and the Subpopulation Treatment Effect Pattern Plot (STEPP) method, and of strategies based on categorization of continuous covariates (splitting the covariate at the median, splitting at quartiles, and using an “optimal” split by maximizing a corresponding test statistic). In different scenarios with no interactions, linear interactions or nonlinear interactions, type I error probability and the power for detection of a true covariate-treatment interaction were estimated. The Cox regression approach was more efficient than the other methods for scenarios with monotonous interactions, especially when the number of observed events was small to moderate. When patterns of the biomarker-treatment interaction effect were more complex, MFPI and LPLB performed well compared to the other approaches. Categorization of data generally led to a loss of power, but for very complex patterns, splitting the data into multiple categories might help to explore the nature of the interaction effect. Consequently, we recommend application of statistical methods developed for assessment of interactions between continuous biomarkers and treatment instead of arbitrary or data-driven categorization of continuous covariates.

#### 1. Introduction

For medical decision making, predictive biomarkers play an important role for various diseases [1–4]. A biomarker is called “predictive,” if the difference between the effectiveness of two or more treatment options depends on the value of that biomarker [5, 6]. In the presence of a qualitative biomarker-treatment interaction [7], i.e., when the choice of the “optimal” treatment for a given patient depends on the patient’s value of a certain biomarker, the biomarker can be used for treatment stratification [8]. Biomarkers used in clinical practice for treatment stratification are, e.g., the human epidermal growth factor receptor 2 (HER-2) status for breast cancer patients [9, 10] or presence of epidermal growth factor receptor (EGFR) mutation in non-small cell lung cancers (NSCLC) [11]. Consequently, the identification of biomarkers that allow prediction of the treatment effect when different treatment options are available is essential to increase clinical decision making in the sense of a stratified or personalized medicine [12].

In practice, investigation of such treatment effect heterogeneity over the range of a certain biomarker in data obtained from a randomized clinical trial is often performed by subgroup analyses [13], where the difference in outcome between the study groups, quantified, e.g., by a hazard ratio, an odds ratio, or a mean difference, is estimated for patient subgroups with similar characteristics [14] and compared using a statistical test for interaction, which can be performed by including the product of the biomarker and the variable indicating treatment allocation in an appropriate regression model [15, 16]. While this procedure is intuitive and straightforward for categorical variables, e.g., gender or presence of comorbidities as diabetes, investigation of treatment effect heterogeneity with respect to continuous variables, e.g., age or continuously measured blood parameters, requires categorization of the variable, when subgroup analyses are to be performed. Such categorization of continuous variables was criticized due to loss of information leading to a loss of power for detection of true interactions, implication of biological implausible effects, and lack of comparability of results from different studies [17, 18]. Therefore, various approaches were proposed in the literature that allow to model and test for treatment effect heterogeneity over the range of a continuous variable that do not require categorization of the variable [19–21].

In this article, we describe a simulation study comparing different approaches for detection of an interaction between one (predefined) continuous covariate and treatment. We simulated data as they would be expected to be collected in a randomized clinical trial intended to compare efficacy of two treatment groups or of treatment versus placebo. Consequently, patients are allocated randomly into one of two treatment arms and the distribution of the variable of interest (often referred to as a biomarker [22]) is expected to be the same for both treatment groups. As most predictive biomarkers were identified for treatment of cancer [23], a time-to-event outcome is considered, as typically overall survival or progression-free survival is considered as primary endpoint in randomized phase III oncological trials [24]. Results obtained by methods relying on categorization of the continuous variable as well as methods that do not use such categorization were investigated. We considered a method splitting the continuous biomarker at its median to determine two subgroups for further analysis, the use of four subgroups determined by splitting data at the quartiles, and use of an “optimal” cutoff value found by maximization of the Wald statistics of the interaction term in a Cox regression model. Additionally, we applied the Subpopulation Treatment Effect Pattern Plot (STEPP) approach that incorporates overlapping subgroups [25], the Cox regression model [26] assuming a linear covariate-treatment interaction term, the Multivariable Fractional Polynomials for Interaction (MFPI) approach that incorporates nonlinear transformations for the interaction term [19], and the Local Partial-Likelihood Bootstrap (LPLB) that uses local estimates of the treatment effect at different values of the variable of interest [27]. Different scenarios with absence and presence of biomarker-treatment interactions were investigated in order to estimate and compare type I error probability and statistical power of the different approaches under the given scenarios. Sample size and censoring distribution are varying to investigate the impact of these characteristics on the outcome.

The article is organized as follows. In Section 2, the simulation study is described. The different methods used for identification of a biomarker-treatment interaction are shortly introduced in Section 2.1, and references to original articles and further articles including more detailed descriptions of the considered methods are given. The setting of the simulation study and the relevant aspects that were varied are described in Section 2.2. Results of the simulation study, namely, observed type I error probabilities for scenarios with no true biomarker-treatment interactions and estimates for statistical power for scenarios with truly present biomarker-treatment interactions, are presented in Section 3. A discussion of the results with concluding remarks and strengths and limitations of our simulation study is given in Section 4.

#### 2. Methods

The methods investigated in the simulation study are described in Section 2.1. Details on the settings used in the simulation study and the data generating process are given in Section 2.2. Data were generated and analysed using the statistical software R [28]. Cox regression was performed using the function *coxph* provided in the R library *survival* [29, 30]. For convenience, the continuous covariate of interest will be called “biomarker” and denoted as *Z* throughout the section. Treatment allocation will be represented by a binary treatment variable *T* with , where represents, e.g., an experimental treatment and a placebo control or standard treatment. As it appears to be the most relevant effect size in practice, homogeneity of the hazard ratio between the study groups in regard to the biomarker of interest was investigated. For all statistical tests, a significance level of was used. Exact 95% confidence intervals for rejection probabilities were calculated.

##### 2.1. Methods Used to Test for a Biomarker-Treatment Interaction

###### 2.1.1. Median Split

In many applications investigating treatment-effect heterogeneity in regard to a continuous biomarker, individuals are divided into two subgroups of equal size. This is achieved by splitting the data at the median of the biomarker *Z*. This procedure will be denoted as “Median split” in this article. A binary indicator variable that is assigned the value of one if the biomarker value is above or equal to the observed median and zero else is derived. To test for biomarker-treatment interaction, a Cox regression model with this indicator variable, the binary treatment indicator, and their product (the interaction term) is fitted to the data. The *p* value of the Wald test for the interaction term was used to decide whether the null hypothesis of no biomarker-treatment interaction can be rejected on the prespecified significance level of .

###### 2.1.2. Quartile Split

As an alternative approach, individuals were divided into four subgroups with splits at the corresponding quartiles of the biomarker of interest (“Quartile split”). The categorical variable indicating the corresponding subgroup was used as a dummy coded nominal independent variable in a Cox regression model. Additionally, the binary treatment indicator and an interaction term between the dummy coded categorical variable indicating the biomarker quartile and treatment were included. A likelihood ratio test with three degrees of freedom provided in the R library *car* [31] was performed to test for presence of a biomarker-treatment interaction.

###### 2.1.3. “Optimal” Split

For this approach, henceforth called “Optimal split”, an “optimal” cutoff value for splitting the continuous variable into two subgroups was determined in a first step. Of all possible cutoff values (restricted to a minimum subgroup size of of the overall sample size), the one leading to the largest value of the Wald statistic for the interaction term between the dichotomized biomarker and treatment in a Cox regression model also including the corresponding main effects as independent variables was used to define the subgroups for assessment of treatment effect heterogeneity. In a second step, these subgroups were treated as if they were predefined subgroups, and assessment of a biomarker-treatment interaction was performed as described for the Median split procedure in Section 2.1.1.

###### 2.1.4. Subpopulation Treatment Effect Pattern Plot (STEPP) Method

The Subpopulation Treatment Effect Pattern Plot (“STEPP”) method was proposed by Bonetti and Gelber [25]. In the STEPP procedure, heterogeneity of the treatment effect over the range of a biomarker of interest is assessed by estimating the effect in multiple overlapping subgroups. Additionally, methods for estimation of simultaneous confidence intervals and for testing the null hypothesis of no biomarker-treatment interaction were developed [25, 32]. Two different versions, a “tail-oriented” and a “sliding window” approach, were proposed initially. In our simulation study, we used the “sliding window” approach, where the number of individuals within two consecutive subgroups is held (approximately) constant by adding and eliminating the same number of observations and the number of observations overlapping between two consecutive subgroups is chosen a priori. For our analysis, the number of individuals within each subgroup was chosen to be and the number of overlapping individuals to be . So, the subgroup sizes were 50, 100, and 200 for scenarios with 250, 500, and 1000 observations, and the number of overlapping observations was 25, 50, and 100, respectively. This led to a total number of nine subgroups considered irrespective of the sample size. A test on homogeneity of the hazard ratio over all subgroups was performed to test for a biomarker-treatment interaction. A permutation test as recommended in [32] was conducted using 500 permutations for each simulated dataset. Further details on the STEPP procedure can be found in [33, 34]. For application of STEPP, the R library *stepp* [35] was used.

###### 2.1.5. Cox Regression Model with Linear Interaction

To avoid categorization of the continuous biomarker of interest *Z*, a Cox regression model [26] assuming a linear interaction between *Z* and treatment *T* was considered. This procedure implies that the log-hazard ratio between the study groups is linearly associated with the biomarker value. The main effects of the biomarker *Z*, the treatment group *T*, and their product were used as independent variables in a Cox regression model. The *p* value of the Wald test for the interaction term was considered to decide on rejection of the null hypothesis of no biomarker-treatment interaction. This procedure will be called “Cox model with linear interaction” or shortly “Cox (linear Int.)” throughout the article.

###### 2.1.6. Multivariable Fractional Polynomials for Interaction (MFPI)

To allow for nonlinear interaction terms, Royston and Sauerbrei proposed the Multivariable Fractional Polynomials for Interaction (“MFPI”) approach [19], which is based on the Multivariable Fractional Polynomials (MFP) approach presented by Royston and Altman [36]. A nonlinear transformation is considered for the biomarker of interest, and a model including main effects of treatment and the transformed biomarker as well as their interaction is compared to a model including only the corresponding main effects. In the original publication, a model with two polynomial transformations and (FP2) out of the set , where indicates a logarithmic transformation, was described. Identification of the best transformation was proposed to be determined in the model without an interaction term by finding the combination of transformations providing the highest (log-)likelihood value (later called flex1 approach). Based on the results of a simulation study [37, 38] considering a continuous outcome, an alternative approach with only one polynomial transformation (FP1) and separate determination of the best transformation in the model with and without interaction (flex3, potentially leading to nonnested models) was recommended. We applied both approaches, the FP2-flex1 and the FP1-flex3 approach, to our simulated data. To test for presence of a biomarker-treatment interaction, likelihood ratio tests comparing the models with and without interaction terms were performed for both strategies.

###### 2.1.7. Local Partial-Likelihood Bootstrap (LPLB)

Another method proposed in the literature for modelling nonlinear interaction effects between a continuous biomarker and treatment is the Local Partial-Likelihood Estimation proposed by Fan et al. [21]. Liu et al. developed a bootstrapping method, called Local Partial-Likelihood Bootstrap (“LPLB”), that allows to test for the presence of an overall treatment effect and to test whether the treatment effect is heterogeneous over the range of a continuous biomarker [27]. In the LPLB approach, linear approximations of the treatment effect estimate at a given biomarker value are obtained by first-order Taylor approximations using weighted data in the local neighbourhood of the biomarker value of interest. The proposed bootstrap test makes use of the residual bootstrap [39]. The obtained local estimates of the log-hazard ratio are compared to the estimate obtained from a standard Cox regression model assuming a constant treatment effect over the biomarker range. The maximum observed standardized difference of the local estimates to the constant log-hazard ratio is considered as test statistic. For our simulation study, we used the R library *lplb* [40] to apply the LPLB procedure. Local estimates were obtained for every decile of the empirical biomarker distribution. A bandwidth, indicating the amount of observations in the neighbourhood used for local estimation, of 0.2 was used and an Epanechnikov kernel was considered for weighting. Five hundred bootstrap samples were drawn for each generated dataset.

##### 2.2. Simulation Settings

Data were generated to mimic data observed in a randomized clinical trial primarily intended for comparison of two different treatment options. Consequently, simulated individuals were randomly allocated to one of two treatment groups with equal probability for each group. The covariate of interest was randomly generated from a uniform distribution with a minimum value of zero and a maximum value of one. Event times were drawn from an exponential distribution with the individual hazard rate depending on the allocated treatment group and the drawn covariate value as described in Section 2.2.1. Censoring times were drawn from exponential distributions with rates as described in Section 2.2.3. The lower value of the two time variables was allocated as observed time and an observed event was indicated, if the drawn event time was smaller than the corresponding censoring time, and a censored observation was indicated else.

###### 2.2.1. Functional Form

In order to estimate the type I error probability and the statistical power for detection of truly present interaction effects associated with the different approaches, different scenarios were investigated. Overall, six different functional forms were considered, two without presence of an interaction effect (Scenarios 1 and 2) and four scenarios considering different shapes of interaction terms (Scenarios 3 to 6). All scenarios are visualized in Figure 1, showing the hazard rates used for simulation of the event times in dependence of the biomarker value (dashed black and solid grey line and black scale/axis) and the resulting hazard ratios (using a logarithmic scale) between the treatment groups (red line and scale/axis).