Computational and Mathematical Methods in Medicine

Volume 2017 (2017), Article ID 1421409, 8 pages

https://doi.org/10.1155/2017/1421409

## Probing for Sparse and Fast Variable Selection with Model-Based Boosting

^{1}Department of Statistics, LMU München, München, Germany^{2}Department of Medical Informatics, Biometry and Epidemiology, FAU Erlangen-Nürnberg, Erlangen, Germany^{3}Department of Medical Biometry, Informatics and Epidemiology, University Hospital Bonn, Bonn, Germany

Correspondence should be addressed to Tobias Hepp

Janek Thomas and Tobias Hepp contributed equally to this work.

Received 9 February 2017; Accepted 13 April 2017; Published 31 July 2017

Academic Editor: Yuhai Zhao

Copyright © 2017 Janek Thomas et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We present a new variable selection method based on model-based gradient boosting and randomly permuted variables. Model-based boosting is a tool to fit a statistical model while performing variable selection at the same time. A drawback of the fitting lies in the need of multiple model fits on slightly altered data (e.g., cross-validation or bootstrap) to find the optimal number of boosting iterations and prevent overfitting. In our proposed approach, we augment the data set with randomly permuted versions of the true variables, so-called shadow variables, and stop the stepwise fitting as soon as such a variable would be added to the model. This allows variable selection in a single fit of the model without requiring further parameter tuning. We show that our probing approach can compete with state-of-the-art selection methods like stability selection in a high-dimensional classification benchmark and apply it on three gene expression data sets.

#### 1. Introduction

At the latest since the emergence of genomic and proteomic data, where the number of available variables is possibly far higher than the sample size , high-dimensional data analysis becomes increasingly important in biomedical research [1–4]. Since common statistical regression methods like ordinary least squares are unable to estimate model coefficients in these settings due to singularity of the covariance matrix, varying strategies have been proposed to select only truly influential, that is, informative, variables and discard those without impact on the outcome.

By enforcing sparsity in the true coefficient vector, regularized regression approaches like the* lasso* [5],* least angle regression* [6],* elastic net* [7], and* gradient boosting* algorithms [8, 9] perform variable selection directly in the model fitting process. This selection is controlled by tuning hyperparameters that define the degree of penalization. While these hyperparameters are commonly determined using resampling strategies like cross-validation, bootstrapping, and similar methods, the focus on minimizing the prediction error often results in the selection of many noninformative variables [10, 11].

One approach to address this problem is* stability selection* [12, 13], a method that combines variable selection with repeated subsampling of the data to evaluate selection frequencies of variables. While stability selection can considerably improve the performance of several variable selection methods including regularized regression models in high-dimensional settings [12, 14], its application depends on additional hyperparameters. Although recommendations for reasonable values exist [12, 14], proper specification of these parameters is not straightforward in practice as the optimal configuration would require a priori knowledge about the number of informative variables. Another potential drawback is that stability selection increases the computational demand, which can be problematic in high-dimensional settings if the computational complexity of the used selection technique scales superlinearly with the number of predictor variables.

In this paper, we propose a new method to determine the optimal number of iterations in model-based boosting for variable selection inspired by* probing*, a method frequently used in related areas of machine learning research [15–17] and the analysis of microarrays [18]. The general notion of probing involves the artificial inflation of the data with random noise variables, so-called* probes* or* shadow variables*. While this approach is in principle applicable to the lasso or least angle regression as well, it is especially attractive to use with more computationally intensive boosting algorithms, as no resampling is required at all. Using the first selection of a shadow variable as stopping criterion, the algorithm is applied only once without the need to optimize any hyperparameters in order to extract a set of informative variables from the data, thereby making its application very fast and simple in practice. Furthermore, simulation studies show that the resulting models in fact tend to be more strictly regularized compared to the ones resulting from cross-validation and contain less uninformative variables.

In Section 2, we provide detailed descriptions of the model-based gradient boosting algorithm as well as stability selection and the new probing approach. Results of a simulation study comparing the performance of probing to cross-validation and different configurations of stability selection in a binary classification setting are then presented in Section 3 before discussing the application of these methods on three data sets with measurements of gene expression levels in Section 4. Section 5 summarizes our findings and presents an outlook to extensions of the algorithm.

#### 2. Methods

##### 2.1. Gradient Boosting

Given a learning problem with a data set sampled i.i.d. from a distribution over the joint space , with a -dimensional input space and an output space (e.g., for regression and for binary classification), the aim is to estimate a function, , that maps elements of the input space to the output space as good as possible. Relying on the perspective on boosting as gradient descent in function space, gradient boosting algorithms try to minimize a given loss function, , that measures the discrepancy between a predicted outcome value of and the true . Minimizing this discrepancy is achieved by repeatedly fitting weak prediction functions, called* base learners*, to previous mistakes, in order to combine them to a strong ensemble [19]. Although early implementations in the context of machine learning focused specifically on the use of regression trees, the concept has been successfully extended to suit the framework of a variety of statistical modelling problems [8, 20]. In this model-based approach, the* base learners * are typically defined by semiparametric regression functions on to build an additive model. A common simplification is to assume that each base learner is defined on only one component of the input spaceFor an overview of the fitting process of model-based boosting see Algorithm 1.

*Algorithm 1 (model-based gradient boosting). *Starting at with a constant loss minimal initial value , the algorithm iteratively updates the predictor with a small fraction of the base learner with the best fit on the negative gradient of the loss function: (1)Set iteration counter .(2)While , compute the negative gradient vector of the loss function: (3)Fit every base learner separately to the negative gradient vector .(4)Find , that is, the base learner with the best fit: (5)Update the predictor with a small fraction of this component:

The resulting model can be interpreted as a generalized additive model with partial effects for each covariate contained in the additive predictor. Although the algorithm relies on two hyperparameters and , Bühlmann and Hothorn [9] claim that the* learning rate * is of minor importance as long as it is “sufficiently small,” with commonly used in practice.

The stopping criterion, , determines the degree of regularization and thereby heavily affects the model quality in terms of overfitting and variable selection [21]. However, as already outlined in the introduction, optimizing using common approaches like cross-validation results in the selection of many uninformative variables. Although still focusing on minimizing prediction error, using a 25-fold bootstrap instead of the commonly used 10-fold cross-validation tends to return sparser models without sacrificing prediction performance [22].

##### 2.2. Stability Selection

The weak performance of cross-validation regarding variable selection partly results from the fact that it pursues the goal of minimizing the prediction error instead of selecting only informative variables. One possible solution is the* stability selection* framework [12, 13], a very versatile algorithm that can be combined with all kinds of variable selection methods like gradient boosting, lasso, or forward stepwise selection. It produces sparser solutions by controlling the number of false discoveries. Stability selection defines an upper bound for the per-family error rate (PFER), for example, the expected number of uninformative variables included in the final model.

Therefore, using stability selection with model-based boosting means that Algorithm 1 is run independently on random subsamples of the data until either a predefined number of iterations is reached or different variables have been selected. Subsequently, all variables are sorted with respect to their selection frequency in the sets. The amount of informative variables is then determined by a user-defined threshold that has to be exceeded. A detailed description of these steps is given in Algorithm 2.

*Algorithm 2 (stability selection for model-based boosting [14]). *(1)For ,(a)draw a subset of size from the data;(b)fit a boosting model to the subset until the number of selected variables is equal to or the number of iterations reaches a prespecified number ().(2)Compute the selection frequencies per variable : where denotes the set of selected variables in iteration .(3)Select variables with a selection frequency of at least , which yields a set of stable covariates:

Following this approach, the upper bound for the PFER can be derived as follows [12]:With additional assumptions on exchangeability and shape restrictions on the distribution of simultaneous selection, even tighter bounds can be derived [13]. While this method is successfully applied in a large number of different applications [23–26], several shortcomings impede the usage in practice. First off, three additional hyperparameters , PFER, and are introduced. Although only two of them have to be specified by the user (the third one can be calculated by assuming equality in (7)), it is not intuitively clear which parameter should be left out and how to specify the remaining two. Even though recommendations for reasonable settings for the selection threshold [12] or the PFER [14] are proposed, the effectiveness of these settings is difficult to evaluate in practical settings. The second obstacle in the usage of stability selection is the considerable computational power required for calculation. Overall boosting models ([13] recommends ) have to be fitted and a reasonable has to be found as well, which will most likely require cross-validation. Even though this process can be parallelized quite easily, complex model classes with smooth and higher-order effects can become extremely costly to fit.

##### 2.3. Probing

The approach of adding* probes* or* shadow variables*, for example, artificial uninformative variables to the data, is not completely new and has already been investigated in some areas of machine learning. Although they share the underlying idea to benefit from the presence of variables that are known to be independent from the outcome, the actual implementation of the concept differs (see Guyon and Elisseeff (2003) [15] for an overview). An especially useful approach, however, is to generate these additional variables as randomly shuffled versions of all observed variables. These permuted variables will be called* shadow variables* for the remainder of this paper and are denoted as . Compared to adding randomly sampled variables, shadow variables have the advantage that the marginal distribution of is preserved in . This approach is tightly connected to the theory of permutation tests [27] and is used similarly for* all-relevant* variable selection with random forests [28].

Implementing the* probing* concept to the sequential structure of model-based gradient boosting is rather straightforward. Since boosting algorithms proceed in a greedy fashion and only update the effect which yields the largest loss reduction in each iteration, selecting a shadow variable essentially implies that the best possible improvement at this stage relies on information that is known to be unrelated to the outcome. As a consequence, variables that are selected in later iterations are most likely correlated to only by chance as well. Therefore, all variables that have been added prior to the first shadow variable are assumed to have a true influence on the target variable and should be considered informative. A description of the full procedure is presented in Algorithm 3.

*Algorithm 3 (probing for variable selection in model-based boosting). *(1)* Expand* the data set by creating randomly shuffled images for each of the variables such that where denotes the symmetric group that contains all possible permutations of .(2)* Initialize* a boosting model on the inflated data set and start iterations with .(3)* Stop if* the first is selected; see Algorithm 1 step ().(4)* Return* only the variables selected from the original data set .

The major advantage of this approach compared to variable selection via cross-validation or stability selection is that one model fit is enough to find informative variables and no expensive refitting of the model is required. Additionally, there is no need for any prespecification like the search space () for cross-validation or additional hyperparameters (, , PFER) for stability selection. However, it should be noted that, unlike classical cross-validation, probing aims at optimal variable selection instead of prediction performance of the algorithm. Since this usually involves stopping much earlier, the effect estimates associated with the selected variables are most likely strongly regularized and might not be optimal for predictions.

#### 3. Simulation Study

In order to evaluate the performance of our proposed variable selection method, we conduct a benchmark simulation study where we compare the set of nonzero coefficients determined by the use of shadow variables as stopping criterion to cross-validation and different configurations of stability selection. We simulate data points for variables from a multivariate normal distribution with Toeplitz correlation structure for all and . The response variable is then generated by sampling Bernoulli experiments with probability with the linear predictor for the th observation and all nonzero elements of sampled from . Since the total amount of nonzero coefficients determines the number of informative variables in the setting, it is denoted as .

Overall, we consider 12 different simulation scenarios defined by all possible combinations of , , and . Specifically, this leads to the evaluation of 2 low-dimensional settings with , 4 settings with , and 6 high-dimensional settings with . Each configuration is run times. Along with new realizations of and , we also draw new values for the nonzero coefficients in and sample their position in the vector in each run to allow for varying correlation patterns among the informative variables. For variable selection with cross-validation, -fold bootstrap (the default in mboost) is used to determine the final number of iterations. Different configurations of stability selection were tested to investigate whether and, if so, to what extent these settings affect the selection. In order to explicitly use the upper error bounds of stability selection, we decided to specify combinations with and and calculate from (7). Aside from the learning rate , which is set to for all methods, no further parameters have to be specified for the probing scheme. Two performance measures are considered for the evaluation of the methods with respect to variable selection: first, the true positive rate (TPR) as the fraction of (correctly) selected variables from all true informative variables and, second, the false discovery rate (FDR) as the fraction of uninformative variables in the set of selected variables. To ensure reproducibility the R package batchtools [29] was used for all simulations.

The results of the simulations for all settings are illustrated in Figure 1. With TPR and FDR on the -axis and -axis, respectively, solutions displayed in the top left corner of the plots therefore successfully separate informative variables from the ones without true effect on the response. Although already using a sparse cross-validation approach, the FDR of variable selection via cross-validation is still relatively high, with more than 50% false positives in the selected sets in the majority of the simulated scenarios. Whereas this seems to be mostly disadvantageous in the cases where , the trend to more greedy solutions leads to a considerably higher chance of identifying more of the truly informative variables if or with very high , however, still at the price of picking up many noise variables on the way. Pooling the results of all configurations considered for stability selection, the results cover a large area of the performance space in Figure 1, thereby probably indicating high sensitivity on the decisions regarding the three tuning parameters.