Computational Intelligence and Neuroscience

Volume 2017 (2017), Article ID 2747431, 10 pages

https://doi.org/10.1155/2017/2747431

## Ensembling Variable Selectors by Stability Selection for the Cox Model

^{1}School of Science, Xi’an University of Architecture and Technology, Xi’an, Shaanxi 710055, China

^{2}School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China

Correspondence should be addressed to Qing-Yan Yin

Received 13 May 2017; Revised 18 August 2017; Accepted 29 October 2017; Published 15 November 2017

Academic Editor: Paolo Gastaldo

Copyright © 2017 Qing-Yan Yin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

As a pivotal tool to build interpretive models, variable selection plays an increasingly important role in high-dimensional data analysis. In recent years, variable selection ensembles (VSEs) have gained much interest due to their many advantages. Stability selection (Meinshausen and Bühlmann, 2010), a VSE technique based on subsampling in combination with a base algorithm like lasso, is an effective method to control the false discovery rate (FDR) and to improve selection accuracy in linear regression models. By adopting lasso as a base learner, we attempt to extend stability selection to handle variable selection problems in a Cox model. According to our experience, it is crucial to set the regularization region $\Lambda$ in lasso and the parameter $\lambda_{\min}$ properly so that stability selection can work well. To the best of our knowledge, however, there is no literature addressing this problem in an explicit way. Therefore, we first provide a detailed procedure to specify $\Lambda$ and $\lambda_{\min}$. Then, some simulated and real-world data with various censoring rates are used to examine how well stability selection performs. It is also compared with several other variable selection approaches. Experimental results demonstrate that it achieves better or competitive performance in comparison with several other popular techniques.

#### 1. Introduction

Variable selection is a classical problem in statistics and has enjoyed increased attention in recent years due to a massive growth of high-dimensional data across many scientific disciplines. In modern statistical applications, the number of variables or covariates $p$ often exceeds the number of observations $n$. In such settings, the true model is often assumed to be sparse, in the sense that only a small proportion of the variables actually relates to the response. Thus, variable selection is fundamentally important in statistical analysis of high-dimensional data. With a proper selection method and under suitable conditions, we are able to build a good model that makes the relationship between the covariates and the outcome of interest easier to interpret, to avoid overfitting in prediction and estimation, and to identify important variables for applications or further study.

For variable selection, many researchers focus on multiple linear regression models. To emphasize that variable selection methods are useful for other statistical models as well, we use a different statistical model, namely, Cox’s proportional hazards model (abbreviated as Cox model) [1], as the platform in this paper. The Cox model was first proposed for exploring the relationship between the survival of a patient and some explanatory variables. The Cox model [2, 3] is nowadays one of the most commonly used semiparametric models; it can not only handle censored data but also analyze the influence of various factors on survival time simultaneously. A brief mathematical description of the Cox model is given as follows.

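Suppose that there are $n$ observations $\{(t_i, \mathbf{x}_i, \delta_i)\}_{i=1}^{n}$ of survival data. For an individual $i$, $t_i$ denotes its survival time and $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^{T}$ stands for the observed data for the $p$ covariates. At the same time, $\delta_i \in \{0, 1\}$ is a censoring indicator variable, where $\delta_i = 0$ means that $t_i$ is right-censored. Let $h(t \mid \mathbf{x})$ be the hazard rate at time $t$; the generic form of a Cox’s proportional hazards model can be expressed as

$$h(t \mid \mathbf{x}) = h_0(t) \exp\left(\boldsymbol{\beta}^{T} \mathbf{x}\right), \tag{1}$$

where $\boldsymbol{\beta}$ is a $p$-dimensional unknown coefficient vector and $h_0(t)$ is the baseline hazard function, that is, the hazard function at time $t$ when all covariates take value zero. In general, $\boldsymbol{\beta}$ can be estimated by maximizing partial likelihood function. For convenience, we assume below.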
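To make the proportional hazards structure concrete, the following minimal Python sketch evaluates the hazard for two individuals and checks that their hazard ratio does not depend on time. The baseline hazard `h0`, the coefficient values, and the covariate vectors are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def cox_hazard(t, x, beta, baseline_hazard):
    """Hazard at time t for covariates x: baseline hazard times exp(beta^T x)."""
    return baseline_hazard(t) * np.exp(np.dot(x, beta))

def h0(t):
    # Hypothetical baseline hazard, used only for this illustration.
    return 0.5 * np.sqrt(t)

# Hypothetical coefficients and covariate vectors (assumptions, not paper values).
beta = np.array([0.8, 0.0, -0.5])
x_a = np.array([1.0, 2.0, 0.0])
x_b = np.array([0.0, 2.0, 1.0])

# The baseline hazard cancels in the ratio, so the ratio is the same at every t.
ratios = [cox_hazard(t, x_a, beta, h0) / cox_hazard(t, x_b, beta, h0)
          for t in (1.0, 4.0, 9.0)]
expected_ratio = np.exp(np.dot(x_a - x_b, beta))
print(ratios, expected_ratio)
```

Because the baseline hazard cancels in every such ratio, the coefficients can be estimated from the partial likelihood without specifying the baseline hazard.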

As with linear regression models, traditional methods such as subset selection [4, 5], forward selection, backward elimination, and a combination of both are among the most common methods for selecting variables in a Cox model. However, these methods become computationally difficult when faced with high-dimensional data. Therefore, some other methods have been proposed to overcome this problem. After lasso (least absolute shrinkage and selection operator) [6] was first proposed for linear regression models, Tibshirani [7] extended it to the Cox model. Later on, many scholars [2, 3, 8–12] developed penalized shrinkage techniques like SCAD [13] and adaptive lasso [14] specifically for Cox models.

Although the above-mentioned variable selection methods have been shown to be successful in theoretical properties and numerous experiments, their performance strongly depends on the proper setup of the tuning parameter. Moreover, these approaches may be unstable, especially with high-dimensional data. Breiman [15] proved that uncertainty can lead to more prediction loss. More importantly, small changes in the data can cause the same method to select different models, which makes the subsequent interpretation difficult and unreliable. To obtain more stable, accurate, and reliable variable selection results, ensemble learning [16, 17] is a highly promising technique.

As a hot research topic in machine learning, ensemble learning has been used more and more widely in many fields of natural and social science over the last two decades. The main advantages of ensemble learning lie in improving the generalization capacity and enhancing robustness in the process of learning. Its main idea is to obtain a number of different base learning machines by running some simple learning algorithm and then to combine these base machines into an ensemble learning machine in some way. Generally, the base learning machines should have strong generalization capability on the one hand and should complement each other on the other.

The ensemble approach for statistical modeling was first proposed for solving prediction problems, aiming to maximize *prediction accuracy*. Inspired by this idea, Zhu and Chipman [18] applied the bagging ensemble approach to variable selection problems, aiming to maximize *selection accuracy*. Meanwhile, they pointed out that there is much difference between “prediction ensembles” (PEs) and “variable selection ensembles” (VSEs). More recently, ensemble learning methods have attracted more attention for coping with variable selection problems, since they can greatly improve the selection accuracy, lessen the risk of falsely selecting unimportant variables, and simultaneously overcome the instability of traditional methods in high-dimensional data analysis. Because of these benefits, a growing body of research applies ensemble learning to variable selection and puts forward novel approaches. As far as we know, existing VSE techniques mainly include PGA (parallel genetic algorithm) [18], stability selection [19], BSS (bagged stepwise search) [20], random lasso [21], ST2E (stochastic stepwise ensemble) [22], TCSL (tilted correlation screening learning) [23], RMSA (random splitting model averaging) [24], SCCE (stochastic correlation coefficient ensemble) [25], and PST2E (pruned stochastic stepwise ensemble) [26]. It is noteworthy that these algorithms are mainly designed for handling variable selection problems in linear regression models. Only Zhu and Fan [20] investigated the performance of BSS and PGA in the Cox model.

Through analyzing these VSE techniques, it can be found that their success primarily lies in producing multiple importance measures for each predictor. By simply averaging these measures across multiple trials, the noise variables can be more reliably distinguished from the informative ones. In this process, the strength to select important variables and the diversity between the importance measures need to be preserved simultaneously [20, 22]. Stability selection applies subsampling (or bootstrap) to a selection method like lasso to improve its performance. In fact, it is an extremely general ensemble learning technique for identifying important variables. Due to the characteristics of lasso, it is very efficient in high-dimensional situations. Another good property of stability selection is that it provides an effective way to control false discovery rate (FDR) in finite sample cases provided that its tuning parameters are set properly. Due to its versatility and flexibility, stability selection has been successfully applied in many domains such as gene expression analysis [24, 27–29]. Nevertheless, we have not found any literature about applying stability selection to a Cox model. Therefore, in this paper we would like to extend it to the situation of Cox models. At the same time, we also discuss how to set appropriate values for the involved parameters so that stability selection achieves its best performance.

The remainder of the paper is organized as follows. In Section 2, the details for applying stability selection to the Cox model are described. We also provide an explicit way to set its involved parameters. In Section 3, some numerical experiments are conducted to study the impact of $\lambda_{\min}$ on the behavior of stability selection and to compare its performance with other variable selection approaches for the Cox model. In Section 4, some real examples are analyzed to further study the effectiveness of stability selection. Finally, some conclusions are offered in Section 5.

#### 2. Stability Selection Algorithm for the Cox Model

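In this paper, we consider stability selection with lasso as its base learner. Lasso [6] is one of the most effective techniques to deal with high-dimensional linear regression problems with $p > n$. With respect to its application in Cox models, the core idea is to maximize the partial likelihood minus the penalty function. For convenience, suppose that there are $m$ unique failure times, say, $t_{(1)} < t_{(2)} < \cdots < t_{(m)}$, among the $n$ observations. Let $j(i)$ denote the index of the observation failing at time $t_{(i)}$. The lasso algorithm needs to maximize

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m} \left[ \mathbf{x}_{j(i)}^{T} \boldsymbol{\beta} - \log \left( \sum_{j \in R_i} \exp\left(\mathbf{x}_{j}^{T} \boldsymbol{\beta}\right) \right) \right] \tag{2}$$

under the constraint $\sum_{k=1}^{p} |\beta_k| \le s$. In (2), $R_i$ is the set of indices, $j$, with $t_j \ge t_{(i)}$ (i.e., the observations are at risk at time $t_{(i)}$). Equivalently, the estimate of $\boldsymbol{\beta}$ can be obtained as

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ -\ell(\boldsymbol{\beta}) + \lambda \sum_{k=1}^{p} |\beta_k| \right\}, \tag{3}$$

where $\lambda$ is the regularization parameter which controls the trade-off between the model fitting and the coefficient shrinkage degree. At present, there are several efficient algorithms [7, 30] (such as cyclical coordinate descent) to get $\hat{\boldsymbol{\beta}}$ in (3). We refer readers to the related literature for more details about the optimization strategy.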
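The penalized criterion described above can be evaluated directly. The following is a minimal NumPy sketch, assuming no tied failure times and an unscaled log partial likelihood (some implementations divide by the sample size, which rescales the regularization parameter). The function names and the toy data are hypothetical, and the code only evaluates the objective; it is not an optimizer such as coordinate descent.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """Negative Cox log partial likelihood, assuming no tied failure times."""
    eta = X @ beta
    total = 0.0
    for i in np.flatnonzero(event):           # loop over observed failures only
        at_risk = time >= time[i]              # risk set at this failure time
        total += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return -total

def lasso_objective(beta, X, time, event, lam):
    """Penalized criterion: negative log partial likelihood plus lam * L1 norm."""
    return neg_log_partial_likelihood(beta, X, time, event) + lam * np.sum(np.abs(beta))

# Toy data (hypothetical): four subjects, two covariates, one censored subject.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
time_toy = np.array([1.0, 2.0, 3.0, 4.0])
event_toy = np.array([1, 0, 1, 1])             # 1 = failure observed, 0 = censored

# At beta = 0 the value is log(4) + log(2) + log(1), the log risk-set sizes.
print(neg_log_partial_likelihood(np.zeros(2), X_toy, time_toy, event_toy))
```

A censored subject contributes no term of its own to the sum but still appears in the risk sets of earlier failures, which is how the partial likelihood uses censored data.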

In applications, we need to first set a sensible region, say, $\Lambda = [\lambda_{\min}, \lambda_{\max}]$, for the regularization parameter $\lambda$ in lasso. Notice that lasso will choose *all* variables (i.e., full model) for $\lambda = 0$ while choosing *none* of the variables (i.e., null model) for sufficiently large $\lambda$. By taking candidate values in $\Lambda$, that is, $\lambda_1, \ldots, \lambda_K$, lasso generally employs 5-fold or 10-fold cross-validation to select an optimal value of $\lambda$, say $\lambda^{*}$. Then, the variables which have nonzero coefficient estimates under $\lambda^{*}$ are determined as important variables. Although lasso with $\lambda$ being specified in this way has good prediction performance, much evidence [14, 19, 21] has shown that it tends to choose more variables than necessary (i.e., it has a higher FDR).
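One common way to build such a candidate grid, sketched below in Python, is to compute the smallest penalty that yields the null model and then place log-spaced values below it. This is an illustrative convention and not necessarily the procedure proposed in this paper. It assumes the unscaled criterion (negative log partial likelihood plus the penalty times the L1 norm) and no tied failure times; the function names, the default `ratio=0.05`, and the grid size are hypothetical.

```python
import numpy as np

def lambda_max_cox(X, time, event):
    """Smallest penalty giving the null model: the largest absolute component
    of the gradient of the log partial likelihood at beta = 0."""
    grad = np.zeros(X.shape[1])
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        # At beta = 0 every subject in the risk set has equal weight.
        grad += X[i] - X[at_risk].mean(axis=0)
    return np.max(np.abs(grad))

def lambda_grid(lam_max, ratio=0.05, num=50):
    """Log-spaced candidates from lam_max down to lam_min = ratio * lam_max."""
    return np.geomspace(lam_max, ratio * lam_max, num=num)

# Toy data (hypothetical): four subjects, two covariates, one censored subject.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
time_toy = np.array([1.0, 2.0, 3.0, 4.0])
event_toy = np.array([1, 0, 1, 1])

lam_max = lambda_max_cox(X_toy, time_toy, event_toy)
grid = lambda_grid(lam_max)
```

The lower end of the grid is the quantity that matters most here: the smaller it is, the more variables the base learner admits on each run.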

To eliminate this drawback of lasso, Meinshausen and Bühlmann [19] developed stability selection, which works by choosing variables whose *selection probabilities* are large as important ones. In practice, the selection probability can be estimated by running lasso on multiple different sets. These sets can be obtained via subsampling from the given set. Specifically, stability selection first estimates the probability $\hat{\Pi}_j^{\lambda}$ that variable $x_j$ is selected for each regularization parameter $\lambda \in \Lambda$, and then takes the maximum probability over $\Lambda$ as the importance measure for $x_j$. Eventually, it selects important variables by a preset threshold $\pi_{\mathrm{thr}}$. The detailed steps of stability selection algorithm for the Cox model are listed in Algorithm 1.
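The procedure just described (subsample, run the base selector for every candidate penalty, take the maximum selection frequency, then threshold) can be sketched generically in Python. This is a minimal illustration and not a transcription of Algorithm 1: the subsample size of half the data without replacement follows Meinshausen and Bühlmann, while the function name, the defaults `n_subsamples=100` and `pi_thr=0.6`, and the `base_selector` interface are assumptions made for this sketch. The base selector is any callable that fits a Cox lasso at a given penalty and returns a boolean mask of the selected variables.

```python
import numpy as np

def stability_selection(X, time, event, lambdas, base_selector,
                        n_subsamples=100, pi_thr=0.6, seed=0):
    """Return (selected mask, importance) from subsampling-based selection frequencies.

    base_selector(X_sub, time_sub, event_sub, lam) must return a boolean
    array of length p marking the variables selected at penalty lam.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros((len(lambdas), p))
    for _ in range(n_subsamples):
        # Draw a subsample of size floor(n / 2) without replacement.
        idx = rng.choice(n, size=n // 2, replace=False)
        for k, lam in enumerate(lambdas):
            counts[k] += base_selector(X[idx], time[idx], event[idx], lam)
    freq = counts / n_subsamples       # estimated selection probability per (lambda, variable)
    importance = freq.max(axis=0)      # maximum over the candidate penalties
    return importance >= pi_thr, importance
```

Keeping the base selector as an argument separates the ensemble logic from the Cox lasso solver, so the same function can wrap whichever solver is available.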