Special Issue: Statistical Genetics and Its Applications in Medical Studies
Variable Selection in ROC Regression
Regression models are introduced into receiver operating characteristic (ROC) analysis to accommodate the effects of covariates, such as genes. When many covariates are available, the variable selection issue arises. The traditional induced methodology models the outcomes of the diseased and nondiseased groups separately; applying variable selection to the two models separately therefore hinders interpretation, because the selected models may differ. Furthermore, in ROC regression the focus should be the accuracy of the estimated area under the curve (AUC) rather than model selection consistency or prediction performance. In this paper, we formulate a single objective function with the group SCAD to select grouped variables, which accommodates popular model selection criteria, and we propose a two-stage framework for applying the focused information criterion (FIC). Some asymptotic properties of the proposed methods are derived. Simulation studies show that the grouped variable selection is superior to separate model selections. Furthermore, the FIC improves the accuracy of the estimated AUC compared with other criteria.
In modern medical diagnosis and genetic studies, the receiver operating characteristic (ROC) curve is a popular tool to evaluate the discrimination performance of a biomarker for a disease status or a phenotype. For example, in a continuous-scale test, the diagnosis of a disease depends on whether a test result is above or below a specified cutoff value. Also, genome-wide association studies in human populations aim at creating genomic profiles which combine the effects of many associated genetic variants to predict the disease risk of a new subject with high discriminative accuracy. For a given cutoff value of a biomarker or a combination of biomarkers, the sensitivity and the specificity quantify the discriminative performance. By varying the cutoff value over the entire real line, the resulting plot of sensitivity against 1 − specificity is the ROC curve. The area under the ROC curve (AUC) is an important one-number summary of the overall discriminative accuracy of a ROC curve, taking the influence of all cutoff values into account. Let $Y_D$ be the response of a diseased subject and $Y_{\bar{D}}$ the response of a nondiseased subject; then the AUC can be expressed as $\mathrm{AUC} = P(Y_D > Y_{\bar{D}})$. Pepe and Zhou et al. provided broad reviews of statistical methods for the evaluation of diagnostic tests.
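To make the probabilistic definition above concrete, the nonparametric (Mann-Whitney) estimate of the AUC is the fraction of diseased/nondiseased pairs in which the diseased response is larger, with ties counted as one half. A minimal sketch (illustrative code, not the authors' implementation):

```python
def empirical_auc(y_diseased, y_nondiseased):
    """Mann-Whitney estimate of AUC = P(Y_D > Y_Dbar); ties count one half."""
    wins = 0.0
    for yd in y_diseased:
        for yn in y_nondiseased:
            if yd > yn:
                wins += 1.0
            elif yd == yn:
                wins += 0.5
    return wins / (len(y_diseased) * len(y_nondiseased))

print(empirical_auc([3, 4, 5], [0, 1, 2]))  # perfect separation -> 1.0
print(empirical_auc([1, 2], [1, 2]))        # identical groups -> 0.5
```

A marker whose diseased responses always exceed the nondiseased ones attains AUC 1, while an uninformative marker gives 0.5.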
Traditional ROC analyses do not account for characteristics of the study subjects or the operating conditions of the test, although such factors can affect test results by shifting the distributions of test measurements for diseased and/or nondiseased subjects. Additionally, although the number of genes is large, only a small number of them may be associated with the disease risk or phenotype. Therefore, regression models are introduced into the ROC analysis. Chapter Six of Pepe offers an excellent introduction to covariate adjustment in ROC curves. As reviewed in Rodríguez-Álvarez et al., there are two main methodologies for regression analysis in ROC: (1) the “induced” methodology, which first models the outcomes of diseased and nondiseased subjects separately and then uses these models to induce the ROC curve and AUC, and (2) the “direct” methodology, which models the AUC directly on all covariates. In this paper, we focus on the induced methodology, to which current model selection techniques can be extended.
If there are many covariates, the variable selection issue arises out of considerations of model interpretation and estimability. There are two main groups of variable selection procedures. One is best-subset selection combined with criteria such as cross-validation (CV), generalized cross-validation (GCV), AIC, and BIC. The other is based on regularization methods such as the LASSO, the SCAD, and the adaptive LASSO, with tuning parameters selected by the same criteria, such as CV and BIC. Procedures in the second group have recently become popular because they are stable and applicable to high-dimensional data.
So far, little attention has been paid to variable selection in ROC regression. Two reasons may account for this. First, if we model the outcomes of diseased and nondiseased subjects separately, the selected submodels may differ. This difference causes difficulties in interpretation, because it is natural to expect the same set of variables to contribute to discriminating diseased from nondiseased subjects. Second, most current criteria for variable selection focus on prediction performance or selection consistency. However, in ROC regression the focus is the precision of the estimated AUC rather than prediction or model selection, so the most popular criteria may not be appropriate. Claeskens and Hjort argued that these “one-fit-all” model selection criteria aim at selecting a single model with good overall properties. Alternatively, they developed the focused information criterion (FIC), which concentrates on a parameter singled out for interest. The insight behind this criterion is that a model that gives good precision for one estimand may be worse when used for inference about another estimand. Wang and Fang successfully applied the FIC to variable selection in linear models and demonstrated that the FIC indeed improved the estimation of the singled-out parameters. This “individualized” criterion fits the ROC regression exactly.
The remainder of this paper is organized as follows. In Section 2, we rewrite the ROC regression in a grouped variable selection form so that current criteria can be applied. A general two-stage framework with a BIC selector for the group SCAD under the local model assumption is proposed in Section 3. Simulation studies and a real data analysis are given in Sections 4 and 5, respectively. A brief discussion is provided in Section 6. All proofs are presented in the Supplement; see the Supplementary Materials available online at http://dx.doi.org/10.1155/2013/436493.
2. ROC Regression
In this section, we rewrite the penalized ROC regression with the induced methodology as a grouped variable selection problem with the SCAD penalty. Initially, we require that all covariates be centered at 0 for comparability. Also, for notational simplicity, the response variables are centered; if not, we can center the responses, carry out the model selection, and then add the centers back to evaluate the AUC. Following the notation of the local model, which generalizes the commonly used sparsity assumption, homoscedastic regression models for diseased and nondiseased subjects are assumed as follows:

$$y_D = X_D \beta_D + Z_D \gamma_D + \epsilon_D, \qquad y_{\bar{D}} = X_{\bar{D}} \beta_{\bar{D}} + Z_{\bar{D}} \gamma_{\bar{D}} + \epsilon_{\bar{D}}, \tag{1}$$

where $X_D$ and $X_{\bar{D}}$ contain the variables that are always added, $Z_D$ and $Z_{\bar{D}}$ contain the variables which may or may not be added, $\beta_D$, $\beta_{\bar{D}}$, $\gamma_D$, and $\gamma_{\bar{D}}$ are the corresponding coefficient vectors, $y_D$ and $y_{\bar{D}}$ are response vectors of lengths $n_D$ and $n_{\bar{D}}$, the sample sizes for the diseased and nondiseased groups, respectively, and $\epsilon_D$ and $\epsilon_{\bar{D}}$ independently follow $N(0, \sigma_D^2 I)$ and $N(0, \sigma_{\bar{D}}^2 I)$. Under the local model, $(\gamma_D, \gamma_{\bar{D}})$ is of order $\delta/\sqrt{n}$; in particular, if $\delta = 0$, a sparse model results. Then, the AUC given covariates $(x, z)$ can be written as

$$\mathrm{AUC}(x, z) = \Phi\left(\frac{x^{T}(\beta_D - \beta_{\bar{D}}) + z^{T}(\gamma_D - \gamma_{\bar{D}})}{\sqrt{\sigma_D^2 + \sigma_{\bar{D}}^2}}\right), \tag{2}$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution. Clearly, the narrow model is the one with $\gamma_D = \gamma_{\bar{D}} = 0$, including only the constant effects $\beta_D$ and $\beta_{\bar{D}}$. More details of the local model assumption are provided in the following section.
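Under the binormal model implied by normal-error regressions, the covariate-specific AUC has the closed form $\Phi\big((\mu_D - \mu_{\bar D})/\sqrt{\sigma_D^2 + \sigma_{\bar D}^2}\big)$, where $\mu_D$ and $\mu_{\bar D}$ are the group means at the covariate value of interest. A sketch of this evaluation (illustrative names, not the authors' code):

```python
import math

def binormal_auc(mu_d, mu_n, var_d, var_n):
    """AUC = Phi((mu_D - mu_Dbar) / sqrt(var_D + var_Dbar)) under normal errors."""
    z = (mu_d - mu_n) / math.sqrt(var_d + var_n)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

print(binormal_auc(0.0, 0.0, 1.0, 1.0))            # equal groups -> 0.5
print(round(binormal_auc(2.0, 0.0, 1.0, 1.0), 3))  # well separated -> 0.921
```

The means would come from the fitted regressions for the diseased and nondiseased groups at a given covariate value, and the variances from the residual variance estimates.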
Assume that the observed i.i.d. samples are $(y_{Di}, x_{Di}, z_{Di})$, $i = 1, \ldots, n_D$, and $(y_{\bar{D}j}, x_{\bar{D}j}, z_{\bar{D}j})$, $j = 1, \ldots, n_{\bar{D}}$. Instead of selecting separate models, we consider the following single objective function with a group penalty, given a tuning parameter $\lambda$:

$$\sum_{i=1}^{n_D}\left(y_{Di} - x_{Di}^{T}\beta_D - z_{Di}^{T}\gamma_D\right)^2 + \sum_{j=1}^{n_{\bar{D}}}\left(y_{\bar{D}j} - x_{\bar{D}j}^{T}\beta_{\bar{D}} - z_{\bar{D}j}^{T}\gamma_{\bar{D}}\right)^2 + \sum_{k} p_{\lambda}\left(\|\theta_k\|\right), \tag{3}$$

where $\theta_k = (\gamma_{Dk}, \gamma_{\bar{D}k})^{T}$ is a 2-dimensional vector, with $\gamma_{Dk}$ the $k$th component of $\gamma_D$ and $\gamma_{\bar{D}k}$ the $k$th component of $\gamma_{\bar{D}}$, and $p_\lambda$ is the SCAD penalty. More generally, instead of the $L_2$ norm of $\theta_k$, we can use $\|\theta_k\|_R = (\theta_k^{T} R \theta_k)^{1/2}$ with a positive definite matrix $R$. Then, given $\lambda$, the minimizer of (3) can be obtained as an estimate of the parameters. The motivation for penalizing $\theta_k$ jointly rather than separately is that the inclusion or exclusion of the effect of a certain variable should be simultaneous for the diseased and nondiseased groups. It is not appropriate to include only $\gamma_{Dk}$ or only $\gamma_{\bar{D}k}$ in the model, which would cause trouble in interpreting the resulting model. This is exactly the motivation of the group LASSO method of Yuan and Lin for handling categorical variables and of the group SCAD of Wang et al. for spline bases.
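The grouped penalty above ties the diseased and nondiseased coefficients of the same covariate together by penalizing the Euclidean norm of the pair with the SCAD function of Fan and Li (linear near zero, then quadratic, then flat; $a = 3.7$ is their suggested default). A sketch, with illustrative names:

```python
import math

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(t) of Fan and Li (2001), evaluated at |t|."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

def group_scad_penalty(gamma_d, gamma_n, lam):
    """Penalize each covariate's pair (gamma_Dk, gamma_Dbar_k) through its
    Euclidean norm, so a variable enters or leaves both models together."""
    total = 0.0
    for gd, gn in zip(gamma_d, gamma_n):
        total += scad_penalty(math.hypot(gd, gn), lam)
    return total

print(scad_penalty(0.5, 1.0))  # linear zone -> 0.5
```

Because the penalty is flat for large norms, large grouped effects are left essentially unshrunk, which underlies the oracle property cited later.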
Note that there are two separate residual sums of squares in (3). To comply with the framework of grouped variable selection, a modified version of the objective function (3) is required. Let $\otimes$ denote the Kronecker product operator. Stacking the two groups, the model can be written in matrix form as $y = U\eta + \epsilon$, where $y$ is the $(n_D + n_{\bar{D}})$-dimensional vector of all responses, $\eta$ collects all regression coefficients, and $U$ is the corresponding design matrix. By construction the variables are grouped: $U$ can be split into submatrices, each consisting of two consecutive columns of $U$ in turn, one for the diseased and one for the nondiseased coefficient of the same covariate. Additionally, because the variances of the healthy and diseased groups differ, weighted least squares should be applied. Let $W$ be a diagonal matrix whose diagonal entries are $1/\sigma_D^2$ for diseased observations and $1/\sigma_{\bar{D}}^2$ for nondiseased observations. Then, the objective function (3) can be written as a weighted penalized least squares problem.
Furthermore, to facilitate computation with existing R packages, we define transformed observations by weighting: simply put $\tilde{y} = W^{1/2} y$ and transform the design matrix in the same way. Finally, the penalized ROC regression (3) has been written as a group SCAD-type problem (7). Current model selection criteria, such as CV, GCV, AIC, and BIC, can then be applied to select a final model. For this specific ROC regression problem, where the AUC is the focus, these criteria may not be appropriate; therefore, as argued by Claeskens and Hjort, the FIC can play a role here.
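The weighting step simply rescales each group's centered observations by its inverse standard deviation, the square root of the inverse-variance weights, so that an ordinary (unweighted) least squares routine can be applied to the stacked data. A minimal sketch (names illustrative):

```python
import math

def weight_observations(y_d, y_n, var_d, var_n):
    """Rescale each group's centered responses by 1 / sigma_g so that the
    stacked data can be passed to an ordinary least-squares routine."""
    wd, wn = 1.0 / math.sqrt(var_d), 1.0 / math.sqrt(var_n)
    return [wd * y for y in y_d] + [wn * y for y in y_n]

print(weight_observations([2.0, 4.0], [3.0], 4.0, 9.0))  # -> [1.0, 2.0, 1.0]
```

The same rescaling would be applied row-wise to the design matrix, which is what the transformation to the tilde variables accomplishes.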
Under the local model assumption, a novel procedure for applying the FIC to grouped variable selection is developed, motivated by Wang and Fang. Briefly, the procedure consists of two steps. First, a narrow model, containing the variables that are always added, is identified through the objective function (7). Second, the FIC is applied to select a subgroup of the remaining variables. The final model is the combination of the variables selected in the two steps. Details are provided in the following section. In terms of the FIC, the focus parameter is naturally the AUC at a given covariate value; that is, $\mu = \mathrm{AUC}(x, z)$.
Later, in the simulation studies, separate variable selection for the diseased and nondiseased models will also be carried out for comparison. We expect the group selection to be superior to the separate selection.
3. A BIC Selector for Group SCAD under the Local Model Assumption
This section follows the notation of the two fundamental FIC papers: Hjort and Claeskens and Claeskens and Hjort. Furthermore, we allow grouped variables, each of which stands for a factor, such as a series of dummy variables coded from a multilevel categorical variable. The starting assumption of the FIC is that some variables are always added to the regression model and the others may or may not be added; that is,

$$y = X\beta + Z\gamma + \epsilon, \tag{8}$$

where $X$ includes the variables which are always added, $Z$ includes the variables which may or may not be added, and $\epsilon \sim N(0, \sigma^2 I)$. Without loss of generality, both $X$ and $Z$ are standardized to remove the intercept term. Furthermore, we assume that $X$ actually consists of $p$ factors and $Z$ consists of $q$ factors, with the coefficient vectors $\beta$ and $\gamma$ partitioned conformably into subvectors $\beta_1, \ldots, \beta_p$ and $\gamma_1, \ldots, \gamma_q$, whose dimensions sum to those of $\beta$ and $\gamma$, respectively. For simplicity, the residual variance is estimated from the full model and is not treated as a parameter.
In the variable selection literature, in order to establish the selection consistency of a procedure, the true model is usually assumed to be sparse. Thus, the sparsity assumption plays a critical role in the current model selection literature, and many procedures have been shown to be selection consistent under it. For example, the SCAD with the tuning parameter selected via BIC has been shown to be selection consistent by Wang et al. [21, 22] and Zhang et al.
However, it may be too strict to assume that the true model is sparse. It is more reasonable and flexible to take as the true model the local model (8) with $\gamma = \delta/\sqrt{n}$, under which the FIC is developed. This model is close to the sparse model but differs from it by a term of order $1/\sqrt{n}$. The sparsity assumption, in the notation of this paper, is equivalent to assuming that $\delta = 0$ and hence $\gamma = 0$. Therefore, the local model assumption used here is a natural extension of the sparsity assumption. All “consistency” results obtained in this paper still apply to sparse models with grouped variables.
The FIC centers on inference for a certain estimand or focus, denoted by $\mu$. It is well known that using a bigger model typically means smaller bias but bigger variance. Therefore, the FIC tries to balance the bias and the variance in estimating the focus. To be specific, like any existing criterion, the FIC starts with a narrow model that includes only the variables in $X$ and searches over submodels that add some factors of $Z$. The whole process involves $2^q$ submodels in total, one for each subset $S$ of $\{1, \ldots, q\}$.
In this framework, the various estimators of the focus parameter range from the narrow-model estimator to the full-model estimator. In general, the FIC attempts to select the subset $S$ associated with the smallest mean squared error (MSE) of $\hat{\mu}_S$, the estimator based on the submodel that adds the factors indexed by $S$ and sets the coefficients indexed by the complement of $S$ to zero.
3.1. Stage 1: Consistent Selection of the Narrow Model
Assuming the true model (8) with $\gamma = \delta/\sqrt{n}$ and grouped variables, the first important question is whether we can select the narrow model consistently. A similar question was addressed by Wang and Fang, who considered nongrouped variables. In the following, we show that the group SCAD with the tuning parameter selected via BIC can consistently select the narrow model.
Wang et al. extended the SCAD, proposed by Fan and Li, to grouped variables and established its oracle property, following the elegant idea of the group LASSO. The group SCAD generates an estimate via the following penalized least squares:

$$\frac{1}{2}\left\|y - X\beta - Z\gamma\right\|^2 + n \sum_{k=1}^{q} p_{\lambda}\left(\|\gamma_k\|\right), \tag{9}$$

where $\gamma_k$ is the coefficient subvector of the $k$th factor and $p_\lambda$ is the SCAD penalty defined in the previous section. Let $S_\lambda$ be the selected narrow model for a given $\lambda$. By arguments similar to those in the previous section, the norm used in the penalty can be replaced by any metric of the form $\|\gamma_k\|_R = (\gamma_k^{T} R \gamma_k)^{1/2}$ such that $R$ is a symmetric positive definite matrix.
Under the local model assumption with nongrouped variables, Wang and Fang showed that, with the tuning parameter selected via BIC, the SCAD is selection consistent; that is, with probability tending to one, the narrow model is identified. Similarly, a BIC selector can be defined based on the group SCAD as follows:

$$\hat{\lambda}_{\mathrm{BIC}} = \arg\min_{\lambda} \left\{ \log\left(\mathrm{SSE}_{\lambda}/n\right) + \mathrm{df}_{\lambda}\,\frac{\log n}{n} \right\}, \tag{10}$$

where $\mathrm{SSE}_\lambda$ is the residual sum of squares of the model selected with tuning parameter $\lambda$ and $\mathrm{df}_\lambda$ is its degrees of freedom. We expect that the group SCAD is still selection consistent in the sense that $P(S_{\hat{\lambda}_{\mathrm{BIC}}} = S_0) \to 1$ as $n \to \infty$, where $S_0$ denotes the narrow model.
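Operationally, a BIC selector is a grid search over the tuning parameter: fit the penalized regression at each candidate value and keep the one minimizing a BIC-type score. The sketch below uses one common form of the score, $n \log(\mathrm{RSS}/n) + \mathrm{df}\,\log n$ (equivalent, up to a factor of $n$, to the selector above); the fitting routine is left abstract and all names are illustrative:

```python
import math

def bic_select(lambdas, fit):
    """Pick the tuning parameter minimizing a BIC-type score.
    `fit(lam)` must return (rss, df, n) for the model selected at lam;
    the score n*log(rss/n) + df*log(n) trades fit against model size."""
    best_lam, best_score = None, float("inf")
    for lam in lambdas:
        rss, df, n = fit(lam)
        score = n * math.log(rss / n) + df * math.log(n)
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam

# Toy solution path: larger lambda -> sparser model, larger RSS.
toy = {0.1: (50.0, 8, 100), 0.5: (52.0, 3, 100), 1.0: (90.0, 1, 100)}
print(bic_select(toy, lambda lam: toy[lam]))  # -> 0.5
```

In the toy path, the middle value wins: the dense model pays too large a complexity penalty and the sparsest model fits too poorly, illustrating why BIC tends to land on parsimonious but adequate models.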
Formally, within the framework of the FIC, assuming that the local model (8) is the true model with a given narrow model, we show the following theorem. Proofs can be found in the Supplement.
Theorem 1. Under some mild conditions (see the Supplement for details), one has $P(S_{\hat{\lambda}_{\mathrm{BIC}}} = S_0) \to 1$ as $n \to \infty$, provided that model (8) with $\gamma = \delta/\sqrt{n}$ is the true model.
Remark 2. If we assume that $\delta = 0$, that is, the model is sparse, then Theorem 1 provides a BIC selector for the tuning parameter in the group SCAD which consistently identifies the nonzero effects. In other words, we extend the BIC selector for the SCAD proposed by Wang et al. to the group SCAD setting.
Theorem 1 also implies both advantages and disadvantages of the BIC, which have been discussed by Wang and Fang. Briefly speaking, to achieve model selection consistency, the BIC sacrifices prediction consistency in the sense of filtering out all of the variables whose effect sizes are of order $1/\sqrt{n}$. The theorem provides a data-driven method to consistently specify a narrow model, which is critical before applying the FIC. In the following subsection, we suggest a two-stage framework that applies the FIC on top of a narrow model selected via BIC, in order to recover part of the variables filtered out by the BIC.
3.2. Stage 2: FIC
In Stage 1, a narrow model $S_0$ has been identified via the group SCAD with the tuning parameter selected via BIC. In Stage 2, any subset of the remaining factors can be added to it. A direct application of the FIC proposed by Claeskens and Hjort is not feasible even for a moderate number of remaining factors, because the number of their subsets grows exponentially. Furthermore, best-subset selection is unstable. Therefore, similarly to Wang and Fang, and without the double minimization over both subsets and tuning parameters proposed by Claeskens, we suggest limiting the search domain to the subsets on the solution path of a group regularization procedure such as the group LASSO or the group SCAD.
With a selected narrow model $S_0$, partition the design into the factors already selected and the remaining factors indexed by $S_0^c$. Then, a solution path is generated from the following group LASSO procedure (or group SCAD):

$$\frac{1}{2}\left\|y - X\beta - Z_{S_0}\gamma_{S_0} - Z_{S_0^c}\gamma_{S_0^c}\right\|^2 + n\lambda \sum_{k \in S_0^c} \|\gamma_k\|, \tag{12}$$

where the tuning parameter $\lambda$ controls the grouped variables included in the subset $S$. As $\lambda$ decreases from some large value to 0, $S$ increases from the empty set to the “full” set $S_0^c$. We then use the FIC to guide the selection of $\lambda$ in (12) over the resulting subsets, which constitute the search domain.
Now, Stage 2 of the FIC for a given focus $\mu$ is summarized as follows. For a given $\lambda$, a subset $S$ is given by the indices of the nonzero factors from (12). Then, based on the submodel indexed by $S_0 \cup S$, the FIC value is evaluated according to the formula in Claeskens and Hjort [15, formula (3.3)], which is essentially a parametric estimate of the MSE of $\hat{\mu}_S$. Consequently, the value of $\lambda$ minimizing this estimated MSE is selected, and the final submodel is the corresponding $S_0 \cup S$.
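In outline, the Stage-2 rule reduces to minimizing an estimated mean squared error, squared bias plus variance of the focus estimator, over the candidate subsets on the regularization path. A schematic sketch (the inputs are illustrative placeholders for the parametric bias and variance estimates, not the paper's formulas):

```python
def fic_select(candidates):
    """Among candidate subsets on a regularization path, pick the one whose
    estimated squared bias plus variance for the focus parameter (here,
    the AUC at a given covariate value) is smallest.
    `candidates` maps a subset (frozenset) to (bias_sq_hat, var_hat)."""
    return min(candidates, key=lambda s: sum(candidates[s]))

# Toy path: the narrow model is stable but biased; the full model is
# unbiased but most variable; an intermediate subset wins.
path = {
    frozenset(): (0.020, 0.001),
    frozenset({1}): (0.004, 0.003),
    frozenset({1, 2}): (0.000, 0.009),
}
print(sorted(fic_select(path)))  # -> [1]
```

The toy numbers illustrate the bias-variance trade-off that motivates the FIC: neither the narrow nor the full model minimizes the estimated MSE of the focus.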
4. Simulation Studies

Simulated data are generated under models (1) with intercepts equal to 0. Moderate sample sizes are used for the diseased and nondiseased groups, compared with 8 and 20 as the numbers of covariates. Three scenarios of parameter values are considered, specifying the always-included effects and the local effects in each case. Clearly, the narrow model of the first two settings contains only the always-included effects, whereas, for the third one, no clear boundary is specified between big effects and small effects.
Corresponding to each setting, test covariate values are selected to generate AUC values around 0.6, 0.8, and 0.95, accommodating low-, moderate-, and high-accuracy cases, respectively.
Besides the proposed two-stage framework (FIC) with the group SCAD, four popular variable selection criteria, 5-fold CV, GCV, AIC, and BIC, are employed for comparison. Additionally, the SCAD penalty is applied to the diseased and healthy groups separately to show the gain from applying the group SCAD.
Two popular measurements, the mean squared error (MSE) and the mean absolute error (MAE) of the estimated AUC, are used to evaluate the performance of the models selected under the different criteria, where the AUC estimate is based on the final model chosen by each criterion. Due to the limited range of the AUC and the skewed distributions of its estimates, especially near the boundaries, the MAE is supposed to be more appropriate.
In this paper, a composite measurement of precision and recall is employed to evaluate the performance of selecting the narrow model among the various methods, alongside the commonly reported proportions of selecting underfitting, correct, and overfitting models. As noted by Lim and Yu, a high value of this composite measure means that both the false-positive and false-negative rates are low. Here Precision is the proportion of selected factors that belong to the narrow model, and Recall is the proportion of narrow-model factors that are selected. All results are summarized over 500 repetitions of the simulation settings in Tables 1, 2, and 3.
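The two ingredients of the composite measure can be computed directly from the selected set and the true narrow model. A minimal sketch (illustrative, not the simulation code):

```python
def precision_recall(selected, true_set):
    """Precision = fraction of selected factors truly in the narrow model;
    Recall = fraction of narrow-model factors that were selected."""
    selected, true_set = set(selected), set(true_set)
    tp = len(selected & true_set)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    return precision, recall

print(precision_recall({1, 2, 3}, {1, 2}))  # precision 2/3, recall 1.0
```

High precision corresponds to a low false-positive rate and high recall to a low false-negative rate, which is why a high composite value indicates that both error rates are small.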
Table 1 indicates that the BIC has the best performance in identifying the narrow model. Also, when there are more weak signals, as in Setting 2, the performance is not as good as in Setting 1. This is reasonable: with an increasing number of variables for a given sample size, it is more challenging to filter weak signals, even under the sparsity assumption. From Table 2, we can see that, in all three settings, the five methods perform well. Specifically, for moderate and large AUC cases, the FIC performs slightly better, providing smaller MAE. Additionally, in these cases, the FIC improves on the BIC substantially, which once again indicates that the BIC filters out weak signals.
To show the benefit of grouped variable selection, separate model selections for diseased and healthy subjects are also considered, with results summarized in Table 3. Comparing Tables 2 and 3, in most cases the group penalty provides smaller MSE and MAE for every criterion. Due to the limited range of the AUC, all MSE and MAE values in Tables 2 and 3 are small, but the group selection can improve on the separate selections by as much as 25%. It is not surprising that, in high-AUC situations, the differences are small and separate selections with BIC are better. Possible reasons are the following: (1) there is not much room for an estimated AUC to vary when it is close to 1; (2) separate selections with BIC offer greater flexibility to obtain a sparse model.
5. Real Data Analysis
In this section, we demonstrate the proposed procedure on the audiology data reported by Stover et al., which were analyzed by Pepe [3, 28]. The dataset contains results of the distortion product otoacoustic emissions (DPOAE) test used to diagnose hearing impairment. There are 208 subjects who were examined at different combinations of three frequencies and three intensities of the DPOAE device. An audiometric threshold can be obtained for each combination. At a particular frequency, an ear was classified as hearing impaired if the audiometric threshold was greater than 20 dB HL. The original dataset contains multiple records per subject; in this study, we randomly select one record for each subject, and among the 208 subjects there are 55 with hearing impairment. The test result is the negative signal-to-noise ratio, −SNR. The covariates used in Dodd and Pepe are frequency (Hz)/100, intensity (dB)/10, and (hearing threshold − 20) (dB)/10. To encourage model selection, we incorporate two-way interaction terms. Quadratic terms are not included because of the high correlation between each variable and its quadratic term. All covariates are centered.
Previous studies of this dataset showed that −SNR provides quite high discriminative performance and that certain covariates have only small effects. To avoid specifying inappropriate covariate values, we randomly select three centered observations from the whole dataset as the focused subjects.
Table 4 shows the AUC values of the models selected by each method as well as the corresponding model sizes. CV, AIC, and GCV tend to select the full model. By contrast, BIC tends to select a sparse model containing only one covariate. The full model need not provide the largest AUC, because a large model brings instability and can degrade the AUC; indeed, for the second test point, both BIC and FIC provide a higher AUC than the full model. But the single variable selected by the BIC seems too strict. By focusing on the precision of the estimated focus parameter, the FIC provides a customized way to fill the gap: for the first test point, three main effects are selected; for the second, two are selected; for the third, only one is selected. In terms of the precision of the estimated AUC, the FIC acts as a compromise, selecting models that generate AUC values in the middle.
6. Discussion

In this paper, we rewrite the model selection problem of the ROC regression with the induced methodology in a grouped factor selection form. We also develop a two-stage framework for applying the FIC with the group SCAD under the local model assumption. Specifically, if the true model is sparse, our framework naturally accommodates current model selection criteria. Furthermore, the BIC selector is proved to be model selection consistent under either a sparse or a local model, in the sense of selecting the sparse model or the narrow model, respectively.
Most current model selection criteria aim at prediction performance or model selection consistency; thus, in ROC regression, where the AUC is the focus parameter, they may not be appropriate. This observation motivates the application of the FIC, which is shown to perform well in our simulation studies. Our method therefore has potential applications in genetic studies, where the number of genes is typically large compared with the sample size.
For the direct methodology, there is a rich literature based on generalized estimating equations, motivated by the fact that the range of the AUC is that of the probability of a binary random variable. Our future work will extend the framework developed here to generalized estimating equations and apply it to ROC regression with the direct methodology.
As discussed by one referee, it is possible that some coefficients are the same in the diseased and nondiseased models. Modeling them separately, as in (1), increases the degrees of freedom in (3), especially when a large number of genes serve as covariates. If a coefficient is known a priori to be the same in the diseased and healthy groups, it is natural for the FIC to include it in the narrow model with a single coefficient. Within the proposed objective function, a fused-LASSO type of penalty could be applied to obtain such a structure, in addition to the group LASSO/SCAD. Friedman et al. provided a note on the group LASSO and the sparse group LASSO, which could shed light on this question. This will be an interesting topic for future work.
Conflict of Interests
There is no conflict of interests regarding the publication of this article.
Acknowledgments

The authors would like to thank Dr. Yixin Fang for his invaluable suggestions and generous support, which made this paper publishable. They also thank the editor, the associate editor, and the referees for their valuable comments, which led to substantial improvements of this paper.
Supplementary Materials

In the Supplement, we prove Theorem 1 by establishing four key lemmas. The theorem shows that the developed BIC selector can consistently identify the narrow model with grouped variables.
References

D. Bamber, “The area above the ordinal dominance graph and the area below the receiver operating characteristic graph,” Journal of Mathematical Psychology, vol. 12, no. 4, pp. 387–415, 1975.
M. S. Pepe, The Statistical Evaluation of Medical Tests for Classification and Prediction, Oxford University Press, New York, NY, USA, 2003.
X. H. Zhou, N. A. Obuchowski, and D. M. McClish, Statistical Methods in Diagnostic Medicine, John Wiley & Sons, New York, NY, USA, 2nd edition, 2011.
M. X. Rodríguez-Álvarez, P. G. Tahoces, C. Cadarso-Suárez, and M. J. Lado, “Comparative study of ROC regression techniques-applications for the computer-aided diagnostic system in breast cancer detection,” Computational Statistics and Data Analysis, vol. 55, no. 1, pp. 888–902, 2011.
M. Stone, “Cross-validatory choice and assessment of statistical predictions,” Journal of the Royal Statistical Society B, vol. 36, no. 2, pp. 111–147, 1974.
H. Akaike, “Information theory and an extension of the maximum likelihood principle,” in Proceedings of the 2nd International Symposium on Information Theory, B. N. Petrov and F. Csaki, Eds., pp. 267–281, Akademiai Kiado, Budapest, Hungary, 1973.
R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society B, vol. 58, no. 1, pp. 267–288, 1996.
J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
L. Breiman, “Heuristics of instability and stabilization in model selection,” Annals of Statistics, vol. 24, no. 6, pp. 2350–2383, 1996.
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2009.
B. Wang and Y. Fang, “On the focused information criterion for variable selection,” submitted.
P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer, 2011.
L. Stover, M. P. Gorga, S. T. Neely, and D. Montoya, “Toward optimizing the clinical utility of distortion product otoacoustic emission measurements,” Journal of the Acoustical Society of America, vol. 100, no. 2, part 1, pp. 956–967, 1996.
J. Friedman, T. Hastie, and R. Tibshirani, “A note on the group lasso and a sparse group lasso,” 2010.