Special Issue: Advanced Designs and Statistical Methods for Genetic and Genomic Studies of Complex Diseases
Research Article | Open Access
Jingyuan Zhao, Zehua Chen, "A Two-Stage Penalized Logistic Regression Approach to Case-Control Genome-Wide Association Studies", Journal of Probability and Statistics, vol. 2012, Article ID 642403, 15 pages, 2012. https://doi.org/10.1155/2012/642403
A Two-Stage Penalized Logistic Regression Approach to Case-Control Genome-Wide Association Studies
We propose a two-stage penalized logistic regression approach to case-control genome-wide association studies. This approach consists of a screening stage and a selection stage. In the screening stage, main-effect and interaction-effect features are screened by using $L_1$-penalized logistic likelihoods. In the selection stage, the retained features are ranked by the logistic likelihood with the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and Jeffrey's prior penalty (Firth, 1993), a sequence of nested candidate models is formed, and the models are assessed by a family of extended Bayesian information criteria (J. Chen and Z. Chen, 2008). The proposed approach is applied to the analysis of the prostate cancer data of the Cancer Genetic Markers of Susceptibility (CGEMS) project of the National Cancer Institute, USA. Simulation studies are carried out to compare the approach with the pair-wise multiple testing approach (Marchini et al., 2005) and the LASSO-patternsearch algorithm (Shi et al., 2007).
The case-control genome-wide association study (GWAS) with single-nucleotide polymorphism (SNP) data is a powerful approach to research on common human diseases. There are two goals of GWAS: (1) to identify suitable SNPs for the construction of classification rules and (2) to discover SNPs which are etiologically important. The emphasis is on the prediction capacity of the SNPs for the first goal and on the etiological effect of the SNPs for the second goal. The phrase "an etiological SNP" is used in the sense that either the SNP itself is etiological or it is in high linkage disequilibrium with an etiological locus. Well-developed classification methods in the literature can be used for the first goal. These methods include classification and regression trees, random forests, support vector machines, and logic regression. In this article, we focus on statistical methods for the second goal.
The approach of multiple testing based on single or paired SNP models is commonly used for the detection of etiological SNPs. Either the Bonferroni correction is applied to control the overall Type I error rate (see, e.g., Marchini et al.), or methods are used to control the false discovery rate (FDR); see Benjamini and Hochberg, Efron and Tibshirani, and Storey and Tibshirani. Other variants of multiple testing have also been advocated; see Hoh and Ott. The multiple testing approach considers either a single SNP or a pair of SNPs at a time. It does not adjust for the effects of other markers. If there are many loci having high sample correlations with a true genetic variant, which is common in GWAS, it is prone to produce spurious etiological loci.
It is natural to seek alternative methods that overcome the drawback of multiple testing. Such methods must consider many loci simultaneously and assess the significance of the loci by their synergistic effect. When the synergistic effect is of concern, adding a locus spuriously correlated with an etiological locus contributes nothing further once the etiological locus has been considered. Thus the drawback of multiple testing can be avoided. In this paper, we propose a method of the abovementioned nature: a two-stage penalized logistic regression approach. In the first stage of this approach, $L_1$-penalized logistic regression models are used together with a tournament procedure to screen out apparently unimportant features (by features we refer to the covariates representing SNPs or their products). In the second stage, logistic models with the SCAD penalty plus Jeffrey's prior penalty are used to rank the retained features and form a sequence of nested candidate models. The extended Bayesian information criteria (EBIC, [13, 14]) are used for the final model selection. In both stages of the approach, the features are assessed by their synergistic effects.
The two-stage strategy has been considered by other authors. For example, J. Fan and Y. Fan adopted this strategy for high-dimensional classification, and Shi et al. developed a two-stage procedure called LASSO-patternsearch. Sure independence screening (SIS) and its ramifications, such as correlation screening and two-sample $t$-tests, are commonly used in the screening stage. Compared with SIS approaches, the tournament screening with $L_1$-penalized likelihood produces fewer spuriously correlated features while enjoying the sure screening property possessed by the SIS approaches; see Z. Chen and J. Chen and the comprehensive simulation studies by Wu. This has an impact on the accuracy of feature selection in the second stage; see Koh. The $L_1$-penalized likelihood is easier to compute than that with the SCAD penalty. However, the SCAD penalty has an edge over the $L_1$ penalty in ranking the features, so that the ranks are more consistent with the actual effects. This has been observed in simulation studies; see Zhao. It is possibly due to the fact that the $L_1$ penalty over-penalizes features with large effects, whereas the SCAD penalty does not penalize large effects at all. Jeffrey's prior penalty is added to handle the difficulty caused by separation of the data, which often arises in logistic regression models with factor covariates; see Albert and Anderson. If, within any of the categories determined by the levels of the factors, the responses are all 1 or all 0, the data are said to be completely separated. When the responses within any of the categories are almost all 1 or 0, the separation is referred to as quasi-complete. When there is separation (complete or quasi-complete), the maximum likelihood estimates of the corresponding coefficients become infinite. Jeffrey's prior penalty plays the role of shrinking the parameters toward zero in the case of separation.
Logistic regression models with various penalties have been considered for GWAS by a number of authors. Park and Hastie considered logistic models with an $L_2$ penalty. Wu et al. considered logistic models with an $L_1$ penalty. The LASSO-patternsearch developed by Shi et al. is also based on logistic regression models. However, the accuracy of identifying etiological SNPs was not fully addressed. Park and Hastie introduced the $L_2$ penalty mainly for computational reasons. Their method is essentially a classical stepwise procedure with AIC/BIC as model selection criteria. The method considered by Wu et al. is in fact only a screening procedure: the numbers of main-effect and interaction features to be retained are predetermined and left as a subjective matter. The LASSO-patternsearch is closer to our approach. The procedure first screens the features by correlation screening based on single-feature (main-effect/interaction) models. Then a LASSO model is fitted to the retained features with its penalty parameter chosen by cross-validation. The features selected by LASSO are then refitted in a nonpenalized logistic regression model, and the coefficients of the features are subjected to hypothesis testing at a varying level $\alpha$; the $\alpha$ is again determined by cross-validation. By using cross-validation, this procedure addresses the prediction error of the selected model instead of the accuracy of the selected features. Our method is compared with the LASSO-patternsearch and the multiple testing approach by simulation studies.
The two-stage penalized logistic regression approach is described in detail in Section 2. The approach is applied to publicly accessible CGEMS prostate cancer data in Section 3. Simulation studies are presented in Section 4. The paper ends with some remarks. A supplementary document which contains some details omitted in the paper is provided at the website http://www.stat.nus.edu.sg/~stachenz/, available online at doi: 10.1155/2012/642403.
2. The Two-Stage Penalized Logistic Regression Approach
We first give a brief account of the elements required in the approach: the logistic model for case-control study, the penalized likelihood, and the EBIC.
2.1. Logistic Model for Case-Control GWAS
Let $Y_i$ denote the disease status of individual $i$, 1 for case and 0 for control. Denote by $X_{ij}$, $j = 1, \ldots, p$, the genotypes of individual $i$ at the $p$ SNPs under study. The $X_{ij}$ takes the value 0, 1, or 2, corresponding to the number of copies of a particular allele in the genotype. Here, the additive genetic mode is assumed for all SNPs. The logistic model is as follows:
$$ \ln \frac{P(Y_i = 1)}{P(Y_i = 0)} = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \sum_{j < k} \gamma_{jk} X_{ij} X_{ik}, $$
where the $X_{ij}$ and the products $X_{ij} X_{ik}$ are referred to as main-effect and interaction features, respectively, hereafter. The validity of the logistic model for case-control experiments has been argued by Armitage and by Breslow and Day. There are two fundamental facts about the above model for GWAS: (a) the number of features is much larger than the sample size $n$, since $p$ is usually huge in GWAS; this situation is referred to as small-$n$-large-$p$; (b) since there are only a few etiological SNPs, only a few of the coefficients in the model are nonzero; this phenomenon is referred to as sparsity.
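As a concrete illustration of the feature expansion implied by the model above, the following sketch (the function name is ours) turns one individual's additive genotype codes into the main-effect features followed by all pairwise interaction (product) features:

```python
from itertools import combinations

def build_features(genotypes):
    """Expand additive genotype codes X_i1, ..., X_ip into main-effect
    features and all pairwise interaction (product) features."""
    p = len(genotypes)
    main = list(genotypes)
    inter = [genotypes[j] * genotypes[k] for j, k in combinations(range(p), 2)]
    return main + inter
```

For $p$ SNPs this yields $p + p(p-1)/2$ features per individual, which is why the feature count explodes in GWAS even for moderate $p$.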
2.2. Penalized Likelihood
Penalized likelihood makes the fitting of a logistic model with small-$n$-large-$p$ data computationally feasible. It also provides a mechanism for feature selection. Let $s$ be the index set of a subset of the features. Let $l(\boldsymbol{\beta}_s)$ denote the log likelihood of the logistic model consisting of the features with indices in $s$, where $\boldsymbol{\beta}_s$ consists of those $\beta_j$ and $\gamma_{jk}$ with indices in $s$. The penalized (negative) log likelihood is defined as
$$ pl(\boldsymbol{\beta}_s) = -l(\boldsymbol{\beta}_s) + \sum_{j \in s} p_{\lambda}(|\beta_j|), $$
where $p_{\lambda}$ is a penalty function and $\lambda$ is called the penalty parameter. The following penalty functions are used in our approach: the $L_1$ penalty $p_{\lambda}(|\beta|) = \lambda |\beta|$ and the SCAD penalty
$$ p_{\lambda}(|\beta|) = \begin{cases} \lambda |\beta|, & |\beta| \le \lambda, \\ \dfrac{2a\lambda|\beta| - \beta^2 - \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases} $$
where $a$ is a fixed constant bigger than 2. The penalized log likelihood with the $L_1$ penalty is used together with the tournament procedure in the screening stage. At each application of the penalized likelihood, the parameter $\lambda$ is tuned such that the minimization of the penalized likelihood yields a predetermined number of nonzero coefficients. The R package glmpath developed by Park and Hastie is used for the computation; the tuning of $\lambda$ is equivalent to setting the maximum number of steps in the glmpath function to the predetermined number of nonzero coefficients. The SCAD penalty is used in the second stage for ranking the features.
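The two penalty functions can be coded directly; the sketch below is ours, with $a = 3.7$ used as the customary default of Fan and Li (2001) rather than a value prescribed in this paper:

```python
def l1_penalty(beta, lam):
    """L1 penalty: lambda * |beta|."""
    return lam * abs(beta)

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001); requires a > 2.
    Linear near zero, quadratic blending in between, constant for
    large |beta| (so large effects are not penalized further)."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    return (a + 1) * lam**2 / 2
```

Note that the SCAD penalty flattens out at $(a+1)\lambda^2/2$ for $|\beta| > a\lambda$, which is exactly why it does not over-penalize large effects the way the $L_1$ penalty does.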
2.3. The Extended BIC
In small-$n$-large-$p$ problems, the AIC and BIC are not selection consistent. To tackle the issue of feature selection in small-$n$-large-$p$ problems, J. Chen and Z. Chen developed a family of extended Bayesian information criteria (EBIC). In the context of the logistic model described above, the EBIC is given as
$$ \mathrm{EBIC}_{\gamma}(s) = -2\, l(\hat{\boldsymbol{\beta}}_s) + (k_1 + k_2) \ln n + 2\gamma \left[ \ln \binom{p}{k_1} + \ln \binom{p(p-1)/2}{k_2} \right], $$
where $k_1$ and $k_2$ are, respectively, the numbers of main-effect and interaction features in $s$ and $\hat{\boldsymbol{\beta}}_s$ is the maximum likelihood estimate of the parameter vector in the model. It has been shown that, under certain conditions, the EBIC is selection consistent when $\gamma$ is larger than $1 - 1/(2\kappa)$, where $p = O(n^{\kappa})$; see J. Chen and Z. Chen [13, 14]. The original BIC, which corresponds to the EBIC with $\gamma = 0$, fails to be selection consistent when $p$ has a polynomial order in $n$.
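One way the criterion above can be coded is sketched below (the function names are ours, and `loglik` is assumed to be the maximized log likelihood of the fitted model); the binomial coefficients are computed on the log scale to avoid overflow for GWAS-sized $p$:

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), via lgamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def ebic(loglik, n, p, k1, k2, gamma):
    """EBIC for a logistic model with k1 main-effect features (out of p
    candidates) and k2 interaction features (out of p*(p-1)/2 candidates)."""
    q = p * (p - 1) // 2  # number of candidate interaction features
    return (-2.0 * loglik + (k1 + k2) * math.log(n)
            + 2.0 * gamma * (log_binom(p, k1) + log_binom(q, k2)))
```

With `gamma = 0` this reduces to the ordinary BIC; increasing `gamma` raises the penalty on models drawn from larger feature spaces, trading PDR for a lower FDR, as discussed later in the paper.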
We now describe the two-stage penalized logistic regression (TPLR) approach as follows.
2.4. Screening Stage
Let $k_1$ and $k_2$ be two predetermined numbers of main-effect and interaction features, respectively, to be retained. The screening stage consists of two steps: a main-effect screening step and an interaction screening step.
In the main-effect screening step, only the main-effect features are considered. Let $S_M$ denote the index set of the main-effect features, that is, $S_M = \{1, \ldots, p\}$. If $p$, the number of members in $S_M$, is not too large, minimize the $L_1$-penalized likelihood by tuning the value of $\lambda$ to retain $k_1$ features. If $p$ is very large, the following tournament procedure proposed in Z. Chen and J. Chen is applied. Partition $S_M$ into groups of an appropriate size $g$, where $g$ is chosen such that the minimization of the penalized likelihood with $g$ features can be efficiently carried out. For each group, minimize the $L_1$-penalized likelihood by tuning the value of $\lambda$ to retain $k_1$ features. If the number of retained features still exceeds $g$, repeat the above process with all retained features; otherwise, apply the $L_1$-penalized logistic model to the retained features to reduce their number to $k_1$. Let $s_1$ denote the indices of these features.
The interaction screening step is similar to the main-effect screening step. However, the main-effect features retained in the main-effect screening step are built into the models for interaction screening. Let $S_I$ denote the set of pairs of indices for all the interaction features, that is, $S_I = \{(j,k) : 1 \le j < k \le p\}$. Since $S_I$ is large in general, the tournament procedure is applied for interaction screening. Let $S_I$ be partitioned into groups of size $g$. For each group, minimize the penalized likelihood by tuning the value of $\lambda$ to retain $k_2$ interaction features. Note that, in the above penalized likelihood, both the main-effect features in $s_1$ and the interaction features in the group are involved in the likelihood part. However, only the parameters associated with the interaction features are penalized. Since no penalty is put on the parameters associated with the main-effect features, the main-effect features are always retained in the process of interaction screening. If the number of retained interaction features still exceeds $g$, repeat the above process with the set of all retained features; otherwise, reduce the retained features to $k_2$ of them by one run of the minimization using an $L_1$-penalized likelihood.
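The tournament logic shared by the two screening steps can be sketched generically. In this sketch (function and argument names are ours), `select(indices, m)` stands in for fitting an $L_1$-penalized logistic model on the given features and tuning $\lambda$ to retain `m` of them:

```python
import random

def tournament_screen(features, keep, group_size, select):
    """Tournament screening: repeatedly partition the feature indices into
    groups of at most `group_size`, keep the best `keep` from each group via
    `select`, and finish with one run of `select` on the last surviving pool."""
    indices = list(features)
    while len(indices) > group_size:
        random.shuffle(indices)  # random partition into groups
        groups = [indices[i:i + group_size]
                  for i in range(0, len(indices), group_size)]
        indices = [j for g in groups for j in select(g, keep)]
    return select(indices, keep)
```

For instance, with a toy `select` that keeps the `m` best features by some marginal score, a pool of 100 features is reduced in two rounds of group-wise fits plus one final fit; any feature that is among the overall best `keep` is guaranteed to survive every round, which is the intuition behind the sure screening property.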
2.5. Selection Stage
The selection stage consists of a ranking step and a model selection step. In the ranking step, the retained features (main-effect and interaction) are ranked together by a penalized likelihood with the SCAD penalty plus an additional Jeffrey's prior penalty. In the model selection step, a sequence of nested models is formed and evaluated by the EBIC.
For convenience, let the retained interaction features be referred to by a single index. Let $s_0$ be the index set of all the main-effect and interaction features retained in the screening stage, and let $k = |s_0|$. Denote by $\boldsymbol{\beta}$ the vector of coefficients corresponding to these features (the components of $\boldsymbol{\beta}$ are the $\beta_j$'s and $\gamma_{jk}$'s corresponding to the retained main effects and interactions). Jeffrey's prior penalty is the log determinant of the Fisher information matrix. Thus the penalized likelihood in the selection stage is given by
$$ pl(\boldsymbol{\beta}) = -l(\boldsymbol{\beta}) + \sum_{j \in s_0} p_{\lambda}(|\beta_j|) - \frac{1}{2} \ln \det I(\boldsymbol{\beta}), $$
where $p_{\lambda}$ is the SCAD penalty and $I(\boldsymbol{\beta})$ is the Fisher information matrix. The ranking step is done as follows. The parameter $\lambda$ is tuned to the smallest value at which minimizing $pl(\boldsymbol{\beta})$ makes at least one component of $\boldsymbol{\beta}$ zero. Let $j_k$ be the index corresponding to the zero component. Update $s_0$ to $s_0 \setminus \{j_k\}$, that is, the feature with index $j_k$ is eliminated from further consideration. With the updated index set, the above process is repeated, and another feature, $j_{k-1}$, is eliminated. Continuing this way, we eventually obtain an ordered sequence $j_1, j_2, \ldots, j_k$ of the indices in $s_0$, ranked from the most to the least important. From this ordered sequence, a sequence of nested models is formed as $s_m = \{j_1, \ldots, j_m\}$, $m = 1, \ldots, k$. For each $s_m$, the unpenalized likelihood is maximized. The EBIC with $\gamma$ values in a range $[0, \gamma_{\max}]$ is computed for all these models. For each $\gamma$, the model with the smallest EBIC is identified. The upper bound of the range, $\gamma_{\max}$, is taken as a value such that no feature can be selected by the EBIC with that value. Only a few distinct models are identified as $\gamma$ varies in the range. Each identified model corresponds to a subinterval of $[0, \gamma_{\max}]$. The identified models together with their corresponding subintervals are then reported.
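The model selection step can be sketched as follows. For simplicity this sketch (names are ours) treats all $k$ retained features as drawn from one pool of $p$ candidates, whereas the paper's EBIC counts main-effect and interaction features against their separate candidate pools; `loglik(model)` stands in for maximizing the unpenalized logistic likelihood over the given model:

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def select_nested(ordered, loglik, n, p, gamma):
    """Evaluate the nested models s_m = {j_1, ..., j_m} produced by the
    ranking step and return the one minimizing the EBIC."""
    best_model, best_val = None, float("inf")
    for m in range(1, len(ordered) + 1):
        model = tuple(ordered[:m])
        val = (-2.0 * loglik(model) + m * math.log(n)
               + 2.0 * gamma * log_binom(p, m))
        if val < best_val:
            best_model, best_val = model, val
    return best_model
```

Because the models are nested along the ranking, only $k$ likelihood maximizations are needed per $\gamma$, rather than a search over all $2^k$ subsets.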
The choice of $\gamma$ in the EBIC affects the positive discovery rate (PDR) and the false discovery rate (FDR). In the context of GWAS, the PDR is the proportion of correctly identified SNPs among all etiological SNPs, and the FDR is the proportion of incorrectly identified SNPs among all identified SNPs. A larger $\gamma$ results in a smaller FDR but also a lower PDR. A smaller $\gamma$ results in a higher PDR but also a higher FDR. A balance must be struck between PDR and FDR according to the purpose of the study. If the purpose is to confirm the etiological effect of certain well-studied loci or regions, one should put more emphasis on a desirably low FDR than on a high PDR. If the purpose is to discover candidate loci or regions for further study, one should put more emphasis on a high PDR with only a reasonable FDR. The FDR is related to the statistical significance of the features. Measures of statistical significance can be obtained from the final identified models and their corresponding subintervals. The upper bound of the subinterval determines the largest threshold which the effects of the features in the model must exceed. Likelihood ratio test (LRT) statistics can be used to assess the significance of the feature effects. For example, suppose a model consisting of $k_1$ main-effect features and $k_2$ interaction features is selected with $\gamma$ in a subinterval $[\gamma_l, \gamma_u]$. The LRT statistic for the significance of the feature with the lowest rank in the model must exceed the threshold $\ln n + 2\gamma_u \ln p$ if the feature is a main-effect one, and $\ln n + 2\gamma_u \ln[p(p-1)/2]$ if the feature is an interaction one. If the feature does not actually have any effect, the probability for the LRT statistic to exceed the threshold is at most $P(\chi^2_1 > \ln n + 2\gamma_u \ln p)$ or $P(\chi^2_1 > \ln n + 2\gamma_u \ln[p(p-1)/2])$ for a main-effect or an interaction feature, respectively. These probabilities, like the $p$-values in classical hypothesis testing, provide a statistical basis for the user to determine which model should be taken as the selected model.
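These tail probabilities are easy to compute in closed form: for one degree of freedom, $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$. The sketch below (our naming) assumes the EBIC-implied threshold is $\ln n + 2\gamma \ln(\text{number of candidate features})$, with the candidate count being $p$ for a main-effect feature and $p(p-1)/2$ for an interaction feature:

```python
import math

def lrt_tail_prob(n, n_candidates, gamma):
    """Upper bound on the significance probability of the lowest-ranked
    feature: P(chi^2_1 > log(n) + 2*gamma*log(n_candidates))."""
    threshold = math.log(n) + 2.0 * gamma * math.log(n_candidates)
    # For 1 df, P(chi^2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(threshold / 2.0))
```

For example, with $n = 1000$ and $\gamma = 0$ (ordinary BIC) the bound is already below 0.01, and it shrinks rapidly as $\gamma$ or the candidate pool grows, which is the sense in which larger $\gamma$ trades PDR for FDR.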
A final issue with the two-stage logistic regression procedure is how to determine $k_1$ and $k_2$. If they are large enough, usually several times the actual numbers, their choice will not affect the final model selection. Since the actual numbers are unknown, a strategy is to consider several different values of $k_1$ and $k_2$. First, run the procedure with an educated guess of $k_1$ and $k_2$. Then, run the procedure again using larger $k_1$ and $k_2$. If the models identified using these values are almost the same, the choice of $k_1$ and $k_2$ is appropriate. Otherwise, further values of $k_1$ and $k_2$ should be considered, until different choices eventually yield the same results.
3. Analysis of CGEMS Prostate Cancer Data
The CGEMS data portal of the National Cancer Institute, USA, provides public access to the summary results of approximately 550,000 SNPs genotyped in the CGEMS prostate cancer whole-genome scan; see http://cgems.cancer.gov. We applied the two-stage penalized regression approach to the prostate cancer Phase 1A data from the prostate, lung, colorectal, and ovarian (PLCO) cancer screening trial. The dataset contains 294,179 autosomal SNPs which passed quality control on 1,111 controls and 1,148 cases (673 cases are aggressive, Gleason ≥ 7 or stage ≥ III; 475 cases are nonaggressive, Gleason < 7 and stage < III). In our analysis, we put all the cases together without distinguishing aggressive and nonaggressive ones. We assumed the additive genetic mode for all the SNPs.
Applying the screening stage to all 294,179 SNPs directly is not only time consuming but also unnecessary. Therefore, we did a preliminary screening using single-SNP logistic models. For each SNP, a logistic model is fitted and the $p$-value of the significance test of the SNP effect is obtained. Those SNPs with a $p$-value bigger than 0.05 are discarded; the 17,387 SNPs with a $p$-value less than 0.05 are retained.
Because of the sheer number of features, 17,387 main-effect features and about $1.5 \times 10^8$ interaction features, the tournament procedure is applied in the screening stage. At the main-effect feature screening step, the main-effect features are randomly partitioned into groups of size 1,000, except one group of size 1,387, and 100 features are selected from each group. A second round of screening is applied to the 1,700 selected features, out of which 100 features are retained. The interaction feature screening is applied to all $\binom{17{,}387}{2}$ interaction features. In each round, the retained features are partitioned into groups of size 1,000, and 50 features are selected from each group. The procedure continues until 300 interaction features are finally selected. Eventually, the 100 main-effect features and 300 interaction features are put together and screened to retain a total of 100 features (main-effect or interaction). These eventual 100 features are then subjected to the selection procedure.
The features selected by the EBIC with $\gamma$ in three subintervals are given in Table 1. With the largest $\gamma$ value at which at least one feature can be selected, the following three interaction features are selected: rs1885693-rs12537363, rs7837688-rs2256142, and rs1721525-rs2243988. The next largest value, 0.77, selects 7 additional interaction features, and the third largest value, 0.73, selects 2 further interaction features; the corresponding significance levels are implied by these $\gamma$ values as described in Section 2.5. The chromosomal region 8q24 is the one on which many previous prostate cancer studies have concentrated. It has been reported in a number of studies that rs1447295, one of the 4 tightly linked SNPs in the "locus 1" region of 8q24, is associated with prostate cancer, and it has been established as a benchmark for prostate cancer association studies. In the current data set, we found that rs7837688 is highly correlated with rs1447295 and is more significant than rs1447295 based on single-SNP models. These two SNPs, which are in the same recombination block, are also physically close.
An older and slightly different version of the CGEMS prostate data was analyzed by Yeager et al. using the single-SNP multiple testing approach. In their analysis, they distinguished between aggressive and nonaggressive status and assumed no structure on genetic modes. For each SNP, they considered four tests: a $\chi^2$-test with 4 degrees of freedom based on a contingency table; a score test with 4 degrees of freedom based on a polytomous logistic regression model adjusted for age group, region of recruitment, and whether a case was diagnosed within one year of entry to the trial; and two further tests which are the same as the $\chi^2$ and score tests but take into account incidence-density sampling. They identified two physically close but genetically independent regions (at a distance of 0.65 centimorgans) within 8q24. One of the regions is where the benchmark SNP rs1447295 is located. They reported three SNPs: rs1447295, rs7837688, and rs6983267, where rs7837688 is in the same region as rs1447295 and rs6983267 is in the other region. The reported $p$-values are computed from the score statistic based on the incidence-density-sampling polytomous logistic regression model adjusted for the other covariates.
In our analysis, we identified rs7837688 but not rs1447295. This is because the penalized likelihood tends to select only one feature among several highly correlated features, in contrast to multiple testing, which selects all the correlated features if any of them is associated with the disease status. We failed to identify rs6983267. A possible reason is that its effect is masked by other more significant features identified in our analysis. We also carried out the selection procedure with only the 100 main-effect features retained from the screening stage. It was found that rs6983267 is among the top 20 selected main-effect features. It is interesting to notice that the two SNPs rs7837688 and rs1721525 appearing in the top three interaction features are also among the top four features selected, with a maximum $\gamma$ value of 0.7185, when only main-effect features are considered. Since no SNP on chromosomes other than 8q24 has been reported in other studies, we wonder whether statistically significant SNPs on other chromosomes can be ignored for biological reasons. If not, our analysis strongly suggests that rs1721525, located on chromosome 1, could represent another region of the genome associated with prostate cancer; if this holds biologically, chromosome 1 cannot be excluded in the consideration of genetic variants for prostate cancer.
4. Simulation Studies
We present results of two simulation studies in this section. In the first study, we compare the two-stage penalized logistic regression (TPLR) approach with the paired-SNP multiple testing (PMT) approach of Marchini et al.  under simulation settings considered by them. In the second study, we compare the TPLR approach with LASSO-patternsearch using a data structure mimicking the CGEMS prostate cancer data.
4.1. Simulation Study 1
The comparison of TPLR and PMT is based on four models. Each model involves two etiological SNPs. In the first model, the effects of the two SNPs are multiplicative both within and between loci; in the second model, the effects of the two SNPs are multiplicative within but not between loci; in the third model, the two SNPs have threshold interaction effects; in the fourth model, the two SNPs have an interaction effect but no marginal effects. The first three models are taken from Marchini et al. . The details of these models are provided in the supplementary document.
Marchini et al. considered two strategies of PMT. In the first strategy, a logistic model with 9 parameters is fitted for each pair of SNPs, and a Bonferroni-corrected significance level is used to declare the significant pairs. In the second strategy, the SNPs that are significant in single-SNP tests at a liberal level $\alpha_1$ are identified first; then the significance of all the pairs formed by these SNPs is tested using the Bonferroni-corrected level $\alpha_2$.
In the first three models, the marginal effects of both loci are nonnegligible and can be picked up by the single-SNP tests at the relaxed significance level. In this situation, the second strategy has an advantage over the first in terms of detection power and false discovery rate. In this study, we compare our approach with the second strategy of PMT under the first three models. In the fourth model, since there are no marginal effects at either locus, the second strategy of PMT cannot be applied, since it will fail to pick up any loci at the first step. Hence, we compare our approach with the first strategy of PMT. However, the first strategy involves a stupendous amount of computation which exceeds our computing capacity. To circumvent this difficulty, we consider an artificial version of the first strategy; that is, we only consider the pairs which involve at least one of the etiological SNPs. This artificial version has the same detection power but a lower false discovery rate than the full version. The artificial version cannot be implemented with real data since it requires knowledge of the etiological SNPs. However, it can be implemented with simulated data and serves the purpose of comparison.
Each simulated dataset contains 800 individuals (400 cases and 400 controls) with genotypes at $p$ SNPs. Two values of $p$, 1000 and 5000, are considered. The genotypes at the disease loci, which are not among the $p$ SNPs, and the disease status of the individuals are generated first. Then, the genotypes of the SNPs which are in linkage disequilibrium with the disease loci are generated using a prespecified squared correlation coefficient. The genotypes of the remaining SNPs are generated independently assuming Hardy-Weinberg equilibrium. For the first three models, the effects of the disease loci are specified by the prevalence, the disease allele frequencies, and two marginal effect parameters. The prevalence is set at 0.01 throughout, and the two marginal effects are set equal. For the fourth model, the effect is specified through the interaction coefficient in the logistic model; the remaining coefficients are determined through the constraints of the model, with the intercept set to −5. The definitions of these parameters and the details of the data generation are given in the supplementary document.
The liberal level $\alpha_1$ and the Bonferroni-corrected level $\alpha_2$ in the PMT approach are taken to be 0.1 and 0.05, respectively, the same as in Marchini et al. The $\gamma$ in the EBIC is fixed at 1, since it is infeasible to incorporate the consideration of the choice of $\gamma$ into the simulation study. The average PDR and FDR over 200 simulation replicates under Models 1–4 are given in Tables 2–5, respectively. In Table 5, the entries of the FDR for the PMT approach are lower bounds rather than the actual FDRs since, as mentioned earlier, only the pairs of SNPs involving at least one etiological SNP are considered in the artificial version of the first strategy of PMT, which results in fewer false discoveries than the full version while retaining the same positive detections.
The results presented in Tables 2–5 are summarized as follows. Under Model 1, TPLR has a much lower FDR and a comparable PDR compared with PMT. Under Models 2–4, the PDR of TPLR is significantly higher than that of PMT in all cases except one setting of Model 2 (0.95 versus 1) and one setting of Model 3 (0.81 versus 0.84). The overall averaged FDR of TPLR is 0.0487 while that of PMT is 0.7604. Thus the FDR of TPLR is always kept at reasonably low levels while that of PMT is intolerably high, and at the same time TPLR is still more powerful than PMT for detecting etiological SNPs. From the simulation results, we can also see the impact of $p$ on PDR and FDR. In general, increasing $p$ reduces the PDR and increases the FDR of both approaches. However, the impact on TPLR is smaller than that on PMT.
4.2. Simulation Study 2
The data for this simulation study are generated mimicking the structure of the CGEMS prostate cancer data. The cases and controls are generated using a logistic model whose linear predictor involves the feature values of 14 SNPs; these mimic the 14 SNPs involved in the top 5 main-effect features and top 5 interaction features of the CGEMS prostate cancer data. The minor allele frequencies (MAF) of the SNPs are estimated from the prostate cancer data. The genotypes of these 14 SNPs are generated using the MAF, assuming Hardy-Weinberg equilibrium. In addition to these 14 SNPs, 20,000 noncausal SNPs are randomly selected (without replacement) from the 294,179 SNPs of the prostate cancer data in each simulation replicate. For each simulation replicate, 1,000 cases and 1,000 controls are generated. They are matched with randomly selected (without replacement) individuals from the prostate cancer data, and their genotypes at the 20,000 noncausal SNPs are taken to be the same as those in the prostate cancer data.
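The genotype generation for the 14 causal SNPs can be sketched as follows (a toy stand-in with our naming; the MAF values here are placeholders, not the estimates used in the paper). Under Hardy-Weinberg equilibrium, a genotype code is simply the sum of two independent allele draws:

```python
import random

def simulate_genotypes(maf, n, seed=0):
    """Simulate additive genotype codes (0/1/2 = minor-allele counts) for n
    individuals under Hardy-Weinberg equilibrium, given a list of minor
    allele frequencies."""
    rng = random.Random(seed)
    # each allele is the minor one with probability q, independently
    return [[(rng.random() < q) + (rng.random() < q) for q in maf]
            for _ in range(n)]
```

Under this scheme the expected genotype code at a locus with MAF $q$ is $2q$, so the simulated columns can be sanity-checked against their target frequencies.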
In the TPLR approach, 50 main-effect features and 50 interaction features are selected in the screening stage using the tournament screening strategy. In the selection stage, EBIC($\gamma$) values are calculated for the nested models with $\gamma$ ranging from 0 to 2 in steps of 0.1.
In the LASSO-patternsearch approach, at the screening stage, 0.05 and 0.002 are used as thresholds for the $p$-values of the main-effect features and interaction features, respectively. At the LASSO selection step, 5-fold cross-validation is used to choose the penalty parameter. At the hypothesis testing step, 9 levels $\alpha$ are considered. The largest level amounts to stopping the procedure at the LASSO selection step.
Since in the TPLR approach there is no definite choice of $\gamma$, to facilitate the comparison we calculate the PDR and FDR for each fixed $\gamma$ value in the TPLR approach and for each fixed level $\alpha$ in LASSO-patternsearch. The PDR and FDR are calculated separately for the detection of true main-effect and interaction features. They are also calculated for the detection of causal SNPs. A causal SNP is considered positively discovered if it is selected either as a main-effect feature or as a constituent of an interaction feature. The simulated FDR and PDR over 100 replicates of TPLR and of LASSO-patternsearch are reported in Table 6. It is the $\gamma$ values at the higher end and the $\alpha$ levels at the lower end that are actually involved in the final selection, so the comparison of the results at those values is the more relevant one. As shown by the bold digits in Table 6, TPLR has higher PDR and lower FDR than LASSO-patternsearch across the board. For the main-effect features, the lowest FDR of TPLR is 0.006 while it achieves a PDR of around 0.65, but the lowest FDR of LASSO-patternsearch is around 0.2 while it only achieves a PDR of around 0.6. The FDR and PDR on interaction features and on causal SNPs show the same pattern. When the two approaches have about the same PDR, LASSO-patternsearch has a much larger, undesirable FDR than TPLR. For example, on the SNPs, when the PDR is 0.608 for TPLR and 0.609 for LASSO-patternsearch, the FDRs are, respectively, 0.041 and 0.654; on the main-effect features, when the PDR is 0.646 for both TPLR and LASSO-patternsearch, the FDRs are, respectively, 0.006 and 0.220. The ROC curves of the two approaches in identifying etiological SNPs are plotted in Figure 1. Figure 1 shows clearly that the PDR of TPLR is much higher than the PDR of LASSO-patternsearch at the same FDR, uniformly over the FDR range.
To investigate the effect of the choice of and , we considered 15, 25, and 50, which are 3, 5, and 10 times the actual number of causal features, respectively. The simulation results show that, although there is a slight difference between the choice of 15 and the other two choices, there is no substantial difference between the choices of 25 and 50. This justifies the strategy given at the end of Section 2. Detailed results on the comparison of these choices are given in the supplementary document.
We also investigated whether the ranking step in the TPLR approach really reflects the actual importance of the features. The average ranks of the ten causal features over the 100 simulation replicates are given in Table 7.
On average, the causal features are all among the top ten ranks, which justifies the ranking step in the selection stage of the TPLR approach.
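The average-rank check just described is a simple bookkeeping step; a minimal sketch, with hypothetical feature names and rankings, is the following.

```python
# Average rank of each causal feature over replicates, as summarized in
# Table 7. The causal features and per-replicate rankings are illustrative.

causal = ["snpA", "snpB"]
rankings = [
    ["snpA", "snpB", "snpC", "snpD"],  # replicate 1 (best feature first)
    ["snpB", "snpA", "snpD", "snpC"],  # replicate 2
]

avg_rank = {
    f: sum(r.index(f) + 1 for r in rankings) / len(rankings) for f in causal
}
print(avg_rank)  # both causal features average rank 1.5
```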
5. Some Remarks
It is a common understanding that individual SNPs are unlikely to play an important role in the development of complex diseases; instead, it is the interactions of many SNPs that are behind disease development, see Garte . The finding that only interaction features are selected in our analysis (since they are more significant than the main-effect features) provides some evidence for this understanding. Perhaps even higher-order interactions should be investigated, which makes methods such as penalized logistic regression, which can deal with interactions, all the more desirable.
The analysis of the CGEMS prostate cancer data can be refined by replacing the binary logistic model with a polytomous logistic regression model, taking into account that the genetic mechanisms behind aggressive and nonaggressive prostate cancers might be different. Accordingly, the penalty in the penalized likelihood can be replaced by a variant of the group LASSO penalty considered by Huang et al. . A polytomous logistic regression model with an appropriate penalty function is of general interest for feature selection with multinomial responses and will be pursued elsewhere.
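To make the proposed refinement concrete, here is a minimal sketch of an unpenalized polytomous (multinomial) logistic model fitted by gradient descent on synthetic data with three outcome classes (e.g. control / nonaggressive / aggressive). The group-penalty term discussed above is deliberately omitted, and all data and settings are illustrative.

```python
# Minimal multinomial (softmax) logistic regression by gradient descent.
# This is only a sketch of the model class mentioned in the text; the
# penalized version would add a group-penalty term to the objective.
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 300, 5, 3                      # observations, features, classes
X = rng.normal(size=(n, p))
true_B = rng.normal(size=(p, k))         # hypothetical true coefficients
logits = X @ true_B
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(k, p=row) for row in probs])

B = np.zeros((p, k))
for _ in range(500):
    Z = X @ B
    P = np.exp(Z - Z.max(axis=1, keepdims=True))  # stabilized softmax
    P /= P.sum(axis=1, keepdims=True)
    Y = np.eye(k)[y]                              # one-hot responses
    grad = X.T @ (P - Y) / n                      # gradient of the mean NLL
    B -= 0.5 * grad

accuracy = ((X @ B).argmax(axis=1) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Since the negative log-likelihood is convex in B, plain gradient descent suffices for this sketch; a group penalty would instead call for a proximal-gradient or coordinate-descent scheme.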
Acknowledgments
The authors would like to thank the National Cancer Institute of the USA for granting access to the CGEMS prostate cancer data. The research of the authors is supported by Research Grant R-155-000-065-112 of the National University of Singapore. The research of the first author was done when she was a Ph.D. student at the National University of Singapore.
Supplementary Materials
The supplementary document contains (1) the details of the genetic models and the data generation procedure considered in Simulation Study 1 and (2) some further results obtained in Simulation Study 2.
References
- L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth Statistics/Probability Series, Wadsworth Advanced Books and Software, Belmont, Calif, USA, 1984.
- L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
- V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
- H. Schwender and K. Ickstadt, “Identification of SNP interactions using logic regression,” Biostatistics, vol. 9, no. 1, pp. 187–198, 2008.
- J. Marchini, P. Donnelly, and L. R. Cardon, “Genome-wide strategies for detecting multiple loci that influence complex diseases,” Nature Genetics, vol. 37, no. 4, pp. 413–417, 2005.
- Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B, vol. 57, no. 1, pp. 289–300, 1995.
- B. Efron and R. Tibshirani, “Empirical Bayes methods and false discovery rates for microarrays,” Genetic Epidemiology, vol. 23, no. 1, pp. 70–86, 2002.
- J. D. Storey and R. Tibshirani, “Statistical methods for identifying differentially expressed genes in DNA microarrays,” Functional Genomics: Methods in Molecular Biology, vol. 224, pp. 149–157, 2003.
- J. Hoh and J. Ott, “Mathematical multi-locus approaches to localizing complex human trait genes,” Nature Reviews Genetics, vol. 4, no. 9, pp. 701–709, 2003.
- Z. Chen and J. Chen, “Tournament screening cum EBIC for feature selection with high-dimensional feature spaces,” Science in China. Series A, vol. 52, no. 6, pp. 1327–1341, 2009.
- J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
- D. Firth, “Bias reduction of maximum likelihood estimates,” Biometrika, vol. 80, no. 1, pp. 27–38, 1993.
- J. Chen and Z. Chen, “Extended Bayesian information criteria for model selection with large model spaces,” Biometrika, vol. 95, no. 3, pp. 759–771, 2008.
- J. Chen and Z. Chen, “Extended BIC for small-n-large-P sparse GLM,” Statistica Sinica. In press.
- J. Fan and Y. Fan, “High-dimensional classification using features annealed independence rules,” The Annals of Statistics, vol. 36, no. 6, pp. 2605–2637, 2008.
- W. Shi, K. E. Lee, and G. Wahba, “Detecting disease-causing genes by LASSO-Patternsearch algorithm,” BMC Proceedings, vol. 1, supplement 1, p. S60, 2007.
- J. Fan and J. Lv, “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society. Series B, vol. 70, no. 5, pp. 849–911, 2008.
- K. K. Wu, Comparison of sure independence screening and tournament screening for feature selection with ultra-high dimensional feature space, Honors thesis, Department of Statistics and Applied Probability, National University of Singapore, 2010.
- W. L. H. Koh, The comparison of two-stage feature selection methods in small-n-large-p problems, Honors thesis, Department of Statistics and Applied Probability, National University of Singapore, 2011.
- J. Zhao, Model selection methods and their applications in genome-wide association studies, Ph.D. thesis, Department of Statistics and Applied Probability, National University of Singapore, 2008.
- A. Albert and J. A. Anderson, “On the existence of maximum likelihood estimates in logistic regression models,” Biometrika, vol. 71, no. 1, pp. 1–10, 1984.
- M. Y. Park and T. Hastie, “Penalized logistic regression for detecting gene interactions,” Biostatistics, vol. 9, no. 1, pp. 30–50, 2008.
- T. T. Wu, Y. F. Chen, T. Hastie, E. Sobel, and K. Lange, “Genome-wide association analysis by lasso penalized logistic regression,” Bioinformatics, vol. 25, no. 6, pp. 714–721, 2009.
- P. Armitage, Statistical Methods in Medical Research, Blackwell, Oxford, UK, 1971.
- N. Breslow and N. E. Day, Statistical Methods in Cancer Research, vol. 1 of The Analysis of Case-Control Studies, International Agency for Research on Cancer Scientific Publications, Lyon, France, 1980.
- M. Y. Park and T. Hastie, “An L1 regularization path algorithm for generalized linear models,” Journal of the Royal Statistical Society. Series B, vol. 69, no. 4, pp. 659–677, 2007.
- M. Yeager, N. Orr, R. B. Hayes et al., “Genome-wide association study of prostate cancer identifies a second risk locus at 8q24,” Nature Genetics, vol. 39, no. 5, pp. 645–649, 2007.
- S. Garte, “Metabolic susceptibility genes as cancer risk factors: time for a reassessment?” Cancer Epidemiology Biomarkers and Prevention, vol. 10, no. 12, pp. 1233–1237, 2001.
- J. Huang, S. Ma, H. Xie, and C.-H. Zhang, “A group bridge approach for variable selection,” Biometrika, vol. 96, no. 2, pp. 339–355, 2009.
Copyright © 2012 Jingyuan Zhao and Zehua Chen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.