Abstract

We propose a two-stage penalized logistic regression approach to case-control genome-wide association studies. This approach consists of a screening stage and a selection stage. In the screening stage, main-effect and interaction-effect features are screened by using 𝐿1-penalized logistic like-lihoods. In the selection stage, the retained features are ranked by the logistic likelihood with the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and Jeffrey’s Prior penalty (Firth, 1993), a sequence of nested candidate models are formed, and the models are assessed by a family of extended Bayesian information criteria (J. Chen and Z. Chen, 2008). The proposed approach is applied to the analysis of the prostate cancer data of the Cancer Genetic Markers of Susceptibility (CGEMS) project in the National Cancer Institute, USA. Simulation studies are carried out to compare the approach with the pair-wise multiple testing approach (Marchini et al. 2005) and the LASSO-patternsearch algorithm (Shi et al. 2007).

1. Introduction

The case-control genome-wide association study (GWAS) with single-nucleotide polymorphism (SNP) data is a powerful approach to the research on common human diseases. There are two goals of GWAS: (1) to identify suitable SNPs for the construction of classification rules and (2) to discover SNPs which are etiologically important. The emphasis is on the prediction capacity of the SNPs for the first goal and on the etiological effect of the SNPs for the second goal. The phrase β€œan etiological SNP” is used in the sense that either the SNP itself is etiological or it is in high-linkage disequilibrium with an etiological locus. Well-developed classification methods in the literature can be used for the first goal. These methods include classification and regression trees [1], random forest [2], support vector machine [3], and logic regression [4]. In this article, we focus on statistical methods for the second goal.

The approach of multiple testing based on single or paired SNP models is commonly used for the detection of etiological SNPs. Either the Bonferroni correction is applied for the control of the overall Type I error rate, see, for example, Marchini et al. [5] or some methods are used to control the false discovery rate (FDR), see, Banjamini and Hochberg [6], Efron and Tibshirani [7], and Storey and Tibshirani [8]. Other variants of multiple testing have also been advocated, see Hoh and Ott [9]. The multiple test approach considers either a single SNP or a pair of SNPs at a time. It does not adjust for the effects of other markers. If there are many loci having high sample correlations with a true genetic variant, which is common in GWAS, it is prone to result in spurious etiological loci.

It is natural to seek alternative methods that overcome the drawback of multiple testing. Such methods must have the nature of considering many loci simultaneously and assessing the significance of the loci by their synergistic effect. When the synergistic effect is of concern, adding loci spuriously correlated to an etiological locus does not contribute to the synergistic effect while the etiological locus has already been considered. Thus the drawback of multiple testing can be avoided. In this paper, we propose a method of the abovementioned nature: a two-stage penalized logistic regression approach. In the first stage of this approach, 𝐿1-penalized logistic regression models are used together with a tournament procedure [10] to screen out apparently unimportant features (by features we refer to the covariates representing SNPs or their products). In the second stage, logistic models with the SCAD penalty [11] plus the Jeffrey’s prior penalty [12] are used to rank the retained features and form a sequence of nested candidate models. The extended Bayesian information criteria (EBIC, [13, 14]) are used for the final model selection. In both stages of the approach, the features are assessed by their synergistic effects.

The two-stage strategy has been considered by other authors. For example, J. Fan and Y. Fan [15] adopted this strategy for high-dimensional classification, and Shi et al. [16] developed a two-stage procedure called LASSO-patternsearch. Sure independence screening (SIS, [17]) and its ramifications such as correlation screening and 𝑑-tests are commonly used in the screening stage. Compared with SIS approaches, the tournament screening with 𝐿1-penalized likelihood produces less spuriously correlated features while enjoying the sure screening property possessed by the SIS approaches, see Z. Chen and J. Chen [10] and the comprehensive simulation studies by Wu [18], which has an impact on the accuracy of feature selection in the second stage, see Koh [19]. The 𝐿1-penalized likelihood is easier to compute than that with the SCAD penalty. However, the SCAD penalty has an edge over the 𝐿1-penalty in ranking the features so that the ranks are more consistent with their actual effects. This has been observed in simulation studies, see Zhao [20]. It is possibly due to the fact that the 𝐿1 penalty over-penalizes those features with large effects compared with SCAD penalty that does not penalize large effects at all. Jeffrey’s prior penalty is added to handle the difficulty caused by separation of data that usually presents in logistic regression models with factor covariates, see Albert and Anderson [21]. If, within any of the categories determined by the levels of the factors, the responses are all 1 or 0, it is said that there is a complete data separation. When the responses within any of the categories are almost all 1 or 0, it is referred to as a quasicomplete data separation. When there is separation (complete or quasi-complete), the maximum likelihood estimate of the corresponding coefficients becomes infinite. Jeffrey’s prior penalty plays the role to shrink the parameters toward zero in the case of separation.

Logistic regression models with various penalties have been considered for GWAS by a number of authors. Park and Hastie [22] considered logistic models with a 𝐿2-penalty. Wu et al. [23] considered logistic models with an 𝐿1-penalty. The LASSO-patternsearch developed by Shi et al. [16] is also based on logistic regression models. However, the accuracy for identifying etiological SNPs was not fully addressed. Park and Hastie [22] introduced the 𝐿2-penalty mainly for computational reasons. Their method is essentially a classical stepwise procedure with AIC/BIC as model selection criteria. The method considered by Wu et al. [23] is in fact only a screening procedure. The numbers of main-effect and interaction features to be retained are predetermined and left as a subjective matter. The LASSO-patternsearch is closer to our approach. The procedure first screens the features by correlation screening based on single-feature (main-effect/interaction) models. Then a LASSO model is fitted to the retained features with its penalty parameter chosen by cross-validation. The features selected by LASSO are then refitted to a nonpenalized logistic regression model, and the coefficients of the features are subjected to hypothesis testing with varied level 𝛼. The 𝛼 is again determined by cross-validation. By using cross-validation, this procedure addresses the prediction error of the selected model instead of the accuracy of the selected features. Our method is compared with the LASSO-patternsearch and the multiple test approach by simulation studies.

The two-stage penalized logistic regression approach is described in detail in Section 2. The approach is applied to a publically accessible CGEMS prostate cancer data in Section 3. Simulation studies are presented in Section 4. The paper is ended by some remarks. A supplementary document which contains some details omitted in the paper is provided at the website: http://www.stat.nus.edu.sg/~stachenz/, available online at doi: 10.1155/2012/642403.

2. The Two-Stage Penalized Logistic Regression Approach

We first give a brief account on the elements required in the approach: the logistic model for case-control study, the penalized likelihood, and the EBIC.

2.1. Logistic Model for Case-Control GWAS

Let 𝑦𝑖 denote the disease status of individual 𝑖, 1 for case and 0 for control. Denote by π‘₯𝑖𝑗, 𝑗=1,…,𝑃, the genotypes of individual 𝑖 at the SNPs under study. The π‘₯𝑖𝑗 takes the value 0, 1, or 2, corresponding to the number of a particular allele in the genotype. Here, the additive genetic mode is assumed for all SNPs. The logistic model is as follows:π‘¦π‘–ξ€·βˆΌBinomial1,πœ‹π‘–ξ€Έ,πœ‹log𝑖1βˆ’πœ‹π‘–=𝛽0+𝑃𝑗=1𝛽𝑗π‘₯𝑖𝑗+𝑗<π‘˜πœ‰π‘—π‘˜π‘₯𝑖𝑗π‘₯π‘–π‘˜,𝑖=1,…,𝑛,(2.1) where π‘₯𝑖𝑗 and π‘₯𝑖𝑗π‘₯π‘–π‘˜ are referred to as main-effect and interaction features, respectively, hereafter. The validity of the logistic model for case-control experiments has been argued by Armitage [24] and Breslow and Day [25]. There are two fundamental facts about the above model for GWAS: (a) the number of features is much larger than the sample size 𝑛, since 𝑃 is usually huge in GWAS, this situation is referred to as small-𝑛-large-𝑝; (b) since there are only a few etiological SNPs, only a few of the coefficients in the model are nonzero, this phenomenon is referred to as sparsity.

2.2. Penalized Likelihood

Penalized likelihood makes the fitting of a logistic model with small-𝑛-large-𝑝 computationally feasible. It also provides a mechanism for feature selection. Let 𝑠 be the index set of a subset of the features. Let 𝐿(𝜽(𝑠)βˆ£π‘ ) denote the likelihood function of the logistic model consisting of features with indices in 𝑠, where 𝜽(𝑠) consists of those 𝛽 and πœ‰ with their indices in 𝑠. The penalized log likelihood is defined as 𝑙𝑝(ξ“πœ½(𝑠)βˆ£πœ†)=βˆ’2log𝐿(𝜽(𝑠)βˆ£π‘ )+π‘—βˆˆπ‘ π‘πœ†ξ€·πœƒπ‘—ξ€Έ,(2.2) where π‘πœ†(β‹…) is a penalty function and πœ† is called the penalty parameter. The following penalty functions are used in our approach:𝐿1-penaltyβˆΆπ‘πœ†ξ€·πœƒπ‘—ξ€Έ||πœƒ=πœ†π‘—||,SCADpenaltyβˆΆπ‘ξ…žπœ†ξ€·||πœƒ||𝐼||πœƒ||ξ€Έ+ξ€·||πœƒ||ξ€Έ=πœ†β‰€πœ†π‘Žπœ†βˆ’+𝐼||πœƒ||ξ€Έξƒ°,(π‘Žβˆ’1)πœ†>πœ†(2.3) where π‘Ž is a fixed constant bigger than 2. The penalized log likelihood with 𝐿1-penalty is used together with the tournament procedure [10] in the screening stage. At each application of the penalized likelihood, the parameter πœ† is tuned such that the minimization of the penalized likelihood yields a predetermined number of nonzero coefficients. The R package glmpath developed by Park and Hastie [26] is used for the computation, the tuning on πœ† is equivalent to setting the maximum steps in the glmpath function to the predetermined number of nonzero coefficients. The SCAD penalty is used in the second stage for ranking the features.

2.3. The Extended BIC

In small-𝑛-large-𝑝 problems, the AIC and BIC are not selection consistent. To tackle the issue of feature selection in small-𝑛-large-𝑝 problems, J. Chen and Z. Chen [13] developed a family of extended Bayes information criteria (EBIC). In the context of the logistic model described above, the EBIC is given as +ξ€·πœˆEBIC(𝛾)=βˆ’2log𝐿𝜽(𝑠)βˆ£π‘ 1+𝜈2ξ€ΈβŽ›βŽœβŽœβŽπ‘ƒπœˆlog(𝑛)+2𝛾log1βŽžβŽŸβŽŸβŽ βŽ›βŽœβŽœβŽπœˆπ‘ƒ(π‘ƒβˆ’1)/22⎞⎟⎟⎠,𝛾β‰₯0,(2.4) where 𝜈1 and 𝜈2 are, respectively, the numbers of main-effect and interaction features and 𝜽(𝑠) is the maximum likelihood estimate of the parameter vector in the model. It has been shown that, under certain conditions, the EBIC is selection consistent when 𝛾 is larger than 1βˆ’ln𝑛/(2ln𝑝), see J. Chen and Z. Chen [13, 14]. The original BIC, which corresponds to the EBIC with 𝛾=0, fails to be selection consistent when 𝑝 has a βˆšπ‘› order.

We now describe the two-stage penalized logistic regression (TPLR) approach as follows.

2.4. Screening Stage

Let 𝑛𝑀 and 𝑛𝐼 be two predetermined numbers, respectively, for main-effect and interaction features to be retained. The screening stage consists of two steps: a main-effect screening step and an interaction screening step.

In the main-effect screening step, only the main-effect features are considered. Let 𝑆𝑀 denote the index set of the main features, that is, 𝑆𝑀={1,…,𝑝}. If |𝑆𝑀|, the number of members in 𝑆𝑀, is not too large, minimizeξ€·βˆ’2logπΏπœ·βˆ£π‘†π‘€ξ€Έ+πœ†π‘ξ“π‘—=1||𝛽𝑗||(2.5) by tuning the value of πœ† to retain 𝑛𝑀 features. If |𝑆𝑀| is very large. The following tournament procedure proposed in Z. Chen and J. Chen [10] is applied. Partition 𝑆𝑀 into 𝑆𝑀=βˆͺπ‘˜π‘ π‘˜ with |π‘ π‘˜| equal to an appropriate group size 𝑛𝐺, where 𝑛𝐺 is chosen such that the minimization of the penalized likelihood with 𝑛𝐺 features can be efficiently carried out. For each π‘˜, minimize ξ€·πœ·ξ€·π‘ βˆ’2logπΏπ‘˜ξ€Έβˆ£π‘ π‘˜ξ€Έξ“+πœ†π‘—βˆˆπ‘ π‘˜||𝛽𝑗||(2.6) by tuning the value of πœ† to retain π‘›π‘˜(β‰ˆπ‘›π‘€) features. If βˆ‘π‘—π‘›π‘—>𝑛𝐺, repeat the above process with all retained features; otherwise, apply the 𝐿1-penalized logistic model to the retained features to reduce the number to 𝑛𝑀. Let 𝑠𝑀 denote the indices of these 𝑛𝑀 features.

The interaction screening is similar to the main-effect screening step. However, the main-effect features retained in the main-effect screening step are built in the models for interaction screening. Let 𝑆𝐼 denote the set of pairs of the indices for all the interaction features, that is, 𝑆𝐼={(𝑖,𝑗)βˆΆπ‘–<𝑗,𝑖,𝑗=1,…,𝑝}. Since |𝑆𝐼| is large in general, the tournament procedure is applied for interaction screening. Let 𝑆𝐼 be partitioned as 𝑆𝐼=βˆͺπ‘˜π‘‡π‘˜ with |π‘‡π‘˜|β‰ˆπ‘›πΊ. For each π‘˜, minimizeξ€·πœ·ξ€·π‘ βˆ’2log𝐿𝑀𝑇,πƒπ‘˜ξ€Έβˆ£π‘ π‘€,π‘‡π‘˜ξ€Έξ“+πœ†(𝑖,𝑗)βˆˆπ‘‡π‘˜||πœ‰π‘–π‘—||(2.7) by tuning the value of πœ† to retain π‘›π‘˜(β‰ˆπ‘›πΌ) interaction features. Note that, in the above penalized likelihood, both the main-effect features in 𝑠𝑀 and the interaction features in π‘‡π‘˜ are involved in the likelihood part. However, only the parameters associated with the interaction features are penalized. Since no penalty is put on the parameters associated with the main-effect features, the main-effect features are always retained in the process of interaction screening. If βˆ‘π‘˜π‘›π‘˜>𝑛𝐺, repeat the above process with the set of all retained features; otherwise, reduce the retained features to 𝑛𝐼 of them by one run of the minimization using an 𝐿1-penalized likelihood.

2.5. Selection Stage

The selection stage consists of a ranking step and a model selection step. In the ranking step, the retained features (main-effect and interaction) are ranked together by a penalized likelihood with SCAD penalty plus an additional Jeffrey’s prior penalty. In the model selection step, a sequence of nested models are formed and evaluated by the EBIC.

For convenience, let the retained interaction features be referred to by a single index. Let π‘†βˆ— be the index set of all the main-effect and interaction features retained in the screening stage. Let 𝐾=|π‘†βˆ—|. Denote by 𝜽(π‘†βˆ—) the vector of coefficients corresponding to these features (the components of 𝜽(π‘†βˆ—) are the 𝛽’s and πœ‰β€™s corresponding to the retained main-effects and interactions). Jeffrey’s prior penalty is the log determinant of the Fisher information matrix. Thus the penalized likelihood in the selection stage is given byπ‘™π‘ξ€·πœ½ξ€·π‘†βˆ—ξ€Έξ€Έξ€·πœ½ξ€·π‘†βˆ£πœ†=βˆ’2logπΏβˆ—ξ€Έβˆ£π‘†βˆ—ξ€Έ||πΌξ€·πœ½ξ€·π‘†βˆ’logβˆ—||+ξ“ξ€Έξ€Έπ‘—βˆˆπ‘†βˆ—π‘πœ†ξ€·||πœƒπ‘—||ξ€Έ,(2.8) where π‘πœ† is the SCAD penalty and 𝐼(𝜽(π‘†βˆ—)) is the Fisher information matrix. The ranking step is done as follows. The parameter πœ† is tuned to a value πœ†1 such that it is the smallest to make at least one component of 𝜷(π‘†βˆ—) zero by minimizing 𝑙𝑝(𝜷(π‘†βˆ—)βˆ£πœ†1). Let π‘—πΎβˆˆπ‘†βˆ— be the index corresponding to the zero component. Update π‘†βˆ— to π‘†βˆ—/{𝑗𝐾}, that is, the feature with index 𝑗𝐾 is eliminated from further consideration. With the updated π‘†βˆ—, the above process is repeated, and another feature is eliminated. Continuing this way, eventually, we obtain an ordered sequence of the indices in π‘†βˆ—: 𝑗1,𝑗2,…,𝑗𝐾. From the ordered sequence above, a sequence of nested models is formed as π‘†π‘˜={𝑗1,…,π‘—π‘˜},π‘˜=1,…,𝐾. For each π‘†π‘˜, the un-penalized likelihood log𝐿(𝜽(π‘†π‘˜)βˆ£π‘†π‘˜) is maximized. The EBIC with 𝛾 values in a range [1βˆ’ln𝑛/ln𝑝,𝛾max] is computed for all these models. For each 𝛾, the model with the smallest EBIC(𝛾) is identified. The upper bound of the range, 𝛾max, is taken as a value such that no feature can be selected by the EBIC with that value. Only a few models can be identified when 𝛾 varies in the range. Each identified model corresponds to a subinterval of [1βˆ’ln𝑛/ln𝑝,𝛾max]. The identified models together with their corresponding subintervals are then reported.

The choice of 𝛾 in the EBIC affects the positive discovery rate (PDR) and the false discovery rate (FDR). In the context of GWAS, the PDR is the proportion of correctly identified SNPs among all etiological SNPs, and the FDR is the proportion of incorrectly identified SNPs among all identified SNPs. A larger 𝛾 results in a smaller FDR and also a lower PDR. A smaller 𝛾 results in a higher PDR and also a higher FDR. A balance must be stricken between PDR and FDR according to the purpose of the study. If the purpose is to confirm the etiological effect of certain well-studied loci or regions, one should emphasize more on a desirably low FDR rather than a high PDR. If the purpose is to discover candidate loci or regions for further study, one should emphasize more on a high PDR with only a reasonable FDR. The FDR is related to the statistical significance of the features. Measures on the statistical significance can be obtained from the final identified models and their corresponding subintervals. The upper bound of the subinterval determines the largest threshold which the effects of the features in the model must exceed. Likelihood ratio test (LRT) statistics can be used to assess the significance of the feature effects. For example, suppose a model consisting of 𝜈1 main-effect features and 𝜈2 interaction features is selected with 𝛾 in a sub-interval (𝛾,𝛾]. The LRT statistic for the significance of the feature with the lowest rank in the model must exceed the threshold log𝑛+2𝛾log((π‘ƒβˆ’πœˆ1+1)/𝜈1), if the feature is a main-effect one, and log𝑛+2𝛾log((𝑃(π‘ƒβˆ’1)/2βˆ’πœˆ2+1)/𝜈2), if the feature is an interaction one. The probability for the LRT to exceed the threshold is at most π‘ƒπ‘Ÿ(πœ’21>log𝑛+2𝛾log((π‘ƒβˆ’πœˆ1+1)/𝜈1)) or π‘ƒπ‘Ÿ(πœ’21>log𝑛+2𝛾log((𝑃(π‘ƒβˆ’1)/2βˆ’πœˆ2+1)/𝜈2)) for a main-effect or interaction feature if the feature does not actually have any effect. These probabilities, like the 𝑃-values in classical hypothesis testing, provide statistical basis for the user to determine which model should be taken as the selected model.

A final issue on the two-stage logistic regression procedure is how to determine 𝑛𝑀 and 𝑛𝐼. If they are large enough, usually several times of the actual numbers, their choice will not affect the final model selection. Since the actual numbers are unknown, a strategy is to consider several different 𝑛𝑀 and 𝑛𝐼. First, run the procedure with some educated guess on 𝑛𝑀 and 𝑛𝐼. Then, run the procedure again using larger 𝑛𝑀 and 𝑛𝐼. If the identified models by using these 𝑛𝑀 and 𝑛𝐼 are almost the same, the choice of 𝑛𝑀 and 𝑛𝐼 is appropriate. Otherwise, further values of 𝑛𝑀 and 𝑛𝐼 should be considered, until eventually different 𝑛𝑀, 𝑛𝐼 result in the same results.

3. Analysis of CGEMS Prostate Cancer Data

The CGEMS data portal of National Cancer Institute, USA, provides public access to the summary results of approximately 550,000 SNPs genotyped in the CGEMS prostate cancer whole genome scan, see http://cgems.cancer.gov. We applied the two-stage penalized regression approach to the prostate cancer Phase 1A data in the prostate, lung, colon, and ovarian (PLCO) cancer screening trial. The dataset contains 294,179 autosomal SNPs which passed the quality controls on 1,111 controls and 1,148 cases (673 cases are aggressive, Gleason β‰₯ 7 or stage β‰₯ III; 475 cases are nonaggressive, Gleason < 7 and stage < III). In our analysis, we put all the cases together without distinguishing aggressive and non-aggressive ones. We assumed additive genetic mode for all the SNPs.

The application of the screening stage to all the 294,179 SNPs directly is not only time consuming but also unnecessary. Therefore, we did a preliminary screening by using single-SNP logistic models. For each SNP, a logistic model is fitted and the 𝑃-value of the significance test of the SNP effect is obtained. Those SNPs with a 𝑃-value bigger than 0.05 are discarded. There are 17,387 SNPs which have a 𝑃-value less than 0.05 and are retained.

Because of the sheer huge number of features, 17,387 main features and 17,387Γ—(1,7387βˆ’1)/2 interaction features, the tournament procedure is applied in the screening stage. At the main-effect feature screening step, the main-effect features are randomly partitioned into groups of size 1,000, except one group of size 1,387, and 100 features are selected from each group. A second round of screening is applied to the selected 1,700 features out of which 100 features are retained. The interaction feature screening is applied to 17,387Γ—(1,7387βˆ’1)/2 interaction features. Each round, the retained features are partitioned into groups of size 1,000, and 50 features are selected from each group. The procedure continues until 300 interaction features are finally selected. Eventually, the 100 main-effect features and 300 interaction features are put together and screened to retain a total of 100 features (main-effect or interaction). The eventual 100 features are then subjected to the selection procedure.

The features selected by EBIC with 𝛾 in the subintervals (0.70,0.73],(0.73,0.77], and (0.77,0.8] are given in Table 1. With 𝛾=0.8, the largest value at which at least one feature can be selected, the following three interaction features are selected: rs1885693-rs12537363, rs7837688-rs2256142 and rs1721525-rs2243988. The effects of these features have a significance level at least 1.8250𝑒-11. The next largest 𝛾 value, 0.77, selects 7 additional interaction features which have a significance level at least 9.6435𝑒-11. The third largest 𝛾 value, 0.73, selects still 2 additional interaction features which have a significance level at least 2.7407𝑒-10. The chromosomal region 8q24 is the one where many previous prostate cancer studies are concentrated. It has been reported in a number of studies that rs1447295, one of the 4 tightly linked SNPs in the β€œlocus 1” region of 8q24, is associated with prostate cancer, and it has been established as a benchmark for prostate cancer association studies. In the current data set, we found that rs7837688 is highly correlated with rs1447295 (π‘Ÿ2=0.9) and is more significant than rs1447295 based on single-SNP models. These two SNPs, which are in the same recombination block, are also physically close.

An older and slightly different version of the CGEMS prostate data has been analyzed by Yeager et al. [27] using single-SNP multiple testing approach. In their analysis, they distinguished between aggressive and non-aggressive status and assumed no structure on genetic modes. For each SNP, they considered four tests: a πœ’2-test with 4 degrees of freedom based on a 3Γ—3 contingency table, a score test with 4 degrees of freedom based on a polytomous logistic regression model adjusted for age group, region of recruitment, and whether a case is diagnosed within one year of entry to the trial, as well as the other two which are the same as the πœ’2 and score tests but take into account incidence-density sampling. They identified two physically close but genetically independent regions (in a distance 0.65 centi-Morgans) within 8q24. One of the regions is where the benchmark SNP rs1447295 is located. They reported three SNPs: rs1447295 (𝑃-value: 9.75𝑒-05), rs7837688 (𝑃-value: 6.52𝑒-06) and rs6983267 (𝑃-value: 2.43𝑒-05), where rs7837688 is in the same region as rs1447295 and rs6983267 is in the other region. The 𝑃-values are computed from the score statistic based on incidence-density sampling polytomous logistic regression model adjusted for other covariates.

In our analysis, we identified rs7837688 but not rs1447295. This is because the penalized likelihood tends to select only one feature among several highly correlated features, which is a contrast to the multiple testing that selects all the correlated features if any of them is associated with the disease status. We failed to identify rs6983267. The possible reason could be that its effect is masked by other more significant features which are identified in our analysis. We also carried out the selection procedure with only the 100 main-effect features retained from the screening stage. It is found that rs6983267 is among the top 20 selected main-effect features with a significance level 2.3278𝑒-05. It is interesting to notice that the two SNPs rs7837688 and rs1721525 appearing in the top three interaction features are also among the top four features selected with a maximum 𝛾 value 0.7185 when only main-effect features are considered. Since no SNP on chromosomes other than 8q24 has been reported in other studies, we wonder whether statistically significant SNPs on other chromosomes can be ignored due to biological reasons: if not, our analysis strongly suggests that rs1721525 located on chromosome 1 could represent another region in the genome which is associated with prostate cancer, if it holds, biologically, chromosome 1 cannot be excluded in the consideration of genetic variants for prostate cancer.

4. Simulation Studies

We present results of two simulation studies in this section. In the first study, we compare the two-stage penalized logistic regression (TPLR) approach with the paired-SNP multiple testing (PMT) approach of Marchini et al. [5] under simulation settings considered by them. In the second study, we compare the TPLR approach with LASSO-patternsearch using a data structure mimicking the CGEMS prostate cancer data.

4.1. Simulation Study 1

The comparison of TPLR and PMT is based on four models. Each model involves two etiological SNPs. In the first model, the effects of the two SNPs are multiplicative both within and between loci; in the second model, the effects of the two SNPs are multiplicative within but not between loci; in the third model, the two SNPs have threshold interaction effects; in the fourth model, the two SNPs have an interaction effect but no marginal effects. The first three models are taken from Marchini et al. [5]. The details of these models are provided in the supplementary document.

Marchini et al. [5] considered two strategies of PMT. In the first strategy, a logistic model with 9 parameters is fitted for each pair of SNPs, and the Bonferroni corrected significance level 𝛼/𝑃2ξ€Έ is used to declare the significant pairs. In the second strategy, the SNPs that are significant in single-SNP tests at a liberal level 𝛼1 are identified, then the significances of all the pairs formed by these SNPs are tested using the Bonferroni corrected level 𝛼/𝑃𝛼12ξ€Έ.

In the first three models, the marginal effects of both loci are nonnegligible and can be picked up by the single-SNP tests at the relaxed significance level. In this situation, the second strategy has an advantage over the first strategy in terms of detection power and false discovery rate. In this study, we compare our approach with the second strategy of PMT under the first three models. In the fourth model, since there are no marginal effects at both loci, the second strategy of PMT cannot be applied since it will fail to pick up any loci at the first step. Hence, we compare our approach with the first strategy of PMT. However, the first strategy involves a stupendous amount of computation which exceeds our computing capacity. To circumvent this dilemma, we consider an artificial version of the first strategy; that is, we only consider the pairs which involve at least one of the etiological SNPs. This artificial version has the same detection power but lower false discovery rate than the full version. The artificial version cannot be implemented with real data since it requires the knowledge of the etiological SNPs. However, it can be implemented with simulated data and serves the purpose of comparison.

Each simulated dataset contains 𝑛=800 individuals (400 cases and 400 controls) with genotypes of 𝑃 SNPs. Two values of 𝑃, 1000 and 5000, are considered. The genotypes of disease loci, which are not among the 𝑃 SNPs, and the disease status of the individuals are generated first. Then, the genotypes of the SNPs which are in linkage disequilibrium with the disease loci are generated using a square correlation coefficient π‘Ÿ2=0.5. The genotypes of the remaining SNPs are generated independently assuming Hardy-Weinberg equilibrium. For the first three models, the effects of the disease loci are specified by the prevalence, disease allele frequencies, denoted by π‘ž, and marginal effect parameters, denoted by πœ†1 and πœ†2. The prevalence is set at 0.01 throughout. The two marginal effects are set equal, that is, πœ†1=πœ†2=πœ†. For the fourth model, the effect is specified through the coefficient in the logistic model. The coefficients are determined by first specifying πœ‰12 and then determining 𝛽1 and 𝛽2 through the constraints of the model while 𝛽0 is set to βˆ’5. The definition of these parameters and the details of the data generation are given in the supplementary document.

The 𝛼1 and 𝛼 in the PMT approach are taken to be 0.1 and 0.05, respectively, the same as in Marchini et al. [5]. The 𝛾 in EBIC is fixed as 1 since it is infeasible to incorporate the consideration on the choice of 𝛾 into the simulation study. The average PDR and FDR over 200 simulation replicates under Model 1–4 are given in Tables 2–5, respectively. In Table 5, the entries of the FDR for the PMT approach are lower bounds rather than the actual FDRs, since, as mentioned earlier, only the pairs of SNPs involving at least one etiological SNP are considered in the artificial version of the first strategy of PMT, which results in less false discoveries than the full version while retaining the same positive detections.

The results presented in Tables 2–5 are summarized as follows. Under Model 1, TPLR has much lower FDR and comparable PDR compared with PMT. Under Models 2–4, the PDR of TPLR is significantly higher than PMT in all cases except Model 2 when πœ†=0.7,π‘ž=0.2,𝑃=1000 (0.95 versus 1) and Model 3 when πœ†=1,π‘ž=0.1,𝑃=1000 (0.81 versus 0.84). The overall averaged FDRs of TPLR is 0.0487 while that of PMT is 0.7604. It is seen that the FDR of TPLR is always kept at reasonably low levels but that of PMT is intolerably high, and at the same time TPLR is still more powerful than PMT for detecting etiological SNPs. From the simulation results, we can also see the impact of 𝑃 on PDR and FDR. In general, the increase of 𝑃 reduces PDR and increases FDR of both approaches. However, the impact on TPLR is less than that on PMT.

4.2. Simulation Study 2

The data for this simulation study is generated mimicking the structure of the CGEMS prostate cancer data. The cases and controls are generated using a logistic model with the following linear predictor: πœ‚=𝛽0+5j=1𝛽jπ‘₯j+πœ‰1π‘₯6π‘₯7+πœ‰2π‘₯8π‘₯9+πœ‰3π‘₯10π‘₯11+πœ‰4π‘₯2π‘₯12+πœ‰5π‘₯13π‘₯14,(4.1) where π‘₯𝑗’s are feature values of 14 SNPs. The parameter values are taken as 𝜷=(βˆ’8.65,0.89,1.1,0.74,1.18,1.25),𝝃=(1.95,1.62,1.9,1.8,1.1).(4.2) The SNPs in the above model mimic the 14 SNPs involved in the top 5 main-effect features and top 5 interaction features of the CGEMS prostate cancer data. The minor allele frequencies (MAF) of the SNPs, which are estimated from the prostate cancer data, are given as follows: MAF=(0.31,0.12,0.29,0.12,0.13,0.13,0.47,0.18,0.29,0.16,0.04,0.12,0.36,0.40).(4.3) The genotypes of these 14 SNPs are generated by using the MAF, assuming Hardy- Weinburg Equilibrium. In addition to these 14 SNPs, 20,000 noncausal SNPs are randomly selected (without replacement) from the 294,179 SNPs of the prostate cancer data in each simulation replicate. For each simulation replicate, 1,000 cases and 1,000 controls are generated. They are matched by randomly selected (without replacement) individuals from the prostate cancer data. Their genotypes at the 20,000 noncausal SNPs are taken the same as those in the prostate cancer data.

In the TPLR approach, 50 main effect features and 50 interaction features are selected in the screening stage using the tournament screening strategy. In the selection stage, EBIC(𝛾) values are calculated for the nested models with 𝛾 in the range 0(0.1)2, that is, from 0 to 2 in space of 0.1.

In the LASSO-patternsearch approach, at the screening stage, 0.05 and 0.002 are used as thresholds for the 𝑃-values of the main-effect features and interaction features, respectively. At the LASSO selection step, a 5-fold cross-validation is used for the choice of penalty parameter. At the hypothesis testing step, 9 𝛼 levels are considered, that is, 𝛼=10βˆ’π‘˜,π‘˜=0,1,…,8. The case 𝛼=1 amounts to stopping the procedure at the LASSO selection step.

Since in the TPLR approach there is not a definite choice of 𝛾, to facilitate the comparison, we calculate PDR and FDR for each fixed 𝛾 value in the TPLR approach, and for each fixed 𝛼 level in LASSO-patternsearch. The PDR and FDR are calculated separately for the detection of true main-effect and interaction features. They are also calculated for the detection of causal SNPs. A causal SNP is considered positively discovered if it is selected either as a main-effect feature or a constituent in an interaction feature. The simulated FDR and PDR over 100 replicates of TPLR with 𝑛𝑀=𝑛𝐼=50 and 𝛾=0(0.2)2 and those of LASSO-patternsearch with 𝛼=10βˆ’π‘˜,π‘˜=0,1,…,8 are reported in Table 6. It is actually the 𝛾 values in the higher end and 𝛼 levels in the lower end that will be involved in the final selection. The comparison of the results with those values is more relevant. As shown by the bold digits in Table 6, TPLR has higher PDR and lower FDR than LASSO-patternsearch across-the-board. For the main-effect features, the lowest FDR of TPLR is 0.006 while it achieves PDR around 0.65, but the lowest FDR of LASSO-patternsearch is around 0.2 while it only achieves PDR around 0.6. The FDR and PDR on interaction features and causal SNPs have the same pattern. When the two approaches have about the same PDR, the LASSO-patternsearch has a much larger undesirable FDR than TPLR. For example, on the SNPs, when the PDR is 0.608 for TPLR and 0.609 for LASSO-patternsearch, the FDRs are, respectively, 0.041 and 0.654; on the main-effect features, when the PDR is 0.646 for both TPLR and LASSO-patternsearch, the FDRs are, respectively, 0.006 and 0.220. The ROC curves of the two approaches in identifying etiological SNPs are plotted in Figure 1. Figure 1 shows clearly that the PDR of TPLR is much higher than the PDR of LASSO-patternsearch when FDR is the same, which is true uniformly over FDR.

To investigate the effect of the choice of 𝑛𝑀 and 𝑛𝐼, we considered 𝑛𝑀=𝑛𝐼=15,25, and 50 which are 3, 5, and 10 times of the actual number of causal features, respectively. The simulation results show that, though there is a slight difference between the choice of 15 and the other two choices, there is no substantial difference between the choice of 25 and 50. This justifies the strategy given at the end of Section 2. The detailed results on the comparison of the choices are given in the supplementary document.

We also investigated whether the ranking step in the TPLR approach really reflects the actual importance of the features. The average ranks of the ten causal features over the 100 simulation replicates are given in Table 7.

On the average, the causal features are all among the top ten ranks. This gives a justification for the ranking step in the selection stage of the TPLR approach.

5. Some Remarks

It is a common understanding that individual SNPs are unlikely to play an important role in the development of complex diseases, and, instead, it is the interactions of many SNPs that are behind disease developments, see Garte [28]. The finding that only interaction features are selected (since they are more significant than main-effect features) in our analysis provides some evidence to this understanding. Perhaps, even higher-order interactions should be investigated. This makes methods such as the penalized logistic regression which can deal with interactions even more desirable.

The analysis of the CGEMS prostate cancer data can be refined by replacing the binary logistic model with a polytomous logistic regression model taking into account that the genetic mechanisms behind aggressive and nonaggressive prostate cancers might be different. Accordingly, the penalty in the penalized likelihood can be replaced by some variants of the group LASSO penalty considered by Huang et al. [29]. A polytomous logistic regression model with an appropriate penalty function is of general interest in feature selection with multinomial responses, which will be pursued elsewhere.

Acknowledgments

The authors would like to thank the National Cancer Institute of USA for granting the access to the CGEMS prostate cancer data. The research of the authors is supported by Research Grant R-155-000-065-112 of the National University of Singapore, and the research of the first author was done when she was a Ph.D. student at the National University of Singapore.

Supplementary Materials

The supplementary document contains (1) the details of the genetic models and the data generation procedure considered in Simulation Study 1 and (2) some further results obtained in Simulation Study 2.

  1. Supplementary Material