Statistical Analysis of High-Dimensional Genetic Data in Complex TraitsView this Special Issue
Evaluation of Penalized and Nonpenalized Methods for Disease Prediction with Large-Scale Genetic Data
Owing to recent improvement of genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci and this successful finding has substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which have been a big hurdle to build disease prediction model. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called “large P and small N” problem. Penalized regressions including least absolute selection and shrinkage operator (LASSO) and ridge regression limit the space of parameters, and this constraint enables the estimation of effects for very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods for at least diseases under consideration.
Accurate disease prediction is a central goal of clinical genetics, and much effort has been made to utilize the large-scale genetic data for a disease prediction model for complex disease. However, except for the fully penetrant Mendelian disorders, effect sizes of most disease susceptibility loci identified by genome-wide association studies (GWAS) are usually modest  and the presence of much larger number of genetic variants than the sample size (or so-called “large P and small N” problem) makes the construction of a disease risk prediction model intractable. For instance, the variation of predicted risk scores for each individual is proportionally related to the number of causal variants, and the accuracy of the predicted disease status decreases with the increase of the number of causal variants when the relative proportion of variance explained by causal variants is fixed [2, 3]. Also large P and small N problem prevents the estimation of the joint effect of all markers and thus prediction model building has been based only on marginal effects of variants [4–6]. Recently, various nonpenalized and penalized statistical methods have been suggested to tackle these issues. However, a comprehensive evaluation of existing methods has not been conducted yet.
The statistical methods for the disease risk prediction at the early stage were based on gene scores [4–6]. Causal variants often have additive effects on phenotypes, and a simple linear (logistic) regression can be adopted to estimate marginal effects of each variant under the assumption that there is no gene × gene and gene × environment interactions. Then, the coded genotypes of large-scale genetic data are multiplied with their marginal effect estimates and their sum for each individual can be incorporated to build the disease risk prediction model. Multiple studies showed that gene score-based approach is practically useful for building disease risk prediction . This approach is computationally very efficient, and further extensions based on best linear unbiased predictor (BLUP) have been proposed in the literature [8–10]. However, for instance, if joint effects between multiple variants are substantial or there is large linkage disequilibrium between variants, the estimated gene score can be biased and the predicted disease risk becomes less accurate.
As an alternative to gene score-based approach, one may consider statistical learning methods in disease risk prediction. Statistical learning algorithms have been successful over the past decades in various learning tasks including text categorization, fraud detection, character and image recognition, natural language processing, and marketing. Disease risk prediction can be naturally posed as a classification problem, and support vector machines (SVMs)  and ensemble algorithms, in particular, random forests proposed by Breiman , have been often shown to yield more accurate predictions than other classification algorithms . In particular, SVMs have been an important tool in classification because of their accuracy and flexibility in modeling different types of data. However, these methods have some drawbacks in disease risk prediction. Effects of variants on phenotypes in the prediction models from ensemble algorithms are difficult to interpret and SVMs do not provide class conditional probabilities . Therefore, in this report, we focus on the penalized methods in logistic regression.
Recently, various penalized methods have been proposed to resolve the large P and small N problem. Examples include convex penalizations such as ridge [15–17] and LASSO  and nonconvex penalizations such as the smoothly clipped absolute deviation (SCAD)  and bridge . In general, penalized methods have often provided more accurate predictions and easier interpretations than nonpenalized methods, especially when the number of samples is smaller than the number of variables. Some penalized methods automatically select relevant variables by setting the estimated coefficients of irrelevant variables as exactly zero. Also penalized methods enhance the accuracy of predictions by shrinking the coefficients of nonzero elements with data-adaptive tuning parameters.
In this report, we compare the performances of various nonpenalized and penalized methods in the prediction of diseases on data from Korea Association Resource (KARE) project that is a part of Korea Genome Epidemiology Study. We select individuals with extreme phenotypes among the participants in KARE project and consider the type 2 diabetes, obesity, hypertension, and three smoking-related phenotypes. The predictive performances of those nonpenalized and penalized methods are compared by area under the curve (AUC). Our results indicate that penalized methods tend to yield more accurate predictions than nonpenalized methods although their relative performances depend on particular diseases.
2.1. KARE Cohort
The KARE project, with 10,038 participants living in Ansung (rural) and Ansan (urban), was initiated in 2007 for large-scale GWAS based on the Korean population. Among the participants, 10,004 individuals were genotyped for 500,568 SNPs with the Affymetrix Genome-Wide Human SNP array 5.0. We discarded SNPs with -values for Hardy-Weinberg equilibrium (HWE) less than 10−6, with genotype call rates less than 95%, or minor allele frequencies (MAF) less than 0.01, and 352,228 SNPs were left for subsequent analysis. Individuals with low call rates (<95%, ), high heterozygosity (>30%, ), gender inconsistencies , or serious concomitant illness were excluded from analysis. We considered independent samples and excluded related or identical individuals whose computed average pairwise identical in state value was higher than that estimated from first-degree relatives of Korean sib-pair samples (>0.8, ). In total, 8,842 individuals were analyzed. From randomly selected 20 duplicate samples, we found that genotype concordance rates exceeded 99.7%, with no single SNP excessively discordant. The population substructure was handled with EIGENSTRAT approach  and we chose 10 principal component scores. Missing genotypes were imputed with Beagle .
2.1.1. Type 2 Diabetes (T2D)
T2D mainly occurs in people aged over 40, and it is diagnosed with level of glucose and hemoglobin a1c (hba1c) in blood. In our studies, individuals were selected as being affected with type 2 diabetes if their hba1c are larger than 6.5, fasting plasma glucoses are larger than or equal to 126, or 2-hour postprandial blood glucoses are larger than or equal to 200. In total, there were 1182 affected individuals, and 2364 individuals not satisfying the condition for type-2 diabetes and older than the other unaffected individuals were considered as controls. As environmental variables, we considered area (Ansan/Ansung), sex, age, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), triglyceride, and ten PC scores.
Obesity status was determined by BMI. Individuals were considered as cases if their BMIs are larger than 27, and there were 1022 affected individuals in KARE cohort. We also selected 2325 individuals with BMIs less than 27 and older than the other unaffected individuals as controls. We considered area, sex, age, height, waist-hip ratio, SBP, DBP, high density lipoprotein, triglyceride, and ten PC scores as environmental variables.
Hypertension status was determined by SBP and DBP. 1035 individuals with SBPs and DBPs larger than 140 and 80, respectively, were considered as cases. 2290 individuals whose SBPs and DBPs were less than 120 and 80, respectively, were selected as controls. Environmental variables considered were area, sex, age, BMI, and ten PC scores.
2.1.4. Cigarettes Smoked per Day (CPD)
For smoking-related phenotypes, we considered only male samples for predicting smoking behaviors because the number of female smokers was very small. CPD was defined to detect the nicotine dependence of each individual. Individuals whose number of cigarettes per day was larger than 20 were defined as being addicted to nicotine, and 333 individuals were selected as cases. Individuals were chosen as controls if the number of cigarettes per day was less than 10, and 375 males were chosen as controls. Environmental variables were area, age, BMI, waist-hip ratio, triglyceride, SBP, and ten PC scores.
2.1.5. Smoking Initiation (SI)
Smoking status for each individual has four categories: never smoked, former smoker, occasional smoker, and habitual smoker. Males who never smoked were defined as controls, and males who occasionally or habitually smoke were defined as cases. There were 3357 cases and 807 controls, and the same clinical variables as in CPD were environmental variables.
2.1.6. Smoking Cessation (SC)
SC was defined with smoking status as SI, but we used different categories for cases and controls. Males who never smoked were defined as controls, and males who occasionally or habitually smoke or smoked before were defined as cases. The numbers of cases and controls are 2064 and 1293, respectively, and environmental variables were the same clinical variables as in CPD.
2.2. Disease Risk Prediction Model Building
Let be a dichotomous phenotype for individual , and affected and unaffected individuals are coded as 1 and 0, respectively. The sample size is denoted as , where and denote the numbers of cases and controls, respectively. We assume that there are genetic variants and environmental variants including an intercept. Therefore the total number of variables is . denotes a vector with covariates for individual , and the coded genotypes of the th variant and the th environmental variable were denoted by and , respectively. The coefficient vector of covariates is denoted by .
2.2.2. Cross Validation
To see the effect of sample size, we selected individuals where cases and controls with extreme phenotypes were chosen, and the relative ratios of to are assumed to be equal to their ratios between all available cases and controls in KARE cohorts. We evaluated the accuracies of the disease risk prediction models for different choices of . Accuracies of the disease prediction models were assessed via 10-fold cross validation, and AUCs were calculated with 10 replicates. All individuals were randomly divided into 10 different subgroups with the same number of cases and controls. Each subgroup was used as test set once across ten replicates and therefore there is no overlap between test set in different replicates.
2.2.3. Feature Selection and Risk Prediction
Numbers of available genetic markers seem to be related to the prediction accuracy, and different numbers of genetic variants were selected to build the disease risk prediction model. We choose the top genetic variants by the order of -ratio from train set. If we let be the average expression level of the th variant for individuals with phenotype and denote the overall mean expression level of the th variant by , the -ratio of the th variant [23, 24] is defined as Then we build the disease risk prediction model with those selected top genetic variants and environmental variables on train set and apply the prediction model to test set.
2.3. Nonpenalized Methods
2.3.1. Genetic Risk Scores (GRS)
The marginal effects of covariates are tested with -ratio [23, 24]. Then, the coded genotypes of significant variants at level are summed to calculate GRS, and GRS and environmental variables were incorporated into the logistic regressions as covariates to build the final disease risk model on train set. The disease risk scores are calculated for individuals on test set and its accuracy of disease risk prediction model is evaluated.
Polygenic effects explained by available SNPs can be modeled by the linear mixed model whose variance covariance matrix is parameterized with the genetic relationship matrix [25–27], and BLUP can be used to predict the disease risk by genetic effects. However, those approaches assume that effects of all SNPs are homogeneous in spite of their heterogeneity. For instance, it has been shown that MAFs of SNPs may reveal some information about genetic architecture  and random effects need to be defined for SNPs with different spectra of MAFs separately. MultiBLUP  categorizes each SNP into different classes with distinct effect sizes or linkage disequilibrium block and applies a linear mixed model with multiple random effects to improve the accuracy of the prediction model .
2.4. Penalized Methods
Various penalized methods have been recently proposed, and we consider five penalized methods in our comparison: ridge , LASSO , elastic-net , SCAD , and truncated ridge (TR) [33–35]. The dimensional coefficient vector can be estimated by minimizing the penalized negative log-likelihood: where is a penalty function and is a vector of tuning parameter that can be determined by a search on an appropriate grid. Each penalized regression requires the estimation of , and 100 grid points of were considered from “glmnet” function in for all the methods.
In linear regression, estimates from least square method are quite unstable under severe multicollinearity because of their large variances. Ridge penalty was originally developed to stabilize the sample performance of least square estimates by shrinking their absolute values toward zero . Ridge penalty controls the amount of shrinkage effect by choosing the tuning parameter , and the resulting ridge estimates tend to have a smaller variance than least square estimates. In particular, ridge regression can be conducted even when is much larger than , where the least square method does not have a model identifiability. However, ridge estimates have a drawback in the interpretation of the final model because all the covariates are included in the final model regardless of the choice of . Hence, ridge regression must be conducted together with an extra selection process such as stepwise subset selection or truncation methods.
LASSO was proposed by Tibshirani  to achieve both shrinkage and covariate selection via the penalty LASSO selects relevant covariates and estimates their coefficients simultaneously by controlling the tuning parameter . LASSO often shows a quite stable performance, especially when the sample size is small [32, 36], and achieves higher prediction accuracy than other penalized methods. LASSO has been applied to various statistical models such as Gaussian graphical models  because there are fast and efficient algorithms that are easily implementable [37–40]. However, several defects of LASSO have been reported in the literature [36, 41–43]. For example, LASSO tends to overfit, that is, selecting more covariates than expected , and is known to have a confliction between correct selection and optimal prediction . To remedy such defects, modified versions of LASSO  were proposed and extended to the large P and small N problem .
Elastic-net penalty proposed by Zou and Hastie  is a convex combination of LASSO and ridge penalty is Here we considered 20 equally spaced grid points from zero to one for . Elastic-net has more desirable properties than LASSO and ridge. For instance, ridge tends to keep all the covariates in the final model and hence is undesirable when there are many noncausal variants. In contrast, LASSO cannot select larger number of covariates than the sample size and tends to select a single covariate among highly correlated covariates. However, by choosing appropriate and , elastic-net enables us to have balanced estimates, producing a slightly more complex model than LASSO but far simpler model than ridge. Also it achieves a grouping effect  on highly correlated covariates. However, elastic-net shares the disadvantage of LASSO; that is, it often overfits, which can be resolved by applying a data adaptive weight vector .
The SCAD penalty introduced by Fan and Li  is and is used for our own optimization algorithm. SCAD has several desirable properties over LASSO . First, SCAD produces the same unbiased estimates as usual nonpenalized estimates of the covariates selected by SCAD. Hence SCAD can be considered as a stable version of best subset selection , achieving a unique benefit of the unbiased coefficient estimate . Second, SCAD is known to have the oracle property ; that is, the set of selected covariates are asymptotically equal to the set of true causal variants. However, in spite of theoretical optimality of SCAD , its estimates can be poor unless the sample size is large and the effects of signal covariates are strong. In addition, similarity between numerically estimated values and theoretical ones cannot be measured because of the nonconvexity of SCAD penalty, and the computational cost for SCAD is often much more expensive than LASSO.
As we mentioned above, ridge cannot be directly used in identifying important covariates. However, TR  can produce sparse estimates and inherits the same shrinkage effect as ridge that results in high prediction accuracy in the presence of multicollinearity . To obtain TR estimates, we first obtain usual ridge estimates with tuning parameter and then truncate them with truncating level . Hence TR declares the ridge coefficients whose absolute values smaller than as zero and keeps the other large coefficients intact. An appropriate choice of truncating level enables us to identify a correct model while the final estimates still keep the same shrinkage property as ridge [33–35], and 20 grid points equally spaced in logarithmic scale from 0.01 to 0.001 were considered for .
To see the differences of penalized methods, we calculated AUCs of those methods on test set and the number of nonzero coefficients as a function of sample size. Figure 1 shows that relative performance of each method substantially depends on phenotypes, and least AUCs are often observed for SI, followed by SC. Their least AUCs may be explained by the relative importance of genetic components for each phenotype. We calculated the relative proportion of variances, , explained by genotyped variants with GCTA program [27, 49]. for binary traits was estimated with all available samples by using default options, and Table 1 shows estimates for . In particular, the proportion between cases and controls for each phenotype is different from true prevalence, and the ascertainment bias often happens. However the performance of each method may be related to unadjusted estimates of and ascertainment bias was not taken into account. According to Table 1, the genotyped variants explain around 25% of phenotypic variances for hypertension and CPD. However the standard error of for CPD10 is large, and genetic components for all smoking-related phenotypes seem relatively less informative.
Figure 1 shows that AUCs of two nonpenalized methods, GRS and MultiBLUP, on test set were generally outperformed by the penalized methods across various levels of and . Both approaches do not consider the joint effects among multiple causal SNPs. GRS assumes that effect sizes for causal SNPs are homogeneous and MultiBLUP assumes that sums of each SNP affect the normal distribution. However, penalized methods estimate individual effects of each SNP by shrinking each coefficient. This may explain the superiority of penalized methods over nonpenalized methods, but if those assumptions for nonpenalized methods are satisfied, they may perform better than penalized methods approaches. Interestingly MultiBLUP performs better than GRS except OB if is larger, and AUC improvement of MultiBLUP for larger is more substantial than GRS. Therefore, MultiBLUP seems to be more reasonable choice than GRS. Comparing overall performances of penalized methods, it can be seen that ridge and TR are the best, LASSO and elastic-net are the second, and SCAD is the last even though the performance of each method depends on specific diseases and the levels of and . Regardless of and , ridge was the best performer even for small for all phenotypes except SI. For SI, it seems that the performance of ridge depends on rather than . TR virtually has almost the same prediction accuracy, and Figure 2 shows that its model complexity is similar with ridge for CPD, obesity, hypertension, and T2D. This observation is also strengthened by the fact that the optimal value of truncation parameter, , is around 0.001, and thus the effect of truncation parameter on model complexity is almost negligible for these data sets. However, Figure 1 shows that differences between ridge and TR are substantial for SC and SI. AUCs of TR depend on and, in particular, are large even when is small, which indicates that AUCs of TR depend less on than ridge. Robustness of TR can be partially explained by smaller model complexity than ridge in Figure 2. For instance, TR usually selects quite small number of SNPs (at most 15.3 SNPs for and 46.6 SNPs for ) but achieves higher prediction accuracy than ridge when is less than 800. However, when , TR selects the same number of SNPs as ridge. Thus, we can conclude that the effect of truncation parameter diminishes for large , which explains higher prediction accuracy when is small.
LASSO and elastic-net show relatively large dependency on and in prediction accuracy and model complexity for whole phenotypes except SI, and their AUCs are proportionally related to but inversely related to . Although their prediction accuracies are lower than those of ridge and TR for small , they perform as well as ridge with small numbers of SNPs for large . For instance, LASSO includes about 100 SNPs for and and about 500 SNPs for and , which indicates that we can construct prediction models without using the whole SNPs. Elastic-net tends to behave quite similarly as LASSO and it selects slightly larger number of SNPs for whole phenotypes except SI.
In terms of model complexity, there are substantial differences among penalized methods. Figure 2 show that SCAD selects the smallest number of covariates, while other methods such as LASSO and elastic-net usually include much more covariates. Ridge always includes all covariates, and model complexity for TR depends on data. However, even though SCAD generates the simplest model, SCAD is less preferable if it achieves the least performance among penalized regressions. For , SCAD performs as well as other methods while still keeping small number of SNPs. For instance, for obesity, AUCs of SCAD are virtually the best and select extremely sparse models that have only 7.3 and 3.4 SNPs for and , respectively. Therefore, we can conclude that SCAD is appropriate as long as relatively large number of individuals is available.
In this study we have considered five penalized and two nonpenalized statistical methods with six case-control datasets that are computationally feasible at the genome-wide scale. Each method was utilized to build the disease risk prediction model with different sample sizes and numbers of variants, and the accuracy of disease risk prediction models was evaluated with cross validation. Cross validation tends to overestimate the prediction accuracy, and results should be interpreted with care. A more reliable but time-consuming way is to compare the methods on random partitions of data. However cross validation does not have a strong preference towards a specific method and it may give us a rough idea on prediction accuracies of methods. According to our results, dense methods such as ridge and TR are usually more accurate than sparse methods such as LASSO and SCAD. For a large sample size, prediction accuracies from penalized methods are expected to be similar to that from ridge [23, 35, 50].
However, in spite of our comprehensive evaluations, various factors such as filtering conditions for SNPs or individuals, test statistic for prescreening, and ways of obtaining tuning parameters can affect the accuracy of the final risk prediction model, and depending on their choices, accuracy of disease risk prediction model can be substantially different. In this context, the 1-standard deviation rule  for tuning parameters was adopted to reduce overfitting problem. However, it did not provide any significant improvement in the results, which may indicate that there may be many causal genetic variants with small effects in the analyzed data sets. This consistently explains the reason why dense methods outperformed sparse methods such as LASSO and SCAD in our analysis. Moreover, while the results from SCAD were quite unstable for , the choice of led to the better prediction accuracy. These findings suggest that most of SNPs have a small causal effect on diseases considered in this report. In this sense, sparse methods such as SCAD may not be preferred for infinitesimal model  unless the sample size is sufficiently large.
In this report we have compared various penalized regression methods. However, we have not considered more recent methods such as bootstrapping methods [33, 52, 53]. Most of them usually suffer from intensive computational burden induced by tuning extra parameters such as bootstrap size, and thus they are not computationally feasible at genome-wide scale. Alternatively, in the follow-up studies, we pursue the direction of refining the penalized methods considered in this report because there is still a significant room for improvement.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Sungho Won and Hosik Choi contributed equally to this work.
The research of S. Won was supported by the Industrial Core Technology Development Program (10040176, Development of Various Bioinformatics Software Using Next Generation Bio-Data) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea) and by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2014S1A2A2028559). The research of H. Choi was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no. 2013R1A1A2007611). The research of J. Lee was provided by an intramural grant from the Korean National Institute of Health (2013-NG73002-00). The research of C. Park was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (no. 2012R1A1A2004901). The research of S. Kwon was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1002995).
Figure 1: AUCs from train set for T2D, obesity and hypertension AUCs for T2D, obesity and hypertension from train set were calculated for different n and p1 TR indicates the truncated ridge.
Figure 2: AUCs from train set for CPD, SC and SI. AUCs for CPD, SC and SI from train set were calculated for different n and p1. TR indicates the truncated ridge.
A. E. Hoerl, “Ridge regression,” Biometrics, vol. 26, article 603, 1970.View at: Google Scholar
A. E. Hoerl and R. W. Kennard, “Ridge regression: applications to nonorthogonal problems,” Technometrics, vol. 12, no. 1, pp. 69–82, 1970.View at: Google Scholar
S. H. Lee, J. Yang, M. E. Goddard, P. M. Visscher, and N. R. Wray, “Estimation of pleiotropy between complex diseases using single-nucleotide polymorphism-derived genomic relationships and restricted maximum likelihood,” Bioinformatics, vol. 28, no. 19, pp. 2540–2542, 2012.View at: Google Scholar
R. J. Tibshirani, “Regression shrinkage and selection via the LASSO,” Journal of the Royal Statistical Society: Series B, vol. 58, pp. 267–288, 1996.View at: Google Scholar
J. Huang, S. Ma, and C.-H. Zhang, “Adaptive Lasso for sparse high-dimensional regression models,” Statistica Sinica, vol. 18, no. 4, pp. 1603–1618, 2008.View at: Google Scholar
P. Hall, E. R. Lee, and B. U. Park, “Bootstrap-based penalty choice for the LASSO, achieving oracle performance,” Statistica Sinica, vol. 19, no. 2, pp. 449–471, 2009.View at: Google Scholar