BioMed Research International

BioMed Research International / 2015 / Article
Special Issue

Advanced Computational Approaches for Medical Genetics and Genomics

View this Special Issue

Review Article | Open Access

Volume 2015 |Article ID 143712 |

Ronald de Vlaming, Patrick J. F. Groenen, "The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics", BioMed Research International, vol. 2015, Article ID 143712, 18 pages, 2015.

The Current and Future Use of Ridge Regression for Prediction in Quantitative Genetics

Academic Editor: Junwen Wang
Received28 Nov 2014
Accepted24 Dec 2014
Published26 Jul 2015


In recent years, there has been a considerable amount of research on the use of regularization methods for inference and prediction in quantitative genetics. Such research mostly focuses on selection of markers and shrinkage of their effects. In this review paper, the use of ridge regression for prediction in quantitative genetics using single-nucleotide polymorphism data is discussed. In particular, we consider (i) the theoretical foundations of ridge regression, (ii) its link to commonly used methods in animal breeding, (iii) the computational feasibility, and (iv) the scope for constructing prediction models with nonlinear effects (e.g., dominance and epistasis). Based on a simulation study we gauge the current and future potential of ridge regression for prediction of human traits using genome-wide SNP data. We conclude that, for outcomes with a relatively simple genetic architecture, given current sample sizes in most cohorts (i.e., ,000) the predictive accuracy of ridge regression is slightly higher than the classical genome-wide association study approach of repeated simple regression (i.e., one regression per SNP). However, both capture only a small proportion of the heritability. Nevertheless, we find evidence that for large-scale initiatives, such as biobanks, sample sizes can be achieved where ridge regression compared to the classical approach improves predictive accuracy substantially.

1. Introduction

The advent of large-scale molecular genetic data has paved the way for using these data to help predict, diagnose, and treat complex human diseases [1]. In recent years, the use of such data for the prediction of polygenic diseases and traits has become increasingly popular (e.g., [24]). This venue has proved successful even for traits such as educational attainment and cognitive performance [5, 6]. The vast majority of research into the genetic architecture of human traits and diseases is exploratory and considers the effects of at least hundreds of thousands of single-nucleotide polymorphisms (SNPs) on the outcome of interest [7].

Predictions based on molecular genetic data are typically constructed as a weighted linear combination of the available SNPs. This yields a so-called polygenic risk score [3] (polygenic score, genetic risk score, and genome-wide score [8]). Multiple regression (ordinary least squares, OLS) is a natural technique for estimating the weights of the predictors (SNPs) in this context but cannot be applied here: in general, the number of samples () available is far lower than the number of SNPs (); typically, and . OLS would yield a perfect in-sample prediction without any predictive value out of sample and would not allow drawing inferences on the weights of the SNPs, as they are nonunique. A commonly accepted solution to this problem is to carry out a genome-wide association study (GWAS), where one regresses the outcome of interest on each SNP separately. In this paper, we call this the repeated simple regression (RSR) approach.

Polygenic scores are typically constructed as the weighted sum of the SNPs with weights resulting from a GWAS using RSR. We raise four points of critique regarding this method. The first problem with this approach is that, in contrast to multiple regression, there is no search for the best linear combination over all SNPs jointly for predicting the outcome. A second, related, problem is that highly correlated SNPs (i.e., SNPs in strong linkage disequilibrium) repeatedly contribute very similar information, thereby distorting the risk score. For example, consider a set of ten perfectly correlated SNPs. In the RSR, they receive exactly the same weight. As the polygenic risk score is a weighted linear sum of the SNPs with the weights coming from RSR, these perfectly correlated SNPs contribute a factor ten stronger to the risk score than a single SNP capturing all information from that region does. This factor ten does not depend on the predictive power of the information in that region. A third problem is that the polygenic risk score can theoretically be correlated with confounding variables (confounders, control variables, and controls). For instance, SNPs can be correlated with the population structure. Therefore, the polygenic risk—being a linear combination of SNPs—can be correlated with the confounders. Usually, confounders, such as age and gender, are included as regressors in order to control for spurious relations through these covariates. However, we find that often in empirical work researchers do not control properly for the confounders in at least one of the many steps that lead from phenotype and genotype data to evaluation of the out-of-sample predictive accuracy of the polygenic risk score. A fourth problem is that the RSR approach is not able to handle even two-way interactions between the SNPs, as it would lead to a number of weights to be estimated that is quadratic in the number of SNPs, which is clearly computationally infeasible.

In this paper, we review the use of ridge regression (RR) [9] to tackle the four problems discussed above. The purpose of this paper is threefold. First, we discuss how prediction using RR can address the aforementioned four points of critique pertaining to a typical polygenic score, that is, how RR can be used to search for the best linear combination of SNPs jointly, to address the multicollinearity of SNPs [10, 11], and to account for the presence of confounding variables and of nonlinear SNP effects (e.g., [1217]). Second, we review relevant work on ridge regression both in and outside the field genetics. Third, we assess the merits of prediction using ridge regression in the new domain of biobanks. That is, we predict the expected accuracy of ridge regression in large scale initiatives with over a 100,000 observations.

An important property of RR is that it cannot select a subset of predictors (e.g., SNPs). Other regularization methods related to RR are able to select a subset of predictors from a large set of predictors. Examples of such methods are the least absolute shrinkage and selection operator (LASSO), group LASSO [18], adaptive LASSO [19], and the elastic net [20].

In a GWAS, SNP selection is a desirable property when trying to find regions in the DNA that bear a causal influence on the outcome. However, there is mixed evidence for the claim that selection techniques in general improve the overall predictive accuracy of the polygenic score. Some studies suggest that preselection of markers (e.g., SNPs), based on either linkage disequilibrium or (in-sample) univariate association results, is detrimental to predictive accuracy (e.g., [3, 8, 11, 21]). Moreover, there is no conclusive evidence on the relative performance of RR-type methods and LASSO-type methods. For instance, using a simulation study, Ogutu et al. [22] find that LASSO-type methods outperform classic RR, whereas other studies find that RR outperforms LASSO and similar variable selection methods (e.g., [2325]). A reasonable proposition is that the relative performance of RR and LASSO depends on trait architecture (e.g., [21, 26]). In particular, a low number of causal SNPs favor LASSO-type methods, whereas an intermediate or high number of causal variants favor RR-type methods. Regularization methods performing selection are computationally more involved and less amenable to incorporate nonlinear SNP effects than RR. For the above reasons, as well as our aim to provide a clear overview of RR, we focus in this paper primarily on RR.

The remainder of this paper is organized as follows. In Section 2, we present the theory underlying RR. In Section 3, we show that RR can be perceived as a method between OLS and RSR, leveraging the advantages of these two methods. Subsequently, in Section 4, we discuss the relation between RR and the best linear unbiased prediction used in animal breeding and the relation between RR and LASSO-type methods. In Section 5, we pay special attention to the effect standardization of SNP data has on the implicit assumptions about the genetic architecture of traits. As indicated, the feasibility of RR depends critically on the use of computationally efficient approaches. These will be discussed in Section 6. Related to this, in Section 7, we will discuss methods to tune the penalty parameter of RR. Following that, in Section 8, advanced RR techniques will be discussed, such as modelling nonlinear effects using RR, weighting SNPs differently, and incorporating information from earlier studies.

In order to assess the current and future use of ridge regression for prediction in quantitative genetics, we run a suite of simulations. The design of the simulations and the results are presented in Section 9. Based on these results we will estimate the effect sample size, the number of SNPs, the number of causal SNPs, and trait heritability have on the predictive accuracy of RR and the classical RSR approach. Using these estimates we will extrapolate how RR and RSR are expected to perform relative to each other in large scale studies (e.g., ). Finally, in Section 10, we summarize the most important aspects of RR in the context of prediction in quantitative genetics and discuss our expectations for its future uses.

2. Ridge Regression

Using ridge regression (RR) for prediction in quantitative genetics was first proposed by Whittaker et al. [27]. RR can be understood as follows. Like regular least-squares methods RR minimizes a loss function that includes the sum of squared regression residuals. However, opposed to least squares, the loss function also includes a term consisting of positive penalty parameter times the model complexity, measured by the sum of squared regression weights [9]. This penalty prevents overfitting by shrinking the weights towards zero, ensuring that, even in case of multicollinearity and , the estimator has a solution. The RR estimator has a simple analytical solution.

More formally, given a set of individuals, SNPs, and confounders, a linear model for quantitative outcome vector (), with a matrix of SNP data (), and a matrix of confounders () as predictors, is given by where is the vector of SNP effects, the vector of effects of the confounders, and the phenotype noise.

In this particular case, we consider a large set of SNPs and a small set of potential confounders. Since one of our aims is to prevent any spurious relations via the confounders, we use a loss function that does not apply shrinkage to these. Therefore, the RR estimator minimizes Under this loss function, the RR estimator of is given by where is the projection matrix, removing the effects of the confounding variables. The larger the is, the more the shrinkage is applied. When , RR corresponds to OLS. The OLS estimator only exists if , meaning that there is no perfect collinearity amongst the SNPs and that . However, in a GWAS, almost invariably . Therefore, OLS cannot be applied in this context. However, the RR estimator has a solution for any , even if .

Heteroskedastic ridge regression (HRR) is a generalization of RR, where each SNP receives a different amount of shrinkage, . The loss function of HRR is given bywhere . The corresponding estimator is given by The matrix in (3) and (5) can be regarded as a map of the estimated correlation (linkage disequilibrium) between markers. OLS takes this linkage disequilibrium fully into account at the expense of overfitting the data, whereas RSR completely ignores it. For this reason, when constructing a polygenic score, RSR is often used in combination with a heuristic procedure, known as linkage disequilibrium pruning, which selects SNPs that are not too strongly correlated. As is shown in the next section, RR leverages the two extremes of OLS and RSR. Therefore, opposed to RSR, RR does not require the a priori selection of SNPs; RR is able to handle linkage disequilibrium between markers [10, 11].

RR is expected to perform particularly well under a scenario where a substantial proportion of the SNPs is expected to contribute to the phenotype and where each contribution is small.

3. The Limiting Cases of Ridge Regression

Varying the penalty weight, , allows specifying special cases of RR. Prediction by RR can be perceived as a method that lies between prediction based on OLS estimates considering all SNPs jointly and OLS estimates considering each SNP separately. By definition of RR [9], for sufficiently low shrinkage, the RR estimates converge to the multiple regression estimates [10], provided these are unique. For sufficiently high shrinkage a RR prediction score is equivalent to an RSR prediction score, in terms of the proportion of variance accounted for by the respective scores. For ease of notation, we assume in this section that there are no confounders .

To establish the aforementioned relations, two conditions are needed. First, the measure of predictive accuracy is independent of scale. That is, given an out-of-sample quantitative outcome vector and its prediction , the accuracy measure should be such that for any coefficient the accuracy of prediction is identical to that of prediction . An example of such a measure is the of an outcome and its prediction. The second condition is that SNP data are standardized, such that each SNPs has mean zero (, where ) and equal standard deviation (, where is a scalar).

Consider the prediction of based on out-of-sample genotype matrix , using in-sample RR estimates . This prediction is given by . Based on the first condition, we can multiply the prediction by . This is equivalent to inflating the RR estimates by instead of inflating the predictions. Thus, we can take . This yields where . The OLS estimator considering all SNPs jointly is given by Thus, it follows that when goes to zero (i.e., goes to zero), the RR estimator goes to the OLS estimator. Moreover, as goes to one (i.e., becomes sufficiently large), the inflated RR estimator goes to .

Using the condition of having standardized SNPs, we can rewrite the RSR for SNP as , where is the standardized genotype vector of SNP . This expression can be vectorized over all SNPs as . From this, it follows that the inflated RR estimates approach the RSR estimates as becomes sufficiently large.

Prediction using RR is related to the predictions that arise under a widely used simple mixed linear model, commonly referred to as the animal model. In such a model, expected genetic relatedness is mapped to phenotypic relatedness. Usually pedigree information is used to infer genetic relatedness. However, with the advent of genome-wide molecular data, mixed models that use SNPs to estimate genetic relatedness have been proposed (e.g., see Yang et al. [28]). In most mixed models using SNPs, the prior assumption is that SNP effects are normally distributed with mean zero and variance , and the error terms in the phenotype are also normally distributed with variance .

To understand the relation between RR and mixed models, consider the following mixed linear model where is the SNP effect variance and the noise variance. In this model the effects of the confounders, , are assumed to be fixed. For the remainder of this section we ignore the confounders for ease of notation. The parameters and can be estimated using, for instance, maximum likelihood, restricted maximum likelihood [29], or expectation maximization [30]. Alternatively, these parameters can be fixed by using prior information from other data sets; see, for instance, Hofheinz et al. [31].

Consider conditional expectations and . In a mixed linear model such expectations are known as the best linear unbiased prediction (BLUP) [3236]. BLUP was first proposed by Henderson [32] in order to obtain estimates of the so-called breeding values, that is, the part of the phenotype that can be attributed to genetic variation.

Provided that the RR penalty , the BLUP of SNP effects [28, 37, 38] is equivalent to the RR estimator. Under that same condition, the BLUP of the SNP-based breeding values is equivalent to RR prediction. Such genomic estimated breeding values [38] contain the part of the phenotype that can be attributed to the genetic variation in the genotyped markers.

To understand this equivalence, first we rewrite the RR estimator in (3). By applying the Sherman-Morrison-Woodbury formula [39, 40] to the inverse of , we obtain Second, by rewriting (8) in terms of the joint distribution of and : the BLUP of is given by the expectation of conditional on [17]. This yields Clearly, when , .

In addition, from a Bayesian perspective the posterior mode of the distribution of SNP effects (i.e., the mode of the distribution conditional on a training set) can also be used as point estimator. Estimation using the posterior mode is known as maximum a posteriori (MAP) estimation. However, due to the normality of and the mode coincides with the conditional expectation . Therefore, MAP estimation of in (8) is equivalent to BLUP.

Consequently, there exists a such that the RR estimator of SNP effects is equivalent to its BLUP [16, 41] and by extension to the MAP estimator. The diagram in Figure 1 summarizes the relations between RR, BLUP, and MAP.

4.1. SNP Selection Using LASSO-Type Methods

An important feature that RR lacks is the selection of SNPs. LASSO-type methods, such as the LASSO, group LASSO, adaptive LASSO, and the elastic net, are able to select SNPs. The key to achieving SNP selection is to include an penalty, that is, adding a penalty consisting of a penalty parameter, , times . The loss function of the LASSO is given by This function is highly similar to the RR loss function in (2). The most important property of the LASSO is that it performs variable selection; that is, for a sufficiently large many of the SNP coefficients will be zero. The higher the is, the fewer the nonzero SNP effects are obtained by the LASSO. Moreover, this method also shrinks the nonzero coefficients, that is, the estimated effects of the selected SNPs.

The loss function of the elastic net [20] is obtained by taking a convex combination of and as penalty; that is, with and . The elastic-net method preserves SNP selection, while allowing more than of SNPs to be selected. Taking a convex combination of the two norms hardly increases the computational costs of solving this problem, when compared to solving the LASSO problem [20]. Typically, the LASSO solution is obtained by means of the least-angle regression algorithm [42]. This algorithm entails an iterative procedure, where at most one SNP can enter the model at a time. Therefore, LASSO-type methods are computationally far more involved than RR-type methods.

Finally, the group LASSO [18] splits the predictors in mutually disjoint groups, with predictors in group , and associated effects , for groups . The group LASSO minimizes Each group can be chosen, for instance, to represent a single gene in terms of its SNPs. The group LASSO induces sparsity at the group level (e.g., a gene is either included as a whole or wholly excluded), whereas within a group the individual regressors receive an penalty. To the best of our knowledge, Sabourin et al. [43] provide the first, and so far only, application of a (modified) group LASSO using SNP data to construct polygenic scores. In this study, each SNP is considered as a group, with two effects: an additive and a dominance effect. In a simulation with mild to strong dominance, this method improves accuracy, compared to an RSR-type approach [43]. For a detailed comparison of LASSO-type methods and RR, we refer to Hastie et al. [44].

5. The Implications of Standardizing SNPs

In the preceding sections, we have only considered SNP standardization as a tool to show that RR can be perceived as a method between the classical GWAS approach and the OLS approach considering all SNPs jointly. However, SNP standardization is often used in the mixed linear model in (8).

The reason for this is that standardization has a profound effect on the implicit assumptions about the effect sizes of SNPs. We show in this section that the standardization we use is equivalent to HRR applied to raw genetic data, where SNPs measuring rare variants receive less shrinkage than SNPs measuring common variants.

More specifically, let (resp., ) denote raw SNP data in sample (out of sample) that has already been mean-centered but not yet standardized to have the same variance. The standardized data in Section 3 can now be obtained by postmultiplying by a diagonal matrix . That is, , where Under the reasonable assumption that only SNPs are considered for which in-sample variation occurs, this matrix is invertible.

By applying this transformation in both the training and test set, RR prediction based on standardized data is given by This shows that RR applied to standardized SNP data is equivalent to HRR, with , applied to raw genotype data. Here, the SNP-specific shrinkage depends on the amount of SNP variation. This type of shrinkage implicitly assumes that the standardized SNPs have homoskedastic effects, whereas the underlying raw genotypes (i.e., the count data) have effects of which the variance decreases with minor allele frequency. That is, rare alleles are assumed to have larger effects on average than common variants. For a qualitative treatment of the relation between allele frequency and expected effect sizes, see, for instance, Manolio et al. [45].

To be more precise, this type of shrinkage corresponds to the implicit assumption that the variance of the effect of raw SNP , denoted by , with allele frequency , is proportional to . This assumption implies that when is close to one or zero, the variance of the effect size is expected to be large, whereas for close to 50% the variance of the effect size attains its minimum.

Naturally, raw SNP effect variances responding differently to allele frequency can be conceived. As indicated by Manolio et al. [45] such relations depend on the effect of the trait under consideration, on the fitness of the individual. Therefore, a natural extension would be to consider HRR with . Here corresponds to a trait for which allele frequency is independent of effect size and corresponds to the relation described before. Moreover, describes a trait for which there is a slight relation between allele frequency and effect size. It is interesting to note that corresponds to a trait where diversity is an asset, that is, a trait in which variants causing phenotypic divergence between individuals tend to become common. Finally, would correspond to a trait for which there has been strong selection pressure causing convergence; only very rare variants are expected to have a large effect. Thus, in future work can be considered as an additional hyperparameter which might boost predictive accuracy and of which the estimate would reveal something about the selection pressure regarding the trait under consideration. The same type of transformation has been proposed by Speed et al. [46] for improving estimation of SNP-based heritability in a mixed linear model.

6. Computational Costs

The main hurdle in computing RR predictions is estimating the parameters, when . In particular, a naive approach requires solving a system with unknowns. However, RR can be implemented in a computationally efficient way. When , using dimensionality reduction techniques the complexity of RR can be reduced from to in case one is interested in the estimated effects [47].

Moreover, if the focus lies solely on obtaining predictions, a nonparametric representation of RR reveals the fact that a dual formulation exists, which can be perceived as solving a linear model with unknowns [48]. Solving such a system has a complexity slightly less than . Building on this computationally efficient approach, RR can also efficiently control for confounders, both in sample and out of sample.

Finally, when considering a wide array of values of , RR can be reformulated to generate predictions for all values of jointly by exploiting the properties of the eigendecomposition of an matrix, thereby yielding a complexity of .

To understand these reductions in computational costs, consider the RR estimator in (9), used to show equivalence of RR and the BLUP. Premultiplying this expression by , the out-of-sample prediction is given by As discussed, accounting for confounding variables is important. Let be the in-sample matrix of confounders and the out-of-sample matrix of confounders. By replacing by and by , where is the projection matrix removing the effects of , we find that where and , and , and . Therefore, one can correct for covariates by simply pre- and postmultiplying matrices, by appropriate projection matrices.

Matrices and both have the interpretation of a SNP-based genetic relationship matrix (GRM) [28], measuring the genetic similarity of individuals in the space of additive SNP effects.

Given the eigendecomposition RR prediction can be written as If , this approach is far more efficient than the naive approach to RR prediction. GRMs can be computed efficiently in packages such as  PLINK 1.9 [49] and  GCTA [28]. The most involved step in the prediction procedure is finding the eigendecomposition of .

7. Tuning and Interpreting

So far, it was assumed that the penalty strength parameter is given. However, in most applications of RR the optimal is not known in advance. Here, we discuss three ways for choosing .

The dominant approach in the machine learning literature for tuning is by maximizing out-of-sample predictive accuracy of RR using cross-validation (CV). In CV one considers a fine grid of potential values of . The data are randomly split in a (small) test set (e.g., 10% of the sample) and CV set (90%). To the CV set one applies -fold CV (e.g., ), meaning that one splits the CV sample randomly in blocks of (approximately) equal size. In each fold blocks are considered as CV training set and the remaining block as CV test set. Using RR for all values of , predictions in the CV test set are generated. Each block is the CV test set precisely once. After the -folds, the predictive accuracy over all CV test sets is evaluated for all . Now, is set to maximize the cross-validation accuracy. Finally, using the predictive accuracy in the final test is considered, using the full CV set as training data. For a more detailed treatment of CV, see, for instance, Hastie et al. [44].

Nested cross-validation (NCV) is a natural extension of CV, where the sample is randomly split in “super”-blocks of approximately equal size (e.g., ) and where there are “super”-folds. In each superfold, one block is considered as final test set and other blocks as CV set. To this CV set and test set one applies regular -fold CV. Each superblock is used as final test set precisely once.

Classical CV is used to fit the model and to assess its predictive accuracy; one can judge the merits of a set of values of the hyperparameter by means of the CV procedure and apply the optimal value to a new part of the sample which has not yet been considered. Using NCV one can test whether the hyperparameter and accuracy that result from classical CV are robust; NCV can show the amount of variation in either of these over the “super”-folds.

CV requires a computationally efficient strategy since a different set of RR predictions will result for each different value of . However, a large set of different values of can be evaluated in one step at nearly the same costs of evaluating a single value of . This approach avoids computing a full RR solution for each separately. To see this, the formulation of RR prediction in (20) is highly relevant. In this equation, the eigendecomposition of is independent of . Thus, predictions for each can be obtained by the following equation:where and “” denotes the element-wise (Hadamard) product. A  MATLAB implementation of this approach to RR prediction is provided in Algorithm 1. The computation of the eigendecomposition of has a computational complexity of . Given this decomposition, the prediction consists of simple operations such as multiplication and addition of scalars.

(1)    %  rdgpredEfficient prediction using ridge regression.
(2)   %rdgpred(Y,A,A21,L) returns ridge regression predictions in test set.
(3)   %Vector Y contains outcome training set, matrix A similarity measures
(4)   %in training set (e.g., A=XX  for N-by-P input matrix X), matrix A21
(5)   %similarity individuals in test set (rows) and training set (columns),
(6)   %and vector L the penalty values to consider.
(7)   %
(8)   %rdgpred(Y,A,A21,L,Z,Z2) first corrects A and A21 for confounders.
(9)   %Matrix Z contains confounders in traing set and Z2 those in test set.
(10) %
(11)  %Author  : R de Vlaming and PJF Groenen
(12) %Institute: Erasmus School of EconomicsDate: November 25, 2014
(13) function Y2 = rdgpred(Y,A,A21,L,Z,Z2)
(15) P  = numel(L);  % find size of set of penalties
(16) N  = numel(Y);  % find size of training set
(17) N2 = size(A21,1); % find size of test set
(19) if nargin > 5    % correct similarities if confounders present
(20)  M  = eye(N) Zinv(ZZ)Z;% anti projection matrix of Z
(21)   M2  = eye(N2)  -  Z2inv(Z2Z2)Z2;  % anti projection matrix of Z2
(22)  A  = MAM;% adjust similarties A
(23)  A21 = M2A21M;% adjust similarties A21
(24) end
(26) [Q,D] = eig(A); D = diag(D); % obtain eigenvecs Q and eigenvalues D of A
(28) % for each eigenvalue E (rows) and lambda S (cols) find 1/(E+S)
(29) D = 1./(repmat(D,1,P) + repmat(L(:),N,1));
(31) % predict for observations in test set (rows) for each lambda (cols)
(32) QTY = repmat((YQ),1,P);
(33) Y2  = A21(Q(D.QTY));
(34) Y2  = real(Y2); % remove imaginary part due to numerical imprecession
(36) end

To illustrate the differences in the respective approaches to RR, Figure 2 shows the CPU time for (i) the naive approach in (3) which involves solving unknowns, (ii) the dual formulation in (18) which requires solving systems with unknowns each, and (iii) the dual formulation in (21) solving for all values of jointly. These results are obtained by applying the approaches to simulated data, with baseline settings , , , and , and by varying the levels of the factors and , one factor at a time. In order to ensure no approach has an advantage in terms of preprocessing of the data (e.g., constructing and its eigendecomposition) all reported CPU times include these preprocessing steps.

In Figure 2(a), we see that as the number of SNPs increases the time required by the naive approach keeps growing at a fixed rate, whereas the time required by the dual approaches remains unchanged. Moreover, the approach considering all values of jointly outperforms the dual approach solving separate systems. When sample size is relatively large compared to the dual formulations lose their advantage compared to the naive approach. This is not surprising: when the dual formulation requires solving more unknowns than the naive approach. Concordantly, when faced with data in which one can apply the dual approach, and when one can use the classical approach to RR. Figure 2(b) shows that for a very small set of ’s the dual formulation solving systems with unknowns is faster than the formulation solving for all values of jointly. However, the CPU time required by the former approach increases continuously with , whereas the CPU time of the method considering all ’s jointly hardly changes. When , the latter method attains a better CPU time than the former method does.

The second method for setting is based on the mixed model in (8). In this model, the optimal hyperparameter is a function of and . Therefore, one can estimate the mixed linear model using methods such as (restricted) maximum likelihood [28, 29] and take .

Finally, one can use an existing heritability estimate of the trait under consideration. Given the following definition of SNP-based heritability, provided the SNP data are standardized as -scores, it is shown by Hofheinz et al. [31] that the RR shrinkage parameter can be written as a function of the SNP-based heritability. Specifically, simple algebra shows that, under the above definition of SNP-based heritability, This implies that heritability estimates can be used to set [31]. When using a GRM () to carry out RR prediction, the corresponding shrinkage parameter . This implies the relation between and is given by .

8. Advanced Ridge Regression Methods

8.1. Heteroskedastic Ridge Regression

A point of critique regarding the use of RR is the lack of SNP selection. However, for highly polygenic traits, given current sample sizes, there is evidence that SNP selection is sometimes detrimental to predictive accuracy (e.g., [3, 8, 21]). Nevertheless, since RR can be used for inference just as well as RSR, the approach of selecting SNPs that attain a value below some threshold in the GWAS can also be extended to RR.

In a spirit similar to that of SNP selection, one can argue in favor of a heteroskedastic ridge regression (HRR), where each SNP receives a different amount of shrinkage [50, 51]. As with homoskedastic shrinkage, this SNP-specific shrinkage might be based on either results from the training set or prior information from different data sets. Depending on the size of SNP-specific shrinkage, this method can leverage between SNP selection and full inclusion. Based on prior evidence or in-sample evidence the weight assigned to a SNP can be made arbitrarily small or arbitrarily large given the amount of evidence for association with the outcome. SNP-specific shrinkage opens up the door for a whole array of HRR methods (e.g., [50, 51]).

The HRR estimator in (5) and resulting predictions can be rewritten as where is a diagonal matrix with SNP-specific shrinkage effects.

It is implied by (24) and (25) that HRR can be carried out using the same machinery as homoskedastic RR, by first weighting the SNPs appropriately. More specifically, take and and construct corresponding weighted GRMs by taking Now, using the eigendecomposition defined in (19) of the weighted GRM defined in (26) and by subsequently applying (21) to resulting eigenvectors in and eigenvalues, , we obtain efficient out-of-sample HRR predictions.

8.2. Incorporating Information from Earlier Studies

Using HRR prediction it is possible to include results from a GWAS in other samples as prior information. Consider SNP-specific shrinkage, given by , and a set of GWAS -test statistics from another study without the presence of confounding variables. Given that is approximately constant over the SNPs in the GWAS, the -test statistic of SNP can be written as where denotes SNP standardized to unit length and the estimated effect of the standardized SNP. It follows from this equation that these statistics are proportional to the estimated effects of standardized SNPs. Therefore, the square -test statistics are approximately proportional to the square standardized GWAS estimates. Now, under the prior probability distribution that we have that is a consistent estimator of . Correspondingly, the square -test statistics are proportional to this estimator of the SNP-specific effect variance. Therefore, for a suitable choice of a consistent estimator of is given by . In the framework of HRR, this entails setting . This definition of implies that SNPs are weighted according to . From these weighted SNPs we can construct the weighted GRM and apply (25) to obtain out-of-sample HRR predictions which incorporate information from a GWAS in another dataset.

8.3. Nonlinear Prediction Methods

An important question in genetics is how nonlinear effects (e.g., dominance and epistasis) contribute to the variation of complex traits. RR can efficiently implement such nonlinear SNP effects using the kernel trick from machine learning. Resulting kernel ridge regression (KRR) extends the nonparametric approach to RR, where genetic “similarities” in the space of additive effects are replaced by genetic “similarities” in a larger (potentially infinite) feature space, for instance, including two- or three-way interactions.

The efficient RR predictions in (17) are in essence a weighted average of the observed phenotypes in the training set. Weights are based on the genetic similarity of individuals in the test set and the training set. The more genetically similar two individuals are in the test and training set, the more weight will be given to the phenotype of the similar individual in the training set.

Classical RR measures genetic similarity of individuals in the space of additive effects and assigns weights accordingly. KRR, however, can measure genetic similarity in the space of more than just additive effects. This extended space can include, for instance, -way interactions between SNPs. Now, a GWAS estimating all potential -way interactions between SNPs is not feasible. However, with KRR, rather than having to estimate all coefficients of all nonlinear combinations of regressors, one can instead obtain the measure of genetic similarity in this higher-dimensional space by applying a simple kernel function to any two genotype vectors and corresponding to individuals and .

In this context, classical RR corresponds to . Similarly, a function measuring similarity in the space consisting only of two-way linear interactions is given by To see why this is so, consider expanding (28). We then have where . Thus, is a vector that contains all possible two-way interactions between the markers. Kernel function represents the genetic similarity of individuals and in this space of all two-way interactions between SNPs.

The essence of KRR is the so-called kernel trick that allows one to efficiently compute the higher-dimensional similarity measure by applying a simple kernel function to any two input vectors for individuals and [52]. Provided the kernel is positive definite it constitutes the reproducing kernel of a unique reproducing kernel Hilbert space (RKHS) [53]. KRR then is equivalent to a so-called RKHS regression.

In the case of -way interactions the associated kernel function can be evaluated for all pairs of individuals by raising each element of the GRM, , to the power . An alternative is the nonhomogeneous polynomial kernel of degree , given by . This kernel, similar to the regular polynomial kernel of degree , includes -way interactions but also lower-order interaction terms including the “one-way interactions,” that is, simple additive linear effects.

The preceding example of the polynomial kernel of degree two shows how KRR can include dominance and epistasis in the prediction model. For frequently used kernels, such as the Gaussian (radial basis function) kernel, there exists a representation in which classical RR is applied to a model with infinitely many predictors, nevertheless yielding finite predictions. Obtaining the weights for infinitely many predictors is not possible. Hence, rather than aiming to obtain point estimates of , KRR only aims to obtain predictions.

BLUP and, by extension, RR are special cases of prediction using KRR (e.g., [54, 55]). There has been a substantial amount of work in plant and animal breeding, aiming to improve predictive accuracy using KRR (e.g., [12, 15, 16]). A generally used kernel is the aforementioned Gaussian kernel, defined as where and hyperparameter . This type of kernel includes all conceivable linear interactions between the SNPs and with themselves. Endelman [16] finds that the Gaussian kernel outperforms accuracy of RR and a Bayesian approach to LASSO, used to predict wheat and maize traits in samples, typically with about 300 observations and 3000 SNPs. Similarly, using a Bayesian approach, Crossa et al. [15] find in samples of about 250 observations, with 1100 SNPs, that both the Gaussian kernel and the LASSO outperform predictive accuracy of RR for grain yield and maize flowering traits. However, when comparing the LASSO with the Gaussian KRR, which of two the methods is better, depends on the trait. An efficient implementation of KRR based on maximum likelihood, using the Gaussian kernel, is available in the R package  rrBLUP [16].

Morota and Gianola [17] compare a wide range of kernels, such as the exponential [12, 16, 56], Matérn, diffusion (e.g., [57]), and kernel [58], for the purpose of obtaining genomic estimated breeding values [38]. Though it is argued that selecting a suitable kernel is the most precarious step (e.g., [14]), current evidence suggests that most considered kernels attain a predictive accuracy similar to that of the Gaussian kernels [17]. Thus, it appears that the Gaussian KRR is a robust prediction method for quantitative traits, able to handle nonlinear genetic architectures. Moreover, Endelman [16] finds little evidence supporting the hypothesis that a Gaussian kernel is likely to overfit the data [56].

Given the current evidence, KRR using an appropriate kernel (e.g., the Gaussian kernel) is a promising prediction technique, especially for traits where epistatic effects and dominance are expected to contribute to trait variation. De los Campos et al. [14] suggest an interesting venue for further research on the use of KRR for prediction in quantitative genetics, by combining multiple kernels in a single model, each kernel representing a single variance component (e.g., additive, dominance, or epistasis). For a more detailed treatment of KRR and its uses in quantitative genetics, see Morota and Gianola [17].

Regarding the computation of predictions using KRR, let denote the matrix of similarities in the higher-dimensional feature space in the training set, such that an element of this matrix is given by and let be defined similarly for individuals in the test set versus individuals in the training set. Now, KRR prediction without confounders is given by and with confounders by where, as before, is the projection matrix removing the effects of .

In the case of the nonhomogeneous polynomial kernel of degree , given the GRMs, and , the matrices and can be obtained efficiently by adding a constant to each element of the GRMs and by raising each resulting element to the power . When and are not fixed, these are additional hyperparameters which can be tuned via ()CV.

9. Simulation Study

An important question is under what circumstances can we expect RR to yield more accurate predictions than RSR? The answer to this question can help us assess the merits of RR in quantitative genetics. As discussed, prediction using RR is intimately related to the BLUP of the phenotype under a mixed linear model in which SNP effects are assumed to be all drawn from a normal distribution. This corresponds to idea of each SNP making a tiny contribution to phenotype. Therefore, it is reasonable to assume that RR will perform well when the SNP effects are as such. However, given that not all SNPs in existence are causally affecting the outcome, an open question is how does RR perform when only a subset of SNPs affects the outcome?

Moreover, an important factor influencing predictive accuracy of a classical polygenic score is the training sample size. Therefore, RR is likely also to be very sensitive to the sample size. Finally, the more heritable a trait is, the easier it should be to detect the effects of SNPs. Thus, an additional question is how do RR and RSR perform under different levels of heritability?

In short, we want to know the relative predictive accuracy of RR and RSR (i) for a wide range of trait architectures and (ii) under particular combinations of sample size and the number of genotyped SNPs. To answer this question we run a suite of simulations. In these analyses, we vary sample size of the training set (), the number of genotyped SNPs (), the fraction of SNPs exerting a causal influence (), and the SNP-based heritability ().

Table 1 shows the levels we consider for these factors. In addition, a range of values for on the interval is considered. Each unique combination of levels of these factors constitutes a scenario. The total number of scenarios is ,160. We consider runs, yielding combinations of levels and runs.

FactorsLevelsNumber of levels


(Linear increases on logarithmic scale)


For a combination of sample size, the number of SNPs, trait heritability, and a fraction of causal SNPs chosen from the levels shown in Table 1, let be the corresponding number of causal SNPs. Now, the data generating process for this combination of levels is given bywhere denotes the binomial distribution with draws each with probability of success and denotes the uniform distribution on the interval . This data generating process corresponds to a quantitative trait which is normally distributed and has only additive genetic variation to which common variants contribute (i.e., minor allele frequency above 5%).

The total number of observations includes the individuals in the test set. The size of the test set is 10% of the size of the training set, hence, yielding . Here, denotes the nearest smaller integer.

In order not to be dependent on a single generated dataset, the entire simulation consists of independent runs (replications). In each run we simulate only one set of genotype data for individuals and SNPs. Given any combination of and listed in Table 1 we can take an appropriate submatrix of the genotype matrix. To this submatrix we apply a set of weights of which some are zero, such that we attain the desired fraction of SNPs being causal. Moreover, by scaling these weights and the noise vector appropriately we can attain any specified heritability. The result is a four-dimensional phenotype array with individuals along the first dimension and the factors , , and along the remaining dimensions.

When computing the out-of-sample predictions based on RR and RSR, the available genotype matrix only depends on and , not on or on . Therefore, given and , when the eigendecomposition of the GRM, , can be reused for all combinations of and . Moreover, the approach has already been amended to reuse the eigendecomposition for different values of . Similarly, when the eigendecomposition of matrix can be reused. Since there only are unique levels of and unique levels of , RR prediction (i) for the 62,160 scenarios per replication reduces to computing eigendecompositions and (ii) for each scenario to carrying out the matrix multiplications seen in (21).

In a typical run it takes 4.5 hours to predict using RR on a machine with 16 cores at 2.60 GHz per core with 64 GB RAM. The RSR predictions are generated alongside at virtually no costs in terms of CPU and memory. The computing time includes computation of the GRM, , when and when . Given and , failure to exploit (i) the constancy of the GRM and of over the different combinations of and and (ii) the properties of the eigendecomposition which enable the joint evaluation of the 151 values of we consider dramatically increases the CPU time of RR. In fact, we infer that the less efficient approach yields a CPU time that is at most a factor larger than the 4.5 hours we attain (i.e., about 57 years per run). Even worse, when the naive RR approach is applied and also when , RR predictions cannot be obtained for datasets with more than 50,000 SNPs on the machine we use. Thus, using the efficient approach based on the GRM when and based on when , combined with the smart use of eigendecompositions and constancy of GRMs over different combinations of and we are able to reduce CPU times from several decades to several hours.

In each run, for each combination of levels we compute the of the RSR prediction with the outcome and the of the RR prediction with the outcome. is measured by the squared sample correlation coefficient between the polygenic score and the outcome in the test set. Our aim is to assess predictive accuracy of RSR and see whether it differs significantly from zero for a wide range of configurations. Moreover, we want to test whether RR provides a significant improvement compared to RSR. Therefore, the performance of RR is measured relative to RSR. That is, we take the log-ratio of the two, given by . This measure is continuously distributed over .

We measure the absolute performance of RSR by the logit transformation of ; that is, This measure is also distributed over . The reason for dividing by is that we want to know what part of the genetic variation the polygenic score captures. If is low, for instance, 5%, we consider a polygenic score that attains an of to be more impressive than a risk score that explains 10% of the variation in a highly heritable trait (e.g., ). Note that we exclude observations with as these are uninformative outliers; a polygenic score that “explains” more genetic variation than there actually is is simply wrong.

Regarding the RR penalty, let denote the accuracy of RR in run , given penalty , conditional on some , , , and . Now, let denote the median of the RR performance over the runs for a specific value of . Now, for this combination of , , , and we take Thus, for a given combination of levels of factors is tuned by setting it such that it maximizes the median of RR over the runs for the given combination of levels. Based on this procedure, the optimal of RR in run is given . This yields a single measure of accuracy of RR per replication and per combination of levels. This procedure results in a value of that performs well in 21 independent samples. Hence, it is similar to a value that would result from CV; there is little scope for overfitting. Moreover, since the median is less sensitive to outliers than, for instance, the mean, we make our measure more robust by taking the median over the runs. The reason that we choose for this approach instead of CV is to reduce the computational complexity of the simulation procedure at the expense of having a slightly less elegant approach.

9.1. Simulation Results

Table 2 shows the summary statistics of the measure and of . As can be seen, overall the combinations of levels and runs RR seems to outperform RSR on average by about 6%. However, there is much variation in the log-ratio. The lowest log-ratio is and the highest is . Since this ratio is on a log scale this implies a tremendous difference in . The reason for this is that when either the nominator or the denominator of gets close to zero, the log-ratio can attain a large value (in absolute terms). For this reason we excluded log-ratios outside the interval . This leads to a drop in the variance from about 0.4 to 0.04, only at the expense of losing 3.9% of the observed combinations of levels and runs. Moreover, the mean log-ratio hardly changes by removing the outliers. This reduction in variance allows us to display the results in a more insightful manner and ensures further inferences on the relation between our factors (e.g., sample size) and predictive accuracy are not influenced by aberrant observations. For we see that the average of about 17% is significantly greater than zero.

Outcome Restriction Count (% total) Mean Var. Min Max



Figure 3 shows the histogram of over the combinations of runs and levels inside the range . This histogram confirms that there are long and thin tails. Most mass centers around zero. However, the empirical distribution is slightly skewed to the right, giving rise to the positive average log-ratio. The figure shows that RR often performs better than RSR. Given the fact that RR lies between RSR and OLS, this is not surprising. Using the penalty parameter , RR tries to find the optimum between these two extremes. Figure 4 shows the histogram of excluding observations for which . The observations are smoothly distributed. A value of zero corresponds to an equal to half the heritability. Thus, in a substantial proportion of the cases RSR captures more than half of the genetic variation.

Figure 5 shows the log-ratio of the median of ridge regression and of RSR, with values outside the interval truncated to corresponding extremes of this interval. This truncation is necessary in order for the figure not to be dominated by the outliers. For (see the lower right block in Figure 5), the performance of RR and RSR is volatile. Sometimes, RR strongly outperforms RSR and sometimes it is the other way round. However, on average RR seems to outperform RSR. As approaches (see the lower left and upper right blocks in Figure 5) RR starts to outperform RSR. There are large regions, where the log of the gain in accuracy is consistently between zero and a half. This corresponds to a relative increase between zero and 65%. For example, for , , and 200 causal SNPs RSR attains a median of 17% and RR 20%, constituting a relative increase of 16%. This gain in accuracy peaks when .

When (see the upper left block in Figure 5), the gain in accuracy drops to zero. However, it is unlikely that this pattern, where the gain of RR dies out as keeps increasing, replicates empirically. The reason for this is that the pattern is probably an artefact of the design of the simulation; all SNPs are simulated independent of each other. Even though empirical correlations between SNPs can arise in the simulations, asymptotically there is none. Thus, for sufficiently large (compared to ) the standardized simulated SNP data are such that approaches the identity matrix and RR becomes equivalent to RSR (see Section 3). Therefore, the accuracy of RR and RSR does not differ for such extremely large values of . How the performance differs in these large samples when there is linkage disequilibrium in the data remains to be seen.

Table 3 shows the median of the of RSR and that of RR relative to RSR for combinations of sample size and the number of genotyped SNPs that are typically seen in a GWAS (e.g., , ). We see that for these data dimensions a trait with a heritability of 50% has a classical polygenic score which on average only explains 0.5% of the total phenotypic variation. Moreover, RR yields a relative increase of just 2.9%. This increase gives an absolute of 0.51% for RR. This observation clearly illustrates that the so-called missing heritability [45] is hard to find, even under a very simple data generating process, that is, a process for which we are sure that both RSR and RR should asymptotically capture all genetic variation.

Median Median / 

5 k500 k0.500.0031.078
10 k500 k0.500.0051.029
20 k500 k0.500.0091.038

10 k100 k0.500.0271.079
10 k200 k0.500.0111.000
10 k500 k0.500.0051.029

10 k500 k0.250.0011.000
10 k500 k0.500.0051.029
10 k500 k0.750.0111.011

9.2. Modelling the Simulation Results

To understand the relation between the various factors in the simulation study and the gain in predictive accuracy by RR we fit a linear model to the logarithm of the ratio for all replications and for all considered levels of factors, such as sample size. Moreover, in order to obtain the of RSR as a benchmark we also fit a linear model, the logit transformation of .

The results in the previous section indicate that the relation between sample size and the performance is nonlinear. The relation seems to exhibit an inverted U-shape. For this purpose, we include and its square as regressors. Moreover, the location of the peak depends on the number of SNPs, implying that the parameters of regressors related to sample size depend on . Consequently, interactions between and are added to the model. By symmetry of Figure 5, similar arguments hold for the performance as function of . Based on this argument we consider up to three-way interactions between the regressors.

In addition, we see in many subplots of Figure 5 that the gain in predictive accuracy differs systematically between low, intermediate, and high heritabilities. Therefore, heritability is included as regressor. Finally, although the effect of the fraction of causal SNPs is hard to judge from Figure 5, we include this factor as regressor as well.

Both outcomes are modelled as a linear function of the aforementioned basic regressors. These regressors are reported in Table 4. We consider models ranging from merely an intercept up to all 3-way interactions between the explanatory variables. We choose the model that minimizes the Bayes information criterion (BIC) [59].

Regressor Captures

Intercept Level
Effect sample size
Effect number of SNPs
Effect of number of causal SNPs ()
Effect of fraction of SNPs causal
Effect of heritability

Table 5 reports the BIC values of the respective models. On the basis of these values we find that a model including all three-way interactions is most appropriate, both in case of the log-ratio and in case of the logit of the performance of RSR relative to the heritability. The model for the gain in accuracy of RR relative to RSR can explain approximately 12% of the variation in this measure on the basis of sample size and the other regressors. The model for the accuracy of RSR can explain about 61%.

Outcome Regressors (number of regressors) Number of observations BIC

Intercept (1)1,254,1680.0%
Regressors in Table 4 (5)1,254,1684.7%
& 2-way interactions (15)1,254,1688.4%
& 3-way interactions (35)1,254,16812.4%

Intercept (1)1,239,7210.0%
Regressors in Table 4 (5)1,239,72148.6%
& 2-way interactions (15)1,239,72156.3%
& 3-way interactions (35)1,239,72160.9%

A likely reason for the fact that we can explain far more variation in the of RSR than in the gain of RR relative to RSR is the following. In case both the of RR and RSR are to a large extent influenced by our factors in a similar way, taking the log-ratio basically eliminates these common effects. What then remains is a measure over which the factors have less predictive power than over the absolute measure.

Using the parameters estimates of the models we predict the log-ratio of and as well as for sample sizes between 100,000 and 500,000 individuals and the number of SNPs between 100,000 and 500,000. For heritability and the fraction of causal SNPs we use the ranges considered in the initial simulations. The resulting predictions of the gain in accuracy are displayed in the heatmap in Figure 6.

In addition, point estimates of and are reported together with confidence intervals in Table 6. There are three groups of predictions. In the first group , , and varies from 100,000 to 500,000. In the second group and varies from 100,000 to 500,000. In the last group and ranges from 25 to 75%.

95% CI 95% CI

100 k500 k0.50(0.139; 0.146; 0.153)(1.062; 1.070; 1.079)
200 k500 k0.50(0.324; 0.337; 0.349)(1.094; 1.107; 1.121)
500 k500 k0.50(0.473; 0.478; 0.482)(1.142; 1.167; 1.193)

500 k100 k0.50(0.486; 0.488; 0.490)(1.218; 1.244; 1.270)
500 k200 k0.50(0.482; 0.485; 0.488)(1.193; 1.218; 1.244)
500 k500 k0.50(0.473; 0.478; 0.482)(1.142; 1.167; 1.193)

500 k500 k0.25(0.205; 0.212; 0.218)(1.110; 1.135; 1.160)
500 k500 k0.50(0.473; 0.478; 0.482)(1.142; 1.167; 1.193)
500 k500 k0.75(0.733; 0.736; 0.739)(1.191; 1.218; 1.245)

Results from Figure 6 and Table 6 indicate that in most cases RR is expected to yield a relative increase in between 10% and 20% for sample sizes ranging between 100,000 and 500,000 individuals. All increases in accuracy are greater than zero at a 5% significance level. Moreover, RSR attains values of ranging between 15% and 75%. As an example, in case of 200,000 individuals and 500,000 SNPs, for a trait with the of RSR is expected to be 33.7% and the of RR 37.3%.

Regarding these findings, combining the attained by RSR with the relative increase by RR yields expected values of the of RR which in some cases surpass . In practice this cannot be true. In case a trait has an of 50% it is not possible to consistently predict more than 50% of the phenotypic variation on the basis of SNP data. This seems to indicate that our estimates are somewhat optimistic. Nevertheless, for the ranges in which we actually simulated data (i.e., and ) RSR is able to attain a substantial when and RR is able to considerably increase the . For instance, at and , with 200 causal SNPs the median of RSR is 17%, and the median of RR is 20%. This constitutes a relative increase in of about 16%. As shown in Figure 5, this pattern seems to persist while .