Abstract

Multiple correlated traits are often collected in genetic studies. By jointly analyzing multiple traits, we can increase power by aggregating multiple weak effects and reveal additional insights into the genetic architecture of complex human diseases. In this article, we propose a multivariate linear regression-based method to test the joint association of multiple quantitative traits. It is flexible to accommodate any covariates, has very accurate control of type I errors, and offers very competitive performance. We also discuss fast and accurate significance value computation especially for genome-wide association studies with small-to-medium sample sizes. We demonstrate through extensive numerical studies that the proposed method has competitive performance. Its usefulness is further illustrated with application to genome-wide association analysis of diabetes-related traits in the Atherosclerosis Risk in Communities (ARIC) study. We found some very interesting associations with diabetes traits which have not been reported before. We implemented the proposed methods in a publicly available R package.

1. Introduction

Over the past ten years, many epidemiologic studies have used genome-wide association studies (GWAS) to identify genetic components of many complex human diseases. These large cohort studies often collected a broad array of correlated traits that often reflect common physiological processes. By jointly analyzing these correlated traits, we can often gain more power by aggregating multiple weak effects and shed light on the mechanisms underlying complex human diseases [1].

There have been many methods proposed recently to detect SNP association with multiple correlated traits (see, e.g., [213]). A direct approach is based on the minimum trait value [6], which typically requires permutations to compute significance value. A related approach is the trait-based association test using an extended Simes procedure (TATES; [10]) that combines the univariate trait values while correcting for the correlations among the multivariate traits. Various dimension reduction methods that summarize the multivariate traits into a univariate outcome are also proposed, which then apply the traditional univariate association test. Examples include the principal component analysis (PCA) [2], principal components of heritability (PCH) [3], and averaging longitudinally observed traits [7, 14]. PCA is an unsupervised dimension reduction and the top PC may not necessarily reflect the association signal. Sample splitting is typically used in PCH for significance calculations and may lead to loss of power.

Multivariate trait testing methods generally perform better than univariate analysis-based approach [15]. Among the multivariate testing methods, a popular approach is the canonical correlation analysis (CCA) [4, 16, 17], which is fast to compute but not flexible and is unable to accommodate covariates. Liu et al. [5] proposed the GEE model [18] to jointly analyze one continuous and one binary trait. In Avery et al. [19] and He et al. [11], GEE-based marginal generalized linear modeling of multivariate traits is adopted for efficient multitrait association testing. Schifano et al. [20] proposed a closely related GEE-based scaled marginal association test of multiple secondary continuous traits. Sitlani et al. [13] explored the GEE modeling of longitudinally measured traits for association test. These GEE-based methods typically explicitly avoided modeling the trait correlations. Another set of multivariate approaches is based on the inverted regression of genotypes to test the overall trait effects. For example, the proportional odds regression modeling of genotypes was proposed as a convenient approach to testing multitrait associations [8, 21, 22]. A related adjacent category logistic regression of genotypes was proposed by Wu and Pankow [12]. Inverted regression approach does not easily accommodate imputed SNPs and has generally used the “best-guess” genotypes, which is known to be leading to a loss of power. In contrast, the multivariate trait regression approach can easily test imputed SNPs by using the imputation dosage as covariate.

In this article, we explore an alternative multivariate regression framework to explicitly model the trait correlation and adjust for covariates to test multitrait associations. We compute the analytical values for the proposed tests based on the -distributions that offer very accurate type I error control with good finite sample performance. We also exploit the parallel nature of genome-wide association test to develop very efficient numerical algorithms that are extremely scalable to genome-wide association tests of millions of SNPs. We demonstrate through extensive numerical studies that the proposed methods have very competitive performance compared to existing methods. We further illustrate the usefulness of the proposed methods through an application to genome-wide association study of multiple diabetes-related glycemic traits.

2. Methods

We first discuss a multivariate linear regression-based framework for modeling the multiple quantitative traits and then derive the Wald type statistics for testing multitrait associations.

2.1. Multivariate Linear Regression Model

Consider continuous traits , a covariate vector of length (which could contain an ancestry indicator or principal components), and a genotype score coding the number of minor alleles. Consider the multivariate normal trait model:where is a vector of length , is an matrix, is a vector of length , and the random error is of length and is assumed to follow a zero mean multivariate normal distribution with covariance , . Multivariate trait association amounts to testing . Here we have assumed the same covariates for all traits, which is the case for our ARIC study GWAS example (see Application to ARIC GWAS of Glycemic Traits) and many typical GWAS. In the supplementary materials (available here), we discuss the possible scenario with different covariates for each trait. The trait model (1) is a multivariate linear model (MLM; see, e.g., [23, chapter 8] and [24, chapter 9]).

Given observations for unrelated individuals, for individual , denote as the outcome, as the covariate, and as the genotype score. Denote , , , and design matrix of dimension , where is a column vector of ones.

Denote the parameter matrix . We can check that the maximum likelihood estimators (MLEs) are (see, e.g., [23, p. 294])

2.2. Conducting Multivariate Association Tests

Denote the vector operator , which stacks the columns of a matrix into a vector. Denote . For the MLEs (2) of the MLM model (1), we can check that (see, e.g., [23, p. 296])where denotes the Kronecker product and independently follows a Wishart distribution, , with degrees of freedom (DFs) and scale matrix .

Define the design matrix and the corresponding hat matrix . Let and . Here is an identity matrix. We can check thatWe test the multitrait association with the following Wald statistic:Note that and are independent. Under the null hypothesis,    follows the -distribution with DFs (see, e.g., [25, p. 541]).

In the supplementary materials, we analytically show that the CCA test approach [4] is equivalent to a Score test statistic under the MLM model (1) when there are no covariates other than the genotype. Therefore, the proposed MLM-based Wald test can be treated as a natural and flexible generalization of the CCA: (I) it can accommodate any covariates; (II) it is based on the more powerful Wald test instead of the Score test for an association test of quantitative traits; (III) it has an exact -distribution for the multivariate normally distributed traits and hence has very accurate control of type I errors for any sample sizes without the need of asymptotic approximation; and (IV) it is very fast to compute (see next section for details) and extremely scalable to genome-wide association tests of millions of SNPs.

When genetic effects are similar across traits, we can further improve the multivariate association test power using a test statistic with 1-DF following the lines of O’Brien [26], which performed a Wald test of linear combinations of . We can derive similar Wald tests under the MLM (1) (see supplementary materials for technical details). When the genotype effects are the same across different traits, we study the following test statistic:where is an column vector of ones. When the scaled genotype effects are the same across different traits, we study the following test statistic:where is a column vector of estimated standard errors: .

Under the null hypothesis, both and follow the asymptotic standard normal distribution. To improve the finite sample performance, we can compare and to a -distribution with -DF.

2.3. Efficient Computation of GWAS Wald Test Statistics

For a typical GWAS with millions of SNPs, rather than fitting a MLM for each SNP, we developed very efficient algorithm to estimate the MLMs for all SNPs using matrix decomposition tricks following the line of Voorman et al. [27] as follows. For , denote its singular value decomposition (SVD) as , where is an matrix with orthogonal columns, is a diagonal matrix, and is a orthogonal matrix. The null MLM hat matrix can then be computed as , and . Denote the null MLM residual matrix as , and let . In (4), we have shown that the genotype effect can be efficiently computed as . We can then compute the covariance matrix MLE as . Here both and just need to be precomputed once and can be stored for use with all SNPs. Operationally we can also apply the popular PLINK tool [28] to test multitrait association. We first obtain the residuals of multivariate traits and genotypes adjusting for all covariates. We then input the residuals into the PLINK CCA test approach [4]. Technically, we need to adjust the PLINK output value using an -distribution with different DFs (see supplementary materials for technical details).

3. Results

3.1. Simulation Studies

We consider three forms of Wald statistics: is the omnibus test, and and are the 1-DF test assuming common or common scaled effects. The GEE-based approaches of He et al. [11] are computationally very efficient, have been shown to appropriately control the type I errors, and have the overall best detection power compared to the other methods (e.g., TATES of [10] and other univariate test-based methods) in extensive numerical studies. Here we compared the proposed methods to their GEE score tests, denoted as , which are the -DF omnibus test and 1-DF tests assuming a common effect or common scaled effect.

We consider a standard normal covariate and a Bernoulli covariate with probability of . The SNP genotype score is simulated from a Binomial distribution, , where the minor allele frequency (MAF) . Here is essentially a population indicator and we have simulated SNPs under population stratification.

We conducted simulations for testing related traits of 1,000 unrelated individuals, respectively. Each time, we simulate the traits from a multivariate normal distribution with a compound symmetry correlation matrix with correlation . The first trait has a variance of 2 and all the other traits have unit variance. We set for , and for .

We used 10 million experiments to evaluate the type I error and experiments to evaluate the power under various combinations of . We conducted simulations for , , and . Here we report the results for , , and . The conclusions remain the same for other settings (data not shown).

Tables 1 and 2 summarize the estimated type I errors. Overall, the type I errors are well controlled for the proposed methods, while the GEE score tests are conservative, especially for large number of traits (). In general, the proposed Wald tests follow the exact -distribution under the null hypothesis and hence the type I errors are well controlled under all settings. The GEE tests rely on the large-sample asymptotic distribution and therefore generally we need large sample size to have better control of type I errors, especially for a larger number of traits (containing more model parameters).

Tables 3 and 4 summarize the power for and , respectively. is the most powerful when are close to each other, and is the most powerful when are close to each other. In general, the proposed MLM-based Wald tests perform better than the corresponding GEE-based score tests, especially when testing a large number of traits. This agrees with the general principle that the Wald test is typically more powerful than the GEE-based test.

The chi-square statistic is commonly used in practice and referred to an -DF chi-square distribution to compute multitrait association test’s values, which can lead to significantly inflated type I errors at stringent genome-wide significance levels. Figure 1 shows the ratio of actual significance level of Wald test’s values computed using the chi-square distribution and -distribution, respectively. We can see that the type I error based on the chi-square distribution is inflated: more so for larger number of traits, smaller significance level, and smaller sample size. For example, when testing traits with covariates and samples, under genome-wide significance level , the actual significance level of chi-square distribution value is . Using the chi-square distribution to compute values will lead to very small inflation only when the sample size is large, such as in the meta-analysis of multiple GWAS studies. For typical GWAS with small-to-medium sample sizes, we recommend using the appropriate -distribution to compute significance values to reduce false positive findings.

3.2. Application to ARIC GWAS of Glycemic Traits

The Atherosclerosis Risk in Communities (ARIC) study [29] is a population-based, multicenter prospective investigation of cardiovascular disease. Men and women aged 45–64 years at baseline were recruited from four US communities: Forsyth County, North Carolina; Jackson, Mississippi; suburban areas of Minneapolis, Minnesota; and Washington County, Maryland. A total of 15,792 individuals participated in the baseline examination during the period of 1987–1989. The vast majority of ARIC participants are of European (73%) or African (26%) ancestry. We conducted two analyses of diabetes-related glycemic traits in ARIC GWAS data, which has been imputed to around 2.5 million HapMap SNPs using MaCH [30]. We included in the analysis those common SNPs with MAF and imputation score .

As a proof of concept, we first analyzed four fasting glucose levels in 5947 nondiabetic ARIC white participants measured at four visits (visits 1–4) conducted approximately three years apart. The average correlation of glucose levels is 0.55. We applied an additive genetic model with imputed dosage as a covariate and adjusted for age, gender, and study center in all tests. By analyzing four fasting glucose measures jointly, identified 104 significant SNPs, identified 103, identified 102, identified 101, and and identified the same set of 95 SNPs at the genome-wide significance level . Analyzing each glucose measure separately identified 34, 84, 37, and 64 genome-wide significant SNPs at visits 1, 2, 3, and 4, respectively. All the identified SNPs by different methods are genome-wide significant in the MAGIC Consortium, a meta-analysis of 21 fasting glucose GWAS which together included 46,186 nondiabetic participants [31].

Compared to , the two additional SNPs identified by , rs780093 and rs780094, had values of and using . Their respective MAGIC meta-analysis’ values were and . Compared to , the two additional SNPs identified by , rs1260326 and rs11688384, had values of and using . Their respective MAGIC meta-analysis’ values were and .

Second, we jointly analyzed three distinct diabetes-related glycemic traits measured at visit 4 in 5068 nondiabetic white participants measured at visit 4 in ARIC: fasting glucose, fasting insulin, and glucose level 2 hours after an oral glucose challenge. We applied an additive genetic model with imputed dosage as a covariate and adjusted for age, gender, and study center. To account for the skewed distribution of fasting insulin, we adopted the Box-Cox transformation with an estimated power of 0.35 [32]. The three traits had an average pairwise correlation of 0.31. When analyzing fasting insulin or 2-hour glucose levels individually, we did not identify any significant SNPs at the genome-wide significance level (). For joint testing of all three traits, identified none, identified 139, and identified 140 genome-wide significant SNPs, among which 61 and 61 SNPs were reported as genome-wide significant in the MAGIC meta-analyses of fasting glucose, fasting insulin, or 2-hour glucose levels [31, 33].

Compared to , identified two additional genome-wide significant SNPs, rs4665987 and rs853780, with values of and , respectively. MAGIC meta-analysis of fasting glucose reported a value of for rs853780. Its MAGIC meta-analyses of fasting insulin and 2-hour glucose values are 0.054 and 0.477, respectively. For rs4665987 (near GCKR on chromosome 2:27755825), MAGIC meta-analysis’ values for the fasting glucose, fasting insulin, and 2-hour glucose levels are , , and , respectively. This SNP was genome-wide significantly associated with human serum metabolite levels in a GWAS of 8330 Finnish individuals [34] and several other GWAS [3538]. Compared to , reported one additional genome-wide significant SNP, rs17540154, with value of . The MAGIC meta-analysis of fasting glucose reported a value of for rs17540154. Its MAGIC meta-analyses of fasting insulin and 2-hour glucose values are 0.101 and 0.720, respectively.

Among the identified significant SNPs by joint testing, there were 79 novel genome-wide significant SNPs that have not been reported as significantly associated with diabetes-related fasting glucose and insulin levels before. Among them, one SNP, rs4665987, is located on chromosome 2:27755825 and 78 other SNPs are clustered on chromosomes 15:62132921 to 15:62396389. Interestingly, six of them (listed in Table 5) were genome-wide significant in the MAGIC meta-analysis of proinsulin level [39]. The list of all identified SNPs with detailed analysis’ results is available in the supplementary materials.

4. Discussion

So far typical effect sizes of most identified genetic variants for many diseases or traits are very small and they have only explained a very small proportion of the overall disease heritability or trait variation. It is commonly accepted that there are many more common variants with relatively small-to-medium effect sizes or rare variants with larger effect sizes yet to be discovered. To identify these additional variants, very large sample sizes will be needed. One approach is to form a consortium to facilitate meta-analysis of many studies, but development of these genetics consortia is generally time-consuming and logistically challenging. Meanwhile the recently studied joint association test of multiple correlated traits offers an alternative approach to boost power in that it can often dramatically improve the association test power by “enlarging the sample size” through the incorporation of many correlated traits that are typically collected in most large genetic studies and may share genetic determinants. Another strategy to further improve the detection power is to use a variant-set association test, which has been proven to be very useful (see, e.g., [16, 17, 4042]). It is worthwhile to generalize the proposed Wald tests to develop more accurate and powerful association tests of variant sets across multiple traits.

Here we have focused on testing a relatively small number of correlated quantitative traits, which have enabled us to develop accurate and powerful association tests without any asymptotic approximations as adopted in the more general though conservative GEE approach, which can be applied to any mix of quantitative and discrete traits. It will be interesting to extend the proposed methods to the phenome-wide association studies (PheWAS) with a large collection of phenotypes [4345] and develop more powerful joint association test of quantitative and discrete traits.

In the previous discussions, we have assumed the same set of covariates across all traits. With differing covariates, we provide technical details regarding model estimation and extensive simulation studies to confirm that the proposed methods accurately control type I errors and perform favorably compared to existing methods (see the supplementary materials for complete results). In summary, we recommend the proposed multivariate linear regression-based test as a complementary approach to enhancing the power of analyzing multiple quantitative traits in unrelated individuals. Our numerical studies have suggested that the omnibus Wald test generally has robust and good performance. The 1-DF Wald tests can perform well due to reduced DFs, but they could be sensitive to the underlying assumptions. It will be worthwhile to develop adaptive and powerful tests. We have implemented the proposed methods in an R package available at http://www.github.com/baolinwu/MTAR. We provide some sample R codes to install and use the package in the supplementary materials. The developed algorithms are very efficient and extremely scalable to genome-wide association test.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported in part by NIH Grants GM083345 and CA134848. The authors are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. The ARIC study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute Contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C), R01HL087641, R01HL59367, and R01HL086694; National Human Genome Research Institute Contract U01HG004402; and National Institutes of Health Contract HHSN268200625226C. The authors thank the staff and participants of the ARIC study for their important contributions. Infrastructure was partly supported by Grant no. UL1RR025005, a component of the National Institutes of Health and NIH Roadmap for Medical Research.

Supplementary Materials

In the supplementary materials, we provide more details for (1) model estimation and inference with numerical results when we have different covariates for each trait, (2) 1-DF Wald test assuming similar effects across all traits, (3) relation of proposed methods to CCA (e.g., as implemented in the PLINK), (4) the list of all identified SNPs with detailed analysis results for the ARIC GWAS of glycemic traits, (5) sample codes to install and use the developed R package, and (6) simulation results illustrating the robustness of proposed tests to deviation from normality and impact of directions of marginal effects and trait correlations on the multitrait association test power. (Supplementary Materials)