Computational and Mathematical Methods in Medicine

Volume 2018 (2018), Article ID 2564531, 9 pages

https://doi.org/10.1155/2018/2564531

## Fast and Accurate Genome-Wide Association Test of Multiple Quantitative Traits

^{1}Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA^{2}Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, MN, USA

Correspondence should be addressed to Baolin Wu; ude.nmu@niloab

Received 27 September 2017; Accepted 24 January 2018; Published 18 March 2018

Academic Editor: Zoran Bursac

Copyright © 2018 Baolin Wu and James S. Pankow. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Multiple correlated traits are often collected in genetic studies. By jointly analyzing multiple traits, we can increase power by aggregating multiple weak effects and reveal additional insights into the genetic architecture of complex human diseases. In this article, we propose a multivariate linear regression-based method to test the joint association of multiple quantitative traits. It is flexible to accommodate any covariates, has very accurate control of type I errors, and offers very competitive performance. We also discuss fast and accurate significance value computation especially for genome-wide association studies with small-to-medium sample sizes. We demonstrate through extensive numerical studies that the proposed method has competitive performance. Its usefulness is further illustrated with application to genome-wide association analysis of diabetes-related traits in the Atherosclerosis Risk in Communities (ARIC) study. We found some very interesting associations with diabetes traits which have not been reported before. We implemented the proposed methods in a publicly available R package.

#### 1. Introduction

Over the past ten years, many epidemiologic studies have used genome-wide association studies (GWAS) to identify genetic components of many complex human diseases. These large cohort studies often collected a broad array of correlated traits that often reflect common physiological processes. By jointly analyzing these correlated traits, we can often gain more power by aggregating multiple weak effects and shed light on the mechanisms underlying complex human diseases [1].

There have been many methods proposed recently to detect SNP association with multiple correlated traits (see, e.g., [2–13]). A direct approach is based on the minimum trait value [6], which typically requires permutations to compute significance value. A related approach is the trait-based association test using an extended Simes procedure (TATES; [10]) that combines the univariate trait values while correcting for the correlations among the multivariate traits. Various dimension reduction methods that summarize the multivariate traits into a univariate outcome are also proposed, which then apply the traditional univariate association test. Examples include the principal component analysis (PCA) [2], principal components of heritability (PCH) [3], and averaging longitudinally observed traits [7, 14]. PCA is an unsupervised dimension reduction and the top PC may not necessarily reflect the association signal. Sample splitting is typically used in PCH for significance calculations and may lead to loss of power.

Multivariate trait testing methods generally perform better than univariate analysis-based approach [15]. Among the multivariate testing methods, a popular approach is the canonical correlation analysis (CCA) [4, 16, 17], which is fast to compute but not flexible and is unable to accommodate covariates. Liu et al. [5] proposed the GEE model [18] to jointly analyze one continuous and one binary trait. In Avery et al. [19] and He et al. [11], GEE-based marginal generalized linear modeling of multivariate traits is adopted for efficient multitrait association testing. Schifano et al. [20] proposed a closely related GEE-based scaled marginal association test of multiple secondary continuous traits. Sitlani et al. [13] explored the GEE modeling of longitudinally measured traits for association test. These GEE-based methods typically explicitly avoided modeling the trait correlations. Another set of multivariate approaches is based on the inverted regression of genotypes to test the overall trait effects. For example, the proportional odds regression modeling of genotypes was proposed as a convenient approach to testing multitrait associations [8, 21, 22]. A related adjacent category logistic regression of genotypes was proposed by Wu and Pankow [12]. Inverted regression approach does not easily accommodate imputed SNPs and has generally used the “best-guess” genotypes, which is known to be leading to a loss of power. In contrast, the multivariate trait regression approach can easily test imputed SNPs by using the imputation dosage as covariate.

In this article, we explore an alternative multivariate regression framework to explicitly model the trait correlation and adjust for covariates to test multitrait associations. We compute the analytical values for the proposed tests based on the -distributions that offer very accurate type I error control with good finite sample performance. We also exploit the parallel nature of genome-wide association test to develop very efficient numerical algorithms that are extremely scalable to genome-wide association tests of millions of SNPs. We demonstrate through extensive numerical studies that the proposed methods have very competitive performance compared to existing methods. We further illustrate the usefulness of the proposed methods through an application to genome-wide association study of multiple diabetes-related glycemic traits.

#### 2. Methods

We first discuss a multivariate linear regression-based framework for modeling the multiple quantitative traits and then derive the Wald type statistics for testing multitrait associations.

##### 2.1. Multivariate Linear Regression Model

Consider continuous traits , a covariate vector of length (which could contain an ancestry indicator or principal components), and a genotype score coding the number of minor alleles. Consider the multivariate normal trait model:where is a vector of length , is an matrix, is a vector of length , and the random error is of length and is assumed to follow a zero mean multivariate normal distribution with covariance , . Multivariate trait association amounts to testing . Here we have assumed the same covariates for all traits, which is the case for our ARIC study GWAS example (see Application to ARIC GWAS of Glycemic Traits) and many typical GWAS. In the supplementary materials (available here), we discuss the possible scenario with different covariates for each trait. The trait model (1) is a multivariate linear model (MLM; see, e.g., [23, chapter 8] and [24, chapter 9]).

Given observations for unrelated individuals, for individual , denote as the outcome, as the covariate, and as the genotype score. Denote , , , and design matrix of dimension , where is a column vector of ones.

Denote the parameter matrix . We can check that the maximum likelihood estimators (MLEs) are (see, e.g., [23, p. 294])

##### 2.2. Conducting Multivariate Association Tests

Denote the vector operator , which stacks the columns of a matrix into a vector. Denote . For the MLEs (2) of the MLM model (1), we can check that (see, e.g., [23, p. 296])where denotes the Kronecker product and independently follows a Wishart distribution, , with degrees of freedom (DFs) and scale matrix .

Define the design matrix and the corresponding hat matrix . Let and . Here is an identity matrix. We can check thatWe test the multitrait association with the following Wald statistic:Note that and are independent. Under the null hypothesis, follows the -distribution with DFs (see, e.g., [25, p. 541]).

In the supplementary materials, we analytically show that the CCA test approach [4] is equivalent to a Score test statistic under the MLM model (1) when there are no covariates other than the genotype. Therefore, the proposed MLM-based Wald test can be treated as a natural and flexible generalization of the CCA: (I) it can accommodate any covariates; (II) it is based on the more powerful Wald test instead of the Score test for an association test of quantitative traits; (III) it has an exact -distribution for the multivariate normally distributed traits and hence has very accurate control of type I errors for any sample sizes without the need of asymptotic approximation; and (IV) it is very fast to compute (see next section for details) and extremely scalable to genome-wide association tests of millions of SNPs.

When genetic effects are similar across traits, we can further improve the multivariate association test power using a test statistic with 1-DF following the lines of O’Brien [26], which performed a Wald test of linear combinations of . We can derive similar Wald tests under the MLM (1) (see supplementary materials for technical details). When the genotype effects are the same across different traits, we study the following test statistic:where is an column vector of ones. When the scaled genotype effects are the same across different traits, we study the following test statistic:where is a column vector of estimated standard errors: .

Under the null hypothesis, both and follow the asymptotic standard normal distribution. To improve the finite sample performance, we can compare and to a -distribution with -DF.

##### 2.3. Efficient Computation of GWAS Wald Test Statistics

For a typical GWAS with millions of SNPs, rather than fitting a MLM for each SNP, we developed very efficient algorithm to estimate the MLMs for all SNPs using matrix decomposition tricks following the line of Voorman et al. [27] as follows. For , denote its singular value decomposition (SVD) as , where is an matrix with orthogonal columns, is a diagonal matrix, and is a orthogonal matrix. The null MLM hat matrix can then be computed as , and . Denote the null MLM residual matrix as , and let . In (4), we have shown that the genotype effect can be efficiently computed as . We can then compute the covariance matrix MLE as . Here both and just need to be precomputed once and can be stored for use with all SNPs. Operationally we can also apply the popular PLINK tool [28] to test multitrait association. We first obtain the residuals of multivariate traits and genotypes adjusting for all covariates. We then input the residuals into the PLINK CCA test approach [4]. Technically, we need to adjust the PLINK output value using an -distribution with different DFs (see supplementary materials for technical details).

#### 3. Results

##### 3.1. Simulation Studies

We consider three forms of Wald statistics: is the omnibus test, and and are the 1-DF test assuming common or common scaled effects. The GEE-based approaches of He et al. [11] are computationally very efficient, have been shown to appropriately control the type I errors, and have the overall best detection power compared to the other methods (e.g., TATES of [10] and other univariate test-based methods) in extensive numerical studies. Here we compared the proposed methods to their GEE score tests, denoted as , which are the -DF omnibus test and 1-DF tests assuming a common effect or common scaled effect.

We consider a standard normal covariate and a Bernoulli covariate with probability of . The SNP genotype score is simulated from a Binomial distribution, , where the minor allele frequency (MAF) . Here is essentially a population indicator and we have simulated SNPs under population stratification.

We conducted simulations for testing related traits of 1,000 unrelated individuals, respectively. Each time, we simulate the traits from a multivariate normal distribution with a compound symmetry correlation matrix with correlation . The first trait has a variance of 2 and all the other traits have unit variance. We set for , and for .

We used 10 million experiments to evaluate the type I error and experiments to evaluate the power under various combinations of . We conducted simulations for , , and . Here we report the results for , , and . The conclusions remain the same for other settings (data not shown).

Tables 1 and 2 summarize the estimated type I errors. Overall, the type I errors are well controlled for the proposed methods, while the GEE score tests are conservative, especially for large number of traits (). In general, the proposed Wald tests follow the exact -distribution under the null hypothesis and hence the type I errors are well controlled under all settings. The GEE tests rely on the large-sample asymptotic distribution and therefore generally we need large sample size to have better control of type I errors, especially for a larger number of traits (containing more model parameters).