Research Article  Open Access
Robust Joint Analysis with Data Fusion in TwoStage Quantitative Trait GenomeWide Association Studies
Abstract
Genomewide association studies (GWASs) in identifying the diseaseassociated genetic variants have been proved to be a great pioneering work. Twostage design and analysis are often adopted in GWASs. Considering the genetic model uncertainty, many robust procedures have been proposed and applied in GWASs. However, the existing approaches mostly focused on binary traits, and few work has been done on continuous (quantitative) traits, since the statistical significance of these robust tests is difficult to calculate. In this paper, we develop a powerful statisticbased robust joint analysis method for quantitative traits using the combined raw data from both stages in the framework of twostaged GWASs. Explicit expressions are obtained to calculate the statistical significance and power. We show using simulations that the proposed method is substantially more robust than the test based on the additive model when the underlying genetic model is unknown. An example for rheumatic arthritis (RA) is used for illustration.
1. Introduction
Genomewide association studies (GWASs) have identified a large number of genomic regions (especially singlenucleotide polymorphisms (SNPs)) with a wide variety of complex traits/diseases. In a GWAS, two most common types of data, qualitative (or binary) and quantitative (or continuous) traits, are analyzed and two contentious points are often faced; one is how to construct the test statistic considering the genetic model uncertainty and the other is how to evaluate the statistical significance for controlling the false positive rates efficiently (e.g., [1, 2]). Considering these issues, a lot of work has been done on the binary trait in the past 10 years (e.g., [3â€“7]). Computer algorithms have also been developed to calculated the significance level of robust tests in GWASs, taking into account the genetic model uncertainty [8]. However, few work has been done on continuous traits, only recently So and Sham [9] proposed a MAX3 based on score test statistics, and Li et al. [10] gave a MAX3 based on test statistics. Note that these tests just focus on singlemarker analysis in onestage analysis.
Although the costs of wholegenome genotyping are decreasing with the highthroughput biological technology, the total costs for a GWAS are still very expensive due to the thousands of sampling units and huge amounts of singlenucleotide polymorphisms. In order to save the costs, the twostage design and the corresponding statistical analysis where all the SNPs are genotyped in Stage 1 on a portion of the samples and the promising SNPs with small values (e.g., <0.001) based on some efficient tests are further screened on the remaining subjects, are often adopted in practice (e.g., [11â€“15]).
In genetic association studies, especially GWASs, genetic markers are routinely tested under the assumption of additive effects. Although convenient to use, those tests are optimal only when the true underlying genetic model is additive so that they are not robust against the genetic model misspecification. To our best knowledge, few work has been done on the twostage joint analysis for quantitative trait GWASs allowing for genetic model uncertainty. Here, we attempt to develop a joint analysis method with data fusion in the twostage design using statistic, since test is commonly employed from the linear regression model for quantitative trait, and Li et al. [10] show that MAX3 based on statistics is more powerful than So and Shamâ€™s method by extensively numerical simulation.
The content of this paper is organized as follows. In Section 2, we give some notations and the proposed robust joint test statistics. Further, we derive the asymptotic distribution of the test statistics under the null and the alternative hypotheses. In Section 3, we show that the proposed joint analysis method is substantially more robust than the additivemodelbased test from the numerical results of power comparison when the real genetic model is unknown. After that, an illustrative example for rheumatic arthritis (RA) is presented. Finally, we give some discussion of this paper in Section 4.
2. Methods
2.1. Notations
Assume that individuals are randomly selected to be genotyped in a twostaged GWAS for a certain quantitative trait and that is the sampling proportion in Stage 1. Let and be the sample sizes for Stages 1 and 2, respectively. Consider a biallelic marker with two alleles and . Without loss of generality, we assume that is the minor or highrisk allele. We suppose that the total SNPs are genotyped on the samples of Stage 1, and SNPs with values less than in Stage 1 will be further genotyped and tested in Stage 2. Let the significance level be , and then the genomewide significance level per SNP is with the Bonferroni adjustments. Let and be the observed quantitative outcome vectors for Stage 1 and Stage 2, respectively. Without loss of generality, we assume that the first individuals in Stage 1 have the genotype , the second individuals in Stage 1 have the genotype , and the last subjects in Stage 1 possess the genotype . Similarly, the first subjects in Stage 2 have the genotype , the second individuals in Stage 2 have the genotype , and the last subjects in Stage 2 possess the genotype . Let and , and let be the matrix with all its entries being zero and be the identity matrix.
2.2. StatisticBased Robust Joint Analysis
We firstly briefly introduce statisticbased MAX3 by Li et al. [10] just using the data from Stage 1. Consider the following linear regression model: where is the nuisance parameter for the intercept, is the parameter of interest for genetic effect, and is the genotype value, which takes 0, 1, or 2 corresponding to the count of at a marker locus for the th subject, . The hypotheses of interest are The variable in the previously stated equation is coded differently for the three common genetic models. Let , , and be the design matrices under three commonly used genetic models, where corresponds to the recessive model, corresponds to the additive model, and is for the dominant model. Denote , where and . The modified test statistics under the recessive, additive, and dominant models for Stage 1 are given by where The robust test statistic in Stage 1 is
We now give the proposed robust joint analysis. In the framework of twostage design GWAS of quantitative traits, the SNPs with values less than will be genotyped on the remaining subjects in Stage 2. Following the previous notation for Stage 1, corresponding to the recessive, additive, and dominant models, the genotype data in Stage 2 are denoted by , , and , respectively, and the design matrices are , , and , respectively. Denote , where and . Then, we can obtain three modified test statistics under the recessive, additive, and dominant models for Stage 2 similarly, and denote them by , , and . Let , , , and . Denote , , and for the combined sample sizes from two stages, corresponding to three genotypes. Then the proposed test statistics under three genetic models on the basis of the combined data are as follows: where , , , and â€‰â€‰, Furthermore, we propose the joint testing statistic as
In order to calculate the power of the proposed joint analysis, we have to get the thresholds, which is determined by the significance level. Denote the threshold for choosing the promising SNPs in Stage 1 by , which is the solution of Since the genomewide significance level is , in order to control the false positive rate, we have where is the cutoff point for the joint statistic. Once we have and , the power is calculated by
We now give the detail to calculate the cutoff point and power above. The left side of (10) can be further expressed as For controlling the type I error rate and calculating the power, we need to know the distribution or the asymptotic distribution of under both and .
Note that whether or holds, and and are mutually independent (the proof is given in Appendix A). Denote the correlation matrix of by , whose entries are , , , and , respectively. Similarly, let be the correlation matrix of with , , , and . Then, we can derive that , , and where is the correlation matrix between and , with
Under , for a given odds ratio for subjects with two copies of risk allele corresponding to recessive model or one copy of risk allele corresponding to additive or dominant models, we have the following:(i) when the true genetic model is recessive, â€‰where with (ii) when the true genetic model is additive, â€‰where with (iii) when the true genetic model is dominant, â€‰where with
We develop a method for simplifying the calculations of and and . The details are included in Appendix B, and the calculations of and and are essentially similar.
3. Results
3.1. Power Comparison
We conduct simulation studies to evaluate the performance of the proposed method under three commonly used genetic models (recessive, additive, and dominant models). We mainly compare the power of two approaches; one is the proposed method in this paper, and the other is the joint analysis based on the test statistics and . For convenience, we refer to the proposed method as MAXFJ and AFJ for the other one. We choose the sample size , and . The proportion of subjects genotyped in Stage 1 has three levels . We set the genomewide significance level as and that the significance level per SNP as . In Stage 1, the value threshold for SNPs selected for followup is set to be and . We assume that the HardyWeinberg equilibrium holds in the general sample population, and then there are on average , , and individuals with genotype gg, Gg, and GG, respectively, where the minor allele frequency is set to be 0.15, 0.30 and 0.45. To make the power comparison more distinctly, we specify different genetic effect parameters under three genetic models as follows: for the recessive model, for the additive model, and for the dominant model.
The power results are displayed in Tables 1 and 2 for and , respectively. They indicate that MAXFJ is more efficiency robust than AFJ across various inheritance models. As expected, AFJ is more powerful than MAXFJ under the additive model. However, MAFJ performs much more powerful than AFJ when the true genetic model is recessive. For instance, in Table 2, with and MAF = 0.3, the powers of AFJ and MAXFJ are 0.101 and 0.529, respectively. In summary, MAXFJ is substantially more powerful than AFJ in twostaged GWAS of quantitative traits, when the model for AFJ is misspecified.


3.2. An Illustration Example: Rheumatoid Arthritis
Rheumatoid arthritis (RA) is an autoimmune disease (resulting in a chronically systemic inflammatory disorder) which mainly attacks synovial joints. About 1% of the common adult population worldwide is affected by RA [16]. It has been pointed out that the genetic variants might play a major role in RA susceptibility [17]. Genetic Analysis Workshop 16 (GAW16) based on the North American Rheumatoid Arthritis Consortium (NARAC) is a GWAS testing association with RA using about SNPs [18â€“20]. It included 868 individuals who were RA positive (cases) and also had continuous trait anticyclic citrullinated peptide (antiCCP) measures and 1194 controls sampled from the New York Cancer Project (NYCP) without RA which had no antiCCP measures. Huizinga et al. [21] pointed out that a greater antiCCP would be linked to better prediction of increased risk developing RA. Chen et al. [22] showed that SNP rs2476601 located in PTPN22 had the most significant association with RA. Here, we only focus on SNP rs2476601 and apply two joint analysis methods (AFJ and MAXFJ) to evaluate its statistical significance. The minimum of antiCCP among 868 cases was affected to each control, and a log transformation of antiCCP was applied in the analysis. Then, we considered three simulation circumstances. For , thirty percent of individuals were randomly sampled from all cases and controls and were used as the data from Stage 1, and the rest of individuals were treated as the data of Stage 2. The values of AFJ and MAXFJ were calculated, respectively. We repeated the above procedure 1,000 times and saved the corresponding values. A base10 logarithm transformation and an opposite transformation were successively applied to these values, and the histogram and density of these transformed data were obtained (Figure 1). Similarly, we conducted the simulation and calculation for and 0.5, and the corresponding histogram and density were presented in Figures 2 and 3. Examination of Figures 1â€“3 showed that the values of MAXFJ are more stable than those of AFJ and the estimated density curves of MAXFJ are more closer to the symmetrical normal distribution while the estimated density curves of AFJ are rather skewed, which indicated that MAXFJ possesses more robust performance when the real genetic models are unknown.
(a)
(b)
(a)
(b)
(a)
(b)
4. Discussion
We have developed a feasible twostage design and the corresponding robust joint analysis approach for quantitative trait GWASs. The method is based on the statistics over three different genetic models. The denominator of the used statistic, which is constructed without assuming any genetic model, is different from the commonly used one. This adoption can reduce the computation intensity. Taking advantage of an ingenious design matrix, we successfully construct the common denominator of three test statistics for the joint analysis with combined raw data from both stages. The statistical significance (value) for the proposed joint analysis method can be calculated with the derived analytic expressions on the basis of the asymptotic distributions, which greatly reduce the complexity and computational intensity compared with the resamplingtype permutation and bootstrap procedures. Our numerical results demonstrate that this novel approach has the greater efficiency robustness for genetic model uncertainty than the statisticbased joint analysis which assumes the additive genetic model.
In this work, we did not investigate the power of joint analysis based on other existing robust association methods for quantitative traits such as So and Shamâ€™s method. We find that it is very difficult to extend So and Shamâ€™s method (score testbased MAX3) to twostaged GWASs with quantitative outcomes, since it is almost impossible to derive the joint distribution of score tests from two stages.
For simplicity, here we do not take into account the effects of covariates in the considered twostage design. However, in real application, the proposed method can be easily applied to the situation including one or more covariates as shown by the original MAXF by Li et al. [10]. It is important to stress that we combine the raw data from two stages to construct the joint statistic, unlike the joint analysis for binary traits using the weighted sum of two statistics in Stages 1 and 2 [12]. Furthermore, one basic assumption in this paper is that the effect sizes of genetic variants between two stages are identical (i.e., no heterogeneity exists), which is the natural and reasonable precondition for the data fusion strategy. In addition, the populationbased genetic association studies may be affected by the population stratification, and this needs future research to examine it.
Appendix
A. The Derivation of the Asymptotic Properties of
Consider the linear model for the combined raw data from Stage 1 and Stage 2 as follows: where and .
Denote
Based on the design matrix for the expanded full model above, we can get the ordinary least square estimator of by , and the residual sum of square is given by . Furthermore, we denote the residual sum of squares under the following constraints: , , , and , by , and , respectively. After some algebras, we can obtain
On the one hand, according to it follows that So, we can get that That is, . Furthermore, is also the unbiased estimator of the variance of the residual based on the independence and unbiasedness of and .
On the other hand, and are both independent of , so is also independent of , and , . Consequently, we have . Similarly, we can get and .
B. The Detailed Calculation of and and
Denote and . For a given , where and is the bivariate normal density function for with mean and variancecovariance matrix . Taking advantage of the symmetry of bivariate normal distribution, the above twofold integration can be only calculated at the right half space of and then multiplied by 2, which is
Based on the property of conditional distributions of the multivariate normal distribution, we have where is the probability density function of and is the density function of the conditional normal distribution . That is, Then, it follows that Thus,
Denote and . For any given , where and is the multivariate normal density function for with mean and variancecovariance matrix
Note that is the sum of nine integrations as follows: Moreover, we can obtain that , , , and based on the symmetry of the integration domain for and , respectively.
We have where is the conditional normal density function as and is the conditional normal density function given by with the submatrix of formed by first three rows and three columns.
Denote . Then, we have