#### Abstract

In genome-wide association study (GWAS), robust genetic association tests such as maximum of three CATTs (MAX3), each corresponding to recessive, additive, and dominant genetic models, the minimum *p* value of Pearson’s Chi-square test with 2 degrees of freedom, and CATT based on additive genetic model (MIN2), genetic model selection (GMS), and genetic model exclusion (GME) methods have been shown to provide better power performance under wide range of underlying genetic models. In this paper, we demonstrate how these robust tests can be applied to the replication study of GWAS and how the overall statistical significance can be evaluated using the combined test formed by *p* values of the discovery and replication studies.

#### 1. Introduction

With the advance of biotechnology and substantial reduction of genotyping costs, a genome-wide association study (GWAS) using hundred thousand markers in several thousand individuals is now increasingly utilized and has been successful in detecting genetic associations across the entire genome with complex human traits [1–6]. Among many challenges this application holds; development of more efficient and robust statistical methodologies with higher power to detect an association with a single marker has been one of the most important statistical issues, given that effects of individual markers are usually characterized as being small to moderate. One attempt to overcome this challenge is focused on developing efficient tests that are robust against underlying genetic model misspecification.

Two most frequently used association tests are the allele-based test (ABT) and the genotype-based test (GBT). ABT compares the allele frequencies between cases and controls, while GBT compares the genotype distributions of cases and controls. The Cochran-Armitage trend test (CATT) [7, 8] is a popular GBT which takes into account the underlying genetic model. It is well known, however, that the ABT may inflate type I error when Hardy-Weinberg equilibrium (HWE) does not hold in the samples [9]. Even under HWE, when the genetic model is recessive or dominant, the ABT may suffer from serious power loss. On the other hand, the CATT does not depend on HWE, but to apply the CATT the choice of scores optimal for the underlying genetic model needs to be specified. For complex diseases, the genetic model is usually unknown and robust tests such as the maximum of three CATTs (MAX3) [10] and the maximum efficiency robust test (MERT) [11, 12] are preferable. Alternatively, Zheng and Ng [13] and Joo et al. [14] proposed a two-phase analysis based on the genetic model selection (GMS) and genetic model exclusion (GME). Moreover, an alternative approach was proposed by the Wellcome trust case-control consortium (WTCCC) [5] which used a minimum value of Pearson’s Chi-square test and additive CATT, and the asymptotic properties of this approach were studied in detail by Joo et al. [15]. These methods provide better or comparable power performance than some of the robust tests such as MAX3.

In this paper, we illustrate how these robust tests can be applied to a replication study of GWAS and how overall statistical significance can be evaluated using the combined test formed by values of the discovery and replication studies. The importance of replication or validation in GWAS has been well recognized [16, 17], and joint analysis in a two-stage design of GWAS has been proved to be more powerful than replication-based analysis and has been widely conducted in GWAS with a variety of phenotypes of interest [18, 19].

The paper is organized as follows. We first describe the data structures and notation and review existing robust association tests for a single data set. Then we describe how to obtain the value for the replication data set, given the significant result of the discovery stage, using robust tests. In the next section, a combined test of the values of the discovery and replication data sets is proposed, together with the way to evaluate the statistical significance for the combined test. Simulation studies are conducted to compare the type I error rates and powers of various analytical strategies. For illustration purposes, the summarized methods are applied to a non-small-cell lung cancer data set and at the end there is a discussion.

#### 2. Methods

##### 2.1. Data and Notation

For a marker with two alleles and , let the frequencies of in cases and controls be and . Denote three genotypes by , , and . In case-control association studies, cases and controls are independently sampled from each population. The observed genotype counts for are in the cases and in the controls. Disease prevalence is denoted by and penetrance by for . Two genotype relative risks (GRRs) are denoted by and using as baseline penetrance. Under the null hypothesis of no association or alternatively . Genetic model is recessive (REC), additive (ADD), multiplicative (MUL), and dominant (DOM) when , , , and , respectively.

##### 2.2. Review of Association Tests for a Single Data Set

The association in case-control studies can be tested using various methods which have been extensively studied. The general association between the disease status and the SNP can be tested using Pearson’s Chi-square test which has an asymptotic Chi-square distribution with 2 degrees of freedom under . The test is given by where for and . Under Hardy-Weinberg equilibrium (HWE), an allele-based test (ABT) and CATT with scores for , where , are given by where [9]. The optimal choices of for the recessive (REC), additive/multiplicative (ADD/MUL), and dominant (DOM) models are and 1, respectively [9, 20]. Both and asymptotically follow a standard normal distribution under . can be used even when HWE does not hold. However, without the HWE assumption, does not follow a standard normal distribution due to the correlation between two alleles.

A robust test, MAX3 proposed by Friedlin et al. [10], can be obtained by taking the maximum of three CATTs under the three genetic models as . Parametric bootstrap or permutation methods can be used to find the value of [4].

Let the values of Pearson’s Chi-square test and CATT under the additive genetic model be and , respectively. WTCCC [5] proposed an alternative robust test . Joo et al. [15] derived the asymptotic null distribution of and using their result the value of can be obtained as where and are the cumulative distributions of Chi-square distributions with 1 and 2 degrees of freedom.

On the other hand, Song and Elston [21] considered a Hardy-Weinberg disequilibrium trend test (HWDTT) given by where and are the estimates of and , where and . Here, denotes the Hardy-Weinberg disequilibrium (HWD) coefficient defined by and and denote the HWD coefficient in cases and controls, respectively.

Zheng and Ng [13] used the information contained in the signs of to determine the genetic models in their two-phase method. Their two-phase statistic is given by if , if , and otherwise, where for . The asymptotic correlations between and three CATTs under HWE were derived and the significance level was adjusted accordingly to control the desired type I error. Based on the observation that this method assumes is the risk allele, Joo et al. [14] studied the behavior of when either one of the alleles can be a risk allele. They chose the risk allele based on the sign of ; that is, if , is the risk allele, and , , and are chosen for REC, ADD, and DOM models, respectively. If , the respective test statistics are chosen to be , , and . They incorporate this property in defining the test statistic for genetic model selection () and calculating the value. Let , , and . Then, the value of this method can be obtained bywhere in (5) and () are replaced by their consistent estimates. Here, and . Moreover, and are the observed values of and , respectively.

While studying the properties of GMS, Joo et al. [14] noticed that the probability of selecting the true recessive or dominant models using is very low especially for low to moderate GRRs, but the unlikely genetic model can be successfully excluded. This led to genetic model exclusion method which is the same as the described above except for is replaced by where . And the value of GME can be obtained aswhere for .

##### 2.3. Value of Replication Data Using the Robust Method

In the discovery stage, the value of robust association tests, including , , , and , can be obtained as described in Section 2.2. For the value of replication data using the robust method, we use the same analytic method that was used for discovery and the risk allele identified by it [16]. This means that when the best test statistic or genetic model is selected in the discovery stage, the replication stage will adopt the discovery stage selection and the direction of association.

Suppose that, for simplicity of notation, our interest is in GWAS with two stages, one for discovery and the other for replication, although the methodology described below can be extended to multistages for replication. Let for be the CATT optimal for recessive, additive, and dominant models and let be corresponding value for th stage ( for discovery and for replication stages). Also, denote for . Then, for CATT with a preselected genetic model, using a one-sided value given the direction of association from the discovery stage, and . Moreover, denote the test statistics and values using Pearson’s Chi-square test from the th stage as and . Further, let HWDTT from the th stage be . Then, the second stage values, using , , , and , denoted as , , , and , can be obtained as follows:

It is important to note that even though the direction of the test statistics and the selected genetic models are used to obtain the second stage values, the values from the two stages are independent under the null hypothesis. This is because, under the null hypothesis, the probability of being positive or negative is simply , and the probability of the selection of a certain genetic model is also a constant ( for the recessive and dominant models and for the additive model).

##### 2.4. Combined Test Using Values and Its Statistical Significance

For a given robust test, we can consider the joint analysis by combining values from the discovery and replication stages of GWAS. We consider using values rather than the test statistics because test statistics can have complex forms and obtaining the distribution of the joint test can be difficult. On the other hand, calculating a value for each data set might be relatively simple, and the distribution of values under the null hypothesis of no association is easy to handle.

There are several methods for combining test statistics from two stages [22], and two most commonly used forms are based on Fisher’s combination and a linear combination after inverse normal transformation [23]. Fisher’s combination (FC) directly sums values after transformation; that is, , where is value from for discovery and for replication stages using a given robust test. A specification of gives the same weight for discovery and replication stages, and one can consider and where , and and are sample sizes of the discovery and replication data sets. A linear combination of two values after taking the inverse of the standard normal cumulative distribution is given by with a natural choice of and . Let the significance level of the discovery stage be , which means that markers with are selected and replicated in the replication stage. The value of combined test can then be obtained as where the observed value of is . The are calculated as for equal weights where and for unequal weights where . Detailed derivations are described in the Appendix. Equivalently, for an overall type I error threshold for a single marker of , one may obtain the threshold of that satisfies . Similarly, for the , the value is calculated as = for where the observed value of .

#### 3. Simulation Results

##### 3.1. Type I Error

Table 1 provides the type I errors under different scenarios. A disease prevalence of 10% is assumed, and a total of 1500 cases and 1500 controls were divided into two stages. The proportions of samples in the first stage () of 0.5 and 0.6 were considered for the minor allele frequency (MAF) of 0.3 and 0.4. We considered markers to control the genome-wide false positive rate at with the Bonferroni correction. We did not consider a larger such as 300,000 or 500,000 because this will require more than 10 million simulations to show a stable estimate of the type I error rate. With , we performed 20,000 simulations which result in less than 10% of a coefficient of variation for a significance level for each marker [24]. The test statistics considered are , Pearson’s Chi-square test, MIN2, MAX3, GMS, and GME. For the second stage analysis, we considered a replication-based analysis, , and as proposed above. The results are based on the situation under HWE (HWE coefficient ). As expected, all tests control the type I error reasonablly well, and similar results were obtained when a slight deviation from HWE is present with (results not shown).

##### 3.2. Empirical Power

We examined the empirical powers of different tests considered above. In Figure 1, we considered markers, a disease prevalence of 10%, the same genotype relative risk for two stages ( and ), and 1,000 cases and 1,000 controls. 2,000 simulations were performed under HWE () to control the genome-wide false positive rate at . The recessive, additive, and dominant models were assumed for the first, second, and third rows. Both joint analyses showed better power performances compared to the replication-based analysis (up to 15.9% in scenarios considered in Figure 1), and LC and FC have comparable powers with less than 2% difference. The power gain of using the joint analysis is not as much as that observed in Skol et al. [18]. However, as reported by Skol et al. [18], when the between-stage heterogeneity exists and the risk allele has a larger effect in the first stage than that in the second stage, much improved power is observed by using the joint test. Figure 2 shows results under this scenario with and , and the observed increase in power using the joint test is as high as 33.9%. Again, the difference between LC and FC is minor with less than 3% difference. As for comparison between different robust methods, MAX3, GMS, and GME perform well under the recessive model, while , , and MIN2 are less powerful. Under the additive model, is most powerful, as expected, and is least powerful. Other robust methods perform well with a slight decrease in power compared to . Under the dominant model, MAX3, GMS, and GME perform the best even though all tests show good power performances, and the difference is minor. Similar patterns were observed when a slight deviation from the HWE is present (results not shown).

#### 4. Real Data Application

The GWAS on non-small-cell lung cancer (NSCLC) by Yoon et al. [25] studied 621 NSCLC patients and 1541 control subjects in the discovery stage. After stringent quality control steps, a total of 246,758 SNPs were tested for the association with NSCLC based on . In the replication stage, 168 SNPs with value less than in the first stage based on were tested using 804 patients and 1470 control samples. We identified additional 234 SNPs using MIN2 in the first stage which could be studied in the replication stage if MIN2 was used instead of since MIN2 produces stronger evidence for the additional SNPs than does. The Manhattan plots of using MIN2 and are presented in Figure 3. One example is located in chromosome 2, which had a value of which reached significance level at Bonferroni correction in discovery samples alone, whereas yielded a value greater than . Even though there is possibility of false positive findings, these SNPs could have been selected for replication if robust methods were used.

**(a)**

**(b)**

Since we do not have replication data for these additional SNPs selected using MIN2 because the first stage selection was based on in Yoon et al. [25], just for illustration purpose of the proposed methods, we present the results of three SNPs including that was reported by Yoon et al. [25]. When the significance level in the discovery stage is set at so that all these exemplary SNPs can be selected in the discovery stage; the value of combined test based on four robust methods (, MIN2, GMS, and GME) as well as and Pearson’s Chi-square test is presented in Table 2. Fisher’s combination was used for the joint test in the second stage. Only was found to be significant with Bonferroni correction ( value ) by all except method.

#### 5. Discussion

In genetic association studies, efficiency robust tests whose performance does not depend on the underlying genetic model have been extensively studied, and their power benefit over a wide range of genetic models has been well recognized. In this paper, we described how the idea of these robust association tests can be applied to the replication studies and further how overall statistical significance can be evaluated using the combined test formed by values of the discovery and replication studies.

When the robust tests are used, the test statistic of each stage can have a complex form and thus dealing with the distribution of the joint test can be difficult, whereas calculating the value of each stage might be relatively simple. Because the asymptotic distribution of the value under the null hypothesis of no association is easy to handle, the combined test using values rather than the test statistics themselves can provide computational convenience.

There are several methods for combining test statistics from two stages and Won et al. [22] compared the performances of various choices. Two most commonly used forms are based on Fisher’s combination and the linear combination after the inverse normal transformation [23], and we presented the test statistics and values of these two methods. In our limited experience, the linear combination and Fisher’s combination are fairly comparable. Fisher’s combination seems to perform slightly better than the linear combination when there exists some heterogeneity between stages in terms of the genotype relative risk, while the linear combination seems to perform slightly better in most of other situations. However, the difference is extremely minor. Further research is required for the thorough comparison of various methods of combining values in the application of efficiency robust tests to the replication of genetic association studies.

In a genetic study where the purpose of considering a replication stage is to validate or replicate the genetic findings from the discovery stage, which is the case considered in this paper, the analysis in the replication stage utilized the test statistic or genetic model that is selected as being the best in the discovery stage and also the direction of the risk allele, following guidelines for exact replication in genetic association studies. If the purpose is to simply combine the evidence from different data sources such as in meta-analysis, other strategies may be devised. Further research, again, is required to provide fully detailed properties of such methods.

Power gain of a joint analysis over the conventional replication-based analysis was thoroughly studied by Skol et al. [18, 19]. In our simulation, the amount of power increase using a joint test compared to the replication-based analysis was much minor than what was observed by Skol et al. [18, 19]. The exact reason is not known, but we suspect this might be due to the power advantages of robust methods and also due to the fact that the optimal choice from the first stage is used when calculating the second stage values. However, even though it was minor in some situations, the joint anlysis presented better power performance than the replication-based analysis in our study. This type of joint analysis raised concerns about the exact meaning of replication [17]. However, McCarthy et al. [26] mentioned that joint analyses “blur the boundaries of where exactly replication starts, but whichever analytical approach is taken, confirmation in many independent samples is important and it is the overall strength of the evidence of association that matters.” Purpose of the current study was to present how the overall strength of the evidence of association can be evaluated when robust tests are used in GWAS replication studies.

We illustrated how the proposed methods can be applied in the real data that studied the association of SNPs with non-small-cell lung cancer (NSCLC) in discovery and replication stages. In the original study reported by Yoon et al. [25], SNPs were selected in the discovery data set not based on the robust tests but based on additive CATT. Therefore, we found that some SNPs could have been selected by one of the robust methods but they were not included in the replication data set. For these SNPs, we were not able to perform the joint analysis that we propose, and it was not possible to examine whether there are other SNPs that could have been found to be associated with NSCLC by proposed methods in the replication study. For this reason, we merely presented how many additional SNPs could have been further followed in the replication stage when robust methods were used. In many GWASs, it is a common practice to report the summary test statistics and values of the SNPs under a specific genetic model, usually an additive model, which were further genotyped in the replication stage and were finally defined to be significantly associated with a phenotype of interest. As emphasized in this paper, one may have a better chance of finding many missing SNPs by applying more powerful and robust methods that consider different genetic models simultaneously. Therefore, we urge the community to share test results under not only an additive model but also other genetic models, although they were not significant at a stringent significance level, so that future research may have enriched data resources, to which robust tests can be applied in association studies.

#### Appendix

#### value of Fisher’s Combination for Equal and Unequal Weights

*Equal Weights*. Assume . Under the null hypothesis of no association, and are independent and each asymptotically follows a distribution with 2 degrees of freedom Let and be the probability and cumulative density functions of random variable with degrees of freedom Then , , , and . Denote the cutoff of the discovery stage based on as ; that is, . For observed value of , the value is written as

*Unequal Weights.* When different proportions of samples are used in the discovery and replication stages, it may be more appropriate to assign weights proportional to the sample sizes for each stage. For example, when only a small portion is used in the discovery stage, to prevent Fisher’s combination test from being dominated by the significant result in the discovery stage, one may want to assign a small weight to the discovery stage result.

When is the proportion of samples used in the discovery stage, one selection for weights is and for discovery and replication stages, which simplifies to equal weights when . Based on these weights, we consider unequal-weighted Fisher’s combination as [27]. Its density function is given by and the probability distribution function is

Using the previous notation, we have the following form of value:where .

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

The authors are indebted to late Dr. Gang Zheng for his inspiration and support on their work. This work was supported by grant of the National Cancer Center (no. NCC-1210060).