Abstract

Simultaneous testing of multiple genetic variants for association is widely recognized as a valuable complementary approach to single-marker tests. As such, principal component regression (PCR) has been found to have competitive power. We focus on exploring a robust test for an unknown genetic mode of all SNPs, an unknown Hardy-Weinberg equilibrium (HWE) in a population, and a large number of all SNPs. First, we propose a new global test by means of the use of codominant codes for all markers and PCR. The new global test is built on an empirical Bayes-type score statistic for testing marginal associations with each single marker. The new global test gains power by robustly exploiting the Hardy-Weinberg equilibrium in the control population and effectively using linkage disequilibrium among test markers. The new global test reduces to PCR when the genotype for each marker is coded as the number of minor alleles. This connection lends insight into the power of the new global test relative to PCR and some other popular multimarker test methods. Second, we propose a robust test method based on the new global test and the ordinary PCR test built on a prospective score statistic for testing marginal associations with each single marker when the genotype for each marker is coded as the number of minor alleles by taking the minimum value of these two tests. Finally, through extensive simulation studies and analysis of the association between pancreatic cancer and some genes of interest, we show that the proposed robust test method has desirable power and can often identify association signals that may be missed by existing methods.

1. Introduction

Association analyses that test multiple genetic markers as a set rather than individually have been appreciated for their potential power. These statistical methods largely fall into three classes: those for summarizing values from the tests of each single marker [15], those that synthesize single-marker test statistics, such as Hotelling (standard Chi-squared) statistic [68] and the burden test [9, 10], and those based on a direct test of joint associations of multiple markers, such as variance component tests (VC) [1113], the sequence kernel association test (SKAT) [1418], and principal component regression (PCR) methods [1921]. The relative performance of these methods has been comprehensively compared in previous work [22]. When the number of single-nucleotide polymorphisms (SNPs) is small, these methods have similar power; however, when the number of SNPs is large, the effects of SNPs are not constant and may have different directions, the linkage disequilibrium (LD) among multiple markers is somewhat strong, and the SNPs adopt additive genetic code. Three methods, namely, VC, SKAT, and PCR, have been found to have competitive power in this case [22, 23]. A major reason is that all 3 methods can decrease the degrees of freedom of the test to some extent [12]. In this work, we focus on exploring a robust test for unknown genetic modes of SNPs of interest, unknown Hardy-Weinberg equilibrium (HWE) in a population, and a large number of SNPs of interest.

We first propose a novel multi-SNP test under the case-control study design, which we term the principal Chi-squared test. The principal Chi-squared test applies a two-degree-of-freedom score statistic based on the empirical Bayes method for each SNP and derives a global test based on the eigenvalue decomposition of the asymptotic variance-covariance matrix of each SNP test. The global test achieves improved power by robustly exploiting the HWE in the control population and effectively exploiting the LD among all SNPs. We denote the global test by PChiB (see Methods). In addition to competitive power, PChiB is conveniently implemented and is easily comprehensible by the nonstatistics community because of the well-known eigenvalue decomposition method. The global test is closely related to standard PCR in that it reduces to the score test of PCR when each SNP is coded as the number of minor alleles. This relation not only lends insight into its power relative to PCR but also into the connection between PCR and variance-component-based tests. We show that both classes of these methods are weighted combinations of uncorrelated Chi-squared random variables, each of which is a weighted combination of a single SNP test with weights equal to the loadings of the eigenvectors of their joint asymptotic variance-covariance matrix. This observation, while supporting documented conclusions that none of the two classes of methods is uniformly more powerful than the other [22], reveals theoretically that the LD structure among SNPs plays a critical role in the powers of these methods. When a real disease causal SNP adopts recessive and dominate codes, test PChiB can gain desirable power. When a real disease causal SNP adopts an additive code, test PChiB may have somewhat lower power. Thus, we propose a robust test by taking the minimum value of the new global test PChiB and the ordinary prospective score test of PCR in which each SNP is coded as the number of minor alleles, regardless of the actual genetic code of each SNP. We denote the robust test by Min2.

Suppose that diallelic SNPs in a genomic region of interest are genotyped for case samples and control samples. Let denote the binary case-control status (: case; : control) for sample (), where , the first samples are cases, and the remaining samples are controls. Denote as the count of the minor alleles of SNP from sample for , and . A new global test is designed to test the null hypothesis that the genomic region spanned by SNPs is not associated with the phenotype status of interest against the general alternative that one or more SNPs, which may or may not be genotyped, are associated with the phenotype status of interest. We fit an ordinary logistic regression model for the binary case-control status and all SNPs.

Incorporating HWE constraints into the control population based on the retrospective likelihood for testing a diallelic marker may lead to increased power under dominant and recessive genetic models compared to standard prospective likelihood-based tests [24]. To address the issue that deviation from HWE may lead to an inflated type I error rate in this test, an empirical Bayes score test, which is a data-adaptive linear combination of the prospective likelihood score test and retrospective likelihood score test under the HWE constraint, was proposed [25]. This test can maintain nominal type I error rates under deviations from HWE that are observed in real settings and largely maintains the power gain under the recessive genetic model. Here, our new global statistic uses this test principal as the building block. We expect that our method achieves considerably improved power when aggregating the small power gains at each SNP.

The rest of this paper is organized as follows. In Results, we demonstrate, through simulation studies and analysis of pancreatic cancer data [26, 27], that the proposed robust test can often have desirable power compared to some popular tests across a broad range of scenarios. In Discussion, we further discuss the merits and disadvantages of our proposed test method and note some directions for future research. In Methods, we present the new global test in detail and discuss its connections to PCR and other existing methods. We also briefly introduce the robust test by taking the minimum value of the new global test and the score test of PCR, where each single SNP is coded as the number of minor alleles, regardless of the actual genetic code of each SNP.

2. Results

2.1. A Robust Statistical Method Based on Two Types of Principal Chi-Squared Tests

For real genotype data, we can first calculate the prospective score test, denoted by , in which all SNPs are supposed to adopt additive codes. We denote a consistently estimated covariance for and calculate the ordinary principal components regression (PCR) score statistic, which is denoted by (selecting the top PCs explaining 85% of genetic variability) based on the estimated covariance , as in Gauderman et al. [19]. Second, we can obtain the value of PChiP, which is denoted by because PChiP follows a Chi-squared distribution asymptotically under the null hypothesis. Third, we calculate the empirical Bayes score denoted by and its consistently estimated covariance, which is denoted by , based on codominant codes (see Methods). Similarly, we calculate the new aforementioned principal Chi-squared statistic PChiB based on the estimated covariance . Note the dimension of is , and we can estimate the value of PChiB, which is denoted by , because PChiB also follows a Chi-squared distribution asymptotically under the null hypothesis. Finally, we take the minimum of the two values of PChiP and PChiB as a robust test, as follows:

We estimate the value of Min2 via statistical permutation. We conduct extensive simulations to investigate the power performance of Min2.

To view the performance of Min2 comprehensively, we can compare it to 4 other tests, namely, PChiB, PChiP, SSUP (see Methods) and GOLD, where GOLD is constructed as follows. Suppose the first SNP is the real causal SNP satisfying the logistic regression model , where and represent log odds ratios. Other SNPs are correlated with the first SNP with genotypes . The Gold method (denoted by GOLD) is an ordinary score test based on the above real statistical model. Clearly, in real data analysis scenarios, we do not know the causal SNP. GOLD only has a value in simulation studies and is not practical in real data analysis. We consider 3 scenarios for analysing genotype data. First, we apply PChiB, Min2, PChiP, SSUP, and GOLD to analyse genotype data, including all SNPs. Second, we apply PChiB, Min2, PChiP, SSUP, and GOLD to analyse genotype data, excluding the first SNP, which is the causal SNP. Third, we apply PChiB, Min2, PChiP, SSUP and GOLD to analyse genotype data, including only labelled SNPs. To comprehensively assess the performance of these 5 methods, we designate every SNP among all SNPs as the causal SNP in turn in the simulation procedure.

2.2. Simulation Procedure

We conduct extensive simulation studies to assess the relative power of Min2 by comparing its performance with that of 4 other test statistics, namely, PChiB, PChiP, SSUP, and GOLD. We consider real LD structures defined by haplotypes inferred from the International Hapmap Project CEU samples. We set the haplotype information for gene NAT2 studied by Kwee et al. [28] as the basis of our simulations. To generate multilocus genotype data based on real haplotypes, we estimated haplotypes and their frequencies in a genomic region via HaploView software [29]. The LD structures plot based on the complete set of SNPs for gene NAT2 is displayed in Supplementary Figure 1 (See Supplementary File). For gene NAT2, we select SNPs with and for a total of 18 SNPs. Haplotypes based on the complete set of SNPs and their frequencies are provided in Table 1. Five SNPs, rs13277605, rs1799930, rs1208, rs1961456, and rs2410556, are tag SNPs.

To obtain the control samples, we generated multilocus genotype data, as follows. Let denote the set of estimated haplotype frequencies with . Then, a pair of haplotypes for each control sample was generated under HWE, where the frequency of haplotype pairs takes the form as and as . The haplotype phase information was then deleted, and only locus-specific genotype data were retained. To generate multilocus genotype data for each case sample (total number ), we generated the pair of haplotypes using the following probabilities: where and are the odds ratios for genotypes “Aa” and “aa”, ‘A’ is the major allele for the disease causal SNP, ‘a’ is the minor allele of the disease-causal SNP, and indicator functions and refer to whether haplotype pair has allele combinations (A,a) and (a, a), respectively, at the causal SNP.

To evaluate the impact of deviation from HWE on the power of PChiB, we additionally generated multilocus genotype data from real haplotypes based on gene NAT2, as described above, but with the frequency of haplotype pairs equal to Here, is an indicator function, with if and if , and is the fixation parameter, which represents mild deviation from HWE, as observed in real gene association analysis studies.

We set and and consider two scenarios with HWE indicator and , as in Luo et al. [25]. Furthermore, we designate every SNP as the causal SNP in turn. When the causal SNP adopts an additive code, we obtain the genotype and case-control status based on the logistic model with causal SNP odds ratio 1 for estimating the empirical type I error rates and with causal SNP odds ratio 1.2 for estimating the empirical power. When the causal marker adopts a dominant code, we obtain the genotype and case-control status based on the logistic model with causal SNP odds ratio 1.3 for estimating the empirical power. When the causal marker adopts a recessive code, we obtain the genotype and case-control status based on the logistic model with causal SNP odds ratio 1.5 for estimating the empirical power. With the genotype and case-control status information, we calculate the value of Min2 via 200 permutations. The empirical type I error rates and powers of the 4 tests were considered under a significance level of 0.05 by means of 500 repetitions, as Kwee et al. [28] examined the type I error and power of the semiparametric and single-tag SNP approaches assuming a nominal significance level of 0.05.

2.3. Numerical Results

To comprehensively assess the performance of Min2, we construct test statistics under 3 scenarios, namely, using all SNPs, using all SNPs except the causal SNP, and using only tag SNPs.

Because the empirical type I error rates are nearly the same when the real causal SNP adopts an additive code, dominant code, and recessive code, we present the empirical type I error rates for only the case where the real causal SNP adopts an additive code. The results based on all 18 SNPs with are displayed in Figure 1, and the results based on all 18 SNPs with are displayed in Figure 2. Other results based on all 17 SNPs (excluding the causal SNP) and 5 tag SNPs are displayed in Supplementary Figure 2a, Figure 2b, Figure 3a, and Figure 3b (See Supplementary File). From Figures 1 and 2, we can see that Min2 can control the type I error rate well when the HWE indicator coefficient equals 0 or 0.5log(2.0), but PChiB has a conservative empirical type I error rate when equals 0. We further investigate this phenomenon: when the real genetic model adopts additive code, PChiB adopts a codominant code with equal to 0, so the correlations between every two SNPs are decreased and test PChiB may absorb a large number of degrees of freedom. For example, when considering the scenario with all 18 SNPs and designating the 1st SNP as the causal SNP, PChiP absorbs 2 degrees of freedom and PChiB absorbs 5 degrees of freedom, according to the simulation data. When the real genetic model adopts recessive and dominant codes, all 5 tests control the type I error rate well, regardless of whether is 0 or 0.5log(2.0).

For the empirical power comparison, when the real causal SNP adopts a recessive code, we display the results based on all 18 SNPs in Tables 2 and 3 for and . Other results based on 17 SNPs (excluding the causal SNP) and 5 tag SNPs are displayed in Supplementary Figure 4a, 4a, Figure 5a, and Figure 5b (See Supplementary File). From Table 2, Supplementary Figure 4a and Supplementary Figure 4b for , we can see that the GOLD test always performs best because it is an oracle test, and Min2 performs nearly as good as PChiB in all 3 scenarios. Additionally, Min2 always performs better than PChiP and SSUP, regardless of which of the 18 SNPs is the causal SNP. For example, in Table 2, the empirical powers of PChiP, SSUP, GOLD, Min2, and PChiB are 0.364, 0.352, 0.826, 0.504, and 0.492, respectively, when the 2nd SNP is the causal SNP. From Table 3, Supplementary Figure 5a, and Supplementary Figure 5b for , we can see that Min2, when using all 18 SNPs, using all 18 SNPs except for the causal SNP, and using only tag SNPs, always performs much better than PChiP and SSUP, regardless of which of the 18 SNPs is the causal SNP. For example, in Table 3, the empirical powers of PChiP, SSUP, GOLD, and Min2 are 0.755, 0.795, 0.970, 0.840, and 0.875, respectively, when the 1st SNP is the causal SNP.

When the real causal SNP adopts a dominant code, we display all the results based on all 18 SNPs in Tables 4 and 5 for and . Other results based on 17 SNPs (excluding the causal SNP) and 5 tag SNPs are displayed in Supplementary Figure 6a, Figure 6b, Figure 7a, and Figure 7b (See Supplementary File). From these figures, we can see that Min2 performs robustly among all 5 tests over all 3 scenarios with and 0.5log(2). For example, in Table 4, the empirical powers of PChiP, SSUP, GOLD, Min2, and PChiB are 0.598, 0.588, 0.846, 0.636, and 0.556, respectively, when the 9th SNP is the causal SNP, and the empirical powers of PChiP, SSUP, GOLD, Min2, and PChiB are 0.638, 0.382, 0.826, 0.628, and 0.496, respectively, when the 10th SNP is the causal SNP. In Table 5 for , the empirical powers of PChiP, SSUP, GOLD, Min2, and PChiB are 0.585, 0.310, 0.786, 0.545, and 0.455, respectively, when the 11th SNP is the causal SNP.

When the real causal SNP adopts an additive code, we display all results based on all 18 SNPs in Tables 6 and 7 for and . Other results based on 17 SNPs (excluding the causal SNP) and 5 tag SNPs are displayed in Supplementary Figure 8a, Figure 8b, Figure 9a, and Figure 9b (See Supplementary File). From these figures, we can see that Min2 performs robustly among all 5 tests over all 3 scenarios for and 0.5log(2). Under these 3 scenarios, the real genetic codes are additive, so it is not unexpected that the performance of PChiP is always a little better than that of Min2, regardless of which of the 18 SNPs is the causal SNP. Although SSUP sometimes has slightly better power than PChiP and Min2, it can sometimes have very low power. For example, in Table 6, the empirical powers of PChiP, SSUP, GOLD, Min2, and PChiB are 0.626, 0.402, 0.770, 0.616, and 0.400 when the 11th SNP is the causal SNP. In Table 7, for , the empirical powers of PChiP, SSUP, GOLD, Min2, and PChiB are 0.660, 0.670, 0.800, 0.645, and 0.570, respectively, when the 9th SNP is the causal SNP.

2.4. The Analysis of High-Density Lipoprotein Cholesterol (HDL-C) Data from GWAS Pancreatic Cancer Data

Herein, we present an analysis of HDL-C data from GWAS pancreatic cancer data [26, 27] to illustrate our method. Plasma levels of high-density lipoprotein cholesterol are known to be heritable, but only a fraction of the heritability is explained. We developed a high-density genotyping array populated with HDL-C candidate loci selected based on the known biology of HDL metabolism, mouse genetic studies, human genetic association studies, and available GWAS data. SNP selection was based on tag SNPs but also included low-frequency nonsynonymous SNPs. We performed association analysis on the majority of reported GWAS loci (including ABCA1, CETP, GALNT2, LCAT, LIPG, LIPC, and LPL).

The data set consists of 1231 samples (case: 625 and control: 606) with 64 SNPs from the above 13 genes. Basic information about the 13 genes is presented in Supplementary Table 1 (additional file 2). We calculate the values of 4 test methods, i.e., PChiP, SSUP, Min2, and PChiB, when analysing the data set. The numerical results are displayed in Table 8. From Table 8, we can see that the numerical results of Min2 are consistent with those of the other tests. For example, when investigating the association between HDL-C and gene GALNT2, including 2 SNPs, the values of PChiP, SSUP, Min2, and PChiB are 0.1065, 0.1065, 0.0370, and 0.0272, respectively. For another example, when investigating the association between HDL-C and gene LPL, including 15 SNPs, the values of PChiP, SSUP, Min2, and PChiB are 0.002, 0.00016, 0.002, and 0.0044, respectively. For the third example, when investigating the association between HDL-C and gene LIPG, including 2 SNPs, the values of PChiP, SSUP, Min2, and PChiB are 0.0012, 0.0012, 0.0001, and 0.0002.

Because the number of SNPs in each gene is not very large in the real data, the real data do not provide a good example to illustrate the merit our test. However, this limitation does not affect our purpose of deriving a robust test. Our method focuses on the robustness in the following 3 scenarios: the genetic code for all SNPs is unknown, whether the HWE is satisfied in the original population is unknown, and a large number of SNPs exists.

3. Discussion

One key factor of the improved power of kernel-machine-based tests [17] and PCR is the reduced degrees of freedom. Kernel-machine-based tests make full use of possible correlations among score statistics, which is known to be advantageous for high-dimensional data [30], and are robust to the directions of association of different SNPs. Principal component analysis is a standard method of reducing the dimensionality of a large number of variables. Despite this seemingly obvious argument, the relative merits of PCR and kernel-machine-based tests remain understudied. We provide insights into the theoretical connection between kernel-machine-based tests and the PCR method. We find that when the LD extent of each pair of SNPs is somewhat strong, principal component analysis methods may have higher power than kernel-machine-based tests. PCR often has similar or higher power than kernel-machine-based tests, where the LD pattern is an important parameter for power. We will further explore the principle of selecting the number of PCs in future work.

In this work, we consider an association test between human complex diseases and genetic SNPs based on principal component analysis (PCA) since PCA is widely used in the recent literature. PCA accounts for linear combinations among SNPs. If this linearity exists, PCA is optimal. However, when how the multiple genetic SNPs influence the risk of disease is unknown, one alternative strategy is to use haplotype analysis since haplotypes can capture the LD information between markers [3137].

We propose a novel global test (PChiB) based on the empirical Bayes score test, which is a data-adaptive linear combination of the prospective likelihood score and the retrospective likelihood score under the HWE constraint in the control population. PChiB can maintain desirable power when the real causal SNP adopts recessive and dominant codes under the HWE constraint in the control population. A small disadvantage of PChiB is that when the genetic code of the real causal SNP is additive, PChiB does not have desirable power because of the large degrees of freedom. Thus, we propose a robust test (Min2) that maintains the power gain under deviations from HWE observed in real settings, regardless of which genetic code the real causal SNP adopts. Min2 gains power by effectively using the LD among all the tested SNPs over all scenarios. Because PChiP is based on the assumption that all SNPs adopt an additive code, while PChiB and Min2 are based on the assumption that all SNPs adopt a codominant code, PChiP has low degrees of freedom and performs best when the causal SNP adopts an additive code. PChiB and Min2 may have less power than PchiP in this scenario. When the causal SNP adopts dominant or recessive codes, Min2 has desirable power, regardless of whether HWE is satisfied in the control population. We propose to use our new test Min2 for the association analysis of multilocus genotypes and complex diseases.

We propose the robust test Min2, where the values are obtained via permutation and compared it with PChiB (empirical score based on all SNPs adopting codominant codes), PChiP (prospective score based on all SNPs adopting additive codes), and SSUP (a VC method based on the prospective score and all SNPs adopting an additive code). The main purpose of this article is to introduce the proposed test Min2, not to compare it with other existing tests for GWAS.

Notably, it would be a good idea to extend the proposed tests to include covariate adjustments in the logistic models. The derivation will be very complex and requires additional research. We will consider this problem in our future work. In simulations, we need to set a large sample size as the number of MAF is low, so we have not considered rare variants. We may investigate the robustness about PChiB when the number of MAF is low in our further work.

4. Methods

4.1. A New Principal Chi-Squared Test

Suppose there are case samples and control samples and denote . For the th () sample and th () SNP, denote as the additive code, namely, the numbers of minor alleles taking values 0, 1, and 2. For the th () sample and th () SNP, denote as the codominant code, namely, , where is an indicator function. Clearly, , , and .

For , denote as the estimated minor allele frequency (MAF) for the th SNP in the pooled case-control sample and denote as the number of minor allele in a genotype for the th SNP in a population with values 0, 1, and 2. For , denote as the estimated genotype frequency for the th SNP. We can then obtain , , , and . For , denote a 2-dimensional row vector by , where is the expected value of under HWE, and is the pooled sample mean of , namely, . For , denote as the pooled sample variance of , namely, the variance of and denote as the pooled sample variance of , namely, the variance of . For , denote a diagonal matrix with elements equal to and . Clearly, is extended from the weight proposed by Luo et al. [25] and Chatterjee et al. [38] when an additive (dominant or recessive) code is adopted. The weight matrix is data adaptive. When codominant coding is adopted, by means of , we propose the empirical Bayes score for the th SNP with the following form: where is an identity matrix with dimension 2.

Let denote the vector of empirical Bayes scores for all SNPs, namely, , which is of length . Denote the estimated asymptotic covariance matrix by (See Supplementary File) for empirical Bayes score vector . A common test for whether all markers can be jointly built, similar to the Hotelling statistic, is , where ‘’ indicates the transpose of a vector or matrix. Our proposed new global statistic is based on the eigenvalue decomposition of covariance matrix , as follows. For , denote and (a column vector) as the eigenvalue and corresponding eigenvector of covariance matrix . Let and denote the eigenvalues and corresponding eigenvectors of covariance matrix . We then have , and can be written as . Since the norm of the eigenvector is unity and can be written as , the test statistic can be written as

Note that is a linear combination of the score for each individual SNP with for We propose to utilize the first () summands in to test the null hypothesis and denote the resultant test statistic as follows:

Due to the orthogonality of , are independent. Because are all asymptotically normally distributed with mean 0 and variance 1 under the null hypothesis that the genomic region spanned by the SNPs is not associated with the phenotype status of interest, PChiB is asymptotically distributed as a Chi-squared variable with degrees of freedom under the null hypothesis.

A remaining issue is how to select the number of summands . Note that PChiB is based on eigenvalue decomposition, similar to the standard PCR. Many criteria for selecting have been introduced in the literature [39]. It has been shown that using the top principal components that explain 8090% of the genetic variability is sufficient [19, 20, 23]. We select according to the same principal, i.e., that the top principal components can explain approximately 85% of the genetic variability. This strategy is supported by the connection between PChiB and PCR (see the next subsection). In fact, the number of principal components affects the power of the principal component test [40]. When the LD extent of each pair of SNPs is very strong, the top one principal component alone has desirable power. When the LD extent of each pair of SNPs is somewhat strong, using the top principal components that explain 8090% of the genetic variability is a robust method.

4.2. Understanding PChiB through an Exposition of PCR

We revisit PChiB based on only the standard prospective likelihood score under additive coding for th () and establish its equivalence to PCR [19, 20]. This equivalence sheds light on the promise of increased power of PChiB since PCR has been established to be a promising method for multi-SNP association analysis. In PCR, the phenotype variable is regressed on only a few of the top principal components (PCs) that summarize approximately 80-90% of the genetic variability. The PCs represent the directions in which most of the variability in the data occurs, as identified by the eigenvalue decomposition of the variance-covariance matrix of the centred raw genotype scores. Each principal component is a linear combination of genotype scores for all SNPs, and all principal components are uncorrelated with each other.

Here, we present the standard prospective likelihood score under additive genetic coding. The collection of all prospective score functions, denoted by , is asymptotically distributed as multivariate normal with mean and variance-covariance matrix under the null hypothesis. Let , . For , let and . For , let . Denote as a genotype matrix with th row and th column element for ,and . Let be a column vector with all elements and length . In matrix form, , and its covariance matrix . Now, let be a matrix whose th column is the characteristic vector of the matrix , and let be its eigenvalues. Denote orthogonal transformation . The likelihood score based on a logistic regression of on is . The covariance matrix of is a diagonal matrix with elements

Suppose that we consider the first () PCs as follows. Let be a matrix containing the first eigenvectors, and let . The standard PCA test based on the score statistic for testing the association between and from the logistic regression model is exactly equal to , which is denoted by , and is the same as our proposed method when the adopted genetic code is additive code. Denote , When adopting additive code, the standard Hotelling statistic is equal to , and the PChiP statistic reduces to .

The proposed statistic (in this situation, equivalent to PCR) can be shown to be closely related to a statistic called the sum of squared score test based on prospective likelihood [12], which is denoted by SSUP. SSUP is obtained as SSUP, and it can be expressed as SSUP. Therefore, SUP and PChiB use different weights for the contributions of the PCs: SSUP weights all PCs by the eigenvalues, whereas PChiB assigns equal weights to the top PCs. SSUP allows PCs with small eigenvalues to make additional contributions to the test, but PChiB discards PCs with small eigenvalues to reduce the degrees of freedom. This difference has implications on their relative power, which depends critically on the structure of variance-covariance matrix and, therefore, the LD structure of the assessed genomic region.

Data Availability

Data available on request.

Conflicts of Interest

The authors declare no competing interests.

Authors’ Contributions

Jiayan Zhu and Yi Tian designed the methods, wrote the main manuscript text, and conducted some of the simulations. Ma Li and Xiaohong Cai conducted some of the simulations. Jiayan Zhu contributed to the interpretation of all results. All authors reviewed the manuscript.

Acknowledgments

We thank Prof. Jinbo Chen from the Department of Biostatistics of the University of Pennsylvania and Tonja Nansel and Prof. Kai Yu from the National Institutes of Health in the USA for providing the genetic data to demonstrate the methods and for providing some meaningful comments to improve the manuscript. We thank Prof. Qizhai Li from the China Science Academy for the discussion of the manuscript. Research of Yi Tian is partially supported by the self-determined research funds of Central China Normal University(CCNU) from the colleges basic research of MOE (No. CCNU19TD009) and National Nature Science Foundation of China (No. 61877023). Research of Jiayan Zhu is partially supported by seeding project funding (No. 2019ZZX026), scientific research project funding of talent recruitment, and start up funding for scientific research of Hubei University of Chinese Medicine.

Supplementary Materials

Derivation of the asymptotic variance estimate for EB score test UB. 2. Supplementary tables and figures for Results. (Supplementary Materials)