Statistical Analysis of High-Dimensional Genetic Data in Complex TraitsView this Special Issue
Clique-Based Clustering of Correlated SNPs in a Gene Can Improve Performance of Gene-Based Multi-Bin Linear Combination Test
Gene-based analysis of multiple single nucleotide polymorphisms (SNPs) in a gene region is an alternative to single SNP analysis. The multi-bin linear combination test (MLC) proposed in previous studies utilizes the correlation among SNPs within a gene to construct a gene-based global test. SNPs are partitioned into clusters of highly correlated SNPs, and the MLC test statistic quadratically combines linear combination statistics constructed for each cluster. The test has degrees of freedom equal to the number of clusters and can be more powerful than a fully quadratic or fully linear test statistic. In this study, we develop a new SNP clustering algorithm designed to find cliques, which are complete subnetworks of SNPs with all pairwise correlations above a threshold. We evaluate the performance of the MLC test using the clique-based CLQ algorithm versus using the tag-SNP-based LDSelect algorithm. In our numerical power calculations we observed that the two clustering algorithms produce identical clusters about 40~60% of the time, yielding similar power on average. However, because the CLQ algorithm tends to produce smaller clusters with stronger positive correlation, the MLC test is less likely to be affected by the occurrence of opposing signs in the individual SNP effect coefficients.
Current genetic association studies aim to identify genetic variants responsible for a disease by investigating associations between single nucleotide polymorphisms (SNPs) and a trait of interest. In a single-SNP approach, each SNP is analyzed individually for the marginal association with the trait. In a multi-SNP approach, a group of SNPs is analyzed together for polygenic model analysis or gene-based analysis to obtain a global statistic for the combined effect of a set of SNPs [1–8]. When the gene is the unit of interest in the association analysis, gene-based analyses can be performed with multimarker methods using multi-SNP genotypes or haplotypes [6, 8–10]. These methods have the potential benefits of reducing genome-wide type I error burden and boosting the power of the study .
Most popular multimarker methods have been developed for the analysis of genotypes. In some methods, the marginal effects of individual SNPs are combined to form a global statistic [5, 11]. In others, SNP genotypes are analyzed in a multiple regression model and global statistics are constructed to represent the joint effects of multiple SNPs in a gene [3, 8, 12, 13]. Some multimarker tests such as C-alpha , SKAT , and CMC tests  specifically target rare variants with minor allele frequency (MAF) less than 1%. On the other hand, multimarker tests such as SKAT-C  and the test by Curtis  can be applied to a combined set of rare (MAF < 1%), low frequency (1% ≤ MAF < 5%), and common (MAF ≥ 5%) variants.
Multimarker methods can be roughly divided into two types: linear and quadratic tests . Linear tests are constructed by combining the individual SNP effects in a linear combination with certain weights [2, 4, 18]. Linear tests can be powerful if the individual SNP effects have the same direction but can lose substantial power if this condition is not met [2, 5, 8, 17]. Since the direction of an effect can be reversed by recoding the genotype, some researchers developed methods to recode the risk and base alleles considering magnitude and direction of pairwise correlations between SNPs [2, 5]. Quadratic tests are constructed as a quadratic form of an effect vector with corresponding weight matrix [5, 11, 16]. Quadratic tests are more robust to the occurrence of effects in opposing directions, but the degrees of freedom can be high if many neutral SNPs are included in the analysis .
Yoo et al. [8, 13] proposed the multi-bin linear combination test (MLC), which is a hybrid of linear and quadratic tests, and evaluated its performance for common SNPs  and for combinations of common and low frequency SNPs . For the MLC test, SNPs are partitioned into bins or clusters of highly correlated SNPs according to the pairwise linkage disequilibrium (LD) measure . Then percluster linear combinations of individual SNP effects are combined in a quadratic form [8, 13]. Because of the quadratic component, the MLC test is more robust than linear tests and can have better power than a quadratic test such as the generalized Wald test under realistic causal model scenarios [8, 13].
For SNP clustering, Yoo et al. [8, 13] previously applied an algorithm incorporated in the LDSelect program . LDSelect is designed to select tag SNPs and the cluster partitioning of a gene proceeds by identifying SNPs that capture the effects of other SNPs through LD. Because its greedy algorithm begins with a SNP in LD with the largest number of other SNPs, it tends to first construct one large cluster. However, for the MLC test, clusters with fewer SNPs are less likely to include causal effects in opposing directions and may therefore be more robust. Yoo et al. also showed that the power of the MLC test is better when correlations between SNPs within a cluster are large and positive .
In this study, we develop a new clustering algorithm called CLQ that identifies cliques in the network of SNPs. By definition, all pairwise correlations between SNPs in a clique are above a prespecified threshold value. We also incorporate the coding correction algorithm of Wang and Elston [2, 5] into CLQ so that, after recoding, the cliques consist only of positively correlated SNPs. We compare the performance of MLC tests using the previous clustering algorithm, LDSelect, with that using the new CLQ algorithm in terms of power and robustness. For power calculations, we use genotype data from the HapMap Asian population to provide 1000 different realistic LD structures. For one causal and two causal SNP scenarios, we consider all possible causal SNP choices within each gene. Through extensive numerical power calculations for the MLC test under various causal-gene SNP-trait model scenarios, we show that the CLQ algorithm is highly suitable for incorporation into the MLC test.
2. Methods and Materials
2.1. Multi-Bin Linear Combination Test
Suppose SNPs in a gene are jointly analyzed in a multiple regression. We denote the genotypes of SNPs as . The genotypes can be coded differently depending on the genetic model. For the rest of the paper, we assume an additive genetic model such that is the count of minor alleles for th SNP; that is, , 1, or 2. We set up the regression model aswhere is the link function. For the global hypothesis of gene-based association, we construct a test using the estimated beta coefficients, , and the covariance matrix of , .
Suppose SNPs are partitioned into several bins or clusters based on the pairwise linkage disequilibrium measure defined aswhere and are the MAF values of the th and th SNP and is the frequency of the haplotype consisting of the two minor alleles. If phase information of genotypes to identify haplotypes is not available, is estimated using maximum likelihood methods, which is the same as computing the Pearson correlation coefficient between additive genotypes and . If SNPs are partitioned into clusters, we construct a matrix to denote SNP assignments such that if the th SNP belongs to the th cluster and if not.
If only one SNP is assigned to each cluster (singleton), is the same as the generalized Wald test statistic Moreover, if all SNPs are assigned to one cluster, is a linear combination (LC) test . The asymptotic null distribution of the Wald test statistic is an chi-square distribution, assuming no linear dependencies among SNP genotypes, whereas the MLC test statistic follows an chi-square distribution. The asymptotic null distribution of the LC test statistic is 1 chi-square.
2.2. Allele Recoding Algorithm
As shown in Yoo et al. , power of the LC and the MLC tests benefits from high positive correlation between causal and neutral SNPs or between causal SNPs. The sign of the correlation between two SNPs changes if we switch the risk and base alleles for one of the two SNPs. For example, if we replace with a new genotype variable , then the genotype of , 1, 2 becomes , 1, 0, respectively, under an additive model. When this change is applied, the correlation between the genotype and becomes for . This coding change will also change the sign of beta estimates if . To make most pairwise correlations positive for SNPs in the joint analysis, we apply the Wang and Elston’s SNP recoding algorithm , which is as follows.
Step 1. Count the number of negatively correlated SNPs for each SNP and denote it as for ; that is, , where is an indicator function.
Step 2. Select the SNP with the ; then switch the risk and base allele for the genotype of that SNP.
Step 3. Iterate Steps 1-2 with updated correlations from the updated genotypes until .
For the MLC test based on the LDSelect algorithm, we applied recoding for each cluster separately after clustering. With the CLQ algorithm, we incorporate recoding within the clustering algorithm.
2.3. SNP Clustering Using the LDSelect Algorithm
The LDSelect algorithm  is as follows.
Step 1. Set a threshold value for correlation between two SNPs. Suppose the SNPs in a gene are indexed with . Let .
Starting with , iterate the selection of the th cluster in Steps 2 to 4.
Step 2. For each SNP in , count the number of other SNPs having correlation with SNP greater than a threshold value such that , whereWe call the SNPs that meet this criteria the neighbors of SNP . Proceed to Step 5 if is at most equal to 0 for all . If not, proceed to the next step.
Step 3. First, select one SNP (say ) with and all the neighbors of SNP in and group them as . When there is more than one SNP with the maximum number of neighbors, randomly select one SNP from among them.
Step 4. Remove SNPs in from and denote it as . Also, update with . Iterate Steps 2~4 unless the condition to proceed to Step 5 is met or all SNPs are assigned into a cluster.
Step 5. If the maximum for all is at most 0, the SNPs in will be partitioned into singleton clusters (each with only one SNP).
End. In this way all the SNPs are assigned into clusters for some . Then , where for and for .
2.4. SNP Clustering Using CLQ Algorithm
Let be a graph with a vertex set and an edge set of , the set of some pairs of vertices in . If an edge between two vertices is included in , we call these two adjacent. A clique is defined as a subset of such that all pairs of vertices in are adjacent. A maximal clique in is a clique whose vertices are not a subset of the vertices of a larger clique, and the maximum cliques in are the largest among all cliques in a graph. A subgraph of is a graph with a vertex set and an edge set . A subgraph of is said to be induced by a vertex set when an edge is in if and only if the edge is in . The subgraph induced by a clique is complete (all possible edges between vertices in clique are included in the edge set).
The CLQ algorithm is as follows.
Step 1. For a threshold value , construct a graph with a vertex set corresponding to SNP in a gene and an edge set in which the undirected edge between vertex and is included if for . Let with and .
Starting with , iterate the selection of the th cluster in Steps 2 to 4.
Step 2. For each vertex in , find the maximal cliques that contain the vertex using the Bron-Kerbosch algorithm  implemented in igraph package  and select the largest clique of maximal cliques found for all vertices. Proceed to Step 5 if there is no maximum clique with at least two vertices. Otherwise, proceed to the next step.
Step 3. Apply the recoding algorithm to the maximum cliques chosen in Step 2. If all pairwise correlations between SNPs in the clique can be recoded to be positive, then take the SNPs corresponding the chosen clique as the cluster . If negatively correlated SNPs still exist after the recoding algorithm has been applied to this clique, discard the chosen clique and select the next largest one. If there are multiple cliques in with the largest size and all SNPs can be recoded to be positively correlated, choose the one with the largest sum of absolute correlation. Repeat application of the recoding algorithm until is determined. If there is no clique with at least two vertices that can be recoded to have all positive correlations, proceed to Step 5.
Step 4. Remove SNPs in from and denote it as . Also, update with . Update with the subgraph induced by . The edge set is also updated by the edge set of as . Iterate Steps 2~4 unless the condition to proceed to Step 5 is met or all SNPs are assigned into a cluster.
Step 5. If there is no maximum clique with at least two vertices in , the SNPs in will be partitioned into singleton clusters.
End. In this way all the SNPs are assigned into clusters for some . Then , where for and for .
2.5. Comparison of Clustering Results
To compare the clusters produced for a gene by the two different clustering methods, we used the criterion of Rand  and the criterion that is adjusted for chance agreement . Suppose in one clustering method, , that SNPs are partitioned into and, using another method, they are partitioned into . The similarity between two clustering results is defined aswhere if there exist and such that SNP and SNP are both in and , or if there exist and such that SNP is in both and while SNP is in neither nor , and otherwise. A higher value of corresponds to more similar performance of two clustering methods for the given data. When a pair of clustering results are exactly identical, then , whereas if there is no similarity. The adjusted agreement measure is defined aswhere denotes the number of common SNPs that belong to clusters and , and and are the number of SNPs in clusters and , respectively .
2.6. Numerical Power Analysis Based on HapMap Data
Based on 1000 gene structures obtained from HapMap phase III Asian data, we computed MLC test power using LDSelect clustering (MLC-LD) and CLQ clustering (MLC-CL) for several alternative trait models with one or two causal SNPs. The HapMap gene panels were obtained by random selection from the 8883 genes that had 4~30 SNPs, excluding SNPs with MAF < 0.01 or any pairwise LD value . Two sets of 1000 genes were randomly selected, allowing overlap: one for the analysis of Models A, B, and C and another for validation analysis. For each set of 1000 genes, two panels of SNPs with MAF ≥ 0.05 and MAF ≥ 0.01, respectively, were used in comparisons of clustering results, and one panel with MAF ≥ 0.01 was used for power analysis. We evaluated a range of clustering threshold values for equal to 0.3 through 0.9 for LDSelect and CLQ.
For trait models, we considered models with one to four causal SNPs within a gene and a linear model for the quantitative phenotype : where is the number of causal SNPs, is the effect of th causal SNP, is the number of causal alleles for the th causal SNP, and is the error term assumed to follow a normal distribution with mean 0 and variance (Table 1).
Initially we considered three types of trait models: one with one causal SNP in a gene with effect (Model A), another with two causal SNPs in a gene with effects , (Model B), and a third with two causal SNPs in a gene with effects , (Model C). Since power in a linear model depends on the ratio of signal to noise, that is, , we selected the variance to correspond to reasonable power for a given gene structure and choice of causal SNPs for Models A, B, and C. To clearly see relative performance of the MLC tests compared to the generalized Wald test, we adjusted such that the Wald test power is 60% for each trait model. For Model A, in each gene we chose each of the SNPs in turn to be the causal SNP, resulting in over 11,000 trait model settings in total over 1000 genes. For Models B and C, in each gene we chose each of all possible SNP pairs in turn to be the causal SNP pair, resulting in nearly 80,000 trait model settings in total for main and validation sets.
In a fourth trait model (Model D), we also obtained power over mixed types of causal models with variable effect sizes. The number of causal SNPs was chosen randomly between 1 and 4, with the deleterious and protective causal SNPs randomly assigned. For each causal SNP, was randomly selected from the uniform distribution where is the expected standard deviation of following the effect size estimates for SNPs associated with lipid levels presented in Willer et al. . Then the error variance was fixed as 1.
For power analysis using Models A to D, the genotype data were randomly generated from the haplotype panel of HapMap data for subjects. Power was calculated numerically for each gene assuming asymptotic chi-square distributions under the null and alternative trait models. With a given significance level and number of clusters , the critical value is obtained from the chi-square distribution such that . Power is computed as with and noncentrality parameter equal to the expected MLC statistic value under the trait model (see Appendix in ).
3.1. Comparison of SNP Clustering by LDSelect and CLQ
In Figure 1, we illustrate LDSelect and CLQ clustering for 12 SNPs in ARHGAP29 at a threshold value 0.7. By LDSelect, the largest cluster includes SNPs 1, 2, 3, and 7 since SNP 1 tags SNPs 2, 3, and 7. However with CLQ, these four SNPs do not form a clique because the pairwise correlations between SNPs 2 and 3 and between SNPs 3 and 7 are below the threshold value. By CLQ, SNPs 2, 7, and 9 are clustered as a clique and SNPs 1 and 3 are clustered as another clique. Here, SNP 1 is recoded so that the correlation within the clique is positive.
We compared LDSelect and CLQ clustering for each of the 1000 HapMap genes across a range of threshold values. For a given threshold, the clustering methods often produce identical gene clusters, particularly at higher threshold values (Table 2). For example, at the threshold value 0.7, 54% of the clustering results are the same. With increased threshold values, the averages of the agreement measures and also increase. At threshold values greater than 0.5, the average agreement measure is greater than 78% overall. Comparison of average and under stratification by five gene-size groups (≤10, 11~15, 16~20, 21~25, >25 SNPs per gene) showed that the agreement slightly weakens with increased number of SNPs (results not shown).
On average, the number of clusters obtained by LDSelect is usually smaller than that by CLQ for a given threshold value (Table 3). Cluster sizes are smaller and less variable in CLQ than in LDSelect, averaged over all gene sizes (Table 3). Figure 2(a) shows the average over 1000 genes of the ratio of the number of clusters to the number of SNPs per gene used for clustering by LDSelect and CLQ given a threshold value . Restricting the SNPs to have higher minor allele frequency (MAF ≥ 0.05 versus MAF ≥ 0.01) reduces the ratio similarly for both clustering methods. At the same threshold value, CLQ produces a larger number of clusters compared to LDSelect, mainly because CLQ has a stricter within-cluster LD requirement, but this difference decreases as the threshold value increases. It follows that the average size of the largest cluster in each gene is greater for LDSelect than for CLQ (Figure 2(b)), with greater differences at lower threshold values. Conversely, CLQ usually produces more singleton clusters than LDSelect (Figure 2(c)). We conclude that at the same threshold value, CLQ tends to produce more clusters of smaller size than LDSelect.
(a) The ratio of number of clusters to number of SNPs
(b) The size of the largest cluster
(c) The number of singleton clusters
To compare maximum cluster size when the number of clusters per gene is the same, rather than at a fixed threshold value, we applied the clustering methods across a range of threshold values and for each gene matched the LDselect and CLQ clustering results according to the number of clusters (Figure 3(a)). At nearly all cluster numbers, the average size of the largest cluster is again smaller for CLQ. Out of all clustering results with different numbers of clusters, 69% had the same maximum cluster size by LDSelect and CLQ, 23% had a larger maximum cluster by LDSelect, and only 8% had a smaller maximum cluster size by LDSelect, based on SNPs with MAF ≥ 0.05. For the SNPs with MAF ≥ 0.01, these percentages were 65%, 28%, and 7%. For a fixed number of clusters, the number of singleton clusters was slightly smaller for CLQ than LDSelect (Figure 3(b)). Out of all clustering results, 70% had the same number of singleton clusters by LDSelect and CLQ, 21% had a larger number by LDSelect, and only 9% had a smaller number of singleton clusters by LDSelect than CLQ, based on SNPs with MAF ≥ 0.05. For the SNPs with MAF ≥ 0.01, these percentages were 67%, 25%, and 8%. We conclude that when the two clustering methods produce the same number of clusters for a gene, the CLQ clusters will tend to be less variable in size than the LDSelect clusters. We draw similar conclusions from the analysis of validation set (results not shown).
(a) The size of the largest cluster
(b) The number of singleton clusters
3.2. Comparison of MLC-LD and MLC-CL Test Power
For the power calculations, the variance of the error term was set such that Wald test power is 60% for Models A, B, and C. For Model D, the variance was fixed as 1 and variation in the regression coefficient determined power. In Table 4, the average MLC test power values for trait model types A, B, C, and D using two clustering methods (MLC-LD and MLC-CL) vary across values of the clustering threshold . When averaged over all sets of genes and causal SNP choices, MLC-LD power and MLC-CL power were both higher than Wald test power (which was 60% for Models A–C and roughly 35–38% for Model D). Average power was usually maximized at or 0.7 for LDSelect and at or 0.5 for CLQ. At threshold values less than 0.7, MLC-CL power was higher than MLC-LD for all models. At threshold values higher than 0.6, however, MLC-CL power was less than MLC-LD for Models A, B, and D. For Model C, MLC-CL power was higher than MLC-LD for all threshold values. For each model, the highest average power was achieved by MLC-CL (bolded entries in Table 4). Comparison of average MLC power values under stratification by five gene-size groups (≤10, 11~15, 16~20, 21~25, >25 SNPs per gene) generally yielded similar results (results not shown).
We also compared the proportion of gene-causal-SNP cases in which MLC-LD power or MLC-CL power was less than Wald test power (Table 4). The proportion with lower MLC-CL power was smaller, suggesting improved robustness. At lower threshold values particularly, the proportion with power less than the Wald test for LDSelect was much higher, up to 40~54% for some models, but was less than 25% for CLQ. Plots of gene-specific power obtained for the cases in which the LDSelect and CLQ clusters differ show that the MLC test using CLQ is less likely than MLC test using LDselect to have substantially reduced power relative to the Wald test (Figure 4). Similar conclusions about relative power were obtained from the analysis of validation set (results not shown).
(a) Model A (one causal SNP)
(b) Model B (two causal SNPs, both deleterious)
(c) Model C (two causal SNPs, one deleterious, one protective)
(d) Model D (1~4 random causal SNPs, random number of deleterious and protective SNPs)
In previous studies, we reported that power of the MLC test depends on the correlation structure among SNPs and we postulated that clusters of strongly correlated SNPs with positive correlations benefit the test . Therefore, the CLQ algorithm designed to construct such clusters, in which all pairwise correlations in the cluster are positive and strong, should work well for MLC tests. LDSelect and CLQ produced exactly identical clusters for about 38~54% of the genes at threshold values = 0.5~0.7. This implies that many of the clusters found by LDSelect using a SNP that tags other SNPs are actually cliques; that is, the pairwise correlations between SNPs in the cluster other than the SNP with most neighbors are also above the threshold, even though the LDSelect algorithm does not consider that information.
The LDSelect algorithm was originally developed for tag SNP selection so that indirect associations could be efficiently captured by genotyping and analyzing only tag SNPs. We observed that the LDSelect algorithm also works reasonably well for MLC tests where power depends on formation of the clusters with large positive correlations. Because the algorithms produce identical clusters for a substantial portion of the cases, the average MLC test power values were not dramatically different over the genes we tested. However, robustness as measured by the proportion of gene-causal-SNP cases with lower power than the Wald test was better with CLQ than with LDSelect. The plot of entire power values for all trait-causal SNP models over 1000 genes also indicated that the MLC test using CLQ is less likely than LDSelect to have substantially reduced power relative to the Wald test. Because the CLQ algorithm produces slightly more clusters than LDSelect for a given threshold, the degrees of freedom tends to be higher for the MLC test using CLQ than LDSelect, and in that matter CLQ has a disadvantage compared to LDSelect. However, the smaller sized clusters constructed by CLQ may be advantageous because SNPs with opposing effects are less likely to occur in the same cluster.
For power comparisons using different threshold values, we found that CLQ with threshold values = 0.4~0.5 usually produces the best average power among all clustering algorithm-threshold value combinations. In previous studies [8, 13], we suggested the threshold value 0.3~0.5 for (0.5~0.7 for ) to achieve optimum power using LDSelect algorithm, which has been validated by the results of this study. We suggest the threshold value of = 0.4~0.5 to be used for CLQ algorithm based on the results of this study. However, a dynamically determined threshold value after evaluating the LD structure might be more appropriate for MLC tests, and being able to choose nonarbitrary threshold values is more attractive to researchers applying the method.
We applied the CLQ algorithm to a prespecified gene unit for gene-based analysis, but it could be applied similarly to intergenic regions, exomes, and promoter regions, that is, any regional units exhibiting some linkage disequilibrium between SNPs. If these regions include too many SNPs (e.g., more than 100), it is unreasonable to apply MLC tests based on joint regression models unless the sample size is extremely large. In that case, it may be desirable to break up the region into several LD blocks. Another approach would be to apply variable selection techniques such as penalized regressions [25, 26] and construct a MLC-type test with the resulting models.
In summary, we observed that CLQ and LDSelect produce identical clusters about half the time, and in the remaining cases, CLQ usually produces more clusters of smaller size. On average, MLC test power using CLQ is similar to that using LDSelect. The MLC test using CLQ shows better robustness to the detrimental effects of opposing SNP associations within the same cluster. Therefore, the CLQ algorithm is a promising approach for preanalysis clustering of SNPs for multimarker methods such as the MLC test.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Research Foundation of Korea (NRF) Grant NRF-2012R1A1A3012428 and the Canadian Institutes of Health Research (CIHR) operating Grant MOP-84287.
Y. J. Yoo, L. Sun, J. Poirier, and S. B. Bull, “Multi-bin multi-variant tests for gene-based linear regression analysis of genetic association,” Tech. Rep., Department of Statistical Sciences, University of Toronto, 2014, http://www.utstat.toronto.edu/wordpress/?page_id=514.View at: Google Scholar
C. S. Carlson, M. A. Eberle, M. J. Rieder, Q. Yi, L. Kruglyak, and D. A. Nickerson, “Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium,” The American Journal of Human Genetics, vol. 74, no. 1, pp. 106–120, 2004.View at: Publisher Site | Google Scholar
G. Csardi and T. Nepusz, “The igraph software package for complex network research,” InterJournal: Complex Systems, p. 1695, 2006.View at: Google Scholar