BioMed Research International

Volume 2015 (2015), Article ID 852341, 11 pages

http://dx.doi.org/10.1155/2015/852341

## Clique-Based Clustering of Correlated SNPs in a Gene Can Improve Performance of Gene-Based Multi-Bin Linear Combination Test

^{1}Department of Mathematics Education, Seoul National University, Seoul 151-742, Republic of Korea^{2}Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 151-742, Republic of Korea^{3}Prosserman Centre for Health Research, The Lunenfeld-Tanenbaum Research Institute of Mount Sinai Hospital, Toronto, ON, Canada M5T 3L9^{4}Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada M5T 3M7

Received 14 November 2014; Revised 3 February 2015; Accepted 14 February 2015

Academic Editor: Taesung Park

Copyright © 2015 Yun Joo Yoo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Gene-based analysis of multiple single nucleotide polymorphisms (SNPs) in a gene region is an alternative to single SNP analysis. The multi-bin linear combination test (MLC) proposed in previous studies utilizes the correlation among SNPs within a gene to construct a gene-based global test. SNPs are partitioned into clusters of highly correlated SNPs, and the MLC test statistic quadratically combines linear combination statistics constructed for each cluster. The test has degrees of freedom equal to the number of clusters and can be more powerful than a fully quadratic or fully linear test statistic. In this study, we develop a new SNP clustering algorithm designed to find cliques, which are complete subnetworks of SNPs with all pairwise correlations above a threshold. We evaluate the performance of the MLC test using the clique-based CLQ algorithm versus using the tag-SNP-based LDSelect algorithm. In our numerical power calculations we observed that the two clustering algorithms produce identical clusters about 40~60% of the time, yielding similar power on average. However, because the CLQ algorithm tends to produce smaller clusters with stronger positive correlation, the MLC test is less likely to be affected by the occurrence of opposing signs in the individual SNP effect coefficients.

#### 1. Introduction

Current genetic association studies aim to identify genetic variants responsible for a disease by investigating associations between single nucleotide polymorphisms (SNPs) and a trait of interest. In a single-SNP approach, each SNP is analyzed individually for the marginal association with the trait. In a multi-SNP approach, a group of SNPs is analyzed together for polygenic model analysis or gene-based analysis to obtain a global statistic for the combined effect of a set of SNPs [1–8]. When the gene is the unit of interest in the association analysis, gene-based analyses can be performed with multimarker methods using multi-SNP genotypes or haplotypes [6, 8–10]. These methods have the potential benefits of reducing genome-wide type I error burden and boosting the power of the study [9].

Most popular multimarker methods have been developed for the analysis of genotypes. In some methods, the marginal effects of individual SNPs are combined to form a global statistic [5, 11]. In others, SNP genotypes are analyzed in a multiple regression model and global statistics are constructed to represent the joint effects of multiple SNPs in a gene [3, 8, 12, 13]. Some multimarker tests such as C-alpha [14], SKAT [11], and CMC tests [15] specifically target rare variants with minor allele frequency (MAF) less than 1%. On the other hand, multimarker tests such as SKAT-C [16] and the test by Curtis [10] can be applied to a combined set of rare (MAF < 1%), low frequency (1% ≤ MAF < 5%), and common (MAF ≥ 5%) variants.

Multimarker methods can be roughly divided into two types: linear and quadratic tests [17]. Linear tests are constructed by combining the individual SNP effects in a linear combination with certain weights [2, 4, 18]. Linear tests can be powerful if the individual SNP effects have the same direction but can lose substantial power if this condition is not met [2, 5, 8, 17]. Since the direction of an effect can be reversed by recoding the genotype, some researchers developed methods to recode the risk and base alleles considering magnitude and direction of pairwise correlations between SNPs [2, 5]. Quadratic tests are constructed as a quadratic form of an effect vector with corresponding weight matrix [5, 11, 16]. Quadratic tests are more robust to the occurrence of effects in opposing directions, but the degrees of freedom can be high if many neutral SNPs are included in the analysis [13].

Yoo et al. [8, 13] proposed the multi-bin linear combination test (MLC), which is a hybrid of linear and quadratic tests, and evaluated its performance for common SNPs [13] and for combinations of common and low frequency SNPs [8]. For the MLC test, SNPs are partitioned into bins or clusters of highly correlated SNPs according to the pairwise linkage disequilibrium (LD) measure . Then percluster linear combinations of individual SNP effects are combined in a quadratic form [8, 13]. Because of the quadratic component, the MLC test is more robust than linear tests and can have better power than a quadratic test such as the generalized Wald test under realistic causal model scenarios [8, 13].

For SNP clustering, Yoo et al. [8, 13] previously applied an algorithm incorporated in the LDSelect program [19]. LDSelect is designed to select tag SNPs and the cluster partitioning of a gene proceeds by identifying SNPs that capture the effects of other SNPs through LD. Because its greedy algorithm begins with a SNP in LD with the largest number of other SNPs, it tends to first construct one large cluster. However, for the MLC test, clusters with fewer SNPs are less likely to include causal effects in opposing directions and may therefore be more robust. Yoo et al. also showed that the power of the MLC test is better when correlations between SNPs within a cluster are large and positive [13].

In this study, we develop a new clustering algorithm called CLQ that identifies cliques in the network of SNPs. By definition, all pairwise correlations between SNPs in a clique are above a prespecified threshold value. We also incorporate the coding correction algorithm of Wang and Elston [2, 5] into CLQ so that, after recoding, the cliques consist only of positively correlated SNPs. We compare the performance of MLC tests using the previous clustering algorithm, LDSelect, with that using the new CLQ algorithm in terms of power and robustness. For power calculations, we use genotype data from the HapMap Asian population to provide 1000 different realistic LD structures. For one causal and two causal SNP scenarios, we consider all possible causal SNP choices within each gene. Through extensive numerical power calculations for the MLC test under various causal-gene SNP-trait model scenarios, we show that the CLQ algorithm is highly suitable for incorporation into the MLC test.

#### 2. Methods and Materials

##### 2.1. Multi-Bin Linear Combination Test

Suppose SNPs in a gene are jointly analyzed in a multiple regression. We denote the genotypes of SNPs as . The genotypes can be coded differently depending on the genetic model. For the rest of the paper, we assume an additive genetic model such that is the count of minor alleles for th SNP; that is, , 1, or 2. We set up the regression model aswhere is the link function. For the global hypothesis of gene-based association, we construct a test using the estimated beta coefficients, , and the covariance matrix of , .

Suppose SNPs are partitioned into several bins or clusters based on the pairwise linkage disequilibrium measure defined aswhere and are the MAF values of the th and th SNP and is the frequency of the haplotype consisting of the two minor alleles. If phase information of genotypes to identify haplotypes is not available, is estimated using maximum likelihood methods, which is the same as computing the Pearson correlation coefficient between additive genotypes and . If SNPs are partitioned into clusters, we construct a matrix to denote SNP assignments such that if the th SNP belongs to the th cluster and if not.

Using the assignment matrix , we construct an MLC test statistic such thatwhere [8, 13].

If only one SNP is assigned to each cluster (singleton), is the same as the generalized Wald test statistic Moreover, if all SNPs are assigned to one cluster, is a linear combination (LC) test [1]. The asymptotic null distribution of the Wald test statistic is an chi-square distribution, assuming no linear dependencies among SNP genotypes, whereas the MLC test statistic follows an chi-square distribution. The asymptotic null distribution of the LC test statistic is 1 chi-square.

##### 2.2. Allele Recoding Algorithm

As shown in Yoo et al. [13], power of the LC and the MLC tests benefits from high positive correlation between causal and neutral SNPs or between causal SNPs. The sign of the correlation between two SNPs changes if we switch the risk and base alleles for one of the two SNPs. For example, if we replace with a new genotype variable , then the genotype of , 1, 2 becomes , 1, 0, respectively, under an additive model. When this change is applied, the correlation between the genotype and becomes for . This coding change will also change the sign of beta estimates if . To make most pairwise correlations positive for SNPs in the joint analysis, we apply the Wang and Elston’s SNP recoding algorithm [2], which is as follows.

*Step 1. *Count the number of negatively correlated SNPs for each SNP and denote it as for ; that is, , where is an indicator function.

*Step 2. *Select the SNP with the ; then switch the risk and base allele for the genotype of that SNP.

*Step 3. *Iterate Steps 1-2 with updated correlations from the updated genotypes until .

For the MLC test based on the LDSelect algorithm, we applied recoding for each cluster separately after clustering. With the CLQ algorithm, we incorporate recoding within the clustering algorithm.

##### 2.3. SNP Clustering Using the LDSelect Algorithm

The LDSelect algorithm [19] is as follows.

*Step 1. *Set a threshold value for correlation between two SNPs. Suppose the SNPs in a gene are indexed with . Let .

Starting with , iterate the selection of the th cluster in Steps 2 to 4.

*Step 2. *For each SNP in , count the number of other SNPs having correlation with SNP greater than a threshold value such that , whereWe call the SNPs that meet this criteria the* neighbors* of SNP . Proceed to Step 5 if is at most equal to 0 for all . If not, proceed to the next step.

*Step 3. *First, select one SNP (say ) with and all the neighbors of SNP in and group them as . When there is more than one SNP with the maximum number of neighbors, randomly select one SNP from among them.

*Step 4. *Remove SNPs in from and denote it as . Also, update with . Iterate Steps 2~4 unless the condition to proceed to Step 5 is met or all SNPs are assigned into a cluster.

*Step 5. *If the maximum for all is at most 0, the SNPs in will be partitioned into singleton clusters (each with only one SNP).

*End*. In this way all the SNPs are assigned into clusters for some . Then , where for and for .

##### 2.4. SNP Clustering Using CLQ Algorithm

Let be a graph with a vertex set and an edge set of , the set of some pairs of vertices in . If an edge between two vertices is included in , we call these two adjacent. A* clique* is defined as a subset of such that all pairs of vertices in are adjacent. A* maximal clique* in is a clique whose vertices are not a subset of the vertices of a larger clique, and the* maximum cliques* in are the largest among all cliques in a graph. A* subgraph* of is a graph with a vertex set and an edge set . A subgraph of is said to be* induced* by a vertex set when an edge is in if and only if the edge is in . The subgraph induced by a clique is complete (all possible edges between vertices in clique are included in the edge set).

The CLQ algorithm is as follows.

*Step 1. *For a threshold value , construct a graph with a vertex set corresponding to SNP in a gene and an edge set in which the undirected edge between vertex and is included if for . Let with and .

Starting with , iterate the selection of the th cluster in Steps 2 to 4.

*Step 2. *For each vertex in , find the maximal cliques that contain the vertex using the Bron-Kerbosch algorithm [20] implemented in igraph package [21] and select the largest clique of maximal cliques found for all vertices. Proceed to Step 5 if there is no maximum clique with at least two vertices. Otherwise, proceed to the next step.

*Step 3. *Apply the recoding algorithm to the maximum cliques chosen in Step 2. If all pairwise correlations between SNPs in the clique can be recoded to be positive, then take the SNPs corresponding the chosen clique as the cluster . If negatively correlated SNPs still exist after the recoding algorithm has been applied to this clique, discard the chosen clique and select the next largest one. If there are multiple cliques in with the largest size and all SNPs can be recoded to be positively correlated, choose the one with the largest sum of absolute correlation. Repeat application of the recoding algorithm until is determined. If there is no clique with at least two vertices that can be recoded to have all positive correlations, proceed to Step 5.

*Step 4. *Remove SNPs in from and denote it as . Also, update with . Update with the subgraph induced by . The edge set is also updated by the edge set of as . Iterate Steps 2~4 unless the condition to proceed to Step 5 is met or all SNPs are assigned into a cluster.

*Step 5. *If there is no maximum clique with at least two vertices in , the SNPs in will be partitioned into singleton clusters.

*End*. In this way all the SNPs are assigned into clusters for some . Then , where for and for .

##### 2.5. Comparison of Clustering Results

To compare the clusters produced for a gene by the two different clustering methods, we used the criterion of Rand [22] and the criterion that is adjusted for chance agreement [23]. Suppose in one clustering method, , that SNPs are partitioned into and, using another method, they are partitioned into . The similarity between two clustering results is defined aswhere if there exist and such that SNP and SNP are both in and , or if there exist and such that SNP is in both and while SNP is in neither nor , and otherwise. A higher value of corresponds to more similar performance of two clustering methods for the given data. When a pair of clustering results are exactly identical, then , whereas if there is no similarity. The adjusted agreement measure is defined aswhere denotes the number of common SNPs that belong to clusters and , and and are the number of SNPs in clusters and , respectively [23].

##### 2.6. Numerical Power Analysis Based on HapMap Data

Based on 1000 gene structures obtained from HapMap phase III Asian data, we computed MLC test power using LDSelect clustering (MLC-LD) and CLQ clustering (MLC-CL) for several alternative trait models with one or two causal SNPs. The HapMap gene panels were obtained by random selection from the 8883 genes that had 4~30 SNPs, excluding SNPs with MAF < 0.01 or any pairwise LD value . Two sets of 1000 genes were randomly selected, allowing overlap: one for the analysis of Models A, B, and C and another for validation analysis. For each set of 1000 genes, two panels of SNPs with MAF ≥ 0.05 and MAF ≥ 0.01, respectively, were used in comparisons of clustering results, and one panel with MAF ≥ 0.01 was used for power analysis. We evaluated a range of clustering threshold values for equal to 0.3 through 0.9 for LDSelect and CLQ.

For trait models, we considered models with one to four causal SNPs within a gene and a linear model for the quantitative phenotype : where is the number of causal SNPs, is the effect of th causal SNP, is the number of causal alleles for the th causal SNP, and is the error term assumed to follow a normal distribution with mean 0 and variance (Table 1).