BioMed Research International

Volume 2015, Article ID 523641, 10 pages

http://dx.doi.org/10.1155/2015/523641

## Detecting Genetic Interactions for Quantitative Traits Using m-Spacing Entropy Measure

^{1}Department of Physiology and Biophysics, Eulji University, Daejeon, Republic of Korea

^{2}Department of Bioinformatics, Seoul National University, Seoul, Republic of Korea

^{3}Department of Informational Statistics, Korea University, Jochiwon, Republic of Korea

^{4}Department of Statistics, Seoul National University, Seoul, Republic of Korea

^{5}Department of Preventive Medicine, Eulji University, Daejeon, Republic of Korea

Received 14 November 2014; Revised 4 February 2015; Accepted 8 March 2015

Academic Editor: Xiang-Yang Lou

Copyright © 2015 Jaeyong Yee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A number of statistical methods for detecting gene-gene interactions have been developed in genetic association studies with binary traits. However, many phenotype measures are intrinsically quantitative, and categorizing continuous traits may not always be straightforward or meaningful. The association of gene-gene interactions with an observed distribution of such phenotypes needs to be investigated directly, without categorization. Information gain based on an entropy measure has previously been successful in identifying genetic associations with binary traits. We extend the usefulness of this information gain by proposing a nonparametric method for evaluating the conditional entropy of a quantitative phenotype associated with a given genotype. Hence, the information gain can be obtained for any phenotype distribution. Because no functional form, such as a Gaussian, is assumed for the distribution of the trait, either overall or for a given genotype, this method is expected to be robust enough to be applied to any phenotypic association data. Here, we show that it successfully identifies the main effect, as well as the genetic interactions, associated with a quantitative trait.

#### 1. Introduction

Recent advances in high-throughput genotyping techniques have produced massive volumes of genetic data. Although it is common to analyze single SNP effects extensively, such approaches cannot adequately explain the intricate genetic contributions to complex diseases such as hypertension, diabetes, and certain psychiatric disorders. Consequently, a large share of the genetic component remains unexplained. Gene-gene interaction analysis may be one way to address this missing heritability problem [1].

For case-control studies, which formulate the measures for a binary trait, a number of statistical methods for detecting gene-gene interactions have been proposed. One of the most popular methods is multifactor dimensionality reduction (MDR) [2] that converts a high-dimensional contingency table to a one-dimensional model without raising the issue of sparse cells. Several variants of MDR have been recently developed [3–8], while another approach was developed [9–11] from information theory [12, 13]. More recently, an entropy-based approach which utilizes the relative gain of information, as well as its standardized measure, has also been proposed [14].

However, for quantitative traits such as blood pressure, body mass index, and patient survival time, relatively few attempts have been made to analyze genetic interactions. Because many phenotype measures are intrinsically quantitative, and categorizing a continuous trait may not always be straightforward or meaningful, the association of gene-gene interactions with the observed distribution of such phenotypes needs to be investigated directly, without categorization. To that end, introducing a new statistic is one way to tackle the problem [15]. Extensions of the MDR algorithm to continuous traits, such as the generalized MDR (GMDR) and the model-based MDR (MB-MDR), have been proposed [3, 6]. More recently, a quantitative MDR (QMDR) was proposed that replaces the balanced accuracy metric with a t-test statistic [16]. However, these MDR-based approaches may oversimplify the original data to some degree through classification of phenotypes. An entropy-based approach may well be an alternative. Entropy is commonly used in information theory to measure the uncertainty of random variables [12, 13], and information gain, or mutual information, has been shown to be useful for representing association strengths [17–19]. Although the usefulness of such information-theoretic methods is well known, statistical methods based on this approach for analyzing gene-gene interactions of quantitative traits are rarely found, with the exception of one specific case [20]. Even that method, however, may be limited in application because it assumes a normal distribution.

Here, we extend the usefulness of the information concept to quantitative traits by considering nonparametric estimates based on sample-spacing, or m-spacing [22–25], for the conditional entropy of a quantitative phenotype given a genotype. The challenge, therefore, is to couple a nonparametric entropy estimator to correct and stable information gains. We thus developed the information gain standardized (IGS) approach and applied it to datasets composed of several genotypes and a quantitative trait. This approach can be considered an extension of previous work on categorical traits [14] to quantitative phenotypes. Unlike other methods, such as the variants of MDR, the proposed method does not attempt in any way to classify quantitative phenotypes but instead handles them directly, which provides the intrinsic advantage of removing the chance of misclassification. While previous entropy-based methods for analyzing quantitative traits assumed the trait distribution to be normal [20], our method does not need to specify the distribution to estimate the association. Regular and irregular distributions alike cause no difficulty. Although this is also an advantage of GMDR or QMDR, we propose a method that combines the advantageous characteristics of both approaches. We also performed extensive simulation studies to compare the power of the proposed method with that of QMDR and GMDR, demonstrating its advantage in detection power.

In the following sections, after a brief review of nonparametric entropy estimation, we describe a new method for modeling genetic interactions. In Materials and Methods, a nonparametric entropy estimator is modified so that it couples successfully with genetic datasets. The resulting information gain standardized (IGS) approach is evaluated on both simulated and real datasets in Results and Discussion.

#### 2. Materials and Methods

##### 2.1. Estimation of the Entropy for a Continuous Variable

If $X$ is a random vector with probability density function $f(x)$, its differential entropy is defined by

$$H(X) = -\int f(x) \ln f(x)\, dx. \tag{1}$$

A well-known approach for estimating this quantity is to use plug-in estimates. In this approach, $f(x)$ is first estimated using a standard density estimation method such as a histogram or kernel density estimator, and the entropy is then computed. Integral, resubstitution, splitting data, and cross-validation estimates are among the usual plug-in estimates [22]. Another approach is based on sample-spacing. Let $X_1, X_2, \ldots, X_n$ be a set of independent and identically distributed real valued random variables, with corresponding order statistics $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$. Here, $n$ represents the total number of measured samples. For arbitrary integers $i$ and $m$ satisfying the condition $1 \le i < i + m \le n$, a spacing of order $m$, or $m$-spacing, is defined as $X_{(i+m)} - X_{(i)}$. A density estimate based on sample-spacing, $f_n(x)$, is then constructed as

$$f_n(x) = \frac{m}{n} \cdot \frac{1}{X_{(im)} - X_{((i-1)m)}}, \tag{2}$$

where $x \in [X_{((i-1)m)}, X_{(im)})$ [14]. This density estimate is consistent if, as $n \to \infty$, $m \to \infty$ and $m/n \to 0$ [22]. Several variations of an entropy estimator with minor differences have been proposed, all based on the above density estimates [23, 24]. Among them, the following was reported to approximate $H$ with lowered variance [25]:

$$H_{m,n} = \frac{1}{n-m} \sum_{i=1}^{n-m} \ln\!\left( \frac{n+1}{m} \left( X_{(i+m)} - X_{(i)} \right) \right). \tag{3}$$

The asymptotic bias of this estimator can be corrected by adding additional terms, including the digamma function $\psi$ [22, 28]:

$$H_{m,n} = \frac{1}{n-m} \sum_{i=1}^{n-m} \ln\!\left( \frac{n+1}{m} \left( X_{(i+m)} - X_{(i)} \right) \right) - \psi(m) + \ln m. \tag{4}$$

As $m$ increases, the correctional terms become negligible and the two estimators coincide. Our evaluation of the entropy of the phenotype of a quantitative trait is based on this estimator.
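The spacing-based estimator described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the estimator takes the form of a mean log-spacing scaled by $(n+1)/m$ with the bias terms $-\psi(m) + \ln m$, and the names `digamma_int` and `m_spacing_entropy` are hypothetical. Tied sample values give a zero spacing and would need separate handling.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(m):
    """Digamma psi(m) for a positive integer m: -gamma + sum_{k=1}^{m-1} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, m))


def m_spacing_entropy(samples, m, bias_correct=True):
    """Hypothetical sketch of an m-spacing entropy estimate (natural log).

    Assumes distinct sample values; a zero spacing makes math.log fail.
    """
    x = sorted(samples)          # order statistics X_(1) <= ... <= X_(n)
    n = len(x)
    if not 1 <= m < n:
        raise ValueError("m must satisfy 1 <= m < n")
    total = 0.0
    for i in range(n - m):       # the n - m available spacings of order m
        spacing = x[i + m] - x[i]
        total += math.log((n + 1) / m * spacing)
    h = total / (n - m)
    if bias_correct:
        # Bias correction: subtract psi(m) and add ln(m)
        h += math.log(m) - digamma_int(m)
    return h


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0.0, 1.0) for _ in range(400)]
    print("m = 20 estimate:", m_spacing_entropy(xs, m=20))
    print("analytic N(0,1):", 0.5 * math.log(2 * math.pi * math.e))
```

For integer $m$ the digamma function reduces to a partial harmonic sum, so the sketch needs nothing beyond the standard library.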

##### 2.2. Modification of the m-Spacing Based Entropy Estimator

The estimator in (4) has both $n$ and $m$ as parameters. In genetic association studies, a sample size $n$ of several hundred is common. However, when the conditional entropy is estimated, a minor allele may be represented by a much smaller number of samples. Moreover, the choice of the sample-spacing $m$ affects the resulting entropy estimate. An entropy estimation scheme is therefore required that is independent of the number of samples and does not need a particular value of the sample-spacing to be chosen. To illustrate this requirement, an ensemble of 3,000 sets of random deviates from a normal distribution with standard deviation $\sigma$ was generated for each data point in Figure 1, where the mean and standard deviation of the estimates are plotted for each ensemble. In Figure 1(a), $m$ is fixed to 10 and 20 while $n$ is varied. The analytic formula of the entropy for a normal distribution can be obtained as follows [20], where $e$ is Euler's number:

$$H = \frac{1}{2} \ln\!\left( 2 \pi e \sigma^2 \right). \tag{5}$$

The calculated value of (5) is marked on the vertical axis by a horizontal arrow with the corresponding $\sigma$ above it. The obvious $n$-dependence of the estimator can be seen in this plot, where the estimate approaches the analytic value as $n$ increases, with $\sqrt{n}$-consistency, as expected [24]. In Figure 1(b), $n$ is fixed to 400 while $m$ is varied. In this plot, the estimated entropy again changes in value throughout the possible range of $m$, and it is always smaller than the analytically calculated value. Therefore, assigning a particular value to $m$, such as the typical choice $m = \sqrt{n}$ [25], would not be appropriate in this sampling range. Because of these $n$- and $m$-dependences, the estimator in (4) may need to be modified. We therefore modify the entropy estimator in (4) as follows:

$$H_n = \frac{1}{n-1} \sum_{m=1}^{n-1} H_{m,n}. \tag{6}$$

In this modification, the entropy estimator is averaged over the possible values of $m$ for each $n$, and the result is denoted by $H_n$. This estimator is used to plot the entropy versus the number of samples in Figure 2.

Over a wide range of $n$, this entropy estimator yields very stable values, in contrast to Figure 1(a). The increase in the extremely small $n$ range should be within the tolerable error in an application to genome-wide association, because the contribution of such a minor allele to the conditional entropy is suppressed by the weighting factor of the marginal probability, which is proportional to the number of corresponding samples. Analytically obtained entropy values for the normal distribution, with three different $\sigma$'s, are marked on the vertical axis on the right-hand side. Regardless of the value of $\sigma$, the difference between the analytically obtained value and the value given by the estimator stays essentially the same. Because an association study measures the difference between the entropy and the corresponding conditional entropy, the stability of the estimates is more critical than their absolute value. Compensating for this offset is therefore unnecessary as long as it is stable, and the underestimation of the entropy shown in the plot should have little effect on the association strength. Hence, an entropy estimator has been set up that satisfies practical $n$-independence without the need to find a proper sample-spacing.
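The averaging over sample-spacings and the comparison against the analytic normal entropy can be illustrated with the following sketch. It rests on assumptions rather than the authors' code: the average is taken uniformly over $m = 1, \ldots, n-1$, each term uses the bias-corrected spacing estimate with the $(n+1)/m$ scaling, and the function names `averaged_spacing_entropy` and `normal_entropy` are hypothetical. Sample values are assumed to be distinct.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def averaged_spacing_entropy(samples):
    """Hypothetical sketch: mean of bias-corrected m-spacing estimates, m = 1..n-1."""
    x = sorted(samples)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    psi = -EULER_GAMMA                      # digamma(1)
    total = 0.0
    for m in range(1, n):
        log_sum = sum(
            math.log((n + 1) / m * (x[i + m] - x[i])) for i in range(n - m)
        )
        total += log_sum / (n - m) - psi + math.log(m)
        psi += 1.0 / m                      # digamma(m + 1) = digamma(m) + 1/m
    return total / (n - 1)


def normal_entropy(sigma):
    """Analytic differential entropy of a normal distribution, in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


if __name__ == "__main__":
    rng = random.Random(0)
    for sigma in (0.5, 1.0, 2.0):
        for n in (50, 200, 400):
            xs = [rng.gauss(0.0, sigma) for _ in range(n)]
            est = averaged_spacing_entropy(xs)
            print(f"sigma={sigma} n={n} estimate={est:.3f} "
                  f"analytic={normal_entropy(sigma):.3f}")
```

One property worth noting in this sketch: rescaling every sample by a factor $a > 0$ shifts each log-spacing by $\ln a$, so the estimate shifts by exactly $\ln a$, just as the analytic value in (5) does. This is consistent with the offset from the analytic value being the same for every $\sigma$.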