BioMed Research International

Volume 2015, Article ID 671349, 9 pages

http://dx.doi.org/10.1155/2015/671349

## On the Estimation of Heritability with Family-Based and Population-Based Samples

^{1}The Center for Genome Science, Korea National Institute of Health, KCDC, Osong 361-951, Republic of Korea^{2}Department of Applied Statistics, Chung-Ang University, Seoul 156-756, Republic of Korea^{3}Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 151-742, Republic of Korea^{4}Chunlab Inc., Seoul National University, Seoul 151-742, Republic of Korea^{5}Department of Epidemiology, Seoul National University, Seoul 151-742, Republic of Korea^{6}Department of Preventive Medicine, Ajou University School of Medicine, Suwon 443-380, Republic of Korea^{7}Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106-7281, USA^{8}Department of Epidemiology and Biostatistics, School of Public Health & Institute of Health and Environment, Seoul National University, Seoul 151-742, Republic of Korea

Received 22 November 2014; Revised 16 April 2015; Accepted 21 April 2015

Academic Editor: Kristel van Steen

Copyright © 2015 Youngdoe Kim et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

For a family-based sample, the phenotypic variance-covariance matrix can be parameterized to include the variance of a polygenic effect that has then been estimated using a variance component analysis. However, with the advent of large-scale genomic data, the genetic relationship matrix (GRM) can be estimated and can be utilized to parameterize the variance of a polygenic effect for population-based samples. Therefore narrow sense heritability, which is both population and trait specific, can be estimated with both population- and family-based samples. In this study we estimate heritability from both family-based and population-based samples, collected in Korea, and the heritability estimates from the pooled samples were, for height, 0.60; body mass index (BMI), 0.32; log-transformed triglycerides (log TG), 0.24; total cholesterol (TCHL), 0.30; high-density lipoprotein (HDL), 0.38; low-density lipoprotein (LDL), 0.29; systolic blood pressure (SBP), 0.23; and diastolic blood pressure (DBP), 0.24. Furthermore, we found differences in how heritability is estimated—in particular the amount of variance attributable to common environment in twins can be substantial—which indicates heritability estimates should be interpreted with caution.

#### 1. Introduction

Under polygenic inheritance, the effects of segregation at single loci are assumed to be too small to estimate individually and the total genetic variance has been considered to identify the overall genetic effect underlying a trait. Genetic variance consists of additive, dominant, and epistatic components. However, the amount of dominant variance is usually assumed to be relatively small compared to the additive variance and is never identified without a family-based sample that includes bilineal relatives. Similarly, estimation of the epistatic variance (which may include additive components) requires special relationships in family data and is also assumed to be small. Therefore, the estimation of genetic variance has been confined to the additive genetic variance and, to estimate heritability, the proportion of the phenotypic variance attributable to only additive genetic variance has been used even though this can lead to biased estimation in the presence of dominant variance, epistatic variance, and gene × environmental interaction [1].

In general, a parameter allowing for additive polygenic variance can be incorporated into the phenotypic covariances between pairs of individuals, and there are two main ways for incorporating this parameterization. In the absence of population substructure, dominance or any environmental effect shared by family members, the phenotypic covariances can be expressed as a function of the kinship coefficient between family members in family-based samples. Under this parameterization, the additive polygenic variance is obtained from the covariances between family members using variance component models [2–5]. Alternatively, since the advent of large-scale genome data, which reveals similarity in genotypic background, the genetic relationships between individuals have become estimable from genome-wide data and this has also been used to identify population substructure. In the same context, the phenotypic variance explained by additive polygenic variance can also be estimated in population-based samples from the genetic relationships obtained in this way [6, 7]. In particular, the individuals in population-based samples are not closely related and share much less common environmental exposures than do the family members in family-based samples. For this reason Yang et al. [8] suggested excluding closely related individuals from the analysis when estimating heritability from population-based samples, noting that the environmental effect shared by family members seems to be inversely related to their degree of physical proximity, so that close relatives inflate any estimate of heritability.

In this paper, motivated by wishing to calculate the heritability of cardiovascular disease related traits in the Korean population, we examine to what extent estimates of heritability depend on how they are estimated. We calculate the heritability of various traits related to cardiovascular disease in a Korean population using two family-based cohorts, the healthy Twin Study, Korea (HTK) [9] and Ansung Family (ASF) cohorts, and one population-based cohort, that for the Korean Association Resource (KARE) [10] project. Comparing the heritability estimates from family-based and population-based samples, disturbing differences were found. With simulation studies we show that the meaning of heritability estimates can be affected by the absence of highly correlated samples and be substantially inflated by variance attributable to common environment. Thus heritability estimates should be interpreted with caution.

#### 2. Materials

Three cohorts, all part of the Korean Genome Epidemiology Study (KoGES) which is an ongoing prospective epidemiological study, have been utilized to estimate heritability: the KARE project [10] cohort and the HTK [9] and ASF cohorts. These cohorts were genotyped in the Korean Genome Analysis Project (KoGAP) by the Center for Genome Science in the Korea Center for Disease Control and Prevention, which was launched in Korea between 2001 and 2007.

##### 2.1. KARE Project

The KARE project, with 10,038 participants who were living in Ansung (rural) and Ansan (urban), was initiated in 2007 for large-scale genome-wide association studies (GWAS) based on the Korean population. Among the 10,038 participants, 10,004 individuals were genotyped for 500,568 SNPs with the Affymetrix Genome-Wide Human SNP array 5.0. We discarded SNPs with values for departure from Hardy-Weinberg equilibrium (HWE) less than 10^{−5}, with genotype call rates less than 95%, or minor allele frequencies (MAF) less than 0.01, leaving 350,364 SNPs for subsequent analysis. Individuals with low call rates (<95%, ), high heterozygosity (>30%, ), gender inconsistencies (), or serious concomitant illness () were excluded from analysis, along with 601 individuals related or identical whose computed average pairwise identical in state value was higher than that estimated from first-degree relatives of Korean sib-pair samples (>0.8). In total 8,842 individuals were analyzed. In 20 randomly selected duplicate samples, we found that genotype concordance rates exceeded 99.7%, with no single SNP excessively discordant.

##### 2.2. HTK Cohort

The HTK cohort was initiated to identify genetic variation responsible for complex traits as well as the role of the environment in the etiology of complex diseases. Some healthy twins in this cohort were recruited through advertisements in a nationwide newspaper and through posters in about 300 hospitals. Other twin families were selected from the large Korean Genomic Cohort Study of adult individuals and the KoGAP. Then the family members of the selected twins were recruited into this cohort. It should be noted that health status was not considered for sampling. This type of family study can be useful for detecting quantitative trait loci and genetic variations underlying common diseases [11]. Among the 2,473 participants enrolled from April 2005 to December 2008, there are 990 individuals comprising monozygotic (MZ) twins and 234 individuals comprising dizygotic (DZ) twins, and 1861 of these individuals could be genotyped with the Affymetrix Genome-Wide Human SNP array 6.0. We discarded SNPs with values for departure from HWE less than 10^{−5} or MAF less than 0.01. In addition, SNPs were excluded if Mendelian errors or double recombinants were found in at least 3 families, and in total 520,484 SNPs were used for analysis. We calculated the proportion of genotypes identical in state between individuals in each family and excluded those with any inconsistency between the genetic and reported relationship (). Also, individuals who had coding errors for MZ/DZ status () were excluded, and as a result genotypes for 1801 family members were available for analysis. Among the genotyped individuals, there are 4 pairs of MZ twins and 393 genotyped individuals whose MZ twin siblings were not genotyped. Also 84 pairs of DZ twins were genotyped, and there are 16 additional genotyped individuals whose DZ twin siblings’ genotypes were unknown. There are 162 nuclear families and 3 families consisting of individuals in three generations that include MZ/DZ twins.

##### 2.3. ASF Cohort

In the Ansung area, 5,018 unrelated and related participants were initially recruited for the KARE project; another cohort to study type 2 diabetes was initiated in this area in 2007. In this cohort, some individuals were selected from the KARE project, and their family members and other individuals from the Ansung area who were not in the KARE project were included, if they were diagnosed as having type 2 diabetes and agreed to participate in this study. This sampling scheme could lead to the presence of ascertainment bias, but the small correlations between type 2 diabetes status and the traits of interest (see Table S1 in Supplementary Material available online at http://dx.doi.org/10.1155/2015/671349) reveal that any ascertainment bias would not be substantial. In these samples, 456 individuals who were included in the KARE project were genotyped with the Affymetrix Genome-wide Human SNP array 5.0, and another 781 individuals were genotyped with the Affymetrix Genome-wide Human SNP array 6.0. Individuals were excluded if they reported relationships in the family inconsistent with the genotypic relationships estimated by the proportion of genotypes identical in state () or had unavailable trait data (). Also, SNPs were excluded if Mendelian inconsistency was found in at least 3 families, the values for HWE were less than 10^{−5}, or the MAF was less than 0.01. As a result, 784 family members with 417,719 SNPs were used for our analysis.

#### 3. Methods

To estimate heritability we used the freely available software Genome-wide Complex Trait Analysis (GCTA) [8] and the ASSOC program in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) [12] package. We considered eight traits: height, body mass index (BMI), triglycerides (TG), total cholesterol (TCHL), high-density lipoprotein (HDL), low-density lipoprotein (LDL), systolic blood pressure (SBP), and diastolic blood pressure (DBP). We included age, age^{2}, and sex as covariates. In particular, the linear mixed model for GCTA is robust to population substructure, and the EIGENSTRAT method [13] which includes PC scores as covariates was not applied. The effect of a living environment variable (urban versus rural) was not significant at the 0.05 significance level for any of the eight traits and so was not included as a covariate in the detailed analyses reported in Tables S2–S9. Quantile-quantile plots in Figures S1-S2 indicate that TG is not normally distributed and log-transformed TG (log TG) was used to obtain approximate normality. For the other phenotypes, the original scales were used because heritability estimates on the original scale and after inverse-normal transformation were almost the same and interpretation is not straightforward for the inverse-normal transformed data. The missing rates for each phenotype were calculated (see Table S10) and were usually very small. ASSOC parameterizes the phenotypic correlations between individuals using the reported familial relationships and can split the nonpolygenic variance into components for measurement error, sibling, and marital effects, and these results were summarized. GCTA estimates heritability by parameterizing phenotypic correlations with the estimated genetic relationship matrix (GRM) from the standardized genotypes. In particular, the results from GCTA were obtained with and without the default GRM-cutoff option. In addition, we separately analyzed monozygotic (MZ) and dizygotic (DZ) twin data, to estimate the relative proportion of the phenotypic variance attributable to common environmental effects.

##### 3.1. Heritability Estimation Using Familial Relationships

Under the multivariate normality model, the covariance between family members can be expressed as a function of their kinship coefficients and this can be utilized to estimate heritability. We estimated the heritability from the family data, separately in the HTK and ASF cohorts, with the ASSOC program in S.A.G.E. (ver. 6.2) [12]. ASSOC is based on a linear mixed model and the parameters are estimated by the maximum likelihood (ML) method. Let denote the response for individual in family , where and ; and indicate the number of families and the number of individuals in family , respectively. Also, let indicate covariates that affect . Then, denoting as the kinship coefficient between individual and individual in family , we letWe denote the additive polygenic, dominant polygenic, and random error variances, respectively, by , , and . If we also denote the identity matrix by , the linear model used in ASSOC for random mating and only additive effects is will be called the familial relationship matrix (FRM) in the remainder of this paper. Furthermore, ASSOC can estimate the variances separately attributable to polygenic, common sibship, and marital effects as described by Elston et al. [14].

S.A.G.E. ASSOC was used to estimate heritability in the family-based HTK and ASF cohorts and, for a fair comparison with GCTA, only genotyped individuals were analyzed this way. In the HTK cohort, 1801 genotyped individuals were considered, and there are 4 pairs of MZ twins among those genotyped. S.A.G.E. cannot easily handle MZ twins, and a single individual for each MZ twin was randomly selected for analysis with both ASSOC and GCTA. There is other software available that can handle MZ twins [15–17] in pedigrees, but this was not considered because the number of genotyped MZ twins is very small and so the variance attributable to common environment could not be well estimated in these cohorts using the GRM. We used the program PEDINFO in S.A.G.E. to provide descriptive statistics of the pedigree data.

##### 3.2. Heritability Estimation Using Estimated Genetic Relationships

When large-scale genotypes are available, the GRM can be estimated with the software GCTA [8] and, instead of the FRM, the estimated GRM can be incorporated into the same linear mixed model (2) as available in ASSOC, to estimate . The minor allele frequencies for GRM were estimated by using all individuals even when some individuals were correlated. Because the genetic relationship is estimated with genotypes, GCTA can be applied to both family-based and population-based samples. In addition, the GCTA program can estimate the variance components by both the restricted maximum likelihood (REML) and ML methods. The REML method provides more unbiased estimates of the variance components than the ML method. Therefore we estimated heritability by the REML method when applying GCTA to the KARE project, the HTK cohort, and the ASF cohort, though for these large samples the difference in the estimates is expected to be trivially small. Yang et al. [8] suggested excluding closely related individuals from the analysis when estimating genetic variation captured by all the SNPs, using a GRM-cutoff option. However, for the analysis of family-based samples, family members are highly correlated and most individuals become excluded from the analysis if the GRM-cutoff option for individual selection is activated. We report the results of both with and without the GRM-cutoff option, and we used 0.025 as the GRM-cutoff.

##### 3.3. Estimating Familial Correlations with S.A.G.E.

FCOR in S.A.G.E. [12] can estimate familial correlations for all pair types existing in a set of pedigrees. FCOR cannot handle the effect of covariates, and thus for height, BMI, log TG, TCHL, HDL, LDL, SBP, and DBP, we calculated the residuals from the linear model with age, age^{2}, and sex as covariates. Residuals from this linear model were used to estimate the empirical correlations between family members and their 95% confidence intervals with FCOR in S.A.G.E.

##### 3.4. Estimating Variance Attributable to the Common Environment with Twins

If we assume an additive model with no interaction, the phenotypic variance consists of the genetic variance and a common environmental variance component. However, the variance for environmental effects shared by family members is in general unidentifiable. If we further assume that the amount of covariance between MZ twins attributable to a common environmental effect is similar to that between DZ twins [18] and that any dominant or epistatic polygenic effects are relatively small compared to the additive genetic and common environmental effects, the covariance attributable to the common environmental effect can be estimated.

We separated out all the MZ and DZ twins, whether genotyped or not, from the HTK cohort, so that the members in each family are always either MZ or DZ twins in this analysis. In total, 958 individuals (479 pairs) comprising MZ twins and 224 individuals (112 pairs) comprising DZ twins were analyzed. If we denote the common environmental variance by , the polygenic model provides the following variance-covariance structure between twins:To construct this variance-covariance structure for MZ and DZ twins in our linear mixed model, we denote , , and . We define two matrices and as follows:Then, our linear model becomesHere and should be between −1 and 1. We used the REML method to estimate variance parameters, and each parameter was estimated by the average information method [19, 20]. R code for the proposed method can be downloaded from http://healthstat.snu.ac.kr/data/heritability_Rcode.zip. It is simple to show that, ignoring any epistatic effects, is and, if we assume that , becomes and the proportion of variance attributable to common environment, , can be calculated by . If we let , the Fisher information matrix for , , and can be obtained bywhere is a sample size and is the number of covariates. Thus the variance of can be obtained by . Provided the environmental correlation is the same for both MZ and DZ twins, this estimate can be utilized as a lower bound for the variance attributable to the environmental effects shared by siblings.

##### 3.5. Simulation Studies

With extensive simulation studies, we investigated the accuracy of heritability estimates for various scenarios. We generated 5000 pairs of individuals with 100,000 SNPs, and heritability was estimated by GCTA without activating the GRM-cutoff. The individuals in different pairs were generated to be independent and the correlations of genotypes, , between individuals in each pair were generated to be , or 0. A pair of individuals with indicates siblings or a parent-offspring pair. To generate pairs of individuals with correlation of genotypes , randomly selected alleles from two individuals were generated to be identical by descent with probability . The minor allele frequencies were generated from and genotypes were generated with the binomial distribution under Hardy-Weinberg equilibrium (HWE). Monomorphic variants were excluded from the analyses, and all markers were assumed to be in linkage equilibrium. If there are too many redundant SNPs in linkage disequilibrium with the causal variants, the empirical standard deviation of heritability estimates can be inflated and the analysis with GCTA should be modified as indicated by Speed et al. [21].

The traits were generated by summing a polygenic effect and a random effect. The random effect was generated from . To create a polygenic effect we simulated 100 independent causal SNPs and we assumed that all or 50 randomly selected ones of these causal SNPs were genotyped. The additive disease mode of inheritance was assumed and a single SNP genetic effect is denoted by . Letting be the allele frequency for causal SNP and heritability be , the genetic effect, , was calculated asHere were generated from or , respectively, and the genetic effects , for the 100 causal SNPs, were taken to be equal. was assumed to be 1 and was taken to be 0.1, 0.3, 0.5, 0.7, or 0.9.

#### 4. Results

##### 4.1. Estimates of Heritability in a Korean Population

Table 1 shows the descriptive statistics for eight traits: height, BMI, log TG, TCHL, HDL, LDL, SBP, and DBP. Interquartile ranges for these traits show that the traits in the three cohorts are comparable. We calculated heritabilities in the HTK, ASF, and KARE cohorts separately, and they were also combined to calculate overall heritabilities by pooling the samples and including two dummy (0/1) covariates to adjust for the effects of each sample. Table 2 shows that the heritability estimates from the pooled samples with GCTA were, for height, 0.60; BMI, 0.32; log TG, 0.24; TCHL, 0.30; HDL, 0.38; LDL, 0.29; SBP, 0.23; and DBP, 0.24. In each case these heritability estimates are between the limits of those from the individual KARE, HTK, and ASF cohorts. Tables S2–S9 show that, in the samples where both GCTA and S.A.G.E. can be applied, the heritability estimates from S.A.G.E. and GCTA are usually comparable. GCTA estimates heritability with the REML method based on an estimated GRM, while S.A.G.E. estimates heritability with the ML method based on the FRM. The estimates from the REML and ML methods must be very similar for a large sample size, and thus the convergence of the estimated GRM to FRM [22] explains their similarity.