Abstract

In linkage analysis for mapping genetic diseases, the transmission/disequilibrium test (TDT) uses the linkage disequilibrium (LD) between some marker and trait loci for precise genetic mapping while avoiding confounding due to population stratification. The sib-TDT (S-TDT) and combined-TDT (C-TDT) proposed by Spielman and Ewens can combine data from families with and without parental marker genotypes (PMGs). For some families with missing PMG, the reconstruction-combined TDT (RC-TDT) proposed by Knapp may be used to reconstruct missing parental genotypes from the genotypes of their offspring to increase power and to correct for potential bias. In this paper, we propose a further extension of the RC-TDT, called the reconstruction-combined transmission disequilibrium/heterogeneity (RC-TDH) test, to take into account the identical-by-descent (IBD) sharing information in addition to the LD information. It can effectively utilize families with missing or incomplete parental genetic marker information. An application of this proposed method to Genetic Analysis Workshop 14 (GAW14) data sets and extensive simulation studies suggest that this approach may further increase statistical power which is particularly valuable when LD is unknown and/or when some or all PMGs are not available.

1. Introduction

Genetic linkage analysis is an important step in localizing and identifying the genes on the chromosomes that underlie many human diseases and other traits of interest. A brief overview of commonly used statistical methods for linkage analysis, including recently developed model-free and model-based methods for mapping qualitative- and quantitative-trait loci, can be found in Shao [1]. For more extensive discussions of linkage analysis, readers can consult Ott [2].

Mapping genes that underlie complex diseases is of great current interest. The essence of linkage analysis is to identify statistical association between the inheritance of a complex genetic disease phenotype and the inheritance of specific pieces of genetic material (called marker alleles). Many complex diseases, including cancers, have a heritable component. For marker alleles that are associated with the inheritance of complex diseases, the transmission probabilities of a marker allele of interest commonly vary across heterozygous parents, owing to locus heterogeneity, etiological heterogeneity, and many other complexities and/or combinations of them [3, 4]. Under such transmission heterogeneity, the transmission likelihood generally has the form of a mixture model with many parameters [4, 5]. It can be shown that the efficient score test of such a mixture likelihood has two parts: one related to transmission disequilibrium, reflected by the existence of linkage disequilibrium (LD), and the other related to transmission heterogeneity, in the form of excess dispersion in the sharing of genetic markers as might be inferred from identical-by-descent (IBD) patterns (e.g., allele-sharing patterns among affected sib-pairs).

The transmission/disequilibrium test (TDT) developed by Spielman et al. [6] uses the LD information between some marker and disease loci for precise genetic mapping while avoiding confounding due to population stratification. It has been extended in multiple directions to meet the need for mapping complex traits, for example [7, 8]. In particular, missing parental genetic marker genotypes are very common for studying diseases with late onset. The sib-TDT (S-TDT) and combined-TDT (C-TDT) proposed by Spielman and Ewens [9] can deal with families without parental marker genotypes (PMGs) and can combine with data from families having PMG available. For some families with missing PMG, the reconstruction-combined TDT (RC-TDT) proposed by Knapp [10, 11] may be used to reconstruct missing PMG from the genotypes of their offspring to increase power of the C-TDT with a correction for potential bias in using reconstructed PMG [12].

An attractive feature of the RC-TDT is that it utilizes the missing PMG that can be uniquely determined from the genotypes of the children and corrects the potential biases resulting from using reconstructed PMG by employing the appropriate null expectation and variance, supplied in Tables 1 and 2 of Knapp [10]. Like the TDT and C-TDT, the RC-TDT is powerful only when there is strong LD. LD is usually unknown and difficult to measure, so it is generally desirable to combine LD information with allele-sharing information obtained from IBD patterns [5, 13].

For fine mapping of complex genetic disorders, Shao [4] derived a general mixture likelihood for allele transmission under various transmission disequilibrium and/or heterogeneity and further proposed a transmission disequilibrium/heterogeneity (TDH) test to efficiently combine the transmission disequilibrium and heterogeneity information to maximize the power for detecting linkage using genetic data from nuclear families. The TDH test was shown to be an efficient score test of the general mixture likelihood derived in Shao [4] which is a summation of two parts, a transmission/disequilibrium test (TDT) part which utilizes the LD information and a transmission heterogeneity test (THT) part that utilizes IBD-sharing information. To see that the THT utilizes IBD-sharing information, it should be pointed out that general mixture likelihood contains the mixture binomial likelihood discussed in Huang and Jiang [13] and Lo et al. [5], and the test statistic of the classical mean test for affected sib-pairs (ASPs) is a special case of the THT statistic with in Shao [4]. The classical mean test for affected sib-pairs is the most well-known IBD sharing-based linkage test [14]. The THT is applicable to general sibship and thus can be regarded as an extension of the classical mean test for affected sib-pairs.

In practice, parental marker genotypes are often incomplete in genetic studies, particularly for late-onset diseases. Using only families with complete parental marker genotype information would discard a large portion of the useful data and can also introduce bias. It is thus crucial to make the TDH test applicable to families with missing or incomplete parental marker genotype information. In this paper, we develop a transmission disequilibrium/heterogeneity test with parental-genotype reconstruction, which utilizes both the LD information and the IBD-sharing information and can combine families with or without PMG information.

The transmission disequilibrium/heterogeneity test with parental-genotype reconstruction (RC-TDH) is introduced in the next section. In Section 3, the RC-TDH test is applied to a data set from GAW14, and the results are compared with those of the RC-TDT. Finally, in Section 4, simulation studies that use common genetic models [5, 15] are carried out to compare the power and the true size of the RC-TDT and the RC-TDH test. The numerical results suggest that the RC-TDH test may greatly increase statistical power, which is particularly valuable whenever LD levels are unknown and/or PMG information is missing, as in studies of diseases with late age of onset.

The main comparison made in this paper is between the RC-TDT and the RC-TDH. We do not formally compare them with the classical IBD-based linkage tests such as those implemented in Genehunter and other software. The rationale is as follows. We are mainly interested in fine mapping of genetic variants that underlie complex diseases, where the classical linkage tests are known to have low power because they do not utilize LD information effectively. With the rapid advancement of biotechnology, it is now feasible and affordable to use dense genetic markers, for example, single nucleotide polymorphisms (SNPs), for genomewide linkage scans. With a large number of dense genetic markers, some of the markers can be expected to fall into the LD block of the causal genetic variants, so LD would generally exist to some degree for many markers. The TDT and TDH tests would therefore have a power advantage over classical linkage tests, which effectively utilize only the IBD information.

2. Method

2.1. Notation

It will be assumed that there are two alleles and at the marker locus, and allele is of particular interest. Let denote the number of affected children, let denote the number of unaffected children, and let denote the size of the sibship for family . In each family, all children have been typed at the marker locus, but the PMG may or may not be available. Let be random variables, denoting the number of affected (or unaffected) children with genotype in family . Small letters (i.e., and ) are used to denote the observed values of and . Further, let and denote the random variable and the observed number of children with genotype in family , respectively. denotes the number of alleles in affected children (i.e., ). The notation introduced here is consistent with Knapp [10, 11] and Han [16].

2.2. The TDH Test with Complete PMG

For completeness, we first consider the case when PMG are observed along with children's marker genotypes. Let be the number of alleles transmitted by the th marker heterozygous parent to the affected children. When the exact number of marker alleles transmitted to affected children cannot be determined as might happen in families with two heterozygous parents, then can be used to replace . Using in families with ambiguous transmissions, the TDT statistic can be written as where The transmission heterogeneity test (THT) statistic is denoted as where where the moments of under given the parental marker genotypes (PMGs) are summarized in Table 1.
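As a point of reference for the LD part of the statistic, the following minimal sketch computes the classical TDT of Spielman et al. [6] from transmission counts of heterozygous parents. It is not the TDH statistic of this section; the function name and the arguments `b` and `c` are hypothetical and introduced only for illustration. Under the null hypothesis each transmission from a heterozygous parent is treated as a fair coin flip, which gives the standardized form used below.

```python
import math


def classical_tdt(b, c):
    """Classical TDT from heterozygous-parent transmissions (illustrative).

    b: times the allele of interest was transmitted to an affected child
    c: times the other allele was transmitted to an affected child
    Returns (z, chi_square) with chi_square = z**2 on 1 degree of freedom.
    """
    n = b + c
    if n == 0:
        raise ValueError("no informative (heterozygous-parent) transmissions")
    # Under the null, b ~ Binomial(n, 1/2): mean n/2, variance n/4,
    # so (b - n/2) / sqrt(n/4) simplifies to (b - c) / sqrt(b + c).
    z = (b - c) / math.sqrt(n)
    return z, z * z


z, chi_square = classical_tdt(60, 40)
print(z, chi_square)
```

The squared form is the familiar McNemar-type statistic; the signed form is convenient when a one-sided alternative for a specific allele is of interest.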

The transmission disequilibrium/heterogeneity (TDH) test is based on the following test statistic [4]:

In terms of statistical optimality, it can be shown that the TDH test is the efficient score test from the mixture likelihood function under transmission disequilibrium and heterogeneity [4]. In theory, the efficient score test is known to be locally most powerful.

2.3. The Reconstruction-Combined TDH (RC-TDH) Test

When at least one parent has a missing PMG, Knapp [10] proposed the reconstruction-combined TDT (RC-TDT), which reconstructs the PMG from the genotypes of the offspring and corrects for the biases resulting from using reconstructed PMG. To improve the power to detect linkage, we propose the reconstruction-combined TDH test (RC-TDH) using the following test statistic: where denotes the number of marker alleles in affected children, and , denote the appropriate null expectation and variance of , respectively, as can be found in Tables 1 and 2 of Knapp [10]. In the RC-TDH statistic, the first term is the RC-TDT statistic of Knapp [10] and the second term is the RC-THT statistic with the restriction. To get the appropriate null expectation , we need to derive the conditional distribution of given the constraint for reconstruction .

When one parental genotype is missing and reconstructible, the conditional probabilities of are listed in Table 2. Note that the family index has been dropped in the formula in Table 2. In the first column, the first parental genotype is typed and the second one is reconstructed. The second column presents a necessary and sufficient condition, for the observed marker genotypes in the offspring, to allow reconstruction of the parental genotypes. The details of the derivation are provided in Han [16].

When both parental genotypes are missing, the reconstruction condition and the conditional probabilities of are the same as in the case where one parental genotype is missing and the known parental genotype is .
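The idea of a uniquely reconstructible parental genotype can be illustrated by brute-force enumeration. The sketch below is only an illustration under stated assumptions: the allele labels `A` and `B`, the genotype strings, and the function names are hypothetical, and reconstruction is taken to mean that exactly one parental mating type is compatible with Mendelian inheritance of the observed offspring genotypes. The exact conditions used by the test are those listed in Table 2 and in Knapp [10].

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB")  # hypothetical labels for a biallelic marker


def possible_children(parent1, parent2):
    """All offspring genotypes that two parental genotypes can produce."""
    return {"".join(sorted(a + b)) for a in parent1 for b in parent2}


def reconstruct_parents(children, typed_parent=None):
    """Return the unique compatible parental pair, or None if not unique.

    children: iterable of offspring genotypes, e.g. ["AA", "AB"]
    typed_parent: genotype of the one typed parent, or None if both missing
    """
    observed = set(children)
    first_options = (typed_parent,) if typed_parent else GENOTYPES
    candidates = set()
    for p1, p2 in product(first_options, GENOTYPES):
        if observed <= possible_children(p1, p2):
            # With both parents missing, parental order carries no information.
            pair = (p1, p2) if typed_parent else tuple(sorted((p1, p2)))
            candidates.add(pair)
    return candidates.pop() if len(candidates) == 1 else None


print(reconstruct_parents(["AA", "AB"], typed_parent="AA"))
print(reconstruct_parents(["AB", "AB"], typed_parent="AA"))
print(reconstruct_parents(["AA", "BB"]))
```

In this sketch, a typed homozygous parent with both homozygous and heterozygous offspring forces the missing parent to be heterozygous, while offspring of a single genotype leave the missing parent ambiguous.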

When at least one parental genotype is missing and cannot be reconstructed, but the condition for the S-TDT is satisfied (i.e., there is at least one affected and at least one unaffected child in this family, not all of the children possess the same genotype), the distribution of can be calculated using the affected and unaffected children genotypes by the hypergeometric distribution. The details are provided in the Appendix section.
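For families handled through the S-TDT condition, the null distribution of the allele count in affected children can be obtained by treating the affected sibs as a random draw, without replacement, from the sibship. The following sketch enumerates that distribution exactly; the function and argument names are hypothetical, and the code is meant as an illustration of the hypergeometric argument rather than as the authors' implementation.

```python
from fractions import Fraction
from math import comb


def allele_count_null_distribution(n_hom, n_het, n_sibs, n_affected):
    """Exact null distribution of the number of copies of the allele of
    interest carried by the affected sibs.

    n_hom: sibs homozygous for the allele of interest (2 copies each)
    n_het: heterozygous sibs (1 copy each)
    n_sibs: sibship size; n_affected: number of affected sibs
    """
    n_other = n_sibs - n_hom - n_het
    total = comb(n_sibs, n_affected)
    dist = {}
    for x_hom in range(min(n_hom, n_affected) + 1):
        for x_het in range(min(n_het, n_affected - x_hom) + 1):
            x_other = n_affected - x_hom - x_het
            if x_other > n_other:
                continue
            ways = comb(n_hom, x_hom) * comb(n_het, x_het) * comb(n_other, x_other)
            y = 2 * x_hom + x_het  # allele copies among affected sibs
            dist[y] = dist.get(y, 0) + Fraction(ways, total)
    return dist


def mean_and_variance(dist):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(y * p for y, p in dist.items())
    var = sum((y - mean) ** 2 * p for y, p in dist.items())
    return mean, var


dist = allele_count_null_distribution(n_hom=1, n_het=2, n_sibs=4, n_affected=2)
print(mean_and_variance(dist))
```

Exact rational arithmetic is used so that the null mean and variance, which enter the standardized statistic, carry no rounding error for small sibships.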

As in C-TDT and RC-TDT, families not belonging to the previous categories will be ignored.

3. Application to Genetic Analysis Workshop 14 Data

The proposed RC-TDH test was applied to a Genetic Analysis Workshop 14 (GAW14) dataset to compare its power with that of the RC-TDT. The GAW14 simulated data were generated by Dr. David Greenberg. A behavioral disorder was simulated in multiple replicates of four different populations/groups. There are 100 families in each of the Aipotu, Karnagar, and Danacaa data sets, and there are 100 replicates for each data set. The power comparison of the RC-TDH with the RC-TDT for the linkage between the trait b disease allele and the marker B01T0561 is presented in Table 3. This trait has incomplete penetrance with . Application of the RC-TDH is illustrated in Table 3 with 50% and 100% missing parental genotypes. Power is evaluated at a nominal type I error rate of 0.05.

4. Simulation

4.1. Simulation Set-Up

Simulation studies are conducted to compare the powers of the proposed RC-TDH test with the RC-TDT. To attain the correct type I error rates, we directly simulated the critical values under the null hypothesis of no linkage, in which (recombination frequency) = 0.5. In the simulations for the null distribution, 1,000,000 replicates of samples of nuclear families are generated and the empirical critical values are obtained. Based on 500 independent replicates and the empirical critical values, we estimate the power of the tests using the relative frequencies of the simulated test statistics which exceed the empirical critical values.
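The calibration step described above can be sketched as follows. The sketch assumes a one-sided test that rejects for large values of the statistic, which is an assumption made here for illustration; the function names are hypothetical, and the inputs are simply lists of simulated test statistics.

```python
import math


def empirical_critical_value(null_stats, alpha):
    """Upper-tail empirical critical value from null-simulated statistics.

    Chosen so that at most floor(alpha * n) of the n null statistics
    strictly exceed the returned value.
    """
    ordered = sorted(null_stats)
    n = len(ordered)
    n_exceed = math.floor(alpha * n)
    if n_exceed >= n:
        raise ValueError("alpha too large for the number of null replicates")
    return ordered[n - n_exceed - 1]


def empirical_power(alt_stats, critical_value):
    """Fraction of alternative-simulated statistics exceeding the cutoff."""
    return sum(s > critical_value for s in alt_stats) / len(alt_stats)


# Toy numbers only; real use would pass the simulated test statistics.
null_stats = [1, 2, 3, 4, 5, 6, 7, 8]
cutoff = empirical_critical_value(null_stats, alpha=0.25)
print(cutoff, empirical_power([5, 7, 9, 10], cutoff))
```

Applying `empirical_power` to an independent set of null replicates gives an estimate of the true type I error rate in the same way.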

To generate the family-based data, as in earlier work [5], we consider two biallelic loci: one disease locus (with disease allele and normal allele ) and one marker locus (with allele and ). The frequency for disease allele is and for marker allele is . The linkage disequilibrium is the deviation of the frequency of haplotype from its equilibrium value (expected by chance). Define the parameter as In our simulations, we assume is the allele in with . Thus, the range of the parameter is in , in which 0 indicates linkage equilibrium. There are three penetrance parameters, , , and , corresponding to three possible disease genotypes.
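To make the two-locus set-up concrete, the sketch below builds the four haplotype frequencies from the two allele frequencies and the raw LD deviation, that is, the difference between the frequency of the disease-allele/marker-allele haplotype and the product of the two allele frequencies. The function names and dictionary keys are hypothetical, and the code works with the unnormalized deviation only; any rescaling of that deviation to a fixed range is left to the definition given in the text.

```python
def ld_bounds(p_disease, p_marker):
    """Smallest and largest raw LD deviation compatible with the
    allele frequencies (all haplotype frequencies must be >= 0)."""
    p, q = p_disease, p_marker
    lower = -min(p * q, (1 - p) * (1 - q))
    upper = min(p * (1 - q), (1 - p) * q)
    return lower, upper


def haplotype_frequencies(p_disease, p_marker, deviation, eps=1e-12):
    """Haplotype frequencies for two biallelic loci given the raw deviation."""
    p, q = p_disease, p_marker
    lower, upper = ld_bounds(p, q)
    if not (lower - eps <= deviation <= upper + eps):
        raise ValueError("deviation incompatible with the allele frequencies")
    return {
        ("disease", "marker"): p * q + deviation,
        ("disease", "other"): p * (1 - q) - deviation,
        ("normal", "marker"): (1 - p) * q - deviation,
        ("normal", "other"): (1 - p) * (1 - q) + deviation,
    }


# Linkage equilibrium corresponds to a deviation of zero.
print(haplotype_frequencies(0.1, 0.5, 0.0))
print(ld_bounds(0.1, 0.5))
```

With a disease allele frequency of 0.1 and a marker allele frequency of 0.5, the largest positive deviation places every disease allele on the marker-allele haplotype.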

Simulation study 1 closely followed the approach used by Boehnke and Langefeld [15]. For each model, a disease prevalence of 5% was assumed. The disease allele frequency that resulted from each of the disease models can be calculated by . Summary of the parameters used in this simulation study is in Table 4.
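One way to obtain a disease allele frequency that matches a target prevalence is to solve the Hardy-Weinberg prevalence equation numerically. The sketch below assumes Hardy-Weinberg proportions at the disease locus and penetrances that do not decrease with the number of disease alleles; both are assumptions made here for illustration, and the function and argument names are hypothetical.

```python
def prevalence(p, f_two, f_one, f_zero):
    """Population prevalence under Hardy-Weinberg proportions.

    p: disease allele frequency
    f_two, f_one, f_zero: penetrances for 2, 1, and 0 disease alleles
    """
    return p * p * f_two + 2 * p * (1 - p) * f_one + (1 - p) ** 2 * f_zero


def disease_allele_frequency(target, f_two, f_one, f_zero, tol=1e-12):
    """Solve prevalence(p) = target for p in [0, 1] by bisection."""
    if not (f_zero <= f_one <= f_two):
        raise ValueError("sketch assumes f_zero <= f_one <= f_two")
    if not (f_zero <= target <= f_two):
        raise ValueError("target prevalence not attainable")
    lo, hi = 0.0, 1.0
    # prevalence() is non-decreasing in p under the ordering checked above.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prevalence(mid, f_two, f_one, f_zero) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Fully penetrant dominant and recessive models at 5% prevalence.
print(disease_allele_frequency(0.05, 1.0, 1.0, 0.0))
print(disease_allele_frequency(0.05, 1.0, 0.0, 0.0))
```

Bisection is used instead of the quadratic formula so that the same code covers degenerate cases in which the quadratic coefficient vanishes.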

The parameters used in simulation study 2 are summarized in Table 5. Four commonly used disease models are considered: dominant (), additive (), multiplicative (), and recessive () models.
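The four modes of inheritance differ only in how the heterozygote penetrance relates to the two homozygote penetrances. The sketch below uses the conventional definitions (heterozygote equal to the high-risk homozygote, to the arithmetic mean, to the geometric mean, or to the low-risk homozygote); these conventions are assumed here for illustration, and the function name and arguments are hypothetical.

```python
import math


def heterozygote_penetrance(model, f_two, f_zero):
    """Heterozygote penetrance implied by a mode of inheritance.

    f_two: penetrance with two disease alleles
    f_zero: penetrance with no disease allele
    Conventional definitions are assumed for each model.
    """
    if model == "dominant":
        return f_two
    if model == "additive":
        return (f_two + f_zero) / 2
    if model == "multiplicative":
        return math.sqrt(f_two * f_zero)
    if model == "recessive":
        return f_zero
    raise ValueError(f"unknown model: {model}")


for model in ("dominant", "additive", "multiplicative", "recessive"):
    print(model, heterozygote_penetrance(model, f_two=1.0, f_zero=0.01))
```

With homozygote penetrances of 1 and 0.01, the four models span heterozygote penetrances from 0.01 up to 1.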

4.2. Simulation Results

Table 6 presents estimates of the critical values for RC-TDH at significance levels of .05, .01, and .001. Table 7 presents the estimates of the true type I error rate, at nominal significance levels of .05, .01, and .001. The simulations support the validity of approximating the null distribution with a standard normal distribution for RC-TDT.

The results of simulation study 1 are shown in Table 8. The disease models are denoted by “,” “,” and “” for the mode of inheritance (i.e., dominant, additive, and recessive); “1” and “2” for the value of (i.e., 1.0 and 0.5). The presented results come from the simulations with 4 sibs in each family, which have the same trend as those with 2 or 6 sibs in each family. In instances for which there is no parental genotype information available, application of the RC-TDH instead of the RC-TDT results in a consistent gain of power, especially when linkage disequilibrium is weak.

We conducted simulation study 2 to compare the power of the proposed RC-TDH test with that of the RC-TDT as a function of linkage disequilibrium in different scenarios based on Table 5, such as tight linkage versus weak linkage and full penetrance versus incomplete penetrance. Each simulated sample consists of families with an identical number of sibs () in each family (with ), which are ascertained on the basis of the presence of an affected child. Each sample consists of a total of 600 children. Half of the 200 families have complete PMG, and the other half have no PMG. To assess the power of the tests, 500 replicate samples are generated under each simulation scenario. For each replicate sample, the statistics of the proposed RC-TDH and of the RC-TDT were calculated.

To compare power of the RC-TDH with that of the RC-TDT at different levels, we set the range of between 0 and 1, recombination fraction at 0.01, the frequency of allele at 0.1, the frequency of allele at 0.5, penetrance for genotype at full penetrance 1, penetrance for genotype at 0.01, and then the penetrance for genotype can be determined by the modes of inheritance. The results in Table 9 and Figure 1 show that the power increases with , and the proposed RC-TDH is more powerful than RC-TDT, especially when is weak as in scenario 1 of Table 4.

Penetrance is the conditional probability of observing a phenotype given a specified disease genotype. In scenario 1, we set (the penetrance for a subject whose marker genotype is ) at 1, which is an idealistic penetrance. To compare the power of the proposed RC-TDH with that of its competitor under different penetrance, is varied from full penetrance to incomplete penetrance 0.5, which is more realistic. The results in Table 9 and Figure 2 show that the proposed RC-TDH has better power than RC-TDT with half penetrance for genotype individuals as in scenario 5 of Table 5.

In summary, our simulation results show that the proposed RC-TDH is generally more powerful than RC-TDT for a broad range of , the tightness of the linkage, and across disease models.

5. Discussion

For mapping complex diseases, it is common that the transmission probabilities of a marker allele of interest vary across heterozygous parents, due to locus heterogeneity, etiological heterogeneity, and many other complexities and/or combinations of them [3, 4]. Under such transmission heterogeneity, the transmission likelihood generally has the form of mixture models with many parameters, and the efficient score test has two parts in the form of a TDH test [4]. This paper studies a TDH test which allows the inclusion of reconstructed parental marker genotype data and extends the RC-TDT of Knapp [10, 11]. The proposed new approach was validated by simulation studies and GAW14 data sets, and the results indicate that the new approach might improve the power of family-based linkage analysis for a broad range of . Moreover, the simulation studies also indicate that the systematic power advantage of the RC-TDH test over the RC-TDT holds regardless of the underlying genetic models (e.g., recessive, dominant, additive, multiplicative).

Similar to RC-TDT, the new approach can utilize the missing parental information that can be reconstructed from the child genotypes, especially including some families with genotype-concordant or phenotype-concordant sibs. In addition, the proposed test is a sibship-oriented method which does not require specification of the underlying genetic model; it naturally uses the multiple siblings by considering the sibship as a whole. The second part of the RC-TDH statistic, the THT part of the test statistic, is based on information from IBD. This is quite obvious in the situation of affected sib-pairs, where the THT is essentially equivalent to the so-called mean test [4, 13].

Many other linkage analysis tests such as the tests implemented by Genehunter have relatively low power with respect to TDT or TDH when is present. In reality, some degree of is often present particularly when we use dense genetic markers (e.g., SNPs) along the genome because they are available at increasingly cheaper cost, and these dense markers are already very affordable. With a large number of dense genetic markers, some markers may be expected to fall into the block of the causal variants. When using these affordable dense markers along the genome or candidate gene regions, we believe that RC-TDH will have better chance of success than the classical IBD-based linkage methods in detecting linkage signals along the genome.

As high-density SNP arrays become increasingly affordable to researchers, genomewide linkage studies are becoming common. Our TDH test has a simple closed-form test statistic that is computationally easy to evaluate, in addition to good overall power across a broad range of . Thus the proposed method is potentially useful for genomewide linkage analysis. In contrast, the likelihood ratio test for a mixture likelihood is generally computationally intensive [5, 17]. Many existing linkage tests and algorithms, such as the likelihood ratio test discussed in Lo et al. [5], would be too computationally intensive for genomewide studies or when the number of genotyped markers is large.

It is possible to further extend the method to be applicable to markers with more than two alleles, which would be of great interest in studying haplotypes of multiple loci. However, our proposed tests are already applicable to the commonly used biallelic markers; for instance, the widely used single nucleotide polymorphisms (SNPs) are convenient biallelic markers.

Appendix

A. Computational Details for the RC-TDH Test

When no parent has been typed, the conditional probability has been derived in equation (A.6) of Knapp [10]. When only one parent has been typed as , the same constraint for reconstruction applies, so (A.6) of Knapp [10] also applies. Next we derive the conditional probability when only one parent has been typed as . The case in which only one parent has been typed as follows by symmetry between and .

A.1. One Parental Genotype Has Been Typed as

Note that the family index has been dropped in the following formula.

Only one parental genotype has been typed, which is , but the genotype of the missing parent can be reconstructed as , if there is at least one child with genotype and at least one child with genotype . Here, the condition is and . To calculate the conditional distribution of , we first calculate the probability of satisfying the constraint for reconstruction, :

Then we calculate the joint probability of and :

There are three cases for the calculation: case 1: , , case 2: , , case 3: , .

Therefore the distribution of conditioned on is

A.2. At Least One Parental Genotype Is Missing and Cannot Be Reconstructed, but the Condition for the S-TDT Is Satisfied

In a sibship with affected and unaffected sibs, the total number of sibs is . Suppose that in this sibship the number of sibs who are of genotype is and the number of sibs who are of genotype is . Let be the number of sibs and let be the number of sibs who are classified as affected. As discussed in Spielman and Ewens [9], given the totals , , , , and , the numbers , can be regarded as two entries in a contingency table with marginal totals , , , , and . Therefore, the distribution of can be obtained by the generalized hypergeometric distribution [18, page 47]. More specifically, we have

More formulas for parental marker genotype reconstruction probabilities under various missing genotype types and constraints, as well as detailed derivations of these formulas, can be found in Han [16].

Acknowledgments

This research was partially supported by a Stony Wold-Herbert Foundation grant, the MPD Research Consortium Project Grant (1P01 CA108671), and the New York University Cancer Center Supporting Grant (2P30 CA16087) and by the NYU NIEHS Center Grant (5P30 ES00260). The research of JH was carried out as part of her Ph.D. dissertation work at New York University.