Research Article | Open Access
A Statistical Variance Components Framework for Mapping Imprinted Quantitative Trait Locus in Experimental Crosses
Current methods for mapping imprinted quantitative trait locus (iQTL) with inbred line crosses assume fixed QTL effects. When an iQTL segregates in experimental line crosses, combining different line crosses with similar genetic background can improve the accuracy of iQTLs inference. In this article, we develop a general interval-based statistical variance components framework to map iQTLs underlying complex traits by combining different backcross line crosses. We propose a new iQTL variance partition method based on the nature of marker alleles shared identical-by-decent (IBD) in inbred lines. Maternal effect is adjusted when testing imprinting. Efficient estimation methods with the maximum likelihood and the restricted maximum likelihood are derived and compared. Statistical properties of the proposed mapping strategy are evaluated through extensive simulations under different sampling designs. An extension to multiple QTL analysis is given. The proposed method will greatly facilitate genetic dissection of imprinted complex traits in inbred line crosses.
The genetic architecture of complex phenotypes in agriculture, evolution, and biomedicine is generally complex involving a network of multiple genetic and environmental factors that interact with one another in complicated ways . The development of molecular markers makes it possible to identify genetic locus (i.e., quantitative trait locus (QTL)) underlie various traits of interest. Genetic designs with controlled crosses are generally pursued to generate mapping populations aimed to identify QTL underlying the variation of phenotypes. Statistical method for QTL mapping with experimental crosses dates back to the seminal work of Lander and Botstein . Various extensions have been developed since then (e.g., [3, 4]).
For a diploid organism, the expression products of most functional regions from each one of a chromosome pair are equal. A broken of this equivalence, that is, nonequivalent genetic contribution of each parental genome to offspring phenotype, can result in genomic imprinting, a phenomenon also called parent-of-origin effect . Since its discovery, imprinting-like phenomena have been commonly observed in mammals and seed plants (reviewed by Burt and Trivers ). However, statistical methods for identifying imprinted genes have not been extensively studied and well developed.
The imprinted inheritance violates the Mendelian theory and brings challenges in statistical modeling. Currently, there are two frameworks in mapping imprinted genes. One is based on the random effect model with pedigree-based natural population such as humans. Hanson et al.  first proposed a variance components framework by partitioning the additive variance component as two parts, a component due to maternal gene and a component due to paternal gene. The variance component method is developed based on the identical-by-decent (IBD) idea in which the expression of the gene for a pair of individuals is expected to be similar if they share alleles IBD. Liu et al.  recently applied the model to map iQTL underlying canine hip dysplasia in a structured canine population. However, the current IBD-based variance components method for mapping imprinted genes assumes noninbreeding population. Their applications are immediately limited with fully or partially inbreeding population such as the controlled inbreeding design in plants and animals. With inbred mapping population in humans, Abney et al.  proposed a method to estimate variance components of quantitative traits. However, the extension of the method to map imprinted gene is not straightforward. No variance components method has been proposed to map imprinted genes with inbred population in the literature.
Another general framework for mapping imprinted genes is based on the fixed-effect model in which the effects of genetic factors are considered as fixed. A number of studies were proposed under this framework for mapping imprinted QTL (iQTL) with controlled crosses of outbred parents [10–12]. One potential limitation of these methods is that allelic heterozygosity at a locus between two outbred parents could cause confounding effects for genomic imprinting. The genetic differences detected by such a fixed-effect model could be caused by allelic heterozygosity of the parents rather than the imprinted effect of iQTL . A natural alternative for the mapping population is the inbred lines. Fixed-effect models based on backcross (BC) and population were recently proposed under the maximum likelihood framework [14–17]. When inbred lines are used, Xie et al.  pointed out that it is more meaningful to inference QTL effect by its variance rather than by the allele substitution effect. The QTL variance is generally calculated conditional on the cross, and it, as a variable, is different from one cross to another . In a single-line cross, the estimated QTL variance cannot be simply extended to a statistical inference space beyond that . Multiple parental lines are needed for QTL variance inference. A solution to this is to combine data from multiple line crosses . An IBD-based variance component method was proposed by Xie et al.  with multiple line crosses. Extension of the IBD-based variance component method with multiple line crosses to iQTL mapping has not been studied.
Motivated by the limitations of current methods aforementioned and by the pressing need for efficient iQTL mapping procedure, in this article, we propose a statistical variance components framework for iQTL mapping by combining data from multiple inbred line crosses. The proposed model is robust in iQTL variance inference by extending the iQTL inference space from single-line cross to multiple line crosses. A parent-specific IBD sharing partition method is proposed by considering the inbreeding structure in line crosses. As discussed by Cui in , the phenotype of an offspring is not only controlled by its own genetic profiles, but also controlled by maternal genotype. The effect of maternal genotype on the phenotype of her offspring, termed maternal effect, is one potential source of confounding effect in the inference of genomic imprinting. The existence of such parental effect may lead to incorrect interpretations of imprinting when they are not properly accounted for in the analysis. Parameters that model the maternal effect are also included and adjusted when testing imprinting.
With the developed model, we propose an interval-based method for genome-wide scan and testing of iQTL. Both maximum likelihood (ML) and restricted maximum likelihood (REML) methods are proposed and compared for parameter estimation and power analysis. An extension to multiple QTL is also proposed in which the multiple QTL model provides improved resolution for QTL inference. Extensive simulations are conducted to compare the performance of the proposed model under different sampling designs with different combinations of family and offspring size. Coparisons of the ML and REML methods, single QTL and multiple QTL methods are discussed. The proposed method provides a general framework in iQTL mapping with multiple line crosses and has significant implications in real application.
2. Statistical Methods
2.1. Genetic Design
The dissection of imprinting effects in line crosses depends on appropriate mating designs, where the allele parental origin can be traced and distinguished. Most commonly used inbred line crosses are the backcross, , and recombinant inbred line (RIL). Reciprocal backcross design has been proposed in iQTL mapping [14, 16]. Considering parental origin of an allele, we use the subscripts and to refer to an allele inherited from the maternal and paternal parents, respectively. The merit of a backcross design is that two reciprocal heterozygotes in offsprings, and , can be distinguished and their mean effects can be estimated and tested to assess imprinting [14, 16]. While all individuals in an segregation population share the same parental information, theoretically it is impossible to distinguish the phenotypic distribution of and without extra information. Considering sex-specific recombination rates, Cui et al.  recently developed an imprinting model by incorporating this information into an interval mapping framework. No study has been reported to use RILs for iQTL mapping.
The methods proposed in Cui  and Cui et al.  are fixed-effects QTL models where the effects of an iQTL are considered as fixed. While only four backcross families are considered, when extending to multiple backcross families, the inference of iQTL variance calculation is less efficient. The variance components method, initially proposed in human linkage analysis , offers a powerful alternative in assessing genomic imprinting . In this paper, we will extend the variance components method to inbred line populations by combining different backcross lines to map iQTL.
A typical backcross design often starts with the cross between one of the parental lines and their progeny to create a segregation population. Then large number of offsprings are collected for QTL mapping. When imprinting effect is considered, reciprocal backcrosses are needed. A basic design framework is illustrated in Table 1 in Cui . The two reciprocal backcrosses are treated as the base mapping units. Multiple backcross families are sampled based on these four crosses. For simplicity, we sample equal number of families for each backcross category. For example, a sample of 8 families would require two of each of the four backcrosses. Noted that the variance components method assesses the degree of allele sharing among siblings. When it is applied to inbred line crosses, each backcross population is considered as one family and different families are considered as independent. For fixed total sample size, one issue is to assess whether we should sample large number of families each with small offspring size or small number of families each with large offspring size. For example, to sample 400 individuals, shall we sample 4 backcross families each with 100 offsprings or 100 families each with 4 progenies or other sampling strategies? The choice of optimal designs is intensively evaluated through simulations.
2.2. The Mixed-Effect Variance Components Model
Suppose there is a putative QTL with two segregating alleles and , located in an interval responsible for the variation of a quantitative trait. The phenotype, , for individual measured in backcross family can be written as a linear function of QTL, polygene, and environmental effects:where is the number of offsprings in the th backcross family; denotes the overall mean; is the random additive effect of the major monogenic QTL assuming normal distribution with mean zero; is the polygenic effect that reflects the effects of unlinked genes and is assumed to be normally distributed with mean zero; is the random environmental error uncorrelated to other effects. The phenotypic variance-covariance for the th family can be expressed aswhere and are the additive and polygene variances; is a matrix containing the proportion of marker alleles shared IBD for individuals in the th backcross family; is a matrix of the expected proportion of alleles shared IBD; I is the identity matrix. The calculation of the IBD sharing matrix with inbred lines can be found in Xie et al.  which is based on the Malécot coefficient of coancestry .
Noted that a backcross offspring with genotype may be obtained by the or the cross. When there is a significant maternal effect, the mean expression for genotype may be different depending on whether its maternal parents carrying or genotype. As described in Cui , maternal effect is one source of potential confounding factor for genomic imprinting. It should be appropriately modeled and adjusted when testing imprinting. Here, we model the cytoplasmic maternal effects as fixed effects, and the overall mean is replaced by which models the maternal effect of the th distinct backcross family.
To accommodate parent-of-origin effects, the QTL additive effect () can be partitioned as two terms:
(1)a component that reflects the influence of the QTL carried on the maternally derived chromosome (); (2)a component that reflects the influence of the QTL carried on the paternally derived chromosome ().
The model that accommodates the parent-specific effects can be expressed as Note that the proposed design contains three distinct maternal parent genotypes. Thus the maternal effects indexed by can be compressed to three distinct maternal effects, instead of terms. For data vector in family , the above model can be reexpressed aswhere is an indicator matrix corresponding to the th backcross family and contains parameters associated with the three maternal effects; , , , , where and are matrices containing the proportion of marker alleles shared IBD that are derived from the mother and father, respectively; is a matrix of the expected proportion of alleles shared IBD; I is the identity matrix; and are the variance of alleles inherited from the maternal and paternal parents, respectively.
With noninbreeding mapping population, Hanson et al.  expressed the phenotypic variance-covariance for the th family asHowever, for an inbred mapping population, this IBD-based variance partition method cannot be directly applied. New method considering the inbreeding structure is needed.
2.3. Parent-Specific Allele Sharing and Covariances between Two Inbreeding Full-Sibs
Before we get the phenotypic variance-covariance of a pair of individuals and , let us first consider the parent-specific allele sharing status. Within each BC family, there are two alleles segregating at each locus. Because of inbreeding, the IBD values between two backcross individuals are different from those calculated from outbred full-sibs. Consider two sibs and in the th backcross family. Without considering allelic parental origin, Xie et al.  proposed to calculate the IBD value at a QTL aswith being the Malécot coefficient of coancestry . Thus, for an inbred population, is not the actual IBD value between individuals and , rather interpreted as twice the coefficient of coancestry [18, 21]. For individuals with itself, where is the inbreeding coefficient for individual at the QTL. The elements in matrix are just the expected values of and which are and .
When allelic parental origin is considered, the IBD sharing matrix can also be calculated based on the coefficient of coancestry. By definition, the coefficient of coancestry is defined as the probability that two randomly drawn alleles from individuals and are identical by descent. Figure 1 displays possible alleles shared IBD for sibs drawn in backcross families. Consider two backcross individuals (with two alleles and ) and (with two alleles and ). Define as the coefficient of coancestry between individuals and . By definition, can be calculated aswhere can be interpreted as the allelic kinship coefficient, that is, the probability that a randomly chosen allele from individual is IBD to a randomly chosen allele from individual . Note that the two terms and are not distinguishable. However, their sum is unique and therefore the two terms can be combined as one single term, denoted as . After the manipulation, the coefficient of coancestry for individuals and can be expressed as which is composed of three components.
Following Xie et al. , the alleles shared IBD between individuals and can be expressed aswhere and are the alleles shared IBD derived from the mother and father, respectively; is the alleles shared IBD due to alleles cross sharing, a special case for inbreeding sibs. Without inbreeding, takes value of zero.
For completely inbreeding population, the inbreeding coefficient is 1 if alleles inherited from both parents are the same since these alleles can be traced back to the same grandparent. For example, for an individual with genotype , Pr since both alleles and are inherited from the same grandparent. Therefore, for individuals with itself, would be the same as when and carry the same genotypes. The expected proportion of alleles shared IBD can also be calculated.
Thus, the proportion of alleles-shared IBD can be partitioned as three components for inbreeding sibs, rather than two components considering parent-of-origin effects proposed by Hanson et al. . To further illustrate the idea, we use one backcross family to demonstrate the derivation. A full list of possible IBD sharing values for the two reciprocal backcrosses is given in Table 1. Considering a backcross family initiated with the cross. Randomly selecting two individuals and with genotype and , the Malécot coefficient of coancestry can be calculated asThus, and . For sib pairs (with genotype ) and (with genotype ), , and , and which is the same as given in (2.6) without considering parent-of-origin partition.
Considering the allelic sharing status in a complete inbreeding population, the relationship between the maternal and paternal alleles is no longer independent if the two alleles are in identical form. There exists a covariance term (denoted as ) due to alleles cross sharing for two inbreeding full-sibs when calculating the phenotypic variance. Corresponding to the partition of the IBD-sharing considering allelic parental origin, the major QTL additive variance component can be partitioned into three components, that is, , , and , in which can be interpreted as the covariance due to alleles cross sharing in inbreeding families. Thus, the trait covariance between two individuals and can be expressed aswhere is an indicator variable taking value 1 if and 0 if . The variance-covariance matrix for a phenotypic vector in the th backcross family can then be expressed aswhere the elements of , , and can be found in Table 1.
For noninbreeding sib pairs with random mating, and hence Cov(. Model (2.12) reduces to , the same as the variance components partition model considering parent-of-origin effects given by Hanson et al. .
2.4. Likelihood Function and Parameter Estimation
Assuming multivariate normality, the density function of observing a particular vector of data for family is given bywhere is an vector of phenotypes for the th backcross family, and is the th backcross family size. The overall log likelihood function for independent backcross families is give by
Note that the maternal effect is the same for families with the same maternal genotype. Thus, only three maternal effects need to be estimated. Two commonly used methods can be applied to estimate parameters in a mixed effects model, the ML method and the REML method. Both methods have been applied in genetic linkage analysis in a variance components model framework [19, 22]. In general, ML estimators tend to be downwardly biased given that it does not account for the loss in degrees of freedom resulted from estimation of the fixed effects . The REML is based on a linear transformation of the data such that the fixed effects are eliminated from the model, hence it provides less-biased estimators. Even though standard softwares such as SAS have standard procedures to estimate parameters for a mixed effects model, the estimation for the proposed model cannot be directly fitted into a standard software. The estimation procedures for the two methods are detailed here.
2.4.1. The Ml Estimation
The phenotype vector in the th backcross family follows a multivariate normal distribution, that is, . Parameters that need to be estimated are with .
Define , , , , , and . is the total phenotypic variance and hence and can be considered as the heritability of maternal and paternal alleles, is the total genetic heritability due to the major QTL, is the polygene heritability, and is the overall heritability. The phenotypic variance-covariance between any two individuals and in the th backcross family can then be reexpressed as wherewith ; is defined similarly; .
If there are sibs in each backcross family, is simply an matrix. Instead of estimating , we can estimate and solve above equations to get the original variance estimates. Now, the log-likelihood can be expressed as
Maximizing likelihood (2.14) is equivalent to maximize (2.17). Here, we take an iterated estimation procedure to estimate the parameters contained in . For given values of , we can get the maximum likelihood estimates (MLEs) of parameters () by setting the partial derivative of the log-likelihood function (2.17) to zero, that is,
It can be seen that and are functions of , , , and . Plug the updated parameter values for and into likelihood equation (2.17), the log-likelihood function can be simplified as
To guarantee a positive definite covariance matrix when searching for these heritability values over the constraint parameter space, a reparameterization technique is adopted . Taking where , , and . We now have four new unknowns with the constraints:, and .
The new constraints can be easily satisfied by a reparameterization technique. Let , , , , and be any real numbers. Estimating , , , , and can be done by maximizing the likelihood function (2.19) via searching through the real domain space with respect to , , , , and with the reparameterization
MLEs of , , , , and can be obtained through the estimated values for , , , , and according to the invariance property of MLEs. These estimated MLEs are used to update , , , , and , and hence and . The iteration steps continue until converge.
2.4.2. The Reml Estimation
The REML method was first proposed by Patterson and Thompson . This method has been broadly applied to estimate variance components in a mixed-effect model framework. Taking ), where . The REML method starts with maximizing the following likelihood function:where . We can combine all family data together as one vector denoted as where . All the and the variance-covariance matrix corresponding to each family can be combined. The log-likelihood function for the combined data is expressed aswhere is a block diagonal matrix with the th diagonal block corresponding to the th family and off-diagonal blocks being zeros; is also a block diagonal matrix with block elements given by . The dimension of is . With this combination, we develop the following REML estimation procedure.
We apply the Fisher scoring algorithm to estimate the unknowns, which has the formwhere is the Fisher information matrix evaluated at which can be expressed as
The first derivative of the log-likelihood function with respective to each variance components is given byThe REML estimator of is the generalized least squares estimator, that is,
Under the current mapping framework, the likelihood function is very complex and no global maxima is guaranteed. Thus, initial values are very important. In both ML and REML estimation procedures, the estimated values under the null are set as the initial parameters for the alternative model. The additional variance component(s) to be estimated under the alternative model is(are) set to a small positive number. In this way, we can guarantee that the alternative model always produces larger likelihood values than the null model does, and hence positive likelihood ratio value.
2.5. QTL IBD Sharing and Genome-Wide Linkage Scan
The above IBD computation procedure assumes that a putative QTL is located right on a marker. When a QTL is located within an interval, a more efficient approach would be to do an interval scan and to test the imprinting property of QTLs at positions across the entire linkage group. Under the proposed framework, essentially, we need to estimate the proportion of putative QTL alleles shared IBD at every genome position. Here, we propose a method to calculate QTL alleles-shared IBD inside an interval conditional on the flanking markers. The so-called expected conditional IBD values can be derived at each test position as a function of recombination fraction between the two flanking markers, and the one between one flanking marker and the QTL.
We use one backcross initiated with the cross as an example to illustrate the idea. For a putative QTL with two alleles and , four QTL genotype pairs , , and can be formed. If the QTL genotype is observed, the corresponding QTL alleles-shared IBD can be calculated (see Table 1). In general, the QTL genotype is unobservable, but its conditional distribution can be calculated from the two flanking markers. For individuals and with flanking marker genotypes and , let be the IBD values calculated at the QTL position between individual carrying QTL genotype or 2 corresponding to or , resp.) and individual carrying genotype (similarly 1 or 2), where or . For example, is the proportion of IBD sharing between individual carrying QTL genotype and individual carrying genotype for alleles derived from the mother.
Let and be the conditional distribution of QTL genotype and for individuals and given on the flanking markers and , respectively. This conditional probabilities can be easily calculated and can be found at standard QTL mapping literature (see ). The probability to observe is just . Thus, the expected IBD values between individuals and at the tested QTL position conditioning on the flanking markers and can be calculated as = E() = . For the above example, the IBD values derived from the maternal and paternal parents can be calculated as = E() = 0.5 + 0.5 +0.5 + 0.5 and = E() = 0.5 + 0.5 . Similarly, we can calculate the conditional expectation of IBD sharing for other backcross families.
Since and are functions of recombinations, the conditional QTL IBD values vary at different testing positions. Once the estimated IBD matrix is calculated at every 1 or 2 cm on an interval bracketed by two markers throughout the entire genome, a grid search can be done at all testing positions. The amount of support for a QTL at a particular map position can be displayed graphically through the use of likelihood ratio profiles, which plot the likelihood ratio test statistic as a function of testing positions of putative QTL (see details in Section 2.6). The peaks of the profile plot that pass certain significant threshold correspond to the positions of significant QTL.
2.6. Hypothesis Testing
With the estimated parameters using either the ML or REML method, we are interested in testing the existence of QTL across the genome and assess their imprinting mechanism. The first hypothesis is to test the existence of major QTL, termed overall QTL test, which can be formulated as
Likelihood ratio (LR) test is applied which is computed between the full (there is a QTL) and the reduced model (there is no QTL) corresponding to and , respectively. Let and be the estimates of the unknown parameters under and , respectively. The log-likelihood ratio can be calculated as
When testing the hypothesis, the polygene and the residual variances are nuisance parameters which are constrained to be nonnegative. The three tested genetic variance components under the null are lied on the boundaries of their alternative parameter spaces. Following Self and Liang , when the null is true, may asymptotically follows a mixture of distribution on degrees of freedom (df) with the mixture proportion for the components being . The theoretical distribution can be used to assess significance in linkage scan. However, since there are many point tests across the genome, the pointwise significance value may not guarantee an appropriate genome-wide error rate. Another approach to assess significance is to use nonparametric permutation tests in which the critical threshold value can be empirically calculated on the basis of repeatedly shuffling the relationships between marker genotypes and phenotypes . In simulation studies, we also simulate the null distribution and compare it with the theoretical distribution.
For those detected QTL, the next step is to assess their imprinting property. An identified QTL can be imprinted, completely imprinted, partially imprinted, or not imprinted at all. These can be tested through the following sequential tests. The first imprinting test is to assess whether a QTL shows imprinting effect, which can be done by formulating the following hypotheses:Rejection of provides evidence of genomic imprinting and the QTL is called iQTL. Again, likelihood ratio test can be applied in which the log-likelihood ratio test statistics asymptotically follows an with one df . We denote the log-likelihood ratio test statistic as . If the null is rejected, one would be interested to test if the detected iQTL is completely maternally or paternally imprinted. The corresponding hypotheses can be formulated asfor testing completely maternal imprinting andfor testing completely paternal imprinting. The likelihood ratio test statistics for the above two tests asymptotically follow a 50 : 50 mixture of and distributions . Rejection of complete imprinting indicates partial imprinting.
2.7. Multiple QTL Model
In reality, more than one QTL may contribute to the phenotypic variation located in one chromosome region or across the whole genome. The polygenic effect in model (2.4) absorbs the effects of multiple QTL located on other chromosomes. However, when there are multiple QTL located on the same linkage group as the tested QTL, if their effects are not properly adjusted, the estimation could be biased due to interference caused by these QTL outside of the testing interval [3, 30–32]. A multiple QTL model that can test the putative QTL effect while adjusting the effects of interference QTL deserves more attention.
Zeng  previously showed that IBD variables share the same property as the indicator variables in which the shared proportion of alleles IBD for a QTL conditional on the IBD of one flanking marker is independent of that of a QTL on the other side of that flanking marker. Thus, conditional on one flanking marker, the interference of QTL located on the other side of the marker can be eliminated. By conditional on the IBD of the flanking markers, the IBD sharing of a QTL is uncorrelated with that outside this interval. Xu and Atchley  showed that one marker is enough to block the interference caused by other QTL located on the same linkage group. The authors derived the next-to-flanking markers structure to block additional QTL effects from both sides of testing region in one chromosome. We derive a multiple QTL model adopting a similar idea as Xu and Atchley . Assume there are total QTL located on a linkage group. Considering parent-specific allelic effects, the multiple QTL model can be expressed in general as
In an interval-based linkage scan, only one putative QTL is considered at each testing position conditioning on the effects of all other QTL. Assuming there are total and QTL located on the left and right sides of the putative QTL on a linkage group, model (2.32) can be modified aswhere and are the th and th QTL random effects on the left and right sides of the putative QTL, respectively. When testing the putative QTL effect, we are only interested in blocking the total effects of QTL outside of the tested interval. Therefore, in the modified model, the effects of QTL outside of the tested interval are not partitioned. This, however, does not affect the inference of the tested QTL.
As shown by Zeng  and Jansen [31, 33], one marker is enough to block the correlation between a locus on its left and a locus on its right. Therefore, only two additional markers flanking the current interval are needed to block interference caused by outside QTL . Let and denote two flanking markers for the tested interval, and and denote the two markers next to and with the marker order - - - . With the modified model given in (2.33), the covariance of phenotypes between individuals and in the th backcross family can be expressed aswhere and are the IBD values for QTL located on the left and right sides of the putative QTL in the th backcross family, and can be calculated following (2.6) and (2.7) if their genotype information is known. Unfortunately, the number and exact locations of QTL outside the testing interval are unknown. Hence, and are not observable. Xu and Atchley  showed that when and are unknown, they can be estimated by some composite terms and , where is a function of the recombination fraction between the th QTL and the left marker as well as a function of , the IBD value for a pair of individuals at the left marker . can be similarly defined. Following Xu and Atchley , can be expressed as a function of multiplied by a function of recombination frequency between the th QTL and the marker , , that is, . Similarly, . When doing an interval scan, the covariance function given in (2.34) between individuals and can be reexpressed asInstead of estimating individual variance components and , now we estimate the composite terms and . By conditioning the IBD sharing information for the left and right markers and , the effects of those interference QTL are blocked. and absorb the random effects of all QTL that are outside of the testing interval but are on the same linkage group as the putative QTL. Estimation of the variance components terms follows the same procedure as the single QTL analysis with slight modification to consider multiple variance components.
3.1. Simulation Design
To investigate the performance of the proposed models and estimation methods, we conduct intensive computer simulations. We start with the single-QTL simulation followed by the multiple QTL analysis. Six evenly spaced markers () are simulated. The total length for the simulated linkage group is 100 cm. We assume that all the backcross families share the same linkage map constructed using Haldane map function. For simplicity, we assume the sample size for all backcross families is the same (i.e., ). The position of the simulated QTL is assumed to be located 48 cm away from the first marker (). The effect of the putative QTL is simulated by assuming different imprinting mechanisms, that is, no imprinting, completely imprinting, and partial imprinting. Once QTL genotypes are simulated, phenotypes can be simulated by randomly drawing multivariate normal distribution with the covariance structure given in (2.12) with different parameter combinations.
To evaluate the effect of family and offspring size combination on testing power and parameter estimation, we simulate data assuming different sample size combinations. We fix the total sample size as 400 and vary the family and offspring size with different combinations, that is, , , , and . The first number for each combination indicates the family size. For example, in the combination , 4 families each containing 100 offsprings are simulated. For each sib-pair, the IBD value at a putative position at every 2 cm along the linkage group is calculated as described in Section 2.7. For each simulation scenario, 100 simulation replications are recorded, and the ML and the REML methods are used to estimate the unknown parameters.
3.2. Simulation Results
3.2.1. Single QTL Analysis
The single QTL model assumes one QTL is located at the third interval in the simulated linkage group, 48 cm away from the first marker. Results using both ML and REML estimation methods are summarized in Table 2. denotes the number of families and denotes the number of offsprings for each family. Without loss of generality, we assume equal offspring size for all families in each simulation scenario. The simulated parameter values are listed under each parameter. The root mean square errors (RMSEs) are recorded for each parameter estimate to assess the estimation precision. Overall, the fixed effects (three means) and most variance components can be better estimated with large number of families. For example, the RMSE of parameter is reduced from 1.869 (2.45) to 0.321 (0.305) when the number of families increases from 4 to 100 with the ML (REML) estimation method. The only exception is the two variance components terms ( and ) which are better estimated with the combination design. Through the combination of different line crosses, the parameter inference space is expanded, and as a result, better estimations are achieved as expected. However, the QTL position is better estimated with the and designs than the other two among the four simulation scenarios. The design gives the worst QTL position estimation with the largest RMSEs for both estimation methods. Therefore, a balance of family and offspring size is needed. A moderate family size with moderate offspring size would be necessary in order to achieve reasonable parameter estimation for both QTL effects and position.
|The locations of the QTL are described by
the map distances (in cm) from the first marker of the linkage group (100 cm long). The true QTL is located at 48 cm; Pow is calculated using the empirical distribution
through simulation. Pow is calculated using the theoretical
distribution assuming mixture chi-square distribution. Type I error refers to
the imprinting type I error. |
Table 2 also lists the results of power analysis under different scenarios with two different estimation methods. Powe denotes the empirical power calculated from the simulated null distribution corresponding to hypothesis test (2.27). We simulate the null distribution by simulating data assuming no QTL effect (i.e., = = = 0). The LR test statistics is calculated for each simulation run, and the 95% cutoff is reported as the threshold value. Powe refers to the theoretical power which is calculated assuming the mixture chi-square distribution, that is, . Results show that the threshold calculated from the theoretical distribution is smaller than the one calculated from the simulation. Thus, the testing power based on the theoretical cutoff is greater than the empirical power. The testing powers under different sampling designs are very comparable except for the 100 4 design in which the power is dramatically reduced compared to other designs. No remarkable difference in power for both estimation methods is observed. Figure 2 shows the log-likelihood ratio test statistic calculated under the four sampling designs across the simulated linkage group by using both ML and REML estimation methods. The plotted LR curve is from averaged LR values out of 100 replications. It is clear that large offspring size always gives large test statistics. As the family size increases from 4 to 100 and so decreased offspring size, we observe a huge LR value decrease. Clearly, the 100 4 design is less powerful than the others. The last column listed in Table 2 shows type I error for testing genomic imprinting, that is, : . The simulated data assume no imprinting (). The imprinting test is only conducted at the position where the overall QTL test shows significance. The imprinting test statistic is compared with a chi-square distribution with 1 df. Overall, the REML estimation method results in smaller type I error rate than the ML method does. As the number of families increase, type I error decreases. The 4 design yields the largest type I error.
In comparison of the ML and REML methods, the REML method gives smaller estimation biases but larger RMSEs than the ML method does. This reflects the large variability of the REML estimation. In terms of computation speed, the ML method is faster than the REML method. For example, in a single-simulation run with the one-QTL model, the ML method takes about 9 minutes to scan the linkage group compared to 26 minutes with the REML method. The difference is more remarkable with the multiple QTL model (e.g., 10 minutes for ML versus 43 minutes for REML). Even though the QTL position estimation is better estimated by using the REML method when family size is small, as family size increases, the REML method performs worse than the ML method (Table 2). In checking the LR profile plot in Figure 2 and the power analysis in Table 2, we do not observe significant gain in power by using the REML method. The two methods do no dominate each other and are very comparable in power analysis. With large sample size and limited computing resources, one might want to try the ML method first. However, the REML method is suggested when testing imprinting since it has small type I error.
In a short summary of the results listed in Table 2, the and designs give better QTL position estimation and testing power. In terms of the type I error for imprinting test, the and designs provide reasonable type I error. Thus, a practical guidance is to choose the design, and one should always avoid designs with extremely large or extremely small family size.
To evaluate the proposed model under different imprinting mechanisms, we simulated data assuming different degrees of imprinting. Since the results in Table 2 indicate that a design provides relatively reasonable parameter estimation, good power, and small type I error rate for imprinting test, the evaluation of imprinting analysis is thus focused on this design. The results for 100 simulation replications are summarized in Table 3. Three imprinting models are assumed complete maternal imprinting ( and ), complete paternal imprinting ( and ), and partial maternal imprinting ( and ). Both ML and REML estimators are reported. Overall, the two estimation methods produce very comparable results with less-biased estimations by the REML method as we expected. All the parameters can be properly estimated with reasonable precision. Large imprinting power is observed when the variance difference between the two parent-specific variance components is large. When the difference between the two parent-specific variance components is reduced, the power to detect imprinting is largely reduced. For example, when data are simulated assuming complete paternal imprinting, the power is 0.91(0.86) by using the ML(REML) estimation method. With partially imprinted data, the imprinting power reduces to 0.24(0.09) by using the ML(REML) method, even though it can be increased by increasing the offspring sample size (data not shown).
|Po and Po correspond to the overall QTL effect test (2.27)
calculated using the empirical and theoretical cutoffs, respectively; ipower
refers to the imprinting test power corresponding to test (2.29). See Table 2 for
explanations of other parameters. |
In reality, whether a QTL is imprinted or not is an unknown prior. When a QTL has Mendelian effect and is not imprinted, is there any power loss by analyzing with the proposed imprinting model? Or when a QTL is actually imprinted, is there any power loss by analyzing with regular variance components approach? To answer these two questions, we simulated data under different scenarios and analyzed with both Mendelian and imprinting models. The first and second columns in Table 4 refer to the simulation and analysis models, respectively. M refers to the Mendelian model without variance components partition and I refers to the imprinting model with allelic-specific partition of the variance components. For comparison purpose, heritabilities are recorded instead of original variance components estimates. The polygene and residual variances are fixed as 0.5 () and 2, respectively for all the simulation scenarios. We first simulated data with one additive genetic effect without partitioning variance into allelic-specific components. This is equivalent to simulate data assuming the Mendelian model. A single additive variance component of 3.5 is assumed which corresponds to a heritability of . The second scenario is to simulate data with three allelic-specific variance components. Simulation models and correspond to a complete maternal imprinting model (i.e., and ) and a partial maternal imprinting model (i.e., and ), respectively. The variance component is assumed to be 0.5 () for and . In all the simulations, we use the design to make the comparison. Similar results are expected under the other sampling designs. Since the true variance components values for the imprinting model are unknown when data are simulated assuming Mendelian effect and vice versa, only standard deviations for these parameter estimates are recorded (listed as italic font in the parentheses).
|M and I refer to Mendelian and imprinting
models, respectively. Simulated parameters for model M:(); 0, 0.5, 0.083); 0.083, 0.417, 0.083). Powe and Powe correspond to the power calculated using the
empirical cutoff and the theoretical threshold, respectively. The numbers given
in the parenthesis with normal and italic fonts correspond to the RMSEs and
standard errors of the parameter estimates, respectively. See Table 2 for other
The simulation results are summarized in Table 4 analyzed with the ML method. When the simulated model is Mendelian, QTL position is better estimated with the Mendelian model than with the imprinting model. No remarkable difference in power is observed for both models. The estimated parent-specific variances due to maternal and paternal alleles are almost identical and no imprinting is detected. When data are simulated assuming imprinting (model and ), large power is observed when analyzed with the imprinting model. For example, the power is 86% when analyze the imprinting data by the Mendelian model. The power is increased to 95% when data are analyzed by the imprinting model. When imprinting data are analyzed with the Mendelian model, the major QTL variance is underestimated, and the polygene variance is slightly overestimated. No remarkable differences are observed for the estimation of the three fixed mean effects and the residual variance under all simulation cases. In any case, the imprinting model performs better or no worse than the Mendelian model in terms of power. In checking type I error rate based on the theoretical threshold, we find the imprinting model has slightly higher type I error rate compared with the Mendelian model. In real data analysis, it is more important to control the false negatives than the false positives. Thus, it is safe to apply the imprinting model for data with any inheritance pattern in this regard.
3.2.2. Multiple QTL Analysis
To see the relative merit of multiple QTL analysis against single QTL analysis when multiple QTL are located on the same linkage group, two QTL are simulated with QTL 1 (denoted as ) located at the second interval, away from the first marker (), and QTL 2 (denoted as ) located at the fourth interval, 68 cm away from the first marker. Two simulation scenarios are considered. The first scenario considers two nonimprinted QTL with equal genetic effects. The second scenario assumes is imprinted and is not imprinted. Simulated parameters for the two QTL are listed in Table 5. Data are simulated assuming the design. Parameters are estimated by the ML and REML approaches with 100 replicates.
| and refer to two QTL located at 28 cm and 68 cm. |
Figure 3 shows the LR profile plots for the single and multiple QTL analyses. The single QTL model indicates three major peaks. The highest peak for the single QTL analysis is located at the wrong QTL interval where no QTL is assumed. The so-called “ghost image” of QTL can be removed and the positions of the two QTL can be precisely mapped on the chromosome by the multiple QTL model. Two clear peaks indicating the correct QTL positions (arrow signs) are observed by the multiple QTL analysis. However, we observe a remarkable reduction in LR values by multiple QTL analysis compared to those by the single QTL analysis. Since the threshold for multiple QTL analysis is unknown, we cannot make the conclusion that multiple QTL analysis is less powerful than the single QTL analysis. It is possible that we may gain accuracy in QTL position estimation at the cost of power loss. Similar phenomenon and issues were also observed and discussed in the literature [3, 25].
The results of the multiple QTL analysis are summarized in Table 5. The fixed mean effects, the polygene, and residual variance components can be reasonably estimated with small RMSEs, similar results shown in Table 2 for the design and hence are not reported here. Only the genetic factors for the two simulated QTLs are reported. It can be seen that both ML and REML methods provide reasonable parameter estimates and are very comparable. Under the first simulation scenario in which both QTL are not imprinted, the genetic effects are all slightly overestimated by both methods. This might be due to the interference of the two QTL in the same linkage group. The multiple QTL model may not completely block the effects of QTL outside of the tested interval. For the second simulation scenario, an interesting pattern is observed. When one QTL is imprinted (), the maternal and paternal variance components for the second one () tend to be estimated with bias in the direction as the first imprinted QTL, that is, tends to be overestimated and tends to be underestimated. As we gain accuracy in QTL position estimation, we lose precision for the parameter estimation. These effects are expected as described in Zeng  and Xu and Atchley . More investigations are needed in multiple QTL analysis in order to maintain a good balance of QTL position and parameter inference.
Statistical methods assuming fixed effect models for iQTL mapping in controlled outbred and inbred lines have been proposed (e.g., [11, 14–16]). Considering the limitation of fixed-effect models, a random model that estimates the QTL variance by extending single line cross to multiple line crosses should be more powerful in QTL variance inference . The IBD-based variance components method assuming random genetic effect for iQTL mapping has been developed in human linkage analysis . However, no study has been proposed to map iQTL using variance components method with inbred or partially inbred line cross. In this article, we have first time presented an IBD-based variance components framework to search for the existence and distribution of iQTL throughout the entire genome in multiple experimental line crosses. The idea of the method is demonstrated through a backcross design. It can also be extended to multiple line crosses using the sex-specific recombination information as proposed by Cui et al. .
The key point of the proposed iQTL variance components analysis is to partition the additive genetic variance into parent-specific components. We have proposed a new parent-specific allelic sharing method which characterizes the relatedness of parent-specific alleles between pairs of individuals in a backcross pedigree. The calculation of parent-specific allelic sharing is based on the information of the coefficient of coancestry. More complicated calculation of the coefficient of coancestry can be found at Harris . The quantification of the coefficient of the coancestry proposed by Harris  can also be utilized to calculate the parent-specific IBD sharing in an inbred human population, and thus for iQTL mapping in inbred human populations.
There have been extensive studies in the literature about various methods in the estimation of variance components in a mixed-effect model framework. The ML and REML are two commonly applied methods in variance components estimation with less-biased estimation by the REML method. Simulations show that the ML method yields high precision in parameter estimation but with relatively large bias than the REML method. Power analysis indicates that the ML method is a little more powerful than the REML method but with large type I error when testing imprinting. In terms of computing speed, the ML method is faster than the REML method. Thus, no single method dominates the other. In terms of overall QTL test, we suggest to use the ML method for the genome-wide linkage scan and use the REML method for the imprinting test.
The effect of sampling design is investigated by extensive simulations. Results indicate that one can always achieve large power with large offspring size when the total sample size is fixed. The LR value differences under different sampling designs are shown in Figure 2. However, the combination of small families each with large offsprings gives poor parameter estimation and large type I error for imprinting test (Table 2). As the number of families increase, we observe less-biased parameter estimates for both fixed and random effects, but with poor QTL position estimation and small power. This information implies that it is necessary to enlarge the number of families to improve precision of parameter estimation. Meanwhile, a balance of family and offspring size is needed to maintain good QTL detection power and position estimation. Our simulations indicate that for a fixed total sample size (), both and designs yield comparable results and both designs outperform the other two designs (Table 2). Moreover, the design produces relatively small type I error in imprinting test. With the design, results in Table 4 indicate that the imprinting model is better or as good as the regular Mendelian analysis without considering imprinting. In real data analysis, it should be safe to apply the proposed imprinting model for data with any imprinting pattern.
In this study, we have extended the single marker-based analysis to an interval-based mapping for genome-wide scan and testing of iQTL effects. Considering the interference of QTL located on the same linkage group, we have extended the single QTL model to multiple QTL analysis following the derivation of Xu and Atchley . Simulation results indicate the relative merits of the multiple QTL analysis with improved QTL position inference, but with possible power loss (Figure 3). This, however, has been a common issue in multiple QTL modeling (see [3, 25]). More investigations are needed in deriving efficient and robust multiple QTL mapping models to improve precision without suffering too much from power loss.
The theoretical distribution for the likelihood ratio test has been a challenging problem in QTL mapping. Dupuis and Siegmund  first proposed theoretical properties for LR test statistics in a genome-wide linkage scan for QTL in an interval mapping framework with a fixed-effect model. Currently, most linkage analysis using the variance components method assumes that the LR test statistic follows a mixture of chi-square distribution . The mixture distribution is derived following Self and Liang . With multiple testings and multiple nuisance parameters in a genome-wide scan, the assumptions to get the mixture chi-square distribution may not satisfy. Moreover, the multivariate normal assumption for the phenotypic data required to get the mixture distribution may not even valid. No theoretical work has been done to investigate this in an IBD-based variance components linkage mapping. Our simulations indicate that the theoretical threshold calculated from the mixture chi-square distribution is smaller than the simulated cutoff. Thus, the power calculated with the theoretical threshold is slightly inflated. A modified mixture chi-quare distribution may be more appropriate. More theoretical investigations are needed in this regard.
This work was supported by NSF Grant DMS-0707031 and by Michigan State University intramural research Grant 06-IRGP-789.
- M. Lynch and B. Walsh, Genetics and Analysis of Quantitative Traits, Sinauer, Sunderland, Mass, USA, 1998.
- E. S. Lander and D. Botstein, “Mapping mendelian factors underlying quantitative traits using RFLP linkage maps,” Genetics, vol. 121, no. 1, pp. 185–199, 1989.
- Z.-B. Zeng, “Precision mapping of quantitative trait loci,” Genetics, vol. 136, no. 4, pp. 1457–1468, 1994.
- C.-H. Kao, Z.-B. Zeng, and R. D. Teasdale, “Multiple interval mapping for quantitative trait loci,” Genetics, vol. 152, no. 3, pp. 1203–1216, 1999.
- K. Pfeifer, “Mechanisms of genomic imprinting,” The American Journal of Human Genetics, vol. 67, no. 4, pp. 777–787, 2000.
- A. Burt and R. L. Trivers, Genes in Conflict, Harvard University Press, Cambridge, Mass, USA, 2006.
- R. L. Hanson, S. Kobes, R. S. Lindsay, and W. C. Knowler, “Assessment of parent-of-origin effects in linkage analysis of quantitative traits,” The American Journal of Human Genetics, vol. 68, no. 4, pp. 951–962, 2001.
- T. Liu, R. J. Todhunter, S. Wu et al., “A random model for mapping imprinted quantitative trait loci in a structured pedigree: an implication for mapping canine hip dysplasia,” Genomics, vol. 90, no. 2, pp. 276–284, 2007.
- M. Abney, M. S. McPeek, and C. Ober, “Estimation of variance components of quantitative traits in inbred populations,” The American Journal of Human Genetics, vol. 66, no. 2, pp. 629–650, 2000.
- S. A. Knott, L. Marklund, C. S. Haley et al., “Multiple marker mapping of quantitative trait loci in a cross between outbred wild boar and large white pigs,” Genetics, vol. 149, no. 2, pp. 1069–1080, 1998.
- D.-J. de Koning, A. P. Rattink, B. Harlizius, J. A. M. van Arendonk, E. W. Brascamp, and M. A. M. Groenen, “Genome-wide scan for body composition in pigs reveals important role of imprinting,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 14, pp. 7947–7950, 2000.
- D.-J. de Koning, H. Bovenhuis, and J. A. M. van Arendonk, “On the detection of imprinted quantitative trait loci in experimental crosses of outbred species,” Genetics, vol. 161, no. 2, pp. 931–938, 2002.
- M. Lin, X.-Y. Lou, M. Chang, and R. Wu, “A general statistical framework for mapping quantitative trait loci in nonmodel systems: issue for characterizing linkage phases,” Genetics, vol. 165, no. 2, pp. 901–913, 2003.
- Y. Cui, “A statistical framework for genome-wide scanning and testing of imprinted quantitative trait loci,” Journal of Theoretical Biology, vol. 244, no. 1, pp. 115–126, 2007.
- Y. Cui, Q. Lu, J. M. Cheverud, R. C. Littell, and R. Wu, “Model for mapping imprinted quantitative trait loci in an inbred design,” Genomics, vol. 87, no. 4, pp. 543–551, 2006.
- Y. Cui, J. M. Cheverud, and R. Wu, “A statistical model for dissecting genomic imprinting through genetic mapping,” Genetica, vol. 130, no. 3, pp. 227–239, 2007.
- Y. Li, C. M. Coelho, T. Liu et al., “A statistical model for estimating maternal-zygotic interactions and parent-of-origin effects of QTLs for seed development,” PLoS ONE, vol. 3, no. 9, p. e3131, 2008.
- C. Xie, D. D. G. Gessler, and S. Xu, “Combining different line crosses for mapping quantitative trait loci using the identical by descent-based variance component method,” Genetics, vol. 149, no. 2, pp. 1139–1146, 1998.
- C. I. Amos, “Robust variance-components approach for assessing genetic linkage in pedigrees,” The American Journal of Human Genetics, vol. 54, no. 3, pp. 535–543, 1994.
- G. Malécot, Les mathématiques del'hérédité, Masson et Cie, Paris, France, 1948.
- D. L. Harris, “Genotypic covariances between inbred relatives,” Genetics, vol. 50, no. 6, pp. 1319–1348, 1964.
- L. Almasy and J. Blangero, “Multipoint quantitative-trait linkage analysis in general pedigrees,” The American Journal of Human Genetics, vol. 62, no. 5, pp. 1198–1211, 1998.
- R. R. Corbeil and S. R. Searle, “A comparison of variance component estimators,” Biometrics, vol. 32, no. 4, pp. 779–791, 1976.
- J. A. Nelder and R. Mead, “A simplex method for function minimization,” Computer Journal, vol. 7, no. 4, pp. 308–313, 1965.
- S. Xu and W. R. Atchley, “A random model approach to interval mapping of quantitative trait loci,” Genetics, vol. 141, no. 3, pp. 1189–1197, 1995.
- H. D. Patterson and R. Thompson, “Recovery of inter-block information when block sizes are unequal,” Biometrika, vol. 58, pp. 545–554, 1971.
- R. Wu, C.-X. Ma, and G. Casella, Statistical Genetics of Quantitative Traits: Linkage, maps, and QTL, Statistics for Biology and Health, Springer, New York, NY, USA, 2007.
- S. G. Self and K.-Y. Liang, “Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions,” Journal of the American Statistical Association, vol. 82, no. 398, pp. 605–610, 1987.
- G. A. Churchill and R. W. Doerge, “Empirical threshold values for quantitative trait mapping,” Genetics, vol. 138, no. 3, pp. 963–971, 1994.
- O. Martínez and R. N. Curnow, “Estimating the locations and the sizes of the effects of quantitative trait loci using flanking markers,” Theoretical and Applied Genetics, vol. 85, no. 4, pp. 480–488, 1992.
- R. C. Jansen, “Controlling the type I and type II errors in mapping quantitative trait loci,” Genetics, vol. 138, no. 3, pp. 871–881, 1994.
- Z.-B. Zeng, “Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci,” Proceedings of the National Academy of Sciences of the United States of America, vol. 90, no. 23, pp. 10972–10976, 1993.
- R. C. Jansen, “Interval mapping of multiple quantitative trait loci,” Genetics, vol. 135, no. 1, pp. 205–211, 1993.
- J. Dupuis and D. Siegmund, “Statistical methods for mapping quantitative trait loci from a dense set of markers,” Genetics, vol. 151, no. 1, pp. 373–386, 1999.
- D. B. Allison, M. C. Neale, R. Zannolli, N. J. Schork, C. I. Amos, and J. Blangero, “Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci-mapping procedure,” The American Journal of Human Genetics, vol. 65, no. 2, pp. 531–544, 1999.
Copyright © 2009 Gengxin Li and Yuehua Cui. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.