Abstract

The most powerful and comprehensive approach of study in modern biology is to understand the whole process of development and all events of importance to development which occur in the process. As a consequence, joint modeling of developmental processes and events has become one of the most demanding tasks in statistical research. Here, we propose a joint modeling framework for functional mapping of specific quantitative trait loci (QTLs) which controls developmental processes and the timing of development and their causal correlation over time. The joint model contains two submodels, one for a developmental process, known as a longitudinal trait, and the other for a developmental event, known as the time to event, which are connected through a QTL mapping framework. A nonparametric approach is used to model the mean and covariance function of the longitudinal trait while the traditional Cox proportional hazard (PH) model is used to model the event time. The joint model is applied to map QTLs that control whole-plant vegetative biomass growth and time to first flower in soybeans. Results show that this model should be broadly useful for detecting genes controlling physiological and pathological processes and other events of interest in biomedicine.

1. Introduction

To study biology, a classic approach is dimension reduction in which a biological phenomenon or process is dissected into several discrete features over time and space. Most efforts in the past decades have been made to understand biological details of individual features and then use knowledge from each feature to draw an inference about biology as a whole. There has been increasing recognition of the limitation of this approach because it fails to detect a rule that governs the transition from one feature to next, thus leading to a significant loss of information behind the development of a biological trait. More recently, tremendous developments in statistics and computer science have enabled scientists to model and compute the dynamic behavior of a biological phenomenon and construct a comprehensive view of how a cell, tissue, or organ grows and develops across the time-space scale.

A statistical dynamic model, called functional mapping, is one of the products of such developments [1, 2]. The merit of functional mapping lies in its biological relevance to study the tempo-spatial pattern of change for the trait and further predict the physiological or pathological status of trait phenotype. Functional mapping has proven to be powerful for elucidating the dynamic genetic architecture of complex phenotypic traits by identifying when specific genes (known as quantitative trait loci or QTLs) involved turn on and turn off and how long they are expressed in a time course. With the advent of new automatic techniques that collect dynamic data in a cost-effective way, functional mapping can be anticipated to play an increasingly important role in shedding light on the genetic control mechanisms of complex traits or diseases.

The statistical foundation of functional mapping is longitudinal data analysis or functional data analysis. There has been a considerable body of literature on statistical modeling of time-varying mean and covariance structure using various parametric, nonparametric, and semiparametric methods [37]. A joint mean-covariance model was proposed by Pourahmadi [8, 9], which shows some advantages over modeling the mean and covariance separately. Since the publication of the pioneering work by Laird and Ware [10], random effects model have been extensively used for longitudinal data analysis [11]. All these statistical approaches have been incorporated into functional mapping [12, 13], aiming to provide the most parsimonious estimates of QTL effects for a given data set. A Bayesian algorithm for functional mapping has been proposed recently by Liu and Wu [14].

The complexity of biology lies in the fact that no biological trait is isolated, rather every trait is affected by other traits through genes and environmental factors. For example, when a plant grows into a particular stage, reproductive behavior, such as flowering, starts to emerge as one of the important events in plant development. The time to first flower is highly associated with the amount of vegetative growth, depending on the environment where the plant is grown. Likewise, the time to recurrence of prostate cancer in humans is related with dynamic changes of prostate specific antigen level. How to jointly model longitudinal and time-to-event data within functional mapping has become an important issue for studying the common genetic basis of these processes and predicting events based on longitudinal traits.

Simultaneous modeling of longitudinal traits and time to events has been an active area in biostatistics during the past twenty years. A linear random effects model and EM estimation approach are proposed by Henderson et al. [15] for joint modeling. Guo and Carlin [16] made a comparative study between separate models and a joint model, showing that a joint model is more powerful when there is a strong correlation between the trait and the event. Wang and Taylor [17] developed a Bayesian method and MCMC algorithm for joint modeling of longitudinal and event time data and applied their algorithm on AIDS data. A review article by Tsiatis and Davidian [18] nicely summarizes the recent developments for such joint modeling.

By simply estimating the correlation between longitudinal traits and event time, Lin and Wu [19] developed a first model that connects these two aspects within functional mapping. However, they developed a likelihood-based framework where the covariance structure for the longitudinal trait was modelled by the known AR, and model parameters were estimated using maximum likelihood estimation. Taking advantage of event models, such as semiparametric Cox proportional hazard model, Weibull model, accelerated failure time (AFT) model, we here propose a sophisticated model for joint modeling of longitudinal trait and time to event to locate the QTLs which control the event via a dynamic trait. The detection of those QTLs that are common to these types of traits may help to prevent or accelerate the outcome by genetic approaches. Our model is constructed with a Bayesian paradigm and model parameters are estimated by the MCMC algorithm. Local polynomials are used to model the mean trajectory and generalized-linear-model- (GLM-) based approach is used to model the covariance matrix. The model is validated using a real example in which whole-plant biomass as a longitudinal trait measured at a series of discrete time points and the time to first flower as a time-to-event are jointly modeled through functional mapping. The statistical properties of the model applied to estimate QTL temporal effects in this example and its practical usefulness are investigated by simulation studies.

2. Joint Modeling Framework

2.1. Model for the Longitudinal Trait

Genetic mapping should be based on a segregating population, such as the backcross, F2, or recombinant inbred lines (RILS), initiated with two inbred lines each carrying an alternative allele. An RIL population is generated by self-crossing the hybrids of the two inbred lines continuously for 7-8 successive generations, which leads to two homozygous genotypes for alternative alleles at each locus. Methods for other designs can be derived similarly. Suppose a backcross has progeny which is genotyped to construct a linkage map, aiming at locating putative QTLs that trigger significant effects on a longitudinal trait and its associated event. For each progeny, the trait is measured repeatedly at different time points and a time-to-event is also recorded. At a specific time point , the phenotypic value of the trait for progeny affected by a putative QTL can be expressed by a linear model as follows: where is an indicator variable for a possible QTL genotype of progeny and defined as 1 if a particular QTL genotype is indicated and 0 otherwise ( for QTL genotype and 2 for genotype ), is the mean phenotypic value of QTL genotype for progeny at time , is the subject specific random effect, and is the residual error assumed to follow a normal distribution with mean zero and covariance matrix .

The central theme of functional mapping is to model the mean and covariance structures for the longitudinal trait efficiently. Here, we model the mean vector by polynomial function and the covariance matrix by an approach that guarantees the positive definiteness of the estimated covariance matrix. Without loss of generality, assume the response vector for progeny , , has mean 0 and covariance matrix . The response at time , , can be predicted by its predecessors as the follows: where is the corresponding regression coefficient, is the prediction error for progeny with , and is its variance. Assuming that ’s are uncorrelated (Pourahmadi [8]), we get , a diagonal matrix with being the th diagonal element, where is the vector of prediction errors. Hence, the matrix representation of the above autoregression becomes where is a lower triangular matrix with 1’s in diagonal elements and in the th position. The above equation simply gives which is related to the modified Cholesky decomposition of Σ [20].

Equation (4) will be considered as the basis for modeling the covariance structure, since this guarantees the estimated covariance matrix to be positive definite. Following Pourahmadi [8, 9], we model the mean vector, unconstrained variance parameters , and dependence parameter , using a polynomial function of a particular order, expressed as

The optimal is determined from the information criteria (AIC/BIC). We note that different genotypes are assumed to have the same covariance structure but different means. Note that the above method of modeling the covariance structure for a longitudinal response is more robust than the traditional first-order autoregressive (AR) or compound symmetry (CS) structure since real data might not show a parametric dependence structure. We refer to the proposed approach as GLM-based approach to estimate the covariance matrix.

Denote for QTL genotype and for subject . Then, the conditional mean function for progeny carrying QTL genotype ( or 2) for the given subject specific random effect () can be expressed as where

Assume the vectors of subject specific random effects follow -variate normal distribution with mean 0 and covariance matrix and they are independent of the residual errors. Note that under this assumption, will follow , and the marginal distribution of will be .

2.2. Model for the Event Time

We use to denote the event time of progeny . Since in the current situation, the event time is recorded for all progeny; no progeny is censored. Assuming a Cox proportional hazard model for this event, we get for progeny , where denotes the baseline hazard at time , is the mean longitudinal trait at time for given when progeny is of QTL genotype and the regression coefficient represents the effect of the trait on the event time. The survival function for progeny can be expressed in terms of the hazard function as .

The longitudinal model described above is linked to the hazard model by . If = 0, then the event is independent of the trait, and hence we should better fit separate models for the trait and the event. However, when is different from zero, a joint model performs better than the separate models [17]. For simplicity, the baseline hazard is assumed to be a step function, over a partition of the observed time scale [0, max)] into (possibly evenly spaced) intervals, . The value of is usually not too large, possibly smaller than 10.

2.3. Likelihood for the Joint Model

Since the QTL genotype of a progeny is unknown, we use a mixture model to describe the likelihood of the progeny in terms of its possible underlying QTL genotypes [21]. The joint likelihood of unknown parameters given the longitudinal trait and event time for all progeny can be expressed as where denotes the joint density of the longitudinal trait and event time; denotes a multivariate normal with QTL genotype-specific mean modeled as (9) and covariance matrix ; hazard function is modeled as (11); and denotes the conditional probability of QTL genotype given that the marker information of projeny and is the QTL genotype for the -th subject.

The QTL genotype is inferred from marker genotypes of the linkage map. Let be the -marker genotypes for progeny , the position of the putative QTL measured by its distance from the very first marker of an ordered linkage group, and the distances between marker 1 and . Assume that the QTL is located between marker and . Then, the conditional probability of QTL genotype given the genotype of these two markers that flank the QTL is expressed as Note that, given the QTL locations , , and , one can compute , the distance of the QTL from marker and , the distance of the QTL from marker [22]. Using the Haldane map function, one can compute recombination fractions between marker and QTL (), between QTL and marker (), and between markers and () as follows: where is the distance between marker and . Wu et al. [22] provide a procedure for deriving the conditional probabilities of QTL genotypes given marker interval genotypes for the backcross, F2, and RIL populations, respectively.

Unknown parameters in likelihood (12) contain QTL genotype-specific parameters , , the parameters that model the variance structure and dependence structure and , as shown in models (7) and (8), respectively, the effect of the longitudinal trait on the event , QTL position and the baseline hazards .

2.4. Posterior Distribution and Sampling Procedure

We derive a Bayesian approach for estimating the unknown parameters. This will first need to specify the prior distributions for the parameters and, given the data and the priors, derive the posterior distribution over all the unknown parameters. For , we place a multivariate normal prior with zero mean and covariance matrix . An inverse gamma prior with parameters (,) is considered for . We consider a uniform prior on and uniform prior for the parameter . Independent Gamma priors are taken for , for . Priors for and are taken as MVN(0,) and MVN(0,), respectively.

With the above priors and likelihood function, we have the joint posterior distribution for the parameters. In this case, it is quite straightforward to get the full conditional posterior distributions. Assume that the priors are independent for different parameters. Thus, we get the posterior density of , , , , , , as

Assuming that priors for different genotypes are independent, we can express the above posterior distribution as

The full conditional distributions for the model parameters, as derived in the Appendix, are used to estimate the parameters using the MCMC algorithm. Note that the full conditional distribution for is expressed as a product of a normal distribution term which comes from the longitudinal trait and two other terms from the hazard model. To update , therefore, we use a Metropolis-Hastings (MH) algorithm with a normal proposal density since it is a part of its posterior distribution. We also note that in the full conditionals of and , normal distribution coming from the longitudinal part of the data is the main determinant. Hence for and , we consider normal proposals with the current value of the parameter as the mean and covariance matrix as and , respectively. Selection of a good proposal density for is a bit tricky and we follow the recommendation given by Wang and Taylor [17]. By evaluating several choices for a good proposal, we consider a normal distribution with mean as the current state of the parameter and a suitable standard deviation in such a way that the proposed density gets well mixed with the target distribution (acceptance rate between 0.25 to 0.40). Because of conjugacy, we can directly simulate from the full conditional of ’s.

The parameter which specifies the location of the QTL is updated following the idea of Satagopan et al. [23] by using the MH algorithm. A new value of , which we denote by , is generated from Uniform , where is the tuning parameter. Denote this proposed distribution by . The proposed value will be considered as the new value of the chain with probability We note that and, similarly, .

Because of the independence among progeny, is updated by updating for each separately. For each progeny , the full conditional density is in the form of a multinomial with the following cell probabilities: We can sample the QTL genotype directly from this full conditional density at each cycle. Details of the estimation procedure can be found in Satagopan et al. [23].

3. Application

3.1. Material and Analysis

The new model was applied to analyze a real data set for QTL mapping in soybeans. The mapping population contains 184 RILs derived from two cultivars, Nannong 1138-2 and Kefeng no. 1. A genetic linkage map of this population was first established by Zhang et al. [24] with 452 makers including RFLP, SSR, EST distributed among the 21 linkage group. This map was recently updated by adding some new SSR makers and dumping some unreliable markers. The new map contains 834 molecular makers covering a length of 2,308 cM in 24 linkage groups, with an average genetic distance of 2.85 cM between adjacent markers. Those markers with missing information were excluded from the analysis, leading to a total of 780 markers involved in our analysis.

The plants and their parents were grown in a sample lattice design with two replicates at Jiangpu Soybean Experiment Station, Nanjing Agricultural University, China. After 20 days of seedling emergence, plant biomass (in gms.) were measured once every 5–10 days until most plants stopped height growth. A total of 8 measurements were taken for the biomass and the time to get the first flower in that growing season was also recorded for each plant. Figure 1 shows the raw data, both the trait (biomass) and the event time for 184 plants.

Prior distributions for the model parameters were taken as follows. For genotype specific fixed effect , a multivariate normal prior was used with zero mean and a diagonal covariance matrix with all diagonal elements 100. For , we took prior which has with small variance (0.25). A uniform prior was taken for following Wang and Taylor [17]. Observed time scale for the event time data was partitioned into 5 parts; that is, we took and considered independent gamma (0.04, 1.0) priors for , following Wang and Taylor [17]. Uniform prior was taken for . For and , we considered multivariate normal priors with zero means and diagonal covariance matrices with diagonal elements 30 and 20, respectively. To investigate the effect of prior distributions on the estimation method, we did a sensitivity analysis. Considering different sets of priors, we fitted our model many times and it turned out the estimation is almost insensitive to the choice of priors. Hence the choice of our priors (even with huge variances for , , and ) does not affect much the estimation of the model parameters.

We fitted our joint model as described in Section 2, by MCMC sampling. We ran chains 120,000 times. To remove the effect of the starting values, we excluded first 20,000 burn-in iterations. With the remaining 100,000 iterations, we estimated the posterior distributions and the parameters were estimated by the posterior mean and also calculated the sample standard deviations for the posterior densities. For the MH algorithm, the acceptance rates were 0.31, 0.26, 0.25, and 0.30 for , , , and , respectively.

Since our model is complex, we perform several standard diagnostic tests to assess the convergence of the Markov chains. First, we use the method proposed by Brooks and Gelman [25]. Considering five different chains with different starting points and discarding the burn-in iterations, we computed multivariate potential scale reduction factors (MPSRF) to assess the convergence of the chains. Starting points for the model parameters were drawn from the respective priors. The computed values of this statistic get stabilized near 1 after 60,000 iterations for our model parameters, which indicates convergence of the chains.

Second, we perform Geweke test which compares the earlier part of the markov chain to the later part for assessing convergence. After deleting the burn-in iterations, from the remaining 100,000 iterations, we take out two subsequences; the first 50,000 and the last 50,000 iterations. Also consistent spectral density estimates at zero frequency are calculated to compute the -scores. The calculated values for our model parameters are above 0.18, which indicates a small absolute -scores assessing the convergence of our chains. A detailed discussion of this method can be found in Geweke [26].

Finally, we perform the Heidelberger and Welch test as proposed by Heidelberger and Welch [27]. This test has two parts: a stationary test and a half-width test. Our chains after deletion of the burn-in iterations, pass the stationary test. To assess whether the number of iterations is adequate to estimate the parameters accurately, we calculate relative half-width (RHW). We consider the default value (0.05) and predetermined tolerance value is taken as 0.1. For all our model parameters, the calculated RHW is in between 0.045 and 0.081. This indicates that we have enough iterations to estimate the model parameters with 95% confidence under tolerance of 0.1.

3.2. Results

Figure 1 is the growth trajectories of whole-plant biomass over time for all RILs, in which the times to first flower for each RIL are also indicated. There are great variability found for these two traits among RILs. In the exploratory data analysis, we computed BIC values for different orders for the triplet , showing that the BIC value is smallest for the order , that is, second order polynomials can best fit the the mean, variance, and dependence structures (Table 1). By scanning for the existence of QTLs over the genetic linkage map, we obtained the posterior distribution of the model parameters and estimate marginal posterior distributions of the QTL locations () for all 24 linkage groups (Figure 2).

We observed posterior peaks in linkage groups 1, 4, 15, 19, 20, 21, and 23. To draw inference about the existence of a putative QTL in each of these groups, we computed the Bayes Factor (BF), defined as where denotes the number of QTLs in that particular group. Following Jeffrey’s scale, the BFS with value smaller than 1 gives strong evidence against the null hypothesis and higher than 10 gives enough evidence for the null hypothesis. Note that in this case, we are testing the existence of no QTL (null) versus the existence of one QTL (alternative) for each linkage group. Here, no QTL in a group means , for that group. So, in order to compute BF we run our MCMC twice, first under the null and then under the alternative. The calculated BF for the above 7 linkage groups were 0.2781, 11.274, 13.493, 10.610, 0.5913, 11.475, and 0.7953, respectively, implying the existence of QTL in groups 1, 20, and 23. Table 2 provides genotype-specific mean parameters for the QTLs located on linkage groups 1, 20, and 23, along with their 95 percent credible intervals (C.I.).

Since the nature of our model is complex and our estimation is based on MCMC, we perform posterior predictive check for the aforementioned 7 linkage groups. We simulate observations to get the posterior predictive distribution. Let be the set of all model parameters and be the replicated data. Then given the data , the posterior predictive distribution of is given by .

One can simulate from the posterior predictive distribution using the following two steps. First from the posterior distributions of the model parameters, simulate values (vectors) of . Next for each value of , simulate a value (vector) from the likelihood. The values (vectors) of drawn in this way will essentially come from posterior predictive distribution .

We simulate 100 draws (=100) from the posterior predictive distribution and then apply proposed joint analysis and estimate the model parameters using MCMC as described earlier. We compute BF for each of those 7 linkage groups by considering the problem of testing the existence of no QTL (null) versus the existence of one QTL (alternative). Table 3 shows the average BF (with estimated SE) for each linkage group and the existence of QTL in groups 1, 20, and 23 is quite evident.

Heritability (broad-sense) for the traits is estimated from the data, as the proportion of phenotypic variance attributable to genetic variance. The estimated heritability in our case is 32.6%. Also we compute the percentage of variance explained by three identified QTLs. It turns out the QTLs identified in linkage groups 1, 20, and 23 explain 6.8%, 14.3% and 11.4% of the total variance, respectively.

We show the marginal posterior plots with 95% credible intervals for the parameter in Figure 3. Note that for all three groups, the estimates and the confidence intervals are in the negative part which indicates a negative relationship between the trait and the event time. Biologically this is sensible since the plants with higher body mass will take less time to have the first flower compared to the plants with lower body masses.

Figure 4 illustrates genotypic differences in whole-plant biomass trajectory and the time to first flower for the three QTLs detected. At the QTL on linkage group 1, the allele () inherited from parent Nannong 1138-2 leads to increased biomass growth and earlier flowering than the allele () from parent Kefeng no. 1. The inverse pattern is observed for the QTL on linkage group 20. The QTL on linkage group 23 alters its direction of genetic effect. Affected by the first two QTLs, poor vegetative biomass growth in plants stimulates early flowering (Figures 4(a) and 4(b)). Yet, the QTL on linkage group 23 makes the fast-growing genotype to flower earlier than the slow-growing genotype (Figure 4(c)).

4. Simulation Study

We performed simulation studies to study the statistical properties of the joint model. We assumed an RIL design of 200 progeny and simulated 11 evenly spaced markers on a linkage group of length 100 cM. A QTL is located at 43 cM from the very first marker of the linkage group. To reflect a practical problem, we used parameter estimates of the soybean QTL detected in linkage group 20 as true values to simulate the data, allowing the covariance structure. Time-dependent phenotypic values were assumed to follow a multivariate normal distribution and the event times were taken the same as the soybean data. To make a comparison, we analyzed the simulated data using our nonparametric GLM-based covariance structure and the traditional AR and CS covariance structures.

The prior distributions for the model parameters were taken in the same way as discussed in Section 3.1. A uniform prior on (0,100) was considered for . For each situation, we ran Markov chains 120,000 iterations and initial 20,000 burn-in iterations were discarded. Model parameters were estimated from the posterior distributions on the basis of remaining 100,000 iterations. The computed BF was 0.649 giving a strong evidence against the null hypothesis. Table 4 shows the means of the Bayesian estimates of model parameters with their respective Monte Carlo standard errors, (MCSE). It can be seen that the estimates are quite close to the actual values with a reasonably small standard errors which justifies the accuracy and precision of our estimation procedure. However, our GLM-based approach provides better estimation of parameters than AR and CS-based approaches.

Figure 5 elucidates the marginal posterior plot for the QTL location under three different covariance structures. It is found that both AR- and CS-based models provide the peaks at wrong locations, whereas GLM-based nonparametric covariance structure locates QTL more accurately in which case the length of the credible interval is narrower than those obtained from the former two structures. This provides numerical evidence that the proposed GLM-based model has better precision of QTL localization.

We perform further simulation studies to assess the reliability of BF in our data application. For each of the linkage groups 1, 20, and 23, we simulate data under the “null” model. As mentioned earlier, under the “null” model, . For group 1, we consider the null model , for group 20 it is ), and for group 23, our null model is given as . The computed BFS for these three groups are 34.55, 46.79, and 41.86, respectively, suggesting strong evidence for the null.

5. Discussion

Tools to reveal the secret of life should reflect the dynamic nature of life. More recently, a series of statistical models have been developed to map quantitative trait loci (QTLs) that control the dynamic process of a complex trait [1, 2, 22, 28]. These so-called functional mapping models integrate mathematical aspects of biological processes into a statistical framework derived to map complex trait QTLs and have proved to be useful for detecting and identifying genes and genetic interactions involved in quantitative genetic variation for plant height, plant rooting ability, and animal body mass. Functional mapping is also flexible to incorporate complex biological phenomena, such as genotype-environment interactions and allometric scaling providing powerful means for addressing biological questions of fundamental importance.

In this paper, we develop a new version of functional mapping that can map QTLs for developmental events affected by organismic growth trajectories in time. This version is benefited from recent statistical developments for joint modeling of longitudinal traits and event time [1618, 29]. In the presence of strong correlation between a longitudinal trait and event, a joint model performs better than submodels separately for a single trait. In the joint modeling framework for these two types of traits, we applied a GLM-based approach to model the covariance structure and local polynomials for the mean curves. Bayesian estimation method using the MCMC algorithm was used since it is computationally much simpler than a likelihood-based approach. Simulation results show the effectiveness of GLM-based covariance model compared to traditional parametric compound symmetry or autoregressive structure.

Our joint model, embedded within functional mapping, promotes the study of testing how QTLs pleiotropically affect different biological processes and how one trait is predicted by other traits through genetic information. The application of the new model to soybean mapping data does not only validate its usefulness and utilization, but also gains new insight into the genetic and developmental regulation of trait correlations in plants. There is no doubt that the new model can be modified to study the genetic associations between HIV dynamics and the time to death as well as prostate specific antigen change and the time to recurrence of prostate. However, there is much room for modifying this model. First, to clearly describe our idea, we assume one QTL at a time for trait control. Epistatic interactions between multiple QTLs may play an important role in trait development as well as in correlations between longitudinal traits and events. Second, from a dynamic systems perspective, we need to model dynamic correlations among multiple longitudinal traits and multiple events. Third, with the availability of efficient genotyping techniques, our model should accommodate a high-dimension model selection scheme to identify significant genetic variants from a flood of marker data.

Appendix

Denote : , , , and . Let be the number of progeny that carries QTL genotype . We derive the full conditional distributions for the unknown parameters as

Hence, we have where

Similarly, the full conditional distributions for the other parameters can be derived as follows: where is the number of events in the interval and

Acknowledgments

This work is partially supported by NSF/IOS-0923975, Changjiang Scholars Award, “Thousand-Person Plan" Award, the China National Key Basic Research Program (2006CB1017, 2009CB1184, 2010CB125906), the China National Hightech R&D Program (2006AA100104), the Natural Science Foundation of China (30671266), and the China MOE 111 Project (B08025). R.Li was supported by an NIDA Grant P50-DA10075 and an NNSF of China Grant 11028103. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIDA, or the NIH.