Abstract

DNA sequence data are now being used to study the ancestral history of human population. The existing methods for such coalescence inference use recursion formula to compute the data probabilities. These methods are useful in practical applications, but computationally complicated. Here we first investigate the asymptotic behavior of such inference; results indicate that, broadly, the estimated coalescent time will be consistent to a finite limit. Then we study a relatively simple computation method for this analysis and illustrate how to use it.

1. Introduction

In the past decades, considerable progress has been made in the field of population genetics. One of the main goals is to infer the coalescence time of the population under study, that is, to infer the time since their most recent common ancestor (MRCA) and its distribution based on the observed data.

In genetics, coalescent theory is a retrospective of population genetics that traces all genes in a sample from a population to a single ancestral copy shared by all the members of the population. The coalescent time of a population is the time of their most recent common ancestor. The inheritance relationship among the genes is typically represented as a gene genealogy, similar to a phylogenetic tree. The goal of coalescent analysis is to infer the coalescent time of a sample of individuals independently sampled from a population of size , based on their observed DNA sequence diversity. Unlike parameter inference for independent and identically distributed (iid) data, for which asymptotic limit can be used conveniently to characterize the estimator when the data size is large, various existing studies indicate that the estimated MRCA, in unit of generations, is unclear as whether it will concentrate as the data sample size increases without bound. In contrast, in the estimation of mutation rate in the same setting, the estimate is consistent and asymptotically normal [1], although at a much slower rate of , compared to the rate of for i.i.d. data. Also, different from usual parameters, the MRCA changes with , the number of sequences. This prompts us to the investigate the asymptotic behavior of the estimated coalescent time. We want to know whether such estimator will be asymptotically consistent and in what sense if it does. Conditioning on the total number of segregating sites, we find that such estimators converge or not to some nonnegative finite limits in posterior mean, depending on the behavior of the number of mutations on all the branches of the rooted trees constructed from the observed data. Also, analysis of this problem with this type of data is often computationally extensive and complicated; we study a relatively simple simulation method for this problem. We first study the asymptotic behavior of this method in Section 3, and then describe and illustrate our method for this problem in Section 4.

In coalescence inference, mitochondrial DNA (mtDNA) data plays an important role. Mitochondria is one of the few genes existing outside the cell nucleus, and for mammalian it is only maternally inherited. Human mtDNA is a double-stranded molecule sequence about 16,500 base pairs in length. It is outside the cell nuclear, and it is known that the mutation rate in mtDNA is about 10 times that of the nuclear genes, and that on one section of the mitochondria, its control region, the mutation rate is even one order higher. The simple inheritance pattern and high variability make mtDNA an important source in the study of human evolutionary history. Each site on the DNA strand has one of the four bases A, C, G, or T. As the molecule evolves, mutations occur in the form of base substitutions. The change between purines (A,G) or pyrimidines (C,T) is called transition; that between a purine and pyrimidine is transversion. The former type of substitution is much more common than the latter.

We focus on the control region of the mitochondrial data in Griffiths and Tavaré [2], which is part of the data in Ward et al. [3]. They are from a segment of the control region, with 352 base pairs (sites), out of which 159 are purine sites and 193 are pyrimidine sites. This data contains 63 sequences sampled from a North American Indian tribe, the Nuu-Chah-Nulth, from Vancouver Island. After eliminating sequences with multiple mutations on some single sites, so that the assumption of at most one mutation each site is met, the remaining data has 55 sequences, with 14 distinct sequences (called lineages) in the data. Site at which not all the observed sequences have the same base is a segregating site. The whole sequences are long, but only the segregating sites are informative for the analysis; the other sites are ignored. The mentioned data has 18 segregating sites and is presented in Table 1, with the frequency (or multiplicity) of each lineage.

The coalescent is a model for the genealogical tree of a random sample of DNA sequences from a large population. An example of such a tree of sample size is given in Figure 1.

For more detailed reviews of this topic, see Hudson [4] and Donnelly and Tavaré [5].

In coalescence inference one has the following.

Basic Assumptions. The population size is large, remains unchanged for many generations into the past, and is known, or can be estimated from other sources; the data is a random sample from the population; the number of births in each generation follows the Wright-Fisher model (since the population is of constant size, the number of deaths also follows the similar model); mutation (substitution) at any nucleotide site can occur only once in the ancestry and is irreversible; mutations that occur in different time intervals are independent; the time point at which mutation occurs follows a Poison distribution with rate to be defined latter, independently in each branch of the genealogy tree, where is known, or can be estimated from other methods or sources.

The inference of coalescence time of a sample population of size has two steps. The first step is modeling the distribution of without any data, the predata distribution; then in the second step, update the predata distribution, using the observed data, to the postdata distribution, based on which the formal inference is conducted. The predata distribution is pioneered by Kingman [6, 7]; he showed that, in time units of generations, where the ’s are independent waiting times. is the time from common ancestors of the sample to common ancestors. A quick reference on this can be found in Tavaré [8]. Here is distributed as exponential , with . The s can be represented graphically as a coalescent tree as in Figure 1; then is the height of the tree. Define the tree length as then (Kingman) The time unit is transformed to years by the relationship , where is the average years of each generation, which is usually taken as 20–25. Here we see that, as an initial analysis without the observed data, the coalescent time of a random sample of size from a population of size is roughly generations, as long as is moderately large. Thus, the coalescent time of a sample from a subpopulation is roughly the same as that of the population (as long as the sample size is moderately large). This phenomenon is further investigated by Watterson [9], who showed that where is the number of ancestors, at generations ago, of the population with size from which the data sample of size is drawn. Here the sample must be a random draw from the population; otherwise the result may not be reliable. For example, the sample of size is drawn from a subpopulation of size from a population of size ; then by (3), the predata estimated of the coalescent time of this sample is roughly generations, but also it is roughly generations since the sample is also from the whole population. The paradox arises from the sampling scheme. If the sample is drawn from the subpopulation of size , one can only use as the time scale, not , since the samples drawn from the subpopulation are expected to have smaller genetic variation than from the whole population.

For mutation, the common assumption is that the times at which mutation occurs follow a Poison process with constant rate , so that, in any branch of length from the tree, the number of mutations on that branch has a Poison distribution with mean , independently of the mutations on the other branches. For the time scale mentioned before, usually , where is the probability of a mutation that occurs per sequence per generation. For DNA sequences, is the sequence length (number of bases) times the mutation rate per site per generation and is often available from other sources. Since the coalescent time of a sample with moderate size is approximately generations, can be approximately interpreted as the cumulative (since the time of MRCA) mutation rate (number of mutations) per sequence. Also, since the population size is , can also be interpreted as the mutation rate of the whole population per generation.

Thus, given the mutation rate and the tree length , the number of mutations in a sample of individuals from the given population follows the Poison distribution Po [10] Note that this probability does not depend on , but on , , and . Why ?. Take ; then , which is the expected number of cumulative mutations since generations ago in a sequence. So is a reasonable choice of the parameter in the Poison distribution. But if we model the number of cumulative mutations per sequence since generations ago, for moderately large , we should use .

The key in the coalescence inference is to evaluate the postdata distribution of , which is much more involved than its predata distribution, it depends heavily on the mutation distribution in the data. For example, if more mutations occur in the earlier stage of the genealogy tree, then the estimated will be bigger. Although under the assumption that mutation can only occur at most once at each site and mutation is irreversible, the total number of mutations in the observed data is just the number of segregating sites. But how the mutations distribute in the branches of the genealogy tree is unknown. Such distribution is crucial in the inference of , which depends on how much data information being used and on the actual methods. This is our focus from now on. Denoting by the observed data, the estimated coalescent time of the sample is given by the postdata distribution mean of as The inference can be viewed as a Bayesian procedure, with the predata and postdata distributions that correspond to the prior and posterior distributions in a Bayesian framework. But unlike the common Bayes setting, here the parameter varies with the sample size , and the data cannot be modeled i.i.d. with this parameter. That is the reason the inference of cannot be made arbitrarily accurate, in the sense that the variance of the postdata distribution cannot be arbitrarily small, as the sample size increases without bound. Also, generally the postdata distribution is not in closed form and has to be evaluated by sampling methods. Tavaré et al. [10] derived the postdata distribution based on only the number of segregating sites in the sample. This method is very convenient to use, but does not use the DNA sequences structural information. The well known method in Griffiths and Tavaré [2], hereafter GT, is based on the full data information represented by a set of rooted trees. This method is one of the basic tools in coalescent inference using full data information, but is computationally complicated.

To evaluate the postdata coalescent distribution, GT used the probabilities recursion formula, derived in Ethier and Griffiths [11]. The method is not easy to fully understand and correctly use for many geneticists. Also these probabilities are computationally prohibitive; the postdata distribution of is computed by a Markov chain Monte Carlo sampling and is quite involved.

Here we study a relatively simple approximate method using the full data information; in this method, instead of computing the tree probabilities as in GT, we just set the post-data tree probabilities as uniform for the rooted trees and use a simulation method to compute the coalescent distribution; thus, getting round of the complicated evaluations of the tree probabilities, it is easy to understand and much simpler in computation.

The rooted tree plays an important role in the analysis, which is not uniquely determined from the data. The data is equivalent to an unrooted tree, which is equivalent to a set of unrooted trees. Each rooted tree has a 0-1 valued matrix representation which is convenient for some computations, but not any 0-1 valued matrix corresponds to a rooted tree. In the following, we give more details about them and their relationships.

Rooted Tree. A rooted tree consists of a system of branches, subbranches, and so forth. The tip of each branch or subbranch represents a known lineage. The observed mutations in the sample are represented as dots in the branches, subbranches, and so forth at specified positions. The observed multiplicity of each lineage is represented as leaves at the tip of each branch or subbranch, and so forth.

The presentation of a rooted tree is unique up to the relative positions of its branches, subbranches, and so forth. A rooted tree has several levels of randomness. If we only know the sample size , then the rooted tree has a total of leaves; apart from that, the shape of the tree, how to split, how to allocate the leaves, how many mutations, and the distribution of the mutations are all random. If the data and the number of mutations are given, then the tree can only take a few shapes. Different from GT and other related literatures, here we put the observed lineage frequencies (multiplicities) as leaves in the corresponding tips of branches, subbranches, and so forth of the rooted tree.

Different from a coalescent tree which has a complete time ordering of the splitting points of branches, a rooted tree has only partial time orderings of these splits and mutations. We only know that splits of branch(es) occurred before those of its subbranches, but do not know the ordering of splits of different branches. We know that mutation(s) on the branch occurred before those on its subbranch(es), but do not know the order of ones on the same branch, same subbranch(es), or on different subbranches. For a given sequence data, it may correspond to more than one different rooted tree. For the observed data in Table 1, all the columns are for segregating sites, and there is no transversion. Under the assumption that mutation can only occur at most once at each site and mutation is irreversible, at each segregating site, one and only one of the base types is mutant; the other type is ancestral. So if we know the mutation status at each segregating site, the mutation statuses are said to be labelled, and we can use a 0-1 valued matrix to denote the observed data, where , if the base type of lineage at site is mutant, and otherwise. Such 0-1 matrix representation of the data is convenient in the analysis. It is easy to see that each rooted tree uniquely determines a 0-1 valued matrix , but an arbitrary 0-1 valued matrix may not correspond to a rooted tree. It must satisfy some conditions to corresponds a rooted tree. There are abundant methods and algorithms on how to judge if a given 0-1 values matrix is a valid representation of a rooted tree, and if so how to build the rooted tree (e.g., [1216]). We find the method that appeared in a number of articles and is stated as Lemma 1 in Gusfiled [16] is easy to use. Given a valid 0-1 valued matrix (means it satisfies the condition for representing a tree), one can uniquely draw a rooted tree corresponding to it. Here, uniqueness means the genealogy relationships, including which lineages are in the same branch or subbranch, and so forth and which mutation sites are on which section of which branch or subbranch and so forth, are determined, but the particular shape of the tree, such as some branch put on the left or right side, the angle of branches, their lengths, and so forth, are irrelevant. Thus, there is a 1-1 correspondence between a rooted tree and a valid 0-1 valued matrix. Given the observed data, the mutation statuses at the sites are usually unknown. For data with segregating sites, there are different ways to labelling the mutation statues, but most of the labeling matrices do not qualify to be representations of a rooted tree; it is known that there are only different rooted trees, and hence different labellings (matrices) correspond to the data, and there are existing algorithms to construct the rooted trees and their corresponding matrices (e.g., [16, 17]). However, we find that the method in GT is convenient. By this method, one first needs to construct one rooted tree from the data or its valid 0-1 valued matrix. For example, start from the least shared mutations labeling that, on each column (site) of the data, label the less common base type as mutant (the other as ancestral). It is easy to check the conditions for its validity using Lemma 1 mentioned above. Construct the rooted tree corresponding to this matrix and convert it to an unrooted tree as in GT; that is, absorb those subbranches without mutations into their branch(es), and then straighten the branches, subbranches, and so forth. The unrooted tree is uniquely determined from any of the rooted trees.

Then, based on this unrooted tree, one can get all the other rooted trees as in Griffiths and Tavaré [18]; that is, alternatively put the tree root point near each of the vertexes that stretch out that vertex, then arrange the branches, subbranches, and so forth into the desired shapes; if there are more than one mutation between two adjacent vertexes, put the tree root point in the middle of two such adjacent mutations, alternatively for all such pairs of mutations, and shape the tree as above. This way we get all the rooted trees from the unrooted tree. In fact, given any rooted tree, all the other rooted trees can be constructed in the same way above, without using the unrooted tree. Once the rooted trees are constructed, the corresponding matrix representations are at hand.

3. Asymptotic Behavior of MRCA Estimate

For parameter inference with independent and identically distributed data and sample size , it is known that the estimator is asymptotically consistent and asymptotically normal with rate . But for inference of MRCA, the data are not independent and identically distributed, and existing studies indicated that the distribution of the estimated MRCA will not concentrate, even if . In the case of estimating the mutation rate with the same data, the estimator is found to be consistent and asymptotically normal with rate [1]. This motivates us to investigate the asymptotic behavior of as a commonly used point estimator of the coalescent time. We want to know whether this estimator has similar asymptotic behavior as the mutation rate estimator. We find that such estimators are not consistent almost surely. To describe the result, we consider the data set in three different commonly used forms. The first type of data we consider is in the form of a coalescent tree as in Figure 1. This type of data is often not practical, as for most real data we do not have the information to construct such tree. But as a starting point it will provide us some guide on the result. There are nodes (splitting points) in the tree numbered 2 to in their time order. Recall the definition of the th coalescent time . Between the th and th node there are exactly segments, denote them as from left to right, each has length . Assume the number of mutations on segment is known. Let , be the mutation distribution corresponding to and . Here this type of data is fully represented by . When we do not have , is not uniquely determined. But given each rooted tree , and the location information of the mutations, a mutation vector can be constructed by a random manner (to be detailed in Section 4) corresponding to . Denote be our prior on the rooted tree 's, that is, without additional knowledge we treat each rooted tree as equally likely from the observed data. Here our ’s have different meaning from the probabilities ’s as in GT (the latter do not sum up to one, but to the probability of obtaining the unrooted tree from the observed data). We have (Appendix)

The commonly available data is in the form of Table 1, which is equivalent to rooted trees; here .

The last method is to estimate only by the number of mutations , without using the information in the rooted trees.

We have the following result (proof in Appendix).

Proposition 1. (i) One has consequently, the above estimator will diverge almost surely, if is treated as random.
(ii) One has and will converge or not depending on that of the series above.
(iii) One has and the asymptotic behavior of the above estimator depends on the series above.

Remark 2. The above result tells us that cannot be characterized by an asymptotic deterministic quantity, even for large data size. The estimator is dominated by the number of mutations in the first few coalescent times. Hence, the only practical way to infer the coalescent time is via numerical methods, as the postdata coalescent distribution has no closed form even asymptotically. In contrast, the predata mean is convergent but is inaccurate as an estimator of the coalescence time for the population under study.

4. The Proposed Method

The method is to construct the mutation vector and compute the data probability directly from the genealogy rooted tree 's. Suppose that there are segregating sites in the sequence data, which is exactly the total number of mutations occurred in the history of the sampled individuals, then there are different rooted trees 's compatible with the data. Each of the rooted trees is a fixed genealogy structure, with the multiplicities as the leaves, but the number of mutations among the tree segments is random, subject to the total number of mutations being . The structure consists of the tree branches, subbranches within each branches, sub-subbranches, and so on,, and the leaves. These are the fixed features of a rooted tree. Given the data, the rooted tree is a display of how the mutations are distributed along the lineages, but there is no time scale in the tree, so (5) cannot be used to compute the mutation probabilities. Each rooted tree tells us a partial ordering of the mutations. For example, in the rooted tree, we know mutations at sites 4, 6, and 14 occurred before the split of lineages , , , and , thus occurred before the mutations at sites 1, 5, and 10. But we do not know which of 4, 6, and 14 occurred first. We know mutation 1 occurred before 10, but we do not know the order of 1 and 5, and so forth. If we have the full data corresponding to all the rooted trees, 's, we can compute as in Proposition 1(ii). But and the s are not directly available; however, can be easily simulated by the prior exponential distribution, and each rooted tree has an initial mutation distribution on its branch segments. Denote by the th segment (the order is arbitrary, e.g., we can lable them from upper to lower and left to right locations), and let be the number of mutations on it (many of them are zeros; we can concentrate on the segments with nonzero mutations). Denote . Given , can be sampled from (to be detailed latter). Let be the expectation with respect to . The above motivates us to estimate by

The above expectation is not easy to compute directly since we do not know the joint distribution of . Instead we use simulation method. For this, we sample independently and generate (see below) corresponding to and for each then approximate as

Now we consider generating . After is allocated among the branches of , we only need to consider each segment with nonzero number of mutations in them. Each length of in is the summation of some of the 's. For simplicity of exposition and notation, suppose that they are ; then given the mutations in and using (5), it is easy to see that the number of mutations in each of the intervals , follows the multinomial distribution where . After all the nonzero 's are allocated in the corresponding intervals, we have where the summation is for all 's that fall in .

Specifically, the simulation method is as below. For , do the following steps.(i)Sample from the coalescent distribution as in (1); that is, the 's are independent, with . Or equivalently, sample and set .(ii)For each fixed , allocate to the coalescent events of the sequences based on each rooted tree . See illustration below for details.(iii)Allocate the mutations in the corresponding segments according to (13). Then get the 's as in (14).

After all the iterations, evaluate (12) until convergence, which can be assessed by relative error, for example.

Illustration: Allocate to the coalescent events of the sequences based on rooted tree. We use the backward method; that is, first allocate , then ,, and last . Consider the rooted tree, for example. There are sequences, with frequencies for lineages . Note that sequences (leaves) in each lineage (branch) only coalescence within each branch (if the branch has more than one leaves), and branch with a single leaf coalescences only at MRCA . We first decide goes to which branch or pairs of single branches. Since it is the latest coalescent time, it can only go to a pair of leaves in some branch with multiple leaves. Since branch has only 1 leaf; it is excluded at this step. The remaining branches all have multiple leaves with a total of 54. We assign to one of these branches with weights proportional to their number of leaves, that is, with probabilities . Suppose that is assigned to ; we need to decide which subbranch it goes to. We have three candidate subbranches with number of leaves . We randomly assign to them with weights . Suppose it is assigned to branch ; then will go to a pair within this branch, and which pair is irrelevant. But the pair will be treated as a single leaf in assigning the rest 's. So after this step, we reassign the number of leaves in as 18.

Now we assign . The procedure is the same as above; the only difference is now has 18 leaves. The candidate branches are still with weights . Suppose that is also allocated to of branch ; then has 17 leaves now.

We now allocate to candidates with weights . Suppose that is allocated to ; we need to decide which of the 4 subbranches it will go to. has only 1 leaf and is excluded. So we allocate subbranches with weights . Supposing it goes to , since it has only one pair of leaves, then and are merged as one leaf after this assignment.

Continue this way, until is allocated. Then all the branches in this rooted tree have lengths as the 's allocated to them. After this step, the length of each segment of is a summation of some 's. Since is known from each , we can allocate each of the 's by (13), then get the 's by the formula that follows it. Then compute (12).

The assumption that the population size is constant can be relaxed the same way as in GS and Tavaré et al. [10].

Appendix

Proof of the Proposition

(i) Recall that the 's are independent, the 's are independent, and the 's are independent with , , , , and so . Observe The right-hand side above is the density of distribution, up to an normalizing constant. It has mean and variance . Since , we have where , the 's are i.i.d with , , , and . Note that and are convergent sequences with Note also that , that is, is a weight sequence for each fixed . Since for fixed , by the results for weighted sum of i.i.d. random variables (see [19], for a review of such results), a necessary condition for to converge (a.s.) is that for all fixed , or no term will be dominant as . Since this condition is not satisfied, the first few terms are dominant and we have The proofs of part (ii) and (iii) are similar to that of part (i) and omitted.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.