Abstract

We consider an infinite-allele Markov branching process (IAMBP). Our main focus is the frequency spectrum of this process, that is, the proportion of alleles having a given number of copies at a specified time point. We derive the variance of the frequency spectrum, which is useful for interval estimation and hypothesis testing for process parameters. In addition, for a class of special IAMBP with birth and death offspring distribution, we show that the mean of its limiting frequency spectrum has an explicit form in terms of the hypergeometric function. We also derive an asymptotic expression for convergence rate to the limit. Simulations are used to illustrate the results for the birth and death process.

1. Introduction

The infinite-allele branching process was first introduced by Griffiths and Pakes [1]. As a special type of branching process, this process allows individuals to mutate into infinitely many allelic variants, each of which is “new” in the sense of being different from all previously existing variants. This idealization is approximately correct for rare point mutations in long DNA sequences. Fundamental results for the discrete-time case (simple branching process) and for the continuous-time case (Markov branching process) have been obtained in [1, 2]. These include the number of alleles at a given generation or time, the generation number or time of the last mutation, and the limiting frequency spectrum. There exists an analogy between the results for the discrete-time and the continuous-time cases; however, the characteristics in the continuous case are relatively easier to derive [2]. Many evolutionary processes may be considered time continuous, and frequently we assume Markov property in modeling. A classical example is the discrete-time Wright-Fisher model, which is typically either approximated by a continuous-time diffusion or replaced by a continuous-time Markov chain, the so-called continuous-time Moran process [3]. Therefore, the time-continuous infinite-allele Markov branching process (TCIAMBP, or simply IAMBP) seems to be appropriate for modeling evolution in population genetics.

Consider a Markov branching process with neutral mutations. Suppose that the process starts from a group of individuals carrying the same allele, and individuals can mutate into new allelic variants. We assume that the mutation is independent of the previous history of the process, and the offspring distribution is independent of the allelic type, that is, the selection is neutral for all alleles. The process can be described as an “infinitely-many-alleles” model (IAM). Whenever a mutation happens, it yields a new allele, which differs from all the previously existing ones. In this paper, we are interested in the frequency spectrum of the IAMBP, which may be defined as the number or proportion of alleles present in a given number of individuals at a specified time point. Frequency spectrum in this paper refers to random allele frequencies, not their expected value as in Griffiths and Pakes [1] and in Pakes [2], since we also consider the variance of the allele frequency later on. Unless specified otherwise, we will use terms “mean frequency spectrum” and “variance frequency spectrum” in the remainder of this paper to denote the expected value and variance of the allele frequency. The frequency spectrum plays an important role in many genetic processes, such as DNA sequence evolution. As an example, Kimmel and Mathaes [4] modeled the Alu sequence data using an infinite-allele simple branching process with linear-fractional offspring distribution, and the goodness of fit testing suggested that Alu sequences do not evolve neutrally and might be under selection. It has to be noted that the concept of the frequency spectrum is in some sense similar to the Ewens’ sampling formula [5] in population genetics. We will return to this subject in the discussion, although analysis of the analogies and differences transcends the scope of the present paper.

The paper is organized as follows. In Section 2, we rigorously define the IAMBP and the mean frequency spectrum of the IAMBP. Then, we provide explicit expressions for the special case of the birth and death process. In Section 3, we derive the variance frequency spectrum and discuss its use in interval estimation for process parameters. We perform simulations to illustrate the results using the birth and death process example in Section 4. Section 5 is a summary.

2. IAMBP and Its Limiting Mean Frequency Spectrum

2.1. Definition and Basic Properties in the Supercritical Case

Let us consider a continuous-time Markov branching process consisting of individuals with exponential life spans with mean . Let us assume that upon death, each individual produces a random number of offspring. As usually assumed, the offspring counts are identically distributed according to probability generating function (pgf) , and they are independent conditional on the past process. The mean of the offspring distribution is , regardless of the allelic type. We further assume that a newborn individual mutates into a new allelic type with probability independently of the previous history of the process. Let us denote by the offspring pgf in a clone, started by the overall ancestor or any of mutants, containing only the like-type individuals. The entire process is a union over all individual types of such clones. The theory of the IAMBP has been developed by Griffiths and Pakes [1] in the discrete-time case and then by Pakes [2] in the continuous-time case. We will assume and , although some results can be proved without this latter assumption.

Let be the number of alleles present in individuals at time and , where subscript indicates that the process begins with individuals carrying the same allele. It has been shown that [2] where is the Malthusian parameter of the overall process and is the probability of observing individuals () carrying the parental allele at time when starting from individuals with the parental allele at time . Consequently, for the number of alleles at time , we have

Let , . If we define and as the limiting mean frequency spectrum, that is, the expected proportion of alleles present in individuals as , then we see that for the supercritical process such that ,

If , then the process of the like-type clones is supercritical, and as it is known [6], and , , as . Therefore, and as , for . This yields the following asymptotic equivalence: Details of the proof are omitted, since they appear elementary.

2.2. IAMBP with Birth and Death Offspring Distribution

For the IAMBP with birth and death offspring distribution , , we are able to obtain an explicit form for , ; therefore, the limiting mean frequency spectrum can be derived. The offspring pgf of the like-type individuals clone in the birth and death IAMBP is written as where and stand for the death, birth, and mutation probabilities for every individual and . Note that under another parameterization where the two newborn individuals die, live, and mutate independently, this pgf may be formulated differently as . Under either parameterization, . If, as assumed, , then parameters and are subject to a constraint

Let us write and (note, for the other formulation, and ). The explicit form of can be written as where is the Malthusian parameter of the like-type clone and is the Gauss hypergeometric function [7], defined as For a detailed derivation, see Appendix A. Note that the supercritical condition also guarantees that the argument of the hypergeometric function remains within its region of definiteness.

It follows that

Figure 1 shows an example of the limiting mean frequency spectrum for the birth and death process with parameters , , and , based on formula (10). To see how the spectrum varies with different parameter settings, we plot in Figure 2(a), the 3-D surface of a major component of the spectrum, , for different ’s and ’s. Figures 2(b) and 2(c) illustrate the effect of one parameter on given a fixed value of the other parameter.

We see that for fixed , increasing causes an increase of . This can be intuitively explained by the offspring pgf of the like-type clone. From the pgf expression , we see that the probability of obtaining one like-type individual in the offspring is , which is an increasing function of for a given , under the constraint . Therefore, increasing will finally lead to an increase of . The effect of on when fixing is not so obvious, but we notice that when fixing very close to , as approaches , the process is approximately critical binary fission; therefore, drops down because of almost sure extinction of the process, as seen from the tail behavior of the solid thick line in Figure 2(c).

Arguably, the frequency spectrum can only be observed in finite time. The finite-time mean frequency spectrum can be obtained by computing , numerically. For the birth and death process, this involves the computation of the incomplete hypergeometric function. The following is a valid question in this context. In order to safely use the limiting mean frequency spectrum, how long should the process history be? Figure 3(a) compares the limiting mean frequency spectrum with some long-term mean frequency spectra, for the birth and death process with parameters , , and . We see that under this setting, the long-term mean frequency spectrum is almost identical to the limiting mean frequency spectrum when . In general, this result depends strongly on parameters , , and , for example, small leads to longer . This provides us with some intuitions concerning the sufficiently large for approximating the limiting mean frequency spectrum. Figure 3(b) illustrates the difference between the finite-time mean frequency spectrum and the limiting mean frequency spectrum as a function of , for large , and for , where lines represent the true difference and markers represent the asymptotic approximation by formula (5). To emphasize the agreement for large, this figure is plotted in semilogarithmic scale. We see that the true difference drops exponentially fast, and the asymptotic approximation is good for large .

Given the observed long-term mean frequency spectrum, the parameters of the IAMBP, such as in the birth and death process, can be estimated by equating the observed long-term mean frequency spectrum from the sample to the expected limiting mean frequency spectrum from formula (3) and solving for the process parameters. In the case of the birth and death process, we may estimate and for example by solving for positive integers , , where and are both functions of and .

There is no explicit solution for such estimator, but numerical search according to some criteria is feasible. Another possibility is to minimize the distance (such as the norm) between the observed long-term mean frequency spectrum and the expected limiting mean frequency spectrum, that is, .

The estimated parameters can be used to check the goodness of fit of the IAMBP model. Another interesting problem is to test whether two sets of parameters are identical, given two observed mean frequency spectra. A simple approach is to use Pearson’s test, such as in Kimmel and Mathaes [4]. However, there may be restrictions to applying the test, such as small cell counts and inappropriateness due to the finite length of the observed spectrum. This motivates us to develop an interval estimator for the IAMBP parameters.

3. Variance of the Frequency Spectrum

Moment estimators based on the mean frequency spectrum only give point estimates of the process parameters. In order to quantify the uncertainty of point estimates, an interval estimator is needed, which requires more information about the distribution of the statistic . First, it can be seen that [2] where are the successive split times of the process, , are two indicators, and if there are individuals alive at time carrying the parental allele, and , for if the th individual born at time () mutates to a novel allelic type and further produces individuals carrying this allele time units later. is the number of split times in , and is the number of offspring produced at time . Obtaining the distribution of is not elementary. However, it may still be possible to define a confidence interval (CI) based on the first and second moments of .

Let be the variance frequency spectrum; by the law of total variance and independence between the indicators in the expression of (details in Appendix B), we have where In Expression (14), , and is the variance of the offspring distribution, regardless of the allelic types.

Similarly as in Expression (3), we may define a limiting variance frequency spectrum . Expression (13) is complicated and usually does not assume an explicit form, even for the special case of the birth and death process. Therefore, we will only give numerical solutions for the finite-time variance frequency spectrum. Figure 4 shows an example of the “”-bands of the finite-time frequency spectrum for the infinite-allele birth and death process with , , , , and . To emphasize the tail probabilities, we draw this plot in semilogarithmic scale.

From the finite-time variance frequency spectrum, it is possible to define a CI where the upper and lower bounds can be written as This CI is useful for checking model validity and for testing whether two observed mean frequency spectra are from the same IAMBP model.

4. Simulation Study

We perform a simulation study of the birth and death process to illustrate the finite-time mean and variance frequency spectra. First we generate samples (genealogical trees) from an IAMBP with birth and death offspring distribution starting from individuals carrying the same parental allele. Due to memory restrictions caused by forward simulation, we limit our simulations to generations and a relatively large mutation probability . The other parameters of the process are set to be and . At time , we record the number of alleles represented by copies, for . Repeating the simulation times, we then obtain the simulated finite-time mean and variance frequency spectra from the replicates.

Figure 5 shows side-by-side bar plots of the simulated and expected finite-time mean and variance frequency spectra, for . We plot the mean frequency spectrum and the variance frequency spectrum in semi-logarithmic scale to emphasize the tail probabilities. In each bar plot, the first black bar represents the expected finite-time mean frequency spectrum or variance frequency spectrum . The remaining ten white bars represent ten replicates of the simulated finite-time mean or variance frequency spectrum as described above. We see that for some classes, the expected mean or variance frequency spectrum is slightly different from the simulated spectrum. Beside sampling bias, this may be caused by the small scale of the simulations. For small mutation probability , we have to set large initial population size and a long time to obtain acceptable values of and from the simulated genealogical trees. We note that if one tries to use a naive method to calculate the variance frequency spectrum, that is, assume the proportions of alleles having representatives to be the mean of some independent Bernoulli random variables (they are not independent) and employ , such method performs much worse than based on the derivation of the variance frequency spectrum.

5. Summary

In this paper, we consider the frequency spectrum of the IAMBP of Pakes [2]. We develop an explicit expression for the limiting mean frequency spectrum for the special case of the birth and death process, which can be stated in terms of the hypergeometric function. We also derive an asymptotic expression for the rate of convergence of the finite-time mean frequency spectrum to the limiting mean frequency spectrum and illustrate the convergence using the birth and death process. We further state and prove a theorem concerning the variance frequency spectrum of the IAMBP, which helps to quantify uncertainty in parameter estimation and hypothesis testing. We illustrate the results using simulations of the birth and death process case.

As noted in the introduction, the frequency spectrum is similar to the Ewens’ sampling formula [5] in population genetics, since they both concern the count or frequency spectrum based on an infinite-allele model under neutral selection. However, they differ in several aspects. (1) Our frequency spectrum describes population property under a branching process, whereas the Ewens’ sampling formula describes allelic class count probabilities caused by a sampling procedure and further requires the sample size to be small compared to the size of the whole population which is assumed constant. (2) Our results concerning the frequency spectrum only provide the first and second moments and not the distribution function of the proportion of alleles having a given number of copies at a specified time point, whereas the Ewens’ sampling formula gives the joint probability of all allelic classes. We also note that the Poisson-Dirichlet process [8] is usually used to describe the equilibrium behavior of the neutral infinite-allele model. Study of the relation between variations of the frequency spectrum under different models is of our future interest.

The question of validity of the Wright-Fisher and Moran models of population genetics [3], as compared to stochastic population processes such as the IAMBP or O’Connell process [9], has importance for estimation of parameters based on genetic data. As an example, Cyran and Kimmel [10] compared estimates of the age of the Mitochondrial Eve based on the various versions of the Wright-Fisher model with those based on various branching process models. The outcomes showed differences of about 10–15%.

Appendices

A. Derivation of , for the Birth and Death Process

For the birth and death process, the offspring pgf of the like-type individuals clone assumes either the form or the form . In both cases, the backward Kolmogorov equation gives a unified expression for the process pgf : under different parameterizations, where for offspring pgf , and , whereas for offspring pgf , and . We see that this process is equivalent to a birth and death process with and offspring pgf . Using the known result of the birth and death process pgf [6], we obtain where .

To obtain an explicit form for , we may use two approaches. The first approach is to start from finding the pgf of , which then leads to . The second approach is to find directly and then obtain . Both approaches lead to the same result. Here, we give derivation for the second approach only.

From the explicit form of , we can directly read and , . Consider Let , , and , the above expression becomes and .

This is a mixture of two geometric pgf’s with the same parameter but different supports, one is the set , the other is the set . Therefore, Hence,

B. Derivation of the Variance Frequency Spectrum for IAMBP

Let be the successive split times, let be the number of split times till time , and let be the number of offspring produced at split time . Consider that at time , the alleles which are represented by individuals are from two sources: the initial allele or the mutant alleles. Correspondingly, we define two indicator functions. if there are individuals carrying the initial allele alive at time , and , for if the th individual born at time () mutates to a new allelic type and further produces individuals carrying this allele units later. Then

For each , is independent of and , as well as , and it can be seen that , , , . By the law of total variance, the variance frequency spectrum takes the form

For the second term, we know from independence among the indicator functions conditional on that The variance on the right hand side can be obtained from the following theorem, which is an analogue to Lemma  3.1.1 in [2]: .

Theorem 1. Let be a bounded continuous function. Then where

Proof. By independence of family lines, . By the law of total variance, where and is the number of split times in . The right hand side can be further written as Differentiating both sides and solving the resulting differential equation, we obtain

Replacing by and replacing by , we see that (B.3) becomes

For the third term, we have the expression inside the expectation as where is the variance of the offspring distribution, regardless of the allelic types. Therefore,

By (B.2), (B.9) and (B.11), the final expression of is then

Acknowledgments

The authors thank an anonymous referee for suggesting an asymptotic expression, which following some revisions, became (5). Part of M. Kimmel’s work has been carried out in Fall 2011 when he was visiting the Institute of Advance Study at the Warwick University, supported by grant from IAS and EPSCR. His work was also supported by the Polish NCN Grant NN519579938.