Abstract

Recombination within a DNA segment during the neutral fixation process is studied to determine the number of individuals in previous generations which carry genetic material ancestral to that region in the present generation. If $Nr \ll 1$, where $N$ is the population size and $r$ is the probability of a recombination event within that region per individual in a generation, the ancestors of all the base pairs in that segment were probably in the same individual in an arbitrary generation in the asymptotic past (prior to the most recent common ancestor) and all the base pairs in that segment share a common coalescent. If $Nr \gg 1$, the ancestors of the base pairs in a segment are probably spread among several individuals in asymptotic generations; hence, there is not an ancestral individual, but an ancestral pool, and the coalescents of base pairs do not coincide. The overlap of the ancestral pools of unlinked genetic segments is less than $2pq$, where $p$ and $q$ are the relative frequencies of the two ancestral pools, which provides that the size of the ancestral pool for the human genome is close to the .80 upper bound which ensues from the Poisson progeny distribution.

1. Introduction

Gene substitution is a foundation of evolution. Greater understanding of this process has been provided by the diffusion approximation of Kimura and Ohta [1] which yielded an estimate of the time until fixation of a new mutation and the coalescent process of Kingman [2, 3] which provided an estimate of the time since a common ancestor (which is essentially the same quantity). This is the basis of the time since the mitochondrial Eve [4] and the 𝑌-chromosome Adam [5] which penetrated the popular press.

But these calculations for Eve and Adam are based on the fact that there is no recombination in the mitochondrial DNA or the 𝑌-chromosome. Eve and Adam only contained the genes ancestral to all present genes in the mitochondria and 𝑌-chromosome, and the present genetic material in the 22 autosomes and the 𝑋-chromosome had its ancestral material in many different contemporaries of Eve and Adam. There is not one genetic ancestor of the human population, but an ancestral pool, in each generation a set of individuals which contain genetic material ancestral to the present population. (The pool may contract to a single individual in some generations which provides a grand-most recent common ancestor [6] but will expand in previous generations.)

This paper studies how many base pairs (nucleotide sites) a genetic segment (a contiguous set of base pairs in DNA) can contain and have no recombination in that segment as a reasonable model for evolution; and how many individuals in a generation will contain material ancestral to the present population (base pairs identical by descent to base pairs in the present population) if recombination splits the genetic segment, hence the ancestral graph. The number of individuals in a given generation which contain material ancestral to the present population is the size of the ancestral genetic pool. Of course, recombination can split the ancestry of two adjacent base pairs, and there may be some generations where the genetic material ancestral to the present population is in a single individual no matter how long the genetic segment, but estimates for the expected size of the ancestral genetic pool are obtained. This paper helps delineate when recombination is an important factor in evolution.

There are two results which provide information on the size of the ancestral genetic pool. Chang [7] showed that asymptotically as time goes back, 80 percent of the population are pedigree ancestors of the present population; the others have no living descendants. This does not mean that the entire 80 percent contains genetic material ancestral to the present population; rather, it is an upper bound on the size of the ancestral pool for the entire genome.

Wiuf and Hein [8] obtained an estimate for the size of the ancestral pool of chromosome 20 using the model of Hudson and Kaplan [9] for incorporating recombination into the coalescent process. Their estimate is $1.28R/\ln(1+R)$, where $R$ is defined as the (effective) population size ($N$) times the length of the genetic material in morgans ($r$) (the number of morgans is the expected number of recombination events in an individual in one generation). This formula, which was obtained from curve fitting based on numerical simulations, produces the estimate that the ancestral pool for chromosome 20 is 13 percent of the diploid population size ($R=20{,}000$). They employed the range of values $1000 \le R \le 20{,}000$ for their numerical simulations, which includes neither 1000 contiguous base pairs (unless $N>10^{8}$) nor the entire genome (unless $N<400$). The formula $1.28R/\ln(1+R)$ is consistent with our results for 1000 contiguous base pairs but cannot be valid for the entire genome if $N<10^{12}$ (because the size of the ancestral pool would exceed the size of the population). Since their formula is obtained from a diffusion approximation holding $N \times r$ constant as $N \to \infty$, it should not be expected to remain valid for large $r$.
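The 13 percent figure can be checked directly from the quoted curve fit. The following is a minimal sketch, not the authors' computation; the function name is hypothetical, and it assumes chromosome 20 is about one morgan long so that $N$ equals $R$.

```python
import math

def wiuf_hein_pool_size(R):
    """Curve-fit estimate 1.28 * R / ln(1 + R) quoted in the text."""
    return 1.28 * R / math.log(1.0 + R)

R = 20_000                       # value quoted for chromosome 20
pool = wiuf_hein_pool_size(R)    # estimated number of ancestral sequences
fraction = pool / R              # ASSUMPTION: r is about 1 morgan, so N = R
print(round(pool), round(fraction, 3))
```

Under that assumption the ratio comes out near 0.13, in line with the percentage stated above.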

We first calculate asymptotic bounds for the expected size of the ancestral pool, hence the probability that the ancestral pool is a single individual. This addresses the question: does a common ancestor exist (i.e., is there high probability that the ancestral pool is a single individual for most generations in the asymptotic past)? We use the word “common” in the sense of shared by all the individuals in the present generation (which is the standard usage), but also in the sense of shared by all the nucleotide sites in a segment. The results depend on the product of the (effective) population size ($N$) and the length of the genetic segment ($r$) in morgans. For concreteness, we identify the results with the product $rN$ and also various population sizes for a segment of 1000 contiguous base pairs (i.e., $r=10^{-5}$ morgans). This choice is motivated as a contiguous DNA sequence coding for a 333 amino acid protein.

We next calculate bounds for the probability that the most recent common ancestor (MRCA) of a nucleotide site in a DNA segment is indeed the MRCA of the entire segment (i.e., the MRCA of every base pair in the segment is in the same individual). These bounds are not functions of $rN$, so we employ the value $r=10^{-5}$ above and various values for $N$. However, we have numerically confirmed that the results do not change much as $r$ and $N$ vary with $rN$ constant. Results for the asymptotic pool size and for the MRCA are presented in Table 1.

Sets of base pairs which are not contiguous (i.e., multiple segments) are of interest but difficult to analyze because recombination between the segments will depend on the locations within the segments. But our last results provide information on multiple genetic segments by bounding the overlap of ancestral pools of unlinked genetic segments. This provides a loose bound for the size of genetic pools of multiple genetic segments. In particular, it is informative for the size of the ancestral pool of the entire genome if the sizes of the ancestral pools of chromosomes are known.

2. Results

2.1. The Model

The results are obtained using the coalescent [6, 10]. The population size is $N$ diploid individuals (i.e., $2N$ haploid gametes); we are assuming this is also the effective population size. However, the analysis is haploid; hence, the word “individual” (when not preceded by “diploid”) refers to a single copy of the genetic segment. The length of a segment ($r$) is measured in morgans; 1 morgan is the length over which the expected number of crossover events in one individual (in one generation) is 1. When we study the MRCA, we shall employ the length $r=10^{-5}$, which is motivated by a segment of 1000 contiguous base pairs with the crossover probability between two adjacent nucleotides of $10^{-8}$. The value 1000 corresponds to DNA coding for 333 amino acids, and $10^{-8}$ was used by Wiuf and Hein [8] (the recombination rate varies between species, and hotspots may impact the recombination rate by a factor of 10; Wiuf and Hein [11] assumed the recombination rate $10^{-7}$). This model is for a single contiguous segment.

By coalescent, we are always referring to the coalescent of the entire population which is the ancestral graph containing all of the ancestors of the individuals in the present generation. The coalescent process (merging of ancestral lineages) is essentially the inverse of the fixation process. Time (𝑡) is measured in generations from the common ancestor hence increases with real time. Recombination (crossing over) within the segment is incorporated using the model of Hudson and Kaplan [9] as employed by Wiuf and Hein [8].

In computing bounds, some approximations are employed (such as rounding off to lowest-order terms or employing estimates for the coalescent size). Hence, the bounds could be interpreted as approximate bounds but, when paired, give a good indication of the measures of identity for various parameter values.

2.2. Asymptotic Ancestral Pools

The coalescent may not exist for a segment, because different base pairs may have different ancestral pedigrees, but it does exist for every base pair. Before (i.e., after in negative time) the MRCA of a base pair, there is an ancestral lineage which extends back to the dawn of time. Such a lineage exists for each base pair. The ancestral pool of a segment is the union of the individuals (gametes) which contain the ancestral lineages of the base pairs in that segment in a given generation. By asymptotic, we mean the behavior of those pools as time goes backward to negative infinity. Two questions are of interest: what is the probability that all the lineages coincide in a single gamete (i.e., a common ancestor exists) in a given generation, and what is the average size of the ancestral pool (averaged as time goes back to negative infinity)? It is possible to bound these two quantities.

A sequence [8] is defined as a segment which contains one or more ancestral base pairs, perhaps contiguous, perhaps with intervening nonancestral base pairs. For a given segment (region of DNA), denote the number of sequences in a generation in the past as $k$. At equilibrium, the number of coalescent events decreasing the number of sequences is equal to the number of crossing over events increasing the number of sequences. Unfortunately, we cannot characterize the latter exactly but have two inequalities:

$$r \le \frac{E[k(k-1)]}{4N} \le E[k] \times r. \tag{1}$$

The outer quantities are bounds on the number of crossing over events, and the middle quantity is the frequency of coalescent events. Equality on the left assumes all the ancestral base pairs in a sequence are contiguous so that only crossovers between adjacent ancestral base pairs can increase the number of sequences. Equality on the right assumes that ancestral material is dispersed everywhere (within the segment region) in sequences carrying ancestral material so that crossovers anywhere within the segment region will generate an additional sequence. (Simulations by Wiuf and Hein [8] suggest that the former is closer to reality.)

From convexity and the right hand inequality,

$$(E[k])^2 - E[k] \le E[k^2] - E[k] \le 4N\,E[k] \times r. \tag{2}$$

Solving this quadratic inequality for $E[k]$ yields $E[k] \le 1 + 4N \times r$.
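The step from inequality (2) to the bound on the expected number of sequences can be spelled out as follows. This is a short worked derivation, writing $x$ for $E[k]$ (a symbol introduced here only for brevity) and using the outer terms of (2) together with the fact that there is always at least one sequence.

```latex
\begin{align*}
x^{2} - x \le 4Nr\,x, \qquad x = E[k] \ge 1
  &\;\Longrightarrow\; x - 1 \le 4Nr
  && \text{(divide by } x > 0\text{)} \\
  &\;\Longrightarrow\; E[k] \le 1 + 4Nr .
\end{align*}
```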

This provides $E[k] \le 1.004$ for $Nr=.001$, 1.04 for $Nr=.01$, 1.4 for $Nr=.1$, 5 for $Nr=1$, 41 for $Nr=10$, and 401 for $Nr=100$ (the number of base pairs is always an upper bound, since each sequence contains at least one ancestral base pair). Because $k \ge 1$ (there is at least one ancestor), we can calculate $P(k=1)>.996$ for $Nr=.001$, .96 for $Nr=.01$, and .6 for $Nr=.1$ (these bounds are based on the worst case scenario $k=2$ if $k \ne 1$). These values are in Table 1.
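The values just listed follow from two one-line formulas: the upper bound $1+4Nr$ on the expected number of sequences, and, because every generation with more than one sequence contributes at least one extra sequence to the mean, the lower bound $1-4Nr$ on the probability of a single sequence. A minimal sketch (the function name is hypothetical):

```python
def pool_bounds_from_convexity(Nr):
    """Return (upper bound on E[k], lower bound on P(k = 1)) for a given N*r."""
    upper_mean = 1.0 + 4.0 * Nr
    # Worst case k = 2 whenever k != 1, so P(k != 1) <= E[k] - 1 <= 4*N*r.
    lower_p_single = max(0.0, 1.0 - 4.0 * Nr)
    return upper_mean, lower_p_single

for Nr in (0.001, 0.01, 0.1, 1, 10, 100):
    mean_bound, p_bound = pool_bounds_from_convexity(Nr)
    print(f"Nr={Nr:<6} E[k] <= {mean_bound:<8.3f} P(k=1) >= {p_bound:.3f}")
```

For $Nr \ge .25$ the probability bound is vacuous (zero), which is why the text reports it only for the three smallest values.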

An upper bound for the probability of there being a single sequence (a true coalescent common ancestor) and a lower bound for the expected number of sequences is obtained by using the lower bound for the frequency of crossover events generating new sequences $r$ with the coalescent probability $k(k-1)/4N$ (i.e., the left hand inequality in (1)). Recall that increased frequency of crossing over increases the number of sequences and coalescence decreases the number of sequences (going backward in time). Hence, a model employing a smaller frequency of crossovers will generate fewer sequences than the actual crossover frequency would generate. This will provide a higher probability that there will be a single sequence in the asymptotic past and a smaller asymptotic expected number of sequences than the actual crossover rate would provide.

To calculate the bounds, the transitions $r$ and $k(k-1)/4N$ can be put into an infinite stochastic matrix governing the distribution of the number of sequences with $r$ on the subdiagonal increasing the number of sequences by recombination, $k(k-1)/4N$ on the superdiagonal decreasing the number of sequences due to coalescence, and $1-r-k(k-1)/4N$ on the diagonal manifesting no change in the number of sequences. (The coalescent probability $k(k-1)/4N$ is an approximation which is only valid for small $k$, but this does not affect our calculations which only employ small $k$.) The $i$th entry in the stochastic vector the matrix acts on is the probability that the ancestral pool contains $i$ sequences. The upper left hand corner of this matrix is displayed below:

$$\begin{pmatrix}
1-r & \frac{2}{4N} & 0 & 0 & 0 \\
r & 1-r-\frac{2}{4N} & \frac{6}{4N} & 0 & 0 \\
0 & r & 1-r-\frac{6}{4N} & \frac{12}{4N} & 0 \\
0 & 0 & r & 1-r-\frac{12}{4N} & \frac{20}{4N} \\
0 & 0 & 0 & r & 1-r-\frac{20}{4N}
\end{pmatrix}. \tag{3}$$

Because (3) is a nondegenerate stochastic matrix, there is a unique stochastic eigenvector which is the equilibrium (asymptotic) distribution for the stochastic process governed by (3), and repeated multiplication of any stochastic vector by (3) will converge to that equilibrium distribution. The first component of this eigenvector is the asymptotic probability that there is a single sequence, and the expected number of sequences is $\sum_{i=1}^{\infty} i \times e_i$, where $e_i$ is the $i$th component of the eigenvector.

This eigenvector can be calculated iteratively using 1 as the first component, $2N \times r$ for the second component, and $((i-1)(i-2)e_{i-1} + (4N \times r)(e_{i-1} - e_{i-2}))/(i(i-1))$ for the $i$th component, where $e_i$ is the $i$th component, and then normalizing to a stochastic vector. Computations were performed truncating both at 10,000 components and at 50 components to make sure that error was not introduced by $k$ being too large (the results were the same for both truncations). (Truncating is consistent with the direction of the bound.)
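The iteration just described can be sketched in a few lines. This is an illustrative reimplementation rather than the authors' program; the function names and the default truncation at 50 components are choices made here, and the recursion is read as $e_i = ((i-1)(i-2)e_{i-1} + 4Nr(e_{i-1} - e_{i-2}))/(i(i-1))$ with $e_1 = 1$ and $e_2 = 2Nr$.

```python
def equilibrium_sequence_distribution(Nr, n_components=50):
    """Equilibrium distribution of the number of sequences, truncated and normalized."""
    e = [1.0, 2.0 * Nr]                      # unnormalized e_1 and e_2
    for i in range(3, n_components + 1):
        e_prev, e_prev2 = e[-1], e[-2]       # e_{i-1} and e_{i-2}
        numerator = (i - 1) * (i - 2) * e_prev + 4.0 * Nr * (e_prev - e_prev2)
        e.append(numerator / (i * (i - 1)))
    total = sum(e)
    return [x / total for x in e]

def summarize(Nr, n_components=50):
    """Return (P(single sequence), expected number of sequences)."""
    dist = equilibrium_sequence_distribution(Nr, n_components)
    p_single = dist[0]
    mean_k = sum(i * p for i, p in enumerate(dist, start=1))
    return p_single, mean_k

for Nr in (0.001, 0.01, 0.1, 1, 10):
    p_single, mean_k = summarize(Nr)
    print(f"rN={Nr:<6} P(k=1) ~ {p_single:.5f}  E[k] ~ {mean_k:.2f}")
```

Only the product $Nr$ enters the recursion, so the function takes that product as its single parameter.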

To show that the result is really a function of the product $r \times N$, note that the eigenvectors of a matrix are unchanged when the matrix is multiplied by a nonzero constant or has a multiple of the identity matrix added to it (excluding degenerate cases). Hence, the eigenvectors for (3) are the same as the eigenvectors for

$$\begin{pmatrix}
-Nr & \frac{2}{4} & 0 & 0 & 0 \\
Nr & -Nr-\frac{2}{4} & \frac{6}{4} & 0 & 0 \\
0 & Nr & -Nr-\frac{6}{4} & \frac{12}{4} & 0 \\
0 & 0 & Nr & -Nr-\frac{12}{4} & \frac{20}{4} \\
0 & 0 & 0 & Nr & -Nr-\frac{20}{4}
\end{pmatrix}, \tag{4}$$

which is obtained by multiplying (3) by $N$ and then subtracting $N\mathbf{I}$ from it ($\mathbf{I}$ is the identity matrix). Since the matrix (4) is a function of $rN$, so are its eigenvectors, hence the bound for the asymptotic ancestral pool sizes associated with (3).

The result from calculating the eigenvectors is that for $rN=.001$, the probability of a single ancestral sequence was less than 1.00, the expected number of sequences was greater than 1.00; for $rN=.01$, the probability of a single ancestral sequence was less than .98, the expected number of sequences was greater than 1.02; for $rN=.1$, the probability of a single ancestral sequence was less than .83, the expected number of sequences was greater than 1.19; for $rN=1$, the probability of a single ancestral sequence was less than .20, the expected number of sequences was greater than 2.32; for $rN=10$, the probability of a single ancestral sequence was less than .00019, the expected number of sequences was greater than 6.59; for $rN=100$, the probability of a single ancestral sequence was less than $10^{-15}$, the expected number of sequences was greater than 20. Note that $rN=1$ corresponds to $N=10^{5}$ if $r=10^{-5}$, which ensues from a segment length of 1000 base pairs. These values are in Table 1.

2.3. The Most Recent Common Ancestor

In addition to the asymptotic history, we can ask whether the MRCA really is an MRCA, that is, whether the MRCA of a single base pair (which must exist) is the MRCA of every base pair in the segment. This is not the requirement that the coalescents of all the base pairs in a segment coincide, merely that they terminate in the same individual. Crossing over during the coalescent process divides the genetic material in a single individual among two individuals, causing the ancestry of the gene to be contained in two different ancestral subgraphs; those graphs may terminate in the same MRCA or in different MRCAs. This is illustrated in Figure 1, where a crossover between individuals $x_1$ and $x_2$ or $x_1$ and $x_3$ would change the ancestral graph of the genetic material involved in the crossover but leave the same MRCA; a crossover between $x_1$ and $x_4$ would change the ancestral graph and change the MRCA to a more distant ancestor. The schematic of a coalescent in Figure 1 also illustrates that, during the process of coalescence or fixation, there are individuals not in the coalescent (ancestral pedigree) which share the common ancestor of the coalescent (e.g., $x_3$) and individuals not in the coalescent which do not share the common ancestor of the coalescent (e.g., $x_4$).

The probability of no crossing over involving individuals in the coalescent provides a lower bound for the probability of a common MRCA because that will assure a common MRCA, but allowing crossing over to individuals sharing the MRCA, whether inside or outside the coalescent, will also provide that MRCA. The probability of no crossing over involving individuals in the coalescent can be approximated employing the estimate for the cumulative number of individuals in the coalescent $4N(\ln(4N)-0.5)$ ([12]; the cumulative size of the coalescent is the total number of individuals in the coalescent: in Figure 1, $z_1$, $y_1$, $x_1$, $x_2$, $w_1$, $w_2$, $w_3$, and $w_4$ are in the coalescent; hence, the cumulative size is 8) and probability of a crossover in a single individual $10^{-5}$, and assuming crossing over is a Poisson process. The result is that the probability of no crossover involving individuals in the coalescent is approximately $\exp(-10^{-5} \times 4N(\ln(4N)-0.5))$. The quantity $4N(\ln(4N)-0.5)$ is an estimate for the expected size of the coalescent based on the expected time between changes in the size of the coalescent; convexity of the exponential function provides that $\exp(E[X]) \le E[\exp(X)]$ (in this case, $X$ is the size of the coalescent), which is consistent with providing a lower bound.
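The first, cruder lower bound is a single exponential and is easy to tabulate. The sketch below is illustrative only: the function name is hypothetical, and it assumes the cumulative coalescent size estimate reads $4N(\ln(4N) - 0.5)$ with a crossover probability of $10^{-5}$ per individual.

```python
import math

def no_crossover_lower_bound(N, r=1e-5):
    """exp(-r * 4N(ln(4N) - 0.5)): probability of no crossover in the coalescent."""
    # ASSUMPTION: cumulative coalescent size is estimated as 4N(ln(4N) - 0.5).
    cumulative_size = 4 * N * (math.log(4 * N) - 0.5)
    return math.exp(-r * cumulative_size)

for N in (100, 1_000, 10_000, 100_000):
    print(f"2N={2 * N:<8} lower bound ~ {no_crossover_lower_bound(N):.3g}")
```

The bound decays quickly with $N$ because the exponent grows like $N \ln N$.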

A higher lower bound is obtained by calculating an upper bound for the probability that a recombination event involving a member of the coalescent resulted in at least one nucleotide base pair which did not share the MRCA of the coalescent being in the ancestry of that individual. To this end, we calculate the probability that a member of the coalescent crossed over with an individual outside the coalescent (e.g., $x_1$ with $x_3$ or $x_4$); this overestimates the probability of recombination with an individual not sharing the MRCA because some individuals outside the coalescent (e.g., $x_3$ in Figure 1) will share the same MRCA. The number of individuals in the coalescent at time $t$ ($t$ is the expected time from the MRCA until the coalescent has the specified size; this function is the inverse of the expected time to the coalescent size) is approximately $(1+1/2N-t/4N)^{-1}$ [12]. Because $t$ is the expected time until the coalescent size, this is only valid until the expected time to fixation ($4N$) when the size of the coalescent becomes the population size ($2N$, which is $N$ diploid individuals); hence, it is not relevant that the quantity becomes negative for $t>4N+2$. Because $(1+1/2N-t/4N)^{-1}$ is obtained from the coalescent process by employing the expected transition times for decreasing the number of individuals in the coalescent by one (i.e., manifests the expected time at each size), the summation (5) manifests the expected time at each coalescent size, hence gives the expected number of crossing over events; variation in the timing of coalescent events does not introduce any error since expected times are used; any error results from the approximation $(1+1/2N-t/4N)^{-1}$ (and perhaps summing instead of integrating).
The expected number of crossover events between individuals inside and outside the coalescent is

$$10^{-5} \times \sum_{t=0}^{4N} \left(1+\frac{1}{2N}-\frac{t}{4N}\right)^{-1} \frac{2N-\left(1+1/2N-t/4N\right)^{-1}}{2N}, \tag{5}$$

where $10^{-5}$ is the probability that a crossover occurs in a single individual, $(1+1/2N-t/4N)^{-1}$ is the number of individuals in the coalescent at time $t$, and $1/2N \times (2N-(1+1/2N-t/4N)^{-1})$ is the probability that the crossover is with an individual outside the coalescent. This, assuming crossover events are a Poisson process, provides the probability of no such crossovers

$$e^{-10^{-5} \times \sum_{t=0}^{4N} (1+1/2N-t/4N)^{-1}\,(2N-(1+1/2N-t/4N)^{-1})/2N}. \tag{6}$$

(The variation in duration of the coalescent process will provide greater variation than a Poisson process; hence, the exponentiation in (6) underestimates the probability of no crossovers, which is consistent with providing a lower bound.)

For a population of 100 diploid individuals (i.e., 200 gametes, $2N=200$), this provides the lower bound for the probability that all nucleotide sites in a segment have the same MRCA .98; for $2N=2000$, .77; for $2N=20{,}000$, .03; for $2N=200{,}000$ or more, less than $10^{-19}$. Thus, all the nucleotide sites in a segment probably have the same MRCA in populations smaller than 1000 but may not in larger populations (this is only a lower bound for all nucleotide sites having the same MRCA). This information is presented in Table 1.
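The lower bound (6) can be evaluated by direct summation. The following is a minimal sketch under the reading that the coalescent size at time $t$ is $(1 + 1/2N - t/4N)^{-1}$ and that the sum runs over $t = 0, \ldots, 4N$; the function names are hypothetical.

```python
import math

def coalescent_size(t, N):
    """Approximate number of individuals in the coalescent at time t."""
    return 1.0 / (1.0 + 1.0 / (2 * N) - t / (4 * N))

def same_mrca_lower_bound(N, r=1e-5):
    """Probability of no crossover between the coalescent and outside individuals."""
    two_n = 2 * N
    weighted_sum = 0.0
    for t in range(0, 4 * N + 1):
        c = coalescent_size(t, N)
        # c individuals, each pairing with an outsider with probability (2N - c)/2N
        weighted_sum += c * (two_n - c) / two_n
    return math.exp(-r * weighted_sum)

for N in (100, 1_000, 10_000):
    print(f"2N={2 * N:<7} lower bound ~ {same_mrca_lower_bound(N):.2f}")
```

With $r = 10^{-5}$ this sketch gives values close to the .98, .77, and .03 reported for $2N = 200$, 2000, and 20,000.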

In order to obtain an upper bound for the probability that the MRCA for a nucleotide base pair is indeed the MRCA for the entire 1000 base pairs in the segment, we shall use a lower bound for the probability that a crossover occurred between an individual in the coalescent and an individual not sharing the MRCA of the coalescent (e.g., $x_1$ and $x_4$ in Figure 1).

Heuristically, this can be obtained from the growth of the coalescent $(1+1/2N-t/4N)^{-1}$ and the rate of increase of the allele destined to fixation (which includes individuals such as $x_3$ which are not in the coalescent). For the Poisson progeny distribution with $\lambda=1$, the expected number of siblings of an individual is 1. Therefore, since all progeny are equally likely to become fixed, the expected increase in frequency, conditioned on fixation, is $1-(k-1)/(2N-1)<1$, where the 1 is the expected number of siblings of the progeny destined for fixation and the $(k-1)/(2N-1)$ reflects that the other $2N-1$ individuals in the parental generation ($k-1$ of which are of the same type as the progeny destined for fixation) must have on average $1-1/(2N-1)$ progeny to maintain a constant population size. This provides that the expected number of copies of the allele destined for fixation is less than or equal to $t$ at time $t$; hence, $r\sum_{t=0}^{2N}(1+1/2N-t/4N)^{-1}(2N-t)/2N$ should be a lower bound for the probability that the MRCA of a nucleotide pair is not the MRCA of all the nucleotide pairs (a crossover occurred with an individual not descended from the MRCA). Truncating the summation at $2N$ is consistent with calculating a lower bound, but because the factors in the summation are an expected value and a bound on an expected value, this may not be a lower bound.

Rigorously, a weaker bound can be obtained using Tchebychev’s theorem. The variance of the change in allele frequency in a generation is $k(2N-k)/2N$, where $k$ is the number of alleles of the designated type (the actual model is the binomial distribution; the Poisson progeny distribution is an approximation which is useful for many purposes, but the binomial variance is tractable here). Because the rate of increase of the designated allele is less than 1, the expected number of copies of the designated allele at time $t$ is less than $t$ (assuming one copy at time 1); hence, the variance of the change in allele frequencies at time $t$ is less than $t$ (i.e., $k \times (2N-k)/2N<t$; because $k(2N-k)$ is concave in $k$, the expected value of the variance is less than the variance calculated using the expected value). Independence between generations provides that the variance of the cumulative change over $t$ generations is less than $\sum_{i=1}^{t} i = t(t+1)/2 < t^2$; hence, the cumulative standard deviation is less than $t$.

This provides that $4t$ is three standard deviation units above the expected number of copies at time $t$; hence, by Tchebychev’s theorem, there are at least $2N-4t$ alleles not identical by descent with the designated allele at time $t$ with probability 8/9. Because the argument $t$ of the coalescent size $(1+1/2N-t/4N)^{-1}$ is the expected time to that size and $2N-4t$ is linear, multiplying $(1+1/2N-t/4N)^{-1}$ by $2N-4t$ entails an accurate pairing of coalescent and nondescendant sizes (i.e., for a given $E(t)$ which is the argument of $(1+1/2N-t/4N)^{-1}$, the actual value of $t$ in $2N-4t$ will vary, but conditioning on $E(t)$ as the argument for $(1+1/2N-t/4N)^{-1}$, averaging over all the associated values of $2N-4t$ will be the same as using that $E(t)$ as the argument for $2N-4t$). (The truncation of $2N-4t$ is consistent with the direction of the bound.) This provides the upper bound for the probability that the MRCA of a nucleotide pair is the MRCA of all the nucleotide pairs in the segment:

$$e^{-10^{-5} \times \sum_{t=0}^{N/2} (1+1/2N-t/4N)^{-1} \times (2N-4t)/2N \times .88}, \tag{7}$$

where $r=10^{-5}$ and .88 is the 8/9 from Tchebychev’s theorem.

Numerical evaluation of this expression produces 1.000 for $2N=200$, .998 for $2N=2000$, .977 for $2N=20{,}000$, .795 for $2N=200{,}000$, .100 for $2N=2{,}000{,}000$, and $10^{-10}$ for $2N=20{,}000{,}000$. As noted above, this is a generous bound; hence, there is very low probability that all the nucleotide sites in a gene have the same MRCA for $N$ greater than 1,000,000. These values are in Table 1.
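The numerical evaluation of (7) can be sketched as a direct sum. This is an illustration, not the authors' code: the function names are hypothetical, the sum is assumed to run from $t = 0$ to $N/2$ (where $2N - 4t$ reaches zero), and .88 is used for the 8/9 factor as in the text.

```python
import math

def coalescent_size(t, N):
    """Approximate number of individuals in the coalescent at time t."""
    return 1.0 / (1.0 + 1.0 / (2 * N) - t / (4 * N))

def same_mrca_upper_bound(N, r=1e-5, chebyshev=0.88):
    """Upper bound on the probability that one MRCA serves the whole segment."""
    two_n = 2 * N
    weighted_sum = 0.0
    for t in range(0, N // 2 + 1):
        # at least 2N - 4t alleles are not descended from the MRCA at time t
        weighted_sum += coalescent_size(t, N) * (two_n - 4 * t) / two_n
    return math.exp(-r * chebyshev * weighted_sum)

for N in (100, 1_000, 10_000, 100_000):
    print(f"2N={2 * N:<8} upper bound ~ {same_mrca_upper_bound(N):.3f}")
```

For these population sizes the sketch lands near the 1.000, .998, .977, and .795 quoted above.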

2.4. Multiple Unlinked Segments

Genetics is seldom concerned with single contiguous segments of DNA, but often multiple segments with significant separation, hence recombination, between them. Although we should consider an arbitrary recombination frequency between segments, that frequency will depend on the locations within the segments (recombination within one segment will result in part, but not all, of that segment recombining with another segment), making it a difficult problem. Free recombination is the opposite extreme to no recombination and is appropriate for some cases including segments on different chromosomes or segments which are entire chromosomes. The specific question which we address is if the sizes of the ancestral pools of two unlinked segments are known, what is the size of the combined ancestral pool? It is at least the size of the larger of the two pools and at most the sum of the sizes of the pools. We provide a more precise bound. Calculations are based on lowest-order terms in power series.

First consider the case where the segment lengths and population size are small enough so that each ancestral pool is a single individual; hence, there are two ancestral lineages. This case lays a foundation for the following cases, hence is of interest beyond the circumstances when its assumptions are met. The population size is $N$, hence $2N$ gametes. If the ancestral lineages of two unlinked segments are in the same gamete, then the previous generation they were in the same gamete half the time (because the zygote they came from was two gametes). If they are in different gametes, then $1/N$ of the time they came from the same zygote (this follows from Kingman’s [3] observation that the Wright-Fisher model is equivalent to each individual choosing its parent independently from the previous generation), hence $1/2N$ of the time they came from the same gamete the previous generation. This defines a Markov process going backward in time with the two states that the lineages are or are not in the same gamete, and the matrix for this Markov process is

$$\begin{pmatrix} .5 & \frac{1}{2N} \\ .5 & 1-\frac{1}{2N} \end{pmatrix}, \tag{8}$$

which has the eigenvector (stable distribution) $(1/(1+N),\ N/(1+N))$; hence the diploid structure provides that two independent lineages will coincide (be in the same gamete) approximately $1/N$ of the time rather than $1/2N$, which would occur from random association.
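The stable distribution of the two-state process can be checked with exact rational arithmetic. The sketch below assumes the matrix acts on column vectors, with the first state meaning "same gamete"; the helper name and the example value of $N$ are hypothetical.

```python
from fractions import Fraction

def stationary_two_state(stay_first, enter_first):
    """Stable distribution of a two-state chain.

    stay_first  = P(state 1 -> state 1)
    enter_first = P(state 2 -> state 1)
    """
    leave_first = 1 - stay_first
    # Balance condition: x * leave_first == y * enter_first with x + y == 1.
    total = enter_first + leave_first
    return enter_first / total, leave_first / total

N = 1000  # illustrative diploid population size
same, different = stationary_two_state(Fraction(1, 2), Fraction(1, 2 * N))
print(same, different)
```

Using `Fraction` keeps the result exact, so it can be compared term by term with $1/(1+N)$ and $N/(1+N)$.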

Next consider a single ancestral lineage (ancestral pool of size one) and the ancestral pool of size greater than one of an unlinked segment; $u$ is the relative frequency (size/$2N$) of the ancestral pool at the gamete stage. In order to maintain an equilibrium size $u$ of the ancestral pool, coalescence must be balanced by crossing over (recombination) going backward in time. Coalescence reduces the size of the ancestral pool from $u$ to $1-e^{-u}$ in a generation, and $u-(1-e^{-u})=u^2/2$ to lowest order terms; hence crossing over must increase the number of ancestral lineages by that amount. Only crossing over in individuals in which exactly one of the alleles is ancestral to the ancestral pool will increase the size of the ancestral pool; the frequency of such individuals is $2e^{-u}(1-e^{-u})$ ($e^{-u}$ is the probability that a parental allele (half a zygote) is not an ancestor of the ancestral pool). Therefore, the frequency of crossing over, which we designate with $\rho$, satisfies $u^2/2=\rho \times 2e^{-u}(1-e^{-u})$, or $\rho=u/4$ to order $u$.

This provides that the probability that if the lineage was in a gamete with a part of the ancestral pool, it was in a gamete with part of the ancestral pool the previous generation is $.5+.5(1-e^{-u})+.5\rho e^{-u}$, which is obtained by summing the probability the ancestral pool material was in the same gamete the previous generation (.5), the probability the gamete the previous generation contained the other copy of the allele in the zygote, but it was also ancestral ($.5(1-e^{-u})$), and the probability the gamete the previous generation contained the other copy of the allele in the zygote which was not ancestral, but it was made ancestral by crossing over ($.5\rho e^{-u}$). To first-order terms in $u$, this is equal to $.5+.625u$; hence the probability that if a lineage was in a gamete with part of the ancestral pool, it was in a gamete without part of the ancestral pool the previous generation is $.5-.625u$. If the lineage was in a gamete without material from the ancestral pool, then its gamete the previous generation could have material from the ancestral pool if either its gamete the previous generation contained the ancestor of that nonancestral allele, but that allele had coalesced with an allele with ancestral material, or it contained the ancestor of the other allele in the parent to the gamete and that allele contained ancestral material (crossing over produces higher-order terms); the respective probabilities are $.5(1-e^{-u})$ and $.5(1-e^{-u})$. To order $u$, summing these yields $u$. Hence, the probability that if the lineage was in a gamete without ancestral material, it was also in a gamete without ancestral material the previous generation is $1-u$.
This yields the Markov matrix governing cooccurrence of the lineage and ancestral pool

$$\begin{pmatrix} .5+.625u & u \\ .5-.625u & 1-u \end{pmatrix}, \tag{9}$$

which has the eigenvector (stable distribution) $(u/(.5+.375u),\ (.5-.625u)/(.5+.375u))$; hence, the diploid structure provides that a lineage will coincide with part of an unlinked ancestral pool of size $u$ approximately $u/(.5+.375u)$ (i.e., approximately $2u$) of the time rather than $u$, which would occur from random association.
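The stable distribution of the cooccurrence matrix can be confirmed numerically by repeated multiplication. This is a hedged sketch: it assumes the matrix acts on column vectors with first column $(.5 + .625u,\ .5 - .625u)$ and second column $(u,\ 1 - u)$, and the function name, starting vector, and step count are choices made here.

```python
def cooccurrence_stationary(u, steps=200):
    """Power-iterate the two-state cooccurrence chain for pool frequency u."""
    stay_with = 0.5 + 0.625 * u   # with pool -> with pool
    join = u                      # without pool -> with pool
    x, y = 0.5, 0.5               # arbitrary starting distribution
    for _ in range(steps):
        x, y = (stay_with * x + join * y,
                (1.0 - stay_with) * x + (1.0 - join) * y)
    return x, y

for u in (0.01, 0.05, 0.1):
    with_pool, _ = cooccurrence_stationary(u)
    closed_form = u / (0.5 + 0.375 * u)
    print(f"u={u:<5} iterated={with_pool:.6f} closed form={closed_form:.6f} 2u={2 * u:.3f}")
```

The second eigenvalue of the matrix is $.5 - .375u$, so the iteration converges within a few dozen steps; the printout also shows how close the result is to $2u$ for small $u$.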

Now consider two unlinked segments (or unlinked collections of genetic material) for which the sizes of the ancestral pools are known. Assume the asymptotic probabilities of gametes containing ancestral material for those segments are $u$ and $v$, respectively (hence, we shall refer to them as “$u$” and “$v$” segments). Then, the ancestral lineage for each nucleotide pair in the “$v$” segment will be in a gamete with material in the “$u$” ancestral pool with probability $u/(.5+.375u)$ (or $u/(.5+.375u)$ of such lineages will be in “$u$” gametes). If all gametes containing “$v$” ancestral material had equal probability of containing “$u$” ancestral material, the probability that a gamete with “$v$” ancestral material contained “$u$” ancestral material would be $u/(.5+.375u)$, the probability for a “$v$” lineage containing “$u$” ancestral material. Hence, the probability that a gamete contained ancestral material from both segments would be $vu/(.5+.375u)$ ($v$ is the probability of containing ancestral material from the second segment, and $u/(.5+.375u)$ is the conditional probability of containing ancestral material from the first segment).

However, gametes containing many (as opposed to fewer) “𝑣” ancestral lineages are likely to have recently coalesced (because coalescence combines ancestral lineages and crossing over separates them). The “𝑢” segment (whether or not ancestral) in that gamete is also likely to have recently coalesced because the sexual reproduction process keeps independent segments together (with probability .5 each generation), and because it coalesced, it is more likely to contain ancestral material. Hence, gametes with many ancestral “𝑣” lineages are more likely to contain ancestral “𝑢” material than gametes with few ancestral “𝑣” lineages. This provides that the probability that a gamete containing ancestral “𝑣” material also contains ancestral “𝑢” material will be less than the probability that an ancestral “𝑣” lineage is in a gamete with ancestral “𝑢” material. Thus, the probability that a gamete contains both “𝑢” and “𝑣” ancestral material is less than 𝑣𝑢/(.5+.375𝑢) (and less than 𝑣𝑢/(.5+.375𝑣) by symmetry). In particular, the probability that an individual contains ancestral material from both pools is less than twice the product of the probabilities of the two pools (2𝑢𝑣).
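The final step of this paragraph can be written as a short chain of inequalities. This is only a restatement of the bounds above for a pool size $u > 0$, not an additional result; the second inequality holds because the denominator $1 + .75u$ exceeds 1.

```latex
\[
P(\text{gamete contains both ``$u$'' and ``$v$'' ancestral material})
\;<\; \frac{vu}{.5 + .375u}
\;=\; \frac{2uv}{1 + .75u}
\;<\; 2uv .
\]
```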

Therefore, the size of the combined ancestral pool is at least 𝑢+𝑣−2𝑢𝑣 (and at most 𝑢+𝑣). This argument can be extended recursively to find a bound on the size of the ancestral pool of an arbitrary number of unlinked segments for which the ancestral pool size is known. In particular, it can be used to find a bound on the size of the ancestral pool of the entire genome if the size of the ancestral pool for each chromosome is known.
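The recursive use of these bounds can be sketched as a simple fold over the segment pool sizes. The sketch below rests on assumptions that are not in the original: the function names are hypothetical, each pool size is taken to be a known frequency no greater than .5 (so that the lower bound remains valid when the running union is itself known only through its lower bound), and the input values are placeholders rather than estimates from any data set. It uses only the 𝑢+𝑣−2𝑢𝑣 and 𝑢+𝑣 bounds stated above, together with the trivial bound max(𝑢,𝑣).

```python
def combine(low, high, w):
    """Bounds on the union of a running pool (size in [low, high]) and a
    new unlinked pool of known size w.

    Lower bound: x + w - 2*x*w, never below the trivial max(x, w).
    Upper bound: x + w, never above 1 (the whole population).
    """
    if not 0.0 <= w <= 0.5:
        # Assumption of this sketch: x + w - 2*x*w is nondecreasing in x
        # only when w <= .5, which is what lets us plug in the lower bound.
        raise ValueError("pool size must lie in [0, 0.5]")
    new_low = max(low + w - 2.0 * low * w, low, w)
    new_high = min(high + w, 1.0)
    return new_low, new_high


def pool_bounds(sizes):
    """Fold combine() over pool sizes of mutually unlinked segments."""
    sizes = list(sizes)
    if not sizes:
        raise ValueError("need at least one pool size")
    low = high = sizes[0]
    for w in sizes[1:]:
        low, high = combine(low, high, w)
    return low, high


if __name__ == "__main__":
    # Placeholder sizes for three unlinked segments (illustrative only).
    print(pool_bounds([0.02, 0.03, 0.05]))
```

The fold treats the segments one at a time, so the order of the inputs does not need to follow their order in the genome, only the requirement that each new segment is unlinked to those already combined.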

3. Discussion

The main result from Table 1 is that a segment will probably have a single ancestor (i.e., an ancestral pool of size 1) if 𝑟𝑁≪1 (the probability is greater than .6 if 𝑟𝑁=.1, greater than .96 if 𝑟𝑁=.01, and greater than .99 if 𝑟𝑁=.001). Complementarily, the probability of a single ancestor is close to zero for 𝑟𝑁≫1 (the probability is less than .00019 for 𝑟𝑁=10 and less than 10⁻¹⁵ for 𝑟𝑁=100). The bounds on the expected size of the asymptotic pool are of course close to 1 for 𝑟𝑁<1, but are not very useful for 𝑟𝑁>1 (numerical calculations provide that the lower bound approaches 51 as 𝑟𝑁 gets large while the upper bound is approximately 4𝑟𝑁). For 𝑟𝑁=1, there is a rather tight bound on the expected size of the asymptotic pool (between 2.3 and 5). However, 𝑟𝑁=1 is of limited interest. 𝑟𝑁=1 corresponds to a gene or a piece of a gene of 10³ or 10² contiguous base pairs if the population size is 10⁵ or 10⁶, respectively. But it certainly does not correspond to an entire chromosome: a chromosome in man or Drosophila is about one morgan in size, which would require an effective population size close to 1. (This assumes a recombination rate of 10⁻⁸ between adjacent base pairs; there are other estimates for that rate, and variation in the rate (hotspots) further complicates the analysis [13].)
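The correspondence between segment length and population size at 𝑟𝑁=1 is simple arithmetic, sketched below. The linear approximation 𝑟 ≈ (rate per adjacent base pair) × (segment length) and the function names are assumptions of this sketch; the default rate is the 10⁻⁸ figure assumed in the text.

```python
RATE_PER_BP = 1e-8  # recombination rate between adjacent base pairs (from the text)


def segment_length_for_unit_rN(population_size, rate=RATE_PER_BP):
    """Segment length (in base pairs) at which r*N = 1, taking r ~ rate * length."""
    return 1.0 / (population_size * rate)


def population_size_for_unit_rN(r):
    """Population size at which r*N = 1 for a segment with recombination probability r."""
    return 1.0 / r


if __name__ == "__main__":
    for n in (10**5, 10**6):
        print(n, segment_length_for_unit_rN(n))
    # A chromosome of about one morgan has r close to 1.
    print(population_size_for_unit_rN(1.0))
```

The linear approximation is only reasonable while 𝑟 is small; for a whole chromosome it serves merely to show the order of magnitude involved.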

These results provide insight into the question: what is the integrity of the gene? Is the gene the atom of evolution, or does evolution occur on a finer scale? In small populations (𝑁<1000), the gene (defined as 1000 contiguous base pairs) is indeed a meaningful entity: the most recent common ancestor (MRCA) is the same for all of its base pairs, and that individual has an ancestral lineage which contains common ancestors for all the nucleotide pairs in that gene. Periods when the ancestral material is spread among multiple individuals are infrequent; hence, all the base pairs change their frequency as a unit. In larger populations (𝑁>1,000,000), the MRCAs for the various base pairs in the gene do not coincide, and it is rare that the ancestral lineages for all the base pairs coincide. There is not an ancestral individual, but an ancestral pool. Positive probability, no matter how small, provides that the lineages of all the base pairs will coincide at some time in the past (hence, there is a common ancestor), but, if 𝑁𝑟≫1, the base pairs will not all stay together and evolve (change frequency) as a unit. These conclusions are drawn from the numerical bounds calculated in Table 1. Some of the bounds are quite loose, but they still support the conclusions.

These results are for neutral drift with no mutation (i.e., identity by descent). Selection will speed up the fixation process and increase identity by descent [14], and hence increase the likelihood that the MRCA for a base pair is the MRCA for all the base pairs in the gene; it might also eliminate aberrant forms of the gene, thereby further contributing to integrity. Mutation will decrease the physical identity of the genes. Since the mutation rate is comparable to the recombination rate (both are around 10⁻⁸, per nucleotide site or between adjacent nucleotide sites, and both vary greatly), probabilities of identity by type will be similar. But because much recombination will be with individuals which are identical by descent, identity by type is less likely than identity by descent.

The bounds in this paper on the size of the ancestral pool are most useful for a genetic segment of 1000 contiguous base pairs, and Wiuf and Hein [8] have presented an estimate for the size of the ancestral pool for a chromosome. Indeed, it would be nice to have tighter bounds for a genetic segment and an estimate for chromosomes which does not rely on simulation for the population size of interest. But it is also necessary to extend results for genetic segments to results for unions of genetic segments, whether a few separated contiguous segments or the entire genome. We have improved the bounds obtained by assuming that the genetic material in different segments (or chromosomes) is in the same individuals as much as possible, or in different individuals as much as possible (i.e., if the sizes of two genetic pools are 𝑢 and 𝑣, the size of the combined pool is between max(𝑢,𝑣) and 𝑢+𝑣); we have shown that the overlap of the two pools is less than 2𝑢𝑣 if the genetic segments are unlinked. This enables us to show, based on the chromosomal pool size of Wiuf and Hein [8] and recursively applying the 2𝑢𝑣 bound, that the size of the ancestral pool of the human genome is close to the 80 percent pedigree ancestor upper bound of Chang [7]. But tighter bounds should be sought in general, especially for the difficult problem of genetic segments which are linked.

Acknowledgment

This paper has been significantly improved due to suggestions from Joe Felsenstein and anonymous reviewers.