Advances in Bioinformatics

Volume 2008 (2008), Article ID 257864, 9 pages

http://dx.doi.org/10.1155/2008/257864

## A Tutorial of the Poisson Random Field Model in Population Genetics

^{1}Department of Genetics, School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA^{2}Department of Computer and Information Sciences, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, PA 19104, USA

Received 6 February 2008; Accepted 15 May 2008

Academic Editor: Alexander Zelikovsky

Copyright © 2008 Praveen Sethupathy and Sridhar Hannenhalli. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Population genetics is the study of allele frequency changes driven by various evolutionary forces such as mutation, natural selection, and random genetic drift. Although natural selection is widely recognized as a bona-fide phenomenon, the extent to which it drives evolution continues to remain unclear and controversial. Various qualitative techniques, or so-called “tests of neutrality”, have been introduced to detect signatures of natural selection. A decade and a half ago, Stanley Sawyer and Daniel Hartl provided a mathematical framework, referred to as the Poisson random field (PRF), with which to determine quantitatively the intensity of selection on a particular gene or genomic region. The recent availability of large-scale genetic polymorphism data has sparked widespread interest in genome-wide investigations of natural selection. To that end, the original PRF model is of particular interest for geneticists and evolutionary genomicists. In this article, we will provide a tutorial of the mathematical derivation of the original Sawyer and Hartl PRF model.

#### 1. Introduction

Selectionists and neutralists have fiercely debated, for the past five decades, the extent to which Darwinian selection has shaped
molecular evolution. However, both camps do agree that Darwinian selection is a
bona fide natural phenomenon. Therefore, various so-called “tests of
neutrality” have been developed to detect natural selection on a particular
gene or genomic location (for a review on this topic, see [1]). However, these
tests are often qualitative and only provide the directionality of selection. A
decade and a half ago, S. Sawyer and D. Hartl provided a mathematical framework
with which to determine quantitatively the intensity of selection on a
particular gene, which they applied to the *Adh* locus in the *Drosophila* genome [2].
This framework is referred to as the Poisson random field (PRF) model. They
then further used this framework to analyze codon bias in enteric bacteria [3].
Owing to the recent availability of whole genome sequences and genome-wide
human polymorphism data, it has become increasingly tractable to perform
genome-wide scans for signatures of selection. The PRF model has been applied
to estimate the intensity of selection on synonymous and nonsynonymous sites
throughout mitochondrial and nuclear genomes of a variety of species, including
human [4–12]. Very recently, due to the
advent of high-throughput experimental and computational identification of
genomic regulatory elements, there has been an interest to estimate the
intensity of natural selection on regulatory mutations. Chen and Rajewsky [13]
use the PRF, among other techniques, to provide evidence for purifying
selection (even stronger than on nonsynonymous coding sites) on a class of
regulatory sites known as microRNA target sites. Due to the potentially wide
range of applications of, and opportunities for theoretical extensions to, the
PRF model, it is an increasingly important mathematical framework for quantitative
geneticists. In this article, we will provide a tutorial of the mathematical
derivation of the basic PRF model that was originally developed in [2]. The tutorial will follow the outline provided
below:
(i)Wright-Fisher
model,(ii)diffusion approximation
to the Wright-Fisher model,(iii)derivation,
via diffusion theory, of formulas describing evolutionary processes of interest,(iv)derivation of
the PRF using the above-mentioned formulas.
The first three items are discussed in [14], and the last point was originally presented in [2]. In this tutorial, we aim to provide an
integrated and comprehensive presentation that is accessible to nonprofessionals or beginners in the field of population genetics. Since the
primary purpose is to review mathematical derivations, familiarity with calculus and at least a cursory knowledge of genetics will be helpful for the
reader.

#### 2. The Wright-Fisher Model

The Wright-Fisher (WF) model describes the change in frequency of a single mutation (derived allele) in a population over time. The simplest version of the model makes the following assumptions: (1) nonoverlapping generations, (2) constant population size in each generation, and (3) random mating, and is described as follows.

Consider a population of *N* diploid individuals that has a single
polymorphic site with two alleles, one ancestral and one derived. Under this
model, the frequency of the derived allele in the current generation is a
function of the selection pressure on this allele and the binomial sampling
effect with probabilities proportional to the frequency of this allele in the
previous generation. The probability, *p _{ij}*,
that there are

*j*genes of the derived allele present at generation G + 1 given

*i*genes of the derived allele present at generation G is given by the following binomial calculation: where depends on the relative fitness of the derived allele.

Assuming no dominance and no
recurrent mutation,
where is the fitness of the derived allele relative to 1 for the
ancestral allele, and *x* (which is
simply ) is the derived allele
frequency (daf) in generation G. In the simplest model (no selection and no
recurrent mutation), is simply *x* or .

The intuition behind * Ψ_{i}* is the following. Consider
the scenario where both the ancestral and the derived alleles are neutrally
evolving (no or negligible selection pressure). In this case, the probability
of sampling a gene of the derived allele from the population in generation G is
simply the frequency of the derived allele in generation G, or

*x*. This can be rewritten as . Now, suppose that the derived allele is under some selection,

*s*, meaning that the fitness of the derived allele is relative to 1 for the ancestral allele. In this case, genes are sampled according to their relative fitnesses (as in the equation for

*above). Figure 1(a) provides a pictorial representation of the basic Wright-Fisher model.*

*Ψ*_{i}#### 3. Diffusion Theory

We define as the probability that a polymorphic site has *i* genes of the derived allele at time *t*, given that it had *k* genes of the derived allele at time 0. satisfies the following:
where is given in (1).

It is convenient to change notation
and write as ,
so that the above becomes
In this framework, it has been
shown to be extremely difficult to explicitly derive formulas for several
quantities of evolutionary interest. However, as the size of the population
approaches infinity (i.e., ), and assuming that the scaled selection
pressure (*Ns*) and scaled mutation
rate (*N μ*) remain constant, the
discrete Markov process given above can be closely approximated by a
continuous-time, continuous-space diffusion process (Figure 1(b)):
where is the probability distribution of

*x*at time

*t*,

*x*is the daf at time

*t*,

*p*is the daf at time 0, and is the daf change in time .

We can perform a Taylor series expansion on both sides in and to derive the forward Kolmogorov equation:
where
and *a(x)* and *b(x)* depend on
the genetic model (e.g., see eq (24)).

Equation (5) can be represented
diagrammatically as in Figure 2.
The probability of derived allele frequency
at time is the product of the probability of moving
from *p* to *x* in time *t* and the
probability of moving from *x* to in time ,
summed over all possible values of *x*.

The frequency trajectory of a
derived allele can also be depicted as in Figure 3, which illustrates that the
probability of frequency *x* at time is the product of the probability of moving
from *p* to in time and the probability of moving from to *x* in time *t*, summed over
all possible values of * δp*. This is
formalized as follows:
We can again perform a Taylor
series expansion on
both sides to derive the backward Kolmogorov equation:
The forward and backward Kolmogorov
equations have played a central role in theoretical population genetics since
1922. For details regarding their derivation, we refer the reader [15, Chapter 4].
Next, we will discuss how they are utilized to derive formulas for various
quantities of evolutionary interest (yellow boxes in Figure 4).

In a model where there is two-way
recurrent mutation (i.e., there are no absorbing states, either extinction or
fixation), stationarity is achieved when the probability of change in the
derived allele frequency is no longer dependent on time *t*. We solve for the stationary distribution, *f(x)*, in the following manner. First, we integrate through the
forward Kolmogorov equation with respect to *x* :
where is the
probability of the derived allele assuming a frequency between 0 and *x* at time *t*. Therefore, the derivative of with respect to *t* can be
interpreted as the probability flux (change in probability over time) of the
diffusion process. The stationary distribution, *f(x)*, can be solved by setting the probability flux equal to zero.

#### 4. Derivation of Formulas Describing Evolutionary Processes of Interest

Let us now focus on a genetic model that assumes no recurrent mutation (i.e., two absorbing states, one at and another at ). As depicted by Figure 4, in such a model, it is possible to determine the probability of extinction (), the probability of fixation (), and the mean time until absorption (either at or ) by using the Kolmogorov backward equation (Figure 4). It is also possible to derive the mean time until absorption conditioned on always eventually reaching only one of the two states. Since this quantity is not directly applicable to the PRF, we do not review its derivation here, but instead refer the reader to [14].

##### 4.1. Probability of Extinction

Using (11), we arrive at an equation
parallel to (9):
The probability that the derived allele
frequency, *x*, reaches 0 at or before time *t* follows from (11) and is given by
where *p* is the initial frequency of the derived allele and 0^{+} indicates 0 + *ε*, where *ε* is very small.

Replacing with ,
(12) can be written as
As , can be interpreted as the probability that
extinction ever occurs (independent of time) and can be rewritten in the form .
From (14), it is evident that satisfies the following equation:
Solving (15), we arrive at the
following:
where
and where *a(z)* and *b(z)* are defined
as in (6).

##### 4.2. Probability of Fixation

The probability that the derived
allele frequency, *x*, reaches 1 at
time *t* follows from (11) and is given
by
where *p* is the initial frequency of the derived allele and 1^{−} indicates , where *ε* is very small.

In (12), can be replaced by without any loss of generality. Also, by replacing with ,
(12) can be rewritten as
By letting and solving for we arrive at the following:
where has been defined in (17) and *a(z)* and *b(z)* have been defined in (6).

The probability of fixation and the probability of extinction must sum to 1. Using (16) and (20), we can verify that this is indeed the case.

Consider a genetic model that assumes the presence of selection, but no recurrent mutation, where and . Starting from (20), we can express the probability of fixation under this genetic model in the following manner:

##### 4.3. Mean Time until Either Extinction or Fixation

We define to be the density function of the time *t* at which absorption occurs. The probability that absorption occurs, at
either boundary or , by time *t*, is
Furthermore, since absorption must
happen by ,
we know that
Performing integration by parts, we
get the following:
Equations (14), (19), and (22) show
that satisfies the following equation:
Using (25) and the fact that approaches 0 faster than *t* approaches ,
we can rewrite (24) as
After interchanging the order of
integration and differentiation we get
where
and is the mean time that the daf spends in the interval before absorption occurs.

We are interested in the case,
where , since this is the
initial frequency of the derived allele. In this case, we are interested only
in values of *x* greater than , and for these values we can write
and is defined in (17).

Under the simplest genetic model
that assumes no selection and no recurrent mutation, we can set in (17) and (21) and show that reduces to *p* and reduces to 1.
It follows from this that (29) can be reduced to
Under a genetic model where ,
using , (29) can be rewritten as
After integrating and simplifying
the terms, we obtain
Finally, substituting and , and invoking the approximation for small values of *a*, reduces approximately to
where is a notation common in the literature to represent the
expected time for which the population frequency of a derived allele is in the
range before eventual
absorption.

#### 5. Poisson Random Field Theory

S. Sawyer and D. Hartl expanded the
modeling of site evolution to multiple sites. Their model makes the following
assumptions: (1) mutations arise at Poisson times, (2) each mutation occurs at
a new site (infinite sites, irreversible), and (3) each mutant follows an independent
WF process (no linkage). Sawyer and Hartl noticed from in (33), that
is the expected number of sites in
the population with derived allele frequency between *x _{1}* and (where equals , the per-locus mutation rate). The function , for which the full expression is
given below, is also referred to in the literature as the limiting,
equilibrium, or expected density function for derived allele frequencies.
In a sample of size

*n*, the expected number of sites with

*i*(which ranges from 1 to ) copies of the derived allele is defined as a function of : The intuition behind

*F(i)*is the following. The expected number of polymorphic sites with population daf

*x*that have

*i*copies of the derived allele out of

*n*samples is given by the product of the expected number of sites with population daf , and the probability that each of those sites has

*i*copies in the sample, which is given by the binomial calculation in the right-hand side of (36). To determine the expected number of sites with

*any*population daf that have

*i*copies of the derived allele, this product must be integrated over all possible values of

*x*(resulting in

*F(i)*above).

Consider the sample data , where is the observed number of sites with *i* copies of the derived allele out of *n*.
Sawyer and Hartl showed that the number of derived alleles in the entire
population at a particular frequency is a PRF with mean density given by (35) [2].
It follows, from the marking theorem on Poisson processes [16],
that each random variable *X _{i}* is
an independent Poisson distribution with mean equal to

*F(i)*[2]. This framework allows us to define the probability of observing

*x*sites that have

_{i}*i*copies of the derived allele (and copies of the ancestral allele) as the following: Since the

*X*’s are independent, the probability of observing is given as The likelihood equation above provides a convenient means of estimating the values of the parameters

_{i}*and*

*θ**. The use of the PRF theory leads directly to a likelihood-ratio test of neutrality.*

*γ**is defined as the ratio of the likelihood value under the maximum likelihood estimate of*

*Λ**to the likelihood value under the neutral value of*

*γ**. It is a standard result that is asymptotically chi-square distributed with one degree of freedom [17].*

*γ*Sawyer and Hartl further extended
the PRF model in order to calculate the ratio of expected number of
polymorphisms within species to expected number of fixed differences between
species.In 1991, McDonald and
Kreitman devised a 2-by-2 contingency table test of neutrality that was later
named the MK test [18]. In the
traditional MK test, a 2-by-2 contingency table is formed in order to compare
the number of nonsynonymous and synonymous sites that are polymorphic within a
species (RP and SP) and diverged between species (RF and SF) (Table 1). The central assumption of the MK test is that only nonsynonymous sites may be under
selective pressure (i.e., synonymous sites are assumed to be neutrally evolving). If nonsynonymous sites are evolving according to a neutral model,
then the expectation is that . However, if nonsynonymous sites are under
negative selection, then the expectation is that , and if under positive selection, then . Sawyer
and Hartl derived the formulas for the expected values of SP, SF, RP, and RF
using their PRF theory [2]. Below are
the derivations of each of these formulas. For all of the derivations, assume
that the data consists of samples of size *m* and *n* from two different species.

##### 5.1. Expected Number of Synonymous Polymorphic Sites

Under neutral evolution (), the expected number of polymorphic
sites with population daf *x* can be
computed by taking the product of the per-locus mutation rate ()
and the probability under a neutral model of a single mutation having a
frequency of *x* (from (30)):
Now, consider species 1 with sample
size *m*. The probability that a
polymorphic site, with population daf equal to *x*, is detected as polymorphic in a sample of size *m* is given as
The expected number of synonymous
polymorphic sites, with population daf *x*,
in the species 1 sample is the product of the expected number of synonymous
polymorphic sites with daf *x* in the
population () and the fraction of those that are
expected to be detected in a sample of size . It follows then
that the total expected number of synonymous polymorphic sites, with any
population daf, in the species 1 sample is computed by integrating the product
of and over the range of
possible values for *x*:
Finally, the total number of
expected synonymous polymorphic sites in both species’ sample data is given as

##### 5.2. Expected Number of Replacement Polymorphic Sites

The derivation of the expected
value of RP follows the same logic. As described in (35), the expected number
of polymorphic sites with population daf *x* given some average selection pressure *γ* is given by *g(x)*. Similar to (41), the total expected number of replacement
polymorphic sites in the species 1 sample is computed by integrating the
product of *g(x)* and *P _{m}(x)* from 0 to 1:
Finally, the total expected number
of replacement polymorphic sites in both species’ sample data is given as

##### 5.3. Expected Number of Synonymous Fixed Substitutions

When , the expected number of fixed substitutions in one species
relative to another that diverged generations ago is given as the product of the number of total mutations and
the probability of fixation of each mutation. The number of total mutations is
the product of the mutation rate per generation and the number of generations since
divergence is
The probability of fixation is
given in (21). As *s* approaches 0
(i.e., neutral evolution), the probability of fixationcan be reduced to *p* using the approximation for small values of *a*. Thus, for a newly derived neutral allele that has an initial
frequency of , the probability of
fixation is also .

Therefore, the total expected number of fixed substitutions in species 1 is However, given that the data are samples of the populations from both species, not all sites identified as fixed substitutions in the sample are truly fixed substitutions in the entire population. The expected number of sites in the species 1 sample that fall into this category is given by where and is given in (39).

Therefore, the total expected number of synonymous fixed substitutions in both species’ sample data is given as

##### 5.4. Expected Number of Replacement Fixed Substitutions

Similar to the calculation of (46),
given some selection pressure, * γ*, the
expected number of fixed substitutions in one species relative to another that
diverged generations ago is given as the product of (45) and (21):
Substituting for

*p*and invoking the approximation that for small values of

*a*, we arrive at the following: However, again, given that the data are samples of the populations from both species, not all sites identified as fixed substitutions in the sample are truly fixed substitutions in the entire population. The expected number of sites in the species 1 sample that fall into this category is given by Therefore, the total expected number of replacement fixed substitutions in both species’ sample data is given as

##### 5.5. Estimating Parameters

It is possible to obtain estimates
of * θ* and

*by setting each of the observed values SP, RP, SF, and RF (Table 1) to their PRF expectations given by (42), (44), (48), and (52), respectively, and solving for the parameters. It has been shown that these estimates are equivalent to maximum-likelihood estimates [2, 19]. Bustamante et al. also eloquently describe and implement a hierarchical Bayesian model for parameter estimation [9].*

*γ*#### 6. Concluding Remarks

Sawyer and Hartl’s seminal presentation of the PRF in 1992 provided an innovative mathematical framework for estimating selection pressures and mutation rates, which are critical parameters that influence molecular evolution. However, it is worth noting that the model does harbor certain limitations. Foremost among these is the assumption of site independence, which is equivalent to the assumption of free recombination among mutations (i.e., no linkage). Thus, the model may not be appropriate for many data wherein strong linkage is present. Another limitation is the assumption of infinite sites (i.e., each mutation is at a new site). Although this assumption allows for a simpler model, it is not always biologically appropriate, especially for organisms that experience a higher mutation rate. Indeed, recent work has shown that the assumption of infinite sites can underestimate selection pressures and mutation rates and even infer positive selection, when in fact there is weak negative selection [20]. Recent theoretical work has focused on relaxing these and other assumptions of the original PRF model, so as to make it more appropriate for diverse biological contexts. For a brief list of such studies, we refer the reader to [20]. Ongoing theoretical and empirical work in this area will undoubtedly continue to extend the power of a PRF-based approach for population genetic inference.

#### Acknowledgments

The authors would like to thank Professors Joshua B. Plotkin and Warren J. Ewens for many thoughtful comments on the manuscript, discussions about the material, and suggestions about the presentation. Their support and expert advice have been instrumental to the successful completion of this tutorial. They also thank two anonymous reviewers for their helpful suggestions in improving the manuscript.

#### References

- S. Biswas and J. M. Akey, “Genomic insights into positive selection,”
*Trends in Genetics*, vol. 22, no. 8, pp. 437–446, 2006. View at Publisher · View at Google Scholar · View at PubMed - S. A. Sawyer and D. L. Hartl, “Population genetics of polymorphism and divergence,”
*Genetics*, vol. 132, no. 4, pp. 1161–1176, 1992. View at Google Scholar - D. L. Hartl, E. N. Moriyama, and S. A. Sawyer, “Selection intensity for codon bias,”
*Genetics*, vol. 138, no. 1, pp. 227–234, 1994. View at Google Scholar - H. Akashi, “Inferring weak selection from patterns of polymorphism and divergence at “silent” sites in
*Drosophila*DNA,”*Genetics*, vol. 139, no. 2, pp. 1067–1076, 1995. View at Google Scholar - M. W. Nachman, “Deleterious mutations in animal mitochondrial DNA,”
*Genetica*, vol. 102-103, pp. 61–69, 1998. View at Publisher · View at Google Scholar - D. M. Rand and L. M. Kann, “Mutation and selection at silent and replacement sites in the evolution of animal mitochondrial DNA,”
*Genetica*, vol. 102-103, pp. 393–407, 1998. View at Publisher · View at Google Scholar - H. Akashi, “Inferring the fitness effects of DNA mutations from polymorphism and divergence data: statistical power to detect directional selection under stationarity and free recombination,”
*Genetics*, vol. 151, no. 1, pp. 221–238, 1999. View at Google Scholar - D. M. Weinreich and D. M. Rand, “Contrasting patterns of nonneutral evolution in proteins encoded in nuclear and mitochondrial genomes,”
*Genetics*, vol. 156, no. 1, pp. 385–399, 2000. View at Google Scholar - C. D. Bustamante, R. Nielsen, S. A. Sawyer, K. M. Olsen, M. D. Purugganan, and D. L. Hartl, “The cost of inbreeding in
*Arabidopsis*,”*Nature*, vol. 416, no. 6880, pp. 531–534, 2002. View at Publisher · View at Google Scholar · View at PubMed - S. A. Sawyer, R. J. Kulathinal, C. D. Bustamante, and D. L. Hartl, “Bayesian analysis suggests that most amino acid replacements in
*Drosophila*are driven by positive selection,”*Journal of Molecular Evolution*, vol. 57, pp. S154–S164, 2003. View at Publisher · View at Google Scholar · View at PubMed - C. Bartolomé, X. Maside, S. Yi, A. L. Grant, and B. Charlesworth, “Patterns of selection on synonymous and nonsynonymous variants in
*Drosophila miranda*,”*Genetics*, vol. 169, no. 3, pp. 1495–1507, 2005. View at Publisher · View at Google Scholar · View at PubMed - C. D. Bustamante, A. Fledel-Alon, S. Williamson et al., “Natural selection on protein-coding genes in the human genome,”
*Nature*, vol. 437, no. 7062, pp. 1153–1157, 2005. View at Publisher · View at Google Scholar · View at PubMed - K. Chen and N. Rajewsky, “Natural selection on human microRNA binding sites inferred from SNP data,”
*Nature Genetics*, vol. 38, no. 12, pp. 1452–1456, 2006. View at Publisher · View at Google Scholar · View at PubMed - W. J. Ewens,
*Mathematical Population Genetics: I. Theoretical Introduction*, Springer, New York, NY, USA, 2004. - W. J. Ewens,
*Mathematical Population Genetics*, Springer, New York, NY, USA, 1979. - J. F. C. Kingman,
*Poisson Processes*, Oxford University Press, Oxford, UK, 1993. - S. S. Wilks,
*Mathematical Statistics*, John Wiley & Sons, New York, NY, USA, 1962. - J. H. McDonald and M. Kreitman, “Adaptive protein evolution at the
*Adh*locus in*Drosophila*,”*Nature*, vol. 351, no. 6328, pp. 652–654, 1991. View at Publisher · View at Google Scholar · View at PubMed - S. Williamson, A. Fledel-Alon, and C. D. Bustamante, “Population genetics of polymorphism and divergence for diploid selection models with arbitrary
dominance,”
*Genetics*, vol. 168, no. 1, pp. 463–475, 2004. View at Publisher · View at Google Scholar · View at PubMed - M. Desai and J. B. Plotkin, “Detecting directional selection from the polymorphism frequency spectrum,” Genetics, In press.