The Number of Segregating Sites in Expanding

The Number of Segregating Sites in Expanding Human Populations,
with Implications for Estimates of Demographic Parameters
Giorgio Bertore&? and Montgomery S1atkin-f
*Dipartime n t o di Biologia, Universita’
Berkeley
di Padova,
and TDepartment
of Integrative
Biology, University
of California,
The frequency distribution of pairwise differences between sequences of mtDNA has recently been used to estimate
the size of human populations before and after a hypothetical episode of rapid population growth and the time at
which the population grew. To test the internal consistency of this method, we used three different sets of human
mtDNA data and the corresponding demographic parameters estimated from the distribution of pairwise differences
to determine by simulation the expected number of segregating sites, S, and its empirical distribution. The results
indicate that the observed values of S are significantly lower than expected in two of three cases under the assumption
of the infinite-sites model. Further simulations in which mutations were allowed to occur more than once at the
same site and in which there was variation in mutation rate among sites show that the expected number of
segregating sites can be much lower than under the infinite-sites assumption. Nevertheless, the observed value of
S is still significantly different from the value expected under the expansion hypothesis in two of three cases.
Introduction
Histograms
of pairwise differences between nonrecombining
DNA sequences have recently been investigated, with the goal of understanding
the relationship
between their features and the demographic
history of
populations.
The theoretical
expectation
of the distribution of pair-wise differences has been predicted assuming different hypothetical scenarios of population history
and structure
(Watterson
1975; Slatkin and Hudson
199 1; Rogers and Harpending
1992; Marjoram
and
Donnelly
1994). The comparison
of an observed histogram with expectations
from alternative models may
thus provide information
about important characteristics
of the population
sampled.
The distribution
of pairwise differences of mtDNA
sequences sampled from different human populations
has been noted to fit poorly the distribution
expected in
a population
of constant size under neutrality (Cann et
al. 1987; Di Rienzo and Wilson 199 1; Harpending
et al.
1993). In contrast, unimodal distributions
are often observed. Slatkin and Hudson ( 199 1) showed that exponential population growth can produce a nearly Poisson
distribution
of pairwise differences, similar to that obKey words: population
growth,
finite-sites
model, human
evolu-
tion.
Address for correspondence
and reprints: Giorgio Bertorelle, Dipartimento
di Biologia, Universita’ di Padova, via Trieste 75, 35 12 1
Padova, Italy; E-mail: GIORGIO@CRIBI
1.BIO.UNIPD.IT.
Mol. Bid. Evol. 12(5):887-892. 1995.
0 1995 by The University of Chicago. All rights reserved.
0737-4038/95/l 205-00 17$02.00
served in some human populations
(Di Rienzo and Wilson 199 1). Rogers and Harpending
( 1992) assumed that
a unimodal
distribution
of pairwise differences is the
signature of a previous population expansion event, and
they estimated population
sizes before and after rapid
growth and the time of growth for different human populations. Their method relies on a least-squares
fitting
of a theoretical transient distribution
of pairwise differences, derived by Li (1977) under the assumption
of the
infinite-sites
model
of mutation.
Rogers’s
( 1995)
method-of-moments
estimator gives similar estimates
of the population size before the expansion and the time
since expansion but does not estimate the present population size, which is assumed to approach infinity.
In general, two problems may arise in trying to infer
the demographic
history of a population
from genetic
data. First, the observed sample of genes represents only
one of the possible evolutionary
outcomes of the genealogical process. If the variance in the expected distribution of the possible genealogies is high, it may be difficult to draw any inference from a single sample. After
a rapid expansion
and until the new equilibrium
is
reached, however, the average pairwise difference is likely
to be closer to its expectation
than in a population
of
constant size at equilibrium
(Rogers and Harpending
1992), especially if there is a large increase in the population size (Rogers 1995). This fact, which is due to the
initial decrease of the coefficient of variation of the average pairwise difference after the expansion, implies that
the errors in Rogers and Harpending’s
(1992) estimates
887
888
Bertorelle
and Slatkin
Table 1
The mtDNA Data Sets Used in the Simulations
Sample
na
S”
@lb
@lb
zb
No’
NIC
Sardinia
..
Middle East
147
69
42
195
52
61
2.44
0.76
0
410.69
17.55
3,117.4
7.18
3.99
7.34
813
232
1
136,897
5,35 1
950,427
World
...
tC
2,393
1,216
2,238
’ Sample size (n) and number of segregating sites (S) from Cann et al. (1987)
(worldwide sample) and from Di Rienzo
and Wilson (1991) (Sardinian and Middle Eastern samples).
b Estimates of 0 = 2Nu before (0,) and after (0,) the instantaneous population expansion and the age of the expansion
in units of mutational time (T), from Rogers and Hat-pending (1992).
’ Population sizes and date of the expansion (in unit of generations) used in the simulations, calculated from C&, 8,,
and z, assuming a mutation rate per nucleotide of 5.0 X lo-’ and 4.1 X 10m6per generation for the world data set (which
includes the whole mtDNA sequence) and for the Sardinia and Middle East data sets (which include only the control
region), respectively.
tend to be low when large and recent expansion events
occurred in the history of the population.
Second, the application of Rogers and Harpending’s
(1992) method for the estimation
of demographic
parameters relies on the assumption
that population
expansion, genetic drift, and mutation are the only factors
that produced the observed distribution
of pairwise differences. Yet, a unimodal
distribution
of pairwise differences may also be the consequence
of selection on a
particular haplotype (Di Rienzo and Wilson 199 1) or of
high levels of homoplasy present at nucleotide sites with
very high mutation
rates (Lundstrom
et al. 1992). In
particular, these alternative
explanations
for unimodal
distributions
may have some importance
in human
populations,
where simulation of expansion events have
shown that nonunimodal
distributions
of pairwise differences can easily result if the population
is subdivided
(Marjoram and Donnelly
1994).
The consistency
of the Rogers and Harpending
(1992) method may be tested by using another measure
of genetic variability,
the number of segregating sites,
which has been shown to have a different dynamic than
the average number of nucleotide differences when population size is not constant (Tajima 1989b). To do that,
we simulated gene genealogies similar to those for three
human mtDNA data sets (Cann et al. 1987; Di Rienzo
and Wilson 199 1) using the parameters of the population
expansion estimated by Rogers and ‘Harpending ( 1992).
We then compared the observed number of segregating
sites with the simulation
results. Finally, we tested how
the mutational
model affects the results, in particular
assuming more realistic finite-sites models of mutation
for the human mtDNA.
(1987). The second and the third are the sequences of a
segment spanning the first 400 bp of the control region
in the Sardinian and Middle Eastern populations studied
by Di Rienzo and Wilson ( 199 1). The characteristics
of
these samples and the genetic variability measures relevant for the present work are presented in table 1, together with the population
expansion parameters
estimated by Rogers and Harpending
(1992) from the
observed distribution of the pairwise differences between
each sequence and all the others in the sample. These
parameters
are O,, = 2N0u, O1 = 2Nlu, and z = 2ut,
where No and Nl are the number of females in the population before and after the expansion,
t is the age in
number of generations
of this event, and u is the mutation rate per generation per region of DNA. The crucial
parameters in the model are 00, O,, and z. The estimates
of No, Ni, and t depend on the assumed mutation rate.
The three different data sets lead to three different
sets of population expansion estimates (see table 1). The
estimated variation in size ranges from a 23-fold increase
in the Sardinian population
to an unknown
(@, = 0)
but very large increase in the Middle Eastern population.
The age of these expansions (in units of 1/2u generations)
range from 3.99 in the Sardinian data set to 7.34 in the
Middle Eastern data set. Moreover, the three samples
are apparently
also very different in terms of homogeneity, since the Cann et al. ( 1987) data set includes people
from four continents,
whereas the other samples come
from more restricted areas. Consequently,
the three data
sets seem to be suitable to test the applicability of Rogers
and Harpending’s
( 1992) method of inferring evolutionary parameters.
Simulations
The mtDNA Data
Three human mtDNA data sets were used. The
first is the worldwide
sample of restriction
fragment
length polymorphisms
(RFLPs) analyzed by Cann et al.
Following
the general coalescent
approach
described by Hudson (1990), the computer simulations
reconstructed
backward the genealogy of a sample of
genes. A thousand gene genealogies were simulated for
Segregating Sites in Expanding Populations
each of the three human mtDNA samples considered,
using the values of Oo, 01, and z estimated by Rogers
and Harpending
(1992). A model of sudden population
expansion was assumed: t generations before the present,
the population
size changed from No to N, (see table 1).
The values of No, N,, and t were determined
from Oo,
O1 z and assuming a per-nucleotide
mutation
rate of
5.0 X 10m7 and 4.1 X 10m6 per generation for the world
data set (which includes the entire mtDNA molecule)
and for the Sardinian
and Middle Eastern data sets
(which include only the control region), respectively. For
the purposes of this article, the estimates of the mutation
rate and the consequent
values assigned to N are not
critical, since the total number of mutations on the tree
depends only on their product Nu.
The number of mutations
that occurred along a
particular lineage of length t was Poisson distributed with
mean ut. To avoid overestimation
of the total number
of mutations in a tree (which is possible when the population size is very small), multiple coalescence events
each generation were allowed.
Three models of mutation were used in the simulations. The first was the infinite-sites
model (Kimura
1969), where each mutation
occurs at a different site.
Under this model, the number of segregating sites is the
total number of mutations
in the tree (Hudson
1990).
The second was a finite-sites model with two possible
states per sites, thus allowing for the possibility of reverse
mutations and multiple changes per sites. The third was
a two-rate finite-sites model. Again, the mutations
may
occur more than once at the same site (with two possible
states per site), but in this case some sites were considered
slow, that is, with a low mutation
rate, and other fast,
that is, with a high mutation
rate. Wakeley (1993) has
shown that a model with gamma-distributed
rates provides a better fit to the number of changes at each nucleotide site in three human mtDNA data sets pooled
than does a two-rate model, but this was no longer the
case when the three samples were analyzed separately.
Since the size of the samples used in this article is similar
to or smaller than the individual samples used by Wakeley, a simple two-rate model of mutation seems, in this
context, a reasonable approximation
of a more realistic
mutational
process. Keeping the average mutation rate
equal to the value used in the previous models, the proportion of fast sites was fixed at 0.1 in the simulations
of the world data set and at 0.2 in the simulations
of the
Sardinian
and Middle Eastern data sets. These values
were chosen because they approach the fraction of noncoding sites in human mtDNA sequence and the estimates of Wakeley (1993) on the control region, respectively. As already mentioned,
the world data set refers
to the whole mtDNA sequence, whereas the other two
889
data sets consisted of the control region only. In all cases,
fast sites mutated 10 times faster than slow sites. Small
changes in these values did not affect the simulation
results.
Under the infinite-sites
assumption,
the two-rate
mutation model and the constant mutation rate model
produce the same level of genetic diversity, as long as
the average mutation rate is the same. For that reason,
we used the two-rate model only under a finite-sites assumption.
Results
The distributions
of the number of segregating sites
S obtained in the simulations
are shown in figure 1,
where the observed values in the three human populations analyzed are indicated by an arrow. Under an infinite-sites assumption,
the values of S expected if the
WORLD
Finite-sites, two rates
140
180
220
260
model
300
340
380
420
S
SARDINIA
Finite-sites, two rates model
20
28
36
44
52
60
68
S
MIDDLE EAST
60
75
90
105
120
135
150
165
180
195
S
FIG. 1.-Empirical distributions of the number of segregating sites,
S, obtained by simulating the same genealogical process 1,000 times
under three different models of mutation. Arrows indicate the observed
value of S.
890
Bertorelle
and Slatkin
populations
had experienced an expansion as estimated
by Rogers and Harpending
(1992) are quite different
from the observed values. In particular, using the empirical distribution
of S obtained in 1,000 simulations
to evaluate the significance of the departures of the observed from the expected values, the population
expansion hypothesis significantly
overestimates
the number
of segregating sites observed in the world and in the
Middle Eastern data sets. For the Sardinian data set, on
the other hand, the expected value of S was smaller than
the observed value. In this case, however, about 8% of
the simulations
produced a value of S equal to or larger
than the observed S, thus implying a nonsignificant
departure of the number of segregating sites from the prediction of the model.
Using the two finite-sites models of mutation, the
simulations
resulted in a lower number of segregating
sites, and this reduction was stronger when the two-rate
model of mutation was employed. In this case, the value
of S observed in the world data set did not differ significantly
from the expected value, but this was not
true for the Middle Eastern and the Sardinian data sets
(table 2).
Finally, the average mean number of pairwise differences between sequences,
X, as well as its average
sample variance S*(K) did not change much under different mutational
models (table 3). Their coefficients of
variation for 1,000 simulations
are reported in table 3.
Discussion
The results of this study show that the expected
number of segregating sites in expanding
populations
Table 2
Simulation Results: Mean and Coefficient of Variation
of the Number of Segregating Sites, S, Observed
in 1,000 Repetitions
Sample and Mutational
World:
IS .
FSl
.
FS2
.
Sardinia:
IS .
..
FSl _....
FS2
..
Middle East:
IS
.
FSl
.
FS2
.
Model
s
cv
E(s)
352.20*
246.24*
182.81
0.06
0.05
0.06
351.72
0.17
40.85
40.8 1
39.3 1
35.70*
0.16
0.16
150.77*
124.67*
98.87*
0.09
0.08
0.07
150.53
NOTE.-Abbreviations
as follows: IS, infinite-sites model; FS 1,finite-sites,
one-rate model; FS2, finite-sites, two-rate model. Asterisks indicate that the observedvalue of S is smaller than the 2.5 percentile or larger than the 97.5 percentile
of the empirical distribution of S. The expected value of S, analytically derived
by Tajima ( 19896) for the infinite-sites model, is reported in the last column.
Table 3
Simulation Results: Mean and Coefficient of Variation of the
Mean Number of Pairwise Differences, IC,and of its Sample
Variance, s*(n), Observed in 1,000 Repetitions
Sample and Mutational
World:
IS
FSl
FS2
Sardinia:
IS
FSl . . . .
FS2
. .
Middle East:
IS . . . . .
FSl
...
FS2
Model
Tc
cv
;;I@)
cv
Eb0
9.50
9.30
8.95
0.17
0.16
0.15
16.65
15.38
13.32
1.41
0.92
0.60
9.51
4.18
4.19
4.07
0.24
0.25
0.23
34.54
34.47
35.25
0.25
0.25
0.23
4.17
7.34
7.17
6.99
0.08
0.08
0.08
12.03
12.54
12.93
0.22
0.213
0.20
7.33
NOTE-Abbreviations
are as follows: IS, infinite-sites model; FSI, finitesites, one-rate model; FS2, finite-sites, two-rate model. The expected value of IC,
analytically derived by Li (1977) for the infinite-sites model, is reported in the
last column.
can be strongly affected by assumptions
about the mutational process and the demographic
parameters
of a
population
expansion, estimated from the distribution
of the pairwise differences observed in three human
populations, do not always explain the observed number
of segregating sites.
The assumption
that mutations
can occur more
than once at the same site reduced the number of segregating sites expected under the infinite-sites
model.
This effect, more evident when a large number of mutations were forced into few fast sites, is due to the fact
that under the infinite-sites
model each mutation
increases the number of segregating site, whereas, under
a finite-sites model, that is not true when a mutation in
the genealogy occurs at an already-mutated
site. The
deviation from the infinite-sites
expectation
is proportional to the total length of the tree (longer trees contain
more mutations,
which implies more homoplasy)
and
therefore to the size of the population
and, until a new
equilibrium
is reached, to the age of the expansion. This
explains the small reduction of S observed in the simulation of the Sardinian data set compared with the almost-twofold
decrease observed in the simulations
of
the world and Middle Eastern data sets. In contrast, the
expected values of n found in the simulations
did not
change much when the infinite-sites
or the finite-sites
models were used. This finding was predicted by the
analytical derivation of Rogers (1992), but it is unclear
how these small variations, together
with the variation
_
introduced
by the finite-sites model into the sample
variance of n, will affect the estimations of the population
expansion parameters.
Segregating
The different influences of the mutational
assumptions on different measures of genetic variability have
other interesting
consequences.
For example, the conclusions of the widely used Tajima’s ( 1989a) test of neutrality may be affected by the mutational
process. The
expected value of Tajima’s D statistic, under neutrality
and assuming the infinite-sites model, is 0, and the upper
95% confidence limit, in a sample of 100 genes, is 2.073
(Tajima 1989~). That means that observed values of D
larger than 2.073 can be taken as evidence against neutrality. When we simulated the genealogy of 100 genes
using a finite-sites, two-rate model of mutation, with 0
= 20, the average mutation rate per generation per nucleotide equal to 2.5 X 10w6, 390 slow sites (u = 2.33
X 10w7), and 10 fast sites (u = 9.1 X 10P5), the empirical
distribution of D observed in 1,000 replicates was shifted
substantially
toward positive values (fig. 2). The mean
value of D was 1.58, and 23.4% of the genealogies showed
a significant departure from the hypothesis of neutrality.
This finding shows that, even if only 2.5% of the sites
are subject to frequent recurrent mutation,
the confidence limits of Tajima’s statistic may no longer be appropriate.
We now turn to the comparison
between the observed and the expected values of S. The analysis of the
world data set shows that the population
expansion hypothesis can explain the observed number of segregating
sites when we use the two-rate model of mutation. However, it is not clear how much bias is introduced
into
the distribution
of pairwise differences when sequences
sampled from distinct populations
(geographically
and
historically) are pooled, as they are in the world sample.
The pooling of data from different populations
could
create a unimodal
distribution
because of the central
limit theorem (L. Excoffier, personal communication).
Thus, the interpretation
of demographic
parameters estimated from the distribution
of pairwise differences in
a nonhomogeneous
data sets is difficult, even when the
observed number of segregating sites and mean pairwise
difference are compatible with the values expected under
the expansion hypothesis. An additional
complication
is that we found in our simulations
of the world data
set that the coefficient of variation of the sample variance
of n: was rather high (greater than one under the infinitesites model). Hence, it is difficult to have confidence in
estimates of parameters that depend on the variance of
n:. In this context, we note that Templeton
(1993), using
a different approach, could not find evidence of worldwide population expansions in a different mitochondrial
DNA data set.
The analyses of the Sardinian and the Middle Eastern data sets, which comprise homologous
sequences,
show a significant difference between the observed and
60
Sites in Expanding
Populations
89 1
1
2
25
35
D
FIG. 2.-Empirical
distribution
of Tajima’s (1989~) D statistic,
when a finite-sites two-rate mutation model was used and the population
size did not vary; D values with empty frequency bars show significant
departure from neutrality. 8= 20; n = 100; average mutation rate per
nucleotide per generation = 2.5 X lo-$390 slow sites (p = 2.33 X lo-‘);
10 fast sites (u = 9.1 X lo-‘).
expected values of S when the two-rate mutation model
was assumed. Two different but not mutually exclusive
explanations
for these results may be envisaged. First,
these populations
underwent an expansion that led to a
unimodal
distribution
of the pairwise differences, but
the errors in the estimation of 00, @, and z due to the
stochasticity of the genealogical process are large enough
to also include values that predict a number of segregating sites different from the observed value. The errors
of these estimates depend on the magnitude and the age
of the expansion (Rogers 1995) and increase as the population approaches the new equilibrium.
This first explanation is not convincing for the Middle Eastern population, which seems to be far from equilibrium.
In fact,
the nucleotide diversity expected in this population after
the expansion and at equilibrium
is almost 500 times
larger than that observed. Second, population expansion
may not be the only factor that shaped the levels of genetic variability in Sardinian and Middle Eastern populations. Compared with the simulated expectation, the
excess of S observed in the Sardinia sample may be due
to the existence of deleterious mutations (Tajima 1989a),
wh&h do not seem to be uncommon
in human mtDNA
(see Excoffier 1990). On the contrary, the deficit of S
observed in the Middle Eastern data set seems more difficult to explain.
A unimodal distribution
of pairwise differences of
mtDNA can result from any process that produces a
lack of correlation between sequences. Rapid population
growth and a selective sweep of a new advantageous
haplotype lead to a low correlation
among sequences
because they cause a starlike gene genealogy. High mutation rates at some nucleotide sites also result in low
correlation because mutation would then tend to obscure
the true genealogical history. It seems likely that both
demographic and mutational processes contribute to the
892 Bertorelle and Slatkin
distributions of pairwise differences. A unimodal distribution is consistent with rapid population growth but
does not prove it occurred or necessarily provide much
information about how it occurred. We have shown that
the observed numbers of segregating sites are not consistent with the values expected in the model of Rogers
and Hat-pending (1992), as well as when more realistic
models of mutation are incorporated. Thus, the estimates
of Oo, 0,, and z obtained from that analysis should be
interpreted with caution.
Acknowledgments
We thank F. Bonhomme, G. Barbujani, and L. Excoffier for helpful discussions on this topic. This work
was partly supported by an Erasmus grant to G.B. and
by a National Institutes of Health grant GM 40282 to
M.S., and it was developed while visiting the Centre National de la Recherche Scientifique (CNRS) laboratory
of F. Bonhomme in Montpellier, France.
LITERATURE CITED
CANN, R. L., M. STONEKING,and A. C. WILSON. 1987. Mitochondrial DNA and human evolution. Nature 325:3 l36.
DI RIENZO, A., and A. C. WILSON. 199 1. Branching pattern
in the evolutionary tree for human mitochondrial DNA.
Proc. Natl. Acad. Sci. USA 88: 1597- 160 1.
EXCOFFIER,L. 1990. Evolution of human mitochondrial DNA:
evidence for departure from a pure neutral model of populations at equilibrium. J. Mol. Evol. 30: 125- 139.
HARPENDING, H. C., S. T. SHERRY, A. R. ROGERS, and M.
STONEKING.1993. The genetic structure of ancient human
populations. Cm-r. Anthropol. 34:483-496.
HUDSON, R. R. 1990. Gene genealogy and the coalescent process. Oxford Surv. Evol. Biol. 7: l-44.
KIMURA, M. 1969. The number of heterozygous nucleotide
sites maintained in a finite population due to steady flux
of mutation. Genetics 61:893-903.
Lr, W. H. 1977. Distribution of nucleotide differences between
two randomly chosen cistrons in a finite population. Genetics 85:33 l-337.
LUNDSTROM,R., S. TAVAR~, and R. H. WARD. 1992. Modelling evolution of the human mitochondrial genome. Math.
Biosci. 112:319-335.
MARJORAM,P., and P. DONNELLY. 1994. Pairwise comparisons of mitochondrial DNA sequences in subdivided populations and implications for early human evolution. Genetics 136:673-683.
ROGERS,A. 1992. Error introduced by the infinite-sites model.
Mol. Biol. Evol. 9: 118 1- 1184.
-.
1995. Genetic evidence for a Pleistocene population
explosion. Evolution. In press.
ROGERS,
A. R., and H. HARPENDING.1992. Population growth
makes waves in the distribution of pair-wise genetic differences. Mol. Biol. Evol. 9:552-569.
SLATKIN,M., and R. R. HUDSON. 199 1. Pairwise comparisons
of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562.
TEMPLETON, A. R. 1993. The “Eve” hypotheses: a genetic
critique and reanalysis. Am. Anthropol. 95:5 l-72.
TAJIMA, T. 1989a. Statistical method for testing the neutral
mutation hypothesis by DNA polymorphism. Genetics 123:
585-595.
. 1989b. The effect of change in population size on
DNA polymorphism. Genetics 123:597-60 1.
WAKELEY,J. 1993. Substitution rate variation among sites in
hypervariable region 1 of human mitochondrial DNA. Mol.
Biol. Evol. 37:6 13-623.
WATTERSON,G. A. 1975. On the number of segregating sites
in genetical models without recombination. Theor. Popul.
Biol. 7:256-276.
CHARLES F. AQUADRO, reviewing editor
Received August 4, 1994
Accepted April 3, 1995