Genetic Sampling Error of Distance $(\delta\mu)^2$ and Variation in Mutation Rate
Among Microsatellite Loci
Lev A. Zhivotovsky,* David B. Goldstein,† and Marcus W. Feldman‡
*Vavilov Institute of General Genetics, Moscow, Russia; †Department of Biology, University College London,
London, England; and ‡Department of Biological Sciences, Stanford University
An expression is obtained for the time-dependent variance of the microsatellite genetic distance $(\delta\mu)^2$ when the
mutation rate is allowed to vary randomly among loci. An estimator is presented for the coefficient of variation,
$C_w$, of the mutation rate. Values of $C_w$ estimated from genetic distances between African and non-African populations
were less than 100%. Caveats to this conclusion are discussed.
Introduction
In order to estimate the time of divergence of two
contemporary populations from a single ancestral lineage, a genetic distance that is a known function of this
time is desirable. When the populations are assayed for
microsatellite polymorphism, the genetic distance $(\delta\mu)^2$,
based on the average squared differences in the sizes of
alleles sampled in pairs, one from each population, has
an expectation that increases linearly with time at a rate
equal to twice the mutation rate in the case of one-step
mutations (Goldstein et al. 1995). For multistep mutations, the rate of increase is twice the effective mutation
rate, which is the product of the mutation rate and the
variance of changes in allele size due to mutation (Zhivotovsky and Feldman 1995).
The usual way to analyze a set of microsatellite loci
from individuals sampled in two populations is to compute $(\delta\mu)^2$ for each locus and average across loci. If the
mutation rate (or the effective mutation rate) is the same
at all loci, and is known, then dividing the average distance by twice that rate gives an
estimate of the expected time since separation of the
populations. Variation across loci in the mutation rate
affects the variance of $(\delta\mu)^2$ (but not its expectation).
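The averaging-and-division step described above can be sketched in a few lines. This is only an illustration: it takes $(\delta\mu)^2$ at a locus to be the squared difference between the two populations' mean repeat scores, ignores sampling corrections, and uses hypothetical function names and made-up repeat counts.

```python
from statistics import mean

def delta_mu_sq(sizes_pop1, sizes_pop2):
    """Squared difference of the mean repeat scores of two samples at one locus."""
    return (mean(sizes_pop1) - mean(sizes_pop2)) ** 2

def divergence_time(loci, w):
    """Average (delta mu)^2 over loci and divide by 2w.

    `w` is the (effective) mutation rate, assumed known and equal at all loci;
    the result is in generations.
    """
    avg = mean(delta_mu_sq(a, b) for a, b in loci)
    return avg / (2.0 * w)

# Hypothetical allele sizes (repeat counts): one (pop1, pop2) pair per locus.
loci = [
    ([10, 11, 12, 11], [13, 14, 13, 12]),   # means 11 and 13, distance 4
    ([20, 21, 20, 19], [20, 22, 21, 21]),   # means 20 and 21, distance 1
]
avg_dist = mean(delta_mu_sq(a, b) for a, b in loci)
t_hat = divergence_time(loci, w=0.001)
```

With only two loci the estimate is, of course, extremely noisy; the sketch shows the arithmetic, not a usable sample size.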
The evolutionary process involves genetic sampling error due to random genetic drift and mutation,
and thus the variance among the possible evolutionary
replicates of the distance is an important issue. Zhivotovsky and Feldman (1995) implied that among replicates, the distance follows a chi-square distribution. In
fact, the variance of the distance does asymptotically
satisfy the most important property of the chi-square distribution, namely, that its variance approaches twice the
square of its expectation as time increases (Zhivotovsky,
Feldman, and Grishechkin 1997), but the actual distribution is not exactly chi-square.
From their analysis of properties of $(\delta\mu)^2$ in a study
of more than 200 human microsatellite loci, Cooper et
al. (1999) found strong evidence for variation among
loci in the mutation rate. Our purpose in this paper is
to obtain an analytical expression for the variance of
$(\delta\mu)^2$ when the mutation rate is variable. An important
application of this analytical expression could be estimation of the extent of variation in mutation rate among
microsatellite loci. Our analysis also allows us to compute the time-dependent dynamics of the variance of
$(\delta\mu)^2$ and to assess how sensitive these dynamics are to
the assumption of a fixed mutation rate that is constant
across loci.
Key words: microsatellite loci, mutation rate, genetic distance.
Address for correspondence and reprints: Marcus W. Feldman,
Department of Biological Sciences, Stanford University, Stanford, California 94305. E-mail: [email protected].
Mol. Biol. Evol. 18(12):2141–2145. 2001
© 2001 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
Results
Consider a randomly mating diploid population of
constant size $N$ with nonoverlapping generations and an
autosomal microsatellite locus undergoing multiple-step
mutation with mutation rate $\mu$ and, possibly, constant
mutation bias, as measured by the difference between
the mean size of mutations and the size of the parental
allele. (There is no bias if the difference is zero.) Let
$\eta_m^{(2)}$ be the expectation of the square of mutational gains
and losses (Di Rienzo et al. 1998), which in the case of
no average mutation bias becomes the variance in mutation changes, $\sigma_m^2$ (Slatkin 1995). We call $w = \mu\eta_m^{(2)}$
the effective mutation rate. Also, introduce $k = \mu\eta_m^{(4)}$,
where $\eta_m^{(4)}$ is the fourth noncentral moment of mutational
changes in repeat score; $w = k = \mu$ in the case of one-step symmetric mutation. Assume for now that the
mutation parameters do not vary between loci.
The within-population variation at a microsatellite
locus can be characterized by the mean allele size (r),
the variance of allele size (the second central moment)
(V), and the unnormalized kurtosis (the fourth central
moment) (K) (Zhivotovsky and Feldman 1995). The between-population variation can be measured by analogs
of FST (Slatkin 1995; see also Michalakis and Excoffier
1996; Rousset 1996; Feldman, Kumm, and Pritchard
1999). For two populations, the $(\delta\mu)^2$ distance is defined
as the squared difference of the mean values of their
repeat scores: $(\delta\mu)^2 = (r_1 - r_2)^2$ (Goldstein et al. 1995).
Suppose that two populations diverged from an ancestral population at initial time $t = 0$, at which the profile of allele frequencies was represented by $P_0$ (i.e., $P_0$
produced specific values of the variance $V_0$, unnormalized kurtosis $K_0$, etc.), and then evolved independently
under random genetic drift and multistep mutation. Given $P_0$, $E_r\{S \mid P_0\}$ is an expectation operator that averages
the statistic $S$ over all possible realizations (replicates)
of the drift-mutation process. Averaging with $E_r$ may
then be followed by the operator $E_0$, which averages
over all possible genetic structures $P_0$ of the unknown
ancestral population, i.e., values of $V_0$, $K_0$, etc. Thus, $E_r$
averages over loci having identical mutation parameters
and identical starting conditions, and $E_0$ averages over
the different initial conditions. We assume that prior to
divergence, the ancestral population had attained mutation-drift equilibrium, where the expectations of the
within-locus variances, the between-locus variance of
variances, and the unnormalized within-locus kurtosis
are

$$
\hat V = E_0(V_0) = (2N - 1)w \approx 2Nw, \qquad
\widehat{\mathrm{Var}}(V) = E_0(V_0 - \hat V)^2 \approx \frac{4}{3}\hat V^2 + \frac{\kappa}{6}, \qquad
\hat K = E_0(K_0) \approx 5\hat V^2 + \frac{\kappa}{2}, \tag{1}
$$

respectively, with $\kappa = (2N - 1)k$ (Zhivotovsky and
Feldman 1995); $k$ is defined in equation (6) of the appendix. Recall that expressions (1) are valid if mutation
parameters do not vary among loci (see eq. 12 if they
vary among loci).

FIG. 1.—To evaluate the accuracy of equation (2), we ran coalescent simulations following the algorithm of Hudson (1990), including
the stepwise mutation process with and without rate variation. Separation times are given in units of $2N$. In the case of a constant mutation
rate, $\theta$ is set to 3.5. For rate variation, the average $\theta$ is again 3.5, but
the values of $\theta$ are now drawn from a gamma distribution with a variance
of 10. Each of 200 replications involves 30 loci and 30 sampled alleles
at each locus. White triangles and circles represent analytical results
for no variation in mutation and for variation in mutation among loci,
respectively. Black triangles and circles are the corresponding simulated values.

After $t$ generations of divergence, the expected distance, $E_0E_r[(\delta\mu)^2]$, equals $2wt$ (Zhivotovsky and Feldman 1995; see also Feldman, Kumm, and Pritchard
1999; Zhivotovsky 2001), which becomes $2\mu t$ with one-step symmetric mutation (Goldstein et al. 1995).
The square of the genetic sampling error of a statistic is its variance over replicates (Weir 1996). Therefore, the within-locus variance of $(\delta\mu)^2$ is defined as

$$
\mathrm{Var}_W = E_0\{\mathrm{Var}_r[(\delta\mu)^2 \mid P_0]\}
= E_0\big(E_r[\{(\delta\mu)^2 - E_r[(\delta\mu)^2 \mid P_0]\}^2 \mid P_0]\big).
$$

The between-locus variance,

$$
\mathrm{Var}_B = E_0\{E_r[(\delta\mu)^2 \mid P_0] - E_0E_r[(\delta\mu)^2]\}^2,
$$

is due to variation in initial conditions $P_0$ and together
with the within-locus variance makes up the total variance in the case of no variation in mutation rate among
loci, $\mathrm{Var}_T = \mathrm{Var}_W + \mathrm{Var}_B$. $\mathrm{Var}_T$ is in fact the quantity
of interest in assessing the reliability of $(\delta\mu)^2$ as a distance measure in the presence of variation across loci in
the mutation rate. Analytical expressions for both variances are given in the appendix. It can be shown that
the total variance is greater than $8w^2t^2$ near the beginning of the process, although this value is ultimately
approached, as was shown by Zhivotovsky, Feldman,
and Grishechkin (1997) and has been observed in numerical simulations.
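As a small numerical companion to expressions (1) and to the expected distance $2wt$, the sketch below evaluates the equilibrium moments for given $2N$, $w$, and $k$. It assumes the reading $\kappa = (2N - 1)k$ with $\hat K \approx 5\hat V^2 + \kappa/2$ and $\widehat{\mathrm{Var}}(V) \approx (4/3)\hat V^2 + \kappa/6$; the function names and parameter values are illustrative, not taken from the paper.

```python
def equilibrium_moments(two_n, w, k):
    """Equilibrium moments of expressions (1) for a constant mutation rate.

    Returns (V_hat, Var_hat(V), K_hat), assuming kappa = (2N - 1) * k.
    """
    kappa = (two_n - 1) * k
    v_hat = (two_n - 1) * w                       # expected within-locus variance
    var_v = 4.0 * v_hat ** 2 / 3.0 + kappa / 6.0  # between-locus variance of variances
    k_hat = 5.0 * v_hat ** 2 + kappa / 2.0        # unnormalized kurtosis
    return v_hat, var_v, k_hat

def expected_distance(w, t):
    """Expected (delta mu)^2 after t generations of divergence: 2wt."""
    return 2.0 * w * t

# One-step symmetric mutation, so w = k = mu (illustrative values).
v_hat, var_v, k_hat = equilibrium_moments(two_n=5000, w=0.001, k=0.001)
dist_4000 = expected_distance(w=0.001, t=4000)
```

For one-step symmetric mutation $\kappa$ coincides with $\hat V$, so the variance of variances reduces to $(4/3)\hat V^2 + \hat V/6$.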
Suppose the effective mutation rate varies across
loci with mean $\bar w$ and variance $\sigma_w^2$, and let $\bar\kappa$ be the mean value
of $\kappa$ over loci. It is proved in the appendix that

$$
\begin{aligned}
\mathrm{Var}_T ={}& 8\bar w^2 t^2 \left[1 + \frac{2}{3}\left(1 + e^{-t/2N}\right)\left(\frac{1 - e^{-t/2N}}{t/2N}\right)^2\right] \\
&+ \frac{\bar\kappa}{3}\left(\frac{3t}{N} + 4e^{-t/2N} + e^{-t/N} - 5\right) \\
&+ \frac{2}{3}\left(1 - e^{-t/2N}\right)^2\left[\bar\kappa + 8(2N\bar w)^2\right] \\
&+ \sigma_w^2 \times 8\left\{t^2\left[1.5 + \frac{2}{3}\left(1 + e^{-t/2N}\right)\left(\frac{1 - e^{-t/2N}}{t/2N}\right)^2\right] + \frac{2}{3}\left(1 - e^{-t/2N}\right)^2(2N)^2\right\},
\end{aligned} \tag{2}
$$

assuming mutation-drift equilibrium, which entails that
the expected distance is $2wt$ at generation $t$. As time
increases, this asymptotically approaches $8\bar w^2t^2 \times [1 + 3\sigma_w^2/2\bar w^2 + O(1/t)]$. In order to evaluate the accuracy of
equation (2), we carried out a simulation using coalescent techniques following the algorithm of Hudson
(1990), modified to include the stepwise mutation process with and without variation among loci in the mutation rate. In figure 1, we see that the simulated data
produce values for $\mathrm{Var}_T$ that are close to those expected
from equation (2).
Rewrite equation (2) as

$$
\mathrm{Var}_T = A + B\sigma_w^2, \tag{3}
$$

where $A$ collects the terms on the right-hand side of equation (2) that do not involve $\sigma_w^2$, and $B$ is the multiplier
of $\sigma_w^2$ in equation (2). Then, given the
observed variance in genetic distances across loci,
$\mathrm{Var}_{obs}$, the variance in mutation rates can be estimated
as

$$
\sigma_w^2 = \frac{\mathrm{Var}_{obs} - A}{B}. \tag{4}
$$
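Equations (2)–(4) translate directly into code. The sketch below is one reading of equation (2), split into the coefficients $A$ and $B$ of equation (3), with $\bar\kappa$ taken as the mean over loci of $(2N - 1)k$. The function names are hypothetical, and the parameter values are illustrative (they match the ones quoted in the caption of figure 2).

```python
import math

def variance_coefficients(t, two_n, w_bar, kappa_bar):
    """Return (A, B) such that Var_T = A + B * sigma_w^2 (equations 2 and 3)."""
    x = t / two_n                       # time in units of 2N generations
    e1 = math.exp(-x)                   # e^{-t/2N}
    e2 = math.exp(-2.0 * x)             # e^{-t/N}
    g = (2.0 / 3.0) * (1.0 + e1) * ((1.0 - e1) / x) ** 2
    n = two_n / 2.0
    a = (8.0 * w_bar ** 2 * t ** 2 * (1.0 + g)
         + (kappa_bar / 3.0) * (3.0 * t / n + 4.0 * e1 + e2 - 5.0)
         + (2.0 / 3.0) * (1.0 - e1) ** 2 * (kappa_bar + 8.0 * (two_n * w_bar) ** 2))
    b = 8.0 * (t ** 2 * (1.5 + g) + (2.0 / 3.0) * (1.0 - e1) ** 2 * two_n ** 2)
    return a, b

def estimate_sigma2_w(var_obs, t, two_n, w_bar, kappa_bar):
    """Equation (4): variance of the effective mutation rate among loci."""
    a, b = variance_coefficients(t, two_n, w_bar, kappa_bar)
    return (var_obs - a) / b

# Illustrative parameters; one-step symmetric mutation, so k = mu = w.
two_n, w_bar, sigma_w = 5000, 0.001, 0.0005
kappa_bar = (two_n - 1) * w_bar
t = 4000
a, b = variance_coefficients(t, two_n, w_bar, kappa_bar)
var_t = a + b * sigma_w ** 2                      # equation (3)
recovered = estimate_sigma2_w(var_t, t, two_n, w_bar, kappa_bar)

# Long-time check: Var_T / (8 w^2 t^2) should approach 1 + 1.5 * C_w^2.
t_long = 1.0e7
a_long, b_long = variance_coefficients(t_long, two_n, w_bar, kappa_bar)
ratio_long = (a_long + b_long * sigma_w ** 2) / (8.0 * w_bar ** 2 * t_long ** 2)
```

In practice $t$, $N$, and $\bar w$ are themselves uncertain, so the round trip above checks only the algebra, not the statistical behavior of the estimator.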
As time increases, $\sigma_w^2$ and the coefficient of variation of $w$, $C_w = \sigma_w/\bar w$, asymptotically satisfy

$$
\frac{\sigma_w^2}{\bar w^2} \approx \frac{\mathrm{Var}_{obs} - 2[(\delta\mu)^2]^2}{3[(\delta\mu)^2]^2}, \qquad
C_w^2 \approx \frac{1}{3}C_{(\delta\mu)^2}^2 - \frac{2}{3}, \tag{5}
$$

where $C_{(\delta\mu)^2}$ is the coefficient of variation of $(\delta\mu)^2$. Expression (5) can also provide an upper estimate of $C_w^2$ if
the asymptote has not been approached (fig. 2).

FIG. 2.—Dynamics of the coefficient of variation (%) of the effective mutation rate $C_w$ (equation 5). The parameters are $w = 0.001$,
$\sigma_w = 0.0005$ (hence, $C_w = 0.5$), and $2N = 5,000$. Mutation is single-step and symmetric.

Discussion
We can use expression (5) to estimate $C_w$ from
data. Table 1 shows the estimates for different sets of
di- and tetranucleotide loci based on genetic distances
between African and non-African human populations.
Two of three sets show substantial values of $C_w$. However, probably not more than 10,000 generations have
passed since the divergence of Africans and non-Africans, and thus the values of $C_w$ in table 1 are overestimated (see fig. 2). Therefore, on average, variation in
mutation rate does not seem to be very extensive, although it is not excluded that some microsatellite loci
show much higher or lower mutation rates than an
average locus. For example, Forster et al. (2000) found
that the average mutation rate at the Y-chromosome loci
could be taken as $0.26 \times 10^{-3}$ if locus DYS392 was
omitted because of its unusual behavior; otherwise, it
was about 10 times as high. However, we should emphasize that our findings concern the effective mutation
rate, i.e., the product of mutation rate and the variance
in the number of repeats due to mutation, while Forster
et al. (2000) considered only the mutation rate.

Table 1
Estimates of the Coefficient of Variation Among Loci of
the Effective Mutation Rate, $C_w$, Based on Genetic
Distances Between African and Non-African Populations
for Different Sets of Data

                                 SOURCE OF DATA
ITEMS                            28 Dinucleotide Loci     89 Dinucleotide Loci     60 Tetranucleotide Loci
                                 (Bowcock et al. 1994)    (Jin et al. 2000)        (Jorde et al. 1997)
Mean value of $(\delta\mu)^2$    4.65                     4.64                     2.44
Variance of $(\delta\mu)^2$      34.2                     82.4                     28.1
Estimate of $C_w$ (%)            213.9                    78.0                     89.9

NOTE.—For each locus, $(\delta\mu)^2$ was computed as the average value of the
distance over all different pairs of populations, one African and the other non-African. $C_w$ was estimated by using the asymptotic formulae (eq. 5).

Two caveats should be noted in connection with
the above remarks on the size of $C_w$. First, our estimates
were made under the assumption of constant population
size, which is surely erroneous for humans in the last
4,000 generations. Second, since the variance of $C_w$ is
likely to be large over this time range and with the number of loci considered here, our confidence that $C_w$ is
indeed small cannot be great.
Earlier, Zhivotovsky and Feldman (1995) pointed
out that hundreds of loci are required to estimate the
genetic distance $(\delta\mu)^2$ with reasonable accuracy, and
with variable mutation rates, the number of loci must be
even greater. Indeed, as follows from equation (5), the
coefficient of variation of genetic distance $(\delta\mu)^2$ averaged over $L$ loci, which can be used as a measure of the
relative accuracy ($R$) of estimation of the genetic distance, is approximated by $[(2 + 3C_w^2)/L]^{1/2}$, or $L = (2 + 3C_w^2)/R^2$. For instance, if the relative accuracy is 10%,
i.e., $R = 0.1$, then 200 loci with identical mutation rates
would be needed, whereas 500 loci are required to estimate genetic distance with the same precision if the
relative variation in mutation rates is 100%, i.e., if $C_w = 1$. As an example, using combined data on 131 di-,
tri-, and tetranucleotide microsatellite loci, Zhivotovsky
(2001, table 1) estimated approximately 14% for the accuracy of genetic distances between African and non-African populations. It should be noted, however, that
in the analyses of Jin et al. (2000), $(\delta\mu)^2$ was not able
to reliably distinguish continental groups in trees made
using the 28 loci of Bowcock et al. (1994), although its
performance was comparable with other distance measures with 64 microsatellite loci. Again, this reinforces
our view that several hundred loci would be needed to
produce satisfactory estimates of $(\delta\mu)^2$ and $C_w$.
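The asymptotic estimator in expression (5) and the locus-count approximation $L = (2 + 3C_w^2)/R^2$ are simple enough to sketch directly. The function names below are hypothetical, and truncating a negative estimate of $C_w^2$ at zero is an assumption of this sketch, not something the derivation prescribes.

```python
import math

def cw_from_moments(mean_dist, var_obs):
    """Asymptotic estimate of C_w (expression 5) from the mean and the
    among-locus variance of (delta mu)^2."""
    c2_dist = var_obs / mean_dist ** 2          # squared CV of (delta mu)^2
    c2_w = c2_dist / 3.0 - 2.0 / 3.0
    # Assumption of this sketch: report 0 when the estimate of C_w^2 is negative.
    return math.sqrt(c2_w) if c2_w > 0.0 else 0.0

def loci_needed(rel_accuracy, cw):
    """Number of loci L giving relative accuracy R: L = (2 + 3 C_w^2) / R^2."""
    return (2.0 + 3.0 * cw ** 2) / rel_accuracy ** 2

cw_example = cw_from_moments(4.64, 82.4)        # a mean and variance listed in table 1
loci_equal_rates = loci_needed(0.1, 0.0)        # identical mutation rates
loci_variable_rates = loci_needed(0.1, 1.0)     # C_w = 100%
```

A mean of 4.64 and a variance of 82.4 give an estimate of roughly 78%, and the two calls to `loci_needed` give the 200 and 500 loci quoted in the text.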
It should be strongly emphasized that expression
(2), as well as expressions (4) and (5), derived from it,
are only valid for reproductively isolated populations of
constant size at mutation-drift equilibrium. Otherwise,
if we consider a process of subdivision of a parental
population into two populations that subsequently
evolve under mutation and genetic drift, the genetic distance $(\delta\mu)^2$ becomes a nonlinear function of time; in
particular, it underestimates the divergence time if the
two populations are growing in size and/or are connected by gene flow (Zhivotovsky 2001). Therefore, our
estimates in table 1 have to be regarded with caution.
Acknowledgments
We are indebted to two anonymous reviewers for
helpful comments and constructive suggestions. This research was supported in part by the National Institutes
of Health (grants GM 28016, GM 28428, and 1 R03
TW005540), the Russian Foundation of Basic Research
(grants 01-04-48441 and 01-07-90197), and the Russian
State Program ‘‘Human Genome’’ (grant 26/01).
APPENDIX
A Case of Constant Mutation Bias
We permit a constant bias in mutation; that is, the
expected average repeat score in progeny may be larger
(or smaller) than the size of a parental allele by a constant value that is independent of the parental allele. As
noted by Di Rienzo et al. (1998) and Zhivotovsky
(2001), if $\eta_m^{(2)}$ and $\eta_m^{(4)}$ are the second and fourth noncentral moments of mutational changes in repeat score,
equations (3)–(8) of Zhivotovsky and Feldman (1995),
as well as the expectation of $(\delta\mu)^2$ (namely, $2wt$), remain valid with the parameters

$$
w = \mu\eta_m^{(2)}, \qquad k = \mu\eta_m^{(4)}, \tag{6}
$$

neglecting terms of order $\mu^2$ and smaller. In the case of
no mutation bias, $\eta_m^{(2)}$ is $\sigma_m^2$, the variance in mutation
changes. In particular, the relationships (eq. 1) that were
obtained by Zhivotovsky and Feldman (1995) under the
assumption of no mutation bias remain valid in the case
of constant bias if the moments are taken with respect
to zero instead of with respect to the mean (Zhivotovsky
2001). Expressions for $\hat V$ and $\widehat{\mathrm{Var}}(V)$ were extended to
the case of constant mutation bias by Kimmel and Chakraborty (1996) and Di Rienzo et al. (1998).
The Within-Locus Variance of $(\delta\mu)^2$
Using the expression for the within-locus variance
Var{D(t)} (Zhivotovsky, Feldman, and Grishechkin
1997, p. 932, right column), which remains valid with
the moments taken with respect to zero, and taking the
limit as the regression coefficient $b \to +0$, we obtain

$$
\mathrm{Var}_W = 8w^2t^2\left[1 + \frac{2}{3}\left(1 + e^{-t/2N}\right)\left(\frac{1 - e^{-t/2N}}{t/2N}\right)^2\right]
+ \frac{\kappa}{3}\left(\frac{3t}{N} + 4e^{-t/2N} + e^{-t/N} - 5\right). \tag{7}
$$

(Note that in Zhivotovsky, Feldman, and Grishechkin
[1997, p. 932, right column], the expressions Var{D(t)}
and (E{D(t)})$^2$ in the above notation are
$E_0\{\mathrm{Var}_r\{D(t)\}\}$ and $E_0\{(E_r\{D(t)\})^2\}$, respectively, and
the symbol k should read km.) Zhivotovsky, Feldman,
and Grishechkin (1997) showed that this variance is approximated as

$$
\mathrm{Var}_W \approx E_0\{2[E_r\{(\delta\mu)^2 \mid P_0\}]^2\} \tag{8}
$$

when $t/2N$ either is small or increases infinitely. (Earlier,
Zhivotovsky and Feldman [1995, corollaries 1 and 2]
had suggested that the variance was twice the squared
distance expected at equilibrium, namely, $2(2wt)^2 = 8w^2t^2$. However, the latter is $2[E_0\{E_r\{(\delta\mu)^2\}\}]^2$, which
is not the right-hand side of eq. 8.)
The Between-Locus Variance of $(\delta\mu)^2$
From equations (4) and (14) of Zhivotovsky and
Feldman (1995), the changes in the expected values of
the distance and the variance are $E_r((\delta\mu)^2(t + 1) \mid P_0) - E_r((\delta\mu)^2(t) \mid P_0) \approx (1/N)E_r(V(t) \mid P_0)$ and $E_r(V(t + 1) \mid P_0) - E_r(V(t) \mid P_0) \approx w - (1/2N)E_r(V(t) \mid P_0)$, neglecting terms of order smaller than $1/N$ and recalling that $w$ is
defined by equation (6). Replacing the differences in the
left-hand sides of these approximations with corresponding differentials and solving the resulting linear differential equations, we have

$$
E_r[(\delta\mu)^2 \mid P_0] \approx 2wt + 2(V_0 - \hat V)(1 - e^{-t/2N}). \tag{9}
$$

As follows from the definition of the between-locus
variance, $\mathrm{Var}_B$ is equal to the expectation $E_0$ of the
square of $2(V_0 - \hat V)(1 - e^{-t/2N})$. Then, using equation
(1), we obtain

$$
\mathrm{Var}_B = \frac{2}{3}\left(1 - e^{-t/2N}\right)^2[\kappa + 8(2Nw)^2]. \tag{10}
$$
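The step from the two recursions to equation (9) can be checked numerically by iterating the difference equations and comparing the result with the closed form. The sketch below assumes the drift term in the variance recursion is $V/(2N)$, which is the form consistent with the factor $e^{-t/2N}$ in equation (9) and with $\hat V \approx 2Nw$; function names and parameter values are illustrative.

```python
import math

def expected_distance_recursion(t, n, w, v0):
    """Iterate D(t+1) = D(t) + V(t)/N and V(t+1) = V(t) + w - V(t)/(2N),
    starting from D(0) = 0 and V(0) = v0, and return D(t)."""
    d, v = 0.0, v0
    for _ in range(t):
        d, v = d + v / n, v + w - v / (2.0 * n)
    return d

def expected_distance_closed_form(t, n, w, v0):
    """Equation (9): 2wt + 2(V0 - V_hat)(1 - e^{-t/2N}), with V_hat = 2Nw."""
    v_hat = 2.0 * n * w
    return 2.0 * w * t + 2.0 * (v0 - v_hat) * (1.0 - math.exp(-t / (2.0 * n)))

# Illustrative values: N = 2500, w = 0.001 (V_hat = 5), ancestral variance V0 = 8.
d_rec = expected_distance_recursion(4000, 2500, 0.001, 8.0)
d_eq9 = expected_distance_closed_form(4000, 2500, 0.001, 8.0)
```

The two values differ only by the error of replacing differences with differentials, which is small when $N$ is large; when $V_0 = \hat V$ both reduce to $2wt$.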
Variation in Mutation Rate
The well-known partitioning of conditional variance (e.g., Rice 1995) can be extended to the case of
three random values: for an arbitrary function $f(x, y, z)$,
its variance, $E_zE_yE_x(f - E_zE_yE_xf)^2$, is

$$
\mathrm{Var}_{xyz} f = E_zE_y\mathrm{Var}_x(f) + E_z\mathrm{Var}_yE_x(f) + \mathrm{Var}_zE_yE_x(f). \tag{11}
$$

Now, consider $E_x$, $E_y$, and $E_z$, respectively, as $E_r$,
$E_0$, and the expectation operator averaging over varying
values of the mutation parameters, $E_m$, and take the distance $(\delta\mu)^2$ as function $f$. The first two terms on the right-hand side of equation (11) represent the expectation $E_m$
of $\mathrm{Var}_W$ in equation (7) and of $\mathrm{Var}_B$ in equation (10), respectively. The third term is $\mathrm{Var}_m(E_0((\delta\mu)^2))$, the variance of the expected distance in equation (9) with respect to mutation parameters. Because $E_0(V_0 - \hat V) = 0$, the $E_0$ expectation of equation (9) is $2wt$, so this third term equals $4t^2\sigma_w^2$. Taking the expectations
and summing in equation (11), we obtain equation (2).
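The three-level decomposition in equation (11) can be verified by brute force on a toy example. In the sketch below, $x$, $y$, and $z$ are independent and uniform on small finite sets, and $f$ is an arbitrary made-up function; nothing here is specific to microsatellites.

```python
from itertools import product
from statistics import mean, pvariance

# Toy supports; each variable is uniform and independent of the others.
xs = [-1.0, 0.0, 1.0]   # plays the role of replicate noise (E_r)
ys = [-1.0, 1.0]        # plays the role of the initial condition (E_0)
zs = [0.5, 1.5]         # plays the role of the mutation parameter (E_m)

def f(x, y, z):
    """Arbitrary illustrative function of the three random values."""
    return (2.0 * z + y + x * z) ** 2

# Left-hand side of equation (11): total variance over all (x, y, z).
total = pvariance([f(x, y, z) for x, y, z in product(xs, ys, zs)])

# Right-hand side: E_z E_y Var_x(f) + E_z Var_y E_x(f) + Var_z E_y E_x(f).
term1 = mean(mean(pvariance([f(x, y, z) for x in xs]) for y in ys) for z in zs)
term2 = mean(pvariance([mean(f(x, y, z) for x in xs) for y in ys]) for z in zs)
term3 = pvariance([mean(mean(f(x, y, z) for x in xs) for y in ys) for z in zs])
```

Population variances (`pvariance`) are used throughout because the sets are enumerated in full rather than sampled.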
Additionally, note that at mutation-drift equilibrium, the within-locus variance, the unnormalized within-locus kurtosis, and the between-locus variance of variances in the case of varying mutation rate become (using
the same notation as in eq. 1)

$$
\hat V = (2N - 1)\bar w, \qquad
\bar K \approx 5\hat V^2\left(1 + \frac{\sigma_w^2}{\bar w^2}\right) + \frac{\bar\kappa}{2}, \qquad
\widehat{\mathrm{Var}}(V) \approx \frac{4}{3}\hat V^2\left(1 + \frac{7}{4}\frac{\sigma_w^2}{\bar w^2}\right) + \frac{\bar\kappa}{6}; \tag{12}
$$

Di Rienzo et al. (1998) obtained the same expression for $\widehat{\mathrm{Var}}(V)$.
LITERATURE CITED
BOWCOCK, A. M., A. RUIZ-LINARES, J. TOMFOHRDE, E. MINCH,
J. R. KIDD, and L. L. CAVALLI-SFORZA. 1994. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368:455–457.
COOPER, G., W. AMOS, R. BELLAMY, M. R. SIDDIQUI, A. FRODSHAM, A. V. S. HILL, and D. C. RUBINSZTEIN. 1999. An
empirical exploration of the $(\delta\mu)^2$ genetic distance for 213
human microsatellite markers. Am. J. Hum. Genet. 65:
1125–1133.
DI RIENZO, A., P. DONNELLY, C. TOOMAJIAN, B. SISK, A. HILL,
M. L. PETZL-ERLER, G. K. HAINES, and D. H. BARCH. 1998.
Heterogeneity of microsatellite mutations within and between loci, and implications for human demographic histories. Genetics 148:1269–1281.
FELDMAN, M. W., J. KUMM, and J. K. PRITCHARD. 1999. Mutation and migration in models of microsatellite evolution.
Pp. 98–115 in D. B. GOLDSTEIN and C. SCHLOTTERER, eds.
Microsatellites: evolution and applications. Oxford University Press, Oxford.
FORSTER, P., A. ROHL, P. L. LUNNERMANN, C. BRINKMANN, T.
ZERJAL, C. TYLER-SMITH, and B. BRINKMANN. 2000. A
short tandem repeat-based phylogeny for the human Y chromosome. Am. J. Hum. Genet. 67:182–196.
GOLDSTEIN, D. B., A. R. LINARES, L. L. CAVALLI-SFORZA, and
M. W. FELDMAN. 1995. Genetic absolute dating based on
microsatellites and the origin of modern humans. Proc. Natl.
Acad. Sci. USA 92:6723–6727.
HUDSON, R. R. 1990. Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7:1–45.
JIN, L., M. L. BASKETT, L. L. CAVALLI-SFORZA, L. A. ZHIVOTOVSKY, M. W. FELDMAN, and N. A. ROSENBERG. 2000.
Microsatellite evolution in modern humans: a comparison
of two data sets from the same populations. Ann. Hum.
Genet. 64:117–134.
JORDE, L. B., A. R. ROGERS, M. BAMSHAD, W. S. WATKINS,
P. KRAKOWIAK, S. SUNG, J. KERE, and H. HARPENDING.
1997. Microsatellite diversity and the demographic history
of modern humans. Proc. Natl. Acad. Sci. USA 94:3100–
3103.
KIMMEL, M., and R. CHAKRABORTY. 1996. Measures of variation at DNA repeat loci under a general stepwise mutation
model. Theor. Popul. Biol. 50:345–367.
MICHALAKIS, Y., and L. A. EXCOFFIER. 1996. Generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics
142:1061–1064.
RICE, J. A. 1995. Mathematical statistics and data analysis. 2nd
edition. Duxbury Press, Belmont, Calif.
ROUSSET, F. 1996. Equilibrium values of measures of population subdivision for stepwise mutation processes. Genetics
142:1357–1362.
SLATKIN, M. 1995. A measure of population subdivision based
on microsatellite allele frequencies. Genetics 139:457–462.
WEIR, B. S. 1996. Genetic data analysis II. Methods for discrete population genetic data. Sinauer, Sunderland, Mass.
ZHIVOTOVSKY, L. A. 2001. Estimating divergence time with
the use of microsatellite genetic distances: impacts of population growth and gene flow. Mol. Biol. Evol. 18:700–709.
ZHIVOTOVSKY, L. A., and M. W. FELDMAN. 1995. Microsatellite variability and genetic distances. Proc. Natl. Acad. Sci.
USA 92:11549–11552.
ZHIVOTOVSKY, L. A., M. W. FELDMAN, and S. A. GRISHECHKIN. 1997. Biased mutations and microsatellite variation.
Mol. Biol. Evol. 14:926–933.
KEITH CRANDALL, reviewing editor
Accepted June 6, 2001