The Number of Segregating Sites in Expanding Human Populations, with Implications for Estimates of Demographic Parameters Giorgio Bertore&? and Montgomery S1atkin-f *Dipartime n t o di Biologia, Universita’ Berkeley di Padova, and TDepartment of Integrative Biology, University of California, The frequency distribution of pairwise differences between sequences of mtDNA has recently been used to estimate the size of human populations before and after a hypothetical episode of rapid population growth and the time at which the population grew. To test the internal consistency of this method, we used three different sets of human mtDNA data and the corresponding demographic parameters estimated from the distribution of pairwise differences to determine by simulation the expected number of segregating sites, S, and its empirical distribution. The results indicate that the observed values of S are significantly lower than expected in two of three cases under the assumption of the infinite-sites model. Further simulations in which mutations were allowed to occur more than once at the same site and in which there was variation in mutation rate among sites show that the expected number of segregating sites can be much lower than under the infinite-sites assumption. Nevertheless, the observed value of S is still significantly different from the value expected under the expansion hypothesis in two of three cases. Introduction Histograms of pairwise differences between nonrecombining DNA sequences have recently been investigated, with the goal of understanding the relationship between their features and the demographic history of populations. The theoretical expectation of the distribution of pair-wise differences has been predicted assuming different hypothetical scenarios of population history and structure (Watterson 1975; Slatkin and Hudson 199 1; Rogers and Harpending 1992; Marjoram and Donnelly 1994). The comparison of an observed histogram with expectations from alternative models may thus provide information about important characteristics of the population sampled. The distribution of pairwise differences of mtDNA sequences sampled from different human populations has been noted to fit poorly the distribution expected in a population of constant size under neutrality (Cann et al. 1987; Di Rienzo and Wilson 199 1; Harpending et al. 1993). In contrast, unimodal distributions are often observed. Slatkin and Hudson ( 199 1) showed that exponential population growth can produce a nearly Poisson distribution of pairwise differences, similar to that obKey words: population growth, finite-sites model, human evolu- tion. Address for correspondence and reprints: Giorgio Bertorelle, Dipartimento di Biologia, Universita’ di Padova, via Trieste 75, 35 12 1 Padova, Italy; E-mail: GIORGIO@CRIBI 1.BIO.UNIPD.IT. Mol. Bid. Evol. 12(5):887-892. 1995. 0 1995 by The University of Chicago. All rights reserved. 0737-4038/95/l 205-00 17$02.00 served in some human populations (Di Rienzo and Wilson 199 1). Rogers and Harpending ( 1992) assumed that a unimodal distribution of pairwise differences is the signature of a previous population expansion event, and they estimated population sizes before and after rapid growth and the time of growth for different human populations. Their method relies on a least-squares fitting of a theoretical transient distribution of pairwise differences, derived by Li (1977) under the assumption of the infinite-sites model of mutation. Rogers’s ( 1995) method-of-moments estimator gives similar estimates of the population size before the expansion and the time since expansion but does not estimate the present population size, which is assumed to approach infinity. In general, two problems may arise in trying to infer the demographic history of a population from genetic data. First, the observed sample of genes represents only one of the possible evolutionary outcomes of the genealogical process. If the variance in the expected distribution of the possible genealogies is high, it may be difficult to draw any inference from a single sample. After a rapid expansion and until the new equilibrium is reached, however, the average pairwise difference is likely to be closer to its expectation than in a population of constant size at equilibrium (Rogers and Harpending 1992), especially if there is a large increase in the population size (Rogers 1995). This fact, which is due to the initial decrease of the coefficient of variation of the average pairwise difference after the expansion, implies that the errors in Rogers and Harpending’s (1992) estimates 887 888 Bertorelle and Slatkin Table 1 The mtDNA Data Sets Used in the Simulations Sample na S” @lb @lb zb No’ NIC Sardinia .. Middle East 147 69 42 195 52 61 2.44 0.76 0 410.69 17.55 3,117.4 7.18 3.99 7.34 813 232 1 136,897 5,35 1 950,427 World ... tC 2,393 1,216 2,238 ’ Sample size (n) and number of segregating sites (S) from Cann et al. (1987) (worldwide sample) and from Di Rienzo and Wilson (1991) (Sardinian and Middle Eastern samples). b Estimates of 0 = 2Nu before (0,) and after (0,) the instantaneous population expansion and the age of the expansion in units of mutational time (T), from Rogers and Hat-pending (1992). ’ Population sizes and date of the expansion (in unit of generations) used in the simulations, calculated from C&, 8,, and z, assuming a mutation rate per nucleotide of 5.0 X lo-’ and 4.1 X 10m6per generation for the world data set (which includes the whole mtDNA sequence) and for the Sardinia and Middle East data sets (which include only the control region), respectively. tend to be low when large and recent expansion events occurred in the history of the population. Second, the application of Rogers and Harpending’s (1992) method for the estimation of demographic parameters relies on the assumption that population expansion, genetic drift, and mutation are the only factors that produced the observed distribution of pairwise differences. Yet, a unimodal distribution of pairwise differences may also be the consequence of selection on a particular haplotype (Di Rienzo and Wilson 199 1) or of high levels of homoplasy present at nucleotide sites with very high mutation rates (Lundstrom et al. 1992). In particular, these alternative explanations for unimodal distributions may have some importance in human populations, where simulation of expansion events have shown that nonunimodal distributions of pairwise differences can easily result if the population is subdivided (Marjoram and Donnelly 1994). The consistency of the Rogers and Harpending (1992) method may be tested by using another measure of genetic variability, the number of segregating sites, which has been shown to have a different dynamic than the average number of nucleotide differences when population size is not constant (Tajima 1989b). To do that, we simulated gene genealogies similar to those for three human mtDNA data sets (Cann et al. 1987; Di Rienzo and Wilson 199 1) using the parameters of the population expansion estimated by Rogers and ‘Harpending ( 1992). We then compared the observed number of segregating sites with the simulation results. Finally, we tested how the mutational model affects the results, in particular assuming more realistic finite-sites models of mutation for the human mtDNA. (1987). The second and the third are the sequences of a segment spanning the first 400 bp of the control region in the Sardinian and Middle Eastern populations studied by Di Rienzo and Wilson ( 199 1). The characteristics of these samples and the genetic variability measures relevant for the present work are presented in table 1, together with the population expansion parameters estimated by Rogers and Harpending (1992) from the observed distribution of the pairwise differences between each sequence and all the others in the sample. These parameters are O,, = 2N0u, O1 = 2Nlu, and z = 2ut, where No and Nl are the number of females in the population before and after the expansion, t is the age in number of generations of this event, and u is the mutation rate per generation per region of DNA. The crucial parameters in the model are 00, O,, and z. The estimates of No, Ni, and t depend on the assumed mutation rate. The three different data sets lead to three different sets of population expansion estimates (see table 1). The estimated variation in size ranges from a 23-fold increase in the Sardinian population to an unknown (@, = 0) but very large increase in the Middle Eastern population. The age of these expansions (in units of 1/2u generations) range from 3.99 in the Sardinian data set to 7.34 in the Middle Eastern data set. Moreover, the three samples are apparently also very different in terms of homogeneity, since the Cann et al. ( 1987) data set includes people from four continents, whereas the other samples come from more restricted areas. Consequently, the three data sets seem to be suitable to test the applicability of Rogers and Harpending’s ( 1992) method of inferring evolutionary parameters. Simulations The mtDNA Data Three human mtDNA data sets were used. The first is the worldwide sample of restriction fragment length polymorphisms (RFLPs) analyzed by Cann et al. Following the general coalescent approach described by Hudson (1990), the computer simulations reconstructed backward the genealogy of a sample of genes. A thousand gene genealogies were simulated for Segregating Sites in Expanding Populations each of the three human mtDNA samples considered, using the values of Oo, 01, and z estimated by Rogers and Harpending (1992). A model of sudden population expansion was assumed: t generations before the present, the population size changed from No to N, (see table 1). The values of No, N,, and t were determined from Oo, O1 z and assuming a per-nucleotide mutation rate of 5.0 X 10m7 and 4.1 X 10m6 per generation for the world data set (which includes the entire mtDNA molecule) and for the Sardinian and Middle Eastern data sets (which include only the control region), respectively. For the purposes of this article, the estimates of the mutation rate and the consequent values assigned to N are not critical, since the total number of mutations on the tree depends only on their product Nu. The number of mutations that occurred along a particular lineage of length t was Poisson distributed with mean ut. To avoid overestimation of the total number of mutations in a tree (which is possible when the population size is very small), multiple coalescence events each generation were allowed. Three models of mutation were used in the simulations. The first was the infinite-sites model (Kimura 1969), where each mutation occurs at a different site. Under this model, the number of segregating sites is the total number of mutations in the tree (Hudson 1990). The second was a finite-sites model with two possible states per sites, thus allowing for the possibility of reverse mutations and multiple changes per sites. The third was a two-rate finite-sites model. Again, the mutations may occur more than once at the same site (with two possible states per site), but in this case some sites were considered slow, that is, with a low mutation rate, and other fast, that is, with a high mutation rate. Wakeley (1993) has shown that a model with gamma-distributed rates provides a better fit to the number of changes at each nucleotide site in three human mtDNA data sets pooled than does a two-rate model, but this was no longer the case when the three samples were analyzed separately. Since the size of the samples used in this article is similar to or smaller than the individual samples used by Wakeley, a simple two-rate model of mutation seems, in this context, a reasonable approximation of a more realistic mutational process. Keeping the average mutation rate equal to the value used in the previous models, the proportion of fast sites was fixed at 0.1 in the simulations of the world data set and at 0.2 in the simulations of the Sardinian and Middle Eastern data sets. These values were chosen because they approach the fraction of noncoding sites in human mtDNA sequence and the estimates of Wakeley (1993) on the control region, respectively. As already mentioned, the world data set refers to the whole mtDNA sequence, whereas the other two 889 data sets consisted of the control region only. In all cases, fast sites mutated 10 times faster than slow sites. Small changes in these values did not affect the simulation results. Under the infinite-sites assumption, the two-rate mutation model and the constant mutation rate model produce the same level of genetic diversity, as long as the average mutation rate is the same. For that reason, we used the two-rate model only under a finite-sites assumption. Results The distributions of the number of segregating sites S obtained in the simulations are shown in figure 1, where the observed values in the three human populations analyzed are indicated by an arrow. Under an infinite-sites assumption, the values of S expected if the WORLD Finite-sites, two rates 140 180 220 260 model 300 340 380 420 S SARDINIA Finite-sites, two rates model 20 28 36 44 52 60 68 S MIDDLE EAST 60 75 90 105 120 135 150 165 180 195 S FIG. 1.-Empirical distributions of the number of segregating sites, S, obtained by simulating the same genealogical process 1,000 times under three different models of mutation. Arrows indicate the observed value of S. 890 Bertorelle and Slatkin populations had experienced an expansion as estimated by Rogers and Harpending (1992) are quite different from the observed values. In particular, using the empirical distribution of S obtained in 1,000 simulations to evaluate the significance of the departures of the observed from the expected values, the population expansion hypothesis significantly overestimates the number of segregating sites observed in the world and in the Middle Eastern data sets. For the Sardinian data set, on the other hand, the expected value of S was smaller than the observed value. In this case, however, about 8% of the simulations produced a value of S equal to or larger than the observed S, thus implying a nonsignificant departure of the number of segregating sites from the prediction of the model. Using the two finite-sites models of mutation, the simulations resulted in a lower number of segregating sites, and this reduction was stronger when the two-rate model of mutation was employed. In this case, the value of S observed in the world data set did not differ significantly from the expected value, but this was not true for the Middle Eastern and the Sardinian data sets (table 2). Finally, the average mean number of pairwise differences between sequences, X, as well as its average sample variance S*(K) did not change much under different mutational models (table 3). Their coefficients of variation for 1,000 simulations are reported in table 3. Discussion The results of this study show that the expected number of segregating sites in expanding populations Table 2 Simulation Results: Mean and Coefficient of Variation of the Number of Segregating Sites, S, Observed in 1,000 Repetitions Sample and Mutational World: IS . FSl . FS2 . Sardinia: IS . .. FSl _.... FS2 .. Middle East: IS . FSl . FS2 . Model s cv E(s) 352.20* 246.24* 182.81 0.06 0.05 0.06 351.72 0.17 40.85 40.8 1 39.3 1 35.70* 0.16 0.16 150.77* 124.67* 98.87* 0.09 0.08 0.07 150.53 NOTE.-Abbreviations as follows: IS, infinite-sites model; FS 1,finite-sites, one-rate model; FS2, finite-sites, two-rate model. Asterisks indicate that the observedvalue of S is smaller than the 2.5 percentile or larger than the 97.5 percentile of the empirical distribution of S. The expected value of S, analytically derived by Tajima ( 19896) for the infinite-sites model, is reported in the last column. Table 3 Simulation Results: Mean and Coefficient of Variation of the Mean Number of Pairwise Differences, IC,and of its Sample Variance, s*(n), Observed in 1,000 Repetitions Sample and Mutational World: IS FSl FS2 Sardinia: IS FSl . . . . FS2 . . Middle East: IS . . . . . FSl ... FS2 Model Tc cv ;;I@) cv Eb0 9.50 9.30 8.95 0.17 0.16 0.15 16.65 15.38 13.32 1.41 0.92 0.60 9.51 4.18 4.19 4.07 0.24 0.25 0.23 34.54 34.47 35.25 0.25 0.25 0.23 4.17 7.34 7.17 6.99 0.08 0.08 0.08 12.03 12.54 12.93 0.22 0.213 0.20 7.33 NOTE-Abbreviations are as follows: IS, infinite-sites model; FSI, finitesites, one-rate model; FS2, finite-sites, two-rate model. The expected value of IC, analytically derived by Li (1977) for the infinite-sites model, is reported in the last column. can be strongly affected by assumptions about the mutational process and the demographic parameters of a population expansion, estimated from the distribution of the pairwise differences observed in three human populations, do not always explain the observed number of segregating sites. The assumption that mutations can occur more than once at the same site reduced the number of segregating sites expected under the infinite-sites model. This effect, more evident when a large number of mutations were forced into few fast sites, is due to the fact that under the infinite-sites model each mutation increases the number of segregating site, whereas, under a finite-sites model, that is not true when a mutation in the genealogy occurs at an already-mutated site. The deviation from the infinite-sites expectation is proportional to the total length of the tree (longer trees contain more mutations, which implies more homoplasy) and therefore to the size of the population and, until a new equilibrium is reached, to the age of the expansion. This explains the small reduction of S observed in the simulation of the Sardinian data set compared with the almost-twofold decrease observed in the simulations of the world and Middle Eastern data sets. In contrast, the expected values of n found in the simulations did not change much when the infinite-sites or the finite-sites models were used. This finding was predicted by the analytical derivation of Rogers (1992), but it is unclear how these small variations, together with the variation _ introduced by the finite-sites model into the sample variance of n, will affect the estimations of the population expansion parameters. Segregating The different influences of the mutational assumptions on different measures of genetic variability have other interesting consequences. For example, the conclusions of the widely used Tajima’s ( 1989a) test of neutrality may be affected by the mutational process. The expected value of Tajima’s D statistic, under neutrality and assuming the infinite-sites model, is 0, and the upper 95% confidence limit, in a sample of 100 genes, is 2.073 (Tajima 1989~). That means that observed values of D larger than 2.073 can be taken as evidence against neutrality. When we simulated the genealogy of 100 genes using a finite-sites, two-rate model of mutation, with 0 = 20, the average mutation rate per generation per nucleotide equal to 2.5 X 10w6, 390 slow sites (u = 2.33 X 10w7), and 10 fast sites (u = 9.1 X 10P5), the empirical distribution of D observed in 1,000 replicates was shifted substantially toward positive values (fig. 2). The mean value of D was 1.58, and 23.4% of the genealogies showed a significant departure from the hypothesis of neutrality. This finding shows that, even if only 2.5% of the sites are subject to frequent recurrent mutation, the confidence limits of Tajima’s statistic may no longer be appropriate. We now turn to the comparison between the observed and the expected values of S. The analysis of the world data set shows that the population expansion hypothesis can explain the observed number of segregating sites when we use the two-rate model of mutation. However, it is not clear how much bias is introduced into the distribution of pairwise differences when sequences sampled from distinct populations (geographically and historically) are pooled, as they are in the world sample. The pooling of data from different populations could create a unimodal distribution because of the central limit theorem (L. Excoffier, personal communication). Thus, the interpretation of demographic parameters estimated from the distribution of pairwise differences in a nonhomogeneous data sets is difficult, even when the observed number of segregating sites and mean pairwise difference are compatible with the values expected under the expansion hypothesis. An additional complication is that we found in our simulations of the world data set that the coefficient of variation of the sample variance of n: was rather high (greater than one under the infinitesites model). Hence, it is difficult to have confidence in estimates of parameters that depend on the variance of n:. In this context, we note that Templeton (1993), using a different approach, could not find evidence of worldwide population expansions in a different mitochondrial DNA data set. The analyses of the Sardinian and the Middle Eastern data sets, which comprise homologous sequences, show a significant difference between the observed and 60 Sites in Expanding Populations 89 1 1 2 25 35 D FIG. 2.-Empirical distribution of Tajima’s (1989~) D statistic, when a finite-sites two-rate mutation model was used and the population size did not vary; D values with empty frequency bars show significant departure from neutrality. 8= 20; n = 100; average mutation rate per nucleotide per generation = 2.5 X lo-$390 slow sites (p = 2.33 X lo-‘); 10 fast sites (u = 9.1 X lo-‘). expected values of S when the two-rate mutation model was assumed. Two different but not mutually exclusive explanations for these results may be envisaged. First, these populations underwent an expansion that led to a unimodal distribution of the pairwise differences, but the errors in the estimation of 00, @, and z due to the stochasticity of the genealogical process are large enough to also include values that predict a number of segregating sites different from the observed value. The errors of these estimates depend on the magnitude and the age of the expansion (Rogers 1995) and increase as the population approaches the new equilibrium. This first explanation is not convincing for the Middle Eastern population, which seems to be far from equilibrium. In fact, the nucleotide diversity expected in this population after the expansion and at equilibrium is almost 500 times larger than that observed. Second, population expansion may not be the only factor that shaped the levels of genetic variability in Sardinian and Middle Eastern populations. Compared with the simulated expectation, the excess of S observed in the Sardinia sample may be due to the existence of deleterious mutations (Tajima 1989a), wh&h do not seem to be uncommon in human mtDNA (see Excoffier 1990). On the contrary, the deficit of S observed in the Middle Eastern data set seems more difficult to explain. A unimodal distribution of pairwise differences of mtDNA can result from any process that produces a lack of correlation between sequences. Rapid population growth and a selective sweep of a new advantageous haplotype lead to a low correlation among sequences because they cause a starlike gene genealogy. High mutation rates at some nucleotide sites also result in low correlation because mutation would then tend to obscure the true genealogical history. It seems likely that both demographic and mutational processes contribute to the 892 Bertorelle and Slatkin distributions of pairwise differences. A unimodal distribution is consistent with rapid population growth but does not prove it occurred or necessarily provide much information about how it occurred. We have shown that the observed numbers of segregating sites are not consistent with the values expected in the model of Rogers and Hat-pending (1992), as well as when more realistic models of mutation are incorporated. Thus, the estimates of Oo, 0,, and z obtained from that analysis should be interpreted with caution. Acknowledgments We thank F. Bonhomme, G. Barbujani, and L. Excoffier for helpful discussions on this topic. This work was partly supported by an Erasmus grant to G.B. and by a National Institutes of Health grant GM 40282 to M.S., and it was developed while visiting the Centre National de la Recherche Scientifique (CNRS) laboratory of F. Bonhomme in Montpellier, France. LITERATURE CITED CANN, R. L., M. STONEKING,and A. C. WILSON. 1987. Mitochondrial DNA and human evolution. Nature 325:3 l36. DI RIENZO, A., and A. C. WILSON. 199 1. Branching pattern in the evolutionary tree for human mitochondrial DNA. Proc. Natl. Acad. Sci. USA 88: 1597- 160 1. EXCOFFIER,L. 1990. Evolution of human mitochondrial DNA: evidence for departure from a pure neutral model of populations at equilibrium. J. Mol. Evol. 30: 125- 139. HARPENDING, H. C., S. T. SHERRY, A. R. ROGERS, and M. STONEKING.1993. The genetic structure of ancient human populations. Cm-r. Anthropol. 34:483-496. HUDSON, R. R. 1990. Gene genealogy and the coalescent process. Oxford Surv. Evol. Biol. 7: l-44. KIMURA, M. 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutation. Genetics 61:893-903. Lr, W. H. 1977. Distribution of nucleotide differences between two randomly chosen cistrons in a finite population. Genetics 85:33 l-337. LUNDSTROM,R., S. TAVAR~, and R. H. WARD. 1992. Modelling evolution of the human mitochondrial genome. Math. Biosci. 112:319-335. MARJORAM,P., and P. DONNELLY. 1994. Pairwise comparisons of mitochondrial DNA sequences in subdivided populations and implications for early human evolution. Genetics 136:673-683. ROGERS,A. 1992. Error introduced by the infinite-sites model. Mol. Biol. Evol. 9: 118 1- 1184. -. 1995. Genetic evidence for a Pleistocene population explosion. Evolution. In press. ROGERS, A. R., and H. HARPENDING.1992. Population growth makes waves in the distribution of pair-wise genetic differences. Mol. Biol. Evol. 9:552-569. SLATKIN,M., and R. R. HUDSON. 199 1. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562. TEMPLETON, A. R. 1993. The “Eve” hypotheses: a genetic critique and reanalysis. Am. Anthropol. 95:5 l-72. TAJIMA, T. 1989a. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595. . 1989b. The effect of change in population size on DNA polymorphism. Genetics 123:597-60 1. WAKELEY,J. 1993. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. Mol. Biol. Evol. 37:6 13-623. WATTERSON,G. A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7:256-276. CHARLES F. AQUADRO, reviewing editor Received August 4, 1994 Accepted April 3, 1995
© Copyright 2025 Paperzz