RESEARCH ARTICLES

Native South American Genetic Structure and Prehistory Inferred from Hierarchical Modeling of mtDNA

Cecil M. Lewis Jr¹ and Jeffrey C. Long
Department of Human Genetics, University of Michigan Medical School, Ann Arbor

Genetic diversity in Native South Americans forms a complex pattern at both the continental and local levels. Compared with the East, the West shows more variation within groups and smaller genetic distances between groups. From this pattern, researchers have proposed that there is more variation in the West and that a larger, more genetically diverse, founding population entered the West than the East. Here, we question this characterization of South American genetic variation and its interpretation. Our concern arises because others have inferred regional variation from the mean variation within local populations without taking into account the variation among local populations within the same region. This failure produces a biased view of the actual variation in the East. In this study, we analyze the mitochondrial DNA sequence between positions 16040 and 16362 of the Cambridge reference sequence. Our sample represents a total of 886 people from 27 indigenous populations from South (22), Central (3), and North America (2). The basic unit of our analyses is nucleotide identity by descent, which is easily modeled and proportional to nucleotide diversity. We use a forward modeling strategy to fit a series of nested models to identity by descent within and between all pairs of local populations. This method provides estimates of identity by descent at different levels of population hierarchy without assuming homogeneity within populations, regions, or continents. Our main discovery is that Eastern South America harbors more genetic variation than has been recognized. We find no evidence that there is increased identity by descent in the East relative to the total for South America.
By contrast, we discovered that populations in the Western region, as a group, harbor more identity by descent than has been previously recognized, despite the fact that average identity by descent within groups is lower. In this light, there is no need to postulate separate founding populations for the East and the West because the variability in the East could serve as a source for the Western gene pools.

Introduction

Genetic diversity in Native South Americans forms a complex pattern at both the continental and local levels. Western populations, such as those located in the Andes, have higher variation within groups and lower genetic distances among groups, whereas Eastern populations, such as those in the Amazon and surrounding regions, have lower variation within groups and higher genetic distances. This pattern is observed in multiple genetic systems including classical autosomal markers (Luiselli et al. 2000), Y chromosome short tandem repeats (Tarazona-Santos et al. 2001), and mitochondrial DNA (Fuselli et al. 2003; Lewis et al. 2007). Tarazona-Santos and colleagues (Luiselli et al. 2000; Tarazona-Santos et al. 2001; Fuselli et al. 2003) explained the pattern using the following historical scenario. Initially, a larger and more genetically diverse population entered the West than the East. Subsequently, the West maintained a larger effective population size than did the East, and Western local groups maintained more gene flow than was maintained by Eastern local groups. These authors raised the possibility that the ancestors of the Western and Eastern populations entered South America separately and at different times (Tarazona-Santos et al. 2001). However, Rothhammer and Moraga (2001) contest this scenario because they doubt that the populations in the East form a cohesive group.

1 Present address: Department of Anthropology, The University of Oklahoma, Norman.
Key words: identity by descent, site frequency spectrum, population structure.
E-mail: [email protected].
Mol. Biol. Evol. 25(3):478–486. 2008 doi:10.1093/molbev/msm225 Advance Access publication January 24, 2008
© The Author 2007. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved.

The uncertainty about South American population origins and structure is not surprising considering that complex patterns of genetic variation are difficult to model and test for statistical significance. There is no simple solution. For instance, Long and Kittles (2003) have shown that assumptions made about the distribution of variation at one level of population structure (e.g., within groups) can significantly bias the results at other levels (e.g., among groups). The most commonly used population structure statistics require restrictive assumptions, such as that heterozygosity is the same for all local populations and for all regions (Weir and Cockerham 1984; Excoffier et al. 1992). Thus, they are likely to produce biased results in situations where heterozygosity varies within local groups and/or between geographic regions. The previous genetic studies on the origins of South Americans have used these biased methods. Moreover, some studies have compared populations and regions in terms of effective population size estimates obtained from genetic data. Such estimates require a mutation/drift steady state in a closed population. A steady state between mutation and genetic drift rarely, if ever, occurs in natural populations. If a steady state were present, it would have erased information about the initial migrations and founding populations. In this paper, we measure genetic variation in different local populations in different continental regions without assuming equal levels of diversity within populations, or regions, and without assuming a mutation–drift steady state.
We test 3 specific hypotheses: 1) the Native populations in the Western and Eastern regions of South America form meaningful groups, 2) there is more genetic variation in the Western region, and 3) the differences between the Western and Eastern regional groups can be related back to the characteristics of distinct founding populations. To test these hypotheses, we analyzed nucleotide sequence data from the first hypervariable segment (HVS1) of the mtDNA control region.

FIG. 1.—Location of the populations examined.

Materials and Methods

DNA Sequences, Subjects, and Geographic Regions

From peer-reviewed literature sources, we obtained mtDNA HVS1 data for 886 individuals. For each person, the datum is the complete sequence between reference nucleotide positions 16040 and 16362 (Anderson et al. 1981; Andrews et al. 1999). In total, 27 populations are represented, 22 South American, 3 Central American, and 2 North American. The North and Central American populations serve as outgroups to assess levels of variation in South American populations. Figure 1 presents the locations for all 27 populations studied. The appendix contains the primary reference for each sample, the sample size, and the assignment to continental region (also see fig. 1). We divide the 22 South American populations into 14 from the Western region and 8 from the Eastern region. Although previous studies restricted the Western group to populations in or near the Andes (Tarazona-Santos et al. 2001), we define the Western region of South America as the geographic area from the Pacific coast to the Eastern foothills of the Andean mountains. This better corresponds to the hypothesized Pacific coastal pathway for founding Western migrations (Dillehay 1999). We define the Eastern region of South America as the Amazon Rainforest to the Atlantic coast, including the surrounding swamps, woodlands, and savannas as well as subtropical forests surrounding the Paraná River.
With regard to South American regional groups, 4 populations require further consideration. The Embera and Wounan settlements are north of the Andes, but their settlements reach the Pacific Coast. Additionally, the range of Ignaciano and Trinitario settlements includes the Eastern foothills of the Andes as well as the savannas west of the Amazon rainforest. We included the Embera, Wounan, Ignaciano, and Trinitario within the Western group; however, we found that our results are robust to alternative groupings, such as including the Embera, Wounan, Ignaciano, and/or Trinitario within the Eastern group.

Unit of Analysis

The basic unit of our analysis is the probability that the nucleotides present at a randomly chosen site between 2 homologous copies of a nonrecombining DNA sequence are identical by descent. We call this measure “nucleotide identity by descent” (nibd). We chose to analyze our data in terms of nibd for 3 reasons. First, this measure does not depend on the length of the DNA sequence because it is scaled to a single nucleotide. Second, founder effects and population bottlenecks, which are likely to occur at first colonization of a region, enhance genetic drift and increase identity by descent. Third, it has clear biological meaning. Under the infinite sites mutation model, the proportion of sites with the same nucleotide between 2 DNA sequences is an unbiased estimator of nibd. However, we use the Tamura–Nei model to account for factors such as finite sites, transition–transversion bias, asymmetric substitution probabilities, and heterogeneous mutation rates among sites. Thus, for each pair of DNA sequences in the total sample, we used equation (16) and equation (17) from Tamura and Nei (1993, p. 518) to estimate the expected number (μ̂) and standard error (σ̂²) of mutations that occurred per site.
Then, we estimated the probability that a random site has not substituted since the common ancestor for a pair of sequences from the negative binomial formula,

$$P(x = 0) = \mathrm{nibd} = \left(1 - \frac{P}{1 + P}\right)^{Lk},$$

where L is the length of the nucleotide sequence, k = σ̂²/[P(1 + P)], and P = μ̂/k. For the purpose of population comparisons, we constructed a matrix RĴ with rows and columns equal to the number of local populations. Each diagonal element RĴii contains the estimated nibd averaged over all pairs of DNA sequences from the ith population, and each off-diagonal element RĴij contains the estimated nibd averaged for all pairs of DNA sequences, the first from population i and the second from population j. The elements of RĴ represent “raw” averages because we do not use a model of population relationships to estimate them.

Hierarchical Models and Estimation

Our analysis consisted of proposing a series of hierarchical models for the 27 populations and fitting these models to our estimated nibd matrix, RĴ. Each hierarchical model consists of strictly nested sets of populations related to each other such that, for any set, the previous entry is a superset and the next entry is a subset. We use tree diagrams and terminology to display our models and explain our results. We assume for biological reasons that 1) the nibd at any node on a tree is higher than the nibd at the node preceding it and 2) the change in nibd along any branch of a tree is independent of changes on all other branches of the tree. We call attention to 3 features of our models. First, a node can split into 2 or more populations. Second, a phylogenetic process is sufficient but not necessary to satisfy the above requirements. Third, a hierarchical pattern of gene flow is also sufficient but not necessary to satisfy the above requirements. We estimate the nibd at internal nodes of a proposed model from the nibd between the pairs of observed populations for which the node makes the closest connection.
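The raw averages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the infinite sites estimator (the proportion of sites with the same nucleotide) rather than the Tamura–Nei correction applied in the analysis, and the function names `identity` and `raw_matrix`, the input layout, and the toy sequences are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def identity(seq_a, seq_b):
    """Proportion of aligned sites that carry the same nucleotide."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def raw_matrix(samples):
    """Average pairwise identity within and between populations.

    samples: dict mapping a population name to a list of aligned sequences
    (hypothetical input layout). Each population needs at least 2 sequences
    so that a within-population pair exists.
    """
    names = sorted(samples)
    size = len(names)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i, size):
            if i == j:
                # Diagonal: all distinct pairs drawn from the same population.
                pairs = combinations(samples[names[i]], 2)
            else:
                # Off-diagonal: one sequence from each of the two populations.
                pairs = product(samples[names[i]], samples[names[j]])
            value = mean(identity(a, b) for a, b in pairs)
            matrix[i][j] = matrix[j][i] = value
    return names, matrix


# Toy data, invented for illustration only.
toy = {"A": ["ACGT", "ACGT", "ACGA"], "B": ["ACGT", "TCGA"]}
names, matrix = raw_matrix(toy)
```

No model of population relationships enters this calculation, which is why the resulting values are raw averages.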
By extending this process, the nibd from the most divergent pairs of local populations provides an estimate of nibd at the root of the tree. Internal nodes often connect more than one pair of observed populations. This situation provides multiple estimates of the nibd at that node. Although these estimates are not fully independent, it is possible to test them for consistency. This is the basis of the Cavalli-Sforza and Piazza (1975) test for treeness. We used the system of equations developed by Anderson (1973) to fit our hierarchical models. This procedure provides approximate maximum likelihood solutions. The system of equations as applied to genetic data is given in more detail by Cavalli-Sforza and Piazza (1975) and Urbanek et al. (1996). The estimation procedure produces a new estimate of the nibd matrix that is contingent on the hierarchical model. We denote this estimate of the nibd matrix by MĴ.

Hypothesis Testing

We use modifications of a likelihood ratio test that was originally proposed by Cavalli-Sforza and Piazza (1975) to assess how well a proposed hierarchical model fits the data relative to a specified alternative. We are interested here in 2 kinds of hypothesis tests.

Global Hypothesis Test

The first kind of hypothesis test involves a global comparison between the raw matrix RĴ and the model-based estimate MĴ. In this test, the hierarchical model serves as the null hypothesis. Rejecting this null hypothesis indicates that the model has the wrong structure for the data, but it does not reveal the nature of the lack of fit. The global hypothesis test is performed using a likelihood ratio statistic, Λ0,Ω (Long and Kittles 2003). For a large number of independently evolving sites, Λ0,Ω is distributed as a χ² random variable with degrees of freedom equal to s(s + 1)/2 − p, where s is the number of populations sampled and p is the number of parameters fitted.
However, mtDNA does not provide independently evolving sites; as a result, the likelihood ratio test rejects the null hypothesis too easily, but Λ0,Ω still provides a rank ordering of the fit of different hierarchical models to the data.

Subhypothesis Test

The second kind of hypothesis test compares a full model to a reduced model. For example, we can evaluate whether creating a new subset of populations improves the fit of an existing model. This is accomplished by creating a new node in the hierarchy. The previous model is now a reduced version of the new model. The reduced model serves as the null hypothesis, and the full model serves as the alternative hypothesis. Rejecting this null hypothesis indicates something specific about the population structure, that is, whether or not a specific group of populations significantly improves the hierarchical model’s fit to the data. We compare the log likelihood maximized under the full model with the log likelihood maximized under the reduced model. We use an F-test that is likely to be conservative,

$$f_{a,b} = \frac{\Lambda_{a,\Omega} / df_a}{\Lambda_{b,\Omega} / df_b} \sim F(df_a, df_b),$$

where Λa,Ω is the likelihood ratio of the full model in comparison to the global alternative and Λb,Ω is the likelihood ratio of the reduced model in comparison to the global alternative. The principle behind this test is as follows. If the test statistics Λa,Ω and Λb,Ω are equally inflated relative to the chi-squared distribution, then the inflation factor will cancel in their ratio, thus providing a valid test according to the F distribution. The test is likely to be conservative because the reduced model always has a worse fit to the data, which makes the chi-squared distribution a worse approximation for Λb,Ω.

Modeling Strategy

We used the step-forward procedure described below to test our hypotheses about the genetic structure of Native South American populations. Steps:
1. 
We test a proposed model against the most general alternative, that is, that the matrix RĴ perfectly represents the population structure. If the test rejects the model, we proceed to step 2; otherwise, we terminate the analysis.
2. The process continues by elaborating the model from step 1. Two sorts of elaborations are possible: 1) we relax constraints on the existing parameters without changing the levels of nesting and/or 2) we add a new level of nesting.
3. We then test the elaborated model against the original model. If there is a significant improvement, we maintain the new level of nesting and return to step 1. If there is not a significant improvement, we return to step 2.
4. The process repeats until further improvements are impossible.

We investigated the following sequence of models in order to identify the genetic structure of Native South American populations and to evaluate the evolutionary scenarios that others have proposed. To begin, we postulated that each of the s = 27 local populations was evolving independently and that local populations possessed the same level of nibd. From there, we evaluated the importance of relaxing the assumption that nibd was the same in all groups. This was performed because the failure to allow for differences in the level of nibd within groups can bias estimates of nibd between groups. We then tested a model that clustered populations into 4 geographical subsets: North America, Central America, Eastern South America, and Western South America. After this, we tested the effect of placing Eastern and Western South American populations into a South American superset. Finally, we added local population structure to the previous results. The purpose of adding local population structure was to test whether our results for continental regions were sensitive to the existence of higher levels of genetic structure.
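The arithmetic behind the two tests in the Methods can be illustrated with a short Python sketch. It assumes that the subhypothesis statistic is formed by dividing each likelihood ratio statistic by its own degrees of freedom before taking the ratio; the function names and the numerical likelihood ratio values are hypothetical and chosen only to make the arithmetic easy to follow. The sketch does not compute a P value, which would come from an F distribution with the two degrees of freedom.

```python
def degrees_of_freedom(s, p):
    """s(s + 1)/2 - p for s sampled populations and p fitted parameters."""
    return s * (s + 1) // 2 - p


def f_ratio(lr_full, df_full, lr_reduced, df_reduced):
    """Ratio of two likelihood ratio statistics, each scaled by its degrees of freedom."""
    return (lr_full / df_full) / (lr_reduced / df_reduced)


# The simplest model in the text fits 2 parameters to 27 populations.
df_simple = degrees_of_freedom(27, 2)  # 27 * 28 / 2 - 2 = 376

# Hypothetical likelihood ratio statistics, not values from the study.
example = f_ratio(lr_full=100.0, df_full=50, lr_reduced=300.0, df_reduced=75)
```

If both statistics are inflated by the same factor, that factor cancels in `f_ratio`, which is the rationale given for the test.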
Results

Site Frequency Spectrum

By comparing the 886 copies of HVS1, we found that the nucleotide varied at 127 of 322 sites. However, these data provide less information for answering our questions about the peopling of South America than one might expect. The site frequency spectrum for the total sample (fig. 2, top) shows that at most sites the minor allele is rare. We observed only one copy of the minor allele at one-third of the sites (42/127) and fewer than 10 copies of the minor allele at three-quarters (95/127) of the sites. In fact, the minor allele at over 90% (115/127) of sites fails to reach the 5% frequency threshold for declaring a polymorphism. The bottom plot of figure 2 presents the site frequency spectrum in a bivariate manner that shows more about the information for determining population relationships. The abscissa gives the number of copies of the minor allele. The ordinate gives the number of populations for which a minor allele of a certain number of copies occurs. For example, we observe only 2 copies of the minor allele at 17 sites, at 5 of these sites we found both copies in the same population, and at 12 of these sites we found one copy in each of 2 populations. Clearly, a minor allele that occurs in only one population provides no information about population relationships. By summing the bottom row of the table, we see that the minor allele at 51 sites falls into this category. Similarly, a minor allele that appears in only 2 populations provides little information about relationships of larger sets of populations. We expect that the sites with the most information about population structure will have minor alleles that are common in one subset of local groups and absent in other subsets of local groups.
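A bivariate tabulation of this kind can be sketched as follows. The sketch is an illustration under stated assumptions: the function name, the input layout, and the toy sequences are hypothetical, and at sites with more than 2 nucleotides every copy that differs from the most common nucleotide is counted as a minor allele copy, which is a simplification.

```python
from collections import Counter


def site_frequency_spectrum(samples):
    """List (site index, minor allele copies, populations carrying it).

    samples: dict mapping a population name to a list of aligned sequences
    (hypothetical input layout). Invariant sites are skipped.
    """
    pooled = [(pop, seq) for pop, seqs in samples.items() for seq in seqs]
    length = len(pooled[0][1])
    spectrum = []
    for site in range(length):
        counts = Counter(seq[site] for _, seq in pooled)
        if len(counts) < 2:
            continue  # no variation at this site
        major = counts.most_common(1)[0][0]
        # Assumption: all non-major copies are pooled as the minor allele.
        minor_copies = sum(n for allele, n in counts.items() if allele != major)
        carriers = {pop for pop, seq in pooled if seq[site] != major}
        spectrum.append((site, minor_copies, len(carriers)))
    return spectrum


# Toy data, invented for illustration only.
toy = {
    "P1": ["AAC", "AAC", "AAT"],
    "P2": ["AAC", "GAC"],
    "P3": ["AAC", "AAT"],
}
spectrum = site_frequency_spectrum(toy)
# Sites whose minor allele is confined to a single population.
private_sites = [site for site, _, n_pops in spectrum if n_pops == 1]
```

In the toy data, site 0 carries a minor allele found in one population only, so it says nothing about relationships among populations, whereas site 2 is shared by 2 populations.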
With regard to our questions about South America, the minor allele at an optimal site that would confirm an Eastern group will be present in 8 populations and absent in all others, whereas an optimal site that would confirm a Western group will be present in 14 populations and absent in all others. The absence of sites with minor allele frequencies in this space is conspicuous, but the situation is even worse. We note that 5 sites have minor alleles that appear in samples from exactly 8 populations, but these 8 populations are not located exclusively in the Eastern region. Only 1 site has a minor allele that appears in samples from exactly 14 populations, but these 14 populations are not located exclusively in the Western region. There are 10 sites with high frequency minor alleles that appear in at least 20 of the samples. Allele frequencies at these 10 sites provide some information about population relationships; however, they are useless as diagnostic markers for groups of the sizes that would resolve our questions about South America because both alleles occur in most populations. Moreover, some of the information provided by these sites is redundant because they are in high linkage disequilibrium with each other.

FIG. 2.—Minor allele frequency spectrum with respect to the number of sites (top) and the number of populations (bottom). Shaded boxes indicate sites that are optimally diagnostic of regional South American populations. The big and small boxes indicate the regions corresponding to the 1% and 5% polymorphism thresholds, respectively.

Nucleotide Identity by Descent

The range for raw average nibd within populations was from 0.9811 for the Cheyenne to 0.9969 for the Ache. When comparing the sequences from a random pair of individuals for all 322 nucleotide positions, these nibd estimates are consistent with at least one substitution at 6.81 and 1.06 sites, respectively. The range for raw average nibd between populations was from 0.9785 for the Cheyenne and Tupe to 0.9914 for the Surui and Gaviao. When comparing the sequences from a random pair of individuals for all 322 nucleotide positions, these nibd estimates are consistent with at least one substitution at 6.93 and 2.78 sites, respectively. Figure 3 displays the elements of RĴ grouped by region. In the bottom panel, the tick marks indicate nibd within populations for the 4 different geographic regions. In the middle panel, the tick marks indicate nibd between pairs of populations within the same region. In the top panel, the tick marks indicate nibd between pairs of populations from different regions. Two salient points emerge from this figure. First, the range of nibd within populations is so great that it makes little sense to pool them into a single within-group component of variation for further analyses of population genetic structure. Second, nibd between populations in the same region is often lower than nibd between populations in different regions. Because of this, we do not expect regional groupings of populations to be a major feature of the population genetic structure. We assumed a Tamura–Nei (1993) substitution model for our estimates of nibd. This model accounts for finite sites, transition–transversion bias, asymmetric substitution probabilities, and heterogeneous mutation rates among sites. However, given the coalescence time frame for Native American mtDNA, few substitutions deviate from an infinite sites model. In fact, in separate analyses, we found the same results by assuming the infinite sites model where the metric was the proportion of sites with the same nucleotide between 2 DNA sequences.

Fitted Models

We fit 17 models to the nibd matrix, RĴ. Five models test our major hypotheses.
We constructed the remaining models to confirm the principal 5 models, for example, by using fewer or greater numbers of within-group nibd values or by adding regional populations one at a time rather than as blocks. We did not find that minor variations on the principal models made large changes in the outcomes. For brevity, we present the results for the 5 models that directly address our major questions. In Model 1 (fig. 4, I), we assumed that nibd within all local populations is equal and that nibd between all pairs of local populations is equal and independent. This assumption is roughly equivalent to Wright’s island model of population structure. This model has 2 parameters, one a pooled within-population nibd component and the other a pooled between-population nibd component. The estimates of these parameters are 0.9864 and 0.9829, respectively. The model fits the data poorly, as we should expect from the raw nibd values displayed in figure 3. Nevertheless, this model provides a useful baseline to begin the forward model selection procedure. Model 2 (fig. 4, II) relaxes the assumption that nibd is equal within local populations by allowing 5 different levels of within-group nibd. Although all 27 populations may, in principle, harbor a unique level of nibd, we were unable to extract 27 estimates from the HVS1 data because there are too few polymorphic sites. The range of the model-based nibd estimates is 0.9824–0.9911, which better approximates the range of the raw estimates. Model 2 provides an improved fit to the raw data matrix over Model 1. The F-test shows that this improvement achieved significance relative to Model 1 (P = 0.0452); therefore, we reject Model 1 in favor of Model 2 and conclude that it is necessary to allow for differences in nibd within populations.

FIG. 3.—Raw nibd. The points in the top plot reflect comparisons of sequences drawn from different local populations within different groups; the points in the middle plot reflect comparisons of sequences drawn from different local populations within the same groups; and the points in the bottom plot reflect comparisons of sequences drawn from the same local populations.

Model 3 (fig. 4, III) clusters populations into 4 geographical groups: North America, Central America, Western South America, and Eastern South America. In this model, we estimated nibd separately for each regional group. The fitted model had a near-zero branch length (0.0002) between the Eastern South American node and the node connecting all populations; thus, the Eastern South American node was eliminated, and the model was refitted. Model 3 fits the raw nibd matrix significantly better than does Model 2 (P = 0.021). The fact that populations in Eastern South America do not form a distinct genetic group is of utmost importance to the theories about the initial peopling of South America because it shows that populations in the Eastern region harbor a great deal of variation. Model 4 (fig. 4, IV) is similar to Model 3, but it nests all South American populations into a continental cluster. Although Model 4 provides a slightly improved fit to the raw nibd matrix, the improvement is not statistically significant (P = 0.484), and we retain Model 3 as a parsimonious representation of RĴ. Model 5 adds local population structure to Model 3 in order to determine whether the existence of higher level structure affects our conclusions about broad regional groups (fig. 5). Model 5 provides a significant improvement over Model 3 (P = 0.005). However, the local pattern does not change our interpretation of the regional population groups. Our forward modeling strategy assesses the distribution of nibd at lower levels of population structure prior to higher levels.
This raises an important methodological issue because assumptions made about the distribution of nibd at one level of population structure can significantly bias the results at other levels (Long and Kittles 2003). When assessing nibd at lower levels, the forward strategy assumes that changes in nibd within and among groups at higher levels occur independently. To some extent, our strategy is robust to this assumption because, after our first model, we allowed nibd within local populations to vary. Fortunately, the higher level structure that we discovered in Model 5 did not change our inferences about regional groups from Model 3. Specifically, Eastern local populations still connected at the basal position, and all Western local populations still emerged together as a distinct group.

Discussion

Our principal discovery is that Eastern South America harbors more variation than has been recognized heretofore (Tarazona-Santos et al. 2001; Fuselli et al. 2003; Lewis et al. 2007). Although generalized hierarchical modeling led us to this result, 2 lines of evidence supporting it are visible in the raw nibd estimates (fig. 3). First, there are some estimates of low nibd within the Eastern populations. In fact, nibd in the Guahibo ranks third lowest in comparison to nibd for all 27 populations studied. Second, nibd estimates between some pairs of Eastern populations are on the order of nibd estimates for interregional, and intercontinental, comparisons. Because of these low nibd estimates between Eastern populations, we were unable to reject the hypothesis that Eastern South America harbors as much variation as the total for South America and even the hypothesis that Eastern South America harbors as much variation as the total for all the Americas. Thus, we cannot consider the Eastern populations a meaningful group because they harbor as much diversity collectively as our total sample including all regions.
In contrast to Eastern South America, we discovered that Western South America harbors less genetic variation than has been previously recognized (Tarazona-Santos et al. 2001; Fuselli et al. 2003). Again, the raw nibd estimates confirm the result from generalized hierarchical modeling (fig. 3). Although nibd is low within most Western populations, the range substantially overlaps the range of nibd within Eastern populations. Moreover, nibd between pairs of Western populations is entirely within the range of nibd between pairs of Eastern populations. This fact causes the Western populations to emerge together as a distinct group. However, our results now fail to confirm the previously held idea that there is more variation in the Western region than in the Eastern region.

FIG. 4.—Models 1–4 fitted to RĴ. The units on the scale bar are nibd. The range of nibd estimates for the raw data is marked by the boxes on the scale bar. The range of the nibd estimates predicted by the model is marked by the circles on the scale bar. North American populations are coded in black, Central American populations are coded in yellow, Eastern South American populations are coded in blue, and Western South American populations are coded in red.

FIG. 5.—Model 5 fitted to the data. The units on the scale bar are nibd. The range of nibd estimates within the observed data is marked by the boxes on the scale bar. The range of nibd estimates predicted by the model is marked by the circles on the scale bar.

The pattern of population relationships deciphered above imposes an important limitation on testing the theories about the peopling of South America. From these HVS1 data, we find that the Eastern populations connect at the most basal node of the hierarchy; consequently, we cannot distinguish the characteristics of the female founders of South America from the female founders of the Americas as a whole.
Therefore, there is no need to postulate separate source populations for the Eastern and Western regions of South America because the variability in the East could serve as a source for the Western gene pools. The questions arise, could additional population samples or sequence from a longer stretch of the mitochondrial genome change our findings with respect to relative levels of nibd? To address these questions, we performed a phylogeographic analysis on the 886 HVS1 sequences analyzed. Specifically, we calculated a Neighbor-Joining tree, which we rooted using an HVS1 sequence from an African for the outgroup. The Neighbor-Joining tree presents clusters of haplotypes that correspond to the universally recognized Native American mitochondrial haplogroups, A–D. In our total sample, copies of HVS1 from 879/886 individuals fall into these clusters. The remaining seven copies of HVS1, all from the Cheyenne population, show characteristics of haplogroup X (Brown et al. 1998) including the T allele at both nucleotide positions 16223 and 16278 and the absence of markers at other sites that would place them in one of the 4 main haplogroups. However, definitive assignment of these copies of the sequence would require data from outside of the HVS1. For the HVS1 data examined here, haplotypes within major haplogroups have approximately 2–3 nt differences, whereas haplotypes between major haplogroups have approximately 6–10 nt differences. Previous studies estimated that haplotypes within these major haplogroups coalesce to a common ancestor approximately 25–40 thousand years ago (Bonatto and Salzano 1997b; Silva et al. 2002), whereas haplotypes between these major haplogroups coalesce to a common ancestor approximately 100 thousand years ago (Bonatto and Salzano 1997b; Silva et al. 2002; Gonder et al. 2007).
This places an important limitation on the resolving power of mtDNA because the mutations responsible for many of the nucleotide differences between copies of HVS1 occurred prior to the peopling of the Americas. Substitutions between haplotypes in different haplogroups make a greater contribution to average nibd than do substitutions between haplotypes within haplogroups. Because of this, the pattern of nibd is robust to the addition of population samples: our current population samples have already revealed that haplogroups A–D are common in all regions. In fact, sampling additional populations is unlikely to change the observed pattern unless the added samples present a new common haplogroup. However, the possibility of such a discovery is remote. To date, Native American studies have provided an extensive survey of haplogroups, and nearly all Native Americans possess one of the 4 haplogroups (Bonatto and Salzano 1997b; Mulligan et al. 2004; Schurr and Sherry 2004; Tamm et al. 2007).

We also expect that the pattern of nibd is robust to sequencing a larger portion of the mitochondrial genome. Extending the sequence length can improve the resolution of the coalescent history of the haplotypes within a haplogroup and may reveal additional sublineages. However, to change the pattern of nibd, extending the sequence length would need to overturn our current knowledge of the coalescent history of these haplogroups. This is unlikely, considering that mitochondrial genome studies identify the same major haplogroups as seen in analyses of HVS1 (Maca-Meyer et al. 2001; Bandelt et al. 2003; Tamm et al. 2007). Moreover, even with such a discovery, the ubiquitous distribution of the haplogroups buffers the pattern of nibd. We expect that such a discovery would affect nibd in all regions similarly.

In summary, we developed a series of models to test hypotheses about the genetic structure and initial peopling of South America using mtDNA sequence data.
Our models do not require us to pool variation within populations or to pool variation among populations from different regions, as is the case with more usual methods. The flexibility in our approach led us to the novel discovery that populations in Eastern South America harbor a great deal of variation. The level of variation is so great that we cannot view the Eastern South American populations as a single cohesive group. Because of this finding, a single human migration into South America is the most parsimonious interpretation of the mtDNA HVS1 data. Analyses of the site frequency spectrum indicate that questions about broad regional populations in South America are beyond the resolving power of mtDNA HVS1 data. Moreover, phylogeographic analysis of the mtDNA haplogroup lineages tagged by these HVS1 data indicates that even mitochondrial whole-genome analysis may not be able to resolve these questions. The best chance for obtaining a higher resolution of population history will be to examine many independently inherited loci with similar mutation mechanisms.

Acknowledgment

C.M.L. was supported by NIH T32-HG-00040.
Appendix: Populations, Sample Sizes, and Primary Citations

Population    n    Region                  Citation
Ache          63   Eastern South America   (Schmitt et al. 2004)
Ancash        35   Western South America   (Lewis et al. 2007)
Arequipa      22   Western South America   (Fuselli et al. 2003)
Bella Coola   41   North America           (Ward et al. 1993)
Cheyenne      39   North America           (Kittles et al. 1999)
Embera        44   Western South America   (Kolman and Bermingham 1997)
Gaviao        28   Eastern South America   (Ward et al. 1996)
Guahibo       59   Eastern South America   (Vona et al. 2005)
Huentar       27   Central America         (Santos et al. 1994)
Ignaciano     15   Western South America   (Bert et al. 2004)
Kuna          63   Central America         (Bert et al. 2004)
Mapuche       34   Western South America   (Moraga et al. 2000)
Movima        12   Eastern South America   (Bert et al. 2004)
Ngoebe        46   Central America         (Kolman et al. 1995)
Pehuenche     24   Western South America   (Moraga et al. 2000)
Puno          34   Western South America   (Lewis et al. 2007)
Surui         24   Eastern South America   (Bonatto and Salzano 1997a)
Tayacaja      65   Western South America   (Fuselli et al. 2003)
Trinitario    12   Western South America   (Bert et al. 2004)
Tupe          16   Western South America   (Lewis et al. 2007)
Waiwai        26   Eastern South America   (Bonatto and Salzano 1997a)
Wounan        31   Western South America   (Kolman and Bermingham 1997)
Xavante       25   Eastern South America   (Ward et al. 1996)
Yaghan        15   Western South America   (Moraga et al. 2000)
Yungay        38   Western South America   (Lewis et al. 2007)
Yuracare      15   Western South America   (Bert et al. 2004)
Zoro          28   Eastern South America   (Ward et al. 1996)

Literature Cited

Anderson S, Bankier AT, Barrell BG, et al. (14 co-authors). 1981. Sequence and organization of the human mitochondrial genome. Nature. 290:457–465.
Anderson TW. 1973. Asymptotically efficient estimation of covariance matrices with linear structure. Ann Stat. 1:79–95.
Andrews R, Kubacka I, Chinnery P, Lightowlers R, Turnbull D, Howell N. 1999. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet. 23:147.
Bandelt HJ, Herrnstadt C, Yao YG, et al. (13 co-authors). 2003. Identification of Native American founder mtDNAs through the analysis of complete mtDNA sequences: some caveats. Ann Hum Genet. 67:512–524.
Bert F, Corella A, Gene M, Perez-Perez A, Turbon D. 2004. Mitochondrial DNA diversity in the Llanos de Moxos: Moxo, Movima and Yuracare Amerindian populations from Bolivia lowlands. Ann Hum Biol. 31:9–28.
Bonatto SL, Salzano FM. 1997a. A single and early migration for the peopling of the Americas supported by mitochondrial DNA sequence data. Proc Natl Acad Sci USA. 94:1866–1871.
Bonatto SL, Salzano FM. 1997b. Diversity and age of the four major mtDNA haplogroups, and their implications for the peopling of the New World. Am J Hum Genet. 61:1413–1423.
Brown MD, Hosseini SH, Torroni A, Bandelt HJ, Allen JC, Schurr TG, Scozzari R, Cruciani F, Wallace DC. 1998. mtDNA haplogroup X: an ancient link between Europe/Western Asia and North America? Am J Hum Genet. 63:1852–1861.
Cavalli-Sforza LL, Piazza A. 1975. Analysis of evolution: evolutionary rates, independence, and treeness. Theor Popul Biol. 8:127–165.
Dillehay TD. 1999. The late Pleistocene cultures of South America. Evol Anthropol. 7:206–216.
Excoffier L, Smouse PE, Quattro JM. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics. 131:479–491.
Fuselli S, Tarazona-Santos E, Dupanloup I, Soto A, Luiselli D, Pettener D. 2003. Mitochondrial DNA diversity in South America and the genetic history of Andean Highlanders. Mol Biol Evol. 20:1682–1691.
Gonder MK, Mortensen HM, Reed FA, de Sousa A, Tishkoff SA. 2007. Whole-mtDNA genome sequence analysis of ancient African lineages. Mol Biol Evol. 24:757–768.
Kittles RA, Bergen AW, Urbanek M, Virkkunen M, Linnoila M, Goldman D, Long JC. 1999. Autosomal, mitochondrial, and Y chromosome DNA variation in Finland: evidence for a male-specific bottleneck. Am J Phys Anthropol. 108:381–399.
Kolman CJ, Bermingham E. 1997. Mitochondrial and nuclear DNA diversity in the Chocó and Chibcha Amerinds of Panamá. Genetics. 147:1289–1302.
Kolman CJ, Bermingham E, Cooke R, Ward RH, Arias TD, Guionneau-Sinclair F. 1995. Reduced mtDNA diversity in the Ngöbé Amerinds of Panama. Genetics. 140:275–283.
Lewis CM Jr, Lizárraga B, Tito RY, et al. (11 co-authors). Forthcoming. Mitochondrial DNA and the Peopling of South America. Hum Biol.
Long JC, Kittles RA. 2003. Human genetic diversity and the nonexistence of biological races. Hum Biol. 75:449–471.
Luiselli D, Simoni L, Tarazona-Santos E, Pastor S, Pettener D. 2000. Genetic Structure of Quechua-Speakers of the Central Andes and Geographic Patterns of Gene Frequencies in South Amerindian Populations. Am J Phys Anthropol. 113:5–17.
Maca-Meyer N, Gonzalez AM, Larruga JM, Flores C, Cabrera VM. 2001. Major genomic mitochondrial lineages delineate early human expansions. BMC Genet. 2:13.
Moraga ML, Rocco P, Miquel JF, Nervi F, Llop E, Chakraborty R, Rothhammer F, Carvallo P. 2000. Mitochondrial DNA polymorphisms in Chilean aboriginal populations: implications for the peopling of the southern cone of the continent. Am J Phys Anthropol. 113:19–29.
Mulligan CJ, Hunley K, Cole S, Long JC. 2004. Population genetics, history, and health patterns in Native Americans. Annu Rev Genomics Hum Genet. 5:295–315.
Rothhammer F, Moraga M. 2001. Patterns of Y-chromosome variation in South Amerindians. Am J Hum Genet. 69:904–906.
Santos M, Ward RH, Barrantes R. 1994. mtDNA variation in the Chibcha Amerindian Huetar from Costa Rica. Hum Biol. 66:963–977.
Schmitt R, Bonatto SL, Freitas LB, Muschner VC, Hill K, Hurtado AM, Salzano FM. 2004. Extremely limited mitochondrial DNA variability among the Ache Natives of Paraguay. Ann Hum Biol. 31:87–94.
Schurr TG, Sherry ST. 2004. Mitochondrial DNA and Y chromosome diversity and the peopling of the Americas: evolutionary and demographic evidence. Am J Hum Biol. 16:420–439.
Silva WA Jr, Bonatto SL, Holanda AJ, et al. (14 co-authors). 2002. Mitochondrial genome diversity of Native Americans supports a single early entry of founder populations into America. Am J Hum Genet. 71:187–192.
Tamm E, Kivisild T, Reidla M, et al. (21 co-authors). 2007. Beringian standstill and spread of Native American founders. PLoS ONE. 2:e829.
Tamura K, Nei M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 10:512–526.
Tarazona-Santos E, Carvalho-Silva DR, Pettener D, Luiselli D, De Stefano GF, Martinez Labarga C, Rickards O, Tyler-Smith C, Pena SDJ, Santos FR. 2001. Genetic differentiation in South Amerindians is related to environmental and cultural diversity: evidence from the Y chromosome. Am J Hum Genet. 68:1485–1496.
Urbanek M, Goldman D, Long JC. 1996. The apportionment of dinucleotide repeat diversity in Native Americans and Europeans: a new approach to measuring gene identity reveals asymmetric patterns of divergence. Mol Biol Evol. 13:943–953.
Vona G, Falchi A, Moral P, Calo CM, Varesi L. 2005. Mitochondrial sequence variation in the Guahibo Amerindian population from Venezuela. Am J Phys Anthropol. 127:361–369.
Ward RH, Redd A, Valencia D, Frazier B, Pääbo S. 1993. Genetic and linguistic differentiation in the Americas. Proc Natl Acad Sci USA. 90:10663–10667.
Ward RH, Salzano FM, Bonatto SL, Hutz MH, Coimbra CEA Jr, Santos RV. 1996. Mitochondrial DNA polymorphism in three Brazilian Indian tribes. Am J Hum Biol. 8:317–323.
Weir BS, Cockerham CC. 1984. Estimating F-statistics for the analysis of population structure. Evolution. 38:1358–1370.

Connie Mulligan, Associate Editor
Accepted October 4, 2007