RESEARCH ARTICLES
Native South American Genetic Structure and Prehistory Inferred from
Hierarchical Modeling of mtDNA
Cecil M. Lewis Jr.¹ and Jeffrey C. Long
Department of Human Genetics, University of Michigan Medical School, Ann Arbor
Genetic diversity in Native South Americans forms a complex pattern at both the continental and local levels.
Compared with the East, the West shows more variation within groups and smaller genetic distances between groups.
From this pattern, researchers have proposed that there is more variation in the West and that a larger, more genetically
diverse, founding population entered the West than the East. Here, we question this characterization of South American
genetic variation and its interpretation. Our concern arises because others have inferred regional variation from the mean
variation within local populations without taking into account the variation among local populations within the same
region. This failure produces a biased view of the actual variation in the East.
In this study, we analyze the mitochondrial DNA sequence between positions 16040 and 16362 of the Cambridge
reference sequence. Our sample represents a total of 886 people from 27 indigenous populations from South (22), Central
(3), and North America (2). The basic unit of our analyses is nucleotide identity by descent, which is easily modeled and
proportional to nucleotide diversity. We use a forward modeling strategy to fit a series of nested models to identity by
descent within and between all pairs of local populations. This method provides estimates of identity by descent at
different levels of population hierarchy without assuming homogeneity within populations, regions, or continents.
Our main discovery is that Eastern South America harbors more genetic variation than has been recognized. We find no
evidence that there is increased identity by descent in the East relative to the total for South America. By contrast, we discovered
that populations in the Western region, as a group, harbor more identity by descent than has been previously recognized, despite
the fact that average identity by descent within groups is lower. In this light, there is no need to postulate separate founding
populations for the East and the West because the variability in the East could serve as a source for the Western gene pools.
Introduction
Genetic diversity in Native South Americans forms
a complex pattern at both the continental and local levels.
Western populations, such as those located in the Andes,
have higher variation within groups and lower genetic distances among groups, whereas Eastern populations, such as
those in the Amazon and surrounding regions, have lower
variation within groups and higher genetic distances. This
pattern is observed in multiple genetic systems including
classical autosomal markers (Luiselli et al. 2000), Y chromosome short tandem repeats (Tarazona-Santos et al. 2001),
and mitochondrial DNA (Fuselli et al. 2003; Lewis et al.
2007). Tarazona-Santos and colleagues (Luiselli et al.
2000; Tarazona-Santos et al. 2001; Fuselli et al. 2003) explained the pattern using the following historical scenario.
Initially, a larger and more genetically diverse population
entered the West than the East. Subsequently, the West maintained a larger effective population size than did the East, and
Western local groups maintained more gene flow than was
maintained by Eastern local groups. These authors raised the
possibility that the ancestors of the Western and Eastern populations entered South America separately and at different
times (Tarazona-Santos et al. 2001). However, Rothhammer
and Moraga (2001) contest this scenario because they doubt
that the populations in the East form a cohesive group.
¹ Present address: Department of Anthropology, The University of Oklahoma, Norman.
Key words: identity by descent, site frequency spectrum, population
structure.
E-mail: [email protected].
Mol. Biol. Evol. 25(3):478–486. 2008
doi:10.1093/molbev/msm225
Advance Access publication January 24, 2008
© The Author 2007. Published by Oxford University Press on behalf of
the Society for Molecular Biology and Evolution. All rights reserved.
For permissions, please e-mail: [email protected]
The uncertainty about South American population origins and structure is not surprising considering that complex patterns of genetic variation are difficult to model
and test for statistical significance. There is no simple solution. For instance, Long and Kittles (2003) have shown that
assumptions made about the distribution of variation at one
level of population structure (e.g., within groups) can significantly bias the results at other levels (e.g., among groups).
The most commonly used population structure statistics require restrictive assumptions, such as that heterozygosity is the
same for all local populations and for all regions (Weir
and Cockerham 1984; Excoffier et al. 1992). Thus, they
are likely to produce biased results in situations where heterozygosity varies within local groups and/or between geographic regions. The previous genetic studies on the origins
of South Americans have used these biased methods. Moreover, some studies have compared populations and regions in
terms of effective population size estimates obtained from
genetic data. Such estimates require a mutation/drift steady
state in a closed population. A steady state between mutation
and genetic drift rarely, if ever, occurs in natural populations.
If a steady state were present, it would have erased information about the initial migrations and founding populations.
In this paper, we measure genetic variation in different
local populations in different continental regions without
assuming equal levels of diversity within populations, or
regions, and without assuming a mutation–drift steady state.
We test 3 specific hypotheses: 1) the Native populations in
the Western and Eastern regions of South America form
meaningful groups, 2) there is more genetic variation in the
Western region, and 3) the differences between the Western
and Eastern regional groups can be related back to the characteristics of distinct founding populations. To test these hypotheses, we analyzed nucleotide sequence data from the first
hypervariable segment (HVS1) of the mtDNA control region.
FIG. 1.—Location of the populations examined.
Materials and Methods
DNA Sequences, Subjects, and Geographic Regions
From peer-reviewed literature sources, we obtained
mtDNA HVS1 data for 886 individuals. For each person,
the datum is the complete sequence between reference nucleotide positions 16040 and 16362 (Anderson et al. 1981;
Andrews et al. 1999). In total, 27 populations are represented, 22 South American, 3 Central American, and 2
North American. The North and Central American populations serve as outgroups to assess levels of variation in
South American populations. Figure 1 presents the locations for all 27 populations studied. The appendix contains
the primary reference for each sample, the sample size, and
the assignment to continental region (also see fig. 1).
We divide the 22 South American populations into 14
from the Western region and 8 from the Eastern region. Although previous studies restricted the Western group to
populations in or near the Andes (Tarazona-Santos et al.
2001), we define the Western region of South America
as the geographic area from the Pacific coast to the Eastern
foothills of the Andean mountains. This better corresponds
to the hypothesized Pacific coastal pathway for founding
Western migrations (Dillehay 1999). We define the Eastern
region of South America as the Amazon Rainforest to the
Atlantic coast, including the surrounding swamps, woodlands, and savannas as well as subtropical forests surrounding the Paraná River.
With regard to South American regional groups, 4
populations require further consideration. The Embera
and Wounan settlements are north of the Andes, but their
settlements reach the Pacific Coast. Additionally, the range
of Ignaciano and Trinitario settlements includes the Eastern
foothills of the Andes as well as the savannas west of the
Amazon rainforest. We included the Embera, Wounan,
Ignaciano, and Trinitario within the Western group; however, we found that our results are robust to alternative
groupings, such as including the Embera, Wounan, Ignaciano, and/or Trinitario within the Eastern group.
Unit of Analysis
The basic unit of our analysis is the probability that the
nucleotides present at a randomly chosen site between 2
homologous copies of a nonrecombining DNA sequence
are identical by descent. We call this measure “nucleotide
identity by descent” (nibd). We chose to analyze our data in
terms of nibd for 3 reasons. First, this measure does not depend on the length of the DNA sequence because it is scaled
to a single nucleotide. Second, founder effects and population bottlenecks, which are likely to occur at first colonization of a region, enhance genetic drift and increase identity
by descent. Third, it has clear biological meaning.
Under the infinite sites mutation model, the proportion
of sites with the same nucleotide between 2 DNA sequences
is an unbiased estimator of nibd. However, we use the
Tamura–Nei model to account for factors such as finite
sites, transition–transversion bias, asymmetric substitution
probabilities, and heterogeneous mutation rates among
sites. Thus, for each pair of DNA sequences in the total
sample, we used equation (16) and equation (17) from
Tamura and Nei (1993, p. 518) to estimate the expected
number ($\hat{\mu}$) and variance ($\hat{\sigma}^2$) of mutations that occurred per site. Then, we estimated the probability that a random site has not substituted since the common ancestor
for a pair of sequences from the negative binomial formula,
$$P(x = 0) = \mathrm{nibd} = \left(1 - \frac{P}{1 + P}\right)^{Lk},$$
where $L$ is the length of the nucleotide sequence,
$k = \hat{\sigma}^2 / [P(1 + P)]$, and $P = \hat{\mu}/k$.
For the purpose of population comparisons, we constructed a matrix RĴ with rows and columns equal to the
number of local populations. Each diagonal element RĴ_ii
contains the estimated nibd averaged over all pairs of
DNA sequences from the ith population, and each off-diagonal element RĴ_ij contains the estimated nibd averaged
over all pairs of DNA sequences, the first from population i
and the second from population j. The elements of RĴ represent “raw” averages because we do not use a model of
population relationships to estimate them.
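The construction of the raw matrix can be sketched in a few lines of Python. This is a minimal illustration only: it uses the infinite sites estimator (the proportion of sites with the same nucleotide) rather than the Tamura–Nei correction, and the function names, population labels, and toy sequences are hypothetical. It also assumes every population contributes at least 2 sequences.

```python
from itertools import combinations, product

def identity(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences carry the same nucleotide."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def raw_nibd_matrix(samples):
    """Build the matrix of raw averages.

    samples maps a population name to a list of aligned sequences.
    Diagonal cells average over all pairs within a population;
    off-diagonal cells average over all pairs with one sequence
    from each population.
    """
    names = sorted(samples)
    n = len(names)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if i == j:
                pairs = list(combinations(samples[names[i]], 2))
            else:
                pairs = list(product(samples[names[i]], samples[names[j]]))
            value = sum(identity(a, b) for a, b in pairs) / len(pairs)
            matrix[i][j] = matrix[j][i] = value
    return names, matrix

# Hypothetical toy data: 2 populations, 8 aligned sites.
toy = {
    "PopA": ["ACGTACGT", "ACGTACGT", "ACGTACGA"],
    "PopB": ["ACGTTCGT", "ACGTTCGT"],
}
names, J = raw_nibd_matrix(toy)
for name, row in zip(names, J):
    print(name, [round(x, 4) for x in row])
```

No model of population relationships enters this calculation, which is why the resulting values are raw averages.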
Hierarchical Models and Estimation
Our analysis consisted of proposing a series of hierarchical models for the 27 populations and fitting these models
to our estimated nibd matrix, RĴ. Each hierarchical model
consists of strictly nested sets of populations related to each
other such that, for any set, the previous entry is a superset and
the next entry is a subset. We use tree diagrams and terminology to display our models and explain our results. We assume for biological reasons that 1) the nibd at any node on
a tree is higher than the nibd at the node preceding it and 2) the
change in nibd along any branch of a tree is independent of
changes on all other branches of the tree. We call attention to
3 features of our models. First, a node can split into 2 or more
populations. Second, a phylogenetic process is sufficient but
not necessary to satisfy the above requirements. Third, a hierarchical pattern of gene flow is also sufficient but not necessary to satisfy the above requirements.
We estimate the nibd at internal nodes of a proposed
model from the nibd between the pairs of observed populations for which the node makes the closest connection. By
extending this process, the nibd from the most divergent
pairs of local populations provides an estimate of nibd at
the root of the tree. Internal nodes often connect more than
one pair of observed populations. This situation provides
multiple estimates of the nibd at that node. Although these
estimates are not fully independent, it is possible to test
them for consistency. This is the basis of the Cavalli-Sforza
and Piazza (1975) test for treeness.
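The idea of estimating nibd at an internal node from the pairs of populations that first connect there can be sketched as follows. This is a simple unweighted average for illustration, not the approximate maximum likelihood fit used in the analysis. The nested-tuple tree, the population labels, and the nibd values are all hypothetical.

```python
from itertools import combinations

def leaves(node):
    """Return the population names under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]

def node_nibd(node, raw):
    """Average raw nibd over the population pairs whose closest connection is `node`.

    `node` must be an internal node (a tuple of children). A pair connects
    at this node when its two populations sit under different children.
    Returns the average together with the individual estimates, so that
    their consistency can be inspected.
    """
    values = []
    for child_a, child_b in combinations(node, 2):
        for a in leaves(child_a):
            for b in leaves(child_b):
                values.append(raw[frozenset((a, b))])
    return sum(values) / len(values), values

# Hypothetical hierarchy: a Western group plus two populations joined at the root.
western = ("W1", "W2")
root = (western, "E1", "E2")

# Hypothetical raw nibd between pairs of populations.
raw = {
    frozenset(("W1", "W2")): 0.990,
    frozenset(("W1", "E1")): 0.984,
    frozenset(("W2", "E1")): 0.985,
    frozenset(("W1", "E2")): 0.983,
    frozenset(("W2", "E2")): 0.982,
    frozenset(("E1", "E2")): 0.981,
}

root_estimate, root_values = node_nibd(root, raw)
west_estimate, west_values = node_nibd(western, raw)
print(round(root_estimate, 4), round(west_estimate, 4))
```

The root receives 5 separate estimates in this toy case, and their spread gives a rough sense of how well a tree-like structure describes the data.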
We used the system of equations developed by
Anderson (1973) to fit our hierarchical models. This procedure
provides approximate maximum likelihood solutions. The
system of equations as applied to genetic data is given in more
detail by Cavalli-Sforza and Piazza (1975) and Urbanek
et al. (1996). The estimation procedure produces a new estimate of the nibd matrix that is contingent on the hierarchical
model. We denote this estimate of the nibd matrix by MĴ.
Hypothesis Testing
We use modifications of a likelihood ratio test that was
originally proposed by Cavalli-Sforza and Piazza (1975) to
assess how well a proposed hierarchical model fits the data
relative to a specified alternative. We are interested here in 2
kinds of hypothesis tests.
Global Hypothesis Test
The first kind of hypothesis test involves a global comparison between the raw matrix RĴ and the model-based estimate MĴ. In this test, the hierarchical model serves as the
null hypothesis. Rejecting this null hypothesis indicates that
the model has the wrong structure for the data, but it does
not reveal the nature of the lack of fit.
The global hypothesis test is performed using a likelihood ratio statistic, $\Lambda_{0,\Omega}$ (Long and Kittles 2003). For
a large number of independently evolving sites, $\Lambda_{0,\Omega}$ is distributed as a $\chi^2$ random variable with degrees of freedom equal
to $s(s + 1)/2 - p$, where $s$ is the number of populations
sampled and $p$ is the number of parameters fitted. However,
mtDNA does not provide independently evolving sites; as
a result, the likelihood ratio test rejects the null hypothesis
too easily, but $\Lambda_{0,\Omega}$ still provides a rank ordering of the fit of
different hierarchical models to the data.
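As a small worked example of the degrees of freedom, the count is the number of distinct elements in a symmetric s-by-s matrix, s(s + 1)/2, minus the number of fitted parameters p. The function name below is hypothetical; the values s = 27 and p = 2 correspond to the 27 populations and a model with 2 parameters.

```python
def global_test_df(s, p):
    """Degrees of freedom for the global test: s(s + 1)/2 - p."""
    distinct_elements = s * (s + 1) // 2  # diagonal plus upper triangle
    return distinct_elements - p

df_two_parameter_model = global_test_df(27, 2)
print(df_two_parameter_model)
```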
Subhypothesis Test
The second kind of hypothesis test compares a full
model to a reduced model. For example, we can evaluate
whether creating a new subset of populations improves
the fit of an existing model. This is accomplished by creating
a new node in the hierarchy. The previous model is now a
reduced version of the new model. The reduced model serves
as the null hypothesis, and the full model serves as the alternative hypothesis. Rejecting this null hypothesis indicates
something specific about the population structure, namely,
that a specific group of populations significantly
improves the hierarchical model’s fit to the data.
We compare the log likelihood maximized under the full
model with the log likelihood maximized under the reduced
model. We use an F-test that is likely to be conservative,
$$f_{a,b} = \frac{\Lambda_{a,\Omega} / df_a}{\Lambda_{b,\Omega} / df_b} \sim F(df_a, df_b),$$
where $\Lambda_{a,\Omega}$ is the likelihood ratio of the full model in comparison to the global alternative and $\Lambda_{b,\Omega}$ is the likelihood ratio of the reduced model in comparison to the global
alternative. The principle behind this test is as follows. If
the test statistics $\Lambda_{a,\Omega}$ and $\Lambda_{b,\Omega}$ are equally inflated relative
to the chi-squared distribution, then the inflation factor will
cancel in their ratio, thus providing a valid test according to the F
distribution. The test is
likely to be conservative because the reduced model always
has a worse fit to the data, and the chi-squared distribution is therefore a worse approximation for $\Lambda_{b,\Omega}$.
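The cancellation argument can be checked numerically. The sketch below assumes the statistic is the ratio of the two likelihood ratio statistics, each divided by its degrees of freedom; the function name and all numerical values are hypothetical, and no P value is computed.

```python
import math

def f_ratio(lr_full, df_full, lr_reduced, df_reduced):
    """Ratio of likelihood ratio statistics, each scaled by its degrees of freedom."""
    return (lr_full / df_full) / (lr_reduced / df_reduced)

# Hypothetical likelihood ratio statistics and degrees of freedom.
lr_full, df_full = 410.0, 370
lr_reduced, df_reduced = 480.0, 374

f_plain = f_ratio(lr_full, df_full, lr_reduced, df_reduced)

# Inflate both statistics by the same hypothetical factor, as might happen
# when sites do not evolve independently.
inflation = 1.8
f_inflated = f_ratio(inflation * lr_full, df_full,
                     inflation * lr_reduced, df_reduced)

print(math.isclose(f_plain, f_inflated))
```

A common inflation factor leaves the ratio unchanged, which is the property the test relies on. If the two statistics were inflated by different factors, the ratio would shift accordingly.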
Modeling Strategy
We used the step-forward procedure described below
to test our hypotheses about the genetic structure of Native
South American populations.
Steps:
1. We test a proposed model against the most general
alternative, that is, that the matrix RĴ perfectly
represents the population structure. If the test rejects
the model, we proceed to step 2, otherwise we terminate
the analysis.
2. The process goes on by elaborating the model from step 1.
Two sorts of elaborations are possible: 1) we relax constraints on the existing parameters without changing the
levels of nesting and/or 2) we add a new level of nesting.
3. Now we test the elaborated model against the original
model. If there is a significant improvement, we maintain
the new level of nesting and return to step 1. If there is not
a significant improvement, we return to step 2.
4. The process repeats until further improvements are
impossible.
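The control flow of the steps above can be sketched as a short loop. This is an illustration of the procedure's logic only: the three callables stand in for the global test, the generation of candidate elaborations, and the test of an elaborated model against the current one. Their names, the feature labels, and the hard-coded acceptance rule are hypothetical and are not derived from data.

```python
def step_forward(model, rejected_globally, elaborations, improves_on):
    """Forward model selection following steps 1 to 4.

    rejected_globally(model) -> True if the global test rejects the model.
    elaborations(model)      -> iterable of candidate elaborated models.
    improves_on(new, old)    -> True if `new` fits significantly better than `old`.
    """
    history = [model]
    while rejected_globally(model):            # step 1
        for candidate in elaborations(model):  # step 2
            if improves_on(candidate, model):  # step 3: keep it, back to step 1
                model = candidate
                history.append(model)
                break
        else:
            break  # step 4: no elaboration improves the fit
    return model, history

# Toy stand-ins: a model is a tuple of the features added so far.
FEATURES = ["A", "B", "C", "D"]

def elaborations(model):
    for feature in FEATURES:
        if feature not in model:
            yield model + (feature,)

def improves_on(candidate, model):
    # Hard-coded for illustration: adding feature "C" never helps.
    return candidate[-1] != "C"

def rejected_globally(model):
    # Mimics a global test that always rejects.
    return True

final_model, history = step_forward((), rejected_globally, elaborations, improves_on)
print(final_model)
```

In this toy run the loop ends only when no remaining elaboration is accepted, because the stand-in global test never stops rejecting.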
We investigated the following sequence of models in
order to identify the genetic structure of Native South
American populations and to evaluate the evolutionary scenarios that others have proposed.
To begin, we postulated that each of the s = 27 local
populations was evolving independently and that local populations possessed the same level of nibd. From there, we
evaluated the importance of relaxing the assumption that nibd
was the same in all groups. This was performed because
the failure to allow for differences in the level of nibd within
groups can bias estimates of nibd between groups. We then
tested a model that clustered populations into 4 geographical
subsets: North America, Central America, Eastern South
America, and Western South America. After this, we tested
the effect of placing Eastern and Western South American
populations into a South American superset. Finally, we
added local population structure to the previous results.
The purpose of adding local population structure was to test
whether our results for continental regions were sensitive to
the existence of higher levels of genetic structure.
Results
Site Frequency Spectrum
By comparing the 886 copies of HVS1, we found that
the nucleotide varied at 127 of 322 sites. However, these
data provide less information for answering our questions
about the peopling of South America than one might expect. The site frequency spectrum for the total sample
(fig. 2, top) shows that at most sites the minor allele is rare.
We observed only one copy of the minor allele at one-third
of the sites (42/127) and fewer than 10 copies of the minor
allele at three-quarters (95/127) of the sites. In fact, the minor allele at over 90% (115/127) of sites fails to reach the
5% frequency threshold for declaring a polymorphism.
The bottom plot of figure 2 presents the site frequency
spectrum in a bivariate manner that shows more about the
information for determining population relationships. The
abscissa gives the number of copies of the minor allele.
The ordinate gives the number of populations for which
a minor allele of a certain number of copies occurs. For example, we observe only 2 copies of the minor allele at 17
sites; at 5 of these sites we found both copies in the same
population, and at 12 of these sites we found one copy in
each of 2 populations.
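A tabulation of this bivariate kind can be sketched as follows. The sketch makes a simplifying assumption: at each variable site, every nucleotide other than the most common one is counted as the minor allele. The function name, population labels, and toy sequences are hypothetical.

```python
from collections import Counter

def minor_allele_table(samples):
    """Tabulate variable sites by minor allele copies and populations.

    samples is a list of (population, aligned sequence) pairs.
    Returns a Counter that maps (copies of the minor allele,
    number of populations carrying it) to the number of sites.
    """
    length = len(samples[0][1])
    table = Counter()
    for site in range(length):
        column = [(pop, seq[site]) for pop, seq in samples]
        counts = Counter(base for _, base in column)
        if len(counts) < 2:
            continue  # invariant site, no minor allele
        major = counts.most_common(1)[0][0]
        copies = len(column) - counts[major]
        populations = {pop for pop, base in column if base != major}
        table[(copies, len(populations))] += 1
    return table

# Hypothetical toy sample: 3 populations, 2 sequences each, 4 sites.
toy = [
    ("P1", "ACGT"), ("P1", "ACGA"),
    ("P2", "ACGA"), ("P2", "TCGT"),
    ("P3", "ACGT"), ("P3", "ACGT"),
]
table = minor_allele_table(toy)
for (copies, populations), sites in sorted(table.items()):
    print(copies, populations, sites)
```

In the toy data, one site has a single minor allele copy confined to one population, and one site has 2 copies split between 2 populations.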
Clearly, a minor allele that occurs in only one population provides no information about population relationships.
By summing the bottom row of the table, we see that the
minor allele at 51 sites falls into this category. Similarly,
a minor allele that appears in only 2 populations provides
little information about relationships of larger sets of populations. We expect that the sites with the most information
about population structure will have minor alleles that are
common in one subset of local groups and absent in other
subsets of local groups. With regard to our questions about
South America, the minor allele at an optimal site that would
confirm an Eastern group will be present in 8 populations
and absent in all others, whereas an optimal site that would
confirm a Western group will be present in 14 populations
and absent in all others. The absence of sites with minor
allele frequencies in this space is conspicuous, but the situation is even worse. We note that 5 sites have minor alleles
that appear in samples from exactly 8 populations, but these
8 populations are not located exclusively in the Eastern region. Only 1 site has a minor allele that appears in samples
from exactly 14 populations, but these 14 populations are
not located exclusively in the Western region.
There are 10 sites with high frequency minor alleles
that appear in at least 20 of the samples. Allele frequencies
at these 10 sites provide some information about population
relationships; however, they are useless as diagnostic
markers for groups of the sizes that would resolve our questions about South America because both alleles occur in
most populations. Moreover, some of the information provided by these sites is redundant because they are in high
linkage disequilibrium with each other.
Nucleotide Identity by Descent
The range for raw average nibd within populations was
from 0.9811 for the Cheyenne to 0.9969 for the Ache. When
comparing the sequences from a random pair of individuals
for all 322 nucleotide positions, these nibd estimates are consistent with at least one substitution at 6.81 and 1.06 sites,
respectively. The range for raw average nibd between populations was from 0.9785 for the Cheyenne and Tupe to
0.9914 for the Surui and Gaviao. When comparing the sequences from a random pair of individuals for all 322
FIG. 2.—Minor allele frequency spectrum with respect to the number of sites (top) and the number of populations (bottom). Shaded boxes indicate
sites that are optimally diagnostic of regional South American populations. The big and small boxes indicate the regions corresponding to the 1% and
5% polymorphism thresholds, respectively.
nucleotide positions, these nibd estimates are consistent with
at least one substitution at 6.93 and 2.78 sites, respectively.
Figure 3 displays the elements of RĴ grouped by region. In the
bottom panel, the tick marks indicate nibd within populations
for the 4 different geographic regions. In the middle panel, the
tick marks indicate nibd between pairs of populations within
the same region. In the top panel, the tick marks indicate nibd
between pairs of populations from different regions. Two salient points emerge from this figure. First, the range of nibd
within populations is so great that it makes little sense to pool
them into a single within-group component of variation for
further analyses of population genetic structure. Second, nibd
between populations in the same region is often lower than
nibd between populations in different regions. Because of
this, we do not expect regional groupings of populations
to be a major feature of the population genetic structure.
We assumed a Tamura–Nei (1993) substitution model
for our estimates of nibd. This model accounts for finite
sites, transition–transversion bias, asymmetric substitution
probabilities, and heterogeneous mutation rates among
sites. However, given the coalescence time frame for Native
American mtDNA, few substitutions deviate from an infinite sites model. In fact, in separate analyses, we found the
same results by assuming the infinite sites model where the
metric was the proportion of sites with the same nucleotide
between 2 DNA sequences.
Fitted Models
We fit 17 models to the nibd matrix, RĴ. Five models
test our major hypotheses. We constructed the remaining
models to confirm the principal 5 models, for example,
by using fewer or greater numbers of within-group nibd values or by adding regional populations one at a time rather
than as blocks. We did not find that minor variations on the
principal models made large changes in the outcomes. For
brevity, we present the results for the 5 models that directly
address our major questions.
In Model 1 (fig. 4, I), we assumed that nibd within all
local populations is equal and that nibd between all pairs
of local populations is equal and independent. This assumption is roughly equivalent to Wright’s island model of population structure. This model has 2 parameters, one a pooled
within-population nibd component and the other a pooled
between-population nibd component. The estimates of these
parameters are 0.9864 and 0.9829, respectively. The model fits
the data poorly, as we should expect from the raw nibd values
displayed in figure 3. Nevertheless, this model provides a useful baseline to begin the forward model selection procedure.
Model 2 (fig. 4, II) relaxes the assumption that nibd is
equal within local populations by allowing 5 different levels
of within-group nibd. Although all 27 populations may, in
principle, harbor a unique level of nibd, we were unable
to extract 27 estimates from the HVS1 data because there
are too few polymorphic sites. The range of the modelbased nibd estimates is 0.9824–0.9911, which better
approximates the range of the raw estimates. Model 2 provides an improved fit to raw data matrix over Model 1. The
F-test shows that this improvement achieved significance
relative to Model 1 (P = 0.0452); therefore, we reject
Model 1 in favor of Model 2 and conclude that it is necessary to allow for differences in nibd within populations.
FIG. 3.—Raw nibd, the points in the top plot reflect comparisons of sequences drawn from different local populations within different groups; the
points in the middle plot reflect comparisons of sequences drawn from different local populations within the same groups; and the points in the bottom
plot reflect comparisons of sequences drawn from the same local populations.
Model 3 (fig. 4, III) clusters populations into 4 geographical groups: North America, Central America, Western
South America, and Eastern South America. In this model,
we estimated nibd separately for each regional group. The
fitted model had a near-zero branch length (0.0002) between the Eastern South American node and the node connecting all populations; thus, the Eastern South American
node was eliminated, and the model was refitted. Model 3
fits the raw nibd matrix significantly better than does Model
2 (P = 0.021). The fact that populations in Eastern South
America do not form a distinct genetic group is of utmost
importance to the theories about the initial peopling of South
America because it shows that populations in the Eastern
region harbor a great deal of variation.
Model 4 (fig. 4, IV) is similar to Model 3, but it nests
all South American populations into a continental cluster.
Although Model 4 provides a slightly improved fit to the
raw nibd matrix, the improvement is not statistically significant (P = 0.484), and we retain Model 3 as a parsimonious
representation of RĴ.
Model 5 adds local population structure to Model 3 in
order to determine whether the existence of higher level
structure affects our conclusions about broad regional
groups (fig. 5). Model 5 provides a significant improvement
over Model 3 (P = 0.005). However, the local pattern does
not change our interpretation of the regional population
groups.
Our forward modeling strategy assesses the distribution of nibd at lower levels of population structure prior
to higher levels. This raises an important methodological
issue because assumptions made about the distribution of
nibd at one level of population structure can significantly
bias the results at other levels (Long and Kittles 2003).
When assessing nibd at lower levels, the forward strategy
assumes that changes in nibd within and among groups at
higher levels occur independently. To some extent, our
strategy is robust to this assumption because, after our first
model, we allowed nibd within local populations to vary.
Fortunately, the higher level structure that we discovered
in Model 5 did not change our inferences about regional
groups from Model 3. Specifically, Eastern local populations still connected at the basal position, and all Western
local populations still emerged together as a distinct group.
Discussion
Our principal discovery is that Eastern South America
harbors more variation than has been recognized heretofore
(Tarazona-Santos et al. 2001; Fuselli et al. 2003; Lewis
et al. 2007). Although generalized hierarchical modeling
led us to this result, 2 lines of evidence supporting it are
visible in the raw nibd estimates (fig. 3). First, there are
some estimates of low nibd within the Eastern populations.
In fact, nibd in the Guahibo ranks third lowest in comparison to nibd for all 27 populations studied. Second, nibd
estimates between some pairs of Eastern populations are
on the order of nibd estimates for interregional, and intercontinental, comparisons. Because of these low nibd estimates between Eastern populations, we were unable to
reject the hypothesis that Eastern South America harbors
as much variation as the total for South America and even
the hypothesis that Eastern South America harbors as much
variation as the total for all the Americas. Thus, we cannot consider the Eastern populations a meaningful group because
they harbor as much diversity collectively as our total sample including all regions.
In contrast to Eastern South America, we discovered that
Western South America harbors less genetic variation than
has been previously recognized (Tarazona-Santos et al.
2001; Fuselli et al. 2003). Again, the raw nibd estimates confirm the result from generalized hierarchical modeling (fig. 3).
Although nibd is low within most Western populations, the
FIG. 4.—Models 1–4 fitted to RĴ. The units on the scale bar are nibd. The range of nibd estimates for the raw data is marked by the boxes on the
scale bar. The range of the nibd estimates predicted by the model is marked by the circles on the scale bar. North American populations are coded in
black, Central American populations are coded in yellow, Eastern South American populations are coded in blue, and Western South American
populations are coded in red.
range substantially overlaps the range of nibd within Eastern
populations. Moreover, nibd between pairs of Western populations is entirely within the range of nibd between pairs of
Eastern populations. This fact causes the Western populations
to emerge together as a distinct group. However, our results
now fail to confirm the previously held idea that there is more
variation in the Western region than in the Eastern region.
The pattern of population relationships deciphered
above imposes an important limitation on testing the theories about the peopling of South America. From these
HVS1 data, we find that the Eastern populations connect
at the most basal node of the hierarchy; consequently,
we cannot distinguish the characteristics of the female
founders of South America from the female founders of
FIG. 5.—Model 5 fitted to the data. The units on the scale bar are nibd. The range of nibd estimates within the observed data is marked by the boxes
on the scale bar. The range of nibd estimates predicted by the model is marked by the circles on the scale bar.
the Americas as a whole. Therefore, there is no need to postulate separate source populations for the Eastern and Western regions of South America because the variability in the
East could serve as a source for the Western gene pools.
The questions arise: could additional population samples
or sequence from a longer stretch of the mitochondrial genome
change our findings with respect to relative levels of nibd? To
address these questions, we performed a phylogeographic
analysis on the 886 HVS1 sequences analyzed. Specifically,
we calculated a Neighbor-Joining tree, which we rooted using
an HVS1 sequence from an African for the outgroup.
The Neighbor-Joining tree presents clusters of haplotypes that correspond to the universally recognized Native
American mitochondrial haplogroups, A–D. In our total
sample, copies of HVS1 from 879/886 individuals fall into
these clusters. The remaining seven copies of HVS1, all from the
Cheyenne population, show characteristics of haplogroup X
(Brown et al. 1998) including the T allele at both nucleotide
positions 16223 and 16278 and the absence of markers at
other sites that would place them in one of the 4 main haplogroups. However, definitive assignment of these copies of
the sequence would require data from outside of the HVS1.
For the HVS1 data examined here, haplotypes within major
haplogroups have approximately 2–3 nt differences, whereas
haplotypes between major haplogroups have approximately
6–10 nt differences. Previous studies estimated that haplotypes within these major haplogroups coalesce to a common
ancestor approximately 25–40 thousand years ago (Bonatto
and Salzano 1997b; Silva et al. 2002), whereas haplotypes
between these major haplogroups coalesce to a common ancestor approximately 100 thousand years ago (Bonatto and
Salzano 1997b; Silva et al. 2002; Gonder et al. 2007). This
places an important limitation on the resolving power of
mtDNA because the mutations responsible for many of
the nucleotide differences between copies of HVS1 occurred
prior to the peopling of the Americas.
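The contrast between within-haplogroup and between-haplogroup differences can be made concrete with a small calculation. The following Python sketch is illustrative only: the sequences and haplogroup labels are invented 10-nt toy data, not HVS1 haplotypes from this study, and the function names are hypothetical. It counts nucleotide differences for every pair of aligned sequences and averages them separately for pairs within the same haplogroup and pairs from different haplogroups.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def within_between_means(samples):
    """Mean pairwise differences within and between haplogroups.

    samples: list of (haplogroup_label, aligned_sequence) tuples.
    Returns (mean_within, mean_between).
    """
    within, between = [], []
    for (g1, s1), (g2, s2) in combinations(samples, 2):
        d = hamming(s1, s2)
        if g1 == g2:
            within.append(d)
        else:
            between.append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Invented toy data (assumption): two haplogroups, short aligned sequences.
toy = [
    ("A", "ACGTACGTAC"),
    ("A", "ACGTACGTAT"),
    ("A", "ACGTACGTGC"),
    ("B", "TCGAACCTAC"),
    ("B", "TCGAACCTAT"),
]

mean_within, mean_between = within_between_means(toy)
print(f"within: {mean_within:.2f} nt, between: {mean_between:.2f} nt")
```

In this toy data set, the between-haplogroup pairs differ at roughly three times as many sites as the within-haplogroup pairs, which mirrors in miniature the pattern reported for the HVS1 data.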
Substitutions between haplotypes in different haplogroups make a greater contribution to average nibd than
do substitutions between haplotypes within haplogroups.
Because of this, the pattern of nibd should be robust to
additional population samples: our current population samples have already revealed that haplogroups A–D
are common in all regions. In fact, sampling additional populations is unlikely to change the observed pattern unless the
added samples present a new common haplogroup. However, the possibility of such a discovery is remote. To date,
Native American studies have provided an extensive survey
of haplogroups and nearly all Native Americans possess one
of the 4 haplogroups (Bonatto and Salzano 1997b; Mulligan
et al. 2004; Schurr and Sherry 2004; Tamm et al. 2007).
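A rough numerical sketch may help show why between-haplogroup substitutions dominate the average. The Python example below is not the model fitted in this study. It assumes hypothetical haplogroup frequencies, and it uses 2.5 and 8 nt as stand-in values taken from the middle of the approximate ranges quoted above (2–3 nt within haplogroups, 6–10 nt between). It computes the expected number of differences between two sequences drawn independently, and the share of that expectation contributed by pairs from different haplogroups.

```python
from itertools import product


def expected_differences(freqs, within_nt, between_nt):
    """Expected pairwise differences and the between-haplogroup share.

    freqs: dict mapping haplogroup label -> frequency (should sum to 1).
    within_nt / between_nt: assumed mean differences for pairs drawn
    from the same haplogroup or from different haplogroups.
    Returns (expected_total, share_from_between_pairs).
    """
    total = 0.0
    between_part = 0.0
    for (g1, p1), (g2, p2) in product(freqs.items(), repeat=2):
        if g1 == g2:
            total += p1 * p2 * within_nt
        else:
            contribution = p1 * p2 * between_nt
            total += contribution
            between_part += contribution
    return total, between_part / total


# Hypothetical frequencies (assumption): equal, then strongly skewed.
equal = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
skewed = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}

for name, freqs in (("equal", equal), ("skewed", skewed)):
    total, share = expected_differences(freqs, within_nt=2.5, between_nt=8.0)
    print(f"{name}: expected differences = {total:.3f} nt, "
          f"between-haplogroup share = {share:.1%}")
```

Under these assumed values, pairs from different haplogroups account for most of the expected differences whether the four haplogroups are equally frequent or strongly skewed, so a new sample that again contains haplogroups A–D would shift the average only modestly.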
We also expect that the pattern of nibd is resistant to
sequencing a larger portion of the mitochondrial genome. Extending the sequence length can improve the resolution of the
coalescent history of the haplotypes within a haplogroup and
may reveal additional sublineages. However, to change the
pattern of nibd, extending the sequence length would need to
reject our current knowledge of the coalescent history of
these haplogroups. This is unlikely, considering that mitochondrial genome studies identify the same major haplogroups as seen in analyses of HVS1 (Maca-Meyer et al.
2001; Bandelt et al. 2003; Tamm et al. 2007). Moreover, even
with such a discovery, the ubiquitous distribution of the haplogroups buffers the pattern of nibd. We expect that such
a discovery would affect nibd in all regions similarly.
In summary, we developed a series of models to test hypotheses about the genetic structure and initial peopling of
South America using mtDNA sequence data. Our models
do not require us to pool variation within populations or to pool
variation among populations from different regions, as is the
case with more usual methods. The flexibility in our approach
led us to the novel discovery that populations in Eastern South
America harbor a great deal of variation. The level of variation
is so great that we cannot view the Eastern South American
populations as a single cohesive group. Because of this finding, a single human migration into South America is the most
parsimonious interpretation of the mtDNA HVS1 data. Analyses of the site frequency spectrum indicate that questions
about broad regional populations in South America are beyond
the resolving power of mtDNA HVS1 data. Moreover, phylogeographic analysis of the mtDNA haplogroup lineages
tagged by these HVS1 data indicates that even mitochondrial
whole-genome analysis may not be able to resolve these questions. The best chance for obtaining a higher resolution of population history will be to examine many independently
inherited loci with similar mutation mechanisms.
Acknowledgment
C.M.L. was supported by NIH T32-HG-00040.
Appendix: Populations, Sample Sizes, and Primary Citations

Population    n    Region                   Citation
Ache          63   Eastern South America    (Schmitt et al. 2004)
Ancash        35   Western South America    (Lewis et al. 2007)
Arequipa      22   Western South America    (Fuselli et al. 2003)
Bella Coola   41   North America            (Ward et al. 1993)
Cheyenne      39   North America            (Kittles et al. 1999)
Embera        44   Western South America    (Kolman and Bermingham 1997)
Gaviao        28   Eastern South America    (Ward et al. 1996)
Guahibo       59   Eastern South America    (Vona et al. 2005)
Huetar        27   Central America          (Santos et al. 1994)
Ignaciano     15   Western South America    (Bert et al. 2004)
Kuna          63   Central America          (Bert et al. 2004)
Mapuche       34   Western South America    (Moraga et al. 2000)
Movima        12   Eastern South America    (Bert et al. 2004)
Ngoebe        46   Central America          (Kolman et al. 1995)
Pehuenche     24   Western South America    (Moraga et al. 2000)
Puno          34   Western South America    (Lewis et al. 2007)
Surui         24   Eastern South America    (Bonatto and Salzano 1997a)
Tayacaja      65   Western South America    (Fuselli et al. 2003)
Trinitario    12   Western South America    (Bert et al. 2004)
Tupe          16   Western South America    (Lewis et al. 2007)
Waiwai        26   Eastern South America    (Bonatto and Salzano 1997a)
Wounan        31   Western South America    (Kolman and Bermingham 1997)
Xavante       25   Eastern South America    (Ward et al. 1996)
Yaghan        15   Western South America    (Moraga et al. 2000)
Yungay        38   Western South America    (Lewis et al. 2007)
Yuracare      15   Western South America    (Bert et al. 2004)
Zoro          28   Eastern South America    (Ward et al. 1996)
Literature Cited
Anderson S, Bankier AT, Barrell BG, et al. (14 co-authors).
1981. Sequence and organization of the human mitochondrial
genome. Nature. 290:457–465.
Anderson TW. 1973. Asymptotically efficient estimation of
covariance matrices with linear structure. Ann Stat. 1:79–95.
Andrews R, Kubacka I, Chinnery P, Lightowlers R, Turnbull D,
Howell N. 1999. Reanalysis and revision of the Cambridge
reference sequence for human mitochondrial DNA. Nat
Genet. 23:147.
Bandelt HJ, Herrnstadt C, Yao YG, et al. (13 co-authors). 2003.
Identification of Native American founder mtDNAs through
the analysis of complete mtDNA sequences: some caveats.
Ann Hum Genet. 67:512–524.
Bert F, Corella A, Gene M, Perez-Perez A, Turbon D. 2004.
Mitochondrial DNA diversity in the Llanos de Moxos: Moxo,
Movima and Yuracare Amerindian populations from Bolivia
lowlands. Ann Hum Biol. 31:9–28.
Bonatto SL, Salzano FM. 1997a. A single and early migration for
the peopling of the Americas supported by mitochondrial
DNA sequence data. Proc Natl Acad Sci USA. 94:1866–1871.
Bonatto SL, Salzano FM. 1997b. Diversity and age of the four
major mtDNA haplogroups, and their implications for the
peopling of the New World. Am J Hum Genet.
61:1413–1423.
Brown MD, Hosseini SH, Torroni A, Bandelt HJ, Allen JC,
Schurr TG, Scozzari R, Cruciani F, Wallace DC. 1998.
mtDNA haplogroup X: an ancient link between Europe/
Western Asia and North America? Am J Hum Genet.
63:1852–1861.
Cavalli-Sforza LL, Piazza A. 1975. Analysis of evolution:
evolutionary rates, independence, and treeness. Theor Popul
Biol. 8:127–165.
Dillehay TD. 1999. The late Pleistocene cultures of South
America. Evol Anthropol. 7:206–216.
Excoffier L, Smouse PE, Quattro JM. 1992. Analysis of
molecular variance inferred from metric distances among
DNA haplotypes: application to human mitochondrial DNA
restriction data. Genetics. 131:479–491.
Fuselli S, Tarazona-Santos E, Dupanloup I, Soto A, Luiselli D,
Pettener D. 2003. Mitochondrial DNA diversity in South
America and the genetic history of Andean Highlanders. Mol
Biol Evol. 20:1682–1691.
Gonder MK, Mortensen HM, Reed FA, de Sousa A,
Tishkoff SA. 2007. Whole-mtDNA genome sequence analysis of ancient African lineages. Mol Biol Evol. 24:757–768.
Kittles RA, Bergen AW, Urbanek M, Virkkunen M, Linnoila M,
Goldman D, Long JC. 1999. Autosomal, mitochondrial, and
Y chromosome DNA variation in Finland: evidence for
a male-specific bottleneck. Am J Phys Anthropol.
108:381–399.
Kolman CJ, Bermingham E. 1997. Mitochondrial and nuclear
DNA diversity in the Chocó and Chibcha Amerinds of
Panamá. Genetics. 147:1289–1302.
Kolman CJ, Bermingham E, Cooke R, Ward RH, Arias TD,
Guionneau-Sinclair F. 1995. Reduced mtDNA diversity in the
Ngöbé Amerinds of Panama. Genetics. 140:275–283.
Lewis CM Jr, Lizárraga B, Tito RY, et al. (11 co-authors).
Forthcoming. Mitochondrial DNA and the Peopling of South
America. Hum Biol.
Long JC, Kittles RA. 2003. Human genetic diversity and the
nonexistence of biological races. Hum Biol. 75:449–471.
Luiselli D, Simoni L, Tarazona-Santos E, Pastor S, Pettener D.
2000. Genetic Structure of Quechua-Speakers of the Central
Andes and Geographic Patterns of Gene Frequencies in South
Amerindian Populations. Am J Phys Anthropol. 113:5–17.
Maca-Meyer N, Gonzalez AM, Larruga JM, Flores C,
Cabrera VM. 2001. Major genomic mitochondrial lineages
delineate early human expansions. BMC Genet. 2:13.
Moraga ML, Rocco P, Miquel JF, Nervi F, Llop E,
Chakraborty R, Rothhammer F, Carvallo P. 2000. Mitochondrial DNA polymorphisms in Chilean aboriginal populations:
implications for the peopling of the southern cone of the
continent. Am J Phys Anthropol. 113:19–29.
Mulligan CJ, Hunley K, Cole S, Long JC. 2004. Population
genetics, history, and health patterns in Native Americans.
Annu Rev Genomics Hum Genet. 5:295–315.
Rothhammer F, Moraga M. 2001. Patterns of Y-chromosome
variation in South Amerindians. Am J Hum Genet.
69:904–906.
Santos M, Ward RH, Barrantes R. 1994. mtDNA variation in the
Chibcha Amerindian Huetar from Costa Rica. Hum Biol.
66:963–977.
Schmitt R, Bonatto SL, Freitas LB, Muschner VC, Hill K,
Hurtado AM, Salzano FM. 2004. Extremely limited mitochondrial DNA variability among the Ache Natives of
Paraguay. Ann Hum Biol. 31:87–94.
Schurr TG, Sherry ST. 2004. Mitochondrial DNA and Y
chromosome diversity and the peopling of the Americas:
evolutionary and demographic evidence. Am J Hum Biol.
16:420–439.
Silva WA Jr, Bonatto SL, Holanda AJ, et al. (14 co-authors).
2002. Mitochondrial genome diversity of Native Americans
supports a single early entry of founder populations into
America. Am J Hum Genet. 71:187–192.
Tamm E, Kivisild T, Reidla M, et al. (21 co-authors). 2007.
Beringian standstill and spread of Native American founders.
PLoS ONE. 2:e829.
Tamura K, Nei M. 1993. Estimation of the number of nucleotide
substitutions in the control region of mitochondrial DNA in
humans and chimpanzees. Mol Biol Evol. 10:512–526.
Tarazona-Santos E, Carvalho-Silva DR, Pettener D, Luiselli D,
De Stefano GF, Martinez Labarga C, Rickards O, Tyler-Smith C, Pena SDJ, Santos FR. 2001. Genetic differentiation
in South Amerindians is related to environmental and cultural
diversity: evidence from the Y chromosome. Am J Hum
Genet. 68:1485–1496.
Urbanek M, Goldman D, Long JC. 1996. The apportionment of
dinucleotide repeat diversity in Native Americans and Europeans: a new approach to measuring gene identity reveals
asymmetric patterns of divergence. Mol Biol Evol.
13:943–953.
Vona G, Falchi A, Moral P, Calo CM, Varesi L. 2005.
Mitochondrial sequence variation in the Guahibo Amerindian
population from Venezuela. Am J Phys Anthropol.
127:361–369.
Ward RH, Redd A, Valencia D, Frazier B, Pääbo S. 1993.
Genetic and linguistic differentiation in the Americas. Proc
Natl Acad Sci USA. 90:10663–10667.
Ward RH, Salzano FM, Bonatto SL, Hutz MH, Coimbra CEA Jr,
Santos RV. 1996. Mitochondrial DNA polymorphism in three
Brazilian Indian tribes. Am J Hum Biol. 8:317–323.
Weir BS, Cockerham CC. 1984. Estimating F-statistics for the
analysis of population structure. Evolution. 38:1358–1370.
Connie Mulligan, Associate Editor
Accepted October 4, 2007