Author's self-archived version
Bayesian clustering for genetic assignment in oaks
Bayesian clustering analyses for genetic assignment and study of
hybridization in oaks: Effects of asymmetric phylogenies and
asymmetric sampling schemes
Charalambos Neophytou1
ABSTRACT
Bayesian clustering methods have been widely used
for studying species delimitation and genetic
introgression. In order to test the effect of
phylogenetic relationships and sampling scheme on
the inferred clustering solution and on the
performance of Bayesian clustering analysis, I
simulated genotypes of the interfertile oak species
Quercus robur, Q. petraea and Q. pubescens and I ran
analyses using two popular software programs,
STRUCTURE and BAPS. First, based on purebred
simulations, I compared clustering solutions
resulting from different sample size configurations.
While the clustering solution generally reflected the
taxonomic relationships when equal samples of each
species were included, a spurious partition was
inferred by STRUCTURE when some species were
represented by larger and others by smaller
samples. In very unbalanced configurations,
STRUCTURE failed to identify the three species, even
if three subpopulations were assumed. By contrast,
BAPS could properly identify the three species
under any sampling scheme. Second, based on
simulations of purebreds and hybrids, I tested the
performance of individual assignments with variable
number of loci. This analysis showed that
STRUCTURE can detect introgressed individuals
more efficiently than BAPS. However, BAPS could
assign purebreds more efficiently with a lower
number of loci. Method performance also depended
on phylogenetic relationships. In the case of Quercus
petraea, Q. pubescens and their hybrids, method
performance was lower due to their phylogenetic
affinity. Inclusion of three instead of two species into
the analysis led to reduction of performance, and to
misclassification of hybrids, which often reflected
the phylogenetic affinity between Q. petraea and Q.
pubescens.
1 Forest Research Institute (FVA) Baden-Württemberg, Wonnhaldestr. 4, 79100 Freiburg, Germany. E-mail: [email protected]
KEY WORDS
Bayesian clustering, Quercus, BAPS, STRUCTURE,
simulation, microsatellites
1. INTRODUCTION
Along with significant improvements in molecular
techniques, Bayesian clustering methods of
population genetic structure analysis have
experienced a rapid expansion during the last
decade. Main applications of such methods include
investigation of intraspecific genetic differentiation
(Rosenberg et al. 2001; Heuertz et al. 2004; Frantz et
al. 2006), species delimitation and study of
hybridization and genetic introgression (Lexer et al.
2005; Kronforst et al. 2006; Bohling et al. 2013).
Being a multispecific genus with high levels of
interbreeding among taxa, oaks (genus Quercus)
have often been used for such analyses in a
population genetic and evolutionary context
(Burgarella et al. 2009; Lepais et al. 2009;
Neophytou et al. 2010; Gugger and Cavender-Bares
2011). Yet, method performance varies among
different case studies. Several factors related to the
experimental design and the taxonomic
relationships affect the ability of Bayesian clustering
analyses to characterize inter- and intraspecific
genetic differentiation and to define levels of genetic
introgression between interfertile units (Vähä and
The final publication is available at http://link.springer.com/article/10.1007%2Fs11295-013-0680-2
Primmer 2006, Kalinowski 2011, Bohling et al.
2013).
In general, an increase in the number of genotyped loci leads to
a higher efficiency and accuracy of genetic
assignment (Vähä and Primmer 2006). However, it
has been shown that diagnostic power can vary
strongly among loci. Thus, use of a small number of
highly informative loci can be adequate to identify
genetic structures, whereas adding a large number
of less informative loci may not improve the results
significantly (Rosenberg 2005). In order to select a
set of appropriate markers for Bayesian clustering
analyses, loci have to be evaluated and sorted
according to their diagnostic power. Several
measures estimating locus diagnostic power have
been proposed. Among them, Wright's FST and
Rosenberg's informativeness of assignment (In;
Rosenberg et al. 2003) have been frequently used
for such marker classifications and have been shown
to perform better than other related measures (Ding
et al. 2011).
Phylogenetic relationships among species may also
strongly affect the ability of Bayesian clustering to
distinguish different taxonomical units. In the case of
ancient speciation events, genetic drift and
mutations are likely to have altered allelic
frequencies in extant species more strongly and at more
loci. Ancient speciation may explain the high levels
of differentiation among Mediterranean
representatives of the oak section Cerris (Manos et
al. 1999). In two case studies with oaks of the
section Cerris, a limited number of markers (as few
as four) was adequate to achieve a high performance
of Bayesian clustering for species assignment,
supporting the aforementioned hypothesis
(Burgarella et al. 2009; Neophytou et al. 2011). On
the contrary, more recent speciation, probably in
interaction with frequent hybridization, may explain
the fact that Bayesian analyses within certain
species of the section Lobatae (red oaks) could not
resolve the species, even when 15 loci were used
(Aldrich et al. 2003).
Furthermore, a better resolution of Bayesian
clustering may be required to study introgression
among interfertile species. Given that performance
of Bayesian clustering methods depends on the
genetic differentiation among species, a high
number of loci may be required in order to reliably
assign hybrids and especially backcrosses, in case of
limited interspecific differentiation (Vähä and
Primmer 2006). In addition, depending on the
algorithm used, proportions of assigned purebreds,
hybrids and backcrosses may vary strongly
(Burgarella et al. 2009; Bohling et al. 2013). Another
key issue is the decision about a threshold value of
membership proportion or admixture coefficient (q)
in order to distinguish purebreds from potential
hybrids (i.e. the percentage of genetic variation of
each individual drawn from a specific gene pool).
Studies based on simulations of purebreds and
hybrid genotypes have aimed to evaluate the effect
of all aforementioned factors on the performance of
Bayesian clustering (Vähä and Primmer 2006;
Burgarella et al. 2009; Lepais et al. 2009; Guichoux
et al. 2013).
Finally, sampling scheme and the chosen analysis
method may also cause problems for genetic
structure identification. For instance, using the
popular Bayesian clustering analysis software
STRUCTURE (Pritchard et al. 2000; Falush et al.
2003), it has been shown that variations in sample
size among demes may strongly influence the
clustering solution (Kalinowski 2011). Preliminary
analysis of the data underlying the present study,
carried out using the same software, showed a
tendency of two species groups, those of Q. petraea
and Q. pubescens, which were represented by a
relatively low number of individuals, to cluster together, even
when the number of assumed subpopulations was
set to 3 (i.e. the number of species). This could be
due to the phylogenetic affinity of these two species
or due to stochastic error (Kalinowski 2011). On the
contrary, use of another method of Bayesian
clustering analysis, BAPS (Corander and Marttinen
2006; Corander et al. 2008a), showed clustering
patterns consistent with the taxonomic relationships
among the three species. This observation has
largely motivated further simulation-based analyses
presented in this paper.
In the present study, I chose to study the effects of
all aforementioned factors by focusing on three
interfertile oak species of Central Europe: Quercus
robur, Q. petraea and Q. pubescens. Given that Q.
petraea and Q. pubescens are phylogenetically closer
to each other, while Q. robur is genetically more
divergent (Curtu et al. 2007, Lepais et al. 2009), this
species complex is a good model for studying the
effects of phylogeny on the performance of Bayesian
clustering. In particular, based on genotype
simulations, I aimed to explore the utility of Bayesian
clustering methods for species identification and the
study of hybridization and introgression, and to
investigate advantages and drawbacks of two
well-established methods, BAPS
and STRUCTURE. Specifically, my objectives were
(1) to study the effect of phylogenetic relationships
on the reliability of clustering among species with
different levels of pairwise differentiation, the oak
species Quercus robur, Q. petraea and Q. pubescens,
(2) to test the effect of sample size of each specific
group on clustering patterns, (3) to choose subsets
of highly informative markers that discriminate
among the three species, as well as pairwise in all
three possible combinations, (4) to evaluate
Wright's FST and Rosenberg's In as tools for choosing
highly informative marker sets suitable for Bayesian
analyses, (5) to compare the performance of two
different Bayesian clustering methods in all
aforementioned tasks, (6) to study the effect of
phylogenetic relationships on the efficiency and
accuracy of purebred and hybrid assignment and (7)
to test whether inclusion of three species reduces
method performance in comparison to analyses with
species pairs and their hybrids.
2. MATERIALS AND METHODS
2.1. Study area and sample collections
A total of 2048 individual trees of Q. robur, Q.
petraea and Q. pubescens were systematically
sampled from 76 forest stands in the Upper Rhine
Valley in France and Germany, delimited by the
Vosges Mountains to the west, the Jura Mountains to
the south and the Black Forest to the east. Individual
trees were georeferenced. A preliminary assignment
to one of the three species was made in the field
based on basic phenotypic characters (leaves, bark
and acorns). Among the sampled stands, 15 were
mixed with Q. robur and Q. petraea and the
remaining were pure for one of the three study
species. In particular, 40 of those stands consisted
predominantly of Q. robur, 15 of Q. petraea and 6 of
Q. pubescens.
2.2. Laboratory procedures
Depending on the season of the samplings, three
different types of plant tissue – cambium, leaves or
buds – were collected from each individual for the
DNA analysis. After sample collections, plant
material was transferred to the laboratory and was
frozen at -80°C. Subsequently, it was freeze-dried in
vacuum and DNA was extracted using the DNeasy 96
extraction kit (Qiagen, Hilden, Germany). Multiplex
polymerase chain reactions (PCR) were carried out
for the amplification of 11 non-genic (nSSRs) and 10
EST-derived microsatellite loci (EST-SSRs). Among
the eleven analyzed non-genic microsatellites six –
QrZAG7, QrZAG11, QrZAG30, QrZAG96 and
QrZAG112 – were initially developed in Q. robur
(Kampfer et al. 1998), four – QpZAG9, QpZAG15,
QpZAG104 and QpZAG110 – in Q. petraea
(Steinkellner et al. 1997a) and one – MSQ13 – in Q.
macrocarpa (Dow et al. 1995). All ten EST-derived
microsatellites – PIE020, PIE102, PIE152, PIE215,
PIE223, PIE227, PIE242, PIE243, PIE267 and
PIE271 – were described in Durand et al. (2010). Both
non-genic and EST-derived microsatellites are
known to be highly transferable among related
white oak species (section Quercus), as has been
described in studies mainly including Q. robur and Q.
petraea (Steinkellner et al. 1997b; Guichoux et al.
2011).
For the PCR reactions, primers were divided in three
multiplexes. The multiplex including the EST-derived SSRs was largely based on Guichoux et al.
(2011). Details about the used loci, multiplexes and
fluorescent labeling are presented in Online
Resource 1. As a reagent, the SuperHot Mastermix
(Genaxxon, Biberach, Germany), a premixed
mastermix including all PCR components except
DNA template and primers, was used. Reaction
volume was set to 10 µl, comprised of 5 µl reaction
mastermix, 2 µl of primer mix, 2 µl water and 1 µl
diluted DNA (ca. 4 ng/µl). A common PCR program
was used for all reactions, including the following
steps: (1) denaturation at 95 °C for 15 min; (2) 26
cycles with a denaturation step at 94 °C for 30 s,
primer annealing at 57 °C for 1 min 30 s and an
elongation step at 72 °C for 30 s; (3) final elongation
at 72 °C for 10 min and (4) a final step at 60 °C for 30
min. Allele scoring was performed by means of
capillary electrophoresis using an ABI Prism 3130xl
genetic analyzer and the software GeneMapper
(Applied Biosystems).
2.3. Selection of purebred individuals
For a preliminary species assignment, two different
Bayesian clustering approaches, STRUCTURE
(Pritchard et al. 2000; Falush et al. 2003) and BAPS
(Corander and Marttinen 2006; Corander et al.
2008a), were used. Both methods allow for presence
of multiple clusters, which fits the present data set
consisting of three species. The purpose of this first
step was to select purebred individuals of each
species and use them for simulations in the next
steps of the study. An individual was characterized
as purebred only if it was assigned to the same
species cluster by both clustering methods. The
value of 0.875 was chosen as a threshold of
membership proportion (q), with each individual
with q ≥ 0.875 being characterized as purebred. This
threshold value was used only for the preliminary
analyses (the choice of threshold values for the
simulation study is described in Chapter 2.6). It was
selected assuming that first generation hybrids are
expected to have q-values of 0.50 to each one of their
parental species and backcrossings 0.75 and 0.25
respectively. Therefore, a membership proportion of
0.375-0.625 would be expected for F1 hybrids,
0.625-0.875 for backcrossings and above 0.875 for
purebreds. A q-value of 0.875 was also used as a
threshold to distinguish purebreds in recent studies
using Bayesian clustering (Bohling et al. 2013,
Guichoux et al. 2013).
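The reasoning behind these class limits can be illustrated with a minimal sketch. It assumes only two parental gene pools, so that q and 1 − q describe the same individual; the function name classify_by_q and the treatment of values lying exactly on a limit are assumptions made for illustration and are not part of the original analysis.

```python
def classify_by_q(q):
    """Map a membership proportion q (to one of two clusters) to a class.

    Expected q: purebred 1.0, backcross 0.75 (or 0.25), F1 hybrid 0.5.
    The class limits 0.625 and 0.875 are the midpoints between these
    expectations. Boundary handling (>=) is an assumption of this sketch.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie between 0 and 1")
    # Fold q so that the larger of the two parental contributions is used.
    major = max(q, 1.0 - q)
    if major >= 0.875:
        return "purebred"
    if major >= 0.625:
        return "backcross"
    return "F1"


for q in (0.95, 0.75, 0.50, 0.25):
    print(q, classify_by_q(q))
```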
STRUCTURE analysis was performed choosing the
admixture model and correlated allele frequencies.
The number of assumed subpopulations (K) was set
between 1 and 20. For each K, ten independent runs
were performed applying 100,000 burn-in
replications followed by 100,000 MCMC iterations.
All runs made for this study were performed using
the on-line platform of the Oslo University (Kumar et
al. 2009) which applied the version 2.3 of
STRUCTURE at the time of the analyses. In order to
choose the most appropriate number of clusters, the
method of Evanno et al. (2005) was implemented.
According to this method, ΔK, an ad-hoc statistic
based on the rate of change of the maximum
posterior probability of data was calculated for each
value of K. The value of K for which ΔK is maximized
indicates the uppermost hierarchical level of
population subdivision. Notably, in the case of
complex hierarchical schemes – for instance when
genetic differentiation among different pairs of
clusters varies strongly – ΔΚ may not detect all
clusters from the beginning (Evanno et al. 2005).
Thus, in order to detect further hidden within-group
clustering, subsequent STRUCTURE analyses
(applying the same settings) were performed using
individuals having been assigned as purebreds (q ≥
0.875) in the first analysis, as suggested by Evanno
et al. (2005). These within-group analyses were
continued until no meaningful population
subdivision was supported by the program results.
ΔΚ analyses were performed using the on-line
platform STRUCTURE HARVESTER (Earl and
vonHoldt 2012).
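For readers unfamiliar with the ΔK statistic, the following sketch shows one way it can be computed from the lnP(D) values of replicate runs. It is not the code of STRUCTURE HARVESTER; it assumes that the second-order difference is taken on the run means and divided by the standard deviation of lnP(D) at the same K, and the input values in the example are invented.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} from {K: [lnP(D) of each replicate run]}.

    delta K = |L(K+1) - 2 L(K) + L(K-1)| / sd(L(K)), where L(K) is the
    mean lnP(D) over runs. The statistic is only defined for K values
    that have both neighbours, so it cannot be evaluated for the
    smallest and largest K tested.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_by_k[k])
            if sd > 0:  # skip K values without variation among runs
                second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
                result[k] = second_diff / sd
    return result


# Hypothetical lnP(D) values for two runs per K (illustration only).
example = {1: [-1000, -1002], 2: [-800, -802], 3: [-790, -794], 4: [-785, -795]}
dk = delta_k(example)
best_k = max(dk, key=dk.get)
```

In this invented example the likelihood rises steeply up to K = 2 and flattens afterwards, so ΔK is largest for K = 2.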
Given that geographic coordinates of the individuals
were available, the option of spatial clustering of
individuals was chosen for BAPS analysis. This
method uses a prior that is stricter against an
increase of the number of clusters, thus preventing
detection of spurious clusters due to stochastic
fluctuations of allele frequencies (Corander et al.
2008b). The maximum number of assumed
subpopulations (K) was set from 2 to 20. An analysis
assuming K = 1 is not meaningful in BAPS. Ten
independent runs for each value of K were
performed. First, a mixture analysis was carried out
to assign individuals to clusters and to define the
most appropriate clustering solution. Second, based
on mixture data, an admixture analysis was
performed in order to calculate membership
proportions (admixture coefficients) of each
individual to each cluster. The default settings of the
program were used for this analysis. The minimum
size of populations to be taken into account was set
to 5, while 50 iterations were applied to estimate the
admixture coefficients of the individuals, 50
reference individuals from each population were
used and, finally, 10 iterations were performed in
order to estimate the admixture coefficients of the
reference individuals.
2.4. Simulations, genetic diversity and phylogenetic
relationships
Subsequently, purebreds, as well as first generation
hybrids and backcrossings were simulated based on
the purebreds characterized previously. In
particular, one hundred purebred individuals from
each species group, characterized in the first step,
were randomly chosen and were used as input for
the program HYBRIDLAB (Nielsen et al. 2006). The
following groups of genotypes were simulated using
this software: (a) 100, 500 and 1000 purebred
individuals of each species, (b) 100 first generation
hybrids between each species pair and (c) 100
backcrossings of first generation hybrids with each
one of the parental species. Locus diversity in each
species was analyzed using the three groups of
purebred individuals, upon which simulations were
based. To calculate number of alleles per locus (na),
observed (Ho) and expected (He) heterozygosity, as
well as inbreeding coefficients (FIS; Weir and
Cockerham 1984) the software Genetix v. 4.05.02
(Belkhir et al. 2004) was used. Moreover, in order to
investigate phylogenetic relationships among the
three species, an unrooted neighbor-joining tree
based on pairwise FST values was constructed using
the software POPTREE2 (Takezaki et al. 2010).
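The principle of such genotype simulations can be sketched as follows. This is not the HYBRIDLAB implementation; it assumes that each simulated diploid genotype receives, at every locus, one allele drawn at random from each of two parental samples, so that alleles are drawn in proportion to their observed frequencies. Function names and the toy allele sizes are hypothetical.

```python
import random


def allele_pools(genotypes):
    """Return, per locus, the list of all allele copies seen in a sample."""
    n_loci = len(genotypes[0])
    return [[a for g in genotypes for a in g[locus]] for locus in range(n_loci)]


def simulate_offspring(parents_a, parents_b, n, rng):
    """Simulate n diploid genotypes from two parental samples.

    Each genotype is a list of (allele, allele) tuples, one per locus.
    The first allele is drawn from sample A, the second from sample B.
    """
    pools_a = allele_pools(parents_a)
    pools_b = allele_pools(parents_b)
    return [
        [(rng.choice(pa), rng.choice(pb)) for pa, pb in zip(pools_a, pools_b)]
        for _ in range(n)
    ]


rng = random.Random(1)
# Toy single-locus samples with invented allele sizes.
robur = [[(101, 103)], [(101, 101)]]
petraea = [[(120, 122)], [(122, 122)]]

purebred = simulate_offspring(robur, robur, 100, rng)    # both alleles from one species
f1 = simulate_offspring(robur, petraea, 100, rng)        # first generation hybrids
backcross = simulate_offspring(f1, robur, 100, rng)      # F1 x parental species
```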
2.5. Effect of sample size and phylogenetic
relationships on clustering
In the following step, simulated purebred genotypes
were used to investigate sampling effects on
clustering, as calculated by each one of the two
Bayesian methods. Each run of BAPS or STRUCTURE
was based on an input file consisting of 100, 500 or
1000 simulated purebreds of each species.
The following configurations of sample size were tested:
(a) 1000/100/100, (b) 1000/500/100, (c) 1000/500/500,
(d) 1000/1000/100, (e) 1000/1000/500 and (f) 1000/1000/1000. These
configurations were tested for all possible species
combinations, in order to investigate the effect of
phylogeny (e.g. whether the presence of related
species in the small groups leads to different
clustering solution in comparison to inputs with the
least related species forming the small groups). In
BAPS, clustering of individuals (without spatial
information) was carried out, since simulated
individuals have no meaningful coordinates for
spatial clustering. All other parameters for the analysis were the
same as those used in the preliminary analysis (see
Chapter 2.3). The maximum number of assumed
subpopulations (K) was set from 2 to 10. Ten
independent runs for each value of K were
performed. In addition, ten independent runs with
fixed K = 2 were performed in order to test whether
phylogenetic relationships are reflected in the
results, i.e. whether the most related species cluster
together (by choosing variable K values, only the
optimal solution in terms of log-likelihood is
presented in the output).
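The input files implied by "all possible species combinations" can be enumerated with a short sketch. It assumes that exchanging two species with the same sample size does not produce a new input; the species codes follow the abbreviations RO, PE and PU used in the figures and tables, and the function name is hypothetical.

```python
from itertools import permutations

SPECIES = ("RO", "PE", "PU")
SIZE_CONFIGS = [
    (1000, 100, 100), (1000, 500, 100), (1000, 500, 500),
    (1000, 1000, 100), (1000, 1000, 500), (1000, 1000, 1000),
]


def distinct_inputs(sizes, species=SPECIES):
    """Return the distinct species-to-sample-size assignments for one configuration."""
    seen = set()
    for order in permutations(species):
        # A frozenset of (species, size) pairs ignores the order of equal sizes.
        seen.add(frozenset(zip(order, sizes)))
    return sorted((dict(s) for s in seen), key=lambda d: sorted(d.items()))


counts = [len(distinct_inputs(sizes)) for sizes in SIZE_CONFIGS]
```

Under this assumption, a configuration with two equal group sizes yields three distinct inputs, one with three different sizes yields six, and the balanced configuration yields one.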
Similarly to BAPS, ten independent runs for each K
value between 1 and 10 were carried out in
STRUCTURE, maintaining the same analysis
parameters mentioned previously. The uppermost
hierarchical level of population subdivision was
calculated following the ΔK method with the on-line
software STRUCTURE HARVESTER, as described
above. In order to find the optimal cluster alignment
and calculate the average membership proportion
among the 10 runs for each K, the software CLUMPP
v. 1.1.2 (Jakobsson and Rosenberg 2007) was
applied. For producing graphics with individual
membership proportions based on either BAPS or
STRUCTURE, the cluster visualization program
DISTRUCT (Rosenberg 2004) was used.
2.6. Diagnostic power of loci, efficiency, accuracy and
performance of Bayesian assignment
In order to test marker set efficiency, loci were
sorted by their diagnostic power. Two different
measures were used: Wright's FST and
informativeness of assignment (In), introduced by
Rosenberg et al. (2003). The latter measure was
calculated using the software INFOCALC (Rosenberg
2005). Calculation of FST per locus was made using
the on-line software LOSITAN (Antao et al. 2008).
For both analyses, the genotypes of the 100
purebred tree individuals of each species selected in
the first stage of the study were used. Separate
analyses were run (1) including all three species and
(2) pairwise, resulting in three species pairs (Q.
robur – Q. petraea, Q. robur – Q. pubescens and Q.
petraea – Q. pubescens).
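As an illustration of the second measure, the sketch below computes In for a single locus from the allele frequencies of K populations, following the definition of Rosenberg et al. (2003) as I understand it: the entropy of the mean allele frequencies minus the mean within-population entropy. It is not the INFOCALC code, and the function name and input format are assumptions.

```python
from math import log


def informativeness(freqs):
    """Informativeness for assignment (In) of one locus.

    freqs: one dict {allele: frequency} per population.
    In = sum_j ( -p_j ln p_j + sum_i (p_ij / K) ln p_ij ), where p_j is
    the mean frequency of allele j over the K populations and
    0 ln 0 is taken as 0.
    """
    k = len(freqs)
    alleles = set().union(*freqs)
    total = 0.0
    for allele in alleles:
        p_ij = [f.get(allele, 0.0) for f in freqs]
        p_j = sum(p_ij) / k
        if p_j > 0:
            total -= p_j * log(p_j)
        total += sum(p / k * log(p) for p in p_ij if p > 0)
    return total


# Two populations fixed for different alleles: fully diagnostic locus.
diagnostic = informativeness([{"a": 1.0}, {"b": 1.0}])
# Identical allele frequencies: no information for assignment.
uninformative = informativeness([{"a": 0.3, "b": 0.7}, {"a": 0.3, "b": 0.7}])
```

For two populations the measure ranges from 0 (identical frequencies) to ln 2 (no shared alleles); loci can then be ranked by decreasing value.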
After sorting the loci by decreasing FST and In values,
Bayesian clustering analyses were performed using
BAPS and STRUCTURE to test the performance of
the two clustering methods with increasing number
of loci included. The following species configurations
were used: (1) three species (including 1000
simulated genotypes for each species and, for all possible
simulated first generation hybrids and
backcrossings, 100 simulated genotypes each) and
(2) two species, using all three possible species
combinations (including 1000 purebreds, 100
simulated F1 and 100 backcrossings with each one
of the parental species). First, analyses were carried
out using genotypic data from the first locus (i.e.
with the highest FST or In) and, subsequently, loci
were added one by one to carry out further runs,
following the aforementioned order (i.e. by
decreasing FST or In). This resulted in 21 different
analyses for each configuration. Ten independent
runs were performed with each software by setting
K = 3 when all three species were included and K = 2
when species pairs were analyzed. The remaining
settings for both software programs were the same
as in the analysis examining the effects of sample
size (see above). In order to find the optimal
alignment and the average membership proportion
among the 10 runs and to visualize the results, the
software programs CLUMPP and DISTRUCT were
used as described above.
Subsequently, simulated individuals were assigned
to groups of purebreds and hybrids. Performance of
Bayesian analyses was evaluated based on different
measures described in Vähä and Primmer (2006),
which have been also used in later simulation based
studies (e.g. Burgarella et al. 2009; Guichoux et al.
2013). Efficiency was defined as the proportion of
individuals in a group that were correctly identified.
For instance, efficiency for Q. robur was calculated as
the number of simulated Q. robur genotypes
correctly assigned to the Q. robur cluster divided by
the total number of simulated Q. robur genotypes.
Accuracy was defined as the proportion of
individuals assigned by the clustering analysis to a
certain group that truly belong to this group (Vähä
and Primmer 2006). For instance, Q. robur accuracy
is the number of simulated Q. robur genotypes
correctly assigned to their species cluster divided by
the overall number of individuals assigned as Q.
robur (including simulated hybrids, backcrossings
and purebreds of other species assigned to the Q.
robur cluster). Finally, total performance was
calculated as the product of efficiency multiplied by
the accuracy for a given category.
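The three measures can be written down compactly. The sketch below assumes that the true category of every simulated individual and the category assigned by the clustering analysis are available as two parallel lists; the function name, the labels and the toy data are hypothetical.

```python
def assignment_metrics(true_labels, assigned_labels, group):
    """Return (efficiency, accuracy, performance) for one category.

    efficiency  = correctly assigned members / all true members
    accuracy    = correctly assigned members / all individuals assigned to it
    performance = efficiency * accuracy
    """
    pairs = list(zip(true_labels, assigned_labels))
    n_true = sum(t == group for t, _ in pairs)
    n_assigned = sum(a == group for _, a in pairs)
    n_correct = sum(t == group and a == group for t, a in pairs)
    efficiency = n_correct / n_true if n_true else 0.0
    accuracy = n_correct / n_assigned if n_assigned else 0.0
    return efficiency, accuracy, efficiency * accuracy


# Toy example: 4 Q. robur, 2 Q. petraea and 2 hybrids (invented data).
true_labels = ["RO", "RO", "RO", "RO", "PE", "PE", "ROxPE", "ROxPE"]
assigned = ["RO", "RO", "RO", "ROxPE", "RO", "PE", "RO", "ROxPE"]
eff, acc, perf = assignment_metrics(true_labels, assigned, "RO")
```

In this toy example three of the four true Q. robur genotypes are recovered (efficiency 0.75), but two other individuals are also assigned to Q. robur, so only three of five assignments are correct (accuracy 0.6).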
In order to choose optimal threshold values of
membership proportion (q), several critical q-values
from 0.5 to 0.975 (in steps of 0.025) were tested.
Maximization of overall performance was used as a
criterion to select the optimal threshold values.
Given that q-values of the simulated first generation
hybrids and backcrosses were largely overlapping
even when all loci were included (see Results), only
groups of purebreds and hybrids were defined.
Therefore, in the case of three-species configuration,
individuals were assigned to three purebred (one for
each species) and three hybrid groups (Q. robur – Q.
petraea, Q. robur – Q. pubescens and Q. petraea – Q.
pubescens). In order for an individual to be assigned
as purebred, membership proportion higher than
the threshold q-value to any cluster was required. If
membership proportion to any cluster was lower,
then the individual was assigned as hybrid. The two
clusters with the highest membership proportion
were considered as the parental species of the
hybrid.
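The assignment rule for the three-species configuration can be summarized in a few lines. The sketch assumes that the membership proportions of one individual are given as a dictionary of cluster names and q-values and that a value exactly equal to the threshold does not count as purebred; the names assign_individual and THRESHOLDS are introduced here for illustration only.

```python
# Candidate thresholds from 0.5 to 0.975 in steps of 0.025.
THRESHOLDS = [round(0.5 + 0.025 * i, 3) for i in range(20)]


def assign_individual(q, threshold):
    """Assign one individual as purebred or hybrid.

    q: dict {cluster: membership proportion}.
    Returns ("purebred", cluster) if the highest q exceeds the threshold,
    otherwise ("hybrid", (cluster_1, cluster_2)) with the two clusters of
    highest membership taken as the parental species.
    """
    ranked = sorted(q, key=q.get, reverse=True)
    if q[ranked[0]] > threshold:
        return ("purebred", ranked[0])
    return ("hybrid", tuple(sorted(ranked[:2])))


pure = assign_individual({"RO": 0.95, "PE": 0.03, "PU": 0.02}, 0.875)
hybrid = assign_individual({"RO": 0.50, "PE": 0.10, "PU": 0.40}, 0.875)
```

Repeating the assignment for every value in THRESHOLDS and keeping the value that maximizes overall performance corresponds to the threshold selection described above.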
After choosing threshold q-values, comparisons of
efficiency, accuracy and method performance were
made. First, in order to compare FST and In, the rate
of increase of the three measures with increasing
number of loci was observed. Second, the rate of
increase and final value of the three measures were
compared between the two Bayesian clustering
methods used. Third, efficiency, accuracy and total
performance were compared among the tested
species pairs, but also between pairwise and three-species configurations.
3. RESULTS
3.1. Preliminary Bayesian analyses
At a first stage, I performed preliminary Bayesian
analyses with STRUCTURE and BAPS, in order to
choose purebred individuals for genotype
simulations. In the first STRUCTURE run, including
all sampled individuals, the statistic ΔΚ was
maximized for two assumed subpopulations (K = 2;
Fig. 1). For K = 2, one cluster included Q. robur
individuals and the other consisted of Q. petraea and
Q. pubescens. For K = 3 clustering solutions among
runs differed. For two runs, species could be
separated. For eight runs, Q. petraea and Q.
pubescens individuals clustered together, while Q.
robur was subdivided and its individuals were
admixed (Fig. 2). This subdivision was biologically
unreasonable. However, ln posterior probability of
data – lnP(D) – was on average higher for the runs
with proper species identification than for those
with biologically unreasonable partition. This
resulted in two bulks of points, when lnP(D) for each
run was plotted as a function of K (Fig. 1). Similar
bimodality was also observed when K was set to 4.
When K was set to 5, uniform results among all ten
runs were obtained. Quercus petraea and Q.
pubescens formed separate clusters and individuals
of them were assigned to their own species clusters
with relatively high membership proportions. ΔΚ
presented a secondary peak for K = 5 (Fig. 1).
Given that ΔK was maximized for two assumed
subpopulations, I used results for K = 2 to perform
further analyses within the clusters derived
for K = 2, as suggested by Evanno et al. (2005). By
analyzing individuals assigned to the common
cluster of Q. petraea and Q. pubescens (q ≥ 0.875), the
two species could be consistently identified and ΔK
was maximal for K = 2. Thus, I used results from this
run to make final species assignments. In total, I
assigned 522 individuals to Q. petraea and 108 to Q.
pubescens (q ≥ 0.875). Within Q. robur, no further
subdivision was revealed by the analysis, as
posterior likelihood of data did not increase with increasing
K. Thus, I assigned 1226 individuals to Q.
robur (q ≥ 0.875) based on the analysis for K = 2
including all individuals.
Fig. 1 – Results of the first preliminary run with
STRUCTURE carried out using sampled individual trees.
Ln posterior probability of data (lnP(D)) for each run and
values of the statistic ΔK are presented for different
numbers of assumed subpopulations (K = …).
By performing spatial clustering of individuals with
BAPS, I found the optimal clustering solution for K =
3, corresponding to the three species. After a
subsequent admixture analysis, I assigned 1248
individuals to Q. robur, 600 to Q. petraea and 128 to
Q. pubescens (q ≥ 0.875). Among these individuals,
1224, 522 and 105 had been also assigned as Q.
robur, Q. petraea and Q. pubescens, respectively,
using STRUCTURE. I used 100 randomly chosen
individuals of each group to produce genotype
simulations, since they had been identified as
purebred by both analysis methods. Furthermore, I
used these individuals to calculate diversity
parameters and phylogenetic relationships among
species.
Fig. 2 – STRUCTURE clustering results of preliminary analysis carried out based on genotyped tree individuals. Each
individual is represented with a vertical bar and each inferred cluster is marked with a different gray tone. For K = 3
and K = 4, two different clustering solutions were found. The number of runs, in which each one of these solutions was
found, is given on the left hand side of the figure. Species assignment based on morphology is given below the diagram
(RO = Quercus robur, PE = Q. petraea, PU = Q. pubescens).
3.2. Genetic diversity and differentiation
Whereas genetic variability was generally high,
some reduction of expected heterozygosity (He) in
one of the species was observed in some cases. At
loci QrZAG96 and PIE227 Q. robur presented lower
He values in comparison to the other two
species. Similarly, He was reduced in Q. petraea at
locus QrZAG112. On average, genetic diversity in
terms of He was highest in Q. pubescens. On the other
hand, Q. robur displayed a higher number of alleles
per locus than the other two species. Furthermore,
EST-SSRs showed reduced values of both number of
alleles per locus and expected heterozygosity in
comparison to non-genic SSRs. Details about the
analysis of genetic diversity are provided as
supplementary material (Online Resource 2).
Phylogenetic relationships among the three species
are presented by means of an FST-based unrooted NJ-tree (Fig. 3). Results support phylogenetic affinity
between Q. petraea and Q. pubescens, while Q. robur
appears to be genetically more differentiated. The
measured pairwise FST values were 0.120 between Q.
robur and Q. petraea, 0.098 between Q. robur and Q.
pubescens and 0.050 between Q. petraea and Q.
pubescens.
Fig. 3 – Phylogenetic relationships among the three study
species visualized by an unrooted neighbour-joining
phylogenetic tree based on pairwise FST values between
species.
3.3. Effect of phylogenetic relationships and sample
size on Bayesian clustering
In order to test the effect of phylogenetic
relationships and sample size on Bayesian
clustering, I used three different group sizes of
simulated purebreds in all possible species
combinations. In all tested configurations of group
sizes and species combinations, clustering of
individuals using BAPS inferred K = 3 as the optimal
population subdivision. For K = 3, the three species
were correctly identified and individuals were
assigned to their clusters with high membership
proportions (Online Resource 3). On the contrary,
application of the STRUCTURE software did not
always resolve the three species. The statistic ΔK
was maximized for K = 2 and showed secondary
peaks in some cases, resembling the previously
described preliminary analysis (details on ΔK, as
well as for lnP(D) of each run, are presented in
Online Resource 4). For K = 2, in most cases Q.
petraea and Q. pubescens clustered together, as
expected given their phylogenetic affinity. However,
when sample sizes were unbalanced, STRUCTURE
tended to assign the smallest
groups to the same cluster. For instance, when one
group of 1000 individuals and two groups of 100
individuals (corresponding to the three study
species) were used as input, the two small groups
always clustered together at K = 2 for all species
combinations, irrespective of their species identity.
Though less frequent, clustering
inconsistent with species phylogenies could also be
observed when running the data with BAPS with a fixed
K = 2. A comparison of the results from the two
software programs for K = 2 is presented in Table 1.

Table 1 – Comparison of group membership proportions between BAPS and STRUCTURE for different species and
sample sizes, assuming two subpopulations (K = 2). Cases in which the phylogenetically more related species Q. petraea
and Q. pubescens did not cluster together are highlighted with bold letters. In cases with variable clustering solutions
among independent runs, membership proportions are underscored. RO = Quercus robur, PE = Quercus petraea, PU =
Quercus pubescens.

Configuration                BAPS, simulated group      STRUCTURE, simulated group
Sample size      Species     1       2       3          1       2       3
1000/100/100     RO-PE-PU    1.000   0.000   0.000      0.995   0.010   0.080
1000/100/100     PE-RO-PU    1.000   0.000   0.045      0.992   0.004   0.132
1000/100/100     PU-RO-PE    1.000   0.000   0.971      0.992   0.005   0.052
1000/500/100     RO-PE-PU    0.999   0.001   0.006      0.995   0.006   0.051
1000/500/100     RO-PU-PE    1.000   0.000   0.019      0.995   0.004   0.042
1000/500/100     PE-RO-PU    1.000   0.001   0.957      0.993   0.008   0.816
1000/500/100     PE-PU-RO    1.000   0.006   0.000      0.992   0.020   0.012
1000/500/100     PU-RO-PE    1.000   0.001   0.970      0.996   0.006   0.922
1000/500/100     PU-PE-RO    1.000   0.001   0.008      0.986   0.010   0.041
1000/500/500     RO-PE-PU    1.000   0.001   0.000      0.995   0.008   0.006
1000/500/500     PE-RO-PU    1.000   0.001   0.999      0.980   0.017   0.660
1000/500/500     PU-RO-PE    1.000   0.001   0.998      0.995   0.007   0.985
1000/1000/100    RO-PE-PU    0.999   0.000   0.045      0.994   0.006   0.152
1000/1000/100    RO-PU-PE    1.000   0.000   0.031      0.995   0.004   0.072
1000/1000/100    PE-PU-RO    0.998   0.004   0.015      0.991   0.018   0.489
1000/1000/500    RO-PE-PU    0.999   0.000   0.001      0.995   0.006   0.010
1000/1000/500    RO-PU-PE    0.999   0.000   0.003      0.995   0.005   0.014
1000/1000/500    PE-PU-RO    0.999   1.000   0.001      0.698   0.709   0.007
1000/1000/1000   RO-PE-PU    1.000   0.001   0.000      0.995   0.009   0.006

The final publication is available at http://link.springer.com/article/10.1007%2Fs11295-013-0680-2
In contrast to BAPS, STRUCTURE did not always
distinguish the three species when three
subpopulations were assumed (K = 3). For example,
when a combination of 1000 simulated Q.
robur, 100 Q. petraea and 100 Q. pubescens was included in the
analysis, the latter two species were assigned to the
same cluster in all ten runs, while the large Q. robur
group was subdivided into two biologically
unreasonable clusters and individuals were admixed
between them. For the same configuration
(1000/100/100), when Q. petraea or Q. pubescens
formed the large cluster, results among the ten
runs performed for K = 3 were not consistent. Four
runs correctly resolved the three species, while six
merged the two small and, in this case,
phylogenetically less related groups into the same
cluster (see detailed results of group membership
proportions for all clustering solutions in Online
Resource 5). Again, runs with the proper clustering
solution presented a higher lnP(D) (Online Resource
4).

Table 2 – Efficiency, accuracy and total performance of BAPS and STRUCTURE analyses of simulated purebreds and hybrids. Configurations of all three species and all three
pairwise combinations including simulated hybrids and backcrosses were used. Individuals were assigned to groups of purebreds and hybrid groups using a threshold of q =
0.90. RO = Quercus robur, PE = Quercus petraea, PU = Quercus pubescens.

Program    Configuration  Measure  RO       ROxPE  PE       PExPU  PU       ROxPU  Average  Average
                                   Purebr.  Hybr.  Purebr.  Hybr.  Purebr.  Hybr.  Purebr.  Hybr.
BAPS       RO-PE-PU       Eff.     1.000    0.498  0.999    0.200  0.997    0.475  0.999    0.391
BAPS       RO-PE-PU       Acc.     0.851    0.943  0.866    0.938  0.830    0.973  0.849    0.951
BAPS       RO-PE-PU       Perf.    0.851    0.470  0.865    0.188  0.828    0.462  0.848    0.372
BAPS       RO-PE          Eff.     0.999    0.550  1.000    -      -        -      1.000    0.550
BAPS       RO-PE          Acc.     0.932    0.994  0.942    -      -        -      0.937    0.994
BAPS       RO-PE          Perf.    0.931    0.547  0.942    -      -        -      0.936    0.547
BAPS       RO-PU          Eff.     1.000    -      -        -      1.000    0.517  1.000    0.517
BAPS       RO-PU          Acc.     0.926    -      -        -      0.939    1.000  0.932    1.000
BAPS       RO-PU          Perf.    0.926    -      -        -      0.939    0.517  0.932    0.517
BAPS       PE-PU          Eff.     -        -      0.999    0.220  0.997    -      0.998    0.220
BAPS       PE-PU          Acc.     -        -      0.916    0.875  0.875    -      0.895    0.957
BAPS       PE-PU          Perf.    -        -      0.915    0.210  0.872    -      0.893    0.210
STRUCTURE  RO-PE-PU       Eff.     0.992    0.830  0.985    0.643  0.956    0.753  0.978    0.742
STRUCTURE  RO-PE-PU       Acc.     0.954    0.883  0.951    0.757  0.917    0.926  0.940    0.855
STRUCTURE  RO-PE-PU       Perf.    0.946    0.733  0.937    0.487  0.876    0.698  0.919    0.635
STRUCTURE  RO-PE          Eff.     0.996    0.843  0.996    -      -        -      0.996    0.843
STRUCTURE  RO-PE          Acc.     0.971    0.969  0.983    -      -        -      0.977    0.969
STRUCTURE  RO-PE          Perf.    0.967    0.817  0.979    -      -        -      0.973    0.817
STRUCTURE  RO-PU          Eff.     0.997    -      -        -      1.000    0.807  0.999    0.807
STRUCTURE  RO-PU          Acc.     0.976    -      -        -      0.968    0.988  0.972    0.988
STRUCTURE  RO-PU          Perf.    0.973    -      -        -      0.968    0.797  0.970    0.797
STRUCTURE  PE-PU          Eff.     -        -      0.993    0.623  0.966    -      0.980    0.623
STRUCTURE  PE-PU          Acc.     -        -      0.959    0.820  0.932    -      0.945    0.820
STRUCTURE  PE-PU          Perf.    -        -      0.953    0.511  0.900    -      0.926    0.511
Setting K = 3, similar inconsistencies were observed
with other configurations as well. For instance,
when I used group sizes of 1000, 500 and 100
individuals for the analysis, the two smallest groups
again tended to cluster together (with bimodality of
lnP(D)), while biologically unreasonable subdivision
of the large simulated group was observed. This
merging of small clusters was more common when the
small groups of 500 and 100 simulated purebreds
were formed by Q. petraea and Q. pubescens (in
either combination). On the contrary, species were
consistently correctly identified when the small
clusters were formed by Q. robur and Q. petraea. The
correct clustering solution was also given by using
1000 simulated Q. petraea, 500 Q. robur genotypes
and 100 Q. pubescens, but not in the opposite case
(1000 Q. petraea – 500 Q. pubescens – 100 Q. robur).
Furthermore, small groups were again merged into
one cluster in two particular cases for the
configurations of 1000/500/500 and
1000/1000/100 (Online Resource 5). In both cases,
this was due to the occurrence of a common Q.
petraea – Q. pubescens cluster for some runs when
small groups were formed by these two species.
Finally, STRUCTURE gave the correct clustering
solution when I used a less unbalanced
configuration of 1000/1000/500 or equally large
groups (1000/1000/1000).
3.4. Efficiency and accuracy of Bayesian assignment,
diagnostic power of loci and effect of species
configuration
Even using all 21 loci, the range of q-values for the
simulated backcrosses overlapped greatly with that of
F1 hybrids and, to a lesser extent, with that of purebreds
(Online Resource 6). Therefore, I decided to use a
single threshold value to distinguish between
purebreds and hybrids. In STRUCTURE, using a
threshold value of q = 0.90 led to maximum total
performance in the three-species configuration and in
most of the two-species configurations (Online
Resource 7). With BAPS, total performance
increased for threshold q-values up to 0.7-0.825,
depending on the particular case, and then remained
steady for higher values (Online Resource 7). This is
due to the fact that not a single individual, purebred
or hybrid, received a q-value between 0.825 and
0.999 under any configuration (BAPS assigned
individuals either as purebreds with q = 1.000 or as
admixed with q < 0.825). Thus, a threshold q = 0.90
was suitable in order to achieve maximum total
performance in BAPS, too.
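The threshold-based assignment and the three measures reported in Table 2 can be illustrated with a short sketch. It assumes the usual definitions (efficiency = correctly assigned members of a class divided by all true members; accuracy = correctly assigned members divided by all individuals assigned to that class; total performance = efficiency × accuracy) and collapses the classes to "purebred" versus "hybrid". Function names and q-values are hypothetical and invented for illustration only.

```python
def classify(q_max, threshold=0.90):
    # An individual counts as purebred if its largest membership
    # proportion reaches the threshold, otherwise as hybrid.
    return "purebred" if q_max >= threshold else "hybrid"


def assignment_metrics(true_classes, q_values, target, threshold=0.90):
    """Return (efficiency, accuracy, performance) for one class."""
    assigned = [classify(q, threshold) for q in q_values]
    n_true = sum(1 for t in true_classes if t == target)
    n_assigned = sum(1 for a in assigned if a == target)
    n_correct = sum(1 for t, a in zip(true_classes, assigned)
                    if t == target and a == target)
    efficiency = n_correct / n_true if n_true else float("nan")
    accuracy = n_correct / n_assigned if n_assigned else float("nan")
    return efficiency, accuracy, efficiency * accuracy


# Hypothetical q-values (largest membership proportion per individual).
true_classes = ["purebred"] * 4 + ["hybrid"] * 4
q_values = [1.00, 0.97, 0.95, 0.85, 0.55, 0.60, 0.92, 0.70]

for target in ("purebred", "hybrid"):
    eff, acc, perf = assignment_metrics(true_classes, q_values, target)
    print(target, round(eff, 3), round(acc, 3), round(perf, 3))
```

The Perf. rows of Table 2 are consistent with this product, for example 0.498 × 0.943 ≈ 0.470 for ROxPE hybrids analysed with BAPS in the three-species configuration.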
In order to test the rate of increase in efficiency and
accuracy of Bayesian clustering analysis with
increasing number of loci, I first calculated two
measures of locus-specific diagnostic power,
informativeness of assignment (In) and Wright's FST.
The two measures resulted in different rank orders.
I used both rank lists to run Bayesian analyses,
starting from the most informative locus and gradually
adding loci of lower diagnostic power (Online
Resource 8). In general, use of the rank list based on
In resulted in a steeper increase of efficiency and
accuracy (with increasing number of loci included)
than the list based on FST, both when I included all species and when I
carried out analyses pairwise (Online Resource 9).
In all cases, purebred efficiency surpassed 80 % with
use of the five most informative loci. Increase of
efficiency and accuracy varied depending on the
species compared and the software used.
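As an illustration of how loci can be ranked by diagnostic power, the sketch below computes In following the formula of Rosenberg et al. (2003) and, as a stand-in for Wright's FST, the heterozygosity-based ratio (HT − HS)/HT. The latter estimator, the function names, the locus names and the allele frequencies are all assumptions made for this example and are not taken from the study.

```python
import math


def informativeness_for_assignment(freqs):
    """In for one locus.

    freqs: list of K lists with allele frequencies (same allele order).
    In = sum_j [ -p_j ln p_j + sum_i (p_ij / K) ln p_ij ],
    where p_j is the mean frequency of allele j over the K groups.
    """
    k = len(freqs)
    total = 0.0
    for j in range(len(freqs[0])):
        p_mean = sum(pop[j] for pop in freqs) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for pop in freqs:
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total


def fst_from_frequencies(freqs):
    """(HT - HS) / HT from allele frequencies (assumed estimator)."""
    k = len(freqs)
    hs = sum(1 - sum(p * p for p in pop) for pop in freqs) / k
    mean = [sum(pop[j] for pop in freqs) / k for j in range(len(freqs[0]))]
    ht = 1 - sum(p * p for p in mean)
    return (ht - hs) / ht if ht > 0 else 0.0


# Hypothetical loci with allele frequencies in two groups.
loci = {
    "locus_A": [[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]],
    "locus_B": [[0.5, 0.5, 0.0], [0.4, 0.5, 0.1]],
    "locus_C": [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]],
}

rank_in = sorted(loci, key=lambda l: informativeness_for_assignment(loci[l]),
                 reverse=True)
rank_fst = sorted(loci, key=lambda l: fst_from_frequencies(loci[l]),
                  reverse=True)
print(rank_in)
print(rank_fst)
```

Loci would then be added to the clustering analysis in the order given by either list, starting with the first entry.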
In general, total performance of the analysis was
higher with STRUCTURE than with BAPS (Table 2).
Regarding simulated purebreds, I found a higher
efficiency, but not accuracy, when I applied BAPS.
This means that BAPS correctly assigned more
simulated purebreds to their own group than
STRUCTURE. Moreover, BAPS required fewer loci than
STRUCTURE to achieve the
same value of purebred detection efficiency (Online
Resource 9). However, STRUCTURE provided higher
accuracy of purebred assignment, as BAPS tended to
wrongly assign simulated hybrids or backcrosses
as purebreds. In particular, it assigned most
backcrosses as purebreds with q = 1.00 in every
species configuration, even when all 21 loci
were used (Online Resource 10). With STRUCTURE,
the majority of both hybrids and backcrosses
showed q < 0.90.
Phylogenetic relationships played a significant role
in method performance, especially concerning
hybrids. Hybrid detection efficiency was generally
higher when these arose from combinations of
phylogenetically most differentiated species (Table
2). In both two- and three-species configurations,
method performance was highest for hybrids
between Q. robur and Q. petraea, whereas in the case
of Q. robur and Q. pubescens it was slightly lower.
Method performance for hybrids between Q. petraea
and Q. pubescens was markedly lower in comparison
to the other two species combinations (Table 2).
Inclusion of all three species into the analysis
resulted in a slight reduction of method
performance. This was true for both purebreds and
hybrids, with both BAPS and STRUCTURE (Table 2).
When all three species were included, backcrosses
of Q. robur – Q. pubescens hybrids with Q. robur were
relatively often assigned by STRUCTURE as hybrids
of Q. robur with Q. petraea (12 % of such
backcrosses). Likewise, backcrosses of Q. robur –
Q. petraea hybrids with Q. petraea were assigned as
hybrids of Q. robur with Q. pubescens (6 % of the
cases; see also Online Resource 10).
4. DISCUSSION
The first part of the present study comprised
analyses based on simulated purebreds of the three
study species in various configurations of species
and sample sizes. A frequent observation throughout
these analyses was instability of clustering solutions
among independent runs at a particular K with
STRUCTURE and the occurrence of biologically
unreasonable partition. As Bayesian clustering
methods often use stochastic simulations and
unsupervised approaches (as in the runs with
simulated individuals presented here), analyses of
the same data may generally produce several
distinct solutions, even if the same initial conditions
are used in each run (Jakobsson and Rosenberg
2007). Thus, biologically unreasonable clustering
solutions observed here apparently occurred because
the STRUCTURE algorithm became stuck in
suboptimal solutions. By choosing the runs with the
highest posterior probability of the data, I could
distinguish runs with proper partition in several
cases of multimodality. This strategy has also been
followed elsewhere (e.g. Rosenberg et al. 2001,
Reeves and Richards 2011). However, the frequency
of runs with biologically unreasonable partition
increased when I used unbalanced sampling schemes.
In the most unbalanced configurations, not even a
single run resolved the three species for K = 3 (e.g.
configuration with 1000 simulated purebreds of Q.
robur, 100 of Q. petraea and 100 of Q. pubescens).
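The run-selection strategy described above can be sketched as follows. The example mirrors the case of four runs resolving the three species and six runs merging the small groups; the lnP(D) values, the solution labels and the function names are hypothetical and serve only to show the bookkeeping.

```python
def best_run(runs):
    """Return the replicate run with the highest lnP(D)."""
    return max(runs, key=lambda r: r["lnPD"])


def solution_counts(runs):
    """Count how many replicate runs produced each clustering solution."""
    counts = {}
    for r in runs:
        counts[r["solution"]] = counts.get(r["solution"], 0) + 1
    return counts


# Hypothetical replicate runs at K = 3 (invented lnP(D) values).
runs = [
    {"run": 1, "lnPD": -81250.4, "solution": "three_species"},
    {"run": 2, "lnPD": -81690.2, "solution": "small_groups_merged"},
    {"run": 3, "lnPD": -81688.7, "solution": "small_groups_merged"},
    {"run": 4, "lnPD": -81251.9, "solution": "three_species"},
    {"run": 5, "lnPD": -81691.5, "solution": "small_groups_merged"},
    {"run": 6, "lnPD": -81249.7, "solution": "three_species"},
    {"run": 7, "lnPD": -81689.9, "solution": "small_groups_merged"},
    {"run": 8, "lnPD": -81692.3, "solution": "small_groups_merged"},
    {"run": 9, "lnPD": -81252.3, "solution": "three_species"},
    {"run": 10, "lnPD": -81690.8, "solution": "small_groups_merged"},
]

print(solution_counts(runs))
print(best_run(runs)["solution"])
```

This only helps when at least one run reaches the proper partition; in the most unbalanced configurations described here, no run did.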
Besides sample size configuration, another factor
affecting the inferred clustering solution was the
phylogenetic relationships among species. The more
closely related Q. petraea and Q. pubescens showed a
tendency to cluster together at K = 2. Yet,
unbalanced sample sizes led to merging of the
smaller simulated groups irrespective of their
phylogenetic identity, thus obscuring the effect of
phylogenetic relationships. For example, analysis of
1000 simulated purebreds of Q. petraea, 100 of Q.
robur and 100 of Q. pubescens for K = 2 gave such a
result. By forcing K to 2, the same clustering was
inferred by BAPS for this specific configuration.
Nevertheless, clustering for K = 2 with BAPS fit
the phylogenetic relationships better than with
STRUCTURE (results not shown). Many studies
carried out with STRUCTURE have shown that
small sample size hampers subpopulation identification
(Rosenberg et al. 2002, Duminil et al. 2006,
Kalinowski 2011). This problem can especially occur
when the program is forced to assign individuals
into an inappropriately small number of clusters
(Kalinowski 2011).
Here, I show that the problem may persist when the
real K value is reached and, interestingly, even
above. Based on previous knowledge, Jakobsson &
Rosenberg (2007) state that biological factors may
cause multiple parts of the space of possible
membership coefficients to provide similarly
appropriate explanations for the data, which can
explain the occurrence of multimodality. Indeed, in
some case studies with real data, multimodality
could be explained through subtle genetic structure
within populations (or species) or isolation by
distance (Lepais et al. 2006, Höltken et al. 2012).
However, in the present study, I clearly show
that the occurrence of inappropriate partition for
unbalanced species configurations is due to
stochastic error. There is no biological explanation
for splitting a homogeneous group of simulated
purebreds when K is equal to the number of species.
Interestingly, for unbalanced configurations,
converging solutions were provided at a higher
number of K, for which the small groups were
consistently separated from each other and the
species represented by the largest group was
admixed (for example preliminary data analysis for
K = 5, Fig. 1). K values characterized by such
converging solutions showed an increased posterior
probability of data and a secondary ΔΚ peak (Online
Resource 3). Probably, the size of the inferred
clusters has, in turn, an influence on the outcome of
each run. More balanced sizes among the inferred
clusters may facilitate convergence and may result
in a lower rate of suboptimal solutions, whether
these are real clusters or spuriously admixed. In
any case, multimodality decreased in more balanced
configurations, but only the configuration with 1000
simulated purebreds of each species was totally free
of biologically unreasonable clustering solutions for
both K = 2 and K = 3.
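For readers unfamiliar with the ΔK statistic referred to above, the sketch below follows the definition of Evanno et al. (2005): the absolute second-order rate of change of lnP(D) with respect to K, divided by the standard deviation of lnP(D) among replicate runs at that K. As a simplification assumed here, the second difference is taken on the mean lnP(D) per K. All input values and the function name are hypothetical.

```python
from statistics import mean, stdev


def delta_k(lnpd_by_k):
    """Delta K for every K that has both neighbours K-1 and K+1.

    lnpd_by_k: dict mapping K to a list of lnP(D) values from
    replicate runs (at least two runs per K).
    """
    m = {k: mean(v) for k, v in lnpd_by_k.items()}
    result = {}
    for k in sorted(lnpd_by_k):
        if k - 1 in m and k + 1 in m:
            sd = stdev(lnpd_by_k[k])
            if sd > 0:  # Delta K is undefined when runs are identical
                result[k] = abs(m[k + 1] - 2 * m[k] + m[k - 1]) / sd
    return result


# Hypothetical lnP(D) values; runs at K = 3 disagree (multimodality).
lnpd = {
    1: [-60000.0, -60002.0, -59998.0],
    2: [-55000.0, -55001.0, -54999.0],
    3: [-54000.0, -54400.0, -54010.0],
    4: [-53900.0, -53950.0, -54300.0],
}

dk = delta_k(lnpd)
print(max(dk, key=dk.get))
```

Because the standard deviation among runs sits in the denominator, a K with unstable runs is penalized, which is one way in which multimodality at the true K can shift the ΔK maximum to a lower K.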
In contrast to STRUCTURE, BAPS analyses steadily
resulted in proper clustering solutions. In all 19
configurations tested, K = 3 was inferred as the most
likely partition and admixture coefficients (q)
corresponded to the taxonomic identity of
individuals and populations. This result might be
due to the differences between the algorithms of
the two software packages. BAPS uses a non-reversible
algorithm process with intelligent operators, which
enables simultaneous exploration of several local
neighborhoods of parameter space, while preventing
the absorption of any particular process to a
relatively inferior state (Corander et al. 2008a). In
contrast, the Gibbs sampler algorithm used by
STRUCTURE (Pritchard et al. 2000) is prone to
convergence problems and may not reach the true
posterior even after a substantial number of
iterations (Celeux et al. 2000, Hanage et al. 2009).
These differences between the two programs should
be taken into account especially when blind
approaches are followed (i.e. when the proportions
among sample sizes are not known from the
beginning).
The second part of the study aimed to explore the
power of the chosen Bayesian clustering methods in
assigning purebreds, F1 hybrids, as well as
backcrosses. Due to the balanced sampling schemes,
multimodality and spurious partition were not an
issue here. At a first stage, I aimed to identify highly
informative marker subsets by ranking the loci used
based on In and FST as criteria of diagnostic power. The
In measure generally performed better than FST, as higher levels
of efficiency and accuracy could be reached when
markers were chosen based on their In value. These
results are in agreement with a previous simulation-based
study by Ding et al. (2011) in humans. FST may
underestimate the biologically relevant genetic
differentiation as it is strongly dependent on
variation. In particular, the maximum value of FST
decreases with increasing locus heterozygosity and,
thus, the same value of FST may not reflect the
same level of genetic differentiation (Hedrick
1999). This might explain why EST-SSRs had lower
ranks in the list based on In, compared to the
FST-based list. On the other hand, non-genic SSRs in the
present study tend to be more variable and more
powerful for Bayesian clustering analysis than EST-microsatellites
(based on In). Highly variable
dinucleotide SSRs have been shown to perform
better in Bayesian clustering analyses compared to
less variable markers like trinucleotide (as most of
EST-SSRs used here) or tetranucleotide SSRs or
SNPs in other studies as well (Narum et al. 2008;
Payseur and Jing 2009).
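The dependence of the maximum FST on heterozygosity can be made concrete with a small worked example. The sketch assumes the heterozygosity-based formulation FST = (HT − HS)/HT, which implies FST < 1 − HS, and considers two populations that share no alleles at all, each carrying n equally frequent alleles. The function name is hypothetical.

```python
def fst_no_shared_alleles(n_alleles):
    """FST between two populations with completely distinct allele sets.

    Each population has n equally frequent private alleles, so
    HS = 1 - 1/n and HT = 1 - 1/(2n) for the pooled frequencies.
    """
    hs = 1 - 1 / n_alleles
    ht = 1 - 1 / (2 * n_alleles)
    return (ht - hs) / ht


for n in (1, 2, 5, 10, 20):
    print(n, round(fst_no_shared_alleles(n), 3))
```

Although the two populations are maximally differentiated in every case (no allele is shared), FST falls from 1 with a single private allele per population to 1/19 (about 0.053) with ten, which illustrates why highly variable loci can rank low on an FST-based list while remaining very informative for assignment.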
Regarding simulated purebreds, increase of
efficiency was fast and, in most cases, it was possible
to assign more than 90 % of simulated purebreds
correctly using only two markers. In general, BAPS
showed a faster increase of purebred detection
efficiency than STRUCTURE. This is most likely due to
algorithm differences between the two software
packages. Unlike STRUCTURE, BAPS uses a filtering
process that sets the admixture coefficient to 1 when
the genotypic data provide only weak evidence for admixture,
aiming to reduce the number of false positive cases
of admixture (Corander et al. 2008a). Thus, using
only one locus, BAPS recognizes only purebreds (q
= 1 to any cluster). Increasing the number of loci results
in an increasing number of admixed individuals. On
the contrary, STRUCTURE begins with a high
number of admixed individuals which is diminished
with increasing number of loci, as method
performance improves. However, even when all 21
loci were used, there was an obvious lack of
individuals with q values between 0.8 and 0.99 in
BAPS, while q value distribution in STRUCTURE was
continuous.
A further consequence of these algorithm
differences is the fact that BAPS tends to
underestimate the number of admixed individuals.
Even when all 21 markers were included into the
analysis, BAPS misassigned more F1 hybrids as
purebreds than STRUCTURE did. In general, BAPS
tends to overestimate the number of purebreds,
which results in a generally higher efficiency, but not
accuracy, among purebreds. Similar observations
were made in a recent study with red wolves (Canis
rufus) and coyotes (Canis latrans) comparing the
two software packages (Bohling et al. 2013). In any
case, the present study showed that the marker set
used here was not adequate for BAPS to distinguish
backcrosses either from purebreds or from F1
hybrids (i.e. confidence intervals around the median
q for all these categories were overlapping). By
choosing a compromise threshold q of 0.90, it was
possible to assign a relatively weak majority of
simulated backcrosses as hybrids with
STRUCTURE, whereas with BAPS, most of them
received q = 1. Irrespective of the applied method,
21 loci are evidently not adequate to
distinguish backcrosses from purebreds and hybrids
among the study species. Up to around 50 loci may
be required to distinguish backcrosses even when
the parental species are highly divergent (Vähä and
Primmer 2006). Use of a higher number of loci
would also allow BAPS to cover a wider or even the
whole range of q (Corander et al. 2008a), achieving a
better efficiency of backcross (and hybrid)
assignment. Researchers should keep this in mind
when applying Bayesian clustering analyses in
natural populations of hybridizing species, since
backcrossed individuals may occur more often than
F1 hybrids (Lepais et al. 2009).
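Why backcrosses are harder to separate from purebreds than F1 hybrids can be seen from their expected ancestry. The sketch below simulates genotypes at independent loci by drawing alleles from species-specific frequencies, in the spirit of hybrid simulation programs such as HYBRIDLAB but not reproducing its implementation. To make ancestry directly countable, it assumes fully diagnostic alleles ("a" for species A, "b" for species B); all names and values are hypothetical.

```python
import random


def draw_allele(freqs, rng):
    """Draw one allele according to a dict of allele frequencies."""
    alleles = list(freqs)
    weights = [freqs[a] for a in alleles]
    return rng.choices(alleles, weights=weights, k=1)[0]


def f1(freqs_a, freqs_b, rng):
    # One allele from each parental species.
    return (draw_allele(freqs_a, rng), draw_allele(freqs_b, rng))


def backcross(freqs_a, freqs_b, rng):
    # One allele from a purebred of species A, one gamete from an F1.
    hybrid = f1(freqs_a, freqs_b, rng)
    return (draw_allele(freqs_a, rng), rng.choice(hybrid))


def fraction_from_b(genotypes):
    """Proportion of alleles that originate from species B."""
    alleles = [a for g in genotypes for a in g]
    return alleles.count("b") / len(alleles)


rng = random.Random(1)       # fixed seed for reproducibility
freqs_a = {"a": 1.0}         # diagnostic allele of species A (assumed)
freqs_b = {"b": 1.0}         # diagnostic allele of species B (assumed)
n_loci = 4000

f1_loci = [f1(freqs_a, freqs_b, rng) for _ in range(n_loci)]
bc_loci = [backcross(freqs_a, freqs_b, rng) for _ in range(n_loci)]
print(fraction_from_b(f1_loci))
print(round(fraction_from_b(bc_loci), 2))
```

An F1 carries exactly half of its alleles from each species, whereas a first-generation backcross is expected to carry about a quarter from the second species. With only 21 loci and non-diagnostic alleles, the sampling variance around these expectations is large, which is consistent with the overlapping q-value ranges reported here.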
Furthermore, phylogenetic relationships also
influenced performance of Bayesian analyses. For
instance, more loci were required for the
combination of Q. petraea and Q. pubescens than for
the two other species combinations in order to
achieve the same levels of method performance. The
effect of phylogenetic relationship on hybrid
assignment performance was even stronger. Using
STRUCTURE with the seven most informative loci, it
was possible to distinguish F1 hybrids between Q.
robur and Q. petraea or between Q. robur and Q.
pubescens from the parental species, at least in the two-species
configurations. By contrast, 10-12 loci were
required to distinguish purebreds of Q. petraea from
F1 hybrids between Q. petraea and Q. pubescens.
Notably, even with use of all 21 loci, F1 hybrids still
overlapped with purebreds of Q. pubescens in terms
of q-values. This might be due to differences of allelic
patterns among species. Both Quercus robur and Q.
petraea displayed reduced genetic variation at
specific loci, which was due to high frequency of a
certain allele, which is characteristic for the
particular species (e.g. QrZAG96 for Q. robur and
QrZAG112 for Q. petraea). This might have
facilitated the correct assignment of purebred
individuals. Lack of such loci in Q. pubescens
probably accounts for the relatively low
performance of purebred assignment. The utility of
such loci for species discrimination has been shown
in other case studies, too (Curtu et al. 2007,
Neophytou et al. 2011).
Lower performance for purebreds of Q. pubescens on
the one hand and for hybrids between Q. petraea
and Q. pubescens on the other could also be observed
in the three-species configuration. However, the most
important outcome of the analysis using all three
species is the reduction of efficiency and accuracy in
comparison to pairwise configurations. First, the
presence of three clusters increased the probability
of misassigning an individual, which is supported by
the reduction of efficiency and accuracy of every
single category of purebreds and hybrids. Second,
the phylogenetic affinity between Q. petraea and Q.
pubescens resulted in further assignment errors. In
particular, F1 hybrids and backcrosses of Q.
petraea with Q. robur were often assigned as hybrids
involving Q. pubescens and the opposite happened
for hybrids and backcrosses of Q. pubescens with Q.
robur. This phenomenon, also observed elsewhere
(Lepais et al. 2009, Bohling et al. 2013), additionally
contributed to the reduction of method
performance. It should be taken into account when
several hybridizing species are included into the
analysis, in order to avoid wrong conclusions about
the direction of introgression and the species
involved in hybridization.
ACKNOWLEDGEMENTS
This work was supported by the European Regional
Development Fund (ERDF), the regional government
authority of Baden-Württemberg in Freiburg
(Regierungspräsidium Freiburg; RPF), the National
Office of Forests (Office National des Forêts; ONF) in
France and the Regional Directory of Food,
Agriculture and Forestry of Alsace (Direction
Régionale de l'Alimentation, de l'Agriculture et de la
Forêt d'Alsace; DRAAF) in the frame of the Interreg IV project "The regeneration of the oaks in the Upper
Rhine lowlands". I express my gratitude to all the
colleagues of the ONF, RPF and the FVA who worked
for sample collections and laboratory analyses, to
Jukka Corander for kindly answering several
questions about the BAPS software and to two
anonymous reviewers for providing valuable
comments and suggestions.
DATA ARCHIVING STATEMENT
Genotypic data used for this study are available at
Dryad: doi: 10.1007/s11295-013-0680-2.
REFERENCE LIST
Aldrich PR, Parker GR, Michler CH, Romero-Severson
J (2003) Whole-tree silvic identifications and the
microsatellite genetic structure of a red oak species
complex in an Indiana old-growth forest. Can J
Forest Res 33:2228–2237.
Antao T, Lopes A, Lopes RJ, Beja-Pereira A, Luikart G
(2008) LOSITAN: A workbench to detect molecular
adaptation based on a FST-outlier method. BMC
Bioinformatics 9:323.
Belkhir K, Borsa P, Chikhi L, Raufaste N, Bonhomme
F (2004) GENETIX 4.05, WindowsTM Software for
Population Genetics. Laboratoire génome,
populations, interactions, CNRS UMR 5000.
Bohling JH, Adams JR, Waits LP (2013) Evaluating
the ability of Bayesian clustering methods to detect
hybridization and introgression using an empirical
red wolf data set. Mol Ecol 22:74–86.
Burgarella C, Lorenzo Z, Jabbour-Zahab R, Lumaret
R, Guichoux E, Petit RJ, Soto Á, Gil L (2009) Detection
of hybrids in nature: application to oaks (Quercus
suber and Q. ilex). Heredity 102:442–452.
Celeux G, Hurn M, Robert CP (2000) Computational
and Inferential Difficulties with Mixture Posterior
Distributions. J Am Stat Assoc 95:957–970.
Corander J, Marttinen P (2006) Bayesian
identification of admixture events using multilocus
molecular markers. Mol Ecol 15:2833–2843.
Corander J, Marttinen P, Sirén J, Tang J (2008a)
Enhanced Bayesian modelling in BAPS software for
learning genetic structures of populations. BMC
Bioinformatics 9:539.
Corander J, Sirén J, Arjas E (2008b) Bayesian spatial
modeling of genetic population structure. Comput
Stat 23:111–129.
Curtu AL, Gailing O, Finkeldey R (2007) Evidence for
hybridization and introgression within a species-rich oak (Quercus spp.) community. BMC Evol Biol
7:218.
Ding L, Wiener H, Abebe T, et al (2011) Comparison
of measures of marker informativeness for ancestry
and admixture mapping. BMC Genomics 12:622.
Dow B, Ashley M, Howe H (1995) Characterization of
highly variable (GA/CT) n microsatellites in the bur
oak, Quercus macrocarpa. Theor Appl Genet 91:137–
141.
Duminil J, Caron H, Scotti I, Cazal S-O, Petit RJ
(2006) Blind population genetics survey of tropical
rainforest trees. Mol Ecol 15:3505–3513.
Durand J, Bodénès C, Chancerel E, et al (2010) A fast
and cost-effective approach to develop and map
EST-SSR markers: oak as a case study. BMC
Genomics 11:570.
Earl DA, vonHoldt BM (2012) STRUCTURE
HARVESTER: a website and program for visualizing
STRUCTURE output and implementing the Evanno
method. Conservation Genet Resour 4:359–361.
Evanno G, Regnaut S, Goudet J (2005) Detecting the
number of clusters of individuals using the software
STRUCTURE: a simulation study. Mol Ecol 14:2611–
2620.
Falush D, Stephens M, Pritchard JK (2003) Inference
of population structure using multilocus genotype
data: Linked loci and correlated allele frequencies.
Genetics 164:1567–1587.
Frantz AC, Pourtois JT, Heuertz M, Schley L, Flamand
MC, Krier A, Bertouille S, Chaumont F, Burke T
(2006) Genetic structure and assignment tests
demonstrate illegal translocation of red deer (Cervus
elaphus) into a continuous population. Mol Ecol
15:3191–3203.
Gugger PF, Cavender-Bares J (2011) Molecular and
morphological support for a Florida origin of the
Cuban oak. J Biogeogr. Published on-line
(doi:10.1111/j.1365-2699.2011.02610.x)
Guichoux E, Lagache L, Wagner S, Léger P, Petit RJ
(2011) Two highly validated multiplexes (12-plex
and 8-plex) for species delimitation and parentage
analysis in oaks (Quercus spp.). Mol Ecol Resour
11:578–585.
Guichoux E, Garnier-Géré P, Lagache L, Lang T,
Boury C, Petit RJ (2013) Outlier loci highlight the
direction of introgression in oaks. Mol Ecol 22:450–
462.
Hanage WP, Fraser C, Tang J, Connor TR, Corander J
(2009) Hyper-Recombination, Diversity, and
Antibiotic Resistance in Pneumococcus. Science
324:1454–1457.
Hedrick PW (1999) Perspective: Highly variable loci
and their interpretation in evolution and
conservation. Evolution 53:313.
Heuertz M, Fineschi S, Anzidei M et al (2004)
Chloroplast DNA variation and postglacial
recolonization of common ash (Fraxinus excelsior L.)
in Europe. Mol Ecol 13:3437–3452.
Höltken A, Buschbom J, Kätzel R (2012) Die
Artintegrität unserer heimischen Eichen Quercus
robur L., Q. petraea (Matt.) Liebl. und Q. pubescens
Willd. aus genetischer Sicht (in German). Allg Forst-Jagdztg 183:100–110.
Jakobsson M, Rosenberg NA (2007) CLUMPP: a
cluster matching and permutation program for
dealing with label switching and multimodality in
analysis of population structure. Bioinformatics
23:1801–1806.
Kalinowski ST (2011) The computer program
STRUCTURE does not reliably identify the main
genetic clusters within species: simulations and
implications for human population structure.
Heredity 106:625–632.
Kampfer S, Lexer C, Glössl J, Steinkellner H (1998)
Characterization of (GA) n microsatellite loci from
Quercus robur. Hereditas 129:183–186.
Kronforst MR, Young LG, Blume LM, Gilbert LE
(2006) Multilocus analyses of admixture and
introgression among hybridizing Heliconius
butterflies. Evolution 60:1254–1268.
Kumar S, Skjæveland Å, Orr RJ, Enger P, Ruden T,
Mevik B-H, Burki F, Botnen A, Shalchian-Tabrizi K
(2009) AIR: A batch-oriented web program package
for construction of supermatrices ready for
phylogenomic analyses. BMC Bioinformatics 10:357.
Lepais O, Petit R, Guichoux E, Lavabre J, Alberto F,
Kremer A, Gerber S (2009) Species relative
abundance and direction of introgression in oaks.
Mol Ecol 18:2228–2242.
Lexer C, Fay MF, Joseph JA, Nica M-S, Heinze B
(2005) Barrier to gene flow between two
ecologically divergent Populus species, P. alba (white
poplar) and P. tremula (European aspen): the role of
ecology and life history in gene introgression. Mol
Ecol 14:1045–1057.
Manos PS, Doyle JJ, Nixon KC (1999) Phylogeny,
Biogeography, and Processes of Molecular
Differentiation in Quercus Subgenus Quercus
(Fagaceae). Mol Phylogenet Evol 12:333–349.
Narum SR, Banks M, Beacham TD et al (2008)
Differentiating salmon populations at broad and fine
geographical scales with microsatellites and single
nucleotide polymorphisms. Mol Ecol 17:3464–3477.
Neophytou C, Aravanopoulos F, Fink S, Dounavi A
(2010) Detecting interspecific and geographic
differentiation patterns in two interfertile oak
species (Quercus petraea (Matt.) Liebl. and Q. robur
L.) using small sets of microsatellite markers. For
Ecol Manag 259:2026–2035.
Neophytou C, Dounavi A, Fink S, Aravanopoulos F
(2011) Interfertile oaks in an island environment: I.
High nuclear genetic differentiation and high degree
of chloroplast DNA sharing between Q. alnifolia and
Q. coccifera in Cyprus. A multipopulation study. Eur J
For Res 130:543–555.
Nielsen EE, Bach LA, Kotlicki P (2006) HYBRIDLAB
(version 1.0): a program for generating simulated
hybrids from population samples. Mol Ecol Notes
6:971–973.
Payseur BA, Jing P (2009) A Genomewide
Comparison of Population Structure at STRPs and
Nearby SNPs in Humans. Mol Biol Evol 26:1369–1377.
Pritchard JK, Stephens M, Donnelly P (2000)
Inference of population structure using multilocus
genotype data. Genetics 155:945–959.
Reeves PA, Richards CM (2011) Species Delimitation
under the General Lineage Concept: An Empirical
Example Using Wild North American Hops
(Cannabaceae: Humulus lupulus). Syst Biol 60:45–59.
Rosenberg NA, Burke T, Elo K et al (2001) Empirical
evaluation of genetic clustering methods using
multilocus genotypes from 20 chicken breeds.
Genetics 159:699–713.
Rosenberg NA, Pritchard JK, Weber JL, Cann HM,
Kidd KK, Zhivotovsky LA, Feldman MW (2002)
Genetic Structure of Human Populations. Science
298:2381–2385.
Rosenberg NA, Li LM, Ward R, Pritchard JK (2003)
Informativeness of genetic markers for inference of
ancestry. Am J Hum Genet 73:1402–1422.
Rosenberg NA (2004) DISTRUCT: a program for the
graphical display of population structure. Mol Ecol
Notes 4:137–138.
Rosenberg NA (2005) Algorithms for selecting
informative marker panels for population
assignment. J Comput Biol 12:1183–1201.
Steinkellner H, Fluch S, Turetschek E, Lexer C, Streiff
R, Kremer A, Burg K, Glössl J (1997a) Identification
and characterization of (GA/CT)n-microsatellite loci
from Quercus petraea. Plant Mol Biol 33:1093–1096.
Steinkellner H, Lexer C, Turetschek E, Glössl J
(1997b) Conservation of (GA)n microsatellite loci
between Quercus species. Mol Ecol 6:1189–1194.
Takezaki N, Nei M, Tamura K (2010) POPTREE2:
Software for constructing population trees from
allele frequency data and computing other
population statistics with Windows interface. Mol
Biol Evol 27:747–752.
Vähä J-P, Primmer CR (2006) Efficiency of model-based
Bayesian methods for detecting hybrid
individuals under different hybridization scenarios
and with different numbers of loci. Mol Ecol 15:63–72.
Weir BS, Cockerham CC (1984) Estimating F-Statistics
for the analysis of population structure.
Evolution 38:1358–1370.
The final publication is available at http://link.springer.com/article/10.1007%2Fs11295-013-0680-2