Author's self-archived version
Bayesian clustering for genetic assignment in oaks

Bayesian clustering analyses for genetic assignment and study of hybridization in oaks: Effects of asymmetric phylogenies and asymmetric sampling schemes

Charalambos Neophytou (1)

(1) Forest Research Institute (FVA) Baden-Württemberg, Wonnhaldestr. 4, 79100 Freiburg, Germany. E-mail: [email protected]

ABSTRACT
Bayesian clustering methods have been widely used for studying species delimitation and genetic introgression. In order to test the effect of phylogenetic relationships and sampling scheme on the inferred clustering solution and on the performance of Bayesian clustering analysis, I simulated genotypes of the interfertile oak species Quercus robur, Q. petraea and Q. pubescens and ran analyses using two popular software programs, STRUCTURE and BAPS. First, based on purebred simulations, I compared clustering solutions resulting from different sample size configurations. While the clustering solution generally reflected the taxonomic relationships when equal samples of each species were included, a spurious partition was inferred by STRUCTURE when some species were represented by larger and others by smaller samples. In very unbalanced configurations, STRUCTURE failed to identify the three species, even if three subpopulations were assumed. By contrast, BAPS could properly identify the three species under any sampling scheme. Second, based on simulations of purebreds and hybrids, I tested the performance of individual assignments with a variable number of loci. This analysis showed that STRUCTURE can detect introgressed individuals more efficiently than BAPS. However, BAPS could assign purebreds more efficiently with a lower number of loci. Method performance also depended on phylogenetic relationships. In the case of Quercus petraea, Q. pubescens and their hybrids, method performance was lower due to their phylogenetic affinity.
Inclusion of three instead of two species in the analysis led to a reduction of performance and to misclassification of hybrids, which often reflected the phylogenetic affinity between Q. petraea and Q. pubescens.

KEY WORDS
Bayesian clustering, Quercus, BAPS, STRUCTURE, simulation, microsatellites

The final publication is available at http://link.springer.com/article/10.1007%2Fs11295-013-0680-2

1. INTRODUCTION
Along with significant improvements in molecular techniques, Bayesian clustering methods of population genetic structure analysis have experienced a rapid expansion during the last decade. Main applications of such methods include investigation of intraspecific genetic differentiation (Rosenberg et al. 2001; Heuertz et al. 2004; Frantz et al. 2006), species delimitation and study of hybridization and genetic introgression (Lexer et al. 2005; Kronforst et al. 2006; Bohling et al. 2013). Being a multispecific genus with high levels of interbreeding among taxa, oaks (genus Quercus) have often been used for such analyses in a population genetic and evolutionary context (Burgarella et al. 2009; Lepais et al. 2009; Neophytou et al. 2010; Gugger and Cavender-Bares 2011). Yet, method performance varies among different case studies.

Several factors related to the experimental design and the taxonomic relationships affect the ability of Bayesian clustering analyses to characterize inter- and intraspecific genetic differentiation and to define levels of genetic introgression between interfertile units (Vähä and Primmer 2006; Kalinowski 2011; Bohling et al. 2013). In general, an increase in the number of genotyped loci leads to a higher efficiency and accuracy of genetic assignment (Vähä and Primmer 2006). However, it has been shown that diagnostic power can vary strongly among loci.
Thus, use of a small number of highly informative loci can be adequate to identify genetic structures, whereas adding a large number of less informative loci may not improve the results significantly (Rosenberg 2005). In order to select a set of appropriate markers for Bayesian clustering analyses, loci have to be evaluated and sorted according to their diagnostic power. Several measures estimating locus diagnostic power have been proposed. Among them, Wright's FST and Rosenberg's informativeness of assignment (In; Rosenberg et al. 2003) have been frequently used for such marker classifications and have been shown to perform better than other related measures (Ding et al. 2011).

Phylogenetic relationships among species may also strongly affect the ability of Bayesian clustering to distinguish different taxonomical units. In the case of ancient speciation events, genetic drift and mutations are likely to have altered allelic frequencies in extant species more strongly and at more loci. Ancient speciation may explain the high levels of differentiation among Mediterranean representatives of the oak section Cerris (Manos et al. 1999). In two case studies with oaks of the section Cerris, a limited number of markers (as few as four) was adequate to achieve a high performance of Bayesian clustering for species assignment, supporting the aforementioned hypothesis (Burgarella et al. 2009; Neophytou et al. 2011). On the contrary, more recent speciation, probably in interaction with frequent hybridization, may explain the fact that Bayesian analyses within certain species of the section Lobatae (red oaks) could not resolve the species, even when 15 loci were used (Aldrich et al. 2003). Furthermore, a better resolution of Bayesian clustering may be required to study introgression among interfertile species.
Given that the performance of Bayesian clustering methods depends on the genetic differentiation among species, a high number of loci may be required in order to reliably assign hybrids, and especially backcrosses, in the case of limited interspecific differentiation (Vähä and Primmer 2006). In addition, depending on the algorithm used, proportions of assigned purebreds, hybrids and backcrosses may vary strongly (Burgarella et al. 2009; Bohling et al. 2013). Another key issue is the choice of a threshold value of membership proportion or admixture coefficient (q, i.e. the percentage of genetic variation of each individual drawn from a specific gene pool) in order to distinguish purebreds from potential hybrids. Studies based on simulations of purebred and hybrid genotypes have aimed to evaluate the effect of all aforementioned factors on the performance of Bayesian clustering (Vähä and Primmer 2006; Burgarella et al. 2009; Lepais et al. 2009; Guichoux et al. 2013).

Finally, the sampling scheme and the chosen analysis method may also cause problems for genetic structure identification. For instance, using the popular Bayesian clustering analysis software STRUCTURE (Pritchard et al. 2000; Falush et al. 2003), it has been shown that variations in sample size among demes may strongly influence the clustering solution (Kalinowski 2011). A preliminary analysis of the data underlying the present study, carried out with the same software, showed a tendency of two species groups, those of Q. petraea and Q. pubescens, which were represented by a relatively low number of individuals, to cluster together, even when the number of assumed subpopulations was set to 3 (i.e. the number of species). This could be due to the phylogenetic affinity of these two species or due to stochastic error (Kalinowski 2011). On the contrary, use of another method of Bayesian clustering analysis, BAPS (Corander and Marttinen 2006; Corander et al. 
2008a), showed clustering patterns consistent with the taxonomic relationships among the three species. This observation has largely motivated the further simulation-based analyses presented in this paper.

In the present study, I chose to study the effects of all aforementioned factors by focusing on three interfertile oak species of Central Europe: Quercus robur, Q. petraea and Q. pubescens. Given that Q. petraea and Q. pubescens are phylogenetically closer to each other, while Q. robur is genetically more divergent (Curtu et al. 2007; Lepais et al. 2009), this species complex is a good model for studying the effects of phylogeny on the performance of Bayesian clustering. In particular, based on genotype simulations, I aimed to explore the utility of Bayesian clustering analysis methods of genetic structure for species identification and for the study of hybridization and introgression, and to investigate advantages and drawbacks of two well-established methods, BAPS and STRUCTURE. Specifically, my objectives were (1) to study the effect of phylogenetic relationships on the reliability of clustering among species with different levels of pairwise differentiation, the oak species Quercus robur, Q. petraea and Q. 
pubescens, (2) to test the effect of sample size of each specific group on clustering patterns, (3) to choose subsets of highly informative markers that discriminate among the three species, as well as pairwise in all three possible combinations, (4) to evaluate Wright's FST and Rosenberg's In as tools for choosing highly informative marker sets suitable for Bayesian analyses, (5) to compare the performance of two different Bayesian clustering methods in all aforementioned tasks, (6) to study the effect of phylogenetic relationships on the efficiency and accuracy of purebred and hybrid assignment and (7) to test whether inclusion of three species reduces method performance in comparison to analyses with species pairs and their hybrids.

2. MATERIALS AND METHODS
2.1. Study area and sample collections
A total of 2048 individual trees of Q. robur, Q. petraea and Q. pubescens were systematically sampled from 76 forest stands in the Upper Rhine Valley in France and Germany, delimited by the Vosges Mountains to the west, the Jura Mountains to the south and the Black Forest to the east. Individual trees were georeferenced. A preliminary assignment to one of the three species was made in the field based on basic phenotypic characters (leaves, bark and acorns). Among the sampled stands, 15 were mixed with Q. robur and Q. petraea and the remaining were pure for one of the three study species. In particular, 40 of those stands consisted predominantly of Q. robur, 15 of Q. petraea and 6 of Q. pubescens.

2.2. Laboratory procedures
Depending on the season of the samplings, three different types of plant tissue – cambium, leaves or buds – were collected from each individual for the DNA analysis. After sample collection, plant material was transferred to the laboratory and frozen at -80 °C. Subsequently, it was freeze-dried in vacuum and DNA was extracted using the DNeasy 96 extraction kit (Qiagen, Hilden, Germany).
Multiplex polymerase chain reactions (PCR) were carried out for the amplification of 11 non-genic (nSSRs) and 10 EST-derived microsatellite loci (EST-SSRs). Among the eleven analyzed non-genic microsatellites six – QrZAG7, QrZAG11, QrZAG30, QrZAG96 and QrZAG112 – were initially developed in Q. robur (Kampfer et al. 1998), four – QpZAG9, QpZAG15, QpZAG104 and QpZAG110 – in Q. petraea (Steinkellner et al. 1997a) and one – MSQ13 – in Q. macrocarpa (Dow et al. 1995). All ten EST-derived microsatellites – PIE020, PIE102, PIE152, PIE215, PIE223, PIE227, PIE242, PIE243, PIE267 and PIE271 – were described in Durand et al. (2010). Both non-genic and EST-derived microsatellites are known to be highly transferable among related white oak species (section Quercus), as has been described in studies mainly including Q. robur and Q. petraea (Steinkellner et al. 1997b; Guichoux et al. 2011).

For the PCR reactions, primers were divided into three multiplexes. The multiplex including the EST-derived SSRs was largely based on Guichoux et al. (2011). Details about the loci used, multiplexes and fluorescent labeling are presented in Online Resource 1. As a reagent, the SuperHot Mastermix (Genaxxon, Biberach, Germany), a premixed mastermix including all PCR components except DNA template and primers, was used. Reaction volume was set to 10 µl, comprising 5 µl reaction mastermix, 2 µl primer mix, 2 µl water and 1 µl diluted DNA (ca. 4 ng/µl). A common PCR program was used for all the reactions, including the following steps: (1) denaturation at 95 °C for 15 min; (2) 26 cycles with a denaturation step at 94 °C for 30 s, primer annealing at 57 °C for 1 min 30 s and an elongation step at 72 °C for 30 s; (3) final elongation at 72 °C for 10 min and (4) a final step at 60 °C for 30 min.
Allele scoring was performed by capillary electrophoresis using an ABI Prism 3130xl genetic analyzer and the software GeneMapper (Applied Biosystems).

2.3. Selection of purebred individuals
For a preliminary species assignment, two different Bayesian clustering approaches, STRUCTURE (Pritchard et al. 2000; Falush et al. 2003) and BAPS (Corander and Marttinen 2006; Corander et al. 2008a), were used. Both methods allow for the presence of multiple clusters, which fits the present data set consisting of three species. The purpose of this first step was to select purebred individuals of each species and use them for simulations in the next steps of the study. An individual was characterized as purebred only if it was assigned to the same species cluster by both clustering methods. The value of 0.875 was chosen as a threshold of membership proportion (q), with each individual with q ≥ 0.875 being characterized as purebred. This threshold value was used only for the preliminary analyses (the choice of threshold values for the simulation study is described in Chapter 2.6). It was selected assuming that first generation hybrids are expected to have q-values of 0.50 to each one of their parental species and backcrossings 0.75 and 0.25, respectively. Therefore, a membership proportion of 0.375-0.625 would be expected for F1 hybrids, 0.625-0.875 for backcrossings and above 0.875 for purebreds. A q-value of 0.875 was also used as a threshold to distinguish purebreds in recent studies using Bayesian clustering (Bohling et al. 2013; Guichoux et al. 2013).

STRUCTURE analysis was performed choosing the admixture model and correlated allele frequencies. The number of assumed subpopulations (K) was set between 1 and 20. For each K, ten independent runs were performed applying 100,000 burn-in replications followed by 100,000 MCMC iterations. All runs made for this study were performed using the on-line platform of the Oslo University (Kumar et al. 
2009), which applied version 2.3 of STRUCTURE at the time of the analyses. In order to choose the most appropriate number of clusters, the method of Evanno et al. (2005) was implemented. According to this method, ΔK, an ad-hoc statistic based on the rate of change of the maximum posterior probability of data, was calculated for each value of K. The value of K for which ΔK is maximized indicates the uppermost hierarchical level of population subdivision. Notably, in the case of complex hierarchical schemes – for instance when genetic differentiation among different pairs of clusters varies strongly – ΔK may not detect all clusters from the beginning (Evanno et al. 2005). Thus, in order to detect further hidden within-group clustering, subsequent STRUCTURE analyses (applying the same settings) were performed using individuals that had been assigned as purebreds (q ≥ 0.875) in the first analysis, as suggested by Evanno et al. (2005). These within-group analyses were continued until no meaningful population subdivision was supported by the program results. ΔK analyses were performed using the on-line platform STRUCTURE HARVESTER (Earl and vonHoldt 2012).

Given that geographic coordinates of the individuals were available, the option of spatial clustering of individuals was chosen for BAPS analysis. This method uses a prior that is stricter against an increase of the number of clusters, thus preventing detection of spurious clusters due to stochastic fluctuations of allele frequencies (Corander et al. 2008b). The maximum number of assumed subpopulations (K) was set from 2 to 20, since an analysis assuming K = 1 is not meaningful in BAPS. Ten independent runs for each value of K were performed. First, a mixture analysis was carried out to assign individuals to clusters and to define the most appropriate clustering solution.
Second, based on mixture data, an admixture analysis was performed in order to calculate membership proportions (admixture coefficients) of each individual to each cluster. The default settings of the program were used for this analysis. The minimum size of populations to be taken into account was set to 5, while 50 iterations were applied to estimate the admixture coefficients of the individuals, 50 reference individuals from each population were used and, finally, 10 iterations were performed in order to estimate the admixture coefficients of the reference individuals.

2.4. Simulations, genetic diversity and phylogenetic relationships
Subsequently, purebreds, as well as first generation hybrids and backcrossings, were simulated based on the purebreds characterized previously. In particular, one hundred purebred individuals from each species group, characterized in the first step, were randomly chosen and were used as input for the program HYBRIDLAB (Nielsen et al. 2006). The following groups of genotypes were simulated using this software: (a) 100, 500 and 1000 purebred individuals of each species, (b) 100 first generation hybrids between each species pair and (c) 100 backcrossings of first generation hybrids with each one of the parental species.

Locus diversity in each species was analyzed using the three groups of purebred individuals upon which simulations were based. To calculate the number of alleles per locus (na), observed (Ho) and expected (He) heterozygosity, as well as inbreeding coefficients (FIS; Weir and Cockerham 1984), the software Genetix v. 4.05.02 (Belkhir et al. 2004) was used. Moreover, in order to investigate phylogenetic relationships among the three species, an unrooted neighbor-joining tree based on pairwise FST values was constructed using the software POPTREE2 (Takezaki et al. 2010).

2.5. Effect of sample size and phylogenetic relationships on clustering
In the following step, simulated purebred genotypes were used to investigate sampling effects on clustering, as calculated by each one of the two Bayesian methods. Each run of BAPS or STRUCTURE was based on an input file consisting of 100, 500 or 1000 simulated purebreds of each species. The following configurations of sample size were tested: (a) 1000/100/100, (b) 1000/500/100, (c) 1000/500/500, (d) 1000/1000/100, (e) 1000/1000/500 and (f) 1000/1000/1000. These configurations were tested for all possible species combinations, in order to investigate the effect of phylogeny (e.g. whether the presence of related species in the small groups leads to a different clustering solution in comparison to inputs with the least related species forming the small groups).

In BAPS, the option of clustering of individuals was used instead of spatial clustering, since coordinates would be meaningless for simulated individuals. All other parameters for the analysis were the same as those used in the preliminary analysis (see Chapter 2.3). The maximum number of assumed subpopulations (K) was set from 2 to 10. Ten independent runs for each value of K were performed. In addition, ten independent runs with fixed K = 2 were performed in order to test whether phylogenetic relationships are reflected in the results, i.e. whether the most related species cluster together (when variable K values are chosen, only the optimal solution in terms of log-likelihood is presented in the output). Similarly to BAPS, ten independent runs for each K value between 1 and 10 were carried out in STRUCTURE, maintaining the same analysis parameters mentioned previously. The uppermost hierarchical level of population subdivision was calculated following the ΔK method with the on-line software STRUCTURE HARVESTER, as described above.
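The ΔK criterion referred to above can be illustrated with a minimal sketch. It assumes the commonly used form of the statistic of Evanno et al. (2005), ΔK = |L(K+1) − 2L(K) + L(K−1)| / sd(L(K)), where L(K) is the mean lnP(D) over the independent runs for a given K. The function name and the lnP(D) values are hypothetical and serve only as an illustration; they are not output of STRUCTURE or STRUCTURE HARVESTER.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: delta K} from {K: [lnP(D) of each independent run]}.

    Assumed form: |L(K+1) - 2*L(K) + L(K-1)| / sd(L(K)), with L(K) the
    mean lnP(D) over runs. Only K values with both neighbours can be scored.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_by_k[k])
            if sd > 0:  # skip K values whose runs are all identical
                second_diff = means[k + 1] - 2 * means[k] + means[k - 1]
                result[k] = abs(second_diff) / sd
    return result


# Hypothetical lnP(D) values, three runs per K (illustration only).
runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4501.0, -4499.0],
    3: [-4450.0, -4460.0, -4440.0],
    4: [-4440.0, -4470.0, -4410.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

In this sketch ΔK cannot be computed for the smallest and largest K tested, because both neighbouring values of K are needed.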
In order to find the optimal cluster alignment and calculate the average membership proportion among the 10 runs for each K, the software CLUMPP v. 1.1.2 (Jakobsson and Rosenberg 2007) was applied. For producing graphics with individual membership proportions based on either BAPS or STRUCTURE, the cluster visualization program DISTRUCT (Rosenberg 2004) was used.

2.6. Diagnostic power of loci, efficiency, accuracy and performance of Bayesian assignment
In order to test marker set efficiency, loci were sorted by their diagnostic power. Two different measures were used: Wright's FST and informativeness of assignment (In), introduced by Rosenberg et al. (2003). The latter measure was calculated using the software INFOCALC (Rosenberg 2005). Calculation of FST per locus was made using the on-line software LOSITAN (Antao et al. 2008). For both analyses, the genotypes of the 100 purebred tree individuals of each species selected in the first stage of the study were used. Separate analyses were run (1) including all three species and (2) pairwise, resulting in three species pairs (Q. robur – Q. petraea, Q. robur – Q. pubescens and Q. petraea – Q. pubescens).

After sorting the loci by decreasing FST and In values, Bayesian clustering analyses were performed using BAPS and STRUCTURE to test the performance of the two clustering methods with an increasing number of loci included. The following species configurations were used: (1) three species (including 1000 simulated genotypes for each species and all possible simulated first generation hybrids and backcrossings, with 100 simulated genotypes each) and (2) two species, using all three possible species combinations (including 1000 purebreds, 100 simulated F1 and 100 backcrossings with each one of the parental species).
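As an illustration of how loci can be scored and ordered before loci are added to the analysis step by step, the following minimal sketch computes informativeness of assignment per locus and builds nested marker sets of increasing size. It assumes the formula In = Σ_j [ −p_j ln p_j + Σ_i (p_ij / K) ln p_ij ] of Rosenberg et al. (2003), with p_ij the frequency of allele j in population i and p_j its mean over the K populations. The function names, locus names and allele frequencies are hypothetical; this is not the INFOCALC implementation.

```python
from math import log


def informativeness(freqs):
    """In for one locus; freqs[i][j] is the frequency of allele j in population i."""
    k = len(freqs)
    total = 0.0
    for j in range(len(freqs[0])):
        p_mean = sum(pop[j] for pop in freqs) / k
        if p_mean > 0:  # convention: 0 * log(0) = 0
            total -= p_mean * log(p_mean)
        for pop in freqs:
            if pop[j] > 0:
                total += pop[j] / k * log(pop[j])
    return total


def incremental_marker_sets(scores):
    """Nested locus sets: best locus, best two loci, ..., all loci."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[:n] for n in range(1, len(ranked) + 1)]


# Hypothetical allele frequencies for two populations (illustration only).
loci = {
    "locus_A": [[1.0, 0.0], [0.0, 1.0]],  # fixed for different alleles
    "locus_B": [[0.5, 0.5], [0.5, 0.5]],  # identical frequencies
    "locus_C": [[0.8, 0.2], [0.3, 0.7]],  # intermediate differentiation
}
scores = {name: informativeness(f) for name, f in loci.items()}
marker_sets = incremental_marker_sets(scores)
print(marker_sets)
```

Under these assumptions, a locus fixed for different alleles in two populations reaches the maximum In of ln 2, whereas a locus with identical frequencies scores zero. The same ranking helper can be applied to per-locus FST values.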
First, analyses were carried out using genotypic data from the first locus (i.e. with the highest FST or In) and, subsequently, loci were added one by one to carry out further runs, following the aforementioned order (i.e. by decreasing FST or In). This resulted in 21 different analyses for each configuration. Ten independent runs were performed with each software by setting K = 3 when all three species were included and K = 2 when species pairs were analyzed. The remaining settings for both software programs were the same as in the analysis examining the effects of sample size (see above). In order to find the optimal alignment and the average membership proportion among the 10 runs and to visualize the results, the software programs CLUMPP and DISTRUCT were used as described above.

Subsequently, simulated individuals were assigned to groups of purebreds and hybrids. Performance of Bayesian analyses was evaluated based on different measures described in Vähä and Primmer (2006), which have also been used in later simulation-based studies (e.g. Burgarella et al. 2009; Guichoux et al. 2013). Efficiency was defined as the proportion of individuals in a group that were correctly identified. For instance, efficiency for Q. robur was calculated as the number of simulated Q. robur genotypes correctly assigned to the Q. robur cluster divided by the total number of simulated Q. robur genotypes. Accuracy was defined as the proportion of individuals assigned by the clustering analysis to a certain group that truly belong to this group (Vähä and Primmer 2006). For instance, Q. robur accuracy is the number of simulated Q. robur genotypes correctly assigned to their species cluster divided by the overall number of individuals assigned as Q. robur (including simulated hybrids, backcrossings and purebreds of other species assigned to the Q. robur cluster).
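The two definitions above can be written out as a short sketch. The function name, the group labels and the toy data are hypothetical; the sketch only assumes that each simulated individual carries a known true group and a group assigned by the clustering analysis.

```python
def efficiency_and_accuracy(true_groups, assigned_groups, group):
    """Efficiency and accuracy for one group, following the definitions above.

    efficiency = correctly assigned members / all true members of the group
    accuracy   = correctly assigned members / all individuals assigned to it
    """
    pairs = list(zip(true_groups, assigned_groups))
    n_true = sum(1 for t, _ in pairs if t == group)
    n_assigned = sum(1 for _, a in pairs if a == group)
    n_correct = sum(1 for t, a in pairs if t == group and a == group)
    efficiency = n_correct / n_true if n_true else float("nan")
    accuracy = n_correct / n_assigned if n_assigned else float("nan")
    return efficiency, accuracy


# Toy data (illustration only): 4 Q. robur, 2 Q. petraea, 2 hybrids.
true_groups = ["RO", "RO", "RO", "RO", "PE", "PE", "ROxPE", "ROxPE"]
assigned_groups = ["RO", "RO", "RO", "ROxPE", "PE", "PE", "RO", "ROxPE"]

eff_ro, acc_ro = efficiency_and_accuracy(true_groups, assigned_groups, "RO")
print(eff_ro, acc_ro)
```

In this toy example one true Q. robur is missed, which lowers efficiency, and one hybrid is wrongly assigned as Q. robur, which lowers accuracy.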
Finally, total performance was calculated as the product of efficiency and accuracy for a given category.

In order to choose optimal threshold values of membership proportion (q), several critical q-values from 0.5 to 0.975 (in steps of 0.025) were tested. Maximization of overall performance was used as a criterion to select the optimal threshold values. Given that q-values of the simulated first generation hybrids and backcrosses were largely overlapping even when all loci were included (see Results), only groups of purebreds and hybrids were defined. Therefore, in the case of the three-species configuration, individuals were assigned to three purebred groups (one for each species) and three hybrid groups (Q. robur – Q. petraea, Q. robur – Q. pubescens and Q. petraea – Q. pubescens). In order for an individual to be assigned as purebred, a membership proportion higher than the threshold q-value to any cluster was required. If no cluster reached the threshold, the individual was assigned as hybrid. The two clusters with the highest membership proportion were considered as the parental species of the hybrid.

After choosing threshold q-values, comparisons of efficiency, accuracy and method performance were made. First, in order to compare FST and In, the rate of increase of the three measures with increasing number of loci was observed. Second, the rate of increase and final value of the three measures were compared between the two Bayesian clustering methods used. Third, efficiency, accuracy and total performance were compared among the tested species pairs, but also between pairwise and three-species configurations.

3. RESULTS
3.1. Preliminary Bayesian analyses
At a first stage, I performed preliminary Bayesian analyses with STRUCTURE and BAPS, in order to choose purebred individuals for genotype simulations. In the first STRUCTURE run, including all sampled individuals, the statistic ΔK was maximized for two assumed subpopulations (K = 2; Fig. 1). For K = 2, one cluster included Q. robur individuals and the other consisted of Q. petraea and Q. pubescens. For K = 3, clustering solutions differed among runs. For two runs, species could be separated. For eight runs, Q. petraea and Q. pubescens individuals clustered together, while Q. robur was subdivided and its individuals were admixed (Fig. 2). This subdivision was biologically unreasonable. However, the ln posterior probability of data, lnP(D), was on average higher for the runs with proper species identification than for those with the biologically unreasonable partition. This resulted in two bulks of points when lnP(D) for each run was plotted as a function of K (Fig. 1). Similar bimodality was also observed when K was set to 4. When K was set to 5, uniform results among all ten runs were obtained. Quercus petraea and Q. pubescens formed separate clusters and their individuals were assigned to their own species clusters with relatively high membership proportions. ΔK presented a secondary peak for K = 5 (Fig. 1).

Given that ΔK was maximized for two assumed subpopulations, I used the results for K = 2 to perform further analyses within the derived clusters, as suggested by Evanno et al. (2005). By analyzing individuals assigned to the common cluster of Q. petraea and Q. pubescens (q ≥ 0.875), the two species could be consistently identified and ΔK was maximal for K = 2. Thus, I used results from this run to make final species assignments. In total, I assigned 522 individuals to Q. petraea and 108 to Q. pubescens (q ≥ 0.875). Within Q. 
robur, no further subdivision was revealed by the analysis, as the posterior likelihood of data did not increase with increasing K. Thus, I assigned 1226 individuals to Q. robur (q ≥ 0.875) based on the analysis for K = 2 including all individuals.

By performing spatial clustering of individuals with BAPS, I found the optimal clustering solution for K = 3, corresponding to the three species. After a subsequent admixture analysis, I assigned 1248 individuals to Q. robur, 600 to Q. petraea and 128 to Q. pubescens (q ≥ 0.875). Among these individuals, 1224, 522 and 105 had also been assigned as Q. robur, Q. petraea and Q. pubescens, respectively, using STRUCTURE. To produce genotype simulations, I used 100 randomly chosen individuals of each group that had been identified as purebred by both analysis methods. Furthermore, I used these individuals to calculate diversity parameters and phylogenetic relationships among species.

Fig. 1 – Results of the first preliminary run with STRUCTURE carried out using sampled individual trees. Ln posterior probability of data (lnP(D)) for each run and values of the statistic ΔK are presented for different numbers of assumed subpopulations (K = …).

Fig. 2 – STRUCTURE clustering results of the preliminary analysis carried out based on genotyped tree individuals. Each individual is represented with a vertical bar and each inferred cluster is marked with a different gray tone. For K = 3 and K = 4, two different clustering solutions were found. The number of runs in which each one of these solutions was found is given on the left hand side of the figure. Species assignment based on morphology is given below the diagram (RO = Quercus robur, PE = Q. petraea, PU = Q. pubescens).

3.2. Genetic diversity and differentiation
Whereas genetic variability was generally high, some reduction of expected heterozygosity (He) in one of the species was observed in some cases. At loci QrZAG96 and PIE227, Q. robur presented lower He values in comparison to the other two species. Similarly, He was reduced in Q. petraea at locus QrZAG112. On average, genetic diversity in terms of He was highest in Q. pubescens. On the other hand, Q. robur displayed a higher number of alleles per locus than the other two species. Furthermore, EST-SSRs showed reduced values of both number of alleles per locus and expected heterozygosity in comparison to non-genic SSRs. Details about the analysis of genetic diversity are provided as supplementary material (Online Resource 2).

Phylogenetic relationships among the three species are presented by means of an FST-based unrooted NJ tree (Fig. 3). Results support phylogenetic affinity between Q. petraea and Q. pubescens, while Q. robur appears to be genetically more differentiated. The measured pairwise FST values were 0.120 between Q. robur and Q. petraea, 0.098 between Q. robur and Q. pubescens and 0.050 between Q. petraea and Q. pubescens.

Fig. 3 – Phylogenetic relationships among the three study species visualized by an unrooted neighbour-joining phylogenetic tree based on pairwise FST values between species.

3.3. Effect of phylogenetic relationships and sample size on Bayesian clustering
In order to test the effect of phylogenetic relationships and sample size on Bayesian clustering, I used three different group sizes of simulated purebreds in all possible species combinations. In all tested configurations of group sizes and species combinations, clustering of individuals using BAPS inferred K = 3 as the optimal population subdivision. For K = 3, the three species were correctly identified and individuals were assigned to their clusters with high membership proportions (Online Resource 3). On the contrary, application of the STRUCTURE software did not always resolve the three species.
The statistic ΔK was maximized for K = 2 and showed secondary peaks in some cases, resembling the previously described preliminary analysis (details on ΔK, as well as on lnP(D) of each run, are presented in Online Resource 4). For K = 2, in most cases Q. petraea and Q. pubescens clustered together, as expected given their phylogenetic affinity. However, when sample sizes were unbalanced, STRUCTURE tended to assign the smallest groups to the same cluster. For instance, when one group of 1000 individuals and two groups of 100 individuals (corresponding to the three study species) were used as input, the two small groups always clustered together at K = 2 for all species combinations, irrespective of their species identity. Though less frequent, clustering inconsistent with the species phylogeny could also be observed when running the data with BAPS at a fixed K = 2. A comparison of the results from the two software programs for K = 2 is presented in Table 1.

Table 1 – Comparison of group membership proportions between BAPS and STRUCTURE for different species and sample sizes, assuming two subpopulations (K = 2). Cases in which the phylogenetically more related species Q. petraea and Q. pubescens did not cluster together are highlighted with bold letters. In cases with variable clustering solutions among independent runs, membership proportions are underscored. RO = Quercus robur, PE = Quercus petraea, PU = Quercus pubescens.

Configuration                      BAPS (species)              STRUCTURE (species)
Sample size     Simulated group    1        2        3         1        2        3
1000/100/100    RO-PE-PU           1.000    0.000    0.000     0.995    0.010    0.132
1000/100/100    PE-RO-PU           1.000    0.000    0.045     0.992    0.080    0.005
1000/100/100    PU-RO-PE           1.000    0.000    0.971     0.992    0.004    0.052
1000/500/100    RO-PE-PU           0.999    0.001    0.006     0.995    0.006    0.020
1000/500/100    RO-PU-PE           1.000    0.000    0.019     0.995    0.051    0.012
1000/500/100    PE-RO-PU           1.000    0.001    0.957     0.993    0.004    0.006
1000/500/100    PE-PU-RO           1.000    0.006    0.000     0.992    0.042    0.922
1000/500/100    PU-RO-PE           1.000    0.001    0.970     0.996    0.008    0.010
1000/500/100    PU-PE-RO           1.000    0.001    0.008     0.986    0.816    0.041
1000/500/500    RO-PE-PU           1.000    0.001    0.000     0.995    0.008    0.660
1000/500/500    PE-RO-PU           1.000    0.001    0.999     0.980    0.006    0.007
1000/500/500    PU-RO-PE           1.000    0.001    0.998     0.995    0.017    0.985
1000/1000/100   RO-PE-PU           0.999    0.000    0.045     0.994    0.006    0.072
1000/1000/100   RO-PU-PE           1.000    0.000    0.031     0.995    0.152    0.018
1000/1000/100   PE-PU-RO           0.998    0.004    0.015     0.991    0.004    0.489
1000/1000/500   RO-PE-PU           0.999    0.000    0.001     0.995    0.006    0.014
1000/1000/500   RO-PU-PE           0.999    0.000    0.003     0.995    0.010    0.709
1000/1000/500   PE-PU-RO           0.999    1.000    0.001     0.698    0.005    0.007
1000/1000/1000  RO-PE-PU           1.000    0.001    0.000     0.995    0.009    0.006

In contrast to BAPS, STRUCTURE did not always distinguish the three species when three subpopulations were assumed (K = 3). For example, when a combination of 1000 simulated Q. robur, 100 Q. petraea and 100 Q. pubescens was included in the analysis, the latter two species were assigned to the same cluster in all ten runs, while the large Q. robur group was subdivided into two biologically unreasonable clusters, with individuals admixed between them. For the same configuration (1000/100/100), when Q. petraea or Q. pubescens formed the large cluster, results among the ten runs for K = 3 were not consistent. Four runs correctly resolved the three species, while six merged the two small and, in this case, phylogenetically less related groups into the same cluster (see detailed results of group membership proportions for all clustering solutions in Online Resource 5). Again, runs with the proper clustering solution presented a higher lnP(D) (Online Resource 4).

Table 2 – Efficiency, accuracy and total performance of BAPS and STRUCTURE analyses of simulated purebreds and hybrids. Configurations of all three species and all three pairwise combinations including simulated hybrids and backcrosses were used. Individuals were assigned to groups of purebreds and hybrid groups using a threshold of q = 0.90. RO = Quercus robur, PE = Quercus petraea, PU = Quercus pubescens.

                                   BAPS                        STRUCTURE
Configuration   Group              Eff.     Acc.     Perf.     Eff.     Acc.     Perf.
RO-PE-PU        RO Purebr.         1.000    0.851    0.851     0.992    0.954    0.946
RO-PE-PU        ROxPE Hybr.        0.498    0.943    0.470     0.830    0.883    0.733
RO-PE-PU        PE Purebr.         0.999    0.866    0.865     0.985    0.951    0.937
RO-PE-PU        PExPU Hybr.        0.200    0.938    0.188     0.643    0.757    0.487
RO-PE-PU        PU Purebr.         0.997    0.830    0.828     0.956    0.917    0.876
RO-PE-PU        ROxPU Hybr.        0.475    0.973    0.462     0.753    0.926    0.698
RO-PE-PU        Average Purebr.    0.999    0.849    0.848     0.978    0.940    0.919
RO-PE-PU        Average Hybr.      0.391    0.951    0.372     0.742    0.855    0.635
RO-PE           RO Purebr.         0.999    0.932    0.931     0.996    0.971    0.967
RO-PE           ROxPE Hybr.        0.550    0.994    0.547     0.843    0.969    0.817
RO-PE           PE Purebr.         1.000    0.942    0.942     0.996    0.983    0.979
RO-PE           Average Purebr.    1.000    0.937    0.936     0.996    0.977    0.973
RO-PE           Average Hybr.      0.550    0.994    0.547     0.843    0.969    0.817
RO-PU           RO Purebr.         1.000    0.926    0.926     0.997    0.976    0.973
RO-PU           PU Purebr.         1.000    0.939    0.939     1.000    0.968    0.968
RO-PU           ROxPU Hybr.        0.517    1.000    0.517     0.807    0.988    0.797
RO-PU           Average Purebr.    1.000    0.932    0.932     0.999    0.972    0.970
RO-PU           Average Hybr.      0.517    1.000    0.517     0.807    0.988    0.797
PE-PU           PE Purebr.         0.999    0.916    0.915     0.993    0.959    0.953
PE-PU           PExPU Hybr.        0.220    0.875    0.210     0.623    0.820    0.511
PE-PU           PU Purebr.         0.997    0.875    0.872     0.966    0.932    0.900
PE-PU           Average Purebr.    0.998    0.895    0.893     0.980    0.945    0.926
PE-PU           Average Hybr.      0.220    0.957    0.210     0.623    0.820    0.511

Setting K = 3, similar inconsistencies were observed with other configurations as well.
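The observation that runs with the proper clustering solution had a higher lnP(D) suggests a simple way to triage replicate runs at a given K. The sketch below is only illustrative: the lnP(D) values and the solution labels are invented, and `best_run` and `solution_counts` are hypothetical helper names, not part of STRUCTURE or of the analysis described here.

```python
from collections import Counter

# Hypothetical replicate runs at K = 3: (lnP(D), label of the clustering solution).
# All values are invented for illustration; they are not results from this study.
runs = [
    (-70110.2, "three species resolved"),
    (-70112.8, "three species resolved"),
    (-70109.5, "three species resolved"),
    (-70111.1, "three species resolved"),
    (-70954.3, "small groups merged"),
    (-70951.7, "small groups merged"),
    (-70960.0, "small groups merged"),
    (-70949.9, "small groups merged"),
    (-70957.4, "small groups merged"),
    (-70952.6, "small groups merged"),
]


def best_run(runs):
    """Return the (lnP(D), label) pair with the highest lnP(D)."""
    return max(runs, key=lambda run: run[0])


def solution_counts(runs):
    """Count how many runs converged to each clustering solution."""
    return Counter(label for _, label in runs)


print(best_run(runs))
print(solution_counts(runs))
```

More than one key in the counter signals multimodality among runs. Selecting the run with the highest lnP(D) only helps if at least one run found the proper partition, which was not the case for the 1000/100/100 configuration with a large Q. robur group, where all ten runs merged the two small groups.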
For instance, when I used group sizes of 1000, 500 and 100 individuals for the analysis, the two smallest groups again tended to cluster together (with bimodality of lnP(D)), while a biologically unreasonable subdivision of the large simulated group was observed. This merging of small clusters was more common when the small groups of 500 and 100 simulated purebreds were formed by Q. petraea and Q. pubescens (in either combination). By contrast, species were consistently identified correctly when the small clusters were formed by Q. robur and Q. petraea. The correct clustering solution was also obtained with 1000 simulated Q. petraea, 500 Q. robur and 100 Q. pubescens genotypes, but not in the opposite case (1000 Q. petraea – 500 Q. pubescens – 100 Q. robur). Furthermore, small groups were again merged into one cluster in two particular cases for the configurations of 1000/500/500 and 1000/1000/100 (Online Resource 5). In both cases, this was due to the occurrence of a common Q. petraea – Q. pubescens cluster in some runs when the small groups were formed by these two species. Finally, STRUCTURE gave the correct clustering solution when I used a less unbalanced configuration of 1000/1000/500 or equally large groups (1000/1000/1000).

3.4. Efficiency and accuracy of Bayesian assignment, diagnostic power of loci and effect of species configuration

Even when all 21 loci were used, the range of q-values for the simulated backcrosses overlapped greatly with that of F1 hybrids and, to a lesser extent, with that of purebreds (Online Resource 6). Therefore, I decided to use a single threshold value to distinguish between purebreds and hybrids. In STRUCTURE, a threshold value of q = 0.90 led to maximum total performance in the three-species configuration and in most of the two-species configurations (Online Resource 7).
With BAPS, total performance increased for threshold q-values up to 0.7–0.825, depending on the particular case, and then remained steady for higher values (Online Resource 7). This is because not a single individual, purebred or hybrid, received a q-value between 0.825 and 0.999 under any configuration (BAPS assigned individuals either as purebreds with q = 1.000 or as admixed with q < 0.825). Thus, a threshold of q = 0.90 was suitable for achieving maximum total performance in BAPS, too.

To test how fast the efficiency and accuracy of Bayesian clustering analysis increase with an increasing number of loci, I first calculated two measures of locus-specific diagnostic power, informativeness for assignment (In) and Wright's FST. The two measures resulted in different rank orders. I used both rank lists to run Bayesian analyses, starting from the most informative locus and gradually adding loci of lower diagnostic power (Online Resource 8). In general, the rank list based on In resulted in a steeper increase of efficiency and accuracy with increasing number of loci than the list based on FST, both when I included all species and when I carried out pairwise analyses (Online Resource 9). In all cases, purebred efficiency surpassed 80 % with the five most informative loci. The increase of efficiency and accuracy varied depending on the species compared and the software used.

In general, total performance of the analysis was higher with STRUCTURE than with BAPS (Table 2). Regarding simulated purebreds, I found a higher efficiency, but not accuracy, when I applied BAPS. This means that BAPS correctly assigned more simulated purebreds to their own group than STRUCTURE.
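The difference between efficiency and accuracy can be made concrete with a small sketch. It assumes the definitions that are consistent with the text and with Table 2 (efficiency: share of a true group that is correctly assigned; accuracy: share of the individuals assigned to a group that truly belong to it; performance: their product) and a single threshold of q = 0.90. The q-values and the function names are hypothetical.

```python
THRESHOLD = 0.90  # single threshold on the largest membership coefficient


def classify(q_max, threshold=THRESHOLD):
    """Call an individual 'purebred' if its largest q reaches the threshold."""
    return "purebred" if q_max >= threshold else "hybrid"


def efficiency_accuracy_performance(truth, calls, group):
    """Return (efficiency, accuracy, performance) for one group.

    Assumes the group occurs at least once in both `truth` and `calls`.
    """
    pairs = list(zip(truth, calls))
    correct = sum(1 for t, c in pairs if t == group and c == group)
    efficiency = correct / sum(1 for t, _ in pairs if t == group)
    accuracy = correct / sum(1 for _, c in pairs if c == group)
    return efficiency, accuracy, efficiency * accuracy


# Invented example: four purebreds and four hybrids, two of which have q above
# 0.90 and are therefore called purebred (as happens with many backcrosses).
truth = ["purebred"] * 4 + ["hybrid"] * 4
q_max = [1.00, 0.98, 0.95, 0.93, 0.97, 0.92, 0.70, 0.55]
calls = [classify(q) for q in q_max]

for group in ("purebred", "hybrid"):
    print(group, efficiency_accuracy_performance(truth, calls, group))
```

In this toy case purebred efficiency is 1 while purebred accuracy drops to 2/3, because two hybrids are counted among the individuals called purebred. This is the same qualitative pattern reported for BAPS.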
Moreover, BAPS required fewer loci than STRUCTURE to achieve the same purebred detection efficiency (Online Resource 9). However, STRUCTURE provided higher accuracy of purebred assignment, as BAPS tended to wrongly assign simulated hybrids or backcrosses as purebreds. In particular, BAPS assigned most backcrosses as purebreds with q = 1.00 in all species configurations, even when all 21 loci were used (Online Resource 10). With STRUCTURE, the majority of both hybrids and backcrosses showed q < 0.90.

Phylogenetic relationships played a significant role in method performance, especially concerning hybrids. Hybrid detection efficiency was generally higher when hybrids arose from combinations of the phylogenetically most differentiated species (Table 2). In both two- and three-species configurations, method performance was highest for hybrids between Q. robur and Q. petraea, whereas for Q. robur and Q. pubescens it was slightly lower. Method performance for hybrids between Q. petraea and Q. pubescens was markedly lower than for the other two species combinations (Table 2).

Inclusion of all three species in the analysis resulted in a slight reduction of method performance. This was true for both purebreds and hybrids, with either BAPS or STRUCTURE (Table 2). When all three species were included, backcrosses of Q. robur – Q. pubescens hybrids with Q. robur were relatively often assigned by STRUCTURE as hybrids of Q. robur with Q. petraea (12 % of such backcrosses). Likewise, backcrosses of Q. robur – Q. petraea hybrids with Q. petraea were assigned as hybrids of Q. robur with Q. pubescens (6 % of the cases; see also Online Resource 10).

4. DISCUSSION

The first part of the present study comprised analyses based on simulated purebreds of the three study species in various configurations of species and sample sizes.
A frequent observation throughout these analyses was the instability of clustering solutions among independent STRUCTURE runs at a particular K and the occurrence of biologically unreasonable partitions. As Bayesian clustering methods often use stochastic simulations and unsupervised approaches (as in the runs with simulated individuals presented here), analyses of the same data may generally produce several distinct solutions, even if the same initial conditions are used in each run (Jakobsson and Rosenberg 2007). Thus, the biologically unreasonable clustering solutions observed here apparently occurred because the STRUCTURE algorithm became stuck in suboptimal solutions. By choosing the runs with the highest posterior probability of the data, I could distinguish runs with a proper partition in several cases of multimodality. This strategy has also been followed elsewhere (e.g. Rosenberg et al. 2001, Reeves and Richards 2011). However, the frequency of runs with biologically unreasonable partitions increased when I used unbalanced sampling schemes. In the most unbalanced configurations, not even a single run resolved the three species for K = 3 (e.g. the configuration with 1000 simulated purebreds of Q. robur, 100 of Q. petraea and 100 of Q. pubescens).

Besides sample size configuration, another factor affecting the inferred clustering solution was the phylogenetic relationships among species. The more closely related Q. petraea and Q. pubescens showed a tendency to cluster together at K = 2. Yet, unbalanced sample sizes led to a merging of the smaller simulated groups irrespective of their phylogenetic identity, thus obscuring the effect of phylogenetic relationships. For example, analysis of 1000 simulated purebreds of Q. petraea, 100 of Q. robur and 100 of Q. pubescens for K = 2 gave such a result. By forcing K to 2, the same clustering was inferred by BAPS for this specific configuration.
Nevertheless, clustering for K = 2 with BAPS fit the phylogenetic relationships better than with STRUCTURE (results not shown).

In many studies carried out with STRUCTURE, it has been shown that small sample size hampers subpopulation identification (Rosenberg et al. 2002, Duminil et al. 2006, Kalinowski 2011). This problem can especially occur when the program is forced to assign individuals to an inappropriately small number of clusters (Kalinowski 2011). Here, I show that the problem may persist when the real K value is reached and, interestingly, even above it. Based on previous knowledge, Jakobsson and Rosenberg (2007) state that biological factors may cause multiple parts of the space of possible membership coefficients to provide similarly appropriate explanations for the data, which can explain the occurrence of multimodality. Indeed, in some case studies with real data, multimodality could be explained by subtle genetic structure within populations (or species) or by isolation by distance (Lepais et al. 2006, Höltken et al. 2012). However, in the present study, I clearly show that the occurrence of inappropriate partitions for unbalanced species configurations is due to stochastic error. There is no biological explanation for splitting a homogeneous group of simulated purebreds when K is equal to the number of species. Interestingly, for unbalanced configurations, converging solutions were provided at a higher K, for which the small groups were consistently separated from each other and the species represented by the largest group was admixed (for example, the preliminary data analysis for K = 5, Fig. 1). K values characterized by such converging solutions showed an increased posterior probability of the data and a secondary ΔK peak (Online Resource 3).
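To relate such ΔK peaks to the underlying lnP(D) values, the following sketch computes ΔK in the spirit of Evanno et al. (2005) from replicate runs: the absolute second-order difference of lnP(D) across successive K, divided by the standard deviation of lnP(D) among runs at that K. It is a simplification that works on the mean lnP(D) per K, and the input values and the function name `delta_k` are invented for illustration.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Map K -> delta K, given a dict K -> list of lnP(D) over replicate runs.

    Delta K is only defined where K - 1 and K + 1 were also run and where
    the replicate runs at K are not all identical (standard deviation > 0).
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue
        sd = stdev(lnp_by_k[k])
        if sd > 0:
            second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_difference / sd
    return result


# Invented lnP(D) values for three replicate runs at each K.
lnp_by_k = {
    1: [-1000.0, -1001.0, -999.0],
    2: [-900.0, -902.0, -898.0],
    3: [-880.0, -884.0, -876.0],
    4: [-875.0, -870.0, -880.0],
}

print(delta_k(lnp_by_k))
```

With these numbers ΔK is largest at K = 2, where lnP(D) rises steeply and the runs agree closely. Because the denominator is the between-run standard deviation, a K with tightly converging runs can produce a secondary peak even if the partition itself is not biologically meaningful.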
Probably, the size of the inferred clusters has, in turn, an influence on the outcome of each run. More balanced sizes among the inferred clusters may facilitate convergence and may result in a lower rate of suboptimal solutions, whether these are real clusters or spuriously admixed. In any case, multimodality decreased in more balanced configurations, but only the configuration with 1000 simulated purebreds of each species was totally free of biologically unreasonable clustering solutions for both K = 2 and K = 3.

In contrast to STRUCTURE, BAPS analyses consistently resulted in proper clustering solutions. In all 19 configurations tested, K = 3 was inferred as the most likely partition and admixture coefficients (q) corresponded to the taxonomic identity of individuals and populations. This result might be due to differences between the algorithms of the two software packages. BAPS uses a non-reversible algorithm process with intelligent operators, which enables simultaneous exploration of several local neighborhoods of parameter space, while preventing the absorption of any particular process to a relatively inferior state (Corander et al. 2008a). In contrast, the Gibbs sampler algorithm used by STRUCTURE (Pritchard et al. 2000) is prone to convergence problems and may not reach the true posterior even after a substantial number of iterations (Celeux et al. 2000, Hanage et al. 2009). These differences between the two programs should be taken into account, especially when blind approaches are followed (i.e. when the proportions among sample sizes are not known from the beginning).

The second part of the study aimed to explore the power of the chosen Bayesian clustering methods in assigning purebreds, F1 hybrids and backcrosses. Due to the balanced sampling schemes, multimodality and spurious partition were not an issue here.
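Since this part of the study relies on ranking loci by informativeness for assignment, a minimal sketch of that statistic may be useful. It assumes the formula of Rosenberg et al. (2003), In = Σ_j [ −p_j ln p_j + Σ_i (p_ij / K) ln p_ij ], where p_ij is the frequency of allele j in population i and p_j is the mean frequency over the K populations. The allele frequencies and locus names below are hypothetical.

```python
from math import log


def informativeness(freqs):
    """Informativeness for assignment (In) of one locus.

    `freqs` holds one allele-frequency list per population, with alleles in
    the same order in every list. Terms with zero frequency contribute zero.
    """
    k = len(freqs)
    total = 0.0
    for j in range(len(freqs[0])):
        p_mean = sum(pop[j] for pop in freqs) / k
        if p_mean > 0:
            total -= p_mean * log(p_mean)
        for pop in freqs:
            if pop[j] > 0:
                total += pop[j] / k * log(pop[j])
    return total


# Hypothetical loci with allele frequencies in two populations.
loci = {
    "locus_A": [[1.0, 0.0], [0.0, 1.0]],  # fully diagnostic
    "locus_B": [[0.8, 0.2], [0.3, 0.7]],  # partly differentiated
    "locus_C": [[0.5, 0.5], [0.5, 0.5]],  # identical frequencies
}

ranking = sorted(loci, key=lambda name: informativeness(loci[name]), reverse=True)
print(ranking)
```

A fully diagnostic locus reaches the maximum of ln K (here ln 2), while a locus with identical allele frequencies in all populations scores zero, so sorting in descending order puts the most informative loci first.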
At a first stage, I aimed to identify highly informative marker subsets by ranking the loci based on In and FST as criteria of diagnostic power. In general, In performed better than FST, as higher levels of efficiency and accuracy could be reached when markers were chosen based on their In value. These results are in agreement with a previous simulation-based study in humans by Ding et al. (2011). FST may underestimate the biologically relevant genetic differentiation, as it is strongly dependent on variation. In particular, the maximum value of FST decreases with increasing locus heterozygosity and, thus, the same value of FST may not reflect the same level of genetic differentiation (Hedrick 1999). This might explain why EST-SSRs had lower ranks in the list based on In than in the FST-based list. On the other hand, non-genic SSRs in the present study tended to be more variable and more powerful for Bayesian clustering analysis than EST-microsatellites (based on In). Highly variable dinucleotide SSRs have been shown to perform better in Bayesian clustering analyses than less variable markers such as trinucleotide SSRs (as most of the EST-SSRs used here), tetranucleotide SSRs or SNPs in other studies as well (Narum et al. 2008; Payseur and Jing 2009).

Regarding simulated purebreds, the increase of efficiency was fast and, in most cases, it was possible to assign more than 90 % of simulated purebreds correctly using only two markers. In general, BAPS showed a faster increase of purebred detection efficiency than STRUCTURE. This is obviously due to algorithmic differences between the two software packages. Unlike STRUCTURE, BAPS uses a filtering process that sets the admixture coefficient to 1 when weak evidence is provided by the genotypic data, aiming to reduce the number of false positive cases of admixture (Corander et al.
2008a). Thus, using only one locus, BAPS recognizes only purebreds (q = 1 to any cluster). An increasing number of loci results in an increasing number of admixed individuals. By contrast, STRUCTURE begins with a high number of admixed individuals, which diminishes with increasing number of loci as method performance improves. However, even when all 21 loci were used, there was an obvious lack of individuals with q-values between 0.8 and 0.99 in BAPS, while the q-value distribution in STRUCTURE was continuous.

A further consequence of these algorithmic differences is that BAPS tends to underestimate the number of admixed individuals. Even when all 21 markers were included in the analysis, BAPS misassigned more F1 hybrids as purebreds than STRUCTURE did. In general, BAPS tends to overestimate the number of purebreds, which results in a generally higher efficiency, but not accuracy, among purebreds. Similar observations were made in a recent study with red wolves (Canis rufus) and coyotes (Canis latrans) comparing the two software packages (Bohling et al. 2013). In any case, the present study showed that the marker set used here was not adequate for BAPS to distinguish backcrosses either from purebreds or from F1 hybrids (i.e. confidence intervals around the median q for all these categories were overlapping). By choosing a compromise threshold of q = 0.90, it was possible to assign a relatively weak majority of simulated backcrosses as hybrids with STRUCTURE, whereas with BAPS most of them received q = 1. Irrespective of the applied method, 21 loci are obviously not adequate to distinguish backcrosses from purebreds and hybrids among the study species. Up to around 50 loci may be required to distinguish backcrosses even when the parental species are highly divergent (Vähä and Primmer 2006). Use of a higher number of loci would also allow BAPS to cover a wider or even the whole range of q (Corander et al.
2008a), achieving a better efficiency of backcross (and hybrid) assignment. Researchers should keep this in mind when applying Bayesian clustering analyses to natural populations of hybridizing species, since backcrossed individuals may occur more often than F1 hybrids (Lepais et al. 2009).

Furthermore, phylogenetic relationships also influenced the performance of Bayesian analyses. For instance, more loci were required for the combination of Q. petraea and Q. pubescens than for the two other species combinations to achieve the same level of method performance. The effect of phylogenetic relationship on hybrid assignment performance was even stronger. Using STRUCTURE with the seven most informative loci, it was possible to distinguish F1 hybrids between Q. robur and Q. petraea or between Q. robur and Q. pubescens from the parental species, at least in the two-species configurations. By contrast, 10–12 loci were required to distinguish purebreds of Q. petraea from F1 hybrids between Q. petraea and Q. pubescens. Notably, even with all 21 loci, F1 hybrids still overlapped with purebreds of Q. pubescens in terms of q-values. This might be due to differences in allelic patterns among species. Both Q. robur and Q. petraea displayed reduced genetic variation at specific loci, which was due to the high frequency of a certain allele that is characteristic of the particular species (e.g. QrZAG96 for Q. robur and QrZAG112 for Q. petraea). This might have facilitated the correct assignment of purebred individuals. The lack of such loci in Q. pubescens probably accounts for the relatively low performance of purebred assignment. The utility of such loci for species discrimination has been shown in other case studies, too (Curtu et al. 2007, Neophytou et al. 2011). Lower performance for purebreds of Q. pubescens on the one hand and for hybrids between Q. petraea and Q.
pubescens on the other could also be observed in the three-species configuration. However, the most important outcome of the analysis using all three species is the reduction of efficiency and accuracy in comparison to the pairwise configurations. First, the presence of three clusters increased the probability of misassigning an individual, which is supported by the reduction of efficiency and accuracy in every single category of purebreds and hybrids. Second, the phylogenetic affinity between Q. petraea and Q. pubescens resulted in further assignment errors. In particular, F1 hybrids and backcrosses of Q. petraea with Q. robur were often assigned as hybrids involving Q. pubescens, and the opposite happened for hybrids and backcrosses of Q. pubescens with Q. petraea. This phenomenon, also observed elsewhere (Lepais et al. 2009, Bohling et al. 2013), additionally contributed to the reduction of method performance. It should be taken into account when several hybridizing species are included in the analysis, in order to avoid wrong conclusions about the direction of introgression and the species involved in hybridization.

ACKNOWLEDGEMENTS

This work was supported by the European Regional Development Fund (ERDF), the regional government authority of Baden-Württemberg in Freiburg (Regierungspräsidium Freiburg; RPF), the National Office of Forests (Office National des Forêts; ONF) in France and the Regional Directory of Food, Agriculture and Forestry of Alsace (Direction Régionale de l'Alimentation, de l'Agriculture et de la Forêt d'Alsace; DRAAF) in the frame of the Interreg IV project "The regeneration of the oaks in the Upper Rhine lowlands".
I express my gratitude to all the colleagues of the ONF, RPF and the FVA who worked on sample collection and laboratory analyses, to Jukka Corander for kindly answering several questions about the BAPS software and to two anonymous reviewers for providing valuable comments and suggestions.

DATA ARCHIVING STATEMENT

Genotypic data used for this study are available at Dryad: doi: 10.1007/s11295-013-0680-2.

REFERENCE LIST

Aldrich PR, Parker GR, Michler CH, Romero-Severson J (2003) Whole-tree silvic identifications and the microsatellite genetic structure of a red oak species complex in an Indiana old-growth forest. Can J Forest Res 33:2228–2237.
Antao T, Lopes A, Lopes RJ, Beja-Pereira A, Luikart G (2008) LOSITAN: A workbench to detect molecular adaptation based on a FST-outlier method. BMC Bioinformatics 9:323.
Belkhir K, Borsa P, Chikhi L, Raufaste N, Bonhomme F (2004) GENETIX 4.05, WindowsTM Software for Population Genetics. Laboratoire génome, populations, interactions, CNRS UMR 5000.
Bohling JH, Adams JR, Waits LP (2013) Evaluating the ability of Bayesian clustering methods to detect hybridization and introgression using an empirical red wolf data set. Mol Ecol 22:74–86.
Burgarella C, Lorenzo Z, Jabbour-Zahab R, Lumaret R, Guichoux E, Petit RJ, Soto Á, Gil L (2009) Detection of hybrids in nature: application to oaks (Quercus suber and Q. ilex). Heredity 102:442–452.
Celeux G, Hurn M, Robert CP (2000) Computational and Inferential Difficulties with Mixture Posterior Distributions. J Am Stat Assoc 95:957–970.
Corander J, Marttinen P (2006) Bayesian identification of admixture events using multilocus molecular markers. Mol Ecol 15:2833–2843.
Corander J, Marttinen P, Sirén J, Tang J (2008a) Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics 9:539.
Corander J, Sirén J, Arjas E (2008b) Bayesian spatial modeling of genetic population structure. Comput Stat 23:111–129.
Curtu AL, Gailing O, Finkeldey R (2007) Evidence for hybridization and introgression within a species-rich oak (Quercus spp.) community. BMC Evol Biol 7:218.
Ding L, Wiener H, Abebe T, et al (2011) Comparison of measures of marker informativeness for ancestry and admixture mapping. BMC Genomics 12:622.
Dow B, Ashley M, Howe H (1995) Characterization of highly variable (GA/CT)n microsatellites in the bur oak, Quercus macrocarpa. Theor Appl Genet 91:137–141.
Duminil J, Caron H, Scotti I, Cazal S-O, Petit RJ (2006) Blind population genetics survey of tropical rainforest trees. Mol Ecol 15:3505–3513.
Durand J, Bodénès C, Chancerel E, et al (2010) A fast and cost-effective approach to develop and map EST-SSR markers: oak as a case study. BMC Genomics 11:570.
Earl DA, vonHoldt BM (2012) STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genet Resour 4:359–361.
Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14:2611–2620.
Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164:1567–1587.
Frantz AC, Pourtois JT, Heuertz M, Schley L, Flamand MC, Krier A, Bertouille S, Chaumont F, Burke T (2006) Genetic structure and assignment tests demonstrate illegal translocation of red deer (Cervus elaphus) into a continuous population. Mol Ecol 15:3191–3203.
Gugger PF, Cavender-Bares J (2011) Molecular and morphological support for a Florida origin of the Cuban oak. J Biogeogr. Published on-line (doi:10.1111/j.1365-2699.2011.02610.x)
Guichoux E, Lagache L, Wagner S, Léger P, Petit RJ (2011) Two highly validated multiplexes (12-plex and 8-plex) for species delimitation and parentage analysis in oaks (Quercus spp.). Mol Ecol Resour 11:578–585.
Guichoux E, Garnier-Géré P, Lagache L, Lang T, Boury C, Petit RJ (2013) Outlier loci highlight the direction of introgression in oaks. Mol Ecol 22:450–462.
Hanage WP, Fraser C, Tang J, Connor TR, Corander J (2009) Hyper-Recombination, Diversity, and Antibiotic Resistance in Pneumococcus. Science 324:1454–1457.
Hedrick PW (1999) Perspective: Highly variable loci and their interpretation in evolution and conservation. Evolution 53:313.
Heuertz M, Fineschi S, Anzidei M, et al (2004) Chloroplast DNA variation and postglacial recolonization of common ash (Fraxinus excelsior L.) in Europe. Mol Ecol 13:3437–3452.
Höltken A, Buschbom J, Kätzel R (2012) Die Artintegrität unserer heimischen Eichen Quercus robur L., Q. petraea (Matt.) Liebl. und Q. pubescens Willd. aus genetischer Sicht (in German). Allg Forst Jagdztg 183:100–110.
Jakobsson M, Rosenberg NA (2007) CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23:1801–1806.
Kalinowski ST (2011) The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure. Heredity 106:625–632.
Kampfer S, Lexer C, Glössl J, Steinkellner H (1998) Characterization of (GA)n microsatellite loci from Quercus robur. Hereditas 129:183–186.
Kronforst MR, Young LG, Blume LM, Gilbert LE (2006) Multilocus analyses of admixture and introgression among hybridizing Heliconius butterflies. Evolution 60:1254–1268.
Kumar S, Skjæveland Å, Orr RJ, Enger P, Ruden T, Mevik B-H, Burki F, Botnen A, Shalchian-Tabrizi K (2009) AIR: A batch-oriented web program package for construction of supermatrices ready for phylogenomic analyses. BMC Bioinformatics 10:357.
Lepais O, Petit R, Guichoux E, Lavabre J, Alberto F, Kremer A, Gerber S (2009) Species relative abundance and direction of introgression in oaks. Mol Ecol 18:2228–2242.
Lexer C, Fay MF, Joseph JA, Nica M-S, Heinze B (2005) Barrier to gene flow between two ecologically divergent Populus species, P. alba (white poplar) and P. tremula (European aspen): the role of ecology and life history in gene introgression. Mol Ecol 14:1045–1057.
Manos PS, Doyle JJ, Nixon KC (1999) Phylogeny, Biogeography, and Processes of Molecular Differentiation in Quercus Subgenus Quercus (Fagaceae). Mol Phylogenet Evol 12:333–349.
Narum SR, Banks M, Beacham TD, et al (2008) Differentiating salmon populations at broad and fine geographical scales with microsatellites and single nucleotide polymorphisms. Mol Ecol 17:3464–3477.
Neophytou C, Aravanopoulos F, Fink S, Dounavi A (2010) Detecting interspecific and geographic differentiation patterns in two interfertile oak species (Quercus petraea (Matt.) Liebl. and Q. robur L.) using small sets of microsatellite markers. For Ecol Manag 259:2026–2035.
Neophytou C, Dounavi A, Fink S, Aravanopoulos F (2011) Interfertile oaks in an island environment: I. High nuclear genetic differentiation and high degree of chloroplast DNA sharing between Q. alnifolia and Q. coccifera in Cyprus. A multipopulation study. Eur J For Res 130:543–555.
Nielsen EE, Bach LA, Kotlicki P (2006) HYBRIDLAB (version 1.0): a program for generating simulated hybrids from population samples. Mol Ecol Notes 6:971–973.
Payseur BA, Jing P (2009) A Genomewide Comparison of Population Structure at STRPs and Nearby SNPs in Humans. Mol Biol Evol 26:1369–1377.
Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959.
Reeves PA, Richards CM (2011) Species Delimitation under the General Lineage Concept: An Empirical Example Using Wild North American Hops (Cannabaceae: Humulus lupulus). Syst Biol 60:45–59.
Rosenberg NA, Burke T, Elo K et al (2001) Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics 159:699–713.
Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Genetic Structure of Human Populations. Science 298:2381–2385.
Rosenberg NA, Li LM, Ward R, Pritchard JK (2003) Informativeness of genetic markers for inference of ancestry. Am J Hum Genet 73:1402–1422.
Rosenberg NA (2004) DISTRUCT: a program for the graphical display of population structure. Mol Ecol Notes 4:137–138.
Rosenberg NA (2005) Algorithms for selecting informative marker panels for population assignment. J Comput Biol 12:1183–1201.
Steinkellner H, Fluch S, Turetschek E, Lexer C, Streiff R, Kremer A, Burg K, Glössl J (1997a) Identification and characterization of (GA/CT)n-microsatellite loci from Quercus petraea. Plant Mol Biol 33:1093–1096.
Steinkellner H, Lexer C, Turetschek E, Glössl J (1997b) Conservation of (GA)n microsatellite loci between Quercus species. Mol Ecol 6:1189–1194.
Takezaki N, Nei M, Tamura K (2010) POPTREE2: Software for constructing population trees from allele frequency data and computing other population statistics with Windows interface. Mol Biol Evol 27:747–752.
Vähä J-P, Primmer CR (2006) Efficiency of model-based Bayesian methods for detecting hybrid individuals under different hybridization scenarios and with different numbers of loci. Mol Ecol 15:63–72.
Weir BS, Cockerham CC (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358–1370.