Brandvain et al. Founding C. rubella

Genomic identification of founding haplotypes reveals the history of the selfing species Capsella rubella

Yaniv Brandvain1,∗, Tanja Slotte2, Khaled M. Hazzouri3, Stephen I. Wright3, Graham Coop1

1 Section of Evolution and Ecology & Center for Population Biology, University of California - Davis, Davis, CA, USA
2 Evolutionary Biology Center, Uppsala University, Uppsala, Sweden
3 Dept. of Ecology and Evolutionary Biology, University of Toronto, Toronto, ON, Canada
∗ E-mail: [email protected]

Abstract

The evolution of predominant self-fertilization is one of the most common transitions in flowering plants; however, the population genomic processes occurring early in this transition are still not well understood. We investigate patterns of population genomic variation in a selfing species, Capsella rubella, recently derived from its outcrossing progenitor, C. grandiflora. Viewing the genome as a mosaic of incompletely sorted founding haplotypes and regions where all individuals are descended from the same founding haplotype, we utilize variation at sites jointly segregating in both species to infer the alternative extant haplotypes inherited from C. rubella’s founders. Since genomic regions where individuals carry different founding chromosomes are incompletely sorted since C. rubella’s origin, pairwise comparisons between these chromosomes mimic interspecific comparisons - that is, these regions are characterized by high levels of neutral diversity and strong selective constraint. By contrast, comparisons within genomic regions descended from the same founding chromosome contain information regarding the recent history of C. rubella, and as such, harbor very little sequence diversity, a slight excess of rare alleles, and low levels of selective constraint. To infer the history of C. rubella since its founding, we build a coalescent model that makes use of these patterns within and between founding chromosomes to infer that C.
rubella was founded ≈ 50 kya (or 100 kya, depending on the mutation rate estimate), and has experienced a twenty-fold reduction in effective population size relative to C. grandiflora. The low levels of sequence diversity in C. rubella are consistent with this long-term reduction in effective population size, and do not require an extreme founding event. As population genomic data from an increasing number of outcrossing/selfing pairs are generated, analyses of these data with approaches like the one proposed herein will facilitate a fine-scale view of the evolution of self-fertilization.

Author Summary

While many plants require pollen from another individual to set seed, in some species, self-pollination is the norm. This transition from outcrossing to self-fertilization is among the most common in flowering plants. Here we use dense genome sequence data to identify where in the genome two individuals have inherited the same or different chunks of ancestral diversity present in the founders of the selfing species, Capsella rubella, to obtain a genome-wide view of this transition. This identification of ancestral chromosomes allows us to partition mutations that occurred before and after C. rubella separated from its outcrossing progenitor, C. grandiflora. With this partitioning, we estimate that C. rubella split from C. grandiflora between 50 and 100 kya (depending on the estimated mutation rate). In this relatively short time frame, an extreme reduction in C. rubella’s population size is associated with a massive loss of genetic variation and an increase in the relative proportion of putatively deleterious polymorphisms. Although much ancestral variation has been lost by drift and/or selection, the approximately 25% of the genome for which two individuals have inherited different ancestral variation makes up roughly 90% of the genetic variation between pairs of individuals.

Introduction

The majority of flowering plant species are hermaphroditic (i.e. have both male and female reproductive organs), but to avoid self-fertilization many have evolved elaborate mechanisms including self-incompatibility and physical (herkogamy) or temporal (dichogamy) separation of male and female function. These mechanisms are thought to have evolved to allow plants to evade the high costs of selfing, including inbreeding depression and the limited genetic diversity of offspring from selfing parents. However, an estimated ∼ 15% of flowering plant species [1] (but see [2] for a discussion of this estimate), including many commercially important crop species [3], are self-compatible and predominantly selfing. This transition from outcrossing to inbreeding by self-fertilization is one of the most common evolutionary transitions in flowering plants [4, 5]. This transition is thought to occur when the short-term evolutionary benefits of selfing (e.g. reproductive assurance when mates or pollinators are limiting [6], the ‘automatic’ advantage of fertilizing one’s own ovules as well as those of others [7], and the maintenance of locally adapted genotypes [8]) overwhelm the immediate costs of inbreeding depression [9, 10]. However, in the longer term, limited genetic diversity and difficulty in shedding deleterious mutations are thought to make selfing an evolutionary dead end (for example, [11–13]).

Here, we utilize population genomic data from a young selfing species, Capsella rubella, and its outcrossing progenitor, C. grandiflora, to provide a genome-wide view of the transition from outcrossing to selfing. C. rubella is a highly selfing and broadly distributed (its indigenous range includes much of Europe, the Middle East, and Northern Africa [14, 15]) annual [16], with low levels of genetic diversity [17–19]. By contrast, C.
grandiflora is self-incompatible (and therefore obligately outcrossing), with a limited distribution (Greece, Italy and parts of Albania) and high levels of diversity [17–19]. The stark contrast between these closely related species provides an opportunity to document the demographic and selective consequences of the origin of a selfing species and its spread across the globe.

A rough view of the history of speciation in this pair has begun to emerge from population genetic analyses of numerous loci; however, the exact details are murky. For example, while it is clear that C. rubella split from C. grandiflora in the last 100,000 years, estimates of this split time vary dramatically [14, 15, 19]. Additionally, the low levels of diversity in C. rubella reflect a massive decrease in the effective population size relative to C. grandiflora [14, 15, 19]; however, it is unclear whether this decrease corresponds to a dramatic bottleneck of perhaps a single individual at the founding of C. rubella [15], or whether the signature of a bottleneck reflects a long-term decrease in C. rubella’s effective population size. Similarly, while C. rubella is found worldwide and C. grandiflora is restricted to Greece, the details of C. rubella’s spread are not clear: did C. rubella leave Greece as it originated, or is there evidence for a distinct ‘Out of Greece’ event? It is in hopes of answering these questions that we introduce our haplotype-based approach.

We illustrate our novel approach in Figure 1, which depicts a recently founded selfing population (e.g. C. rubella). Today, at a given site, all individuals trace their ancestry to one of a small number of founding alleles that have survived to the present. Due to the high level of selfing, and hence a low effective rate of recombination, this ancestry can persist for long distances along the genome [20], allowing for the identification of founding haplotypes (bottom right of Figure 1A).
Here, we develop an approach to assign extant sequences to alternative founding chromosomes using patterns of variation at sites polymorphic in both species. This allows us to learn about the diversity of the founders, and to partition mutational changes into those that occurred before and after the origin of C. rubella (Figure 1B). We then use this partitioning of diversity within and among ancestral haplotypes to gain a detailed view of the timing of the origin of C. rubella, its historical rate of population growth, and changes in selective pressures since its origin.

The methods developed herein highlight that the fourfold decrease in diversity in C. rubella relative to C. grandiflora represents a twenty-fold decrease in the effective population size of C. rubella. This long-term reduction in Ne suggests that ecological factors associated with selfing and/or the long arm of linked selection act to dramatically depress the effective population size of C. rubella well beyond the expected two-fold decrease due to strict autozygosity. Such ecological and evolutionary factors may explain the common [21–24], but not universal [25], ≫ 2-fold decrease in diversity in selfing populations as compared to their outcrossing relatives. This decrease in Ne is associated with an extreme relaxation in selection, as demonstrated by a 3.5-fold increase in the ratio of non-synonymous to synonymous diversity (πN /πS) within haplotypes as compared to that between species and haplotypes, or within C. grandiflora. This partitioning of πN /πS within and among haplotypes more clearly demonstrates that the reduced efficacy of purifying selection in selfers [17, 18, 25] may be more dramatic than the slight increase in πN /πS often documented.

[Figure 1 panels: A) Coalescent history of a selfer, showing haplotypes from the ancestral, outcrossing, and selfing populations, the Nf founding haplotypes, τ, and N0. B) Chromosome 7; x-axis: chromosome position (1 kb scale).]

Figure 1. The coalescent history of a recently derived selfing species. A) We depict a selfing species founded by Nf chromosomes (i.e. Nf /2 individuals) τ generations ago with no subsequent introgression. At that time we sample Nf chromosomes from a large outcrossing population (i.e. a finite number of alternative chromosomes drawn from the progenitor population), which instantaneously grow to a population size of N0. The entire selfing population therefore traces its ancestry back to one of Nf chromosomes, which we depict in red and blue (for Nf = 2). The low effective recombination rate in the selfing species ensures that large chunks of ancestral chromosomes remain intact, while recombination severely scrambles ancestral chromosomes in the outcrosser. B) A portion of chromosome seven in C. rubella, with inferred founding chromosomes colored as red or blue. Mutations segregating in samples of both Capsella species are marked in black, while mutations private to the C. rubella sample are marked in green.

Results

Samples / Sequencing

We analyze SNP data generated from the transcriptomes of 11 Capsella samples (six C. rubella, five C. grandiflora) aligned to the C. rubella reference genome [17], using the GATK pipeline to call SNPs (METHODS). Three of our C. rubella samples were collected from Greece (denoted by G, below), the native range of C. grandiflora and the putative location of origin of C. rubella [14, 15], and three were collected from outside of Greece (from Italy, Algeria, and Argentina, denoted by O, below). Therefore, our Greek C. rubella samples are likely closer to demographic equilibrium and have a potential for recent introgression with C. grandiflora, while our Out-of-Greece samples provide us with an opportunity to explore the influence of geographic expansion on patterns of variation in C. rubella.
We implemented a series of quality controls as described in the METHODS, resulting in nearly identical values of π in 53 kb of Sanger sequencing as compared to our genotype calls (see METHODS and SUPPLEMENT).

Genome-wide summaries

We briefly present simple summaries of our data without consideration of haplotype labels (see Figure S1 for a graphical version of these results), noting that we use the same dataset as [17] and that our slightly different quality controls yield very similar results.

Diversity within and among species: The mean number of pairwise differences at fourfold degenerate sites (hereafter πS) is much lower in C. rubella (πS Cr = 0.41%) than in C. grandiflora (πS Cg = 1.86%) or between species (πS Cr x Cg = 2.03%). Out-of-Greece [O] C. rubella samples are more closely related to one another than are Greek [G] C. rubella samples, while comparisons across geographic labels are most diverse (πS Cr[O] = 0.27%, πS Cr[G] = 0.40%, πS Cr[GxO] = 0.46%), a result consistent with an Out-of-Greece history of C. rubella. However, diversity between C. grandiflora and Greek C. rubella samples (πS Cr[G] x Cg = 2.04%) is very similar to that between C. grandiflora and Out-of-Greece C. rubella samples (πS Cr[O] x Cg = 2.02%), suggesting that recent introgression in Greece has not been common enough to shape patterns of diversity.

The similarity of diversity within C. grandiflora and between species speaks to the recency of the species split and provides a way to date this split. Assuming a simple model of speciation with no subsequent gene flow and that the expected pairwise coalescent time in C. grandiflora is the same today as it was in the population ancestral to C. grandiflora and C. rubella, we use levels of diversity within and between species to estimate a divergence time (following [26]). Under these assumptions, interspecific diversity at neutral sites should equal diversity within C.
grandiflora plus two times the product of the per generation mutation rate and the number of generations since the population split (i.e. πCg x Cr ≈ πCg + 2τµ, where µ is the neutral per generation mutation rate), yielding an estimate of τ = 8.47 × 10^−4 /µ. We convert this estimate of split time to a date of 56.5 kya by assuming a neutral mutation rate of 1.5 × 10^−8 per generation [27] and one generation per year [16] to facilitate a direct comparison to previous analyses [14]. Replacing the Koch [27] estimate of µ with that of Ossowski [28] (7 × 10^−9), we estimate a split time of 100 kya, in line with a recent analysis of these data [17].

The allele frequency spectrum in C. rubella: We investigate the allele frequency spectrum separately in our three Greek and three Out-of-Greece samples (to avoid the confounding effect of population structure on demographic inference), polarizing the spectrum by the major allele in C. grandiflora (ties broken randomly). In these comparisons, 58% (Out-of-Greece) and 59% (Greek) of segregating fourfold degenerate sites are singletons, values lower than the 66% expected for three samples under the standard neutral model. This result is consistent with a recent population contraction associated with the origin of C. rubella, and corroborates the large positive value of Tajima’s D reported elsewhere [17].

Relaxed selection in C. rubella: To investigate the hypothesis that selfing species accrue a higher burden of deleterious mutations than outcrossing species [29], we compare the ratio of diversity at zerofold and fourfold degenerate sites (πN /πS). In line with this hypothesis, and with the other analysis of these data [17], πN /πS is lowest within C. grandiflora and between species (0.144 and 0.146, respectively), and greatest within C. rubella (0.173). Thus, we see evidence for relaxed constraint in C.
rubella; however, this pattern leaves no imprint on interspecific comparisons, presumably because the large majority of interspecific variation can be traced to polymorphism between current day C. grandiflora and C. rubella’s founders, rather than the low levels of diversity accumulated since C. rubella’s origin.

Comparisons within and among haplotypes

We now introduce our comparisons within and among founding haplotypes, described in Figure 1. To orient the reader, we begin with a simple heuristic view of expected patterns of haplotype sharing across the genome and patterns of diversity within and among founding haplotypes. If two individuals inherit the same founding haplotype for a fraction p0 of their genome, diversity between them can be thought of as the sum of diversity within and between haplotypes, weighted by p0 and (1 − p0), respectively (i.e. πS Cr = πS same hap × p0 + πS diff hap × (1 − p0)). Since the divergence between C. rubella and C. grandiflora is much less than the coalescent time scale in C. grandiflora, and since both interspecific pairs and alternative founding chromosomes cannot coalesce until the origin of C. rubella (barring secondary contact), we assume that πS same hap ≈ 0, and πS diff hap ≈ πS Cr x Cg, so that πS Cr = πS Cr x Cg × (1 − p0). With our diversity estimates (above, see Figure S1A), πS Cr ≈ 0.004 = 0.02 × (1 − p0), so 1 − p0 ≈ 0.2 and we estimate that two individuals carry the same founding haplotype for 80% of their genome. Below, we show both that our analysis is strikingly consistent with this conceptual view of the history of C. rubella, and that patterns of variation within haplotypes and rates of transition between haplotypes can provide novel insight into the biology of C. rubella.

Haplotype labeling: Building on our genomic model of the origin of a selfing population presented in Figure 1, we utilize patterns of polymorphism within C.
rubella and between species to robustly identify where individuals have inherited the same or different founding chromosomes in a non-parametric framework. In regions where two C. rubella individuals differ at ancestrally polymorphic sites, they cannot reside on the same founding chromosome. We therefore place pairs of individuals on alternate founding chromosomes in regions where they consistently differ at sites segregating in both species, and place samples on the same founding chromosome in stretches of the genome where they are identical at such sites. Additionally, we assign all C. rubella samples to the same founding chromosome in stretches of the genome where a number of sites polymorphic in C. grandiflora are fixed in C. rubella. We then use this preliminary assignment to place all individuals on a set of founding chromosomes. We note that regions with more than one founding chromosome may reflect incomplete lineage sorting or introgression of C. grandiflora chromosomes into C. rubella after the founding event. In practice, given the long-range linkage disequilibrium in C. rubella and the recent divergence of these species, these two options are hard to distinguish, and so for the sake of brevity we simply refer to founding chromosomes. In the METHODS we describe this algorithm more explicitly, including details of the number of SNPs and physical distances we require to place samples on the same or different founding chromosomes. While we also experimented with both Hidden Markov Models (similar to those implemented in FASTPHASE [30]) and sliding window-based analyses to call haplotypes, in practice these yielded poor haplotype calls, perhaps because of the necessarily patchy nature of transcriptomic data. We found our non-parametric method to yield robust haplotype calls, and show in the SUPPLEMENT that our results are largely insensitive to the parameter choices made.
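The pairwise step of the labeling logic described above can be illustrated with a toy sketch. It assumes two homozygous samples and a single window of sites that segregate in both species; the function name `label_window` and the thresholds `min_sites` and `min_diffs` are hypothetical placeholders, not the SNP-count and physical-distance criteria given in the METHODS.

```python
def label_window(alleles_a, alleles_b, min_sites=4, min_diffs=2):
    """Toy pairwise haplotype label for one genomic window.

    alleles_a, alleles_b: alleles of two homozygous samples at sites that
    segregate in BOTH species (None marks missing data).
    min_sites, min_diffs: hypothetical thresholds chosen for illustration.
    Returns "same", "diff", or "ambiguous".
    """
    # Keep only sites where both samples have a call.
    informative = [(a, b) for a, b in zip(alleles_a, alleles_b)
                   if a is not None and b is not None]
    n_diffs = sum(a != b for a, b in informative)

    if len(informative) < min_sites:
        return "ambiguous"   # too little data to call
    if n_diffs >= min_diffs:
        return "diff"        # repeated differences at shared polymorphisms
    if n_diffs == 0:
        return "same"        # identical at every shared polymorphism
    return "ambiguous"       # a lone difference could be an error


if __name__ == "__main__":
    print(label_window(list("ACGTA"), list("ACGTA")))
    print(label_window(list("ACGTA"), list("ATGCA")))
    print(label_window(list("ACGTA"), list("ACGTC")))
```

Treating a single mismatch as ambiguous rather than as evidence for different founders is a design choice of this sketch, meant to mirror the requirement that samples "consistently differ" before being placed on alternate founding chromosomes.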
Patterns of haplotype sharing

Our data span 124.6 Mb of the C. rubella genome. Approximately 120 Mb of these 124.6 Mb could be assigned a recombination rate from a genetic map (constructed from a QTL cross between C. rubella and C. grandiflora [31]), for a total genetic map length of 339 cM. While this genetic map may not be representative of that in C. rubella, it is more appropriate to measure lengths of haplotypes on a genetic rather than physical map, and so we quote both lengths.

Although C. rubella is predominantly self-fertilizing, some of our samples contained large genomic regions of residual heterozygosity across multiple chromosomes (e.g. Figure S2A, S2B, and S2C). These heterozygous tracts can complicate assignment to founding haplotypes, as in these regions haplotypic phase is unknown. Although a computational phasing step is conceptually simple, it is unnecessary here: only 7.2% of our total C. rubella sequence appeared to be heterozygous, so we simply treat the few heterozygous regions as missing data and directly observe phase in the remaining homozygous sequence. Individual heterozygosity in these regions resembles diversity between C. rubella samples (Figure S2G), suggesting that these regions are not unusual, so excluding them will have little impact on our results. Additionally, the homozygous nature of the genome allows for an additional step of quality control: since all heterozygous sites in putatively homozygous regions are likely errors, we treat these sites as missing data (METHODS).

Pairwise haplotype sharing: Patterns of pairwise haplotype sharing provide a view of population history and a measure of relatedness between individuals. In Figure 2A we display the frequency with which two individuals are assigned to the same or different haplotypes. Two C. rubella individuals have inherited the same founding chromosome for on average 72% of their genome, and different founding haplotypes for on average 15% of their genome, and our calls are ambiguous (i.e.
we are unwilling to make a haplotype call due to missing data causing conflicting calls amongst samples, or imperfect resolution of haplotype switch points) for the remaining 13%. This result is consistent with our ‘back-of-the-envelope’ prediction that two individuals share the same founding haplotype for 80% of the genome (above).

We note that these proportions vary across geographic comparisons: two Out-of-Greece samples carry the same haplotype more often than two Greek samples, and comparisons between Greek and Out-of-Greece samples have the lowest proportion of haplotype sharing. Additionally, pairs of Out-of-Greece samples contain relatively more long regions of haplotype sharing than pairs of Greek samples, which contain slightly more long runs of haplotype sharing than combinations of Greek and Out-of-Greece samples (Figure 2B). In total, patterns of haplotype sharing support a recent Out-of-Greece bottleneck in C. rubella, and corroborate our findings from π within and between C. rubella geographic regions (i.e. πS Cr[O] ≪ πS Cr[G] < πS Cr[GxO], above; see Figure S1A).

A summary of haplotype assignment: For a majority of the genome (51%), all samples are homozygous and have inherited the same founding chromosome, suggesting that drift and selection since the origin of C. rubella have homogenized much of the genome. Additionally, for a large portion of the genome (83% on average), a C. rubella individual is assigned to one of two common haplotypes. The remaining 17% of an average C. rubella genome is split between heterozygous regions (7%), contradictory haplotype assignment due to missing data (5%), transitions between haplotypes (3%), and assignment to a rarer haplotype (2%, i.e. not one of the 2 dominant haplotypes). Since approximately 2% of an average sample’s genome is assigned to a third (or rarer) haplotype, approximately 10% of the genome contains at least one individual putatively assigned to a third or rarer haplotype. We revisit these regions later.

[Figure 2 panels: A) Pairwise hap. sharing; x-axis: geographic comparison (G/G, G/O, O/O); y-axis: frequency; categories: same hap, ambiguous, diff hap. B) Length of shared haps.; x-axis: length bin (cM); y-axis: frequency; series: G/G, G/O, O/O.]

Figure 2. Patterns of haplotype sharing in C. rubella: A) The proportion of the genome for which two individuals are on the same or different haplotypes, or for which haplotype calls are ambiguous (see text for explanation). B) The proportion of regions of haplotype sharing of length X. In the SUPPLEMENT, we recreate this figure using physical, rather than genetic, distances and find qualitatively similar patterns.

Polymorphism at putatively neutral sites

Above, we described how often we infer two C. rubella individuals to carry the same or different founding chromosomes. Here, we utilize patterns of pairwise sequence differences within and among these founding chromosomes to investigate the composition of the founders of C. rubella and to illuminate evolutionary processes occurring since C. rubella’s founding. To minimize the influence of haplotype mis-assignment on our inference, we limit these analyses to regions in which two individuals are inferred to be on the same or different haplotypes by both our pairwise comparisons and our higher level haplotype assignment (e.g. we exclude regions with ambiguous haplotype calls as described above). We then find the mean fraction of sequence differences at fourfold degenerate sites between C. rubella individuals assigned to the same or different haplotypes (πS Cr Same and πS Cr Diff, respectively), irrespective of whether these sites were polymorphic in C. grandiflora.

Neutral diversity within and among founding chromosomes: Diversity among C.
rubella founding haplotypes is very high (see Figure ??A) and similar to interspecific diversity (compare πS Cr Diff = 2.22% to πS Cg x Cr = 2.03%). By contrast, diversity within founding haplotypes is very low (compare πS Cr Same = 5.29 × 10^−4 to πS Cr = 40.72 × 10^−4, Figure ??A). This result is consistent with different haplotypes corresponding to alternative founding chromosomes, drawn from the full diversity of C. grandiflora. That diversity between haplotypes resembles interspecific diversity reflects the idea that in both cases coalescence is restricted to the time before speciation. The slightly deeper divergence between haplotypes than between species may result from the fact that, in order for us to call individuals as being on different haplotypes, they necessarily have to differ at a sufficient number of sites, which will inflate their divergence compared to two truly randomly sampled C. grandiflora haplotypes.

Since differences within founding haplotypes represent mutations that have arisen since the origin of C. rubella, the low diversity within founding haplotypes can provide insight into historical processes subsequent to speciation. For example, by converting the sample-wide πS Cr Same to the average coalescent time (in years), we find that, on average, two samples on the same founding haplotype share a common ancestor 17.6 kya (assuming a per generation mutation rate of 1.5 × 10^−8 and 1 generation per year, as above). If there has been no coalescence since the origin of C. rubella, as predicted under very rapid growth/expansion (i.e. a starlike tree), this value would provide an accurate estimate of the origin of C. rubella. That this value is substantially smaller than our estimate of species origin from diversity within and between species (56.5 kya, above, or 100 kya under the lower mutation rate estimate) suggests appreciable levels of coalescence since C. rubella’s origin. Therefore, to estimate the age of C.
rubella from patterns of polymorphism within haplotypes, we integrate our estimate of πS with the observed frequency spectrum within haplotypes (below).

Diversity within founding chromosomes across the globe: As in other measures of genetic distance between samples, pairwise relatedness within haplotypes is greatest in comparisons between Out-of-Greece samples, and relatedness is lowest between Greek and Out-of-Greece samples (πS Same[O] = 3.48 × 10^−4, πS Same[G] = 5.55 × 10^−4 and πS Same[GxO] = 6.05 × 10^−4), providing independent support for an Out-of-Greece founding event which occurred subsequent to the origin of C. rubella (we note that, as predicted from this hypothesis, this pattern is not observed in comparisons between haplotypes, Figure ??A).

The relationship between length of a shared haplotype and genetic distance: For each region where two individuals have inherited the same founding haplotype, we can estimate the time to coalescence from the number of mutations separating these samples (hereafter, Tµ). If there is variation in within-haplotype coalescent times in C. rubella, all else being equal, regions where two individuals share a long stretch of haplotype should have a shallower coalescent time because there has been less time for recombination to break apart this region of shared ancestry. In Figure ??B, we show a strong relationship between the length of haplotype sharing and the diversity between shared haplotypes. This result implies both plentiful opportunities for coalescence in C. rubella (i.e. C. rubella does not have a starlike topology), confirming our results above, and that C. rubella, although predominantly selfing, does outcross with some frequency.
In fact, by fitting a simple model to the relationship between Tµ and the genetic length of a region where two samples are assigned to the same founding haplotype, and by accounting for the possibility that a recombination event will be unobserved because an individual recombines back onto the same founding haplotype, we estimate a 6% outcrossing rate in C. rubella (see METHODS for details of this model). This estimate is admittedly somewhat crude, as it does not make use of all the data (e.g. potentially observable recombination events within founding haplotypes); however, it is not misled by ancestral recombination before the origin of C. rubella. Additionally, this estimate of the historical outcrossing rate is in strong agreement with an independent estimate of the current outcrossing rate between 0.06 and 0.10 based on single site heterozygosity [19].

[Figure 3 panels: Diversity at silent sites (πS %, same hap and diff hap; G/G, G/O, O/O; grand x grand, rube x rube, rube x grand). πS by length (x-axis: length bin, cM; Greek, Out of Greece). AFS, same hap: 3 samples (All Greek, All Out-of-Greece, neutral expectation). AFS, same hap: 6 samples (all samples, neutral expectation); x-axis: number of derived alleles; y-axis: frequency.]

Figure 3. Diversity within and between haplotypes in C. rubella. A) Mean number of pairwise sequence differences at fourfold degenerate sites (in log10 scale). B) Ratio of diversity at zerofold to fourfold degenerate sites. Geographic comparison (denoted by the number of non-Greek (W) samples) is given on the x-axis.

The demography of C. rubella: To learn more about the demographic history of C. rubella since its founding, we examine the allele frequency spectrum in regions where more than two individuals share the same founding haplotype.
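As a reference point for the within-haplotype frequency spectra discussed in this section, the minimal sketch below computes two limiting expectations: the standard neutral spectrum, in which the expected share of segregating sites carrying i derived copies is proportional to 1/i, and a starlike genealogy with no coalescence since founding, in which every segregating site is a singleton. The function names are ours and purely illustrative; this is not the coalescent simulation used for inference in the METHODS.

```python
from fractions import Fraction


def neutral_afs(n):
    """Expected share of segregating sites with i = 1..n-1 derived copies
    in a sample of n chromosomes under the standard neutral model
    (constant size, no structure): proportional to 1/i."""
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    weights = [Fraction(1, i) for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def starlike_afs(n):
    """Limiting case of no coalescence since founding: all new mutations
    fall on external branches, so every segregating site is a singleton."""
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    return [Fraction(1)] + [Fraction(0)] * (n - 2)


if __name__ == "__main__":
    for n in (3, 6):
        print(n, [round(float(x), 3) for x in neutral_afs(n)])
```

For three samples the neutral singleton share is 2/3, which is the 66% benchmark quoted in the genome-wide summaries above; observed spectra falling between the two limiting cases indicate some, but incomplete, coalescence since founding.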
If, for example, the C. rubella population grew quickly to a large size after its founding, such that drift (and therefore coalescence) was negligible over the past ∼ 55 ky, then we would expect all sites segregating within a haplotype to be singletons. Alternatively, if C. rubella has been maintained at a stable size that is small enough (and/or for long enough) to ensure coalescence at all loci, the observed frequency spectrum would match the standard neutral model.

The allele frequency spectrum: In Figure ??C-D we display the allele frequency spectrum when individuals are assigned to the same haplotype. Contrary to the frequency spectrum observed without reference to haplotypic identity (Figure S1C-D), in which we observed an excess of common variants, here we observe a very minor excess of rare variants in Greece and a stronger excess of rare variants in our Out-of-Greece samples. This suggests that our Greek samples are only slightly removed from drift-mutation equilibrium (i.e. little recent population growth), while the site frequency spectrum in our Out-of-Greece samples is consistent with a population expansion upon leaving Greece. We note, however, that given the broad geographic sampling of our Out-of-Greece samples, this signal of population growth outside of Greece could be due to the confounding effect of population structure [32]. We observe a substantial excess of singletons in our combined sample (Figure ??D), which likely largely reflects the population subdivision documented above.

Inferring the number of founders and the timing of speciation: Inspired by previous methods that aim to infer the number of founding chromosomes from patterns of genetic variation [33, 34], we jointly estimate the number of founding chromosomes and the time of C. rubella’s founding. To do so, we simulate a simple coalescent history in which C. rubella is founded by Nf chromosomes drawn at random from C. grandiflora τ generations ago with no subsequent gene flow from C.
grandiflora, and in which this population then grew to the current effective number of chromosomes, N0 (Figure 1, and see METHODS for details). We then infer these demographic parameters (τ, N0, Nf) by finding the likelihood of our parameter values given observed summaries of our data. This instant growth model is clearly a very simple cartoon of the history of C. rubella, but it captures the spectrum of previous models proposed by [15]. Details of our inference method are described in the METHODS, but we sketch the salient points here. We infer the time of founding (τ), the current effective number of chromosomes (N0), and the number of founders (Nf) of C. rubella in a composite likelihood framework. Because this framework treats all observations as independent, it inflates our confidence in our parameter estimates (see METHODS for more discussion). Specifically, we generate expected values of the allele frequency spectrum within haplotypes and the fraction of genomic windows where all samples inherited the same founding haplotype by simulating a coalescent model across a grid of Nf founders giving rise to C. rubella a scaled time τ/N0 (in units of N0 generations) ago (Figure 1). After estimating τ/N0, we resolve this compound parameter by using our estimate of diversity within haplotypes and an estimate of the mutation rate of 1.5 × 10^−8. Throughout, we limit this analysis to four exchangeable samples (three from Greece and one from Out-of-Greece), so that our inference is not misled by population structure. In Figure 4A we present the composite likelihood of our data across a grid of τ/N0 and Nf. A broad ridge in parameter space (1.2 < τ/N0 < 1.9, and 3 ≤ Nf < ∞) is consistent with our observed data (i.e. these values are within two log-likelihood units of our MLE). Our relatively large estimate and peaked likelihood with respect to τ/N0 (MLE = 1.7) reflect the slight excess of singletons and the preservation of alternative founding haplotypes (Figure 4C). 
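To make the simulated quantity concrete, the following is a minimal sketch of how one might approximate the fraction of windows in which all samples trace to a single founding chromosome under the instant-growth model. It is not the inference code used here: the function names (`lineages_surviving`, `frac_all_same_founder`) are hypothetical, recombination within a window is ignored, and we assume that each lineage surviving to the founding is assigned independently and uniformly to one of the Nf founding chromosomes.

```python
import random

def lineages_surviving(n, t_scaled, rng):
    """Number of ancestral lineages of n samples left after t_scaled = tau/N0
    coalescent time units (pairwise coalescence rate 1 per unit)."""
    k, t = n, 0.0
    while k > 1:
        # With k lineages, the waiting time to the next coalescence is
        # exponential with rate k*(k-1)/2.
        t += rng.expovariate(k * (k - 1) / 2.0)
        if t > t_scaled:
            break
        k -= 1
    return k

def frac_all_same_founder(n, t_scaled, n_founders, reps, seed=0):
    """Monte Carlo estimate of the probability that all n samples descend
    from one founding chromosome (assumed uniform choice among founders)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        k = lineages_surviving(n, t_scaled, rng)
        founders = {rng.randrange(n_founders) for _ in range(k)}
        hits += (len(founders) == 1)
    return hits / reps

# Example: four samples at the MLE tau/N0 = 1.7, with an illustrative Nf = 1000.
print(frac_all_same_founder(4, 1.7, 1000, reps=10_000))
```

Evaluating such a function across a grid of τ/N0 and Nf, and comparing the result to the observed fraction of completely sorted windows, gives one component of a composite likelihood; the within-haplotype allele frequency spectrum would be handled analogously.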
From this result and πS within haplotypes, we estimate N0 (Figure 4D) to lie between 25,000 and 42,000, and the split time to lie between 48 and 52 kya (which is fairly close to the estimated split time of 56 kya, above). We further note that the composite likelihood is particularly flat with respect to Nf, suggesting that the precise details concerning the number of founders of C. rubella have been lost to generations of coalescence (Figure 4B,C2); however, we point out that this model suggests that the hypothesis that C. rubella was founded by a single hermaphrodite is particularly unlikely, as Nf = 2 lies just outside of our confidence interval, a finding consistent with the direct observation of regions containing at least three ancestral chromosomes (below).

Relaxed selection in C. rubella: By comparing the ratio of non-synonymous to synonymous polymorphisms within and among haplotypes, we can evaluate whether most putatively deleterious mutations associated with selfing accumulated during the founding of the selfing species or subsequent to its early evolution. In Figure 5A, we show that πN/πS between haplotypes (0.139) is similar to values in C. grandiflora (0.144) and between species (0.146), consistent with our view of the different haplotypes representing draws from C. grandiflora diversity. By contrast, πN/πS within haplotypes is very large (0.438), suggesting that the accumulation of putatively deleterious mutations since the origin of the species is predominantly responsible for the elevated occurrence of putatively deleterious variants in C. rubella. Moreover, πN/πS within haplotypes increased with the number of worldwide (Out-of-Greece) samples in the comparison (πN[G]/πS[G] = 0.416, πN[GxO]/πS[GxO] = 0.439, πN[O]/πS[O] = 0.461), suggesting a further weakening of selection outside of the Mediterranean (Figure 5A). Although the elevated ratio of non-synonymous to synonymous diversity in C. 
rubella may be due to a shift in mating system, it is also possible that, since the putatively deleterious mutations within haplotypes in C. rubella are necessarily young, there may have been insufficient time for purifying selection to remove them, which would elevate πN/πS. This hypothesis, which may explain some of our pattern, clearly cannot fully explain the extreme elevation of πN/πS within C. rubella haplotypes, because even common variants (e.g. doubletons, tripletons, etc.) are not much rarer at nonsynonymous sites than they are at synonymous sites (Figure 5B). Thus, it seems likely that relaxed purifying selection following a shift to selfing has allowed deleterious mutations to rise to moderate frequency in C. rubella.

Diversity and haplotype sharing genome-wide

Haplotype sharing: The view put forward in this manuscript is that C. rubella diversity is a mosaic of lineages tracing back to relatively few founding haplotypes, drawn from a large population ancestral to both C. rubella and C. grandiflora, that have survived to the present day. Thus, when we conduct a pairwise comparison between individuals, we move between long genomic regions which coalesce to the same founding haplotype and regions which coalesce to alternative founding haplotypes. Scaling this comparison to all six of our samples, we envision moving across genomic regions where some proportion of our samples all coalesce to the same founding haplotype. When all individuals coalesce to the same haplotype, the major haplotype frequency is 1, and πS will be approximately πS(same hap) ≈ 0.05%, representing mutations accumulated since the local most recent common ancestor. At the other extreme, in regions where there is no coalescence since the founding of C. rubella (i.e. all samples reside on different founding chromosomes), we expect πS = πS(Cr x Cg) = πS(diff hap) ≈ 2%. 
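As a rough consistency check on this mosaic view, genome-wide pairwise diversity can be written as a weighted average of the two extremes. The sketch below is illustrative only: the function name is hypothetical, and it plugs in the rounded values quoted in the text (two individuals share a founding haplotype across roughly 80% of the genome, πS ≈ 0.05% within and ≈ 2% between founding haplotypes).

```python
def expected_pairwise_pi(frac_same, pi_same, pi_diff):
    """Genome-wide pairwise pi_S as a mixture of same-haplotype and
    different-haplotype regions, plus the share of that diversity
    contributed by different-haplotype regions."""
    within = frac_same * pi_same
    between = (1.0 - frac_same) * pi_diff
    total = within + between
    return total, between / total

# Rounded inputs taken from the text; treat the outputs as approximate.
total, share_between = expected_pairwise_pi(0.80, 0.0005, 0.02)
print(f"pi_S = {total:.2%}, between-haplotype share = {share_between:.0%}")
```

With these rounded inputs the mixture gives a πS of about 0.44%, with roughly nine-tenths of the diversity arising from comparisons between founding haplotypes, in line with the approximate figures reported elsewhere in the text.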
With this view in mind, we present πS and the major haplotype frequency in overlapping 10 kb sliding windows, slid every 2 kb across chromosome 7, in Figure 6, where the frequency of the major haplotype is the number of individuals assigned the major haplotype divided by the number of homozygous, non-contradictory haplotype assignments, and πS is calculated as above (we present figures for each chromosome in the SUPPLEMENT). Figure 6 suggests a negative relationship between the frequency of the major haplotype (hereafter M) and πS: as the major haplotype frequency decreases, πS increases. Genome-wide, we find a significant negative correlation (Pearson correlation, r = −0.23, p < 2.2 × 10^−16) between πS and M in non-overlapping two kb windows. Since the correlation between the C. rubella M and πS in C. grandiflora is much weaker than that observed in C. rubella (r = −0.02, p < 7.3 × 10^−6), elevated πS and minor haplotype frequencies due to alignment errors alone are unlikely to drive the observed pattern.

Figure 4. A summary of our model of the history of C. rubella: The sensitivity of these results to the cutoffs used in haplotype labeling is presented in the SUPPLEMENT. A) The composite log-likelihood of the data as a function of τ/N0 and Nf, normalized so that the MLE = 0. B) The probability that all individuals reside on the same haplotype as a function of Nf at the MLE and at the lower and upper confidence limits for τ/N0. The dotted red line indicates the observed value. C) A summary of simulation results (assuming Nf = 1000). C1) The frequency of singletons, doubletons, and tripletons observed in simulation (full lines) and in our data (dashed lines), conditional on all lineages coalescing by τ. C2) The frequency of one, two, three or four lineages surviving to the founding. When Nf is large, Pr(nc = 1) is the probability that all samples coalesce to the same founding haplotype. The dotted black line portrays the observed frequency of all lineages residing on one founding haplotype, ηobserved. D) The estimated effective number of chromosomes in C. rubella as a function of the number of founding chromosomes at the MLE and at the lower and upper confidence limits for τ/N0. [Plot legend: τ/N0 = 1.16 (lower bound), 1.70 (MLE), 1.96 (upper bound).]

Figure 5. Nonsynonymous variation in C. rubella: A) Diversity at zerofold degenerate sites divided by diversity at fourfold degenerate sites, within and between C. rubella haplotypes. B) The frequency spectrum of nonsynonymous and synonymous sites when all C. rubella samples are descended from the same founding allele. [Panel titles: "πN/πS by haplotype"; "AFS same hap: syn vs nonsyn" (x-axis: derived allele count).]

More than two haplotypes? Above, we found that some portions of the genome could not be easily assigned to one of two founding chromosomes. While such regions are excellent candidates for genomic regions tracing their ancestry back to more than a single founder, there are numerous alternative explanations (e.g. misalignment of paralogous regions, unobserved recombination events). Since the observation of more than two founding chromosomes in a region directly informs the provocative speculation that the ancestry of C. 
rubella can be traced to just a single founder [15], here we carefully vet these regions and produce a list of our top candidate regions containing three (or more) haplotypes; that is, we identify regions in which our inferred third haplotype cannot be easily explained by misalignment or many crossovers. Finding such regions would provide direct empirical confirmation of our model-based inference that C. rubella likely originated from more than one founder (i.e. > 2 haplotypes). After a series of rigorous controls (see METHODS), we find 80 genomic regions that support the existence of > 2 haplotypes (all are presented in the SUPPLEMENT, to allow the reader to evaluate this evidence). We show an example of these regions in Figure 6, where we compare a region with two (Figure 6B) or three (Figure 6C) inferred haplotypes. Regions with three or more haplotypes are generally quite short (thirty-six are 10 kb or less, twenty are greater than 10 kb and up to 20 kb, twelve are greater than 20 kb but less than 30 kb, nine are 30-50 kb, and of the remaining three, two are 74 kb and one is 90 kb). We note that since there has been substantial time for recombination, we expect regions with three haplotypes to be short, and we speculate that we missed many regions likely containing > 2 founding chromosomes.

Discussion

We present a novel framework to interpret patterns of sequence diversity in recently founded populations by assigning chromosomes to alternative founding haplotypes. We exploit this view to provide a detailed characterization of the evolutionary transition from outcrossing to self-pollination in C. rubella. In principle, our conceptual approach is applicable to any founding event recent enough to preserve a reasonable portion of the polymorphism present in the founders. The application to Capsella was aided by the fact that few founding lineages contribute ancestry to our C. rubella sample, and that levels of linkage disequilibrium differ so starkly between C. 
rubella and C. grandiflora, making identification of the founding haplotypes relatively easy. As these criteria are met by many recently founded selfing species and populations (e.g. selfing populations or species within Leavenworthia [22], Mimulus [23, 35], Arabidopsis lyrata [36], and Clarkia xantiana [21]), including a number of commercially important species (e.g. indica rice [37] and soybean [38]), our framework should be of broad use as population genomic resources continue to be developed in these systems [39–41]. Our approach provides a straightforward interpretation of pairwise sequence diversity across the genomes of recently derived selfers. Moving across the genome, we transition between regions in which our samples coalesce at or since the origin of selfing, and regions in which samples do not coalesce until they join the ancestral outcrossing population. Critically, we can use polymorphism present in the outcrossing progenitor to assess if two individuals have inherited the same or different founding haplotypes, as individuals that differ at ancestrally segregating sites certainly inherited different founding haplotypes. Viewing pairwise comparisons across the genome as comparisons either within or between founding chromosomes provides a straightforward way to decompose patterns of genomic variation in C. rubella.

Figure 6. Haplotypic diversity across chromosome seven: A) Mean pairwise sequence diversity and major haplotype frequency across chromosome seven. Grey boxes surround narrow QTL for key functional selfing traits (labeled in black). πS and the major haplotype frequency in such regions do not differ from randomly sampled, size-matched regions (not shown). Red points are regions putatively containing more than two haplotypes. B and C) Pairwise sequence diversity between three individuals for an exemplar portion of chromosome seven containing two (B) or potentially three (C) haplotypes. Lines at πS = 2% are presented as a reference for expected diversity among haplotypes. C) Pairwise sequence diversity between three individuals all assigned to different haplotypes (beginning at position 10 874 000) for a portion of chromosome seven. [Panel titles: "A) Diversity by major haplotype frequency across chromosome 7"; "B) Pairwise πS (No 3rd hap)"; "C) Pairwise πS (3rd hap candidate)". Legend: Argentina x Greece 1, Argentina x Greece 2, Greece 1 x Greece 2. QTL labels: Petal size, S-locus.]

Figure 7. An unrooted neighbor joining tree for all Capsella samples, built from a matrix of pairwise πS. A) Without regard to haplotype information. B) When C. rubella samples are on the same haplotype. C) When C. rubella samples are on different haplotypes. [Scale bar: πS = 0.5%. Legend: grand., rube. [G], rube. [O].]

Within a founding lineage, diversity is incredibly low and a reasonable fraction of this diversity is nonsynonymous (i.e. putatively deleterious); however, diversity between founding haplotypes is comparable to interspecific diversity and little of this diversity is non-synonymous. In Figure 7, we present a set of neighbor joining trees to visualize how consideration of haplotypic structure aids in our interpretation of patterns of variation in Capsella. Without considering haplotype, i.e. using data from the entire transcriptome (Figure 7A), we get a rough view of the history of Capsella: C. rubella contains little genetic diversity, is distinct from C. grandiflora, and Out-of-Greece samples cluster closely together. When only considering diversity within haplotypes (Figure 7B), diversity in C. 
rubella is completely dwarfed by diversity within C. grandiflora and between species; however, zooming in (top left of Figure 7B), the Out-of-Greece history of C. rubella becomes clear. By contrast, comparisons between founding haplotypes reveal a starlike phylogeny for all sequences (Figure 7C). This starlike tree suggests that the founding lineages of C. rubella were close to a random selection of ancestral chromosomes present in C. grandiflora, rather than a distinct sub-population, and that there has been little drift in C. grandiflora since the founding of C. rubella. Overall, our haplotype-based approach facilitates an in-depth view of the history and selective regime of the selfing species, C. rubella. Results of our haplotype-based approach provide a rough sketch of the history of C. rubella (we assume a mutation rate of 1.5 × 10^−8; use of the other major mutation rate estimate of 7 × 10^−9 [28] results in a two-fold increase in our date estimates). Approximately 50 kya (potentially during the ‘long mild middle’ of the last ice age), a C. grandiflora-like ancestral population of unknown size became largely selfing and gave rise to C. rubella. Although most of the diversity present in the founding population has been lost to drift since the origin of C. rubella, two C. rubella individuals inherit different founding haplotypes for ≈ 20% of their genome, and the diversity maintained from the founding population makes up ≈ 90% of extant diversity. Since its origin, C. rubella has experienced a long-term reduction in population size as compared to its outcrossing progenitor, C. grandiflora, and therefore maintains less diversity and relatively more non-synonymous diversity than C. grandiflora. Sometime between the origin of C. rubella and the present, a population left Greece and spread across the globe. 
No obvious signal of an extreme bottleneck: Due to the high levels of autozygosity within a selfing population, the effective population size of a selfing population can be reduced by as much as one half compared to the same population under outcrossing [42, 43]. Therefore, all else being equal, neutral diversity in selfing taxa should be approximately half of that observed in their outcrossing relatives. As selfing species often exhibit a greater than two-fold reduction in diversity, severe founding bottlenecks (along with other demographic processes [44, 45] and linked selection [46]) are often invoked to explain this discordance (e.g. in C. rubella [14, 15]). Such founding bottlenecks are seen as evidence supporting the idea that selfing species are often founded by a small number of individuals, consistent with reproductive assurance favoring the evolution of selfing, rather than the gradual evolution of selfing favored by other advantages of selfing [47]. The very low levels of diversity within C. rubella seemed initially to be consistent with this view. Indeed, we find that for a given genomic region, few founding lineages drawn from a C. grandiflora-like ancestor contributed ancestry to present day C. rubella. However, the reduction in diversity appears to be due to drift subsequent to the founding of C. rubella, and this low level of diversity does not require an extreme bottleneck at C. rubella’s founding. This high level of drift confounds our ability to estimate the actual number of founding chromosomes, because the genetic contribution of founders has been lost (see [33, 34] for further discussion). We therefore caution that low current levels of diversity in selfing plants may erode historical signals concerning their founding. Our likelihood-based inference, in combination with evidence for > 2 haplotypes in some genomic regions, shows that the hypothesis that C. rubella was founded by a single plant with no subsequent secondary contact from C. 
grandiflora is not well founded; however, we lack sufficient information to pinpoint the founding population size.

Long term reduction in C. rubella effective population size: The observations above reflect the fact that the size of the population that founded C. rubella played a minor role in shaping C. rubella diversity, as compared to drift since its founding. Levels of diversity since the founding of C. rubella seem to reflect the long-term maintenance of a small effective population size, corresponding to a twenty-fold reduction relative to the 600,000 effective chromosomes making up C. grandiflora. Although the cause of the reduced effective population size in C. rubella is unclear, numerous forces (e.g. frequent oscillations in population size, linked selection, etc.) may be responsible [44, 46, 48], and future work on the determinants of Ne in selfing species will clarify this issue. This small effective population size has led to a rapid loss of diversity since the founding of C. rubella. While some regions maintain higher levels of pairwise sequence diversity, with multiple extant lineages, if this small size persists, C. rubella will quickly lose much of its genetic diversity. For example, two individuals currently reside on the same founding haplotype for 80% of the genome, resulting in a profound lack of diversity. At the current rate, it will take only another 40 ky for 95% of the genome of two C. rubella individuals to be homozygous for all C. grandiflora variation. This would reduce genome-wide πS in C. rubella to 0.0016, severely limiting the pool of standing variation available for a response to selection. Perhaps it is in part this low diversity that limits the adaptive evolution of selfing species and contributes to their eventual demise [11–13].

Relaxed selection in C. 
rubella: The long-term reduction in effective population size and our comparisons within and between founding haplotypes also clarify the process of the accumulation of deleterious mutations in C. rubella. Viewing the chromosomes that contributed ancestry to C. rubella as a random draw from an ancestral C. grandiflora population, we expect (and indeed observe; Figure 5A) pairwise comparisons between founding chromosomes to show πN/πS values similar to those of comparisons within C. grandiflora. Therefore, the founding of C. rubella does not appear to have facilitated the accumulation of deleterious mutations; however, the long-term reduction in effective population size has severely weakened the efficacy of purifying selection, as is reflected by the threefold increase in πN/πS within haplotypes as compared to between species, between haplotypes, or within C. grandiflora. Although most pairwise comparisons across the C. rubella genome reflect comparisons within a founding haplotype, πN/πS within C. rubella is closer to interspecific πN/πS, because a majority of the diversity (≈ 90%) observed in C. rubella exists between founding haplotypes. In principle, elevated πN/πS values observed within C. rubella haplotypes could be explained by insufficient time for the removal of these mutations in a non-equilibrium population. However, numerous observations suggest that this is not the case, and favor the hypothesis that the accumulation of putatively deleterious mutations in C. rubella is due to a long-term reduction in population size. Specifically, our observation of numerous non-synonymous variants as doubletons and tripletons within a founding haplotype, and an allele frequency spectrum within C. rubella haplotypes close to the spectrum expected under equilibrium, support the latter hypothesis. Our view of the origin of deleterious mutations in C. 
rubella may explain the fact that although πN/πS within selfing species often exceeds that of their outcrossing relatives (e.g. [25]), limited observations of divergence between selfers and outcrossers do not suggest an excess of putatively deleterious substitutions on selfing branches (as we observe in comparisons within and between Capsella species [25, 49]). Since many selfing species are thought to be young, most of the divergence between them and their outcrossing relatives may reflect sorting of polymorphisms accumulated as an outcrosser, rather than deleterious mutations that have arisen and achieved high frequency since the founding of the selfing species.

Future prospects: With our haplotype-based approach, we can provide a reasonable sketch of the history of C. rubella with only six C. rubella genomes; however, numerous questions remain. For example, with whole genome data, we should be able to more clearly identify contiguous regions with more than two founding haplotypes, and with sequence from more individuals, it may be possible to identify putative regions of recent introgression between Capsella species (for which we found no evidence). Moreover, additional samples will provide a more fine-grained view of the allele frequency spectrum, ushering in a more complete model of population expansion and contraction over the last 50 ky in C. rubella, and will facilitate better estimation of haplotype frequencies across the genome, providing more information to leverage in an attempt to identify recent positive selection in a species with a small effective population size, elevated linkage disequilibrium and a recent founding.

Materials and Methods

Sequencing, alignment, and sequence quality

We utilize genotype data from sequencing RNA extracted from flower bud tissue of 11 samples (6 C. rubella and 5 C. grandiflora), which was then mapped to the C. rubella reference genome as described previously [17]. 
To call SNPs from the RNA data, we utilized the GATK pipeline on the BAM files [50, 51], and instituted straightforward QC steps, treating as missing data all genotypes with coverage less than 10X, quality scores (from the GATK pipeline) less than 30, and/or heterozygous sites in otherwise autozygous regions. We compared our genotype data to ≈ 53,000 sites of Sanger sequencing and found very little discordance (see SUPPLEMENT). Two sites labeled as alternative in the transcriptome data were observed to be reference in the Sanger data, while two sites called as NA in the transcriptome data were observed to be alternative alleles in the Sanger data. Diversity, measured as the mean number of pairwise sequence differences, was very similar in both types of data (πSanger = 0.156%, πTranscriptome = 0.159%, for 72,066 and 71,645 pairwise comparisons between base pairs, respectively). See SUPPLEMENT for more details.

Runs of Residual Heterozygosity

Given the high level of observed selfing in C. rubella, the genome of a C. rubella individual is expected to be mostly homozygous. However, some heterozygous regions are expected in field-collected selfing species with non-zero outcrossing rates. Indeed, we observed numerous heterozygous sites in our C. rubella samples. Such sites could be caused by genotyping and/or alignment error, de novo mutations, or residual heterozygosity retained since a lineage’s most recent outcrossing event. Since incompletely sorted heterozygosity will be maintained in chunks corresponding to portions of the alternative chromosomes present in the most recent outcrossed ancestor, while sequencing errors will be distributed widely across the genome, we utilize the distribution of heterozygous sites across the genome to separate truly heterozygous regions from sequencing error in C. rubella. More specifically, we identify regions of residual heterozygosity by examining the local density of heterozygous sites. 
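The idea of examining local density can be illustrated with a simple windowed count. This is only a hypothetical sketch: region boundaries in this study were identified by eye rather than by a fixed rule, and the window size, threshold, and function names below are arbitrary assumptions chosen for illustration.

```python
def het_density(het_positions, chrom_length, window=100_000):
    """Count heterozygous sites in consecutive non-overlapping windows.
    Positions are assumed to be 0-based and smaller than chrom_length."""
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for pos in het_positions:
        counts[pos // window] += 1
    return counts

def candidate_het_windows(counts, threshold):
    """Indices of windows whose heterozygous-site count reaches the
    (arbitrary, user-chosen) threshold."""
    return [i for i, c in enumerate(counts) if c >= threshold]

# Toy example: a cluster of heterozygous sites in the second window.
counts = het_density([5, 10, 150, 160, 170, 299], chrom_length=300, window=100)
print(counts, candidate_het_windows(counts, threshold=3))
```

Windows flagged in this way would mark candidate stretches of residual heterozygosity, whereas isolated heterozygous calls scattered across otherwise homozygous windows are more plausibly artifacts.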
These regions are generally quite obvious (Figure S2A-F), so we visually identified the beginnings and ends of these incompletely sorted regions. Heterozygous sites in regions of residual heterozygosity should largely reflect the individual's true genotype, with relatively few genotyping errors. By contrast, the vast majority of heterozygous sites in long autozygous regions are expected to be artifacts (e.g. sequencing error, misalignment, etc.), and very few should represent de novo mutations that have arisen since the region was last made homozygous by descent due to inbreeding. Reassuringly, πS in these regions is comparable to πS between individuals (compare πS ind = 0.43% to πS pop = 0.41%, Figure S3). Outside these regions, on average 0.13% of synonymous sites are heterozygous. This error rate varies significantly across individuals, corresponding to sequencing lane, and we therefore treat such sites as missing data in our population genomic analyses.

Labeling founding haplotypes

Since C. rubella has recently arisen from C. grandiflora, much of the genome is incompletely sorted and therefore many sites (≈ 1/4 of sites polymorphic in C. rubella) are segregating in both species. We leveraged these jointly segregating sites to assign individuals to putative founding haplotypes across the genome. These haplotypes represent chromosomes contributed to C. rubella by an ancestral, C. grandiflora-like population (see Figure 1). We note that throughout this assignment, we ignored individuals who had truly heterozygous stretches overlapping a region, regardless of whether they appeared to be on the same or different founding chromosomes.

Preliminary pairwise haplotype assignment: We began by comparing pairs of C. rubella individuals at sites jointly segregating in both species. In regions where there are exactly two haplotypes segregating, our two founding C. 
rubella haplotypes must differ at these sites, and regardless of the number of founders, samples with different alternative alleles at ancestrally polymorphic sites cannot share a founding chromosome. We use this simple intuition to preliminarily assign individuals to the same or different founding haplotype across the genome. In pairwise comparisons between samples, we placed each site polymorphic in both species in one of three categories (ignoring heterozygous regions):
1. NA sites: Sites where either individual is heterozygous or contains missing data.
2. Same sites: Sites where both individuals are homozygous for an identical allele.
3. Different sites: Sites where individuals are homozygous for alternative alleles.
Ignoring NA sites, we identified runs of haplotype sharing between two samples beginning with a ‘Same site’ and ending at the last ‘Same site’ before a ‘Different site.’ We identified runs of haplotype differences in a similar manner.

Higher order haplotype assignment: We next utilize our pairwise haplotype assignment to label alternative founding haplotypes in C. rubella. In addition to providing a view into the allele frequency spectrum within a haplotype, integrating data from all pairwise comparisons in this manner allows us to correct for cases in which we incorrectly assigned pairwise comparisons to the same haplotype due to missing data. Additionally, this higher-order haplotype labeling step assists us in identifying the number and frequency of alternative haplotypes across the genome (e.g. we find regions with ancestry from only one founding chromosome). 
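The preliminary pairwise step (classifying each jointly segregating site and then collapsing consecutive informative sites into runs) can be sketched as follows. This is a minimal illustration rather than the code used for the analysis; the genotype encoding (a pair of alleles, or `None` for missing data) and the function names are assumptions made for the example.

```python
def classify_site(g1, g2):
    """Classify one jointly segregating site for a pair of individuals.
    Each genotype is a 2-tuple of alleles, or None if missing."""
    if g1 is None or g2 is None or g1[0] != g1[1] or g2[0] != g2[1]:
        return "NA"            # missing data or a heterozygous genotype
    return "same" if g1[0] == g2[0] else "different"

def haplotype_runs(positions, labels):
    """Collapse site labels into runs of (label, start, end, n_sites),
    ignoring NA sites. A run ends at its last site before the label changes."""
    runs = []
    for pos, lab in zip(positions, labels):
        if lab == "NA":
            continue
        if runs and runs[-1][0] == lab:
            label, start, _, n = runs[-1]
            runs[-1] = (label, start, pos, n + 1)
        else:
            runs.append((lab, pos, pos, 1))
    return runs

# Toy example with five jointly segregating sites.
positions = [100, 200, 300, 400, 500]
ind1 = [("A", "A"), ("A", "G"), ("C", "C"), ("T", "T"), ("G", "G")]
ind2 = [("A", "A"), ("A", "A"), ("C", "C"), ("C", "C"), ("A", "A")]
labels = [classify_site(a, b) for a, b in zip(ind1, ind2)]
print(haplotype_runs(positions, labels))
```

Keeping the number of informative sites alongside each run's coordinates makes it straightforward to apply length and site-count filters to the runs afterwards.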
To begin, we identified regions of high confidence haplotype assignment: pairwise comparisons in which runs of the same or different haplotypes extended more than 1.5 kb and consisted of at least 4 informative sites (we explore alternative haplotype labeling rules in the SUPPLEMENT, where we show that our results hold under most reasonable criteria). In addition to these high-confidence regions, we added another category: long regions (> 10 kb, and containing more than five sites polymorphic in C. grandiflora) with no shared polymorphisms between species, representing regions where the ancestry of all C. rubella samples coalesces to a single founding chromosome (i.e. invariant regions). With our high-confidence pairwise assignments and invariant regions, we broke the genome into windows. The starting and ending points of each window correspond to the beginnings and ends of invariant regions and high-confidence pairwise assignments across any pair of individuals. We then placed individuals on haplotypes in each window as follows:
1. In invariant regions, we assigned all individuals to the same haplotype.
2. In all other regions, we placed individuals with high-confidence pairwise haplotype assignments onto alternative haplotypes by constructing networks of haplotype sharing. To do this,
(a) We began with the first individual (this choice does not affect the algorithm, see below), found which (if any) others were on the same haplotype with high confidence, and labeled all of these individuals as ‘haplotype 1’.
(b) We continued this process until no further individuals matched haplotype one.
(c) We then chose the first individual not assigned to haplotype one and placed it on haplotype two, finding the other individuals on this haplotype as described above.
(d) We continued this scheme until all of these individuals were assigned to a haplotype.
3. Occasionally, due to missing data, there was discordance in our haplotype assignment (e.g. individual 1 was on the same haplotype as individuals 2 and 3, but individuals 2 and 3 were assigned to different haplotypes). In such cases, the order of our algorithm could influence our results; however, since these regions clearly suffered from poor data quality, we labeled such conflicts as ‘ambiguous regions’ and did not use them when comparing within or among haplotypes. This both maintains high data quality and ensures that the order of our individuals does not influence our results.
4. Since pairwise assignments began and ended at the first and last different (or same) ancestrally polymorphic site, in some regions an individual was not assigned to the same or different haplotype as any other sample, and was labeled ‘NA.’ These regions represent switches between founding chromosomes, which arise from historical recombination events.
5. After implementing this algorithm, individuals were assigned to a haplotype (or labelled as NA, heterozygous, or ambiguous) for every genomic window.

Estimating the outcrossing rate

The relationship between the divergence time of a pair of sequences descended from the same founding haplotype and the genetic length of this shared haplotype can provide a novel estimate of the effective outcrossing rate. This follows from the fact that the genetic length of a shared haplotype is expected to be inversely proportional to the divergence time (in number of generations) and to the probability, β, that a recombination event produces a visible change in haplotype. This probability, β, reflects the fact that a recombination event can be hidden, either by the lack of heterozygosity within the individual (due to numerous generations of selfing), and/or by the lack of heterozygosity within the population. 
As Tµ (the mutational distance between two sequences) and the genetic length of a region for which two individuals reside on the same haplotype (L) provide us with two independent estimates of the number of generations since the pair last shared a common ancestor, we can estimate the probability of a visible recombination per generation. We do so by fitting a linear model to predict 2/L by πS/2µ, forcing the intercept of this regression to be zero, where L is the haplotype length in Morgans. The slope of this linear model, β, is an estimate of twice the probability that a recombination event produces a visible change in haplotype. Looking exclusively at comparisons between Greek samples, and limiting our analysis to regions with > 0.005 cM of sharing (where we have very strong confidence in our haplotype calls), we find β = 0.012. Dividing β by one minus the haplotype homozygosity within Greece (the probability that two individuals are on different haplotypes, 1 − p0 ≈ 0.2) allows us to adjust β for the probability that an outcrossing individual recombines without locally changing haplotypes. This yields an estimated outcrossing rate of ≈ 0.06 (0.012/0.2), representing an outcrossing event roughly 6 times in every 100 generations (MAIN TEXT).

Demographic inference

To infer the history of C. rubella, we simulate a coalescent model where, at time τ, Nf chromosomes found our population and the population size instantaneously grows to N0 effective chromosomes (Figure 1A). To avoid potential confusion with the definition of the effective population size in selfers (see [52] for recent discussion), we directly use the effective number of chromosomes, N0, as our coalescent units, so that the rate of coalescence of a pair of lineages equals 1/N0. We note that our inference of the number of founding chromosomes is inspired by, and a slight elaboration of, two recent models [33, 34] built for microsatellite and PCR-amplified loci, respectively.
Coalescent simulations

To infer the demographic parameters of interest (τ, Nf, and N0), we make use of two critical pieces of information: the allele frequency spectrum when all samples reside on the same haplotype (φ) and the frequency with which all samples have inherited the same founding chromosome (η). In our four exchangeable samples, φ = 0.62 singletons : 0.22 doubletons : 0.16 tripletons and η = 0.66. We simulate a coalescent model to generate the composite likelihood of our data given our parameters, C(φ, η|τ, N0, Nf), and obtain C(φ, η|τ, N0, Nf) by multiplying C(φ|τ, N0, Nf) and C(η|τ, N0, Nf). In this simulation, our parameters are the coalescent-scaled time until the founding event, τ/N0, and the number of founding chromosomes, Nf. For each combination of τ/N0 and Nf we obtain the probability that all four samples coalesce to the same founding chromosome, and the allele frequency spectrum conditional on all four samples coalescing to the same founding chromosome, by replicating the coalescent simulation outlined below 10,000 times.

We simulate the coalescent genealogy of four lineages (representing our four exchangeable samples) in a population with N0 effective chromosomes, back to time τ/N0. For a given simulation, our sample of four has coalesced down to nc lineages (1 ≤ nc ≤ 4) at time τ/N0. With probability (1/Nf)^(nc − 1), all surviving lineages coalesce to the same founding haplotype, and so we force the coalescence of all remaining lineages at time τ. For each simulation we keep track of the number of lineages surviving to the generation after founding (nc) and whether or not all samples coalesce to the same founding chromosome. Across simulations, we record the proportion in which all samples coalesce to the same founding chromosome (ηsimulated), and we obtain a vector of the time with k lineages, Tk (2 ≤ k ≤ 4).
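The simulation scheme just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than our actual code: time is measured in units of N0 generations (so each pair of lineages coalesces at rate 1), the function name simulate_founding and its arguments are hypothetical, and the mean times E[Tk] are averaged only over replicates in which all samples trace to the same founding chromosome.

```python
import random


def simulate_founding(t_found, n_founders, n_samples=4, n_reps=10_000, seed=1):
    """Estimate eta and the mean times with k lineages, E[T_k].

    t_found    : founding time tau / N0 (units of N0 generations; assumed scaling)
    n_founders : number of founding chromosomes, Nf
    Returns (eta_simulated, {k: mean T_k given a single shared founder}).
    """
    rng = random.Random(seed)
    n_same = 0
    summed = {k: 0.0 for k in range(2, n_samples + 1)}
    for _ in range(n_reps):
        k, elapsed = n_samples, 0.0
        times = {j: 0.0 for j in range(2, n_samples + 1)}
        while k > 1:
            # With k lineages, the next coalescence occurs at rate k(k-1)/2.
            wait = rng.expovariate(k * (k - 1) / 2)
            if elapsed + wait >= t_found:
                # k lineages survive to founding; truncate this interval.
                times[k] = t_found - elapsed
                break
            times[k] = wait
            elapsed += wait
            k -= 1
        # The k survivors all pick the same founder with probability
        # (1/Nf)^(k-1); they are then forced to coalesce at the founding
        # time, so intervals with fewer than k lineages have length zero.
        if rng.random() < (1.0 / n_founders) ** (k - 1):
            n_same += 1
            for j in times:
                summed[j] += times[j]
    eta = n_same / n_reps
    mean_times = {
        j: (total / n_same if n_same else float("nan"))
        for j, total in summed.items()
    }
    return eta, mean_times


# Example call with arbitrary illustrative parameter values.
eta_sim, mean_tk = simulate_founding(t_found=1.0, n_founders=2)
print(eta_sim, mean_tk)
```

Running this over a grid of τ/N0 and Nf values yields, for each combination, a simulated η and a vector of mean times that can be compared with the observed data.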
We use this distribution of coalescence times to infer the allele frequency spectrum within a haplotype, φ. We do so by computing the expected number of sites with i copies of a derived allele, E[ξi], from [53]:

$$E[\xi_i] = \frac{\theta}{i} \binom{n-1}{i}^{-1} \sum_{k=2}^{n-i+1} \binom{k}{2} \binom{n-k}{i-1} E[T_k], \qquad 1 \le i \le n-1 \tag{1}$$

where θ is the population mutation rate. We then convert the expected number of sites with i copies of a derived allele, E[ξi], into the expected proportion of polymorphic sites observed i times in a sample of size n, i.e. the frequency spectrum:

$$E[\phi_i] = \frac{E[\xi_i]}{\sum_{j=1}^{n-1} E[\xi_j]} \tag{2}$$

E[φi] is the expected frequency spectrum conditional on all four of our samples residing on the same founding haplotype. Note that this frequency spectrum does not depend on θ and so carries no information about the product N0µ, only about τ/N0 and Nf (and only weak information about the latter). The probability of a configuration of unlinked sites is multinomial with probabilities given by E[φi], which we use to estimate the likelihood of φ given the parameters and the number of sites that are polymorphic in the four samples when all have inherited the same founding chromosome. Similarly, the probability that all samples coalesce to the same founding haplotype is binomial with probability ηsimulated, which we use to estimate the likelihood of ηobserved given the model. A difficulty with estimating the composite likelihood of the fraction of sites where all samples reside on the same haplotype, ηobserved, is that there is no natural observable unit for a haplotype (unlike a site) over which to take a product of likelihoods. As the majority of our stretches where individuals share a haplotype are shorter than 1 cM, and the total genetic map length covered by our data is ∼ 300 cM, we conservatively have > 100 independent haplotype regions, and likely many more.
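As an illustration of Equations 1 and 2, the sketch below converts a vector of mean times E[Tk] into an expected frequency spectrum and evaluates a multinomial log-likelihood for observed site counts. The function names expected_sfs and multinomial_loglik are hypothetical, θ is set to 1 because it cancels in Equation 2, and the log-likelihood omits the multinomial coefficient, which is constant across parameter values.

```python
from math import comb, log


def expected_sfs(mean_times, n=4):
    """Expected proportions E[phi_i], i = 1..n-1, from {k: E[T_k]}.

    Implements Equation 1 with theta = 1 (theta cancels when the
    spectrum is normalized as in Equation 2).
    """
    xi = []
    for i in range(1, n):
        total = sum(
            comb(k, 2) * comb(n - k, i - 1) * mean_times[k]
            for k in range(2, n - i + 2)  # k = 2, ..., n - i + 1
        )
        xi.append(total / (i * comb(n - 1, i)))
    norm = sum(xi)
    return [x / norm for x in xi]


def multinomial_loglik(counts, probs):
    """Multinomial log-likelihood, dropping the constant coefficient."""
    return sum(c * log(p) for c, p in zip(counts, probs) if c > 0)


# Sanity check with standard neutral expectations, E[T_k] = 1 / C(k, 2),
# which should recover a spectrum proportional to 1 : 1/2 : 1/3.
neutral_times = {2: 1.0, 3: 1.0 / 3.0, 4: 1.0 / 6.0}
print(expected_sfs(neutral_times))
```

For a constant-size population the expected proportions are 6/11, 3/11, and 2/11 (about 0.55 : 0.27 : 0.18), so the observed spectrum of 0.62 : 0.22 : 0.16 shows a modest excess of singletons relative to that baseline.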
We take a composite likelihood approach to construct a likelihood of our allele frequency spectrum and the probability that all individuals reside on the same haplotype given τ/N0 and Nf; that is, we assume C(φ, η|τ, N0, Nf) = P(φ|τ, N0, Nf) × P(ηobserved|τ, N0, Nf). Composite likelihood approaches construct a pseudo-likelihood by taking the product of the likelihoods of a set of correlated observations, ignoring the dependence between observations. This considerably simplifies the inference, allowing theoretical expectations or simple simulations to be used to obtain the likelihood. This approach has gained popularity in population genetics, as it allows us to circumvent correlations between our observations due to linkage disequilibrium between sites or sets of sites (e.g. [54–58]). Ignoring this kind of dependence makes the likelihood surface overly peaked, which should not lead to bias in the MLE, but can lead to overly tight confidence intervals (when they are constructed on the basis of asymptotic assumptions). Specifically, in our model, the likelihood of the allele frequency spectrum is composite because we treat each SNP as independent (i.e. we do not model linkage between sites), and the probability that all individuals reside on the same haplotype is also composite because these regions are also linked.

Finally, we use a moment-based estimator of N0µ based on the mean pairwise difference within these four haplotypes (πS Same) to estimate N0 given a prior estimate of µ (1.5 × 10^−8, as above) and our estimates of Nf and τ/N0. To do so, we conduct a coalescent simulation of π within a haplotype by simulating the expected coalescent time for a pair of lineages conditional on them coalescing to the same founding parental chromosome at or before the founding of C. rubella.
We then numerically solve this to match the average πS Same/2µ for our six pairwise comparisons of our four samples, to obtain an estimate of N0. This does not acknowledge the noise in our estimate of πS Same, but given the many sites that contribute to this estimate, and the hundreds of independent instances of our pairwise haplotypes across the genome, this noise should be negligible, and any error in our estimate of N0 will be due to systematic errors such as sequencing errors.

Identifying regions with more than two haplotypes

To begin our search for candidate regions with ancestry from more than two founding chromosomes, we identified all overlapping 10 kb windows in which at least 20% of the region suggested three or more haplotypes in our haplotype labeling algorithm. We then merged contiguous windows that met this criterion and found the first and last non-overlapping 2 kb windows that support the existence of more than two haplotypes to identify the beginning and end of regions putatively containing more than two haplotypes. With this approach, we assign ≈ 11% of the genome as a candidate for containing at least three haplotypes. We then trim these regions by removing all windows that begin before or end after our local observation of a third haplotype. After this trimming, 7% of the genome putatively contained more than two haplotypes. We take these remaining candidates and find pairwise πS for all pairs of individuals assigned to the same or different haplotypes. Cases where πS between alternative haplotypes is low represent two haplotypes that have been incorrectly split, and with a cutoff of πS between haplotypes > 0.5% we see that potentially 52% of our candidates represent this oversplitting.
By contrast, regions where πS within haplotypes is large (we use a threshold value of 0.5%) could represent regions with insufficient coverage of ancestral variation to correctly assign haplotypes, alignment errors unchecked by previous filters, or genotyping errors, and make up 53% of our candidate regions (note that these options are nonexclusive). After enforcing these strict filters, we find that 10% of our original candidates (1% of the genome) represent genomic regions with ancestry that potentially traces back to more than two ancestral chromosomes.

Acknowledgments

We would like to thank Dan Koenig, Jeremiah Busch, Peter Ralph, Alisa Sedghifar, Jeremy Berg, Mike May, and Gideon Bradburd for their thoughtful comments.

References

1. Goodwillie C, Kalisz S, Eckert C (2005) The evolutionary enigma of mixed mating systems in plants: Occurrence, theoretical explanations, and empirical evidence. Annual Review of Ecology, Evolution, and Systematics 36: 47-79.
2. Igic B, Kohn JR (2006) The distribution of plant mating systems: study bias against obligately outcrossing species. Evolution 60: 1098-103.
3. Hamrick JL, Godt MJW (1997) Allozyme diversity in cultivated crops. Crop Science 37: 26-30.
4. Stebbins GL (1950) Variation and evolution in plants. New York, New York, USA: Columbia University Press.
5. Stebbins GL (1974) Flowering plants: Evolution above the species level. Cambridge, MA, USA: Belknap Press.
6. Baker H (1955) Self-compatibility and establishment after ‘long-distance’ dispersal. Evolution 9: 347-349.
7. Fisher RA (1941) Average excess and average effect of a gene substitution. Annals of Human Genetics 11: 53-63.
8. Schoen D, Lloyd D (1984) The selection of cleistogamy and heteromorphic diaspores. Biological Journal of the Linnean Society 23: 303-322.
9. Lande R, Schemske D (1985) The evolution of self-fertilization and inbreeding depression in plants. I. Genetic models. Evolution 39: 24-40.
10. Charlesworth D (2006) Evolution of plant breeding systems. Current Biology 16: R726-R735.
11. Stebbins GL (1957) Self fertilization and population variability in higher plants. American Naturalist 91: 337-354.
12. Takebayashi N, Morrell PL (2001) Is self-fertilization an evolutionary dead end? Revisiting an old hypothesis with genetic theories and a macroevolutionary approach. Am J Bot 88: 1143-1150.
13. Goldberg EE, Kohn JR, Lande R, Robertson KA, Smith SA, et al. (2010) Species selection maintains self-incompatibility. Science 330: 493-495.
14. Foxe JP, Slotte T, Stahl EA, Neuffer B, Hurka H, et al. (2009) Recent speciation associated with the evolution of selfing in Capsella. Proc Natl Acad Sci U S A 106: 5241-5.
15. Guo YL, Bechsgaard JS, Slotte T, Neuffer B, Lascoux M, et al. (2009) Recent speciation of Capsella rubella from Capsella grandiflora, associated with loss of self-incompatibility and an extreme bottleneck. Proc Natl Acad Sci U S A 106: 5246-51.
16. Neuffer B, Eschner S (1995) Life history traits and ploidy levels in the genus Capsella (Brassicaceae). Canadian Journal of Botany 73: 1354-1365.
17. Slotte T, et al. (in prep) The Capsella rubella genome provides insights into the causes and consequences of mating system evolution.
18. Qiu S, Zeng K, Slotte T, Wright S, Charlesworth D (2011) Reduced efficacy of natural selection on codon usage bias in selfing Arabidopsis and Capsella species. Genome Biol Evol 3: 868-80.
19. St Onge KR, Källman T, Slotte T, Lascoux M, Palmé AE (2011) Contrasting demographic history and population structure in Capsella rubella and Capsella grandiflora, two closely related species with different mating systems. Mol Ecol 20: 3306-20.
20. Nordborg M (2000) Linkage disequilibrium, gene trees and selfing: an ancestral recombination graph with partial self-fertilization. Genetics 154: 923-9.
21. Pettengill JB, Moeller DA (2012) Tempo and mode of mating system evolution between incipient Clarkia species. Evolution 66: 1210-25.
22. Busch JW, Joly S, Schoen DJ (2011) Demographic signatures accompanying the evolution of selfing in Leavenworthia alabamica. Mol Biol Evol 28: 1717-29.
23. Sweigart AL, Willis JH (2003) Patterns of nucleotide diversity in two species of Mimulus are affected by mating system and asymmetric introgression. Evolution 57: 2490-506.
24. Baudry E, Kerdelhue C, Innan H, Stephan W (2001) Species and recombination effects on DNA variability in the tomato genus. Genetics 158: 1725-1735.
25. Glémin S, Bazin E, Charlesworth D (2006) Impact of mating systems on patterns of sequence polymorphism in flowering plants. Proc Biol Sci 273: 3011-9.
26. Hudson RR, Kreitman M, Aguadé M (1987) A test of neutral molecular evolution based on nucleotide data. Genetics 116: 153-9.
27. Koch M, Haubold B, Mitchell-Olds T (2001) Molecular systematics of the Brassicaceae: evidence from coding plastidic matK and nuclear Chs sequences. Am J Bot 88: 534-44.
28. Ossowski S, Schneeberger K, Lucas-Lledó JI, Warthmann N, Clark RM, et al. (2010) The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science 327: 92-4.
29. Glémin S (2007) Mating systems and the efficacy of selection at the molecular level. Genetics 177: 905-16.
30. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629-44.
31. Slotte T, Hazzouri KM, Stern D, Andolfatto P, Wright SI (2012) Genetic architecture and adaptive significance of the selfing syndrome in Capsella. Evolution 66: 1360-1374.
32. Ptak SE, Przeworski M (2002) Evidence for population growth in humans is confounded by fine-scale population structure. Trends Genet 18: 559-63.
33. Anderson EC, Slatkin M (2007) Estimation of the number of individuals founding colonized populations. Evolution 61: 972-83.
34. Leblois R, Slatkin M (2007) Estimating the number of founder lineages from haplotypes of closely linked SNPs. Molecular Ecology 16: 2237-2245.
35. Wu CA, Lowry DB, Cooley AM, Wright KM, Lee YW, et al. (2008) Mimulus is an emerging model system for the integration of ecological and genomic studies. Heredity 100: 220-30.
36. Mable BK, Adam A (2007) Patterns of genetic diversity in outcrossing and selfing populations of Arabidopsis lyrata. Mol Ecol 16: 3565-80.
37. Caicedo AL, Williamson SH, Hernandez RD, Boyko A, Fledel-Alon A, et al. (2007) Genome-wide patterns of nucleotide polymorphism in domesticated rice. PLoS Genet 3: 1745-56.
38. Lam HM, Xu X, Liu X, Chen W, Yang G, et al. (2010) Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nat Genet 42: 1053-9.
39. Branca A, Paape TD, Zhou P, Briskine R, Farmer AD, et al. (2011) Whole-genome nucleotide diversity, recombination, and linkage disequilibrium in the model legume Medicago truncatula. Proc Natl Acad Sci U S A 108: E864-70.
40. Ness RW, Wright SI, Barrett SCH (2010) Mating-system variation, demographic history and patterns of nucleotide diversity in the tristylous plant Eichhornia paniculata. Genetics 184: 381-92.
41. Ness RW, Siol M, Barrett SCH (2011) De novo sequence assembly and characterization of the floral transcriptome in cross- and self-fertilizing plants. BMC Genomics 12: 298.
42. Pollak E (1987) On the theory of partially inbreeding finite populations. I. Partial selfing. Genetics 117: 353-60.
43. Nordborg M, Donnelly P (1997) The coalescent process with selfing. Genetics 146: 1185-95.
44. Ingvarsson PK (2002) A metapopulation perspective on genetic diversity and differentiation in partially self-fertilizing plants. Evolution 56: 2368-73.
45. Wright SI, Lauga B, Charlesworth D (2003) Subdivision and haplotype structure in natural populations of Arabidopsis lyrata. Mol Ecol 12: 1247-63.
46. Charlesworth D, Wright S (2001) Breeding systems and genome evolution. Current Opinion in Genetics & Development 11: 685-690.
47. Schoen D, Morgan M, Bataillon T (1996) How does self-pollination evolve? Inferences from floral ecology and molecular genetic variation. Philosophical Transactions of the Royal Society of London Series B-Biological Sciences 351: 1281-1290.
48. Wright S, Ness R, Foxe J, Barrett S (2008) Genomic consequences of selfing and outcrossing in plants. International Journal of Plant Sciences 169: 105-118.
49. Wright SI, Lauga B, Charlesworth D (2002) Rates and patterns of molecular evolution in inbred and outbred Arabidopsis. Mol Biol Evol 19: 1407-20.
50. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20: 1297-303.
51. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43: 491-8.
52. Balloux F, Lehmann L, de Meeûs T (2003) The population genetics of clonal and partially clonal diploids. Genetics 164: 1635-44.
53. Griffiths R, Tavaré S (1999) The ages of mutations in gene trees. Annals of Applied Probability 9: 567-590.
54. Adams AM, Hudson RR (2004) Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms. Genetics 168: 1699-712.
55. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet 5: e1000695.
56. Hudson RR (2001) Two-locus sampling distributions and their application. Genetics 159: 1805-17.
57. Larribe F, Fearnhead P (2011) On composite likelihoods in statistical genetics. Statistica Sinica 21: 43-69.
58. Wiuf C (2006) Consistency of estimators of population scaled parameters using composite likelihood. Journal of Mathematical Biology 53: 821-841.