Molecular Ecology Resources (2011) 11 (Suppl. 1), 93–108
doi: 10.1111/j.1755-0998.2010.02969.x

SNP DISCOVERY: NEXT GENERATION SEQUENCING

Short reads and nonmodel species: exploring the complexities of next-generation sequence assembly and SNP discovery in the absence of a reference genome

M. V. EVERETT, E. D. GRAU and J. E. SEEB

School of Aquatic and Fishery Sciences, University of Washington, 1122 NE Boat Street Box 355020, Seattle, WA 98195-5020, USA

Abstract

How practical is gene and SNP discovery in a nonmodel species using short read sequences? Next-generation sequencing technologies are being applied to an increasing number of species with no reference genome. For nonmodel species, the cost, availability of existing genetic resources, genome complexity and the planned method of assembly must all be considered when selecting a sequencing platform. Our goal was to examine the feasibility and optimal methodology for SNP and gene discovery in the sockeye salmon (Oncorhynchus nerka) using short read sequences. SOLiD short reads (up to 50 bp) were generated from single- and pooled-tissue transcriptome libraries from ten sockeye salmon. The individuals were from five distinct populations from the Wood River Lakes and Mendeltna Creek, Alaska. As no reference genome was available for sockeye salmon, the SOLiD sequence reads were assembled to publicly available EST reference sequences from sockeye salmon and two closely related species, rainbow trout (Oncorhynchus mykiss) and Atlantic salmon (Salmo salar). Additionally, de novo assembly of the SOLiD data was carried out, and the SOLiD reads were remapped to the de novo contigs. The results from each reference assembly were compared across all references. The number and size of contigs assembled varied with the size of the reference sequences. In silico SNP discovery was carried out on contigs from all four references; however, discovery of valid SNPs was most successful using one of the two conspecific references.
Keywords: EST, next-generation sequencing, SNP, sockeye salmon, SOLiD, transcriptome

Received 1 September 2010; revision received 26 November 2010; accepted 30 November 2010

Correspondence: Meredith V. Everett, Fax: (206) 543 5728; E-mail: [email protected]

© 2011 Blackwell Publishing Ltd

Introduction

Single-nucleotide polymorphisms (SNPs) have emerged as a powerful multipurpose tool in the study of wild populations (Habicht et al. 2010; McGlauflin et al. 2010). SNPs are the most common type of genetic variation (Morin et al. 2009; Slate et al. 2010) and can allow the characterization of both neutral and adaptive variation on a genome-wide scale. A carefully selected SNP panel (see Morin et al. (2009) for experimental design with SNPs) may perform as well as or better than microsatellite markers as a population genetics tool, with results that are more consistent between laboratories (Smith & Seeb 2008; Morin et al. 2009; Slate et al. 2010). However, the use of SNP markers in population studies to date remains limited because of their narrow availability for many species. Additional SNP discovery has been hindered by the need for time-consuming and expensive sequencing efforts.

Other techniques for SNP discovery, such as high-resolution melt analysis (HRMA) (e.g. McGlauflin et al. 2010) or alignment of sequences from existing EST databases (Smith et al. 2005), have been used effectively but are limited in the number of SNPs they detect (Elfstrom et al. 2006). The recent advances in next-generation sequencing (NGS) techniques, which can produce millions of sequence reads in a single run, provide a potential wealth of material for SNP discovery at a relatively low cost (Hayes et al. 2007; Seeb et al. 2011). NGS technologies have revolutionized molecular studies in nonmodel organisms, allowing the rapid characterization of gene structure and expression (Ellegren 2008).
As costs decrease, NGS technologies continue to be applied to an increasing number of nonmodel species (defined here as species lacking a reference genome) (Morozova et al. 2009; Trick et al. 2009; Goetz et al. 2010; Kunstner et al. 2010; Van Bers et al. 2010; Wolf et al. 2010). When selecting an NGS platform, laboratories working with nonmodel species must consider the cost, research question and availability of resources for sequence assembly (Flicek & Birney 2009). The three commonly used NGS platforms are the Roche GS-FLX, Illumina Genome Analyzer (GA) and ABI SOLiD. The Roche GS-FLX runs produce the longest reads (400 bp) (Morozova et al. 2009), and while the short reads (50–80 bp) produced by both the SOLiD system and the Illumina GA present challenges for assembly, both systems produce more reads at relatively lower cost. Shendure & Ji (2008) observed that the price per megabase for Roche 454 sequencing is approximately 30 times higher than that of either Illumina GA or SOLiD. Our observations at the time of this writing are that this cost distribution remains the same. The specific chemistries of the platforms have been reviewed elsewhere (Morozova et al. 2009), but the advantage of all platforms is their ability to produce hundreds of thousands to millions of sequences in a single run.

Multiple tools have been developed for the assembly of all types of NGS data (Flicek & Birney 2009). Sequence assembly may either be de novo or assembly to a reference (hereafter referred to as ‘reference assembly’ or ‘read mapping’). The reference may be a genome sequence, existing EST database or other sequence database from either the species of interest or one closely related (Trick et al. 2009; Parchman et al. 2010; Van Bers et al. 2010). Many studies utilize a combination of these assembly methods (Collins et al. 2008; Flicek & Birney 2009; Buggs et al. 2010).
Complications for both de novo and reference assembly remain for all three NGS technologies. De novo assembly remains difficult because of the computational complexity of assembling the large volume of data each system produces. This is especially true for Illumina and SOLiD, which produce millions of reads per run. Additionally, Flicek & Birney (2009) point out that to provide long assemblies, regardless of sequence or assembly method used, at least a portion of reads must be longer than the longest near identical region in the genome. This parameter varies greatly among genomes, and the short reads produced by all three platforms rarely meet this threshold. Regardless of NGS platform selected, in order for contig assembly and detection of variants such as SNPs to be successful, sufficiently deep genome coverage is needed. As many genomes are large and complex, a method for reducing sequence complexity such that sufficient coverage can be achieved at relatively low cost is necessary. Sequencing of the transcriptome provides a straightforward method for identification and annotation of only the protein coding genes reducing the complexity of sequences to assemble. Furthermore, the high coverage achieved by NGS technologies, especially the Illumina and SOLiD platforms, ensures identification of even rare transcripts and variants (Hale et al. 2009; Morozova et al. 2009; Van Bers et al. 2010). When constructing a transcriptome EST library, the selection of tissues for starting material is important. Pooled-tissue libraries have frequently been used in EST sequencing projects in an attempt to expand the diversity of genes discovered (Bonaldo et al. 1996; Carre et al. 2006; Govoroun et al. 2006). However, the overexpression of specific genes in some tissues is a potential source of bias in sequencing pooled-tissue EST libraries. 
This overrepresentation can be addressed through library normalization or through tissue selection targeted to reduce redundancy (Govoroun et al. 2006). One remaining question is whether the high coverage of NGS data can overcome the transcript redundancy found in pooled-tissue libraries when compared to single-tissue libraries, allowing the identification of rare transcripts in any library. Another area of interest is the determination of the most efficient method for generating data relevant to questions of population genomics, population and individual assignment, gene discovery etc., in the absence of a reference genome. While previous studies have successfully assembled NGS data to existing sequence resources, none of these examples have questioned whether the EST database selected and assembly parameters used affect the quality of sequence assembly and identification of SNPs. Species of Pacific salmon (Oncorhynchus sp.) and Atlantic salmon (Salmo salar) are thought to have diverged 25 Mya (Allendorf & Thorgaard 1984), and the two groups are 94–96% similar when comparing ESTs. It is unknown whether this 4–6% sequence divergence, measured across the entire transcriptome, might affect the correct detection of informative SNPs (Smith et al. 2005; Koop et al. 2008). Our study had three primary goals: first, to compare assemblies between pooled- and single-tissue SOLiD libraries, second, to evaluate the assembly of SOLiD reads among existing EST resources for salmonids and third, to examine whether underlying differences between EST databases from different species affect the validation of SNPs in sockeye salmon (O. nerka). We found limited differences between the assemblies of pooled- and single-tissue SOLiD libraries, with a trend towards higher rates of contig discovery in pooled-tissue libraries. Contig assembly on each EST database was associated with the size of the database. 
SNP discovery was most successful using conspecific EST data; however, differences in SNP validation among EST databases may have been related to assembly parameters that lead to the misidentification of paralogous sequence variants (PSVs) as SNPs.

Methods

RNA isolation and SOLiD sequencing

For transcriptome library generation, gill, heart, liver and testes were collected from two reproductively active male sockeye salmon from each of five populations of ecological and commercial interest (Schindler et al. 2010; Smith et al. 2011) (Table 1). The tissues were collected and stored in RNAlater (Ambion). Total RNA was extracted from all four tissues using Trizol (Invitrogen) according to the manufacturer’s protocols and cleaned using a Qiagen RNeasy kit. Total RNA concentration was quantified using a PicoGreen (Invitrogen) assay according to the manufacturer’s protocol. All RNA was screened for quality on a Bioanalyzer (Agilent). Aliquots of the cleaned total RNA were sent to the University of Washington High Throughput Next Generation Sequencing Facility for cDNA library preparation and sequencing using the ABI SOLiD system. All cDNA library preparation and sequencing were carried out using standard SOLiD protocols, including treatment with RiboMinus (Invitrogen), to remove ribosomal RNA (rRNA). None of the libraries were normalized during preparation.

Two cDNA libraries were created for each of the ten sockeye salmon individuals. The first library was created from 10 µg of total RNA from testes alone. The second cDNA library was created by pooling 2.5 µg of total RNA from each tissue (gill, heart, liver and testes). Each library was run on 1/8th of a SOLiD slide. All sequences were deposited in the NCBI Short Read Archive (SRA) under accession number SRA023604.2.

Table 1 Summary of SOLiD sequencing for all individuals. Individuals included were from populations of commercial interest (Habicht et al. 2010). Locations are as follows: Yako Creek (YakoCk), Yako Beach (YakoB), Silverhorn Bay Beach (SSilv), Lake Kulik beaches (SLKul) and Mendeltna Creek (SMend). The numbers after each abbreviation designate specific individuals. All locations except Mendeltna Creek are from the Wood River Lakes system, Alaska, that drains into the Bering Sea; Mendeltna Creek is a Copper River tributary, draining into the Gulf of Alaska. Testes tissue libraries were all created from 10 µg RNA from testes. Pooled-tissue libraries were created by pooling 2.5 µg RNA each from four tissues: testes, liver, heart and gill

Testes tissue
Individual     Number of reads   Number of reads after trimming   % Ribosomal   Final number of reads
YakoCk 1001    29 939 160        13 347 643                       25            10 031 781
YakoCk 1002    28 256 540        11 505 962                       23             8 913 446
YakoB 1001     27 784 826        11 590 747                       47             6 142 722
YakoB 1002     55 825 749        15 058 650                       17            12 551 092
SSilv 1001     20 570 121         2 383 678                       21             1 886 069
SSilv 1002     27 656 862        12 580 518                       24             9 607 538
SLKul 1001     51 794 375        26 072 506                       32            17 792 214
SLKul 1002     27 818 996        13 562 574                       20            10 904 540
SMend 1001     26 605 246        12 166 111                       23             9 406 487
SMend 1002     50 094 660        20 900 448                       25            15 749 481

Pooled tissue
Individual     Number of reads   Number of reads after trimming   % Ribosomal   Final number of reads
YakoCk 1001    27 618 007        13 891 903                       35             8 999 696
YakoCk 1002    23 412 417        10 377 341                       25             7 780 985
YakoB 1001     26 287 980         9 476 172                       27             6 937 048
YakoB 1002     26 995 349        10 655 899                       27             7 747 486
SSilv 1001     28 207 755        10 982 744                       36             7 066 965
SSilv 1002     24 807 577        11 571 972                       18             9 455 131
SLKul 1001     86 188 999        49 422 200                       22            38 671 756
SLKul 1002     93 180 511        43 108 751                       24            32 591 176
SMend 1001     41 988 429        17 566 983                       17            14 578 467
SMend 1002     62 962 615        24 473 432                       22            19 156 967

Sequence analysis

Sequence quality. Sequence analysis and quality control were carried out using the CLC Genomics Workbench 3.7.1 (CLC bio). All SOLiD data and quality scores, including colour space information, were imported into the Genomics Workbench and trimmed for length and quality using the trim sequences tool. The quality score limit was set to 0.05, and sequences less than 20 bp in length were discarded. While each RNA sample was treated to remove rRNAs during library preparation, some rRNA contamination remained. To exclude these sequences from the analysis, the trimmed SOLiD sequences from each library were assembled to a set of publicly available teleost 18S and 28S sequences [GenBank (EF126038, EF126042, EU780557, EF126037, EU637075, EF126043, EF126039, AB193742, AB193567, AB105163, AF308735, EF126040, EU780557, AB099628, U34342, AY452491, U34341, U34340, Z18683, Z18691, Z18686, Z18673, Z18764)] using the reference assembly function and default assembly settings in the Genomics Workbench. Reads from each library that did not assemble to the teleost rRNA sequences were retained in a separate file and used for all remaining sequence analysis.

Sequence assembly. Publicly available salmonid EST databases from rainbow trout (O. mykiss), Atlantic salmon and sockeye salmon (Koop et al. 2008) were used as reference sequences to map the SOLiD reads. The EST databases varied in size: sockeye salmon 6598 sequences, rainbow trout 79 018 sequences and Atlantic salmon 119 912 sequences. All three EST databases were from the most recent 100/99 assemblies [minimum score, repeat_stringency 99 in a Phrap assembly; see Koop et al. (2008); cGRASP]. Mapping of SOLiD reads was carried out using the reference assembly function in the Genomics Workbench. Each SOLiD library was mapped to each EST database singly, followed by simultaneous mapping of all SOLiD pooled-tissue and testes libraries (Table 1) (20 individual assemblies and 2 assemblies across all individuals).
The Genomics Workbench reference assembly tool allows the user to set the score for an individual nucleotide mismatch and the total mismatch score limit for retaining each assembly and also takes into account colour space information. All trimmed SOLiD reads were assembled with a nucleotide mismatch score of one and a total score limit of two mismatches per sequence. This corresponds to between 90 and 98 per cent identity with the reference for each SOLiD read, dependent on the read length. De novo sequence assembly was carried out on all SOLiD reads from all libraries using the NGS Cell 3.1 (CLC bio) with default parameters. The SOLiD reads were remapped to these de novo contigs using the reference assembly tool in the Genomics Workbench and the same parameters as the reference assemblies to the EST databases.

In silico detection of putative SNPs. SNP detection was carried out on each of the reference assemblies using the SNP detection tool in the Genomics Workbench. SNP detection parameters included a minimum coverage threshold of four reads, with a minimum variant frequency of 35% for heterozygous individuals. The maximum coverage limit was set to the value of the highest coverage contig found in each assembly to initially capture the maximum number of putative polymorphisms. SNP detection parameters also included a high-quality score for both the putative SNP and an eleven-base-pair window of surrounding nucleotides. Both homozygous and heterozygous SNPs were considered when examining the assemblies that included all individuals. Putative SNPs that were heterozygous with the rainbow trout or Atlantic salmon sequences, but homozygous within the SOLiD reads, were discarded. The resulting putative SNP tables were exported to Microsoft Excel where both single- and multi-individual assemblies were compared and screened using additional parameters similar to those in Sanchez et al. (2009).
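The identity range quoted for the mapping parameters can be checked with simple arithmetic. The sketch below is only an illustration: it assumes an ungapped, end-to-end alignment in which each mismatch costs one point, takes the read-length bounds from the text (20 bp minimum after trimming, 50 bp maximum) and ignores colour-space error correction. The function and constant names are hypothetical and are not part of the Genomics Workbench.

```python
def percent_identity(read_length: int, mismatches: int) -> float:
    """Percent identity of a read aligned end-to-end with a given number of mismatches."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 <= mismatches <= read_length:
        raise ValueError("mismatches must lie between 0 and read_length")
    return 100.0 * (read_length - mismatches) / read_length


# Bounds taken from the text; the names are illustrative assumptions.
MIN_READ_LENGTH = 20   # reads shorter than 20 bp were discarded after trimming
MAX_READ_LENGTH = 50   # SOLiD reads were up to 50 bp long
MAX_MISMATCHES = 2     # total mismatch score limit, at a cost of one per mismatch

if __name__ == "__main__":
    for length in (MIN_READ_LENGTH, MAX_READ_LENGTH):
        for mismatches in range(1, MAX_MISMATCHES + 1):
            identity = percent_identity(length, mismatches)
            print(f"{length} bp read, {mismatches} mismatch(es): {identity:.0f}% identity")
```

Under these assumptions, a 20 bp read at the two-mismatch limit gives the 90% lower bound and a 50 bp read with a single mismatch gives 98%; a 50 bp read at the two-mismatch limit is 96% identical to the reference.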
These screening steps were included to reduce the rate of misidentification of PSVs as true SNPs in silico, helping to reduce the high cost of SNP validation. The order of these steps is simply a matter of convenience. First, putative SNPs that appeared to contain more than two alleles among all individuals were excluded from further analysis as there is an increased likelihood that such loci are PSVs rather than true SNPs. Next, the remaining putative SNPs were screened to exclude any that occurred within 100 bp of one another in a contig. Sanchez et al. (2009) point out that contigs containing multiple SNPs are more likely to represent paralogous loci. Thus, the decision to exclude these putative SNPs attempts to strike a balance between the number of putative SNPs discovered and the rate of false discovery. Finally, putative SNP tables from each individual and the multi-individual assembly were compared to one another to locate putative SNPs shared among multiple individuals. SNPs that contained both homozygous and heterozygous individuals among the populations were selected for further validation.

SNP validation

A set of 96 putative SNPs was selected for validation. Twenty-four were selected from contigs from each of the four reference databases, comparing across the individual assemblies and the group assemblies to detect putative SNPs found in multiple individuals. Each of the putative SNPs chosen for the panel was detected in multiple individuals and appeared to have variable allele counts among populations. We used a four-step validation process, including PCR tests, HRMA, Sanger sequencing and population genotyping (Seeb et al. 2011). BatchPrimer3 (You et al. 2008) was used to design PCR primers flanking each putative SNP from consensus sequences from each reference assembly containing a selected putative SNP.
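The in silico screening steps described earlier in this section (exclude loci with more than two alleles, exclude putative SNPs within 100 bp of one another on a contig, keep those shared among individuals) were carried out by the authors in Microsoft Excel. The following is a minimal sketch of the same logic, not the authors' procedure: the record layout, function name and default thresholds are hypothetical, "within 100 bp" is read as a distance of 100 bp or less, and every member of a close pair is dropped.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PutativeSNP:
    contig: str             # contig (reference sequence) identifier
    position: int           # position of the variant on the contig consensus
    alleles: frozenset      # alleles observed across all individuals
    n_individuals: int      # number of individuals in which the SNP was called


def screen_putative_snps(snps, min_spacing=100, min_individuals=2):
    """Apply the three screening steps in the order given in the text."""
    # Step 1: exclude loci that appear to carry more than two alleles.
    biallelic = [s for s in snps if len(s.alleles) <= 2]

    # Step 2: among the remaining loci, exclude any that lie within
    # min_spacing bp of another putative SNP on the same contig.
    by_contig = defaultdict(list)
    for snp in biallelic:
        by_contig[snp.contig].append(snp)

    isolated = []
    for contig_snps in by_contig.values():
        contig_snps.sort(key=lambda s: s.position)
        for i, snp in enumerate(contig_snps):
            close_to_previous = (
                i > 0 and snp.position - contig_snps[i - 1].position <= min_spacing
            )
            close_to_next = (
                i + 1 < len(contig_snps)
                and contig_snps[i + 1].position - snp.position <= min_spacing
            )
            if not (close_to_previous or close_to_next):
                isolated.append(snp)

    # Step 3: keep putative SNPs shared among multiple individuals.
    return [s for s in isolated if s.n_individuals >= min_individuals]
```

The additional requirement that a selected SNP show both homozygous and heterozygous individuals would need per-individual genotypes and is left out of this sketch.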
Primers were first subjected to a PCR test on pooled genomic DNA extracted from the same 10 sockeye salmon individuals used for SOLiD transcriptome sequencing. Genomic DNA was extracted from preserved tissues from each individual using a Qiagen DNeasy kit, following the manufacturer’s protocols. DNA concentration was quantified via a fluorescence assay on a NanoDrop (ThermoScientific), and the concentration was standardized among all samples. Primer testing was carried out using real-time PCR on a LightCycler 480 (Roche). PCR conditions were identical for all primer pairs: an initial denaturation of 10 min at 95 °C followed by 45 cycles of 95 °C for 10 s, annealing at 55 °C for 20 s and extension at 72 °C for 20 s. Primer pairs that failed to amplify, or amplified more than one product, were excluded from further testing.

Successful primer pairs were used in a second set of PCR reactions that included an HRMA to test for the presence of each putative SNP. Each primer pair was run on each of the ten individuals singly. PCR conditions were identical to the conditions described earlier except for the addition of a final melt step with a temperature ramp from 62 to 95 °C, at a rate of 0.02 °C per s.

High-resolution melt analysis (HRMA) results indicating a polymorphism were further validated by Sanger sequencing. Sanger sequencing was carried out at the University of Washington’s High Throughput Sequencing facility. PCR products identified from HRMA were sequenced in both directions using a standard BigDye 3.1 protocol (ABI). Resulting sequences and chromatograms for each primer set were examined across all individuals in BioEdit. The ClustalW algorithm in BioEdit was used to create alignments, and the presence of each putative SNP was verified by eye.
Consensus sequences from alignments containing putative SNPs were used in the design of custom TaqMan assays (ABI). TaqMan assays were tested on up to 95 individuals from each of eight test populations of sockeye salmon (Table 2, Fig. 1). Assays were run in 384-well plates on an ABI 7900 instrument, and resulting genotypes were analysed using the SDS 2.3 software (ABI) (Seeb et al. 2009). Allele frequencies and deviation from Hardy–Weinberg equilibrium were calculated in GENALEX 6.1 (Peakall & Smouse 2006), and FST estimates were calculated in GENEPOP on the web, version 4.0.10 (Raymond & Rousset 1995; Rousset 2008).

Table 2 Test populations and sample size for SNP validation. The number of individuals included from each population is listed in n. Populations range from the South Peninsula, Alaska to the Igushik River in Bristol Bay, Alaska. Bolshaya River is located on the Western Kamchatka Peninsula in Russia (Fig. 1)

Population region    n
Igushik River        60
Lower Wood River     93
Illiamna Lake        95
Egegik River         95
Cinder River         89
Bear Lake            95
Chignik Lake         95
Bolshaya River       95

Results

Sequencing and assembly results

Two SOLiD sequence libraries, one each from pooled-tissue and testes, were obtained from each individual for a total of 20 SOLiD libraries (Table 1). Within each library, between 20 570 121 and 93 180 511 reads were initially obtained. Reads were trimmed for quality and length, leaving between 2 383 678 and 49 422 200 reads (Table 1). Across all SOLiD libraries, between 17 and 47 per cent of the trimmed reads were found to be rRNA sequence and excluded from further analysis leaving between 1 886 069 and 38 671 756 sequence reads in each library (Table 1). All SOLiD libraries were successfully assembled to the cGRASP EST databases for sockeye salmon, rainbow trout and Atlantic salmon (Table 3). De novo assembly was successfully carried out across all SOLiD libraries, resulting in 25 426 contigs.
De novo contig lengths ranged from 200 to 1193 bp with a mean length of 262 bp. All SOLiD libraries were successfully mapped to these de novo contigs (Table 3).

Pooled-tissue vs. testes specific libraries

One goal of this study was to compare sequence assembly between pooled-tissue and testes libraries within each of the four reference databases. The pooled-tissue libraries contained higher numbers of assembled reads, higher numbers of contigs and greater contig length overall (Table 3, Fig. 2). Among all assemblies, there was a great deal of variation in both the number of contigs detected and the depth of coverage per contig. Both pooled-tissue and testes libraries contained contigs unique to each library type, but more unique contigs were detected in pooled-tissue samples (Table 3). However, among the unique contigs assembled on each EST reference database, many were detected in only a single library: 67 in sockeye salmon, 2411 in rainbow trout and 4840 in Atlantic salmon. Of these contigs, the majority were constructed from the alignment of a single SOLiD read to the reference (56/67 in sockeye salmon, 2083/2411 in rainbow trout and 4260/4840 in Atlantic salmon). In all cases, these single read contigs made up more than half of the unique contigs detected between pooled-tissue and testes libraries. Reference assembly back to the de novo set of sequences produced 48 contigs detected in a single library. Once more, these reads made up more than half of unique reads between pooled-tissue and testes libraries. However, in this instance, multiple SOLiD reads were mapped to all reference contigs.

Fig. 1 Map of locations of the eight test populations genotyped using TaqMan assays.

Table 3 Summary of mapping of SOLiD to the cGRASP EST reference libraries and the de novo SOLiD contigs. Among the EST databases, the sockeye salmon reference consisted of 6598 contigs and singletons, the rainbow trout reference contained 79 018 contigs and singletons, and the Atlantic salmon reference contained 119 912 contigs and singletons. The de novo references contained 25 426 contigs. The percentages following the standard deviation are the standard deviation calculated as a percentage of the average

Reference species        SOLiD library type   Average starting reads   SD (%)            Average number of contigs   SD       % EST library
Sockeye salmon           Pooled               15 298 567               11 478 360 (75)    5828                         329    88
Sockeye salmon           Testes               10 298 537                4 504 473 (44)    5588                         481    84
Rainbow trout            Pooled               15 298 567               11 478 360 (75)   55 371                       6853    70
Rainbow trout            Testes               10 298 537                4 504 473 (44)   50 152                       8537    63
Atlantic salmon          Pooled               15 298 567               11 478 360 (75)   73 904                     13 352    62
Atlantic salmon          Testes               10 298 537                4 504 473 (44)   64 686                     14 575    60
Sockeye salmon de novo   Pooled               15 298 567               11 478 360 (75)   22 276                       1734    88
Sockeye salmon de novo   Testes               10 298 537                4 504 473 (44)   20 920                       2843    82

Reference species        SOLiD library type   Average reads assembled to reference   SD (%)           Number of unique contigs   Average length   SD   Average coverage   SD
Sockeye salmon           Pooled               1 411 177                              1 049 054 (74)     59                       338              88   235                163
Sockeye salmon           Testes                 861 917                                402 634 (47)     41                       300              77   149                 64
Rainbow trout            Pooled               2 147 101                              1 571 985 (73)   2381                       202              53    32                 27
Rainbow trout            Testes               1 304 322                                621 324 (48)   1498                       176              47    20                 13
Atlantic salmon          Pooled               1 883 711                              1 354 873 (71)   5352                       145              36    26                 17
Atlantic salmon          Testes               1 121 298                                519 619 (46)   2946                       128              30    18                  9
Sockeye salmon de novo   Pooled               1 475 877                                828 618 (56)     62                       149              30    53                 24
Sockeye salmon de novo   Testes                 758 768                                406 749 (54)      4                       133              31    35                 18

Variation among reference databases

The results of mapping the SOLiD reads to each of the EST reference databases and the de novo contigs were variable among these four references. On the sockeye salmon EST database, an average of between 84 and 88% of the total ESTs in the reference library were assembled among individual libraries, with a total of 6528 contigs (98%) assembled across all libraries (Table 3).
Mapping to the rainbow trout database generated contigs representing between 63 and 70% of the reference library (Table 3), with a total of 73 843 (93%) contigs mapped across all libraries. An average of between 60% and 62% of the contigs from the Atlantic salmon database were generated from mapped SOLiD reads among all individual libraries, with a total of 112 214 contigs (93%) mapped across all libraries. Finally, remapping the SOLiD reads to the de novo contigs produced an average of between 20 920 and 22 276 contigs in each library, representing an average 82–88% of the starting reference library (Table 3). However, across all SOLiD libraries, a total of 25 426 contigs representing the entire de novo reference were successfully assembled.

We observed substantial variation in coverage depth and length among assemblies to each EST database (Figs 3 and 4). In general, assembly to both the rainbow trout and Atlantic salmon EST references produced the largest number of contigs, consistent with the larger size of these reference databases. However, in the assembly to both rainbow trout and Atlantic salmon references, there is a very high proportion of contigs (up to 55% of the contigs mapped on the Atlantic salmon reference) that contain fewer than 20 SOLiD reads. There is also a larger proportion of short (<100 bp) contigs in the rainbow trout and Atlantic salmon assemblies.

Fig. 2 Comparison of the frequency of occurrence of contig coverage values and contig lengths between pooled-tissue and testes libraries. Each SOLiD library was mapped to all four reference databases, and mean length and coverage (number of reads per contig) were calculated across all individuals in each reference. Panels a–d are coverage and e–h are length. Reference library species are as follows: a. Sockeye salmon de novo, b. Sockeye salmon, c. Rainbow trout, d. Atlantic salmon, e. Sockeye salmon de novo, f. Sockeye salmon, g. Rainbow trout, h. Atlantic salmon. Error bars are standard deviation from the mean. Note the overlap in error in all cases. [Axes: number of reads per contig (panels a–d) or contig length in bp (panels e–h) against number of contigs (log); series: Single, Pooled.]

SNP detection and validation

In silico SNP detection was successfully carried out on the contigs from all four reference assemblies. The average number of putative SNPs across all four reference databases ranged between 4219 and 13 608 at a minimum coverage depth of 4 reads (Table 4).
At higher coverage values (≥10 reads), an average of between 1928 and 5657 putative SNPs was detected (Table 4). The total number of putative SNPs detected among all SOLiD libraries, assembled on each EST reference database, was between 37 831 and 109 850 before secondary screening, and 631 and 14 046 after screening to remove loci that appeared to contain more than two alleles or occurred within 100 bp of one another (Table 4). The average number of putative SNPs per contig (coverage ≥4) after all screening steps ranged from 1.6 to 4.4, across all reference databases. The greatest numbers of putative SNPs per contig were detected on the sockeye salmon EST reference. Following the same pattern as the mapped contigs, numerous putative SNPs were detected in assemblies from only a single individual. Additionally, in the assemblies to some reference sequences, multiple putative SNPs were detected across individuals; however, the detection and coverage for each of these putative SNPs were variable among individuals.

Fig. 3 Comparison of contig coverage frequencies among reference databases. Coverage was calculated as number of reads per consensus for the assembly of all pooled-tissue and testes SOLiD libraries on each reference sequence.

Fig. 4 Comparison of contig length frequencies among reference databases. Lengths were determined based on the assembly of all pooled-tissue SOLiD libraries and all the testes SOLiD libraries on each reference sequence.

Ninety-six putative SNPs were selected for further validation, 24 each from contigs assembled from each of the four reference databases.
Coverage values on each of these candidates ranged from around 25 reads to more than 100 when all individuals were aligned together. No relationship between the level of coverage and SNP validation rates was observed. Six of the original 96 primer pairs failed to amplify a product in initial PCR tests (Table 5). A further 39 were rejected because they amplified more than one product. Successful primer pairs were screened with HRMA to detect the presence of putative SNPs. Seventeen putative SNPs were rejected after HRMA because they contained ambiguous products, and 18 were rejected when no polymorphism was detected. Sixteen putative SNPs were detected with HRMA and were Sanger sequenced to confirm the presence or absence of the variant. After sequencing, five potential SNPs were rejected because either multiple products were amplified or no SNP was detected in any product (Table 5). Finally, 11 putative SNPs were confirmed via direct sequencing, and the consensus of these sequences was used to design TaqMan assays (Table 6). Of these final 11 putative SNPs, five were detected in contigs assembled from the SOLiD de novo reference, two from the reference assembly to the rainbow trout EST database and four from the sockeye salmon EST database. All candidate SNPs detected using the Atlantic salmon database failed validation as they either did not amplify a product, amplified multiple products, or lacked a true SNP in HRMA analysis.

The 11 TaqMan assays were successfully amplified on the eight test populations (Table 2). One assay failed to distinguish between homozygous and heterozygous individuals and was excluded from further analysis, while the other ten loci were successfully scored across all eight test populations (Table 7). Several populations deviated from Hardy–Weinberg equilibrium at a single locus (Table 7). FST values for each locus ranged from 0.01 to 0.31. Global FST across all 10 loci was 0.06.
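The secondary screening rule used in this section (discard putative SNPs with more than two alleles, or lying within 100 bp of another putative SNP) can be sketched as follows. This is a minimal illustration, not the software used in the study: the function name, the tuple layout and the toy data are hypothetical, and the sketch assumes that every putative site, including a multi-allelic one, counts as a neighbour for the spacing rule.

```python
from collections import defaultdict


def screen_putative_snps(snps, min_spacing=100, max_alleles=2):
    """Return (contig, position) pairs that pass secondary screening.

    snps: iterable of (contig_id, position, alleles), where alleles is a
    string of the bases observed at that site (hypothetical layout).
    A site is dropped if it shows more than `max_alleles` alleles or if
    another putative site on the same contig lies closer than `min_spacing` bp.
    """
    by_contig = defaultdict(list)
    for contig, pos, alleles in snps:
        by_contig[contig].append((pos, alleles))

    kept = []
    for contig, sites in by_contig.items():
        sites.sort()
        positions = [pos for pos, _ in sites]
        for i, (pos, alleles) in enumerate(sites):
            if len(set(alleles)) > max_alleles:
                continue  # more than two alleles: likely a PSV
            near_prev = i > 0 and pos - positions[i - 1] < min_spacing
            near_next = i + 1 < len(sites) and positions[i + 1] - pos < min_spacing
            if near_prev or near_next:
                continue  # clustered variants: likely a PSV
            kept.append((contig, pos))
    return sorted(kept)


# Toy input: two clustered sites, one tri-allelic site, two isolated sites.
example = [
    ("c1", 50, "AG"),
    ("c1", 120, "CT"),
    ("c1", 400, "AGT"),
    ("c1", 900, "CT"),
    ("c2", 10, "AC"),
]
kept = screen_putative_snps(example)
print(kept)
```

In practice the spacing threshold and the allele limit would be tuned alongside the minimum coverage depth (4 or 10 reads in Table 4).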
Table 4 Summary of SNP discovery across all reference assemblies. The total numbers of putative SNPs in all cases are total number of unique putative SNPs across all assembled contigs and libraries. Coverage is the number of reads assembled per contig

Reference species | SOLiD library type | Average in silico SNPs (coverage ≥10) | ±SD | Average in silico SNPs (coverage ≥4) | ±SD | Average SNPs per contig (coverage ≥4) | ±SD | Total putative SNPs discovered | Total putative SNPs after screening
Sockeye salmon | Pooled | 4076 | 3367 | 5671 | 4348 | 4.1 | 0.6 | 43 773 | 806
Sockeye salmon | Testes | 2962 | 2410 | 4219 | 3280 | 4.4 | 0.9 | 37 831 | 631
Rainbow trout | Pooled | 5657 | 5386 | 13 608 | 11 671 | 2.1 | 0.2 | 142 944 | 1036
Rainbow trout | Testes | 3213 | 2945 | 7742 | 5969 | 2.2 | 0.5 | 101 100 | 3923
Atlantic salmon | Pooled | 3877 | 3700 | 10 201 | 8811 | 1.9 | 0.2 | 109 850 | 14 046
Atlantic salmon | Testes | 1928 | 1374 | 5293 | 3467 | 2.0 | 0.3 | 71 089 | 2350
Sockeye salmon de novo | Pooled | 2730 | 2344 | 4428 | 3629 | 1.6 | 0.1 | 44 950 | 2218
Sockeye salmon de novo | Testes | 2583 | 3405 | 4290 | 5260 | 1.7 | 0.4 | 51 617 | 775

Table 5 Results of primer design and validation of candidate SNPs. Twenty-four primer pairs were designed from consensus sequences from each reference assembly to test putative SNPs. These putative SNPs were subjected to four validation steps. First, we tested for successful PCR amplification producing a single product. Second, the successful primer pairs were subjected to high-resolution melt analysis (HRMA) using individuals from eight populations. Third, templates that appeared to contain a SNP in HRMA were resequenced using Sanger sequencing. Finally, putative SNPs appearing in the Sanger sequences were validated using TaqMan assays on eight populations (Table 2)

Reference assembly | Primer pairs | Successful PCR | Successful HRMA | Sanger validation | TaqMan validation
Sockeye salmon | 24 | 13 | 5 | 4 | 4
Rainbow trout | 24 | 11 | 4 | 2 | 2
Atlantic salmon | 24 | 11 | 1 | na | na
Sockeye salmon de novo | 24 | 16 | 6 | 5 | 4
Totals | 96 | 51 | 16 | 11 | 10
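The validation funnel in Table 5 can be expressed as step-wise and cumulative pass rates. The sketch below is a minimal illustration that uses only the totals row of Table 5; the stage labels and the function name are hypothetical.

```python
# Totals row of Table 5: number of candidate loci remaining after each step.
stages = [
    ("primer pairs designed", 96),
    ("successful PCR", 51),
    ("successful HRMA", 16),
    ("Sanger validation", 11),
    ("TaqMan validation", 10),
]


def funnel(stages):
    """Return (label, count, pass rate vs previous step, pass rate vs start)."""
    start = stages[0][1]
    previous = start
    rows = []
    for label, count in stages:
        rows.append((label, count, count / previous, count / start))
        previous = count
    return rows


rows = funnel(stages)
for label, count, step_rate, overall_rate in rows:
    print(f"{label:24s} {count:3d}  step {step_rate:6.1%}  overall {overall_rate:6.1%}")
```

The last row makes the reporting difference explicit: 10 of 11 TaqMan assays succeeded, but only 10 of the 96 starting candidates did.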
Table 6 TaqMan primer and probe sequences designed from Sanger-validated sequences. Assay One-A5 (*) amplified more than two alleles when tested

Reference database and assay name: Rainbow trout Sockeye salmon de novo Sockeye salmon de novo One-F4b One-A5 Sockeye salmon Sockeye salmon Sockeye salmon Sockeye salmon One-F10 One-G6 One-H3 One-H5 One-H6 One-B8 One-B6 Sockeye salmon de novo Sockeye salmon de novo Sockeye salmon de novo Rainbow trout One-B5* One-B11

Forward primer seq.: AGGTGAATCATTTG TCCATAGCATCA CCATCCTTCCTTCAT TGACACCATT ATTTAAGTAATCTA TTTCTTTGGCTCGAC TGA GTCGATGGGCAGGA AGATGA ACAAATCGTGTTAA TGACAGGCTACT TGTCCGACCACAAC AATGTC GAGCCCCAGTACCA TTCCA GGAAGTTTGTGCTTG GCTAAGATCA CATTTTTTGGCACCA TAACCTTGGT TTGGTTCCGATCTAC AAGTTGACAT GGCATGATTGTCCTT GGGAAGATAT

Reverse primer seq.: TCTCCTGTCCTTTC ATACACTCTGA GGATGACGTAATT GATCAACTGTCCAT ACCAGGTTGAGAA AAACGTTATCCT GGACCCATGATAGT TCCCATCTT GGGATTTATTGCTC TGAGAGGACAA ATACCCCAGTCCACCAATCAG ACAAGGAATTCA GTGTGGGATTGG ATCTGAGGAAGCATATTTTTTCCTAATTCTATTTCT CCTCAGGGCAACTT ATATTCAAAGC CCTCTCGGCCATCT TTGAAGTTATT GGAGCATCTAAGA AAATACCCGTCTT

Probe 1 dye: VIC VIC VIC VIC VIC VIC VIC VIC VIC VIC VIC

Probe 1 sequence: CAAATCAACTGGATTTAC TCATGCATAGATA CTTGAC CTAAATCTGAATT AATTTACG CCTAACACAACATTGCTT AGGACACACAGC TCTGT ATAAACAATCAGG GAAATG TAGCGACGAAGAC CACA CCTGCCAGGCCTC CCATCATTCTCA TTACTGTTT ATCTGAA GTATTGGCTTTAA TGTGGCCAATGGA CCAA

Probe 2 dye: FAM FAM FAM FAM FAM FAM FAM FAM FAM FAM FAM

Probe 2 sequence: TCAAATCAACTG TATTTAC ATGCATAGATGCC TTGAC AAATCTGAATTCATTTACG TCCTAACACAACTTTGCTT AGGACACACAAC TCTGT AAACAATCAAGGAAATG TAGCGACGAACA CCACA CCCTGCTAGGCCTC CCATCATTCTCATTCCTGTTT ATCTGAAG TATTTGCTTTAA TGGCCAATGAACC AA
Discussion

Sequencing results

As the cost of NGS methods decreases, the use of this technology on nonmodel species, those species lacking a reference genome, has increased (Morozova et al. 2009; Trick et al. 2009; Kunstner et al. 2010). The short reads produced by NGS, including the SOLiD system, remain a challenge for assembly; thus, their primary use to date has been in resequencing and gene expression studies in organisms with a sequenced genome (Morozova et al. 2009; Trick et al. 2009; Wall et al. 2009). Recently, NGS studies in nonmodel species have begun to use existing EST data sets as references for sequence assembly (Collins et al. 2008; Trick et al. 2009; Parchman et al. 2010; Van Bers et al. 2010). However, no studies have examined the possible consequences of the mapping reference selected. Our study characterized variations in assembly to three publicly available salmonid EST databases, as well as assembly to a set of de novo contigs, and examined the optimal methodology for SNP detection in the assembly of short SOLiD reads.

Using the SOLiD sequencing system, we successfully obtained transcriptome sequence from both pooled-tissue and testes libraries from 10 individuals spanning five distinct populations of Alaskan sockeye salmon. Our decision to use the SOLiD system, rather than a method such as 454, was based on cost. Using SOLiD allowed us to sequence more individuals at very high coverage. Several million reads were obtained from each library. However, a large proportion (up to 47%) of the sequences obtained consisted of rRNA sequences. These sequences were present despite the RiboMinus treatment of all RNA samples outlined in the methods. The RiboMinus kit uses regions of ribosomal sequence that are highly conserved among species.
Nonetheless, the large number of ribosomal sequences remaining in the sample suggests that the included sequences do not have high enough specificity for use in salmonids. Future studies should consider using other techniques such as poly-A selection to reduce rRNA in their samples.

Table 7 Genotype results from eight populations: Igushik River, AK; Lower Wood River, AK; Iliamna Lake, AK; Egegik River, AK; North Peninsula, Cinder River, AK; North Peninsula, Bear Lake, AK; Chignik Lake, AK; Bolshaya River, Russia. Eleven TaqMan assays were designed from Sanger-validated sequences. Ten assays successfully amplified in all eight populations. Minor allele frequency and deviations from Hardy–Weinberg equilibrium (*P < 0.05, **P < 0.01) are shown below for all populations

Minor allele frequency
Population | One-F4b | One-A5 | One-B11 | One-B6 | One-B8 | One-F10 | One-G6 | One-H3 | One-H5 | One-H6
Igushik River | 0.117** | 0.283 | 0.242 | 0.242* | 0.475 | 0.158 | 0.333 | 0.411 | 0.333 | 0.344*
Lower Wood River | 0.188 | 0.382 | 0.484 | 0.258* | 0.473 | 0.177 | 0.360 | 0.140 | 0.285 | 0.265
Iliamna Lake | 0.126 | 0.363 | 0.411 | 0.384 | 0.405 | 0.100 | 0.342 | 0.102 | 0.437 | 0.159
Egegik River | 0.128 | 0.300 | 0.420 | 0.326 | 0.426 | 0.153 | 0.353* | 0.054 | 0.426 | 0.196
North Peninsula, Cinder River | 0.099 | 0.309 | 0.427* | 0.340 | 0.416 | 0.081 | 0.494 | 0.326 | 0.433 | 0.263*
North Peninsula, Bear Lake | 0.087 | 0.200 | 0.290 | 0.440 | 0.421 | 0.273 | 0.337 | 0.468* | 0.463 | 0.227
Chignik Lake | 0.140 | 0.263 | 0.381 | 0.299 | 0.489 | 0.367 | 0.379 | 0.163 | 0.453 | 0.295
Bolshaya River, Russia | 0.146 | 0.337* | 0.131 | 0.355 | 0.400 | 0.120 | 0.393 | 0.300 | 0.494 | 0.500

Pooled-tissue versus testes libraries

Our first goal was to compare the efficiency of assembly between pooled-tissue and testes libraries. Pooled-tissue libraries have frequently been used in EST sequencing projects in an attempt to expand the diversity of genes discovered (Bonaldo et al. 1996; Carre et al. 2006; Govoroun et al. 2006). However, one potential source of bias in sequencing pooled-tissue EST libraries is that highly expressed genes may be overrepresented in the library, while rare transcripts may be missed altogether. Such variation in distribution can be addressed through library normalization or through tissue selection targeted to reduce library redundancy. Library normalization reduces the proportion of highly expressed transcripts compared to other transcripts in the sample; however, even after normalization, redundant EST clusters may remain, and depending on the intended downstream application of the library (e.g. gene expression), normalization may not be appropriate. While this is problematic for traditional sequencing methods, which are limited in their numerical capacity, Hale et al. (2009) demonstrated that the high coverage obtained in NGS generally eliminates the need for normalization, and that even rare transcripts can be identified in nonnormalized libraries.

We compared reference assemblies of libraries from both testes and pooled-tissues. Despite the use of nonnormalized (native) libraries, more than 90% of the ESTs found in each reference database were identified in at least one individual from our libraries. This was consistent with the findings of Hale et al. (2009) that native libraries can be effectively used for gene discovery. The use of native libraries does have an effect on the distribution of coverage, however. A comparison of the 100 genes with the highest coverage from each reference assembly revealed that 19–85% of the total number of reads mapped to each reference was contained in these 100 contigs. The large variation in these values is attributed to differences between reference libraries and the high variability between individuals.
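The share of mapped reads held by the most deeply covered contigs (19–85% for the top 100 contigs in this study) is a simple ratio. A minimal sketch, assuming a hypothetical mapping from contig ID to number of mapped reads; the function name and the toy counts are illustrative and are not data from the study:

```python
def top_n_read_share(reads_per_contig, n=100):
    """Fraction of all mapped reads that fall in the n highest-coverage contigs."""
    counts = sorted(reads_per_contig.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n]) / total


# Toy library: 100 mapped reads spread over four contigs.
toy_library = {"contigA": 50, "contigB": 30, "contigC": 15, "contigD": 5}
share = top_n_read_share(toy_library, n=2)
print(f"top 2 contigs hold {share:.0%} of mapped reads")
```

Computing this value per individual and per reference would show how unevenly coverage is distributed in a native library before deciding whether normalization is worth its cost.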
Given the high proportion of reads mapped to a few contigs, it is probable that normalization would more evenly distribute coverage across our libraries, improving consistency between individuals and possibly enhancing SNP discovery.

Mapping of pooled-tissue libraries generally resulted in higher numbers of contigs, and the assembled contigs tended to be longer (Fig. 2, Table 3). Both pooled-tissue and testes libraries contained unique contigs, although there were more contigs unique to the pooled-tissue libraries, regardless of the EST database used as a reference. It should be noted that these trends are general and the pooled-tissue and testes libraries have overlapping error rates in the total number of reads assembled, assembled contig length and coverage depth (Table 3, Fig. 2). Differences among EST libraries are typically identified by grouping ESTs by Gene Ontology (GO) terms or gene identifications and then comparing relative counts of ESTs via a hypergeometric or related distribution, as in binomial, chi-squared or Fisher's exact tests (Susko & Roger 2004; Young et al. 2010). The resulting distribution of sequences among libraries from NGS methods frequently violates the assumptions underlying these tests, specifically the assumption that all genes are independent and equally likely to be selected as differentially expressed, under the null hypothesis (Young et al. 2010). Thus, we did not perform statistical tests, and our comparisons remain general at this time. While pooled-tissue libraries contained a higher diversity of transcripts, the large variability in reads mapped (Table 3) and the overlap in error rates for assemblies between pooled-tissue and testes libraries suggest that both are an important resource for gene discovery.
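For reference, the kind of count-based comparison mentioned above can be sketched as an upper-tail hypergeometric probability, which is the one-sided form of Fisher's exact test. This only illustrates the typical calculation; it was not applied in this study for the reasons given. The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) for a hypergeometric draw.

    N: total ESTs across both libraries
    K: ESTs (in both libraries) annotated with the GO term of interest
    n: ESTs in the library being tested
    k: ESTs in that library annotated with the GO term
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts: 1000 ESTs in total, 100 carry the GO term,
# and the tested library contributes 400 ESTs, 55 of them with the term.
p_value = hypergeom_upper_tail(k=55, K=100, n=400, N=1000)
print(f"one-sided p-value: {p_value:.3g}")
```

Because the test assumes that ESTs are sampled independently, a small p-value from deeply sequenced, nonnormalized NGS libraries should be read with the caution expressed above.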
Decisions regarding tissue selection for library preparation should be made based on tissue availability and the specific question being addressed, keeping in mind the possibility for other downstream applications with each library type.

Comparison among EST references

The second goal of this study was a comparison of gene and SNP discovery when SOLiD reads are mapped to different EST databases and a de novo assembly. The short reads produced by the SOLiD sequencing system remain difficult to assemble de novo, because of their length and the computational complexity of handling the large volume of data, although assembly algorithms are improving (Flicek & Birney 2009). The overall number of reads that assembled to each reference database was variable among all individuals (Table 3). Additionally, a relatively low proportion of the average number of starting reads (approximately 10%) mapped to each reference database. Two factors may explain this low rate. First, the genes in the EST database may not have been representative of the genes expressed in our tissue libraries. The second factor, our strict assembly parameters, was more likely the crucial factor producing the low mapping rate. Any reads that contained more than two mismatches to the reference sequence were discarded. This necessarily excluded many sequences; however, the hope was that it would improve the overall quality of the assembly and separate PSVs. In spite of the reduction in the total number of reads assembled, contigs representing more than 90% of each reference library were assembled from our data.

We found that assembly of SOLiD reads was variable among the three EST databases and the de novo contigs. As expected, the number of contigs assembled on each reference varied proportionally with the number of sequences in the reference database.
The Atlantic salmon and rainbow trout databases contained a larger number of ESTs and thus were able to capture more of the expressed genes present in the sockeye salmon transcriptome. Across all assemblies, a large proportion of the starting EST databases (between 94 and 98%) were detected in at least a single library, suggesting a large proportion of transcripts in sockeye salmon are found in the publicly available EST databases. The average proportion of contigs assembled, as well as sequence length and coverage, was lower in the Atlantic salmon and rainbow trout assemblies (Table 3). There was also a large variation in the number of contigs assembled among individuals on all references, with a number of contigs assembled in a single individual in all cases. Based on standard deviation of the contigs assembled (Table 3), the variation in the number of assembled contigs was largest on the Atlantic salmon reference, and smallest on the sockeye salmon EST database. In addition to containing more sequences, the rainbow trout and Atlantic salmon EST libraries contained sequences that were generally longer than those in either the sockeye salmon EST or sockeye de novo library. Thus, the contigs assembled from these databases tended to be longer. This variability in contig length and depth, as well as the number of contigs assembled among individuals, may have been related to the methods of assembly. In the assembly software, if a SOLiD read was nonspecific, i.e. it mapped to more than one reference sequence, then the software provided the option to align to one of these references at random or remove it from analysis. In all reference assemblies, the software was set to random assembly of nonspecific matches.
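The two options for nonspecific reads described here can be contrasted with a toy example. This is a minimal sketch, not the assembly software used in the study: the function name, the policy labels and the reference IDs are hypothetical.

```python
import random
from collections import Counter


def assign_reads(hits_per_read, policy="random", seed=0):
    """Count reads per reference under a multi-mapping policy.

    hits_per_read: list where each element lists the reference IDs one read matched.
    policy: "random" places a nonspecific read on one of its hits at random;
            "discard" drops every read that matched more than one reference.
    """
    if policy not in ("random", "discard"):
        raise ValueError(f"unknown policy: {policy}")
    rng = random.Random(seed)
    coverage = Counter()
    for hits in hits_per_read:
        if not hits:
            continue  # unmapped read
        if len(hits) == 1:
            coverage[hits[0]] += 1
        elif policy == "random":
            coverage[rng.choice(hits)] += 1
        # policy == "discard": nonspecific read is ignored
    return coverage


# Six reads unique to one EST, four reads shared with two redundant copies of it.
reads = [["estA"]] * 6 + [["estA", "estA_dup1", "estA_dup2"]] * 4
random_cov = assign_reads(reads, policy="random")
discard_cov = assign_reads(reads, policy="discard")
print("random :", dict(random_cov))
print("discard:", dict(discard_cov))
```

Under the random policy all ten reads are retained but may be scattered over redundant references as thin contigs, whereas the discard policy keeps fewer reads and concentrates them on the unambiguous reference.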
Thus, any underlying redundancy in an EST database could result in a wider distribution of SOLiD reads across these sequences, resulting in more short, low coverage contigs. Such underlying redundancy could include paralogous sequences, repetitive DNA elements, or shared motifs, although specific identification of these structures is beyond the scope of the current study. While longer sequence reads can separate these features, the correct mapping of short SOLiD reads to these features remains problematic. Among our assemblies, we observed numerous, short (<100 bp) contigs, and contigs consisting of only a few reads (<20 reads per contig) (Figs 3 and 4). A test reassembly of the SOLiD reads, with the software option set to remove any read that mapped to multiple reference sequences, resulted in a lower total number of assembled reads and a lower number of assembled contigs. There was also a substantial decrease in the differences in the number, length and coverage of contigs assembled using the rainbow trout and Atlantic salmon references (data not shown). Thus, differences in assembly between the EST databases may be related to both underlying differences in the structure of the EST database chosen and to the parameters chosen for the reference assembly itself. Researchers mapping NGS reads to a closely related species should take both of these factors into account when performing reference assembly.

SNP discovery and validation

SNP detection was successfully carried out in silico on contigs from assemblies to all four reference libraries. Following the same patterns as contig assembly, there was a large variation in number of putative SNPs detected, both among individual libraries and across the EST references. One of the primary difficulties in successfully detecting and confirming true SNPs in salmonids is the presence of PSVs, the result of a whole genome duplication event in salmonids (Allendorf & Thorgaard 1984; Koop et al. 2008).
Paralog assemblies often contain multiple polymorphic sites. The intent of our initial assembly parameters, allowing only two mismatches per mapped SOLiD read, was to separate PSVs into individual loci if possible. The EST databases selected from cGRASP were from an EST assembly also intended to separate PSVs (Koop et al. 2008). Despite the stringency of assembly, our approach had limited success. Similar SNP detection studies in species without a recent genome duplication have had high success rates. For example, 84% of in silico SNPs detected in the great tit, Parus major, were validated (Van Bers et al. 2010). SNP detection efforts in polyploid plant species have had up to 93% success in validating SNPs detected from NGS data (Bundock et al. 2009; Trick et al. 2009; Buggs et al. 2010). Many of the plants in these studies, however, have EST or other sequence resources available from closely related, diploid ancestors, which allow better identification of PSVs.

Previous SNP detection studies in salmonids have generally reported success rates of around 50–74% (Smith et al. 2005; Ryynanen & Primmer 2006; Hayes et al. 2007; Sanchez et al. 2009). The apparent large difference in validation rates between the current study and these previous efforts in salmonids may be due to the methods used in each study. We tested only a limited number of putative SNPs assembled across the entire transcriptome. Previous efforts have targeted known overlapping sequences between species (Smith et al. 2005). A large portion of our putative SNPs were discarded when the primers designed to test them failed to amplify a product. Primers designed on transcriptome sequence to test putative SNPs may cross introns and fail to amplify (see Table 5). Additionally, these previous efforts have used Sanger sequencing, which produces longer reads that are easier to accurately assemble.
Some studies report success based on the number of successful TaqMan assays alone, rather than the total number of sequences tested (Smith et al. 2005). Calculating our success in this fashion would result in a 90% success rate (10 of 11 TaqMan assays were successful). In contrast, a recent next-generation SNP discovery effort in chum salmon used 454 sequencing for SNP discovery and the same SNP validation pipeline used in this study. The rate of SNP validation reported in that study was approximately 20%, much closer to the rate reported here (Seeb et al. 2011). Regardless of the differences in reporting between studies, the relatively low validation rate in salmonids compared to other species is attributed to difficulty in separating PSVs from true SNPs.

While our validation rate was low, the specific validation rate varied depending on the EST database used as a reference. Of the validated SNPs, eight were from contigs assembled from the two sockeye salmon references and the remaining two assays were successfully designed from assemblies to the rainbow trout reference. None of the candidates selected for validation from the Atlantic salmon reference were successful.

A number of additional factors may underlie our low validation rate. First, despite the large numbers of putative SNPs detected (Table 4), the lack of overlap in loci detected among libraries reduced the number of SNPs that passed secondary screening. A portion of the SNPs detected were from contigs assembled in only one library. At the same time, in a proportion of contigs assembled in multiple individuals, the portions of each reference sequence that were assembled from each individual did not fully overlap among individuals.
For example, if we assembled contigs corresponding to a single reference sequence in two sockeye individuals, the reads from the first individual might align to the beginning portion of the reference sequence, while the reads from the second sockeye individual might align to the end. Consequently, any putative SNPs detected in each of these segments would not be shared between these two individuals. This uneven distribution of putative SNPs among individuals greatly reduced the number of putative SNPs for selection for validation. Possible solutions to this uneven coverage include library normalization or use of emerging techniques including sequencing of reduced representation libraries (Sanchez et al. 2009) or RAD tag sequencing (Miller et al. 2007; Baird et al. 2008) that produce more even coverage across all individuals. Another source of SNP dropout was erroneous assignment of PSVs or sequence errors as SNPs. Of the loci that failed during validation, six failed to amplify a product in an initial PCR test (Table 5). Primers designed on these loci probably spanned an intron boundary, as primers were designed from transcriptome consensus sequences, but tested on genomic DNA. Furthermore, of the starting 96 loci tested, 60 appeared to amplify multiple loci and thus were likely PSVs. Nineteen appeared to be false positives in initial assembly, where no SNP was detected during validation, possibly a result of sequencing error. The variation in SNP detection among reference sequences is unusual. Previous studies in salmonids have successfully used cross-species sequence data to design primers and assays for SNP loci (Smith et al. 2005). This unusual complication appears to be related to the assembly parameters used for the complex assembly of short SOLiD reads to the large EST data set and our screening parameters for SNP detection. 
As described previously, in our assembly parameters, short SOLiD reads that hit multiple sequences were randomly mapped to a single sequence. This led to the large number of relatively low coverage contigs discussed earlier. Before selecting putative SNPs for validation, we eliminated putative SNPs that either contained more than two alleles, or were within 100 bp of one another, as such variants were more likely to be PSVs (Sanchez et al. 2009). Our assembled reads were more widely distributed among more, low coverage contigs in the assembly to the Atlantic salmon and rainbow trout EST sequences. Thus, PSVs and sequencing errors that would otherwise have been eliminated during our secondary screening of SNPs were more likely to be missed. Possible solutions to this scenario are to set the assembly parameters to discard all SOLiD reads that have multiple hits, to only assemble to full length sequences, or to increase the coverage threshold for putative SNP detection. All these methods will result in loss of some sequence data but should reduce the number of misidentified PSVs.

Conclusions and recommendations for future research

SNP discovery using short read chemistry as carried out here was technically challenging. We offer the following conclusions and recommendations for future research. First, differences between testes and pooled-tissue SOLiD libraries were not substantial. Investigators may choose to select any tissue or tissue combination relevant to their individual study. Second, the high depth of coverage provided by NGS data identified even rare transcripts. However, normalization of libraries may help more evenly distribute coverage, enhancing variant discovery. If use of normalization can be balanced against the resulting increase in cost, it is a good option. If native libraries are needed for gene expression studies, these may be subsampled during library construction.
Third, care should be taken to select reference sequences from conspecifics or closely related species. Despite the difficulties in using short reads for de novo assembly, we recommend a strategy that incorporates a combination of de novo and reference assemblies. Fourth, when assembling to EST libraries, assembly parameters should eliminate nonspecific hits to better facilitate complete, high coverage contig assembly. An assembly strategy that randomly assigns nonspecific reads may be used for comparison if too large a proportion of data is removed from the analysis but should not be used for SNP detection. Finally, strategies to screen putative SNPs for potential PSVs, as well as stringent assembly strategies such as only mapping to full length contigs, may reduce the occurrence of incorrectly combining PSVs into a single sequence.

Our current strategy for SNP discovery using NGS is to use strict assembly parameters and assemble to conspecific EST libraries to maximize our SNP validation rates. Use of strict parameters will reduce the misidentification of PSVs; however, it may lower the overall number of contigs discovered. By testing a variety of strict assembly parameters on a subset of the data, investigators can find parameters that optimize both SNP and gene discovery for their individual organism.

Acknowledgements

We thank Carita Pascal for laboratory support, Steven Roberts for aid in de novo assembly and all the University of Washington and Alaska Department of Fish and Game personnel who collected and provided tissue samples. This manuscript was partially funded by the Alaska Sustainable Salmon Fund under Study no. 45908 from the National Oceanic and Atmospheric Administration, US Department of Commerce, administered by the Alaska Department of Fish and Game.
Additional funding for this project was provided by a grant from the Gordon and Betty Moore Foundation, and a grant from the Bristol Bay Regional Seafood Development Association. The statements, findings, conclusions and recommendations are those of the authors and do not necessarily reflect the views of the National Oceanic and Atmospheric Administration, the US Department of Commerce, or the Alaska Department of Fish and Game.

Conflict of Interest

The authors have no conflict of interest to declare and note that the sponsors of the issue had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Allendorf FW, Thorgaard GH (1984) Tetraploidy and the evolution of salmonid fishes. In: Evolutionary Genetics of Fishes (ed. Turner BJ), pp. 1–53. Plenum Press, New York.
Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE, 3, e3376.
Bonaldo MDF, Lennon G, Soares MB (1996) Normalization and subtraction: two approaches to facilitate gene discovery. Genome Research, 6, 791–806.
Buggs RJA, Chamala S, Wu W et al. (2010) Characterization of duplicate gene evolution in the recent natural allopolyploid Tragopogon miscellus by next-generation sequencing and Sequenom iPLEX MassARRAY genotyping. Molecular Ecology, 19, 132–146.
Bundock PC, Eliott FG, Ablett G et al. (2009) Targeted single nucleotide polymorphism (SNP) discovery in a highly polyploid plant species using 454 sequencing. Plant Biotechnology Journal, 7, 347–354.
Carre W, Wang XF, Porter TE et al. (2006) Chicken genomics resource: sequencing and annotation of 35,407 ESTs from single and multiple tissue cDNA libraries and CAP3 assembly of a chicken gene index. Physiological Genomics, 25, 514–524.
Collins LJ, Biggs PJ, Voelckel C, Joly S (2008) An approach to transcriptome analysis of non-model organisms using short-read sequences. Genome Informatics, 21, 3–14.
Elfstrom CM, Smith CT, Seeb JE (2006) Thirty-two single nucleotide polymorphism markers for high-throughput genotyping of sockeye salmon. Molecular Ecology Notes, 6, 1255–1259. Ellegren H (2008) Sequencing goes 454 and takes large-scale genomics into the wild. Molecular Ecology, 17, 1629–1631. Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nature Methods, 6, S6–S12. Molecular Ecology Resources (2011) 11 (Suppl. 1), 93–108 2011 Blackwell Publishing Ltd Goetz F, Rosauer D, Sitar S et al. (2010) A genetic basis for the phenotypic differentiation between siscowet and lean lake trout (Salvelinus namaycush). Molecular Ecology, 19, 176–196. Govoroun M, Le Gac F, Guiguen Y (2006) Generation of a large scale repertoire of Expressed Sequence Tags (ESTs) from normalised rainbow trout cDNA libraries. BMC Genomics, 7, 196. Habicht C, Seeb LW, Myers KW, Farley EV, Seeb JE (2010) Summer–Fall Distribution of Stocks of Immature Sockeye Salmon in the Bering Sea as Revealed by Single-Nucleotide Polymorphisms. Transactions of the American Fisheries Society, 139, 1171–1191. Hale MC, McCormick CR, Jackson JR, DeWoody JA (2009) Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): the relative merits of normalization and rarefaction in gene discovery. BMC Genomics, 10, 203. Harismendy O, Ng PC, Strausberg RL et al. (2009) Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology, 10, R32. Hayes B, Laerdahl J, Lien S et al. (2007) An extensive resource of single nucleotide polymorphism markers associated with Atlantic salmon (Salmo salar) expressed sequences. Aquaculture, 265, 82–90. Koop BF, von Schalburg KR, Leong J et al. (2008) A salmonid EST genomic study: genes, duplications, phylogeny and microarrays. BMC Genomics, 9, 545. Kunstner A, Wolf JBW, Backstrom N et al. 
(2010) Comparative genomics based on massive parallel transcriptome sequencing reveals patterns of substitution and selection across 10 bird species. Molecular Ecology, 19, 266–276. McGlauflin MT, Smith MJ, Wang JT et al. (2010) High-resolution melting analysis for the discovery of novel single-nucleotide polymorphisms in rainbow and cutthroat trout for species identification. Transactions of the American Fisheries Society, 139, 676–684. Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Research, 17, 240–248. Morin PA, Martien KK, Taylor BL (2009) Assessing statistical power of SNPs for population structure and conservation studies. Molecular Ecology Resources, 9, 66–73. Morozova O, Hirst M, Marra MA (2009) Applications of new sequencing technologies for transcriptome analysis. Annual Review of Genomics and Human Genetics, 10, 135–151. Parchman TL, Geist KS, Grahnen JA, Benkman CW, Buerkle CA (2010) Transcriptome sequencing in an ecologically important tree species: assembly, annotation, and marker discovery. BMC Genomics, 11, 180. Peakall R, Smouse P (2006) GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes, 6, 288–295. Raymond M, Rousset F (1995) Genepop (Version-1.2) - population-genetics software for exact tests and ecumenicism. Journal of Heredity, 86, 248–249. Rousset F (2008) GENEPOP '007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources, 8, 103–106. Ryynanen HJ, Primmer CR (2006) Single nucleotide polymorphism (SNP) discovery in duplicated genomes: intron-primed exon-crossing (IPEC) as a strategy for avoiding amplification of duplicated loci in Atlantic salmon (Salmo salar) and other salmonid fishes. BMC Genomics, 7, 192. Sanchez CC, Smith TPL, Wiedmann RT et al.
(2009) Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics, 10, 559. Schindler DE, Hilborn R, Chasco B et al. (2010) Population diversity and the portfolio effect in an exploited species. Nature, 465, 609–U102. Seeb JE, Pascal CE, Ramakrishnan R, Seeb LW (2009) SNP genotyping by the 5′-nuclease reaction: advances in high throughput genotyping with non-model organisms. In: Methods in Molecular Biology, Single Nucleotide Polymorphisms, 2nd edn (ed. Komar A), pp. 277–292. Humana Press, New York. Seeb JE, Pascal CE, Grau ED et al. (2011) Transcriptome sequencing and high-resolution melt analysis advance SNP discovery in duplicated salmonids. Molecular Ecology Resources, doi: 10.1111/j.1755-0998.2010.02936.x. Shendure J, Ji HL (2008) Next-generation DNA sequencing. Nature Biotechnology, 26, 1135–1145. Slate J, Gratten J, Beraldi D et al. (2010) Gene mapping in the wild with SNPs: guidelines and future directions (vol. 136, pg 97, 2009). Genetica, 138, 467–467. Smith CT, Seeb LW (2008) Number of alleles as a predictor of the relative assignment accuracy of short tandem repeat (STR) and single-nucleotide-polymorphism (SNP) baselines for chum salmon. Transactions of the American Fisheries Society, 137, 751–762. Smith CT, Elfstrom CM, Seeb LW, Seeb JE (2005) Use of sequence data from rainbow trout and Atlantic salmon for SNP detection in Pacific salmon. Molecular Ecology, 14, 4193–4203. Smith M, Pascal CE, Grauvogel Z et al. (2011) Multiplex preamplification PCR and microsatellite validation allows accurate single nucleotide polymorphism (SNP) genotyping of historical fish scales. Molecular Ecology Resources, 11, 257–266. Susko E, Roger AJ (2004) Estimating and comparing the rates of gene discovery and expressed sequence tag (EST) frequencies in EST surveys. Bioinformatics, 20, 2279–2287.
Trick M, Long Y, Meng JL, Bancroft I (2009) Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing. Plant Biotechnology Journal, 7, 334–346. Van Bers NEM, Van Oers K, Kerstens HHD et al. (2010) Genome-wide SNP detection in the great tit Parus major using high throughput sequencing. Molecular Ecology, 19, 89–99. Wall P, Leebens-Mack J, Chanderbali A et al. (2009) Comparison of next generation sequencing technologies for transcriptome characterization. BMC Genomics, 10, 347. Wolf JBW, Bayer T, Haubold B et al. (2010) Nucleotide divergence vs. gene expression differentiation: comparative transcriptome sequencing in natural isolates from the carrion crow and its hybrid zone with the hooded crow. Molecular Ecology, 19, 162–175. You FM, Huo NX, Gu YQ et al. (2008) BatchPrimer3: a high throughput web application for PCR and sequencing primer design. BMC Bioinformatics, 9, 253. Young M, Wakefield M, Smyth G, Oshlack A (2010) Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14.

Appendix 3. Allele frequency stability in large, wild exploited populations over multiple generations: insights from Alaska sockeye salmon

(Manuscript in preparation for Canadian Journal of Fisheries and Aquatic Sciences)

Daniel Gomez-Uchida1,2, James E. Seeb1, Christopher Habicht3 & Lisa W. Seeb1*

1 School of Aquatic and Fishery Sciences, 1122 Boat St NE Box 355020 Seattle, WA 98195-5020 USA.
2 Departamento de Zoología, Facultad de Ciencias Naturales y Oceanográficas, Universidad de Concepción, Casilla 160-C, Concepción, Chile.
3 Division of Commercial Fisheries, Alaska Department of Fish and Game, 333 Raspberry Road, Anchorage, AK 99518, USA.
*Corresponding author

Acknowledgments

Lowell Fair, Tim Baker, and Colton Lipka helped identify and prepare the scale collections from ADF&G archives. We are indebted to Carita Pascal, Eleni Petrou, and Taylor Gibbons for support during the laboratory stages of this project. Mark Witteveen and Birch Foster at the ADF&G kindly donated their expertise on Kodiak Island salmon management through conversations and reports. This manuscript benefited from criticisms from colleagues at the School of Aquatic and Fishery Sciences, University of Washington, who attend the Friday Lunch Quantitative Seminar series. We thank Ryan Waples for stimulating discussions and feedback on one of the figures. Funding for this research was provided by the Gordon and Betty Moore Foundation and by the Alaska Sustainable Salmon Fund under Study #45908 from the National Oceanic and Atmospheric Administration, U.S. Department of Commerce, administered by the ADF&G. The statements, findings, conclusions, and recommendations are those of the authors and do not necessarily reflect the views of the National Oceanic and Atmospheric Administration, the U.S. Department of Commerce, or the ADF&G. Data for this study are available at: to be completed after manuscript is accepted for publication.

Abstract

Genetic data are increasingly used to improve the management of commercially exploited populations. Uncertainty about the temporal stability of allelic frequencies surrounds the common use of multigenerational datasets in fisheries that are prosecuted on migrating admixtures. We genotyped six pairs of (archived and contemporary) collections of Alaskan sockeye salmon to estimate temporal divergence over a period of 25 – 42 years (4.9 – 8.4 generations).
First, our results show that temporal changes were dramatically (between 40- and 250-fold) smaller than spatial changes in allele frequencies when based on nuclear SNPs; differences were much less marked for mitochondrial SNPs. Second, the magnitude of temporal change was generally consistent with a model of genetic drift: (i) large-FST or candidate SNPs for diversifying selection were not more likely to show significant temporal changes than small-FST or selectively neutral SNPs and (ii) the observed number of significant tests fell within estimates predicted by a theoretical model relating sample size and effective population size (Ne). Third, estimates of Ne and upper 95% CI were generally infinitely large, except for one paired collection with unique life-history attributes of both a shorter smoltification phase and generation time. Overall, these findings argue that allele frequency stability was pervasive over multiple generations for most SNPs, despite the potential influence of diversifying selection and the presence of significant temporal divergence in one paired collection. Use of multigenerational datasets based on candidates for selection and putatively neutral SNPs seems a safe practice in management of Alaska sockeye salmon that could be extended to other large, wild stocks.

Introduction

A noteworthy application of population genetics theory to resource management has been genetic stock identification (GSI), also known as “mixed-stock” or “stock composition” analyses, a type of assignment method that uses multilocus genotypes to ascertain the population (or groups thereof) composition in a mixture sample (Pella and Milner 1987; Manel et al. 2005). GSI has found enormous success in commercial species, endangered organisms, or both, that require an accurate estimate of the number of population sources and their proportions.
Populations that have benefited from the GSI approach are diverse and include hawksbill turtles (Browne, Horrocks, and Abreu-Grobois 2010), Canada geese (Mylecraine et al. 2008), honey bees (Bourgeois et al. 2010), and many species of fish, especially salmonids (reviewed in Utter and Ryman 1993; Waples, Punt, and Cope 2008). Like traditional assignment tests, GSI requires a set of source or reference populations (“baseline”) to estimate individual membership probabilities; unlike assignment tests, however, GSI incorporates the uncertainty of individual assignment to get the stock composition of the mixture, rather than simply classify individuals to their potential source, which generally translates into greater accuracy (Manel et al. 2005). Yet, both methods may provide congruent outputs if genetic differentiation among populations of the baseline is large (Potvin and Bernatchez 2001).

It is a common practice to establish baseline datasets from samples that were collected over several generations (e.g., Beacham et al. 2004; Habicht et al. 2010). A frequently unverified assumption is the temporal stability of baseline allele frequencies, despite some earlier in-depth theoretical considerations (Waples 1990). Temporal instability of allele frequencies, however, may yield unreliable GSI estimates, especially if it surpasses the magnitude of spatial variation (Waples 1990). It is pertinent to evaluate the extent of this phenomenon on various grounds, some of which have been previously discussed (Waples 1990). First, contemporary datasets normally span several decades of sampling (e.g., Seeb et al. 2011; Templin et al. 2011), which raises the question of whether baseline allele frequencies are representative of the entire period in which GSI analyses take place.
Second, not all markers within a set used for GSI contain the same amount of information (Banks, Eichert, and Olsen 2003); performance among the now widely used single nucleotide polymorphisms (SNPs) depends heavily on levels of genetic divergence, such as FST (Weir and Cockerham 1984). Large-FST SNPs generally outperform small-FST SNPs (Ackerman, Habicht, and Seeb 2011; Hess, Matala, and Narum 2011). Large-FST loci in general are also more likely to be candidates for diversifying selection (“outliers”) than small-FST loci, which on average behave as selectively neutral (Storz 2005). But are large-FST loci also more temporally unstable than small-FST loci? Because some outlier SNPs may be linked to functional genes and their allele frequencies may change as a function of latitude (Seeb et al. 2011) as well as environmental factors (Bradbury et al. 2010), this question deserves further consideration.

Changes in allele frequency over two or more generations are often the result of genetic drift, which occurs at a rate that is inversely proportional to the effective size of a population (Ne). Populations of large Ne are thus expected to drift less than populations of small Ne (Ostergaard et al. 2003; Vaha et al. 2008; Therkildsen et al. 2010; but see Garant, Dodson, and Bernatchez 2000). The estimation of Ne using genetic data from molecular markers (reviewed in Wang 2005) has accordingly found renewed interest among applied geneticists during the last decade, chiefly because access to life-time demographic parameters (e.g., variance in reproductive success) is limited in wild populations, and because genetic estimates of Ne can be an indicator of population health when related to census or adult population sizes (Palstra and Ruzzante 2008). Yet, Ne can be notoriously difficult to estimate, and too often this parameter may be influenced by factors other than genetic drift, especially age-structure and population subdivision (Waples 2010).
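As a rough illustration of why populations of large Ne are expected to drift less, the expected variance in allele frequency after t generations under an idealized Wright-Fisher model is p0(1 − p0)[1 − (1 − 1/(2Ne))^t]. The sketch below is not part of the analyses reported here; the function name and the example values are assumptions chosen only for illustration.

```python
def expected_drift_variance(p0: float, ne: float, t: float) -> float:
    """Expected variance of allele frequency after t generations of pure drift.

    Assumes an idealized Wright-Fisher population with discrete generations,
    constant effective size ne, and starting allele frequency p0.
    """
    return p0 * (1.0 - p0) * (1.0 - (1.0 - 1.0 / (2.0 * ne)) ** t)


# Hypothetical values: same starting frequency and time span, two effective sizes.
var_small = expected_drift_variance(p0=0.5, ne=100, t=6)
var_large = expected_drift_variance(p0=0.5, ne=10_000, t=6)

# The ratio is close to the 100-fold ratio of the two effective sizes,
# which is what "inversely proportional to Ne" implies when t is small relative to Ne.
ratio = var_small / var_large
```

For short time spans relative to Ne the expression reduces to roughly p0(1 − p0)t/(2Ne), which is why the ratio above tracks the ratio of effective sizes.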
Here we used six pairs of collections of sockeye salmon (Oncorhynchus nerka, Walbaum 1792), which were temporally spaced between 25 and 42 years (4.9 and 8.4 generations), to gauge the stability of allele frequencies over multiple generations in three lake systems throughout Alaska. Anadromous sockeye salmon undergo a short freshwater migration to the spawning grounds where they were hatched after spending several years growing in the northern Pacific Ocean (Quinn 2005). This annual migration has supported one of the world’s largest, most profitable fisheries; its landed value was estimated at US$ 7.9 billion between 1950 and 2008 in Bristol Bay (Schindler et al. 2010). Using a suite of 85 nuclear and 3 mitochondrial SNPs, we addressed three specific objectives. First, we tested whether temporal changes in allele frequencies were smaller than spatial changes in allele frequencies, one fundamental assumption of GSI, using population-based spatial statistics. There is considerable support for this hypothesis in sockeye salmon (Beacham et al. 2004; Habicht et al. 2010; Creelman et al. 2011); to our knowledge, however, no studies have used the timescales presented here or compared nuclear to mitochondrial SNPs. Second, we tested whether temporal changes in allele frequencies (or generational divergence) were significant and consistent with a model of pure genetic drift, where directional forces like natural selection may be negligible. We investigated whether (i) large-FST SNPs, as revealed by an outlier detection method or “genome scan”, were more likely to exhibit significant temporal changes in allele frequencies than small-FST SNPs and (ii) the observed number of significant temporal tests among SNPs was proportional to the ratio of sample size to Ne, according to theoretical predictions (Waples 1989).
Third, we tested whether estimates of variance Ne, using an unbiased estimator of the so-called temporal method (Jorde and Ryman 2007), varied significantly among collections, and were thus good predictors of the magnitude of genetic drift or temporal change. Because spawning populations of sockeye salmon are often composed of thousands of individuals (Schindler et al. 2010), we made the general prediction that estimates of variance Ne should be infinitely large. However, we also hypothesized that variable life and colonization histories within sockeye salmon (Wood 1995) may influence variance Ne, as life-history types differ in demographic attributes (Wood et al. 2008).

Methods

Experimental design

Collections of archived dry scales and contemporary ethanol-preserved tissue (fin and heart) originated from three lake systems: Bear Lake, Red Lake, and South Olga Lakes (Fig. 1). The first is located in the northern Alaska Peninsula; the second and third are located in the western region of Kodiak Island, Alaska. These were chosen because they had (i) the oldest archived collections, (ii) no documented history of hatchery practices or introductions, and (iii) no evidence of population admixture, either natural or human-mediated (L. Fair and M. Witteveen, Alaska Department of Fish and Game, Anchorage, pers. comm. 2010). Adult migration has a bimodal temporal distribution in Bear Lake and South Olga Lakes; Red Lake, on the other hand, is composed of several peaks (Fig. 2). The Alaska Department of Fish and Game (ADF&G) has managed each lake system as two stocks, EARLY and LATE, depending on the date of migration (Fig. 2), although Red Lake was managed as a single stock between 1989 and 2010 (M. B. Foster, ADF&G, pers. comm. 2010).
EARLY and LATE runs are linked to spawning ecotypes: EARLY fish often spawn among tributary inlets, whereas LATE fish spawn among lake shoals or river outlets (Wood 1995). Early life-history strategies differ among lake systems depending on the duration and location of the smoltification phase (Wood 1995). Juveniles from Bear Lake, Red Lake and the EARLY run from South Olga Lakes normally spend 1 – 2 years rearing in lacustrine habitats (‘lake’ ecotype); offspring from the LATE run from South Olga Lakes, however, spend less than a year rearing in river (‘sea’ ecotype) or estuarine habitats as suggested by age composition analyses (M. B. Foster, unpublished data; M. Witteveen, pers. comm. 2010).

We sampled a total of 12 collections (Table 1). Samples from each lake system and run (EARLY and LATE) were taken at two different generations, 0 and t (i.e., a paired collection). Generation 0 (g0) corresponds to the oldest collections within each pair, which were taken between 1966 and 1976, whereas generation t (gt) corresponds to the most recent collections, which were taken between 1995 and 2009. We often combined individuals taken over the course of multiple days during the spawning season, despite some exceptions among contemporary collections that were taken within the same day (Fig. 2). We avoided scales taken too close to the recognized date boundaries that historically differentiate EARLY and LATE collections (Fig. 2). In a few collections, it was necessary to pool samples over multiple years to attain reasonable composite sample sizes (n = 71 – 96: Table 1). We combined samples over a maximum of three years into collections to avoid pooling across different generations. Nomenclature for the collections combined all three hierarchical sampling levels: lake system, run timing, and generation.
For instance, the most recent collection from South Olga Lakes taken from the EARLY run would be OLGA_EARLY_gt (t was subsequently replaced by an estimate of the number of generations since g0; see Estimation of variance Ne subsection on how to calculate t).

Genotyping

We followed the genotyping protocol of Seeb et al. (2009) with some modifications for archived scales (Smith et al. 2011). Genotyping for a panel of 96 SNPs, composed of 93 nuclear and 3 mitochondrial markers (Smith et al. 2005; Elfstrom, Smith, and Seeb 2006; Habicht et al. 2010; C. Storer, unpublished) was performed using Fluidigm® 96.96 dynamic arrays and uniplex PCR reactions. Genotype calls were conducted in Fluidigm® Genotyping Analysis proprietary software by two independent researchers. A quality control step included re-genotyping of 8% of the samples from each collection. SNPs that failed to consistently amplify across all collections (> 10% failure rate) were excluded from further analyses.

Exploratory analyses

We performed several tests to verify the integrity of the data and calculate general statistics. For nuclear SNPs, we tested for deviations from Hardy-Weinberg equilibrium (HWE) and linkage equilibrium on each locus using GENEPOP 4.0 (Rousset 2008). Heterozygote excess or deficit was reported through inbreeding coefficients per collection (FIS: Weir and Cockerham 1984). For HWE tests over all SNPs within collections, we implemented a binomial likelihood method that is less sensitive to deviations in a few loci (Moran 2003); for linkage equilibrium over all SNPs, we used Fisher’s method in GENEPOP. We estimated observed and expected heterozygosities in GENALEX (Peakall and Smouse 2006), while allelic richness was estimated in FSTAT (Goudet 1995, 2001). Expected heterozygosity and allelic richness were compared between lake systems through randomizations in FSTAT.
All three mitochondrial SNPs were combined in alphabetical order and analyzed as composite haplotypes, which were referred to by their nucleotide composition (e.g., “CGG”). We estimated haplotype diversity in GENALEX. For all software settings, see Gomez-Uchida et al. (2011).

Temporal vs. spatial divergence

Divergence between collections was classified according to the type of comparison: (i) within lakes (WL) or between lakes (BL), (ii) within runs (WR) or between runs (BR), and (iii) within generations (WG) or between generations (BG); these were combined into composite categories of divergence. For example, a comparison between RED_LATE_g0 and RED_LATE_gt would be classified WLWRBG for temporal or generational divergence. Thus, spatial divergence categories would be WLBRWG for run-timing divergence, and BLWG for lake divergence. Comparisons WR or BR between lake systems as well as hybrid categories (e.g., WLBRBG or BLBG) were not considered.

Each collection was treated as a separate population for the purpose of estimating divergence. For nuclear SNPs, we estimated pairwise divergence between collections through FST (Weir and Cockerham 1984), which was calculated in GENEPOP. Significance of FST values (i.e., if FST > 0) was evaluated using locus-specific χ2 tests of differentiation implemented in CHIFISH (Ryman 2006). Multilocus significance was estimated through a Pearson test on χ2 values. For mitochondrial haplotypes, we estimated pairwise divergence between collections through a standardized genetic distance, DS (Nei 1972) implemented in GENALEX. Pairwise tests of genetic differentiation on haplotype data were performed through contingency tables of haplotype counts and exact tests of heterogeneity using R (R Development Core Team 2010).
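The composite categories of divergence described in this subsection can be sketched as a small classification routine. This is only an illustration of the stated rules; the function names are hypothetical, and the sketch assumes collection names of the form LAKE_RUN_generation, with "g0" marking the archived collection and any other label (gt, or a numeric estimate of t) marking the contemporary one.

```python
from typing import Optional, Tuple


def parse_collection(name: str) -> Tuple[str, str, bool]:
    """Split a name such as 'RED_LATE_g0' into lake, run and an archived flag."""
    lake, run, generation = name.split("_")
    return lake, run, generation == "g0"


def divergence_category(a: str, b: str) -> Optional[str]:
    """Return WLWRBG, WLBRWG or BLWG, or None for comparisons not considered."""
    lake_a, run_a, archived_a = parse_collection(a)
    lake_b, run_b, archived_b = parse_collection(b)
    same_lake = lake_a == lake_b
    same_run = run_a == run_b
    same_generation = archived_a == archived_b

    if same_lake and same_run and not same_generation:
        return "WLWRBG"  # temporal (generational) divergence
    if same_lake and not same_run and same_generation:
        return "WLBRWG"  # run-timing divergence
    if not same_lake and same_generation:
        return "BLWG"  # lake divergence, regardless of run
    return None  # hybrid categories (e.g. WLBRBG, BLBG) or identical collections


category = divergence_category("RED_LATE_g0", "RED_LATE_gt")
```

Returning None for hybrid categories mirrors the statement that such comparisons were not considered.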
To address multiple comparisons over loci or collections, we compared the nominal type I error (p = 0.05) to the adjusted one using a 10% false discovery rate (padj) implemented in R, following Benjamini and Yekutieli (2001). Estimates of divergence (FST, DS) were visualized in two ways: first, we compared the distribution of values among hierarchical groups of collections using dot charts; second, we computed principal coordinate analysis via the covariance matrix of genetic distances using standard options in GENALEX.

The distribution of genetic variance among hierarchical sampling levels was computed through an analysis of molecular variance (AMOVA) in ARLEQUIN 3.5 (Excoffier, Laval, and Schneider 2005) for nuclear SNPs and GENALEX for mitochondrial SNPs. Such analysis enabled the estimation of temporal to spatial variance ratios. Components of the genetic variance were reported through sums of squares and F-statistics (nuclear SNPs) or Φ-statistics (mitochondrial SNPs).

Expectations of generational divergence among SNPs under pure genetic drift

First, we identified candidate SNPs for selection by relating genetic diversity and differentiation within and between populations (Storz 2005), using ARLEQUIN 3.5. We assumed a hierarchical island model, because it best describes the complex genetic structure among sockeye salmon populations, where gene flow is more predominant between demes within groups than among groups (Gomez-Uchida et al. 2011). Settings included 10,000 simulations, 100 demes (= paired collections), and 10 groups (= lake systems). Minimum and maximum expected heterozygosities were set at 0 and 0.5, respectively. We focused exclusively on candidates for diversifying selection found outside the upper quantiles, and hence ignored low-differentiation candidate SNPs for balancing selection, usually found outside the lower quantiles.
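The false discovery rate adjustment mentioned above for multiple comparisons (Benjamini and Yekutieli 2001) was run in R; the Python sketch below only illustrates the step-up rule behind that kind of adjustment. The function name and the example p-values are hypothetical, and how the reported padj thresholds map onto this rule is an assumption.

```python
from typing import List, Sequence


def by_rejections(pvalues: Sequence[float], q: float = 0.10) -> List[bool]:
    """Benjamini-Yekutieli step-up procedure at false discovery rate q.

    Returns one flag per input p-value (True = null hypothesis rejected).
    """
    m = len(pvalues)
    # Harmonic-sum correction that makes the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    ranked = sorted(pvalues)

    # Largest rank k whose ordered p-value falls at or below k * q / (m * c_m).
    k_max = 0
    for k, p in enumerate(ranked, start=1):
        if p <= k * q / (m * c_m):
            k_max = k

    if k_max == 0:
        return [False] * m
    cutoff = ranked[k_max - 1]  # largest p-value that is still rejected
    return [p <= cutoff for p in pvalues]


# Hypothetical p-values for four tests, evaluated at a 10% false discovery rate.
flags = by_rejections([0.001, 0.02, 0.04, 0.5], q=0.10)
```

With four tests the per-rank thresholds are k × 0.012, so the two smallest p-values pass and the remaining two do not.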
Second, we investigated if generational divergence was consistent with expectations under pure stochastic forces, namely sampling error and genetic drift. A generalized model by Waples (1989) was used to calculate the probability of a locus-specific Pearson χ2 test being significant according to

$$p_e = P\left(\chi^2 > \frac{3.84}{C}\right),$$

where 3.84 is the critical χ2 value for a diallelic locus (d.f. = 1, p = 0.05) and C is a scaling factor that is proportional to the ratio of sample size (n) to effective population size (Ne) and can be approximated by (Waples 1989):

$$C \approx 1 + \frac{\tilde{n}\,t}{2N_e}; \qquad (1)$$

this expression should be valid for a broad range of values and applicable to both sampling plans (before and after reproduction), unless t is too large and Ne is too small (Waples 1989). The expected number of locus-specific significant tests (pe) was then calculated from estimates of variance Ne, t (see Estimation of variance effective population size, below), and ñ, the harmonic mean of sample sizes taken at g0 and gt, and compared to the observed number of locus-specific significant tests (po). Where the estimate of variance Ne was infinite, we used an estimate of census population size (N). In addition, we identified SNPs showing significant (p < 0.05) generational divergence within each set of paired collections and compared them to candidates for diversifying selection identified above. Our rationale was that if nonrandom forces, such as natural selection, drive generational divergence at specific SNPs, these may appear in multiple paired collections and may match putative outlier SNPs (e.g., Jump et al. 2006).

Estimation of variance (Ne) and census population size (N)

We used an unbiased moment-based estimator of the standardized shift in allele frequencies, F’S (Jorde and Ryman 2007) to calculate variance Ne for each paired collection.
The estimation of F’S has been implemented in the software TempoFs (http://www.zoologi.su.se/~ryman/). The method assumes that generations are discrete and that there is no gene flow; thus, temporal changes in allele frequencies are the sole result of genetic drift. We estimated F’S between paired collections (e.g., OLGA_EARLY_g0 and OLGA_EARLY_gt) separated by t generations; t was estimated from the difference in years between the oldest and most recent collections divided by the mean generation time, G. The parameter G was calculated according to

$$G = \sum_{i=1}^{j} p_i\, i,$$

where pi is the proportion of individuals of age i (iterated through age j; Felsenstein 1971). Sampling at g0 and gt followed plan I, or sampling after reproduction (Waples 1989). The age composition of the escapement was estimated from brood tables prepared during 1985 – 2010 (M. B. Foster, unpublished). Pairwise differences in age composition between paired collections were assessed by means of exact tests of heterogeneity implemented in R.

Census population sizes (N) for each paired collection were estimated by summing daily average escapements (between 1966 and 2009: see Fig. 2) across days that define EARLY and LATE run timing within each lake system: BEAR_EARLY, 10-Jun to 31-Jul; BEAR_LATE, 1-Aug to 15-Sep; RED_EARLY, 29-May to 15-Jul; RED_LATE, 16-Jul to 1-Sep; OLGA_EARLY, 29-May to 15-Jul; and OLGA_LATE, 16-Jul to 15-Sep.

Results

Genotyping

Eight SNPs failed to consistently amplify across all collections (> 10% failure rate) and were therefore excluded, leaving 85 nuclear and three mitochondrial SNPs for all ensuing statistical analyses (average and median amplification success = 98%; range of amplification success among loci: 93 – 100%). The quality control step found mismatches in 6 of 8579 genotypes; mismatches were exclusively heterozygote-homozygote calls (or vice versa) for a discrepancy rate of 0.07%.
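For reference, the mean generation time G and the number of generations t defined in the Methods can be computed as in the following sketch. The age composition and sampling years used here are hypothetical and do not come from the brood tables; the function names are likewise assumptions for illustration.

```python
from typing import Mapping


def mean_generation_time(age_proportions: Mapping[int, float]) -> float:
    """G = sum over ages i of p_i * i, with p_i the proportion of individuals of age i.

    Proportions are renormalized in case they do not sum exactly to one.
    """
    total = sum(age_proportions.values())
    return sum(age * p for age, p in age_proportions.items()) / total


def generations_elapsed(year_g0: int, year_gt: int, g: float) -> float:
    """t = difference in years between collections divided by mean generation time."""
    return (year_gt - year_g0) / g


# Hypothetical age composition (age in years -> proportion of the escapement).
G = mean_generation_time({4: 0.3, 5: 0.5, 6: 0.2})
# Hypothetical sampling years for an archived and a contemporary collection.
t = generations_elapsed(1970, 2002, G)
```

With these made-up inputs G is 4.9 years and t is about 6.5 generations, values of the same order as the 4.9 – 8.4 generations spanned by the paired collections.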
Exploratory analyses

Tests. Forty-three out of 936 tests for HWE were significant at a nominal type I error of 5% (p = 0.05), a number expected by chance without correction for multiple tests (Pearson χ21,1 = 1.1, p = 0.56). No evidence was found to reject the joint null hypothesis of no deviations for HWE in any collection (Table 1). Evidence for gametic disequilibrium was found in one collection, OLGA_EARLY_g0 (Fisher’s p = 0.012). Furthermore, significant evidence (p < 0.001) for physical linkage was found between three locus-pairs across multiple collections: (i) One_MHC2-190 and One_MHC2-251 (12 collections), (ii) One_Tf_ex11-750 and One_Tf_in3-182 (9 collections), and (iii) One_GPDH-201 and One_GPDH2-187 (7 collections). These results agree with previous findings in sockeye salmon from different Alaskan drainages (Habicht et al. 2010; Creelman et al. 2011).

Nuclear SNPs. We found significant evidence for differences in genetic diversity between lake systems (FSTAT: p = 0.01), including mean estimates (± standard deviation) of expected heterozygosity (Bear Lake: HE = 0.269 ± 0.008; Red Lake: HE = 0.284 ± 0.005; South Olga Lakes: HE = 0.279 ± 0.005) and allelic richness (Bear Lake: AR = 1.916 ± 0.009; Red Lake: AR = 1.919 ± 0.010; South Olga Lakes: AR = 1.941 ± 0.005).

Mitochondrial SNPs. Haplotype CGG was the most common in Bear Lake and Red Lake (> 0.6), followed by haplotype TAG (< 0.4); conversely, haplotype TAG was generally the most common for the paired OLGA_EARLY collection, followed by haplotype CGG (Fig. 3). For the OLGA_LATE paired collection, however, both haplotypes had fairly similar frequencies. In South Olga Lakes we also found one unique haplotype (CAA = 0.141 in OLGA_EARLY_g0; Fig. 3) and one rare haplotype (TGA) that had a higher frequency within OLGA_LATE_g0 and OLGA_LATE_g6.5 than Red Lake (Fig. 3).
We found that the mean (± standard deviation) haplotype diversity was the lowest for Bear Lake (h = 0.337 ± 0.064), intermediate for Red Lake (h = 0.463 ± 0.040), and the highest for South Olga Lakes (h = 0.569 ± 0.116).

Temporal vs. spatial divergence

Nuclear SNPs. The smallest divergence was always generational (WLWRBG; FST range: 0.0016 – 0.0047; Fig. 4a). No significance was found following correction for multiple tests (padj = 0.001) except for the paired OLGA_LATE collection (FST = 0.0047, Fisher’s p = 0.00008, Pearson χ2 p < 0.0001). Generational divergence was followed by intermediate values of run-timing divergence (WLBRWG; FST range: 0.0075 – 0.0295; Fig. 4a) that showed significant multilocus probabilities (Table 2). Lake divergence was always the largest (BLWG; FST range: 0.0408 – 0.0924; Fig. 4a) and significant in all cases (Table 2). This hierarchy of divergence among FST was also evident in a principal coordinate analysis: distances separating generational comparisons were the shortest (with nuances between lake systems), followed by run-timing and lake comparisons that had intermediate and the longest distances in the two-dimensional space, respectively (Fig. 5a). All three lake systems were reciprocally different: South Olga Lakes was as different from Red Lake as from Bear Lake, despite large differences in distances among these sites (Fig. 1).

Mitochondrial SNPs. DS values overlapped among categories, especially at small levels of divergence (Fig. 4b). Yet, categories differed in range of DS values as did the probability for the null hypothesis of no differentiation between collections, which was especially pronounced between South Olga Lakes and the other two lake systems (Table 2). The range for generational divergence was DS = 0.001 – 0.026 with no significant comparisons found after correction for multiple tests (padj = 0.014; Fig. 4b).
The range for run-timing divergence was DS = 0.000 – 0.181 with only four significant comparisons, whereas the range for lake divergence was DS = 0.000 – 0.730 and most comparisons were significant (Fig. 4b; Table 2). Using principal coordinate analysis, we noticed a lack of hierarchical distribution of DS values among categories for Red Lake and Bear Lake, in contrast to South Olga Lakes (Fig. 5b). Bear Lake and Red Lake clustered together, despite the geographic distance separating them, whereas South Olga Lakes formed a genetically distinct group, with marked differences between EARLY and LATE collections (Fig. 5b).

Hierarchical AMOVA. For nuclear SNPs, and after the variation found within collections, the largest component of the genetic variance was found between lakes, followed by run-timing within lakes, and their FST values were significantly greater than zero. The component between generations within run-timing was the smallest and its FST was no different from zero (Table 3). Absolute FST ratios between components suggested that the generational component was nearly 250-fold and 40-fold smaller than the lake and run-timing components, respectively. For mitochondrial SNPs, ΦST values for all three components were significantly greater than zero; ΦST ratios between components, on the other hand, approached unity (Table 3).

Was generational divergence consistent with a model of pure genetic drift?

Candidate SNPs for diversifying selection. The number of outlier SNPs varied between 2 (99th quantile) and 6 (95th quantile) depending on the threshold of the theoretical heterozygosity-FST distribution (Fig. 6). The range of FST for these SNPs varied between 0.146 (One_RFC2-285) and 0.549 (One_Tf_in3-182).
Two outliers mapped to well-described transferrin proteins in salmonids (One_Tf_in3-182 and One_Tf_ex11-750: Ford 2001), while a third (One_HpaI-99) mapped to a family of short interspersed elements in salmonids (Kido et al. 1991). A fourth outlier (One_RFC2-285) mapped to the replication factor C, subunit 2, of Atlantic salmon (Leong et al. 2010). Two outliers had no described annotation.

Observed (po) and expected number of significant tests (pe) among SNPs. Estimates of po varied between 5.0% and 14.1%; the highest estimate was found in the paired OLGA_LATE collection (Table 4). In general, po was smaller than pe in all but one paired collection (BEAR_EARLY), suggesting that generational divergence could be explained by genetic drift alone in most cases (Table 4). The number of SNPs that exhibited significant temporal differentiation varied between 1 (RED_EARLY) and 12 (OLGA_LATE), and in total, 26 SNPs showed significant temporal variation to a variable degree. There was little overlap among SNPs across paired collections: only two markers (One_RAG3-93 and One_ghsR-66) appeared twice (Table 4). With one exception (One_HpaI-99), no outlier SNPs or candidates for diversifying selection matched SNPs that showed significant generational divergence.

Ne, age composition, and N

Estimates of Ne were generally infinitely large: we found no finite upper 95% CI in the majority of paired collections (Table 5). Only OLGA_LATE had finite 95% CI estimates for Ne (Table 5). Census population sizes (N) ranged between ~40,000 (OLGA_EARLY) and ~200,000 (BEAR_EARLY) and suggest differences in productivity between lake systems as well as between collections with different run timing within the same lake system (Table 5).
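To illustrate why a temporal estimate of Ne can come out as "infinitely large", the following is a minimal sketch of the classical moment-based temporal method: a standardized variance in allele frequency change (Fc) is corrected for sampling error and then converted to Ne. This is an assumption-laden illustration and not necessarily the estimator used in this study; the function names, sample sizes, and allele frequencies are hypothetical.

```python
import math


def fc_biallelic(x, y):
    """Standardized variance in allele frequency change (Fc) for one
    diallelic locus, where x and y are the frequencies of the same allele
    in the earlier and later temporal samples."""
    denom = (x + y) / 2.0 - x * y
    return (x - y) ** 2 / denom


def temporal_ne(freqs_0, freqs_t, s0, st, t):
    """Moment-based temporal estimate of Ne (illustrative sketch).

    freqs_0, freqs_t: per-locus allele frequencies at the two time points
    s0, st:           number of diploid individuals sampled at each time
    t:                number of generations between samples
    Returns math.inf when the observed change does not exceed what
    sampling error alone would produce.
    """
    fcs = []
    for x, y in zip(freqs_0, freqs_t):
        # Skip loci fixed for the same allele in both samples (Fc undefined).
        if (x + y) / 2.0 - x * y > 0:
            fcs.append(fc_biallelic(x, y))
    if not fcs:
        raise ValueError("no informative loci")
    f_mean = sum(fcs) / len(fcs)
    # Remove the expected contribution of sampling error at both time points.
    drift_signal = f_mean - 1.0 / (2 * s0) - 1.0 / (2 * st)
    if drift_signal <= 0:
        return math.inf
    return t / (2.0 * drift_signal)


# Hypothetical inputs: tiny allele frequency shifts give an infinite estimate,
# whereas a large shift gives a small finite estimate.
stable = temporal_ne([0.30, 0.50, 0.70], [0.31, 0.49, 0.71], s0=95, st=95, t=5)
drifting = temporal_ne([0.30], [0.50], s0=95, st=95, t=5)
print(stable, drifting)
```

In this sketch an infinite point estimate simply means that the observed allele frequency change is no larger than expected from sampling a finite number of fish, which is the situation one would expect for very large populations.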
Age composition varied significantly only between OLGA_LATE and the remaining collections (all exact tests, p < 0.0001); differences were especially obvious for age-3 and age-6 fish (Table 6), which appeared overrepresented and underrepresented in OLGA_LATE, respectively. Interestingly, generation time was roughly one year shorter for OLGA_LATE than for the rest of the paired collections (Table 6).

Discussion

Temporal vs. spatial divergence

The temporal stability of baseline allele frequencies is one of the underpinnings of GSI: even though allele frequency shifts between generations are expected due to genetic drift, these ought to be smaller than changes in allele frequencies in space for GSI to provide reliable estimates (Waples 1990; Beacham and Withler 2010). Our survey validated this assumption for nuclear SNPs over a period of 25 – 42 years (4.9 – 8.4 generations): absolute FST ratios between spatial and temporal AMOVA components suggested that generational divergence was on average nearly 40-fold smaller than run-timing divergence, and 250-fold smaller than lake divergence. Larger spatial variation than temporal variation in allele frequencies appears to be a trademark of baselines in Pacific salmonids (Beacham and Withler 2010), despite some exceptions (Heath et al. 2002; Walter et al. 2009). Recently, Walter et al. (2009) showed that temporal variation surpassed spatial variation for a group of Chinook salmon populations from the Upper Fraser River in British Columbia. This conclusion, however, has been questioned on technical and statistical grounds, sparking some debate given the importance of GSI for Chinook salmon management (Beacham and Withler 2010; Walter, Shrimpton, and Heath 2010).
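The comparison of spatial and temporal divergence can be sketched with a simple heterozygosity-based FST for two samples, FST = (HT − HS) / HT, summed over diallelic loci. This is only an illustration of the idea of a spatial-to-temporal FST ratio; it is not the hierarchical AMOVA used in this study, and the function name and all allele frequencies below are hypothetical.

```python
def fst_two_samples(p1, p2):
    """Heterozygosity-based FST between two equally weighted samples.

    p1, p2: per-locus frequencies of the same allele at diallelic loci.
    Uses the ratio of sums over loci: sum(HT - HS) / sum(HT).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Mean within-sample expected heterozygosity.
        hs = (2 * a * (1 - a) + 2 * b * (1 - b)) / 2.0
        # Expected heterozygosity of the pooled (average) allele frequency.
        p_bar = (a + b) / 2.0
        ht = 2 * p_bar * (1 - p_bar)
        numerator += ht - hs
        denominator += ht
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical frequencies: one collection sampled in two generations
# (temporal pair) and two collections from different lakes (spatial pair).
early_gen = [0.40, 0.70]
late_gen = [0.41, 0.69]
other_lake = [0.55, 0.50]

fst_temporal = fst_two_samples(early_gen, late_gen)
fst_spatial = fst_two_samples(early_gen, other_lake)
print(fst_temporal, fst_spatial, fst_spatial / fst_temporal)
```

A ratio well above 1 in such a comparison corresponds to the situation GSI relies on, in which allele frequencies differ more among baseline populations than among generations within a population.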
DS estimates from mitochondrial SNP haplotypes, on the other hand, showed overlap between lake, run-timing, and generational comparisons, and ΦST ratios from the AMOVA were close to 1. This suggests fundamental differences between the two SNP types; in particular, mitochondrial SNPs offered limited GSI resolution within the scope of this survey. However, a more thorough assessment is needed to validate this conclusion, given differences between the spatial scale of this study and that of GSI applications, which may include the entire north Pacific Ocean (Habicht et al. 2010). Furthermore, mitochondrial SNPs may be useful for identifying unique (maternal) lineages overlooked by nuclear SNPs, which frequently experience recombination (see section Divergent colonization and life histories in sockeye salmon, below).

The implementation of population-based statistics required the assumption that each paired collection represented a separate population; this is difficult to verify, because sampling at weirs occurred before the fish reached their spawning sites. Nevertheless, several statistics suggest that each paired collection may be a cohesive reproductive unit and thus be considered a ‘population’ from an evolutionary perspective (Waples and Gaggiotti 2006), even if each is composed of fish that spawn at multiple sites (e.g., Bear Lake: Boatright, Quinn, and Hilborn 2004). First, negative or near-zero (and generally nonsignificant) pairwise FST values for generational comparisons suggest that, most likely, the same spawning aggregations were sampled in two different generations (intrapopulation FST). Second, neither consistent deviations from HWE nor gametic disequilibrium were evident across collections, suggestive of negligible population admixture. The only exception was OLGA_EARLY_g0, for which we found evidence for gametic disequilibrium but no deviations from HWE.
However, the intrapopulation FST between the oldest and most recent collections (OLGA_EARLY_g0 vs. OLGA_EARLY_g6.9) was not different from 0, implying that admixture, if any, makes a minor contribution to differentiation.

Was generational divergence consistent with a model of pure genetic drift?

Generational divergence among outlier SNPs. First, only one candidate SNP showed significant generational divergence (in OLGA_LATE); most of the other well-annotated outliers linked to potentially adaptive polymorphisms were not more likely than neutral SNPs to show significant temporal variation. This strongly suggests that, if outlier SNPs are indeed under selection, or if they are linked to functional genes experiencing selection, the environmental regimes influencing spatial divergence are different from those influencing temporal variation, with one possible exception in One_HpaI-99. This polymorphism is located within a family of short interspersed repetitive elements, which have been suggested to play a role in salmonid speciation (Kido et al. 1991), although no studies have addressed its potential role in adaptive divergence within species. Jump et al. (2006) determined that one strong outlier locus in European beech (Fagus sylvatica) populations experienced a dramatic decrease in the frequency of one allele over 50 years (from 0.8 to 0.5), possibly from an increase of 2 °C in mean annual temperature. Gomez-Uchida et al. (2011) also found One_HpaI-99 to be a candidate for selection in sockeye salmon populations from the Kvichak River drainage in southwest Alaska; this warrants additional scrutiny of this locus, perhaps including additional temporal samples and relating allele frequencies to candidate environmental variable(s).
Second, only two SNPs showing significant generational divergence appeared twice in three paired collections (BEAR_LATE, OLGA_LATE, and RED_LATE), which can be considered independent as there is limited gene flow between lake systems. Therefore, temporal divergence in allele frequencies appears to occur randomly among SNPs.

Expectations of generational divergence under pure genetic drift. Waples (1989) demonstrated that the probability of rejecting the null hypothesis of no temporal variation in allele frequencies is not equal to the nominal value in a classical contingency test (e.g., Pearson χ²: p = 0.05), but depends on the ratio of sample size to Ne. Using this theoretical framework, we generally found no evidence to reject this null hypothesis, and our range of po estimates was consistent with findings for other wild, but not hatchery, salmonid populations (Waples and Teel 1990). The only exception was the paired collection from BEAR_EARLY, for which we expected 5.0% of tests to be significant, but observed 5.8%. We speculate this may be the result of the test’s dependence on a finite estimate of Ne. For all EARLY collections, we used estimates of N instead of Ne as the latter were infinitely large; however, the use of smaller yet plausible values for Ne is likely to increase the expected number of significant tests according to equation (1) and simulations presented by Waples (1989). For example, if we use Ne = 1247 (the lower 95% CI for BEAR_EARLY), then pe = 7.2%, which provides an upper (though conservative) threshold for the expected number of significant tests.
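The dependence of the rejection rate on the ratio of sample size to Ne can be sketched as follows. Under drift plus sampling, the variance of the allele frequency difference between two temporal samples is approximately P(1 − P)[1/(2S0) + 1/(2St) + t/(2Ne)], whereas a standard contingency χ² test accounts only for the two sampling terms; the test statistic is therefore inflated by the ratio of these variances. This is a simplified sketch under that approximation (one diallelic locus, 1 degree of freedom), not necessarily the exact form of equation (1) referred to above; the function names, sample sizes, and number of generations are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def expected_rejection_rate(s0, st, t, ne, critical=3.841459):
    """Approximate probability that a standard 1-df chi-square test rejects
    temporal homogeneity when allele frequencies change by drift alone.

    s0, st:   diploid sample sizes at the two time points (hypothetical inputs)
    t:        generations between samples
    ne:       effective population size (math.inf means no drift)
    critical: chi-square critical value for a nominal 5% test with 1 df
    """
    sampling_var = 1.0 / (2 * s0) + 1.0 / (2 * st)
    drift_var = t / (2.0 * ne)
    # Factor by which drift inflates the standard test statistic.
    inflation = 1.0 + drift_var / sampling_var
    return chi2_sf_1df(critical / inflation)


# With an effectively infinite Ne the rate stays at the nominal 5%;
# it rises as Ne shrinks relative to sample size.
for ne in (math.inf, 200000, 1247, 500):
    rate = expected_rejection_rate(s0=95, st=95, t=5, ne=ne)
    print(ne, round(100 * rate, 1))
```

The sketch makes the qualitative point of the paragraph explicit: substituting a very large N for Ne leaves the expected proportion of significant tests near 5%, while any smaller, finite Ne raises it.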
Overall, the main implications of these two findings for GSI are encouraging: first, SNPs with the ability to more accurately discriminate between populations, and possibly candidates for selection, are not more prone to experience significant temporal shifts in allele frequencies than neutral SNPs; and second, generational divergence was consistent with a model of pure genetic drift and has occurred randomly among SNPs across independent collections, where selection seems to play a negligible role. We hypothesize that these findings stem from the presence of large, wild populations of sockeye salmon, located in a largely pristine environment with no hatchery influence (Schindler et al. 2010).

Do divergent colonization and life histories help explain spatial divergence and Ne?

Spatial divergence. Hierarchical lake and run-timing divergence can be explained by a combination of large-scale historical events, such as postglacial colonization following retreat of the Cordilleran ice sheet less than 14,700 years BP (Mann and Peteet 1994), and small-scale local adaptation, such as life history variation as an evolutionary response to ecologically divergent spawning habitats (Gomez-Uchida et al. 2011). Regarding large-scale differences between lake systems, principal coordinate analyses based on nuclear SNPs revealed that all three lake systems were reciprocally different, while principal coordinate analyses based on mitochondrial SNPs revealed that Red Lake shared more haplotypes with Bear Lake than with South Olga Lakes. Results from both marker types were intriguing given the geographic proximity between the two lake systems on Kodiak Island, strongly suggesting that sockeye salmon from South Olga Lakes are derived from a unique lineage. This hypothesis finds support in genetic and demographic attributes of this lake system.
First, South Olga Lakes collections generally contained the highest genetic diversity at both nuclear and mitochondrial SNPs (Table 1), which is typical of the ‘sea’ sockeye salmon, an ancestral form that spends a shorter period rearing in fluvial or estuarine habitat than the more derived ‘lake’ form (Wood 1995; Beacham, McIntosh, and MacConnachie 2004; Wood et al. 2008). Second, the OLGA_LATE paired collection had a different age structure in comparison with the others. For instance, the age-4 cohort had an important contribution from fish that spend only months in freshwater and three years at sea; similarly, the age-3 cohort is dominated by fish that spend less than a year in freshwater and two years at sea (M. B. Foster, unpublished). This cohort possibly explains the shorter generation length in OLGA_LATE than in other collections. But why did OLGA_EARLY have a different age structure? Age composition analyses additionally suggested that juveniles from OLGA_EARLY spend one or two years rearing in freshwater, a typical strategy of the ‘lake’ ecotype and the majority of populations of this study. We hypothesize that OLGA_EARLY was founded from OLGA_LATE colonizers, and thus retained some ancestral haplotypes, despite the loss of demographic attributes of ‘sea’ fish that possibly followed the contemporary adaptation to rearing in limnetic environments.

Bear Lake and Red Lake were more divergent at nuclear SNPs than at mitochondrial SNPs; indeed, these two lake systems seem to share an abundant haplotype (i.e., CGG), possibly characteristic of the ‘lake’ ecotype as pointed out by principal coordinate analyses from mitochondrial SNPs. Although the ‘sea’ ecotype has life-history advantages over the ‘lake’ ecotype for successful recolonization where sockeye salmon populations have been extirpated (Wood et al.
2008), the possibility that the ‘lake’ ecotype can survive in glacial refugia and colonize new environments has been recently raised (Pavey, Hamon, and Nielsen 2007), and our results are in line with this possibility. Principal coordinate analyses from nuclear SNPs, on the other hand, suggested there is very little gene flow between Bear Lake and Red Lake populations, potentially resulting from geographic isolation, demographic processes, or both. For example, estimates of nuclear genetic diversity indicated that Bear Lake was the most depauperate of all three lakes (Table 1), a finding potentially associated with population bottlenecks (e.g., Gomez-Uchida et al. 2011). Nuclear markers in general may also reveal further historical contingencies overlooked by mitochondrial markers, especially in geologically young populations (Fraser and Bernatchez 2005; Gomez-Uchida et al. 2008). Indeed, it has been proposed that a large area of Kodiak Island, which coincidentally matches the location of the Ayakulik River that drains Red Lake, remained ice-free during the last glacial maximum, based on radiometric and stratigraphic evidence (Mann and Peteet 1994). Red Lake populations may thus be descendants from the so-called Kodiak Island Refugium (Karlstrom 1969), whereas Bear Lake populations likely originated from the Beringian Refugium in the northern Alaska Peninsula (Wood 1995). A broader survey of lake systems composed of ‘lake’ and ‘sea’ ecotypes should corroborate many of the hypotheses presented in the preceding sections.

Within lake systems, substantial nuclear divergence between populations with variable run timing has been previously described in sockeye salmon (Burger et al. 2000; Seeb et al. 2000; Ramstad, Foote, and Olsen 2003).
EARLY and LATE seasonal components are often linked to specific spawning habitats: EARLY fish reproduce among inlet tributaries or streams, whereas LATE fish reproduce among river outlets and beaches (Schindler et al. 2010). The basis for this variation includes life history trade-offs related to egg size, age composition, body size, and body depth, which have evolved presumably as a result of natural and sexual selection (Quinn, Hendry, and Wetzel 1995; Quinn, Hendry, and Buck 2001).

Run-timing divergence can be important in a variety of applied contexts. For management strategies that depend on temporally explicit (seasonal) sampling regimes, the natural division into EARLY and LATE components with varying productivity, for which different escapement goals are normally set, has a clear genetic basis. In particular, significant differentiation between RED_EARLY and RED_LATE argues for separate management, a measure in existence prior to 1989 that ADF&G reinstituted during 2011 (M. B. Foster, pers. comm. 2010). Overall, the sustainability of the sockeye salmon fishery relies on these lower hierarchical levels of population diversity (Schindler et al. 2010), some of which can have dramatically different population and recruitment dynamics (e.g., ‘sea’ vs. ‘lake’ ecotypes: Wood et al. 2008).

Variance Ne. The magnitude of genetic drift varied among paired collections. In particular, we found that OLGA_LATE had the highest number of both observed and expected significant tests under pure genetic drift, which likely resulted from a relatively small (and finite) estimate of Ne, and generational divergence was significant over loci after a false discovery rate correction for multiple tests. Even though the composite sample size for the most recent collection (OLGA_LATE_g6.5) was the smallest (n = 71), we argue that sampling error is unlikely to yield that many false positives.
Computer simulations under several conditions, such as variable number of loci (0 – 40 SNPs) and unequal sample sizes, suggested that the type I error of the Pearson χ² test usually remained under p = 0.06 for diallelic SNPs with uniform or even skewed allele frequencies (Ryman et al. 2006). This reinforces our hypothesis that a small contemporary Ne, on the order of a few hundred individuals, is reflected in increased genetic drift and instability in OLGA_LATE.

It is unclear what demographic or life-history attributes, or both, explain this finding. Potential bottlenecks that occurred in either the distant or recent past are inconsistent with high estimates of genetic diversity and N, respectively. We speculate that the younger age composition and shorter generation time of OLGA_LATE compared with the rest of the paired collections represent pieces of this puzzle. Waples, Jensen, and McClure (2010) found an important link between Ne, though calculated from demographic data, and the coefficient of variation of population growth in Chinook salmon: annual variation in population growth accounted for a reduction of 35% in Ne. Findings of reduced Ne in OLGA_LATE may reflect eco-evolutionary trade-offs, and we believe this deserves further scrutiny.

Although the concept of Ne was developed from a very simple idea, its empirical estimation becomes complicated owing to multiple spatial and temporal stratifications, which create different classes of individuals within a population (Waples 2010). In our study, two such stratifications include age structure and immigration from nearby populations. First, we assumed no age structure (i.e., discrete generations), although Pacific salmon populations are often composed of multiple cohorts with variable age at maturity.
Nevertheless, the temporal approach may still be valid considering samples were spaced 5 or more generations apart, which should eliminate the bias introduced by unequal cohort contribution (see Palstra, O'Connell, and Ruzzante 2009). Yet the interpretation of estimates of Ne must proceed with caution, because they represent a generational average and not the quantity fishery biologists are likely to be most interested in: the annual number of breeders (Waples 1990). Second, we assumed no gene flow, although dispersal between EARLY and LATE populations may be possible, which could potentially bias Ne estimates downward or upward, depending on whether gene flow is sporadic or frequent, respectively (Wang and Whitlock 2003). Exploratory analyses using MLNE (Wang and Whitlock 2003), which simultaneously estimates temporal Ne and migration rate (m), yielded a similar point estimate for OLGA_LATE (NeMLNE = 539) using OLGA_EARLY as potential source of immigrants (m = 0.001), but a rather imprecise 95% CI (< 1 – Inf). Although sporadic gene flow from OLGA_EARLY may indeed bias Ne for OLGA_LATE downward, our general conclusion regarding this collection is unlikely to change after accounting for immigration (e.g., Waples 2010).

Conclusions

How stable were allele frequencies among Alaskan sockeye salmon populations over multiple generations? First, our survey suggests that temporal changes were between 3- and 20-fold smaller than spatial changes in allele frequencies when based on nuclear SNPs; differences were nonetheless much less marked for mitochondrial SNPs.
Second, the magnitude of generational divergence was generally consistent with a model of pure genetic drift: (i) large-FST or candidate SNPs for diversifying selection were not more likely to show significant temporal changes than small-FST SNPs and (ii) the observed number of significant tests fell within estimates predicted by a theoretical model relating sample size and Ne. Third, estimates of Ne and upper 95% CI using the temporal method were infinitely large, with the exception of OLGA_LATE. Demographic attributes of this collection suggest it corresponds to the ‘sea’ ecotype. A possible explanation for a reduced Ne could be found in a shorter generation length than in the rest of the collections, although eco-evolutionary links are hitherto unclear. Overall, our findings for sockeye salmon support the use of SNP baselines spanning several decades, which may be pertinent for management of other large, wild stocks, but perhaps less so for stocks with hatchery influence that are much more temporally unstable (Waples and Teel 1990). With the increasing use of high-resolution SNPs, we welcome parallel studies that contemplate multigenerational analyses of SNP variation to judge the breadth of our conclusions.

Literature cited

Ackerman, M. W., C. Habicht, and L. W. Seeb. 2011. Single-Nucleotide Polymorphisms (SNPs) under Diversifying Selection Provide Increased Accuracy and Precision in Mixed-Stock Analyses of Sockeye Salmon from the Copper River, Alaska. Transactions of the American Fisheries Society 140 (3):865-881.
Banks, M. A., W. Eichert, and J. B. Olsen. 2003. Which genetic loci have greater population assignment power? Bioinformatics 19 (11):1436-1438.
Beacham, T. D., M. Lapointe, J. R. Candy, B. McIntosh, C. MacConnachie, A. Tabata, K. Kaukinen, L. T. Deng, K. M. Miller, and R.
E. Withler. 2004. Stock identification of Fraser River sockeye salmon using microsatellites and major histocompatibility complex variation. Transactions of the American Fisheries Society 133 (5):1117-1137.
Beacham, T. D., B. McIntosh, and C. MacConnachie. 2004. Population structure of lake-type and river-type sockeye salmon in transboundary rivers of northern British Columbia. Journal of Fish Biology 65 (2):389-402.
Beacham, T. D., and R. E. Withler. 2010. Comment on "Gene flow increases temporal stability of Chinook salmon (Oncorhynchus tshawytscha) populations in the Upper Fraser River, British Columbia, Canada". Canadian Journal of Fisheries and Aquatic Sciences 67 (1):202-205.
Benjamini, Y., and D. Yekutieli. 2001. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29 (4):1165-1188.
Boatright, C., T. Quinn, and R. Hilborn. 2004. Timing of adult migration and stock structure for sockeye salmon in Bear Lake, Alaska. Transactions of the American Fisheries Society 133 (4):911-921.
Bourgeois, L., W. S. Sheppard, H. A. Sylvester, and T. E. Rinderer. 2010. Genetic Stock Identification of Russian Honey Bees. Journal of Economic Entomology 103 (3):917-924.
Bradbury, I. R., S. Hubert, B. Higgins, T. Borza, S. Bowman, I. G. Paterson, P. V. R. Snelgrove, C. J. Morris, R. S. Gregory, D. C. Hardie, J. A. Hutchings, D. E. Ruzzante, C. T. Taggart, and P. Bentzen. 2010. Parallel adaptive evolution of Atlantic cod on both sides of the Atlantic Ocean in response to temperature. Proceedings of the Royal Society B-Biological Sciences 277 (1701):3725-3734.
Browne, D. C., J. A. Horrocks, and F. A. Abreu-Grobois. 2010. Population subdivision in hawksbill turtles nesting on Barbados, West Indies, determined from mitochondrial DNA control region sequences. Conservation Genetics 11 (4):1541-1546.
Burger, C. V., K. T. Scribner, W. J. Spearman, C. O. Swanton, and D. E. Campton. 2000.
Genetic contribution of three introduced life history forms of sockeye salmon to colonization of Frazer Lake, Alaska. Canadian Journal of Fisheries and Aquatic Sciences 57 (10):2096-2111.
Creelman, E. K., L. Hauser, R. K. Simmons, W. D. Templin, and L. W. Seeb. 2011. Temporal and Geographic Genetic Divergence: Characterizing Sockeye Salmon Populations in the Chignik Watershed, Alaska, Using Single-Nucleotide Polymorphisms. Transactions of the American Fisheries Society 140 (3):749-762.
Elfstrom, C. M., C. T. Smith, and J. E. Seeb. 2006. Thirty-two single nucleotide polymorphism markers for high-throughput genotyping of sockeye salmon. Molecular Ecology Notes 6 (4):1255-1259.
Excoffier, L., G. Laval, and S. Schneider. 2005. Arlequin (version 3.0): An integrated software package for population genetics data analysis. Evolutionary Bioinformatics 1:47-50.
Felsenstein, J. 1971. Inbreeding and variance effective numbers in populations with overlapping generations. Genetics 68 (4):581-&.
Ford, M. J. 2001. Molecular evolution of transferrin: Evidence for positive selection in salmonids. Molecular Biology and Evolution 18 (4):639-647.
Fraser, D. J., and L. Bernatchez. 2005. Allopatric origins of sympatric brook charr populations: colonization history and admixture. Molecular Ecology 14 (5):1497-1509.
Garant, D., J. J. Dodson, and L. Bernatchez. 2000. Ecological determinants and temporal stability of the within-river population structure in Atlantic salmon (Salmo salar L.).
Molecular Ecology 9 (5):615-628.
Gomez-Uchida, D., K. P. Dunphy, M. F. O'Connell, and D. E. Ruzzante. 2008. Genetic divergence between sympatric Arctic charr Salvelinus alpinus morphs in Gander Lake, Newfoundland: roles of migration, mutation and unequal effective population sizes. Journal of Fish Biology 73 (8):2040-2057.
Gomez-Uchida, D., J. E. Seeb, M. J. Smith, C. Habicht, T. P. Quinn, and L. W. Seeb. 2011. Single nucleotide polymorphisms unravel hierarchical divergence and signatures of selection among Alaskan sockeye salmon (Oncorhynchus nerka) populations. BMC Evolutionary Biology 11 (48):48.
Goudet, J. 1995. FSTAT (Version 1.2): A computer program to calculate F-statistics. Journal of Heredity 86 (6):485-486.
Goudet, J. 2001. FSTAT, a program to estimate and test gene diversities and fixation indices. Ver. 2.9.3. Available from: http://www2.unil.ch/popgen/softwares/fstat.htm.
Habicht, C., L. W. Seeb, K. W. Myers, E. V. Farley, and J. E. Seeb. 2010. Summer-Fall Distribution of Stocks of Immature Sockeye Salmon in the Bering Sea as Revealed by Single-Nucleotide Polymorphisms. Transactions of the American Fisheries Society 139 (4):1171-1191.
Heath, D. D., C. Busch, J. Kelly, and D. Y. Atagi. 2002. Temporal change in genetic structure and effective population size in steelhead trout (Oncorhynchus mykiss). Molecular Ecology 11 (2):197-214.
Hess, J. E., A. P. Matala, and S. R. Narum. 2011. Comparison of SNPs and microsatellites for fine-scale application of genetic stock identification of Chinook salmon in the Columbia River Basin. Molecular Ecology Resources 11:137-149.
Hilborn, R., T. P. Quinn, D. E. Schindler, and D. E. Rogers. 2003. Biocomplexity and fisheries sustainability. Proceedings of the National Academy of Sciences of the United States of America 100 (11):6564-6568.
Jorde, P. E., and N. Ryman. 2007. Unbiased estimator for genetic drift and effective population size. Genetics 177 (2):927-935.
Jump, A. S., J. M. Hunt, J. A. Martinez-Izquierdo, and J. Penuelas.
2006. Natural selection and climate change: temperature-linked spatial and temporal trends in gene frequency in Fagus sylvatica. Molecular Ecology 15 (11):3469-3480.
Karlstrom, T. N. V. 1969. Regional setting and geology. In The Kodiak Island Refugium, edited by T. N. V. Karlstrom and G. E. Ball: The Boreal Institute of North America, University of Alberta, Ryerson Press.
Kido, Y., M. Aono, T. Yamaki, K. Matsumoto, S. Murata, M. Saneyoshi, and N. Okada. 1991. Shaping and reshaping of salmonid genomes by amplification of transfer RNA-derived retroposons during evolution. Proceedings of the National Academy of Sciences of the United States of America 88 (6):2326-2330.
Leong, J. S., S. G. Jantzen, K. R. von Schalburg, G. A. Cooper, A. M. Messmer, N. Y. Liao, S. Munro, R. Moore, R. A. Holt, S. J. M. Jones, W. S. Davidson, and B. F. Koop. 2010. Salmo salar and Esox lucius full-length cDNA sequences reveal changes in evolutionary pressures on a post-tetraploidization genome. BMC Genomics 11.
Manel, S., O. E. Gaggiotti, and R. S. Waples. 2005. Assignment methods: matching biological questions with appropriate techniques. Trends in Ecology & Evolution 20 (3):136-142.
Mann, D. H., and D. M. Peteet. 1994. Extent and timing of the last glacial maximum in southwestern Alaska. Quaternary Research 42 (2):136-148.
Moran, M. D. 2003. Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos 100 (2):403-405.
Mylecraine, K. A., H. L. Gibbs, C. S. Anderson, and M. C. Shieldcastle. 2008. Using 2 genetic markers to discriminate among Canada goose populations in Ohio. Journal of Wildlife Management 72 (5):1220-1230.
Nei, M. 1972. Genetic distance between populations. American Naturalist 106 (949):283-&.
Ostergaard, S., M. M. Hansen, V. Loeschcke, and E. E. Nielsen. 2003.
Long-term temporal changes of genetic composition in brown trout (Salmo trutta L.) populations inhabiting an unstable environment. Molecular Ecology 12 (11):3123-3135.
Palstra, F. P., M. F. O'Connell, and D. E. Ruzzante. 2009. Age Structure, Changing Demography and Effective Population Size in Atlantic Salmon (Salmo salar). Genetics 182 (4):1233-1249.
Palstra, F. P., and D. E. Ruzzante. 2008. Genetic estimates of contemporary effective population size: what can they tell us about the importance of genetic stochasticity for wild population persistence? Molecular Ecology 17 (15):3428-3447.
Pavey, S. A., T. R. Hamon, and J. L. Nielsen. 2007. Revisiting evolutionary dead ends in sockeye salmon (Oncorhynchus nerka) life history. Canadian Journal of Fisheries and Aquatic Sciences 64 (9):1199-1208.
Peakall, R., and P. E. Smouse. 2006. GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes 6 (1):288-295.
Pella, J. J., and G. B. Milner. 1987. Use of genetic marks in stock composition analysis. In Population Genetics & Fishery Management, edited by N. Ryman and F. M. Utter. Seattle: University of Washington Press.
Potvin, C., and L. Bernatchez. 2001. Lacustrine spatial distribution of landlocked Atlantic salmon populations assessed across generations by multilocus individual assignment and mixed-stock analyses. Molecular Ecology 10 (10):2375-2388.
Quinn, T. P., A. P. Hendry, and G. B. Buck. 2001. Balancing natural and sexual selection in sockeye salmon: interactions between body size, reproductive opportunity and vulnerability to predation by bears. Evolutionary Ecology Research 3 (8):917-937.
Quinn, T. P., A. P. Hendry, and L. A. Wetzel. 1995. The influence of life history trade-offs and the size of incubation gravels on egg size variation in sockeye salmon (Oncorhynchus nerka). Oikos 74 (3):425-438.
Quinn, T. P. 2005. The Behavior and Ecology of Pacific Salmon and Trout. Seattle: University of Washington Press.
R Development Core Team. 2010. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Ramstad, K. M., C. J. Foote, and J. B. Olsen. 2003. Genetic and phenotypic evidence of reproductive isolation between seasonal runs of Sockeye salmon in Bear Lake, Alaska. Transactions of the American Fisheries Society 132 (5):997-1013.
Rousset, F. 2008. GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources 8 (1):103-106.
Ryman, N. 2006. CHIFISH: a computer program testing for genetic heterogeneity at multiple loci using chi-square and Fisher's exact test. Molecular Ecology Notes 6 (1):285-287.
Ryman, N., S. Palm, C. Andre, G. R. Carvalho, T. G. Dahlgren, P. E. Jorde, L. Laikre, L. C. Larsson, A. Palme, and D. E. Ruzzante. 2006. Power for detecting genetic divergence: differences between statistical methods and marker loci. Molecular Ecology 15 (8):2031-2045.
Schindler, D. E., R. Hilborn, B. Chasco, C. P. Boatright, T. P. Quinn, L. A. Rogers, and M. S. Webster. 2010. Population diversity and the portfolio effect in an exploited species. Nature 465 (7298):609-612.
Seeb, J. E., C. E. Pascal, R. Ramakrishnan, and L. W. Seeb. 2009. SNP Genotyping by the 5'-Nuclease Reaction: Advances in High-Throughput Genotyping with Nonmodel Organisms. In Single Nucleotide Polymorphisms, Methods in Molecular Biology, edited by A. A. Komar.
Seeb, L. W., C. Habicht, W. D. Templin, K. E. Tarbox, R. Z. Davis, L. K. Brannian, and J. E. Seeb. 2000. Genetic diversity of sockeye salmon of Cook Inlet, Alaska, and its application to management of populations affected by the Exxon Valdez oil spill. Transactions of the American Fisheries Society 129 (6):1223-1249.
Seeb, L. W., W. D. Templin, S. Sato, S. Abe, K. Warheit, J. Y. Park, and J. E. Seeb. 2011. Single nucleotide polymorphisms across a species' range: implications for conservation studies of Pacific salmon. Molecular Ecology Resources 11:195-217.
Smith, C. T., C. M. Elfstrom, L. W. Seeb, and J. E. Seeb. 2005. Use of sequence data from rainbow trout and Atlantic salmon for SNP detection in Pacific salmon. Molecular Ecology 14 (13):4193-4203.
Smith, M. J., C. E. Pascal, Z. Grauvogel, C. Habicht, J. E. Seeb, and L. W. Seeb. 2011. Multiplex preamplification PCR and microsatellite validation enables accurate single nucleotide polymorphism genotyping of historical fish scales. Molecular Ecology Resources 11:268-277.
Storz, J. F. 2005. Using genome scans of DNA polymorphism to infer adaptive population divergence. Molecular Ecology 14 (3):671-688.
Templin, W. D., J. E. Seeb, J. R. Jasper, A. W. Barclay, and L. W. Seeb. 2011. Genetic differentiation of Alaska Chinook salmon: the missing link for migratory studies. Molecular Ecology Resources 11:226-246.
Therkildsen, N. O., E. E. Nielsen, D. P. Swain, and J. S. Pedersen. 2010. Large effective population size and temporal genetic stability in Atlantic cod (Gadus morhua) in the southern Gulf of St. Lawrence. Canadian Journal of Fisheries and Aquatic Sciences 67 (10):1585-1595.
Utter, F., and N. Ryman. 1993. Genetic markers and mixed stock fisheries. Fisheries 18 (8):11-21.
Vaha, J. P., J. Erkinaro, E. Niemela, and C. R. Primmer. 2008. Temporally stable genetic structure and low migration in an Atlantic salmon population complex: implications for conservation and management. Evolutionary Applications 1 (1):137-154.
Walter, R. P., T. Aykanat, D. W. Kelly, J. M. Shrimpton, and D. D. Heath. 2009.
Gene flow increases temporal stability of Chinook salmon (Oncorhynchus tshawytscha) populations in the Upper Fraser River, British Columbia, Canada. Canadian Journal of Fisheries and Aquatic Sciences 66 (2):167-176.
Walter, R. P., J. M. Shrimpton, and D. D. Heath. 2010. Reply to the comment by Beacham and Withler on "Gene flow increases temporal stability of Chinook salmon (Oncorhynchus tshawytscha) populations in the Upper Fraser River, British Columbia, Canada". Canadian Journal of Fisheries and Aquatic Sciences 67 (1):206-208.
Wang, J. L. 2005. Estimation of effective population sizes from data on genetic markers. Philosophical Transactions of the Royal Society B-Biological Sciences 360 (1459):1395-1409.
Wang, J. L., and M. C. Whitlock. 2003. Estimating effective population size and migration rates from genetic samples over space and time. Genetics 163 (1):429-446.
Waples, R. S. 1989. Temporal variation in allele frequencies: testing the right hypothesis. Evolution 43 (6):1236-1251.
Waples, R. S. 1990. Temporal changes of allele frequency in Pacific salmon - implications for mixed-stock fishery analysis. Canadian Journal of Fisheries and Aquatic Sciences 47 (5):968-976.
Waples, R. S. 2010. Spatial-temporal stratifications in natural populations and how they affect understanding and estimation of effective population size. Molecular Ecology Resources 10 (5):785-796.
Waples, R. S., and O. Gaggiotti. 2006. What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity. Molecular Ecology 15 (6):1419-1439.
Waples, R. S., D. W. Jensen, and M. McClure. 2010. Eco-evolutionary dynamics: fluctuations in population growth rate reduce effective population size in chinook salmon. Ecology 91 (3):902-914.
Waples, R. S., A. E. Punt, and J. M. Cope. 2008. Integrating genetic data into management of marine resources: how can we do it better? Fish and Fisheries 9 (4):423-449.
Waples, R. S., and D. J. Teel. 1990.
Conservation Genetics of Pacific salmon. I. Temporal Changes in Allele Frequencies. Conservation Biology 4 (2):144-156.
Weir, B. S., and C. C. Cockerham. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38 (6):1358-1370.
Wood, C. C., J. W. Bickham, R. J. Nelson, C. J. Foote, and J. C. Patton. 2008. Recurrent evolution of life history ecotypes in sockeye salmon: implications for conservation and future evolution. Evolutionary Applications 1 (2):207-221.
Wood, C. C. 1995. Life history variation and population structure in sockeye salmon. American Fisheries Society Symposium 17:195-216.

Figure legends

Figure 1. Study sites in (a) Alaska Peninsula and (b) Kodiak Island: 1) Bear River; 2) Bear Lake; 3) Ayakulik River; 4) Red Lake; 5) lower South Olga Lake and 6) upper South Olga Lake. Black bars represent weirs operated by the Alaska Department of Fish and Game.

Figure 2. Daily average escapement of sockeye salmon by date between 1966 and 2009 (black solid line, left y-axis) and number of sampled individuals within specific years (colored bars, right y-axis) from three Alaskan lake systems: (a) Bear Lake, (b) Red Lake, and (c) South Olga Lakes. Vertical dashed lines are cut-off dates used by the Alaska Department of Fish and Game to separate EARLY and LATE run timing (Red Lake and South Olga Lakes: 15 July; Bear Lake: 31 July).

Figure 3. Frequency of five mitochondrial SNP haplotypes (see legend) in all 12 sockeye salmon collections (see Table 1 for nomenclature of collections).

Figure 4. Dotcharts of pairwise FST from 85 nuclear SNPs: (a) number of nuclear SNPs showing significance (p < 0.05) and (b) pairwise Nei's DS from mitochondrial SNP haplotypes (c) plotted for different hierarchical groups: WLWRBG = within lakes, within runs, between generations; WLBRWG = within lakes, between runs, within generations; BLWG = between lakes, within generations.
Dashed line: p = 0.05 (uncorrected); solid line: padj (multiple-test correction assuming 10% false discovery rate).

Figure 5. Principal coordinate analysis using the covariance matrix of pairwise FST (a) and pairwise Nei's DS (b) projected in a two-dimensional space. Symbols represent different lake systems (circles = Bear Lake; squares = Red Lake; triangles = South Olga Lakes); empty symbols represent EARLY run timing, whereas filled symbols represent LATE run timing. Labels represent different generations at which each collection was sampled (see Table 1 for nomenclature).

Figure 6. Scatterplot of FST as a function of genetic diversity (HO/(1 - FST)) to distinguish candidate SNPs for diversifying selection (in red, plus labels) from neutral SNPs (in black) using simulations. Dashed and solid lines are the upper limits of the 95th and 99th quantile distribution of neutral SNPs, respectively.

Tables

Table 1. Collection number, name, lake system, run timing, sampling years, DNA source material, and genetic statistics for sockeye salmon collected from three lakes in Alaska and used to examine temporal stability in allele frequencies. Genetic statistics include: composite sample size per collection (n); allelic richness (AR; number of alleles in 55 diploid individuals); correlation coefficient of inbreeding (FIS); mitochondrial haplotype diversity (h); and probability for deviations from Hardy-Weinberg equilibrium over all loci (HWE).
No.  Collection*        Lake system, Run timing     Sampling year(s)   DNA source   n    AR      FIS      h       HWE
1    BEAR_EARLY_g0      Bear Lake, EARLY            1975-1976          scale        95   1.907    0.002   0.369   0.164
2    BEAR_EARLY_g4.9    Bear Lake, EARLY            2000               scale        95   1.928    0.017   0.268   0.200
3    BEAR_LATE_g0       Bear Lake, LATE             1969               scale        95   1.918   -0.085   0.303   0.099
4    BEAR_LATE_g6.0     Bear Lake, LATE             2000               scale        95   1.911   -0.024   0.409   0.099
5    RED_EARLY_g0       Red Lake, EARLY             1967               scale        95   1.905   -0.007   0.404   0.210
6    RED_EARLY_g8.4     Red Lake, EARLY             2009               fin          95   1.921    0.003   0.478   0.156
7    RED_LATE_g0        Red Lake, LATE              1966-1968          scale        95   1.921    0.002   0.490   0.203
8    RED_LATE_g8.4      Red Lake, LATE              2008               fin          95   1.930   -0.003   0.480   0.200
9    OLGA_EARLY_g0      South Olga Lakes, EARLY     1967-1968          scale        95   1.942   -0.019   0.522   0.105
10   OLGA_EARLY_g6.9    South Olga Lakes, EARLY     2000               heart        95   1.948   -0.017   0.427   0.162
11   OLGA_LATE_g0       South Olga Lakes, LATE      1967-1968          scale        96   1.936    0.039   0.663   0.160
12   OLGA_LATE_g6.5     South Olga Lakes, LATE      1995               scale        71   1.939    0.024   0.663   0.154
*Nomenclature combines all three hierarchical sampling levels: lake system, run timing, and generation (g0 = generation 0; gt = generation t, with t = number of generations since g0).

Table 2. Pairwise FST (below diagonal; nuclear SNPs) and Nei's DS (above diagonal; mitochondrial SNPs) with associated multilocus significance from Pearson χ2 (tests using Fisher's method produced identical results).

      1        2        3        4        5        6        7        8        9        10       11       12
1     -        0.008    0.006    0.001    0.000    0.025    0.025    0.019    0.538*   0.403*   0.201*   0.192*
2     -0.001   -        0.000    0.015    0.010    0.062*   0.061*   0.051*   0.730*   0.561*   0.289*   0.269*
3     0.004†   0.006*   -        0.011    0.008    0.055*   0.054*   0.045†   0.699*   0.535*   0.274*   0.256*
4     0.006*   0.009*   0.001    -        0.001    0.016    0.016    0.011    0.480*   0.355*   0.174*   0.169*
5     0.082*   0.078*   0.084*   0.087*   -        0.022    0.022    0.017    0.520*   0.388*   0.184*   0.174*
6     0.084*   0.081*   0.085*   0.088*   -0.002   -        0.000    0.001    0.297*   0.203*   0.091*   0.097*
7     0.072*   0.071*   0.071*   0.072*   0.011*   0.012*   -        0.001    0.299*   0.205*   0.091*   0.097*
8     0.074*   0.072*   0.071*   0.074*   0.007*   0.008*   0.001    -        0.326*   0.227*   0.106*   0.112*
9     0.079*   0.078*   0.088*   0.089*   0.092*   0.094*   0.068*   0.068*   -        0.025*   0.112*   0.181*
10    0.078*   0.075*   0.087*   0.087*   0.085*   0.086*   0.062*   0.062*   0.000    -        0.080*   0.127*
11    0.060*   0.058*   0.061*   0.062*   0.067*   0.069*   0.041*   0.043*   0.016*   0.015*   -        0.009
12    0.059*   0.055*   0.058*   0.061*   0.073*   0.074*   0.047*   0.049*   0.030*   0.030*   0.005*   -
†p < 0.01; *p < 0.001. Collection numbers follow those in Table 1.

Table 3. Analysis of molecular variance (AMOVA) at three hierarchical levels for sockeye salmon collections taken from two time periods (between 4.9 and 6.9 generations apart), from two run-timings, from three lakes in Alaska.

Source of variation                                            d.f.   SS       % Variance   FST/ΦST    p-value
Nuclear SNPs
  Between collections in different lakes                       2      1237.3   6.3          0.07401    < 0.001
  Between collections with different run-timing within lakes   3      173.8    1.1          0.01134    < 0.001
  Between generations within run-timing collections            6      62.7     -0.03        -0.00027   0.640
  Within collections                                           2222   24452    93.8
Mitochondrial SNPs (haplotypes)
  Between collections in different lakes                       2      13.1     3.8          0.03767    < 0.001
  Between collections with different run-timing within lakes   3      10.1     2.6          0.02551    < 0.001
  Between generations within run-timing collections            6      11.0     2.6          0.02631    < 0.001
  Within collections                                           2222   705.2
d.f. = degrees of freedom; SS = sum of squares.

Table 4. SNPs involved in generational divergence (Pearson χ2 test: p < 0.05) and observed vs. expected number of significant tests across six paired collections (three lake systems with two run times) of sockeye salmon from Alaska. Paired collection nomenclature is defined in Table 1. Noteworthy SNPs are in bold and footnoted.

Paired collection           Observed significant tests (po)   Expected significant tests (pe)§   Locus list (p-value)
BEAR_EARLY (g0 vs. g4.9)    5/85 = 5.9%                       5.0%                               One_apoe-83 (0.006); One_agt-132 (0.022); One_STR07 (0.039); One_ZNF-61 (0.020); One_ctgf-301 (0.048)
BEAR_LATE (g0 vs. g6.0)     5/85 = 5.9%                       6.2%                               One_Ots208-234 (0.006); One_U1105 (0.017); One_U1013-108 (0.033); One_Zp3b-49 (0.009); One_RAG3-93** (0.031)
RED_EARLY (g0 vs. g8.4)     1/85 = 1.2%                       5.0%                               One_pax7-248 (0.01574)
RED_LATE (g0 vs. g8.4)      3/85 = 3.5%                       7.9%                               One_rpo2j-261 (0.009); One_PIP (0.028); One_ghsR-66** (0.004)
OLGA_EARLY (g0 vs. g6.9)    2/85 = 2.4%                       5.1%                               One_gdh-212 (0.02224); One_IL8r-362 (0.02571)
OLGA_LATE (g0 vs. g6.5)     12/85 = 14.1%                     14.6%                              One_U1201-492 (0.016); One_U1101 (0.031); One_tshB-92 (0.00871); One_ghsR-66** (0.032); One_cin-177 (0.007); One_Mkpro-129 (0.033); One_hcs71-220 (0.036); One_MHC2_190 (0.018); One_RAG3-93** (0.019); One_MHC2_251 (0.009); One_HpaI-99† (0.04554); One_U503-170 (0.04794)
**Found in more than one comparison; †Candidate for diversifying selection from simulations in ARLEQUIN. §Using Ne: BEAR_LATE, RED_LATE, and OLGA_LATE; using N: BEAR_EARLY, RED_EARLY, and OLGA_EARLY.

Table 5. Run timing period; number of generations elapsed between samples (t); unbiased estimator of the temporal shift in allele frequencies (F'S; Jorde and Ryman 2007); estimates of variance effective population size (Ne, plus 95% CI); and estimate of census population size from historical escapement data (N; average 1966-2009) for six paired collections (three lake systems with two run times) of sockeye salmon from Alaska.

Paired collection   Run-timing period    t     F'S          Ne (95% CI)†       N
BEAR_EARLY          10-Jun to 31-Jul     4.9   -0.0015362   Inf (1247 - Inf)   ~ 200,000
BEAR_LATE           1-Aug to 15-Sep      6.0   0.0011198    2679 (613 - Inf)   ~ 150,000
RED_EARLY           29-May to 15-Jul     8.4   -0.0032631   Inf (Inf - Inf)    ~ 150,000
RED_LATE            16-Jul to 1-Sep      8.4   0.0025456    1650 (607 - Inf)   ~ 100,000
OLGA_EARLY          29-May to 15-Jul     6.9   -0.000141    Inf (1339 - Inf)   ~ 40,000
OLGA_LATE           16-Jul to 15-Sep     6.5   0.0099589    326 (198 - 907)    ~ 150,000
†Inf = infinitely large (negative) estimate.

Table 6. Average age composition (%; between 1985 and 2010) and estimated mean generation years (G) for six paired collections (three lake systems with two run times) of sockeye salmon from Alaska.

Age         BEAR_EARLY   BEAR_LATE   RED_EARLY   RED_LATE   OLGA_EARLY   OLGA_LATE
2           0.0          0.0         0.0         0.0        0.0          0.5
3           0.7          0.9         2.2         1.3        2.7          15.8
4           16.1         14.5        24.1        15.9       25.4         30.4
5           56.1         65.1        50.4        61.1       57.1         49.8
6           26.6         19.3        22.7        21.1       14.7         3.4
7           0.4          0.2         0.6         0.6        0.1          0.0
G (years)   5.1          5.0         5.0         5.0        4.8          4.3

[Figure 2: panels a, b and c; x-axis: date (29-May to 26-Sep); left y-axis: average escapement x 10^3; right y-axis: number of sampled individuals; bar legend by sampling year.]
[Figure 3: stacked haplotype frequency (0.0 to 1.0) for each of the 12 collections; haplotypes TGA, TAG, CGG, CGA and CAA.]
[Figure 4: panel a, pairwise FST by hierarchical group (WLWRBG, WLBRWG, BLWG); panel b, Nei's DS by hierarchical group.]
[Figure 5: panel a, principal coordinate 1 (47%) vs. principal coordinate 2 (38%); panel b, principal coordinate 1 (92%) vs. principal coordinate 2 (7%); points labelled by generation for each lake system and run timing.]
[Figure 6: FST plotted against HO/(1 - FST); labelled loci: One_Tf_in3-182, One_HpaI-99, One_Tf_ex11-750, One_U1003-75, One_U1209-111 and One_RFC2-285.]
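The mean generation years (G) in Table 6 appear consistent with an age-weighted mean of the age composition. The sketch below assumes that reading: G is taken as the sum of age times percentage, divided by the summed percentages. The dictionary and function names are hypothetical, and the manuscript does not state the exact formula. Under this assumption most tabulated G values are reproduced to one decimal; small differences (for example for OLGA_LATE) may reflect rounding of the published percentages or a different calculation by the authors.

```python
# Age classes (years) and average age composition (%) transcribed from Table 6.
AGES = [2, 3, 4, 5, 6, 7]
AGE_COMPOSITION = {
    "BEAR_EARLY": [0.0, 0.7, 16.1, 56.1, 26.6, 0.4],
    "BEAR_LATE": [0.0, 0.9, 14.5, 65.1, 19.3, 0.2],
    "RED_EARLY": [0.0, 2.2, 24.1, 50.4, 22.7, 0.6],
    "RED_LATE": [0.0, 1.3, 15.9, 61.1, 21.1, 0.6],
    "OLGA_EARLY": [0.0, 2.7, 25.4, 57.1, 14.7, 0.1],
    "OLGA_LATE": [0.5, 15.8, 30.4, 49.8, 3.4, 0.0],
}


def mean_generation_time(percentages, ages=AGES):
    """Age-weighted mean (assumed definition of G, not confirmed by the source).

    Dividing by the summed percentages keeps the result stable when the
    rounded percentages do not add up to exactly 100.
    """
    total = sum(percentages)
    weighted = sum(age * pct for age, pct in zip(ages, percentages))
    return weighted / total


if __name__ == "__main__":
    for name, pcts in AGE_COMPOSITION.items():
        print(f"{name}: G = {mean_generation_time(pcts):.2f} years")
```

Dividing the number of years between two samples by G would then give a rough count of elapsed generations, which is how a value such as t in Table 5 could be approximated under the same assumption.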