Sockeye SNP Short Read Sequences Paper

Molecular Ecology Resources (2011) 11 (Suppl. 1), 93–108
doi: 10.1111/j.1755-0998.2010.02969.x
SNP DISCOVERY: NEXT GENERATION SEQUENCING
Short reads and nonmodel species: exploring the complexities
of next-generation sequence assembly and SNP discovery in
the absence of a reference genome
M. V. EVERETT, E. D. GRAU and J. E. SEEB
School of Aquatic and Fishery Sciences, University of Washington, 1122 NE Boat Street Box 355020, Seattle, WA 98195-5020, USA
Abstract
How practical is gene and SNP discovery in a nonmodel species using short read sequences? Next-generation sequencing
technologies are being applied to an increasing number of species with no reference genome. For nonmodel species, the
cost, availability of existing genetic resources, genome complexity and the planned method of assembly must all be considered when selecting a sequencing platform. Our goal was to examine the feasibility and optimal methodology for SNP and
gene discovery in the sockeye salmon (Oncorhynchus nerka) using short read sequences. SOLiD short reads (up to 50 bp)
were generated from single- and pooled-tissue transcriptome libraries from ten sockeye salmon. The individuals were from
five distinct populations from the Wood River Lakes and Mendeltna Creek, Alaska. As no reference genome was available
for sockeye salmon, the SOLiD sequence reads were assembled to publicly available EST reference sequences from sockeye salmon and two closely related species, rainbow trout (Oncorhynchus mykiss) and Atlantic salmon (Salmo salar). Additionally, de novo assembly of the SOLiD data was carried out, and the SOLiD reads were remapped to the de novo contigs.
The results from each reference assembly were compared across all references. The number and size of contigs assembled
varied with the size of the reference sequences. In silico SNP discovery was carried out on contigs from all four EST references;
however, discovery of valid SNPs was most successful using one of the two conspecific references.
Keywords: EST, next-generation sequencing, SNP, sockeye salmon, SOLiD, transcriptome
Received 1 September 2010; revision received 26 November 2010; accepted 30 November 2010
Introduction
Single-nucleotide polymorphisms (SNPs) have emerged
as a powerful multipurpose tool in the study of wild populations (Habicht et al. 2010; McGlauflin et al. 2010). SNPs
are the most common type of genetic variation (Morin
et al. 2009; Slate et al. 2010) and can allow the characterization of both neutral and adaptive variation on a genome-wide scale. A carefully selected SNP panel (see
Morin et al. (2009) for experimental design with SNPs)
may perform as well as or better than microsatellite markers
as a population genetics tool with results that are more
consistent between laboratories (Smith & Seeb 2008; Morin et al. 2009; Slate et al. 2010). However, the use of SNP
markers in population studies to date remains limited
because of their narrow availability for many species.
Additional SNP discovery has been hindered by the need
for time-consuming and expensive sequencing efforts.
Correspondence: Meredith V. Everett, Fax: (206) 543 5728;
E-mail: [email protected]
2011 Blackwell Publishing Ltd
Other techniques for SNP discovery, such as high-resolution melt analysis (HRMA) (e.g. McGlauflin et al. 2010) or
alignment of sequences from existing EST databases
(Smith et al. 2005), have been used effectively but are limited in the number of SNPs they detect (Elfstrom et al.
2006). The recent advances in next-generation sequencing
(NGS) techniques, which can produce millions of
sequence reads in a single run, provide a potential wealth
of material for SNP discovery at a relatively low cost
(Hayes et al. 2007; Seeb et al. 2011).
NGS technologies have revolutionized molecular studies in nonmodel organisms, allowing the rapid characterization of gene structure and expression (Ellegren 2008).
As costs decrease, NGS technologies continue to be
applied to an increasing number of nonmodel species
(defined here as species lacking a reference genome)
(Morozova et al. 2009; Trick et al. 2009; Goetz et al. 2010;
Kunstner et al. 2010; Van Bers et al. 2010; Wolf et al. 2010).
When selecting an NGS platform, laboratories working
with nonmodel species must consider the cost, research
question and availability of resources for sequence assembly
(Flicek & Birney 2009). The three commonly used NGS
platforms are the Roche GS-FLX, Illumina Genome Analyzer (GA) and ABI SOLiD. The Roche GS-FLX runs produce the longest reads (400 bp) (Morozova et al. 2009),
and while the short reads (50–80 bp) produced by both
the SOLiD system and the Illumina GA present challenges
for assembly, both systems produce more reads at relatively lower cost. Shendure & Ji (2008) observed that the
price per megabase for Roche 454 sequencing is approximately 30 times the cost of both Illumina GA
and SOLiD. Our observations at the time of this writing
are that this cost distribution remains the same. The specific chemistries of the platforms have been reviewed elsewhere (Morozova et al. 2009), but the advantage of all
platforms is their ability to produce hundreds of thousands to millions of sequences in a single run.
Multiple tools have been developed for the assembly of
all types of NGS data (Flicek & Birney 2009). Sequence
assembly may either be de novo or assembly to a reference
(hereafter referred to as ‘reference assembly’ or ‘read mapping’). The reference may be a genome sequence, existing
EST database or other sequence database from either the
species of interest or one closely related (Trick et al. 2009;
Parchman et al. 2010; Van Bers et al. 2010). Many studies
utilize a combination of these assembly methods (Collins
et al. 2008; Flicek & Birney 2009; Buggs et al. 2010). Complications for both de novo and reference assembly remain for
all three NGS technologies. De novo assembly remains difficult because of the computational complexity of assembling the large volume of data each system produces. This
is especially true for Illumina and SOLiD, which produce
millions of reads per run. Additionally, Flicek & Birney
(2009) point out that to provide long assemblies, regardless of sequence or assembly method used, at least a portion of reads must be longer than the longest near identical
region in the genome. This parameter varies greatly
among genomes, and the short reads produced by all three
platforms rarely meet this threshold.
Regardless of NGS platform selected, in order for contig assembly and detection of variants such as SNPs to be
successful, sufficiently deep genome coverage is needed.
As many genomes are large and complex, a method for
reducing sequence complexity such that sufficient coverage can be achieved at relatively low cost is necessary.
Sequencing of the transcriptome provides a straightforward method for identification and annotation of only
the protein-coding genes, reducing the complexity of
sequences to assemble. Furthermore, the high coverage
achieved by NGS technologies, especially the Illumina
and SOLiD platforms, ensures identification of even rare
transcripts and variants (Hale et al. 2009; Morozova et al.
2009; Van Bers et al. 2010).
When constructing a transcriptome EST library, the
selection of tissues for starting material is important.
Pooled-tissue libraries have frequently been used in EST
sequencing projects in an attempt to expand the diversity
of genes discovered (Bonaldo et al. 1996; Carre et al. 2006;
Govoroun et al. 2006). However, the overexpression of
specific genes in some tissues is a potential source of bias
in sequencing pooled-tissue EST libraries. This overrepresentation can be addressed through library normalization or through tissue selection targeted to reduce
redundancy (Govoroun et al. 2006). One remaining question is whether the high coverage of NGS data can overcome the transcript redundancy found in pooled-tissue
libraries when compared to single-tissue libraries, allowing the identification of rare transcripts in any library.
Another area of interest is the determination of the
most efficient method for generating data relevant to
questions of population genomics, population and individual assignment, gene discovery etc., in the absence of
a reference genome. While previous studies have successfully assembled NGS data to existing sequence
resources, none of these examples have questioned
whether the EST database selected and assembly parameters used affect the quality of sequence assembly and
identification of SNPs. Species of Pacific salmon (Oncorhynchus sp.) and Atlantic salmon (Salmo salar) are
thought to have diverged 25 Mya (Allendorf & Thorgaard 1984), and the two groups are 94–96% similar when
comparing ESTs. It is unknown whether this 4–6%
sequence divergence, measured across the entire transcriptome, might affect the correct detection of informative SNPs (Smith et al. 2005; Koop et al. 2008).
Our study had three primary goals: first, to compare
assemblies between pooled- and single-tissue SOLiD
libraries, second, to evaluate the assembly of SOLiD
reads among existing EST resources for salmonids and
third, to examine whether underlying differences
between EST databases from different species affect the
validation of SNPs in sockeye salmon (O. nerka). We
found limited differences between the assemblies of
pooled- and single-tissue SOLiD libraries, with a trend
towards higher rates of contig discovery in pooled-tissue
libraries. Contig assembly on each EST database was
associated with the size of the database. SNP discovery
was most successful using conspecific EST data; however, differences in SNP validation among EST databases
may have been related to assembly parameters that lead
to the misidentification of paralogous sequence variants
(PSVs) as SNPs.
Methods
RNA isolation and SOLiD sequencing
For transcriptome library generation, gill, heart, liver and
testes were collected from two reproductively active male
sockeye salmon from each of five populations of ecological and commercial interest (Schindler et al. 2010; Smith
et al. 2011) (Table 1). The tissues were collected and
stored in RNAlater (Ambion). Total RNA was extracted
from all four tissues using Trizol (Invitrogen) according
to the manufacturer’s protocols and cleaned using a Qiagen RNeasy kit. Total RNA concentration was quantified
using a PicoGreen (Invitrogen) assay according to the
manufacturer’s protocol. All RNA was screened for quality on a Bioanalyzer (Agilent). Aliquots of the cleaned
total RNA were sent to the University of Washington
High Throughput Next Generation Sequencing Facility
for cDNA library preparation and sequencing using the
ABI SOLiD system. All cDNA library preparation and
sequencing were carried out using standard SOLiD protocols, including treatment with RiboMinus (Invitrogen),
to remove ribosomal RNA (rRNA). None of the libraries
were normalized during preparation. Two cDNA
libraries were created for each of the ten sockeye salmon
individuals. The first library was created from 10 µg of
total RNA from testes alone. The second cDNA library
was created by pooling of 2.5 µg total RNA from each tissue from gill, heart, liver and testes. Each library was run
on 1/8th of a SOLiD slide. All sequences were deposited
in the NCBI Short Read Archive (SRA) under accession
number SRA023604.2.
Sequence analysis
Sequence quality. Sequence analysis and quality control were carried out using the CLC Genomics Workbench 3.7.1 (CLC bio). All SOLiD data and quality scores,
including colour space information, were imported into
the Genomics Workbench and trimmed for length and
quality using the trim sequences tool. The quality score
limit was set to 0.05, and sequences less than 20 bp in
length were discarded.
While each RNA sample was treated to remove
rRNAs during library preparation, some rRNA contamination remained. To exclude these sequences from the
analysis, the trimmed SOLiD sequences from each library
were assembled to a set of publicly available teleost 18S
and 28S sequences [GenBank (EF126038, EF126042,
EU780557, EF126037, EU637075, EF126043, EF126039,
AB193742, AB193567, AB105163, AF308735, EF126040,
EU780557, AB099628, U34342, AY452491, U34341,
U34340, Z18683, Z18691, Z18686, Z18673, Z18764)] using
the reference assembly function and default assembly
settings in the Genomics Workbench. Reads from each
library that did not assemble to the teleost rRNA
sequences were retained in a separate file and used for all
remaining sequence analysis.
Table 1 Summary of SOLiD sequencing for all individuals. Individuals included were from populations of commercial interest (Habicht
et al. 2010). Locations are as follows: Yako Creek (YakoCk), Yako Beach (YakoB), Silverhorn Bay Beach (SSilv), Lake Kulik beaches
(SLKul) and Mendeltna Creek (SMend). The numbers after each abbreviation designate specific individuals. All locations except
Mendeltna Creek are from the Wood River Lakes system, Alaska, that drains into the Bering Sea; Mendeltna Creek is a Copper River
tributary, draining into the Gulf of Alaska. Testes tissue libraries were all created from 10 µg RNA from testes. Pooled-tissue libraries
were created by pooling 2.5 µg RNA each from four tissues: testes, liver, heart and gill

Testes Tissue
Individual     Number of reads   Number of reads after trimming   % Ribosomal   Final number of reads
YakoCk 1001    29 939 160        13 347 643                       25            10 031 781
YakoCk 1002    28 256 540        11 505 962                       23             8 913 446
YakoB 1001     27 784 826        11 590 747                       47             6 142 722
YakoB 1002     55 825 749        15 058 650                       17            12 551 092
SSilv 1001     20 570 121         2 383 678                       21             1 886 069
SSilv 1002     27 656 862        12 580 518                       24             9 607 538
SLKul 1001     51 794 375        26 072 506                       32            17 792 214
SLKul 1002     27 818 996        13 562 574                       20            10 904 540
SMend 1001     26 605 246        12 166 111                       23             9 406 487
SMend 1002     50 094 660        20 900 448                       25            15 749 481

Pooled Tissue
Individual     Number of reads   Number of reads after trimming   % Ribosomal   Final number of reads
YakoCk 1001    27 618 007        13 891 903                       35             8 999 696
YakoCk 1002    23 412 417        10 377 341                       25             7 780 985
YakoB 1001     26 287 980         9 476 172                       27             6 937 048
YakoB 1002     26 995 349        10 655 899                       27             7 747 486
SSilv 1001     28 207 755        10 982 744                       36             7 066 965
SSilv 1002     24 807 577        11 571 972                       18             9 455 131
SLKul 1001     86 188 999        49 422 200                       22            38 671 756
SLKul 1002     93 180 511        43 108 751                       24            32 591 176
SMend 1001     41 988 429        17 566 983                       17            14 578 467
SMend 1002     62 962 615        24 473 432                       22            19 156 967

Sequence assembly. Publicly available salmonid EST
databases from rainbow trout (O. mykiss), Atlantic salmon and sockeye salmon (Koop et al. 2008) were used as
reference sequences to map the SOLiD reads. The EST
databases varied in size: sockeye salmon 6598 sequences,
rainbow trout 79 018 sequences and Atlantic salmon
119 912 sequences. All three EST databases were from
the most recent 100/99 assemblies [minimum score,
repeat_stringency 99 in a Phrap assembly see (Koop et al.
2008); cGRASP]. Mapping of SOLiD reads was carried
out using the reference assembly function in the Genomics Workbench. Each SOLiD library was mapped to each
EST database singly, followed by simultaneous mapping
of all SOLiD pooled-tissue and testes libraries (Table 1)
(20 individual assemblies and 2 assemblies across all
individuals). The Genomics Workbench reference assembly tool allows the user to set the score for an individual
nucleotide mismatch and the total mismatch score limit
for retaining each assembly and also takes into account
colour space information. All trimmed SOLiD reads were
assembled with a nucleotide mismatch score of one and a
total score limit of two mismatches per sequence. This
corresponds to between 90 and 98 per cent identity with
the reference for each SOLiD read, dependent on the read
length.
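As a rough check on the identity range quoted above, the following sketch converts a mismatch count into per cent identity for a read of a given length. It assumes identity is simply (length − mismatches) / length and ignores colour-space corrections; the function name is illustrative and is not part of the Genomics Workbench.

```python
def percent_identity(read_length: int, mismatches: int) -> float:
    """Per cent identity of a read carrying the given number of mismatches."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 <= mismatches <= read_length:
        raise ValueError("mismatches must lie between 0 and read_length")
    return 100.0 * (read_length - mismatches) / read_length


# Shortest retained read (20 bp), an intermediate length, and the longest
# SOLiD read (50 bp), each at the limit of two mismatches.
for length in (20, 35, 50):
    print(length, round(percent_identity(length, 2), 1))
```

Under this simple definition, a 20 bp read with two mismatches is 90% identical to the reference and a 50 bp read with two mismatches is 96% identical; 98% corresponds to a single mismatch at 50 bp.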
De novo sequence assembly was carried out on all SOLiD
reads from all libraries using the NGS Cell 3.1 (CLC bio)
with default parameters. The SOLiD reads were remapped to these de novo contigs using the reference
assembly tool in the Genomics Workbench and the same
parameters as the reference assemblies to the EST databases.
In silico detection of putative SNPs.
SNP detection was
carried out on each of the reference assemblies using the
SNP detection tool in the Genomics Workbench. SNP
detection parameters included a minimum coverage
threshold of four reads, with a minimum variant frequency of 35% for heterozygous individuals. The maximum coverage limit was set to the value of the highest
coverage contig found in each assembly to initially capture the maximum number of putative polymorphisms.
SNP detection parameters also included a high-quality
score for both the putative SNP and an eleven-base-pair
window of surrounding nucleotides. Both homozygous
and heterozygous SNPs were considered when examining the assemblies that included all individuals.
Putative SNPs which were heterozygous with the rainbow trout or Atlantic salmon sequences, but homozygous
within the SOLiD reads were discarded.
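The coverage and frequency thresholds described above can be illustrated with a minimal sketch. It is not the Genomics Workbench algorithm: the function name and data layout are hypothetical, and the quality filter on the SNP and its eleven-base-pair window is omitted. Only the two thresholds stated in the text (at least four reads, variant frequency of at least 35%) are applied.

```python
from collections import Counter

MIN_COVERAGE = 4         # minimum number of reads covering the site
MIN_VARIANT_FREQ = 0.35  # minimum frequency of the non-reference allele


def call_site(read_bases, ref_base,
              min_cov=MIN_COVERAGE, min_freq=MIN_VARIANT_FREQ):
    """Return a putative SNP call for one aligned site, or None.

    read_bases: string of bases observed in the reads covering the site.
    ref_base:   base in the reference sequence at that site.
    """
    coverage = len(read_bases)
    if coverage < min_cov:
        return None  # too few reads to make a call
    counts = Counter(read_bases)
    # Candidate variants are all observed bases that differ from the reference.
    variants = [(n, base) for base, n in counts.items() if base != ref_base]
    if not variants:
        return None
    n, base = max(variants)  # most frequent non-reference base
    freq = n / coverage
    if freq < min_freq:
        return None
    return {"allele": base, "frequency": freq, "coverage": coverage}


print(call_site("AAAGG", "A"))  # 2 of 5 reads carry G: passes both thresholds
print(call_site("AAAAG", "A"))  # 1 of 5 reads carries G: below 35%
print(call_site("AGG", "A"))    # only 3 reads: below minimum coverage
```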
The resulting putative SNP tables were exported to
Microsoft Excel where both single- and multi-individual
assemblies were compared and screened using additional parameters similar to those in Sanchez et al. (2009).
These screening steps were included to reduce the rate of
misidentification of PSVs as true SNPs in silico, helping
to reduce the high cost of SNP validation. The order of
these steps is simply a matter of convenience. First, putative SNPs that appeared to contain more than two alleles
among all individuals were excluded from further analysis as there is an increased likelihood that such loci are
PSVs rather than true SNPs. Next, the remaining putative
SNPs were screened to exclude any that occurred within
100 bp of one another in a contig. Sanchez et al. (2009)
point out that contigs containing multiple SNPs are more
likely to represent paralogous loci. Thus, the decision to
exclude these putative SNPs attempts to strike a balance
between the number of putative SNPs discovered and
the rate of false discovery. Finally, putative SNP tables
from each individual and the multi-individual assembly
were compared to one another to locate putative SNPs
shared among multiple individuals. SNPs that contained
both homozygous and heterozygous individuals among
the populations were selected for further validation.
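The screening steps above were carried out in a spreadsheet; the sketch below shows how the same logic might be expressed in code. The record layout (contig, pos, alleles, individuals) is hypothetical, and treating a distance of exactly 100 bp as "within 100 bp" is an assumption. The final criterion (both homozygous and heterozygous individuals present) is not modelled.

```python
from collections import defaultdict


def screen_snps(snps, min_spacing=100, min_individuals=2):
    """Apply three screening steps to a list of putative SNP records."""
    # Step 1: exclude loci that appear to carry more than two alleles.
    biallelic = [s for s in snps if len(s["alleles"]) <= 2]

    # Step 2: exclude every locus lying within min_spacing bp of another
    # locus on the same contig (both members of a close pair are dropped).
    by_contig = defaultdict(list)
    for s in biallelic:
        by_contig[s["contig"]].append(s)
    spaced = []
    for group in by_contig.values():
        group.sort(key=lambda s: s["pos"])
        for i, s in enumerate(group):
            near_prev = i > 0 and s["pos"] - group[i - 1]["pos"] <= min_spacing
            near_next = (i + 1 < len(group)
                         and group[i + 1]["pos"] - s["pos"] <= min_spacing)
            if not (near_prev or near_next):
                spaced.append(s)

    # Step 3: keep loci detected in more than one individual.
    return [s for s in spaced if len(s["individuals"]) >= min_individuals]


example = [
    {"contig": "c1", "pos": 120, "alleles": {"A", "G"}, "individuals": {"i1", "i2"}},
    {"contig": "c1", "pos": 180, "alleles": {"C", "T"}, "individuals": {"i1", "i3"}},
    {"contig": "c1", "pos": 400, "alleles": {"A", "C", "T"}, "individuals": {"i1", "i2"}},
    {"contig": "c2", "pos": 50, "alleles": {"G", "T"}, "individuals": {"i4"}},
    {"contig": "c2", "pos": 300, "alleles": {"A", "G"}, "individuals": {"i2", "i5"}},
]
kept = screen_snps(example)
print([(s["contig"], s["pos"]) for s in kept])
```

In this toy input, the two c1 loci 60 bp apart are both dropped, the three-allele locus is dropped, and the c2 locus seen in one individual is dropped, leaving only c2 position 300.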
SNP validation
A set of 96 putative SNPs was selected for validation.
Twenty-four were selected from contigs from each of the
four reference databases, comparing across the individual assemblies and the group assemblies to detect putative SNPs found in multiple individuals. Each of the
putative SNPs chosen for the panel was detected in multiple individuals and appeared to have variable allele
counts among populations.
We used a four-step validation process, including PCR
tests, HRMA, Sanger sequencing and population genotyping (Seeb et al. 2011). BatchPrimer3 (You et al. 2008)
was used to design PCR primers flanking each putative
SNP from consensus sequences from each reference
assembly containing a selected putative SNP.
Primers were first subjected to a PCR test on pooled
genomic DNA extracted from the same 10 sockeye salmon individuals used for SOLiD transcriptome sequencing. Genomic DNA was extracted from preserved tissues
from each individual using a Qiagen DNeasy kit, following manufacturers’ protocols. DNA concentration was
quantified via a fluorescence assay on a NanoDrop (ThermoScientific), and the concentration was standardized
among all samples. Primer testing was carried out using
real-time PCR on a LightCycler 480 (Roche). PCR conditions were identical for all primer pairs: an initial denaturation of 10 min at 95 °C followed by 45 cycles of
95 °C for 10 s, annealing at 55 °C for 20 s and extension
at 72 °C for 20 s. Primer pairs that failed to amplify, or
amplified more than one product, were excluded from
further testing.
Successful primer pairs were used in a second set of
PCR reactions that included an HRMA to test for the
presence of each putative SNP. Each primer pair was run
on each of the ten individuals singly. PCR conditions
were identical to the conditions described earlier except
for the addition of a final melt step with a temperature
ramp from 62–95 °C, at a rate of 0.02 °C per s.
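For planning instrument time, the hold times of the cycling programme and the duration of the melt ramp can be estimated as in the sketch below. It assumes the ramp rate is 0.02 °C per second and ignores the time spent ramping between cycling steps, so actual run times will be longer. Names are illustrative.

```python
INITIAL_DENATURATION_S = 10 * 60  # 10 min at 95 °C
CYCLES = 45
STEP_HOLDS_S = (10, 20, 20)       # 95 °C, 55 °C and 72 °C holds per cycle


def cycling_hold_time_s():
    """Total programmed hold time for the PCR, in seconds."""
    return INITIAL_DENATURATION_S + CYCLES * sum(STEP_HOLDS_S)


def melt_ramp_time_s(start_c=62.0, end_c=95.0, rate_c_per_s=0.02):
    """Duration of a linear melt ramp, in seconds."""
    return (end_c - start_c) / rate_c_per_s


print(cycling_hold_time_s() / 60)        # minutes of programmed holds
print(round(melt_ramp_time_s() / 60, 1)) # minutes for the melt ramp
```

Under these assumptions the holds total 47.5 min and the melt ramp adds about 27.5 min.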
High-resolution melt analysis (HRMA) results indicating a polymorphism were further validated by Sanger
sequencing. Sanger sequencing was carried out at the
University of Washington’s High Throughput Sequencing facility. PCR products identified from HRMA were
sequenced in both directions using a standard BigDye 3.1
protocol (ABI). Resulting sequences and chromatograms
for each primer set were examined across all individuals
in BioEdit. The ClustalW algorithm in BioEdit was used
to create alignments, and the presence of each putative
SNP was verified by eye. Consensus sequences from
alignments containing putative SNPs were used in the
design of custom TaqMan assays (ABI).
TaqMan assays were tested on up to 95 individuals
from each of eight test populations of sockeye salmon
(Table 2, Fig. 1). Assays were run in 384-well plates on
an ABI 7900 instrument, and resulting genotypes were
analysed using the SDS 2.3 software (ABI) (Seeb et al.
2009). Allele frequencies and deviation from Hardy–
Weinberg equilibrium were calculated in GENALEX 6.1
(Peakall & Smouse 2006), and FST estimates were calculated in GENEPOP on the web, version 4.0.10 (Raymond &
Rousset 1995; Rousset 2008).
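For readers unfamiliar with the test, the sketch below shows a textbook chi-square goodness-of-fit test for Hardy–Weinberg proportions at a biallelic locus. It is not the GENALEX implementation, whose exact procedure is not described here, and it does not cover the FST calculation. The function name and example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 d.f.) of Hardy-Weinberg proportions.

    Takes observed genotype counts and returns
    (frequency of allele A, chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, chi2, p_value


print(hwe_chi_square(25, 50, 25))  # genotype counts in exact HW proportions
print(hwe_chi_square(40, 20, 40))  # strong heterozygote deficit
```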
Table 2 Test populations and sample size for SNP validation.
The number of individuals included from each population is
listed in n. Populations range from the South Peninsula, Alaska
to the Igushik River in Bristol Bay, Alaska. Bolshaya River is
located on the Western Kamchatka Peninsula in Russia (Fig. 1)

Population region    n
Igushik River        60
Lower Wood River     93
Illiamna Lake        95
Egegik River         95
Cinder River         89
Bear Lake            95
Chignik Lake         95
Bolshaya River       95
Results
Sequencing and assembly results
Two SOLiD sequence libraries, one each from pooled-tissue and testes, were obtained from each individual for a
total of 20 SOLiD libraries (Table 1). Within each library,
between 20 570 121 and 93 180 511 reads were initially
obtained. Reads were trimmed for quality and length,
leaving between 2 383 678 and 49 422 200 reads
(Table 1). Across all SOLiD libraries, between 17 and 47
per cent of the trimmed reads were found to be rRNA
sequence and excluded from further analysis leaving
between 1 886 069 and 38 671 756 sequence reads in each
library (Table 1).
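The percentage of rRNA reads follows directly from the read counts before and after filtering. The sketch below applies this to the smallest and largest libraries quoted above, assuming that each pair of trimmed and final counts comes from the same library; the function name is illustrative.

```python
def percent_removed(reads_before: int, reads_after: int) -> float:
    """Percentage of reads removed between two processing steps."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    return 100.0 * (reads_before - reads_after) / reads_before


# Smallest library: trimmed reads vs reads remaining after rRNA removal.
smallest = percent_removed(2_383_678, 1_886_069)
# Largest library: trimmed reads vs reads remaining after rRNA removal.
largest = percent_removed(49_422_200, 38_671_756)
print(round(smallest), round(largest))
```

Both values (about 21% and 22%) fall inside the 17–47% range reported for rRNA content.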
All SOLiD libraries were successfully assembled to
the cGRASP EST databases for sockeye salmon, rainbow
trout and Atlantic salmon (Table 3). De novo assembly
was successfully carried out across all SOLiD libraries,
resulting in 25 426 contigs. De novo contig lengths ranged
from 200 to 1193 bp with a mean length of 262 bp. All
SOLiD libraries were successfully mapped to these de
novo contigs (Table 3).
Pooled-tissue vs. testes specific libraries
One goal of this study was to compare sequence assembly between pooled-tissue and testes libraries within
each of the four reference databases. The pooled-tissue
libraries contained higher numbers of assembled reads,
higher numbers of contigs and greater contig length
overall (Table 3, Fig. 2). Among all assemblies, there
was a great deal of variation in both the number of contigs
detected and the depth of coverage per contig. Both
pooled-tissue and testes libraries contained contigs
unique to each library type, but more unique contigs
were detected in pooled-tissue samples (Table 3). However, among the unique contigs assembled on each EST
reference database, many were detected in only a single
library: 67 in sockeye salmon, 2411 in rainbow trout and
4840 in Atlantic salmon. Of these contigs, the majority
were constructed from the alignment of a single SOLiD
read to the reference (56/67 in sockeye salmon,
2083/2411 in rainbow trout and 4260/4840 Atlantic salmon). In all cases, these single read contigs made up
more than half of the unique contigs detected between
pooled-tissue and testes libraries. Reference assembly
back to the de novo set of sequences produced 48 contigs
detected in a single library. Once more, these reads
made up more than half of unique reads between
pooled-tissue and testes libraries. However, in this
instance, multiple SOLiD reads were mapped to all reference contigs.
Fig. 1 Map of locations of the eight test populations genotyped using TaqMan assays.
Table 3 Summary of mapping of SOLiD to the cGRASP EST reference libraries and the de novo SOLiD contigs. Among the EST databases, the sockeye salmon reference consisted of
6598 contigs and singletons, the rainbow trout reference contained 79 018 contigs and singletons, and the Atlantic salmon reference contained 119 912 contigs and singletons. The de
novo references contained 25 426 contigs. The percentages following the standard deviation are the standard deviation calculated as a percentage of the average

SOLiD library type   Reference species         Average starting reads   SD (%)            Average reads assembled to reference   SD (%)
Pooled               Sockeye salmon            15 298 567               11 478 360 (75)   1 411 177                              1 049 054 (74)
Testes               Sockeye salmon            10 298 537                4 504 473 (44)     861 917                                402 634 (47)
Pooled               Rainbow trout             15 298 567               11 478 360 (75)   2 147 101                              1 571 985 (73)
Testes               Rainbow trout             10 298 537                4 504 473 (44)   1 304 322                                621 324 (48)
Pooled               Atlantic salmon           15 298 567               11 478 360 (75)   1 883 711                              1 354 873 (71)
Testes               Atlantic salmon           10 298 537                4 504 473 (44)   1 121 298                                519 619 (46)
Pooled               Sockeye salmon de novo    15 298 567               11 478 360 (75)   1 475 877                                828 618 (56)
Testes               Sockeye salmon de novo    10 298 537                4 504 473 (44)     758 768                                406 749 (54)

SOLiD library type   Reference species         Average number of contigs   SD       % EST library   Number of unique contigs   Average length   SD   Average coverage   SD
Pooled               Sockeye salmon             5 828                         329   88                59                       338              88   235                163
Testes               Sockeye salmon             5 588                         481   84                41                       300              77   149                 64
Pooled               Rainbow trout             55 371                       6 853   70              2381                       202              53    32                 27
Testes               Rainbow trout             50 152                       8 537   63              1498                       176              47    20                 13
Pooled               Atlantic salmon           73 904                      13 352   62              5352                       145              36    26                 17
Testes               Atlantic salmon           64 686                      14 575   60              2946                       128              30    18                  9
Pooled               Sockeye salmon de novo    22 276                       1 734   88                62                       149              30    53                 24
Testes               Sockeye salmon de novo    20 920                       2 843   82                 4                       133              31    35                 18

Variation among reference databases
The results of mapping the SOLiD reads to each of the
EST reference databases and the de novo contigs were
variable among these four references. On the sockeye
salmon EST database, an average of between 84 and 88%
of the total ESTs in the reference library were assembled
among individual libraries, with a total of 6528 contigs
(98%) assembled across all libraries (Table 3). Mapping
to the rainbow trout database generated contigs representing between 63 and 70% of the reference library
(Table 3), with a total of 73 843 (93%) contigs mapped
across all libraries. An average of between 60% and 62%
of the contigs from the Atlantic salmon database were
generated from mapped SOLiD reads among all individual libraries with a total number of 112 214 contigs (93%)
mapped across all libraries. Finally, remapping the
SOLiD reads to the de novo contigs produced an average
of between 22 276 and 20 920 contigs in each library,
representing an average 82–88% of the starting reference
library (Table 3). However, across all SOLiD libraries, a
total of 25 426 contigs representing the entire de novo reference were successfully assembled.
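The proportions of each reference recovered across all libraries can be recomputed from the counts given above, as in the following sketch. The dictionary keys are labels chosen here for illustration.

```python
# Number of sequences in each reference (from the text).
REFERENCE_SIZES = {
    "sockeye salmon": 6598,
    "rainbow trout": 79018,
    "Atlantic salmon": 119912,
    "de novo": 25426,
}
# Contigs assembled across all SOLiD libraries combined (from the text).
MAPPED_ALL_LIBRARIES = {
    "sockeye salmon": 6528,
    "rainbow trout": 73843,
    "Atlantic salmon": 112214,
    "de novo": 25426,
}


def percent_mapped(mapped: int, total: int) -> float:
    """Percentage of reference sequences with at least one mapped read."""
    return 100.0 * mapped / total


for name, total in REFERENCE_SIZES.items():
    pct = percent_mapped(MAPPED_ALL_LIBRARIES[name], total)
    print(f"{name}: {pct:.1f}%")
```

The rainbow trout and Atlantic salmon values agree with the reported 93%. The sockeye salmon value computes to about 98.9%, which the text gives as 98%.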
We observed substantial variation in coverage depth
and length among assemblies to each EST database
(Figs 3 and 4). In general, assembly to both the rainbow
trout and Atlantic salmon EST references produced the
largest number of contigs, consistent with the larger size
of these reference databases. However, in the assembly to
both the rainbow trout and Atlantic salmon references, there
is a very high proportion of contigs (up to 55% of the contigs mapped on the Atlantic salmon reference) that contain fewer than 20 SOLiD reads. There is also a larger
proportion of short (<100 bp) contigs in the rainbow trout
and Atlantic salmon assemblies.
Fig. 2 axes: number of contigs (log scale) plotted against number of reads per contig (panels a–d) or contig length in bp (panels e–h); series are single-tissue (testes) and pooled-tissue libraries.
Fig. 2 Comparison of the frequency of occurrence of contig coverage values and contig lengths between pooled-tissue and testes libraries.
Each SOLiD library was mapped to all four reference databases, and mean length and coverage (number of reads per contig) were calculated across all individuals in each reference. Panels a–d are coverage and e–h are length. Reference library species are as follows: a.
Sockeye salmon de novo, b. Sockeye salmon, c. Rainbow trout, d. Atlantic salmon, e. Sockeye salmon de novo, f. Sockeye salmon, g. Rainbow trout, h. Atlantic salmon. Error bars are standard deviation from the mean. Note the overlap in error in all cases.
SNP detection and validation
In silico SNP detection was successfully carried out on the
contigs from all four reference assemblies. The average
number of putative SNPs across all four reference databases ranged between 4219 and 13 608 at a minimum
coverage depth of 4 reads (Table 4). At higher coverage
values (≥10 reads), an average of between 1928 and 5657
putative SNPs was detected (Table 4). The total number
of putative SNPs detected among all SOLiD libraries,
assembled on each EST reference database, was between
37 831 and 109 850 before secondary screening, and 631
and 14 046 after screening to remove loci that appeared
to contain more than two alleles or occurred within
100 bp of one another (Table 4). The average number of
putative SNPs per contig (coverage ≥4) after all screening
steps ranged from 1.6 to 4.4, across all reference databases. The greatest numbers of putative SNPs per contig
were detected on the sockeye salmon EST reference. Following the same pattern as the mapped contigs, numerous putative SNPs were detected in assemblies from only
a single individual. Additionally, in the assemblies to
some reference sequences, multiple putative SNPs were
detected across individuals; however, the detection and
coverage for each of these putative SNPs were variable
among individuals.
Fig. 3 Comparison of contig coverage frequencies among reference databases. Coverage was calculated as number of reads per consensus for the assembly of all pooled-tissue and testes SOLiD libraries on each reference sequence.
Fig. 4 Comparison of contig length frequencies among reference databases. Lengths were determined based on the assembly of all
pooled-tissue SOLiD libraries and all the testes SOLiD libraries on each reference sequence.
Ninety-six putative SNPs were selected for further
validation, 24 each from contigs assembled from each of
the four reference databases. Coverage values on each of
these candidates ranged from around 25 reads to more
than 100 when all individuals were aligned together. No
relationship between the level of coverage and SNP validation rates was observed. Six of the original 96 primer
pairs failed to amplify a product in initial PCR tests
(Table 5). A further 39 were rejected because they amplified more than one product. Successful primer pairs were
screened with HRMA to detect the presence of putative
SNPs. Seventeen putative SNPs were rejected after
HRMA because they contained ambiguous products, and
18 were rejected when no polymorphism was detected.
Sixteen putative SNPs were detected with HRMA and
were Sanger sequenced to confirm the presence or
absence of the variant. After sequencing, five potential
SNPs were rejected because either multiple products were amplified or no SNP was detected in any product
(Table 5). Finally, 11 putative SNPs were confirmed via
direct sequencing, and the consensus of these sequences
was used to design TaqMan assays (Table 6). Of these
final 11 putative SNPs, five were detected in contigs
assembled from the SOLiD de novo reference, two from
the reference assembly to the rainbow trout EST database
and four were from the sockeye salmon EST database.
All candidate SNPs detected using the Atlantic salmon
database failed validation as they either did not amplify
a product, amplified multiple products, or lacked a true
SNP in HRMA analysis.
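The attrition described above can be tallied directly from the per-reference counts in Table 5. The following Python sketch is illustrative only and is not part of the original analysis; the names `TABLE5`, `stage_totals` and `retention` are hypothetical, and the "na" entries for the Atlantic salmon reference are assumed to count as zero surviving candidates.

```python
# Candidate counts per validation stage, transcribed from Table 5.
# "na" in the table (Atlantic salmon, Sanger and TaqMan) is encoded as None.
STAGES = ["Primer pairs", "Successful PCR", "Successful HRMA",
          "Sanger validation", "TaqMan validation"]

TABLE5 = {
    "Sockeye salmon":         [24, 13, 5, 4, 4],
    "Rainbow trout":          [24, 11, 4, 2, 2],
    "Atlantic salmon":        [24, 11, 1, None, None],
    "Sockeye salmon de novo": [24, 16, 6, 5, 4],
}


def stage_totals(table):
    """Sum each stage across references, treating None ("na") as zero (assumption)."""
    return [sum(row[i] or 0 for row in table.values())
            for i in range(len(STAGES))]


def retention(counts):
    """Fraction of the starting candidates still present at each stage."""
    start = counts[0]
    return [c / start for c in counts]


totals = stage_totals(TABLE5)
for stage, n, frac in zip(STAGES, totals, retention(totals)):
    print(f"{stage:<18} {n:>3}  {frac:6.1%}")
```

The stage totals reproduce the counts given in the text (96 primer pairs, 51 single-product amplifications, 16 HRMA candidates, 11 Sanger-confirmed SNPs and 10 working TaqMan assays).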
The 11 TaqMan assays were successfully amplified
on the eight test populations (Table 2). One assay
failed to distinguish between homozygous and heterozygous individuals and was excluded from further
analysis, while the other ten loci were successfully
scored across all eight test populations (Table 7). Several populations deviated from Hardy–Weinberg equilibrium at a single locus (Table 7). FST values for each
locus ranged from 0.01 to 0.31. Global FST across all 10
loci was 0.06.
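For readers unfamiliar with how a deviation from Hardy–Weinberg equilibrium is flagged, the sketch below shows one common approach, a chi-square goodness-of-fit test at a biallelic locus. This section does not state which test was used for Table 7, so the choice of test is an assumption, and the genotype counts in the example are hypothetical rather than taken from the study. Exact tests are often preferred when sample sizes are small.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg proportions.

    Takes observed genotype counts at a biallelic locus and returns
    (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts (AA, AB, BB) for 100 individuals.
in_equilibrium = hwe_chi_square(36, 48, 16)    # matches p = 0.6 expectations
heterozygote_deficit = hwe_chi_square(50, 20, 30)
print(in_equilibrium, heterozygote_deficit)
```

Under this sketch, a locus would be flagged at the thresholds used in Table 7 when the returned P-value falls below 0.05 or 0.01.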
Table 4 Summary of SNP discovery across all reference assemblies. The total numbers of putative SNPs in all cases are total number of unique putative SNPs across all assembled contigs and libraries. Coverage is the number of reads assembled per contig

Reference species       SOLiD library  Average in silico SNPs  ±SD   Average in silico SNPs  ±SD     Average SNPs per contig  ±SD  Total putative   Total putative SNPs
                        type           (coverage ≥10)                (coverage ≥4)                   (coverage ≥4)                 SNPs discovered  after screening
Sockeye salmon          Pooled         4076                    3367  5671                    4348    4.1                      0.6  43 773           806
Sockeye salmon          Testes         2962                    2410  4219                    3280    4.4                      0.9  37 831           631
Rainbow trout           Pooled         5657                    5386  13 608                  11 671  2.1                      0.2  142 944          1036
Rainbow trout           Testes         3213                    2945  7742                    5969    2.2                      0.5  101 100          3923
Atlantic salmon         Pooled         3877                    3700  10 201                  8811    1.9                      0.2  109 850          14 046
Atlantic salmon         Testes         1928                    1374  5293                    3467    2.0                      0.3  71 089           2350
Sockeye salmon de novo  Pooled         2730                    2344  4428                    3629    1.6                      0.1  44 950           2218
Sockeye salmon de novo  Testes         2583                    3405  4290                    5260    1.7                      0.4  51 617           775
Table 5 Results of primer design and validation of candidate SNPs. Twenty-four primer pairs were designed from consensus sequences from each reference assembly to test putative SNPs. These putative SNPs were subjected to four validation steps. First, we tested for successful PCR amplification producing a single product. Second, the successful primer pairs were subjected to high-resolution melt analysis (HRMA) using individuals from eight populations. Third, templates that appeared to contain a SNP in HRMA were resequenced using Sanger sequencing. Finally, putative SNPs appearing in the Sanger sequences were validated using TaqMan assays on eight populations (Table 2)

Reference assembly      Primer pairs  Successful PCR  Successful HRMA  Sanger validation  TaqMan validation
Sockeye salmon          24            13              5                4                  4
Rainbow trout           24            11              4                2                  2
Atlantic salmon         24            11              1                na                 na
Sockeye salmon de novo  24            16              6                5                  4
Totals                  96            51              16               11                 10
Rainbow trout
Sockeye salmon
de novo
Sockeye salmon
de novo
One-F4b
One-A5
Sockeye salmon
Sockeye salmon
Sockeye salmon
Sockeye salmon
One-F10
One-G6
One-H3
One-H5
One-H6
One-B8
One-B6
Sockeye salmon
de novo
Sockeye salmon
de novo
Sockeye salmon
de novo
Rainbow trout
One-B5*
One-B11
Reference database
Assay
name
AGGTGAATCATTTG
TCCATAGCATCA
CCATCCTTCCTTCAT
TGACACCATT
ATTTAAGTAATCTA
TTTCTTTGGCTCGAC
TGA
GTCGATGGGCAGGA
AGATGA
ACAAATCGTGTTAA
TGACAGGCTACT
TGTCCGACCACAAC
AATGTC
GAGCCCCAGTACCA
TTCCA
GGAAGTTTGTGCTTG
GCTAAGATCA
CATTTTTTGGCACCA
TAACCTTGGT
TTGGTTCCGATCTAC
AAGTTGACAT
GGCATGATTGTCCTT
GGGAAGATAT
Forward primer seq.
TCTCCTGTCCTTTC
ATACACTCTGA
GGATGACGTAATT
GATCAACTGTCCAT
ACCAGGTTGAGAA
AAACGTTATCCT
GGACCCATGATAGT
TCCCATCTT
GGGATTTATTGCTC
TGAGAGGACAA
ATACCCCAGTCCACCAATCAG
ACAAGGAATTCA
GTGTGGGATTGG
ATCTGAGGAAGCATATTTTTTCCTAATTCTATTTCT
CCTCAGGGCAACTT
ATATTCAAAGC
CCTCTCGGCCATCT
TTGAAGTTATT
GGAGCATCTAAGA
AAATACCCGTCTT
Reverse primer seq.
VIC
VIC
VIC
VIC
VIC
VIC
VIC
VIC
VIC
VIC
VIC
Probe
1 dye
CAAATCAACTGGATTTAC
TCATGCATAGATA
CTTGAC
CTAAATCTGAATT
AATTTACG
CCTAACACAACATTGCTT
AGGACACACAGC
TCTGT
ATAAACAATCAGG
GAAATG
TAGCGACGAAGAC
CACA
CCTGCCAGGCCTC
CCATCATTCTCA
TTACTGTTT
ATCTGAA
GTATTGGCTTTAA
TGTGGCCAATGGA
CCAA
Probe 1 sequence
FAM
FAM
FAM
FAM
FAM
FAM
FAM
FAM
FAM
FAM
FAM
Probe
2 dye
Table 6 TaqMan primer and probe sequences designed from Sanger-validated sequences. Assay One-B5 (*) amplified more than two alleles when tested
TCAAATCAACTG
TATTTAC
ATGCATAGATGCC
TTGAC
AAATCTGAATTCATTTACG
TCCTAACACAACTTTGCTT
AGGACACACAAC
TCTGT
AAACAATCAAGGAAATG
TAGCGACGAACA
CCACA
CCCTGCTAGGCCTC
CCATCATTCTCATTCCTGTTT
ATCTGAAG
TATTTGCTTTAA
TGGCCAATGAACC
AA
probe 2 sequence
Discussion
Sequencing results
As the cost of NGS methods decreases, the use of this
technology on nonmodel species, those species lacking a
reference genome, has increased (Morozova et al. 2009;
Trick et al. 2009; Kunstner et al. 2010). The short reads
produced by NGS, including the SOLiD system, remain a
challenge for assembly; thus, their primary use to date
has been in resequencing and gene expression studies in
organisms with a sequenced genome (Morozova et al.
2009; Trick et al. 2009; Wall et al. 2009). Recently, NGS
studies in nonmodel species have begun to use existing
EST data sets as references for sequence assembly (Collins et al. 2008; Trick et al. 2009; Parchman et al. 2010; Van
Bers et al. 2010). However, no studies have examined the
possible consequences of the mapping reference selected.
Our study characterized variations in assembly to three
publicly available salmonid EST databases, as well as
assembly to a set of de novo contigs, and examined the
optimal methodology for SNP detection in the assembly
of short SOLiD reads.
Using the SOLiD sequencing system, we successfully
obtained transcriptome sequence from both pooled-tissue and testes libraries from 10 individuals spanning five
distinct populations of Alaskan sockeye salmon. Our
decision to use the SOLiD system, rather than a method
such as 454, was based on cost. Using SOLiD allowed us
to sequence more individuals at very high coverage. Several million reads were obtained from each library. However, a large proportion (up to 47%) of the sequences
obtained consisted of rRNA sequences. These sequences
were present despite the RiboMinus treatment of all
RNA samples outlined in the methods. The RiboMinus
kit uses regions of ribosomal sequence that are highly
conserved among species. Nonetheless, the large number
of ribosomal sequences remaining in the sample suggests
that the included sequences do not have high enough
specificity for use in salmonids. Future studies should
consider using other techniques such as poly-A selection
to reduce rRNA in their samples.
Pooled-tissue versus testes libraries
Our first goal was to compare the efficiency of assembly
between pooled-tissue and testes libraries. Pooled-tissue
libraries have frequently been used in EST sequencing
projects in an attempt to expand the diversity of genes
discovered (Bonaldo et al. 1996; Carre et al. 2006; Govoroun et al. 2006). However, one potential source of bias in
sequencing pooled-tissue EST libraries is that highly
expressed genes may be overrepresented in the library,
while rare transcripts may be missed altogether. Such
variation in distribution can be addressed through
library normalization or through tissue selection targeted
to reduce library redundancy. Library normalization
reduces the proportion of highly expressed transcripts
compared to other transcripts in the sample; however,
even after normalization, redundant EST clusters may
remain, and depending on the intended downstream
application of the library (e.g. gene expression), normalization may not be appropriate. While this is problematic
for traditional sequencing methods, which are limited
in their numerical capacity, Hale et al. (2009) demonstrated that the high coverage obtained in NGS generally
eliminates the need for normalization, and even rare transcripts can be identified in nonnormalized libraries.
Table 7 Genotype results from eight populations: Igushik River, AK; Lower Wood River, AK; Illiamna Lake, AK; Egegik River, AK; North Peninsula (Cinder River), AK; North Peninsula (Bear Lake), AK; Chignik Lake, AK; Bolshaya River, Russia. Eleven TaqMan assays were designed from Sanger-validated sequences. Ten assays successfully amplified in all eight populations. Minor allele frequency and deviations from Hardy–Weinberg equilibrium (*P < 0.05, **P < 0.01) are shown below for all populations

Minor allele frequency
Assay name  Igushik  Lower Wood  Illiamna  Egegik  North Peninsula,  North Peninsula,  Chignik  Bolshaya River,
            River    River       Lake      River   Cinder River      Bear Lake         Lake     Russia
One-F4b     0.117**  0.188       0.126     0.128   0.099             0.087             0.140    0.146
One-A5      0.283    0.382       0.363     0.300   0.309             0.200             0.263    0.337*
One-B11     0.242    0.484       0.411     0.420   0.427*            0.290             0.381    0.131
One-B6      0.242*   0.258*      0.384     0.326   0.340             0.440             0.299    0.355
One-B8      0.475    0.473       0.405     0.426   0.416             0.421             0.489    0.400
One-F10     0.158    0.177       0.100     0.153   0.081             0.273             0.367    0.120
One-G6      0.333    0.360       0.342     0.353*  0.494             0.337             0.379    0.393
One-H3      0.411    0.140       0.102     0.054   0.326             0.468*            0.163    0.300
One-H5      0.333    0.285       0.437     0.426   0.433             0.463             0.453    0.494
One-H6      0.344*   0.265       0.159     0.196   0.263*            0.227             0.295    0.500
We compared reference assemblies of libraries from
both testes and pooled-tissues. Despite the use of nonnormalized (native) libraries, more than 90% of the ESTs
found in each reference database were identified in at
least one individual from our libraries. This was consistent with the findings of Hale et al. (2009) that native
libraries can be effectively used for gene discovery. The
use of native libraries does have an effect on the distribution of coverage, however. A comparison of the 100 genes
with the highest coverage from each reference assembly
revealed that 19–85% of the total number of reads
mapped to each reference was contained in these 100
contigs. The large variation in these values is attributed
to differences between reference libraries and the high
variability between individuals. Given the high proportion of reads mapped to a few contigs, it is probable that
normalization would more evenly distribute coverage
across our libraries, improving consistency between individuals and possibly enhancing SNP discovery.
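The concentration of reads in a small number of contigs can be summarized with a simple ratio. The Python sketch below is a minimal illustration, assuming a mapping from contig name to mapped read count is available; the function name `top_n_read_share` and the toy library are hypothetical and do not reproduce the study's data.

```python
def top_n_read_share(reads_per_contig, n=100):
    """Fraction of all mapped reads that fall in the n highest-coverage contigs."""
    counts = sorted(reads_per_contig.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n]) / total


# Hypothetical toy library: three dominant contigs plus a long tail
# of 100 low-coverage contigs.
toy = {f"contig{i}": 1000 for i in range(3)}
toy.update({f"tail{i}": 10 for i in range(100)})

print(f"{top_n_read_share(toy, n=3):.0%} of reads fall in the top 3 contigs")
```

Applied with n = 100 to each reference assembly, a statistic of this kind corresponds to the 19–85% range reported above.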
Mapping of pooled-tissue libraries generally resulted
in higher numbers of contigs, and the assembled contigs
tended to be longer (Fig. 2, Table 3). Both pooled-tissue
and testes libraries contained unique contigs, though
there were more contigs unique to the pooled-tissue
libraries, regardless of the EST database used as a reference. It should be noted that these trends are general and
the pooled-tissue and testes libraries have overlapping
error rates in the total number of reads assembled,
assembled contig length and coverage depth (Table 3,
Fig. 2). Differences among EST libraries are typically
identified by grouping ESTs by Gene Ontology (GO)
terms or gene identifications and then comparing relative
counts of ESTs via a hypergeometric distribution, such as
a binomial, chi-squared or Fisher's exact distribution
(Susko & Roger 2004; Young et al. 2010). The resulting
distribution of sequences among libraries from NGS
methods frequently violates the assumptions underlying
these tests, specifically the assumption that all genes are
independent and equally likely to be selected as differentially expressed, under the null hypothesis (Young et al.
2010). Thus, we did not perform statistical tests, and our
comparisons remain general at this time. While pooled-tissue libraries contained a higher diversity of transcripts,
the large variability in reads mapped (Table 3) and the
overlap in error rates for assemblies between pooled-tissue and testes libraries suggest that both are an important
resource for gene discovery. Decisions regarding tissue
selection for library preparation should be made based
on tissue availability and the specific question being
addressed, keeping in mind the possibility for other
downstream applications with each library type.
Comparison among EST references
The second goal of this study was a comparison of gene
and SNP discovery when SOLiD reads are mapped to
different EST databases and a de novo assembly. The short
reads produced by the SOLiD sequencing system remain
difficult to assemble de novo, because of their length and
the computational complexity of handling the large volume of data, although assembly algorithms are improving (Flicek & Birney 2009). The overall number of reads
that assembled to each reference database was variable
among all individuals (Table 3). Additionally, a relatively
low proportion of the average number of starting reads
(approximately 10%) mapped to each reference database.
There were two factors that may explain this low rate.
First, the genes in the EST database may not have been
representative of the genes expressed in our tissue
libraries. The second factor, our strict assembly parameters, was more likely the crucial factor producing the low
mapping rate. Any reads that contained more than two
mismatches to the reference sequence were discarded.
This necessarily excluded many sequences; however, the
hope was that it would improve the overall quality of the
assembly and separate PSVs. In spite of the reduction in
the total number of reads assembled, contigs representing more than 90% of each reference library were assembled from our data.
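The effect of the two-mismatch rule can be illustrated with a short Python sketch. It is a simplification under stated assumptions: SOLiD reads are produced in colour space and were aligned with dedicated software, whereas this example compares plain base-space strings at a known offset. The function names and the toy sequences are hypothetical.

```python
def count_mismatches(read, reference, offset):
    """Number of positions where the read differs from the reference window."""
    window = reference[offset:offset + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return sum(1 for a, b in zip(read, window) if a != b)


def passes_filter(read, reference, offset, max_mismatches=2):
    """True if the read has no more than max_mismatches against the reference."""
    return count_mismatches(read, reference, offset) <= max_mismatches


reference = "ACGTACGTACGT"          # hypothetical reference fragment
kept_read = "ACGTTCGT"              # one mismatch at offset 0: retained
dropped_read = "TCGTTCGA"           # three mismatches at offset 0: discarded

print(passes_filter(kept_read, reference, 0))
print(passes_filter(dropped_read, reference, 0))
```

A low threshold of this kind discards divergent reads, including reads from diverged paralogs, at the cost of a lower overall mapping rate.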
We found that assembly of SOLiD reads was variable
among the three EST databases and the de novo contigs.
As expected, the number of contigs assembled on each
reference varied proportionally with the number of
sequences in the reference database. The Atlantic salmon
and rainbow trout databases contained a larger number
of ESTs and thus were able to capture more of the
expressed genes present in the sockeye salmon transcriptome. Across all assemblies, a large proportion of the
starting EST databases (between 94 and 98%) were
detected in at least a single library, suggesting a large
proportion of transcripts in sockeye salmon are found in
the publicly available EST databases.
The average proportion of contigs assembled, as well
as sequence length and coverage, was lower in the Atlantic salmon and rainbow trout assemblies (Table 3). There
was also a large variation in the number of contigs assembled among individuals on all references, with a number
of contigs assembled in a single individual in all cases.
Based on standard deviation of the contigs assembled
(Table 3), the variation in the number of assembled contigs was largest on the Atlantic salmon reference, and
smallest on the sockeye salmon EST database. In addition
to containing more sequences, the rainbow trout and
Atlantic salmon EST libraries contained sequences that
were generally longer than those in either the sockeye
salmon EST or sockeye de novo library. Thus, the contigs
assembled from these databases tended to be longer.
This variability in contig length and depth, as well as
the number of contigs assembled among individuals,
may have been related to the methods of assembly. In the
assembly software, if a SOLiD read was nonspecific, i.e. it
mapped to more than one reference sequence, then the
software provided the option to align to one of these references at random or remove it from analysis. In all reference assemblies, the software was set to random
assembly of nonspecific matches. Thus, any underlying
redundancy in an EST database could result in a wider
distribution of SOLiD reads across these sequences,
resulting in more short, low coverage contigs. Such
underlying redundancy could include paralogous
sequences, repetitive DNA elements, or shared motifs,
although specific identification of these structures is
beyond the scope of the current study. While longer
sequence reads can separate these features, the correct
mapping of short SOLiD reads to these features remains
problematic. Among our assemblies, we observed
numerous, short (<100 bp) contigs, and contigs consisting of only a few reads (<20 reads per contig) (Figs 3 and
4). A test reassembly of the SOLiD reads, with the software option set to remove any read that mapped to multiple reference sequences, resulted in a lower total
number of assembled reads and a lower number of
assembled contigs. There was also a substantial decrease
in the differences in the number, length and coverage of
contigs assembled using the rainbow trout and Atlantic
salmon references (data not shown). Thus, differences in
assembly between the EST databases may be related to
both underlying differences in the structure of the EST
database chosen and to the parameters chosen for the
reference assembly itself. Researchers mapping NGS
reads to a closely related species should take both of
these factors into account when performing reference
assembly.
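The two ways of handling nonspecific reads discussed above can be contrasted in a few lines of Python. This is a minimal sketch, not the behaviour of the assembly software used in the study: the function `assign_reads`, the policy names and the toy hit table are all hypothetical.

```python
import random
from collections import Counter


def assign_reads(hits, policy="random", seed=0):
    """Assign each read to at most one reference sequence.

    hits maps a read id to the list of reference ids it matched equally well.
    policy "random" places a multi-hit read on one of its hits at random;
    policy "discard" drops every read with more than one hit.
    Returns a Counter of assigned reads per reference.
    """
    if policy not in ("random", "discard"):
        raise ValueError("policy must be 'random' or 'discard'")
    rng = random.Random(seed)
    coverage = Counter()
    for refs in hits.values():
        if not refs:
            continue
        if len(refs) == 1:
            coverage[refs[0]] += 1
        elif policy == "random":
            coverage[rng.choice(refs)] += 1
    return coverage


# Hypothetical hits: estA and estA_dup are redundant entries in the database.
toy_hits = {
    "r1": ["estA"], "r2": ["estA"], "r3": ["estA"],
    "r4": ["estA", "estA_dup"], "r5": ["estA", "estA_dup"],
    "r6": ["estB"],
}

print(assign_reads(toy_hits, policy="random"))
print(assign_reads(toy_hits, policy="discard"))
```

With random placement every read is kept but coverage can be spread over redundant entries; with the discard policy fewer reads are assembled, consistent with the test reassembly described above.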
SNP discovery and validation
SNP detection was successfully carried out in silico on
contigs from assemblies to all four reference libraries.
Following the same patterns as contig assembly, there
was a large variation in number of putative SNPs
detected, both among individual libraries and across the
EST references. One of the primary difficulties in successfully detecting and confirming true SNPs in salmonids is
the presence of PSVs, the result of a whole genome duplication event in salmonids (Allendorf & Thorgaard 1984;
Koop et al. 2008). Paralog assemblies often contain multiple polymorphic sites. The intent of our initial assembly
parameters, allowing only two mismatches per mapped
SOLiD read, was to separate PSVs into individual loci if
possible. The EST databases selected from cGRASP were
from an EST assembly also intended to separate PSVs
(Koop et al. 2008). Despite the stringency of assembly,
our approach had limited success.
Similar SNP detection studies in species without a
recent genome duplication have had high success rates.
For example, 84% of in silico SNPs detected in the great
tit, Parus major, were validated (Van Bers et al. 2010). SNP
detection efforts in polyploid plant species have had up
to 93% success in validating SNPs detected from NGS
data (Bundock et al. 2009; Trick et al. 2009; Buggs et al.
2010). Many of the plants in these studies, however, have
EST or other sequence resources available from closely
related, diploid ancestors, which allow better identification of PSVs.
Some previous SNP detection studies in salmonids
have generally reported around 50–74% success rates
(Smith et al. 2005; Ryynanen & Primmer 2006; Hayes et al.
2007; Sanchez et al. 2009). The apparent large difference
in validation rates between the current study and these
previous efforts in salmonids may be due to the methods
used in each study. We tested only a limited number of
putative SNPs assembled across the entire transcriptome.
Previous efforts have targeted known overlapping
sequences between species (Smith et al. 2005). A large
portion of our putative SNPs were discarded when the
primers designed to test them failed to amplify a product. Primers designed on transcriptome sequence to test
putative SNPs may cross introns and fail to amplify (see
Table 5). Additionally, these previous efforts have used
Sanger sequencing, which produces longer reads that are
easier to accurately assemble. Some studies report success based on the number of successful TaqMan assays
alone, rather than the total number of sequences tested
(Smith et al. 2005). Calculating our success in this fashion
would result in a 90% success rate (10 of 11 TaqMan
assays were successful). In contrast, a recent next-generation SNP discovery effort in chum salmon used 454
sequencing for SNP discovery and the same SNP validation pipeline used in this study. Their reported rate of
SNP validation in that study was approximately 20%,
much closer to the rate reported here (Seeb et al. 2011).
Regardless of the differences in reporting between studies, the relatively low validation rate in salmonids compared to other species is attributed to difficulty in
separating PSVs from true SNPs.
While our validation was low, the specific validation
rate varied depending on the EST database used as a reference. Of the validated SNPs, eight were from contigs
assembled from the two sockeye salmon references and
the remaining two assays were successfully designed
from assemblies to the rainbow trout reference. None of
the candidates selected for validation from the Atlantic
salmon reference were successful.
A number of additional factors may underlie our low
validation. First, despite the large numbers of putative
SNPs detected (Table 4), the lack of overlap in loci
detected among libraries reduced the number of SNPs
that passed secondary screening. A portion of the SNPs
detected were from contigs assembled in only one
library. At the same time, in a proportion of contigs
assembled in multiple individuals, the portions of each
reference sequence that were assembled from each individual did not fully overlap among individuals. For
example, if we assembled contigs corresponding to a single reference sequence in two sockeye individuals, the
reads from the first individual might align to the beginning portion of the reference sequence, while the reads
from the second sockeye individual might align to the
end. Consequently, any putative SNPs detected in each
of these segments would not be shared between these
two individuals. This uneven distribution of putative
SNPs among individuals greatly reduced the number of
putative SNPs for selection for validation. Possible solutions to this uneven coverage include library normalization or use of emerging techniques including sequencing
of reduced representation libraries (Sanchez et al. 2009)
or RAD tag sequencing (Miller et al. 2007; Baird et al.
2008) that produce more even coverage across all individuals.
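The two-individual example above can be made concrete with a small Python sketch. It assumes, for simplicity, that each individual covers one contiguous half-open interval of the reference; real coverage is usually patchier. The function names and coordinates are hypothetical.

```python
def shared_region(interval_a, interval_b):
    """Overlap of two half-open covered intervals (start, end), or None."""
    start = max(interval_a[0], interval_b[0])
    end = min(interval_a[1], interval_b[1])
    return (start, end) if start < end else None


def shared_snps(snps_a, snps_b, interval_a, interval_b):
    """Putative SNP positions that lie where both individuals have coverage."""
    region = shared_region(interval_a, interval_b)
    if region is None:
        return []
    candidates = set(snps_a) | set(snps_b)
    return sorted(p for p in candidates if region[0] <= p < region[1])


# Individual 1 covers the start of a 600 bp reference, individual 2 the end:
# no position is covered in both, so no putative SNP can be compared.
print(shared_snps([120], [520], (0, 250), (350, 600)))

# With partial overlap (300-400), only SNPs inside the overlap are comparable.
print(shared_snps([120, 340], [350, 520], (0, 400), (300, 600)))
```

Only positions inside the shared region can be evaluated in both individuals, which is why uneven coverage shrinks the pool of candidates available for validation.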
Another source of SNP dropout was erroneous
assignment of PSVs or sequence errors as SNPs. Of the
loci that failed during validation, six failed to amplify a
product in an initial PCR test (Table 5). Primers
designed on these loci probably spanned an intron
boundary, as primers were designed from transcriptome
consensus sequences, but tested on genomic DNA. Furthermore, of the starting 96 loci tested, 60 appeared to
amplify multiple loci and thus were likely PSVs. Nineteen appeared to be false positives in initial assembly,
where no SNP was detected during validation, possibly
a result of sequencing error.
The variation in SNP detection among reference
sequences is unusual. Previous studies in salmonids have
successfully used cross-species sequence data to design
primers and assays for SNP loci (Smith et al. 2005). This
unusual complication appears to be related to the assembly parameters used for the complex assembly of short
SOLiD reads to the large EST data set and our screening
parameters for SNP detection. As described previously,
in our assembly parameters, short SOLiD reads that hit
multiple sequences were randomly mapped to a single
sequence. This led to the large number of relatively low
coverage contigs discussed earlier. Before selecting putative SNPs for validation, we eliminated putative SNPs
that either contained more than two alleles, or were
within less than 100 bp of one another, as such variants
were more likely to be PSVs (Sanchez et al. 2009). Our
assembled reads were more widely distributed among
more, low coverage contigs in the assembly to the Atlantic salmon and rainbow trout EST sequences. Thus, PSVs
and sequencing errors that would otherwise have been
eliminated during our secondary screening of SNPs were
more likely to be missed. Possible solutions to this scenario are to set the assembly parameters to discard all
SOLiD reads that have multiple hits, to only assemble to
full length sequences, or to increase the coverage threshold for putative SNP detection. All these methods will
result in loss of some sequence data but should reduce
the number of misidentified PSVs.
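The secondary screening rules and the coverage threshold discussed above can be expressed as a simple filter. The Python sketch below is illustrative only; the record layout, the function name `screen_snps`, and the decision to measure spacing against every putative SNP on the contig (including ones that fail another rule) are assumptions rather than a description of the software used in the study.

```python
def screen_snps(snps, min_coverage=4, min_spacing=100):
    """Screen putative SNPs on a single contig.

    snps is a list of dicts with keys "pos" (int), "alleles" (set of bases)
    and "coverage" (int). A site is kept when it has at most two alleles,
    meets the coverage threshold, and has no other putative SNP closer
    than min_spacing bp.
    """
    positions = [s["pos"] for s in snps]
    kept = []
    for s in snps:
        if len(s["alleles"]) > 2:
            continue                      # more than two alleles: likely PSV
        if s["coverage"] < min_coverage:
            continue                      # too few reads to trust the call
        crowded = any(other != s["pos"] and abs(other - s["pos"]) < min_spacing
                      for other in positions)
        if crowded:
            continue                      # clustered variants: likely PSV
        kept.append(s)
    return kept


# Hypothetical candidates on one contig.
candidates = [
    {"pos": 50,  "alleles": {"A", "G"},      "coverage": 12},
    {"pos": 120, "alleles": {"C", "T"},      "coverage": 15},
    {"pos": 400, "alleles": {"A", "C"},      "coverage": 12},
    {"pos": 700, "alleles": {"A", "C", "T"}, "coverage": 30},
    {"pos": 900, "alleles": {"G", "T"},      "coverage": 3},
]

print([s["pos"] for s in screen_snps(candidates)])
```

Raising `min_coverage` in such a filter corresponds to the stricter coverage threshold suggested above: it removes more data but leaves fewer low-coverage sites where paralogous variants could pass unnoticed.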
Conclusions and recommendations for future
research
SNP discovery using short read chemistry as carried out
here was technically challenging. We offer the following
conclusions and recommendations for future research.
First, differences between testes and pooled-tissue
SOLiD libraries were not substantial. Investigators may
choose to select any tissue or tissue combination relevant
to their individual study.
Second, the high depth of coverage provided by NGS
data identified even rare transcripts. However, normalization of libraries may help more evenly distribute coverage, enhancing variant discovery. If use of normalization
can be balanced against the resulting increase in cost, it is
a good option. If native libraries are needed for gene
expression studies, these may be subsampled during
library construction.
Third, care should be taken to select reference
sequences from conspecifics or closely related species.
Despite the difficulties in using short reads for de novo
assembly, we recommend a strategy that incorporates a
combination of de novo and reference assemblies.
Fourth, when assembling to EST libraries, assembly
parameters should eliminate nonspecific hits to better
facilitate complete, high coverage contig assembly. An
assembly strategy that randomly assigns nonspecific
reads may be used for comparison if too large a proportion of data is removed from the analysis but should not
be used for SNP detection.
Finally, strategies to screen putative SNPs for potential PSVs, as well as stringent assembly strategies such as
only mapping to full length contigs, may reduce the
occurrence of incorrectly combining PSVs into a single
sequence. Our current strategy for SNP discovery using
NGS is to use strict assembly parameters and assemble to
conspecific EST libraries to maximize our SNP validation
rates. Use of strict parameters will reduce the misidentification of PSVs; however, it may lower the overall number
of contigs discovered. By testing a variety of strict assembly parameters on a subset of the data, investigators can
find parameters that optimize both SNP and gene discovery for their individual organism.
Acknowledgements
We thank Carita Pascal for laboratory support, Steven Roberts
for aid in de novo assembly and all the University of Washington and Alaska Department of Fish and Game personnel who
collected and provided tissue samples. This manuscript was
partially funded by the Alaska Sustainable Salmon Fund
under Study no. 45908 from the National Oceanic and
Atmospheric Administration, US Department of Commerce,
administered by the Alaska Department of Fish and Game.
Additional funding for this project was provided by a grant
from the Gordon and Betty Moore Foundation, and a grant
from the Bristol Bay Regional Seafood Development Association. The statements, findings, conclusions and recommendations are those of the authors and do not necessarily reflect
the views of the National Oceanic and Atmospheric Administration, the US Department of Commerce, or the Alaska
Department of Fish and Game.
Conflict of Interest
The authors have no conflict of interest to declare and
note that the sponsors of the issue had no role in the
study design, data collection and analysis, decision to
publish, or preparation of the manuscript.
References
Allendorf FW, Thorgaard GH (1984) Tetraploidy and the evolution of
salmonid fishes. In: Evolutionary Genetics of Fishes (ed. Turner BJ), pp.
1–53. Plenum Press, New York.
Baird NA, Etter PD, Atwood TS et al. (2008) Rapid SNP discovery
and genetic mapping using sequenced RAD markers. PLoS ONE, 3,
e3376.
Bonaldo MDF, Lennon G, Soares MB (1996) Normalization and subtraction: two approaches to facilitate gene discovery. Genome Research, 6,
791–806.
Buggs RJA, Chamala S, Wu W et al. (2010) Characterization of duplicate
gene evolution in the recent natural allopolyploid Tragopogon miscellus
by next-generation sequencing and Sequenom iPLEX MassARRAY
genotyping. Molecular Ecology, 19, 132–146.
Bundock PC, Eliott FG, Ablett G et al. (2009) Targeted single nucleotide
polymorphism (SNP) discovery in a highly polyploid plant species
using 454 sequencing. Plant Biotechnology Journal, 7, 347–354.
Carre W, Wang XF, Porter TE et al. (2006) Chicken genomics resource:
sequencing and annotation of 35,407 ESTs from single and multiple tissue cDNA libraries and CAP3 assembly of a chicken gene index. Physiological Genomics, 25, 514–524.
Collins LJ, Biggs PJ, Voelckel C, Joly S (2008) An approach to transcriptome analysis of non-model organisms using short-read sequences.
Genome Informatics, 21, 3–14.
Elfstrom CM, Smith CT, Seeb JE (2006) Thirty-two single nucleotide polymorphism markers for high-throughput genotyping of sockeye salmon. Molecular Ecology Notes, 6, 1255–1259.
Ellegren H (2008) Sequencing goes 454 and takes large-scale genomics
into the wild. Molecular Ecology, 17, 1629–1631.
Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nature Methods, 6, S6–S12.
Goetz F, Rosauer D, Sitar S et al. (2010) A genetic basis for the phenotypic
differentiation between siscowet and lean lake trout (Salvelinus namaycush). Molecular Ecology, 19, 176–196.
Govoroun M, Le Gac F, Guiguen Y (2006) Generation of a large scale repertoire of Expressed Sequence Tags (ESTs) from normalised rainbow
trout cDNA libraries. BMC Genomics, 7, 196.
Habicht C, Seeb LW, Myers KW, Farley EV, Seeb JE (2010) Summer–Fall
Distribution of Stocks of Immature Sockeye Salmon in the Bering Sea
as Revealed by Single-Nucleotide Polymorphisms. Transactions of the
American Fisheries Society, 139, 1171–1191.
Hale MC, McCormick CR, Jackson JR, DeWoody JA (2009) Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake
sturgeon (Acipenser fulvescens): the relative merits of normalization and
rarefaction in gene discovery. BMC Genomics, 10, 203.
Harismendy O, Ng PC, Strausberg RL et al. (2009) Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology, 10, R32.
Hayes B, Laerdahl J, Lien S et al. (2007) An extensive resource of single
nucleotide polymorphism markers associated with Atlantic salmon
(Salmo salar) expressed sequences. Aquaculture, 265, 82–90.
Koop BF, von Schalburg KR, Leong J et al. (2008) A salmonid EST genomic study: genes, duplications, phylogeny and microarrays. BMC Genomics, 9, 545.
Kunstner A, Wolf JBW, Backstrom N et al. (2010) Comparative genomics
based on massive parallel transcriptome sequencing reveals patterns of
substitution and selection across 10 bird species. Molecular Ecology, 19,
266–276.
McGlauflin MT, Smith MJ, Wang JT et al. (2010) High-resolution melting
analysis for the discovery of novel single-nucleotide polymorphisms in
rainbow and cutthroat trout for species identification. Transactions of
the American Fisheries Society, 139, 676–684.
Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid
and cost-effective polymorphism identification and genotyping using
restriction site associated DNA (RAD) markers. Genome Research, 17,
240–248.
Morin PA, Martien KK, Taylor BL (2009) Assessing statistical power of
SNPs for population structure and conservation studies. Molecular Ecology Resources, 9, 66–73.
Morozova O, Hirst M, Marra MA (2009) Applications of new sequencing
technologies for transcriptome analysis. Annual Review of Genomics and
Human Genetics, 10, 135–151.
Parchman TL, Geist KS, Grahnen JA, Benkman CW, Buerkle CA
(2010) Transcriptome sequencing in an ecologically important tree
species: assembly, annotation, and marker discovery. BMC Genomics,
11, 180.
Peakall R, Smouse P (2006) GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology
Notes, 6, 288–295.
Raymond M, Rousset F (1995) Genepop (Version-1.2) - population-genetics software for exact tests and ecumenicism. Journal of Heredity, 86,
248–249.
Rousset F (2008) GENEPOP'007: a complete re-implementation of the
GENEPOP software for Windows and Linux. Molecular Ecology
Resources, 8, 103–106.
Ryynanen HJ, Primmer CR (2006) Single nucleotide polymorphism
(SNP) discovery in duplicated genomes: intron-primed exon-crossing
(IPEC) as a strategy for avoiding amplification of duplicated loci in
Atlantic salmon (Salmo salar) and other salmonid fishes. BMC Genomics,
7, 192.
Sanchez CC, Smith TPL, Wiedmann RT et al. (2009) Single nucleotide
polymorphism discovery in rainbow trout by deep sequencing of a
reduced representation library. BMC Genomics, 10, 559.
Schindler DE, Hilborn R, Chasco B et al. (2010) Population diversity and
the portfolio effect in an exploited species. Nature, 465, 609–U102.
Seeb JE, Pascal CE, Ramakrishnan R, Seeb LW (2009) SNP genotyping by
the 5′-nuclease reaction: advances in high throughput genotyping with
non-model organisms. In: Methods in Molecular Biology, Single Nucleotide
Polymorphisms, 2nd edn (ed. Komar A), pp. 277–292. Humana Press,
New York.
Seeb JE, Pascal CE, Grau ED et al. (2011) Transcriptome sequencing and
high-resolution melt analysis advance SNP discovery in duplicated salmonids. Molecular Ecology Resources, doi: 10.1111/j.1755-0998.2010.
02936.x.
Shendure J, Ji HL (2008) Next-generation DNA sequencing. Nature Biotechnology, 26, 1135–1145.
Slate J, Gratten J, Beraldi D et al. (2010) Gene mapping in the wild with
SNPs: guidelines and future directions (vol. 136, pg 97, 2009). Genetica
138, 467–467.
Smith CT, Seeb LW (2008) Number of alleles as a predictor of the relative
assignment accuracy of short tandem repeat (STR) and single-nucleotide-polymorphism (SNP) baselines for chum salmon. Transactions of
the American Fisheries Society, 137, 751–762.
Smith CT, Elfstrom CM, Seeb LW, Seeb JE (2005) Use of sequence data
from rainbow trout and Atlantic salmon for SNP detection in Pacific
salmon. Molecular Ecology, 14, 4193–4203.
Smith M, Pascal CE, Grauvogel Z et al. (2011) Multiplex preamplification
PCR and microsatellite validation allows accurate single nucleotide
polymorphism (SNP) genotyping of historical fish scales. Molecular
Ecology Resources, 11, 257–266.
Susko E, Roger AJ (2004) Estimating and comparing the rates of gene discovery and expressed sequence tag (EST) frequencies in EST surveys.
Bioinformatics, 20, 2279–2287.
Trick M, Long Y, Meng JL, Bancroft I (2009) Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa
transcriptome sequencing. Plant Biotechnology Journal, 7, 334–346.
Van Bers NEM, Van Oers K, Kerstens HHD et al. (2010) Genome-wide
SNP detection in the great tit Parus major using high throughput
sequencing. Molecular Ecology, 19, 89–99.
Wall P, Leebens-Mack J, Chanderbali A et al. (2009) Comparison of next
generation sequencing technologies for transcriptome characterization.
BMC Genomics, 10, 347.
Wolf JBW, Bayer T, Haubold B et al. (2010) Nucleotide divergence vs.
gene expression differentiation: comparative transcriptome sequencing
in natural isolates from the carrion crow and its hybrid zone with the
hooded crow. Molecular Ecology, 19, 162–175.
You FM, Huo NX, Gu YQ et al. (2008) BatchPrimer3: a high throughput
web application for PCR and sequencing primer design. BMC Bioinformatics, 9, 253.
Young M, Wakefield M, Smyth G, Oshlack A (2010) Gene ontology
analysis for RNA-seq: accounting for selection bias. Genome Biology, 11,
R14.
Appendix 3.
Allele frequency stability in large, wild exploited populations over multiple
generations: insights from Alaska sockeye salmon
(Manuscript in preparation for Canadian Journal of Fisheries and Aquatic Sciences)
Daniel Gomez-Uchida1,2, James E. Seeb1, Christopher Habicht3 & Lisa W. Seeb1*
1 School of Aquatic and Fishery Sciences, 1122 Boat St NE Box 355020 Seattle, WA 98195-5020
USA.
2 Departamento de Zoología, Facultad de Ciencias Naturales y Oceanográficas, Universidad de
Concepción, Casilla 160-C, Concepción, Chile.
3 Division of Commercial Fisheries, Alaska Department of Fish and Game, 333 Raspberry Road,
Anchorage, AK 99518, USA.
*Corresponding author
Acknowledgments
Lowell Fair, Tim Baker, and Colton Lipka helped identify and prepare the scale collections from
ADF&G archives. We are indebted to Carita Pascal, Eleni Petrou, and Taylor Gibbons for
support during the laboratory stages of this project. Mark Witteveen and Birch Foster at the
ADF&G kindly donated their expertise on Kodiak Island salmon management through
conversations and reports. This manuscript benefited from criticisms from colleagues at the
School of Aquatic and Fishery Sciences, University of Washington, who attend the Friday Lunch
Quantitative Seminar series. We thank Ryan Waples for stimulating discussions and feedback on
one of the figures. Funding for this research was provided by the Gordon and Betty Moore
Foundation and by the Alaska Sustainable Salmon Fund under Study #45908 from the National
Oceanic and Atmospheric Administration, U.S. Department of Commerce, administered by the
ADF&G. The statements, findings, conclusions, and recommendations are those of the authors
and do not necessarily reflect the views of the National Oceanic and Atmospheric
Administration, the U.S. Department of Commerce, or the ADF&G. Data for this study are
available at: to be completed after manuscript is accepted for publication.
Abstract
Genetic data are increasingly used to improve the management of commercially exploited
populations. Uncertainty about the temporal stability of allelic frequencies surrounds the
common use of multigenerational datasets in fisheries that are prosecuted on migrating
admixtures. We genotyped six pairs of (archived and contemporary) collections of Alaskan
sockeye salmon to estimate temporal divergence over a period of 25 – 42 years (4.9 – 8.4
generations). First, our results show that temporal changes were dramatically (between 40- and
250-fold) smaller than spatial changes in allele frequencies when based on nuclear SNPs;
differences were much less marked for mitochondrial SNPs. Second, the magnitude of temporal
change was generally consistent with a model of genetic drift: (i) large-FST or candidate SNPs for
diversifying selection were not more likely to show significant temporal changes than small-FST
or selectively neutral SNPs and (ii) the observed number of significant tests fell within estimates
predicted by a theoretical model relating sample size and effective population size (Ne). Third,
estimates of Ne and upper 95% CI were generally infinitely large, except for one paired
collection with unique life-history attributes of both a shorter smoltification phase and generation
time. Overall, these findings argue that allele frequency stability was pervasive over multiple
generations for most SNPs, despite the potential influence of diversifying selection and the
presence of significant temporal divergence in one paired collection. Use of multigenerational
datasets based on candidates for selection and putatively neutral SNPs seems a safe practice in
management of Alaska sockeye salmon that could be extended to other large, wild stocks.
Introduction
A noteworthy application of population genetics theory to resource management has been
genetic stock identification (GSI), also known as “mixed-stock” or “stock composition”
analyses, a type of assignment method that uses multilocus genotypes to ascertain the population
(or groups thereof) composition in a mixture sample (Pella and Milner 1987; Manel et al. 2005).
GSI has found enormous success in commercial species, endangered organisms, or both, that
require an accurate estimate of the number of population sources and their proportions.
Populations that have benefited from the GSI approach are diverse and include hawksbill turtles
(Browne, Horrocks, and Abreu-Grobois 2010), Canada geese (Mylecraine et al. 2008), honey
bees (Bourgeois et al. 2010), and many species of fish, especially salmonids (reviewed in Utter
and Ryman 1993; Waples, Punt, and Cope 2008). Like traditional assignment tests, GSI requires
a set of source or reference populations (“baseline”) to estimate individual membership
probabilities; unlike assignment tests, however, GSI incorporates the uncertainty of individual
assignment to get the stock composition of the mixture, rather than simply classify individuals to
their potential source, which generally translates into greater accuracy (Manel et al. 2005). Yet,
both methods may provide congruent outputs if genetic differentiation among populations of the
baseline is large (Potvin and Bernatchez 2001).
It is a common practice to establish baseline datasets from samples that were collected
over several generations (e.g., Beacham et al. 2004; Habicht et al. 2010). A frequently unverified
assumption is the temporal stability of baseline allele frequencies, despite some earlier in-depth
theoretical considerations (Waples 1990). On the other hand, temporal instability of allele
frequencies may yield unreliable GSI estimates, especially if it surpasses the magnitude of spatial
variation (Waples 1990). It is pertinent to evaluate the extent of this phenomenon on various
grounds, some of which have been previously discussed (Waples 1990). First, contemporary
datasets normally span several decades of sampling (e.g., Seeb et al. 2011; Templin et al. 2011),
which raises the question of whether baseline allele frequencies are representative of the entire
period in which GSI analyses take place. Second, not all markers within a set used for GSI
contain the same amount of information (Banks, Eichert, and Olsen 2003); performance among
the now widely-used single nucleotide polymorphisms (SNPs) depends heavily on levels of
genetic divergence, such as FST (Weir and Cockerham 1984). Large-FST SNPs generally
outperform small-FST SNPs (Ackerman, Habicht, and Seeb 2011; Hess, Matala, and Narum
2011). Large-FST loci in general are also more likely to be candidates for diversifying selection
(“outliers”) than small-FST loci, which on average behave as selectively neutral (Storz 2005).
But, are large-FST loci also more temporally unstable than small-FST loci? Because some outlier
SNPs may be linked to functional genes and their allele frequencies may change as a function of
latitude (Seeb et al. 2011) as well as environmental factors (Bradbury et al. 2010), this question
deserves further consideration.
Changes in allele frequency over two or more generations are often the result of genetic
drift, which occurs at a rate that is inversely proportional to the effective size of a population
(Ne). Populations of large Ne are thus expected to drift less than populations of small Ne
(Ostergaard et al. 2003; Vaha et al. 2008; Therkildsen et al. 2010; but see Garant, Dodson, and
Bernatchez 2000). The estimation of Ne using genetic data from molecular markers (reviewed in
Wang 2005) has accordingly found renewed interest among applied geneticists during the last
decade, chiefly because access to life-time demographic parameters (e.g., variance in
reproductive success) is limited in wild populations, and because genetic estimates of Ne can be
an indicator of population health when related to census or adult population sizes (Palstra and
Ruzzante 2008). Yet, Ne can be notoriously difficult to estimate, and too often this parameter
may be influenced by factors other than genetic drift, especially age-structure and population
subdivision (Waples 2010).
Here we used six pairs of collections of sockeye salmon (Oncorhynchus nerka, Walbaum
1792), which were temporally spaced between 25 and 42 years (4.9 and 8.4 generations), to
gauge the stability of allele frequencies over multiple generations in three lake systems
throughout Alaska. Anadromous sockeye salmon undergo a short freshwater migration to the
spawning grounds where they were hatched after spending several years growing in the northern
Pacific Ocean (Quinn 2005). This annual migration has supported one of the world’s largest,
most profitable fisheries—its landed value was estimated at US$ 7.9 billion between 1950 and
2008 in Bristol Bay (Schindler et al. 2010). Using a suite of 85 nuclear and 3 mitochondrial
SNPs, we addressed three specific objectives. First, we tested whether temporal changes in allele
frequencies were smaller than spatial changes in allele frequencies, one fundamental assumption
of GSI, using population-based spatial statistics. There is considerable support for this hypothesis
in sockeye salmon (Beacham et al. 2004; Habicht et al. 2010; Creelman et al. 2011); to our
knowledge, however, no studies have used the timescales presented here or compared nuclear to
mitochondrial SNPs. Second, we tested whether temporal changes in allele frequencies (or
generational divergence) were significant and consistent with a model of pure genetic drift,
where directional forces like natural selection may be negligible. We investigated whether (i)
large-FST SNPs, as revealed by an outlier detection method or “genome scan”, were more likely
to exhibit significant temporal changes in allele frequencies than small-FST SNPs and (ii) the
observed number of temporal significant tests among SNPs was proportional to the ratio of
sample size to Ne, according to theoretical predictions (Waples 1989). Third, we tested whether
estimates of variance Ne, using an unbiased estimator of the so-called temporal method (Jorde
and Ryman 2007), varied significantly among collections, and were thus good predictors of the
magnitude of genetic drift or temporal change. Because spawning populations of sockeye salmon
are often composed of thousands of individuals (Schindler et al. 2010), we made the general
prediction that estimates of variance Ne should be infinitely large. However, we also
hypothesized that variable life and colonization histories within sockeye salmon (Wood 1995)
may influence variance Ne, as life-history types differ in demographic attributes (Wood et al.
2008).
Methods
Experimental design
Collections of archived dry scales and contemporary ethanol-preserved tissue (fin and
heart) originated from three lake systems: Bear Lake, Red Lake, and South Olga Lakes (Fig. 1).
The first is located in the northern Alaska Peninsula; the second and third are located in the
western region of Kodiak Island, Alaska. These were chosen because they had (i) the oldest
archived collections, (ii) no documented history of hatchery practices or introductions, and (iii)
no evidence of population admixture, either natural or human-mediated (L. Fair and M.
Witteveen, Alaska Department of Fish and Game, Anchorage, pers. comm. 2010). Adult
migration has a bimodal temporal distribution in Bear Lake and South Olga lakes; Red Lake, on
the other hand, is composed of several peaks (Fig. 2). The Alaska Department of Fish and Game
(ADF&G) has managed each lake system as two stocks—EARLY and LATE—depending on the
date of migration (Fig. 2), albeit Red Lake was managed as a single stock between 1989 and
2010 (M. B. Foster, ADF&G, pers. comm. 2010). EARLY and LATE runs are linked to
spawning ecotypes: EARLY fish often spawn among tributary inlets, whereas LATE fish spawn
among lake shoals or river outlets (Wood 1995). Early life-history strategies differ among lake
systems depending on the duration and location of the smoltification phase (Wood 1995).
Juveniles from Bear Lake, Red Lake and the EARLY run from South Olga Lakes normally
spend 1 – 2 years rearing in lacustrine habitats (‘lake’ ecotype); offspring from the LATE run
from South Olga Lakes, however, spend less than a year rearing in river (‘sea’ ecotype) or
estuarine habitats as suggested by age composition analyses (M. B. Foster, unpublished data; M.
Witteveen, pers. comm. 2010).
We sampled a total of 12 collections (Table 1). Samples from each lake system and run
(EARLY and LATE) were taken at two different generations, 0 and t (i.e., a paired collection).
Generation 0 (g0) corresponds to the oldest collections within each pair, which were taken
between 1966 and 1976, whereas generation t (gt) corresponds to the most recent collections,
which were taken between 1995 and 2009. We often combined individuals taken over the course
of multiple days during the spawning season, despite some exceptions among contemporary
collections that were taken within the same day (Fig. 2). We avoided scales taken too close to the
recognized date boundaries that historically differentiate EARLY and LATE collections (Fig. 2).
In a few collections, it was necessary to pool samples over multiple years to attain reasonable
composite sample sizes (n = 71 – 96: Table 1). We combined samples over a maximum of three
years into collections to avoid pooling across different generations. Nomenclature for the
collections combined all three hierarchical sampling levels: lake system, run timing, and
generation. For instance, the most recent collection from South Olga Lakes taken from the
EARLY run would be OLGA_EARLY_gt (t was subsequently replaced by an estimate of the
number of generations since g0; see Estimation of variance Ne subsection on how to calculate t).
Genotyping
We followed the genotyping protocol of Seeb et al. (2009) with some modifications for archived
scales (Smith et al. 2011). Genotyping for a panel of 96 SNPs, composed of 93 nuclear and 3
mitochondrial markers (Smith et al. 2005; Elfstrom, Smith, and Seeb 2006; Habicht et al. 2010;
C. Storer, unpublished) was performed using Fluidigm® 96.96 dynamic arrays and uniplex PCR
reactions. Genotype calls were conducted in Fluidigm® Genotyping Analysis proprietary
software by two independent researchers. A quality control step included re-genotyping of 8% of
the samples from each collection. SNPs that failed to consistently amplify across all collections
(> 10% failure rate) were excluded from further analyses.
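The exclusion rule above can be sketched in a few lines. This is only an illustration: the data layout (a dictionary of calls per SNP with `None` for a failed amplification), the SNP names, and the choice to pool the failure rate over all samples rather than per collection are assumptions, not details given in the text.

```python
def filter_snps(calls, max_failure_rate=0.10):
    """Split SNP names into (kept, dropped) by their failure rate.

    `calls` maps a SNP name to a list of genotype calls, with None marking
    a failed amplification (hypothetical encoding, not the Fluidigm format).
    A SNP is dropped only when its failure rate is strictly above the cutoff.
    """
    kept, dropped = [], []
    for snp, genotypes in calls.items():
        failure_rate = sum(g is None for g in genotypes) / len(genotypes)
        if failure_rate > max_failure_rate:
            dropped.append(snp)
        else:
            kept.append(snp)
    return kept, dropped


# Toy data with made-up SNP names: snp_A fails in 1 of 10 samples, snp_B in 2 of 10.
toy_calls = {
    "snp_A": ["AA", "AG", None, "GG", "AG", "AA", "AA", "GG", "AG", "AA"],
    "snp_B": ["CC", None, None, "CT", "CC", "TT", "CT", "CC", "CC", "CT"],
}
kept_snps, dropped_snps = filter_snps(toy_calls)
```

With the strict inequality, a SNP at exactly 10% failure is retained, which matches the "> 10%" wording.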
Exploratory analyses
We performed several tests to verify the integrity of the data and calculate general statistics. For
nuclear SNPs, we tested for deviations from Hardy-Weinberg equilibrium (HWE) and linkage
equilibrium on each locus using GENEPOP 4.0 (Rousset 2008). Heterozygote excess or deficit
was reported through inbreeding coefficients per collection (FIS: Weir and Cockerham 1984). For
HWE tests over all SNPs within collections, we implemented a binomial likelihood method that
is less sensitive to deviations in a few loci (Moran 2003); for linkage equilibrium over all SNPs,
we used Fisher’s method in GENEPOP. We estimated observed and expected heterozygosities in
GENALEX (Peakall and Smouse 2006), while allelic richness was estimated in FSTAT (Goudet
1995, 2001). Expected heterozygosity and allelic richness were compared between lake systems
through randomizations in FSTAT. All three mitochondrial SNPs were combined in alphabetical
order and analyzed as composite haplotypes, which were referred to by their nucleotide
composition (e.g., “CGG”). We estimated haplotype diversity in GENALEX. For all software
settings, see Gomez-Uchida et al. (2011).
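For readers unfamiliar with Fisher's method, which the paragraph above uses to combine linkage-equilibrium tests over SNPs, the following is a minimal sketch of the calculation. It is not GENEPOP's code; the function name is hypothetical, and the p-values are assumed to be independent.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    X = -2 * sum(ln p) follows a chi-square distribution with 2k degrees of
    freedom under the joint null hypothesis. For even degrees of freedom the
    upper tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("need at least one p-value in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_x) * total
```

A single p-value is returned unchanged, and two p-values of 0.5 combine to 0.25 × (1 + ln 4), which gives a quick sanity check on the closed form. For very large numbers of tests a log-space implementation would be safer numerically.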
Temporal vs. spatial divergence
Divergence between collections was classified according to the type of comparison: (i) within
lakes (WL) or between lakes (BL), (ii) within runs (WR) or between runs (BR), and (iii) within
generations (WG) or between generations (BG); these were combined into composite categories
of divergence. For example, a comparison between RED_LATE_g0 and RED_LATE_gt would
be classified WLWRBG for temporal or generational divergence. Thus, spatial divergence
categories would be WLBRWG for run-timing divergence, and BLWG for lake divergence.
Comparisons WR or BR between lake systems as well as hybrid categories (e.g., WLBRBG or
BLBG) were not considered.
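The classification scheme above can be written out as a small function. This is a sketch that assumes collection names follow the LAKE_RUN_gX pattern described under Experimental design; the function names are hypothetical, and any generation label other than "g0" is treated as the contemporary collection.

```python
from collections import Counter
from itertools import combinations


def divergence_category(a, b):
    """Return the composite category for two collections, or None when the
    comparison is one the text says was not considered."""
    lake_a, run_a, gen_a = a.split("_")
    lake_b, run_b, gen_b = b.split("_")
    same_lake = lake_a == lake_b
    same_run = run_a == run_b
    same_gen = (gen_a == "g0") == (gen_b == "g0")
    if same_lake and same_run and not same_gen:
        return "WLWRBG"  # generational (temporal) divergence
    if same_lake and not same_run and same_gen:
        return "WLBRWG"  # run-timing divergence
    if not same_lake and same_gen:
        return "BLWG"    # lake divergence, run timing ignored
    return None          # hybrid categories such as WLBRBG or BLBG


def count_categories(collections):
    """Tally categories over all unordered pairs of collections."""
    return Counter(divergence_category(a, b)
                   for a, b in combinations(collections, 2))


ALL_COLLECTIONS = [f"{lake}_{run}_{gen}"
                   for lake in ("BEAR", "RED", "OLGA")
                   for run in ("EARLY", "LATE")
                   for gen in ("g0", "gt")]
```

Applied to the 12 collections, this scheme yields 6 generational, 6 run-timing, and 24 lake comparisons out of 66 possible pairs, with the remaining 30 falling into the hybrid categories that were set aside.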
Each collection was treated as a separate population for the purpose of estimating
divergence. For nuclear SNPs, we estimated pairwise divergence between collections through
FST (Weir and Cockerham 1984), which was calculated in GENEPOP. Significance of FST values
(i.e., if FST > 0) was evaluated using locus-specific χ2 tests of differentiation implemented in
CHIFISH (Ryman 2006). Multilocus significance was estimated through Pearson test on χ2
values. For mitochondrial haplotypes, we estimated pairwise divergence between collections
through a standardized genetic distance, DS (Nei 1972) implemented in GENALEX. Pairwise
tests of genetic differentiation on haplotype data were performed through contingency tables of
haplotype counts and exact tests of heterogeneity using R (R Development Core Team 2010). To
address multiple comparisons over loci or collections, we compared the nominal type I error rate (p =
0.05) to the one adjusted using a 10% false discovery rate (padj), implemented in R following
Benjamini and Yekutieli (2001). Estimates of divergence (FST, DS) were visualized in two ways:
first, we compared the distribution of values among hierarchical groups of collections using dot
charts; second, we computed principal coordinate analysis via the covariance matrix of genetic
distances using standard options in GENALEX.
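The false discovery rate correction mentioned above can be illustrated with a short sketch of the Benjamini and Yekutieli step-up procedure. The text reports an adjusted significance threshold (padj), whereas this sketch returns adjusted p-values to be compared against the 10% rate; the two views lead to the same rejections. The function name is hypothetical and this is not the authors' R code.

```python
def by_adjust(p_values):
    """Benjamini-Yekutieli adjusted p-values, in the input order.

    Each p-value of ascending rank r is scaled by m * c(m) / r, where
    c(m) = sum_{i=1..m} 1/i, and monotonicity is enforced by taking a
    running minimum from the largest p-value downwards.
    """
    m = len(p_values)
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        scaled = p_values[idx] * m * c_m / rank
        running_min = min(running_min, scaled)
        adjusted[idx] = running_min
    return adjusted


def significant_at_fdr(p_values, fdr=0.10):
    """Flag the tests whose adjusted p-value does not exceed the FDR level."""
    return [q <= fdr for q in by_adjust(p_values)]
```

Because c(m) grows with the number of tests, the procedure is more conservative than the Benjamini-Hochberg correction, which is the price paid for validity under arbitrary dependence among tests.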
The distribution of genetic variance among hierarchical sampling levels was computed
through an analysis of molecular variance (AMOVA) in ARLEQUIN 3.5 (Excoffier, Laval, and
Schneider 2005) for nuclear SNPs and GENALEX for mitochondrial SNPs. Such analysis
enabled the estimation of temporal to spatial variance ratios. Components of the genetic variance
were reported through sums of squares and F-statistics (nuclear SNPs) or Φ-statistics
(mitochondrial SNPs).
Expectations of generational divergence among SNPs under pure genetic drift
First, we identified candidate SNPs for selection by relating genetic diversity and differentiation
within and between populations (Storz 2005), using ARLEQUIN 3.5. We assumed a hierarchical
island model, because it best describes the complex genetic structure among sockeye salmon
populations, where gene flow is more predominant between demes within groups than among
groups (Gomez-Uchida et al. 2011). Settings included 10,000 simulations, 100 demes (= paired
collections), and 10 groups (= lake systems). Minimum and maximum expected heterozygosities
were set at 0 and 0.5, respectively. We focused exclusively on candidates for diversifying
selection found outside the upper quantiles, and hence ignored low-differentiation candidate
SNPs for balancing selection, usually found outside the lower quantiles.
Second, we investigated if generational divergence was consistent with expectations
under pure stochastic forces, namely sampling error and genetic drift. A generalized model by
Waples (1989) was used to calculate the probability of a locus-specific Pearson χ2 test being
significant according to $p_e(\chi^2 > 3.84 / C)$, where 3.84 is the critical χ2 value for a diallelic locus (χ2
= 3.84, d.f. = 1, p = 0.05) and C is a scaling factor that is proportional to the ratio of sample size
(n) to effective population size (Ne) and can be approximated by (Waples 1989):
\[ C \approx 1 + \frac{\tilde{n}\, t}{2 N_e}; \qquad (1) \]
this expression should be valid for a broad range of values and applicable to both sampling plans
(before and after reproduction), unless t is too large and Ne is too small (Waples 1989). The
expected number of locus-specific significant tests (pe) was then calculated from estimates of
variance Ne, t (see Estimation of variance effective population size, below), and ñ, the harmonic
mean of sample sizes taken at g0 and gt, and compared to the observed number of locus-specific
significant tests (po). In case variance Ne was infinity, we used an estimate of census population
size (N). In addition, we identified SNPs showing significant (p < 0.05) generational divergence
within each set of paired collections and compared them to candidates for diversifying selection
identified above. Our rationale was that if nonrandom forces, such as natural selection, drive
generational divergence at specific SNPs, these may appear in multiple paired collections and
may match putative outlier SNPs (e.g., Jump et al. 2006).
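The drift expectation in eqn (1) can be turned into numbers with a few lines of code. This sketch reads the scaling factor as C ≈ 1 + ñt/(2Ne), with ñ the harmonic mean of the two sample sizes, and uses the closed-form upper tail of a χ2 distribution with one degree of freedom. Function names and the example inputs are hypothetical, and multiplying the probability by the number of loci to obtain an expected count is an assumption about how the expectation was formed.

```python
import math


def harmonic_mean(n0, nt):
    """Harmonic mean of the sample sizes taken at g0 and gt."""
    return 2.0 * n0 * nt / (n0 + nt)


def scaling_factor(n0, nt, t, ne):
    """C ~= 1 + n~ * t / (2 * Ne): drift inflates the chi-square statistic."""
    return 1.0 + harmonic_mean(n0, nt) * t / (2.0 * ne)


def prob_significant(c, critical=3.84):
    """P(chi-square with 1 d.f. > critical / C).

    For one degree of freedom the upper tail is erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(critical / c / 2.0))


def expected_significant(n_loci, n0, nt, t, ne):
    """Expected count of significant locus-specific tests (assumed to be
    the per-locus probability times the number of loci)."""
    return n_loci * prob_significant(scaling_factor(n0, nt, t, ne))


# Hypothetical inputs: 85 loci, 90 fish per sample, 6 generations, Ne = 5000.
example_expectation = expected_significant(85, 90, 90, 6.0, 5000.0)
```

When Ne is very large relative to ñt, C approaches 1 and the probability falls back to the nominal 5%, so only sampling error is expected to produce significant tests.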
Estimation of variance (Ne) and census population size (N)
We used an unbiased moment-based estimator of the standardized shift in allele frequencies, F’S
(Jorde and Ryman 2007) to calculate variance Ne for each paired collection. The estimation of
F’S has been implemented in the software TempoFs (http://www.zoologi.su.se/~ryman/). The
method assumes that generations are discrete and that there is no gene flow; thus, temporal
changes in allele frequencies are the sole result of genetic drift. We estimated Ne from paired collections (e.g.,
OLGA_EARLY_g0 and OLGA_EARLY_gt) separated by t generations; t was estimated from
the difference in years between the oldest and most recent collections divided by the mean
generation time, G. The parameter G was calculated according to $G = \sum_{i=1}^{j} p_i \, i$, where pi is the
proportion of individuals of age i (iterated through age j; Felsenstein 1971). Sampling at g0 and
gt followed plan I or after reproduction (Waples 1989). The age composition of the escapement
was estimated from brood tables prepared during 1985 – 2010 (M. B. Foster, unpublished).
Pairwise differences in age composition between paired collections were assessed by means of
exact tests of heterogeneity implemented in R.
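The calculation of G and t described above amounts to a weighted mean of ages followed by a division. A minimal sketch, with hypothetical function names and a made-up age composition, might look like this:

```python
def mean_generation_time(age_proportions):
    """G = sum_i p_i * i, with p_i the proportion of spawners of age i.

    `age_proportions` maps age in years to its proportion; the proportions
    are required to sum to 1.
    """
    total = sum(age_proportions.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("age proportions must sum to 1")
    return sum(age * p for age, p in age_proportions.items())


def generations_elapsed(year_g0, year_gt, generation_time):
    """t = years between the two collections divided by G."""
    return (year_gt - year_g0) / generation_time


# Made-up age composition: 30% age 4, 60% age 5, 10% age 6.
toy_g = mean_generation_time({4: 0.3, 5: 0.6, 6: 0.1})
toy_t = generations_elapsed(1970, 2005, toy_g)
```

A stock with younger spawners has a smaller G and therefore accumulates more generations, and more opportunity for drift, over the same number of calendar years.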
Census population sizes (N) for each paired collection were estimated by summing daily
average escapements (between 1966 and 2009: see Fig. 2) across days that define EARLY and
LATE run timing within each lake system: BEAR_EARLY, 10-Jun to 31-Jul; BEAR_LATE, 1-Aug to 15-Sep; RED_EARLY, 29-May to 15-Jul; RED_LATE, 16-Jul to 1-Sep; OLGA_EARLY, 29-May to 15-Jul; and OLGA_LATE, 16-Jul to 15-Sep.
Results
Genotyping
Eight SNPs failed to consistently amplify across all collections (> 10% failure rate) and were
therefore excluded, leaving 85 nuclear and three mitochondrial SNPs for all ensuing statistical
analyses (average and median amplification success = 98%; range of amplification success
among loci: 93 – 100%). The quality control step found mismatches in 6 of 8579 genotypes;
mismatches were exclusively heterozygote-homozygote calls (or vice versa) for a discrepancy
rate of 0.07%.
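As a quick arithmetic check of the figures in this paragraph, using only numbers reported in the text (the panel of 93 nuclear and 3 mitochondrial SNPs, the eight exclusions, and the quality-control counts):

```latex
\[
  96 - 8 = 88 = 85 \ (\text{nuclear}) + 3 \ (\text{mitochondrial}),
  \qquad
  \frac{6}{8579} \approx 0.0007 = 0.07\%.
\]
```

Since 93 − 8 = 85, the counts imply that all eight excluded SNPs were nuclear markers.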
Exploratory analyses
Tests. Forty-three out of 936 tests for HWE were significant at a nominal type I error rate of 5%
(p = 0.05), a number expected by chance without correction for multiple tests (Pearson χ21,1 =
1.1, p = 0.56). No evidence was found to reject the joint null hypothesis of no deviations for
HWE in any collection (Table 1). Evidence for gametic disequilibrium was found in one
collection, OLGA_EARLY_g0 (Fisher’s p = 0.012). Furthermore, significant evidence (p <
0.001) for physical linkage was found between three locus-pairs across multiple collections: (i)
One_MHC2-190 and One_MHC2-251 (12 collections), (ii) One_Tf_ex11-750 and One_Tf_in3-182 (9 collections), and (iii) One_GPDH-201 and One_GPDH2-187 (7 collections). These
results agree with previous findings in sockeye salmon from different Alaskan drainages
(Habicht et al. 2010; Creelman et al. 2011).
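The statement that 43 significant HWE tests is "a number expected by chance" can be checked with one line of arithmetic, assuming each of the 936 tests has a 5% chance of being significant when the null hypothesis holds:

```latex
\[
  E[\text{significant tests}] = 0.05 \times 936 = 46.8,
\]
```

which is close to, and slightly above, the 43 significant tests observed.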
Nuclear SNPs. We found significant evidence for differences in genetic diversity between lake
systems (FSTAT: p = 0.01), including mean estimates (± standard deviation) of expected
heterozygosity (Bear Lake: HE = 0.269 ± 0.008; Red Lake: HE = 0.284 ± 0.005; South Olga
Lakes: HE = 0.279 ± 0.005) and allelic richness (Bear Lake: AR = 1.916 ± 0.009; Red Lake: AR =
1.919 ± 0.010; South Olga Lakes: AR = 1.941 ± 0.005).
Mitochondrial SNPs. Haplotype CGG was the most common in Bear Lake and Red Lake (> 0.6),
followed by haplotype TAG (< 0.4); conversely, haplotype TAG was generally the most
common for the paired OLGA_EARLY collection, followed by haplotype CGG (Fig. 3). For the
OLGA_LATE paired collection, however, both haplotypes had fairly similar frequencies. In
South Olga Lakes we also found one unique haplotype (CAA = 0.141 in OLGA_EARLY_g0;
Fig. 3) and one rare haplotype (TGA) that had a higher frequency within OLGA_LATE_g0 and
OLGA_LATE_g6.5 than Red Lake (Fig. 3). We found that the mean (± standard deviation)
haplotype diversity was the lowest for Bear Lake (h = 0.337 ± 0.064), intermediate for Red Lake
(h = 0.463 ± 0.040), and the highest for South Olga Lakes (h = 0.569 ± 0.116).
Temporal vs. spatial divergence
Nuclear SNPs. The smallest divergence was always generational (WLWRBG; FST range: 0.0016
– 0.0047; Fig. 4a). No significance was found following correction for multiple tests (padj =
0.001) except for the paired OLGA_LATE collection (FST = 0.0047, Fisher’s p = 0.00008,
Pearson χ2 p < 0.0001). Generational divergence was followed by intermediate values of run-timing divergence (WLBRWG; FST range: 0.0075 – 0.0295; Fig. 4a) that showed significant
multilocus probabilities (Table 2). Lake divergence was always the largest (BLWG; FST range:
0.0408 – 0.0924; Fig. 4a) and significant in all cases (Table 2). This hierarchy of divergence
among FST was also evident in a principal coordinate analysis: distances separating generational
comparisons were the shortest (with nuances between lake systems), followed by run-timing and
lake comparisons that had intermediate and the longest distances in the two-dimensional space,
respectively (Fig. 5a). All three lake systems were reciprocally different: South Olga Lakes was
as different from Red Lake as from Bear Lake, despite large differences in distances among these
sites (Figure 1).
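The hierarchy described above compares FST values across generational, run-timing, and lake contrasts. As a rough illustration of what a pairwise FST measures, the sketch below computes FST = (HT - HS) / HT from allele frequencies at diallelic SNPs. This is an assumption-laden simplification: it weights the two populations equally and applies no sample-size correction, the estimator used in the survey may differ, and the allele frequencies in the usage lines are hypothetical.

```python
def fst_two_populations(freqs_a, freqs_b):
    """Multilocus FST = (HT - HS) / HT for two populations at diallelic loci.

    freqs_a, freqs_b: frequencies of the same reference allele at each SNP.
    HS is the mean within-population expected heterozygosity; HT is the
    expected heterozygosity at the mean allele frequency. Sums over loci
    are taken before forming the ratio.
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both populations need the same set of loci")
    hs_total = 0.0
    ht_total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        hs_total += (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
        p_bar = (pa + pb) / 2
        ht_total += 2 * p_bar * (1 - p_bar)
    if ht_total == 0:
        return 0.0  # every locus is fixed for the same allele in both populations
    return (ht_total - hs_total) / ht_total


# Hypothetical frequencies at three SNPs in two collections.
fst = fst_two_populations([0.40, 0.25, 0.70], [0.60, 0.30, 0.65])
```

Identical allele frequencies give FST = 0 and alternate fixation gives FST = 1, so the reported ranges (thousandths for generational contrasts, hundredths for lake contrasts) correspond to small but graded shifts in allele frequency.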
Mitochondrial SNPs. DS values overlapped among categories, especially at small levels of
divergence (Fig. 4b). Yet, categories differed in range of DS values as did the probability for the
null hypothesis of no differentiation between collections, which was especially pronounced
between South Olga Lakes and the other two lake systems (Table 2). The range for generational
divergence was DS = 0.001 – 0.026 with no significant comparisons found after correction for
multiple tests (padj = 0.014; Fig. 4b). The range for run-timing divergence was DS = 0.000 –
0.181 with only four significant comparisons, whereas the range for lake divergence was DS =
0.000 – 0.730 and most comparisons were significant (Fig. 4b; Table 2). Principal
coordinate analysis showed no hierarchical distribution of DS values among categories
for Red Lake and Bear Lake, in contrast to South Olga Lakes (Fig. 5b). Bear Lake and Red Lake
clustered together, despite the geographic distance separating them, whereas South Olga Lakes
formed a genetically distinct group, with marked differences between EARLY and LATE
collections (Fig. 5b).
Hierarchical AMOVA. For nuclear SNPs, and after the variation found within collections, the
largest component of the genetic variance was found between lakes, followed by run-timing
within lakes, and their FST values were significantly greater than zero. The component between
generations within run-timing was the smallest and its FST was no different from zero (Table 3).
Absolute FST ratios between components suggested that the generational component was nearly
250-fold and 40-fold smaller than the lake and run-timing components, respectively. For
mitochondrial SNPs, ΦST for all three components were significantly greater than zero; ΦST ratios
between components, on the other hand, approached unity (Table 3).
Was generational divergence consistent with a model of pure genetic drift?
Candidate SNPs for diversifying selection. The number of outlier SNPs varied between 2 (99th
quantile) and 6 (95th quantile) depending on the threshold of the theoretical heterozygosity-FST
distribution (Fig. 6). The range of FST for these SNPs varied between 0.146 (One_RFC2-285) and
0.549 (One_Tf_in3-182). Two outliers mapped to well-described transferrin proteins in
salmonids (One_Tf_in3-182 and One_Tf_ex11-750: Ford 2001), while a third (One_HpaI-99)
mapped to a family of short interspersed elements in salmonids (Kido et al. 1991). A fourth
outlier (One_RFC2-285) mapped to the replication factor C, subunit 2, of Atlantic salmon
(Leong et al. 2010). Two outliers had no described annotation.
Observed (po) and expected number of significant tests (pe) among SNPs. Estimates of po varied
between 5.0% and 14.1%; the highest estimate was found in the paired OLGA_LATE collection
(Table 4). In general, po was smaller than pe in all but one paired collection (BEAR_EARLY),
suggesting that generational divergence could be explained by genetic drift alone in most cases
(Table 4). The number of SNPs that exhibited significant temporal differentiation varied between 1
(RED_EARLY) and 12 (OLGA_LATE), and in total, 26 SNPs showed significant temporal variation to
a variable degree. There was little overlap among SNPs across paired collections: only two
markers—One_RAG3-93 and One_ghsR-66—appeared twice (Table 4). With one exception
(One_HpaI-99), no outlier SNPs or candidates for diversifying selection matched SNPs that
showed significant generational divergence.
Ne, age composition, and N
Estimates of Ne were generally infinitely large: we found no finite upper 95% CI in the
majority of paired collections (Table 5). Only OLGA_LATE had finite 95% CI estimates for Ne
(Table 5). Census population sizes (N) fluctuated between ~40,000 (OLGA_EARLY) and
~200,000 (BEAR_EARLY) and suggest differences in productivity between lake systems as well
as between collections with different run timing within the same lake system (Table 5).
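"Infinitely large" estimates have a simple arithmetic origin in moment-based temporal estimators: when the standardized allele-frequency change between samples is no larger than the change expected from sampling alone, nothing is left to attribute to drift. The sketch below illustrates this with a generic temporal estimator (the Fc statistic with a sampling correction of the form 1/(2S0) + 1/(2St)). It is an assumption for illustration, not necessarily the estimator behind Table 5, and all frequencies, sample sizes, and generation counts in the usage lines are hypothetical.

```python
import math


def fc_diallelic(x, y):
    """Standardized variance in allele frequency change (Fc) for one SNP.

    x and y are the frequencies of the same allele in the earlier and
    later samples. Returns None when the locus is fixed for the same
    allele in both samples, because Fc is then undefined.
    """
    denom = (x + y) / 2.0 - x * y
    if denom == 0:
        return None
    return (x - y) ** 2 / denom


def temporal_ne(freqs_t0, freqs_t1, s0, s1, generations):
    """Moment-based temporal estimate of Ne from two samples.

    s0 and s1 are the numbers of individuals sampled at each time point.
    Returns math.inf when sampling error alone accounts for the observed
    allele-frequency change.
    """
    values = [fc_diallelic(x, y) for x, y in zip(freqs_t0, freqs_t1)]
    values = [v for v in values if v is not None]
    if not values:
        raise ValueError("no polymorphic loci to analyse")
    f_mean = sum(values) / len(values)
    drift_signal = f_mean - 1.0 / (2 * s0) - 1.0 / (2 * s1)
    if drift_signal <= 0:
        return math.inf
    return generations / (2.0 * drift_signal)


# Hypothetical inputs: one SNP, 100 fish per sample, 6 generations apart.
ne_finite = temporal_ne([0.5], [0.6], 100, 100, 6)
ne_infinite = temporal_ne([0.5], [0.51], 100, 100, 6)
```

In the second call the allele-frequency shift is smaller than the sampling correction, so the estimate is infinite; this is the situation the text reports for most paired collections.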
Age composition varied significantly only between OLGA_LATE and the remaining
collections (all exact tests, p < 0.0001); differences were especially obvious for age-3 and age-6
fish (Table 6), which appeared overrepresented and underrepresented in OLGA_LATE,
respectively. Interestingly, generation time was roughly one year shorter for OLGA_LATE than
for the other paired collections (Table 6).
Discussion
Temporal vs. spatial divergence
The temporal stability of baseline allele frequencies is one of the underpinnings of GSI: even
though allele frequency shifts between generations were expected due to genetic drift, these
ought to be smaller than changes in allele frequencies in space for GSI to provide reliable
estimates (Waples 1990; Beacham and Withler 2010). Our survey validated this assumption for
nuclear SNPs over a period of 25 – 42 years (4.9 – 8.4 generations): absolute FST ratios between
spatial and temporal AMOVA components suggested that generational divergence was on
average nearly 40-fold smaller than run-timing divergence, and 250-fold smaller than lake
divergence. Larger spatial variation than temporal variation in allele frequencies appears to be a
trademark of baselines in Pacific salmonids (Beacham and Withler 2010), despite some
exceptions (Heath et al. 2002; Walter et al. 2009). Recently, Walter et al. (2009) showed that
temporal variation surpassed spatial variation for a group of Chinook salmon populations from
the Upper Fraser River in British Columbia. This conclusion, however, has been questioned on
technical and statistical grounds, sparking some debate given the importance of GSI for Chinook
salmon management (Beacham and Withler 2010; Walter, Shrimpton, and Heath 2010).
DS estimates from mitochondrial SNP haplotypes, on the other hand, showed overlap
between lake, run-timing, and generational comparisons, and ΦST ratios from the AMOVA were
close to 1. This suggests fundamental differences between the two SNP types; in particular,
mitochondrial SNPs offered limited GSI resolution within the scope of this survey. However, a
more thorough assessment is needed, given differences in the spatial scale of this study and GSI
applications that may include the entire north Pacific Ocean (Habicht et al. 2010), to validate this
conclusion. Furthermore, mitochondrial SNPs may be useful for identifying unique (maternal)
lineages overlooked by nuclear SNPs, which frequently experience recombination (see section
Divergent colonization and life histories in sockeye salmon, below).
The implementation of population-based statistics required the assumption that each
paired collection represented a separate population; this is difficult to verify, because sampling at
weirs occurred before the fish reached their spawning sites. Nevertheless, several statistics suggest
that each paired collection may be a cohesive reproductive unit and thus considered a ‘population’
from an evolutionary perspective (Waples and Gaggiotti 2006), even if each is composed of fish
that spawn at multiple sites (e.g., Bear Lake: Boatright, Quinn, and Hilborn 2004). First, negative
or near-zero (and generally nonsignificant) pairwise FST values for generational comparisons suggest that,
most likely, the same spawning aggregations have been sampled at two different generations
(intrapopulation FST). Second, no consistent deviations from HWE and gametic disequilibrium
were evident across collections, suggestive of negligible population admixture. The only
exception was OLGA_EARLY_g0, for which we found evidence for gametic disequilibrium but
no deviations from HWE. However, the intrapopulation FST between the oldest and most recent
collections (OLGA_EARLY_g0 vs. OLGA_EARLY_g6.9) was no different than 0, implying
that admixture, if any, appears to have a minor contribution to differentiation.
Was generational divergence consistent with a model of pure genetic drift?
Generational divergence among outlier SNPs. First, only one candidate SNP showed significant
generational divergence in OLGA_LATE; most of the other well-annotated outliers linked to
potentially adaptive polymorphisms were not more likely than neutral SNPs to show significant
temporal variation. This strongly suggests that, if outlier SNPs are effectively under selection, or
if they are linked to functional genes experiencing selection, the environmental regimes
influencing spatial divergence are clearly different from those influencing temporal variation,
with one possible exception in One_HpaI-99. This polymorphism was located within a family of
short interspersed repetitive elements, which have been suggested to play a role in salmonid
speciation (Kido et al. 1991), although no studies have addressed its potential role in adaptive
divergence within species. Jump et al. (2006) determined that one strong outlier locus in
European beech Fagus sylvatica populations experienced a dramatic decrease in the frequency of
one allele over 50 years (0.8 – 0.5), possibly from an increase of 2 °C in mean annual
temperature. Gomez-Uchida et al. (2011) also found One_HpaI-99 to be a candidate for selection
in sockeye salmon populations from the Kvichak River drainage in southwest Alaska; this
warrants additional scrutiny of this locus, perhaps including additional temporal samples and
relating allele frequencies to candidate environmental variable(s). Second, only two SNPs
showing significant generational divergence appeared twice in three paired collections
(BEAR_LATE, OLGA_LATE, and RED_LATE), which can be considered independent as there
is limited gene flow between lake systems. Therefore, temporal divergence in allele frequencies
appears to occur randomly among SNPs.
Expectations of generational divergence under pure genetic drift. Waples (1989) demonstrated
that the probability of rejecting the null hypothesis of no temporal variation in allele frequencies is
not equal to the nominal value in a classical contingency test (e.g., Pearson χ2: p = 0.05), but instead
depends on the ratio of sample size to Ne. Using this theoretical framework, we generally found
no evidence to reject this null hypothesis, and our range of po estimates was consistent with
findings for other wild, but not hatchery, salmonid populations (Waples and Teel 1990). The
only exception was the paired collection from BEAR_EARLY, for which we expected 5.0% of
tests to be significant, but observed 5.8%. We speculate this may be the result of the test’s dependence
on a finite estimate of Ne. For all EARLY collections, we used estimates of N instead of Ne as the
latter were infinitely large; however, the use of smaller yet plausible values for Ne is likely to
increase the expected number of significant tests according to equation (1) and simulations
presented by Waples (1989). For example, if we use Ne = 1247 (the lower 95% CI for
BEAR_EARLY), then pe = 7.2%, which provides an upper (though conservative) threshold for
the expected number of significant tests.
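The dependence of the expected rejection rate on Ne can be illustrated with a simple approximation. The sketch below assumes a diallelic SNP tested with a 1-df χ2 test, and assumes that drift inflates the variance of the allele-frequency difference from the sampling term 1/(2S0) + 1/(2St) to that term plus t/(2Ne). It approximates the idea only and is not a reproduction of equation (1) in Waples (1989). The function name, sample sizes, and number of generations are hypothetical; only Ne = 1247 comes from the text.

```python
import math

# Critical value of the chi-square distribution with 1 df at alpha = 0.05.
CHI2_CRIT_1DF_05 = 3.841458820694124


def expected_rejection_rate(s0, s1, generations, ne, crit=CHI2_CRIT_1DF_05):
    """Approximate probability that a nominal 5% chi-square test rejects
    homogeneity of allele frequencies when drift is the only force acting.

    s0, s1: individuals sampled at each time point.
    generations: generations elapsed between samples.
    ne: effective population size (math.inf means no drift).
    """
    sampling_var = 1.0 / (2 * s0) + 1.0 / (2 * s1)
    drift_var = generations / (2.0 * ne)
    inflation = (sampling_var + drift_var) / sampling_var
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)); drift rescales the statistic.
    return math.erfc(math.sqrt(crit / (2.0 * inflation)))


# Hypothetical design: 80 fish per sample, 5 generations apart.
rate_no_drift = expected_rejection_rate(80, 80, 5, math.inf)
rate_with_drift = expected_rejection_rate(80, 80, 5, 1247)
```

With an infinite Ne the function returns the nominal rate of about 0.05, and any finite Ne raises it, which is why substituting the lower confidence limit of Ne for N increases the expected proportion of significant tests.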
Overall, the main implications of these two findings for GSI are encouraging: first, SNPs
that more accurately discriminate between populations, and are possibly candidates
for selection, are not more prone to experience significant temporal shifts in allele frequencies
than neutral SNPs; and second, generational divergence was consistent with a model of pure
genetic drift and has occurred randomly among SNPs across independent collections, where
selection seems to play a negligible role. We hypothesize these emanate from the presence of
large, wild populations of sockeye salmon, located in a largely pristine environment with no
hatchery influence (Schindler et al. 2010).
Do divergent colonization and life histories help explain spatial divergence and Ne?
Spatial divergence. Hierarchical lake and run-timing divergence can be explained by a
combination of large-scale historical events, such as postglacial colonization following retreat of
the Cordilleran ice sheet less than 14,700 years BP (Mann and Peteet 1994), and small-scale
local adaptation, such as life history variation as an evolutionary response to ecologically
divergent spawning habitats (Gomez-Uchida et al. 2011). Regarding large-scale differences
between lake systems, principal coordinate analyses based on nuclear SNPs revealed all three
lake systems were reciprocally different, while principal coordinate analyses based on
mitochondrial SNPs revealed that Red Lake shared more haplotypes with Bear Lake than South
Olga Lakes. Results from both marker types were intriguing given the geographic proximity
between the two lake systems on Kodiak Island, strongly suggesting that sockeye salmon from
South Olga Lakes are derived from a unique lineage. This hypothesis finds support in genetic
and demographic attributes of this lake system. First, South Olga Lake collections generally
contained the highest genetic diversity at both nuclear and mitochondrial SNPs (Table 1), which
is typical of the ‘sea’ sockeye salmon, an ancestral form that spends a shorter period rearing in
fluvial or estuarine habitat than the more derived ‘lake’ form (Wood 1995; Beacham, McIntosh,
and MacConnachie 2004; Wood et al. 2008). Second, the OLGA_LATE paired collection had a
different age structure in comparison with the others. For instance, the age-4 cohort had an
important contribution from fish that spend only months in freshwater and three years at sea;
similarly, the age-3 cohort is dominated by fish that spend less than a year in freshwater and two
years at sea (M. B. Foster, unpublished). This cohort possibly explains the shorter generation
length in OLGA_LATE than in other collections. But why did OLGA_EARLY have a different
age structure? Age composition analyses additionally suggested that juveniles from
OLGA_EARLY spend one or two years rearing in freshwater, a typical strategy of the ‘lake’
ecotype and the majority of populations of this study. We hypothesize that OLGA_EARLY was
founded from OLGA_LATE colonizers, and thus retained some ancestral haplotypes, despite the
loss of demographic attributes of ‘sea’ fish that possibly followed the contemporary adaptation to
rearing in limnetic environments.
Bear Lake and Red Lake were more divergent at nuclear SNPs than mitochondrial SNPs;
indeed, these two lake systems seem to share an abundant haplotype (i.e., CGG), possibly
characteristic of the ‘lake’ ecotype as pointed out by principal coordinate analyses from
mitochondrial SNPs. Although the ‘sea’ ecotype has life-history advantages over the ‘lake’
ecotype for successful recolonization where sockeye salmon populations have been extirpated
(Wood et al. 2008), the possibility that the ‘lake’ ecotype can survive in glacial refugia and
colonize new environments has been recently raised (Pavey, Hamon, and Nielsen 2007), and our
results are in line with this possibility. Principal coordinate analyses from nuclear SNPs, on the
other hand, suggested there is very little gene flow between Bear Lake and Red Lake
populations, potentially resulting from geographic isolation, demographic processes, or both. For
example, estimates of nuclear genetic diversity indicated that Bear Lake was the most
depauperate of all three lakes (Table 1), a finding potentially associated with population
bottlenecks (e.g., Gomez-Uchida et al. 2011). Nuclear markers in general may also reveal further
historical contingencies overlooked by mitochondrial markers, especially in geologically young
populations (Fraser and Bernatchez 2005; Gomez-Uchida et al. 2008). Indeed, it has been
proposed that a large area of Kodiak Island, which coincidentally matches the location of the
Ayakulik River that drains Red Lake, remained ice-free during the last glacial maximum, based
on radiometric and stratigraphic evidence (Mann and Peteet 1994). Red Lake populations may
thus be descendants from the so-called Kodiak Island Refugium (Karlstrom 1969), whereas Bear
Lake populations likely originated from the Beringian Refugium in the northern Alaska
Peninsula (Wood 1995). A broader survey of lake systems composed of ‘lake’ and ‘sea’ ecotypes
should corroborate many of the hypotheses presented in the preceding sections.
Within lake systems, substantial nuclear divergence between populations with variable
run timing has been previously described in sockeye salmon (Burger et al. 2000; Seeb et al.
2000; Ramstad, Foote, and Olsen 2003). EARLY and LATE seasonal components are often
linked to specific spawning habitats: EARLY fish reproduce among inlet tributaries or streams,
whereas LATE fish reproduce among river outlets and beaches (Schindler et al. 2010). The basis
for this variation includes life history trade-offs related to egg size, age composition, body size,
and body depth, which have evolved presumably as a result of natural and sexual selection
(Quinn, Hendry, and Wetzel 1995; Quinn, Hendry, and Buck 2001).
Run-timing divergence can be important in a variety of applied contexts. For management
strategies that depend on temporally-explicit (seasonal) sampling regimes, the natural divisions
of EARLY and LATE components with varying productivity, for which different escapement
goals are normally set, have a clear genetic basis. In particular, significant differentiation between
RED_EARLY and RED_LATE argues for separate management, a measure in existence prior to
1989 that ADF&G reinstituted during 2011 (M. B. Foster, pers. comm. 2010). Overall, the
sustainability of the sockeye salmon fishery relies on these lower hierarchical levels of
population diversity (Schindler et al. 2010), some of which can have dramatically different
population and recruitment dynamics (e.g., ‘sea’ vs. ‘lake’ ecotypes: Wood et al. 2008).
Variance Ne. The magnitude of genetic drift varied among paired collections. In particular, we
found that OLGA_LATE had the highest number of both observed and expected significant tests
under pure genetic drift, which likely resulted from a relatively small (and finite) estimate of Ne,
and generational divergence was significant over loci after a false discovery rate correction for
multiple tests. Even though the composite sample size for the most recent collection
(OLGA_LATE_g6.5) was the smallest (n = 71), we argue that sampling error is unlikely to yield
that many false positives. Computer simulations under several conditions, such as variable
number of loci (0 – 40 SNPs) and unequal sample sizes, suggested that the type I error of Pearson’s χ2
test usually remained under p = 0.06 for diallelic SNPs with uniform or even skewed allele
frequencies (Ryman et al. 2006). This reinforces our hypothesis that a small contemporary Ne,
on the order of a few hundred individuals, is reflected in increased genetic drift and instability in
OLGA_LATE.
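A false discovery rate correction for multiple tests under dependency, as described by Benjamini and Yekutieli (2001), can be sketched as a step-up procedure. The code below is a generic illustration that assumes this procedure; the text does not specify the exact variant or threshold calculation applied here, and the p-values in the usage lines are hypothetical.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Step-up false discovery rate procedure valid under dependency.

    Returns a list of booleans, in the input order, marking which null
    hypotheses are rejected at FDR level alpha.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic-sum penalty that accounts for arbitrary dependence among tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value is at or below k * alpha / (m * c_m).
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from three per-collection tests.
decisions = benjamini_yekutieli([0.5, 0.00008, 0.9])
```

Because the threshold for the smallest p-value shrinks as the number of tests grows, a collection must show a strong multilocus signal to remain significant after correction.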
It is unclear what demographic or life-history attributes, or both, explain this finding.
Potential bottlenecks that occurred in either the distant or recent past are inconsistent with high
estimates of genetic diversity and N, respectively. We speculate that the younger age
composition and shorter generation time of OLGA_LATE relative to the other paired collections
represent pieces of this puzzle. Waples, Jensen, and McClure (2010) found an important link
between Ne, though calculated from demographic data, and the coefficient of variation of
population growth in Chinook salmon: annual variation in population growth accounted for a
reduction of 35% in Ne. Findings of reduced Ne in OLGA_LATE may reflect eco-evolutionary
trade-offs, and we believe this deserves further scrutiny.
Although the concept of Ne was developed from a very simple idea, its empirical
estimation becomes complicated owing to multiple spatial and temporal stratifications, which
create different classes of individuals within a population (Waples 2010). In our study, two
such stratifications are age structure and immigration from nearby populations. First, we
assumed no age structure (i.e., discrete generations), although Pacific salmon populations are
often composed of multiple cohorts with variable age at maturity. Nevertheless, the temporal
approach may still be valid considering samples were spaced 5 or more generations apart, which
should eliminate the bias introduced by unequal cohort contribution (see Palstra, O'Connell, and
Ruzzante 2009). Yet, the interpretation of estimates of Ne must proceed with caution, because
they represent a generational average and not the quantity fishery biologists are likely to be most
interested in: the annual number of breeders (Waples 1990). Second, we assumed no gene flow,
although dispersal between EARLY and LATE populations may be possible, which could
potentially bias Ne estimates downward or upward, depending on whether gene flow is sporadic
or frequent, respectively (Wang and Whitlock 2003). Exploratory analyses using MLNE (Wang
and Whitlock 2003), which simultaneously estimates temporal Ne and migration rate (m), yielded
a similar point estimate for OLGA_LATE (NeMLNE = 539) using OLGA_EARLY as potential
source of immigrants (m = 0.001), but a rather imprecise 95% CI (< 1 – Inf). Although sporadic
gene flow from OLGA_EARLY may indeed lead to underestimation of Ne for OLGA_LATE, our general
conclusion regarding this collection is unlikely to change after accounting for immigration (e.g.,
Waples 2010).
Conclusions
How stable were allele frequencies among Alaskan sockeye salmon populations over multiple
generations? First, our survey suggests that temporal changes were between 3- and 20-fold
smaller than spatial changes in allele frequencies when based on nuclear SNPs; differences were
nonetheless much less marked for mitochondrial SNPs. Second, the magnitude of generational
divergence was generally consistent with a model of pure genetic drift: (i) large-FST or candidate
SNPs for diversifying selection were not more likely to show significant temporal changes than
small-FST SNPs and (ii) the observed number of significant tests fell within estimates predicted
by a theoretical model relating sample size and Ne. Third, estimates of Ne and upper 95% CI
using the temporal method were infinitely large, with the exception of OLGA_LATE. Demographic
attributes of this collection suggest it corresponds to the ‘sea’ ecotype. A possible explanation for
a reduced Ne could be found in a shorter generation length than in the other collections, although
eco-evolutionary links are hitherto unclear. Overall, our findings for sockeye salmon support the use
of SNP baselines spanning several decades, which may be pertinent for management of other
large, wild stocks, but perhaps less so for stocks with hatchery influence that are much more
temporally unstable (Waples and Teel 1990). With the increasing use of high-resolution SNPs,
we welcome parallel studies that contemplate multigenerational analyses of SNP variation to
judge the breadth of our conclusions.
Literature cited
Ackerman, M. W., C. Habicht, and L. W. Seeb. 2011. Single-Nucleotide Polymorphisms (SNPs)
under Diversifying Selection Provide Increased Accuracy and Precision in Mixed-Stock
Analyses of Sockeye Salmon from the Copper River, Alaska. Transactions of the
American Fisheries Society 140 (3):865-881.
Banks, M. A., W. Eichert, and J. B. Olsen. 2003. Which genetic loci have greater population
assignment power? Bioinformatics 19 (11):1436-1438.
Beacham, T. D., M. Lapointe, J. R. Candy, B. McIntosh, C. MacConnachie, A. Tabata, K.
Kaukinen, L. T. Deng, K. M. Miller, and R. E. Withler. 2004. Stock identification of
Fraser River sockeye salmon using microsatellites and major histocompatibility complex
variation. Transactions of the American Fisheries Society 133 (5):1117-1137.
Beacham, T. D., B. McIntosh, and C. MacConnachie. 2004. Population structure of lake-type
and river-type sockeye salmon in transboundary rivers of northern British Columbia.
Journal of Fish Biology 65 (2):389-402.
Beacham, T. D., and R. E. Withler. 2010. Comment on "Gene flow increases temporal stability
of Chinook salmon (Oncorhynchus tshawytscha) populations in the Upper Fraser River,
British Columbia, Canada". Canadian Journal of Fisheries and Aquatic Sciences 67
(1):202-205.
Benjamini, Y., and D. Yekutieli. 2001. The control of the false discovery rate in multiple testing
under dependency. Annals of Statistics 29 (4):1165-1188.
Boatright, C., T. Quinn, and R. Hilborn. 2004. Timing of adult migration and stock structure for
sockeye salmon in Bear Lake, Alaska. Transactions of the American Fisheries Society
133 (4):911-921.
Bourgeois, L., W. S. Sheppard, H. A. Sylvester, and T. E. Rinderer. 2010. Genetic Stock
Identification of Russian Honey Bees. Journal of Economic Entomology 103 (3):917-924.
Bradbury, I. R., S. Hubert, B. Higgins, T. Borza, S. Bowman, I. G. Paterson, P. V. R. Snelgrove,
C. J. Morris, R. S. Gregory, D. C. Hardie, J. A. Hutchings, D. E. Ruzzante, C. T. Taggart,
and P. Bentzen. 2010. Parallel adaptive evolution of Atlantic cod on both sides of the
Atlantic Ocean in response to temperature. Proceedings of the Royal Society B-Biological
Sciences 277 (1701):3725-3734.
Browne, D. C., J. A. Horrocks, and F. A. Abreu-Grobois. 2010. Population subdivision in
hawksbill turtles nesting on Barbados, West Indies, determined from mitochondrial DNA
control region sequences. Conservation Genetics 11 (4):1541-1546.
Burger, C. V., K. T. Scribner, W. J. Spearman, C. O. Swanton, and D. E. Campton. 2000.
Genetic contribution of three introduced life history forms of sockeye salmon to
colonization of Frazer Lake, Alaska. Canadian Journal of Fisheries and Aquatic Sciences
57 (10):2096-2111.
Creelman, E. K., L. Hauser, R. K. Simmons, W. D. Templin, and L. W. Seeb. 2011. Temporal
and Geographic Genetic Divergence: Characterizing Sockeye Salmon Populations in the
Chignik Watershed, Alaska, Using Single-Nucleotide Polymorphisms. Transactions of
the American Fisheries Society 140 (3):749-762.
Elfstrom, C. M., C. T. Smith, and J. E. Seeb. 2006. Thirty-two single nucleotide polymorphism
markers for high-throughput genotyping of sockeye salmon. Molecular Ecology Notes 6
(4):1255-1259.
Excoffier, L., G. Laval, and S. Schneider. 2005. Arlequin (version 3.0): An integrated software
package for population genetics data analysis. Evolutionary Bioinformatics 1:47-50.
Felsenstein, J. 1971. Inbreeding and variance effective numbers in populations with
overlapping generations. Genetics 68 (4):581-&.
Ford, M. J. 2001. Molecular evolution of transferrin: Evidence for positive selection in
salmonids. Molecular Biology and Evolution 18 (4):639-647.
Fraser, D. J., and L. Bernatchez. 2005. Allopatric origins of sympatric brook charr populations:
colonization history and admixture. Molecular Ecology 14 (5):1497-1509.
Garant, D., J. J. Dodson, and L. Bernatchez. 2000. Ecological determinants and temporal
stability of the within-river population structure in Atlantic salmon (Salmo salar L.).
Molecular Ecology 9 (5):615-628.
Gomez-Uchida, D., K. P. Dunphy, M. F. O'Connell, and D. E. Ruzzante. 2008. Genetic
divergence between sympatric Arctic charr Salvelinus alpinus morphs in Gander Lake,
Newfoundland: roles of migration, mutation and unequal effective population sizes.
Journal of Fish Biology 73 (8):2040-2057.
Gomez-Uchida, D., J. E. Seeb, M. J. Smith, C. Habicht, T. P. Quinn, and L. W. Seeb. 2011. Single
nucleotide polymorphisms unravel hierarchical divergence and signatures of selection
among Alaskan sockeye salmon (Oncorhynchus nerka) populations. BMC Evolutionary
Biology 11 (48):48.
Goudet, J. 1995. FSTAT (Version 1.2): A computer program to calculate F-statistics. Journal of
Heredity 86 (6):485-486.
———. 2001. FSTAT, a program to estimate and test gene diversities and fixation indices. Ver.
2.9.3. Available from: http://www2.unil.ch/popgen/softwares/fstat.htm.
Habicht, C., L. W. Seeb, K. W. Myers, E. V. Farley, and J. E. Seeb. 2010. Summer-Fall
Distribution of Stocks of Immature Sockeye Salmon in the Bering Sea as Revealed by
Single-Nucleotide Polymorphisms. Transactions of the American Fisheries Society 139
(4):1171-1191.
Heath, D. D., C. Busch, J. Kelly, and D. Y. Atagi. 2002. Temporal change in genetic structure
and effective population size in steelhead trout (Oncorhynchus mykiss). Molecular
Ecology 11 (2):197-214.
Hess, J. E., A. P. Matala, and S. R. Narum. 2011. Comparison of SNPs and microsatellites for
fine-scale application of genetic stock identification of Chinook salmon in the Columbia
River Basin. Molecular Ecology Resources 11:137-149.
Hilborn, R., T. P. Quinn, D. E. Schindler, and D. E. Rogers. 2003. Biocomplexity and fisheries
sustainability. Proceedings of the National Academy of Sciences of the United States of
America 100 (11):6564-6568.
Jorde, P. E., and N. Ryman. 2007. Unbiased estimator for genetic drift and effective population
size. Genetics 177 (2):927-935.
Jump, A. S., J. M. Hunt, J. A. Martinez-Izquierdo, and J. Penuelas. 2006. Natural selection and
climate change: temperature-linked spatial and temporal trends in gene frequency in
Fagus sylvatica. Molecular Ecology 15 (11):3469-3480.
Karlstrom, T. N. V. 1969. Regional setting and geology. In The Kodiak Island Refugium, edited
by T. N. V. Karlstrom and G. E. Ball: The Boreal Institute of North America, University
of Alberta, Ryerson Press.
Kido, Y., M. Aono, T. Yamaki, K. Matsumoto, S. Murata, M. Saneyoshi, and N. Okada. 1991.
Shaping and reshaping of salmonid genomes by amplification of transfer RNA-derived
retroposons during evolution. Proceedings of the National Academy of Sciences of the
United States of America 88 (6):2326-2330.
Leong, J. S., S. G. Jantzen, K. R. von Schalburg, G. A. Cooper, A. M. Messmer, N. Y. Liao, S.
Munro, R. Moore, R. A. Holt, S. J. M. Jones, W. S. Davidson, and B. F. Koop. 2010.
Salmo salar and Esox lucius full-length cDNA sequences reveal changes in evolutionary
pressures on a post-tetraploidization genome. BMC Genomics 11.
Manel, S., O. E. Gaggiotti, and R. S. Waples. 2005. Assignment methods: matching
biological questions with appropriate techniques. Trends in Ecology & Evolution 20
(3):136-142.
Mann, D. H., and D. M. Peteet. 1994. Extent and timing of the last glacial maximum in
southwestern Alaska. Quaternary Research 42 (2):136-148.
Moran, M. D. 2003. Arguments for rejecting the sequential Bonferroni in ecological studies.
Oikos 100 (2):403-405.
Mylecraine, K. A., H. L. Gibbs, C. S. Anderson, and M. C. Shieldcastle. 2008. Using 2 genetic
markers to discriminate among Canada goose populations in Ohio. Journal of Wildlife
Management 72 (5):1220-1230.
Nei, M. 1972. Genetic distance between populations. American Naturalist 106 (949):283-292.
Ostergaard, S., M. M. Hansen, V. Loeschcke, and E. E. Nielsen. 2003. Long-term temporal
changes of genetic composition in brown trout (Salmo trutta L.) populations inhabiting an
unstable environment. Molecular Ecology 12 (11):3123-3135.
Palstra, F. P., M. F. O'Connell, and D. E. Ruzzante. 2009. Age Structure, Changing Demography
and Effective Population Size in Atlantic Salmon (Salmo salar). Genetics 182 (4):1233-1249.
Palstra, F. P., and D. E. Ruzzante. 2008. Genetic estimates of contemporary effective population
size: what can they tell us about the importance of genetic stochasticity for wild
population persistence? Molecular Ecology 17 (15):3428-3447.
Pavey, S. A., T. R. Hamon, and J. L. Nielsen. 2007. Revisiting evolutionary dead ends in
sockeye salmon (Oncorhynchus nerka) life history. Canadian Journal of Fisheries and
Aquatic Sciences 64 (9):1199-1208.
Peakall, R., and P. E. Smouse. 2006. GENALEX 6: genetic analysis in Excel. Population genetic
software for teaching and research. Molecular Ecology Notes 6 (1):288-295.
Pella, J.J., and G.B. Milner. 1987. Use of genetic marks in stock composition analysis. In
Population Genetics & Fishery Management, edited by N. Ryman and F. M. Utter.
Seattle: University of Washington Press.
Potvin, C., and L. Bernatchez. 2001. Lacustrine spatial distribution of landlocked Atlantic
salmon populations assessed across generations by multilocus individual assignment and
mixed-stock analyses. Molecular Ecology 10 (10):2375-2388.
Quinn, T. P., A. P. Hendry, and G. B. Buck. 2001. Balancing natural and sexual selection in
sockeye salmon: interactions between body size, reproductive opportunity and
vulnerability to predation by bears. Evolutionary Ecology Research 3 (8):917-937.
Quinn, T. P., A. P. Hendry, and L. A. Wetzel. 1995. The influence of life history trade-offs and
the size of incubation gravels on egg size variation in sockeye salmon (Oncorhynchus
nerka). Oikos 74 (3):425-438.
Quinn, T.P. 2005. The Behavior and Ecology of Pacific Salmon and Trout. Seattle: University of
Washington Press.
R Development Core Team. 2010. R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL
http://www.R-project.org.
Ramstad, K. M., C. J. Foote, and J. B. Olsen. 2003. Genetic and phenotypic evidence of
reproductive isolation between seasonal runs of Sockeye salmon in Bear Lake, Alaska.
Transactions of the American Fisheries Society 132 (5):997-1013.
Rousset, F. 2008. GENEPOP'007: a complete re-implementation of the GENEPOP software for
Windows and Linux. Molecular Ecology Resources 8 (1):103-106.
Ryman, N. 2006. CHIFISH: a computer program testing for genetic heterogeneity at multiple
loci using chi-square and Fisher's exact test. Molecular Ecology Notes 6 (1):285-287.
Ryman, N., S. Palm, C. Andre, G. R. Carvalho, T. G. Dahlgren, P. E. Jorde, L. Laikre, L. C.
Larsson, A. Palme, and D. E. Ruzzante. 2006. Power for detecting genetic divergence:
differences between statistical methods and marker loci. Molecular Ecology 15 (8):2031-2045.
Schindler, D. E., R. Hilborn, B. Chasco, C. P. Boatright, T. P. Quinn, L. A. Rogers, and M. S.
Webster. 2010. Population diversity and the portfolio effect in an exploited species.
Nature 465 (7298):609-612.
Seeb, J.E., C.E. Pascal, R. Ramakrishnan, and L.W. Seeb. 2009. SNP Genotyping by the
5'-Nuclease Reaction: Advances in High-Throughput Genotyping with Nonmodel
Organisms. In Single Nucleotide Polymorphisms, Methods in Molecular Biology, edited
by A. A. Komar.
Seeb, L. W., C. Habicht, W. D. Templin, K. E. Tarbox, R. Z. Davis, L. K. Brannian, and J. E.
Seeb. 2000. Genetic diversity of sockeye salmon of Cook Inlet, Alaska, and its
application to management of populations affected by the Exxon Valdez oil spill.
Transactions of the American Fisheries Society 129 (6):1223-1249.
Seeb, L. W., W. D. Templin, S. Sato, S. Abe, K. Warheit, J. Y. Park, and J. E. Seeb. 2011. Single
nucleotide polymorphisms across a species' range: implications for conservation studies
of Pacific salmon. Molecular Ecology Resources 11:195-217.
Smith, C. T., C. M. Elfstrom, L. W. Seeb, and J. E. Seeb. 2005. Use of sequence data from
rainbow trout and Atlantic salmon for SNP detection in Pacific salmon. Molecular
Ecology 14 (13):4193-4203.
Smith, M. J., C. E. Pascal, Z. Grauvogel, C. Habicht, J. E. Seeb, and L. W. Seeb. 2011. Multiplex
preamplification PCR and microsatellite validation enables accurate single nucleotide
polymorphism genotyping of historical fish scales. Molecular Ecology Resources 11:268-277.
Storz, J. F. 2005. Using genome scans of DNA polymorphism to infer adaptive population
divergence. Molecular Ecology 14 (3):671-688.
Templin, W. D., J. E. Seeb, J. R. Jasper, A. W. Barclay, and L. W. Seeb. 2011. Genetic
differentiation of Alaska Chinook salmon: the missing link for migratory studies.
Molecular Ecology Resources 11:226-246.
Therkildsen, N. O., E. E. Nielsen, D. P. Swain, and J. S. Pedersen. 2010. Large effective
population size and temporal genetic stability in Atlantic cod (Gadus morhua) in the
southern Gulf of St. Lawrence. Canadian Journal of Fisheries and Aquatic Sciences 67
(10):1585-1595.
Utter, F., and N. Ryman. 1993. Genetic markers and mixed stock fisheries. Fisheries 18 (8):11-21.
Vaha, J. P., J. Erkinaro, E. Niemela, and C. R. Primmer. 2008. Temporally stable genetic
structure and low migration in an Atlantic salmon population complex: implications for
conservation and management. Evolutionary Applications 1 (1):137-154.
Walter, R. P., T. Aykanat, D. W. Kelly, J. M. Shrimpton, and D. D. Heath. 2009. Gene flow
increases temporal stability of Chinook salmon (Oncorhynchus tshawytscha) populations
in the Upper Fraser River, British Columbia, Canada. Canadian Journal of Fisheries and
Aquatic Sciences 66 (2):167-176.
Walter, R. P., J. M. Shrimpton, and D. D. Heath. 2010. Reply to the comment by Beacham and
Withler on "Gene flow increases temporal stability of Chinook salmon (Oncorhynchus
tshawytscha) populations in the Upper Fraser River, British Columbia, Canada".
Canadian Journal of Fisheries and Aquatic Sciences 67 (1):206-208.
Wang, J. L. 2005. Estimation of effective population sizes from data on genetic markers.
Philosophical Transactions of the Royal Society B-Biological Sciences 360 (1459):1395-1409.
Wang, J. L., and M. C. Whitlock. 2003. Estimating effective population size and migration rates
from genetic samples over space and time. Genetics 163 (1):429-446.
Waples, R. S. 1989. Temporal variation in allele frequencies: testing the right hypothesis.
Evolution 43 (6):1236-1251.
———. 1990. Temporal changes of allele frequency in Pacific salmon - implications for
mixed-stock fishery analysis. Canadian Journal of Fisheries and Aquatic Sciences 47 (5):968-976.
———. 2010. Spatial-temporal stratifications in natural populations and how they affect
understanding and estimation of effective population size. Molecular Ecology Resources
10 (5):785-796.
Waples, R. S., and O. Gaggiotti. 2006. What is a population? An empirical evaluation of some
genetic methods for identifying the number of gene pools and their degree of
connectivity. Molecular Ecology 15 (6):1419-1439.
Waples, R. S., D. W. Jensen, and M. McClure. 2010. Eco-evolutionary dynamics: fluctuations in
population growth rate reduce effective population size in chinook salmon. Ecology 91
(3):902-914.
Waples, R. S., A. E. Punt, and J. M. Cope. 2008. Integrating genetic data into management of
marine resources: how can we do it better? Fish and Fisheries 9 (4):423-449.
Waples, R. S., and D. J. Teel. 1990. Conservation Genetics of Pacific salmon. I. Temporal
Changes in Allele Frequencies. Conservation Biology 4 (2):144-156.
Weir, B. S., and C. C. Cockerham. 1984. Estimating F-statistics for the analysis of population
structure. Evolution 38 (6):1358-1370.
Wood, C. C., J. W. Bickham, R. J. Nelson, C. J. Foote, and J. C. Patton. 2008. Recurrent
evolution of life history ecotypes in sockeye salmon: implications for conservation and
future evolution. Evolutionary Applications 1 (2):207-221.
Wood, C.C. 1995. Life history variation and population structure in sockeye salmon. American
Fisheries Society Symposium 17:195-216.
Figure legends
Figure 1. Study sites in (a) Alaska Peninsula and (b) Kodiak Island: 1) Bear River; 2) Bear Lake;
3) Ayakulik River; 4) Red Lake; 5) lower South Olga Lake and 6) upper South Olga Lake.
Black bars represent weirs operated by the Alaska Department of Fish and Game.
Figure 2. Daily average escapement of sockeye salmon by date between 1966 and 2009 (black
solid line, left y-axis) and number of sampled individuals within specific years (colored bars,
right y-axis) from three Alaskan lake systems: (a) Bear Lake, (b) Red Lake, and (c) South Olga
Lakes. Vertical dashed lines are cut-off dates used by the Alaska Department of Fish and Game
to separate EARLY and LATE run timing (Red Lake and South Olga Lakes: 15 July; Bear Lake:
31 July).
Figure 3. Frequency of five mitochondrial SNP haplotypes (see legend) in all 12 sockeye salmon
collections (see Table 1 for nomenclature of collections).
Figure 4. Dotcharts of pairwise FST from 85 nuclear SNPs: (a) number of nuclear SNPs showing
significance (p < 0.05) and (b) pairwise Nei’s DS from mitochondrial SNP haplotypes (c) plotted
for different hierarchical groups: WLWRBG = within lakes, within runs, between generations;
WLBRWG = within lakes, between runs, within generations; BLWG = between lakes, within
generations. Dashed line: p = 0.05 (uncorrected); solid line: padj (multiple-test correction
assuming 10% false discovery rate).
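
The legend does not say which false-discovery-rate procedure produced padj. As a minimal sketch, assuming the Benjamini-Hochberg step-up rule (an assumption, not necessarily the authors' method), the adjusted significance threshold for a set of p-values at a 10% false discovery rate could be obtained as follows. The function name and the p-values are hypothetical and serve only as illustration.

```python
def bh_threshold(p_values, fdr=0.10):
    """Largest p-value declared significant under the Benjamini-Hochberg
    step-up rule (assumed here; the legend only states a 10% FDR).
    Returns 0.0 if no test passes."""
    m = len(p_values)
    threshold = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        # A p-value passes if it lies at or below its rank-scaled bound.
        if p <= fdr * rank / m:
            threshold = p
    return threshold


# Hypothetical p-values, not taken from the study.
example = [0.001, 0.004, 0.02, 0.03, 0.2, 0.6]
print(bh_threshold(example))
```

Every test with a p-value at or below the returned threshold would be treated as significant after correction.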
Figure 5. Principal coordinate analysis using the covariance matrix of pairwise FST (a) and
pairwise Nei’s DS (b) projected in a two-dimensional space. Symbols represent different lake
systems (circles = Bear Lake; squares = Red Lake; triangles = South Olga Lakes); empty
symbols represent EARLY run timing, whereas filled symbols represent LATE run timing.
Labels represent different generations at which each collection was sampled (see Table 1 for
nomenclature).
Figure 6. Scatterplot of FST as a function of genetic diversity (HO/(1 – FST)) to distinguish
candidate SNPs for diversifying selection (in red, plus labels) from neutral SNPs (in black) using
simulations. Dashed and solid lines are the upper limits of the 95th and 99th quantile distribution
of neutral SNPs, respectively.
Tables
Table 1. Collection number, name, lake system, run timing, sampling years and DNA source material and genetic statistics including:
composite sample size per collection (n); allelic richness (AR; number of alleles in 55 diploid individuals); correlation coefficient of
inbreeding (FIS); mitochondrial haplotype diversity (h); and probability for deviations from Hardy-Weinberg equilibrium over all loci
(HWE) for sockeye salmon collected from three lakes in Alaska used to examine temporal stability in allele frequencies.
Collection*           Lake system, Run timing    Sampling year(s)  DNA source  n   AR     FIS     h      HWE
 1 BEAR_EARLY_g0      Bear Lake, EARLY           1975-1976         scale       95  1.907   0.002  0.369  0.164
 2 BEAR_EARLY_g4.9    Bear Lake, EARLY           2000              scale       95  1.928   0.017  0.268  0.200
 3 BEAR_LATE_g0       Bear Lake, LATE            1969              scale       95  1.918  -0.085  0.303  0.099
 4 BEAR_LATE_g6.0     Bear Lake, LATE            2000              scale       95  1.911  -0.024  0.409  0.099
 5 RED_EARLY_g0       Red Lake, EARLY            1967              scale       95  1.905  -0.007  0.404  0.210
 6 RED_EARLY_g8.4     Red Lake, EARLY            2009              fin         95  1.921   0.003  0.478  0.156
 7 RED_LATE_g0        Red Lake, LATE             1966-1968         scale       95  1.921   0.002  0.490  0.203
 8 RED_LATE_g8.4      Red Lake, LATE             2008              fin         95  1.930  -0.003  0.480  0.200
 9 OLGA_EARLY_g0      South Olga Lakes, EARLY    1967-1968         scale       95  1.942  -0.019  0.522  0.105
10 OLGA_EARLY_g6.9    South Olga Lakes, EARLY    2000              heart       95  1.948  -0.017  0.427  0.162
11 OLGA_LATE_g0       South Olga Lakes, LATE     1967-1968         scale       96  1.936   0.039  0.663  0.160
12 OLGA_LATE_g6.5     South Olga Lakes, LATE     1995              scale       71  1.939   0.024  0.663  0.154
*Nomenclature combines all three hierarchical sampling levels: lake system, run timing, and generation (g0 = generation 0; gt =
generation t, with t = number of generations since g0).
Table 2. Pairwise FST (below diagonal; nuclear SNPs) and Nei’s DS (above diagonal; mitochondrial SNPs) with associated multilocus
significance from Pearson χ2 (tests using Fisher’s method produced identical results).
      1        2        3        4        5        6        7        8        9        10       11       12
 1    -        0.008    0.006    0.001    0.000    0.025    0.025    0.019    0.538*   0.403*   0.201*   0.192*
 2   -0.001    -        0.000    0.015    0.010    0.062*   0.061*   0.051*   0.730*   0.561*   0.289*   0.269*
 3    0.004†   0.006*   -        0.011    0.008    0.055*   0.054*   0.045†   0.699*   0.535*   0.274*   0.256*
 4    0.006*   0.009*   0.001    -        0.001    0.016    0.016    0.011    0.480*   0.355*   0.174*   0.169*
 5    0.082*   0.078*   0.084*   0.087*   -        0.022    0.022    0.017    0.520*   0.388*   0.184*   0.174*
 6    0.084*   0.081*   0.085*   0.088*  -0.002    -        0.000    0.001    0.297*   0.203*   0.091*   0.097*
 7    0.072*   0.071*   0.071*   0.072*   0.011*   0.012*   -        0.001    0.299*   0.205*   0.091*   0.097*
 8    0.074*   0.072*   0.071*   0.074*   0.007*   0.008*   0.001    -        0.326*   0.227*   0.106*   0.112*
 9    0.079*   0.078*   0.088*   0.089*   0.092*   0.094*   0.068*   0.068*   -        0.025*   0.112*   0.181*
10    0.078*   0.075*   0.087*   0.087*   0.085*   0.086*   0.062*   0.062*   0.000    -        0.080*   0.127*
11    0.060*   0.058*   0.061*   0.062*   0.067*   0.069*   0.041*   0.043*   0.016*   0.015*   -        0.009
12    0.059*   0.055*   0.058*   0.061*   0.073*   0.074*   0.047*   0.049*   0.030*   0.030*   0.005*   -
†p < 0.01; *p < 0.001. Collection numbers follow those in Table 1.
Table 3. Analysis of molecular variance (AMOVA) at three hierarchical levels for sockeye salmon collections taken from two time
periods (between 4.9 and 6.9 generations apart), from two run-timings, from three lakes in Alaska.
Source of variation                                             d.f.   SS       % Variance   FST/ΦST    p-value
Nuclear SNPs
  Between collections in different lakes                        2      1237.3    6.3          0.07401   < 0.001
  Between collections with different run-timing within lakes    3      173.8     1.1          0.01134   < 0.001
  Between generations within run-timing collections             6      62.7     -0.03        -0.00027   0.640
  Within collections                                            2222   24452     93.8
Mitochondrial SNPs (haplotypes)
  Between collections in different lakes                        2      13.1      3.8          0.03767   < 0.001
  Between collections with different run-timing within lakes    3      10.1      2.6          0.02551   < 0.001
  Between generations within run-timing collections             6      11.0      2.6          0.02631   < 0.001
  Within collections                                            2222   705.2
d.f. = degrees of freedom; SS = sum of squares.
Table 4. SNPs involved in generational divergence (Pearson χ2 test: p < 0.05) and observed vs. expected number of significant tests
across six paired collections (three lake systems with two run times) of sockeye salmon from Alaska. Paired collection nomenclature
is defined in Table 1. Noteworthy SNPs are in bold and footnoted.
Paired collection           Observed significant   Expected significant   Locus list (p-value)
                            tests (po)             tests (pe)§
BEAR_EARLY (g0 vs. g4.9)    5/85 = 5.9%            5.0%                   One_apoe-83 (0.006); One_agt-132 (0.022);
                                                                          One_STR07 (0.039); One_ZNF-61 (0.020);
                                                                          One_ctgf-301 (0.048)
BEAR_LATE (g0 vs. g6.0)     5/85 = 5.9%            6.2%                   One_Ots208-234 (0.006); One_U1105 (0.017);
                                                                          One_U1013-108 (0.033); One_Zp3b-49 (0.009);
                                                                          One_RAG3-93** (0.031)
RED_EARLY (g0 vs. g8.4)     1/85 = 1.2%            5.0%                   One_pax7-248 (0.01574)
RED_LATE (g0 vs. g8.4)      3/85 = 3.5%            7.9%                   One_rpo2j-261 (0.009); One_PIP (0.028);
                                                                          One_ghsR-66** (0.004)
OLGA_EARLY (g0 vs. g6.9)    2/85 = 2.4%            5.1%                   One_gdh-212 (0.02224); One_IL8r-362 (0.02571)
OLGA_LATE (g0 vs. g6.5)     12/85 = 14.1%          14.6%                  One_U1201-492 (0.016); One_U1101 (0.031);
                                                                          One_tshB-92 (0.00871); One_ghsR-66** (0.032);
                                                                          One_cin-177 (0.007); One_Mkpro-129 (0.033);
                                                                          One_hcs71-220 (0.036); One_MHC2_190 (0.018);
                                                                          One_RAG3-93** (0.019); One_MHC2_251 (0.009);
                                                                          One_HpaI-99† (0.04554); One_U503-170 (0.04794)
**Found in more than one comparison; †Candidate for diversifying selection from simulations in ARLEQUIN. §Using Ne:
BEAR_LATE, RED_LATE, and OLGA_LATE; using N: BEAR_EARLY, RED_EARLY, and OLGA_EARLY.
Table 5. Run timing period; number of generations elapsed between samples (t); unbiased estimator of the temporal shift in allele
frequencies (F'S; Jorde and Ryman 2007); estimates of variance effective population size (Ne, plus 95% CI); and estimate of census
population size from historical escapement data (N; average 1966-2009) for six paired collections (three lake systems with two run
times) of sockeye salmon from Alaska.
Paired collection   Run-timing period    t     F'S          Ne (95% CI)†        N
BEAR_EARLY          10-Jun to 31-Jul     4.9   -0.0015362   Inf (1247 - Inf)    ~ 200,000
BEAR_LATE           1-Aug to 15-Sep      6.0    0.0011198   2679 (613 - Inf)    ~ 150,000
RED_EARLY           29-May to 15-Jul     8.4   -0.0032631   Inf (Inf - Inf)     ~ 150,000
RED_LATE            16-Jul to 1-Sep      8.4    0.0025456   1650 (607 - Inf)    ~ 100,000
OLGA_EARLY          29-May to 15-Jul     6.9   -0.000141    Inf (1339 - Inf)    ~ 40,000
OLGA_LATE           16-Jul to 15-Sep     6.5    0.0099589   326 (198 - 907)     ~ 150,000
†Inf = infinitely large (negative) estimate.
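
The finite point estimates in Table 5 are numerically consistent with the temporal-method relationship Ne = t / (2 F'S), with non-positive F'S reported as an infinite estimate. The sketch below only illustrates that arithmetic; it assumes any sampling correction is already contained in F'S, does not reproduce the confidence intervals, and uses a hypothetical function name.

```python
import math


def variance_ne(f_s, t):
    """Point estimate of variance effective size, Ne = t / (2 * F'S).
    A zero or negative F'S is returned as infinity, matching the 'Inf'
    convention of Table 5. Sketch only; CIs are not computed."""
    if f_s <= 0:
        return math.inf
    return t / (2.0 * f_s)


# (t, F'S) pairs copied from Table 5.
pairs = {
    "BEAR_EARLY": (4.9, -0.0015362),
    "BEAR_LATE": (6.0, 0.0011198),
    "RED_EARLY": (8.4, -0.0032631),
    "RED_LATE": (8.4, 0.0025456),
    "OLGA_EARLY": (6.9, -0.000141),
    "OLGA_LATE": (6.5, 0.0099589),
}

for name, (t, f_s) in pairs.items():
    ne = variance_ne(f_s, t)
    print(name, "Inf" if math.isinf(ne) else round(ne))
```

With the tabulated inputs this gives 2679, 1650 and 326 for BEAR_LATE, RED_LATE and OLGA_LATE, and Inf for the three EARLY collections.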
Table 6. Average age composition (%; between 1985 and 2010) and estimated mean generation years (G) for six paired collections
(three lake systems with two run times) of sockeye salmon from Alaska.
Age         BEAR_EARLY  BEAR_LATE  RED_EARLY  RED_LATE  OLGA_EARLY  OLGA_LATE
2           0.0         0.0        0.0        0.0       0.0         0.5
3           0.7         0.9        2.2        1.3       2.7         15.8
4           16.1        14.5       24.1       15.9      25.4        30.4
5           56.1        65.1       50.4       61.1      57.1        49.8
6           26.6        19.3       22.7       21.1      14.7        3.4
7           0.4         0.2        0.6        0.6       0.1         0.0
G (years)   5.1         5.0        5.0        5.0       4.8         4.3
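
The table does not spell out how G is derived. One plausible reading, assumed here and not confirmed by the source, is that G is the mean age of returning fish weighted by the age composition. The following sketch applies that assumption to two columns of Table 6; the function name is hypothetical, and because the published percentages are rounded the result need not match the tabulated G exactly.

```python
def mean_generation_time(age_pct):
    """Composition-weighted mean age (years). Assumes G is the weighted
    mean age; percentages are renormalised because rounded values may
    not sum to exactly 100."""
    total = sum(age_pct.values())
    return sum(age * pct for age, pct in age_pct.items()) / total


# Age (years) -> percentage of the run, copied from Table 6.
bear_early = {2: 0.0, 3: 0.7, 4: 16.1, 5: 56.1, 6: 26.6, 7: 0.4}
olga_late = {2: 0.5, 3: 15.8, 4: 30.4, 5: 49.8, 6: 3.4, 7: 0.0}

print(round(mean_generation_time(bear_early), 2))
print(round(mean_generation_time(olga_late), 2))
```

Under this assumption BEAR_EARLY comes out close to the tabulated 5.1 years, while OLGA_LATE comes out slightly above the tabulated 4.3 years.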
[Figure 2: panels (a), (b) and (c); axes: date (29-May to 26-Sep), average escapement × 10^3, and number of sampled individuals; bars labelled by sampling year.]
[Figure 3: frequency (0.0 to 1.0) of haplotypes TGA, TAG, CGG, CGA and CAA in each of the 12 collections.]
[Figure 4: panels (a) pairwise FST and (b) Nei's DS, each plotted by hierarchical group (WLWRBG, WLBRWG, BLWG).]
[Figure 5: panel (a) principal coordinate 1 (47%) vs. principal coordinate 2 (38%); panel (b) principal coordinate 1 (92%) vs. principal coordinate 2 (7%); points labelled by generation (g0 to g8.4) for RED_EARLY, RED_LATE, OLGA_EARLY, OLGA_LATE, BEAR_EARLY and BEAR_LATE.]
[Figure 6: FST plotted against HO/(1 − FST); labelled loci: One_Tf_in3-182, One_HpaI-99, One_Tf_ex11-750, One_U1003-75, One_U1209-111 and One_RFC2-285.]