AGBT2007 - BC Bioinformatics

SNP Discovery in Whole-Genome Light-Shotgun
454 Pyrosequences
Aaron Quinlan (1), Andrew Clark (2), Elaine Mardis (3), Gabor Marth (1)
(1) Department of Biology, Boston College
(2) Departments of Molecular Biology and Genetics, Cornell University
(3) Departments of Genetics and Molecular Microbiology, Washington
University
AGBT 2007. Marco Island, FL. February 9, 2007
454 machines have been proven for several applications
• genome sequencing
• microRNA discovery
• mutation detection in
cancer tissue
454 machines trade off throughput with read length
[Figure: bases per run (1 Mb to 1 Gb) versus read length (10 bp to 1,000 bp)]
454 shotgun reads for SNP discovery
• for 100 Mb genomes, a few 454 runs produce ~1x coverage
• at ~1x, the genome is fairly densely covered
• still, most 454 reads align as singletons
[Figure: bases per run (1 Mb to 100 Mb) versus genome size (10 Mb to 10 Gb)]
Are single-coverage 454 reads resulting from light-shotgun sequencing accurate enough for SNP discovery?
melanogaster reference genome sequence (iso-1 strain)
454 shotgun reads from an African melanogaster isolate (strain id 46-2)
• African melanogaster strain courtesy of Dr. Charles Langley, UC Davis
• 454 sequencing at the Washington University Genome Sequencing Center
Steps of SNP discovery
Sequence clustering and
organization
Paralog identification
Multiple fragment alignment
SNP detection
SNP discovery in capillary traces hinges on base quality
• in Sanger-principle capillary sequences the number of bases is generally well
resolved
• most errors come from
substitutions, i.e. calling the wrong
base
• substitution errors are well described by the
PHRED base quality values allowing us to
distinguish between sequencing error and true
polymorphism, detect and score candidate SNPs
Most 454 errors are over-calls or under-calls
• in 454 reads the identity of the nucleotide is usually accurate, but the
number of bases is often unclear
• most errors are over-calls or
under-calls
• errors don’t necessarily occur in “low quality” regions of the read, and PHRED
base quality values do not describe over- and under-call errors
[Figure: light signal from successive nucleotide incorporation tests, with example signals of 0.09 and 1.5. How many bases were incorporated?]
• the number of bases in a mono-nucleotide run has to be inferred from
the signal intensity, but this inference is often not trivial
• a signal is also produced when, in fact, no nucleotide is incorporated
• signal intensity is variable for a given number of incorporated bases
The base number probabilities
[Figure: histogram of observed signal intensities for different numbers of actually incorporated bases]
• conversely, for a given signal intensity (e.g. 1.5), the true number of incorporated
nucleotides is either 1 or 2 (and sometimes even 3 or 0)
• our base caller calculates and reports the base number probabilities, i.e. the
(posterior) probabilities that, given the observed incorporation signal, 0, 1, 2, …
bases were incorporated, e.g. P(0C), P(1C), P(2C), …
• these base number probabilities address under- and over-calls and replace
the PHRED base quality values for 454 reads
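As an illustration of what a base number probability is, the sketch below applies Bayes' rule to a single incorporation signal. It is not the authors' base caller: the Gaussian signal model (mean equal to the number of bases, one fixed standard deviation `sd`), the flat prior, and the function name `base_number_posteriors` are all assumptions made for this example. Real signal distributions would be estimated from data and need not be Gaussian or equally wide for every base count.

```python
import math

def base_number_posteriors(signal, max_bases=4, sd=0.25, prior=None):
    """Return P(n bases incorporated | signal) for n = 0..max_bases.

    Assumed toy model: the signal for n incorporated bases is
    Normal(mean=n, sd=sd). Both the model and sd are illustrative.
    """
    counts = range(max_bases + 1)
    if prior is None:
        # Flat prior over base counts (an assumption for this sketch).
        prior = [1.0 / (max_bases + 1)] * (max_bases + 1)
    # Unnormalized posterior = likelihood * prior; the Gaussian
    # normalizing constant is shared by all n and cancels out.
    weights = [
        math.exp(-0.5 * ((signal - n) / sd) ** 2) * prior[n] for n in counts
    ]
    total = sum(weights)
    return [w / total for w in weights]

post = base_number_posteriors(1.5)
for n, p in enumerate(post):
    print(f"P({n} bases | signal=1.5) = {p:.3f}")
```

Under these assumptions a signal of 1.5 splits its probability almost evenly between one and two bases, which is the ambiguity the slide describes.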
PyroBayes – our 454 base caller
Mapping / sequence alignment
• simple BLAT approach to map 454 reads
• 454 reads that align to multiple locations in the genome (paralogous
sequences) are removed
• unique pair-wise alignments kept
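The uniqueness filter described above can be sketched as follows. The `(read_id, chrom, start)` tuples are a hypothetical simplified record invented for this example, not BLAT's actual output format, and the function name `unique_alignments` is likewise an assumption.

```python
from collections import defaultdict

def unique_alignments(hits):
    """Keep only reads that align to exactly one genome location.

    hits: iterable of (read_id, chrom, start) tuples, one per alignment
    (hypothetical simplified records, not real BLAT output).
    """
    by_read = defaultdict(list)
    for read_id, chrom, start in hits:
        by_read[read_id].append((chrom, start))
    # Reads with more than one hit are treated as paralogous and dropped.
    return {rid: locs[0] for rid, locs in by_read.items() if len(locs) == 1}

hits = [
    ("read1", "2L", 1000),
    ("read2", "3R", 5000),
    ("read2", "X", 7200),   # second location for read2, so read2 is removed
    ("read3", "2R", 42),
]
kept = unique_alignments(hits)
print(kept)
```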
TTGATGACTAGTAACGACAGGGACGCGTGGGAAGGTTAGTACCGTAC
ACGACAGGGATGCGTGGGA
SNP calling for 454 reads
To evaluate sequence differences, we use the base number probabilities
ACGACAGGGACGCGTGGGA
ACGACAGGGATGCGTGGGA
P(0C) would not be available from PHRED
Given an apparent mismatch between the genome reference sequence (C allele) and
the 454 read (T allele) we have to consider the possibility that:
• the genome reference allele (C) is wrong and, in fact, the reference allele
is T (from PHRAP base quality value)
• the 454 allele (T) is the result of an over-call, and one of the C nucleotide
tests just before or after was an under-call…
The result is a SNP probability score that our SNP caller reports
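One way to picture how the competing explanations combine into a single score is the toy Bayesian calculation below. It is a sketch under stated assumptions, not the authors' SNP caller: the prior `prior_snp`, the function names, and the simplification that a mismatch without a SNP comes from exactly one erroneous allele are all invented for this example. `read_error` stands for the probability, derived from the base number probabilities, that the 454 allele is an over-call or under-call artifact. The conversion from a PHRED-scaled quality value Q to an error probability, 10^(-Q/10), is the standard definition.

```python
def phred_to_error(q):
    """Convert a PHRED-scaled quality value to an error probability."""
    return 10 ** (-q / 10)

def snp_probability(ref_quality, read_error, prior_snp=0.005):
    """Toy posterior probability that an observed mismatch is a real SNP.

    ref_quality: PHRED-scaled quality of the reference base.
    read_error:  probability that the 454 allele is a calling artifact.
    prior_snp:   illustrative prior, an assumption of this sketch.
    """
    e_ref = phred_to_error(ref_quality)
    # Real difference: both alleles were called correctly.
    like_snp = (1 - e_ref) * (1 - read_error)
    # No real difference: exactly one of the two alleles is an error.
    like_no_snp = e_ref * (1 - read_error) + (1 - e_ref) * read_error
    numerator = prior_snp * like_snp
    return numerator / (numerator + (1 - prior_snp) * like_no_snp)

for read_error in (0.0001, 0.001, 0.01):
    score = snp_probability(ref_quality=40, read_error=read_error)
    print(f"read_error={read_error}: SNP probability {score:.3f}")
```

The score falls as the read-side error probability rises, which is why a quantity such as P(0C) matters for 454 reads.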
The SNP discovery pipeline
• 454 base calling (341,600 reads called)
• read mapping (220,121 reads uniquely mapped)
• SNP calling + thresholding, e.g. Pr(C/T) (41,265 candidate SNPs)
SNP candidate validation
• we attempted experimental validation
for 1,549 randomly chosen candidates
• each candidate was PCR-amplified and
sequenced on ABI capillary machines.
• 1,114 of 1,231 candidates were
confirmed (318 could not be assayed).
• 90.5% true positive rate
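The validation figures above fit together as follows; the variable names are chosen for this example only.

```python
attempted = 1549        # randomly chosen candidates
not_assayable = 318     # could not be assayed
confirmed = 1114        # confirmed by capillary sequencing

assayed = attempted - not_assayable
rate = confirmed / assayed
print(f"{assayed} assayed, {rate:.1%} confirmed")
# prints "1231 assayed, 90.5% confirmed"
```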
Melanogaster SNPs from a single 454 run
• 81.4% of SNPs were discovered in a
single 454 read vs. the genome reference
• 1 SNP per 530 bp of aligned 454 sequence
• SNPs were evenly distributed on
melanogaster autosomes (chr. 4 is
almost completely heterochromatic)
• Average density: 1 SNP per 2.9 kb of
melanogaster genome sequence
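As a rough consistency check, the two densities can be combined with the read and candidate counts reported on the pipeline slide. The sketch assumes that the 41,265 candidate SNPs are the numerator of both densities, which the slides do not state explicitly, so the derived values are only indicative.

```python
candidate_snps = 41_265
uniquely_mapped_reads = 220_121
bp_per_snp_aligned = 530      # 1 SNP per 530 bp of aligned 454 sequence
bp_per_snp_genome = 2_900     # 1 SNP per 2.9 kb of genome sequence

aligned_bp = candidate_snps * bp_per_snp_aligned
genome_bp = candidate_snps * bp_per_snp_genome
mean_aligned_len = aligned_bp / uniquely_mapped_reads
coverage = aligned_bp / genome_bp

print(f"implied aligned sequence: {aligned_bp / 1e6:.1f} Mb")
print(f"implied genome span:      {genome_bp / 1e6:.1f} Mb")
print(f"implied aligned length per read: {mean_aligned_len:.1f} bp")
print(f"implied coverage from one run:   {coverage:.2f}x")
```

Under that assumption the implied aligned length per read is close to 100 bp and the implied coverage is a fraction of 1x, in line with the earlier statement that a few runs are needed for ~1x coverage of a 100 Mb genome.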
SNPs for a melanogaster genotyping chip
• some SNP alleles we discovered are likely
singletons (alleles only present in the
reference or the African strain, but not in
the entire melanogaster “population”)
• but we know from population genetic
theory that SNP discovery (ascertainment)
in a pair of chromosomes enriches for
common variants most useful as genetic
markers
• 40K SNPs with 90%+ validation rate from a
single 454 run probably sufficient for a
genotyping chip
• for larger genomes / denser maps multiple 454
runs will be needed
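The ascertainment argument above can be illustrated with a standard population genetic result: two chromosomes drawn at random from one population carry different alleles at a biallelic site with probability 2p(1 - p), where p is the frequency of one allele. The sketch assumes random sampling from a single population, which is a simplification for a reference strain compared with an African isolate; the function name is chosen for this example.

```python
def discovery_probability(p):
    """Probability that two randomly drawn chromosomes differ at a
    biallelic site where one allele has population frequency p."""
    return 2 * p * (1 - p)

for p in (0.01, 0.05, 0.2, 0.5):
    prob = discovery_probability(p)
    print(f"allele frequency {p:.2f}: discovery probability {prob:.4f}")
```

Rare alleles are seldom caught in a single pair, so the discovered set is enriched for common variants.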
Ongoing 454 data mining projects
• 10 different melanogaster strains
• mammalian projects: larger genome size
requires reduced genome representation
strategy (RRS)
• RRS shotgun reads provide deeper
sequence coverage in “target” regions
Refinements of the 454 data analysis pipeline
• improved base calling gives higher accuracy
• extending SNP calls to all substitutions and INDELs gives more SNPs
• effective anchored aligners and SNP callers for deep alignments
address more data and deeper alignments from RRS strategies
Thanks
Elaine Mardis (Wash. U.)
Andy Clark (Cornell University)
Aaron Quinlan (Boston College)
Chip Stewart
Michael Stromberg
Weichun Huang
Michele Busby
Damien Croteau-Chonka
Eric Tsung
Tony Nguyen
bioinformatics.bc.edu/marthlab
• base callers for 454 and short-read
sequencing machines
• reference guided, “anchored” alignment
programs
• SNP callers for deep 454
alignments and for short read
alignments
SNP calling – filters
• only considered candidate SNPs that were least likely to be the result of
a 454 over-call or under-call
[Figure: three example alignments, each pairing a Reference sequence with an Afr. 454 seq. Sequences as extracted: TCGCCTACGCG, TCGCGTTCGCG, TCGCGTATGCG, TCCCGTATGCG, TCGCGTATGCG, TCTCGTATGCG]
• only considered candidate SNPs with SNP probability score > 0.9