Sequence variation informatics

Discovery tools for human genetic variations
Gabor T. Marth
Department of Biology
Boston College
Chestnut Hill, MA 02467
Sequence variations
• Human Genome Project produced a reference genome
sequence that is 99.9% common to each human being
• sequence variations make our
genetic makeup unique
SNP
• Single-nucleotide polymorphisms
(SNPs) are most abundant, but other
types of variations exist and are important
How do we find variations?
• comparative analysis of multiple
sequences from the same region of the
genome (redundant sequence coverage)
• diverse sequence
resources can be used (EST, WGS, BAC)
Steps of SNP discovery
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection
Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference
sequence as a template to organize
other sequence fragments from
arbitrary sources
2. Use sequence quality information
(base quality values) to distinguish
true mismatches from sequencing
errors
[Figure: sequencing error versus true polymorphism]
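A minimal sketch of how a base quality value can be turned into an error probability, assuming the quality values are Phred-scaled (Q = -10 log10 of the error probability). The function names are hypothetical, and spreading the error evenly over the three other bases is a simplifying assumption, not something stated on the slide.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality value to an error probability."""
    return 10 ** (-q / 10)


def base_likelihoods(called_base, q):
    """Probability of each true base given the base call and its quality.

    Assumption: the error probability is split evenly among the three
    bases that were not called.
    """
    e = phred_to_error_prob(q)
    return {b: (1 - e) if b == called_base else e / 3 for b in "ACGT"}


# Higher quality values mean a mismatch is less likely to be a sequencing error.
for q in (10, 20, 40):
    print(q, phred_to_error_prob(q))
```

A mismatch seen at a Q40 base is therefore far more likely to be a true difference than one seen at a Q10 base.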
SNP discovery with PolyBayes
genome reference sequence
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection
Sequence clustering
• Clustering simplifies to search against sequence database to
recruit relevant sequences
• Clusters = groups of overlapping sequence fragments matching
the genome reference
[Figure: fragments placed along the genome reference and grouped into cluster 1, cluster 2, and cluster 3]
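The clustering idea can be sketched as interval merging: once each fragment has been placed on the genome reference by a database search, overlapping placements form a cluster. The input format (fragment id plus half-open reference coordinates) and the function name are assumptions made for illustration; this is not the PolyBayes implementation.

```python
def cluster_fragments(hits):
    """Group fragments whose reference intervals overlap.

    hits: list of (fragment_id, ref_start, ref_end), half-open coordinates.
    Returns a list of clusters, each a list of fragment ids.
    """
    clusters = []
    current, current_end = [], None
    for frag_id, start, end in sorted(hits, key=lambda h: h[1]):
        if current and start < current_end:
            # Overlaps the running cluster: extend it.
            current.append(frag_id)
            current_end = max(current_end, end)
        else:
            # No overlap: close the previous cluster and start a new one.
            if current:
                clusters.append(current)
            current, current_end = [frag_id], end
    if current:
        clusters.append(current)
    return clusters


hits = [("est1", 100, 400), ("est2", 350, 700), ("wgs1", 1200, 1700),
        ("bac1", 1650, 2100), ("est3", 5000, 5300)]
print(cluster_fragments(hits))
# [['est1', 'est2'], ['wgs1', 'bac1'], ['est3']]
```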
(Anchored) multiple alignment
• The genomic reference sequence serves as an anchor
• fragments pair-wise aligned to genomic sequence
• insertions are propagated – “sequence padding”
• Advantages
• efficient -- only involves pair-wise comparisons
• accurate -- correctly aligns alternatively spliced ESTs
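A small sketch of how insertions might be propagated ("sequence padding") when pair-wise alignments are stacked on the reference. Each pair-wise alignment is assumed to be given as a reference start position plus two gapped rows of equal length, and each alignment is assumed to begin with an aligned reference base. The representation and function names are hypothetical.

```python
def parse_pairwise(ref_start, ref_row, frag_row):
    """Map each reference position to (aligned base, bases inserted after it)."""
    columns = {}
    pos = ref_start - 1
    for r, f in zip(ref_row, frag_row):
        if r == "-":
            # Insertion relative to the reference: attach to previous position.
            base, ins = columns[pos]
            columns[pos] = (base, ins + f)
        else:
            pos += 1
            columns[pos] = (f, "")
    return columns


def anchored_alignment(reference, pairwise):
    """Build a padded multiple alignment; '*' marks padding, ' ' no coverage."""
    parsed = [parse_pairwise(*p) for p in pairwise]
    # Widest insertion seen after each reference position, over all fragments.
    max_ins = [0] * len(reference)
    for cols in parsed:
        for pos, (_, ins) in cols.items():
            max_ins[pos] = max(max_ins[pos], len(ins))
    rows = ["".join(b + "*" * max_ins[i] for i, b in enumerate(reference))]
    for cols in parsed:
        row = []
        for i in range(len(reference)):
            width = 1 + max_ins[i]
            if i in cols:
                base, ins = cols[i]
                row.append((base + ins).ljust(width, "*"))
            else:
                row.append(" " * width)
        rows.append("".join(row))
    return rows


reference = "ACGTACGT"
pairwise = [
    (0, "ACG-TAC", "ACGGTAC"),  # fragment with an inserted G
    (2, "GTACGT", "GT-CGT"),    # fragment with a deleted base
]
rows = anchored_alignment(reference, pairwise)
for row in rows:
    print(row)
```

Only fragment-to-reference comparisons are needed; fragments are never aligned to each other, which is the efficiency advantage listed above.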
Paralog filtering
• The “paralog problem”
• unrecognized paralogs give rise to spurious SNP predictions
• SNPs in duplicated regions may be useless for genotyping
• Challenge
• to differentiate between sequencing errors and paralogous
difference
[Figure: sequencing errors versus paralogous difference]
Paralog filtering
• Pair-wise comparison between fragment and genomic sequence
• Model of expected discrepancies
• Orthologous: sequencing error + polymorphisms
• Paralog: sequencing error + paralogous sequence difference
Paralog discrimination
[Figure: probability (0 to 1) versus number of discrepancies d (0 to 15); curves P(d|Model_NAT), P(d|Model_PAR), and P(Model_NAT|d)]
• Bayesian discrimination algorithm
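The discrimination step can be sketched with two binomial models for the number of discrepancies d between a fragment and the genomic sequence. The binomial form, the rate values, and the 50/50 model prior below are illustrative assumptions, not parameters taken from the slides.

```python
from math import comb


def binom_pmf(d, n, p):
    """Probability of exactly d discrepancies in n aligned bases."""
    return comb(n, d) * p**d * (1 - p) ** (n - d)


def prob_native(d, length, err=0.01, poly=0.001, para=0.02, prior_native=0.5):
    """Posterior P(Model_NAT | d) by Bayes' rule.

    Native model:  sequencing error + polymorphism rate.
    Paralog model: sequencing error + paralogous sequence difference.
    All rates here are assumed values chosen for illustration.
    """
    p_nat = binom_pmf(d, length, err + poly)
    p_par = binom_pmf(d, length, err + para)
    numerator = p_nat * prior_native
    return numerator / (numerator + p_par * (1 - prior_native))


# Posterior drops as the number of discrepancies in a 500-base alignment grows.
for d in (0, 5, 10, 15):
    print(d, round(prob_native(d, 500), 4))
```

Fragments whose posterior falls below a chosen cutoff would be treated as likely paralogs and excluded before SNP detection.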
Paralog filtering
SNP detection
• Goal: to discern true variation from sequencing error
[Figure: sequencing error versus polymorphism]
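Bayesian-statistical SNP detection
• Input for each of the N aligned sequences (N = depth of coverage): base call + base quality
• Bayesian posterior probability of a permutation of true nucleotides S_1, ..., S_N given the reads R_1, ..., R_N:

\[
P(S_1,\ldots,S_N \mid R_1,\ldots,R_N) =
\frac{\dfrac{P(S_1 \mid R_1)}{P_{Prior}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{Prior}(S_N)} \cdot P_{Prior}(S_1,\ldots,S_N)}
{\sum\limits_{S_{i_1} \in \{A,C,G,T\}} \cdots \sum\limits_{S_{i_N} \in \{A,C,G,T\}} \dfrac{P(S_{i_1} \mid R_1)}{P_{Prior}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{Prior}(S_{i_N})} \cdot P_{Prior}(S_{i_1},\ldots,S_{i_N})}
\]

• P(S_i | R_i): base call + base quality
• P_Prior(S_i): base composition
• P_Prior(S_1, ..., S_N): expected polymorphism rate
• SNP probability: sum over all variable (polymorphic) permutations; the monomorphic permutations (all A, all C, all G, all T) are excluded

\[
P(SNP) = \sum_{\text{all variable } [S_1,\ldots,S_N]} P(S_1,\ldots,S_N \mid R_1,\ldots,R_N)
\]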
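A brute-force sketch of the permutation-based posterior on this slide, for one alignment column of small depth. Several things here are assumptions made for illustration: Phred-scaled qualities, a uniform base composition prior of 0.25, and a joint prior that gives the monomorphic permutations a total weight of 1 minus the polymorphism rate and spreads the polymorphism rate evenly over all variable permutations. The actual priors also account for allele frequency and variation type, which this sketch ignores.

```python
from itertools import product

BASES = "ACGT"


def base_posterior(call, q):
    """P(S_i | R_i) from a base call and its Phred quality value."""
    e = 10 ** (-q / 10)
    return {b: (1 - e) if b == call else e / 3 for b in BASES}


def snp_probability(reads, poly_rate=1 / 300, base_prior=0.25):
    """P(SNP) for one column; reads is a list of (called_base, quality).

    Enumerates all 4**N permutations, so it is only usable for small N.
    """
    n = len(reads)
    per_read = [base_posterior(c, q) for c, q in reads]
    n_variable = 4**n - 4  # permutations containing more than one nucleotide
    total = 0.0
    variable = 0.0
    for perm in product(BASES, repeat=n):
        is_mono = len(set(perm)) == 1
        # Assumed joint prior P_Prior(S_1, ..., S_N); see lead-in.
        joint_prior = (1 - poly_rate) / 4 if is_mono else poly_rate / n_variable
        term = joint_prior
        for s, post in zip(perm, per_read):
            term *= post[s] / base_prior
        total += term
        if not is_mono:
            variable += term
    return variable / total


print(snp_probability([("A", 40)] * 4))
print(snp_probability([("A", 40), ("A", 40), ("C", 40), ("C", 40)]))
print(snp_probability([("A", 40), ("A", 40), ("A", 40), ("C", 10)]))
```

The three calls contrast a monomorphic column, a well-supported two-allele column, and a column whose only mismatch is a low-quality base.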
Priors
• Polymorphism rate in population -- e.g. 1 / 300 bp
• Distribution of SNPs according
to minor allele frequency
[Figure: relative occurrence [%] versus minor allele frequency [%]]
• Distribution of SNPs according
to specific variation
[Figure: relative occurrence [%] by variation type (AC, AG, AT, CG)]
• Sample size (alignment depth)
[Figure: Prob(k alleles of N = 20) versus k alleles, for p = 0.02, p = 0.1, and p = 0.5]
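One natural reading of the sample-size prior is a binomial model: if an allele has population frequency p, the chance of seeing it k times among N aligned sequences is binomial. The sketch below assumes that reading; the function names are hypothetical.

```python
from math import comb


def prob_k_alleles(k, n, p):
    """Binomial probability of sampling an allele of frequency p exactly k times in n sequences."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_allele_sampled(n, p):
    """Probability that the allele appears at least once among n sequences."""
    return 1 - (1 - p) ** n


# Rare alleles are often missed entirely, even at depth 20.
for p in (0.02, 0.1, 0.5):
    print(p, round(prob_allele_sampled(20, p), 3))
```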
SNP score
polymorphism
specific variation
Validation – pooled sequencing
[Figure: SNP confirmation rate [%] by P(SNP) bin (0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00); series labels: African, Asian, Hispanic, Caucasian, CHM 1]
Validation -- resequencing
[Figure: confirmation rate (0 to 100) by SNP score [%] bin (51-60, 61-70, 71-80, 81-90, 91-100)]
Properties of SNP detection algorithm
[Figure, panel 1: Detection of a single allele; quality value versus alignment depth (2 to 10), for Threshold = 0.5 and Threshold = 0.9]
[Figure, panel 2: Quality value vs. allele frequency (alignment depth = 20); quality value versus allele frequency [%] (5 to 50), for Threshold = 0.5 and Threshold = 0.9]
• frequent alleles are easier to detect
• high-quality alleles are easier to detect
The PolyBayes software
http://genome.wustl.edu/gsc/polybayes
• First statistically rigorous SNP discovery
tool
• Correctly analyzes alternative cDNA
splice forms
• Available for use (~70 licenses)
Marth et al., Nature Genetics, 1999
SNP mining: genome BAC overlaps
overlap detection
inter- & intra-chromosomal duplications
known human repeats
fragmentary nature of draft data
SNP analysis
candidate SNP predictions
BAC overlap mining results
~ 30,000 clones
>CloneX
ACGTTGCAACGT
GTCAATGCTGCA
>CloneY
ACGTTGCAACGT
GTCAATGCTGCA
25,901 clones
(7,122 finished, 18,779 draft
with base quality values)
21,020 clone overlaps
(124,356 fragment overlaps)
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
507,152 high-quality
candidate SNPs
(validation rate 83-96%)
Marth et al., Nature Genetics 2001
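As a toy illustration of comparing two overlapping clone sequences, the sketch below scans the two aligned sequences from the slide for mismatches and keeps those where both bases meet a quality cutoff. This naive threshold filter stands in for the Bayesian detection step; the uniform quality values and the cutoff of 20 are made-up inputs for illustration.

```python
def candidate_snps(seq_a, qual_a, seq_b, qual_b, min_quality=20):
    """Return (position, base_a, base_b) for mismatches supported by both qualities.

    Assumes the two sequences are already aligned without gaps.
    """
    candidates = []
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b and qual_a[i] >= min_quality and qual_b[i] >= min_quality:
            candidates.append((i, a, b))
    return candidates


seq_a = "ACCTAGGAGACTGAACTTACTG"
seq_b = "ACCTAGGAGACCGAACTTACTG"
quals = [40] * len(seq_a)  # hypothetical uniform base quality values

print(candidate_snps(seq_a, quals, seq_b, quals))
# [(11, 'T', 'C')]
```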
SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps
Weber et al., AJHG 2002
2. The SNP Consortium (TSC): polymorphism discovery in
random, shotgun reads from whole-genome libraries
Sachidanandam et al., Nature 2001
Genotyping by sequence
• SNP discovery usually deals with single-stranded (clonal) sequences
• It is often necessary to determine the allele state of individuals at
known polymorphic locations
• Genotyping usually involves double-stranded DNA → the possibility
of heterozygosity exists
• there is no unique underlying nucleotide,
no meaningful base quality value, hence
statistical methods of SNP discovery do not
apply
Genotyping
[Figure: homozygous peak versus heterozygous peak]