Hidden Markov Models of Haplotype
Diversity and Applications in
Genetic Epidemiology
Ion Mandoiu
University of Connecticut
Outline
• HMM model of haplotype diversity
• Applications
- Phasing
- Error detection
- Imputation
- Genotype calling from low-coverage sequencing data
• Conclusions
Single Nucleotide Polymorphisms
• Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …
• High density in the human genome: $\approx 1 \times 10^7$ SNPs out of total $3 \times 10^9$ base pairs
Haplotypes and Genotypes
• Diploids: two homologous copies of each autosomal chromosome
- One inherited from mother and one from father
• Haplotype: description of SNP alleles on a chromosome
- 0/1 vector: 0 for major allele, 1 for minor
• Genotype: description of alleles on both chromosomes
- 0/1/2 vector: 0 (1): both chromosomes contain the major (minor) allele; 2: the chromosomes contain different alleles
  011100110
+ 001000010   (two haplotypes per individual)
= 021200210   (genotype)
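The encoding above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `genotype_from_haplotypes` is an assumption of this example and not part of the original slides.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Slide encoding: 0 = both major, 1 = both minor, 2 = different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    out = []
    for a, b in zip(h1, h2):
        if a not in "01" or b not in "01":
            raise ValueError("alleles must be 0 or 1")
        # Equal alleles keep their value; different alleles become 2.
        out.append(a if a == b else "2")
    return "".join(out)


print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```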
Sources of Haplotype Diversity: Mutation
The International HapMap Consortium. A Haplotype Map
of the Human Genome. Nature 437, 1299-1320. 2005.
Sources of Haplotype Diversity: Recombination
Haplotype Structure in Human Populations
HMM Model of Haplotype Frequencies
[Diagram: hidden chain F1 → F2 → … → Fn; each Fi emits the observed allele Hi]
• $F_i$ = founder haplotype at locus i, $H_i$ = observed allele at locus i
• $P(F_i)$, $P(F_i \mid F_{i-1})$ and $P(H_i \mid F_i)$ estimated from reference genotype or haplotype data
• For given haplotype h, $P(H = h \mid M)$ can be computed in $O(nK^2)$ time using the forward algorithm (n = number of SNPs, K = number of founder haplotypes)
• Similar models proposed in [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]
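The forward computation of P(H = h | M) can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter layout (`init`, `trans`, `emit`) and the toy numbers are invented for this example and are not taken from the authors' software.

```python
def haplotype_probability(h, init, trans, emit):
    """Forward algorithm: P(H = h | M) in O(n K^2) time.

    init[k]        : P(founder k at the first locus)
    trans[t][j][k] : P(founder k at locus t+1 | founder j at locus t), loci 0-based
    emit[t][k][a]  : P(allele a at locus t | founder k at locus t)
    """
    K = len(init)
    # alpha[k] = P(h[0..t], founder k at locus t)
    alpha = [init[k] * emit[0][k][h[0]] for k in range(K)]
    for t in range(1, len(h)):
        alpha = [
            emit[t][k][h[t]] * sum(alpha[j] * trans[t - 1][j][k] for j in range(K))
            for k in range(K)
        ]
    return sum(alpha)


# Toy model (all numbers are made up): K = 2 founders, n = 3 SNPs.
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
emit = [[[0.95, 0.05], [0.10, 0.90]]] * 3

print(haplotype_probability([0, 1, 1], init, trans, emit))
```

Each locus costs K^2 multiplications (every founder against every predecessor), which gives the O(nK^2) bound. A quick sanity check is that the probabilities of all 2^n haplotypes sum to 1.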
Genotype Phasing
g: 0010212 can be explained by two different haplotype pairs:
h1:0010111 + h2:0010010
h3:0010011 + h4:0010110
Which pair is the true phasing?
Maximum Likelihood Genotype Phasing
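[Diagram: two copies of the haplotype HMM, F1 → F2 → … → Fn emitting H1, …, Hn and F'1 → F'2 → … → F'n emitting H'1, …, H'n; each pair (Hi, H'i) determines the genotype Gi]
• Maximum likelihood genotype phasing: given g, find $(h_1, h_2) = \arg\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$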
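For very short genotypes the maximization can be done by brute force, which makes the objective concrete. The sketch below is an illustration only: `compatible_phasings`, `ml_phasing` and the haplotype probabilities in `toy_prob` are hypothetical, and a real implementation would obtain P(h | M) from the HMM rather than from a lookup table.

```python
from itertools import product


def compatible_phasings(g):
    """Yield each unordered pair (h1, h2) with h1 + h2 = g exactly once.

    Slide encoding: 0 = both major, 1 = both minor, 2 = heterozygous.
    """
    het = [i for i, c in enumerate(g) if c == "2"]
    base = [c if c != "2" else None for c in g]
    # Fix the orientation of the first heterozygous site so that
    # (h1, h2) and (h2, h1) are not both produced.
    for bits in product("01", repeat=max(len(het) - 1, 0)):
        choice = ("0",) + bits if het else ()
        h1, h2 = list(base), list(base)
        for pos, b in zip(het, choice):
            h1[pos] = b
            h2[pos] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


def ml_phasing(g, hap_prob):
    """Return the compatible pair maximizing hap_prob(h1) * hap_prob(h2)."""
    return max(compatible_phasings(g), key=lambda p: hap_prob(p[0]) * hap_prob(p[1]))


# Hypothetical haplotype probabilities standing in for P(h | M).
toy_prob = {"0010111": 0.30, "0010010": 0.30, "0010011": 0.05, "0010110": 0.05}
best = ml_phasing("0010212", lambda h: toy_prob.get(h, 1e-6))
print(best)
```

A genotype with k heterozygous sites has 2^(k-1) compatible pairs, so this enumeration is only practical for small k.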
Computational Complexity
• [KMP08] Cannot approximate $\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$ within a factor of $O(n^{1/2 - \varepsilon})$, unless ZPP=NP
• [Rastas et al.] give Viterbi and random sampling based heuristics that yield phasing accuracy comparable to best existing methods (PHASE)
Genotyping Errors
• A real problem despite advances in technology & typing algorithms
- 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005]
• Systematic errors (e.g., assay failure) typically detected by departure from HWE [Hosking et al. 2004]
• In pedigrees, some errors detected as Mendelian Inconsistencies (MIs)
• Many errors remain undetected
- As much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999]
Likelihood Sensitivity Approach to Error Detection in Trios
Mother 012102
0 1 1 1 0 0 h1
0 1 0 1 0 1 h2
Father 022102
0 0 0 1 0 1 h3
0 1 1 1 0 0 h4
Child 022102
0 1 1 1 0 0 h1
0 0 0 1 0 1 h3
$L(T) = \max\, p(h_1)\, p(h_2)\, p(h_3)\, p(h_4)$: likelihood of best phasing for original trio T
Likelihood Sensitivity Approach to Error Detection in Trios
Mother 012102
0 1 0 1 0 1 h'1
0 1 1 1 0 0 h'2
Father 022102
0 0 0 1 0 0 h'3
0 1 1 1 0 1 h'4
Child 0 2 2? 1 0 2
0 1 0 1 0 1 h'1
0 0 0 1 0 0 h'3
$L(T) = \max\, p(h_1)\, p(h_2)\, p(h_3)\, p(h_4)$: likelihood of best phasing for original trio T
$L(T') = \max\, p(h'_1)\, p(h'_2)\, p(h'_3)\, p(h'_4)$: likelihood of best phasing for modified trio T'
Likelihood Sensitivity Approach to Error Detection in Trios
Mother 012102
Father 022102
Child 0 2 2? 1 0 2
• Large change in likelihood suggests likely error
• Flag genotype as an error if $L(T')/L(T) > R$, where R is the detection threshold (e.g., $R = 10^4$)
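The flagging rule can be sketched as a scan over single-genotype changes. This is a hedged illustration, not the authors' implementation: it assumes the modified trio T' is obtained by substituting each alternative value for one genotype, and `toy_likelihood` is a made-up stand-in for an HMM-based trio likelihood.

```python
def flag_errors(trio, likelihood, R=1e4):
    """Return (member, locus, alternative, ratio) for every single-genotype
    change whose likelihood ratio L(T') / L(T) exceeds the threshold R.

    Assumes likelihood(trio) is strictly positive.
    """
    base = likelihood(trio)
    flags = []
    for member, g in trio.items():
        for locus, old in enumerate(g):
            for alt in "012":
                if alt == old:
                    continue
                modified = dict(trio)
                modified[member] = g[:locus] + alt + g[locus + 1:]
                ratio = likelihood(modified) / base
                if ratio > R:
                    flags.append((member, locus, alt, ratio))
    return flags


def toy_likelihood(trio):
    # Hypothetical likelihood: strongly prefers one particular child genotype.
    return 1e-3 if trio["child"] == "020102" else 1e-9


trio = {"mother": "012102", "father": "022102", "child": "022102"}
flags = flag_errors(trio, toy_likelihood)
print(flags)
```

With the toy likelihood, only the change of the child's third genotype from 2 to 0 raises the likelihood by more than a factor of R, so that genotype is the one flagged.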
Alternate Likelihood Functions
• [KMP08] Cannot approximate L(T) within a factor of $O(n^{1/4 - \varepsilon})$, unless ZPP=NP
• Efficiently Computable Likelihood Functions
- Viterbi probability
- Probability of Viterbi Haplotypes
- Total Trio Probability
Comparison with FAMHAP (Children)
[Figure: sensitivity vs. FP rate (0 to 0.015) for TotalProb-UNO, TotalProb-DUO, TotalProb-TRIO, TotalProb-COMBINED, FAMHAP-1 and FAMHAP-3]
Comparison with FAMHAP (Parents)
[Figure: sensitivity vs. FP rate (0 to 0.015) for TotalProb-UNO, TotalProb-DUO, TotalProb-TRIO, TotalProb-COMBINED, FAMHAP-1 and FAMHAP-3]
Genome-Wide Association Studies
• Powerful method for finding genes associated with complex human diseases
- Large number of markers (SNPs) typed in cases and controls
- Disease causal SNPs unlikely to be typed directly
- Significant statistical power gained by performing imputation of untyped HapMap genotypes [WTCCC’07]
HMM Based Genotype Imputation
• Train HMM using the haplotypes from a related HapMap population or a small cohort typed at high density
• Probability of missing genotypes given the typed genotype data ($g_{-i}$ denotes the typed genotypes at the other loci):
$P(g_i = x \mid g_{-i}, M) = \frac{P(g_{-i},\, g_i = x \mid M)}{P(g_{-i} \mid M)}$
• $g_i$ is imputed as $x = \arg\max_{x \in \{0,1,2\}} P(g_{-i},\, g_i = x \mid M)$
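Given the three joint probabilities for one untyped genotype, the imputation step is a normalization followed by an argmax. The sketch below is illustrative only: the function `impute`, its `threshold` parameter (leaving a genotype uncalled when the best posterior is too low) and the example numbers are assumptions of this example.

```python
def impute(joint, threshold=0.0):
    """Impute one genotype from joint[x] = P(g_{-i}, g_i = x | M), x in {0, 1, 2}.

    Returns (call, posterior); call is None when the best posterior
    probability falls below the threshold.
    """
    total = sum(joint)  # equals P(g_{-i} | M)
    posterior = [p / total for p in joint]
    x = max(range(3), key=lambda k: posterior[k])
    return (x if posterior[x] >= threshold else None), posterior


# Made-up joint probabilities for a single untyped SNP.
call, posterior = impute([2e-8, 1e-9, 7e-8])
print(call, posterior)
print(impute([2e-8, 1e-9, 7e-8], threshold=0.9)[0])
```

Because the denominator does not depend on x, the argmax of the joint probability and of the conditional probability coincide; the normalization only matters when a confidence threshold is applied.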
Experimental Results
• Estimates of the allele 0 frequency based on Imputation vs. Illumina 15k
Experimental Results
• Accuracy and missing data rate for imputed genotypes at different thresholds
Ultra-High Throughput Sequencing
• New massively parallel sequencing technologies deliver orders of magnitude higher throughput compared to Sanger sequencing
- Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400bp reads
- Illumina / Solexa Genetic Analyzer 1G: 1000 Mb/run, 35bp reads
- Applied Biosystems SOLiD: 3000 Mb/run, 25-35bp reads
Probabilistic Model
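[Diagram: two founder chains F1 → F2 → … → Fn and F'1 → F'2 → … → F'n emit the haplotype alleles Hi and H'i; each pair (Hi, H'i) determines the genotype Gi; each Gi emits the reads Ri,1, …, Ri,ci covering locus i]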
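Model Training
• Initial founder probabilities $P(f_1)$, $P(f'_1)$, transition probabilities $P(f_{i+1} \mid f_i)$, $P(f'_{i+1} \mid f'_i)$, and emission probabilities $P(h_i \mid f_i)$, $P(h'_i \mid f'_i)$ are trained using the Baum-Welch algorithm from haplotypes inferred from the populations of origin for mother/father
• $P(g_i \mid h_i, h'_i)$ is set to 1 if $h_i + h'_i = g_i$ and to 0 otherwise (here $g_i$ counts the copies of allele 1, so $g_i = 1$ is the heterozygous genotype)
• $P(R_{i,j} = r \mid G_i = g_i) = \frac{g_i}{2}\, \varepsilon_{r(i)}^{1 - r(i)} (1 - \varepsilon_{r(i)})^{r(i)} + \frac{2 - g_i}{2}\, \varepsilon_{r(i)}^{r(i)} (1 - \varepsilon_{r(i)})^{1 - r(i)}$, where $\varepsilon_{r(i)}$ is the probability that read r has an error at locus i
• Conditional probabilities for sets of reads are given by:
$P(r_i \mid G_i = 0) = \prod_{r \in r_i,\ r(i) = 0} (1 - \varepsilon_{r(i)}) \prod_{r \in r_i,\ r(i) = 1} \varepsilon_{r(i)}$
$P(r_i \mid G_i = 1) = (1/2)^{c_i}$, where $c_i$ is the number of reads covering locus i
$P(r_i \mid G_i = 2) = \prod_{r \in r_i,\ r(i) = 1} (1 - \varepsilon_{r(i)}) \prod_{r \in r_i,\ r(i) = 0} \varepsilon_{r(i)}$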
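The read-set probabilities can be evaluated directly from the per-read formula. This is a small sketch under assumptions: reads are represented as hypothetical `(allele, error_probability)` pairs, and `phred_to_error` uses the standard Phred definition to turn a base quality score into an error probability.

```python
from math import prod


def phred_to_error(q):
    """Standard Phred conversion: error probability for base quality q."""
    return 10 ** (-q / 10)


def read_set_likelihood(reads, g):
    """P(r_i | G_i = g) for the reads covering one locus.

    reads : list of (allele, eps), allele in {0, 1} as shown by the read,
            eps = probability that the read is wrong at this locus
    g     : number of copies of allele 1 (0, 1 or 2)
    """
    def one_read(allele, eps):
        from_allele1 = (1 - eps) if allele == 1 else eps  # read drawn from a copy carrying allele 1
        from_allele0 = (1 - eps) if allele == 0 else eps  # read drawn from a copy carrying allele 0
        return (g / 2) * from_allele1 + ((2 - g) / 2) * from_allele0

    return prod(one_read(a, e) for a, e in reads)


# Three made-up reads: two show allele 1, one shows allele 0.
reads = [(1, 0.01), (1, 0.02), (0, 0.05)]
for g in (0, 1, 2):
    print(g, read_set_likelihood(reads, g))
```

For g = 1 each read contributes exactly 1/2 regardless of its error rate, which is why the heterozygous case reduces to (1/2) raised to the number of reads.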
Multilocus Genotyping Problem
GIVEN:
• Shotgun read sets r=(r1, r2, … , rn)
• Base quality scores
• HMMs for populations of origin for mother/father
FIND:
• Multilocus genotype $g^* = (g^*_1, g^*_2, \ldots, g^*_n)$ with maximum posterior probability, i.e., $g^* = \arg\max_g P(g \mid r)$
Posterior Decoding Algorithm
1. For each i = 1..n, compute $g_i^* = \arg\max_{g_i} P(g_i \mid r) = \arg\max_{g_i} P(g_i, r)$
2. Return $g^* = (g_1^*, \ldots, g_n^*)$
• Joint probabilities can be computed using a forward-backward algorithm:
$P(g_i, r) = P(r_i \mid g_i) \sum_{f_i = 1}^{K} \sum_{f'_i = 1}^{K} F^i_{f_i, f'_i}\, B^i_{f_i, f'_i}\, E^i_{f_i, f'_i}(g_i)$
where $F^i$ and $B^i$ denote the forward and backward probabilities for the founder pair $(f_i, f'_i)$ and $E^i_{f_i, f'_i}(g_i)$ is the probability of genotype $g_i$ given that pair
• Direct implementation gives $O(m + nK^4)$ time, where
- m = number of reads
- n = number of SNPs
- K = number of founder haplotypes in HMMs
• Runtime reduced to $O(m + nK^3)$ using speed-up idea similar to [Rastas et al. 08, Kennedy et al. 08]
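The direct O(nK^4) forward-backward computation over founder pairs can be sketched in plain Python. Everything here is an illustrative assumption rather than the authors' code: the function name, the parameter layout, the use of one shared HMM for both parental chains, and the toy numbers. Genotypes count copies of allele 1, and `read_lik[i][g]` stands for P(r_i | G_i = g).

```python
def posterior_decode(read_lik, init, trans, emit):
    """Per-locus posterior decoding; returns (calls, joints) with
    joints[i][g] = P(g_i = g, r) and calls[i] = argmax_g joints[i][g]."""
    n, K = len(read_lik), len(init)
    pairs = [(a, b) for a in range(K) for b in range(K)]

    def genotype_given_pair(i, a, b):
        p = [0.0, 0.0, 0.0]
        for h1 in (0, 1):
            for h2 in (0, 1):
                p[h1 + h2] += emit[i][a][h1] * emit[i][b][h2]
        return p

    E = [{s: genotype_given_pair(i, *s) for s in pairs} for i in range(n)]
    # w[i][s] = P(r_i | founder pair s at locus i)
    w = [{s: sum(read_lik[i][g] * E[i][s][g] for g in range(3)) for s in pairs}
         for i in range(n)]

    def step(i, s, t):
        # P(pair t at locus i+1 | pair s at locus i); the chains are independent
        return trans[i][s[0]][t[0]] * trans[i][s[1]][t[1]]

    # fwd[i][s] = P(reads before locus i, pair s at locus i)
    fwd = [{s: init[s[0]] * init[s[1]] for s in pairs}]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append({t: sum(prev[s] * w[i - 1][s] * step(i - 1, s, t) for s in pairs)
                    for t in pairs})
    # bwd[i][s] = P(reads after locus i | pair s at locus i)
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in pairs}
    for i in range(n - 2, -1, -1):
        bwd[i] = {s: sum(step(i, s, t) * w[i + 1][t] * bwd[i + 1][t] for t in pairs)
                  for s in pairs}

    joints = [[read_lik[i][g] * sum(fwd[i][s] * bwd[i][s] * E[i][s][g] for s in pairs)
               for g in range(3)] for i in range(n)]
    calls = [max(range(3), key=lambda g: joints[i][g]) for i in range(n)]
    return calls, joints


# Toy example (made-up numbers): the middle SNP has no reads at all.
init = [0.5, 0.5]
stay = [[0.99, 0.01], [0.01, 0.99]]
trans = [stay, stay]
emit = [[[0.99, 0.01], [0.01, 0.99]]] * 3
read_lik = [[1e-12, 0.015625, 0.94], [1.0, 1.0, 1.0], [1e-12, 0.015625, 0.94]]
calls, joints = posterior_decode(read_lik, init, trans, emit)
print(calls)
```

In the toy example the uncovered middle SNP is called from its neighbours through the shared founder states, which is the effect the model relies on at low coverage. Summing `joints[i]` over g gives P(r) at every locus, a convenient consistency check.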
Genotyping Accuracy on Watson Reads
[Figure: two bar charts of genotyping accuracy (%), one for homozygous and one for heterozygous Watson SNPs (Affy 500k), comparing Binomial and Posterior genotype calls at 0.35x, 0.70x, 1.41x, 2.82x and 5.64x coverage]
Conclusions
• HMM model of haplotype diversity provides a powerful framework for addressing central problems in population genetics & genetic epidemiology
- Enables significant improvements in accuracy by exploiting the high amount of linkage disequilibrium in human populations
- Despite hardness results, heuristics such as posterior or Viterbi decoding perform well in practice
- Highly scalable runtime (linear in #SNPs and #individuals/reads)
• Software available at http://www.engr.uconn.edu/~ion/SOFT/
Acknowledgements
• Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin Kennedy, Bogdan Pasaniuc
• NSF funding (awards IIS-0546457 and DBI-0543365)