Hidden Markov Models of Haplotype
Diversity and Applications in
Genetic Epidemiology
Ion Mandoiu
University of Connecticut
Outline
HMM model of haplotype diversity
Applications
- Phasing
- Error detection
- Imputation
- Genotype calling from low-coverage sequencing data
Conclusions
Single Nucleotide Polymorphisms
Main form of variation between individual genomes:
single nucleotide polymorphisms (SNPs)
… ataggtccCtatttcgcgcCgtatacacgggActata …
… ataggtccGtatttcgcgcCgtatacacgggTctata …
… ataggtccCtatttcgcgcCgtatacacgggTctata …
High density in the human genome: $1 \times 10^7$ SNPs
out of a total of $3 \times 10^9$ base pairs
Haplotypes and Genotypes
Diploids: two homologous copies of each autosomal chromosome
Haplotype: description of SNP alleles on a chromosome
One inherited from mother and one from father
0/1 vector: 0 for major allele, 1 for minor
Genotype: description of alleles on both chromosomes
0/1/2 vector: 0 (or 1) if both chromosomes carry the major (or minor) allele;
2 if the chromosomes carry different alleles
011100110 + 001000010 (two haplotypes per individual)
= 021200210 (genotype)
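The haplotype-to-genotype encoding above can be checked with a few lines of Python. This is a minimal sketch; the function name `genotype_from_haplotypes` is chosen here for illustration and does not come from the slides.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    Encoding from the slide: 0 or 1 when both chromosomes carry the same
    allele, 2 when the two chromosomes carry different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# The example from the slide.
print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```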
Sources of Haplotype Diversity: Mutation
The International HapMap Consortium. A Haplotype Map
of the Human Genome. Nature 437, 1299-1320. 2005.
Sources of Haplotype Diversity: Recombination
Haplotype Structure in Human Populations
HMM Model of Haplotype Frequencies
[Figure: HMM with a chain of hidden founder states F1, F2, …, Fn, each emitting an observed allele H1, H2, …, Hn]
Fi = founder haplotype at locus i, Hi = observed allele at locus i
P(Fi), P(Fi | Fi-1) and P(Hi | Fi) estimated from reference
genotype or haplotype data
For a given haplotype h, P(H=h|M) can be computed in $O(nK^2)$ time
using the forward algorithm (n = number of SNPs, K = number of founder haplotypes)
Similar models proposed in [Schwartz 04, Rastas et
al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]
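As an illustration of the forward computation of P(H=h|M), here is a minimal NumPy sketch. The array layout (`init`, `trans`, `emit`) and the helper `random_model` are assumptions made for this example; the slides do not specify a data format, and a real model would be trained from reference data rather than drawn at random.

```python
import itertools
import numpy as np


def haplotype_probability(h, init, trans, emit):
    """P(H = h | M) by the forward algorithm.

    init:  (K,)         init[k]        = P(F_1 = k)
    trans: (n-1, K, K)  trans[i][a, b] = P(F_{i+2} = b | F_{i+1} = a)  (0-based i)
    emit:  (n, K, 2)    emit[i][k, x]  = P(H_{i+1} = x | F_{i+1} = k)
    """
    alpha = init * emit[0][:, h[0]]
    for i in range(1, len(h)):
        # One vector-matrix product per locus: O(K^2) work, O(n K^2) overall.
        alpha = (alpha @ trans[i - 1]) * emit[i][:, h[i]]
    return float(alpha.sum())


def random_model(n, K, seed=0):
    """Hypothetical toy model with random (normalized) parameters."""
    rng = np.random.default_rng(seed)
    init = rng.dirichlet(np.ones(K))
    trans = rng.dirichlet(np.ones(K), size=(n - 1, K))
    emit = rng.dirichlet(np.ones(2), size=(n, K))
    return init, trans, emit


init, trans, emit = random_model(n=4, K=3)
total = sum(haplotype_probability(h, init, trans, emit)
            for h in itertools.product((0, 1), repeat=4))
print(round(total, 6))  # probabilities of all 2^4 haplotypes sum to 1
```

For long haplotypes the products underflow, so a practical version would rescale `alpha` at each locus or work in log space.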
Genotype Phasing
g: 0010212
Phased as h1: 0010111 + h2: 0010010, or as h3: 0010011 + h4: 0010110?
Maximum Likelihood Genotype Phasing
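[Figure: two copies of the haplotype HMM, one with founder states F1, …, Fn emitting alleles H1, …, Hn and one with founder states F'1, …, F'n emitting alleles H'1, …, H'n; each genotype Gi is determined by Hi and H'i]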
Maximum likelihood genotype phasing: given g, find
$(h_1, h_2) = \arg\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$
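To make the objective concrete, the following sketch enumerates every haplotype pair compatible with a genotype and picks the pair with the largest product of probabilities. It is a brute-force illustration only (exponential in the number of heterozygous sites), not one of the heuristics cited in the slides; `hap_prob` stands for any function returning P(h|M), and the frequency table below is made up for the example.

```python
from itertools import product


def compatible_pairs(g):
    """Yield each unordered pair (h1, h2) of 0/1 strings with h1 + h2 = g.

    g uses the 0/1/2 encoding: 0 or 1 = homozygous, 2 = heterozygous.
    """
    het = [i for i, x in enumerate(g) if x == "2"]
    seen = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pair = tuple(sorted(("".join(h1), "".join(h2))))
        if pair not in seen:  # (h1, h2) and (h2, h1) are the same phasing
            seen.add(pair)
            yield pair


def ml_phasing(g, hap_prob):
    """Return the compatible pair maximizing hap_prob(h1) * hap_prob(h2)."""
    return max(compatible_pairs(g),
               key=lambda p: hap_prob(p[0]) * hap_prob(p[1]))


# Hypothetical haplotype probabilities, for illustration only.
freq = {"0010111": 0.4, "0010010": 0.3, "0010011": 0.2, "0010110": 0.1}
print(ml_phasing("0010212", lambda h: freq.get(h, 0.0)))
# ('0010010', '0010111')
```

For the genotype 0010212 the enumeration produces exactly the two candidate phasings shown on the Genotype Phasing slide.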
Computational Complexity
• [KMP08] Cannot approximate $\max_{h_1 + h_2 = g} P(h_1 \mid M)\, P(h_2 \mid M)$
within a factor of $O(n^{1/2 - \varepsilon})$ for any $\varepsilon > 0$, unless ZPP=NP
• [Rastas et al.] give Viterbi and random sampling based
heuristics that yield phasing accuracy comparable to the best
existing methods (PHASE)
Genotyping Errors
A real problem despite advances in technology & typing
algorithms
1.1% of 20 million dbSNP genotypes typed multiple times are
inconsistent [Zaitlen et al. 2005]
Systematic errors (e.g., assay failure) typically detected by
departure from HWE [Hosking et al. 2004]
In pedigrees, some errors detected as Mendelian
Inconsistencies (MIs)
Many errors remain undetected
As much as 70% of errors are Mendelian consistent for
mother/father/child trios [Gordon et al. 1999]
Likelihood Sensitivity Approach to Error
Detection in Trios
Mother: 012102 = h1 (0 1 1 1 0 0) + h2 (0 1 0 1 0 1)
Father: 022102 = h3 (0 0 0 1 0 1) + h4 (0 1 1 1 0 0)
Child: 022102 = h1 (0 1 1 1 0 0) + h3 (0 0 0 1 0 1)
$L(T) = \max\, p(h_1)\, p(h_2)\, p(h_3)\, p(h_4)$: likelihood of the best phasing for the original trio T
Likelihood Sensitivity Approach to Error
Detection in Trios
Mother: 012102 = h'1 (0 1 0 1 0 1) + h'2 (0 1 1 1 0 0)
Father: 022102 = h'3 (0 0 0 1 0 0) + h'4 (0 1 1 1 0 1)
Child: 0 2 2? 1 0 2 (third genotype under test), with haplotypes h'1 (0 1 0 1 0 1) and h'3 (0 0 0 1 0 0)
$L(T) = \max\, p(h_1)\, p(h_2)\, p(h_3)\, p(h_4)$: likelihood of the best phasing for the original trio T
$L(T') = \max\, p(h'_1)\, p(h'_2)\, p(h'_3)\, p(h'_4)$: likelihood of the best phasing for the modified trio T'
Likelihood Sensitivity Approach to Error
Detection in Trios
Mother: 012102
Father: 022102
Child: 0 2 2? 1 0 2
Large change in likelihood suggests likely error
Flag genotype as an error if L(T')/L(T) > R, where R is the
detection threshold (e.g., $R = 10^4$)
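The flagging rule can be sketched as a loop over every genotype in the trio. This is only an illustration under stated assumptions: `likelihood` is a placeholder for any trio likelihood function L(·) (the slides do not fix an interface), and taking the best alternative genotype at each position as the modified trio T' is an assumption of this sketch.

```python
def flag_errors(trio, likelihood, R=1e4):
    """Return (member, locus) pairs whose genotype is flagged as a likely error.

    trio:       dict mapping member name -> list of genotypes in {0, 1, 2}
    likelihood: callable taking such a dict and returning L(T) (hypothetical)
    R:          detection threshold on the ratio L(T') / L(T)
    """
    base = likelihood(trio)
    flagged = []
    for member, genotypes in trio.items():
        for i, g in enumerate(genotypes):
            best_alt = 0.0
            for alt in (0, 1, 2):
                if alt == g:
                    continue
                modified = {m: list(v) for m, v in trio.items()}
                modified[member][i] = alt
                best_alt = max(best_alt, likelihood(modified))
            # Same test as best_alt / base > R, but safe when base == 0.
            if best_alt > R * base:
                flagged.append((member, i))
    return flagged


# Toy likelihood, for illustration only: it strongly penalizes a
# heterozygous call at the child's third locus.
def toy_likelihood(t):
    return 1e-6 if t["child"][2] == 2 else 1.0


trio = {"mother": [0, 1, 2, 1, 0, 2],
        "father": [0, 2, 2, 1, 0, 2],
        "child":  [0, 2, 2, 1, 0, 2]}
print(flag_errors(trio, toy_likelihood))  # [('child', 2)]
```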
Alternate Likelihood Functions
• [KMP08] Cannot approximate L(T) within a factor of $O(n^{1/4 - \varepsilon})$ for any $\varepsilon > 0$, unless
ZPP=NP
• Efficiently Computable Likelihood Functions
- Viterbi probability
- Probability of Viterbi Haplotypes
- Total Trio Probability
Comparison with FAMHAP (Children)
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.015) for TotalProb-UNO, TotalProb-DUO, TotalProb-TRIO, TotalProb-COMBINED, FAMHAP-1, and FAMHAP-3]
Comparison with FAMHAP (Parents)
[Figure: sensitivity (0 to 1) vs. FP rate (0 to 0.015) for TotalProb-UNO, TotalProb-DUO, TotalProb-TRIO, TotalProb-COMBINED, FAMHAP-1, and FAMHAP-3]
Genome-Wide Association Studies
Powerful method for finding genes associated with
complex human diseases
Large number of markers (SNPs) typed in cases and
controls
Disease causal SNPs unlikely to be typed directly
Significant statistical power is gained by imputing
untyped HapMap genotypes [WTCCC'07]
HMM Based Genotype Imputation
Train the HMM using haplotypes from a related HapMap population or from a
small cohort typed at high density
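Probability of a missing genotype given the typed genotype data:
$$P(g_i = x \mid g_{-i}, M) = \frac{P(g_{-i},\, g_i = x \mid M)}{P(g_{-i} \mid M)}$$
where $g_{-i}$ denotes the typed genotypes (all loci other than $i$)
$g_i$ is imputed as $x^* = \arg\max_{x \in \{0,1,2\}} P(g_{-i},\, g_i = x \mid M)$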
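Once the three joint probabilities are available, the imputation rule is a normalization followed by an argmax. The sketch below assumes the joint values have already been computed from the HMM; the optional `threshold` argument, which leaves a genotype missing when the posterior is too low, is an assumption of this example.

```python
def impute(joint, threshold=0.0):
    """Impute one genotype from joint probabilities.

    joint:     [P(g_-i, g_i = x | M) for x in (0, 1, 2)]
    threshold: minimum posterior needed to make a call (assumed here)
    Returns (call, posterior); call is None when no call is made.
    """
    total = sum(joint)  # equals P(g_-i | M)
    if total == 0:
        return None, 0.0
    posterior = [p / total for p in joint]
    x = max(range(3), key=posterior.__getitem__)
    call = x if posterior[x] >= threshold else None
    return call, posterior[x]


print(impute([0.02, 0.06, 0.02])[0])                  # 1
print(impute([0.02, 0.06, 0.02], threshold=0.9)[0])   # None
```

Dividing by the total does not change the argmax, which is why the slide can maximize the joint probability directly.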
Experimental Results
Estimates of the allele 0 frequency based on Imputation
vs. Illumina 15k
Experimental Results
Accuracy and missing data rate for imputed genotypes at
different thresholds
Ultra-High Throughput Sequencing
New massively parallel sequencing technologies
deliver orders of magnitude higher throughput
compared to Sanger sequencing
Roche / 454 Genome Sequencer FLX: 100 Mb/run, 400 bp reads
Illumina / Solexa Genetic Analyzer 1G: 1000 Mb/run, 35 bp reads
Applied Biosystems SOLiD: 3000 Mb/run, 25-35 bp reads
Probabilistic Model
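[Figure: graphical model with founder chains F1, …, Fn and F'1, …, F'n emitting alleles H1, …, Hn and H'1, …, H'n; each genotype Gi is determined by Hi and H'i and generates the reads Ri,1, …, Ri,ci covering locus i]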
Model Training
Initial founder probabilities P(f1), P(f’1), transition probabilities P(fi+1|fi),
P(f’i+1|f’i), and emission probabilities P(hi|fi), P(h’i|f’i) trained using the
Baum-Welch algorithm from haplotypes inferred from the populations of
origin for mother/father
P(gi|hi,h'i) set to 1 if hi+h'i=gi and to 0 otherwise
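$$P(R_{i,j} = r \mid G_i = g_i) = \frac{g_i}{2}\, \varepsilon_{r(i)}^{\,1 - r(i)} \left(1 - \varepsilon_{r(i)}\right)^{r(i)} + \frac{2 - g_i}{2}\, \varepsilon_{r(i)}^{\,r(i)} \left(1 - \varepsilon_{r(i)}\right)^{1 - r(i)}$$
where $r(i)$ is the allele shown by read $r$ at locus $i$, $\varepsilon_{r(i)}$ is the probability that read $r$ has an error at locus $i$, and $g_i \in \{0, 1, 2\}$ counts the minor alleles (so $G_i = 1$ is heterozygous)
Conditional probabilities for sets of reads are given by:
$$P(r_i \mid G_i = 0) = \prod_{r \in r_i,\; r(i) = 0} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in r_i,\; r(i) = 1} \varepsilon_{r(i)}$$
$$P(r_i \mid G_i = 1) = \left(\frac{1}{2}\right)^{c_i}$$
$$P(r_i \mid G_i = 2) = \prod_{r \in r_i,\; r(i) = 1} \left(1 - \varepsilon_{r(i)}\right) \prod_{r \in r_i,\; r(i) = 0} \varepsilon_{r(i)}$$
where $c_i$ is the number of reads covering locus $i$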
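The read likelihoods above translate directly into code. The sketch below assumes genotypes count minor alleles (0, 1, or 2) and that each read contributes one observed allele and one error probability at the locus; the function names are illustrative, and the Phred conversion is the standard relation between a base quality score and an error probability.

```python
import math


def phred_to_error(q):
    """Standard Phred relation: quality score q -> error probability."""
    return 10 ** (-q / 10)


def single_read_likelihood(allele, err, g):
    """P(R = r | G = g) for one read showing `allele` (0 or 1) at the locus."""
    from_minor = err ** (1 - allele) * (1 - err) ** allele   # read drawn from a minor-allele copy
    from_major = err ** allele * (1 - err) ** (1 - allele)   # read drawn from a major-allele copy
    return (g / 2) * from_minor + ((2 - g) / 2) * from_major


def read_set_likelihoods(alleles, errors):
    """[P(r_i | G_i = g) for g in (0, 1, 2)], treating reads as independent."""
    return [math.prod(single_read_likelihood(a, e, g)
                      for a, e in zip(alleles, errors))
            for g in (0, 1, 2)]


# Three reads at one locus: two show the major allele, one the minor allele,
# each with error probability 0.01 (Phred score 20).
print(read_set_likelihoods([0, 0, 1], [phred_to_error(20)] * 3))
```

With three reads the heterozygous likelihood is (1/2)^3 = 0.125 regardless of the error rates, which matches the closed form for G_i = 1.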
Multilocus Genotyping Problem
GIVEN:
• Shotgun read sets r=(r1, r2, … , rn)
• Base quality scores
• HMMs for populations of origin for mother/father
FIND:
• Multilocus genotype g*=(g*1,g*2,…,g*n) with maximum
posterior probability, i.e., g*=argmaxg P(g | r)
Posterior Decoding Algorithm
1. For each i = 1..n, compute $g_i^* = \arg\max_{g_i} P(g_i \mid r) = \arg\max_{g_i} P(g_i, r)$
2. Return $g^* = (g_1^*, \ldots, g_n^*)$
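Joint probabilities can be computed using a forward-backward
algorithm:
$$P(g_i, r) = P(r_i \mid g_i) \sum_{f_i = 1}^{K} \sum_{f'_i = 1}^{K} \mathcal{F}^{\,i}_{f_i, f'_i}\; \mathcal{B}^{\,i}_{f_i, f'_i}\; \mathcal{E}^{\,i}_{f_i, f'_i}(g_i)$$
where $\mathcal{F}^{\,i}$ and $\mathcal{B}^{\,i}$ are the forward and backward probabilities for the founder pair $(f_i, f'_i)$ at locus $i$, and $\mathcal{E}^{\,i}_{f_i, f'_i}(g_i) = P(g_i \mid f_i, f'_i)$ is the probability that the pair emits genotype $g_i$
Direct implementation gives $O(m + nK^4)$ time, where
m = number of reads
n = number of SNPs
K = number of founder haplotypes in HMMs
Runtime reduced to $O(m + nK^3)$ using a speed-up idea similar to
[Rastas et al. 08, Kennedy et al. 08]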
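A compact NumPy sketch of posterior decoding over founder pairs follows. It is an illustration under several assumptions, not the authors' implementation: the array layouts and function names are invented for this example, genotypes count minor alleles, and the forward variable is defined to exclude the emission at locus i. Because transitions factor into a mother part and a father part, each update is two K×K matrix products, so the cost per locus is O(K^3); whether this matches the cited speed-up exactly is not confirmed by the slides.

```python
import numpy as np


def genotype_emissions(em, ef):
    """P(g | f, f') as a (K, K, 3) array for one locus.

    em, ef: (K, 2) arrays with P(h | f) for the mother and father HMMs.
    """
    out = np.empty((em.shape[0], ef.shape[0], 3))
    out[:, :, 0] = np.outer(em[:, 0], ef[:, 0])
    out[:, :, 2] = np.outer(em[:, 1], ef[:, 1])
    out[:, :, 1] = np.outer(em[:, 0], ef[:, 1]) + np.outer(em[:, 1], ef[:, 0])
    return out


def posterior_decode(init_m, init_f, trans_m, trans_f, emit_m, emit_f, read_lik):
    """Return (calls, joint), where joint[i, g] = P(g_i = g, r).

    trans_*: (n-1, K, K) with trans[i][a, b] = P(next = b | current = a)
    emit_*:  (n, K, 2); read_lik: (n, 3) with read_lik[i, g] = P(r_i | g)
    """
    n = len(read_lik)
    ge = [genotype_emissions(emit_m[i], emit_f[i]) for i in range(n)]
    # E[i][f, f'] = P(r_i | f, f'), summing over the three genotypes.
    E = [ge[i] @ read_lik[i] for i in range(n)]

    fwd = [np.outer(init_m, init_f)]  # excludes the emission at locus i
    for i in range(n - 1):
        fwd.append(trans_m[i].T @ (fwd[i] * E[i]) @ trans_f[i])

    bwd = [None] * n
    bwd[n - 1] = np.ones_like(fwd[0])
    for i in range(n - 2, -1, -1):
        bwd[i] = trans_m[i] @ (E[i + 1] * bwd[i + 1]) @ trans_f[i].T

    joint = np.array([read_lik[i] * np.einsum("ab,abg->g", fwd[i] * bwd[i], ge[i])
                      for i in range(n)])
    return joint.argmax(axis=1), joint
```

Each row of `joint` sums to the same value, P(r), which is a convenient sanity check. No rescaling is done here, so for many loci a practical version would normalize the forward and backward arrays to avoid underflow.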
Genotyping Accuracy on Watson Reads
[Figure: two bar charts, "Homozygous Watson SNPs (Affy 500k)" and "Heterozygous Watson SNPs (Affy 500k)", comparing Binomial and Posterior genotype calls at 0.35x, 0.70x, 1.41x, 2.82x, and 5.64x coverage; vertical axis from 0 to 120]
Conclusions
HMM model of haplotype diversity provides a powerful
framework for addressing central problems in population
genetics & genetic epidemiology
Enables significant improvements in accuracy by exploiting the high
amount of linkage disequilibrium in human populations
Despite hardness results, heuristics such as posterior or Viterbi
decoding perform well in practice
Highly scalable runtime (linear in #SNPs and #individuals/reads)
Software available at
http://www.engr.uconn.edu/~ion/SOFT/
Acknowledgements
Sanjiv Dinakar, Jorge Duitama, Yözen Hernández, Justin
Kennedy, Bogdan Pasaniuc
NSF funding (awards IIS-0546457 and DBI-0543365)