Genotype Error Detection using
Hidden Markov Models of
Haplotype Diversity
Justin Kennedy, Ion Mandoiu, Bogdan Pasaniuc
CSE Department, University of Connecticut
Outline
Introduction
Likelihood Sensitivity Approach to Error
Detection
HMM-Based Algorithms
Experimental Results
Conclusion
Genotyping Errors
A real problem despite advances in genotyping technology
[Zaitlen et al. 2005] found 1.1% inconsistencies among the 20
million dbSNP genotypes typed multiple times
Error types
Systematic errors (e.g., assay failure) detected by departure from
HWE [Hosking et al. 2004]
For pedigree data some errors detected as Mendelian
Inconsistencies (MIs)
Undetected errors
E.g., if mother/father/child are all heterozygous, any error is
Mendelian consistent
Only ~30% detectable as MIs for trios [Gordon et al. 1999]
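The all-heterozygous case can be checked by brute force. The sketch below is illustrative only: the genotype coding (0 = homozygous major, 1 = homozygous minor, 2 = heterozygous) follows the notation used later in the slides, and the function names are hypothetical.

```python
from itertools import product

# Genotype code -> the pair of alleles it represents (0 = major, 1 = minor).
ALLELES = {0: (0, 0), 1: (1, 1), 2: (0, 1)}

def mendelian_consistent(mother, father, child):
    """True if the child can be formed from one maternal and one paternal allele."""
    target = sorted(ALLELES[child])
    return any(sorted((m, f)) == target
               for m, f in product(ALLELES[mother], ALLELES[father]))

def undetected_single_errors(trio):
    """Return (undetected, total) over all single-genotype changes to the trio."""
    undetected = total = 0
    for i, g in enumerate(trio):
        for alt in ALLELES:
            if alt == g:
                continue
            changed = list(trio)
            changed[i] = alt          # introduce one genotype error
            total += 1
            undetected += mendelian_consistent(*changed)
    return undetected, total

# Mother, father, and child all heterozygous: every single error stays consistent.
print(undetected_single_errors((2, 2, 2)))
```

For the all-heterozygous trio, all six possible single-genotype changes remain Mendelian consistent, which is why such errors need a different detection signal.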
Effects of Undetected Genotyping
Errors
Even low error levels can have large effects for some study designs
(e.g. rare alleles, haplotype-based)
Error rates as low as 0.1% can increase Type I error rates in haplotype
sharing transmission disequilibrium test (HS-TDT) [Knapp & Becker 04]
1% errors decrease power by 10-50% for linkage, and by 5-20% for
association [Douglas et al. 00, Abecasis et al. 01]
Related Work
Improved genotype calling algorithms
[Marchini et al. 07, Nicolae et al. 06, Rabbee&Speed 05, Xiao et al. 07]
Explicit modeling in analysis methods
[Cheng 07, Hao & Wang 04, Liu et al. 07]
Computationally complex
Separate error detection step
Detected errors can be retyped, imputed, or ignored in downstream
analyses
Common approach in pedigree genotype data analysis [Abecasis et
al. 02, Douglas et al. 00, Sobel et al. 02]
Basic Notations
Haplotype: description of SNP alleles on a chromosome
0/1 vector: 0 for major allele, 1 for minor
Genotype: description of alleles on both chromosomes
0/1/2/? vector: 0 (1) - both chromosomes contain the major
(minor) allele; 2 - the chromosomes contain different alleles; ? –
missing genotype
  011100110
+ 001000010    (two haplotypes per individual)
  ---------
  021200210    (genotype)
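The coding above can be written as a short function. This is a minimal sketch with a hypothetical function name; it ignores missing data ("?").

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP loci")
    # Equal alleles give the homozygous code (0 or 1); different alleles give 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

print(genotype_from_haplotypes("011100110", "001000010"))  # 021200210
```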
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
Mother 012102
0 1 1 1 0 0 h1
0 1 0 1 0 1 h2
Father 022102
0 0 0 1 0 1 h3
0 1 1 1 0 0 h4
Child 022102
0 1 1 1 0 0 h1
0 0 0 1 0 1 h3
$L(T) = \max_{(h_1,h_2,h_3,h_4)} p(h_1)\,p(h_2)\,p(h_3)\,p(h_4)$: likelihood of best phasing for original trio T
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
Mother 012102
0 1 0 1 0 1 h’1
0 1 1 1 0 0 h’2
Father 022102
0 0 0 1 0 0 h’3
0 1 1 1 0 1 h’4
Child 0 2 2? 1 0 2
0 1 0 1 0 1 h’1
0 0 0 1 0 0 h’3
$L(T) = \max_{(h_1,h_2,h_3,h_4)} p(h_1)\,p(h_2)\,p(h_3)\,p(h_4)$: likelihood of best phasing for original trio T
$L(T') = \max_{(h'_1,h'_2,h'_3,h'_4)} p(h'_1)\,p(h'_2)\,p(h'_3)\,p(h'_4)$: likelihood of best phasing for modified trio T’
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
Mother 012102
Father 022102
Child 0 2 2? 1 0 2
Large change in likelihood suggests likely error
Flag genotype as an error if L(T’)/L(T) > R, where R is the
detection threshold (e.g., $R = 10^4$)
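A brute-force sketch of the likelihood ratio follows. It rests on several assumptions that are not in the slides: the haplotype frequencies in `FREQS` are made up for illustration, the modified trio T’ is formed by replacing the tested genotype with "?" (missing), and all function names are hypothetical. The trio and haplotypes are those of the example above.

```python
from itertools import product

def compatible(h1, h2, genotype):
    """True if haplotypes h1, h2 explain the 0/1/2/? genotype ('?' matches anything)."""
    return all(g == "?" or g == (a if a == b else "2")
               for a, b, g in zip(h1, h2, genotype))

def best_phasing_likelihood(trio, freqs):
    """L(T): max of p(h1)p(h2)p(h3)p(h4) with mother=(h1,h2), father=(h3,h4), child=(h1,h3)."""
    mother, father, child = trio
    best = 0.0
    for h1, h2, h3, h4 in product(freqs, repeat=4):
        if (compatible(h1, h2, mother) and compatible(h3, h4, father)
                and compatible(h1, h3, child)):
            best = max(best, freqs[h1] * freqs[h2] * freqs[h3] * freqs[h4])
    return best

def likelihood_ratio(trio, member, locus, freqs):
    """L(T')/L(T), where T' marks one genotype as missing (an assumption)."""
    genotypes = list(trio)
    g = genotypes[member]
    genotypes[member] = g[:locus] + "?" + g[locus + 1:]
    original = best_phasing_likelihood(trio, freqs)
    modified = best_phasing_likelihood(tuple(genotypes), freqs)
    return float("inf") if original == 0 else modified / original

# Hypothetical frequencies: the haplotype 000101 needed to phase T is very rare.
FREQS = {"011100": 0.4, "010101": 0.3, "000100": 0.2,
         "011101": 0.099999, "000101": 0.000001}
TRIO = ("012102", "022102", "022102")   # mother, father, child

ratio = likelihood_ratio(TRIO, member=2, locus=2, freqs=FREQS)
print(ratio > 1e4)   # flag the child's third genotype if the ratio exceeds R
```

Enumerating all quadruples is exponential in practice; it is used here only to make the definition of L(T) concrete.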
Implementation in FAMHAP
[Becker et al. 06]
Window-based algorithm
For each window including the SNP under test, generate list of H most
frequent haplotypes (default H=50)
Find most likely trio phasings by pruned search over the $H^4$
quadruples of frequent haplotypes
Flag genotype as an error if L(T’)/L(T) > R for at least one window
Example window:
Mother …201012 1 02210...
Father …201202 2 10211...
Child …000120 2 21021...
Limitations of FAMHAP
Implementation
Truncating the list of haplotypes to size H may lead to suboptimal phasings and inaccurate L(T) values
False positives caused by nearby errors (due to the use of
multiple short windows)
Our approach:
HMM model of haplotype frequencies → all haplotypes
represented + no need for short windows
Alternate likelihood functions → scalable runtime
Hidden Markov Model Overview
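Similar HMMs proposed by [Kimmel & Shamir 05, Rastas et
al. 05, Schwartz 04]
Paths with high transition probability correspond to
“founder” haplotypes
For a haplotype h, the best-path probability $\max_\pi P(h, \pi \mid M)$ and the total
probability $P(h \mid M)$ are computed using the Viterbi and forward algorithms, respectively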
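The two quantities differ only in replacing a sum over predecessor states by a maximum. The sketch below assumes a simple parameterization that is not taken from the slides: `init[k]` is the probability of starting in founder k, `trans[j][q][k]` the probability of moving from founder q at locus j to founder k at locus j+1, and `emit[j][k][a]` the probability that founder k emits allele a at locus j. All numbers are hypothetical.

```python
def forward_and_viterbi(h, init, trans, emit):
    """Return (P(h | M), max over paths of P(h, path | M)) for a 0/1 haplotype h."""
    K = len(init)
    total = [init[k] * emit[0][k][h[0]] for k in range(K)]
    best = list(total)
    for j in range(1, len(h)):
        # Forward step: sum over predecessor states.
        total = [emit[j][k][h[j]] * sum(total[q] * trans[j - 1][q][k] for q in range(K))
                 for k in range(K)]
        # Viterbi step: same recurrence with max in place of sum.
        best = [emit[j][k][h[j]] * max(best[q] * trans[j - 1][q][k] for q in range(K))
                for k in range(K)]
    return sum(total), max(best)

# Toy model: two founders, three loci, 1% emission noise, 10% switch probability.
founders = ["010", "101"]
err = 0.01
emit = [[[1 - err, err] if f[j] == "0" else [err, 1 - err] for f in founders]
        for j in range(3)]
trans = [[[0.9, 0.1], [0.1, 0.9]]] * 2
init = [0.5, 0.5]

p_total, p_best = forward_and_viterbi([0, 1, 0], init, trans, emit)
print(p_total >= p_best)
```

The total probability is never smaller than the best-path probability, since the best path is one of the summed terms.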
HMM Training
Previous works use EM training of HMM based on
unrelated genotype data
Our 2-step training algorithm exploits pedigree info
Step 1: Infer haplotypes using pedigree-aware algorithm based
on entropy-minimization
Step 2: train HMM based on inferred haplotypes, using Baum-Welch
Inapproximability of HMM-based
Maximum Trio Phasing Probability
Maximum trio phasing probability problem: Given HMM model
M with n SNP loci and K founders + genotype trio T, find
$L(T) = \max_{(h_1,h_2,h_3,h_4)} p(h_1 \mid M)\,p(h_2 \mid M)\,p(h_3 \mid M)\,p(h_4 \mid M)$
Theorem: For every $\varepsilon > 0$, maximum trio phasing probability cannot
be approximated within a factor of $O(n^{1/4-\varepsilon})$, unless
ZPP=NP
• Proof by reduction from the maximum clique problem using an
idea of [Lyngsø & Pedersen 02]; details in journal version
Alternate Likelihood Functions
• Viterbi probability (ViterbiProb): the maximum probability
of a set of 4 HMM paths that emit 4 haplotypes compatible
with the trio
• Probability of Viterbi Haplotypes (ViterbiHaps): product
of total probabilities of the 4 Viterbi haplotypes
• Total Trio Probability (TotalProb): total probability P(T)
that the HMM emits four haplotypes that explain trio T
along all possible 4-tuples of paths
Efficient Computation of Viterbi
Probability
For a fixed trio, Viterbi paths can be found using a 4-path
version of Viterbi’s algorithm in $O(NK^8)$ time
$K^3$ speed-up by reuse of common terms (similar to [Rastas et
al. 05]):
$V(j+1; q_1, q_2, q_3, q_4) = E(j+1; q_1, q_2, q_3, q_4) \cdot \max_{q'_4 \in Q_j} \{ Pre_3(j; q_1, q_2, q_3, q'_4)\, \gamma(q'_4, q_4) \}$
Where:
• $E(j+1; q_1, q_2, q_3, q_4)$ = maximum probability of emitting SNP
genotypes at locus j+1 from states $(q_1, q_2, q_3, q_4)$
• $\gamma$ = transition probability
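The idea behind the speed-up, maximizing over one predecessor state at a time and reusing the partial maxima, can be illustrated on a 2-path analogue. This is only a sketch: the 2-path setting, the function names, and the random inputs are assumptions made to keep the example small, and the emission factor is omitted because it multiplies the result afterwards.

```python
import random
from itertools import product

def step_naive(V, gamma, K):
    """max over (p1, p2) of V[p1][p2] * gamma[p1][q1] * gamma[p2][q2]: O(K^4) work."""
    return [[max(V[p1][p2] * gamma[p1][q1] * gamma[p2][q2]
                 for p1, p2 in product(range(K), repeat=2))
             for q2 in range(K)] for q1 in range(K)]

def step_factored(V, gamma, K):
    """Same values, maximizing one coordinate at a time: O(K^3) work."""
    # pre1[q1][p2] = max over p1 of V[p1][p2] * gamma[p1][q1]
    pre1 = [[max(V[p1][p2] * gamma[p1][q1] for p1 in range(K))
             for p2 in range(K)] for q1 in range(K)]
    # Maximize over p2, reusing pre1 for every q2.
    return [[max(pre1[q1][p2] * gamma[p2][q2] for p2 in range(K))
             for q2 in range(K)] for q1 in range(K)]

rng = random.Random(0)
K = 3
V = [[rng.random() for _ in range(K)] for _ in range(K)]
gamma = [[rng.random() for _ in range(K)] for _ in range(K)]
naive, fast = step_naive(V, gamma, K), step_factored(V, gamma, K)
print(all(abs(naive[a][b] - fast[a][b]) < 1e-12 for a in range(K) for b in range(K)))
```

The factorization is valid because all terms are non-negative, so the maximum of a product over independent coordinates can be taken coordinate by coordinate. With four paths the same trick reduces the per-locus work from $K^8$ to $K^5$.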
Overall Runtimes
Viterbi probability
Likelihoods of all 3N modified trios can be computed within $O(NK^5)$
time using forward-backward algorithm
Overall runtime for M trios $O(MNK^5)$
Probability of Viterbi haplotypes
Obtain haplotypes from standard traceback, then compute haplotype
probabilities using forward algorithms
Overall runtime $O(M(NK^5 + N^2K))$
Total trio probability
Similar pre-computation speed-up & forward-backward algorithm
Overall runtime $O(MNK^5)$
Datasets
Real dataset [Becker et al. 2006]
35 SNP loci covering a region of 91kb
551 trios
Synthetic datasets
35 SNPs, 551 trios
Preserved missing data pattern of real dataset
Haplotypes assigned to trios based on frequencies
inferred from real dataset
1% error rate using random allele insertion model
Comparison of Alternative
Likelihood Functions
[Figure: sensitivity vs. FP rate (0 to 0.015) for VitHaps-P, VitProb-P, TotalProb-P, VitHaps-C, VitProb-C, and TotalProb-C]
Sensitivity = TP/(TP+FN)
False Positive rate = 1 - TN/(FP+TN) = FP/(FP+TN)
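As a small worked example of the two accuracy measures, using the standard definition of sensitivity, TP/(TP+FN), the sketch below uses made-up counts and hypothetical function names.

```python
def sensitivity(tp, fn):
    """Fraction of true errors that are flagged: TP / (TP + FN)."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """Fraction of correct genotypes that are flagged: 1 - TN / (FP + TN)."""
    return 1 - tn / (fp + tn)

# Hypothetical counts: 100 true errors, 1000 correct genotypes.
sens = sensitivity(tp=80, fn=20)
fpr = false_positive_rate(fp=5, tn=995)
print(round(sens, 3), round(fpr, 3))
```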
Distribution of Log-Likelihood
Ratios for TotalTrioProb
[Figure: histograms of log-likelihood ratios (x-axis 0 to 5.94; counts on a log scale from 1 to 1,000,000) for NO_ERR and ERR genotypes. Panels: Parents-TRIOS and Children-TRIOS. Annotation on the Children-TRIOS panel: "Same-locus errors in parents"]
“Combined” Detection Method
Compute 4 likelihood ratios
Trio
Mother-child duo
Father-child duo
Child (unrelated)
Flag as error if all ratios are
above detection threshold
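The decision rule can be sketched as below. The dictionary keys, the function name, and the example ratios are hypothetical; the default threshold reuses the example value R = 10^4 given earlier for the likelihood sensitivity approach.

```python
def flag_combined(ratios, threshold=1e4):
    """Flag a genotype only if every likelihood ratio exceeds the threshold."""
    return all(r > threshold for r in ratios.values())

# Hypothetical likelihood ratios for one child genotype.
strong = {"trio": 3e6, "mother_child": 5e4, "father_child": 2e5, "child": 1.5e4}
# A large trio ratio alone is not enough under the combined rule.
mixed = {"trio": 3e6, "mother_child": 5e4, "father_child": 2e5, "child": 12.0}

print(flag_combined(strong), flag_combined(mixed))  # True False
```

Requiring agreement across all four views makes the rule more conservative than the trio ratio alone.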
Distribution of Log-Likelihood
Ratios for Combined Method
[Figure: histograms of log-likelihood ratios (x-axis 0 to about 5.9; counts on a log scale from 1 to 1,000,000) for NO_ERR and ERR genotypes. Panels: Parents-COMBINED and Children-COMBINED]
Comparison with FAMHAP (Children)
[Figure: sensitivity vs. FP rate (0 to 0.015) for TotalProb-TRIO, TotalProb-COMBINED, and FAMHAP, with the #FP=#FN line marked]
Comparison with FAMHAP (Parents)
[Figure: sensitivity vs. FP rate (0 to 0.015) for TotalProb-TRIO, TotalProb-COMBINED, and FAMHAP, with the #FP=#FN line marked]
Error Detection Accuracy on
Unrelated Genotype Data
[Figure: sensitivity vs. FP rate (0 to 0.015) for Len=10Kb (U), Len=100Kb (U), Len=1Mb (U), and Len=10Mb (U), with the #FP=#FN line marked]
551 unrelated individuals
Recombination & mutation rates of $10^{-8}$ per generation per bp
35 SNPs within a region of 10kb-10Mb
Conclusion
Proposed efficient methods for error detection in
trio genotype data based on an HMM model of
haplotype diversity
Can exploit available pedigree info
Yield improved detection accuracy compared to FAMHAP
Runtime grows linearly in #SNPs and #individuals
Ongoing work
Improve error detection in unrelated genotype data by
integration of typing confidence scores
Questions?
Effect of Population Size
[Figure: sensitivity vs. FP rate (0 to 0.015) for n=551, n=129, and n=30, each shown for (P) and (C)]
Error Model Comparison
[Figure: sensitivity vs. FP rate (0 to 0.015) for the random allele, random geno, hetero-to-homo, and homo-to-hetero error models, each shown for (P) and (C)]
TrioProb-Combined Results on
Real Dataset
            Total Signals       True Positives      False Positives     Unknown
FP Rate     1%    .5%   .1%     1%    .5%   .1%     1%    .5%   .1%     1%    .5%   .1%
Parents     218   127   69      9     9     8       1     0     0       208   118   61
Children    104   74    24      11    11    11      3     3     2       90    60    11
Total       322   201   93      20    20    19      4     3     2       298   178   72
[Becker et al. 06] resequenced all trio members at 41 loci
flagged by FAMHAP-3
26 SNP genotypes in 23 trios were identified as true errors
41*3-26=97 resequenced SNP genotypes agree with original calls
(or are unknown)
Complexity of Computing
Maximum Phasing Probability
• For unrelated genotypes, computing maximum phasing probability is
hard to approximate within a factor of $O(f^{1/2-\varepsilon})$ unless ZPP=NP, where f
is the number of founders
• For trios, hard to approx. within $O(f^{1/4-\varepsilon})$
• Reductions from the clique problem