
CSE291: Personal genomics for bioinformaticians
Class meetings: TR 3:30-4:50 MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216
Contact: [email protected]
Today’s schedule:
• 3:30-4:10 – Phasing and imputation
• 4:10-4:15 – break
• 4:15-4:20 – Tips for using XSEDE
• 4:20-4:50 – Work on problem set 2
Announcements:
• Reading posted for next Tuesday
Phasing and Imputation
CSE291: Personal Genomics for Bioinformaticians
01/19/17
Outline
• Overview
• Methods for phasing
• Hidden Markov Models
• Phasing/imputation using HMMs
• Overview of existing tools
Overview
Review: LD induces correlation in the genome
[Figure: haplotypes before recombination and after recombination, with the recombination event marked]
Review: LD induces correlation in the genome
Linkage disequilibrium decays with distance
Recombination induces haplotype “blocks” of correlated SNPs
http://graphics.cs.wisc.edu/WP/vis10/archives/458-hapmap-linkagedisequilibrium-plot
Review: quantifying LD
Contingency table of haplotype counts for two biallelic loci, A/a and B/b:

          B      b      Total
A         nAB    nAb    nA
a         naB    nab    na
Total     nB     nb     n

Frequencies (each count above divided by n):
Linkage equilibrium:
nAB=nAnB
nAb=nA(1-nB)
naB=(1-nA)nB
nab=(1-nA)(1-nB)
Linkage disequilibrium:
nAB≠nAnB
nAb≠nA(1-nB)
naB≠(1-nA)nB
nab≠(1-nA)(1-nB)
Can quantify by:
r2: Χ2/n
• Where Χ2 is the chi-squared statistic for the contingency table
Can also write as a correlation coefficient:
r: (nAB-nAnB)/sqrt[nA(1-nA) nB(1-nB)]
• Same as Pearson correlation. Between -1 and +1.
• r2 gives power loss in association studies
• Most commonly used in population genetics
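As a rough illustration of the two formulas on this slide, the sketch below computes r from haplotype counts and checks that r squared matches Χ2/n. The function names and the example counts (40, 10, 10, 40) are made up for illustration and do not come from the lecture.

```python
import math

def ld_r(n_AB, n_Ab, n_aB, n_ab):
    """Correlation coefficient r between alleles A and B from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n                 # frequency of the AB haplotype
    p_A = (n_AB + n_Ab) / n         # frequency of allele A
    p_B = (n_AB + n_aB) / n         # frequency of allele B
    return (p_AB - p_A * p_B) / math.sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))

def chi_squared(n_AB, n_Ab, n_aB, n_ab):
    """Pearson chi-squared statistic for the 2x2 table (no continuity correction)."""
    observed = [[n_AB, n_Ab], [n_aB, n_ab]]
    n = n_AB + n_Ab + n_aB + n_ab
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: AB=40, Ab=10, aB=10, ab=40
counts = (40, 10, 10, 40)
r = ld_r(*counts)
r2_from_chi2 = chi_squared(*counts) / sum(counts)
print(r, r * r, r2_from_chi2)
```

Both routes give the same r2 for a 2x2 table, which is why the slide can present them interchangeably.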
Haplotypes vs. Genotypes
Heterozygous genotype (CA or AC)
A C G T T G C A T A C C - T T
A A G T A G C A C A C C G T T
Haplotype
A C G T T G C A T A C C - T G
A C G T A G C A C A C C G T T
Homozygous genotype (TT)
A A G T A G C A T A C C G T T
A A G T A G C A T A C C G T T
Phase: are “T” and “C” on the same copy of the chromosome?
A C G T T G C A C A C C G T T
A A G T A G C A T A C C - T G
Imputation
A ? G T T G C A ? A C C - T T
A ? G T A G C A ? A C C G T T
A ? G T T G C A ? A C C - T G
A ? G T A G C A ? A C C G T T
A ? G T A G C A ? A C C G T T
A ? G T A G C A ? A C C G T T
A ? G T T G C A ? A C C G T T
A ? G T A G C A ? A C C - T G
Phasing and imputation
Phasing:
• Intuitively: For a given position, which allele came from mom and which from dad?
• Or: for a set of adjacent heterozygous sites, which alleles are on the same chromosome? (Homozygous sites are trivially phased.)
Imputation:
• Given a set of genotyped positions and a reference haplotype panel, infer missing genotypes
Process:
• Usually: first phase, then use phased haplotypes to impute. But as we’ll see, these processes are not completely independent!
Applications of phasing and imputation
Save $$ on association tests by genotyping fewer variants (phasing->imputation)
A ? G T A G C A ? A C C G T T
Resolving compound heterozygotes (phasing)
Observed genotype: AGCAT(T/C)AGATAT(G/A)
Possible haplotype pairs: AGCATCAGATATG / AGCATTAGATATA, or AGCATTAGATATG / AGCATCAGATATA
IBD analysis (phasing)
Detecting selection, studying recombination/mutation, and more
Phasing methods
Experimental phasing (e.g. physical phasing)
True sequence
AAACTACGTCTATACG
Observed sequencing read ACTACGTCT
Phasing from related individuals
AG
TT
AA
GT
GA
TG
Statistical phasing of unrelated individuals
Inferred haplotypes:
AAACGACATCTACTACG
AAATGACGTCTCCTACG
ATATGAAGTCTCCTGCG
Measuring phasing accuracy – switch error
True haplotypes
A G T C T A G G G A T C C A A G C A T A T T
C G A T A T C C A T C G G G T C T T A G C C
Inferred haplotypes
A G T C T A G G G A T C G G T C T T A G C C
C G A T A T C C A T C G C A A G C A T A T T
Switch error = s/(n-1)
n: number of heterozygous sites in the individual
s: minimal number of switches needed to recover the true haplotype
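A minimal sketch of the switch error calculation, assuming both haplotype pairs are given as equal-length strings and that the inferred pair has the same genotype as the truth at every site. The function name is hypothetical. The example strings follow the slide, with a single switch after the twelfth site.

```python
def switch_error(true_h1, true_h2, inf_h1, inf_h2):
    """Switch error s/(n-1) over the heterozygous sites of one individual."""
    orientation = []  # 0 if inf_h1 carries true_h1's allele at a het site, else 1
    for t1, t2, i1, i2 in zip(true_h1, true_h2, inf_h1, inf_h2):
        if t1 == t2:
            continue  # homozygous sites carry no phase information
        if (i1, i2) == (t1, t2):
            orientation.append(0)
        elif (i1, i2) == (t2, t1):
            orientation.append(1)
        else:
            raise ValueError("inferred genotype differs from true genotype")
    n = len(orientation)
    if n < 2:
        return 0.0  # nothing to phase with fewer than two het sites
    # Each change of orientation between consecutive het sites is one switch
    s = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return s / (n - 1)

true_h1 = "AGTCTAGGGATCCAAGCATATT"
true_h2 = "CGATATCCATCGGGTCTTAGCC"
inf_h1 = "AGTCTAGGGATCGGTCTTAGCC"
inf_h2 = "CGATATCCATCGCAAGCATATT"
print(switch_error(true_h1, true_h2, inf_h1, inf_h2))
```

Here 21 of the 22 sites are heterozygous and one switch is needed, so the switch error is 1/20.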
Phasing in families
Unphased genotypes
0 1 2 2 0 0 1 0
1 1 0 0 1 1 1 2
0 ? 1 1 0 0 1 0
0 ? 1 1 0 0 0 0
0 ? 0 0 0 1 1 1
1 ? 0 0 1 0 0 1
Phased haplotypes
0 1 1 1 0 1 2 1
Mom:
Dad:
0 ? 1 1 0 0 1 0
0 ? 0 0 0 1 1 1
Family-based phasing is generally treated as the gold standard, since we can directly trace inherited haplotypes
Phasing in families
Li et al. 2009
Phasing and imputation in unrelated individuals
Li et al. 2009
Statistical phasing using haplotype frequencies
• Assume haplotype frequencies are known
• Population frequency of haplotype pair obtained using Hardy-Weinberg principle
Browning and Browning 2011
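To make the Hardy-Weinberg step concrete, here is a small sketch that scores the candidate haplotype pairs for one genotype. The haplotype frequencies are invented for illustration, and the function name is hypothetical. Under Hardy-Weinberg, a pair of distinct haplotypes has probability 2·f1·f2, and a pair of identical haplotypes has probability f1 squared.

```python
def phase_posteriors(candidate_pairs, hap_freq):
    """Posterior probability of each candidate haplotype pair for one genotype."""
    weights = []
    for h1, h2 in candidate_pairs:
        multiplicity = 1 if h1 == h2 else 2  # Hardy-Weinberg: f^2 or 2*f1*f2
        weights.append(multiplicity * hap_freq[h1] * hap_freq[h2])
    total = sum(weights)
    return {pair: w / total for pair, w in zip(candidate_pairs, weights)}

# Hypothetical frequencies for the four two-site haplotypes
hap_freq = {"00": 0.5, "11": 0.3, "01": 0.1, "10": 0.1}
# A double heterozygote is consistent with two haplotype pairs
candidates = [("00", "11"), ("01", "10")]
posteriors = phase_posteriors(candidates, hap_freq)
print(posteriors)
```

With these made-up frequencies the 00/11 phasing is strongly preferred, because both of its haplotypes are common.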
Extracting haplotypes from unrelated samples
Example: Clark’s Algorithm
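Genotypes are coded per site as 0 = homozygous for allele 0, 1 = heterozygous, 2 = homozygous for allele 1.

Genotype   After step 1    Consistent haplotype pairs
02101      ?               01001/01100 or 01000/01101
10000      10000/00000     10000/00000
21212      ?               10101/11111 or 10111/11101
00200      00100/00100     00100/00100
02000      01000/01000     01000/01000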
Step 1: Find all easy cases (homozygotes, single heterozygotes) and make a list of all haplotypes that can be uniquely resolved
Step 2: While unresolved genotypes remain, try to explain each one using a haplotype from the current list. If a listed haplotype is consistent with the genotype, add the complementary haplotype to the list.
Simple, but you might get stuck!
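The two steps can be sketched as follows. This is a simplified reading of Clark's algorithm that takes the first matching haplotype it finds. The function names are hypothetical, and genotypes are strings with 0 = homozygous for allele 0, 1 = heterozygous, 2 = homozygous for allele 1.

```python
def easy_resolution(g):
    """Haplotype pair for a genotype with at most one heterozygous site."""
    h1 = g.replace("2", "1")                    # het site carries allele 1 here
    h2 = g.replace("1", "0").replace("2", "1")  # het site carries allele 0 here
    return h1, h2

def complement(g, h):
    """Haplotype that pairs with h to give genotype g, or None if inconsistent."""
    other = []
    for gc, hc in zip(g, h):
        if gc == "1":
            other.append("1" if hc == "0" else "0")
        elif (gc == "0" and hc == "0") or (gc == "2" and hc == "1"):
            other.append(hc)
        else:
            return None
    return "".join(other)

def clark(genotypes):
    known, resolved = [], {}
    # Step 1: homozygotes and single heterozygotes are unambiguous
    for g in genotypes:
        if g.count("1") <= 1:
            resolved[g] = easy_resolution(g)
            for h in resolved[g]:
                if h not in known:
                    known.append(h)
    # Step 2: explain remaining genotypes with a known haplotype
    progress = True
    while progress:
        progress = False
        for g in genotypes:
            if g in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[g] = (h, other)
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    unresolved = [g for g in genotypes if g not in resolved]
    return resolved, unresolved

resolved, unresolved = clark(["02101", "10000", "21212", "00200", "02000"])
print(resolved["02101"], unresolved)
```

On the slide's genotypes, 02101 is resolved as 01000/01101 because 01000 is already on the list, while 21212 matches no known haplotype and stays unresolved. That is the "stuck" case.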
Expectation maximization algorithm
Step 1: Initially, “guess” haplotype frequencies (that are consistent with the data)
Step 2: Use current frequency estimates to infer genotype phase
Step 3: Re-estimate the frequency of haplotypes by counting the results
Step 4: Iterate steps 2 and 3 until the frequency estimates converge
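The four steps above might look like this in code. It is a toy sketch: it enumerates every haplotype pair compatible with each genotype, so it only works for a handful of sites. It also runs a fixed number of iterations instead of testing convergence. All names and the example genotypes are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(g):
    """All unordered haplotype pairs consistent with genotype string g (0/1/2)."""
    het = [i for i, c in enumerate(g) if c == "1"]
    base = ["1" if c == "2" else "0" for c in g]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = base[:], base[:]
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=50):
    options = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    # Step 1: start from uniform frequencies over all compatible haplotypes
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for opts in options:
            # Step 2: weight each phasing by its Hardy-Weinberg probability
            w = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in opts]
            total = sum(w)
            for (a, b), wi in zip(opts, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # Step 3: re-estimate frequencies from the expected counts
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Hypothetical sample: one double heterozygote plus four homozygotes
freq = em_haplotype_freqs(["11", "00", "22", "00", "22"])
print(freq)
```

Because haplotypes 00 and 11 are seen in the homozygotes, EM pushes the double heterozygote toward the 00/11 phasing and the frequencies of 01 and 10 shrink toward zero.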
Hidden Markov Models
What is an HMM?
State (hidden)
Data (observed)
• Generally used to model sequential data (e.g. time series)
• The state at x(t) is only dependent on the state at x(t-1)
• We observe data (y), but the “state” (x) is hidden.
Elements of an HMM
• State (hidden)
• e.g. rainy or sunny
• Initial probabilities of each state
• Observation
• e.g. walk, shop, clean
• Transition probabilities
• e.g. P(xt=rainy|xt-1=sunny)
• Emission probabilities
• e.g. P(walk|rainy)
https://en.wikipedia.org/wiki/Hidden_Markov_model
Example HMM
https://en.wikipedia.org/wiki/Hidden_Markov_model
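One way to write down the elements of a rainy/sunny HMM is with plain dictionaries. The probability values below are illustrative numbers in the style of the linked example and should be treated as assumptions. The helper `joint_probability` is a hypothetical name for a function that multiplies the initial, transition and emission terms along one path.

```python
states = ("Rainy", "Sunny")

# Initial probabilities of each state (assumed values)
start_p = {"Rainy": 0.6, "Sunny": 0.4}

# Transition probabilities: trans_p[previous][current]
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

# Emission probabilities: emit_p[state][observation]
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def joint_probability(hidden, observed):
    """P(hidden states, observations) for one fully specified path."""
    p = start_p[hidden[0]] * emit_p[hidden[0]][observed[0]]
    for t in range(1, len(hidden)):
        p *= trans_p[hidden[t - 1]][hidden[t]] * emit_p[hidden[t]][observed[t]]
    return p

# Sunny then Rainy, observing walk then shop: 0.4 * 0.6 * 0.4 * 0.4
p = joint_probability(("Sunny", "Rainy"), ("walk", "shop"))
print(p)
```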
Example HMM on DNA sequence: splice sites
States:
• E: exon
• 5: 5’ splice site
• I: intron
Emission probabilities
Transition probabilities (arrows)
Eddy Nature Biotechnology 2004
Decoding an HMM
Decoding problem: Given an HMM and an observed sequence of outputs, calculate the most likely sequence of hidden states
[Figure: trellis with hidden states 1-5 as rows and time (or position along a DNA sequence) as columns, with the observed data shown below]
Viterbi algorithm: dynamic programming algorithm to find the most likely sequence of hidden states
The Viterbi algorithm
Given an HMM with:
• State space S
• Initial probabilities π_i: probability of starting in state i
• Transition probabilities a_{i,j}: probability of transitioning from state i to state j
• Emission probabilities P(y|k): probability of observing y in state k
Let:
• y_1, y_2, y_3, …, y_T be the sequence of observations
• V_{t,k} be the probability of the most likely sequence of states that accounts for the first t observations and ends in state k
Key idea: the best path ending in state k at time t must be an extension of the best path ending in some state x at time t-1.
We can write the recurrence:
• V_{1,k} = P(y_1|k) · π_k
• V_{t,k} = max_{x ∈ S} P(y_t|k) · a_{x,k} · V_{t-1,x}
• x_T = argmax_{x ∈ S} V_{T,x}
The Viterbi algorithm
• V_{1,k} = P(y_1|k) · π_k
• V_{t,k} = max_{x ∈ S} P(y_t|k) · a_{x,k} · V_{t-1,x}
• x_T = argmax_{x ∈ S} V_{T,x}
We can calculate V_{t,k} by simply filling in the state-by-time matrix:
[Figure: matrix with states 1-5 as rows and time (or position along a DNA sequence) as columns]
• First column: use initial probabilities
• Each later column: calculate from the previous column
• Max value in the last column gives the answer!
• Backtrack the calculation to find the best path
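A compact sketch of the matrix-filling procedure, using dictionaries for the HMM parameters. The rainy/sunny numbers are illustrative assumptions, not values from the lecture. The code multiplies raw probabilities for readability. For long sequences such as chromosomes one would normally work with log probabilities to avoid underflow.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability)."""
    # First column: initial probability times emission
    V = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    back = [{}]
    # Each later column is computed from the previous column
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for k in states:
            best_prev = max(states, key=lambda x: V[t - 1][x] * trans_p[x][k])
            V[t][k] = emit_p[k][obs[t]] * trans_p[best_prev][k] * V[t - 1][best_prev]
            back[t][k] = best_prev
    # Max value in the last column, then backtrack
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]

# Assumed example parameters
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

path, prob = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(path, prob)
```

With these numbers the best path is Sunny, Rainy, Rainy with probability 0.01344.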
HMM algorithms overview
Viterbi algorithm: given an HMM and a sequence of outputs, infer the most likely sequence of hidden states
Forward algorithm: determine the probability of a state at a given time, given the outputs observed up to that time
Forward-backward algorithm: determine the posterior marginal distribution of every hidden state given all the outputs
Baum-Welch algorithm: uses EM to find a (local) maximum likelihood estimate of the parameters (transition, emission, and initial probabilities) of an HMM given a set of observations
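For comparison with Viterbi, here is a small forward-backward sketch that returns the likelihood of the observations and the posterior marginal of each hidden state. The parameter values are illustrative assumptions. No scaling is applied, so this version is only suitable for short sequences.

```python
def forward_backward(obs, states, start_p, trans_p, emit_p):
    """Return (likelihood of obs, list of posterior dicts, one per time step)."""
    T = len(obs)
    # Forward pass: fwd[t][k] = P(y_1..y_t, x_t = k)
    fwd = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    for t in range(1, T):
        fwd.append({
            k: emit_p[k][obs[t]] * sum(fwd[t - 1][x] * trans_p[x][k] for x in states)
            for k in states
        })
    # Backward pass: bwd[t][k] = P(y_{t+1}..y_T | x_t = k)
    bwd = [dict() for _ in range(T)]
    bwd[T - 1] = {k: 1.0 for k in states}
    for t in range(T - 2, -1, -1):
        bwd[t] = {
            k: sum(trans_p[k][x] * emit_p[x][obs[t + 1]] * bwd[t + 1][x] for x in states)
            for k in states
        }
    likelihood = sum(fwd[T - 1][k] for k in states)
    posteriors = [
        {k: fwd[t][k] * bwd[t][k] / likelihood for k in states} for t in range(T)
    ]
    return likelihood, posteriors

# Assumed example parameters
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

likelihood, posteriors = forward_backward(
    ("walk", "shop", "clean"), states, start_p, trans_p, emit_p
)
print(likelihood, posteriors)
```

Viterbi returns one best path, whereas the posteriors here give per-position uncertainty. Imputation methods use that uncertainty to report genotype probabilities instead of hard calls.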
Phasing/imputation using HMMs
Modeling imputation using an HMM
Observations: observed genotypes
States: pairs of haplotypes (Hi, Hj) for an individual, where H is the set of reference haplotypes (e.g. from 1000 Genomes or HapMap)
Transition probabilities: represent recombination events
Emission probabilities: P(obs genotype | haplotype pair), based on probabilities of mutation and genotyping error
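The sketch below is a deliberately simplified version of this model, not the diploid model on the slide: it imputes a single already-phased target haplotype, so each state is one reference haplotype being "copied" and not a pair. The switch probability `rho`, the error rate `eps`, the function name and the tiny reference panel are all assumptions chosen for illustration.

```python
def impute_haploid(target, panel, rho=0.05, eps=0.01):
    """Posterior probability of allele 1 at every site of one target haplotype.

    target: list of 0/1 alleles, with None at untyped sites
    panel:  list of reference haplotypes (lists of 0/1), same length as target
    """
    K, M = len(panel), len(target)

    def emit(k, m):
        if target[m] is None:
            return 1.0  # untyped site: no information
        return 1 - eps if panel[k][m] == target[m] else eps

    def step(j, k):
        # Stay on the same reference haplotype, or recombine to a random one
        return (1 - rho) * (j == k) + rho / K

    fwd = [[0.0] * K for _ in range(M)]
    bwd = [[1.0] * K for _ in range(M)]
    for k in range(K):
        fwd[0][k] = emit(k, 0) / K
    for m in range(1, M):
        for k in range(K):
            fwd[m][k] = emit(k, m) * sum(fwd[m - 1][j] * step(j, k) for j in range(K))
    for m in range(M - 2, -1, -1):
        for k in range(K):
            bwd[m][k] = sum(step(k, j) * emit(j, m + 1) * bwd[m + 1][j] for j in range(K))

    dosages = []
    for m in range(M):
        weights = [fwd[m][k] * bwd[m][k] for k in range(K)]
        total = sum(weights)
        # Average the reference alleles, weighted by the state posteriors
        dosages.append(sum(w * panel[k][m] for k, w in enumerate(weights)) / total)
    return dosages

# Hypothetical reference panel of three haplotypes over five sites
panel = [[0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1],
         [0, 1, 0, 1, 0]]
dosages = impute_haploid([1, None, 1, None, 1], panel)
print(dosages)
```

The typed sites match only the second reference haplotype, so the untyped sites are imputed as allele 1 with high probability. Extending the states to pairs of reference haplotypes gives the diploid model, at the cost of a state space that grows quadratically with panel size.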
Modeling imputation using an HMM
Example: fastPHASE
• States are “clusters” of similar haplotypes. Choosing number of clusters is tricky…
• Emissions are P(allele=A | haplotype cluster j)
• Two independent Markov chains
• Baum Welch to learn parameters
http://statweb.stanford.edu/~nzhang/Stat366/Lecture5.pdf
Measuring imputation accuracy
Concordance: what percent of imputed genotypes match the true genotypes?
• Inflated for rare variants, since calling every genotype homozygous for the major allele is already mostly right
R2: squared Pearson correlation between imputed and true genotypes
Others: IQS, IMPUTE2 INFO, which take into account allele frequencies
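A small made-up example of why concordance is inflated for rare variants: 100 individuals, two true heterozygotes, and an imputation that recovers only one of them. The data and function names are hypothetical.

```python
def concordance(true_g, imputed_g):
    """Fraction of imputed genotypes that exactly match the truth."""
    return sum(t == i for t, i in zip(true_g, imputed_g)) / len(true_g)

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

# Genotypes coded as alternate allele counts (0, 1, 2)
true_g = [0] * 98 + [1, 1]
imputed_g = [0] * 98 + [1, 0]

conc = concordance(true_g, imputed_g)
r2 = r_squared(true_g, imputed_g)
print(conc, r2)
```

Concordance is 99% even though half of the carriers were missed, while R2 is about 0.49. Note that R2 is undefined if either vector has no variation, for example if every imputed genotype is 0.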
Factors influencing phasing/imputation accuracy
• Sample size: larger=better
• Genotype accuracy: hard to get the haplotypes right if the genotypes were wrong! Genotype likelihoods are helpful
• Degree of relatedness: Related individuals produce superior phase, since we can directly trace inheritance patterns
• Sample ethnicity: does the reference panel match the target population?
• Allele frequency: rarer alleles (low MAF) are harder to impute
Overview of existing tools
State of the art
Beagle
• HMM-based method for imputation. Clustering method for phasing and identifying haplotypes
IMPUTE2
• Alternately estimates haplotypes at typed positions and imputes genotypes at untyped positions using Markov chain Monte Carlo. Integrates over unknown phase.
• Pre-phasing is suggested, e.g. using SHAPEIT
SHAPEIT (phasing)
• HMM to represent haplotypes consistent with observed genotypes
• duoHMM method incorporates family information
Eagle
• Uses the “positional Burrows-Wheeler transform” data structure with a search-based algorithm
The haplotype reference consortium
http://www.haplotype-reference-consortium.org/
• 64,976 haplotypes at 39M SNPs, each with a minor allele count >= 5
McCarthy et al. 2016
The imputation server
https://imputationserver.sph.umich.edu/index.html
VCF Format
“/” represents unphased genotypes e.g. 0/1
“|” represents phased genotypes, e.g. 0|1 vs. 1|0
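A minimal sketch of reading the GT value from a VCF sample column. It handles only the simple cases shown on the slide plus "." for a missing allele. The function name is hypothetical, and a real pipeline would normally use an existing VCF library.

```python
def parse_gt(gt):
    """Split a VCF GT string into (allele indices, phased flag)."""
    phased = "|" in gt
    sep = "|" if phased else "/"
    # "." marks a missing allele call
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased

print(parse_gt("0/1"))   # unphased heterozygote
print(parse_gt("0|1"))   # phased: reference allele on the first haplotype
print(parse_gt("1|0"))   # phased: alternate allele on the first haplotype
```

For unphased genotypes the order of the alleles carries no meaning, so 0/1 and 1/0 describe the same call. For phased genotypes, 0|1 and 1|0 are different haplotype assignments.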