CSE291: Personal genomics for bioinformaticians
Class meetings: TR 3:30-4:50, MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00, CSE 4216
Contact: [email protected]

Today’s schedule:
• 3:30-4:10 – Phasing and imputation
• 4:10-4:15 – break
• 4:15-4:20 – Tips for using XSEDE
• 4:20-4:50 – Work on problem set 2

Announcements:
• Reading posted for next Tuesday

Phasing and Imputation
CSE291: Personal Genomics for Bioinformaticians
01/19/17

Outline
• Overview
• Methods for phasing
• Hidden Markov Models
• Phasing/imputation using HMMs
• Overview of existing tools

Overview

Review: LD induces correlation in the genome
[Figure: haplotypes before and after recombination, with the recombination event marked]

Review: LD induces correlation in the genome
Linkage disequilibrium decays with distance
Recombination induces haplotype “blocks” of correlated SNPs
http://graphics.cs.wisc.edu/WP/vis10/archives/458-hapmap-linkagedisequilibrium-plot

Review: quantifying LD
Contingency table for two loci, A/a and B/b:

           B       b       Total
  A        n_AB    n_Ab    n_A
  a        n_aB    n_ab    n_a
  Total    n_B     n_b     n

Frequencies under linkage equilibrium:
  $n_{AB} = n_A n_B$
  $n_{Ab} = n_A (1 - n_B)$
  $n_{aB} = (1 - n_A) n_B$
  $n_{ab} = (1 - n_A)(1 - n_B)$

Frequencies under linkage disequilibrium:
  $n_{AB} \neq n_A n_B$
  $n_{Ab} \neq n_A (1 - n_B)$
  $n_{aB} \neq (1 - n_A) n_B$
  $n_{ab} \neq (1 - n_A)(1 - n_B)$

Can quantify by:
  $r^2 = \chi^2 / n$
• Where $\chi^2$ is the chi-squared statistic for the contingency table

Can also write as a correlation coefficient:
  $r = (n_{AB} - n_A n_B) / \sqrt{n_A (1 - n_A) \, n_B (1 - n_B)}$
• Same as the Pearson correlation. Between -1 and +1.
• $r^2$ gives power loss in association studies
• Most commonly used in population genetics

Haplotypes vs. Genotypes
[Figure: aligned sequence pairs illustrating a heterozygous genotype (CA or AC), a haplotype, and a homozygous genotype (TT)]
Phase: are “T” and “C” on the same copy of the chromosome?

Imputation
[Figure: aligned sequences with untyped positions shown as “?”]
Phasing and imputation
Phasing:
• Intuitively: for a given position, which allele came from mom and which from dad?
• Or: for a set of adjacent heterozygous sites, which alleles are on the same chromosome? (Homozygous sites are trivially phased.)
Imputation:
• Given a set of genotyped positions and a reference haplotype panel, infer missing genotypes
Process:
• Usually: first phase, then use phased haplotypes to impute. But as we’ll see, these processes are not completely independent!

Applications of phasing and imputation
• Save $$ on association tests by genotyping fewer variants (phasing->imputation), e.g. A ? G T A G C A ? A C C G T T
• Resolving compound heterozygotes (phasing)
    Unphased genotype: AGCAT(T/C)AGATAT(G/A)
    Possible haplotype pairs: AGCATCAGATATG / AGCATTAGATATA, or AGCATTAGATATG / AGCATCAGATATA
• IBD analysis (phasing)
• Detecting selection, studying recombination/mutation, and more

Phasing methods
Experimental phasing (e.g. physical phasing)
    True sequence:            AAACTACGTCTATACG
    Observed sequencing read:   ACTACGTCT
Phasing from related individuals
    AG TT AA GT GA TG
Statistical phasing of unrelated individuals
    Inferred haplotypes: AAACGACATCTACTACG, AAATGACGTCTCCTACG, ATATGAAGTCTCCTGCG

Measuring phasing accuracy – switch error
[Figure: a pair of true haplotypes next to the pair of inferred haplotypes]
Switch error = s/(n-1)
    n: number of heterozygous sites in the individual
    s: minimal number of switches needed to recover the true haplotype

Phasing in families
[Figure: unphased family genotypes (coded 0/1/2, “?” for missing) and the corresponding phased haplotypes, with rows labeled Mom and Dad]
Family-based phasing is generally treated as the gold standard, since we can directly trace inherited haplotypes.

Phasing in families
(Li et al. 2009)

Phasing and imputation in unrelated individuals
(Li et al. 2009)

Statistical phasing using haplotype frequencies
• Assume haplotype frequencies are known
• Population frequency of a haplotype pair is obtained using the Hardy-Weinberg principle
(Browning and Browning 2011)

Extracting haplotypes from unrelated samples
Example: Clark’s Algorithm

  Genotype    After Step 1    Consistent haplotype pairs
  02101       ?               01001/01100 or 01000/01101
  10000       10000/00000     10000/00000
  21212       ?               10101/11111 or 10111/11101
  00200       00100/00100     00100/00100
  02000       01000/01000     01000/01000

Step 1: Find all easy cases (homozygotes, single heterozygotes) and make a list of all haplotypes that can be uniquely resolved.
Step 2: While unresolved genotypes remain, try to explain each one using a haplotype from the current list. If a match is found, add that haplotype’s complementary partner to the list.
Simple, but you might get stuck!

Expectation maximization algorithm
Step 1: Initially, “guess” haplotype frequencies (that are consistent with the data)
Step 2: Use current frequency estimates to infer genotype phase
Step 3: Re-estimate frequency of haplotypes by counting the results
Step 4: Iterate steps 2 and 3 until the frequency estimates converge

Hidden Markov Models

What is an HMM?
[Figure: chain of hidden states, each emitting observed data]
• Generally used to model sequential data (e.g. time series)
• The state x(t) depends only on the state x(t-1)
• We observe data (y), but the “state” (x) is hidden.

Elements of an HMM
• State (hidden)
  • e.g. rainy or sunny
• Initial probabilities of each state
• Observation
  • e.g. walk, shop, clean
• Transition probabilities
  • e.g. P(x_t = rainy | x_{t-1} = sunny)
• Emission probabilities
  • e.g.
    P(walk|rainy)
https://en.wikipedia.org/wiki/Hidden_Markov_model

Example HMM
[Figure: example HMM diagram]
https://en.wikipedia.org/wiki/Hidden_Markov_model

Example HMM on DNA sequence: splice sites
States:
• E: exon
• 5: 5’ splice site
• I: intron
[Figure: state diagram showing emission probabilities and transition probabilities (arrows)]
(Eddy, Nature Biotechnology 2004)

Decoding an HMM
Decoding problem: given an HMM and an observed sequence of outputs, calculate the most likely sequence of hidden states.
[Figure: state (rows 1-5) by time (columns; or position along a DNA sequence) matrix, with the observed data along the time axis]
Viterbi algorithm: dynamic programming algorithm to find the most likely sequence of hidden states

The Viterbi algorithm
Given an HMM with:
• State space $S$
• Initial probabilities $\pi_i$: the probability of starting at state $i$
• Transition probabilities $a_{ij}$: the probability of transitioning from state $i$ to state $j$
Let:
• $y_1, y_2, y_3, \ldots, y_T$ be the sequence of observations
• $V_{t,k}$ be the probability of the most likely sequence of states responsible for the first $t$ observations and ending at state $k$
Key idea: the best path ending in state $k$ at time $t$ must extend the best path ending in some state $x$ at time $t-1$.
We can write the recurrence:
• $V_{1,k} = P(y_1 \mid k) \, \pi_k$
• $V_{t,k} = \max_{x \in S} P(y_t \mid k) \, a_{x,k} \, V_{t-1,x}$
• $x_T = \arg\max_{x \in S} V_{T,x}$

The Viterbi algorithm
We can calculate $V_{t,k}$ by simply filling in the state-by-time matrix:
[Figure: state (rows 1-5) by time (columns; or position along a DNA sequence) matrix. Callout: “Max value in this column gives the answer!”]
• Use initial probabilities
• Calculate from previous column
• Backtrack the calculation to find the best path

HMM algorithms overview
Viterbi algorithm: given an HMM and a sequence of outputs, infer the most likely sequence of hidden states
Forward algorithm: determine the probability of a state at a given time, given the outputs observed up to that time
Forward-backward algorithm: determine the posterior marginal distribution for all hidden states given the outputs
Baum-Welch algorithm: uses EM to find the maximum likelihood estimate of the parameters (transition, emission, and initial probabilities) of an HMM given a set of observations

Phasing/imputation using HMMs

Modeling imputation using an HMM
Observations: observed genotypes
States: represent pairs of haplotypes for an individual (Hi, Hj), where H is the set of reference haplotypes (e.g. from 1000 Genomes or HapMap)
Transition probabilities: represent recombination events
Emission probabilities: P(obs genotype | haplotype pair), based on probabilities of mutation and genotyping error

Modeling imputation using an HMM
Example: fastPhase
• States are “clusters” of similar haplotypes. Choosing the number of clusters is tricky…
• Emissions are P(allele=A | haplotype cluster j)
• Two independent Markov chains
• Baum-Welch to learn parameters
http://statweb.stanford.edu/~nzhang/Stat366/Lecture5.pdf

Measuring imputation accuracy
Concordance: what percent of imputed genotypes match the true genotypes?
• Inflated for rare variants
r²: the squared Pearson correlation between imputed and true genotypes
Others: IQS, IMPUTE2 INFO, which take into account allele frequencies

Factors influencing phasing/imputation accuracy
• Sample size: larger=better
• Genotype accuracy: hard to get the haplotypes right if the genotypes are wrong! Genotype likelihoods are helpful
• Degree of relatedness: related individuals produce superior phase, since we can directly trace inheritance patterns
• Sample ethnicity: does the reference panel match the target population?
• Allele frequency: rarer alleles (low MAF) are harder to impute

Overview of existing tools

State of the art
Beagle
• HMM-based method for imputation. Clustering method for phasing and identifying haplotypes
IMPUTE2
• Alternately estimates haplotypes at typed positions and imputes genotypes at untyped positions using Markov chain Monte Carlo. Integrates over unknown phase.
• Pre-phasing (e.g. with SHAPEIT) is suggested
SHAPEIT (phasing)
• HMM to represent haplotypes consistent with observed genotypes
• duoHMM method incorporates family information
Eagle
• Uses the “positional Burrows-Wheeler Transform” data structure with a search-based algorithm

The haplotype reference consortium
http://www.haplotype-reference-consortium.org/
• 64,976 haplotypes at 39M SNPs with minor allele count >= 5
(McCarthy et al. 2016)

The imputation server
https://imputationserver.sph.umich.edu/index.html

VCF Format
“/” represents unphased genotypes, e.g. 0/1
“|” represents phased genotypes, e.g. 0|1 vs. 1|0
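
To make the “/” versus “|” distinction concrete, here is a minimal Python sketch that splits a single GT string into its alleles and a phased flag. The helper name `parse_gt` is hypothetical (it is not part of any VCF library), and the sketch assumes a plain GT string rather than a full VCF record.

```python
def parse_gt(gt):
    """Split a GT string such as '0|1' or '0/1' into (alleles, phased).

    Hypothetical helper for illustration; a missing allele '.' becomes None.
    """
    phased = "|" in gt
    sep = "|" if phased else "/"
    alleles = [None if a == "." else int(a) for a in gt.split(sep)]
    return alleles, phased


for gt in ["0/1", "0|1", "1|0"]:
    alleles, phased = parse_gt(gt)
    # 0|1 and 1|0 are the same genotype but put the alternate allele
    # on different copies of the chromosome.
    print(gt, alleles, "phased" if phased else "unphased")
```

With phased calls the order of the alleles carries information, so 0|1 and 1|0 must be kept distinct; with unphased calls the order should not be interpreted.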
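
Phased genotypes are also what the switch error from the phasing-accuracy slide is computed on. The sketch below is one way to compute s/(n-1), assuming both haplotype pairs are given as equal-length strings and that the inferred genotypes agree with the true genotypes at every heterozygous site. The function name and the toy haplotypes are made up for illustration.

```python
def switch_error(true_pair, inferred_pair):
    """Return s/(n-1): s switches over n heterozygous sites."""
    t1, t2 = true_pair
    i1, i2 = inferred_pair
    # For each heterozygous site, record whether inferred copy 1 carries
    # the allele of true copy 1 (True) or of true copy 2 (False).
    orientation = []
    for a, b, c, d in zip(t1, t2, i1, i2):
        if a == b:
            continue  # homozygous sites are trivially phased
        if {a, b} != {c, d}:
            raise ValueError("inferred genotype differs from true genotype")
        orientation.append(c == a)
    n = len(orientation)
    if n < 2:
        return 0.0  # fewer than two heterozygous sites: nothing to switch
    # Each change of orientation between adjacent heterozygous sites
    # is one switch needed to recover the true haplotypes.
    s = sum(x != y for x, y in zip(orientation, orientation[1:]))
    return s / (n - 1)


# Toy example (made-up haplotypes): 3 heterozygous sites, 1 switch.
print(switch_error(("ACGTA", "ATGCC"), ("ACGCC", "ATGTA")))  # 0.5
```

Swapping the two inferred copies wholesale gives a switch error of 0, which matches the idea that only the relative phase between sites matters.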
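
Returning to the Viterbi recurrence from the HMM slides, the following is a minimal Python sketch of filling in the state-by-time matrix and backtracking. The slides only name the rainy/sunny states and the walk/shop/clean observations; the probability values below are illustrative assumptions, and the dictionary-based layout is just one convenient choice.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability)."""
    # V[t][k]: probability of the best path explaining obs[0..t], ending in k.
    V = [{k: emit_p[k][obs[0]] * start_p[k] for k in states}]
    back = [{}]  # back[t][k]: best previous state for state k at time t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for k in states:
            prev = max(states, key=lambda x: trans_p[x][k] * V[t - 1][x])
            V[t][k] = emit_p[k][obs[t]] * trans_p[prev][k] * V[t - 1][prev]
            back[t][k] = prev
    # The max value in the last column gives the answer; then backtrack.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]


# Illustrative parameters (assumed, not taken from the slides).
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

path, prob = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(path)  # ['Sunny', 'Rainy', 'Rainy']
```

For long sequences, such as positions along a chromosome, implementations usually work with log probabilities so that the products do not underflow.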
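
Finally, the two LD measures from the review slides can be checked against each other numerically. This is a small sketch assuming a 2x2 table of haplotype counts; the counts in the example are made up, and the function names are hypothetical.

```python
from math import sqrt


def ld_r(n_AB, n_Ab, n_aB, n_ab):
    """Correlation coefficient r between two biallelic loci."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    return (p_AB - p_A * p_B) / sqrt(p_A * (1 - p_A) * p_B * (1 - p_B))


def chi_squared(n_AB, n_Ab, n_aB, n_ab):
    """Chi-squared statistic for the 2x2 contingency table."""
    n = n_AB + n_Ab + n_aB + n_ab
    observed = ((n_AB, n_Ab), (n_aB, n_ab))
    rows = (n_AB + n_Ab, n_aB + n_ab)
    cols = (n_AB + n_aB, n_Ab + n_ab)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # count expected under equilibrium
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Made-up counts: 40 AB, 10 Ab, 10 aB, 40 ab haplotypes (n = 100).
counts = (40, 10, 10, 40)
r = ld_r(*counts)
print(r, r ** 2, chi_squared(*counts) / sum(counts))
```

On these counts r is 0.6, and r squared agrees with the chi-squared statistic divided by n (both 0.36, up to floating-point rounding).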