Markov Chains and Hidden Markov Models
Marjolijn Elsinga & Elze de Groot
Andrei A. Markov
Born: 14 June 1856 in Ryazan, Russia
Died: 20 July 1922 in Petrograd, Russia
Graduate of Saint Petersburg University (1878)
Work: number theory and analysis, continued fractions, limits of integrals, approximation theory and the convergence of series
Today's topics
Markov chains
Hidden Markov models
- Viterbi Algorithm
- Forward Algorithm
- Backward Algorithm
- Posterior Probabilities
Markov Chains (1)
Emitting states
Markov Chains (2)
Transition probabilities
Probability of the sequence
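A sketch of the two quantities in standard notation (reconstructed; the slide's own formulas are not in the extracted text):

```latex
% Transition probability from symbol s to symbol t
a_{st} = P(x_i = t \mid x_{i-1} = s)

% Probability of a whole sequence x of length L
P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}
```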
Key property of Markov Chains
The probability of a symbol x_i depends only on the value of the preceding symbol x_{i-1}
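Written out as a formula, this first-order Markov property reads:

```latex
P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1})
```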
Begin and End states
Silent states
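In the usual treatment (assumed here), begin and end are modelled as one silent state 0 that emits no symbol:

```latex
% Probability of starting with symbol s
P(x_1 = s) = a_{0s}

% Probability of ending after symbol t
P(\mathrm{end} \mid x_L = t) = a_{t0}
```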
Example: CpG Islands
CpG = Cytosine – phosphodiester bond – Guanine
100 – 1000 bases long
Cytosine is modified by methylation
Methylation is suppressed in short stretches of the genome (the start regions of genes)
Methylated cytosine has a high chance of mutating into a thymine (T)
Two questions
How would we decide if a short stretch of genomic sequence comes from a CpG island or not?
How would we find, given a long piece of sequence, the CpG islands in it, if there are any?
Discrimination
48 putative CpG islands are extracted
Derive 2 models
- regions labelled as CpG island (‘+’ model)
- regions from the remainder (‘-’ model)
Transition probabilities are set from the counts
- where c_st^+ is the number of times letter t follows letter s in the ‘+’ regions
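The resulting maximum likelihood estimates (and analogously for the ‘-’ model):

```latex
a^{+}_{st} = \frac{c^{+}_{st}}{\sum_{t'} c^{+}_{st'}}
```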
Maximum Likelihood Estimators
Each row sums to 1
Tables are asymmetric
Log-odds ratio
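The log-odds score of a sequence x under the two models (reconstructed in standard notation; x_0 denotes the begin state):

```latex
S(x) = \log \frac{P(x \mid \text{model}^{+})}{P(x \mid \text{model}^{-})}
     = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}}
```

A positive score means the sequence is more likely under the ‘+’ (CpG island) model.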
Discrimination shown
Simulation: ‘+’ model
Simulation: ‘-’ model
Today's topics
Markov chains
Hidden Markov models
- Viterbi Algorithm
- Forward Algorithm
- Backward Algorithm
- Posterior Probabilities
Hidden Markov Models (HMM) (1)
No one-to-one correspondence between states and symbols
No longer possible to say what state the model is in when it emits x_i
Transition probability from state k to l
π_i is the ith state in the path (state sequence)
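In symbols, the transition probability is:

```latex
a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)
```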
Hidden Markov Models (HMM) (2)
Begin state: a_0k
End state: a_k0
In the CpG islands example: states A+, C+, G+, T+ and A-, C-, G-, T-
Hidden Markov Models (HMM) (3)
We need a new set of parameters because we have decoupled symbols from states
Probability that symbol b is seen when in state k:
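In the standard notation:

```latex
e_k(b) = P(x_i = b \mid \pi_i = k)
```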
Example: dishonest casino (1)
Fair die and loaded die
Loaded die: probability 0.5 of a 6 and probability 0.1 for 1-5
Switch from fair to loaded: probability 0.05
Switch back: probability 0.1
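A minimal sketch of this model in Python (all variable names are my own):

```python
# States: F = fair die, L = loaded die
states = ["F", "L"]

# Transition probabilities (each row sums to 1)
trans = {"F": {"F": 0.95, "L": 0.05},   # switch fair -> loaded: 0.05
         "L": {"F": 0.10, "L": 0.90}}   # switch loaded -> fair: 0.1

# Emission probabilities for the six faces
emit = {"F": {face: 1 / 6 for face in "123456"},
        "L": {**{face: 0.1 for face in "12345"}, "6": 0.5}}
```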
Dishonest casino (2)
Emission probabilities: an HMM is a model that generates or emits sequences
Dishonest casino (3)
Hidden: you don’t know if the die is fair or loaded
Joint probability of observed sequence x and state sequence π:
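With a_{0π_1} the begin transition, the joint probability is:

```latex
P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}
```

where π_{L+1} is taken to be the end state if one is modelled.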
Three algorithms
What is the most probable path for generating a given sequence?
Viterbi Algorithm
How likely is a given sequence?
Forward Algorithm
How can we learn the HMM parameters given a set of sequences?
Forward-Backward (Baum-Welch) Algorithm
Viterbi Algorithm
CGCG can be generated in different ways, and with different probabilities
Choose the path with the highest probability
The most probable path can be found recursively
Viterbi Algorithm (2)
v_k(i) = probability of the most probable path ending in state k with observation x_i
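The recursion in full:

```latex
% Initialisation: the path starts in the begin state 0
v_0(0) = 1, \qquad v_k(0) = 0 \text{ for } k > 0

% Recursion
v_l(i+1) = e_l(x_{i+1}) \max_k \bigl( v_k(i)\, a_{kl} \bigr)

% Termination
P(x, \pi^{*}) = \max_k \bigl( v_k(L)\, a_{k0} \bigr)
```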
Viterbi Algorithm (3)
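A runnable sketch of the recursion, kept in plain probability space for readability (see the underflow remark later); it assumes the `states`/`trans`/`emit` dictionaries from the casino sketch above plus a `start` distribution:

```python
def viterbi(obs, states, start, trans, emit):
    """Most probable state path for the observation sequence obs."""
    v = {k: start[k] * emit[k][obs[0]] for k in states}  # v_k(1)
    ptrs = []                                            # back-pointers per position
    for sym in obs[1:]:
        nxt, ptr = {}, {}
        for l in states:
            # best predecessor state for l
            best = max(states, key=lambda k: v[k] * trans[k][l])
            ptr[l] = best
            nxt[l] = v[best] * trans[best][l] * emit[l][sym]
        ptrs.append(ptr)
        v = nxt
    # traceback from the most probable final state
    path = [max(states, key=lambda k: v[k])]
    for ptr in reversed(ptrs):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# e.g. viterbi("316666462", states, {"F": 0.5, "L": 0.5}, trans, emit)
```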
Viterbi Algorithm
Most probable path for CGCG
Viterbi Algorithm
Result with casino example
Three algorithms
What is the most probable path for generating a given sequence?
Viterbi Algorithm
How likely is a given sequence?
Forward Algorithm
How can we learn the HMM parameters given a set of sequences?
Forward-Backward (Baum-Welch) Algorithm
Forward Algorithm (1)
Probability over all possible paths
The number of possible paths increases exponentially with the length of the sequence
The forward algorithm enables us to compute this efficiently
Forward Algorithm (2)
Replace the maximisation steps of the Viterbi algorithm with sums
Probability of the observed sequence up to and including x_i, requiring π_i = k
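The definition and recursion (the max of Viterbi becomes a sum):

```latex
f_k(i) = P(x_1 \ldots x_i,\ \pi_i = k)

% Recursion
f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}

% Termination
P(x) = \sum_k f_k(L)\, a_{k0}
```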
Forward Algorithm (3)
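A sketch mirroring the Viterbi code above, with max replaced by a sum (same assumed parameter dictionaries):

```python
def forward(obs, states, start, trans, emit):
    """Total probability of obs, summed over all state paths."""
    f = {k: start[k] * emit[k][obs[0]] for k in states}  # f_k(1)
    for sym in obs[1:]:
        f = {l: emit[l][sym] * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())
```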
Three algorithms
What is the most probable path for generating a given sequence?
Viterbi Algorithm
How likely is a given sequence?
Forward Algorithm
How can we learn the HMM parameters given a set of sequences?
Forward-Backward (Baum-Welch) Algorithm
Backward Algorithm (1)
Probability of the observed sequence from x_{i+1} to the end of the sequence, requiring π_i = k
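The definition, recursion and initialisation (computed from the end of the sequence backwards):

```latex
b_k(i) = P(x_{i+1} \ldots x_L \mid \pi_i = k)

% Recursion
b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)

% Initialisation (use b_k(L) = 1 if no end state is modelled)
b_k(L) = a_{k0}
```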
Disadvantage of the algorithms
Multiplying many probabilities gives very small numbers, which can lead to underflow errors on the computer
This can be solved by running the algorithms in log space, calculating log(v_l(i))
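For the Viterbi algorithm the products simply become sums of logs; the forward and backward sums additionally need a log-sum-exp, sketched here:

```python
import math

def log_sum_exp(log_values):
    """log(sum(exp(v) for v in log_values)) without underflow."""
    m = max(log_values)
    if m == -math.inf:   # all underlying probabilities are zero
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in log_values))
```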
Backward Algorithm
Posterior State Probability (1)
Probability that observation x_i came from state k, given the observed sequence
Posterior probability of state k at time i when the emitted sequence is known: P(π_i = k | x)
Posterior State Probability (2)
First calculate the probability of producing the entire observed sequence with the ith symbol being produced by state k:
P(x, π_i = k) = f_k(i) · b_k(i)
Posterior State Probability (3)
Posterior probabilities will then be:
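Dividing the joint probability above by P(x) gives:

```latex
P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}
```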
P(x) is the result of the forward or backward calculation
Posterior Probabilities (4)
For the casino example
Two questions
How would we decide if a short stretch of genomic sequence comes from a CpG island or not?
How would we find, given a long piece of sequence, the CpG islands in it, if there are any?
Prediction of CpG islands
First way: Viterbi Algorithm
- Find the most probable path through the model
- When this path goes through the ‘+’ states, a CpG island is predicted
Prediction of CpG islands
Second way: Posterior Decoding
- function: G(i|x) = Σ_k P(π_i = k | x) g(k)
- g(k) = 1 for k ∈ {A+, C+, G+, T+}
- g(k) = 0 for k ∈ {A-, C-, G-, T-}
G(i|x) is the posterior probability, according to the model, that base i is in a CpG island
Summary (1)
A Markov chain is a collection of states where each state depends only on the state before it
A hidden Markov model is a model in which the state sequence is ‘hidden’
Summary (2)
Most probable path: Viterbi algorithm
How likely is a given sequence?: forward algorithm
Posterior state probability: forward and backward algorithms (used to find the most probable state of an observation)