Motifs and motif prediction methods II - BIDD

CZ5226: Advanced Bioinformatics
Lecture 6: HMM Method for generating motifs
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1,
National University of Singapore
Problem in biology
• Data and patterns are often not clear cut
• When we want to build a method that recognises a
pattern (e.g. a sequence motif), we have to learn it
from the data (e.g. there may be other differences
between sequences that have the pattern and those
that do not)
• This leads to Data mining and Machine learning
2
A widely used machine learning
approach: Markov models
Contents:
• Markov chain models (1st order, higher order and
inhomogeneous models; parameter estimation;
classification)
• Interpolated Markov models (and back-off models)
• Hidden Markov models (forward, backward and Baum-Welch
algorithms; model topologies; applications to gene finding
and protein family modeling)
3
4
Markov Chain Models
• a Markov chain model is defined by:
– a set of states
• some states emit symbols
• other states (e.g. the begin state) are silent
– a set of transitions with associated
probabilities
• the transitions emanating from a given state define
a distribution over the possible next states
5
Markov Chain Models
• Given some sequence x of length L, we can
ask how probable the sequence is given our
model
• For any probabilistic model of sequences, we can write
this probability using the chain rule:
Pr(x) = Pr(x_1) Pr(x_2 | x_1) … Pr(x_L | x_1, …, x_{L-1})
• Key property of a (1st-order) Markov chain: the
probability of each x_i depends only on x_{i-1}, so
Pr(x) = Pr(x_1) Pr(x_2 | x_1) … Pr(x_L | x_{L-1})
6
Markov Chain Models
Pr(cggt) = Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
7
Markov Chain Models
Can also have an end state, allowing the model to
represent:
• Sequences of different lengths
• Preferences for sequences ending with particular symbols
8
Markov Chain Models
The transition parameters can be denoted by a_{x_{i-1}, x_i}, where
a_{x_{i-1}, x_i} = Pr(x_i | x_{i-1})
Similarly, we can denote the probability of a sequence x as
Pr(x) = a_{B, x_1} ∏_{i=2}^{L} a_{x_{i-1}, x_i}
where a_{B, x_1} represents the transition from the begin state
9
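A minimal Python sketch of this sequence-probability calculation (the transition table uses made-up illustrative numbers, and the function name markov_chain_prob is ours, not from the lecture):

# Probability of a DNA sequence under a first-order Markov chain
# with a begin state 'B'. All probabilities here are illustrative only.
trans = {
    # trans[prev][next] = Pr(next | prev); 'B' is the begin state
    'B': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
    'a': {'a': 0.18, 'c': 0.27, 'g': 0.43, 't': 0.12},
    'c': {'a': 0.17, 'c': 0.37, 'g': 0.27, 't': 0.19},
    'g': {'a': 0.16, 'c': 0.34, 'g': 0.38, 't': 0.12},
    't': {'a': 0.08, 'c': 0.36, 'g': 0.38, 't': 0.18},
}

def markov_chain_prob(x, trans):
    """Pr(x) = a_{B,x_1} * product over i of a_{x_{i-1}, x_i}."""
    prob = trans['B'][x[0]]
    for prev, cur in zip(x, x[1:]):
        prob *= trans[prev][cur]
    return prob

print(markov_chain_prob("cggt", trans))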
Example Application
• CpG islands
– CG dinucleotides are rarer in eukaryotic genomes than
expected given the independent probabilities of C, G
– but the regions upstream of genes are richer in CG
dinucleotides than elsewhere – CpG islands
– useful evidence for finding genes
• Could predict CpG islands with Markov chains
– one to represent CpG islands
– one to represent the rest of the genome
The example involves estimating the model parameters by maximum
likelihood or Bayesian statistics and feeding them into the Markov chain models
10
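A hedged sketch of such a two-chain classifier: train one chain on CpG-island sequences and one on background sequence, then score a query by the log-odds of the two likelihoods. Everything here (training data, pseudocounts, the zero cutoff, the helper names) is an illustrative assumption rather than the lecture's recipe:

from math import log
from collections import defaultdict

def train_chain(seqs, alphabet="acgt", pseudo=1.0):
    """Maximum-likelihood (plus pseudocount) transition probabilities
    estimated from a set of training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            counts[prev][cur] += 1
    trans = {}
    for prev in alphabet:
        total = sum(counts[prev][c] + pseudo for c in alphabet)
        trans[prev] = {c: (counts[prev][c] + pseudo) / total for c in alphabet}
    return trans

def log_odds(x, plus_model, minus_model):
    """Sum of log(a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i}); > 0 favours the CpG-island model."""
    return sum(log(plus_model[p][c] / minus_model[p][c]) for p, c in zip(x, x[1:]))

# toy usage with made-up training data
plus = train_chain(["cgcgcgcg", "gcgcgc"])      # stand-in for CpG-island sequences
minus = train_chain(["atatatta", "aatttagc"])   # stand-in for background sequence
print(log_odds("cgcggcgc", plus, minus) > 0)    # True -> predicted CpG island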
Estimating the Model Parameters
• Given some data (e.g. a set of sequences from
CpG islands), how can we determine the
probability parameters of our model?
• One approach: maximum likelihood estimation
– given a set of data D
– set the parameters θ to maximize Pr(D | θ)
– i.e. make the data D look likely under the model
11
Maximum Likelihood Estimation
• Suppose we want to estimate the parameters
Pr(a), Pr(c), Pr(g), Pr(t)
• And we’re given the sequences:
accgcgctta
gcttagtgac
tagccgttac
• Then the maximum likelihood estimates are:
Pr(a) = 6/30 = 0.2
Pr(c) = 9/30 = 0.3
Pr(g) = 7/30 = 0.233
Pr(t) = 8/30 = 0.267
12
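These estimates can be reproduced directly in a few lines of Python from the three training sequences above:

from collections import Counter

seqs = ["accgcgctta", "gcttagtgac", "tagccgttac"]
counts = Counter("".join(seqs))
total = sum(counts.values())           # 30 characters in all

for base in "acgt":
    print(base, counts[base], "/", total, "=", round(counts[base] / total, 3))
# a 6 / 30 = 0.2
# c 9 / 30 = 0.3
# g 7 / 30 = 0.233
# t 8 / 30 = 0.267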
[Slides 13-20: figures only; the accompanying text notes that these data are derived from genome sequences]
Higher Order Markov Chains
• An nth order Markov chain over some alphabet is
equivalent to a first order Markov chain over the
alphabet of n-tuples
• Example: a 2nd order Markov model for DNA can be
treated as a 1st order Markov model over alphabet:
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT,
TA, TC, TG, and TT (i.e. all possible dinucleotides)
21
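A brief sketch of the equivalence in code: a 2nd-order DNA model is a table of Pr(next base | previous dinucleotide), i.e. a 1st-order chain whose states are dinucleotides (the input string and the function name are ours, for illustration):

from collections import defaultdict

def second_order_counts(seq):
    """Count Pr(x_i | x_{i-2} x_{i-1}) contexts: equivalent to a
    first-order chain whose states are the 16 dinucleotides."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(2, len(seq)):
        context = seq[i-2:i]          # e.g. "cg" -- a dinucleotide state
        counts[context][seq[i]] += 1
    return counts

counts = second_order_counts("accgcgcttagcttagtgac")
for context, nxt in sorted(counts.items()):
    total = sum(nxt.values())
    probs = {b: round(n / total, 2) for b, n in nxt.items()}
    print(context, probs)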
A Fifth Order Markov Chain
22
Inhomogeneous Markov Chains
• In the Markov chain models we have considered
so far, the probabilities do not depend on where
we are in a given sequence
• In an inhomogeneous Markov model, we can
have different distributions at different positions
in the sequence
• Consider modeling codons in protein coding
regions
23
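A minimal sketch of an inhomogeneous chain for codons: keep one transition table per codon position and select the table by position mod 3 (the uniform values are placeholders standing in for trained parameters):

# Inhomogeneous Markov chain for codons: one transition table per
# codon position. The uniform values here are placeholders only.
trans_by_pos = [
    {prev: {c: 0.25 for c in "acgt"} for prev in "acgt"}   # positions 1, 2, 3
    for _ in range(3)
]

def coding_prob(seq, start_prob=0.25):
    """Pr(seq) when the table used for x_i depends on the codon position i mod 3."""
    prob = start_prob
    for i in range(1, len(seq)):
        prob *= trans_by_pos[i % 3][seq[i - 1]][seq[i]]
    return prob

print(coding_prob("atggctgaa"))   # 0.25 * 0.25**8 with the placeholder tables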
Inhomogeneous Markov Chains
24
A Fifth Order Inhomogeneous
Markov Chain
25
Selecting the Order of a
Markov Chain Model
• Higher order models remember more “history”
• Additional history can have predictive value
• Example:
– predict the next word in this sentence fragment
“…finish __” (up, it, first, last, …?)
– now predict it given more history
• “Fast guys finish __”
26
Hidden Markov models (HMMs)
Given, say, a T in our input sequence, which state emitted it?
27
Hidden Markov models (HMMs)
Hidden State
• We will distinguish between the observed parts of a
problem and the hidden parts
• In the Markov models we have considered previously, it
is clear which state accounts for each part of the observed
sequence
• In the model above (preceding slide), there are multiple
states that could account for each part of the observed
sequence
– this is the hidden part of the problem
– states are decoupled from sequence symbols
28
HMM-based homology searching
Transition probabilities and Emission probabilities
Gapped HMMs also have insertion and deletion
states
29
Profile HMM: m = match state, i = insert state, d = delete state; go from
left to right. The i and m states output amino acids; the d states are 'silent'.
[Profile HMM architecture diagram: Start, match states m0-m5, insert states I0-I4, and delete states d1-d4, ending at End]
30
HMM-based homology searching
• Most widely used HMM-based profile searching tools
currently are SAM-T99 (Karplus et al., 1998) and HMMER2
(Eddy, 1998)
• formal probabilistic basis and consistent theory behind gap
and insertion scores
• HMMs good for profile searches, bad for alignment (due to
parametrisation of the models)
• HMMs are slow
31
Homology-derived Secondary Structure of Proteins
Sander & Schneider, 1991
32
The Parameters of an HMM
33
HMM for Eukaryotic Gene Finding
Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences
34
A Simple HMM
35
Three Important Questions
• How likely is a given sequence?
the Forward algorithm
• What is the most probable “path” for
generating a given sequence?
the Viterbi algorithm
• How can we learn the HMM parameters given
a set of sequences?
the Forward-Backward
(Baum-Welch) algorithm
36
How Likely is a Given Sequence?
• The probability that the path π is taken and the sequence is generated:
Pr(x, π) = a_{B, π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i, π_{i+1}}
• (assuming begin/end are the only silent states on the path, and taking
π_{L+1} to be the end state)
37
How Likely is a Given Sequence?
38
How Likely is a Given Sequence?
The probability over all paths is: Pr(x) = Σ_π Pr(x, π)
but the number of paths can be exponential in the
length of the sequence...
• the Forward algorithm enables us to compute
this efficiently
39
How Likely is a Given Sequence:
The Forward Algorithm
• Define f_k(i) to be the probability of being in state k
having observed the first i characters of x
• We want to compute f_N(L), the probability of being in
the end state having observed all of x
• We can define this recursively
40
How Likely is a Given Sequence:
41
The forward algorithm
• Initialisation:
f_0(0) = 1 (start: the probability that we are in the start state and have
observed 0 characters from the sequence),
f_k(0) = 0 (other silent states k)
• Recursion:
f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_{k,l} (emitting states),
f_l(i) = Σ_k f_k(i) a_{k,l} (silent states)
• Termination:
Pr(x) = Pr(x_1…x_L) = f_N(L) = Σ_k f_k(L) a_{k,N}
(the probability that we are in the end state and have observed the entire sequence)
42
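A Python sketch of this forward computation for a model with no internal silent states (the two-state model, its probabilities, and the variable names are illustrative assumptions, not the lecture's example):

def forward(obs, states, start_p, trans_p, emit_p):
    """Return Pr(obs) by summing over all state paths.
    f[i][k] = Pr(x_1..x_i, state at step i = k)."""
    f = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    for i in range(1, len(obs)):
        f.append({
            l: emit_p[l][obs[i]] * sum(f[i - 1][k] * trans_p[k][l] for k in states)
            for l in states
        })
    # summing the last column assumes an end state reachable from every
    # state with probability 1 (a simplification in this sketch)
    return sum(f[-1][k] for k in states)

# toy two-state model (all numbers invented)
states = ["I", "B"]                       # e.g. "island" vs "background"
start_p = {"I": 0.5, "B": 0.5}
trans_p = {"I": {"I": 0.8, "B": 0.2}, "B": {"I": 0.1, "B": 0.9}}
emit_p = {"I": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
          "B": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

print(forward("cggt", states, start_p, trans_p, emit_p))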
Forward algorithm example
43
Three Important Questions
• How likely is a given sequence?
• What is the most probable “path” for
generating a given sequence?
• How can we learn the HMM parameters
given a set of sequences?
44
Finding the Most Probable Path:
The Viterbi Algorithm
• Define v_k(i) to be the probability of the most probable path
accounting for the first i characters of x and ending in state k
• We want to compute v_N(L), the probability of the most probable
path accounting for all of the sequence and ending in the end state
• Can be defined recursively
• Can use dynamic programming (DP) to find v_N(L) efficiently
45
Finding the Most Probable Path:
The Viterbi Algorithm
Initialisation:
v_0(0) = 1 (start), v_k(0) = 0 (non-silent states)
Recursion for emitting states (i = 1…L):
v_l(i) = e_l(x_i) max_k [v_k(i−1) a_{k,l}]
Recursion for silent states:
v_l(i) = max_k [v_k(i) a_{k,l}]
46
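A matching Python sketch of the Viterbi algorithm with traceback, using the same kind of toy two-state model (all parameters and names invented for illustration):

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, by dynamic programming.
    v[i][k] = best probability of any path ending in state k at step i."""
    v = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    ptr = [{}]
    for i in range(1, len(obs)):
        v.append({})
        ptr.append({})
        for l in states:
            best_k = max(states, key=lambda k: v[i - 1][k] * trans_p[k][l])
            v[i][l] = emit_p[l][obs[i]] * v[i - 1][best_k] * trans_p[best_k][l]
            ptr[i][l] = best_k
    # traceback from the best final state
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return v[-1][last], "".join(reversed(path))

states = ["I", "B"]
start_p = {"I": 0.5, "B": 0.5}
trans_p = {"I": {"I": 0.8, "B": 0.2}, "B": {"I": 0.1, "B": 0.9}}
emit_p = {"I": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
          "B": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}
print(viterbi("cggt", states, start_p, trans_p, emit_p))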
Finding the Most Probable Path:
The Viterbi Algorithm
47
Three Important Questions
• How likely is a given sequence?
(clustering)
• What is the most probable “path” for
generating a given sequence? (alignment)
• How can we learn the HMM parameters
given a set of sequences?
48
The Learning Task
• Given:
– a model
– a set of sequences (the training set)
• Do:
– find the most likely parameters to explain the training
sequences
• The goal is to find a model that generalizes well to
sequences we haven’t seen before
49
Learning Parameters
• If we know the state path for each training
sequence, learning the model parameters is simple
– no hidden state during training
– count how often each parameter is used
– normalize/smooth to get probabilities
– the process is just like it was for Markov chain models
• If we don’t know the path for each training
sequence, how can we determine the counts?
– key insight: estimate the counts by considering every
path weighted by its probability
50
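A small sketch of the known-path case described above: count transitions and emissions along the given paths, then normalize (the paired sequence/path training data are invented):

from collections import defaultdict

# each training example: (observed symbols, known state path); invented data
training = [("cggtac", "IIIBBB"), ("gcgc", "IIII"), ("attata", "BBBBBB")]

trans_counts = defaultdict(lambda: defaultdict(int))
emit_counts = defaultdict(lambda: defaultdict(int))

for obs, path in training:
    for sym, state in zip(obs, path):
        emit_counts[state][sym] += 1                 # count emissions
    for k, l in zip(path, path[1:]):
        trans_counts[k][l] += 1                      # count transitions

def normalize(counts):
    """Turn counts into probabilities (pseudocounts could be added here to smooth)."""
    return {k: {x: n / sum(row.values()) for x, n in row.items()}
            for k, row in counts.items()}

print(normalize(trans_counts))
print(normalize(emit_counts))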
Learning Parameters:
The Baum-Welch Algorithm
• An EM (expectation maximization) approach, a
forward-backward algorithm
• Algorithm sketch:
– initialize parameters of model
– iterate until convergence
• Calculate the expected number of times each
transition or emission is used
• Adjust the parameters to maximize the likelihood
of these expected values
51
[Slides 52-55: The Expectation step (figures only)]
The Expectation step
• First, we need to know the probability of the i-th
symbol being produced by state k, given the sequence x:
Pr(π_i = k | x)
• Given this, we can compute our expected counts for
state transitions and character emissions
56
The Expectation step
57
The Backward Algorithm
58
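A Python sketch of the backward recursion, together with the way it combines with the forward values to give the posterior Pr(π_i = k | x) used in the expectation step (the toy model and all its numbers are invented; no internal silent states assumed):

def forward(obs, states, start_p, trans_p, emit_p):
    f = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    for i in range(1, len(obs)):
        f.append({l: emit_p[l][obs[i]] * sum(f[i-1][k] * trans_p[k][l] for k in states)
                  for l in states})
    return f

def backward(obs, states, trans_p, emit_p):
    """b[i][k] = Pr(x_{i+1}..x_L | state at step i = k)."""
    L = len(obs)
    b = [dict() for _ in range(L)]
    b[L - 1] = {k: 1.0 for k in states}              # end reached with prob 1 (sketch assumption)
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(trans_p[k][l] * emit_p[l][obs[i+1]] * b[i+1][l] for l in states)
                for k in states}
    return b

def posteriors(obs, states, start_p, trans_p, emit_p):
    """Pr(state at step i = k | obs) = f_k(i) * b_k(i) / Pr(obs)."""
    f = forward(obs, states, start_p, trans_p, emit_p)
    b = backward(obs, states, trans_p, emit_p)
    px = sum(f[-1][k] for k in states)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(obs))]

states = ["I", "B"]
start_p = {"I": 0.5, "B": 0.5}
trans_p = {"I": {"I": 0.8, "B": 0.2}, "B": {"I": 0.1, "B": 0.9}}
emit_p = {"I": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
          "B": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}
print(posteriors("cggt", states, start_p, trans_p, emit_p))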
[Slides 59-61: The Expectation step (figures only)]
[Slides 62-63: The Maximization step (figures only)]
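The maximization step then just re-normalizes the expected counts into new transition and emission probabilities; a small sketch with invented expected-count values:

# Maximization step: turn expected counts A_{k,l} and E_k(b) into new
# parameters a_{k,l} = A_{k,l} / Σ_l' A_{k,l'} and e_k(b) = E_k(b) / Σ_b' E_k(b').
expected_trans = {"I": {"I": 7.3, "B": 1.9}, "B": {"I": 0.8, "B": 6.0}}    # invented
expected_emit = {"I": {"a": 1.1, "c": 3.4, "g": 3.9, "t": 0.6},
                 "B": {"a": 2.7, "c": 1.0, "g": 1.2, "t": 2.1}}            # invented

def m_step(expected_counts):
    return {k: {x: n / sum(row.values()) for x, n in row.items()}
            for k, row in expected_counts.items()}

new_trans = m_step(expected_trans)
new_emit = m_step(expected_emit)
print(new_trans["I"], new_emit["I"])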
The Baum-Welch Algorithm
• Initialize the parameters of the model
• Iterate until convergence
– calculate the expected number of times each transition or
emission is used
– adjust the parameters to maximize the likelihood of these
expected values
• This algorithm will converge to a local maximum (in the
likelihood of the data given the model)
• Usually in a fairly small number of iterations
64