PGM 2002/03 Tirgul 1
Hidden Markov Models
Introduction
Hidden Markov Models (HMMs) are one of the most
common forms of probabilistic graphical models,
although they were developed (1913) long before the
notion of general graphical models existed. They are
used to model time-invariant, limited-horizon
processes that have both an underlying mechanism
(hidden states) and an observable consequence.
They have been extremely successful in language
modeling and speech recognition systems and are
still the most widely used technique in these domains.
Markov Models
A Markov process or model assumes that we can
predict the future based just on the present (or on a
limited horizon into the past):
Let {X_1,…,X_T} be a sequence of random variables
taking values in {1,…,N}; then the Markov properties
are:
Limited horizon:
P(X_{t+1} = i | X_1, …, X_t) = P(X_{t+1} = i | X_t)
Time invariant (stationary):
P(X_{t+1} = i | X_t) = P(X_2 = i | X_1)
Describing a Markov Chain
A Markov chain can be described by the transition
matrix A and initial state probabilities Q:
q_i = P(X_1 = i)
a_{ij} = P(X_{t+1} = j | X_t = i)
or alternatively by a state-transition diagram (figure:
four states with arcs labeled by the transition
probabilities), and we calculate:

P(X_1, …, X_T) = P(X_1) P(X_2 | X_1) ⋯ P(X_T | X_{T-1}) = q_{X_1} ∏_{t=1}^{T-1} A(X_t, X_{t+1})
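The product above can be sketched in Python (an illustrative sketch; the two-state chain below is made up and is not the chain from the slide's diagram):

```python
# P(X_1,...,X_T) = q_{X_1} * prod_{t=1}^{T-1} A(X_t, X_{t+1})

def chain_prob(states, q, A):
    """Probability of a full state sequence under a Markov chain."""
    p = q[states[0]]                       # initial state probability
    for s, s_next in zip(states, states[1:]):
        p *= A[s][s_next]                  # one transition factor per step
    return p

# Illustrative two-state chain (made-up numbers).
q = [0.7, 0.3]
A = [[0.9, 0.1],
     [0.4, 0.6]]
print(chain_prob([0, 0, 1, 1], q, A))      # 0.7 * 0.9 * 0.1 * 0.6 = 0.0378
```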
Hidden Markov Models
In a Hidden Markov Model (HMM) we do not
observe the sequence that the model passed
through but only some probabilistic function of it.
Thus, it is a Markov model with the addition of
emission probabilities:
b_{ik} = P(Y_t = k | X_t = i)
For example (part-of-speech tagging):

Observed:  The   house   is     on     fire
States:    def   noun    verb   prep   noun
Why use HMMs?
• A lot of real-life processes are composed of
underlying events generating surface phenomena.
Tagging parts of speech is a common example.
• We can usually think of processes as having a
limited horizon (and we can easily extend to the case
of a constant horizon larger than 1).
• We have an efficient training algorithm (EM).
• Once the model is set, we can easily run it:
  t = 1; start in state i with probability q_i
  forever:
    emit y_t = k with probability b_ik
    move from state i to state j with probability a_ij; i = j
    t = t + 1
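The run loop above can be sketched as (Python; `sample_hmm` and the toy parameters are mine):

```python
import random

def sample_hmm(q, A, B, T):
    """Run the model forward for T steps: start in state i with probability
    q[i]; each step emit symbol k with probability B[i][k], then move to
    state j with probability A[i][j]."""
    states, emissions = [], []
    i = random.choices(range(len(q)), weights=q)[0]
    for _ in range(T):
        states.append(i)
        emissions.append(random.choices(range(len(B[i])), weights=B[i])[0])
        i = random.choices(range(len(A[i])), weights=A[i])[0]
    return states, emissions

# Illustrative two-state, two-symbol model (made-up numbers).
q = [0.5, 0.5]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.8, 0.2], [0.3, 0.7]]
states, emissions = sample_hmm(q, A, B, 10)
```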
The fundamental questions
Likelihood: Given a model λ = (A, B, Q), how do we
efficiently compute the likelihood of an observation,
P(Y | λ)?
Decoding: Given the observation sequence Y and a
model λ, what state sequence explains it best
(MPE)?
This is, for example, the tagging process of an
observed sentence.
Learning: Given an observation sequence Y and a
generic model, how do we estimate the parameters
that define the best model to describe the data?
Computing the Likelihood
For any state sequence (X_1, …, X_T):

P(Y | X) = ∏_{t=1}^{T} P(y_t | X_t) = b_{X_1 y_1} b_{X_2 y_2} ⋯ b_{X_T y_T}

P(X_1, …, X_T) = q_{X_1} a_{X_1 X_2} a_{X_2 X_3} ⋯ a_{X_{T-1} X_T}

Using P(Y, X) = P(Y | X) P(X) we get:

P(Y) = Σ_X P(Y, X) = Σ_X P(Y | X) P(X) = Σ_{x_1, …, x_T} q_{x_1} b_{x_1 y_1} ∏_{t=2}^{T} a_{x_{t-1} x_t} b_{x_t y_t}

But this takes O(T · N^T) multiplications!
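A brute-force sketch of this sum (Python; toy code to make the O(T · N^T) blow-up concrete by enumerating every path with itertools.product — the function name is mine):

```python
from itertools import product

def likelihood_bruteforce(y, q, A, B):
    """P(Y) = sum over all N**T paths of q_{x1} b_{x1,y1} *
    prod_{t=2..T} a_{x_{t-1},x_t} b_{x_t,y_t}  --  O(T * N**T) work."""
    N, T = len(q), len(y)
    total = 0.0
    for xs in product(range(N), repeat=T):       # N**T state sequences
        p = q[xs[0]] * B[xs[0]][y[0]]
        for t in range(1, T):
            p *= A[xs[t - 1]][xs[t]] * B[xs[t]][y[t]]
        total += p
    return total
```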
The trellis (lattice) algorithm
To compute likelihood: Need to enumerate over all
paths in the lattice (all possible instantiations of
X1…XT).
But… some starting sub-path (blue, in the slide's
lattice figure) is common to many continuing paths
(blue + red).
Idea: using dynamic programming, calculate a path
in terms of shorter sub-paths.
The trellis (lattice) algorithm
We build a matrix of the probability of being, at time t,
in state i: α_t(i) = P(x_t = i, y_1 y_2 … y_{t-1}), computed
as a function of the previous column (forward
procedure):

α_1(i) = q_i

α_{t+1}(i) = Σ_{j=1}^{N} α_t(j) a_{ji} b_{j y_t}

P(Y) = Σ_{i=1}^{N} α_{T+1}(i)

(figure: trellis columns of states 1…N at times t and t+1)
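A minimal sketch of the forward procedure (Python; the function name is mine, and it follows this deck's convention in which emission y_t belongs to column t and P(Y) sums α_{T+1}):

```python
def forward_likelihood(y, q, A, B):
    """Forward procedure:
      alpha_1(i)     = q_i
      alpha_{t+1}(i) = sum_j alpha_t(j) * a_{ji} * b_{j, y_t}
      P(Y)           = sum_i alpha_{T+1}(i)
    Each column costs O(N^2), so the total is O(T * N^2)."""
    N = len(q)
    alpha = list(q)                                   # alpha_1
    for t in range(len(y)):
        alpha = [sum(alpha[j] * A[j][i] * B[j][y[t]] for j in range(N))
                 for i in range(N)]                   # next column
    return sum(alpha)                                 # sum of alpha_{T+1}
```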
The trellis (lattice) algorithm (cont.)
We can similarly define a backward procedure for
filling the matrix β_t(i) = P(y_{t+1} … y_T | X_t = i):

β_T(i) = 1

β_t(i) = Σ_{j=1}^{N} a_{ij} b_{j y_{t+1}} β_{t+1}(j)

P(Y) = Σ_{i=1}^{N} q_i b_{i y_1} β_1(i)

And we can easily combine:

P(Y, X_t = i) = P(y_1 … y_T, X_t = i)
= P(y_1 … y_{t-1}, X_t = i, y_t … y_T)
= P(y_1 … y_{t-1}, X_t = i) P(y_t … y_T | y_1 … y_{t-1}, X_t = i)
= P(y_1 … y_{t-1}, X_t = i) P(y_t … y_T | X_t = i)
= α_t(i) b_{i y_t} β_t(i)
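A matching sketch of the backward procedure (Python; same assumptions as before — plain lists and the deck's indexing). Its P(Y) should agree with the forward procedure's:

```python
def backward_likelihood(y, q, A, B):
    """Backward procedure: beta_t(i) = P(y_{t+1} ... y_T | X_t = i), with
      beta_T(i) = 1
      beta_t(i) = sum_j a_{ij} * b_{j, y_{t+1}} * beta_{t+1}(j)
      P(Y)      = sum_i q_i * b_{i, y_1} * beta_1(i)"""
    N, T = len(q), len(y)
    beta = [1.0] * N                                  # beta_T
    for t in range(T - 2, -1, -1):                    # columns T-1 down to 1
        beta = [sum(A[i][j] * B[j][y[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    return sum(q[i] * B[i][y[0]] * beta[i] for i in range(N))
```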
Finding the best state sequence
We would like to find the most likely path (and not just
the most likely state at each time slice):

argmax_X P(X | Y) = argmax_X P(X, Y)

The Viterbi algorithm is an efficient trellis method for
finding the MPE:

δ_1(i) = q_i
δ_{t+1}(i) = max_j δ_t(j) a_{ji} b_{j y_t}
γ_{t+1}(i) = argmax_j δ_t(j) a_{ji} b_{j y_t}

and we reconstruct the path:

P(X̂) = max_i δ_{T+1}(i)
X̂_{T+1} = argmax_i δ_{T+1}(i)
X̂_t = γ_{t+1}(X̂_{t+1})
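A Viterbi sketch under the same conventions (Python; `viterbi` and the toy model in the test are mine):

```python
def viterbi(y, q, A, B):
    """Viterbi recursions:
      delta_1(i)     = q_i
      delta_{t+1}(i) = max_j    delta_t(j) * a_{ji} * b_{j, y_t}
      gamma_{t+1}(i) = argmax_j delta_t(j) * a_{ji} * b_{j, y_t}
    then backtrack from X_{T+1} = argmax_i delta_{T+1}(i).
    Returns X_1 .. X_{T+1} (note the trailing state after the last
    emission, as in the recursions above)."""
    N, T = len(q), len(y)
    delta = list(q)
    gammas = []
    for t in range(T):
        scores = [[delta[j] * A[j][i] * B[j][y[t]] for j in range(N)]
                  for i in range(N)]
        gammas.append([max(range(N), key=lambda j: scores[i][j])
                       for i in range(N)])
        delta = [max(scores[i]) for i in range(N)]
    path = [max(range(N), key=lambda i: delta[i])]    # X_{T+1}
    for g in reversed(gammas):                        # X_t = gamma_{t+1}(X_{t+1})
        path.append(g[path[-1]])
    path.reverse()
    return path
```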
The Casino HMM
A casino switches from a fair die (state F) to a
loaded one (state U) with probability 0.05, and the
other way around with probability 0.1. Also, with
probability 0.01. The casino, honestly, reads off the
number that was rolled.

Transition matrix A:
      F     U
F    0.95  0.05
U    0.1   0.9

Emission matrix B:
      1     2     3     4     5     6
F    1/6   1/6   1/6   1/6   1/6   1/6
U    1/10  1/10  1/10  1/10  1/10  1/2

P(3151166661, FFFFFUUUUU) = P(FFFFFUUUUU) · (1/6 · 1/6 · 1/6 · 1/6 · 1/6 · 1/2 · 1/2 · 1/2 · 1/2 · 1/10)
The Casino HMM (cont.)
What is the likelihood of 3151166661?
Y = 3151166661

α_1(1) = 0.95, α_1(2) = 0.05
α_2(1) = 0.95·0.95·1/6 + 0.05·0.1·1/10 = 0.1509
α_2(2) = 0.95·0.05·1/6 + 0.05·0.9·1/10 = 0.0124
α_3(1) = 0.0240, α_3(2) = 0.0025
α_4(1) = 0.0038, α_4(2) = 0.0004
α_5(1) = 0.0006, α_5(2) = 0.0001
… all smaller than 0.0001!
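These columns can be checked numerically (Python; state 0 = F, state 1 = U, with q = (0.95, 0.05) and the fair/loaded emissions from the previous slide):

```python
# Casino model: state 0 = fair (F), state 1 = loaded (U).
q = [0.95, 0.05]
A = [[0.95, 0.05],
     [0.10, 0.90]]
B = [[1 / 6] * 6,                             # fair die: uniform
     [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]          # loaded die: 6 with prob 1/2
y = [2, 0, 4, 0, 0, 5, 5, 5, 5, 0]            # rolls 3151166661, 0-based faces

columns = [list(q)]                           # alpha_1(i) = q_i
for t in range(len(y)):                       # alpha_{t+1}(i) = sum_j alpha_t(j) a_{ji} b_{j,y_t}
    prev = columns[-1]
    columns.append([sum(prev[j] * A[j][i] * B[j][y[t]] for j in range(2))
                    for i in range(2)])

print(columns[1])        # alpha_2 ~ [0.1509, 0.0124]
print(sum(columns[-1]))  # P(Y) = sum_i alpha_{T+1}(i)
```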
The Casino HMM (cont.)
What explains 3151166661 best?
Y = 3151166661

δ_1(1) = 0.95, δ_1(2) = 0.05
δ_2(1) = max(0.95·0.95·1/6, 0.05·0.1·1/10) = 0.1504
δ_2(2) = max(0.95·0.05·1/6, 0.05·0.9·1/10) = 0.0079
δ_3(1) = 0.0238, δ_3(2) = 0.0013
δ_4(1) = 0.0006, δ_4(2) = 0.0002
δ_5(1) = 0.0001, δ_5(2) = 0.0000…
…
The Casino HMM (cont.)
An example of reconstruction using Viterbi (Durbin):
Rolls   3151162464466442453113216311641521336
Die     0000000000000000000000000000000000000
Viterbi 0000000000000000000000000000000000000

Rolls   2514454363165662656666665116645313265
Die     0000000011111111111111111111100000000
Viterbi 0000000000011111111111111111100000000

Rolls   1245636664631636663162326455236266666
Die     0000111111111111111100011111111111111
Viterbi 0000111111111111111111111111111111111

(0 = fair die, 1 = loaded die)
Learning
If we were given both X and Y, we could choose

argmax_{A,B,Q} P(Y_training, X_training | A, B, Q)

Using the Maximum Likelihood principle, we simply
set each parameter to the corresponding relative
frequency.
What do we do when we have only Y?

argmax_{A,B,Q} P(Y_training | A, B, Q)

ML here does not have a closed-form formula!
EM (Baum Welch)
Idea: use the current guess to complete the data,
then re-estimate.
E-step: "guess" X using Y and the current
parameters.
M-step: re-estimate the parameters using the
current completion of the data.
Thm: the likelihood of the observables never
decreases!!! (to be proved later in the course)
Problem: gets stuck at sub-optimal solutions.
Parameter Estimation
We define the expected number of transitions from
state i to j at time t:

p_t(i, j) = P(X_t = i, X_{t+1} = j | Y)
= α_t(i) a_{ij} b_{i y_t} b_{j y_{t+1}} β_{t+1}(j) / Σ_m Σ_n α_t(m) a_{mn} b_{m y_t} b_{n y_{t+1}} β_{t+1}(n)

The expected # of transitions from i to j in Y is then

Σ_{t=1}^{T} p_t(i, j)
Parameter Estimation (cont.)
We use the EM re-estimation formulas with the
expected counts we already have:

q̂_i = expected frequency in state i at time 1 = Σ_j p_1(i, j)

â_{ij} = expected # of transitions from state i to j / expected # of transitions from state i
= Σ_t p_t(i, j) / Σ_t Σ_j p_t(i, j)

b̂_{ik} = expected # of emissions of k from state i / expected # of emissions from state i
= Σ_j Σ_{t: y_t = k} p_t(i, j) / Σ_t Σ_j p_t(i, j)
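A hedged sketch of one full re-estimation step (Python; `baum_welch_step` is my name, and the transition/emission sums run over t = 1…T-1, where p_t(i,j) is defined — a boundary choice the slide leaves implicit):

```python
def baum_welch_step(y, q, A, B):
    """One EM (Baum-Welch) re-estimation step. p[t][i][j] below is
    p_{t+1}(i,j) (0-based t), built from forward and backward columns;
    the M-step turns the expected counts into relative frequencies."""
    N, T, K = len(q), len(y), len(B[0])
    # forward: alpha[t] is the deck's alpha_{t+1}, alpha[0] = q
    alpha = [list(q)]
    for t in range(T):
        alpha.append([sum(alpha[t][j] * A[j][i] * B[j][y[t]] for j in range(N))
                      for i in range(N)])
    # backward: beta[t] is the deck's beta_{t+1}, beta[T-1] = 1
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][y[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    PY = sum(alpha[T])
    # p_t(i,j) = alpha_t(i) a_ij b_{i,y_t} b_{j,y_{t+1}} beta_{t+1}(j) / P(Y)
    p = [[[alpha[t][i] * A[i][j] * B[i][y[t]] * B[j][y[t + 1]] * beta[t + 1][j] / PY
           for j in range(N)] for i in range(N)] for t in range(T - 1)]
    q_new = [sum(p[0][i]) for i in range(N)]              # sum_j p_1(i,j)
    trans_from = [sum(p[t][i][j] for t in range(T - 1) for j in range(N))
                  for i in range(N)]
    A_new = [[sum(p[t][i][j] for t in range(T - 1)) / trans_from[i]
              for j in range(N)] for i in range(N)]
    B_new = [[sum(p[t][i][j] for t in range(T - 1) for j in range(N)
                  if y[t] == k) / trans_from[i]
              for k in range(K)] for i in range(N)]
    return q_new, A_new, B_new
```

As a sanity check, the new q, each row of the new A, and each row of the new B should still sum to 1.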
Application: Sequence Pair Alignment
DNA sequences are "strings" over a four-letter
alphabet {C, G, A, T}.
Real-life sequences often differ by way of mutation of
a letter, or even a deletion of one. A fundamental
problem in computational biology is to align such
sequences:

Input:
CTTTACGTTACTTACG
CTTAGTTACGTTAG

Output:
CTTTACGTTAC-TTACG
C-TTA-GTAACGTTA-G
How can we use an HMM for such a task?
Sequence Pair Alignment (cont.)
We construct an HMM with 3 states that emits pairs of letters:
(M)atch: emission of aligned pairs (high probability
for matching letters, low for a mismatch)
(D)elete1: emission of a letter in the first sequence
and an insert (gap) in the second sequence
(D)elete2: the converse of D1

Transition matrix A:
      M     D1    D2
M    0.9   0.05  0.05
D1   0.95  0.05  0
D2   0.95  0     0.05

Matrix B: if in M, emit a pair of identical letters with
probability 0.24 each; if in D1 or D2, emit all letters
uniformly.

Q: M 0.9, D1 0.05, D2 0.05
Sequence Pair Alignment (cont.)
How do we align 2 new sequences?
We also need (B)egin and (E)nd states to signify the start and
end of the sequence.
• From B we will have the same transition probabilities as from
M, and we will never return to it.
• We will have a probability of 0.05 to reach E from any state,
and we will never leave it. This probability determines the
average length of an alignment.
We now just run Viterbi, with some extra technical details
(programming ex. 1).
Extensions
HMMs have been used so extensively that it is
impossible to even start on the many forms they
take. Several extensions, however, are worth
mentioning:
• using different kinds of transition matrices
• using continuous observations
• using a larger horizon
• assigning probabilities to wait times
• using several training sets
• …