Hidden Markov Models
A Hidden Markov Model consists of
1. A sequence of states {Xt | t ∈ T} = {X1, X2, ..., XT}, and
2. A sequence of observations {Yt | t ∈ T} = {Y1, Y2, ..., YT}.
• The sequence of states {X1, X2, ..., XT} forms a Markov chain moving amongst the M states {1, 2, …, M}.
• The observation Yt comes from a distribution that is determined by the current state of the process, Xt (or possibly also by past observations and past states).
• The states, {X1, X2, ... , XT}, are unobserved
(hence hidden).
The Hidden Markov Model
[Figure: graphical model of the HMM — hidden states X1 → X2 → X3 → … → XT, with each state Xt emitting the observation Yt.]
Some basic problems, given the observations {Y1, Y2, ..., YT}:
1. Determine the sequence of states {X1, X2, ..., XT}.
2. Determine (or estimate) the parameters of the stochastic process that is generating the states and the observations.
Examples
Example 1
• A person is rolling two sets of dice (one is
balanced, the other is unbalanced). He
switches between the two sets of dice using
a Markov transition matrix.
• The states are the dice.
• The observations are the numbers rolled
each time.
[Figure: Balanced Dice — bar chart of the probability of each total (2–12) for the balanced pair of dice.]
[Figure: Unbalanced Dice — bar chart of the probability of each total (2–12) for the unbalanced pair of dice.]
Example 2
• The Markov chain has two states.
• The observations (given the states) are independent Normal.
• Both the mean and the variance depend on the state.
HMM AR.xls
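As a rough illustration of this example (the transition matrix, means, and variances below are assumed for the sketch, not taken from HMM AR.xls), the following Python code simulates a two-state HMM with Normal emissions:

```python
import numpy as np

# Assumed (illustrative) parameters -- not taken from HMM AR.xls
P = np.array([[0.95, 0.05],      # transition matrix: P[i, j] = P(X_{t+1}=j | X_t=i)
              [0.10, 0.90]])
pi0 = np.array([0.5, 0.5])       # initial state distribution
mu = np.array([0.0, 2.0])        # state-dependent means
sigma = np.array([1.0, 0.5])     # state-dependent standard deviations

def simulate_hmm(T, rng=np.random.default_rng(0)):
    """Simulate T steps of the two-state Gaussian HMM."""
    states = np.empty(T, dtype=int)
    obs = np.empty(T)
    states[0] = rng.choice(2, p=pi0)
    obs[0] = rng.normal(mu[states[0]], sigma[states[0]])
    for t in range(1, T):
        states[t] = rng.choice(2, p=P[states[t - 1]])
        obs[t] = rng.normal(mu[states[t]], sigma[states[t]])
    return states, obs

states, obs = simulate_hmm(200)
```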
Example 3 – Dow Jones
[Figure: Dow Jones index level (roughly 106 to 126) over about 80 trading days.]
Daily Changes, Dow Jones
[Figure: daily changes of the Dow Jones index (roughly -1 to 2) over the same 80 days.]
Hidden Markov Model??
[Figure: the daily-change series shown again.]
Bear and Bull Market?
[Figure: the Dow Jones index level shown again.]
Speech Recognition
• When a word is spoken the vocalization
process goes through a sequence of states.
• The sound produced is relatively constant
when the process remains in the same state.
• Recognizing the sequence of states and the
duration of each state allows one to
recognize the word being spoken.
• The interval of time when the word is
spoken is broken into small (possibly
overlapping) subintervals.
• In each subinterval one measures the amplitudes of various frequencies in the sound (using Fourier analysis). The vector of amplitudes Yt is assumed to have a multivariate normal distribution in each state, with the mean vector and covariance matrix being state dependent.
Hidden Markov Models for Biological Sequences
Consider the Motif:
[AT][CG][AC][ACGT]*A[TG][GC]
Some realizations:
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
Hidden Markov model of the same motif: [AT][CG][AC][ACGT]*A[TG][GC]
[Figure: state diagram of the motif HMM. Each motif position has its own state with position-specific emission probabilities (e.g. A .8 / T .2 for [AT], C .8 / G .2 for [CG], A .8 / C .2 for [AC], T .8 / G .2 for [TG], C .8 / G .2 for [GC]); the [ACGT]* positions are handled by an insert state emitting A .2, C .4, G .2, T .2 with self-transition probability .4 and exit probability .6; the remaining transitions have probability 1.0, apart from the .6/.4 split where the insert state can be entered.]
Profile HMMs
[Figure: profile HMM architecture — a chain of states running from a Begin state to an End state.]
Computing Likelihood
Let $\pi_{ij} = P[X_{t+1} = j \mid X_t = i]$ and $\Pi = (\pi_{ij})$ = the Markov chain transition matrix.
Let $\pi_i^0 = P[X_1 = i]$ and $\boldsymbol{\pi}^0 = \big(\pi_1^0, \pi_2^0, \ldots, \pi_M^0\big)$ = the initial distribution over the states. Then
$$P[X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T] = \pi^0_{i_1}\,\pi_{i_1 i_2}\,\pi_{i_2 i_3}\cdots\pi_{i_{T-1} i_T}.$$
Now assume that
$$P[Y_t = y_t \mid X_1 = i_1, X_2 = i_2, \ldots, X_t = i_t] = P[Y_t = y_t \mid X_t = i_t] = p(y_t \mid i_t) = \theta_{i_t y_t}.$$
Then
$$P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T] = P[X = i, Y = y] = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
Therefore
$$P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T] = P[Y = y] = \sum_{i_1, i_2, \ldots, i_T}\pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}$$
$$= L\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big), \quad \text{where } \boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_M).$$
In the case when Y1, Y2, ..., YT are continuous random variables or continuous random vectors, let $f(y \mid \theta_i)$ denote the conditional density of $Y_t$ given $X_t = i$. Then the joint density of Y1, Y2, ..., YT is given by
$$L\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = f(y_1, y_2, \ldots, y_T) = f(\mathbf{y}) = \sum_{i_1, i_2, \ldots, i_T}\pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T},$$
where $\theta_{i_t y_t} = f(y_t \mid \theta_{i_t})$.
Efficient Methods for Computing the Likelihood
The Forward Method
Let $Y^{(t)} = (Y_1, Y_2, \ldots, Y_t)$ and $y^{(t)} = (y_1, y_2, \ldots, y_t)$. Consider
$$\alpha_t(i_t) = P\big[Y^{(t)} = y^{(t)}, X_t = i_t\big] = P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_t = y_t, X_t = i_t].$$
Note
$$\alpha_1(i_1) = P\big[Y^{(1)} = y^{(1)}, X_1 = i_1\big] = P[Y_1 = y_1, X_1 = i_1] = \pi^0_{i_1}\theta_{i_1 y_1},$$
and
$$\alpha_{t+1}(i_{t+1}) = P\big[Y^{(t+1)} = y^{(t+1)}, X_{t+1} = i_{t+1}\big] = \sum_{i_t} P\big[Y^{(t)} = y^{(t)}, Y_{t+1} = y_{t+1}, X_t = i_t, X_{t+1} = i_{t+1}\big]$$
$$= \sum_{i_t} P\big[Y^{(t)} = y^{(t)}, X_t = i_t\big]\; P\big[Y_{t+1} = y_{t+1}, X_{t+1} = i_{t+1} \mid Y^{(t)} = y^{(t)}, X_t = i_t\big] = \sum_{i_t} \alpha_t(i_t)\,\pi_{i_t i_{t+1}}\,\theta_{i_{t+1} y_{t+1}}.$$
Then
$$P[Y = y] = P\big[Y^{(T)} = y^{(T)}\big] = P[Y_1 = y_1, \ldots, Y_T = y_T] = \sum_{i_T} P\big[Y^{(T)} = y^{(T)}, X_T = i_T\big] = \sum_{i_T}\alpha_T(i_T).$$
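A minimal sketch of this forward recursion for a discrete-emission HMM (the parameter values below are illustrative assumptions, not taken from any of the examples above):

```python
import numpy as np

def forward(pi0, Pi, theta, y):
    """Forward recursion: alpha[t, i] = P[observations up to position t, state i at t].

    pi0   : (M,)   initial state distribution
    Pi    : (M, M) transition matrix, Pi[i, j] = P(X_{t+1}=j | X_t=i)
    theta : (M, K) emission probabilities, theta[i, k] = P(Y_t=k | X_t=i)
    y     : (T,)   observed symbols coded 0..K-1
    """
    T, M = len(y), len(pi0)
    alpha = np.zeros((T, M))
    alpha[0] = pi0 * theta[:, y[0]]                   # alpha_1(i) = pi0_i * theta_{i, y1}
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Pi) * theta[:, y[t]]
    return alpha

# Illustrative two-state example
pi0 = np.array([0.6, 0.4])
Pi = np.array([[0.7, 0.3], [0.4, 0.6]])
theta = np.array([[0.5, 0.5], [0.1, 0.9]])
y = np.array([0, 1, 1, 0, 1])
alpha = forward(pi0, Pi, theta, y)
likelihood = alpha[-1].sum()      # P[Y = y] = sum over i_T of alpha_T(i_T)
```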
The Backward Procedure
Let $Y^{*(t)} = (Y_{t+1}, Y_{t+2}, \ldots, Y_T)$ and $y^{*(t)} = (y_{t+1}, y_{t+2}, \ldots, y_T)$. Consider
$$\beta^*_t(i_t) = P\big[Y^{*(t)} = y^{*(t)} \mid X_t = i_t\big] = P[Y_{t+1} = y_{t+1}, Y_{t+2} = y_{t+2}, \ldots, Y_T = y_T \mid X_t = i_t].$$
Note
$$\beta^*_{T-1}(i_{T-1}) = P\big[Y^{*(T-1)} = y^{*(T-1)} \mid X_{T-1} = i_{T-1}\big] = P[Y_T = y_T \mid X_{T-1} = i_{T-1}] = \sum_{i_T}\pi_{i_{T-1} i_T}\,\theta_{i_T y_T}.$$
Now
$$\beta^*_{t-1}(i_{t-1}) = P\big[Y^{*(t-1)} = y^{*(t-1)} \mid X_{t-1} = i_{t-1}\big] = \sum_{i_t} P\big[Y_t = y_t, Y^{*(t)} = y^{*(t)}, X_t = i_t \mid X_{t-1} = i_{t-1}\big]$$
$$= \sum_{i_t} P\big[Y^{*(t)} = y^{*(t)} \mid X_t = i_t\big]\; P[Y_t = y_t \mid X_t = i_t]\; P[X_t = i_t \mid X_{t-1} = i_{t-1}] = \sum_{i_t}\beta^*_t(i_t)\,\pi_{i_{t-1} i_t}\,\theta_{i_t y_t}.$$
Then
$$P[Y = y] = P\big[Y^{*(0)} = y^{*(0)}\big] = P[Y_1 = y_1, \ldots, Y_T = y_T] = \sum_{i_1} P\big[Y_1 = y_1, Y^{*(1)} = y^{*(1)}, X_1 = i_1\big]$$
$$= \sum_{i_1} P\big[Y^{*(1)} = y^{*(1)} \mid X_1 = i_1\big]\; P[X_1 = i_1, Y_1 = y_1] = \sum_{i_1}\beta^*_1(i_1)\,\pi^0_{i_1}\,\theta_{i_1 y_1}.$$
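A matching sketch of the backward recursion, using the same conventions as the forward sketch above (with 0-based time indexing, so the final row of the table is all ones):

```python
import numpy as np

def backward(Pi, theta, y):
    """Backward recursion: beta[t, i] = P[observations after position t | state i at t]
    (the slides' beta*), with beta[T-1, i] = 1."""
    T, M = len(y), Pi.shape[0]
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        # beta*_t(i) = sum_j Pi[i, j] * theta[j, y_{t+1}] * beta*_{t+1}(j)
        beta[t] = Pi @ (theta[:, y[t + 1]] * beta[t + 1])
    return beta

# With pi0, Pi, theta, y as in the forward example:
# beta = backward(Pi, theta, y)
# likelihood = np.sum(pi0 * theta[:, y[0]] * beta[0])   # the same P[Y = y] as before
```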
Prediction of states from the observations and the model:
Consider $\alpha_T(i_T) = P[Y = y, X_T = i_T] = P[Y_1 = y_1, \ldots, Y_T = y_T, X_T = i_T]$. Thus
$$P[X_T = i_T \mid Y = y] = \frac{P[Y = y, X_T = i_T]}{P[Y = y]} = \frac{\alpha_T(i_T)}{\sum_{i_T}\alpha_T(i_T)}.$$
Also
$$P[X_t = i_t \mid Y = y] = \frac{P[Y = y, X_t = i_t]}{P[Y = y]} = \frac{P\big[Y^{(t)} = y^{(t)}, X_t = i_t\big]\; P\big[Y^{*(t)} = y^{*(t)} \mid X_t = i_t\big]}{P[Y = y]} = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_T}\alpha_T(i_T)}.$$
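Combining the two recursions gives these posterior state probabilities; a short sketch reusing the forward and backward functions from the sketches above:

```python
import numpy as np

def state_posteriors(pi0, Pi, theta, y):
    """P[X_t = i | Y = y] for every t and i, via forward and backward probabilities."""
    alpha = forward(pi0, Pi, theta, y)     # alpha_t(i), from the forward sketch
    beta = backward(Pi, theta, y)          # beta*_t(i), from the backward sketch
    likelihood = alpha[-1].sum()           # P[Y = y]
    return alpha * beta / likelihood       # posterior[t, i] = P[X_t = i | Y = y]

# posterior = state_posteriors(pi0, Pi, theta, y)
# posterior[t].argmax() is the (marginally) most probable state at time t
```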
The Viterbi Algorithm
(Viterbi Paths)
Suppose that we know the parameters of the
Hidden Markov Model.
Suppose in addition that we have observed the sequence of observations Y1, Y2, ..., YT.
Now consider determining the sequence of states X1, X2, ..., XT.
Recall that
$$P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T] = P[X = i, Y = y] = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
Consider the problem of determining the sequence of states $i_1, i_2, \ldots, i_T$ that maximizes the above probability. This is equivalent to maximizing
$$P[X = i \mid Y = y] = P[X = i, Y = y]\,/\,P[Y = y].$$
The Viterbi Algorithm
We want to maximize
$$P[X = i, Y = y] = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
Equivalently, we want to minimize $U(i_1, i_2, \ldots, i_T)$, where
$$U(i_1, i_2, \ldots, i_T) = -\ln P[X = i, Y = y] = -\Big[\ln\!\big(\pi^0_{i_1}\theta_{i_1 y_1}\big) + \ln\!\big(\pi_{i_1 i_2}\theta_{i_2 y_2}\big) + \cdots + \ln\!\big(\pi_{i_{T-1} i_T}\theta_{i_T y_T}\big)\Big].$$
• Minimization of $U(i_1, i_2, \ldots, i_T)$ can be achieved by dynamic programming.
• This can be thought of as finding the shortest path through a grid of points, starting at the unique point in stage 0 and moving from a point in stage t to a point in stage t+1 in an optimal way.
• The distances between points in stage t and points in stage t+1 are equal to
$$d_1(0, i_1) = -\ln\!\big(\pi^0_{i_1}\theta_{i_1 y_1}\big) \quad (t = 0), \qquad d_{t+1}(i_t, i_{t+1}) = -\ln\!\big(\pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\big) \quad (t \ge 1).$$
Dynamic Programming
[Figure: dynamic-programming trellis — a single point at stage 0 and one point per state at each of stages 1, 2, ..., T, with edge lengths $d_1(0, i_1), d_2(i_1, i_2), \ldots, d_T(i_{T-1}, i_T)$.]
Let
$$U^t(i_1, i_2, \ldots, i_t) = -\Big[\ln\!\big(\pi^0_{i_1}\theta_{i_1 y_1}\big) + \ln\!\big(\pi_{i_1 i_2}\theta_{i_2 y_2}\big) + \cdots + \ln\!\big(\pi_{i_{t-1} i_t}\theta_{i_t y_t}\big)\Big] = d_1(0, i_1) + d_2(i_1, i_2) + \cdots + d_t(i_{t-1}, i_t)$$
and
$$V^t(i_t) = \min_{i_1, i_2, \ldots, i_{t-1}} U^t(i_1, i_2, \ldots, i_t).$$
Then
$$V^1(i_1) = -\ln\!\big(\pi^0_{i_1}\theta_{i_1 y_1}\big), \qquad i_1 = 1, 2, \ldots, M,$$
and
$$V^{t+1}(i_{t+1}) = \min_{i_t}\Big[V^t(i_t) + d_{t+1}(i_t, i_{t+1})\Big] = \min_{i_t}\Big[V^t(i_t) - \ln\!\big(\pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\big)\Big], \qquad i_{t+1} = 1, \ldots, M;\ t = 1, \ldots, T-2.$$
Finally
$$V^T = \min_{i_T}\,\min_{i_{T-1}}\Big[V^{T-1}(i_{T-1}) - \ln\!\big(\pi_{i_{T-1} i_T}\theta_{i_T y_T}\big)\Big] = \min_{i_1, i_2, \ldots, i_T} U(i_1, i_2, \ldots, i_T).$$
Summary of the calculation of the Viterbi path:
1. $V^1(i_1) = -\ln\!\big(\pi^0_{i_1}\theta_{i_1 y_1}\big)$, for $i_1 = 1, 2, \ldots, M$.
2. $V^{t+1}(i_{t+1}) = \min_{i_t}\Big[V^t(i_t) - \ln\!\big(\pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\big)\Big]$, for $i_{t+1} = 1, \ldots, M$ and $t = 1, \ldots, T-2$.
3. $V^T = \min_{i_T}\,\min_{i_{T-1}}\Big[V^{T-1}(i_{T-1}) - \ln\!\big(\pi_{i_{T-1} i_T}\theta_{i_T y_T}\big)\Big] = \min_{i_1, i_2, \ldots, i_T} U(i_1, i_2, \ldots, i_T)$.
The minimizing sequence $(i_1, i_2, \ldots, i_T)$, recovered by backtracking through the minimizing choices at each step, is the Viterbi path.
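A minimal sketch of this dynamic program in Python (negative-log-probability form as above; the arrays follow the same illustrative conventions as the earlier forward/backward sketches):

```python
import numpy as np

def viterbi(pi0, Pi, theta, y):
    """Viterbi path via dynamic programming on V[t, i] = minimal negative log-probability."""
    T, M = len(y), len(pi0)
    with np.errstate(divide="ignore"):                 # allow log(0) = -inf
        log_pi0, log_Pi, log_theta = np.log(pi0), np.log(Pi), np.log(theta)
    V = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)                 # argmin bookkeeping for backtracking
    V[0] = -(log_pi0 + log_theta[:, y[0]])             # V^1(i) = -ln(pi0_i * theta_{i, y1})
    for t in range(1, T):
        # cost[i, j] = V^t(i) - ln(Pi[i, j] * theta[j, y_{t+1}])
        cost = V[t - 1][:, None] - (log_Pi + log_theta[:, y[t]][None, :])
        back[t] = cost.argmin(axis=0)
        V[t] = cost.min(axis=0)
    # Backtrack the minimizing state at each stage
    path = np.zeros(T, dtype=int)
    path[-1] = V[-1].argmin()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# path = viterbi(pi0, Pi, theta, y)   # with the arrays from the earlier sketches
```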
An alternative approach to prediction of states from the observations and the model:
It can be shown that
$$P[X_t = i_t \mid Y = y] = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_t}\alpha_t(i_t)\,\beta^*_t(i_t)} = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_T}\alpha_T(i_T)}.$$
Forward Probabilities: $\alpha_t(i_t) = P\big[Y^{(t)} = y^{(t)}, X_t = i_t\big]$
1. $\alpha_1(i_1) = \pi^0_{i_1}\theta_{i_1 y_1}$
2. $\alpha_{t+1}(i_{t+1}) = \sum_{i_t}\alpha_t(i_t)\,\pi_{i_t i_{t+1}}\,\theta_{i_{t+1} y_{t+1}}$
Backward Probabilities: $\beta^*_t(i_t) = P\big[Y^{*(t)} = y^{*(t)} \mid X_t = i_t\big]$
1. $\beta^*_{T-1}(i_{T-1}) = \sum_{i_T}\pi_{i_{T-1} i_T}\,\theta_{i_T y_T}$
2. $\beta^*_{t-1}(i_{t-1}) = \sum_{i_t}\beta^*_t(i_t)\,\pi_{i_{t-1} i_t}\,\theta_{i_t y_t}$
HMM generator (normal).xls
Estimation of Parameters of a Hidden Markov Model
If both the sequence of observations Y1 = y1, Y2 = y2, ..., YT = yT and the sequence of states X1 = i1, X2 = i2, ..., XT = iT are observed, then the likelihood is given by
$$L\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T},$$
and the log-likelihood is given by
$$\ell\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \ln L\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \ln\pi^0_{i_1} + \ln\theta_{i_1 y_1} + \ln\pi_{i_1 i_2} + \ln\theta_{i_2 y_2} + \ln\pi_{i_2 i_3} + \ln\theta_{i_3 y_3} + \cdots + \ln\pi_{i_{T-1} i_T} + \ln\theta_{i_T y_T}$$
$$= \sum_{i=1}^M f_i^0\ln\pi_i^0 + \sum_{i=1}^M\sum_{j=1}^M f_{ij}\ln\pi_{ij} + \sum_{i=1}^M\sum_{\{t:\, i_t = i\}}\ln\theta_{i y_t},$$
where
$f_i^0$ = the number of times state i occurs as the first state,
$f_{ij}$ = the number of times state i changes to state j,
$\theta_{iy} = f(y \mid \theta_i)$ (or $p(y \mid i)$ in the discrete case), and
the last summation runs over all observations $y_t$ for which $X_t = i$.
In this case the maximum likelihood estimates are
$$\hat\pi_i^0 = \frac{f_i^0}{1}, \qquad \hat\pi_{ij} = \frac{f_{ij}}{\sum_{j=1}^M f_{ij}}, \qquad \text{and}$$
$\hat\theta_i$ = the MLE of $\theta_i$ computed from the observations $y_t$ where $X_t = i$.
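A sketch of these count-based estimates for a single observed sequence with discrete emissions (M states, K symbols; it assumes every state is visited, so no denominator is zero):

```python
import numpy as np

def mle_known_states(states, obs, M, K):
    """MLE of (pi0, Pi, theta) when both the states and the observations are seen."""
    pi0 = np.zeros(M)
    pi0[states[0]] = 1.0                          # f_i^0 / 1
    trans = np.zeros((M, M))
    for i, j in zip(states[:-1], states[1:]):     # f_ij = observed transition counts
        trans[i, j] += 1
    Pi = trans / trans.sum(axis=1, keepdims=True)
    emit = np.zeros((M, K))
    for i, y in zip(states, obs):                 # emission counts per state
        emit[i, y] += 1
    theta = emit / emit.sum(axis=1, keepdims=True)
    return pi0, Pi, theta
```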
MLE (states unknown)
If only the sequence of observations Y1 = y1, Y2 = y2, ..., YT = yT is observed, then the likelihood is given by
$$L\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \sum_{i_1, i_2, \ldots, i_T}\pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
• It is difficult to find the maximum likelihood estimates directly from the likelihood function.
• The techniques that are used are:
1. The Segmental K-means Algorithm
2. The Baum-Welch (E-M) Algorithm
The Segmental K-means Algorithm
In this method the parameters $\lambda = \big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big)$ are adjusted to maximize
$$L\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta} \mid \mathbf{y}, \mathbf{i}\big) = L(\lambda \mid \mathbf{y}, \mathbf{i}) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T},$$
where $\mathbf{i} = (i_1, i_2, \ldots, i_T)$ is the Viterbi path.
Consider this with the special case:
Case: The observations {Y1, Y2, ..., YT} are continuous multivariate Normal with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$ when $X_t = i$, i.e.
$$\theta_{i y_t} = \frac{1}{(2\pi)^{p/2}\,|\Sigma_i|^{1/2}}\exp\!\Big[-\tfrac{1}{2}\,(\mathbf{y}_t - \boldsymbol{\mu}_i)'\,\Sigma_i^{-1}\,(\mathbf{y}_t - \boldsymbol{\mu}_i)\Big].$$
1. Pick arbitrarily M centroids $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_M$. Assign each of the T observations $\mathbf{y}_t$ (kT if multiple realizations are observed) to a state $i_t$ by determining
$$\min_i \|\mathbf{y}_t - \mathbf{a}_i\|.$$
2. Then
$$\hat\pi_i^0 = \frac{\text{Number of times } i_1 = i}{k}, \qquad \hat\pi_{ij} = \frac{\text{Number of transitions from } i \text{ to } j}{\text{Number of transitions from } i}.$$
3. And
$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{\{t:\, i_t = i\}}\mathbf{y}_t}{N_i}, \qquad \hat\Sigma_i = \frac{\sum_{\{t:\, i_t = i\}}(\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)(\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)'}{N_i}.$$
4. Calculate the Viterbi path $(i_1, i_2, \ldots, i_T)$ based on the parameters of steps 2 and 3.
5. If there is a change in the sequence $(i_1, i_2, \ldots, i_T)$, repeat steps 2 to 4.
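A compact sketch of these five steps for a one-dimensional Gaussian-emission HMM and a single sequence (the initialization and the small smoothing constants are illustrative assumptions; it also assumes every state keeps at least one observation throughout):

```python
import numpy as np

def segmental_kmeans(y, M, n_iter=20, rng=np.random.default_rng(0)):
    """Segmental K-means for a 1-D Gaussian-emission HMM (illustrative sketch)."""
    # Step 1: initial assignment of each observation to the nearest centroid
    centroids = rng.choice(y, size=M, replace=False)
    states = np.abs(y[:, None] - centroids[None, :]).argmin(axis=1)
    for _ in range(n_iter):
        # Steps 2-3: re-estimate the parameters from the current state sequence
        pi0 = np.zeros(M); pi0[states[0]] = 1.0
        trans = np.full((M, M), 1e-12)            # tiny smoothing to keep logs finite
        for i, j in zip(states[:-1], states[1:]):
            trans[i, j] += 1
        Pi = trans / trans.sum(axis=1, keepdims=True)
        mu = np.array([y[states == i].mean() for i in range(M)])
        sigma = np.array([y[states == i].std() + 1e-6 for i in range(M)])
        # Step 4: Viterbi path under the Gaussian emission densities
        log_emit = (-0.5 * ((y[:, None] - mu) / sigma) ** 2
                    - np.log(sigma) - 0.5 * np.log(2 * np.pi))
        V = np.log(pi0 + 1e-12) + log_emit[0]
        back = np.zeros((len(y), M), dtype=int)
        for t in range(1, len(y)):
            cost = V[:, None] + np.log(Pi) + log_emit[t][None, :]
            back[t] = cost.argmax(axis=0)
            V = cost.max(axis=0)
        new_states = np.zeros(len(y), dtype=int)
        new_states[-1] = V.argmax()
        for t in range(len(y) - 1, 0, -1):
            new_states[t - 1] = back[t, new_states[t]]
        # Step 5: stop when the state sequence no longer changes
        if np.array_equal(new_states, states):
            break
        states = new_states
    return pi0, Pi, mu, sigma, states
```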
The Baum-Welch (E-M)
Algorithm
• The E-M algorithm was designed originally
to handle “Missing observations”.
• In this case the missing observations are the
states {X1, X2, ... , XT}.
• Assuming a model, the states are estimated
by finding their expected values under this
model. (The E part of the E-M algorithm).
• With these values the model is estimated by
Maximum Likelihood Estimation (The M
part of the E-M algorithm).
• The process is repeated until the estimated
model converges.
The E-M Algorithm
Let $f(\mathbf{Y}, \mathbf{X} \mid \boldsymbol{\theta}) = L(\mathbf{Y}, \mathbf{X}, \boldsymbol{\theta})$ denote the joint distribution of $\mathbf{Y}, \mathbf{X}$.
Consider the function
$$Q\big(\boldsymbol{\theta}, \boldsymbol{\theta}^*\big) = E_{\mathbf{X}}\big[\ln L(\mathbf{Y}, \mathbf{X}, \boldsymbol{\theta}) \mid \mathbf{Y}, \boldsymbol{\theta}^*\big].$$
Starting with an initial estimate $\boldsymbol{\theta}^{(1)}$, a sequence of estimates $\boldsymbol{\theta}^{(m)}$ is formed by finding $\boldsymbol{\theta} = \boldsymbol{\theta}^{(m+1)}$ to maximize $Q\big(\boldsymbol{\theta}, \boldsymbol{\theta}^{(m)}\big)$ with respect to $\boldsymbol{\theta}$.
The sequence of estimates $\boldsymbol{\theta}^{(m)}$ converges to a local maximum of the likelihood
$$L(\mathbf{Y}, \boldsymbol{\theta}) = f(\mathbf{Y} \mid \boldsymbol{\theta}).$$
Example: Sampling from Mixtures
Let y1, y2, …, yn denote a sample from the density
$$f(y \mid \lambda_1, \lambda_2, \ldots, \lambda_m, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_m) = \lambda_1\, g(y \mid \boldsymbol{\theta}_1) + \lambda_2\, g(y \mid \boldsymbol{\theta}_2) + \cdots + \lambda_m\, g(y \mid \boldsymbol{\theta}_m),$$
where $\lambda_1 + \lambda_2 + \cdots + \lambda_m = 1$ and $g(y \mid \boldsymbol{\theta}_i)$ is known except for $\boldsymbol{\theta}_i$.
Suppose that m = 2, and let x1, x2, …, xn denote independent random variables taking on the value 1 with probability $\lambda$ and 0 with probability $1 - \lambda$.
Suppose that $y_i$ comes from the density
$$f(y \mid \lambda, \boldsymbol{\theta}_1, \boldsymbol{\theta}_2) = \big[g(y \mid \boldsymbol{\theta}_1)\big]^{x_i}\big[g(y \mid \boldsymbol{\theta}_2)\big]^{1 - x_i}.$$
We will also assume that $g(y \mid \boldsymbol{\theta}_i)$ is normal with mean $\mu_i$ and standard deviation $\sigma_i$.
Thus the joint distribution of x1, x2, …, xn and y1, y2, …, yn is
$$f(\mathbf{y}, \mathbf{x} \mid \lambda, \mu_1, \mu_2, \sigma_1, \sigma_2) = \prod_{i=1}^n\left[\frac{\lambda}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}}\right]^{x_i}\left[\frac{1-\lambda}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(y_i - \mu_2)^2}{2\sigma_2^2}}\right]^{1 - x_i} = L(\lambda, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x}),$$
and the log-likelihood is
$$\ell(\lambda, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x}) = \ln L(\lambda, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x})$$
$$= -\frac{n}{2}\ln(2\pi) + \sum_{i=1}^n x_i\left[\ln\lambda - \ln\sigma_1 - \frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right] + \sum_{i=1}^n(1 - x_i)\left[\ln(1-\lambda) - \ln\sigma_2 - \frac{(y_i - \mu_2)^2}{2\sigma_2^2}\right].$$
Setting the derivative with respect to $\lambda$ equal to zero,
$$\frac{\partial\,\ell(\lambda, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x})}{\partial\lambda} = \sum_{i=1}^n\left[\frac{x_i}{\lambda} - \frac{1 - x_i}{1 - \lambda}\right] = 0,$$
or
$$(1 - \lambda)\sum_{i=1}^n x_i = \lambda\left(n - \sum_{i=1}^n x_i\right).$$
Hence
$$\hat\lambda = \frac{1}{n}\sum_{i=1}^n x_i.$$
Differentiating $\ell(\lambda, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x})$ in the same way with respect to $\mu_1$, $\mu_2$, $\sigma_1$ and $\sigma_2$ gives the usual complete-data estimates (means and variances of the observations weighted by the $x_i$ and $1 - x_i$). In the E-M algorithm the unobserved $x_i$ are then replaced by their conditional expectations given the data and the current parameter estimates.
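A compact sketch of the resulting E-M iteration for this two-component normal mixture (the starting values are illustrative assumptions; `xbar` below is the conditional expectation of $x_i$ given $y_i$):

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def em_two_normals(y, n_iter=100):
    """E-M for the mixture lambda*N(mu1, s1^2) + (1 - lambda)*N(mu2, s2^2)."""
    # Illustrative starting values (assumptions, not from the slides)
    lam, mu1, mu2, s1, s2 = 0.5, y.min(), y.max(), y.std(), y.std()
    for _ in range(n_iter):
        # E-step: xbar_i = E[x_i | y_i] under the current parameters
        w1 = lam * normal_pdf(y, mu1, s1)
        w2 = (1 - lam) * normal_pdf(y, mu2, s2)
        xbar = w1 / (w1 + w2)
        # M-step: complete-data MLEs with x_i replaced by xbar_i
        lam = xbar.mean()
        mu1 = np.sum(xbar * y) / np.sum(xbar)
        mu2 = np.sum((1 - xbar) * y) / np.sum(1 - xbar)
        s1 = np.sqrt(np.sum(xbar * (y - mu1) ** 2) / np.sum(xbar))
        s2 = np.sqrt(np.sum((1 - xbar) * (y - mu2) ** 2) / np.sum(1 - xbar))
    return lam, mu1, mu2, s1, s2
```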
In the case of an HMM the log-likelihood is given by
$$\ell\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \ln L\big(\boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \ln\pi^0_{i_1} + \ln\theta_{i_1 y_1} + \ln\pi_{i_1 i_2} + \ln\theta_{i_2 y_2} + \ln\pi_{i_2 i_3} + \ln\theta_{i_3 y_3} + \cdots + \ln\pi_{i_{T-1} i_T} + \ln\theta_{i_T y_T}$$
$$= \sum_{i=1}^M f_i^0\ln\pi_i^0 + \sum_{i=1}^M\sum_{j=1}^M f_{ij}\ln\pi_{ij} + \sum_{i=1}^M\sum_{\{t:\, i_t = i\}}\ln\theta_{i y_t},$$
where
$f_i^0$ = the number of times state i occurs as the first state,
$f_{ij}$ = the number of times state i changes to state j,
$\theta_{iy} = f(y \mid \theta_i)$ (or $p(y \mid i)$ in the discrete case), and
the last summation runs over all observations $y_t$ for which $X_t = i$.
Recall
$$\gamma_t(i) = P[X_t = i \mid Y = y] = \frac{\alpha_t(i)\,\beta^*_t(i)}{\sum_j\alpha_t(j)\,\beta^*_t(j)} = \frac{\alpha_t(i)\,\beta^*_t(i)}{\sum_j\alpha_T(j)},$$
and
$$\sum_{t=1}^{T-1}\gamma_t(i) = \text{Expected number of transitions from state } i.$$
t i, j P X t i, X t 1 j Y y
P X t i, X t 1 j, Y y
P Y y
P X t i, Y (t ) y (t ) , X t 1 j, Yt 1 yt 1 , Y*(t 1) y *(t 1)
P Y y
t i p ij j yt 1 t*1 j
T j
T 1
j
t i, j
t 1
Expected no. of transitions from state i to
state j.
The E-M Re-estimation Formulae
Case 1: The observations {Y1, Y2, ..., YT} are discrete with K possible values and $\theta_{iy} = P[Y_t = y \mid X_t = i]$. Then
$$\hat\pi_i^0 = \gamma_1(i), \qquad \hat\pi_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i, j)}{\sum_{t=1}^{T-1}\gamma_t(i)}, \qquad \text{and} \qquad \hat\theta_{iy} = \frac{\sum_{\{t:\, y_t = y\}}\gamma_t(i)}{\sum_{t=1}^{T}\gamma_t(i)}.$$
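A sketch of one re-estimation (E-M) step in this discrete case, reusing the forward and backward functions from the earlier sketches and the same illustrative array conventions:

```python
import numpy as np

def baum_welch_step(pi0, Pi, theta, y):
    """One E-M re-estimation step for a discrete-emission HMM (Case 1)."""
    T, M, K = len(y), len(pi0), theta.shape[1]
    alpha = forward(pi0, Pi, theta, y)       # from the forward sketch above
    beta = backward(Pi, theta, y)            # from the backward sketch above
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood        # gamma[t, i] = P[X_t = i | Y = y]
    # xi[t, i, j] = P[X_t = i, X_{t+1} = j | Y = y], for t = 1, ..., T-1
    xi = (alpha[:-1, :, None] * Pi[None, :, :]
          * (theta[:, y[1:]].T * beta[1:])[:, None, :]) / likelihood
    new_pi0 = gamma[0]
    new_Pi = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_theta = np.zeros((M, K))
    for k in range(K):
        new_theta[:, k] = gamma[y == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi0, new_Pi, new_theta
```

Iterating this step until the estimates stabilise is the Baum-Welch algorithm.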
Case 2: The observations {Y1, Y2, ..., YT} are continuous multivariate Normal with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$ when $X_t = i$, i.e.
$$\theta_{i y_t} = \frac{1}{(2\pi)^{p/2}\,|\Sigma_i|^{1/2}}\exp\!\Big[-\tfrac{1}{2}\,(\mathbf{y}_t - \boldsymbol{\mu}_i)'\,\Sigma_i^{-1}\,(\mathbf{y}_t - \boldsymbol{\mu}_i)\Big].$$
Then
$$\hat\pi_i^0 = \gamma_1(i), \qquad \hat\pi_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i, j)}{\sum_{t=1}^{T-1}\gamma_t(i)}, \qquad \text{and}$$
$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\,\mathbf{y}_t}{\sum_{t=1}^{T}\gamma_t(i)}, \qquad \hat\Sigma_i = \frac{\sum_{t=1}^{T}\gamma_t(i)\,(\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)(\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)'}{\sum_{t=1}^{T}\gamma_t(i)}.$$
Measuring the distance between two HMMs
Let
$$\lambda^1 = \big(\boldsymbol{\pi}^{01}, \Pi^1, \boldsymbol{\theta}^1\big) \quad\text{and}\quad \lambda^2 = \big(\boldsymbol{\pi}^{02}, \Pi^2, \boldsymbol{\theta}^2\big)$$
denote the parameters of two different HMM models. We now consider defining a distance between these two models.
The Kullback-Leibler distance
Consider two discrete distributions $p^1(\mathbf{y})$ and $p^2(\mathbf{y})$ ($f^1(\mathbf{y})$ and $f^2(\mathbf{y})$ in the continuous case), and define
$$I\big(p^1, p^2\big) = \sum_{\mathbf{y}}\ln\!\left(\frac{p^1(\mathbf{y})}{p^2(\mathbf{y})}\right)p^1(\mathbf{y}) = E_{p^1}\big[\ln p^1(\mathbf{y}) - \ln p^2(\mathbf{y})\big],$$
and in the continuous case
$$I\big(f^1, f^2\big) = \int\ln\!\left(\frac{f^1(\mathbf{y})}{f^2(\mathbf{y})}\right)f^1(\mathbf{y})\,d\mathbf{y} = E_{f^1}\big[\ln f^1(\mathbf{y}) - \ln f^2(\mathbf{y})\big].$$
These measures of distance between the two distributions are not symmetric, but can be made symmetric by the following:
$$I_s\big(p^1, p^2\big) = \frac{I\big(p^1, p^2\big) + I\big(p^2, p^1\big)}{2}.$$
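A small sketch of these formulas for two discrete distributions given as probability arrays over a common support:

```python
import numpy as np

def kl(p, q):
    """I(p, q) = sum_y p(y) * [ln p(y) - ln q(y)] for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                          # terms with p(y) = 0 contribute nothing
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

def symmetric_kl(p, q):
    """I_s(p, q) = [I(p, q) + I(q, p)] / 2."""
    return 0.5 * (kl(p, q) + kl(q, p))

# Example: symmetric_kl([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
```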
In the case of a Hidden Markov model,
$$p(\mathbf{y}) = p(\mathbf{y} \mid \lambda) = p\big(\mathbf{y} \mid \boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \sum_{\mathbf{i}} p\big(\mathbf{y}, \mathbf{i} \mid \boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big),$$
where
$$p\big(\mathbf{y}, \mathbf{i} \mid \boldsymbol{\pi}^0, \Pi, \boldsymbol{\theta}\big) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
The computation of $I\big(p^1, p^2\big)$ in this case is formidable.
Juang and Rabiner distance
Let $\mathbf{Y}_T^{(i)} = \big(Y_1^{(i)}, Y_2^{(i)}, \ldots, Y_T^{(i)}\big)$ denote a sequence of observations generated from the HMM with parameters $\lambda^i = \big(\boldsymbol{\pi}^{0i}, \Pi^i, \boldsymbol{\theta}^i\big)$.
Let $\mathbf{i}^{*(i)}(\mathbf{y}) = \big(i_1^{(i)}(\mathbf{y}), i_2^{(i)}(\mathbf{y}), \ldots, i_T^{(i)}(\mathbf{y})\big)$ denote the optimal (Viterbi) sequence of states assuming HMM model $\lambda^i = \big(\boldsymbol{\pi}^{0i}, \Pi^i, \boldsymbol{\theta}^i\big)$.
Then define
$$D\big(\lambda^1, \lambda^2\big) \stackrel{\text{def}}{=} \lim_{T\to\infty}\frac{1}{T}\Big[\ln p\big(\mathbf{Y}_T^{(1)}, \mathbf{i}^{*(1)}(\mathbf{Y}_T^{(1)}) \mid \lambda^1\big) - \ln p\big(\mathbf{Y}_T^{(1)}, \mathbf{i}^{*(2)}(\mathbf{Y}_T^{(1)}) \mid \lambda^2\big)\Big]$$
and
$$D_s\big(\lambda^1, \lambda^2\big) = \frac{D\big(\lambda^1, \lambda^2\big) + D\big(\lambda^2, \lambda^1\big)}{2}.$$