Hidden Markov Models
A Hidden Markov Model consists of
1. A sequence of states $\{X_t \mid t \in T\} = \{X_1, X_2, \ldots, X_T\}$, and
2. A sequence of observations $\{Y_t \mid t \in T\} = \{Y_1, Y_2, \ldots, Y_T\}$.

Some basic problems, given the observations $\{Y_1, Y_2, \ldots, Y_T\}$:
1. Determine the sequence of states $\{X_1, X_2, \ldots, X_T\}$ (assuming the model):
- the Viterbi path
- the state probabilities given the observations $\{Y_1, Y_2, \ldots, Y_T\}$
2. Determine (or estimate) the parameters of the stochastic process that is generating the states and the observations.
Computing Likelihood

Let $p_{ij} = P[X_{t+1} = j \mid X_t = i]$ and $\mathbf{P} = (p_{ij})$ = the Markov chain transition matrix.
Let $p_i^0 = P[X_1 = i]$ and $\boldsymbol{\pi}^0 = (p_1^0, p_2^0, \ldots, p_M^0)$ = the initial distribution over the states. Then
$$P[X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T] = p_{i_1}^0\, p_{i_1 i_2}\, p_{i_2 i_3} \cdots p_{i_{T-1} i_T}.$$
Computing Likelihood

$$P[X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T, Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T] = P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]
= p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T),$$
where, in the discrete case, $\theta_i(y) = P[Y_t = y \mid X_t = i]$. Therefore
$$P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T] = P[\mathbf{Y} = \mathbf{y}]
= \sum_{i_1, i_2, \ldots, i_T} p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T)
= L(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta}), \quad \text{where } \boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_M).$$
In the case when $Y_1, Y_2, \ldots, Y_T$ are continuous random variables or continuous random vectors, let $f(y \mid \theta_i)$ denote the conditional density of $Y_t$ given $X_t = i$. Then the joint density of $Y_1, Y_2, \ldots, Y_T$ is given by
$$L(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta}) = f(y_1, y_2, \ldots, y_T) = f(\mathbf{y})
= \sum_{i_1, i_2, \ldots, i_T} p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T),$$
where $\theta_{i_t}(y_t) = f(y_t \mid \theta_{i_t})$.
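To make the sum over state sequences concrete, here is a minimal brute-force sketch in Python/NumPy that evaluates $P[\mathbf{Y} = \mathbf{y}]$ for a small discrete HMM by enumerating every path $(i_1, \ldots, i_T)$; the parameter values and the names pi0, P, Theta are illustrative and not part of the notes.

import itertools
import numpy as np

# Illustrative 2-state, 3-symbol HMM (hypothetical numbers).
pi0 = np.array([0.6, 0.4])            # initial distribution pi^0
P = np.array([[0.7, 0.3],             # transition matrix (p_ij)
              [0.2, 0.8]])
Theta = np.array([[0.5, 0.4, 0.1],    # Theta[i, y] = theta_i(y) = P[Y_t = y | X_t = i]
                  [0.1, 0.3, 0.6]])
y = [0, 2, 1, 1]                      # observed sequence y_1, ..., y_T
M, T = len(pi0), len(y)

# Sum p_{i1}^0 theta_{i1}(y1) p_{i1 i2} theta_{i2}(y2) ... over all M^T paths.
likelihood = 0.0
for path in itertools.product(range(M), repeat=T):
    prob = pi0[path[0]] * Theta[path[0], y[0]]
    for t in range(1, T):
        prob *= P[path[t - 1], path[t]] * Theta[path[t], y[t]]
    likelihood += prob

print(likelihood)                     # P[Y = y]

The cost of this enumeration grows as $M^T$, which is exactly why the recursions of the next section are preferred.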
Efficient Methods for Computing Likelihood

The Forward Method
1. $\alpha_1(i_1) = p_{i_1}^0\, \theta_{i_1}(y_1)$
2. $\alpha_{t+1}(i_{t+1}) = \left[\sum_{i_t} \alpha_t(i_t)\, p_{i_t i_{t+1}}\right] \theta_{i_{t+1}}(y_{t+1})$
3. Then $P[\mathbf{Y} = \mathbf{y}] = \sum_{i_T} \alpha_T(i_T)$.
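A minimal NumPy sketch of the forward recursion for a discrete-observation HMM, using the same illustrative parameters as above; because of 0-based indexing, row t of alpha holds the values $\alpha_{t+1}(\cdot)$.

import numpy as np

def forward(pi0, P, Theta, y):
    """Forward method: returns the table alpha (T x M) and P[Y = y]."""
    T, M = len(y), len(pi0)
    alpha = np.zeros((T, M))
    alpha[0] = pi0 * Theta[:, y[0]]                  # step 1: alpha_1(i) = p_i^0 theta_i(y_1)
    for t in range(1, T):                            # step 2: alpha_{t+1}(j) = [sum_i alpha_t(i) p_ij] theta_j(y_{t+1})
        alpha[t] = (alpha[t - 1] @ P) * Theta[:, y[t]]
    return alpha, alpha[-1].sum()                    # step 3: P[Y = y] = sum_i alpha_T(i)

# Same illustrative parameters as in the brute-force sketch; the cost is now O(T M^2).
pi0 = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
Theta = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
alpha, lik = forward(pi0, P, Theta, [0, 2, 1, 1])
print(lik)                                           # matches the brute-force sum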
The Backward Procedure
1. $\beta^*_{T-1}(i_{T-1}) = \sum_{i_T} p_{i_{T-1} i_T}\, \theta_{i_T}(y_T)$
2. Then $\beta^*_t(i_t) = \sum_{i_{t+1}} p_{i_t i_{t+1}}\, \theta_{i_{t+1}}(y_{t+1})\, \beta^*_{t+1}(i_{t+1})$, for $t = T-2, \ldots, 1$
3. and $P[\mathbf{Y} = \mathbf{y}] = \sum_{i_1} \beta^*_1(i_1)\, p_{i_1}^0\, \theta_{i_1}(y_1)$.
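A matching NumPy sketch of the backward recursion; it uses the equivalent convention $\beta^*_T(i) = 1$, so step 1 above is simply the first pass of the loop. The names and parameter values follow the earlier illustrative example.

import numpy as np

def backward(pi0, P, Theta, y):
    """Backward procedure: returns the table beta* (T x M) and P[Y = y]."""
    T, M = len(y), len(pi0)
    beta = np.ones((T, M))                           # convention beta*_T(i) = 1
    for t in range(T - 2, -1, -1):                   # beta*_t(i) = sum_j p_ij theta_j(y_{t+1}) beta*_{t+1}(j)
        beta[t] = P @ (Theta[:, y[t + 1]] * beta[t + 1])
    lik = np.sum(beta[0] * pi0 * Theta[:, y[0]])     # P[Y = y] = sum_i beta*_1(i) p_i^0 theta_i(y_1)
    return beta, lik

pi0 = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
Theta = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
beta, lik = backward(pi0, P, Theta, [0, 2, 1, 1])
print(lik)                                           # same value as the forward method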
Prediction of states from the observations and the model:
$$P[X_t = i_t \mid \mathbf{Y} = \mathbf{y}] = \frac{\alpha_t(i_t)\, \beta^*_t(i_t)}{\sum_{i_T} \alpha_T(i_T)}.$$
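Combining the two tables gives the state probabilities; a small helper, assuming alpha and beta arrays of shape (T, M) as produced by the forward and backward sketches above.

import numpy as np

def state_posteriors(alpha, beta):
    """gamma[t, i] = P[X_{t+1} = i | Y = y] (0-based row t) from the forward and backward tables."""
    return alpha * beta / alpha[-1].sum()            # alpha_t(i) beta*_t(i) / sum_i alpha_T(i)

# Usage with the forward() and backward() sketches above (hypothetical parameters):
#   gamma = state_posteriors(forward(pi0, P, Theta, y)[0], backward(pi0, P, Theta, y)[0])
# Each row of gamma sums to one, since it is a probability distribution over the M states.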
The Viterbi Algorithm (Viterbi Paths)

The Viterbi path is the sequence of states $X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T$ that maximizes
$$P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T]
= p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T)$$
for a given set of observations $Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T$.
Summary of calculations of the Viterbi Path

Equivalently, the Viterbi path minimizes $U(i_1, i_2, \ldots, i_T) = -\ln P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T]$:
1. $V_1(i_1) = -\ln\!\left[p_{i_1}^0\, \theta_{i_1}(y_1)\right]$, for $i_1 = 1, 2, \ldots, M$
2. $V_{t+1}(i_{t+1}) = \min_{i_t}\left\{V_t(i_t) - \ln\!\left[p_{i_t i_{t+1}}\, \theta_{i_{t+1}}(y_{t+1})\right]\right\}$, for $i_{t+1} = 1, 2, \ldots, M$; $t = 1, \ldots, T-2$
3. $V_T(i_T) = \min_{i_{T-1}}\left\{V_{T-1}(i_{T-1}) - \ln\!\left[p_{i_{T-1} i_T}\, \theta_{i_T}(y_T)\right]\right\}$, and
$$\min_{i_T} V_T(i_T) = \min_{i_1, i_2, \ldots, i_T} U(i_1, i_2, \ldots, i_T).$$
The Viterbi path itself is recovered by recording, at each step, the value of $i_t$ that attains the minimum.
HMM generator (normal).xls
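A minimal NumPy sketch of this negative-log recursion, with back-pointers added so the minimizing sequence itself (not just $\min U$) is recovered; the parameters are the same illustrative ones used earlier.

import numpy as np

def viterbi(pi0, P, Theta, y):
    """Viterbi path via the minimum of U = -ln P[X = i, Y = y]."""
    T, M = len(y), len(pi0)
    V = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)                     # argmin pointers
    V[0] = -(np.log(pi0) + np.log(Theta[:, y[0]]))         # step 1
    for t in range(1, T):                                  # steps 2-3
        cand = V[t - 1][:, None] - np.log(P)               # V_t(i) - ln p_ij, for all (i, j)
        back[t] = cand.argmin(axis=0)
        V[t] = cand.min(axis=0) - np.log(Theta[:, y[t]])
    # Trace back from the state minimizing V_T to recover the Viterbi path.
    path = [int(V[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], V[-1].min()                         # states (0-based) and min U

pi0 = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.2, 0.8]])
Theta = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi(pi0, P, Theta, [0, 2, 1, 1]))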

Estimation of Parameters of a Hidden Markov Model

If both the sequence of observations $Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T$ and the sequence of states $X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T$ are observed, then the likelihood is given by
$$L(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta}) = p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T)$$
and the log-likelihood is given by
$$l(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta}) = \ln L(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta})
= \ln p_{i_1}^0 + \ln \theta_{i_1}(y_1) + \ln p_{i_1 i_2} + \ln \theta_{i_2}(y_2) + \ln p_{i_2 i_3} + \ln \theta_{i_3}(y_3) + \cdots + \ln p_{i_{T-1} i_T} + \ln \theta_{i_T}(y_T)$$
$$= \sum_{i=1}^{M} f_i^0 \ln p_i^0 + \sum_{i=1}^{M}\sum_{j=1}^{M} f_{ij} \ln p_{ij} + \sum_{i=1}^{M} \sum_{y(i)} \ln \theta_{iy},$$
where
$f_i^0$ = the number of times state $i$ occurs as the first state,
$f_{ij}$ = the number of times state $i$ changes to state $j$,
$\theta_{iy} = f(y \mid \theta_i)$ (or $p(y \mid \theta_i)$ in the discrete case), and
$\sum_{y(i)}$ = the sum over all observations $y_t$ where $X_t = i$.

In this case the maximum likelihood estimates are
$$\hat{p}_i^0 = \frac{f_i^0}{1}, \qquad \hat{p}_{ij} = \frac{f_{ij}}{\sum_{j=1}^{M} f_{ij}},$$
and $\hat{\theta}_i$ = the MLE of $\theta_i$ computed from the observations $y_t$ where $X_t = i$.
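Because the estimates are simple counts when the states are observed, a short sketch suffices; the data here are hypothetical, and the discrete-emission estimate of $\hat{\theta}_{iy}$ is shown as one concrete instance of "the MLE of $\theta_i$ computed from the observations $y_t$ where $X_t = i$".

import numpy as np

# Hypothetical fully observed data: states x_t in {0,...,M-1}, symbols y_t in {0,...,K-1}.
x = np.array([0, 0, 1, 1, 1, 0, 1, 1, 0, 0])
y = np.array([0, 1, 2, 2, 1, 0, 2, 1, 0, 1])
M, K = 2, 3

f0 = np.bincount(x[:1], minlength=M)                 # f_i^0: count of the first state
f = np.zeros((M, M))
for a, b in zip(x[:-1], x[1:]):                      # f_ij: transition counts
    f[a, b] += 1

p0_hat = f0 / f0.sum()                               # here f0.sum() = 1 (single realization)
P_hat = f / f.sum(axis=1, keepdims=True)             # p_ij-hat = f_ij / sum_j f_ij

# Discrete-emission MLE: theta_iy-hat = (# of t with x_t = i, y_t = y) / (# of t with x_t = i).
counts = np.zeros((M, K))
for a, b in zip(x, y):
    counts[a, b] += 1
Theta_hat = counts / counts.sum(axis=1, keepdims=True)

print(p0_hat, P_hat, Theta_hat, sep="\n")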
MLE (states unknown)

If only the sequence of observations $Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T$ is observed, then the likelihood is given by
$$L(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta}) = \sum_{i_1, i_2, \ldots, i_T} p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T).$$
• It is difficult to find the maximum likelihood estimates directly from this likelihood function.
• The techniques that are used are:
1. The Segmental K-means Algorithm
2. The Baum-Welch (E-M) Algorithm
The Segmental K-means Algorithm

In this method the parameters $\boldsymbol{\lambda} = (\boldsymbol{\pi}^0, \boldsymbol{\Pi}, \boldsymbol{\theta})$ are adjusted to maximize
$$L(\boldsymbol{\pi}^0, \boldsymbol{\Pi}, \boldsymbol{\theta} \mid \mathbf{y}, \mathbf{i}) = L(\boldsymbol{\lambda} \mid \mathbf{y}, \mathbf{i})
= p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T),$$
where $\mathbf{i} = (i_1, i_2, \ldots, i_T)$ is the Viterbi path.

Consider this with the special case:
Case: The observations $\{Y_1, Y_2, \ldots, Y_T\}$ are continuous multivariate normal with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$ when $X_t = i$, i.e.
$$\theta_i(\mathbf{y}_t) = \frac{1}{(2\pi)^{p/2}\, \left|\boldsymbol{\Sigma}_i\right|^{1/2}} \exp\!\left[-\tfrac{1}{2}\, (\mathbf{y}_t - \boldsymbol{\mu}_i)'\, \boldsymbol{\Sigma}_i^{-1}\, (\mathbf{y}_t - \boldsymbol{\mu}_i)\right].$$
1. Pick arbitrarily M centroids $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_M$. Assign each of the T observations $\mathbf{y}_t$ (kT if k multiple realizations are observed) to a state $i_t$ by determining $\min_i \left\|\mathbf{y}_t - \mathbf{a}_i\right\|$.
2. Then
$$\hat{p}_i^0 = \frac{\text{number of times } i_1 = i}{k}, \qquad
\hat{p}_{ij} = \frac{\text{number of transitions from } i \text{ to } j}{\text{number of transitions from } i}.$$
3. And
$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{i_t = i} \mathbf{y}_t}{N_i}, \qquad
\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{i_t = i} (\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)(\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)'}{N_i},$$
where $N_i$ is the number of observations assigned to state $i$.
4. Calculate the Viterbi path $(i_1, i_2, \ldots, i_T)$ based on the parameters of steps 2 and 3.
5. If there is a change in the sequence $(i_1, i_2, \ldots, i_T)$, repeat steps 2 to 4. (A code sketch of the full loop follows this list.)
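A compact sketch of the whole segmental K-means loop for the Gaussian case, restricted for brevity to univariate normal emissions (p = 1) and a single realization (k = 1); scipy.stats.norm supplies the density, and the initialization, names, and numerical guardrails are illustrative choices rather than part of the notes. The multivariate case would replace norm with scipy.stats.multivariate_normal.

import numpy as np
from scipy.stats import norm

def viterbi_gauss(pi0, P, mu, sigma, y):
    """Viterbi path for a univariate-Gaussian-emission HMM (negative-log scale)."""
    T, M = len(y), len(pi0)
    logtheta = norm.logpdf(y[:, None], loc=mu, scale=sigma)   # (T, M): log theta_i(y_t)
    V = -(np.log(pi0) + logtheta[0])
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        cand = V[:, None] - np.log(P)
        back[t] = cand.argmin(axis=0)
        V = cand.min(axis=0) - logtheta[t]
    path = [int(V.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return np.array(path[::-1])

def segmental_kmeans(y, M, n_iter=50, seed=0):
    """Segmental K-means for a univariate Gaussian HMM, single realization (k = 1)."""
    rng = np.random.default_rng(seed)
    # Step 1: assign each y_t to the nearest of M arbitrary centroids.
    a = rng.choice(y, size=M, replace=False)
    states = np.abs(y[:, None] - a).argmin(axis=1)
    for _ in range(n_iter):
        # Steps 2-3: re-estimate pi0, P, mu, sigma from the current state sequence.
        pi0 = np.full(M, 1e-6); pi0[states[0]] += 1.0; pi0 /= pi0.sum()
        P = np.full((M, M), 1e-6)
        for i, j in zip(states[:-1], states[1:]):
            P[i, j] += 1.0
        P /= P.sum(axis=1, keepdims=True)
        mu = np.array([y[states == i].mean() if np.any(states == i) else y.mean() for i in range(M)])
        sigma = np.array([y[states == i].std() if np.sum(states == i) > 1 else y.std() for i in range(M)])
        sigma = np.maximum(sigma, 1e-3)                        # guard against zero variance
        # Step 4: recompute the Viterbi path; Step 5: stop when it no longer changes.
        new_states = viterbi_gauss(pi0, P, mu, sigma, y)
        if np.array_equal(new_states, states):
            break
        states = new_states
    return states, pi0, P, mu, sigma

# Hypothetical data: two regimes with different means.
rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])
states, pi0, P, mu, sigma = segmental_kmeans(y, M=2)
print(mu)
print(P)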
The Baum-Welch (E-M) Algorithm
• The E-M algorithm was designed originally
to handle “Missing observations”.
• In this case the missing observations are the
states {X1, X2, ... , XT}.
• Assuming a model, the states are estimated
by finding their expected values under this
model. (The E part of the E-M algorithm).
• With these values the model is estimated by
Maximum Likelihood Estimation (The M
part of the E-M algorithm).
• The process is repeated until the estimated
model converges.
The E-M Algorithm

Let $f(\mathbf{Y}, \mathbf{X} \mid \boldsymbol{\theta}) = L(\mathbf{Y}, \mathbf{X}, \boldsymbol{\theta})$ denote the joint distribution of $\mathbf{Y}$ and $\mathbf{X}$.
Consider the function
$$Q(\boldsymbol{\theta}, \bar{\boldsymbol{\theta}}) = E_{\mathbf{X}}\!\left[\ln L(\mathbf{Y}, \mathbf{X}, \boldsymbol{\theta}) \mid \mathbf{Y}, \bar{\boldsymbol{\theta}}\right].$$
Starting with an initial estimate $\boldsymbol{\theta}^{(1)}$ of $\boldsymbol{\theta}$, a sequence of estimates $\left\{\boldsymbol{\theta}^{(m)}\right\}$ is formed by finding $\boldsymbol{\theta} = \boldsymbol{\theta}^{(m+1)}$ to maximize $Q\!\left(\boldsymbol{\theta}, \boldsymbol{\theta}^{(m)}\right)$ with respect to $\boldsymbol{\theta}$.
The sequence of estimates $\left\{\boldsymbol{\theta}^{(m)}\right\}$ converges to a local maximum of the likelihood $L(\mathbf{Y}, \boldsymbol{\theta}) = f(\mathbf{Y} \mid \boldsymbol{\theta})$.
In the case of an HMM the log-likelihood is given by
$$l(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta}) = \ln L(\boldsymbol{\pi}^0, \mathbf{P}, \boldsymbol{\theta})
= \ln p_{i_1}^0 + \ln \theta_{i_1}(y_1) + \ln p_{i_1 i_2} + \ln \theta_{i_2}(y_2) + \ln p_{i_2 i_3} + \ln \theta_{i_3}(y_3) + \cdots + \ln p_{i_{T-1} i_T} + \ln \theta_{i_T}(y_T)$$
$$= \sum_{i=1}^{M} f_i^0 \ln p_i^0 + \sum_{i=1}^{M}\sum_{j=1}^{M} f_{ij} \ln p_{ij} + \sum_{i=1}^{M} \sum_{y(i)} \ln \theta_{iy},$$
where
$f_i^0$ = the number of times state $i$ occurs as the first state,
$f_{ij}$ = the number of times state $i$ changes to state $j$,
$\theta_{iy} = f(y \mid \theta_i)$ (or $p(y \mid \theta_i)$ in the discrete case), and
$\sum_{y(i)}$ = the sum over all observations $y_t$ where $X_t = i$.
Recall
$$\gamma_t(i) = P[X_t = i \mid \mathbf{Y} = \mathbf{y}]
= \frac{\alpha_t(i)\, \beta^*_t(i)}{\sum_{j} \alpha_T(j)}
= \frac{\alpha_t(i)\, \beta^*_t(i)}{\sum_{j} \alpha_t(j)\, \beta^*_t(j)}$$
and
$$\sum_{t=1}^{T-1} \gamma_t(i) = \text{the expected number of transitions from state } i.$$
Let
$$\xi_t(i, j) = P[X_t = i, X_{t+1} = j \mid \mathbf{Y} = \mathbf{y}]
= \frac{P[X_t = i, X_{t+1} = j, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]}
= \frac{P\!\left[X_t = i, \mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_{t+1} = j, Y_{t+1} = y_{t+1}, \mathbf{Y}^{*(t+1)} = \mathbf{y}^{*(t+1)}\right]}{P[\mathbf{Y} = \mathbf{y}]}
= \frac{\alpha_t(i)\, p_{ij}\, \theta_j(y_{t+1})\, \beta^*_{t+1}(j)}{\sum_{k} \alpha_T(k)},$$
where $\mathbf{Y}^{(t)} = (Y_1, \ldots, Y_t)$ and $\mathbf{Y}^{*(t+1)} = (Y_{t+2}, \ldots, Y_T)$. Then
$$\sum_{t=1}^{T-1} \xi_t(i, j) = \text{the expected number of transitions from state } i \text{ to state } j.$$
The E-M Re-estimation Formulae

Case 1: The observations $\{Y_1, Y_2, \ldots, Y_T\}$ are discrete with $K$ possible values and $\theta_{iy} = P[Y_t = y \mid X_t = i]$. Then
$$\hat{p}_i^0 = \gamma_1(i), \qquad
\hat{p}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \text{and} \qquad
\hat{\theta}_{iy} = \frac{\sum_{t=1,\; y_t = y}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}.$$
Case 2: The observations $\{Y_1, Y_2, \ldots, Y_T\}$ are continuous multivariate normal with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$ when $X_t = i$, i.e.
$$\theta_i(\mathbf{y}_t) = \frac{1}{(2\pi)^{p/2}\, \left|\boldsymbol{\Sigma}_i\right|^{1/2}} \exp\!\left[-\tfrac{1}{2}\, (\mathbf{y}_t - \boldsymbol{\mu}_i)'\, \boldsymbol{\Sigma}_i^{-1}\, (\mathbf{y}_t - \boldsymbol{\mu}_i)\right].$$
Then
$$\hat{p}_i^0 = \gamma_1(i), \qquad
\hat{p}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)},$$
and
$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \mathbf{y}_t}{\sum_{t=1}^{T} \gamma_t(i)}, \qquad
\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, (\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)(\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)'}{\sum_{t=1}^{T} \gamma_t(i)}.$$
Measuring the distance between two HMMs

Let
$$\boldsymbol{\lambda}^{(1)} = \left(\boldsymbol{\pi}^{0(1)}, \boldsymbol{\Pi}^{(1)}, \boldsymbol{\theta}^{(1)}\right) \quad \text{and} \quad
\boldsymbol{\lambda}^{(2)} = \left(\boldsymbol{\pi}^{0(2)}, \boldsymbol{\Pi}^{(2)}, \boldsymbol{\theta}^{(2)}\right)$$
denote the parameters of two different HMM models. We now consider defining a distance between these two models.
The Kullback-Leibler distance

Consider two discrete distributions $p^{(1)}(y)$ and $p^{(2)}(y)$ ($f^{(1)}(y)$ and $f^{(2)}(y)$ in the continuous case), and define
$$I\!\left(p^{(1)}, p^{(2)}\right) = \sum_{y} \ln\!\left(\frac{p^{(1)}(y)}{p^{(2)}(y)}\right) p^{(1)}(y)
= E_{p^{(1)}}\!\left[\ln p^{(1)}(y) - \ln p^{(2)}(y)\right],$$
and in the continuous case
$$I\!\left(f^{(1)}, f^{(2)}\right) = \int \ln\!\left(\frac{f^{(1)}(y)}{f^{(2)}(y)}\right) f^{(1)}(y)\, dy
= E_{f^{(1)}}\!\left[\ln f^{(1)}(y) - \ln f^{(2)}(y)\right].$$
These measures of distance between the two distributions are not symmetric but can be made symmetric as follows:
$$I_s\!\left(p^{(1)}, p^{(2)}\right) = \frac{I\!\left(p^{(1)}, p^{(2)}\right) + I\!\left(p^{(2)}, p^{(1)}\right)}{2}.$$
In the case of a Hidden Markov Model,
$$p^{(i)}(\mathbf{y}) = p\!\left(\mathbf{y} \mid \boldsymbol{\lambda}^{(i)}\right) = p\!\left(\mathbf{y} \mid \boldsymbol{\pi}^{0(i)}, \boldsymbol{\Pi}^{(i)}, \boldsymbol{\theta}^{(i)}\right)
= \sum_{\mathbf{i}} p\!\left(\mathbf{y}, \mathbf{i} \mid \boldsymbol{\pi}^{0(i)}, \boldsymbol{\Pi}^{(i)}, \boldsymbol{\theta}^{(i)}\right),$$
where
$$p\!\left(\mathbf{y}, \mathbf{i} \mid \boldsymbol{\pi}^0, \boldsymbol{\Pi}, \boldsymbol{\theta}\right) = p_{i_1}^0\, \theta_{i_1}(y_1)\, p_{i_1 i_2}\, \theta_{i_2}(y_2)\, p_{i_2 i_3}\, \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T}\, \theta_{i_T}(y_T).$$
The computation of $I\!\left(p^{(1)}, p^{(2)}\right)$ in this case is formidable.
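For ordinary discrete distributions (as opposed to the HMM marginals above, where the computation is formidable), the Kullback-Leibler distance and its symmetrized version are straightforward to evaluate; a minimal sketch with illustrative distributions:

import numpy as np

def kl(p1, p2):
    """I(p1, p2) = sum_y ln(p1(y)/p2(y)) p1(y) for discrete distributions."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mask = p1 > 0                                    # terms with p1(y) = 0 contribute 0
    return np.sum(p1[mask] * np.log(p1[mask] / p2[mask]))

def kl_symmetric(p1, p2):
    """I_s(p1, p2) = [I(p1, p2) + I(p2, p1)] / 2."""
    return 0.5 * (kl(p1, p2) + kl(p2, p1))

p1 = [0.5, 0.3, 0.2]
p2 = [0.4, 0.4, 0.2]
print(kl(p1, p2), kl(p2, p1), kl_symmetric(p1, p2))  # note I(p1, p2) != I(p2, p1)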
Juang and Rabiner distance

Let $\mathbf{Y}_T^{(i)} = \left(Y_1^{(i)}, Y_2^{(i)}, \ldots, Y_T^{(i)}\right)$ denote a sequence of observations generated from the HMM with parameters
$$\boldsymbol{\lambda}^{(i)} = \left(\boldsymbol{\pi}^{0(i)}, \boldsymbol{\Pi}^{(i)}, \boldsymbol{\theta}^{(i)}\right).$$
Let $\mathbf{i}^{*(i)}(\mathbf{y}) = \left(i_1^{(i)}(\mathbf{y}), i_2^{(i)}(\mathbf{y}), \ldots, i_T^{(i)}(\mathbf{y})\right)$ denote the optimal (Viterbi) sequence of states assuming HMM model $\boldsymbol{\lambda}^{(i)} = \left(\boldsymbol{\pi}^{0(i)}, \boldsymbol{\Pi}^{(i)}, \boldsymbol{\theta}^{(i)}\right)$.
Then define
$$D\!\left(\boldsymbol{\lambda}^{(1)}, \boldsymbol{\lambda}^{(2)}\right) \stackrel{\text{def}}{=}
\lim_{T \to \infty} \frac{1}{T}\left[\ln p\!\left(\mathbf{Y}_T^{(1)}, \mathbf{i}^{*(1)}\!\left(\mathbf{Y}_T^{(1)}\right) \,\middle|\, \boldsymbol{\lambda}^{(1)}\right)
- \ln p\!\left(\mathbf{Y}_T^{(1)}, \mathbf{i}^{*(2)}\!\left(\mathbf{Y}_T^{(1)}\right) \,\middle|\, \boldsymbol{\lambda}^{(2)}\right)\right]$$
and
$$D_s\!\left(\boldsymbol{\lambda}^{(1)}, \boldsymbol{\lambda}^{(2)}\right) = \frac{D\!\left(\boldsymbol{\lambda}^{(1)}, \boldsymbol{\lambda}^{(2)}\right) + D\!\left(\boldsymbol{\lambda}^{(2)}, \boldsymbol{\lambda}^{(1)}\right)}{2}.$$
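Because the definition involves a limit over ever-longer generated sequences, a practical approach is Monte Carlo: simulate one long sequence from model 1, score its Viterbi path under both models, and divide the log-probability difference by T. A sketch for discrete-observation HMMs, with illustrative parameter values and a finite T standing in for the limit:

import numpy as np

def simulate(pi0, P, Theta, T, rng):
    """Generate T observations from a discrete-observation HMM."""
    M, K = Theta.shape
    x = rng.choice(M, p=pi0)
    y = np.empty(T, dtype=int)
    for t in range(T):
        y[t] = rng.choice(K, p=Theta[x])
        x = rng.choice(M, p=P[x])
    return y

def viterbi_logprob(pi0, P, Theta, y):
    """Max over state paths of ln p(y, i | lambda), i.e. the Viterbi-path log-probability."""
    V = np.log(pi0) + np.log(Theta[:, y[0]])
    for t in range(1, len(y)):
        V = (V[:, None] + np.log(P)).max(axis=0) + np.log(Theta[:, y[t]])
    return V.max()

def juang_rabiner(lam1, lam2, T=5000, seed=0):
    """Monte Carlo approximation of D(lambda1, lambda2) with a finite T."""
    rng = np.random.default_rng(seed)
    y = simulate(*lam1, T, rng)
    return (viterbi_logprob(*lam1, y) - viterbi_logprob(*lam2, y)) / T

lam1 = (np.array([0.6, 0.4]), np.array([[0.7, 0.3], [0.2, 0.8]]),
        np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))
lam2 = (np.array([0.5, 0.5]), np.array([[0.6, 0.4], [0.4, 0.6]]),
        np.array([[0.4, 0.3, 0.3], [0.2, 0.3, 0.5]]))
D12 = juang_rabiner(lam1, lam2)
D21 = juang_rabiner(lam2, lam1)
print(D12, D21, 0.5 * (D12 + D21))      # the last value approximates D_s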