
Hidden Markov Models
A Hidden Markov Model consists of
1. A sequence of states {X_t | t ∈ T} = {X_1, X_2, ..., X_T}, and
2. A sequence of observations {Y_t | t ∈ T} = {Y_1, Y_2, ..., Y_T}.
• The sequence of states {X_1, X_2, ..., X_T} forms a Markov chain moving amongst the M states {1, 2, ..., M}.
• The observation Y_t comes from a distribution that is determined by the current state X_t of the process (or possibly by past observations and past states).
• The states {X_1, X_2, ..., X_T} are unobserved (hence "hidden").
[Figure: the Hidden Markov Model. The hidden chain X_1 → X_2 → X_3 → … → X_T, with each state X_t emitting the observation Y_t.]
Some basic problems: from the observations {Y_1, Y_2, ..., Y_T},
1. Determine the sequence of states {X_1, X_2, ..., X_T}.
2. Determine (or estimate) the parameters of the stochastic process that is generating the states and the observations.
Examples
Example 1
• A person is rolling two sets of dice (one balanced, the other unbalanced), switching between the two sets according to a Markov transition matrix.
• The states are the dice sets.
• The observations are the numbers rolled each time.
Balanced Dice
[Figure: bar chart of the probabilities of the rolled totals 2-12 for the balanced dice.]
Unbalanced Dice
[Figure: bar chart of the probabilities of the rolled totals 2-12 for the unbalanced dice.]
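Example 1 can be simulated directly. The sketch below is illustrative only: the transition probabilities and the unbalanced die's face weights are invented, since the slides give only the figures, not the numbers.

```python
import random

# Hypothetical two-state dice HMM: the hidden state is the set of dice in
# use, the observation is the total rolled. All probabilities are made up.
TRANSITION = {"balanced": {"balanced": 0.95, "unbalanced": 0.05},
              "unbalanced": {"balanced": 0.10, "unbalanced": 0.90}}
BALANCED_FACES = [1 / 6] * 6                        # fair die
UNBALANCED_FACES = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]   # loaded toward 6

def roll(faces, rng):
    """Roll one die with the given face probabilities."""
    return rng.choices(range(1, 7), weights=faces)[0]

def simulate(T, rng=None):
    """Generate T (hidden state, observed total) pairs from the HMM."""
    rng = rng or random.Random(0)
    state = "balanced"
    out = []
    for _ in range(T):
        faces = BALANCED_FACES if state == "balanced" else UNBALANCED_FACES
        y = roll(faces, rng) + roll(faces, rng)     # total of the two dice
        out.append((state, y))
        nxt = TRANSITION[state]                     # Markov switch of dice set
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return out

sample = simulate(50)
```

Only the totals in `sample` would be visible to an observer; the state column is what the later algorithms try to recover.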
Example 2
• The Markov chain has two states.
• The observations (given the states) are independent Normal.
• Both the mean and the variance depend on the state.
HMM AR.xls
Example 3 –Dow Jones
[Figure: Dow Jones index over roughly 80 trading days.]
Daily Changes Dow Jones
[Figure: daily changes of the Dow Jones over the same period.]
Hidden Markov Model??
[Figure: the daily changes again, suggesting segments generated by different hidden states.]
Bear and Bull Market?
[Figure: the index again, with candidate "bear" and "bull" regimes.]
Speech Recognition
• When a word is spoken, the vocalization process goes through a sequence of states.
• The sound produced is relatively constant while the process remains in the same state.
• Recognizing the sequence of states and the duration of each state allows one to recognize the word being spoken.
• The interval of time when the word is spoken is broken into small (possibly overlapping) subintervals.
• In each subinterval one measures the amplitudes of various frequencies in the sound (using Fourier analysis). The vector of amplitudes Y_t is assumed to have a multivariate normal distribution in each state, with the mean vector and covariance matrix being state dependent.
Hidden Markov Models for Biological Sequences
Consider the Motif:
[AT][CG][AC][ACGT]*A[TG][GC]
Some realizations:
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
Hidden Markov model of the same motif :
[AT][CG][AC][ACGT]*A[TG][GC]
[Figure: state diagram of the motif HMM. Reading along the motif:
  State 1 [AT]: A .8, T .2 → next (1.0)
  State 2 [CG]: C .8, G .2 → next (1.0)
  State 3 [AC]: A .8, C .2 → next (.6) or insert state (.4)
  Insert state [ACGT]*: A .2, C .4, G .2, T .2; self-loop (.4), exit (.6)
  State 4 [A]: A 1.0 → next (1.0)
  State 5 [TG]: T .8, G .2 → next (1.0)
  State 6 [GC]: C .8, G .2]
Profile HMMs
[Figure: profile HMM architecture, a chain of states running from a Begin state to an End state.]
Computing Likelihood
Let p_{ij} = P[X_{t+1} = j | X_t = i] and \Pi = (p_{ij}) = the Markov chain transition matrix.
Let \pi_i^0 = P[X_1 = i] and \pi^0 = (\pi_1^0, \pi_2^0, \ldots, \pi_M^0) = the initial distribution over the states.
Then
P[X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T] = \pi_{i_1}^0 p_{i_1 i_2} p_{i_2 i_3} \cdots p_{i_{T-1} i_T}
Now assume that
P[Y_t = y_t | X_1 = i_1, X_2 = i_2, \ldots, X_t = i_t] = P[Y_t = y_t | X_t = i_t] = p(y_t | i_t) = \theta_{i_t}(y_t)
Then
P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T] = P[X = i, Y = y]
= \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
Therefore
P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T] = P[Y = y]
= \sum_{i_1, i_2, \ldots, i_T} \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
= L(\pi^0, \Pi, \theta), where \theta = (\theta_1, \theta_2, \ldots, \theta_M)
In the case when Y_1, Y_2, \ldots, Y_T are continuous random variables or continuous random vectors, let f(y | \theta_i) denote the conditional density of Y_t given X_t = i. Then the joint density of Y_1, Y_2, \ldots, Y_T is given by
L(\pi^0, \Pi, \theta) = f(y_1, y_2, \ldots, y_T) = f(y)
= \sum_{i_1, i_2, \ldots, i_T} \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
where \theta_{i_t}(y_t) = f(y_t | \theta_{i_t})
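For a tiny model the sum over all M^T state sequences can be evaluated directly. A minimal sketch, using an invented two-state, three-symbol model (all probabilities here are made up for illustration):

```python
from itertools import product

# Direct (exponential-time) evaluation of
#   P[Y = y] = sum over all paths of pi0[i1]*theta[i1](y1)*p[i1,i2]*theta[i2](y2)*...
pi0 = [0.6, 0.4]                              # initial distribution pi^0
P = [[0.7, 0.3], [0.4, 0.6]]                  # transition matrix (p_ij)
theta = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]    # theta_i(y) for y in {0, 1, 2}

def brute_force_likelihood(y):
    """Sum the joint probability over every state path (i1, ..., iT)."""
    M, T = len(pi0), len(y)
    total = 0.0
    for path in product(range(M), repeat=T):
        p = pi0[path[0]] * theta[path[0]][y[0]]
        for t in range(1, T):
            p *= P[path[t - 1]][path[t]] * theta[path[t]][y[t]]
        total += p
    return total

like = brute_force_likelihood([0, 2, 1])
```

This is only feasible for toy problems; the forward and backward recursions below reduce the cost from O(M^T) to O(M^2 T).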
Efficient Methods for computing
Likelihood
The Forward Method
Let Y^{(t)} = (Y_1, Y_2, \ldots, Y_t) and y^{(t)} = (y_1, y_2, \ldots, y_t).
Consider
\alpha_t(i_t) = P[Y^{(t)} = y^{(t)}, X_t = i_t] = P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_t = y_t, X_t = i_t]
Note
\alpha_1(i_1) = P[Y^{(1)} = y^{(1)}, X_1 = i_1] = P[Y_1 = y_1, X_1 = i_1] = \pi_{i_1}^0 \theta_{i_1}(y_1)
and
\alpha_{t+1}(i_{t+1}) = P[Y^{(t+1)} = y^{(t+1)}, X_{t+1} = i_{t+1}]
= \sum_{i_t} P[Y^{(t)} = y^{(t)}, Y_{t+1} = y_{t+1}, X_t = i_t, X_{t+1} = i_{t+1}]
= \sum_{i_t} P[Y^{(t)} = y^{(t)}, X_t = i_t] P[Y_{t+1} = y_{t+1}, X_{t+1} = i_{t+1} | Y^{(t)} = y^{(t)}, X_t = i_t]
= \sum_{i_t} P[Y^{(t)} = y^{(t)}, X_t = i_t] p_{i_t i_{t+1}} \theta_{i_{t+1}}(y_{t+1})
= \sum_{i_t} \alpha_t(i_t) p_{i_t i_{t+1}} \theta_{i_{t+1}}(y_{t+1})
Then
P[Y = y] = P[Y^{(T)} = y^{(T)}] = P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T]
= \sum_{i_T} P[Y^{(T)} = y^{(T)}, X_T = i_T] = \sum_{i_T} \alpha_T(i_T)
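The forward recursion can be sketched in a few lines. The toy parameters are invented for illustration:

```python
# Forward recursion: alpha[t][i] = P[Y1..Yt = y1..yt, Xt = i].
pi0 = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def forward(y):
    """Return the full table of forward probabilities alpha_t(i)."""
    M = len(pi0)
    alpha = [[pi0[i] * theta[i][y[0]] for i in range(M)]]   # alpha_1(i)
    for t in range(1, len(y)):
        prev = alpha[-1]
        # alpha_{t+1}(j) = sum_i alpha_t(i) * p_ij * theta_j(y_{t+1})
        alpha.append([sum(prev[i] * P[i][j] for i in range(M)) * theta[j][y[t]]
                      for j in range(M)])
    return alpha

def likelihood(y):
    """P[Y = y] = sum_i alpha_T(i)."""
    return sum(forward(y)[-1])
```

The cost is O(M^2 T) rather than O(M^T) for the brute-force sum.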
The Backward Procedure
Let Y^{*(t)} = (Y_{t+1}, Y_{t+2}, \ldots, Y_T) and y^{*(t)} = (y_{t+1}, y_{t+2}, \ldots, y_T).
Consider
\beta_t(i_t) = P[Y^{*(t)} = y^{*(t)} | X_t = i_t] = P[Y_{t+1} = y_{t+1}, Y_{t+2} = y_{t+2}, \ldots, Y_T = y_T | X_t = i_t]
Note
\beta_{T-1}(i_{T-1}) = P[Y^{*(T-1)} = y^{*(T-1)} | X_{T-1} = i_{T-1}] = P[Y_T = y_T | X_{T-1} = i_{T-1}] = \sum_{i_T} p_{i_{T-1} i_T} \theta_{i_T}(y_T)
Now
\beta_{t-1}(i_{t-1}) = P[Y^{*(t-1)} = y^{*(t-1)} | X_{t-1} = i_{t-1}]
= \sum_{i_t} P[Y_t = y_t, Y^{*(t)} = y^{*(t)}, X_t = i_t | X_{t-1} = i_{t-1}]
= \sum_{i_t} P[Y^{*(t)} = y^{*(t)} | X_t = i_t] P[Y_t = y_t | X_t = i_t] P[X_t = i_t | X_{t-1} = i_{t-1}]
= \sum_{i_t} \beta_t(i_t) p_{i_{t-1} i_t} \theta_{i_t}(y_t)
Then
P[Y = y] = P[Y^{*(0)} = y^{*(0)}] = P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T]
= \sum_{i_1} P[Y_1 = y_1, Y^{*(1)} = y^{*(1)}, X_1 = i_1]
= \sum_{i_1} P[Y^{*(1)} = y^{*(1)} | X_1 = i_1] P[X_1 = i_1, Y_1 = y_1]
= \sum_{i_1} \beta_1(i_1) \pi_{i_1}^0 \theta_{i_1}(y_1)
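The backward recursion, again with invented toy parameters. The convention beta_T(i) = 1 makes the first real step reproduce the formula for beta_{T-1} above:

```python
# Backward recursion: beta[t][i] = P[Y_{t+1}..Y_T | X_t = i].
pi0 = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def backward(y):
    """Return the full table of backward probabilities beta_t(i)."""
    M, T = len(pi0), len(y)
    beta = [[1.0] * M]                    # beta_T(i) = 1 by convention
    for t in range(T - 1, 0, -1):
        # beta_{t-1}(i) = sum_j p_ij * theta_j(y_t) * beta_t(j)
        beta.insert(0, [sum(P[i][j] * theta[j][y[t]] * beta[0][j] for j in range(M))
                        for i in range(M)])
    return beta

def likelihood(y):
    """P[Y = y] = sum_i pi0_i * theta_i(y1) * beta_1(i)."""
    b1 = backward(y)[0]
    return sum(pi0[i] * theta[i][y[0]] * b1[i] for i in range(len(pi0)))
```

Forward and backward recursions give the same likelihood; their product at each t gives the state posteriors used below.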
Prediction of states from the observations
and the model:
Consider T iT   P Y  y, X T  iT 
 PY1  y1 , Y2  y2 ,, YT  yT , X T  iT 
PY  y, X T  iT  T iT 
Thus P X T  iT Y  y  

PY  y 
T iT 
P Y  y, X t  it 
Also P X t  it Y  y  
P Y  y 
(t )
(t )
*( t )
*( t )

P Y  y , X t  it P Y  y
X t  it 



P Y  y 
 t it  t* it 

  it 
T iT 
iT

iT

The Viterbi Algorithm
(Viterbi Paths)
Suppose that we know the parameters of the Hidden Markov Model, and that in addition we have observed the sequence of observations Y_1, Y_2, ..., Y_T.
Now consider determining the sequence of states X_1, X_2, ..., X_T.
Recall that
P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T] = P[X = i, Y = y]
= \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
Consider the problem of determining the
sequence of states, i1, i2, ... , iT , that
maximizes the above probability.
This is equivalent to maximizing
P[X = i|Y = y] = P[X = i,Y = y] / P[Y = y]
The Viterbi Algorithm
We want to maximize
P[X = i, Y = y] = \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
Equivalently we want to minimize
U(i_1, i_2, \ldots, i_T) = -\ln P[X = i, Y = y]
= -[\ln(\pi_{i_1}^0 \theta_{i_1}(y_1)) + \ln(p_{i_1 i_2} \theta_{i_2}(y_2)) + \cdots + \ln(p_{i_{T-1} i_T} \theta_{i_T}(y_T))]
• Minimization of U(i_1, i_2, \ldots, i_T) can be achieved by Dynamic Programming.
• This can be thought of as finding the shortest path through a grid of points: starting at the unique point in stage 0 and moving from a point in stage t to a point in stage t+1 in an optimal way.
Dynamic Programming
[Figure: the dynamic-programming trellis with stages 0, 1, 2, ..., T-1, T, M points per stage, and edge lengths d_1(0, i_1), d_2(i_1, i_2), ..., d_T(i_{T-1}, i_T).]
• The distances between points in stage t and points in stage t+1 are equal to:
d_1(0, i_1) = -\ln(\pi_{i_1}^0 \theta_{i_1}(y_1)) if t = 0, and
d_{t+1}(i_t, i_{t+1}) = -\ln(p_{i_t i_{t+1}} \theta_{i_{t+1}}(y_{t+1})) if t \ge 1
Let
U_t(i_1, i_2, \ldots, i_t) = -[\ln(\pi_{i_1}^0 \theta_{i_1}(y_1)) + \ln(p_{i_1 i_2} \theta_{i_2}(y_2)) + \cdots + \ln(p_{i_{t-1} i_t} \theta_{i_t}(y_t))]
= d_1(0, i_1) + d_2(i_1, i_2) + \cdots + d_t(i_{t-1}, i_t)
and
V_t(i_t) = \min_{i_1, i_2, \ldots, i_{t-1}} U_t(i_1, i_2, \ldots, i_t)
Then
V_1(i_1) = -\ln(\pi_{i_1}^0 \theta_{i_1}(y_1)),  i_1 = 1, 2, \ldots, M
and
V_{t+1}(i_{t+1}) = \min_{i_t} [V_t(i_t) - \ln(p_{i_t i_{t+1}} \theta_{i_{t+1}}(y_{t+1}))] = \min_{i_t} [V_t(i_t) + d_{t+1}(i_t, i_{t+1})],
  i_{t+1} = 1, 2, \ldots, M;  t = 1, \ldots, T-2
Finally
V_T = \min_{i_T} \min_{i_{T-1}} [V_{T-1}(i_{T-1}) - \ln(p_{i_{T-1} i_T} \theta_{i_T}(y_T))] = \min_{i_1, i_2, \ldots, i_T} U(i_1, i_2, \ldots, i_T)
Summary of calculations of Viterbi Path
1. V_1(i_1) = -\ln(\pi_{i_1}^0 \theta_{i_1}(y_1)),  i_1 = 1, 2, \ldots, M
2. V_{t+1}(i_{t+1}) = \min_{i_t} [V_t(i_t) - \ln(p_{i_t i_{t+1}} \theta_{i_{t+1}}(y_{t+1}))],  i_{t+1} = 1, 2, \ldots, M;  t = 1, \ldots, T-2
3. V_T = \min_{i_T} \min_{i_{T-1}} [V_{T-1}(i_{T-1}) - \ln(p_{i_{T-1} i_T} \theta_{i_T}(y_T))] = \min_{i_1, i_2, \ldots, i_T} U(i_1, i_2, \ldots, i_T)
The Viterbi path itself is recovered by recording, at each step, the minimizing i_t and backtracking from the final minimum.
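The three-step summary translates directly into code. A sketch with the same invented two-state, three-symbol toy model used earlier, carrying out the minimization of U with backpointers:

```python
import math

# Viterbi as minimisation of U(i1..iT) = -ln P[X = i, Y = y].
pi0 = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
theta = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

def viterbi(y):
    """Return (min U, optimal state path), using the V_t recursion."""
    M = len(pi0)
    V = [-math.log(pi0[i] * theta[i][y[0]]) for i in range(M)]   # step 1: V_1(i)
    back = []                                                    # argmin pointers
    for t in range(1, len(y)):
        newV, ptr = [], []
        for j in range(M):
            # step 2: V_{t+1}(j) = min_i [V_t(i) - ln(p_ij * theta_j(y_{t+1}))]
            costs = [V[i] - math.log(P[i][j] * theta[j][y[t]]) for i in range(M)]
            best = min(range(M), key=costs.__getitem__)
            newV.append(costs[best]); ptr.append(best)
        V, back = newV, back + [ptr]
    j = min(range(M), key=V.__getitem__)            # step 3: final minimum
    path = [j]
    for ptr in reversed(back):                      # backtrack the pointers
        j = ptr[j]; path.insert(0, j)
    return min(V), path

u, path = viterbi([0, 2, 1])
```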
An alternative approach to prediction of states
from the observations and the model:
It can be shown that:
P[X_t = i_t | Y = y] = \gamma_t(i_t) = \frac{\alpha_t(i_t) \beta_t(i_t)}{\sum_{i_T} \alpha_T(i_T)} = \frac{\alpha_t(i_t) \beta_t(i_t)}{\sum_{i_t} \alpha_t(i_t) \beta_t(i_t)}
Forward Probabilities
\alpha_t(i_t) = P[Y^{(t)} = y^{(t)}, X_t = i_t]
1. \alpha_1(i_1) = \pi_{i_1}^0 \theta_{i_1}(y_1)
2. \alpha_{t+1}(i_{t+1}) = \sum_{i_t} \alpha_t(i_t) p_{i_t i_{t+1}} \theta_{i_{t+1}}(y_{t+1})
Backward Probabilities
\beta_t(i_t) = P[Y^{*(t)} = y^{*(t)} | X_t = i_t]
1. \beta_{T-1}(i_{T-1}) = \sum_{i_T} p_{i_{T-1} i_T} \theta_{i_T}(y_T)
2. \beta_{t-1}(i_{t-1}) = \sum_{i_t} \beta_t(i_t) p_{i_{t-1} i_t} \theta_{i_t}(y_t)
HMM generator (normal).xls
Estimation of Parameters of a Hidden
Markov Model
If both the sequence of observations Y_1 = y_1, Y_2 = y_2, ..., Y_T = y_T and the sequence of states X_1 = i_1, X_2 = i_2, ..., X_T = i_T are observed, then the likelihood is given by:
L(\pi^0, \Pi, \theta) = \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
and the log-likelihood is given by:
l(\pi^0, \Pi, \theta) = \ln L(\pi^0, \Pi, \theta)
= \ln \pi_{i_1}^0 + \ln \theta_{i_1}(y_1) + \ln p_{i_1 i_2} + \ln \theta_{i_2}(y_2) + \ln p_{i_2 i_3} + \ln \theta_{i_3}(y_3) + \cdots + \ln p_{i_{T-1} i_T} + \ln \theta_{i_T}(y_T)
= \sum_{i=1}^M f_i^0 \ln \pi_i^0 + \sum_{i=1}^M \sum_{j=1}^M f_{ij} \ln p_{ij} + \sum_{i=1}^M \sum_{y(i)} \ln \theta_i(y)
where
f_i^0 = the number of times state i occurs as the first state,
f_{ij} = the number of times state i changes to state j, and
\theta_i(y) = f(y | \theta_i) (or p(y | \theta_i) in the discrete case); the sum \sum_{y(i)} is over all observations y_t where X_t = i.
In this case the maximum likelihood estimates are:
\hat{\pi}_i^0 = f_i^0 / 1,  \hat{p}_{ij} = f_{ij} / \sum_{j=1}^M f_{ij}, and
\hat{\theta}_i = the MLE of \theta_i computed from the observations y_t where X_t = i.
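With fully observed states the estimates are just normalized counts. A minimal sketch for the transition estimates, using a made-up two-state sequence:

```python
from collections import Counter

# Closed-form MLE of the transition matrix from an observed state sequence:
# p_hat_ij = f_ij / sum_j f_ij (transition counts normalised by row).
states = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0]     # invented example sequence

def mle_transitions(states, M):
    counts = Counter(zip(states, states[1:]))   # f_ij = count of i -> j
    P_hat = []
    for i in range(M):
        row_total = sum(counts[(i, j)] for j in range(M))
        P_hat.append([counts[(i, j)] / row_total for j in range(M)])
    return P_hat

P_hat = mle_transitions(states, 2)
```

The emission estimates \hat{\theta}_i are obtained analogously from the observations assigned to each state (e.g. the sample mean and variance in the normal case).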
MLE (states unknown)
If only the sequence of observations Y_1 = y_1, Y_2 = y_2, ..., Y_T = y_T is observed, then the likelihood is given by:
L(\pi^0, \Pi, \theta) = \sum_{i_1, i_2, \ldots, i_T} \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
• It is difficult to find the maximum likelihood estimates directly from the likelihood function.
• The techniques that are used are:
1. The Segmental K-means Algorithm
2. The Baum-Welch (E-M) Algorithm
The Segmental K-means
Algorithm
In this method the parameters \lambda = (\pi^0, \Pi, \theta) are adjusted to maximize
L(\pi^0, \Pi, \theta | y, i) = L(\lambda | y, i) = \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
where i = (i_1, i_2, \ldots, i_T) is the Viterbi path.
Consider this with the special case:
Case: The observations {Y_1, Y_2, ..., Y_T} are continuous multivariate Normal with mean vector \mu_i and covariance matrix \Sigma_i when X_t = i, i.e.
\theta_i(y_t) = \frac{1}{(2\pi)^{p/2} |\Sigma_i|^{1/2}} \exp\left[-\tfrac{1}{2}(y_t - \mu_i)' \Sigma_i^{-1} (y_t - \mu_i)\right]
1. Pick arbitrarily M centroids a_1, a_2, ..., a_M. Assign each of the T observations y_t (kT if k multiple realizations are observed) to a state i_t by determining:
\min_i \|y_t - a_i\|
2. Then
\hat{\pi}_i^0 = (number of times i_1 = i) / k
\hat{p}_{ij} = (number of transitions from i to j) / (number of transitions from i)
3. And
\hat{\mu}_i = \frac{1}{N_i} \sum_{i_t = i} y_t,  \hat{\Sigma}_i = \frac{1}{N_i} \sum_{i_t = i} (y_t - \hat{\mu}_i)(y_t - \hat{\mu}_i)'
4. Calculate the Viterbi path (i_1, i_2, ..., i_T) based on the parameters of steps 2 and 3.
5. If there is a change in the sequence (i_1, i_2, ..., i_T), repeat steps 2 to 4.
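Steps 1-5 can be sketched for the simplest setting: one realization, scalar observations, per-state normal emissions. The data, centroids, and the flat treatment of the initial distribution inside the Viterbi step are all assumptions made for this illustration:

```python
import math

y = [0.1, -0.2, 0.3, 5.1, 4.8, 5.3, 0.0, 0.2, 4.9, 5.0]   # invented 1-D data
M = 2

def estimate(y, labels):
    """Steps 2-3: transition, mean and variance estimates from a labelling."""
    trans = [[1e-9] * M for _ in range(M)]        # tiny floor avoids log(0)
    for a, b in zip(labels, labels[1:]):
        trans[a][b] += 1
    P = [[c / sum(row) for c in row] for row in trans]
    mu, var = [], []
    for i in range(M):
        pts = [v for v, l in zip(y, labels) if l == i]
        m = sum(pts) / len(pts)
        mu.append(m)
        var.append(max(sum((v - m) ** 2 for v in pts) / len(pts), 1e-6))
    return P, mu, var

def viterbi(y, P, mu, var):
    """Step 4: most probable labelling under the current parameters."""
    def nll(i, v):                                # -ln of the normal density
        return 0.5 * math.log(2 * math.pi * var[i]) + (v - mu[i]) ** 2 / (2 * var[i])
    V = [nll(i, y[0]) for i in range(M)]          # flat initial distribution assumed
    back = []
    for v in y[1:]:
        newV, ptr = [], []
        for j in range(M):
            costs = [V[i] - math.log(P[i][j]) + nll(j, v) for i in range(M)]
            k = min(range(M), key=costs.__getitem__)
            newV.append(costs[k]); ptr.append(k)
        V, back = newV, back + [ptr]
    j = min(range(M), key=V.__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]; path.insert(0, j)
    return path

centroids = [0.0, 5.0]                            # step 1: arbitrary centroids
labels = [min(range(M), key=lambda i: abs(v - centroids[i])) for v in y]
for _ in range(10):                               # steps 2-5: iterate to a fixed point
    P, mu, var = estimate(y, labels)
    new_labels = viterbi(y, P, mu, var)
    if new_labels == labels:
        break                                     # step 5: no change, stop
    labels = new_labels
```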
The Baum-Welch (E-M)
Algorithm
• The E-M algorithm was designed originally
to handle “Missing observations”.
• In this case the missing observations are the
states {X1, X2, ... , XT}.
• Assuming a model, the states are estimated
by finding their expected values under this
model. (The E part of the E-M algorithm).
• With these values the model is estimated by
Maximum Likelihood Estimation (The M
part of the E-M algorithm).
• The process is repeated until the estimated
model converges.
The E-M Algorithm
f Y, X θ  LY, X, θ denote the joint
Let
distribution of Y,X.
Consider the function:
Qθ, θ  EX ln LY, X, θ Y, θ
(1)

Starting with an initial estimate of θ θ  .
A sequence of estimates θ(m)  are formed by
finding θ  θ
to maximize Q θ, θ( m ) 
( m 1)
with respect to θ .
 
The sequence of estimates θ(m)
converge to a local maximum of the
likelihood
LY, θ  f Y θ .
Example: Sampling from Mixtures
Let y_1, y_2, ..., y_n denote a sample from the density:
f(y | \pi_1, \ldots, \pi_m, \theta_1, \ldots, \theta_m) = \pi_1 g(y | \theta_1) + \pi_2 g(y | \theta_2) + \cdots + \pi_m g(y | \theta_m)
where \pi_1 + \pi_2 + \cdots + \pi_m = 1 and g(y | \theta_i) is known except for \theta_i.
Suppose that m = 2 and let x_1, x_2, ..., x_n denote independent random variables taking on the value 1 with probability \pi and 0 with probability 1 - \pi.
Suppose that y_i comes from the density
f(y | \pi, \theta_1, \theta_2) = x_i g(y | \theta_1) + (1 - x_i) g(y | \theta_2)
We will also assume that g(y | \theta_i) is normal with mean \mu_i and standard deviation \sigma_i.
Thus the joint distribution of x_1, x_2, ..., x_n and y_1, y_2, ..., y_n is:
f(y, x | \pi, \mu_1, \mu_2, \sigma_1, \sigma_2)
= \prod_{i=1}^n \pi^{x_i} (1 - \pi)^{1 - x_i} \left[\frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}}\right]^{x_i} \left[\frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(y_i - \mu_2)^2}{2\sigma_2^2}}\right]^{1 - x_i}
= L(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 | y, x)
so that
l(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 | y, x) = \ln L(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 | y, x)
= -\frac{n}{2}\ln 2\pi + \sum_{i=1}^n x_i\left[\ln \pi - \ln \sigma_1 - \frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right] + \sum_{i=1}^n (1 - x_i)\left[\ln(1 - \pi) - \ln \sigma_2 - \frac{(y_i - \mu_2)^2}{2\sigma_2^2}\right]
Then
\frac{\partial l(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 | y, x)}{\partial \pi} = \sum_{i=1}^n \left[\frac{x_i}{\pi} - \frac{1 - x_i}{1 - \pi}\right] = 0
or
(1 - \pi)\sum_{i=1}^n x_i = \pi\left(n - \sum_{i=1}^n x_i\right)
Hence
\hat{\pi} = \frac{1}{n}\sum_{i=1}^n x_i
Similarly
\frac{\partial l(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 | y, x)}{\partial \mu_1} = \sum_{i=1}^n x_i \frac{y_i - \mu_1}{\sigma_1^2} = 0
giving \hat{\mu}_1 = \sum_{i=1}^n x_i y_i / \sum_{i=1}^n x_i, with analogous equations for \mu_2, \sigma_1 and \sigma_2. In the E-step the unobserved x_i is replaced by its conditional expectation
E[x_i | y_i] = \frac{\pi \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}}}{\pi \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}} + (1 - \pi) \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(y_i - \mu_2)^2}{2\sigma_2^2}}}
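The alternation between the expectation of x_i and the closed-form maximizers can be sketched for the two-component normal mixture. The data and starting values are invented for illustration:

```python
import math

y = [-1.2, -0.8, -1.0, -0.9, 3.1, 2.8, 3.0, 3.2]   # made-up sample

def phi(v, m, s):
    """Normal density with mean m and standard deviation s."""
    return math.exp(-(v - m) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

def em(y, iters=50):
    pi, m1, m2, s1, s2 = 0.5, min(y), max(y), 1.0, 1.0   # crude starting values
    n = len(y)
    for _ in range(iters):
        # E-step: x_hat_i = E[x_i | y_i] = pi*phi1 / (pi*phi1 + (1-pi)*phi2)
        x = [pi * phi(v, m1, s1) /
             (pi * phi(v, m1, s1) + (1 - pi) * phi(v, m2, s2)) for v in y]
        # M-step: the closed-form solutions of the likelihood equations
        pi = sum(x) / n
        m1 = sum(xi * v for xi, v in zip(x, y)) / sum(x)
        m2 = sum((1 - xi) * v for xi, v in zip(x, y)) / (n - sum(x))
        s1 = math.sqrt(sum(xi * (v - m1) ** 2 for xi, v in zip(x, y)) / sum(x))
        s2 = math.sqrt(sum((1 - xi) * (v - m2) ** 2 for xi, v in zip(x, y)) / (n - sum(x)))
        s1, s2 = max(s1, 1e-3), max(s2, 1e-3)        # guard against collapse
    return pi, m1, m2, s1, s2

pi_hat, m1_hat, m2_hat, s1_hat, s2_hat = em(y)
```

With well-separated clusters the responsibilities become essentially 0/1 and the estimates settle on the two cluster means.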
In the case of an HMM the log-likelihood is given by:
l(\pi^0, \Pi, \theta) = \ln L(\pi^0, \Pi, \theta)
= \ln \pi_{i_1}^0 + \ln \theta_{i_1}(y_1) + \ln p_{i_1 i_2} + \ln \theta_{i_2}(y_2) + \ln p_{i_2 i_3} + \ln \theta_{i_3}(y_3) + \cdots + \ln p_{i_{T-1} i_T} + \ln \theta_{i_T}(y_T)
= \sum_{i=1}^M f_i^0 \ln \pi_i^0 + \sum_{i=1}^M \sum_{j=1}^M f_{ij} \ln p_{ij} + \sum_{i=1}^M \sum_{y(i)} \ln \theta_i(y)
where
f_i^0 = the number of times state i occurs as the first state,
f_{ij} = the number of times state i changes to state j, and
\theta_i(y) = f(y | \theta_i) (or p(y | \theta_i) in the discrete case); the sum \sum_{y(i)} is over all observations y_t where X_t = i.
Recall
\gamma_t(i) = P[X_t = i | Y = y] = \frac{\alpha_t(i)\beta_t(i)}{\sum_j \alpha_T(j)} = \frac{\alpha_t(i)\beta_t(i)}{\sum_j \alpha_t(j)\beta_t(j)}
and
\sum_{t=1}^{T-1} \gamma_t(i) = the expected number of transitions from state i.
Let
\xi_t(i, j) = P[X_t = i, X_{t+1} = j | Y = y] = \frac{P[X_t = i, X_{t+1} = j, Y = y]}{P[Y = y]}
= \frac{P[X_t = i, Y^{(t)} = y^{(t)}, X_{t+1} = j, Y_{t+1} = y_{t+1}, Y^{*(t+1)} = y^{*(t+1)}]}{P[Y = y]}
= \frac{\alpha_t(i)\, p_{ij}\, \theta_j(y_{t+1})\, \beta_{t+1}(j)}{\sum_j \alpha_T(j)}
and
\sum_{t=1}^{T-1} \xi_t(i, j) = the expected number of transitions from state i to state j.
The E-M Re-estimation Formulae
Case 1: The observations {Y_1, Y_2, ..., Y_T} are discrete with K possible values and \theta_i(y) = P[Y_t = y | X_t = i]. Then:
\hat{\pi}_i^0 = \gamma_1(i)
\hat{p}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, and
\hat{\theta}_i(y) = \frac{\sum_{t=1, y_t = y}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}
Case 2: The observations {Y_1, Y_2, ..., Y_T} are continuous multivariate Normal with mean vector \mu_i and covariance matrix \Sigma_i when X_t = i, i.e.
\theta_i(y_t) = \frac{1}{(2\pi)^{p/2} |\Sigma_i|^{1/2}} \exp\left[-\tfrac{1}{2}(y_t - \mu_i)' \Sigma_i^{-1} (y_t - \mu_i)\right]
Then:
\hat{\pi}_i^0 = \gamma_1(i)
\hat{p}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, and
\hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, y_t}{\sum_{t=1}^{T} \gamma_t(i)},  \hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, (y_t - \hat{\mu}_i)(y_t - \hat{\mu}_i)'}{\sum_{t=1}^{T} \gamma_t(i)}
Measuring distance between two HMM’s
Let
\lambda^{(1)} = (\pi^{0(1)}, \Pi^{(1)}, \theta^{(1)}) and \lambda^{(2)} = (\pi^{0(2)}, \Pi^{(2)}, \theta^{(2)})
denote the parameters of two different HMM models. We now consider defining a distance between these two models.
The Kullback-Leibler distance
Consider the two discrete distributions p^{(1)}(y) and p^{(2)}(y) (f^{(1)}(y) and f^{(2)}(y) in the continuous case), then define
I(p^{(1)}, p^{(2)}) = \sum_y \ln\left[\frac{p^{(1)}(y)}{p^{(2)}(y)}\right] p^{(1)}(y) = E_{p^{(1)}}\left[\ln p^{(1)}(y) - \ln p^{(2)}(y)\right]
and in the continuous case:
I(f^{(1)}, f^{(2)}) = \int \ln\left[\frac{f^{(1)}(y)}{f^{(2)}(y)}\right] f^{(1)}(y)\, dy = E_{f^{(1)}}\left[\ln f^{(1)}(y) - \ln f^{(2)}(y)\right]
These measures of distance between the two distributions are not symmetric but can be made symmetric by the following:
I_s(p^{(1)}, p^{(2)}) = \frac{I(p^{(1)}, p^{(2)}) + I(p^{(2)}, p^{(1)})}{2}
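For discrete distributions the definitions translate directly. A minimal sketch, with two invented example distributions:

```python
import math

def kl(p, q):
    """I(p, q) = sum_y p(y) * ln(p(y) / q(y))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """I_s(p, q) = (I(p, q) + I(q, p)) / 2."""
    return 0.5 * (kl(p, q) + kl(q, p))

p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]
d = symmetric_kl(p1, p2)
```

The guard `if pi > 0` uses the convention 0 ln 0 = 0; the divergence is infinite if q assigns zero probability where p does not.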
In the case of a Hidden Markov model,
p^{(i)}(y) = p(y | \lambda^{(i)}) = p(y | \pi^{0(i)}, \Pi^{(i)}, \theta^{(i)}) = \sum_{i_1, \ldots, i_T} p(y, i | \pi^{0(i)}, \Pi^{(i)}, \theta^{(i)})
where
p(y, i | \pi^0, \Pi, \theta) = \pi_{i_1}^0 \theta_{i_1}(y_1) p_{i_1 i_2} \theta_{i_2}(y_2) p_{i_2 i_3} \theta_{i_3}(y_3) \cdots p_{i_{T-1} i_T} \theta_{i_T}(y_T)
The computation of I(p^{(1)}, p^{(2)}) in this case is formidable.
Juang and Rabiner distance


Let YT(i )  Y1(i ) , Y2(i ) ,, YT(i ) denote a sequence
of observations generated from the HMM
with parameters:
i 
i  i 
0i 
λ  π ,Π ,θ




Let i*(i ) y   i1(i ) y , i2(i ) y ,, iT(i ) y 
denote the optimal (Viterbi) sequence of states
assuming HMM model
.

λ i   π0i  , Πi  , θi 

Then define:


def
D λ 1 , λ 2  

  
    
1
lim ln p YT(1) , i*(1) YT(1) λ 1  ln p YT(1) , i*( 2) YT(1) λ 2 
T  T
and


 
1 2 
2  1
D
λ
,
λ

D
λ
,λ
1 2 
Ds λ , λ 
2


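The limit can be approximated by Monte Carlo: generate a long sequence from model 1 and compare the per-symbol Viterbi log-probabilities under each model. Both toy models below are invented, and a finite T only approximates the limit:

```python
import math
import random

model1 = dict(pi0=[0.5, 0.5], P=[[0.9, 0.1], [0.1, 0.9]],
              theta=[[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])   # peaked, invented
model2 = dict(pi0=[0.5, 0.5], P=[[0.5, 0.5], [0.5, 0.5]],
              theta=[[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]])   # near-uniform, invented

def generate(m, T, rng):
    """Sample an observation sequence of length T from model m."""
    s = rng.choices(range(2), weights=m["pi0"])[0]
    y = []
    for _ in range(T):
        y.append(rng.choices(range(3), weights=m["theta"][s])[0])
        s = rng.choices(range(2), weights=m["P"][s])[0]
    return y

def viterbi_logprob(m, y):
    """max over paths of ln p(y, i | m), via the V_t recursion."""
    V = [math.log(m["pi0"][i] * m["theta"][i][y[0]]) for i in range(2)]
    for v in y[1:]:
        V = [max(V[i] + math.log(m["P"][i][j] * m["theta"][j][v])
                 for i in range(2)) for j in range(2)]
    return max(V)

rng = random.Random(1)
T = 2000
y = generate(model1, T, rng)
D12 = (viterbi_logprob(model1, y) - viterbi_logprob(model2, y)) / T
```

Averaging D12 with the analogous D21 (data generated from model 2) gives the symmetrized D_s.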