A Tutorial of Hidden Markov Models
Shuxing Cheng
CS673 Project
Our talk is based on the following paper:
"A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" by Lawrence R. Rabiner.
Outline

- Markov models.
- Hidden Markov models (HMMs).
- Three basic problems and their solutions.
- More HMM types.
- Applications of HMMs.
Background of Markov models

A Markov model is a stochastic process satisfying the Markov property.

Markov property (memorylessness): the future is independent of the past given the present.

Three pieces of information define a Markov model:
- the parameter space,
- the state space,
- the state transition probabilities.

The one-step transition probability is the basis.
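As a small illustration (ours, not the paper's), simulating a chain from its one-step transition matrix only requires sampling the next state from the current state's row; the two-state "weather" chain and all numbers below are invented:

```python
import numpy as np

# A toy two-state chain (invented values): state 0 = "sunny", state 1 = "rainy".
# P[i, j] is the one-step probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate(P, start, steps, seed=None):
    """Sample a state path of length steps+1 from transition matrix P."""
    rng = np.random.default_rng(seed)
    path = [start]
    for _ in range(steps):
        # The Markov property: the next state depends only on the current one.
        path.append(int(rng.choice(len(P), p=P[path[-1]])))
    return path

path = simulate(P, start=0, steps=10, seed=0)
```

Each row of P is a probability distribution over next states, which is why it must sum to one.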
A Markov model example

[Figure: graphical representation of a nine-state chain over states A-I, with arcs labeled by their transition probabilities.]

Matrix representation (rows index the current state, columns the next state):

P =
       A    B    C    D    E    F    G    H    I
   A   0   0.5   0   0.5   0    0    0    0    0
   B   0    0    1    0    0    0    0    0    0
   C   0    0    0    0    0    1    0    0    0
   D   0    0    0    0    0    0    1    0    0
   E   0    1    0    0    0    0    0    0    0
   F   0    0    0    0    1    0    0    0    0
   G   0    0    0    0    0    0    0    1    0
   H   0    0    0   0.3   0   0.4   0    0   0.3
   I   0    0    0    0    0    0    0    0    1

Each row sums to one.
If the states are not observable!

- States are not observable.
- Observations are probabilistic functions of the states.
- State transitions are still probabilistic.

Use Hidden Markov Models (HMMs)!
The urn-and-ball problem

- Each state corresponds to a specific urn.
- A (ball) color probability distribution is defined for each state.
- The choice of urns is dictated by the state transition matrix.
MMs vs. HMMs

- MMs and HMMs represent different levels of knowledge about the "real" state.
- MMs use knowledge of the state history and the current state to predict future states.
- HMMs use evidence about historical states and evidence about the current state to predict future states.
Elements of an HMM

- N: the number of states.
- A: the state transition probability matrix, A = {a_ij}, where a_ij = P(q_{t+1} = S_j | q_t = S_i).
- M: the number of distinct observation symbols per state; we denote the individual symbols as V = {v_1, v_2, ..., v_M}.
- B: the observation symbol probability distribution in state j, B = {b_j(k)}, where b_j(k) = P(v_k at time t | q_t = S_j).
- π: the initial state distribution, π = {π_i}, where π_i = P(q_1 = S_i).

Two model parameters: N, M.
Three probability measures: A, B, π.
Build the observation sequence

Given the above five elements, we can generate an observation sequence O = O_1 O_2 ... O_T, where T is the number of observations, as follows:

1. Set t = 1.
2. Choose an initial state q_1 = S_i according to the initial state distribution π.
3. Choose O_t = v_k according to the symbol probability distribution in state q_t.
4. Move to a new state q_{t+1} = S_j according to the state transition probability distribution.
5. Set t = t + 1 and return to step 3 while t ≤ T.
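The five-step procedure above can be sketched in Python with NumPy; the integer encoding of states and symbols, and the toy parameter values, are our own illustration:

```python
import numpy as np

def sample_hmm(A, B, pi, T, seed=None):
    """Generate T observations (and the hidden states) from an HMM (A, B, pi).

    A:  (N, N) transition matrix,  A[i, j] = P(q_{t+1}=S_j | q_t=S_i)
    B:  (N, M) emission matrix,    B[j, k] = P(O_t=v_k | q_t=S_j)
    pi: (N,)   initial state distribution
    """
    rng = np.random.default_rng(seed)
    N, M = B.shape
    q = int(rng.choice(N, p=pi))                 # step 2: initial state
    states, obs = [], []
    for _ in range(T):                           # steps 3-5, while t <= T
        states.append(q)
        obs.append(int(rng.choice(M, p=B[q])))   # step 3: emit a symbol
        q = int(rng.choice(N, p=A[q]))           # step 4: transition
    return states, obs

# Toy parameters (invented for illustration).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
states, obs = sample_hmm(A, B, pi, T=5, seed=0)
```

`rng.choice(..., p=...)` performs the categorical sampling required in steps 2-4.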
Three basic problems of HMMs

1. Given the observation sequence O = O_1 O_2 ... O_T and a model λ, how do we efficiently compute P(O | λ), the probability of this observation sequence under the given model?
2. Given the observation sequence and the model, how do we choose a corresponding state sequence Q = q_1 q_2 ... q_T?
3. How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
The discussion of these three problems

Problem 1: how well a given model matches a given observation sequence.
Problem 2: uncovering the hidden part of the model.
Problem 3: optimizing the model parameters; in other words, training the HMM.
Solution to the first problem

A straightforward way:

P(O | λ) = Σ_Q P(O | Q, λ) P(Q | λ)

P(O | Q, λ) = b_{q_1}(O_1) · b_{q_2}(O_2) ··· b_{q_T}(O_T)

P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ··· a_{q_{T-1} q_T}

We assume statistical independence of the observations here.

There are N^T possible state sequences, and (2T − 1) calculations per state sequence, so this direct summation is infeasible for realistic N and T.
Forward procedure

Define the forward variable α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ).

We can compute it inductively:

Initialization: α_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N

Induction: α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(O_{t+1}),  1 ≤ t ≤ T−1, 1 ≤ j ≤ N

Termination: P(O | λ) = Σ_{i=1}^{N} α_T(i)

This requires on the order of N²T calculations.
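The forward procedure translates directly into code; a minimal sketch (our variable names, with states and symbols encoded as integers, and toy parameter values):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Compute P(O | λ) by the forward procedure in O(N^2 T) operations."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                            # termination

# Toy parameters (invented for illustration).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
p = forward(A, B, pi, [0, 1, 0])
```

The matrix product `alpha[t-1] @ A` computes all N inner sums Σ_i α_t(i) a_ij at once.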
Backward variable

Similarly to the forward variable, we can define the backward variable
β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ).

It can also be computed inductively:

Initialization: β_T(i) = 1,  1 ≤ i ≤ N

Induction: β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j),  t = T−1, T−2, ..., 1, 1 ≤ i ≤ N
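A matching sketch of the backward recursion (same toy parameters and integer encoding as our other sketches; the identity P(O | λ) = Σ_i π_i b_i(O_1) β_1(i) gives a useful consistency check against the forward result):

```python
import numpy as np

def backward(A, B, obs):
    """Compute all backward variables β_t(i) for a given observation sequence."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                     # initialization: β_T(i) = 1
    for t in range(T - 2, -1, -1):                     # induction, t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Toy parameters (invented for illustration).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0]
beta = backward(A, B, obs)
p = (pi * B[:, obs[0]] * beta[0]).sum()   # should equal the forward probability
```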
Solution to the second problem

Several solutions exist because different optimality criteria are possible.

One of the most popular is the Viterbi algorithm: find the single best state sequence.

Formally, for a given observation sequence O = O_1 O_2 ... O_T, define the quantity

δ_t(i) = max_{q_1, q_2, ..., q_{t−1}} P(q_1 q_2 ... q_{t−1}, q_t = S_i, O_1 O_2 ... O_t | λ)

δ_t(i) is the best score along a single path, at time t, which accounts for the first t observations and ends in state S_i.

By induction, we have

δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(O_{t+1})
The complete procedure

Initialization:
δ_1(i) = π_i b_i(O_1),  1 ≤ i ≤ N
ψ_1(i) = 0

Recursion:
δ_t(j) = max_{1 ≤ i ≤ N} [ δ_{t−1}(i) a_ij ] b_j(O_t),  2 ≤ t ≤ T, 1 ≤ j ≤ N
ψ_t(j) = argmax_{1 ≤ i ≤ N} [ δ_{t−1}(i) a_ij ],  2 ≤ t ≤ T, 1 ≤ j ≤ N

Termination:
P* = max_{1 ≤ i ≤ N} δ_T(i)
q_T* = argmax_{1 ≤ i ≤ N} δ_T(i)

Path backtracking:
q_t* = ψ_{t+1}(q_{t+1}*),  t = T−1, T−2, ..., 1
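The complete procedure maps directly to code; a sketch (our names, integer-coded states and symbols, toy parameter values):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the single best state sequence and its probability P*."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # δ_t(j): best score ending in state j
    psi = np.zeros((T, N), dtype=int)   # ψ_t(j): backpointers
    delta[0] = pi * B[:, obs[0]]                        # initialization
    for t in range(1, T):                               # recursion
        scores = delta[t - 1][:, None] * A              # scores[i, j] = δ_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                    # termination
    for t in range(T - 1, 0, -1):                       # path backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# Toy parameters (invented for illustration).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
path, p_star = viterbi(A, B, pi, [1, 1, 1])
```

Note the structure mirrors the forward procedure, with the sum over i replaced by a max (plus the backpointers ψ needed to recover the path).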
Solution to the third problem

Given any finite observation sequence as training data, there is no optimal way of estimating the model parameters.

Approaches that locally maximize P(O | λ) include:
- the Baum-Welch method (an instance of the EM, expectation-maximization, algorithm);
- gradient techniques.
Baum-Welch method

Define ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ). Then

ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j, O | λ) / P(O | λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)
          = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / [ Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) ]
Baum-Welch method (continued)

γ_t(i): the probability of being in state S_i at time t, given O and λ; note that γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j).

Σ_{t=1}^{T−1} γ_t(i): expected number of transitions from S_i.

Σ_{t=1}^{T−1} ξ_t(i, j): expected number of transitions from S_i to S_j.

Using these three quantities, we can give a method for re-estimating the parameters of an HMM.
Baum-Welch method (continued)

A set of reasonable re-estimation formulas for π, A, and B:

π̄_i = expected number of times in state S_i at time t = 1 = γ_1(i)

ā_ij = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i)
     = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)

b̄_j(k) = (expected number of times in state S_j observing symbol v_k) / (expected number of times in state S_j)
        = Σ_{t=1, O_t = v_k}^{T} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
Baum-Welch method (continued)

Given the current model λ = (A, B, π), the re-estimated model is λ̄ = (Ā, B̄, π̄).

It can be proved that either
- the initial model λ defines a critical point of the likelihood function, in which case λ̄ = λ, or
- model λ̄ is more likely than model λ, in the sense that P(O | λ̄) > P(O | λ).
Baum-Welch method (continued)

The above updating scheme can be derived by maximizing Baum's auxiliary function

Q(λ, λ̄) = Σ_Q P(Q | O, λ) log P(O, Q | λ̄)

over λ̄. It can be proved that maximization of Q(λ, λ̄) leads to increased likelihood:

max_{λ̄} Q(λ, λ̄)  ⇒  P(O | λ̄) ≥ P(O | λ)
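One full re-estimation step can be sketched as follows (a sketch for a single observation sequence, without the scaling that real implementations need for long sequences; names and toy values are ours):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step; returns the re-estimated (A, B, pi)."""
    T = len(obs)
    N, M = B.shape
    # Forward and backward variables.
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    pO = alpha[-1].sum()                      # P(O | λ)
    # ξ_t(i, j) and γ_t(i).
    xi = np.array([
        alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / pO
        for t in range(T - 1)
    ])
    gamma = alpha * beta / pO
    # Re-estimation formulas.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    obs = np.asarray(obs)
    B_new = np.stack([gamma[obs == k].sum(axis=0) for k in range(M)], axis=1)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new

# Toy parameters (invented for illustration).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
pi = np.array([0.6, 0.4])
A2, B2, pi2 = baum_welch_step(A, B, pi, [0, 1, 1, 0])
```

Iterating this step (λ ← λ̄) until P(O | λ) stops improving is the full training loop.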
More types of HMMs

Many different types of HMMs exist. We introduce one variation based on the underlying Markov model itself:
- the ergodic Markov model;
- the reducible Markov model.
Distance between HMMs

Given two HMMs, λ1 and λ2, what is a reasonable measure of the similarity of the two models?

We can use a measure of this kind:

D(λ1, λ2) = (1/T) [ log P(O | λ1) − log P(O | λ2) ]

What is the problem with this measure? It is nonsymmetric! We therefore introduce the following symmetrized measure:

D_s(λ1, λ2) = [ D(λ1, λ2) + D(λ2, λ1) ] / 2
Applications of HMMs

- Speech recognition.
- Bioinformatics.
Isolated word recognition

- For each word v in the vocabulary, we build an HMM λ_v.
- For each unknown word to be recognized, we compute P(O | λ_v) for every word model.
- We select the word whose model likelihood is highest.
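The recognition rule above is just an argmax over per-word model likelihoods, each computed with the forward procedure; a toy sketch (the two "word" models and all parameter values below are invented for illustration):

```python
import numpy as np

def forward(A, B, pi, obs):
    """P(O | λ) via the forward procedure."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Two toy word models (invented parameters), each λ_v = (A, B, pi).
models = {
    "yes": (np.array([[0.8, 0.2], [0.0, 1.0]]),
            np.array([[0.9, 0.1], [0.2, 0.8]]),
            np.array([1.0, 0.0])),
    "no":  (np.array([[0.5, 0.5], [0.0, 1.0]]),
            np.array([[0.1, 0.9], [0.7, 0.3]]),
            np.array([1.0, 0.0])),
}

def recognize(obs):
    """Pick the word whose model gives the observations the highest likelihood."""
    return max(models, key=lambda v: forward(*models[v], obs))
```

In a real recognizer the observations would come from a speech feature front end rather than being small integers, but the argmax-over-models structure is the same.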