Forward-backward algorithm

LING 572
Fei Xia
02/23/06
Outline
• Forward and backward probability
• Expected counts and update formulae
• Relation with EM
HMM
• An HMM is a tuple $(S, \Sigma, \Pi, A, B)$:
– A set of states $S = \{s_1, s_2, \dots, s_N\}$.
– A set of output symbols $\Sigma = \{w_1, \dots, w_M\}$.
– Initial state probabilities $\Pi = \{\pi_i\}$.
– State transition probabilities $A = \{a_{ij}\}$.
– Symbol emission probabilities $B = \{b_{ijk}\}$.
• State sequence: $X_1 \dots X_{T+1}$
• Output sequence: $o_1 \dots o_T$
Constraints

$\sum_{i=1}^{N} \pi_i = 1$

$\sum_{j=1}^{N} a_{ij} = 1 \quad \forall i$

$\sum_{k=1}^{M} b_{ijk} = 1 \quad \forall i, j$

$\sum_{j=1}^{N} \sum_{k=1}^{M} a_{ij} \, b_{ijk} = 1 \quad \forall i$
Decoding
• Given the observation $O_{1,T} = o_1 \dots o_T$, find the state sequence $X_{1,T+1} = X_1 \dots X_{T+1}$ that maximizes $P(X_{1,T+1} \mid O_{1,T})$.

[Trellis diagram: states $X_1, X_2, \dots, X_T, X_{T+1}$, with outputs $o_1, o_2, \dots, o_T$ emitted along the arcs]

→ Viterbi algorithm
Notation
• A sentence: $O_{1,T} = o_1 \dots o_T$; T is the sentence length
• The state sequence: $X_{1,T+1} = X_1 \dots X_{T+1}$
• t: time t, ranging from 1 to T+1
• $X_t$: the state at time t
• i, j: states $s_i$, $s_j$
• k: word $w_k$ in the vocabulary
Forward and backward probabilities

Forward probability
The probability of producing $o_1 \dots o_{t-1}$ while ending up in state $s_i$:

$\alpha_i(t) \stackrel{\text{def}}{=} P(O_{1,t-1}, X_t = i)$
Calculating forward probability
Initialization:

$\alpha_i(1) = \pi_i$

Induction:

$\alpha_j(t+1) = P(O_{1,t}, X_{t+1} = j)$
$= \sum_i P(O_{1,t}, X_t = i, X_{t+1} = j)$
$= \sum_i P(O_{1,t-1}, X_t = i) \cdot P(o_t, X_{t+1} = j \mid O_{1,t-1}, X_t = i)$
$= \sum_i P(O_{1,t-1}, X_t = i) \cdot P(o_t, X_{t+1} = j \mid X_t = i)$
$= \sum_i \alpha_i(t) \, a_{ij} \, b_{ij o_t}$
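The forward recursion can be sketched in NumPy. The toy parameters below (2 states, 2 symbols) are hypothetical, chosen only to make the recursion concrete; `B[i, j, k]` holds the arc-emission probability $b_{ijk}$:

```python
import numpy as np

# Hypothetical toy arc-emission HMM: N = 2 states, M = 2 output symbols.
pi = np.array([0.6, 0.4])                    # pi_i: initial state probabilities
A = np.array([[0.7, 0.3],                    # A[i, j] = a_ij
              [0.4, 0.6]])
B = np.array([[[0.9, 0.1], [0.2, 0.8]],      # B[i, j, k] = b_ijk
              [[0.5, 0.5], [0.3, 0.7]]])     # (symbol k emitted on arc i -> j)

O = [0, 1, 0]                                # observations o_1 .. o_T (symbol indices)
T, N = len(O), len(pi)

# alpha[t-1, i] stores alpha_i(t) = P(O_{1,t-1}, X_t = i), for t = 1 .. T+1
alpha = np.zeros((T + 1, N))
alpha[0] = pi                                # base case: alpha_i(1) = pi_i
for t in range(T):
    # induction: alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{ij o_t}
    alpha[t + 1] = alpha[t] @ (A * B[:, :, O[t]])

prob_O = alpha[-1].sum()                     # P(O) = sum_i alpha_i(T+1)
```

Each induction step is a single vector-matrix product because $a_{ij} b_{ij o_t}$ can be precomputed as the elementwise product `A * B[:, :, O[t]]`.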
Backward probability
• The probability of producing the sequence $O_{t,T}$, given that at time t we are in state $s_i$:

$\beta_i(t) \stackrel{\text{def}}{=} P(O_{t,T} \mid X_t = i)$
Calculating backward probability
Initialization:

$\beta_i(T+1) = 1$

Induction:

$\beta_i(t) \stackrel{\text{def}}{=} P(O_{t,T} \mid X_t = i)$
$= \sum_j P(o_t, O_{t+1,T}, X_{t+1} = j \mid X_t = i)$
$= \sum_j P(o_t, X_{t+1} = j \mid X_t = i) \cdot P(O_{t+1,T} \mid X_t = i, X_{t+1} = j, o_t)$
$= \sum_j P(o_t, X_{t+1} = j \mid X_t = i) \cdot P(O_{t+1,T} \mid X_{t+1} = j)$
$= \sum_j \beta_j(t+1) \, a_{ij} \, b_{ij o_t}$
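The backward recursion mirrors the forward one, filled in from right to left. A sketch with the same hypothetical toy parameters as before:

```python
import numpy as np

# Hypothetical toy arc-emission HMM (illustrative numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])    # B[i, j, k] = b_ijk
O = [0, 1, 0]
T, N = len(O), len(pi)

# beta[t-1, i] stores beta_i(t) = P(O_{t,T} | X_t = i), for t = 1 .. T+1
beta = np.zeros((T + 1, N))
beta[T] = 1.0                               # base case: beta_i(T+1) = 1
for t in range(T - 1, -1, -1):
    # induction: beta_i(t) = sum_j a_ij * b_{ij o_t} * beta_j(t+1)
    beta[t] = (A * B[:, :, O[t]]) @ beta[t + 1]

prob_O = pi @ beta[0]                       # P(O) = sum_i pi_i * beta_i(1)
```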
Calculating the prob of the observation

$P(O) = \sum_{i=1}^{N} \alpha_i(T+1)$

$P(O) = \sum_{i=1}^{N} \pi_i \, \beta_i(1)$

$P(O) = \sum_{i=1}^{N} P(O, X_t = i) = \sum_{i=1}^{N} \alpha_i(t) \, \beta_i(t)$  (for any t)
Estimating parameters
• The prob of traversing a certain arc at time t given O (denoted by $p_t(i,j)$ in M&S):

$\gamma_{ij}(t) = P(X_t = i, X_{t+1} = j \mid O) = \frac{P(X_t = i, X_{t+1} = j, O)}{P(O)} = \frac{\alpha_i(t) \, a_{ij} \, b_{ij o_t} \, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t) \, \beta_m(t)}$

• The prob of being at state i at time t given O:

$\gamma_i(t) = P(X_t = i \mid O) = \sum_{j=1}^{N} P(X_t = i, X_{t+1} = j \mid O) = \sum_{j=1}^{N} \gamma_{ij}(t)$
Expected counts
Sum over the time index:
• Expected # of transitions from state i to j in O:

$\sum_{t=1}^{T} \gamma_{ij}(t)$

• Expected # of transitions from state i in O:

$\sum_{t=1}^{T} \gamma_i(t) = \sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t) = \sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)$
Update parameters

$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } t = 1 = \gamma_i(1)$

$\hat{a}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_i(t)} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}$

$\hat{b}_{ijk} = \frac{\text{expected \# of transitions from } i \text{ to } j \text{ with } k \text{ observed}}{\text{expected \# of transitions from } i \text{ to } j} = \frac{\sum_{t=1}^{T} \delta(o_t, w_k) \, \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$
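The update formulae translate directly into array operations over the $\gamma$ tables. A sketch with hypothetical toy parameters:

```python
import numpy as np

# Hypothetical toy arc-emission HMM (illustrative numbers).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])    # B[i, j, k] = b_ijk
O = [0, 1, 0]
T, N, M = len(O), len(pi), B.shape[2]

alpha = np.zeros((T + 1, N)); alpha[0] = pi
beta = np.zeros((T + 1, N)); beta[T] = 1.0
for t in range(T):
    alpha[t + 1] = alpha[t] @ (A * B[:, :, O[t]])
for t in range(T - 1, -1, -1):
    beta[t] = (A * B[:, :, O[t]]) @ beta[t + 1]
prob_O = alpha[T].sum()

gamma_ij = np.array([(alpha[t][:, None] * A * B[:, :, O[t]]
                      * beta[t + 1][None, :]) / prob_O for t in range(T)])
gamma_i = gamma_ij.sum(axis=2)

pi_hat = gamma_i[0]                                   # gamma_i(1)
A_hat = gamma_ij.sum(axis=0) / gamma_i.sum(axis=0)[:, None]
B_hat = np.zeros((N, N, M))
obs = np.asarray(O)
for k in range(M):
    # numerator: sum_t delta(o_t, w_k) * gamma_ij(t)
    B_hat[:, :, k] = gamma_ij[obs == k].sum(axis=0) / gamma_ij.sum(axis=0)
```

By construction $\hat{\pi}$ sums to 1, each row of $\hat{A}$ sums to 1, and each $\hat{b}_{ij\cdot}$ sums to 1 over the vocabulary.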
Final formulae

$\gamma_{ij}(t) = \frac{\alpha_i(t) \, a_{ij} \, b_{ij o_t} \, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t) \, \beta_m(t)}$

$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}$

$\hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \delta(o_t, w_k) \, \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$
Emission probabilities
Arc-emission HMM:

$b_{ijk} = \frac{\text{expected \# of transitions from } i \text{ to } j \text{ with } k \text{ observed}}{\text{expected \# of transitions from } i \text{ to } j} = \frac{\sum_{t=1}^{T} \delta(o_t, w_k) \, \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$
The inner loop for the forward-backward algorithm
Given an input sequence and $(S, \Sigma, \Pi, A, B)$:
1. Calculate forward probability:
• Base case: $\alpha_i(1) = \pi_i$
• Recursive case: $\alpha_j(t+1) = \sum_i \alpha_i(t) \, a_{ij} \, b_{ij o_t}$
2. Calculate backward probability:
• Base case: $\beta_i(T+1) = 1$
• Recursive case: $\beta_i(t) = \sum_j \beta_j(t+1) \, a_{ij} \, b_{ij o_t}$
3. Calculate expected counts:
$\gamma_{ij}(t) = \frac{\alpha_i(t) \, a_{ij} \, b_{ij o_t} \, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t) \, \beta_m(t)}$
4. Update the parameters:
$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}$
$\hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \delta(o_t, w_k) \, \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$
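The four steps can be combined into one function. This is a minimal sketch, not a full training loop (no convergence test, single sequence only); `forward_backward_step` and the toy parameters used below are illustrative, not from the slides:

```python
import numpy as np

def forward_backward_step(pi, A, B, O):
    """One inner-loop iteration for an arc-emission HMM: forward pass,
    backward pass, expected counts, parameter update.  Returns the updated
    parameters and P(O) under the *old* parameters."""
    T, N, M = len(O), len(pi), B.shape[2]
    obs = np.asarray(O)
    # Step 1: forward probabilities
    alpha = np.zeros((T + 1, N)); alpha[0] = pi
    for t in range(T):
        alpha[t + 1] = alpha[t] @ (A * B[:, :, O[t]])
    # Step 2: backward probabilities
    beta = np.zeros((T + 1, N)); beta[T] = 1.0
    for t in range(T - 1, -1, -1):
        beta[t] = (A * B[:, :, O[t]]) @ beta[t + 1]
    prob_O = alpha[T].sum()
    # Step 3: expected counts gamma_ij(t)
    gamma_ij = np.array([alpha[t][:, None] * A * B[:, :, O[t]]
                         * beta[t + 1][None, :] for t in range(T)]) / prob_O
    gamma_i = gamma_ij.sum(axis=2)
    # Step 4: update pi, A, B from the expected counts
    pi_hat = gamma_i[0]
    A_hat = gamma_ij.sum(axis=0) / gamma_i.sum(axis=0)[:, None]
    B_hat = np.stack([gamma_ij[obs == k].sum(axis=0) for k in range(M)],
                     axis=2) / gamma_ij.sum(axis=0)[:, :, None]
    return pi_hat, A_hat, B_hat, prob_O
```

Calling the step repeatedly re-estimates the model; as the Iterations slide states, the likelihood of the training data never decreases from one iteration to the next.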
Relation to EM
• An HMM is a PM (Product of Multinomials) model.
• The forward-backward algorithm is a special case of the EM algorithm for PM models.
• X (observed data): each data point is an $O_{1,T}$.
• Y (hidden data): the state sequence $X_{1,T+1}$.
• $\Theta$ (parameters): $a_{ij}$, $b_{ijk}$, $\pi_i$.
Relation to EM (cont)

$count(a_{ij}) = \sum_Y P(Y \mid X, \theta) \cdot count(X, Y, a_{ij})$
$= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1,T}, \theta) \cdot count(O_{1,T}, X_{1,T+1}, a_{ij})$
$= \sum_{t=1}^{T} P(X_t = i, X_{t+1} = j \mid O_{1,T}, \theta)$
$= \sum_{t=1}^{T} \gamma_{ij}(t)$

$count(b_{ijk}) = \sum_Y P(Y \mid X, \theta) \cdot count(X, Y, b_{ijk})$
$= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1,T}, \theta) \cdot count(O_{1,T}, X_{1,T+1}, b_{ijk})$
$= \sum_{t=1}^{T} P(X_t = i, X_{t+1} = j \mid O_{1,T}, \theta) \cdot \delta(o_t, w_k)$
$= \sum_{t=1}^{T} \gamma_{ij}(t) \, \delta(o_t, w_k)$
Iterations
• Each iteration provides values for all the parameters.
• The new model never decreases the likelihood of the training data:
$P(O \mid \hat{\theta}) \geq P(O \mid \theta)$
• The algorithm is not guaranteed to reach a global maximum.
Summary
• A way of estimating parameters for HMMs
– Define forward and backward probabilities, which can be calculated efficiently (DP).
– Given an initial parameter setting, we re-estimate the parameters at each iteration.
– The forward-backward algorithm is a special case of the EM algorithm for PM models.
Additional slides

Definitions so far
• The prob of producing $O_{1,t-1}$ and ending at state $s_i$ at time t:
$\alpha_i(t) \stackrel{\text{def}}{=} P(O_{1,t-1}, X_t = i)$
• The prob of producing the sequence $O_{t,T}$, given that at time t we are at state $s_i$:
$\beta_i(t) \stackrel{\text{def}}{=} P(O_{t,T} \mid X_t = i)$
• The prob of being at state i at time t given O:
$\gamma_i(t) = P(X_t = i \mid O) = \frac{P(X_t = i, O)}{P(O)} = \frac{\alpha_i(t) \, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t) \, \beta_j(t)}$
• The prob of traversing the arc from i to j at time t given O:
$\gamma_{ij}(t) = P(X_t = i, X_{t+1} = j \mid O) = \frac{\alpha_i(t) \, a_{ij} \, b_{ij o_t} \, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t) \, \beta_m(t)}$
$\gamma_i(t) = P(X_t = i \mid O) = \sum_{j=1}^{N} \gamma_{ij}(t)$
Emission probabilities
Arc-emission HMM:

$b_{ijk} = \frac{\text{expected \# of transitions from } i \text{ to } j \text{ with } k \text{ observed}}{\text{expected \# of transitions from } i \text{ to } j} = \frac{\sum_{t=1}^{T} \delta(o_t, w_k) \, \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$

State-emission HMM:

$b_{jk} = \frac{\text{expected \# of transitions to } j \text{ with } k \text{ observed}}{\text{expected \# of transitions to } j} = \frac{\sum_{t=1}^{T} \delta(o_t, w_k) \, \gamma_j(t)}{\sum_{t=1}^{T} \gamma_j(t)}$