cs621-lect38-39-baum-welch-hmm-training-2009

CS621: Artificial Intelligence
Pushpak Bhattacharyya
CSE Dept.,
IIT Bombay
Lecture 38-39: Baum Welch
Algorithm; HMM training
Baum Welch algorithm
 Training a Hidden Markov Model (not structure learning, i.e.,
the structure of the HMM is pre-given). This involves:
 Learning the probability values ONLY
Correspondence with PCFG:
 Not learning the production rules, but the probabilities associated with them
 The training algorithm for PCFGs is called the Inside-Outside algorithm
Key Intuition
[State diagram: two states q and r, with transitions between and within them labeled by the output symbols a and b]
Given: training sequence
Initialization: probability values
Compute: Pr(state seq | training seq) → get expected counts of transitions → compute rule probabilities
Approach: initialize the probabilities and recompute them… an EM-like approach
Building blocks: Probabilities to be used
1. Forward probability:
α_i(t) = P(W_{1,t-1}, S_t = s_i), t > 0
α_i(1) = 1.0 if s_i is the start state; 0 otherwise
P(W_{1,n}) = Σ_{i=1}^{T} P(W_{1,n}, S_{n+1} = s_i) = Σ_{i=1}^{T} α_i(n+1)
α_j(t+1) = Σ_{i=1}^{T} α_i(t) · P(s_i –W_t→ s_j)
[Trellis figure: outputs W_1, W_2, …, W_{n-1}, W_n emitted along states S_1, S_2, …, S_n, S_{n+1}]
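The forward recursion above can be sketched in Python for an arc-emission HMM (outputs on transitions, as in these slides). The dict-based representation keyed by (src, symbol, dst) is an illustrative choice; the example probabilities are the initial-guess HMM that appears on a later slide of this lecture:

```python
def forward(states, start, probs, w):
    """alpha[t][j] = P(W_{1,t}, S_{t+1} = s_j); alpha[0] encodes alpha_i(1)."""
    alpha = [{s: 0.0 for s in states} for _ in range(len(w) + 1)]
    alpha[0][start] = 1.0                      # alpha_i(1) = 1 iff s_i is the start state
    for t, sym in enumerate(w):                # extend the trellis one output symbol at a time
        for j in states:
            alpha[t + 1][j] = sum(alpha[t][i] * probs.get((i, sym, j), 0.0)
                                  for i in states)
    return alpha

# Arc probabilities P(s_i -w-> s_j): the initial-guess HMM from a later slide
probs = {('q', 'a', 'q'): 0.48, ('q', 'b', 'q'): 0.48,
         ('q', 'a', 'r'): 0.04, ('r', 'b', 'q'): 1.0}
alpha = forward(['q', 'r'], 'q', probs, 'ababa')
print(alpha[-1])   # alpha_i(n+1); summing over i gives P(W_{1,n})
```

Summing the last trellis column over all states gives P(W_{1,n}), as in the second equation above.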
Probabilities to be used, contd…
2. Backward probability:
β_i(t) = P(W_{t,n} | S_t = s_i), t ≥ 1
β_1(1) = P(W_{1,n} | S_1 = s_1) = P(W_{1,n})
β_i(n+1) = P(ε | S_{n+1} = s_i) = 1
β_i(t) = Σ_{j=1}^{T} P(s_i –W_t→ s_j) · β_j(t+1)
Exercise 1: Prove the following: P(W_{1,n}) = Σ_{j=1}^{T} α_j(t) · β_j(t), for any t
Start of Baum-Welch algorithm
[State diagram: two states q and r, with transitions labeled by the output symbols a and b]
String = abb aaa bbb aaa
Sequence of states with respect to input symbols:
o/p seq:   a, b, b, a, a, a, b, b, b, a, a, a
State seq: q –a→ r –b→ q –b→ q –a→ r –a→ q –a→ r –b→ q –b→ q –b→ q –a→ r –a→ q –a→ r
Calculating probabilities from table
P(q –a→ r) = 5/8
P(q –b→ q) = 3/8
In general:
P(s_i –w_k→ s_j) = c(s_i –w_k→ s_j) / Σ_{l=1}^{T} Σ_{m=1}^{A} c(s_i –w_m→ s_l)
where T = #states, A = #alphabet symbols

Table of counts:
Src   Dest   O/P   Count
q     r      a     5
q     q      b     3
r     q      a     3
r     q      b     2

Now if we have non-deterministic transitions, then multiple state sequences are possible for the given o/p sequence (refer to the previous slide's figure). Our aim is to find the expected counts in that case.
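The maximum-likelihood estimate above is just count normalization. A minimal sketch using the slide's table of counts, with arcs keyed (src, output, dst):

```python
from collections import Counter

# Table of counts from this slide
counts = Counter({('q', 'a', 'r'): 5, ('q', 'b', 'q'): 3,
                  ('r', 'a', 'q'): 3, ('r', 'b', 'q'): 2})

# Denominator: all arcs leaving the same source state
# (the double sum over l = 1..T and m = 1..A in the formula)
totals = Counter()
for (src, sym, dst), c in counts.items():
    totals[src] += c

probs = {arc: c / totals[arc[0]] for arc, c in counts.items()}
print(probs[('q', 'a', 'r')], probs[('q', 'b', 'q')])   # 5/8 and 3/8, as on the slide
```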
Interplay Between Two Equations
P(s_i –w_k→ s_j) = c(s_i –w_k→ s_j) / Σ_{l=1}^{T} Σ_{m=1}^{A} c(s_i –w_m→ s_l)
c(s_i –w_k→ s_j) = Σ_{S_{1,n+1}} P(S_{1,n+1} | W_{1,n}) · n(s_i –w_k→ s_j, S_{1,n+1}, W_{1,n})
where n(s_i –w_k→ s_j, S_{1,n+1}, W_{1,n}) = no. of times the transition s_i –w_k→ s_j occurs in the state sequence for the string
Learning probabilities
Actual (Desired) HMM:
[Diagram: states q, r; q –a:0.67→ q, q –b:0.17→ q, q –a:0.16→ r, r –b:1.0→ q]
Initial guess:
[Diagram: states q, r; q –a:0.48→ q, q –b:0.48→ q, q –a:0.04→ r, r –b:1.0→ q]
One run of Baum-Welch algorithm: string ababa

Each cell below is P(path) × n(transition, path); the column totals give the expected counts of each transition.

State sequences   P(path)    q–a→r     r–b→q     q–b→q     q–a→q
q r q r q q       0.00077    0.00154   0.00154   0         0.00077
q r q q q q       0.00442    0.00442   0.00442   0.00442   0.00884
q q q r q q       0.00442    0.00442   0.00442   0.00442   0.00884
q q q q q q       0.02548    0         0         0.05096   0.07644
Rounded Total →   0.035      0.01      0.01      0.06      0.095
New Probabilities (P) →      0.06      1.0       0.36      0.581
                             e.g. 0.06 = 0.01/(0.01+0.06+0.095)

* ε is considered as the starting and ending symbol of the input sequence string.
This way, through multiple iterations, the probability values will converge.
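The table above can be reproduced by brute force: enumerate every state path for "ababa" under the initial-guess HMM, weight each path's transition counts by P(path), and renormalize. A sketch under the same assumptions as the slide (only paths ending in q are kept, since ε, the end marker, is emitted from q):

```python
from itertools import product

# Initial-guess HMM, keyed (src, symbol, dst)
probs = {('q', 'a', 'q'): 0.48, ('q', 'b', 'q'): 0.48,
         ('q', 'a', 'r'): 0.04, ('r', 'b', 'q'): 1.0}
states, start, w = ['q', 'r'], 'q', 'ababa'

paths = []
for tail in product(states, repeat=len(w)):
    path = (start,) + tail
    if path[-1] != 'q':                  # keep only paths ending in q, as the slide does
        continue
    p = 1.0
    for t, sym in enumerate(w):
        p *= probs.get((path[t], sym, path[t + 1]), 0.0)
    if p > 0:
        paths.append((path, p))

# Expected count of each arc: sum over paths of P(path) * n(arc, path)
expected = {}
for path, p in paths:
    for t, sym in enumerate(w):
        arc = (path[t], sym, path[t + 1])
        expected[arc] = expected.get(arc, 0.0) + p

# Renormalize per source state to get the re-estimated probabilities
totals = {}
for (src, _sym, _dst), c in expected.items():
    totals[src] = totals.get(src, 0.0) + c
new_probs = {arc: c / totals[arc[0]] for arc, c in expected.items()}
print(new_probs)
```

This yields approximately 0.06 for q–a→r, 0.36 for q–b→q, and 1.0 for r–b→q, matching the table; for q–a→q it gives about 0.575, close to the slide's 0.581 (the small gap presumably comes from intermediate rounding on the slide).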
Applying Naïve Bayes
P(S_{1,n+1} | W_{1,n}) = P(S_{1,n+1}, W_{1,n}) / P(W_{1,n})
= (1/P(W_{1,n})) · P(S_1) · P(S_2 W_1 | S_1) · P(S_3 W_2 | S_1 S_2 W_1) · P(S_4 W_3 | S_1 S_2 S_3 W_1 W_2) · …
= (P(S_1)/P(W_{1,n})) · Π_{i=1}^{n} P(S_{i+1} W_i | S_i)   [by the Markov assumption]
Hence multiplying the transition probabilities is valid.
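The factorization can be sanity-checked on one path from the worked example (the toy initial-guess HMM from the earlier slide, with P(S_1 = q) = 1): the product of per-transition probabilities reproduces the P(path) entry of the Baum-Welch table.

```python
# P(S_{1,n+1}, W_{1,n}) = P(S_1) * prod_i P(S_{i+1} W_i | S_i)
probs = {('q', 'a', 'q'): 0.48, ('q', 'b', 'q'): 0.48,
         ('q', 'a', 'r'): 0.04, ('r', 'b', 'q'): 1.0}
path, w = ('q', 'r', 'q', 'r', 'q', 'q'), 'ababa'
p = 1.0                                   # P(S_1 = q) = 1, q being the start state
for t, sym in enumerate(w):
    p *= probs[(path[t], sym, path[t + 1])]
print(round(p, 5))   # 0.00077, the value in the Baum-Welch table
```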
Discussions
1. Symmetry breaking:
Example: a symmetric initialization leads to no change in the values across iterations, hence the need for symmetry breaking.
Desired HMM:
[Diagram: states with transitions labeled a:0.5, b:1.0, a:1.0, b:0.5]
Initialized HMM (symmetric values):
[Diagram: the same states with transitions labeled a:0.25, b:0.25, a:0.5, b:0.5, a:0.25, b:0.5, a:0.5, b:0.5]
2. Getting stuck in local maxima
3. Label bias problem:
Probabilities have to sum to 1, so some values can rise only at the cost of a fall in the values of others.
Computational part
c(s_i –w_k→ s_j) = (1/P(W_{1,n})) · Σ_{S_{1,n+1}} [P(S_{1,n+1}, W_{1,n}) · n(s_i –w_k→ s_j, S_{1,n+1}, W_{1,n})]

Σ_{S_{1,n+1}} P(S_{1,n+1}, W_{1,n}) · n(s_i –w_k→ s_j, S_{1,n+1}, W_{1,n})
= Σ_{t=1}^{n} P(S_t = s_i, S_{t+1} = s_j, W_t = w_k, W_{1,n})
= Σ_{t=1}^{n} α_i(t) · P(s_i –w_k→ s_j) · β_j(t+1)

Exercise 2: What is the complexity of calculating the above expression?
Hint: To find this, first solve Exercise 1, i.e., understand how the probability of the given string can be represented as P(W_{1,n}) = Σ_{i=1}^{T} α_i(t) · β_i(t)