Mixture Language Models and
EM Algorithm
(Lecture for CS397-CXZ Intro Text Info Systems)
Sept. 17, 2003
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Rest of this Lecture
• Unigram mixture models
– Slightly more sophisticated unigram LMs
– Related to smoothing
• EM algorithm
– VERY useful for estimating parameters of a mixture
model or when latent/hidden variables are involved
– Will occur again and again in the course
Modeling a Multi-topic Document
A document may contain two types of vocabulary, e.g., "text mining" passages interleaved with "food nutrition" passages:
   text mining passage
   food nutrition passage
   text mining passage
   text mining passage
   food nutrition passage
   …
– How do we model such a document?
– How do we "generate" such a document?
– How do we estimate our model?
Solution: a mixture model + EM
Simple Unigram Mixture Model
Model/topic 1: p(w|θ1), weight λ = 0.7
   text 0.2
   mining 0.1
   association 0.01
   clustering 0.02
   …
   food 0.00001
Model/topic 2: p(w|θ2), weight 1−λ = 0.3
   food 0.25
   nutrition 0.1
   healthy 0.05
   diet 0.02
   …
$p(w \mid \theta_1, \theta_2) = \lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)$
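As a quick illustration of this mixture (not part of the original slides), here is a minimal Python sketch that computes p(w|θ1, θ2) for a couple of words using the probabilities listed above; the function name and the treatment of unlisted words (probability 0) are our own simplifications.

```python
# Two unigram topic models (word -> probability), values taken from the slide above.
topic1 = {"text": 0.2, "mining": 0.1, "association": 0.01,
          "clustering": 0.02, "food": 0.00001}
topic2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

def mixture_prob(word, lam=0.7):
    """p(w | theta_1, theta_2) = lam * p(w|theta_1) + (1 - lam) * p(w|theta_2).
    Words not listed in a model are treated as probability 0 for this illustration."""
    return lam * topic1.get(word, 0.0) + (1 - lam) * topic2.get(word, 0.0)

print(mixture_prob("text"))   # 0.7 * 0.2   + 0.3 * 0.0  = 0.14
print(mixture_prob("food"))   # 0.7 * 1e-5  + 0.3 * 0.25 = 0.075007
```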
Parameter Estimation
Likelihood:
$p(d \mid \theta_1, \theta_2) = \prod_{w \in V} \big[\lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)\big]^{c(w,d)}$
$\log p(d \mid \theta_1, \theta_2) = \sum_{w \in V} c(w,d) \log\big[\lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)\big]$
Estimation scenarios:
– p(w|θ1) and p(w|θ2) are known; estimate λ:
  The doc is about text mining and food nutrition; what percentage is about text mining?
– p(w|θ1) and λ are known; estimate p(w|θ2):
  30% of the doc is about text mining; what is the rest about?
– p(w|θ1) is known; estimate λ and p(w|θ2):
  The doc is about text mining; is it also about some other topic, and if so, to what extent?
– λ is known; estimate p(w|θ1) and p(w|θ2):
  30% of the doc is about one topic and 70% is about another; what are these two topics?
– Estimate λ, p(w|θ1), and p(w|θ2) (= clustering):
  The doc is about two subtopics; find out what these two subtopics are and to what extent the doc covers each.
Parameter Estimation Example:
Given p(w|θ1) and p(w|θ2), estimate λ
Maximum Likelihood:
$L(\lambda) = \prod_{w \in V} \big[\lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)\big]^{c(w,d)}$
$\log L(\lambda) = \sum_{w \in V} c(w,d) \log\big[\lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)\big]$
$\lambda^* = \arg\max_{\lambda} \log L(\lambda)$
There is no closed-form solution, but many numerical algorithms can be used (a small sketch follows below).
The Expectation-Maximization (EM) algorithm is a commonly used method.
Basic idea: start from some random guess of the parameter values, then iteratively improve the estimates ("hill climbing").
[Figure: hill-climbing view. The curve is the log-likelihood log L(λ); at the current guess λ(n), the E-step computes a lower bound of the likelihood, and the M-step finds a new λ(n+1) that maximizes this lower bound.]
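As a sketch of what "numerical algorithm" can mean here (not from the original slides), the snippet below maximizes log L(λ) by a brute-force grid search, using the toy document and topic models from the worked example later in this lecture; NumPy is assumed.

```python
import numpy as np

# Toy document (word counts) and two known topic models, taken from the
# worked example table later in this lecture: the, paper, text, mining.
counts = np.array([4, 2, 4, 2], dtype=float)   # c(w, d)
p1     = np.array([0.5, 0.3, 0.1, 0.1])        # p(w | theta_1)
p2     = np.array([0.2, 0.1, 0.5, 0.3])        # p(w | theta_2)

def log_likelihood(lam):
    # log L(lambda) = sum_w c(w,d) * log[lam*p(w|theta_1) + (1-lam)*p(w|theta_2)]
    return float(np.sum(counts * np.log(lam * p1 + (1 - lam) * p2)))

# Brute-force grid search over lambda in (0, 1); any 1-D optimizer would also work.
grid = np.linspace(0.001, 0.999, 999)
best = max(grid, key=log_likelihood)
print(f"lambda* ~ {best:.3f}, log L = {log_likelihood(best):.2f}")  # roughly lambda* ~ 0.36
```

The EM iterations on the later slide climb toward this same value (λ: 0.5 → 0.46 → 0.43 → …), since for fixed topic models the log-likelihood is concave in λ and has a single maximum.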
EM Algorithm: Intuition
Same mixture as before, but now λ is unknown. We observe a document d generated from
$p(w \mid \theta_1, \theta_2) = \lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)$
with both topic models known:
   p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001   (weight λ = ?)
   p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …   (weight 1−λ = ?)
Suppose we knew the identity of each word, i.e., which topic generated it. Then
$\lambda^* = \frac{\sum_{w \in V} c(w,d)\, \delta(w \text{ is from } \theta_1)}{\sum_{w \in V} c(w,d)}$
Can We “Guess” the Identity?
Identity ("hidden") variable: z_w ∈ {1 (w is from θ1), 0 (w is from θ2)}

   word        z_w
   the         1
   paper       1
   presents    1
   a           1
   text        0
   mining      0
   algorithm   0
   the         1
   paper       0
   ...         ...

What's a reasonable guess of z_w?
– It depends on λ (why?)
– It depends on p(w|θ1) and p(w|θ2) (how?)

E-step:
$p(z_w = 1 \mid w) = \frac{p(z_w=1)\, p(w \mid z_w=1)}{p(z_w=1)\, p(w \mid z_w=1) + p(z_w=0)\, p(w \mid z_w=0)} = \frac{\lambda\, p(w \mid \theta_1)}{\lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)}$

M-step:
$\lambda^{\text{new}} = \frac{\sum_{w \in V} c(w,d)\, p(z_w = 1 \mid w)}{\sum_{w \in V} c(w,d)}$

Initially, set λ to some random value, then iterate …
An Example of EM Computation
Log-likelihood: $\log L(\lambda) = \sum_{w \in V} c(w,d) \log\big[\lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)\big]$
E-step: $p(z_w = 1 \mid w) = \frac{\lambda\, p(w \mid \theta_1)}{\lambda\, p(w \mid \theta_1) + (1-\lambda)\, p(w \mid \theta_2)}$
M-step: $\lambda^{\text{new}} = \frac{\sum_{w \in V} c(w,d)\, p(z_w = 1 \mid w)}{\sum_{w \in V} c(w,d)}$
Word     #   p(w|θ1)  p(w|θ2)   Iter. 1: p(z=1|w)   Iter. 2: p(z=1|w)
The      4   0.5      0.2       0.71                0.68
Paper    2   0.3      0.1       0.75                0.72
Text     4   0.1      0.5       0.17                0.14
Mining   2   0.1      0.3       0.25                0.22

                   Init: λ(0) = 0.5   Iter. 1: λ(1) = 0.46   Iter. 2: λ(2) = 0.43
Log-likelihood:    -15.45             -15.39                 -15.35
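The iteration above can be reproduced with a few lines of code. Below is a minimal, self-contained sketch (not part of the original slides, NumPy assumed) that runs the E-step and M-step formulas on this toy document; it matches the table's values up to rounding.

```python
import numpy as np

# Toy document and known topic models from the table above.
counts = np.array([4, 2, 4, 2], dtype=float)   # c(w,d) for: the, paper, text, mining
p1     = np.array([0.5, 0.3, 0.1, 0.1])        # p(w | theta_1)
p2     = np.array([0.2, 0.1, 0.5, 0.3])        # p(w | theta_2)

def log_likelihood(lam):
    # log L(lambda) = sum_w c(w,d) * log[lam*p(w|theta_1) + (1-lam)*p(w|theta_2)]
    return float(np.sum(counts * np.log(lam * p1 + (1 - lam) * p2)))

lam = 0.5                                      # initial guess lambda^(0)
print(f"init     lambda = {lam:.2f}   logL = {log_likelihood(lam):.2f}")

for it in range(1, 3):
    # E-step: posterior probability that each word was generated by theta_1
    post = lam * p1 / (lam * p1 + (1 - lam) * p2)          # p(z_w = 1 | w)
    # M-step: re-estimate lambda from the expected counts
    lam = float(np.sum(counts * post) / np.sum(counts))
    print(f"iter {it}   p(z=1|w) = {np.round(post, 2)}   "
          f"lambda = {lam:.2f}   logL = {log_likelihood(lam):.2f}")
```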
Any Theoretical Guarantee?
• EM is guaranteed to reach a LOCAL maximum
• When the local maximum is also the global maximum, EM can find the global maximum
• But when there are multiple local maxima, "special techniques" are needed (e.g., try different initial values)
• In our case, there is a unique local maximum (why?)
A General Introduction to EM
Data: X (observed) + H (hidden); parameter: θ
"Incomplete" likelihood: L(θ) = log p(X|θ)
"Complete" likelihood: Lc(θ) = log p(X, H|θ)
EM iteratively maximizes the expected complete likelihood:
Starting with an initial guess θ(0),
1. E-step: compute the expectation of the complete likelihood
$Q(\theta; \theta^{(n-1)}) = E_{\theta^{(n-1)}}[L_c(\theta) \mid X] = \sum_{h_i} p(H = h_i \mid X, \theta^{(n-1)}) \log p(X, h_i \mid \theta)$
2. M-step: compute θ(n) by maximizing the Q-function
$\theta^{(n)} = \arg\max_{\theta} Q(\theta; \theta^{(n-1)}) = \arg\max_{\theta} \sum_{h_i} p(H = h_i \mid X, \theta^{(n-1)}) \log p(X, h_i \mid \theta)$
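To connect this general recipe back to the two-topic mixture above, here is a short derivation (our addition, not spelled out on the original slides): the hidden data H are the word identities z_w, and maximizing the Q-function over λ recovers exactly the M-step update used earlier.

$$Q(\lambda; \lambda^{(n-1)}) = \sum_{w \in V} c(w,d)\Big[ p(z_w{=}1 \mid w) \log\big(\lambda\, p(w \mid \theta_1)\big) + \big(1 - p(z_w{=}1 \mid w)\big) \log\big((1-\lambda)\, p(w \mid \theta_2)\big)\Big]$$

Setting $\partial Q / \partial \lambda = 0$ gives

$$\sum_{w \in V} c(w,d)\left[\frac{p(z_w{=}1 \mid w)}{\lambda} - \frac{1 - p(z_w{=}1 \mid w)}{1-\lambda}\right] = 0 \quad\Rightarrow\quad \lambda = \frac{\sum_{w \in V} c(w,d)\, p(z_w{=}1 \mid w)}{\sum_{w \in V} c(w,d)},$$

where the posteriors $p(z_w{=}1 \mid w)$ are computed with $\lambda^{(n-1)}$ in the E-step.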
Convergence Guarantee
Goal: maximize the "incomplete" likelihood L(θ) = log p(X|θ),
i.e., choose θ(n) so that L(θ(n)) − L(θ(n-1)) ≥ 0.
Note that, since p(X, H|θ) = p(H|X, θ) p(X|θ), we have L(θ) = Lc(θ) − log p(H|X, θ), so
$L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log\frac{p(H \mid X, \theta^{(n-1)})}{p(H \mid X, \theta^{(n)})}$
Taking the expectation w.r.t. p(H|X, θ(n-1)),
$L(\theta^{(n)}) - L(\theta^{(n-1)}) = Q(\theta^{(n)}; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)}) + D\big(p(H \mid X, \theta^{(n-1)}) \,\|\, p(H \mid X, \theta^{(n)})\big)$
EM chooses θ(n) to maximize Q, and the KL-divergence is always non-negative.
Therefore, L(θ(n)) ≥ L(θ(n-1))!
Another way of looking at EM
[Figure: the likelihood curve p(X|θ) and its lower bound (the Q function). For any θ,
$L(\theta) = L(\theta^{(n-1)}) + Q(\theta; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)}) + D\big(p(H \mid X, \theta^{(n-1)}) \,\|\, p(H \mid X, \theta)\big)$,
so $L(\theta^{(n-1)}) + Q(\theta; \theta^{(n-1)}) - Q(\theta^{(n-1)}; \theta^{(n-1)})$ is a lower bound on L(θ) that touches it at the current guess θ(n-1); the next guess is the θ that maximizes this bound.]
E-step = computing the lower bound
M-step = maximizing the lower bound
What You Should Know
• Why is the unigram language model so important?
• What is a unigram mixture language model?
• How to estimate parameters of simple unigram
mixture models using EM
• Know the general idea of EM (EM will be
covered again later in the course)