Parameter Estimation for Completely Observed Graphical Models

Samson Cheung
Outline

- Motivation
- Directed Acyclic Graph
- Undirected Graph
  - Iterative Proportional Fitting (IPF)
  - Relation with Junction Tree
Density Estimation

Problem: Given a graphical model G and N data points, compute the maximum likelihood estimate of the density.

[Figure: a four-node directed graph X1 → X2, X1 → X3, X2 → X4, X3 → X4, with observed samples $(x_{1,i}, x_{2,i}, x_{3,i}, x_{4,i})$, $i = 1, \ldots, N$ (e.g. $x_{1,i}=0$, $x_{2,i}=1$, $x_{3,i}=0$, $x_{4,i}=0$), and empty conditional probability tables $P(X_1)$ and $P(X_2 \mid X_1)$ to be filled in.]

This lecture: estimate the parameters of each conditional probability table,

- $\theta_1$ : $P(X_1 \mid \theta_1)$
- $\theta_2$ : $P(X_2 \mid X_1, \theta_2)$
- $\theta_3$ : $P(X_3 \mid X_1, \theta_3)$
- $\theta_4$ : $P(X_4 \mid X_2, X_3, \theta_4)$
Maximum Likelihood Parameter Estimation

1. Form the likelihood of N IID data points: $L(\theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$.
2. Compute the logarithm: $\ell(\theta) = \log L(\theta)$.
3. Take the derivative with respect to $\theta$, set it to zero, and solve for $\theta$.
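
Applied to the four-node example above (my own worked illustration of the recipe), the likelihood factors according to the graph, so the log-likelihood decouples into one term per node:

$$L(\theta) = \prod_{i=1}^{N} \prod_{v} p(x_{v,i} \mid x_{\pi(v),i}, \theta_v), \qquad \ell(\theta) = \sum_{v} \sum_{i=1}^{N} \log p(x_{v,i} \mid x_{\pi(v),i}, \theta_v),$$

where $\pi(v)$ denotes the parents of node $v$. Each $\theta_v$ can therefore be maximized separately.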
Directed Graph

[Figure: the same four-node directed graph with samples $i = 1, \ldots, N$.]

e.g. (N = 7 binary observations):

x1 x2 x3 x4
 0  0  0  1
 1  0  1  1
 0  1  1  1
 0  1  1  1
 1  0  0  0
 0  0  0  0
 1  1  1  0

Why (1)? Grouping the N log-likelihood terms over all possible configurations of $x_v$ and $x_{\pi(v)}$, and letting $m(x_v, x_{\pi(v)})$ be the number of times the configuration $x_{v,\pi(v)}$ occurs in the data, gives

$$\ell(\theta) = \sum_{v} \sum_{x_v,\, x_{\pi(v)}} m(x_v, x_{\pi(v)}) \log \theta_{x_v \mid x_{\pi(v)}}. \qquad (1)$$

Why (2)? Maximizing (1) subject to the normalization over all possible configurations of $x_v$, $\sum_{x_v} \theta_{x_v \mid x_{\pi(v)}} = 1$, gives the count ratio

$$\hat\theta_{x_v \mid x_{\pi(v)}} = \frac{m(x_v, x_{\pi(v)})}{m(x_{\pi(v)})}, \qquad m(x_{\pi(v)}) = \sum_{x_v} m(x_v, x_{\pi(v)}). \qquad (2)$$
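
As a minimal sketch (not from the slides) of how equation (2) is computed, here is the count-ratio estimate on the example table above; the 0-based variable indices and the `mle_cpt` helper are my own:

```python
from collections import Counter

# Example data from the slide: each row is (x1, x2, x3, x4).
data = [
    (0, 0, 0, 1),
    (1, 0, 1, 1),
    (0, 1, 1, 1),
    (0, 1, 1, 1),
    (1, 0, 0, 0),
    (0, 0, 0, 0),
    (1, 1, 1, 0),
]

# Parent sets of the DAG (0-based): X1 -> X2, X1 -> X3, {X2, X3} -> X4.
parents = {0: (), 1: (0,), 2: (0,), 3: (1, 2)}

def mle_cpt(v, pa, data):
    """MLE of P(X_v | X_pa) as the count ratio m(x_v, x_pa) / m(x_pa)."""
    joint = Counter((tuple(x[p] for p in pa), x[v]) for x in data)
    marginal = Counter(tuple(x[p] for p in pa) for x in data)
    return {(x_pa, x_v): n / marginal[x_pa] for (x_pa, x_v), n in joint.items()}

for v, pa in parents.items():
    print(f"P(X{v + 1} | {', '.join(f'X{p + 1}' for p in pa) or 'nothing'}):")
    for (x_pa, x_v), prob in sorted(mle_cpt(v, pa, data).items()):
        print(f"  parents={x_pa} value={x_v}: {prob:.3f}")
```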
Undirected Graph

[Figure: an undirected graphical model whose potentials live on two cliques C1 and C2, with observed clique configurations $C_{1,i}$, $C_{2,i}$, $i = 1, \ldots, N$.]

e.g. (N = 7 observations; each entry is the joint state of the nodes in that clique):

C1   C2
010  110
110  110
111  011
110  110
010  110
000  100
111  111
Iterative Proportional Fitting

Idea: modify $\psi_C(x_C)$ so that the model marginal $p(X_C = x_C)$ matches $p_{ML}(X_C = x_C)$.

How?

Suppose that after the t-th iteration every potential function has converged except the one at clique C; call this potential function $\psi_C^{(t)}(X_C)$.

Goal: change $\psi_C^{(t)}(X_C)$ to $\psi_C^{(t+1)}(X_C)$ so that $p^{(t+1)}(X_C = x_C) = p_{ML}(X_C = x_C)$.
IPF cont.

The marginal probability of the undirected graph is computed by summing over all possible configurations of $X_{V \setminus C}$ with $X_C$ fixed at $x_C$:

$$p^{(t)}(x_C) = \frac{1}{Z^{(t)}}\, \psi_C^{(t)}(x_C) \sum_{x_{V \setminus C}} \prod_{D \neq C} \psi_D^{(t)}(x_D).$$

Try this iteration step:

$$\psi_C^{(t+1)}(x_C) = \psi_C^{(t)}(x_C)\, \frac{p_{ML}(x_C)}{p^{(t)}(x_C)}.$$

Is $p^{(t+1)}(X_C = x_C) = p_{ML}(X_C = x_C)$? Yes, provided that $Z^{(t+1)} = Z^{(t)}$.

Is $Z^{(t+1)} = Z^{(t)}$? Amazingly, yes! Plug in $p^{(t)}(x_C)$ from above.
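
Writing out that substitution (a reconstruction of the step the slide refers to):

$$Z^{(t+1)} = \sum_{x_C} \psi_C^{(t+1)}(x_C) \sum_{x_{V \setminus C}} \prod_{D \neq C} \psi_D^{(t)}(x_D) = \sum_{x_C} \frac{p_{ML}(x_C)}{p^{(t)}(x_C)}\, Z^{(t)}\, p^{(t)}(x_C) = Z^{(t)} \sum_{x_C} p_{ML}(x_C) = Z^{(t)}.$$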
Overall IPF algorithm

But changing $\psi_C(x_C)$ may affect the other marginals $p(X_D)$, so we iterate. Here is the overall algorithm:

1. Initialize all the potential functions by assigning 1 to each configuration. Set t = 0 and compute $Z^{(0)}$.
2. For each clique C:
   a. Inference step: compute the marginal $p^{(t)}(x_C)$. (How to do this efficiently?)
   b. Iterate step: $\psi_C^{(t+1)}(x_C) = \psi_C^{(t)}(x_C)\, p_{ML}(x_C) / p^{(t)}(x_C)$.
3. Check for convergence, say using maximum change < ε.
4. Set t = t + 1 and go to step 2.

Homework: work out the example on slide 8.
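
Below is a compact, runnable sketch of the algorithm (my own illustration, not the slide's example). It uses a made-up three-node chain with cliques {X1, X2} and {X2, X3}, brute-force inference in place of a junction tree, and an arbitrary convergence threshold:

```python
import itertools
import numpy as np

# Toy undirected chain X1 - X2 - X3 with binary nodes and cliques
# C1 = {X1, X2}, C2 = {X2, X3} (illustrative setup only).
cliques = [(0, 1), (1, 2)]
n_vars = 3

# Made-up observed data: rows are (x1, x2, x3).
data = [(0, 1, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]

def empirical_marginal(C):
    """Empirical (maximum likelihood) clique marginal p_ML(x_C)."""
    m = np.zeros((2,) * len(C))
    for x in data:
        m[tuple(x[v] for v in C)] += 1
    return m / len(data)

p_ml = [empirical_marginal(C) for C in cliques]

# Step 1: initialize every potential function to 1.
psi = [np.ones((2,) * len(C)) for C in cliques]

def model_marginal(C):
    """Inference step by brute force: normalize the joint, then marginalize."""
    joint = np.zeros((2,) * n_vars)
    for x in itertools.product((0, 1), repeat=n_vars):
        joint[x] = np.prod([psi[k][tuple(x[v] for v in Ck)]
                            for k, Ck in enumerate(cliques)])
    joint /= joint.sum()  # divide by the partition function Z
    other_axes = tuple(v for v in range(n_vars) if v not in C)
    return joint.sum(axis=other_axes)

# Steps 2-4: sweep over the cliques until the potentials stop changing.
for t in range(100):
    max_change = 0.0
    for k, C in enumerate(cliques):
        p_t = model_marginal(C)                        # inference step
        ratio = np.divide(p_ml[k], p_t,                # p_ML(x_C) / p^(t)(x_C)
                          out=np.ones_like(p_t), where=p_t > 0)
        max_change = max(max_change, np.abs(psi[k] * (ratio - 1)).max())
        psi[k] *= ratio                                # iterate step
    if max_change < 1e-8:                              # convergence check
        break

for k, C in enumerate(cliques):
    print(f"clique {C}: marginal matched?",
          np.allclose(model_marginal(C), p_ml[k]))
```

On this toy model the two cliques form a tree, so the sweep matches both marginals after very few iterations; in general the inference step is the expensive part, which is where the junction tree material from the outline comes in.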
Does IPF converge?

Here $\theta$ collects the values of the potential functions $\psi$.

- Take the derivative of the log-likelihood and set it to zero; this yields the condition $p(X_C = x_C) = p_{ML}(X_C = x_C)$ (read the derivation in the chapter).
- We cannot solve for $\psi_C$ directly, because it cancels out of this condition.
- However, each iteration of IPF ensures that the above condition is SATISFIED, provided every other potential function is held fixed.
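
For reference, a sketch of that derivative (my reconstruction; the chapter has the full derivation). With $m(x_C)$ the number of times $x_C$ occurs in the data,

$$\ell(\theta) = \sum_C \sum_{x_C} m(x_C) \log \psi_C(x_C) - N \log Z, \qquad \frac{\partial \ell}{\partial \psi_C(x_C)} = \frac{m(x_C)}{\psi_C(x_C)} - N\,\frac{p(x_C)}{\psi_C(x_C)},$$

using $\partial Z / \partial \psi_C(x_C) = Z\, p(x_C) / \psi_C(x_C)$. Setting the derivative to zero, the common factor $1/\psi_C(x_C)$ cancels, leaving $p(x_C) = m(x_C)/N = p_{ML}(x_C)$.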
IPF as coordinate ascent

- Each iteration MAXIMIZES the log-likelihood in the direction of one potential function.
- IPF must converge to a LOCAL maximum, since the likelihood is bounded above ($L(\theta) \le 1$).
- IPF is a coordinate ascent technique: the movement of the candidate solution is always parallel to one of the coordinate axes.
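
In symbols, each IPF update can be read as the coordinate-wise maximization (my paraphrase of the claim above):

$$\psi_C^{(t+1)} = \arg\max_{\psi_C}\; \ell\big(\psi_C, \{\psi_D^{(t)}\}_{D \neq C}\big),$$

holding all other potential functions fixed.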
Alternative view from KL

Let $\tilde p$ denote the empirical distribution.

- From the last chapter, we know that maximizing the likelihood is equivalent to minimizing the KL divergence $D(\tilde p \,\|\, p(\cdot \mid \theta))$.
- Focus on the marginal on clique C; its parameter is the potential function $\psi_C$.

Alternative view from KL (2)

- Thus minimizing $D(\tilde p \,\|\, p)$ in the direction of $\psi_C$ is the same as minimizing $D(\tilde p(x_C) \,\|\, p(x_C))$.
- But the IPF update sets $p^{(t+1)}(x_C) = p_{ML}(x_C) = \tilde p(x_C)$. Hence each iteration step of IPF minimizes $D(\tilde p \,\|\, p)$ in the $\psi_C$ direction.
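
The decomposition behind the first bullet (my reconstruction, using the chain rule for KL divergence):

$$D(\tilde p \,\|\, p) = D\big(\tilde p(x_C) \,\|\, p(x_C)\big) + \sum_{x_C} \tilde p(x_C)\, D\big(\tilde p(x_{V \setminus C} \mid x_C) \,\|\, p(x_{V \setminus C} \mid x_C)\big).$$

The conditional $p(x_{V \setminus C} \mid x_C)$ does not depend on $\psi_C$ (the factor $\psi_C(x_C)$ cancels between the joint and the marginal), so varying $\psi_C$ changes only the first term, and the IPF update drives that term to zero.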