
Probabilistic Graphical Models
David Sontag
New York University
Lecture 8, March 22, 2012
Approximate marginal inference
Given the joint p(x1, . . . , xn) represented as a graphical model, how
do we perform marginal inference, e.g. to compute p(x1)?
We showed in Lecture 5 that doing this exactly is NP-hard
Nearly all approximate inference algorithms are either:
1. Monte Carlo methods (e.g., likelihood reweighting, MCMC)
2. Variational algorithms (e.g., mean-field, TRW, loopy belief propagation)
These next two lectures will be on variational methods
Variational methods
Goal: Approximate a difficult distribution p(x) with a new distribution q(x) such that:
1. p(x) and q(x) are “close”
2. Computation on q(x) is easy
How should we measure distance between distributions?
The Kullback-Leibler divergence (KL-divergence) between two
distributions p and q is defined as
D(p‖q) = Σ_x p(x) log (p(x) / q(x)).
(measures the expected number of extra bits required to describe
samples from p(x) using a code based on q instead of p)
As you showed in your homework, D(p‖q) ≥ 0 for all p, q, with
equality if and only if p = q
Notice that KL-divergence is asymmetric: in general, D(p‖q) ≠ D(q‖p)
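To make the definition concrete, here is a small numerical check (my illustration, not from the slides) using two hand-picked discrete distributions; it also shows the asymmetry directly:

    import numpy as np

    def kl(p, q):
        # D(p || q) = sum_x p(x) log(p(x) / q(x)), in nats (natural log)
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0          # terms with p(x) = 0 contribute zero by convention
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.4, 0.4, 0.2])
    print(kl(p, q), kl(q, p))   # two different nonnegative values: D(p||q) != D(q||p)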
KL-divergence
(see Section 8.5 of K&F)
D(p‖q) = Σ_x p(x) log (p(x) / q(x)).
Suppose p is the true distribution we wish to do inference with
What is the difference between the solution to
arg min_{q} D(p‖q)
(called the M-projection of q onto p) and
arg min_{q} D(q‖p)
(called the I-projection)?
These two will differ only when q is minimized over a restricted set of
probability distributions Q = {q1, . . .}, and in particular when p ∉ Q
KL-divergence – M-projection
q* = arg min_{q∈Q} D(p‖q) = arg min_{q∈Q} Σ_x p(x) log (p(x) / q(x)).
For example, suppose that p(z) is a 2D Gaussian and Q is the set of all
Gaussian distributions with diagonal covariance matrices:
[Figure (b): contour plot over (z1, z2); p = green, q = red]
KL-divergence – I-projection
q* = arg min_{q∈Q} D(q‖p) = arg min_{q∈Q} Σ_x q(x) log (q(x) / p(x)).
For example, suppose that p(z) is a 2D Gaussian and Q is the set of all
Gaussian distributions with diagonal covariance matrices:
[Figure (a): contour plot over (z1, z2); p = green, q = red]
KL-divergence (single Gaussian)
In this simple example, both the M-projection and I-projection find an
approximate q(x) that has the correct mean (i.e., E_p[z] = E_q[z]):
[Figures (b) and (a): contour plots over (z1, z2) of the M-projection and I-projection, respectively]
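For the diagonal-covariance example above, both projections have closed forms (a sketch using standard Gaussian identities; the numbers are made up, not from the slides): the M-projection copies each coordinate's marginal variance Σ_ii, while the I-projection uses 1/Λ_ii from the precision matrix Λ = Σ⁻¹, so both keep the correct mean but the I-projection is narrower:

    import numpy as np

    # A correlated 2D Gaussian p(z) = N(m, S)
    m = np.array([0.5, 0.5])
    S = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
    Lam = np.linalg.inv(S)            # precision matrix of p

    # M-projection onto diagonal Gaussians: match each coordinate's moments
    var_m_proj = np.diag(S)           # variances S_ii

    # I-projection onto diagonal Gaussians: match the diagonal of the precision
    var_i_proj = 1.0 / np.diag(Lam)   # variances 1 / Lam_ii (smaller than S_ii)

    print("mean (both projections):", m)
    print("M-projection variances:", var_m_proj)   # [1.0, 1.0]
    print("I-projection variances:", var_i_proj)   # [0.36, 0.36]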
What if p(x) is multi-modal?
KL-divergence – M-projection (mixture of Gaussians)
q* = arg min_{q∈Q} D(p‖q) = arg min_{q∈Q} Σ_x p(x) log (p(x) / q(x)).
Now suppose that p(x) is a mixture of two 2D Gaussians and Q is the set of
all 2D Gaussian distributions (with arbitrary covariance matrices):
[Figure: p = blue, q = red]
The M-projection yields a distribution q(x) with the correct mean and covariance.
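The moment-matching computation itself is straightforward; here is a sketch (my example mixture, not from the slides) of how the M-projection's mean and covariance are obtained from the mixture components:

    import numpy as np

    # p(x): mixture of two 2D Gaussians with weights w, means mus, covariances Sigmas
    w = np.array([0.5, 0.5])
    mus = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
    Sigmas = [np.eye(2), np.eye(2)]

    # M-projection onto a single Gaussian = moment matching
    mean = sum(wk * mk for wk, mk in zip(w, mus))
    cov = sum(wk * (Sk + np.outer(mk, mk)) for wk, Sk, mk in zip(w, Sigmas, mus)) \
          - np.outer(mean, mean)
    print(mean)   # [0, 0]: sits between the two modes
    print(cov)    # broad covariance that covers both modes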
KL-divergence – I-projection (mixture of Gaussians)
q* = arg min_{q∈Q} D(q‖p) = arg min_{q∈Q} Σ_x q(x) log (q(x) / p(x)).
[Figure: p = blue, q = red (two equivalently good solutions!)]
Unlike the M-projection, the I-projection does not necessarily yield the
correct moments.
Finding the M-projection is the same as exact inference
The M-projection is:
q* = arg min_{q∈Q} D(p‖q) = arg min_{q∈Q} Σ_x p(x) log (p(x) / q(x)).
Recall the definition of probability distributions in the exponential family:
q(x; η) = h(x) exp{η · f(x) − ln Z(η)}
f(x) are called the sufficient statistics
In the exponential family, there is a one-to-one correspondence between
distributions q(x; η) and marginal vectors E_q[f(x)]
Suppose that Q is an exponential family (p(x) can be arbitrary)
It can be shown (see Thm 8.6) that the expected sufficient statistics, with
respect to q*(x), are exactly the corresponding marginals under p(x):
E_{q*}[f(x)] = E_p[f(x)]
Thus, solving for the M-projection is just as hard as the original inference
problem
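The moment-matching condition can be seen in one step (my sketch of the standard exponential-family argument, not reproduced on the slide): write out D(p‖q_η) and set its gradient with respect to η to zero, using ∇_η ln Z(η) = E_{q_η}[f(x)]:

D(p‖q_η) = Σ_x p(x) log p(x) − Σ_x p(x) [log h(x) + η · f(x) − ln Z(η)]
∇_η D(p‖q_η) = −E_p[f(x)] + ∇_η ln Z(η) = −E_p[f(x)] + E_{q_η}[f(x)] = 0
⇒ E_{q*}[f(x)] = E_p[f(x)]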
Most variational inference algorithms make use of the I-projection
Variational methods
Suppose that we have an arbitrary graphical model:
p(x; θ) = (1/Z(θ)) ∏_{c∈C} φ_c(x_c) = exp{ Σ_{c∈C} θ_c(x_c) − ln Z(θ) }
All of the approaches begin as follows:
D(q‖p) = Σ_x q(x) ln (q(x) / p(x))
= − Σ_x q(x) ln p(x) − Σ_x q(x) ln (1 / q(x))
= − Σ_x q(x) [ Σ_{c∈C} θ_c(x_c) − ln Z(θ) ] − H(q(x))
= − Σ_{c∈C} Σ_x q(x) θ_c(x_c) + Σ_x q(x) ln Z(θ) − H(q(x))
= − Σ_{c∈C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)).
Variational approach
Since D(q‖p) ≥ 0, we have
− Σ_{c∈C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)) ≥ 0,
which implies that
ln Z(θ) ≥ Σ_{c∈C} E_q[θ_c(x_c)] + H(q(x)).
Thus, any approximating distribution q(x) gives a lower bound on the
log-partition function
Recall that D(q‖p) = 0 if and only if p = q. Thus, if we allow
ourselves to optimize over all distributions, we have:
ln Z(θ) = max_q Σ_{c∈C} E_q[θ_c(x_c)] + H(q(x)).
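Here is a brute-force check of this bound on a made-up three-variable chain (my illustration; the potentials and the choice of q are arbitrary): any factored q evaluates to at most the exact ln Z(θ).

    import itertools
    import numpy as np

    # Tiny pairwise model on 3 binary variables with cliques {1,2} and {2,3}
    theta12 = np.array([[1.0, -0.5], [-0.5, 1.0]])
    theta23 = np.array([[0.5,  0.0], [ 0.0, 0.5]])
    score = lambda x: theta12[x[0], x[1]] + theta23[x[1], x[2]]
    states = list(itertools.product([0, 1], repeat=3))

    # Exact log-partition function by enumeration
    lnZ = np.log(sum(np.exp(score(x)) for x in states))

    # An arbitrary fully factored q(x) = q1(x1) q2(x2) q3(x3)
    q = [np.array([0.6, 0.4]), np.array([0.5, 0.5]), np.array([0.3, 0.7])]
    qx = lambda x: q[0][x[0]] * q[1][x[1]] * q[2][x[2]]

    E_theta = sum(qx(x) * score(x) for x in states)
    H_q = -sum(qx(x) * np.log(qx(x)) for x in states)
    print(E_theta + H_q, "<=", lnZ)   # the lower bound holds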
Mean-field algorithms
ln Z(θ) = max_q Σ_{c∈C} E_q[θ_c(x_c)] + H(q(x)).
Although this function is concave and thus in theory should be easy
to optimize, we need some compact way of representing q(x)
Mean-field algorithms assume a factored representation of the joint
distribution:
q(x) = ∏_{i∈V} q_i(x_i)
The objective function to use for variational inference then becomes:
max_{ q_i(x_i) ≥ 0, Σ_{x_i} q_i(x_i) = 1 }  Σ_{c∈C} Σ_{x_c} θ_c(x_c) ∏_{i∈c} q_i(x_i) + Σ_{i∈V} H(q_i)
Key difficulties: (1) this is a highly non-convex optimization problem, and (2) the
fully factored distribution is often too crude an approximation
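For a pairwise model, the resulting coordinate-ascent updates set each q_i proportional to the exponentiated expected potentials under the other factors. A minimal sketch for binary variables (my illustration; the data structures and numbers are made up, not from the slides):

    import numpy as np

    def mean_field(theta_i, theta_ij, edges, n_iters=100):
        # Naive mean-field for a pairwise MRF over binary variables.
        # theta_i[i]: length-2 node potential; theta_ij[(i, j)]: 2x2 edge potential.
        q = {i: np.full(2, 0.5) for i in theta_i}
        nbrs = {i: [] for i in theta_i}
        for (i, j) in edges:
            nbrs[i].append((j, theta_ij[(i, j)]))      # rows indexed by x_i
            nbrs[j].append((i, theta_ij[(i, j)].T))    # rows indexed by x_j
        for _ in range(n_iters):
            for i in theta_i:
                logits = theta_i[i].copy()
                for (j, th) in nbrs[i]:
                    logits = logits + th @ q[j]        # E_{q_j}[theta_ij(x_i, X_j)]
                q[i] = np.exp(logits - logits.max())   # normalize (stably)
                q[i] /= q[i].sum()
        return q

    # Example: a 3-node chain with attractive couplings (made-up numbers)
    theta_i = {0: np.array([0.5, -0.5]), 1: np.zeros(2), 2: np.zeros(2)}
    coupling = np.array([[0.8, -0.8], [-0.8, 0.8]])
    theta_ij = {(0, 1): coupling, (1, 2): coupling}
    print(mean_field(theta_i, theta_ij, edges=[(0, 1), (1, 2)]))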
Convex relaxation
ln Z(θ) = max_q Σ_{c∈C} E_q[θ_c(x_c)] + H(q(x)).
Assume that p(x) is in the exponential family, and let f(x) be its sufficient
statistic vector
Let Q be the exponential family with sufficient statistics f(x)
Define µ_q = E_q[f(x)] to be the marginals of q(x)
We can re-write the objective as
ln Z(θ) = max_q Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_{q,c}(x_c) + H(µ_q),
where we define H(µ_q) to be the entropy of the maximum entropy
distribution with marginals µ_q
Next, instead of optimizing over distributions q(x), optimize over valid
marginal vectors µ. We obtain:
ln Z(θ) = max_{µ∈M} Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_c(x_c) + H(µ)
Marginal polytope (same as from Lecture 7!)
[Figure (Wainwright & Jordan, ’03): illustration of the marginal polytope for a Markov random field on three binary variables. Each vertex µ stacks the node assignments for X1, X2, X3 and the edge assignments for X1X2, X1X3, X2X3 of one global assignment; the midpoint µ = ½(µ′ + µ′′) of two vertices is also a valid marginal vector. The vertices of the polytope correspond one-to-one with global assignments.]
Convex relaxation
ln Z(θ) = max_{µ∈M} Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_c(x_c) + H(µ)
We still haven’t achieved anything, because:
1. The marginal polytope M is complex to describe (in general, exponentially many vertices and facets)
2. H(µ) is very difficult to compute or optimize over
We now make two approximations:
1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints M_L
2. We replace H(µ) with a concave function H̃(µ) which upper bounds H(µ), i.e. H(µ) ≤ H̃(µ)
As a result, we obtain the following upper bound on the log-partition
function, which is concave and easy to optimize:
ln Z(θ) ≤ max_{µ∈M_L} Σ_{c∈C} Σ_{x_c} θ_c(x_c) µ_c(x_c) + H̃(µ)
Local consistency polytope (same as from Lecture 7!)
Force every “cluster” of variables to choose a local assignment:
µ_i(x_i) ≥ 0                         ∀i ∈ V, x_i
Σ_{x_i} µ_i(x_i) = 1                 ∀i ∈ V
µ_ij(x_i, x_j) ≥ 0                   ∀ij ∈ E, x_i, x_j
Σ_{x_i, x_j} µ_ij(x_i, x_j) = 1      ∀ij ∈ E
Enforce that these local assignments are globally consistent:
µ_i(x_i) = Σ_{x_j} µ_ij(x_i, x_j)    ∀ij ∈ E, x_i
µ_j(x_j) = Σ_{x_i} µ_ij(x_i, x_j)    ∀ij ∈ E, x_j
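A small checker for these constraints (my illustration, assuming binary variables and hypothetical dictionaries of node and edge pseudomarginals):

    import numpy as np

    def in_local_polytope(mu_i, mu_ij, edges, tol=1e-8):
        # Nonnegativity and normalization of node and edge pseudomarginals,
        # plus marginalization consistency along every edge.
        for m in mu_i.values():
            if (m < -tol).any() or abs(m.sum() - 1) > tol:
                return False
        for (i, j) in edges:
            M = mu_ij[(i, j)]                                        # M[x_i, x_j]
            if (M < -tol).any() or abs(M.sum() - 1) > tol:
                return False
            if not np.allclose(M.sum(axis=1), mu_i[i], atol=tol):    # sum over x_j
                return False
            if not np.allclose(M.sum(axis=0), mu_i[j], atol=tol):    # sum over x_i
                return False
        return True

    # Locally consistent pseudomarginals on a single edge
    mu_i = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
    mu_ij = {(0, 1): np.array([[0.4, 0.1], [0.1, 0.4]])}
    print(in_local_polytope(mu_i, mu_ij, edges=[(0, 1)]))   # True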
Tree reweighted entropy
One particularly powerful concave entropy approximation is the tree-reweighted
approximation from Wainwright, Jaakkola & Willsky (2005)
[Figure (from the tree-reweighted paper): illustration of the spanning tree polytope T(G). The original graph is shown in panel (a); probability 1/3 is assigned to each of the three spanning trees {T_i | i = 1, 2, 3} shown in panels (b)–(d). Edge b is a so-called bridge in G.]
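The slide does not write out the approximation itself; for reference (my addition, following the tree-reweighted construction), for a pairwise model it takes the form

H̃(µ) = Σ_{i∈V} H(µ_i) − Σ_{ij∈E} ρ_ij I(µ_ij),

where ρ_ij is the probability that edge ij appears in a spanning tree drawn from the chosen distribution over spanning trees (a bridge edge appears in every spanning tree, so its ρ equals 1), and I(µ_ij) is the mutual information of the edge pseudomarginal. This H̃(µ) is concave in µ and upper bounds the true entropy H(µ).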
Obtaining true bounds on the marginals
We showed how to obtain upper and lower bounds on the partition function
These can be used to obtain upper and lower bounds on marginals
Let Z(θ_{x_i}) denote the partition function of the distribution on X_{V\i} where X_i = x_i
Suppose that L_{x_i} ≤ Z(θ_{x_i}) ≤ U_{x_i}
Then,
p(x_i; θ) = Σ_{x_{V\i}} exp(θ(x_{V\i}, x_i)) / [ Σ_{x̂_i} Σ_{x_{V\i}} exp(θ(x_{V\i}, x̂_i)) ]
= Z(θ_{x_i}) / Σ_{x̂_i} Z(θ_{x̂_i})
≤ U_{x_i} / Σ_{x̂_i} L_{x̂_i}.
Similarly, p(x_i; θ) ≥ L_{x_i} / Σ_{x̂_i} U_{x̂_i}.
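Numerically, the final step is just elementwise division (a sketch with hypothetical values for the bounds L and U on the clamped partition functions):

    import numpy as np

    # Hypothetical bounds L[xi] <= Z(theta_{xi}) <= U[xi] for each value xi of X_i
    L = np.array([2.0, 5.0, 1.0])
    U = np.array([2.5, 6.0, 1.5])

    lower = L / U.sum()   # p(xi; theta) >= L[xi] / sum over all values of U
    upper = U / L.sum()   # p(xi; theta) <= U[xi] / sum over all values of L
    print(lower, upper)   # elementwise bounds on the marginal of X_i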