
Approximate Inference:
Particle-Based Methods
Eran Segal
Weizmann Institute
Inference Complexity Summary

NP-hard:

  Exact inference
  Approximate inference with relative error
  Approximate inference with absolute error < 0.5 (given evidence)

Hopeless?

  No: we will see many network structures that have provably
  efficient algorithms, and we will see cases where approximate
  inference works efficiently with high accuracy
Approximate Inference

Particle-based methods

  Create instances that represent part of the probability mass
    Random sampling
    Deterministic search for high-probability assignments

Global methods

  Approximate the distribution in its entirety
    Use exact inference on a simpler (but close) network
    Perform inference in the original network but approximate some
    steps of the process (e.g., ignore certain computations or
    approximate some intermediate results)
Particle-Based Methods

Particle generation process

  Generate particles deterministically
  Generate particles by sampling

Particle definition

  Full particles – complete assignments to all variables
  Distributional particles – assignments to a subset of the variables

General framework

  Generate samples x[1],...,x[M]
  Estimate the function by:

    \hat{E}_P(f) \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])
Particle-Based Methods Overview

Full particle methods

  Sampling methods
    Forward sampling
    Importance sampling
    Markov chain Monte Carlo
  Deterministic particle generation

Distributional particles
Forward Sampling

Generate random samples from P(X)

  Use the Bayesian network to generate the samples

Estimate the function by:

  \hat{E}_P(f) \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])
Forward Sampling: Example

Network: D → G ← I,  I → S,  G → L

  P(D):   d0 = 0.4,  d1 = 0.6
  P(I):   i0 = 0.7,  i1 = 0.3

  P(G | D,I):        g0     g1     g2
    d0, i0           0.3    0.4    0.3
    d0, i1           0.05   0.25   0.7
    d1, i0           0.9    0.08   0.02
    d1, i1           0.5    0.3    0.2

  P(S | I):          s0     s1
    i0               0.95   0.05
    i1               0.2    0.8

  P(L | G):          l0     l1
    g0               0.1    0.9
    g1               0.4    0.6
    g2               0.99   0.01

Samples are generated one variable at a time in topological order
(D, I, G, S, L); the slide sequence builds up one sample step by step:

  D = d1  →  I = i1  →  G = g2  →  S = s1  →  L = l0

yielding the particle (d1, i1, g2, s1, l0).
Forward Sampling


Let X1,…,Xn be a topological order of the variables
For i = 1,...,n
  Sample xi from P(Xi | pa_i)
  (Note: since Pa_i ⊆ {X1,…,Xi−1}, we have already assigned values to the parents)
Return x1,…,xn

Estimate a function by:

  \hat{E}_P(f) \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])

Estimate P(y) by:

  \hat{P}(y) \approx \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{x[m]\langle Y\rangle = y\}
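Below is a minimal Python sketch of this procedure for the example network above. The CPT encoding, helper names, and the final query are illustrative assumptions added here, not part of the slides.

import random

# CPTs for the example network (values taken from the slides).
# Each CPT maps an assignment to the parents to a distribution over the variable.
P_D = {(): {'d0': 0.4, 'd1': 0.6}}
P_I = {(): {'i0': 0.7, 'i1': 0.3}}
P_G = {('d0', 'i0'): {'g0': 0.3,  'g1': 0.4,  'g2': 0.3},
       ('d0', 'i1'): {'g0': 0.05, 'g1': 0.25, 'g2': 0.7},
       ('d1', 'i0'): {'g0': 0.9,  'g1': 0.08, 'g2': 0.02},
       ('d1', 'i1'): {'g0': 0.5,  'g1': 0.3,  'g2': 0.2}}
P_S = {('i0',): {'s0': 0.95, 's1': 0.05},
       ('i1',): {'s0': 0.2,  's1': 0.8}}
P_L = {('g0',): {'l0': 0.1,  'l1': 0.9},
       ('g1',): {'l0': 0.4,  'l1': 0.6},
       ('g2',): {'l0': 0.99, 'l1': 0.01}}

# Variables in topological order: (name, parents, CPT).
NETWORK = [('D', (), P_D), ('I', (), P_I), ('G', ('D', 'I'), P_G),
           ('S', ('I',), P_S), ('L', ('G',), P_L)]

def sample_categorical(dist):
    # Sample a value from {value: probability} by inverting a uniform draw.
    u, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if u <= total:
            return value
    return value  # guard against floating-point rounding

def forward_sample():
    # Sample every variable in topological order from P(Xi | pa_i).
    x = {}
    for var, parents, cpt in NETWORK:
        x[var] = sample_categorical(cpt[tuple(x[p] for p in parents)])
    return x

# Estimate P(L = l1) as the fraction of samples with x(L) = l1.
M = 10000
samples = [forward_sample() for _ in range(M)]
print(sum(s['L'] == 'l1' for s in samples) / M)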
Forward Sampling

Sampling cost

  Per-variable cost: O(log |Val(Xi)|)
    Sample uniformly in [0,1]
    Find the appropriate value among the |Val(Xi)| values that Xi can take
  Per-sample cost: O(n log d), where d = max_i |Val(Xi)|
  Total cost: O(M n log d)

Number of samples needed

  To get a relative error < ε with probability 1−δ, we need

    M \geq 3\,\frac{\ln(2/\delta)}{P(y)\,\epsilon^2}

  Note that the number of samples grows inversely with P(y)
  For small P(y) we need many samples; otherwise we are likely to report P(y) = 0
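For instance (numbers chosen purely for illustration), to estimate a query with P(y) = 0.01 to within relative error ε = 0.1 with probability 1 − δ = 0.95, the bound requires roughly

  M \geq 3\,\frac{\ln(40)}{0.01 \cdot 0.1^2} \approx 1.1 \times 10^5

samples.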
Rejection Sampling


In general we need to compute P(Y|e)
We can do so with rejection sampling:

  Generate samples as in forward sampling
  Reject samples in which E ≠ e
  Estimate the function from the accepted samples

Problem: if the evidence is unlikely (e.g., P(e) = 0.001),
then most of the generated samples are rejected
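A sketch of rejection sampling built on the forward-sampling code above (function and variable names are carried over from that sketch and are illustrative):

def rejection_sample(evidence, M):
    # Estimate P(Y | e): keep only forward samples consistent with the evidence.
    accepted = []
    for _ in range(M):
        x = forward_sample()
        if all(x[var] == val for var, val in evidence.items()):
            accepted.append(x)
    return accepted

# E = {S=s1, G=g2}; the acceptance rate is roughly P(e), so unlikely evidence wastes samples.
accepted = rejection_sample({'S': 's1', 'G': 'g2'}, 10000)
if accepted:
    print(len(accepted) / 10000,                                  # approx. P(e)
          sum(s['L'] == 'l1' for s in accepted) / len(accepted))  # approx. P(l1 | e)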
Particle-Based Methods Overview

Full particle methods

  Sampling methods
    Forward sampling
    Importance sampling
    Markov chain Monte Carlo
  Deterministic particle generation

Distributional particles
Likelihood Weighting

Can we ensure that all of our samples satisfy E = e?

  Solution: when sampling a variable X ∈ E, set X = e
  Problem: we are trying to sample from the posterior P(X|e), but our
  sampling process still samples from P(X)
  Solution: weight each sample by the joint probability of setting each
  evidence variable to its observed value
  In effect, we are sampling from P(X,e), which we can normalize to
  obtain P(X|e) for a query of interest
Likelihood Weighting


Let X1,…,Xn be a topological order of the variables
Set w = 1
For i = 1,...,n
  If Xi ∉ E
    Sample xi from P(Xi | pa_i)
  If Xi ∈ E
    Set xi to its observed value e[Xi]
    Set w = w · P(e[Xi] | pa_i)
Return w and x1,…,xn

Estimate P(y|e) by:

  \hat{P}(y \mid e) \approx \frac{\sum_{m=1}^{M} w[m]\,\mathbf{1}\{x[m]\langle Y\rangle = y\}}{\sum_{m=1}^{M} w[m]}
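A sketch of the likelihood weighting procedure and estimator in Python, reusing NETWORK and sample_categorical from the forward-sampling sketch (all names and the example query are illustrative):

def likelihood_weighted_sample(evidence):
    # One LW particle: evidence variables are clamped and contribute to the weight.
    x, w = {}, 1.0
    for var, parents, cpt in NETWORK:
        dist = cpt[tuple(x[p] for p in parents)]
        if var in evidence:
            x[var] = evidence[var]          # set Xi to its observed value
            w *= dist[evidence[var]]        # w = w * P(e[Xi] | pa_i)
        else:
            x[var] = sample_categorical(dist)
    return x, w

def estimate_lw(query_var, query_val, evidence, M):
    # P(y|e) ~ (sum of weights of particles with x(Y)=y) / (sum of all weights).
    num = den = 0.0
    for _ in range(M):
        x, w = likelihood_weighted_sample(evidence)
        den += w
        if x[query_var] == query_val:
            num += w
    return num / den

# For example: estimate P(i1 | s1, g2) from 10,000 weighted particles.
print(estimate_lw('I', 'i1', {'S': 's1', 'G': 'g2'}, 10000))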
Likelihood Weighting: Example

Same network and CPDs as in the forward sampling example;
evidence E = {S = s1, G = g2}.

Generating one weighted particle in topological order (D, I, G, S, L),
as built up over the slide sequence:

  Start:                 w = 1
  D ∉ E:  sample         D = d1
  I ∉ E:  sample         I = i1
  G ∈ E:  set G = g2,    w = w · P(g2 | d1, i1) = 1 · 0.2 = 0.2
  S ∈ E:  set S = s1,    w = w · P(s1 | i1) = 0.2 · 0.8 = 0.16
  L ∉ E:  sample         L = l0

Result: particle (d1, i1, g2, s1, l0) with weight w = 0.16.
Importance Sampling

Idea: to estimate a function relative to P, rather than
sampling from the distribution P, sample from another
distribution Q

  P is called the target distribution
  Q is called the proposal (or sampling) distribution
  Requirement on Q: P(x) > 0 ⟹ Q(x) > 0
    Q does not 'ignore' any non-zero-probability events of P
  In practice, performance depends on the similarity between Q and P
Unnormalized Importance Sampling

Since:

  E_{P(X)}[f(X)] = \sum_{x} P(x)\, f(x)
                 = \sum_{x} Q(x)\, f(x)\,\frac{P(x)}{Q(x)}
                 = E_{Q(X)}\!\left[f(X)\,\frac{P(X)}{Q(X)}\right]

we can estimate E_P[f(X)] by generating samples from Q and computing:

  \hat{E}_P(f) \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])\,\frac{P(x[m])}{Q(x[m])}

Can show that the estimator variance decreases with more samples
Can show that Q = P gives the lowest-variance estimator
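A small self-contained Python sketch of unnormalized importance sampling on a toy discrete distribution (the distributions and function are chosen purely for illustration):

import random

P = {0: 0.1, 1: 0.2, 2: 0.7}          # target distribution (fully known here)
Q = {0: 1/3, 1: 1/3, 2: 1/3}          # proposal: Q(x) > 0 wherever P(x) > 0
f = lambda x: x * x                    # function whose expectation we want

def sample_from(dist):
    # Invert a uniform draw against the cumulative distribution.
    u, total = random.random(), 0.0
    for value, p in dist.items():
        total += p
        if u <= total:
            return value
    return value

M = 100000
est = sum(f(x) * P[x] / Q[x] for x in (sample_from(Q) for _ in range(M))) / M
exact = sum(P[x] * f(x) for x in P)
print(est, exact)   # the IS estimate should be close to the exact expectation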
Normalized Importance Sampling

Unnormalized importance sampling assumes that P is known

Usually we know P only up to a constant: P = P'/α, where

  \alpha = \sum_{x} P'(x)

  Example: the posterior distribution P(X|e) = P(X,e)/α, where α = P(e)

We can estimate α by:

  E_{Q(X)}\!\left[\frac{P'(X)}{Q(X)}\right] = \sum_{x} Q(x)\,\frac{P'(x)}{Q(x)} = \sum_{x} P'(x) = \alpha

Thus:

  E_{P(X)}[f(X)] = \sum_{x} P(x)\, f(x)
                 = \sum_{x} Q(x)\, f(x)\,\frac{P(x)}{Q(x)}
                 = \frac{1}{\alpha} \sum_{x} Q(x)\, f(x)\,\frac{P'(x)}{Q(x)}
                 = \frac{1}{\alpha}\, E_{Q(X)}\!\left[f(X)\,\frac{P'(X)}{Q(X)}\right]
                 = \frac{E_{Q(X)}\!\left[f(X)\,P'(X)/Q(X)\right]}{E_{Q(X)}\!\left[P'(X)/Q(X)\right]}
Normalized Importance Sampling

Given M samples from Q, normalized importance sampling
estimates a function f by:

  \hat{E}(f) \approx \frac{\sum_{m=1}^{M} f(x[m])\,P'(x[m])/Q(x[m])}{\sum_{m=1}^{M} P'(x[m])/Q(x[m])}
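A sketch of the self-normalized estimator, reusing sample_from from the previous toy example (the scaled target P' and all names are illustrative):

def normalized_is_estimate(f, P_unnorm, Q, M):
    # Self-normalized IS: only P' = alpha * P needs to be evaluable pointwise.
    num = den = 0.0
    for _ in range(M):
        x = sample_from(Q)               # sample from the proposal
        w = P_unnorm(x) / Q[x]           # importance weight P'(x)/Q(x)
        num += f(x) * w
        den += w
    return num / den                     # the unknown constant alpha cancels

# Same toy target as before, but known only up to a constant (scaled by 10).
print(normalized_is_estimate(lambda x: x * x,
                             lambda x: 10 * {0: 0.1, 1: 0.2, 2: 0.7}[x],
                             {0: 1/3, 1: 1/3, 2: 1/3}, 100000))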
Importance Sampling for BayesNets

Define the mutilated network G_{E=e} as:

  Nodes X ∈ E have no parents in G_{E=e}
  Nodes X ∈ E have a CPD that assigns probability 1 to X = E[X] and 0 otherwise
  The parents and CPDs of all other variables are unchanged

Example (E = {S = s1, G = g2}):

  Original network:  D → G ← I,  I → S,  G → L
  Mutilated network: G and S lose their parents and are clamped to the evidence:
    P(G):  g0 = 0,  g1 = 0,  g2 = 1
    P(S):  s0 = 0,  s1 = 1
  The CPDs of D, I, and L are unchanged
Likelihood Weighting as IS




Target distribution: P'(X) = P(X, e)
Proposal distribution: Q(X) = P_{G_{E=e}}(X, e)
Claim: likelihood weighting is precisely normalized
importance sampling with the above distributions

Proof sketch:

  LW estimate:

    \hat{P}(y \mid e) \approx \frac{\sum_{m=1}^{M} w[m]\,\mathbf{1}\{x[m]\langle Y\rangle = y\}}{\sum_{m=1}^{M} w[m]}

  IS estimate:

    \hat{P}(y \mid e) \approx \frac{\sum_{m=1}^{M} \mathbf{1}\{x[m]\langle Y\rangle = y\}\,P'(x[m])/Q(x[m])}{\sum_{m=1}^{M} P'(x[m])/Q(x[m])}

  But w[m] in LW comes out to be precisely P'(x[m])/Q(x[m]), since

    \frac{P'(x[m])}{Q(x[m])} = \frac{\prod_i P(x_i \mid pa_i)\,\prod_j P(e_j \mid pa_j)}{\prod_i P(x_i \mid pa_i)} = \prod_j P(e_j \mid pa_j) = w[m]
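Checking this on the running example: for the particle x = (d1, i1, g2, s1, l0) with evidence e = {s1, g2},

  P'(x)/Q(x) = P(g2 | d1, i1) · P(s1 | i1) = 0.2 · 0.8 = 0.16,

exactly the weight computed in the likelihood weighting example above.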
Particle-Based Methods Overview

Full particle methods

  Sampling methods
    Forward sampling
    Importance sampling
    Markov chain Monte Carlo
  Deterministic particle generation

Distributional particles
Markov Chain Monte Carlo

General idea:

  Define a sampling process that is guaranteed to converge to
  sampling from the posterior distribution of interest
  Generate samples from the sampling process
  Estimate f(X) from the samples
Markov Chains

A Markov chain consists of

  A state space Val(X)
  Transition probabilities T(x → x') of going from state x to state x'

The distribution over subsequent states is defined as

  P^{(t+1)}(X^{(t+1)} = x') = \sum_{x \in Val(X)} P^{(t)}(X^{(t)} = x)\, T(x \to x')

[Figure: a simple Markov chain over states x1, x2, x3 with
 T(x1→x1) = 0.25, T(x1→x3) = 0.75, T(x2→x2) = 0.7, T(x2→x3) = 0.3,
 T(x3→x1) = 0.5, T(x3→x2) = 0.5]
Stationary Distribution

A distribution π(X) is a stationary distribution for a
Markov chain T if it satisfies

  \pi(X = x') = \sum_{x \in Val(X)} \pi(X = x)\, T(x \to x')

For the simple Markov chain above:

  π(x1) = 0.25·π(x1) + 0.5·π(x3)
  π(x2) = 0.7·π(x2) + 0.5·π(x3)
  π(x3) = 0.75·π(x1) + 0.3·π(x2)

with solution π(x1) = 0.2, π(x2) = 0.5, π(x3) = 0.3
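A small Python sketch that simulates this chain (transition probabilities as reconstructed above; dictionary layout and step counts are illustrative) and checks that the empirical state frequencies approach the stationary distribution:

import random

# T[x] maps each state to its distribution over next states.
T = {'x1': {'x1': 0.25, 'x3': 0.75},
     'x2': {'x2': 0.7,  'x3': 0.3},
     'x3': {'x1': 0.5,  'x2': 0.5}}

def step(x):
    # Sample the next state from T(x -> .).
    u, total = random.random(), 0.0
    for nxt, p in T[x].items():
        total += p
        if u <= total:
            return nxt
    return nxt

# Run one long chain and count state visits, discarding the initial steps
# so the chain has time to approach its stationary distribution.
counts, x = {'x1': 0, 'x2': 0, 'x3': 0}, 'x1'
for t in range(200000):
    x = step(x)
    if t >= 1000:
        counts[x] += 1
total = sum(counts.values())
print({s: round(c / total, 3) for s, c in counts.items()})
# Empirical frequencies should approach (0.2, 0.5, 0.3).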
Markov Chains & Stationary Dist.



A Markov chain is regular if there is a k such that, for
every x, x' ∈ Val(X), the probability of getting from x to
x' in exactly k steps is greater than zero
Theorem: if a finite-state Markov chain T is regular, then
it has a unique stationary distribution
Goal: define a Markov chain whose stationary
distribution is P(X|e)
Gibbs Sampling

States

  Legal (= consistent with the evidence) assignments to the variables

Transition probability

  T = T1 ∘ ... ∘ Tk (apply each local transition model in turn)
  Ti = T((u_i, x_i) → (u_i, x'_i)) = P(x'_i | u_i)

Claim: P(X|e) is a stationary distribution of this chain
Gibbs Sampling

The Gibbs-sampling Markov chain is regular if:

  Bayesian networks: all CPDs are strictly positive
  Markov networks: all clique potentials are strictly positive

Gibbs sampling procedure (one sample):

  Set x[m] = x[m−1]
  For each variable Xi ∈ X − E
    Set u_i = x[m](X − Xi)
    Sample from P(Xi | u_i)
    Save the sampled value as x[m](Xi)
  Return x[m]

Note: P(Xi | u_i) is easily computed from the Markov blanket of Xi
Gibbs Sampling


How do we evaluate P(Xi | x1,...,x_{i−1},x_{i+1},...,xn)?

Let Y1,…,Yk be the children of Xi
  By the definition of the Markov blanket MB_i, the parents of each Yj are contained in MB_i ∪ {Xi}

It is easy to show that

  P(x_i \mid \mathcal{X} - X_i) = P(x_i \mid MB_i)
    = \frac{P(x_i \mid pa_i)\,\prod_{j} P(y_j \mid pa_{Y_j})}
           {\sum_{x'_i} P(x'_i \mid pa_i)\,\prod_{j} P(y_j \mid pa_{Y_j}[X_i = x'_i])}
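A sketch of Gibbs sampling for the example network, reusing NETWORK, sample_categorical, and likelihood_weighted_sample from the earlier sketches (all names and the query are illustrative); the conditional P(Xi | MB_i) is computed from the Markov blanket formula above:

def gibbs_conditional(var, x):
    # P(var | MB(var)), obtained by renormalizing P(x_i|pa_i) * prod_j P(y_j|pa_j).
    var_idx = {name: (parents, cpt) for name, parents, cpt in NETWORK}
    parents, cpt = var_idx[var]
    children = [(n, ps, c) for n, ps, c in NETWORK if var in ps]
    scores = {}
    for val in cpt[tuple(x[p] for p in parents)]:
        x_try = dict(x, **{var: val})
        score = cpt[tuple(x_try[p] for p in parents)][val]
        for child, cps, ccpt in children:
            score *= ccpt[tuple(x_try[p] for p in cps)][x_try[child]]
        scores[val] = score
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}

def gibbs_sweep(x, evidence):
    # One Gibbs step: resample every non-evidence variable from P(Xi | MB_i).
    for var, _, _ in NETWORK:
        if var not in evidence:
            x[var] = sample_categorical(gibbs_conditional(var, x))
    return x

# Start from a likelihood-weighted particle (any legal assignment would do),
# run the chain, and average indicator values after an initial burn-in period.
evidence = {'S': 's1', 'G': 'g2'}
x, _ = likelihood_weighted_sample(evidence)
count = 0
for t in range(20000):
    x = gibbs_sweep(x, evidence)
    if t >= 1000 and x['I'] == 'i1':
        count += 1
print(count / 19000)   # approx. P(i1 | s1, g2)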
Gibbs Sampling in Practice

We need to wait until the burn-in time has ended – the
number of steps until we start taking samples from the chain

  Wait until the sampling distribution is close to the stationary distribution
  It is hard to give general bounds on this 'mixing time'
  Once the burn-in time has ended, all samples are from the
  stationary distribution
  We can assess the burn-in time by comparing two chains
  Note: even after the burn-in time, successive samples are correlated

Since no theoretical guarantees exist, the application of
Markov chains is somewhat of an art
Sampling Strategy


How do we collect the samples?

Strategy I:
  Run M chains, each for N steps, with each run starting from a different state
  Return the last state of each run
  → M samples, one from each chain

Strategy II:
  Run one chain for a long time
  After some "burn-in" period, take a sample every fixed number of steps
  → M samples from one chain
Comparing Strategies
Strategy I:
  Better chance of "covering" the space of points, especially if
  the chain is slow to reach the stationary distribution
  Have to perform "burn-in" steps for each chain

Strategy II:
  Perform "burn-in" only once
  Samples might be correlated (although only weakly)

Hybrid strategy:
  Run several chains, and take a few samples from each
  Combines the benefits of both strategies
Particle-Based Methods Overview

Full particle methods

  Sampling methods
    Forward sampling
    Importance sampling
    Markov chain Monte Carlo
  Deterministic particle generation

Distributional particles
Deterministic Search Methods

Idea: if the distribution is dominated by a small set of
instances, it suffices to consider only those instances when
approximating the function

For generated instances x[1],...,x[M], the estimate is:

  \hat{E}_P(f) \approx \sum_{m=1}^{M} f(x[m])\, P(x[m])

  Note: we can obtain lower and upper bounds by examining
  the part of the probability mass covered by the x[m]

Key question: how can we enumerate highly probable
instantiations?

  Note: even finding the single most probable instantiation (MPE) is
  itself an NP-hard problem
Particle-Based Methods Overview

Full particle methods

  Sampling methods
    Forward sampling
    Importance sampling
    Markov chain Monte Carlo
  Deterministic particle generation

Distributional particles
Distributional Particles

Idea: use partial assignments to a subset of the
network variables, combined with a closed-form
representation of a distribution over the rest

  Xp – variables whose assignments are defined by the particles
  Xd – variables over which we maintain a distribution
  Distributional particles are also known as Rao-Blackwellized particles

Estimation proceeds as

  E_{P(\cdot \mid e)}(f) = \sum_{x_p, x_d} P(x_p, x_d \mid e)\, f(x_p, x_d)
                         = \sum_{x_p} P(x_p \mid e) \sum_{x_d} P(x_d \mid x_p, e)\, f(x_p, x_d)
                         = \sum_{x_p} P(x_p \mid e)\; E_{P(X_d \mid x_p, e)}\!\left[f(x_p, X_d)\right]

  We can use any sampling procedure to sample Xp
  We assume that we can compute the internal expectation efficiently
Distributional Particles

Distributional particles define a continuum

  For |Xp| = |X − E| we have full particles, and thus full sampling
  For |Xp| = 0 we are performing full exact inference
Distributional Likelihood Weighting

Sample over a subset of the variables

The estimator becomes:

  \hat{P}(y \mid e) \approx \frac{\sum_{m=1}^{M} w[m]\; E_{P(X_d \mid x_p[m],\, e)}\!\left[\mathbf{1}\{x[m]\langle Y\rangle = y\}\right]}{\sum_{m=1}^{M} w[m]}
Distributional Gibbs Sampling

Sample only a subset of the variables; the transition probability
is as before, T((u_i, x_i) → (u_i, x'_i)) = P(x'_i | u_i), but computing
it may now require inference