Approximate Inference:
Particle-Based Methods
Eran Segal
Weizmann Institute
Inference Complexity Summary
NP-hard:
Exact inference
Approximate inference with relative error
Approximate inference with absolute error < 0.5 (given evidence)
Hopeless?
No: we will see many network structures that have provably
efficient algorithms, and cases where approximate inference
works efficiently with high accuracy
Approximate Inference
Particle-based methods
Create instances that represent part of the probability mass
Random sampling
Deterministically search for high probability assignments
Global methods
Approximate the distribution in its entirety
Use exact inference on a simpler (but close) network
Perform inference in the original network but approximate some
steps of the process (e.g., ignore certain computations or
approximate some intermediate results)
Particle-Based Methods
Particle generation process
Generate particles deterministically
Generate particles by sampling
Particle definition
Full particles – complete assignments to all variables
Distributional particles – assignment to part of the variables
General framework
Generate samples x[1],...,x[M]
Estimate the function by $\hat{E}_P(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m])$
Particle-Based Methods Overview
Full particle methods
Sampling methods
Forward sampling
Importance sampling
Markov chain Monte Carlo
Deterministic particle generation
Distributional particles
Forward Sampling
Generate random samples from P(X)
Use the Bayesian network to generate samples
Estimate the function by $\hat{E}_P(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m])$
Forward Sampling
Example network with variables D, I, G, S, L, where D and I are the parents of G, I is the parent of S, and G is the parent of L; its CPDs are:
P(D): d0 0.4, d1 0.6
P(I): i0 0.7, i1 0.3
P(G | D,I):
            g0     g1     g2
  d0, i0    0.3    0.4    0.3
  d0, i1    0.05   0.25   0.7
  d1, i0    0.9    0.08   0.02
  d1, i1    0.5    0.3    0.2
P(S | I):
            s0     s1
  i0        0.95   0.05
  i1        0.2    0.8
P(L | G):
            l0     l1
  g0        0.1    0.9
  g1        0.4    0.6
  g2        0.99   0.01
Samples assign the variables one at a time in topological order (D, I, G, S, L).
Forward Sampling (example walk-through)
Using the CPDs above, one sample is generated step by step:
Sample D from P(D): D = d1
Sample I from P(I): I = i1
Sample G from P(G | d1, i1): G = g2
Sample S from P(S | i1): S = s1
Sample L from P(L | g2): L = l0
The completed sample is (d1, i1, g2, s1, l0).
Forward Sampling
Let X1, …,Xn be a topological order of the variables
For i = 1,...,n
Sample xi from P(Xi | pai )
(Note: since Pa_i ⊆ {X1,…,Xi-1}, these variables have already been assigned values)
return x1, …,xn
Estimate a function by: $\hat{E}_P(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m])$
Estimate P(y) by: $\hat{P}(y) = \frac{1}{M}\sum_{m=1}^{M} \mathbf{1}\{x[m](Y) = y\}$
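As a concrete illustration, here is a minimal Python sketch of forward sampling on the example network above (the CPDs are copied from the tables; helper names such as sample_from and forward_sample are ours, not part of any library):

import random

# CPDs of the example network, copied from the tables above
p_D = {'d0': 0.4, 'd1': 0.6}
p_I = {'i0': 0.7, 'i1': 0.3}
p_G = {('d0', 'i0'): {'g0': 0.3,  'g1': 0.4,  'g2': 0.3},
       ('d0', 'i1'): {'g0': 0.05, 'g1': 0.25, 'g2': 0.7},
       ('d1', 'i0'): {'g0': 0.9,  'g1': 0.08, 'g2': 0.02},
       ('d1', 'i1'): {'g0': 0.5,  'g1': 0.3,  'g2': 0.2}}
p_S = {'i0': {'s0': 0.95, 's1': 0.05}, 'i1': {'s0': 0.2, 's1': 0.8}}
p_L = {'g0': {'l0': 0.1,  'l1': 0.9},
       'g1': {'l0': 0.4,  'l1': 0.6},
       'g2': {'l0': 0.99, 'l1': 0.01}}

def sample_from(cpd):
    # Draw one value from a {value: probability} dictionary.
    r, acc = random.random(), 0.0
    for value, prob in cpd.items():
        acc += prob
        if r <= acc:
            return value
    return value  # guard against floating-point round-off

def forward_sample():
    # Sample the variables in topological order D, I, G, S, L.
    d = sample_from(p_D)
    i = sample_from(p_I)
    g = sample_from(p_G[(d, i)])
    s = sample_from(p_S[i])
    l = sample_from(p_L[g])
    return {'D': d, 'I': i, 'G': g, 'S': s, 'L': l}

# Estimate P(G = g2) as the fraction of samples with G = g2.
M = 10000
samples = [forward_sample() for _ in range(M)]
print(sum(x['G'] == 'g2' for x in samples) / M)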
Forward Sampling
Sampling cost
Per-variable cost: O(log |Val(Xi)|)
Sample uniformly in [0,1]
Find the appropriate value among the |Val(Xi)| values that Xi can take (binary search over the cumulative distribution)
Per-sample cost: O(n log d), where d = max_i |Val(Xi)|
Total cost: O(M n log d)
Number of samples needed
To achieve relative error < ε with probability 1−δ, we need
$M \ge \frac{3\ln(2/\delta)}{P(y)\,\varepsilon^2}$
Note that the number of samples grows inversely with P(y)
For small P(y) we need many samples; otherwise we are likely to report P(y)=0
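A quick illustration of this bound, with example values of P(y), ε, and δ that are purely illustrative:

import math

# Illustrative values: P(y) = 0.01, relative error eps = 0.1, failure probability delta = 0.05
p_y, eps, delta = 0.01, 0.1, 0.05
M = 3 * math.log(2 / delta) / (p_y * eps ** 2)
print(math.ceil(M))  # about 110,667 samples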
Rejection Sampling
In general we need to compute P(Y|e)
We can do so with rejection sampling
Generate samples as in forward sampling
Reject samples in which E ≠ e
Estimate the function from the accepted samples
Problem: if the evidence is unlikely (e.g., P(e)=0.001),
then most generated samples are rejected
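A minimal sketch of rejection sampling for a conditional query, reusing forward_sample() from the sketch above; the evidence E = {S=s1, G=g2} is the one used in the likelihood-weighting example later:

def rejection_sample(evidence, M):
    # Keep only the forward samples that agree with the evidence.
    accepted = []
    for _ in range(M):
        x = forward_sample()
        if all(x[var] == val for var, val in evidence.items()):
            accepted.append(x)
    return accepted

accepted = rejection_sample({'S': 's1', 'G': 'g2'}, 50000)
if accepted:  # most samples are rejected when the evidence is unlikely
    print(len(accepted), sum(x['I'] == 'i1' for x in accepted) / len(accepted))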
Particle-Based Methods Overview
Full particle methods
Sampling methods
Forward sampling
Importance sampling
Markov chain Monte Carlo
Deterministic particle generation
Distributional particles
Likelihood Weighting
Can we ensure that all of our samples satisfy E=e?
Solution: when sampling a variable X ∈ E, set X to its observed value
Problem: we are trying to sample from the posterior P(X|e), but our
sampling process still samples from P(X)
Solution: weight each sample by the joint probability of setting each
evidence variable to its observed value
In effect, we are sampling from P(X,e) which we can normalize to
then obtain P(X|e) for a query of interest
Likelihood Weighting
Let X1, …,Xn be a topological order of the variables
Initialize w = 1
For i = 1,...,n
If Xi ∉ E: sample xi from P(Xi | pai)
If Xi ∈ E: set xi = e[Xi] and set w = w · P(e[Xi] | pai)
return w and x1, …,xn
Estimate P(y | e) by:
$\hat{P}(y \mid e) = \frac{\sum_{m=1}^{M} w[m]\,\mathbf{1}\{x[m](Y) = y\}}{\sum_{m=1}^{M} w[m]}$
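A minimal sketch of likelihood weighting on the same network, reusing the CPD dictionaries and sample_from() from the forward-sampling sketch; the evidence E = {S=s1, G=g2} matches the walk-through that follows:

def weighted_sample(evidence):
    # One pass of likelihood weighting: evidence variables are clamped and
    # contribute to the weight, all other variables are sampled.
    x, w = {}, 1.0

    def assign(var, cpd):
        nonlocal w
        if var in evidence:
            x[var] = evidence[var]
            w *= cpd[x[var]]          # multiply in P(observed value | parents)
        else:
            x[var] = sample_from(cpd)

    assign('D', p_D)
    assign('I', p_I)
    assign('G', p_G[(x['D'], x['I'])])
    assign('S', p_S[x['I']])
    assign('L', p_L[x['G']])
    return x, w

# Estimate P(I = i1 | S = s1, G = g2) as a weighted fraction of the samples.
evidence = {'S': 's1', 'G': 'g2'}
pairs = [weighted_sample(evidence) for _ in range(10000)]
num = sum(w for x, w in pairs if x['I'] == 'i1')
den = sum(w for x, w in pairs)
print(num / den)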
Likelihood Weighting (example walk-through)
Evidence: E = {S=s1, G=g2}; the CPDs are those of the network above.
Initialize w = 1
Sample D from P(D): D = d1 (w unchanged, w = 1)
Sample I from P(I): I = i1 (w unchanged, w = 1)
G ∈ E: set G = g2 and w = w · P(g2 | d1, i1) = 1 · 0.2 = 0.2
S ∈ E: set S = s1 and w = w · P(s1 | i1) = 0.2 · 0.8 = 0.16
Sample L from P(L | g2): L = l0 (w unchanged)
The completed sample is (d1, i1, g2, s1, l0) with weight w = 0.16.
Importance Sampling
Idea: to estimate a function relative to P, rather than
sampling from the distribution P, sample from another
distribution Q
P is called the target distribution
Q is called the proposal or the sampling distribution
Requirement from Q: P(x) > 0 ⇒ Q(x) > 0
Q does not ‘ignore’ any non-zero probability events in P
In practice, performance depends on similarity between Q and P
Unnormalized Importance Sampling
Since:
$E_{P(X)}[f(X)] = \sum_{x} P(x)\,f(x) = \sum_{x} Q(x)\,f(x)\,\frac{P(x)}{Q(x)} = E_{Q(X)}\!\left[f(X)\,\frac{P(X)}{Q(X)}\right]$
we can estimate E_P[f(X)] by generating samples x[1],...,x[M] from Q
and estimating:
$\hat{E}_P(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m])\,\frac{P(x[m])}{Q(x[m])}$
Can show that estimator variance decreases with
more samples
Can show that Q=P is the lowest variance estimator
Normalized Importance Sampling
Unnormalized importance sampling assumes that P is known
Usually we know P only up to a constant: P = P'/α, where α = Σ_x P'(x)
Example: the posterior distribution P(X|e) = P(X,e)/α, where α = P(e)
Note that:
$E_{Q(X)}\!\left[\frac{P'(X)}{Q(X)}\right] = \sum_{x} Q(x)\,\frac{P'(x)}{Q(x)} = \sum_{x} P'(x) = \alpha$
Thus:
$E_{P(X)}[f(X)] = \sum_{x} P(x)\,f(x) = \frac{1}{\alpha}\sum_{x} Q(x)\,f(x)\,\frac{P'(x)}{Q(x)} = \frac{E_{Q(X)}[f(X)\,P'(X)/Q(X)]}{E_{Q(X)}[P'(X)/Q(X)]}$
Normalized Importance Sampling
Given M samples from Q, normalized sampling
estimates function f by:
$\hat{E}_P(f) = \frac{\sum_{m=1}^{M} f(x[m])\,P'(x[m])/Q(x[m])}{\sum_{m=1}^{M} P'(x[m])/Q(x[m])}$
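A minimal, self-contained sketch of normalized importance sampling on a small discrete example; the target P' and the uniform proposal Q here are made up purely for illustration:

import random

values = [0, 1, 2, 3]
p_unnorm = {0: 1.0, 1: 4.0, 2: 2.0, 3: 1.0}  # P' = alpha * P; alpha is not used by the sampler
q = {v: 0.25 for v in values}                # proposal Q: uniform over the four values

def f(x):
    return x * x  # any function whose expectation under P we want

M = 20000
samples = [random.choice(values) for _ in range(M)]   # draws from Q
weights = [p_unnorm[x] / q[x] for x in samples]       # P'(x)/Q(x)
estimate = sum(w * f(x) for x, w in zip(samples, weights)) / sum(weights)

# Exact value for comparison: E_P[f] = sum_x P(x) f(x)
alpha = sum(p_unnorm.values())
exact = sum(p_unnorm[x] / alpha * f(x) for x in values)
print(estimate, exact)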
Importance Sampling for BayesNets
Define the mutilated network G_{E=e} as:
Nodes X ∈ E have no parents in G_{E=e}
Nodes X ∈ E have a CPD that is 1 for X = e[X] and 0 otherwise
The parents and CPDs of all other variables are unchanged
Example with E = {S=s1, G=g2}: in the mutilated network, D, I, and L keep their original CPDs, while G and S become root nodes with deterministic CPDs:
P(D): d0 0.4, d1 0.6
P(I): i0 0.7, i1 0.3
P(G): g0 0, g1 0, g2 1
P(S): s0 0, s1 1
P(L | G):
            l0     l1
  g0        0.1    0.9
  g1        0.4    0.6
  g2        0.99   0.01
Likelihood Weighting as IS
Target distribution: P'(X) = P(X, e)
Proposal distribution: Q(X) = P_{G_{E=e}}(X)
Claim: Likelihood weighting is precisely normalized
importance sampling with the above distributions
Proof sketch:
LW estimate:
$\hat{P}(y \mid e) = \frac{\sum_{m=1}^{M} w[m]\,\mathbf{1}\{x[m](Y) = y\}}{\sum_{m=1}^{M} w[m]}$
IS estimate:
$\hat{P}(y \mid e) = \frac{\sum_{m=1}^{M} \mathbf{1}\{x[m](Y) = y\}\; P'(x[m])/Q(x[m])}{\sum_{m=1}^{M} P'(x[m])/Q(x[m])}$
But w[m] in LW is precisely P'(x[m])/Q(x[m]), since
$w[m] = \prod_{i} P(e_i \mid \mathrm{pa}(E_i)) = \frac{\prod_{i} P(x_i \mid \mathrm{pa}(X_i))\,\prod_{i} P(e_i \mid \mathrm{pa}(E_i))}{\prod_{i} P(x_i \mid \mathrm{pa}(X_i))} = \frac{P'(x[m])}{Q(x[m])}$
(the numerator is P(x[m], e), and the denominator is the probability of x[m] in the mutilated network)
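A small numeric check of this claim on the sample from the likelihood-weighting walk-through, reusing the CPD dictionaries from the forward-sampling sketch:

x = {'D': 'd1', 'I': 'i1', 'G': 'g2', 'S': 's1', 'L': 'l0'}
# P'(x) = P(x, e): the product of all CPD entries in the original network
p_target = (p_D[x['D']] * p_I[x['I']] * p_G[(x['D'], x['I'])][x['G']]
            * p_S[x['I']][x['S']] * p_L[x['G']][x['L']])
# Q(x): the same product in the mutilated network, where G and S have probability 1
q_proposal = p_D[x['D']] * p_I[x['I']] * p_L[x['G']][x['L']]
print(p_target / q_proposal)  # 0.16 (up to round-off), exactly the LW weight of this sample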
Particle-Based Methods Overview
Full particle methods
Sampling methods
Forward sampling
Importance sampling
Markov chain Monte Carlo
Deterministic particle generation
Distributional particles
Markov Chain Monte Carlo
General idea:
Define a sampling process that is guaranteed to converge to
taking samples from the posterior distribution of interest
Generate samples from the sampling process
Estimate f(X) from the samples
Markov Chains
A Markov chain consists of
A state space Val(X)
Transition probability T(x → x') of going from state x to x'
Distribution over subsequent states is defined as
$P^{(t+1)}(X^{(t+1)} = x') = \sum_{x \in Val(X)} P^{(t)}(X^{(t)} = x)\, T(x \to x')$
Example (simple Markov chain): three states {x1, x2, x3} with transitions
T(x1→x1)=0.25, T(x1→x3)=0.75, T(x2→x2)=0.7, T(x2→x3)=0.3, T(x3→x1)=0.5, T(x3→x2)=0.5
Stationary Distribution
A distribution π(X) is a stationary distribution for a
Markov chain T if it satisfies
$\pi(X = x') = \sum_{x \in Val(X)} \pi(X = x)\, T(x \to x')$
For the simple Markov chain above, the stationary equations are
$\pi(x_1) = 0.25\,\pi(x_1) + 0.5\,\pi(x_3)$
$\pi(x_2) = 0.7\,\pi(x_2) + 0.5\,\pi(x_3)$
$\pi(x_3) = 0.75\,\pi(x_1) + 0.3\,\pi(x_2)$
with solution π(x1) = 0.2, π(x2) = 0.5, π(x3) = 0.3.
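A small, self-contained sketch that recovers this stationary distribution by repeatedly applying the transition model of the 3-state chain (power iteration):

# Transition model of the 3-state chain from the example above
T = {'x1': {'x1': 0.25, 'x3': 0.75},
     'x2': {'x2': 0.7,  'x3': 0.3},
     'x3': {'x1': 0.5,  'x2': 0.5}}

pi = {'x1': 1.0, 'x2': 0.0, 'x3': 0.0}  # arbitrary starting distribution
for _ in range(100):
    nxt = {s: 0.0 for s in T}
    for x, row in T.items():
        for x_next, prob in row.items():
            nxt[x_next] += pi[x] * prob   # pi^(t+1)(x') = sum_x pi^(t)(x) T(x -> x')
    pi = nxt
print(pi)  # converges to {'x1': 0.2, 'x2': 0.5, 'x3': 0.3}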
Markov Chains & Stationary Dist.
A Markov chain is regular if there is a k such that for
every x, x' ∈ Val(X), the probability of getting from x to
x' in exactly k steps is greater than zero
Theorem: A finite state Markov chain T has a unique
stationary distribution if and only if it is regular
Goal: Define a Markov chain whose stationary
distribution is P(X|e)
Gibbs Sampling
States: legal (= consistent with the evidence) assignments to the variables
Transition probability: T = T1 ∘ ... ∘ Tk, where
Ti = T((u_i, x_i) → (u_i, x'_i)) = P(x'_i | u_i)
(u_i denotes the current assignment to all variables other than Xi)
Claim: P(X|e) is a stationary distribution of this chain
Gibbs Sampling
Gibbs-sampling Markov chain is regular if:
Bayesian networks: all CPDs are strictly positive
Markov networks: all clique potentials are strictly positive
Gibbs sampling procedure (one sample)
Set x[m] = x[m-1]
For each variable Xi ∈ X − E
Set u_i = x[m](X − Xi)
Sample a value from P(Xi | u_i)
Save the sampled value: x[m](Xi) = sampled value
Return x[m]
Note: P(Xi | u_i) is easily computed from Xi's Markov blanket
Gibbs Sampling
How do we evaluate P(Xi | x1,...,xi-1,xi+1,...,xn)?
Let Y1, …,Yk be the children of Xi
By the definition of MB_i, the parents of each Yj are in MB_i ∪ {Xi}
It is easy to show that
$P(x_i \mid x_{-i}) = P(x_i \mid \mathrm{MB}_i) = \frac{P(x_i \mid \mathrm{pa}_i)\,\prod_{j} P(y_j \mid \mathrm{pa}_{Y_j})}{\sum_{x_i'} P(x_i' \mid \mathrm{pa}_i)\,\prod_{j} P(y_j \mid \mathrm{pa}_{Y_j}[X_i \leftarrow x_i'])}$
where $\mathrm{pa}_{Y_j}[X_i \leftarrow x_i']$ denotes Yj's parent assignment with Xi set to x_i'
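A minimal sketch of Gibbs sampling for P(I | S=s1, G=g2) on the example network; the Markov-blanket conditionals are built directly from the CPD dictionaries and sample_from() of the forward-sampling sketch:

def normalize(scores):
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}

def gibbs_step(x):
    # Resample each non-evidence variable given its Markov blanket.
    # D: P(d | MB) is proportional to P(d) * P(g | d, i)
    x['D'] = sample_from(normalize(
        {d: p_D[d] * p_G[(d, x['I'])][x['G']] for d in p_D}))
    # I: P(i | MB) is proportional to P(i) * P(g | d, i) * P(s | i)
    x['I'] = sample_from(normalize(
        {i: p_I[i] * p_G[(x['D'], i)][x['G']] * p_S[i][x['S']] for i in p_I}))
    # L: P(l | MB) is proportional to P(l | g)
    x['L'] = sample_from(p_L[x['G']])
    return x

# Evidence G = g2, S = s1 stays fixed; D, I, L are resampled in every sweep.
x = {'D': 'd0', 'I': 'i0', 'G': 'g2', 'S': 's1', 'L': 'l0'}
burn_in, M, count = 500, 5000, 0
for t in range(burn_in + M):
    x = gibbs_step(x)
    if t >= burn_in and x['I'] == 'i1':
        count += 1
print(count / M)  # estimate of P(I = i1 | S = s1, G = g2)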
Gibbs Sampling in Practice
We need to wait until the burn-in time has ended –
number of steps until we take samples from the chain
Wait until the sampling distribution is close to the stationary distribution
It is hard to provide general bounds on this 'mixing time'
Once the burn-in time has ended, all subsequent samples are
(approximately) from the stationary distribution
We can evaluate burn-in time by comparing two chains
Note: after the burn-in time, samples are correlated
Since no theoretical guarantees exist, application of
Markov chains is somewhat of an art
Sampling Strategy
How do we collect the samples?
Strategy I:
Run M chains, each starting from a different state
Run each chain for N steps and return its last state
Yields M samples, one per chain
Strategy II:
Run one chain for a long time
After some "burn-in" period, collect a sample every fixed
number of steps
Yields M samples from one chain
Comparing Strategies
Strategy I:
Better chance of “covering” the space of points, especially if
the chain is slow to reach the stationary distribution
Have to perform “burn in” steps for each chain
Strategy II:
Perform “burn in” only once
Samples might be correlated (although only weakly)
Hybrid strategy:
Run several chains and collect a few samples from each
Combines the benefits of both strategies
Particle-Based Methods Overview
Full particle methods
Sampling methods
Forward sampling
Importance sampling
Markov chain Monte Carlo
Deterministic particle generation
Distributional particles
Deterministic Search Methods
Idea: if the distribution is dominated by a small set of
instances, it suffices to consider only them for
approximating the function
For the instances x[1],...,x[M] that we generate, the estimate is:
$\hat{E}_P(f) = \sum_{m=1}^{M} P(x[m])\, f(x[m])$
Note: we can obtain lower and upper bounds by examining
the part of the probability mass covered by the x[m]
Key question: how can we enumerate highly probable
instantiations?
Note: the single most probable instantiation is MPE which
itself is an NP-hard problem
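As a toy illustration, the example network is small enough (48 joint assignments) to enumerate exhaustively; here is a minimal sketch that keeps the K most probable assignments and bounds P(I=i1) from the retained mass, reusing the CPD dictionaries from the forward-sampling sketch (a real application would need a smarter enumeration of high-probability instantiations):

from itertools import product

def joint(d, i, g, s, l):
    # Joint probability of a full assignment in the example network
    return p_D[d] * p_I[i] * p_G[(d, i)][g] * p_S[i][s] * p_L[g][l]

assignments = [((d, i, g, s, l), joint(d, i, g, s, l))
               for d, i, g, s, l in product(p_D, p_I, ['g0', 'g1', 'g2'],
                                            ['s0', 's1'], ['l0', 'l1'])]
assignments.sort(key=lambda pair: pair[1], reverse=True)

K = 10
top = assignments[:K]
covered = sum(p for _, p in top)  # probability mass covered by the K particles
lower = sum(p for (d, i, g, s, l), p in top if i == 'i1')
upper = lower + (1 - covered)     # the uncovered mass could all have I = i1
print(lower, upper, covered)      # bounds on P(I = i1) from the retained mass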
Particle-Based Methods Overview
Full particle methods
Sampling methods
Forward sampling
Importance sampling
Markov chain Monte Carlo
Deterministic particle generation
Distributional particles
Distributional Particles
Idea: use partial assignments to a subset of the
network variables, combined with closed form
representation of a distribution over the rest
Xp – Variables whose assignments are defined by particles
Xd – Variables over which we maintain a distribution
Distributional particles are also known as Rao-Blackwellized particles
Estimation proceeds as
$E_{P(X \mid e)}[f] = \sum_{x_p, x_d} P(x_p, x_d \mid e)\, f(x_p, x_d)$
$= \sum_{x_p} P(x_p \mid e) \sum_{x_d} P(x_d \mid x_p, e)\, f(x_p, x_d)$
$= \sum_{x_p} P(x_p \mid e)\; E_{P(X_d \mid x_p, e)}[f(x_p, X_d)]$
We can use any sampling procedure to sample Xp
We assume that we can compute the internal expectation efficiently
Distributional Particles
Distributional particles define a continuum
For |Xp|=|X-E| we have full particles and thus full sampling
For |Xp|= 0 we are performing full exact inference
Distributional Likelihood Weighting
Sample over a subset of the variables
$\hat{P}(y \mid e) = \frac{\sum_{m=1}^{M} w[m]\; E_{P(X_d \mid x_p[m],\, e)}\big[\mathbf{1}\{x(Y) = y\}\big]}{\sum_{m=1}^{M} w[m]}$
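A minimal sketch of one simple instantiation of this scheme on the example network, with particles over Xp = {D} and with I, L handled in closed form; the particular weighting used here, w = P(e | d) computed by summing out I, is our choice for this sketch rather than the only possibility. It reuses the CPD dictionaries and sample_from() from the forward-sampling sketch:

def p_evidence_given_d(d, g, s):
    # P(G=g, S=s | D=d), obtained by summing out I in closed form
    return sum(p_I[i] * p_G[(d, i)][g] * p_S[i][s] for i in p_I)

def p_i1_given_d_and_evidence(d, g, s):
    # P(I=i1 | D=d, G=g, S=s) via Bayes' rule over the two values of I
    num = p_I['i1'] * p_G[(d, 'i1')][g] * p_S['i1'][s]
    return num / p_evidence_given_d(d, g, s)

g, s, M = 'g2', 's1', 5000
num = den = 0.0
for _ in range(M):
    d = sample_from(p_D)                 # particle over Xp = {D}
    w = p_evidence_given_d(d, g, s)      # weight computed exactly over Xd
    num += w * p_i1_given_d_and_evidence(d, g, s)
    den += w
print(num / den)  # estimate of P(I = i1 | S = s1, G = g2)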
Distributional Gibbs Sampling
Sample only a subset of the variables; the transition probability
is as before, Ti = T((u_i, x_i) → (u_i, x'_i)) = P(x'_i | u_i), but
computing it may now require inference over the distributional variables