
Examples comparing Importance Sampling and
the Metropolis algorithm
Federico Bassetti
Università degli Studi di Pavia
Torino, 29 March 2006
with Persi Diaconis, Stanford
Summary

- Introduction: Metropolis vs Importance Sampling
- Diagonalization and Variance
- Example 1: the Hypercube
- Independence Sampling
- Example 2: Monotone Paths
- Open problems

Introduction

- X a finite set
- π(x) a probability on X
- K(x, y) a reversible Markov chain on X with stationary distribution σ(x) > 0 for all x in X
- f : X → R

We want to approximate
\[ \mu = \sum_{x} f(x)\,\pi(x). \]
Two classical procedures are available.

Metropolis

Change the output of the K chain to have stationary distribution π by constructing
\[
M(x, y) =
\begin{cases}
K(x, y)\,A(x, y) & x \neq y \\[4pt]
K(x, x) + \sum_{z \neq x} K(x, z)\,\bigl(1 - A(x, z)\bigr) & x = y
\end{cases}
\]
with
\[ A(x, y) := \min\left( \frac{\pi(y)\,K(y, x)}{\pi(x)\,K(x, y)},\; 1 \right). \]

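For a chain given as a row-stochastic matrix, this construction can be sketched in a few lines of code (a minimal illustration, not from the slides; the helper name `metropolis_kernel` and the small random example are mine):

```python
import numpy as np

def metropolis_kernel(K, pi):
    """Turn a proposal chain K (row-stochastic matrix) into a chain M with
    stationary distribution pi, using A(x,y) = min(pi(y)K(y,x)/(pi(x)K(x,y)), 1)."""
    m = len(pi)
    M = np.zeros((m, m))
    for x in range(m):
        for y in range(m):
            if x != y and K[x, y] > 0:
                A = min(pi[y] * K[y, x] / (pi[x] * K[x, y]), 1.0)
                M[x, y] = K[x, y] * A
        # rejected mass stays at x: K(x,x) + sum_{z != x} K(x,z) (1 - A(x,z))
        M[x, x] = 1.0 - M[x].sum()
    return M

# tiny check on a random 4-state proposal: pi is stationary for M
rng = np.random.default_rng(0)
K = rng.random((4, 4)); K /= K.sum(axis=1, keepdims=True)
pi = np.array([0.1, 0.2, 0.3, 0.4])
M = metropolis_kernel(K, pi)
assert np.allclose(pi @ M, pi)
```

The check works because the construction satisfies detailed balance: π(x)M(x, y) = min(π(y)K(y, x), π(x)K(x, y)) is symmetric in x and y.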
Metropolis

Generate Y_1 from π and then Y_2, …, Y_N from M(x, y), and take
\[ \hat{\mu}_M = \frac{1}{N} \sum_{i=1}^{N} f(Y_i). \]

Importance sampling

Generate X_1 from σ and then X_2, …, X_N from K(x, y), and take
\[ \hat{\mu}_I = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(X_i)}{\sigma(X_i)}\, f(X_i) \]
or the self-normalized version
\[ \tilde{\mu}_I = \frac{\displaystyle\sum_{i=1}^{N} \frac{\pi(X_i)}{\sigma(X_i)}\, f(X_i)}{\displaystyle\sum_{i=1}^{N} \frac{\pi(X_i)}{\sigma(X_i)}}. \]

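In code, the three estimates look like this (a minimal sketch, assuming the trajectories Y_1, …, Y_N and X_1, …, X_N have already been generated as above; the function names are mine):

```python
import numpy as np

def metropolis_estimate(f, Y):
    """mu_hat_M: plain average of f along the Metropolis trajectory."""
    return np.mean([f(y) for y in Y])

def importance_estimate(f, X, pi, sigma):
    """mu_hat_I: reweight the K-trajectory by pi/sigma."""
    w = np.array([pi(x) / sigma(x) for x in X])
    fx = np.array([f(x) for x in X])
    return np.mean(w * fx)

def self_normalized_estimate(f, X, pi, sigma):
    """mu_tilde_I: ratio (self-normalized) importance sampling estimate.
    Here pi/sigma only needs to be known up to a multiplicative constant."""
    w = np.array([pi(x) / sigma(x) for x in X])
    fx = np.array([f(x) for x in X])
    return np.sum(w * fx) / np.sum(w)
```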
Importance sampling

Both μ̂_M and μ̂_I take the output of the Markov chain K and reweight it to get an unbiased estimate of μ. The work involved is comparable, and it is natural to ask which estimate is better. Throughout:

(A) we do not worry about the "burn-in" problem: Y_1 ∼ π and X_1 ∼ σ;
(B) we take K_IS = K_M = K.

Variance computation

- Let P(x, y) be a reversible Markov chain on the finite set X with stationary distribution p(x).
- The spectral theorem implies that P has real eigenvalues 1 = β_0 > β_1 ≥ β_2 ≥ ⋯ ≥ β_{|X|−1} > −1, with an orthonormal basis of eigenfunctions ψ_i : X → R.
- For f ∈ L²(p) with Σ_x f(x) p(x) = 0, write f(x) = Σ_{i ≥ 1} a_i ψ_i(x) (with a_i = ⟨f, ψ_i⟩_p).
- Let Z_1 be chosen from p and Z_1, …, Z_N be a realization of the P(x, y) chain.

Variance computation

Then
\[ \hat{\mu}_P = \frac{1}{N} \sum_{i=1}^{N} f(Z_i) \]
has variance
\[ \mathrm{Var}_p(\hat{\mu}_P) = \frac{1}{N^2} \sum_{k \geq 1} |a_k|^2\, W_N(k) \]
with
\[ W_N(k) = \frac{N(1 - \beta_k^2) - 2\beta_k(1 - \beta_k^N)}{(1 - \beta_k)^2}
          = \frac{N - N\beta_k^2 - 2\beta_k + 2\beta_k^{N+1}}{(1 - \beta_k)^2}. \]

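This decomposition is easy to check numerically. The sketch below (not from the slides; the small chain and the function f are arbitrary) diagonalizes a reversible chain in L²(p) and compares the spectral expression with a direct computation of the stationary covariances:

```python
import numpy as np

rng = np.random.default_rng(1)
m, N = 5, 20
p = rng.random(m); p /= p.sum()
f = rng.random(m); f -= f @ p                    # centre f: sum_x f(x) p(x) = 0

# a reversible chain with stationary distribution p: Metropolize a uniform proposal
Q = np.full((m, m), 1.0 / m)
P = np.minimum(Q, Q * p[None, :] / p[:, None])   # accepted off-diagonal moves
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))

# eigen-decomposition in L^2(p): diagonalize S = D^{1/2} P D^{-1/2}, D = diag(p)
rt = np.sqrt(p)
beta, V = np.linalg.eigh(P * (rt[:, None] / rt[None, :]))
a = (f * rt) @ V                                 # a_k = <f, psi_k>_p

# spectral expression (1/N^2) sum_{k>=1} |a_k|^2 W_N(k), skipping the eigenvalue 1
keep = beta < 1 - 1e-9
b = beta[keep]
W = (N * (1 - b ** 2) - 2 * b * (1 - b ** N)) / (1 - b) ** 2
var_spectral = (a[keep] ** 2 * W).sum() / N ** 2

# direct computation from the stationary covariances Cov(f(Z_1), f(Z_{1+t}))
var_direct, Pt = 0.0, np.eye(m)
for t in range(N):
    cov_t = p @ (f * (Pt @ f))
    var_direct += (N if t == 0 else 2 * (N - t)) * cov_t
    Pt = Pt @ P
var_direct /= N ** 2

assert np.isclose(var_spectral, var_direct)
```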
A classical bound

\[ \sigma_\infty^2(\hat{\mu}_P) := \lim_{N \to +\infty} N\,\mathrm{Var}_p(\hat{\mu}_P)
   = \sum_{k \geq 1} |a_k|^2\, \frac{1 + \beta_k}{1 - \beta_k} \]
\[ \sigma_\infty^2(\hat{\mu}_P) \leq \frac{2}{1 - \beta_1}\, \|f\|_{2,p}^2,
   \qquad 1 - \beta_1 \geq \mathrm{spectral\ gap}(P). \]

The hypercube

\[ X = \mathbb{Z}_2^d, \qquad \pi(x) = \theta^{|x|}(1-\theta)^{d-|x|}, \qquad 1/2 \le \theta \le 1, \]
where |x| = number of ones in the d-tuple x.

Let K(x, y) be given by: from x, pick a coordinate at random and change it to one or zero with probability p or 1 − p (1/2 < p ≤ 1). The stationary distribution is
\[ \sigma(x) = p^{|x|}(1-p)^{d-|x|}. \]

Remark
This example models a high-dimensional problem where the desired distribution π(x) is concentrated in a small part of the space. We have available a sampling procedure – run the chain K(x, y) – whose stationary distribution is roughly right (if p is close to θ) but not spot on.

The hypercube: Importance Sampling

The importance sampling chain on Z_2^d is
\[ K = \frac{1}{d} \sum_{i=1}^{d} K_i \]
with K_i having matrix (restricted to the i-th coordinate, states 0 and 1)
\[ K_i = \begin{pmatrix} \bar{p} & p \\ \bar{p} & p \end{pmatrix}, \qquad \bar{p} = 1 - p. \]

The hypercube: Importance sampling

The importance sampling chain on Z_2^d has 2^d eigenvalues and eigenvectors, indexed by ζ ∈ Z_2^d. These are
\[ \beta_\zeta^* = 1 - \frac{|\zeta|}{d}, \qquad
   \psi_\zeta^*(x) = \prod_{i=1}^{d} \psi_{\zeta_i}^*(x_i)
   = \prod_{i=1}^{d} \left( \sqrt{\tfrac{p}{\bar{p}}} \right)^{\zeta_i (1 - x_i)}
                     \left( -\sqrt{\tfrac{\bar{p}}{p}} \right)^{\zeta_i x_i}, \]
where ψ_0^*(0) = ψ_0^*(1) = 1, ψ_1^*(0) = √(p/p̄), ψ_1^*(1) = −√(p̄/p). The eigenvectors are orthonormal in L²(σ), where σ(x) = p^{|x|}(1 − p)^{d−|x|}.

The hypercube: Importance sampling

Example: let f(x) = Σ_{i=1}^{d} (x_i − θ).
The importance sampling estimate is
\[ \hat{\mu}_I = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi(X_i)}{\sigma(X_i)}\, f(X_i) \]
with
\[ \frac{\pi(x)}{\sigma(x)} = a\, b^{|x|}
   \qquad \text{for } a = \left( \frac{1-\theta}{1-p} \right)^{d}, \quad b = \frac{\theta(1-p)}{p(1-\theta)}. \]

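Putting the pieces together, here is a simulation sketch of μ̂_I on the hypercube (not from the slides; the values of d, θ, p, N below are arbitrary, and for this f the true value μ = E_π f is 0):

```python
import numpy as np

rng = np.random.default_rng(2)
d, theta, p, N = 10, 0.8, 0.7, 5000       # arbitrary illustrative values

def weight(x):
    # pi(x)/sigma(x) = (theta/p)^{|x|} ((1-theta)/(1-p))^{d-|x|}  (= a * b^{|x|} above)
    k = x.sum()
    return (theta / p) ** k * ((1 - theta) / (1 - p)) ** (d - k)

def f(x):
    return np.sum(x - theta)               # the test function of the example; E_pi f = 0

x = (rng.random(d) < p).astype(float)      # X_1 ~ sigma, the stationary law of K
total = 0.0
for _ in range(N):
    total += weight(x) * f(x)
    i = rng.integers(d)                    # K: pick a coordinate at random ...
    x[i] = 1.0 if rng.random() < p else 0.0  # ... and set it to 1 w.p. p, 0 w.p. 1-p
mu_hat_I = total / N
print(mu_hat_I)   # should be near 0; its variance is the quantity analysed next
```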
The hypercube: Importance sampling

One has
\[ \mathrm{Var}_\sigma(\hat{\mu}_I) = \frac{\alpha^2}{N^2} \sum_{i=1}^{d} \binom{d}{i}\, i^2\, \beta^{2i}\, W_N^*(i), \]
with
\[ \alpha := \frac{\theta\bar{\theta}\sqrt{p/\bar{p}} + \bar{\theta}\theta\sqrt{\bar{p}/p}}
                  {\bar{\theta}\sqrt{p/\bar{p}} - \theta\sqrt{\bar{p}/p}},
   \qquad
   \beta := \bar{\theta}\sqrt{p/\bar{p}} - \theta\sqrt{\bar{p}/p}, \]
\[ W_N^*(i) := \frac{2Nd}{i} - N - \frac{2d^2}{i^2}\left(1 - \frac{i}{d}\right)
              + \frac{2d^2}{i^2}\left(1 - \frac{i}{d}\right)^{N+1}. \]

The hypercube: Metropolis

The Metropolis chain is given by
\[ M = \frac{1}{d} \sum_{i=1}^{d} M_i \]
with M_i the Metropolis construction operating on the i-th coordinate. The transition matrix restricted to the i-th coordinate (states 0 and 1) is
\[ M_i = \begin{pmatrix} \bar{p} & p \\ p\bar{\theta}/\theta & 1 - p\bar{\theta}/\theta \end{pmatrix},
   \qquad \bar{p} = 1 - p, \quad \bar{\theta} = 1 - \theta. \]

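As a quick check (a sketch with arbitrary values of θ and p, taking p ≤ θ so that the displayed matrix really is the Metropolis chain for the Bernoulli(p) proposal), M_i has stationary distribution (θ̄, θ) and non-trivial eigenvalue 1 − p/θ, which is consistent with the eigenvalues β_ζ on the next slide:

```python
import numpy as np

theta, p = 0.8, 0.7                     # arbitrary values with 1/2 < p <= theta < 1
pbar, tbar = 1 - p, 1 - theta

# per-coordinate Metropolis matrix from the slide (states 0, 1)
Mi = np.array([[pbar,             p                   ],
               [p * tbar / theta, 1 - p * tbar / theta]])

pi_i = np.array([tbar, theta])          # target restricted to one coordinate
assert np.allclose(pi_i @ Mi, pi_i)     # stationarity
evals = np.linalg.eigvals(Mi).real
assert np.isclose(evals.min(), 1 - p / theta)   # non-trivial eigenvalue 1 - p/theta
# hence M = (1/d) sum_i M_i has eigenvalues beta_zeta = 1 - |zeta| p / (d theta)
```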
The hypercube: Metropolis

The Metropolis chain on Z_2^d has 2^d eigenvalues and eigenvectors, indexed by ζ ∈ Z_2^d. These are
\[ \beta_\zeta = 1 - \frac{|\zeta|\,p}{d\theta}, \qquad
   \psi_\zeta(x) = \prod_{i=1}^{d} \psi_{\zeta_i}(x_i)
   = \prod_{i=1}^{d} \left( \sqrt{\tfrac{\theta}{\bar{\theta}}} \right)^{\zeta_i (1-x_i)}
                     \left( -\sqrt{\tfrac{\bar{\theta}}{\theta}} \right)^{\zeta_i x_i}, \]
with
\[ \psi_0(0) = \psi_0(1) = 1, \qquad \psi_1(0) = \sqrt{\theta/\bar{\theta}}, \quad \psi_1(1) = -\sqrt{\bar{\theta}/\theta}. \]

The hypercube: Metropolis

Again take f(x) = Σ_{i=1}^{d} (x_i − θ). Then
\[ \mathrm{Var}_\pi(\hat{\mu}_M) = \frac{2d^2\bar{\theta}\theta^2}{Np} - \frac{d\theta\bar{\theta}}{N}
   + \frac{2d^3\theta^3\bar{\theta}}{N^2 p^2}\left(1 - \frac{p}{d\theta}\right)^{N+1}
   - \frac{2d^3\bar{\theta}\theta^3}{N^2 p^2}\left(1 - \frac{p}{d\theta}\right). \]
Here
\[ \sigma_\infty^2(\hat{\mu}_M) \sim \frac{2d^2\bar{\theta}\theta^2}{p} \qquad (d \to +\infty). \]

The hypercube: Metropolis

The bottom line:
\[ \mathrm{Var}_\sigma(\hat{\mu}_I) \sim \frac{2\alpha^2 d^2 \beta^2}{N}\,(1 + \beta^2)^{d-1}. \]
This is exponentially worse (in d) than
\[ \mathrm{Var}_\pi(\hat{\mu}_M) \sim \frac{2d^2\theta^2\bar{\theta}}{Np}. \]

The hypercube: another example

For another example we take f(x) = δ_d(|x|) − θ^d. In this case,
\[ \mathrm{Var}_\pi(\hat{\mu}_M) = \frac{\theta^{2d}}{N^2} \sum_{i=1}^{d} \binom{d}{i}
   \left(\frac{\bar{\theta}}{\theta}\right)^{i} W_N^*\!\left(\frac{ip}{\theta}\right), \]
with
\[ W_N^*(i) = \frac{2Nd}{i} - N - \frac{2d^2}{i^2}\left(1-\frac{i}{d}\right)
             + \frac{2d^2}{i^2}\left(1-\frac{i}{d}\right)^{N+1}, \]
and
\[ \mathrm{Var}_\sigma(\hat{\mu}_I) = \frac{\theta^{2d}}{N^2} \sum_{i=1}^{d} \binom{d}{i}
   \left(\frac{\bar{p}}{p}\right)^{i}
   \left(1 - \left(\frac{\theta-p}{1-p}\right)^{i}\right)^{2} W_N^*(i). \]

The hypercube: another example

Simple computations show that
\[ \sigma_\infty^2(\hat{\mu}_M)/\mathrm{Var}_\pi(f)
   = \frac{2\theta}{p\bar{\theta}(1-\theta^d)}\,\frac{d}{d+1}\, B_d(\bar{\theta}/\theta) - 1, \]
with d ↦ B_d(θ̄/θ) bounded.
Moreover,
\[ \sigma_\infty^2(\hat{\mu}_I)/\mathrm{Var}_\pi(f)
   = \frac{1}{1-\theta^d}\left(\frac{\theta}{p}\right)^{d}
     \left[ \frac{2d}{d+1}\, B_d^*(p,\theta) - C_d^*(p,\theta) \right], \]
with d ↦ [ (2d/(d+1)) B_d^*(p, θ) − C_d^*(p, θ) ] bounded and strictly positive, for every p and θ such that 1/2 < p < θ < 1.

The hypercube: another example

Finally,
\[ \sigma_\infty^2(\hat{\mu}_I)/\mathrm{Var}_\pi(f)
   \sim K_3(\theta,p)\left[ \left(\frac{\theta p + \theta^3 - 2\theta^2 p}{p(1-p)}\right)^{d}
   + \left(\frac{\theta}{p}\right)^{d} \right], \]
with (θp + θ³ − 2θ²p)/(p(1−p)) < 1 if 1/2 < θ < p < 1, K_3 being a suitable constant.

The hypercube: another example

That is:
\[ \sigma_\infty^2(\hat{\mu}_M)/\mathrm{Var}_\pi(f) \sim K_1(\theta,p). \]
If 1/2 < p < θ < 1,
\[ \sigma_\infty^2(\hat{\mu}_I)/\mathrm{Var}_\pi(f) \sim K_2(\theta,p)\left(\frac{\theta}{p}\right)^{d}. \]
If 1/2 < θ < p < 1,
\[ \sigma_\infty^2(\hat{\mu}_I)/\mathrm{Var}_\pi(f)
   \sim K_3(\theta,p)\left[ \left(\frac{\theta p + \theta^3 - 2\theta^2 p}{p(1-p)}\right)^{d}
   + \left(\frac{\theta}{p}\right)^{d} \right]. \]

The hypercube: Remark

The last proposition shows that, for 1/2 < p < θ < 1, the normalized asymptotic variance of μ̂_I is exponentially worse (in d) than the normalized asymptotic variance of μ̂_M, while it is exponentially better if 1/2 < θ < p < 1. On the one hand, this fact agrees with the heuristic that the importance sampling estimator of π(A) performs better than the Monte Carlo estimator whenever the importance distribution σ puts more weight on A than π does. On the other hand, the last example shows that even a small "loss of weight" in A can cause exponentially worse behavior.

Independence Sampling

Take K(x, y) = σ(y). Let X = {0, 1, …, m − 1} with
\[ \frac{\pi(0)}{\sigma(0)} \geq \frac{\pi(1)}{\sigma(1)} \geq \cdots \geq \frac{\pi(m-1)}{\sigma(m-1)}. \]
From state i the chain proceeds by choosing j from σ; if j ≤ i the chain moves to j. If j > i the chain moves to j with probability π(j)σ(i)/(π(i)σ(j)) and remains at i with probability 1 − π(j)σ(i)/(π(i)σ(j)).

Independence Sampling

J. Liu diagonalized the Independence Sampling chain:
\[ \beta_k = \sum_{i \geq k} \left( \sigma(i) - \pi(i)\,\frac{\sigma(k)}{\pi(k)} \right), \qquad
   \psi_k = (\underbrace{0, \ldots, 0}_{k-1},\; S_\pi(k+1),\; -\pi(k), \ldots, -\pi(k)), \]
where
\[ S_\pi(k+1) := \sum_{j=k+1}^{m-1} \pi(j). \]

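Liu's closed form is easy to verify numerically on a small state space. The sketch below (arbitrary π and σ; the ordering and the transition matrix are built exactly as described on the previous slide) compares the eigenvalue 1 for the constants plus Liu's β_k with a direct diagonalization:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 6
pi = rng.random(m);    pi /= pi.sum()
sigma = rng.random(m); sigma /= sigma.sum()

# order the states so that pi/sigma is non-increasing, as on the slide
order = np.argsort(-pi / sigma)
pi, sigma = pi[order], sigma[order]

# transition matrix of the Metropolized independence sampler
M = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        if j != i:
            M[i, j] = sigma[j] * min(1.0, (pi[j] * sigma[i]) / (pi[i] * sigma[j]))
    M[i, i] = 1.0 - M[i].sum()

# Liu's closed form: eigenvalue 1 plus beta_k for the first m-1 states
beta = [np.sum(sigma[k:] - pi[k:] * sigma[k] / pi[k]) for k in range(m - 1)]
assert np.allclose(np.sort(np.linalg.eigvals(M).real),
                   np.sort(np.array([1.0] + beta)))
```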
Non–self-intersecting paths: a problem of Knuth

[Figure: two example lattice paths across the grid, one self-intersecting (NO) and one non-self-intersecting (YES).]

Non–self-intersecting paths: a problem of Knuth

\[ \hat{\mu}_{SIS} = \frac{1}{N} \sum_{i=1}^{N} 1/\sigma(\gamma_i) \]

Knuth used a sequential importance sampling algorithm to give the following estimates for a 10 × 10 grid:

- number of paths = (1.6 ± 0.3) × 10^{24}
- average path length = 92 ± 5
- proportion of paths through (5, 5) = 81 ± 10 percent.

He noted that usually 1/σ(γ) was between 10^{11} and 10^{17}. A few of his sample values were much larger, accounting for the 10^{24}. It is hard to bound or assess the variance of the sequential importance sampling estimator.

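Knuth's actual procedure for the 10 × 10 grid was more refined, but the basic sequential idea can be sketched as follows (a minimal illustration, not from the slides; the grid size and the sample size are arbitrary, and walks that run into a dead end before reaching the opposite corner simply contribute zero):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 4, 20000                      # small grid here; Knuth worked on the 10 x 10 grid

def sis_sample():
    """Grow a self-avoiding path from (0,0), choosing uniformly among the
    unvisited lattice neighbours at each step.  Returns 1/sigma(path) if the
    path reaches (n,n), and 0 if it gets stuck first."""
    pos, visited, weight = (0, 0), {(0, 0)}, 1.0
    while pos != (n, n):
        x, y = pos
        nbrs = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx <= n and 0 <= y + dy <= n
                and (x + dx, y + dy) not in visited]
        if not nbrs:
            return 0.0               # dead end: no self-avoiding continuation
        weight *= len(nbrs)          # 1/sigma picks up a factor (#choices)
        pos = nbrs[rng.integers(len(nbrs))]
        visited.add(pos)
    return weight

estimates = [sis_sample() for _ in range(N)]
print("estimated number of self-avoiding corner-to-corner paths:", np.mean(estimates))
```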
Monotone paths

[Figure: the six monotone paths in the 2 × 2 grid, with sampling probabilities 1/4, 1/8, 1/8, 1/8, 1/8, 1/4 under σ.]

Monotone paths

In this case:

- |X| = \binom{2n}{n}
- π(γ) = 1 / \binom{2n}{n}
- σ(γ) = 2^{−T(γ)}, where T(γ) is the first time the walk hits the top or right side (a simulation sketch follows).

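Sampling from σ is just coin flipping until the boundary is hit, and 1/σ(γ) = 2^{T(γ)} then estimates the number of monotone paths (a minimal sketch; n and N below are arbitrary):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(5)
n, N = 10, 100_000

def sample_T():
    # grow a monotone path with fair coin flips until it first hits the top
    # or the right side; after that the path is forced, so sigma(path) = 2^{-T}
    x = y = T = 0
    while x < n and y < n:
        if rng.random() < 0.5:
            x += 1
        else:
            y += 1
        T += 1
    return T

T_samples = np.array([sample_T() for _ in range(N)])
estimate = np.mean(2.0 ** T_samples)      # average of 1/sigma(gamma)
print(estimate, comb(2 * n, n))           # estimate vs the exact number of paths
```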
Monotone paths

For the monotone paths on an n × n grid:

- Under the uniform distribution π,
  \[ \pi\{T(\gamma) = j\} = 2\binom{j-1}{n-1} \Big/ \binom{2n}{n}, \qquad n \leq j \leq 2n-1. \]
- For n large and any fixed k,
  \[ \pi\{T(\gamma) = 2n - 1 - k\} \to \frac{1}{2^{k+1}}, \qquad 0 \leq k < +\infty. \]
- Under the importance sampling distribution σ,
  \[ \sigma\{T(\gamma) = j\} = 2^{1-j}\binom{j-1}{n-1}, \qquad n \leq j \leq 2n-1. \]
- For n large and any fixed positive x,
  \[ \sigma\left\{ \frac{2n - 1 - T(\gamma)}{\sqrt{n}} \leq x \right\}
     \to \frac{1}{\sqrt{\pi}} \int_{0}^{x} e^{-y^2/4}\, dy. \]

Monotone paths

Recall that
\[ \hat{\mu}_{SIS} = \frac{1}{N} \sum_{i=1}^{N} 1/\sigma(\gamma_i) \]
estimates the number of monotone paths in an n × n grid, and
\[ \mathrm{Var}_\sigma(\hat{\mu}_{SIS}) = \frac{1}{N}\, \frac{16^n}{4\sqrt{n}}
   \left( 1 + O\!\left(\frac{1}{n}\right) \right). \]

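To get a feel for the size of this variance, the one-sample variance of 1/σ(γ) can also be computed exactly from the distribution of T given above (a small script, not from the talk); it grows essentially like 16^n, up to lower-order factors:

```python
from math import comb

# exact one-sample variance of 1/sigma(gamma) = 2^{T(gamma)} under sigma, using
# the fact (previous slides) that exactly 2*comb(j-1, n-1) paths have T = j
for n in (5, 10, 15, 20):
    # E_sigma[(2^T)^2] = sum_gamma sigma(gamma) * 2^{2T} = sum_gamma 2^{T(gamma)}
    second_moment = sum(2 * comb(j - 1, n - 1) * 2 ** j for j in range(n, 2 * n))
    var_one_sample = second_moment - comb(2 * n, n) ** 2
    print(n, var_one_sample, var_one_sample / 16 ** n)
```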
Monotone paths

More generally,
\[ \hat{\mu}_{SIS}(f) := \frac{1}{N} \sum_{i=1}^{N} \frac{2^{T(\gamma_i)}}{\binom{2n}{n}}\, f(\gamma_i) \]
is an unbiased estimator of
\[ \mu = \sum_{\gamma} \frac{1}{\binom{2n}{n}}\, f(\gamma). \]
In this case we have
\[ \mathrm{Var}_\sigma(\hat{\mu}_{SIS}(f))
   = \frac{1}{N} \left\{ \mathbb{E}_\sigma\!\left[ \left( \frac{2^{T(\gamma)}}{\binom{2n}{n}}\, f(\gamma) \right)^{2} \right] - \mu^2 \right\}. \]

Monotone paths

Example: f(γ) = T(γ), the first hitting time of a uniformly chosen random path γ to the top or right side of an n × n grid. Then
\[ \mu = \mathbb{E}_\pi(T) = \left( 2 - \frac{2}{n+1} \right) n \]
and
\[ \mathrm{Var}_\sigma(\hat{\mu}_{SIS}(T)) \sim \frac{\sqrt{\pi}\, n^{5/2}}{N}. \]

Monotone paths

Consider independence Metropolis sampling with π(γ) = 1/\binom{2n}{n} and proposal distribution σ(γ) = 2^{−T(γ)}.

Convention: γ ≤ γ′ if T(γ) ≥ T(γ′); if T(γ) = T(γ′), use the lexicographic order. We break the paths into groups by T(γ):

- group one, with T(γ) = 2n − 1, of size A_1 := 2\binom{2n-2}{n-1}
- group two, with T(γ) = 2n − 2, of size A_2 := 2\binom{2n-3}{n-1}
- …
- group i, with T(γ) = 2n − i, of size A_i := 2\binom{2n-1-i}{n-1}

Monotone paths

The independence proposal Markov chain on monotone paths with proposal distribution σ(γ) = 2^{−T(γ)} and stationary distribution π(γ) = 1/\binom{2n}{n} has n distinct eigenvalues, one for each of the n groups above, the i-th with multiplicity the size of the i-th group. If s(i) = 2^{−(2n−1−i)}, the eigenvalues are
\[ \beta_1 = 1 - s(0)\binom{2n}{n}, \qquad \text{multiplicity } A_1 - 1, \]
\[ \beta_2 = 1 - s(1)\binom{2n}{n} + A_0\,(s(1) - s(0)), \qquad \text{multiplicity } A_2, \]
\[ \cdots \]
\[ \beta_i = 1 - s(i-1)\binom{2n}{n} + A_0\,(s(i-1) - s(0)) + \cdots + A_{i-2}\,(s(i-1) - s(i-2)),
   \qquad \text{multiplicity } A_i. \]

Monotone paths

Using Stirling's formula, for fixed j,
\[ \beta_j = 1 - \frac{2j}{\sqrt{\pi n}} + O_j\!\left(\frac{1}{n}\right). \]

Monotone paths

Let f(γ) = T(γ), the first hitting time of a uniformly chosen random path to the top or right side of an n × n grid. Then
\[ \sigma_\infty^2(\hat{\mu}_{Met}(T)) \leq 8\sqrt{\pi}\, n^{3/2} + O(n). \]
Roughly,
\[ \mathrm{Var}_\sigma(\hat{\mu}_{SIS}) \approx \frac{n^{5/2}}{N}
   \qquad \text{while} \qquad
   \mathrm{Var}_\pi(\hat{\mu}_{Met}) \approx \frac{n^{3/2}}{N}. \]

What's beyond toy models?

An example of an (almost) real problem: fix (c_1, …, c_n) ∈ (\mathbb{Z}_+)^n and (r_1, …, r_n) ∈ (\mathbb{Z}_+)^n such that
\[ \sum_i c_i = \sum_j r_j = M, \]
and set
\[ X = \Big\{ A = [a_{ij}] : \sum_i a_{ij} = c_j,\ \sum_j a_{ij} = r_i,\ a_{ij} \in \mathbb{Z}_+ \Big\}. \]
Here: π uniform on X.

Harder [networks]:
\[ X = \Big\{ A = [a_{ij}] : \sum_i a_{ij} = c_j,\ \sum_j a_{ij} = r_i,\ a_{ij} \in \{0, 1\} \Big\}. \]

What's beyond toy models?

MCMC (Diaconis): pick two rows and two columns at random and add
\[ \begin{pmatrix} -1 & +1 \\ +1 & -1 \end{pmatrix}
   \quad \text{or} \quad
   \begin{pmatrix} +1 & -1 \\ -1 & +1 \end{pmatrix} \]
to the corresponding 2 × 2 submatrix, so that for instance a submatrix \( \begin{smallmatrix} 1 & 0 \\ 0 & 1 \end{smallmatrix} \) becomes \( \begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix} \); this preserves all row and column sums.

SLOW!! but everything is known!

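A minimal sketch of this swap chain (the starting table, its margins and the number of steps below are arbitrary; proposals that would create a negative entry are rejected, one standard convention, which keeps the move symmetric and the uniform distribution π stationary):

```python
import numpy as np

rng = np.random.default_rng(6)

# an arbitrary starting table; its row and column sums define the margins r, c
A = np.array([[2, 1, 0],
              [0, 3, 1],
              [1, 0, 2]])
steps = 10000

for _ in range(steps):
    i, k = rng.choice(A.shape[0], size=2, replace=False)   # two distinct rows
    j, l = rng.choice(A.shape[1], size=2, replace=False)   # two distinct columns
    s = rng.choice([-1, 1])
    # proposed move: add s * [[+1, -1], [-1, +1]] on the chosen 2 x 2 submatrix;
    # this preserves every row and column sum
    B = A.copy()
    B[i, j] += s; B[i, l] -= s; B[k, j] -= s; B[k, l] += s
    if B.min() >= 0:                 # reject moves that leave the state space
        A = B

print(A)
print(A.sum(axis=1), A.sum(axis=0))  # margins are unchanged throughout
```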
What’s beyond toy models?
Importance sampling: FAST!!
but difficult to analyze!!!
(and to describe ...)