Examples comparing Importance Sampling and the Metropolis algorithm
Federico Bassetti, Università degli Studi di Pavia
(with Persi Diaconis, Stanford)
Torino, 29 March 2006

Summary
- Introduction: Metropolis vs Importance Sampling
- Diagonalization and Variance
- Example 1: the Hypercube
- Independence Sampling
- Example 2: Monotone Paths
- Open problems

Introduction
- $\mathcal{X}$ a finite set
- $\pi(x)$ a probability on $\mathcal{X}$
- a reversible Markov chain $K(x,y)$ on $\mathcal{X}$ with stationary distribution $\sigma(x) > 0$ for all $x$ in $\mathcal{X}$
- $f : \mathcal{X} \to \mathbb{R}$

We want to approximate
$$\mu = \sum_{x} f(x)\,\pi(x).$$
Two classical procedures are available.

Metropolis
Change the output of the $K$ chain so that it has stationary distribution $\pi$ by constructing
$$M(x,y) = \begin{cases} K(x,y)\,A(x,y) & x \neq y \\ K(x,x) + \sum_{z \neq x} K(x,z)\,\big(1 - A(x,z)\big) & x = y \end{cases}
\qquad
A(x,y) := \min\Big( \frac{\pi(y)K(y,x)}{\pi(x)K(x,y)},\, 1 \Big).$$
Generate $Y_1$ from $\pi$ and then $Y_2, \dots, Y_N$ from $M(x,y)$, and take
$$\hat{\mu}_M = \frac{1}{N}\sum_{i=1}^N f(Y_i).$$

Importance sampling
Generate $X_1$ from $\sigma$ and then $X_2, \dots, X_N$ from $K(x,y)$, and take
$$\hat{\mu}_I = \frac{1}{N}\sum_{i=1}^N \frac{\pi(X_i)}{\sigma(X_i)}\, f(X_i)
\qquad\text{or}\qquad
\tilde{\mu}_I = \frac{\sum_{i=1}^N \frac{\pi(X_i)}{\sigma(X_i)} f(X_i)}{\sum_{i=1}^N \frac{\pi(X_i)}{\sigma(X_i)}}.$$

Both $\hat{\mu}_M$ and $\hat{\mu}_I$ take the output of the Markov chain $K$ and reweight it to get an unbiased estimate of $\mu$. The work involved is comparable, so it is natural to ask which estimate is better.
(A) We do not worry about the "burn-in" problem here: $Y_1 \sim \pi$ and $X_1 \sim \sigma$.
(B) We take $K_{IS} = K_M = K$.

Variance computation
- Let $P(x,y)$ be a reversible Markov chain on the finite set $\mathcal{X}$ with stationary distribution $p(x)$. The spectral theorem implies that $P$ has real eigenvalues $1 = \beta_0 > \beta_1 \geq \beta_2 \geq \dots \geq \beta_{|\mathcal{X}|-1} > -1$, with an orthonormal basis of eigenfunctions $\psi_i : \mathcal{X} \to \mathbb{R}$.
- Let $f \in L^2(p)$ with $\sum_x f(x)p(x) = 0$ and $f(x) = \sum_{i \geq 1} a_i \psi_i(x)$ (where $a_i = \langle f, \psi_i \rangle_p$).
- Let $Z_1$ be chosen from $p$ and $Z_1, \dots, Z_N$ be a realization of the $P(x,y)$ chain.

Then
$$\hat{\mu}_P = \frac{1}{N}\sum_{i=1}^N f(Z_i)$$
has variance
$$\mathrm{Var}_p(\hat{\mu}_P) = \frac{1}{N^2}\sum_{k \geq 1} |a_k|^2\, W_N(k)
\qquad\text{with}\qquad
W_N(k) = \frac{N(1 - \beta_k^2) - 2\beta_k(1 - \beta_k^{N})}{(1-\beta_k)^2}.$$

A classical bound
$$\sigma^2_\infty(\hat{\mu}_P) := \lim_{N \to +\infty} N\,\mathrm{Var}_p(\hat{\mu}_P) = \sum_{k \geq 1} |a_k|^2\, \frac{1+\beta_k}{1-\beta_k},
\qquad
\sigma^2_\infty(\hat{\mu}_P) \leq \frac{2}{1-\beta_1}\, \|f\|_{2,p}^2,$$
with $1 - \beta_1 \geq \mathrm{spectral\ gap}(P)$.
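The variance formula is easy to evaluate numerically once the spectral data are known. Below is a minimal Python sketch (not from the talk) that computes $\mathrm{Var}_p(\hat\mu_P)$ and the asymptotic variance $\sigma^2_\infty$ from a given list of non-trivial eigenvalues $\beta_k$ and coefficients $a_k$; the sample values at the bottom are purely illustrative.

```python
import numpy as np

def w_N(beta, N):
    """W_N(k) from the spectral formula:
    (N*(1 - beta**2) - 2*beta*(1 - beta**N)) / (1 - beta)**2."""
    beta = np.asarray(beta, dtype=float)
    return (N * (1 - beta**2) - 2 * beta * (1 - beta**N)) / (1 - beta) ** 2

def var_ergodic_average(betas, coeffs, N):
    """Var_p(mu_hat_P) = (1/N^2) * sum_k |a_k|^2 * W_N(k),
    summing over the non-trivial eigenvalues beta_1, ..., beta_{|X|-1}."""
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.sum(coeffs**2 * w_N(betas, N)) / N**2)

def asymptotic_variance(betas, coeffs):
    """sigma_inf^2 = sum_k |a_k|^2 * (1 + beta_k) / (1 - beta_k)."""
    betas = np.asarray(betas, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)
    return float(np.sum(coeffs**2 * (1 + betas) / (1 - betas)))

# Illustrative spectral data (not from any chain in the talk).
betas, coeffs, N = [0.9, 0.5, 0.1], [1.0, 0.3, 0.2], 10_000
print(N * var_ergodic_average(betas, coeffs, N))  # N * Var_p(mu_hat_P) ...
print(asymptotic_variance(betas, coeffs))         # ... approaches sigma_inf^2
```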
The hypercube
$\mathcal{X} = \mathbb{Z}_2^d$, $\pi(x) = \theta^{|x|}(1-\theta)^{d-|x|}$, $1/2 \leq \theta \leq 1$, where $|x|$ is the number of ones in the $d$-tuple $x$.
Let $K(x,y)$ be given by: from $x$, pick a coordinate at random and change it to one or zero with probability $p$ or $1-p$ ($1/2 < p \leq 1$). The stationary distribution is $\sigma(x) = p^{|x|}(1-p)^{d-|x|}$.

Remark. This example models a high-dimensional problem where the desired distribution $\pi(x)$ is concentrated in a small part of the space. We have available a sampling procedure (run the chain $K(x,y)$) whose stationary distribution is roughly right (if $p$ is close to $\theta$) but not spot on.

The hypercube: Importance Sampling
The importance sampling chain on $\mathbb{Z}_2^d$ is
$$K = \frac{1}{d}\sum_{i=1}^d K_i$$
with $K_i$ having matrix (restricted to the $i$th coordinate, rows and columns indexed by $0,1$)
$$\begin{pmatrix} \bar p & p \\ \bar p & p \end{pmatrix}, \qquad \bar p = 1-p.$$
The importance sampling chain on $\mathbb{Z}_2^d$ has $2^d$ eigenvalues and eigenvectors, indexed by $\zeta \in \mathbb{Z}_2^d$:
$$\beta^*_\zeta = 1 - \frac{|\zeta|}{d}, \qquad
\psi^*_\zeta(x) = \prod_{i=1}^d \psi^*_{\zeta_i}(x_i) = \prod_{i=1}^d \Big(\sqrt{p/\bar p}\Big)^{\zeta_i(1-x_i)} \Big(-\sqrt{\bar p/p}\Big)^{\zeta_i x_i},$$
where $\psi^*_0(0) = \psi^*_0(1) = 1$, $\psi^*_1(0) = \sqrt{p/\bar p}$, $\psi^*_1(1) = -\sqrt{\bar p/p}$. The eigenvectors are orthonormal in $L^2(\sigma)$, where $\sigma(x) = p^{|x|}(1-p)^{d-|x|}$.

Example. Let $f(x) = \sum_{i=1}^d (x_i - \theta)$. The importance sampling estimate is
$$\hat{\mu}_I = \frac{1}{N}\sum_{i=1}^N \frac{\pi(X_i)}{\sigma(X_i)}\, f(X_i)
\qquad\text{with}\qquad
\frac{\pi(x)}{\sigma(x)} = a\, b^{|x|},\quad a = \Big(\frac{1-\theta}{1-p}\Big)^d,\quad b = \frac{\theta(1-p)}{p(1-\theta)}.$$
One has
$$\mathrm{Var}_\sigma(\hat{\mu}_I) = \frac{\alpha^2}{N^2}\sum_{i=1}^d \binom{d}{i}\, i^2\, \beta^{2i}\, W_N^*(i)$$
with
$$\alpha := \frac{\theta\bar\theta\sqrt{p/\bar p} + \bar\theta\theta\sqrt{\bar p/p}}{\bar\theta\sqrt{p/\bar p} - \theta\sqrt{\bar p/p}},
\qquad
\beta := \bar\theta\sqrt{p/\bar p} - \theta\sqrt{\bar p/p},$$
$$W_N^*(x) := \frac{2Nd}{x} - N - \frac{2d^2}{x^2}\Big(1-\frac{x}{d}\Big) + \frac{2d^2}{x^2}\Big(1-\frac{x}{d}\Big)^{N+1}.$$

The hypercube: Metropolis
The Metropolis chain is
$$M = \frac{1}{d}\sum_{i=1}^d M_i$$
with $M_i$ the Metropolis construction operating on the $i$th coordinate. The transition matrix restricted to the $i$th coordinate (rows and columns indexed by $0,1$) is
$$\begin{pmatrix} \bar p & p \\ p\bar\theta/\theta & 1 - p\bar\theta/\theta \end{pmatrix}, \qquad \bar p = 1-p,\ \bar\theta = 1-\theta.$$
The Metropolis chain on $\mathbb{Z}_2^d$ has $2^d$ eigenvalues and eigenvectors, indexed by $\zeta \in \mathbb{Z}_2^d$:
$$\beta_\zeta = 1 - \frac{|\zeta|\,p}{d\,\theta}, \qquad
\psi_\zeta(x) = \prod_{i=1}^d \psi_{\zeta_i}(x_i) = \prod_{i=1}^d \Big(\sqrt{\theta/\bar\theta}\Big)^{\zeta_i(1-x_i)} \Big(-\sqrt{\bar\theta/\theta}\Big)^{\zeta_i x_i},$$
with $\psi_0(0) = \psi_0(1) = 1$, $\psi_1(0) = \sqrt{\theta/\bar\theta}$, $\psi_1(1) = -\sqrt{\bar\theta/\theta}$.

Again take $f(x) = \sum_{i=1}^d (x_i - \theta)$. Then
$$\mathrm{Var}_\pi(\hat\mu_M) = \frac{2d^2\theta^2\bar\theta}{Np} - \frac{d\theta\bar\theta}{N} - \frac{2d^3\theta^3\bar\theta}{N^2p^2}\Big(1-\frac{p}{d\theta}\Big) + \frac{2d^3\theta^3\bar\theta}{N^2p^2}\Big(1-\frac{p}{d\theta}\Big)^{N+1}.$$
Here
$$\sigma^2_\infty(\hat\mu_M) \sim \frac{2d^2\bar\theta\theta^2}{p} \qquad (d \to +\infty).$$

The bottom line:
$$\mathrm{Var}_\sigma(\hat\mu_I) \sim \frac{2\alpha^2 d^2\beta^2(1+\beta^2)^{d-1}}{N},$$
which is exponentially worse (in $d$) than
$$\mathrm{Var}_\pi(\hat\mu_M) \sim \frac{2d^2\theta^2\bar\theta}{Np}.$$

The hypercube: another example
For another example take $f(x) = \delta_d(|x|) - \theta^d$. In this case
$$\mathrm{Var}_\pi(\hat\mu_M) = \frac{\theta^{2d}}{N^2}\sum_{i=1}^d \binom{d}{i}\Big(\frac{\bar\theta}{\theta}\Big)^i W_N^*(ip/\theta)$$
and
$$\mathrm{Var}_\sigma(\hat\mu_I) = \frac{\theta^{2d}}{N^2}\sum_{i=1}^d \binom{d}{i}\Big(\frac{\bar p}{p}\Big)^i \Big(1 - \Big(\frac{\theta-p}{1-p}\Big)^i\Big)^2 W_N^*(i),$$
with $W_N^*$ as above.

Simple computations show that
$$\frac{\sigma^2_\infty(\hat\mu_M)}{\mathrm{Var}_\pi(f)} = \frac{2\theta}{p\,\bar\theta\,(1-\theta^d)}\,\frac{d}{d+1}\,B_d(\bar\theta/\theta) - 1,$$
with $d \mapsto B_d(\bar\theta/\theta)$ bounded. Moreover,
$$\frac{\sigma^2_\infty(\hat\mu_I)}{\mathrm{Var}_\pi(f)} = \frac{1}{1-\theta^d}\Big(\frac{\theta}{p}\Big)^d\Big[\frac{2d}{d+1}B^*_d(p,\theta) - C^*_d(p,\theta)\Big],$$
with $d \mapsto \frac{2d}{d+1}B^*_d(p,\theta) - C^*_d(p,\theta)$ bounded and strictly positive for every $p$ and $\theta$ such that $1/2 < p < \theta < 1$.

Finally, if $1/2 < \theta < p < 1$,
$$\frac{\sigma^2_\infty(\hat\mu_I)}{\mathrm{Var}_\pi(f)} \sim K_3(\theta,p)\Big[\Big(\frac{\theta p + \theta^3 - 2\theta^2 p}{p(1-p)}\Big)^d + \Big(\frac{\theta}{p}\Big)^d\Big],$$
where $\frac{\theta p + \theta^3 - 2\theta^2 p}{p(1-p)} < 1$ when $1/2 < \theta < p < 1$, and $K_3$ is a suitable constant.
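To see the two procedures side by side, here is a minimal Python sketch (not from the talk) that runs the single-coordinate chain $K$ and its Metropolized version on $\{0,1\}^d$ and compares the empirical variances of $\hat\mu_I$ and $\hat\mu_M$ for $f(x) = \sum_i (x_i - \theta)$. The Metropolis step uses the generic acceptance ratio rather than the closed-form matrix above, the starting states are drawn from $\sigma$ and $\pi$ as in assumption (A), and the parameters at the bottom are illustrative.

```python
import numpy as np

def is_estimate(d, N, p, theta, rng):
    """Importance sampling: run the chain K (pick a coordinate, set it to 1
    with probability p), whose stationary law is sigma = Bernoulli(p)^d,
    and reweight by pi/sigma with pi = Bernoulli(theta)^d,
    for f(x) = sum_i (x_i - theta)."""
    x = (rng.random(d) < p).astype(int)              # X_1 ~ sigma
    total = 0.0
    for t in range(N):
        if t > 0:                                    # one step of K
            x[rng.integers(d)] = 1 if rng.random() < p else 0
        k = x.sum()                                  # |x|
        weight = (theta / p) ** k * ((1 - theta) / (1 - p)) ** (d - k)
        total += weight * (k - d * theta)
    return total / N

def metropolis_estimate(d, N, p, theta, rng):
    """Metropolis chain M built from the same proposal K, with stationary
    law pi = Bernoulli(theta)^d, for the same f."""
    x = (rng.random(d) < theta).astype(int)          # Y_1 ~ pi
    total = 0.0
    for t in range(N):
        if t > 0:
            i = rng.integers(d)
            new = 1 if rng.random() < p else 0       # proposal from K
            if new != x[i]:
                k_ratio = (1 - p) / p if new == 1 else p / (1 - p)   # K(y,x)/K(x,y)
                pi_ratio = theta / (1 - theta) if new == 1 else (1 - theta) / theta
                if rng.random() < min(1.0, pi_ratio * k_ratio):
                    x[i] = new
        total += x.sum() - d * theta
    return total / N

# Illustrative comparison: repeat both estimators and compare variances.
d, N, p, theta, reps = 10, 2000, 0.7, 0.8, 200
rng = np.random.default_rng(0)
var_I = np.var([is_estimate(d, N, p, theta, rng) for _ in range(reps)])
var_M = np.var([metropolis_estimate(d, N, p, theta, rng) for _ in range(reps)])
print("Var(mu_I) ~", var_I, "  Var(mu_M) ~", var_M)
```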
That is:
$$\frac{\sigma^2_\infty(\hat\mu_M)}{\mathrm{Var}_\pi(f)} \sim K_1(\theta,p).$$
If $1/2 < p < \theta < 1$,
$$\frac{\sigma^2_\infty(\hat\mu_I)}{\mathrm{Var}_\pi(f)} \sim K_2(\theta,p)\Big(\frac{\theta}{p}\Big)^d.$$
If $1/2 < \theta < p < 1$,
$$\frac{\sigma^2_\infty(\hat\mu_I)}{\mathrm{Var}_\pi(f)} \sim K_3(\theta,p)\Big[\Big(\frac{\theta p + \theta^3 - 2\theta^2 p}{p(1-p)}\Big)^d + \Big(\frac{\theta}{p}\Big)^d\Big].$$

The hypercube: Remark
The last proposition shows that, for $1/2 < p < \theta < 1$, the normalized asymptotic variance of $\hat\mu_I$ is exponentially worse (in $d$) than the normalized asymptotic variance of $\hat\mu_M$, while it is exponentially better if $1/2 < \theta < p < 1$. On the one hand, this agrees with the heuristic that the importance sampling estimator of $\pi(A)$ performs better than the Monte Carlo estimator whenever the importance distribution $\sigma$ puts more weight on $A$ than $\pi$ does. On the other hand, the example shows that even a small "loss of weight" on $A$ can cause exponentially worse behaviour.

Independence Sampling
Take $K(x,y) = \sigma(y)$. Let $\mathcal{X} = \{0, 1, \dots, m-1\}$, with the states ordered so that
$$\frac{\pi(0)}{\sigma(0)} \ge \frac{\pi(1)}{\sigma(1)} \ge \dots \ge \frac{\pi(m-1)}{\sigma(m-1)}.$$
From state $i$ the Metropolis chain proceeds by choosing $j$ from $\sigma$: if $j \le i$ the chain moves to $j$; if $j > i$ the chain moves to $j$ with probability $\pi(j)\sigma(i)/(\pi(i)\sigma(j))$ and remains at $i$ with probability $1 - \pi(j)\sigma(i)/(\pi(i)\sigma(j))$.

J. Liu diagonalized the independence sampling chain:
$$\beta_k = \sum_{i \ge k}\Big(\sigma(i) - \pi(i)\frac{\sigma(k)}{\pi(k)}\Big),
\qquad
\psi_k = (\underbrace{0,\dots,0}_{k-1},\, S_\pi(k+1),\, -\pi(k), \dots, -\pi(k)),
\qquad
S_\pi(k+1) := \sum_{j=k+1}^{m-1}\pi(j).$$

Non-self-intersecting paths: a problem of Knuth
(Figure: two lattice paths on a grid, one self-intersecting, marked NO, and one self-avoiding, marked YES.)

With $\sigma$ the sampling distribution of the paths,
$$\hat\mu_{SIS} = \frac{1}{N}\sum_{i=1}^N \frac{1}{\sigma(\gamma_i)}.$$
Knuth used a sequential importance sampling algorithm to give the following estimates for a $10 \times 10$ grid:
- number of paths $= (1.6 \pm 0.3) \times 10^{24}$
- average path length $= 92 \pm 5$
- proportion of paths through $(5,5)$ $= 81 \pm 10$ percent.
He noted that usually $1/\sigma(\gamma)$ was between $10^{11}$ and $10^{17}$; a few of his sample values were much larger, accounting for the $10^{24}$. It is hard to bound or assess the variance of the sequential importance sampling estimator.

Monotone paths
(Figure: monotone paths on a small grid with their probabilities under $\sigma$: 1/4, 1/8, 1/8, 1/8, 1/8, 1/4.)
In this case:
- $|\mathcal{X}| = \binom{2n}{n}$
- $\pi(\gamma) = 1/\binom{2n}{n}$
- $\sigma(\gamma) = 2^{-T(\gamma)}$, where $T(\gamma)$ is the first time the walk hits the top or right side.

For monotone paths on an $n \times n$ grid:
- Under the uniform distribution $\pi$,
$$\pi\{T(\gamma) = j\} = 2\binom{j-1}{n-1}\Big/\binom{2n}{n}, \qquad n \le j \le 2n-1.$$
- For $n$ large and any fixed $k$,
$$\pi\{T(\gamma) = 2n-1-k\} \to \frac{1}{2^{k+1}}, \qquad 0 \le k < +\infty.$$
- Under the importance sampling distribution $\sigma$,
$$\sigma\{T(\gamma) = j\} = 2^{1-j}\binom{j-1}{n-1}, \qquad n \le j \le 2n-1.$$
- For $n$ large and any fixed positive $x$,
$$\sigma\Big\{\frac{2n-1-T(\gamma)}{\sqrt{n}} \le x\Big\} \to \frac{1}{\sqrt{\pi}}\int_0^x e^{-y^2/4}\,dy.$$

Recall that
$$\hat\mu_{SIS} = \frac{1}{N}\sum_{i=1}^N \frac{1}{\sigma(\gamma_i)}$$
estimates the number of monotone paths in an $n\times n$ grid, and
$$\mathrm{Var}_\sigma(\hat\mu_{SIS}) = \frac{1}{N}\,\frac{16^n}{4\sqrt{n}}\,\Big(1 + O\big(\tfrac1n\big)\Big).$$

More generally,
$$\hat\mu_{SIS}(f) := \frac{1}{N}\sum_{i=1}^N \frac{2^{T(\gamma_i)}}{\binom{2n}{n}}\, f(\gamma_i)$$
is an unbiased estimator of
$$\mu = \sum_\gamma \frac{1}{\binom{2n}{n}}\, f(\gamma).$$
In this case we have
$$\mathrm{Var}_\sigma(\hat\mu_{SIS}(f)) = \frac{1}{N}\bigg\{ E_\pi\Big[\frac{2^{T(\gamma)}}{\binom{2n}{n}}\, f(\gamma)^2\Big] - \mu^2 \bigg\}.$$

Example: $f(\gamma) = T(\gamma)$, the first hitting time of a uniformly chosen random path $\gamma$ to the top or right side of an $n\times n$ grid. Then
$$\mu = E_\pi(T) = \Big(2 - \frac{2}{n+1}\Big)n
\qquad\text{and}\qquad
\mathrm{Var}_\sigma(\hat\mu_{SIS}(T)) \sim \frac{\sqrt{\pi}\, n^{5/2}}{N}.$$
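As a sanity check on these formulas, here is a minimal Python sketch (not from the talk) of the sequential sampler for monotone paths: it draws paths under $\sigma(\gamma) = 2^{-T(\gamma)}$ by choosing right/up uniformly until the boundary is hit, then forms the SIS estimates of the number of paths and of $E_\pi(T)$. Grid size and sample size are illustrative.

```python
import math
import numpy as np

def sample_T(n, rng):
    """Grow a monotone path on an n x n grid: while strictly inside, choose
    'right' or 'up' with probability 1/2 each; after hitting the top or right
    side the remaining steps are forced.  Returns T(gamma), the first hitting
    time of the top or right side, so sigma(gamma) = 2**(-T)."""
    x = y = t = 0
    while x < n and y < n:
        t += 1
        if rng.random() < 0.5:
            x += 1
        else:
            y += 1
    return t

def sis_estimates(n, N, rng):
    """SIS estimates: mean of 2**T estimates the number of monotone paths
    C(2n, n); the weighted mean of T estimates E_pi(T)."""
    total_paths = math.comb(2 * n, n)
    ts = np.array([sample_T(n, rng) for _ in range(N)], dtype=float)
    weights = 2.0 ** ts
    count_hat = weights.mean()                        # estimates C(2n, n)
    mean_T_hat = np.mean(weights * ts) / total_paths  # estimates E_pi(T)
    return count_hat, mean_T_hat

rng = np.random.default_rng(0)
n, N = 10, 100_000
count_hat, mean_T_hat = sis_estimates(n, N, rng)
print(count_hat, "vs exact", math.comb(2 * n, n))
print(mean_T_hat, "vs exact", (2 - 2 / (n + 1)) * n)
```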
Monotone paths: Metropolis
Consider independence Metropolis sampling with $\pi(\gamma) = 1/\binom{2n}{n}$ and proposal distribution $\sigma(\gamma) = 2^{-T(\gamma)}$.
Convention: $\gamma \le \gamma'$ if $T(\gamma) \ge T(\gamma')$; if $T(\gamma) = T(\gamma')$, use the lexicographic order.
We break paths into groups by $T(\gamma)$:
- group one, with $T(\gamma) = 2n-1$, of size $A_1 := 2\binom{2n-2}{n-1}$;
- group two, with $T(\gamma) = 2n-2$, of size $A_2 := 2\binom{2n-3}{n-1}$;
- ...
- group $i$, with $T(\gamma) = 2n-i$, of size $A_i := 2\binom{2n-1-i}{n-1}$.

The independence proposal Markov chain on monotone paths with proposal distribution $\sigma(\gamma) = 2^{-T(\gamma)}$ and stationary distribution $\pi(\gamma) = 1/\binom{2n}{n}$ has $n$ distinct eigenvalues, one for each of the $n$ groups above, with multiplicity the size of the $i$th group. If $s(i) = 2^{-(2n-1-i)}$, the eigenvalues are
$$\beta_1 = 1 - s(0)\binom{2n}{n}, \qquad \text{multiplicity } A_1 - 1,$$
$$\beta_2 = 1 - s(1)\binom{2n}{n} + A_1\big(s(1)-s(0)\big), \qquad \text{multiplicity } A_2,$$
$$\dots$$
$$\beta_i = 1 - s(i-1)\binom{2n}{n} + A_1\big(s(i-1)-s(0)\big) + \dots + A_{i-1}\big(s(i-1)-s(i-2)\big), \qquad \text{multiplicity } A_i,$$
$$\dots$$

Using Stirling's formula, for fixed $j$,
$$\beta_j = 1 - \frac{2j}{\sqrt{\pi n}} + O_j\Big(\frac{1}{n}\Big).$$

Let $f(\gamma) = T(\gamma)$, the first hitting time of a uniformly chosen random path to the top or right side of an $n\times n$ grid. Then
$$\sigma^2_\infty(\hat\mu_{Met}(T)) \le 8\sqrt{\pi}\, n^{3/2} + O(n).$$
Roughly,
$$\mathrm{Var}_\pi(\hat\mu_{SIS}) \approx \frac{n^{5/2}}{N}
\qquad\text{while}\qquad
\mathrm{Var}_\pi(\hat\mu_{Met}) \approx \frac{n^{3/2}}{N}.$$

What's beyond toy models?
An example of an (almost) real problem. Fix $(c_1,\dots,c_n) \in (\mathbb{Z}^+)^n$ and $(r_1,\dots,r_n) \in (\mathbb{Z}^+)^n$ such that
$$\sum_i c_i = \sum_j r_j = M,$$
and set
$$\mathcal{X} = \Big\{A = [a_{ij}] : \sum_i a_{ij} = c_j,\ \sum_j a_{ij} = r_i,\ a_{ij} \in \mathbb{Z}^+\Big\}.$$
Here $\pi$ is uniform on $\mathcal{X}$.
Harder (networks):
$$\mathcal{X} = \Big\{A = [a_{ij}] : \sum_i a_{ij} = c_j,\ \sum_j a_{ij} = r_i,\ a_{ij} \in \{0,1\}\Big\}.$$

MCMC (Diaconis): pick two rows and two columns at random and add
$$\begin{pmatrix} +1 & -1 \\ -1 & +1 \end{pmatrix}
\qquad\text{or}\qquad
\begin{pmatrix} -1 & +1 \\ +1 & -1 \end{pmatrix}$$
to the corresponding $2\times 2$ submatrix (for instance turning a submatrix $\begin{pmatrix} 1 & 0 \\ 0 & 1\end{pmatrix}$ into $\begin{pmatrix} 0 & 1 \\ 1 & 0\end{pmatrix}$), leaving all row and column sums unchanged; a sketch of this move is given below. SLOW!! but everything is known!

Importance sampling: FAST!! but difficult to analyze (and to describe ...).
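To make the contingency-table chain concrete, here is a minimal Python sketch (not from the talk) of the $\pm 1$ swap move described above, for the nonnegative-integer case: proposals that would create a negative entry are rejected, so the chain stays in $\mathcal{X}$ and the row and column sums are preserved. The starting table and the number of steps are illustrative.

```python
import numpy as np

def swap_move(A, rng):
    """One proposed +/-1 checkerboard move on a nonnegative integer matrix:
    pick two rows and two columns at random and add (+1,-1;-1,+1) or
    (-1,+1;+1,-1) to the 2x2 submatrix; reject if any entry would go
    negative.  Row and column sums are unchanged in either case."""
    n_rows, n_cols = A.shape
    i, j = rng.choice(n_rows, size=2, replace=False)
    k, l = rng.choice(n_cols, size=2, replace=False)
    sign = 1 if rng.random() < 0.5 else -1
    delta = sign * np.array([[1, -1], [-1, 1]])
    sub = A[np.ix_([i, j], [k, l])] + delta
    if (sub >= 0).all():                  # stay inside the state space
        A[np.ix_([i, j], [k, l])] = sub
    return A

# Illustrative start: any table with the desired margins will do.
rng = np.random.default_rng(0)
A = np.array([[2, 1, 0],
              [0, 1, 2],
              [1, 1, 1]])
for _ in range(10_000):
    A = swap_move(A, rng)
print(A)
print(A.sum(axis=0), A.sum(axis=1))       # margins are preserved
```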