An introduction to Markov chain Monte Carlo on general state spaces

Daniel Rudolf∗
[email protected]

October 11, 2013

Contents

Preface
1 Introduction
  1.1 Where classical Monte Carlo fails
  1.2 Bayesian approach to inverse problems
2 Markov chains
  2.1 Definitions and notation
  2.2 Markov operator
  2.3 Convergence properties of Markov chains
3 Approximation of expectations
4 Algorithms
  4.1 Metropolis-Hastings algorithm
  4.2 Hit-and-run algorithm
  4.3 Gibbs sampler

∗ The author was supported by an Australian Research Council Discovery Project (F. Kuo and I. Sloan), by the DFG priority program 1324 and the DFG Research training group 1523.

Preface

These notes are based on a series of six lectures given in November and December 2012 at the University of New South Wales in Sydney. The material discussed in these notes is not new. The goal was to provide an introduction to Markov chain Monte Carlo on general state spaces, and I wanted to keep it elementary to give the reader better access to the topic. I want to thank F. Kuo and J. Dick for asking me to give these lectures and for their interest in the topic. Furthermore, I want to thank H. Zhu for typing a preliminary version of this manuscript.

1 Introduction

Markov chains are widely used. For example, they are used in optimization and approximate sampling, they provide a tool for modeling real world phenomena, they are helpful in decision theory, and they can be used for the approximation of expectations. We focus on approximate sampling and the application of Markov chain Monte Carlo methods for the approximation of expectations. For more applications see for example [GRS96, BGJM11].

1.1 Where classical Monte Carlo fails

Let $G \subset \mathbb{R}^d$ be a measurable set with $0 < \mathrm{vol}_d(G) < \infty$, where $\mathrm{vol}_d$ denotes the Lebesgue measure. Assume that $f \colon G \to \mathbb{R}$ is an integrable function. The goal is to compute

$$ S(f) = \frac{1}{\mathrm{vol}_d(G)} \int_G f(x) \, dx, $$

where $f$ belongs to some class of functions. In other words, we want to compute the expectation of $f$ with respect to the uniform distribution on $G$. We assume that there is a procedure $\mathrm{Or}_f$ which provides information about $f$. This procedure is considered as a "black box" and we call it an oracle. Let $\mathrm{Or}_f$ be a procedure which returns for an input $x \in G$ the function value $f(x)$, i.e. $\mathrm{Or}_f(x) = f(x)$. In general, if $d$ is large, say 10, 50 or even 100, then it might be difficult, depending on $f$, to compute $S(f)$ explicitly. For the moment we assume that we can sample the uniform distribution on $G$, i.e. we can sample random variables $X$ which map from a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ to $(G, \mathcal{B}(G))$ with

$$ \mathbb{P}_X(A) := \mathbb{P}(X \in A) = \frac{\mathrm{vol}_d(A)}{\mathrm{vol}_d(G)}, \quad A \in \mathcal{B}(G), $$

where $\mathcal{B}(G)$ is the Borel $\sigma$-algebra.

Example 1. Let $G = [0,1]^d$ and let $x = (x_1, \dots, x_d)$, where each $x_i \in [0,1]$ is chosen independently and identically distributed (i.i.d.) with uniform distribution in $[0,1]$. Then $x \in [0,1]^d$ is a realization of a random variable with uniform distribution in $[0,1]^d$.
Example 2. Let $G = B_d = \{x \in \mathbb{R}^d : |x|^2 = \sum_{i=1}^d x_i^2 \le 1\}$ be the Euclidean unit ball. By the following steps one can generate a uniformly distributed point $z$ in $B_d$:

1. Let $x = (x_1, \dots, x_d)$, where each $x_i$ is chosen i.i.d. according to $N(0,1)$. (With $N(\mu, \sigma^2)$ we denote the Normal distribution with mean $\mu$ and variance $\sigma^2$.)

2. Let $y = \frac{x}{|x|}$. This is a realization of a uniformly distributed random variable on the $d$-dimensional sphere $S^{d-1} = \{x \in \mathbb{R}^d : |x| = 1\}$.

3. Choose $u \in [0,1]$ uniformly distributed and set $r = u^{1/d}$. Then return $z = r y$.

As an exercise one can prove that the procedure stated in Example 2 indeed returns a uniformly distributed point in the Euclidean unit ball $B_d$.

Let $(X_n)_{n \in \mathbb{N}}$ be an i.i.d. sequence of random variables, where every $X_n$ is uniformly distributed on $G$. Then we consider

$$ S_n(f) = \frac{1}{n} \sum_{i=1}^n f(X_i) $$

as an approximation of $S(f)$. We call it the classical Monte Carlo method. The following result guarantees that the algorithm is well defined.

Theorem 3 (Strong law of large numbers). Let $f \colon G \to \mathbb{R}$ and assume that

$$ \|f\|_1 := \frac{1}{\mathrm{vol}_d(G)} \int_G |f(x)| \, dx < \infty. $$

Then

$$ \mathbb{P}\big( \lim_{n \to \infty} S_n(f) = S(f) \big) = 1. $$

Note that this is an asymptotic statement, whereas we are rather interested in explicit results with respect to some error criterion. Let us consider the mean square error of a generic algorithm $A_n$ which uses $n$ function values of $f$ and a sample of random variables $Y_1, \dots, Y_n$. For an integrable function $f$ it is given by

$$ e(A_n, f) = \big( \mathbb{E} \, |A_n(f) - S(f)|^2 \big)^{1/2}, $$

where $\mathbb{E}$ denotes the expectation with respect to the joint distribution of $Y_1, \dots, Y_n$. The squared mean square error can be decomposed into variance and bias, i.e.

$$ e(A_n, f)^2 = \mathbb{V}(A_n(f)) + B(A_n(f)), $$

where the variance is $\mathbb{V}(A_n(f)) = \mathbb{E} |A_n(f) - \mathbb{E} A_n(f)|^2$ and the bias is $B(A_n(f)) = |\mathbb{E} A_n(f) - S(f)|^2$. The classical Monte Carlo method $S_n$ is unbiased, i.e. $\mathbb{E} S_n(f) = S(f)$, so that $B(S_n(f)) = 0$ and $\mathbb{V}(S_n(f))^{1/2} = e(S_n, f)$. Now we get the following error bound.

Theorem 4. Let $f \colon G \to \mathbb{R}$ and assume that

$$ \|f\|_2 := \left( \frac{1}{\mathrm{vol}_d(G)} \int_G |f(x)|^2 \, dx \right)^{1/2} < \infty. $$

Then

$$ e(S_n, f) = \frac{1}{\sqrt{n}} \, \|f - S(f)\|_2 \le \frac{1}{\sqrt{n}} \, \|f\|_2. $$

Proof. Since $f(X_1), \dots, f(X_n)$ is an i.i.d. sequence of random variables, the formula of Bienaymé¹ gives

$$ e(S_n, f)^2 = \mathbb{V}(S_n(f)) = \frac{1}{n^2} \mathbb{V}\Big( \sum_{i=1}^n f(X_i) \Big) = \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}(f(X_i)) = \frac{1}{n} \|f - S(f)\|_2^2 \le \frac{1}{n} \|f\|_2^2. $$

¹ For a sequence of square integrable i.i.d. random variables $Z_1, \dots, Z_n$ the formula of Bienaymé states $\mathbb{V}(\sum_{i=1}^n Z_i) = \sum_{i=1}^n \mathbb{V}(Z_i)$.

We showed an explicit error bound for every $n \in \mathbb{N}$ rather than an asymptotic result. However, the assumption $\|f\|_2 < \infty$ is more restrictive than the assumption of Theorem 3, since $\|f\|_1 \le \|f\|_2$. In other words, the class of functions to which the strong law of large numbers applies is larger than the class of functions considered in Theorem 4.

The classical Monte Carlo method relies on the assumption that one has a sampling procedure for the uniform distribution on $G$. Let us modify the problem as follows. Suppose that the set $G$ is given by a membership oracle; in other words, we assume that we are able to evaluate the indicator function $1_G$. Then $f$ and $1_G$ are part of the input of the algorithm, so that we want to compute

$$ S(f, 1_G) = \frac{1}{\mathrm{vol}_d(G)} \int_G f(x) \, dx, $$

where the tuple $(f, 1_G)$ belongs to some class of functions. We need the following regularity assumption.

Assumption 5. The convex body $G \subset \mathbb{R}^d$ is bounded from below and above; more precisely, for $r > 1$ let

$$ B_d \subset G \subset r B_d, $$

where $r B_d = \{x \in \mathbb{R}^d : |x| \le r\}$ is the Euclidean ball with radius $r$ around $0$ and $B_d = 1 B_d$.

Since $G$ is part of the input of the algorithm it is necessary to construct a sampling procedure which generates the uniform distribution on $G$.
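Before discussing how to do this, here is a minimal Python sketch (not part of the original notes; NumPy is assumed) of the sampling procedure from Example 2 together with the classical Monte Carlo method $S_n$. As a test function we take $f(x) = |x|^2$ on the unit ball, for which the exact value is $S(f) = d/(d+2)$.

import numpy as np

def sample_unit_ball(d, rng):
    """Draw one point uniformly from the Euclidean unit ball B_d (Example 2)."""
    x = rng.standard_normal(d)      # step 1: i.i.d. N(0,1) coordinates
    y = x / np.linalg.norm(x)       # step 2: uniform on the sphere S^{d-1}
    r = rng.uniform() ** (1.0 / d)  # step 3: radius correction
    return r * y

def classical_monte_carlo(f, sampler, n, rng):
    """S_n(f): average of f over n i.i.d. samples from the uniform distribution."""
    return np.mean([f(sampler(rng)) for _ in range(n)])

rng = np.random.default_rng(0)
d = 10
# Approximate the mean of f(x) = |x|^2 over the unit ball; exact value is d/(d+2).
est = classical_monte_carlo(lambda x: np.dot(x, x),
                            lambda rng: sample_unit_ball(d, rng), 10_000, rng)
print(est, d / (d + 2))

With $n = 10^4$ samples, Theorem 4 predicts a mean square error of order $n^{-1/2} = 10^{-2}$, independently of the dimension $d$.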
A first approach might be to consider a simple acceptance/rejection method: we generate a uniformly distributed point in $r B_d$, and if it lies in $G$ it is accepted, otherwise it is rejected. However, this method does not work reasonably. For example, the acceptance probability for the generation of one state in $G = B_d$ is $r^{-d}$. One might think that the set $r B_d$ is just too large compared to $B_d$. Let us assume that $r = \sqrt{d}$ and that we generate a uniformly distributed state in the smallest box containing $G = B_d$, that is $[-1,1]^d$. The acceptance probability of the proposed state is

$$ 2^{-d} \, \mathrm{vol}_d(B_d) \approx \frac{\sqrt{2e}}{\pi} \left( \frac{e\pi}{2(d+2)} \right)^{\frac{d+1}{2}}. $$

This is also exponentially bad with respect to the dimension. Now, we assumed $G = B_d$, so one could argue that it is no problem to sample the uniform distribution on the unit ball (Example 2). But the sampling procedure should work for every set which satisfies Assumption 5. It should work uniformly for balls, cubes and polytopes as well as for more general convex sets, such as $\mathrm{conv}(\{1\} \cup B_d)$, where $1 = (1, \dots, 1)$ and $\mathrm{conv}$ denotes the convex hull. Summarized, the problem is to provide a reasonably fast random number generator uniformly for all sets which satisfy Assumption 5.

Now the idea is to run a Markov chain to approximate the uniform distribution for any $G$ which is given by a membership oracle and satisfies Assumption 5. A Markov chain is a sequence of random variables $(X_n)_{n \in \mathbb{N}}$ which satisfies the Markov property. This means that the distribution of $X_{n+1}$ (the future) depends only on $X_n$ (the present) and not on $(X_1, \dots, X_{n-1})$ (the past). In formulas,

$$ \mathbb{P}(X_{n+1} \in A \mid X_1, \dots, X_n) = \mathbb{P}(X_{n+1} \in A \mid X_n). $$

The Markov chain should approximate the uniform distribution on $G$, i.e. for all $A \subset G$ we want that

$$ \mathbb{P}(X_n \in A \mid X_1) \xrightarrow{n \to \infty} \frac{\mathrm{vol}_d(A)}{\mathrm{vol}_d(G)}. $$

Later we define precisely what kind of convergence is sufficient, and we will construct a Markov chain which has the desired properties. Then we consider the estimator

$$ S_{n,n_0}(f, 1_G) = \frac{1}{n} \sum_{j=1}^n f(X_{j+n_0}). $$

The additional parameter $n_0$ is called the burn-in and is, roughly speaking, the number of steps the Markov chain needs to get close to the uniform distribution on $G$.

1.2 Bayesian approach to inverse problems

Markov chains are especially useful if one wants to draw approximate samples from a probability measure which is only partially known. We want to motivate this by a Bayesian approach to inverse problems. For a thorough introduction to Bayesian inverse problems we refer to [Stu10].

Let $s, d \in \mathbb{N}$, $u \in \mathbb{R}^s$, $y \in \mathbb{R}^d$ and $G \colon \mathbb{R}^s \to \mathbb{R}^d$. We have an equation of the form

$$ y = G(u) $$

and the goal is to find $u \in \mathbb{R}^s$ for a given $y \in \mathbb{R}^d$. We call $u$ the solution, $y$ the data and $G$ the observation operator. The problem might be ill-posed, which means that either no solution exists or the solution is not unique. This motivates a reformulation of the problem. Namely, we consider

$$ \operatorname*{argmin}_{u \in \mathbb{R}^s} \frac{1}{2} |y - G(u)|_\Gamma^2 $$

as a solution of the inverse problem, where $|\cdot|_\Gamma$ is a norm on $\mathbb{R}^d$. Again there might be difficulties: for example, there might be multiple minima. To avoid this problem one can use a regularized approach. Let $|\cdot|_\Sigma$ be a norm on $\mathbb{R}^s$ and let $u_0 \in \mathbb{R}^s$. Then the goal is to find

$$ \operatorname*{argmin}_{u \in \mathbb{R}^s} \; \frac{1}{2} |y - G(u)|_\Gamma^2 + \frac{1}{2} |u - u_0|_\Sigma^2. \tag{1} $$

The problem formulated this way can often be solved. However, the choice of the norm $|\cdot|_\Sigma$ and of $u_0$ is, without additional assumptions, somewhat arbitrary. Furthermore, if we consider $y$ as data measured by a physical device, it is very likely that some noise in the measurement perturbs the data.
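For orientation, here is a minimal sketch of the regularized problem (1) in the special case of a linear observation operator $G(u) = Au$ with norms induced by matrices $\Gamma^{-1}$ and $\Sigma^{-1}$; in this case the minimizer solves a linear system (the normal equations). All concrete values below ($A$, the noise level, $u_0$) are made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
s, d = 3, 5
A = rng.standard_normal((d, s))                  # linear observation operator G(u) = A u
u_true = np.array([1.0, -2.0, 0.5])
y = A @ u_true + 0.1 * rng.standard_normal(d)    # noisy data

Gamma_inv = np.eye(d)        # |v|_Gamma^2 = v^T Gamma^{-1} v (here Gamma = I)
Sigma_inv = 0.5 * np.eye(s)  # |w|_Sigma^2 = w^T Sigma^{-1} w
u0 = np.zeros(s)

# For G(u) = A u the minimizer of (1) solves the normal equations
# (A^T Gamma^{-1} A + Sigma^{-1}) u = A^T Gamma^{-1} y + Sigma^{-1} u0.
u_hat = np.linalg.solve(A.T @ Gamma_inv @ A + Sigma_inv,
                        A.T @ Gamma_inv @ y + Sigma_inv @ u0)
print(u_hat)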
In this setting a Bayesian approach is useful. We have the following modified model

$$ y = G(u) + z, $$

where $z \in \mathbb{R}^d$ is the observation noise. Now we want to find a probability measure $\pi^y$ on $(\mathbb{R}^s, \mathcal{B}(\mathbb{R}^s))$ which contains information about the probability of different solutions $u$ given data $y$. We call $\pi^y$ the posterior distribution. Further, we assume that there is a probability measure $\pi_0$ on $(\mathbb{R}^s, \mathcal{B}(\mathbb{R}^s))$ which describes our prior belief about the solution $u$. The measure $\pi_0$ is called the prior distribution. Finally, we assume that the noise $z$ is a realization of a random variable with mean zero whose statistical properties are known. Let us summarize and state the assumptions precisely.

Assumption 6. We have a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and all random variables map from this space into $(\mathbb{R}^q, \mathcal{B}(\mathbb{R}^q))$ with either $q = s$ or $q = d$. Further let

(i) $\rho_0 \colon \mathbb{R}^s \to [0, \infty)$ be the density of the measure $\pi_0$ with respect to the Lebesgue measure on $(\mathbb{R}^s, \mathcal{B}(\mathbb{R}^s))$;

(ii) $\rho^y \colon \mathbb{R}^s \to [0, \infty)$ be the density of the measure $\pi^y$ with respect to the Lebesgue measure on $(\mathbb{R}^s, \mathcal{B}(\mathbb{R}^s))$;

(iii) the observational noise be modeled by a mean zero random variable $Z \colon (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ with distribution

$$ \mathbb{P}(Z \in A) = \int_A v(z) \, dz, \quad A \in \mathcal{B}(\mathbb{R}^d), $$

where $v \colon \mathbb{R}^d \to [0, \infty)$ is a density with respect to the Lebesgue measure.

Before going on we recall the Bayes formula and provide some implications. To this end let $U \colon (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}^s, \mathcal{B}(\mathbb{R}^s))$ and $Y \colon (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ be random variables. Let us assume that the joint density of $(U, Y)$ with respect to the Lebesgue measure is given by $\xi \colon \mathbb{R}^s \times \mathbb{R}^d \to (0, \infty)$, i.e.

$$ \mathbb{P}((U, Y) \in A \times B) = \int_A \int_B \xi(u, y) \, dy \, du, \quad A \in \mathcal{B}(\mathbb{R}^s), \; B \in \mathcal{B}(\mathbb{R}^d). $$

Then the formula of Bayes is given by

$$ \xi(u \mid Y = y) = \frac{\xi(y \mid U = u) \, \xi_U(u)}{\xi_Y(y)}, $$

where

$$ \xi_U(u) = \int_{\mathbb{R}^d} \xi(u, y) \, dy \quad \text{and} \quad \xi_Y(y) = \int_{\mathbb{R}^s} \xi(u, y) \, du. $$

We call $\xi(u \mid Y = y)$ the posterior density, $\xi(y \mid U = u)$ the data likelihood and $\xi_U(u)$ the prior density. We use the formula of the conditional distribution

$$ \xi(y \mid U = u) = \frac{\xi(u, y)}{\xi_U(u)} $$

to conclude that

$$ \xi_Y(y) = \int_{\mathbb{R}^s} \xi(y \mid U = u) \cdot \xi_U(u) \, du. $$

Then, by the formula of Bayes, we obtain

$$ \xi(u \mid Y = y) = \frac{\xi(y \mid U = u) \cdot \xi_U(u)}{\int_{\mathbb{R}^s} \xi(y \mid U = u) \cdot \xi_U(u) \, du}. $$

In other words, the density $\xi(u \mid Y = y)$ in $u$ is proportional to $\xi(y \mid U = u) \cdot \xi_U(u)$. This proportionality is denoted by

$$ \xi(u \mid Y = y) \propto \xi(y \mid U = u) \cdot \xi_U(u). $$

Under Assumption 6 we set

$$ \rho_0(u) = \xi_U(u) \quad \text{(prior density)}, \qquad \rho^y(u) = \xi(u \mid Y = y) \quad \text{(posterior density)} $$

and use the likelihood ansatz

$$ \xi(y \mid U = u) = v(y - G(u)), $$

such that $\rho^y(u)$ is proportional to $v(y - G(u)) \rho_0(u)$, in symbols

$$ \rho^y(u) \propto v(y - G(u)) \, \rho_0(u). $$

This implies

$$ \frac{d\pi^y}{d\pi_0}(u) \propto v(y - G(u)). $$

Let us define the potential $\phi(u, y) = -\log(v(y - G(u)))$, such that

$$ \frac{d\pi^y}{d\pi_0}(u) \propto \exp(-\phi(u, y)). $$

The goal is to sample with respect to the posterior measure $\pi^y$. Note that we know the density $\frac{d\pi^y}{d\pi_0}(u)$ only up to the normalizing constant. Thus, Markov chains might provide a way to approximate a sample with distribution $\pi^y$.

In the following let us briefly state some relations between the regularized and the Bayesian approach to the inverse problem. To this end we consider a specific model, where we require additional assumptions. We assume that $v$ is the density of a Normal distribution with mean zero and covariance matrix $\Gamma \in \mathbb{R}^{d \times d}$, and that the density $\rho_0$ of the prior is also given by a Normal distribution, with mean $u_0$ and covariance matrix $\Sigma \in \mathbb{R}^{s \times s}$, i.e.

$$ v(z) \propto \exp\Big( -\frac{1}{2} \big| \Gamma^{-1/2} z \big|^2 \Big) \quad \text{and} \quad \rho_0(u) \propto \exp\Big( -\frac{1}{2} \big| \Sigma^{-1/2} (u - u_0) \big|^2 \Big). $$

In this setting

$$ \rho^y(u) \propto \exp\Big( -\frac{1}{2} \big| \Gamma^{-1/2} (y - G(u)) \big|^2 - \frac{1}{2} \big| \Sigma^{-1/2} (u - u_0) \big|^2 \Big) $$

and

$$ \frac{d\pi^y}{d\pi_0}(u) \propto \exp\Big( -\frac{1}{2} \big| \Gamma^{-1/2} (y - G(u)) \big|^2 \Big). $$
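In code, all that the sampling algorithms of Section 4 will need is the unnormalized (log-)posterior density. A minimal sketch for the Gaussian model above; the nonlinear observation operator G_op and all concrete values are placeholders, not part of the original notes.

import numpy as np

def log_posterior_unnorm(u, y, G_op, Gamma_inv, Sigma_inv, u0):
    # log rho^y(u) up to an additive constant:
    # -(1/2)|Gamma^{-1/2}(y - G(u))|^2 - (1/2)|Sigma^{-1/2}(u - u0)|^2
    r = y - G_op(u)
    w = u - u0
    return -0.5 * r @ Gamma_inv @ r - 0.5 * w @ Sigma_inv @ w

# Toy example with a made-up nonlinear observation operator.
G_op = lambda u: np.array([u[0] ** 2, u[0] * u[1]])
val = log_posterior_unnorm(np.array([1.0, 2.0]), np.array([1.1, 1.9]),
                           G_op, np.eye(2), np.eye(2), np.zeros(2))
print(val)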
We define the norms in the regularized approach (1) as $|\cdot|_\Sigma = |\Sigma^{-1/2} \cdot|$ and $|\cdot|_\Gamma = |\Gamma^{-1/2} \cdot|$. Thus, a solution of the regularized problem (1) achieves the largest value of the density $\rho^y$. The other way around: a very likely area with respect to $\pi^y$ is with large probability close to a solution of (1).

There are advantages of the Bayesian approach to the inverse problem: the norms which arise have a close relation to the model. Namely, the norms above come from the covariance matrices of the Normal distributions of the observational noise and the prior. Additionally, the parameter $u_0$ is the mean of the prior. And, in the Bayesian context one might ask:

- What is the probability that a solution of the regularized approach is determined by a different local minimizer?

- How certain can we be that a prediction made by a mathematical model will lie in a certain regime?

2 Markov chains

2.1 Definitions and notation

In this section we provide facts and definitions of Markov chains on general state spaces. The paper [RR04] of Roberts and Rosenthal surveys various results in this setting. For further reading we refer to [Rev84, Num84, MT09].

Let $G$ be a complete, separable metric space and let $\mathcal{G} = \mathcal{B}(G)$ be the corresponding Borel $\sigma$-algebra over $G$, such that $(G, \mathcal{G})$ is a measurable space. In all examples $G$ is contained in $\mathbb{R}^d$; however, for the moment we want to keep it more general. In the following we provide the definition of a transition kernel and of a Markov chain.

Definition 1 (Markov kernel, transition kernel). The function $K \colon G \times \mathcal{G} \to [0,1]$ is called a Markov kernel or transition kernel if

(i) for each $x \in G$ the mapping $A \in \mathcal{G} \mapsto K(x, A)$ is a probability measure on $(G, \mathcal{G})$,

(ii) for each $A \in \mathcal{G}$ the mapping $x \in G \mapsto K(x, A)$ is a $\mathcal{G}$-measurable real-valued function.

Definition 2 (Markov chain, initial distribution). A sequence of random variables $(X_n)_{n \in \mathbb{N}}$ on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ mapping into $(G, \mathcal{G})$ is called a Markov chain with transition kernel $K$ if for all $n \in \mathbb{N}$ and $A \in \mathcal{G}$ holds

$$ \mathbb{P}(X_{n+1} \in A \mid X_1, \dots, X_n) = \mathbb{P}(X_{n+1} \in A \mid X_n), \tag{2} $$

and for almost all $x \in G$ we assume

$$ \mathbb{P}(X_{n+1} \in A \mid X_n = x) = K(x, A). \tag{3} $$

The distribution

$$ \nu(A) = \mathbb{P}(X_1 \in A), \quad A \in \mathcal{G}, $$

is called the initial distribution.

We say that a sequence of random variables has the Markov property if (2) is satisfied. By (3) we have that $K$ is a regular version² of a conditional distribution of $X_{n+1}$ given $X_n$, for every $n$. Note that $\mathbb{N}$ is the index set of the sequence of random variables, i.e. we only study discrete time Markov chains. Further note that all Markov chains considered here are homogeneous, i.e. $\mathbb{P}(X_{n+1} \in A \mid X_n = x) = K(x, A)$ and $K$ does not depend on $n$.

² For details on conditional expectation, conditional distribution and a regular version of a conditional distribution see for example [Kal02].

Suppose that $K$ is a transition kernel and $\nu$ is a probability measure on $(G, \mathcal{G})$. How can we get a Markov chain with transition kernel $K$ and initial distribution $\nu$? Let us assume that $G \subset \mathbb{R}^d$. For any transition kernel there exists a random mapping representation, see for example Kallenberg [Kal02, Lemma 2.22, p. 34]. A random mapping representation is given by a measurable function $\Phi \colon G \times U \to G$ which satisfies

$$ \mathbb{P}(\Phi(x, Z) \in A) = K(x, A), \quad x \in G, \; A \in \mathcal{G}, $$

where $(U, \mathcal{U})$ is a measurable space and the random variable $Z \colon (\Omega, \mathcal{F}, \mathbb{P}) \to (U, \mathcal{U})$ has distribution $\mu$. Then a Markov chain can be constructed as follows. Let $(Z_n)_{n \in \mathbb{N}}$, with $Z_n \colon (\Omega, \mathcal{F}, \mathbb{P}) \to (U, \mathcal{U})$, be a sequence of i.i.d.
random variables with distribution $\mu$, and assume that $X_1$ has distribution $\nu$. Then one can see that $(X_n)_{n \in \mathbb{N}}$ defined by

$$ X_n = \Phi(X_{n-1}, Z_n), \quad n \ge 2, $$

is a Markov chain with transition kernel $K$ and initial distribution $\nu$. The function $\Phi$ is also called the update function, random mapping representer or algorithm of the corresponding Markov chain.

We have seen that the transition kernel is an important object in the definition of a Markov chain. In the following we present some examples of transition kernels. The reader might check that the conditions of Definition 1 are satisfied.

Example 7. Let $G \subset \mathbb{R}^d$, $A \in \mathcal{G}$ and $x \in G$.

1. Let $K_1(x, A) = 1_A(x)$. Basically, $K_1$ is the simplest transition kernel one can think of. The update function is given by $\Phi_1(x, u) = x$ for any $u \in U$.

2. Let $\pi$ be a probability measure on $(G, \mathcal{G})$ and assume that we can sample with respect to $\pi$. Then let us define the transition kernel $K_2(x, A) = \pi(A)$. Thus, i.i.d. sampling is a special Markov chain. The update function is given by $\Phi_2(x, u) = u$, where $(U, \mathcal{U}) = (G, \mathcal{G})$ and $u$ is chosen according to $\pi$. Note that the succeeding state is chosen independently of the previous one.

3. Any convex combination of two transition kernels is again a transition kernel. Thus for any $\lambda \in [0,1]$ we have that

$$ K(x, A) = \lambda K_1(x, A) + (1 - \lambda) K_2(x, A) $$

is a transition kernel. Here it is more convenient to describe the update function algorithmically: choose a number $r \in [0,1]$ uniformly distributed, and if $r \le \lambda$ apply $\Phi_1$, otherwise $\Phi_2$.

4. Let $\delta > 0$ and let $B_\delta(x)$ denote the Euclidean ball with radius $\delta$ around $x$. Then the transition kernel of the ball walk is

$$ B_\delta(x, A) = \frac{\mathrm{vol}_d(A \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))} + 1_A(x) \left( 1 - \frac{\mathrm{vol}_d(G \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))} \right). $$

The corresponding update function is

$$ \Phi(x, u) = \begin{cases} x + \delta u, & x + \delta u \in G, \\ x, & \text{otherwise}, \end{cases} $$

where $u$ is chosen according to the uniform distribution in $(U, \mathcal{U}) = (B_d, \mathcal{B}(B_d))$. This can be done, see Example 2. Algorithmically, $B_\delta(x, \cdot)$ is presented in Algorithm 1.

Algorithm 1: Ball-Walk(x, δ)
  input : current state x, radius δ.
  output: next state y.
  Choose u uniformly distributed in B_d;
  if x + δu ∈ G then
      Return x + δu;
  else
      Return x;
  end

5. Let $G \subset \mathbb{R}^d$ be bounded. Now we describe the transition kernel of the hit-and-run algorithm. Recall that $S^{d-1} = \{x \in \mathbb{R}^d : |x| = 1\}$. Let $\theta \in S^{d-1}$ and let $L(x, \theta)$ be the chord in $G$ through $x$ and $x + \theta$, i.e.

$$ L(x, \theta) = \{ x + s\theta \in G \mid s \in \mathbb{R} \}. $$

Note that $L(x, \theta)$ is a one-dimensional object in $\mathbb{R}^d$. Then the transition kernel is

$$ H(x, A) = \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \frac{\mathrm{vol}_1(L(x, \theta) \cap A)}{\mathrm{vol}_1(L(x, \theta))} \, d\theta. $$

In words, the transition works as follows. First, choose a direction $\theta$. Then sample the next state on the line given by $x$ and $x + \theta$ intersected with $G$. The update function is described in Algorithm 2, where we assume that we are able to sample the uniform distribution on $L(x, \theta)$ for any $x \in G$ and $\theta \in S^{d-1}$.

Algorithm 2: hit-and-run(x)
  input : current state x.
  output: next state y.
  Choose θ uniformly distributed in S^{d-1};
  Choose y uniformly distributed in L(x, θ);

6. Finally we want to explain the random scan Gibbs sampler. Let $\{e_1, \dots, e_d\}$ be the Euclidean standard basis in $\mathbb{R}^d$, i.e. $e_i = (0, \dots, 0, 1, 0, \dots, 0) \in \mathbb{R}^d$ with the $1$ at the $i$th coordinate. Then

$$ R(x, A) = \frac{1}{d} \sum_{i=1}^d H_{e_i}(x, A), \quad \text{where} \quad H_{e_i}(x, A) = \frac{\mathrm{vol}_1(L(x, e_i) \cap A)}{\mathrm{vol}_1(L(x, e_i))}. $$

Recall that $L(x, e_i)$ is the intersection of $G$ with the line defined by $x$ and $x + e_i$. This is conceptually very close to the hit-and-run transition kernel. The difference is that, instead of choosing an arbitrary direction, one chooses an element of the Euclidean standard basis uniformly at random. Also note that each $H_{e_i}$ is a transition kernel and $R$ is a convex combination of the $H_{e_i}$'s. For completeness we present the update function in Algorithm 3.

Algorithm 3: random-scan-Gibbs(x)
  input : current state x.
  output: next state y.
  Choose i uniformly distributed in {1, . . . , d};
  Choose y uniformly distributed in L(x, e_i);
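A minimal Python sketch (not part of the original notes) of the ball walk update function of Algorithm 1. The membership oracle in_G and the radius delta are the inputs; the proposal uses the uniform distribution on the unit ball from Example 2.

import numpy as np

def ball_walk_step(x, delta, in_G, rng):
    """One transition of the ball walk (Algorithm 1): propose x + delta*u with u
    uniform in B_d; stay at x if the proposal leaves G (membership oracle in_G)."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    u *= rng.uniform() ** (1.0 / d)   # uniform in the unit ball (Example 2)
    y = x + delta * u
    return y if in_G(y) else x

# Short chain targeting the uniform distribution on G = B_d (see Example 9(4)).
rng = np.random.default_rng(2)
in_G = lambda z: np.dot(z, z) <= 1.0
x = np.zeros(5)
for _ in range(1000):
    x = ball_walk_step(x, 0.5, in_G, rng)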
In the examples we have seen that a convex combination of two transition kernels is again a transition kernel; one could say that this is the way transition kernels are adapted to summation. Now we want to define the product of two transition kernels. Let $K_1$ and $K_2$ be transition kernels on $(G, \mathcal{G})$. Then the product of $K_1$ and $K_2$ is defined as

$$ K_1 K_2(x, A) = \int_G K_2(y, A) \, K_1(x, dy), \quad x \in G, \; A \in \mathcal{G}. $$

Note that $K_1 K_2$ is again a transition kernel. This enables us to multiply a transition kernel $K$ with itself. Let $x \in G$, $A \in \mathcal{G}$ and $n \in \mathbb{N}$; then we define $K^0(x, A) = 1_A(x)$ and

$$ K^n(x, A) = \int_G K^{n-1}(y, A) \, K(x, dy). $$

We also have

$$ K^n(x, A) = \int_G K(y, A) \, K^{n-1}(x, dy). $$

This product can be interpreted in terms of a Markov chain. Let $(X_n)_{n \in \mathbb{N}}$ be a Markov chain with transition kernel $K$. Then we have for all $k, n \in \mathbb{N}$ that

$$ K^n(x, A) = \mathbb{P}(X_{k+n} \in A \mid X_k = x). $$

This is proven by induction over $n$. For $n = 1$ we have

$$ \mathbb{P}(X_{k+1} \in A \mid X_k = x) = K(x, A). $$

If the assertion holds for $n \in \mathbb{N}$, we show how to deduce it for $n + 1$:

$$ \begin{aligned} \mathbb{P}(X_{k+n+1} \in A \mid X_k = x) &= \int_G \mathbb{P}(X_{k+n+1} \in A \mid X_{k+n} = y, X_k = x) \, \mathbb{P}(X_{k+n} \in dy \mid X_k = x) \\ &\overset{(2)}{=} \int_G \mathbb{P}(X_{k+n+1} \in A \mid X_{k+n} = y) \, \mathbb{P}(X_{k+n} \in dy \mid X_k = x) \\ &= \int_G K(y, A) \, K^n(x, dy) = K^{n+1}(x, A). \end{aligned} $$

Now let us assume that $(X_n)_{n \in \mathbb{N}}$ is a Markov chain with transition kernel $K$ and initial distribution $\nu$. Then let us define the notation

$$ \nu P^m(A) = \int_G K^m(y, A) \, \nu(dy), \quad m \in \mathbb{N}. \tag{4} $$

Note that we can consider $\nu$ as a transition kernel (Example 7(2)), and then we have, by the product of two kernels, that $\nu P^m = \nu K^m$. Further note that

$$ \nu P^m(A) = \int_G \mathbb{P}(X_{m+1} \in A \mid X_1 = y) \, \nu(dy) = \mathbb{P}(X_{m+1} \in A). $$

Let us just briefly mention that one can extend this operation to signed measures. With these notations at hand we can define the terms stationary measure and reversible transition kernel.

Definition 3 (stationary measure). Let $\pi$ be a probability measure on $(G, \mathcal{G})$. Then $\pi$ is called a stationary distribution of a transition kernel $K$ if

$$ \pi P(A) = \int_G K(x, A) \, \pi(dx) = \pi(A), \quad A \in \mathcal{G}. $$

For some calculations it is useful to rewrite this as

$$ \pi(dy) = \int_G K(x, dy) \, \pi(dx), \quad y \in G. \tag{5} $$

For a Markov chain $(X_n)_{n \in \mathbb{N}}$ with transition kernel $K$ and stationary initial distribution $\pi$ we have the following interpretation: after a single transition of the Markov chain the same distribution as before arises, i.e.

$$ \mathbb{P}(X_1 \in A) = \pi(A) = \pi P(A) = \mathbb{P}(X_2 \in A), \quad A \in \mathcal{G}. $$

Definition 4 (reversible transition kernel). Let $\pi$ be a probability measure on $(G, \mathcal{G})$. A transition kernel $K$ is called reversible with respect to $\pi$ if

$$ \int_B K(x, A) \, \pi(dx) = \int_A K(x, B) \, \pi(dx), \quad A, B \in \mathcal{G}. \tag{6} $$

Sometimes this is also written, for short, as

$$ K(x, dy) \, \pi(dx) = K(y, dx) \, \pi(dy), \quad x, y \in G. \tag{7} $$

If a transition kernel $K$ is reversible with respect to a distribution $\pi$, then $\pi$ is a stationary distribution of $K$.
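On a finite state space these notions reduce to linear algebra, which is convenient for quick sanity checks. The following sketch (a finite-state illustration, not the general-state-space setting of these notes) verifies the detailed balance condition (6)/(7) and the stationarity (5) for a small row-stochastic matrix; the matrix and distribution are made up.

import numpy as np

# On a finite state space a transition kernel is a row-stochastic matrix K, and
# reversibility (7) becomes detailed balance: pi[i] K[i, j] = pi[j] K[j, i].
K = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
pi = np.array([0.25, 0.5, 0.25])

flux = pi[:, None] * K
print(np.allclose(flux, flux.T))   # detailed balance holds
print(np.allclose(pi @ K, pi))     # hence pi K = pi, i.e. pi is stationary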
Again for the Markov chain with stationary initial distribution we have, under the additional assumption that the transition kernel is reversible with respect to $\pi$, that

$$ \mathbb{P}(X_1 \in A, X_2 \in B) = \mathbb{P}(X_1 \in B, X_2 \in A), \quad A, B \in \mathcal{G}. $$

This gives symmetry in $A$ and $B$ of the measure $\mu(A \times B) = \mathbb{P}(X_1 \in A, X_2 \in B)$. We have the following basic results for reversible transition kernels. We leave the proof to the reader, since it is not too difficult.

Lemma 8.

(i) If a transition kernel $K$ is reversible with respect to $\pi$, then $\pi$ is a stationary distribution of $K$.

(ii) Assume that (6) holds for all disjoint sets $A, B \in \mathcal{G}$. Then $K$ is reversible with respect to $\pi$.

Now let us provide some examples of stationary distributions and reversible transition kernels.

Example 9.

1. Let us consider $K_1(x, A) = 1_A(x)$. Then we obtain for any probability measure $\nu$ on $(G, \mathcal{G})$ that

$$ \int_B K_1(x, A) \, \nu(dx) = \nu(A \cap B) = \int_A K_1(x, B) \, \nu(dx), \quad A, B \in \mathcal{G}. $$

Thus, this kernel is reversible with respect to any possible $\nu$. This further implies that any $\nu$ is a stationary distribution of $K_1$.

2. Let us consider $K_2(x, A) = \pi(A)$, where $\pi$ is a probability measure on $(G, \mathcal{G})$. Obviously,

$$ \int_A K_2(x, B) \, \pi(dx) = \pi(A) \pi(B) = \int_B K_2(x, A) \, \pi(dx), \quad A, B \in \mathcal{G}. $$

Thus, $K_2$ is reversible with respect to $\pi$ and $\pi$ is a stationary measure.

3. Let $K_1$ and $K_2$ be two transition kernels which are both reversible with respect to a distribution $\pi$. Then any convex combination of $K_1$ and $K_2$ is again reversible with respect to $\pi$. But in general it is not true that the product of $K_1$ and $K_2$ has this property: $K_1 K_2$ is reversible with respect to $\pi$ if and only if $K_2 K_1 = K_1 K_2$. However, $\pi$ is a stationary distribution of the product of $K_1$ and $K_2$. This holds since $\pi$ is a stationary distribution of $K_1$ and of $K_2$; here reversibility is not necessary.

4. Now let us turn to a less obvious example. We consider the ball walk. Here we assume that $G \subset \mathbb{R}^d$ is a bounded set and $\delta > 0$. Recall that

$$ B_\delta(x, A) = \frac{\mathrm{vol}_d(A \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))} + 1_A(x) \left( 1 - \frac{\mathrm{vol}_d(G \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))} \right). $$

If one imagines a typical large sample of the ball walk, one might guess that the stationary measure is the uniform distribution on $G$. By Lemma 8(ii) it is enough to show (6) for disjoint sets $C, D \in \mathcal{G}$, i.e. $C \cap D = \emptyset$. To shorten the notation let us define $\kappa = \mathrm{vol}_d(B_\delta(0)) \, \mathrm{vol}_d(G)$. We have

$$ \frac{1}{\mathrm{vol}_d(G)} \int_C B_\delta(x, D) \, dx = \frac{1}{\kappa} \int_C \int_D 1_{B_\delta(x)}(y) \, dy \, dx = \frac{1}{\kappa} \int_C \int_D 1_{B_\delta(y)}(x) \, dy \, dx = \frac{1}{\mathrm{vol}_d(G)} \int_D B_\delta(x, C) \, dx. $$

Note that the second equality follows from $1_{B_\delta(x)}(y) = 1_{B_\delta(y)}(x)$ for any $x, y \in G$. Thus, the ball walk is reversible with respect to the uniform distribution.

2.2 Markov operator

In this subsection we describe how a transition kernel induces an operator and state properties of this operator. Let us assume that $\pi$ is a stationary probability measure of a transition kernel $K$. Further let $p \in [1, \infty]$ and

$$ L_p = L_p(\pi) = \Big\{ f \colon G \to \mathbb{R} \;\Big|\; \|f\|_p = \Big( \int_G |f(x)|^p \, \pi(dx) \Big)^{1/p} < \infty \Big\}. $$

Note that $\|f\|_p \le \|f\|_{p'}$ if $p \le p'$; hence we have $L_{p'} \subset L_p$. For $f \in L_p$ let us define

$$ P f(x) = \int_G f(y) \, K(x, dy), \quad x \in G. $$

We call the linear operator $P$ the Markov operator or transition operator. Let us provide an interpretation. If a Markov chain $(X_n)_{n \in \mathbb{N}}$ with transition kernel $K$ and initial distribution $\delta_x$, the point mass at $x \in G$, is given, then $P f(x)$ is the expectation of $f(X_2)$, i.e.

$$ \mathbb{E}[f(X_2) \mid X_1 = x] = P f(x). $$

In the following we state some helpful properties.

Lemma 10.

(i) For the constant function $f \equiv 1$ we have

$$ P f(x) = f(x), \quad x \in G. \tag{8} $$

(ii) The operator $P$ maps from $L_p$ to $L_p$ and is bounded with $\|P\|_{L_p \to L_p} = 1$.
(iii) The transition kernel $K$ is reversible with respect to $\pi$ if and only if $P$ is self-adjoint on $L_2$.

Proof. The assertion of (i) is obvious. We prove (ii) by

$$ \int_G |P f(x)|^p \, \pi(dx) = \int_G \Big| \int_G f(y) \, K(x, dy) \Big|^p \pi(dx) \le \int_G \int_G |f(y)|^p \, K(x, dy) \, \pi(dx) = \int_G |f(y)|^p \, \pi(dy). $$

The inequality is an application of the Jensen inequality for conditional expectations, and the last equality holds since $\pi$ is a stationary distribution, i.e. $\pi(dy) = \int_G K(x, dy) \, \pi(dx)$.

Now we show (iii). Note that $L_2$ is a Hilbert space with inner product

$$ \langle f, g \rangle = \int_G f(x) g(x) \, \pi(dx). $$

Thus, we have to prove that $\langle P f, g \rangle = \langle f, P g \rangle$ is equivalent to reversibility of $K$. The first direction follows with $f = 1_A$ and $g = 1_B$ for any $A, B \in \mathcal{G}$. For the reverse conclusion we recall the intuitive writing of reversibility: (7) is $K(x, dy) \pi(dx) = K(y, dx) \pi(dy)$. Then for all $f, g \in L_2$ holds

$$ \langle P f, g \rangle = \int_G \int_G f(y) g(x) \, K(x, dy) \, \pi(dx) \overset{(7)}{=} \int_G \int_G f(y) g(x) \, K(y, dx) \, \pi(dy) = \langle f, P g \rangle, $$

which finishes the proof.

From now on we have the following standing assumption.

Assumption 11. The transition kernel $K$ is reversible with respect to a probability measure $\pi$ on $(G, \mathcal{G})$.

This assumption allows us to use the machinery for self-adjoint operators, since then the Markov operator is self-adjoint.

Now let us define how the transition kernel acts on signed measures. By $\mathcal{M}(G)$ we denote the set of real-valued signed measures³ on $(G, \mathcal{G})$. Let $\nu \in \mathcal{M}(G)$. If there exists a density of $\nu$ with respect to $\pi$, then we denote it by $\frac{d\nu}{d\pi}$, and for $q \in [1, \infty]$ let

$$ \|\nu\|_q = \begin{cases} \big\| \frac{d\nu}{d\pi} \big\|_q, & \nu \ll \pi, \\ \infty, & \text{otherwise}. \end{cases} $$

Set $\mathcal{M}_q = \mathcal{M}_q(G, \pi) = \{\nu \in \mathcal{M}(G) : \|\nu\|_q < \infty\}$. The function space $L_q$ is isometrically isomorphic to the space of signed measures $\mathcal{M}_q$, in symbols $L_q \cong \mathcal{M}_q$. The space of signed measures $\mathcal{M}_2$ is a Hilbert space, and its inner product is the $L_2$ inner product of the densities:

$$ \langle \nu, \mu \rangle = \int_G \frac{d\nu}{d\pi}(x) \, \frac{d\mu}{d\pi}(x) \, \pi(dx) = \Big\langle \frac{d\nu}{d\pi}, \frac{d\mu}{d\pi} \Big\rangle, \quad \nu, \mu \in \mathcal{M}_2. $$

³ The set function $\mu \colon \mathcal{G} \to \mathbb{R}$ is a real-valued signed measure if it is the difference of two measures on $(G, \mathcal{G})$. More precisely, if $\mu(\emptyset) = 0$ and for pairwise disjoint $A_1, A_2, \dots$, with $A_k \in \mathcal{G}$ for $k \in \mathbb{N}$, one has $\mu(\cup_{k=1}^\infty A_k) = \sum_{k=1}^\infty \mu(A_k)$.

Let us extend the definition of (4) to signed measures. The transition kernel applies to a signed measure $\nu \in \mathcal{M}_q$ as

$$ \nu P(A) = \int_G K(x, A) \, \nu(dx), \quad A \in \mathcal{G}. $$

We show that this defines a linear operator $\nu \mapsto \nu P$ on $\mathcal{M}_q$. Obviously $(\nu + \mu) P = \nu P + \mu P$. But until now it is not clear whether $\nu P \in \mathcal{M}_q$ when $\nu \in \mathcal{M}_q$. In the following we show that this is indeed true.

Lemma 12. Let $q \in [1, \infty]$ and $\nu \in \mathcal{M}_q$. Then $\pi$-almost everywhere

$$ \frac{d(\nu P)}{d\pi}(x) = P\Big( \frac{d\nu}{d\pi} \Big)(x) \tag{9} $$

and $\nu P \in \mathcal{M}_q$.

Proof. We have for $\nu \in \mathcal{M}_q$ that

$$ \nu P(A) = \int_G K(x, A) \, \nu(dx) = \int_G \int_G 1_A(y) \, \frac{d\nu}{d\pi}(x) \, K(x, dy) \, \pi(dx) \overset{\text{rev.}}{=} \int_G \int_G 1_A(y) \, \frac{d\nu}{d\pi}(x) \, K(y, dx) \, \pi(dy) = \int_A P\Big( \frac{d\nu}{d\pi} \Big)(y) \, \pi(dy). $$

Thus $\|\nu P\|_q = \big\| P\big( \frac{d\nu}{d\pi} \big) \big\|_q \le \big\| \frac{d\nu}{d\pi} \big\|_q < \infty$, which finishes the proof.

2.3 Convergence properties of Markov chains

Let $(X_n)_{n \in \mathbb{N}}$ be a Markov chain with transition kernel $K$ and initial distribution $\nu$. We assume that $K$ is reversible with respect to $\pi$. For $A \in \mathcal{B}(G)$ we ask how fast

$$ \mathbb{P}(X_n \in A) = \nu P^{n-1}(A) \xrightarrow{n \to \infty} \pi(A) \; ? $$

Thus, the goal is to quantify the speed of convergence, if it converges at all, of $\nu P^{n-1}$ to $\pi$ for increasing $n \in \mathbb{N}$. Therefore we introduce a distance between measures.

Definition 5 (total variation distance). The total variation distance between two probability measures $\nu, \mu \in \mathcal{M}(G)$ is defined by

$$ \|\nu - \mu\|_{\mathrm{tv}} = \sup_{A \in \mathcal{G}} |\nu(A) - \mu(A)|. $$

It is helpful to consider the total variation distance as an $L_1$-norm.
Lemma 13. If $\nu, \mu \in \mathcal{M}_1$, then $\|\nu - \mu\|_{\mathrm{tv}} = \frac{1}{2} \|\nu - \mu\|_1$.

Proof. First, we show that

$$ \|\nu - \mu\|_{\mathrm{tv}} = \frac{1}{2} \sup_{\|f\|_\infty \le 1} \Big| \int_G f(x) \, (\nu(dx) - \mu(dx)) \Big|. \tag{10} $$

Let

$$ A = \Big\{ x \in G \;\Big|\; \frac{d\nu}{d\pi}(x) \ge \frac{d\mu}{d\pi}(x) \Big\}. $$

Then the supremum within the total variation distance on the left hand side of (10) is achieved for $A$, and the supremum on the right hand side is achieved for $f = 1_A - 1_{A^c}$. By using this we obtain (10). Now let $\xi \in \mathcal{M}_1$; then we have

$$ \sup_{\|f\|_\infty \le 1} \Big| \int_G f(x) \, \xi(dx) \Big| = \sup_{\|f\|_\infty \le 1} \Big| \int_G f(x) \, \frac{d\xi}{d\pi}(x) \, \pi(dx) \Big|. $$

Furthermore,

$$ \Big\| \frac{d\xi}{d\pi} \Big\|_1 \le \sup_{\|f\|_\infty \le 1} \Big| \int_G f(x) \, \frac{d\xi}{d\pi}(x) \, \pi(dx) \Big| \le \sup_{\|f\|_\infty \le 1} \|f\|_\infty \, \Big\| \frac{d\xi}{d\pi} \Big\|_1 = \Big\| \frac{d\xi}{d\pi} \Big\|_1. $$

Thus the proof is finished if we set $\xi = \nu - \mu$.

Now let us ask for an upper bound of $\|\nu P^n - \pi\|_{\mathrm{tv}}$. First we apply Lemma 13; to this end we assume that $\nu \in \mathcal{M}_1$. We just start a calculation and see how far we get:

$$ 2 \, \|\nu P^n - \pi\|_{\mathrm{tv}} = \|\nu P^n - \pi\|_1 = \Big\| \frac{d(\nu P^n)}{d\pi} - 1 \Big\|_1 \overset{(9)}{=} \Big\| P^n \frac{d\nu}{d\pi} - 1 \Big\|_1 \overset{(8)}{=} \Big\| (P^n - S)\Big( \frac{d\nu}{d\pi} - 1 \Big) \Big\|_1, $$

where we define

$$ S(f) = \int_G f(x) \, \pi(dx). $$

Note that the last equality of the calculation comes from $S(\frac{d\nu}{d\pi} - 1) = 0$ (together with $P^n 1 = 1$, see (8)). This consideration implies the following two bounds.

Lemma 14. Let $\nu \in \mathcal{M}_1$. Then

$$ \|\nu P^n - \pi\|_{\mathrm{tv}} \le \frac{1}{2} \, \|P^n - S\|_{L_1 \to L_1} \, \Big\| \frac{d\nu}{d\pi} - 1 \Big\|_1 $$

and

$$ \|\nu P^n - \pi\|_{\mathrm{tv}} \le \frac{1}{2} \, \|P^n - S\|_{L_2 \to L_2} \, \Big\| \frac{d\nu}{d\pi} - 1 \Big\|_2. $$

Note that if $\nu = \pi$, then $\|\nu P^n - \pi\|_{\mathrm{tv}} = 0$ and our estimates contain $\| \frac{d\nu}{d\pi} - 1 \|_p = 0$ for $p = 1, 2$, which guarantees that the estimate is also zero. This is a nice behavior of the bound with respect to the initial distribution. Let us consider $\|P^n - S\|_{L_2 \to L_2}$ in more detail.

Lemma 15. Let $K$ be a reversible transition kernel with respect to $\pi$. Then

$$ \|P^n - S\|_{L_2 \to L_2} = \|(P - S)^n\|_{L_2 \to L_2} = \|P - S\|_{L_2 \to L_2}^n, \quad n \in \mathbb{N}. $$

Proof. The first equality comes from the fact that $P^n - S = (P - S)^n$, which holds since $\pi$ is a stationary distribution of $K$. The second equality comes from the fact that $(P - S)^n$ is a self-adjoint operator, which follows from the reversibility of $K$; see Lemma 10(iii).

The last two lemmata motivate the following two different convergence properties of transition kernels.

Definition 6 ($L_1$-exponential convergence). Let $\alpha \in [0,1)$ and $M \in (0, \infty)$. Then the transition kernel $K$ is $L_1$-exponentially convergent with $(\alpha, M)$ if

$$ \|P^n - S\|_{L_1 \to L_1} \le \alpha^n M, \quad n \in \mathbb{N}. \tag{11} $$

A Markov chain with transition kernel $K$ is called $L_1$-exponentially convergent if there exist an $\alpha \in [0,1)$ and $M \in (0, \infty)$ such that (11) holds.

Definition 7 ($L_2$-spectral gap). We say that a transition kernel $K$ and its corresponding Markov operator $P$ has an $L_2$-spectral gap if

$$ \mathrm{gap}(P) = 1 - \|P - S\|_{L_2 \to L_2} > 0. $$

If the transition kernel has an $L_2$-spectral gap, then we have, by Lemma 14 and Lemma 15, that

$$ \|\nu P^n - \pi\|_{\mathrm{tv}} \le \frac{(1 - \mathrm{gap}(P))^n}{2} \, \Big\| \frac{d\nu}{d\pi} - 1 \Big\|_2. $$

Next, we define other convergence properties which are based on the total variation distance. After that we state different relations between the different properties.

Definition 8 (uniform ergodicity, geometric ergodicity). Let $\alpha \in [0,1)$ and $M \colon G \to (0, \infty)$. Then the transition kernel $K$ is called geometrically ergodic with $(\alpha, M(x))$ if one has for $\pi$-almost all $x \in G$ that

$$ \|K^n(x, \cdot) - \pi\|_{\mathrm{tv}} \le M(x) \, \alpha^n, \quad n \in \mathbb{N}. \tag{12} $$

If the inequality of (12) holds with a bounded function $M(x)$, i.e. $\sup_{x \in G} M(x) \le M' < \infty$, then $K$ is called uniformly ergodic with $(\alpha, M')$.

Since we assume that the transition kernel is reversible with respect to $\pi$, we have the following relations:

$$ \text{uniformly ergodic with } (\alpha, M) \iff L_1\text{-exponentially convergent with } (\alpha, 2M), $$
$$ \text{uniformly ergodic with } (\alpha, M) \implies \text{geometrically ergodic with } (\alpha, M), $$
$$ L_1\text{-exponentially convergent with } (\alpha, M) \implies L_2\text{-spectral gap} \ge 1 - \alpha. \tag{13} $$

The fact that uniform ergodicity implies geometric ergodicity is obvious.
For the proofs of the other relations and further details we refer to [Rud12, Proposition 3.23, Proposition 3.24]. Now one might ask whether there is a direct relation between geometric ergodicity and the existence of an $L_2$-spectral gap. We impose another condition on the transition kernel.

Definition 9 ($\pi$-irreducible). The transition kernel $K$ is called $\pi$-irreducible if for all $A \in \mathcal{G}$ and all $x \in G$ there exists an $n \in \mathbb{N}$ such that

$$ \pi(A) > 0 \implies K^n(x, A) > 0. $$

The assumption of $\pi$-irreducibility is a connectivity condition in the following sense. Whenever a set $A$ with $\pi(A) > 0$ is given, then, from any starting point $x \in G$, a Markov chain with transition kernel $K$ needs at most a finite number of steps to reach $A$ with positive probability. Thus, there is no set $A$ with $\pi(A) > 0$ which cannot be reached by the Markov chain. Let the transition kernel be $\pi$-irreducible. Then

$$ \text{geometrically ergodic with } (\alpha, M(x)) \iff L_2\text{-spectral gap} \ge 1 - \alpha. \tag{14} $$

For a proof of the relation stated in (14) see [RR97] and [RT01].

3 Approximation of expectations

In this section we assume that we have a Markov chain $(X_n)_{n \in \mathbb{N}}$ with transition kernel $K$ and initial distribution $\nu$. Further let $K$ be reversible with respect to $\pi$. Then the goal is to approximate

$$ S(f) = \int_G f(x) \, \pi(dx). $$

We use the computation of an average of a finite Markov chain sample as approximation of the mean. Thus, we consider

$$ S_{n,n_0}(f) = \frac{1}{n} \sum_{j=1}^n f(X_{j+n_0}). $$

The number $n$ determines the number of function values of $f$ which are computed. The number $n_0$ is the burn-in or warm-up time. Intuitively, it is the number of steps of the Markov chain needed to get close to the stationary distribution. We state a strong law of large numbers, also called ergodic theorem. For details we refer to [AG10], [RR04, Theorem 4, Fact 5, Remark after Corollary 6] or [MT09, Theorem 17.1.7, p. 427].

Theorem 16 (Strong law of large numbers). Let $K$ be a $\pi$-irreducible transition kernel. Then, for any $f \in L_1$ we have

$$ \mathbb{P}_x\big( \lim_{n \to \infty} S_{n,n_0}(f) = S(f) \big) = 1 $$

for $\pi$-almost all $x \in G$, where $\mathbb{P}_x$ denotes the distribution of a Markov chain $(X_n)_{n \in \mathbb{N}}$ with transition kernel $K$ and initial distribution $\nu = \delta_x$, the point mass at $x$.

However, we are rather interested in explicit error bounds. We want to study the mean square error of $S_{n,n_0}$, given by

$$ e_\nu(S_{n,n_0}, f) = \big( \mathbb{E}_{\nu, K} \, |S_{n,n_0}(f) - S(f)|^2 \big)^{1/2}, $$

where $\nu$ and $K$ indicate the initial distribution and the transition kernel. A first question one might ask is: how does the error behave if one already starts with the stationary distribution? Or shorter: how does $e_\pi(S_{n,n_0}, f)$ behave?

Theorem 17. Let $(X_n)_{n \in \mathbb{N}}$ be a Markov chain with transition kernel $K$ and initial distribution $\pi$. Further assume that the transition kernel is reversible with respect to $\pi$. We define

$$ \Lambda = \sup\{ \alpha : \alpha \in \mathrm{spec}(P - S) \}, $$

where $\mathrm{spec}(P - S)$ denotes the spectrum of the operator $P - S \colon L_2 \to L_2$, and assume that $\Lambda < 1$. Then

$$ \sup_{\|f\|_2 \le 1} e_\pi(S_{n,n_0}, f)^2 \le \frac{2}{n(1 - \Lambda)}. $$

For a proof of this result we refer to [Rud12, Corollary 3.27]. Let us discuss the assumptions and the result of the theorem. First, note that for the classical Monte Carlo method we have $\Lambda = 0$. In this case we get exactly (up to a constant factor) what we would expect (see Theorem 4), i.e. the theorem also covers the classical Monte Carlo method. Further note that $\mathrm{gap}(P) = 1 - \|P - S\|_{L_2 \to L_2}$ and

$$ \|P - S\|_{L_2 \to L_2} = \sup\{ |\alpha| : \alpha \in \mathrm{spec}(P - S) \}, $$

such that $\mathrm{gap}(P) \le 1 - \Lambda$. This also implies that if $P \colon L_2 \to L_2$ is positive semi-definite, we obtain $\mathrm{gap}(P) = 1 - \Lambda$. Thus, whenever we have a lower bound for the spectral gap we can apply Theorem 17 and substitute $1 - \Lambda$ by $\mathrm{gap}(P)$. Further note that if $\gamma \in [0,1)$, $M \in (0, \infty)$ and the transition kernel is $L_1$-exponentially convergent with $(\gamma, M)$, then we have, by using (13), that $\mathrm{gap}(P) \ge 1 - \gamma$.
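As a finite-state illustration of Theorem 17 (again plain linear algebra, not the general setting of these notes): for a reversible row-stochastic matrix $K$ whose eigenvalue $1$ is simple, the spectrum of $P - S$ consists of $0$ together with the eigenvalues of $K$ other than $1$, so $\Lambda$ and the bound $2/(n(1-\Lambda))$ can be computed directly. A sketch, reusing the matrix from the earlier detailed-balance check:

import numpy as np

# Reversible chain from the sketch above; its eigenvalues are 1, 0.5 and 0.
K = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
eigs = np.sort(np.linalg.eigvals(K).real)
Lam = max(eigs[-2], 0.0)            # largest element of spec(P - S) for this chain
n = 10_000
print(Lam, 2.0 / (n * (1.0 - Lam))) # Lambda and the bound of Theorem 17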
Now we might ask how $e_\nu(S_{n,n_0}, f)$ behaves depending on the initial distribution. The idea is to decompose the error in a suitable way, for example into a bias and a variance term. However, we want an estimate with respect to $\|f\|_2$, and in this setting the following decomposition is more convenient:

$$ e_\nu(S_{n,n_0}, f)^2 = e_\pi(S_{n,n_0}, f)^2 + \text{rest}, $$

where "rest" denotes an additional term such that equality holds. Then we have to estimate the rest term and can use Theorem 17 to obtain an error bound. For further details of the proof of the following error bound we refer to [Rud12, Theorem 3.34 and Theorem 3.41].

Theorem 18. Let $(X_n)_{n \in \mathbb{N}}$ be a Markov chain with transition kernel $K$ and initial distribution $\nu$. Further let the transition kernel be reversible with respect to $\pi$. Then

$$ \sup_{\|f\|_p \le 1} e_\nu(S_{n,n_0}, f)^2 \le \frac{2}{n(1 - \Lambda)} + \frac{2 C_\nu \gamma^{n_0}}{n^2 (1 - \gamma)^2}, \tag{15} $$

with $\Lambda = \sup\{\alpha : \alpha \in \mathrm{spec}(P - S)\}$, where $\mathrm{spec}(P - S)$ denotes the spectrum of the operator $P - S \colon L_2 \to L_2$, in the following settings:

(i) for $p = 2$, $\nu \in \mathcal{M}_\infty$ and a transition kernel $K$ which is $L_1$-exponentially convergent with $(\gamma, M)$, where $C_\nu = M \big\| \frac{d\nu}{d\pi} - 1 \big\|_\infty$;

(ii) for $p = 4$, $\nu \in \mathcal{M}_2$ and $1 - \gamma = \mathrm{gap}(P) > 0$, where $C_\nu = 64 \big\| \frac{d\nu}{d\pi} - 1 \big\|_2$.

Now we have an explicit error bound. If we want to achieve an error of $\varepsilon \in (0,1)$, it is still not clear how to choose $n$ and $n_0$ when the total number of steps $n + n_0$ is fixed. Thus, the question is: how should one choose the burn-in $n_0$? Let $e(n, n_0)$ be the right hand side of (15) and assume that we have computational resources for $N = n + n_0$ steps of the Markov chain. We want an $n_{\mathrm{opt}}$ which minimizes $e(N - n_0, n_0)$ (as a function of $n_0$). In [Rud12, Lemma 2.26] the following is proven: for all $\delta > 0$, if $N$ and $C_\nu$ are large enough, then $n_{\mathrm{opt}}$ satisfies

$$ n_{\mathrm{opt}} \in \left[ \frac{\log C_\nu}{\log \gamma^{-1}}, \; (1 + \delta) \frac{\log C_\nu}{\log \gamma^{-1}} \right]. $$

Thus, in this setting $n_{\mathrm{opt}} = \big\lceil \frac{\log C_\nu}{\log \gamma^{-1}} \big\rceil$ is a reasonable choice.

4 Algorithms

Let $G \subset \mathbb{R}^d$, $\mathcal{G} = \mathcal{B}(G)$ and $\rho \colon G \to (0, \infty)$, where $\rho$ is integrable with respect to the Lebesgue measure. We can define a distribution $\pi_\rho$ on $(G, \mathcal{G})$ by

$$ \pi_\rho(A) = \frac{\int_A \rho(x) \, dx}{\int_G \rho(x) \, dx}, \quad A \in \mathcal{G}. $$

The goal of this section is to provide basic constructions of transition kernels with stationary distribution $\pi_\rho$.

4.1 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is the most famous method for approximate sampling by a Markov chain. The idea is to modify a given transition kernel such that $\pi_\rho$ becomes a stationary distribution of the modification. First we consider a special case.

Example 19. Let $G \subset \mathbb{R}^d$ be bounded and assume that $\mu$ denotes the uniform distribution on $(G, \mathcal{G})$. How can we modify $\mu$ to obtain something which gets close to $\pi_\rho$? We explain the independent Metropolis algorithm. For $i \in \mathbb{N}$ a single transition from $x_i \in G$ works as follows:

1. Sample a proposal state $y \in G$ with respect to $\mu$, i.e. with uniform distribution on $(G, \mathcal{G})$.

2. Let

$$ \alpha(x_i, y) = \min\Big\{ 1, \frac{\rho(y)}{\rho(x_i)} \Big\} $$

and choose $r \in [0,1]$ uniformly distributed. If $r \le \alpha(x_i, y)$, return $x_{i+1} = y$; otherwise return $x_{i+1} = x_i$.

By introducing an acceptance/rejection step we modify the sampling with respect to $\mu$. The dependence on $\rho$ is entirely determined by the acceptance probability $\alpha(\cdot, \cdot)$. A state $y$ is proposed and it is checked how likely it is. Informally, if $\rho(y)/\rho(x_i) \ge 1$, then the proposed state is at least as likely as the current one and is always accepted. If $\rho(y)/\rho(x_i) < 1$, then the state is accepted only with a certain probability, which quantifies how much less likely the proposed state is compared to the current one.
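A direct transcription of Example 19 into Python (a sketch; the target density rho, the state space $G = [0,1]$ and all concrete values are made up for illustration):

import numpy as np

def independent_metropolis_uniform(rho, sample_mu, n_steps, x0, rng):
    """Example 19: proposals drawn i.i.d. from the uniform distribution mu on G,
    accepted with probability alpha(x, y) = min(1, rho(y)/rho(x))."""
    x = x0
    chain = [x0]
    for _ in range(n_steps):
        y = sample_mu(rng)
        if rng.uniform() <= min(1.0, rho(y) / rho(x)):
            x = y
        chain.append(x)
    return np.array(chain)

# Target on G = [0,1]: unnormalized density rho(x) = exp(-8 (x - 0.5)^2).
rng = np.random.default_rng(3)
chain = independent_metropolis_uniform(lambda x: np.exp(-8 * (x - 0.5) ** 2),
                                       lambda rng: rng.uniform(), 5000, 0.5, rng)
print(chain.mean())   # close to 0.5 by symmetry of the target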
The example illustrates the general modification procedure: we propose a state and accept it with a certain probability which depends on $\rho$. We need some further notation to make the general approach precise. Let $q \colon G \times G \to [0, \infty]$ be a function such that $q(x, \cdot)$ is Lebesgue integrable for all $x \in G$ with $\int_G q(x, y) \, dy \le 1$. Then

$$ Q(x, A) = \int_A q(x, y) \, dy + 1_A(x) \Big( 1 - \int_G q(x, y) \, dy \Big), \quad x \in G, \; A \in \mathcal{G}, $$

is a transition kernel, and we call $q(\cdot, \cdot)$ the transition density. Let $Q$ be the proposal transition kernel. Now the question is: how can we modify $Q$ with an acceptance/rejection step to obtain a reversible transition kernel with respect to $\pi_\rho$? With the proposal transition kernel $Q$ and the non-normalized density $\rho$ we want to construct a transition kernel

$$ K_\rho(x, A) = \int_A k_\rho(x, y) \, dy + 1_A(x) \Big( 1 - \int_G k_\rho(x, y) \, dy \Big), \quad A \in \mathcal{G}, $$

with transition density $k_\rho(\cdot, \cdot)$, such that $K_\rho$ is reversible with respect to $\pi_\rho$. In terms of the transition density, reversibility with respect to $\pi_\rho$ is equivalent to

$$ k_\rho(x, y) \, \rho(x) = k_\rho(y, x) \, \rho(y), \quad x, y \in G. $$

Let $x, y \in G$ and w.l.o.g. assume that $q(x, y) \rho(x) > q(y, x) \rho(y)$. We want to modify the left-hand side so that we achieve equality; thus we have to multiply it by a number $\alpha(x, y) < 1$. We obtain

$$ \alpha(x, y) \, q(x, y) \, \rho(x) = q(y, x) \, \rho(y). $$

This gives $k_\rho(x, y) = \alpha(x, y) q(x, y)$, and since we want reversibility we have to choose $\alpha(y, x) = 1$, so that we get

$$ \alpha(x, y) \, q(x, y) \, \rho(x) = q(y, x) \, \rho(y) = \alpha(y, x) \, q(y, x) \, \rho(y). $$

This completely specifies the acceptance probability as

$$ \alpha(x, y) = \begin{cases} 1, & q(x, y) \rho(x) = 0, \\ \min\Big\{ 1, \dfrac{q(y, x) \rho(y)}{q(x, y) \rho(x)} \Big\}, & \text{otherwise}. \end{cases} $$

For a single transition of the Metropolis-Hastings algorithm see Algorithm 4.

Algorithm 4: Metropolis-Hastings-algorithm(x, Q)
  input : current state x, proposal transition kernel Q.
  output: next state y.
  Generate y with respect to Q(x, ·);
  Compute α(x, y) = min{1, ρ(y)q(y, x) / (ρ(x)q(x, y))};
  if U(0, 1) ≤ α(x, y) then
      Return y;
  else
      Return y := x;
  end

Now we are also able to define the transition kernel of the Metropolis-Hastings algorithm. Namely, we have

$$ K_\rho(x, A) = \int_A \alpha(x, y) \, Q(x, dy) + 1_A(x) \Big( 1 - \int_G \alpha(x, y) \, Q(x, dy) \Big) = \int_A \alpha(x, y) \, q(x, y) \, dy + 1_A(x) \Big( 1 - \int_G \alpha(x, y) \, q(x, y) \, dy \Big). $$

By the construction we obtained the following.

Lemma 20. The transition kernel $K_\rho$ is reversible with respect to $\pi_\rho$.

Proof. The proof follows by the construction. However, if one just has the transition kernel and wants to check whether reversibility holds, one has to show

$$ \int_A K_\rho(x, B) \, \pi_\rho(dx) = \int_B K_\rho(x, A) \, \pi_\rho(dx) $$

for disjoint $A, B \in \mathcal{B}(G)$, see Lemma 8(ii). If one keeps in mind that $\alpha(x, y) q(x, y) \rho(x) = \alpha(y, x) q(y, x) \rho(y)$, it is not too difficult to prove the assertion.

Now we define two special cases of the Metropolis-Hastings algorithm. If $q(x, y) = q(y, x)$, i.e. $q$ is symmetric, then $K_\rho$ is called the Metropolis algorithm, and if $q(x, y) = \eta(y)$ for a function $\eta \colon G \to (0, \infty)$ and all $x, y \in G$, then $K_\rho$ is called the independent Metropolis algorithm, see Example 19.
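A sketch of Algorithm 4 in the first special case, the Metropolis algorithm: here $q(x, \cdot)$ is a symmetric Gaussian random walk on $G = \mathbb{R}^d$, so the acceptance probability reduces to $\min\{1, \rho(y)/\rho(x)\}$. Working with $\log \rho$ avoids numerical under- and overflow; the target and step size below are made up.

import numpy as np

def metropolis_random_walk(log_rho, x0, step, n_steps, rng):
    """Algorithm 4 with symmetric proposal q(x, .) = N(x, step^2 I); then
    alpha(x, y) = min{1, rho(y)/rho(x)}, evaluated in log scale for stability."""
    x, lx = x0, log_rho(x0)
    chain = np.empty((n_steps + 1, x0.shape[0]))
    chain[0] = x0
    for i in range(1, n_steps + 1):
        y = x + step * rng.standard_normal(x.shape[0])
        ly = log_rho(y)
        if np.log(rng.uniform()) <= ly - lx:   # accept with prob. min{1, e^{ly-lx}}
            x, lx = y, ly
        chain[i] = x
    return chain

# Standard normal target: log rho(x) = -|x|^2 / 2 up to a constant.
rng = np.random.default_rng(4)
chain = metropolis_random_walk(lambda x: -0.5 * np.dot(x, x),
                               np.zeros(2), 1.0, 5000, rng)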
The Metropolis-Hastings algorithm provides a tool to construct a Markov chain with a desired stationary distribution. However, until now it is not clear whether the Markov chain converges to the stationary distribution. In the following we state a sufficient condition for the independent Metropolis algorithm to be uniformly ergodic. This is based on [MT96, Theorem 2.1, p. 105]. The proof is omitted; however, we remark that we also use [Rud12, Proposition 3.24, p. 49] to derive the result.

Theorem 21. Let $\eta \colon G \to (0, \infty)$ be such that $\int_G \eta(y) \, dy = 1$ and $Q(x, A) = \int_A \eta(y) \, dy$ for $A \in \mathcal{B}(G)$. If there exists a number $\gamma > 0$ such that

$$ \frac{\eta(y)}{\rho(y)} \ge \gamma, \quad y \in G, $$

then

$$ \| P_\rho^n - S \|_{L_1 \to L_1} \le 2 (1 - \gamma)^n, $$

where $P_\rho$ denotes the transition operator with corresponding transition kernel $K_\rho$ and $S(f) = \int_G f(x) \, \pi_\rho(dx)$.

The previous theorem states that the independent Metropolis algorithm is $L_1$-exponentially convergent. Further note that, by the relations in (13), $L_1$-exponential convergence implies a spectral gap; hence $\mathrm{gap}(P_\rho) \ge \gamma$. Now we could apply Theorem 18 to get an error bound for the approximation of expectations. We consider an example which satisfies the assumptions of Theorem 21.

Example 22. Let $G = \mathbb{R}$ and $\xi^2 > 1$. Further let

$$ \rho(y) = \frac{1}{\sqrt{2\pi}} \exp(-y^2/2) \quad \text{and} \quad \eta(y) = \frac{1}{\sqrt{2\pi}\,\xi} \exp(-y^2/(2\xi^2)). $$

In other words, $\rho$ is the density of a $N(0,1)$ random variable and $\eta$ is the density of a $N(0, \xi^2)$ random variable. We have

$$ \frac{\eta(y)}{\rho(y)} = \xi^{-1} \exp\big( -y^2 (\xi^{-2} - 1)/2 \big) \ge \xi^{-1}, $$

with equality at $y = 0$. Thus, by Theorem 21 we obtain $\| P_\rho^n - S \|_{L_1 \to L_1} \le 2 (1 - \xi^{-1})^n$.
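A sketch of Example 22 in Python; the acceptance ratio $\rho(y)\eta(x) / (\rho(x)\eta(y))$ simplifies to $\exp(\frac{1}{2}(x^2 - y^2)(1 - \xi^{-2}))$, and the value $\xi = 2$ below is made up.

import numpy as np

rng = np.random.default_rng(5)
xi = 2.0   # proposal std; Example 22 gives eta(y)/rho(y) >= 1/xi, so gamma = 1/xi

def independent_metropolis_step(x):
    """One transition of Example 22: target rho = N(0,1), proposal eta = N(0, xi^2).
    Acceptance ratio rho(y) eta(x) / (rho(x) eta(y)), evaluated in log scale."""
    y = xi * rng.standard_normal()
    log_alpha = 0.5 * (x * x - y * y) * (1.0 - 1.0 / xi**2)
    return y if np.log(rng.uniform()) <= log_alpha else x

x, xs = 0.0, []
for _ in range(20_000):
    x = independent_metropolis_step(x)
    xs.append(x)
print(np.mean(xs), np.var(xs))   # approximately 0 and 1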
Next, we will introduce another technique to show a convergence property; more precisely, a technique to show a lower bound of the spectral gap $1 - \Lambda$, see below. Therefore we need the following new definition.

Definition 10. The conductance of a transition kernel $K$, which is reversible with respect to $\pi$, is given by

$$ \varphi = \inf_{0 < \pi(A) \le 1/2} \frac{\int_A K(x, A^c) \, \pi(dx)}{\pi(A)}. $$

The ratio in the infimum is also called the bottleneck ratio. It is the probability $\mathbb{P}(X_2 \in A^c \mid X_1 \in A)$ of a Markov chain $(X_n)_{n \in \mathbb{N}}$ where the initial state is chosen according to the stationary distribution, i.e. $\mathbb{P}(X_1 \in A) = \pi(A)$. In this sense it measures how fast the chain leaves a set $A$ in one step. The following result is well known; in this form it is due to Lawler and Sokal [LS88].

Theorem 23. We have

$$ \frac{\varphi^2}{2} \le 1 - \Lambda \le 2\varphi, $$

with $\Lambda = \sup\{\alpha : \alpha \in \mathrm{spec}(P - S)\}$, where $\mathrm{spec}(P - S)$ denotes the spectrum of $P - S \colon L_2 \to L_2$.

In the next example we show how one can use this result.

Example 24. Let $G \subset \mathbb{R}^d$ be a bounded set and assume that there are numbers $0 < c_1, c_2 < \infty$ such that $c_1 \le \rho(x) \le c_2$ for all $x \in G$. We consider an independent Metropolis algorithm. Let the proposal transition kernel be given by

$$ Q(x, A) = \mu(A) = \frac{\mathrm{vol}_d(A)}{\mathrm{vol}_d(G)}, \quad A \in \mathcal{B}(G), $$

i.e. a state is proposed with the uniform distribution on $G$. Then

$$ K_\rho(x, A) = \int_A \alpha(x, y) \, \frac{dy}{\mathrm{vol}_d(G)} + 1_A(x) \Big( 1 - \int_G \alpha(x, y) \, \frac{dy}{\mathrm{vol}_d(G)} \Big), $$

where $\alpha(x, y) = \min\{1, \frac{\rho(y)}{\rho(x)}\}$. By Theorem 21 and $\frac{\eta(y)}{\rho(y)} \ge \frac{1}{c_2 \mathrm{vol}_d(G)}$ we obtain

$$ \| P_\rho^n - S \|_{L_1 \to L_1} \le 2 \Big( 1 - \frac{1}{c_2 \, \mathrm{vol}_d(G)} \Big)^n, $$

where $P_\rho$ denotes the transition operator induced by $K_\rho$. By (13) we have $\mathrm{gap}(P_\rho) \ge \frac{1}{c_2 \mathrm{vol}_d(G)}$, and we can apply Theorem 18 to obtain an error bound for the approximation of $S(f)$, see Corollary 25. However, often $\mathrm{vol}_d(G)$ grows exponentially with increasing $d$, for example if $G = [-1,1]^d$ or $G = B_d$. This is the drawback of the former estimate.

Let us apply Theorem 23. It is known that $P_\rho$ is positive semi-definite on $L_2(\pi_\rho)$; for details we refer to [RU12]. Thus, we have $\mathrm{gap}(P_\rho) = 1 - \Lambda_\rho$ with $\Lambda_\rho = \sup\{\alpha : \alpha \in \mathrm{spec}(P_\rho - S)\}$. Now we estimate the conductance. Let $A \in \mathcal{B}(G)$ and assume that $0 < \pi_\rho(A) \le 1/2$. Then

$$ \begin{aligned} \int_A K_\rho(x, A^c) \, \pi_\rho(dx) &= \int_A \int_{A^c} \alpha(x, y) \, \frac{dy}{\mathrm{vol}_d(G)} \, \pi_\rho(dx) \\ &= \int_A \int_{A^c} \min\Big\{ \frac{1}{\rho(x)}, \frac{1}{\rho(y)} \Big\} \, \frac{\int_G \rho(z) \, dz}{\mathrm{vol}_d(G)} \, \pi_\rho(dy) \, \pi_\rho(dx) \\ &= \frac{1}{\mathrm{vol}_d(G)} \int_A \int_{A^c} \min\Big\{ \int_G \frac{\rho(z)}{\rho(x)} \, dz, \int_G \frac{\rho(z)}{\rho(y)} \, dz \Big\} \, \pi_\rho(dy) \, \pi_\rho(dx) \\ &\ge \frac{c_1}{c_2} \, \pi_\rho(A) \, \pi_\rho(A^c) \ge \frac{c_1}{2 c_2} \, \pi_\rho(A). \end{aligned} $$

This implies $\varphi \ge \frac{c_1}{2 c_2}$, which gives $\mathrm{gap}(P_\rho) \ge \frac{c_1^2}{8 c_2^2}$. We obtained a lower bound of $\mathrm{gap}(P_\rho)$ which is independent of $\mathrm{vol}_d(G)$ but depends on $c_1^2$ and $c_2^2$. By the combination of both lower bounds of the spectral gap we have

$$ \mathrm{gap}(P_\rho) \ge \max\Big\{ \frac{1}{c_2 \, \mathrm{vol}_d(G)}, \; \frac{c_1^2}{8 c_2^2} \Big\}, $$

and obtain, by Theorem 18 and $\big\| \frac{d\mu}{d\pi_\rho} - 1 \big\|_\infty \le \frac{c_2}{c_1} - 1$, the following error bounds.

Corollary 25. Let $(X_n)_{n \in \mathbb{N}}$ be a Markov chain with transition kernel $K_\rho$ and initial distribution $\mu$.

(i) For $c_1^2 \, \mathrm{vol}_d(G) \le 8 c_2$ and $n_0 = \max\big\{ \big\lceil c_2 \, \mathrm{vol}_d(G) \log\big( 2\big(\tfrac{c_2}{c_1} - 1\big) \big) \big\rceil, 0 \big\}$ the error satisfies

$$ \sup_{\|f\|_2 \le 1} e_\mu(S_{n,n_0}, f)^2 \le \frac{2 c_2 \, \mathrm{vol}_d(G)}{n} + \frac{2 c_2^2 \, \mathrm{vol}_d(G)^2}{n^2}. $$

(ii) For $c_1^2 \, \mathrm{vol}_d(G) > 8 c_2$ and $n_0 = \max\big\{ \big\lceil \tfrac{8 c_2^2}{c_1^2} \log\big( 64\big(\tfrac{c_2}{c_1} - 1\big) \big) \big\rceil, 0 \big\}$ the error satisfies

$$ \sup_{\|f\|_4 \le 1} e_\mu(S_{n,n_0}, f)^2 \le \frac{16 c_2^2}{n \, c_1^2} + \frac{128 c_2^4}{n^2 c_1^4}. $$

The setting of the example can be extended; similar results hold for general proposals. In [Vol13, Theorem 3.4] a lower and an upper bound of the conductance of the Metropolis-Hastings algorithm is proven, depending on $c_1$, $c_2$ and the conductance of the proposal transition kernel. In particular, the lower bound of $\mathrm{gap}(P_\rho)$ depends on the ratio $\frac{c_1}{c_2}$.

However, there are some crucial drawbacks of lower bounds of the spectral gap which contain $\frac{c_1}{c_2}$. Note that then the error bound of the mean square error depends on $\frac{c_2}{c_1} = \frac{\sup \rho}{\inf \rho}$. Say, for example,

$$ \rho(x) = \exp(-\alpha |x|^2), \quad x \in \mathbb{R}^d, \tag{16} $$

i.e. $\rho$ is proportional to the density of a $N(0, (2\alpha)^{-1} I)$ random variable, where $I$ denotes the identity. If $G = B_d$, then $\frac{c_2}{c_1} = \exp(\alpha)$, and if $G = [-1,1]^d$, then $\frac{c_2}{c_1} = \exp(\alpha d)$. This is not that good, since the ratio $\frac{c_2}{c_1}$ might depend exponentially on $\alpha$ and $d$. The goal would be to obtain an error bound which depends only polynomially on $\alpha$ and $d$.

In the previous example we have seen how one can estimate the conductance. However, the lower bounds were not satisfying. In the following we want to state a much better result. These types of results are difficult to prove; they rely on isoperimetric inequalities and different estimates of geometric objects. We have the following assumption.

Assumption 26. Let $G = B_d$ and let $\rho$ be log-concave, i.e. for all $\lambda \in (0,1)$ and for all $x, y \in G$ we have

$$ \rho(\lambda x + (1 - \lambda) y) \ge \rho(x)^\lambda \, \rho(y)^{1 - \lambda}. \tag{17} $$

Further, let $\log \rho$ be Lipschitz continuous with constant $\alpha$, i.e.

$$ |\log \rho(x) - \log \rho(y)| \le \alpha \, |x - y|. $$

Note that the setting is more restrictive compared to the previous example. The goal is to get a lower bound of the conductance which is polynomial in $\alpha^{-1}$ and $d^{-1}$. In [MN07] the following result is proven.

Proposition 27. Let Assumption 26 be satisfied and let $\delta = \min\{\frac{1}{\sqrt{d+1}}, \frac{1}{\alpha}\}$. Let the proposal transition kernel for the Metropolis algorithm $K_\rho$ be given by the ball walk $B_\delta$, see Example 9(4). Then the conductance of $K_\rho$ satisfies

$$ \varphi \ge \frac{0.0025}{\sqrt{d+1}} \min\Big\{ \frac{1}{\sqrt{d+1}}, \frac{1}{\alpha} \Big\}. $$

If we consider for example the density of (16), we see that Assumption 26 is satisfied, and the lower bound of the conductance depends only polynomially on $\alpha$ and $d$, rather than on $\frac{c_2}{c_1} = \exp(\alpha)$.

Note that for the Metropolis algorithm with ball walk proposal it is not clear whether the transition operator $P_\rho$ is positive semi-definite; this would imply that Theorem 23 gives a lower bound of $\mathrm{gap}(P_\rho)$. However, by considering the lazy version of the Metropolis algorithm with ball walk proposal one obtains a positive semi-definite operator. For further details see for instance [Rud12].
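A sketch of the Metropolis algorithm with ball walk proposal in the setting of Proposition 27, for the density (16) with $\alpha = 1$ on $G = B_d$. A proposal leaving $G$ is treated as a rejection, which is exactly the behavior of the ball walk proposal kernel; since the density part of the proposal is symmetric, the acceptance probability is $\min\{1, \rho(y)/\rho(x)\}$. The parameter values are made up.

import numpy as np

def metropolis_ball_walk(log_rho, in_G, x0, delta, n_steps, rng):
    """Metropolis algorithm with ball walk proposal (setting of Proposition 27)."""
    d = x0.shape[0]
    x, lx = x0, log_rho(x0)
    for _ in range(n_steps):
        u = rng.standard_normal(d)
        u *= rng.uniform() ** (1.0 / d) / np.linalg.norm(u)  # uniform in B_d
        y = x + delta * u
        # Stay at x if the proposal leaves G; otherwise Metropolis accept/reject.
        if in_G(y) and np.log(rng.uniform()) <= log_rho(y) - lx:
            x, lx = y, log_rho(y)
    return x

# Density (16) with alpha = 1 on G = B_d; delta as suggested by Proposition 27.
rng = np.random.default_rng(6)
d, alpha = 10, 1.0
delta = min(1.0 / np.sqrt(d + 1), 1.0 / alpha)
x = metropolis_ball_walk(lambda z: -alpha * np.dot(z, z),
                         lambda z: np.dot(z, z) <= 1.0,
                         np.zeros(d), delta, 10_000, rng)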
4.2 Hit-and-run algorithm

Let us describe a transition from $x \in G$ of the hit-and-run algorithm. First, a direction $\theta$ with $|\theta| = 1$, i.e. on the sphere $S^{d-1}$, is chosen uniformly distributed. Then the next state is sampled on $\{x + s\theta : s \in \mathbb{R}\}$ with the distribution determined by $\rho$ restricted to the line. Thus, for $x \in G$ and $A \in \mathcal{B}(G)$ the transition kernel is

$$ H_\rho(x, A) = \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} H_\theta(x, A) \, d\theta, $$

where

$$ H_\theta(x, A) = \int_{-\infty}^\infty \frac{1_A(x + s\theta) \, \rho(x + s\theta)}{\ell_\rho(x, \theta)} \, ds \quad \text{with} \quad \ell_\rho(x, \theta) = \int_{-\infty}^\infty 1_G(x + s\theta) \, \rho(x + s\theta) \, ds. $$

Note that $H_\theta(x, \cdot)$ denotes the distribution of the next state. It is the distribution of $\rho$ restricted to the line determined by $\theta$ and the current state $x$. Let $\bar{H}_\rho$ be the transition operator which corresponds to $H_\rho$. Now we ask two questions:

1. Is the transition kernel $H_\rho$ reversible with respect to $\pi_\rho$?

2. Are there estimates of the spectral gap of $\bar{H}_\rho$, or estimates of $\| \nu H_\rho^n - \pi_\rho \|_{\mathrm{tv}}$ in terms of $n$ and the initial distribution $\nu$?

The first question addresses whether the transition kernel is well-defined in the sense that we obtain the desired stationary distribution $\pi_\rho$. The second question asks for convergence: we want to quantify how fast we get to the stationary distribution. The following lemma answers the first question.

Lemma 28. For all $\theta \in S^{d-1}$ we have that $H_\theta$ is reversible with respect to $\pi_\rho$. Furthermore, $H_\rho$ is also reversible with respect to $\pi_\rho$.

Proof. Let $c = \int_G \rho(x) \, dx$ and $A, B \in \mathcal{B}(G)$. By

$$ \begin{aligned} \int_A H_\theta(x, B) \, \pi_\rho(dx) &= \int_A \int_{-\infty}^\infty \frac{1_B(x + s\theta) \, \rho(x + s\theta)}{\ell_\rho(x, \theta)} \, ds \, \frac{\rho(x)}{c} \, dx \\ &= \int_{\mathbb{R}^d} \int_{-\infty}^\infty \frac{1_A(x) \, 1_B(x + s\theta) \, \rho(x + s\theta) \, \rho(x)}{\ell_\rho(x, \theta) \, c} \, ds \, dx \\ &\overset{y = x + s\theta}{=} \int_{\mathbb{R}^d} \int_{-\infty}^\infty \frac{1_A(y - s\theta) \, 1_B(y) \, \rho(y) \, \rho(y - s\theta)}{\ell_\rho(y - s\theta, \theta) \, c} \, ds \, dy \\ &= \int_B \int_{-\infty}^\infty \frac{1_A(y - s\theta) \, \rho(y - s\theta)}{\ell_\rho(y - s\theta, \theta)} \, ds \, \pi_\rho(dy) \\ &\overset{\ell_\rho(y - s\theta, \theta) = \ell_\rho(y, \theta)}{=} \int_B \int_{-\infty}^\infty \frac{1_A(y - s\theta) \, \rho(y - s\theta)}{\ell_\rho(y, \theta)} \, ds \, \pi_\rho(dy) \\ &\overset{t = -s}{=} \int_B \int_{-\infty}^\infty \frac{1_A(y + t\theta) \, \rho(y + t\theta)}{\ell_\rho(y, \theta)} \, dt \, \pi_\rho(dy) = \int_B H_\theta(x, A) \, \pi_\rho(dx), \end{aligned} $$

reversibility of $H_\theta$ is proven. By the reversibility of $H_\theta$ we get the reversibility of $H_\rho$. We have

$$ \int_A H_\rho(x, B) \, \pi_\rho(dx) = \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \int_A H_\theta(x, B) \, \pi_\rho(dx) \, d\theta = \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \int_B H_\theta(x, A) \, \pi_\rho(dx) \, d\theta = \int_B H_\rho(x, A) \, \pi_\rho(dx). $$

Thus the proof is complete.

Note that the transition kernel of the hit-and-run algorithm induces a positive semi-definite operator on $L_2$; for details we refer to [RU12].
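For $\rho = 1_G$ and convex $G$ given by a membership oracle, one hit-and-run transition can be sketched as follows. Locating the chord endpoints by bisection (and the assumption $G \subset s_{\max} B_d$) is an implementation choice for this sketch, not part of the original notes.

import numpy as np

def chord_endpoint(x, theta, in_G, s_max=10.0, iters=50):
    """Largest s with x + s*theta in G, found by bisection; assumes G is convex,
    x lies in G, and G is contained in a ball of radius s_max around the origin."""
    lo, hi = 0.0, s_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if in_G(x + mid * theta):
            lo = mid
        else:
            hi = mid
    return lo

def hit_and_run_step(x, in_G, rng):
    """One hit-and-run transition for rho = 1_G: uniform direction on S^{d-1},
    then a uniform point on the chord L(x, theta) (Algorithm 2)."""
    d = x.shape[0]
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    s_plus = chord_endpoint(x, theta, in_G)
    s_minus = -chord_endpoint(x, -theta, in_G)
    return x + rng.uniform(s_minus, s_plus) * theta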
Now let us come to the second question. Here we consider two different settings. First, let $\rho = 1_G$ and let Assumption 5 be satisfied; basically it is assumed that $B_d \subset G \subset r B_d$, where $G$ is a convex body. Then, again by deep arguments and the conductance technique, one can prove the following [LV06b].

Theorem 29. Let Assumption 5 be satisfied. Recall that $\bar{H}_\rho$ is the transition operator which corresponds to the transition kernel $H_\rho$. Then

$$ \mathrm{gap}(\bar{H}_\rho) \ge \frac{2^{-52}}{(dr)^2}. $$

By Theorem 18 this implies an error bound for a suitable initial distribution of a Markov chain with transition kernel $H_\rho$. Now let us define the second setting.

Assumption 30. For a number $r > 1$ we have $B_d \subset G \subset r B_d$. Furthermore the function $\rho \colon G \to (0, \infty)$ satisfies:

- $\rho$ is log-concave, for a definition see (17);

- for all $s > 0$ let $G(s) = \{x \in G : \rho(x) \ge s\}$ be the level set of $\rho$ at level $s$; then we assume that

$$ \pi_\rho(G(s)) \ge \frac{1}{8} \implies \exists z \in G \text{ such that } B(z, 1) \subset G(s). $$

For further details and interpretations of the conditions see [LV06a, Rud]. Then, again by an isoperimetric inequality and a relation of the so-called $s$-conductance to the total variation distance, one can prove the following theorem [LV06a].

Theorem 31. Let Assumption 30 be satisfied. Let the initial distribution $\nu \in \mathcal{M}_\infty$, i.e. $\big\| \frac{d\nu}{d\pi_\rho} \big\|_\infty < \infty$. Further set

$$ \beta = \exp\Big( -\frac{10^{-9}}{(dr)^{2/3}} \Big) \quad \text{and} \quad C = 12 \, d \, r \, \Big\| \frac{d\nu}{d\pi_\rho} \Big\|_\infty. $$

Then

$$ \| \nu H_\rho^n - \pi_\rho \|_{\mathrm{tv}} \le C \beta^{\sqrt[3]{n}}, \quad n \in \mathbb{N}. $$

We provided two settings where explicit estimates of the spectral gap and the total variation distance are available. The constants in these results are rather large, but note that they are absolute constants without any hidden dependence on some variables.

4.3 Gibbs sampler

The Gibbs sampler, also known as Glauber dynamics or heat bath algorithm, is widely used in scientific computing. In this subsection we distinguish two Gibbs samplers, the random scan Gibbs sampler and the deterministic scan Gibbs sampler.

The random scan Gibbs sampler is conceptually similar to the hit-and-run algorithm. For $x \in G$ and $A \in \mathcal{B}(G)$ the transition kernel is given by

$$ R_\rho(x, A) = \frac{1}{d} \sum_{j=1}^d H_{e_j}(x, A), $$

where $\{e_1, \dots, e_d\}$ is the Euclidean standard basis in $\mathbb{R}^d$ and, recall,

$$ H_{e_j}(x, A) = \frac{\int_{-\infty}^\infty 1_A(x + s e_j) \, \rho(x + s e_j) \, ds}{\ell_\rho(x, e_j)}. $$

We choose a direction $e_i$ within the Euclidean standard basis uniformly distributed. This is the only difference to the hit-and-run algorithm, where the direction is chosen uniformly distributed on the sphere. Then the next state is generated on the line $\{x + s e_i : s \in \mathbb{R}\}$ with respect to $\rho$ restricted to $\{x + s e_i : s \in \mathbb{R}\}$. By similar arguments as for the hit-and-run algorithm one can prove that $R_\rho$ is reversible with respect to $\pi_\rho$. The proof is a not too difficult exercise.

Lemma 32. The transition kernel $R_\rho$ is reversible with respect to $\pi_\rho$.

Now let us describe the deterministic scan Gibbs sampler. For $x \in G$ and $A \in \mathcal{B}(G)$ the transition kernel is given by

$$ D_\rho(x, A) = H_{e_1} H_{e_2} \cdots H_{e_d}(x, A). $$

The deterministic scan Gibbs sampler from $x \in G$ updates each coordinate in one step. Note that the coordinate updates are according to the Euclidean standard basis; thus, in a row, each coordinate is updated. This is a very tempting idea. However, there are some drawbacks. First, one step of the deterministic scan Gibbs sampler has the computational cost of $d$ steps of the random scan Gibbs sampler. Furthermore, the transition kernel is in general not reversible.

Lemma 33. The transition kernel $D_\rho$ is in general not reversible with respect to $\pi_\rho$, but $\pi_\rho$ is a stationary distribution of $D_\rho$.

Proof. To prove that $D_\rho$ is in general not reversible with respect to $\pi_\rho$ we construct a counterexample for $d = 2$. Let $G = [0,2] \times [0,2] \setminus [1,2] \times [1,2]$ and $\rho(x) = 1_G(x)$. Now we argue that for $A_1 = [0,1] \times [1,2]$ and $A_2 = [1,2] \times [0,1]$ holds

$$ \int_{A_1} H_{e_1} H_{e_2}(x, A_2) \, \frac{dx}{\mathrm{vol}_2(G)} \ne \int_{A_2} H_{e_1} H_{e_2}(x, A_1) \, \frac{dx}{\mathrm{vol}_2(G)}. $$

This is true, since for all $x \in A_1$ we have $H_{e_1} H_{e_2}(x, A_2) = 0$, but for all $x \in A_2$ we have $H_{e_1} H_{e_2}(x, A_1) = \frac{1}{4}$. Thus, $D_\rho$ is in general not reversible with respect to $\pi_\rho$. However, in the following we prove that $\pi_\rho$ is a stationary distribution. From Lemma 28 we know that $\pi_\rho H_{e_j}(A) = \pi_\rho(A)$ for all $A \in \mathcal{B}(G)$, since $H_{e_j}$ is reversible with respect to $\pi_\rho$. Thus,

$$ \pi_\rho D_\rho(A) = \pi_\rho H_{e_1} H_{e_2} \cdots H_{e_d}(A) = \pi_\rho H_{e_2} \cdots H_{e_d}(A) = \cdots = \pi_\rho H_{e_d}(A) = \pi_\rho(A). $$
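For $\rho = 1_G$ the random scan Gibbs sampler differs from the hit-and-run sketch above only in the choice of direction. A sketch reusing the bisection helper chord_endpoint from that sketch (same assumptions: convex $G$, membership oracle, $G \subset s_{\max} B_d$):

import numpy as np

def random_scan_gibbs_step(x, in_G, rng):
    """One random scan Gibbs transition for rho = 1_G (Algorithm 3): pick a
    coordinate direction e_i uniformly, then a uniform point on the chord
    L(x, e_i); chord_endpoint is the bisection helper defined above."""
    d = x.shape[0]
    e = np.zeros(d)
    e[rng.integers(d)] = 1.0
    s_plus = chord_endpoint(x, e, in_G)
    s_minus = -chord_endpoint(x, -e, in_G)
    return x + rng.uniform(s_minus, s_plus) * e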
We briefly state some convergence results for the Gibbs sampling algorithms. First, let us consider the case $\rho = 1_G$, i.e. the goal is to sample the uniform distribution on $G$. The following results are proven in [RR98].

Theorem 34. Let $G \subset \mathbb{R}^d$ be a bounded, open, connected set whose boundary is a $(d-1)$-dimensional $C^2$ manifold. Then the random scan Gibbs sampler is uniformly ergodic.

Proposition 35. Let $G \subset \mathbb{R}^2$ be the width one triangle with lower angle $\theta$ and upper angle $\phi$, that is,

$$ G = \{ (x, y) \in \mathbb{R}^2 : 0 < x < 1, \; x \tan\theta < y < x \tan\phi \}, $$

where $0 < \theta < \phi < \pi/2$. Then both Gibbs samplers are geometrically ergodic.

For general densities $\rho$ not much seems to be known. In [RR97] it is proven:

Theorem 36. If the deterministic scan Gibbs sampler is geometrically ergodic, then the random scan Gibbs sampler is also geometrically ergodic.

Note that, on the one hand, the results of these theorems hold under weak assumptions compared to the convergence results of the hit-and-run algorithm. But on the other hand we do not get the parameters $(\alpha, M)$ or $(\alpha, M(x))$ of the ergodicity explicitly.

References

[AG10] S. Asmussen and P. Glynn, Harris recurrence and MCMC: A simplified approach, submitted (2010).

[BGJM11] S. Brooks, A. Gelman, G. Jones, and X. Meng, Handbook of Markov Chain Monte Carlo, Chapman & Hall, 2011.

[GRS96] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov chain Monte Carlo in practice, Chapman & Hall, 1996.

[Kal02] O. Kallenberg, Foundations of modern probability, second ed., Probability and its Applications, Springer-Verlag, New York, 2002.

[LS88] G. Lawler and A. Sokal, Bounds on the L2 spectrum for Markov chains and Markov processes: a generalization of Cheeger's inequality, Trans. Amer. Math. Soc. 309 (1988), no. 2, 557–580.

[LV06a] L. Lovász and S. Vempala, Fast algorithms for logconcave functions: sampling, rounding, integration and optimization, Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), IEEE Computer Society, 2006, pp. 57–68.

[LV06b] L. Lovász and S. Vempala, Hit-and-run from a corner, SIAM J. Comput. 35 (2006), no. 4, 985–1005.

[MN07] P. Mathé and E. Novak, Simple Monte Carlo and the Metropolis algorithm, J. Complexity 23 (2007), no. 4-6, 673–696.

[MT96] K. Mengersen and R. Tweedie, Rates of convergence of the Hastings and Metropolis algorithms, Ann. Statist. 24 (1996), no. 1, 101–121.

[MT09] S. Meyn and R. Tweedie, Markov chains and stochastic stability, second ed., Cambridge University Press, 2009.

[Num84] E. Nummelin, General irreducible Markov chains and non-negative operators, Cambridge University Press, 1984.

[Rev84] D. Revuz, Markov chains, second ed., North-Holland Mathematical Library, vol. 11, North-Holland Publishing Co., Amsterdam, 1984.

[RR97] G. Roberts and J. Rosenthal, Geometric ergodicity and hybrid Markov chains, Electron. Comm. Probab. 2 (1997), no. 2, 13–25.

[RR98] G. Roberts and J. Rosenthal, On convergence rates of Gibbs samplers for uniform distributions, Ann. Appl. Probab. 8 (1998), no. 4, 1291–1302.

[RR04] G. Roberts and J. Rosenthal, General state space Markov chains and MCMC algorithms, Probability Surveys 1 (2004), 20–71.

[RT01] G. Roberts and R. Tweedie, Geometric L2 and L1 convergence are equivalent for reversible Markov chains, J. Appl. Probab. 38A (2001), 37–41.

[RU12] D. Rudolf and M. Ullrich, Positivity of hit-and-run and related algorithms, ArXiv e-prints (2012).

[Rud] D. Rudolf, Hit-and-run for numerical integration, to appear in: J. Dick, F. Y. Kuo, G. Peters, I. H. Sloan (eds.), Monte Carlo and Quasi-Monte Carlo Methods 2012, Springer-Verlag.

[Rud12] D. Rudolf, Explicit error bounds for Markov chain Monte Carlo, Dissertationes Math. 485 (2012), 93 pp.

[Stu10] A. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica 19 (2010), 451–559.

[Vol13] S. Vollmer, Dimension-independent MCMC sampling for elliptic inverse problems with non-Gaussian priors, ArXiv e-prints (2013).