
An introduction to Markov chain Monte Carlo on general state spaces
Daniel Rudolf∗
[email protected]
October 11, 2013
Contents

Preface
1 Introduction
  1.1 Where classical Monte Carlo fails
  1.2 Bayesian approach to inverse problems
2 Markov chains
  2.1 Definitions and notation
  2.2 Markov operator
  2.3 Convergence properties of Markov chains
3 Approximation of expectations
4 Algorithms
  4.1 Metropolis-Hastings algorithm
  4.2 Hit-and-run algorithm
  4.3 Gibbs sampler

∗ The author was supported by an Australian Research Council Discovery Project (F. Kuo and I. Sloan), by the DFG priority program 1324 and the DFG Research Training Group 1523.
Preface

These notes are based on a series of six lectures given in November and December 2012 at the University of New South Wales in Sydney. The material discussed in these notes is not new. The goal was to provide an introduction to Markov chain Monte Carlo on general state spaces, and I wanted to keep it elementary to give the reader better access to the topic.

I want to thank F. Kuo and J. Dick for asking me to give these lectures and for their interest in the topic. Furthermore, I want to thank H. Zhu for typing a preliminary version of this manuscript.
1 Introduction

Markov chains are widely used: for example, in optimization, in approximate sampling, as a tool for modeling real-world phenomena, in decision theory, and for the approximation of expectations. We focus on approximate sampling and the application of Markov chain Monte Carlo methods for the approximation of expectations. For more applications see for example [GRS96, BGJM11].
1.1 Where classical Monte Carlo fails

Let G ⊂ R^d be a measurable set with 0 < vol_d(G) < ∞, where vol_d denotes the Lebesgue measure. Assume that f : G → R is an integrable function. The goal is to compute
\[
S(f) = \frac{1}{\mathrm{vol}_d(G)} \int_G f(x) \, dx,
\]
where f belongs to some class of functions. In other words, we want to compute the expectation of f with respect to the uniform distribution on G.
We assume that there is a procedure Or_f which provides information about f. This procedure is considered as a "black box" and we call it an oracle. Let Or_f be a procedure which returns for an input x ∈ G the function value f(x), i.e. Or_f(x) = f(x). In general, if d is large, say 10, 50 or even 100, then it might be difficult, depending on f, to compute S(f) explicitly.
For the moment we assume that we can sample the uniform distribution on G, i.e. we can sample random variables X, mapping from a probability space (Ω, A, P) to (G, B(G)), with
\[
P_X(A) := P(X \in A) = \frac{\mathrm{vol}_d(A)}{\mathrm{vol}_d(G)}, \qquad A \in \mathcal{B}(G),
\]
where B(G) is the Borel σ-algebra.
Example 1. Let G = [0, 1]^d and let x = (x_1, . . . , x_d), where each x_i is chosen independently and identically distributed (i.i.d.) according to the uniform distribution on [0, 1]. Then x ∈ [0, 1]^d is a realization of a random variable with uniform distribution on [0, 1]^d.
Example 2. Let G = B_d = {x ∈ R^d : |x|² = Σ_{i=1}^d x_i² ≤ 1} be the Euclidean unit ball. By the following steps one can generate a uniformly distributed point z in B_d:

1. Let x = (x_1, . . . , x_d), where each x_i is chosen i.i.d. according to N(0, 1). By N(µ, σ²) we denote the normal distribution with mean µ and variance σ².

2. Let y = x/|x|. This is a realization of a uniformly distributed random variable on the (d−1)-dimensional sphere S^{d−1} = {x ∈ R^d : |x| = 1}.

3. Choose u ∈ [0, 1] uniformly distributed and set r = u^{1/d}. Then return z = ry.
As an exercise one can prove that the procedure stated in Example 2 indeed returns a uniformly distributed point in the Euclidean unit ball B_d.
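The following short Python sketch implements the three steps of Example 2; the function name and the use of NumPy are my choices, not part of the original notes.

import numpy as np

def uniform_ball(d, rng=np.random.default_rng()):
    """Return one point drawn uniformly from the Euclidean unit ball B_d."""
    x = rng.standard_normal(d)      # step 1: d i.i.d. N(0,1) coordinates
    y = x / np.linalg.norm(x)       # step 2: uniform point on the sphere S^{d-1}
    r = rng.uniform() ** (1.0 / d)  # step 3: radius with density d r^{d-1}
    return r * y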
Let (X_n)_{n∈N} be an i.i.d. sequence of random variables, where every X_n is uniformly distributed on G. Then we consider
\[
S_n(f) = \frac{1}{n} \sum_{i=1}^{n} f(X_i)
\]
as an approximation of S(f). We call it the classical Monte Carlo method. The following result guarantees that the algorithm is well defined.
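As an illustration, here is a minimal sketch of the classical Monte Carlo method S_n(f) on the unit ball, reusing the sampler uniform_ball from above; the test function is my choice.

def classical_monte_carlo(f, sample, n):
    """Approximate S(f) by the average of f over n i.i.d. samples."""
    return sum(f(sample()) for _ in range(n)) / n

# Example: approximate the mean of f(x) = |x|^2 under the uniform
# distribution on B_3 (the exact value is d/(d+2) = 3/5).
d, n = 3, 100_000
est = classical_monte_carlo(lambda x: float(np.dot(x, x)),
                            lambda: uniform_ball(d), n)
print(est)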
Theorem 3 (Strong law of large numbers). Let f : G → R and assume that
\[
\|f\|_1 := \frac{1}{\mathrm{vol}_d(G)} \int_G |f(x)| \, dx < \infty.
\]
Then
\[
P\bigl( \lim_{n \to \infty} S_n(f) = S(f) \bigr) = 1.
\]
Note that this is an asymptotic statement, whereas we are rather interested in explicit results with respect to some error term. Let us consider the mean square error of a generic algorithm A_n which uses n function values of f and a sample of random variables Y_1, . . . , Y_n. For an integrable function f it is given by
\[
e(A_n, f) = \bigl( \mathbb{E} \, |A_n(f) - S(f)|^2 \bigr)^{1/2},
\]
where E denotes the expectation with respect to the joint distribution of Y_1, . . . , Y_n. The squared mean square error can be decomposed into variance and squared bias, i.e.
\[
e(A_n, f)^2 = \mathbb{V}(A_n(f)) + B(A_n(f)),
\]
where the variance is
\[
\mathbb{V}(A_n(f)) = \mathbb{E} \, |A_n(f) - \mathbb{E} A_n(f)|^2
\]
and the squared bias is
\[
B(A_n(f)) = |\mathbb{E} A_n(f) - S(f)|^2 .
\]
The classical Monte Carlo method S_n is unbiased, i.e. E S_n(f) = S(f), so that B(S_n(f)) = 0 and
\[
\mathbb{V}(S_n(f))^{1/2} = e(S_n, f).
\]
Now we get the following error bound.
Theorem 4. Let f : G → R and assume that
\[
\|f\|_2 := \left( \frac{1}{\mathrm{vol}_d(G)} \int_G |f(x)|^2 \, dx \right)^{1/2} < \infty.
\]
Then
\[
e(S_n, f) = \frac{1}{\sqrt{n}} \, \|f - S(f)\|_2 \le \frac{1}{\sqrt{n}} \, \|f\|_2 .
\]
Proof. Since f(X_1), . . . , f(X_n) is an i.i.d. sequence of random variables, the formula of Bienaymé¹ yields
\[
e(S_n, f)^2 = \mathbb{V}(S_n(f)) = \frac{1}{n^2} \, \mathbb{V}\Bigl( \sum_{i=1}^n f(X_i) \Bigr)
= \frac{1}{n^2} \sum_{i=1}^n \mathbb{V}(f(X_i)) = \frac{1}{n} \, \|f - S(f)\|_2^2 \le \frac{1}{n} \, \|f\|_2^2 .
\]
We have shown an explicit error bound for every n ∈ N rather than an asymptotic result. However, the assumption ‖f‖_2 < ∞ is more restrictive than the assumption of Theorem 3, since ‖f‖_1 ≤ ‖f‖_2. In other words, the class of functions to which the strong law of large numbers applies is larger than the class of functions considered in Theorem 4.
The classical Monte Carlo method relies on the assumption that one has a sampling procedure for the uniform distribution on G. Let us modify the problem as follows. Suppose that the set G is given by a membership oracle, in other words, we assume that we are able to evaluate the indicator function 1_G. Then f and 1_G are part of the input of the algorithm, so that we want to compute
\[
S(f, 1_G) = \frac{1}{\mathrm{vol}_d(G)} \int_G f(x) \, dx,
\]
where the tuple (f, 1_G) belongs to some class of functions. We need the following regularity assumption.
¹ For a sequence of square integrable i.i.d. random variables Z_1, . . . , Z_n the statement of the formula of Bienaymé is V(Σ_{i=1}^n Z_i) = Σ_{i=1}^n V(Z_i).
Assumption 5. The convex body G ⊂ R^d is bounded from below and above, more precisely, for r > 1 let
\[
B_d \subset G \subset r B_d,
\]
where rB_d = {x ∈ R^d : |x| ≤ r} is the Euclidean ball with radius r around 0 and B_d = 1B_d.
Since G is part of the input of the algorithm it is necessary to construct a sampling procedure which generates the uniform distribution on G. A first approach might be to consider a simple acceptance/rejection method. Namely, we generate a uniformly distributed point in rB_d and if it is in G it is accepted, otherwise it is rejected. However, this method does not work reasonably. For example, the acceptance probability for the generation of one state in G = B_d is r^{−d}. One might think that the set rB_d is just too large compared to B_d. Let us assume that r = √d and assume that we generate a uniformly distributed state in the smallest box containing G = B_d, that is [−1, 1]^d. The acceptance probability of the proposed state is
\[
2^{-d} \, \mathrm{vol}_d(B_d) \approx \sqrt{\frac{2}{e\pi}} \left( \sqrt{\frac{e\pi}{2(d+2)}} \right)^{d+1}.
\]
This is also exponentially bad with respect to the dimension.
However, we assumed G = B_d, so one could argue that it is no problem to sample the uniform distribution on the unit ball (Example 2). But the sampling procedure should work for every set which satisfies Assumption 5. It should work uniformly for balls, cubes and polytopes as well as for more general convex sets, such as conv({1} ∪ B_d), where 1 = (1, . . . , 1) and conv denotes the convex hull. To summarize, the problem is to provide a reasonably fast random number generator uniformly for all sets which satisfy Assumption 5.
Now the idea is to run a Markov chain to approximate the uniform distribution for any G which is given by a membership oracle and satisfies Assumption 5. A Markov chain is a sequence of random variables (X_n)_{n∈N} which satisfies the Markov property. This means that the distribution of X_{n+1} (the future) depends only on X_n (the present) and not on (X_1, . . . , X_{n−1}) (the past). In formulas,
\[
P(X_{n+1} \in A \mid X_1, \dots, X_n) = P(X_{n+1} \in A \mid X_n).
\]
The Markov chain should approximate the uniform distribution on G, i.e. for all A ⊂ G we want that
\[
P(X_n \in A \mid X_1) \xrightarrow{\; n \to \infty \;} \frac{\mathrm{vol}_d(A)}{\mathrm{vol}_d(G)} .
\]
Later we define precisely what kind of convergence is sufficient and we will construct a Markov chain which has the desired properties. Then we consider the estimator
\[
S_{n,n_0}(f, 1_G) = \frac{1}{n} \sum_{j=1}^{n} f(X_{j+n_0}).
\]
The additional parameter n_0 is called burn-in and, roughly speaking, it is the number of steps the Markov chain needs to get close to the uniform distribution on G.
1.2 Bayesian approach to inverse problems
Markov chains are especially useful if one wants to draw approximate samples from a probability measure which is only partially known. We want to
motivate this by a Bayesian approach to inverse problems. For a thorough
introduction to Bayesian inverse problems we refer to [Stu10].
Let s, d ∈ N, u ∈ Rs , y ∈ Rd and G : Rs → Rd . We have an equation of the
form
y = G(u)
and the goal is to find u ∈ Rs for a given y ∈ Rd . We call u the solution, y
the data and G the observation operator.
The problem might be ill-posed, which means that either no solution exists or the solution is not unique. This motivates a reformulation of the problem. Namely, we consider
\[
\operatorname*{argmin}_{u \in \mathbb{R}^s} \; \frac{1}{2} \, |y - G(u)|_\Gamma^2
\]
as a solution of the inverse problem, where |·|_Γ is a norm on R^d.
Again there might be difficulties: for example, there might be multiple minima. To avoid this problem one can use a regularized approach. Let |·|_Σ be a norm on R^s and let u_0 ∈ R^s. Then the goal is to find
\[
\operatorname*{argmin}_{u \in \mathbb{R}^s} \; \frac{1}{2} \, |y - G(u)|_\Gamma^2 + \frac{1}{2} \, |u - u_0|_\Sigma^2 . \tag{1}
\]
The problem formulated that way can often be solved. However, the choice of the norm |·|_Σ and of u_0 is, without additional assumptions, somewhat arbitrary. Furthermore, if we consider y as data which is measured by a physical device, it is very likely that some noise in the measurement perturbs the data. In this setting a Bayesian approach is useful. We have the following modified model
\[
y = G(u) + z,
\]
where z ∈ Rd is the observation noise. Now we want to find a probability
measure π y on (Rs , B(Rs )) which contains information of the probability of
different solutions u given data y. We call π y posterior distribution. Further, we assume that there is a probability measure π0 on (Rs , B(Rs )) which
describes our prior belief about the solution u. The measure π0 is called
prior distribution. Finally, we assume that the noise z is a realization of a
random variable with mean zero, whose statistical properties are known. Let
us precisely summarize and state the assumptions.
Assumption 6. We have a probability space (Ω, F, P) and all random variables map from this space into (R^q, B(R^q)) with either q = s or q = d. Further let

(i) ρ_0 : R^s → [0, ∞) be the density of the measure π_0 with respect to the Lebesgue measure on (R^s, B(R^s));

(ii) ρ^y : R^s → [0, ∞) be the density of the measure π^y with respect to the Lebesgue measure on (R^s, B(R^s));

(iii) the observational noise be modeled by a mean zero random variable Z : (Ω, F, P) → (R^d, B(R^d)) with distribution
\[
P(Z \in A) = \int_A v(z) \, dz, \qquad A \in \mathcal{B}(\mathbb{R}^d),
\]
where v : R^d → [0, ∞) is a density with respect to the Lebesgue measure.
Before going on we recall the formula of Bayes and provide some implications. Therefore let U : (Ω, F, P) → (R^s, B(R^s)) and Y : (Ω, F, P) → (R^d, B(R^d)) be random variables. Let us assume that the joint density of (U, Y) with respect to the Lebesgue measure is given by ξ : R^s × R^d → (0, ∞), i.e.
\[
P((U, Y) \in A \times B) = \int_A \int_B \xi(u, y) \, dy \, du, \qquad A \in \mathcal{B}(\mathbb{R}^s), \; B \in \mathcal{B}(\mathbb{R}^d).
\]
Then the formula of Bayes is given by
\[
\xi(u \mid Y = y) = \frac{\xi(y \mid U = u) \, \xi_U(u)}{\xi_Y(y)},
\]
where
\[
\xi_U(u) = \int_{\mathbb{R}^d} \xi(u, y) \, dy \qquad \text{and} \qquad \xi_Y(y) = \int_{\mathbb{R}^s} \xi(u, y) \, du.
\]
We call ξ(u | Y = y) the posterior density, ξ(y | U = u) the data likelihood and ξ_U(u) the prior density. We use the formula of the conditional density
\[
\xi(y \mid U = u) = \frac{\xi(u, y)}{\xi_U(u)}
\]
to conclude that
\[
\xi_Y(y) = \int_{\mathbb{R}^s} \xi(y \mid U = u) \, \xi_U(u) \, du.
\]
Then, by the formula of Bayes, we obtain
\[
\xi(u \mid Y = y) = \frac{\xi(y \mid U = u) \, \xi_U(u)}{\int_{\mathbb{R}^s} \xi(y \mid U = u) \, \xi_U(u) \, du}.
\]
In other words, as a density in u, ξ(u | Y = y) is proportional to ξ(y | U = u) · ξ_U(u). This proportionality is denoted by
\[
\xi(u \mid Y = y) \propto \xi(y \mid U = u) \cdot \xi_U(u).
\]
Under Assumption 6 we set
\[
\rho_0(u) = \xi_U(u) \quad \text{(prior density)}, \qquad \rho^y(u) = \xi(u \mid Y = y) \quad \text{(posterior density)}
\]
and use the likelihood ansatz
\[
\xi(y \mid U = u) = v(y - G(u)),
\]
such that ρ^y(u) is proportional to v(y − G(u)) ρ_0(u), in symbols
\[
\rho^y(u) \propto v(y - G(u)) \, \rho_0(u).
\]
This implies
\[
\frac{d\pi^y}{d\pi_0}(u) \propto v(y - G(u)).
\]
Let us define the potential φ(u, y) = − log(v(y − G(u))), such that
\[
\frac{d\pi^y}{d\pi_0}(u) \propto \exp(-\varphi(u, y)).
\]
The goal is to sample with respect to the posterior measure π^y. Note that we know the density dπ^y/dπ_0(u) only up to the normalizing constant. Thus, Markov chains might provide a way to approximate a sample with distribution π^y.
In the following let us briefly state some relations between the regularized and the Bayesian approach to the inverse problem. Therefore we consider a specific model, where we require additional assumptions. We assume that v is a density of a normal distribution with mean zero and covariance matrix Γ ∈ R^{d×d}, and that the density of the prior ρ_0 is also given by a normal distribution with mean u_0 and covariance matrix Σ ∈ R^{s×s}, i.e.
\[
v(z) \propto \exp\Bigl( -\frac{1}{2} \, \bigl| \Gamma^{-1/2} z \bigr|^2 \Bigr)
\qquad \text{and} \qquad
\rho_0(u) \propto \exp\Bigl( -\frac{1}{2} \, \bigl| \Sigma^{-1/2} (u - u_0) \bigr|^2 \Bigr).
\]
In this setting
\[
\rho^y(u) \propto \exp\Bigl( -\frac{1}{2} \, \bigl| \Gamma^{-1/2} (y - G(u)) \bigr|^2 - \frac{1}{2} \, \bigl| \Sigma^{-1/2} (u - u_0) \bigr|^2 \Bigr)
\]
and
\[
\frac{d\pi^y}{d\pi_0}(u) \propto \exp\Bigl( -\frac{1}{2} \, \bigl| \Gamma^{-1/2} (y - G(u)) \bigr|^2 \Bigr).
\]
We define the norms in the regularized approach, in (1), as |·|_Σ = |Σ^{−1/2} ·| and |·|_Γ = |Γ^{−1/2} ·|. Thus, a solution of the regularized problem (1) achieves the largest value of the density ρ^y. The other way around: a very likely area with respect to π^y is with large probability close to a solution of (1).
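To make the Gaussian model concrete, here is a small sketch evaluating the negative log posterior (1/2)|Γ^{−1/2}(y − G(u))|² + (1/2)|Σ^{−1/2}(u − u_0)|² up to an additive constant; the forward map G and all parameter values below are placeholders of my choosing.

import numpy as np

def neg_log_posterior(u, y, G, Gamma, Sigma, u0):
    """Potential of the posterior: data misfit plus prior term (up to a constant)."""
    r = y - G(u)                                  # residual in data space
    misfit = 0.5 * r @ np.linalg.solve(Gamma, r)  # (1/2)|Gamma^{-1/2} r|^2
    diff = u - u0
    prior = 0.5 * diff @ np.linalg.solve(Sigma, diff)
    return misfit + prior

# Toy example: linear forward map G(u) = A u with s = 2, d = 3.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
val = neg_log_posterior(np.zeros(2), np.ones(3), lambda u: A @ u,
                        Gamma=np.eye(3), Sigma=np.eye(2), u0=np.zeros(2))
print(val)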
There are advantages of the Bayesian approach to inverse problems: the norms which arise have a close relation to the model. Namely, the norms above come from the covariance matrices of the normal distributions of the observational noise and the prior. Additionally, the parameter u_0 is the mean of the prior. And, in the Bayesian context one might ask:

- What is the probability that a solution of the regularized approach is determined by a different local minimizer?

- How certain can we be that a prediction made by a mathematical model will lie in a certain regime?
2 Markov chains

2.1 Definitions and notation
In this section we provide facts and definitions of Markov chains on general state spaces. The paper [RR04] of Roberts and Rosenthal surveys various results in this setting. For further reading we refer to [Rev84, Num84, MT09]. Let G be a complete, separable metric space and let G = B(G) be the corresponding Borel σ-algebra over G, such that (G, G) is a measurable space. In all examples G is contained in R^d; however, for the moment we want to keep the setting more general.

In the following we provide the definition of a transition kernel and a Markov chain.
In the following we provide the definition of a transition kernel and a Markov
chain.
Definition 1 (Markov kernel, transition kernel). The function K : G × G →
[0, 1] is called a Markov kernel or transition kernel if
(i) for each x ∈ G the mapping A ∈ G 7→ K(x, A) is a probability measure
on (G, G),
(ii) for each A ∈ G the mapping x ∈ G 7→ K(x, A) is a G-measurable
real-valued function.
Definition 2 (Markov chain, initial distribution). A sequence of random variables (X_n)_{n∈N} on a probability space (Ω, F, P) mapping into (G, G) is called a Markov chain with transition kernel K if for all n ∈ N and A ∈ G it holds that
\[
P(X_{n+1} \in A \mid X_1, \dots, X_n) = P(X_{n+1} \in A \mid X_n), \tag{2}
\]
and for almost all x ∈ G we assume
\[
P(X_{n+1} \in A \mid X_n = x) = K(x, A). \tag{3}
\]
The distribution
\[
\nu(A) = P(X_1 \in A), \qquad A \in \mathcal{G},
\]
is called the initial distribution.
We say that a sequence of random variables has the Markov property if (2) is satisfied. By (3), K is a regular version² of the conditional distribution of X_{n+1} given X_n for every n. Note that N is the index set of the sequence of random variables, i.e. we only study discrete time Markov chains. Further note that all our Markov chains are homogeneous, i.e.
\[
P(X_{n+1} \in A \mid X_n = x) = K(x, A)
\]
and K does not depend on n.
Suppose that K is a transition kernel and ν is a probability measure on (G, G). How can we get a Markov chain with transition kernel K and initial distribution ν? Let us assume that G ⊂ R^d. For any transition kernel there exists a random mapping representation, see for example Kallenberg [Kal02, Lemma 2.22, p. 34]. A random mapping representation is given by a measurable function Φ : G × U → G which satisfies
\[
P(\Phi(x, Z) \in A) = K(x, A), \qquad x \in G, \; A \in \mathcal{G},
\]
where (U, U) is a measurable space and the random variable Z : (Ω, F, P) → (U, U) has distribution µ. Then, a Markov chain can be constructed as follows. Let (Z_n)_{n∈N}, with Z_n : (Ω, F, P) → (U, U), be a sequence of i.i.d. random variables with distribution µ, and assume that X_1 has distribution ν. Then one can see that (X_n)_{n∈N} defined by
\[
X_n = \Phi(X_{n-1}, Z_n), \qquad n \ge 2,
\]
is a Markov chain with transition kernel K and initial distribution ν. The function Φ is also called update function, random mapping representer or algorithm of the corresponding Markov chain.
We have seen that the transition kernel is an important object in the definition of a Markov chain. In the following we present some examples of
transition kernels. The reader might check that the conditions of Definition 1 are satisfied.
Example 7. Let G ⊂ Rd , A ∈ G and x ∈ G.
² For details on conditional expectations, conditional distributions and regular versions of conditional distributions see for example [Kal02].
1. Let K1 (x, A) = 1A (x). Basically, K1 is the simplest transition kernel
one can think of. The update function is given by Φ1 (x, u) = x for any
u ∈ U.
2. Let π be a probability measure on (G, G). Further assume that we
can sample with respect to π. Then, let us define the transition kernel
K2 (x, A) = π(A). Thus, i.i.d. sampling is a special Markov chain. The
update function is given by Φ2 (x, u) = u where (U, U) = (G, G) and u
is chosen by π. Note that the succeeding state is chosen independently
of the previous one.
3. Any convex combination of two transition kernels is again a transition kernel. Thus for any λ ∈ [0, 1] we have that K(x, A) = λK_1(x, A) + (1 − λ)K_2(x, A) is a transition kernel. Here it is more convenient to describe the update function algorithmically. Namely, choose a number r ∈ [0, 1] uniformly distributed and if r ≤ λ apply Φ_1, otherwise Φ_2.
4. Let δ > 0 and let B_δ(x) denote the Euclidean ball with radius δ around x. Then the transition kernel of the ball walk is
\[
B_\delta(x, A) = \frac{\mathrm{vol}_d(A \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))}
+ 1_A(x) \left( 1 - \frac{\mathrm{vol}_d(G \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))} \right).
\]
The corresponding update function is
\[
\Phi(x, u) = \begin{cases} x + \delta u, & x + \delta u \in G, \\ x, & \text{otherwise}, \end{cases}
\]
where u is chosen by the uniform distribution in (U, U) = (B_d, B(B_d)). This can be done, see Example 2. Algorithmically, B_δ(x, ·) is presented in Algorithm 1.
5. Let G ⊂ R^d be bounded. Now we describe the transition kernel of the hit-and-run algorithm. Recall that S^{d−1} = {x ∈ R^d : |x| = 1}. Let θ ∈ S^{d−1} and let L(x, θ) be the chord in G through x and x + θ, i.e.
\[
L(x, \theta) = \{ x + s\theta \in G \mid s \in \mathbb{R} \}.
\]
Note that L(x, θ) is a one-dimensional object in R^d. Then the transition kernel is
\[
H(x, A) = \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \frac{\mathrm{vol}_1(L(x, \theta) \cap A)}{\mathrm{vol}_1(L(x, \theta))} \, d\theta.
\]
Algorithm 1: Ball-Walk(x, δ)
input : current state x, radius δ.
output: next state y.
Choose u uniformly distributed in Bd ;
if x + δu ∈ G then
Return x + δu;
else
Return x;
end
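A minimal Python sketch of Algorithm 1, reusing uniform_ball from Example 2; the membership test in_G stands for the membership oracle and is an assumed input.

def ball_walk_step(x, delta, in_G, rng=np.random.default_rng()):
    """One transition of the ball walk B_delta from the current state x."""
    u = uniform_ball(len(x), rng)   # uniform proposal direction in B_d
    y = x + delta * u
    return y if in_G(y) else x      # stay at x if the proposal leaves G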
In words the transition works as follows. First, choose a direction θ. Then we sample the next state uniformly on the chord given by x and x + θ intersected with G. The update function is described in Algorithm 2, where we assume that we are able to sample the uniform distribution on L(x, θ) for any x ∈ G and θ ∈ S^{d−1}.
Algorithm 2: hit-and-run(x)
input : current state x.
output: next state y.
Choose θ uniformly distributed in Sd−1 ;
Choose y uniformly distributed in L(x, θ);
6. Finally we want to explain the random scan Gibbs sampler. Let {e_1, . . . , e_d} be the Euclidean standard basis in R^d, i.e. e_i = (0, . . . , 0, 1, 0, . . . , 0) ∈ R^d where the 1 is at the ith coordinate. Then
\[
R(x, A) = \frac{1}{d} \sum_{i=1}^{d} H_{e_i}(x, A),
\qquad \text{where} \qquad
H_{e_i}(x, A) = \frac{\mathrm{vol}_1(L(x, e_i) \cap A)}{\mathrm{vol}_1(L(x, e_i))}.
\]
Recall that L(x, e_i) is the intersection of G with the line defined by x and x + e_i. This is conceptually very close to the hit-and-run transition kernel. The difference is that, instead of choosing an arbitrary direction, one chooses an element of the Euclidean standard basis uniformly at random. Also note that each H_{e_i} is a transition kernel and R is a convex combination of the H_{e_i}. For completeness we present the update function in Algorithm 3.
Algorithm 3: random-scan-Gibbs(x)
input : current state x.
output: next state y.
Choose i uniformly distributed in {1, . . . , d};
Choose y uniformly distributed in L(x, ei );
In the examples we have seen that a convex combination of two transition kernels is again a transition kernel; this is how transition kernels behave under summation. Now we want to define the product of two transition kernels. Let K_1 and K_2 be transition kernels on (G, G). Then the product of K_1 and K_2 is defined as
\[
K_1 K_2 (x, A) = \int_G K_2(y, A) \, K_1(x, dy), \qquad x \in G, \; A \in \mathcal{G}.
\]
Note that K_1 K_2 is again a transition kernel. This enables us to multiply a transition kernel K with itself. Let x ∈ G, A ∈ G and n ∈ N. Then we define K^0(x, A) = 1_A(x) and
\[
K^n(x, A) = \int_G K^{n-1}(y, A) \, K(x, dy).
\]
We also have
\[
K^n(x, A) = \int_G K(y, A) \, K^{n-1}(x, dy).
\]
This product can be interpreted in terms of a Markov chain. Let (X_n)_{n∈N} be a Markov chain with transition kernel K. Then we have for all k, n ∈ N that
\[
K^n(x, A) = P(X_{k+n} \in A \mid X_k = x).
\]
This is proven by induction over n. For n = 1 we have
\[
P(X_{k+1} \in A \mid X_k = x) = K(x, A).
\]
If the assertion holds for n ∈ N we show how to deduce it for n + 1:
\begin{align*}
P(X_{k+n+1} \in A \mid X_k = x)
&= \int_G P(X_{k+n+1} \in A \mid X_{k+n} = y, X_k = x) \, P(X_{k+n} \in dy \mid X_k = x) \\
&\overset{(2)}{=} \int_G P(X_{k+n+1} \in A \mid X_{k+n} = y) \, P(X_{k+n} \in dy \mid X_k = x) \\
&= \int_G K(y, A) \, K^n(x, dy) = K^{n+1}(x, A).
\end{align*}
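On a finite state space a transition kernel is simply a row-stochastic matrix, and the kernel product above becomes matrix multiplication, so K^n is the nth matrix power. A small sketch (the matrix below is an arbitrary example of mine):

import numpy as np

K = np.array([[0.9, 0.1],
              [0.2, 0.8]])         # row-stochastic: K[x, y] = K(x, {y})

K5 = np.linalg.matrix_power(K, 5)  # K^5(x, {y}) = P(X_{k+5} = y | X_k = x)
print(K5)                          # rows still sum to one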
Now let us assume that (X_n)_{n∈N} is a Markov chain with transition kernel K and initial distribution ν. Then let us define the notation
\[
\nu P^m(A) = \int_G K^m(y, A) \, \nu(dy), \qquad m \in \mathbb{N}. \tag{4}
\]
Note that we can consider ν as a transition kernel (Example 7(2)) and then we have, by the product of two kernels, that νP^m = νK^m. Further note that
\[
\nu P^m(A) = \int_G P(X_{m+1} \in A \mid X_1 = y) \, \nu(dy) = P(X_{m+1} \in A).
\]
Let us just briefly mention that one can extend this operation to signed measures.
With these notations at hand we can define the terms stationary measure
and reversible transition kernel.
Definition 3 (stationary measure). Let π be a probability measure on (G, G). Then π is called a stationary distribution of a transition kernel K if
\[
\pi P(A) = \int_G K(x, A) \, \pi(dx) = \pi(A), \qquad A \in \mathcal{G}.
\]
For some calculations it is useful to rewrite this as
\[
\pi(dy) = \int_G K(x, dy) \, \pi(dx). \tag{5}
\]
For a Markov chain (X_n)_{n∈N} with transition kernel K and stationary initial distribution π we have the following interpretation: after a single transition of the Markov chain the same distribution as before arises, i.e.
\[
P(X_1 \in A) = \pi(A) = \pi P(A) = P(X_2 \in A), \qquad A \in \mathcal{G}.
\]
Definition 4 (reversible transition kernel). Let π be a probability measure on (G, G). A transition kernel K is called reversible with respect to π if
\[
\int_B K(x, A) \, \pi(dx) = \int_A K(x, B) \, \pi(dx), \qquad A, B \in \mathcal{G}. \tag{6}
\]
Sometimes this is also written for short as
\[
K(x, dy) \, \pi(dx) = K(y, dx) \, \pi(dy), \qquad x, y \in G. \tag{7}
\]
If a transition kernel K is reversible with respect to a distribution π, then π is a stationary distribution of K. Again, for the Markov chain with stationary initial distribution we have, under the additional assumption that the transition kernel is reversible with respect to π, that
\[
P(X_1 \in A, X_2 \in B) = P(X_1 \in B, X_2 \in A), \qquad A, B \in \mathcal{G}.
\]
This gives symmetry in A and B of the measure µ(A × B) = P(X_1 ∈ A, X_2 ∈ B).
We have the following basic results for reversible transition kernels. We leave the proof to the reader, since it is not too difficult.

Lemma 8. (i) If a transition kernel K is reversible with respect to π, then π is a stationary distribution of K.

(ii) Assume that (6) holds for all disjoint sets A, B ∈ G. Then K is reversible with respect to π.
Now let us provide some examples of stationary distributions and reversible transition kernels.

Example 9. 1. Let us consider K_1(x, A) = 1_A(x). Then we obtain for any probability measure ν on (G, G) that
\[
\int_B K_1(x, A) \, \nu(dx) = \nu(A \cap B) = \int_A K_1(x, B) \, \nu(dx), \qquad A, B \in \mathcal{G}.
\]
Thus, this kernel is reversible with respect to any possible ν. This further implies that any ν is a stationary distribution of K_1.
2. Let us consider K_2(x, A) = π(A), where π is a probability measure on (G, G). Obviously,
\[
\int_A K_2(x, B) \, \pi(dx) = \pi(A) \, \pi(B) = \int_B K_2(x, A) \, \pi(dx), \qquad A, B \in \mathcal{G}.
\]
Thus, K_2 is reversible with respect to π and π is a stationary measure.

3. Let K_1 and K_2 be two transition kernels which are both reversible with respect to a distribution π. Then any convex combination of K_1 and K_2 is again reversible with respect to π. But in general it is not true that the product of K_1 and K_2 has this property: K_1 K_2 is reversible with respect to π if and only if K_2 K_1 = K_1 K_2. However, π is a stationary distribution of the product of K_1 and K_2. This holds since π is a stationary distribution of K_1 and of K_2; here reversibility is not necessary.
4. Now let us turn to a not so obvious example. We consider the ball walk. Here we assume that G ⊂ R^d is a bounded set and δ > 0. Recall that
\[
B_\delta(x, A) = \frac{\mathrm{vol}_d(A \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))}
+ 1_A(x) \left( 1 - \frac{\mathrm{vol}_d(G \cap B_\delta(x))}{\mathrm{vol}_d(B_\delta(0))} \right).
\]
If one imagines a typical large sample of the ball walk, one might guess that the stationary measure is the uniform distribution on G. By Lemma 8(ii) it is enough to show (6) for disjoint sets C, D ∈ G, i.e. C ∩ D = ∅. To shorten the notation let us define κ = vol_d(B_δ(0)) vol_d(G). We have
\begin{align*}
\frac{1}{\mathrm{vol}_d(G)} \int_C B_\delta(x, D) \, dx
&= \frac{1}{\kappa} \int_C \int_D 1_{B_\delta(x)}(y) \, dy \, dx \\
&= \frac{1}{\kappa} \int_C \int_D 1_{B_\delta(y)}(x) \, dy \, dx
= \frac{1}{\mathrm{vol}_d(G)} \int_D B_\delta(x, C) \, dx.
\end{align*}
Note that the second equality follows from 1_{B_δ(x)}(y) = 1_{B_δ(y)}(x) for any x, y ∈ G. Thus, the ball walk is reversible with respect to the uniform distribution.
2.2 Markov operator

In this subsection we describe how a transition kernel induces an operator, and we state properties of this operator. Let us assume that π is a stationary probability measure of a transition kernel K. Further let p ∈ [1, ∞] and
\[
L_p = L_p(\pi) = \Bigl\{ f : G \to \mathbb{R} \;\Big|\; \|f\|_p = \Bigl( \int_G |f(x)|^p \, \pi(dx) \Bigr)^{1/p} < \infty \Bigr\}.
\]
Note that ‖f‖_p ≤ ‖f‖_{p′} if p ≤ p′. Hence we have L_{p′} ⊂ L_p. For f ∈ L_p let us define
\[
P f(x) = \int_G f(y) \, K(x, dy), \qquad x \in G.
\]
We call the linear operator P the Markov operator or transition operator. Let us provide an interpretation. If a Markov chain (X_n)_{n∈N} with transition kernel K and initial distribution δ_x, the point mass at x ∈ G, is given, then P f(x) is the expectation of f(X_2), i.e.
\[
\mathbb{E}[f(X_2) \mid X_1 = x] = P f(x).
\]
In the following we state helpful properties.

Lemma 10. (i) For the constant function f ≡ 1 we have
\[
P f(x) = f(x), \qquad x \in G. \tag{8}
\]

(ii) The operator P maps from L_p to L_p and is bounded with ‖P‖_{L_p → L_p} = 1.

(iii) The transition kernel K is reversible with respect to π if and only if P is self-adjoint on L_2.

Proof. The assertion of (i) is obvious. We prove (ii) by
\[
\int_G |P f(x)|^p \, \pi(dx)
= \int_G \Bigl| \int_G f(y) \, K(x, dy) \Bigr|^p \pi(dx)
\le \int_G \int_G |f(y)|^p \, K(x, dy) \, \pi(dx)
= \int_G |f(y)|^p \, \pi(dy).
\]
The inequality is an application of the Jensen inequality for conditional expectations, and the last equality holds since π is a stationary distribution, i.e. π(dy) = ∫_G K(x, dy) π(dx). Now we show (iii). Note that L_2 is a Hilbert space with inner product
\[
\langle f, g \rangle = \int_G f(x) g(x) \, \pi(dx).
\]
Thus, we have to prove that ⟨P f, g⟩ = ⟨f, P g⟩ is equivalent to reversibility of K. The first direction follows with f = 1_A and g = 1_B for any A, B ∈ G. For the reverse conclusion we recall the intuitive writing of reversibility, namely (7), which is
\[
K(x, dy) \, \pi(dx) = K(y, dx) \, \pi(dy).
\]
Then for all f, g ∈ L_2 it holds that
\[
\langle P f, g \rangle = \int_G \int_G f(y) \, g(x) \, K(x, dy) \, \pi(dx)
\overset{(7)}{=} \int_G \int_G f(y) \, g(x) \, K(y, dx) \, \pi(dy) = \langle f, P g \rangle,
\]
which finishes the proof.
From now on we have the following standing assumption.
Assumption 11. The transition kernel K is reversible with respect to a
probability measure π on (G, G).
This assumption allows us to use the machinery for self-adjoint operators,
since then the Markov operator is self-adjoint.
Now let us define how the transition kernel acts on signed measures. By M(G) we denote the set of real-valued signed measures³ on (G, G). Let ν ∈ M(G). If there exists a density of ν with respect to π then we denote it by dν/dπ, and for q ∈ [1, ∞] let
\[
\|\nu\|_q =
\begin{cases}
\bigl\| \frac{d\nu}{d\pi} \bigr\|_q, & \nu \ll \pi, \\
\infty, & \text{otherwise}.
\end{cases}
\]
Set
\[
M_q = M_q(G, \pi) = \{ \nu \in M(G) : \|\nu\|_q < \infty \}.
\]
The function space L_q is isometrically isomorphic to the space of signed measures M_q, in symbols L_q ≅ M_q. The space of signed measures M_2 is a Hilbert space whose inner product is the L_2 inner product of the densities; one has
\[
\langle \nu, \mu \rangle = \int_G \frac{d\nu}{d\pi}(x) \, \frac{d\mu}{d\pi}(x) \, \pi(dx)
= \Bigl\langle \frac{d\nu}{d\pi}, \frac{d\mu}{d\pi} \Bigr\rangle, \qquad \nu, \mu \in M_2 .
\]

³ The set function µ : G → R is a real-valued signed measure if it is the difference of two measures on (G, G). More precisely, if µ(∅) = 0 and for pairwise disjoint A_1, A_2, . . . , with A_k ∈ G for k ∈ N, one has µ(∪_{k=1}^∞ A_k) = Σ_{k=1}^∞ µ(A_k).
Let us extend the definition of (4) to signed measures. The transition kernel applies to a signed measure ν ∈ M_q as
\[
\nu P(A) = \int_G K(x, A) \, \nu(dx), \qquad A \in \mathcal{G}.
\]
We show that this defines a linear operator ν ↦ νP on M_q. Obviously (ν + µ)P = νP + µP. But until now it is not clear whether νP ∈ M_q when ν ∈ M_q. In the following we show that this is indeed true.
Lemma 12. Let q ∈ [1, ∞] and ν ∈ M_q. Then π-almost everywhere
\[
\frac{d(\nu P)}{d\pi}(x) = P\Bigl( \frac{d\nu}{d\pi} \Bigr)(x) \tag{9}
\]
and νP ∈ M_q.

Proof. We have for ν ∈ M_q that
\begin{align*}
\nu P(A) = \int_G K(x, A) \, \nu(dx)
&= \int_G \int_G 1_A(y) \, \frac{d\nu}{d\pi}(x) \, K(x, dy) \, \pi(dx) \\
&\overset{\text{rev.}}{=} \int_G \int_G 1_A(y) \, \frac{d\nu}{d\pi}(x) \, K(y, dx) \, \pi(dy)
= \int_A P\Bigl( \frac{d\nu}{d\pi} \Bigr)(y) \, \pi(dy).
\end{align*}
Thus ‖νP‖_q = ‖P(dν/dπ)‖_q ≤ ‖dν/dπ‖_q < ∞, which finishes the proof.
2.3 Convergence properties of Markov chains

Let (X_n)_{n∈N} be a Markov chain with transition kernel K and initial distribution ν. We assume that K is reversible with respect to π. For A ∈ B(G) we ask how fast
\[
P(X_n \in A) = \nu P^{n-1}(A) \xrightarrow{\; n \to \infty \;} \pi(A) \; ?
\]
Thus, the goal is to quantify the speed of convergence, if it converges at all, of νP^{n−1} to π for increasing n ∈ N. Therefore we introduce a distance between measures.
Definition 5 (total variation distance). The total variation distance between two probability measures ν, µ ∈ M(G) is defined by
\[
\|\nu - \mu\|_{\mathrm{tv}} = \sup_{A \in \mathcal{G}} |\nu(A) - \mu(A)| .
\]
It is helpful to consider the total variation distance as an L_1-norm.

Lemma 13. If ν, µ ∈ M_1, then ‖ν − µ‖_tv = (1/2) ‖ν − µ‖_1.
Proof. First, we show that
\[
\|\nu - \mu\|_{\mathrm{tv}} = \frac{1}{2} \sup_{\|f\|_\infty \le 1} \Bigl| \int_G f(x) \, (\nu(dx) - \mu(dx)) \Bigr|. \tag{10}
\]
Let
\[
A = \Bigl\{ x \in G \;\Big|\; \frac{d\nu}{d\pi}(x) \ge \frac{d\mu}{d\pi}(x) \Bigr\}.
\]
Then the supremum within the total variation distance on the left-hand side of (10) is attained at A, and the supremum on the right-hand side is attained for
\[
f(x) = \begin{cases} 1, & x \in A, \\ -1, & x \notin A. \end{cases}
\]
By using this we obtain (10). Let ξ ∈ M_1. Then we have
\[
\sup_{\|f\|_\infty \le 1} \Bigl| \int_G f(x) \, \xi(dx) \Bigr|
= \sup_{\|f\|_\infty \le 1} \Bigl| \int_G f(x) \, \frac{d\xi}{d\pi}(x) \, \pi(dx) \Bigr|.
\]
Furthermore,
\[
\Bigl| \int_G f(x) \, \frac{d\xi}{d\pi}(x) \, \pi(dx) \Bigr|
\le \|f\|_\infty \, \Bigl\| \frac{d\xi}{d\pi} \Bigr\|_1 \le \Bigl\| \frac{d\xi}{d\pi} \Bigr\|_1 = \|\xi\|_1,
\]
with equality for the choice of f above. Thus, the proof is finished by setting ξ = ν − µ.
Now let us ask for an upper bound of ‖νP^n − π‖_tv. First we apply Lemma 13; therefore we assume that ν ∈ M_1. We just start a calculation and see how far we get:
\begin{align*}
2 \, \|\nu P^n - \pi\|_{\mathrm{tv}} = \|\nu P^n - \pi\|_1
&= \Bigl\| \frac{d(\nu P^n)}{d\pi} - 1 \Bigr\|_1
\overset{(9)}{=} \Bigl\| P^n \frac{d\nu}{d\pi} - 1 \Bigr\|_1
\overset{(8)}{=} \Bigl\| P^n \Bigl( \frac{d\nu}{d\pi} - 1 \Bigr) \Bigr\|_1 \\
&= \Bigl\| (P^n - S) \Bigl( \frac{d\nu}{d\pi} - 1 \Bigr) \Bigr\|_1 ,
\end{align*}
where we define
\[
S(f) = \int_G f(x) \, \pi(dx).
\]
Note that the last equality of the calculation comes from S(dν/dπ − 1) = 0. This consideration implies the following two bounds.
Lemma 14. Let ν ∈ M_1. Then
\[
\|\nu P^n - \pi\|_{\mathrm{tv}} \le \|P^n - S\|_{L_1 \to L_1} \, \frac{1}{2} \Bigl\| \frac{d\nu}{d\pi} - 1 \Bigr\|_1
\]
and
\[
\|\nu P^n - \pi\|_{\mathrm{tv}} \le \|P^n - S\|_{L_2 \to L_2} \, \frac{1}{2} \Bigl\| \frac{d\nu}{d\pi} - 1 \Bigr\|_2 .
\]
Note that if ν = π, then ‖νP^n − π‖_tv = 0 and our estimates contain ‖dν/dπ − 1‖_p = 0 for p = 1, 2, which guarantees that the estimate is also zero. This is a nice behavior of the bound with respect to the initial distribution.
Let us consider ‖P^n − S‖_{L_2 → L_2} in more detail.

Lemma 15. Let K be a transition kernel which is reversible with respect to π. Then
\[
\|P^n - S\|_{L_2 \to L_2} = \|(P - S)^n\|_{L_2 \to L_2} = \|P - S\|_{L_2 \to L_2}^n , \qquad n \in \mathbb{N}.
\]
Proof. The first equality comes from the fact that P^n − S = (P − S)^n, which holds since π is a stationary distribution of K. The second equality comes from the fact that (P − S)^n is a self-adjoint operator, which follows from the reversibility of K, see Lemma 10(iii).
The last two lemmata motivate the following two different convergence properties of transition kernels.

Definition 6 (L_1-exponential convergence). Let α ∈ [0, 1) and M ∈ (0, ∞). Then the transition kernel K is L_1-exponentially convergent with (α, M) if
\[
\|P^n - S\|_{L_1 \to L_1} \le \alpha^n M, \qquad n \in \mathbb{N}. \tag{11}
\]
A Markov chain with transition kernel K is called L_1-exponentially convergent if there exist an α ∈ [0, 1) and an M ∈ (0, ∞) such that (11) holds.
Definition 7 (L_2-spectral gap). We say that a transition kernel K and its corresponding Markov operator P have an L_2-spectral gap if
\[
\mathrm{gap}(P) = 1 - \|P - S\|_{L_2 \to L_2} > 0.
\]
If the transition kernel has an L_2-spectral gap, then we have by Lemma 14 and Lemma 15 that
\[
\|\nu P^n - \pi\|_{\mathrm{tv}} \le (1 - \mathrm{gap}(P))^n \, \frac{1}{2} \Bigl\| \frac{d\nu}{d\pi} - 1 \Bigr\|_2 .
\]
Next, we define other convergence properties which are based on the total variation distance. After that we state relations between the different properties.

Definition 8 (uniform ergodicity, geometric ergodicity). Let α ∈ [0, 1) and M : G → (0, ∞). Then the transition kernel K is called geometrically ergodic with (α, M(x)) if one has for π-almost all x ∈ G that
\[
\|K^n(x, \cdot) - \pi\|_{\mathrm{tv}} \le M(x) \, \alpha^n, \qquad n \in \mathbb{N}. \tag{12}
\]
If the inequality (12) holds with a bounded function M(x), i.e.
\[
\sup_{x \in G} M(x) \le M' < \infty,
\]
then K is called uniformly ergodic with (α, M′).
Since we assume that the transition kernel is reversible with respect to π, we have the following relations:

- K uniformly ergodic with (α, M) ⟺ K L_1-exponentially convergent with (α, 2M);
- K uniformly ergodic with (α, M) ⟹ K geometrically ergodic with (α, M(x));
- K L_1-exponentially convergent with (α, 2M) ⟹ L_2-spectral gap ≥ 1 − α. (13)

The fact that uniform ergodicity implies geometric ergodicity is obvious. For the proofs of the other relations and further details we refer to [Rud12, Proposition 3.23, Proposition 3.24]. Now one might ask if there is a direct relation between geometric ergodicity and the existence of an L_2-spectral gap. We impose another condition on the transition kernel.
Definition 9 (π-irreducible). The transition kernel K is called π-irreducible if for all A ∈ G and all x ∈ G there exists an n ∈ N such that
\[
\pi(A) > 0 \implies K^n(x, A) > 0.
\]
The assumption of π-irreducibility is a connectivity condition in the following sense: whenever a set A with π(A) > 0 is given, then, from any starting point x ∈ G, a Markov chain with transition kernel K needs at most a finite number of steps to reach A with positive probability. Thus, there is no set A with π(A) > 0 which cannot be reached by the Markov chain.

Let the transition kernel be π-irreducible. Then
\[
\text{geometrically ergodic with } (\alpha, M(x)) \iff L_2\text{-spectral gap} \ge 1 - \alpha. \tag{14}
\]
For a proof of the relation stated in (14) see [RR97] and [RT01].
3 Approximation of expectations

In this section we assume that we have a Markov chain (X_n)_{n∈N} with transition kernel K and initial distribution ν. Further let K be reversible with respect to π. Then, the goal is to approximate
\[
S(f) = \int_G f(x) \, \pi(dx).
\]
We use the average over a finite Markov chain sample as an approximation of this mean. Thus, we consider
\[
S_{n,n_0}(f) = \frac{1}{n} \sum_{j=1}^{n} f(X_{j+n_0}).
\]
The number n determines the number of function values of f which are computed. The number n_0 is the burn-in or warm-up time. Intuitively, it is the number of steps of the Markov chain needed to get close to the stationary distribution.
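A minimal sketch of the estimator S_{n,n_0}: run the chain for n_0 + n steps and average f over the last n states. The step function is an assumed input, for instance a lambda wrapping ball_walk_step from Section 2.

def mcmc_estimate(f, x1, step, n, n0):
    """Approximate S(f) by S_{n,n_0}: discard a burn-in of n0 steps,
    then average f along the next n states of the chain."""
    x, total = x1, 0.0
    for _ in range(n0):          # burn-in: move towards stationarity
        x = step(x)
    for _ in range(n):           # average over the next n states
        x = step(x)
        total += f(x)
    return total / n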
We state a strong law of large numbers also called ergodic theorem. For details
we refer to [AG10], [RR04, Theorem 4, Fact 5, Remark after Corollary 6] or
[MT09, Theorem 17.1.7, p. 427].
Theorem 16 (Strong law of large numbers). Let K be a π-irreducible transition kernel. Then, for any f ∈ L_1 we have
\[
P_x\bigl( \lim_{n \to \infty} S_{n,n_0}(f) = S(f) \bigr) = 1
\]
for π-almost all x ∈ G, where P_x denotes the distribution of a Markov chain (X_n)_{n∈N} with transition kernel K and initial distribution ν = δ_x, the point mass at x.
However, we are rather interested in explicit error bounds. We want to study the mean square error of S_{n,n_0}, given by
\[
e_\nu(S_{n,n_0}, f) = \bigl( \mathbb{E}_{\nu,K} \, |S_{n,n_0}(f) - S(f)|^2 \bigr)^{1/2},
\]
where ν and K indicate the initial distribution and transition kernel. A first question one might ask: how does the error behave if one already starts with the stationary distribution? Or shorter: how does e_π(S_{n,n_0}, f) behave?
Theorem 17. Let (X_n)_{n∈N} be a Markov chain with transition kernel K and initial distribution π. Further assume that the transition kernel is reversible with respect to π. We define
\[
\Lambda = \sup\{ \alpha : \alpha \in \mathrm{spec}(P - S) \},
\]
where spec(P − S) denotes the spectrum of the operator P − S : L_2 → L_2, and assume that Λ < 1. Then
\[
\sup_{\|f\|_2 \le 1} e_\pi(S_{n,n_0}, f)^2 \le \frac{2}{n(1 - \Lambda)} .
\]
For a proof of this result we refer to [Rud12, Corollary 3.27]. Let us discuss the assumptions and the result of the theorem. First, note that for the classical Monte Carlo method we have Λ = 0. In this case we get exactly (up to a constant factor) what we would expect (see Theorem 4), i.e. the theorem also covers the classical Monte Carlo method. Further note that gap(P) = 1 − ‖P − S‖_{L_2 → L_2} and
\[
\|P - S\|_{L_2 \to L_2} = \sup\{ |\alpha| : \alpha \in \mathrm{spec}(P - S) \},
\]
such that gap(P) ≤ 1 − Λ. This also implies that if P : L_2 → L_2 is positive semi-definite we obtain gap(P) = 1 − Λ. Thus, whenever we have a lower bound for the spectral gap we can apply Theorem 17 and substitute 1 − Λ by gap(P). Further note that if γ ∈ [0, 1), M ∈ (0, ∞) and the transition kernel is L_1-exponentially convergent with (γ, M), then we have, by using (13), that gap(P) ≥ 1 − γ.
Now we might ask how e_ν(S_{n,n_0}, f) behaves depending on the initial distribution. The idea is to decompose the error in a suitable way, for example into a bias and a variance term. However, we want an estimate with respect to ‖f‖_2, and in this setting the following decomposition is more convenient:
\[
e_\nu(S_{n,n_0}, f)^2 = e_\pi(S_{n,n_0}, f)^2 + \text{rest},
\]
where rest denotes an additional term such that equality holds. Then, we have to estimate the rest term and can use Theorem 17 to obtain an error bound. For further details of the proof of the following error bound we refer to [Rud12, Theorem 3.34 and Theorem 3.41].
Theorem 18. Let (X_n)_{n∈N} be a Markov chain with transition kernel K and initial distribution ν. Further let the transition kernel be reversible with respect to π. Then
\[
\sup_{\|f\|_p \le 1} e_\nu(S_{n,n_0}, f)^2 \le \frac{2}{n(1 - \Lambda)} + \frac{2 C_\nu \gamma^{n_0}}{n^2 (1 - \gamma)^2} \tag{15}
\]
with
\[
\Lambda = \sup\{ \alpha : \alpha \in \mathrm{spec}(P - S) \},
\]
where spec(P − S) denotes the spectrum of the operator P − S : L_2 → L_2, holds in the following settings:

(i) for p = 2, ν ∈ M_∞ and a transition kernel K which is L_1-exponentially convergent with (γ, M), where C_ν = M ‖dν/dπ − 1‖_∞;

(ii) for p = 4, ν ∈ M_2 and 1 − γ = gap(P) > 0, where C_ν = 64 ‖dν/dπ − 1‖_2.
Now we have an explicit error bound. If we want to achieve an error of ε ∈ (0, 1) it is still not clear how to choose n and n_0 when the total number of steps n + n_0 is fixed. Thus, the question is how to choose the burn-in n_0. Let e(n, n_0) be the right-hand side of (15) and assume that we have computational resources for N = n + n_0 steps of the Markov chain. We want to find an n_opt which minimizes e(N − n_0, n_0) (depending on n_0). In [Rud12, Lemma 2.26] the following is proven: for all δ > 0, if N and C_ν are large enough, then n_opt satisfies
\[
n_{\mathrm{opt}} \in \left[ \frac{\log C_\nu}{\log \gamma^{-1}}, \; (1 + \delta) \, \frac{\log C_\nu}{\log \gamma^{-1}} \right].
\]
Thus, in this setting n_0 = ⌈log C_ν / log γ^{−1}⌉ is a reasonable choice.
4 Algorithms

Let G ⊂ R^d, G = B(G) and ρ : G → (0, ∞), where ρ is integrable with respect to the Lebesgue measure. We can define a distribution π_ρ on (G, G) by
\[
\pi_\rho(A) = \frac{\int_A \rho(x) \, dx}{\int_G \rho(x) \, dx}, \qquad A \in \mathcal{G}.
\]
The goal of this section is to provide basic constructions of transition kernels with stationary distribution π_ρ.
4.1 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm is the most famous method for approximate sampling by a Markov chain. The idea is to modify a given transition kernel such that π_ρ becomes a stationary distribution of the modification. First we consider a special case:
Example 19. Let G ⊂ R^d be bounded and assume that µ denotes the uniform distribution on (G, G). How can we modify µ to obtain something which gets close to π_ρ? We explain the independent Metropolis algorithm. For i ∈ N a single transition from x_i ∈ G works as follows:

1. Sample a proposal state y ∈ G with respect to µ, i.e. with uniform distribution on (G, G).

2. Let
\[
\alpha(x_i, y) = \min\Bigl\{ 1, \frac{\rho(y)}{\rho(x_i)} \Bigr\}
\]
and choose r ∈ [0, 1] uniformly distributed. If r ≤ α(x_i, y) return x_{i+1} = y, otherwise return x_{i+1} = x_i.
By introducing an acceptance/rejection step we modify the sampling with respect to µ. The dependence on ρ is entirely determined by the acceptance probability α(·, ·). A state y is proposed and it is checked how likely it is. Informally, if ρ(y)/ρ(x_i) ≥ 1, then the proposed state is at least as likely as the current one and it is accepted. If ρ(y)/ρ(x_i) < 1, then the state is accepted only with a certain probability, which quantifies how much less likely the proposed state is compared to the current one.
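A minimal sketch of one transition of Example 19; sample_uniform_G is an assumed routine that samples the uniform distribution on G.

import numpy as np

def independent_metropolis_step(x, rho, sample_uniform_G,
                                rng=np.random.default_rng()):
    """One transition of the independent Metropolis algorithm of Example 19."""
    y = sample_uniform_G()                  # proposal, independent of x
    alpha = min(1.0, rho(y) / rho(x))       # acceptance probability
    return y if rng.uniform() <= alpha else x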
The example illustrates the general modification procedure: we propose a state and accept it with a certain probability which depends on ρ. We need some further notation to make the general approach precise. Let q : G × G → [0, ∞] be a function such that q(x, ·) is Lebesgue integrable for all x ∈ G with ∫_G q(x, y) dy ≤ 1. Then
\[
Q(x, A) = \int_A q(x, y) \, dy + 1_A(x) \Bigl( 1 - \int_G q(x, y) \, dy \Bigr), \qquad x \in G, \; A \in \mathcal{G},
\]
is a transition kernel and we call q(·, ·) the transition density. Let Q be the proposal transition kernel.
Now the question is: how can we modify Q with an acceptance/rejection step to obtain a reversible transition kernel with respect to π_ρ? With the proposal transition kernel Q and the non-normalized density ρ we want to construct a transition kernel
\[
K_\rho(x, A) = \int_A k_\rho(x, y) \, dy + 1_A(x) \Bigl( 1 - \int_G k_\rho(x, y) \, dy \Bigr)
\]
with transition density k_ρ(·, ·), such that K_ρ is reversible with respect to π_ρ. In terms of the transition density, reversibility with respect to π_ρ is equivalent to
\[
k_\rho(x, y) \, \rho(x) = k_\rho(y, x) \, \rho(y), \qquad x, y \in G.
\]
Let x, y ∈ G and w.l.o.g. assume that
\[
q(x, y) \, \rho(x) > q(y, x) \, \rho(y).
\]
We want to modify the left-hand side so that we achieve equality; thus we have to multiply it by a number α(x, y) < 1. We obtain
\[
\alpha(x, y) \, q(x, y) \, \rho(x) = q(y, x) \, \rho(y).
\]
This gives k_ρ(x, y) = α(x, y) q(x, y), and since we want reversibility we have to choose α(y, x) = 1, so that we get
\[
\alpha(x, y) \, q(x, y) \, \rho(x) = q(y, x) \, \rho(y) = \alpha(y, x) \, q(y, x) \, \rho(y).
\]
This completely specifies the acceptance probability as
\[
\alpha(x, y) =
\begin{cases}
1, & q(x, y) \rho(x) = 0, \\
\min\Bigl\{ 1, \frac{q(y, x) \rho(y)}{q(x, y) \rho(x)} \Bigr\}, & \text{otherwise}.
\end{cases}
\]
For a single transition of the Metropolis-Hastings algorithm see Algorithm 4.

Algorithm 4: Metropolis-Hastings-algorithm(x, Q)
input : current state x, proposal transition kernel Q.
output: next state y.
Generate y with respect to Q(x, ·);
Compute
\[
\alpha(x, y) = \min\Bigl\{ 1, \frac{\rho(y) \, q(y, x)}{\rho(x) \, q(x, y)} \Bigr\};
\]
if U(0, 1) ≤ α(x, y) then
  Return y;
else
  Return x;
end

Now we are also able to define the transition kernel of the Metropolis-Hastings algorithm. Namely, we have
\begin{align*}
K_\rho(x, A)
&= \int_A \alpha(x, y) \, Q(x, dy) + 1_A(x) \Bigl( 1 - \int_G \alpha(x, y) \, Q(x, dy) \Bigr) \\
&= \int_A \alpha(x, y) \, q(x, y) \, dy + 1_A(x) \Bigl( 1 - \int_G \alpha(x, y) \, q(x, y) \, dy \Bigr).
\end{align*}
By the construction we obtained the following.
Lemma 20. The transition kernel Kρ is reversible with respect to πρ .
Proof. The proof follows by the construction. However, if one just has the transition kernel and wants to check whether reversibility holds, one has to show
\[
\int_A K_\rho(x, B) \, \pi_\rho(dx) = \int_B K_\rho(x, A) \, \pi_\rho(dx)
\]
for disjoint A, B ∈ B(G), see Lemma 8(ii). If one keeps in mind that α(x, y) q(x, y) ρ(x) = α(y, x) q(y, x) ρ(y), it is not too difficult to prove the assertion.
Now we define two special cases of the Metropolis-Hastings algorithm. Namely, if q(x, y) = q(y, x), i.e. q is symmetric, then K_ρ is called the Metropolis algorithm, and if q(x, y) = η(y) for a function η : G → (0, ∞) and all x, y ∈ G, then K_ρ is called the independent Metropolis algorithm, see Example 19.
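The following Python sketch turns Algorithm 4 into code for a symmetric Gaussian random walk proposal, i.e. the Metropolis algorithm just defined; the proposal scale and function names are my choices.

import numpy as np

def metropolis_step(x, log_rho, scale=0.5, rng=np.random.default_rng()):
    """One Metropolis transition (x is a NumPy array) with a symmetric
    Gaussian proposal q(x, .) = N(x, scale^2 I); for symmetric q the
    proposal densities cancel in the acceptance ratio of Algorithm 4."""
    y = x + scale * rng.standard_normal(x.shape)
    log_alpha = min(0.0, log_rho(y) - log_rho(x))   # log acceptance probability
    return y if np.log(rng.uniform()) <= log_alpha else x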
The Metropolis-Hastings algorithm provides a tool to construct a Markov chain with a desired stationary distribution. However, until now it is not clear whether the Markov chain converges to the stationary distribution. In the following we state a sufficient condition for the independent Metropolis algorithm to be uniformly ergodic. This is based on [MT96, Theorem 2.1, p. 105]. The proof is omitted; however, we want to remark that we also use [Rud12, Proposition 3.24, p. 49] to derive the result.
Theorem 21. Let η : G → (0, ∞) be such that ∫_G η(y) dy = 1 and Q(x, A) = ∫_A η(y) dy for A ∈ B(G). If there exists a number γ > 0 such that
\[
\frac{\eta(y)}{\rho(y)} \ge \gamma, \qquad y \in G,
\]
then
\[
\|P_\rho^n - S\|_{L_1 \to L_1} \le 2 (1 - \gamma)^n,
\]
where P_ρ denotes the transition operator with corresponding transition kernel K_ρ and S(f) = ∫_G f(x) π_ρ(dx).

The previous theorem states that the independent Metropolis algorithm is L_1-exponentially convergent. Further note that by the diagram in (13), L_1-exponential convergence implies a spectral gap. Hence
\[
\mathrm{gap}(P_\rho) \ge \gamma.
\]
Now we could apply Theorem 18 to get an error bound for the approximation
of expectations.
We consider an example which satisfies the assumptions of Theorem 21.
Example 22. Let G = R and ξ² > 1. Further let
\[
\rho(y) = \frac{1}{\sqrt{2\pi}} \exp(-y^2/2)
\qquad \text{and} \qquad
\eta(y) = \frac{1}{\sqrt{2\pi}\,\xi} \exp\bigl(-y^2/(2\xi^2)\bigr).
\]
In other words, ρ is the density of a N(0, 1) random variable and η is the density of a N(0, ξ²) random variable. We have
\[
\frac{\eta(y)}{\rho(y)} = \xi^{-1} \exp\bigl( -y^2 (\xi^{-2} - 1)/2 \bigr) \ge \xi^{-1},
\]
with equality at y = 0. Thus, by Theorem 21 we obtain ‖P_ρ^n − S‖_{L_1→L_1} ≤ 2(1 − ξ^{−1})^n.
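A small sketch simulating Example 22: an independent Metropolis chain targeting N(0, 1) with N(0, ξ²) proposals; the chain length and ξ are arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(0)
xi, n = 2.0, 10_000

x, chain = 0.0, []
for _ in range(n):
    y = xi * rng.standard_normal()          # proposal from N(0, xi^2)
    # log acceptance ratio log(rho(y) eta(x)) - log(rho(x) eta(y)) for the
    # independent Metropolis algorithm with q(x, y) = eta(y)
    log_alpha = (-0.5 * y**2 + 0.5 * x**2) + (y**2 - x**2) / (2 * xi**2)
    if np.log(rng.uniform()) <= min(0.0, log_alpha):
        x = y
    chain.append(x)

print(np.mean(chain), np.var(chain))  # should be close to 0 and 1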
Next, we introduce another technique to show a convergence property; more precisely, a technique to show a lower bound of the spectral gap 1 − Λ, see below. Therefore we need the following new definition.

Definition 10. The conductance of a transition kernel K, which is reversible with respect to π, is given by
\[
\varphi = \inf_{0 < \pi(A) \le 1/2} \frac{\int_A K(x, A^c) \, \pi(dx)}{\pi(A)} .
\]
The ratio in the infimum is also called the bottleneck ratio. It is the probability P(X_2 ∈ A^c | X_1 ∈ A) of a Markov chain (X_n)_{n∈N} where the initial state is chosen by the stationary distribution, i.e. P(X_1 ∈ A) = π(A). In this sense it measures how fast one leaves a set A in one step of the Markov chain. The following result is well known; in this form it is due to Lawler and Sokal [LS88].
Theorem 23. We have
\[
\frac{\varphi^2}{2} \le 1 - \Lambda \le 2\varphi,
\]
with Λ = sup{α : α ∈ spec(P − S)}, where spec(P − S) denotes the spectrum of P − S : L_2 → L_2.
In the next example we show how one can use this result.

Example 24. Let G ⊂ R^d be a bounded set and assume that there are numbers 0 < c_1, c_2 < ∞ such that c_1 ≤ ρ(x) ≤ c_2 for all x ∈ G. We consider an independent Metropolis algorithm. Let the proposal transition kernel be given by
\[
Q(x, A) = \mu(A) = \frac{\mathrm{vol}_d(A)}{\mathrm{vol}_d(G)}, \qquad A \in \mathcal{B}(G),
\]
i.e. a state is proposed with the uniform distribution on G. Then
\[
K_\rho(x, A) = \int_A \alpha(x, y) \, \frac{dy}{\mathrm{vol}_d(G)}
+ 1_A(x) \Bigl( 1 - \int_G \alpha(x, y) \, \frac{dy}{\mathrm{vol}_d(G)} \Bigr),
\]
where α(x, y) = min{1, ρ(y)/ρ(x)}. By Theorem 21 and
\[
\frac{\eta(y)}{\rho(y)} \ge \frac{1}{c_2 \, \mathrm{vol}_d(G)}
\]
we obtain
\[
\|P_\rho^n - S\|_{L_1 \to L_1} \le 2 \Bigl( 1 - \frac{1}{c_2 \, \mathrm{vol}_d(G)} \Bigr)^n,
\]
where P_ρ denotes the transition operator induced by K_ρ. By (13) it holds that gap(P_ρ) ≥ 1/(c_2 vol_d(G)). We can apply Theorem 18 to obtain an error bound for the approximation of S(f), see Corollary 25. However, often vol_d(G) is exponentially increasing for increasing d, for example if G = [−1, 1]^d or G = B_d. This is the drawback of the former estimate.
Let us apply Theorem 23. It is known that P_ρ is positive semi-definite on L_2(π_ρ); for details we refer to [RU12]. Thus, we have that gap(P_ρ) = 1 − Λ_ρ with
\[
\Lambda_\rho = \sup\{ \alpha : \alpha \in \mathrm{spec}(P_\rho - S) \}.
\]
Now we estimate the conductance. Let A ∈ B(G) and assume that 0 < π_ρ(A) ≤ 1/2. Then
\begin{align*}
\int_A K_\rho(x, A^c) \, \pi_\rho(dx)
&= \int_A \int_{A^c} \alpha(x, y) \, \frac{dy}{\mathrm{vol}_d(G)} \, \pi_\rho(dx) \\
&= \int_A \int_{A^c} \min\Bigl\{ \frac{1}{\rho(x)}, \frac{1}{\rho(y)} \Bigr\}
\frac{\int_G \rho(z) \, dz}{\mathrm{vol}_d(G)} \, \pi_\rho(dy) \, \pi_\rho(dx) \\
&= \frac{1}{\mathrm{vol}_d(G)} \int_A \int_{A^c}
\min\Bigl\{ \int_G \frac{\rho(z)}{\rho(x)} \, dz, \int_G \frac{\rho(z)}{\rho(y)} \, dz \Bigr\}
\, \pi_\rho(dy) \, \pi_\rho(dx) \\
&\ge \frac{c_1}{c_2} \, \pi_\rho(A) \, \pi_\rho(A^c)
\ge \frac{c_1}{2 c_2} \, \pi_\rho(A).
\end{align*}
This implies φ ≥ c_1/(2c_2), which gives gap(P_ρ) ≥ c_1²/(8c_2²). We obtained a lower bound of gap(P_ρ) which is independent of vol_d(G), but depends on c_1² and c_2².
By the combination of both lower bounds of the spectral gap we have
\[
\mathrm{gap}(P_\rho) \ge \max\Bigl\{ \frac{c_1^2}{8 c_2^2}, \; \frac{1}{c_2 \, \mathrm{vol}_d(G)} \Bigr\},
\]
and obtain by Theorem 18 and ‖dµ/dπ_ρ − 1‖_∞ ≤ c_2/c_1 − 1 the following error bounds.
Corollary 25. Let (X_n)_{n∈N} be a Markov chain with transition kernel K_ρ and initial distribution µ.

(i) For c_1² vol_d(G) ≤ 8 c_2 and n_0 = max{⌈c_2 vol_d(G) log(2(c_2/c_1 − 1))⌉, 0} the error satisfies
\[
\sup_{\|f\|_2 \le 1} e_\mu(S_{n,n_0}, f)^2 \le \frac{2 c_2 \, \mathrm{vol}_d(G)}{n} + \frac{2 c_2^2 \, \mathrm{vol}_d(G)^2}{n^2} .
\]

(ii) For c_1² vol_d(G) > 8 c_2 and n_0 = max{⌈(8 c_2²/c_1²) log(64(c_2/c_1 − 1))⌉, 0} the error satisfies
\[
\sup_{\|f\|_4 \le 1} e_\mu(S_{n,n_0}, f)^2 \le \frac{16 c_2^2}{n \, c_1^2} + \frac{128 c_2^4}{n^2 c_1^4} .
\]
The setting of the example can be extended; similar results hold for general proposals. In [Vol13, Theorem 3.4] a lower and an upper bound of the conductance of the Metropolis-Hastings algorithm are proven, depending on c_1, c_2 and the conductance of the proposal transition kernel. In particular, the lower bound of gap(P_ρ) depends on c_1/c_2.

However, there are some crucial drawbacks of lower bounds of the spectral gap which contain c_1/c_2. Note that then the error bound of the mean square error depends on c_2/c_1 = sup ρ / inf ρ. Say, for example,
\[
\rho(x) = \exp(-\alpha |x|^2), \qquad x \in \mathbb{R}^d, \tag{16}
\]
i.e. ρ is a density of a N(0, (2α)^{−1} I) random variable, where I denotes the identity. If G = B_d, then c_2/c_1 = exp(α), and if G = [−1, 1]^d, then c_2/c_1 = exp(αd). This is not that good, since the ratio c_2/c_1 might depend exponentially on α and d. The goal would be to obtain an error bound which depends only polynomially on α and d.
In the previous example we have seen how one can estimate the conductance. However, the lower bounds were not satisfying. In the following we want to state a much better result. These types of results are difficult to prove; they rely on isoperimetric inequalities and different estimates of geometric objects. We have the following assumption.

Assumption 26. Let G = B_d and let ρ be log-concave, i.e. for all λ ∈ (0, 1) and for all x, y ∈ G we have
\[
\rho(\lambda x + (1 - \lambda) y) \ge \rho(x)^\lambda \, \rho(y)^{1-\lambda}. \tag{17}
\]
Further, let log ρ be Lipschitz continuous with constant α, i.e.
\[
|\log \rho(x) - \log \rho(y)| \le \alpha \, |x - y| .
\]
1
Proposition 27. Let Assumption 26 be satisfied and let δ = min{ √d+1
, α1 }.
Let the proposal transition kernel for the Metropolis algorithm Kρ be given
by the ball walk Bδ , see Example 9(4). Then, the conductance of Kρ satisfies
0.0025
1
1
.
ϕ≥ √
min √
,
d+1
d+1 α
If we consider for example the density of (16), we see that Assumption 26 is
satisfied and the lower bound of the conductance depends only polynomially
on α and d rather than on cc21 = exp(−α).
Note that for the Metropolis algorithm with ball walk proposal it is not clear
whether the transition operator Pρ is positive semi-definite. This would imply
that Theorem 23 gives a lower bound of gap(Pρ ). However, by considering
the lazy version of the Metropolis algorithm with ball walk proposal one
obtains a positive semi-definite operator. For further details see for instance
[Rud12].
35
4.2 Hit-and-run algorithm

Let us describe a transition from x ∈ G of the hit-and-run algorithm. First, a direction θ with |θ| = 1, i.e. on the sphere S^{d−1}, is chosen uniformly distributed. Then, the next state is sampled on {x + sθ : s ∈ R} with the distribution determined by ρ restricted to this line. Thus, for x ∈ G and A ∈ B(G) the transition kernel is
\[
H_\rho(x, A) = \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} H_\theta(x, A) \, d\theta,
\]
where
\[
H_\theta(x, A) = \frac{\int_{-\infty}^{\infty} 1_A(x + s\theta) \, \rho(x + s\theta) \, ds}{\ell_\rho(x, \theta)}
\qquad \text{with} \qquad
\ell_\rho(x, \theta) = \int_{-\infty}^{\infty} 1_G(x + s\theta) \, \rho(x + s\theta) \, ds.
\]
Note that H_θ(x, ·) denotes the distribution of the next state: it is the distribution of ρ restricted to the line determined by θ and the current state x.
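For the uniform distribution on the unit ball the chord through x in direction θ can be computed in closed form, which gives the following minimal hit-and-run sketch (the restriction to G = B_d is my simplification):

import numpy as np

def hit_and_run_ball_step(x, rng=np.random.default_rng()):
    """One hit-and-run transition for the uniform distribution on B_d
    (x is a NumPy array inside the ball)."""
    d = len(x)
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)            # uniform direction on S^{d-1}
    # solve |x + s theta| = 1 for the chord endpoints s_minus < s_plus
    b = x @ theta
    disc = np.sqrt(b * b - (x @ x - 1.0))
    s = rng.uniform(-b - disc, -b + disc)     # uniform on the chord
    return x + s * theta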
Let H̄_ρ be the transition operator which corresponds to H_ρ. Now, we ask two questions:

1. Is the transition kernel H_ρ reversible with respect to π_ρ?

2. Are there estimates of the spectral gap of H̄_ρ or estimates of ‖νH_ρ^n − π_ρ‖_tv in terms of n and the initial distribution ν?

The first question addresses whether the transition kernel is well-defined in the sense that we obtain the desired stationary distribution π_ρ. The second question asks for convergence: we want to quantify how fast we get to the stationary distribution. The following lemma answers the first question.
Lemma 28. For all θ ∈ S^{d−1} we have that H_θ is reversible with respect to π_ρ. Furthermore, H_ρ is also reversible with respect to π_ρ.

Proof. Let c = ∫_G ρ(x) dx and A, B ∈ B(G). By
\begin{align*}
\int_A H_\theta(x, B) \, \pi_\rho(dx)
&= \int_A \int_{-\infty}^{\infty} \frac{1_B(x + s\theta) \, \rho(x + s\theta)}{\ell_\rho(x, \theta)} \, ds \, \frac{\rho(x)}{c} \, dx \\
&= \int_{\mathbb{R}^d} \int_{-\infty}^{\infty} \frac{1_A(x) \, 1_B(x + s\theta) \, \rho(x + s\theta) \, \rho(x)}{\ell_\rho(x, \theta) \, c} \, ds \, dx \\
&\overset{y = x + s\theta}{=} \int_{\mathbb{R}^d} \int_{-\infty}^{\infty} \frac{1_A(y - s\theta) \, 1_B(y) \, \rho(y) \, \rho(y - s\theta)}{\ell_\rho(y - s\theta, \theta) \, c} \, ds \, dy \\
&= \int_B \int_{-\infty}^{\infty} \frac{1_A(y - s\theta) \, \rho(y - s\theta)}{\ell_\rho(y - s\theta, \theta)} \, ds \, \pi_\rho(dy) \\
&\overset{\ell_\rho(y - s\theta, \theta) = \ell_\rho(y, \theta)}{=}
\int_B \int_{-\infty}^{\infty} \frac{1_A(y - s\theta) \, \rho(y - s\theta)}{\ell_\rho(y, \theta)} \, ds \, \pi_\rho(dy) \\
&\overset{t = -s}{=} \int_B \int_{-\infty}^{\infty} \frac{1_A(y + t\theta) \, \rho(y + t\theta)}{\ell_\rho(y, \theta)} \, dt \, \pi_\rho(dy)
= \int_B H_\theta(x, A) \, \pi_\rho(dx),
\end{align*}
reversibility of H_θ is proven. By the reversibility of H_θ we get the reversibility of H_ρ. We have
\begin{align*}
\int_A H_\rho(x, B) \, \pi_\rho(dx)
&= \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \int_A H_\theta(x, B) \, \pi_\rho(dx) \, d\theta \\
&= \frac{1}{\mathrm{vol}_{d-1}(S^{d-1})} \int_{S^{d-1}} \int_B H_\theta(x, A) \, \pi_\rho(dx) \, d\theta
= \int_B H_\rho(x, A) \, \pi_\rho(dx).
\end{align*}
Thus the proof is complete.

Note that the transition kernel of the hit-and-run algorithm induces a positive semi-definite operator on L_2; for details we refer to [RU12].
Now let us come to the second question. Here we consider two different settings. First, let ρ = 1_G and let Assumption 5 be satisfied; basically it is assumed that B_d ⊂ G ⊂ rB_d, where G is a convex body. Then, again by deep arguments and the conductance technique, one can prove the following [LV06b].

Theorem 29. Let Assumption 5 be satisfied. Recall that H̄_ρ is the transition operator which corresponds to the transition kernel H_ρ. Then
\[
\mathrm{gap}(\bar{H}_\rho) \ge \frac{2^{-52}}{(d r)^2} .
\]
By Theorem 18 this implies an error bound for a suitable initial distribution of a Markov chain with transition kernel H_ρ.
Now let us define the second setting.

Assumption 30. For a number r > 1 we have B_d ⊂ G ⊂ rB_d. Furthermore the function ρ : G → (0, ∞) satisfies:

- ρ is log-concave, for a definition see (17);

- for all s > 0 let G(s) = {x ∈ G : ρ(x) ≥ s} be the level set of ρ at level s. Then we assume that
\[
\pi_\rho(G(s)) \ge \frac{1}{8} \implies \exists z \in G \text{ such that } B(z, 1) \subset G(s).
\]

For further details and interpretations of the conditions see [LV06a, Rud]. Then, again by an isoperimetric inequality and a relation of the so-called s-conductance to the total variation distance, one can prove the following theorem [LV06a].

Theorem 31. Let Assumption 30 be satisfied. Let the initial distribution ν ∈ M_∞, i.e. ‖dν/dπ_ρ‖_∞ < ∞. Further let
\[
\beta = \exp\Bigl( -\frac{10^{-9}}{(d r)^{2/3}} \Bigr)
\qquad \text{and} \qquad
C = 12 \, d \, r \, \Bigl\| \frac{d\nu}{d\pi_\rho} \Bigr\|_\infty .
\]
Then
\[
\|\nu H_\rho^n - \pi_\rho\|_{\mathrm{tv}} \le C \, \beta^{\sqrt[3]{n}}, \qquad n \in \mathbb{N}.
\]

We provided two settings where explicit estimates of the spectral gap and the total variation distance are available. The constants in these results are rather large, but note that these are absolute constants without any hidden dependence on some variables.
4.3 Gibbs sampler

The Gibbs sampler, also known as Glauber dynamics or heat bath algorithm, is widely used in scientific computing. In this subsection we distinguish two Gibbs samplers: the random scan Gibbs sampler and the deterministic scan Gibbs sampler.

The random scan Gibbs sampler is conceptually similar to the hit-and-run algorithm. For x ∈ G and A ∈ B(G) the transition kernel is given by
\[
R_\rho(x, A) = \frac{1}{d} \sum_{j=1}^{d} H_{e_j}(x, A),
\]
where {e_1, . . . , e_d} is the Euclidean standard basis in R^d and, recall,
\[
H_{e_j}(x, A) = \frac{\int_{-\infty}^{\infty} 1_A(x + s e_j) \, \rho(x + s e_j) \, ds}{\ell_\rho(x, e_j)} .
\]
We choose a direction e_i within the Euclidean standard basis uniformly at random. This is the only difference to the hit-and-run algorithm, where the direction is chosen uniformly distributed on the sphere. Then, the next state is generated on the line {x + s e_i : s ∈ R} with respect to ρ restricted to this line.

By similar arguments as for the hit-and-run algorithm one can prove that R_ρ is reversible with respect to π_ρ. The proof is a not too difficult exercise.
Lemma 32. The transition kernel Rρ is reversible with respect to πρ .
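A minimal sketch of the random scan Gibbs sampler for the uniform distribution on the unit ball, where the coordinate chord is available in closed form (again, the restriction to G = B_d is my simplification):

import numpy as np

def random_scan_gibbs_ball_step(x, rng=np.random.default_rng()):
    """One random scan Gibbs transition for the uniform distribution on B_d
    (x is a NumPy array inside the ball)."""
    d = len(x)
    i = rng.integers(d)                    # pick a coordinate direction e_i
    r2 = x @ x - x[i] ** 2                 # squared norm of the other coordinates
    half = np.sqrt(1.0 - r2)               # chord: coordinate i in [-half, half]
    y = x.copy()
    y[i] = rng.uniform(-half, half)        # uniform on L(x, e_i)
    return y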
Now let us describe the deterministic scan Gibbs sampler. For x ∈ G and A ∈ B(G) the transition kernel is given by
\[
D_\rho(x, A) = H_{e_1} H_{e_2} \cdots H_{e_d}(x, A).
\]
The deterministic scan Gibbs sampler from x ∈ G updates each coordinate in one step; the coordinate updates are performed in the order of the Euclidean standard basis, so that each coordinate is updated in a row. This is a very tempting idea. However, there are some drawbacks. First, one step of the deterministic scan Gibbs sampler has the computational cost of d steps of the random scan Gibbs sampler. Furthermore, the transition kernel is in general not reversible.
Lemma 33. The transition kernel D_ρ is in general not reversible with respect to π_ρ, but π_ρ is a stationary distribution of D_ρ.

Proof. To prove that D_ρ is in general not reversible with respect to π_ρ we construct a counterexample for d = 2. Let G = [0, 2] × [0, 2] \ [1, 2] × [1, 2] and ρ(x) = 1_G(x). Now we argue that for A_1 = [0, 1] × [1, 2] and A_2 = [1, 2] × [0, 1] it holds that
\[
\int_{A_1} H_{e_1} H_{e_2}(x, A_2) \, \frac{dx}{\mathrm{vol}_2(G)}
\ne \int_{A_2} H_{e_1} H_{e_2}(x, A_1) \, \frac{dx}{\mathrm{vol}_2(G)} .
\]
This is true, since for all x ∈ A_1 we have H_{e_1} H_{e_2}(x, A_2) = 0, but for all x ∈ A_2 we have H_{e_1} H_{e_2}(x, A_1) = 1/4. Thus, D_ρ is in general not reversible with respect to π_ρ.

However, in the following we prove that π_ρ is a stationary distribution. From Lemma 28 we know that π_ρ H_{e_j}(A) = π_ρ(A) for all A ∈ B(G), since H_{e_j} is reversible with respect to π_ρ. Thus,
\[
\pi_\rho D_\rho(A) = \pi_\rho H_{e_1} H_{e_2} \cdots H_{e_d}(A)
= \pi_\rho H_{e_2} \cdots H_{e_d}(A) = \cdots = \pi_\rho H_{e_d}(A) = \pi_\rho(A).
\]
We briefly state some convergence results for the Gibbs sampling algorithms. First, let us consider the case ρ = 1_G, i.e. the goal is to sample the uniform distribution on G. The following results are proven in [RR98].

Theorem 34. Let G ⊂ R^d be a bounded, open, connected set in R^d, whose boundary is a (d − 1)-dimensional C² manifold. Then the random scan Gibbs sampler is uniformly ergodic.

Proposition 35. Let G ⊂ R² be the width 1 triangle with lower angle θ and upper angle φ, that is,
\[
G = \{ (x, y) \in \mathbb{R}^2 : 0 < x < 1, \; x \tan\theta < y < x \tan\varphi \},
\]
where 0 < θ < φ < π/2. Then both Gibbs samplers are geometrically ergodic.

For general densities ρ not much seems to be known. In [RR97] it is proven that:

Theorem 36. If the deterministic scan Gibbs sampler is geometrically ergodic, then the random scan Gibbs sampler is also geometrically ergodic.

Note that, on the one hand, the results of these theorems hold under weak assumptions compared to the convergence results of the hit-and-run algorithm. But on the other hand we do not get the parameters (α, M) or (α, M(x)) of the ergodicity explicitly.
References

[AG10] S. Asmussen and P. Glynn, Harris recurrence and MCMC: A simplified approach, submitted (2010).

[BGJM11] S. Brooks, A. Gelman, G. Jones, and X. Meng, Handbook of Markov Chain Monte Carlo, Chapman & Hall, 2011.

[GRS96] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov chain Monte Carlo in practice, Chapman & Hall, 1996.

[Kal02] O. Kallenberg, Foundations of modern probability, second ed., Probability and its Applications, Springer-Verlag, New York, 2002.

[LS88] G. Lawler and A. Sokal, Bounds on the L² spectrum for Markov chains and Markov processes: a generalization of Cheeger's inequality, Trans. Amer. Math. Soc. 309 (1988), no. 2, 557–580.

[LV06a] L. Lovász and S. Vempala, Fast algorithms for logconcave functions: sampling, rounding, integration and optimization, Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), IEEE Computer Society, Washington, DC, 2006, pp. 57–68.

[LV06b] L. Lovász and S. Vempala, Hit-and-run from a corner, SIAM J. Comput. 35 (2006), no. 4, 985–1005.

[MN07] P. Mathé and E. Novak, Simple Monte Carlo and the Metropolis algorithm, J. Complexity 23 (2007), no. 4-6, 673–696.

[MT96] K. Mengersen and R. Tweedie, Rates of convergence of the Hastings and Metropolis algorithms, Ann. Statist. 24 (1996), no. 1, 101–121.

[MT09] S. Meyn and R. Tweedie, Markov chains and stochastic stability, second ed., Cambridge University Press, 2009.

[Num84] E. Nummelin, General irreducible Markov chains and non-negative operators, Cambridge University Press, 1984.

[Rev84] D. Revuz, Markov chains, second ed., North-Holland Mathematical Library, vol. 11, North-Holland Publishing Co., Amsterdam, 1984.

[RR97] G. Roberts and J. Rosenthal, Geometric ergodicity and hybrid Markov chains, Electron. Comm. Probab. 2 (1997), no. 2, 13–25.

[RR98] G. Roberts and J. Rosenthal, On convergence rates of Gibbs samplers for uniform distributions, Ann. Appl. Probab. 8 (1998), no. 4, 1291–1302.

[RR04] G. Roberts and J. Rosenthal, General state space Markov chains and MCMC algorithms, Probability Surveys 1 (2004), 20–71.

[RT01] G. Roberts and R. Tweedie, Geometric L² and L¹ convergence are equivalent for reversible Markov chains, J. Appl. Probab. 38A (2001), 37–41.

[RU12] D. Rudolf and M. Ullrich, Positivity of hit-and-run and related algorithms, ArXiv e-prints (2012).

[Rud] D. Rudolf, Hit-and-run for numerical integration, to appear in: J. Dick, F. Y. Kuo, G. Peters, I. H. Sloan (Eds.), Monte Carlo and Quasi-Monte Carlo Methods 2012, Springer-Verlag.

[Rud12] D. Rudolf, Explicit error bounds for Markov chain Monte Carlo, Dissertationes Math. 485 (2012), 93 pp.

[Stu10] A. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica 19 (2010), 451–559.

[Vol13] S. Vollmer, Dimension-independent MCMC sampling for elliptic inverse problems with non-Gaussian priors, ArXiv e-prints (2013).