Statistics & Probability Letters 17 (1993) 231-236
North-Holland
18 June 1993

A slowly mixing Markov chain with implications for Gibbs sampling

Peter Matthews
University of Maryland Baltimore County, Baltimore, MD, USA

Received December 1991
Revised October 1992
Abstract: We give a Markov chain that converges to its stationary distribution very slowly. It has the form of a Gibbs sampler running on a posterior distribution of a parameter θ given data X. Consequences for Gibbs sampling are discussed.

Keywords: Gibbs sampling; posterior distribution; mixing rate; coupling.
1. Introduction
Gibbs sampling is a Monte Carlo technique that has seen explosive growth recently in Bayesian statistics. See, for example, Gelfand and Smith (1990). Gibbs sampling has deservedly been presented optimistically as a means of attacking heretofore intractable posterior calculations. Though it is understood that Gibbs sampling can fail, this possibility has not received the careful attention it deserves. One difficulty is the number of steps necessary to give near convergence to the stationary (posterior) distribution. Gelfand, Hills, Racine-Poon and Smith (1990) suggest monitoring the sample paths of the Gibbs sampler until convergence is apparently achieved. They make it clear that this methodology is non-rigorous. The purpose of this note is to show there are situations where this methodology can fail. We give an example in which, with high probability, a Gibbs sampler will appear to converge when in fact true convergence takes much longer. Gelman and Rubin (1992) give other examples of non-convergent Gibbs samplers appearing to converge. The example presented here is mathematically simple to facilitate theoretical computations. However, it should be realistic enough to suggest that care is needed when using a Gibbs sampler.

The remainder of this note is as follows. Section 2 states the model. Section 3 gives a lower bound on its rate of convergence to uniformity. Section 4 gives an example for particular, somewhat realistic, values of the model parameters. Section 5 discusses the issues raised in this note.
2. The general model and Gibbs sampling
We consider sampling from the posterior distribution of a d-dimensional parameter θ given some data X. Let C denote the open d-dimensional hypercube (0, 1)^d. We take the prior distribution Π(θ) to be uniform on C. For some known parameters δ ∈ (0, 1) and σ > 0 we take

    f(X | θ) = (1 − δ)(2πσ²)^{−d/2} exp( −Σ_{i=1}^d (X_i − θ_i)² / (2σ²) ) + δ,   X ∈ C.   (2.1)

Thus the distribution of X given θ is a 1 − δ, δ mixture of N(θ, σ²I) and the uniform distribution on C. The posterior Π(θ | X) = f(X | θ)Π(θ)/m(X) is proportional to f(X | θ) on C. The marginal m(X) need not play any role in Gibbs sampling; we consider only sampling from a density proportional to f(X | θ).

Correspondence to: Peter Matthews, Department of Mathematics and Statistics, University of Maryland Baltimore County, Baltimore, MD 21228, USA. Research supported by NSF and NSA under NSF grant DMS 9001295.

0167-7152/93/$06.00 © 1993 Elsevier Science Publishers B.V. All rights reserved
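As a concrete aid, model (2.1) can be evaluated numerically. This is a sketch of mine, not code from the paper; the function name and the default parameter values (which match the Section 4 choices σ = δ = 0.01, d = 9) are my own.

```python
import numpy as np

def f_given_theta(x, theta, sigma=0.01, delta=0.01):
    """Unnormalized posterior density (2.1): a (1 - delta, delta) mixture of
    a N(theta, sigma^2 I) density and the uniform density on the unit cube C,
    which is identically 1 on C."""
    x, theta = np.asarray(x, float), np.asarray(theta, float)
    d = x.size
    normal_density = (2 * np.pi * sigma**2) ** (-d / 2) * np.exp(
        -np.sum((x - theta) ** 2) / (2 * sigma**2))
    return (1 - delta) * normal_density + delta

# Away from theta = X the normal spike is negligible and the density is
# essentially the constant delta; at theta = X it is enormous.
x = np.full(9, 0.5)
far_value = f_given_theta(x, np.full(9, 0.9))   # essentially delta = 0.01
peak_value = f_given_theta(x, x)                # of order 10^14
```

The huge ratio between the two values is the "thumbtack" geometry discussed in Section 5: a flat head of height δ with a very narrow spike at X.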
We consider Gibbs sampling implemented as follows. An initial value θ^0 = (θ^0_1, ..., θ^0_d) ∈ C is chosen. We will choose θ^0 with distribution Π(θ), the uniform distribution on C. A Markov chain on C is run for t steps. Each step of the chain consists of updating each of the d components of θ in turn. Suppose the current state is (θ^i_1, ..., θ^i_{k−1}, θ^{i−1}_k, ..., θ^{i−1}_d). That is, we are making the transition from θ^{i−1} to θ^i, components 1, ..., k − 1 have been updated, and component k is to be updated next. The Gibbs sampler chooses a value θ^i_k from the distribution f(θ_k | θ_1 = θ^i_1, ..., θ_{k−1} = θ^i_{k−1}, θ_{k+1} = θ^{i−1}_{k+1}, ..., θ_d = θ^{i−1}_d, X). These t steps of d updates may then be repeated independently m times from independently chosen initial positions.
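The update just described can be sketched in code. This is my illustration, not the paper's: it uses the two-component mixture form of the full conditional that Section 3 derives (uniform on (0, 1) with probability δ/(δ + p_k), a truncated normal otherwise), and the rejection step for the truncated normal assumes X_k is not extremely close to the boundary of (0, 1).

```python
import math
import numpy as np

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gibbs_step(theta, x, sigma=0.01, delta=0.01, rng=None):
    """One step of the sampler: update the d components of theta in turn.

    Each full conditional is a mixture: U(0, 1) with probability
    delta / (delta + p_k), else N(x_k, sigma^2) truncated to (0, 1),
    where p_k is the normal component's weight given the current values
    of the other coordinates (the quantity called p in Section 3).
    """
    rng = rng or np.random.default_rng()
    d = len(x)
    theta = np.array(theta, float)
    root = sigma * math.sqrt(2 * math.pi)
    for k in range(d):
        ss = sum((x[j] - theta[j]) ** 2 for j in range(d) if j != k)
        p_k = ((1 - delta) * root ** -(d - 1)
               * math.exp(-ss / (2 * sigma**2))
               * (Phi((1 - x[k]) / sigma) - Phi(-x[k] / sigma)))
        if rng.random() < delta / (delta + p_k):
            theta[k] = rng.random()            # uniform component
        else:
            while True:                        # truncated-normal component
                draw = rng.normal(x[k], sigma)
                if 0.0 < draw < 1.0:
                    theta[k] = draw
                    break
    return theta
```

Iterating `gibbs_step` from a uniform θ^0 reproduces the chain above. Started far from X, p_k underflows to 0 and every update is uniform; started at X, p_k is astronomically large and the chain stays in the normal spike.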
In our example the distribution ℒ(θ^t) of θ^t converges at an exponential rate to the posterior Π(θ | X) in total variation. One can easily show that the one-step transition function of the Markov chain is bounded from below (though the bound depends on X). Thus the Doeblin condition (Doob, 1953, p. 256) is met and geometric ergodicity follows. That is,

    ‖ℒ(θ^t) − Π(θ | X)‖ = sup_A |P(θ^t ∈ A) − Π(A | X)| ≤ M r^t

for some M > 0 and r < 1. M and r may depend on X. Further, with a deterministic initial position one can show uniform (in the initial position) geometric ergodicity. This is an asymptotic result. The exponent r may be arbitrarily close to 1, so for any t of interest this variation distance may be arbitrarily close to 1. Doss and Sethuraman (1991) and Tierney (1991) give more general conditions under which the distribution ℒ(θ^t) of θ^t converges at an exponential rate to the posterior Π(θ | X) in total variation.
3. A lower bound on the mixing rate
In this section we use coupling to give an upper bound on the variation distance between ℒ(θ^t) and the prior Π — not the posterior Π(θ | X). Actually we show more; we show that the entire sample path θ^0, ..., θ^t has a law that is close in total variation to the distribution of t + 1 independent random vectors, each uniformly distributed on C. We do this by showing that, with high probability, the process θ^0, ..., θ^t can be coupled with the output of a Gibbs sampler driven by Π(θ) rather than Π(θ | X). Once this is shown, if Π(θ) and Π(θ | X) are distant in total variation, then ℒ(θ^t) and Π(θ | X) must be as well.

Theorem 1. Let U^{t+1}(C) denote the distribution of t + 1 independent random vectors each uniformly distributed on C. For X ∈ C,

    ‖ℒ(θ^0, ..., θ^t) − U^{t+1}(C)‖ ≤ 2td ( (ab + √(a²b² + 2(d − 1)b)) / 2 )^{(d−1)/2},   (3.1)

where

    a = log((1 − δ)/δ) − (d − 1) log(σ√(2π))   and   b = 2πσ² / Γ(½(d + 1))^{2/(d−1)}.
Proof. Consider a second Markov chain ψ^t. Let ψ^0 be uniformly distributed on C. If ψ^{i+1} is obtained from ψ^i via d updates in Gibbs sampling driven by the prior Π(θ), then it is straightforward to show that ℒ(ψ^0, ..., ψ^t) is U^{t+1}(C). To prove the theorem we need only construct (θ^0, ..., θ^t) and (ψ^0, ..., ψ^t) on the same probability space such that

    P((θ^0, ..., θ^t) ≠ (ψ^0, ..., ψ^t)) ≤ 2td ( (ab + √(a²b² + 2(d − 1)b)) / 2 )^{(d−1)/2}.
We suppose (ψ^0, ..., ψ^t) are defined and construct (θ^0, ..., θ^t) from (ψ^0, ..., ψ^t) and independent random variables as needed. Set θ^0 = ψ^0, since both initial distributions are Π. Inductively suppose we have (θ^i_1, ..., θ^i_{k−1}, θ^{i−1}_k, ..., θ^{i−1}_d) and we need to construct θ^i_k. If (θ^i_1, ..., θ^i_{k−1}, θ^{i−1}_k, ..., θ^{i−1}_d) ≠ (ψ^i_1, ..., ψ^i_{k−1}, ψ^{i−1}_k, ..., ψ^{i−1}_d), then generate θ^i_k using random variables independent of ψ^0, ..., ψ^t. In this case the processes have already uncoupled. However, if (θ^i_1, ..., θ^i_{k−1}, θ^{i−1}_k, ..., θ^{i−1}_d) = (ψ^i_1, ..., ψ^i_{k−1}, ψ^{i−1}_k, ..., ψ^{i−1}_d), then we proceed as follows.

For X ∈ C, f(θ_k | θ^i_1, ..., θ^i_{k−1}, θ^{i−1}_{k+1}, ..., θ^{i−1}_d, X) is proportional to

    δ + (1 − δ)(σ√(2π))^{−d} e^{−Σ_{j≠k}((X_j − θ_j)/σ)²/2} e^{−((θ_k − X_k)/σ)²/2},

where superscripts on θ have been dropped. Let p denote

    p = (1 − δ)(σ√(2π))^{−d} e^{−Σ_{j≠k}((X_j − θ_j)/σ)²/2} ∫_0^1 e^{−((θ − X_k)/σ)²/2} dθ.

Integration gives p < (1 − δ)(σ√(2π))^{−(d−1)} exp(−(2σ²)^{−1} Σ_{j≠k}(X_j − θ_j)²). Then θ^i_k can be generated as follows. With probability δ/(δ + p), let θ^i_k = ψ^i_k, which is U(0, 1). In this case the processes remain coupled. With probability p/(p + δ), choose θ^i_k from the N(X_k, σ²) density conditional on θ^i_k being in (0, 1).
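The coupling just described can be sketched as a simulation. This is my illustration of the construction, not the paper's code: the prior-driven chain ψ supplies i.i.d. U(0, 1) draws, θ copies them as long as the uniform component keeps being selected, and the first selection of the normal component uncouples the chains. The helper `normal_weight` is the quantity p above, with the integral written via the normal CDF; all names are mine.

```python
import math
import numpy as np

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_weight(theta, x, k, sigma=0.01, delta=0.01):
    """The quantity p for coordinate k: the weight of the truncated-normal
    component of the full conditional, given the other coordinates."""
    d = len(x)
    ss = sum((x[j] - theta[j]) ** 2 for j in range(d) if j != k)
    return ((1 - delta) * (sigma * math.sqrt(2 * math.pi)) ** -(d - 1)
            * math.exp(-ss / (2 * sigma**2))
            * (Phi((1 - x[k]) / sigma) - Phi(-x[k] / sigma)))

def coupled_run(x, t, sigma=0.01, delta=0.01, rng=None):
    """Run theta coupled to the prior-driven chain psi for t steps.

    While coupled, theta equals psi, whose components are i.i.d. U(0, 1);
    each coordinate update uncouples with probability p / (p + delta).
    Returns (uncoupled, path-up-to-uncoupling).
    """
    rng = rng or np.random.default_rng()
    d = len(x)
    theta = rng.random(d)          # theta^0 = psi^0, uniform on C
    path = [theta.copy()]
    for _ in range(t):
        for k in range(d):
            p = normal_weight(theta, x, k, sigma, delta)
            if rng.random() < p / (p + delta):
                return True, path  # uncoupled: theta leaves the psi path
            theta[k] = rng.random()
        path.append(theta.copy())
    return False, path
```

With the Section 4 values d = 9 and σ = δ = 0.01, p underflows to zero at typical points of C, so a run of moderate length almost always stays coupled and its path is exactly i.i.d. uniform on C.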
The probability that (θ^0, ..., θ^t) ≠ (ψ^0, ..., ψ^t) is bounded by, in obvious notation,

    Σ_{i=1}^t Σ_{k=1}^d E[ p_{ik}/(p_{ik} + δ); (θ^i_1, ..., θ^i_{k−1}, θ^{i−1}_k, ..., θ^{i−1}_d) = (ψ^i_1, ..., ψ^i_{k−1}, ψ^{i−1}_k, ..., ψ^{i−1}_d) ].

We can write p_{ik} as a function of ψ^{i−1} and ψ^i on the indicated set, drop the indicator functions, and use the fact that the random variables ψ^i_k, i = 0, ..., t, k = 1, ..., d, are i.i.d. U(0, 1) to obtain

    P((θ^0, ..., θ^t) ≠ (ψ^0, ..., ψ^t)) ≤ td E p_{11}/(p_{11} + δ).

Since P(0 < p_{11}/(p_{11} + δ) < 1) = 1, if we show P(p_{11}/(p_{11} + δ) > γ) < γ, then E(p_{11}/(p_{11} + δ)) ≤ 2γ. Thus consider

    P(p_{11}/δ > γ) ≤ P( Σ_{j=2}^d (X_j − θ_j)² < 2σ² log[ ((1 − δ)/(γδ)) (σ√(2π))^{−(d−1)} ] ).

Since θ_2, ..., θ_d are i.i.d. U(0, 1), this is the probability that they, as a point in ℝ^{d−1}, lie within a ball of radius (2σ² log[((1 − δ)/(γδ))(σ√(2π))^{−(d−1)}])^{1/2} of (X_2, ..., X_d). The volume of a ball of radius r in ℝ^{d−1} is [π^{(d−1)/2}/Γ(½(d + 1))] r^{d−1}, so this probability is at most

    [π^{(d−1)/2}/Γ(½(d + 1))] ( 2σ² log[ ((1 − δ)/(γδ)) (σ√(2π))^{−(d−1)} ] )^{(d−1)/2}.   (3.2)
We must find a small γ such that (3.2) < γ. In terms of a and b defined in the theorem, we require

    (ab − b log γ)^{(d−1)/2} < γ.

Let γ = r^{(d−1)/2}. Then we must satisfy ab − ½(d − 1)b log r < r. Note that log r > −1/r for r ∈ (0, 1), so it suffices to have r² − abr − ½(d − 1)b > 0. Consider the positive root of the corresponding equality. This root is

    r = ( ab + √(a²b² + 2(d − 1)b) ) / 2.

If this root is bigger than 1, then (3.1) is trivial. If it is between 0 and 1, then it gives a satisfactory γ. The right side of (3.1) is then at most 2tdγ. □
4. An example
In (2.1) take d = 9, σ = 0.01 and δ = 0.01. For simplicity assume 0.03 < X_i < 0.97 for i = 1, ..., 9. This has prior probability approximately (0.94)^9 ≈ 0.57. We shall see that the Gibbs sampler does not converge in any reasonable time frame, though it appears to.

A realistic situation similar to this is the following. I have the uniform distribution on C as my prior for a set of nine proportions. Someone else does an experiment and reports X along with the claim that given θ, X_1, ..., X_9 are independent with 2500X_i ~ Binomial(2500, θ_i). I do not completely trust this person's experimental design. With personal probability 0.99, I think their design is correct, but with probability 0.01 I think their experiment is so flawed as to say nothing about θ. Given the frequency of faulty studies nominally supported by statistics, this is not unreasonable. Here a model like (2.1) occurs, though a scaled binomial with variance depending on θ replaces the normal. The normal model was studied for mathematical simplicity; one could expect similar results with the binomial.
In this situation a = log(0.99/0.01) − 8 log(0.01√(2π)) ≈ 34.08 and b = 2π(0.01)²/Γ(5)^{1/4} ≈ 0.000284. Thus

    ‖ℒ(θ^0, ..., θ^t) − U^{t+1}(C)‖ ≤ 0.0000412t ≡ At.

On the other hand the variation distance between the prior and posterior is nearly 1. To see this we compute the probability of ∩_{i=1}^9 {|X_i − θ_i| < 0.03}. For the values of X we consider, this is a cube centered at X lying entirely within C. Thus the prior probability of this cube is (0.06)^9. For all X, by integrating (2.1) we see that the marginal density m(X) is at most 1. With this, the posterior probability of ∩_{i=1}^9 {|X_i − θ_i| < 0.03} can be bounded from below by integrating (2.1); the result is at least 0.99(0.997)^9. Thus a lower bound on the total variation distance between the prior and posterior is 0.99(0.997)^9 − (0.06)^9 ≈ 0.96. Theorem 1 now implies that ‖ℒ(θ^t) − Π(θ | X)‖ ≥ 0.96 − At, so at least tens of thousands of steps are required for convergence.
This process is deceptive in that empirically it appears to converge rapidly. Suppose we perform 200 independent runs of the same process, each of length 50. Then with probability at least (1 − 50A)^{200} ≈ 0.66, in all of the runs we can think of θ and ψ as remaining coupled. In other words, on a set of probability about 0.66, the output of the Gibbs sampler looks just like 200 sequences of length 50 of i.i.d. uniform (0, 1)^9 random vectors. It would be easy to infer rapid convergence naively from this data. Looking at one long output sequence would have an identical problem.
By increasing the dimension or decreasing σ, the probability that the two processes remain coupled can be made arbitrarily close to 1 for an arbitrarily long Gibbs sampling sequence. The value of δ is almost immaterial since it only enters logarithmically in the coefficient a. A value of 0.01 seems reasonable, though choosing a value several orders of magnitude smaller would not significantly affect the results.
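The constants in this section can be checked directly. A sketch of mine (not the paper's) evaluating a, b, the coefficient A in the bound (3.1), and the two probabilities quoted above; the variable names are my own.

```python
import math

d, sigma, delta = 9, 0.01, 0.01   # Section 4 parameter values

# a and b as defined in Theorem 1.
a = (math.log((1 - delta) / delta)
     - (d - 1) * math.log(sigma * math.sqrt(2 * math.pi)))   # ~ 34.08
b = 2 * math.pi * sigma**2 / math.gamma((d + 1) / 2) ** (2 / (d - 1))  # ~ 0.000284

# Positive root r from the proof of Theorem 1, and the coefficient A in
# ||L(theta^0, ..., theta^t) - U^{t+1}(C)|| <= 2td r^{(d-1)/2} = A t.
r = (a * b + math.sqrt((a * b) ** 2 + 2 * (d - 1) * b)) / 2
A = 2 * d * r ** ((d - 1) / 2)    # ~ 4.1e-5, matching the 0.0000412 above

# Lower bound on the prior/posterior total variation distance, and the
# probability that all 200 runs of length 50 stay coupled.
tv_lower = 0.99 * 0.997**9 - 0.06**9   # ~ 0.96
all_coupled = (1 - 50 * A) ** 200      # ~ 0.66
```

The required number of steps for convergence, roughly 0.96/A, is indeed in the tens of thousands.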
5. Discussion
It is interesting to consider why convergence to stationarity is slow in this example. The posterior is unimodal. For multimodal posteriors, convergence can certainly be slow due to the process being trapped for a long time near a local mode. A nice example of this in combinatorial problems is Jerrum (1992). On the other hand, the posterior is not log-concave. Computer scientists have developed algorithms for integration of log-concave functions that give provably accurate results in polynomial time. There is a precise technical sense to the polynomiality of the time taken, and the current algorithms are probably not yet fast enough to be useful. See, for example, Dyer and Frieze (1991) and Applegate and Kannan (1991). In the example considered here the posterior is nearly singular with respect to the prior. This alone is not enough to cause slow convergence. If δ were set to 0 in (2.1), then this near singularity would become stronger, yet convergence would be rapid. The parameter space is high dimensional, though, as with the near prior/posterior singularity, this alone is not enough to cause slow convergence.

What seems to me to be the best explanation comes from a geometric view of the posterior density. Consider the region of (d + 1)-dimensional space parameterized by θ_1, ..., θ_d, y, bounded by the hyperplane y = 0 and the manifold y = Π(θ_1, ..., θ_d | X). Geometrically this region looks much like a high-dimensional thumbtack; there is a head with a narrow spike extending from it. In three dimensions (with a two-dimensional prior) the analogy is clear. Due to the presence of δ in (2.1) and the thinness of the normal tails, the head of the tack is nearly flat. Since each iteration involved in a step of the Gibbs sampler only looks in a prespecified direction, until the current position is such that the spike is in the visible direction, the sampler will wander aimlessly around the head of the tack. Once it sees the spike, with high probability the Gibbs sampler will go there and remain there a long time. Of course this analogy is not quite correct in that the posterior is unimodal, so there will be some tendency for the sampler to drift towards the spike. In this example, since the normal distribution tails are so thin, this drift is negligible. It may be useful to think of this example as being like sampling from a bimodal distribution; in either case there are two regions of space the sampler alternates between, spending a long time in each before moving to the other. Here the two regions are the head and the spike of the thumbtack. Requiring log-concavity of the density would rule out this extreme behavior.
Since the posterior is explicit in this example, it is easy to suggest corrections in this case. The real problem occurs if you really only have information on the conditionals f(θ_i | X, {θ_j, j ≠ i}). In this situation naive use of the Gibbs sampler is destined to have difficulties. Theorem 1 says that the set of sample paths from the Gibbs sampler in this problem and the set of sample paths from a process that simply picks points at random from the unit cube have a substantial overlap. Picking points at random is very rapidly mixing; it takes only one step to become completely random. Suppose an algorithm had the task of deciding whether to stop a Gibbs sampler, based only on the sample path observed so far. The implication of Theorem 1 here is that any algorithm that stops early with high probability when points are chosen uniformly at random must also stop early with high probability when observing the Gibbs sampler in this example. Thus no algorithm can do all that is desired; it cannot stop early for rapidly mixing chains and wait to stop for slowly mixing chains. This problem cannot be avoided by looking at several shorter paths of the Gibbs sampler with different initial positions rather than one long path.

The only solution seems to be to make use of some information about the posterior. This, unfortunately, leads to Gibbs samplers that are not as easy for non-experts to implement. For example, if we can find its modes by applying a hill-climbing algorithm to the posterior, then we can easily dispense with problems like this one. Cui, Tanner, Sinha and Hall (1992) give a diagnostic for monitoring convergence and apply it to the example of this paper.
References
Applegate, D. and R. Kannan (1991), Sampling and integration of near log-concave functions, preprint.
Cui, L., M. Tanner, D. Sinha and W. Hall (1992), Monitoring convergence of the Gibbs sampler: further experience with the Gibbs stopper, preprint, Dept. of Biostatist., Univ. of Rochester (Rochester, NY).
Doob, J. (1953), Stochastic Processes (Wiley, New York).
Doss, H. and J. Sethuraman (1991), A study of the convergence properties of successive substitution sampling based on Harris recurrence of Markov chains, preprint.
Dyer, M. and A. Frieze (1991), Computing the volume of convex bodies: a case where randomness helps, Res. Rept. 91-104, Dept. of Math., Carnegie-Mellon Univ. (Pittsburgh, PA).
Gelfand, A., S. Hills, A. Racine-Poon and A. Smith (1990), Illustrations of Bayesian inference in normal data models using Gibbs sampling, J. Amer. Statist. Assoc. 85, 972-985.
Gelfand, A. and A. Smith (1990), Sampling-based approaches to calculating marginal densities, J. Amer. Statist. Assoc. 85, 398-409.
Gelman, A. and D. Rubin (1992), Honest inferences from iterative simulation, Tech. Rept., Dept. of Statist., Univ. of California (Berkeley, CA).
Jerrum, M. (1992), Large cliques elude the Metropolis process, Random Struct. Algor. 3, 347-359.
Tierney, L. (1991), Markov chains for exploring posterior distributions, Tech. Rept. 560, School of Statist., Univ. of Minnesota (Minneapolis, MN).