Efficiency of finite state space Monte Carlo Markov chains

Antonietta Mira
2000/1
UNIVERSITÀ DELL'INSUBRIA
FACOLTÀ DI ECONOMIA
http://eco.uninsubria.it
These Working Papers collect the work of the Faculty of Economics of the University of Insubria.
The publication of contributions by other scholars, who have a stable teaching or scientific relationship with the Faculty, can be proposed by a member of the Faculty, provided that the paper has been presented in public. The name of the proposer is reported in a footnote to the article.
The views expressed in the Working Papers reflect the opinions of the authors only, and not necessarily those of the Faculty of Economics of the University of Insubria.
© Copyright Antonietta Mira
Printed in Italy in October 2000
Università degli Studi dell'Insubria
Via Ravasi 2, 21100 Varese, Italy
All rights reserved. No part of this paper may be reproduced in any form
without permission of the Author.
Efficiency of finite state space Monte Carlo Markov chains
Antonietta Mira†
10th October 2000
Abstract
The class of finite state space Markov chains stationary with respect to a common pre-specified distribution is considered. An easy to check partial ordering is defined on this class. The ordering provides a sufficient condition for the dominating Markov chain to be more efficient. Efficiency is measured by the asymptotic variance of the estimator of the integral of a specific function with respect to the stationary distribution of the chains. A class of transformations that, when applied to a transition matrix, preserves its stationary distribution and improves its efficiency is defined and studied.
Keywords: Markov chain Monte Carlo, Efficiency ordering, Stationarity preserving and efficiency increasing transfers.
1 Introduction
Consider a distribution of interest $\pi(x)$, $x \in \mathcal{X}$, possibly known only up to a normalizing constant. Assume that $\mathcal{X}$ is a finite state space with $N$ elements. In order to gather information about $\pi$ we construct a Harris recurrent Markov chain with transition matrix $P$ (we will identify Markov chains with the corresponding transition matrices) that has $\pi$ as its unique stationary distribution: $\pi = \pi P$. In particular, following the Markov chain Monte Carlo (MCMC) literature, we estimate $E_\pi[f(X)] = \mu$ with the sample average along a realized path of the chain of length $n$:
$$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i).$$
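A minimal simulation sketch of this estimator (the function name and the use of numpy are illustrative choices, not part of the paper; the chain is advanced by drawing each new state from the current row of $P$):

import numpy as np

def ergodic_average(P, f, n, x0=0, rng=None):
    """Simulate n steps of the chain with transition matrix P, started at x0,
    and return the sample average of f along the realized path."""
    rng = np.random.default_rng() if rng is None else rng
    N = P.shape[0]
    x, total = x0, 0.0
    for _ in range(n):
        total += f[x]
        x = rng.choice(N, p=P[x])   # next state drawn from row x of P
    return total / n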
This work has been supported by EU TMR network ERB-FMRX-CT96-0095 on "Computational and Statistical methods for the analysis of spatial data" and by the F.A.R. 2000 of the University of Insubria.
† Università degli Studi dell'Insubria, Facoltà di Economia, Via Ravasi 2, 21100 Varese, Italy. Email: [email protected]
Let $L_2(\pi)$ indicate the space of functions that have finite variance under $\pi$ and assume that all the functions of interest belong to this space. Let $\mathbf{1}$ denote the $1 \times N$ vector of ones and, likewise, $\mathbf{0}$ denote the $1 \times N$ vector of zeros. Define the inner product on $L_2(\pi)$ to be
$$(f, g) = f \Pi g' = \sum_{x \in \mathcal{X}} f(x) g(x) \pi(x),$$
where $\Pi$ is an $N \times N$ diagonal matrix with $\pi_i$ as the $i$-th element on the diagonal and $g'$ is the transpose of $g$. A measure of the efficiency of $\hat{\mu}_n$ is $v(f, P)$, the limit, as $n$ tends to infinity, of $n \sigma_n^2 = n \, \mathrm{Var}_\pi[\hat{\mu}_n]$. It can be shown (Peskun, 1973) that
$$v(f, P) = f \Pi [2 \lambda_P^{-1} - I - A] f' = (f, [2 \lambda_P^{-1} - I - A] f), \qquad (1)$$
where $I$ is the $N \times N$ identity matrix, $A = \mathbf{1}' \pi$ and $\lambda_P = I - (P - A)$ is called the Laplacian of $P$. Notice that, in the formula for the asymptotic variance, the only quantity that involves the transition matrix (and thus the only quantity on which we can intervene to improve the performance of our estimates) is the inverse Laplacian.
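As a numerical companion to formula (1), here is a short Python sketch (again with illustrative names, not part of the paper) that computes $v(f, P)$ from a transition matrix, its stationary distribution and the function of interest:

import numpy as np

def asymptotic_variance(P, pi, f):
    """Asymptotic variance v(f, P) of the ergodic average, from formula (1).

    P  : (N, N) transition matrix with stationary distribution pi
    pi : (N,) stationary probabilities
    f  : (N,) values of the function of interest on the N states
    """
    N = len(pi)
    Pi = np.diag(pi)                  # diagonal matrix with pi_i on the diagonal
    A = np.outer(np.ones(N), pi)      # A = 1' pi: every row equals pi
    lam_inv = np.linalg.inv(np.eye(N) - (P - A))   # inverse Laplacian of P
    return f @ Pi @ (2 * lam_inv - np.eye(N) - A) @ f

For a chain started in stationarity, $n \, \mathrm{Var}_\pi[\hat{\mu}_n]$ should approach this value as $n$ grows, which gives a direct way to check the formula against the simulation sketch above.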
Given a distribution of interest $\pi$ there is often more than one transition matrix that has $\pi$ as its unique stationary distribution. The primary criterion used to choose among them is, as everywhere else in statistics, the asymptotic variance of the (asymptotically unbiased) estimators. We thus give the following definitions:
Definition 1. If $P$ and $Q$ are Markov chains with stationary distribution $\pi$, then $P$ is at least as efficient as $Q$ for a particular function $f$, $P \succeq_f Q$, if $v(f, P) \le v(f, Q)$.

Definition 2. If $P$ and $Q$ are Markov chains with stationary distribution $\pi$, then $P$ is at least as uniformly efficient as $Q$, $P \succeq_E Q$, if
$$v(f, P) \le v(f, Q) \qquad \forall f \in L_2(\pi). \qquad (2)$$
The Peskun ordering (Peskun, 1973; Tierney, 1995) is a sufficient condition for (2) to hold, and the covariance ordering (Mira and Geyer, 1998) is necessary and sufficient. Uniform efficiency has already been studied quite extensively: Peskun (1973), Tierney (1995), Mira and Geyer (1998), Frigessi et al. (1992). In this paper we focus on relative efficiency and try to answer the following question: if we have a specific function $f$ in mind and we are only interested in its expected value with respect to $\pi$, which Markov chain should we use for our simulation study? Intuitively, a transition matrix chosen with a specific function to estimate in mind will perform better than a generic one chosen to minimize the asymptotic variance over all possible functions with finite asymptotic variance under $\pi$.
2 Sufficient conditions for efficiency
Assume that the function of interest $f$ is monotone. This is not a restrictive hypothesis since we can always rearrange the state space $\mathcal{X}$, and everything else accordingly, to make it monotone.
Define the summation matrix $T$ to be an $N \times N$ upper triangular matrix with zeros below the main diagonal and ones elsewhere. Define the south-west sub-matrix of a matrix $M$ to be the sub-matrix of $M$ obtained by deleting the first row and the last column of $M$.
Definition 3. $P$ dominates $Q$ in the south-west ordering (with respect to some reordering of the state space), $P \succeq_{SW} Q$, if all the elements of the south-west sub-matrix of $T \Pi (P - Q) T$ are non-negative.
Theorem 1. Let $P$ and $Q$ be two Markov chains having $\pi$ as their unique stationary distribution. If $\lambda_P^{-1} \succeq_{SW} \lambda_Q^{-1}$ then $P \succeq_f Q$.

Proof. It is sufficient to show that, for the given function, $f \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) f' \le 0$. Consider the identity $f \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) f' = f T^{-1} [T \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) T] T^{-1} f'$ and note that: the last column of $T \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) T$ equals $T \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) \mathbf{1}' = \mathbf{0}'$; the first row of $T \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) T$ equals $\mathbf{1} \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) T = \pi (\lambda_P^{-1} - \lambda_Q^{-1}) T = \mathbf{0}$; $f$ monotone is equivalent to the first $(N-1)$ elements of $T^{-1} f'$ and the last $(N-1)$ elements of $f T^{-1}$ having opposite signs. Hence the quadratic form involves only the south-west sub-matrix of $T \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) T$, each element of which is multiplied by a non-positive product. It follows that $\lambda_P^{-1} \succeq_{SW} \lambda_Q^{-1}$ implies $P \succeq_f Q$.
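As an illustration of how the condition of Theorem 1 can be checked numerically, here is a sketch in the same hypothetical Python setting as above (helper names are not from the paper; the south-west sub-matrix is taken from $T \Pi (\lambda_P^{-1} - \lambda_Q^{-1}) T$, as in Definition 3):

import numpy as np

def laplacian_inverse(P, pi):
    """Inverse Laplacian (I - (P - A))^{-1}, with A = 1' pi."""
    N = len(pi)
    A = np.outer(np.ones(N), pi)
    return np.linalg.inv(np.eye(N) - (P - A))

def southwest_dominates(MP, MQ, pi, tol=1e-12):
    """True if MP dominates MQ in the south-west ordering, i.e. if every
    element of T Pi (MP - MQ) T, after deleting the first row and the
    last column, is non-negative."""
    N = len(pi)
    T = np.triu(np.ones((N, N)))     # summation matrix: ones on and above the diagonal
    D = T @ np.diag(pi) @ (MP - MQ) @ T
    return bool(np.all(D[1:, :-1] >= -tol))

# Theorem 1: if southwest_dominates(laplacian_inverse(P, pi),
#                                   laplacian_inverse(Q, pi), pi)
# holds and f is monotone, then v(f, P) <= v(f, Q).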
With respect to the Peskun ordering, the south-west ordering requires checking fewer conditions, namely $(N-1) \times (N-1)$ instead of $N(N-1)$. This is an important issue when the state space is large. Both orderings are partial, that is, they do not allow a ranking of all transition matrices having a specified stationary distribution. A possible limitation of Theorem 1 is that it requires working with inverse Laplacians, and computing matrix inverses can be computationally intensive (especially on large state spaces). A way to work with the transition matrices directly is to realize that $I + (P - A)$ provides a first order approximation to the inverse Laplacian since (Kemeny and Snell, 1969)
$$\lambda_P^{-1} = I + \sum_{i=1}^{\infty} (P - A)^i$$
and $\lim_{i \to \infty} (P - A)^i$ equals, by construction, the $N \times N$ zero matrix. Thus a first order approximation to $v(f, P)$ is given by
$$v(f, P) \approx v(f, A) + 2 f \Pi (P - A) f',$$
where $v(f, A)$ is the theoretical independence-sampling variance of $\hat{\mu}_n$ and $f \Pi (P - A) f'$ can be interpreted as a first-order covariance if $\pi$ is the distribution of the initial state of the Markov chain.
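In the same sketch setting, the first order approximation can be computed directly; here $v(f, A)$ is obtained as the variance of $f$ under $\pi$, which is what formula (1) gives when $P = A$ (names illustrative):

import numpy as np

def first_order_variance(P, pi, f):
    """First order approximation v(f, A) + 2 f Pi (P - A) f' to v(f, P)."""
    N = len(pi)
    Pi = np.diag(pi)
    A = np.outer(np.ones(N), pi)
    mu = f @ pi                      # E_pi[f]
    v_A = f @ Pi @ f - mu ** 2       # v(f, A): independence-sampling variance of f
    return v_A + 2 * f @ Pi @ (P - A) @ f

Comparing this value with the exact one from formula (1) on a given $P$ gives a feel for how good the first order approximation is.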
This justifies the attempt to find $P$ such that $f \Pi (P - A) f'$ is large in absolute value and negative. This is nothing more than an attempt to induce negative correlation into the Markov chain Monte Carlo sampler. We thus give the following definition:

Definition 4. If $P$ and $Q$ are Markov chains with stationary distribution $\pi$, then $P$ is at least as first order efficient as $Q$ for a particular function $f$, $P \succeq_{1f} Q$, if $f \Pi (P - Q) f' \le 0$.

A reasoning similar to the one in the proof of Theorem 1 shows that, if $P \succeq_{SW} Q$, then $P \succeq_{1f} Q$.
3 Stationarity preserving transfers
Denote by S.P.T. a stationarity preserving (and efficiency increasing) transfer performed on a transition matrix $P$ (with stationary distribution $\pi$) of the following kind: given integers $1 \le i < j \le N$ and $1 \le k < l \le N$ and a quantity $0 < h \le 1$, increase $p_{i,l}$ and $p_{j,k}$ by $h$ and $h \pi_i / \pi_j$ respectively, and decrease $p_{i,k}$ and $p_{j,l}$ by $h$ and $h \pi_i / \pi_j$ respectively. Thus a S.P.T. is completely defined by giving four indexes and the amount by which to increase/decrease the proper entries of the matrix on which the transfer is performed. We will thus identify a S.P.T. by $P(i, j, k, l, h)$.
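A minimal sketch of a single transfer under the same conventions as the earlier snippets (0-based indices, hypothetical function name); the check at the end enforces that the move keeps $P$ a proper transition matrix:

import numpy as np

def spt(P, pi, i, j, k, l, h):
    """Stationarity preserving transfer P(i, j, k, l, h) (0-based indices):
    increase p[i, l] and p[j, k] by h and h * pi[i] / pi[j] respectively,
    decrease p[i, k] and p[j, l] by the same two amounts."""
    assert i < j and k < l and h > 0
    Q = P.astype(float).copy()
    Q[i, l] += h
    Q[i, k] -= h
    Q[j, k] += h * pi[i] / pi[j]
    Q[j, l] -= h * pi[i] / pi[j]
    if np.any(Q < 0) or np.any(Q > 1):
        raise ValueError("a transfer of size h is not admissible for this P")
    return Q

# Stationarity is preserved: pi @ spt(P, pi, i, j, k, l, h) equals pi @ P = pi.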
If $P$ is derived from $Q$ via a finite sequence of stationarity-preserving and efficiency increasing transfers then $P \succeq_{SW} Q$. We can thus increase the efficiency of a transition matrix (provided the first order approximation to the asymptotic variance is good), while preserving its stationary distribution, via a sequence of S.P.T. Indeed, we can keep transferring probability mass around until there no longer exist $i < j$ such that $p_{i,k} > 0$ and $p_{j,l} > 0$ for some $k < l$. Of course, every time we move some probability mass around, we need to check that the resulting matrix remains a proper (all entries between 0 and 1) irreducible transition matrix. Notice that knowing $\pi$ only up to a normalizing multiplicative constant does not constitute a limit to this theory, since only the ratio of specific $\pi$ values is needed to define a S.P.T.
It is interesting to study the limiting transition matrix obtained by applying a sequence of S.P.T. The resulting matrix has at most one non-zero element along the main diagonal and it presents a path of positive entries connecting the north-east to the south-west corner of the matrix. Which specific pattern is optimal depends on $\pi$. For example, consider the case $N = 2$ and assume that $\pi$ is properly normalized. Consider first the case where $\pi_1 = 1/2$. In this setting a transition matrix $P$ has the proper stationary distribution if and only if it is doubly stochastic. For example, a Metropolis-Hastings algorithm with proposal matrix $K$ given by $k_{12} = p$ and $k_{21} = q$, with $0 \le p, q \le 1$, has transition matrix $P$ with $p_{12} = \min\{q, p\}$ and $p_{21} = \min\{q, p\}$.
How do we optimally choose the proposal distribution, that is, the values of $p$ and $q$? The first-order-optimal transition matrix has $p_{11} = 0$, $p_{21} = 1$ and the path of positive entries follows the north-east to south-west diagonal. This is nothing but a Metropolis-Hastings algorithm where the proposal distribution always proposes to jump to the other state (notice that this proposal is always accepted). The structure of the first-order-optimal transition matrix is the same for any uniform stationary distribution, that is, we always have a pattern of ones on the north-east to south-west diagonal. The resulting first-order-optimal transition matrix is periodic since it has an eigenvalue equal to $-1$. This is a problem for convergence in total variation distance to stationarity, but it is not an issue if we are interested in the asymptotic variance of the ergodic average estimates. If $\pi_1 < 1/2$ then, to obtain the first-order-optimal transition matrix, we have to set $p_{11} = 0$ and $p_{21} = \pi_1 / (1 - \pi_1)$. Finally, if $\pi_1 > 1/2$ then we have to set $p_{22} = 0$ and $p_{12} = (1 - \pi_1) / \pi_1$ for first-order optimality.
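The two-state recipe can be written out explicitly. The following sketch (function name hypothetical) builds the first-order-optimal matrix described above and verifies stationarity:

import numpy as np

def optimal_two_state(pi1):
    """First-order-optimal two-state chain for pi = (pi1, 1 - pi1),
    following the recipe in the text (pi1 = 1/2 gives the swap matrix)."""
    if pi1 <= 0.5:
        p21 = pi1 / (1.0 - pi1)
        return np.array([[0.0, 1.0],
                         [p21, 1.0 - p21]])
    p12 = (1.0 - pi1) / pi1
    return np.array([[1.0 - p12, p12],
                     [1.0, 0.0]])

pi1 = 0.3
P = optimal_two_state(pi1)
print(np.array([pi1, 1.0 - pi1]) @ P)   # recovers (0.3, 0.7): pi is stationary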
4 Conclusions
Two partial orderings are defined on the space of irreducible transition matrices having a common stationary distribution $\pi$. The south-west ordering involves the inverse Laplacians of the transition matrices and provides a sufficient condition for the dominating Markov chain to produce MCMC estimates of $E_\pi f$ with a smaller asymptotic variance. The second ordering is defined on the transition matrices themselves and provides a sufficient condition for the dominating Markov chain to be first order more efficient.
Following the economic literature on inequality comparisons, and in particular the principle of Pigou-Dalton transfers (Dardanoni, 1993), we define stationarity preserving and efficiency increasing transfers of probability mass within a transition matrix. The optimal transition matrix that results after a sequence of such transfers resembles the matrix that we would observe when studying the joint distribution of two negatively correlated characters. The literature on these topics might help to extend the ideas explored in the present paper from finite state spaces to general state spaces.
Acknowledgments
I would like to thank Gareth Roberts and Peter Green for illuminating discussions and
Valentino Dardanoni for bringing to my attention the paper that inspired the present
work (Dardanoni, 1993).
References
Dardanoni, V. (1993), Measuring Social Mobility, Journal of Economic Theory, 61, 372–394.
Frigessi, A., Hwang, C. and Younes, L. (1992), Optimal spectral structure of reversible stochastic matrices, Monte Carlo methods and the simulation of Markov random fields, The Annals of Applied Probability, 2, 610–628.
Kemeny, J. G. and Snell, J. L. (1969), Finite Markov Chains, Princeton: Van Nostrand.
Mira, A. and Geyer, C. (1998), Ordering Monte Carlo Markov chains, Tech. Rep. 632, U. of Minnesota.
Peskun, P. H. (1973), Optimum Monte Carlo sampling using Markov chains, Biometrika, 60, 607–612.
Tierney, L. (1995), A note on Metropolis-Hastings kernels for general state spaces, Tech. Rep. 606, U. of Minnesota.