Estimation of the Asymptotic Variance in the CLT for Markov Chains

STOCHASTIC MODELS
Vol. 19, No. 4, pp. 449–465, 2003

Didier Chauveau* and Jean Diebolt*
Laboratoire d'Analyse et de Mathématiques Appliquées,
Universités de Marne-la-Vallée, Champs-sur-Marne, France
ABSTRACT
This paper is devoted to estimating the asymptotic variance in the central limit theorem
(CLT) for Markov chains. We assume that the functional CLT for Markov chains
applies for properly normalized partial-sum processes of functions of the chain, and
study a continuous-time empirical variance process based on i.i.d. parallel chains. The
centered empirical variance process, properly normalized, converges in distribution to a
centered Gaussian process with known covariance function. This allows us to estimate
the limiting variance and to control the fluctuations of the variance estimator after n
steps. An application to Monte Carlo Markov chain (MCMC) convergence control is
suggested.
Key Words: Markov chains; Functional CLT; Asymptotic variance; Stochastic
process; Itô process.
1. INTRODUCTION
The asymptotic properties of ergodic homogeneous Markov chains, such as the strong law of large numbers (SLLN) or the central limit theorem (CLT), are the key limit behaviors which justify the use of Monte Carlo Markov chain (MCMC) algorithms in statistical inference or in the simulation of physical particle systems. Since the problem we address here was initially motivated by MCMC methods, we begin with a brief presentation of this technique.

*Correspondence: Didier Chauveau and Jean Diebolt, Laboratoire d'Analyse et de Mathématiques Appliquées, CNRS UMR 8050, Universités de Marne-la-Vallée, Champs-sur-Marne, France; E-mail: [email protected] and [email protected].

DOI: 10.1081/STM-120025399
Copyright © 2003 by Marcel Dekker, Inc.
1532-6349 (Print); 1532-4214 (Online)
www.dekker.com
An MCMC algorithm generates an ergodic Markov chain $X = (X_n)_{n \ge 0}$ with a given stationary distribution $\pi$, which usually comes as the posterior distribution in some Bayesian setup. Investigating characteristics of posterior distributions requires simulation from $\pi$, or computation of integrals like $\int g \, d\pi$, where $g$ is some real-valued function defined on the state space of $X$. In many situations, direct simulation from $\pi$ is not tractable, and $\int g \, d\pi$ is not available in closed form, so that an MCMC method is appropriate. Good introductions to the topic can be found in Gelfand and Smith's[1] seminal paper, or in Gilks et al.[2]
The usual estimate provided by an MCMC algorithm comes from the SLLN applied to the properly normalized sum $S_n(g) = \sum_{i=1}^n g(X_i)$, that is,

$$\lim_{n \to \infty} \frac{1}{n} S_n(g) = \int g \, d\pi =: \pi(g) \quad \text{a.s.}$$
Since precision is desired for the estimate $S_n(g)/n$, it is natural to apply the CLT, which states that

$$\frac{1}{\sqrt{n}} S_n(\bar g) \xrightarrow{d} \mathcal{N}(0, \sigma^2(g)) \tag{1}$$

where $\bar g(\cdot) = g(\cdot) - \pi(g)$ is the centered function in the stationary regime (i.e., when $X_0 \sim \pi$), and

$$\sigma^2(g) = \lim_{n \to \infty} \frac{1}{n} \operatorname{var}(S_n(\bar g)) = E_\pi[\bar g(X_0)^2] + 2 \sum_{n=1}^{\infty} E_\pi[\bar g(X_0)\bar g(X_n)] \tag{2}$$

is the asymptotic variance, which should satisfy $0 < \sigma^2(g) < \infty$. Various conditions for these asymptotic results to hold, in both the stationary and nonstationary cases, are given in Meyn and Tweedie,[3] Chap. 17, and will be made precise in Sec. 2.
The variance $\sigma^2(g)$ is difficult to calculate theoretically (see the discussion hereafter), but estimating it properly is crucial to build confidence intervals for $\pi(g)$. This estimation problem relates somehow to the problem of MCMC convergence assessment (see, e.g., Brooks and Roberts[4]). Actually, it is not possible to obtain reliable estimates of $\sigma^2(g)$ before the chain has approximately reached its stationary regime and the normality implied by the CLT can be empirically checked (see Chauveau and Diebolt[5]). Conversely, the CLT cannot be used as long as a reliable estimate of $\sigma^2(g)$ has not been found. In this perspective, Chauveau and Diebolt[5] proposed to check the stabilization of the variance after $n$ steps,

$$\sigma_n^2(g) = \operatorname{var}\left( \frac{S_n(g)}{\sqrt{n}} \right) \tag{3}$$

by graphically monitoring the empirical variance of a sample of values of $S_n(g)/\sqrt{n}$ computed from parallel chains, i.e., i.i.d. copies of the Markov chain started from a dispersed initial distribution.
Our goal in this paper is two-fold: first, to propose an applicable, generic and consistent method for estimating $\sigma^2(g)$ from parallel chains; second, to control the fluctuations of $\sigma_n^2(g)$ around its limiting value. We provide theoretically grounded answers to both problems. Controlling the asymptotic normality provides justification for the normal approximations used. Our approach also provides a way to "recycle" the simulated observations to improve the estimator over the natural one based on parallel chains (variance reduction). Finally, we insist on the generic aspect of our method: approaches requiring problem-specific knowledge of the transition kernel of the Markov chain, or specific computer code for their implementation, are far less usable than methods based solely on the output of the sampler, which can use available generic code.
It turns out that finding good estimates for $\sigma^2(g)$ is itself difficult. A summary of the investigations made in this direction can be found in Geyer.[6] More recently, other methods based on renewal theory have been proposed in Robert[7] (see also Guihenneuc-Jouyaux and Robert,[8] Sec. 4.6). However, to our knowledge, the proposed procedures are all based on a single chain (i.e., the output of one trajectory of the Markov chain) and suffer from several drawbacks or lack of applicability, as discussed below.
In Geyer,[6] three methods to estimate $\sigma^2(g)$ in the stationary case are discussed. The first one, the window estimator, uses the usual empirical estimates $\hat\gamma_{n,t}$ of the lag-$t$ autocovariances $\gamma_t = \operatorname{cov}(g(X_i), g(X_{i+t}))$, based on a single run of length $n$. The variance $\sigma^2(g)$ is then estimated by

$$\hat\sigma^2(g) = \sum_{s=-\infty}^{+\infty} w_n(s) \hat\gamma_{n,s}$$
where $w_n(s)$ is a weight function (the lag window) such that $w_n(s) \to 0$ when $|s| \to \infty$, which has to be chosen and tuned in order to obtain, under strong enough regularity conditions, consistency of the window estimator. This technique comes from the time series literature and was developed for estimating spectral densities. For instance, Priestley[9] summarizes several possible choices for the shape and parameters of the lag window. However, the desirable theoretical properties (such as weak consistency) are obtained under some assumptions about the underlying process $X$, the least restrictive being that $X$ must belong to the class of (so-called) "general linear processes." Indeed, it is not clear how reasonable these kinds of assumptions are in the MCMC setup. References where window estimators have been used in the MCMC context are suggested by Geyer,[6] but do not provide answers to this question (see, e.g., Green and Han[10]).
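For concreteness, a window estimator of this kind can be sketched as follows; the Bartlett (triangular) lag window, the bandwidth, and the AR(1) test chain are our own illustrative choices, not prescribed by the text:

```python
import numpy as np

def window_estimator(x, bandwidth):
    """Lag-window estimate of sigma^2(g) from a single run x_1..x_n,
    using a Bartlett (triangular) lag window of the given bandwidth."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # empirical lag-s autocovariances gamma_hat_{n,s}, s = 0..bandwidth
    gammas = np.array([xc[: n - s] @ xc[s:] / n for s in range(bandwidth + 1)])
    # Bartlett weights w_n(s) = 1 - |s|/bandwidth, vanishing beyond the bandwidth
    weights = 1.0 - np.arange(bandwidth + 1) / bandwidth
    # symmetric sum over s = -bandwidth..bandwidth
    return gammas[0] + 2.0 * weights[1:] @ gammas[1:]

# Test chain: AR(1), X_{i+1} = rho X_i + eps_i, with g the identity, for which
# the asymptotic variance is sigma_eps^2 / (1 - rho)^2 = 4 when rho = 0.5.
rng = np.random.default_rng(0)
rho, n = 0.5, 100_000
x = np.empty(n)
x[0] = rng.standard_normal()
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.standard_normal()
print(window_estimator(x, bandwidth=100))  # roughly 4
```

The bandwidth choice is exactly the tuning problem discussed above: too small truncates real autocovariances, too large inflates the variance of the estimate.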
The second method uses batch means, i.e., averages over blocks based on a single run. These batch means are shown to converge to i.i.d. normal random variables, and this property is used to build confidence intervals for $\sigma^2(g)$. A drawback is that the batch size needs to be chosen large enough for these asymptotic properties (independence and normality) to hold. This choice is related to the mixing rate of the chain, which is unknown. Hence achieving a good compromise between the number of batches and the batch size may be difficult without additional information on the chain.
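A minimal batch-means sketch in the same spirit (the batch count and the AR(1) test chain are again our illustrative choices):

```python
import numpy as np

def batch_means_estimator(x, n_batches):
    """Batch-means estimate of sigma^2(g) from a single run: split the run
    into n_batches consecutive blocks and rescale the empirical variance
    of the block means by the batch size."""
    x = np.asarray(x, dtype=float)
    batch_size = len(x) // n_batches
    x = x[: n_batches * batch_size]
    means = x.reshape(n_batches, batch_size).mean(axis=1)
    # var(batch mean) ~ sigma^2(g) / batch_size for large batches
    return batch_size * means.var(ddof=1)

# Same AR(1) test chain as above: true value sigma_eps^2/(1 - rho)^2 = 4.
rng = np.random.default_rng(1)
rho, n = 0.5, 100_000
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = rho * x[i - 1] + rng.standard_normal()
print(batch_means_estimator(x, n_batches=100))  # roughly 4
```

With only 100 batch means, the estimate itself is quite noisy, which illustrates the batch-count/batch-size compromise mentioned above.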
The third method uses the fact that, under general conditions (stationarity, irreducibility and reversibility of the Markov chain), the sums of adjacent pairs of autocovariances $\gamma_{2t} + \gamma_{2t+1}$ are strictly positive, decreasing and convex. This property is used to build empirical rules for the number of terms over which the estimator obtained by summing the $\hat\gamma_{n,t}$'s is computed. The major drawback of this approach is that it only provides consistent overestimates in general situations (see Geyer[6]).
A more recent approach given in Robert[7] uses renewal properties of the Markov chain to build i.i.d. blocks, for which the usual CLT holds. The condition for renewal theory to apply is that there exists a so-called renewal, or small, set $A$, a real $0 < \epsilon < 1$ and a probability measure $\nu$ such that

$$P(X_{n+1} \in B \mid X_n = x_n) \ge \epsilon \, \nu(B), \quad \forall x_n \in A, \ \forall B$$

In that case, the kernel can be modified in such a way as to obtain renewal events with probability $\epsilon$ when $X_n \in A$. The variance $\sigma_A^2(g)$ in the usual (i.i.d.) CLT can be estimated by the usual empirical variance, and an estimate of $\sigma^2(g)$ can be deduced. A first obvious drawback of this approach is that one needs to exhibit the triple $(A, \epsilon, \nu)$ defining the small set, which requires a deep analytical knowledge of the kernel driving the chain, and hence is problem-specific. Moreover, this small set $A$ must be chosen so as to obtain short return times to it, with the associated $\epsilon$ not too small, in order to obtain a sufficient number of renewal events in a reasonable total simulation time (note in addition that these two requirements are somewhat contradictory). Finally, as the author points out, this leads to a highly conservative method.
We now describe the flavor of our approach. Suppose that we have an estimator $\hat\sigma_n^2(g)$ of $\sigma_n^2(g)$ and we want to control the range of its fluctuations around $\sigma^2(g)$. This can be done by determining, for some $n_1 < n_2$, an estimate of

$$P\left( \sup_{n_1 \le n \le n_2} \left| \hat\sigma_n^2(g) - \sigma^2(g) \right| \le u \right), \quad u > 0$$

when the underlying Markov chain has approximately reached its stationary regime, in order to build confidence bands. Of course, since $\sigma^2(g)$ is unknown, in practice it has to be replaced by the average of the values of $\hat\sigma_n^2(g)$ over $n_1 \le n \le n_2$. It is then natural to consider a partial-sum process $\hat\sigma_{[nt]}^2(g)$, $t \in [0, 1]$, issued from this estimator (where $[t]$ denotes the largest integer $k \le t$) and to show that it satisfies a functional CLT, e.g., as in Billingsley.[11] This means that the sequence of continuous random functions obtained by interpolating the values of $\hat\sigma_{[nt]}^2(g)$ converges in distribution to a Brownian motion.
Here, we study the limiting behavior of a process issued from the natural estimator of the variance after $n$ steps, which is constructed as follows: (i) we use the fact that the continuous-time process issued from the interpolation of the partial-sum process $S_{[nt]}(\bar g)/\sqrt{n}$ satisfies a functional CLT (see, e.g., Meyn and Tweedie[3] and our Sec. 2.1); (ii) we consider $m$ i.i.d. copies of the original chain, giving $m$ i.i.d. copies of the partial-sum process that converge to $m$ i.i.d. copies of the Brownian motion; (iii) we build a continuous-time empirical variance process from the empirical variance of these $m$ partial-sum processes. Therefore, we can expect to gain control over the estimation of $\sigma^2(g)$ from the study of the asymptotic behavior of this process as both $n$ and $m$ increase to $\infty$.
In Sec. 2, we first recall the conditions under which a Markov chain satisfies a functional CLT. We then present our estimator computed from the independent parallel Markov chains, as well as the continuous-time process that will be considered throughout the remaining part of the paper. Section 3 states the weak convergence of this empirical variance process as the "time" $n \to \infty$. We prove that the empirical variance process essentially converges in distribution to a normalized sum of $m$ independent Itô processes, plus a term converging to 0 as $m \to \infty$. Section 4 is devoted to the study of the weak convergence of this sum of $m$ i.i.d. Itô processes as $m \to \infty$. A slightly more general result is also given under some general assumptions. As a consequence, it is proved that the centered empirical variance process, properly normalized, converges in distribution to a centered Gaussian process with known covariance function. Finally, Sec. 5 suggests how to control the fluctuations of $\sigma_n^2(g)$ using the theoretical results obtained in the previous sections, and proposes a strategy for estimating $\sigma^2(g)$ using the properties of the Gaussian limiting process. This leads to estimates of $\sigma^2(g)$ with a variance smaller than that of the natural estimate based on i.i.d. chains.
2. THE EMPIRICAL VARIANCE PROCESS

2.1. Functional Central Limit Theorem for Markov Chains
This section starts with a summary of classical results related to the functional CLT for Markov chains. Here, we consider a single ergodic Markov chain $X = (X_n, n \ge 0)$ over a general state space $(E, \mathcal{B}(E))$ with transition kernel $P$, initial position $X_0 \sim \mu$ (where $\mu$ is an arbitrary probability distribution) and invariant distribution $\pi$. Notice that in the literature the results below are often stated for the stationary situation (i.e., when $X_0 \sim \pi$). We will state them in the general case ($X_0 \sim \mu$ for arbitrary $\mu$) since this is the actual situation in standard MCMC applications. For a function $g: E \to \mathbb{R}$, we define

$$S_n(g) = \sum_{i=1}^n g(X_i), \quad \text{and} \quad S_n(\bar g) = \sum_{i=1}^n (g(X_i) - \pi(g))$$

For $t \in [0, 1]$, the partial-sum process associated to $S_n(\bar g)$ is

$$S_{[nt]}(\bar g) = \sum_{i=1}^{[nt]} (g(X_i) - \pi(g)) \tag{4}$$

and we suppose in the sequel that $X$ satisfies conditions ensuring that the functional CLT holds for the normalized linear interpolation of (4):

$$Y_{n,g}(t) = \frac{1}{\sqrt{n}} \left[ S_{[nt]}(\bar g) + (nt - [nt]) \left( S_{[nt]+1}(\bar g) - S_{[nt]}(\bar g) \right) \right] \tag{5}$$
More precisely, if $\sigma^2(g)$ is the asymptotic variance in the CLT defined by (2), we have the following result:

Proposition 1 (Functional CLT). Assume that $X$ is positive Harris and a solution $\hat g$ of the Poisson equation $\hat g - P\hat g = g - \pi(g)$ exists with $\pi(\hat g^2) < \infty$. If $\sigma^2(g) > 0$, then

$$Y_{n,g} \xrightarrow{d} \sigma(g) W \quad \text{as } n \to \infty$$

where $W$ denotes a standard Brownian motion over $[0, 1]$.

Proposition 1 is from Meyn and Tweedie,[3] Theorem 17.4.4. There, it is proved in the stationary case ($X_0 \sim \pi$). However (as explained in the proof of Theorem 17.4.4), the result can be extended to the nonstationary case by using their Proposition 17.1.6, which claims that, if the functional CLT holds for some initial distribution $\mu$, then it holds for any initial distribution. Note also that although the result is written for $t \in [0, 1]$ for simplicity, it holds as well for $t \in [0, T]$, whatever $T > 0$.
2.2. Estimation of the Variance from i.i.d. Chains

We now turn to the description of our estimator. We propose to use $m$ independent copies of the Markov chain to estimate the variance $\sigma_n^2(g)$. For $1 \le \ell \le m$, we denote the $\ell$th chain by $X^{(\ell)} = (X_n^{(\ell)}, n \ge 0)$ and the corresponding $\ell$th sum and centered sum, respectively, by

$$S_n^{(\ell)}(g) = \sum_{i=1}^n g(X_i^{(\ell)}) \quad \text{and} \quad S_n^{(\ell)}(\bar g) = \sum_{i=1}^n (g(X_i^{(\ell)}) - \pi(g))$$

The simulation of $m$ i.i.d. chains up to time $n$ provides an i.i.d. sample of $m$ (noncentered) random variables

$$\frac{S_n^{(1)}(g)}{\sqrt{n}}, \ldots, \frac{S_n^{(m)}(g)}{\sqrt{n}}$$

so that the natural estimator of $\sigma_n^2(g)$ is the usual (biased) empirical variance of this sample:

$$\hat\sigma_{n,m}^2(g) = \frac{1}{m} \sum_{\ell=1}^m \left[ \frac{1}{\sqrt{n}} S_n^{(\ell)}(g) - \frac{1}{\sqrt{n}} \bar S_n(g) \right]^2 \tag{6}$$

where

$$\bar S_n(g) = \frac{1}{m} \sum_{\ell=1}^m S_n^{(\ell)}(g) \tag{7}$$
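The estimator (6)-(7) is straightforward to compute from the output of the sampler alone. Below is a sketch on $m$ parallel AR(1) chains, for which $\sigma^2(g)$ is known in closed form; the AR(1) target and all tuning constants are our illustrative choices:

```python
import numpy as np

def sigma2_hat(chains):
    """Natural parallel-chain estimator (6)-(7): the (biased) empirical
    variance of S_n^(l)(g)/sqrt(n) over the m independent chains.
    `chains` is an (m, n) array holding the values g(X_i^(l))."""
    m, n = chains.shape
    s = chains.sum(axis=1) / np.sqrt(n)   # S_n^(l)(g)/sqrt(n), l = 1..m
    return s.var()                        # ddof=0: the biased variance (6)

# m parallel AR(1) chains X_{i+1} = rho X_i + eps with g the identity;
# sigma^2(g) = sigma_eps^2 / (1 - rho)^2 = 4 for rho = 0.5.
rng = np.random.default_rng(2)
m, n, rho = 500, 2000, 0.5
x = rng.normal(0.0, 3.0, size=m)          # dispersed initial distribution
rows = np.empty((m, n))
for i in range(n):
    x = rho * x + rng.standard_normal(m)
    rows[:, i] = x
print(sigma2_hat(rows))   # roughly 4, up to O(1/sqrt(m)) fluctuations
```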
To use Proposition 1, we need to introduce the sums $S_n^{(\ell)}(\bar g)$ that satisfy the functional CLT. This leads us to rewrite (6) as

$$\hat\sigma_{n,m}^2(g) = \frac{1}{m} \sum_{\ell=1}^m \left[ \frac{1}{\sqrt{n}} S_n^{(\ell)}(\bar g) + \frac{1}{\sqrt{n}} \sum_{i=1}^n \pi(g) - \frac{1}{\sqrt{n}} \bar S_n(g) \right]^2 = \frac{1}{m} \sum_{\ell=1}^m \left[ \frac{1}{\sqrt{n}} S_n^{(\ell)}(\bar g) - \frac{1}{\sqrt{n}} \bar S_n(\bar g) \right]^2 \tag{8}$$

where

$$\bar S_n(\bar g) = \frac{1}{m} \sum_{\ell=1}^m S_n^{(\ell)}(\bar g) \tag{9}$$

Note that the expression in brackets in (8) can also be interpreted as

$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \left( g(X_i^{(\ell)}) - \hat\pi_m^i(g) \right)$$

where

$$\hat\pi_m^i(g) = \frac{1}{m} \sum_{\ell=1}^m g(X_i^{(\ell)})$$

is the estimate of $E[g(X_i)] = \pi^i(g)$, the expectation under the marginal distribution of the chain at time $i$, based on the observations from the $m$ i.i.d. chains at time $i$, which is consistent in $m$ from the standard SLLN. The estimate $\hat\sigma_{n,m}^2(g)$ appears in this way as the empirical second moment of the normalized and approximately centered sums associated to the $m$ chains.
2.3. Partial Sums and Empirical Variance Process

We will now introduce the partial-sum processes to construct our continuous-time estimator process. For $\ell = 1, \ldots, m$, we define

$$S_{[nt]}^{(\ell)}(\bar g) = \sum_{i=1}^{[nt]} (g(X_i^{(\ell)}) - \pi(g)), \quad t \in [0, T]$$

and their normalized linear interpolations,

$$Y_n^{(\ell)}(t) = \frac{1}{\sqrt{n}} \left[ S_{[nt]}^{(\ell)}(\bar g) + (nt - [nt]) \left( S_{[nt]+1}^{(\ell)}(\bar g) - S_{[nt]}^{(\ell)}(\bar g) \right) \right]$$
where, for simplicity, we drop $g$ in the notation since $g$ is considered fixed throughout the paper. We accordingly define the partial-sum version of $\hat\sigma_{n,m}^2(g)$ by

$$\hat\sigma_{[nt],m}^2(g) = \frac{1}{m} \sum_{\ell=1}^m \left[ \frac{1}{\sqrt{n}} S_{[nt]}^{(\ell)}(\bar g) - \frac{1}{m} \sum_{\ell=1}^m \frac{1}{\sqrt{n}} S_{[nt]}^{(\ell)}(\bar g) \right]^2 \tag{10}$$

Using the linear interpolations $Y_n^{(\ell)}$'s, we consequently define the empirical variance process by

$$V_{n,m}(t) = \frac{1}{m} \sum_{\ell=1}^m \left[ Y_n^{(\ell)}(t) - \frac{1}{m} \sum_{\ell=1}^m Y_n^{(\ell)}(t) \right]^2 = \frac{1}{m} \sum_{\ell=1}^m (Y_n^{(\ell)}(t))^2 - \left( \frac{1}{m} \sum_{\ell=1}^m Y_n^{(\ell)}(t) \right)^2 \tag{11}$$

Note that $V_{n,m}(t)$ coincides with (10) at the points $t = i/n$, $i = 1, \ldots, [nT]$. It is a nonlinear interpolation of $\hat\sigma_{n,m}^2(g)$.

In the next two sections, we study the asymptotic behavior of $V_{n,m}(t)$ by considering first its limit in distribution as $n \to \infty$ for a fixed number $m$ of parallel chains, and then its behavior for large $m$.
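At the grid points $t = i/n$, the process (11) reduces to (10), so its trajectory can be computed directly from the running sums of the parallel chains. A sketch, reusing an illustrative AR(1) setting of our own:

```python
import numpy as np

def variance_process(chains):
    """Empirical variance process V_{n,m} at the grid points t = i/n:
    there, (11) reduces to the empirical variance (10) of the normalized
    partial sums S_[nt]^(l)(g)/sqrt(n) over the m chains. Returns an
    array of length n (values of V at t = 1/n, ..., n/n)."""
    m, n = chains.shape
    partial = chains.cumsum(axis=1) / np.sqrt(n)   # S_i^(l)(g)/sqrt(n)
    return partial.var(axis=0)                     # variance over l = 1..m

# For a chain satisfying the CLT, V_{n,m}(t) should grow roughly
# linearly in t with slope sigma^2(g) = 4 in this AR(1) example.
rng = np.random.default_rng(3)
m, n, rho = 500, 2000, 0.5
x = np.zeros(m)
rows = np.empty((m, n))
for i in range(n):
    x = rho * x + rng.standard_normal(m)
    rows[:, i] = x
v = variance_process(rows)
print(v[n // 2 - 1], v[-1])   # roughly 2 and 4: V(t) ~ t * sigma^2(g)
```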
3. ASYMPTOTIC BEHAVIOR IN TIME

Proposition 2. Assume that $X$ and $g$ satisfy the conditions of Proposition 1. Then

$$Y_n^{(\ell)} \xrightarrow{d} \sigma(g) W^{(\ell)} \quad \text{as } n \to \infty, \quad \ell = 1, \ldots, m$$

where $W^{(1)}, \ldots, W^{(m)}$ are $m$ independent standard Brownian motions, and

$$V_{n,m} \xrightarrow{d} Z_m \quad \text{as } n \to \infty$$

for each $m \ge 1$, where the process $Z_m$ (which also depends on $g$) is

$$Z_m(t) = \frac{1}{m} \sum_{\ell=1}^m \left( \sigma(g) W_t^{(\ell)} \right)^2 - \frac{\sigma^2(g)}{m} B_t^2, \quad t \in [0, T] \tag{12}$$

where $B_t$ is a standard Brownian motion not independent from the $W_t^{(\ell)}$'s.

Proof. The first result is just Proposition 1 applied to the $m$ i.i.d. chains. The weak convergence of $V_{n,m}$ directly follows from the fact that $V_{n,m}$ is a continuous function of the independent processes $Y_n^{(1)}, \ldots, Y_n^{(m)}$. This function is $f(y_1, \ldots, y_m) = \sum_{i=1}^m y_i^2/m - \bar y^2$, as given by (11). The limiting process is then the image of the individual limits by this function, which yields (12), where $B_t = m^{-1/2} \sum_{\ell=1}^m W_t^{(\ell)}$ is itself a standard Brownian motion. □
4. ASYMPTOTIC BEHAVIOR IN THE NUMBER OF CHAINS

In this section, we describe the behavior of the process $Z_m$ when $m$ gets large. Since $E[(\sigma(g)W_t)^2] = \sigma^2(g)t$, let us rewrite (12) as

$$Z_m(t) = \frac{1}{\sqrt{m}} \xi_m(t) + \sigma^2(g)t - \frac{\sigma^2(g)}{m} B_t^2, \quad t \in [0, T] \tag{13}$$

where

$$\xi_m(t) = \frac{1}{\sqrt{m}} \sum_{\ell=1}^m \left[ c(W_t^{(\ell)}) - E[c(W_t)] \right], \quad \text{and} \quad c(x) = \sigma^2(g)x^2$$

We keep the above notation with the function $c$, instead of replacing it directly by $\sigma^2(g)x^2$, to indicate a slightly more general result at the end of Sec. 4.

Theorem 1 (Convergence in distribution of $\xi_m$).

$$\xi_m \xrightarrow{d} G \quad \text{as } m \to \infty$$

where $G$ is a centered Gaussian process which can be represented as

$$G(t) = \int_0^t 2\sigma^2(g)\sqrt{s} \, dW_s \tag{14}$$

and has the same distribution as the Gaussian process $W_a$ with covariance function $(s, t) \mapsto a(s \wedge t)$, where $a(t) = 2\sigma^4(g)t^2$.
Proof. Each individual process $c(W_t)$ can be written, via Itô's formula, as

$$c(W_t) = c(0) + \int_0^t c'(W_s) \, dW_s + \frac{1}{2} \int_0^t c''(W_s) \, ds$$

where $c(0) = 0$ here. The stochastic integral $\int_0^t c'(W_s) \, dW_s$ is a square integrable centered martingale when

$$E\left[ \int_0^T (c'(W_s))^2 \, ds \right] < \infty \tag{15}$$

and this condition holds here since

$$E\left[ \int_0^T (c'(W_s))^2 \, ds \right] = \int_0^T \int_{-\infty}^{+\infty} c'^2(u) \, \frac{e^{-u^2/2s}}{\sqrt{2\pi s}} \, du \, ds \le 4\sigma^4(g) \sqrt{\frac{2T}{\pi}} \int_{-\infty}^{+\infty} u^2 e^{-u^2/2T} \, du < \infty$$

Since $c''$ is constant here, $\int_0^T E[|c''(W_s)|] \, ds = 2T\sigma^2(g)$, and

$$c(W_t) - E[c(W_t)] = \int_0^t c'(W_s) \, dW_s + \frac{1}{2} \int_0^t \left( c''(W_s) - E[c''(W_s)] \right) ds \tag{16}$$

so that the second term vanishes, and the process $\xi_m$ can be written

$$\xi_m(t) = \frac{1}{\sqrt{m}} \sum_{\ell=1}^m \int_0^t c'(W_s^{(\ell)}) \, dW_s^{(\ell)} \tag{17}$$

Since $\xi_m$ is a sum of independent martingales, we have

$$\langle \xi_m \rangle(t) = \frac{1}{m} \sum_{\ell=1}^m \left\langle \int_0^{\cdot} c'(W_s^{(\ell)}) \, dW_s^{(\ell)} \right\rangle(t) = \frac{1}{m} \sum_{\ell=1}^m \int_0^t (c'(W_s^{(\ell)}))^2 \, ds$$

Hence, for each $t \in [0, T]$, the SLLN gives

$$\langle \xi_m \rangle(t) \to E\left[ \int_0^t (c'(W_s))^2 \, ds \right] = \int_0^t h_c^2(s) \, ds \quad \text{a.s. as } m \to \infty$$

where

$$h_c(s) = \sqrt{E[c'^2(W_s)]}$$

The limit above is continuous on $[0, T]$. Then, from Rebolledo's theorem (see, e.g., Dacunha-Castelle and Duflo[12]), the sequence $\{\xi_m\}$ is tight in $C([0, T])$, and

$$\xi_m \xrightarrow{d} G \quad \text{as } m \to \infty$$

where the limiting process $G$ can be represented as the martingale

$$G(t) = \int_0^t h_c(s) \, dW_s$$

Note that $G$ is a Gaussian process since it is the Brownian stochastic integral of the deterministic function $h_c$. Finally, $h_c(s) = 2\sigma^2(g)\sqrt{s}$ here, so that the limiting process is given by (14), with $\langle G \rangle(t) = a(t) = 2\sigma^4(g)t^2$. □
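The limiting variance function $a(t) = 2\sigma^4(g)t^2$ of Theorem 1 can be checked by direct simulation of $\xi_m(t)$ with $c(x) = \sigma^2(g)x^2$; this is only a numerical sanity check of ours, not part of the proof (in fact $\operatorname{var}(\xi_m(t)) = 2\sigma^4(g)t^2$ exactly for every $m$, since $\operatorname{var}(W_t^2) = 2t^2$):

```python
import numpy as np

# xi_m(t) = m^{-1/2} sum_l [ c(W_t^(l)) - E c(W_t) ], c(x) = sigma2 * x^2,
# with E c(W_t) = sigma2 * t. Theorem 1 gives var(xi_m(t)) -> 2 sigma2^2 t^2.
rng = np.random.default_rng(4)
sigma2, t, m, n_rep = 3.0, 0.7, 50, 200_000
w_t = rng.normal(0.0, np.sqrt(t), size=(n_rep, m))   # W_t^(l) ~ N(0, t)
xi = (sigma2 * w_t**2 - sigma2 * t).sum(axis=1) / np.sqrt(m)
print(xi.var(), 2 * sigma2**2 * t**2)   # both roughly 8.82
```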
Remark. A slightly more general result can be proved: the convergence in distribution as $m \to \infty$ of the normalized sum of i.i.d. Itô processes

$$\xi_m(t) = \frac{1}{\sqrt{m}} \sum_{\ell=1}^m \left[ c(W_t^{(\ell)}) - E[c(W_t)] \right], \quad t \in [0, T]$$

where $W^{(1)}, \ldots, W^{(m)}$ are $m$ independent copies of Brownian motion, holds for any smooth enough function $c$ satisfying the conditions

(i) $\int_{-\infty}^{+\infty} c'^2(u) e^{-u^2/2T} \, du < \infty$

(ii) $\int_{-\infty}^{+\infty} c''^2(u) e^{-u^2/2T} \, du < \infty$

Indeed, in that case, $\xi_m$ can be split into the martingale term studied in the proof of Theorem 1, and a sequence of processes

$$\frac{1}{2\sqrt{m}} \sum_{\ell=1}^m \int_0^t \left( c''(W_s^{(\ell)}) - E[c''(W_s)] \right) ds$$

that can be proved to be tight in $C([0, T])$. The conclusion then follows since the sequence $\{\xi_m\}$ is tight in $C([0, T])$ as the sum of two tight sequences in $C([0, T])$, and the finite-dimensional distributions of $\xi_m$ converge weakly to a normal distribution.
5. ESTIMATION OF THE LIMITING VARIANCE

We now turn to the estimation of $\sigma^2(g)$ and the control of the fluctuations of $\sigma_n^2(g)$ for large $n$. From Proposition 2, $V_{n,m} \to Z_m$ weakly as $n \to \infty$, where the limiting process takes the form

$$Z_m(t) = t\sigma^2(g) + \frac{1}{\sqrt{m}} \xi_m(t) - \frac{\sigma^2(g)}{m} B_t^2 \tag{18}$$

with $c(x) = \sigma^2(g)x^2$. Hence for $n$ large enough, $V_{n,m}(t) \approx Z_m(t)$ for all $t \in [0, T]$. Clearly, such an approximation can hold only when the finite-dimensional distributions of the process $Y_{n,g}$ [defined in (5)] over $[0, T]$ are approximately normal. Although checking the approximate normality of all the finite-dimensional distributions would be intractable, at least the CLT control method presented in Chauveau and Diebolt[5] should be applied to determine a range of sufficiently large values of $n$ for all practical purposes.

In addition, $\xi_m \to W_a$ in distribution as $m \to \infty$, so that for $m$ large enough the term of order $O(1/m)$ in (18) becomes negligible, and we can consider the approximation

$$V_{n,m}(t) \approx t\sigma^2(g) + \frac{1}{\sqrt{m}} W_{a(t)}, \quad t \in [0, T] \tag{19}$$
where $W_{a(t)} = (W_a)_t$. This leads us to estimate the variance by

$$\hat\sigma_{n,m,g}^2(t) = \hat\sigma_g^2(t) = \frac{V_{n,m}(t)}{t} \approx \sigma^2(g) + \frac{1}{\sqrt{m}} \, \frac{W_{a(t)}}{t} \tag{20}$$

where the process $W_{a(t)}/t$ has covariance function

$$\operatorname{cov}\left( \frac{W_{a(s)}}{s}, \frac{W_{a(t)}}{t} \right) = 2\sigma^4(g) \, \frac{s^2 \wedge t^2}{st} = 2\sigma^4(g) \, \frac{s \wedge t}{s \vee t} \tag{21}$$

5.1. Control of the Fluctuations of the Estimate
If we set the change of variables $s = Te^{-u}$ and $t = Te^{-v}$ for $u > 0$, $v > 0$, then the process $U_v = W_{a(t)}/t$ has covariance function

$$\operatorname{cov}(U_u, U_v) = 2\sigma^4(g) \exp(-|u - v|)$$

hence it is an Ornstein-Uhlenbeck process. There exist available results concerning the distribution of the supremum of this process over compact intervals of time (see, e.g., Delong[13]). These results can be used to construct approximate confidence bands for the fluctuations over compact subintervals of $(0, T]$ of the estimate $\hat\sigma_g^2(t)$ given by (20). We should keep in mind that the validity of such bands is asymptotic, and that trying to control the fluctuations of some estimate of $\sigma^2(g)$ without having first checked the approximate normality of the $\hat S_{n,m}^{(\ell)}(\bar g)$'s makes no sense.

For some $0 < t_1 < t_2 \le T$, we can define the 95% (say) confidence band of $W_{a(t)}/t$ corresponding to $[t_1, t_2]$:

$$P\left( \sup_{t_1 \le t \le t_2} \left| \frac{W_{a(t)}}{t} \right| \ge w_{95\%} \right) = 5\%$$

Suppose that approximate normality has been positively checked (e.g., as in Chauveau and Diebolt[5]). If the estimate $\hat\sigma_g^2(t)$ remains for $t_1 \le t \le t_2$ in its 95% asymptotic confidence band,

$$\sup_{t_1 \le t \le t_2} \left| \hat\sigma_g^2(t) - \sigma^2(g) \right| \le \frac{1}{\sqrt{m}} w_{95\%}$$

where the unknown $\sigma^2(g)$ has to be replaced by $(t_2 - t_1)^{-1} \int_{t_1}^{t_2} \hat\sigma_g^2(t) \, dt$, then we can expect that it is reliable.
5.2. Reduction of Variance of the Estimator

The natural way to use the estimator (20) is to compute it at time $t = 1$, i.e., after $n$ iterations of the parallel chains. In this case, $\hat\sigma_g^2(1)$ is simply $\hat\sigma_{n,m}^2(g)$ defined by (8), and its variance is

$$\operatorname{var}(\hat\sigma_g^2(1)) = \operatorname{var}(V_{n,m}(1)) \approx \frac{2\sigma^4(g)}{m}$$

This variance can be reduced by taking advantage of our knowledge of the covariance function of the limiting process (20).

We suggest using a weighted average of estimates $\hat\sigma_g^2(t_i)$ computed at $p$ different times $t_i$, $i = 1, \ldots, p$. For $t = (t_1, \ldots, t_p)$ and weights $w = (w_1, \ldots, w_p)$, the estimator becomes

$$\hat\sigma_g^2(w, t) = \sum_{i=1}^p w_i \frac{V_{n,m}(t_i)}{t_i}, \quad w \in (0, 1)^p, \quad \sum_{i=1}^p w_i = 1$$

and has variance

$$\operatorname{var}(\hat\sigma_g^2(w, t)) \approx \frac{1}{m} \sum_{i=1}^p w_i^2 \operatorname{var}\left( \frac{W_{a(t_i)}}{t_i} \right) + \frac{2}{m} \sum_{i<j} w_i w_j \operatorname{cov}\left( \frac{W_{a(t_i)}}{t_i}, \frac{W_{a(t_j)}}{t_j} \right)$$

Our objective is then to find $(w, t)$ that minimizes this variance. A natural way to reduce the complexity of this problem is to notice that, from (21), the above covariances only depend on the ratios $t_i/t_j$ for $t_i < t_j$. Hence, if we choose the time points $t_i$ so that $t_{i+1}/t_i = q$ is constant, the optimization only relies on the parameters $(w, q)$. Also, the smallest point $t_1$ has to correspond to a number $[nt_1]$ of iterations large enough for our approximation via the process $W_a$ to be valid. For example, we can restrict ourselves to $t_1 = 1/2$, or $t_1 = 1/3$, as we will see later. We thus propose the following scheme for $p = 2k+1$ points:

$$t_1 = \frac{1}{q^k}, \quad t_i = q t_{i-1}, \quad i = 2, \ldots, 2k+1$$

so that the middle point $t_{k+1} = 1$, the largest one $t_{2k+1} = q^k$, and

$$\operatorname{cov}\left( \frac{W_{a(t_i)}}{t_i}, \frac{W_{a(t_{i+j})}}{t_{i+j}} \right) = 2\sigma^4(g) \, \frac{t_i}{t_{i+j}} = 2\sigma^4(g) r^j$$

where $r = 1/q < 1$. With this setting, the variance of the estimator becomes

$$\operatorname{var}(\hat\sigma_g^2(w, r)) = \frac{2\sigma^4(g)}{m} F_r(w)$$
where $F_r$ is the positive definite quadratic form

$$F_r(w) = \sum_{i=1}^{2k+1} w_i^2 + 2r \sum_{i=1}^{2k} w_i w_{i+1} + 2r^2 \sum_{i=1}^{2k-1} w_i w_{i+2} + \cdots + 2r^{2k} w_1 w_{2k+1} = \sum_{i=1}^{2k+1} w_i^2 + 2 \sum_{j=1}^{2k} r^j \sum_{i=1}^{2k+1-j} w_i w_{i+j} \tag{22}$$
Lemma 1. For any $0 < r < 1$ and integer $k \ge 1$, the solution to the minimization problem

$$w^* = \operatorname{argmin}\left\{ F_r(w) : w \in (0, 1)^{2k+1}, \ \sum_{i=1}^{2k+1} w_i = 1 \right\}$$

is given by

$$w_1^* = w_{2k+1}^* = \frac{1}{2k + 1 - (2k - 1)r} \tag{23}$$

$$w_i^* = \frac{1 - r}{2k + 1 - (2k - 1)r}, \quad 2 \le i \le 2k \tag{24}$$

Proof. The proposed solution $w^*$ satisfies the conditions $\sum_{i=1}^{2k+1} w_i^* = 1$ and $w^* \in (0, 1)^{2k+1}$. If we check in addition that

$$\frac{\partial F_r}{\partial w_i}(w^*) = \lambda, \quad i = 1, \ldots, 2k+1$$

for some $\lambda > 0$, then $w^*$ is the unique solution by Lagrange's method. We have

$$\frac{\partial F_r}{\partial w_1}(w) = 2 \sum_{j=0}^{2k} r^j w_{j+1}$$

$$\frac{\partial F_r}{\partial w_i}(w) = 2 \sum_{j=1}^{i-1} r^{i-j} w_j + 2 \sum_{j=0}^{2k+1-i} r^j w_{i+j}, \quad i = 2, \ldots, 2k$$

$$\frac{\partial F_r}{\partial w_{2k+1}}(w) = 2 \sum_{j=1}^{2k+1} r^{2k+1-j} w_j$$

and it is easy to check that $(\partial F_r/\partial w_i)(w^*) = 2(r + 1)w_1^*$, $i = 1, \ldots, 2k+1$. □
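Lemma 1 can be verified numerically by writing $F_r(w) = w^{\top} M w$ with $M_{ij} = r^{|i-j|}$; this matrix formulation of (22) is ours:

```python
import numpy as np

def optimal_weights(k, r):
    """Weights (23)-(24): w_1 = w_{2k+1} = 1/d, w_i = (1-r)/d otherwise,
    with d = 2k + 1 - (2k - 1) r."""
    d = 2 * k + 1 - (2 * k - 1) * r
    w = np.full(2 * k + 1, (1 - r) / d)
    w[0] = w[-1] = 1 / d
    return w

k, r = 3, 0.5
p = 2 * k + 1
# F_r(w) = w^T M w with M_ij = r^{|i-j|}, the quadratic form (22)
M = r ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
w_star = optimal_weights(k, r)

assert np.isclose(w_star.sum(), 1.0)      # feasible: weights sum to 1
grad = 2 * M @ w_star                     # gradient of F_r at w*
assert np.allclose(grad, grad[0])         # Lagrange condition: all equal

# w* beats random feasible weight vectors
rng = np.random.default_rng(5)
f_star = w_star @ M @ w_star
for _ in range(100):
    w = rng.dirichlet(np.ones(p))         # random weights summing to 1
    assert f_star <= w @ M @ w + 1e-12
print("Lemma 1 checks passed")
```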
To simplify the expression $F_r(w^*)$, set $a = w_1^* = w_{2k+1}^*$ and $b = w_2^* = \cdots = w_{2k}^*$. Then the minimal value of $F_r$ (inductively given by Mathematica) is

$$F_r(w^*) = 2a^2(1 + r^{2k}) + b^2\left[ (2k - 1) + 2\sum_{j=1}^{2k-2} (2k - j - 1)r^j \right] + 4ab \sum_{j=1}^{2k-1} r^j$$

so that the relative variance reduction, by comparison with the variance of the estimate using solely $t = 1$, is

$$\frac{\operatorname{var}(\hat\sigma_g^2(1)) - \operatorname{var}(\hat\sigma_g^2(w^*, r))}{\operatorname{var}(\hat\sigma_g^2(1))} = 1 - F_r(w^*) = \frac{2k(1 - r)}{2k + 1 - (2k - 1)r} =: R(k, q)$$
Examples of Variance Reduction

This reduction $R(k, q)$ only depends on $(k, q)$ with $q = r^{-1} > 1$, but is constrained by the fact that $t_1 = 1/q^k$ should not be too small, so that our approximations remain valid, i.e., so that the observed Markov chains have reached an approximately stationary regime after $[nt_1]$ iterations. How small $t_1$ can be is obviously related to the total number of iterations observed, which depends essentially on the computing time available. This means that we have to run the chains for more iterations than the duration necessary to reach the stationary regime, precisely up to $[nt_{2k+1}] = [nq^k]$ iterations.

For instance, if we assume that the Markov chains have approximately reached stationarity after $n/2$ iterations and that we have run the parallel chains up to $2n$ iterations, then the time points over which the estimate can be averaged have to be within the interval $[1/2, 2]$. Thus $t_1 = 1/2$ implies that $k \le [\log(2)/\log(q)]$. Informally replacing $k$ with the real number $\log(2)/\log(q)$ in $R(k, q)$ yields a relative reduction $R$ which only depends on $q$ and reduces to

$$R(q) = \frac{2(q - 1)\log(2)}{\log(q/4) + q\log(4q)}, \quad q > 1$$
Table 1. Relative variance reduction for estimation schemes with time points $t_i \in [1/2, 2]$, $i = 1, \ldots, 2k+1$.

Number of points   k   q         R(q)
3                  1   2         0.400
5                  2   2^{1/2}   0.407
7                  3   2^{1/3}   0.408
9                  4   2^{1/4}   0.409
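The schemes of Table 1 can be reproduced from $R(k, q) = 2k(1-r)/(2k+1-(2k-1)r)$ with $q = 2^{1/k}$, together with the limiting value $\log(2)/(1 + \log(2)) \approx 0.4094$ that follows from the expression for $R(q)$:

```python
import math

def reduction(k, q):
    """Relative variance reduction R(k, q), with r = 1/q."""
    r = 1.0 / q
    return 2 * k * (1 - r) / (2 * k + 1 - (2 * k - 1) * r)

for k in range(1, 5):
    q = 2 ** (1.0 / k)          # 2k+1 time points spanning [1/2, 2]
    print(2 * k + 1, k, round(reduction(k, q), 3))
# prints: 3 1 0.4 / 5 2 0.407 / 7 3 0.408 / 9 4 0.409

# limiting value as q -> 1: log(2) / (1 + log(2)) ~ 0.4094
print(math.log(2) / (1 + math.log(2)))
```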
Table 2. Relative variance reduction for estimation schemes with time points $t_i \in [1/3, 3]$, $i = 1, \ldots, 2k+1$.

Number of points   k   q         R(q)
3                  1   3         0.500
5                  2   3^{1/2}   0.517
7                  3   3^{1/3}   0.521
9                  4   3^{1/4}   0.522
The function $R(q)$ converges to its maximum value of about 0.4094 as $q \to 1$, i.e., as $k = \log(2)/\log(q) \to \infty$. Fortunately, this limiting value is almost reached already for small values of $k$. This can be seen by plotting $q \mapsto R(q)$, and is illustrated by the simple schemes given in Table 1.

For a second example, if we assume that the process has approximately reached stationarity after $n/3$ iterations only and that we have run it up to $3n$ iterations, then we can average the estimate over $[1/3, 3]$. Setting now $k = \log(3)/\log(q)$ leads to a maximal variance reduction equal to $\lim_{q \to 1} R(q) \approx 0.5235$. The behavior is similar to that of the first example. Simple estimation schemes are given in Table 2.
ACKNOWLEDGMENT
The authors wish to thank a referee for his insightful comments and suggestions.
REFERENCES

1. Gelfand, A.E.; Smith, A.F.M. Sampling based approaches to calculating marginal densities. J. Am. Stat. Assoc. 1990, 85, 398–409.
2. Gilks, W.R.; Richardson, S.; Spiegelhalter, D.J. Markov Chain Monte Carlo in Practice; Chapman & Hall: London, 1996.
3. Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Springer-Verlag: London, 1993.
4. Brooks, S.P.; Roberts, G. Assessing convergence of Markov chain Monte Carlo algorithms. Stat. Comput. 1998, 8 (4), 319–335.
5. Chauveau, D.; Diebolt, J. An automated stopping rule for MCMC convergence assessment. Comput. Stat. 1999, 14 (3), 419–442.
6. Geyer, C.J. Practical Markov chain Monte Carlo. Stat. Sci. 1992, 7 (4), 473–511.
7. Robert, C.P. Convergence control techniques for Markov chain Monte Carlo algorithms. Stat. Sci. 1996, 10 (3), 231–253.
8. Guihenneuc-Jouyaux; Robert, C.P. Valid discretization via renewal theory. In Discretization and MCMC Convergence Assessment; Robert, C.P., Ed.; Lecture Notes in Statistics; Springer-Verlag: New York, 1998; Vol. 135, Chap. 4, 67–97.
9. Priestley, M.B. Spectral Analysis and Time Series; Academic: London, 1981.
10. Green, P.J.; Han, X.-L. Metropolis methods, Gaussian proposals, and antithetic variables. In Lecture Notes in Statistics; Springer-Verlag: New York, 1992; Vol. 74, 142–164.
11. Billingsley, P. Convergence of Probability Measures; John Wiley & Sons: New York, 1968.
12. Dacunha-Castelle, D.; Duflo, M. Probability and Statistics; Springer-Verlag: New York, 1986; Vol. 2.
13. Delong, D.M. Crossing probabilities for a square root boundary by a Bessel process. Commun. Stat. Theory 1981, A10, 2197–2213.
Received January 23, 2001
Revised October 11, 2002
Accepted April 7, 2003