STOCHASTIC MODELS, Vol. 19, No. 4, pp. 449–465, 2003
DOI: 10.1081/STM-120025399
Copyright © 2003 by Marcel Dekker, Inc. 1532-6349 (Print); 1532-4214 (Online)

Estimation of the Asymptotic Variance in the CLT for Markov Chains

Didier Chauveau and Jean Diebolt

Laboratoire d'Analyse et de Mathématiques Appliquées, CNRS UMR 8050, Universités de Marne-la-Vallée, Champs-sur-Marne, France. E-mail: [email protected] and [email protected].

ABSTRACT

This paper is devoted to estimating the asymptotic variance in the central limit theorem (CLT) for Markov chains. We assume that the functional CLT for Markov chains applies to properly normalized partial-sum processes of functions of the chain, and study a continuous-time empirical variance process based on i.i.d. parallel chains. The centered empirical variance process, properly normalized, converges in distribution to a centered Gaussian process with known covariance function. This allows us to estimate the limiting variance and to control the fluctuations of the variance estimator after $n$ steps. An application to Monte Carlo Markov chain (MCMC) convergence control is suggested.

Key Words: Markov chains; Functional CLT; Asymptotic variance; Stochastic process; Itô process.

1. INTRODUCTION

The asymptotic properties of ergodic homogeneous Markov chains, such as the strong law of large numbers (SLLN) or the central limit theorem (CLT), are the key limit behaviors which justify the use of Monte Carlo Markov chain (MCMC) algorithms in statistical inference or in the simulation of physical particle systems. Since the problem we address here was initially motivated by MCMC methods, we begin with a brief presentation of this technique.
An MCMC algorithm generates an ergodic Markov chain $X = (X_n)_{n \ge 0}$ with a given stationary distribution $\pi$, which usually arises as the posterior distribution in some Bayesian setup. Investigating characteristics of posterior distributions requires simulation from $\pi$, or computation of integrals like $\int g \, d\pi$, where $g$ is some real-valued function defined on the state space of $X$. In many situations, direct simulation from $\pi$ is not tractable, and $\int g \, d\pi$ is not available in closed form, so that an MCMC method is appropriate. Good introductions to the topic can be found in Gelfand and Smith's[1] seminal paper, or in Gilks et al.[2]

The usual estimate provided by an MCMC algorithm comes from the SLLN applied to the properly normalized sum $S_n(g) = \sum_{i=1}^n g(X_i)$, that is,

$$\lim_{n\to\infty} \frac{S_n(g)}{n} = \int g \, d\pi =: \pi(g) \quad \text{a.s.}$$

Since precision is desired for the estimate $S_n(g)/n$, it is natural to apply the CLT, which states that

$$\frac{1}{\sqrt{n}} S_n(\bar g) \xrightarrow{d} N(0, \sigma^2(g)) \tag{1}$$

where $\bar g(\cdot) = g(\cdot) - \pi(g)$ is the centered function in the stationary regime (i.e., when $X_0 \sim \pi$), and

$$\sigma^2(g) = \lim_{n\to\infty} \frac{1}{n}\operatorname{var}(S_n(\bar g)) = E_\pi[\bar g(X_0)^2] + 2\sum_{n=1}^{\infty} E_\pi[\bar g(X_0)\bar g(X_n)] \tag{2}$$

is the asymptotic variance, which should satisfy $0 < \sigma^2(g) < \infty$. Various conditions for these asymptotic results to hold, in both the stationary and nonstationary cases, are given in Meyn and Tweedie,[3] Chap. 17, and will be made precise in Sec. 2.

The variance $\sigma^2(g)$ is difficult to calculate theoretically (see the discussion hereafter), but estimating it properly is crucial to build confidence intervals for $\pi(g)$. This estimation problem relates to the problem of MCMC convergence assessment (see, e.g., Brooks and Roberts[4]). Actually, it is not possible to obtain reliable estimates of $\sigma^2(g)$ before the chain has approximately reached its stationary regime and the normality implied by the CLT can be empirically checked (see Chauveau and Diebolt[5]). Conversely, the CLT cannot be used as long as a reliable estimate of $\sigma^2(g)$ has not been found.
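As a concrete illustration of (1)–(2) (ours, not from the paper), the asymptotic variance has a closed form for the toy AR(1) chain $X_{i+1} = \rho X_i + \varepsilon_i$ with standard normal innovations and $g(x) = x$: the lag-$t$ autocovariance is $\rho^t/(1-\rho^2)$, so summing the series in (2) gives $\sigma^2(g) = 1/(1-\rho)^2$. A minimal Monte Carlo sketch checking that $\operatorname{var}(S_n(\bar g)/\sqrt{n})$ approaches this value:

```python
import math
import random

def asymptotic_variance_mc(rho=0.5, n=2000, m=300, seed=1):
    """Monte Carlo estimate of var(S_n(g-bar)/sqrt(n)) for the AR(1) chain
    X_{i+1} = rho*X_i + eps_i, g(x) = x, started in stationarity so that
    pi(g) = 0 and g-bar = g.  Closed form limit: sigma^2(g) = 1/(1-rho)^2."""
    rng = random.Random(seed)
    vals = []
    for _ in range(m):
        # draw X_0 from pi = N(0, 1/(1-rho^2)), then run the chain
        x = rng.gauss(0.0, math.sqrt(1.0 / (1.0 - rho * rho)))
        s = 0.0
        for _ in range(n):
            x = rho * x + rng.gauss(0.0, 1.0)
            s += x
        vals.append(s / math.sqrt(n))
    mean = sum(vals) / m
    # empirical variance of the m replicated values of S_n/sqrt(n)
    return sum((v - mean) ** 2 for v in vals) / m
```

For $\rho = 0.5$ the limit is $1/(1-0.5)^2 = 4$, and the Monte Carlo value fluctuates around it with relative error of order $\sqrt{2/m}$.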
In this perspective, Chauveau and Diebolt[5] have proposed to check the stabilization of the variance after $n$ steps,

$$\sigma_n^2(g) = \operatorname{var}\left(\frac{S_n(\bar g)}{\sqrt{n}}\right) \tag{3}$$

by graphically monitoring the empirical variance of a sample of values of $S_n(\bar g)/\sqrt{n}$ computed from parallel chains, i.e., i.i.d. copies of the Markov chain started from a dispersed initial distribution.

Our goal in this paper is two-fold: first, to propose an applicable, generic and consistent method for estimating $\sigma^2(g)$ from parallel chains; second, to control the fluctuations of $\sigma_n^2(g)$ around its limiting value. We provide theoretically grounded answers to both problems. Our control of the asymptotic normality provides justification for the normal approximations used. Our approach also provides a way to "recycle" the simulated observations to improve the estimator over the natural one based on parallel chains (variance reduction). Finally, we insist on the generic aspect of our method: approaches requiring problem-specific knowledge of the transition kernel of the Markov chain, or specific computer codes for their implementation, are far less usable than methods solely based upon the output of the sampler, which can use available generic code.

It turns out that finding good estimates for $\sigma^2(g)$ is itself difficult. A summary of the investigations made in this direction can be found in Geyer.[6] More recently, other methods based on renewal theory have been proposed in Robert[7] (see also Guihenneuc-Jouyaux and Robert,[8] Sec. 4.6). However, to our knowledge, the proposed procedures are all based on a single chain (i.e., the output of one trajectory of the Markov chain) and suffer from several drawbacks or lack of applicability, as discussed below. In Geyer,[6] three methods to estimate $\sigma^2(g)$ in the stationary case are discussed.
The first one, the window estimator, uses the usual empirical estimates $\hat\gamma_{n,t}$ of the lag-$t$ autocovariances $\gamma_t = \operatorname{Cov}(g(X_i), g(X_{i+t}))$, based on a single run of length $n$. The variance $\sigma^2(g)$ is then estimated by

$$\hat\sigma^2(g) = \sum_{s=-\infty}^{+\infty} w_n(s)\, \hat\gamma_{n,s}$$

where $w_n(s)$ is a weight function (the lag window) such that $w_n(s) \to 0$ when $|s| \to \infty$, which has to be chosen and tuned in order to obtain, under strong enough regularity conditions, consistency of the window estimator. This technique comes from the time series literature and was developed for estimating spectral densities. For instance, Priestley[9] summarizes several possible choices for the shape and parameters of the lag window. However, the desirable theoretical properties (such as weak consistency) are obtained under some assumptions about the underlying process $X$, the least restrictive being that $X$ must belong to the class of the (so-called) "general linear processes." Indeed, it is not clear how reasonable these kinds of assumptions are in the MCMC setup. References where window estimators have been used in the MCMC context are suggested by Geyer,[6] but do not provide answers to this question (see, e.g., Green and Han[10]).

The second method uses batch means, i.e., averages over blocks based on a single run. These batch means are shown to converge to i.i.d. normal random variables, and this property is used to build confidence intervals for $\sigma^2(g)$. A drawback is that the batch size needs to be chosen large enough for these asymptotic properties (independence and normality) to hold. This choice is related to the mixing rate of the chain, which is unknown. Hence achieving a good compromise between the number of batches and the batch size may be difficult without additional information on the chain.
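A minimal sketch of the window estimator just described (illustrative; the function names and the choice of a triangular Bartlett window are ours, not Geyer's):

```python
def window_estimate(xs, wn):
    """Lag-window estimate of sigma^2(g): sum over s of wn(s) * gamma_hat_s,
    where gamma_hat_t is the usual biased empirical autocovariance
    (normalized by n) of the single run xs = (g(X_1), ..., g(X_n))."""
    n = len(xs)
    mean = sum(xs) / n

    def gamma_hat(t):
        return sum((xs[i] - mean) * (xs[i + t] - mean) for i in range(n - t)) / n

    total = wn(0) * gamma_hat(0)
    s = 1
    while s < n and wn(s) > 0:
        total += 2 * wn(s) * gamma_hat(s)   # gamma_{-s} = gamma_s by stationarity
        s += 1
    return total

def bartlett(M):
    """Triangular (Bartlett) lag window with truncation point M."""
    return lambda s: max(0.0, 1.0 - abs(s) / M)
```

On an i.i.d. run the estimate should be close to $\gamma_0$; on a correlated run the weighted sum of autocovariances recovers the extra terms of (2), at the cost of tuning the truncation point $M$.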
The third method uses the fact that, under general conditions (stationarity, irreducibility and reversibility of the Markov chain), the adjacent pairs of autocovariances $\gamma_{2t} + \gamma_{2t+1}$ are strictly positive, decreasing and convex. This property is used to build empirical rules for the number of terms over which the estimator obtained by summing the $\hat\gamma_{n,t}$'s is computed. The major drawback of this approach is that it only provides consistent overestimates in general situations (see Geyer[6]).

A more recent approach given in Robert[7] uses renewal properties of the Markov chain to build i.i.d. blocks, for which the usual CLT holds. The condition for renewal theory to apply is that there exist a so-called renewal, or small, set $A$, a real $0 < \epsilon < 1$ and a probability measure $\nu$ such that

$$P(X_{n+1} \in B \mid X_n = x_n) \ge \epsilon\,\nu(B), \quad \forall x_n \in A,\ \forall B$$

In that case, the kernel can be modified in such a way as to obtain renewal events with probability $\epsilon$ when $X_n \in A$. The variance $\sigma_A^2(g)$ in the usual (i.i.d.) CLT can be estimated by the usual empirical variance, and an estimate of $\sigma^2(g)$ can be deduced. A first obvious drawback of this approach is that one needs to exhibit the triple $(A, \epsilon, \nu)$ defining the small set, and this requires a deep analytical knowledge of the kernel driving the chain, hence is problem-specific. Moreover, this small set $A$ must be chosen in a way that yields short return times to it, with the associated $\epsilon$ not too small, so as to obtain a sufficient number of renewal events in a reasonable total simulation time (note in addition that these two requirements are somewhat contradictory). Finally, as the author points out, this leads to a highly conservative method.

We now outline the flavor of our approach. Suppose that we have an estimator $\hat\sigma_n^2(g)$ of $\sigma_n^2(g)$ and we want to control the range of its fluctuations around $\sigma^2(g)$.
This can be done by determining, for some $n_1 < n_2$, an estimate of

$$P\left(\sup_{n_1 \le n \le n_2} |\hat\sigma_n^2(g) - \sigma^2(g)| \le u\right), \quad u > 0$$

when the underlying Markov chain has approximately reached its stationary regime, in order to build confidence bands. Of course, since $\sigma^2(g)$ is unknown, in practice it has to be replaced by the average of the values of $\hat\sigma_n^2(g)$ over $n_1 \le n \le n_2$. It is then natural to consider a partial-sum process $\hat\sigma_{[nt]}^2(g)$, $t \in [0, 1]$, issued from this estimator (where $[t]$ denotes the largest integer $k \le t$) and to show that it satisfies a functional CLT, e.g., as in Billingsley.[11] This means that the sequence of continuous random functions obtained by interpolating the values of $\hat\sigma_{[nt]}^2(g)$ converges in distribution to a Brownian motion.

Here, we study the limiting behavior of a process issued from the natural estimator of the variance after $n$ steps, which is constructed as follows: (i) we use the fact that the continuous-time process issued from the interpolation of the partial-sum process $S_{[nt]}(\bar g)/\sqrt{n}$ satisfies a functional CLT (see, e.g., Meyn and Tweedie[3] and our Sec. 2.1); (ii) we consider $m$ i.i.d. copies of the original chain, giving $m$ i.i.d. copies of the partial-sum process that converge to $m$ i.i.d. copies of the Brownian motion; (iii) we build a continuous-time empirical variance process from the empirical variance of these $m$ partial-sum processes. Therefore, we can expect to gain control over the estimation of $\sigma^2(g)$ from the study of the asymptotic behavior of this process as both $n$ and $m$ increase to $\infty$.

In Sec. 2, we first recall the conditions under which a Markov chain satisfies a functional CLT. We then present our estimator computed from the independent parallel Markov chains, as well as the continuous-time process that will be considered throughout the remaining part of the paper. Section 3 states the weak convergence of this empirical variance process as the "time" $n \to \infty$.
We prove that the empirical variance process essentially converges in distribution to a normalized sum of $m$ independent Itô processes, plus a term converging to 0 as $m \to \infty$. Section 4 is devoted to the study of the weak convergence of this sum of $m$ i.i.d. Itô processes as $m \to \infty$. A slightly more general result is also given under some general assumptions. As a consequence, it is proved that the centered empirical variance process, properly normalized, converges in distribution to a centered Gaussian process with known covariance function. Finally, Sec. 5 suggests how to control the fluctuations of $\sigma_n^2(g)$ using the theoretical results obtained in the previous sections, and proposes a strategy for estimating $\sigma^2(g)$ using the properties of the Gaussian limiting process. This leads to estimates of $\sigma^2(g)$ with a variance smaller than the variance of the natural estimate of $\sigma^2(g)$ based on i.i.d. chains.

2. THE EMPIRICAL VARIANCE PROCESS

2.1. Functional Central Limit Theorem for Markov Chains

This section starts with a summary of classical results related to the functional CLT for Markov chains. Here, we consider a single ergodic Markov chain $X = (X_n, n \ge 0)$ over a general state space $(E, \mathcal{B}(E))$ with transition kernel $P$, initial position $X_0 \sim \mu$ (where $\mu$ is an arbitrary probability distribution) and invariant distribution $\pi$. Notice that in the literature the results below are often stated for the stationary situation (i.e., when $X_0 \sim \pi$). We will state them in the general case ($X_0 \sim \mu$ for arbitrary $\mu$) since this is the actual situation in standard MCMC applications. For a function $g: E \to$
$\mathbb{R}$, we define

$$S_n(g) = \sum_{i=1}^n g(X_i), \quad \text{and} \quad S_n(\bar g) = \sum_{i=1}^n (g(X_i) - \pi(g))$$

For $t \in [0, 1]$, the partial-sum process associated to $S_n(\bar g)$ is

$$S_{[nt]}(\bar g) = \sum_{i=1}^{[nt]} (g(X_i) - \pi(g)) \tag{4}$$

and we suppose in the sequel that $X$ satisfies conditions ensuring that the functional CLT holds for the normalized linear interpolation of (4):

$$Y_{n,g}(t) = \frac{1}{\sqrt{n}}\left[S_{[nt]}(\bar g) + (nt - [nt])\left(S_{[nt]+1}(\bar g) - S_{[nt]}(\bar g)\right)\right] \tag{5}$$

More precisely, if $\sigma^2(g)$ is the asymptotic variance in the CLT defined by (2), we have the following result:

Proposition 1 (Functional CLT). Assume that $X$ is positive Harris and that a solution $\hat g$ of the Poisson equation $\hat g - P\hat g = g - \pi(g)$ exists with $\pi(\hat g^2) < \infty$. If $\sigma^2(g) > 0$, then

$$Y_{n,g} \xrightarrow{d} \sigma(g)\,W \quad \text{as } n \to \infty$$

where $W$ denotes a standard Brownian motion over $[0, 1]$.

Proposition 1 is from Meyn and Tweedie,[3] Theorem 17.4.4. There, it is proved in the stationary case ($X_0 \sim \pi$). However (as explained in the proof of Theorem 17.4.4), the result can be extended to the nonstationary case by using their Proposition 17.1.6, which claims that, if the functional CLT holds for some initial distribution $\mu$, then it holds for any initial distribution. Note also that although the result is written for $t \in [0, 1]$ for simplicity, it holds as well for $t \in [0, T]$, whatever $T > 0$.

2.2. Estimation of the Variance from i.i.d. Chains

We now turn to the description of our estimator. We propose to use $m$ independent copies of the Markov chain to estimate the variance $\sigma_n^2(g)$. For $1 \le \ell \le m$, we denote the $\ell$th chain by $X^{(\ell)} = (X_n^{(\ell)}, n \ge 0)$ and the corresponding $\ell$th sum and centered sum, respectively, by

$$S_n^{(\ell)}(g) = \sum_{i=1}^n g(X_i^{(\ell)}) \quad \text{and} \quad S_n^{(\ell)}(\bar g) = \sum_{i=1}^n (g(X_i^{(\ell)}) - \pi(g))$$

The simulation of $m$ i.i.d. chains up to time $n$ provides an i.i.d. sample of $m$ (noncentered) random variables

$$\frac{S_n^{(1)}(g)}{\sqrt{n}}, \ldots, \frac{S_n^{(m)}(g)}{\sqrt{n}}$$

so that the natural estimator of $\sigma_n^2(g)$ is the usual (biased) empirical variance of this sample:

$$\hat\sigma_{n,m}^2(g) = \frac{1}{m}\sum_{\ell=1}^m \left[\frac{1}{\sqrt{n}} S_n^{(\ell)}(g) - \frac{1}{\sqrt{n}} \overline{S_n}(g)\right]^2 \tag{6}$$

where

$$\overline{S_n}(g) = \frac{1}{m}\sum_{\ell=1}^m S_n^{(\ell)}(g) \tag{7}$$

To use Proposition 1, we need to introduce the sums $S_n^{(\ell)}(\bar g)$ that satisfy the functional CLT. This leads us to rewrite (6) as

$$\hat\sigma_{n,m}^2 = \frac{1}{m}\sum_{\ell=1}^m \left[\frac{1}{\sqrt{n}} S_n^{(\ell)}(\bar g) + \frac{1}{\sqrt{n}}\sum_{i=1}^n \pi(g) - \frac{1}{\sqrt{n}} \overline{S_n}(g)\right]^2 = \frac{1}{m}\sum_{\ell=1}^m \left[\frac{1}{\sqrt{n}} S_n^{(\ell)}(\bar g) - \frac{1}{\sqrt{n}} \overline{S_n}(\bar g)\right]^2 \tag{8}$$

where

$$\overline{S_n}(\bar g) = \frac{1}{m}\sum_{\ell=1}^m S_n^{(\ell)}(\bar g) \tag{9}$$

Note that the expression in brackets in (8) can also be interpreted as

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \left(g(X_i^{(\ell)}) - \hat\pi_i^m(g)\right)$$

where

$$\hat\pi_i^m(g) = \frac{1}{m}\sum_{\ell=1}^m g(X_i^{(\ell)})$$

is the estimate of $E[g(X_i)] = \pi_i(g)$, the expectation under the marginal distribution of the chain at time $i$, based on the observations from the $m$ i.i.d. chains at time $i$, which is consistent in $m$ by the standard SLLN. The estimate $\hat\sigma_{n,m}^2(g)$ appears in this way as the empirical second moment of the normalized and approximately centered sums associated with the $m$ chains.

2.3. Partial Sums and Empirical Variance Process

We will now introduce the partial-sum processes to construct our continuous-time estimator process. For $\ell = 1, \ldots, m$, we define

$$S_{[nt]}^{(\ell)}(\bar g) = \sum_{i=1}^{[nt]} (g(X_i^{(\ell)}) - \pi(g)), \quad t \in [0, T]$$

and their normalized linear interpolations,

$$Y_n^{(\ell)}(t) = \frac{1}{\sqrt{n}}\left[S_{[nt]}^{(\ell)}(\bar g) + (nt - [nt])\left(S_{[nt]+1}^{(\ell)}(\bar g) - S_{[nt]}^{(\ell)}(\bar g)\right)\right]$$

where, for simplicity, we drop $g$ in the notation since $g$ is considered fixed throughout the paper.
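The estimator (6) is straightforward to compute from stored trajectories. A minimal sketch (ours; the helper `ar1_chains` is a hypothetical test chain, not part of the paper's method):

```python
import math
import random

def sigma2_hat_nm(trajectories, g):
    """Estimator (6): biased empirical variance of the m values
    S_n^{(l)}(g)/sqrt(n), centered at their sample mean."""
    m = len(trajectories)
    n = len(trajectories[0])
    z = [sum(g(x) for x in traj) / math.sqrt(n) for traj in trajectories]
    zbar = sum(z) / m
    return sum((v - zbar) ** 2 for v in z) / m

def ar1_chains(m, n, rho=0.5, seed=0):
    """Hypothetical test case: m i.i.d. AR(1) chains started in stationarity.
    For g(x) = x the true asymptotic variance is 1/(1-rho)^2."""
    rng = random.Random(seed)
    chains = []
    for _ in range(m):
        x = rng.gauss(0.0, math.sqrt(1.0 / (1.0 - rho * rho)))
        traj = []
        for _ in range(n):
            x = rho * x + rng.gauss(0.0, 1.0)
            traj.append(x)
        chains.append(traj)
    return chains
```

Note that, as in the rewriting (8), adding a constant to $g$ leaves the estimate unchanged, since the sample-mean centering absorbs it.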
We accordingly define the partial-sum version of $\hat\sigma_{n,m}^2(g)$ by

$$\hat\sigma_{[nt],m}^2 = \frac{1}{m}\sum_{\ell=1}^m \left[\frac{1}{\sqrt{n}} S_{[nt]}^{(\ell)}(\bar g) - \frac{1}{m}\sum_{\ell'=1}^m \frac{1}{\sqrt{n}} S_{[nt]}^{(\ell')}(\bar g)\right]^2 \tag{10}$$

Using the linear interpolations $Y_n^{(\ell)}$, we consequently define the empirical variance process by

$$V_{n,m}(t) = \frac{1}{m}\sum_{\ell=1}^m \left[Y_n^{(\ell)}(t) - \frac{1}{m}\sum_{\ell'=1}^m Y_n^{(\ell')}(t)\right]^2 = \frac{1}{m}\sum_{\ell=1}^m \left(Y_n^{(\ell)}(t)\right)^2 - \left(\frac{1}{m}\sum_{\ell=1}^m Y_n^{(\ell)}(t)\right)^2 \tag{11}$$

Note that $V_{n,m}(t)$ coincides with (10) at the points $t = i/n$, $i = 1, \ldots, [nT]$. It is a nonlinear interpolation of $\hat\sigma_{n,m}^2(g)$. In the next two sections, we study the asymptotic behavior of $V_{n,m}(t)$ by considering first its limit in distribution as $n \to \infty$ for a fixed number $m$ of parallel chains, and then its behavior for large $m$.

3. ASYMPTOTIC BEHAVIOR IN TIME

Proposition 2. Assume that $X$ and $g$ satisfy the conditions of Proposition 1. Then

$$Y_n^{(\ell)} \xrightarrow{d} \sigma(g)\,W^{(\ell)} \quad \text{as } n \to \infty, \quad \ell = 1, \ldots, m$$

where $W^{(1)}, \ldots, W^{(m)}$ are $m$ independent standard Brownian motions, and

$$V_{n,m} \xrightarrow{d} Z_m \quad \text{as } n \to \infty$$

for each $m \ge 1$, where the process $Z_m$ (which also depends on $g$) is

$$Z_m(t) = \frac{1}{m}\sum_{\ell=1}^m \left(\sigma(g) W_t^{(\ell)}\right)^2 - \frac{\sigma^2(g)}{m} B_t^2, \quad t \in [0, T] \tag{12}$$

where $B_t$ is a standard Brownian motion not independent of the $W_t^{(\ell)}$'s.

Proof. The first result is just Proposition 1 applied to the $m$ i.i.d. chains. The weak convergence of $V_{n,m}$ directly follows from the fact that $V_{n,m}$ is a continuous function of the independent processes $Y_n^{(1)}, \ldots, Y_n^{(m)}$. This function is $f(y_1, \ldots, y_m) = \sum_{i=1}^m y_i^2/m - \bar y^2$, as given by (11). The limiting process is then the image of the individual limits by this function, which yields (12), where $B_t = m^{-1/2}\sum_{\ell=1}^m W_t^{(\ell)}$ is itself a standard Brownian motion. ∎

4. ASYMPTOTIC BEHAVIOR IN THE NUMBER OF CHAINS

In this section, we describe the behavior of the process $Z_m$ when $m$ gets large.
Since $E[(\sigma(g)W_t)^2] = \sigma^2(g)t$, let us rewrite (12) as

$$Z_m(t) = \sigma^2(g)t + \frac{1}{\sqrt{m}}\,\xi_m(t) - \frac{\sigma^2(g)}{m} B_t^2, \quad t \in [0, T] \tag{13}$$

where

$$\xi_m(t) = \frac{1}{\sqrt{m}}\sum_{\ell=1}^m \left[c(W_t^{(\ell)}) - E[c(W_t)]\right], \quad \text{and} \quad c(x) = \sigma^2(g)\,x^2$$

We keep the notation with the function $c$, instead of replacing it directly by $\sigma^2(g)x^2$, so as to indicate a slightly more general result at the end of Sec. 4.

Theorem 1 (Convergence in distribution of $\xi_m$).

$$\xi_m \xrightarrow{d} G \quad \text{as } m \to \infty$$

where $G$ is a centered Gaussian process which can be represented as

$$G(t) = 2\sigma^2(g)\int_0^t \sqrt{s}\; dW_s \tag{14}$$

and has the same distribution as the Gaussian process $W_a$ with covariance function $(s, t) \mapsto a(s \wedge t)$, where $a(t) = 2\sigma^4(g)t^2$.

Proof. Each individual process $c(W_t)$ can be written, via Itô's formula, as

$$c(W_t) = c(0) + \int_0^t c'(W_s)\, dW_s + \frac{1}{2}\int_0^t c''(W_s)\, ds \tag{15}$$

where $c(0) = 0$ here. The stochastic integral $\int_0^t c'(W_s)\, dW_s$ is a square integrable centered martingale when

$$E\left[\int_0^T (c'(W_s))^2\, ds\right] < \infty$$

and this condition holds here since

$$E\left[\int_0^T (c'(W_s))^2\, ds\right] = \int_0^T \int_{-\infty}^{+\infty} c'^2(u)\, \frac{e^{-u^2/2s}}{\sqrt{2\pi s}}\, du\, ds \le 4\sigma^4(g)\sqrt{\frac{2T}{\pi}} \int_{-\infty}^{+\infty} u^2 e^{-u^2/2T}\, du < \infty \tag{16}$$

Since $c''$ is constant here, $\int_0^T E[|c''(W_s)|]\, ds = 2T\sigma^2(g)$, and

$$c(W_t) - E[c(W_t)] = \int_0^t c'(W_s)\, dW_s + \frac{1}{2}\int_0^t \left(c''(W_s) - E[c''(W_s)]\right) ds$$

so that the second term vanishes, and the process $\xi_m$ can be written

$$\xi_m(t) = \frac{1}{\sqrt{m}}\sum_{\ell=1}^m \int_0^t c'(W_s^{(\ell)})\, dW_s^{(\ell)} \tag{17}$$

Since $\xi_m$ is a sum of independent martingales, we have

$$\langle \xi_m \rangle(t) = \frac{1}{m}\sum_{\ell=1}^m \left\langle \int_0^{\cdot} c'(W_s^{(\ell)})\, dW_s^{(\ell)} \right\rangle(t) = \frac{1}{m}\sum_{\ell=1}^m \int_0^t \left(c'(W_s^{(\ell)})\right)^2 ds$$

Hence, for each $t \in [0, T]$, the SLLN gives

$$\langle \xi_m \rangle(t) \to E\left[\int_0^t (c'(W_s))^2\, ds\right] = \int_0^t h_c^2(s)\, ds \quad \text{a.s. as } m \to \infty$$

where

$$h_c(s) = \sqrt{E[c'^2(W_s)]}$$

The limit above is continuous on $[0, T]$. Then, from Rebolledo's theorem (see, e.g., Dacunha-Castelle and Duflo[12]), the sequence $\{\xi_m\}$ is tight in $C([0, T])$, and

$$\xi_m \xrightarrow{d} G \quad \text{as } m \to \infty$$

where the limiting process $G$ can be represented as the martingale

$$G(t) = \int_0^t h_c(s)\, dW_s$$

Note that $G$ is a Gaussian process since it is the Brownian stochastic integral of the deterministic function $h_c$. Finally, $h_c(s) = 2\sigma^2(g)\sqrt{s}$ here, so that the limiting process is given by (14), with $\langle G \rangle(t) = a(t) = 2\sigma^4(g)t^2$. ∎

Remark. A slightly more general result can be proved: the convergence in distribution as $m \to \infty$ of the normalized sum of i.i.d. Itô processes

$$\xi_m(t) = \frac{1}{\sqrt{m}}\sum_{\ell=1}^m \left[c(W_t^{(\ell)}) - E[c(W_t)]\right], \quad t \in [0, T]$$

where $W^{(1)}, \ldots, W^{(m)}$ are $m$ independent copies of Brownian motion, holds for any smooth enough function $c$ satisfying the conditions

(i) $\int_{-\infty}^{+\infty} c'^2(u)\, e^{-u^2/2T}\, du < \infty$
(ii) $\int_{-\infty}^{+\infty} c''^2(u)\, e^{-u^2/2T}\, du < \infty$

Indeed, in that case, $\xi_m$ can be split into the martingale term studied in the proof of Theorem 1 and a sequence of processes

$$\frac{1}{2\sqrt{m}}\sum_{\ell=1}^m \int_0^t \left(c''(W_s^{(\ell)}) - E[c''(W_s)]\right) ds$$

that can be proved to be tight in $C([0, T])$. The conclusion then follows since the sequence $\{\xi_m\}$ is tight in $C([0, T])$ as the sum of two tight sequences in $C([0, T])$, and the finite-dimensional distributions of $\xi_m$ converge weakly to a normal distribution.

5. ESTIMATION OF THE LIMITING VARIANCE

We now turn to the estimation of $\sigma^2(g)$ and the control of the fluctuations of $\sigma_n^2(g)$ for large $n$. From Proposition 2, $V_{n,m} \to Z_m$ weakly as $n \to \infty$, where the limiting process takes the form

$$Z_m(t) = t\sigma^2(g) + \frac{1}{\sqrt{m}}\,\xi_m(t) - \frac{\sigma^2(g)}{m} B_t^2 \tag{18}$$

with $c(x) = \sigma^2(g)x^2$. Hence for $n$ large enough, $V_{n,m}(t) \approx Z_m(t)$ for all $t \in [0, T]$. Clearly, such an approximation can hold only when the finite-dimensional distributions of the process $Y_{n,g}$ [defined in (5)] over $[0, T]$ are approximately normal.
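A quick numerical sanity check of the covariance $a(t)$ in Theorem 1, at a fixed time $t$ (our illustration, with $\sigma^2(g) = 1$): since $W_t \sim N(0, t)$, each summand $c(W_t) - E[c(W_t)] = \sigma^2(W_t^2 - t)$ has variance $\sigma^4\operatorname{var}(W_t^2) = 2\sigma^4 t^2 = a(t)$, and the variance is preserved exactly by the normalized sum $\xi_m(t)$ for every $m$:

```python
import math
import random

def xi_variance_mc(sigma2=1.0, t=1.0, m=150, reps=1500, seed=7):
    """Monte Carlo estimate of var(xi_m(t)), where
    xi_m(t) = m^{-1/2} * sum_l [c(W_t^{(l)}) - E c(W_t)] with c(x) = sigma2*x^2.
    The exact value is a(t) = 2*sigma2^2*t^2 for every m."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        s = 0.0
        for _ in range(m):
            w = rng.gauss(0.0, math.sqrt(t))   # W_t ~ N(0, t)
            s += sigma2 * (w * w - t)          # c(W_t) - E[c(W_t)]
        vals.append(s / math.sqrt(m))
    mean = sum(vals) / reps
    return sum((v - mean) ** 2 for v in vals) / reps
```

For $\sigma^2 = 1$ and $t = 1$ the exact value is $a(1) = 2$; the Monte Carlo estimate fluctuates around it with relative error of order $\sqrt{2/\text{reps}}$.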
Although checking the approximate normality of all the finite-dimensional distributions would be intractable, at least the CLT control method presented in Chauveau and Diebolt[5] should be applied to determine a range of sufficiently large values of $n$ for all practical purposes. In addition, $\xi_m \to W_a$ in distribution as $m \to \infty$, so that for $m$ large enough the term of order $O(1/m)$ in (18) becomes negligible, and we can consider the approximation

$$V_{n,m}(t) \approx t\sigma^2(g) + \frac{1}{\sqrt{m}} W_{a(t)}, \quad t \in [0, T] \tag{19}$$

where $W_{a(t)} = (W_a)_t$. This leads to estimating the variance by

$$\hat\sigma_{n,m,g}^2(t) = \hat\sigma_g^2(t) = \frac{V_{n,m}(t)}{t} \approx \sigma^2(g) + \frac{1}{\sqrt{m}}\,\frac{W_{a(t)}}{t} \tag{20}$$

where the process $W_{a(t)}/t$ has covariance function

$$\operatorname{cov}\left(\frac{W_{a(s)}}{s}, \frac{W_{a(t)}}{t}\right) = 2\sigma^4(g)\,\frac{s^2 \wedge t^2}{st} = 2\sigma^4(g)\,\frac{s \wedge t}{s \vee t} \tag{21}$$

5.1. Control of the Fluctuations of the Estimate

If we make the change of variables $s = Te^{-u}$ and $t = Te^{-v}$ for $u > 0$, $v > 0$, then the process $U_v = W_{a(t)}/t$ has covariance function

$$\operatorname{cov}(U_u, U_v) = 2\sigma^4(g)\exp(-|u - v|)$$

hence it is an Ornstein-Uhlenbeck process. There exist available results concerning the distribution of the supremum of this process over compact intervals of time (see, e.g., DeLong[13]). These results can be used to construct approximate confidence bands for the fluctuations, over compact subintervals of $(0, T]$, of the estimate $\hat\sigma_g^2(t)$ given by (20). We should keep in mind that the validity of such bands is asymptotic, and that trying to control the fluctuations of some estimate of $\sigma^2(g)$ without having first checked the approximate normality of the $S_n^{(\ell)}(\bar g)$'s makes no sense. For some $0 < t_1 < t_2 \le T$ we can define the 95% (say) confidence band of $W_{a(t)}/t$ corresponding to $[t_1, t_2]$:

$$P\left[\sup_{t_1 \le t \le t_2} \left|\frac{W_{a(t)}}{t}\right| \ge w_{95\%}\right] = 5\%$$

Suppose that approximate normality has been positively checked (e.g., as in Chauveau and Diebolt[5]).
If the estimate $\hat\sigma_g^2(t)$ remains for $t_1 \le t \le t_2$ in its 95% asymptotic confidence band,

$$\sup_{t_1 \le t \le t_2} \left|\hat\sigma_g^2(t) - \sigma^2(g)\right| \le \frac{1}{\sqrt{m}}\, w_{95\%}$$

where the unknown $\sigma^2(g)$ has to be replaced by $(t_2 - t_1)^{-1}\int_{t_1}^{t_2} \hat\sigma_g^2(t)\, dt$, then we can expect that it is reliable.

5.2. Reduction of Variance of the Estimator

The natural way to use the estimator (20) is to compute it at time $t = 1$, i.e., after $n$ iterations of the parallel chains. In this case, $\hat\sigma_g^2(1)$ is simply $\hat\sigma_{n,m}^2(g)$ defined by (8), and its variance is

$$\operatorname{var}(\hat\sigma_g^2(1)) = \operatorname{var}(V_{n,m}(1)) \approx \frac{2\sigma^4(g)}{m}$$

This variance can be reduced by taking advantage of our knowledge of the covariance function of the limiting process (20). We suggest using a weighted average of estimates $\hat\sigma_g^2(t_i)$ computed at $p$ different times $t_i$, $i = 1, \ldots, p$. For $t = (t_1, \ldots, t_p)$ and weights $w = (w_1, \ldots, w_p)$, the estimator becomes

$$\hat\sigma_g^2(w, t) = \sum_{i=1}^p w_i\, \frac{V_{n,m}(t_i)}{t_i}, \quad w \in (0, 1)^p, \quad \sum_{i=1}^p w_i = 1$$

and has variance

$$\operatorname{var}(\hat\sigma_g^2(w, t)) \approx \frac{1}{m}\sum_{i=1}^p w_i^2 \operatorname{var}\left(\frac{W_{a(t_i)}}{t_i}\right) + \frac{2}{m}\sum_{i<j} w_i w_j \operatorname{cov}\left(\frac{W_{a(t_i)}}{t_i}, \frac{W_{a(t_j)}}{t_j}\right)$$

Our objective is then to find $(w, t)$ that minimizes this variance. A natural way to reduce the complexity of this problem is to notice that, from (21), the above covariances only depend on the ratios $t_i/t_j$ for $t_i < t_j$. Hence, if we choose the time points $t_i$ so that $t_{i+1}/t_i = q$ is constant, the optimization only relies on the parameters $(w, q)$. Also, the smallest point $t_1$ has to correspond to a number $[nt_1]$ of iterations large enough for our approximation via the process $W_a$ to be valid. For example, we can restrict ourselves to $t_1 = 1/2$, or $t_1 = 1/3$, as we will see later. We thus propose the following scheme for $p = 2k + 1$ points:

$$t_1 = \frac{1}{q^k}, \quad t_i = q\,t_{i-1}, \quad i = 2, \ldots, 2k+1$$

so that the middle point $t_{k+1} = 1$, the largest one $t_{2k+1} = q^k$, and

$$\operatorname{cov}\left(\frac{W_{a(t_i)}}{t_i}, \frac{W_{a(t_{i+j})}}{t_{i+j}}\right) = 2\sigma^4(g)\,\frac{t_i}{t_{i+j}} = 2\sigma^4(g)\, r^j$$

where $r = 1/q < 1$.
With this setting, the variance of the estimator becomes

$$\operatorname{var}(\hat\sigma_g^2(w, r)) \approx \frac{2\sigma^4(g)}{m}\, F_r(w)$$

where $F_r$ is the positive definite quadratic form

$$F_r(w) = \sum_{i=1}^{2k+1} w_i^2 + 2r\sum_{i=1}^{2k} w_i w_{i+1} + 2r^2\sum_{i=1}^{2k-1} w_i w_{i+2} + \cdots + 2r^{2k} w_1 w_{1+2k} = \sum_{i=1}^{2k+1} w_i^2 + 2\sum_{j=1}^{2k} r^j \sum_{i=1}^{2k+1-j} w_i w_{i+j} \tag{22}$$

Lemma 1. For any $0 < r < 1$ and integer $k \ge 1$, the solution to the minimization problem

$$w^* = \operatorname{argmin}\left\{F_r(w):\ w \in (0, 1)^{2k+1},\ \sum_{i=1}^{2k+1} w_i = 1\right\}$$

is given by

$$w_1^* = w_{2k+1}^* = \frac{1}{2k + 1 - (2k - 1)r} \tag{23}$$

$$w_i^* = \frac{1 - r}{2k + 1 - (2k - 1)r}, \quad 2 \le i \le 2k \tag{24}$$

Proof. The proposed solution $w^*$ satisfies the conditions $\sum_{i=1}^{2k+1} w_i^* = 1$ and $w^* \in (0, 1)^{2k+1}$. If we check in addition that

$$\frac{\partial F_r}{\partial w_i}(w^*) = \lambda, \quad i = 1, \ldots, 2k+1$$

for some $\lambda > 0$, then $w^*$ is the unique solution by Lagrange's method. We have

$$\frac{\partial F_r}{\partial w_1}(w) = 2\sum_{j=0}^{2k} r^j w_{j+1}$$

$$\frac{\partial F_r}{\partial w_i}(w) = 2\sum_{j=1}^{i-1} r^{i-j} w_j + 2\sum_{j=0}^{2k+1-i} r^j w_{i+j}, \quad i = 2, \ldots, 2k$$

$$\frac{\partial F_r}{\partial w_{2k+1}}(w) = 2\sum_{j=1}^{2k+1} r^{2k+1-j} w_j$$

and it is easy to check that $(\partial F_r/\partial w_i)(w^*) = 2(1 + r)\,w_1^*$ for $i = 1, \ldots, 2k+1$. ∎

To simplify the expression of $F_r(w^*)$, set $a = w_1^* = w_{2k+1}^*$ and $b = w_2^* = \cdots = w_{2k}^*$. Then the minimal value of $F_r$ (inductively given by Mathematica) is

$$F_r(w^*) = 2a^2(1 + r^{2k}) + b^2\left[(2k - 1) + 2\sum_{j=1}^{2k-2}(2k - j - 1)r^j\right] + 4ab\sum_{j=1}^{2k-1} r^j$$

so that the relative variance reduction, by comparison with the variance of the estimate using solely $t = 1$, is

$$R(k, q) = \frac{\operatorname{var}(\hat\sigma_g^2(1)) - \operatorname{var}(\hat\sigma_g^2(w^*, r))}{\operatorname{var}(\hat\sigma_g^2(1))} = 1 - F_r(w^*) = \frac{2k(1 - r)}{2k + 1 - (2k - 1)r}$$

Examples of Variance Reduction

The reduction $R(k, q)$ only depends on $(k, q)$ with $q = r^{-1} > 1$, but is constrained by the fact that $t_1 = 1/q^k$ should not be too small, so that our approximations remain valid, i.e., so that the observed Markov chains have reached an approximately stationary regime after $[nt_1]$ iterations. How small $t_1$ can be is obviously related to the total number of iterations observed, which depends essentially on the computing time available.
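The closed forms (23)-(24) and $R(k, q)$ can be checked numerically: writing $F_r(w) = \sum_{i,j} r^{|i-j|} w_i w_j$ (which is (22) regrouped), the sketch below (ours) confirms that $1 - F_r(w^*)$ matches $R(k, q)$, reproducing, e.g., the reduction 0.400 for the 3-point scheme with $q = 2$:

```python
def optimal_weights(k, q):
    """Weights (23)-(24) minimizing F_r(w) over the simplex,
    with p = 2k+1 points and r = 1/q."""
    r = 1.0 / q
    denom = 2 * k + 1 - (2 * k - 1) * r
    w = [(1.0 - r) / denom] * (2 * k + 1)   # interior weights (24)
    w[0] = w[-1] = 1.0 / denom              # boundary weights (23)
    return w

def F_r(w, q):
    """Quadratic form (22) written as sum_{i,j} r^{|i-j|} w_i w_j;
    the estimator's variance is 2 sigma^4(g) F_r(w) / m."""
    r = 1.0 / q
    p = len(w)
    return sum(w[i] * w[j] * r ** abs(i - j) for i in range(p) for j in range(p))

def reduction(k, q):
    """Relative variance reduction R(k,q) = 2k(1-r) / (2k+1-(2k-1)r)."""
    r = 1.0 / q
    return 2 * k * (1.0 - r) / (2 * k + 1 - (2 * k - 1) * r)
```

For $k = 1$, $q = 2$ the weights are $(0.4, 0.2, 0.4)$ and $1 - F_r(w^*) = R(1, 2) = 0.4$, matching the first row of Table 1.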
This means that we have to run the chains for more iterations than the duration necessary to reach the stationary regime, precisely up to $[nt_{2k+1}] = [nq^k]$ iterations. For instance, if we assume that the Markov chains have approximately reached stationarity after $n/2$ iterations and that we have run the parallel chains up to $2n$ iterations, then the time points over which the estimate can be averaged have to be within the interval $[1/2, 2]$. Thus $t_1 = 1/2$ implies that $k \le [\log(2)/\log(q)]$. Informally replacing $k$ with the real number $\log(2)/\log(q)$ in $R(k, q)$ yields a relative reduction $R$ which only depends on $q$ and reduces to

$$R(q) = \frac{2(q - 1)\log(2)}{\log(q/4) + q\log(4q)}, \quad q > 1$$

The function $R(q)$ converges to its maximum value of about 0.4094 as $q \to 1$, i.e., as $k = \log(2)/\log(q) \to \infty$. Fortunately, this limiting value is almost reached already for small values of $k$. This can be seen by plotting $q \mapsto R(q)$, and is illustrated by the simple schemes given in Table 1.

Table 1. Relative variance reduction for estimation schemes with time points $t_i \in [1/2, 2]$, $i = 1, \ldots, 2k+1$.

  Number of points p   k   q         R(q)
  3                    1   2         0.400
  5                    2   2^{1/2}   0.407
  7                    3   2^{1/3}   0.408
  9                    4   2^{1/4}   0.409

For a second example, if we assume that the process has approximately reached stationarity after $n/3$ iterations only, and that we have run it up to $3n$ iterations, then we can average the estimate over $[1/3, 3]$. Setting now $k = \log(3)/\log(q)$ leads to a maximal variance reduction equal to $\lim_{q\to 1} R(q) \approx 0.5235$. The behavior is similar to that of the first example. Simple estimation schemes are given in Table 2.

Table 2. Relative variance reduction for estimation schemes with time points $t_i \in [1/3, 3]$, $i = 1, \ldots, 2k+1$.

  Number of points p   k   q         R(q)
  3                    1   3         0.500
  5                    2   3^{1/2}   0.517
  7                    3   3^{1/3}   0.521
  9                    4   3^{1/4}   0.522

ACKNOWLEDGMENT

The authors wish to thank a referee for his insightful comments and suggestions.

REFERENCES

1. Gelfand, A.E.; Smith, A.F.M. Sampling based approaches to calculating marginal densities.
J. Am. Stat. Assoc. 1990, 85, 398–409.
2. Gilks, W.R.; Richardson, S.; Spiegelhalter, D.J. Markov Chain Monte Carlo in Practice; Chapman & Hall: London, 1996.
3. Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Springer-Verlag: London, 1993.
4. Brooks, S.P.; Roberts, G. Assessing convergence of Markov chain Monte Carlo algorithms. Stat. Comput. 1998, 8 (4), 319–335.
5. Chauveau, D.; Diebolt, J. An automated stopping rule for MCMC convergence assessment. Computation. Stat. 1999, 14 (3), 419–442.
6. Geyer, C.J. Practical Markov chain Monte Carlo. Stat. Sci. 1992, 7 (4), 473–511.
7. Robert, C.P. Convergence control techniques for Markov chain Monte Carlo algorithms. Stat. Sci. 1996, 10 (3), 231–253.
8. Guihenneuc-Jouyaux; Robert, C.P. Valid discretization via renewal theory. In Discretization and MCMC Convergence Assessment; Robert, C.P., Ed.; Lecture Notes in Statistics; Springer-Verlag: New York, 1998; Vol. 135, Chap. 4, 67–97.
9. Priestley, M.B. Spectral Analysis and Time Series; Academic: London, 1981.
10. Green, P.J.; Han, X.-L. Metropolis methods, Gaussian proposals, and antithetic variables. In Lecture Notes in Statistics; Springer-Verlag: New York, 1992; Vol. 74, 142–164.
11. Billingsley, P. Convergence of Probability Measures; John Wiley & Sons: New York, 1968.
12. Dacunha-Castelle, D.; Duflo, M. Probability and Statistics; Springer-Verlag: New York, 1986; Vol. 2.
13. DeLong, D.M. Crossing probabilities for a square root boundary by a Bessel process. Commun. Stat. Theory 1981, A10, 2197–2213.

Received January 23, 2001
Revised October 11, 2002
Accepted April 7, 2003