PEOPLE
MARIE CURIE ACTIONS
International Outgoing Fellowships (IOF)
Call: FP7-PEOPLE-IOF-2008
Project Title: Control Variates for Markov Chain Monte Carlo Variance Reduction
“CVMCMC”
Research Fellow: Ioannis Kontoyiannis
Host Supervisor: Petros Dellaportas
Project Number: 235837
Project website: http://pages.cs.aueb.gr/users/yiannisk/CVMCMC/
WORK PROGRESS AND ACHIEVEMENTS DURING PERIOD 1
During this period, the project has fully achieved its objectives and technical goals, as stated in the original
proposal and detailed below. All of the scientific and technical achievements described here are presented in great
detail in the following three documents, the first two of which are available on the project’s public website.
[1] P. Dellaportas and I. Kontoyiannis. “Notes on using control variates for estimation with reversible MCMC
samplers.” Technical report, July 2009. Available online: http://arxiv.org/abs/0907.4160 .
[2] P. Dellaportas and I. Kontoyiannis. “Control variates for reversible MCMC samplers.” Preprint, submitted for
publication, currently under revision.
[3] Z. Tsourti. Ph.D. thesis draft, jointly supervised by I. Kontoyiannis and P. Dellaportas. Expected graduation
date: December 2010.
1 Overview of progress in line with the original work plan
Below we indicate how each of the steps and deliverables stated in the original work plan for this period has been
accomplished.
WP1 Literature review (months 1-3)
Review and analysis of the existing state-of-the-art methods for reducing the variance of Markov chain Monte
Carlo (MCMC) algorithms. Review and analysis of the most widely used MCMC algorithms and the main,
prototypical examples of important applications where MCMC methods fail due to slow convergence.
Deliverable 1 (3rd month): Initial Outline Report: State-of-the-art.
An extensive and very thorough literature review was carried out, primarily by the research fellow Kontoyiannis.
The relevant literature, as well as the difficulties and drawbacks of existing methodological proposals, is described
in [1,2] and in the Ph.D. thesis [3], currently under preparation. A number of relevant papers are also listed, together
with links to the full papers, on the project’s website.
A significant contribution prompted by this literature search was the observation that the way control variates
have been used in the vast majority of existing works is not merely suboptimal: the loss in efficiency (as measured
by the increase in the estimation variance), compared to the optimal method which we subsequently introduce and
apply, can be arbitrarily large. This is justified mathematically and also demonstrated via simulation examples.
WP2 Application of proposed methodology to Gibbs samplers (months 3-9)
Examination of the performance of the general methodology proposed to all major families of Gibbs sampling
algorithms appearing in applications, including all so-called ‘conjugate’ Gibbs samplers, as well as numerous
versions of Metropolis-within-Gibbs samplers that appear in applied Bayesian statistical analyses, again with
particular emphasis on cases where MCMC methods fail due to slow convergence. Examples will span all
major areas where these algorithms are used, including Gaussian mixture modeling, binary/multinomial probit
models, estimation and model selection in (discrete and continuous) multiple regression studies, and general
hierarchical (linear and nonlinear) Bayesian models.
Deliverable 2 (9th month): Progress report/journal paper draft: Control Variates for Gibbs Samplers in Statistical Applications.
The bulk of the research fellow’s research effort during period 1 has been devoted to this activity. As described
below, and in much more detail in [1,2], we have been very successful in this direction.
First, a clear theoretical foundation has been laid out for the use of control variates and their precise mathematical
properties in the context of general estimation problems using Markov chain samplers.
Second, for the case of ‘conjugate’ Gibbs samplers, a specific black-box methodology has been developed which
is, in most cases, guaranteed to offer a very significant improvement in the accuracy of the estimation process, at a
very small computational cost. This methodology is based on the use of Henderson’s general construction, combined
with two major new results: (i) A new, easily computable and provably consistent estimator is introduced for the
value of the optimal parameter vector for any collection of control variate functions; and (ii) For a special case
of a conjugate Gibbs sampler which can be viewed as a high-dimensional approximation of most such MCMC
algorithms used in many statistical studies, we solve the associated Poisson equation exactly, offering a proposal for
the construction of control variates which is provably optimal. Note that this is perhaps the first realistic example
of a nontrivial Markov chain that naturally arises in applications, for which the solution to the Poisson equation has
been computed exactly.
Third, we have tested this methodology in numerous examples of applied Bayesian inference involving experimental
data, and we have found that it is not only generally worthwhile to use, but typically very effective. Our
experiments include, among many others: a large hierarchical normal linear model with data from a laboratory
study of the growth of young rats; a hybrid Metropolis-within-Gibbs sampler with provably slow convergence; a
log-linear model employed in a hypertension/obesity/alcohol-consumption study; and a two-threshold
autoregressive model of volatility data.
WP3 Generalization of proposed methodology to Metropolis samplers (months 9-20)
This is the core phase of the project, with two main goals: First, to extend and generalize the proposed control
variates ideas and techniques to Metropolis samplers. This is a highly nontrivial, technically challenging,
but also fascinating and potentially extremely influential task. Second, to apply the methodology
developed in WP2 and WP3 to difficult statistical analyses based on real empirical data, in conjunction
with scientists from the corresponding fields of scientific interest.
Deliverables 3.1 and 3.2 are scheduled for the 20th month.
This activity is well under way. The extension of the basic methodology mentioned in WP2 from conjugate Gibbs
samplers to general Metropolis-Hastings samplers is a highly nontrivial task. There are several difficulties, the most
severe of which is that Henderson’s proposal for the construction of control variates – which was the basis of all of the
results described so far – cannot be carried out for the majority of Metropolis-like MCMC algorithms.
We are currently investigating several ways through which this difficulty can be overcome. [This direction is also
expected to form a significant part of the Ph.D. thesis of our student, Zoi Tsourti.] At the moment we are pursuing
three distinct directions: (a) The use of direct Monte Carlo sampling as a substitute for the conditional expectation
needed in order to define Henderson-style control variates; (b) The use of importance sampling at the post-processing
stage for the same purpose; and (c) The application of Kontoyiannis and Meyn’s “screened estimation” idea to the
case of Metropolis samplers. Preliminary experiments show very encouraging results in all three cases.
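To make direction (a) concrete, the following minimal Python sketch (our own illustration, not code from the project; all function names are hypothetical) approximates the one-step expectation PG(x) under a Metropolis-Hastings kernel by direct Monte Carlo over the proposal distribution, in place of the closed-form computation available for conjugate Gibbs samplers:

import numpy as np

def pg_monte_carlo(x, G, log_pi, propose, log_q_ratio, m, rng):
    """Monte Carlo substitute for PG(x) = E[G(X_1) | X_0 = x] under a
    Metropolis-Hastings kernel: draw m proposals y ~ q(x, .) and average
    alpha(x, y) G(y) + (1 - alpha(x, y)) G(x), where alpha(x, y) is the
    usual acceptance probability."""
    total = 0.0
    for _ in range(m):
        y = propose(x, rng)
        # log alpha(x, y) = min(0, log pi(y) - log pi(x) + log[q(y, x)/q(x, y)])
        alpha = np.exp(min(0.0, log_pi(y) - log_pi(x) + log_q_ratio(x, y)))
        total += alpha * G(y) + (1.0 - alpha) * G(x)
    return total / m

For a symmetric random-walk proposal, log_q_ratio(x, y) is identically zero, and the estimate is unbiased for PG(x) at the cost of m extra proposal draws per state.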
2 Summary of progress towards scientific objectives
The fundamental scientific objectives of this project as described in the original proposal are summarized in the
following three tasks.
1. Theory: Develop the necessary mathematical foundation for the application of control variates to Markov chains
in general, and to important classes of Markov chain Monte Carlo (MCMC) algorithms in particular.
2. Methodology: Introduce generic methodologies for the direct use of control variates in all the major families of
MCMC algorithms currently used in applications.
3. Applications: Make significant scientific contributions, based on the above methodology, across the range of
scientific disciplines that require intensive use of statistical computing via MCMC, including problems in genetics,
finance, engineering, biology and medicine.
As mentioned in Section 1 above, during this period, significant progress in several important directions has been
made in all three tasks. Our results are developed in detail in the two manuscripts [1,2], which have already been
made publicly available. Below, after a brief introduction, we describe two of our main theoretical/methodological
results, and we also briefly present an application of our proposed methodology. Note that the results below are only
a representative sample; a much more thorough analysis, with significantly more results in terms of the theory, the
methodology and the applied aspects of this work is given in [2].
Markov chains, control variates and the Poisson equation. Suppose {Xn } is a discrete-time Markov chain with
initial state X0 = x, taking values in the state-space X. For our purposes, X will always be a (discrete or continuous)
subset of Rd and B will denote the set of all (Borel) measurable subsets of X. The distribution of {Xn } is described
by its transition kernel P(x, dy), where P(x, A) := Pr{X_{k+1} ∈ A | X_k = x}, for x ∈ X, A ∈ B.
In many applications where it is desirable to compute the expectation Eπ(F) := π(F) := ∫ F dπ of some function
F : X → R with respect to some probability measure π on (X, B), it turns out that, although the direct computation of
π(F) is impossible and we cannot even produce samples from π, we can construct an easy-to-simulate Markov chain
{Xn } which has π as its unique invariant measure. Under appropriate conditions, the distribution of {Xn } converges
to π . This can be made precise in several ways. For example, writing PF for the function,
PF(x) = E[F(X1 ) | X0 = x],
we have that, for any initial condition x, Pn F(x) := E[F(Xn ) | X0 = x] → π (F), as n → ∞, for an appropriate class of
functions F : X → R. Furthermore, the rate of this convergence can be quantified by the function,
F̂(x) = ∑_{n=0}^{∞} [PⁿF(x) − π(F)],        (1)
where F̂ is easily seen to satisfy the Poisson equation for F, namely,

PF̂ − F̂ = −F + π(F).        (2)

Indeed, applying P to (1) term by term gives PF̂(x) = ∑_{n=0}^{∞} [Pⁿ⁺¹F(x) − π(F)], and subtracting (1) itself leaves only the n = 0 term, −[F(x) − π(F)].
The above results describe how the distribution of Xn converges to π . In terms of estimation, the quantities of interest
are the ergodic averages,
μn(F) := (1/n) ∑_{i=0}^{n−1} F(Xi).        (3)
Again, under appropriate conditions on the chain and the function F, the ergodic theorem holds,
μn (F) → π (F), a.s., as n → ∞.
Moreover, the rate of this convergence is quantified by an associated central limit theorem, which states that,
√n [μn(F) − π(F)] = (1/√n) ∑_{i=0}^{n−1} [F(Xi) − π(F)] −→ N(0, σF²) in distribution, as n → ∞,
where σF², the asymptotic variance of F, is given by σF² := lim_{n→∞} Varπ(√n μn(F)). Alternatively, it can be expressed in terms of the solution F̂ to Poisson’s equation as,
σF² = π(F̂² − (PF̂)²).        (4)
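In applications, σF² must itself be estimated from the simulation output. The report does not prescribe a particular estimator; one standard choice, shown in the minimal Python sketch below (our own, with an illustrative function name), is the method of batch means:

import numpy as np

def batch_means_variance(F_vals, num_batches=30):
    """Estimate the asymptotic variance sigma_F^2 from a single run of
    values F(X_0), ..., F(X_{n-1}): split the run into equal-length
    batches and rescale the sample variance of the batch averages by
    the batch length."""
    n = len(F_vals)
    b = n // num_batches                      # batch length
    batches = np.asarray(F_vals)[: b * num_batches].reshape(num_batches, b)
    return b * batches.mean(axis=1).var(ddof=1)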
Suppose that for this Markov chain {Xn } with transition kernel P and invariant measure π , we use the ergodic
averages μn(F) defined in (3) to estimate the mean π(F) of some function F under π. In many applications, although
the estimates μn(F) converge to π(F) as n → ∞, the associated asymptotic variance σF² is large and the convergence
is very slow. In order to reduce the variance, we employ the idea of using control variates, as in the case of simple
Monte Carlo with i.i.d. samples. That is, given a function U : X → R for which we know that π (U) = 0, define,
Fθ = F − θU,        (5)
and consider the modified estimators,
μn (Fθ ) = μn (F) − θ μn (U).
We will concentrate exclusively on the following class of functions U proposed by Henderson (1997). For an
arbitrary G : X → R with π (|G|) < ∞, define,
U = G − PG.
The invariance of π under P and the integrability of G immediately imply that π (U) = 0. Therefore, the ergodic
theorem guarantees that the {μn (Fθ )} are consistent with probability one, and it is natural to seek particular choices
for U and θ so that the asymptotic variance σFθ² of the modified estimators is significantly smaller than the variance
σF² of the standard ergodic averages μn(F).
Suppose, at first, that we have complete freedom in the choice of G, so that we may set θ = 1 without loss of
generality. Then we wish to make the asymptotic variance of F − U = F − G + PG as small as possible. But, in
view of the Poisson equation (2), we see that the choice G = F̂ yields, F −U = F − F̂ + PF̂ = π (F), which has zero
variance. So our basic rule of thumb is:
Choose a control variate U = G − PG with G ≈ F̂.
As mentioned above, it is typically impossible to compute F̂ exactly for realistic models in applications. But it may
be possible to come up with a guess for a vector of functions (G1 , G2 , . . . , Gk ) such that a linear combination of the
Gi approximates F̂.
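As a toy illustration of the rule of thumb (our own example, not taken from the report): for the reversible AR(1) chain X_{n+1} = ρXn + εn with εn ~ N(0, 1) and F(x) = x, formula (1) gives F̂(x) = ∑_{n≥0} ρⁿx = x/(1 − ρ), so the choice G = F̂ is available in closed form and the modified estimator has exactly zero variance:

import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.9, 10_000

# Simulate the AR(1) chain; its invariant law is pi = N(0, 1/(1 - rho^2)).
X = np.empty(n)
X[0] = 0.0
for i in range(1, n):
    X[i] = rho * X[i - 1] + rng.standard_normal()

F = X                        # estimate pi(F) = 0
G = X / (1.0 - rho)          # G = F_hat, computable exactly here
PG = rho * X / (1.0 - rho)   # PG(x) = E[G(X_1) | X_0 = x] = rho x / (1 - rho)
U = G - PG                   # Henderson control variate; pi(U) = 0

print(np.mean(F))            # standard ergodic average: noisy around 0
print(np.mean(F - U))        # modified estimator (theta = 1): exactly 0

Here U = G − PG works out to U(x) = x = F(x), so F − U vanishes identically: the zero-variance phenomenon described above.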
First main result. We begin by examining the simplest family of reversible MCMC algorithms that can be
used in conjunction with the methodology developed so far, that is, algorithms for which one-step expectations
PG(x) = E[G(Xn+1)|Xn = x] can be readily computed in closed form. This is the collection of random-scan Gibbs
samplers, with a target density π that typically arises as the posterior distribution of the parameters in a Bayesian
inference study. Moreover, since for large sample sizes the posterior is approximately normal, as our starting point
we take the case where π is a multivariate normal distribution. It is somewhat remarkable that, in this case, the
Poisson equation can be solved explicitly, and its solution is of a particularly simple form:
Theorem 1. Let {Xn } denote the Markov chain constructed from the random-scan Gibbs sampler used to simulate
from an arbitrary (nondegenerate) multivariate normal distribution π = N(μ, Σ) in R^k. If the goal is to estimate the
mean of the first component of π, then letting F(x) = x^(1) for each x = (x^(1), x^(2), . . . , x^(k))ᵗ ∈ R^k, the solution F̂ of the
Poisson equation for F can be expressed as a linear combination of the basis functions Gj(x) := x^(j), x ∈ R^k, 1 ≤ j ≤ k:

F̂ = ∑_{j=1}^{k} θj Gj.
Moreover, writing Q = Σ⁻¹, the coefficient vector θ is given by the first row of the matrix k(I − A)⁻¹, where A has
entries Aij = −Qij/Qii for 1 ≤ i ≠ j ≤ k, Aii = 0 for all i, and (I − A) is always invertible.
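For concreteness, the prescription of Theorem 1 is a few lines of linear algebra; the following minimal Python sketch (ours; the function name and example covariance matrix are illustrative) computes the coefficient vector:

import numpy as np

def theorem1_coefficients(Sigma):
    """Coefficients theta_1, ..., theta_k with F_hat = sum_j theta_j G_j,
    G_j(x) = x^(j), for the random-scan Gibbs sampler targeting N(mu, Sigma):
    theta is the first row of k (I - A)^{-1}, where Q = Sigma^{-1},
    A_ij = -Q_ij / Q_ii for i != j, and A_ii = 0."""
    k = Sigma.shape[0]
    Q = np.linalg.inv(Sigma)
    A = -Q / np.diag(Q)[:, None]      # divide row i by Q_ii
    np.fill_diagonal(A, 0.0)
    return k * np.linalg.inv(np.eye(k) - A)[0]

# Example: a correlated 3-dimensional normal target.
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
theta = theorem1_coefficients(Sigma)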
Second main result. Once a function G is selected and the control variate U = G − PG is defined, we form the
modified estimators μn(Fθ) with respect to the function Fθ = F − θU = F − θG + θPG, defined in (5).
The next task is to choose θ = θ∗ so that the resulting variance, σθ² := σFθ² = π(F̂θ² − (PF̂θ)²), is minimized.
Standard computations show that,
θ∗ = π(F̂G − (PF̂)(PG)) / π(G² − (PG)²).        (6)
Once again, this expression depends on F̂, so it is not immediately clear how to estimate θ∗ from the data {Xn}. In a
way, it is this very difficulty that has held back the use of control variates in MCMC. Nevertheless,
as we show below, this difficulty can be overcome.
Theorem 2. Suppose we have a reversible chain, and write F̄ = F − π(F) for the centered version of F. Then,

θ∗ = θ∗rev := π((F − π(F))(G + PG)) / π(G² − (PG)²).        (7)
Moreover, consider the natural estimator for θ∗ ,
θ̂n,rev =
μn (F − μn (F))(G + PG)
μn (G2 ) − μn ((PG)2 )
,
and denote the resulting estimator μn (Fθ̂n,rev ) based on the control variate U = G − PG and the parameter θ̂n,rev ,
μn,rev (F) := μn (Fθ̂n,rev ) = μn (F − θ̂n,revU).
Then, under essentially minimal ergodicity assumptions, the estimator θ̂n,rev is consistent, and the asymptotic
variance of the estimator μn,rev(F) is the smallest possible.
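In code, the estimator of Theorem 2 requires only the simulated values of F, G and PG along the run. A minimal Python sketch (ours; function names are illustrative):

import numpy as np

def theta_hat_rev(F_vals, G_vals, PG_vals):
    """theta_hat_{n,rev}: ratio of mu_n((F - mu_n(F))(G + PG)) to
    mu_n(G^2) - mu_n((PG)^2), from arrays of F(X_i), G(X_i), PG(X_i)."""
    Fc = F_vals - F_vals.mean()
    num = np.mean(Fc * (G_vals + PG_vals))
    den = np.mean(G_vals**2) - np.mean(PG_vals**2)
    return num / den

def mu_n_rev(F_vals, G_vals, PG_vals):
    """Modified estimator mu_{n,rev}(F) = mu_n(F - theta_hat U), U = G - PG."""
    theta = theta_hat_rev(F_vals, G_vals, PG_vals)
    return np.mean(F_vals - theta * (G_vals - PG_vals))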
An example of statistical inference. In an early application of MCMC in Bayesian statistics, Gelfand et al. (1990)
illustrate the use of Gibbs sampling for inference in a large hierarchical normal linear model. The data consist of
N = 5 weekly weight measurements of k = 30 young rats, whose weight is assumed to increase linearly in time, so
that,
Yij ∼ N(αi + βi xij, σc²),        1 ≤ i ≤ k, 1 ≤ j ≤ N,

where the Yij are the measured weights and the xij denote the corresponding rats’ ages (in days). The population
structure and the conjugate prior specification are assumed to be of the customary normal-Wishart-inverse gamma
form: For i = 1, 2, . . . , k,
φi = (αi, βi)ᵗ ∼ N(μc, Σc),
μc = (αc, βc)ᵗ ∼ N(η, C),
Σc⁻¹ ∼ W((ρR)⁻¹, ρ),
σc² ∼ IG(ν0/2, ν0τ0²/2),

with known values for η, C, ν0, ρ, R and τ0.
The posterior has 2k + 2 + 3 + 1 = 66 parameters, and MCMC samples from ((φi ), μc , Σc , σc2 ) can be generated
via conjugate Gibbs sampling since the full conditional densities of all four parameter blocks (φi ), μc , Σc , and σc2 ,
are easily identified explicitly in terms of standard distributions. For example, conditional on (φi ), Σc , σc2 and the
observations (Yij), the means μc have a bivariate normal distribution with covariance matrix V := (kΣc⁻¹ + C⁻¹)⁻¹
and mean,

V (Σc⁻¹ ∑_i φi + C⁻¹η).        (8)
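For illustration, a minimal Python sketch (ours; variable names are hypothetical) of the corresponding Gibbs update for μc, drawn from the bivariate normal full conditional with covariance V and mean (8):

import numpy as np

def update_mu_c(phi, Sigma_c, C, eta, rng):
    """One Gibbs update of mu_c: a draw from
    N(V (Sigma_c^{-1} sum_i phi_i + C^{-1} eta), V), where
    V = (k Sigma_c^{-1} + C^{-1})^{-1}, as in (8)."""
    k = phi.shape[0]                  # phi: (k, 2) array of rows (alpha_i, beta_i)
    Sc_inv = np.linalg.inv(Sigma_c)
    C_inv = np.linalg.inv(C)
    V = np.linalg.inv(k * Sc_inv + C_inv)
    mean = V @ (Sc_inv @ phi.sum(axis=0) + C_inv @ eta)
    return rng.multivariate_normal(mean, V)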
Suppose, first, that we wish to estimate the posterior mean of αc . We use a four-block, random-scan Gibbs sampler, which at each step selects one of the four blocks at random and replaces the current values of the parameter(s) in
that block with a draw from the corresponding full conditional density. We set F((φi), μc, Σc, σc²) = αc, and construct
control variates according to the methodology suggested by Theorem 1, by first defining 66 functions Gj and then
computing the one-step expectations PGj. For example, numbering each Gj with the corresponding index in the
order in which it appears above, we have G61((φi), μc, Σc, σc²) = αc, and from (8) we obtain,
PG61((φi), μc, Σc, σc²) = (3/4)αc + (1/4) [(kΣc⁻¹ + C⁻¹)⁻¹ (Σc⁻¹ ∑_i φi + C⁻¹η)]^(1),

where [·]^(1) denotes the first component of the vector.
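The one-step expectation above is then immediate to evaluate in code; a sketch of ours, reusing the ingredients of the full conditional of μc:

import numpy as np

def PG61(alpha_c, phi, Sigma_c, C, eta):
    """One-step expectation of G_61 = alpha_c under the four-block random-scan
    sampler: with probability 3/4 the mu_c block is left unchanged, and with
    probability 1/4 it is redrawn from its full conditional, whose mean is (8)."""
    k = phi.shape[0]
    Sc_inv = np.linalg.inv(Sigma_c)
    C_inv = np.linalg.inv(C)
    V = np.linalg.inv(k * Sc_inv + C_inv)
    cond_mean = V @ (Sc_inv @ phi.sum(axis=0) + C_inv @ eta)
    return 0.75 * alpha_c + 0.25 * cond_mean[0]   # first component corresponds to alpha_c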
Figure 1 shows a typical realization of the sequence of estimates obtained by the standard estimators μn (F) and
by the adaptive estimators μn,rev (F), for n = 50000 simulation steps. The variance of μn,rev (F) was found to be
approximately 30 times smaller than that of μn (F).
[Figure 1 appears here: a trace of the estimates over 50000 simulation steps, with values ranging between approximately 106.4 and 107.2.]
Figure 1: The sequence of the standard ergodic averages is shown as a solid blue line and the adaptive estimates
μn,rev (F) as red diamonds. For visual clarity, the values μn,rev (F) are plotted only every 1000 simulation steps. The
“true” value of the posterior mean of αc, shown as a horizontal line, was computed after n = 10⁷ Gibbs steps and
taken to be ≈ 106.6084.