
Benchmark Estimation for Markov Chain Monte Carlo Samples
Subharup Guha, Steven N. MacEachern and Mario Peruggia
[email protected]
May 2002; revised November 2002
Abstract
While studying various features of the posterior distribution of a vector-valued parameter using
an MCMC sample, a subsample is often all that is available for analysis. The goal of benchmark
estimation is to use the best available information, i.e. the full MCMC sample, to improve future
estimates made on the basis of the subsample. We discuss a simple approach to do this and provide
a theoretical basis for the method. The methodology and benefits of benchmark estimation are
illustrated using a well-known example from the literature. We obtain as much as an 80% reduction
in MSE with the technique based on a 1-in-10 subsample and show that greater benefits accrue with
the thinner subsamples that are often used in practice.
1 Introduction
While using an MCMC sample to investigate the posterior distribution of a vector-valued parameter θ,
many features of interest have the representation E[g(θ)] for some function g(θ). A subsample of
the MCMC output is often all that is retained for further investigation of the posterior distribution.
Subsampling is often necessary in computationally intensive or real-time, interactive investigations where
speed is essential. Examples include expensive plot processing and examination of changes in the prior
(sensitivity analysis), likelihood (robustness) or data (case influence). Typically, such studies would
include hundreds or thousands of changes to the model. Another reason for subsampling is expensive
storage space. Practical constraints, like limited disk space available to users of shared computing
resources, often make it infeasible to store the entire sample of MCMC draws when the parameter has
a large number of components. A subsample is then retained for future investigation of the posterior
distribution.
Geyer (1992) and MacEachern and Berliner (1994) show that subsampling the output of the Markov
chain in a systematic fashion can only lead to poorer estimation of E[g(θ)]. The goal of benchmark
estimation is to produce a number of estimates based on the entire MCMC sample, and to then use
these to improve other estimates made on the basis of the subsample. The benchmark estimates must
be quick and easy to compute. They must also be compatible with quick computations for further,
more (computationally) expensive analyses based on the eventual subsample.
Several motivating perspectives are useful for understanding and investigating various aspects of benchmark estimation. The point of view of calibration estimation, developed in the sampling literature
to improve survey estimates (Deville and Särndal, 1992; Vanderhoeft, 2001), helps to bring all these
perspectives together into a unified framework. In calibration estimation, a probability sample from
a finite population is used to compute estimates of population quantities of interest. The (regression
type) estimators are built as weighted averages of the observations in the sample, with the weights determined so as to satisfy a (vector-valued) calibration equation which forces the resulting estimators to
produce exact estimates of known population features. Usually, the constraints imposed by the calibration equation do not determine a unique set of weights. Thus, among the sets of weights satisfying the
calibration equation, one chooses the set that yields weights that are as close as possible (with respect
to some distance metric) to a fixed set of prespecified (typically uniform) weights.
To cast MCMC benchmark estimation into the framework of calibration estimation, we regard
the MCMC output as a finite population and a 1-in-k systematic subsample as a probability sample
drawn from the finite population. This systematic sampling design gives each unit in the population a
probability 1/k of being selected, though many joint inclusion probabilities are 0. In this setting, the
(vector-valued) benchmark E[h(θ)], for which the subsample estimate is forced to match the full sample
estimate, corresponds to the auxiliary information available through the calibration equation. Once the
calibration weights have been calculated, they can then be used to compute the calibration subsample
estimate of any feature E[g(θ)]. As the full MCMC sample size increases, the asymptotic performance
of these benchmark estimators matches that of the corresponding calibration estimators.
The benchmark estimators that we introduce in this paper can be shown to be calibration estimators
corresponding to appropriately chosen calibration equations and metrics. We investigate two methods
of creating weights: post-stratification and maximum entropy. In their simplest form, post-stratification
weights are derived by partitioning the parameter space into a finite number of regions and by forcing
the weighted subsample frequencies of each region to match the corresponding raw frequencies for the
entire MCMC sample. The weights are taken to be constant over the various elements of the partition
and to sum to one.
An improved version of post-stratification (and, in fact, the approach that in our experience has
generated the most successful estimators) begins with a representation of an arbitrary function g(θ) as
a countable linear combination of basis functions hj(θ): g(θ) = Σ_{j=1}^{∞} cj hj(θ). The estimand, E[g(θ)], is
expressed as the same linear combination of integrals of the basis functions, Σ_{j=1}^{∞} cj E[hj(θ)]. Splitting
the infinite series representation of g(θ) into two parts, we have a finite series which may provide a good
approximation to g(θ) and an infinite remainder series that fills out the representation of g(θ). Focusing
on the finite series, we determine the weights by forcing estimates of E[hj (θ)] based on the subsample
to match those based on the full sample. In addition, we require the weights to be constant over the
elements of a suitably chosen partition of the parameter space and to sum to one. This produces a
better estimate of E[Σ_{j=1}^{m} cj hj(θ)] than one based on the subsample alone. The improvement carries
over to estimation of E[g(θ)] when the tail of the series is of minor importance. We refer to the finite
set of basis functions as the (vector-valued) benchmark function.
In both the basic and improved post-stratification approaches, we specify enough conditions that
(for virtually all MCMC samples of reasonable size) there is a unique set of weights that would satisfy
them. Thus, from the point of view of calibration estimation, the choice of the distance metric becomes
immaterial, in the sense that any metric would yield identical weights. In this respect, our post-stratification weights arise from a degenerate instance of a problem of calibration estimation. In the
case of the maximum entropy weights, however, we do not specify enough benchmark conditions to
make the weights unique. Rather, among the sets of weights satisfying an under-determined number of
benchmark conditions, we select the set having maximum entropy and this, from the point of view of
calibration estimation, is tantamount to choosing a specific distance metric.
Benchmark estimation yields improved weighted estimators (having smaller variances) based on
MCMC subsamples. We investigate the performance of estimators based on weights determined according to post-stratification and maximum entropy methods. Substantial reductions in the variability of
the estimates occur when examining expectations of functions g(θ) that are similar to a linear combination of benchmark components. Theoretical results suggest the gains that we see in practice. The
methodology is illustrated on an example discussed by George, Makov and Smith (1993). We compare
estimation of E[g(θ)] for a variety of functions g(θ), showing that there are often substantial benefits to
the use of benchmark estimates. The extent of the improvement in estimation of E[g(θ)] for functions
that are noticeably different from the benchmarks is striking, even for values of m as small as 3 or 4.
2 A simple approach to benchmark estimation

2.1 An improved subsample estimator
Let θ ∈ Θ be a vector-valued parameter. Imagine that an MCMC sample is drawn from the posterior
distribution of θ. Call the sequence of draws θ (1) , θ(2) , . . . , θ(N ) . The draws are used to estimate some
feature of the posterior distribution. Often these features of interest can be represented as E[g(θ)] for
some (possibly vector-valued) function g(θ). The most straightforward estimator for E[g(θ)] is
Ê[g(θ)]f = (1/N) Σ_{i=1}^{N} g(θ(i)),   (1)
where the right hand subscript denotes the full sample estimator. If one selects a systematic 1-in-k
subsample of the data, the natural estimator is
Ê[g(θ)]s = (1/n) Σ_{i=1}^{n} g(θ(ki)),   (2)
where N = kn and s denotes the subsample estimator. As mentioned earlier, this form of subsampling
always leads to poorer estimation; the unweighted subsample estimator (2) has a larger variance than
the full sample estimator (1).
We wish to use the information available from the full sample to improve future estimation based
on the subsample for any feature E[g(θ)]. For an appropriately chosen (and possibly vector-valued)
function h(θ), we refer to the feature E[h(θ)] as the benchmark. We now create a weighted version of
the subsample estimator of E[g(θ)] as follows:
Ê[g(θ)]w = Σ_{i=1}^{n} wi g(θ(ki)),   (3)

where Σ_{i=1}^{n} wi = 1. The weights wi are chosen so that they force the weighted subsample benchmark
estimate to equal the full sample estimate:
Ê[h(θ)]w = Ê[h(θ)]f.   (4)
Thus Ê[h(θ)]w and Ê[h(θ)]f have the same distributions provided the weights can be constructed, and
all features of their distributions conditional on this event are the same. For a vector-valued benchmark
function, any linear combination of its coordinates results in the same estimate for both the subsample
and the full sample, and the estimators have the same distribution. In particular, the two estimators
have the same variance, and we have possibly greatly increased precision for our subsample estimator
of E[g(θ)].
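To make the construction concrete, here is a minimal Python sketch (ours, not code from the paper; the function and variable names are hypothetical) that computes the full sample estimator (1), the unweighted subsample estimator (2), and the weighted subsample estimator (3) from a stored matrix of draws. The weights are taken as given; the constructions of Section 2.2 supply weights that make (4) hold exactly.

import numpy as np

def estimators(theta, g, weights, k):
    """Full sample, unweighted subsample, and weighted subsample estimates of E[g(theta)].

    theta   : (N, p) array of MCMC draws theta(1), ..., theta(N), with N = k * n
    g       : function mapping an (m, p) array of draws to an (m,) array of values
    weights : (n,) array of subsample weights that sum to one
    """
    sub = theta[k - 1::k]               # systematic 1-in-k subsample theta(k), ..., theta(nk)
    full = g(theta).mean()              # estimator (1)
    unweighted = g(sub).mean()          # estimator (2)
    weighted = np.dot(weights, g(sub))  # estimator (3)
    return full, unweighted, weighted

# Toy usage with uniform weights; with uniform weights the benchmark constraint (4) holds
# only approximately, and the weight constructions of Section 2.2 are what enforce it exactly.
rng = np.random.default_rng(0)
draws = rng.normal(size=(10_000, 2))    # stand-in for stored MCMC output
w = np.full(1_000, 1.0 / 1_000)
print(estimators(draws, lambda th: th[:, 0], w, k=10))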
The connection between a conditionally conjugate structure and linear posterior expectation in exponential families implies that, for many popular models, quantities such as the conditional posterior
mean for a case or the conditional posterior predictive mean will be a linear function of hyperparameters. The structure of the hierarchical model enables us to use benchmark functions based on the
hyperparameters to create more accurate estimates of these quantities. The reduction in variability
when moving from Ê[h(θ)]s to Ê[h(θ)]w also appears when examining expectations of functions g(θ)
that are similar to h(θ). Functions such as a predictive variance which depend on first and second
moments will typically be closely related to benchmark functions based on the hyperparameters and so
they will be more accurately estimated with our technique.
The weighted subsample, (wi , θ(ki) ) for i = 1, 2, . . . , n, is now used in place of the unweighted
subsample. The weights act exactly as they would if arising from an importance sample, and so we
obtain weighted subsample estimates Ê[g(θ)]w for various features of interest E[g(θ)] of the posterior.
Techniques and software developed for importance samples can be used without modification for these
weighted samples.
2.2 Some methods of obtaining weights
The constraints that Σ_{i=1}^{n} wi = 1 and that Ê[h(θ)]w = Ê[h(θ)]f will not typically determine the wi.
With a single real benchmark function, we would have only two linear constraints on the wi. We
supplement the constraints with a principle that will yield a unique set of weights. The two principles
we investigate are motivated by the literatures on survey sampling and information theory.
Weights by post-stratification. Post-stratification is a standard technique in survey sampling,
designed to ensure that a sample matches certain characteristics of a population. The population
characteristics are matched by computing a weight for each unit in the sample. Large sample results
show that a post-stratified sample behaves almost exactly like a proportionally allocated stratified
sample. This type of stratification reduces the variance of estimates as compared to a simple random
sample.
In this setting, the full sample plays the role of the population while the subsample plays the role
of the sample. Thus, the essence of the technique is to partition the parameter space into (say) m
strata, and to assign the same weight to each θ(ki) in a stratum. Formally, suppose that {Θj}_{j=1}^{m} is a
finite partition of the parameter space Θ. Let Ij (θ) denote the indicator of set Θj , for j = 1, . . . , m.
The natural application of the post-stratification method takes as the benchmark function the vector
of these m indicator functions. That is, h(θ) = (I1(θ), I2(θ), . . . , Im(θ))′. We assign the same weight
to all subsample points belonging to a given stratum. Specifically, for all i such that θ(ki) ∈ Θj, we set
wi = vj, where, according to (4), the values vj are determined by

Σ_{i=1}^{n} vj Ij(θ(ki)) = (1/N) Σ_{i=1}^{N} Ij(θ(i)),   for j = 1, . . . , m.
The post-stratification weights are then obtained as:
vj = [ (1/N) Σ_{i=1}^{N} Ij(θ(i)) ] / [ Σ_{i=1}^{n} Ij(θ(ki)) ],   (5)
provided each of the strata contains at least one subsample point. As in survey sampling, with fairly
well chosen strata, the chance that any of the strata are empty of subsample points is negligible. The
intuitive description of the post-stratification weight vj is as the ratio of the proportion of full sample
points in Θj to the number of subsample points in Θj. We refer to this subsample estimator as the basic
post-stratification estimator, Ê[g(θ)]w,ps .
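A minimal sketch of this computation (ours; the stratum-labelling function and the toy data are stand-ins) that returns the weight wi = vj of (5) attached to each subsample draw:

import numpy as np

def post_stratification_weights(theta_full, theta_sub, stratum_of):
    """Basic post-stratification weights of equation (5).

    theta_full : (N, p) array of all MCMC draws
    theta_sub  : (n, p) array holding the 1-in-k systematic subsample
    stratum_of : function mapping an (m, p) array of draws to integer labels 0, ..., m-1
    Returns the (n,) array of weights w_i, with w_i = v_j for theta(ki) in stratum j.
    """
    full_lab, sub_lab = stratum_of(theta_full), stratum_of(theta_sub)
    m = int(max(full_lab.max(), sub_lab.max())) + 1
    full_prop = np.bincount(full_lab, minlength=m) / len(theta_full)  # (1/N) sum_i I_j(theta(i))
    sub_count = np.bincount(sub_lab, minlength=m)                     # sum_i I_j(theta(ki))
    if np.any((sub_count == 0) & (full_prop > 0)):
        raise ValueError("a stratum contains no subsample points; the weights (5) do not exist")
    v = np.where(sub_count > 0, full_prop / np.maximum(sub_count, 1), 0.0)
    return v[sub_lab]

# Example strata: the four quadrants defined by cutoffs on the first two coordinates.
def quadrant(theta, c1=0.0, c2=0.0):
    return (theta[:, 0] > c1).astype(int) + 2 * (theta[:, 1] > c2).astype(int)

rng = np.random.default_rng(1)
full = rng.normal(size=(10_000, 2))     # stand-in for the full MCMC sample
sub = full[9::10]                       # 1-in-10 systematic subsample
w = post_stratification_weights(full, sub, quadrant)
print(w.sum())                          # the weights sum to one by construction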
The perspective of a basis expansion of g(θ) provides a more sophisticated use of post-stratification.
Instead of using a basis formed from indicator functions (essentially a Haar basis), alternative bases
consist of functions other than indicators. An attractive basis, due to its success throughout statistics,
is the polynomial basis that generates Taylor series. Assigning equal weight to subsample points within
each given post-stratum yields n − m linear constraints on the weights, and forcing the weights to sum
to 1 provides one additional constraint. Supplementing these with a further m − 1 linear constraints
on the weights (and also with mild conditions on the posterior distribution and simulation method to
guarantee uniqueness) defines the weights. The weights are quickly obtained as a solution to a system of
m linear equations. This version of post-stratification has proven to be extremely effective in practice.
As an example of this technique, suppose that θ is a p-dimensional vector-valued parameter and that
the parameter space Θ is partitioned into m = 3 strata, namely Θ1, Θ2 and Θ3. Post-stratification assigns
the same weight vj to all subsample points belonging to stratum Θj , where j = 1, 2, 3. Conditional on
the event that no stratum is empty of subsample points, this corresponds to n − 3 linearly independent
constraints on the weights. An additional constraint on the weights is that they sum to 1.
Thus two other independent linear constraints will ensure that the weights are unique. For a choice
of two real functions, say h1 (θ) and h2 (θ), as components of the benchmark function, the benchmark
estimates are given by Ê[h1(θ)]f = (1/N) Σ_{i=1}^{N} h1(θ(i)) and Ê[h2(θ)]f = (1/N) Σ_{i=1}^{N} h2(θ(i)). For
j = 1, 2, 3 and l = 1, 2, let nj = Σ_{i=1}^{n} Ij(θ(ki)) denote the number of subsample points falling in stratum
Θj , and let blj denote the sum of the function hl (θ) evaluated at the subsample points belonging to
stratum Θj . Thus
blj = Σ_{i=1}^{n} hl(θ(ki)) Ij(θ(ki)).
Then, solving the following system of linear equations uniquely determines the weights (v1, v2, v3)
for the three strata, provided the square matrix below is invertible:

[ n1   n2   n3  ] [ v1 ]   [ 1          ]
[ b11  b12  b13 ] [ v2 ] = [ Ê[h1(θ)]f  ]
[ b21  b22  b23 ] [ v3 ]   [ Ê[h2(θ)]f  ]
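A sketch of this calculation (ours, with hypothetical names; numpy's linear solver stands in for any routine that inverts the 3 x 3 system):

import numpy as np

def improved_ps_weights(theta_full, theta_sub, stratum_of, h1, h2):
    """Weights (v1, v2, v3) for m = 3 strata and benchmark functions h1, h2.

    stratum_of maps draws to labels 0, 1, 2; h1 and h2 map an (m, p) array to an (m,) array.
    Returns the (n,) vector of weights w_i attached to the subsample points.
    """
    lab = stratum_of(theta_sub)
    ind = np.stack([lab == j for j in range(3)]).astype(float)  # 3 x n indicators I_j(theta(ki))
    A = np.vstack([ind.sum(axis=1),          # n_1, n_2, n_3
                   ind @ h1(theta_sub),      # b_11, b_12, b_13
                   ind @ h2(theta_sub)])     # b_21, b_22, b_23
    rhs = np.array([1.0, h1(theta_full).mean(), h2(theta_full).mean()])
    v = np.linalg.solve(A, rhs)              # the 3 x 3 system displayed above
    return v[lab]

rng = np.random.default_rng(2)
full = rng.normal(size=(10_000, 2))                       # stand-in for the full MCMC sample
sub = full[9::10]
three_strata = lambda th: np.digitize(th[:, 0], [-0.5, 0.5])
w = improved_ps_weights(full, sub, three_strata,
                        lambda th: th[:, 0], lambda th: th[:, 1])
print(w.sum(), w @ sub[:, 0], full[:, 0].mean())          # constraint (4) holds for h1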
Maximum entropy weights. Information theory describes, in various fashions, the amount of
information in data about a parameter or distribution. In a Bayesian context, it is often used to
describe subjective information (playing the role of data) in order to elicit a prior distribution. This
is accomplished by specifying a number of features of the distribution, typically expectations, as the
“information” about the prior. The prior is then chosen to reflect this information but no more. With
entropy defined as the negative of information, the prior which reflects exactly the desired information
is that which maximizes entropy among those priors matching the constraints.
In our setting, we borrow this technique, matching exactly the information in the full sample benchmark estimates, but no more. Let w = (w1 , w2 , . . . , wn ) be the n-tuple of weights given in (3).
Let us denote by Ω the (possibly empty) set of all weight n-tuples that satisfy (4). Thus Ω is the
set { w | wi ≥ 0 for all i, Σ_{i=1}^{n} wi = 1, Σ_{i=1}^{n} wi h(θ(ki)) = Ê[h(θ)]f }.
Definition 2.1 The entropy of an n-tuple w belonging to set Ω is defined as
En(w) = − Σ_{i=1}^{n} wi ln wi,
subject to the convention that 0 ln(0) equals 0.
We observe that for all w belonging to Ω, En(w) ≤ En((1/n, 1/n, . . . , 1/n)) = ln(n). Since Ω is closed,
there exists an element w ∗ of Ω such that En (w ∗ ) = supw∈Ω En (w). These weights w ∗ are called
maximum entropy weights, and they exist whenever Ω is non-empty.
Finding maximum entropy weights w∗ is thus equivalent to maximizing En(w) subject to the constraints
wi ≥ 0 for i = 1, . . . , n, Σ_{i=1}^{n} wi = 1, and Σ_{i=1}^{n} wi h(θ(ki)) = Ê[h(θ)]f. For a real benchmark
function h(θ), it can be shown that the maximum entropy weights w ∗ are unique whenever they exist.
For most subsamples of reasonable size they are given by
wi∗ = exp(λ1 + λ2 h(θ(ki))),   i = 1, 2, . . . , n;   (6)

where λ2 ∈ R satisfies the equation

Σ_{i=1}^{n} ( h(θ(ki)) − Ê[h(θ)]f ) exp( λ2 ( h(θ(ki)) − Ê[h(θ)]f ) ) = 0,   (7)

and λ1 = − ln( Σ_{i=1}^{n} exp(λ2 h(θ(ki))) ). The few subsamples where the weights fail to exist will have either
h(θ(ki) ) < Ê[h(θ)]f for all i, or h(θ (ki) ) > Ê[h(θ)]f for all i. For all other subsamples, equation (7) has
a unique root because the left hand side of the equation increases monotonically from −∞ to +∞ as
λ2 increases. The root can be obtained by numerical methods, and hence wi∗ can be calculated very
quickly using (6).
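As an illustration (ours, not the authors' code), the following sketch brackets the root of (7), solves it with a standard one-dimensional root finder (scipy's brentq, one of many possible choices), and then forms the weights (6); the final normalization is written as a softmax, which absorbs λ1 and is numerically stable:

import numpy as np
from scipy.optimize import brentq

def max_entropy_weights(h_sub, h_full_mean):
    """Maximum entropy weights (6) for a real-valued benchmark function.

    h_sub       : (n,) array of h(theta(ki)) over the subsample
    h_full_mean : the full sample benchmark estimate E-hat[h(theta)]_f
    """
    d = h_sub - h_full_mean
    if np.all(d < 0) or np.all(d > 0):
        raise ValueError("the weights do not exist for this subsample")

    def lhs(lam2):                       # left-hand side of (7); increasing in lam2
        return np.sum(d * np.exp(lam2 * d))

    lo, hi = -1.0, 1.0                   # expand until the root of (7) is bracketed
    while lhs(lo) > 0:
        lo *= 2.0
    while lhs(hi) < 0:
        hi *= 2.0
    lam2 = brentq(lhs, lo, hi)

    z = lam2 * h_sub                     # equation (6); lambda1 is the log-normalizer,
    w = np.exp(z - z.max())              # absorbed here by normalizing (a softmax)
    return w / w.sum()

rng = np.random.default_rng(3)
h_full = rng.normal(size=10_000)         # stand-in for h evaluated on the full sample
h_sub = h_full[9::10]
w = max_entropy_weights(h_sub, h_full.mean())
print(w.sum(), w @ h_sub, h_full.mean()) # weighted subsample estimate matches the benchmark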
When h(θ) is an indicator function of a subset of the parameter space Θ, the answer obtained using
(6) matches the post-stratification weights given in (5). Thus, the two approaches will sometimes yield
the same result.
3 Theoretical Results
The motivation for subsampling the output of a Markov chain carries over to the study of asymptotic
properties of the estimators. A first asymptotic is motivated by the common practice of using preliminary runs of the Markov chain to assess the dependence and convergence properties of the chain, and
then selecting a subsampling rate for the run of the chain used for estimation. This practice leads us to
consider the asymptotic where k is held fixed and n grows (Case A). The strongest motivation for subsampling is that further use of the subsample involves expensive (slow) processing. When this further
processing is done in real time, as with investigation of changes in the prior distribution or likelihood,
it is essential to limit the number of points that are repeatedly processed. Pursuing this motivation for
subsampling, a natural asymptotic holds the number of subsampled points, n, fixed while letting the
interval between subsampled points, k, tend to infinity (Case B). In the limit, these subsampled points
will look like a random sample from π. We note that asymptotics between these two, where both n and
k grow (Case C), are also natural candidates for theoretical exploration, and are useful for filling out
the range of asymptotic expressions useful for assessing the accuracy of the benchmark estimators.
Tierney (1994) contains a collection of useful results for the asymptotics of estimators based on
output from a Markov chain. The two essential types of result are ergodic theorems which guarantee
strong convergence of an empirical average (or full-sample estimator) to a corresponding expectation
under the limiting distribution, and central limit theorems which describe weak convergence of an
appropriately centered and scaled full-sample estimator to a normal distribution. We rely heavily on
his results to show asymptotic normality of our subsampled estimators.
Case A. We first consider the asymptotic where k is held fixed and n tends to ∞. The ergodic
theorem in Tierney’s paper applied to the sub-sampled chain (which is itself Markovian) allows us to
conclude that, as n tends to ∞, Ê[g(θ)]s tends to E[g(θ)] almost surely. The next theorem establishes
the asymptotic normality of the basic post-stratification estimator for expectations of bounded functions
g(θ) and geometrically ergodic Markov chains.
Theorem 3.1 Suppose that the Markov chain is geometrically ergodic with invariant distribution π. If
the function g(θ) is bounded and not a linear combination of the strata indicators Ij(θ), j = 1, . . . , m,
then there exists a real number σ(g) such that, as n → ∞, the distribution of

√n ( Ê[g(θ)]w,ps − E[g(θ)] )

converges weakly to a normal distribution with mean 0 and variance σ²(g).
Proof. The result follows from a verification of the conditions of Theorem 4 of Tierney’s paper and by
an application of the delta method. See Appendix A for a more detailed proof.
Under the stronger assumption of uniform ergodicity of the Markov chain, an identical result holds
for all functions g(θ) with finite posterior variance. A proof of this result relies on Theorem 5 of Tierney’s
paper, but is otherwise the same as the one above.
The results extend beyond the basic post-stratification estimator, applying also to the more complex post-stratification estimators. The proof of the following corollary for modified post-stratification
estimators is almost identical to that of the previous theorem. It differs only in minor details specific to
modified post-stratification estimators. Non-singularity of the matrix B defined below is required for
local continuity of the estimator.
Corollary 3.2 Let Ê[g(θ)]∗w,ps denote the modified version of the post-stratification estimator based on
benchmark functions h1(θ), . . . , hm−1(θ). Let the matrix B be given by

B = [ π1               . . .  πm
      E[h1(θ)I1(θ)]    . . .  E[h1(θ)Im(θ)]
      . . .                   . . .
      E[hm−1(θ)I1(θ)]  . . .  E[hm−1(θ)Im(θ)] ].
Under the same assumptions as Theorem 3.1, there exists a real number σ ∗ (g) such that, as n → ∞,
the distribution of

√n ( Ê[g(θ)]∗w,ps − E[g(θ)] )

converges weakly to a normal distribution with mean 0 and variance σ∗²(g), provided the matrix B is
invertible.
Case B. The second asymptotic that we consider involves increasingly thinner subsamples as the
full sample size tends to ∞. The subsample size, bounded by constraints on computing power and/or
storage space, remains fixed. The following result and corollary are useful when investigating the limiting
behavior of the basic post-stratification estimator Ê[g(θ)]w,ps in this case. The result states that, for
a fixed subsample size, as the length of the full sample grows, the joint distribution of any continuous
function of the subsample tends to the distribution of that function applied to n independent draws
from π. Again, the result follows from Tierney’s Section 3.
Theorem 3.3 Suppose that the Markov chain is geometrically ergodic with invariant distribution π.
Let f (·) be any function which has a set of discontinuities having measure 0 under the distribution on
(γ (1) , . . . , γ (n) ), where (γ (1) , . . . , γ (n) ) are independent draws from π. Then, as k → ∞, f (θ (1k) , . . . , θ(nk) )
converges in distribution to f (γ (1) , . . . , γ (n) ).
The next theorem gives the limiting distribution of the basic post-stratification estimator as the full
sample size grows, when the subsample size is fixed.
Theorem 3.4 In addition to the assumptions of the previous theorem, assume that the function g(θ) is
continuous, and that the boundary of each stratum has π-measure 0. Then for a fixed subsample size n
and as the subsample distance k tends to ∞, the basic post-stratification estimator Ê[g(θ)]w,ps converges
in distribution to Ê[g(θ)]ps∗ , the post-stratification estimator that makes use of π1 , . . . , πm , the posterior
probability of the m strata, and is based on independent draws from the posterior distribution.
Proof. The estimator Ê[g(θ)]w,ps may be viewed as a real valued function mapping the np-dimensional
vector (θ (1k) , . . . , θ(nk) ) into the real line. Match the estimator to the function f (·) in the previous
theorem. Since g(θ) is continuous, the only discontinuities in f (·) appear at the boundaries of the
strata, Θj . The measure assigned to this set under the distribution on (γ (1) , . . . , γ (n) ) is no more than n
times the measure assigned to the boundaries under π. Since π assigns probability 0 to these boundaries,
the result follows.
Case C. We now consider the third asymptotic for the post-stratification estimator, where both the
subsample size n and the subsampling distance k tend to ∞. This situation is frequently encountered
in MCMC sampling when the full sample size grows at a faster rate than the subsample size, resulting in a progressively thinner and approximately independent subsample of draws from the posterior.
Along with the previous theorem, the following theorem indicates that the post-stratification estimator
Ê[g(θ)]w,ps based on a large enough subsample has a smaller asymptotic variance than the unweighted
subsample estimator. The result parallels what sampling statisticians have long known about the tie
between stratified samples and post-stratified samples. Their results are generally described in terms
of the asymptotic distribution of the estimators, as the sample size grows, thereby finessing the (rare)
nonexistence of the post-stratified estimator. See Cochran (1977) and Lohr (1999), among others, for
the result that shows that the asymptotic distribution of the post-stratified estimator is the same as
the asymptotic distribution of the estimator based on a proportionally allocated stratified sample. In
this theorem, we follow this approach, restating the standard sampling result in the context of our
infinite population setting. Interestingly, the asymptotic variance formula applies even when the π j are
irrational, thus prohibiting proportional allocation.
Theorem 3.5 Assume that πj > 0, where πj is the posterior probability of the jth stratum for j =
1, . . . , m, and that E[g²(θ)] < ∞. Then, with Ê[g(θ)]ps representing the post-stratified estimator based
on n independent draws from π, we have that

√(n/v) ( Ê[g(θ)]ps − E[g(θ)] )

converges weakly to a standard normal distribution as n → ∞, where v = Σ_{j=1}^{m} πj Var(g(θ) | θ ∈ Θj).

Proof. Define µj = E[g(θ) | θ ∈ Θj] and µ = E[g(θ)]. We have Var(g(θ)) = Σ_{j=1}^{m} πj E[(g(θ) − µ)² | θ ∈ Θj] ≥
Σ_{j=1}^{m} πj E[(g(θ) − µj)² | θ ∈ Θj] = Σ_{j=1}^{m} πj Var(g(θ) | θ ∈ Θj). Since E[g²(θ)] < ∞, this implies that each
within-stratum variance of g(θ) is finite. Once this is established, a straightforward application of the
delta method yields the result. See, for example, Pollard (1984), p. 189.
As evident from the proof, Ê[g(θ)]ps has smaller asymptotic variance than does Ê[g(θ)]s unless µj = µ for
j = 1, . . . , m. Hence, the argument just presented suggests the superiority of the basic post-stratification
estimator Ê[g(θ)]w,ps to the unweighted subsample estimator. The weakness of the argument is that it
does not directly handle the joint limit where n and N = nk tend to ∞. Consideration of this joint limit
for the dependent MCMC sample involves substantial technical detail and will be presented elsewhere.
4 Illustration
The example we consider is discussed in George, Makov and Smith (1993). The data consist of numbers
of failures (xi ) and length of operation time in thousands of hours (ti ) for 10 power plant pumps. Interest
focuses on parameters explicitly appearing in the model and on additional features such as the mean
failure rate for a pump on which there is no data. A gamma-Poisson hierarchical model for the data
is assumed, with the usual conventions of conditional independence. The parameter δ i represents the
failure rate of pump i. The model is
xi ∼ Poisson(δi ti),   i = 1, . . . , 10,
δi ∼ gamma(α, β),   i = 1, . . . , 10,
α ∼ exponential(1),   β ∼ gamma(3, 1),
where the gamma distribution is parameterized to have mean α/β.
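The simulation below is a minimal Gibbs-within-Metropolis sketch of this model (ours; the paper's sampling was done in BUGS, not with this code). The conjugate updates for the δi and for β follow directly from the model; the full conditional of α is non-standard and is updated here with a random-walk step on log α. To keep the sketch self-contained, it simulates synthetic data with the structure of the pump data rather than hard-coding the actual failure counts and operation times.

import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)

# Synthetic stand-in data with the structure of the pump example: 10 units with operation
# times t_i and Poisson failure counts x_i.  Replace with the real data to reproduce the study.
t = rng.uniform(1.0, 100.0, size=10)
x = rng.poisson(0.5 * t)

def log_cond_alpha(alpha, beta, delta):
    # log full conditional of alpha up to a constant: exponential(1) prior times the
    # gamma(alpha, beta) density of each delta_i
    return (-alpha + 10.0 * (alpha * np.log(beta) - gammaln(alpha))
            + (alpha - 1.0) * np.log(delta).sum())

def gibbs(n_keep, burn_in=500, step_size=0.3):
    alpha, beta = 1.0, 1.0                           # initial values, as in the simulation study
    draws = []
    for it in range(burn_in + n_keep):
        # delta_i | alpha, beta, x ~ gamma(x_i + alpha, rate = t_i + beta)
        delta = rng.gamma(x + alpha, 1.0 / (t + beta))
        # beta | alpha, delta ~ gamma(3 + 10 alpha, rate = 1 + sum(delta))
        beta = rng.gamma(3.0 + 10.0 * alpha, 1.0 / (1.0 + delta.sum()))
        # alpha: random-walk Metropolis step on log(alpha), with the Jacobian term
        prop = alpha * np.exp(step_size * rng.normal())
        log_ratio = (log_cond_alpha(prop, beta, delta) - log_cond_alpha(alpha, beta, delta)
                     + np.log(prop) - np.log(alpha))
        if np.log(rng.uniform()) < log_ratio:
            alpha = prop
        if it >= burn_in:
            draws.append(np.concatenate(([alpha, beta], delta)))
    return np.asarray(draws)                         # (n_keep, 12) draws of (alpha, beta, delta)

theta = gibbs(10_000)
print(theta[:, 0].mean(), theta[:, 1].mean())        # crude check on the draws of alpha and beta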
We conducted a simulation study to estimate the reductions in MSE (if any) achieved using weighted
benchmark estimators. It is quite straightforward to generate MCMC samples from the posterior distribution of the model parameter θ = (α, β, δ1 , δ2 , . . . , δ10 ) by Gibbs sampling. For an appropriately
chosen benchmark function, the full sample benchmark estimate can be computed on the fly, and a systematic 1-in-k subsample stored. We relied on BUGS (Spiegelhalter et al., 1996) for the implementation
that was based on 100 replications of these basic steps:
1. Set the initial values for α and β equal to one.
2. Discard a burn-in period of 500 initial updates to overcome the effects of the arbitrary starting
values of the chain and retain the next 10,000 updates to provide a 1-in-10 subsample of size
n = 1, 000 from a full sample of size N = 10, 000.
3. After a subsequent burn-in period of 71,500 updates, retain the next 100,000 updates to provide
a 1-in-100 subsample of size n = 1, 000 from a full sample of size N = 100, 000. (The intervening
burn-in period of 71,500 updates was used to investigate other subsampling rates.)
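A small sketch (ours; the sampler step below is only a stand-in for the Gibbs updates) of the bookkeeping mentioned above, in which the full sample benchmark estimate is accumulated on the fly while only the 1-in-k subsample is stored:

import numpy as np

def run_with_benchmark(step, theta0, n_keep, k, h, burn_in=500):
    """Run a sampler, storing only a 1-in-k subsample while accumulating the full sample
    benchmark estimate E-hat[h(theta)]_f on the fly.

    step : function advancing the chain by one update, theta -> theta'
    h    : benchmark function, mapping a draw to a vector of benchmark components
    Returns (subsample, full_sample_benchmark_estimate).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(burn_in):
        theta = step(theta)
    kept, bench_sum = [], 0.0
    n_full = n_keep * k
    for i in range(1, n_full + 1):
        theta = step(theta)
        bench_sum = bench_sum + np.asarray(h(theta), dtype=float)
        if i % k == 0:
            kept.append(theta.copy())
    return np.asarray(kept), bench_sum / n_full

# Stand-in chain: a simple autoregressive update, just to exercise the bookkeeping.
rng = np.random.default_rng(5)
ar_step = lambda th: 0.9 * th + np.sqrt(1.0 - 0.9 ** 2) * rng.normal(size=th.shape)
sub, bench = run_with_benchmark(ar_step, np.zeros(2), n_keep=1_000, k=10,
                                h=lambda th: np.array([th[0], th[1], th[0] * th[1]]))
print(sub.shape, bench)      # (1000, 2) stored draws and the three full sample benchmark means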
For further details about the BUGS code used in the simulation, please refer to the web site
http://www.stat.ohio-state.edu/˜peruggia.
After the subsample estimates and benchmark estimates for all 100 replications were obtained from
BUGS, an Splus program was run to quickly compute subsample weights by the different methods:
unweighted, post-stratification and its variations, and maximum entropy. We then computed subsample
estimates of a number of features of interest, E[g(θ)], and we estimated the variance of the estimators
by the sample variance of the estimates over the 100 independent replications. Each feature E[g(θ)]
was estimated with the mean of the combined 100 subsamples. Treating this estimated value as the
actual E[g(θ)], we then compared the MSE of the weighted subsample estimators to the unweighted
subsample estimators. Use of the unweighted estimator as the target gives a small advantage to the
unweighted estimators in the comparisons that follow.
Four methods for creating the weights were examined.
Post-stratification (i) weights
We chose m = 4 strata, generated by the Cartesian product of
cutoffs of 1.0 for α and 1/1.8 = 0.556 for 1/β. (A parameterization in terms of 1/β, rather than β, is
more directly related to several quantities of interest and slightly more natural for this problem.) The
components of the benchmark functions were the indicator functions of the four strata. The cutoff values
of 1.0 and 0.556 are, approximately, the posterior medians of α and 1/β estimated from a preliminary
run of the MCMC algorithm. As we observe later, the performance of post-stratification estimators
seems to be comparatively robust with respect to the choice of strata.
Post-stratification (ii) weights
Using the same strata as the post-stratification (i) weights, the
m − 1 = 3 components of the benchmark function were chosen as α, β and αβ.
Post-stratification (iii) weights
For the same strata as above, the three components of the bench-
mark function were chosen as α, 1/β and α/β.
Maximum entropy weights
The real-valued benchmark function was chosen as h(θ) = α/β. The
maximum entropy weights were obtained as described in Section 2.2.
We considered a number of features of interest, E[g(θ)]. Several estimands focus on the hyperparameters α and β. Note that α/β is the predictive mean failure rate for a pump on which we have no data
and that α/β 2 is the variance for such a pump.
We also investigated functions tied to specific pumps in the data set. The tables show results for
the mean, variance and density of the failure rate, E(δ1), Var(δ1) and φ(δ1) respectively, of pump 1.
For density estimation, we used a grid of 196 points from 0.005 to 0.2, spaced at intervals of 0.001. The
integrated mean square error was approximated by an average of squared differences from the target
density. To broaden the comparisons, those for pump 1 rely on Rao-Blackwellized estimates for each
estimand. The weights, however, were computed without Rao-Blackwellization.
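For concreteness, here is a sketch (ours) of a weighted Rao-Blackwellized density estimate for δ1 on the grid described above. It averages, with the benchmark weights, the conditional density of δ1 given α, β, x1 and t1, which under the gamma-Poisson model is gamma(x1 + α, t1 + β); the draws, weights and (x1, t1) values in the usage example are made up.

import numpy as np
from scipy.stats import gamma

def rb_density_delta1(alpha_sub, beta_sub, weights, x1, t1, grid=None):
    """Weighted Rao-Blackwellized estimate of the posterior density of delta_1.

    alpha_sub, beta_sub : (n,) arrays of subsampled draws of alpha and beta
    weights             : (n,) benchmark weights summing to one
    x1, t1              : failure count and operation time for pump 1
    Averages, with the given weights, the conditional densities
    delta_1 | alpha, beta, x_1, t_1 ~ gamma(x_1 + alpha, rate = t_1 + beta) over the grid.
    """
    if grid is None:
        grid = np.linspace(0.005, 0.2, 196)   # the 196-point grid described in the text
    dens = gamma.pdf(grid[None, :], a=(x1 + alpha_sub)[:, None],
                     scale=1.0 / (t1 + beta_sub)[:, None])
    return grid, weights @ dens

# Toy usage with made-up draws, uniform weights, and illustrative (x1, t1) values.
rng = np.random.default_rng(6)
a = rng.gamma(2.0, 0.5, size=1_000)
b = rng.gamma(2.0, 0.5, size=1_000)
grid, phi = rb_density_delta1(a, b, np.full(1_000, 1.0 / 1_000), x1=3, t1=50.0)
print(grid[phi.argmax()], phi.max())          # location and height of the estimated mode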
We also investigated two empirical Bayes problems. For the first problem, we took as our estimand
(E[δ1], . . . , E[δ10]) with sum of squared error loss. For the second problem, we took (Var(δ1), . . . , Var(δ10))
as our estimand, again under sum of squared error loss. Rao-Blackwellization was used in this portion
of the study.
The (integrated) squared biases of all the estimators were found to be negligible compared to the
respective variances. The performances of the weighted subsample estimators relative to the unweighted
subsample estimators are summarized in Tables 1 and 2. The percent reduction in MSE is the reduction
in MSE achieved by the weighted estimator expressed as a percentage of the unweighted estimator
MSE. The percent reductions in MSE produced by the subsample estimator for all method-by-rate
combinations and all features of interest E[g(θ)] are also presented graphically in a trellis display in
Figure 1.
We observe that most of the weighted subsample estimators have smaller MSEs than the corresponding
unweighted estimators, as indicated by the positive values of percent reduction in MSE in the
tables. As expected, we see a large reduction in MSE for the benchmarks. If the full samples consisted
of i.i.d. draws from the posterior, we would expect a 90% reduction in MSE for the benchmarks
for a 1-in-10 subsample. The actual range of 60% to 90% seen in Table 1 reflects the positive
dependence inherent in the MCMC samples. For thinner and thinner subsamples, as the dependence
between draws gets progressively weaker, the actual reductions in MSE begin to more closely resemble
the corresponding values for a sequence of i.i.d. draws. Nicely, all estimators show improvement for
estimation of E[α/β], the predictive failure rate for a pump on which there is no data.
Maximum entropy performs well on its benchmark and on features of interest closely related to the
benchmark, but it shows poor performance for several other estimands, even increasing MSE in some
cases. We conjecture that the relatively poor performance is tied to the behavior of the weights. The
weights are monotone in α/β, either increasing or decreasing, corresponding to whether Ê[h(θ)]s <
Ê[h(θ)]f or Ê[h(θ)]s > Ê[h(θ)]f . In either case, the most extreme weights are in a tail, and this tends
to destabilize the estimates.
Post-stratification (i) shows improvement for all estimands. The most substantial improvement is
shown where the mean of the function varies greatly from stratum to stratum. However, the big winners
are post-stratification (ii) and (iii), motivated by the perspective of a basis expansion. These estimators
show a large reduction in MSE. The reductions for these methods are, on the whole, much greater
than for post-stratification (i) or maximum entropy. Post-stratification (ii) shows better performance
than post-stratification (iii) for functions expressed in terms of β while the reverse holds for functions
expressed in terms of 1/β. The pattern of reduction in MSE is particularly clear for functions estimated
with Rao-Blackwellization, appearing in the last five lines of each table. Posterior means, variances and
densities of the pump failure rates are estimated extremely well with post-stratification (ii) or (iii). We
recommend use of weights similar to post-stratification (ii) or (iii).
Another issue of interest is the sensitivity of the weighted post-stratified subsample estimator to
the choice of the cutoff values used to partition the parameter space. The values of 1.0 and 0.556
used in our simulations to partition the α-(1/β) space into four strata coincide with the posterior
estimates of the 0.55 quantile of α and of the 0.48 quantile of 1/β. Figure 2 displays, as a function
of the cutoff probabilities for α and 1/β, a level plot of the percent MSE reduction for the weighted
(1-in-10) subsample estimation of β 2 using the post-stratification weights (iii). For a given pair of
cutoff probabilities over the range [0.3, 0.7] × [0.3, 0.7], the strata are formed by partitioning the α-(1/β)
space into the four quadrants with the common vertex located at the point of coordinates given by the
posterior estimates of the corresponding quantiles for α and 1/β.
The most striking feature revealed by Figure 2 is the insensitivity of the performance of the estimator
to the choice of strata. For a fixed value of the cutoff probabilities for (1/β), the percent MSE reduction
is essentially constant over the [0.3, 0.7] range as a function of the cutoff probabilities for α. The
dependence on the cutoff probabilities for (1/β) is a little more pronounced, but any choice of cutoff
probabilities for α and (1/β) in the rectangle [0.3, 0.7] × [0.4, 0.7] produces nearly optimal results. This
robustness alleviates our concern that the benefits of benchmark estimation might be due to a fortuitous
choice of strata.
5 Conclusions
The application of benchmark estimation to the pump data shows the remarkable benefits that accrue
to a computationally cheap means of improving MCMC simulation. The method is relatively automatic
and is robust with respect to the choice of many implementation details. The potential application of the
method is much wider than illustrated here. Benchmark estimation can be used in any setting where
more accurate estimators are available. Examples of better estimators include Rao-Blackwellization
of the weights as well as of the estimators (Gelfand and Smith, 1990), and subsampled estimators where
the subsampling scheme produces estimators that are more accurate than the full-sample estimators
(MacEachern and Peruggia, 2000). Benchmarks may also be created that match known population
quantities, as there are typically some functions h(θ) for which E[h(θ)] is known. The advantage of
matching these quantities is that the benchmarks then contain no sampling error.
The connection between calibration estimation and benchmark estimation suggests further investigation. We have created the post-stratification estimators by imposing enough linear constraints to
yield a unique solution for the weights. Calibration methods instead create a multitude of weights
that provide estimates matching the benchmark estimates. The final set of weights that is selected is
then chosen by minimizing a distance, often from uniform weights. The resulting weights will rarely
be uniform within strata. This may lead to superior estimators, or it may tend to produce estimators
that are susceptible to tail effects, as we have seen with the maximum entropy estimator. An important
feature of our problem, one that is not always respected in calibration estimation, is that, for our final
estimators, we wish to produce a set of weights that correspond to a probability distribution.
This work also suggests the need for a more complete development of the theoretical underpinnings
of benchmark estimation. We have pursued this line of research, with particular attention to the
asymptotic behavior of the estimators, evaluation of the asymptotic variances of estimators, and use of
the methods in conjunction with importance sampling techniques. This work will appear elsewhere.
Acknowledgments
This material is based upon work supported by the National Science Foundation under Awards No.
DMS-0072526 and SES-0214574. We wish to thank the Associate Editor and referee whose comments
improved the focus of the paper and led us to the connection between benchmark estimation and
calibration estimation.
Appendix
Proof of Theorem 3.1
The basic notation used in the proof was introduced earlier in section 2.2. For each stratum j = 1, . . . , m,
we use Nj to denote the number of full sample points belonging to stratum j, and we write the product
g(θ)Ij (θ) as a function fj (θ). The simple post-stratification estimator Ê[g(θ)]w,ps , provided it exists,
can be written as
Ê[g(θ)]w,ps = Σ_{j=1}^{m} ( Nj / (nj N) ) Σ_{i=1}^{n} fj(θ(ki)).
Stringing together the 3m empirical sums used in the above expression, we define the random vector Yn
as follows:

Yn = ( Σ_{i=1}^{n} f1(θ(ki)), . . . , Σ_{i=1}^{n} fm(θ(ki)), n1, . . . , nm, N1, . . . , Nm ).
In Lemma 5.4, we prove that, suitably centered and scaled, the vector (1/n)Yn tends to a multivariate normal distribution. The
probability of the non-existence of the post-stratification estimate tends to 0 as n increases, and so by
applying the delta method we will have the result stated in Theorem 3.1. The following three lemmas are
well-known results on convergence of probability measures, and help establish Lemma 5.4. Billingsley
(1968) provides the background needed to fill in the details missing from these proofs.
Lemma 5.1 (Cramer-Wold device). For random vectors Xn = (Xn1 , . . . , Xnp ) and X = (X1 , . . . , Xp ),
a necessary and sufficient condition that Xn converges weakly to X is that l′Xn converges weakly to l′X
for all vectors l ∈ Rp .
Lemma 5.2 Let {Zn } be a tight sequence of random vectors in Rp having characteristic functions
{ϕn }. If limn ϕn (u) = f (u) for all u ∈ Rp , where f is some complex-valued function, then f (u) is the
characteristic function of some random variable Z and Zn converges in distribution to Z.
The next lemma is explicitly needed to prove Lemma 5.4:
Lemma 5.3 Let Xn = (Xn1 , . . . , Xnp ), n = 1, 2, . . . , be a sequence of random vectors. If, for all
l ∈ Rp, l′Xn converges weakly to some real-valued random variable Yl, then there exists a vector X =
(X1 , . . . , Xp ) such that Xn converges weakly to X.
Proof. Let lk be the vector of length p having only the k-th component equal to 1 and all other
components equal to 0, where k = 1, . . . , p. By our assumption, the sequence of random variables {X nk }
converges weakly to the random variable Ylk and is therefore tight. Applying Bonferroni’s inequality, we
conclude that the sequence {Xn} is also tight. Since l′Xn converges weakly to the real-valued random
variable Yl for all l ∈ Rp, the limit of the corresponding characteristic functions E[exp(it l′Xn)], as n tends to
∞, equals E[exp(it Yl)] for all t ∈ R. Setting u = tl ∈ Rp and applying Lemma 5.2, we conclude that there
exists a random vector X of length p to which Xn converges weakly.
The following lemma completes the proof by establishing the asymptotic multivariate normality of
the vector (1/n)Yn :
Lemma 5.4 Suppose the Markov chain is geometrically ergodic with invariant distribution π. Assume
that the function g(θ) is bounded and is not a linear combination of the strata indicators I j (θ), where
j = 1, . . . , m. Then, as n tends to ∞, the distribution of
√n ( (1/n)Yn − µY )
converges weakly to a N3m (0, Σ) distribution where Σ is a non-negative definite square matrix of dimension 3m, and the vector µY = limn→∞ E[(1/n)Yn ] is the expected value of (f1 (θ), . . . , fm (θ), I1 (θ), . . . ,
Im (θ), kI1 (θ), . . . , kIm (θ)) with respect to π.
Proof. We define a new sequence β (i) of vector-valued MCMC iterates, by combining k consecutive
samples from the original chain. That is, for i ∈ N, we define β(i) = (θ(k(i−1)+1), . . . , θ(ki)). It is
easily verified that {β (i) } is a Markov chain with limiting distribution Πk , which we define as the joint
distribution of k consecutive samples from the original Markov chain, where the first state is generated
from the invariant distribution π.
The new Markov chain is also geometrically ergodic. Geometric ergodicity of the original chain
implies that there exists some 0 ≤ ρ < 1 such that ||π(θ(m) | θ(0)) − π(θ)|| ≤ ρ^m M(θ(0)), for all m ∈ N,
with || · || denoting total variation distance. We note that

||π(β(i) | β(0)) − Πk(β)|| = ||π(θ(k(i−1)+1) | θ(0)) − π(θ)||
                          ≤ ρ^(k(i−1)+1) M(θ(0))
                          = [ρ^k]^i [ρ^(1−k) M(θ(0))].
The first equality is a consequence of the Markovian nature of the fine-scale chain. The last equality
establishes that the rate parameter of the new chain is the k-th power of the rate parameter of the
original chain.
The conditions are now set to apply Tierney’s Theorem 4. This theorem implies that, for each
l ∈ R3m, √n ( l′(Yn/n) − l′µY ) converges weakly to a normal random variable, say Zl. Lemma 5.3
implies that the sequence √n ( (Yn/n) − µY ) converges weakly to some random vector Z; Lemma 5.1
equates the distributions l′Z and Zl for every l ∈ R3m, and, since all one-dimensional projections of Z
are normal, we have that Z is normal.
References
Billingsley, P. (1968). Convergence of Probability Measures, Wiley, New York.
Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York.
Deville, J. and C. Särndal (1992). Calibration Estimators in Survey Sampling. J. Amer. Statist. Assoc.
87, 376–382.
Gelfand, A. E. and A. F. M. Smith (1990). Sampling-Based Approaches to Calculating Marginal
Densities. J. Amer. Statist. Assoc. 85, 398–409.
George, E. I., U. E. Makov and A. F. M. Smith (1993). Conjugate Likelihood Distributions. Scand. J.
Statist. 20, 147–156.
Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statist. Sci. 7, 473–483.
Lohr, S. (1999). Sampling: Design and Analysis. Duxbury, Pacific Grove, CA.
MacEachern, S. N. and L. M. Berliner (1994). Subsampling the Gibbs Sampler. Amer. Stat. 48, 188–
190.
MacEachern, S. N. and M. Peruggia (2000). Subsampling the Gibbs sampler: Variance reduction.
Statist. Prob. Letters 47, 91–98.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.
Spiegelhalter, J., A. Thomas, N. G. Best and W. R. Gilks (1996). BUGS: Bayesian inference Using
Gibbs Sampling, Version 0.5, (version ii). Cambridge, UK: MRC Biostatistics Unit.
Tierney, L. (1994). Markov Chains for Exploring Posterior Distributions. Ann. Statist. 22, 1701–1728.
Vanderhoeft, C. (2001). Generalized Calibration at Statistics Belgium: SPSS Module g-CALIB-S and
Current Practices. Working Paper, Statistics Belgium.
                          % Reduction in MSE for a Subsampling Rate of 1-in-10
g(θ)                      Post-Strat (i)   Post-Strat (ii)   Post-Strat (iii)   Max Ent
α                         46.1 (5.6)       70.9* (4.2)       70.9* (4.2)        6.2 (1.9)
α²                        33.3 (7.3)       66.7 (4.6)        67.1 (4.7)         4.7 (1.9)
β                         38.5 (7.6)       66.2* (5.1)       42.7 (7.9)         12.6 (7.9)
β²                        29.9 (7.3)       64.8 (4.6)        39.4 (6.9)         9.0 (6.5)
αβ                        37.4 (6.6)       62.3* (5.3)       57.0 (5.5)         -0.4 (3.8)
1/β                       24.8 (8.5)       44.1 (7.1)        65.5* (5.0)        24.9 (10.0)
1/β²                      8.0 (5.7)        16.1 (5.9)        53.0 (3.9)         22.4 (7.0)
α/β                       26.3 (9.0)       55.5 (6.9)        86.7* (2.6)        86.7* (2.6)
α/β²                      7.8 (7.8)        16.2 (7.8)        59.7 (4.2)         53.1 (8.3)
I(α/β ≤ 0.58)             32.8 (7.4)       50.5 (7.4)        50.9 (8.0)         50.4 (7.0)
E(δ1)                     45.0 (5.8)       74.1 (3.8)        74.2 (3.7)         12.7 (3.0)
Var(δ1)                   43.7 (6.0)       72.1 (4.1)        72.3 (4.0)         9.0 (2.5)
φ(δ1)                     46.6 (5.4)       73.1 (3.9)        73.2 (3.8)         12.0 (2.8)
Combined pump means       32.6 (8.7)       69.2 (4.9)        60.0 (7.0)         55.1 (7.0)
Combined pump variances   30.6 (8.0)       77.8 (3.7)        53.5 (7.4)         51.8 (6.5)
Table 1: Comparison of MSE of the subsample estimators for a 1-in-10 systematic subsample. A negative
percent reduction indicates an increase in MSE. An ∗ indicates that the weighted subsample estimate is
forced to agree with the full sample estimate. Shown in parentheses are the estimated standard errors.
                          % Reduction in MSE for a Subsampling Rate of 1-in-100
g(θ)                      Post-Strat (i)   Post-Strat (ii)   Post-Strat (iii)   Max Ent
α                         63.8 (5.6)       97.6* (0.5)       97.6* (0.5)        1.0 (2.5)
α²                        56.4 (6.0)       95.0 (1.0)        95.0 (0.9)         1.0 (1.9)
β                         71.8 (4.8)       94.5* (1.0)       71.4 (5.5)         25.2 (7.4)
β²                        57.9 (5.8)       91.3 (1.6)        64.1 (6.1)         14.0 (7.1)
αβ                        64.3 (5.0)       96.6* (0.6)       90.4 (1.6)         3.8 (3.8)
1/β                       51.4 (6.8)       73.1 (4.0)        95.3* (0.9)        43.0 (5.2)
1/β²                      23.7 (7.6)       37.9 (7.8)        76.5 (5.5)         25.2 (10.3)
α/β                       28.5 (7.4)       60.8 (6.2)        97.2* (0.5)        97.2* (0.5)
α/β²                      24.5 (7.3)       37.8 (7.0)        80.1 (3.5)         57.4 (6.7)
I(α/β ≤ 0.58)             20.0 (8.0)       37.3 (9.5)        44.5 (8.6)         47.4 (6.5)
E(δ1)                     60.4 (6.0)       98.0 (0.4)        97.7 (0.5)         2.8 (4.2)
Var(δ1)                   61.2 (5.9)       97.7 (0.5)        97.7 (0.5)         1.8 (3.2)
φ(δ1)                     61.0 (5.9)       96.8 (0.6)        96.4 (0.6)         2.6 (4.0)
Combined pump means       56.6 (5.6)       85.5 (2.4)        74.2 (4.7)         70.1 (4.5)
Combined pump variances   52.8 (5.6)       92.8 (1.2)        61.6 (7.0)         62.0 (5.3)
Table 2: Comparison of MSE of the subsample estimators for a 1-in-100 systematic subsample. A
negative percent reduction indicates an increase in MSE. An ∗ indicates that the weighted subsample
estimate is forced to agree with the full sample estimate. Shown in parentheses are the estimated
standard errors.
[Figure 1 appears here: a trellis display with one panel per method-by-rate combination (ps.1, ps.2, ps.3 and MaxEnt, each at rates 1-in-10 and 1-in-100), plotting the percent MSE reduction for each feature of interest on a common 0 to 100 scale.]
Figure 1: Trellis plots of the percent reductions in MSE produced by the subsample estimator for all
method-by-rate combinations and all features of interest E[g(θ)]. To facilitate visual inspection, the
features of interest are ordered so as to produce an increasing pattern in the upper-left panel.
[Figure 2 appears here: a level plot titled "Percent MSE Reduction for Estimation of Beta Squared", with the cut-off probabilities for α on the horizontal axis and the cut-off probabilities for 1/β on the vertical axis.]
Figure 2: Level plot of the percent MSE reduction for the weighted subsample estimation of β 2 using
the post-stratification weights (iii) and 1-in-10 subsamples. For a given pair of cutoff probabilities for
α and 1/β, the strata are formed by partitioning the α-(1/β) space into the four quadrants with the
common vertex located at the point of coordinates given by the posterior estimates of the corresponding
quantiles for α and 1/β.