
J. R. Statist. Soc. B (2003)
65, Part 3, pp. 585–618
A theory of statistical models for Monte Carlo
integration
A. Kong,
deCODE Genetics, Reykjavik, Iceland
P. McCullagh,
University of Chicago, USA
X.-L. Meng
Harvard University, Cambridge, USA
and D. Nicolae and Z. Tan
University of Chicago, USA

Address for correspondence: P. McCullagh, Department of Statistics, University of Chicago, 5734 University Avenue, Chicago, IL 60657-1514, USA. E-mail: [email protected]
[Read before The Royal Statistical Society at a meeting organized by the Research Section on
Wednesday, December 11th, 2002, Professor D. Firth in the Chair ]
Summary. The task of estimating an integral by Monte Carlo methods is formulated as a statistical model using simulated observations as data. The difficulty in this exercise is that we
ordinarily have at our disposal all of the information required to compute integrals exactly by
calculus or numerical integration, but we choose to ignore some of the information for simplicity
or computational feasibility. Our proposal is to use a semiparametric statistical model that makes
explicit what information is ignored and what information is retained. The parameter space in
this model is a set of measures on the sample space, which is ordinarily an infinite dimensional
object. None-the-less, from simulated data the base-line measure can be estimated by maximum likelihood, and the required integrals computed by a simple formula previously derived by
Vardi and by Lindsay in a closely related model for biased sampling. The same formula was also
suggested by Geyer and by Meng and Wong using entirely different arguments. By contrast with
Geyer’s retrospective likelihood, a correct estimate of simulation error is available directly from
the Fisher information. The principal advantage of the semiparametric model is that variance
reduction techniques are associated with submodels in which the maximum likelihood estimator in the submodel may have substantially smaller variance than the traditional estimator. The
method is applicable to Markov chain and more general Monte Carlo sampling schemes with
multiple samplers.
Keywords: Biased sampling model; Bridge sampling; Control variate; Exponential family;
Generalized inverse; Importance sampling; Invariant measure; Iterative proportional scaling;
Log-linear model; Markov chain Monte Carlo methods; Multinomial distribution; Normalizing
constant; Semiparametric model; Retrospective likelihood
1. Normalizing constants and Monte Carlo integration
Certain inferential problems arising in statistical work involve awkward summation or high
dimensional integrals that are not analytically tractable. Many of these problems are such that
only ratios of integrals are required, and it is this class of problems with which we shall be
concerned in this paper. To establish the notation, let Γ be a set, let µ be a measure on Γ, let {q_θ} be a family of functions on Γ and let

c(θ) = ∫_Γ q_θ(x) dµ.

Our goal is ideally to compute exactly, or in practice to estimate, the ratios c(θ)/c(θ′) for all values θ and θ′ in the family. The family may contain a reference function q_{θ_0} whose integral is known, in which case the remaining integrals are directly estimable by reference to the standard. Our theory accommodates but does not require such a standard. For an estimator to be useful, an approximate measure of estimation error is also required.
We refer to c(θ) as the normalizing constant associated with the function q_θ(x). In particular, if q_θ is non-negative and 0 < c(θ) < ∞,

dP_θ(x) = q_θ(x) dµ/c(θ)

is a probability distribution on Γ. For the method to work, the family must contain at least one non-negative function q_θ, but it is not necessary that all of them be non-negative. Depending on the context and on the functions q_θ, the normalization constant might represent anything from a posterior expectation in a Bayesian calculation to a partition function in statistical physics.
Observations simulated from one or more of these distributions are the key ingredient in
Monte Carlo integration. We assume throughout the paper that techniques are available to
simulate from Pθ without computing the normalization constant. At least initially, we assume
that these techniques generate a stream of independent observations from Pθ .
At first glance, the problem appears to be an exercise in calculus or numerical analysis, and
not amenable to statistical formulation. After all, statistical theory does not seek to avoid estimators that are difficult to compute; nor is it inclined to opt for inferior estimators because they
are convenient for programming. So it is hard to see how any efficient statistical formulation could avoid the obvious and excellent estimator c(θ) = ∫_Γ q_θ(x) dµ, which has zero variance and requires no simulated data.
This paper demonstrates that the exercise can nevertheless be formulated as a model-based
statistical estimation problem in which the parameter space is determined by how much information we choose to ignore. In effect, the statistical model serves to estimate that part of the
information that is ignored and uses the estimate to compute the required integrals in a manner
that is asymptotically efficient given the information available. Neither the nature nor the extent
of the ignored information is predetermined. By judicious use of group invariant submodels
and other submodels, the amount of information ignored may be controlled in such a way that
the simulation variance is reduced with little increase in computational effort.
The literature on Monte Carlo estimation of integrals is very extensive, and no attempt will
be made here to review it. For good summaries, see Hammersley and Handscomb (1964), Ripley
(1987), Evans and Swartz (2000) or Liu (2001). For overviews on the computation of normalizing constants, see DiCiccio et al. (1997) or Gelman and Meng (1998).
2. Illustration
The following example with sample space Γ = R×R+ is sufficiently simple that the integrals can
be computed analytically. Nevertheless, it illustrates the gains that are achievable by choice of
design and by choice of submodel. Three Monte Carlo techniques are described and compared.
Suppose that we need to evaluate the integrals over the upper half-plane

c_σ = ∫_Γ dx_1 dx_2 / {x_1² + (x_2 + σ)²}²

for σ ∈ {0.25, 0.5, 1.0, 2.0, 4.0}. It is conventional to take µ to be Lebesgue measure, so that q_σ(x) = 1/{x_1² + (x_2 + σ)²}². As it happens, the distribution P_σ has mean (0, σ) with infinite variances and covariances. Consider first an importance sampling design in which a stream of independent observations x_1, ..., x_n is made available from the distribution P_1. The importance sampling estimator of c_σ/c_1 is

ĉ_σ/ĉ_1 = n⁻¹ Σ_i q_σ(x_i)/q_1(x_i).
By applying the results from Section 4, we find, on the basis of n = 500 simulations, that the matrix

V̂ = n⁻¹ ×
[  4.411   1.491   0.000  −0.601  −0.821 ]
[  1.491   0.641   0.000  −0.383  −0.582 ]
[  0.000   0.000   0.000   0.000   0.000 ]
[ −0.601  −0.383   0.000   0.578   1.273 ]
[ −0.821  −0.582   0.000   1.273   3.591 ]

is such that the asymptotic variance of log(ĉ_r/ĉ_s) is equal to V̂_rr + V̂_ss − 2V̂_rs for r, s ∈ {0.25, 0.5, 1.0, 2.0, 4.0}. The individual components c_r are not identifiable in the model and the estimates do not have a variance, asymptotic or otherwise. The estimated variances of the 10 pairwise logarithmic contrasts range from 0.6/n to 9.6/n, with an average of 3.6/n. The matrix V̂ is obtained from the observed Fisher information, so a different simulation will yield a slightly different matrix.
Suppose as an alternative that it is feasible to simulate from any or all of the distributions P_σ, as in Geyer (1994), Hesterberg (1995), Meng and Wong (1996) or Owen and Zhou (2000). Various simulation designs, also called defensive importance sampling or bridge sampling plans, can now be considered in which n_r observations are generated from P_r. These are called the design weights, or bridge sampling weights. For simplicity we consider the uniform design in which n_r = n/5. The importance sampling estimator must now be replaced by the more general maximum likelihood estimator derived in Section 3, which is obtained by solving

ĉ_σ = Σ_{i=1}^n q_σ(x_i) / Σ_s n_s ĉ_s⁻¹ q_s(x_i).   (2.1)
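Equation (2.1) can be solved by a simple fixed-point iteration, updating the vector ĉ from its current value. The sketch below, in Python, applies the iteration to a hypothetical family of unnormalized Gaussian integrands (not the half-plane integrals above), chosen so that every true ratio equals 1; the means, design weights, sample sizes and tolerance are arbitrary choices made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical set-up: unnormalized Gaussian integrands q_s(x) = exp{-(x - m_s)^2/2},
    # so every c_s equals sqrt(2*pi) and every ratio c_r/c_s equals 1.
    means = [0.0, 0.5, 1.0, 2.0]
    k = len(means)
    n_r = np.full(k, 250)                              # design weights (uniform design)
    x = np.concatenate([rng.normal(m, 1.0, n) for m, n in zip(means, n_r)])

    # n x k array of q_s(x_i)
    Q = np.column_stack([np.exp(-0.5 * (x - m) ** 2) for m in means])

    # Fixed-point iteration for c_r = sum_i q_r(x_i) / sum_s n_s q_s(x_i)/c_s
    c = np.ones(k)
    for _ in range(500):
        denom = Q @ (n_r / c)                          # sum_s n_s q_s(x_i)/c_s, one entry per i
        c_new = (Q / denom[:, None]).sum(axis=0)
        c_new /= c_new[0]                              # only ratios are identifiable
        if np.max(np.abs(np.log(c_new / c))) < 1e-10:
            c = c_new
            break
        c = c_new

    print("estimated ratios c_r/c_1:", c)              # all close to 1

Functions with n_r = 0 can be included in exactly the same way; their columns simply do not enter the denominator, which is how the importance sampling case k = 1 is recovered.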
The asymptotic covariance matrix of log(ĉ) obtained from a sample of n = 500 simulated observations using equation (4.2) is

V̂ = n⁻¹ ×
[  2.298   0.974  −0.263  −1.197  −1.811 ]
[  0.974   0.668   0.077  −0.588  −1.131 ]
[ −0.263   0.077   0.239   0.117  −0.170 ]
[ −1.197  −0.588   0.117   0.690   0.979 ]
[ −1.811  −1.131  −0.170   0.979   2.132 ]

In using this matrix, it must be borne in mind that c ≡ λc for every positive scalar λ, so only contrasts of log(c) are identifiable. The estimated variances of the 10 pairwise logarithmic contrasts range from 0.7/n to 8.1/n, with an average of 3.0/n. The average efficiency factor of the uniform design relative to the preceding importance sampling design is conveniently measured by the asymptotic variance ratio 3.6/3.0 = 1.2.
A third Monte Carlo variant uses a submodel with a reduced parameter space consisting of
measures that are invariant under group action. Details are described in Section 3.3, but the
operation proceeds as follows. Consider the two-element group G = {±1} in which the inversion g = −1 acts on Γ by reflection in the unit circle

g: (x_1, x_2) → (x_1, x_2)/(x_1² + x_2²).

By construction, g² = 1 and g⁻¹ = g, so this is a group action. Lebesgue measure is not invariant and is thus not in the parameter space as determined by this action. However, the measure ρ(dx) = dx_1 dx_2/x_2² is invariant, so we compensate by writing

q(x; σ) = x_2²/{x_1² + (x_2 + σ)²}²

for the new integrand, and c_σ = ∫_Γ q(x; σ) dρ. The submodel estimator is the same as equation (2.1), but with q replaced by the group average

q̄(x; σ) = ½{q(x; σ) + q(gx; σ)}.
With a uniform design and n = 500 observations, the estimated variance matrix of log(ĉ) is

V̂ = n⁻¹ ×
[  0.202  −0.093  −0.216  −0.093   0.202 ]
[ −0.093   0.045   0.097   0.045  −0.093 ]
[ −0.216   0.097   0.239   0.097  −0.216 ]
[ −0.093   0.045   0.097   0.045  −0.093 ]
[  0.202  −0.093  −0.216  −0.093   0.202 ]

The variances of the 10 pairwise logarithmic contrasts range from 0 to 0.87/n with an average of 0.37/n. For reasons described in Section 3.3, the two ratios c_{0.25}/c_4 and c_{0.5}/c_2 are estimated exactly with zero variance. Relative to the preceding Monte Carlo estimator, group averaging reduces the average simulation variance of contrasts by an efficiency factor of 8.1. By this comparison, n simulated observations using the group-averaged estimator are approximately equivalent to 8n observations using the estimator (2.1) with the same design weights, but without group averaging.
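The following Python sketch illustrates the group-averaged estimator for the five half-plane integrals. The sampling recipe for P_σ (rejection for the angle, inversion for the radius) is our own derivation via polar coordinates for (x_1, x_2 + σ) and is not given in the paper; the sample sizes and seed are arbitrary. Because q̄(x; 0.25) is exactly proportional to q̄(x; 4), the ratio ĉ_{0.25}/ĉ_4 is reproduced without error.

    import numpy as np

    rng = np.random.default_rng(1)
    sigmas = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
    m = 100                                            # draws per sampler, n = 500 in all

    def draw_P(sigma, size):
        # Draw from P_sigma, density proportional to 1/{x1^2 + (x2+sigma)^2}^2 on x2 > 0.
        # Assumed recipe: for (x1, x2+sigma) in polar coordinates, the angle t has density
        # proportional to sin^2(t) on (0, pi) and, given t, P(R > r) = {sigma/(r sin t)}^2.
        out = np.empty((size, 2))
        for i in range(size):
            while True:                                # rejection sampling for the angle
                t = rng.uniform(0.0, np.pi)
                if rng.uniform() < np.sin(t) ** 2:
                    break
            r = sigma / (np.sin(t) * np.sqrt(rng.uniform()))
            out[i] = r * np.cos(t), r * np.sin(t) - sigma
        return out

    def q(x, sigma):                                   # integrand w.r.t. rho(dx) = dx1 dx2 / x2^2
        return x[:, 1] ** 2 / (x[:, 0] ** 2 + (x[:, 1] + sigma) ** 2) ** 2

    def reflect(x):                                    # group action g: reflection in the unit circle
        return x / (x ** 2).sum(axis=1, keepdims=True)

    def q_bar(x, sigma):                               # group average over G = {1, g}
        return 0.5 * (q(x, sigma) + q(reflect(x), sigma))

    x = np.vstack([draw_P(s, m) for s in sigmas])
    n_r = np.full(len(sigmas), m)

    def solve(Q, n_r, iters=500):                      # fixed-point solution of equation (2.1)
        c = np.ones(Q.shape[1])
        for _ in range(iters):
            denom = Q @ (n_r / c)
            c = (Q / denom[:, None]).sum(axis=0)
            c = c / c[2]                               # report the ratios c_sigma / c_1
        return c

    print("plain        :", solve(np.column_stack([q(x, s) for s in sigmas]), n_r))
    print("group average:", solve(np.column_stack([q_bar(x, s) for s in sigmas]), n_r))
    print("exact ratios :", 1.0 / sigmas ** 2)         # c_sigma = pi/(4 sigma^2)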
To achieve efficiency gains of this magnitude, it is not necessary to use a large group, but it
is necessary to have a good understanding of the integrands in a qualitative sense, and to select
the group action accordingly. If the group action had been chosen so that g(x_1, x_2) = (−x_1, x_2)
the gain in efficiency would have been zero. Arguably, the gain in efficiency would be negative
because of the slight increase in computational effort.
Given observations simulated by any of the preceding schemes, equation (2.1), or the group-averaged version, can also be used for the estimation of integrals such as

c′_σ = ∫_Γ log(x_1² + x_2²) dx_1 dx_2 / {x_1² + (x_2 + σ)²}²

in which the integrand is not positive, and there is no associated probability distribution. In this extended version of the problem, 10 integrals are estimated simultaneously, and c′_σ/c_σ is the expected value of log(X_1² + X_2²) when X ∼ P_σ. For this example, c_σ = π/(4σ²), and c′_σ/c_σ = 2 log(σ).
The general theory described in the next section covers integrals of this sort, and the Fisher
information also provides a variance estimate.
3. A semiparametric model
3.1. Problem formulation
The statistical problem is formulated as a challenge issued by one individual called the simulator
and accepted by a second individual called the statistical analyst. In practice, these typically are
two personalities of one individual, but for clarity of exposition we suppose that two distinct
individuals are involved. The simulator is omniscient, honest and secretive but willing to provide
data in essentially unlimited quantity. Partial information is available to the analyst in the form
of a statistical model and the data made available by the simulator.
Let q_1, ..., q_k be real-valued functions on Γ, known to the analyst, and let µ be any non-negative measure on Γ. The challenge is to compute the ratios c_r/c_s, where each integral c_r = ∫_Γ q_r(x) dµ is assumed to be finite, i.e. we are interested in estimating all ratios simultaneously, where q_r(x) = q_{θ_r}(x), using the notation of Section 1. Assume that there is at least one non-negative function q_r such that 0 < c_r < ∞, and that n_r observations from the weighted distribution

P_r(dx) = c_r⁻¹ q_r(x) µ(dx)   (3.1)

are made available by the simulator at the behest of the analyst. The analyst's design vector (n_1, ..., n_k) has at least one r such that n_r > 0. Typically, however, many of the functions q_r are such that n_r = 0. For these functions, the non-negativity condition is not required.
Different versions of the problem are available depending on what is known to the analyst
about µ. Four of these are now described.
(a) If µ is known, e.g. if µ is Lebesgue measure, the constants can, in principle, be determined
exactly by integral calculus.
(b) If µ is known up to a positive scalar multiple, the constants can be determined modulo
the same scalar multiple, and the ratios can be determined exactly by integral calculus.
(c) If µ is completely unknown, neither the constants nor their ratios can be determined by
calculus alone. Nevertheless, the ratios can be estimated consistently from simulated data
by using a slight modification of Vardi’s (1985) biased sampling model.
(d) If partial information is available concerning µ, neither the constants nor their ratios can
be determined by calculus. Nevertheless, partial information may permit a substantial
gain of efficiency by comparison with (c).
As an exercise in integral calculus, the first and second versions of the problem are not considered further in this paper.
3.2. A full exponential model
In this section, we focus on a semiparametric model in which the parameter µ is a measure or distribution on Γ, completely unknown to the analyst. The simulator is free to choose any measure
µ, and the analyst’s estimate must be consistent regardless of that choice. The parameter space
is thus the set Θ of all non-negative measures on Γ, not necessarily probability distributions,
and the components of interest are the linear functionals
c_r = ∫_Γ q_r(x) dµ.   (3.2)
The state space Γ in the following analysis is assumed to be countable; the more general argument
covering uncountable spaces is given by Vardi (1985) in a model for biased sampling.
We suppose that simulated data are available in the form (y_1, x_1), ..., (y_n, x_n), in which the pairs are independent, y_i ∈ {1, ..., k} is determined by the simulation design and x_i is a random draw from the distribution P_{y_i}. Then for each draw x from the distribution P_r the likelihood contribution is

P_r({x}) = c_r⁻¹ q_r(x) µ({x}).

The full likelihood at µ is thus

L(µ) = ∏_{i=1}^n P_{y_i}({x_i}) = ∏_{i=1}^n µ({x_i}) c_{y_i}⁻¹ q_{y_i}(x_i).

It is helpful at this stage to reparameterize the model in terms of the canonical parameter θ ∈ ℝ^Γ given by θ(x) = log{µ({x})}. Let P̂ be the empirical measure on Γ placing mass 1/n at each data point. Ignoring additive constants, the log-likelihood at θ is

Σ_{i=1}^n θ(x_i) − Σ_{s=1}^k n_s log{c_s(θ)} = n ∫_Γ θ(x) dP̂ − Σ_{s=1}^k n_s log{c_s(θ)}.   (3.3)
Although it may appear paradoxical at first, the canonical sufficient statistic P̂, the empirical distribution of the simulated values {x_1, ..., x_n}, ignores a part of the data that might be considered highly informative, namely the association of distribution labels with simulated values. In fact, all permutations of the labels y give the same likelihood. The likelihood function is thus unaffected by reassignment of the distribution labels y to the draws x. This point was previously noted by Vardi (1985), section 6. Thus, under the model as specified, or under any submodel, the association of draws with distribution labels is uninformative. The reason for this is that all the information in the labels for estimating the ratios is contained in the design constants {n_1, ..., n_k}.
As is evident from equation (3.3), the model is of the full exponential family type with
canonical parameter θ, and canonical sufficient statistic P̂. The maximum likelihood estimate
of µ, obtained by equating the canonical sufficient statistic to its expectation, is
n P̂(dx) = Σ_{s=1}^k n_s ĉ_s⁻¹ q_s(x) µ̂(dx),

where dx = {x}. Thus

µ̂(dx) = n P̂(dx) / Σ_{s=1}^k n_s ĉ_s⁻¹ q_s(x),   (3.4)

where ĉ_s is the maximum likelihood estimate of c_s. Note that µ̂ is supported on the data values {x_1, ..., x_n}, but the atoms at these points are not all equal. From the integral definition of c_r, we have

ĉ_r = ∫_Γ q_r(x) dµ̂ = Σ_{i=1}^n q_r(x_i) / Σ_{s=1}^k n_s ĉ_s⁻¹ q_s(x_i).   (3.5)
In principle, it is necessary to check that there exists in the parameter space a measure µ̂ such that
these equations are satisfied, which is a non-trivial exercise for certain extreme configurations
of the observations. Fortunately existence and uniqueness have been studied in great detail for
Vardi’s biased sampling model (Vardi, 1985), which is essentially mathematically equivalent.
Statistical Models for Monte Carlo Integration
591
Although P̂ is uniquely determined, µ̂ is determined modulo an arbitrary positive multiple, and the constants ĉ_r are determined modulo the same positive multiple. In other words, the set of estimable parametric functions may be identified with the space of logarithmic contrasts Σ a_r log(c_r) in which Σ a_r = 0. This is clear from equation (3.1), which implies that the problem of ratio estimation is invariant under the transformation q_r(x) → α(x) q_r(x), where α is strictly positive on Γ.
The computational algorithm suggested by equation (3.5) is iterative proportional scaling
(Deming and Stephan, 1940; Bishop et al., 1975) applied to the n × k array {n_r q_r(x_i)} in such a way that, after rescaling,

n_r q_r(x_i) → n_r q_r(x_i) µ̂(x_i)/ĉ_r,

each row total is 1 and the rth column total is n_r. The special case k = 1, called importance
sampling, corresponds to the Horvitz–Thompson estimator (Horvitz and Thompson, 1952),
which is widely used in survey sampling to correct for unequal sampling probabilities. We might
expect to find the more general estimator (3.5) in use for combining data from one or more
surveys that use different known sampling probabilities. Possibly this exercise is not of interest
in survey sampling. In any event, the estimator does not occur in Firth and Bennett (1998) or in
Pfeffermann et al. (1998), which are concerned with unequal selection probabilities in surveys.
The normalized version of equation (3.4) has been obtained by Vardi (1985) and by Lindsay
(1995) in a nonparametric model for biased sampling. In his discussion of Vardi (1985), Mallows
(1985) pointed out the connection with log-linear models and explained why the algorithm converges. These biased sampling models are not posed as Monte Carlo estimation, but the models
are equivalent apart from the restriction of the parameter space to probability distributions. For
further details, including connectivity and support conditions for existence and uniqueness, see
Vardi (1985) or Gill et al. (1988). These conditions are assumed henceforth.
The preceding derivation assumes that Γ is countable, so counting measure dominates all
others. If Γ is not countable, no dominating measure exists. Nevertheless, the likelihood has a
unique maximum given by equation (3.4) provided that the connectivity and support conditions
are satisfied. This maximizing measure has finite support, and the maximum likelihood estimate
of c is given by equation (3.5).
3.3. Symmetry and group invariant submodels
In practice, we invariably ‘know’ that the base-line measure is either counting measure or
Lebesgue measure. The method described above completely ignores such information. As a
result, the conclusions apply equally to discrete sample spaces, finite dimensional vector spaces,
metric spaces, product spaces and arbitrary subsets thereof. On the negative side, if the baseline measure does have symmetry properties that are easily exploited, the estimator may be
considerably less efficient than it need be.
To see how symmetries might be exploited, let G be a compact group acting on Γ in such a
way that the base-line measure µ is invariant: µ(gA) = µ(A) for each A ⊂ Γ and each g ∈ G.
For example, G might be the orthogonal group, a permutation group or any subgroup. In this
reduced model, the parameter space consists only of measures that are invariant under G. The
log-likelihood function (3.3) simplifies because θ(x) = θ(gx) for each g ∈ G, and the minimal sufficient statistic is reduced to the symmetrized empirical distribution function P̂^G given by

P̂^G(A) = ave_{g∈G} {P̂(gA)}

for each A ⊂ Γ. If G is finite and acts freely, P̂^G has mass 1/(n|G|) at each of the transformed
sample points gx_i with g ∈ G. In a rough sense, the effective sample size is increased by a factor equal to the average orbit size. The maximum likelihood estimate of µ, obtained by equating the minimal sufficient statistic to its expectation, is

n P̂^G(dx) = Σ_{s=1}^k n_s ĉ_s⁻¹ q̄_s(x) µ̂(dx),

where

q̄_s(x) = ave_{g∈G} {q_s(gx)}.   (3.6)

In other words, the estimates from the submodel are still given by equations (3.4) and (3.5), but with q_s replaced by the group average q̄_s, and P̂ replaced by P̂^G. The group-averaged estimator may be interpreted as Rao–Blackwellization given the orbit, so group averaging cannot increase the variance of µ̂ or of the linear functionals ĉ_r (Liu (2001), section 2.5.5).
From the estimating equation point of view, the submodel replaces equation (3.2) by

c_r = ∫_Γ q̄_r(x) dµ,   (3.7)

which is a consequence of the assumption that µ is G invariant. However, if we proceed with equation (3.7) directly as with equation (3.2), it would appear that we need draws from

P̄_r(dx) = c_r⁻¹ q̄_r(x) µ(dx),

rather than P_r(dx). Although we can easily draw x̄_i from P̄_{y_i} by randomly drawing a g from G and setting x̄_i = gx_i, this step is unnecessary because P̄_{y_i} is invariant under G, and thus P̄_{y_i}({x̄_i}) = P̄_{y_i}({x_i}). Provided that G is sufficiently small that the group averaging in equation (3.6) represents a negligible computational cost, the submodel estimator is no more difficult to compute than the original ĉ. Consequently, the submodel is most useful if G is a small finite group. If G is the orthogonal group, ave_{g∈G} {q_s(gx)} is the average with respect to Haar measure over an infinite set. This is usually a non-trivial exercise in calculus or numerical integration, precisely what we had sought to avoid by simulation.
With a judicious choice of group action, the potential gain in efficiency can be very large. As an extreme example, suppose that the distributions P_r are such that for each r there exists a g ∈ G such that P_r(A) = P_1(gA) for every measurable A ⊂ Γ. Then q̄_r(x)/q̄_s(x) is a constant independent of x for all r and s, and the ratios of normalizing constants are estimated exactly with zero variance. This effect is evident in the example in Section 2 in which X ∼ P_σ implies gX ∼ P_{1/σ}. In practice, such a group action may be hard to find, but it is often possible to find a group such that there is substantially more overlap among the symmetrized distributions P̄_r(A) = ave_{g∈G} {P_r(gA)} than among the original {P_r}. For location–scale models with parameter (µ, σ), reflection in the circle of radius σ̂ centred at (µ̂, 0) is sometimes effective for integrals of likelihood functions or posterior distributions.
Although a symmetrized estimator using a reflection such as q̄(x) = {q(x) + q(−x)}/2 may remind us of the antithetic principle to reduce Monte Carlo error, these two methods are fundamentally different. The antithetic principle exploits symmetry in the sampling distributions, whereas group averaging utilizes the symmetry in the base-line measure. In addition, the effectiveness of using antithetic variates depends on the form of the integrand (e.g. q_2/q_1), as a non-linear function can make antithetic variates worse than the standard estimator (3.5) (see
Craiu and Meng (2004)). In contrast, group averaging can do no harm regardless of the form
of the integrand.
The importance link function method of MacEachern and Peruggia (2000) has some features
in common with group averaging, but the construction and the implementation are different in
major ways. Group structure has also been used by Liu and Sabatti (2000), but for a different
purpose, to improve the rate of mixing in Gibbs sampling.
Group averaging has been discussed by Evans and Swartz (2000), page 191, as a method
of variance reduction for importance sampling. Although the aims are similar, the details are
entirely different. Evans and Swartz considered only subgroups of the symmetry group of the
importance sampler. That is to say, the group action preserves both Lebesgue measure and
the importance sampling distribution. By contrast, our method is geared towards more general
bridge sampling designs, and the group action is not on the sampling distributions but on the
base-line measures. In our model, no preferential status is accorded to Lebesgue measure or to
any particular sampler, so it is not necessary that the group action should preserve either. On
the contrary, it is desirable that the group action should mix the distributions thoroughly to
make the averaged distributions as similar as possible.
3.4. Projection and linear submodels
Up to this point, the analysis has treated the k functions q_1, ..., q_k in a symmetric manner even where the design constants {n_1, ..., n_k} are not equal. In practice, there may be substantial asymmetries that can be exploited to reduce simulation error. In the simplest case, it may be known that two of the normalizing constants are equal, say c_2 = c_3. The reduced parameter space is then the set of measures µ such that ∫(q_2 − q_3) dµ = 0. Ideally, we would like to estimate µ by maximum likelihood subject to this homogeneous linear constraint. Even when it exists and is unique, the maximum likelihood estimator in this submodel is unlikely to be cost effective, so we seek a simple one-step alternative by linear projection.
Let c̃ be the unconstrained estimator from equation (3.5) or the group-averaged version in Section 3.3, and let Ṽ be the asymptotic variance matrix of log(c̃) as given in equation (4.2). We consider a submodel in which c lies in the subspace 𝒳 ⊂ ℝ^k. For example, a single homogeneous constraint among the constants gives rise to a subspace 𝒳 of dimension k − 1, and a matrix X of order k × (k − 1) whose columns span 𝒳.
Ignoring statistical error in the asymptotic variance matrix cov(c̃) = C Ṽ C, where C = diag(c̃), the weighted least squares projection is

ĉ = X(Xᵀ C⁻¹ Ṽ⁻ C⁻¹ X)⁻ Xᵀ C⁻¹ Ṽ⁻ 1,   (3.8)

where 1 is the constant vector with k components. See, for example, Hammersley and Handscomb (1964), section 5.7. Provided that all generalized inverses are reflexive, i.e. Ṽ⁻ Ṽ Ṽ⁻ = Ṽ⁻, the asymptotic variance matrix is

cov(ĉ) = X(Xᵀ C⁻¹ Ṽ⁻ C⁻¹ X)⁻ Xᵀ.

As always, only ratios of the c_r are estimable in the submodel.
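As a concrete illustration of equation (3.8), the following Python sketch carries out the projection and its covariance, using the Moore–Penrose pseudoinverse as the reflexive generalized inverse. The numerical inputs are hypothetical (k = 3 with the single constraint c_2 = c_3); in practice c̃ and Ṽ would come from equations (3.5) and (4.2).

    import numpy as np

    def project(c_tilde, V_tilde, X):
        # Weighted least squares projection of equation (3.8); only ratios of the
        # returned components are meaningful.
        C_inv = np.diag(1.0 / c_tilde)
        V_g = np.linalg.pinv(V_tilde)                  # reflexive generalized inverse of V-tilde
        A = np.linalg.pinv(X.T @ C_inv @ V_g @ C_inv @ X)
        one = np.ones(len(c_tilde))
        c_hat = X @ A @ X.T @ C_inv @ V_g @ one        # equation (3.8)
        cov_hat = X @ A @ X.T                          # asymptotic covariance of c-hat
        return c_hat, cov_hat

    # Hypothetical inputs: k = 3, constraint c_2 = c_3 encoded in the columns of X
    c_tilde = np.array([2.10, 0.98, 1.05])
    V_tilde = np.array([[0.040, 0.010, 0.010],
                        [0.010, 0.020, 0.005],
                        [0.010, 0.005, 0.020]])
    X = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 1.0]])
    c_hat, cov_hat = project(c_tilde, V_tilde, X)
    print(c_hat)                                       # second and third components are now equal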
It is perhaps worth mentioning by way of clarification the precise role of the control variates when the objective is to estimate a single ratio c_1/c_2. Suppose that k = 3 and that the design constants are (0, n, 0), so all observations are generated from P_2. Then the importance sampling estimator c̃_1/c̃_2 from equation (3.5) has asymptotic variance O(n⁻¹). Suppose now that P_1 is in fact a mixture of P_2 and P_3, so q_1 = α_2 q_2 + α_3 q_3, this being the reason for including q_3 as a control variate such that ∫(q_2 − q_3) dµ = 0. Then

c̃_1 = ∫ q_1(x) dµ̃ = ∫ (α_2 q_2 + α_3 q_3) dµ̃ = α_2 c̃_2 + α_3 c̃_3

is a linear combination of c̃_2 and c̃_3, so the covariance matrix of c̃ has rank 2. After projection, ĉ_2 = ĉ_3 and ĉ_1 = (α_2 + α_3)ĉ_2, so ĉ_1/ĉ_2 = α_2 + α_3 is estimated with zero error. The coefficients α_2 and α_3 are immaterial and need not be positive.
It is evident that projection cannot increase simulation variance, but it is not evident that
the potential for reduction in simulation error by projection is very great. Indeed, if we are
interested primarily in estimating c_1/c_0, the reduction is typically not worthwhile unless the control variates q_2, q_3, ... are chosen carefully. The way in which this may be done for Bayesian
posterior calculations is discussed in Section 5. Efficiency factors of the order of 5–10 appear
to be routinely achievable.
The discussion of control variates by Evans and Swartz (2000) involves subtraction rather
than projection, so the technique is different from that proposed above. The algebra in section 5.7 of Hammersley and Handscomb (1964) and in Rothery (1982), Ripley (1987) and Glynn
and Szechtman (2000) is, in most respects, equivalent to the projection method. There are some
differences but these are mostly superficial, starting with the complication that only ratios are
identifiable in our models and submodels. The main difference in implementation is that our
likelihood method automatically provides a variance matrix, so no preliminary experiment is
required to estimate the coefficients in the projection. Glynn and Szechtman (2000), section 8,
also noted that the required projection is a linear approximation to the nonparametric maximum
likelihood estimator.
3.5. Log-linear submodels
Most sets that arise in statistical applications have a large amount of structure that can potentially be exploited in the construction of submodels. For example, the spaces arising in genetic
problems related to the coalescent model have a tree structure with measured edges. A submodel
may be useful for Monte Carlo purposes if the estimate under the submodel is easy to compute
and has substantially reduced variance. The following example illustrates the principle as it
applies to spaces having a product structure.
If Γ = Γ_1 × ... × Γ_l is a product set, it is natural to consider the submodel consisting of product measures only, i.e. µ = µ_1 × ... × µ_l, in which µ_j is a measure on Γ_j. Then each x ∈ Γ has components (x_1, ..., x_l), and θ(x) = θ_1(x_1) + ... + θ_l(x_l) in an extension of the notation of Section 3.2. The sufficient statistic in equation (3.3) is reduced to the list of l marginal empirical distribution functions, and the resulting model is equivalent to the additive log-linear model (main effects only) for an l-dimensional contingency table, or l + 1 dimensional if the design has more than one sampler. No closed form estimator is available unless the design is such that, for each n_r > 0, the function q_r(x) is expressible as a product q_r(x) = q_{r1}(x_1) ... q_{rl}(x_l). Then each sampler generates observations with independent components, so µ̂_j is given by equation (3.4), applied to the jth component of x. The component measures µ̂_1, ..., µ̂_l are then independent.
In certain circumstances, the component sets Γ_1, ..., Γ_l are isomorphic, in which case we write Γ^l instead of Γ. It is then natural to restrict the parameter space to symmetric product measures of the form µ^l. Suppose, for example, that we wish to compute the integrals

c_θ = ∫_{ℝ²} exp{−(x_1 − θ)²/2 − (x_2 − θ)²/2 − x_1² x_2²/2} dx_1 dx_2

for various values of θ in the range (0, 4). These constitute a subfamily of distributions considered by Gelman and Meng (1991), whose conditional distributions are Gaussian. The parameter space in the submodel is the set of symmetric product measures µ × µ, so µ is a measure on ℝ. For a sampler, it is convenient to take any distribution with density f on ℝ, or the product distribution on ℝ². The numerical values reported here are based on the standard Cauchy sampler. The maximum likelihood estimate of µ on ℝ has mass 1/{n f(x_i)} at each data point, so the maximum likelihood estimate of µ² on ℝ² has mass {n² f(x_i) f(x_j)}⁻¹ at each ordered pair (x_i, x_j) in the sample. We find that the submodel estimator based on n simulated scalar observations is roughly equivalent in terms of statistical efficiency to 3n/2 bivariate observations in the unconstrained model. In principle, therefore, the crude importance sampling estimator can be improved by a factor of 3 without further data. On the negative side, the submodel estimator of c_θ,

ĉ_θ = ∫∫ q_θ(x, x′) dµ̂(x) dµ̂(x′),

is the sum of n² terms, as opposed to n terms in the unconstrained model. In terms of computational effort, therefore, the importance sampling estimator is superior. The submodel estimator is not cost effective unless the simulations are the dominant time-consuming part of the calculation.
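A minimal Python sketch of the symmetric product-measure estimator for this example is given below; it uses the standard Cauchy sampler mentioned in the text, while the sample size, the particular values of θ and the random seed are arbitrary. The n²-term double sum is compared with a crude importance sampling estimate that pairs the same draws into n/2 bivariate observations.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    x = rng.standard_cauchy(n)                         # scalar draws from the Cauchy sampler
    f = 1.0 / (np.pi * (1.0 + x ** 2))                 # sampler density at the draws

    def q(x1, x2, theta):                              # integrand of c_theta
        return np.exp(-0.5 * (x1 - theta) ** 2 - 0.5 * (x2 - theta) ** 2
                      - 0.5 * (x1 * x2) ** 2)

    # mu-hat on R has mass 1/{n f(x_i)} at x_i, so the symmetric product measure
    # mu-hat x mu-hat has mass 1/{n^2 f(x_i) f(x_j)} at each ordered pair (x_i, x_j).
    w = 1.0 / (n * f)
    for theta in (1.0, 2.0, 3.0):
        c_sub = float(np.sum(np.outer(w, w) * q(x[:, None], x[None, :], theta)))
        c_imp = float(np.mean(q(x[0::2], x[1::2], theta) / (f[0::2] * f[1::2])))
        print(f"theta = {theta}: product-measure {c_sub:.4f}, crude importance {c_imp:.4f}")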
There is one additional circumstance in which the submodel estimator achieves a gain in statistical efficiency that is sufficient to offset the increase in computational effort. If two functions q_r and q_s are such that the one-dimensional marginal distributions of P_r are the same as the one-dimensional marginal distributions of P_s, the estimated ratio ĉ_r/ĉ_s has a variance that is o(n⁻¹), i.e.

n var{log(ĉ_r/ĉ_s)} → 0

as n → ∞. Simulation results indicate that the rate of decrease is O(n⁻²). This phenomenon can be observed if we replace the preceding family of integrands by the family exp(−x_1² − x_2² + 2θx_1x_2) for −1 < θ < 1.
3.6. Markov chain models
The model in this section assumes that a sequence of draws constitutes an irreducible Markov
chain having known transition density q(·; x) with respect to the unknown measure µ on Γ. If the design calls for multiple chains, the transition densities are denoted by q_r(·; x) for r = 1, ..., k. It is not necessary that the chain be in equilibrium; nor is it necessary that the chain be constructed to have a particular stationary distribution. Under this new model, a draw is a chain of length l with distribution P_r^(l). The likelihood may be expressed as the product of l factors, the first three of which are

P_r(dx_1) = c_r⁻¹ q_r(x_1) µ(dx_1),
P_r(dx_2 | x_1) = c_r⁻¹(x_1) q_r(x_2; x_1) µ(dx_2),
P_r(dx_3 | x_2) = c_r⁻¹(x_2) q_r(x_3; x_2) µ(dx_3).

If the chain is not in equilibrium, the first factor is ignored. In effect, we now have l ‘independent’ observations from l distinct distributions, each with its own normalizing constant. The log-likelihood function for θ contributed by a single sequence of length l from P_r^(l) is then given by

Σ_{t=1}^l [θ(x_t) − log{c_r(x_{t−1}; θ)}] = l ∫_Γ θ(x) dP̂ − Σ_{t=1}^l log{c_r(x_{t−1}; θ)},
which is a function of the entire sequence. Although the form of the likelihood is very similar to equation (3.3), the empirical distribution function P̂ is no longer sufficient. The log-likelihood equation from k independent chains, in which the chain from P_s has length n_s, is given by

n P̂(dx) = Σ_{s=1}^k Σ_{t=1}^{n_s} P̂_s(dx | x_{t−1}) = Σ_{s=1}^k Σ_{t=1}^{n_s} ĉ_s⁻¹(x_{t−1}) q_s(x; x_{t−1}) µ̂(dx),   (3.9)

where n = Σ_s n_s, c_s(x_0) ≡ c_s, q_s(x; x_0) ≡ q_s(x) and P_s(dx | x_0) ≡ P_s(dx). Consequently, for each r = 1, ..., k and t = 0, ..., n_r − 1, we have

ĉ_r(x_t) = ∫ q_r(x; x_t) dµ̂(x) = Σ_{i=1}^n q_r(x_i; x_t) / Σ_{s=1}^k Σ_{j=1}^{n_s} ĉ_s⁻¹(x_{j−1}) q_s(x_i; x_{j−1}),   (3.10)

which can be solved for {ĉ_r(x_t), t = 0, ..., n_r − 1; r = 1, ..., k} using the Deming–Stephan algorithm. When all draws are independent and the margins are equal, q_s(x; x_{t−1}) = q_s(x) does not depend on x_{t−1}, and equation (3.10) reduces to equation (3.5).
At first sight, we might doubt that equation (3.10) could provide anything useful because we have at most one draw from each of the targeted P_r, namely the first component of each chain; recall that we have purposely ignored the information that all the margins are the same. That is, the transition probabilities q_r(x_t; x_{t−1}) can be arbitrary, and in fact they can even be time inhomogeneous (i.e. q_r(x_t; x_{t−1}) can be replaced by q_{r,t}(x_t; x_{t−1})), as long as the chain is not reducible. Furthermore, it appears that the number of ‘parameters’ {ĉ_r(x_t), t = 0, ..., n_r − 1; r = 1, ..., k} is always the same as the number of data points. However, we must keep in mind that the model parameter is not c, but the base-line measure µ, and as long as a draw is from a known density with respect to µ it provides information about µ. This is in fact the fundamental reason that importance sampling can provide consistent estimators when draws are taken from an unrelated trial density. The information from the trial distributions about the base-line measure must be adequate to estimate the base-line measure of the target distribution. In particular, the union of the supports of the trial densities must cover the support of the target density.
4. Asymptotic covariance matrix
4.1. Multinomial information measure
The log-likelihood (3.3) is evidently a sum of k multinomial log-likelihoods, sharing the same parameter θ. The Fisher information for θ is best regarded explicitly as a measure on Γ × Γ such that I(A, B) = I(B, A) and I(A, Γ) = 0 for A, B ⊂ Γ. In particular, the multinomial information measure associated with the distribution P_r on Γ is given by P_r(A ∩ B) − P_r(A) P_r(B). In the log-likelihood (3.3), the total Fisher information measure for θ is

n I(A, B) = Σ_{r=1}^k n_r {P_r(A ∩ B) − P_r(A) P_r(B)}.

At least formally, the asymptotic covariance matrix of θ̂ is the inverse Fisher information matrix n⁻¹ I⁻, and the asymptotic covariance matrix of dµ̂ is

n⁻¹ dµ(x) dµ(y) I⁻(x, y),
where I⁻(x, y) is the (x, y) element of I⁻, indexed by Γ × Γ. From expression (3.5) for ĉ_r, we find that the asymptotic covariance of ĉ_r and ĉ_s is c_r c_s V_rs, where

V_rs = cov{log(ĉ_r), log(ĉ_s)} = n⁻¹ ∫_{Γ×Γ} I⁻(x, y) dP_r(x) dP_s(y).   (4.1)

As always, only logarithmic contrasts have variances. In this expression P_r is defined by equation (3.1) for each r, provided that c_r is finite and non-zero. There may be integrands q_r that take both positive and negative values, in which case P_r is not a probability distribution.
For the log-linear submodel discussed in Section 3.5 in which each sampler has independent components, it is necessary to replace I⁻(x, y) in equation (4.1) by the sum I_1⁻(x_1, y_1) + ... + I_l⁻(x_l, y_l), where I_j is the Fisher information measure for θ_j. Expression (4.1), or its generalization, gives the O(n⁻¹) term in the asymptotic variance, but this term may be 0. Examples of this phenomenon are given in Sections 3.5 and 5.2. In such cases, more refined calculations are required to find the asymptotic distribution of log(ĉ).
4.2. Matrix version
The results of the preceding section are easily expressed in matrix notation, at least when all
calculations are performed at the maximum likelihood estimate. Let W = diag(n_1, ..., n_k), and let P̂ be the n × k matrix whose (i, r) element is

P̂_r(x_i) = {q_r(x_i)/ĉ_r} / Σ_s n_s ĉ_s⁻¹ q_s(x_i).

The matrix P̂W arises naturally in applying the Deming–Stephan algorithm to solve the maximum likelihood equation. Note that the column sums of P̂ are all 1, whereas the row sums satisfy Σ_r n_r P̂_r(x_i) = 1 for each i.
The Fisher information for θ at θ̂ is (the measure whose density is represented by the matrix) I_n − P̂ W P̂ᵀ, and the asymptotic covariance matrix of log(ĉ) is given by

V̂ = P̂ᵀ(I_n − P̂ W P̂ᵀ)⁻ P̂,   (4.2)

where I_n is the identity matrix of order n. Typically, the matrix I_n − P̂ W P̂ᵀ has rank n − 1 with kernel equal to 1, the set of constant vectors. Then I_n − P̂ W P̂ᵀ + 11ᵀ/n is invertible with approximately unit eigenvalues, and the inverse matrix is also a generalized inverse of I_n − P̂ W P̂ᵀ suitable for use in equation (4.2). Although the inversion of n × n matrices can be avoided, all the numerical calculations reported in this paper use this variance formula.
The eigenvalues of I_n − P̂ W P̂ᵀ + 11ᵀ/n are in fact all less than or equal to 1, with equality for simple Monte Carlo designs in which all observations are generated from a single sampler. For more general designs having more than one sampler, the approximate variance formula V̂ ≈ P̂ᵀP̂ is anticonservative. That is to say, P̂ᵀ(I_n − P̂ W P̂ᵀ)⁻ P̂ ≥ P̂ᵀP̂ in the sense of the Löwner ordering. In practice, if all the samplers have support equal to Γ, the underestimate is frequently negligible. The approximate variance formula is easier to compute and may be
adequate for projection purposes described in Section 3.4.
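The variance formula (4.2) is easily coded directly when n is modest. The Python sketch below forms P̂ from the array of q_r(x_i), the design weights and the fitted constants, and inverts I_n − P̂WP̂ᵀ + 11ᵀ/n as suggested above; the function and variable names are ours, and in practice ĉ would be the output of the fixed-point iteration for equation (3.5).

    import numpy as np

    def log_c_covariance(Q, n_r, c_hat):
        # Asymptotic covariance matrix of log(c-hat) from equation (4.2).
        # Q is the n x k array of q_r(x_i); only contrasts of log(c-hat) have variances.
        n = Q.shape[0]
        denom = Q @ (n_r / c_hat)                      # sum_s n_s q_s(x_i)/c_s for each i
        P = (Q / c_hat) / denom[:, None]               # the n x k matrix P-hat
        M = np.eye(n) - P @ np.diag(n_r) @ P.T + np.ones((n, n)) / n
        return P.T @ np.linalg.inv(M) @ P              # M supplies the generalized inverse

    # Hypothetical usage, e.g. with Q, n_r and c from the earlier Gaussian sketch:
    #   V = log_c_covariance(Q, n_r, c)
    #   var_log_ratio = V[0, 0] + V[1, 1] - 2 * V[0, 1]   # variance of log(c_1/c_2)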
5. Applications to Bayesian computation
5.1. Posterior probability calculation
Consider a regression model in which the component observations y_1, ..., y_m are independent and exponentially distributed with means such that log{E(Y_i)} = β_0 + β_1 x_i, where x_1, ..., x_m are known constants. For illustration m = 10, x_i = i and the values y_i are

2.28, 1.46, 0.90, 0.19, 1.88, 0.72, 2.06, 4.21, 2.90, 7.53.

The maximum likelihood estimate is β̂ = (−0.0668, 0.1494). The asymptotic standard error of β̂_1 is 0.1101 using the expected Fisher information, and 0.0921 using the observed Fisher information.
Let π(·) be a prior distribution on the parameter space. The posterior probability pr(β_1 > 0 | y) is the ratio of two integrals. In the denominator the integrand is the product of the likelihood and the prior. The numerator has an additional Heaviside factor taking the value 1 when β_1 > 0 and 0 otherwise. One way to approximate this ratio is to simulate observations from the posterior distribution and to compute the fraction that have β_1 > 0. However, this exercise is both unnecessary and inefficient.
Let q_0(β) = L(β) π(β) be the product of the likelihood and the prior at β, and let q_1(β) = q_0(β) I(β_1 > 0) be the integrand for the numerator. For auxiliary functions, we choose q_2(β) to be the bivariate normal density at β with mean β̂ and inverse covariance matrix equal to the Fisher information at β̂. It is marginally better to use the observed Fisher information for the inverse variance matrix in q_2, but in principle we could use either or both. Let q_3(β) be the product q_2(β) I(β_1 > 0)/K, where K = pr(β_1 > 0), computed under the normal distribution q_2. In this example,

K = Φ(0.1494/0.0921) = 0.9476

for the observed information approximation, or 0.9127 for the expected information approximation. The common normalizing constant 2π|Î|^{−1/2} for q_2 and q_3 may be ignored. By construction, therefore, ∫ q_2(β) dβ = ∫ q_3(β) dβ, not necessarily equal to 1.
The numerical calculations that follow use the improper uniform prior and n = 400 simulations from the normal proposal density q_2, so the design constants are n = (0, 0, 400, 0). The unconstrained maximum likelihood estimates obtained from equation (3.5) were

log(c̃_1/c̃_0) = −0.0525 ± 0.0137,
log(c̃_3/c̃_2) = 0.0078 ± 0.0108

with correlation 0.901. Note, however, that c_2 = c_3 by design, but the estimator is not similarly constrained at this stage. The estimated posterior probability pr(β_1 > 0 | y) is thus exp(−0.0525) = 0.9489, with an approximate 90% confidence interval (0.928, 0.970).
By imposing the constraint c_2 = c_3 on the parameter space we obtain a new estimator by weighted least squares projection log(ĉ) = X(Xᵀ Ṽ⁻ X)⁻ Xᵀ Ṽ⁻ log(c̃). Here c̃ is the unconstrained estimator, Ṽ is the estimated variance matrix of log(c̃) and X is the model matrix

X =
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 1 ]
[ 0 0 1 ].

The resulting estimate and its standard error are

log(ĉ_1/ĉ_0) = log(c̃_1/c̃_0) − 1.136 log(c̃_3/c̃_2) = −0.0613 ± 0.0059.

The alternative projection (3.8) is preferable in principle, but in this case the two projections yield indistinguishable estimates. The point estimate of the posterior probability is 0.940, with an approximate 90% confidence interval (0.931, 0.950). The efficiency factor is roughly 5, which
is similar to the factors obtained by Rothery (1982) using a similar technique in a problem
concerning power calculation.
Much of the gain in efficiency in this example could be achieved by taking the projection
coefficient to be −1 on the log-scale, i.e. by subtraction using the control variate in the traditional way. However, the real applications that we have in mind are those in which a moderately
large number of integrals is to be computed simultaneously, as in likelihood calculations for pedigree analysis, or computing the entire marginal posterior for β_1. It is then desirable to include several control variates and to compute the projection using equation (3.8). In the present example, the posterior probabilities pr(β_1 > b | y) for six equally spaced values of b in (0, 0.25) were
computed simultaneously by this method using the corresponding six normal control variates.
The six efficiency factors were 5.3, 6.2, 8.4, 10.5, 8.4 and 5.9.
This is a rather small example with m = 10 data points in the regression, in which an
appreciable discrepancy might be expected between the posterior and the normal approximation. Numerical investigations indicate that the efficiency factor tends to increase with larger
m, as might be expected. The approximate variance formula mentioned at the end of Section 4.2 is exact in this case, and effectively exact when observations are simulated from both of the
normal approximations.
5.2. Posterior integral for probit regression
Consider a probit regression model in which the responses are independent Bernoulli variables with pr(y_i = 1) = Φ(x_iᵀβ), where x_i is a vector of covariates and β is the parameter. The aim of this exercise is to calculate the integral of the product L(β; y) π(β), in which L(β; y) is the likelihood function and π(·) is the prior, here taken to be Gaussian. This is done using a Markov chain constructed to have stationary distribution equal to the posterior.
The chain is generated by the standard technique of Gibbs sampling. Following Albert and Chib (1993), the parameter space is augmented to include latent variables z_i ∼ N(x_iᵀβ, 1) such that y_i is the sign of z_i. A cycle of the Gibbs sampler is completed in two steps: β | y, z and z | y, β, which are respectively multivariate normal and a product of independent truncated normal distributions. The transition probability from θ′ = (β′, z′) to θ = (β, z) of the generated Markov chain is

P(dθ | θ′) = c⁻¹(θ′) p(z | y, β) p(β | y, z′) µ(dθ),

where p(z | y, β) and p(β | y, z′) are the full conditional densities. By construction, the normalizing constants c(θ) are known to be equal to 1 for each θ.
Let {(β_t, z_t)}_{t=1}^n be the simulated values from the Markov chain. The required integral is estimated by the ratio of

∫ L(β) π(β) dµ̂ = Σ_{i=1}^n L(β_i) π(β_i) / Σ_{j=1}^n p(β_i | y, z_j)   (5.1)

to the known value c(θ_t) = 1. In effect, ĉ(θ_t) in equation (3.10) is replaced by the known value, which is then used to compute the approximate maximizing measure µ̂. The resulting estimator, which may be interpreted as importance sampling with the semideterministic mixture

n⁻¹ Σ_{j=1}^n p(· | y, z_j)
as the sampler, is consistent but may not be fully efficient. For large n, n times the summand evaluated at any fixed β is a consistent estimate of the integral. Such an estimate was suggested by Chib (1995), who recommended choosing a value β* of high posterior density, taken to be the mean value in the calculations below. The summand as a function of β also appears in calculations by Ritter and Tanner (1992) and Cui et al. (1992), but the purpose there is to monitor convergence of the Gibbs sampler, either with multiple parallel chains or a single long chain divided into batches.
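To make the mechanics of equation (5.1) concrete, the Python sketch below applies it to a hypothetical two-block Gaussian Gibbs sampler rather than to the Albert–Chib probit sampler: the stationary distribution of β is the standard normal kernel exp(−β²/2), which plays the role of L(β)π(β), so the true value of the integral is √(2π). The correlation, chain length and seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    rho, n = 0.8, 1000
    s = np.sqrt(1.0 - rho ** 2)

    # Gibbs sampler for a bivariate normal (beta, z) with correlation rho; the
    # beta-marginal is proportional to exp(-beta^2/2), whose integral is sqrt(2*pi).
    beta = z = 0.0
    betas, zs = np.empty(n), np.empty(n)
    for t in range(n):
        beta = rng.normal(rho * z, s)                  # draw beta | z
        z = rng.normal(rho * beta, s)                  # draw z | beta
        betas[t], zs[t] = beta, z

    def p_cond(beta, z):                               # full conditional density p(beta | z)
        return np.exp(-0.5 * ((beta - rho * z) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

    target = np.exp(-0.5 * betas ** 2)                 # L(beta_i) pi(beta_i)
    denom = p_cond(betas[:, None], zs[None, :]).sum(axis=1)   # sum_j p(beta_i | z_j)
    estimate = float(np.sum(target / denom))           # equation (5.1)
    print(estimate, np.sqrt(2.0 * np.pi))              # estimate versus the true value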
For numerical illustration and comparisons, we use Chib’s (1995) example, taken from a
case-study by Brown (1980). The data were collected on 53 prostate cancer patients to predict
the nodal involvement. There are five predictor variables and a binary response taking the value
1 if the lymph nodes were affected. For the results reported here, three covariates are used:
the logarithm of the level of serum acid phosphatase, X-ray reading and stage of the tumour,
so the model matrix X is of order 53 × 4 with a constant column. The prior is centred at
(0.75, 0.75, 0.75, 0.75) with variance A = diag(5², 5², 5², 5²), as in Chib (1995). The Gibbs sampler was started at β = ÃXᵀy, where Ã = (A⁻¹ + XᵀX)⁻¹, and run for a total of N = n_0 + n cycles with the first n_0 discarded. The process was repeated 1000 times for several values of N.
The mean and standard deviation of 1000 estimates of the logarithm of the integral are given in
Table 1. The central processor unit (CPU) time in seconds was measured separately for Gibbs
sampling and for subsequent integral evaluations. For N = 500 + 5000, the results given here
for Chib’s method are in close agreement with those reported by Chib (1995). All programming
was done in C.
Over the range studied, the statistical efficiency of the likelihood method over Chib’s estimator with n_0 + n draws is not constant, but roughly n/12, i.e. increasing with n. For example, the efficiency factor at N = 500 + 5000 is estimated by

(0.0211/0.00103)² = 420.

To achieve the same accuracy as the likelihood method using 5000 draws, Chib’s method requires 420 × 5000 Gibbs draws, with CPU time 420 × 0.49 = 206 s for the draws alone. The final row in
time and the variance. Over the range studied, the new method achieves a precision per CPU
second of roughly 8.5–9.5 times that of Chib’s method. For fixed n, the likelihood estimator
is computationally more demanding, but the additional effort is worthwhile by a substantial
factor.
Table 1. Numerical comparison of two integral estimators (log-scale)

Results for the following Gibbs sampler cycles and methods (the CPU seconds in the column headings refer to the Gibbs sampling itself):

                            N = 50+500          N = 100+1000        N = 250+2500        N = 500+5000
                            (0.06 CPU s)        (0.11 CPU s)        (0.25 CPU s)        (0.49 CPU s)
                            Chib    Likelihood  Chib    Likelihood  Chib    Likelihood  Chib    Likelihood
Mean + 34                   −0.5693  −0.5796    −0.5588  −0.5661    −0.5542  −0.5569    −0.5510  −0.5534
Standard deviation           0.0652   0.00937    0.0475   0.00505    0.0299   0.00204    0.0211   0.00103
CPU seconds                 <0.01     0.28      <0.01     1.03      <0.01     5.87       0.01    21.94
Precision per CPU second     3921    33500       4029    34396       4474    39263       4492    42024
As it happens, this is one of those problems in which the likelihood estimator of µ converges at the standard rate, but the estimate (5.1) of the integral converges at rate n⁻¹. The bias and standard deviation are both O(n⁻¹): in this example they appear to be approximately equal in magnitude. By contrast, Chib’s estimator converges at the standard n^{−1/2} rate.
6. Retrospective formulation
It is possible to give a deceptively simple derivation of equation (3.5) by a retrospective argument as follows. Regardless of how the design was in fact selected, we may regard the sample
size vector (n_1, ..., n_k) as the observed value of a multinomial random vector with index n and parameter vector (π_1, ..., π_k). This assumption is innocuous provided that (π_1, ..., π_k) are treated as free parameters to be estimated from the data. Evidently, π̂_r = n_r/n is the maximum likelihood estimate.
Mimicking the argument that is frequently employed in retrospective designs, we argue as follows. Given that the point x has been observed, what is the probability that this point was generated from distribution P_r rather than from one of the other distributions? A simple calculation using Bayes’s theorem shows that the required conditional probability vector is

p(x) = ( {q_1(x)π_1/c_1} / Σ_s {q_s(x)π_s/c_s}, ..., {q_k(x)π_k/c_k} / Σ_s {q_s(x)π_s/c_s} ).
These conditional probabilities depend only on the ratios π_r/c_r, and not otherwise on the base-line measure µ. Conditioning on x does not eliminate the base-line measure entirely, for c_r is a linear function of µ. The conditional likelihood associated with the single observation (y, x) is thus

(π_y/c_y) q_y(x) / Σ_r (π_r/c_r) q_r(x),

and the log-likelihood is

Σ_r n_r log(π_r/c_r) − Σ_{i=1}^n log{ Σ_r (π_r/c_r) q_r(x_i) }.   (6.1)
Once again, the observed count vector (n_1, ..., n_k) is the complete sufficient statistic, and the association of y-values with x-values is not informative.
Differentiation with respect to the parameter log(c_r) gives

∂l/∂{log(c_r)} = c_r ∂l/∂c_r = −n_r + Σ_{i=1}^n (π_r/c_r) q_r(x_i) / Σ_s (π_s/c_s) q_s(x_i).

By substituting the known value π̂_r = n_r/n and setting the derivative to 0, we obtain

ĉ_r = Σ_{i=1}^n q_r(x_i) / Σ_s n_s q_s(x_i)/ĉ_s,   (6.2)
which is identical to equation (3.5). That is to say, the retrospective argument, previously put
forward by Geyer (1994), gives exactly the right point estimator of c.
The astute reader will notice that, when we substitute n_r/n for π_r in the retrospective likelihood, the resulting function depends only on those c_r for which n_r > 0, and this restriction also applies to the maximum likelihood equation (6.2). The apparent equivalence of equations (6.2) and (3.5) is thus an illusion. By contrast with the model in Section 3, the retrospective argument does not lead to the conclusion that equation (6.2) is the maximum likelihood estimate of an integral for which q_r(·) takes negative values.
Even if we are willing to overlook the remarks in the preceding paragraph and to assume that n_r > 0 for each r, the objections are not easily evaded. The difficulty at this point is that the conditional likelihood is a function of the ratios φ_r = log(π_r/c_r), so the vectors π and c are not separately estimable from the conditional likelihood. It is tempting, therefore, to substitute n_r/n for π_r, treating this as a known prior probability. After all, who can tell how the sample sizes were chosen? However plausible this argument may sound, the resulting ‘likelihood’ does not give the correct covariance matrix for log(ĉ). The components of the negative logarithmic second-derivative matrix are

−∂²l/∂{log(c_r)}∂{log(c_s)} = δ_rs Σ_{i=1}^n (π_r/c_r) q_r(x_i) / Σ_t (π_t/c_t) q_t(x_i) − Σ_{i=1}^n (π_r/c_r)(π_s/c_s) q_r(x_i) q_s(x_i) / {Σ_t (π_t/c_t) q_t(x_i)}².   (6.3)
At (π̂, ĉ), the first term is equal to the diagonal matrix n_r δ_rs. The second term is non-negative definite. To put this in an alternative matrix form, write p_i for the conditional probability vector p(x_i). Then the negative second-derivative matrix shown above is

I_φ = Σ_{i=1}^n (diag{p_i} − p_i p_iᵀ) = W − W P̂ᵀ P̂ W,
using the matrix notation of Section 4.2.
To see that the inverse of this matrix cannot be the correct asymptotic variance of log(ĉ), consider the limiting case in which q_1 = q_2 = ... = q_k are all equal. It is then known with certainty that c_1 = ... = c_k, even in the absence of data. But, in the second-derivative matrix shown above, all the conditional probability vectors p_i are equal to (π_1, ..., π_k). The second-derivative matrix is in fact the multinomial covariance matrix with index n and probability vector π. The generalized inverse of this matrix does give the correct asymptotic variances and covariances for contrasts of φ̂ = log(π̂/ĉ), as the general theory requires. But it does not give the correct asymptotic variances for log(ĉ), or contrasts thereof.
This line of argument can be partly rescued, but to do so it is necessary to show that π̂ and ĉ are asymptotically independent. This is not obvious and will not be proved here, but it is a consequence of orthogonality of parameters in exponential family models. By standard properties of likelihoods, cov{log(π̂) − log(ĉ)} = Iφ⁻ asymptotically. On the presumption that π̂ and ĉ are asymptotically uncorrelated, we deduce that
$$
\mathrm{cov}\{\log(\hat c)\} = I_\phi^{-} - \mathrm{cov}\{\log(\hat\pi)\} = I_\phi^{-} - \mathrm{diag}\{1/(n\pi)\} + \mathbf{1}\mathbf{1}^{\mathrm T}/n. \qquad (6.4)
$$
The term 11ᵀ/n does not contribute to the variance of contrasts of log(ĉ) and can therefore be ignored. When evaluated at (ĉ, π̂), the resulting expression (W − W P̂ᵀP̂ W)⁻ − W⁻¹, involving no n × n matrices, is identical to equation (4.2), provided that each component nr of W is strictly positive.
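For completeness, here is a minimal sketch (ours, not from the paper) of how this covariance estimate could be assembled numerically. P̂ is assumed here to be the n × k matrix with entries {qr(xi)/ĉr}/{Σs ns qs(xi)/ĉs} and W = diag(nr); this is our reading of the Section 4.2 notation, which is not reproduced in this section:

```python
import numpy as np

def log_c_covariance(Q, n_sizes, c_hat):
    """Asymptotic covariance of log(ĉ) via (W - W P̂ᵀ P̂ W)⁻ - W⁻¹.

    Q       : (n, k) array with Q[i, r] = q_r(x_i)
    n_sizes : length-k sample sizes n_r (all assumed strictly positive)
    c_hat   : length-k fitted constants ĉ_r
    """
    denom = Q @ (n_sizes / c_hat)              # sum_s n_s q_s(x_i)/ĉ_s
    P = (Q / c_hat) / denom[:, None]           # assumed form of P̂ from Section 4.2
    W = np.diag(n_sizes.astype(float))
    I_phi = W - W @ P.T @ P @ W
    return np.linalg.pinv(I_phi) - np.diag(1.0 / n_sizes)
```

The Moore–Penrose inverse plays the role of the generalized inverse; as noted above, only contrasts of log(ĉ) have meaningful variances when no normalizing constant is known.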
7. Conclusions
The key contribution of this paper is the formulation of Monte Carlo integration as a statistical model, making explicit what information is available for use and what information is ‘out
of bounds’. Given that agreement is reached on the information available, it is now possible
to say whether an estimator is or is not efficient. Likelihood methods are thus made available
not only for parameter estimation but also for the estimation of variances and covariances for
various simulation designs, of which importance sampling is the simplest special case. More
interestingly, however, three classes of submodel are identified that have substantial potential for variance reduction. The associated operations are group averaging for group invariant
submodels, linear projection for linear submodels or mixtures and Markov chain models for
Markov chain Monte Carlo schemes. To achieve worthwhile gains in efficiency, it is necessary
to exploit the structure of the problem, so it is not easy to give universally applicable advice.
None-the-less, three simple examples show that efficiency factors in the range 5–10, and possibly
larger, are routinely achievable in certain types of statistical computations. We believe that such
factors are not exceptional, particularly for Bayesian posterior calculations.
It would be remiss of us to overlook the peculiar dilemma for Bayesian computation that
inevitably accompanies the methods described here. Our formulation of all Monte Carlo activities is given in terms of parametric statistical models and submodels. These are fully fledged
statistical models in the sense of the definition given by McCullagh (2002), no more or no less
artificial than any other statistical model. Given that formulation, it might seem natural to
analyse the model by using modern Bayesian methods, beginning with a prior on Θ. If we
adopt the orthodox interpretation of a prior distribution as the one that summarizes the
extent of what is known about the parameter, we are led to the Dirac prior on the true measure,
which is almost invariably Lebesgue measure. For once, the prior is not in dispute. This choice
leads to the logically correct, but totally unsatisfactory, conclusion that the simulated data are
uninformative. The posterior distribution on Θ is equal to the prior, which is unhelpful for
computational purposes. It seems, therefore, that further progress calls for a certain degree of
pretence or pragmatism by selecting a non-informative, or at least non-degenerate, prior on
Θ. Given such a prior distribution, the posterior distribution on Θ can be obtained, and the
posterior moments of the required integrals computed by standard formulae. Although these
operations are straightforward in principle, the computations are rather forbidding, so much
so that it would be impossible to complete the calculations without resorting to Monte Carlo
methods! This computational black hole, an infinite regress of progressively more complicated
models, is an unappealing prospect, to say the least. With this in mind, it is hard to avoid the
conclusion that the old-fashioned maximum likelihood estimate has much to recommend it.
Acknowledgements
The comments of four referees on an earlier draft led to substantial improvements in presentation. We would like to thank Peter Bickel for pointing out the connection with biased sampling
models, and Peter Donnelly for discussions on various points.
Support for this research was provided in part by National Science Foundation grants DMS-0071726 (for McCullagh and Tan) and DMS-0072510 (for Meng and Nicolae).
References
Albert, J. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. J. Am. Statist. Ass.,
88, 669–679.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975) Discrete Multivariate Analysis: Theory and Practice.
Cambridge: Massachusetts Institute of Technology Press.
Brown, B. W. (1980) Prediction analyses for binary data. In Biostatistics Casebook (eds R. J. Miller, B. Efron,
B. W. Brown and L. E. Moses). New York: Wiley.
Chib, S. (1995) Marginal likelihood from the Gibbs output. J. Am. Statist. Ass., 90, 1313–1321.
Craiu, R. V. and Meng, X.-L. (2004) Multi-process parallel antithetic coupling for backward and forward Markov
chain Monte Carlo. Ann. Statist., to be published.
Cui, L., Tanner, M. A., Sinha, D. and Hall, W. J. (1992) Monitoring convergence of the Gibbs sampler: further
experience with the Gibbs stopper. Statist. Sci., 7, 483–486.
Deming, W. E. and Stephan, F. F. (1940) On a least-squares adjustment of a sampled frequency table when the
expected marginal totals are known. Ann. Math. Statist., 11, 427–444.
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997) Computing Bayes factors by combining
simulation and asymptotic approximations. J. Am. Statist. Ass., 92, 903–915.
Evans, M. and Swartz, T. (2000) Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford:
Oxford University Press.
Firth, D. and Bennett, K. E. (1998) Robust models in probability sampling. J. R. Statist. Soc. B, 60, 3–21;
discussion, 41–56.
Gelman, A. and Meng, X.-L. (1991) A note on bivariate distributions that are conditionally normal. Am. Statistn,
45, 125–126.
Gelman, A. and Meng, X. L. (1998) Simulating normalizing constants: from importance sampling to bridge
sampling to path sampling. Statist. Sci., 13, 163–185.
Geyer, C. J. (1994) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo.
Technical Report 568. School of Statistics, University of Minnesota, Minneapolis.
Gill, R., Vardi, Y. and Wellner, J. (1988) Large sample theory of empirical distributions in biased sampling models.
Ann. Statist., 16, 1069–1112.
Glynn, P. W. and Szechtman, R. (2000) Some new perspectives on the method of control variates. In Monte Carlo
and Quasi-Monte Carlo Methods (eds K.-T. Fang, F. J. Hickernell and H. Niederreiter), pp. 27–49. New York:
Springer.
Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo Methods. London: Chapman and Hall.
Hesterberg, T. (1995) Weighted average importance sampling and defensive mixture distributions. Technometrics,
37, 185–194.
Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling without replacement from a finite
universe. J. Am. Statist. Ass., 47, 663–683.
Lindsay, B. (1995) Mixture Models: Theory, Geometry and Applications. Hayward: Institute of Mathematical
Statistics.
Liu, J. S. (2001) Monte Carlo Strategies in Scientific Computing. New York: Springer.
Liu, J. S. and Sabatti, C. (2000) Generalized Gibbs sampler and multigrid Monte Carlo for Bayesian computation.
Biometrika, 87, 353–369.
MacEachern, S. N. and Peruggia, M. (2000) Importance link function estimation for Markov Chain Monte Carlo
methods. J. Comput. Graph. Statist., 9, 99–121.
Mallows, C. L. (1985) Discussion of ‘Empirical distributions in selection bias models’ by Y. Vardi. Ann. Statist.,
13, 204–205.
McCullagh, P. (2002) What is a statistical model (with discussion)? Ann. Statist., 30, 1225–1310.
Meng, X.-L. and Wong, W. H. (1996) Simulating ratios of normalizing constants via a simple identity: a theoretical
explanation. Statist. Sin., 6, 831–860.
Owen, A. and Zhou, Y. (2000) Safe and effective importance sampling. J. Am. Statist. Ass., 95, 135–143.
Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H. and Rasbash, J. (1998) Weighting for unequal
selection probabilities in multilevel models (with discussion). J. R. Statist. Soc. B, 60, 23–40; discussion, 41–56.
Ripley, B. D. (1987) Stochastic Simulation. New York: Wiley.
Ritter, C. and Tanner, M. A. (1992) Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs
sampler. J. Am. Statist. Ass., 87, 861–868.
Rothery, P. (1982) The use of control variates in the Monte Carlo estimation of power. Appl. Statist., 31, 125–129.
Vardi, Y. (1985) Empirical distributions in selection bias models. Ann. Statist., 13, 178–203.
Discussion on the paper by Kong, McCullagh, Meng, Nicolae and Tan
Michael Evans (University of Toronto)
This is an interesting and stimulating paper containing some useful clarifications. The most provocative
part of the paper is the claim that treating the approximation of an integral
$$
I = \int f(x)\,\mu(dx) \qquad (1)
$$
as a problem of statistical inference gives practically meaningful results. Although the paper does effectively argue this, as my discussion indicates, I still retain some doubt about the necessity of adopting this
point of view.
In the paper we have a sequence of integrals
$$
c_i = \int q_i(x)\,\mu(dx),
$$
for i = 1, . . ., r. We want to approximate the ratios ci/cj based on samples of size ni from the normalized densities specified by the qi (ni = 0 whenever qi takes negative values and we assume at least one ni > 0).
If only one ni is non-zero, say n1 , then the estimators are given by the importance sampling estimates
$$
\frac{c_i}{c_j} = \sum_{k=1}^{n}\frac{q_i(x_k)}{w(x_k)} \Bigg/ \sum_{k=1}^{n}\frac{q_j(x_k)}{w(x_k)}, \qquad (2)
$$
where w = q1/c1. Note that c1 need not be known to implement equation (2). As in the general problem
of importance sampling, q1 could be a bad sampler and require unrealistically large sample sizes to make
these estimates accurate. In general, it is difficult to find samplers that can be guaranteed to be good in
problems.
If we have several ni > 0, then the problem is to determine how we should combine these samples to
estimate the ratios. Perhaps an obvious choice is the importance sampling estimate (2) where w is now the
mixture
$$
w(x) = \sum_{i=1}^{r}\frac{n_i}{n}\,\frac{q_i(x)}{c_i}. \qquad (3)
$$
Of course there is no guarantee that this will be a good importance sampler but, more significantly, we do
not know the ci and so cannot implement this directly. Still, if we put equation (3) into equation (2), we
obtain a system of equations in the unknown ratios and, as the paper points out, this system can have a
unique solution for the ratios. These solutions are the estimates that are discussed in the paper.
The paper justifies these estimates as maximum likelihood estimates and more importantly uses likelihood theory to obtain standard errors for the estimates. The above intuitive argument for the estimates
does not seem to lead immediately to error estimates. One might suspect, however, that an argument based
on the delta theorem should generate error estimates. Since I shall not provide such an argument, we must
acknowledge the accomplishment of the paper in doing this. There are still some doubts, however, about
the necessity for the statistical formulation and the likelihood arguments.
The group averaging that is discussed in the paper seems closely related to the use of this technique in
Evans and Swartz (2000). There it is shown that, when G is a finite group of volume preserving symmetries
of the importance sampler w, then we can replace the basic importance sampling estimate f(x)/w(x), with x ∼ w, by f^G(x)/w(x), where
$$
f^G(x) = \frac{1}{|G|}\sum_{g\in G} f(gx).
$$
This is unbiased for integral (1) and satisfies
$$
\mathrm{var}_w\!\left(\frac{f^G}{w}\right) = \mathrm{var}_w\!\left(\frac{f}{w}\right) - E_w\!\left\{\left(\frac{f - f^G}{w}\right)^{\!2}\right\}. \qquad (4)
$$
This shows that the group-averaged estimator always has variance smaller than f/w. This is because f^G and w are more alike than f and w, as f^G and w now share the group of symmetries G. Some standard variance reduction methods, such as antithetic variates, can be seen to be examples of this approach.
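A minimal Python sketch (ours, not from Evans and Swartz (2000)) of a group-averaged importance sampling estimator; the integrand, sampler and two-element reflection group below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def group_averaged_is(f, sample, density, group, n=10_000):
    """Group-averaged importance sampling estimate of I = integral of f(x) dx.

    sample  : function returning n draws from the importance density w
    density : the density w itself
    group   : finite list of volume preserving maps g (identity included)
    """
    x = sample(n)
    f_G = np.mean([f(g(x)) for g in group], axis=0)   # f^G(x) = |G|^{-1} sum_g f(gx)
    return np.mean(f_G / density(x))

# Illustrative use: f integrates to 1, w is the N(0, 1) density, and the group is
# {identity, reflection about 0}, a symmetry of w that preserves Lebesgue measure.
f = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2.0 * np.pi)
estimate = group_averaged_is(
    f,
    sample=lambda n: rng.standard_normal(n),
    density=lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi),
    group=[lambda x: x, lambda x: -x],
)
```

With this particular group the average over G is the familiar antithetic-type estimator.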
Although equation (4) seems to indicate that we always obtain an improvement with group averaging, a fairer comparison takes into account the additional function evaluations in f^G/w. In that case, it was shown in Evans and Swartz (2000) that f^G/w is a true variance reducer if and only if
$$
\frac{|G|-1}{|G|}\,\mathrm{var}_w\!\left(\frac{f}{w}\right) < E_w\!\left\{\left(\frac{f-f^G}{w}\right)^{\!2}\right\}.
$$
Noting that f^G/w is the orthogonal projection of f/w onto the L2(w) space of functions invariant under G, we see that we have true variance reduction if and only if the residual (f − f^G)/w accounts for a proportion at least equal to (|G| − 1)/|G| of the total variation in f/w. This implies that the larger the group is the more stringent is the requirement for true variance reduction. Also, if this technique is to be effective, then f and w must be very different, at least with respect to the symmetries in G, i.e. w must be
a poor importance sampler for the original problem. This leads to some reservations about the general
utility of the method.
The group averaging technique in the paper can be seen as a generalization of this as G is not required to be a group of symmetries of w. Rather we symmetrize both the importance sampler and the integrand so that the basic estimator is f^G/w^G. It may be that w^G is a more effective importance sampler than w but still we can see that the above analysis will restrict the usefulness of the method.
Although I have expressed some reservations about some aspects of the paper, overall I found it to
be thought provoking as it introduces a different way to approach the analysis of integration problems
and this leads to some interesting results. I am happy to propose the thanks and congratulations to the
authors.
Christian P. Robert (Centre de Recherche en Economie et Statistique and Université Dauphine, Paris)
Past contributions of the authors to the Monte Carlo literature, including the notion of bridge sampling,
are noteworthy, and I therefore regret that this paper does not have a similar dimension for our field.
Indeed, the ‘theory of Monte Carlo integration’ that is advertised in the title reduces to a formal justification of bridge sampling, via nonparametric maximum likelihood estimation, and the device of pretending
to estimate the dominating measure allows in addition for an asymptotic approximation of the Monte
Carlo error through the corresponding Fisher information matrix. Although this additional level of interpretation of importance and bridge sampling Monte Carlo approximations (rather than estimations)
of integrals is welcome, and quite exciting as a formal exercise, it seems impossible to derive a working
principle out of the paper.
A remark of interest related to the supposedly unknown measure is that groups can be seen as acting on
the measure rather than on the distribution. This brings much more freedom (but also the embarrassment
of wealth) in looking for group actions, since
$$
\int_{\Gamma} q_s(x)\,d\lambda(x) = \int_{\Gamma/G}\int_{G} q_s(gx)\,d\nu(g)\,d\lambda(x) = \int_{\Gamma/G} |G|\,\bar q_s(x)\,d\lambda(x)
$$
is satisfied for all groups G such that dλ(gx) = dλ(x). However, this representation also exposes the confusion that is central to the paper between variance reduction, which is obvious by a Rao–Blackwell argument, and Monte Carlo improvement, which is unclear since the computing cost is not taken into account. For instance, Fig. 1 analyses the example of Section 2 based on a single realization from P1, reparameterized to [0, 1]², where larger groups acting on [0, 1]² bring better approximations of cσ/c1 = 1/σ². However, this improvement does not directly pertain to Monte Carlo integration, but rather to numerical integration (with the curse of dimensionality lurking in the background). Although the paper shows examples where using the group structure clearly brings substantial improvement, it does not shed much light on the comparison of different groups from a Monte Carlo point of view.
The numerical nature of the improvement is also clear when we realize that, since the group action is on the measure rather than on the distribution, it is often unlikely to be related to the geometry of this distribution and thus can bring little improvement when q(gxi) = 0 for most gs and generated xis. Fig. 2 shows the case of a Monte Carlo experiment on a bivariate normal distribution with the same groups as above: the concentration of the logit transform of the N{(−5, 10), σ²I2} distribution on a small area of the [0, 1]² square (Fig. 2(a)) prevents good evaluations even after 4^10 computations (Fig. 2(b)).
A second opening related to the likelihood representation of the weight evaluation is that a Fisher
information matrix can be associated with this problem and thus variance-like figures are produced in
the paper. Although these matrices stand as a formal ground for comparison of Monte Carlo strategies, I
have difficulties with these figures given that they do not necessarily provide a good approximation to the
Monte Carlo errors. See for instance the example of Section 2: we could apply Slutsky’s lemma with the
transform cσ = exp{log(cσ)} to obtain the variance of the ĉσs as
$$
\mathrm{diag}(c_\sigma)\, V\, \mathrm{diag}(c_\sigma),
$$
but this approximation is invalidated by the absence of variance of the ĉσ s. (See also the normal range
confidence lower bound on Fig. 2(b) which is completely unrelated to the range of the estimates.) Given
that importance sampling estimators are quite prone to suffer from infinite variance (Robert and Casella
(1999), section 3.3.2), the appeal of using Fisher information matrices is somewhat spurious as it gives a
false confidence in estimators that should not be used.
[Fig. 1: nine panels on [0, 1]²; panel titles: Estimate 1.44; n = 1, Estimate 2.026; n = 2, Estimate 2.3523; n = 3, Estimate 2.2884; n = 4, Estimate 2.3908; n = 5, Estimate 2.4372; n = 6, Estimate 2.407; n = 7, Estimate 2.3981; n = 8, Estimate 2.3863.]
Fig. 1. Successive actions of the groups Gn on the simulated value (x1, x2) ∼ P1 (top left-hand corner), where Gn operates on (z1, z2) = (exp(x1)/{1 + exp(x1)}, x2/(1 + x2)) ∈ [0, 1]² by changing some of the first n bits in the binary representation of the decimal part of z1 and z2: the estimates of 1/σ² = 2.38 are given above each graph; the size of the orbit of Gn is 4^n
The extension to Markov chain Monte Carlo settings is another interesting point in the paper, in that
estimator (3.10) reduces to
$$
\hat c_r = \sum_{t=1}^{n} \frac{q_r(x_t)}{\sum_{j=1}^{n} q_1(x_t \mid x_j)}
$$
if we use a single chain based on a transition q1 (with known constant). Although this estimator only
seems to apply to the Gibbs sampler, given that the general Metropolis–Hastings algorithm has no density
against the Lebesgue measure, it is a special case of Rao–Blackwellization (Gelfand and Smith, 1990;
Casella and Robert, 1996) applied to importance sampling. As noted in Robert (1995), the naïve importance sampling alternative
$$
\tilde c_r = \frac{1}{n}\sum_{t=1}^{n} \frac{q_r(x_t)}{q_1(x_t \mid x_{t-1})},
$$
where the 'true' distribution of xt is used instead, is a poor choice, since it most often suffers from infinite variance. (The notation in Section 3.6 is mildly confusing in that cr(x) is unrelated to cr and is also most often known, in contrast with cr. It thus seems inefficient to estimate the cr(xt)s.)
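A small simulated sketch (ours, not from the discussion) contrasting the two estimators above for a toy Markov chain with normal transitions, where the transition density q1(·|·) has a known normalizing constant:

```python
import numpy as np

rng = np.random.default_rng(1)
a, n = 0.5, 2_000

x = np.empty(n)                                   # toy chain: x_t = a*x_{t-1} + N(0, 1)
x[0] = rng.standard_normal()
for t in range(1, n):
    x[t] = a * x[t - 1] + rng.standard_normal()

def q1(xt, xcond):                                # transition density, known constant
    return np.exp(-0.5 * (xt - a * xcond) ** 2) / np.sqrt(2.0 * np.pi)

def qr(xt, r=2.0):                                # unnormalized target; c_r = sqrt(2*pi*r)
    return np.exp(-xt ** 2 / (2.0 * r))

# The estimator displayed above: a mixture of all n transition densities in the
# denominator, at O(n^2) cost.
denom = np.sum(q1(x[:, None], x[None, :]), axis=1)
c_hat = np.sum(qr(x) / denom)

# The naive importance sampling alternative: divide by the single conditional density.
c_tilde = np.mean(qr(x[1:]) / q1(x[1:], x[:-1]))
```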
[Fig. 2: panel (a) axes exp(x1)/{1 + exp(x1)} (horizontal) and exp(x2)/{1 + exp(x2)} (vertical); panel (b) horizontal axis log4(n).]
Fig. 2. (a) Contour plot of the density of the logit transform of the N{(5, 10), σ²I2} distribution and (b) summary of the successive evaluations of cσ/c1 = σ² = 1.5 by averages over the groups Gn for the logit transform of a single simulated value (x1, x2) ∼ N{(5, 10), σ²I2}: average over 100 replications; range of the evaluations; . . . . . . ., 2-standard-deviations upper bound
Except for the misplaced statements of the last paragraph about ‘orthodox Bayesians’, which simply
bring to light the fundamental difference between designed Monte Carlo experiments and statistical inference, I enjoyed working on the authors’ paper and thus unreservedly second the vote of thanks!
The vote of thanks was passed by acclamation.
A. C. Davison (Swiss Federal Institute of Technology, Lausanne)
The choice of the group is a key aspect of the variance reduction by group averaging that is suggested in
the paper, and I would like to ask its authors whether they have any advice for the bootstrapper.
The simplest form of nonparametric bootstrap entails equal probability sampling with replacement
from data y1, . . ., yn, which is used to approximate quantities such as
$$
m = n^{-n} \sum t(y_1^{*}, \ldots, y_n^{*}),
$$
where the sum is over the n^n possible resamples y1*, . . ., yn* from the original data. If the statistic under consideration is symmetric in the yj then the number of summands may be reduced to $\binom{2n-1}{n}$ or fewer,
but it is usually so large that m is approximated by taking a random subset of the resamples. The most
natural group in this context arises from permuting y1 , : : :, yn and has size n!, which is far too large for use
in most situations. One could readily construct smaller groups, obtained for example by splitting the data
into blocks of size k and permuting within the blocks, but they seem somewhat arbitrary. How should
the group be chosen in this context, where the cost of function evaluations vastly outweighs the cost of
sampling?
Wenxin Jiang and Martin A. Tanner (Northwestern University, Evanston)
One way to rederive equation (2.1) is to regard it as importance sampling using a sample x1n with a ‘base
density’ of the mixture
$$
f(x) = \sum_s \frac{n_s}{n}\,\frac{q_s(x)}{c_s}.
$$
The estimator of cσ = ∫_Γ qσ(x) dµ would be n⁻¹ Σ_{i=1}^n qσ(xi)/f(xi). Replacing the unknown constant cs in the base density by ĉs then results in a self-consistency equation that is identical to equation (2.1). (Similarly, with group averaging the functions qσ(x) are essentially replaced by q̄(x; σ), and the integrals cσ = ∫_Γ q̄(x; σ) dρ are estimated by using an importance sample x_1^n having a base density Σs (ns/n) q̄(x; s)/cs.)
The goal here is to achieve a small ‘average asymptotic variance’
$$
\mathrm{AAVAR} = \int_{\Xi}\int_{\Xi} \mathrm{var}\{\log(\hat c_\sigma/\hat c_\tau)\}\, d\pi(\sigma)\, d\pi(\tau).
$$
(The paper actually uses (25/20) AAVAR as the criterion function, averaging over 10 different pairs of σ, τ ∈ Ξ = {0.25, 0.5, 1, 2, 4}.)
What is the optimal base density fo and the corresponding optimal AAVAR(fo)? The following is a result that is obtained by applying variational calculus.

Proposition 1 (average optimality). Let ĉσ(f) = n⁻¹ Σ_{i=1}^n qσ(xi)/f(xi) be an estimator of cσ = ∫_Γ qσ(x) dµ, where x_1^n is an independent and identically distributed sample from a probability measure f(x) dµ. Let
$$
\mathrm{AAVAR}(f) = \int_{\Xi}\int_{\Xi} \mathrm{var}\!\left[\log\!\left\{\frac{\hat c_\sigma(f)}{\hat c_\tau(f)}\right\}\right] d\pi(\sigma)\, d\pi(\tau)
$$
for some prechosen averaging scheme defined by measure π over a set of parameters Ξ. Then
$$
\mathrm{AAVAR}(f_o) = \inf_{f}\{\mathrm{AAVAR}(f)\} = n^{-1}\left\{\int_\Gamma q_o(x)\, d\mu(x)\right\}^{2},
$$
where
$$
f_o(x) \propto q_o(x) = \left[\int_\Xi\int_\Xi \left\{\frac{q_\sigma(x)}{c_\sigma} - \frac{q_\tau(x)}{c_\tau}\right\}^{2} d\pi(\sigma)\, d\pi(\tau)\right]^{1/2}.
$$
For the uniform–average measure π on the space Ξ = {0.25, 0.5, 1, 2, 4}, using qσ(x) = {x1² + (x2 + σ)²}⁻² and Γ = R × R⁺ gives (25/20) AAVAR(fo) = 1.2/n, with the required integral computed
numerically. The paper uses the base densities f1(x) = q1(x)/c1 and f2(x) = Σs (1/5) qs(x)/cs, achieving (25/20) AAVAR(f1,2) = 3.6/n and 3.0/n, with efficiency factors AAVAR(f1,2)⁻¹/AAVAR(fo)⁻¹ being 33% and 40% respectively, compared with the optimal fo.
It is noted that the optimal AAVAR(fo) will be smaller if the densities {qσ(·)/cσ}σ∈Ξ are more similar. A parallel computation can be done with group averaging. Then qσ/cσ is replaced by the averaged density p̄σ = (q_{1/σ}/c_{1/σ} + qσ/cσ)/2 when computing (25/20) AAVAR(fo), which results in a much smaller value 0.20/n since the averaged densities are more similar to each other. With group averaging, the choice of the base density in the paper is essentially Σs (1/5) p̄s(x) ≡ f3, with (25/20) AAVAR(f3) = 0.37/n and an efficiency factor AAVAR(f3)⁻¹/AAVAR(fo)⁻¹ = 54%.
The paper did not use the exact base densities f1,2,3 , but rather those using the estimated normalizing
constant ĉs , which may have added to the variance. Nevertheless, we see that the use of group averaging
seems more effective for reducing AAVAR than the use of any possible importance samples or design
weights.
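A rough numerical sketch (ours, not from the discussion) of how the optimal qo of Proposition 1 could be evaluated for the Section 2 example, taking qσ(x) = {x1² + (x2 + σ)²}⁻² on Γ = R × R⁺ with cσ = π/(4σ²) (so that cσ/c1 = 1/σ², as quoted elsewhere in the discussion), and approximating the integral of qo by quadrature on a truncated grid:

```python
import numpy as np

sigmas = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
c = np.pi / (4.0 * sigmas ** 2)            # c_sigma = pi/(4 sigma^2)

# Truncated grid over Gamma = R x R+; the limits and step are arbitrary accuracy choices.
L, h = 40.0, 0.05
x1 = np.arange(-L, L, h) + h / 2.0
x2 = np.arange(0.0, L, h) + h / 2.0
X1, X2 = np.meshgrid(x1, x2, indexing="ij")

p = np.stack([1.0 / (X1 ** 2 + (X2 + s) ** 2) ** 2 for s in sigmas]) / c[:, None, None]

# q_o(x)^2 is the double average over (sigma, tau) of {q_sigma/c_sigma - q_tau/c_tau}^2.
qo2 = np.zeros_like(X1)
for i in range(len(sigmas)):
    for j in range(len(sigmas)):
        qo2 += (p[i] - p[j]) ** 2
qo2 /= len(sigmas) ** 2

n_aavar = (np.sum(np.sqrt(qo2)) * h * h) ** 2   # n * AAVAR(f_o) = {integral of q_o}^2
print(n_aavar)                                   # to be compared with (20/25) * 1.2 = 0.96
```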
Hani Doss (Ohio State University, Columbus)
An interesting application of the ideas in the paper arises in some current joint work with B. Narasimhan.
In very general terms, our set-up is as follows. As in the usual Bayesian paradigm, we have a data vector
D, distributed according to some probability distribution Pφ , and we have a prior νh on φ. This prior is
indexed by some hyperparameter h ∈ H, a subset of Euclidean space, and we are interested in νh,D , the
posterior distribution of φ given the data, as h ranges over H, for doing sensitivity analyses. In our set-up, we select k values h1, . . ., hk ∈ H, and we assume that, for each r = 1, . . ., k, we have a sample φ1(r), . . ., φnr(r) from νhr,D. We have in mind situations in which φ1(r), . . ., φnr(r) are generated by Markov chain Monte Carlo
sampling, and in which the amount of time that it takes to generate these is non-negligible. We wish to use
the k samples to estimate νh,D , and we need to do this in realtime. (Markov chain Monte Carlo sampling
gives rise to dependent samples, but to keep this discussion focused we ignore this difficulty entirely.)
In standard parametric models, the posterior is proportional to the likelihood times the prior density, i.e.
$$
\nu_{h,D}(d\phi) = c_h^{-1}\, l_D(\phi)\,\nu_h(\phi)\,\mu(d\phi),
$$
where ch = ch(D) is the marginal density of the data when the prior is νh, lD(φ) is the likelihood function and µ is a common carrier measure for all the priors. In the context of this paper, we have an infinite family of probability distributions νh,D, h ∈ H, each known except for a normalizing constant, and we have a sample from a finite number of these. The qhs may be taken to be lD(φ) νh(φ). The issue that for many models lD(φ) is too complicated to be 'known' is irrelevant, since all formulae require only ratios of qs, in which lD cancels. Equation (3.4) gives the maximum likelihood estimate of µ, and from this we estimate νh,D(dφ) directly via
$$
\hat\nu_{h,D}(d\phi) = \frac{1}{\hat c_h}\, q_h(\phi)\,\hat\mu(d\phi)
= \frac{1}{\hat c_h}\sum_{r=1}^{k}\sum_{i=1}^{n_r} \frac{q_h(\phi_i^{(r)})}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}}\,\delta_{\phi_i^{(r)}}(d\phi)
$$
(here δa is the point mass at a), i.e. ν̂h,D is the probability measure which gives mass
$$
w_h(i, r) = \frac{q_h(\phi_i^{(r)})/\hat c_h}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}}
= \left\{\sum_{s=1}^{k} n_s\,\frac{\nu_{h_s}(\phi_i^{(r)})\,\hat c_h}{\nu_{h}(\phi_i^{(r)})\,\hat c_{h_s}}\right\}^{-1}
$$
s=1
to φ.r/
i . A potentially time-consuming computation needs to be used to solve the system of equations (3.5);
however, as noted by the authors, only the system involving ch1 , : : :, chk needs to be solved, since, once
these have been obtained, for all other hs the calculation of ĉh is immediate. The weights are simply the
sequence
−1
k
νh .φ.r/
i /
ns s .r/
νh .φi /ĉhs
s=1
normalized to sum to 1. These weights can usually be calculated very rapidly. This is exactly the estimator that was used in Doss and Narasimhan (1998, 1999) and obtained by other methods (Geyer, 1994). An expectation ∫ f(φ) ν̂h,D(dφ) is estimated via
$$
\sum_{r=1}^{k}\sum_{i=1}^{n_r} f(\phi_i^{(r)})\, w_h(i, r). \qquad (5)
$$
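A minimal sketch (ours, not from the discussion) of how the weights and estimator (5) might be computed for a new hyperparameter h once ĉh1, . . ., ĉhk have been obtained; the prior densities are assumed available as arrays evaluated at the pooled draws, and the likelihood cancels from the weights as noted above:

```python
import numpy as np

def sensitivity_estimate(f_vals, nu_h, nu_design, n_sizes, c_hat):
    """Estimator (5) for a new hyperparameter h.

    f_vals    : f(phi) at the pooled draws (length n = n_1 + ... + n_k)
    nu_h      : prior density nu_h(phi) at the pooled draws
    nu_design : (n, k) array of nu_{h_s}(phi) at the pooled draws, s = 1,...,k
    n_sizes   : length-k array of sample sizes n_s
    c_hat     : length-k array of fitted constants for the design priors
    """
    w = nu_h / (nu_design @ (n_sizes / c_hat))   # the likelihood l_D(phi) cancels
    w = w / np.sum(w)                            # normalize the weights to sum to 1
    return np.sum(f_vals * w)
```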
It is very useful to note that the development in Section 4 enables estimation of the variance of expression (5). We add the function f(φ) qh(φ) to our system. Letting
$$
c_h^{f} = \int f(\phi)\, q_h(\phi)\,\mu(d\phi),
$$
we see that expression (5) is simply ĉh^f/ĉh. If we take P̃ to be the matrix P̂ in Section 4.2 of the paper, augmented with the two columns
$$
\left[\frac{q_h(\phi_i^{(r)})/\hat c_h}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}}\right]_{\text{all } i,\ \text{all } r},
\qquad
\left[\frac{f(\phi_i^{(r)})\, q_h(\phi_i^{(r)})/\hat c_h^{f}}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}}\right]_{\text{all } i,\ \text{all } r}
$$
(so P̃ is an n × (k + 2) matrix), then equation (4.2) gives an estimate of the variance of ĉh^f/ĉh.
J. Qin (Memorial Sloan–Kettering Cancer Center, New York) and K. Fokianos (University of Cyprus, Nicosia)
We congratulate the authors for their excellent contribution which brings together Monte Carlo integration theory and biased sampling models. Our discussion is restricted to the following simple observation
which might be utilized in connection with the theory that is developed in the paper for attacking a larger
class of problems.
Suppose that X1, . . ., Xn1 are independent random variables from a density function pθ(x) which can be written as
$$
p_\theta(x) = q_\theta(x)/c(\theta),
$$
where qθ(x) is a given non-negative function, θ is an unknown parameter and c(θ) = ∫ qθ(x) dx is the normalizing constant. Assume that the integration has no closed form and let p0(x) be a density function from which simulated data Z1, . . ., Zn2 are easily drawn. Then it follows that pθ(x) can be rewritten as
$$
p(x) = \exp\bigl(\alpha + [\log\{q_\theta(x)\} - \log\{p_0(x)\}]\bigr)\, p_0(x) = \exp\{\alpha + \phi(x, \theta)\}\, p_0(x),
$$
where φ(x, θ) = log{qθ(x)} − log{p0(x)} and α = −log[∫ exp{φ(x, θ)} p0(x) dx] = −log{c(θ)}, assuming that both of the densities are supported on the same set which is independent of θ. Hence the initial model is transformed to a two-sample semiparametric problem where the ratio of the two densities p(x) and p0(x) is known up to a parameter vector (α, θ). In addition the sample size n1 of Xs is fixed as opposed to n2—the sample size of the simulated data—which can be made arbitrarily large. Let n = n1 + n2 and denote the pooled sample (T1, . . ., Tn) = (X1, . . ., Xn1; Z1, . . ., Zn2). Following the results of Anderson (1979) and Qin
(1998), inferences for the vector (α, θ) can be based on the profile log-likelihood
$$
l(\alpha, \theta) = \sum_{i=1}^{n_1} \{\alpha + \phi(x_i, \theta)\} - \sum_{i=1}^{n} \log[1 + \rho \exp\{\alpha + \phi(t_i, \theta)\}],
$$
where ρ = n1/n2. The parameter vector (α, θ) can be estimated by maximizing the profile log-likelihood l(α, θ) with respect to α and θ, and the likelihood ratio test regarding θ has an asymptotic χ²-distribution under mild conditions (Qin, 1999). The idea can be generalized further by considering m-samples along the lines of Fokianos et al. (2001).
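A minimal sketch (ours, not from the discussion) of maximizing this profile log-likelihood with illustrative choices: qθ(x) is an unnormalized N(θ, 1) kernel and p0 is the N(0, 1) density, so that φ(x, θ) = θx − θ²/2:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
theta_true, n1, n2 = 1.3, 200, 2_000
x = theta_true + rng.standard_normal(n1)          # observed data from p_theta
z = rng.standard_normal(n2)                       # simulated data from p0 = N(0, 1)
t = np.concatenate([x, z])
rho = n1 / n2

def phi(u, theta):                                # log q_theta(u) - log p0(u)
    return theta * u - 0.5 * theta ** 2

def neg_profile_loglik(par):
    alpha, theta = par
    return -(np.sum(alpha + phi(x, theta))
             - np.sum(np.log1p(rho * np.exp(alpha + phi(t, theta)))))

fit = minimize(neg_profile_loglik, x0=np.zeros(2), method="BFGS")
alpha_hat, theta_hat = fit.x                      # in the text's parameterization alpha = -log c(theta)
```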
Steven N. MacEachern, Mario Peruggia and Subharup Guha (Ohio State University, Columbus)
The authors construct an interesting framework for Monte Carlo integration. The perspective that they
provide will improve simulation-based estimation. The example of Section 5.2 makes a clear case for the
benefits of subsampling the output of a Markov chain.
To evaluate the efficiency of an estimation procedure, we must consider the cost of creating the estimator along with its precision. Here, the cost of generating the sample is O(n), which we approximate by c1 n. The additional time that is required to compute Chib's estimator is linear in n with an ignorable constant that is considerably smaller than c1. The precision of his estimator is O(n), say p1 n, resulting in an asymptotic effective precision p1/c1. The new estimator relies on the same sample of size n. The cost to compute the estimator is O(n²), with leading term c2 n². The new estimator has precision O(n²), with leading term p2 n². The asymptotic effective precision of the estimator is lim_{n→∞} {p2 n²/(c2 n² + c1 n)} = p2/c2. The authors report simulations that enable us to compare the two estimators, leading to the conclusion that p2 c1/(p1 c2) is roughly 8–10. This gain in asymptotic effective precision suggests using the new estimator.
Consider a 1-in-k systematic subsample of the Markov chain of length n (total sample length nk). The subsampled version of equation (5.1) has a cost of c1 nk + c2 n². The precision of the estimator is O(n²) with leading term p2k n². Its asymptotic effective precision is p2k/c2. This example, with positive dependence between successive parameter vectors, is typical of Gibbs samplers. Positive dependence suggests p2k > p2, and so it is preferable to subsample the output of the Markov chain. This benefit for subsampling should hold whenever the cost of computing the estimator is worse than O(n).
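A small numerical sketch (ours, with entirely hypothetical cost and precision constants) of the trade-off just described, comparing the precision attainable within a fixed computing budget with and without 1-in-k subsampling:

```python
import math

def best_precision(budget, c1, c2, p, k=1):
    """Precision attainable within a fixed computing budget when a 1-in-k subsample
    of size n is retained from a chain of length n*k: cost = c1*n*k + c2*n**2 and
    precision = p*n**2 (p = p2 when k = 1, p = p2k when k > 1)."""
    n = (-c1 * k + math.sqrt((c1 * k) ** 2 + 4.0 * c2 * budget)) / (2.0 * c2)
    return p * n ** 2

# Hypothetical constants: thinning weakens serial dependence, so p2k exceeds p2.
full = best_precision(budget=1e6, c1=1.0, c2=0.1, p=1.0, k=1)
thin = best_precision(budget=1e6, c1=1.0, c2=0.1, p=1.3, k=10)
```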
We replicated the authors’ simulation (500 + 5000 case). A 1-in-10 subsample reduced the standard
deviation of integral (5.1) by 10%. We also drew the vector β as four successive univariate normals. For
this sampler with stronger, but still modest, serial dependence, a 1-in-10 subsample reduced the standard
deviation by 43% compared with the unsubsampled estimator.
In recent work (Guha and MacEachern, 2002; Guha et al., 2002), we have developed bench-mark estimation, a technique that allows us to base our estimator primarily on the subsample while incorporating
key information from the full sample. Bench-mark estimators, compatible with importance sampling and
group averaging, can be far more accurate than subsampled estimators. As this example shows, subsampled
estimators can be far more accurate than full sample estimators.
The following contributions were received in writing after the meeting.
Siddhartha Chib (Washington University, St Louis)
I view this paper as a contribution on ways to reduce the variability of estimates derived from Monte Carlo
simulations. For obvious reasons, I was drawn to the discussion in Section 5.2 where the authors apply
their techniques to the problem of estimating the marginal likelihood of a binary response probit model,
given Markov chain Monte Carlo draws from its posterior density sampled by the approach of Albert and
Chib (1993). Their estimator of the marginal likelihood is closely related to that from the Chib approach.
In particular, focusing on expression (5.1), from Chib (1995)
$$
\frac{L(\beta_i)\,\pi(\beta_i)}{n^{-1}\sum_{j=1}^{n} \pi(\beta_i \mid y, z_j)}
$$
is an estimate of the marginal likelihood for any βi and the numerical efficiency of this estimate is high,
for a given n, when βi is a high posterior density point.
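A schematic sketch (ours, not Chib's code) of the two procedures being compared: the single-point estimate displayed above, and the averaging over sampled βi that this discussion attributes to expression (5.1). The functions log_lik, log_prior and post_ordinate are hypothetical placeholders for the model-specific ingredients:

```python
import numpy as np

def chib_single(beta, log_lik, log_prior, post_ordinate, z_draws):
    """Single-point Chib (1995) estimate on the log scale:
    log L(beta) + log pi(beta) - log{ n^{-1} sum_j pi(beta | y, z_j) }."""
    denom = np.mean([post_ordinate(beta, z) for z in z_draws])
    return log_lik(beta) + log_prior(beta) - np.log(denom)

def averaged_estimate(beta_draws, log_lik, log_prior, post_ordinate, z_draws):
    """Average the single-point estimates over the sampled beta_i (the scheme the
    discussion attributes to expression (5.1)); the cost is O(n^2)."""
    logs = np.array([chib_single(b, log_lik, log_prior, post_ordinate, z_draws)
                     for b in beta_draws])
    m = logs.max()                                # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(logs - m)))
```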
In contrast, expression (5.1) dictates that we compute the Chib estimate at each sampled βi and then
average those estimates. Not surprisingly, by averaging, and thereby incurring a bigger computational
load, we obtain a theoretically more efficient estimate. Interestingly, however, the gains documented in
Table 1 are not substantial: with n = 5000, the averaging affects the mean only in the third decimal place.
After adding −34 to each mean and observing that the numerical standard error of the Chib estimate
is 0.02, it is clear that the reduction in the numerical standard error achieved by averaging will have no
practical consequences for model choice in this problem.
The authors do not discuss the application of their method to models that are more complex than the
binary probit. There are, for example, many situations where the likelihood is difficult to calculate, and the
posterior ordinate is not available in as simple a form as in the binary probit. In such cases, the key features
of the Chib method—that the likelihood and posterior ordinate must be evaluated just once, that tractable
and efficient ways of estimating each can be separately identified and that this is sufficient for producing
reliable and accurate estimates—have proved invaluable. Illustrations abound, e.g. to multivariate probit
models, longitudinal count data models, multivariate stochastic volatility models, non-linear stochastic
diffusion models and semiparametric models constructed under the Dirichlet process framework.
If the general approach described in this paper also amounts to averaging the Chib estimate in more
complex models then it is clear that the computational burden of the suggested averaging will be prohibitive in many cases and, therefore, the theoretical efficiency gain, however small or large, will be practically
infeasible to realize. Coupled with the virtual equality of the estimates in a case where averaging is feasible,
it is arguable whether the additional computing effort will be practical or justifiable in more complex
problems.
Ya'acov Ritov (Hebrew University of Jerusalem)
The paper is an interesting bridge between the semiparametric theory of stratified biased samples and Monte Carlo integration. In this comment, I concentrate on the theoretical aspects of the problem, leaving the
numerical analysis side to others.
The asymptotics of the biased sample model have been discussed previously, e.g. Gill et al. (1988),
Pollard (1990) and Bickel and Ritov (1991). Bickel et al. (1993) discussed the information bounds and efficiency considerations for a pot-pourri of models under the umbrella of biased sampling. These include, in
particular, stratified sampling, discussed already in Gill et al. (1988), with or without known total stratum
probability. Case–control studies are the important application of the known probabilities model.
Contrary to the last statement of Section 3.1, the paper deals essentially only with applications in which
the ‘unknown’ measure µ is actually known, at least up to its total mass. This covers most applications
of Markov chain Monte Carlo sampling, as well as integration with respect to the Gaussian or Lebesgue
measures. It is possible to think of other problems, but this is not the main emphasis of the text or of the
examples. The statistician can therefore considerably improve the efficiency of the estimator by using the
known values of different functionals such as moments and probabilities of different sets. The algorithm
becomes increasingly efficient as the number of functionals becomes larger. The result, however, is an
extremely complicated algorithm, which is not necessarily faster. For example, if integration according to
a Gaussian measure on the line is considered, we can use the fact that all quantiles are known. In essence,
we have stratified sampling. However, we can increase efficiency by using the known mean, variance and
higher moments of the distribution.
A similar consideration applies to the group models discussed by the authors. The more symmetry (i.e.
group structure) we use, the more efficient the estimator. Practical considerations may prevent us from
using all possible symmetries, and the actual level of symmetry used amounts to the trade-off between
algorithmic complication and statistical efficiency. The latter is, of course, quite different from algorithmic
efficiency.
I miss an example in the paper in which the stratified biased sample model could be used in full. There
is no reason to use, for example, only a single Gaussian measure. For example, a variance slightly smaller
than intended improves the efficiency of the evaluation of the integral in the centre, whereas a variance
slightly larger than intended improves the evaluation in the tail. Practical guidance for the design problem
could be helpful. One practical solution would be the use of an adaptive design when different designs
are evaluated by the central processor unit time needed for a fixed reduction in estimation error, and the
design actually used is selected accordingly.
James M. Robins (Harvard School of Public Health, Boston)
I congratulate the authors on a beautiful and highly original paper. In Robins (2003), I describe a substantively realistic model which demonstrates that successful Monte Carlo estimation of functionals of a
high dimensional integral can sometimes require, not only that,
(a) as emphasized by the authors, the simulator hides knowledge from the analyst, but also that
(b) the analyst must violate the likelihood principle and eschew semiparametric, nonparametric or
fully parametric maximum likelihood estimation in favour of non-likelihood-based locally efficient
semiparametric estimators.
Here, I demonstrate point (b) with a model that is closer to the authors’ but is less realistic.
Let L encode base-line variables, A with support {0, 1} encode treatment and X encode responses, where
L and X have many continuous components. Let θ = (l, a), qθ(x) be a positive utility function and the measure uθ depend on θ so that u(l,a) is the conditional probability measure for X given (L, A) = (l, a). Then, among subjects with L = l, c(l, a) = c(θ) = cθ = ∫ qθ(x) duθ is the expected utility associated with treatment a and dop(l) = arg max_{a∈{0,1}} {c(l, a)/c(l, 0)} is the optimal treatment strategy. Suppose that the
simulated data are (Θi, Xi), i = 1, . . ., n, with Θi = (Li, Ai) sampled independently and identically distributed under the random design f(a, l) = f(a|l)f(l) and Xi now biasedly sampled from c_{Θi}⁻¹ q_{Θi}(x) u_{Θi}(dx). Suppose that the information available to the analyst is f*(a|l) and qθ*(x), and that c(l, a)/c(l, 0) = exp{−γ*(l, a, ψ†)} where ψ† ∈ R^k is an unknown parameter that determines dop(l), γ*(l, a, ψ) is a known function satisfying γ*(l, 0, ψ) = γ*(l, a, 0) = 0 and * indicates a quantity that is known to the analyst.
Then the likelihood is L = L1 L2 with
$$
L_1 = \prod_{i=1}^{n} \frac{f(L_i)\, q^{*}_{\Theta_i}(X_i)\, u_{\Theta_i}(dX_i)}{\exp\{-\gamma^{*}(L_i, A_i, \psi)\}\, c(L_i, 0)}, \qquad (6)
$$
$$
L_2 = \prod_{i=1}^{n} f^{*}(A_i \mid L_i).
$$
Then with Y = 1/q*_{(L,A)}(X) and b(L) = 1/c(L, 0), the analyst's model is precisely the semiparametric regression model characterized by f*(A|L) known and E(Y|L, A) = exp{γ*(L, A, ψ†)} b(L), with ψ† and b(L) unknown. In particular, the analyst does not know whether b(L) is smooth in L.
In this model, as discussed by Ritov and Bickel (1990), Robins and Ritov (1997), Robins and Wasserman
(2000) and Robins et al. (2000), because of the factorization of the likelihood and lack of smoothness,
(a) any estimator of ψ†, such as a maximum likelihood estimator, that obeys the likelihood principle must ignore the known design probabilities f*(a|l),
(b) estimators that ignore these probabilities cannot converge at rate n^α over the set of possible pairs (f*(a|l), c(l, 0)) for any α > 0 but
(c) semiparametric estimators that depend on the design probabilities can be n^{1/2} consistent. An example of an n^{1/2} consistent estimator is ψ̂(g, h) solving
$$
0 = \sum_{i=1}^{n} [Y_i \exp\{-\gamma^{*}(L_i, A_i, \psi)\} - h(L_i)]\, g(L_i)\,\{A_i - \mathrm{pr}^{*}(A = 1 \mid L_i)\}
$$
for user-supplied functions h(l) ∈ R¹ and g(l) ∈ R^k. Define hopt(L) = b(L) and gopt(L) = v(1, L) − v(0, L), where
$$
v(A, L) = W(\psi^{\dagger})\bigl[R(\psi^{\dagger}) - E\{W(\psi^{\dagger})\mid L\}^{-1} E\{W(\psi^{\dagger})R(\psi^{\dagger})\mid L\}\bigr],
$$
$$
R(\psi^{\dagger}) = b(L)\,\partial\gamma^{*}(L, A, \psi^{\dagger})/\partial\psi
$$
and
$$
W(\psi^{\dagger}) = \exp\{2\gamma^{*}(L, A, \psi^{\dagger})\}\,\mathrm{var}(Y \mid A, L)^{-1}.
$$
Let (ĝopt, ĥopt) be estimates of (gopt, hopt) obtained under 'working' parametric submodels for b(L) and var(Y|A, L). Then ψ̂(ĝopt, ĥopt) is locally efficient, i.e. it attains the semiparametric variance bound if the submodels are correct and remains consistent and asymptotically normal under their misspecification (Chamberlain, 1990; Newey, 1990; Robins et al., 2000).
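A small simulated sketch (ours, not from Robins (2003)) of the estimating equation above in its simplest case: scalar ψ, γ*(l, a, ψ) = ψa, g(l) = 1, h(l) = 1 and a randomized design with pr*(A = 1|l) = 0.5:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
n, psi_true = 5_000, 0.7
L = rng.uniform(-1.0, 1.0, n)
A = rng.integers(0, 2, n).astype(float)            # randomized treatment, pr*(A=1|L) = 0.5
b = 1.0 + L ** 2                                   # unknown, possibly non-smooth, b(L)
Y = np.exp(psi_true * A) * b * rng.exponential(1.0, n)   # E(Y|L, A) = exp(psi*A) b(L)

def estimating_equation(psi, h=1.0):
    # sum_i [Y_i exp(-psi*A_i) - h(L_i)] * g(L_i) * (A_i - 0.5), with g = 1
    return np.sum((Y * np.exp(-psi * A) - h) * (A - 0.5))

psi_hat = brentq(estimating_equation, -5.0, 5.0)
```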
Yehuda Vardi (Rutgers University, Piscataway)
The paper is a significant bridge between nonparametric maximum likelihood estimation under biased
sampling and Monte Carlo integration. Estimation under biased sampling is a fairly mature methodology,
including asymptotic results, and the authors have overlooked considerable published knowledge that is
relevant to their work. My comments mainly attempt to narrow this gap.
(a) The problem formulation in Section 3 is equivalent to Vardi (1985) with qs, cs and µ here being ws, Ws and F in Vardi (1985). The important special case of two samples (specifically qθ(x) = x^θ, θ = 0, 1) has been treated even earlier (Vardi, 1982), including the nonparametric maximum likelihood estimator and its asymptotic behaviour.
(b) The algorithm proposed in Vardi (1982, 1985) for solving equation (3.5) of this paper is different
from that of Deming and Stephan (1940) and is often more efficient (see Bickel et al. (1993), pages
340–343, and Pollard (1990), page 83). This is important in computationally intensive applications
where the Deming–Stephan algorithm is slow to converge.
(c) Regarding the asymptotics, for two samples, equation (4.2) should match theorem 3.1 of Vardi
(1982) and it would be constructive to show this. For more than two samples, equation (4.2) should
match proposition 2.3 in Gill et al. (1988) and similar results in Bickel et al. (1993).
(d) More on the asymptotics: ‘lifting’ the limiting behaviour of the estimated cr s from standard properties of the exponential family, although a useful shortcut, seems mathematically heuristic. A good
discussion of the interplay between the parametric and nonparametric aspects of the model is in
chapter 14 of Pollard (1990).
(e) The methodology calls for generating samples from weighted distributions, but this is often a difficult problem with no satisfactory solution, so there is a computational trade-off here. Note problem
6.1 of Gill et al. (1988), which also connects biased sampling methodology with design issues and
importance sampling.
(f) Gilbert et al. (1999) extended the model to include weight functions qr s, which share a common
unknown parameter. This might be useful in future applications of your methodology.
(g) You allow negative qr s but assume zero observations for such samples (nr = 0 in the discussion
following equation (3.1)). This is confusing. A simple example with negative weight function(s) and
an explanation of how to estimate the ratio of integrals in this case would help.
(h) Woodroofe (1985) showed that the nonparametric maximum likelihood estimator from truncated
data is consistent only under certain conditions on the probabilities that generate the data. Woodroofe’s problem has a similar data structure to your Markov chain models of Section 3.6, where each
targeted distribution has a sample of size 1 or less. This leads me to think that further conditions
might be necessary to achieve consistency.
The authors replied later, in writing, as follows.
General remarks
The view presented in the paper is that essentially every Monte Carlo activity may be interpreted as
parameter estimation by maximum likelihood in a statistical model. We do not claim that this point of
view is necessary; nor do we seek to establish a working principle from it. By analogy, the geometric
interpretation of linear models is not necessary for fitting or understanding; nor is there a geometric working principle. The fact that the estimators are obtained by maximum likelihood does not mean that they
cannot be justified or derived by heuristic arguments, even by flawed arguments. The crux of the matter is
not necessity or working principle, but whether the new interpretation is helpful or useful in the sense of
leading to new understanding, better simulation designs or more efficient algorithms. Jiang and Tanner’s
optimal sampler, although not readily computable and not from the mixture class, is a step in this direction.
The remark by MacEachern, Peruggia and Guha is another step showing that, for slow mixing Markov
chains, the numerical efficiency of superefficient estimators, such as those described in Section 5.2, may
be improved by subsampling.
Several discussants (Evans, Robert, Davison, . . .) raise questions about group averaging, its effectiveness, how to choose the group, and so on. As Robert correctly points out, having the group act on the
parameter space rather than on the sampler provides great latitude for choice of group actions. Group
averaging can be very effective with a small group or very ineffective with a large group, the latter being
counter-productive. The group actions that we have in mind involve very small groups, so the additional
computational effort is small, if not quite negligible. To choose an effective group action, it is usually
necessary to take advantage of the structure of the problem, the symmetries or geometry of the model
or the topology of the space. Euclidean spaces have ample structure for interesting group actions. The
parameter space in a parametric mixture model has obvious symmetries that can potentially be exploited.
In the nonparametric bootstrap, unfortunately, these useful pieces of structure have been stripped away,
the space that remains being a forlorn finite set with no redeeming structure. In the examples that we have
explored, effective group action carries each sample point to a different point that is not usually in the
sample. The bootstrap restriction to the observed sample points precludes group actions other than finite
permutations. If the retention of structure is not contrary to bootstrap principles, it may be possible to
use temporal order, or factor levels or covariate values, and to permute the covariates or the design. Of
course, this may be precisely what you do not want to do.
It is true, as Ritov points out, that maximum likelihood becomes more efficient as the parameter space
is reduced by the inclusion of symmetries or additional information such as the values of certain integrals.
Our submodels are designed to exploit exactly that. The hard part of the exercise is to construct a submodel
such that the gain in precision is sufficient to justify the additional computational effort. Evans’s remarks
show that the potential gain from the traditional type of group action is very limited. We doubt that it is
possible to put a similar bound on the effectiveness of the group action on the parameter space.
Mixtures versus strata
Remarks by Evans and Jiang and Tanner suggest that it may be helpful to explain why four superficially
similar estimators can have very different variances:
(a) the biased sampling estimator or bridge sampling estimator (BSE);
(b) the importance sampling estimator in which a single sample is drawn from the mixture;
(c) the post-stratified importance sampling estimator with actual sample sizes as weights;
(d) the stratified estimator in which the relative normalizing constants for the samplers are given.
We have listed these in increasing order of available information. The principal distinction is that the BSE does not use the normalizing constants for the samplers. The fourth estimator is maximum likelihood subject to certain homogeneous linear constraints of the type described in Section 3.4. For (a)–(c), the estimated mass at the sample point xi is
$$
d\hat\mu(x_i) \propto \frac{n}{\sum_s n_s\, q_s(x_i)/\hat c_s},
\qquad
d\hat\mu(x_i) \propto \frac{1}{\sum_s \pi_s\, q_s(x_i)/c_s}
$$
and
$$
d\hat\mu(x_i) \propto \frac{n}{\sum_s n_s\, q_s(x_i)/c_s}.
$$
For (d) with the restriction c1 = . . . = ck, the estimator obtained by constrained maximization over the set of measures supported on the data is
$$
d\hat\mu(x_i) \propto \frac{1}{\sum_r \lambda_r\, q_r(x_i)},
$$
the ratios λr/λs being determined by the condition ĉr = ĉs, where ĉr = ∫ qr dµ̂. The constrained maximum likelihood estimator is in a mixture form even though there may be draws from only one of the distributions.
Two designs with Γ = [0, 2] show that the variances are not necessarily comparable:
q1(x) = I(0 ≤ x ≤ 1);   q2(x) = I(1 ≤ x ≤ 2);   n1 = 2n2   (design I);
q1(x) = 0.5 I(0 ≤ x ≤ 2);   q2(x) = I(0 ≤ x ≤ 1);   n1 = 2n2   (design II);
plus the mixture with weights (2/3, 1/3). The absence of an overlap between q1 and q2 in design I means that µ is not estimable. The BSE of an integral such that the integrand is non-zero on both intervals does not
exist. In practical terms, the variance is infinite. However, the restriction of µ to each interval is estimable,
and the additional information that c1 = c2 means that these estimated measures can be combined in
such a way that µ̂([0, 1]) = µ̂([1, 2]). The stratified maximum likelihood estimator (d) puts mass 1 at each
sample point in the first interval and 2 at each point in the second. In an importance sampling design with
independent and identically distributed observations from the mixture, the estimator (b) has the same
form with weights 1 or 2 at each sample point. But the number of observations falling in each interval
is binomial, so the measure does not have equal mass on the two intervals. This defect is corrected by a
post-stratification adjustment. Estimator (d) has the natural property that, for each function q3 that is a
linear combination of q1 and q2, the ratio ĉ3/ĉ1 has zero variance even if the draws come entirely from P2.
The main point of this example is that, although the formulae in all cases look like mixture sampling, the
estimators can have very different variances. As Jiang and Tanner note, the use of estimated normalizing
constants in the BSE may increase the variance, potentially by a large factor. Also, their optimal sampler
is not in the mixture family, so the comparison is not entirely fair.
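A small numerical sketch (ours) of design I under the constraint c1 = c2, showing the constrained estimator (d) assigning mass 1 to points in [0, 1] and mass 2 to points in [1, 2], and the exactness of ĉ3/ĉ1 for a linear combination q3:

```python
import numpy as np

rng = np.random.default_rng(4)
n2 = 500
x = np.concatenate([rng.uniform(0, 1, 2 * n2),    # n1 = 2 n2 draws from P1 = U[0, 1]
                    rng.uniform(1, 2, n2)])       # n2 draws from P2 = U[1, 2]

q1 = lambda t: (t <= 1.0).astype(float)           # I(0 <= x <= 1) on this sample
q2 = lambda t: (t > 1.0).astype(float)            # I(1 <= x <= 2) on this sample
q3 = lambda t: 0.3 * q1(t) + 0.7 * q2(t)          # a linear combination of q1 and q2

# Constrained maximum likelihood: mass proportional to 1/(lambda1*q1 + lambda2*q2), with
# lambda1/lambda2 fixed by c1_hat = c2_hat; for design I this puts mass 1 in [0, 1] and
# mass n1/n2 = 2 in [1, 2].
mass = 1.0 * q1(x) + 2.0 * q2(x)

c1_hat = np.sum(q1(x) * mass)
c2_hat = np.sum(q2(x) * mass)                     # equals c1_hat by construction
c3_hat = np.sum(q3(x) * mass)
ratio = c3_hat / c1_hat                            # exactly 0.3 + 0.7 = 1.0, whatever the sample
```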
Variances
Vardi makes the point that, for a given sequence of designs, only certain integrals are consistently estimable. It would be good to have a clean description of this class, how it depends on the design and on
the group action. The asymptotic variance estimates reported in the paper are in agreement with known
results such as proposition 2.3 in Gill et al. (1988), but they apply only to integrals that are consistently
estimable. Our experience has been that the Fisher information calculation usually matches up well with
empirical simulation variances. At the same time, we have not looked at extreme cases near the boundary
of the consistently estimable class, where the approximation might begin to break down. Robert’s example
with n = 1 sheds little light on the adequacy of the asymptotic variance formulae. In applications of this
sort, where the log-ratios log(ĉr/ĉs) do not have moments, the distinction between variance and asymptotic
variance is critical. The asymptotic variance is well defined and finite, and this is the relevant number for
the assessment of precision in large samples.
Retrospective, empirical and profile likelihood
The simplest way to phrase the remarks by Qin and Fokianos is to consider a regression model with response Y and covariate x in which the components of Y are independent with distribution
$$
q_\theta(y)\, d\mu(y)/c_\theta(\mu)
$$
where θi = xi β, and µ is an arbitrary measure. The likelihood maximized over µ for fixed β is a function of β alone, here assumed to be a scalar. If the covariate is binary, as in a two-sample problem, the two-parameter function given by Qin and Fokianos is such that l(α̂β, β) is the profile log-likelihood for β. An equivalent result can be obtained by a conditional retrospective argument along the lines of Armitage (1971). In particular, if log{qθ(y)} is linear in θ, both arguments lead to a likelihood constructed as if the conditional distribution given y satisfies a linear logistic model with independent components.
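A short worked version of the two-sample case, in our notation rather than the reply's: writing φ(y) = log q1(y) − log q0(y), the retrospective probability that a pooled observation y arose from sample 1 is
$$
\Pr(\theta = 1 \mid y)
= \frac{n_1 q_1(y)/c_1}{n_0 q_0(y)/c_0 + n_1 q_1(y)/c_1}
= \frac{\exp\{\alpha^{*} + \phi(y)\}}{1 + \exp\{\alpha^{*} + \phi(y)\}},
\qquad
\alpha^{*} = \log\frac{n_1}{n_0} - \log\frac{c_1}{c_0},
$$
i.e. the sample label given y follows a linear logistic model, which is the form that both the profile and the retrospective arguments recover.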
Comparisons of precision
Table 1 shows that the precision previously achieved in nine times units can now be achieved in one time
unit. Professor Chib argues unconvincingly that this is not a substantial factor. Possibly he has too much
time on his hands. His subsequent remark about the addition of −34 to each value in Table 1 can only be
interpreted tongue in cheek, so the comment about insubstantial factors may be meant likewise.
Likelihood principle
Although it appears to have no direct bearing on any of our results, Robins’s example raises interesting
fundamental issues connected with the likelihood function, its definition and its interpretation. As we
understand it, the likelihood principle does not address matters connected with asymptotics; nor does
it promote point estimation as an inferential activity. In particular, the non-existence of a maximum or
supremum is not regarded as a contradiction of the likelihood principle; nor is the existence of multiple
maxima. By convention, an estimator is said to ‘obey the likelihood principle’ if it is a function of the
sufficient statistic.
In writing this paper, we had given no thought to the applicability of the likelihood principle in infinite
dimensional models. On reflection, however, it is essential that likelihood ratios be well defined, and, to
achieve this, topological conditions are unavoidable. Such conditions are natural in statistical work because an observation is a point in a topological space, in which the σ-field is generated by the open sets.
Time and space limitations preclude more detailed discussion, but an example in which Θ is the set of
Lebesgue measurable functions [0, 1] → [0, 1] illustrates one aspect of the matter. If the model is such that
Yi is Bernoulli with parameter θ(xi), the 'likelihood' expression
$$
L(\theta) = \prod_i \theta(x_i)^{y_i} \{1 - \theta(x_i)\}^{1-y_i}
$$
is not a function on Θ as a topological space. Two functions θ and θ′ differing on a null set represent the same point in the topological space Θ, but they may not have the same 'likelihood'. In other words, θ = θ′ does not imply L(θ) = L(θ′), so L is not a function on Θ. None-the-less, there are non-trivial submodels
for which the likelihood exists and factors, and the Robins phenomenon persists.
References in the discussion
Albert, J. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. J. Am. Statist. Ass.,
88, 669–679.
Anderson, J. A. (1979) Multivariate logistic compounds. Biometrika, 66, 17–26.
Armitage, P. (1971) Statistical Methods in Medical Research. Oxford: Blackwell.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins.
Bickel, P. J. and Ritov, Y. (1991) Large sample theory of estimation in biased sampling regression models I. Ann.
Statist., 19, 797–816.
Casella, G. and Robert, C. P. (1996) Rao-Blackwellisation of sampling schemes. Biometrika, 83, 81–94.
Chamberlain, G. (1988) Efficiency bounds for semiparametric regression. Technical Report. Department of Economics, University of Wisconsin, Madison.
Chib, S. (1995) Marginal likelihood from the Gibbs output. J. Am. Statist. Ass., 90, 1313–1321.
Deming, W. E. and Stephan, F. F. (1940) On a least-squares adjustment of a sampled frequency table when the
expected marginal totals are known. Ann. Math. Statist., 11, 427–444.
Doss, H. and Narasimhan, B. (1998) Dynamic display of changing posterior in Bayesian survival analysis.
In Practical Nonparametric and Semiparametric Bayesian Statistics (eds D. Dey, P. Müller and D. Sinha),
pp. 63–87. New York: Springer.
Doss, H. and Narasimhan, B. (1999) Dynamic display of changing posterior in Bayesian survival analysis: the
software. J. Statist. Softwr., 4, 1–72.
Evans, M. and Swartz, T. (2000) Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford:
Oxford University Press.
Fokianos, K., Kedem, B., Qin, J. and Short, D. A. (2001) A semiparametric approach to the one-way layout.
Technometrics, 43, 56–65.
Gelfand, A. E. and Smith, A. F. M. (1990) Sampling based approaches to calculating marginal densities. J. Am.
Statist. Ass., 85, 398–409.
Geyer, C. J. (1994) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo.
Technical Report 568. School of Statistics, University of Minnesota, Minneapolis.
Gilbert, P. B., Lele, S. R. and Vardi, Y. (1999) Maximum likelihood estimation in semiparametric selection bias
models with application to AIDS vaccine trials. Biometrika, 86, 27–43.
Gill, R., Vardi, Y. and Wellner, J. (1988) Large sample theory of empirical distributions in biased sampling models.
Ann. Statist., 16, 1069–1112.
Guha, S. and MacEachern, S. N. (2002) Benchmark estimation and importance sampling with Markov chain
Monte Carlo samplers. Technical Report. Ohio State University, Columbus.
Guha, S., MacEachern, S. N. and Peruggia, M. (2002) Benchmark estimation for Markov chain Monte Carlo
samples. J. Comput. Graph. Statist., to be published.
Newey, W. (1990) Semiparametric efficiency bounds. J. Appl. Econometr., 5, 99–135.
Pollard, D. (1990) Empirical Processes: Theory and Applications. Hayward: Institute of Mathematical Statistics.
Qin, J. (1998) Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85,
619–630.
Qin, J. (1999) Case-control studies and Monte Carlo methods. Unpublished.
Ritov, Y. and Bickel, P. (1990) Achieving information bounds in non- and semi-parametric models. Ann. Statist.,
18, 925–938.
Robert, C. P. (1995) Convergence control techniques for MCMC algorithms. Statist. Sci., 10, 231–253.
Robert, C. P. and Casella, G. (1999) Monte Carlo Statistical Methods. New York: Springer.
Robins, J. M. (2003) Estimation of optimal treatment strategies. In Proc. 2nd Seattle Symp. Biostatistics, to be
published.
Robins, J. M. and Ritov, Y. (1997) A curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statist. Med., 16, 285–319.
Robins, J. M., Rotnitzky, A. and van der Laan, M. (2000) Comment on ‘On profile likelihood’ (by S. A. Murphy
and A. W. van der Vaart). J. Am. Statist. Ass., 95, 431–435.
Robins, J. M. and Wasserman, L. (2000) Conditioning, likelihood, and coherence: a review of some foundational
concepts. J. Am. Statist. Ass., 95, 1340–1346.
Vardi, Y. (1982) Nonparametric estimation in the presence of length bias. Ann. Statist., 10, 616–620.
Vardi, Y. (1985) Empirical distributions in selection bias models. Ann. Statist., 13, 178–203.
Woodroofe, M. (1985) Estimating a distribution function with truncated data. Ann. Statist., 13, 163–177.