J. R. Statist. Soc. B (2003) 65, Part 3, pp. 585–618 A theory of statistical models for Monte Carlo integration A. Kong, deCODE Genetics, Reykjavik, Iceland P. McCullagh, University of Chicago, USA X.-L. Meng Harvard University, Cambridge, USA and D. Nicolae and Z. Tan University of Chicago, USA [Read before The Royal Statistical Society at a meeting organized by the Research Section on Wednesday, December 11th, 2002, Professor D. Firth in the Chair ] Summary. The task of estimating an integral by Monte Carlo methods is formulated as a statistical model using simulated observations as data. The difficulty in this exercise is that we ordinarily have at our disposal all of the information required to compute integrals exactly by calculus or numerical integration, but we choose to ignore some of the information for simplicity or computational feasibility. Our proposal is to use a semiparametric statistical model that makes explicit what information is ignored and what information is retained. The parameter space in this model is a set of measures on the sample space, which is ordinarily an infinite dimensional object. None-the-less, from simulated data the base-line measure can be estimated by maximum likelihood, and the required integrals computed by a simple formula previously derived by Vardi and by Lindsay in a closely related model for biased sampling. The same formula was also suggested by Geyer and by Meng and Wong using entirely different arguments. By contrast with Geyer’s retrospective likelihood, a correct estimate of simulation error is available directly from the Fisher information. The principal advantage of the semiparametric model is that variance reduction techniques are associated with submodels in which the maximum likelihood estimator in the submodel may have substantially smaller variance than the traditional estimator. The method is applicable to Markov chain and more general Monte Carlo sampling schemes with multiple samplers. Keywords: Biased sampling model; Bridge sampling; Control variate; Exponential family; Generalized inverse; Importance sampling; Invariant measure; Iterative proportional scaling; Log-linear model; Markov chain Monte Carlo methods; Multinomial distribution; Normalizing constant; Semiparametric model; Retrospective likelihood 1. Normalizing constants and Monte Carlo integration Certain inferential problems arising in statistical work involve awkward summation or high dimensional integrals that are not analytically tractable. Many of these problems are such that Address for correspondence: P. McCullagh, Department of Statistics, University of Chicago, 5734 University Avenue, Chicago, IL 60657-1514, USA. E-mail: [email protected] 2003 Royal Statistical Society 1369–7412/03/65585 586 A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae and Z. Tan only ratios of integrals are required, and it is this class of problems with which we shall be concerned in this paper. To establish the notation, let Γ be a set, let µ be a measure on Γ, let {qθ } be a family of functions on Γ and let c.θ/ = qθ .x/ dµ: Γ Our goal is ideally to compute exactly, or in practice to estimate, the ratios c.θ/=c.θ / for all values θ and θ in the family. The family may contain a reference function qθ0 whose integral is known, in which case the remaining integrals are directly estimable by reference to the standard. Our theory accommodates but does not require such a standard. For an estimator to be useful, an approximate measure of estimation error is also required. 
We refer to c.θ/ as the normalizing constant associated with the function qθ .x/. In particular, if qθ is non-negative and 0 < c.θ/ < ∞, dPθ .x/ = qθ .x/ dµ=c.θ/ is a probability distribution on Γ. For the method to work, the family must contain at least one non-negative function qθ , but it is not necessary that all of them be non-negative. Depending on the context and on the functions qθ , the normalization constant might represent anything from a posterior expectation in a Bayesian calculation to a partition function in statistical physics. Observations simulated from one or more of these distributions are the key ingredient in Monte Carlo integration. We assume throughout the paper that techniques are available to simulate from Pθ without computing the normalization constant. At least initially, we assume that these techniques generate a stream of independent observations from Pθ . At first glance, the problem appears to be an exercise in calculus or numerical analysis, and not amenable to statistical formulation. After all, statistical theory does not seek to avoid estimators that are difficult to compute; nor is it inclined to opt for inferior estimators because they are convenient for programming. So it is hard to see how any efficient statistical formulation could avoid the obvious and excellent estimator c.θ/ = qθ .x/ dµ, which has zero variance and requires no simulated data. This paper demonstrates that the exercise can nevertheless be formulated as a model-based statistical estimation problem in which the parameter space is determined by how much information we choose to ignore. In effect, the statistical model serves to estimate that part of the information that is ignored and uses the estimate to compute the required integrals in a manner that is asymptotically efficient given the information available. Neither the nature nor the extent of the ignored information is predetermined. By judicious use of group invariant submodels and other submodels, the amount of information ignored may be controlled in such a way that the simulation variance is reduced with little increase in computational effort. The literature on Monte Carlo estimation of integrals is very extensive, and no attempt will be made here to review it. For good summaries, see Hammersley and Hanscomb (1964), Ripley (1987), Evans and Swartz (2000) or Liu (2001). For overviews on the computation of normalizing constants, see DiCiccio et al. (1997) or Gelman and Meng (1998). 2. Illustration The following example with sample space Γ = R×R+ is sufficiently simple that the integrals can be computed analytically. Nevertheless, it illustrates the gains that are achievable by choice of Statistical Models for Monte Carlo Integration 587 design and by choice of submodel. Three Monte Carlo techniques are described and compared. Suppose that we need to evaluate the integrals over the upper half-plane dx1 dx2 cσ = 2 + .x + σ/2 }2 {x Γ 2 1 for σ ∈ {0:25, 0:5, 1:0, 2:0, 4:0}. It is conventional to take µ to be Lebesgue measure, so that qσ .x/ = 1={x12 + .x2 + σ/2 }2 . As it happens, the distribution Pσ has mean .0, σ/ with infinite variances and covariances. Consider first an importance sampling design in which a stream of independent observations x1 ,: : :, xn is made available from the distribution P1 . 
The importance sampling estimator of cσ/c1 is

ĉσ/ĉ1 = n⁻¹ Σ_{i=1}^{n} qσ(xi)/q1(xi).

By applying the results from Section 4, we find on the basis of n = 500 simulations that the matrix

V̂ = n⁻¹ ×
  [  4.411   1.491   0.000  −0.601  −0.821 ]
  [  1.491   0.641   0.000  −0.383  −0.582 ]
  [  0.000   0.000   0.000   0.000   0.000 ]
  [ −0.601  −0.383   0.000   0.578   1.273 ]
  [ −0.821  −0.582   0.000   1.273   3.591 ]

is such that the asymptotic variance of log(ĉr/ĉs) is equal to V̂rr + V̂ss − 2V̂rs for r, s ∈ {0.25, 0.5, 1.0, 2.0, 4.0}. The individual components cr are not identifiable in the model and the estimates do not have a variance, asymptotic or otherwise. The estimated variances of the 10 pairwise logarithmic contrasts range from 0.6/n to 9.6/n, with an average of 3.6/n. The matrix V̂ is obtained from the observed Fisher information, so a different simulation will yield a slightly different matrix.

Suppose as an alternative that it is feasible to simulate from any or all of the distributions Pσ, as in Geyer (1994), Hesterberg (1995), Meng and Wong (1996) or Owen and Zhou (2000). Various simulation designs, also called defensive importance sampling or bridge sampling plans, can now be considered in which nr observations are generated from Pr. These are called the design weights, or bridge sampling weights. For simplicity we consider the uniform design in which nr = n/5. The importance sampling estimator must now be replaced by the more general maximum likelihood estimator derived in Section 3, which is obtained by solving

ĉσ = Σ_{i=1}^{n} qσ(xi) / {Σ_{s} ns ĉs⁻¹ qs(xi)}.    (2.1)

The asymptotic covariance matrix of log(ĉ) obtained from a sample of n = 500 simulated observations using equation (4.2) is

V̂ = n⁻¹ ×
  [  2.298   0.974  −0.263  −1.197  −1.811 ]
  [  0.974   0.668   0.077  −0.588  −1.131 ]
  [ −0.263   0.077   0.239   0.117  −0.170 ]
  [ −1.197  −0.588   0.117   0.690   0.979 ]
  [ −1.811  −1.131  −0.170   0.979   2.132 ]

In using this matrix, it must be borne in mind that c ≡ λc for every positive scalar λ, so only contrasts of log(c) are identifiable. The estimated variances of the 10 pairwise logarithmic contrasts range from 0.7/n to 8.1/n, with an average of 3.0/n. The average efficiency factor of the uniform design relative to the preceding importance sampling design is conveniently measured by the asymptotic variance ratio 3.6/3.0 = 1.2.

A third Monte Carlo variant uses a submodel with a reduced parameter space consisting of measures that are invariant under group action. Details are described in Section 3.3, but the operation proceeds as follows. Consider the two-element group G = {±1} in which the inversion g = −1 acts on Γ by reflection in the unit circle,

g: (x1, x2) → (x1, x2)/(x1² + x2²).

By construction, g² = 1 and g⁻¹ = g, so this is a group action. Lebesgue measure is not invariant and is thus not in the parameter space as determined by this action. However, the measure ρ(dx) = dx1 dx2/x2² is invariant, so we compensate by writing

q(x; σ) = x2²/{x1² + (x2 + σ)²}²

for the new integrand, and cσ = ∫_Γ q(x; σ) dρ. The submodel estimator is the same as equation (2.1), but with q replaced by the group average

q̄(x; σ) = ½ q(x; σ) + ½ q(gx; σ).

With a uniform design and n = 500 observations, the estimated variance matrix of log(ĉ) is

V̂ = n⁻¹ ×
  [  0.202  −0.093  −0.216  −0.093   0.202 ]
  [ −0.093   0.045   0.097   0.045  −0.093 ]
  [ −0.216   0.097   0.239   0.097  −0.216 ]
  [ −0.093   0.045   0.097   0.045  −0.093 ]
  [  0.202  −0.093  −0.216  −0.093   0.202 ]

The variances of the 10 pairwise logarithmic contrasts range from 0 to 0.87/n with an average of 0.37/n.
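Calculations of this kind are easily mechanized. The sketch below (illustrative only, not the code used for the numbers reported above) draws from each Pσ, iterates the fixed-point equation (2.1), and obtains the covariance matrix of log(ĉ) from the formula (4.2) derived in Section 4. The exact sampler for Pσ used here is an assumption of the sketch rather than part of the paper: X2 + σ is drawn by inversion from the density 2σ²/u³ on (σ, ∞), and given X2 + σ = u the component X1 is u times a t3 variable divided by √3.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmas = np.array([0.25, 0.5, 1.0, 2.0, 4.0])

def q(x, sigma):
    # unnormalized integrand q_sigma(x) = 1 / {x1^2 + (x2 + sigma)^2}^2
    return 1.0 / (x[:, 0] ** 2 + (x[:, 1] + sigma) ** 2) ** 2

def draw_P(sigma, m):
    # assumed exact sampler for P_sigma: U = X2 + sigma has density 2 sigma^2 / u^3 on
    # (sigma, inf); given U = u, X1 = u * T with T having density 2 / {pi (1 + t^2)^2}
    u = sigma / np.sqrt(rng.uniform(size=m))
    t = rng.standard_t(3, size=m) / np.sqrt(3.0)
    return np.column_stack([u * t, u - sigma])

# uniform bridge-sampling design: 100 draws from each of the five samplers
n_r = np.full(5, 100)
x = np.vstack([draw_P(s, m) for s, m in zip(sigmas, n_r)])
n = x.shape[0]
Q = np.column_stack([q(x, s) for s in sigmas])       # n x k matrix of values q_r(x_i)

# fixed-point iteration for equation (2.1); only ratios are identifiable, so the
# component for sigma = 1 is pinned to its known value pi/4 purely for display
c = np.ones(5)
for _ in range(500):
    denom = Q @ (n_r / c)                            # sum_s n_s q_s(x_i) / c_s
    c = Q.T @ (1.0 / denom)
    c = c * (np.pi / 4.0) / c[2]
print(c / c[2])                                      # estimates c_sigma / c_1; exact values are 1/sigma^2

# asymptotic covariance matrix of log(c-hat), equation (4.2) of Section 4
denom = Q @ (n_r / c)
P_hat = (Q / c) / denom[:, None]                     # n x k matrix of P-hat_r(x_i)
A = np.eye(n) - P_hat @ np.diag(n_r) @ P_hat.T
V = P_hat.T @ np.linalg.inv(A + np.ones((n, n)) / n) @ P_hat
for r in range(5):
    for s in range(r + 1, 5):
        v = n * (V[r, r] + V[s, s] - 2.0 * V[r, s])
        print(f"var log(c_{sigmas[r]} / c_{sigmas[s]}) ~ {v:.2f}/n")
```

With n = 500 the printed contrast variances should be broadly comparable with the entries implied by the second matrix above, up to simulation noise from run to run.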
For reasons described in Section 3.3, the two ratios c0:25 =c4 and c0:5 =c2 are estimated exactly with zero variance. Relative to the preceding Monte Carlo estimator, group averaging reduces the average simulation variance of contrasts by an efficiency factor of 8.1. By this comparison, n simulated observations using the group-averaged estimator are approximately equivalent to 8n observations using the estimator (2.1) with the same design weights, but without group averaging. To achieve efficiency gains of this magnitude, it is not necessary to use a large group, but it is necessary to have a good understanding of the integrands in a qualitative sense, and to select the group action accordingly. If the group action had been chosen so that g.x1 , x2 / = .−x1 , x2 / the gain in efficiency would have been zero. Arguably, the gain in efficiency would be negative because of the slight increase in computational effort. Given observations simulated by any of the preceding schemes, equation (2.1), or the groupaveraged version, can also be used for the estimation of integrals such as log.x12 + x22 / dx1 dx2 cσ = 2 2 2 Γ {x1 + .x2 + σ/ } in which the integrand is not positive, and there is no associated probability distribution. In this extended version of the problem, 10 integrals are estimated simultaneously, and cσ =cσ is the expected value of log |X12 +X22 | when X∼Pσ . For this example, cσ = π=4σ 2 , and cσ =cσ = 2 log.σ/. The general theory described in the next section covers integrals of this sort, and the Fisher information also provides a variance estimate. Statistical Models for Monte Carlo Integration 3. 589 A semiparametric model 3.1. Problem formulation The statistical problem is formulated as a challenge issued by one individual called the simulator and accepted by a second individual called the statistical analyst. In practice, these typically are two personalities of one individual, but for clarity of exposition we suppose that two distinct individuals are involved. The simulator is omniscient, honest and secretive but willing to provide data in essentially unlimited quantity. Partial information is available to the analyst in the form of a statistical model and the data made available by the simulator. Let q1 , : : :, qk be real-valued functions on Γ, known to the analyst, and let µ be any nonnegative measure on Γ. The challenge is to compute the ratios cr =cs , where each integral cr = q .x/ dµ is assumed to be finite, i.e. we are interested in estimating all ratios simultaneously, r Γ where qr .x/ = qθr .x/, using the notation of Section 1. Assume that there is at least one nonnegative function qr such that 0 < cr < ∞, and that nr observations from the weighted distribution Pr .dx/ = cr−1 qr .x/ µ.dx/ .3:1/ are made available by the simulator at the behest of the analyst. The analyst’s design vector .n1 ,: : :, nk / has at least one r such that nr > 0. Typically, however, many of the functions qr are such that nr = 0. For these functions, the non-negativity condition is not required. Different versions of the problem are available depending on what is known to the analyst about µ. Four of these are now described. (a) If µ is known, e.g. if µ is Lebesgue measure, the constants can, in principle, be determined exactly by integral calculus. (b) If µ is known up to a positive scalar multiple, the constants can be determined modulo the same scalar multiple, and the ratios can be determined exactly by integral calculus. 
(c) If µ is completely unknown, neither the constants nor their ratios can be determined by calculus alone. Nevertheless, the ratios can be estimated consistently from simulated data by using a slight modification of Vardi’s (1985) biased sampling model.
(d) If partial information is available concerning µ, neither the constants nor their ratios can be determined by calculus. Nevertheless, partial information may permit a substantial gain of efficiency by comparison with (c).

As an exercise in integral calculus, the first and second versions of the problem are not considered further in this paper.

3.2. A full exponential model
In this section, we focus on a semiparametric model in which the parameter µ is a measure or distribution on Γ, completely unknown to the analyst. The simulator is free to choose any measure µ, and the analyst’s estimate must be consistent regardless of that choice. The parameter space is thus the set Θ of all non-negative measures on Γ, not necessarily probability distributions, and the components of interest are the linear functionals

cr = ∫_Γ qr(x) dµ.    (3.2)

The state space Γ in the following analysis is assumed to be countable; the more general argument covering uncountable spaces is given by Vardi (1985) in a model for biased sampling.

We suppose that simulated data are available in the form (y1, x1), ..., (yn, xn), in which the pairs are independent, yi ∈ {1, ..., k} is determined by the simulation design and xi is a random draw from the distribution P_{yi}. Then for each draw x from the distribution Pr the likelihood contribution is

Pr({x}) = cr⁻¹ qr(x) µ({x}).

The full likelihood at µ is thus

L(µ) = Π_{i=1}^{n} P_{yi}({xi}) = Π_{i=1}^{n} c_{yi}⁻¹ q_{yi}(xi) µ({xi}).

It is helpful at this stage to reparameterize the model in terms of the canonical parameter θ ∈ R^Γ given by θ(x) = log[µ({x})]. Let P̂ be the empirical measure on Γ placing mass 1/n at each data point. Ignoring additive constants, the log-likelihood at θ is

Σ_{i=1}^{n} θ(xi) − Σ_{s=1}^{k} ns log{cs(θ)} = n ∫_Γ θ(x) dP̂ − Σ_{s=1}^{k} ns log{cs(θ)}.    (3.3)

Although it may appear paradoxical at first, the canonical sufficient statistic P̂, the empirical distribution of the simulated values {x1, ..., xn}, ignores a part of the data that might be considered highly informative, namely the association of distribution labels with simulated values. In fact, all permutations of the labels y give the same likelihood. The likelihood function is thus unaffected by reassignment of the distribution labels y to the draws x. This point was previously noted by Vardi (1985), section 6. Thus, under the model as specified, or under any submodel, the association of draws with distribution labels is uninformative. The reason for this is that all the information in the labels for estimating the ratios is contained in the design constants {n1, ..., nk}.

As is evident from equation (3.3), the model is of the full exponential family type with canonical parameter θ, and canonical sufficient statistic P̂. The maximum likelihood estimate of µ, obtained by equating the canonical sufficient statistic to its expectation, is

n P̂(dx) = Σ_{s=1}^{k} ns ĉs⁻¹ qs(x) µ̂(dx),

where dx = {x}. Thus

µ̂(dx) = n P̂(dx) / Σ_{s=1}^{k} ns ĉs⁻¹ qs(x),    (3.4)

where ĉs is the maximum likelihood estimate of cs. Note that µ̂ is supported on the data values {x1, ..., xn}, but the atoms at these points are not all equal.

From the integral definition of cr, we have

ĉr = ∫_Γ qr(x) dµ̂ = Σ_{i=1}^{n} qr(xi) / {Σ_{s=1}^{k} ns ĉs⁻¹ qs(xi)}.    (3.5)

In principle, it is necessary to check that there exists in the parameter space a measure µ̂ such that these equations are satisfied, which is a non-trivial exercise for certain extreme configurations of the observations. Fortunately existence and uniqueness have been studied in great detail for Vardi’s biased sampling model (Vardi, 1985), which is essentially mathematically equivalent. Although P̂ is uniquely determined, µ̂ is determined modulo an arbitrary positive multiple, and the constants ĉr are determined modulo the same positive multiple. In other words, the set of estimable parametric functions may be identified with the space of logarithmic contrasts Σ ar log(cr) in which Σ ar = 0. This is clear from equation (3.1), which implies that the problem of ratio estimation is invariant under the transformation qr(x) → α(x) qr(x), where α is strictly positive on Γ.

The computational algorithm suggested by equation (3.5) is iterative proportional scaling (Deming and Stephan, 1940; Bishop et al., 1975) applied to the n × k array {nr qr(xi)} in such a way that, after rescaling, nr qr(xi) → nr qr(xi) µ̂(xi)/ĉr, each row total is 1 and the rth column total is nr. The special case k = 1, called importance sampling, corresponds to the Horvitz–Thompson estimator (Horvitz and Thompson, 1952), which is widely used in survey sampling to correct for unequal sampling probabilities. We might expect to find the more general estimator (3.5) in use for combining data from one or more surveys that use different known sampling probabilities. Possibly this exercise is not of interest in survey sampling. In any event, the estimator does not occur in Firth and Bennett (1998) or in Pfeffermann et al. (1998), which are concerned with unequal selection probabilities in surveys.

The normalized version of equation (3.4) has been obtained by Vardi (1985) and by Lindsay (1995) in a nonparametric model for biased sampling. In his discussion of Vardi (1985), Mallows (1985) pointed out the connection with log-linear models and explained why the algorithm converges. These biased sampling models are not posed as Monte Carlo estimation, but the models are equivalent apart from the restriction of the parameter space to probability distributions. For further details, including connectivity and support conditions for existence and uniqueness, see Vardi (1985) or Gill et al. (1988). These conditions are assumed henceforth.

The preceding derivation assumes that Γ is countable, so counting measure dominates all others. If Γ is not countable, no dominating measure exists. Nevertheless, the likelihood has a unique maximum given by equation (3.4) provided that the connectivity and support conditions are satisfied. This maximizing measure has finite support, and the maximum likelihood estimate of c is given by equation (3.5).

3.3. Symmetry and group invariant submodels
In practice, we invariably ‘know’ that the base-line measure is either counting measure or Lebesgue measure. The method described above completely ignores such information. As a result, the conclusions apply equally to discrete sample spaces, finite dimensional vector spaces, metric spaces, product spaces and arbitrary subsets thereof.
On the negative side, if the baseline measure does have symmetry properties that are easily exploited, the estimator may be considerably less efficient than it need be. To see how symmetries might be exploited, let G be a compact group acting on Γ in such a way that the base-line measure µ is invariant: µ.gA/ = µ.A/ for each A ⊂ Γ and each g ∈ G. For example, G might be the orthogonal group, a permutation group or any subgroup. In this reduced model, the parameter space consists only of measures that are invariant under G. The log-likelihood function (3.3) simplifies because θ.x/ = θ.gx/ for each g ∈ G, and the minimal sufficient statistic is reduced to the symmetrized empirical distribution function P̂ G P̂ G .A/ = ave { P̂ .gA/} g∈G for each A ⊂ Γ. If G is finite and acts freely, P̂ G has mass 1=n|G| at each of the transformed 592 A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae and Z. Tan sample points gxi with g ∈ G. In a rough sense, the effective sample size is increased by a factor equal to the average orbit size. The maximum likelihood estimate of µ, obtained by equating the minimal sufficient statistic to its expectation, is n P̂ G .dx/ = k s=1 ns ĉs−1 q̄s .x/ µ̂.dx/, where q̄s .x/ = ave {qs .gx/}: g∈G .3:6/ In other words, the estimates from the submodel are still given by equations (3.4) and (3.5), but with qs replaced by the group average q̄s , and P̂ replaced by P̂ G . The group-averaged estimator may be interpreted as Rao–Blackwellization given the orbit, so group averaging cannot increase the variance of µ̂ or of the linear functionals ĉr (Liu (2001), section 2.5.5). From the estimating equation point of view, the submodel replaces equation (3.2) by cr = q̄r .x/ dµ, .3:7/ Γ which is a consequence of the assumption that µ is G invariant. However, if we proceed with equation (3.7) directly as with equation (3.2), it would appear that we need draws from P̄r .dx/ = cr−1 q̄r .x/ µ.dx/, rather than Pr .dx/. Although we can easily draw x̄i from P̄yi by randomly drawing a g from G and setting x̄i = gxi , this step is unnecessary because P̄yi is invariant under G, and thus P̄yi .{x̄i }/ = P̄yi .{xi }/. Provided that G is sufficiently small that the group averaging in equation (3.6) represents a negligible computational cost, the submodel estimator is no more difficult to compute than the original ĉ. Consequently, the submodel is most useful if G is a small finite group. If G is the orthogonal group, aveg∈G {qs .gx/} is the average with respect to Haar measure over an infinite set. This is usually a non-trivial exercise in calculus or numerical integration, precisely what we had sought to avoid by simulation. With a judicious choice of group action, the potential gain in efficiency can be very large. As an extreme example, suppose that the distributions Pr are such that for each r there exists a g ∈ G such that Pr .A/ = P1 .gA/ for every measurable A ⊂ Γ. Then q̄r .x/=q̄s .x/ is a constant independent of x for all r and s, and the ratios of normalizing constants are estimated exactly with zero variance. This effect is evident in the example in Section 2 in which X ∼ Pσ implies gX ∼ P1=σ . In practice, such a group action may be hard to find, but it is often possible to find a group such that there is substantially more overlap among the symmetrized distributions P̄r .A/ = aveg∈G {Pr .gA/} than among the original {Pr }. 
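As a small concrete check (our own illustration based on the example of Section 2, not code from the paper), the group average of equation (3.6) for the two-element group generated by reflection in the unit circle can be written in a few lines, and evaluating it at arbitrary points exhibits the constant-ratio phenomenon directly.

```python
import numpy as np

def q_rho(x, sigma):
    # integrand relative to the invariant measure rho(dx) = dx1 dx2 / x2^2 (Section 2):
    # q(x; sigma) = x2^2 / {x1^2 + (x2 + sigma)^2}^2
    return x[:, 1] ** 2 / (x[:, 0] ** 2 + (x[:, 1] + sigma) ** 2) ** 2

def g(x):
    # reflection in the unit circle: (x1, x2) -> (x1, x2) / (x1^2 + x2^2)
    return x / (x[:, 0] ** 2 + x[:, 1] ** 2)[:, None]

def q_bar(x, sigma):
    # group average over G = {identity, g}, as in equation (3.6)
    return 0.5 * (q_rho(x, sigma) + q_rho(g(x), sigma))

rng = np.random.default_rng(0)
x = np.column_stack([rng.normal(size=5), rng.uniform(0.1, 3.0, size=5)])  # arbitrary points with x2 > 0
for sigma in (0.25, 0.5):
    # q_bar(x; 1/sigma) / q_bar(x; sigma) is constant in x and equals sigma^4 = c_{1/sigma}/c_sigma,
    # so these ratios are recovered with zero simulation variance
    print(sigma, q_bar(x, 1.0 / sigma) / q_bar(x, sigma))
```

If the action is replaced by (x1, x2) → (−x1, x2), the two terms in q̄ are equal for every σ and nothing is gained, in line with the remark made in Section 2.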
For location–scale models with parameter .µ, σ/, reflection in the circle of radius σ̂ centred at .µ̂, 0/ is sometimes effective for integrals of likelihood functions or posterior distributions. Although a symmetrized estimator using a reflection such as q̄.x/ = {q.x/ + q.−x/}=2 may remind us of the antithetic principle to reduce Monte Carlo error, these two methods are fundamentally different. The antithetic principle exploits symmetry in the sampling distributions, whereas group averaging utilizes the symmetry in the base-line measure. In addition, the effectiveness of using antithetic variates depends on the form of the integrand (e.g. q2 =q1 ), as a non-linear function can make antithetic variates worse than the standard estimator (3.5) (see Craiu and Meng (2004)). In contrast, group averaging can do no harm regardless of the form of the integrand. Statistical Models for Monte Carlo Integration 593 The importance link function method of MacEachern and Peruggia (2000) has some features in common with group averaging, but the construction and the implementation are different in major ways. Group structure has also been used by Liu and Sabatti (2000), but for a different purpose, to improve the rate of mixing in Gibbs sampling. Group averaging has been discussed by Evans and Swartz (2000), page 191, as a method of variance reduction for importance sampling. Although the aims are similar, the details are entirely different. Evans and Swartz considered only subgroups of the symmetry group of the importance sampler. That is to say, the group action preserves both Lebesgue measure and the importance sampling distribution. By contrast, our method is geared towards more general bridge sampling designs, and the group action is not on the sampling distributions but on the base-line measures. In our model, no preferential status is accorded to Lebesgue measure or to any particular sampler, so it is not necessary that the group action should preserve either. On the contrary, it is desirable that the group action should mix the distributions thoroughly to make the averaged distributions as similar as possible. 3.4. Projection and linear submodels Up to this point, the analysis has treated the k functions q1 , : : :, qk in a symmetric manner even where the design constants {n1 ,: : :, nk } are not equal. In practice, there may be substantial asymmetries that can be exploited to reduce simulation error. In the simplest case, it may be known that two of the normalizing constants are equal, say c2 = c3 . The reduced parameter space is then the set of measures µ such that .q2 − q3 / dµ = 0. Ideally, we would like to estimate µ by maximum likelihood subject to this homogeneous linear constraint. Even when it exists and is unique, the maximum likelihood estimator in this submodel is unlikely to be cost effective, so we seek a simple one-step alternative by linear projection. Let c̃ be the unconstrained estimator from equation (3.5) or the group-averaged version in Section 3.3, and let Ṽ be the asymptotic variance matrix of log.c̃/ as given in equation (4.2). We consider a submodel in which c lies in the subspace X ⊂ Rk . For example, a single homogeneous constraint among the constants gives rise to a subspace X of dimension k − 1, and a matrix X of order k × .k − 1/ whose columns span X . 
Ignoring statistical error in the asymptotic variance matrix cov.c̃/ = CṼ C, where C = diag.c̃/, the weighted least squares projection is ĉ = X.XT C−1 Ṽ − C−1 X/− XT C−1 Ṽ − 1, .3:8/ where 1 is the constant vector with k components. See, for example, Hammersley and Hanscomb (1964), section 5.7. Provided that all generalized inverses are reflexive, i.e. Ṽ − Ṽ Ṽ − = Ṽ − , the asymptotic variance matrix is cov.ĉ/ = X.XT C−1 Ṽ − C−1 X/− XT : As always, only ratios of cs are estimable in the submodel. It is perhaps worth mentioning by way of clarification the precise role of the control variates when the objective is to estimate a single ratio c1 =c2 . Suppose that k = 3 and that the design constants are .0, n, 0/, so all observations are generated from P2 . Then the importance sampling estimator c̃1 =c̃2 from equation (3.5) has asymptotic variance O.n−1 /. Suppose now that P1 is in fact a mixture of P2 and P3 , so q1 = α2 q2 + α3 q3 , this being the reason for including q3 as a control variate such that .q2 − q3 / dµ = 0. Then 594 A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae and Z. Tan c̃1 = q1 .x/ dµ̃ = .α2 q2 + α3 q3 / dµ̃ = α2 c̃2 + α3 c̃3 is a linear combination of c̃2 and c̃3 , so the covariance matrix of c̃ has rank 2. After projection, ĉ2 = ĉ3 and ĉ1 = .α2 + α3 /ĉ2 , so ĉ1 =ĉ2 = α2 + α3 is estimated with zero error. The coefficients α2 and α3 are immaterial and need not be positive. It is evident that projection cannot increase simulation variance, but it is not evident that the potential for reduction in simulation error by projection is very great. Indeed, if we are interested primarily in estimating c1 =c0 , the reduction is typically not worthwhile unless the control variates q2 , q3 ,: : : are chosen carefully. The way in which this may be done for Bayesian posterior calculations is discussed in Section 5. Efficiency factors of the order of 5–10 appear to be routinely achievable. The discussion of control variates by Evans and Swartz (2000) involves subtraction rather than projection, so the technique is different from that proposed above. The algebra in section 5.7 of Hammersley and Hanscomb (1964) and in Rothery (1982), Ripley (1987) and Glynn and Szechtman (2000) is, in most respects, equivalent to the projection method. There are some differences but these are mostly superficial, starting with the complication that only ratios are identifiable in our models and submodels. The main difference in implementation is that our likelihood method automatically provides a variance matrix, so no preliminary experiment is required to estimate the coefficients in the projection. Glynn and Szechtman (2000), section 8, also noted that the required projection is a linear approximation to the nonparametric maximum likelihood estimator. 3.5. Log-linear submodels Most sets that arise in statistical applications have a large amount of structure that can potentially be exploited in the construction of submodels. For example, the spaces arising in genetic problems related to the coalescent model have a tree structure with measured edges. A submodel may be useful for Monte Carlo purposes if the estimate under the submodel is easy to compute and has substantially reduced variance. The following example illustrates the principle as it applies to spaces having a product structure. If Γ = Γ1 × : : : × Γl is a product set, it is natural to consider the submodel consisting of product measures only, i.e. µ = µ1 × : : : × µl , in which µj is a measure on Γj . 
Then each x ∈ Γ has components .x1 , : : :, xl /, and θ.x/ = θ1 .x1 / + : : : + θl .xl / in an extension of the notation of Section 3.2. The sufficient statistic in equation (3.3) is reduced to the list of l marginal empirical distribution functions, and the resulting model is equivalent to the additive log-linear model (main effects only) for an l-dimensional contingency table, or l + 1 dimensional if the design has more than one sampler. No closed form estimator is available unless the design is such that, for each nr > 0, the function qr .x/ is expressible as a product qr .x/ = qr1 .x1 / : : : qrl .xl /. Then each sampler generates observations with independent components, so µ̂j is given by equation (3.4), applied to the jth component of x. The component measures µ̂1 , : : :, µ̂l are then independent. In certain circumstances, the component sets Γ1 ,: : :, Γl are isomorphic, in which case we write Γl instead of Γ. It is then natural to restrict the parameter space to symmetric product measures of the form µl. Suppose, for example, that we wish to compute the integrals exp{−.x1 − θ/2 =2 − .x2 − θ/2 =2 − x12 x22 =2} dx1 dx2 cθ = R2 for various values of θ in the range .0, 4/. These constitute a subfamily of distributions consid- Statistical Models for Monte Carlo Integration 595 ered by Gelman and Meng (1991), whose conditional distributions are Gaussian. The parameter space in the submodel is the set of symmetric product measures µ×µ, so µ is a measure on R. For a sampler, it is convenient to take any distribution with density f on R, or the product distribution on R2 . The numerical values reported here are based on the standard Cauchy sampler. The maximum likelihood estimate of µ on R has mass 1=nf.xi / at each data point, so the maximum likelihood estimate of µ2 on R2 has mass {n2 f.xi / f.xj /}−1 at each ordered pair .xi , xj / in the sample. We find that the submodel estimator based on n simulated scalar observations is roughly equivalent in terms of statistical efficiency to 3n=2 bivariate observations in the unconstrained model. In principle, therefore, the crude importance sampling estimator can be improved by a factor of 3 without further data. On the negative side, the submodel estimator of cθ ĉθ = qθ .x, x / dµ̂.x/ dµ̂.x / is the sum of n2 terms, as opposed to n terms in the unconstrained model. In terms of computational effort, therefore, the importance sampling estimator is superior. The submodel estimator is not cost effective unless the simulations are the dominant time-consuming part of the calculation. There is one additional circumstance in which the submodel estimator achieves a gain in statistical efficiency that is sufficient to offset the increase in computational effort. If two functions qr and qs are such that the one-dimensional marginal distributions of Pr are the same as the onedimensional marginal distributions of Ps , the estimated ratio ĉr =ĉs has a variance that is o.n−1 /, i.e. n var{log.ĉr =ĉs /} → 0 as n → ∞. Simulation results indicate that the rate of decrease is O.n−2 /. This phenomenon can be observed if we replace the preceding family of integrands by the family exp.−x12 − x22 + 2θx1 x2 / for −1 < θ < 1. 3.6. Markov chain models The model in this section assumes that a sequence of draws constitutes an irreducible Markov chain having known transition density q.·; x/ with respect to the unknown measure µ on Γ. If the design calls for multiple chains, the transition densities are denoted by qr .·; x/ for r = 1, : : :, k. 
It is not necessary that the chain be in equilibrium; nor is it necessary that the chain be constructed to have a particular stationary distribution. Under this new model, a draw is a chain of length l with distribution Pr.l/ . The likelihood may be expressed as the product of l factors, the first three of which are Pr .dx1 / = cr−1 qr .x1 / µ.dx1 /, Pr .dx2 |x1 / = cr−1 .x1 / qr .x2 ; x1 / µ.dx2 /, Pr .dx3 |x2 / = cr−1 .x2 / qr .x3 ; x2 / µ.dx3 /: If the chain is not in equilibrium, the first factor is ignored. In effect, we now have l ‘independent’ observations from l distinct distributions, each with its own normalizing constant. The log-likelihood function for θ contributed by a single sequence of length l from Pr.l/ is then given by l l θ.xt / − log{cr .xt−1 ; θ/} = l θ.x/ d P̂ − log{cr .xt−1 ; θ/}, t=1 Γ t=1 596 A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae and Z. Tan which is a function of the entire sequence. Although the form of the likelihood is very similar to equation (3.3), the empirical distribution function P̂ is no longer sufficient. The log-likelihood equation from k independent chains, in which the chain from Ps has length ns , is given by n P̂ .dx/ = ns k s=1 t=1 P̂s .dx|xt−1 / = ns k s=1 t=1 ĉs−1 .xt−1 / qs .x; xt−1 / µ̂.dx/, .3:9/ where n = Σs ns , cs .x0 / ≡ cs , qs .x; x0 / ≡ qs .x/ and Ps .dx|x0 / ≡ Ps .dx/. Consequently, for each r = 1,: : :, k and t = 0,: : :, nr − 1, we have n qr .xi ; xt / , .3:10/ ĉr .xt / = qr .x; xt / dµ̂.x/ = ns k i=1 ĉs−1 .xj−1 / qs .xi ; xj−1 / s=1 j=1 which can be solved for {ĉr .xt /, t = 0, : : :, nr − 1; r = 1, : : :, k} using the Deming–Stephan algorithm. When all draws are independent and the margins are equal, qs .x; xt−1 / = qs .x/ does not depend on xt−1 , and equation (3.10) reduces to equation (3.5). At first sight, we might doubt that equation (3.10) could provide anything useful because we have at most one draw from each of the targeted Pr , namely the first component of each chain— recall that we have purposely ignored the information that all the margins are the same. That is, the transition probabilities qr .xt ; xt−1 / can be arbitrary, and in fact they can even be time inhomogeneous (i.e. qr .xt ; xt−1 / can be replaced by qr, t .xt ; xt−1 /), as long as the chain is not reducible. Furthermore, it appears that the number of ‘parameters’ {ĉr .xt /, t = 0, : : :, nr − 1; r = 1, : : :, k} is always the same as the number of data points. However, we must keep in mind that the model parameter is not c, but the base-line measure µ, and as long as a draw is from a known density with respect to µ it provides information about µ. This is in fact the fundamental reason that importance sampling can provide consistent estimators when draws are taken from an unrelated trial density. The information from the trial distributions about the baseline measure must be adequate to estimate the base-line measure of the target distribution. In particular, the union of the supports of the trial densities must cover the support of the target density. 4. Asymptotic covariance matrix 4.1. Multinomial information measure The log-likelihood (3.3) is evidently a sum of k multinomial log-likelihoods, sharing the same parameter θ. The Fisher information for θ is best regarded explicitly as a measure on Γ × Γ such that I.A, B/ = I.B, A/ and I.A, Γ/ = 0 for A, B ⊂ Γ. In particular, the multinomial information measure associated with the distribution Pr on Γ is given by Pr .A ∩ B/ − Pr .A/ Pr .B/. 
In the log-likelihood (3.3), the total Fisher information measure for θ is n I.A, B/ = k r=1 nr {Pr .A ∩ B/ − Pr .A/ Pr .B/}: At least formally, the asymptotic covariance matrix of θ̂ is the inverse Fisher information matrix n−1 I − , and the asymptotic covariance matrix of dµ̂ is n−1 dµ.x/ dµ.y/ I − .x, y/, Statistical Models for Monte Carlo Integration I − .x, y/ 597 I −, where is the .x, y/ element of indexed by Γ × Γ. From expression (3.5) for ĉr , we find that the asymptotic covariance of ĉr and ĉs is cr cs Vrs , where −1 Vrs = cov{log.ĉr /, log.ĉs /} = n I − .x, y/ dPr .x/ dPs .y/: .4:1/ Γ×Γ As always, only logarithmic contrasts have variances. In this expression Pr is defined by equation (3.1) for each r, provided that cr is finite and non-zero. There may be integrands qr that take both positive and negative values, in which case Pr is not a probability distribution. For the log-linear submodel discussed in Section 3.5 in which each sampler has independent components, it is necessary to replace I − .x, y/ in equation (4.1) by the sum I1− .x1 , y1 / + : : : + Il− .xl , yl /, where Ir is the Fisher information measure for θr . Expression (4.1), or its generalization, gives the O.n−1 / term in the asymptotic variance, but this term may be 0. Examples of this phenomenon are given in Sections 3.5 and 5.2. In such cases, more refined calculations are required to find the asymptotic distribution of log.ĉ/. 4.2. Matrix version The results of the preceding section are easily expressed in matrix notation, at least when all calculations are performed at the maximum likelihood estimate. Let W = diag.n1 , : : :, nk /, and let P̂ be the n × k matrix whose .i, r/ element is P̂ r .xi / = s qr .xi /=ĉr ns ĉs−1 qs .xi / : The matrix P̂W arises naturally in applying the Deming–Stephan algorithm to solve the maximum likelihood equation. Note that the column sums of P̂ are all 1, whereas the row sums satisfy Σr nr P̂ r .xi / = 1 for each i. The Fisher information for θ at θ̂ is (the measure whose density is represented by the matrix) In − P̂ W P̂ T , and the asymptotic covariance matrix of log.ĉ/ is given by V̂ = P̂ T .In − P̂ W P̂ T /− P̂ .4:2/ where In is the identity matrix of order n. Typically, the matrix In − P̂ W P̂ T has rank n − 1 with kernel equal to 1, the set of constant vectors. Then In − P̂ W P̂ T + 11T =n is invertible with approximately unit eigenvalues, and the inverse matrix is also a generalized inverse of In − P̂ W P̂ T suitable for use in equation (4.2). Although the inversion of n × n matrices can be avoided, all the numerical calculations reported in this paper use this variance formula. The eigenvalues of In − P̂ W P̂ T + 11T =n are in fact all less than or equal to 1, with equality for simple Monte Carlo designs in which all observations are generated from a single sampler. For more general designs having more than one sampler, the approximate variance formula V̂ P̂ T P̂ is anticonservative. That is to say, P̂ T .In − P̂W P̂ T /− P̂ P̂ T P̂ in the sense of Löwner ordering. In practice, if all the samplers have support equal to Γ, the underestimate is frequently negligible. The approximate variance formula is easier to compute and may be adequate for projection purposes described in Section 3.4. 5. Applications to Bayesian computation 5.1. Posterior probability calculation Consider a regression model in which the component observations y1 , : : :, ym are independent 598 A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae and Z. 
Tan and exponentially distributed with means such that log{E(Yi)} = β0 + β1 xi, where x1, ..., xm are known constants. For illustration m = 10, xi = i and the values yi are

2.28, 1.46, 0.90, 0.19, 1.88, 0.72, 2.06, 4.21, 2.90, 7.53.

The maximum likelihood estimate is β̂ = (−0.0668, 0.1494). The asymptotic standard error of β̂1 is 0.1101 using the expected Fisher information, and 0.0921 using the observed Fisher information. Let π(·) be a prior distribution on the parameter space. The posterior probability pr(β1 > 0 | y) is the ratio of two integrals. In the denominator the integrand is the product of the likelihood and the prior. The numerator has an additional Heaviside factor taking the value 1 when β1 > 0 and 0 otherwise. One way to approximate this ratio is to simulate observations from the posterior distribution and to compute the fraction that have β1 > 0. However, this exercise is both unnecessary and inefficient.

Let q0(β) = L(β) π(β) be the product of the likelihood and the prior at β, and let q1(β) = q0(β) I(β1 > 0) be the integrand for the numerator. For auxiliary functions, we choose q2(β) to be the bivariate normal density at β with mean β̂ and inverse covariance matrix equal to the Fisher information at β̂. It is marginally better to use the observed Fisher information for the inverse variance matrix in q2, but in principle we could use either or both. Let q3(β) be the product q2(β) I(β1 > 0)/K, where K = pr(β1 > 0), computed under the normal distribution q2. In this example, K = Φ(0.1494/0.0921) = 0.9476 for the observed information approximation, or 0.9127 for the expected information approximation. The common normalizing constant 2π|Î|^{1/2} for q2 and q3 may be ignored. By construction, therefore, ∫ q2(β) dβ = ∫ q3(β) dβ, not necessarily equal to 1.

The numerical calculations that follow use the improper uniform prior and n = 400 simulations from the normal proposal density q2, so the design constants are n = (0, 0, 400, 0). The unconstrained maximum likelihood estimates obtained from equation (3.5) were

log(c̃1/c̃0) = −0.0525 ± 0.0137,
log(c̃3/c̃2) = 0.0078 ± 0.0108,

with correlation 0.901. Note, however, that c2 = c3 by design, but the estimator is not similarly constrained at this stage. The estimated posterior probability pr(β1 > 0 | y) is thus exp(−0.0525) = 0.9489, with an approximate 90% confidence interval (0.928, 0.970).

By imposing the constraint c2 = c3 on the parameter space we obtain a new estimator by weighted least squares projection

log(ĉ) = X(XᵀṼ⁻X)⁻XᵀṼ⁻ log(c̃).

Here c̃ is the unconstrained estimator, Ṽ is the estimated variance matrix of log(c̃) and X is the model matrix

X = [ 1  0  0 ]
    [ 0  1  0 ]
    [ 0  0  1 ]
    [ 0  0  1 ]

The resulting estimate and its standard error are

log(ĉ1/ĉ0) = log(c̃1/c̃0) − 1.136 log(c̃3/c̃2) = −0.0613 ± 0.0059.

The alternative projection (3.8) is preferable in principle, but in this case the two projections yield indistinguishable estimates. The point estimate of the posterior probability is 0.940, with an approximate 90% confidence interval (0.931, 0.950). The efficiency factor is roughly 5, which is similar to the factors obtained by Rothery (1982) using a similar technique in a problem concerning power calculation. Much of the gain in efficiency in this example could be achieved by taking the projection coefficient to be −1 on the log-scale, i.e. by subtraction using the control variate in the traditional way.
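The adjustment just described can be sketched in a few lines (our own arithmetic, working with the two reported contrasts and their estimated covariance rather than the full 4 × 4 matrix Ṽ). Imposing the known value 0 for log(c3/c2) by generalized least squares is the same calculation as the projection with the model matrix X, and it reproduces the figures above up to rounding of the quoted standard errors.

```python
import numpy as np

d1, d2 = -0.0525, 0.0078             # unconstrained contrast estimates, as reported above
s1, s2, rho = 0.0137, 0.0108, 0.901  # their standard errors and correlation
S = np.array([[s1 ** 2, rho * s1 * s2],
              [rho * s1 * s2, s2 ** 2]])   # estimated covariance of (d1, d2)

beta = S[0, 1] / S[1, 1]                       # projection coefficient, about 1.14
d1_proj = d1 - beta * d2                       # about -0.061
se_proj = np.sqrt(S[0, 0] - S[0, 1] ** 2 / S[1, 1])   # about 0.0059

print(beta, d1_proj, se_proj)
print("pr(beta1 > 0 | y) approx", np.exp(d1_proj))    # about 0.940
```

The coefficient printed here (about 1.14) differs slightly from the 1.136 above only because the inputs are the rounded values quoted in the text.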
However, the real applications that we have in mind are those in which a moderately large number of integrals is to be computed simultaneously, as in likelihood calculations for pedigree analysis, or computing the entire marginal posterior for β1 . It is then desirable to include several control variates and to compute the projection using equation (3.8). In the present example, the posterior probabilities pr.β1 b|y/ for six equally spaced values of b in .0, 0:25/ were computed simultaneously by this method using the corresponding six normal control variates. The six efficiency factors were 5.3, 6.2, 8.4, 10.5, 8.4 and 5.9. This is a rather small example with m = 10 data points in the regression, in which an appreciable discrepancy might be expected between the posterior and the normal approximation. Numerical investigations indicate that the efficiency factor tends to increase with larger m, as might be expected. The approximate variance formula mentioned at the end of Section 4.3 is exact in this case, and effectively exact when observations are simulated from both of the normal approximations. 5.2. Posterior integral for probit regression Consider a probit regression model in which the responses are independent Bernoulli variables with pr.yi = 1/ = Φ.xiT β/, where xi is a vector of covariates and β is the parameter. The aim of this exercise is to calculate the integral of the product L.β; y/ π.β/, in which L.β; y/ is the likelihood function and π.·/ is the prior, here taken to be Gaussian. This is done using a Markov chain constructed to have stationary distribution equal to the posterior. The chain is generated by the standard technique of Gibbs sampling. Following Albert and Chib (1993), the parameter space is augmented to include latent variables zi ∼ N.xiT β, 1/ such that yi is the sign of zi . A cycle of the Gibbs sampler is completed in two steps: β|y, z and z|y, β, which are respectively multivariate normal and a product of independent truncated normal distributions. The transition probability from θ = .β , z / to θ = .β, z/ of the generated Markov chain is P.dθ|θ / = c−1 .θ / p.z|y, β/ p.β|y, z / µ.dθ/, where p.z|y, β/ and p.β|y, z / are the full conditional densities. By construction, the normalizing constants c.θ/ are known to be equal to 1 for each θ. Let {.βt , zt /}nt=1 be the simulated values from the Markov chain. The required integral is estimated by the ratio of n L.βi / π.βi / .5:1/ L.β/ π.β/ dµ̂ = n i=1 p.βi |y, zj / j=1 to the known value c.θt / = 1. In effect, ĉ.θt / in equation (3.10) is replaced by the known value, which is then used to compute the approximate maximizing measure µ̂. The resulting estimator, which may be interpreted as importance sampling with the semideterministic mixture n−1 n j=1 p.·|y, zj / 600 A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae and Z. Tan as the sampler, is consistent but may not be fully efficient. For large n, n times the summand evaluated at any fixed β is a consistent estimate of the integral. Such an estimate was suggested by Chib (1995), who recommended choosing a value β Å of high posterior density, taken to be the mean value in the calculations below. The summand as a function of β also appears in calculations by Ritter and Tanner (1992) and Cui et al. (1992), but the purpose there is to monitor convergence of the Gibbs sampler, either with multiple parallel chains or a single long chain divided into batches. For numerical illustration and comparisons, we use Chib’s (1995) example, taken from a case-study by Brown (1980). 
The data were collected on 53 prostate cancer patients to predict the nodal involvement. There are five predictor variables and a binary response taking the value 1 if the lymph nodes were affected. For the results reported here, three covariates are used: the logarithm of the level of serum acid phosphatase, X-ray reading and stage of the tumour, so the model matrix X is of order 53 × 4 with a constant column. The prior is centred at (0.75, 0.75, 0.75, 0.75) with variance A = diag(5², 5², 5², 5²), as in Chib (1995). The Gibbs sampler was started at β = ÃXᵀy, where à = (A⁻¹ + XᵀX)⁻¹, and run for a total of N = n0 + n cycles with the first n0 discarded. The process was repeated 1000 times for several values of N. The mean and standard deviation of 1000 estimates of the logarithm of the integral are given in Table 1. The central processor unit (CPU) time in seconds was measured separately for Gibbs sampling and for subsequent integral evaluations.

For N = 500 + 5000, the results given here for Chib’s method are in close agreement with those reported by Chib (1995). All programming was done in C. Over the range studied, the statistical efficiency of the likelihood method over Chib’s estimator with n0 + n draws is not constant, but roughly n/12, i.e. increasing with n. For example, the efficiency factor at N = 500 + 5000 is estimated by (0.0211/0.00103)² = 420. To achieve the same accuracy as the likelihood method using 5000 draws, Chib’s method requires 420 × 5000 Gibbs draws, with CPU time 420 × 0.49 = 206 s for the draws alone. The final row in Table 1 gives the precision per CPU second, defined as the reciprocal of the product of the total time and the variance. Over the range studied, the new method achieves a precision per CPU second of roughly 8.5–9.5 times that of Chib’s method. For fixed n, the likelihood estimator is computationally more demanding, but the additional effort is worthwhile by a substantial factor.

Table 1. Numerical comparison of two integral estimators (log-scale)
Results for the following Gibbs sampler cycles (with Gibbs CPU seconds) and methods:

                            N = 50 + 500 (0.06 s)   N = 100 + 1000 (0.11 s)   N = 250 + 2500 (0.25 s)   N = 500 + 5000 (0.49 s)
                            Chib      Likelihood    Chib      Likelihood      Chib      Likelihood      Chib      Likelihood
Mean + 34                   −0.5693   −0.5796       −0.5588   −0.5661         −0.5542   −0.5569         −0.5510   −0.5534
Standard deviation           0.0652    0.00937       0.0475    0.00505         0.0299    0.00204         0.0211    0.00103
CPU seconds                 <0.01      0.28         <0.01      1.03           <0.01      5.87            0.01     21.94
Precision per CPU second     3921     33500          4029     34396            4474     39263            4492     42024

As it happens, this is one of those problems in which the likelihood estimator of µ converges at the standard rate, but the estimate (5.1) of the integral converges at rate n⁻¹. The bias and standard deviation are both O(n⁻¹): in this example they appear to be approximately equal in magnitude. By contrast, Chib’s estimator converges at the standard n^{−1/2} rate.

6. Retrospective formulation
It is possible to give a deceptively simple derivation of equation (3.5) by a retrospective argument as follows. Regardless of how the design was in fact selected, we may regard the sample size vector (n1, ..., nk) as the observed value of a multinomial random vector with index n and parameter vector (π1, ..., πk). This assumption is innocuous provided that (π1, ..., πk) are treated as free parameters to be estimated from the data. Evidently, π̂r = nr/n is the maximum likelihood estimate.
Mimicking the argument that is frequently employed in retrospective designs, we argue as follows. Given that the point x has been observed, what is the probability that this point was generated from distribution Pr rather than from one of the other distributions? A simple calculation using Bayes’s theorem shows that the required conditional probability vector is q .x/π =c q .x/π =c k k k 1 1 1 p.x/ = : ,: : :, qs .x/πs =cs qs .x/πs =cs s s These conditional probabilities depend only on the ratios πr =cr , and not otherwise on the baseline measure µ. Conditioning on x does not eliminate the base-line measure entirely, for cr is a linear function of µ. The conditional likelihood associated with the single observation .y, x/ is thus .πy =cy / qy .x/ , .πr =cr / qr .x/ r and the log-likelihood is r nr log.πr =cr / − n i=1 log r .πr =cr / qr .xi / : .6:1/ Once again, the observed count vector .n1 ,: : :, nk / is the complete sufficient statistic, and the association of y-values with x-values is not informative. Differentiation with respect to the parameter log.cr / gives n @l .πr =cr / qr .xi / @l = −nr + = cr : @{log.cr /} @cr .πs =cs / qs .xi / i=1 s By substituting the known value π̂r = nr =n and setting the derivative to 0, we obtain ĉr = n i=1 s qr .xi / , ns qs .xi /=ĉs .6:2/ which is identical to equation (3.5). That is to say, the retrospective argument, previously put forward by Geyer (1994), gives exactly the right point estimator of c. The astute reader will notice that, when we substitute nr =n for πr in the retrospective likelihood, the resulting function depends only on those cs for which nr > 0, and this restriction also 602 A. Kong, P. McCullagh, X.-L. Meng, D. Nicolae and Z. Tan applies to the maximum likelihood equation (6.2). The apparent equivalence of equations (6.2) and (3.5) is thus an illusion. By contrast with the model in Section 3, the retrospective argument does not lead to the conclusion that equation (6.2) is the maximum likelihood estimate of an integral for which qr .·/ takes negative values. Even if we are willing to overlook the remarks in the preceding paragraph and to assume that nr > 0 for each r, the objections are not easily evaded. The difficulty at this point is that the conditional likelihood is a function of the ratios φr = log.πr =cr /, so the vectors π and c are not separately estimable from the conditional likelihood. It is tempting, therefore, to substitute nr =n for πr , treating this as a known prior probability. After all, who can tell how the sample sizes were chosen? However plausible this argument may sound, the resulting ‘likelihood’ does not give the correct covariance matrix for log.ĉ/. The components of the negative logarithmic second-derivative matrix are − n n .π =c /.π =c / q .x / q .x / @2 l .πr =cr / qr .xi / r r s s r i s i δrs = − : @{log.cr /}@{log.cs /} i=1 nt qt .xi /=ct i=1 { nt qt .xi /=ct }2 t .6:3/ t At .π̂, ĉ/, the first term is equal to the diagonal matrix nr δrs . The second term is non-negative definite. To put this in an alternative matrix form, write pi for the conditional probability vector p.xi /. Then the negative second-derivative matrix shown above is Iφ = n T .diag{pi } − pi pT i / W − W P̂ P̂ W , i=1 using the matrix notation of Section 4.2. To see that the inverse of this matrix cannot be the correct asymptotic variance of log.ĉ/, consider the limiting case in which q1 = q2 = : : : = qk are all equal. It is then known with certainty that c1 = : : : = ck , even in the absence of data. 
But, in the second-derivative matrix shown above, all the conditional probability vectors pi are equal to .π1 , : : :, πk /. The second-derivative matrix is in fact the multinomial covariance matrix with index n and probability vector π. The generalized inverse of this matrix does give the correct asymptotic variances and covariances for contrasts of φ̂ = log.π̂=ĉ/, as the general theory requires. But it does not give the correct asymptotic variances for log.ĉ/, or contrasts thereof. This line of argument can be partly rescued, but to do so it is necessary to show that π̂ and ĉ are asymptotically independent. This is not obvious and will not be proved here, but it is a consequence of orthogonality of parameters in exponential family models. By standard properties of likelihoods, cov{log.π̂/ − log.ĉ/} = Iφ− asymptotically. On the presumption that π̂ and ĉ are asymptotically uncorrelated, we deduce that cov{log.ĉ/} = Iφ− − cov{log.π̂/} = Iφ− − diag.1=nπ/ + 11T =n: .6:4/ The term 11T=n does not contribute to the variance of contrasts of log.ĉ/ and can therefore be ignored. When evaluated at .ĉ, π̂/, the resulting expression.W −W P̂ T P̂ W/− − W −1 , involving no n × n matrices, is identical to equation (4.2), provided that each component nr of W is strictly positive. 7. Conclusions The key contribution of this paper is the formulation of Monte Carlo integration as a statistical model, making explicit what information is available for use and what information is ‘out of bounds’. Given that agreement is reached on the information available, it is now possible Statistical Models for Monte Carlo Integration 603 to say whether an estimator is or is not efficient. Likelihood methods are thus made available not only for parameter estimation but also for the estimation of variances and covariances for various simulation designs, of which importance sampling is the simplest special case. More interestingly, however, three classes of submodel are identified that have substantial potential for variance reduction. The associated operations are group averaging for group invariant submodels, linear projection for linear submodels or mixtures and Markov chain models for Markov chain Monte Carlo schemes. To achieve worthwhile gains in efficiency, it is necessary to exploit the structure of the problem, so it is not easy to give universally applicable advice. None-the-less, three simple examples show that efficiency factors in the range 5–10, and possibly larger, are routinely achievable in certain types of statistical computations. We believe that such factors are not exceptional, particularly for Bayesian posterior calculations. It would be remiss of us to overlook the peculiar dilemma for Bayesian computation that inevitably accompanies the methods described here. Our formulation of all Monte Carlo activities is given in terms of parametric statistical models and submodels. These are fully fledged statistical models in the sense of the definition given by McCullagh (2002), no more or no less artificial than any other statistical model. Given that formulation, it might seem natural to analyse the model by using modern Bayesian methods, beginning with a prior on Θ. If we adopt the orthodox interpretation of a prior distribution as the one that summarizes the extent of what is known about the parameter, we are led to the Dirac prior on the true measure, which is almost invariably Lebesgue measure. For once, the prior is not in dispute. 
This choice leads to the logically correct, but totally unsatisfactory, conclusion that the simulated data are uninformative. The posterior distribution on Θ is equal to the prior, which is unhelpful for computational purposes. It seems, therefore, that further progress calls for a certain degree of pretence or pragmatism by selecting a non-informative, or at least non-degenerate, prior on Θ. Given such a prior distribution, the posterior distribution on Θ can be obtained, and the posterior moments of the required integrals computed by standard formulae. Although these operations are straightforward in principle, the computations are rather forbidding, so much so that it would be impossible to complete the calculations without resorting to Monte Carlo methods! This computational black hole, an infinite regress of progressively more complicated models, is an unappealing prospect, to say the least. With this in mind, it is hard to avoid the conclusion that the old-fashioned maximum likelihood estimate has much to recommend it.

Acknowledgements

The comments of four referees on an earlier draft led to substantial improvements in presentation. We would like to thank Peter Bickel for pointing out the connection with biased sampling models, and Peter Donnelly for discussions on various points. Support for this research was provided in part by National Science Foundation grants DMS-0071726 (for McCullagh and Tan) and DMS-0072510 (for Meng and Nicolae).

References

Albert, J. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. J. Am. Statist. Ass., 88, 669–679.
Bishop, Y. M. M., Fienberg, S. E. and Holland, P. W. (1975) Discrete Multivariate Analysis: Theory and Practice. Cambridge: Massachusetts Institute of Technology Press.
Brown, B. W. (1980) Prediction analyses for binary data. In Biostatistics Casebook (eds R. J. Miller, B. Efron, B. W. Brown and L. E. Moses). New York: Wiley.
Chib, S. (1995) Marginal likelihood from the Gibbs output. J. Am. Statist. Ass., 90, 1313–1321.
Craiu, R. V. and Meng, X.-L. (2004) Multi-process parallel antithetic coupling for backward and forward Markov chain Monte Carlo. Ann. Statist., to be published.
Cui, L., Tanner, M. A., Sinha, D. and Hall, W. J. (1992) Monitoring convergence of the Gibbs sampler: further experience with the Gibbs stopper. Statist. Sci., 7, 483–486.
Deming, W. E. and Stephan, F. F. (1940) On a least-squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist., 11, 427–444.
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997) Computing Bayes factors by combining simulation and asymptotic approximations. J. Am. Statist. Ass., 92, 903–915.
Evans, M. and Swartz, T. (2000) Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford: Oxford University Press.
Firth, D. and Bennett, K. E. (1998) Robust models in probability sampling. J. R. Statist. Soc. B, 60, 3–21; discussion, 41–56.
Gelman, A. and Meng, X.-L. (1991) A note on bivariate distributions that are conditionally normal. Am. Statistn, 45, 125–126.
Gelman, A. and Meng, X.-L. (1998) Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statist. Sci., 13, 163–185.
Geyer, C. J. (1994) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report 568. School of Statistics, University of Minnesota, Minneapolis.
Gill, R., Vardi, Y. and Wellner, J. (1988) Large sample theory of empirical distributions in biased sampling models. Ann. Statist., 16, 1069–1112.
Glynn, P. W. and Szechtman, R. (2000) Some new perspectives on the method of control variates. In Monte Carlo and Quasi-Monte Carlo Methods (eds K.-T. Fang, F. J. Hickernell and H. Niederreiter), pp. 27–49. New York: Springer.
Hammersley, J. M. and Hanscomb, D. C. (1964) Monte Carlo Methods. London: Chapman and Hall.
Hesterberg, T. (1995) Weighted average importance sampling and defensive mixture distributions. Technometrics, 37, 185–194.
Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling without replacement from a finite universe. J. Am. Statist. Ass., 47, 663–683.
Lindsay, B. (1995) Mixture Models: Theory, Geometry and Applications. Hayward: Institute of Mathematical Statistics.
Liu, J. S. (2001) Monte Carlo Strategies in Scientific Computing. New York: Springer.
Liu, J. S. and Sabatti, C. (2000) Generalized Gibbs sampler and multigrid Monte Carlo for Bayesian computation. Biometrika, 87, 353–369.
MacEachern, S. N. and Peruggia, M. (2000) Importance link function estimation for Markov chain Monte Carlo methods. J. Comput. Graph. Statist., 9, 99–121.
Mallows, C. L. (1985) Discussion of 'Empirical distributions in selection bias models' by Y. Vardi. Ann. Statist., 13, 204–205.
McCullagh, P. (2002) What is a statistical model? (with discussion). Ann. Statist., 30, 1225–1310.
Meng, X.-L. and Wong, W. H. (1996) Simulating ratios of normalizing constants via a simple identity: a theoretical explanation. Statist. Sin., 6, 831–860.
Owen, A. and Zhou, Y. (2000) Safe and effective importance sampling. J. Am. Statist. Ass., 95, 135–143.
Pfeffermann, D., Skinner, C. J., Holmes, D. J., Goldstein, H. and Rasbash, J. (1998) Weighting for unequal selection probabilities in multilevel models (with discussion). J. R. Statist. Soc. B, 60, 23–40; discussion, 41–56.
Ripley, B. D. (1987) Stochastic Simulation. New York: Wiley.
Ritter, C. and Tanner, M. A. (1992) Facilitating the Gibbs sampler: the Gibbs stopper and the griddy-Gibbs sampler. J. Am. Statist. Ass., 87, 861–868.
Rothery, P. (1982) The use of control variates in the Monte Carlo estimation of power. Appl. Statist., 31, 125–129.
Vardi, Y. (1985) Empirical distributions in selection bias models. Ann. Statist., 13, 178–203.

Discussion on the paper by Kong, McCullagh, Meng, Nicolae and Tan

Michael Evans (University of Toronto)
This is an interesting and stimulating paper containing some useful clarifications. The most provocative part of the paper is the claim that treating the approximation of an integral

$$I = \int f(x)\, \mu(dx) \qquad (1)$$

as a problem of statistical inference gives practically meaningful results. Although the paper does effectively argue this, as my discussion indicates, I still retain some doubt about the necessity of adopting this point of view. In the paper we have a sequence of integrals

$$c_i = \int q_i(x)\, \mu(dx), \qquad i = 1, \ldots, r.$$

We want to approximate the ratios c_i/c_j based on samples of size n_i from the normalized densities specified by the q_i (n_i = 0 whenever q_i takes negative values and we assume at least one n_i > 0). If only one n_i is non-zero, say n_1, then the estimators are given by the importance sampling estimates

$$\frac{c_i}{c_j} = \sum_{k=1}^{n} \frac{q_i(x_k)}{w(x_k)} \bigg/ \sum_{k=1}^{n} \frac{q_j(x_k)}{w(x_k)}, \qquad (2)$$

where w = q_1/c_1. Note that c_1 need not be known to implement equation (2).
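To make estimator (2) concrete, the following is a minimal sketch; the unnormalized Gaussian kernels, sample sizes and function names are illustrative assumptions rather than anything taken from the paper or the discussion, and they are chosen so that every true ratio c_i/c_j equals 1 and the output can be checked by eye.

```python
import numpy as np

# Hypothetical illustration of estimator (2): all draws come from the single sampler
# q_1, and ratios c_i/c_j are estimated without ever computing c_1, which cancels.

def q(sigma):
    """Unnormalized kernel q_sigma(x) = exp{-(x - sigma)^2/2}; c_sigma = sqrt(2*pi) for every sigma."""
    return lambda x: np.exp(-0.5 * (x - sigma) ** 2)

def ratio_estimate(q_i, q_j, q_1, x):
    """Equation (2) with w = q_1 (unnormalized): estimate c_i/c_j from draws x ~ q_1/c_1."""
    w = q_1(x)
    return np.sum(q_i(x) / w) / np.sum(q_j(x) / w)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, 10_000)                  # independent draws from the normalized q_1
print(ratio_estimate(q(0.25), q(2.0), q(1.0), x))  # true value is 1
```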
As in the general problem of importance sampling, q_1 could be a bad sampler and require unrealistically large sample sizes to make these estimates accurate. In general, it is difficult to find samplers that can be guaranteed to be good in problems. If we have several n_i > 0, then the problem is to determine how we should combine these samples to estimate the ratios. Perhaps an obvious choice is the importance sampling estimate (2) where w is now the mixture

$$w(x) = \sum_{i=1}^{r} \frac{n_i}{n}\, \frac{q_i(x)}{c_i}. \qquad (3)$$

Of course there is no guarantee that this will be a good importance sampler but, more significantly, we do not know the c_i and so cannot implement this directly. Still, if we put equation (3) into equation (2), we obtain a system of equations in the unknown ratios and, as the paper points out, this system can have a unique solution for the ratios. These solutions are the estimates that are discussed in the paper. The paper justifies these estimates as maximum likelihood estimates and more importantly uses likelihood theory to obtain standard errors for the estimates. The above intuitive argument for the estimates does not seem to lead immediately to error estimates. One might suspect, however, that an argument based on the delta theorem should generate error estimates. Since I shall not provide such an argument, we must acknowledge the accomplishment of the paper in doing this. There are still some doubts, however, about the necessity for the statistical formulation and the likelihood arguments.

The group averaging that is discussed in the paper seems closely related to the use of this technique in Evans and Swartz (2000). There it is shown that, when G is a finite group of volume preserving symmetries of the importance sampler w, then we can replace the basic importance sampling estimate f(x)/w(x), with x ~ w, by f^G(x)/w(x) where

$$f^G(x) = \frac{1}{|G|} \sum_{g \in G} f(gx).$$

This is unbiased for integral (1) and satisfies

$$\mathrm{var}_w\bigg(\frac{f^G}{w}\bigg) = \mathrm{var}_w\bigg(\frac{f}{w}\bigg) - E_w\bigg\{\bigg(\frac{f - f^G}{w}\bigg)^{2}\bigg\}. \qquad (4)$$

This shows that the group-averaged estimator always has variance smaller than f/w. This is because f^G and w are more alike than f and w, as f^G and w now share the group of symmetries G. Some standard variance reduction methods, such as antithetic variates, can be seen to be examples of this approach. Although equation (4) seems to indicate that we always obtain an improvement with group averaging, a fairer comparison takes into account the additional function evaluations in f^G/w. In that case, it was shown in Evans and Swartz (2000) that f^G/w is a true variance reducer if and only if

$$\frac{|G| - 1}{|G|}\, \mathrm{var}_w\bigg(\frac{f}{w}\bigg) < E_w\bigg\{\bigg(\frac{f - f^G}{w}\bigg)^{2}\bigg\}.$$

Noting that f^G/w is the orthogonal projection of f/w onto the L²(w) space of functions invariant under G, we see that we have true variance reduction if and only if the residual (f − f^G)/w accounts for a proportion that is at least equal to (|G| − 1)/|G| of the total variation in f/w. This implies that the larger the group is the more stringent is the requirement for true variance reduction. Also, if this technique is to be effective, then f and w must be very different, at least with respect to the symmetries in G, i.e. w must be a poor importance sampler for the original problem. This leads to some reservations about the general utility of the method. The group averaging technique in the paper can be seen as a generalization of this as G is not required to be a group of symmetries of w.
Rather we symmetrize both the importance sampler and the integrand so that the basic estimator is f^G/w^G. It may be that w^G is a more effective importance sampler than w but still we can see that the above analysis will restrict the usefulness of the method.

Although I have expressed some reservations about some aspects of the paper, overall I found it to be thought provoking as it introduces a different way to approach the analysis of integration problems and this leads to some interesting results. I am happy to propose the thanks and congratulations to the authors.

Christian P. Robert (Centre de Recherche en Economie et Statistique and Université Dauphine, Paris)
Past contributions of the authors to the Monte Carlo literature, including the notion of bridge sampling, are noteworthy, and I therefore regret that this paper does not have a similar dimension for our field. Indeed, the 'theory of Monte Carlo integration' that is advertised in the title reduces to a formal justification of bridge sampling, via nonparametric maximum likelihood estimation, and the device of pretending to estimate the dominating measure allows in addition for an asymptotic approximation of the Monte Carlo error through the corresponding Fisher information matrix. Although this additional level of interpretation of importance and bridge sampling Monte Carlo approximations (rather than estimations) of integrals is welcome, and quite exciting as a formal exercise, it seems impossible to derive a working principle out of the paper.

A remark of interest related to the supposedly unknown measure is that groups can be seen as acting on the measure rather than on the distribution. This brings much more freedom (but also the embarrassment of wealth) in looking for group actions, since

$$\int_\Gamma q_s(x)\, d\lambda(x) = \int_{\Gamma/G} \int_G q_s(gx)\, d\nu(g)\, d\lambda(x) = |G| \int_{\Gamma/G} \bar q_s(x)\, d\lambda(x)$$

is satisfied for all groups G such that dλ(gx) = dλ(x). However, this representation also exposes the confusion that is central to the paper between variance reduction, which is obvious by a Rao–Blackwell argument, and Monte Carlo improvement, which is unclear since the computing cost is not taken into account. For instance, Fig. 1 analyses the example of Section 2 based on a single realization from P_1, reparameterized to [0, 1]², where larger groups acting on [0, 1]² bring better approximations of c_σ/c_1 = 1/σ². However, this improvement does not directly pertain to Monte Carlo integration, but rather to numerical integration (with the curse of dimensionality lurking in the background). Although the paper shows examples where using the group structure clearly brings substantial agreement, it does not shed much light on the comparison of different groups from a Monte Carlo point of view. The numerical nature of the improvement is also clear when we realize that, since the group action is on the measure rather than on the distribution, it is often unlikely to be related to the geometry of this distribution and thus can bring little improvement when q(gx_i) = 0 for most gs and generated x_i s. Fig. 2 shows the case of a Monte Carlo experiment on a bivariate normal distribution with the same groups as above: the concentration of the logit transform of the N{(−5, 10), σ²I_2} distribution on a small area of the [0, 1]² square (Fig. 2(a)) prevents good evaluations even after 4^10 computations (Fig. 2(b)).
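For completeness, the companion identity over the whole of Γ, which underlies the Rao–Blackwell point, can be checked directly, assuming only that G is finite and that λ is invariant under every g ∈ G:

$$\int_\Gamma \bar q_s(x)\, d\lambda(x) = \frac{1}{|G|} \sum_{g \in G} \int_\Gamma q_s(gx)\, d\lambda(x) = \frac{1}{|G|} \sum_{g \in G} \int_\Gamma q_s(u)\, d\lambda(g^{-1}u) = \int_\Gamma q_s(u)\, d\lambda(u),$$

so replacing each q_s by its group average q̄_s leaves every integral c_s, and hence every ratio c_s/c_t, unchanged.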
A second opening related to the likelihood representation of the weight evaluation is that a Fisher information matrix can be associated with this problem and thus variance-like figures are produced in the paper. Although these matrices stand as a formal ground for comparison of Monte Carlo strategies, I have difficulties with these figures given that they do not necessarily provide a good approximation to the Monte Carlo errors. See for instance the example of Section 2: we could apply Slutsky's lemma with the transform c_σ = exp{log(c_σ)} to obtain the variance of the ĉ_σ s as diag(c_σ) V diag(c_σ), but this approximation is invalidated by the absence of variance of the ĉ_σ s. (See also the normal range confidence lower bound on Fig. 2(b) which is completely unrelated to the range of the estimates.) Given that importance sampling estimators are quite prone to suffer from infinite variance (Robert and Casella (1999), section 3.3.2), the appeal of using Fisher information matrices is somewhat spurious as it gives a false confidence in estimators that should not be used.

[Fig. 1 appears here: nine panels on [0, 1]², with titles Estimate 1.44; n = 1, Estimate 2.026; n = 2, Estimate 2.3523; n = 3, Estimate 2.2884; n = 4, Estimate 2.3908; n = 5, Estimate 2.4372; n = 6, Estimate 2.407; n = 7, Estimate 2.3981; n = 8, Estimate 2.3863.]

Fig. 1. Successive actions of the groups G_n on the simulated value (x_1, x_2) ~ P_1 (top left-hand corner), where G_n operates on (z_1, z_2) = (exp(x_1)/{1 + exp(x_1)}, x_2/(1 + x_2)) ∈ [0, 1]² by changing some of the first n bits in the binary representation of the decimal part of z_1 and z_2: the estimates of 1/σ² = 2.38 are given above each graph; the size of the orbit of G_n is 4^n

The extension to Markov chain Monte Carlo settings is another interesting point in the paper, in that estimator (3.10) reduces to

$$\hat c_r = \sum_{t=1}^{n} \frac{q_r(x_t)}{\sum_{j=1}^{n} q_1(x_t \mid x_j)}$$

if we use a single chain based on a transition q_1 (with known constant). Although this estimator only seems to apply to the Gibbs sampler, given that the general Metropolis–Hastings algorithm has no density against the Lebesgue measure, it is a special case of Rao–Blackwellization (Gelfand and Smith, 1990; Casella and Robert, 1996) applied to importance sampling. As noted in Robert (1995), the naïve importance sampling alternative

$$\tilde c_r = \frac{1}{n} \sum_{t=1}^{n} q_r(x_t)/q_1(x_t \mid x_{t-1}),$$

where the 'true' distribution of x_t is used instead, is a poor choice, since it most often suffers from infinite variance. (The notation in Section 3.6 is mildly confusing in that c_r(x) is unrelated to c_r and is also most often known, in contrast with c_r. It thus seems inefficient to estimate the c_r(x_t)s.)

[Fig. 2 appears here: panel (a) has axes exp(x_1)/{1 + exp(x_1)} and exp(x_2)/{1 + exp(x_2)}; panel (b) has axis log_4(n); the caption is given below.]
Fig. 2. (a) Contour plot of the density of the logit transform of the N{(5, 10), σ²I_2} distribution and (b) summary of the successive evaluations of c_σ/c_1 = σ² = 1.5 by averages over the groups G_n for the logit transform of a single simulated value (x_1, x_2) ~ N{(5, 10), σ²I_2}: average over 100 replications; range of the evaluations; 2-standard-deviations upper bound

Except for the misplaced statements of the last paragraph about 'orthodox Bayesians', which simply bring to light the fundamental difference between designed Monte Carlo experiments and statistical inference, I enjoyed working on the authors' paper and thus unreservedly second the vote of thanks!

The vote of thanks was passed by acclamation.

A. C. Davison (Swiss Federal Institute of Technology, Lausanne)
The choice of the group is a key aspect of the variance reduction by group averaging that is suggested in the paper, and I would like to ask its authors whether they have any advice for the bootstrapper. The simplest form of nonparametric bootstrap entails equal probability sampling with replacement from data y_1, ..., y_n, which is used to approximate quantities such as

$$m = n^{-n} \sum t(y^*_1, \ldots, y^*_n),$$

where the sum is over the n^n possible resamples y^*_1, ..., y^*_n from the original data. If the statistic under consideration is symmetric in the y_j then the number of summands may be reduced to $\binom{2n-1}{n}$ or fewer, but it is usually so large that m is approximated by taking a random subset of the resamples. The most natural group in this context arises from permuting y_1, ..., y_n and has size n!, which is far too large for use in most situations. One could readily construct smaller groups, obtained for example by splitting the data into blocks of size k and permuting within the blocks, but they seem somewhat arbitrary. How should the group be chosen in this context, where the cost of function evaluations vastly outweighs the cost of sampling?

Wenxin Jiang and Martin A. Tanner (Northwestern University, Evanston)
One way to rederive equation (2.1) is to regard it as importance sampling using a sample x_1^n with a 'base density' of the mixture

$$f(x) = \sum_{s} \frac{n_s}{n}\, \frac{q_s(x)}{c_s}.$$

The estimator of c_σ = ∫_Γ q_σ(x) dµ would be n^{-1} Σ_{i=1}^n q_σ(x_i)/f(x_i). Replacing the unknown constant c_s in the base density by ĉ_s then results in a self-consistency equation that is identical to equation (2.1). (Similarly, with group averaging the functions q_σ(x) are essentially replaced by q̄(x; σ) and the integrals c_σ = ∫_Γ q̄(x; σ) dρ are estimated by using an importance sample x_1^n having a base density Σ_s (n_s/n) q̄(x; s)/c_s.) The goal here is to achieve a small 'average asymptotic variance'

$$\mathrm{AAVAR} = \int_\Xi \int_\Xi \mathrm{var}\{\log(\hat c_\sigma/\hat c_\tau)\}\, d\pi(\sigma)\, d\pi(\tau).$$

(The paper actually uses (25/20) AAVAR as the criterion function, averaging over 10 different pairs of σ, τ ∈ Ξ = {0.25, 0.5, 1, 2, 4}.) What is the optimal base density f_o and the corresponding optimal AAVAR(f_o)? The following is a result that is obtained by applying variational calculus.

Proposition 1 (average optimality). Let ĉ_σ(f) = n^{-1} Σ_{i=1}^n q_σ(x_i)/f(x_i) be an estimator of c_σ = ∫_Γ q_σ(x) dµ, where x_1^n is an independent and identically distributed sample from a probability measure f(x) dµ. Let

$$\mathrm{AAVAR}(f) = \int_\Xi \int_\Xi \mathrm{var}\bigg[\log\bigg\{\frac{\hat c_\sigma(f)}{\hat c_\tau(f)}\bigg\}\bigg]\, d\pi(\sigma)\, d\pi(\tau)$$

for some prechosen averaging scheme defined by measure π over a set of parameters Ξ.
Then

$$\mathrm{AAVAR}(f_o) = \inf_f\{\mathrm{AAVAR}(f)\} = n^{-1} \bigg\{\int_\Gamma q_o(x)\, d\mu(x)\bigg\}^{2},$$

where

$$f_o(x) \propto q_o(x) = \bigg[\int_\Xi \int_\Xi \bigg\{\frac{q_\sigma(x)}{c_\sigma} - \frac{q_\tau(x)}{c_\tau}\bigg\}^{2} d\pi(\sigma)\, d\pi(\tau)\bigg]^{1/2}.$$

For the uniform–average measure π on the space Ξ = {0.25, 0.5, 1, 2, 4}, using q_σ(x) = {x_1² + (x_2 + σ)²}^{-2} and Γ = R × R⁺ gives (25/20) AAVAR(f_o) = 1.2/n, with the required integral computed numerically. The paper uses the base densities f_1(x) = q_1(x)/c_1 and f_2(x) = Σ_s (1/5) q_s(x)/c_s, achieving (25/20) AAVAR(f_{1,2}) = 3.6/n and 3.0/n, with efficiency factors AAVAR(f_{1,2})^{-1}/AAVAR(f_o)^{-1} being 33% and 40% respectively, compared with the optimal f_o. It is noted that the optimal AAVAR(f_o) will be smaller if the densities {q_σ(·)/c_σ}_{σ∈Ξ} are more similar. A parallel computation can be done with group averaging. Then q_σ/c_σ is replaced by the averaged density p̄_σ = ½(q_{1/σ}/c_{1/σ} + q_σ/c_σ) when computing (25/20) AAVAR(f_o), which results in a much smaller value 0.20/n since the averaged densities are more similar to each other. With group averaging, the choice of the base density in the paper is essentially Σ_s (1/5) p̄_s(x) ≡ f_3, with (25/20) AAVAR(f_3) = 0.37/n and an efficiency factor AAVAR(f_3)^{-1}/AAVAR(f_o)^{-1} = 54%. The paper did not use the exact base densities f_{1,2,3}, but rather those using the estimated normalizing constant ĉ_s, which may have added to the variance. Nevertheless, we see that the use of group averaging seems more effective for reducing AAVAR than the use of any possible importance samples or design weights.

Hani Doss (Ohio State University, Columbus)
An interesting application of the ideas in the paper arises in some current joint work with B. Narasimhan. In very general terms, our set-up is as follows. As in the usual Bayesian paradigm, we have a data vector D, distributed according to some probability distribution P_φ, and we have a prior ν_h on φ. This prior is indexed by some hyperparameter h ∈ H, a subset of Euclidean space, and we are interested in ν_{h,D}, the posterior distribution of φ given the data, as h ranges over H, for doing sensitivity analyses. In our set-up, we select k values h_1, ..., h_k ∈ H, and we assume that, for each r = 1, ..., k, we have a sample φ_1^{(r)}, ..., φ_{n_r}^{(r)} from ν_{h_r,D}. We have in mind situations in which φ_1^{(r)}, ..., φ_{n_r}^{(r)} are generated by Markov chain Monte Carlo sampling, and in which the amount of time that it takes to generate these is non-negligible. We wish to use the k samples to estimate ν_{h,D}, and we need to do this in real time. (Markov chain Monte Carlo sampling gives rise to dependent samples, but to keep this discussion focused we ignore this difficulty entirely.) In standard parametric models, the posterior is proportional to the likelihood times the prior density, i.e. ν_{h,D}(dφ) = c_h^{-1} l_D(φ) ν_h(φ) µ(dφ), where c_h = c_h(D) is the marginal density of the data when the prior is ν_h, l_D(φ) is the likelihood function and µ is a common carrier measure for all the priors. In the context of this paper, we have an infinite family of probability distributions ν_{h,D}, h ∈ H, each known except for a normalizing constant, and we have a sample from a finite number of these. The q_h s may be taken to be l_D(φ) ν_h(φ). The issue that for many models l_D(φ) is too complicated to be 'known' is irrelevant, since all formulae require only ratios of qs, in which l_D cancels.
Equation (3.4) gives the maximum likelihood estimate of µ, and from this we estimate ν_{h,D}(dφ) directly via

$$\hat\nu_{h,D}(d\phi) = \frac{1}{\hat c_h}\, q_h(\phi)\, \hat\mu(d\phi) = \frac{1}{\hat c_h} \sum_{r=1}^{k} \sum_{i=1}^{n_r} \frac{q_h(\phi_i^{(r)})}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}}\, \delta_{\phi_i^{(r)}}(d\phi)$$

(here δ_a is the point mass at a), i.e. ν̂_{h,D} is the probability measure which gives mass

$$w_h(i, r) = \frac{q_h(\phi_i^{(r)})/\hat c_h}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}} = \bigg\{\sum_{s=1}^{k} n_s\, \frac{\nu_{h_s}(\phi_i^{(r)})\, \hat c_h}{\nu_h(\phi_i^{(r)})\, \hat c_{h_s}}\bigg\}^{-1}$$

to φ_i^{(r)}. A potentially time-consuming computation needs to be used to solve the system of equations (3.5); however, as noted by the authors, only the system involving c_{h_1}, ..., c_{h_k} needs to be solved, since, once these have been obtained, for all other hs the calculation of ĉ_h is immediate. The weights are simply the sequence

$$\bigg\{\sum_{s=1}^{k} n_s\, \frac{\nu_{h_s}(\phi_i^{(r)})}{\nu_h(\phi_i^{(r)})\, \hat c_{h_s}}\bigg\}^{-1}$$

normalized to sum to 1. These weights can usually be calculated very rapidly. This is exactly the estimator that was used in Doss and Narasimhan (1998, 1999) and obtained by other methods (Geyer, 1994). An expectation ∫ f(φ) ν̂_{h,D}(dφ) is estimated via

$$\sum_{r=1}^{k} \sum_{i=1}^{n_r} f(\phi_i^{(r)})\, w_h(i, r). \qquad (5)$$

It is very useful to note that the development in Section 4 enables estimation of the variance of expression (5). We add the function f(φ) q_h(φ) to our system. Letting c_h^f = ∫ f(φ) q_h(φ) µ(dφ), we see that expression (5) is simply ĉ_h^f/ĉ_h. If we take P̃ to be the matrix P̂ in Section 4.2 of the paper, augmented with the two columns

$$\bigg\{\frac{q_h(\phi_i^{(r)})/\hat c_h}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}}\bigg\}_{\text{all } i,\ \text{all } r}, \qquad \bigg\{\frac{f(\phi_i^{(r)})\, q_h(\phi_i^{(r)})/\hat c_h^{\,f}}{\sum_{s=1}^{k} n_s\, q_{h_s}(\phi_i^{(r)})/\hat c_{h_s}}\bigg\}_{\text{all } i,\ \text{all } r}$$

(so P̃ is an n × (k + 2) matrix), then equation (4.2) gives an estimate of the variance of ĉ_h^f/ĉ_h.

J. Qin (Memorial Sloan–Kettering Cancer Center, New York) and K. Fokianos (University of Cyprus, Nicosia)
We congratulate the authors for their excellent contribution which brings together Monte Carlo integration theory and biased sampling models. Our discussion is restricted to the following simple observation which might be utilized in connection with the theory that is developed in the paper for attacking a larger class of problems. Suppose that X_1, ..., X_{n_1} are independent random variables from a density function p_θ(x) which can be written as p_θ(x) = q_θ(x)/c(θ), where q_θ(x) is a given non-negative function, θ is an unknown parameter and c(θ) = ∫ q_θ(x) dx is the normalizing constant. Assume that the integration has no closed form and let p_0(x) be a density function from which simulated data Z_1, ..., Z_{n_2} are easily drawn. Then it follows that p_θ(x) can be rewritten as

$$p_\theta(x) = \exp(\alpha + [\log\{q_\theta(x)\} - \log\{p_0(x)\}])\, p_0(x) = \exp\{\alpha + \phi(x, \theta)\}\, p_0(x),$$

where φ(x, θ) = log{q_θ(x)} − log{p_0(x)} and α = −log[∫ exp{φ(x, θ)} p_0(x) dx] = −log{c(θ)}, assuming that both of the densities are supported on the same set which is independent of θ. Hence the initial model is transformed to a two-sample semiparametric problem where the ratio of the two densities p_θ(x) and p_0(x) is known up to a parameter vector (α, θ). In addition the sample size n_1 of the Xs is fixed as opposed to n_2, the sample size of the simulated data, which can be made arbitrarily large. Let n = n_1 + n_2 and denote the pooled sample (T_1, ..., T_n) = (X_1, ..., X_{n_1}; Z_1, ..., Z_{n_2}). Following the results of Anderson (1979) and Qin (1998), inferences for the vector (α, θ) can be based on the profile log-likelihood

$$l(\alpha, \theta) = \sum_{i=1}^{n_1} \{\alpha + \phi(x_i, \theta)\} - \sum_{i=1}^{n} \log[1 + \rho\, \exp\{\alpha + \phi(t_i, \theta)\}],$$

where ρ = n_1/n_2.
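As a concrete illustration of maximizing this profile log-likelihood, the following is a minimal sketch; the Gaussian kernel q_θ, the choice p_0 = N(0, 1), the sample sizes and all function names are illustrative assumptions rather than anything prescribed by Qin and Fokianos. In this toy case the true values are θ = 1 and α = −log c(θ) = −½ log(2π).

```python
import numpy as np
from scipy.optimize import minimize

# Toy two-sample setting: q_theta(x) = exp{-(x - theta)^2/2}, so c(theta) = sqrt(2*pi)
# and p_theta = N(theta, 1); the reference density p_0 is N(0, 1).
rng = np.random.default_rng(1)
theta_true = 1.0
n1, n2 = 500, 2000
x = rng.normal(theta_true, 1.0, n1)      # observations from p_theta
z = rng.normal(0.0, 1.0, n2)             # simulated draws from p_0
t = np.concatenate([x, z])               # pooled sample
rho = n1 / n2

def phi(u, theta):
    # phi(u, theta) = log q_theta(u) - log p_0(u) for this toy choice
    return -0.5 * (u - theta) ** 2 + 0.5 * u ** 2 + 0.5 * np.log(2 * np.pi)

def neg_profile_loglik(par):
    alpha, theta = par
    term1 = np.sum(alpha + phi(x, theta))
    term2 = np.sum(np.log1p(rho * np.exp(alpha + phi(t, theta))))
    return -(term1 - term2)

alpha_hat, theta_hat = minimize(neg_profile_loglik, x0=np.array([0.0, 0.0])).x
# theta_hat should be close to 1; -alpha_hat estimates log c(theta) = 0.5*log(2*pi) ≈ 0.919
```

When φ(x, θ) is linear in the unknown parameters, this is the familiar logistic-regression device for density ratio estimation discussed by Anderson (1979) and Qin (1998), with log ρ entering as an offset.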
The parameter vector (α, θ) can be estimated by maximizing the profile log-likelihood l(α, θ) with respect to α and θ, and the likelihood ratio test regarding θ has an asymptotic χ²-distribution under mild conditions (Qin, 1999). The idea can be generalized further by considering m samples along the lines of Fokianos et al. (2001).

Steven N. MacEachern, Mario Peruggia and Subharup Guha (Ohio State University, Columbus)
The authors construct an interesting framework for Monte Carlo integration. The perspective that they provide will improve simulation-based estimation. The example of Section 5.2 makes a clear case for the benefits of subsampling the output of a Markov chain.

To evaluate the efficiency of an estimation procedure, we must consider the cost of creating the estimator along with its precision. Here, the cost of generating the sample is O(n), which we approximate by c_1 n. The additional time that is required to compute Chib's estimator is linear in n with an ignorable constant that is considerably smaller than c_1. The precision of his estimator is O(n), say p_1 n, resulting in an asymptotic effective precision p_1/c_1. The new estimator relies on the same sample of size n. The cost to compute the estimator is O(n²), with leading term c_2 n². The new estimator has precision O(n²), with leading term p_2 n². The asymptotic effective precision of the estimator is lim_{n→∞} {p_2 n²/(c_2 n² + c_1 n)} = p_2/c_2. The authors report simulations that enable us to compare the two estimators, leading to the conclusion that p_2 c_1/p_1 c_2 is roughly 8–10. This gain in asymptotic effective precision suggests using the new estimator.

Consider a 1-in-k systematic subsample of the Markov chain of length n (total sample length nk). The subsampled version of equation (5.1) has a cost of c_1 nk + c_2 n². The precision of the estimator is O(n²) with leading term p_{2k} n². Its asymptotic effective precision is p_{2k}/c_2. This example, with positive dependence between successive parameter vectors, is typical of Gibbs samplers. Positive dependence suggests p_{2k} > p_2, and so it is preferable to subsample the output of the Markov chain. This benefit for subsampling should hold whenever the cost of computing the estimator is worse than O(n). We replicated the authors' simulation (500 + 5000 case). A 1-in-10 subsample reduced the standard deviation of integral (5.1) by 10%. We also drew the vector β as four successive univariate normals. For this sampler with stronger, but still modest, serial dependence, a 1-in-10 subsample reduced the standard deviation by 43% compared with the unsubsampled estimator. In recent work (Guha and MacEachern, 2002; Guha et al., 2002), we have developed bench-mark estimation, a technique that allows us to base our estimator primarily on the subsample while incorporating key information from the full sample. Bench-mark estimators, compatible with importance sampling and group averaging, can be far more accurate than subsampled estimators. As this example shows, subsampled estimators can be far more accurate than full sample estimators.

The following contributions were received in writing after the meeting.

Siddhartha Chib (Washington University, St Louis)
I view this paper as a contribution on ways to reduce the variability of estimates derived from Monte Carlo simulations.
For obvious reasons, I was drawn to the discussion in Section 5.2 where the authors apply their techniques to the problem of estimating the marginal likelihood of a binary response probit model, given Markov chain Monte Carlo draws from its posterior density sampled by the approach of Albert and Chib (1993). Their estimator of the marginal likelihood is closely related to that from the Chib approach. In particular, focusing on expression (5.1), from Chib (1995)

$$\frac{L(\beta_i)\, \pi(\beta_i)}{n^{-1} \sum_{j=1}^{n} \pi(\beta_i \mid y, z_j)}$$

is an estimate of the marginal likelihood for any β_i and the numerical efficiency of this estimate is high, for a given n, when β_i is a high posterior density point. In contrast, expression (5.1) dictates that we compute the Chib estimate at each sampled β_i and then average those estimates. Not surprisingly, by averaging, and thereby incurring a bigger computational load, we obtain a theoretically more efficient estimate. Interestingly, however, the gains documented in Table 1 are not substantial: with n = 5000, the averaging affects the mean only in the third decimal place. After adding −34 to each mean and observing that the numerical standard error of the Chib estimate is 0.02, it is clear that the reduction in the numerical standard error achieved by averaging will have no practical consequences for model choice in this problem.

The authors do not discuss the application of their method to models that are more complex than the binary probit. There are, for example, many situations where the likelihood is difficult to calculate, and the posterior ordinate is not available in as simple a form as in the binary probit. In such cases, the key features of the Chib method, namely that the likelihood and posterior ordinate must be evaluated just once, that tractable and efficient ways of estimating each can be separately identified and that this is sufficient for producing reliable and accurate estimates, have proved invaluable. Illustrations abound, e.g. multivariate probit models, longitudinal count data models, multivariate stochastic volatility models, non-linear stochastic diffusion models and semiparametric models constructed under the Dirichlet process framework. If the general approach described in this paper also amounts to averaging the Chib estimate in more complex models then it is clear that the computational burden of the suggested averaging will be prohibitive in many cases and, therefore, the theoretical efficiency gain, however small or large, will be practically infeasible to realize. Coupled with the virtual equality of the estimates in a case where averaging is feasible, it is arguable whether the additional computing effort will be practical or justifiable in more complex problems.

Ya'acov Ritov (Hebrew University of Jerusalem)
The paper is an interesting bridge between the semiparametric theory of stratified biased samples and Monte Carlo integration. In this comment, I concentrate on the theoretical aspects of the problem, leaving the numerical analysis side to others. The asymptotics of the biased sample model have been discussed previously, e.g. Gill et al. (1988), Pollard (1990) and Bickel and Ritov (1991). Bickel et al. (1993) discussed the information bounds and efficiency considerations for a pot-pourri of models under the umbrella of biased sampling. These include, in particular, stratified sampling, discussed already in Gill et al. (1988), with or without known total stratum probability.
Case–control studies are the important application of the known probabilities model. Contrary to the last statement of Section 3.1, the paper deals essentially only with applications in which the 'unknown' measure µ is actually known, at least up to its total mass. This covers most applications of Markov chain Monte Carlo sampling, as well as integration with respect to the Gaussian or Lebesgue measures. It is possible to think of other problems, but this is not the main emphasis of the text or of the examples. The statistician can therefore considerably improve the efficiency of the estimator by using the known values of different functionals such as moments and probabilities of different sets. The algorithm becomes increasingly efficient as the number of functionals becomes larger. The result, however, is an extremely complicated algorithm, which is not necessarily faster. For example, if integration according to a Gaussian measure on the line is considered, we can use the fact that all quantiles are known. In essence, we have stratified sampling. However, we can increase efficiency by using the known mean, variance and higher moments of the distribution. A similar consideration applies to the group models discussed by the authors. The more symmetry (i.e. group structure) we use, the more efficient the estimator. Practical considerations may prevent us from using all possible symmetries, and the actual level of symmetry used amounts to the trade-off between algorithmic complication and statistical efficiency. The latter is, of course, quite different from algorithmic efficiency.

I miss an example in the paper in which the stratified biased sample model could be used in full. There is no reason to use, for example, only a single Gaussian measure. For example, a variance slightly smaller than intended improves the efficiency of the evaluation of the integral in the centre, whereas a variance slightly larger than intended improves the evaluation in the tail. Practical guidance for the design problem could be helpful. One practical solution would be the use of an adaptive design where different designs are evaluated by the central processor unit time needed for a fixed reduction in estimation error, and the design actually used is selected accordingly.

James M. Robins (Harvard School of Public Health, Boston)
I congratulate the authors on a beautiful and highly original paper. In Robins (2003), I describe a substantively realistic model which demonstrates that successful Monte Carlo estimation of functionals of a high dimensional integral can sometimes require not only that (a), as emphasized by the authors, the simulator hides knowledge from the analyst, but also that (b) the analyst must violate the likelihood principle and eschew semiparametric, nonparametric or fully parametric maximum likelihood estimation in favour of non-likelihood-based locally efficient semiparametric estimators. Here, I demonstrate point (b) with a model that is closer to the authors' but is less realistic.

Let L encode base-line variables, A with support {0, 1} encode treatment and X encode responses, where L and X have many continuous components. Let θ = (l, a), let q_θ(x) be a positive utility function and let the measure u_θ depend on θ so that u_{(l,a)} is the conditional probability measure for X given (L, A) = (l, a). Then, among subjects with L = l, c(l, a) = c(θ) = c_θ = ∫ q_θ(x) du_θ is the expected utility associated with treatment a and d_op(l) = arg max_{a∈{0,1}} {c(l, a)/c(l, 0)} is the optimal treatment strategy.
Suppose that the simulated data are (Θ_i, X_i), i = 1, ..., n, with Θ_i = (L_i, A_i) sampled independently and identically distributed under the random design f(a, l) = f(a|l) f(l) and X_i now biasedly sampled from c_{Θ_i}^{-1} q_{Θ_i}(x) u_{Θ_i}(dx). Suppose that the information available to the analyst is f*(a|l) and q*_θ(x), and that c(l, a)/c(l, 0) = exp{−γ*(l, a, ψ†)} where ψ† ∈ R^k is an unknown parameter that determines d_op(l), γ*(l, a, ψ) is a known function satisfying γ*(l, 0, ψ) = γ*(l, a, 0) = 0 and * indicates a quantity that is known to the analyst. Then the likelihood is L = L_1 L_2 with

$$L_1 = \prod_{i=1}^{n} \frac{f(L_i)\, q^*_{\Theta_i}(X_i)\, u_{\Theta_i}(dX_i)}{\exp\{-\gamma^*(L_i, A_i, \psi)\}\, c(L_i, 0)}, \qquad L_2 = \prod_{i=1}^{n} f^*(A_i \mid L_i). \qquad (6)$$

Then with Y = 1/q*_{(L,A)}(X) and b(L) = 1/c(L, 0) the analyst's model is precisely the semiparametric regression model characterized by f*(A|L) known and E(Y | L, A) = exp{γ*(L, A, ψ†)} b(L) with ψ† and b(L) unknown. In particular, the analyst does not know whether b(L) is smooth in L. In this model, as discussed by Ritov and Bickel (1990), Robins and Ritov (1997), Robins and Wasserman (2000) and Robins et al. (2000), because of the factorization of the likelihood and lack of smoothness, (a) any estimator of ψ†, such as a maximum likelihood estimator, that obeys the likelihood principle must ignore the known design probabilities f*(a|l), (b) estimators that ignore these probabilities cannot converge at rate n^α over the set of possible pairs (f*(a|l), c(l, 0)) for any α > 0 but (c) semiparametric estimators that depend on the design probabilities can be n^{1/2} consistent. An example of an n^{1/2} consistent estimator is ψ̂(g, h) solving

$$0 = \sum_{i=1}^{n} [Y_i \exp\{-\gamma^*(L_i, A_i, \psi)\} - h(L_i)]\, g(L_i)\, \{A_i - \mathrm{pr}^*(A = 1 \mid L_i)\}$$

for user-supplied functions h(l) ∈ R¹ and g(l) ∈ R^k. Define h_opt(L) = b(L) and g_opt(L) = v(1, L) − v(0, L), where

$$v(A, L) = W(\psi^\dagger)\big[R(\psi^\dagger) - E\{W(\psi^\dagger) \mid L\}^{-1} E\{W(\psi^\dagger) R(\psi^\dagger) \mid L\}\big],$$

R(ψ†) = b(L) ∂γ*(L, A, ψ†)/∂ψ and W(ψ†) = exp{2γ*(L, A, ψ†)} var(Y | A, L)^{-1}. Let (ĝ_opt, ĥ_opt) be estimates of (g_opt, h_opt) obtained under 'working' parametric submodels for b(L) and var(Y | A, L). Then ψ̂(ĝ_opt, ĥ_opt) is locally efficient, i.e. it attains the semiparametric variance bound if the submodels are correct and remains consistent asymptotically normal under their misspecification (Chamberlain, 1990; Newey, 1990; Robins et al., 2000).

Yehuda Vardi (Rutgers University, Piscataway)
The paper is a significant bridge between nonparametric maximum likelihood estimation under biased sampling and Monte Carlo integration. Estimation under biased sampling is a fairly mature methodology, including asymptotic results, and the authors have overlooked considerable published knowledge that is relevant to their work. My comments mainly attempt to narrow this gap.

(a) The problem formulation in Section 3 is equivalent to Vardi (1985) with the qs, cs and µ here being the ws, Ws and F in Vardi (1985). The important special case of two samples (specifically q_θ(x) = x^θ, θ = 0, 1) has been treated even earlier (Vardi, 1982), including the nonparametric maximum likelihood estimator and its asymptotic behaviour.
(b) The algorithm proposed in Vardi (1982, 1985) for solving equation (3.5) of this paper is different from that of Deming and Stephan (1940) and is often more efficient (see Bickel et al. (1993), pages 340–343, and Pollard (1990), page 83). This is important in computationally intensive applications where the Deming–Stephan algorithm is slow to converge.
(c) Regarding the asymptotics, for two samples, equation (4.2) should match theorem 3.1 of Vardi (1982) and it would be constructive to show this. For more than two samples, equation (4.2) should match proposition 2.3 in Gill et al. (1988) and similar results in Bickel et al. (1993).
(d) More on the asymptotics: 'lifting' the limiting behaviour of the estimated c_r s from standard properties of the exponential family, although a useful shortcut, seems mathematically heuristic. A good discussion of the interplay between the parametric and nonparametric aspects of the model is in chapter 14 of Pollard (1990).
(e) The methodology calls for generating samples from weighted distributions, but this is often a difficult problem with no satisfactory solution, so there is a computational trade-off here. Note problem 6.1 of Gill et al. (1988), which also connects biased sampling methodology with design issues and importance sampling.
(f) Gilbert et al. (1999) extended the model to include weight functions q_r s which share a common unknown parameter. This might be useful in future applications of your methodology.
(g) You allow negative q_r s but assume zero observations for such samples (n_r = 0 in the discussion following equation (3.1)). This is confusing. A simple example with negative weight function(s) and an explanation of how to estimate the ratio of integrals in this case would help.
(h) Woodroofe (1985) showed that the nonparametric maximum likelihood estimator from truncated data is consistent only under certain conditions on the probabilities that generate the data. Woodroofe's problem has a similar data structure to your Markov chain models of Section 3.6, where each targeted distribution has a sample of size 1 or less. This leads me to think that further conditions might be necessary to achieve consistency.

The authors replied later, in writing, as follows.

General remarks
The view presented in the paper is that essentially every Monte Carlo activity may be interpreted as parameter estimation by maximum likelihood in a statistical model. We do not claim that this point of view is necessary; nor do we seek to establish a working principle from it. By analogy, the geometric interpretation of linear models is not necessary for fitting or understanding; nor is there a geometric working principle. The fact that the estimators are obtained by maximum likelihood does not mean that they cannot be justified or derived by heuristic arguments, even by flawed arguments. The crux of the matter is not necessity or working principle, but whether the new interpretation is helpful or useful in the sense of leading to new understanding, better simulation designs or more efficient algorithms. Jiang and Tanner's optimal sampler, although not readily computable and not from the mixture class, is a step in this direction. The remark by MacEachern, Peruggia and Guha is another step showing that, for slow mixing Markov chains, the numerical efficiency of superefficient estimators, such as those described in Section 5.2, may be improved by subsampling.

Several discussants (Evans, Robert, Davison, ...) raise questions about group averaging, its effectiveness, how to choose the group, and so on.
As Robert correctly points out, having the group act on the parameter space rather than on the sampler provides great latitude for choice of group actions. Group averaging can be very effective with a small group or very ineffective with a large group, the latter being counter-productive. The group actions that we have in mind involve very small groups, so the additional computational effort is small, if not quite negligible. To choose an effective group action, it is usually necessary to take advantage of the structure of the problem, the symmetries or geometry of the model or the topology of the space. Euclidean spaces have ample structure for interesting group actions. The parameter space in a parametric mixture model has obvious symmetries that can potentially be exploited. In the nonparametric bootstrap, unfortunately, these useful pieces of structure have been stripped away, the space that remains being a forlorn finite set with no redeeming structure. In the examples that we have explored, effective group action carries each sample point to a different point that is not usually in the sample. The bootstrap restriction to the observed sample points precludes group actions other than finite permutations. If the retention of structure is not contrary to bootstrap principles, it may be possible to use temporal order, or factor levels or covariate values, and to permute the covariates or the design. Of course, this may be precisely what you do not want to do.

It is true, as Ritov points out, that maximum likelihood becomes more efficient as the parameter space is reduced by the inclusion of symmetries or additional information such as the values of certain integrals. Our submodels are designed to exploit exactly that. The hard part of the exercise is to construct a submodel such that the gain in precision is sufficient to justify the additional computational effort. Evans's remarks show that the potential gain from the traditional type of group action is very limited. We doubt that it is possible to put a similar bound on the effectiveness of the group action on the parameter space.

Mixtures versus strata
Remarks by Evans and Jiang and Tanner suggest that it may be helpful to explain why four superficially similar estimators can have very different variances:

(a) the biased sampling estimator or bridge sampling estimator (BSE);
(b) the importance sampling estimator in which a single sample is drawn from the mixture;
(c) the post-stratified importance sampling estimator with actual sample sizes as weights;
(d) the stratified estimator in which the relative normalizing constants for the samplers are given.

We have listed these in increasing order of available information. The principal distinction is that the BSE does not use the normalizing constants for the samplers. The fourth estimator is maximum likelihood subject to certain homogeneous linear constraints of the type described in Section 3.4. For (a)–(c), the estimated mass at the sample point x_i is

(a) dµ̂(x_i) ∝ n / Σ_s n_s q_s(x_i)/ĉ_s (the ĉ_s solving the fixed-point system (3.5); a small numerical sketch follows this passage),
(b) dµ̂(x_i) ∝ 1 / Σ_s π_s q_s(x_i)/c_s,
(c) dµ̂(x_i) ∝ n / Σ_s n_s q_s(x_i)/c_s.

For (d) with the restriction c_1 = ... = c_k, the estimator obtained by constrained maximization over the set of measures supported on the data is dµ̂(x_i) ∝ 1/Σ_r λ_r q_r(x_i), the ratios λ_r/λ_s being determined by the condition ĉ_r = ĉ_s, where ĉ_r = ∫ q_r dµ̂.
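The following minimal sketch shows one way to compute the ĉ_s appearing in the BSE (a), by direct fixed-point iteration of equation (3.5). It is an illustrative scheme under assumed inputs (unnormalized functions q_r and draws from each P_r), not the iterative proportional scaling or Vardi algorithms mentioned elsewhere in the discussion, and the constants are identified only up to a common factor, so the first one is fixed at 1.

```python
import numpy as np

def bse_constants(q_list, draws, n_iter=500):
    """Iterate c_r = sum_i q_r(x_i) / sum_s n_s q_s(x_i)/c_s over the pooled sample.

    q_list : list of callables, the unnormalized functions q_r
    draws  : list of 1-D arrays, draws[r] holding n_r draws from P_r
    Returns the estimated ratios c_r / c_1.
    """
    x = np.concatenate(draws)                      # pooled sample x_1, ..., x_n
    n_r = np.array([len(d) for d in draws], float)
    Q = np.stack([q(x) for q in q_list])           # k x n matrix with entries q_r(x_i)
    c = np.ones(len(q_list))
    for _ in range(n_iter):
        denom = (n_r[:, None] * Q / c[:, None]).sum(axis=0)   # sum_s n_s q_s(x_i)/c_s
        c = (Q / denom).sum(axis=1)
        c /= c[0]                                  # only ratios are identified
    return c

# Toy check with overlapping unnormalized Gaussian kernels, so every true ratio is 1.
rng = np.random.default_rng(2)
qs = [lambda x, m=m: np.exp(-0.5 * (x - m) ** 2) for m in (0.0, 1.0, 2.0)]
samples = [rng.normal(m, 1.0, 2000) for m in (0.0, 1.0, 2.0)]
print(bse_constants(qs, samples))                  # approximately [1, 1, 1]
```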
The constrained maximum likelihood estimator is in a mixture form even though there may be draws from only one of the distributions. Two designs with Γ = [0, 2] show that the variances are not necessarily comparable:

design I: q_1(x) = I(0 ≤ x ≤ 1), q_2(x) = I(1 ≤ x ≤ 2), n_1 = 2n_2;
design II: q_1(x) = 0.5 I(0 ≤ x ≤ 2), q_2(x) = I(0 ≤ x ≤ 1), n_1 = 2n_2;

plus the mixture with weights (2/3, 1/3). The absence of an overlap between q_1 and q_2 in design I means that µ is not estimable. The BSE of an integral such that the integrand is non-zero on both intervals does not exist. In practical terms, the variance is infinite. However, the restriction of µ to each interval is estimable, and the additional information that c_1 = c_2 means that these estimated measures can be combined in such a way that µ̂([0, 1]) = µ̂([1, 2]). The stratified maximum likelihood estimator (d) puts mass 1 at each sample point in the first interval and 2 at each point in the second. In an importance sampling design with independent and identically distributed observations from the mixture, the estimator (b) has the same form with weights 1 or 2 at each sample point. But the number of observations falling in each interval is binomial, so the measure does not have equal mass on the two intervals. This defect is corrected by a post-stratification adjustment. Estimator (d) has the natural property that, for each function q_3 that is a linear combination of q_1 and q_2, the ratio ĉ_3/ĉ_1 has zero variance even if the draws come entirely from P_2. The main point of this example is that, although the formulae in all cases look like mixture sampling, the estimators can have very different variances. As Jiang and Tanner note, the use of estimated normalizing constants in the BSE may increase the variance, potentially by a large factor. Also, their optimal sampler is not in the mixture family, so the comparison is not entirely fair.

Variances
Vardi makes the point that, for a given sequence of designs, only certain integrals are consistently estimable. It would be good to have a clean description of this class, how it depends on the design and on the group action. The asymptotic variance estimates reported in the paper are in agreement with known results such as proposition 2.3 in Gill et al. (1988), but they apply only to integrals that are consistently estimable. Our experience has been that the Fisher information calculation usually matches up well with empirical simulation variances. At the same time, we have not looked at extreme cases near the boundary of the consistently estimable class, where the approximation might begin to break down. Robert's example with n = 1 sheds little light on the adequacy of the asymptotic variance formulae. In applications of this sort, where the log-ratios log(ĉ_r/ĉ_s) do not have moments, the distinction between variance and asymptotic variance is critical. The asymptotic variance is well defined and finite, and this is the relevant number for the assessment of precision in large samples.

Retrospective, empirical and profile likelihood
The simplest way to phrase the remarks by Qin and Fokianos is to consider a regression model with response Y and covariate x in which the components of Y are independent with distribution q_θ(y) dµ(y)/c_θ(µ) where θ_i = x_i β, and µ is an arbitrary measure. The likelihood maximized over µ for fixed β is a function of β alone, here assumed to be a scalar.
If the covariate is binary, as in a two-sample problem, the two-parameter function given by Qin and Fokianos is such that l(α̂_β, β) is the profile log-likelihood for β. An equivalent result can be obtained by a conditional retrospective argument along the lines of Armitage (1971). In particular, if log{q_θ(y)} is linear in θ, both arguments lead to a likelihood constructed as if the conditional distribution given y satisfies a linear logistic model with independent components.

Comparisons of precision
Table 1 shows that the precision previously achieved in nine time units can now be achieved in one time unit. Professor Chib argues unconvincingly that this is not a substantial factor. Possibly he has too much time on his hands. His subsequent remark about the addition of −34 to each value in Table 1 can only be interpreted tongue in cheek, so the comment about insubstantial factors may be meant likewise.

Likelihood principle
Although it appears to have no direct bearing on any of our results, Robins's example raises interesting fundamental issues connected with the likelihood function, its definition and its interpretation. As we understand it, the likelihood principle does not address matters connected with asymptotics; nor does it promote point estimation as an inferential activity. In particular, the non-existence of a maximum or supremum is not regarded as a contradiction of the likelihood principle; nor is the existence of multiple maxima. By convention, an estimator is said to 'obey the likelihood principle' if it is a function of the sufficient statistic. In writing this paper, we had given no thought to the applicability of the likelihood principle in infinite dimensional models. On reflection, however, it is essential that likelihood ratios be well defined, and, to achieve this, topological conditions are unavoidable. Such conditions are natural in statistical work because an observation is a point in a topological space, in which the σ-field is generated by the open sets. Time and space limitations preclude more detailed discussion, but an example in which Θ is the set of Lebesgue measurable functions [0, 1] → [0, 1] illustrates one aspect of the matter. If the model is such that Y_i is Bernoulli with parameter θ(x_i), the 'likelihood' expression

$$L(\theta) = \prod_i \theta(x_i)^{y_i} \{1 - \theta(x_i)\}^{1 - y_i}$$

is not a function on Θ as a topological space. Two functions θ and θ′ differing on a null set represent the same point in the topological space Θ, but they may not have the same 'likelihood'. In other words, θ = θ′ does not imply L(θ) = L(θ′), so L is not a function on Θ. None-the-less, there are non-trivial submodels for which the likelihood exists and factors, and the Robins phenomenon persists.

References in the discussion

Albert, J. and Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. J. Am. Statist. Ass., 88, 669–679.
Anderson, J. A. (1979) Multivariate logistic compounds. Biometrika, 66, 17–26.
Armitage, P. (1971) Statistical Methods in Medical Research. Oxford: Blackwell.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1993) Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins.
Bickel, P. J. and Ritov, Y. (1991) Large sample theory of estimation in biased sampling regression, model I. Ann. Statist., 19, 797–816.
Casella, G. and Robert, C. P. (1996) Rao-Blackwellisation of sampling schemes. Biometrika, 83, 81–94.
Chamberlain, G.
(1988) Efficiency bounds for semiparametric regression. Technical Report. Department of Economics, University of Wisconsin, Madison. Chib, S. (1995) Marginal likelihood from the Gibbs output. J. Am. Statist. Ass., 90, 1313–1321. Deming, W. E. and Stephan, F. F. (1940) On a least-squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist., 11, 427–444. Doss, H. and Narasimhan, B. (1998) Dynamic display of changing posterior in Bayesian survival analysis. In Practical Nonparametric and Semiparametric Bayesian Statistics (eds D. Dey, P. Müller and D. Sinha), pp. 63–87. New York: Springer. Doss, H. and Narasimhan, B. (1999) Dynamic display of changing posterior in Bayesian survival analysis: the software. J. Statist. Softwr., 4, 1–72. Evans, M. and Swartz, T. (2000) Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford: Oxford University Press. Fokianos, K., Kedem, B., Qin, J. and Short, D. A. (2001) A semiparametric approach to the one-way layout. Technometrics, 43, 56–65. Gelfand, A. E. and Smith, A. F. M. (1990) Sampling based approaches to calculating marginal densities. J. Am. Statist. Ass., 85, 398–409. Geyer, C. J. (1994) Estimating normalizing constants and reweighting mixtures in Markov chain Monte Carlo. Technical Report 568. School of Statistics, University of Minnesota, Minneapolis. Gilbert, P. B., Lele, S. R. and Vardi, Y. (1999) Maximum likelihood estimation in semiparametric selection bias models with application to AIDS vaccine trials. Biometrika, 86, 27–43. Gill, R., Vardi, Y. and Wellner, J. (1988) Large sample theory of empirical distributions in biased sampling models. Ann. Statist., 16, 1069–1112. Guha, S. and MacEachern, S. N. (2002) Benchmark estimation and importance sampling with Markov chain Monte Carlo samplers. Technical Report. Ohio State University, Columbus. Guha, S., MacEachern, S. N. and Peruggia, M. (2002) Benchmark estimation for Markov chain Monte Carlo samples. J. Comput. Graph. Statist., to be published. Newey, W. (1990) Semiparametric efficiency bounds. J. Appl. Econometr., 5, 99–135. Pollard, D. (1990) Empirical Processes: Theory and Applications. Hayward: Institute of Mathematical Statistics. Qin, J. (1998) Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85, 619–630. Qin, J. (1999) Case-control studies and Monte Carlo methods. Unpublished. Ritov, Y. and Bickel, P. (1990) Achieving information bounds in non- and semi-parametric models. Ann. Statist., 18, 925–938. Robert, C. P. (1995) Convergence control techniques for MCMC algorithms. Statist. Sci., 10, 231–253. Robert, C. P. and Casella, G. (1999) Monte Carlo Statistical Methods. New York: Springer. Robins, J. M. (2003) Estimation of optimal treatment strategies. In Proc. 2nd Seattle Symp. Biostatistics, to be published. Robins, J. M. and Ritov, Y. (1997) A curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statist. Med., 16, 285–319. Robins, J. M., Rotnizky, A. and van der Laan, M. (2000) Comment on ‘On profile likelihood’ (by S. A. Murphy and A. W. van der Vaart). J. Am. Statist. Ass., 95, 431–435. Robins, J. M. and Wasserman, L. (2000) Conditioning, likelihood, and coherence: a review of some foundational concepts. J. Am. Statist. Ass., 95, 1340–1346. Vardi, Y. (1982) Nonparametric estimation in the presence of length bias. Ann. Statist., 10, 616–620. Vardi, Y. (1985) Empirical distributions in selection bias models. Ann. Statist., 13, 178–203. 
Woodroofe, M. (1985) Estimating a distribution function with truncated data. Ann. Statist., 13, 163–177.