Benchmark Estimation for Markov Chain Monte Carlo Samples

Subharup Guha, Steven N. MacEachern and Mario Peruggia
[email protected]

May 2002; revised November 2002

Abstract

While studying various features of the posterior distribution of a vector-valued parameter using an MCMC sample, a subsample is often all that is available for analysis. The goal of benchmark estimation is to use the best available information, i.e. the full MCMC sample, to improve future estimates made on the basis of the subsample. We discuss a simple approach to do this and provide a theoretical basis for the method. The methodology and benefits of benchmark estimation are illustrated using a well-known example from the literature. We obtain as much as an 80% reduction in MSE with the technique based on a 1-in-10 subsample and show that greater benefits accrue with the thinner subsamples that are often used in practice.

1 Introduction

While using an MCMC sample to investigate the posterior distribution of a vector-valued parameter θ, many features of interest have the representation E[g(θ)] for some function g(θ). A subsample of the MCMC output is often all that is retained for further investigation of the posterior distribution. Subsampling is often necessary in computationally intensive or real-time, interactive investigations where speed is essential. Examples include expensive plot processing and examination of changes in the prior (sensitivity analysis), likelihood (robustness) or data (case influence). Typically, such studies would include hundreds or thousands of changes to the model. Another reason for subsampling is expensive storage space. Practical constraints, like limited disk space available to users of shared computing resources, often make it infeasible to store the entire sample of MCMC draws when the parameter has a large number of components. A subsample is then retained for future investigation of the posterior distribution.

Geyer (1992) and MacEachern and Berliner (1994) show that subsampling the output of the Markov chain in a systematic fashion can only lead to poorer estimation of E[g(θ)]. The goal of benchmark estimation is to produce a number of estimates based on the entire MCMC sample, and to then use these to improve other estimates made on the basis of the subsample. The benchmark estimates must be quick and easy to compute. They must also be compatible with quick computations for further, more (computationally) expensive analyses based on the eventual subsample.

Several motivating perspectives are useful to understand and investigate various aspects of benchmark estimation. The point of view of calibration estimation, developed in the sampling literature to improve survey estimates (Deville and Särndal, 1992; Vanderhoeft, 2001), helps to bring all these perspectives together into a unified framework. In calibration estimation, a probability sample from a finite population is used to compute estimates of population quantities of interest. The (regression type) estimators are built as weighted averages of the observations in the sample, with the weights determined so as to satisfy a (vector-valued) calibration equation which forces the resulting estimators to produce exact estimates of known population features. Usually, the constraints imposed by the calibration equation do not determine a unique set of weights.
Thus, among the sets of weights satisfying the calibration equation, one chooses the set that yields weights that are as close as possible (with respect to some distance metric) to a fixed set of prespecified (typically uniform) weights.

To cast MCMC benchmark estimation into the framework of calibration estimation, we regard the MCMC output as a finite population and a 1-in-k systematic subsample as a probability sample drawn from the finite population. This systematic sampling design gives each unit in the population a probability 1/k of being selected, though many joint inclusion probabilities are 0. In this setting, the (vector-valued) benchmark E[h(θ)], for which the subsample estimate is forced to match the full sample estimate, corresponds to the auxiliary information available through the calibration equation. Once the calibration weights have been calculated, they can then be used to compute the calibration subsample estimate of any feature E[g(θ)]. As the full MCMC sample size increases, the asymptotic performance of these benchmark estimators matches that of the corresponding calibration estimators. The benchmark estimators that we introduce in this paper can be shown to be calibration estimators corresponding to appropriately chosen calibration equations and metrics.

We investigate two methods of creating weights: post-stratification and maximum entropy. In their simplest form, post-stratification weights are derived by partitioning the parameter space into a finite number of regions and by forcing the weighted subsample frequencies of each region to match the corresponding raw frequencies for the entire MCMC sample. The weights are taken to be constant over the various elements of the partition and to sum to one.

An improved version of post-stratification (and, in fact, the approach that in our experience has generated the most successful estimators) begins with a representation of an arbitrary function g(θ) as a countable linear combination of basis functions h_j(θ): g(θ) = Σ_{j=1}^∞ c_j h_j(θ). The estimand, E[g(θ)], is expressed as the same linear combination of integrals of the basis functions, Σ_{j=1}^∞ c_j E[h_j(θ)]. Splitting the infinite series representation of g(θ) into two parts, we have a finite series which may provide a good approximation to g(θ) and an infinite remainder series that fills out the representation of g(θ). Focusing on the finite series, we determine the weights by forcing estimates of E[h_j(θ)] based on the subsample to match those based on the full sample. In addition, we require the weights to be constant over the elements of a suitably chosen partition of the parameter space and to sum to one. This produces a better estimate of E[Σ_{j=1}^m c_j h_j(θ)] than one based on the subsample alone. The improvement carries over to estimation of E[g(θ)] when the tail of the series is of minor importance. We refer to the finite set of basis functions as the (vector-valued) benchmark function.

In both the basic and improved post-stratification approaches, we specify enough conditions that (for virtually all MCMC samples of reasonable size) there is a unique set of weights that would satisfy them. Thus, from the point of view of calibration estimation, the choice of the distance metric becomes immaterial, in the sense that any metric would yield identical weights. In this respect, our post-stratification weights arise from a degenerate instance of a problem of calibration estimation.
In the case of the maximum entropy weights, however, we do not specify enough benchmark conditions to make the weights unique. Rather, among the sets of weights satisfying an under-determined number of benchmark conditions, we select the set having maximum entropy, and this, from the point of view of calibration estimation, is tantamount to choosing a specific distance metric.

Benchmark estimation yields improved weighted estimators (having smaller variances) based on MCMC subsamples. We investigate the performance of estimators based on weights determined according to post-stratification and maximum entropy methods. Substantial reductions in the variability of the estimates occur when examining expectations of functions g(θ) that are similar to a linear combination of benchmark components. Theoretical results suggest the gains that we see in practice. The methodology is illustrated on an example discussed by George, Makov and Smith (1993). We compare estimation of E[g(θ)] for a variety of functions g(θ), showing that there are often substantial benefits to the use of benchmark estimates. The extent of the improvement in estimation of E[g(θ)] for functions that are noticeably different from the benchmarks is striking, even for values of m as small as 3 or 4.

2 A simple approach to benchmark estimation

2.1 An improved subsample estimator

Let θ ∈ Θ be a vector-valued parameter. Imagine that an MCMC sample is drawn from the posterior distribution of θ. Call the sequence of draws θ^(1), θ^(2), …, θ^(N). The draws are used to estimate some feature of the posterior distribution. Often these features of interest can be represented as E[g(θ)] for some (possibly vector-valued) function g(θ). The most straightforward estimator for E[g(θ)] is

\hat{E}[g(\theta)]_f = \frac{1}{N} \sum_{i=1}^{N} g(\theta^{(i)}),    (1)

where the right-hand subscript denotes the full sample estimator. If one selects a systematic 1-in-k subsample of the data, the natural estimator is

\hat{E}[g(\theta)]_s = \frac{1}{n} \sum_{i=1}^{n} g(\theta^{(ki)}),    (2)

where N = kn and s denotes the subsample estimator. As mentioned earlier, this form of subsampling always leads to poorer estimation; the unweighted subsample estimator (2) has a larger variance than the full sample estimator (1).

We wish to use the information available from the full sample to improve future estimation based on the subsample for any feature E[g(θ)]. For an appropriately chosen (and possibly vector-valued) function h(θ), we refer to the feature E[h(θ)] as the benchmark. We now create a weighted version of the subsample estimator of E[g(θ)] as follows:

\hat{E}[g(\theta)]_w = \sum_{i=1}^{n} w_i \, g(\theta^{(ki)}),    (3)

where Σ_{i=1}^n w_i = 1. The weights w_i are chosen so that they force the weighted subsample benchmark estimate to equal the full sample estimate:

\hat{E}[h(\theta)]_w = \hat{E}[h(\theta)]_f.    (4)

Thus, provided the weights can be constructed, Ê[h(θ)]_w and Ê[h(θ)]_f have the same distributions, and all features of their distributions conditional on this event are the same. For a vector-valued benchmark function, any linear combination of its coordinates results in the same estimate for both the subsample and the full sample, and the estimators have the same distribution. In particular, the two estimators have the same variance, and we have possibly greatly increased precision for our subsample estimator of E[g(θ)].
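To fix ideas, the following minimal sketch (Python with NumPy; the function names and the toy g are illustrative assumptions, not code from the paper) computes the full-sample estimator (1), the unweighted 1-in-k subsample estimator (2), and the weighted subsample estimator (3) from a sequence of MCMC draws.

```python
import numpy as np

def full_sample_estimate(g_vals):
    """Full-sample estimator (1): simple average of g over all N draws."""
    return g_vals.mean(axis=0)

def subsample_estimate(g_vals, k):
    """Unweighted 1-in-k systematic subsample estimator (2)."""
    return g_vals[k - 1::k].mean(axis=0)

def weighted_subsample_estimate(g_vals, k, weights):
    """Weighted subsample estimator (3); weights must sum to one.

    Works for scalar- or vector-valued g (g_vals of shape (N,) or (N, d)).
    """
    sub = g_vals[k - 1::k]
    return np.tensordot(weights, sub, axes=1)

# Illustrative usage with fake draws standing in for MCMC output.
rng = np.random.default_rng(0)
theta = rng.normal(size=(10_000, 2))     # N = 10,000 draws of a 2-dimensional theta
g_vals = theta[:, 0] ** 2                # g(theta) = square of the first coordinate
w = np.full(1_000, 1 / 1_000)            # uniform weights reproduce estimator (2)
print(full_sample_estimate(g_vals),
      subsample_estimate(g_vals, 10),
      weighted_subsample_estimate(g_vals, 10, w))
```

With uniform weights w_i = 1/n the weighted estimator reduces to (2); the weighting schemes of Section 2.2 replace these uniform weights with weights satisfying (4).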
The connection between a conditionally conjugate structure and linear posterior expectation in exponential families implies that, for many popular models, quantities such as the conditional posterior mean for a case or the conditional posterior predictive mean will be a linear function of hyperparameters. The structure of the hierarchical model enables us to use benchmark functions based on the hyperparameters to create more accurate estimates of these quantities. The reduction in variability when moving from Ê[h(θ)]_s to Ê[h(θ)]_w also appears when examining expectations of functions g(θ) that are similar to h(θ). Functions such as a predictive variance which depend on first and second moments will typically be closely related to benchmark functions based on the hyperparameters, and so they will be more accurately estimated with our technique.

The weighted subsample, (w_i, θ^(ki)) for i = 1, 2, …, n, is now used in place of the unweighted subsample. The weights act exactly as they would if arising from an importance sample, and so we obtain weighted subsample estimates Ê[g(θ)]_w for various features of interest E[g(θ)] of the posterior. Techniques and software developed for importance samples can be used without modification for these weighted samples.

2.2 Some methods of obtaining weights

The constraints that Σ_{i=1}^n w_i = 1 and that Ê[h(θ)]_w = Ê[h(θ)]_f will not typically determine the w_i. With a single real benchmark function, we would have only two linear constraints on the w_i. We supplement the constraints with a principle that will yield a unique set of weights. The two principles we investigate are motivated by the literatures on survey sampling and information theory.

Weights by post-stratification. Post-stratification is a standard technique in survey sampling, designed to ensure that a sample matches certain characteristics of a population. The population characteristics are matched by computing a weight for each unit in the sample. Large sample results show that a post-stratified sample behaves almost exactly like a proportionally allocated stratified sample. This type of stratification reduces the variance of estimates as compared to a simple random sample. In this setting, the full sample plays the role of the population while the subsample plays the role of the sample. Thus, the essence of the technique is to partition the parameter space into (say) m strata, and to assign the same weight to each θ^(ki) in a stratum.

Formally, suppose that {Θ_j}_{j=1}^m is a finite partition of the parameter space Θ. Let I_j(θ) denote the indicator of set Θ_j, for j = 1, …, m. The natural application of the post-stratification method takes as the benchmark function the vector of these m indicator functions. That is, h(θ) = (I_1(θ), I_2(θ), …, I_m(θ))′. We assign the same weight to all subsample points belonging to a given stratum. Specifically, for all i such that θ^(ki) ∈ Θ_j, we set w_i = v_j, where, according to (4), the values v_j are determined by

\sum_{i=1}^{n} v_j I_j(\theta^{(ki)}) = \frac{1}{N} \sum_{i=1}^{N} I_j(\theta^{(i)}),    j = 1, …, m.

The post-stratification weights are then obtained as

v_j = \frac{N^{-1} \sum_{i=1}^{N} I_j(\theta^{(i)})}{\sum_{i=1}^{n} I_j(\theta^{(ki)})},    (5)

provided each of the strata contains at least one subsample point. As in survey sampling, with fairly well chosen strata, the chance that any of the strata are empty of subsample points is negligible.
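As a concrete illustration of (5), the sketch below (Python/NumPy; an assumed helper, not code taken from the paper) computes the basic post-stratification weights from integer stratum labels for the full sample and the subsample.

```python
import numpy as np

def basic_poststrat_weights(full_strata, sub_strata, m):
    """Basic post-stratification weights, formula (5).

    full_strata, sub_strata: integer stratum labels in {0, ..., m-1} for the
    N full-sample draws and the n subsample draws, respectively.
    Returns one weight per subsample point; the weights are constant within
    strata and sum to one, provided no stratum is empty of subsample points.
    """
    N = len(full_strata)
    full_props = np.bincount(full_strata, minlength=m) / N  # N^{-1} sum_i I_j(theta^{(i)})
    sub_counts = np.bincount(sub_strata, minlength=m)       # sum_i I_j(theta^{(ki)})
    if np.any(sub_counts == 0):
        raise ValueError("a stratum contains no subsample points")
    v = full_props / sub_counts                             # stratum weights v_j in (5)
    return v[sub_strata]                                    # w_i = v_{j(i)}
```

The returned weights can be passed directly to a weighted estimator such as weighted_subsample_estimate in the earlier sketch.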
The intuitive description of the post-stratification weight v_j is as the ratio of the proportion of full sample points in Θ_j to the number of subsample points in Θ_j. We refer to this subsample estimator as the basic post-stratification estimator, Ê[g(θ)]_{w,ps}.

The perspective of a basis expansion of g(θ) provides a more sophisticated use of post-stratification. Instead of using a basis formed from indicator functions (essentially a Haar basis), alternative bases consist of functions other than indicators. An attractive basis, due to its success throughout statistics, is the polynomial basis that generates Taylor series. Assigning equal weight to subsample points within each given post-stratum yields n − m linear constraints on the weights, and forcing the weights to sum to 1 provides one additional constraint. Supplementing these with a further m − 1 linear constraints on the weights (and also with mild conditions on the posterior distribution and simulation method to guarantee uniqueness) defines the weights. The weights are quickly obtained as a solution to a system of m linear equations. This version of post-stratification has proven to be extremely effective in practice.

As an example of this technique, suppose that θ is a p-dimensional vector-valued parameter and that the parameter space Θ is partitioned into m = 3 strata, namely Θ_1, Θ_2 and Θ_3. Post-stratification assigns the same weight v_j to all subsample points belonging to stratum Θ_j, where j = 1, 2, 3. Conditional on the event that no stratum is empty of subsample points, this corresponds to n − 3 linearly independent constraints on the weights. An additional constraint on the weights is that they sum to 1. Thus two other independent linear constraints will ensure that the weights are unique. For a choice of two real functions, say h_1(θ) and h_2(θ), as components of the benchmark function, the benchmark estimates are given by Ê[h_1(θ)]_f = (1/N) Σ_{i=1}^N h_1(θ^(i)) and Ê[h_2(θ)]_f = (1/N) Σ_{i=1}^N h_2(θ^(i)). For j = 1, 2, 3 and l = 1, 2, let n_j = Σ_{i=1}^n I_j(θ^(ki)) denote the number of subsample points falling in stratum Θ_j, and let b_{lj} denote the sum of the function h_l(θ) evaluated at the subsample points belonging to stratum Θ_j. Thus

b_{lj} = \sum_{i=1}^{n} h_l(\theta^{(ki)}) \, I_j(\theta^{(ki)}).

Then, solving the following system of linear equations uniquely determines the weights (v_1, v_2, v_3) for the three strata, provided the square matrix below is invertible:

\begin{pmatrix} n_1 & n_2 & n_3 \\ b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = \begin{pmatrix} 1 \\ \hat{E}[h_1(\theta)]_f \\ \hat{E}[h_2(\theta)]_f \end{pmatrix}.
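The following sketch (Python/NumPy, illustrative only) carries out this computation for a general number of strata m and m − 1 benchmark functions, of which the three-strata, two-benchmark example above is a special case.

```python
import numpy as np

def modified_poststrat_weights(sub_strata, h_sub, h_full_means, m):
    """Stratum weights for the modified post-stratification estimator.

    sub_strata   : stratum labels in {0, ..., m-1} for the n subsample points.
    h_sub        : (n, m-1) array of benchmark functions h_1..h_{m-1} at the
                   subsample points.
    h_full_means : length-(m-1) vector of full-sample benchmark estimates.
    Solves the linear system [n_j; b_{lj}] v = [1; Ehat_f] and returns one
    weight per subsample point, provided the system matrix is invertible.
    """
    n_j = np.bincount(sub_strata, minlength=m).astype(float)   # subsample counts per stratum
    B = np.zeros((len(h_full_means), m))
    for j in range(m):
        # b_{lj}: sum of h_l over subsample points in stratum j
        B[:, j] = h_sub[sub_strata == j].sum(axis=0)
    A = np.vstack([n_j, B])                                    # (m x m) system matrix
    rhs = np.concatenate([[1.0], h_full_means])
    v = np.linalg.solve(A, rhs)                                # stratum weights v_j
    return v[sub_strata]                                       # w_i = v_{j(i)}
```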
Maximum entropy weights. Information theory describes, in various fashions, the amount of information in data about a parameter or distribution. In a Bayesian context, it is often used to describe subjective information (playing the role of data) in order to elicit a prior distribution. This is accomplished by specifying a number of features of the distribution, typically expectations, as the “information” about the prior. The prior is then chosen to reflect this information but no more. With entropy defined as the negative of information, the prior which reflects exactly the desired information is the one that maximizes entropy among those priors matching the constraints. In our setting, we borrow this technique, matching exactly the information in the full sample benchmark estimates, but no more.

Let w = (w_1, w_2, …, w_n) be the n-tuple of weights given in (3). Let us denote by Ω the (possibly empty) set of all weight n-tuples that satisfy (4). Thus Ω is the set { w : w_i ≥ 0 for all i, Σ_{i=1}^n w_i = 1, Σ_{i=1}^n w_i h(θ^(ki)) = Ê[h(θ)]_f }.

Definition 2.1 The entropy of an n-tuple w belonging to the set Ω is defined as

En(w) = -\sum_{i=1}^{n} w_i \ln w_i,

subject to the convention that 0 ln(0) equals 0.

We observe that, for all w belonging to Ω, En(w) ≤ En(1/n, 1/n, …, 1/n) = ln(n). Since Ω is closed, there exists an element w* of Ω such that En(w*) = sup_{w∈Ω} En(w). These weights w* are called maximum entropy weights, and they exist whenever Ω is non-empty.

Finding maximum entropy weights w* is thus equivalent to maximizing En(w) subject to the constraints w_i ≥ 0 for i = 1, …, n, Σ_{i=1}^n w_i = 1, and Σ_{i=1}^n w_i h(θ^(ki)) = Ê[h(θ)]_f. For a real benchmark function h(θ), it can be shown that the maximum entropy weights w* are unique whenever they exist. For most subsamples of reasonable size they are given by

w_i^* = e^{\lambda_1 + \lambda_2 h(\theta^{(ki)})},    i = 1, 2, …, n,    (6)

where λ_2 ∈ R satisfies the equation

\sum_{i=1}^{n} \left( h(\theta^{(ki)}) - \hat{E}[h(\theta)]_f \right) \exp\left\{ \lambda_2 \left( h(\theta^{(ki)}) - \hat{E}[h(\theta)]_f \right) \right\} = 0,    (7)

and λ_1 = -ln( Σ_{i=1}^n e^{λ_2 h(θ^(ki))} ). The few subsamples where the weights fail to exist will have either h(θ^(ki)) < Ê[h(θ)]_f for all i, or h(θ^(ki)) > Ê[h(θ)]_f for all i. For all other subsamples, equation (7) has a unique root because the left-hand side of the equation increases monotonically from −∞ to +∞ as λ_2 increases. The root can be obtained by numerical methods, and hence w_i^* can be calculated very quickly using (6), as in the sketch below.

When h(θ) is an indicator function of a subset of the parameter space Θ, the answer obtained using (6) matches the post-stratification weights given in (5). Thus, the two approaches will sometimes yield the same result.
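Computationally, (6) and (7) reduce to a one-dimensional root-finding problem in λ_2, followed by a normalization that fixes λ_1. A minimal sketch (Python with NumPy and SciPy's brentq root finder; an illustrative implementation, not the one used by the authors, and written without safeguards against numerical overflow) is:

```python
import numpy as np
from scipy.optimize import brentq

def max_entropy_weights(h_sub, h_full_mean):
    """Maximum entropy weights (6) for a real-valued benchmark function h.

    h_sub       : values h(theta^{(ki)}) at the n subsample points.
    h_full_mean : the full-sample benchmark estimate Ehat[h]_f.
    Assumes some h_sub lie on each side of h_full_mean; otherwise the
    weights do not exist and the bracket search below will not terminate.
    """
    d = h_sub - h_full_mean

    def score(lam2):
        # Left-hand side of (7); strictly increasing in lam2.
        return np.sum(d * np.exp(lam2 * d))

    # Expand a bracket until the score changes sign, then solve (7).
    lo, hi = -1.0, 1.0
    while score(lo) > 0:
        lo *= 2.0
    while score(hi) < 0:
        hi *= 2.0
    lam2 = brentq(score, lo, hi)

    w = np.exp(lam2 * h_sub)
    return w / w.sum()   # normalizing determines lambda_1 implicitly
```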
3 Theoretical Results

The motivation for subsampling the output of a Markov chain carries over to the study of asymptotic properties of the estimators. A first asymptotic is motivated by the common practice of using preliminary runs of the Markov chain to assess the dependence and convergence properties of the chain, and then selecting a subsampling rate for the run of the chain used for estimation. This practice leads us to consider the asymptotic where k is held fixed and n grows (Case A). The strongest motivation for subsampling is that further use of the subsample involves expensive (slow) processing. When this further processing is done in real time, as with investigation of changes in the prior distribution or likelihood, it is essential to limit the number of points that are repeatedly processed. Pursuing this motivation for subsampling, a natural asymptotic holds the number of subsampled points, n, fixed while letting the interval between subsampled points, k, tend to infinity (Case B). In the limit, these subsampled points will look like a random sample from π. We note that asymptotics between these two, where both n and k grow (Case C), are also natural candidates for theoretical exploration, and are useful for filling out the range of asymptotic expressions useful for assessing the accuracy of the benchmark estimators.

Tierney (1994) contains a collection of useful results for the asymptotics of estimators based on output from a Markov chain. The two essential types of result are ergodic theorems, which guarantee strong convergence of an empirical average (or full-sample estimator) to a corresponding expectation under the limiting distribution, and central limit theorems, which describe weak convergence of an appropriately centered and scaled full-sample estimator to a normal distribution. We rely heavily on his results to show asymptotic normality of our subsampled estimators.

Case A. We first consider the asymptotic where k is held fixed and n tends to ∞. The ergodic theorem in Tierney's paper applied to the subsampled chain (which is itself Markovian) allows us to conclude that, as n tends to ∞, Ê[g(θ)]_s tends to E[g(θ)] almost surely. The next theorem establishes the asymptotic normality of the basic post-stratification estimator for expectations of bounded functions g(θ) and geometrically ergodic Markov chains.

Theorem 3.1 Suppose that the Markov chain is geometrically ergodic with invariant distribution π. If the function g(θ) is bounded and not a linear combination of the strata indicators I_j(θ), j = 1, …, m, then there exists a real number σ(g) such that, as n → ∞, the distribution of

\sqrt{n} \left( \hat{E}[g(\theta)]_{w,ps} - E[g(\theta)] \right)

converges weakly to a normal distribution with mean 0 and variance σ²(g).

Proof. The result follows from a verification of the conditions of Theorem 4 of Tierney's paper and by an application of the delta method. See the Appendix for a more detailed proof.

Under the stronger assumption of uniform ergodicity of the Markov chain, an identical result holds for all functions g(θ) with finite posterior variance. A proof of this result relies on Theorem 5 of Tierney's paper, but is otherwise the same as the one above.

The results extend beyond the basic post-stratification estimator, applying also to the more complex post-stratification estimators. The proof of the following corollary for modified post-stratification estimators is almost identical to that of the previous theorem, differing only in minor details specific to the modified estimators. Non-singularity of the matrix B defined below is required for local continuity of the estimator.

Corollary 3.2 Let Ê[g(θ)]*_{w,ps} denote the modified version of the post-stratification estimator based on benchmark functions h_1(θ), …, h_{m−1}(θ). Let the matrix B be given by

B = \begin{pmatrix} \pi_1 & \cdots & \pi_m \\ E[h_1(\theta)I_1(\theta)] & \cdots & E[h_1(\theta)I_m(\theta)] \\ \vdots & & \vdots \\ E[h_{m-1}(\theta)I_1(\theta)] & \cdots & E[h_{m-1}(\theta)I_m(\theta)] \end{pmatrix}.

Under the same assumptions as Theorem 3.1, there exists a real number σ*(g) such that, as n → ∞, the distribution of

\sqrt{n} \left( \hat{E}[g(\theta)]^*_{w,ps} - E[g(\theta)] \right)

converges weakly to a normal distribution with mean 0 and variance σ*²(g), provided the matrix B is invertible.

Case B. The second asymptotic that we consider involves increasingly thinner subsamples as the full sample size tends to ∞. The subsample size, bounded by constraints on computing power and/or storage space, remains fixed. The following result and corollary are useful when investigating the limiting behavior of the basic post-stratification estimator Ê[g(θ)]_{w,ps} in this case. The result states that, for a fixed subsample size, as the length of the full sample grows, the joint distribution of any continuous function of the subsample tends to the distribution of that function applied to n independent draws from π. Again, the result follows from Tierney's Section 3.

Theorem 3.3 Suppose that the Markov chain is geometrically ergodic with invariant distribution π. Let f(·) be any function which has a set of discontinuities having measure 0 under the distribution on (γ^(1), …, γ^(n)), where (γ^(1), …, γ^(n)) are independent draws from π. Then, as k → ∞, f(θ^(1k), …, θ^(nk)) converges in distribution to f(γ^(1), …, γ^(n)).
The next theorem gives the limiting distribution of the basic post-stratification estimator as the full sample size grows, when the subsample size is fixed.

Theorem 3.4 In addition to the assumptions of the previous theorem, assume that the function g(θ) is continuous, and that the boundary of each stratum has π-measure 0. Then for a fixed subsample size n and as the subsample distance k tends to ∞, the basic post-stratification estimator Ê[g(θ)]_{w,ps} converges in distribution to Ê[g(θ)]_{ps*}, the post-stratification estimator that makes use of π_1, …, π_m, the posterior probabilities of the m strata, and is based on independent draws from the posterior distribution.

Proof. The estimator Ê[g(θ)]_{w,ps} may be viewed as a real-valued function mapping the np-dimensional vector (θ^(1k), …, θ^(nk)) into the real line. Match the estimator to the function f(·) in the previous theorem. Since g(θ) is continuous, the only discontinuities in f(·) appear at the boundaries of the strata, Θ_j. The measure assigned to this set under the distribution on (γ^(1), …, γ^(n)) is no more than n times the measure assigned to the boundaries under π. Since π assigns probability 0 to these boundaries, the result follows.

Case C. We now consider the third asymptotic for the post-stratification estimator, where both the subsample size n and the subsampling distance k tend to ∞. This situation is frequently encountered in MCMC sampling when the full sample size grows at a faster rate than the subsample size, resulting in a progressively thinner and approximately independent subsample of draws from the posterior. Along with the previous theorem, the following theorem indicates that the post-stratification estimator Ê[g(θ)]_{w,ps} based on a large enough subsample has a smaller asymptotic variance than the unweighted subsample estimator. The result parallels what sampling statisticians have long known about the tie between stratified samples and post-stratified samples. Their results are generally described in terms of the asymptotic distribution of the estimators, as the sample size grows, thereby finessing the (rare) nonexistence of the post-stratified estimator. See Cochran (1977) and Lohr (1999), among others, for the result that shows that the asymptotic distribution of the post-stratified estimator is the same as the asymptotic distribution of the estimator based on a proportionally allocated stratified sample. In this theorem, we follow this approach, restating the standard sampling result in the context of our infinite population setting. Interestingly, the asymptotic variance formula applies even when the π_j are irrational, thus prohibiting proportional allocation.

Theorem 3.5 Assume that π_j > 0, where π_j is the posterior probability of the jth stratum for j = 1, …, m, and that E[g²(θ)] < ∞. Then, with Ê[g(θ)]_ps representing the post-stratified estimator based on n independent draws from π, we have that

\sqrt{\frac{n}{v}} \left( \hat{E}[g(\theta)]_{ps} - E[g(\theta)] \right)

converges weakly to a standard normal distribution as n → ∞, where v = Σ_{j=1}^m π_j Var(g(θ) | θ ∈ Θ_j).

Proof. Define μ_j = E[g(θ) | θ ∈ Θ_j] and μ = E[g(θ)]. We have Var(g(θ)) = Σ_{j=1}^m π_j E[(g(θ) − μ)² | θ ∈ Θ_j] ≥ Σ_{j=1}^m π_j E[(g(θ) − μ_j)² | θ ∈ Θ_j] = Σ_{j=1}^m π_j Var(g(θ) | θ ∈ Θ_j). Since E[g²(θ)] < ∞, this implies that each within-stratum variance of g(θ) is finite. Once this is established, a straightforward application of the delta method yields the result. See, for example, Pollard (1984), p. 189.
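The variance comparison in the proof is easy to check numerically. The following illustrative simulation (not part of the paper) compares the unweighted mean of n i.i.d. draws with the post-stratified estimator Ê[g(θ)]_ps that uses known stratum probabilities; the choice of π = N(0, 1), two strata and the particular g are arbitrary assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 20_000
cut = 0.0                                 # two strata: theta < 0 and theta >= 0
pi_j = np.array([0.5, 0.5])               # known stratum probabilities under pi = N(0, 1)

def g(theta):
    return theta + theta ** 2             # a feature that is not a strata indicator

plain, poststrat = [], []
for _ in range(reps):
    theta = rng.normal(size=n)            # i.i.d. draws from pi
    strata = (theta >= cut).astype(int)
    plain.append(g(theta).mean())
    # post-stratified estimate: weight each stratum mean by its known probability
    est = sum(pi_j[j] * g(theta[strata == j]).mean() for j in range(2))
    poststrat.append(est)

print(np.var(plain), np.var(poststrat))   # the post-stratified variance is smaller
```

In repeated runs the empirical variance of the post-stratified estimates is noticeably smaller than that of the plain averages, in line with v = Σ_j π_j Var(g(θ) | θ ∈ Θ_j) ≤ Var(g(θ)).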
As evident from the proof, Ê[g(θ)]_ps has smaller asymptotic variance than does Ê[g(θ)]_s unless μ_j = μ for j = 1, …, m. Hence, the argument just presented suggests the superiority of the basic post-stratification estimator Ê[g(θ)]_{w,ps} to the unweighted subsample estimator. The weakness of the argument is that it does not directly handle the joint limit where n and N = nk tend to ∞. Consideration of this joint limit for the dependent MCMC sample involves substantial technical detail and will be presented elsewhere.

4 Illustration

The example we consider is discussed in George, Makov and Smith (1993). The data consist of numbers of failures (x_i) and lengths of operation time in thousands of hours (t_i) for 10 power plant pumps. Interest focuses on parameters explicitly appearing in the model and on additional features such as the mean failure rate for a pump on which there is no data. A gamma-Poisson hierarchical model for the data is assumed, with the usual conventions of conditional independence. The parameter δ_i represents the failure rate of pump i. The model is

x_i ∼ Poisson(δ_i t_i) for i = 1, …, 10,
δ_i ∼ gamma(α, β) for i = 1, …, 10,
α ∼ exponential(1),
β ∼ gamma(3, 1),

where the gamma distribution is parameterized to have mean α/β.

We conducted a simulation study to estimate the reductions in MSE (if any) achieved using weighted benchmark estimators. It is quite straightforward to generate MCMC samples from the posterior distribution of the model parameter θ = (α, β, δ_1, δ_2, …, δ_10) by Gibbs sampling. For an appropriately chosen benchmark function, the full sample benchmark estimate can be computed on the fly, and a systematic 1-in-k subsample stored. We relied on BUGS (Spiegelhalter et al., 1996) for the implementation, which was based on 100 replications of these basic steps:

1. Set the initial values for α and β equal to one.

2. Discard a burn-in period of 500 initial updates to overcome the effects of the arbitrary starting values of the chain and retain the next 10,000 updates to provide a 1-in-10 subsample of size n = 1,000 from a full sample of size N = 10,000.

3. After a subsequent burn-in period of 71,500 updates, retain the next 100,000 updates to provide a 1-in-100 subsample of size n = 1,000 from a full sample of size N = 100,000. (The intervening burn-in period of 71,500 updates was used to investigate other subsampling rates.)

For further details about the BUGS code used in the simulation, please refer to the web site http://www.stat.ohio-state.edu/~peruggia.

After the subsample estimates and benchmark estimates for all 100 replications were obtained from BUGS, an S-Plus program was run to quickly compute subsample weights by the different methods: unweighted, post-stratification and its variations, and maximum entropy. We then computed subsample estimates of a number of features of interest, E[g(θ)], and we estimated the variance of the estimators by the sample variance of the estimates over the 100 independent replications. Each feature E[g(θ)] was estimated with the mean of the combined 100 subsamples. Treating this estimated value as the actual E[g(θ)], we then compared the MSE of the weighted subsample estimators to that of the unweighted subsample estimators. Use of the unweighted estimator as the target gives a small advantage to the unweighted estimators in the comparisons that follow.
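Before turning to the weighting methods, here is a minimal Metropolis-within-Gibbs sketch for this model (Python/NumPy). It is an illustrative assumption rather than the BUGS program used in the study: the conditionals of the δ_i and of β are conjugate gamma updates, while α, whose full conditional is nonstandard, is updated with a random-walk Metropolis step on the log scale; the function name, step size and iteration count are placeholders.

```python
import numpy as np
from math import lgamma

def pump_gibbs(x, t, n_iter=10_000, step=0.7, seed=0):
    """Metropolis-within-Gibbs sampler for the gamma-Poisson pump model.

    x, t: failure counts and operation times for the pumps.
    Returns draws of theta = (alpha, beta, delta_1, ..., delta_p).
    """
    rng = np.random.default_rng(seed)
    x, t = np.asarray(x, float), np.asarray(t, float)
    p = len(x)
    alpha, beta = 1.0, 1.0                     # starting values, as in step 1 above
    delta = np.ones(p)
    draws = np.empty((n_iter, 2 + p))

    def log_cond_alpha(a):
        # log full conditional of alpha: exponential(1) prior times the
        # alpha-dependent part of the gamma(alpha, beta) likelihood of the deltas
        return -a + p * (a * np.log(beta) - lgamma(a)) + (a - 1.0) * np.log(delta).sum()

    for it in range(n_iter):
        delta = rng.gamma(x + alpha, 1.0 / (t + beta))                 # delta_i | rest
        beta = rng.gamma(3.0 + p * alpha, 1.0 / (1.0 + delta.sum()))   # beta | rest
        prop = alpha * np.exp(step * rng.normal())                     # log-scale random walk for alpha
        log_ratio = (log_cond_alpha(prop) - log_cond_alpha(alpha)
                     + np.log(prop) - np.log(alpha))                   # Jacobian of the log-scale proposal
        if np.log(rng.uniform()) < log_ratio:
            alpha = prop
        draws[it] = np.concatenate([[alpha, beta], delta])
    return draws
```

Burn-in, systematic 1-in-k thinning of the stored draws, and on-the-fly computation of the full-sample benchmark estimates would then proceed as described in the steps above.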
Four methods for creating the weights were examined.

Post-stratification (i) weights. We chose m = 4 strata, generated by the Cartesian product of cutoffs of 1.0 for α and 1/1.8 = 0.556 for 1/β. (A parameterization in terms of 1/β, rather than β, is more directly related to several quantities of interest and slightly more natural for this problem.) The components of the benchmark function were the indicator functions of the four strata. The cutoff values of 1.0 and 0.556 are, approximately, the posterior medians of α and 1/β estimated from a preliminary run of the MCMC algorithm. As we observe later, the performance of post-stratification estimators seems to be comparatively robust with respect to the choice of strata.

Post-stratification (ii) weights. Using the same strata as the post-stratification (i) weights, the m − 1 = 3 components of the benchmark function were chosen as α, β and αβ.

Post-stratification (iii) weights. For the same strata as above, the three components of the benchmark function were chosen as α, 1/β and α/β.

Maximum entropy weights. The real-valued benchmark function was chosen as h(θ) = α/β. The maximum entropy weights were obtained as described in Section 2.2.

We considered a number of features of interest, E[g(θ)]. Several estimands focus on the hyperparameters α and β. Note that α/β is the predictive mean failure rate for a pump on which we have no data and that α/β² is the variance for such a pump. We also investigated functions tied to specific pumps in the data set. The tables show results for the mean, variance and density of the failure rate of pump 1, E(δ_1), Var(δ_1) and φ(δ_1) respectively. For density estimation, we used a grid of 196 points from 0.005 to 0.2, spaced at intervals of 0.001. The integrated mean square error was approximated by an average of squared differences from the target density. To broaden the comparisons, those for pump 1 rely on Rao-Blackwellized estimates for each estimand. The weights, however, were computed without Rao-Blackwellization. We also investigated two empirical Bayes problems. For the first problem, we took as our estimand (E[δ_1], …, E[δ_10]) with sum of squared error loss. For the second problem, we took (Var(δ_1), …, Var(δ_10)) as our estimand, again under sum of squared error loss. Rao-Blackwellization was used in this portion of the study. The (integrated) squared biases of all the estimators were found to be negligible compared to the respective variances.

The performances of the weighted subsample estimators relative to the unweighted subsample estimators are summarized in Tables 1 and 2. The percent reduction in MSE is the reduction in MSE achieved by the weighted estimator expressed as a percentage of the unweighted estimator's MSE. The percent reductions in MSE produced by the subsample estimators for all method-by-rate combinations and all features of interest E[g(θ)] are also presented graphically in a trellis display in Figure 1. We observe that most of the weighted subsample estimators have smaller MSEs than the corresponding unweighted estimators, indicated by the positive values of percent reduction in MSE in the tables. As expected, we have a big reduction in MSE for the benchmarks. If the full samples consisted of i.i.d. draws from the posterior, we would expect a 90% reduction in MSE for the benchmarks for a 1-in-10 subsample. The actual range of 60% to 90% seen in Table 1 reflects the positive dependence inherent in the MCMC samples.
For thinner and thinner subsamples, as the dependence between draws gets progressively weaker, the actual reductions in MSE begin to more closely resemble the corresponding values for a sequence of i.i.d. draws. Nicely, all estimators show improvement for estimation of E[α/β], the predictive failure rate for a pump on which there is no data.

Maximum entropy performs well on its benchmark and on features of interest closely related to the benchmark, but it shows poor performance for several other estimands, even increasing MSE in some cases. We conjecture that the relatively poor performance is tied to the behavior of the weights. The weights are monotone in α/β, either increasing or decreasing, corresponding to whether Ê[h(θ)]_s < Ê[h(θ)]_f or Ê[h(θ)]_s > Ê[h(θ)]_f. In either case, the most extreme weights are in a tail, and this tends to destabilize the estimates.

Post-stratification (i) shows improvement for all estimands. The most substantial improvement is shown where the mean of the function varies greatly from stratum to stratum. However, the big winners are post-stratification (ii) and (iii), motivated by the perspective of a basis expansion. These estimators show a large reduction in MSE. The reductions for these methods are, on the whole, much greater than for post-stratification (i) or maximum entropy. Post-stratification (ii) shows better performance than post-stratification (iii) for functions expressed in terms of β, while the reverse holds for functions expressed in terms of 1/β. The pattern of reduction in MSE is particularly clear for functions estimated with Rao-Blackwellization, appearing in the last five lines of each table. Posterior means, variances and densities of the pump failure rates are estimated extremely well with post-stratification (ii) or (iii). We recommend use of weights similar to post-stratification (ii) or (iii).

Another issue of interest is the sensitivity of the weighted post-stratified subsample estimator to the choice of the cutoff values used to partition the parameter space. The values of 1.0 and 0.556 used in our simulations to partition the α-(1/β) space into four strata coincide with the posterior estimates of the 0.55 quantile of α and of the 0.48 quantile of 1/β. Figure 2 displays, as a function of the cutoff probabilities for α and 1/β, a level plot of the percent MSE reduction for the weighted (1-in-10) subsample estimation of β² using the post-stratification (iii) weights. For a given pair of cutoff probabilities over the range [0.3, 0.7] × [0.3, 0.7], the strata are formed by partitioning the α-(1/β) space into the four quadrants with the common vertex located at the point whose coordinates are given by the posterior estimates of the corresponding quantiles of α and 1/β. The most striking feature revealed by Figure 2 is the insensitivity of the performance of the estimator to the choice of strata. For a fixed value of the cutoff probability for 1/β, the percent MSE reduction is essentially constant over the [0.3, 0.7] range as a function of the cutoff probability for α. The dependence on the cutoff probability for 1/β is a little more pronounced, but any choice of cutoff probabilities for α and 1/β in the rectangle [0.3, 0.7] × [0.4, 0.7] produces nearly optimal results. This robustness alleviates our concern that the benefits of benchmark estimation might be due to a fortuitous choice of strata.
5 Conclusions

The application of benchmark estimation to the pump data shows the remarkable benefits that accrue to a computationally cheap means of improving MCMC simulation. The method is relatively automatic and is robust with respect to the choice of many implementation details.

The potential application of the method is much wider than illustrated here. Benchmark estimation can be used in any setting where more accurate estimators are available. Examples of better estimators include Rao-Blackwellization of the weights as well as of the estimators (Gelfand and Smith, 1990), and subsampled estimators where the subsampling scheme produces estimators that are more accurate than the full-sample estimators (MacEachern and Peruggia, 2000). Benchmarks may also be created that match known population quantities, as there are typically some functions h(θ) for which E[h(θ)] is known. The advantage of matching these quantities is that the benchmarks then contain no sampling error.

The connection between calibration estimation and benchmark estimation suggests further investigation. We have created the post-stratification estimators by imposing enough linear constraints to yield a unique solution for the weights. Calibration methods instead create a multitude of weights that provide estimates matching the benchmark estimates. The final set of weights is then chosen by minimizing a distance, often from uniform weights. The resulting weights will rarely be uniform within strata. This may lead to superior estimators, or it may tend to produce estimators that are susceptible to tail effects, as we have seen with the maximum entropy estimator. An important feature of our problem that is not always respected in calibration estimation is that, for our final estimators, we wish to produce a set of weights that corresponds to a probability distribution.

This work also suggests the need for a more complete development of the theoretical underpinnings of benchmark estimation. We have pursued this line of research, with particular attention to the asymptotic behavior of the estimators, evaluation of the asymptotic variances of estimators, and use of the methods in conjunction with importance sampling techniques. This work will appear elsewhere.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Awards No. DMS-0072526 and SES-0214574. We wish to thank the Associate Editor and referee whose comments improved the focus of the paper and led us to the connection between benchmark estimation and calibration estimation.

Appendix

Proof of Theorem 3.1

The basic notation used in the proof was introduced earlier in Section 2.2. For each stratum j = 1, …, m, we use N_j to denote the number of full sample points belonging to stratum j, and we write the product g(θ)I_j(θ) as a function f_j(θ). The simple post-stratification estimator Ê[g(θ)]_{w,ps}, provided it exists, can be written as

\hat{E}[g(\theta)]_{w,ps} = \sum_{j=1}^{m} \frac{N_j / n_j}{N} \sum_{i=1}^{n} f_j(\theta^{(ki)}).

Stringing together the 3m empirical sums used in the above expression, we define the random vector Y_n as follows:

Y_n = \left( \sum_{i=1}^{n} f_1(\theta^{(ki)}), \ldots, \sum_{i=1}^{n} f_m(\theta^{(ki)}), \; n_1, \ldots, n_m, \; N_1, \ldots, N_m \right).

In Lemma 5.4, we prove that the vector (1/n)Y_n tends to a multivariate normal distribution. The probability of the non-existence of the post-stratification estimate tends to 0 as n increases, and so, by applying the delta method, we obtain the result stated in Theorem 3.1.
The following three lemmas are well-known results on convergence of probability measures, and help establish Lemma 5.4. Billingsley (1968) provides the background needed to fill in the details missing from these proofs.

Lemma 5.1 (Cramér-Wold device). For random vectors X_n = (X_{n1}, …, X_{np}) and X = (X_1, …, X_p), a necessary and sufficient condition for X_n to converge weakly to X is that l′X_n converges weakly to l′X for all vectors l ∈ R^p.

Lemma 5.2 Let {Z_n} be a tight sequence of random vectors in R^p having characteristic functions {φ_n}. If lim_n φ_n(u) = f(u) for all u ∈ R^p, where f is some complex-valued function, then f(u) is the characteristic function of some random variable Z and Z_n converges in distribution to Z.

The next lemma is explicitly needed to prove Lemma 5.4:

Lemma 5.3 Let X_n = (X_{n1}, …, X_{np}), n = 1, 2, …, be a sequence of random vectors. If, for all l ∈ R^p, l′X_n converges weakly to some real-valued random variable Y_l, then there exists a vector X = (X_1, …, X_p) such that X_n converges weakly to X.

Proof. Let l_k be the vector of length p having only the k-th component equal to 1 and all other components equal to 0, where k = 1, …, p. By our assumption, the sequence of random variables {X_{nk}} converges weakly to the random variable Y_{l_k} and is therefore tight. Applying Bonferroni's inequality, we conclude that the sequence {X_n} is also tight. Since l′X_n converges weakly to the real-valued random variable Y_l for all l ∈ R^p, the limit of the corresponding characteristic functions E[e^{itl′X_n}], as n tends to ∞, equals E[e^{itY_l}] for all t ∈ R. Setting u = tl ∈ R^p and applying Lemma 5.2, we conclude that there exists a random vector X of length p to which X_n converges weakly.

The following lemma completes the proof by establishing the asymptotic multivariate normality of the vector (1/n)Y_n:

Lemma 5.4 Suppose the Markov chain is geometrically ergodic with invariant distribution π. Assume that the function g(θ) is bounded and is not a linear combination of the strata indicators I_j(θ), where j = 1, …, m. Then, as n tends to ∞, the distribution of

\sqrt{n} \left( \frac{1}{n} Y_n - \mu_Y \right)

converges weakly to a N_{3m}(0, Σ) distribution, where Σ is a non-negative definite square matrix of dimension 3m, and the vector μ_Y = lim_{n→∞} E[(1/n)Y_n] is the expected value of (f_1(θ), …, f_m(θ), I_1(θ), …, I_m(θ), kI_1(θ), …, kI_m(θ)) with respect to π.

Proof. We define a new sequence β^(i) of vector-valued MCMC iterates by combining k consecutive samples from the original chain. That is, for i ∈ N, we define β^(i) = (θ^(k(i−1)+1), …, θ^(ki)). It is easily verified that {β^(i)} is a Markov chain with limiting distribution Π_k, which we define as the joint distribution of k consecutive samples from the original Markov chain, where the first state is generated from the invariant distribution π. The new Markov chain is also geometrically ergodic. Geometric ergodicity of the original chain implies that there exists some 0 ≤ ρ < 1 such that

‖π(θ^(m) | θ^(0)) − π(θ)‖ ≤ ρ^m M(θ^(0)) for all m ∈ N,

with ‖·‖ denoting total variation distance. We note that

‖π(β^(i) | β^(0)) − Π_k(β)‖ = ‖π(θ^(k(i−1)+1) | θ^(0)) − π(θ)‖ ≤ ρ^{k(i−1)+1} M(θ^(0)) = [ρ^k]^i [ρ^{1−k} M(θ^(0))].

The first equality is a consequence of the Markovian nature of the fine-scale chain. The last equality establishes that the rate parameter of the new chain is the k-th power of the rate parameter of the original chain. The conditions are now set to apply Tierney's Theorem 4.
This theorem implies that, for each l ∈ R^{3m}, √n (l′(Y_n/n) − l′μ_Y) converges weakly to a normal random variable, say Z_l. Lemma 5.3 implies that the sequence √n ((Y_n/n) − μ_Y) converges weakly to some random vector Z; Lemma 5.1 equates the distributions of l′Z and Z_l for every l ∈ R^{3m}, and, since all one-dimensional projections of Z are normal, we have that Z is normal.

References

Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Cochran, W. G. (1977). Sampling Techniques, 3rd ed. Wiley, New York.

Deville, J. and C. Särndal (1992). Calibration Estimators in Survey Sampling. J. Amer. Statist. Assoc. 87, 376–382.

Gelfand, A. E. and A. F. M. Smith (1990). Sampling-Based Approaches to Calculating Marginal Densities. J. Amer. Statist. Assoc. 85, 398–409.

George, E. I., U. E. Makov and A. F. M. Smith (1993). Conjugate Likelihood Distributions. Scand. J. Statist. 20, 147–156.

Geyer, C. J. (1992). Practical Markov chain Monte Carlo. Statist. Sci. 7, 473–483.

Lohr, S. (1999). Sampling: Design and Analysis. Duxbury, Pacific Grove, CA.

MacEachern, S. N. and L. M. Berliner (1994). Subsampling the Gibbs Sampler. Amer. Statist. 48, 188–190.

MacEachern, S. N. and M. Peruggia (2000). Subsampling the Gibbs sampler: Variance reduction. Statist. Prob. Letters 47, 91–98.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

Spiegelhalter, D. J., A. Thomas, N. G. Best and W. R. Gilks (1996). BUGS: Bayesian inference Using Gibbs Sampling, Version 0.5 (version ii). Cambridge, UK: MRC Biostatistics Unit.

Tierney, L. (1994). Markov Chains for Exploring Posterior Distributions. Ann. Statist. 22, 1701–1728.

Vanderhoeft, C. (2001). Generalized Calibration at Statistics Belgium: SPSS Module g-CALIB-S and Current Practices. Working Paper, Statistics Belgium.

% Reduction in MSE for a Subsampling Rate of 1-in-10

g(θ)                      Post-Strat (i)   Post-Strat (ii)   Post-Strat (iii)   Max Ent
α                         46.1 (5.6)       70.9* (4.2)       70.9* (4.2)        6.2 (1.9)
α²                        33.3 (7.3)       66.7 (4.6)        67.1 (4.7)         4.7 (1.9)
β                         38.5 (7.6)       66.2* (5.1)       42.7 (7.9)         12.6 (7.9)
β²                        29.9 (7.3)       64.8 (4.6)        39.4 (6.9)         9.0 (6.5)
αβ                        37.4 (6.6)       62.3* (5.3)       57.0 (5.5)         -0.4 (3.8)
1/β                       24.8 (8.5)       44.1 (7.1)        65.5* (5.0)        24.9 (10.0)
1/β²                      8.0 (5.7)        16.1 (5.9)        53.0 (3.9)         22.4 (7.0)
α/β                       26.3 (9.0)       55.5 (6.9)        86.7* (2.6)        86.7* (2.6)
α/β²                      7.8 (7.8)        16.2 (7.8)        59.7 (4.2)         53.1 (8.3)
I(α/β ≤ 0.58)             32.8 (7.4)       50.5 (7.4)        50.9 (8.0)         50.4 (7.0)
E(δ_1)                    45.0 (5.8)       74.1 (3.8)        74.2 (3.7)         12.7 (3.0)
Var(δ_1)                  43.7 (6.0)       72.1 (4.1)        72.3 (4.0)         9.0 (2.5)
φ(δ_1)                    46.6 (5.4)       73.1 (3.9)        73.2 (3.8)         12.0 (2.8)
Combined pump means       32.6 (8.7)       69.2 (4.9)        60.0 (7.0)         55.1 (7.0)
Combined pump variances   30.6 (8.0)       77.8 (3.7)        53.5 (7.4)         51.8 (6.5)

Table 1: Comparison of MSE of the subsample estimators for a 1-in-10 systematic subsample. A negative percent reduction indicates an increase in MSE. An * indicates that the weighted subsample estimate is forced to agree with the full sample estimate. Shown in parentheses are the estimated standard errors.
% Reduction in MSE for a Subsampling Rate of 1-in-100

g(θ)                      Post-Strat (i)   Post-Strat (ii)   Post-Strat (iii)   Max Ent
α                         63.8 (5.6)       97.6* (0.5)       97.6* (0.5)        1.0 (2.5)
α²                        56.4 (6.0)       95.0 (1.0)        95.0 (0.9)         1.0 (1.9)
β                         71.8 (4.8)       94.5* (1.0)       71.4 (5.5)         25.2 (7.4)
β²                        57.9 (5.8)       91.3 (1.6)        64.1 (6.1)         14.0 (7.1)
αβ                        64.3 (5.0)       96.6* (0.6)       90.4 (1.6)         3.8 (3.8)
1/β                       51.4 (6.8)       73.1 (4.0)        95.3* (0.9)        43.0 (5.2)
1/β²                      23.7 (7.6)       37.9 (7.8)        76.5 (5.5)         25.2 (10.3)
α/β                       28.5 (7.4)       60.8 (6.2)        97.2* (0.5)        97.2* (0.5)
α/β²                      24.5 (7.3)       37.8 (7.0)        80.1 (3.5)         57.4 (6.7)
I(α/β ≤ 0.58)             20.0 (8.0)       37.3 (9.5)        44.5 (8.6)         47.4 (6.5)
E(δ_1)                    60.4 (6.0)       98.0 (0.4)        97.7 (0.5)         2.8 (4.2)
Var(δ_1)                  61.2 (5.9)       97.7 (0.5)        97.7 (0.5)         1.8 (3.2)
φ(δ_1)                    61.0 (5.9)       96.8 (0.6)        96.4 (0.6)         2.6 (4.0)
Combined pump means       56.6 (5.6)       85.5 (2.4)        74.2 (4.7)         70.1 (4.5)
Combined pump variances   52.8 (5.6)       92.8 (1.2)        61.6 (7.0)         62.0 (5.3)

Table 2: Comparison of MSE of the subsample estimators for a 1-in-100 systematic subsample. A negative percent reduction indicates an increase in MSE. An * indicates that the weighted subsample estimate is forced to agree with the full sample estimate. Shown in parentheses are the estimated standard errors.

[Figure 1 (trellis display; plotted values not reproducible here). Caption: Trellis plots of the percent reductions in MSE produced by the subsample estimator for all method-by-rate combinations and all features of interest E[g(θ)]. To facilitate visual inspection, the features of interest are ordered so as to produce an increasing pattern in the upper-left panel.]

[Figure 2 (level plot; plotted values not reproducible here). Caption: Level plot of the percent MSE reduction for the weighted subsample estimation of β² using the post-stratification weights (iii) and 1-in-10 subsamples. For a given pair of cutoff probabilities for α and 1/β, the strata are formed by partitioning the α-(1/β) space into the four quadrants with the common vertex located at the point of coordinates given by the posterior estimates of the corresponding quantiles for α and 1/β.]