J. R. Statist. Soc. B (2001) 63, Part 1, pp. 127–146

Following a moving target — Monte Carlo inference for dynamic Bayesian models

Walter R. Gilks, Medical Research Council Biostatistics Unit, Cambridge, UK, and Carlo Berzuini, University of Pavia, Italy

[Received September 1998. Final revision August 2000]

Summary. Markov chain Monte Carlo (MCMC) sampling is a numerically intensive simulation technique which has greatly improved the practicality of Bayesian inference and prediction. However, MCMC sampling is too slow to be of practical use in problems involving a large number of posterior (target) distributions, as in dynamic modelling and predictive model selection. Alternative simulation techniques for tracking moving target distributions, known as particle filters, which combine importance sampling, importance resampling and MCMC sampling, tend to suffer from a progressive degeneration as the target sequence evolves. We propose a new technique, based on these same simulation methodologies, which does not suffer from this progressive degeneration.

Keywords: Bayesian inference; Dynamic model; Hidden Markov model; Importance resampling; Importance sampling; Markov chain Monte Carlo methods; Particle filter; Predictive model selection; Sequential imputation; Simulation; Tracking

1. Introduction

Bayesian applications of Markov chain Monte Carlo (MCMC) methods involve generating many samples from the posterior distribution of the model parameters by using a Markov chain, and then approximating posterior expectations of interest with sample averages. Although MCMC sampling is computationally intensive, it has been remarkably successful in expanding the repertoire of feasible Bayesian problems; see, for example, the many applications described in Gilks et al. (1996).
However, for dynamic problems where the posterior (or target) distribution evolves over time through the accumulation of data, and in other situations where a large collection of target distributions is involved, MCMC methods are too computationally intensive to be useful, especially where real-time sequential forecasting is required. Examples of dynamic problems include financial and medical time series prediction, sequential system identification in control engineering, speech recognition, military tracking, on-line updating of classification systems and machine learning. Multiple-target distributions also arise with model selection techniques based on k-step-ahead prediction.

To reduce the computational burden of dynamic Bayesian analysis, several techniques involving some or all of importance sampling, importance resampling and MCMC sampling have been proposed (Handschin and Mayne, 1969; Zaritskii et al., 1975; Kong et al., 1994; Rubin, 1988; Gordon et al., 1993; Kanazawa et al., 1995; Isard and Blake, 1996; Berzuini et al., 1997; Liu and Chen, 1998). These are collectively known as particle filters and are reviewed briefly in Section 4.2. These techniques require the generation of an initial set of particles, which is then progressively resampled and augmented to take account of incoming data and parameters. The state of the art is reviewed by Liu and Chen (1998) and Doucet (1998). However, most of the techniques proposed to date can suffer from progressive impoverishment of the representativeness of the particles as the dynamic process evolves, especially when unknown hyperparameters are involved, or inference about past unobserved variables is required; see Section 4.2.

Address for correspondence: Walter R. Gilks, Medical Research Council Biostatistics Unit, Institute of Public Health, University Forvie Site, Robinson Way, Cambridge, CB2 2SR, UK. E-mail: [email protected]
Here we propose a new method for dynamic Bayesian analysis, which we call the resample–move algorithm. Although this technique draws conceptually on the same base technologies of importance sampling–resampling and MCMC sampling, it avoids the degeneration of existing methods. We use p(dx) to denote generically a probability measure on a random variable x, and p(x) a probability density function. We use π(·) to denote a target distribution and all its conditional and marginal distributions and densities, e.g. π(dx₁ | x₂) and π(x₁). When π(·) is a posterior distribution, conditioning on the data is implicit and will be suppressed notationally.

2. A motivating example

As a concrete motivating example, we consider a version of a classical problem in non-linear filtering, known as bearings-only tracking (Gordon et al., 1993; Carpenter et al., 1998, 1999; Bergman, 1999). The problem, which we analyse in Section 5, is described in Fig. 1. A ship is moving in the north–east plane through random smooth accelerations and decelerations. A stationary observer at the origin of the plane takes a noisy measurement z_t of the ship's angular position at each time t = 1, 2, …. Let x_t and y_t denote the east and north co-ordinates of the ship at integer time t respectively, with corresponding velocities ẋ_t and ẏ_t. A simple model for the trajectory evolution of the ship is given, for t > 1, by equation (1).

Fig. 1. Our illustrative application considers a ship moving along a smooth trajectory in the x–y plane, where x and y represent east and north respectively: ○, true position of the ship at each time t = 1, 2, 3, …
; – – – –, corresponding observed angular position.

For t > 1,

\[
\dot{x}_t \sim N(\dot{x}_{t-1}, \sigma_1^2), \qquad
\dot{y}_t \sim N(\dot{y}_{t-1}, \sigma_1^2), \qquad
x_t = x_{t-1} + \dot{x}_t, \qquad
y_t = y_{t-1} + \dot{y}_t,
\tag{1}
\]

where N(c, d) denotes a normal distribution with mean c and variance d. These equations contain the assumption that the x- and y-components of the ship's velocity have the same, constant but unknown, volatility σ₁. The equation for the observed bearings, z, is

\[
z_t \sim N\{\tan^{-1}(y_t / x_t), \sigma_2^2\}.
\tag{2}
\]

As we observe the ship in its motion, new data z_t accrue, along with new parameters (ẋ_t, ẏ_t). The vector of unknowns at time t is

\[
\theta_t = (\sigma_1, x_1, y_1, \dot{x}_1, \dot{y}_1, \ldots, \dot{x}_t, \dot{y}_t),
\tag{3}
\]

and the data are z_{1:t} = (z_1, …, z_t). From a Bayesian perspective, therefore, the target distribution π_t(dθ_t) = p(dθ_t | z_{1:t}) evolves in an expanding space Θ_t. As t increases, our aim is to maintain a set of sampled points, or particles, in Θ_t which can be used to estimate aspects of π_t of interest. In particular, these particles can be used at any given time point t to approximate the conditional posterior distribution for the current state of the ship, given the data z_{1:t} accumulated up to that point, or to predict the ship's position at a future point in time. To reflect the accumulating information on the unknown hyperparameter σ₁, as t increases there is an increasing need to update the values of σ₁ represented in the particles. This causes particular problems for existing simulation strategies, as noted above. See Section 4.2 for further details. We now turn to a description of the resample–move algorithm.

3. Method

We suppose that we have an evolving sequence of target distributions π_t(dθ), where t denotes discrete time. Denote the support of π_t by Θ. For ease of exposition, in this section we assume that Θ does not depend on t. However, as noted above, our main concern is with problems where Θ expands with t. In Section 4.1 we show that the development in this section applies directly to scenarios where Θ depends on t.
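As an illustrative aside (not part of the original analysis), the data-generating process (1)–(2) of Section 2 can be simulated directly. The sketch below uses `atan2` in place of tan⁻¹ to resolve the angular quadrant, and the parameter values are assumptions chosen only for the example:

```python
import math
import random

def simulate_ship(T, x1, y1, vx1, vy1, sigma1, sigma2, seed=0):
    """Simulate the bearings-only model: a Gaussian random walk on the
    velocities (equation 1) and noisy angular observations (equation 2)."""
    rng = random.Random(seed)
    xs, ys, zs = [x1], [y1], []
    vx, vy = vx1, vy1
    zs.append(rng.gauss(math.atan2(y1, x1), sigma2))  # bearing at t = 1
    for _ in range(2, T + 1):
        vx = rng.gauss(vx, sigma1)   # x-velocity random walk
        vy = rng.gauss(vy, sigma1)   # y-velocity random walk
        xs.append(xs[-1] + vx)       # position integrates velocity
        ys.append(ys[-1] + vy)
        # observed bearing: true angle plus N(0, sigma2^2) noise
        zs.append(rng.gauss(math.atan2(ys[-1], xs[-1]), sigma2))
    return xs, ys, zs

# illustrative values only, loosely in the spirit of Section 5.3
xs, ys, zs = simulate_ship(T=150, x1=-0.01, y1=20.0, vx1=0.002, vy1=-0.06,
                           sigma1=0.001, sigma2=0.005)
```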
In general, θ will be a vector of model parameters, and π_t(dθ) = p(dθ | X_t) will be the posterior distribution of θ after observing data X_t accumulated up to time t. However, our methodology makes no assumptions about the process generating the π_t-sequence. Indeed, π_t need not even be a posterior distribution. For example, π_t could be an importance sampling distribution; see Section 6. Moreover, θ need not be a parameter vector. We merely regard the π_t as a sequence of distributions of interest. Our aim is to estimate

\[
E_t(g) = \int g(\theta)\, \pi_t(d\theta),
\]

for any function of interest g(·) at any time t, by using Monte Carlo methods. Our resample–move algorithm involves maintaining a set S_t of particles, each representing a value of θ. This set of particles is adapted to the evolution in π_t by using a combination of importance sampling, importance resampling and MCMC sampling. In outline, the procedure is as follows. An initial set S_{t₁} of particles is sampled at time t₁ and is used until time t₂, when it is resampled and then moved in θ-space to form set S_{t₂}. This process continues so that, at time t_{k+1}, for k = 1, 2, …, the current particle set S_{t_k} is resampled and moved to form S_{t_{k+1}}. Each resampling is an importance-weighted resampling, and each resampled particle is moved according to a Markov chain transition kernel. At any time t during this process, E_t(g) is estimated by importance sampling based on the most recently generated set of particles. We now describe this algorithm in detail.

3.1. Preliminaries

To simplify the notation, without loss of generality, we assume that sampling or resampling is done at integer-valued times t = 1, 2, …. Thus we set t_k = k, for all positive integers k. Let θ_k denote a generic particle sampled at time k, and in particular let θ_k^{(j)} denote the jth particle sampled at time k. Let n_k denote the number of particles generated at time k, so S_k = {θ_k^{(j)}; j = 1, …, n_k}. Let

\[
w_{kt}(\theta) = \pi_t(d\theta) / \pi_k(d\theta).
\tag{4}
\]
(For readers who are unfamiliar with measure-theoretic notation, think of π_t(dθ)/π_k(dθ) as a density ratio π_t(θ)/π_k(θ).) We shall refer to w_{kt}(·) as an incremental weight function. When t = k + 1 we suppress the second subscript on w_{kt}, so w_k denotes w_{k,k+1}. For k = 1, 2, …, let q_k(θ_k, dθ_{k+1}) denote a Markov chain transition kernel with invariant (stationary) distribution π_{k+1}, i.e. q_k(θ_k, dθ_{k+1}) denotes a conditional probability of moving to a measurable set dθ_{k+1} ⊆ Θ at time k + 1, given position θ_k at time k, with the property that

\[
\int_{\theta_k \in \Theta} \pi_{k+1}(d\theta_k)\, q_k(\theta_k, d\theta_{k+1}) = \pi_{k+1}(d\theta_{k+1}).
\tag{5}
\]

See, for example, Tierney (1994). We adopt the convention that the first subscript k on a variable indicates functional dependence on the particles in S_k, and when combined with superscript (j) this dependence is restricted to just the jth element of S_k, θ_k^{(j)}. Thus g_k^{(j)} denotes g(θ_k^{(j)}), w_k^{(j)} denotes w_k(θ_k^{(j)}), w_{kt}^{(j)} denotes w_{kt}(θ_k^{(j)}) and q_k^{(j)}(dθ_{k+1}) denotes q_k(θ_k^{(j)}, dθ_{k+1}).

3.2. Resample–move algorithm

After initialization at time t = 1, our algorithm proceeds with a rejuvenation at each subsequent integer time. The rejuvenation comprises two steps: a resample step and a move step, as we now describe.

(a) Initialization: at time t = 1, generate the initial set of particles S_1 by sampling, independently for j = 1, …, n_1,

\[
\theta_1^{(j)} \sim \pi_1,
\tag{6}
\]

where '∼' denotes 'is sampled from'. Then, for k = 1, 2, …, we have the rejuvenation step.

(b) Rejuvenation: at each time t = k + 1 calculate weights w_k^{(i)}, for i = 1, …, n_k. Generate S_{k+1} by performing the following two steps, independently for j = 1, …, n_{k+1}:

(i) resample step — randomly select a particle from S_k, such that θ_k^{(i)} is selected with probability proportional to w_k^{(i)}, for i = 1, …, n_k, and denote the selected particle by θ_k^{(i_j)};

(ii) move step — move θ_k^{(i_j)} to a new position θ_{k+1}^{(j)} by sampling

\[
\theta_{k+1}^{(j)} \sim q_k(\theta_k^{(i_j)}, \cdot).
\tag{7}
\]

3.3. Notes on the algorithm

We assume that, at time t = 1, it is possible to sample independently directly from the target distribution π_1.
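The initialization and two-step rejuvenation above can be sketched in a few lines. The example below is a hedged illustration, not the authors' implementation: the target sequence is taken to be N(μ_t, 1) with a drifting mean, the move step is a single random-walk Metropolis iteration invariant for the new target, and all tuning constants are assumptions:

```python
import math
import random

def resample_move(mus, n=2000, seed=1):
    """Track a sequence of N(mu_t, 1) targets with the resample-move
    recipe: weight by the density ratio, resample with replacement,
    then apply one Metropolis move invariant for the new target."""
    rng = random.Random(seed)
    logpdf = lambda x, mu: -0.5 * (x - mu) ** 2  # unnormalized N(mu, 1)
    # initialization: exact independent draws from the first target
    particles = [rng.gauss(mus[0], 1.0) for _ in range(n)]
    for mu_old, mu_new in zip(mus, mus[1:]):
        # incremental weights w_k = pi_{k+1}/pi_k at each particle (eq. 4)
        w = [math.exp(logpdf(p, mu_new) - logpdf(p, mu_old)) for p in particles]
        # resample step: importance-weighted resampling with replacement
        particles = rng.choices(particles, weights=w, k=n)
        # move step: one random-walk Metropolis iteration, invariant
        # for the new target, so no burn-in is needed
        moved = []
        for p in particles:
            cand = p + rng.gauss(0.0, 1.0)
            log_ratio = logpdf(cand, mu_new) - logpdf(p, mu_new)
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                p = cand
            moved.append(p)
        particles = moved
    return particles

final = resample_move([0.0, 0.5, 1.0, 1.5, 2.0])
est_mean = sum(final) / len(final)  # tracks the final target mean, here 2.0
```

The move step is what distinguishes this from plain importance resampling: duplicated heavy-weight particles are dispersed to distinct points under the new target.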
In most Bayesian applications, π_1 would be the prior distribution of the parameters, from which sampling would be easy for the large class of directed acyclic graphical (DAG) models. For non-DAG models, such as typically arise in applications with a spatial component, one strategy would be to set π_1 to a convenient approximation of the prior, which nevertheless admits independent sampling, and then to set π_2 equal to the prior.

The resample step at time k + 1 selects a particle θ_k^{(i_j)} at random from the current particle set S_k, for each j = 1, …, n_{k+1}. This is a weighted resampling with replacement and is known as importance resampling (Rubin, 1988) or the weighted bootstrap (Smith and Gelfand, 1992). Our notation allows the particle set size, n_{k+1}, to change over time, although in practice a fixed set size might be used. In Section 3.4 we consider why we might want to vary the set size. Regions of Θ which are under-represented in S_k with respect to π_{k+1} have weights that are greater than 1 according to equation (4), and similarly particles which are over-represented have weights that are less than 1. It can be shown that the set of resampled particles {θ_k^{(i_j)}; j = 1, …, n_{k+1}}, before being moved, is approximately a sample from the current target distribution π_{k+1}, provided that particle set sizes are large. Note that this is a dependent sample. Although the resampling at time k + 1 is performed independently for each j, it is conditional on the values in S_k. Previous stages of resampling therefore induce marginal dependence between the resampled particles.

The incremental weights w_k^{(i)} used in the resample step involve π_k and π_{k+1}, but in most Bayesian applications we can calculate only π̃_k = c_k π_k and π̃_{k+1} = c_{k+1} π_{k+1}, where the normalization constants c_k and c_{k+1} are unavailable as their evaluation would involve intractable high-dimensional integration. Instead of using w_k^{(i)} in the resample step, we can use the unnormalized weight

\[
\tilde{w}_k^{(i)} = \tilde{\pi}_{k+1}(d\theta_k^{(i)}) / \tilde{\pi}_k(d\theta_k^{(i)}),
\tag{8}
\]

since this can be directly calculated and is proportional to w_k^{(i)}.

The move step at time k + 1 in effect performs one or more iterations of an MCMC algorithm on each of the particles selected at the resample step, where the invariant distribution of the MCMC algorithm is π_{k+1}. For an introduction to MCMC methods see, for example, Tierney (1994), Gilks et al. (1996) or Robert and Casella (1999). For example, q_k might be an iteration of the Gibbs sampler, or perhaps a Gibbs or Metropolis–Hastings move which updates just one element of vector θ. The kernel q_k need be neither irreducible nor reversible. We discuss the choice of q_k further, later. As noted above, each selected particle θ_k^{(i_j)}, before moving, is approximately distributed according to π_{k+1}. Therefore, by equation (5), the moved particle θ_{k+1}^{(j)} is also approximately distributed according to π_{k+1}. Thus we do not require the usual burn-in of MCMC algorithms. The purpose of the resample and move steps is examined in Section 3.4. In Section 4.2 we review connections between the resample–move algorithm and other techniques for dynamic simulation.

3.4. Estimating E_t(g)

Recall that our objective is to estimate E_t(g), where g(·) represents a generic function of interest. Having run the resample–move algorithm up to time t, we can estimate E_t(g) by

\[
\tilde{g}_t = \frac{1}{n_t} \sum_{j=1}^{n_t} g(\theta_t^{(j)}).
\tag{9}
\]

Thus g̃_t is the mean of g over the particle set S_t. Below we present a central limit theorem (CLT) for g̃_t. This involves an integral operator I_{kt}(·), defined for any integers k and t such that 1 ≤ k ≤ t:

\[
I_{kt}(h) =
\begin{cases}
\displaystyle \int h(\theta_t) \prod_{l=k+1}^{t} w_{l-1}(\theta_{l-1})\, q_{l-1}(\theta_{l-1}, d\theta_l), & k < t, \\
h(\theta_k), & k = t,
\end{cases}
\tag{10}
\]

where integration is over the space {θ_{k+1} ∈ Θ} × … × {θ_t ∈ Θ}. This operator takes h(θ_t) as its operand and produces a result which depends on θ_k. We suppress both k and t in the notation for I_{kt}, these being implied by the subscripts. We also need to define a variance-like quantity:

\[
V^*_{k-1,t}(g) = E_{k-1}\bigl( \bigl[ I_{kt}\{ g - E_t(g) \} \bigr]^2 \bigr).
\tag{11}
\]
Here, g is a function of θ_t, and the outer expectation in equation (11) integrates over θ_k ∈ Θ. Thus V*_{k−1,t}(g) is a function of the particles in S_{k−1}, as indicated by its first subscript. We can now state the CLT.

Theorem 1. Under the integrability conditions of theorem 2 in Appendix A,

\[
\frac{\tilde{g}_t - E_t(g)}{\surd\{V_t(g)\}} \Rightarrow N(0, 1),
\]

as n_1 → ∞, then n_2 → ∞, and so on, where

\[
V_t(g) = \sum_{k=1}^{t} \frac{1}{n_k} V^*_{k-1,t}(g).
\tag{12}
\]

Proof. See corollary 2 of Appendix A.

The limit in theorem 1 is for large particle set sizes {n_k}, not for large t. This is the appropriate mode of analysis, as we wish to make valid inference and prediction for each t. However, we thank a referee for pointing out that the mode of convergence of the theorem is not directly relevant to the practical context in which we might wish to consider behaviour for large n, where n = n_1 = n_2 = …. Such convergence results would require the use of more complex interacting-particle theory; see, for example, Shiga and Tanaka (1985) and Del Moral and Guionnet (1999).

We see from theorem 1 that g̃_t is a consistent estimator of E_t(g), whose variance V_t(g) is a somewhat complex function of the incremental weights and transition kernels encountered up to time t. We now discuss the structure of V_t(g) in detail. Many particles in S_k will remain unselected after completing the rejuvenation at time k + 1. Assuming that n_k = n_{k+1} = n, the expected proportion of particles in S_k which will remain unselected after the rejuvenation is

\[
\frac{1}{n} \sum_{j=1}^{n} \left( 1 - \frac{w_k^{(j)}}{\sum_{j'=1}^{n} w_k^{(j')}} \right)^{\!n}
\;\geq\; \left( 1 - \frac{1}{n} \right)^{\!n}
\;\to\; \exp(-1) \qquad \text{as } n \to \infty,
\]

by Jensen's inequality. Thus more than a third of the current particles are lost with each rejuvenation. This progressive impoverishment is reflected in the contribution to V_t(g) of one variance component V*_{k−1,t}(g)/n_k from each rejuvenation. Clearly, large set sizes n_k will reduce these components, but the accumulation of variance components with each rejuvenation could be highly undesirable.
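The lower bound (1 − 1/n)ⁿ, and its limit exp(−1) ≈ 0.368 underlying the 'more than a third' claim, are easy to check numerically:

```python
import math

# Probability that a given particle is never chosen in n equal-weight
# multinomial draws is (1 - 1/n)^n, which increases to exp(-1) ~ 0.368.
fractions = {n: (1 - 1 / n) ** n for n in (10, 100, 10000)}
# e.g. n=10 gives about 0.349, n=100 about 0.366, n=10000 about 0.3678
```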
In particular, suppose that q_k(θ_k, dθ_{k+1}) puts all its probability mass on θ_{k+1} = θ_k, so that the move step almost surely does not move the particle. In this situation, V*_{k−1,t}(g) reduces to

\[
E_{k-1}\bigl[ \{ (g - E_t(g))\, w_{kt} \}^2 \bigr] = E_t\bigl[ \{ g - E_t(g) \}^2 w_{kt} \bigr],
\]

leading to

\[
V_t(g) = E_t\left[ \{ g - E_t(g) \}^2 \sum_{k=1}^{t} \frac{1}{n_k} w_{kt} \right],
\tag{13}
\]

suppressing notationally the functional dependence of g and w_{kt} on θ. From equation (13) we see that V_t(g) is strictly increasing with each rejuvenation. Several existing resampling algorithms for dynamic simulation share this unfortunate property (see Section 4.2).

A special case of theorem 1 is ordinary importance sampling (see, for example, Geweke (1989)), which is obtained when 1 ≤ t < 2, i.e. before the first rejuvenation. Now, V_t(g) reduces to the k = 1 term of equation (13), the importance sampling variance formula. Thus, by theorem 1 of Geweke (1989), E_t(g) can be consistently estimated by g̃_t without recourse to rejuvenation. (This holds for any t > 0, since it is only for notational convenience that we set rejuvenation times at t = 2, 3, ….)

We have seen that the resample step impoverishes the particle base, and is not required for consistent estimation. What, then, is the purpose of rejuvenation? We now show that the move step enriches the particle base and provides an opportunity to improve estimation considerably. Consider a particle θ_k^{(i)} with a heavy weight w_k^{(i)}. This particle may be selected at the resample step at time k + 1 for several different j, and each copy of this particle will tend to be moved to a distinct point in a region of strong support under π_{k+1}. Thus successive rejuvenation steps will help the current particle base S_t to track the moving target π_t. An illustration of this is given in Section 5. Without rejuvenation, tracking cannot occur, resulting eventually in heavy weights accumulating on just a few particles, impacting adversely on V_t(g), as can be seen from the importance sampling formula for V_t(g), given by the k = 1 term of equation (13).
More precisely, if, for any integer l ≤ t, transition kernel q_l is rapidly mixing (allowing fluid movement around Θ), then V*_{k−1,t}(g) will be reduced for all k < l. This is most easily seen by considering an extreme case where q_l is perfectly mixing, i.e. q_l ≡ π_{l+1}. In this case, S_{l+1} does not depend on earlier stages, and V*_{k−1,t}(g) is identically 0 for all k < l (see lemma 2 in Appendix A). Thus we see that there is a cost, in terms of the precision of estimation, with each rejuvenation but also an important potential benefit if the move step provides good mixing. In practice, the optimal timing of rejuvenations will depend critically on the speed with which the target π_t is moving, and the mixing rate of the transition kernels employed. Estimates of V_t(g) will be invaluable in deciding when to rejuvenate, and in determining an adequate particle set size n_t.

3.5. Variance estimation

We propose an estimator of V_t(g) based on only the particles from stage t. For this we require some additional terminology. We say that particle θ_k^{(i_j)} is the parent of particle θ_{k+1}^{(j)}, where i_j is as defined in Section 3.2. Further, for k < l, we say that θ_k^{(j)} is an ancestor of a particle θ_l^{(m)} if the line of parentage from θ_l^{(m)} passes through θ_k^{(j)}. For k ≤ l, define

\[
C_{kl}^{jm} =
\begin{cases}
1 & \text{if } \theta_k^{(j)} \text{ is an ancestor of } \theta_l^{(m)}, \\
0 & \text{otherwise.}
\end{cases}
\tag{14}
\]

For k ≤ t, define

\[
\hat{I}_{kt}^{(j)}(g) = \frac{n_k}{n_t} \sum_{m=1}^{n_t} C_{kt}^{jm}\, g(\theta_t^{(m)}).
\tag{15}
\]

In Appendix A (lemma 8), we show that Î_{kt}^{(j)}(g) is a consistent estimator of I_{kt}^{(j)}(g). For l ≤ t, define

\[
\hat{V}^*_{l-1,t}(g) = \frac{1}{n_l} \sum_{j=1}^{n_l} \{ \hat{I}_{lt}^{(j)}(g) - \tilde{g}_t \}^2.
\tag{16}
\]

In Appendix A (lemma 9), we show that V̂*_{l−1,t}(g) is a consistent estimator of V*_{l−1,t}(g). Define

\[
\hat{V}_t(g) = \sum_{l=1}^{t} \frac{1}{n_l} \hat{V}^*_{l-1,t}(g).
\tag{17}
\]

In Appendix A (theorem 3), we show that V̂_t(g) is a consistent estimator of var(g̃_t). An equivalent but computationally more convenient formula is

\[
\hat{V}_t(g) = \frac{1}{n_t^2} \sum_{m_1=1}^{n_t} \sum_{m_2=1}^{n_t} N_t^{m_1 m_2}
\{ g(\theta_t^{(m_1)}) - \tilde{g}_t \} \{ g(\theta_t^{(m_2)}) - \tilde{g}_t \},
\tag{18}
\]

where

\[
N_t^{m_1 m_2} = \sum_{l=1}^{t} \sum_{j=1}^{n_l} C_{lt}^{j, m_1} C_{lt}^{j, m_2},
\]

i.e. N_t^{m_1 m_2} is the number of common ancestors of θ_t^{(m_1)} and θ_t^{(m_2)}.
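The estimator (18) requires only ancestry bookkeeping over the particle genealogy. A minimal sketch (equal weights assumed, purely illustrative) records each particle's stage-1 ancestor through repeated multinomial resampling; the shrinking count of distinct ancestors is exactly the progressive impoverishment discussed in Section 3.4:

```python
import random

def resample_ancestry(n=500, stages=10, seed=2):
    """Track each particle's stage-1 ancestor through repeated
    equal-weight multinomial resampling -- the bookkeeping behind the
    indicators C in equations (14)-(18) -- and count how many distinct
    stage-1 ancestors survive."""
    rng = random.Random(seed)
    ancestors = list(range(n))  # stage-1 ancestor index of each particle
    for _ in range(stages):
        # each new particle inherits the ancestor of its selected parent
        ancestors = [ancestors[rng.randrange(n)] for _ in range(n)]
    return len(set(ancestors))

survivors = resample_ancestry()  # far fewer than the original 500 lineages
```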
4. Dynamic simulation

4.1. Augmenting the space

So far we have assumed that the target distribution π_t has the same support Θ for all t. This assumption must be relaxed to embrace those numerous situations where the θ-space grows in size during the evolution of π_t. For example, in clinical monitoring, new patient-specific random effects or frailties may be introduced into the Bayesian model with each new patient (Berzuini et al., 1997). Similarly, in hidden Markov model processing of sequential data, each new data observation enters the model accompanied by a new hidden state variable. The development in this section follows Liu and Chen (1998).

At time k + 1, suppose that new variables φ_{k+1} are introduced into θ. To deal with this, just before rejuvenation at k + 1, we perform the following augmentation step for i = 1, …, n_k: draw an independent sample φ_{k+1}^{(i)} from an augmentation distribution f(dφ_{k+1} | θ_k^{(i)}) and construct a new particle

\[
\tilde{\theta}_k^{(i)} = (\theta_k^{(i)\prime}, \phi_{k+1}^{(i)\prime})^{\prime},
\tag{19}
\]

where the joint distribution π_k f is absolutely continuous with respect to π_{k+1}. Rejuvenation then proceeds as described in Section 3.2, with θ_k^{(i)} replaced by θ̃_k^{(i)}, S_k replaced by S̃_k = {θ̃_k^{(i)}} and w_k^{(i)} replaced by

\[
\tilde{w}_k^{(i)} = \frac{\pi_{k+1}(d\tilde{\theta}_k^{(i)})}{\pi_k(d\theta_k^{(i)})\, f(d\phi_{k+1}^{(i)} \mid \theta_k^{(i)})},
\tag{20}
\]

where we adopt the convention that f(dφ_{k+1} | θ_k) ≡ 1 if θ̃_k = θ_k, i.e. if no augmentation is required.

In a Bayesian context, let π_{k+1}(dφ_{k+1} | θ_k^{(i)}) denote the conditional posterior distribution (under the model) of φ_{k+1}, i.e. the distribution of φ_{k+1} conditional on θ_k^{(i)} and on all the data up to time k + 1, X_{k+1}. Let π_k(dφ_{k+1} | θ_k^{(i)}) denote the conditional prior distribution, i.e. the distribution of φ_{k+1} conditioning on only θ_k^{(i)} and X_k. Both the conditional prior and the conditional posterior are natural choices for f(dφ_{k+1} | θ_k^{(i)}). Choosing the conditional prior for f leads to

\[
\tilde{w}_k^{(i)} = p(dX_{k+1} \mid X_k, \tilde{\theta}_k^{(i)}),
\]

i.e. the conditional likelihood. See the example in Section 5.2.
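A sketch of the augmentation step under the conditional-prior choice, where by equation (20) the incremental weight reduces to the conditional likelihood. The random-walk state transition and Gaussian observation likelihood below are assumptions for illustration, not the model of Section 5:

```python
import math
import random

def augment_and_weight(particles, z, sigma_state, sigma_obs, rng):
    """Augmentation step: extend each particle path with the new hidden
    state drawn from its conditional prior (here a Gaussian random-walk
    transition), so the unnormalized incremental weight is just the
    likelihood of the new observation z under the augmented particle."""
    out, weights = [], []
    for path in particles:
        new_state = rng.gauss(path[-1], sigma_state)  # conditional prior draw
        out.append(path + [new_state])
        # conditional likelihood p(z | augmented particle), unnormalized
        weights.append(math.exp(-0.5 * ((z - new_state) / sigma_obs) ** 2))
    return out, weights

rng = random.Random(3)
paths, ws = augment_and_weight([[0.0]] * 100, z=0.2,
                               sigma_state=0.5, sigma_obs=0.3, rng=rng)
```

These weights would then feed the resample step of Section 3.2 exactly as the w̃ of equation (8).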
If the data arriving at time k + 1 carry little information about φ_{k+1}, then the conditional prior may be a good choice, but if they carry substantial information about φ_{k+1} then most of the particles generated by this distribution would fall outside the region of high support for φ_{k+1} and would consequently receive a very low weight, with an adverse effect on the variance of the resulting estimates. A better choice in this case would be the conditional posterior, but this might not be easy to sample from. Moreover, the conditional posterior will typically be known only up to a θ_k-dependent normalizing 'constant'. In principle, any choice for f(dφ_{k+1} | θ_k) where the normalizing constant is known, and which is not too dissimilar to the conditional posterior distribution, will suffice (Berzuini et al., 1997; Liu and Chen, 1998).

4.2. Other dynamic simulation methods

Several other methods have been proposed for dynamic posterior simulation. The simplest of these is sequential importance sampling (Handschin and Mayne, 1969; Zaritskii et al., 1975; Kong et al., 1994), which corresponds to the resample–move algorithm without rejuvenation, and where the augmentation distribution f(dφ_{k+1} | θ_k) is either the conditional prior or conditional posterior distribution. Sequential importance sampling may perform poorly if the arriving information moves the posterior distribution π_t away from π_1, causing the importance weights w_{1t} to become concentrated on just a few particles of S_1. To help to overcome this, Kong et al. (1994) introduced rejuvenation through an importance resampling step (Rubin, 1988). This corresponds to the resample–move algorithm without the move step. The advantages and disadvantages of importance resampling in a dynamic context are discussed by Liu and Chen (1995). Similar resampling algorithms have been proposed.
The Bayesian bootstrap filter (Gordon et al., 1993), also known as the condensation algorithm (Isard and Blake, 1996) or likelihood weighting algorithm (Kanazawa et al., 1995), uses the conditional prior for the augmentation distribution and performs the resampling step before the augmentation step, to introduce more diversity to the particle base. Berzuini et al. (1997), Liu and Chen (1998) and Pitt and Shephard (1999) have introduced MCMC sampling at the augmentation stage, to facilitate sampling from the conditional posterior. These resampling methods have been shown to work well for hidden Markov models in which there are no unknown hyperparameters (e.g. a volatility parameter), and where inference about past states of the hidden process is not required. However, in other situations, these methods may perform poorly, as they provide no opportunity to generate new values for unknown quantities after their initial generation, even though information about them may continue to accrue. Consequently, as the target distribution drifts away from these initial values, the particle base may degenerate to contain few distinct values of these variables. See Section 5 for an illustration.

To introduce more diversity to the current set, Sutherland and Titterington (1994) proposed replacing the resampling step with resampling from a kernel density estimate constructed on the current set S_k. Similarly, West (1993) proposed resampling through adaptive importance sampling, where the bandwidth of the kernel density estimate is reduced on successive iterations of the adaptation to reduce the bias in the variance of the target distribution. Liu and Chen (1998) proposed sampling from a Rao–Blackwellized reconstruction of the target distribution, which is essentially a Gibbs sampling form of the 'move' step of the resample–move algorithm. Carpenter et al. (1998) have pointed out that MCMC methods may be used to move particles.
5. An illustration

Recall the bearings-only example of Section 2. In this section we shall describe an implementation of the resample–move algorithm on this example and then compare this algorithm with a standard sequential importance resampling (SIR) filter in terms of their performance in tracking a simulated ship's trajectory.

5.1. Dynamic modelling of ship tracking

At time t, on the arrival of a new observed bearing z_t, the parameter space expands to incorporate the unobserved ship velocity (ẋ_t, ẏ_t), and we want to generate an updated set of particles, the ith particle θ_t^{(i)} containing a realization of the ship's history up to time t, as described by equation (3). Notationally, for an array v, we let v_t denote restriction to elements corresponding to time t, and v_{h:k} denote restriction to elements associated with the time interval [h, k]. Thus, for example, x_{h:k} := {x_h, x_{h+1}, …, x_k}. At time t, the target is the posterior distribution of θ_t, which has the form

\[
\pi_t(\theta_t) = p(\theta_t \mid z_{1:t}) \propto p(\theta_t) \prod_{i=1}^{t} p(z_i \mid \theta_i),
\tag{21}
\]

where π_t(θ_t) denotes a density with respect to Lebesgue measure of θ_t, the conditional likelihood p(z_i | θ_i) is fully determined by the model specification and where the prior distribution p(θ_t) has structure

\[
p(\theta_t) \propto p(\theta_1)\, \exp( -0.5\, \sigma_1^{-2}\, \dot{y}_{1:t}^{\prime} C \dot{y}_{1:t} - 0.5\, \sigma_1^{-2}\, \dot{x}_{1:t}^{\prime} C \dot{x}_{1:t} ).
\tag{22}
\]

Of the two terms of the above product, the first, p(θ₁), represents our prior knowledge about the ship's initial conditions. The second term represents prior belief in smoothness of the ship's accelerations. C is the t × t tridiagonal symmetric matrix whose elements are all 0 except for those on the diagonals adjacent to the main diagonal, which are equal to −1, and those on the main diagonal, where C(i, i) = 1 if i = 1 or i = t, and otherwise C(i, i) = 2. This matrix is rank deficient, reflecting the fact that the exponent term in equation (22) is invariant with respect to global level changes of the co-ordinate and velocity vectors.
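The structure of C can be verified numerically: with −1 on the adjacent diagonals and (1, 2, …, 2, 1) on the main diagonal, the quadratic form v′Cv equals the sum of squared first differences of v, and is invariant to a global level shift, consistent with the stated rank deficiency. A small sketch:

```python
def smoothness_matrix(t):
    """Tridiagonal C of Section 5.1: diagonal 1 at the ends, 2 inside,
    and -1 on the adjacent diagonals, so v'Cv is the sum of squared
    first differences of v."""
    C = [[0.0] * t for _ in range(t)]
    for i in range(t):
        C[i][i] = 1.0 if i in (0, t - 1) else 2.0
        if i > 0:
            C[i][i - 1] = C[i - 1][i] = -1.0
    return C

def quad_form(C, v):
    n = len(v)
    return sum(C[i][j] * v[i] * v[j] for i in range(n) for j in range(n))

v = [0.3, 0.1, 0.4, 0.1, 0.5]
lhs = quad_form(smoothness_matrix(5), v)
rhs = sum((v[i] - v[i - 1]) ** 2 for i in range(1, 5))  # sum of squared increments
shifted = quad_form(smoothness_matrix(5), [vi + 10.0 for vi in v])  # level shift
```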
5.2. Applying the resample–move algorithm

We now discuss our implementation of the resample–move algorithm in the example, with particular attention on the augmentation and rejuvenation stages of the algorithm. In the augmentation stage at time t, each particle θ_{t−1}^{(i)} was augmented to θ̃_{t−1}^{(i)} by incorporating a realization of the pair (ẋ_t, ẏ_t), sampled from the conditional prior N(ẋ_{t−1}, σ₁²) N(ẏ_{t−1}, σ₁²). The augmented particle θ̃_{t−1}^{(i)} then entered the resampling step with weight proportional to p(z_t | θ̃_{t−1}^{(i)}). In the rejuvenation stage at time t we used the following three types of move:

(a) rescaling — all co-ordinate and velocity values representing the ship's trajectory, along both axes, are reduced or amplified by the same factor;

(b) local perturbation — the particle trajectory is perturbed locally, by moving a block of consecutive points of the trajectory to a new position, while leaving the remaining part of the trajectory unchanged;

(c) σ₁-move — update the value of σ₁.

The role of the rescaling move in the bearings-only problem is discussed in Carpenter et al. (1999). We implemented this move according to a Metropolis–Hastings scheme, using a symmetric proposal distribution. Details are given in Berzuini and Gilks (2000). The rescaling move at time t involves the entire length of the ship's trajectory up to time t and therefore its burden of computation is of the order O(t). However, we can perform the rescaling move progressively less frequently, with probability O(t^{−ε}) say, where ε > 0. Thus the rescaling move represents a burden of computation of the order O(t^{1−ε}).

The local perturbation move is applied to the ẋ-vector first, and then to the ẏ-vector. For any given block ẋ_{a:b}, currently at value ẋ⁰_{a:b}, a new candidate ẋ*_{a:b} can be drawn from the proposal distribution N(ẋ⁰_{a:b}, r C⁻¹_{a:b,a:b}), where C is described in Section 5.1, and parameter r controls the spread of the proposal distribution.
To complete the move, we then generate a candidate ẏ*_{a:b} for the block ẏ_{a:b} from a complex dependence proposal distribution that incorporates information about z_t and ẋ_{a:b}. Further details about this distribution are given in Berzuini and Gilks (2000). The candidate (ẋ*_{a:b}, ẏ*_{a:b}) is then either globally accepted or globally rejected. At stage t of the algorithm, it will often be sensible to update a block containing the tth time point of the trajectory, and subsequently to update less recent blocks, visited in a random order to avoid drift. If the updated blocks cover the entire length of the trajectory up to t without overlapping, the computational burden of the local perturbation move at the updating stage t will be O(t). However, because new data z_t will provide progressively less information on (ẋ_k, ẏ_k), for each fixed k < t, we can arrange to update (ẋ_k, ẏ_k) with diminishing frequency as t increases, such that the overall burden of computation for the local perturbation move at time t is of order O(t^δ), for some δ < 1.

The parameter σ₁ is updated by a standard Gibbs move, which involves at each updating stage t the entire sequence of length t of points in the trajectory. Because the full conditional distribution involves all elements of the trajectory up to time t, the computational complexity of updating σ₁ is O(t). However, because the marginal posterior distribution for σ₁ will stabilize over time, we can exploit information on σ₁ contained in the particles and update σ₁ progressively less frequently, with probability O(t^{−ε}) say, where ε > 0. Thus the burden of computation for the σ₁-update is of the order O(t^{1−ε}). However, this will impoverish the diversity of σ₁-values in S_t and will tend to reduce the precision of estimates of σ₁. The overall burden of computation of each rejuvenation step at time t, summing the separate contributions of the three types of move, is O(t^{1−ε}) + O(t^δ) + O(t^{1−ε}) < O(t).
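The 'standard Gibbs move' for σ₁ is not spelled out here (details are in Berzuini and Gilks (2000)). A hypothetical sketch under conjugacy assumptions: with a Gamma(a₀, b₀) prior on the precision τ = σ₁⁻² and m Gaussian velocity increments d, the full conditional is Gamma(a₀ + m/2, b₀ + Σd²/2); the prior and increments below are illustrative only:

```python
import random

def gibbs_update_precision(increments, a0, b0, rng):
    """Hypothetical conjugate Gibbs move for the volatility
    hyperparameter: draw tau = 1/sigma1^2 from its Gamma full
    conditional given the current velocity increments."""
    m = len(increments)
    shape = a0 + 0.5 * m
    rate = b0 + 0.5 * sum(d * d for d in increments)
    # random.gammavariate takes (shape, scale), so scale = 1/rate
    return rng.gammavariate(shape, 1.0 / rate)

rng = random.Random(4)
draws = [gibbs_update_precision([0.1, -0.2, 0.05, 0.15], a0=2.0, b0=1.0, rng=rng)
         for _ in range(20000)]
mean_tau = sum(draws) / len(draws)  # should be near shape/rate = 4/1.0375
```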
By comparison, in a typical MCMC algorithm all the model parameters would need to be updated at every iteration, giving a burden of computation for one iteration of MCMC sampling at time $t$ equal to $O(t) + O(t) + O(t) = O(t)$, summing the separate contributions of the rescaling move, the local perturbation move and the $\tau$-update.

5.3. Experiment
A series of 150 observations was generated from model (1)-(2), with initial conditions $x_1 = 0.01$, $y_1 = 20$, $\dot x_1 = 0.002$ and $\dot y_1 = 0.06$, and by fixing $\tau = 0.000\,001$ and $\rho = 0.005$. These data are shown in Fig. 1. The simulated data were sequentially incorporated into the resample–move filter. In running the algorithm, we chose $p(\theta_1)$ to represent a proper informative prior on the ship's initial conditions, slightly displaced with respect to the true value of $\tau$ and with respect to the initial co-ordinates $x_1$ and $y_1$, but not with respect to the true initial velocities $\dot x_1$ and $\dot y_1$. Details of this prior, as well as of other aspects of this experiment, can be found in Berzuini and Gilks (2000). We set $\rho$ equal to 0.005. The number of particles was kept equal to 100 over the entire filtering process.

On the same data, with the same number of particles, we ran a standard sampling-importance resampling (SIR) filter. This filter was developed by Gordon et al. (1993), on the basis of principles contained in Rubin (1988), as an alternative to MCMC sampling in dynamic systems. The SIR filter can be described as a special case of our resample–move algorithm in which there is no move step, so that no parameter value is changed once it has been sampled.

5.4. Results
Fig. 2 compares the performance of the resample–move algorithm and SIR on the same data, using the same number of particles. In Fig. 2, $x$ and $y$ represent east and north respectively. Fig. 2(a) shows results from the resample–move algorithm and Fig. 2(b) the results from SIR. In both plots the ship's true trajectory is represented by a chain of circles, with a filled square indicating the final position.
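The structural point made above, that SIR is the special case of resample–move with no move step, can be illustrated on a deliberately simplified model (Python sketch; the model, prior and proposal below are our own illustrative choices, not the bearings-only setup of Section 5.1). Repeated resampling can only delete values of a static parameter, whereas an added Metropolis–Hastings move rejuvenates their diversity:

```python
import numpy as np

def run_filter(data, n_part=200, move=False, rng=None):
    """Toy comparison of SIR with and without a rejuvenation move.

    Illustrative model only: y_t ~ N(0, 1/tau) with static precision
    tau and a Gamma(a, b) prior.  Each particle carries one value of
    tau; at each stage particles are reweighted by the new likelihood
    term and multinomially resampled.  With move=True, a per-particle
    Metropolis-Hastings random walk on log(tau), targeting the current
    posterior, rejuvenates the particle set after resampling.
    """
    rng = np.random.default_rng(1) if rng is None else rng
    a, b = 2.0, 2.0
    tau = rng.gamma(a, 1.0 / b, size=n_part)        # draws from the prior
    n_obs, sse = 0, 0.0                             # sufficient statistics
    for y in data:
        n_obs += 1
        sse += y * y
        logw = 0.5 * np.log(tau) - 0.5 * tau * y * y   # N(y; 0, 1/tau) up to a constant
        w = np.exp(logw - logw.max())
        idx = rng.choice(n_part, size=n_part, p=w / w.sum())
        tau = tau[idx]                              # importance resampling
        if move:
            # MH move on u = log(tau); the target includes the Jacobian term e^u.
            def logpost(u):
                return (a + 0.5 * n_obs) * u - np.exp(u) * (b + 0.5 * sse)
            u = np.log(tau)
            u_prop = u + rng.normal(0.0, 0.3, size=n_part)
            accept = np.log(rng.uniform(size=n_part)) < logpost(u_prop) - logpost(u)
            tau = np.where(accept, np.exp(u_prop), tau)
    return tau
```

Run on the same data, the pure SIR filter typically retains only a handful of distinct values of the static parameter, whereas the moved particle set remains diverse, mirroring the behaviour seen in Fig. 3.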
In both plots the output of the filter is represented by a piecewise linear curve whose break points represent the estimated means of the filtering distribution of $(x_t, y_t)$, for $t = 1, 2, \ldots$, with arrows indicating time ordering. At time $t = 1$ the estimated position of the ship differs markedly from the truth, as a consequence of the prior. Then the resample–move method uses the information contained in the incoming data to provide a fair particle approximation of the evolving path, whereas problems of particle impoverishment attend the SIR filter. Note that the piecewise linear curve in each plot does not represent a realization of the ship's trajectory, but rather the sequence of filtered means.

Fig. 2. Comparing the performances of (a) the resample–move method and (b) a standard SIR filter: circles, the ship's true trajectory; filled square, final position; full curve, the filter's output (see the text for further explanation)

Fig. 3 compares the performance of the resample–move method and SIR in estimating the static parameter $\tau$. Fig. 3(a) suggests that the resample–move algorithm satisfactorily tracks the evolution of the posterior mean of $\tau$ until, in the end, this mean comes very close to the true value. By contrast, according to Fig. 3(b), SIR appears unable to track the posterior mean satisfactorily. The problem with SIR is that the set of particles propagated is left without low values of $\tau$ after a few resampling stages. Such a loss prevents the method from adapting to the gradual decline of the posterior mean of $\tau$. Of course, this also means that forecasts of the future path of the ship would correspondingly underestimate the uncertainty involved. The superiority of the resample–move algorithm from this point of view lies in the fact that the sampled values of $\tau$ are occasionally moved. Most methods proposed in the literature, including the algorithms proposed by Liu and Chen (1998), Berzuini et al. (1997) and Gordon et al.
(1993), lack the ability to update the sampled values of the static parameters once they have been sampled, unless ad hoc modifications are employed. Therefore we expect that the performance of these methods in estimating $\tau$ in our bearings-only example would not differ much from that of SIR.

6. Discussion
The resample–move algorithm provides a framework for dynamic Monte Carlo Bayesian inference and prediction. In applying this framework there are many implementational details which must be decided, in particular the particle set sizes $n_k$, when to rejuvenate, which parameters to move and which Markov chain kernels to use to obtain rapid mixing. We cannot at this stage make general recommendations regarding these choices, and in most applications some experimentation will be required. Indeed, the implementation of the resample–move algorithm may be more demanding than that of MCMC sampling, as an effective strategy must be devised for a wide collection of target distributions.

Fig. 3. Comparing the performances of (a) the resample–move method and (b) SIR in estimating the static parameter $\tau$

The theory of Section 3.4 provides a CLT for large particle set sizes $n_t$. It does not comment on limiting behaviour at large times $t$. This would depend on the rate and type of information arriving, and on whether this information is accompanied by new unobserved variables. This area deserves further research.

As noted in Section 1, the target distribution $\pi_t$ need not be a posterior distribution. For example, if we are concerned with events in the tails of the process, then the resample–move algorithm could be run with $\pi_t$ replaced by some heavier-tailed sequence of distributions, with an appropriate adjustment to the importance weights.

Acknowledgements
We thank Andrew Gelman for insightful comments on our methodology and the Consiglio Nazionale delle Ricerche Institute of Numerical Analysis of Pavia for computing support.
We also are indebted to three referees, each of whom provided exceptionally incisive and constructive reports on our original manuscript.

Appendix A
Throughout we use $k$, $l$ and $t$ to denote non-negative integers. Define $q_0 = \pi_1$. Let $E_{q_k}$ denote expectation with respect to $q_k(\theta_k, \cdot)$. Thus $E_{q_k}g$ is a function of $\theta_k$, although we suppress this dependence in the notation. For any $k$ and $t$ such that $0 \le k \le t$ and $t > 0$, define

$$\tilde g_{kt} = \begin{cases} \displaystyle \sum_{j=1}^{n_k} I_{kt}g(\theta_k^{(j)}) \Big/ \sum_{j=1}^{n_k} I_{kt}1(\theta_k^{(j)}), & 1 \le k \le t, \\ E_t g, & k = 0,\ t > 0. \end{cases} \qquad (23)$$

Note that this is consistent with the definition of $\tilde g_t$ given by equation (9), i.e. $\tilde g_t = \tilde g_{tt}$. Let $E_k$ denote expectation over sampling stage $k+1$, conditional on the particles in $S_k$. Then

$$E_k g = \begin{cases} \displaystyle \sum_{j=1}^{n_k} w_k^{(j)} \int g(\theta_{k+1})\, q_k(\theta_k^{(j)}, d\theta_{k+1}) \Big/ \sum_{j=1}^{n_k} w_k^{(j)}, & k \ge 1, \\ E_{\pi_1} g, & k = 0. \end{cases} \qquad (24)$$

Finally, let '$\Rightarrow$' denote convergence in distribution and '$\to_p$' denote convergence in probability. Let $n_1, \ldots, n_t \to \infty$ denote taking first $n_1 \to \infty$, then $n_2 \to \infty$, and so on.

Remark 1. We state, without proof, the algebraic identities

$$I_{kl}\{I_{lt}g\} = I_{kt}g, \qquad k \le l \le t, \qquad (25)$$
$$E_k g = \tilde g_{k,k+1}, \qquad k \ge 0. \qquad (26)$$

Lemma 1. Let

$$q_k^R(\theta_{k+1}, d\theta_k) = \frac{\pi_k(d\theta_k)\, q_k(\theta_k, d\theta_{k+1})}{\pi_{k+1}(d\theta_{k+1})}. \qquad (27)$$

Then $q_k^R$ is the reverse Markov chain transition kernel corresponding to $q_k$, and it transforms $\pi_{k+1}$ into $\pi_k$.

Proof. $\int_{\theta_k \in \Theta} q_k^R(\theta_{k+1}, d\theta_k) = 1$, using equation (5). Therefore $q_k^R(\theta_{k+1}, \cdot)$ is a probability measure for all $\theta_{k+1} \in \Theta$. Moreover, from equation (27),

$$\int_{\theta_{k+1} \in \Theta} \pi_{k+1}(d\theta_{k+1})\, q_k^R(\theta_{k+1}, d\theta_k) = \int_{\theta_{k+1} \in \Theta} q_k(\theta_k, d\theta_{k+1})\, \pi_k(d\theta_k) = \pi_k(d\theta_k).$$

Therefore $q_k^R$ transforms $\pi_{k+1}$ into $\pi_k$.

Corollary 1. For $1 \le k \le t$,

$$I_{kt}g(\theta_k) = \begin{cases} \displaystyle \int_{(\theta_{k+1}, \ldots, \theta_t) \in \Theta^{t-k}} g(\theta_t)\, \frac{d\pi_t}{d\pi_k}(\theta_t) \prod_{l=k+1}^{t} q_{l-1}^R(\theta_l, d\theta_{l-1}), & k < t, \\ g(\theta_t), & k = t. \end{cases}$$

Proof. Substitute equation (27) into expression (10).

Remark 2. The following two identities follow directly from corollary 1:

$$E_k(I_{kt}g) = E_t g, \qquad 0 < k \le t, \qquad (28)$$
$$\frac{E_k(I_{k+1,t}g)}{E_k(I_{k+1,t}1)} = \tilde g_{kt}, \qquad 0 \le k < t. \qquad (29)$$

Lemma 2. Suppose that $q_k$ is perfectly mixing, i.e. $q_k(\theta_k, d\theta_{k+1}) = \pi_{k+1}(d\theta_{k+1})$. Then $V^*_{l-1,t}(g) = 0$ for all $l \le k \le t$.

Proof.
Substituting $\pi_{k+1}$ for $q_k$ in expression (10) leads to

$$I_{kt}h(\theta_k) = w_k(\theta_k)\, E_{\pi_{k+1}}(I_{k+1,t}h) = w_k(\theta_k)\, E_t h, \qquad (30)$$

by equation (28). Now, from equations (11) and (25), for $l \le k$,

$$V^*_{l-1,t}(g) = E_l[\{I_{lt}g - (E_t g)\, I_{lt}1\}^2] = E_l([I_{lk}\{I_{kt}g - (E_t g)\, I_{kt}1\}]^2) = E_l([I_{lk}\{w_k\, (E_t g - E_t g)\}]^2),$$

from equation (30). But $E_t g - E_t g = 0$. Hence $V^*_{l-1,t}(g) = 0$ for all $l \le k$.

Lemma 3. For $1 \le k \le t$, suppose that

(a) $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}(I^2_{kt}g) < \infty$ and
(b) $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}(I^2_{kt}1) < \infty$,

where the suprema should be ignored for $k = 1$. Then

$$\frac{\tilde g_{kt} - \tilde g_{k-1,t}}{\{(1/n_k)\, V_{k-1,t}(g)\}^{1/2}} \Rightarrow N(0, 1)$$

as $n_k \to \infty$, where

$$V_{k-1,t}(g) = \frac{1}{E^2_{k-1}(I_{kt}1)} \left\{ E_{k-1}(I^2_{kt}g) - 2\, \frac{E_{k-1}(I_{kt}g)}{E_{k-1}(I_{kt}1)}\, E_{k-1}(I_{kt}g \cdot I_{kt}1) + \frac{E^2_{k-1}(I_{kt}g)}{E^2_{k-1}(I_{kt}1)}\, E_{k-1}(I^2_{kt}1) \right\}. \qquad (31)$$

Proof. For $0 < k \le t$, $\tilde g_{kt}$ has the form of a ratio of sums of conditionally independent, identically distributed, random variables. Lemma 3 then follows from the delta method, as in the lemma of Berzuini et al. (1997), noting that $\tilde g_{k-1,t}$ is the ratio of the expected values of these sums, from equation (29). The integrability conditions of the lemma of Berzuini et al. (1997) hold since $E_{k-1}(I^2_{kt}g)$ has the form of a weighted average, from expression (24), and is therefore less than or equal to $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}(I^2_{kt}g)$, which is finite by assumption; similarly for $E_{k-1}(I^2_{kt}1)$.

Lemma 4. Suppose that conditions (a) and (b) of lemma 3 hold for all $k = 1, \ldots, t$. Then $E_{t-1}g \Rightarrow E_t g$ as $n_1, \ldots, n_{t-1} \to \infty$.

Proof. From lemma 3, $\tilde g_{kt} \Rightarrow \tilde g_{k-1,t}$ as $n_k \to \infty$, for all $k \le t$. Therefore, by induction, $\tilde g_{kt} \Rightarrow \tilde g_{0t}$ as $n_1, \ldots, n_k \to \infty$, and in particular $\tilde g_{t-1,t} \Rightarrow \tilde g_{0t}$ as $n_1, \ldots, n_{t-1} \to \infty$. Now, from equation (26), $\tilde g_{t-1,t} = E_{t-1}g$, and $\tilde g_{0t} = E_t g$ by definition (23): hence the result.

Lemma 5. Suppose that, for given $l$ and $t$ satisfying $1 < l \le t$, and for all $k < l$, the following conditions hold:

(a) $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}[I^2_{kl}\{I^2_{lt}g\}] < \infty$,
(b) $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}[I^2_{kl}\{I^2_{lt}1\}] < \infty$,
(c) $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}[I^2_{kl}\{I_{lt}g \cdot I_{lt}1\}] < \infty$,
(d) $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}[I^2_{kl}\{I_{lt}g\}] < \infty$,
(e) $\sup_{\theta_{k-1} \in \Theta} E_{q_{k-1}}[I^2_{kl}\{I_{lt}1\}] < \infty$.
Then $V_{l-1,t}(g) \Rightarrow V^*_{l-1,t}(g)$ as $n_1, \ldots, n_{l-1} \to \infty$, where $V_{l-1,t}(g)$ is given by equation (31) and $V^*_{l-1,t}(g)$ is given by equation (11).

Proof. The conditions ensure that $E_{l-1}(I^2_{lt}g) \Rightarrow E_l(I^2_{lt}g)$ as $n_1, \ldots, n_{l-1} \to \infty$, from lemma 4. Similar results follow for $E_{l-1}(I^2_{lt}1)$, etc. Substituting these results into equation (31) gives

$$V^*_{l-1,t}(g) = E_l(I^2_{lt}g) - 2\, E_l(I_{lt}g \cdot I_{lt}1)\, E_t g + E_l(I^2_{lt}1)\, E^2_t g,$$

using equation (28). Rearranging this expression gives $V^*_{l-1,t}(g) = E_l[\{I_{lt}g - (E_t g)\, I_{lt}1\}^2]$, as in equation (11).

Theorem 2. Assume that conditions (a)-(e) of lemma 5 hold for each $l = 1, \ldots, t$. Then, for any $k = 1, \ldots, t$,

$$z_{kt} = \frac{\tilde g_{kt} - E_t g}{\surd\{\sum_{l=1}^{k} (1/n_l)\, V^*_{l-1,t}(g)\}} \Rightarrow N(0, 1)$$

as $n_1, \ldots, n_k \to \infty$.

Proof. Let

$$z'_{kt} = \frac{\tilde g_{kt} - \tilde g_{k-1,t}}{\{(1/n_k)\, V_{k-1,t}(g)\}^{1/2}}, \qquad \lambda_{k-1,t} = \frac{V_{k-1,t}(g)}{V^*_{k-1,t}(g)}, \qquad r^2_{k-1,t} = \frac{(1/n_k)\, V^*_{k-1,t}(g)}{\sum_{l=1}^{k} (1/n_l)\, V^*_{l-1,t}(g)}.$$

Then

$$z_{kt} = z'_{kt}\, r_{k-1,t}\, \surd\lambda_{k-1,t} + z_{k-1,t}\, \surd(1 - r^2_{k-1,t}). \qquad (32)$$

Now $r_{k-1,t}$ is a constant and $\lambda_{k-1,t}$ does not depend on $S_k$. So, for any constant $a$, letting $i$ denote the imaginary unit,

$$E_{k-1}\{\exp(i a z_{kt})\} = \exp\{i a z_{k-1,t}\, \surd(1 - r^2_{k-1,t})\}\, E_{k-1}[\exp\{i a z'_{kt}\, r_{k-1,t}\, \surd\lambda_{k-1,t}\}] \to \exp\{i a z_{k-1,t}\, \surd(1 - r^2_{k-1,t}) - \tfrac{1}{2}(a\, r_{k-1,t}\, \surd\lambda_{k-1,t})^2\} \qquad (33)$$

as $n_k \to \infty$, by the CLT for $z'_{kt}$ in lemma 3. Note that conditions (a) and (b) of lemma 3 are a special case of the conditions for theorem 2, obtained with $k = l$. Applying lemma 5 to expression (33), we obtain

$$E_{k-1}\{\exp(i a z_{kt})\} \to \exp\{i a z_{k-1,t}\, \surd(1 - r^2_{k-1,t}) - \tfrac{1}{2} a^2 r^2_{k-1,t}\} \qquad (34)$$

as $n_1, \ldots, n_k \to \infty$. Now suppose that the assertion of this theorem is true for $k = l - 1$. Since $|E_{l-1}\{\exp(i a z_{lt})\}| \le E_{l-1}|\exp(i a z_{lt})| = 1$, we have, by the bounded convergence theorem (e.g. Billingsley (1986), theorem 16.5), as $n_1, \ldots, n_l \to \infty$,

$$E[E_{l-1}\{\exp(i a z_{lt})\}] \to E[\exp\{i a z_{l-1,t}\, \surd(1 - r^2_{l-1,t}) - \tfrac{1}{2} a^2 r^2_{l-1,t}\}] \to \exp(-\tfrac{1}{2} a^2),$$

by the assertion. Therefore, by the continuity theorem (e.g. Billingsley (1986), theorem 26.3), $z_{lt} \Rightarrow N(0, 1)$ as $n_1, \ldots, n_l \to \infty$. Therefore, if the assertion is true for $k = l - 1$, it is also true for $k = l \le t$.
Finally, for $k = 1$, $z_{kt} = z'_{kt} \Rightarrow N(0, 1)$ as $n_1 \to \infty$, by lemma 3. Hence, by induction, the assertion is true for all $k = 1, \ldots, t$.

Corollary 2. With the conditions of theorem 2,

$$\frac{\tilde g_t - E_t g}{\surd\{\sum_{l=1}^{t} (1/n_l)\, V^*_{l-1,t}(g)\}} \Rightarrow N(0, 1)$$

as $n_1, \ldots, n_t \to \infty$, where $\tilde g_t$ is defined in equation (9).

Proof. Put $k = t$ in theorem 2, noting that $\tilde g_t = \tilde g_{tt}$.

Corollary 3. For any $k \le t$, $n_k^{-1} \sum_j w_{kt}^{(j)} \Rightarrow 1$ as $n_1, \ldots, n_k \to \infty$.

Proof. Replace $g$ by $w_{kt}$ in corollary 2.

Lemma 6. For $k \le l$, define

$$Y^{(j)}_{kl}(g) = \frac{1}{n_l} \sum_{m=1}^{n_l} C^{(j,m)}_{kl}\, g(\theta_l^{(m)}), \qquad (35)$$

where $C^{(j,m)}_{kl}$ is defined in expression (14). Then $Y^{(j)}_{k,k+1}(g) \Rightarrow n_k^{-1}\, I_{k,k+1}g(\theta_k^{(j)})$ as $n_1, \ldots, n_{k+1} \to \infty$.

Proof. By the law of large numbers,

$$Y^{(j)}_{k,k+1}(g) \Rightarrow E\{C^{(j,m)}_{k,k+1}\, g(\theta_{k+1}^{(m)})\} = \frac{w_k^{(j)} \int g(\theta_{k+1})\, q_k(\theta_k^{(j)}, d\theta_{k+1})}{\sum_{l=1}^{n_k} w_k^{(l)}} = \frac{I_{k,k+1}g(\theta_k^{(j)})}{\sum_{l=1}^{n_k} w_k^{(l)}}$$

as $n_{k+1} \to \infty$. Now $n_k^{-1} \sum_{l=1}^{n_k} w_k^{(l)} \Rightarrow 1$ as $n_1, \ldots, n_k \to \infty$, by corollary 3: hence the result.

Lemma 7. For any $k \le l$, $Y^{(j)}_{kl}(g) \Rightarrow n_k^{-1}\, I_{kl}g(\theta_k^{(j)})$ as $n_1, \ldots, n_l \to \infty$, where $Y^{(j)}_{kl}(g)$ is defined in equation (35).

Proof. The assertion is true trivially for $k = l$. Suppose that the assertion is true for a given $k$, $2 \le k \le l$. Noting that

$$C^{(j,m)}_{k-1,l} = \sum_{i=1}^{n_k} C^{(j,i)}_{k-1,k}\, C^{(i,m)}_{kl},$$

it follows that

$$Y^{(j)}_{k-1,l}(g) = \sum_{i=1}^{n_k} C^{(j,i)}_{k-1,k}\, Y^{(i)}_{kl}(g) \Rightarrow \sum_{i=1}^{n_k} C^{(j,i)}_{k-1,k}\, \frac{1}{n_k}\, I_{kl}g(\theta_k^{(i)})$$

as $n_1, \ldots, n_l \to \infty$, by the assertion,

$$\Rightarrow \frac{1}{n_{k-1}}\, I_{k-1,k}\{I_{kl}g\}(\theta_{k-1}^{(j)})$$

as $n_1, \ldots, n_l \to \infty$, by lemma 6 (with $g$ replaced by $I_{kl}g$),

$$= \frac{1}{n_{k-1}}\, I_{k-1,l}g(\theta_{k-1}^{(j)}),$$

by equation (25). Thus, if the assertion is true for a given $k$, it is also true for $k - 1$. Hence, by induction, the assertion is true for all positive $k \le l$.

Lemma 8. For any $k \le t$, $\hat I^{(j)}_{kt}g \Rightarrow I_{kt}g(\theta_k^{(j)})$ as $n_1, \ldots, n_t \to \infty$, where $\hat I^{(j)}_{kt}g$ is defined in equation (15).

Proof. The result is obtained by applying lemma 7 to equation (15).

Lemma 9. For $l \le t$, $\hat V^*_{l-1,t}(g) \Rightarrow V^*_{l-1,t}(g)$ as $n_1, \ldots, n_t \to \infty$, where $\hat V^*_{l-1,t}(g)$ is defined in equation (16) and $V^*_{l-1,t}(g)$ is defined in equation (11).

Proof.

$$\hat V^*_{l-1,t}(g) \Rightarrow \frac{1}{n_l} \sum_{j=1}^{n_l} \{I_{lt}g(\theta_l^{(j)}) - (E_t g)\, I_{lt}1(\theta_l^{(j)})\}^2$$

as $n_1, \ldots, n_t \to$
$\infty$, by corollary 2 and lemma 8,

$$\Rightarrow E_l[\{I_{lt}g - (E_t g)\, I_{lt}1\}^2]$$

as $n_1, \ldots, n_t \to \infty$, by corollary 2. The result then follows with equation (11).

Theorem 3. $\hat V_t(g)/\mathrm{var}(\tilde g_t) \Rightarrow 1$ as $n_1, \ldots, n_t \to \infty$, where $\hat V_t(g)$ is defined in equation (17).

Proof. From equation (17),

$$\frac{\hat V_t(g)}{\mathrm{var}(\tilde g_t)} = \frac{\sum_{l=1}^{t} n_l^{-1}\, \hat V^*_{l-1,t}(g)}{\mathrm{var}(\tilde g_t)} \Rightarrow \frac{\sum_{l=1}^{t} n_l^{-1}\, V^*_{l-1,t}(g)}{\sum_{l=1}^{t} n_l^{-1}\, V^*_{l-1,t}(g)} = 1$$

as $n_1, \ldots, n_t \to \infty$, by lemma 9 and corollary 2.

References
Bergman, N. (1999) Recursive Bayesian estimation: navigation and tracking applications. Dissertation 579. Department of Electrical Engineering, Linköping University, Linköping.
Berzuini, C., Best, N., Gilks, W. R. and Larizza, C. (1997) Dynamic conditional independence models and Markov chain Monte Carlo methods. J. Am. Statist. Ass., 92, 1403–1412.
Berzuini, C. and Gilks, W. R. (2000) Resample–move filtering with cross-model jumps. In Sequential Monte Carlo Methods in Practice (eds A. Doucet, J. F. G. de Freitas and N. J. Gordon). New York: Springer. To be published.
Billingsley, P. (1986) Probability and Measure. New York: Wiley.
Carpenter, J. R., Clifford, P. and Fearnhead, P. (1998) Building robust simulation-based filters for evolving data sets. Technical Report. Department of Statistics, University of Oxford, Oxford.
—(1999) Improved particle filter for nonlinear problems. IEE Proc. F, 146, 2–7.
Del Moral, P. and Guionnet, A. (1999) A central limit theorem for non-linear filtering using interacting particle systems. Ann. Appl. Probab., 9, 275–297.
Doucet, A. (1998) On sequential simulation-based methods for Bayesian filtering. Technical Report. University of Cambridge, Cambridge.
Geweke, J. (1989) Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57, 1317–1339.
Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (eds) (1996) Markov Chain Monte Carlo in Practice. London: Chapman and Hall.
Gordon, N. J., Salmond, D. J. and Smith, A. F. M.
(1993) Novel approach to non-linear non-Gaussian Bayesian state estimation. IEE Proc. F, 140, 107–113.
Handschin, J. E. and Mayne, D. Q. (1969) Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering. Int. J. Control, 9, 547–559.
Isard, M. and Blake, A. (1996) Contour tracking by stochastic propagation of conditional density. In Computer Vision: Proc. ECCV '96 (eds B. Buxton and R. Cipolla). New York: Springer.
Kanazawa, K., Koller, D. and Russell, S. (eds) (1995) Proc. Conf. Uncertainty in Artificial Intelligence. San Mateo: Morgan Kaufmann.
Kong, A., Liu, J. S. and Wong, W. H. (1994) Sequential imputations and Bayesian missing data problems. J. Am. Statist. Ass., 89, 278–288.
Liu, J. S. and Chen, R. (1995) Blind deconvolution via sequential imputation. J. Am. Statist. Ass., 90, 567–576.
—(1998) Sequential Monte Carlo methods for dynamic systems. J. Am. Statist. Ass., 93, 1032–1044.
Pitt, M. K. and Shephard, N. (1999) Filtering via simulation: auxiliary particle filters. J. Am. Statist. Ass., to be published.
Robert, C. and Casella, G. (1999) Monte Carlo Statistical Methods. New York: Springer.
Rubin, D. B. (1988) Using the SIR algorithm to simulate posterior distributions. In Bayesian Statistics 3 (eds J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith), pp. 395–402. Oxford: Oxford University Press.
Shiga, T. and Tanaka, H. (1985) Central limit theorem for a system of Markovian particles with mean field interaction. Z. Wahrsch. Ver. Geb., 69, 439–459.
Smith, A. F. M. and Gelfand, A. E. (1992) Bayesian statistics without tears: a sampling–resampling perspective. Am. Statistn, 46, 84–88.
Sutherland, A. I. and Titterington, D. M. (1994) Bayesian analysis of image sequences. Technical Report 94-3. Department of Statistics, University of Glasgow, Glasgow.
Tierney, L. (1994) Markov chains for exploring posterior distributions (with discussion). Ann. Statist., 22, 1701–1786.
West, M.
(1993) Mixture models, Monte Carlo, Bayesian updating and dynamic models. In Proc. 24th Symp. Computing Science and Statistics, pp. 325–333. Fairfax Station: Interface Foundation of North America.
Zaritskii, V. S., Shimelevich, L. I. and Svetnik, V. B. (1975) Monte Carlo technique in problems of optimal data processing. Autom. Remote Control, 12, 95–103.