
Bayesian Inference in Econometric Models Using Monte Carlo Integration
Author(s): John Geweke
Source: Econometrica, Vol. 57, No. 6 (Nov., 1989), pp. 1317-1339
Published by: The Econometric Society
Stable URL: http://www.jstor.org/stable/1913710
Accessed: 14/08/2010 13:30
BAYESIAN INFERENCE IN ECONOMETRIC MODELS USING
MONTE CARLO INTEGRATION
BY JOHN GEWEKE¹
Methods for the systematic application of Monte Carlo integration with importance
sampling to Bayesian inference in econometric models are developed. Conditions under
which the numerical approximation of a posterior moment converges almost surely to the
true value as the number of Monte Carlo replications increases, and the numerical accuracy
of this approximation may be assessed reliably, are set forth. Methods for the analytical
verification of these conditions are discussed. Importance sampling densities are derived
from multivariate normal or Student t approximations to local behavior of the posterior
density at its mode. These densities are modified by automatic rescaling along each axis.
The concept of relative numerical efficiency is introduced to evaluate the adequacy of a
chosen importance sampling density. The practical procedures based on these innovations
are illustrated in two different models.
KEYWORDS: Importance sampling, numerical integration, Markov chain model, ARCH linear model.
1. INTRODUCTION
ECONOMETRIC MODELS are usually expressed in terms of an unknown vector of parameters θ ∈ Θ ⊆ R^k, which fully specifies the joint probability distribution of the observations X = {x_1, ..., x_T}. In most cases there exists a probability density function f(X|θ), and classical inference then often proceeds from the likelihood function L(θ) = f(X|θ). The asymptotic behavior of the likelihood
function is well understood, and as a consequence there is a well developed set of
tools with which problems of computation and inference can be approached;
Quandt (1983) and Engle (1984) provide useful surveys. The analytical problems
in a new model are often far from trivial, but there are typically several
approaches that can be explored systematically with the realistic anticipation that
one or more will lead to classical inference procedures with an asymptotic
justification.
Bayesian inference proceeds from the likelihood function and prior information, which is usually expressed as a probability density function over the parameters, π(θ), it being implicit that π(θ) depends on the conditioning set of prior information. The posterior distribution is proportional to p(θ) = π(θ)L(θ). This formulation restricts the prior probability measure for θ to be absolutely continuous with respect to Lebesgue measure. Extensions that allow concentrations of prior probability mass are generally straightforward but notationally cumbersome, and are postponed to the concluding section. Most Bayesian inference problems can be expressed as the evaluation of the expectation of a
¹Financial support for this work was provided by NSF Grant SES-8605867, and by a grant from the Pew Foundation to Duke University. This work benefited substantially from suggestions by an editor, Jean-Francois Richard, Herman van Dijk, Robert Wolpert, and two anonymous referees, who bear no responsibility for any errors of commission or omission.
function of interest g(θ) under the posterior,

(1)  E[g(θ)] = ∫_Θ g(θ)p(θ) dθ / ∫_Θ p(θ) dθ.
Approaches to this problem are nowhere near as systematic, methodical, or general as are those to classical inference problems: classical inference is carried out routinely using likelihood functions for which the evaluation of (1) is at best a major and dubious undertaking and for most practical purposes impossible. If the integration in (1) is undertaken analytically then the range of likelihood functions that can be considered is small, and the class of priors and functions of interest that can be considered is severely restricted. Many numerical approaches, like quadrature methods, require special adaptation for each g, π, or L, and become unworkable if k exceeds, say, three. (Good surveys of these methods may be found in Davis and Rabinowitz, 1975; Hammersley and Handscomb, 1979; and Rubinstein, 1981.)
With the advent of powerful and cheap computing, numerically intensive methods for the computation of (1) have become more attractive. Monte Carlo integration with importance sampling provides a systematic approach that can, in principle, be applied in any situation in which E[g(θ)] exists, and is practical for large values of k. It was discussed by Hammersley and Handscomb (1964, Section 5.4) and brought to the attention of econometricians by Kloek and van Dijk (1978). The main idea is simple. Let {θ_i} be a sequence of k-dimensional random vectors, i.i.d. in theory and in practice generated synthetically and therefore a very good approximation to an i.i.d. sequence. If the probability density function of the θ_i is proportional to p(θ), then n^{-1} ∑_{i=1}^n g(θ_i) → E[g(θ)]; if it is proportional to L(θ), then ∑_{i=1}^n g(θ_i)π(θ_i) / ∑_{i=1}^n π(θ_i) → E[g(θ)]. (Except in Section 4, all convergence is in n, the number of Monte Carlo replications. Almost sure convergence is indicated "→", convergence in distribution "⇒".) Only in a very limited set of simple cases is it feasible to generate synthetic variates whose p.d.f. is in proportion to the posterior or the likelihood function (but the set does include common and interesting cases in which classical and analytical Bayesian inference fail; see Geweke, 1986). More generally, suppose that the probability density function of the θ_i is I(θ), termed the importance sampling density. Then, under very weak assumptions given in Section 2,
(2)  ḡ_n ≡ [∑_{i=1}^n g(θ_i)p(θ_i)/I(θ_i)] / [∑_{i=1}^n p(θ_i)/I(θ_i)] → E[g(θ)].
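The estimator in (2) is straightforward to implement. A minimal sketch in Python (the function name and the toy densities are illustrative, not from the paper): the posterior kernel p and importance density I enter only through their log ratio, so neither need be normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_bar_n(log_p, log_I, g, theta):
    """Importance sampling approximation (2) to E[g(theta)]:
    g_n = sum_i g(theta_i) w_i / sum_i w_i with w_i = p(theta_i)/I(theta_i).
    log_p and log_I may omit normalizing constants: any constant cancels."""
    log_w = log_p(theta) - log_I(theta)
    w = np.exp(log_w - log_w.max())      # rescale to avoid overflow
    return np.sum(g(theta) * w) / np.sum(w)

# Toy check (not from the paper): posterior kernel N(1, 1),
# importance density N(0, 2^2), g(theta) = theta, so E[g(theta)] = 1.
theta = rng.normal(0.0, 2.0, 100_000)
log_p = lambda t: -0.5 * (t - 1.0) ** 2
log_I = lambda t: -0.5 * (t / 2.0) ** 2
est = g_bar_n(log_p, log_I, lambda t: t, theta)
```

Because the importance density has thicker tails than the posterior here, the weights are bounded and the approximation settles quickly.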
The rate of almost sure convergence in (2) depends critically on the choice of the importance sampling density. Under stronger assumptions, developed in Section 3,

(3)  n^{1/2}(ḡ_n − E[g(θ)]) ⇒ N(0, σ²),

and σ² may be estimated consistently. This result was indicated by Kloek and
van Dijk (1978), but does not follow from the result of Cramer (1946) they cite, and is repeated by Bauwens (1984) but without explicit discussion of the moments whose existence is assumed. Loosely speaking, the importance sampling density should mimic the posterior density, and it is especially important that the tails of I(θ) not decay more quickly than the tails of p(θ). This is keenly appreciated by those who have done empirical work with importance sampling, including Zellner and Rossi (1984), Gallant and Monahan (1985), Bauwens and Richard (1985), and van Dijk, Kloek, and Boender (1985). These and other investigators have experienced substantial difficulties in tailoring importance sampling densities to the problem at hand. This is an important failing, for the approach is neither attractive nor methodical if special arcane problems in numerical analysis have to be resolved in each application.
This paper approaches these problems analytically, and presents new results which should make the application of Monte Carlo integration by importance sampling much more routine. Section 3 provides sufficient and easily verified conditions for (3). It also introduces a measure of relative numerical efficiency which is natural to use in evaluating the effectiveness of any given importance sampling density. Based on these considerations, Section 4 outlines a systematic approach to the choice of I(θ). It utilizes the local behavior of the posterior density at its mode, and a new class of importance sampling densities, the multivariate split-normal and multivariate split-Student. These developments are illustrated in Section 5, with the homogeneous Markov chain model, and in Section 6, with the ARCH linear model. Some directions for future research are taken up in the last section.
2. BAYESIAN INFERENCE WITH IMPORTANCE SAMPLING
We begin with some additional notation, and a set of assumptions basic to what follows. Since expectations are taken with respect to a variety of distributions, it is useful to denote E_f[h(θ)] = [∫_Θ f(θ) dθ]^{-1} ∫_Θ f(θ)h(θ) dθ, where f(θ) is proportional to a probability density and h(θ) is integrable with respect to the probability measure that is induced by f(θ). Similarly denote by var_f[h(θ)] and md_f[h(θ)] the variance and mean deviation of h(θ) under this probability measure, when the respective moments exist.
ASSUMPTION 1: The product of the prior density, π(θ), and the likelihood function, L(θ), is proportional to a proper probability density function defined on Θ.

ASSUMPTION 2: {θ_i}_{i=1}^∞ is a sequence of i.i.d. random vectors, the common distribution having a probability density function I(θ).

ASSUMPTION 3: The support of I(θ) includes Θ.

ASSUMPTION 4: E[g(θ)] exists and is finite.
Assumption 1 is generally satisfied and usually not difficult to verify. It explicitly allows the prior density to be improper, so long as ∫_Θ p(θ) dθ = c < ∞. For h(θ) integrable with respect to the posterior probability measure, the simpler notation h̄ ≡ E[h(θ)] (and similarly var[h(θ)] and md[h(θ)]) will be used. Assumptions 2 and 3 are specific to the method of Monte Carlo integration with importance sampling; their role is obvious from the discussion in the introduction. Most inference problems can be cast in the form of determining E[g(θ)], through the function of interest, g(θ).
The value of ḡ_n in (2) is invariant with respect to arbitrary scaling of the posterior and importance sampling densities. In certain situations it is convenient not to work with normalizing constants for I(θ), so we shall re-express ḡ_n with I*(θ) = d^{-1}I(θ) in place of I(θ). Denote the weight function w(θ) = p(θ)/I*(θ); then ḡ_n = ∑_{i=1}^n g(θ_i)w(θ_i) / ∑_{i=1}^n w(θ_i). Consistency of ḡ_n for ḡ is a direct consequence of the strong law of large numbers.
THEOREM 1: Under Assumptions 1-4, ḡ_n → E[g(θ)].
This simple result provides a foundation for the numerical treatment of a very wide array of practical problems in Bayesian inference. A few classes of examples are worth mention.
(i) If the value of a Borel measurable function of the parameters, h(θ), whose expectation exists and is finite, is to be estimated with a quadratic loss function, then g(θ) = h(θ).
(ii) If a decision is to be made between actions a_1 and a_2, according to the expectation of loss functions l(θ; a_1) and l(θ; a_2), then g(θ) = l(θ; a_1) − l(θ; a_2).
(iii) Posterior probabilities may be assessed through indicator functions of the form χ(θ; S) = 1 if θ ∈ S, 0 if θ ∉ S. To evaluate the posterior probability P[h(θ) ∈ A], let g(θ) = χ(θ; {θ : h(θ) ∈ A}). Assumption 4 is satisfied trivially, so given the other assumptions ḡ_n → P[h(θ) ∈ A]. Construction of posterior c.d.f.'s is a special case that is particularly relevant if h(θ) itself has no posterior moments. (An example is provided subsequently in Section 5; see especially Table IV.)
(iv) The application to c.d.f.'s suggests how to construct quantiles. Suppose P[h(θ) ≤ x] = α, and for a given ε > 0, P[h(θ) ≤ x − ε] < α and P[h(θ) ≤ x + ε] > α. If q_n is defined to be any number such that ∑_{i: h(θ_i) ≤ q_n} w(θ_i)/∑_{i=1}^n w(θ_i) ≥ α and ∑_{i: h(θ_i) > q_n} w(θ_i)/∑_{i=1}^n w(θ_i) ≥ 1 − α, then by Theorem 1, lim_{n→∞} P[q_n ∈ (x − ε, x + ε)] = 1. Hence if the c.d.f. of h(θ) is strictly increasing at x, the αth quantile can be computed routinely as a byproduct of Monte Carlo integration with importance sampling.
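The quantile construction in example (iv) can be sketched directly from the draws and weights already in hand. A minimal Python sketch (the function name is ours, not from the paper), assuming h holds the values h(θ_i) and w the weights w(θ_i):

```python
import numpy as np

def posterior_quantile(h, w, alpha):
    """Smallest q_n such that the weight fraction of draws with
    h(theta_i) <= q_n is at least alpha: example (iv), a byproduct
    of the same draws and weights used to form g_n."""
    order = np.argsort(h)
    h_sorted = h[order]
    cum = np.cumsum(w[order]) / np.sum(w)
    idx = np.searchsorted(cum, alpha)    # first index with cum >= alpha
    return h_sorted[min(idx, len(h_sorted) - 1)]
```

With equal weights this reduces to an ordinary sample quantile.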
(v) In forecasting problems functions of interest are of the form g(θ) = P(x_{T+s} ∈ A | θ, X), and in this case g(θ) itself is an integral. Monte Carlo integration with an importance sampling density for the parameters may be combined with Monte Carlo integration for future realizations of the random vector x_{T+s} to form predictive densities (Geweke (1988b)).
3. EVALUATING THE IMPORTANCE SAMPLING DENSITY
Standing alone these results are of little practical value, because nothing can be said about rates of convergence and there is no formal guidance for choosing I(θ) beyond the obvious requirement that subsets of the support of the posterior should not be systematically excluded. Moreover, as noted in the introduction, for seemingly sensible choices of I(θ), ḡ_n can behave badly. Poor behavior is usually manifested by values of ḡ_n that exhibit substantial fluctuations after thousands of replications, which in turn can be traced to extremely large values of w(θ_i) that turn up occasionally. The key problem is the distribution of the ratio of two means in ḡ_n, and this may be characterized by a central limit theorem.
THEOREM 2: In addition to Assumptions 1-4, suppose

E[w(θ)] = c^{-1} ∫_Θ [p(θ)²/I*(θ)] dθ

and

E[g(θ)²w(θ)] = c^{-1} ∫_Θ [g(θ)²p(θ)²/I*(θ)] dθ

are finite. Let

σ² ≡ c^{-1}d^{-1} E{[g(θ) − ḡ]²w(θ)},

σ̂_n² ≡ n ∑_{i=1}^n [g(θ_i) − ḡ_n]²w(θ_i)² / [∑_{i=1}^n w(θ_i)]².

Then

n^{1/2}(ḡ_n − ḡ) ⇒ N(0, σ²)  and  σ̂_n² → σ².

PROOF: Define A(θ) = g(θ)w(θ) and observe that ḡ_n = [n^{-1} ∑_{i=1}^n A(θ_i)] / [n^{-1} ∑_{i=1}^n w(θ_i)]. From Theorem 1, n^{-1} ∑_{i=1}^n A(θ_i) → cdḡ and n^{-1} ∑_{i=1}^n w(θ_i) → cd. By a Central Limit Theorem (e.g. Billingsley (1979, Theorem 29.5)), n^{1/2}(ḡ_n − ḡ) ⇒ N(0, σ²),

σ² = c^{-2}d^{-2}{var[A(θ)] + ḡ² var[w(θ)] − 2ḡ cov[A(θ), w(θ)]}
   = c^{-2}d^{-1} ∫_Θ [g(θ) − ḡ]²w(θ)p(θ) dθ.   Q.E.D.
We shall refer to (σ̂_n²/n)^{1/2} as the numerical standard error of ḡ_n. The conditions of Theorem 2 must be verified, analytically, if that result is to be used to assess the accuracy of ḡ_n as an approximation of E[g(θ)]. Failure of
these conditions does not vitiate the use of ḡ_n: there is no compelling reason why n^{1/2}(ḡ_n − ḡ) needs to have a limiting distribution. But it is essential to have some indication of |ḡ_n − ḡ|, and this task becomes substantially more difficult if a central limit theorem cannot be applied. Practicality suggests that I(θ) be chosen so that Theorem 2 applies. The additional assumptions in Theorem 2 typically can be verified by showing either

(4)  w(θ) ≤ w̄ < ∞ ∀θ ∈ Θ, and var[g(θ)] < ∞;  or

(5)  Θ is compact, and p(θ) ≤ p̄ < ∞, I(θ) ≥ ε > 0, ∀θ ∈ Θ.

Demonstration of (4) involves comparison of the tail behaviors of p(θ) and I(θ). Demonstration of (5) is generally simple, although Θ may be compact only after a nontrivial reparameterization of the prior and the likelihood. Meeting these conditions does not establish the reliability of ḡ_n and σ̂_n² in any practical sense: examples in Section 5 demonstrate uncontrived cases in which (5) applies but σ̂_n² would be unreliable even for n = 10^20. The strength of these conditions is, rather, that they provide a starting point to use numerical methods in constructing I(θ). We shall return to this endeavor in Section 4.
The expression for σ² in Theorem 2 indicates that the numerical standard error is adversely affected by large var[g(θ)], and by large relative values of the weight function. The former is inherent in the function of interest and in the posterior density, but the latter can in principle be controlled through the choice of the importance sampling density. A simple benchmark for comparing the adequacy of importance sampling densities is the numerical standard error that would result if the importance sampling density were the posterior density itself, i.e., I*(θ) ∝ p(θ). In this case σ² = var[g(θ)] and the number of replications controls the numerical standard error relative to the posterior standard deviation of the function of interest: e.g., n = 10,000 implies the former will be one percent of the latter. This is a very appealing metric for numerical accuracy.
Only in special cases is it possible or practical to construct synthetic variates whose distribution is the same as the posterior. But since var[g(θ)] can be computed as a routine byproduct of Monte Carlo integration, it is possible to see what the numerical variance would have been, had it been possible to generate synthetic random vectors directly from the posterior distribution. Define the relative numerical efficiency of the importance sampling density for the function of interest g(θ),

RNE ≡ var[g(θ)] / c^{-1}d^{-1}E{[g(θ) − ḡ]²w(θ)} = var[g(θ)]/σ².

The RNE is the ratio of the number of replications required to achieve any specified numerical standard error using the posterior density as the importance sampling density, to the number required using the importance sampling density I(θ). The numerical standard error for ḡ_n is the fraction (RNE·n)^{-1/2} of the posterior standard deviation.
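Both diagnostics are cheap to compute from the simulator output. A Python sketch (the function name is ours), using σ̂_n² from Theorem 2 and estimating var[g(θ)] by the weighted second moment about ḡ_n:

```python
import numpy as np

def nse_rne(g_vals, w):
    """Return (g_n, numerical standard error, RNE) for
    g_n = sum(g w)/sum(w), with
    sigma2_hat = n * sum((g - g_n)^2 w^2) / (sum w)^2   (Theorem 2)
    and var[g] estimated by sum((g - g_n)^2 w) / sum(w)."""
    n = len(w)
    sw = np.sum(w)
    g_n = np.sum(g_vals * w) / sw
    dev2 = (g_vals - g_n) ** 2
    sigma2 = n * np.sum(dev2 * w ** 2) / sw ** 2
    var_g = np.sum(dev2 * w) / sw
    return g_n, np.sqrt(sigma2 / n), var_g / sigma2
```

When the importance sampling density is the posterior itself all weights are equal, σ̂_n² collapses to the sample variance of g, and the computed RNE is 1.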
Low values of RNE indicate that there exists an importance sampling density (namely, the posterior density itself) that does not have to be tailored specifically
to the function of interest, and provides substantially greater numerical efficiency.
Thus they alert the investigator to the possibility that more efficient, yet practical,
importance sampling densities might be found. If one is willing to consider
importance sampling densities that depend on the function of interest-which
will be impractical if expected values of many functions of interest are to be
computed-then even greater efficiencies can generally be achieved. A lower
bound on a2 for all such functions can be expressed.
THEOREM 3: In addition to Assumptions 1-4, suppose that md[g(θ)] is finite. Then the importance sampling density that minimizes σ² has kernel |g(θ) − ḡ|p(θ), and for this choice σ² = {md[g(θ)]}².

PROOF: From Theorem 2, σ² = c^{-2} ∫_Θ K(θ)²/I(θ) dθ, when we impose the normalization d = 1 and ∫_Θ I(θ) dθ = 1, and define K(θ) = |g(θ) − ḡ|p(θ). Following Rubinstein (1981, Theorem 4.3.1; the restriction to bounded Θ is inessential), σ² is minimized by I(θ) ∝ |K(θ)|.   Q.E.D.
Theorem 3 has some interesting direct applications. An important class of cases consists of model evaluation problems: θ ∈ Θ* or θ ∈ Θ \ Θ*, g(θ) is an indicator function for Θ*, and so E[g(θ)] = P[θ ∈ Θ*] ≡ p*. Then σ² is minimized if I(θ) ∝ (1 − p*)p(θ) for θ ∈ Θ* and I(θ) ∝ p*p(θ) for θ ∈ Θ \ Θ*: half the drawings should be made in Θ* and half in Θ \ Θ*, and in proportion to the posterior in each case.
It appears impractical to construct the densities with kernel |g(θ) − ḡ|p(θ) in general: a different function would be required for each g(θ); a preliminary evaluation of ḡ with a less efficient method would be a prerequisite; and methods for sampling from those importance sampling densities would need to be devised. However, Theorem 3 suggests that importance sampling densities with tails thicker than the posterior density (a characteristic of densities with kernel |g(θ) − ḡ|p(θ)) might be more efficient than the posterior density itself (and hence have RNE > 1) although still suboptimal. This is found in practice with some frequency, and some examples are given in Section 5.
4. CHOOSING THE IMPORTANCE SAMPLING DENSITY
There are two logical steps in choosing the importance sampling density. The
first is to determine a class of densities that satisfy the conditions of Theorem 2,
usually by using (4) or (5). The second step is to find a density within that class
that attains a satisfactory RNE for the functions of interest. The first objective
can only be achieved analytically. The second is a well-defined problem ill its
own right, that can be solved numerically.
When 69 is compact the first step will usually be trivial. For example, if the
classical regularity conditions for the asymptotic normal distribution of the
maximum likelihood estimator 0 obtain, with 0 LN(O*, V), then the N(p, V)
importance sampling density satisfies (5). If &9is not compact, it will generally be
necessary to investigate the tail behavior of the likelihood function to verify (4). There are many instances in which L(θ) behaves like k|θ_i|^{-q} for large values of θ_i and it is not difficult to determine q. Given a multivariate normal prior on θ, the tail of the marginal posterior density in θ_i will behave like exp(−cθ_i²) for large values of θ_i, and a multivariate normal importance sampling density may be appropriate. On the other hand, if the prior is improper, and flat for one or more parameters, then in these instances no multivariate normal importance sampling density will satisfy Theorem 2. An instructive leading example is the computation of the expected value of a function of the mean of a univariate normal population with unknown mean and variance, using the N(x̄, s²/n) importance sampling density. It is not hard to see that (a) Theorem 2 does not apply; (b) if one computes σ̂_n² and the estimate of var[g(θ)], ∑_{i=1}^n [g(θ_i) − ḡ_n]²w(θ_i)/∑_{i=1}^n w(θ_i), then the computed RNE approaches 0 almost surely as n → ∞; (c) if n is held fixed, computed RNE → 1 as sample size approaches infinity. These results are typical of cases in which the classical asymptotic expansion of the log-likelihood function is valid, and Θ is not compact. They underscore the importance of investigating the tail behavior of the likelihood function analytically rather than numerically.
Once the tail behavior of the likelihood function is characterized, a class of importance sampling densities is usually suggested. For example, if the tail is multivariate Student with variance matrix Σ and degrees of freedom ν, a multivariate t importance sampling density with weakly larger variance and weakly smaller degrees of freedom will satisfy Theorem 2. In general let I(θ; a) be a family of importance sampling densities indexed by the q × 1 vector a, and let w(θ; a) = p(θ)/I(θ; a) be the corresponding family of weight functions. It will not usually be practical to choose a distinct importance sampling density for each function of interest. A reasonable objective is the minimization of E[w(θ; a)] ∝ ∫_Θ [p(θ)²/I(θ; a)] dθ. This function appears in the first condition of Theorem 2 and in the denominator of each E[g(θ)]; it is precisely the function one would minimize if g(θ) were an indicator function for Θ*, P(Θ*) = 1/2.
Two related classes of importance sampling densities have turned out to be quite useful in our experience with this approach. They involve densities that have not, to our knowledge, been described in the literature. The heuristic idea is to begin with minus the inverse of the Hessian of the log posterior density evaluated at its mode as the variance matrix of a multivariate normal importance sampling density, and shift to the multivariate Student t, if warranted, with degrees of freedom indicated by the tail behavior of the posterior density. If the prior is diffuse, the asymptotic variance of the maximum likelihood estimator may be substituted for minus the inverse of the Hessian of the log posterior density evaluated at the mode. In practice either importance sampling density can be poor, because the Hessian poorly predicts (especially if it underpredicts) the value of the posterior density away from the mode, because the posterior density is substantially asymmetric, or both. Illustrations of these problems are provided in the next two sections. To cope with them we modify the multivariate normal or Student t, comparing the posterior density and importance sampling density in each direction along each axis, in each case adjusting the importance sampling density to reduce E[w(θ, a)]. With k parameters, there are 2k adjustments in the last step, so a is a 2k × 1 vector.
A little more formally, let sgn⁺(·) be the indicator function for nonnegative real numbers, and let sgn⁻(·) = 1 − sgn⁺(·). The k-variate split normal density N*(μ, T, q, r) is readily described by construction of a member x of its population:

ε ~ N(0, I_k);
η_i = [q_i sgn⁺(ε_i) + r_i sgn⁻(ε_i)]ε_i  (i = 1, ..., k);
x = μ + Tη.

The log-p.d.f. at x is (up to an additive constant)

−∑_{i=1}^k [log(q_i)sgn⁺(ε_i) + log(r_i)sgn⁻(ε_i)] − (1/2)ε′ε.
In the application here, μ is the posterior mode, and T is a factorization such that the inverse of TT′ is the negative of the Hessian of the log posterior density evaluated at the mode. A variate from the multivariate split Student density t*(μ, T, q, r, ν) is constructed the same way, except that η_i = [q_i sgn⁺(ε_i) + r_i sgn⁻(ε_i)]ε_i(χ²/ν)^{-1/2}, with χ² ~ χ²(ν). The log-p.d.f. is (up to an additive constant)

−∑_{i=1}^k [log(q_i)sgn⁺(ε_i) + log(r_i)sgn⁻(ε_i)] − [(ν + k)/2] log(1 + ν^{-1}ε′ε).
Choosing a′ = (q′, r′) to minimize E[w(θ; a)] requires numerical integration at each step of the usual algorithms for minimization of a nonlinear function. Solution of this optimization problem with any accuracy would, in most applications, likely be more time consuming than the computation of E[g(θ)] itself. A much more efficient procedure can be used to select values for q and r that, in many applications, produces high RNE's. The intuitive idea is to explore each axis in each direction, to find the slowest rate of decline in the posterior density relative to a univariate normal (or Student) density, and then choose the variance of the normal (or Student) density to match that slowest rate of decline. A little more formally, let e^{(i)} be a k × 1 indicator vector, e_i^{(i)} = 1, e_j^{(i)} = 0 (j ≠ i). For the split normal define f_i(δ) according to

(6)  p(μ + δTe^{(i)})/p(μ) = exp[−δ²/2f_i(δ)²],

so that

f_i(δ) = |δ|{2[log p(μ) − log p(μ + δTe^{(i)})]}^{-1/2}.

Then take q_i = sup_{δ>0} f_i(δ) and r_i = sup_{δ<0} f_i(δ).
For the split Student the procedure is the same except that

f_i(δ) = ν^{-1/2}|δ|{[p(μ)/p(μ + δTe^{(i)})]^{2/(ν+k)} − 1}^{-1/2}.

In practice, carrying out the evaluation for δ = 1/2, 1, ..., 6 seems to be satisfactory.
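The rescaling rule (6) and the split normal sampler can be sketched compactly in Python. Here log_p is the user-supplied log posterior kernel, mu its mode, and T a factor of minus the inverse Hessian; the δ grid follows the text, and the function names are ours:

```python
import numpy as np

def split_normal_scales(log_p, mu, T, deltas=np.arange(0.5, 6.5, 0.5)):
    """Automatic rescaling (6): along axis i, in each direction,
    f_i(d) = |d| * (2 * [log p(mu) - log p(mu + d T e_i)]) ** (-1/2);
    q_i = sup_{d>0} f_i(d), r_i = sup_{d<0} f_i(d).  Assumes mu is the
    mode, so the log posterior drop is positive at every trial point."""
    k = len(mu)
    lp0 = log_p(mu)
    q, r = np.zeros(k), np.zeros(k)
    for i in range(k):
        for d in deltas:
            f_pos = d / np.sqrt(2.0 * (lp0 - log_p(mu + d * T[:, i])))
            f_neg = d / np.sqrt(2.0 * (lp0 - log_p(mu - d * T[:, i])))
            q[i] = max(q[i], f_pos)
            r[i] = max(r[i], f_neg)
    return q, r

def draw_split_normal(rng, mu, T, q, r, n):
    """Draw n variates x = mu + T eta, with
    eta_i = [q_i sgn+(eps_i) + r_i sgn-(eps_i)] eps_i, eps ~ N(0, I_k)."""
    eps = rng.standard_normal((n, len(mu)))
    eta = np.where(eps >= 0, q, r) * eps
    return mu + eta @ T.T
```

For a posterior that is exactly normal with the Hessian-implied variance, every f_i(δ) equals 1, so q = r = 1 and the split normal collapses to the ordinary normal approximation.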
The two examples that follow illustrate how these methods are applied.
5. EXAMPLES: THE BINOMIAL AND MARKOV CHAINS
In this section we take up a set of successively complex examples in which the
parameter space 9 is compact. We begin with situations which can be studied
analytically in some detail, and consequently there would be no need for a
numerical approach at all. This is strictly for illustrative purposes. As will be
seen, not much elaboration is required to generate realistic models in which there
are no analytical solutions.
5.1. A Simple Binomial Model
Let a sample of size N be drawn from a population in which P[x = 1] = θ, P[x = 0] = 1 − θ, and suppose there are M occurrences of "1". If the prior is π(θ) = 1, 0 ≤ θ ≤ 1, then the posterior is

(7)  p(θ) = [B(M + 1, N − M + 1)]^{-1}θ^M(1 − θ)^{N−M}  (0 ≤ θ ≤ 1).

Since N^{1/2}(θ̂ − θ) ⇒ N(0, θ(1 − θ)), a normal approximation based on the asymptotic distribution of the m.l.e. θ̂ is

I(θ) = [2πN^{-1}θ̂(1 − θ̂)]^{-1/2} exp{−N(θ − θ̂)²/2θ̂(1 − θ̂)}.

Since 0 ≤ θ ≤ 1, condition (5) is satisfied.
Consider two cases. In Case A, N = 69, M = 6, θ̂ = .087, the asymptotic standard error for θ̂ is .034 and the asymptotic variance is .00115. In Case B, N = 71, M = 54, θ̂ = .761, the asymptotic standard error for θ̂ is .051, and the asymptotic variance is .00256. For the corresponding normal importance sampling densities Figure 1 shows the weight functions w(θ) = p(θ)/I(θ), indicated by the heavy lines. In each case w(θ) appears badly behaved, attaining a value of over 10^70 for θ around .9 in Case A, and attaining a value of over 10^4 for θ between .2 and .3 in Case B. For our purposes it is E[w(θ)] = ∫_0^1 w(θ)p(θ) dθ and not w(θ) which matters. The light lines in Figure 1 show w(θ)p(θ), which is much less than w(θ) when w(θ) is large. In Case A E[w(θ)] = 1.65 × 10^24, and in Case B E[w(θ)] = 1.05. Clearly the importance sampling density I(θ) is adequate in the latter case but not the former. The difficulty in Case A is that the likelihood function declines more slowly than its asymptotic normal approximation, for θ > θ̂. A split normal importance sampling density with variance .00115 for θ < θ̂ and variance greater than .00115 for θ > θ̂ performs better, as shown in Figure 2. As the latter variance increases, E[w(θ)] decreases exponentially, until the variance reaches .0016; E[w(θ)] attains its minimum at .0019, and then increases slowly.
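The Case A experiment is easy to replicate in outline. A Python sketch (variable names ours) that draws from a split normal importance density centered at the m.l.e., using the inflated upper variance the text arrives at, and forms the posterior mean of θ; under the Beta(M+1, N−M+1) posterior the exact answer is (M+1)/(N+2):

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 69, 6                            # Case A
theta_hat = M / N
v_lo = theta_hat * (1 - theta_hat) / N  # asymptotic variance, about .00115
v_hi = 0.0019                           # minimizing upper variance from the text

def log_post(th):
    """Log of the posterior kernel (7), dropping the Beta constant."""
    return M * np.log(th) + (N - M) * np.log(1.0 - th)

def log_split(th):
    """Log split normal importance density at th (up to a constant)."""
    s2 = np.where(th >= theta_hat, v_hi, v_lo)
    return -0.5 * (th - theta_hat) ** 2 / s2 - 0.5 * np.log(s2)

n = 200_000
eps = rng.standard_normal(n)
th = theta_hat + np.where(eps >= 0, np.sqrt(v_hi), np.sqrt(v_lo)) * eps

ok = (th > 0.0) & (th < 1.0)            # posterior kernel is zero outside (0, 1)
th = th[ok]                             # zero-weight draws drop out of both sums
log_w = log_post(th) - log_split(th)
w = np.exp(log_w - log_w.max())
post_mean = np.sum(th * w) / np.sum(w)  # approximates (M + 1)/(N + 2) = 7/71
```

Replacing v_hi by v_lo recovers the plain normal importance density, whose upper tail is far too thin for this likelihood.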
[FIGURE 1.-Values of log(w(θ)) (heavy lines) and log(w(θ)p(θ)) (light lines) for the two binomial models discussed in Section 5.1. Panels show the cases N = 69, M = 6 and N = 71, M = 54; horizontal axis: probability parameter θ.]

[FIGURE 2.-Values of E[w(θ)] as a function of the variance of a split normal importance sampling density, centered at θ̂, for Case A discussed in Section 5.1.]
5.2. The Two-State Homogeneous Markov Chain Model²
Suppose an agent can occupy one of two states, with P[In state i at time t | In state j at time t − 1] = p_j, i ≠ j. Let N agents be observed at times t − 1 and t, with m_ij agents in state i at time t − 1 and state j at time t. With a prior density π(p_1, p_2) = 1, 0 ≤ p_i ≤ 1 (i = 1, 2), the posterior p.d.f. is proportional to

(8)  p(p_1, p_2) = p_1^{m_12}(1 − p_1)^{m_11} p_2^{m_21}(1 − p_2)^{m_22},
²For elaboration on the Markov chain model and the problem of embeddability, see Singer and Spilerman (1976) or Geweke, Marshall, and Zarkin (1986a, 1986b).
TABLE I
THREE EXAMPLES FOR THE TWO-STATE HOMOGENEOUS MARKOV CHAIN MODEL^a

               Case I         Case II        Case III
m_11           63             21             68
m_12           6              66             28
m_21           17             6              17
m_22           54             24             4
p̂_1            .087 (.034)    .759 (.046)    .292 (.046)
p̂_2            .239 (.051)    .200 (.073)    .810 (.086)
p̂_1 + p̂_2      .326 (.061)    .959 (.086)    1.102 (.097)

^a Carats denote m.l.e.'s; asymptotic standard errors are shown parenthetically. Data are taken from Singer and Cohen (1980).
0 ≤ p_i ≤ 1 (i = 1, 2). If the underlying process of transition takes place continuously and homogeneously through time, it may be described in terms of lim_{Δ↓0} Δ^{-1}P[In state i at time t + Δ | In state j at time t] = r_j, i ≠ j. If p_1 + p_2 < 1, then r_j = −p_j log(1 − p_1 − p_2)/(p_1 + p_2); otherwise the process of transition cannot be continuous and homogeneous through time. Whenever there exists a continuous time model corresponding to a discrete time Markov chain model, the discrete time model is said to be embeddable.
To provide specific numerical examples, we use part of a data set reported by Singer and Cohen (1980). The data pertain to the incidence of malaria in Nigeria, state 1 being no detectable parasitemia and state 2 being a positive test for parasitemia. Data are reported separately for different demographic groups. The process of infection and recovery is modeled as a homogeneous Markov chain, and Singer and Cohen (1980) focus on whether this model is embeddable. Data for three cases, and the corresponding maximum likelihood estimates and their asymptotic standard errors, are provided in Table I. Notice that the Case I data are those used to illustrate the binomial model; a two-state Markov chain model is simply a combination of two binomial models, as reflected in comparison of (7) and (8). In Case I p̂_1 + p̂_2 ≪ 1, and the model is embeddable; in Case II p̂_1 + p̂_2 < 1 and in Case III p̂_1 + p̂_2 > 1, and the question of embeddability is open.
We conduct inference in this model with two priors and six functions of interest. The first prior is uninformative: π(p_1, p_2) = 1 for all p_j ∈ (0, 1). The second prior imposes embeddability but otherwise is the same as the first: π(p_1, p_2) = 2 for all (p_1, p_2): p_1 > 0, p_2 > 0, p_1 + p_2 < 1. The functions of interest are p_1, p_2, p_1^{-1}, p_2^{-1}, r_1^{-1}, and r_2^{-1}. The function p_j^{-1} is the mean duration in state j in the discrete time model, while r_j^{-1} is the mean duration in state j assuming a continuous time model. Since r_j^{-1} is undefined if p_1 + p_2 > 1, E(r_j^{-1}) is defined only under the second prior; E[r_j] is infinite for either prior, although the posterior distribution of r_j may be obtained readily as suggested in examples (iii) and (iv) of Section 2.
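Under the first prior the posterior is a product of independent beta densities, which makes the "actual" column of Table II easy to check by direct simulation; the second prior then amounts to discarding draws with p_1 + p_2 ≥ 1. A sketch (reading the flat-prior posterior for Case I as p_1 ~ Beta(7, 64) and p_2 ~ Beta(18, 55) is my inference from the data in Table I, not stated in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Case I data (Table I): M11 = 63, M12 = 6, M21 = 17, M22 = 54.
p1 = rng.beta(6 + 1, 63 + 1, n)    # posterior of P[move 1 -> 2]
p2 = rng.beta(17 + 1, 54 + 1, n)   # posterior of P[move 2 -> 1]
print(p1.mean(), p2.mean())        # posterior means, close to .099 and .247

# Second prior: restrict to embeddable draws, then form r_1^{-1},
# the continuous-time mean duration in state 1.
keep = p1 + p2 < 1
s = p1[keep] + p2[keep]
scale = -np.log1p(-s) / s          # r_j = p_j * scale
mean_duration_1 = (1.0 / (p1[keep] * scale)).mean()
```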
Results obtained using the methods proposed in earlier sections are presented
in Table II. Asterisks in column 1 denote the use of the second prior; otherwise
the first is used. The first block of columns provides results with the normal
TABLE II
INFERENCE IN THE TWO-STATE HOMOGENEOUS MARKOV CHAIN MODEL a

[Table body lost in scanning; its columns cannot be reliably recovered. For each case (I, II, III) and each function of interest g(·) b (p_1, p_2, their inverses, and r_1^{-1}, r_2^{-1}, under each prior as available), the table reports E[g], sd[g], 1000σ_n, and RNE c under the normal, split normal, and likelihood-function importance sampling densities, and E[g] and sd[g] for the exact ("actual") posterior where available.]

a Expected values of functions of interest g(·) were computed by Monte Carlo integration with importance sampling, using 10,000 replications.
b Prior is π(p_1, p_2) = 1 if no asterisk; with asterisk, prior is π(p_1, p_2) = 2 if p_1 + p_2 < 1. In Case I, essentially all the mass of the posterior satisfies p_1 + p_2 < 1.
c Relative numerical efficiency.
importance sampling density motivated by the asymptotic distribution: p_1 and p_2 are independent, with p_j ~ N(p̂_j, p̂_j(1 - p̂_j)/(M_j1 + M_j2)). The second block of columns provides results with the split normal importance sampling density constructed as described in Section 4. Since the posterior density is the product of a binomial in p_1 and a binomial in p_2, each p_j may be expressed as a ratio involving two independent chi-square random variables (Johnson and Kotz (1970, Section 24.2)). Hence it is straightforward to use the likelihood function itself as the importance sampling density; the third block of columns provides results obtained this way. Finally, the posterior moments of p_j and p_j^{-1} have simple analytical expressions under the first prior, and these are provided in the last block of columns. (Under the second prior the posterior moments of p_j and p_j^{-1} could probably be computed more efficiently by numerical evaluation of the incomplete beta function, but this has not been pursued because it does not extend to multi-state Markov chain models.)
In Case I, P[p_1 + p_2 < 1] ≈ 1 under the first prior; none of the 10,000 replications with any importance sampling density turned up p_1 + p_2 > 1. In Case II, P[p_1 + p_2 < 1] ≈ .65 and in Case III P[p_1 + p_2 < 1] ≈ .21, under the first prior. The computed expected values of all functions of interest with each importance sampling density are quite similar to each other and (where available) to the actual values. There are no surprises given the computed numerical standard errors. There are greater relative differences among the computed standard deviations of the functions of interest, and it may be verified analytically that this must be the case. One could, of course, compute numerical standard errors for the posterior standard deviations as well. The relative numerical efficiencies indicate that, with a few small exceptions in Case II, the split normal is a better importance sampling density than the normal. The superiority of the split normal increases systematically as E(p_j) approaches zero and the asymptotic normal becomes a poorer approximation to the likelihood function. Many examples of relative numerical efficiencies in excess of 1.0 turn up with the split normal importance sampling density, precisely where we would expect: E(p_j) closer to zero, and moments of the p_j showing greater dispersion, like p_j^{-1}. By construction RNE is 1.0 for importance sampling from the posterior density, and for importance sampling from the likelihood function it is the posterior probability of an imposed prior, evaluated assuming a prior that is uniform in the parameter space of the likelihood function.
TABLE III
SOME DIAGNOSTICS FOR COMPUTATIONAL ACCURACY IN THE
TWO-STATE HOMOGENEOUS MARKOV CHAIN MODEL

              Normal Sampling Density      Split Normal Sampling Density
              n = 10,000    n = 50,000     n = 10,000    n = 50,000
Case I
  RNE, p_1       .441          .269          1.137         1.139
  RNE, p_2       .769          .657          1.014         1.012
  w_1           186.2        1,774.7           2.5           2.5
  w_10           72.9          497.0           2.5           2.5
Case II
  RNE, p_1       .808          .774          1.054         1.047
  RNE, p_2       .720          .564          1.054         1.053
  w_1            31.2          277.9           1.9           1.9
  w_10           17.5          106.3           1.9           1.9
Case III
  RNE, p_1       .810          .790          1.011         1.009
  RNE, p_2       .507          .462          1.032         1.030
  w_1           101.0          348.7           1.8           1.8
  w_10           52.8          161.3           1.8           1.8
From the example of Section 5.1, we know that the population RNE for the normal importance sampling density in Case I should be on the order of 10^{-24}. The normal is such a bad importance sampling density that, even with 10^4 replications, it will almost certainly not produce the values of the p_j that lead to difficulties, and computed RNE will bear no resemblance to population RNE. As the number of replications increases computed RNE's tend to decline in situations like this, as is clear from consideration of Figure 1. Table III contrasts some computed RNE's for n = 10,000 and n = 50,000 and bears out the difficulties with Case I. In these situations the largest weights w(θ_i) increase rapidly as the number of replications, n, increases, and order statistics pertaining to the w(θ_i) are often useful in identifying very poorly conditioned importance sampling densities. Since the largest w(θ_i)'s may be retained systematically as n increases and Σ_{i=1}^{n} w(θ_i) and Σ_{i=1}^{n} w(θ_i)^2 are computed in any event, the diagnostic of the form w_m = (n/m) Σ_{i=1}^{m} w^{(n+1-i)}(θ)^2 / Σ_{i=1}^{n} w(θ_i)^2 (where superscripts denote order statistics and m is small) can be quite useful as an indication of thin tails in the importance sampling density relative to the posterior. The values of these diagnostics for our examples are provided in Table III, again for n = 10,000 and n = 50,000; they indicate, correctly, that the split normal is a much better importance sampling density than is the normal.
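Both RNE and the w_m diagnostic are functions of the importance weights alone, so they can be monitored as replications accumulate. A sketch (the effective-sample-size form of RNE used here is a standard weight-only proxy, not the paper's exact estimator; function names are mine):

```python
import numpy as np

def rne_proxy(w):
    """(sum w)^2 / (n * sum w^2): the effective-sample-size ratio,
    equal to 1 when sampling from the posterior itself."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w.size * (w ** 2).sum())

def w_diagnostic(w, m):
    """w_m = (n/m) * (sum of the m largest w_i^2) / (sum of all w_i^2).
    Values near 1 are benign; large values flag an importance sampling
    density with thin tails, dominated by a few weights."""
    w = np.asarray(w, dtype=float)
    sq = np.sort(w ** 2)
    return (w.size / m) * sq[-m:].sum() / sq.sum()

w = np.ones(10_000)                     # ideal case: equal weights
assert abs(rne_proxy(w) - 1.0) < 1e-12
assert abs(w_diagnostic(w, 10) - 1.0) < 1e-12
w[0] = 500.0                            # one dominant weight
assert w_diagnostic(w, 10) > 100        # diagnostic reacts sharply
```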
5.3. The Quartile Homogeneous Markov Chain Model 3

Consider last a four-state homogeneous Markov chain model, state j denoting the jth quartile of income. Let p_ij = P[In state j at time t | In state i at time t - 1], and arrange the p_ij in a 4 x 4 matrix P. Since Σ_{j=1}^{4} p_ij = Σ_{j=1}^{4} p_ji = 1 for i = 1, ..., 4, the likelihood function has 9 free parameters and is not proportional to a product of multinomial densities. We suppose that the process of transition between income quartiles takes place in continuous time; define

r_ij = lim_{δ↓0} δ^{-1} P[In state j at time t + δ | In state i at time t]

for all i ≠ j, and let r_ii = -Σ_{j≠i} r_ij. If the discrete time model is embeddable, then R = Q log[Λ]Q^{-1}, where Q is the matrix whose columns are right eigenvectors of P and Λ is the diagonal matrix of corresponding eigenvalues. A given discrete time model is embeddable if and only if the off-diagonal elements of Q log[Λ]Q^{-1} are all real and nonnegative and the diagonal elements are all real and nonpositive for some combination of the branches of the logarithms of the eigenvalues. The problem of inference subject to the constraint of embeddability is ill suited to classical treatment, and a purely analytical Bayesian approach is precluded by the complexity of the indicator function for embeddability.

3 Details on the construction of the normal importance sampling density for this example may be found in Geweke, Marshall, and Zarkin (1986a). Interest in this problem in general and in measures of mobility in particular is motivated by Geweke, Marshall, and Zarkin (1986b).
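The embeddability check via R = Q log[Λ]Q^{-1} is direct to implement for the principal branch of the logarithm; a sketch (the full search over branches of the complex logarithms, which the text describes, is omitted here, and the function name is mine):

```python
import numpy as np

def generator_principal_branch(P, tol=1e-9):
    """Return the rate matrix R = Q log(Lambda) Q^{-1} if it is a valid
    generator on the principal branch (off-diagonals real and
    nonnegative, diagonals real and nonpositive), else None."""
    lam, Q = np.linalg.eig(P)
    R = Q @ np.diag(np.log(lam.astype(complex))) @ np.linalg.inv(Q)
    if np.abs(R.imag).max() > tol:
        return None                      # complex generator: not valid here
    R = R.real
    off = R - np.diag(np.diag(R))
    if off.min() < -tol or np.diag(R).max() > tol:
        return None                      # sign conditions violated
    return R

# Two-state illustration: p1 + p2 = .3 < 1, so the chain is embeddable ...
R = generator_principal_branch(np.array([[.9, .1], [.2, .8]]))
# ... whereas a chain with a negative eigenvalue is not.
assert generator_principal_branch(np.array([[.1, .9], [.9, .1]])) is None
```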
[Figure: values of f_i(δ) for the quartile homogeneous Markov chain model, plotted against δ ∈ [-6, 6], one curve per axis; only the legend and axis ticks survived scanning.]
FIGURE 3.-Values of f_i(δ) along nine orthogonal axes of the likelihood function, constructed as described in Section 5.3.
The 9 x 1 vector of parameters for the likelihood function consists of those p_ij for which j ≠ i and j ≠ i + 1, ordered so that the p̂_ij appear in ascending order.4 The split normal importance sampling density was constructed as described in Section 4, with T being the lower triangular Choleski factorization of the asymptotic variance matrix of θ̂. Some idea of the behavior of the likelihood surface is conveyed in Figure 3, which shows values of the f_i(δ) defined in (6). For the lower-numbered axes, the boundary of the parameter space is only a few standard deviations away in the negative direction, and so f_i(δ) is truncated on the left. For these axes f_i(δ) is greater for δ > 0 than it is for the higher-numbered axes, indicating that the likelihood function tapers most slowly relative to the asymptotic normal approximation in these directions. This is not surprising in view of the example in Section 5.1.
The functions of interest are -r_jj^{-1}, the mean duration in each state; -r_jj, the rate of transition to and from each state; and three summary measures of mobility that have been proposed in the literature: M*(R) = -.25 tr(R), M_B*(R) = .25 Σ_{i=1}^{4} Σ_{j=1}^{4} |i - j| r_ij, and M_2*(R) = -log(λ_2), where λ_2 is the second largest eigenvalue of P. We impose the prior π(P) ∝ 1 if P is embeddable and π(P) = 0 otherwise. Under this prior posterior means of the -r_jj^{-1}, but not the r_jj or the mobility measures, exist. The data set was extracted from the National Longitudinal Survey data file, consisting of those 460 white males who were not enrolled
4 For details, see the Appendix of Geweke, Marshall, and Zarkin (1986a).
in school, employed full time, and married with spouse present during 1968, 1969, and 1970; and whose family incomes were coded for 1969, 1970, and 1971. Monte Carlo integration with 10,000 replications, using the normal and split normal importance sampling densities, was carried out. The normal importance sampling density performed very poorly: w_1 = 2,829.6. The corresponding value for the split normal was 22.1. The posterior probability of embeddability is .246 (with a numerical standard error of .0009, computed using the split normal importance sampling density). This probability, rather than 1.0, provides the norm against which RNE's should be compared, for any importance sampling density based on the likelihood function rather than the posterior.
The computed values in Table IV are essentially the same, for expected values and medians of functions of interest, with the two importance sampling densities employed. Higher moments, and quantile points farther from the median, show greater differences. The computed RNE's suggest that the split normal is substantially more efficient than the normal. However, for the normal importance sampling density computed RNE tends to vary widely with each set of 10,000 replications, and deteriorates as the number of replications increases. In this application, and, we conjecture, in most applications with more than a few dimensions, careful modification of asymptotic normal and other symmetric distributions is required if computed numerical standard errors are to be interpreted reliably using Theorem 2.
TABLE IV
INFERENCE IN THE QUARTILE HOMOGENEOUS MARKOV CHAIN MODEL a

[Table body lost in scanning; its columns cannot be reliably recovered. For the split normal and normal importance sampling densities it reports E[g], sd[g], 100σ_n, and RNE b for each function of interest g(·) (-r_11^{-1}, ..., -r_44^{-1}, -r_11, ..., -r_44, M*(R), M_B*(R), M_2*(R)), together with the .01, .25, .50, .75, and .99 quantiles of each.]

a Expected values, standard deviations, and fractiles of functions of interest g(·) were computed by Monte Carlo integration with importance sampling from a split normal density, using 10,000 replications.
b Relative numerical efficiency.
6. EXAMPLES: ARCH LINEAR MODELS

In this section we take up an example in which the parameter space Θ is not compact. It involves constraints which cannot readily be imposed in classical inference, as discussed in detail elsewhere (Geweke (1988a)). For this, among other reasons, exact Bayesian inference is appealing in this model.
The ARCH linear model was proposed by Engle (1982) to allow persistence in conditional variance in economic time series. Let x_t: k x 1 and y_t be time series for which the distribution of y_t conditional on ψ_{t-1} = {x_{t-s}, y_{t-s}, s ≥ 1} is

(9)  y_t | ψ_{t-1} ~ N(x_t'β, h_t).

Defining ε_t = y_t - x_t'β, take

h_t = α_0 + Σ_{j=1}^{p} α_j ε_{t-j}^2 = h(ε_{t-1}, ..., ε_{t-p}; γ),  γ: q x 1.

The parameterization in terms of γ allows restrictions like α_0 = γ_0, α_j = γ_1(p + 1 - j), the linearly declining weights employed by Engle (1982, 1983). For (9) to be plausible it is necessary that α_0 > 0 and α_j ≥ 0 (j = 1, ..., p). For {ε_t} to be stationary it is necessary and sufficient that the roots of 1 - Σ_{j=1}^{p} α_j z^j all be outside the unit circle.
Given the sample (x_t, y_t; t = 1, ..., T) the log-likelihood function is (up to an additive constant)

(10)  l = -(1/2) Σ_{t=p+1}^{T} log h_t - (1/2) Σ_{t=p+1}^{T} h_t^{-1} ε_t^2.

Engle (1982) has shown that for large samples ∂²l/∂β∂γ' ≈ 0, and the maximum likelihood estimates β̂ of β and γ̂ of γ are asymptotically independent. A scoring algorithm may be used to maximize the likelihood function, and revision of the estimate γ^(i) of γ and β^(i) of β may proceed separately at each step. We programmed the procedure described in Engle (1982) and then replicated other results (Engle (1983)) using data furnished by the Journal of Money, Credit, and Banking.
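Equation (10) is straightforward to evaluate directly; a sketch (a literal translation of (10), with my own function name and argument layout):

```python
import numpy as np

def arch_loglik(beta, alpha, y, X):
    """Log likelihood (10), up to an additive constant.
    alpha = (alpha_0, alpha_1, ..., alpha_p);
    eps_t = y_t - x_t'beta; h_t = alpha_0 + sum_j alpha_j * eps_{t-j}^2;
    the sum runs over t = p+1, ..., T."""
    eps = y - X @ beta
    p = len(alpha) - 1
    ll = 0.0
    for t in range(p, len(y)):
        # most recent lag first, to line up with (alpha_1, ..., alpha_p)
        h = alpha[0] + np.dot(alpha[1:], eps[t - p:t][::-1] ** 2)
        ll += -0.5 * (np.log(h) + eps[t] ** 2 / h)
    return ll
```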
Since β̂ and γ̂ are asymptotically independent we construct the importance sampling density as the product of a density for β and a density for γ. To employ condition (4) and the procedures of Section 4 first compare the tail behavior of the multivariate t density in m dimensions,

f(x; μ, Σ, ν) ∝ [1 + ν^{-1}(x - μ)'Σ^{-1}(x - μ)]^{-(ν+m)/2},

with that of the likelihood function. Allowing |x_j| → ∞ while fixing all other x_i, the log of f(x; μ, Σ, ν) behaves like -(ν + m) log|x_j|. The log-likelihood function (10) behaves like -[(T - p)/2] log(γ_j) as γ_j → ∞ with all other parameters fixed, and like -(T - p) log|β_j| as |β_j| → ∞ with all other parameters fixed. Hence for β we shall use a split Student importance sampling density with T - p - k degrees
TABLE V
INFERENCE IN THE ARCH LINEAR MODEL a

g(·)         E[g]    sd[g]   100σ_n    RNE b
β_1         1.096    .092     .110     .700
β_2          .954    .135     .155     .757
γ_0         1.001    .291     .373     .609
γ_1          .281    .066     .079     .693
P[Stable]    .795     -       .475     .722

a Expected values and standard deviations of functions of interest g(·) were computed by Monte Carlo integration with importance sampling from a split Student density, using 10,000 replications.
b Relative numerical efficiency.
of freedom, and for γ we shall use a split Student importance sampling density with (T - p)/2 - q degrees of freedom. The variance matrices are given by expressions (28) and (32), respectively, in Engle (1982).
An artificial sample of size 200 provides a concrete example:

y_t = Σ_{j=1}^{2} β_j x_{tj} + ε_t,  ε_t ~ N(0, h_t),  β_1 = β_2 = 1;
x_{t2} = cos(2πt/200);
h_t = 1 + .5 ε_{t-1}^2 + .25 ε_{t-2}^2.

The parameterization of the conditional variance process in the model is h_t = γ_0 + γ_1(2ε_{t-1}^2 + ε_{t-2}^2); α_0 = γ_0, α_1 = 2γ_1, α_2 = γ_1. We assume a flat prior on β and γ, and estimate these four parameters and assess the posterior probability of stability P[γ_1 < 1/3]. Results using the split Student importance sampling density constructed as described in Section 4 are presented in Table V.
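The artificial sample can be regenerated along these lines; a sketch (taking x_t1 to be a constant regressor is an assumption, since the text specifies only x_t2; the seed and burn-in length are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T, burn = 200, 100

# ARCH(2) errors: h_t = 1 + .5*eps_{t-1}^2 + .25*eps_{t-2}^2,
# with a burn-in so the start-up values do not matter.
eps = np.zeros(T + burn)
for s in range(2, T + burn):
    h = 1.0 + 0.5 * eps[s - 1] ** 2 + 0.25 * eps[s - 2] ** 2
    eps[s] = rng.normal(0.0, np.sqrt(h))
eps = eps[burn:]

# Regressors and dependent variable; beta_1 = beta_2 = 1.
t = np.arange(1, T + 1)
X = np.column_stack([np.ones(T), np.cos(2 * np.pi * t / 200)])
y = X @ np.array([1.0, 1.0]) + eps
```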
The local behavior of the log-likelihood function is portrayed in Figure 4. For each of the four axes this figure indicates the actual log-likelihood function (solid thick line), the multivariate "t" approximation to the likelihood function using the information matrix (Engle (1982, (28) and (32))) (dotted line), and the split Student approximation to the likelihood function constructed as described in Section 4 (thin line). (In the case of γ, the curves terminate at γ_0 = 0 or γ_1 = 0.) For β the classical asymptotic theory provides a good approximation to the likelihood function, but for γ the approximation is poor.
In Table VI we compare diagnostics for computational accuracy using the unmodified and split normal densities (which are unjustified theoretically since E[w(θ)] is infinite in these cases) and the unmodified and split Student densities. The unmodified densities perform quite poorly, for reasons made clear in Figure 4. The split normal and split Student sampling densities lead to acceptable performance, in the sense that RNE is apparently of the same order of magnitude as would be achieved if one could sample directly from the likelihood function. In the case of the split normal we know this is an illusion, but from
[Figure: four panels (First Beta Axis, Second Beta Axis, First Gamma Axis, Second Gamma Axis) plotting log-likelihood values against standard deviations from -6 to 6; only the axis labels and ticks survived scanning.]
FIGURE 4.-All functions are normalized to have the value 1 at 0 standard deviations, which corresponds to the asymptotic maximum likelihood estimator. The solid thick line is the log-likelihood function; the dotted line is the log multivariate t importance sampling density; and the solid thin line is the log split Student importance sampling density.
TABLE VI
SOME DIAGNOSTICS FOR COMPUTATIONAL ACCURACY IN THE ARCH LINEAR MODEL

[Table body lost in scanning; its columns cannot be reliably recovered. For the multivariate normal, split normal, multivariate t, and split Student sampling densities it reports RNE for β_1, β_2, γ_0, γ_1, and p = P[Stable], and the diagnostics w_1 and w_10, at n = 10,000 and n = 50,000.]
TABLE VII
SOME DIAGNOSTICS FOR COMPUTATIONAL ACCURACY IN THE ARCH LINEAR MODEL

[Table body lost in scanning; its columns cannot be reliably recovered. For multivariate t sampling densities with ν = 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, and 12.0 degrees of freedom (n = 10,000) it reports RNE for β_1, β_2, γ_0, γ_1, and p, and the diagnostics w_1 and w_10.]
Figure 4 astronomical values of n are required before RNE is mainly determined by the relative tail behavior of the likelihood function and importance sampling density. Diagnostics for the split Student and split normal importance sampling densities are essentially the same.
The alternative approach to Monte Carlo integration with importance sampling for posterior densities with tails thick relative to the normal has been to employ multivariate t distributions with small degrees of freedom (Zellner and Rossi (1984), van Dijk, Hop, and Louter (1986)). We conclude this example with an examination of diagnostics for computational accuracy for this procedure. Beginning with the asymptotic variance matrices for β and γ, multivariate t importance sampling densities with ν = .5, 1, 1.5, 2, 3, 4, 6, and 12 degrees of freedom were used. The resulting diagnostics for numerical accuracy are given in Table VII. RNE for the four parameters attains a maximum at ν = 3, that for p at ν = 6. For ν < 3 the sampling density is too diffuse, and as ν becomes larger it approaches the multivariate normal, which is much too compact. That w-diagnostics are sensitive to sampling densities with tails that are too thin rather than too thick is clearly indicated in Table VII. The RNE of the multivariate t with low degrees of freedom is decidedly less than that of the split Student t, but of the same order of magnitude for proper choice of ν. In the absence of any systematic or theoretical basis for choosing ν, however, costly numerical experimentation is required to find the appropriate value.
7. CONCLUSION
Integration by Monte Carlo with importance sampling provides a paradigm for Bayesian inference in econometric models. Thanks to increasingly cheap computing it is practical to obtain the exact posterior distributions of any function of
interest of the parameters, subject to any diffuse prior. This is done through
systematic exploration of the likelihood function, and the asymptotic sampling
distribution theory for maximum likelihood estimators provides the organization
for this exploration. Diffuse priors may incorporate inequality restrictions which
arise frequently in applied work but are impractical if not impossible to handle in
a classical setting. Concentrations of prior probability mass may be treated by
applying the methods in this paper to both the original and lower dimensional
problems, and combining results through the appropriate posterior odds ratios.
By choosing functions of interest appropriately, formal and direct answers to the
questions that motivate empirical work can be obtained. The investigator can
routinely handle problems in inference that would be analytical nightmares if
attacked from a sampling theoretic point of view, and is left free to craft his
models and functions more according to substantive questions and less subject to
the restrictions of what asymptotic distribution theory can provide.
Integration by Monte Carlo is an attractive research tool because it makes
numerical problems much more routine than do other numerical integration
methods. The principal analytical task left to the econometrician is the characterization of the tail behavior of the likelihood function. This characterization leads
to a family of importance sampling densities from which a good choice can be
made automatically. Prior distributions and functions of interest can then be
specified at will, taking care that computed moments actually exist. There is no
guarantee that these methods will produce computed moments of reasonable
accuracy in a reasonable number (say, 10,000) of Monte Carlo replications.
Pathological likelihood functions will demand more analytical work: for example
analytical marginalization may be possible, and transformation of parameters to
obtain more regular posterior densities can be quite helpful, but these tasks must
be approached on a case-by-case basis. It is in precisely such cases that sampling
theoretic asymptotic theory and normal or other approximations to the posterior
are also likely to be inadequate: the difficulty is endemic to the model and not the
approach. Clearly there is much to be learned about more complicated likelihood
functions. In approaching these problems, the emphasis should be-as it has
been in this paper-on generic rather than specific solutions.
Institute of Statistics and Decision Sciences, Duke University, Durham, NC
27706, U.S.A.
Final manuscript received March, 1989.
REFERENCES
BAUWENS, L. (1984): Bayesian Full Information Analysis of Simultaneous Equation Models Using
Integration by Monte Carlo. Berlin: Springer-Verlag.
BAUWENS, L., AND J. F. RICHARD (1985): "A 1-1 Poly-t Random Variable Generator with Application to Monte Carlo Integration," Journal of Econometrics, 29, 19-46.
BILLINGSLEY, P. (1979): Probability and Measure. New York: Wiley.
CRAMÉR, H. (1946): Mathematical Methods of Statistics. Princeton: Princeton University Press.
DAVIS, P. J., AND P. RABINOWITZ(1975): Methods of Numerical Integration. New York: Academic
Press.
ENGLE, R. F. (1982): "Autoregressive Conditional Heteroscedasticity with Estimates of the Variance
of United Kingdom Inflation," Econometrica, 50, 987-1008.
(1983): "Estimates of the Variance of U.S. Inflation Based on the ARCH Model," Journal of
Money, Credit, and Banking, 15, 286-301.
(1984): "Wald, Likelihood Ratio and Lagrange Multiplier Tests in Econometrics," Chapter 13
of Handbook of Econometrics, Vol. II, ed. by Z. Griliches and M. D. Intriligator. Amsterdam:
North-Holland.
GALLANT, A. R., AND J. F. MONAHAN (1985): "Explicitly Infinite-Dimensional Bayesian Analysis of
Production Technologies," Journal of Econometrics, 30, 171-202.
GEWEKE,J. (1986): "Exact Inference in the Inequality Constrained Normal Linear Regression
Model," Journal of Applied Econometrics, 1, 127-141.
(1988a): "Exact Inference in Models with Autoregressive Conditional Heteroscedasticity," in
Dynamic Econometric Modeling, ed. by E. Bemdt, H. White, and W. Barnett. Cambridge:
Cambridge University Press, 73-104.
(1988b): "Antithetic Acceleration of Monte Carlo Integration in Bayesian Inference," Journal
of Econometrics, 38, 73-90.
GEWEKE, J., R. C. MARSHALL, AND G. ZARKIN (1986a): "Exact Inference for Continuous Time
Markov Chain Models," Review of Economic Studies, 53, 653-699.
(1986b): "Mobility Indices in Continuous Time Markov Chains," Econometrica, 54,1407-1423.
HAMMERSLEY, J. M., AND D. C. HANDSCOMB (1964): Monte Carlo Methods. London: Methuen (First Edition).
(1979): Monte Carlo Methods. London: Chapman and Hall (Second Edition).
JOHNSON, N. L., AND S. KOTZ (1970): Continuous Univariate Distributions-2. New York: Wiley.
KLOEK, T., AND H. K. VAN DIJK (1978): "Bayesian Estimates of Equation System Parameters: An
Application of Integration by Monte Carlo," Econometrica, 46, 1-20.
QUANDT, R. E. (1983): "Computational Problems and Methods," Chapter 12 of Handbook of
Econometrics, Vol. I, ed. by Z. Griliches and M. D. Intriligator. Amsterdam: North-Holland.
RUBINSTEIN, R. Y. (1981): Simulation and the Monte Carlo Method. New York: Wiley.
SINGER, B., AND J. COHEN (1980): "Estimating Malaria Incidence and Recovery Rates from Panel
Surveys," Mathematical Biosciences, 49, 273-305.
SINGER, B., AND S. SPILERMAN (1976): "The Representation of Social Processes by Markov Models," American Journal of Sociology, 82, 1-54.
VAN DIJK, H. K., J. P. HOP, AND A. S. LOUTER (1986): "An Algorithm for the Computation of Posterior Moments and Densities Using Simple Importance Sampling," Erasmus University Econometric Institute Report 8625/E.
VAN DIJK, H. K., T. KLOEK, AND C. G. E. BOENDER (1985): "Posterior Moments Computed by Mixed Integration," Journal of Econometrics, 29, 3-18.
ZELLNER, A., AND P. ROSSI (1984): "Bayesian Analysis of Dichotomous Quantal Response Models,"
Journal of Econometrics, 25, 365-394.