The Delta Prior and Small Sample Distribution in Time Series*

Marek Jarociński
European Central Bank

Albert Marcet
Institut d'Anàlisi Econòmica CSIC and CEPR

Preliminary, this version: August 22, 2008
Abstract
This paper serves several purposes: i) to reconcile the Bayesian and classical views about the need for correcting upwards the OLS estimator in autoregressive models, ii) to propose the use of the delta prior, based on prior knowledge about growth rates of certain variables, iii) to develop techniques for translating prior knowledge about observables into priors about coefficients. All these purposes are closely related: the delta prior is based on the prior knowledge that most economic time series will not grow at arbitrarily high or low rates, which leads us to propose it as a widely acceptable Bayesian prior. We also show that the Bayesian and classical views are easily reconciled once we observe that the posterior from the delta prior adjusts the OLS estimate in the same direction as classical bias corrections, because frequentist small sample studies were implicitly assuming a delta prior and because the so-called "exact" likelihood approach has the same effect. We find that the delta prior generates estimates that have lower frequentist risk than frequentist bias-corrected estimators,
* We thank Manolo Arellano, Frank Schorfheide and Chris Sims for useful conversations and Harald Uhlig for his comments on an earlier version of this paper. All errors are our own. Albert Marcet acknowledges financial support from CREI, DGES (Ministerio de Educación y Ciencia), CIRIT (Generalitat de Catalunya) and the Barcelona Economics program of CREA. All computations were performed using Ox (Doornik, 2002). The opinions expressed herein are those of the authors and do not necessarily represent those of the European Central Bank. Contacts: [email protected] and [email protected].
so they are attractive also from the classical perspective. A prior about the likely distribution of growth rates is a statement about the features of the marginal distribution of the data. This cannot be used directly to build a posterior, so we need to convert this statement into a prior about the coefficients of the VAR. For this purpose, we propose a procedure for translating priors about observables into priors about coefficients that has not been used before and that has potential applications beyond the construction of the delta prior. We illustrate the use of the delta prior on a macroeconomic VAR of Christiano, Eichenbaum and Evans (1999) and compare its effects to those of the dummy observation priors of Sims and Zha (1998) and of the classical bootstrap approach of Kilian (1998).
Keywords: Time Series Models, Small Sample Distribution, Bias Correction, Bayesian Methods, Vector Autoregression, Monetary Policy Shocks
JEL codes: C11, C22, C32
Contents

1 Introduction
2 Why a Bayesian should agree that ρ_OLS is too small
  2.1 Frequentist bias corrections
  2.2 Views in the Bayesian literature
  2.3 Reconciling the two views under the delta prior
  2.4 Frequentist evaluation of the delta estimator
3 Translating Priors for Observables
  3.1 Defining the prior for coefficients
  3.2 Fixed point formulation
  3.3 Gaussian approximate fixed point
  3.4 Specifying priors on observables and accuracy of the obtained prior
4 Comparison with Other Priors
  4.1 Priors which adjust the OLS estimator for other reasons
  4.2 Priors similar to the delta prior
  4.3 Dummy observations priors
5 Empirical Example: the effect of monetary shocks in the US
6 Conclusions
Appendix A Initial conditions
  A.1 Initial conditions in the studies of bias
  A.2 The Bayesian approach: S-priors for the AR(1)
  A.3 The S-prior for VARs
Appendix B Marginal distribution of the observables implied by the prior on growth rates
Appendix C Our prior is not a reparameterization
Appendix D Analytical iteration on the mapping F for AR(1)
  D.1 Guess of the solution
  D.2 Approaching the prior by fixed point iteration
1 Introduction
Econometricians’ opinions on how to model growing time series differ widely.
Some recommend pre-testing for unit roots and, in multivariate contexts,
cointegration, while others recommend strongly against it; some estimate in
levels, others estimate in differences; some use classical inference based on
asymptotics, others classical small sample methods, others use Bayesian estimation. Within the Bayesian camp there are several standard, automatic
prior distributions whose virtues are hard to compare in general terms. Often these views imply very different approaches to the data and different
empirical implications.
We think, however, that most economists would find the following statement acceptable: most economic time series are unlikely to grow by a huge
amount in a given period. Economists would agree that the growth rate of
output is very unlikely to exceed, say, 50%. In this paper we propose to
use this prior knowledge in the estimation of time series. We call the prior
distribution that is derived from this prior the “delta prior”.
Studying the delta prior has many side products. First, it reconciles
Bayesian and classical small sample approaches to time series estimation.
Second, to the extent that it is a relatively uncontroversial and widely acceptable prior, inferences from the corresponding posterior should be widely
acceptable. Third, translating the knowledge about growth rates into a standard Bayesian prior for model parameters is a non-trivial task and it forces
us to develop techniques that, to our knowledge, have not been used before.
These techniques are likely to be useful for the implementation of other prior
knowledge on observables. Fourth, we argue that the delta prior is in fact
desirable from a classical point of view as well, since it delivers a smaller bias than OLS while having a lower mean squared error (MSE) than classical small sample approximate bias corrections, whose important selling point is exactly their improved mean squared error performance. Fifth, the delta
prior provides a unified framework that allows the interpretation of several
Bayesian priors proposed in the literature.
Section 2 reconciles Bayesian and classical small sample approaches. A growing literature worries about the inadequacy of standard inference in time series based on asymptotic theory. Various methods that perform better in small samples have been designed from a classical perspective. This
literature often focuses on the well-known bias of OLS. But as emphasized
by Sims and Uhlig (1991) a Bayesian with a flat prior sees no need to correct
the bias. It seems that all an applied economist who would like to ignore the small sample problems needs to do is to state a wholehearted belief in the Bayesian approach and a flat prior. The choice between classical
4
bias corrected estimators and Bayesian estimators is purely a matter of faith
in the classical or the Bayesian-flat prior approach. Since most researchers
(like us) do not feel strongly one way or the other and are willing to use
whatever approach is likely to work, we think it is productive to show how
to reconcile these two approaches. This is easily done with the delta prior:
we show that the Bayesian posterior obtained with this prior is no longer
symmetric around the OLS estimate and its mean can be considered to be
“bias corrected”. Furthermore, we argue that this is not surprising because
classical small sample studies were already incorporating a sort of delta prior
in their assumptions. Furthermore, we show that the Bayesian point estimate
with the delta prior has a lower MSE than classical estimators designed to
correct for the small sample bias. Therefore, the delta prior estimation has
all the virtues that a classical econometrician would ask for: a lowered bias
and a lower mean squared error than competitors. Of course, it also has all
the standard virtues of the Bayesian procedure, such as the facility of performing inference about functions of parameters (such as impulse responses), while constructing confidence intervals for functions of classical bias-correcting estimators is tricky (see Sims and Zha (1999) for more discussion of this
issue).
A substantial technical difficulty arises because the a-priori statement
that growth rates of the series have a certain distribution is a statement
about observable series. Instead, the standard procedure in Bayesian econometrics is to formulate a prior for unobservable coefficients. In section 3 we
show how to translate a prior for observable series into a standard prior distribution for parameters.[1] This involves solving approximately a functional
equation or, alternatively, a fixed point problem in the space of distributions. We show how to address this issue in a practical way and we propose
an algorithm that works very well in the empirical applications we consider.
The techniques discussed in this section are applicable to other cases where
researchers naturally formulate prior views on the distribution of observable
series.
In section 4 we compare the delta prior with other priors that have been
used in the literature, including the Minnesota prior, Jeffreys' prior, "exact likelihoods" (which can also be interpreted as using a particular prior), and
[1] An excellent discussion of obtaining priors from statements about observables can be found in (Berger, 1985, Ch. 3.5). The techniques discussed in Berger's book are not applicable to the problem at hand, so we propose a technique that, to our knowledge, has not been used before. The dummy observation priors of Sims and Zha (1998); Sims (2006) and priors from DSGE models of Del Negro and Schorfheide (2004) are also related work in this spirit, and in section 4.3 we discuss in detail the relationship of our technique with these approaches.
Sims’ dummy observation prior. We think that the delta prior provides a
common ground to unify views on these various priors. We would claim that
other approaches were driving up the estimated AR root in a more or less
arbitrary way, while the delta prior achieves this in a well justified way, as a
result of a prior which is easy to elicit and assess intuitively.
Section 5 discusses a practical application which illustrates the effects of
the delta prior and compares it with the results obtained with other priors
and classical procedures. We conclude in section 6. We discuss the relevant
literature in various places along the paper.
2 Why a Bayesian should agree that ρ_OLS is too small
Throughout this section we use as an example the Gaussian AR(1) model with an intercept:

y_t = α + ρ y_{t−1} + ε_t   (1)

with ε_t i.i.d. N(0, σ²), which holds for t = 1, ..., T. In order to simplify the setup further, we will always assume that ρ > 0, which is the most relevant situation in modeling macroeconomic time series. Ordinary least squares estimates of the coefficients of this model will be denoted by α_OLS and ρ_OLS. In considering Bayesian point estimates we will always assume a square loss function in ρ.
This section discusses, first from the classical and then from the Bayesian perspective, the properties of OLS in time series. In the context of our example, the question boils down to whether ρ_OLS is likely to be too small compared with ρ, and thus needs to be adjusted upwards. We will argue that, contrary to a common belief, the controversy regarding upward adjustment of ρ_OLS is unrelated to whether one takes a Bayesian (post-sample) or a frequentist (pre-sample) perspective. Instead, what is crucial is the prior information used implicitly in the classical analysis or explicitly in the Bayesian analysis. We will argue that this prior information is best elicited as a prior on the initial growth rates of the series y, and that an estimator using this prior is attractive for both Bayesians and frequentists.
2.1 Frequentist bias corrections
It is well known that the asymptotic distribution is not a good approximation to the small sample distribution in most time series applications. In particular, it
has been known for a very long time that OLS is biased and that it is likely
to underestimate the highest root of the process for the relevant case of a root close to 1.[2] Therefore classical econometricians should find the use of
OLS problematic in many applied time series studies. Related to the bias is
the well known fact that the small sample distribution of OLS is asymmetric.
These two facts have been largely ignored in much of the applied work based
on classical inference, which often uses the asymptotic distribution.
The small sample bias of OLS may be important in VAR applications.
First, as shown in Abadir et al. (1999), the bias increases with the dimension
of the VAR. Second, the size of the impulse-response coefficients at lag τ is proportional to the root of the AR polynomial raised to the power τ. Therefore,
a small deviation in the VAR coefficients is magnified in the impulse-response
coefficients at medium lags. A researcher who ignores this bias will often
underestimate the persistence of the impulse response and, as a result, may
be understating the economic relevance of his results.
We consider the frequentist small sample distribution of the OLS estimator (α_OLS, ρ_OLS) when T observations are available. For the present example, we take α = y0 = 0, ρ = 0.95 and T = 25. We also take σ = 1, but the distribution of ρ_OLS is unaffected by the choice of σ (see e.g. Result 2 in the Appendix). A Monte Carlo approximation of the distribution of ρ_OLS is shown in Figure 1.
[Figure 1: Distribution of ρ_OLS when ρ = 0.95, sample size is 25 and α = y0 = 0. Approximation on the basis of 100,000 simulated series.]
This is a familiar picture: the distribution of ρ_OLS is skewed to the left and this estimator is biased downwards. The reasons for this bias have been
[2] Some of the earliest references are Quenouille (1949), Hurwicz (1950), Marriott and Pope (1954) and Kendall (1954). A general characterization of the effects of the bias on the highest root is in Stine and Shaman (1989).
described at length in the literature, so we do not repeat them here. In the above example E(ρ_OLS) = 0.85. Furthermore, the effect of the bias is magnified when considering moving average representations: according to the mean OLS estimate the impulse response coefficient at lag 20 is 0.04, instead of 0.36 with the true coefficient.
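The magnification at lag 20 quoted above follows directly from raising the autoregressive coefficient to the 20th power:

```python
# impulse response of an AR(1) at lag tau is rho ** tau
print(round(0.95 ** 20, 2))  # 0.36 with the true coefficient
print(round(0.85 ** 20, 2))  # 0.04 with the mean OLS estimate
```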
This suggests that ρ_OLS should be adjusted upwards to compensate for the bias. A wealth of procedures for correcting the bias has been designed.
Among others: Quenouille (1949) proposed a jackknife estimator. Orcutt
and Winokur (1969) and Roy and Fuller (2001) use analytical expressions of
the bias to construct bias corrected estimates. Andrews (1993) numerically
obtains a median-unbiased estimator, which is extended to approximately
median-unbiased estimation in higher order autoregressions in Andrews and
Chen (1994). MacKinnon and Smith (1998) consider various strategies of
correcting the bias with bootstrap. A bootstrap bias correction is also used
for VARs in Kilian (1998). We will use the generic name “bias-correcting”
(BC) estimators for the estimators derived in this literature.
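As one concrete illustration of this literature, a residual-bootstrap mean-bias correction in the spirit of Kilian (1998) can be sketched as below. This is a simplified sketch, not the published procedure: actual implementations add safeguards, for example to keep the corrected estimate out of the explosive region.

```python
import numpy as np

def ols_ar1(y):
    """OLS of y_t on a constant and y_{t-1}; returns (alpha_hat, rho_hat)."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (a, r), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return a, r

def bias_corrected_rho(y, n_boot=1000, seed=0):
    """Correct the mean bias of rho_OLS by resimulating at the point estimate."""
    rng = np.random.default_rng(seed)
    a_hat, r_hat = ols_ar1(y)
    resid = y[1:] - a_hat - r_hat * y[:-1]
    T = len(y) - 1
    r_boot = np.empty(n_boot)
    for b in range(n_boot):
        yb = np.empty(T + 1)
        yb[0] = y[0]                       # condition on the first observation
        eps = rng.choice(resid, size=T, replace=True)
        for t in range(1, T + 1):
            yb[t] = a_hat + r_hat * yb[t - 1] + eps[t - 1]
        r_boot[b] = ols_ar1(yb)[1]
    bias_hat = r_boot.mean() - r_hat       # estimated mean bias at the point estimate
    return r_hat - bias_hat                # shift rho_OLS upwards by the estimated bias

# demo on one simulated persistent series (rho = 0.95, T = 25)
rng = np.random.default_rng(42)
y = np.empty(26); y[0] = 0.0
for t in range(1, 26):
    y[t] = 0.95 * y[t - 1] + rng.normal()
print(ols_ar1(y)[1], bias_corrected_rho(y))
```

Since the estimated bias is negative for persistent series, the corrected estimate lies above ρ_OLS.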
The proposed bias corrections are quite successful in improving the properties of OLS. First, they eliminate most of the bias.[3] Second, numerous Monte Carlo studies demonstrate that BC estimators have lower mean squared error (MSE) than OLS in a relevant range of parameter values (which, in the case of the AR(1), is approximately 0.5 < ρ ≤ 1). For examples of such Monte Carlo studies see MacKinnon and Smith (1998), Roy and
Fuller (2001) or section 2.4 later in this paper. However, there are several
problems that limit the appeal of this literature.
The first issue is the lack of a clear optimality property of any of the
BC estimators. We don’t know if they are most efficient within any class
(and, with the exception of Andrews (1993), they are not even unbiased).
(Approximate) unbiasedness gives an impression of impartiality, and thus
sounds like an attractive feature for an estimator, but it has no decision theoretic justification, and needs to be specified as a goal in itself. Therefore,
we do not know how to trade it off against efficiency. This is a key question because, although BC estimators are both less biased and more efficient than OLS, as we will show in this paper even larger gains in MSE can be obtained by sacrificing some of the bias correction. Another problem is that
often in time series models we are interested in nonlinear functions of parameters, such as impulse responses, whose bias and MSE need to be addressed
separately. Yet another is the complication of conducting inference with
[3] Exact mean-unbiasedness cannot be achieved, because the bias is a non-linear function of the true parameter, which is unknown. An exact bias correction exists only for the median bias, and only in the AR(1) case (Andrews, 1993).
BC estimates (confidence intervals, hypothesis tests). These are standard
Bayesian arguments raised against frequentist practices.
An issue which we would like to stress in particular is the dependence of the frequentist bias and the derived bias corrections on the assumed initial conditions. To illustrate this dependence, consider changing the previous example. Let us say we change only α while keeping y0 = 0 (we could equivalently do the reverse and change y0 while keeping α constant). The distribution of ρ_OLS for the case α = 2 (and α = 0 for comparison) is shown in Figure 2.
[Figure 2: Density of ρ_OLS when ρ = 0.95, sample size is 25, y0 = 0, α = 2 (and the density for α = 0 for comparison). Approximation on the basis of 100,000 simulated series.]
As can be seen, OLS now looks like a much better estimator and its distribution is almost symmetric. The required bias correction would now be much smaller than in the initial example. The reason is that since we kept y0 = 0, the long run mean of y_t is now given by α/(1 − ρ), which is much higher than the initial condition. As a result, y grows fast, especially in the first periods, as it converges back to this mean. We now provide an intuitive argument to show that it is these large growth rates of y that cause OLS to be such a good estimator.[4]
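This dependence on the initial condition is easy to see in simulation. The sketch below keeps the same assumptions as before (ρ = 0.95, σ = 1, y0 = 0, T = 25, fewer replications than in the figures) and compares the sampling distribution of ρ_OLS for α = 0 and α = 2:

```python
import numpy as np

def rho_ols(y):
    """OLS slope from regressing y_t on a constant and y_{t-1}."""
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    return np.linalg.lstsq(X, y[1:], rcond=None)[0][1]

def simulate(alpha, rho=0.95, sigma=1.0, y0=0.0, T=25, rng=None):
    y = np.empty(T + 1); y[0] = y0
    for t in range(1, T + 1):
        y[t] = alpha + rho * y[t - 1] + rng.normal(0.0, sigma)
    return y

rng = np.random.default_rng(0)
results = {}
for alpha in (0.0, 2.0):
    draws = np.array([rho_ols(simulate(alpha, rng=rng)) for _ in range(10_000)])
    results[alpha] = (draws.mean(), draws.std())
    print(f"alpha={alpha}: mean={draws.mean():.3f}, std={draws.std():.3f}")
# with alpha = 2 the series transits from y0 = 0 towards the distant mean
# alpha/(1-rho) = 40, the regressor is highly dispersed, and rho_OLS is
# nearly unbiased and tightly concentrated around 0.95
```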
Figure 3 displays two realizations of the process y, picked from the simulations underlying Figures 1 and 2. The first column of graphs plots y_t against time. The second column plots Δy_t against time. The third column shows a scatterplot of the right-hand-side variable (y_{t−1}) against the left-hand-side variable (y_t) in the regression equation (1) for this realization. The two
[4] A similar "improvement" of the OLS estimator would occur if we kept α = 0 but lowered y0.
rows of graphs differ only in the constant term α, while all other parameters (including y0 = 0) and random errors are the same in both rows. The first row assumes α = 0 while the second row presents a process with α = 2. The solid line in the scatterplots of the last column is the true regression line implied by the true parameters in (1), while the dashed line is the fitted regression estimated by OLS for this realization.
[Figure 3: Two cases of the AR(1) process (upper row: α = 0; lower row: α = 2) and the performance of the OLS estimator of the coefficients. The first column plots y_t against time. The second column plots the first difference of y_t against time. The third column shows scatter plots of y_t against y_{t−1}, along with the true regression line (α + ρ y_{t−1}) and the fitted line (α_OLS + ρ_OLS y_{t−1}).]
The series in the upper row starts at its long run mean and it fluctuates
around it. The slope of the dashed line in the last graph of the first row is
lower than the actual regression line, reflecting the OLS bias which results
in ρ_OLS < ρ. Notice that the second graph in the first row indicates that the realization contains small growth rates. But in the second row, the assumed initial condition implies that the series starts 40 standard errors away from the long run mean. The transition from the remote starting value to
the steady state dominates the series' behavior. It results in much higher dispersion of the values of the explanatory variable y_{t−1}. It is well known that when the values of the explanatory variable are highly dispersed, OLS is a very good estimator.[5] This is why the fitted regression in the last graph of the second row is much closer to the truth. Notice from the plots of the first difference of the series that y grows unusually fast, especially in the first periods, compared to the long-run growth rate. These large initial growth rates help OLS to perform well. Therefore, one should not be surprised to learn that the small sample distribution of ρ_OLS is much more symmetric and narrow in the case of the second row, as was displayed in Figure 2.
A majority of the cited frequentist studies assume that the first observation is drawn from the stationary distribution of the process (which in our example is N(α/(1 − ρ), σ²/(1 − ρ²))). This is complemented with some arbitrary assumption for the case of |ρ| ≥ 1, when the stationary distribution does not exist. That is, in the context of our Figure 3, these studies explicitly discard situations like the one in the lower panel. Obviously, this has dramatic effects on the conclusions regarding the size of the bias. The importance of the initial conditions extends also to unit root testing, as has been discussed in Müller and Elliott (2003). It is hard to assess whether these assumptions are plausible in the applied problem at hand, and probably for this reason this issue is rarely discussed. However, we consider it a strength of the frequentist approach that it forces researchers to consider the initial conditions explicitly. A Bayesian is free to condition on the initial observation and neglect any reflection about the initial conditions.
2.2 Views in the Bayesian literature
As is well known, using a flat prior for the regression coefficients results in
a posterior distribution which is symmetric and centered at the OLS estimate. Therefore OLS gives the optimal point estimate under a standard loss
function. This result holds regardless of whether all regressors are exogenous, or any of them happen to be lagged dependent variables. Therefore,
for a Bayesian who has a flat prior, OLS needs no adjustment upwards in
autoregressive models despite the presence of a small sample bias.
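This can be checked directly. In the minimal sketch below (σ = 1 is assumed known for simplicity), a flat prior on (α, ρ) combined with the Gaussian likelihood yields a posterior that is normal with mean equal to the OLS coefficients, even though the regressor is a lagged dependent variable:

```python
import numpy as np

rng = np.random.default_rng(1)
# simulate one AR(1) sample: y_t = 0.95 * y_{t-1} + eps_t, T = 25, y0 = 0
y = np.empty(26); y[0] = 0.0
for t in range(1, 26):
    y[t] = 0.95 * y[t - 1] + rng.normal()

X = np.column_stack([np.ones(25), y[:-1]])
b_ols = np.linalg.lstsq(X, y[1:], rcond=None)[0]   # (alpha_OLS, rho_OLS)
# flat prior + Gaussian likelihood (sigma = 1 known) => posterior is
# N(b_OLS, (X'X)^{-1}), i.e. symmetric and centered at the OLS estimate
V_post = np.linalg.inv(X.T @ X)
draws = rng.multivariate_normal(b_ols, V_post, size=50_000)
print(b_ols[1], draws[:, 1].mean())  # posterior mean of rho matches rho_OLS
```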
Sims and Uhlig (1991) use the AR(1) model without an intercept as an illustration of the fact that the presence of a bias is not a compelling reason to adjust the estimator.[6] They study the joint distribution of the true coefficient
[5] In other words, if the variance of the explanatory variable is high, the matrix (X′X)⁻¹ is small, causing the variance of the OLS estimator to be small.
[6] Sims and Uhlig (1991) also formulate other criticisms of the classical approach. In particular, they point to the inconsistency that arises from the fact that the asymptotic
ρ and its OLS estimate ρ_OLS, implied by the model and a uniform prior (they display this bivariate distribution from different angles in their helicopter-tour plots). This joint distribution embeds both the skewed, biased distribution of ρ_OLS conditional on ρ, and the Bayesian quasi-posterior distribution of ρ conditional on the OLS estimate (which should be distinguished from the usual posterior, which is the distribution conditional on the data). They point out that the distribution of the autoregressive coefficient ρ conditional on its OLS estimate ρ_OLS is symmetric around ρ_OLS, in spite of the bias of ρ_OLS, which may be counterintuitive at first,[7] and which highlights the importance of carefully distinguishing the pre-sample and post-sample perspectives.
There are well known philosophical objections to ever using the presample perspective (i.e. failing to condition on the observed sample). The
Likelihood Principle implies that samples which have not occurred should
not affect the inference with the actual sample, and violating this rule can
lead to very unreasonable inferences (for excellent discussion of these issues
and examples see e.g. Berger and Wolpert (1988)). Therefore, seeing ПЃOLS
as too low, or as appropriate, which seemed a very mundane practical issue,
is actually a matter of foundational principles. A researcher who takes a presample (frequentist) perspective thinks that the actual ПЃ is likely to be higher
then ПЃOLS , while a researcher who takes a post-sample (bayesian) perspective
(and is willing to specify his marginal prior beliefs about ПЃ as uniform), believes that ПЃ is equally likely to be higher or lower. Note that, like Sims and
Uhlig (1991), we are abstracting for now from the situation where a bayesian
has some prior ideas about ПЃ which are not well characterized by a uniform
prior, because this would obviously shift the posterior away from the OLS.
Such non-uniform priors could arise either from prior information, or from
alternative specification of a noninformative prior, namely the Jeffreys’ prior
(see Phillips, 1991). We comment on Jeffreys’ prior later in the paper.
[6, cont.] distribution of OLS is discontinuous at exactly a unit root. This criticism pertains to the use of asymptotic theory in the classical approach. The small sample distribution of OLS is nicely continuous at a unit root, so it is not subject to the "discontinuity" criticism. We do not mention this issue further in the paper because our objective is to compare the small sample classical approach with Bayesian approaches.
[7] The intuition they provide is that the bias towards zero is compensated by the increasing precision of OLS as the autoregressive parameter increases. Since these two effects (the bias and the increasing precision) exactly cancel, the distribution of the autoregressive parameter conditional on the OLS estimate is symmetric around the OLS estimate. However, in the model with an intercept the small sample bias of OLS is much larger, while the precision of OLS increases less with the size of the autoregressive parameter (as has been pointed out e.g. in Andrews, 1993, p.158). Therefore, the above intuition about canceling effects does not extend to the specification with the constant term.
We think that this part of Sims and Uhlig's paper is frequently misinterpreted. The misinterpretation is that in general, in autoregressions, the decision whether to adjust the OLS estimate upwards or not is a matter of taking a pre-sample or a post-sample perspective (as long as one's prior for ρ is flat). This suggests a deep dichotomy between the frequentist and the Bayesian treatment of time series models. But in fact, it is only a peculiarity of the AR(1) model without the constant term that flat-prior Bayesians never choose to adjust ρ_OLS upwards, independently of the initial condition. In a more general autoregressive model, there is no such dichotomy between the two schools. The decision on adjusting ρ_OLS upwards or not depends on the initial conditions one assumes. Both frequentists and flat-prior Bayesians will agree that ρ_OLS is too small, once the initial observation is assumed to be close to the deterministic component of the process. Both frequentist and flat-prior Bayesian point estimates will approach the OLS estimate as the initial deviation from the deterministic component increases. The claim that the frequentist bias corrections of the OLS estimate depend on the initial condition was discussed in the previous section. The claim that the same is true for Bayesians (although their corrections of the OLS are, in general, different from bias corrections) will be substantiated in the next section.
In fact, in practical applications Bayesians did not ignore the frequentist evidence that OLS underestimates the root of the process. The most popular priors used routinely in VARs, the Minnesota prior of Doan et al. (1984) and the 'dummy observations' priors of Sims, are designed to push the posterior towards the unit root and thus reduce the bias of the posterior mean, compared with the OLS estimate. These priors are discussed in detail in section
4 of this paper. Motivations for these priors sometimes sound quite close
to worries about the bias of OLS. For example, Sims and Zha (1998, p.959)
justify their priors by the need to correct some undesirable features of the
posterior which are the “other side of the well-known bias toward stationarity
of least-squares estimates of dynamic autoregressions.” However, these priors are experience-based devices for improving the forecasting performance,
and it is difficult to elicit them or assess if they are reasonable in themselves.
Instead, this paper proposes an intuitive prior which addresses the problems
of the OLS estimator.
2.3 Reconciling the two views under the delta prior
We have illustrated in section 2.1 that the key issue in determining the size
of the problems with the OLS estimator is the initial condition, i.e. the
relation of the first observation in the sample with the parameters of the
model. We find it difficult to specify such initial conditions, and we find the
13
solutions in the literature to be arbitrary. How can we distinguish reasonable
initial conditions from unreasonable ones? Our proposal is to think in terms
of the implied dynamics of the series initialized at such initial conditions.
Therefore, instead of specifying the distribution of the first observation, we
specify a prior about initial growth rates of the process, conditional on the
first observation.
Most economists would agree that economic time series are unlikely to grow by a huge amount in a given period. If asked to deliver a precise statement about their beliefs, economists might produce a distribution for the growth rate in, say, period 1. For example, the answer might be[8],[9]

Δy_1 ~ N(µ_Δ, σ²_Δ)   (2)

for some µ_Δ, σ²_Δ. Recall that y0 is a given constant.
A Bayesian would like to use this prior information and combine it with model (1) to form a posterior distribution on the parameters α, ρ. The prior statement (2) is not a prior distribution on the unobserved parameters α, ρ, as is usually required in the application of Bayes' rule. For this purpose we have to translate (2) into a prior distribution of α, ρ.
Translating a prior for coefficients from a prior for observable series is in
general not a trivial task, much effort will be devoted to this issue in the next
section. For the AR(1) case, however, this amounts to a reparameterization
of the model. Given our model we have:
∆y1 = α + (ρ − 1)y0 +
1
(3)
First we guess that α + (ρ − 1)y₀ has a normal distribution, independent of ε₁. Second, taking the expectation and the variance of both sides of (3) and using the standard assumption that the distribution of the coefficients α, ρ is independent of the ε's, we can write

µ_∆ = E(α + (ρ − 1)y₀)    (4)

Var(∆y₁) = Var(α + (ρ − 1)y₀) + σ²_ε    (5)
Putting all this together we find that our prior for the coefficients has to
satisfy:
α + (ρ − 1)y₀ ∼ N(µ_∆, σ²_g)    (6)

for σ²_g = σ²_∆ − σ²_ε. Under our guess that α + (ρ − 1)y₀ has a normal distribution independent of ε₁, we see that indeed any distribution for α, ρ that satisfies (6) is consistent with (2).
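To make the translation concrete: since the consistency condition (6) pins down the distribution of α + (ρ − 1)y₀, it can be checked by simulation. The following sketch (our own illustration, not from the paper; all numbers are made up) takes y₀ = 0, draws α from the delta prior and ε₁ from the model, and verifies that ∆y₁ reproduces the stated prior (2):

```python
import numpy as np

# Illustrative check of the consistency condition (6): with y0 = 0,
# drawing alpha ~ N(mu_d, sig_g^2), where sig_g^2 = sig_d^2 - sig_e^2,
# and eps_1 ~ N(0, sig_e^2) reproduces the prior (2) for the first
# growth rate, Delta y_1 ~ N(mu_d, sig_d^2). All numbers are made up.
rng = np.random.default_rng(0)
mu_d, sig_d, sig_e = 0.02, 0.10, 0.05        # prior mean/sd of growth, shock sd
sig_g = np.sqrt(sig_d**2 - sig_e**2)         # requires sig_d >= sig_e

alpha = rng.normal(mu_d, sig_g, size=1_000_000)  # delta prior on the constant
eps1 = rng.normal(0.0, sig_e, size=1_000_000)    # first-period shock
dy1 = alpha + eps1                               # Delta y_1 when y0 = 0

print(dy1.mean())   # close to mu_d = 0.02
print(dy1.std())    # close to sig_d = 0.10
```

Reversing the inequality (σ_∆ < σ_ε) makes `sig_g` undefined, which is exactly the non-existence problem discussed below.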
⁸ As usual we consider logs of growing variables, so that ∆y is the growth rate.

⁹ The assumption of normality is convenient in this section. The techniques discussed in section 3 and applied in section 5 do not need this assumption at all.
Condition (6) can be viewed either as a prior distribution of ПЃ conditional
on О± or as a prior distribution of О± conditional on ПЃ. In order to have a
joint prior distribution for О±, ПЃ we need to complement it with the respective
marginal distribution. If no additional prior knowledge is available, it is
natural to complement this condition with a flat prior for either ПЃ or О±. The
convenient feature of this simple example is that the posterior will be the
same in either case, because the kernel of the prior is the same.
A word of caution is in order: since we take the variance of ε as known throughout the paper, for a prior for growth rates to be compatible with the model we need σ²_∆ ≥ σ²_ε. If this is not the case, then (2) is incompatible with the model: there is no prior for the coefficients that is consistent with the model. This possibility of non-existence of the underlying prior for coefficients will be an issue throughout the paper.
We call (6) the ’delta’ prior. To insist on our semantics: throughout
the paper we call a prior statement on observables such as (2) a “prior for
growth rates”. We reserve the name “delta prior” for the distribution of
unobservable coefficients that is consistent with the prior for growth rates.
In the case y₀ = 0 (which we will use in this section), the prior for the growth rate simply translates into a prior for the constant term (which can be completed, e.g., with a flat prior for ρ):

α ∼ N(µ_∆, σ²_g)    (7)
Let us illustrate the effect of the delta prior in a “helicopter tour” graph
used by Sims and Uhlig (1991), which shows the joint distribution of the OLS
estimate ПЃOLS and the actual parameter ПЃ. Sims and Uhlig use this graph to
illustrate that, in spite of the small sample bias, the posterior conditional on
the ПЃOLS is symmetric and centered around ПЃOLS .
The first row of Figure 4 illustrates the point of Sims and Uhlig (1991).
We reproduce (up to the sampling error) two of their graphs, which are
generated for the case of AR(1) without the constant term. The cross-sections
along the fixed-ПЃ lines represent the small sample distribution of ПЃOLS given
ПЃ. The cross-sections along the fixed-ПЃOLS lines represent the distribution of
ПЃ given a value for the estimate ПЃOLS . These cross sections are a summary of
the posterior distribution of ПЃ given ПЃOLS . These cuts reflect two well known
facts: i) the small sample distribution of OLS under the classical approach is
biased downwards and skewed; ii) the posterior under a flat prior is symmetric
and centered at the OLS estimator.
In the subsequent two rows we present analogous graphs for the case with the constant term and with the delta prior given by (7).¹⁰ We take µ_∆ = 0 and we consider two values for σ_∆. The plots in the second row of Figure 4 are generated with a very loose delta prior, with σ_∆ = 1, in order to approximate the effect of the flat prior (we cannot draw from the flat prior itself, since it is not a proper distribution).

The third row of Figure 4 illustrates that in the model with an intercept, when reasonable initial conditions are assumed, both frequentists and Bayesians agree that the OLS estimate is too low and wish to correct it upwards. These plots are generated assuming σ_∆ = 0.1, which rules out with 95% probability initial 'systematic' (not caused by ε₁) growth rates outside the (−20%, 20%) range. We think that this is a very reasonable requirement for most growing macroeconomic time series. The frequentist small sample distribution on the left plot shows a downward bias, which is much stronger than in the model without an intercept. A frequentist would like to correct the OLS estimate upwards to compensate for this bias. The posterior generated with this prior is also far from symmetric around the OLS estimate; it is shifted towards higher values.¹¹ In this case, given the OLS estimate of 0.95, a Bayesian would believe that the true parameter ρ is around 0.975, adjusting the OLS estimate upwards as a frequentist would.

The plots in the second row illustrate that, as noted in the previous section, as the initial condition becomes less binding, both the frequentist small sample distribution and the Bayesian posterior distribution concentrate around the OLS estimate. In fact, they converge to a spike at 0.95 for the exactly flat prior on the initial growth rate. Therefore, the posterior mean of ρ conditional on ρ_OLS is indeed centered at the OLS estimate when the prior is flat. But this case should not be confronted with the classical small sample analysis under reasonable initial conditions; it should be confronted with classical small sample analysis under no restriction on initial conditions. A flat prior says that the probability that |∆y₁| is larger than, say, 20% is much bigger than the probability that |∆y₁| is lower than 20%. Alternatively, one could say that a flat prior implies σ²_∆ = ∞ in the prior for growth rates. This is not an 'uninformative' prior; on the contrary, it informs the researcher that OLS is quite a good estimator. Only an analyst who really thinks that very high growth rates are very likely should use OLS. An analyst who thinks that very high growth rates are impossible or very unlikely should not use OLS; instead, he should incorporate this information into the estimation.

¹⁰ Adapting the procedure of Sims and Uhlig (1991) to the model with the constant term, we take the values of ρ on a grid and draw values of α from the prior (7). For each draw of (ρ, α) we simulate the process given by (1) starting from y₀ = 0. We generate 10,000 data vectors of length T = 100. For each data vector we compute α_OLS and ρ_OLS. Lining up the histograms of the ρ_OLS's we obtain a surface which is the joint p.d.f. of ρ and ρ_OLS under the delta prior for α and a flat prior for ρ. The images seen from our helicopter show a slightly different concept from those seen from Sims and Uhlig's helicopter: in our case both α and ρ are underlying parameters, so when we condition on ρ we are in fact integrating over the distribution of α.

¹¹ Note that ρ increases from right to left on the horizontal axis of this plot. We have followed Sims and Uhlig in this convention because it allows one to switch smoothly between the two cross-sections of the plot in the 'helicopter tour' fashion, without having to flip one dimension.

Figure 4: Views of joint frequency distributions of (ρ, ρ_OLS) generated with alternative assumptions. [Three rows of surface plots: 'No constant, flat'; 'With constant, delta looser'; 'With constant, delta tighter'; each shown with the cuts ρ_OLS | ρ = 0.95 and ρ | ρ_OLS = 0.95.]
2.4 Frequentist evaluation of the delta estimator
Our discussion of the connection between Bayesian and classical views may have seemed purely academic and theoretical. But our purpose is very practical: having put both procedures on the same ground, we can now compare their advantages.
We now study the performance of the delta estimator (a point estimate obtained with the delta prior and a loss function which is quadratic in the error of ρ) from the classical point of view. We will see that the delta estimator, although inspired by Bayesian principles, can also be justified from a classical perspective. Classical bias corrections (as discussed earlier) have an element of arbitrariness in that a full bias correction is never achieved. For that reason, in the end, much of the literature on bias correction (BC) ends up checking the virtues of an estimator in terms of the MSE reduction it achieves for certain 'relevant' parameter values. We show that the delta estimators, in fact, achieve a substantially lower mean square error for a wide range of parameter values that includes the ranges considered of interest in many BC papers.
The Bayesian posterior mean is guaranteed to have the lowest MSE on average over the parameter space, where the average is taken with the weights
given by the prior. However, classical evaluations focus on estimators’ performance for specific parameter values, within relevant ranges. We show
the bias and MSE properties of the delta estimator compared to alternative
estimators from this point of view.
To study the frequentist properties of the delta estimator in model (1) we
repeat the Monte Carlo study of MacKinnon and Smith (1998, section 5),
adding the delta estimator. In order to highlight the small sample problems,
we take the case of T = 25. As in the above paper, the initial condition is generated as y₀ = α/(1 − ρ) + ε, where ε ∼ N(0, σ²_ε). This ensures invariance of the results with respect to α and σ_ε (see Appendix A). To compute the
values in the following figures we took 100,000 realizations of the process for
each value of ПЃ = 0, 0.05, . . . , 1.2. We consider the delta estimators following
from the prior for growth rates stating that the initial growth rate is normally
distributed with mean zero (as in equation (2)), with standard deviations:
0.2 (the estimator called delta1) and 0.05 (the estimator called delta2).
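The logic of this experiment can be sketched in a few lines. The following fragment is not the paper's Ox code; it is a simplified illustration with made-up values (ρ = 0.95, σ_ε = 0.04, and a delta prior with µ_∆ = 0, σ_∆ = 0.05), computing the delta estimator as the posterior mean under a normal prior on α and a flat prior on ρ:

```python
import numpy as np

# Illustrative (not the paper's code): bias of OLS vs. a delta-type
# estimator in the AR(1) model y_t = alpha + rho*y_{t-1} + eps_t.
# All parameter values are made up for the sketch.
rng = np.random.default_rng(3)
T, rho, alpha = 25, 0.95, 0.0
sig_e, sig_d = 0.04, 0.05
sig_g2 = sig_d**2 - sig_e**2          # prior variance of alpha, as in (6)
P0 = np.diag([1.0 / sig_g2, 0.0])     # prior precision: normal on alpha, flat on rho

rho_ols, rho_delta = [], []
for _ in range(3000):
    y = np.empty(T + 1)
    y[0] = alpha / (1 - rho) + rng.normal(0.0, sig_e)  # initial condition of the MC study
    for t in range(1, T + 1):
        y[t] = alpha + rho * y[t - 1] + rng.normal(0.0, sig_e)
    X = np.column_stack([np.ones(T), y[:-1]])
    b_ols = np.linalg.solve(X.T @ X, X.T @ y[1:])
    # posterior mean with known sig_e and prior mean zero on alpha
    b_delta = np.linalg.solve(P0 + X.T @ X / sig_e**2, X.T @ y[1:] / sig_e**2)
    rho_ols.append(b_ols[1])
    rho_delta.append(b_delta[1])

print(np.mean(rho_ols) - rho)     # strongly negative: the OLS downward bias
print(np.mean(rho_delta) - rho)   # smaller in absolute value: partial correction
```

The prior on the intercept pulls the posterior along the likelihood ridge, which shifts the estimate of ρ upwards on average, in the direction of the classical bias correction.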
Figure 5: Bias of the OLS, CBC and two delta estimators in a Monte Carlo experiment, sample size T = 25. [Plot of E(ρ̂) − ρ against ρ ∈ [0.4, 1.2].]

Figure 6: RMSE of the OLS, CBC and two delta estimators in a Monte Carlo experiment, sample size T = 25. [Plot of RMSE(ρ̂) against ρ ∈ [0.4, 1.2].]
The performance of the delta estimator is compared with that of OLS and the constant-bias-correcting (CBC) estimator in the nomenclature of MacKinnon and Smith (1998).¹² Figure 5 shows the bias for each considered estimator and each true value of ρ.¹³ Obviously, the largest bias is exhibited
by the OLS estimator. The CBC estimator is only approximately unbiased
so the picture reflects that the bias is not zero, but CBC reduces the bias
considerably relative to OLS. We can see that for high values of ПЃ the bias
of the delta estimator is in between that of OLS and CBC. Therefore, the
Bayesian estimator achieves an approximate bias correction in this parameter
region. The correction is not as precise as in the CBC estimator. However,
less biased estimators can have larger MSE.
Figure 6 reports the root MSE for the three estimators under consideration at various values of ρ. For positive ρ's, the CBC has lower MSE than the OLS estimator when ρ is above 0.5.¹⁴ The figure shows that the delta estimators beat OLS and the CBC for sufficiently high values of ρ. This range surely contains the roots relevant in practice for possibly non-stationary series. Actually, the delta estimator can be substantially better: for example,
at ПЃ = 1 the RMSE under CBC is around 30% larger than with the delta2
estimator. This loss in efficiency is very large, close to the one that would
result from throwing away half of the sample.
Notice that, in this case, all the cards seem to be stacked in favor of the
CBC: the bias correction is obtained taking into account the initial condition actually used. In the real world this (or any other) assumption on the
initial condition is quite unwarranted, so we would expect CBC to be at an
even larger disadvantage in practice. We have repeated the Monte Carlo
study using other initial conditions used in the literature (and discussed in
Appendix A.1) and we obtain similar results.
Our conclusion is that, as long as a classical researcher is willing to say
that the true ПЃ stays between, say, .7 and 1.1, the delta estimator is a better
alternative than available bias corrections, even from a classical point of view.
Unfortunately, outside the case of a simple AR(1) with a prior on only the first-period growth rate, the delta posterior cannot be computed easily using currently available techniques. In the next section we take up a number of technical problems that arise when trying to elicit the delta prior in large scale VAR's.

¹² Kilian (1998) uses the same estimator to improve the small sample properties of inference with VARs.

¹³ We only report positive values of ρ. Since we are concerned with possibly non-stationary series, we ignore negative values of ρ.

¹⁴ This threshold is approximately 0.5 also for other sample sizes, for more sophisticated bootstrap bias-correcting estimators (see MacKinnon and Smith, 1998, Figures 4 and 6), and for other bias-correcting estimators (see Roy and Fuller, 2001, Tables 1 and 3).
3 Translating Priors for Observables
Standard priors used in Bayesian analysis are distributions of unobserved
parameters. A prior about the likely behavior of growth rates of a series such
as in (2) is, instead, a prior about observables. The practical question, which
we study in this section, is how to translate a prior about observables into
a (standard) prior distribution of unobservable parameters. The discussion
and the methods we propose in this section are not limited to the case of
priors about growth rates, but are quite general and can be useful whenever
either experience or formal economic models carry information about the
likely behavior of observable series.
The special case when the prior for the observables takes the form of the one-step-ahead predictive distribution conditional on the known right-hand-side variables is known as the 'predictive approach' to elicitation. This
approach to obtaining priors (as opposed to the ’structural approach’ which
involves specifying directly the prior for the model coefficients) has been
advocated, in a general context, e.g. in Kadane (1980). A widely quoted
practical procedure for obtaining priors for parameters in this way is proposed
in Kadane et al. (1980), and has been applied in the time series context in
Kadane et al. (1996). Our prior from the previous section, which specifies
the first period growth rate conditional on the initial observation, can also be
handled with these methods. However, the tools of the ’predictive approach’
are not applicable in case of priors about growth rates over several periods,
which will be studied and motivated in this section.
A discussion of more general techniques for translating priors on observables into priors for coefficients can be found in Berger (1985, Chapter 3.5).
These techniques, however, are hard to apply for the large scale time series
models used currently in practice, so we develop a novel approach. We note
at this point that translating the marginal distribution of observables into a
prior for parameters is not simply a reparameterization, which is discussed
in more detail in Appendix C.
3.1 Defining the prior for coefficients
Let us consider a general N -dimensional stochastic process {yt }. We define
Y в‰Ў [y1 , ..., yT ] as a T Г— N matrix gathering the random variables from
which the sample of T observations is drawn, while Y represents the actual observed realization of these variables. The joint probabilities for the
stochastic process are given by a model that determines the likelihood function LY |B (Y ; B), which gives the density of observing a realization Y for a
parameter value B. We assume the researcher knows LY |B .
We now assume that the researcher is willing to state a prior distribution on Y,

ϕ_Y    (8)
representing the likely behavior of the series the researcher has in mind before analyzing the sample. It is, therefore, a marginal distribution of the
observable data. The researcher’s knowledge of the likelihood LY |B implies
that the uncertainty represented in ϕY is a combination of the analyst’s uncertainty about the actual values of coefficients B and the error terms of the
model. Therefore knowledge of LY |B and П•Y implicitly defines a marginal
distribution of the parameters fB which is consistent with these functions.
The task of obtaining the prior for coefficients means starting from the known
distribution П•Y and the likelihood LY |B , and finding the prior distribution
of parameters fB that is consistent with them.
A standard formula gives that the desired prior f_B must satisfy

∫ L_{Y|B}(Y; B) f_B(B) dB = ϕ_Y(Y)   for all Y    (9)
Obviously, this just says that the integral over B of the joint distribution of
Y, B has to equal the marginal distribution of Y specified by the prior for
observables (8). Thus (9) gives a functional equation that defines fB in terms
of the functions П•Y , LY |B . Equations of this type are known in calculus as
inhomogeneous Fredholm equations of the first kind.
Obviously, the above problem need not have a solution for arbitrary ϕ_Y and L_{Y|B}. For example, we already pointed out in the AR(1) case analyzed in subsection 2.3 that in order for (2) to be consistent with the model (1) we needed σ²_∆ ≥ σ²_ε. Other cases where a solution for f_B does not exist or where the solution is uninteresting will be discussed as we go along. In such cases the researcher's belief in ϕ_Y is simply incompatible with the model L_{Y|B}: the researcher is asking the model to do something it cannot do. As we will see later, in practice this is rarely a problem, because even if an exact solution does not exist, one can often find a prior for parameters which approximates the desired distribution ϕ_Y to a satisfactory degree.
Finding an f_B that solves the above functional equation is not trivial. Only in special cases can the prior be obtained analytically, as in the AR(1) case discussed in Appendix D. As discussed in Appendix C, translating the prior for observables into the prior for coefficients is not a matter of reparameterization using the usual change-of-variable formula. The reason is that the presence of the random errors (ε), in addition to the random parameters, complicates things. In principle, the random errors could be integrated out, but for that a joint distribution of the errors and observables has to be specified, one consistent with the assumption of independence of the parameters and the errors. Specifying such a distribution may not be easier than solving the integral equation (9).
It is easier to find numerically an f_B that satisfies (9) approximately. One may be tempted to obtain such an approximation by discretizing the distributions L_{Y|B} and ϕ_Y and viewing (9) as a linear system determining a discretized version of f_B. However, this approach is difficult to apply in practice: the resulting discretized f_B often has negative entries, and the system turns out to be ill-conditioned, so that a very large mass of the distribution can be shifted to values of B very far away from each other depending on how exactly the discretization is performed.¹⁵ We do not pursue this route in this paper; instead we propose the following.
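The ill-conditioning is easy to exhibit in a toy scalar case (our own illustration, with made-up grids): take the model y = b + ε with ε ∼ N(0, σ²), so the discretized "likelihood matrix" K[i, j] = density of y_i given b_j is a smooth gaussian kernel, and look at its condition number:

```python
import numpy as np

# Sketch of why the discretized version of (9) is ill-conditioned.
# The grid and sigma below are made up for illustration.
sig = 0.5
grid = np.linspace(-4.0, 4.0, 41)          # same grid for y and b
step = grid[1] - grid[0]

# K[i, j] = N(y_i; b_j, sig^2) * quadrature weight
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / sig) ** 2)
K /= sig * np.sqrt(2.0 * np.pi)
K *= step

# Astronomically large: tiny perturbations of the discretized phi_Y
# move the solved-for discretized f_B wildly.
print(np.linalg.cond(K))
```

Smoothing kernels wash out fine detail in f_B, so inverting them amplifies noise; this is the generic difficulty with Fredholm equations of the first kind.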
3.2 Fixed point formulation
We now reformulate the problem of defining fB as a fixed point problem.
This serves several purposes: first, it will suggest an algorithm to find an
approximate prior by successive iterations that are quite easy to perform
and that work well in the examples we have tried. These approximations will
deliver a normal distribution which will be easy to use in practice. Second,
this formulation will be useful in section 4 to compare the delta prior with
other priors proposed in the literature.
Let g be any density defined on the parameter space of B. Define the functional F as

F(g)(B) ≡ ∫ [ L_{Y|B}(Y; B) g(B) / (L_{Y|B} ∗ g)(Y) ] ϕ_Y(Y) dY   for all B    (10)

where

(L_{Y|B} ∗ g)(Y) ≡ ∫ L_{Y|B}(Y; B) g(B) dB
This functional can be interpreted in the following way: the term L_{Y|B}(Y; B) g(B) / (L_{Y|B} ∗ g)(Y) is the posterior on the parameters derived from the prior g, conditional on having observed the realization Y. Therefore, the expression inside the integral in (10) is the joint distribution of Y, B that emerges when the marginal density of Y is the specified ϕ_Y. Hence F(g), the integral of this joint distribution over Y, is the marginal distribution of B. The mapping sends a prior g to a density F(g) which is a mixture of posteriors given each Y, weighted by ϕ_Y. Obviously, F(g) and g have to be consistent; this is formalized in the following proposition:

¹⁵ A similar problem in the economics literature is encountered by Sims (2007b), who also discusses some of the technical difficulties in solving this kind of equation.
Proposition 1
1. F maps the space of densities of B into itself.
2. If f_B satisfies (9), then f_B is a fixed point of F.
Proof:
We first show that if g is a density of B then F (g) is also a density of B.
Since both LY |B and ϕY are densities we have LY |B , ϕY ≥ 0. Since we
only consider g’s that are densities we have g ≥ 0. Since the integral of a
non-negative function is non-negative we have
F (g)(B) ≥ 0 for all B.
Furthermore, we show that F(g) integrates to one for any density g:

∫ F(g)(B) dB = ∫ ∫ [ L_{Y|B}(Y; B) g(B) / (L_{Y|B} ∗ g)(Y) ] ϕ_Y(Y) dY dB    (11)
= ∫ [ ϕ_Y(Y) / (L_{Y|B} ∗ g)(Y) ] ( ∫ L_{Y|B}(Y; B) g(B) dB ) dY    (12)
= ∫ ϕ_Y(Y) dY = 1    (13)
where the first equality uses the definition of F and Fubini, the second equality just pulls out of the inner integral those terms that do not depend on B,
the third equality uses the definition of LY |B в€— g and the fourth equality
follows because П•Y is a density. This proves part 1.
To prove part 2, note that if f_B solves (9) then, for all B,

F(f_B)(B) = ∫ [ L_{Y|B}(Y; B) f_B(B) / (L_{Y|B} ∗ f_B)(Y) ] ϕ_Y(Y) dY    (14)
= ∫ L_{Y|B}(Y; B) f_B(B) dY = f_B(B) ∫ L_{Y|B}(Y; B) dY = f_B(B)    (15)
where the first equality holds from the definition of F , the second equality
follows from (9) and the third equality takes fB (B) before the integral since
it does not depend on Y and the fourth equality holds because L is a density
on Y so it integrates to 1 over its domain. Therefore F (fB ) = fB .
It would be nice to have sufficiency of the second part of this proposition
to assure that any fixed point of F is indeed a prior consistent with П•Y .
Unfortunately, at this writing we do not have a proof of sufficiency.
Absent sufficiency, the proposition can be used in the following way: if we
find a fixed point we have to go back and check if (9) is satisfied for the fixed
point that has been found. In that case we know that the fixed point of F has
the desired property. Indeed, since we can only aspire to finding approximate solutions, we will have to check (9) anyway to see if it is sufficiently close to being satisfied.
3.3 Gaussian approximate fixed point
Iterating on F in order to find a fixed point means iterating on distributions. Only in very special cases can the integrations in the subsequent iterations be performed analytically; we discuss one such special case in Appendix D. In general, this is a difficult numerical task. There are various strategies to perform it, some based on discretizing and imposing that the result is a distribution, others based on using the mapping F to generate draws from successive distributions in a similar way as MCMC techniques. We propose here an alternative that has worked well for us in practice.
The idea is to start with a normal distribution for B and iterate on F only approximately, staying within the realm of normal distributions along the iterations. First, this guarantees that along the way we always have a well-defined distribution for B. Second, a normal distribution is fully described by its mean and variance, which greatly reduces the dimensionality of the problem. Third, normal distributions produce closed-form formulas, which speeds up the calculations. We will discuss later that the accuracy of this approximation can be readily checked by examining how closely (9) holds and that, in a certain sense, the fact that this is an approximation is nearly irrelevant.
We specify the likelihood LY |B to come from a gaussian VAR, but it
should not be hard to generalize the procedure for other likelihoods. What
will be needed is that the likelihood and the parameterized prior are such
that the posterior is known.
Assume {y_t} to be a VAR(P) process:

y_t = Φ₁ y_{t−1} + ... + Φ_P y_{t−P} + γ + ε_t,   t ≥ 0

where ε_t ∼ N(0, Σ_ε) i.i.d. and γ is a vector of constant terms.¹⁶ We have T + P observations, so that our sample is given by Y ≡ [y₁, ..., y_T]′, a T × N matrix gathering the T observations determined by the model and taking the first P observations as given. The VAR(P) process can also be written as

Y = XB + E    (16)
where X collects, in the usual way, the lagged values of Y and a column of ones which multiplies the constant terms, B ≡ [Φ₁, ..., Φ_P, γ]′ and E ≡ [ε₁, ..., ε_T]′. The initial conditions Y₀ ≡ [y_{−P+1}, ..., y₀]′ are fixed. This implies the standard likelihood function for gaussian VARs, conditional on the initial P observations Y₀ (which are inside X, in the first rows):

L_{Y|B} ∼ N_{vec Y}((I_N ⊗ X) vec B, (Σ_ε ⊗ I_T))    (17)

We assume for simplicity that the error variance Σ_ε is known.
We now look for successive iterations on the mapping F within the space of normal distributions. We know from standard Bayesian econometrics that if we start with a gaussian prior for the coefficients,

g ∼ N(µ_g, Σ_g)    (18)

for some mean µ_g and variance Σ_g, then conditional on observing a value Y the posterior, defined as

p_g(B|Y) ≡ L_{Y|B}(Y; B) g(B) / (L_{Y|B} ∗ g)(Y),
is also gaussian, with mean and variance

Var_{p_g(·|Y)}(B) = (Σ_g^{−1} + Σ_ε^{−1} ⊗ X′X)^{−1}    (19)

E_{p_g(·|Y)}(B) = Var_{p_g(·|Y)}(B) (Σ_g^{−1} µ_g + vec(X′Y Σ_ε^{−1}))    (20)
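As an illustration of how such posterior moments are evaluated in practice, the following sketch computes them for a small simulated VAR(1) with known Σ_ε and checks that, under a very loose prior, the posterior mean essentially reproduces OLS (all dimensions and parameter values are made up):

```python
import numpy as np

# Sketch: posterior moments of a gaussian VAR with known Sigma_e.
# Dimensions and parameter values are made up for illustration.
rng = np.random.default_rng(1)
N, T = 2, 50
k = N + 1                                  # VAR(1): one lag plus a constant

# Simulate a stable VAR(1) to obtain (Y, X).
Phi = np.array([[0.5, 0.1], [0.0, 0.4]])
gamma = np.array([0.02, 0.01])
Sigma_e = 0.01 * np.eye(N)
y = np.zeros((T + 1, N))
for t in range(1, T + 1):
    y[t] = Phi @ y[t - 1] + gamma + rng.multivariate_normal(np.zeros(N), Sigma_e)
Y = y[1:]                                  # T x N
X = np.hstack([y[:-1], np.ones((T, 1))])   # T x k: lagged values and a constant

# Loose gaussian prior on vec B; Se_inv is the known error precision.
mu_g = np.zeros(N * k)
Sigma_g_inv = 0.1 * np.eye(N * k)
Se_inv = np.linalg.inv(Sigma_e)

post_var = np.linalg.inv(Sigma_g_inv + np.kron(Se_inv, X.T @ X))       # variance formula
post_mean = post_var @ (Sigma_g_inv @ mu_g
                        + (X.T @ Y @ Se_inv).flatten(order="F"))       # mean formula

b_ols = np.linalg.solve(X.T @ X, X.T @ Y).flatten(order="F")
print(np.max(np.abs(post_mean - b_ols)))   # tiny: a loose prior reproduces OLS
```

The `order="F"` flattening stacks columns, matching the column-major vec convention implied by the (I_N ⊗ X) form of the likelihood.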
In approximately iterating on the mapping F we exploit this closed-form solution for the posterior. We would now like to perform successive iterations on F. According to the above notation,

F(g)(B) = ∫ p_g(B|Y) ϕ_Y(Y) dY

so that F sends a prior g to a density F(g) which is a mixture of posteriors, averaged over many samples Y, each weighted by the probability of this sample according to the prior for observables ϕ_Y. This mixture need not be a normal distribution. However, we now approximate F(g) itself with a gaussian distribution with the same mean and variance. To evaluate the mean and the variance of B under the distribution F(g) we use a simple Monte Carlo procedure based on the following result:

¹⁶ It is straightforward to generalize this to the case with other exogenous variables.
Result 1

E_{F(g)}(B) = E_{ϕ_Y}[E_{p_g(·|Y)}(B)]    (21)

Var_{F(g)}(B) = E_{ϕ_Y}[Var_{p_g(·|Y)}(B)] + Var_{ϕ_Y}[E_{p_g(·|Y)}(B)]    (22)
Proof
Given g, for any function h we have

E_{F(g)}(h(B)) = ∫ h(B) F(g)(B) dB = ∫ h(B) [ ∫ p_g(B|Y) ϕ_Y(Y) dY ] dB
= ∫ [ ∫ h(B) p_g(B|Y) dB ] ϕ_Y(Y) dY = E_{ϕ_Y}[E_{p_g(·|Y)}(h(B))]    (23)
where the first equality follows by definition of EF (g) , the second by definition
of F (g), the third by Fubini and the fourth by definition of EП•Y .
Clearly, (21) follows when we consider h(B) = B.
To prove (22) we notice

Var_{F(g)}(B) = E_{F(g)}(B²) − [E_{F(g)}(B)]²
= E_{ϕ_Y}[E_{p_g(·|Y)}(B²)] − [E_{ϕ_Y}[E_{p_g(·|Y)}(B)]]²
= E_{ϕ_Y}[ E_{p_g(·|Y)}(B²) − [E_{p_g(·|Y)}(B)]² ] + E_{ϕ_Y}[ [E_{p_g(·|Y)}(B)]² ] − [E_{ϕ_Y}[E_{p_g(·|Y)}(B)]]²
= E_{ϕ_Y}[Var_{p_g(·|Y)}(B)] + Var_{ϕ_Y}[E_{p_g(·|Y)}(B)]

where the first equality follows from the basic fact that Var(X) = E(X²) − (E(X))²; the second equality uses (23) for h(B) = B², simple algebra, and (21); the third equality uses the 'basic fact' about variances just mentioned and the fact that E_{ϕ_Y} is a linear operator; and the fourth equality applies the 'basic fact' about variances again inside the operator E_{ϕ_Y}.
This result immediately suggests the following Monte-Carlo approximation to compute the mean and variance according to the distribution F (g):
draw M realizations of Y from П•Y ; for each draw Y compute Epg (В·|Y ) (B) and
Varpg (В·|Y ) (B) using the above closed-form expressions, finally approximate
the expectations EП•Y by averaging over the M draws the expressions inside
the expectations in the right side of (21) and (22). The normal distribution with mean EF (g) (B) and variance VarF (g) (B) found in this way is the
approximation to F (g) that we propose.
In the applications we propose below we found fixed points of F by successive iterations on the above scheme: starting from a relatively flat distribution as an initial guess, we find successive means and variances using the approximate iteration described above until the scheme delivers a satisfactory approximation to the desired marginal distribution of the data. We have no theorem
that such an algorithm will work, but in all practical applications we have
tried it delivered priors which implied marginal data densities quite close to
the desired one. In all cases it worked similarly as the analytically tractable
case in Appendix D: after the first few iterations the means of coefficients
stabilized, and subsequent iterations were only shrinking the prior variances.
Obviously if such iteration failed there are a number of search algorithms
that could be used to find a fixed point of the mean and variance.
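To see the scheme at work in a case where the fixed point is known in closed form, consider the sketch below (our own illustration, not from the paper): the "model" is a single observation y = α + ε with known σ_ε, the prior for the observable is y ∼ N(µ_∆, σ²_∆), and the iteration should recover the consistent prior α ∼ N(µ_∆, σ²_∆ − σ²_ε):

```python
import numpy as np

# Sketch of the approximate gaussian fixed-point iteration in the
# simplest scalar case, where the answer is known analytically.
# All numbers are made up for illustration.
rng = np.random.default_rng(2)
mu_d, sig_d, sig_e = 0.02, 0.10, 0.05
M = 200_000                          # Monte Carlo draws from phi_Y per iteration

mu_g, v_g = 0.0, 25.0                # start from a relatively flat prior
for _ in range(30):
    y = rng.normal(mu_d, sig_d, size=M)                  # draws from phi_Y
    post_var = 1.0 / (1.0 / v_g + 1.0 / sig_e**2)        # posterior variance, scalar case
    post_mean = post_var * (mu_g / v_g + y / sig_e**2)   # posterior mean, one per draw
    # Result 1: moments of F(g) via the law of total mean/variance.
    mu_g = post_mean.mean()
    v_g = post_var + post_mean.var()

print(mu_g)              # close to mu_d = 0.02
print(np.sqrt(v_g))      # close to sqrt(sig_d^2 - sig_e^2), about 0.0866
```

As in the analytically tractable case, the mean stabilizes after the first few iterations, while subsequent iterations mainly shrink the prior variance toward its fixed-point value.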
It can be seen that П•Y can be completely general for this scheme to work,
all we need is that we can generate random draws from it. It is also not necessary to restrict our attention to normal distributions, all that is needed is
that given the likelihood the posterior has a closed-form expression. Therefore, this idea should be applicable to a wide range of priors on observables
and to a wide range of approximations.
3.4 Specifying priors on observables and accuracy of the obtained prior
The key to this approach is to specify a prior distribution on observables
П•Y . This prior should allow itself to be translated into a prior density for
the coefficients. The translation problem should have a solution and be
numerically manageable. We discuss some generic considerations regarding
how such priors should be specified. Closely related is the issue of checking
if the obtained prior for the coefficients indeed has the desired property that
it is consistent with П•Y . Since numerical approximations will most likely
have to be used this is a necessary final step for any numerical approach.
Additionally, if the fixed point approach of the last subsection is applied,
this is a necessary step since we lack a sufficiency result.
To make our discussion concrete we now specify a prior for growth rates
for an N -dimensional VAR(P ). We state that the analyst has a prior notion
that growth rates are distributed
∆y_t ∼ N(µ_∆, Σ_∆) for all t = 1, ..., T₀    (24)
for given µ_∆, Σ_∆, T₀. Here, µ_∆ is the vector of prior means and Σ_∆ is the matrix of total prior variance that incorporates both the analyst's uncertainty about parameters and the uncertainty due to random shocks (in terms of our discussion of equation (6), this is equal to σ²_g + σ²_ε). Therefore, the prior for coefficients satisfying (24) will not exist if we specify a Σ_∆ which is smaller, in the matrix sense, than Σ_ε.¹⁷ We assume here that this mean and variance are constant for all t = 1, ..., T₀, but the researcher may be willing to leave open the possibility that the initial growth rates are not similar to the long run growth rates. For example, the researcher may want to allow for a transition period due to an initial condition much lower than the steady state.
We also need to specify the prior covariance structure of growth rates across time. Having a priori positively serially correlated growth rates is appealing for two reasons: first, because empirically observed growth rates of macroeconomic variables tend to show positive serial correlation; more importantly, because if, say, α is higher than expected, then all growth rates are higher than expected. In the empirical example we will specify, for simplicity, a single common correlation of growth rates across time.
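This covariance structure can be assembled as a Kronecker product of a cross-time correlation matrix with Σ_∆. The following is a minimal sketch in Python (not the language of the paper's computations); the names mu_delta, sigma_delta and rho_dg, and all numeric values, are illustrative assumptions:

```python
import numpy as np

# Sketch: mean and covariance of the stacked growth rates
# (dy_1', ..., dy_T0')' implied by (24), with a single common
# correlation rho_dg of growth rates across any two dates.

def growth_rate_prior(mu_delta, sigma_delta, rho_dg, T0):
    """mu_delta: (N,) prior mean; sigma_delta: (N, N) total prior
    variance at each date; rho_dg: scalar cross-time correlation.
    Returns the (T0*N,) mean and (T0*N, T0*N) covariance."""
    mean = np.tile(mu_delta, T0)
    # cross-time correlation matrix: 1 on the diagonal, rho_dg elsewhere
    time_corr = np.full((T0, T0), rho_dg) + (1.0 - rho_dg) * np.eye(T0)
    cov = np.kron(time_corr, sigma_delta)
    return mean, cov

# illustrative numbers for N = 2 variables, T0 = 4
mu, cov = growth_rate_prior(np.array([0.007, 0.012]),
                            0.0004 * np.eye(2), 0.4, T0=4)
```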
One key issue is how to choose the parameter T0 involved in the above prior distribution. One may be tempted to impose the prior on all the observations in the sample and to set T0 = T. But this is not reasonable, because then the weight of the prior increases as the sample size grows; the tightness of the prior grows with the sample size, and this prevents the posterior from being asymptotically valid. So, generically, we will need to specify a fixed T0 < T.
In principle, abstracting from singularities and other complications, it seems reasonable to expect that the prior (24) should define a distribution of the same dimension as the number of parameter values in the VAR in order for the implied prior for B to be unique. For example, in the model with only lagged dependent variables and a constant we need T0 N = P N² + N. If T0 is too low, the prior for growth rates gives rise to an improper prior for coefficients and to rather loose prior restrictions on the parameters, while if T0 is too large the implied prior is overdetermined and does not exist. As we have seen (and will see in other examples), even if T0 = P N + 1 we may end up with an overdetermined or improper prior.
It would seem at first glance that these issues of non-uniqueness or non-existence could plague any application of priors on observables. In practice, however, most of these difficulties can be turned into blessings. In section 2 we chose T0 = 1 precisely because we wanted to use a very loose prior and show how even such a loose prior matters for the posterior distribution. To the extent that many researchers may want to approach the data imposing minimal prior information on the estimation, it is desirable to use a low T0, even if this implies an improper prior, and let the data determine the posterior. In particular, one approach we will follow in the applied examples is to choose T0 so that the number of variables determined by the prior equals the number of initial observations that are fixed (i.e. that have no terms in the likelihood), that is, T0 = N P. One reason for this is that in this way we include as many restrictions as if we stated a distribution of initial conditions in terms of the parameters of the model, as in the so-called "exact" likelihood approach (see our discussion in Appendix A).

17 That is, we need the matrix Σ_∆ − Σ to be positive definite.
One final issue: instead of imposing a prior on the initial T0 observations as we did, we could have imposed a prior on the observations t = s, s + 1, ..., s + T0 for some value s > 1. There would be no technical problem with that, as long as we keep in mind that it would be inconsistent to always choose the last T0 observations, i.e. to set s = T − T0 + 1, as this would amount to saying that the prior ideas about the coefficients change with the sample size. But the choice of a fixed s is indeed, in a way, arbitrary. We always choose s = 1 in the applications below because a prior on the initial dates seems to influence the posterior more strongly and because, as shown in section 2, it relates to a large literature on small sample bias.
It is clear that (24), together with the assumption about covariances across time, implies a (prior) distribution for the first T0 periods of the series and, therefore, a prior distribution on observables φ_Y. The algebraic details are shown in Appendix B for completeness.
A numerical procedure, such as the one proposed in the previous section, can be applied to find a candidate prior for coefficients. Since the exact solution may not exist, or because of the numerical approximation, the prior needs to be checked. This is done in a straightforward way by a Monte Carlo simulation, in which coefficients are drawn from the candidate prior and data are then drawn from the likelihood function for T0 periods. In the case of the standard VAR, we draw gaussian errors and simulate the series starting from the same initial conditions as the actual data at hand. Such data y will be distributed with the marginal distribution of the data implied by the candidate prior. This distribution then needs to be compared with the desired marginal distribution of the data φ_Y. The comparison could involve a few moments or quantiles of the distribution, or plots of the densities. Formal comparison criteria are not strictly necessary, because the comparison is in the end inherently subjective: the researcher should verify whether the implied distribution of observables reflects his prior beliefs.
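The Monte Carlo check just described can be sketched for a univariate AR(1), y_t = α + ρ y_{t−1} + e_t. This is only an illustration: the candidate prior, the variances and the initial condition below are hypothetical values, not numbers from the paper.

```python
import numpy as np

# Sketch of the Monte Carlo check for an AR(1): draw coefficients
# from the candidate prior, simulate T0 periods from the fixed
# initial condition, and summarize the implied growth rates for
# comparison with the desired prior phi_Y.

rng = np.random.default_rng(0)
y0, T0, sigma_e = 100.0, 4, 1.0
prior_mean = np.array([0.5, 0.995])        # candidate prior for (alpha, rho)
prior_cov = np.diag([0.25, 0.0001])

draws = []
for _ in range(5000):
    alpha, rho = rng.multivariate_normal(prior_mean, prior_cov)
    y, growth = y0, []
    for _ in range(T0):                    # simulate T0 periods from y0
        y_new = alpha + rho * y + sigma_e * rng.standard_normal()
        growth.append(y_new - y)           # growth in levels; use logs for rates
        y = y_new
    draws.append(growth)

implied = np.asarray(draws)                # (5000, T0) implied growth draws
# compare moments/quantiles of `implied` with the desired phi_Y,
# or overlay density plots
summary = implied.mean(axis=0), np.quantile(implied, [0.025, 0.975], axis=0)
```

The comparison step, as the text notes, can stop at a few moments or quantiles of `implied` versus the assumed distribution.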
4 Comparison with Other Priors
The preceding discussion of priors for growth rates and their translation into priors for coefficients is also helpful for interpreting some of the other standard priors for VARs. We have discussed at length the relationship with the flat prior in section 2, so we will not comment on it any further. But most common informative priors for VARs result in adjusting the posterior in the direction of frequentist bias corrections, much as the delta prior does. We start by commenting on the Jeffreys prior and the Minnesota prior, which adjust the posterior similarly to the delta prior but for reasons other than using prior information about plausible growth rates. More closely related to our work are the prior of Villani (2008) and Uhlig's (1994a) 'exact likelihood' analysis, which can also be interpreted as using a particular prior. Finally, we turn to priors introduced through dummy observations, proposed by Sims (1996), Sims and Zha (1998), and Del Negro and Schorfheide (2004). These priors are also similar in technical details to the delta prior, and it turns out that the fixed point formulation discussed above helps in interpreting them.
4.1 Priors which adjust the OLS estimator for other reasons
Phillips (1991) criticized the flat prior on the grounds that it was not truly noninformative. He suggested using the Jeffreys prior, which has the attractive property of invariance to reparameterizations, and which turns out to put increasing weight on higher values of ρ. By favoring higher roots, the Jeffreys prior adjusts the posterior in the direction of frequentist bias corrections. Phillips' work sparked an intensive debate, to which we have nothing to add, but we note the following. First, the Jeffreys prior is not commonly found in practical applications, in part because of computational difficulties that make VAR applications hard.18 Second, instead of discussing what is a "truly" uninformative prior, we prefer to think of priors that are indeed informative but that are commonly acceptable to many economists.
Applied work often uses the so-called Minnesota prior, a Normal-Wishart prior centered at the random-walk model (see Doan et al., 1984). This estimator has an effect resembling a bias correction, since in practice it usually increases the root of the process relative to the OLS estimate. In any case, the correction is determined by the precision of the prior, which is set in a mechanical and essentially arbitrary way. The Minnesota prior has been shown to be effective for forecasting purposes. However, using a VAR with this prior to summarize properties of the data may be more contentious. If the prior matters for the results, these results can also be challenged by criticizing the prior. Part of the problem is that this prior is stated directly as a prior for coefficients. It is therefore difficult to have a meaningful discussion about what are reasonable values for the parameters of the Minnesota prior, since the mapping between coefficients and features of the observables, about which one may have prior information, is not a very intuitive one.

18 Cointegrating VARs with the Jeffreys prior are studied in Kleibergen and van Dijk (1994).
In the standard specification the Minnesota prior is essentially different from the delta prior, because it does not incorporate any knowledge about growth rates: the standard specification uses a flat prior for the constant term of the VAR (Doan, 2000, p.310), which implies a flat prior for the growth rate of the series. The shape of the Minnesota prior is also quite different, because it assumes that the coefficients are independent a priori. In contrast, the delta prior usually implies a certain prior correlation among the coefficients of the VAR, so as to keep the growth rate within its distribution. For example, in the simple case of subsection 2.3 it is clear that condition (6), assuming e.g. y0 > 0, implies a negative prior correlation between α and ρ.
The above comments relate to the way the Minnesota prior has been applied so far. But the ideas of the Minnesota prior and the ideas discussed in the current paper can be combined in a productive way: one could specify a prior on growth rates and then find the parameters of the Minnesota prior that most closely reproduce these pre-specified growth rates. That is, one could impose on f_B the parameterization of the Minnesota prior and iterate on these parameters until they produce a prior distribution that agrees satisfactorily with the prior for growth rates. In this case, given our observation in the previous paragraph, it would be important to extend the parameterization to allow for a correlation between constant terms and autoregressive coefficients. This would still reduce enormously the number of parameters that determine the prior for VAR coefficients. We have not pursued this idea in this paper.
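A minimal sketch of this idea, for an AR(1) rather than a VAR: restrict the prior to a Minnesota-style parameterization with a single tightness parameter and search for the tightness whose implied distribution of first-period growth rates comes closest to a desired one. The moment-matching criterion, the prior shapes and all numbers are our own illustrative assumptions, not a procedure from the paper.

```python
import numpy as np

# Sketch: Minnesota-style prior for AR(1) coefficients (alpha, rho),
# indexed by a single tightness lam: rho ~ N(1, lam^2),
# alpha ~ N(0, (5*lam)^2), with alpha and rho independent (the
# standard specification). Search for the lam whose implied
# first-period growth-rate distribution best matches N(mu_d, s_d^2).

rng = np.random.default_rng(1)
y0, sigma_e = 100.0, 1.0
mu_d, s_d = 0.7, 2.0                       # desired growth-rate prior

def implied_growth_moments(lam, ndraws=4000):
    alpha = 5.0 * lam * rng.standard_normal(ndraws)
    rho = 1.0 + lam * rng.standard_normal(ndraws)
    # first-period growth: dy_1 = alpha + (rho - 1) * y0 + e_1
    dy1 = alpha + (rho - 1.0) * y0 + sigma_e * rng.standard_normal(ndraws)
    return dy1.mean(), dy1.std()

grid = np.linspace(0.005, 0.2, 40)
losses = []
for lam in grid:
    m, s = implied_growth_moments(lam)
    losses.append((m - mu_d) ** 2 + (s - s_d) ** 2)
lam_star = float(grid[int(np.argmin(losses))])
```

As noted above, with alpha and rho independent this parameterization can only match the spread of the growth-rate prior, not its location; matching a nonzero mean would require the extension with correlated constant and autoregressive terms.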
4.2 Priors similar to the delta prior
The work of Villani (2008) can be interpreted as imposing a delta prior at some dates in the very distant future. Formally, it would involve specifying a fixed T0 and taking s → ∞. This is in the same spirit as the delta prior, since it incorporates prior information on growth rates. The key difference is that Villani specifies VARs in differences, i.e. imposes a unit root with certainty. In contrast, the delta prior is designed for the VAR in levels, and it pushes the posterior towards the unit root only as much as is appropriate given the specified prior.19
Many papers have implicitly introduced priors about initial growth rates when exploiting the information about parameters contained in the initial observations of the sample. The standard likelihood, conditional on the initial observations, ignores this information. There are two approaches to correcting this deficiency of the standard likelihood (which may be equivalent in simple cases): one involves specifying the distribution of the initial observation conditional on parameters, the other specifying the distribution of parameters conditional on the initial observation. Works such as Thornber (1967), Zellner (1971, ch.7.1), Lubrano (1995) and many others write the density of the initial observation conditional on parameters. The likelihood including the term for the initial observation is often called the "exact" likelihood. In fact, using the word "exact" for this exercise is slightly presumptuous, since one can only specify the distribution of the initial observation by making very strong assumptions about what happened before any sample observation was available. An alternative interpretation is used e.g. in Schotman and van Dijk (1991a,b), who specify a prior about the mean of the process, conditional on the initial observation.
Linking the parameters of the model and the initial observation is least controversial in stationary models. Therefore, most of the above papers have either excluded nonstationary models or specified separate assumptions for this case. An approach which allows a unified treatment of the stationary and nonstationary cases is proposed in Uhlig (1994a). (The main focus of his paper is on studying the Jeffreys prior, but we concentrate only on his treatment of initial observations.) Uhlig (1994a) links parameters and initial conditions by specifying how many periods the model has been operating before the sample starts (Uhlig's parameter S) and by stating the value of the process at the beginning of these periods. We call his approach an S-prior, although he stated it as writing the exact likelihood.
The common feature of the S-prior and the previous priors is that they link the parameters to the initial observation in a way which rules out crazy growth rates in the first periods. In terms of our Figure 3, they make situations like that in the lower panel very unlikely. In fact, the S-prior (and the previous approaches) is analogous to the standard practice in classical small sample econometrics of using the distribution of the estimators where initial conditions are generated by the model.
In terms of our discussion in section 2, the relation of the delta prior to Uhlig's S-prior (and the previous approaches) is quite close. In Appendix A.2 we report the frequentist properties of the posterior mean under the S-prior in the univariate case. We show that the S-prior has properties similar to those of the delta prior: it delivers a similar correction of the OLS estimator and it has very good mean squared error properties in a relevant range of parameters. Furthermore, in Appendix A.3 we propose an extension of Uhlig's methodology to the multivariate case and, in an empirical application, find results close to those of the delta prior. These results reinforce the theme of section 2, namely, that if huge growth rates are discarded in the prior, an asymmetric posterior is obtained and the Bayesian estimator can be seen as correcting the bias of OLS and improving over classical estimators.

19 Note that Villani's statement about long run series behavior is in fact a reparameterization. This is why he does not encounter the difficulties of obtaining a prior that we discussed in section 3.
However, we feel that the delta prior is easier to use in applied work than the S-prior. To our knowledge, the application in Appendix A.3 is the first extension to multivariate VARs, and it relies on certain approximations. Most importantly, it is not clear how to pick S and the initial condition. One alternative is to pick S = ∞, but this is as arbitrary as any other assumption: why not 10 periods, or since the end of WWII? Furthermore, for roots equal to or larger than 1 we have to assume a finite number of periods and an arbitrary initial condition at that time. The natural choice, followed by Uhlig (1994a), is to take y_{−S} = µ, but this choice is as hard to defend as any other. We presume that it will be difficult for economists to agree about the "right" value of S and the "right" value of the initial condition. Our proposed delta prior avoids these issues by taking the initial value of the process as given, which amounts to saying that we have a completely noninformative prior for S, the initial condition y_{−S}, and anything else that happened before our sample starts. To our thinking, it should be much easier for economists to compromise on reasonable values of the growth rates.
4.3 Dummy observations priors
An alternative is to specify the prior through dummy observations, as in Sims and Zha (1998), Sims (2006) and other works of Sims. This is similar to the situation e.g. in Del Negro and Schorfheide (2004), where the dummy observations are generated by a DSGE model and have a distribution implied by the prior distribution of the structural model parameters.
The idea of introducing dummy observations is motivated by the need to push the posterior towards a model that is "reasonable" and that is represented by these dummy observations. This is related, but not equivalent, to priors on observables. The key difference is that, as stated in Sims (2007a, p.7), "These dummy observations are interpreted as 'mental observations' on A, c [i.e. the VAR parameters], not as observations on data satisfying the model..." Let us give a detailed account.
Sims (1996, 2000) argued that, even though a flat prior justified using OLS from a Bayesian perspective, the results were unsatisfactory because the OLS estimate tends to attribute too much of the data dynamics to the deterministic component of the process. He referred to this phenomenon as the 'other side of the well-known bias towards stationarity' (Sims and Zha, 1998, p.959). Sims proposed to address this issue by using several types of dummy initial observations. The most relevant for this paper is the 'one-unit-root' dummy observation prior, which reflects the 'belief that no-change forecasts should be 'good' at the beginning of the sample' (Sims, 1996, p.5). This is a special case of the delta prior, with a prior growth rate of zero in the first period. The dummy observation consists of

y_1 = y_0 = y_{−1} = ... = y_{−P+1} = ȳλ

where ȳ = (1/P) Σ_{i=0}^{P−1} y_{−i} and λ is a specified scalar. The vector of 1's that corresponds to the constant term in the data matrix is set to λ in this observation.
In terms of our previous discussion this prior can be written as:

p(B) = ∫ [ L_{Y|B}(Y; B) / ∫ L_{Y|B}(Y; B) dB ] δ_Y^{SD}(Y) dY

where δ_Y^{SD} is a degenerate density of Y which puts a unit mass on the dummy observation defined above. This is the result of applying the mapping F once to a flat prior, i.e. the first iteration on our fixed point problem.
While similar in spirit, Sims' 'one-unit-root' dummy observation prior differs from the delta prior in the following ways: i) the initial conditions are taken to be ȳ instead of the actual y_{−P+1}, ..., y_0; ii) all the prior mass is on a growth rate of 0; and iii) the variance of the error in the dummy observation differs from the error variance of the regular observations, being determined by the arbitrary scalar λ. Point iii) is a consequence of the dummy observations being 'mental observations on parameters' rather than observations on the data; and if they are not observations on the data, then further iteration on the mapping F is not necessary. The effect of ii), i.e. of putting all the prior mass on just one value of the growth rate, makes the prior tight, but it can be made looser by inflating the corresponding error variance.
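The construction of this dummy observation can be sketched as follows, for a VAR written as y_t' = x_t' B + e_t' with x_t stacking P lags and a constant; the function below is a hypothetical helper, not code from the paper:

```python
import numpy as np

# Sketch: the 'one-unit-root' dummy observation for an N-variable
# VAR(P). One artificial row is appended to the data matrices;
# lam is the scaling parameter lambda.

def one_unit_root_dummy(y_init, lam):
    """y_init: (P, N) array holding y_{-P+1}, ..., y_0."""
    P, N = y_init.shape
    ybar = y_init.mean(axis=0)             # average of the P initial values
    y_row = ybar * lam                     # dummy 'dependent variable'
    # every lag equals ybar*lam; the constant column is set to lam
    x_row = np.concatenate([np.tile(ybar * lam, P), [lam]])
    return y_row, x_row

y_row, x_row = one_unit_root_dummy(np.array([[1.0, 2.0],
                                             [1.1, 2.1]]), lam=1.0)
```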
The 'one-unit-root' dummy observation prior is computationally much more convenient than the delta prior. However, this prior, while making a reference to the behavior of the series, is in fact a prior on coefficients. This makes the scaling parameter λ difficult to interpret and elicit, in contrast to the variance of the growth rates of observables in the delta prior approach.
5 Empirical Example: the effect of monetary shocks in the US
To illustrate the effect of alternative priors on macroeconomic VARs we first replicate the estimation of the effects of a monetary shock in the US from Christiano et al. (1999).20 The authors estimate a VAR with output (Y, measured by the log of real GDP), prices (P, the log of the implicit GDP deflator), commodity prices (PCOM, the smoothed change in an index of sensitive commodity prices), the federal funds rate (FF), total reserves (TR, in logs), nonborrowed reserves (NBR, in logs) and money (M1 or M2, in logs). All data are quarterly and the sample runs from 1965:3 to 1995:2. Residuals are orthogonalized with the Choleski decomposition of the variance, using the above variable ordering, and the monetary shock is the one corresponding to the federal funds rate.
Figure 7 displays the responses of output to a monetary shock, estimated with alternative approaches. We focus on the responses of output first, since they happen to be the most affected both by the frequentist small sample bias and by alternative prior assumptions. Responses of the remaining variables are reported and briefly discussed further below.
Our benchmark is the posterior distribution of the impulse responses obtained with the flat prior for the coefficients. It is displayed in all plots, as the shaded region, to facilitate comparisons. More precisely, this region shows a 95% band, constructed as the 0.025 and 0.975 quantiles, calculated separately for each horizon, of the posterior distribution of the impulse responses. This distribution is simulated by Monte Carlo following the well known algorithm from the RATS package manual (Doan, 2000, example 13.4).21 The flat prior band is almost symmetric around the OLS point estimate, which is also displayed in the first plot as the dashed line. Continuous lines on each plot present 95% bands constructed with alternative approaches.
The first plot reproduces the results of Christiano et al. (1999), who use a bootstrap procedure intended to convey the uncertainty about the point estimates of the impulse responses, while disregarding the small sample bias. In this bootstrap procedure, the OLS point estimate of the coefficients is taken to be the data generating process, and series are repeatedly generated with these coefficients from resampled errors. The band is constructed from the percentiles of the distribution of the impulse responses estimated by OLS on the generated series (the 'other-percentile' bootstrap, see Sims and Zha (1999)). Although this was not the intention of the authors, this experiment is essentially a study of the frequentist small sample bias in this context. It shows that when the data generating process exhibits as much persistence as the dashed line, the OLS estimates on data coming from this DGP tend to be less persistent, and the output response to the monetary policy shock is weaker and dies out sooner. The sample is quite large - 120 observations - so the bias may not be striking at first glance, but in smaller samples even 95% bootstrap bands often exclude the point estimates. (Also, it is possible that this application has so many parameters that the sampling error is very large and the bias is, by comparison, small potatoes.)

20 We choose this example because the data are on the Internet, and because the authors use bootstrap error bounds, which highlight the frequentist small sample bias.
21 The precise form of the prior is p(B, Σ) ∝ |Σ|^(−(N+1)/2).
The next step is to impose the prior on the initial growth rates of the variables. The parameters of the basic specification of the prior are given in Table 1. We have chosen these values mechanically, by simply setting the prior means and variances of growth rates equal to those observed on average in the whole sample. A more opinionated reader can plug in his or her own preferred values. We set all autocorrelations of growth rates to 0.4. We take T0 = 4, the number of lags in the VAR, so that the number of assumed distributions of growth rates equals the number of initial observations on which the likelihood conditions. In this way we intend to complete the conditional likelihood with terms conveying the assumption that the initial observations behave, in terms of growth rates, similarly to the rest of the sample.
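This mechanical choice of prior parameters can be sketched as follows; `data` is a stand-in for the (T, N) matrix of the VAR variables (here just simulated random walks, purely for illustration):

```python
import numpy as np

# Sketch: set mu_delta and the standard deviations to the sample
# moments of the observed quarterly growth rates, with all
# cross-time correlations at 0.4 and T0 = 4.

rng = np.random.default_rng(2)
data = np.cumsum(0.01 + 0.02 * rng.standard_normal((120, 3)), axis=0)

dgrowth = np.diff(data, axis=0)            # quarterly growth rates
mu_delta = dgrowth.mean(axis=0)            # prior means
sd_delta = dgrowth.std(axis=0, ddof=1)     # prior standard deviations
T0, rho_time = 4, 0.4                      # baseline specification

# annualized values, as reported in Table 1 (quarterly values times 4)
table = np.column_stack([4 * mu_delta, 4 * sd_delta])
```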
Table 1: Average growth rates and standard deviations of the endogenous variables in the full sample (1965Q3:1996Q2)

variable   mean annualized growth rate   annualized standard deviation
Y          2.7                           3.6
P          5.0                           2.5
PCOM       3.2                           206
FF         0.1                           4.8
NBR        5.4                           9.1
TR         5.2                           6.6
M1         6.5                           4.0

Note: The quarterly growth rates and their standard deviations are multiplied by 4. The original quarterly values were used in the delta prior.
[Figure 7: Impulse responses of output to monetary shocks. Six panels: 'bootstrap', 'delta prior - empirical', 'delta prior - zero growth', 'Sims initial dummy', 'Kilian', 'Kilian nonstationary'. Each panel plots horizons 0-14 and shows the flat prior 95% band (shaded), the OLS point estimate (dashed, first panel), and the alternative 95% band (continuous lines).]
Note that with this choice of T0 the dimension of the prior is only 4 × 7 = 28, compared with the 4 × 7² + 7 = 203 coefficients, so the precise prior would be improper. To keep the prior proper, we complete it with a very weak shrinkage prior centered at zero. We accomplish this by initializing the fixed point iteration, by which we find the approximate gaussian prior for the coefficients, with a gaussian prior with zero mean and a large diagonal variance (with diagonal terms equal to 100², which is several times larger than e.g. the posterior variances of the coefficients estimated with the flat prior). After 160 iterations most variances have shrunk by many orders of magnitude, but some of them remain barely different from the starting point, consistent with the pure delta prior being improper.
The match between the 28 assumed growth rates and the growth rates implied by the gaussian prior actually used is illustrated in Figure 8, and we deem it quite reasonable.
In contrast to the noninformative prior, which is centered around the OLS estimate, and in even starker contrast with the bootstrap bands, the delta prior results in impulse responses which are more persistent than the OLS estimates imply. They are also stronger in the medium run. The effect on the economic interpretation of the results is striking. The cumulative effect of the shock after 4 years is -6.6% of quarterly GDP (at the median) when estimated with the delta prior, but only -4.6% when estimated with OLS, and only -3.1% according to the bootstrap bands. Therefore, using the flat prior we underestimate the cumulative effects of the monetary shock by about one third, and when using bootstrap bands by more than a half, compared with the effects obtained under what we think is a reasonable prior.
That the delta prior impulse responses are more persistent is intuitive since, as discussed earlier, assumptions about initial growth rates are also implicit in the frequentist analyses of the small sample bias. Incorporating these assumptions leads to the optimal Bayesian posterior which, from the frequentist point of view, implements a bias correction.22
In the second row of Figure 7 we display two more plots, which illustrate the relationship between the delta prior and the 'initial dummy observation' prior proposed by Sims. Sims' initial dummy observation prior reflects the idea that 'a no-growth forecast is a good forecast at the beginning of the sample'; it is very convenient to introduce, but its variance has no intuitive interpretation. We compare it with a prior for growth rates with a variance similar to the baseline case, but centered at zero growth rates. We

22 The bands are narrower, which partly reflects the fact that, in the current version of the procedure, we do not take into account the uncertainty about the error variance, and instead fix it at the OLS estimate. A flat prior band which disregards uncertainty about the error variance has a similar width to the presented delta prior band.
[Figure 8: Distributions of the growth rates of the VAR variables in the first 4 periods of the sample, obtained by Monte Carlo simulation. Panels show, for each variable (Y, P, PCOM, FF, NBR, TR, M) at horizons t = 1, ..., 4, the 'actual' marginal density of the growth rates implied by the baseline delta prior against the 'desired' density assumed for the growth rates, which the delta prior is to match.]
can see that Sims' prior increases the persistence of the response relative to the flat prior, in a similar manner to the delta prior. Its effect is much weaker than that of the prior for initial growth rates centered at zero with empirically observed variances. Our interpretation is that Sims' prior is centered at zero growth rates but ends up having a larger variance than our baseline specification.
In the third row of Figure 7 we display, for comparison, the effect of applying frequentist bias correction procedures in the present context. Of course, the resulting bands correspond to the distribution of the impulse response across alternative datasets, which is a different statistical object than the posterior distribution. However, such bands are often casually interpreted as delivering Bayesian post-sample uncertainty about impulse responses. We apply the procedure for constructing error bands for impulse responses proposed in Kilian (1998), called bootstrap-after-bootstrap. It is intended to improve the coverage of bootstrap intervals by approximately removing the small-sample bias at each step of the simulation. The bootstrap is used to approximately remove the bias, and the underlying assumption is that the bias is constant in the neighborhood of the OLS estimate.
As described in Kilian (1998), the first step consists of estimating and correcting the bias of the OLS point estimate. In the second step the error bands are generated: series are repeatedly simulated from the data generating process implied by the corrected OLS estimate, a VAR is estimated by OLS on each simulated data set, the OLS estimate is corrected for bias analogously to the first step, and finally impulse responses are computed and stored. In a simplified, but computationally much cheaper, version of this algorithm, the bias estimate obtained in the first step is reused in the second step to correct all the OLS estimates on generated data (instead of performing a separate bootstrap for each of them). We use this simplified bootstrap-after-bootstrap here.
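The simplified bootstrap-after-bootstrap can be sketched for an AR(1); the sample size, numbers of bootstrap replications and the DGP below are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

# Sketch of simplified bootstrap-after-bootstrap for an AR(1)
# y_t = a + r * y_{t-1} + e_t: the OLS bias is estimated once in a
# first bootstrap, and the SAME bias estimate is reused to correct
# every OLS estimate in the second-stage simulations.

rng = np.random.default_rng(3)

def ols_ar1(y):
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    b = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    return b, y[1:] - X @ b                # coefficients and residuals

def simulate(b, resid, y0, T):
    y = np.empty(T + 1)
    y[0] = y0
    e = rng.choice(resid, size=T)          # resampled errors
    for t in range(T):
        y[t + 1] = b[0] + b[1] * y[t] + e[t]
    return y

# illustrative data from a persistent AR(1)
y = simulate(np.array([0.5, 0.95]), rng.standard_normal(500), 10.0, 120)
b_ols, resid = ols_ar1(y)

# step 1: estimate the OLS bias at b_ols by a first bootstrap
bias = np.mean([ols_ar1(simulate(b_ols, resid, y[0], len(y) - 1))[0]
                for _ in range(200)], axis=0) - b_ols
b_corr = b_ols - bias                      # bias-corrected point estimate

# step 2: bands from re-estimating and re-correcting with the SAME bias
draws = np.array([ols_ar1(simulate(b_corr, resid, y[0], len(y) - 1))[0] - bias
                  for _ in range(500)])
band = np.quantile(draws[:, 1], [0.025, 0.975])  # band for the AR root
```

In the full version of Kilian's algorithm, the bias in step 2 would be re-estimated by a separate bootstrap at every draw instead of reusing `bias`.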
Bias correction pushes the roots of the process towards the nonstationary range, and explosive responses may often result. This may, however, be a spurious effect caused by the assumption of constant bias. Kilian suggests shrinking the bias estimate whenever the bias-corrected estimate would become explosive, thus guaranteeing stationarity of the entire simulated distribution of impulse responses. Sims and Zha (1999), when applying Kilian’s method, deviate here and allow for explosiveness. We report results from both approaches: the plot labeled ’Kilian’ shows the results with shrinking of the explosive roots and the plot labeled ’Kilian nonstationary’ shows responses allowing for explosiveness.
In the present example the VAR has roots close to unity and the results
differ quite a lot depending on the handling of the nonstationary roots. The
bands obtained with shrinking of explosive roots actually exhibit marginally faster mean reversion even than the flat prior bands. The bands obtained without shrinking are much more spread out at farther lags, and put even more weight on explosive behavior than the posterior results with the delta or Sims’ priors. Overall, the figure illustrates the dilemma involved in applying bootstrap-after-bootstrap when the root of the system is close to unity: imposing stationarity discards much of the bias correction, while allowing nonstationarity exposes the results to the inaccuracy of the assumption of constant bias. We conjecture that it is because this assumption fails in practice that the bands contain too much nonstationarity and are so wide at farther lags. It is possible that applying the full, rather than the simplified, version of Kilian’s algorithm gives better results.
In Figures 9 and 10 we report impulse responses for all variables. The OLS point estimate is plotted with a dashed line, the same on all plots, and the 95% bands are delimited by continuous lines. The first column of Figure 9 shows that the frequentist bias towards stationarity affects most clearly the response of output, and to a much smaller extent the federal funds rate (FF), total reserves (TOTR) and money (M1). We are identifying only one shock, so we do not check what happens to the remaining 7 × 6 = 42 impulse responses of this system. Interestingly, the effect of the delta prior is strongest exactly where the bias, indicated by the deviation of the bootstrapped median from OLS, was strongest.
6  Conclusions
We have argued in this paper that, beyond the AR(1) model without intercept, the decision of whether or not to adjust the OLS estimate in time series is not a matter of philosophical principles (the Bayesian vs frequentist perspective). Instead, it depends on whether we believe that the initial condition is compatible with reasonable growth rates, or that any initial condition is possible.
While there are a number of standard priors for VARs which share the same qualitative features, our preferred alternative is the delta prior. Unlike its contenders, this prior has a clear interpretation and it is genuinely possible to elicit it. It should be uncontroversial that most economic variables cannot have huge growth in any given period; introducing this knowledge about the economy reconciles the Bayesian posterior with the certainty of classical econometricians (at least those who correctly worry about small sample issues) that OLS should be adjusted upwards close to a unit root.
[Figure 9 plot area: rows of panels for Y, P, PCOM, FF, TR, NBR and M1; columns for the bootstrap, flat, delta (empirical), delta (zero) and Sims dummy methods; each panel shows the OLS response with the 2.5 and 97.5 percentile bands over horizons 0–16.]
Figure 9: Impulse responses to monetary shocks: OLS point estimate (dashed
line) and the 95% uncertainty bands (continuous lines) generated by alternative methods
[Figure 10 plot area: rows of panels for Y, P, PCOM, FF, TR, NBR and M1; columns for the bootstrap, Kilian and Kilian nonstationary methods; each panel shows the OLS response with the 2.5 and 97.5 percentile bands over horizons 0–16.]
Figure 10: Impulse responses to monetary shocks: OLS point estimate
(dashed line) and the 95% uncertainty bands (continuous lines) generated
by alternative methods
Even from the classical perspective, our Bayesian posterior estimates are
attractive. Correcting the bias does not really deliver (mean-) unbiasedness
while our estimator, which also reduces bias, can have an edge in terms of
mean squared error.
We have illustrated the effect of the delta prior in a macroeconomic VAR
for the US economy. The posterior corrects the excessive stationarity of
the OLS estimate, and implies much more persistent responses of output to
monetary policy shocks.
It will be interesting to apply the philosophy of specifying a prior for the
VAR through features of the data to other cases. Priors for VARs implied
by DSGE models are a promising application on which we plan to work.
Appendices
Appendix A  Initial conditions

A.1  Initial conditions in the studies of bias
This appendix discusses how the initial condition for the AR(1) process has been specified in this literature. To facilitate the discussion of the initial conditions, it is convenient to reparameterize the model as:

y_t − µ = ρ (y_{t−1} − µ) + ε_t    for t = 1, . . . , T    (A.1)

This parameterization is a special case of (1) for:²³

α = µ(1 − ρ)    (A.2)

The specified distribution for y_0 will be related to µ. Under (A.1) the process reverts to µ if |ρ| < 1 and moves away from µ if |ρ| > 1. The constant term drops out when ρ = 1.²⁴
²³ One cannot go the other way around: if ρ = 1 in (1) there is no µ that satisfies the above equation and generates the same y’s. This will have implications for the practical application of this parameterization when the observed variable has a positive growth rate and is close to a unit root, because a trend will then have to be introduced to obtain a growing variable.
²⁴ Zivot (1994) explains that the motivation for suppressing the constant term α when ρ = 1 is that it changes interpretation in the unit root case, from determining the level of the process to determining its drift. When no trend is included for general ρ, it is not reasonable to allow for a drift at just one point of the parameter space: ρ = 1.
Classical estimators need a fully specified distribution of the process in order to derive the small sample properties of the estimator.²⁵ A necessary step for that is specifying the distribution of the initial observation y_0. Since the focus is on estimating ρ, it is convenient to specify a distribution that causes the OLS estimator of ρ to be independent of the ’nuisance’ parameters of the model: µ and σ. In that way, the results on estimators of ρ are valid for a wide range of nuisance parameters. A general condition for this independence is:

Result 2  Assume the initial condition in model (A.1) is given by:

y_0 = µ + σψ    (A.3)

where ψ is a random variable. Then, if ψ is independent of the shocks and its distribution is independent of µ and σ, the distribution of the OLS estimator of ρ in (1) is independent of µ and σ.²⁶
All classical papers that we know of use a version of (A.3). Of course, each choice of ψ will result in a different small sample distribution of ρ̂ conditional on ρ. The most popular approach is to assume that the first observation is drawn from the stationary distribution of the process (as if the process had been running for a long time before the beginning of the sample). This means assuming y_0 ∼ N(µ, σ^2/(1 − ρ^2)), or ψ ∼ N(0, 1/(1 − ρ^2)). The stationary distribution
²⁵ We note in this paper that this forces classical econometricians to consider carefully the joint distribution of the data and the parameters, which, in our Bayesian interpretation, helps them elicit reasonable priors.
²⁶ Similar results have been used in the literature. As can be seen, the proof is very simple, but we could not find a formal proof, so we offer it here for completeness. The proposition is very similar to the property of ρ̂ discussed in Andrews (1993, Appendix A), which contains a verbal proof for |ρ| ≤ 1 and a particular distribution for ψ, but allowing for a trend.

Proof: Define normalized errors u_t ≡ ε_t/σ. (A.3) allows us to write:

y_t = µ + σ ( Σ_{i=1}^t ρ^{t−i} u_i + ρ^t ψ ) = µ + σ ỹ_t

where ỹ is the zero-mean process which would obtain from the same realization of errors, rescaled to have unit variance. Then it is a matter of simple algebra to show that (sums run over t = 1, . . . , T):

ρ̂ ≡ [ Σ y_t y_{t−1} − (Σ y_{t−1})(Σ y_t)/T ] / [ Σ y_{t−1}^2 − (Σ y_{t−1})^2/T ] = [ Σ ỹ_t ỹ_{t−1} − (Σ ỹ_{t−1})(Σ ỹ_t)/T ] / [ Σ ỹ_{t−1}^2 − (Σ ỹ_{t−1})^2/T ]
exists only for stationary processes, and therefore this assumption must be
completed by another assumption for the nonstationary values of ρ.
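As a quick numerical companion to Result 2 (the illustration is ours), one can verify that the OLS estimator of ρ with an intercept is numerically identical for y and for the rescaled process ỹ, so its distribution cannot depend on µ or σ:

```python
import numpy as np

def ols_rho_with_intercept(y):
    """OLS estimate of rho in y_t = alpha + rho*y_{t-1} + e_t."""
    x, z = y[:-1], y[1:]
    T = len(x)
    return (T * (x @ z) - x.sum() * z.sum()) / (T * (x @ x) - x.sum() ** 2)

rng = np.random.default_rng(0)
rho, mu, sigma, T = 0.8, 5.0, 2.5, 40
psi = rng.standard_normal()          # any psi whose law ignores mu and sigma
u = rng.standard_normal(T)           # normalized shocks
ytilde = np.empty(T + 1)             # zero-mean process with unit-variance shocks
ytilde[0] = psi
for t in range(1, T + 1):
    ytilde[t] = rho * ytilde[t - 1] + u[t - 1]
y = mu + sigma * ytilde              # actual data: y_0 = mu + sigma*psi
# The OLS estimate is identical for y and ytilde, so its sampling
# distribution cannot depend on mu or sigma.
assert np.isclose(ols_rho_with_intercept(y), ols_rho_with_intercept(ytilde))
```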
Bhargava (1986, p. 372), who constructs tests for unit roots invariant to µ and σ, takes

ψ ∼ N(0, 1/(1 − ρ^2))    when |ρ| < 1
ψ ∼ N(0, 1)              otherwise    (A.4)
Andrews (1993) considers median unbiased estimators for ρ by taking a similar starting point for the stationary case and an arbitrary starting point (i.e. ψ is an arbitrary constant) for the unit root case (he does not consider explosive values of ρ).

An alternative approach is taken in MacKinnon and Smith (1998), who make the same assumption for both stationary and nonstationary parameter values (see their p. 206-207):

y_0 ∼ N(µ, σ^2)    (A.5)

so that they take ψ ∼ N(0, 1) for all ρ.

All these choices guarantee that, in the case of ρ < 1, the process starts not too far from its steady state, and therefore the convergence towards the steady state will not generate excessive growth rates. In the explosive cases, the speed of divergence from µ increases with the distance from µ, and the assumptions used limit growth rates in this case as well.

Note, however, that these alternative initial conditions have different implications for the bias, so that MacKinnon and Smith (1998) and Andrews (1993) are correcting different biases.
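The alternative choices of ψ can be collected in a small helper (the scheme names and the helper itself are ours):

```python
import numpy as np

def psi_variance(rho, scheme):
    """Variance of psi in y_0 = mu + sigma*psi under the schemes above."""
    if scheme == "bhargava":          # (A.4): stationary draw when |rho| < 1
        return 1.0 / (1.0 - rho ** 2) if abs(rho) < 1 else 1.0
    if scheme == "mackinnon_smith":   # (A.5): psi ~ N(0, 1) for all rho
        return 1.0
    raise ValueError(scheme)

def draw_y0(mu, sigma, rho, scheme, rng):
    """Draw an initial observation consistent with (A.3)."""
    return mu + sigma * np.sqrt(psi_variance(rho, scheme)) * rng.standard_normal()
```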
A.2  The Bayesian approach: S-priors for the AR(1)

In this section we explore the implications of initial conditions formulated following Uhlig (1994a), who considers processes which started at the mean S periods before the first observation in the sample.²⁷

Uhlig (1994a) proposes a formulation which encompasses many cases: he assumes y_{−S} = µ, where S is a given constant that the researcher has to specify. This provides an alternative ψ.
²⁷ Strictly speaking, Uhlig (1994a) does not derive a ’prior’ from the above specification of the initial condition, but uses it to write the ’exact likelihood’. This is, of course, a semantic issue, since the resulting posterior is the same as if the above specification for the initial condition were considered a prior given the initial observations, which is our preferred interpretation.
To provide a common nomenclature for all these cases, we call Uhlig’s proposal the ’S’ model. Then the case y_0 = 0 becomes the ’S = 0’ model, the assumption of MacKinnon and Smith (1998) as in (A.5) becomes the ’S = 1’ model, and we call the assumption of Bhargava (1986) in (A.4) the ’S = ∞’ model.

The general ’S-prior’ (Uhlig (1994a) and Zivot (1994, p. 569-570)) is:

p(µ|σ, y_0, S) = σ^{−1} ν(ρ, S)^{−1/2} exp( −(1/(2σ^2)) ν(ρ, S)^{−1} (y_0 − µ)^2 )    (A.6)

with

ν(ρ, S) = (1 − ρ^{2S})/(1 − ρ^2)    for |ρ| ≠ 1
        = S                         for |ρ| = 1    (A.7)

These priors are then complemented with the noninformative prior for ρ and σ:

p(ρ, σ) ∝ σ^{−1}    (A.8)
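A short sketch of ν(ρ, S) as reconstructed in (A.7) makes the limiting cases easy to check: S = 1 reproduces the MacKinnon-Smith assumption, large S with |ρ| < 1 approaches the stationary variance, and ν is continuous at ρ = 1 where it equals S (the helper and its checks are ours):

```python
import numpy as np

def nu(rho, S):
    """The prior variance factor nu(rho, S) from (A.7)."""
    if np.isclose(abs(rho), 1.0):
        return float(S)
    return (1.0 - rho ** (2 * S)) / (1.0 - rho ** 2)
```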
The S = 1 prior is conditionally conjugate, and the full posterior distribution can be conveniently simulated with the Gibbs sampler. For general S the conjugacy is lost, but the marginal posterior for ρ can be obtained analytically by a straightforward integration of the posterior with respect to µ and σ. The resulting formula can be found in Zivot (1994, equation (43)).

The bias and mean squared error of the Bayesian ’S-prior’ estimators are examined in a Monte Carlo experiment similar to the one in section 2.4. We denote the Bayesian posterior means obtained with the S = 0, 1 and 100 priors respectively as BS0, BS1 and BS100.

We simulate the AR(1) processes with initial conditions corresponding to S = 0, 1 and ∞, and in each case try all 3 Bayesian estimators, in addition to OLS and the CBC estimator of MacKinnon and Smith (1998). The results
are presented in Figures 11, 12 and 13.
The BS1 and BS100 estimators perform somewhat better than, but are broadly similar to, the delta prior: in RMSE terms they beat OLS for positive ρ, and the CBC on an even larger interval (except that the BS100 has slightly worse RMSE than the CBC for ρ around 1). In the range where they perform well, their bias (unreported here) is stronger than that of the CBC, although reduced compared with OLS. They are not very sensitive to the accuracy of the prior; e.g. the BS1 estimator performs similarly also when the prior assumption is false: for the processes starting at the mean (Figure 11), and in the stationary distribution (Figure 13).

That last observation is not true for the BS0 estimator, which gives a particularly strong advantage over the alternatives when the prior is correct
[Figure 11 plot: RMSE against ρ ∈ [−1, 1] for the OLS, CBC, BS0, BS1 and BS100 estimators.]
Figure 11: RMSE of the OLS, BS0, BS1 and BS100 estimators in a Monte
Carlo experiment with S = 0, sample size: T = 25.
[Figure 12 plot: RMSE against ρ ∈ [−1, 1] for the OLS, CBC, BS0, BS1 and BS100 estimators.]
Figure 12: RMSE of the OLS, BS0, BS1 and BS100 estimators in a Monte
Carlo experiment with S = 1, sample size: T = 25.
[Figure 13 plot: RMSE against ρ ∈ [−1, 1] for the OLS, CBC, BS0, BS1 and BS100 estimators.]
Figure 13: RMSE of the OLS, BS0, BS1 and BS100 estimators in a Monte
Carlo experiment with S = в€ћ, sample size: T = 25.
(for the processes starting exactly at the mean, Figure 11), but for some parameter values it is very misleading when the prior is wrong. The graphs are truncated, so it cannot be seen that the RMSE of BS0 has a peak of 0.65 (more than 3 times that of the alternatives) around ρ = −0.6 for S = 1 (Figure 12) and of 0.9 (more than 6 times that of the alternatives) around ρ = −0.9 for S = ∞ (Figure 13). Nevertheless, in the range which is most relevant in practice, i.e. for high positive values of ρ, BS0 is always more precise than the alternatives.
A.3  The S-prior for VARs
In this section we generalize the ’S = 1’ prior to the VAR(P) model evolving around an arbitrary exogenous process (for example a constant mean, or a linear trend).

Let Y be a T × J matrix gathering T observations on J jointly endogenous variables. Let L denote the lag operator and define the 1 × P vector l ≡ [L, L^2, . . . , L^P]. Define X ≡ l ⊗ Y = [Y_{−1}, Y_{−2}, . . . , Y_{−P}], a T × JP matrix with P lagged matrices Y. W is a T × K matrix with K exogenous variables (in the case of a time-invariant mean, W = ι_T, a T × 1 vector of ones, while in the case of a linear trend W = [ι_T, τ], where τ = [1, 2, . . . , T]′). Z ≡ l ⊗ W = [W_{−1}, W_{−2}, . . . , W_{−P}] is a T × KP matrix with P lagged matrices W. Finally, U is a T × J matrix of T independent normal vectors of length J, each with variance Σ.
Deviations of Y from the exogenous component evolve according to a VAR(P) process, perturbed by the shocks U:

Y − WΓ = (X − Z(I_P ⊗ Γ)) Π + U    (A.9)

where Π is a JP × J matrix of VAR coefficients and Γ is a K × J matrix with means and slopes of the time trend for each endogenous variable. The reduced form of the above model is:

Y = WΓ̃ + XΠ + U    (A.10)

with

Γ̃ = Γ − (l ⊗ Γ)Π = Γ(I − Φ_1 L − ... − Φ_P L^P)    (A.11)

where Φ_i is the matrix of VAR coefficients of lag i, so that Π = (Φ_1, Φ_2, . . . , Φ_P)′.
Equation (A.11) is a generalization of the restriction α = µ(1 − ρ) in (A.2) for the AR(1) case, and analogously to that case it has implications when the lag polynomial in the Φ’s contains some unit roots. When Γ consists only of constant terms, (A.11) guarantees that those corresponding to series that have unit roots are suppressed, and the random walks have no drifts. When Γ contains constant terms and time trends, (A.11) ensures that the time trends corresponding to unit root series are suppressed, consistently with the postulate of Uhlig (1994b).
The likelihood conditional on the initial P observations is proportional to:

L(Y; Π, Γ) ∝ |Σ|^{−T/2} exp( −(1/2) tr( Σ^{−1} U′U ) )    (A.12)

A common noninformative prior for the error variance and the autoregressive coefficients is:

p(Σ, Π) ∝ |Σ|^{−(J+1)/2}    (A.13)
The prior for Γ must be proper, to compensate for the indeterminacy at the unit root implied by restriction (A.11). A ’bias-correcting’ prior assumption, in the spirit of the previous sections, relates the coefficients of the deterministic component of the process to the initial observations:

Y_0 − W_0 Γ = U_0    (A.14)

where Y_0 is a T_0 × J matrix with T_0 initial observations on the endogenous variables, W_0 is a T_0 × K matrix with the corresponding values of the exogenous variables, and U_0 is a T_0 × J matrix of T_0 independent normal vectors of length J with variance Σ_0. We take Σ_0 to be the OLS estimate of the variance of the VAR errors.

The implied prior for Γ is normal, centered on the OLS estimate of Γ on the T_0 initial observations:

p(vec Γ|Y_0, W_0, Σ_0) = N( vec Γ̂_0, Σ_0 ⊗ (W_0′W_0)^{−1} )    (A.15)

where

Γ̂_0 ≡ (W_0′W_0)^{−1} W_0′ Y_0
The simplifying assumption in (A.14), which delivers conjugacy of the prior, is that the first T_0 deviations from the deterministic component are independent, and only from time period 0 onwards does the time dependence implied by Π kick in. An exact multivariate and multilag counterpart to Uhlig’s S-prior would assume that the process had been equal to its deterministic component up to period −S and then started evolving according to (A.9). But then the variance of U_0 would depend in a complicated way on Π and the conjugacy of the prior would be lost. The prior in (A.15) implies that the initial T_0 observations are used only to make inference on the coefficients of the deterministic process Γ, but not on Π. This may be a simplification, but it is still better than the case of the flat prior, where the information in the initial observations is ignored altogether.
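The construction of the prior (A.15) from the initial observations is mechanical; a minimal sketch (the helper name is ours, and vec is understood as column-stacking):

```python
import numpy as np

def gamma_prior(Y0, W0, Sigma0):
    """Moments of the normal prior (A.15) for vec(Gamma), built from the
    T0 initial observations."""
    Gamma0_hat = np.linalg.solve(W0.T @ W0, W0.T @ Y0)   # OLS of Y0 on W0
    V = np.kron(Sigma0, np.linalg.inv(W0.T @ W0))        # Sigma0 x (W0'W0)^-1
    return Gamma0_hat, V

T0, J, K = 4, 2, 1
rng = np.random.default_rng(0)
Y0 = rng.standard_normal((T0, J))
W0 = np.ones((T0, K))                  # constant-mean case: W0 = iota
Sigma0 = np.array([[1.0, 0.3], [0.3, 2.0]])
m, V = gamma_prior(Y0, W0, Sigma0)
# With W0 = iota the prior mean is just the mean of the initial observations,
# and the prior variance is Sigma0 / T0.
```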
The joint posterior implied by (A.12), (A.13) and (A.15) is:

p(Π, Γ, Σ|Y, Y_0) ∝ |Σ|^{−(T+J+1)/2} exp( −(1/2) tr( Σ^{−1} U′U + Σ_0^{−1} U_0′U_0 ) )    (A.16)

The conditional posteriors are:

p(Σ|Π, Γ, Y, Y_0) = IW(U′U, T)    (A.17)

p(vec Π|Γ, Σ, Y, Y_0) = N( vec Π̃, Σ ⊗ (X̃′X̃)^{−1} )    (A.18)

p(vec Γ|Π, Σ, Y, Y_0) = N( G^{−1} g, G^{−1} )    (A.19)

where IW is the Inverted Wishart distribution,

X̃ ≡ X − Z(I_P ⊗ Γ)
Π̃ = (X̃′X̃)^{−1} X̃′ (Y − WΓ)
G = A′(Σ^{−1} ⊗ I_T)A + Σ_0^{−1} ⊗ W_0′W_0
g = A′(Σ^{−1} ⊗ I_T) vec(Y − XΠ) + (Σ_0^{−1} ⊗ W_0′W_0) vec Γ̂_0
A = I_J ⊗ W − (Π′ ⊗ Z) ((I_P ⊗ K_{JP})(vec I_P ⊗ I_J) ⊗ I_K)

and K_{JP} is the commutation matrix (Magnus and Neudecker, 1988, p. 46). A sample from the posterior (A.16) can be generated by the Gibbs sampler, i.e. by drawing in turn from (A.17), (A.18) and (A.19).
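A sketch of this Gibbs sampler for the simplest special case K = 1, P = 1 (a VAR(1) around a constant mean), where the general matrix A reduces to (I − Π′) ⊗ ι_T and the commutation-matrix algebra is not needed; all function names, starting values and the Bartlett-based Inverted Wishart draw are ours:

```python
import numpy as np

def draw_inv_wishart(S, df, rng):
    """Draw from IW(S, df) via the Bartlett decomposition of the Wishart."""
    p = S.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(S))
    A = np.diag(np.sqrt(rng.chisquare(df - np.arange(p))))
    A[np.tril_indices(p, -1)] = rng.standard_normal(p * (p - 1) // 2)
    W = L @ A @ A.T @ L.T            # W ~ Wishart(S^-1, df)
    return np.linalg.inv(W)          # hence inv(W) ~ IW(S, df)

def gibbs_s1_var1(Y, mu0, Sigma0, T0, n_iter=100, seed=0):
    """Cycle through (A.17)-(A.19) for a VAR(1) around a constant mean mu."""
    rng = np.random.default_rng(seed)
    T, J = Y.shape[0] - 1, Y.shape[1]
    Ylag, Ynow = Y[:-1], Y[1:]
    mu, Pi = Y.mean(axis=0), 0.5 * np.eye(J)
    S0i = np.linalg.inv(Sigma0)
    for _ in range(n_iter):
        # (A.17): Sigma | Pi, mu ~ IW(U'U, T)
        U = (Ynow - mu) - (Ylag - mu) @ Pi
        Sigma = draw_inv_wishart(U.T @ U, T, rng)
        # (A.18): vec Pi | mu, Sigma ~ N(vec Pi_tilde, Sigma x (X'X)^-1)
        Xt = Ylag - mu
        XtXi = np.linalg.inv(Xt.T @ Xt)
        Pi_tilde = XtXi @ Xt.T @ (Ynow - mu)
        vecPi = rng.multivariate_normal(Pi_tilde.T.ravel(), np.kron(Sigma, XtXi))
        Pi = vecPi.reshape(J, J).T
        # (A.19): mu | Pi, Sigma ~ N(G^-1 g, G^-1); the likelihood sees mu
        # through y_t - Pi'y_{t-1} = (I - Pi')mu + u_t, the prior is (A.15)
        Si = np.linalg.inv(Sigma)
        B = np.eye(J) - Pi.T
        G = T * B.T @ Si @ B + T0 * S0i
        g = B.T @ Si @ (Ynow - Ylag @ Pi).sum(axis=0) + T0 * S0i @ mu0
        Ginv = np.linalg.inv(G)
        mu = rng.multivariate_normal(Ginv @ g, Ginv)
    return mu, Pi, Sigma
```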
Appendix B  Marginal distribution of the observables implied by the prior on growth rates
For completeness we write here explicitly the marginal distribution of the level of the observables Y implied by the prior for growth rates ∆Y, such as the one in (24). For simplicity we assume that the autocorrelations of the growth rates of all variables are the same and equal to r, which gives the full variance of growth rates the Kronecker structure

Var(vec ∆Y) = Σ_∆ ⊗ R

where R is a T_0 × T_0 matrix with ones along the diagonal and r everywhere else. To write the distribution of Y, define the T_0 × 1 vector e ≡ (1, 0, . . . , 0)′ and the T_0 × T_0 first-difference matrix A, with ones on the diagonal, −1 on the first subdiagonal and zeros elsewhere:

A ≡
[  1    0   . . .   0 ]
[ −1    1   . . .   0 ]
[  .    .     .     . ]
[  0    0    −1     1 ]

so that we can write

∆Y = (y_1 − y_0, y_2 − y_1, . . . , y_{T_0} − y_{T_0−1})′ = AY − e y_0    (B.1)
Defining the T_0 × 1 vectors ι ≡ (1, . . . , 1)′ and τ ≡ (1, 2, . . . , T_0)′, for a given initial condition y_0 this together with the prior for growth rates implies a prior for the observables:

ϕ_Y = N( vec µ_ϕ, Σ_ϕ )    (B.2)

with

µ_ϕ = A^{−1}(e y_0 + ι µ_∆) = ι y_0 + τ µ_∆    (B.3)

Σ_ϕ = (I_N ⊗ A^{−1}) (Σ_∆ ⊗ R) (I_N ⊗ A^{−1})′ = Σ_∆ ⊗ A^{−1}R A^{−1}′    (B.4)
Appendix C  Our prior is not a reparameterization

This section illustrates the problem involved in obtaining the delta prior by a reparameterization. The problem is that such attempts generate distributions of coefficients which are not independent of the error terms, which we find quite unappealing.
The complication in the reparameterization is that there is no one-to-one mapping between observables and parameters, because the mapping involves another random term: the errors ε. Obtaining the delta prior through a reparameterization would therefore involve specifying the joint distribution of (y, ε), then obtaining the distribution of (B, ε) through a change-of-variable technique.

For the sake of example, consider the AR(1) model with a constant term and T_0 = 2. In this case, the mapping from (B, ε) to (y, ε) is as follows:

y_1 = α + ρ y_0 + ε_1    (C.1)
y_2 = α + αρ + ρ^2 y_0 + ρ ε_1 + ε_2    (C.2)
ε_1 = ε_1    (C.3)
ε_2 = ε_2    (C.4)
It is easy to verify that the Jacobian matrix of this transformation is:

[  1        y_0                 1   0 ]
[  1 + ρ    α + 2ρy_0 + ε_1    ρ   1 ]
[  0        0                   1   0 ]
[  0        0                   0   1 ]

The determinant of this matrix is α + (ρ − 1)y_0 + ε_1, and the absolute value of this term multiplies the distribution in the new parameter space (α, ρ, ε_1, ε_2). This term cannot be factorized into terms involving only the ε’s and only the parameters, and therefore the obtained density will not, in general, be consistent with independence of the model parameters and errors. In principle, we could specify a joint distribution of observables and errors in which this term would cancel, and which would therefore deliver independence of parameters and errors. However, specifying such a distribution is a challenge that may not be easier than solving the functional equation (9).
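The determinant formula can be confirmed by differentiating the mapping (C.1)-(C.4) numerically; the generic finite-difference helper and the example parameter values are ours:

```python
import numpy as np

def mapping(theta, y0):
    """(alpha, rho, eps1, eps2) -> (y1, y2, eps1, eps2), as in (C.1)-(C.4)."""
    a, r, e1, e2 = theta
    return np.array([a + r * y0 + e1,
                     a + a * r + r ** 2 * y0 + r * e1 + e2,
                     e1, e2])

def num_jacobian(f, x, h=1e-6):
    """Central-difference Jacobian of f at x (generic helper)."""
    J = np.empty((len(x), len(x)))
    for j in range(len(x)):
        dx = np.zeros(len(x)); dx[j] = h
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

y0 = 1.5
theta = np.array([0.2, 0.9, 0.1, -0.3])     # (alpha, rho, eps1, eps2)
a, r, e1, _ = theta
J = num_jacobian(lambda t: mapping(t, y0), theta)
# det(J) should equal alpha + (rho - 1)*y0 + eps1
```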
Appendix D  Analytical iteration on the mapping F for AR(1)

Consider the AR(1) model without the constant term, with y_0 ≠ 0 given. The first observation must be nonzero because, if we are unlucky enough to start exactly at the mean, the growth rate in the first period does not depend on the coefficients and the prior for the growth rate will not carry any information about ρ. But we only require (9) to hold on a set of y’s of probability one. For simplicity, σ^2 is given. Everywhere we will implicitly condition on y_0 and σ^2.
The model is:

y_t = ρ y_{t−1} + ε_t,    ε_t i.i.d. N(0, σ^2)    (D.1)

so that the likelihood of a hypothetical observation in period 1 is:

L_{y_1|ρ}(ȳ_1; ρ) = N(ρ y_0, σ^2)    (D.2)

Introduce the prior assumption of a zero (without loss of generality) growth rate in the first period:

f_{∆y_1}(∆ȳ_1) = N(0, σ_∆^2)    (D.3)

which implies:

ϕ(ȳ_1) = N(y_0, σ_∆^2)    (D.4)

Let us find the marginal prior f_ρ(ρ) which will be consistent with the above L and ϕ, i.e. which will satisfy

∫ L(ȳ_1; ρ) f_ρ(ρ) dρ = ϕ(ȳ_1)    (D.5)
D.1  Guess of the solution

Analyzing the structure of the problem, we can easily guess that the solution is:

f_ρ^{guess}(ρ) = N( 1, (σ_∆^2 − σ^2)/y_0^2 )    (D.6)
Verifying (we skip the tedious algebraic details; the integral can be performed by completing the square):

∫ L_{y_1|ρ}(ȳ_1; ρ) f_ρ^{guess}(ρ) dρ = ... = (2π)^{−1/2} σ_∆^{−1} exp( −(ȳ_1 − y_0)^2 / (2σ_∆^2) ) = ϕ(ȳ_1)    (D.7)

so the guess was right: f_ρ^{guess}(ρ) satisfies condition (D.5).
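The guess can also be confirmed by simulation: drawing ρ from (D.6) and then y_1 from (D.2) must reproduce the marginal (D.4). The parameter values and tolerances below are ours; note that (D.6) requires σ_∆ > σ:

```python
import numpy as np

y0, sigma, sigma_d = 1.0, 0.1, 0.2     # illustrative values with sigma_d > sigma
rng = np.random.default_rng(0)
n = 200_000
rho = rng.normal(1.0, np.sqrt((sigma_d ** 2 - sigma ** 2) / y0 ** 2), n)  # (D.6)
y1 = rho * y0 + sigma * rng.standard_normal(n)                            # (D.2)
# The implied marginal of y1 should be N(y0, sigma_d^2), i.e. (D.4)
```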
D.2  Approaching the prior by fixed point iteration

In the context of the AR(1) model one iteration with the mapping F produces f_ρ^{F(1)}(ρ) (as before, the integral is tedious but easy to compute by the ’completing the square’ approach):

f_ρ^{F(1)}(ρ) = ∫ [ L(y_1; ρ) × 1 / ∫ L(y_1; ρ) × 1 dρ ] ϕ(y_1) dy_1 = ... = N( 1, (σ_∆^2 + σ^2)/y_0^2 )    (D.8)

Verifying whether it satisfies (D.5), i.e. whether it is consistent with the desired marginal distribution of y_1, yields:

∫ L_{y_1|ρ}(ȳ_1; ρ) f_ρ^{F(1)}(ρ) dρ = ... = N( y_0, σ_∆^2 + 2σ^2 ) ≠ ϕ(ȳ_1)    (D.9)

The marginal distribution of y_1 implied by f_ρ^{F(1)}(ρ) is not what we wanted. It has the correct mean, but the variance is too high.
In the second iteration, we first compute the prior f_ρ^{F(F(1))}(ρ) by applying the mapping F to the prior obtained in the first step:

f_ρ^{F(F(1))}(ρ) = ∫ [ L(y_1; ρ) f_ρ^{F(1)}(ρ) / ∫ L(y_1; ρ) f_ρ^{F(1)}(ρ) dρ ] ϕ(y_1) dy_1
  = ... = N( 1, (σ_∆^2 + σ^2)/y_0^2 × (σ_∆^4 + 2σ_∆^2 σ^2 + 2σ^4)/(σ_∆^2 + 2σ^2)^2 )    (D.10)

Conveniently, we already computed the integral in the denominator while verifying F(1) (equation (D.9) above).

This prior has a smaller variance than the prior from the first step, because the second quotient in the variance is less than 1, which can be seen after expanding the denominator:

(σ_∆^4 + 2σ_∆^2 σ^2 + 2σ^4) / (σ_∆^4 + 4σ_∆^2 σ^2 + 4σ^4) < 1    (D.11)

So the prior F(F(1)) has a smaller variance than the prior F(1). However, it still does not satisfy (D.5):

∫ L_{y_1|ρ}(ȳ_1; ρ) f_ρ^{F(F(1))}(ρ) dρ = ... = N( y_0, (σ_∆^6 + 4σ_∆^4 σ^2 + 8σ_∆^2 σ^4 + 6σ^6)/(σ_∆^2 + 2σ^2)^2 ) ≠ ϕ(ȳ_1)    (D.12)

The marginal distribution of y_1 implied by f_ρ^{F(F(1))}(ρ) is still not right: the mean remains correct, but the variance is still too big, although smaller than in the first step, because it can be transformed to

σ_∆^2 < [ (σ_∆^2 + 2σ^2)^3 − (2σ_∆^4 σ^2 + 4σ_∆^2 σ^4 + 2σ^6) ] / (σ_∆^2 + 2σ^2)^2 < σ_∆^2 + 2σ^2    (D.13)
(The transformation is intended to facilitate seeing the second inequality; the first inequality is easy to prove too.)

So in the second iteration we got closer to the right prior. Further iterations can be seen as observing more and more samples of size T_0, and learning better and better about the marginal distribution of the parameters. Unfortunately, for T_0 > 1 the integrals involved in the fixed point iterations are no longer closed form, and need to be evaluated by Monte Carlo.
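For T_0 = 1 the iteration can still be carried out numerically on a grid, which makes the convergence towards (D.6) visible; the grid sizes, parameter values and tolerances below are ours:

```python
import numpy as np

# Grid-based fixed-point iteration on the mapping F for T0 = 1, as a
# numerical companion to the closed forms above.
y0, sigma, sigma_d = 1.0, 0.1, 0.2
rho = np.linspace(0.0, 2.0, 401)
y1 = np.linspace(0.0, 2.0, 401)[:, None]        # column vector for broadcasting
dr, dy = rho[1] - rho[0], y1[1, 0] - y1[0, 0]

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

phi = normal_pdf(y1, y0, sigma_d)               # target marginal (D.4)
L = normal_pdf(y1, rho * y0, sigma)             # likelihood (D.2): rows y1, cols rho

f = np.ones_like(rho)                           # start from the flat prior
for _ in range(40):
    post = L * f                                # posterior of rho given y1 ...
    post /= (post.sum(axis=1) * dr)[:, None]    # ... normalized over rho
    f = (post * phi).sum(axis=0) * dy           # mix over y1 ~ phi: one step of F

mean = (f * rho).sum() * dr
var = (f * rho ** 2).sum() * dr - mean ** 2
# The iterates approach the guess (D.6): mean 1, variance (sigma_d^2-sigma^2)/y0^2
```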
References
Abadir, K. M., Hadri, K., and Tzavalis, E. (1999). The influence of VAR
dimensions on estimator biases. Econometrica, 67(1):163–181.
Andrews, D. W. K. (1993). Exactly median-unbiased estimation of first order
autoregressive / unit root models. Econometrica, 61(1):139–165.
Andrews, D. W. K. and Chen, H.-Y. (1994). Approximately median-unbiased
estimation of autoregressive models. Journal of Business and Economic
Statistics, 12(2):187–204.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis.
Springer Series in Statistics. Springer, New York, second edition.
Berger, J. O. and Wolpert, R. L. (1988). The Likelihood Principle, volume 6
of Lecture Notes - Monograph Series. Institute of Mathematical Statistics,
Hayward, California, second edition.
Bhargava, A. (1986). On the theory of testing for unit roots in observed
time-series. Review of Economic Studies, 53(3):369–384.
Christiano, L. J., Eichenbaum, M., and Evans, C. L. (1999). Monetary
policy shocks: What have we learned and to what end? In Taylor, J. B.
and Woodford, M., editors, Handbook of Macroeconomics, number 1A,
chapter 2, pages 65–148. Amsterdam: North-Holland.
Del Negro, M. and Schorfheide, F. (2004). Priors from general equilibrium
models for VARs. International Economic Review, 45(2):643–673.
Doan, T., Litterman, R., and Sims, C. (1984). Forecasting and conditional projections using realistic prior distributions. Econometric Reviews,
3(1):1–100.
Doan, T. A. (2000). RATS version 5 User’s Guide. Estima, Suite 301, 1800
Sherman Ave., Evanston, IL 60201.
Doornik, J. A. (2002). Object-Oriented Matrix Programming Using Ox. London: Timberlake Consultants Press and Oxford: www.nuff.ox.ac.uk/Users/Doornik, third edition.
Hurwicz, L. (1950). Least-squares bias in time series. In Koopmans, T. C., editor, Statistical Inference in Dynamic Economic Models. Wiley, New York.
Kadane, J. B. (1980). Predictive and structural methods for eliciting prior
distributions. In Zellner, A., editor, Bayesian Analysis in Econometrics
and Statistics, chapter 8, pages 89–93. North-Holland, Amsterdam.
Kadane, J. B., Chan, N. H., and Wolfson, L. J. (1996). Priors for unit root
models. Journal of Econometrics, 75(1):99–111.
Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W. S., and Peters, S. C.
(1980). Interactive elicitation of opinion for a normal linear model. Journal
of the American Statistical Association, 75(372):845–854.
Kendall, M. G. (1954). Note on the bias in the estimation of autocorrelation.
Biometrika, XLI:403–404.
Kilian, L. (1998). Small-sample confidence intervals for impulse response
functions. The Review of Economics and Statistics, 80(2):218–230.
Kleibergen, F. and van Dijk, H. K. (1994). On the shape of the likelihood/posterior in cointegrating models. Econometric Theory, 10(3-4):514–
551.
Lubrano, M. (1995). Testing for unit roots in a Bayesian framework. Journal
of Econometrics, 69:81–109.
MacKinnon, J. G. and Smith, A. A. (1998). Approximate bias correction in
econometrics. Journal of Econometrics, 85:205–230.
Magnus, J. R. and Neudecker, H. (1988). Matrix Differential Calculus with
Applications in Statistics and Econometrics. John Wiley and Sons, New
York.
Marriott, F. H. C. and Pope, J. A. (1954). Bias in the estimation of autocorrelations. Biometrika, 41:393–403.
Müller, U. K. and Elliott, G. (2003). Tests for unit roots and the initial
condition. Econometrica, 71(4):1269–1286.
Orcutt, G. H. and Winokur, H. S. (1969). First order autoregression: Inference, estimation, and prediction. Econometrica, 37(1):1–14.
Phillips, P. C. (1991). To criticize the critics: An objective Bayesian analysis
of stochastic trends. Journal of Applied Econometrics, 6(4):333–364.
Quenouille, M. H. (1949). Approximate tests of correlation in time-series.
Journal of the Royal Statistical Society Series B, 11:68–84.
Roy, A. and Fuller, W. A. (2001). Estimation for autoregressive time series
with a root near 1. Journal of Business and Economic Statistics, 19(4):482–
493.
Schotman, P. C. and van Dijk, H. K. (1991a). A Bayesian analysis of the unit
root in real exchange rates. Journal of Econometrics, 49(1-2):195–238.
Schotman, P. C. and van Dijk, H. K. (1991b). On Bayesian routes to unit
roots. Journal of Applied Econometrics, 6(4):387–401.
Sims, C. A. (2000). Using a likelihood perspective to sharpen econometric
discourse: Three examples. Journal of Econometrics, 95:443–462.
Sims, C. A. (2006). Conjugate dummy observation priors for VAR’s. Technical report, Princeton University.
Sims, C. A. (2007a). Making macro models behave reasonably. Technical
report, Princeton University.
Sims, C. A. (2007b). Rational inattention: A research agenda. Technical
report, Princeton University.
Sims, C. A. (revised 1996). Inference for multivariate time series models
with trend. Discussion paper, presented at the 1992 American Statistical
Association Meetings.
Sims, C. A. and Uhlig, H. (1991). Understanding unit rooters: A helicopter
tour. Econometrica, 59(6):1591–1599.
Sims, C. A. and Zha, T. (1998). Bayesian methods for dynamic multivariate
models. International Economic Review, 39(4):949–968.
Sims, C. A. and Zha, T. (1999). Error bands for impulse responses. Econometrica, 67(5):1113–1155.
Stine, R. A. and Shaman, P. (1989). A fixed point characterization for bias
of autoregressive estimators. The Annals of Statistics, 17(3):1275–1284.
Thornber, H. (1967). Finite sample Monte Carlo studies: An autoregressive
illustration. Journal of the American Statistical Association, 62(319):801–
818.
Uhlig, H. (1994a). On Jeffreys prior when using the exact likelihood function.
Econometric Theory, 10(3-4):633–644.
Uhlig, H. (1994b). What macroeconomists should know about unit roots: A
Bayesian perspective. Econometric Theory, 10(3-4):645–671.
Villani, M. (2008). Steady state priors for vector autoregressions. Journal of
Applied Econometrics. Forthcoming.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics.
Wiley, New York.
Zivot, E. (1994). A Bayesian analysis of the unit-root hypothesis within an
unobserved components model. Econometric Theory, 10(3-4):552–578.