Unit 6 - Georgia Tech ISyE

Bayesian Prediction Methodology

It is assumed that the test and training data, Y^te and Y^tr, can be described as draws from a two-stage model whose first stage specifies that, conditional on ϑ,

[ Y^te ; Y^tr ] | ϑ  ∼  N_{n_e+n_s}( [ F_te ; F_tr ] β ,  λ_Z^{-1} [ R_te , R_te,tr ; R_te,tr^⊤ , R_tr ] ) ,    (4.1.3)
where ϑ depends on the case being studied. A second stage specifies the prior distribution of ϑ.
The following notation is used throughout the chapter.
• [W] denotes the distribution of W;
• where needed explicitly, π(w), E{W}, and Cov(W) denote the (joint) probability density function of W, the mean of W, and the variance-covariance matrix of W, respectively;
• F_te is the n_e × p matrix whose ith row consists of the (known) regression functions for the input x_i^te, 1 ≤ i ≤ n_e;
• F_tr denotes the n_s × p matrix of known regression functions for the n_s training data inputs;
• β denotes the unknown p × 1 vector of regression coefficients;
• λ_Z (λ_Z^{-1}) is the precision (variance) of the GP that describes deviations from the regression;
• R_te is the n_e × n_e correlation matrix Cor(Y^te, Y^te);
• R_te,tr is the n_e × n_s cross-correlation matrix Cor(Y^te, Y^tr);
• R_tr is the n_s × n_s correlation matrix Cor(Y^tr, Y^tr);
• κ denotes the vector of parameters that determine a given correlation function;
• ϑ denotes the vector of all unknown model parameters for the model under discussion; ϑ is β in Section 4.2.1, ϑ is (β, λ_Z) in Section 4.2.2, and ϑ is (β, λ_Z, κ) in Section 4.3.
For easy reference, note that when (4.1.3) holds, then conditionally

[ Y^te | Y^tr = y^tr , ϑ ]  ∼  N_{n_e}( F_te β + R_te,tr R_tr^{-1} ( y^tr − F_tr β ) ,  λ_Z^{-1} [ R_te − R_te,tr R_tr^{-1} R_te,tr^⊤ ] ) ,    (4.1.4)

where ϑ depends on the case being studied.
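To make the conditional formula (4.1.4) concrete, the following Python sketch evaluates its mean and covariance numerically for the case where ϑ (here β, together with λ_Z and the correlations) is fixed at assumed values. The one-dimensional training and test sites, the Gaussian correlation with θ = 10, the constant-mean regression, β, and λ_Z are all illustrative assumptions, not quantities from the text.

```python
# A minimal numerical sketch of the conditional distribution (4.1.4).
import numpy as np

def gauss_corr(x1, x2, theta=10.0):
    """Gaussian correlation R(h) = exp(-theta * h^2) between two sets of 1-d inputs."""
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

x_tr = np.linspace(0.0, 1.0, 7)          # n_s = 7 training inputs (assumed)
x_te = np.array([0.15, 0.40, 0.85])      # n_e = 3 test inputs (assumed)
F_tr = np.ones((len(x_tr), 1))           # constant-mean regression, p = 1
F_te = np.ones((len(x_te), 1))
beta = np.array([0.5])                   # "known" regression coefficient (assumed)
lam_Z = 4.0                              # "known" process precision (assumed)

R_tr = gauss_corr(x_tr, x_tr)
R_te = gauss_corr(x_te, x_te)
R_te_tr = gauss_corr(x_te, x_tr)

y_tr = np.sin(2 * np.pi * x_tr)          # any observed training responses (assumed)

# Conditional mean and covariance of Y^te | Y^tr = y_tr, per equation (4.1.4)
Rinv_resid = np.linalg.solve(R_tr, y_tr - F_tr @ beta)
mean_te = F_te @ beta + R_te_tr @ Rinv_resid
cov_te = (R_te - R_te_tr @ np.linalg.solve(R_tr, R_te_tr.T)) / lam_Z
print(mean_te, np.diag(cov_te))
```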
The chapter is organized as follows. Section 4.2 presents "conjugate" cases where analytic expressions can be derived for π(Y^te | Y^tr), E{Y^te | Y^tr}, and Var{Y^te | Y^tr}. The case ϑ = β is a toy example that is sufficiently simple to illustrate the calculations that become more involved in the remaining conjugate and non-conjugate cases. The conjugate case ϑ = (β, λ_Z) assumes that the local min/max structure of the residual process is known and hence is usually not directly useful in applications. However, in conjunction with the conjugate results for ϑ = (β, λ_Z), Section 4.3 describes Bayesian methodology for the case ϑ = (β, λ_Z, κ), which is used in practical settings.
4.2 Examples of Conjugate Bayesian Models and Prediction
The idea that is used in this and the following sections is that the predictive density π(y^te | y^tr) can be obtained as

π(y^te | y^tr) = ∫ π(y^te, ϑ | y^tr) dϑ = ∫ π(y^te | ϑ, y^tr) π(ϑ | y^tr) dϑ ,    (4.2.1)
where ϑ denotes the unknown model parameters. Inference about ϑ can be obtained from the conditional distribution of ϑ given the training data, i.e., from [ϑ | y^tr]. For example, the mean of [ϑ | y^tr] is a Bayesian estimate of ϑ, while the standard deviation of this posterior measures the uncertainty in the estimated ϑ. Theorems 4.1 and 4.2 also provide the [ϑ | y^tr] conditional distributions.
4.2.1 Predictive Distributions When ϑ = β
This subsection illustrates the application of (4.2.1) in a toy example that is sufficiently simple that the calculations can be completely provided. It is assumed that
the training and test data can be described as draws from the regression + stationary
GP model, in which the regression coefficient is unknown but with known process
precision and correlations, i.e., ϑ = β.
The predictive distribution [Y te | Y tr ] is derived; as a consequence the Bayesian
predictor E{Y te | Y tr } and the uncertainty quantification Var{Y te | Y tr } are immediately available. Section 4.2.2 will consider a more challenging model that has application to the "usual" case where all parameters of the regression + stationary GP
model are unknown.
The following theorem provides the predictive distribution of Y te for two different
“conjugate” choices of second-stage priors for β. The first choice is the normal prior
which can be regarded as an informative choice while the second can be thought
of as a non-informative prior. Formally, the non-informative prior and its predictive distribution are obtained by letting the prior precision λ_β → 0 in the normal prior of case (a).
Theorem 4.1. Suppose (Y^te, Y^tr) follows a two-stage model with first stage

[ Y^te ; Y^tr ] | β  ∼  N_{n_e+n_s}( [ F_te ; F_tr ] β ,  λ_Z^{-1} [ R_te , R_te,tr ; R_te,tr^⊤ , R_tr ] ) ,

where β is unknown while λ_Z and all correlations are known.
(a) If

[β] ∼ N_p( b_β , λ_β^{-1} V_β )    (4.2.2)
is the second-stage model, where V_β is a known positive definite correlation matrix and b_β and λ_β are also known, then the posterior distribution of β is

[ β | Y^tr ] ∼ N_p( µ_β|tr , Σ_β|tr ) ,

where

µ_β|tr = ( λ_Z F_tr^⊤ R_tr^{-1} F_tr + λ_β V_β^{-1} )^{-1} × ( λ_Z F_tr^⊤ R_tr^{-1} F_tr β̂ + λ_β V_β^{-1} b_β )    (4.2.3)

(here β̂ = ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} F_tr^⊤ R_tr^{-1} y^tr is the generalized least squares estimator of β) and

Σ_β|tr = ( λ_Z F_tr^⊤ R_tr^{-1} F_tr + λ_β V_β^{-1} )^{-1} .
The predictive distribution of Y^te is

[ Y^te | Y^tr = y^tr ] ∼ N_{n_e}( µ_te|tr , Σ_te|tr ) ,    (4.2.4)

with mean

µ_te|tr = µ_te|tr(x^te) = F_te µ_β|tr + R_te,tr R_tr^{-1} ( y^tr − F_tr µ_β|tr )    (4.2.5)

and covariance matrix

Σ_te|tr = λ_Z^{-1} { R_te − ( F_te , R_te,tr ) [ −(λ_β/λ_Z) V_β^{-1} , F_tr^⊤ ; F_tr , R_tr ]^{-1} ( F_te , R_te,tr )^⊤ } .    (4.2.6)

(b) If

π(β) = 1    (4.2.7)
on IR^p, then the posterior distribution of β is

[ β | Y^tr ] ∼ N_p( β̂ , λ_Z^{-1} ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} ) ,  where β̂ = ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} F_tr^⊤ R_tr^{-1} y^tr .
The predictive distribution of Y^te is

[ Y^te | Y^tr = y^tr ] ∼ N_{n_e}( µ_te|tr , Σ_te|tr ) ,    (4.2.8)

where the mean µ_te|tr and covariance Σ_te|tr are modifications of (4.2.5) and (4.2.6), respectively, in which µ_β|tr is replaced by β̂ and (λ_β/λ_Z) V_β^{-1} is replaced by the p × p matrix of zeroes.
The proof of Theorem 4.1 is given in Section 4.5. It requires straightforward calculations to implement the right-hand integral in (4.2.1).
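The following sketch assembles the Theorem 4.1(a) quantities: the posterior mean and covariance of β in (4.2.3) and the predictive mean (4.2.5). For the predictive covariance it uses the reduced form λ_Z^{-1}[ R_te − R_te,tr R_tr^{-1} R_te,tr^⊤ + H Q^{-1} H^⊤ ], the matrix analogue of (4.2.12) below, rather than the block-matrix expression (4.2.6). The damped cosine test function is borrowed from Example 4.1; the training-site locations, θ = 10, λ_Z = 6, λ_β = 10, b_β = 5, and V_β = I are assumed for illustration.

```python
# A sketch of the Theorem 4.1(a) computations: posterior of beta (4.2.3) and
# the predictive mean/covariance.  Settings below are illustrative assumptions.
import numpy as np

def gauss_corr(x1, x2, theta=10.0):
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

x_tr = np.linspace(0.0, 1.0, 7)
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)   # damped cosine of Example 4.1
x_te = np.linspace(0.0, 1.0, 101)
F_tr, F_te = np.ones((7, 1)), np.ones((101, 1))

lam_Z, lam_b = 6.0, 10.0            # process and prior precisions (assumed)
b_beta = np.array([5.0])            # prior mean of beta (as in Example 4.1)
V_beta = np.eye(1)

R_tr = gauss_corr(x_tr, x_tr)
R_te_tr = gauss_corr(x_te, x_tr)
R_te = gauss_corr(x_te, x_te)
Rinv = np.linalg.inv(R_tr)

# Posterior of beta, equation (4.2.3)
A = F_tr.T @ Rinv @ F_tr
beta_hat = np.linalg.solve(A, F_tr.T @ Rinv @ y_tr)        # GLS estimate
prec_post = lam_Z * A + lam_b * np.linalg.inv(V_beta)
mu_b = np.linalg.solve(prec_post, lam_Z * A @ beta_hat + lam_b * np.linalg.inv(V_beta) @ b_beta)
Sigma_b = np.linalg.inv(prec_post)

# Predictive mean (4.2.5) and covariance (reduced form of (4.2.6))
mu_te = F_te @ mu_b + R_te_tr @ Rinv @ (y_tr - F_tr @ mu_b)
Q = A + (lam_b / lam_Z) * np.linalg.inv(V_beta)             # as in (4.2.13)
H = F_te - R_te_tr @ Rinv @ F_tr
Sigma_te = (R_te - R_te_tr @ Rinv @ R_te_tr.T + H @ np.linalg.solve(Q, H.T)) / lam_Z
print(mu_b, np.sqrt(np.maximum(np.diag(Sigma_te), 0.0))[:5])
```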
Consider first some observations about inferences concerning β provided by the posterior distribution [β | Y^tr]. The posterior mean of β depends only on the ratio λ_β/λ_Z because

µ_β|tr = λ_Z^{-1} ( F_tr^⊤ R_tr^{-1} F_tr + (λ_β/λ_Z) V_β^{-1} )^{-1} × ( λ_Z F_tr^⊤ R_tr^{-1} F_tr β̂ + λ_β V_β^{-1} b_β )
       = ( F_tr^⊤ R_tr^{-1} F_tr + (λ_β/λ_Z) V_β^{-1} )^{-1} × ( F_tr^⊤ R_tr^{-1} F_tr β̂ + (λ_β/λ_Z) V_β^{-1} b_β ) .
In the special case of (a) where the prior variance of β is identical to the Y(x) process variance, i.e., λ_β = λ_Z, both the posterior mean and covariance simplify further. In this situation, the Bayes estimator of β can be written as

µ_β|tr = ( F_tr^⊤ R_tr^{-1} F_tr + V_β^{-1} )^{-1} × ( F_tr^⊤ R_tr^{-1} F_tr β̂ + V_β^{-1} b_β )
       = Ω β̂ + ( I_p − Ω ) b_β ,

where Ω = ( F_tr^⊤ R_tr^{-1} F_tr + V_β^{-1} )^{-1} F_tr^⊤ R_tr^{-1} F_tr , so that µ_β|tr is a matrix-weighted "convex combination" of the prior mean b_β and the BLUP β̂. In contrast, the posterior covariance,

Σ_β|tr = ( λ_Z F_tr^⊤ R_tr^{-1} F_tr + λ_β V_β^{-1} )^{-1} = λ_Z^{-1} ( F_tr^⊤ R_tr^{-1} F_tr + (λ_β/λ_Z) V_β^{-1} )^{-1} ,

depends on the individual precision parameters whether or not λ_β = λ_Z.
Prediction at a Single Test Input xte
To provide additional insight about the nature of the Bayesian predictor µ_te|tr and its uncertainty quantification σ²_te|tr given by Theorem 4.1, we restrict attention to the case of a single test input x^te to simplify the discussion.

Examining first the predictive mean, the following properties hold for both cases (a) and (b). Algebra shows that µ_te|tr = µ_te|tr(x^te) is linear in y^tr and, with additional calculation, that it is an unbiased predictor of Y(x^te), i.e., µ_te|tr(x^te) is a linear unbiased predictor of Y(x^te). The continuity and other smoothness properties of µ_te|tr are inherited from those of the correlation function R(·) and the regressors {f_j(·)}_{j=1}^p because
µ_te|tr = f^⊤(x^te) µ_β|tr + r_te^⊤ R_tr^{-1} ( y^tr − F_tr µ_β|tr )
        = ∑_{j=1}^{p} f_j(x^te) µ_β|tr,j + ∑_{i=1}^{n_s} d_i R(x^te − x_i) ,    (4.2.9)
where µ_β|tr,j is the jth element of µ_β|tr and d_i is the ith element of R_tr^{-1}( y^tr − F_tr µ_β|tr ). Thus in case (a), µ_te|tr depends only on the ratio of λ_Z and λ_β because this is true of µ_β|tr. Lastly, µ_te|tr interpolates the training data. This is true because when x^te = x_i for some i ∈ {1, . . . , n_s}, f_te = f(x^te) = f(x_i) and r_te^⊤ R_tr^{-1} = e_i^⊤, the ith unit vector. Thus

µ_te|tr = f^⊤(x_i) µ_β|tr + r_te^⊤ R_tr^{-1} ( y^tr − F_tr µ_β|tr )
        = f^⊤(x_i) µ_β|tr + e_i^⊤ ( y^tr − F_tr µ_β|tr )
        = f^⊤(x_i) µ_β|tr + ( y_i − f^⊤(x_i) µ_β|tr )
        = y_i .
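A quick numerical check of the interpolation property, under an assumed set of inputs: placing the test input exactly at a training site returns the observed response regardless of the value used for µ_β|tr.

```python
# Interpolation check: mu_te|tr(x_i) = y_i for a test input at a training site.
import numpy as np

def gauss_corr(x1, x2, theta=10.0):
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

x_tr = np.linspace(0.0, 1.0, 7)
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)
F_tr = np.ones((7, 1))
Rinv = np.linalg.inv(gauss_corr(x_tr, x_tr))

mu_b = np.array([0.372])                      # any posterior mean for beta_0 (assumed)
x0 = x_tr[3]                                  # test input placed exactly at a training site
r_te = gauss_corr(np.array([x0]), x_tr)[0]    # r_te^T R_tr^{-1} equals the unit vector e_3
f_te = np.array([1.0])

mu_te = f_te @ mu_b + r_te @ Rinv @ (y_tr - F_tr @ mu_b)
assert np.isclose(mu_te, y_tr[3])             # predictor reproduces the observed y_3
```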
For prior (b), the posterior covariance Σ_te|tr (a scalar here, since n_e = 1) reduces to

σ²_te|tr(x^te) = λ_Z^{-1} { 1 − r_te^⊤ R_tr^{-1} r_te + h^⊤ ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} h } ,    (4.2.10)

where h = f_te − F_tr^⊤ R_tr^{-1} r_te; equation (4.2.10) was given previously as the variance of the BLUP (3.2.9). For prior (a),

σ²_te|tr = λ_Z^{-1} { 1 − ( f_te^⊤ , r_te^⊤ ) [ −(λ_β/λ_Z) V_β^{-1} , F_tr^⊤ ; F_tr , R_tr ]^{-1} ( f_te ; r_te ) }
         = λ_Z^{-1} { 1 − [ −f_te^⊤ Q^{-1} f_te + 2 f_te^⊤ Q^{-1} F_tr^⊤ R_tr^{-1} r_te + r_te^⊤ ( R_tr^{-1} − R_tr^{-1} F_tr Q^{-1} F_tr^⊤ R_tr^{-1} ) r_te ] }    (4.2.11)
         = λ_Z^{-1} { 1 − r_te^⊤ R_tr^{-1} r_te + f_te^⊤ Q^{-1} f_te − 2 f_te^⊤ Q^{-1} F_tr^⊤ R_tr^{-1} r_te + r_te^⊤ R_tr^{-1} F_tr Q^{-1} F_tr^⊤ R_tr^{-1} r_te }
         = λ_Z^{-1} { 1 − r_te^⊤ R_tr^{-1} r_te + h^⊤ Q^{-1} h } ,    (4.2.12)

where h = f(x^te) − F_tr^⊤ R_tr^{-1} r_te and

Q = F_tr^⊤ R_tr^{-1} F_tr + (λ_β/λ_Z) V_β^{-1} ;    (4.2.13)

the equality in (4.2.11) follows from Lemma C.3.
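The equivalence of the block-matrix form of σ²_te|tr for prior (a) and the reduced form (4.2.12) can also be verified numerically; the sketch below does so for one assumed configuration (random training sites, a single test input, and illustrative values of λ_Z, λ_β, and V_β).

```python
# Numerical check that the block-matrix form of sigma^2_te|tr for prior (a)
# agrees with the reduced form (4.2.12), under assumed inputs.
import numpy as np

def gauss_corr(x1, x2, theta=10.0):
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

rng = np.random.default_rng(0)
x_tr = np.sort(rng.uniform(0, 1, 7))
x0 = np.array([0.37])                          # a single test input (assumed)
F_tr, f_te = np.ones((7, 1)), np.ones(1)
lam_Z, lam_b, V_beta = 6.0, 10.0, np.eye(1)    # assumed precisions and prior correlation

R_tr = gauss_corr(x_tr, x_tr)
r_te = gauss_corr(x0, x_tr)[0]
Rinv = np.linalg.inv(R_tr)

# Block-matrix form: 1 - (f^T, r^T) [[-(lam_b/lam_Z) V^{-1}, F^T], [F, R]]^{-1} (f; r)
M = np.block([[-(lam_b / lam_Z) * np.linalg.inv(V_beta), F_tr.T], [F_tr, R_tr]])
v = np.concatenate([f_te, r_te])
var_block = (1.0 - v @ np.linalg.solve(M, v)) / lam_Z

# Reduced form (4.2.12): 1 - r^T R^{-1} r + h^T Q^{-1} h
Q = F_tr.T @ Rinv @ F_tr + (lam_b / lam_Z) * np.linalg.inv(V_beta)
h = f_te - F_tr.T @ Rinv @ r_te
var_reduced = (1.0 - r_te @ Rinv @ r_te + h @ np.linalg.solve(Q, h)) / lam_Z

assert np.isclose(var_block, var_reduced)
```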
Intuitively, the variance of the posterior of Y(x^te) given the training data should be zero whenever x^te = x_i^tr, 1 ≤ i ≤ n_s, because we know the response exactly at each of the training data sites and there is no measurement error term in the stochastic process model. This was shown previously for (4.2.10), prior (b). To see that this is also the case for prior (a), fix x^te = x_1, say. In this case, recall that r_te^⊤ R_tr^{-1} = e_1^⊤, and observe that f(x^te) = f(x_1). From (4.2.12),

σ²_te|tr(x_1) = λ_Z^{-1} { 1 − r_te^⊤ R_tr^{-1} r_te + ( f^⊤(x^te) − r_te^⊤ R_tr^{-1} F_tr ) Q^{-1} ( f(x^te) − F_tr^⊤ R_tr^{-1} r_te ) }
             = λ_Z^{-1} { 1 − e_1^⊤ r_te + ( f^⊤(x_1) − e_1^⊤ F_tr ) Q^{-1} ( f(x_1) − F_tr^⊤ e_1 ) }
             = λ_Z^{-1} { 1 − 1 + ( f^⊤(x_1) − f^⊤(x_1) ) Q^{-1} ( f(x_1) − f(x_1) ) }
             = λ_Z^{-1} { 1 − 1 + 0 } = 0 ,

where Q is given in (4.2.13).
Fig. 4.1 The function y(x) = exp{−1.4x} cos(3.5πx) (solid line); a seven-point training data set (solid circles); the Bayesian predictor µ_te|tr = µ_β|tr + r_te^⊤ R_tr^{-1}( y^tr − 1_{n_s} µ_β|tr ) in (4.2.14) and (4.2.15) for λ_β = 0 (blue), λ_β = 10 (red), and λ_β = 100 (green).

Perhaps the most important use of Theorem 4.1 is to provide pointwise uncertainty bands about the predictor µ_te|tr(x^te) in (4.2.9). The bands can be obtained by using the fact that

( Y(x^te) − µ_te|tr(x^te) ) / σ_te|tr(x^te)  ∼  N(0, 1) .

This gives the posterior prediction interval
P{ Y(x^te) ∈ µ_te|tr(x^te) ± σ_te|tr(x^te) z_{α/2} | Y^tr } = 1 − α ,

where z_{α/2} is the upper α/2 critical point of the standard normal distribution (see Appendix A). As a special case, if the input x^te ∈ (a, b), then µ_te|tr(x^te) ± σ_te|tr(x^te) z_{α/2} are pointwise 100(1 − α)% prediction bands for Y(x^te), a < x^te < b. Below, we illustrate the prediction band calculation for the hierarchical (Y^te, Y^tr) model presented in Theorem 4.2.
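Before turning to Theorem 4.2, here is a sketch of the corresponding band calculation for the known-λ_Z case of Theorem 4.1 with the non-informative prior (b), computed directly from (4.2.5) and (4.2.10) on a grid of test inputs. The training sites, θ = 10, and λ_Z = 6 are assumed for illustration.

```python
# Pointwise 100(1 - alpha)% bands mu_te|tr(x) +/- z_{alpha/2} * sigma_te|tr(x)
# using the prior-(b) predictor and variance (4.2.10); settings are assumed.
import numpy as np
from scipy.stats import norm

def gauss_corr(x1, x2, theta=10.0):
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

x_tr = np.linspace(0.0, 1.0, 7)
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)
x_te = np.linspace(0.0, 1.0, 201)
F_tr, F_te = np.ones((7, 1)), np.ones((201, 1))
lam_Z = 6.0                                    # assumed process precision

R_tr = gauss_corr(x_tr, x_tr)
R_te_tr = gauss_corr(x_te, x_tr)
Rinv = np.linalg.inv(R_tr)

A = F_tr.T @ Rinv @ F_tr
beta_hat = np.linalg.solve(A, F_tr.T @ Rinv @ y_tr)   # mu_beta|tr for prior (b)

mu_te = F_te @ beta_hat + R_te_tr @ Rinv @ (y_tr - F_tr @ beta_hat)
H = F_te - R_te_tr @ Rinv @ F_tr
var_te = (1.0 - np.sum((R_te_tr @ Rinv) * R_te_tr, axis=1)
          + np.sum((H @ np.linalg.inv(A)) * H, axis=1)) / lam_Z
var_te = np.maximum(var_te, 0.0)               # guard against tiny negative round-off

z = norm.ppf(0.975)                            # alpha = 0.05
lower, upper = mu_te - z * np.sqrt(var_te), mu_te + z * np.sqrt(var_te)
```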
Example 4.1 (Damped Sine Curve). This example illustrates the effect of the prior [β] on the mean of the predictive distribution µ_te|tr(x^te) in Theorem 4.1. Consider the damped cosine function

y(x) = e^{−1.4x} cos(7πx/2) ,  0 < x < 1,

and n_s = 7 points of training data, which are the solid curve and dots in Figure 4.1, respectively. For any x^te ∈ (0, 1), the predictive distribution of Y(x^te) is based on the
hierarchical Bayes model whose first stage is the stationary stochastic process

Y(x) | β_0 = β_0 + Z(x) ,  0 < x < 1,

where β_0 ∈ IR and, for the purpose here of illustrating the case where only β_0 is unknown, R(h) is taken to be R(h) = exp{−10.0 × h²}.
Suppose in part (a) of Theorem 4.1 it is assumed that β_0 ∼ N( b_β , λ_β^{-1} v_β² ) with v_β = 1 and with known prior mean b_β and known prior precision λ_β. For any x^te ∈ (0, 1), the Bayesian predictor of y(x^te) is the posterior mean (4.2.5),

µ_te|tr = µ_te|tr(x^te) = µ_β|tr + r_te^⊤ R_tr^{-1} ( y^tr − 1_{n_s} µ_β|tr ) ,    (4.2.14)

where µ_β|tr is the posterior mean of β_0 given Y^tr, which is
µ_β|tr = ( 1_{n_s}^⊤ R_tr^{-1} y^tr + b_β λ_β/λ_Z ) / ( 1_{n_s}^⊤ R_tr^{-1} 1_{n_s} + λ_β/λ_Z )
       = ω b_β + (1 − ω) ( 1_{n_s}^⊤ R_tr^{-1} 1_{n_s} )^{-1} ( 1_{n_s}^⊤ R_tr^{-1} y^tr )
       = ω b_β + (1 − ω) β̂_0 ,    (4.2.15)

where ω = λ_β / [ λ_Z 1_{n_s}^⊤ R_tr^{-1} 1_{n_s} + λ_β ] ∈ (0, 1).

Fig. 4.2 The factor (1 − r_te^⊤ R_tr^{-1} 1_{n_s}) versus x^te ∈ (0, 1).

In words, (4.2.15) can be interpreted as
saying that the posterior mean of β_0 given Y^tr is a convex combination of the prior mean b_β and the MLE of β_0. For fixed process precision λ_Z, ω → 1 and µ_β|tr → b_β as the β_0 prior precision λ_β → ∞; the predictor guesses the prior mean and ignores the data. Similarly, ω → 0 and µ_β|tr → β̂_0 as the β_0 prior precision λ_β → 0; the predictor uses only the data and ignores the prior information.
However, the impact of the prior precision of β_0 on the predictor µ_te|tr(x^te) of y(x^te) can be relatively minor. Consider the data shown in Figure 4.1 by the solid circles that are superimposed on the damped sine curve. Calculation gives β̂_0 = 0.372. Suppose b_β = 5, θ = 10.0, and λ_Z = 6; then µ_β|tr → 5 as λ_β → ∞ and µ_β|tr → 0.372 as λ_β → 0. Figure 4.1 shows that the effect on µ_te|tr(x^te) of changing the prior precision is relatively minor. The calculation that shows this effect is to observe that

µ_te|tr(x^te) = r_te^⊤ R_tr^{-1} y^tr + ( 1 − r_te^⊤ R_tr^{-1} 1_{n_s} ) × µ_β|tr ,

which shows that the magnitude of the Bayes predictor depends on the posterior mean µ_β|tr only through the factor (1 − r_te^⊤ R_tr^{-1} 1_{n_s}). Figure 4.2 shows that this factor is relatively small in the center of the training data but can increase as x^te moves "away" from the training data.
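A sketch of the Example 4.1 computation follows: the predictor (4.2.14) with µ_β|tr from (4.2.15) for λ_β ∈ {0, 10, 100}. The text does not list the seven training-site locations, so equally spaced sites are assumed here (the fitted β̂_0 will therefore differ somewhat from the 0.372 reported above); b_β = 5, λ_Z = 6, and θ = 10 follow the example.

```python
# Example 4.1 sketch: Bayes predictor for several prior precisions lambda_beta.
import numpy as np

def gauss_corr(x1, x2, theta=10.0):
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

x_tr = np.linspace(0.05, 0.95, 7)                     # assumed 7 training sites
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)
x_te = np.linspace(0.0, 1.0, 201)

lam_Z, b_beta = 6.0, 5.0                              # as in Example 4.1
R_tr = gauss_corr(x_tr, x_tr)
r_all = gauss_corr(x_te, x_tr)                        # rows are r_te^T for each x_te
Rinv = np.linalg.inv(R_tr)
one = np.ones(7)

a = one @ Rinv @ one
beta0_hat = (one @ Rinv @ y_tr) / a                   # GLS/ML estimate of beta_0

for lam_b in [0.0, 10.0, 100.0]:
    w = lam_b / (lam_Z * a + lam_b)                   # omega in (4.2.15)
    mu_b = w * b_beta + (1 - w) * beta0_hat           # posterior mean of beta_0
    mu_te = mu_b + r_all @ Rinv @ (y_tr - one * mu_b) # predictor (4.2.14)
    print(f"lambda_beta = {lam_b:5.1f}: mu_beta|tr = {mu_b:.3f}")
```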
4.2.2 Predictive Distributions When ϑ = (β, λZ )
Theorem 4.2 states the predictive distribution of Y te given Y tr for informative and
non-informative (β, λZ ) priors. The informative prior is stated in terms of the factors
[β, λZ ] = [β | λZ ] × [λZ ]. In both cases, the Y te | Y tr posterior is a location shifted
and scaled multivariate t distribution having degrees of freedom that are increased
for informative prior information from either β or λZ (see Appendix C.4 for the
definition of the non-central m-variate t distribution, T m (ν, µ, Σ)).
The informative conditional [β | λZ ] choice is the multivariate normal distribution
with known mean bβ and known correlation matrix V β ; lacking more definitive information, V β is often taken to be diagonal, if not simply the identity matrix. This
prior model makes strong assumptions; for example, it says that each component of
β is equally likely to be less than or greater than the corresponding component of
bβ . The non-informative β prior is the intuitive choice
π(β) = 1
used in Theorem 4.1. The informative [λZ ] prior is the gamma distribution with
specified mean and variance. This prior can be made quite diffuse and hence is also
stated as an option for the non-informative prior case. A second non-informative
prior for λZ is the "Jeffreys prior"

π(λ_Z) = 1/λ_Z ,  λ_Z > 0
(see Jeffreys (1961), who gives arguments for this choice).
Theorem 4.2. Suppose (Y te , Y tr ) follows a two-stage model in which the conditional
distribution [(Y te , Y tr ) | (β, λZ )] is given by (4.1.3) and all correlations are known.
(a) If
[(β, λZ )] = [β | λZ ][λZ ]
has prior specified by
[β | λ_Z] ∼ N_p( b_β , λ_Z^{-1} V_β )  and  [λ_Z] ∼ Γ(c, d)    (4.2.16)
with known prior parameters, then the posterior distributions of β and λ_Z are

[λ_Z | y^tr] ∼ Γ( c_1^a , d_1^a )  and  [β | y^tr] ∼ T_p( 2c + n_s , µ_β|tr , d_1^a Σ_β|tr / c_1^a )    (4.2.17)

where
• c_1^a = (2c + n_s)/2
• d_1^a = [ 2d + ( y^tr − F_tr β̂ )^⊤ R_tr^{-1} ( y^tr − F_tr β̂ ) + ( β̂ − b_β )^⊤ Σ_π^{-1} ( β̂ − b_β ) ] / 2
• β̂ = ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} F_tr^⊤ R_tr^{-1} y^tr
• Σ_π = V_β + ( F_tr^⊤ R_tr^{-1} F_tr )^{-1}
• µ_β|tr = ( F_tr^⊤ R_tr^{-1} F_tr + V_β^{-1} )^{-1} ( F_tr^⊤ R_tr^{-1} F_tr β̂ + V_β^{-1} b_β )
• Σ_β|tr = ( V_β^{-1} + F_tr^⊤ R_tr^{-1} F_tr )^{-1} .
The predictive distribution of Y^te is

[ Y^te | Y^tr = y^tr ] ∼ T_{n_e}( 2c + n_s , µ_te|tr , M_te|tr d_1^a / c_1^a )    (4.2.18)

where
• µ_te|tr = F_te µ_β|tr + R_te,tr R_tr^{-1} ( y^tr − F_tr µ_β|tr )
• M_te|tr = R_te − R_te,tr R_tr^{-1} R_te,tr^⊤ + H_te Σ_β|tr H_te^⊤ , with H_te = F_te − R_te,tr R_tr^{-1} F_tr .
(b) Suppose that β and λ_Z are independent with [β] ∝ 1 and [λ_Z] having prior (b.1) or (b.2) in the following table:

  Case Designation    [λ_Z] Prior
  (b.1)               Γ(c, d)
  (b.2)               1/λ_Z
For (b.1) the posterior distributions of β and λ_Z are

[λ_Z | y^tr] ∼ Γ( c_1^{b.1} , d_1^{b.1} )  and  [β | y^tr] ∼ T_p( 2c + n_s − p , β̂ , ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} d_1^{b.1} / c_1^{b.1} )    (4.2.19)

where β̂ is defined in (a) and
• c_1^{b.1} = (2c + n_s − p)/2
• d_1^{b.1} = [ 2d + ( y^tr − F_tr β̂ )^⊤ R_tr^{-1} ( y^tr − F_tr β̂ ) ] / 2 .
The predictive distribution of Y^te is

[ Y^te | Y^tr = y^tr ] ∼ T_{n_e}( 2c + n_s − p , µ_te|tr , M_te|tr d_1^{b.1} / c_1^{b.1} )    (4.2.20)

where
• µ_te|tr = F_te β̂ + R_te,tr R_tr^{-1} ( y^tr − F_tr β̂ )
• M_te|tr = R_te − R_te,tr R_tr^{-1} R_te,tr^⊤ + H_te ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} H_te^⊤ .
For (b.2) the posterior distributions of β and λ_Z are

[λ_Z | y^tr] ∼ Γ( c_1^{b.2} , d_1^{b.2} )  and  [β | y^tr] ∼ T_p( n_s − p , β̂ , ( F_tr^⊤ R_tr^{-1} F_tr )^{-1} d_1^{b.2} / c_1^{b.2} )    (4.2.21)

where
• c_1^{b.2} = (n_s − p)/2
• d_1^{b.2} = ( y^tr − F_tr β̂ )^⊤ R_tr^{-1} ( y^tr − F_tr β̂ ) / 2 .
The predictive distribution of Y^te is

[ Y^te | Y^tr = y^tr ] ∼ T_{n_e}( n_s − p , µ_te|tr , d_1^{b.2} M_te|tr / c_1^{b.2} )    (4.2.22)
where µte|tr and Mte|tr are defined as in (b.1).
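The (b.2) quantities are simple to compute once R_tr has been inverted. The sketch below forms the posterior Gamma parameters c_1^{b.2} and d_1^{b.2} and the location and scale of the predictive t distribution (4.2.22); the training-site locations and θ = 10 are assumed for illustration.

```python
# Theorem 4.2(b.2) sketch: posterior Gamma parameters and predictive t quantities
# under non-informative priors on beta and lambda_Z.  Inputs are assumed.
import numpy as np

def gauss_corr(x1, x2, theta=10.0):
    return np.exp(-theta * (x1[:, None] - x2[None, :]) ** 2)

x_tr = np.linspace(0.05, 0.95, 7)
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)
x_te = np.linspace(0.0, 1.0, 201)
n_s, p = 7, 1
F_tr, F_te = np.ones((n_s, p)), np.ones((len(x_te), p))

R_tr = gauss_corr(x_tr, x_tr)
R_te_tr = gauss_corr(x_te, x_tr)
R_te = gauss_corr(x_te, x_te)
Rinv = np.linalg.inv(R_tr)

A = F_tr.T @ Rinv @ F_tr
beta_hat = np.linalg.solve(A, F_tr.T @ Rinv @ y_tr)
resid = y_tr - F_tr @ beta_hat

c1 = (n_s - p) / 2.0                        # posterior Gamma shape, c_1^{b.2}
d1 = (resid @ Rinv @ resid) / 2.0           # posterior Gamma rate, d_1^{b.2}

# Predictive t: df = n_s - p, location mu_te|tr, scale matrix M_te|tr * d1 / c1
mu_te = F_te @ beta_hat + R_te_tr @ Rinv @ (y_tr - F_tr @ beta_hat)
H = F_te - R_te_tr @ Rinv @ F_tr
M_te = R_te - R_te_tr @ Rinv @ R_te_tr.T + H @ np.linalg.solve(A, H.T)
scale = np.sqrt(np.maximum(np.diag(M_te) * d1 / c1, 0.0))
print(n_s - p, mu_te[:3], scale[:3])
```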
The formulas above for the degrees of freedom, location shift, and scale factor in the Y^te predictive distribution all have intuitive interpretations. The base value for the degrees of freedom is n_s − p, which is augmented by p additional degrees of freedom when β has the informative Gaussian prior (case (a)), and is further augmented by 2c degrees of freedom when λ_Z has the informative gamma prior (cases (a) and (b.1)).
The mean of the Y^te predictive distribution is the same for the two λ_Z cases of Theorems 4.1 and 4.2. In the case of Theorem 4.1, which has known λ_Z, (4.2.5) and (4.2.8) give the predictive mean to be

µ_te|tr = F_te µ_β|tr + R_te,tr R_tr^{-1} ( y^tr − F_tr µ_β|tr ) ,

where µ_β|tr is the posterior mean of β under the informative or non-informative prior. In the case of Theorem 4.2, which takes λ_Z to be a hierarchical parameter, the predictive mean is the location parameter of the t-distribution. Examination of µ_te|tr shows it is identical to that of Theorem 4.1. Thus the Bayesian predictor is the same for the two cases.
The uncertainty quantification of Y^te for known λ_Z (given in Theorem 4.1) and for unknown λ_Z with a prior (given in Theorem 4.2) are related. To simplify the discussion, consider the case of a single input x^te at which prediction is desired. When λ_Z is known and it is assumed that λ_Z = λ_β, Theorem 4.1 gives the predictive variance of Y^te to be

σ²_te|tr = λ_Z^{-1} { 1 − r_te^⊤ R_tr^{-1} r_te + h^⊤ Q^{-1} h } ,    (4.2.23)

where h = f(x^te) − F_tr^⊤ R_tr^{-1} r_te and

Q = F_tr^⊤ R_tr^{-1} F_tr + V_β^{-1}   or   Q = F_tr^⊤ R_tr^{-1} F_tr
as the informative or non-informative β prior is assumed.
When λ_Z is unknown but has a gamma or Jeffreys prior, Theorem 4.2 gives the predictive variance of Y^te as

σ²_te|tr = ( d_1^{b.i} / c_1^{b.i} ) K { 1 − r_te^⊤ R_tr^{-1} r_te + h^⊤ Q^{-1} h } ,  i = 1, 2 ,    (4.2.24)

where h is as above and K = (2c + n_s − p)/(2c + n_s − p − 2) or K = (n_s − p)/(n_s − p − 2) (assuming the denominator is positive). The final factor is the same as in the known-λ_Z case. The product of the first two terms in (4.2.24) should be thought of as an estimator of the factor λ_Z^{-1} in (4.2.23). This is because the posterior of λ_Z is gamma with parameters c_1^{b.i} and d_1^{b.i}, i = 1, 2. Thus c_1^{b.i}/d_1^{b.i} is the mean of the λ_Z posterior
and d_1^{b.i}/c_1^{b.i} is a naive guess of λ_Z^{-1}. The degrees-of-freedom "correction" K will be near unity for a "small" regression model (small p) and "large" data and/or strong λ_Z prior information.

Fig. 4.3 95% pointwise prediction bands (4.2.26) for y(x) at n_e = 103 equally spaced x^te values over (0, 1). The solid black curve is the damped sine curve used to generate the n_s = 7 training data points (solid circles). The left panel assumes θ = 10 in (4.2.26) and the right panel assumes θ = 75 in (4.2.26).
Again considering the case of a single input x^te, Theorem 4.2 can be used to place pointwise prediction bands about y(x^te). Using the fact that, given Y^tr,

( Y(x^te) − µ_te|tr(x^te) ) / σ_te|tr(x^te)  ∼  T_1( d.o.f., 0, 1 ) ,

where d.o.f. is either 2c + n_s − p or n_s − p for the informative or non-informative λ_Z prior, respectively, gives

P{ Y(x^te) ∈ µ_te|tr(x^te) ± σ_te|tr(x^te) t_{d.o.f.}^{α/2} | Y^tr } = 1 − α ,    (4.2.25)

where t_ν^{α/2} is the upper α/2 critical point of the univariate central t-distribution with ν degrees of freedom (see Appendix A). When x^te ∈ (a, b), then µ_te|tr(x^te) ± σ_te|tr(x^te) t_ν^{α/2} for a < x^te < b are pointwise 100(1 − α)% prediction bands for y(x^te).
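Obtaining the critical point t_{d.o.f.}^{α/2} and assembling a band is a one-line calculation; the sketch below shows it with placeholder values for µ_te|tr(x^te), σ_te|tr(x^te), and the prior parameters (all assumed, not from the text).

```python
# Assembling the bands in (4.2.25): t critical point times the predictive scale.
from scipy.stats import t

n_s, p, c = 7, 1, 2.0                  # assumed sample size, regressors, Gamma shape
alpha = 0.05

dof_informative = 2 * c + n_s - p      # informative Gamma prior on lambda_Z (case b.1)
dof_noninform = n_s - p                # Jeffreys prior on lambda_Z (case b.2)

mu, sigma = 0.40, 0.12                 # placeholder mu_te|tr(x) and sigma_te|tr(x)
tcrit_inf = t.ppf(1 - alpha / 2, dof_informative)
tcrit_non = t.ppf(1 - alpha / 2, dof_noninform)
band = (mu - tcrit_non * sigma, mu + tcrit_non * sigma)
print(tcrit_inf, tcrit_non, band)
```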
Example 4.1 [Continued]. Figure 4.3 plots the 95% pointwise prediction bands

µ_te|tr ± √{ ( 6 d_1^{b.2} / ( 4 c_1^{b.2} ) ) × [ 1 − r_te^⊤ R_tr^{-1} r_te + h²/Q ] }    (4.2.26)

obtained for the prior (b.2) of Theorem 4.2, based on the n_s = 7 point training data set obtained from the damped sine curve and shown as filled circles. Here h = 1 − 1_7^⊤ R_tr^{-1} r_te, while c_1^{b.2} and d_1^{b.2} are specified in Theorem 4.2. The first stage of the model is a GP with constant mean (p = 1) and Gaussian correlation function

R(h | θ) = exp{ −θ h² } .
The bands have been computed for θ ∈ {10, 75} to show the effect of assuming a stronger correlation structure between the Y(x) values (θ = 10) and a weaker correlation structure between the Y(x) values (θ = 75). Intuitively, when the model assumption allows Y(x) to vary more over a given interval of x, the confidence bands should be wider, as seen in the right panel of the figure. For any θ, the bands have zero width at each of the training data points. Finally, the pointwise y(x) predictor, µ_te|tr, is relatively insensitive to the correlation structure while interpolating the data, although the different left-hand prediction values show that this need not be the case when extrapolating.
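The following sketch carries out the Figure 4.3 band computation (4.2.26) for θ ∈ {10, 75} under prior (b.2), following (4.2.26) as stated. The seven training-site locations are assumed (the text does not list them); the constant-mean first stage and Gaussian correlation follow the example.

```python
# Figure 4.3 sketch: bands (4.2.26) under prior (b.2) for theta in {10, 75}.
import numpy as np

x_tr = np.linspace(0.05, 0.95, 7)                  # assumed training sites
y_tr = np.exp(-1.4 * x_tr) * np.cos(3.5 * np.pi * x_tr)
x_te = np.linspace(0.0, 1.0, 103)                  # n_e = 103 prediction sites as in Fig. 4.3
one = np.ones(7)
n_s, p = 7, 1

for theta in (10.0, 75.0):
    R_tr = np.exp(-theta * (x_tr[:, None] - x_tr[None, :]) ** 2)
    r_all = np.exp(-theta * (x_te[:, None] - x_tr[None, :]) ** 2)
    Rinv = np.linalg.inv(R_tr)

    Q = one @ Rinv @ one                           # F_tr^T R^{-1} F_tr with F_tr = 1_7
    beta0_hat = (one @ Rinv @ y_tr) / Q
    resid = y_tr - beta0_hat
    c1 = (n_s - p) / 2.0                           # c_1^{b.2} = 3
    d1 = (resid @ Rinv @ resid) / 2.0              # d_1^{b.2}

    mu_te = beta0_hat + r_all @ Rinv @ (y_tr - one * beta0_hat)
    h = 1.0 - r_all @ Rinv @ one                   # h = 1 - 1_7^T R^{-1} r_te
    var_factor = 1.0 - np.sum((r_all @ Rinv) * r_all, axis=1) + h ** 2 / Q
    half_width = np.sqrt(np.maximum(6 * d1 / (4 * c1) * var_factor, 0.0))
    lower, upper = mu_te - half_width, mu_te + half_width
    print(theta, half_width.max())
```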
4.3 Examples of Non-conjugate Bayesian Models and Prediction
4.3.1 Introduction
Subsections 4.2.1 and 4.2.2 assumed that the correlations among the observations are known, i.e., R and r_0 are known. Now we assume that y(·) has a hierarchical Gaussian random field prior with parametric correlation function R(· | ψ) having an unknown vector of parameters ψ (as introduced in Subsection ?? and previously considered in Subsection 3.3.2 for predictors). To facilitate the discussion below, suppose that the mean and variance of the normal predictive distribution in (4.2.4) and (??) are denoted by µ_0|n(x_0) = µ_0|n(x_0 | ψ) and σ²_0|n(x_0) = σ²_0|n(x_0 | ψ), where ψ was known in these earlier sections. Similarly, recall that the location and scale parameters of the predictive t distributions in (??) are denoted by µ_i(x_0) = µ_i(x_0 | ψ) and σ²_i(x_0) = σ²_i(x_0 | ψ), for i ∈ {(1), (2), (3), (4)}.
We consider two issues. The first is the assessment of the standard error of the plug-in predictor µ_0|n(x_0 | ψ̂) of Y_0(x_0) that is derived from µ_0|n(x_0 | ψ) by substituting ψ̂, an estimator of ψ that might be the MLE or REML. This question is implicitly stated from the frequentist viewpoint. The second issue is Bayesian; we describe the Bayesian approach to uncertainty in ψ, which is to model it by a prior distribution.
When ψ is known, recall that σ²_0|n(x_0 | ψ) is the MSPE of µ_0|n(x_0 | ψ). This suggests estimating the MSPE of µ_0|n(x_0 | ψ̂) by the plug-in MSPE σ²_0|n(x_0 | ψ̂). The correct expression for the MSPE of µ_0|n(x_0 | ψ̂) is

MSPE( µ_0|n(x_0 | ψ̂), ψ ) = E_ψ { [ µ_0|n(x_0 | ψ̂) − Y(x_0) ]² } .    (4.3.1)
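A Monte Carlo sketch of (4.3.1) is given below: repeatedly draw (Y^tr, Y(x_0)) from the process at a fixed "true" ψ, re-estimate ψ (here, the correlation parameter θ) by a simple grid-based MLE, and average the squared error of the plug-in predictor. The true θ, the sites, the grid, and the use of a constant-mean Gaussian-correlation model are all illustrative assumptions.

```python
# Monte Carlo estimate of the plug-in MSPE (4.3.1) under an assumed true model.
import numpy as np

rng = np.random.default_rng(1)
x_tr = np.linspace(0.0, 1.0, 7)
x0 = 0.42
x_all = np.append(x_tr, x0)
theta_true, lam_Z, beta0 = 25.0, 4.0, 0.0
theta_grid = np.linspace(1.0, 100.0, 200)

def corr(x, theta):
    return np.exp(-theta * (x[:, None] - x[None, :]) ** 2)

def predict(y, theta):
    """Plug-in predictor of Y(x0) with constant mean, given a value of theta."""
    R = corr(x_tr, theta) + 1e-10 * np.eye(7)
    r0 = np.exp(-theta * (x_tr - x0) ** 2)
    Rinv = np.linalg.inv(R)
    one = np.ones(7)
    b0 = (one @ Rinv @ y) / (one @ Rinv @ one)
    return b0 + r0 @ Rinv @ (y - b0)

def profile_neg_loglik(y, theta):
    """Negative log profile likelihood of theta (beta and lam_Z profiled out)."""
    R = corr(x_tr, theta) + 1e-10 * np.eye(7)
    Rinv = np.linalg.inv(R)
    one = np.ones(7)
    b0 = (one @ Rinv @ y) / (one @ Rinv @ one)
    s2 = (y - b0) @ Rinv @ (y - b0) / 7
    return 7 * np.log(s2) + np.linalg.slogdet(R)[1]

errs = []
C_all = corr(x_all, theta_true) / lam_Z               # joint covariance of (Y^tr, Y(x0))
L = np.linalg.cholesky(C_all + 1e-10 * np.eye(8))
for _ in range(500):
    z = beta0 + L @ rng.standard_normal(8)            # one joint draw
    y_tr, y0 = z[:7], z[7]
    theta_hat = theta_grid[np.argmin([profile_neg_loglik(y_tr, th) for th in theta_grid])]
    errs.append((predict(y_tr, theta_hat) - y0) ** 2)
print("Monte Carlo MSPE of plug-in predictor:", np.mean(errs))
```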