Models for robust inference and sensitivity analysis
• Assume we have completed an analysis
(i.e., we have obtained posterior simulations
from a specified model)
• Often want to assess:
– sensitivity of inferences (do the results
change under other reasonable models)
– robustness to outliers by considering
overdispersed alternatives to our model
(e.g., t rather than normal)
– overdispersed version of model to
address heterogeneity
– effect of other small changes
(e.g., deleting an observation)
• Computational approaches
– exact posterior inference under new model
(may be quite time consuming)
– approximate posterior inference using
importance ratios
Models for robust inference and sensitivity analysis
Example
• Recall SAT coaching example:
– data: y = (28, 8, −3, 7, −1, 1, 18, 12)
– model:
yj | θj ∼ N(θj, σj²)
θj | µ, τ² ∼ N(µ, τ²)
p(µ, τ) ∝ 1
• What happens if we replace 12 with 100?
– estimate of τ 2 gets bigger
– estimate of θj moves towards yj
– if school 8 is an outlier, this affects conclusions for
other seven schools
• Would a t-model at one or both stages of the model
help?
Models for robust inference and sensitivity analysis
Overdispersed models
• Recall that we have seen several instances in which
allowing for heterogeneity among units leads to “new”
overdispersed models
• Binomial and Beta-Binomial
– Standard model: yi ∼ Binomial(n, p)
E(yi ) = np
V (yi ) = np(1 − p)
– Overdispersed model:
yi ∼ Binomial(n, pi),  pi ∼ Beta(α, β)  ⇒  yi ∼ Beta-Binomial(n, α, β)
E(yi) = n α/(α+β) = n p
V(yi) = n [α/(α+β)] [β/(α+β)] [(α+β+n)/(α+β+1)] = n p (1−p) × (∗)
(∗) = (α+β+n)/(α+β+1) = overdispersion factor
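The overdispersion factor is easy to verify by simulation. A minimal sketch (Python/numpy; n = 10, α = 2, β = 3 are illustrative values chosen only for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 10, 2.0, 3.0          # illustrative values (assumption)
p = alpha / (alpha + beta)

# beta-binomial draws via the mixture: p_i ~ Beta, y_i | p_i ~ Binomial
p_i = rng.beta(alpha, beta, size=200_000)
y = rng.binomial(n, p_i)

binom_var = n * p * (1 - p)
overdisp = (alpha + beta + n) / (alpha + beta + 1)
print(y.mean(), n * p)                 # ~ E(y) = n*alpha/(alpha+beta)
print(y.var(), binom_var * overdisp)   # ~ V(y) = n p (1-p) * overdispersion factor
```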
Models for robust inference and sensitivity analysis
Overdispersed models (cont’d)
• Poisson and Negative Binomial
– Poisson: variance equals the mean
– Negative Binomial: two-parameter distn allows
the mean and variance to be fitted separately,
with variance at least as great as the mean
– Overdispersed model:
yi ∼ Poisson(λi),  λi ∼ Gamma(α, β)  ⇒  yi ∼ Neg-Bin(α, β)
E(yi) = α/β
V(yi) = (α/β) × [(β+1)/β] = E(yi) × (∗)
(∗) = (β+1)/β = overdispersion factor
Models for robust inference and sensitivity analysis
Overdispersed models (cont’d)
• Normal and t-distribution
t has a longer tail than the normal and can be used for
accommodating:
(a) occasional unusual observations in the
data distribution
(b) occasional extreme parameters in the
prior distribution or hierarchical model
Overdispersed model:
yi ∼ N(µ, Vi),  Vi ∼ Inv-χ²(ν, σ²)  ⇒  yi ∼ tν(µ, σ²)
E(yi) = µ
V(yi) = σ² × [ν/(ν−2)]  for ν > 2
(∗) = ν/(ν−2) = overdispersion factor
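The scale-mixture representation also gives a direct way to simulate tν draws. A minimal sketch (Python/numpy; µ, σ², ν below are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, nu = 0.0, 1.0, 5.0          # illustrative values (assumption)

# V_i ~ Inv-chi^2(nu, sigma^2)  <=>  V_i = nu * sigma^2 / chi^2_nu
V = nu * sigma2 / rng.chisquare(nu, size=500_000)
y = rng.normal(mu, np.sqrt(V))          # y_i | V_i ~ N(mu, V_i)

print(y.var(), sigma2 * nu / (nu - 2))  # sample variance ~ sigma^2 * nu/(nu-2)
```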
Models for robust inference and sensitivity analysis
• Overdispersed (robust) models are “safer” in the sense
that they include the non-robust models as a special
case (e.g., normal is t with infinite d.f.)
• Why not start with robust (expanded) models?
– non-robust models have special justification
∗ normal justified by CLT
∗ Poisson justified by Poisson process
– non-robust models often computationally
convenient
Models for robust inference and sensitivity analysis
Notation for model expansion
• po (y|θ) = sampling distribution for original model
• p(y|θ, φ) = expanded sampling model for y
• φ = hyperparameter defining expanded model
• Normal/t example
– y | µ, σ², ν ∼ tν(µ, σ²)   [i.e., θ = (µ, σ²) and φ = ν]
– po(y | µ, σ²) = N(y | µ, σ²)   [ν = ∞]
• Can be applied to data model (as above) or prior
distribution for θ in a hierarchical model
Models for robust inference and sensitivity analysis
Computation
• Possible inferences
– for one or more fixed φ’s, find
p(θ|y, φ) ∝ p(θ|φ) p(y|θ, φ)
(e.g., fix φ = 4 d.f. for the t-distribution)
– examine joint posterior p(θ, φ|y) = p(φ|y)p(θ|y, φ)
• Computational approaches:
– redo analysis for expanded model
(use MCMC, especially Gibbs sampling)
– approximations based on importance weights
– approximations based on importance resampling
Models for robust inference and sensitivity analysis
Computation: complete analysis
• Consider tν distribution (ν specified) as a robust
alternative to normal model
• Model:
yi |µ, V i, σ 2 ∼ N (µ, Vi σ 2 )
Vi ∼ Inv-χ2 (ν, 1)
p(µ, σ 2 ) ∝ σ −2
• Posterior distribution
p(µ, σ², V | y, ν) ∝ (1/σ²) ∏_{i=1}^n [ e^{−ν/(2Vi)} / Vi^{ν/2+1} ] × ∏_{i=1}^n (σ² Vi)^{−1/2} e^{−(yi−µ)²/(2 Vi σ²)}
• Computation via Gibbs sampler
– µ | V, σ², y ∼ N( Σ_{i=1}^n (yi/Vi) / Σ_{i=1}^n (1/Vi),  σ² / Σ_{i=1}^n (1/Vi) )
– σ² | V, µ, y ∼ Inv-χ²( n, (1/n) Σ_{i=1}^n (yi−µ)²/Vi )
– Vi | µ, σ², y ∼ Inv-χ²( ν + 1, [ν + (yi−µ)²/σ²] / (ν + 1) )
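A compact sketch of this Gibbs sampler (Python/numpy). The data vector y and the fixed degrees of freedom nu are supplied by the user, and the scaled Inv-χ² draws use the identity that Inv-χ²(ν0, s0²) draws equal ν0 s0² / χ²_{ν0}:

```python
import numpy as np

def t_model_gibbs(y, nu, n_iter=2000, seed=0):
    """Gibbs sampler for y_i ~ t_nu(mu, sigma^2) via the scale-mixture model."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mu, sigma2 = y.mean(), y.var()
    V = np.ones(n)
    draws = []
    for _ in range(n_iter):
        # mu | V, sigma^2, y
        w = 1.0 / V
        mu = rng.normal((w * y).sum() / w.sum(), np.sqrt(sigma2 / w.sum()))
        # sigma^2 | V, mu, y ~ Inv-chi^2(n, (1/n) sum (y_i - mu)^2 / V_i)
        scale = ((y - mu) ** 2 / V).mean()
        sigma2 = n * scale / rng.chisquare(n)
        # V_i | mu, sigma^2, y ~ Inv-chi^2(nu + 1, [nu + (y_i - mu)^2 / sigma^2] / (nu + 1))
        s_i = (nu + (y - mu) ** 2 / sigma2) / (nu + 1)
        V = (nu + 1) * s_i / rng.chisquare(nu + 1, size=n)
        draws.append((mu, sigma2))
    return np.array(draws)
```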
Models for robust inference and sensitivity analysis
Computation: complete analysis (cont’d)
• What if ν is unknown? Give it a prior distn p(ν) and
include in the model as a parameter
• Posterior distribution
p(µ, σ², V, ν | y) ∝ [p(ν)/σ²] ∏_{i=1}^n [ (ν/2)^{ν/2} e^{−ν/(2Vi)} / (Γ(ν/2) Vi^{ν/2+1}) ] × ∏_{i=1}^n (σ² Vi)^{−1/2} e^{−(yi−µ)²/(2 Vi σ²)}
• First three Gibbs steps are same as
on previous slide
• Metropolis step to draw from conditional distn of ν,
that is p(ν|σ 2 , µ, V, y)
Models for robust inference and sensitivity analysis
Approximation based on importance weights
• Want to consider robust model without redoing the
analysis
• Suppose interested in quantity of the form E[h(θ)|φ, y]
– importance sampling review:
E(g) = ∫ g(y) f(y) dy = ∫ [g(y) f(y)/p(y)] p(y) dy ≈ (1/N) Σ_{i=1}^N g(yi) f(yi)/p(yi)
where the yi’s are sampled from p(y)
– importance sampling approach
(sometimes called importance weighting here)
∗ E[h(θ)|φ, y] = ∫ h(θ) p(θ|φ, y) dθ
= [po(y)/p(y|φ)] ∫ [h(θ) p(θ|φ) p(y|θ, φ) / (po(θ) po(y|θ))] po(θ|y) dθ
with the initial factor po(y)/p(y|φ) unknown
∗ this unknown term makes things slightly different
∗ use draws θ^l, l = 1, . . . , L from po(θ|y) but not the usual importance sampling estimate
Models for robust inference and sensitivity analysis
Approximation based on importance weights
• Importance sampling (weighting) (cont’d)
– estimate E[h(θ)|φ, y] with
ĥ = [ (1/L) Σ_{l=1}^L h(θ^l) p(θ^l|φ) p(y|θ^l, φ) / (po(θ^l) po(y|θ^l)) × po(y)/p(y|φ) ] ÷ [ (1/L) Σ_{l=1}^L p(θ^l|φ) p(y|θ^l, φ) / (po(θ^l) po(y|θ^l)) × po(y)/p(y|φ) ]
i.e.
ĥ = Σ_{l=1}^L ωl h(θ^l) / Σ_{l=1}^L ωl
where
ωl = p(θ^l|φ) p(y|θ^l, φ) / (po(θ^l) po(y|θ^l))
– note: denominator in ĥ is same as numerator with h ≡ 1 (essentially estimates reciprocal of unknown constant)
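A minimal sketch of the reweighting estimate (Python/numpy). The functions log_p_new and log_p_old are placeholders for log p(θ|φ)p(y|θ, φ) and log po(θ)po(y|θ), each known only up to a constant; the unknown constants (including po(y)/p(y|φ)) cancel in the ratio:

```python
import numpy as np

def importance_weighted_mean(h, theta_draws, log_p_new, log_p_old):
    """Estimate E[h(theta) | phi, y] by reweighting draws from the original posterior.

    theta_draws : draws theta^l from p_o(theta | y)
    log_p_new   : function giving log[p(theta|phi) p(y|theta, phi)] up to a constant
    log_p_old   : function giving log[p_o(theta) p_o(y|theta)] up to a constant
    """
    log_w = np.array([log_p_new(t) - log_p_old(t) for t in theta_draws])
    log_w -= log_w.max()                    # stabilize before exponentiating
    w = np.exp(log_w)                       # unknown constants cancel in the ratio
    h_vals = np.array([h(t) for t in theta_draws])
    return (w * h_vals).sum() / w.sum()     # h_hat = sum w_l h(theta^l) / sum w_l
```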
Models for robust inference and sensitivity analysis
Approximation based on importance weights
• Importance resampling
– may be interested in quantities that are not
posterior expectations
– related idea of importance resampling is to
obtain an “approximate sample” from p(θ|φ, y)
∗ sample θl , l = 1, . . . , L from po (θ|y) with L large
∗ calculate importance ratios
p(θ^l|φ) p(y|θ^l, φ) / (po(θ^l) po(y|θ^l))
∗ check distribution of importance ratios
∗ subsample n draws without replacement from L
draws with probability proportional to
importance ratio.
∗ why without replacement?
to provide protection against the worst
case scenario where one θ has
enormous “importance”
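A sketch of the resampling step, under the same placeholder log-density functions as in the weighting sketch above:

```python
import numpy as np

def importance_resample(theta_draws, log_p_new, log_p_old, n_keep, seed=0):
    """Subsample n_keep draws without replacement, prob. proportional to importance ratio."""
    rng = np.random.default_rng(seed)
    log_w = np.array([log_p_new(t) - log_p_old(t) for t in theta_draws])
    w = np.exp(log_w - log_w.max())
    # inspect the ratios: a few dominant weights signal a poor approximation
    print("largest normalized weights:", np.sort(w / w.sum())[-5:])
    idx = rng.choice(len(theta_draws), size=n_keep, replace=False, p=w / w.sum())
    return [theta_draws[i] for i in idx]
```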
Models for robust inference and sensitivity analysis
Approximation based on importance weights
• Importance sampling and importance resampling
– importance sampling computes E(h(θ)|y, φ),
importance resampling obtains posterior
sample
– if there are a small number of large importance weights,
then both approximations are suspect
Models for robust inference and sensitivity analysis
Approximation based on importance weights
• Accuracy and efficiency of importance sampling
estimates
– no method exists for assessing how accurate the
importance resampling (or reweighted) draws are as
an approximation of the
posterior distribution
– check distribution of importance ratios to
assess quality of estimate
– performance depends on variability in
importance ratios
– estimates will often be poor if the largest ratios are
too large relative to the others
– note small importance ratios are not a problem
(they have little influence on E[h(θ)|φ, y])
Regression Models
• Notation
Given a sample of size n
– yi - response or outcome variable for unit i
– y = (y1 , . . . , yn )0
– xi = (xi1 , . . . , xik ) - explanatory variables for unit i;
usually xi1 ≡ 1 ∀i
– X = n × k matrix of predictors
Regression Models
Justification of conditional modeling
• Full model for (y, X)
p(y, X|θ, ψ) = p(X|ψ)p(y|X, θ)
• Posterior distribution for (θ, ψ)
p(ψ, θ|X, y) ∝ p(X|ψ)p(y|X, θ)p(ψ, θ)
• If ψ and θ are independent in their prior
distribution, i.e. p(ψ, θ) = p(ψ)p(θ), then
p(ψ, θ|X, y) = p(ψ|X)p(θ|X, y)
• We can analyze the second factor by itself with no loss
of information
p(θ|X, y) ∝ p(θ)p(y|X, θ)
• Note: If the explanatory variables X are set by
experimenter, then p(X) is known, and there are no
parameters ψ; this also justifies conditional
modeling
Regression Models
• Goal: statistical inference for the parameters θ,
conditional on X and y
• Since everything is conditional on X, we’ll suppress it
in subsequent notation
• Modeling issues
1. defining X and y so that the conditional
expectation of y given X is reasonably linear
as a function of X
2. setting up a prior distribution on the model
parameters that accurately reflects
substantive knowledge
Regression Models
Normal ordinary linear regression model
• Assumptions
y|β, σ 2 ∼ N (Xβ, σ 2 I)
where I is the n × n identity matrix
– the distribution of y given X is a normal r.v. whose
mean is a linear function of X
E(yi |β, X) = (Xβ)i = β1 xi1 + . . . + βk xik
– Var(y|β, σ 2 ) = σ 2 I
∗ can think of y − Xβ as “errors”
∗ observation errors are independent
∗ observation errors are constant variance
Regression Models
Normal ordinary linear regression model
• Standard noninformative prior distribution
p(β, σ 2 |X) ∝ σ −2
• Posterior distribution
p(β, σ²|y) ∝ p(y|β, σ²) p(β, σ²)
∝ (1/σ²)^{n/2+1} exp{ −(y − Xβ)'(y − Xβ) / (2σ²) }
= (1/σ²)^{n/2+1} exp{ −(β'X'Xβ − 2β'X'y + y'y) / (2σ²) }
Completing the square gives
p(β, σ²|y) ∝ (1/σ²)^{n/2+1} exp{ −(β − (X'X)⁻¹X'y)' (X'X) (β − (X'X)⁻¹X'y) / (2σ²) } × exp{ −(y'y − y'X(X'X)⁻¹X'y) / (2σ²) }
Thus
p(β, σ²|y) = p(β|σ², y) × p(σ²|y)
where p(β|σ², y) = N(β | β̂, σ²(X'X)⁻¹) and p(σ²|y) = Inv-χ²(σ² | n−k, s²),
with β̂ = (X'X)⁻¹X'y and s² = (y − Xβ̂)'(y − Xβ̂)/(n − k)
Regression Models
Normal ordinary linear regression model
• Posterior distribution, p(β, σ 2 |y), is proper as long as:
1. n > k
2. rank(X) = k
• Sampling from the posterior distribution of (β, σ 2 )
Recall: β̂ = (X'X)⁻¹X'y and s² = (y − Xβ̂)'(y − Xβ̂)/(n − k); also let Vβ = (X'X)⁻¹
1. compute β̂ and Vβ
(note: these calculations are not usually done with
traditional matrix calculations)
2. compute s2
3. draw σ 2 from p(σ 2 |y) = Inv-χ2 (σ 2 |n − k, s2 )
4. draw β from p(β|σ 2 , y) = N (β|β̂, σ 2 Vβ )
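A sketch of these four steps (Python/numpy). For clarity it uses a direct matrix inverse, even though, as noted above, the computation is not usually done with traditional matrix calculations:

```python
import numpy as np

def draw_beta_sigma2(X, y, n_draws=1000, seed=0):
    """Draws from p(beta, sigma^2 | y) under the noninformative prior p(beta, sigma^2) ~ 1/sigma^2."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)   # Inv-chi^2(n-k, s^2)
    betas = np.array([rng.multivariate_normal(beta_hat, s * V_beta) for s in sigma2])
    return betas, sigma2
```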
Regression Models
Normal ordinary linear regression model
• Posterior predictive distribution for new data
– consider new data to be collected with observed
predictor matrix X̃; we wish to predict the
outcomes, ỹ
– posterior predictive simulation
∗ first draw (β, σ 2 ) from their joint posterior distn
∗ then draw ỹ ∼ N (X̃β, σ 2 I)
– posterior predictive distribution is
ỹ | σ², y ∼ N(X̃β̂, (I + X̃ Vβ X̃') σ²);
averaging over σ² gives
ỹ | y ∼ t_{n−k}(X̃β̂, (I + X̃ Vβ X̃') s²)
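Continuing the sketch above (it reuses the hypothetical draw_beta_sigma2 helper), posterior predictive draws for a new predictor matrix X_new can be simulated as:

```python
import numpy as np

def posterior_predictive(X_new, X, y, n_draws=1000, seed=1):
    """Simulate y_tilde ~ N(X_new beta, sigma^2 I) over posterior draws of (beta, sigma^2)."""
    rng = np.random.default_rng(seed)
    betas, sigma2 = draw_beta_sigma2(X, y, n_draws=n_draws, seed=seed)  # sketch above
    return np.array([rng.normal(X_new @ b, np.sqrt(s)) for b, s in zip(betas, sigma2)])
```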
Regression Models
Model checking
• Diagnostics
– residual plots (traditional or Bayesian versions)
– posterior predictive checks
• Problems/solutions
– nonlinearity
∗ wrong model so all inferences are suspect
∗ fix by transformation and/or adding
predictors (or polynomial terms)
– nonnormality
∗ inferences are not quite right (usually not
terribly important since posterior distn
can be nearly normal even if data are not)
∗ fix by transformation or by using robust
models
Regression Models
Model checking (cont’d)
• Problems/solutions (cont’d)
– unequal variances
∗ bad inferences (variance is the problem)
∗ fix by generalizing the model
(GLS: y|β, Σy ∼ N (Xβ, Σy ) with Σy 6= σ 2 I)
∗ can be solved by adding missing predictor
– correlations
∗ bad inferences (variance is the problem)
∗ fix by generalizing the model (GLS)
∗ can be solved by adding missing predictor
(time,space)
Generalized Least Squares Model
y|β, Σy ∼ N (Xβ, Σy )
• Possible choices for Σy
– Σy known
– Σy = σ 2 Qy with Qy known
– Σy = f (σ 2 , φ) i.e. a function of some unknown
parameters beyond σ 2
• Posterior distribution for special case Σy known
Let y* = Σy^{−1/2} y; then y* | β ∼ N(Σy^{−1/2} Xβ, I)
Hence
p(β|y) = N(β̂, Vβ), where
β̂ = (X'Σy⁻¹X)⁻¹ X'Σy⁻¹ y
Vβ = (X'Σy⁻¹X)⁻¹
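A minimal sketch of the known-Σy computation (Python/numpy):

```python
import numpy as np

def gls_posterior(X, y, Sigma_y):
    """Posterior mean and covariance of beta under a flat prior with known Sigma_y."""
    Sigma_inv = np.linalg.inv(Sigma_y)
    V_beta = np.linalg.inv(X.T @ Sigma_inv @ X)
    beta_hat = V_beta @ X.T @ Sigma_inv @ y
    return beta_hat, V_beta      # p(beta | y) = N(beta_hat, V_beta)
```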
Generalized Least Squares Model
• Special case: Σy = Qy σ 2 (with Qy known)
p(β, σ²|y) = p(σ²|y) p(β|y, σ²)
p(σ²|y) = Inv-χ²( n − k, (y − Xβ̂)'Qy⁻¹(y − Xβ̂) / (n − k) )
p(β|y, σ²) = N(β̂, σ²Vβ)
where
β̂ = (X'Qy⁻¹X)⁻¹ X'Qy⁻¹ y
and
Vβ = (X'Qy⁻¹X)⁻¹
• Note: prediction can be harder in this case since must
account for possible correlation between ỹ and y in Qy,ỹ
Generalized Least Squares Model
• General case (σ 2 included inside Σy perhaps with other
parameters also)
– prior distn:
p(β, Σy) = p(Σy) p(β|Σy) ∝ p(Σy)   [flat prior on β given Σy]
– joint posterior distn:
p(β, Σy|y) ∝ p(Σy) N(y|Xβ, Σy)
– factor joint posterior distn:
p(β, Σy|y) = p(Σy|y) p(β|Σy, y)
where p(β|Σy, y) = N(β|β̂, Vβ) with β̂ = (X'Σy⁻¹X)⁻¹ X'Σy⁻¹ y and Vβ = (X'Σy⁻¹X)⁻¹
∗ the hard part here is p(Σy|y):
p(Σy|y) = p(β, Σy|y) / p(β|Σy, y) ∝ [ p(Σy) N(y|Xβ, Σy) / N(β|β̂, Vβ) ] evaluated at β = β̂
= |Vβ|^{1/2} p(Σy) |Σy|^{−1/2} e^{−(1/2)(y − Xβ̂)'Σy⁻¹(y − Xβ̂)}
Prior information
• Suppose y|β, σ 2 ∼ N (Xβ, σ 2 I)
• Conjugate analysis
– conjugate prior distribution
p(β, σ²) = p(σ²) p(β|σ²) = Inv-χ²(σ²|n0, σ0²) × N(β|β0, σ²Σ0)
– posterior distribution
p(β|σ², y) = N(β|β̃, Vβ)
p(σ²|y) = Inv-χ²(σ²|n + n0, φ)
where
β̃ = (Σ0⁻¹ + X'X)⁻¹ (Σ0⁻¹β0 + (X'X)β̂)
Vβ = σ² (Σ0⁻¹ + X'X)⁻¹
φ = (n − k)s² + n0σ0² + (β̂ − β0)'Σ0⁻¹(Σ0⁻¹ + X'X)⁻¹X'X(β̂ − β0)
β̂ = (X'X)⁻¹X'y
s² = (y − Xβ̂)'(y − Xβ̂)/(n − k)
Prior information
• Suppose y|β, σ 2 ∼ N (Xβ, σ 2 I)
• Semi-conjugate analysis
– prior distribution
p(β, σ 2 ) = p(σ 2 )p(β) = Inv-χ2 (σ 2 |n0 , σ02 )×N (β|β0 , Σ0 )
– posterior distribution
p(β|σ², y) = N(β|β̃, Vβ)
p(σ²|y) = p(β, σ²|y) / p(β|σ², y)   (evaluate on a 1-dim grid)
where
β̃ = (Σ0⁻¹ + σ⁻²X'X)⁻¹ (Σ0⁻¹β0 + σ⁻²(X'X)β̂)
Vβ = (Σ0⁻¹ + σ⁻²X'X)⁻¹
New view of prior information
• Consider prior information for a single regression
coefficient βj of the form βj ∼ N (βj0 , σβ2j )
with βj0 and σβ2j known
• Mathematically equivalent to βj0 ∼ N (βj , σβ2j )
• Prior can be viewed as “additional data”
• Can write y* | β, Σ* ∼ N(X*β, Σ*) with
y* = [ y ; βj0 ]    X* = [ X ; J ]    Σ* = [ Σy  0 ; 0  σ²βj ]
where J = (0, . . . , 0, 1, 0, . . . , 0) with the 1 in the jth position
• Posterior distn is p(β, Σ∗ |y) ∝ p(Σy )N (y ∗ |β, Σ∗ )
(last term is product of two normal distns)
• If σ²βj → +∞, the added “data point” has no effect on
inference
• If σ²βj = 0, the added “data point” fixes βj exactly at βj0
New view of prior information
• Same idea works for prior distn for the
whole vector β if β ∼ N (β0 , Σβ )
with β0 , Σβ known
• Treat the prior distribution as k prior “data points”
• Write y* ∼ N(X*β, Σ*) with
y* = [ y ; β0 ]    X* = [ X ; Ik ]    Σ* = [ Σy  0 ; 0  Σβ ]
• Posterior distn is
p(β, Σ*|y) ∝ p(Σy) × N(y*|X*β, Σ*)
where N(y*|X*β, Σ*) = N(y|Xβ, Σy) N(β|β0, Σβ)
• If some of the components of β have infinite
variance (i.e. noninformative prior distributions),
they should be excluded from these added “prior”
data points
• The joint prior distribution for β is proper if all k
components have proper prior distributions;
i.e. rank(Σβ ) = k
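A sketch of the whole-vector augmentation (Python/numpy), reusing the hypothetical gls_posterior helper from the GLS sketch above; β0, Σβ, and Σy are assumed known:

```python
import numpy as np

def augment_with_prior(X, y, Sigma_y, beta0, Sigma_beta):
    """Append the k prior 'data points' beta ~ N(beta0, Sigma_beta) to (y, X, Sigma)."""
    k = X.shape[1]
    y_star = np.concatenate([y, beta0])
    X_star = np.vstack([X, np.eye(k)])
    Sigma_star = np.block([
        [Sigma_y,               np.zeros((len(y), k))],
        [np.zeros((k, len(y))), Sigma_beta],
    ])
    return gls_posterior(X_star, y_star, Sigma_star)   # posterior N(beta_hat, V_beta)
```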
Hierarchical Linear Models
• Motivation - combine hierarchical modeling ideas with
regression framework
• Useful way to handle
– random effects
– units that can be considered at two or more levels
(students in classes in schools)
• General Notation
– Likelihood for n data points
y|β, Σy ∼ N (Xβ, Σy )
(often Σy = σ 2 I)
– Prior distn on J regression coefficients
β|α, Σβ ∼ N (Xβ α, Σβ )
(often Xβ = 1 and Σβ = σβ2 I)
– Hyperprior distribution on K parameters α
α|α0 , Σα ∼ N (α0 , Σα )
with α0 , Σα known (often assume p(α) ∝ 1)
Hierarchical Linear Models
Example: J regression expts
• Model for jth experiment is yj |β j , σj2 ∼ N (Xj β j , σj2 I)
where yj = (y1j , y2j , . . . , ynj j )
• The regressions can be viewed as a single model with
y = [ y1 ; y2 ; . . . ; yJ ]    X = blockdiag(X1, X2, . . . , XJ)    β = [ β1 ; β2 ; . . . ; βJ ]
• Hierarchy involves setting a prior distn for β j ’s, often
β j |α, Σβ ∼ N (α, Σβ )
• Also need hyperpriors, e.g.,
p(α, Σβ ) ∝ 1
σj2 ∼ Inv-χ2 (c, d)
• Implied model is
yj | α, σj², Σβ ∼ N(Xj α, σj² I + Xj Σβ Xj')
• Thus the hierarchy introduces correlation
Hierarchical Linear Models
Other examples
• SAT coaching example (a.k.a. 8 schools)
y | β, σ² ∼ N( I8 β, diag(σ1², . . . , σ8²) )
β | α, σβ² ∼ N( 1α, σβ² I8 )
• Animal breeding
y | β, u, σ² ∼ N( Xβ + Zu, σ² I ),  i.e., mean [X Z] [ β ; u ]
u | σα² ∼ N(0, σα² A)
p(β) ∝ 1
Hierarchical Linear Models
Random effects to introduce correlation
• More about how random effects introduce
correlation by considering two models
• Model 1
– nj obs from group/cluster j
– expect objects in a group to be correlated
– assume yj = (y1j, y2j, . . . , ynj j)' | α, Aj ∼ N(α1, Aj)
where Aj is the nj × nj matrix with σ² on the diagonal and ρσ² off the diagonal
– combine data into single model with correlated observations so that
y = (y1, y2, . . . , yJ)' | α, Σy ∼ N(α1, Σy)
where Σy = blockdiag(A1, A2, . . . , AJ)
Hierarchical Linear Models
Random effects to introduce correlation (cont’d)
• Model 2 – Let
X = blockdiag(1n1, 1n2, . . . , 1nJ)    and    β = [ β1 ; . . . ; βJ ]
and assume
y | β, τ² ∼ N(Xβ, τ² I)
β | α, τβ² ∼ N(α1, τβ² I)
⇒ y | α, τ², τβ² ∼ N( α1, blockdiag(B1, . . . , BJ) )
where Bj is the nj × nj matrix with τ² + τβ² on the diagonal and τβ² off the diagonal
• Model 1 = Model 2 if we make the identification
σ² = τ² + τβ²   and   ρ = τβ² / (τ² + τβ²)
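A small numerical check of this identification (Python/numpy; nj, τ², τβ² below are illustrative values):

```python
import numpy as np

n_j, tau2, tau2_beta = 4, 1.5, 0.5          # illustrative values (assumption)

# Model 2: marginal covariance of one group after integrating out beta_j
B_j = tau2 * np.eye(n_j) + tau2_beta * np.ones((n_j, n_j))

# Model 1: compound symmetry with sigma^2 = tau2 + tau2_beta, rho = tau2_beta/(tau2 + tau2_beta)
sigma2 = tau2 + tau2_beta
rho = tau2_beta / (tau2 + tau2_beta)
A_j = sigma2 * ((1 - rho) * np.eye(n_j) + rho * np.ones((n_j, n_j)))

print(np.allclose(A_j, B_j))                 # True
```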
Hierarchical Linear Models
Computation
• Recall
“likelihood”
y|β, Σy ∼ N (Xβ, Σy )
“population distribution”
β|α, Σβ ∼ N (Xβ α, Σβ )
“hyperprior distribution”
α|α0 , Σα ∼ N (α0 , Σα )
• Interpretation as a single linear regression
y∗ |X∗ , γ, Σ∗ ∼ N (X∗ γ, Σ∗ )
where
y* = [ y ; 0 ; α0 ]    X* = [ X  0 ; I  −Xβ ; 0  I ]    Σ* = blockdiag(Σy, Σβ, Σα)    γ = [ β ; α ]
p(α, β, Σy, Σβ | y) ∝ N(y|Xβ, Σy) N(β|Xβ α, Σβ) N(α|α0, Σα)
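A sketch of assembling the augmented regression (Python/numpy), assuming Σy, Σβ, Σα, Xβ, and α0 are all given; it returns the mean and covariance of the normal conditional posterior p(β, α | Σy, Σβ, y) described on the next slide:

```python
import numpy as np
from scipy.linalg import block_diag

def hierarchical_as_regression(y, X, X_beta, alpha0, Sigma_y, Sigma_beta, Sigma_alpha):
    """Stack the three model stages into one regression y* = X* gamma + error, gamma = (beta, alpha)."""
    J, K = X.shape[1], X_beta.shape[1]
    y_star = np.concatenate([y, np.zeros(J), alpha0])
    X_star = np.block([
        [X,                np.zeros((len(y), K))],
        [np.eye(J),        -X_beta],
        [np.zeros((K, J)), np.eye(K)],
    ])
    Sigma_star = block_diag(Sigma_y, Sigma_beta, Sigma_alpha)
    # conditional posterior for gamma = (beta, alpha): normal, by the GLS formula
    S_inv = np.linalg.inv(Sigma_star)
    V = np.linalg.inv(X_star.T @ S_inv @ X_star)
    return V @ X_star.T @ S_inv @ y_star, V
```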
Hierarchical Linear Models
Computation (cont’d)
• Interpretation on previous slide builds on the fact that
the two sides of each equality below are the same distn
statement
N (α|α0 , Σα ) = N (α0 |α, Σα )
N (β|Xβ α, Σβ ) = N (0|β − Xβ α, Σβ )
• Drawing samples from the posterior distribution
– p(α, β|Σy , Σβ , y) is the posterior distn for a linear
regression model with known error variance matrix
which is
N( (X*'Σ*⁻¹X*)⁻¹ X*'Σ*⁻¹ y*,  (X*'Σ*⁻¹X*)⁻¹ )
– need p(Σy , Σβ |y) to complete the joint posterior
distn or p(Σy , Σβ |α, β, y) for Gibbs sampling
– hard to describe this last step in general
because of the many possible models
• Presidential Election example
Study design in Bayesian analysis
• Naive view: data collection doesn’t matter
• Example where data collection doesn’t matter
– observe 9 successes in 24 trials
design 1: 24 Bernoulli trials
design 2:
sample until you get 9 successes
– p(θ|y) ∝ θ^9 (1 − θ)^15 p(θ) is the same for both designs (see the sketch at the end of this slide)
• Example where data collection does matter
– observe 9 successes, unknown number of trials
design 1:
24 Bernoulli trials
design 2:
wait for 100 failures
– p(θ|y) surely depends on design
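The first example can be checked numerically on a grid: with the number of trials known, the binomial and negative-binomial (stop-at-9-successes) likelihoods are proportional as functions of θ, so the posteriors coincide. A sketch (Python/numpy/scipy; the flat prior is just for illustration):

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)                      # flat prior for illustration (assumption)

# design 1: 24 Bernoulli trials, 9 successes observed
like_binom = stats.binom.pmf(9, 24, theta)
# design 2: sample until 9 successes, which took 24 trials (15 failures)
like_negbin = stats.nbinom.pmf(15, 9, theta)     # failures before the 9th success

post1 = prior * like_binom;  post1 /= post1.sum()
post2 = prior * like_negbin; post2 /= post2.sum()
print(np.allclose(post1, post2))                 # True: both prop. to theta^9 (1-theta)^15
```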
Study design in Bayesian analysis
• Study design is important
– pattern of what is observed can be informative
– ignorable designs (studies where design doesn’t affect
inference) are likely to be less sensitive to
assumptions.
note: randomization is useful to Bayesians as a tool
for producing ignorable designs
– data one could have observed can help us to build
models (causality)
Study design in Bayesian analysis
General framework
• View the world in terms of observed data and
complete data, where complete data includes
observed and “missing” values
– Sampling: observed data = values for the n units in the sample; complete data = values for all N units in the population
– Experiment: observed data = outcomes under the observed treatment for each unit treated; complete data = outcomes under all treatments for all units
– Rounded data: observed data = rounded observations; complete data = precise values of all observations
– Unintentional missing data: observed data = observed data values; complete data = complete data, both observed and missing
Formal models for data collection
Notation
• Data
y = (y1 , . . . , yN )
where yi = (yi1 , yi2 , . . . , yin ) = data for the ith unit.
• Observed values
I = (I1, . . . , IN)
where Iij = 1 if yij is observed and Iij = 0 otherwise,
with yij the jth variable for the ith unit. Let
obs = {(i, j) : Iij = 1} index the observed
components of y and mis = {(i, j) : Iij = 0} index the
unobserved components of y. Then y can be written as
y = (yobs, ymis).
Formal models for data collection
• Stability assumption: the measurement process (I) doesn’t
affect the data (y) (this assumption fails if, for example,
there are carryover effects or treatments leak out through the soil)
• Fully observed covariates x We use the notation x for
variables that are fully observed for all units. We might
want to include x in an analysis for the following
reasons:
– we may be interested in some aspect of the joint
distribution of (x, y)
– we may be interested in some of the distribution of
y, but x provides information about y
– even if we are only interested in y, we must include
x in the analysis if x is involved in the data
collection mechanism
Formal models for data collection
• Complete-data model
p(y, I|x, θ, φ) = p(y|x, θ)p(I|x, y, φ)
– p(y|x, θ) models the underlying data without
reference to the data collection process
– The estimands of primary interest are
∗ functions of the complete data y
(finite-population estimands)
∗ functions of the parameters θ
(superpopulation estimands)
– The parameters φ that index the missingness are not
generally of scientific interest
– θ and φ can be related but this is rare
Formal models for data collection
• We don’t observe all of y
• Observed-data likelihood
p(yobs, I|x, θ, φ) = ∫ p(y, I|x, θ, φ) dymis = ∫ p(y|x, θ) p(I|x, y, φ) dymis
• Posterior distributions
– joint posterior distribution of (θ, φ)
p(θ, φ|x, yobs, I) ∝ p(θ, φ|x) p(yobs, I|x, θ, φ)
= p(θ, φ|x) ∫ p(y, I|x, θ, φ) dymis
= p(θ, φ|x) ∫ p(y|x, θ) p(I|x, y, φ) dymis
– marginal posterior distribution of θ
p(θ|x, yobs, I) ∝ p(θ|x) ∫∫ p(φ|x, θ) p(y|x, θ) p(I|x, y, φ) dymis dφ
Note: We don’t have to perform the integrals above; as
usual we can simulate treating ymis , θ, and φ as unknowns
Ignorability
It is tempting to ignore data collection issues I and
focus on
p(θ|x, yobs) ∝ p(θ|x) p(yobs|x, θ) = p(θ|x) ∫ p(y|x, θ) dymis
When the missing data pattern supplies no
information; that is, when
p(θ|x, yobs ) = p(θ|x, yobs , I)
we say that the study design or data collection
mechanism is ignorable (with respect to the
proposed model)
Ignorability
When do we get ignorability?
First, some terminology
– Missing at random (MAR)
p(I|x, y, φ) = p(I|x, yobs , φ)
∗ whether a value is missing doesn’t depend on
value it would have had
∗ the state of being missing is allowed to
depend on observed values but not on
unobserved values
– Missing completely at random (MCAR)
p(I|x, y, φ) = p(I|φ)
– Distinct parameters
p(φ|θ, x) = p(φ|x)
Ignorability
• If MAR and distinct parameters, then
p(θ|x, yobs, I) ∝ p(θ|x) ∫∫ p(φ|x, θ) p(y|x, θ) p(I|x, y, φ) dymis dφ
= p(θ|x) ∫ p(y|x, θ) dymis × ∫ p(φ|x) p(I|x, yobs, φ) dφ   [last factor contains no info about θ]
∝ p(θ|x) p(yobs|x, θ)
∝ p(θ|yobs, x)
and we get ignorability
Formal models for data collection
Example 1
Weigh an object 100 times: yi | θ ∼ N(θ, 1). The scale works with
probability φ: p(Ii = 1|y, φ) = φ
– Complete data
p(y, I|θ, φ) = ∏_{i=1}^{100} N(yi|θ, 1) × ∏_{i=1}^{100} φ^{Ii} (1 − φ)^{1−Ii}
– Observed data
p(yobs, I|θ, φ) = ∫ ∏_{i=1}^{100} N(yi|θ, 1) ∏_{i=1}^{100} φ^{Ii} (1 − φ)^{1−Ii} dymis
= φ^{Σi Ii} (1 − φ)^{100 − Σi Ii} ∏_{i: Ii=1} N(yi|θ, 1)
since ∫ N(yi|θ, 1) dyi = 1 for each i with Ii = 0
– The data collection mechanism is ignorable
Formal models for data collection
Example 2
Weigh an object 100 times: yi | θ ∼ N(θ, 1). The scale fails if the weight
exceeds φ; φ unknown
– Complete data
p(y, I|θ, φ) = ∏_{i=1}^{100} N(yi|θ, 1) × ∏_{i=1}^{100} χ(Ai)
where χ(A) is the indicator function of the event A and
Ai = ({Ii = 1} ∩ {yi < φ}) ∪ ({Ii = 0} ∩ {yi > φ})
– Observed data
p(yobs, I|θ, φ) = ∫ p(y, I|θ, φ) dymis
= ∏_{i: Ii=1} N(yi|θ, 1) χ({yi < φ}) × ∏_{i: Ii=0} ∫ N(yi|θ, 1) χ({yi > φ}) dyi
where each integral in the last product equals Φ(θ − φ) = P(yi > φ)
– This censored data collection mechanism is not ignorable
Bayesian Statistics - Summary
• Model building
– basic probability distns as building blocks
– hierarchical structure
– condition on covariates to get ignorable designs
• Posterior inference
– the power of simulation
– flexible inference for any quantity of interest
– use of decision theory for formal problem-solving
• Model checking
– model checking/model selection
– importance of checking with all available info
– sensitivity analysis
Bayesian Statistics - Pro/Con
• Advantages
– account for uncertainty
– combine information from multiple sources
– probability is the language of uncertainty
– usually straightforward how to proceed with model
development
– flexible inference and model extensions
• Disadvantages
– need for prior distn (importance of sensitivity
analysis)
– always requires a formal model (except for Bayesian
nonparametrics)
– high dimensional nuisance parameters
(e.g., in survival analysis)
– communication with practitioners
Bayesian Statistics - Final thoughts
• There are differences between Bayesian methods
and traditional procedures
• Both will give reasonable data analyses
in good hands
• Bayesians can be interested in frequency
properties of procedures
• No need to declare as a Bayesian or Frequentist now (or
ever)
• Goal of course has been exposure to the
fundamental concepts and methods of
Bayesian data analysis