Bayesian Essentials and Bayesian Regression
Distribution Theory 101
Marginal and Conditional Distributions:
pY(y) = ∫ pX,Y(x,y) dx = ∫ pY|X(y|x) pX(x) dx

Example: suppose pX,Y(x,y) = 2 on 0 < y < x < 1. Then

pX(x) = ∫ pX,Y(x,y) dy = ∫₀ˣ 2 dy = 2y |₀ˣ = 2x

pY|X(y|x) = pX,Y(x,y) / pX(x) = 2/(2x) = 1/x,  y ∈ (0, x)   -- uniform on (0, x)
Simulating from Joint
pX,Y(x,y) = pY|X(y|x) pX(x)

To draw from the joint:
i. draw from the marginal on X
ii. condition on this draw, and draw from the conditional of Y|X

Example:  X ~ χ²_ν ;  Y|X ~ N(μ, νσ²/X)  ⇒  Y ~ MSt(ν, μ, σ)  (marginally, Y is Student t)

implementation:  Y|X = μ + Z/√(X/ν),  Z ~ N(0, σ²)   (see the R sketch below)
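A minimal R sketch of this composition approach; the values of ν, μ, and σ are illustrative, not from the slides:

# draw X from its chi-square marginal, then Y|X from the conditional normal
set.seed(1)
nu <- 5; mu <- 0; sigma <- 1
R <- 10000
X <- rchisq(R, df = nu)                          # step i: marginal on X
Y <- mu + rnorm(R, sd = sigma)/sqrt(X/nu)        # step ii: conditional of Y|X
# the draws of Y are marginally Student t with nu degrees of freedom (scaled by sigma)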
The Goal of Inference
Make inferences about unknown quantities using
available information.
Inference -- make probability statements about unknowns: parameters, functions of parameters, states or latent variables, "future" outcomes, outcomes conditional on an action.
Information --
data-based
non-data-based: theories of behavior; "subjective views" (there is an underlying structure; parameters are finite or lie in some range)
Data Aspects of Marketing Problems
Ex: Conjoint Survey
>500 respondents rank, rate, or choose among product
configurations.
Small Amount of Information per respondent
Response Variable Discrete
Ex: Retail Scanning Data
very large number of products (> 10,000 SKUs)
large number of geographical units (markets, zones,
stores)
limited variation in some marketing mix vars
Must make plausible predictions for decision making!
The likelihood principle
ℓ(θ) ∝ p(D|θ)

Note: any function proportional to the data density can be called the likelihood.

LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns.

This is in contrast to modern econometric methods such as GMM, which do not obey the likelihood principle (e.g., regression for a 0-1 binary dependent variable).

The LP implies that analysis is done conditional on the data, in contrast to the frequentist approach where the sampling distribution is determined prior to observing the data.
Bayes theorem

p(θ|D) = p(D, θ)/p(D) = p(D|θ) p(θ)/p(D)

p(θ|D) ∝ p(D|θ) p(θ)

Posterior ∝ "Likelihood" × Prior

Modern Bayesian computing -- simulation methods for generating draws from the posterior distribution p(θ|D).
Summarizing the posterior
Output from Bayesian inference: p(θ|D), a high-dimensional distribution.

Summarize this object via simulation (a small R sketch follows):
marginal distributions of θ, h(θ)
don't just compute E[θ|D], Var(θ|D)

Contrast with sampling theory:
point est/standard error
summary of an irrelevant dist
bad summary (normal)
limitations of asymptotics
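A minimal sketch of simulation-based summaries. The draws and the function h below are placeholders for illustration; in practice the draws come from a posterior simulator.

theta_draws <- rnorm(10000, mean = 1, sd = 0.5)     # stand-in posterior draws of theta
h <- function(theta) exp(theta)                     # any function of the parameter
h_draws <- h(theta_draws)
# report the whole marginal distribution of h(theta), not just a point estimate
quantile(h_draws, probs = c(0.025, 0.5, 0.975))
hist(h_draws, breaks = 50, main = "posterior of h(theta)")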
Prediction
See D, compute the "predictive distribution":

p(D̃|D) = ∫ p(D̃|θ) p(θ|D) dθ     ( ≠ p(D̃|θ̂) !!! )

This assumes p(D̃, D|θ) = p(D̃|θ) p(D|θ).
Decision theory
Loss: L(a,) where a=action; =state of nature
Bayesian decision theory:


min L(a)  E|D L(a, )   L(a, )p( | D)d
a
Estimation problem is a special case:
 
  
 
ˆ ) ; typically, L ˆ, ˆ  ˆ   ' A ˆ  
min
L(

ˆ


10
Is Sampling Theory Useful?
Sampling theory: the performance of an estimator across
multiple, hypothetical data sets.
Sampling experiments:
i. sample from p(D|θ)
ii. sample from another model
iii. various asymptotic “experiments” – sequences of
data sets
Not useful for inference in applied work (only one data set).
Useful to assess the choice of inference procedure (a small sampling-experiment sketch follows).
Need one look further than Bayes estimators?
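A minimal sampling-experiment sketch, using a made-up normal-mean problem (the model, prior, and values here are illustrative, not from the slides): simulate many data sets from p(D|θ) and compare the frequentist risk of the MLE with that of a Bayes estimator.

set.seed(2)
theta <- 1; n <- 10; nrep <- 5000
mle <- numeric(nrep); bayes <- numeric(nrep)
for (r in 1:nrep) {
  y <- rnorm(n, mean = theta, sd = 1)
  mle[r]   <- mean(y)                  # maximum likelihood estimator
  bayes[r] <- n*mean(y)/(n + 1)        # posterior mean under a N(0,1) prior, sigma known
}
c(mse_mle = mean((mle - theta)^2), mse_bayes = mean((bayes - theta)^2))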
Sampling properties of Bayes estimators


rˆ ()  ED| L(ˆ, )   L ˆ (D),  p(D | )dD
An estimator is admissible if there does not exist
another estimator with lower risk for all values of . The
Bayes estimator minimizes expected (average) risk
which implies they are admissible:
 
 
E r     E ED| L ˆ,     ED E|D L ˆ,   






The Bayes estimator does the best for every D.
Therefore, it must work as at least as well as any other
estimator.
12
Identification


R_θ = { θ : p(Data|θ) = k }

If dim(R_θ) ≥ 1, then we have an "identification" problem. That is, there is a set of observationally equivalent values of the model parameters. The likelihood is "flat" or constant over R_θ.

Practical implications:
the likelihood can have flats or ridges
an issue for both the Bayesian (is it?) and the non-Bayesian
Identification
Is this a problem?
no, I have a proper prior
no, I don’t maximize
“Classical” solution:
impose enough constraints so that the constrained parameter space is identified.
“Bayesian” solution:
use a proper prior and recognize that some functions of θ are determined entirely by the prior.
Identification

Define τ(θ) = ( τ₁(θ), τ₂(θ) )  such that  dim(τ₁) = dim(R_θ)  and  p(τ₁|D) = p(τ₁).

τ₂ is the identified parameter. Report only on the posterior distribution of this function of θ.

Check p(τ₂), the prior induced by p(θ).
Bayes Inference: Summary
Bayesian Inference delivers an integrated approach to:
Inference – including “estimation” and “testing”
Prediction – with a full accounting for uncertainty
Decision – with likelihood and loss (these are
distinct!)
Bayesian Inference is conditional on available info.
The right answer to the right question.
Bayes estimators are admissible. All admissible
estimators are Bayes (Complete Class Thm). Which
Bayes estimator?
Bayes/Classical Estimators
Does MLE obey LP? YES
Does the theory of the maximum likelihood estimator obey the LP? No! Who cares about an infinite amount of irrelevant data!
Is there any relationship?
resort to asymptotics
what is Bayesian asymptotics? does this even make sense?
Investigate the asymptotic behavior of the posterior.
Bayes/Classical Estimators

As n → ∞, the prior washes out -- it is locally uniform!!! Bayes is consistent unless you have a dogmatic prior.

p(θ|D) ≈ N( θ̂_MLE , [Ĥ_{θ̂_MLE}]⁻¹ )
Benefits/Costs of Bayes Inf
Benefits:
finite-sample answer to the right question
full accounting for uncertainty
integrated approach to inference/decision

"Costs":
computational (true any more? < classical!!)
prior (cost or benefit?), esp. with many parms (hierarchical/non-parametric problems)
Bayesian Computations
Before simulation methods, Bayesians used posterior
expectations of various functions as summary of
posterior.
E D h      h  
p D   p   
p D 
d
If p(θ|D) is in a convenient form (e.g. normal), then I
might be able to compute this for some h. Via iid
simulation for all h.
note : p D   p D   p   d
20
Conjugate Families
Models with convenient analytic properties almost
invariably come from conjugate families.
Why do I care now?
- conjugate models are used as building blocks
- build intuition re functions of Bayesian inference
Definition:
A prior is conjugate to a likelihood if the posterior is in the same class of distributions as the prior.
Basically, conjugate priors are like the posterior from some imaginary dataset with a diffuse prior.
Beta-Binomial model
yᵢ ~ Bern(θ)

ℓ(θ) = ∏ᵢ₌₁ⁿ θ^yᵢ (1 − θ)^(1−yᵢ) = θ^y (1 − θ)^(n−y),    where y = Σᵢ₌₁ⁿ yᵢ

p(θ|y) = ?    Need a prior!
Beta distribution
Beta(α, β):   p(θ) ∝ θ^(α−1) (1 − θ)^(β−1),     E[θ] = α/(α + β)

[Figure: Beta densities for (a=2, b=4), (a=3, b=3), and (a=4, b=2) over θ ∈ [0, 1].]
Posterior
p( | D)  p(D | )p()
  y (1  )ny    1(1  )1 
 y1(1  )ny1
~ Beta(  y,  n  y)
24
Prediction

Pr(ỹ = 1 | y) = ∫ Pr(ỹ = 1 | θ, y) p(θ|y) dθ = ∫₀¹ θ p(θ|y) dθ = E[θ|y]

(a small R illustration follows)
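A minimal sketch of the beta-binomial posterior and prediction; the prior values a, b and the data y, n are made up for illustration.

a <- 2; b <- 2                 # hypothetical Beta(a, b) prior
n <- 20; y <- 14               # hypothetical data: y successes in n trials
theta_draws <- rbeta(10000, a + y, b + n - y)      # posterior is Beta(a + y, b + n - y)
quantile(theta_draws, c(0.025, 0.5, 0.975))
(a + y)/(a + b + n)            # Pr(next outcome = 1 | y) = posterior mean of theta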
Regression model
εᵢ ~ Normal(0, σ²),   yᵢ = xᵢ'β + εᵢ

p(yᵢ|β, σ²) = (2πσ²)^(−1/2) exp{ −(yᵢ − xᵢ'β)²/(2σ²) }

y | X, β, σ² ~ N(Xβ, σ²I)

Is this model complete? For non-experimental data, don't we need a model for the joint distribution of y and x?
Regression model

p(x, y) = p(x|Ψ) p(y|x, β, σ²)

This rules out x = f(β)!!!  (simultaneous systems are not written this way!)

If Ψ is a priori independent of (β, σ), then

p(Ψ, β, σ²|y, X) ∝ [ p(Ψ) ∏ᵢ p(xᵢ|Ψ) ] × [ p(β, σ²) ∏ᵢ p(yᵢ|xᵢ, β, σ²) ]

two separate analyses
Conjugate Prior
What is the conjugate prior? It comes from the form of the likelihood function. Here we condition on X.

ε ~ N(0, σ²Iₙ),   Jacobian: |∂ε/∂y| = 1

ℓ(β, σ²) ∝ p(y|X, β, σ²) ∝ (σ²)^(−n/2) exp{ −(1/(2σ²)) (y − Xβ)'(y − Xβ) }

The quadratic form suggests a normal prior.

Let's complete the square on β, or rewrite by projecting y on X (the column space of X).
Geometry of regression
[Figure: y projected onto the column space of (x₁, x₂); ŷ = β̂₁x₁ + β̂₂x₂ and residual e = y − ŷ.]

"Least Squares":  min (e'e)   ⇒   e'ŷ = 0
Traditional regression

e'ŷ = ŷ'e = 0
(Xβ̂)'(y − Xβ̂) = 0   ⇒   β̂ = (X'X)⁻¹X'y

No one ever computes a matrix inverse directly. Two numerically stable methods (see the sketch below):
QR decomposition of X
Cholesky root of X'X, and compute the inverse using the root

Non-Bayesians have to worry about singularity or near-singularity of X'X. We don't! more later
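A minimal sketch of both routes, with simulated data for illustration; neither forms (X'X)⁻¹ by a general matrix inverse.

set.seed(3)
n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n*(k - 1)), ncol = k - 1))
y <- X %*% c(1, 2, -1) + rnorm(n)

bhat_qr <- qr.coef(qr(X), y)                      # route 1: QR decomposition of X
XpXinv  <- chol2inv(chol(crossprod(X)))           # route 2: inverse of X'X from its Cholesky root
bhat_ch <- XpXinv %*% crossprod(X, y)
cbind(bhat_qr, bhat_ch)                           # the two routes agree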
Cholesky Roots
In Bayesian computations, the fundamental matrix operation is the Cholesky root: chol() in R.
The Cholesky root is the generalization of the square root applied to positive definite matrices.
As Bayesians with proper priors, we don't ever have to worry about singular matrices!

Σ = U'U,  Σ p.d.s.   ⇒   |Σ| = ( ∏ᵢ uᵢᵢ )²

U is upper triangular with positive diagonal elements. U⁻¹ is easy to compute by recursively solving TU = I for T; backsolve() in R (see the sketch below).
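A small sketch of the root and its inverse, using a made-up 2 x 2 p.d.s. matrix:

Sigma <- matrix(c(2, 0.5, 0.5, 1), ncol = 2)
U <- chol(Sigma)                           # upper triangular, Sigma = U'U
Uinv <- backsolve(U, diag(nrow(Sigma)))    # solves U %*% Uinv = I
prod(diag(U))^2                            # equals det(Sigma)
Uinv %*% t(Uinv)                           # equals solve(Sigma)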
Cholesky Roots
Cholesky roots can be useful to simulate from the Multivariate Normal distribution.

Σ = U'U ;  z ~ N(0, I)
y = U'z ~ N(0, Σ)   since   E[U'zz'U] = U'IU = Σ

To simulate a matrix of draws from the MVN (each row is a separate draw) in R:
Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)
Y=t(t(Y)+mu)
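A usage sketch with made-up values for mu and Sigma:

n <- 5000; k <- 2
mu <- c(1, -1)
Sigma <- matrix(c(1, 0.8, 0.8, 2), ncol = k)
Y <- matrix(rnorm(n*k), ncol = k) %*% chol(Sigma)   # each row ~ N(0, Sigma)
Y <- t(t(Y) + mu)                                   # add mu to every row
colMeans(Y)    # approx mu
var(Y)         # approx Sigma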
Regression with R
data.txt:
UNIT Y X1 X2
A 1 0.23815 0.43730
A 2 0.55508 0.47938
A 3 3.03399 -2.17571
A 4 -1.49488 1.66929
B 10 -1.74019 0.35368
B 9 1.40533 -1.26120
B 8 0.15628 -0.27751
B 7 -0.93869 -0.04410
B 6 -3.06566 0.14486
df=read.table("data.txt",header=TRUE)
myreg=function(y,X){
#
# purpose: compute lsq regression
#
# arguments:
# y -- vector of dep var
# X -- array of indep vars
#
# output:
# list containing lsq coef and std errors
#
XpXinv=chol2inv(chol(crossprod(X)))
bhat=XpXinv%*%crossprod(X,y)
res=as.vector(y-X%*%bhat)
ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))
se=sqrt(diag(ssq*XpXinv))
list(b=bhat,std_errors=se)
}
Regression likelihood
(y − Xβ)'(y − Xβ) = (Xβ̂ − Xβ)'(Xβ̂ − Xβ) + (y − Xβ̂)'(y − Xβ̂) + 2(y − Xβ̂)'(Xβ̂ − Xβ)
                  = νs² + (β̂ − β)'X'X(β̂ − β)

where νs² = SSE = (y − Xβ̂)'(y − Xβ̂),  ν = n − k
(the cross term vanishes because the least-squares residual is orthogonal to the columns of X)

p(y|X, β, σ²) ∝ (σ²)^(−k/2) exp{ −(1/(2σ²)) (β − β̂)'X'X(β − β̂) } × (σ²)^(−ν/2) exp{ −νs²/(2σ²) }
Regression likelihood


p(y|X, β, σ²) ∝ normal × ?

? is a density of the form p(σ²) ∝ (σ²)^(−a) e^(−b/σ²)

This is called an inverted gamma distribution. It can also be related to the inverse of a Chi-squared distribution.

Note that the conjugate prior suggested by the form of the likelihood has a prior on β which depends on σ.
Bayesian Regression
Prior:
p(β, σ²) = p(β|σ²) p(σ²)

p(β|σ²) ∝ (σ²)^(−k/2) exp{ −(1/(2σ²)) (β − β̄)'A(β − β̄) }

p(σ²) ∝ (σ²)^(−(ν₀/2 + 1)) exp{ −ν₀s₀²/(2σ²) }

(Interpretation: as if from another dataset.)

Inverted Chi-Square:   σ² ~ ν₀s₀²/χ²_{ν₀}

Draw from the prior?  (a sketch follows)
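A minimal sketch of drawing from this natural conjugate prior, with hypothetical hyperparameter values:

k <- 2
betabar <- rep(0, k)          # prior mean of beta
A <- 0.01*diag(k)             # prior precision, so Var(beta|sigmasq) = sigmasq*A^-1
nu0 <- 3; s0sq <- 1           # prior on sigmasq

sigmasq <- nu0*s0sq/rchisq(1, nu0)                             # sigmasq ~ nu0*s0sq/chisq_nu0
beta <- betabar + sqrt(sigmasq)*backsolve(chol(A), rnorm(k))   # beta|sigmasq ~ N(betabar, sigmasq*A^-1)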
Posterior
p(, 2 |D)  (, 2 )p( | 2 )p(2 )
 1

 (2 )n / 2 exp  2 (y  X)'(y  X)
 2

2 k / 2
 ( )
 1

exp  2 (   )' A(   )
 2



 0  1
2  2

 ( )
 0s02 
exp 
2 
2



37
Combining quadratic forms
(y − Xβ)'(y − Xβ) + (β − β̄)'A(β − β̄)
   = (y − Xβ)'(y − Xβ) + (β − β̄)'U'U(β − β̄)      where A = U'U
   = (v − Wβ)'(v − Wβ)

with   v = [ y ; Uβ̄ ]   and   W = [ X ; U ]   (stacked)

(v − Wβ)'(v − Wβ) = ns² + (β − β̃)'W'W(β − β̃)

β̃ = (W'W)⁻¹W'v = (X'X + A)⁻¹(X'Xβ̂ + Aβ̄)

ns² = (v − Wβ̃)'(v − Wβ̃) = (y − Xβ̃)'(y − Xβ̃) + (β̃ − β̄)'A(β̃ − β̄)

(a small numerical check follows)
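A small numerical check of the stacked-regression identity; y, X, A, betabar, and the evaluation point beta are all made up:

set.seed(4)
n <- 20; k <- 2
X <- cbind(1, rnorm(n)); y <- rnorm(n)
A <- diag(c(2, 3)); betabar <- c(0.5, -0.5)
beta <- c(1, 1)                                   # arbitrary evaluation point

U <- chol(A)                                      # A = U'U
v <- c(y, U %*% betabar)
W <- rbind(X, U)
lhs <- sum((y - X %*% beta)^2) + t(beta - betabar) %*% A %*% (beta - betabar)
rhs <- sum((v - W %*% beta)^2)
c(lhs, rhs)                                       # the two quantities agree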
Posterior

p(β, σ²|D) ∝ (σ²)^(−k/2) exp{ −(1/(2σ²)) (β − β̃)'(X'X + A)(β − β̃) }
            × (σ²)^(−((n + ν₀)/2 + 1)) exp{ −(ν₀s₀² + ns²)/(2σ²) }

[β|σ²] ~ N( β̃, σ²(X'X + A)⁻¹ )

[σ²] ~ ν₁s₁²/χ²_{ν₁}    with   ν₁ = ν₀ + n   and   s₁² = (ν₀s₀² + ns²)/(ν₀ + n)
IID Simulations
Scheme:  [y|X, β, σ²] [β|σ²] [σ²]
1) Draw [σ² | y, X]
2) Draw [β | σ², y, X]
3) Repeat
IID Simulator, cont.

1)  [σ² | y, X] ~ ν₁s₁²/χ²_{ν₁}

2)  [β | y, X, σ²] ~ N( β̃, σ²(X'X + A)⁻¹ )

β̃ = (X'X + A)⁻¹(X'Xβ̂ + Aβ̄)

note:  ν ~ N(0, I) ;  β = β̃ + σU'ν ~ N( β̃, σ²U'U ) = N( β̃, σ²(X'X + A)⁻¹ ),   where U'U = (X'X + A)⁻¹
Bayes Estimator
The Bayes Estimator is the posterior mean of β:

E[β|D] = E_{σ²|D}[ E[β|D, σ²] ] = E_{σ²|D}[ β̃ ] = β̃

The marginal posterior on β is a multivariate Student t. Who cares?
Shrinkage and Conjugate Priors
The Bayes Estimator is the posterior mean of β. This is a "shrinkage" estimator:

β̃ = (X'X + A)⁻¹(X'Xβ̂ + Aβ̄)   shrinks β̂ toward β̄

as n → ∞,  β̃ → β̂  (Why? X'X is of order n).

Var(β|σ², D) = σ²(X'X + A)⁻¹  ≤  σ²A⁻¹  or  σ²(X'X)⁻¹

Is this reasonable?  (see the small illustration below)
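A small illustration of the shrinkage, with simulated data and arbitrary prior settings:

set.seed(5)
n <- 50; k <- 2
X <- cbind(1, rnorm(n))
y <- X %*% c(2, 1) + rnorm(n)
betabar <- c(0, 0)
bhat <- chol2inv(chol(crossprod(X))) %*% crossprod(X, y)      # least squares
shrink <- function(A) solve(crossprod(X) + A, crossprod(X) %*% bhat + A %*% betabar)
cbind(bhat, shrink(100*diag(k)), shrink(0.01*diag(k)))        # columns: lsq, tight prior, diffuse prior
# a tight prior (large A) pulls the estimate toward betabar; a diffuse prior barely moves it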
Assessing Prior Hyperparameters
β̄, A, ν₀, s₀²

These determine the prior location and spread for both the coefficients and the error variance.

It has become customary to assess a "diffuse" prior:

β̄ = 0
"small" value of A, e.g. A = .01 I_k   ⇒   Var(β|σ²) = 100 σ² I_k
ν₀ "small", e.g. 3
s₀² = 1

This can be problematic; var(y) might be a better choice.
Improper or “non-informative” priors
Classic “non-informative” prior (improper):

p(β, σ²) = p(β) p(σ²) ∝ 1 × σ⁻²

Is this “non-informative”?
Of course not: it says that β is large with high prior “probability”.
Is this wise computationally?
No, I have to worry about singularity in X'X.
Is this a good procedure?
No, it is not admissible. Shrinkage is good!
runireg
runireg=
function(Data,Prior,Mcmc){
#
# purpose:
#   draw from posterior for a univariate regression model with natural conjugate prior
#
# arguments:
#   Data  -- list of data: y, X
#   Prior -- list of prior hyperparameters:
#     betabar, A   -- prior mean, prior precision
#     nu, ssq      -- prior on sigmasq
#   Mcmc  -- list of MCMC parms:
#     R    -- number of draws
#     keep -- thinning parameter
#
# output:
#   list of beta, sigmasq draws
#   beta is k x 1 vector of coefficients
#
# model:
#   y = X beta + e,  var(e_i) = sigmasq
# priors:
#   beta | sigmasq ~ N(betabar, sigmasq*A^-1)
#   sigmasq ~ (nu*ssq)/chisq_nu
#
# unpack arguments (as documented above)
y=Data$y; X=Data$X; n=length(y); k=ncol(X)
betabar=Prior$betabar; A=Prior$A; nu=Prior$nu; ssq=Prior$ssq
#
RA=chol(A)
W=rbind(X,RA)
z=c(y,as.vector(RA%*%betabar))
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R ; (W'W)^-1 = IR%*%t(IR) -- IR is the inverse of the upper Cholesky root
btilde=crossprod(t(IR))%*%crossprod(W,z)
res=z-W%*%btilde
s=t(res)%*%res
#
# first draw sigmasq
#
sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
#
# now draw beta given sigmasq
#
beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k)
list(beta=beta,sigmasq=sigmasq)
}
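A usage sketch with simulated data and illustrative hyperparameters, looping over the single-draw function above to accumulate R draws (producing trace plots like those on the next slides):

set.seed(6)
n <- 200; k <- 2
X <- cbind(1, rnorm(n))
y <- as.vector(X %*% c(1, 2) + 0.5*rnorm(n))
Data  <- list(y=y, X=X)
Prior <- list(betabar=rep(0,k), A=0.01*diag(k), nu=3, ssq=1)
Mcmc  <- list(R=2000, keep=1)

out <- list(betadraw=matrix(0, Mcmc$R, k), sigmasqdraw=numeric(Mcmc$R))
for (r in 1:Mcmc$R) {
  draw <- runireg(Data, Prior, Mcmc)
  out$betadraw[r,]   <- draw$beta
  out$sigmasqdraw[r] <- draw$sigmasq
}
plot(out$betadraw[,2], type="l")    # trace plot of one coefficient
plot(out$sigmasqdraw, type="l")     # trace plot of sigmasq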
[Figures: trace plots of out$betadraw (roughly 1.0 to 2.5) and out$sigmasqdraw (roughly 0.20 to 0.35) over 2000 draws.]
Multivariate Regression
Y = XB + E,   where   Y = [y₁, …, y_c, …, y_m],   B = [β₁, …, β_c, …, β_m],   E = [ε₁, …, ε_c, …, ε_m]

equation by equation:   y_c = Xβ_c + ε_c

each row of E ~ iid N(0, Σ)
Multivariate regression likelihood

p(Y|X, B, Σ) ∝ |Σ|^(−n/2) exp{ −(1/2) Σᵣ₌₁ⁿ (yᵣ − B'xᵣ)' Σ⁻¹ (yᵣ − B'xᵣ) }

            = |Σ|^(−n/2) etr{ −(1/2) (Y − XB)'(Y − XB) Σ⁻¹ }

            ∝ |Σ|^(−(n−k)/2) etr{ −(1/2) S Σ⁻¹ }  ×  |Σ|^(−k/2) etr{ −(1/2) (B − B̂)'X'X(B − B̂) Σ⁻¹ }

where   S = (Y − XB̂)'(Y − XB̂)
Multivariate regression likelihood
But  tr(A'B) = vec(A)'vec(B), so

tr{ (B − B̂)'X'X(B − B̂) Σ⁻¹ } = vec(B − B̂)' vec( X'X (B − B̂) Σ⁻¹ )

and  vec(ABC) = (C' ⊗ A) vec(B), so this equals

vec(B − B̂)' ( Σ⁻¹ ⊗ X'X ) vec(B − B̂)

therefore,

p(Y|X, B, Σ) ∝ |Σ|^(−(n−k)/2) etr{ −(1/2) S Σ⁻¹ }  ×  |Σ|^(−k/2) exp{ −(1/2) (β − β̂)' ( Σ⁻¹ ⊗ X'X ) (β − β̂) }

with  β = vec(B),  β̂ = vec(B̂)    (a small numerical check of these identities follows)
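A small numerical check of the trace and vec/Kronecker identities, with made-up matrices:

set.seed(7)
k <- 3; m <- 2
A <- matrix(rnorm(k*m), k, m); B <- matrix(rnorm(k*m), k, m)
c(sum(diag(t(A) %*% B)), sum(as.vector(A)*as.vector(B)))              # tr(A'B) = vec(A)'vec(B)

C1 <- matrix(rnorm(k*k), k, k)        # plays the role of X'X
C2 <- matrix(rnorm(m*m), m, m)        # plays the role of Sigma^-1
range(as.vector(C1 %*% B %*% C2) - (t(C2) %x% C1) %*% as.vector(B))   # vec(ABC) = (C' (x) A) vec(B)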
Inverted Wishart distribution
The form of the likelihood suggests that the natural conjugate (convenient) prior for Σ would be of the Inverted Wishart form:

p(Σ|ν₀, V₀) ∝ |Σ|^(−(ν₀+m+1)/2) etr{ −(1/2) V₀ Σ⁻¹ }

denoted  Σ ~ IW(ν₀, V₀)

if ν₀ > m + 1,   E[Σ] = (ν₀ − m − 1)⁻¹ V₀

ν -- tightness;  V -- location;  however, as ν increases, the spread also increases

limitations:  i. small ν -- thick tails;  ii. only one tightness parm
Wishart distribution (rwishart)

If  Σ ~ IW(ν₀, V₀),  then  Σ⁻¹ ~ W(ν₀, V₀⁻¹)

if ν₀ > m + 1,   E[Σ⁻¹] = ν₀ V₀⁻¹

Generalization of χ²:   let  εᵢ ~ N_m(0, Σ);  then  W = Σᵢ₌₁^ν εᵢεᵢ' ~ W(ν, Σ)

The diagonals are χ².   (a simulation sketch follows)
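A minimal sketch of Wishart draws straight from the sum-of-outer-products definition (packaged routines such as rwishart typically use a more efficient construction); Sigma and nu are made up:

rwishart_naive <- function(nu, Sigma) {
  m <- nrow(Sigma)
  eps <- matrix(rnorm(nu*m), nu, m) %*% chol(Sigma)   # nu iid draws of N_m(0, Sigma), one per row
  crossprod(eps)                                      # sum_i eps_i eps_i'  ~  W(nu, Sigma)
}
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
W <- rwishart_naive(10, Sigma)
W/10     # E[W] = nu*Sigma, so this should be near Sigma on average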
Multivariate regression prior and posterior

Prior:
p(Σ, B) = p(Σ) p(B|Σ)
Σ ~ IW(ν₀, V₀)
β|Σ ~ N( β̄, Σ ⊗ A⁻¹ )

Posterior:
Σ | Y, X ~ IW( ν₀ + n, V₀ + S )
β | Y, X, Σ ~ N( β̃, Σ ⊗ (X'X + A)⁻¹ )

B̃ = (X'X + A)⁻¹ (X'XB̂ + AB̄),    β̃ = vec(B̃)

S = (Y − XB̃)'(Y − XB̃) + (B̃ − B̄)'A(B̃ − B̄)
Drawing from Posterior: rmultireg
rmultireg=
function(Y,X,Bbar,A,nu,V){
# note: Y,X,A,Bbar must be matrices!
n=nrow(Y); m=ncol(Y); k=ncol(X)
RA=chol(A)
W=rbind(X,RA)
Z=rbind(Y,RA%*%Bbar)
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R & (W'W)^-1 = IR%*%t(IR) -- IR is the inverse of the upper Cholesky root
Btilde=crossprod(t(IR))%*%crossprod(W,Z)
# IR IR'(W'Z) = (X'X+A)^-1 (X'Y + A Bbar)
S=crossprod(Z-W%*%Btilde)
#
# first draw Sigma: Sigma^-1 ~ W(nu+n, (V+S)^-1)
rwout=rwishart(nu+n,chol2inv(chol(V+S)))
#
# now draw B given Sigma
#   note: beta ~ N(vec(Btilde), Sigma (x) Cov) with Cov = (X'X + A)^-1 = IR IR'
#   Sigma = CI CI', therefore cov(beta) = CI CI' (x) IR IR' = (CI (x) IR)(CI (x) IR)'
#   so draw beta = vec(Btilde) + (CI (x) IR) vec(Z_mk), Z_mk a k x m matrix of N(0,1) draws
#   since vec(ABC) = (C' (x) A) vec(B), this is B = Btilde + IR Z_mk CI'
B = Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout$CI)
list(B=B,Sigma=rwout$CI%*%t(rwout$CI))   # Sigma = CI CI' (see note above)
}
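A usage sketch with simulated data and arbitrary hyperparameters; it assumes a rwishart() draw function (e.g., the one in the bayesm package) is available along with the rmultireg defined above:

set.seed(8)
n <- 300; k <- 2; m <- 2
X <- cbind(1, rnorm(n))
B_true <- matrix(c(1, 2, -1, 0.5), k, m)
E <- matrix(rnorm(n*m), n, m) %*% chol(matrix(c(1, 0.5, 0.5, 1), m, m))
Y <- X %*% B_true + E

Bbar <- matrix(0, k, m); A <- 0.01*diag(k)
nu <- m + 3; V <- nu*diag(m)
draw <- rmultireg(Y, X, Bbar, A, nu, V)     # one draw of (B, Sigma) from the posterior
draw$B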
Conjugacy is Fragile!
SUR: a set of regressions "related" via correlated errors,

yᵢ = Xᵢβᵢ + εᵢ,   i = 1, …, m

given Σ, a normal prior on β would be conjugate
given β, an IW prior on Σ would be conjugate
BUT, there is no joint conjugate prior!!