Prior Distributions for the Variable Selection Problem

Sujit K Ghosh
Department of Statistics
North Carolina State University
http://www.stat.ncsu.edu/∼ghosh/
Email: [email protected]

Bayesian Statistics Working Group, NCSU
BSWG Seminar, Raleigh, NC, October 3, 2006

Overview

• The Variable Selection Problem (VSP)
• A Bayesian Framework
• Choice of Prior Distributions
• Illustrative Examples
• Conclusions

Disclaimer: This talk is not entirely based on my own research work
The Variable Selection Problem

Consider the following canonical linear model:

    y = Xβ + ε,                                        (1)

where ε ∼ Nn(0, σ²I) and β = (β1, ..., βp)^T (X is an n × p matrix)

• Under the above model, suppose also that only an unknown subset of the coefficients βj's are nonzero

• The problem of variable selection is to identify this unknown subset

• Notice that the above canonical framework can be used to address many other problems of interest, including multivariate polynomial regression and nonparametric function estimation

Suppose the true data generating process (DGP) is given by

    y = X0 β0 + ε,                                     (2)

where β0 = (β10, ..., βp00)^T, X0 is n × p0, and WLOG assume that X = (X0 | X1) and p ≥ p0 ≥ 1 (i.e., X1 is n × (p − p0))

The LSE of β and σ² are given by

    β̂  = (X^T X)^− X^T y
    σ̂² = y^T (I − PX) y / (n − r)                      (3)

where r = rank(X) ≤ min(n, p), PX = X (X^T X)^− X^T is the projection matrix and (X^T X)^− is a g-inverse of X^T X. Then

Lemma: E[β̂] = ((X0^T X0)^− X0^T X0 β0, 0)^T and E[X β̂] = X0 β0. Further, E[σ̂²] = σ² for any g-inverse of X^T X.

In particular, if rank(X0) = p0, then E[β̂] = (β0, 0)^T.
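As a quick numerical illustration of the least squares formulas in (3), here is a minimal numpy sketch (mine, not from the talk) that uses the Moore-Penrose pseudoinverse as the g-inverse; the function names and the toy data are assumptions for illustration only.

```python
import numpy as np

def lse_via_ginverse(X, y):
    """Least squares estimates using a g-inverse (here the Moore-Penrose
    pseudoinverse), valid even when X is not of full column rank."""
    n = X.shape[0]
    XtX_ginv = np.linalg.pinv(X.T @ X)               # a g-inverse of X'X
    beta_hat = XtX_ginv @ X.T @ y                    # beta-hat = (X'X)^- X'y
    P = X @ XtX_ginv @ X.T                           # projection onto col(X)
    r = np.linalg.matrix_rank(X)
    sigma2_hat = y @ (np.eye(n) - P) @ y / (n - r)   # unbiased for sigma^2
    return beta_hat, sigma2_hat

# toy check: the true DGP uses only the first two columns of X
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta0 = np.array([1.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta0 + rng.normal(scale=2.0, size=n)
print(lse_via_ginverse(X, y))
```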
VSP, contd.

• The variable selection problem is a special case of the model selection problem

• Each model under consideration corresponds to a distinct subset of x1, ..., xp (Geweke, 1996)

• The model (1) can be generalized to include discrete responses in terms of the first two moments:

    E[y|X] = g(Xβ)
    V[y|X] = Σ(X, β, φ),                               (4)

where g(·) is a suitable link function and Σ(·) is an n × n covariance matrix which may depend on additional parameters φ

• Typically, a single model class is simply applied to all possible subsets, so that all reduced models are nested under the full model

• A common strategy for the VSP has been to select a model that minimizes a penalized sum of squares (PSS) criterion by a constrained optimization method (but why?)

• More specifically, if δ = (δ1, ..., δp)^T denotes the indicator of the inclusion (δj = 1) and exclusion (δj = 0) of the variable xj for j = 1, ..., p, then a PSS criterion would pick a δ ∈ {0, 1}^p (and also β ∈ R^p) that minimizes

    PSS(β, δ) = ||y − X D(δ) β||² / (nσ²) + J(β, δ),   (5)

where D(δ) is the diagonal matrix with diagonal δ and J(·) denotes a suitable penalty function

• The choice of the penalty J(·) is crucial and can be shown to be equivalent to the choice of a prior distribution
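To make the PSS criterion (5) concrete, the following sketch (illustrative only, not from the talk) profiles β out by least squares on the selected columns for each candidate δ and minimizes PSS with the simple penalty J(β, δ) = λ Σj δj over all 2^p subsets. It assumes σ² is known and p is small enough for exhaustive search; all names are mine.

```python
import itertools
import numpy as np

def pss(y, X, delta, sigma2, lam=2.0):
    """Penalized sum of squares (5) with J(beta, delta) = lam * sum(delta),
    profiling beta out by least squares on the selected columns."""
    n = len(y)
    cols = np.flatnonzero(delta)
    if cols.size == 0:
        resid = y
    else:
        resid = y - X[:, cols] @ np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    return resid @ resid / (n * sigma2) + lam * delta.sum()

def best_subset(y, X, sigma2, lam=2.0):
    """Exhaustive minimization of PSS over delta in {0,1}^p (small p only)."""
    p = X.shape[1]
    deltas = [np.array(d) for d in itertools.product([0, 1], repeat=p)]
    return min(deltas, key=lambda d: pss(y, X, d, sigma2, lam))
```

With lam = 2 this mimics an AIC/Cp-type search; lam = log(n) gives the BIC-type version discussed on the next slide.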
VSP, contd.

• A number of popular criteria correspond to (5) with different choices for J(·)

• J(β, δ) = λ(p, n) Σ_{j=1}^p δj (notice Σ_{j=1}^p δj = #non-zero βj's)
  – λ(p, n) = 2 yields Cp (Mallows, 1973) and AIC (Akaike, 1973)
  – λ(p, n) = log n yields BIC (Schwarz, 1978)
  – λ(p, n) = 2 log p yields RIC (Foster and George, 1994)

• The C_MML criterion (George and Foster, 2000) estimates λ(p, n) based on marginal maximum likelihood (MML) using an empirical Bayes framework

• J(β, δ) = 2 Σ_{j=1}^p δj log(p/j), Benjamini and Hochberg (1995)

• Notice that none of the above penalties involve β; they are generally functions of δ (and n) only

• Recent attempts have been to define penalties in terms of β

• J(β, δ) = λ Σ_{j=1}^p |βj|^q, q ≥ 1, yields bridge regression (Frank and Friedman, 1993). Only q = 1 yields a sparse solution among all q ≥ 1 (Fan and Li, 2001)
  – q = 2 yields ridge regression
  – q = 1 yields LASSO (Tibshirani, 1996)

• J(β, δ) = λ1 Σ_{j=1}^p |βj| + λ2 Σ_{j=1}^p βj² yields the Elastic Net (Zou and Hastie, 2005)

• J(β, δ) = λ1 (Σ_{j=1}^p |βj| + λ2 Σ_{j<k} max{|βj|, |βk|}) yields OSCAR (Bondell and Reich, 2006)

• Thus, a general strategy would be to define a penalty function that involves both δ and β

• We will consider such penalties as priors: π(β, δ) ∝ exp{−J(β, δ)}
A Bayesian Framework

The full hierarchical Bayes model:

    y | β, δ, σ²  ∼ Nn(X D(δ) β, σ² I_n)
    (β, δ) | σ²   ∼ π(β, δ | σ²) ∝ exp{−J(β, δ)/σ²}    (6)
    σ²            ∼ π0(σ²)  (e.g., IG(a0, b0))

• Given a loss function L(θ, a), we can obtain (in theory) the Bayes estimator by minimizing the posterior expected loss

    E[L(θ, a) | y, X] = ∫ L(θ, a) π(θ | y, X) dθ       (7)

  wrt a = a(y, X), where θ = (β^T, δ^T, σ²)^T

• Which prior distributions? What loss functions? Can we even do the optimization for a given prior distribution and loss function?

Bayesian Framework, contd.

• A purely subjective point of view for prior selection is problematic for the VSP

• It is rather unrealistic to assume that uncertainty can be meaningfully described given the huge number (≈ 2^{p−1}) and complexity of unknown model parameters

• A common and practical approach has been to construct noninformative, semi-automatic formulations in this context

• Roughly speaking, the goal would be to specify priors that allow the posterior model probabilities to accumulate near the true model (via some form of sparseness and smoothing)

• Unfortunately, there is no universally preferred method to construct such semi-automatic priors! (isn't that nice?)
Bayesian Framework, contd.

• The choice of loss function, although mostly overlooked, is also crucial (different loss functions lead to different estimates)

• In general, suppose the true DGP is: (y, x) ∼ m0(y|x) g0(x)

• Consider a model: (y, x) ∼ m(y|x) g(x), where m(y|x) = ∫ f(y|x, θ) π(θ) dθ with sampling density f and prior π

• The Kullback-Leibler discrepancy between the DGP and the model can be written as:

    K(m0 g0, m g) = ∫ K(m0, m | x) g0(x) dx + K(g0, g),   (8)

  where K(m0, m | x) = ∫ m0(y|x) log[m0(y|x)/m(y|x)] dy, and the first term can be approximated empirically as

    ≈ const. + (1/n) Σ_{i=1}^n [ −log ∫ f(yi | xi, θ) π(θ) dθ ]

• Notice that if θ̂ denotes a MAP estimator, then

    (1/n) Σ_{i=1}^n [ −log ∫ f(yi | xi, θ) π(θ) dθ ] ≈ (1/n) Σ_{i=1}^n [ −log f(yi | xi, θ̂) ] + (−log π(θ̂))

• When y|x follows a canonical normal linear model the above criterion is equivalent to the PSS criterion (5)

• Thus, J(θ) = −log π(θ) emerges as a choice of the penalty function, up to some multiplicative constant (see the prior π(β, δ) ∝ exp{−J(β, δ)} introduced earlier)

• Hence the choice of a penalty function is equivalent to the choice of a prior distribution (including improper distributions in some cases)
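A tiny numerical check of the J(θ) = −log π(θ) correspondence (my own illustration, not the specific prior used in the talk): with independent double-exponential (Laplace) components, −log π(β) is exactly the LASSO penalty τ Σ|βj| plus an additive constant.

```python
import numpy as np

def neg_log_laplace_prior(beta, tau):
    """-log prior for independent DE(0, tau) components with density
    (tau/2) * exp(-tau*|b|); equals tau * sum|beta_j| plus a constant."""
    p = len(beta)
    return tau * np.sum(np.abs(beta)) - p * np.log(tau / 2.0)

beta = np.array([0.5, -1.2, 0.0, 2.0])
tau = 1.5
l1_penalty = tau * np.sum(np.abs(beta))              # the LASSO penalty term
constant = -len(beta) * np.log(tau / 2.0)            # does not depend on beta
assert np.isclose(neg_log_laplace_prior(beta, tau), l1_penalty + constant)
```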
Choice of Prior distributions

• We are not generally confident about any given set of predictors, and hence little prior information on D(δ)β and σ² can be expected for each δ

• Unfortunately, default priors for normal linear models are generally improper

• Nonobjective (conjugate) priors for β are typically centered at 0, making the model with no predictors the null model within a testing-of-hypothesis setup

• The goal is to select a prior (and hence penalty function) that is criterion-based and fully automatic

• More generally, we can think of constructing priors (and hence penalties) that may also depend on the design matrix X

Choice of priors, contd.

• For each δ it is desirable to have some default priors for D(δ)β and σ²

• Zellner's prior: β | δ, σ², X ∼ Np(β0, (σ²/g)(X^T X)^−), σ² ∼ IG(a0, b0) and δ = 1 w.p. 1.
  Here β0, g, a0 and b0 need to be specified by the user (or estimated using either an EB or an HB procedure)

• Extensions of Zellner priors:
  – β | δ, σ², X ∼ Np(β0, (σ²/g)(X(δ)^T X(δ))^−), σ² ∼ IG(a0, b0) and δ ∼ Unif({0, 1}^p), where X(δ) = X D(δ)
  – Almost the same as above, but δ ∼ q^{Σj δj} (1 − q)^{p − Σj δj}

• The advantage of Zellner-type priors is the closed form suitable for rapid computations over the large parameter space for δ
Choice of priors, contd.

• In general we may consider the following independence prior:

    πI(δ) = Π_{j=1}^p qj^{δj} (1 − qj)^{1−δj}          (9)

  where qj = Pr[δj = 1] is the inclusion probability of the j-th variable

• Small qj can be used to downweight the j-th variable

• Notice that when qj ≡ 0.5, models with size p/2 get more weight

• Alternatively, assuming qj = q for all j, one may use a prior q ∼ Beta(a, b) to obtain the exchangeable prior:

    πE(δ) = B(a + Σj δj, b + p − Σj δj) / B(a, b),

  where B(a, b) is the Beta function

• Notice that the components of δ are exchangeable but not independent under this prior

Choice of priors, contd.

• Independent and exchangeable priors on δ may be less satisfactory when the models contain dependent components (e.g., interactions, polynomials, lagged or indicator variables)

• Consider 3 variables with main effects x1, x2, x3 and three two-factor interactions x1x2, x2x3, x1x3

• The importance of an interaction such as x1x2 will often depend only on whether the main effects x1 and x2 are included in the model

• This can be expressed by a prior for δ = (δ1, δ2, δ3, δ12, δ13, δ23) of the form

    π(δ) = Π_{j=1}^3 π(δj) Π_{j<k} π(δjk | δj, δk),

  where π(δjk | δj, δk) requires specifying four probabilities, one for each value of the pair (δj, δk)

• E.g., π(δ12 | 0, 0) < π(δ12 | 0, 1), π(δ12 | 1, 0) < π(δ12 | 1, 1)
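A small sketch of the two priors on δ above: the independence prior (9) and the beta-binomial (exchangeable) prior obtained by integrating q out. The function names are mine, and scipy is assumed available for the log Beta function.

```python
import numpy as np
from scipy.special import betaln

def log_prior_independent(delta, q):
    """log pi_I(delta) = sum_j [delta_j log q_j + (1 - delta_j) log(1 - q_j)]."""
    delta, q = np.asarray(delta), np.asarray(q)
    return np.sum(delta * np.log(q) + (1 - delta) * np.log(1 - q))

def log_prior_exchangeable(delta, a, b):
    """log pi_E(delta) = log B(a + sum delta, b + p - sum delta) - log B(a, b),
    obtained by integrating q ~ Beta(a, b) out of the i.i.d. Bernoulli(q) prior."""
    delta = np.asarray(delta)
    p, k = delta.size, delta.sum()
    return betaln(a + k, b + p - k) - betaln(a, b)

# With q_j = 0.5 every one of the 2^p models gets prior 2^{-p}; the beta-binomial
# prior instead spreads mass over model *sizes* rather than individual models.
print(np.exp(log_prior_independent([1, 1, 0, 0], q=np.full(4, 0.5))))   # 0.0625
print(np.exp(log_prior_exchangeable([1, 1, 0, 0], a=1.0, b=1.0)))       # 1/30 ≈ 0.0333
```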
Choice of priors, contd.

• The number of possible models grows exponentially as the number of interactions, polynomials, and lagged variables increases

• In contrast to independence priors of the form (9), priors for dependent-component models concentrate mass on "plausible" models when the number of possible models is huge

• This can be crucial in applications such as screening designs, where p >> n (see Chipman, Hamada and Wu, 1997)

• Another limitation of independence priors on δ is their failure to account for covariate collinearity

• This problem can be resolved by using the so-called dilution priors (George, 1999)

• A general form of dilution prior can be written as (see the sketch after the next slide)

    πD(δ) = h(det(X(δ)^T X(δ))) πI(δ)

Choice of priors, contd.

• Having little prior information on the variables, objective model selection methods are necessary

• Spiegelhalter and Smith (1982): improper priors
  – used conventional improper priors for β
  – used a pseudo-Bayes factor for inference

• Mitchell and Beauchamp (1988): spike-and-slab priors
  – βj | δj ∼ (1 − δj) δ0 + δj Unif(−aj, aj)
  – the variable selection problem is solved as an estimation problem

• Berger and Pericchi (1996): intrinsic priors
  – developed a fully automatic prior
  – used the intrinsic Bayes factor for inference based on a model encompassing approach
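Here is the promised sketch of the dilution prior πD(δ), taking h to be the identity and πI to be i.i.d. Bernoulli(q); subsets of nearly collinear columns have a near-singular X(δ)^T X(δ), hence a tiny determinant and little prior mass. Illustrative only; in practice one would standardize the columns of X and choose h deliberately.

```python
import numpy as np

def log_dilution_prior(delta, X, q=0.5):
    """Unnormalized log dilution prior
         pi_D(delta) ∝ det(X(delta)' X(delta)) * pi_I(delta),
    with h = identity and pi_I = i.i.d. Bernoulli(q)."""
    delta = np.asarray(delta)
    cols = np.flatnonzero(delta)
    k = delta.sum()
    log_pi_I = k * np.log(q) + (delta.size - k) * np.log(1 - q)
    if cols.size == 0:
        return log_pi_I                       # empty model: no determinant term
    Xd = X[:, cols]
    sign, logdet = np.linalg.slogdet(Xd.T @ Xd)
    return (logdet if sign > 0 else -np.inf) + log_pi_I
```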
Choice of priors, contd.

• Yuan and Lin (2006) have recently proposed the use of the following dilution prior:

    π(δ) = q^{Σj δj} (1 − q)^{p − Σj δj} det(X(δ)^T X(δ))

• The main idea behind this prior is to replace a set of highly correlated variables by one of the variables in that set

• Suppose βj | δj ∼ (1 − δj) δ0 + δj DE(0, τ), where DE(0, τ) denotes a double-exponential distribution with density τ exp{−τ|β|} and δ0 a distribution with point mass at 0

• Yuan and Lin (2005) have shown that if one sets q = (1 + τσπ/2)^{−1}, then the model with the highest posterior probability is approximately equivalent to the LASSO with λ = 2σ²τ

• A Gibbs sampling scheme is also presented (seminar on Oct 31st!)

Choice of priors, contd.

• Another recent attempt to construct automatic priors has been made by Casella and Moreno (2006)

• Their proposed methodology is
  – Criterion based: provides a clear understanding of the properties of selected models
  – Automatic: no tuning parameter (hyperparameter) selection is required

• Formally carries out the hypothesis tests:

    H0: δ = δ*  vs.  Ha: δ = 1p,

  where δ* ∈ {0, 1}^p but δ* ≠ 1p, i.e., it tests the null hypothesis of a reduced model versus the full model (this is in sharp contrast to other conjugate prior approaches)
Choice of priors, contd.

• The test of hypothesis is carried out using posterior model probabilities:

    Pr[δ* | y, X] = m(y | δ*, X) / [ m(y | 1p, X) + Σ_{δ ≠ 1p} m(y | δ, X) ]
                  = BF_{δ*,1p}(y, X) / [ 1 + Σ_{δ ≠ 1p} BF_{δ,1p}(y, X) ],

  where BF_{δ*,1p}(y, X) denotes the Bayes factor for testing H0: δ = δ* vs. Ha: δ = 1p

• The fact that every posterior model probability has the same denominator facilitates rapid computation

• The use of intrinsic priors overcomes the problem of using improper priors while computing the Bayes factors

Illustrative Examples

• A simulation study adapted from Casella and Moreno (2006):

  Full Model: y = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x2² + ε, where ε ∼ N(0, σ²)

  True DGP1: y = 1 + x1 + ε,          x1, x2 ∼ Unif(0, 10), σ = 2
  True DGP2: y = 1 + x1 + 2x2 + ε,    x1, x2 ∼ Unif(0, 10), σ = 2

• A sample of n = 10 is generated and posterior model probabilities are computed for all 2^4 = 16 models

• The procedure was repeated 1000 times and compared with Mallows' Cp
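A sketch of the simulation setup just described (my own code, not Casella and Moreno's): it generates data from DGP1 or DGP2 and builds the four candidate columns (the intercept is assumed to be in every model), leaving the 2^4 = 16 posterior model probabilities to be computed by whichever prior one plugs in, e.g. the g-prior enumeration sketched earlier, as a stand-in for the intrinsic prior.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

def simulate(dgp=1, n=10, sigma=2.0):
    """Data from the Casella-Moreno style simulation: x1, x2 ~ Unif(0, 10);
    candidate (non-intercept) columns are x1, x2, x1^2, x2^2."""
    x1, x2 = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
    mean = 1 + x1 if dgp == 1 else 1 + x1 + 2 * x2       # true DGP1 / DGP2
    y = mean + rng.normal(scale=sigma, size=n)
    X = np.column_stack([x1, x2, x1 ** 2, x2 ** 2])
    return y, X

y, X = simulate(dgp=2)
models = list(itertools.product([0, 1], repeat=4))       # all 2^4 = 16 subsets
# plug (y, X, models) into a posterior-model-probability routine (e.g. the
# g-prior sketch above) and tabulate which model wins across replications.
```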
Examples, contd.

• Consider the "ancient" Hald data (Casella and Moreno, 2006)

• The data measure the effect of cement composition on the heat evolved during hardening

• n = 13 observations on heat (y) and four cement composition variables (xj's, j = 1, ..., 4) are available (⇒ 2^4 = 16 models)

• Historically, it is known that the subset {x1, x2} is the most preferred by earlier analyses
subsets           Pr[δ|y, X]     subsets             Pr[δ|y, X]
x1, x2            0.5224         x1, x3, x4          0.0925
x1, x4            0.1295         x2, x3, x4          0.0120
x1, x2, x3        0.1225         x1, x2, x3, x4      0.0095
x1, x2, x4        0.1098         x3, x4              0.0013

All other models have posterior probabilities < 10^−5
Examples, contd.

• Based on R², Draper and Smith (1981, Sec. 6.1) also concluded in favour of the top two models, with preference to {x1, x4} as x4 is the single best predictor

• Although {x1, x2, x4} had a high R², the variable x4 was excluded as Corr(x2, x4) = −0.973

• Interestingly, George and McCulloch (1993) analyzed this data and favored the model with no predictors (δ = 04), followed by the model with one predictor

• George and McCulloch (1992)'s stochastic search algorithm visited the model {x1, x2} less than 7% of the time!

• This could be because the model with no predictors is used as the null model in all comparisons

• Their methods were sensitive to the choice of priors for β and δ

Examples, contd.

• C & M (2006) also considered the 10-predictor-variable model:

    y = β0 + Σ_{j=1}^3 βj xj + Σ_{j=1}^3 ηj xj² + Σ_{j<k} ηjk xj xk + η123 x1 x2 x3 + ε

• The true DGP2 was used to simulate the data, and a stochastic search with the intrinsic prior was used to estimate posterior model probabilities

• A total of 10^4 MCMC samples were generated

• Exact posterior model probabilities for all 2^10 = 1,024 models were also computed

• The entire procedure was repeated 1,000 times with n = 10

• Two values σ = 2, 5 were used
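For completeness, a sketch of how the 10-predictor design above can be assembled (the column order and names are assumptions of mine): three main effects, their squares, the three two-way interactions, and the three-way interaction.

```python
import itertools
import numpy as np

def full_interaction_design(x):
    """Columns of the 10-predictor candidate model: x1, x2, x3, their squares,
    the three two-way interactions, and x1*x2*x3."""
    x1, x2, x3 = x.T
    mains = [x1, x2, x3]
    squares = [x1 ** 2, x2 ** 2, x3 ** 2]
    twoway = [a * b for a, b in itertools.combinations(mains, 2)]
    return np.column_stack(mains + squares + twoway + [x1 * x2 * x3])

x = np.random.default_rng(2).uniform(0, 10, size=(10, 3))
X = full_interaction_design(x)
print(X.shape)      # (10, 10): n = 10 observations, 2^10 = 1,024 candidate subsets
```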
Examples, contd.

Model              Pr[δ|y] (exact)     MCMC visits (ssvs)
σ = 2
x1, x2             0.449               0.860
x1, x2, x1²        0.080               0.035
x1, x2, x1x2       0.070               0.008
x1, x2, x2²        0.040               0.011
σ = 5
x1, x2             0.136               0.188
x1, x2²            0.075               0.112
x2                 0.063               0.075
x1², x2            0.051               0.076

Conclusions

• Default priors typically used for model parameters are improper, and thus they are not suitable for computing posterior model probabilities

• The commonly used "vague prior" (as a limit of conjugate priors) is typically an ill-defined notion

• Variable selection can be considered as a multiple testing problem in which we test whether any reduction in complexity of the full model is plausible

• Intrinsic priors are well defined, depend on the sampling density, and do not require the choice of tuning parameters

• The intrinsic prior for the full model parameters is "centered" at the reduced model and has heavy tails
Conclusions, contd.

• The role of SSVS is different from estimating a posterior distribution

• The goal is to find "good" models rather than estimating the modes accurately

• However, determining how many MCMC runs should be carried out is a complex issue

• Rigorous evaluation of SSVS in terms of convergence and mixing is very difficult and might be worth more exploration

• Open problems:
  – Given two priors (or, equivalently, penalty functions), how would one rigorously choose a model/method for the VSP?
  – Can the computational cost be factored into the loss function?

THANKS!

All references mentioned in this talk, and many more, are available online:
http://www.stat.ncsu.edu/bswg/