Prior Distributions for the Variable Selection Problem

Sujit K Ghosh
Department of Statistics, North Carolina State University
http://www.stat.ncsu.edu/~ghosh/   Email: [email protected]

Bayesian Statistics Working Group, NCSU
BSWG Seminar, Raleigh, NC, October 3, 2006

Disclaimer: This talk is not entirely based on my own research work.

Overview

• The Variable Selection Problem (VSP)
• A Bayesian Framework
• Choice of Prior Distributions
• Illustrative Examples
• Conclusions

The Variable Selection Problem

Consider the following canonical linear model:
$$y = X\beta + \epsilon, \qquad (1)$$
where $\epsilon \sim N_n(0, \sigma^2 I)$ and $\beta = (\beta_1, \ldots, \beta_p)^T$ ($X$ is an $n \times p$ matrix).

• Under the above model, suppose also that only an unknown subset of the coefficients $\beta_j$ are nonzero.
• The problem of variable selection is to identify this unknown subset.
• Notice that this canonical framework can be used to address many other problems of interest, including multivariate polynomial regression and nonparametric function estimation.

Suppose the true data generating process (DGP) is given by
$$y = X_0 \beta_0 + \epsilon, \qquad (2)$$
where $\beta_0 = (\beta_{10}, \ldots, \beta_{p_0 0})^T$, $X_0$ is $n \times p_0$, and WLOG assume that $X = (X_0 \mid X_1)$ with $p \ge p_0 \ge 1$ (i.e., $X_1$ is $n \times (p - p_0)$).

The LSE of $\beta$ and $\sigma^2$ are given by
$$\hat\beta = (X^T X)^- X^T y, \qquad \hat\sigma^2 = y^T (I - P_X) y / (n - r), \qquad (3)$$
where $r = \mathrm{rank}(X) \le \min(n, p)$, $P_X = X (X^T X)^- X^T$ is the projection matrix, and $(X^T X)^-$ is a g-inverse of $X^T X$. Then:

Lemma: $E[\hat\beta] = ((X_0^T X_0)^- X_0^T X_0 \beta_0, 0)^T$ and $E[X \hat\beta] = X_0 \beta_0$. Further, $E[\hat\sigma^2] = \sigma^2$ for any g-inverse of $X^T X$. In particular, if $\mathrm{rank}(X_0) = p_0$, then $E[\hat\beta] = (\beta_0, 0)^T$.
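Below is a minimal numpy sketch of the quantities in (2)-(3); the sample sizes, seed, and coefficient values are illustrative choices of mine, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p0, p = 50, 2, 5
X0 = rng.uniform(0, 10, size=(n, p0))             # columns of the true DGP (2)
X1 = rng.uniform(0, 10, size=(n, p - p0))         # spurious extra columns
X = np.hstack([X0, X1])
beta0 = np.array([1.0, 2.0])
y = X0 @ beta0 + rng.normal(0.0, 2.0, size=n)     # noise with sigma = 2

XtX_ginv = np.linalg.pinv(X.T @ X)                # one choice of g-inverse of X'X
beta_hat = XtX_ginv @ X.T @ y                     # LSE, eq. (3)
P_X = X @ XtX_ginv @ X.T                          # projection matrix P_X
r = np.linalg.matrix_rank(X)
sigma2_hat = y @ (np.eye(n) - P_X) @ y / (n - r)  # unbiased for sigma^2 (Lemma)
```

Averaging `beta_hat` over repeated simulations illustrates the Lemma: the fitted mean $X\hat\beta$ tracks $X_0\beta_0$, and since this simulated $X$ has full column rank, $E[\hat\beta] = (\beta_0, 0)^T$.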
VSP, contd.

• The variable selection problem is a special case of the model selection problem.
• Each model under consideration corresponds to a distinct subset of $x_1, \ldots, x_p$ (Geweke, 1996).
• The model (1) can be generalized to include discrete responses in terms of the first two moments:
$$E[y \mid X] = g(X\beta), \qquad V[y \mid X] = \Sigma(X, \beta, \phi), \qquad (4)$$
where $g(\cdot)$ is a suitable link function and $\Sigma(\cdot)$ is an $n \times n$ covariance matrix which may depend on additional parameters $\phi$.
• Typically, a single model class is simply applied to all possible subsets, so that all reduced models are nested within the full model.

VSP, contd.

• A common strategy for the VSP has been to select a model that minimizes a penalized sum of squares (PSS) criterion by a constrained optimization method (but why?).
• More specifically, if $\delta = (\delta_1, \ldots, \delta_p)^T$ denotes the indicators of inclusion ($\delta_j = 1$) or exclusion ($\delta_j = 0$) of the variable $x_j$ for $j = 1, \ldots, p$, then a PSS criterion picks a $\delta \in \{0,1\}^p$ (and also a $\beta \in \mathbb{R}^p$) that minimizes
$$PSS(\beta, \delta) = \|y - X D(\delta) \beta\|^2 / (n \sigma^2) + J(\beta, \delta), \qquad (5)$$
where $D(\delta)$ is the diagonal matrix with diagonal $\delta$ and $J(\cdot)$ denotes a suitable penalty function.
• The choice of the penalty $J(\cdot)$ is crucial and can be shown to be equivalent to the choice of a prior distribution.

VSP, contd.

• A number of popular criteria correspond to (5) with different choices of $J(\cdot)$; a toy exhaustive search using these criteria is sketched below.
• $J(\beta, \delta) = \lambda(p, n) \sum_{j=1}^p \delta_j$ (notice $\sum_{j=1}^p \delta_j$ = # nonzero $\beta_j$'s):
  – $\lambda(p, n) = 2$ yields $C_p$ (Mallows, 1973) and AIC (Akaike, 1973)
  – $\lambda(p, n) = \log n$ yields BIC (Schwarz, 1978)
  – $\lambda(p, n) = 2 \log p$ yields RIC (Foster and George, 1994)
• The $C_{MML}$ criterion (George and Foster, 2000) estimates $\lambda(p, n)$ based on marginal maximum likelihood (MML) using an empirical Bayes framework.
• $J(\beta, \delta) = 2 \sum_{j=1}^p \delta_j \log(p/j)$ (Benjamini and Hochberg, 1995).
• Notice that none of the above penalties involves $\beta$; they are generally functions of $\delta$ (and $n$).

VSP, contd.

• More recent attempts define penalties in terms of $\beta$:
• $J(\beta, \delta) = \lambda \sum_{j=1}^p |\beta_j|^q$, $q \ge 1$, yields bridge regression (Frank and Friedman, 1993). Only $q = 1$ yields a sparse solution among all $q \ge 1$ (Fan and Li, 2001).
  – $q = 2$ yields ridge regression
  – $q = 1$ yields the LASSO (Tibshirani, 1996)
• $J(\beta, \delta) = \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2$ yields the Elastic Net (Zou and Hastie, 2005).
• $J(\beta, \delta) = \lambda_1 (\sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j<k} \max\{|\beta_j|, |\beta_k|\})$ yields OSCAR (Bondell and Reich, 2006).
• Thus, a general strategy would be to define a penalty function that involves both $\delta$ and $\beta$.
• We will consider such penalties as priors: $\pi(\beta, \delta) \propto \exp\{-J(\beta, \delta)\}$.
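When $p$ is small, the minimizing $\delta$ in (5) can be found by exhaustive enumeration. The sketch below (my own illustration, not code from the talk) scores each subset by $RSS_\delta/\hat\sigma^2 + \lambda(p,n) \sum_j \delta_j$, an unnormalized version of (5) with the $\delta$-only penalties above, so that $\lambda = 2$ recovers Mallows' $C_p$ up to an additive constant and $\lambda = \log n$ gives a BIC-type rule.

```python
import itertools
import numpy as np

def best_subset(y, X, lam):
    """Exhaustive search over delta in {0,1}^p, scoring each subset by
    RSS/sigma2_hat + lam * sum(delta); sigma2 is estimated once from
    the full model. Feasible only when 2^p enumerations fit in time."""
    n, p = X.shape
    bhat = np.linalg.pinv(X.T @ X) @ X.T @ y
    sigma2 = np.sum((y - X @ bhat) ** 2) / (n - np.linalg.matrix_rank(X))
    best_score, best_delta = np.inf, None
    for delta in itertools.product((0, 1), repeat=p):
        idx = [j for j, d in enumerate(delta) if d]
        if idx:
            coef = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
            rss = np.sum((y - X[:, idx] @ coef) ** 2)
        else:
            rss = np.sum(y ** 2)              # empty model predicts 0
        score = rss / sigma2 + lam * sum(delta)
        if score < best_score:
            best_score, best_delta = score, delta
    return best_delta, best_score

# lam = 2 ~ Cp/AIC-type rule; lam = np.log(n) ~ BIC; lam = 2*np.log(p) ~ RIC
# (each up to its own scaling convention).
```

For larger $p$ the $2^p$ enumeration is infeasible, which is one motivation for the stochastic searches discussed later in the talk.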
A Bayesian Framework

The full hierarchical Bayes model:
$$y \mid \beta, \delta, \sigma^2 \sim N_n(X D(\delta) \beta, \sigma^2 I_n)$$
$$(\beta, \delta) \mid \sigma^2 \sim \pi(\beta, \delta \mid \sigma^2) \propto \exp\{-J(\beta, \delta)/\sigma^2\} \qquad (6)$$
$$\sigma^2 \sim \pi_0(\sigma^2) \quad (\text{e.g., } IG(a_0, b_0))$$

• Given a loss function $L(\theta, a)$, we can obtain (in theory) the Bayes estimator by minimizing the posterior expected loss
$$E[L(\theta, a) \mid y, X] = \int L(\theta, a)\, \pi(\theta \mid y, X)\, d\theta \qquad (7)$$
with respect to $a = a(y, X)$, where $\theta = (\beta^T, \delta^T, \sigma^2)^T$.
• Which prior distributions? What loss functions? Can we even do the optimization for a given prior distribution and loss function?

Bayesian Framework, contd.

• A purely subjective point of view on prior selection is problematic for the VSP.
• It is rather unrealistic to assume that uncertainty can be meaningfully described, given the huge number ($\approx 2^p - 1$) and complexity of unknown model parameters.
• A common and practical approach has been to construct noninformative, semi-automatic formulations in this context.
• Roughly speaking, the goal is to specify priors that allow the posterior model probabilities to accumulate near the true model (via some form of sparseness and smoothing).
• Unfortunately, there is no universally preferred method to construct such semi-automatic priors! (isn't that nice?)

Bayesian Framework, contd.

• The choice of loss function, although mostly overlooked, is also crucial (different loss functions lead to different estimates).
• In general, suppose the true DGP is $(y, x) \sim m_0(y \mid x)\, g_0(x)$.
• Consider a model $(y, x) \sim m(y \mid x)\, g(x)$, where $m(y \mid x) = \int f(y \mid x, \theta) \pi(\theta)\, d\theta$ with sampling density $f$ and prior $\pi$.
• The Kullback-Leibler discrepancy between the DGP and the model can be written as
$$K(m_0 g_0, m g) = \int K(m_0, m \mid x)\, g_0(x)\, dx + K(g_0, g), \quad \text{where } K(m_0, m \mid x) = \int m_0(y \mid x) \log \frac{m_0(y \mid x)}{m(y \mid x)}\, dy. \qquad (8)$$

Bayesian Framework, contd.

• Notice that if $\hat\theta$ denotes a MAP estimator, then
$$-\frac{1}{n} \log \int \prod_{i=1}^n f(y_i \mid x_i, \theta)\, \pi(\theta)\, d\theta \;\approx\; \text{const.} + \frac{1}{n}\Big[-\sum_{i=1}^n \log f(y_i \mid x_i, \hat\theta) + \big(-\log \pi(\hat\theta)\big)\Big].$$
• When $y \mid x$ follows the canonical normal linear model, the above criterion is equivalent to the PSS criterion (5).
• Thus $J(\theta) = -\log \pi(\theta)$ emerges as a choice of the penalty function, up to a multiplicative constant (recall the prior $\pi(\beta, \delta) \propto \exp\{-J(\beta, \delta)\}$ introduced earlier).
• Hence the choice of a penalty function is equivalent to the choice of a prior distribution (including improper distributions in some cases).
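A hedged sketch of the penalty-prior correspondence $J(\theta) = -\log \pi(\theta)$: with a normal likelihood and a double-exponential prior on $\beta$ (anticipating the Yuan and Lin construction later in the talk), the negative log posterior is, up to additive constants, a LASSO-type objective.

```python
import numpy as np

def neg_log_posterior(beta, y, X, sigma2, tau):
    """-log[likelihood x prior], dropping additive constants:
    RSS/(2*sigma2) + tau*||beta||_1, so the MAP estimator is a
    penalized least-squares solution."""
    nll = np.sum((y - X @ beta) ** 2) / (2.0 * sigma2)  # normal -log-likelihood
    penalty = tau * np.sum(np.abs(beta))                # -log DE(0, tau) prior
    return nll + penalty
```

Minimizing this over $\beta$ is the LASSO with $\lambda = 2\sigma^2\tau$, the same constant that appears in the Yuan and Lin result quoted later; swapping in a normal prior would instead give the ridge penalty $\tau \sum_j \beta_j^2$ from the earlier penalty catalog.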
Choice of Prior Distributions

• We are not generally confident about any given set of predictors, and hence little prior information on $D(\delta)\beta$ and $\sigma^2$ can be expected for each $\delta$.
• For each $\delta$ it is desirable to have some default priors for $D(\delta)\beta$ and $\sigma^2$.
• Unfortunately, default priors for normal linear models are generally improper.
• Nonobjective (conjugate) priors for $\beta$ are typically centered at 0, making the model with no predictors the null model within a testing-of-hypotheses setup.
• The goal is to select a prior (and hence a penalty function) that is criterion-based and fully automatic.

Choice of priors, contd.

• More generally, we can construct priors (and hence penalties) that may also depend on the design matrix $X$.
• Zellner's prior: $\beta \mid \delta, \sigma^2, X \sim N_p(\beta_0, \frac{\sigma^2}{g}(X^T X)^-)$, $\sigma^2 \sim IG(a_0, b_0)$, and $\delta = 1$ w.p. 1. Here $\beta_0$, $g$, $a_0$, and $b_0$ need to be specified by the user (or estimated using either an EB or an HB procedure).
• Extensions of Zellner priors:
  – $\beta \mid \delta, \sigma^2, X \sim N_p(\beta_0, \frac{\sigma^2}{g}(X(\delta)^T X(\delta))^-)$, $\sigma^2 \sim IG(a_0, b_0)$, and $\delta \sim Unif(\{0,1\}^p)$, where $X(\delta) = X D(\delta)$.
  – Almost the same as above, but $\delta \sim q^{\sum_j \delta_j}(1 - q)^{p - \sum_j \delta_j}$.
• The advantage of Zellner-type priors is the closed form, suitable for rapid computations over the large parameter space for $\delta$.

Choice of priors, contd.

• In general, we may consider the following independence prior:
$$\pi_I(\delta) = \prod_{j=1}^p q_j^{\delta_j}(1 - q_j)^{1 - \delta_j}, \qquad (9)$$
where $q_j = \Pr[\delta_j = 1]$ is the inclusion probability of the $j$-th variable.
• A small $q_j$ can be used to downweight the $j$-th variable.
• Notice that when $q_j \equiv 0.5$, models of size $p/2$ get more weight.
• Alternatively, assuming $q_j = q$ for all $j$, one may use a prior $q \sim Beta(a, b)$ to obtain the exchangeable prior $\pi_E(\delta) = B(a + \sum_j \delta_j, \, b + p - \sum_j \delta_j)/B(a, b)$, where $B(a, b)$ is the Beta function.
• Notice that the components of $\delta$ are exchangeable but not independent under this prior.

Choice of priors, contd.

• Independent and exchangeable priors on $\delta$ may be less satisfactory when the models contain dependent components (e.g., interactions, polynomials, lagged or indicator variables).
• Consider 3 variables with main effects $x_1, x_2, x_3$ and three two-factor interactions $x_1 x_2$, $x_2 x_3$, $x_1 x_3$.
• The importance of an interaction such as $x_1 x_2$ will often depend only on whether the main effects $x_1$ and $x_2$ are included in the model.
• This can be expressed by a prior for $\delta = (\delta_1, \delta_2, \delta_3, \delta_{12}, \delta_{23}, \delta_{13})$ of the form $\pi(\delta) = \prod_{j=1}^3 \pi(\delta_j) \prod_{j<k} \pi(\delta_{jk} \mid \delta_j, \delta_k)$, where $\pi(\delta_{jk} \mid \delta_j, \delta_k)$ requires specifying four probabilities, one for each value of the pair $(\delta_j, \delta_k)$.
• E.g., $\pi(\delta_{12} \mid 0, 0) < \pi(\delta_{12} \mid 0, 1), \pi(\delta_{12} \mid 1, 0) < \pi(\delta_{12} \mid 1, 1)$.

Choice of priors, contd.

• The number of possible models grows exponentially as the number of interactions, polynomials, and lagged variables increases.
• In contrast to independence priors of the form (9), priors for models with dependent components concentrate mass on "plausible" models when the number of possible models is huge.
• This can be crucial in applications such as screening designs, where $p \gg n$ (see Chipman, Hamada and Wu, 1997).
• Another limitation of independence priors on $\delta$ is their failure to account for covariate collinearity.
• This problem can be resolved by using the so-called dilution priors (George, 1999).
• A general form of dilution prior can be written as $\pi_D(\delta) = h(\det(X(\delta)^T X(\delta)))\, \pi_I(\delta)$.
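A sketch (my notation and simplifications, not code from the talk) of the three priors on $\delta$ just discussed: the independence prior (9), the exchangeable Beta-Binomial prior, and a dilution prior. The slide leaves $h(\cdot)$ general; taking $h$ to be the identity below is purely illustrative.

```python
import numpy as np
from scipy.special import betaln

def pi_I(delta, q):
    """Independence prior (9): prod_j q_j^delta_j * (1-q_j)^(1-delta_j)."""
    delta = np.asarray(delta)
    q = np.broadcast_to(q, delta.shape)
    return float(np.prod(np.where(delta == 1, q, 1.0 - q)))

def pi_E(delta, a, b):
    """Exchangeable prior: q ~ Beta(a, b) integrated out, giving
    B(a + sum(delta), b + p - sum(delta)) / B(a, b)."""
    k, p = int(np.sum(delta)), len(delta)
    return float(np.exp(betaln(a + k, b + p - k) - betaln(a, b)))

def pi_D(delta, X, q):
    """Dilution prior (unnormalized): h(det(X(delta)'X(delta))) * pi_I(delta),
    with h = identity here; collinear subsets have det near 0, hence low mass."""
    idx = np.flatnonzero(np.asarray(delta))
    h = np.linalg.det(X[:, idx].T @ X[:, idx]) if idx.size else 1.0
    return float(h * pi_I(delta, q))
```

As a quick check, `pi_E([1, 0, 1], a=1, b=1)` returns $1/12$: with $q \sim Unif(0,1)$ each model size $0,\ldots,3$ gets mass $1/4$, split evenly among the $\binom{3}{2} = 3$ models of size 2.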
Choice of priors, contd.

• Having little prior information on the variables, objective model selection methods are necessary.
• Spiegelhalter and Smith (1982): improper priors
  – used conventional improper priors for $\beta$
  – used a pseudo-Bayes factor for inference
• Mitchell and Beauchamp (1988): spike-and-slab priors
  – $\beta_j \mid \delta_j \sim (1 - \delta_j)\delta_0 + \delta_j\, Unif(-a_j, a_j)$
  – the variable selection problem is solved as an estimation problem
• Berger and Pericchi (1996): intrinsic priors
  – developed a fully automatic prior
  – used intrinsic Bayes factors for inference, based on a model-encompassing approach

Choice of priors, contd.

• Yuan and Lin (2005) have recently proposed the use of the following dilution prior:
$$\pi(\delta) = q^{\sum_j \delta_j}(1 - q)^{p - \sum_j \delta_j} \det(X(\delta)^T X(\delta))$$
• The main idea behind this prior is to replace a set of highly correlated variables by one of the variables in that set.
• Suppose $\beta_j \mid \delta_j \sim (1 - \delta_j)\delta_0 + \delta_j\, DE(0, \tau)$, where $DE(0, \tau)$ denotes a double-exponential distribution with density $\frac{\tau}{2}\exp\{-\tau|\beta|\}$ and $\delta_0$ a distribution with point mass at 0.
• Yuan and Lin (2005) have shown that if one sets $q = (1 + \tau\sigma\sqrt{\pi/2})^{-1}$, then the model with the highest posterior probability is approximately equivalent to the LASSO with $\lambda = 2\sigma^2\tau$.
• A Gibbs sampling scheme is also presented (seminar on Oct 31st!).

Choice of priors, contd.

• Another recent attempt to construct automatic priors has been made by Casella and Moreno (2006).
• Their proposed methodology is
  – Criterion-based: provides a clear understanding of the properties of the selected models
  – Automatic: no tuning-parameter (hyperparameter) selection is required
• It formally carries out the hypothesis tests $H_0: \delta = \delta^*$ vs. $H_a: \delta = 1_p$, where $\delta^* \in \{0,1\}^p$ but $\delta^* \ne 1_p$; i.e., it tests the null hypothesis of a reduced model versus the full model (in sharp contrast to other conjugate prior approaches).

Choice of priors, contd.

• The test of hypothesis is carried out using posterior model probabilities:
$$\Pr[\delta^* \mid y, X] = \frac{m(y \mid \delta^*, X)}{m(y \mid 1_p, X) + \sum_{\delta \ne 1_p} m(y \mid \delta, X)} = \frac{BF_{\delta^*, 1_p}(y, X)}{1 + \sum_{\delta \ne 1_p} BF_{\delta, 1_p}(y, X)},$$
where $BF_{\delta^*, 1_p}(y, X)$ denotes the Bayes factor for testing $H_0: \delta = \delta^*$ vs. $H_a: \delta = 1_p$.
• The fact that every posterior model probability has the same denominator facilitates rapid computation (a sketch follows the example tables below).
• The use of intrinsic priors overcomes the problem of using improper priors while computing the Bayes factors.

Illustrative Examples

• A simulation study adapted from Casella and Moreno (2006):
  Full model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$
  True DGP1: $y = 1 + x_1 + \epsilon$, with $x_1, x_2 \sim Unif(0, 10)$ and $\sigma = 2$
  True DGP2: $y = 1 + x_1 + 2x_2 + \epsilon$, with $x_1, x_2 \sim Unif(0, 10)$ and $\sigma = 2$
• A sample of $n = 10$ is generated and posterior model probabilities are computed for all $2^4 = 16$ models.
• The procedure was repeated 1000 times and compared with Mallows' $C_p$.

Examples, contd.

• Consider the "ancient" Hald data (Casella and Moreno, 2006).
• It measures the effect of heat on the composition of cement.
• $n = 13$ observations on heat ($y$) and four cement compositions ($x_j$, $j = 1, \ldots, 4$) are available ($\Rightarrow 2^4 = 16$ models).
• Historically, it is known that the subset $\{x_1, x_2\}$ is most preferred by earlier analyses.

  subset            Pr[delta | y, X]    subset            Pr[delta | y, X]
  x1, x2            0.5224              x1, x3, x4        0.0925
  x1, x4            0.1295              x2, x3, x4        0.0120
  x1, x2, x3        0.1225              x1, x2, x3, x4    0.0095
  x1, x2, x4        0.1098              x3, x4            0.0013

  All other models have posterior probabilities < 10^{-5}.

Examples, contd.

• Based on $R^2$, Draper and Smith (1981, Sec. 6.1) also concluded in favour of the top two models, with a preference for $\{x_1, x_4\}$ since $x_4$ is the single best predictor.
• Although $\{x_1, x_2, x_4\}$ had a high $R^2$, the variable $x_4$ was excluded since $Corr(x_2, x_4) = -0.973$.
• Interestingly, George and McCulloch (1993) analyzed these data and favored the model with no predictors ($\delta = 0_4$), followed by the models with one predictor.
• George and McCulloch's (1993) stochastic search algorithm visited the model $\{x_1, x_2\}$ less than 7% of the time!
• This could be because the model with no predictors is treated as the null model in all comparisons.
• Their methods were sensitive to the choice of priors for $\beta$ and $\delta$.

Examples, contd.

• C & M (2006) also considered the 10-predictor model:
$$y = \beta_0 + \sum_{j=1}^3 \beta_j x_j + \sum_{j=1}^3 \eta_j x_j^2 + \sum_{j<k} \eta_{jk} x_j x_k + \eta_{123}\, x_1 x_2 x_3 + \epsilon$$
• The true DGP2 was used to simulate the data, and a stochastic search with the intrinsic prior was used to estimate posterior model probabilities.
• A total of $10^4$ MCMC samples were generated.
• Exact posterior model probabilities for all $2^{10} = 1{,}024$ models were also computed.
• The entire procedure was repeated 1,000 times with $n = 10$.
• Two values $\sigma = 2, 5$ were used.

Examples, contd.

             Model              Pr[delta | y] (exact)    MCMC visits (ssvs)
  sigma = 2  x1, x2             0.449                    0.860
             x1, x2, x1^2       0.080                    0.035
             x1, x2, x1*x2      0.070                    0.008
             x1, x2, x2^2       0.040                    0.011
  sigma = 5  x1, x2             0.136                    0.188
             x1, x2^2           0.075                    0.112
             x2                 0.063                    0.075
             x1^2, x2           0.051                    0.076
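A small sketch of the common-denominator computation behind these tables: given $\log BF_{\delta, 1_p}$ for each model against the full model (with the full model itself contributing $\log BF = 0$), a single normalization yields every posterior model probability at once. How the Bayes factors are obtained (here, from intrinsic priors) is deliberately left abstract.

```python
import numpy as np

def posterior_model_probs(log_bf):
    """log_bf: dict mapping each model delta (a tuple in {0,1}^p) to
    log BF_{delta, 1p} versus the full model; include the full model
    itself with value 0.0. Every probability shares the denominator
    1 + sum of the other Bayes factors, so one pass suffices."""
    models = list(log_bf)
    lb = np.array([log_bf[m] for m in models], dtype=float)
    lb -= lb.max()                        # guard against overflow
    w = np.exp(lb)
    return dict(zip(models, w / w.sum()))

# e.g. posterior_model_probs({(1,1,0,0): 2.3, (1,0,0,1): 0.9, (1,1,1,1): 0.0})
```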
Conclusions

• Default priors typically used for model parameters are improper, and thus they are not suitable for computing posterior model probabilities.
• The commonly used "vague prior" (a limit of conjugate priors) is typically an ill-defined notion.
• Variable selection can be considered as a multiple testing problem in which we test whether any reduction in complexity of the full model is plausible.
• Intrinsic priors are well defined, depend on the sampling density, and do not require the choice of tuning parameters.
• The intrinsic prior for the full-model parameters is "centered" at the reduced model and has heavy tails.

Conclusions, contd.

• The role of SSVS (stochastic search variable selection) is different from that of estimating a posterior distribution (a toy sketch follows this slide).
• The goal is to find "good" models rather than to estimate the posterior model probabilities accurately.
• However, determining how many MCMC runs should be carried out is a complex issue.
• Rigorous evaluation of SSVS in terms of convergence and mixing is very difficult and might be worth more exploration.
• Open problems:
  – Given two priors (or, equivalently, penalty functions), how would one rigorously choose a model/method for the VSP?
  – Can the computational cost be factored into the loss function?
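To make the SSVS remarks concrete, here is a toy stochastic search: a plain Metropolis walk over $\delta$, not George and McCulloch's actual SSVS Gibbs sampler. Its visit frequencies play the role of the "(ssvs)" column in the earlier table, and `log_score` is a stand-in for the log marginal likelihood $\log m(y \mid \delta, X)$.

```python
import numpy as np
from collections import Counter

def stochastic_search(log_score, p, n_iter=10_000, seed=0):
    """Metropolis walk over delta in {0,1}^p, flipping one inclusion
    indicator per step. Visit counts suggest which models are 'good';
    as noted above, they need not estimate posterior probabilities well."""
    rng = np.random.default_rng(seed)
    delta = np.zeros(p, dtype=int)
    cur = log_score(delta)
    visits = Counter()
    for _ in range(n_iter):
        prop = delta.copy()
        prop[rng.integers(p)] ^= 1             # flip inclusion of one x_j
        new = log_score(prop)
        if np.log(rng.uniform()) < new - cur:  # symmetric proposal, so accept
            delta, cur = prop, new             # with prob min(1, ratio)
        visits[tuple(delta)] += 1
    return visits
```

Ranking `visits` by count mimics the MCMC-visit column; comparing it against exact enumeration (as in the tables above) is one way to see the convergence and mixing issues raised in the conclusions.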
THANKS!

All references mentioned in this talk, and many more, are available online at
http://www.stat.ncsu.edu/bswg/