A Review of Bayesian Variable Selection
By: Brian J. Reich and Sujit Ghosh, North Carolina State University
E-mail: [email protected]
Slides, references, and BUGS code are on my webpage, http://www4.stat.ncsu.edu/~reich/

Why Bayesian variable selection?
- Bayesian variable selection methods come equipped with natural measures of uncertainty, such as the posterior probability of each model and the marginal inclusion probabilities of the individual predictors.
- Stochastic variable search is extremely flexible and has been applied in a wide range of settings.
- Model uncertainty can be incorporated into prediction through model averaging, which usually improves prediction.
- Given the model (likelihood, prior, and loss function), there are formal justifications for choosing a particular model.
- Missing data are easily handled by MCMC.

Drawbacks of Bayesian methods
Computational issues:
- For the usual regression setting, the "BMA" (Bayesian Model Averaging) package in R can be used for Bayesian variable selection.
- WinBUGS is a free, user-friendly software package that can be used for stochastic variable search.
Prior selection:
- Bayesian variable selection can be influenced by the prior.
- Priors can be chosen to mimic frequentist solutions (Yuan and Lin).
- Priors can be derived to have good large-sample properties (Ishwaran and Rao).

Notation
- Full model: yi = β0 + x1i β1 + ... + xpi βp + εi, where the εi are iid N(0, σ²).
- The objective is to find a subset of the predictors that fits the data well.
- Let γj = 1 if xj is in the model and γj = 0 otherwise.
- The vector of indicators γ = (γ1, ..., γp) represents the subset of predictors in the model.
- Xγ is the design matrix that includes only the covariates with γj = 1.
- The regression coefficients βγ are often given the conjugate Zellner g-prior βγ ~ N(0, g σ² (Xγ'Xγ)⁻¹).

Bayes factors
- One way to compare the models defined by γ(1) and γ(2) is with a Bayes factor,
  BF = [p(γ(1)|y)/p(γ(2)|y)] / [p(γ(1))/p(γ(2))], where
- p(γ(1)|y) ∝ ∫ p(γ(1), βγ(1), σ² | y) dβγ(1) dσ² is the posterior probability of the model specified by γ(1), and
- p(γ(1)) is the prior probability of the model specified by γ(1).

BIC approximation
- Schwarz (1978) showed that for large n,
  −2 log(BF) ≈ ΔBIC = W − log(n)(|γ(2)| − |γ(1)|), where
- ΔBIC is the change in BIC from model 1 to model 2,
- W = −2 log[ sup p(y | βγ(1), σ²) / sup p(y | βγ(2), σ²) ] is the usual likelihood ratio statistic, with each supremum taken over the corresponding model's parameters, and
- |γ| = Σj γj is the number of parameters in the model defined by γ.

BIC approximation
- The BIC approximation can be used to compute the posterior probability of each model.
- This is implemented by the "BMA" package of Raftery et al.
- The R command is bicreg(x, y), where x is the full n × p design matrix and y is the vector of outcomes.
- "BMA" also has functions for survival and generalized linear models.
- For a list of papers and software for Bayesian model averaging, visit the BMA homepage, http://www.research.att.com/~volinsky/bma.html.
- Example: subset of the Boston Housing data.

Boston Housing Data
Dependent variable (n = 100):
- MV: Median value of owner-occupied homes in $1000s.
Independent variables (p = 6):
- INDUS: Proportion of non-retail business acres per town.
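
As a concrete illustration, a minimal R sketch of this analysis follows. It is not the authors' code: it assumes the Boston data shipped with the MASS package (columns indus, nox, rm, tax, ptratio, lstat, medv), whereas the slides use an n = 100 subset with unstated preprocessing, so the output will not match the tables below exactly.

  # Sketch of the BMA/bicreg analysis described above (assumed data source: MASS::Boston).
  library(MASS)   # provides the Boston data frame
  library(BMA)    # provides bicreg()

  x <- scale(Boston[, c("indus", "nox", "rm", "tax", "ptratio", "lstat")])  # standardize predictors
  y <- Boston$medv                                                          # median home value

  fit <- bicreg(x, y)
  summary(fit)        # posterior means, sds, and P(coefficient != 0) for each predictor
  fit$postprob[1:5]   # posterior probabilities of the top few models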
- NOX: Nitric oxides concentration (parts per 10 million).
- RM: Average number of rooms per dwelling.
- TAX: Full-value property-tax rate per $10,000.
- PT: Pupil-teacher ratio.
- LSTAT: Proportion of the population of lower status.

Least squares solution
The results of the usual linear regression are:

  Variable   Estimate     SE   P-value
  INDUS         -0.31   0.30      0.30
  NOX           -0.61   0.31      0.05
  RM             3.35   0.33      0.00
  TAX           -0.45   0.28      0.12
  PT            -0.63   0.27      0.02
  LSTAT         -1.77   0.35      0.00

BMA solution
The results of the R command summary(bicreg(x, y)) are:

  Variable   Post mean   Post sd   Prob non-zero
  INDUS          -0.13      0.25            0.28
  NOX            -0.34      0.26            0.51
  RM              3.45      0.41            1.00
  TAX            -0.24      0.34            0.43
  PT             -0.68      0.33            0.86
  LSTAT          -1.93      0.38            1.00

The model with the highest posterior probability (prob = 0.20) has RM, PT, and LSTAT.

Stochastic variable selection
- A limitation of the Bayes factor approach is that for large p, computing the BIC for all 2^p models is infeasible.
- Madigan and Raftery (1994) limit the class of models under consideration to those with a minimum level of posterior support and propose a greedy-search algorithm to find models with high posterior probability.
- Stochastic variable selection (SVS) is an alternative.
- Here we use MCMC to draw samples from the model space, so models with high posterior probability are visited more often than low-probability models.
- We'll follow the notation and model of George and McCulloch (JASA, 1993).

George and McCulloch
- Likelihood: yi = β0 + x1i β1 + ... + xpi βp + εi, where the εi are iid N(0, σ²).
- Mixture prior for βj: βj ~ (1 − γj) N(0, τj²) + γj N(0, cj² τj²).
- The constant τj² is small, so that if γj = 0, βj "could be 'safely' estimated by 0".
- The constant cj² is large, so that if γj = 1, "a non-zero estimate of βj should probably be included in the final model".
- One of the priors they consider for the inclusion indicators is γj ~ Bernoulli(0.5).

Comments on SVS
- A similar SVS model that puts positive prior probability on βj = 0 is the "spike and slab" prior of Mitchell and Beauchamp (1988).
- One disadvantage of SVS when p is large is that in 50,000 MCMC iterations, the model with the highest posterior probability may be visited only a handful of times.
- SVS works well for computing the marginal inclusion probabilities of each covariate and for model averaging.
- SVS is computationally convenient and extremely flexible.

WinBUGS code

  # The likelihood:
  for(i in 1:100){
    MV[i] ~ dnorm(mu[i], taue)
    mu[i] <- int + INDUS[i]*b[1] + NOX[i]*b[2] + RM[i]*b[3] +
             TAX[i]*b[4] + PT[i]*b[5] + LSTAT[i]*b[6]
  }
  int ~ dnorm(0, 0.01)      # int has prior variance 1/0.01
  taue ~ dgamma(0.1, 0.1)   # taue is a precision (inverse variance)

  # The priors:
  for(j in 1:6){
    b[j] ~ dnorm(0, prec[j])
    prec[j] <- 1/var[j]
    var[j] <- (1 - gamma[j])*0.001 + gamma[j]*10   # spike variance 0.001, slab variance 10
    gamma[j] ~ dbern(0.5)
  }
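
For readers without WinBUGS, here is a hedged sketch of how essentially the same model could be fit from R via the rjags package (JAGS accepts much the same model language as WinBUGS). The data frame name boston, its column layout, and the chain settings are illustrative assumptions, not part of the original slides.

  # Assumed setup: 'boston' holds the six standardized predictors in columns 1:6 and the response MV.
  library(rjags)

  svs_model <- "
  model {
    for (i in 1:n) {
      MV[i] ~ dnorm(mu[i], taue)
      mu[i] <- int + inprod(X[i, ], b[])
    }
    int ~ dnorm(0, 0.01)
    taue ~ dgamma(0.1, 0.1)
    for (j in 1:p) {
      b[j] ~ dnorm(0, prec[j])
      prec[j] <- 1 / ((1 - gamma[j]) * 0.001 + gamma[j] * 10)  # spike/slab variances as above
      gamma[j] ~ dbern(0.5)
    }
  }"

  dat <- list(MV = boston$MV, X = as.matrix(boston[, 1:6]), n = nrow(boston), p = 6)
  jm  <- jags.model(textConnection(svs_model), data = dat, n.chains = 2)
  out <- coda.samples(jm, variable.names = c("b", "gamma"), n.iter = 50000)
  round(colMeans(as.matrix(out)), 2)  # posterior means of b[j] and inclusion probabilities gamma[j]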
SVS solution
The results of the SVS model from WinBUGS are:

  Variable   Post mean   Post sd   Prob non-zero
  INDUS          -0.13      0.26            0.27
  NOX            -0.34      0.41            0.48
  RM              3.44      0.34            1.00
  TAX            -0.22      0.32            0.39
  PT             -0.64      0.40            0.81
  LSTAT          -1.93      0.38            1.00

More about SVS
Choosing the priors for β and γ:
- Connection with the LASSO (Yuan and Lin, JASA, 2005).
- Rescaled spike and slab priors of Ishwaran and Rao (Annals of Statistics, 2005).
Applications of SVS:
- Gene selection using microarray data.
- Nonparametric regression for complex computer models.

Connection with penalized regression
- Yuan and Lin (JASA, 2005) show a connection between Bayesian variable selection and the LASSO penalized regression method.
- The LASSO solution is β̂ = argmin_β ( ||y − Xβ||² + λ Σj |βj| ).
- The L1 penalty on the regression coefficients encourages sparsity.
- Yuan and Lin's spike and slab prior for βj is βj ~ (1 − γj) δ(0) + γj DE(0, λ).
- δ(0) is the point-mass distribution at zero.
- DE(0, λ) is the double exponential distribution with density (λ/2) exp(−λ|x|).

Connection with penalized regression
- The prior for the model indicators is p(γ) ∝ π^|γ| (1 − π)^(p−|γ|) |Xγ'Xγ|^(1/2).
- |γ| = Σj γj is the number of predictors in the model.
- The factor |Xγ'Xγ|^(1/2) penalizes models with correlated predictors.
- Using a Laplace approximation, they show that the posterior mode of γ selects the same variables as the LASSO.
- There are fast algorithms to find the LASSO solution, so this helps find models with high posterior probability.
- Yuan and Lin also propose an empirical Bayes estimate for the tuning parameter λ.
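
The LASSO side of this correspondence is easy to compute in practice. Below is a hedged R sketch using the glmnet package; this is an illustrative assumption, not the algorithm in Yuan and Lin's paper, and the cross-validated λ it selects is not their empirical Bayes estimate. The objects x and y are the predictor matrix and response from the earlier bicreg sketch.

  # LASSO fit illustrating the selection side of the Yuan-Lin connection (sketch only).
  library(glmnet)

  cvfit <- cv.glmnet(x, y, alpha = 1)   # alpha = 1 gives the L1 (LASSO) penalty
  coef(cvfit, s = "lambda.min")         # coefficients shrunk exactly to zero are dropped from the model
  # The set of nonzero coefficients at a given lambda defines one model gamma; per the slide above,
  # the posterior mode of gamma under the spike-and-slab prior matches the LASSO selection.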
Rescaled spike and slab model of Ishwaran and Rao
- Under the usual SVS model, the effect of the prior vanishes as n increases and coefficients are rarely set to zero.
- Ishwaran and Rao propose a rescaled model that gives the prior a non-vanishing effect.
- The covariates are standardized so that Σi xji = 0 and Σi xji² = n.
- Let yi* = yi / √(σ̂²/n), where σ̂² is the OLS estimate of σ².
- The rescaled model is yi* = xi'β + εi with εi ~ N(0, σ² λn) (say, λn = n).
- σ² has an inverse gamma prior that does not depend on n.
- βj has a continuous bimodal prior similar to George and McCulloch's.

Properties of the rescaled spike and slab model
- If λn → ∞ and λn/n → 0, the posterior mean of β is consistent. While consistency is important, they choose λn = n for variable selection.
- With λn = n, the posterior mean of β, β̂, asymptotically maximizes the posterior, and they recommend selecting the Zcut model, which includes only the variables with |β̂j| > zα/2.
- They compare this rule with the OLS-hard rule, which includes only the variables whose OLS z-scores exceed zα/2 in absolute value.
- They argue that because the posterior mean of β pools information across several models, the Zcut model is better than the OLS-hard model.
- Under certain conditions, they show that the Zcut model has an oracle property relative to the OLS-hard model; that is, the Zcut model includes fewer unimportant variables.

Gene selection
- SVS has been used with microarray data to identify genes that distinguish between two types of breast cancer; for example, Lee et al. (including Marina Vannucci), Bioinformatics, 2003.
- yi indicates the type of breast cancer for patient i.
- xji measures the expression level of gene j for patient i.
- Probit model: yi = 1 if xi'β + εi > 0 and yi = 0 if xi'β + εi ≤ 0, where the εi are iid N(0, 1).
- Priors: βγ ~ N(0, c (Xγ'Xγ)⁻¹) and γj ~ Bern(π).

Variable selection for multivariate nonparametric regression
- For complex computer models that take a long time to run, scientists often run a moderate number of simulations and use nonparametric regression to emulate the computer model.
- Nonparametric regression: yi = f(x1i, ..., xpi) + εi.
- The goal is to estimate f.
- A common model for f is a Gaussian process prior with covariance function
  Cov[f(x1i, ..., xpi), f(x1j, ..., xpj)] = τ² exp(−ρ dij), where dij = √((x1i − x1j)² + ... + (xpi − xpj)²).
- Should all the variables contribute equally to the distance measure in the covariance function?

Variable selection for multivariate nonparametric regression
- One thing we (with Drs. Bondell and Storlie) are working on is an anisotropic covariance function that allows the variables to make different contributions to the distance measure, i.e.,
  τ² exp(−√(ρ1²(x1i − x1j)² + ... + ρp²(xpi − xpj)²)).
- A variable selection mixture prior for ρj is ρj ~ γj Unif(c1, c2) + (1 − γj) δ(c1),
- where c1 is chosen so that c1²(xji − xji')² ≈ 0 for all xji and xji'.
- The posterior mean of γj can be interpreted as the posterior probability that xj is in the model.
- If γ1 = ... = γp = 0, f is essentially smoothed to zero for all x.

Conclusions
- The Bayesian approach to variable selection has several attractive features, including straightforward quantification of variable importance and model uncertainty and the flexibility to handle missing data and non-Gaussian distributions.
- Outstanding issues include computational efficiency and concerns about priors.
- Hopefully we will see some interesting talks about these issues throughout the semester.
- THANKS!