Variable Selection in R
Fun with carets, elasticnets, and the Reverend Thomas Bayes …
Charles Ellis, MAi Research
Mitchell Killian, Ipsos Marketing
Marketing Analysts, LLC
Why do Variable Selection?
"Pluralitas non est ponenda sine necessitate" ("Plurality should not be posited without necessity")
• Overcoming the “Curse of Dimensionality” and developing more efficient data mining activities
▫ Identifying relevant features & discarding those that are not
▫ Enhancing the performance of data mining algorithms
▫ Better prediction/classification
• This applies to almost all fields, but especially those that are “data rich and theory poor” (e.g., Marketing)
2
Options for Tackling the Problem
• Many different approaches have been suggested … it is a growing field, and many of them are implemented in R:
▫ BMA
▫ rfe (in caret)
▫ glmnet
▫ stepPlr
▫ subselect
▫ varSelRF
▫ WilcoxCV
▫ clustvarsel
▫ party
▫ Boruta
▫ penalizedSVM
▫ spikeslab
▫ glmulti
▫ BMS
3
Options for Tackling the Problem
• Today we focus on three approaches, which range in degree of complexity and applicability.
▫ Recursive Feature Elimination (package: caret [Kuhn])
▫ Bayesian Model Averaging (package: BMA [Raftery et al.])
▫ Penalized regression (package: glmnet [Friedman et al.])
4
Recursive Feature Elimination
(with resampling)
• Implemented in the package caret
• The basic idea (from Kuhn, 2009)
▫ For each resampling unit (default is 10-fold cross-validation) do the following:
 Partition the data into training & test sets
 Train the model on the training set using all predictors
 Predict outcomes using the test data
 Calculate variable importance for all predictors
5
Recursive Feature Elimination (cont’d)
(with resampling)
 For each subset size Si to be considered, keep the Si most important variables
 Train the model on the training set using the Si predictors
 Predict outcomes using the test data
 [Optional] Recalculate the rankings for each predictor
▫ Calculate the performance profile over the Si predictors using the held-back samples
▫ Determine the appropriate number of predictors
▫ Fit the final model based on the optimal Si using the original training set
6
Recursive Feature Elimination
An Example
▫ Data set up (same across all examples)
 Hot Breakfast Cereal Category
 N = 310 consumers
 Outcome Variable – Overall Liking
 5-point scale
 Predictors – 31 Agree-Disagree statements measuring attitudes toward the package and its components
 5-point scale
 Outcome and predictors are treated as continuous (although they need not be)
 A minimal R sketch of this setup and the rfe() call follows this slide
7
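A minimal R sketch of this setup and the corresponding rfe() call. It assumes the survey responses sit in a data frame with the hypothetical name cereal, holding the 31 q4b_* statements plus an overall_liking column (both names are placeholders); rfeControl(), rfe(), and lmFuncs are caret's, and the candidate subset sizes mirror those in the results table that follows (the full 31-variable model is always evaluated as well).

library(caret)

x <- cereal[, paste0("q4b_", 1:31)]   # 31 agree-disagree predictors (assumed column names)
y <- cereal$overall_liking            # overall liking, treated as continuous (assumed column name)

ctrl <- rfeControl(functions = lmFuncs,   # linear-model fit/rank functions built into caret
                   method = "cv",         # 10-fold cross-validation as the resampling scheme
                   number = 10)

set.seed(123)                             # make the resampling reproducible
rfe_fit <- rfe(x, y,
               sizes = c(1:10, 15, 20, 25, 30),  # candidate subset sizes Si
               rfeControl = ctrl)
rfe_fit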
Recursive Feature Elimination
• Results:
Variables   RMSE     Rsquared   RMSE_SD   Rsquared_SD
 1          0.5291   0.1287     0.09067   0.1263
 2          0.5325   0.1315     0.09034   0.1268
 3          0.5291   0.1357     0.09426   0.1331
 4          0.5257   0.15       0.09098   0.1382
 5          0.5218   0.1565     0.09372   0.1234
 6          0.5269   0.1383     0.09406   0.1232
 7          0.5284   0.1357     0.09431   0.1352
 8          0.5299   0.1316     0.09338   0.1341
 9          0.5297   0.1283     0.09107   0.1266
10          0.5421   0.1213     0.09395   0.1295
15          0.5417   0.1236     0.097     0.1336
20          0.5469   0.114      0.09747   0.1354
25          0.547    0.1141     0.09804   0.1317
30          0.5471   0.1138     0.09796   0.1313
31          0.5471   0.1138     0.09796   0.1313
• The top 5 variables are: q4b_2, q4b_28, q4b_21, q4b_24, q4b_9
8
Recursive Feature Elimination
• Results (cont’d):
9
Recursive Feature Elimination
• Results (cont’d):
10
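A short sketch of how these results can be pulled from the fitted object; it assumes the rfe_fit object from the earlier sketch.

rfe_fit$results                     # RMSE / R-squared profile by subset size (the table on slide 8)
predictors(rfe_fit)                 # variables retained at the selected subset size
plot(rfe_fit, type = c("g", "o"))   # performance profile across subset sizes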
Bayesian Model Averaging
• Implemented in the package BMA (also BMS)
• The basic idea (from Hoeting et al., 1999)
▫ All models are wrong, some are useful (Box, 1987)
▫ Approach is to average over model uncertainty
▫ Average over the posterior distribution of any statistic (e.g., parameter estimates)
▫ Can be problematic for models with a large number of potential predictors
 For r predictors, the set of potential models is 2^r
 Occam’s Window – average over the subset of models that are supported by the data
 A minimal sketch of the corresponding bicreg() call follows this slide
11
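A minimal sketch of the BMA fit, reusing the x and y objects from the earlier data-setup sketch. bicreg() and the Occam's-window width OR belong to the BMA package; maxCol = 32 is an assumption added here so that all 31 predictors plus the intercept are kept in the search.

library(BMA)

bma_fit <- bicreg(x, y,
                  strict = FALSE,   # use Occam's window rather than the strict subset rule
                  OR = 20,          # width of Occam's window (package default)
                  maxCol = 32)      # allow all 31 predictors plus the intercept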
Bayesian Model Averaging
• Results:
Note the similarity in the predictors chosen across the 5 best models (compared to the rfe algorithm).

            p!=0      EV      SD   Model 1   Model 2   Model 3   Model 4   Model 5
Intercept  100.00    3.63    0.18     3.57      3.67      3.73      3.53      3.73
q4b_1        0.00    0.00    0.00        .         .         .         .         .
q4b_2       98.70    0.15    0.05     0.14      0.15      0.18      0.12      0.18
q4b_3        0.00    0.00    0.00        .         .         .         .         .
q4b_4        0.00    0.00    0.00        .         .         .         .         .
q4b_5        0.00    0.00    0.00        .         .         .         .         .
q4b_6        0.00    0.00    0.00        .         .         .         .         .
q4b_7        0.00    0.00    0.00        .         .         .         .         .
q4b_8        1.00    0.00    0.00        .         .         .         .         .
q4b_9       41.70   -0.03    0.05    -0.09         .     -0.07         .         .
q4b_10       0.00    0.00    0.00        .         .         .         .         .
q4b_11       0.00    0.00    0.00        .         .         .         .         .
q4b_12       0.00    0.00    0.00        .         .         .         .         .
q4b_13       0.00    0.00    0.00        .         .         .         .         .
q4b_14       1.20    0.00    0.00        .         .         .         .         .
q4b_15       7.50    0.01    0.02        .         .         .         .         .
q4b_16       0.00    0.00    0.00        .         .         .         .         .
q4b_17       0.00    0.00    0.00        .         .         .         .         .
q4b_18       0.00    0.00    0.00        .         .         .         .         .
q4b_19       0.00    0.00    0.00        .         .         .         .         .
q4b_20       0.00    0.00    0.00        .         .         .         .         .
q4b_21      24.10   -0.02    0.04        .         .         .         .     -0.08
q4b_22       0.00    0.00    0.00        .         .         .         .         .
q4b_23       1.20    0.00    0.01        .         .         .         .         .
q4b_24      42.00    0.04    0.06     0.11         .         .      0.08         .
q4b_25      25.60    0.02    0.04        .         .         .         .         .
q4b_26       2.10    0.00    0.01        .         .         .         .         .
q4b_27       0.00    0.00    0.00        .         .         .         .         .
q4b_28      88.40    0.10    0.05     0.11      0.10      0.13      0.08      0.14
q4b_29      14.10   -0.01    0.02        .         .         .         .         .
q4b_30       1.20    0.00    0.00        .         .         .         .         .
q4b_31       0.00    0.00    0.00        .         .         .         .         .

nVar                                     4         2         3         3         3
r2                                    0.16      0.13      0.14      0.14      0.14
BIC                                 -31.12    -30.88    -30.06    -29.53    -29.36
post prob                             0.11      0.10      0.07      0.05      0.05

12
Bayesian Model Averaging
• Results (cont’d):
13
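A sketch of the package's standard summaries, assuming bma_fit from the previous sketch; summary() yields a table like the one on slide 12, and imageplot.bma()/plot() produce BMA's usual graphical displays, likely the kind of figure on the preceding "Results (cont'd)" slide.

summary(bma_fit, n.models = 5)   # p!=0, EV, SD and the five best models
imageplot.bma(bma_fit)           # image of which predictors enter which models
plot(bma_fit)                    # posterior distributions of the coefficients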
Penalized Regression
• Implemented in the package glmnet
▫ glmnet is an extension/application of the elasticnet package (Zou & Hastie, 2008)
• The basic idea (from Friedman et al., 2010)
▫ Ridge regression – applies an adjustment (the “ridge”) to the coefficient estimates, allowing them to borrow from each other, and shrinks the coefficient values
▫ However, Ridge shrinks the coefficients of correlated predictors toward each other (in the extreme, toward equality) and never sets any of them to zero, which limits interpretation
▫ Additionally, there is no easy way to determine how to set the penalization parameter
14
Penalized Regression (cont’d)
• Lasso regression also adjusts the coefficients but tends to be “somewhat indifferent to very correlated predictors”
▫ Essentially turns coefficients on/off, elevating one variable over another
• Elastic Net – a compromise between Ridge and Lasso
▫ Averages the effects of highly correlated predictors to create a “weighted” contribution of each variable
 Lambda controls the overall strength of the penalty, i.e., how much the coefficients are shrunk
 Alpha mixes the ridge and lasso penalties and so influences the number of non-zero coefficients in the model
 Alpha=0 is Ridge Regression and Alpha=1 is the Lasso
▫ A minimal glmnet sketch follows this slide
15
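A minimal glmnet sketch, reusing x and y from the earlier data-setup sketch; glmnet expects a numeric matrix, cv.glmnet() selects lambda by cross-validation, and alpha = 0.75 matches one of the settings shown on the results slides that follow.

library(glmnet)

xm <- as.matrix(x)                        # glmnet requires a numeric predictor matrix
cv_fit <- cv.glmnet(xm, y, alpha = 0.75)  # cross-validation (default 10-fold) over the lambda path
coef(cv_fit, s = "lambda.min")            # coefficients at the best lambda; most are exactly zero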
Penalized Regression (cont’d)
[Figure: coefficient paths for Alpha=1, Alpha=0.2, and Alpha=0; at each step there is a unique value of lambda]
16
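A sketch that draws coefficient-path plots like those above for the three alpha settings shown, assuming the xm and y objects from the previous sketch.

op <- par(mfrow = c(1, 3))        # three panels: lasso, elastic net, ridge
for (a in c(1, 0.2, 0)) {
  fit <- glmnet(xm, y, alpha = a)
  plot(fit, xvar = "lambda")      # one path per predictor; alpha = 1 is the lasso, alpha = 0 is ridge
}
par(op)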
Penalized Regression
• Results: The impact of different parameterizations of alpha
[Figures: α = 0.75 and α = 0.10]
17
Penalized Regression
• Results:
Again, notice the similarity with respect to the predictors chosen by the other two algorithms.

(Intercept)   3.706082732
q4b_1         0
q4b_2         0.095195218
q4b_3         0
q4b_4         0
q4b_5         0
q4b_6         0
q4b_7         0
q4b_8         0.004512409
q4b_9        -0.030557283
q4b_10        0
q4b_11        0
q4b_12        0
q4b_13        0
q4b_14        0
q4b_15        0.021679874
q4b_16        0
q4b_17        0
q4b_18        0
q4b_19        0
q4b_20        0
q4b_21       -8.84E-09
q4b_22        0
q4b_23        0
q4b_24        0.058411084
q4b_25        0.033686624
q4b_26        0.002061599
q4b_27        0
q4b_28        0.068898885
q4b_29       -0.016100951
q4b_30        0
q4b_31        0

18
Questions?
19