Marketing Analysts, LLC

Variable Selection in R
Fun with carets, elasticnets, and the Reverend Thomas Bayes

Charles Ellis, MAi Research
Mitchell Killian, Ipsos Marketing

Why do Variable Selection?
"Pluralitas non est ponenda sine necessitate" (plurality should not be posited without necessity)
• Overcoming the "curse of dimensionality" and developing more efficient data mining
  ▫ Identifying relevant features and discarding those that are not
  ▫ Enhancing the performance of data mining algorithms
  ▫ Better prediction/classification
• This applies to almost all fields, but especially those that are "data rich and theory poor" (e.g., marketing)

Options for Tackling the Problem
• Many different approaches have been suggested; it is a growing field, and many of them are implemented in R:
  ▫ BMA
  ▫ rfe (in caret)
  ▫ glmnet
  ▫ stepPlr
  ▫ subselect
  ▫ varSelRF
  ▫ WilcoxCV
  ▫ clustvarsel
  ▫ party
  ▫ Boruta
  ▫ penalizedSVM
  ▫ spikeslab
  ▫ glmulti
  ▫ BMS
• Today we focus on three approaches, which range in degree of complexity and applicability:
  ▫ Recursive Feature Elimination (package: caret [Kuhn])
  ▫ Bayesian Model Averaging (package: BMA [Raftery et al.])
  ▫ Penalized regression (package: glmnet [Friedman et al.])

Recursive Feature Elimination (with resampling)
• Implemented in the package caret
• The basic idea (from Kuhn, 2009):
  ▫ For each resampling iteration (the default is 10-fold cross-validation):
      Partition the data into training and test sets
      Train the model on the training set using all predictors
      Predict outcomes using the test data
      Calculate variable importance for all predictors
      For each subset size S_i to be considered:
        Keep the S_i most important variables
        Train the model on the training set using those S_i predictors
        Predict outcomes using the test data
        [Optional] Recalculate the rankings for each predictor
  ▫ Calculate the performance profile over each S_i using the held-back samples
  ▫ Determine the appropriate number of predictors
  ▫ Fit the final model based on the optimal S_i using the original training set

Recursive Feature Elimination: An Example
• Data set-up (same across all examples):
  ▫ Hot breakfast cereal category; N = 310 consumers
  ▫ Outcome variable: overall liking (5-point scale)
  ▫ Predictors: 31 agree/disagree statements measuring attitudes toward the package and its components (5-point scale)
  ▫ Outcome and predictors are treated as continuous (although they need not be)

Recursive Feature Elimination
• Results (resampled RMSE and R-squared by subset size, with their standard deviations across resamples):

  Variables   RMSE     Rsquared   RMSE_SD   Rsquared_SD
  1           0.5291   0.1287     0.09067   0.1263
  2           0.5325   0.1315     0.09034   0.1268
  3           0.5291   0.1357     0.09426   0.1331
  4           0.5257   0.15       0.09098   0.1382
  5           0.5218   0.1565     0.09372   0.1234
  6           0.5269   0.1383     0.09406   0.1232
  7           0.5284   0.1357     0.09431   0.1352
  8           0.5299   0.1316     0.09338   0.1341
  9           0.5297   0.1283     0.09107   0.1266
  10          0.5421   0.1213     0.09395   0.1295
  15          0.5417   0.1236     0.097     0.1336
  20          0.5469   0.114      0.09747   0.1354
  25          0.547    0.1141     0.09804   0.1317
  30          0.5471   0.1138     0.09796   0.1313
  31          0.5471   0.1138     0.09796   0.1313

• The top 5 variables are: q4b_2, q4b_28, q4b_21, q4b_24, q4b_9
• Results (cont'd): [figures not reproduced]
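The slides report the rfe() output but not the call itself; the following is a minimal sketch of how such a run could be set up with caret, assuming the cereal data sit in a data frame named cereal with the liking score in overall_liking and the 31 statements in columns q4b_1 through q4b_31 (all object names here are illustrative, not taken from the original analysis).

    # Minimal caret RFE sketch (illustrative object names, not the authors' code)
    library(caret)

    set.seed(123)
    preds  <- cereal[, grep("^q4b_", names(cereal))]  # the 31 attitude statements
    liking <- cereal$overall_liking                   # outcome, treated as continuous

    ctrl <- rfeControl(functions = lmFuncs,  # linear model fit + importance ranking
                       method    = "cv",     # 10-fold cross-validation resampling
                       number    = 10)

    rfe_fit <- rfe(x = preds, y = liking,
                   sizes      = c(1:10, 15, 20, 25, 30),  # subset sizes to profile
                   rfeControl = ctrl)

    rfe_fit              # RMSE / R-squared profile by subset size
    predictors(rfe_fit)  # variables retained at the optimal subset size
    plot(rfe_fit)        # performance profile plot

lmFuncs pairs a linear regression with its built-in importance ranking; caret also ships alternatives such as rfFuncs if a random forest is preferred as the underlying model.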
Bayesian Model Averaging
• Implemented in the package BMA (also BMS)
• The basic idea (from Hoeting et al., 1999):
  ▫ All models are wrong, some are useful (Box, 1987)
  ▫ The approach is to average over model uncertainty
  ▫ Average over the posterior distribution of any statistic (e.g., parameter estimates)
  ▫ This can be problematic for models with a large number of potential predictors: for r predictors, the set of potential models is 2^r
  ▫ Occam's Window: average over the subset of models that are supported by the data

Bayesian Model Averaging
• Results: posterior probability that each coefficient is non-zero (p!=0, in %), posterior mean (EV) and standard deviation (SD), and the five best models. Note the similarity in the predictors chosen across the 5 best models (compared to the rfe algorithm).

  Variable    p!=0     EV      SD     Model 1   Model 2   Model 3   Model 4   Model 5
  Intercept   100.00    3.63   0.18     3.57      3.67      3.73      3.53      3.73
  q4b_1         0.00    0.00   0.00     .         .         .         .         .
  q4b_2        98.70    0.15   0.05     0.14      0.15      0.18      0.12      0.18
  q4b_3         0.00    0.00   0.00     .         .         .         .         .
  q4b_4         0.00    0.00   0.00     .         .         .         .         .
  q4b_5         0.00    0.00   0.00     .         .         .         .         .
  q4b_6         0.00    0.00   0.00     .         .         .         .         .
  q4b_7         0.00    0.00   0.00     .         .         .         .         .
  q4b_8         1.00    0.00   0.00     .         .         .         .         .
  q4b_9        41.70   -0.03   0.05    -0.09      .        -0.07      .         .
  q4b_10        0.00    0.00   0.00     .         .         .         .         .
  q4b_11        0.00    0.00   0.00     .         .         .         .         .
  q4b_12        0.00    0.00   0.00     .         .         .         .         .
  q4b_13        0.00    0.00   0.00     .         .         .         .         .
  q4b_14        1.20    0.00   0.00     .         .         .         .         .
  q4b_15        7.50    0.01   0.02     .         .         .         .         .
  q4b_16        0.00    0.00   0.00     .         .         .         .         .
  q4b_17        0.00    0.00   0.00     .         .         .         .         .
  q4b_18        0.00    0.00   0.00     .         .         .         .         .
  q4b_19        0.00    0.00   0.00     .         .         .         .         .
  q4b_20        0.00    0.00   0.00     .         .         .         .         .
  q4b_21       24.10   -0.02   0.04     .         .         .         .        -0.08
  q4b_22        0.00    0.00   0.00     .         .         .         .         .
  q4b_23        1.20    0.00   0.01     .         .         .         .         .
  q4b_24       42.00    0.04   0.06     0.11      .         .         0.08      .
  q4b_25       25.60    0.02   0.04     .         .         .         .         .
  q4b_26        2.10    0.00   0.01     .         .         .         .         .
  q4b_27        0.00    0.00   0.00     .         .         .         .         .
  q4b_28       88.40    0.10   0.05     0.11      0.10      0.13      0.08      0.14
  q4b_29       14.10   -0.01   0.02     .         .         .         .         .
  q4b_30        1.20    0.00   0.00     .         .         .         .         .
  q4b_31        0.00    0.00   0.00     .         .         .         .         .
  nVar                                  4         2         3         3         3
  r2                                    0.16      0.13      0.14      0.14      0.14
  BIC                                 -31.12    -30.88    -30.06    -29.53    -29.36
  post prob                             0.11      0.10      0.07      0.05      0.05

• Results (cont'd): [figure not reproduced]

Penalized Regression
• Implemented in the package glmnet
  ▫ glmnet is an extension/application of the elasticnet package (Zou & Hastie, 2008)
• The basic idea (from Friedman et al., 2010):
  ▫ Ridge regression applies an adjustment (the "ridge") to the coefficient estimates, allowing them to borrow from each other, and shrinks the coefficient values.
  ▫ However, ridge aggressively shrinks coefficients to be equal to each other, allowing for no meaningful interpretation.
  ▫ Additionally, there is no easy way to determine how to set the penalization parameter.

Penalized Regression (cont'd)
• Lasso regression also adjusts the coefficients but tends to be "somewhat indifferent to very correlated predictors"
  ▫ It essentially turns coefficients on/off, elevating one variable over another
• Elastic Net: a compromise between ridge and lasso
  ▫ Averages the effects of highly correlated predictors to create a "weighted" contribution of each variable
  ▫ Lambda, a ridge regression penalty, shrinks coefficients toward each other
  ▫ Alpha influences the number of non-zero coefficients in the model: alpha = 0 is ridge regression and alpha = 1 is the lasso

Penalized Regression (cont'd)
• Plots for alpha = 1, alpha = 0.2, and alpha = 0; at each step there is a unique value of lambda [figures not reproduced]

Penalized Regression
• Results: the impact of different parameterizations of alpha (α = 0.75 and α = 0.10) [figures not reproduced]

Penalized Regression
• Results: again, notice the similarity of the predictors chosen with respect to the other two algorithms

  Variable       Coefficient
  (Intercept)     3.706082732
  q4b_1           0
  q4b_2           0.095195218
  q4b_3           0
  q4b_4           0
  q4b_5           0
  q4b_6           0
  q4b_7           0
  q4b_8           0.004512409
  q4b_9          -0.030557283
  q4b_10          0
  q4b_11          0
  q4b_12          0
  q4b_13          0
  q4b_14          0
  q4b_15          0.021679874
  q4b_16          0
  q4b_17          0
  q4b_18          0
  q4b_19          0
  q4b_20          0
  q4b_21         -8.84E-09
  q4b_22          0
  q4b_23          0
  q4b_24          0.058411084
  q4b_25          0.033686624
  q4b_26          0.002061599
  q4b_27          0
  q4b_28          0.068898885
  q4b_29         -0.016100951
  q4b_30          0
  q4b_31          0
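The fitting code for these two approaches is likewise not shown in the slides; below is a hedged sketch of how the BMA and elastic net fits could be reproduced, reusing the illustrative preds and liking objects from the caret sketch above. The alpha value of 0.75 is one of the settings discussed earlier, not necessarily the one behind the coefficient table.

    # Sketch only: illustrative object names, and settings that are assumptions
    library(BMA)
    library(glmnet)

    # Bayesian model averaging over the 31 predictors (Occam's window via bicreg)
    bma_fit <- bicreg(x = preds, y = liking)
    summary(bma_fit)       # p!=0, EV, SD, and the best models by posterior probability
    imageplot.bma(bma_fit) # which predictors enter which models

    # Elastic net: cross-validate lambda for a chosen alpha
    x_mat  <- as.matrix(preds)
    cv_fit <- cv.glmnet(x_mat, liking, alpha = 0.75)
    plot(cv_fit)                    # cross-validated error along the lambda path
    coef(cv_fit, s = "lambda.min")  # coefficient vector at the selected lambda

cv.glmnet() selects lambda by cross-validation; extracting coefficients at lambda.min (or the more conservative lambda.1se) yields a sparse coefficient vector analogous to the table above.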
Questions?