CAS Predictive Modeling Seminar
Practical Issues in Model Design
Chuck Boucek
(312) 879-3859

# Overview

• Data usually does not seamlessly fit into model assumptions
• The focus of this presentation is the impact that selected issues have on the design matrix
• Agenda
  – Overview of the Design Matrix
  – Non-linearity in predictors
  – Missing data

# What is the Design Matrix?

• Representation of the predictor variables used to construct the model

Data

| Class | State | AOI | Pop Density |
|-------|-------|-----|-------------|
| 65198 | MA    | 125 | .033        |
| 65198 | IL    | 235 | .032        |
| 70446 | MA    | 240 | .034        |
| 70446 | FL    | 350 | .044        |
| 64446 | MA    | 100 | .023        |
| 64446 | IN    | 110 | .025        |

Design Matrix

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Pop Density |
|-----------|-------------|-------------|-------|-----|-------------|
| 1         | 0           | 0           | 1     | 125 | .033        |
| 1         | 0           | 0           | 0     | 235 | .032        |
| 1         | 1           | 0           | 1     | 240 | .034        |
| 1         | 1           | 0           | 0     | 350 | .044        |
| 1         | 0           | 1           | 1     | 100 | .023        |
| 1         | 0           | 1           | 0     | 110 | .025        |

# How is GLM Fit to Data?

Design Matrix × Coefficients = Linear Predictors

$$
X \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \end{pmatrix}
= \begin{pmatrix} LP_1 \\ LP_2 \\ LP_3 \\ LP_4 \\ LP_5 \\ LP_6 \end{pmatrix}
$$

where X is the 6 × 6 design matrix from the previous slide.

• Linear predictors are transformed to an estimate of the response data via the inverse link function
• Family and link function determine the form of the MLE
  – Family: Gaussian, Link: identity, MLE:

$$
l(y) = -\frac{n}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2
$$

  Here $\beta_j$ denotes the coefficients (written $a_1, \dots, a_6$ in the matrix equation above) and $x_{ij}$ the entries of the design matrix.

# Non-Linearity – Description of Issue

[Figure: plot with vertical axis from 0.980 to 1.005 and horizontal axis from 0 to 30]

• GLMs fit linear patterns to data
• Produces poor fit for certain predictor variables
• Splines can address non-linearity within a GLM

# Natural Cubic Spline Characteristics

[Figure: plot with vertical axis from 0.980 to 1.005 and horizontal axis from 0 to 30]

• 3rd degree polynomial between the knots
• Continuous value, first and second derivative at the knots
• Linear outside of the boundary knots

# GLM with a Natural Spline

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Spline 1 | Spline 2 | Pop Density |
|-----------|-------------|-------------|-------|-----|----------|----------|-------------|
| 1         | 0           | 0           | 1     | 125 | 0.6      | 0.0      | .033        |
| 1         | 0           | 0           | 0     | 235 | 98.4     | 66.1     | .032        |
| 1         | 1           | 0           | 1     | 240 | 109.8    | 75.1     | .034        |
| 1         | 1           | 0           | 0     | 350 | 497.3    | 401.3    | .044        |
| 1         | 0           | 1           | 1     | 100 | 0.0      | 0.0      | .023        |
| 1         | 0           | 1           | 0     | 110 | 0.0      | 0.0      | .025        |

This matrix is multiplied by the coefficients $a_1, \dots, a_8$ to give the linear predictors $LP_1, \dots, LP_6$.

• Two columns are added to the design matrix
  – These columns are the spline basis
  – Two additional coefficients are needed
• GLM is fit with the same MLE and link function

# GLM with Natural Spline

[Figure: plot with vertical axis from 0.980 to 1.005 and horizontal axis from 0 to 30]

• Proper reasonability testing
  – Statistical Significance
  – Time Consistency Plot

# Time Consistency Plot

[Figure: time consistency plot with one panel per year (1997, 1998, 1999, 2000); vertical axis from 0.99 to 1.01, horizontal axis from 0 to 25]

# Missing Data – Description of Issue

• Missing data can present unique challenges in model creation

Data

| Class | State | AOI | Pop Density |
|-------|-------|-----|-------------|
| 65198 | MA    | 125 | .033        |
| 65198 | IL    | 235 | .032        |
| 70446 | MA    | 240 | .034        |
| 70446 | FL    | 350 | .044        |
| 64446 | MA    | 100 | .023        |
| 64446 | IN    | 110 | NA          |

Design Matrix

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Pop Density |
|-----------|-------------|-------------|-------|-----|-------------|
| 1         | 0           | 0           | 1     | 125 | .033        |
| 1         | 0           | 0           | 0     | 235 | .032        |
| 1         | 1           | 0           | 1     | 240 | .034        |
| 1         | 1           | 0           | 0     | 350 | .044        |
| 1         | 0           | 1           | 1     | 100 | .023        |
| 1         | 0           | 1           | 0     | 110 | NA          |

• What methodologies exist for addressing missing data?
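To make the question concrete, here is a minimal sketch (not taken from the slides) of how a single missing entry affects the linear predictor. It rebuilds the six-row design matrix from the Missing Data slide with NumPy. The values in `coef` are hypothetical placeholders chosen only for illustration, and an identity link is assumed.

```python
import numpy as np

# Design matrix from the slides: intercept, two class indicators,
# MA state indicator, AOI, and population density (last value missing).
X = np.array([
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, np.nan],
])

# Hypothetical coefficients a1..a6 (illustrative values, not fitted).
coef = np.array([0.5, 0.1, -0.2, 0.05, 0.001, 2.0])

# Linear predictor: design matrix times coefficients.
linear_predictor = X @ coef

# With an identity link the estimate equals the linear predictor,
# so any row containing NaN yields no usable estimate.
rows_without_estimate = np.isnan(linear_predictor)
print(rows_without_estimate)
```

Only the sixth row is flagged: the NaN in Pop Density propagates through the matrix product, which is why the row must be either dropped or filled in before the model can use it.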
# Methodology #1

• Listwise Deletion: Eliminate any row in the design matrix with missing values

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Pop Density |
|-----------|-------------|-------------|-------|-----|-------------|
| 1         | 0           | 0           | 1     | 125 | .033        |
| 1         | 0           | 0           | 0     | 235 | .032        |
| 1         | 1           | 0           | 1     | 240 | .034        |
| 1         | 1           | 0           | 0     | 350 | .044        |
| 1         | 0           | 1           | 1     | 100 | .023        |

# Methodology #2

• Mean Imputation: Replace missing values with the mean of the values where data is present

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Pop Density |
|-----------|-------------|-------------|-------|-----|-------------|
| 1         | 0           | 0           | 1     | 125 | .033        |
| 1         | 0           | 0           | 0     | 235 | .032        |
| 1         | 1           | 0           | 1     | 240 | .034        |
| 1         | 1           | 0           | 0     | 350 | .044        |
| 1         | 0           | 1           | 1     | 100 | .023        |
| 1         | 0           | 1           | 0     | 110 | .033        |

# Methodology #3

• Linear Mean Imputation: Create the spline basis excluding missing values, then mean impute on the spline basis

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Pop Density | Pop Density spline 1 | Pop Density spline 2 |
|-----------|-------------|-------------|-------|-----|-------------|----------------------|----------------------|
| 1         | 0           | 0           | 1     | 125 | .033        | .109                 | .359                 |
| 1         | 0           | 0           | 0     | 235 | .032        | .102                 | .328                 |
| 1         | 1           | 0           | 1     | 240 | .034        | .116                 | .393                 |
| 1         | 1           | 0           | 0     | 350 | .044        | .194                 | .852                 |
| 1         | 0           | 1           | 1     | 100 | .023        | .053                 | .122                 |
| 1         | 0           | 1           | 0     | 110 | .033        | .115                 | .411                 |

The three Pop Density entries in the last row are the column means of the five observed rows.

# Methodology #4

• Single Imputation: Use other predictor variables to build a model and impute missing values
  – Example: Model Pop Density based on AOI

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Pop Density |
|-----------|-------------|-------------|-------|-----|-------------|
| 1         | 0           | 0           | 1     | 125 | .033        |
| 1         | 0           | 0           | 0     | 235 | .025        |
| 1         | 1           | 0           | 1     | 240 | .034        |
| 1         | 1           | 0           | 0     | 350 | .044        |
| 1         | 0           | 1           | 1     | 100 | .023        |
| 1         | 0           | 1           | 0     | 110 | .027        |

# Methodology #5

• Multiple Imputation: Use other predictor variables to model missing values
  – Multiple imputations are created based on the distribution of residuals in the estimates of missing values

The slide shows three imputed copies of the design matrix. They differ only in the imputed Pop Density value of the last row:

| Intercept | Class 70446 | Class 64446 | ST MA | AOI | Pop Density (imputation 1) | Pop Density (imputation 2) | Pop Density (imputation 3) |
|-----------|-------------|-------------|-------|-----|----------------------------|----------------------------|----------------------------|
| 1         | 0           | 0           | 1     | 125 | .033                       | .033                       | .033                       |
| 1         | 0           | 0           | 0     | 235 | .025                       | .025                       | .025                       |
| 1         | 1           | 0           | 1     | 240 | .034                       | .034                       | .034                       |
| 1         | 1           | 0           | 0     | 350 | .044                       | .044                       | .044                       |
| 1         | 0           | 1           | 1     | 100 | .023                       | .023                       | .023                       |
| 1         | 0           | 1           | 0     | 110 | .027                       | .029                       | .025                       |

# Multiple Imputation Process

1. Choose starting values for the mean and covariance matrix of the predictor variables
2. Use the mean and covariance matrix to estimate regression parameters
3. Use the regression parameters to estimate missing values. Add a random draw from the residual normal distribution for that variable
4. Use the resulting data set to compute a new mean and covariance matrix
5. Make a random draw from the posterior distribution of the means and covariances
6. Using the random draw from step 5, go back and cycle through the process until convergence is achieved

# Multiple Imputation Process

• Assumptions underlying multiple imputation algorithms
  – Data is Missing At Random: Missingness of predictor variable "V" cannot depend on the value of "V" but can depend on the values of other predictor variables.
  – Data follows a multivariate normal distribution
• Two issues that must be addressed
  – Initial convergence of iterations
  – Correlation of consecutive iterations

# Time Series Plot

• Initial convergence is assessed via a time series plot

[Figure: time series plot; horizontal axis "Iteration Number" from 0 to 100, vertical axis from 0.84 to 0.94]

# Auto Correlation Plot

• Spread between iterations is assessed via an autocorrelation plot

[Figure: autocorrelation plot; horizontal axis "lag" from 0 to 100, vertical axis "ACF" from 0.0 to 1.0]

# Testing of Missing Value Methods

• Method #1
  – Created training and holdout data sets
    • Both contained missing data
  – Built models of claim frequency under different missing value analysis methods with the training data set
    • Identical predictor variables in all models
  – Compared results (deviance) of the methods in the data set where all data is present

# Testing of Missing Value Options

• Method #2
  – Created a model of missing probability
  – Limited the modeling database to observations in which all data was present
  – Randomly generated missing values based on the missing probability
    • 100 iterations
  – Built models of claim frequency under different missing value analysis methods
    • Identical predictor variables in all models
  – Compared results (deviance) of the methods in the data set where all data is present

# Performance of Missing Value Methods

1. Single Imputation/Multiple Imputation
2. Linear Mean Imputation
3. Mean Imputation
4. Listwise Deletion

# Missing Data Framework

• Questions
  – What is the level of missing data?
  – What can be inferred about the missing data mechanism?
  – What is the size of the modeling database in which all values are present?
  – Will the data continue to be missing when the model is applied?
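The first and third questions can be answered directly from the modeling data. The following is a minimal sketch, assuming the predictors sit in a NumPy array with NaN marking missing entries. It reuses the six-row example from the earlier slides, and the variable names are illustrative rather than taken from the presentation.

```python
import numpy as np

# Predictor columns from the slide example: AOI and population density.
# NaN marks the missing population density in the last row.
predictors = np.array([
    [125, 0.033],
    [235, 0.032],
    [240, 0.034],
    [350, 0.044],
    [100, 0.023],
    [110, np.nan],
])
column_names = ["AOI", "Pop Density"]

missing = np.isnan(predictors)

# Level of missing data: share of missing entries per predictor.
missing_share = dict(zip(column_names, missing.mean(axis=0)))

# Size of the modeling database in which all values are present.
complete_rows = int((~missing.any(axis=1)).sum())

print(missing_share)
print(complete_rows)
```

In this toy example one of six Pop Density values is missing and five complete rows remain. The other two questions (the missingness mechanism and whether data will still be missing at application time) need knowledge of how the data was collected and cannot be read off the array alone.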
# Missing Data Framework

• Actions
  – For low proportions of missing data: Listwise Deletion
  – For higher proportions of missing data in a large modeling database: Listwise Deletion with oversampling
  – For mid to small modeling databases: employ imputation
    • Initial exploration with Linear Mean Imputation
    • Fit final model with Single Imputation or Multiple Imputation

# Sources

• Splines
  – Hastie, Tibshirani and Friedman: The Elements of Statistical Learning
• Missing Data
  – Paul Allison: Missing Data
  – J.L. Schafer: Analysis of Incomplete Multivariate Data
  – Insightful Corporation: Analyzing Data with Missing Values in S-Plus
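As a supplementary illustration of three of the methodologies covered in this deck (listwise deletion, mean imputation and single imputation), the sketch below applies them to the six-row example. It is not the presenter's code. The single imputation step assumes a straight-line regression of Pop Density on AOI fitted with `numpy.polyfit`, since the slides do not state the form of the imputation model.

```python
import numpy as np

aoi = np.array([125.0, 235.0, 240.0, 350.0, 100.0, 110.0])
pop_density = np.array([0.033, 0.032, 0.034, 0.044, 0.023, np.nan])
observed = ~np.isnan(pop_density)

# Methodology #1, listwise deletion: drop rows with a missing value.
aoi_deleted = aoi[observed]
density_deleted = pop_density[observed]

# Methodology #2, mean imputation: fill with the mean of observed values.
mean_fill = pop_density[observed].mean()
density_mean_imputed = np.where(observed, pop_density, mean_fill)

# Methodology #4, single imputation: model Pop Density from AOI.
# Assumption: a straight-line fit; the slides do not specify the model.
slope, intercept = np.polyfit(aoi[observed], pop_density[observed], 1)
density_single_imputed = np.where(observed, pop_density, intercept + slope * aoi)

print(len(density_deleted))
print(round(float(density_mean_imputed[-1]), 3))
print(round(float(density_single_imputed[-1]), 3))
```

Rounded to three decimals, the filled-in values for the last row are 0.033 under mean imputation and 0.027 under the assumed straight-line model, which agrees with the values shown on the Methodology #2 and Methodology #4 slides. Multiple imputation would repeat the regression step with a random residual draw added each time, producing several such filled-in data sets.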