Missing Data

CAS Predictive Modeling Seminar
Practical Issues in Model Design
Chuck Boucek (312) 879-3859
Overview
• Data usually does not seamlessly fit into model
assumptions
• The focus of this presentation is the impact that
selected issues have on the design matrix
• Agenda
– Overview of the Design Matrix
– Non-linearity in predictors
– Missing data
What is the Design Matrix?
Design Matrix
Non-Linearity
Missing Data
• Representation of the predictor variables used to
construct model
Data

Class  State  AOI  Pop Density
65198  MA     125  .033
65198  IL     235  .032
70446  MA     240  .034
70446  FL     350  .044
64446  MA     100  .023
64446  IN     110  .025

Design Matrix

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  .025
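The mapping from the data to the design matrix can be sketched in a few lines of Python. This is only an illustration: it assumes class 65198 and any state other than MA are the base levels (inferred from the indicator columns on the slide), and the helper name `design_row` is hypothetical.

```python
# Raw data rows from the slide: (class, state, AOI, pop density).
rows = [
    ("65198", "MA", 125, 0.033),
    ("65198", "IL", 235, 0.032),
    ("70446", "MA", 240, 0.034),
    ("70446", "FL", 350, 0.044),
    ("64446", "MA", 100, 0.023),
    ("64446", "IN", 110, 0.025),
]

def design_row(cls, state, aoi, density):
    """Return [intercept, class 70446, class 64446, ST MA, AOI, pop density].

    Assumption: class 65198 and non-MA states are the base levels.
    """
    return [
        1,                      # intercept
        int(cls == "70446"),    # indicator for class 70446
        int(cls == "64446"),    # indicator for class 64446
        int(state == "MA"),     # indicator for state MA
        aoi,                    # continuous predictor, used as is
        density,                # continuous predictor, used as is
    ]

X = [design_row(*r) for r in rows]
for line in X:
    print(line)
```

Categorical predictors become indicator columns, while continuous predictors enter the matrix unchanged.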
How is GLM Fit to Data?
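Design Matrix x Coefficients = Linear Predictors

Columns of the design matrix: Intercept, Class (two indicator columns), ST MA, AOI, Pop Density.

$$
\begin{bmatrix}
1 & 0 & 0 & 1 & 125 & .033 \\
1 & 0 & 0 & 0 & 235 & .032 \\
1 & 1 & 0 & 1 & 240 & .034 \\
1 & 1 & 0 & 0 & 350 & .044 \\
1 & 0 & 1 & 1 & 100 & .023 \\
1 & 0 & 1 & 0 & 110 & .025
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \end{bmatrix}
=
\begin{bmatrix} LP_1 \\ LP_2 \\ LP_3 \\ LP_4 \\ LP_5 \\ LP_6 \end{bmatrix}
$$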
• Linear predictors are transformed into estimates of the response via the
inverse link function
• Family and link function determine form of MLE
– Family: Gaussian, Link: identity, MLE:

$$
l(y) = \sum_{i=1}^{n} \left[ -\frac{\left( y_i - \sum_{j=1}^{p} x_{ij} a_j \right)^2}{2\sigma^2} - \frac{1}{2} \ln\left( 2\pi\sigma^2 \right) \right]
$$
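As a rough illustration of the Gaussian, identity-link case, the sketch below evaluates the log-likelihood and finds the coefficients that maximize it, which for this family and link is ordinary least squares. The response values `y` are invented for illustration, the design matrix is cut down to an intercept and AOI, and the function name `gaussian_loglik` is hypothetical.

```python
import numpy as np

def gaussian_loglik(y, X, a, sigma2):
    """Gaussian log-likelihood with identity link for coefficients a."""
    resid = y - X @ a
    n = len(y)
    return -np.sum(resid ** 2) / (2 * sigma2) - 0.5 * n * np.log(2 * np.pi * sigma2)

# Reduced design matrix: intercept and AOI only.
X = np.array([[1, 125], [1, 235], [1, 240], [1, 350], [1, 100], [1, 110]], dtype=float)
# Hypothetical response values, invented for illustration only.
y = np.array([1.1, 2.0, 2.3, 3.2, 0.9, 1.2])

# Maximizing the log-likelihood over the coefficients is a least-squares problem.
a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2_hat = np.mean((y - X @ a_hat) ** 2)  # maximum likelihood estimate of the variance

print(a_hat, gaussian_loglik(y, X, a_hat, sigma2_hat))
```

Other families and links change the form of the log-likelihood and generally require an iterative fit rather than a single least-squares solve.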
Non Linearity – Description of Issue
[Figure: plot; y-axis 0.980 to 1.005, x-axis 0 to 30]
• GLMs fit patterns that are linear in each predictor on the scale of the linear predictor
• This produces a poor fit for predictor variables whose effect is non-linear
• Splines can address non-linearity within a GLM
Natural Cubic Spline Characteristics
[Figure: plot; y-axis 0.980 to 1.005, x-axis 0 to 30]
• 3rd degree polynomial between the knots
• Continuous value, first and second derivative at the knots
• Linear outside of the boundary knots
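The three characteristics above can be checked numerically with a small basis function. The sketch below uses the truncated power form of the natural cubic spline basis described in Hastie, Tibshirani and Friedman; the knot locations and the name `natural_cubic_basis` are assumptions for illustration, not values taken from the slides.

```python
import numpy as np

def natural_cubic_basis(x, knots):
    """Natural cubic spline basis (truncated power form) evaluated at x.

    Returns len(knots) columns: 1, x, and len(knots) - 2 cubic terms
    that are linear beyond the boundary knots.
    """
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)
    K = len(k)

    def d(j):
        # Scaled difference of truncated cubics: knot j versus the last knot.
        return (np.maximum(x - k[j], 0) ** 3 - np.maximum(x - k[-1], 0) ** 3) / (k[-1] - k[j])

    cols = [np.ones_like(x), x]
    cols += [d(j) - d(K - 2) for j in range(K - 2)]
    return np.column_stack(cols)

knots = [0, 10, 20, 30]  # hypothetical knot locations

# Beyond the upper boundary knot, second differences on an even grid vanish,
# which is what "linear outside of the boundary knots" means.
B = natural_cubic_basis([40, 50, 60], knots)
print(B[0] - 2 * B[1] + B[2])
```

Between the knots the cubic columns are piecewise third-degree polynomials, and the truncated cubic terms keep the value, first derivative and second derivative continuous at each knot.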
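GLM with a Natural Spline

Columns of the design matrix: Intercept, Class (two indicator columns), ST MA, AOI, spline basis (two columns), Pop Density.

$$
\begin{bmatrix}
1 & 0 & 0 & 1 & 125 & 0.6 & 0.0 & .033 \\
1 & 0 & 0 & 0 & 235 & 98.4 & 66.1 & .032 \\
1 & 1 & 0 & 1 & 240 & 109.8 & 75.1 & .034 \\
1 & 1 & 0 & 0 & 350 & 497.3 & 401.3 & .044 \\
1 & 0 & 1 & 1 & 100 & 0.0 & 0.0 & .023 \\
1 & 0 & 1 & 0 & 110 & 0.0 & 0.0 & .025
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \end{bmatrix}
=
\begin{bmatrix} LP_1 \\ LP_2 \\ LP_3 \\ LP_4 \\ LP_5 \\ LP_6 \end{bmatrix}
$$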
• Two columns are added to the design matrix
– These columns are the spline basis
– Two additional coefficients are needed
• GLM is fit with same MLE and link function
GLM with a Natural Spline
[Figure: plot; y-axis 0.980 to 1.005, x-axis 0 to 30]
• Proper reasonability testing
– Statistical Significance
– Time Consistency Plot
Time Consistency Plot
[Figure: time consistency plot; four panels, one per year (1997, 1998, 1999, 2000); x-axis 0 to 25, y-axis 0.99 to 1.01]
Missing Data – Description of Issue
• Missing data can present unique challenges in model
creation
Data

Class  State  AOI  Pop Density
65198  MA     125  .033
65198  IL     235  .032
70446  MA     240  .034
70446  FL     350  .044
64446  MA     100  .023
64446  IN     110  NA

Design Matrix

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  NA
• What methodologies exist for addressing missing
data?
Methodology #1
• Listwise Deletion: Eliminate any row in the design
matrix with missing values
Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
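Listwise deletion is simple to express in code. The sketch below is a minimal illustration that uses NaN as a stand-in for the slide's NA; the helper name `listwise_delete` is hypothetical.

```python
import math

NA = float("nan")  # stand-in for the slide's NA

# Columns: intercept, class 70446, class 64446, ST MA, AOI, pop density.
X = [
    [1, 0, 0, 1, 125, 0.033],
    [1, 0, 0, 0, 235, 0.032],
    [1, 1, 0, 1, 240, 0.034],
    [1, 1, 0, 0, 350, 0.044],
    [1, 0, 1, 1, 100, 0.023],
    [1, 0, 1, 0, 110, NA],
]

def listwise_delete(matrix):
    """Keep only the rows that have no missing (NaN) entries."""
    return [row for row in matrix if not any(math.isnan(v) for v in row)]

X_complete = listwise_delete(X)
print(len(X), "rows ->", len(X_complete), "rows")
```

The whole row is dropped, so the observed Class, State and AOI values for that record are lost along with the missing Pop Density.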
Methodology #2
• Mean Imputation: Replace missing values with mean
of values where data is present
Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .032
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  .033
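A minimal sketch of mean imputation on the Pop Density column follows. NaN stands in for the missing value and the helper name `mean_impute` is hypothetical.

```python
import math

NA = float("nan")  # stand-in for the missing value
pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, NA]

def mean_impute(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if math.isnan(v) else v for v in values]

imputed = mean_impute(pop_density)
print(round(imputed[-1], 3))  # 0.033
```

The mean of the five observed values is .0332, which rounds to the .033 shown in the last row of the slide's matrix.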
Methodology #3
• Linear Mean Imputation: Create the spline basis from the non-missing
values, then mean impute each column of the spline basis
Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density  Spline 1  Spline 2
1          0            0            1      125  .033         .109      .359
1          0            0            0      235  .032         .102      .328
1          1            0            1      240  .034         .116      .393
1          1            0            0      350  .044         .194      .852
1          0            1            1      100  .023         .053      .122
1          0            1            0      110  NA           NA        NA
• After mean imputation, each missing entry holds the mean of the observed values in its column

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density  Spline 1  Spline 2
1          0            0            1      125  .033         .109      .359
1          0            0            0      235  .032         .102      .328
1          1            0            1      240  .034         .116      .393
1          1            0            0      350  .044         .194      .852
1          0            1            1      100  .023         .053      .122
1          0            1            0      110  .033         .115      .411
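The two steps of Linear Mean Imputation can be sketched as follows. This is an illustration under stated assumptions: the knot placement (quantiles of the observed values) and the helper name `spline_columns` are hypothetical, and the basis is a generic natural cubic spline, so the numbers will not match the slide's spline columns.

```python
import numpy as np

def spline_columns(x, knots):
    """Natural cubic spline basis columns (no intercept) evaluated at x."""
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)

    def d(j):
        # Scaled difference of truncated cubics: knot j versus the last knot.
        return (np.maximum(x - k[j], 0) ** 3 - np.maximum(x - k[-1], 0) ** 3) / (k[-1] - k[j])

    return np.column_stack([x] + [d(j) - d(len(k) - 2) for j in range(len(k) - 2)])

pop_density = np.array([0.033, 0.032, 0.034, 0.044, 0.023, np.nan])
observed = ~np.isnan(pop_density)

# Assumed knot placement: four quantiles of the observed values.
knots = np.quantile(pop_density[observed], [0.0, 1 / 3, 2 / 3, 1.0])

# Step 1: build the basis from the non-missing values only.
basis = np.full((len(pop_density), len(knots) - 1), np.nan)
basis[observed] = spline_columns(pop_density[observed], knots)

# Step 2: fill each basis column's missing entries with that column's mean.
basis[~observed] = basis[observed].mean(axis=0)
print(basis)
```

Imputing on the basis columns, rather than imputing Pop Density and then recomputing the basis, leaves the imputed row at the mean of every spline column.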
Methodology #4
• Single imputation: Use other predictor variables to build a
model and impute missing values
– Example: Model Pop Density based on AOI
Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density
1          0            0            1      125  .033
1          0            0            0      235  .025
1          1            0            1      240  .034
1          1            0            0      350  .044
1          0            1            1      100  .023
1          0            1            0      110  .027
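The slide's example, modeling Pop Density based on AOI, can be sketched with a one-variable least-squares fit. The choice of a straight-line model is an assumption, and the inputs are the observed Pop Density values from the earlier Missing Data slide.

```python
aoi = [125, 235, 240, 350, 100, 110]
pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, None]  # None marks the missing value

# Fit Pop Density = intercept + slope * AOI on the rows where it is observed.
obs = [(x, y) for x, y in zip(aoi, pop_density) if y is not None]
n = len(obs)
mean_x = sum(x for x, _ in obs) / n
mean_y = sum(y for _, y in obs) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in obs)
    / sum((x - mean_x) ** 2 for x, _ in obs)
)
intercept = mean_y - slope * mean_x

# Replace the missing value with the model's prediction.
imputed = [intercept + slope * x if y is None else y for x, y in zip(aoi, pop_density)]
print(round(imputed[-1], 3))  # 0.027
```

With these inputs the prediction for AOI 110 rounds to .027, in line with the last row of the slide's matrix.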
Methodology #5
• Multiple Imputation: Use other predictor variables to model
missing values
– Multiple imputations are created based on distribution of residuals in
estimates of missing values
Three imputed design matrices, identical except for the imputed Pop Density in the last row:

Intercept  Class 70446  Class 64446  ST MA  AOI  Pop Density (1)  Pop Density (2)  Pop Density (3)
1          0            0            1      125  .033             .033             .033
1          0            0            0      235  .025             .025             .025
1          1            0            1      240  .034             .034             .034
1          1            0            0      350  .044             .044             .044
1          0            1            1      100  .023             .023             .023
1          0            1            0      110  .027             .029             .025
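A simplified sketch of creating several imputations from the distribution of residuals is shown below. It assumes a straight-line model of Pop Density on AOI and normally distributed residuals, and it holds the regression parameters fixed, which a full multiple imputation procedure would not do. The inputs are the observed values from the earlier Missing Data slide.

```python
import math
import random

aoi = [125, 235, 240, 350, 100, 110]
pop_density = [0.033, 0.032, 0.034, 0.044, 0.023, None]  # None marks the missing value

# Regress observed Pop Density on AOI.
obs = [(x, y) for x, y in zip(aoi, pop_density) if y is not None]
n = len(obs)
mean_x = sum(x for x, _ in obs) / n
mean_y = sum(y for _, y in obs) / n
sxx = sum((x - mean_x) ** 2 for x, _ in obs)
slope = sum((x - mean_x) * (y - mean_y) for x, y in obs) / sxx
intercept = mean_y - slope * mean_x

# Residual standard deviation around the fitted line.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in obs)
resid_sd = math.sqrt(sse / (n - 2))

def impute_once(rng):
    """One completed copy of the column: prediction plus a random residual draw."""
    return [
        intercept + slope * x + rng.gauss(0.0, resid_sd) if y is None else y
        for x, y in zip(aoi, pop_density)
    ]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
imputations = [impute_once(rng) for _ in range(3)]

# Pool an estimate across the completed data sets (here, the mean Pop Density).
pooled_mean = sum(sum(d) / len(d) for d in imputations) / len(imputations)
print([round(d[-1], 3) for d in imputations], round(pooled_mean, 4))
```

Each completed data set carries a different imputed value, so the spread across them reflects the uncertainty in the missing entry.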
Multiple Imputation Process
1. Choose starting values for the mean and covariance matrix of the predictor variables
2. Use the mean and covariance matrix to estimate regression parameters
3. Use the regression parameters to estimate missing values. Add a random draw from the residual normal distribution for that variable
4. Use the resulting data set to compute a new mean and covariance matrix
5. Make a random draw from the posterior distribution of the means and covariances
6. Using the random draw from step 5, go back to step 2 and cycle through the process until convergence is achieved
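The cycle can be sketched for two variables, one of which has missing values. Everything here is an assumption made for illustration: the data are simulated, the prior is noninformative (so the posterior draw is an inverse-Wishart covariance followed by a normal mean), and the loop runs a fixed number of iterations instead of testing for convergence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): x1 is always observed, x2 is partly missing.
n = 200
x1 = rng.normal(0.0, 1.0, n)
x2 = 0.5 * x1 + rng.normal(0.0, 1.0, n)
data = np.column_stack([x1, x2])
missing = rng.random(n) < 0.3
data[missing, 1] = np.nan

def draw_posterior(filled, rng):
    """Steps 4 and 5: compute moments of the completed data, then draw (mean, cov)."""
    n_rows, p = filled.shape
    xbar = filled.mean(axis=0)
    centered = filled - xbar
    scatter = centered.T @ centered
    # Inverse-Wishart draw: invert a Wishart(n - 1, scatter^-1) sample.
    z = rng.multivariate_normal(np.zeros(p), np.linalg.inv(scatter), size=n_rows - 1)
    cov = np.linalg.inv(z.T @ z)
    mean = rng.multivariate_normal(xbar, cov / n_rows)
    return mean, cov

# Step 1: starting values from the complete cases.
complete = data[~missing]
mean, cov = complete.mean(axis=0), np.cov(complete, rowvar=False)

filled = data.copy()
chain = []  # trace of the drawn mean of x2, for convergence checks
for _ in range(200):
    # Step 2: regression of x2 on x1 implied by the current mean and covariance.
    slope = cov[0, 1] / cov[0, 0]
    intercept = mean[1] - slope * mean[0]
    resid_var = cov[1, 1] - slope * cov[0, 1]
    # Step 3: predicted value plus a random draw from the residual distribution.
    noise = rng.normal(0.0, np.sqrt(resid_var), missing.sum())
    filled[missing, 1] = intercept + slope * data[missing, 0] + noise
    # Steps 4 to 6: new moments, posterior draw, then repeat the cycle.
    mean, cov = draw_posterior(filled, rng)
    chain.append(mean[1])

print(np.mean(chain[50:]))
```

The `chain` list is the kind of series that would be inspected for initial convergence and for correlation between consecutive iterations.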
Multiple Imputation Process
• Assumptions underlying multiple imputation
algorithms
– Data is Missing At Random: Missingness of predictor
variable “V” cannot depend on value of “V” but can
depend on values of other predictor variables.
– Data follows a multivariate normal distribution
• Two issues that must be addressed
– Initial convergence of iterations
– Correlation of consecutive iterations
Time Series Plot
• Initial convergence is assessed via a time series plot
[Figure: time series plot; x-axis: Iteration Number (0 to 100); y-axis 0.84 to 0.94]
Autocorrelation Plot
• The spacing needed between iterations for their results to be approximately uncorrelated is assessed via an autocorrelation plot
[Figure: autocorrelation plot; x-axis: lag (0 to 100); y-axis: ACF (0.0 to 1.0)]
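The quantity plotted in an autocorrelation plot can be computed directly. In the sketch below the chain is a simulated AR(1) series standing in for a parameter traced across iterations, and the 0.05 threshold for "negligible" correlation is an arbitrary choice; both are assumptions for illustration.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a 1-D series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

# Hypothetical chain: an AR(1) series with lag-one correlation near 0.8.
rng = np.random.default_rng(1)
chain = np.zeros(2000)
for t in range(1, len(chain)):
    chain[t] = 0.8 * chain[t - 1] + rng.normal()

acf = autocorrelation(chain, 100)

# First lag at which the correlation falls below the (arbitrary) threshold.
spacing = int(np.argmax(np.abs(acf) < 0.05))
print(spacing)
```

Imputations taken at least `spacing` iterations apart would then be treated as approximately independent draws.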
Testing of Missing Value Methods
• Method #1
– Created training and holdout data sets
• Both contained missing data
– Built models of claim frequency under different missing
value analysis methods with training dataset
• Identical predictor variables in all models
– Compared results (deviance) of methods in data set
where all data is present
Testing of Missing Value Options
• Method #2
– Created a model of missing probability
– Limited modeling database to observations in which all data was
present
– Randomly generated missing values based on missing probability
• 100 iterations
– Built models of claim frequency under different missing value
analysis methods
• Identical predictor variables in all models
– Compared results (deviance) of methods in data set where all data
is present
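The steps of Method #2 can be sketched on simulated data. This is an illustration only: the data, the missing-probability model and the Gaussian response are all invented (the presentation models claim frequency), and only three of the missing value methods are compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete database (no missing values), invented for illustration.
n = 2000
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)
y = 1.0 + 0.5 * x1 + 1.5 * x2 + rng.normal(size=n)

def fit(X, y):
    """Least-squares fit (Gaussian family, identity link)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def deviance(coef, X, y):
    """Gaussian deviance (residual sum of squares)."""
    return float(np.sum((y - X @ coef) ** 2))

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])
dev_full = deviance(fit(X_full, y), X_full, y)

# Assumed missing-probability model: x2 is missing more often when x1 is large.
p_missing = 1.0 / (1.0 + np.exp(-(x1 - 1.0)))

results = {"listwise": [], "mean": [], "single": []}
for _ in range(100):  # 100 iterations, as on the slide
    miss = rng.random(n) < p_missing
    obs = ~miss

    # Listwise deletion: fit on complete rows only.
    results["listwise"].append(deviance(fit(X_full[obs], y[obs]), X_full, y))

    # Mean imputation of x2.
    x2_mean = np.where(miss, x2[obs].mean(), x2)
    coef = fit(np.column_stack([ones, x1, x2_mean]), y)
    results["mean"].append(deviance(coef, X_full, y))

    # Single imputation: predict x2 from x1.
    b = fit(np.column_stack([ones[obs], x1[obs]]), x2[obs])
    x2_reg = np.where(miss, b[0] + b[1] * x1, x2)
    coef = fit(np.column_stack([ones, x1, x2_reg]), y)
    results["single"].append(deviance(coef, X_full, y))

# Average deviance on the complete data, relative to the full-data fit.
for name, devs in results.items():
    print(name, round(np.mean(devs) / dev_full, 4))
```

Each method's fitted coefficients are scored on the data set where all values are present, so a ratio closer to 1 indicates less damage from the missing data.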
Performance of Missing Value Methods
1. Single Imputation/Multiple Imputation
2. Linear Mean Imputation
3. Mean Imputation
4. Listwise Deletion
Missing Data Framework
• Questions
– What is the level of missing data?
– What can be inferred about the missing data
mechanism?
– What is the size of the modeling database in
which all values are present?
– Will the data continue to be missing when the
model is applied?
Missing Data Framework
• Actions
– For low proportions of missing data: Listwise Deletion
– For higher proportions of missing data in a large
modeling database: Listwise Deletion with oversampling
– For mid to small modeling databases: employ
imputation
• Initial exploration with Linear Mean Imputation
• Fit final model with Single Imputation or Multiple
Imputation
Sources
• Splines
– Hastie, Tibshirani and Friedman: The Elements of
Statistical Learning
• Missing Data
– Paul Allison: Missing Data
– J.L. Schafer: Analysis of Incomplete Multivariate Data
– Insightful Corporation: Analyzing Data with Missing
Values in S-Plus