ST430: Introduction to Regression Analysis, Chapter 5,
Sections 5.7-5.11
Luo Xiao
October 19, 2015
Model Building
Models with one qualitative variable
Recall: a qualitative variable with k levels is represented by (k − 1)
indicator (or dummy) variables.
For the chosen reference level, all the indicator variables are 0;
for each of the other levels, the corresponding indicator variable is 1 and the rest are 0.
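For illustration, this coding can be inspected directly in R; a minimal sketch using model.matrix() on a hypothetical three-observation factor (not part of the original slides):
state <- factor(c("Kansas", "Kentucky", "Texas"))
model.matrix(~ state)  # columns: (Intercept), stateKentucky, stateTexas
# The reference level (Kansas) has zeros in both indicator columns.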
Example
Per-user software maintenance cost, by state (samples of 10 users per state).
Y : cost
Reference level: Kansas
X1 : indicator variable for Kentucky
X2 : indicator variable for Texas
R code for reading in and plotting data:
setwd("/Users/xiaoyuesixi/Dropbox/teaching/2015Fall/R_datasets")
load("BIDMAINT.Rdata")  # load the data
par(mfrow = c(1, 1), mar = c(4, 4.5, 2, 1))  # plotting layout and margins
boxplot(COST ~ STATE, data = BIDMAINT)  # box plot of cost by state
[Figure: box plots of COST by STATE, for Kansas, Kentucky, and Texas]
Model:
E (Y ) = β0 + β1 X1 + β2 X2 .
R code for model fit (see “output1.txt”):
setwd("/Users/xiaoyuesixi/Dropbox/teaching/2015Fall/R_datasets")
load("BIDMAINT.Rdata")  # load the data
fit = lm(COST ~ STATE, data = BIDMAINT)  # fit the dummy-variable model
summary(fit)
The fitted equation is
E (Y ) = 279.6 + 80.3X1 + 198.2X2
where:
X1 = indicator variable for Kentucky,
X2 = indicator variable for Texas.
For Kansas, X1 = X2 = 0, so E (Y ) = 279.6.
That is, the “intercept” is actually the expected value for the reference
state, Kansas.
For Kentucky, X1 = 1 and X2 = 0, so E (Y ) = 279.6 + 80.3 = 359.9.
That is, the coefficient ’STATEKentucky’ is the difference between the
expected value for Kentucky and the expected value for the reference
state.
Similarly, the coefficient ’STATETexas’ is the difference between the
expected value for Texas and the expected value for the reference state.
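As a sketch (assuming the object 'fit' from the lm() call above), the three state means can be recovered directly from the coefficients:
b <- coef(fit)
b["(Intercept)"]                       # Kansas mean: 279.6
b["(Intercept)"] + b["STATEKentucky"]  # Kentucky mean: 279.6 + 80.3 = 359.9
b["(Intercept)"] + b["STATETexas"]     # Texas mean: 279.6 + 198.2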
In R, the default reference level is the first level in alphabetical order.
The default can be overridden before fitting, e.g., with ’relevel(STATE, ref = "Kentucky")’.
Often these differences themselves are of no special interest, and the focus
is on testing whether there are any differences:
H0 : β1 = β2 = · · · = βk−1 = 0.
The value of the F -statistic is unaffected by the choice of reference
level.
R code for F -test: ’anova(fit)’
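A sketch of both steps, assuming the BIDMAINT data from the earlier slides:
BIDMAINT$STATE <- relevel(BIDMAINT$STATE, ref = "Kentucky")  # change the reference level
fit2 <- lm(COST ~ STATE, data = BIDMAINT)
summary(fit2)  # individual coefficients change ...
anova(fit2)    # ... but the F statistic for STATE is unchanged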
Two qualitative variables
E.g., two brands of diesel engine and three types of fuel.
[Figure: engine performance plotted by fuel type (F1, F2, F3) and brand (B1, B2)]
Notation
Y : performance
Reference fuel type: F1
X1 : indicator for fuel type F2
X2 : indicator for fuel type F3
Reference brand: B1
X3 : indicator for brand B2
Main effects model:
E (Y ) = β0 + β1 X1 + β2 X2 + β3 X3 .
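A minimal sketch of fitting this model, assuming the DIESEL data set used on the following slides (response PERFORM, factors FUEL and BRAND):
fit_main <- lm(PERFORM ~ FUEL + BRAND, data = DIESEL)  # main effects only
summary(fit_main)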
Means for main effects model
E (Y ) = β0 + β1 X1 + β2 X2 + β3 X3 .
        F1                  F2                      F3
B1      µ11 = β0            µ21 = β0 + β1           µ31 = β0 + β2
B2      µ12 = β0 + β3       µ22 = β0 + β1 + β3      µ32 = β0 + β2 + β3
(µij denotes the mean response for fuel type i and brand j.)
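The six cell means implied by the main effects model can also be computed from the fit; a sketch assuming fit_main from the previous sketch and that FUEL and BRAND are factors:
grid <- expand.grid(FUEL = levels(DIESEL$FUEL), BRAND = levels(DIESEL$BRAND))
cbind(grid, mean = predict(fit_main, newdata = grid))  # one fitted mean per cell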
Main effects model: Mean response as a function of F and
B when F and B affect E (Y ) independently
[Interaction plots: mean of PERFORM against FUEL (F1, F2, F3) with separate lines for BRAND, and mean of PERFORM against BRAND (B1, B2) with separate lines for FUEL]
The previous interaction plot tells a more complicated story:
For F1 and F2, the effects are additive, with B1 performing better than B2 and F2 performing better than F1;
but for F3, B2 performs better than B1.
Interaction model:
E (Y ) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X1 X3 + β5 X2 X3 .
R code for fitting interaction model:
lm(PERFORM~FUEL*BRAND,data=DIESEL)
Testing if interaction effect is significant
Compare the main effects model
E (Y ) = β0 + β1 X1 + β2 X2 + β3 X3
with the interaction model
E (Y ) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X1 X3 + β5 X2 X3 .
Testing H0 : β4 = β5 = 0:
Use R function: ’anova’ (output in "output2.txt")
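A sketch of the comparison, again assuming the DIESEL data:
fit_main <- lm(PERFORM ~ FUEL + BRAND, data = DIESEL)  # main effects model
fit_int  <- lm(PERFORM ~ FUEL * BRAND, data = DIESEL)  # interaction model
anova(fit_main, fit_int)  # partial F test of H0: β4 = β5 = 0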
Three or more qualitative variables
With a response Y and independent variables a, b, c, . . . , the model might contain:
main effects:
a + b + c + ...;
two-way interactions:
a + b + c + a : b + a : c + b : c + ...;
higher-order interactions:
a + b + c + a : b + a : c + b : c + a : b : c + ...;
Often only main effects and low-order interactions are significant.
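In R formula notation these models can be written compactly; a sketch with hypothetical variables y, a, b, c in a data frame dat:
lm(y ~ a + b + c, data = dat)      # main effects only
lm(y ~ (a + b + c)^2, data = dat)  # main effects plus all two-way interactions
lm(y ~ a * b * c, data = dat)      # all interactions, including a:b:c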
To estimate the highest-order interactions, we need observations for all possible combinations of levels: a factorial design.
E.g., 2 × 3 = 6 combinations for the diesel-engine example.
With several variables, all with at least 2 levels, the number of combinations can be large.
Sometimes a carefully chosen fraction of all possible combinations is used: a fractional factorial design.
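As a sketch, the full set of level combinations can be enumerated with expand.grid():
expand.grid(BRAND = c("B1", "B2"), FUEL = c("F1", "F2", "F3"))  # 2 × 3 = 6 rows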
Models with both quantitative and qualitative variables
Example
Diesel engine performance Y , as a function of:
engine speed, X1 ;
fuel type, with levels F1 , F2 , and F3 ;
take F1 as the reference level, and X2 and X3 as indicators for F2 and
F3 , respectively.
Simple model, ignoring fuel type: a second-order model in X1:
E(Y) = β0 + β1 X1 + β2 X1².
Additive model: include the main effects of fuel type:
E(Y) = β0 + β1 X1 + β2 X1² + β3 X2 + β4 X3.
Switching fuel from F1 to F2 adds β3 to the expected performance, independently of engine speed X1.
Interaction model:
E(Y) = β0 + β1 X1 + β2 X1² + β3 X2 + β4 X3 + β5 X1 X2 + β6 X1 X3 + β7 X1² X2 + β8 X1² X3.
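A sketch of the three fits, assuming a hypothetical engine-speed column SPEED in the DIESEL data:
fit1 <- lm(PERFORM ~ SPEED + I(SPEED^2), data = DIESEL)           # ignores fuel type
fit2 <- lm(PERFORM ~ SPEED + I(SPEED^2) + FUEL, data = DIESEL)    # additive model
fit3 <- lm(PERFORM ~ (SPEED + I(SPEED^2)) * FUEL, data = DIESEL)  # interaction model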
The interaction model is the complete second-order model.
It allows E(Y) to be a different quadratic function of X1 at each level of fuel type.
These models form a nested hierarchy, and we could choose among them using F-tests.
An intermediate model like
E(Y) = β0 + β1 X1 + β2 X1² + β3 X2 + β4 X3 + β5 X1 X2 + β6 X1 X3
(that is, the interaction model with β7 = β8 = 0) might also be of interest.
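A sketch of the comparisons, assuming fit1, fit2, and fit3 from the previous sketch:
anova(fit1, fit2, fit3)  # sequential F tests through the nested hierarchy
# The intermediate model (β7 = β8 = 0): quadratic in SPEED, but only the linear
# term interacts with fuel type.
fit2b <- lm(PERFORM ~ SPEED * FUEL + I(SPEED^2), data = DIESEL)
anova(fit2b, fit3)  # tests β7 = β8 = 0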
Model validation
Regression models are usually fitted to infer something about the behavior of
the expected response, E (Y ), beyond the particular sample used for fitting.
Often the specific goal is to estimate E (Y ) or predict Y for some
combination of independent variables not in the fitting data.
A model that fits well (high R²) may not predict well; the adjusted R², Ra², is a step in the right direction, but only a step.
The best validation is to actually collect new data, and compare predictions
with actual responses.
Label the m new responses as y_{n+1}, . . . , y_{n+m}. Then

R²_prediction = 1 − Σ_{i=n+1}^{n+m} (y_i − ŷ_i)² / Σ_{i=n+1}^{n+m} (y_i − ȳ)²

MSE_prediction = Σ_{i=n+1}^{n+m} (y_i − ŷ_i)² / m

Note: the denominator in MSE_prediction is m, not m − (k + 1).
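A sketch of the computation, assuming a fitted model 'fit' and a hypothetical data frame new_data holding the m new observations (response column y); here ȳ is taken as the mean of the new responses:
y_new <- new_data$y
y_hat <- predict(fit, newdata = new_data)
R2_prediction  <- 1 - sum((y_new - y_hat)^2) / sum((y_new - mean(y_new))^2)
MSE_prediction <- mean((y_new - y_hat)^2)  # divides by m, not m - (k + 1)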
Almost as good: divide the original data into one part used for model
building and fitting, and another “hold-out” part for validation.
But for a true validation, the hold-out data are completely unused until the
model building and fitting are complete.
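A sketch of such a split, with a hypothetical data frame dat and model formula y ~ x:
set.seed(1)
idx   <- sample(nrow(dat), size = round(0.7 * nrow(dat)))  # e.g., 70% for model building
train <- dat[idx, ]
test  <- dat[-idx, ]                   # hold-out part, untouched until the end
fit   <- lm(y ~ x, data = train)       # model built and fitted on the training part only
pred  <- predict(fit, newdata = test)  # evaluated once, after the model is final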
Cross validation
If we do not have enough data to hold out part for a true validation, we can
use cross validation:
First build (and fit) a model to the whole data set.
Then leave out part of the data (one half? one third? one fifth? a single observation?), refit the model to the remainder, and use the refitted model to predict the held-out data (K-fold cross validation).
Repeat so that all parts of the data are held out in turn.
Calculate R² and MSE as for a true validation.
Cross validation is not true validation, because the whole data set is used to
build the model and then re-used in the validation.
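A sketch of K-fold cross validation by hand, with a hypothetical data frame dat and model y ~ x:
K     <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))  # assign each row to a fold
pred  <- numeric(nrow(dat))
for (k in 1:K) {
  fit_k <- lm(y ~ x, data = dat[folds != k, ])                   # refit without fold k
  pred[folds == k] <- predict(fit_k, newdata = dat[folds == k, ])  # predict fold k
}
MSE_cv <- mean((dat$y - pred)^2)
R2_cv  <- 1 - sum((dat$y - pred)^2) / sum((dat$y - mean(dat$y))^2)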
The Jackknife
The jackknife is closely related to leave-one-out cross validation. It uses deleted residuals:
Delete one observation, say yi;
Refit the model, and use it to predict the deleted observation as ŷ(i);
The deleted residual (or prediction residual) is di = yi − ŷ(i).
R²-like statistic:

R²_jackknife = 1 − Σ (yi − ŷ(i))² / Σ (yi − ȳ)².
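For a linear model the deleted residuals need not be computed by refitting n times; a sketch using the standard leverage identity di = ei / (1 − hi), assuming a fitted lm object 'fit':
d <- residuals(fit) / (1 - hatvalues(fit))  # deleted (prediction) residuals
y <- model.response(model.frame(fit))       # the observed responses
R2_jackknife <- 1 - sum(d^2) / sum((y - mean(y))^2)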