
ST430: Introduction to Regression Analysis, Ch. 4, Sec. 4.12-4.14
Luo Xiao
September 28, 2015
Multiple Linear Regression
Complete second-order models
When the first-order model
E(Y) = β₀ + β₁X₁ + β₂X₂
is inadequate, the interaction model
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂
may be better, but sometimes a complete second-order model is needed:
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + β₄X₁² + β₅X₂²
Example: cost of shipping packages
Example 4.8 in textbook
Y: cost of package (dollars)
X₁: package weight (pounds)
X₂: distance shipped (miles)
Plot the data (see next slide).
Model:
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + β₄X₁² + β₅X₂².
Scatter plots
[Two scatter plots: Cost vs. Weight and Cost vs. Distance.]
Fit a complete second-order model in R
R code (see the text file: Output1):
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("EXPRESS.Rdata")# load in data
fit = lm(Cost~Weight*Distance + I(Weight^2) + I(Distance^2),
data=EXPRESS)#fit the model
summary(fit)# display the results
Comparing nested models
Two models are nested if one model contains all the terms of the other, and
at least one additional term.
The larger model is the complete (or full) model, and the smaller is the
reduced (or restricted) model.
Example: with two independent variables X₁ and X₂, possible terms are X₁, X₂, X₁X₂, X₁², and so on.
Consider three models:
First order:
E(Y) = β₀ + β₁X₁ + β₂X₂;
Interaction:
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂;
Full second order:
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + β₄X₁² + β₅X₂².
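In R formula notation these three models could be specified as follows (a sketch with generic names y, x1, x2 and a data frame dat, not taken from the slides):
fit1 = lm(y ~ x1 + x2, data = dat)                      # first order
fit2 = lm(y ~ x1*x2, data = dat)                        # interaction: x1 + x2 + x1:x2
fit3 = lm(y ~ x1*x2 + I(x1^2) + I(x2^2), data = dat)    # full second order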
The first order model is nested within both the Interaction model and the
Full second order model.
The Interaction model is nested within the Full second order model.
We usually want to use the simplest (most parsimonious) model that
adequately fits the observed data.
One way to decide between a full model and a reduced model is by testing
H0 : reduced model is adequate;
Ha : full model is better.
When the full model has exactly one more term than the reduced model, we
can use a t-test.
E.g., testing the Interaction model
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂
against the First order model
E(Y) = β₀ + β₁X₁ + β₂X₂.
H₀: “reduced model is adequate” is the same as H₀: β₃ = 0.
So the usual t-statistic is the relevant test statistic.
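In R, this t-test can be read directly off the coefficient table of the interaction fit. A minimal sketch, assuming the EXPRESS data and an illustrative fit name fitI:
fitI = lm(Cost ~ Weight*Distance, data = EXPRESS)   # interaction model
summary(fitI)$coefficients["Weight:Distance", ]     # estimate, std. error, t value, Pr(>|t|) for beta3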
When the full model has more than one additional term, we use an F-test, which generalizes the t-test.
Basic idea: fit both models, and test whether the full model fits significantly better than the reduced model:
F = (Drop in SSE / Number of extra terms) / (s² for larger model),
where SSE is the sum of squared residuals.
When H₀ is true, F follows an F distribution with numerator degrees of freedom equal to the number of extra terms and denominator degrees of freedom equal to the residual degrees of freedom of the larger model; we use this distribution to find the P-value.
E.g., testing the (full) second order model
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂ + β₄X₁² + β₅X₂²
against the (reduced) interaction model
E(Y) = β₀ + β₁X₁ + β₂X₂ + β₃X₁X₂.
Here H₀ is β₄ = β₅ = 0, and Hₐ is that at least one of β₄, β₅ is nonzero.
Two ways of carrying out the test:
Manually by R functions "summary" or "anova"
Automatically by R function "anova"
Manual calculation
Note:
F = [(SSE_R − SSE_C) / Number of β parameters being tested] / (s² for larger model)
Need to find SSE_R, SSE_C, and s² for the larger model: use the R function "anova".
Example 4.8 revisited
Example 4.11 in textbook.
R code (see text file: Output2):
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("EXPRESS.Rdata")# load in data
fitR = lm(Cost~Weight*Distance,
data=EXPRESS)#interaction model
fitC = lm(Cost~Weight*Distance + I(Weight^2) + I(Distance^2),
data=EXPRESS)#complete second order model
anova(fitR)
anova(fitC)
See the ’Residuals’ line in each set of results:
SSE(Full second order) = 2.745,
SSE(Interaction) = 6.633.
For the Full second order model: s² = 0.196.
So
F = [(6.633 − 2.745)/2] / 0.196 = 9.92, P < .01.
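The same arithmetic can be reproduced in R (a sketch; the complete model's 14 residual degrees of freedom are inferred from s² = SSE_C/df in the anova output):
sse_R = 6.633; sse_C = 2.745                     # residual sums of squares from anova(fitR), anova(fitC)
F_stat = ((sse_R - sse_C)/2) / 0.196             # 2 = number of beta parameters tested
pf(F_stat, df1 = 2, df2 = 14, lower.tail = FALSE)# P-value; 14 residual df assumed for the complete model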
Automatic calculation in R
Use R function "anova(fitR, fitC)", where "fitR" is the fit from the reduced
model and "fitC" is the fit from the complete model.
R code (see the text file (Output3) for output):
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("EXPRESS.Rdata")# load in data
fitR = lm(Cost~Weight*Distance,
data=EXPRESS)#interaction model
fitC = lm(Cost~Weight*Distance + I(Weight^2) + I(Distance^2),
data=EXPRESS)#complete second order model
anova(fitR, fitC)
When there is only one additional term in the complete model
We can use the t-test for testing the corresponding β parameter.
We can also use the above F-test.
The two tests are exactly the same test:
The F-statistic is exactly the square of the t-statistic.
The F critical values are exactly the squares of the (two-sided) t critical values.
So the P-value is exactly the same.
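A quick illustration with the EXPRESS data (a sketch; the fit names fit1 and fit2 are illustrative):
fit1 = lm(Cost ~ Weight + Distance, data = EXPRESS)   # first order model
fit2 = lm(Cost ~ Weight*Distance, data = EXPRESS)     # adds the single interaction term
t_val = summary(fit2)$coefficients["Weight:Distance", "t value"]
F_val = anova(fit1, fit2)$F[2]
c(t_squared = t_val^2, F = F_val)                     # the two values agree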
Complete example: road construction cost
Section 4.14 of textbook; data from the Florida Attorney General’s office.
Y = winning contract cost ("COST" in data "FLAG")
X₁ = DOT engineer’s estimate of cost ("DOTEST" in data "FLAG")
X₂ = indicator of fixed bidding ("STATUS" in data "FLAG"):
X₂ = 1 if fixed, 0 if competitive.
Get the data and plot them (next slide):
What does the visualization tell us?
Y seems to be positively related to X₁;
X₂ seems to affect Y;
Further insights from visualization:
It seems to suggest different slopes;
It seems that straight lines might be sufficient;
Visualization
[Two scatter plots of COST vs. DOTEST, one for STATUS = 0 and one for STATUS = 1, each with a 45 degree line for reference.]
Section 4.14 suggests beginning with the full second order model, and
simplifying it as far as possible (but no further!).
We’ll take the opposite approach: begin with the first order model, and
complicate it as far as necessary.
Because X₂ is an indicator variable, the first order model is a pair of parallel straight lines (see the plots on later slides).
R code for additive model (see text file: Output4):
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("FLAG.Rdata")# load in data
fitA = lm(COST~DOTEST + STATUS,
data=FLAG)#additive model
summary(fitA)
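One way to produce the overlay shown on the next slide (a sketch; it assumes STATUS is stored as a 0/1 numeric variable, and the plotting choices are illustrative):
plot(COST ~ DOTEST, data = FLAG, pch = ifelse(FLAG$STATUS == 1, 19, 1))
cf = coef(fitA)
abline(a = cf["(Intercept)"], b = cf["DOTEST"])                # fitted line for STATUS = 0
abline(a = cf["(Intercept)"] + cf["STATUS"], b = cf["DOTEST"]) # parallel line for STATUS = 1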
overlay fits from additive model
[Scatter plot of COST vs. DOTEST with the two parallel fitted lines from the additive model, one for STATUS = 0 and one for STATUS = 1.]
Both variables look important.
The ’DOTEST’ coefficient is close to 1, so the winning bids roughly track
the estimated cost.
The positive ’STATUS’ coefficient means the line for ’STATUS’ = 1 is
higher than the line for ’STATUS’ = 0.
Are the slopes different? Try the interaction model (see text file: Output5):
fitR = lm(COST~DOTEST*STATUS,
data=FLAG)#interaction model
Interaction model involving qualitative variable
The interaction model:
E(COST) = β₀ + β₁DOTEST + β₂STATUS + β₃DOTEST × STATUS.
Hence,
E(COST) = β₀ + β₁DOTEST, if STATUS = 0;
E(COST) = (β₀ + β₂) + (β₁ + β₃)DOTEST, if STATUS = 1.
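The two lines can be recovered from the fitted coefficients (a sketch; the coefficient names assume STATUS is a 0/1 numeric variable):
cf = coef(fitR)
slope0 = cf[["DOTEST"]]                          # slope when STATUS = 0
slope1 = cf[["DOTEST"]] + cf[["DOTEST:STATUS"]]  # slope when STATUS = 1
c(slope0 = slope0, slope1 = slope1)              # 0.921336 and 1.084618 (see the next slide)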
The interaction term is highly significant: reject the first order model in
favor of the interaction model.
The slopes are 0.921336 for ’STATUS’ = 0, and 0.921336 + 0.163282 =
1.084618 for ’STATUS’ = 1.
So the competitive auctions are won with bids that fall slightly below the
estimated cost, while the fixed winning bids fall slightly above the estimated
cost.
overlay fits from interaction model
[Two scatter plots of COST vs. DOTEST, one for STATUS = 0 and one for STATUS = 1, each showing the 45 degree line and the fitted line from the interaction model.]
To validate the interaction model, we compare it with (finally!) the full
second order model.
Note
When some variables are qualitative, the “full second order model” consists of the full second order (i.e., quadratic) model in the quantitative variables, plus the interactions of those terms with the qualitative variables. Here:
E(COST) = β₀ + β₁DOTEST + β₂STATUS + β₃DOTEST × STATUS + β₄DOTEST² + β₅DOTEST² × STATUS.
R code for model (see text file: Output6):
fitC = lm(COST~DOTEST*STATUS + I(DOTEST^2)
+ I(DOTEST^2*STATUS),
data=FLAG)#complete second order model
F-test for interaction model against complete second order model
Use R function: anova(fitR, fitC)
See text file: Output7.
The significance of the two added terms is tested using an F statistic with 2 degrees of freedom in the numerator; the P-value is 0.3548, so we do not reject the null hypothesis that the interaction model is adequate.
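The statistic and P-value can also be pulled from the comparison table directly (a sketch):
cmp = anova(fitR, fitC)
cmp$F[2]            # F statistic (2 numerator degrees of freedom)
cmp[["Pr(>F)"]][2]  # P-value, 0.3548 per Output7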