
ST430: Introduction to Regression Analysis, Chapter 8, Sections 1-4
Luo Xiao
November 2, 2015
Residual Analysis
Inferences about a regression model are valid only under assumptions about
the random errors in the observations.
Objectives:
Show how residuals reveal departures from assumptions;
Suggest procedures for coping with such departures.
Regression residuals
The random errors satisfy
Y = E(Y) + ε, or ε = Y − E(Y).
We observe Y, but we do not know E(Y), so we cannot calculate ε.
We estimate E(Y) by Ŷ, the predicted (or fitted) value.
We approximate the random errors by the regression residuals:
ε̂_i = Y_i − Ŷ_i, i = 1, 2, . . . , n.
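In R, the residuals of a fitted lm object are exactly the observed responses minus the fitted values; a minimal sketch, using R's built-in cars data purely for illustration:
fit <- lm(dist ~ speed, data = cars)                # illustrative fit on built-in data
all.equal(residuals(fit), cars$dist - fitted(fit))  # TRUE: residual = observed - fitted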
Properties of residuals
If the model contains an intercept, the sum of the residuals, and also their
mean, is zero:
∑_{i=1}^n ε̂_i = 0, and hence the mean residual is also 0.
The covariance of the residuals and any term in the regression model is zero:
∑_{i=1}^n ε̂_i X_{i,j} = 0, j = 1, 2, . . . , k.
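Both identities are easy to verify numerically for any fit that includes an intercept; a quick sketch, again with the built-in cars data:
fit <- lm(dist ~ speed, data = cars)   # model with an intercept
sum(residuals(fit))                    # essentially zero (up to rounding error)
sum(residuals(fit) * cars$speed)       # essentially zero as well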
Detecting lack of fit
A misspecified model is one that leaves out a relevant predictor.
The residuals from a misspecified model do not have mean zero.
Example: serum cholesterol (Y ) and dietary fat (X ) in Olympic athletes.
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("OLYMPIC.RData")
pairs(OLYMPIC)
Scatter plot
(Figure: pairs() scatter plot matrix of FAT and CHOLES.)
Suppose we ignore the graph, and fit a first-order model (see “output1.txt”):
l1 <- lm(CHOLES ~ FAT, OLYMPIC)
summary(l1)
plot(OLYMPIC$FAT, residuals(l1),pch=20) #X versus residuals
abline(h=0,col="gray") #0 horizontal line
The summary of the fitted model looks reasonable.
But the plot of the residuals against X shows that the assumption
E(ε) = 0 is violated: the residuals follow a systematic curved pattern.
Because this is a straight-line model, this graph is effectively the same
as the “residuals versus fitted values” graph from plot(l1).
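That panel can also be requested directly from the fitted object; in base R, which = 1 selects the residuals-versus-fitted plot:
plot(l1, which = 1)  # residuals vs fitted values, with a smooth trend line added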
(Figure: residual plot of residuals(l1) against OLYMPIC$FAT; the residuals trace a curve around the zero line.)
Because of the curvature, we could fit the second-order (quadratic) model
(see “output2.txt”):
l2 <- lm(CHOLES ~ FAT + I(FAT^2), OLYMPIC)
summary(l2)
plot(OLYMPIC$FAT, residuals(l2),pch=20)
abline(h=0,col="gray")
The residual plot suggests that the model is satisfactory.
The quadratic term is highly significant.
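A formal comparison of the two nested fits supports the same conclusion; a minimal sketch, assuming l1 and l2 are the fits above:
anova(l1, l2)  # partial F test of the quadratic term; a small p-value favors l2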
(Figure: residual plot of residuals(l2) against OLYMPIC$FAT; the residuals now scatter evenly about zero.)
Partial residuals
Sometimes the effect of an independent variable is better described by a
transformed version: log(X ), 1/X , etc.
The partial residual plot can help identify the transformation:
The partial residuals for independent variable X_j are
ε̂* = ε̂ + β̂_j X_j.
Plot ε̂* against X_j.
Also known as a “Component + Residual” plot.
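Partial residuals are simple to compute by hand; a minimal sketch, using the built-in cars data and its single predictor speed for illustration:
fit <- lm(dist ~ speed, data = cars)                     # illustrative fit
pr  <- residuals(fit) + coef(fit)["speed"] * cars$speed  # partial residuals for speed
plot(cars$speed, pr)                                     # a "Component + Residual" plot by hand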
Example
Effect of price (PRICE) and advertising expenditure (X) on demand (Y) for coffee.
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("COFFEE2.RData")
pairs(COFFEE2,pch=20)
Try a first-order model:
l1 <- lm(DEMAND ~ PRICE + X, COFFEE2)
summary(l1)
plot(COFFEE2$PRICE, residuals(l1),pch=20)
abline(h=0,col="gray")
The residual plot shows misspecification.
Scatter plot
(Figure: pairs() scatter plot matrix of DEMAND, PRICE, and X.)
(Figure: residual plot of residuals(l1) against COFFEE2$PRICE; a curved pattern indicates misspecification.)
The Component + Residual plot:
library(car)
crPlot(l1, variable = "PRICE",pch=20)
The curve suggests either adding a quadratic term PRICE², or transforming
to log(PRICE) or 1/PRICE.
R² and adjusted R² (R²_a) are highest for 1/PRICE.
Note: the partial regression (added-variable) plot is a different diagnostic.
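A minimal sketch of how such a comparison could be run (the model names lQuad, lLog, and lInv are illustrative):
lQuad <- lm(DEMAND ~ PRICE + I(PRICE^2) + X, COFFEE2)  # add a quadratic term
lLog  <- lm(DEMAND ~ log(PRICE) + X, COFFEE2)          # log transformation
lInv  <- lm(DEMAND ~ I(1/PRICE) + X, COFFEE2)          # reciprocal transformation
sapply(list(quad = lQuad, log = lLog, inv = lInv),
       function(m) summary(m)$adj.r.squared)           # 1/PRICE should score highest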
(Figure: Component + Residual plot for PRICE; the partial residuals follow a clear curve.)
Detecting unequal variances
Homoscedasticity versus heteroscedasticity.
That is, constant variance versus varying variance.
When the variance is not constant, it is most often related to the mean.
For Poisson-distributed data (counts), var(Y) = E(Y).
When errors are multiplicative, Y = E(Y) × (1 + ε), and var(Y) ∝ E(Y)².
Sometimes the variance can be made constant by transforming Y .
For example, with multiplicative errors,
log(Y) = log[E(Y) × (1 + ε)]
= log[E(Y)] + log(1 + ε)
≈ log[E(Y)] + ε, since log(1 + ε) ≈ ε for small ε.
So var[log Y] is (approximately) constant.
Even when a transformation can make the variance constant, a different
method may be better.
For example, with Poisson-distributed counts, √Y has approximately
constant variance, but a generalized linear model may be more satisfactory.
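A hedged sketch contrasting the two approaches on simulated Poisson counts (all names and values here are illustrative):
set.seed(1)
x <- runif(100, 0, 3)
y <- rpois(100, lambda = exp(1 + 0.5 * x))  # counts whose variance grows with the mean
lSqrt <- lm(sqrt(y) ~ x)                    # variance-stabilizing transformation
lGlm  <- glm(y ~ x, family = poisson)       # Poisson GLM with log link
summary(lGlm)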
Example
Salary and experience for social workers.
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("SOCWORK.RData")
pairs(SOCWORK,pch=20)
Try a second-order model:
l2 <- lm(SALARY ~ EXP + I(EXP^2), SOCWORK)
summary(l2)
plot(fitted(l2), residuals(l2),pch=20)
abline(h=0,col="gray")
Scatter plot
(Figure: pairs() scatter plot matrix of EXP, SALARY, and LNSALARY.)
(Figure: residual plot of residuals(l2) against fitted(l2); the scatter fans out as the fitted values grow.)
The “Residuals vs Fitted” plot shows a fan-shaped scatter, and the
“Scale-Location” plot shows an upward trend (try plot(l2)).
This suggests
sd(Y) ∝ E(Y), and hence var(Y) ∝ E(Y)²,
so try logarithms.
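Both diagnostic panels can be requested directly; which = 1 gives “Residuals vs Fitted” and which = 3 gives “Scale-Location”:
plot(l2, which = c(1, 3))  # draws the two diagnostic panels in turn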
Second-order model for log(SALARY):
lLog2 <- lm(log(SALARY) ~ EXP + I(EXP^2), SOCWORK)
summary(lLog2)
The quadratic term is not significant, so try a first-order model:
lLog1 <- lm(log(SALARY) ~ EXP, SOCWORK)
summary(lLog1)
plot(fitted(lLog1), residuals(lLog1),pch=20)
abline(h=0,col="gray")
The residual plots are more satisfactory.
(Figure: residual plot of residuals(lLog1) against fitted(lLog1); the scatter is roughly even across the range of fitted values.)
Simple test for heteroscedasticity
Divide the data set in two, for instance low fitted values versus high fitted
values.
Fit the model separately to each part, and compare the MSEs (Mean
Square Errors).
Under H_0: variance is constant,
F* = MSE_1 / MSE_2
has the F-distribution with ν_1 = n_1 − (k + 1) and ν_2 = n_2 − (k + 1)
degrees of freedom.
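A minimal sketch of this test in R, splitting at the median fitted value; the object names are illustrative and assume the log-salary model lLog1 from the example:
idx   <- fitted(lLog1) <= median(fitted(lLog1))            # split at the median fitted value
mLow  <- lm(log(SALARY) ~ EXP, SOCWORK[idx, ])             # fit to the low half
mHigh <- lm(log(SALARY) ~ EXP, SOCWORK[!idx, ])            # fit to the high half
mse   <- function(m) sum(residuals(m)^2) / df.residual(m)  # MSE = SSE / (n - k - 1)
Fstar <- mse(mLow) / mse(mHigh)                            # test statistic F*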
This is usually a two-sided test, with H_a: variance is not constant.
Reject H_0 at level α if F* differs too far from 1 in either direction; that is, if
F* < F_{1−α/2}(ν_1, ν_2), the lower α/2-point of the distribution, or
F* > F_{α/2}(ν_1, ν_2), the upper α/2-point of the distribution.
Note: F_{1−α/2}(ν_1, ν_2) = 1/F_{α/2}(ν_2, ν_1), so an equivalent method is based on
F = (Larger MSE) / (Smaller MSE) = max(F*, 1/F*).
Then we reject H_0 if
F > F_{α/2}(ν_Larger, ν_Smaller).
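Continuing the sketch above, the one-tailed version with the larger MSE on top (again with illustrative names):
Fmax <- max(Fstar, 1 / Fstar)                         # larger MSE over smaller MSE
dfL  <- df.residual(if (Fstar >= 1) mLow else mHigh)  # df of the larger-MSE fit
dfS  <- df.residual(if (Fstar >= 1) mHigh else mLow)  # df of the smaller-MSE fit
Fmax > qf(1 - 0.05/2, dfL, dfS)                       # TRUE: reject H0 at alpha = 0.05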