wine.lmprice~ temp + h.rain + w.rain + year, data=wine.df) >

STATS 330: Lecture 13
7/14/2017
330 lecture 13
1
Diagnostics 5
Aim of today’s lecture
To discuss diagnostics and remedies for
non-normality
To apply these and the other diagnostics in a
case study
7/14/2017
330 lecture 13
2
Normality
 Another assumption in the regression
model is that the errors are normally
distributed.
 This is not so crucial, but can be important
if the errors have a long-tailed distribution,
since this will imply there are several
outliers
 Normality assumption important for
prediction
7/14/2017
330 lecture 13
3
Detection
The standard diagnostic is the normal plot of the residuals:
• A straight plot is indicative of normality
• A
shape is indicative of right skew errors (eg
gamma)
• A
shape is indicative of a symmetric, shorttailed distribution
• A
shape is indicative of a symmetric longtailed distribution
7/14/2017
330 lecture 13
4
qqnorm(residuals(xyz.lm))
Normal Q-Q Plot
Right-skew
errors
3
0
-2
1
2
Sample Quantiles
4
5
2
1
0
-1
Sample Quantiles
Normal
errors
Normal Q-Q Plot
-1
0
1
2
-2
1
Normal Q-Q Plot
Normal Q-Q Plot
5
2
0.0
-0.4
0
-10
-0.2
Long-tailed
errors
-5
Sample Quantiles
0.2
Sample Quantiles
0
Theoretical Quantiles
Short-tailed
errors
-2
-1
0
1
2
-2
Theoretical Quantiles
7/14/2017
-1
Theoretical Quantiles
0.4
-2
-1
0
1
2
Theoretical Quantiles
330 lecture 13
5
Weisberg-Bingham test
 Test statistic: WB=square of correlation of
normal plot, measures how straight the
plot is
 WB is between 0 and 1. Values close to 1
indicate normality
 Use R function WB.test in the 330
functions: if pvalue is >0.05, normality is
OK
7/14/2017
330 lecture 13
6
Example: cherry trees
> WB.test(cherry.lm)
0.989
-5
WB test statistic =
p = 0.68
0
> qqnorm(residuals(cherry.lm))
Sample Quantiles
5
Normal Q-Q Plot
Since p=0.68>0.05,
normality is OK
7/14/2017
-2
-1
0
1
2
Theoretical Quantiles
330 lecture 13
7
Remedies for
non-normality
 The standard remedy is to transform the response using
a power transformation
 The idea is, on the original scale the model doesn’t fit
well, but on the transformed scale it does
 The power is obtained by means of a “Box-Cox plot”
 The idea is to assume that for some power p, the
response yp follows the regression model. The plot is a
graphical way of estimating the power p.
 Technically, it is a plot of the “profile likelihood”
7/14/2017
330 lecture 13
8
Case study: the wine data
 Data on 27 vintages of Bordeaux wines
 Variables are
•
•
•
•
•
year (1952-1980)
temp (average temp during the growing season, oC)
h.rain (total rainfall during harvest period, mm)
w.rain (total rainfall over preceding winter, mm)
Price (in 1980 US dollars, converted to an index with
1961=100)
 Data on web page as wine.txt
7/14/2017
330 lecture 13
9
$US 1000
$US 800
7/14/2017
330 lecture 13
$US 450
10
Wine prices
 Bordeaux wines are an iconic luxury
consumer good. Many consider these to
be the best wines in the world.
 The quality and the price depends on the
vintage (i.e. the year the wines are made.)
 The prices are (in 1980 dollars, in index
form, 1961=100)
7/14/2017
330 lecture 13
11
Bordeaux wine price index 1953-1980, 1961=100
60
20
40
Price index
80
100
Note
downward
trend: the
older the
wine, the
more
valuable it
is.
1955
1960
1965
1970
1975
1980
Year
7/14/2017
330 lecture 13
12
Preliminary analysis
> wine.df<-read.table(file.choose(),header=T)
> wine.lm<-lm(price~ temp + h.rain + w.rain + year,
data=wine.df)
> summary(wine.lm)
7/14/2017
330 lecture 13
13
Summary output
Residuals:
Min
1Q
-14.077 -9.040
Median
-1.018
3Q
3.172
Max
26.991
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1305.52761 597.31137
2.186 0.03977 *
temp
19.25337
3.92945
4.900 6.72e-05 ***
h.rain
-0.10121
0.03297 -3.070 0.00561 **
w.rain
0.05704
0.01975
2.889 0.00853 **
year
-0.82055
0.29140 -2.816 0.01007 *
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.69 on 22 degrees of freedom
Multiple R-Squared: 0.7369,
Adjusted R-squared: 0.6891
F-statistic: 15.41 on 4 and 22 DF, p-value: 3.806e-06
7/14/2017
330 lecture 13
14
Diagnostic plots
7/14/2017
330 lecture 13
15
Checking normality
Normal Q-Q Plot
10
-10
0
Sample Quantiles
20
> qqnorm(residuals(wine.lm))
> WB.test(wine.lm)
WB test statistic = 0.957
p = 0.03
-2
-1
0
1
2
Theoretical Quantiles
7/14/2017
330 lecture 13
16
Box-Cox routine
boxcoxplot(price~year+temp+h.rain+w.rain,data=wine.df)
110
Power = -1/3
100
Profile likelihood
120
130
Box-Cox plot
-2
-1
0
1
2
p
7/14/2017
330 lecture 13
17
Transform and refit
 Use y^(-1/3) as a response (reciprocal cube
root)
 Has the fit improved?
• Are the errors now more normal? (normal plot)
• Look at R2, has R2 increased?
• Could we improve normality by a further
transformation? (power=1 means that the fit cannot
be improved by further transformation)
7/14/2017
330 lecture 13
18
0.02
0.00
Normality
better!
-0.04
-0.02
Sample Quantiles
0.04
Normal Q-Q Plot
-2
-1
0
1
2
Theoretical Quantiles
7/14/2017
330 lecture 13
19
Call:
lm(formula = price^(-1/3) ~ temp + h.rain + w.rain + year, data =
wine.df)
Residuals:
Min
1Q
Median
-0.048426 -0.024007 -0.002757
3Q
0.022559
Max
0.055406
Better!
(was 0.7369)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00 1.613e+00 -2.273 0.03317 *
temp
-7.051e-02 1.061e-02 -6.644 1.11e-06 ***
h.rain
4.423e-04 8.905e-05
4.967 5.71e-05 ***
w.rain
-1.157e-04 5.333e-05 -2.170 0.04110 *
year
2.639e-03 7.870e-04
3.353 0.00288 **
--Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-Squared: 0.8331,
Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08
7/14/2017
330 lecture 13
20
> boxcoxplot(price^(-1/3)~year +
temp+h.rain+w.rain,data=wine.df)
-48
-46
Power=1, no
further
improvement
possible
-50
Profile likelihood
-44
-42
Box-Cox plot
-2
-1
0
1
2
p
7/14/2017
330 lecture 13
21
Conclusion
 Transformation has been spectacularly
successful in improving the fit!
 What about other aspects of the fit?
• residuals/fitted values
• Pairs
• Partial residual plots
7/14/2017
330 lecture 13
22
plot(trans.lm)
No
problems!
7/14/2017
330 lecture 13
23
plot(gam(price^(-1/3) ~ temp + h.rain + s(w.rain) +
s(year), data = wine.df))
4th degree poly for
winter rain??
w.rain
better as
quadatic?
7/14/2017
330 lecture 13
24
> summary(trans2.lm)
lm(formula = price^(-1/3) ~ year + temp + h.rain + poly(w.rain,
4), data = wine.df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)
-2.974e+00 1.532e+00 -1.942 0.06715 .
year
2.284e-03 7.459e-04
3.062 0.00642 **
temp
-7.478e-02 1.048e-02 -7.137 8.75e-07 ***
h.rain
4.869e-04 8.662e-05
5.622 2.02e-05 ***
poly(w.rain, 4)1 -7.561e-02 3.263e-02 -2.317 0.03180 *
poly(w.rain, 4)2 4.469e-02 3.294e-02
1.357 0.19079
poly(w.rain, 4)3 -2.153e-02 2.945e-02 -0.731 0.47374
poly(w.rain, 4)4 6.130e-02 2.956e-02
2.074 0.05194 .
polynomial not
required
7/14/2017
330 lecture 13
25
Final fit
Call:
lm(formula = price^(-1/3) ~ year + temp + h.rain + w.rain,
data = wine.df)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00 1.613e+00 -2.273 0.03317 *
year
2.639e-03 7.870e-04
3.353 0.00288 **
temp
-7.051e-02 1.061e-02 -6.644 1.11e-06 ***
h.rain
4.423e-04 8.905e-05
4.967 5.71e-05 ***
w.rain
-1.157e-04 5.333e-05 -2.170 0.04110 *
--Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` '
1
Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-Squared: 0.8331,
Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08
7/14/2017
330 lecture 13
26
Conclusions
 Model using price-1/3 as response fits well
 Use this model for prediction, understanding
relationships
• Coef of year >0 so price-1/3 increases with year (i.e.
older vintages are more valuable)
• Coef of h.rain > 0 so high harvest rain increases
price-1/3 (decreases price)
• Coef of w.rain < 0 so high winter rain decreases
price-1/3 (increases price)
• Coef of temp < 0 so high temps decrease price-1/3
(increases price)
7/14/2017
330 lecture 13
27
Another use for Box-Cox
plots
 The transformation indicated by the BoxCox plot is also useful in fixing up nonplanar data, and unequal variances, as we
saw in lecture 10
 If the gam plots indicate several
independent variables need transforming,
then transforming the response using the
Box-Cox power often works.
7/14/2017
330 lecture 13
28