STATS 330: Lecture 13 (7/14/2017)
Diagnostics 5

Aim of today's lecture
• To discuss diagnostics and remedies for non-normality
• To apply these and the other diagnostics in a case study

Normality
Another assumption in the regression model is that the errors are normally distributed. This is not so crucial, but it can be important if the errors have a long-tailed distribution, since that implies there will be several outliers. The normality assumption is important for prediction.

Detection
The standard diagnostic is the normal plot (Q-Q plot) of the residuals:
• A straight plot is indicative of normality
• A convex (J-shaped) plot is indicative of right-skewed errors (e.g. gamma)
• An S-shape that flattens out at both ends is indicative of a symmetric, short-tailed distribution
• An S-shape that steepens at both ends is indicative of a symmetric, long-tailed distribution

qqnorm(residuals(xyz.lm))
[Figure: four normal Q-Q plots (sample quantiles vs theoretical quantiles) illustrating normal errors, right-skew errors, short-tailed errors and long-tailed errors]

Weisberg-Bingham test
Test statistic: WB = the squared correlation of the points in the normal plot; it measures how straight the plot is. WB lies between 0 and 1.
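The statistic just described, the squared correlation of the normal plot, can be sketched numerically. WB.test is an R function from the course's "330 functions"; the Python sketch below only reproduces the statistic itself (the Blom-style plotting positions are an assumption, and no p-value is computed):

```python
import math
import random
from statistics import NormalDist

def wb_statistic(residuals):
    """Squared correlation of the normal Q-Q plot of the residuals."""
    r = sorted(residuals)
    n = len(r)
    nd = NormalDist()
    # Blom-style plotting positions (an assumption; R's qqnorm uses a similar rule)
    q = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mq, mr = sum(q) / n, sum(r) / n
    cov = sum((a - mq) * (b - mr) for a, b in zip(q, r))
    var_q = sum((a - mq) ** 2 for a in q)
    var_r = sum((b - mr) ** 2 for b in r)
    return cov * cov / (var_q * var_r)

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(200)]
print(round(wb_statistic(normal_sample), 3))  # close to 1 for normal data
```

A long-tailed sample scores visibly lower than a normal one, which is exactly what the test exploits.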
Values close to 1 indicate normality. Use the R function WB.test in the 330 functions: if the p-value is > 0.05, normality is OK.

Example: cherry trees
> WB.test(cherry.lm)
WB test statistic = 0.989
p = 0.68
> qqnorm(residuals(cherry.lm))
[Figure: normal Q-Q plot of the cherry.lm residuals, close to a straight line]
Since p = 0.68 > 0.05, normality is OK.

Remedies for non-normality
The standard remedy is to transform the response using a power transformation. The idea is that the model does not fit well on the original scale but does on the transformed scale. The power is obtained by means of a "Box-Cox plot": we assume that for some power p the transformed response y^p follows the regression model, and the plot is a graphical way of estimating p. Technically, it is a plot of the "profile likelihood".

Case study: the wine data
Data on 27 vintages of Bordeaux wines. Variables are
• year (1952-1980)
• temp (average temperature during the growing season, °C)
• h.rain (total rainfall during the harvest period, mm)
• w.rain (total rainfall over the preceding winter, mm)
• price (in 1980 US dollars, converted to an index with 1961 = 100)
Data are on the web page as wine.txt.

[Figure: three Bordeaux bottles, priced $US 1000, $US 800 and $US 450]

Wine prices
Bordeaux wines are an iconic luxury consumer good; many consider them the best wines in the world. The quality and the price depend on the vintage (i.e. the year the wine was made). The prices are in 1980 dollars, in index form with 1961 = 100.

[Figure: Bordeaux wine price index, 1953-1980 (1961 = 100), plotted against year. Note the downward trend: the older the wine, the more valuable it is.]
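Before the analysis, the "profile likelihood" that a Box-Cox plot graphs can be sketched numerically. The sketch below uses invented one-predictor data (not the wine data) and a plain grid search; the boxcoxplot function in the 330 functions profiles the full regression model. The data are generated so that the square root of the response is linear in x, so the curve should peak near p = 0.5:

```python
import math
import random

def boxcox(y, p):
    """Box-Cox transform of a single positive value."""
    return math.log(y) if p == 0 else (y ** p - 1) / p

def profile_loglik(x, y, p):
    """Box-Cox profile log-likelihood for power p (simple one-predictor regression)."""
    z = [boxcox(v, p) for v in y]
    n = len(y)
    mx, mz = sum(x) / n, sum(z) / n
    # least-squares slope and intercept on the transformed scale
    b = sum((a - mx) * (c - mz) for a, c in zip(x, z)) / sum((a - mx) ** 2 for a in x)
    a0 = mz - b * mx
    rss = sum((c - (a0 + b * a)) ** 2 for a, c in zip(x, z))
    # profile log-likelihood up to a constant, including the Jacobian term
    return -n / 2 * math.log(rss / n) + (p - 1) * sum(math.log(v) for v in y)

random.seed(3)
x = [random.uniform(1, 10) for _ in range(100)]
# invented data: sqrt(y) is linear in x, so the best power is near 0.5
y = [(2 + 3 * a + random.gauss(0, 0.5)) ** 2 for a in x]
powers = [k / 10 for k in range(-10, 21)]
best = max(powers, key=lambda p: profile_loglik(x, y, p))
print(best)  # near 0.5
```

The power maximising this curve is the transformation the plot suggests; in the wine analysis below it turns out to be about -1/3.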
Preliminary analysis
> wine.df <- read.table(file.choose(), header = T)
> wine.lm <- lm(price ~ temp + h.rain + w.rain + year, data = wine.df)
> summary(wine.lm)

Summary output
Residuals:
     Min       1Q   Median       3Q      Max
 -14.077   -9.040   -1.018    3.172   26.991

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 1305.52761  597.31137   2.186  0.03977 *
temp          19.25337    3.92945   4.900 6.72e-05 ***
h.rain        -0.10121    0.03297  -3.070  0.00561 **
w.rain         0.05704    0.01975   2.889  0.00853 **
year          -0.82055    0.29140  -2.816  0.01007 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.69 on 22 degrees of freedom
Multiple R-squared: 0.7369, Adjusted R-squared: 0.6891
F-statistic: 15.41 on 4 and 22 DF, p-value: 3.806e-06

Diagnostic plots
[Figure: standard diagnostic plots for wine.lm]

Checking normality
> qqnorm(residuals(wine.lm))
> WB.test(wine.lm)
WB test statistic = 0.957
p = 0.03
[Figure: normal Q-Q plot of the wine.lm residuals]
Since p = 0.03 < 0.05, normality is doubtful.

Box-Cox routine
> boxcoxplot(price ~ year + temp + h.rain + w.rain, data = wine.df)
[Figure: Box-Cox plot of the profile likelihood against the power p; the maximum is at about p = -1/3]

Transform and refit
Use y^(-1/3) as the response (reciprocal cube root). Has the fit improved?
• Are the errors now more normal? (normal plot)
• Has R^2 increased?
• Could we improve normality by a further transformation? (a Box-Cox power of 1 means the fit cannot be improved by further transformation)

[Figure: normal Q-Q plot of the residuals from the transformed fit. Normality is better!]

Call:
lm(formula = price^(-1/3) ~ temp + h.rain + w.rain + year, data = wine.df)

Residuals:
      Min        1Q    Median        3Q       Max
-0.048426 -0.024007 -0.002757  0.022559  0.055406

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00  1.613e+00  -2.273  0.03317 *
temp        -7.051e-02  1.061e-02  -6.644 1.11e-06 ***
h.rain       4.423e-04  8.905e-05   4.967 5.71e-05 ***
w.rain      -1.157e-04  5.333e-05  -2.170  0.04110 *
year         2.639e-03  7.870e-04   3.353  0.00288 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-squared: 0.8331, Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08

Better! (R-squared was 0.7369)

> boxcoxplot(price^(-1/3) ~ year + temp + h.rain + w.rain, data = wine.df)
[Figure: Box-Cox plot for the transformed response; the maximum is at power = 1, so no further improvement is possible]

Conclusion
The transformation has been spectacularly successful in improving the fit! What about other aspects of the fit?
• residuals vs fitted values
• pairs plots
• partial residual plots

> plot(trans.lm)
[Figure: standard diagnostic plots for trans.lm. No problems!]

> plot(gam(price^(-1/3) ~ temp + h.rain + s(w.rain) + s(year), data = wine.df))
[Figure: gam plots of s(w.rain) and s(year). A 4th-degree polynomial for winter rain? Would w.rain be better as a quadratic?]

> summary(trans2.lm)
lm(formula = price^(-1/3) ~ year + temp + h.rain + poly(w.rain, 4), data = wine.df)

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)      -2.974e+00  1.532e+00  -1.942  0.06715 .
year              2.284e-03  7.459e-04   3.062  0.00642 **
temp             -7.478e-02  1.048e-02  -7.137 8.75e-07 ***
h.rain            4.869e-04  8.662e-05   5.622 2.02e-05 ***
poly(w.rain, 4)1 -7.561e-02  3.263e-02  -2.317  0.03180 *
poly(w.rain, 4)2  4.469e-02  3.294e-02   1.357  0.19079
poly(w.rain, 4)3 -2.153e-02  2.945e-02  -0.731  0.47374
poly(w.rain, 4)4  6.130e-02  2.956e-02   2.074  0.05194 .

The higher-order terms are not significant: the polynomial is not required.

Final fit
Call:
lm(formula = price^(-1/3) ~ year + temp + h.rain + w.rain, data = wine.df)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.666e+00  1.613e+00  -2.273  0.03317 *
year         2.639e-03  7.870e-04   3.353  0.00288 **
temp        -7.051e-02  1.061e-02  -6.644 1.11e-06 ***
h.rain       4.423e-04  8.905e-05   4.967 5.71e-05 ***
w.rain      -1.157e-04  5.333e-05  -2.170  0.04110 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03156 on 22 degrees of freedom
Multiple R-squared: 0.8331, Adjusted R-squared: 0.8028
F-statistic: 27.46 on 4 and 22 DF, p-value: 2.841e-08

Conclusions
The model using price^(-1/3) as the response fits well. Use this model for prediction and for understanding the relationships. Note that price^(-1/3) is a decreasing function of price, so:
• Coefficient of year > 0, so price^(-1/3) increases with year (i.e. older vintages are more valuable)
• Coefficient of h.rain > 0, so high harvest rain increases price^(-1/3) (decreases price)
• Coefficient of w.rain < 0, so high winter rain decreases price^(-1/3) (increases price)
• Coefficient of temp < 0, so high temperatures decrease price^(-1/3) (increase price)

Another use for Box-Cox plots
The transformation indicated by the Box-Cox plot is also useful in fixing up non-planar data and unequal variances, as we saw in lecture 10. If the gam plots indicate that several independent variables need transforming, then transforming the response using the Box-Cox power often works.
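Returning to the wine conclusions, the sign reasoning can be checked numerically: because z = price^(-1/3) is decreasing in price, a predictor that raises z lowers the predicted price. The Python sketch below plugs the fitted coefficients from the final summary into the model; the predictor values used are hypothetical inputs chosen only for illustration:

```python
def inv_cube_root_back(z):
    """Invert z = price^(-1/3) back to the price scale (z must be positive)."""
    return z ** -3.0

# fitted coefficients copied from the final summary output
coef = {"intercept": -3.666, "year": 2.639e-3, "temp": -7.051e-2,
        "h.rain": 4.423e-4, "w.rain": -1.157e-4}

def predict_price(year, temp, h_rain, w_rain):
    # linear predictor on the price^(-1/3) scale, then back-transform
    z = (coef["intercept"] + coef["year"] * year + coef["temp"] * temp
         + coef["h.rain"] * h_rain + coef["w.rain"] * w_rain)
    return inv_cube_root_back(z)

# more harvest rain raises price^(-1/3), hence lowers the predicted price
lo = predict_price(1961, 17.0, 100, 600)
hi = predict_price(1961, 17.0, 300, 600)
print(lo > hi)  # True: more harvest rain, lower predicted price
```

The same check works for the other coefficients: a warmer growing season lowers z and so raises the predicted price, matching the bullet-point conclusions.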