ST430: Introduction to Regression Analysis, Chapter 8, Sections 1-4
Luo Xiao
November 2, 2015

Residual Analysis

Inferences about a regression model are valid only under assumptions about the random errors in the observations.

Objectives:
Show how residuals reveal departures from assumptions;
Suggest procedures for coping with such departures.

Regression residuals

The random errors satisfy Y = E(Y) + ε, or ε = Y − E(Y).

We observe Y, but we do not know E(Y), so we cannot calculate ε. We estimate E(Y) by Ŷ, the predicted (or fitted) value, and approximate the random errors by the regression residuals:

ε̂_i = Y_i − Ŷ_i,  i = 1, 2, ..., n.

Properties of residuals

If the model contains an intercept, the sum of the residuals, and hence also their mean, is zero:

Σ_{i=1}^n ε̂_i = 0, and so the mean residual is 0.

The covariance of the residuals with any term in the regression model is zero:

Σ_{i=1}^n ε̂_i X_{i,j} = 0,  j = 1, 2, ..., k.

Detecting lack of fit

A misspecified model is one that leaves out a relevant term, for example a predictor or a needed higher-order term. The residuals from a misspecified model do not have mean zero.

Example: serum cholesterol (Y) and dietary fat (X) in Olympic athletes.
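The two properties listed under "Properties of residuals" can be verified numerically. The slides use R; the sketch below is a minimal Python/numpy check on simulated, made-up data:

```python
import numpy as np

# Check the two residual properties on simulated data:
# fit Y on an intercept and one predictor by least squares.
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 3 + 2 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat                      # regression residuals

print(resid.sum())        # ~0: residuals sum to zero (model has an intercept)
print((resid * x).sum())  # ~0: residuals are orthogonal to each predictor
```

Both printed values are zero up to floating-point error, matching the two identities above.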
setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("OLYMPIC.RData")
pairs(OLYMPIC)

Scatter plot

(Figure: scatter plot matrix of FAT and CHOLES.)

Suppose we ignore the graph, and fit a first-order model (see "output1.txt"):

l1 <- lm(CHOLES ~ FAT, OLYMPIC)
summary(l1)
plot(OLYMPIC$FAT, residuals(l1), pch = 20)  # residuals versus X
abline(h = 0, col = "gray")                 # horizontal line at 0

The summary of the fitted model looks reasonable, but the graph of the residuals against X shows that the assumption E(ε) = 0 is violated. Because this is a straight-line model, this graph is effectively the same as the "residuals versus fitted values" graph from plot(l1).

Residual plot

(Figure: residuals(l1) against OLYMPIC$FAT, showing clear curvature.)

Because of the curvature, we could fit the second-order (quadratic) model (see "output2.txt"):

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("OLYMPIC.RData")
l2 <- lm(CHOLES ~ FAT + I(FAT^2), OLYMPIC)
summary(l2)
plot(OLYMPIC$FAT, residuals(l2), pch = 20)
abline(h = 0, col = "gray")

The residual plot suggests that the model is satisfactory, and the quadratic term is highly significant.

Residual plot

(Figure: residuals(l2) against OLYMPIC$FAT, with no systematic pattern remaining.)

Partial residuals

Sometimes the effect of an independent variable is better described by a transformed version: log(X), 1/X, etc. The partial residual plot can help identify the transformation.

The partial residuals for independent variable X_j are

ε̂* = ε̂ + β̂_j X_j.

Plot ε̂* against X_j. This is also known as a "Component + Residual" plot.
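The partial-residual formula ε̂* = ε̂ + β̂_j X_j can be computed by hand. Below is a Python sketch on simulated data with hypothetical predictors x1 and x2 (the slides themselves use car::crPlot in R); x1 truly enters through log(x1), so the partial residuals for x1 trace the log curve rather than a straight line:

```python
import numpy as np

# Partial residuals by hand: eps_hat_star = eps_hat + beta_hat_j * x_j.
# Simulated data in which x1 truly enters through log(x1).
rng = np.random.default_rng(1)
n = 100
x1 = rng.uniform(1, 20, n)
x2 = rng.normal(0, 1, n)
y = 5 + 4 * np.log(x1) + 2 * x2 + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), x1, x2])     # first-order (linear) fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

partial = resid + beta[1] * x1                # partial residuals for x1
# Plotting 'partial' against x1 (e.g. with matplotlib) shows the curvature
# that suggests a log transformation of x1.
```

Here the partial residuals are almost perfectly correlated with log(x1), which is exactly the diagnostic signal the Component + Residual plot displays.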
Example

Effect of price (PRICE) and advertising (X) on demand (DEMAND) for coffee.

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("COFFEE2.RData")
pairs(COFFEE2, pch = 20)

Try a first-order model:

l1 <- lm(DEMAND ~ PRICE + X, COFFEE2)
summary(l1)
plot(COFFEE2$PRICE, residuals(l1), pch = 20)
abline(h = 0, col = "gray")

The residual plot shows misspecification.

Scatter plot

(Figure: scatter plot matrix of DEMAND, PRICE, and X.)

Residual plot

(Figure: residuals(l1) against COFFEE2$PRICE, showing curvature.)

The Component + Residual plot:

library(car)
crPlot(l1, variable = "PRICE", pch = 20)

The curve suggests either adding a PRICE^2 term, or transforming to log(PRICE) or 1/PRICE. R² and adjusted R² are highest for 1/PRICE.

Note: the partial regression (added-variable) plot is a different plot.

Component + Residual plot

(Figure: component + residual plot for PRICE.)

Detecting unequal variances

Homoscedasticity versus heteroscedasticity; that is, constant variance versus varying variance. When the variance is not constant, it is most often related to the mean:

For Poisson-distributed data (counts), var(Y) = E(Y).
When errors are multiplicative, Y = E(Y) × (1 + ε), and var(Y) ∝ E(Y)².

Sometimes the variance can be made constant by transforming Y. For example, with multiplicative errors,

log(Y) = log[E(Y) × (1 + ε)] = log[E(Y)] + log(1 + ε) ≈ log[E(Y)] + ε.

So var[log Y] is (approximately) constant.
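The claim that the log transform stabilizes multiplicative errors can be checked by simulation. A Python sketch, with two arbitrary mean levels and an assumed 5% multiplicative error:

```python
import numpy as np

# Multiplicative errors: Y = E(Y) * (1 + eps), so sd(Y) grows with the mean.
# After taking logs the spread is roughly the same at every mean level.
rng = np.random.default_rng(2)
m = 100_000
eps_low = rng.normal(0, 0.05, m)     # assumed 5% multiplicative error
eps_high = rng.normal(0, 0.05, m)

y_low = 100.0 * (1 + eps_low)        # mean level 100 (arbitrary)
y_high = 1000.0 * (1 + eps_high)     # mean level 1000 (arbitrary)

r_raw = y_high.std() / y_low.std()
r_log = np.log(y_high).std() / np.log(y_low).std()
print(r_raw)   # ~10: sd proportional to the mean
print(r_log)   # ~1: variance stabilized by the log transform
```

The sd of Y scales with E(Y) (the ratio of means here is 10), while the sd of log(Y) is essentially the same at both mean levels, as the expansion above predicts.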
Sometimes the variance can be made constant by a transformation, but a different method may be better. For example, with Poisson-distributed counts, √Y has approximately constant variance, but a generalized linear model may be more satisfactory.

Example

Salary and experience for social workers.

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("SOCWORK.RData")
pairs(SOCWORK, pch = 20)

Try a second-order model:

l2 <- lm(SALARY ~ EXP + I(EXP^2), SOCWORK)
summary(l2)
plot(fitted(l2), residuals(l2), pch = 20)
abline(h = 0, col = "gray")

Scatter plot

(Figure: scatter plot matrix of EXP, SALARY, and LNSALARY.)

Residual plot

(Figure: residuals(l2) against fitted(l2), fanning out as the fitted values increase.)

The "Residuals vs Fitted" plot shows a fan-shaped scatter, and the "Scale-Location" plot shows an upward trend (try plot(l2)). This suggests sd(Y) ∝ E(Y), hence var(Y) ∝ E(Y)², so try logarithms.

Second-order model for log(SALARY):

lLog2 <- lm(log(SALARY) ~ EXP + I(EXP^2), SOCWORK)
summary(lLog2)

The quadratic term is not significant, so try a first-order model:

lLog1 <- lm(log(SALARY) ~ EXP, SOCWORK)
summary(lLog1)
plot(fitted(lLog1), residuals(lLog1), pch = 20)
abline(h = 0, col = "gray")

The residual plots are more satisfactory.
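As a quick check of the earlier statement that √Y has approximately constant variance for Poisson counts, a simulated Python sketch (mean levels 10 and 100 chosen arbitrarily):

```python
import numpy as np

# Square-root transform for Poisson counts: var(sqrt(Y)) is close to 1/4
# whatever the mean, once the mean is not too small.
rng = np.random.default_rng(3)
m = 200_000
var_sqrt = {}
for lam in (10, 100):                 # two arbitrary mean levels
    y = rng.poisson(lam, m)
    var_sqrt[lam] = np.sqrt(y).var()
print(var_sqrt)                       # both values near 0.25
```

Even though var(Y) itself differs by a factor of 10 between the two mean levels, var(√Y) stays near 1/4 for both.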
Residual plot

(Figure: residuals(lLog1) against fitted(lLog1), with roughly constant spread.)

Simple test for heteroscedasticity

Divide the data set in two, for instance low fitted values versus high fitted values. Fit the model separately to each part, and compare the MSEs (mean squared errors). Under H0: the variance is constant,

F* = MSE1 / MSE2

has the F-distribution with ν1 = n1 − (k + 1) and ν2 = n2 − (k + 1) degrees of freedom.

This is usually a two-sided test, with Ha: the variance is not constant. Reject H0 at level α if F* differs too far from 1 in either direction; that is, if

F* < F_{1−α/2}(ν1, ν2), the lower α/2-point of the distribution, or
F* > F_{α/2}(ν1, ν2), the upper α/2-point of the distribution.

Note: F_{1−α/2}(ν1, ν2) = 1 / F_{α/2}(ν2, ν1), so an equivalent method is based on

F = (larger MSE) / (smaller MSE) = max(F*, 1/F*).

Then we reject H0 if F > F_{α/2}(ν_larger, ν_smaller).
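The split-sample F-test above can be sketched in Python (numpy/scipy) on simulated heteroscedastic data; the straight-line model, sample size, and error structure here are all made up for illustration:

```python
import numpy as np
from scipy import stats

# Split-sample F-test for constant variance, on simulated data whose
# error sd grows with x.
rng = np.random.default_rng(4)
n = 200
x = np.sort(rng.uniform(0, 10, n))
y = 1 + 2 * x + rng.normal(0, 0.2 + 0.3 * x, n)   # sd increases with x

def mse(x_part, y_part):
    """MSE from a straight-line fit: SSE / (n_part - (k + 1)), with k = 1."""
    X = np.column_stack([np.ones(len(x_part)), x_part])
    beta, *_ = np.linalg.lstsq(X, y_part, rcond=None)
    sse = ((y_part - X @ beta) ** 2).sum()
    return sse / (len(x_part) - 2)

half = n // 2                                 # low-x half vs high-x half
mse1, mse2 = mse(x[:half], y[:half]), mse(x[half:], y[half:])
F = max(mse1, mse2) / min(mse1, mse2)         # larger MSE over smaller MSE
nu = half - 2                                 # 98 df in each half
p_value = 2 * stats.f.sf(F, nu, nu)           # two-sided p-value
print(F, p_value)   # large F, small p: constant variance is rejected
```

Because the error sd grows with x, the high-x half has a much larger MSE, F is far above 1, and H0 (constant variance) is rejected.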