ST430: Introduction to Regression Analysis
Chapter 8, Sections 5-7
Luo Xiao
November 9, 2015


Residual Analysis

Checking normality

One of the standard assumptions that ensures inferences are valid is that the random errors \epsilon = Y - E(Y | X) are normally distributed. Standard error calculations do not depend on the normality assumption, but P-values do. Except in small samples, departures from normality do not usually invalidate hypothesis tests or confidence intervals.

Often, when data are not normal, they show longer/heavier tails. Heavy tails generally make inferences conservative: a 95% confidence interval actually covers the true parameter value with probability higher than 95%, and similarly, the Type I error rate in a hypothesis test is less than the nominal \alpha. Conservative inferences are not optimal (for instance, confidence intervals are wider than they need to be).

One approach to checking normality is a hypothesis test:

    H_0: \epsilon is normally distributed, versus H_a: \epsilon is not normally distributed.

The Shapiro-Wilk test is often recommended. All such tests have relatively low power in small samples, and even in moderately large samples: the chance of detecting moderate non-normality is not close to 1, or equivalently, the Type II error rate is not close to 0.

Graphical checks

Use a histogram with a normal curve overlaid, or a normal quantile-quantile plot (Q-Q plot): quantiles of the observed data plotted against quantiles of the standard normal distribution. The normal Q-Q plot is more useful than the histogram.

R code for normality checks

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("SOCWORK.RData")
fit = lm(SALARY ~ EXP + I(EXP^2), SOCWORK)
r = residuals(fit)
par(mfrow = c(1, 2))
hist(r, probability = TRUE)                                 # histogram
curve(dnorm(x, mean = 0, sd = sd(r)), col = 2, add = TRUE)  # add a normal curve
qqnorm(r, pch = 20)                                         # Q-Q plot
qqline(r, col = 2)
shapiro.test(r)  # Shapiro-Wilk test gives a P-value of 0.3475

Histogram and Q-Q plot

[Figure: "Histogram of r" with fitted normal curve, and normal Q-Q plot of r with reference line.]

A Cauchy distribution (heavy-tailed)

[Figure: "Histogram of y" and normal Q-Q plot for a Cauchy sample.]

Another "non-normal" example

[Figure: "Histogram of r" and normal Q-Q plot for a second non-normal sample.]

Outliers

Recall that \hat{\epsilon}_i = Y_i - \hat{Y}_i is the i-th residual; it has the same units as Y. Residuals are often scaled in some way to make them dimensionless. Terminology varies! Here we follow R ('rstandard()' and 'rstudent()') and SAS, not the textbook.

Scaled residual ("standardized" residual in the text):

    z_i = (Y_i - \hat{Y}_i) / s = \hat{\epsilon}_i / s.

Rule of thumb: if |z_i| > 3, the i-th observation is an outlier. Equivalently, |Y_i - \hat{Y}_i| > 3s, a "3-\sigma event".
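The scaled residuals are easy to compute from the fitted model. A minimal sketch, reusing fit and r from the code above (the |z_i| > 3 cutoff is the rule of thumb from these slides, not anything built into R):

z = r / summary(fit)$sigma  # scaled residuals: each residual divided by s
which(abs(z) > 3)           # flag observations with |z_i| > 3 as outliers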
The "hat" matrix

Each observation contributes to the value of \hat{\beta}; in matrix notation,

    \hat{\beta} = (X'X)^{-1} X'Y.

So it also contributes to the predicted values:

    \hat{Y} = X\hat{\beta} = X(X'X)^{-1} X'Y = HY,

where H = X(X'X)^{-1} X' is the hat matrix. H "puts the hat on Y".

The residuals are \hat{\epsilon} = Y - \hat{Y} = (I - H)Y, and consequently (with some matrix algebra)

    var(\hat{\epsilon}_i) = \sigma^2 (1 - h_i),

where h_i is the i-th diagonal entry of H. The standardized residual ("studentized" residual in the text),

    z_i^* = \hat{\epsilon}_i / (s \sqrt{1 - h_i}) = (Y_i - \hat{Y}_i) / (s \sqrt{1 - h_i}) = z_i / \sqrt{1 - h_i},

is adjusted for these different variances. We can also use the rule of thumb with standardized residuals: if |z_i^*| > 3, the i-th observation is an outlier.

Leverage

Recall that \hat{Y} = HY, where H is the hat matrix:

    \hat{Y}_i = \sum_{j=1}^{n} h_{i,j} Y_j.

The diagonal entry h_{i,i} = h_i is the weight attached to Y_i itself in computing \hat{Y}_i, and is defined to be the leverage of the i-th observation. Leverage measures the contribution of Y_i to its own predicted value \hat{Y}_i.

Leverage satisfies 0 < h_i <= 1, and the average leverage is always

    \bar{h} = (k + 1) / n,

where k is the number of predictors. In many designed experiments, all observations have the same leverage, h_i \equiv \bar{h}; in observational studies, leverage can vary widely.

Rule of thumb: if h_i > 2\bar{h}, the i-th observation is a leverage point.

In the fourth residual plot, the standardized residuals are plotted against leverage.

Influence

An observation can be a leverage point yet not have a great influence on \hat{\beta}. Write \hat{\beta}^{(i)} for the parameter estimates when the i-th observation is omitted. If \hat{\beta}^{(i)} is very different from \hat{\beta}, the i-th observation has high influence.

One measure of the magnitude of \hat{\beta}^{(i)} - \hat{\beta} is Cook's distance,

    D_i = \sum_{j=1}^{n} (\hat{Y}_j^{(i)} - \hat{Y}_j)^2 / ((k + 1) s^2)
        = (\hat{\beta}^{(i)} - \hat{\beta})' (X'X) (\hat{\beta}^{(i)} - \hat{\beta}) / ((k + 1) s^2),

where \hat{Y}_j is the usual predicted value of Y_j and \hat{Y}_j^{(i)} is the predicted value using \hat{\beta}^{(i)}.

It can be shown that

    D_i = (z_i^2 / (k + 1)) \cdot (h_i / (1 - h_i)^2) = (z_i^{*2} / (k + 1)) \cdot (h_i / (1 - h_i)),

where z_i is the scaled residual, z_i^* is the standardized residual, and h_i is the leverage. If the i-th observation has both a large standardized residual z_i^* and high leverage h_i, Cook's distance D_i will be large.

Rule of thumb: if D_i > 1, the i-th observation is highly influential. The fourth residual plot shows contours of Cook's distance, so the rule of thumb is easy to use.
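All three diagnostics are available as base R functions. A minimal sketch for the SOCWORK fit above (the 2\bar{h} and D_i > 1 cutoffs are the rules of thumb from these slides, not R defaults):

h = hatvalues(fit)       # leverages h_i; mean(h) equals (k + 1) / n
zstar = rstandard(fit)   # standardized residuals z_i^*
D = cooks.distance(fit)  # Cook's distances D_i
which(h > 2 * mean(h))   # leverage points
which(D > 1)             # highly influential observations
plot(fit, which = 5)     # the fourth residual plot: z_i^* against leverage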
Detecting correlation

Time series data: regression models are sometimes used with responses Y_1, Y_2, ..., Y_n that are collected over time. Often one response is similar to the immediately preceding responses, which means that they are correlated. Since standard errors are usually calculated under the assumption of zero correlation, they can be quite incorrect, often too small by a factor of 2 or more.

When such serial correlation is present, both the estimation procedure (least squares) and the calculation of standard errors need to be modified. First we need to know when significant correlation is present.

Durbin-Watson test

The widely available Durbin-Watson test, developed by James Durbin and Geoffrey Watson, is based on the statistic

    d = \sum_{i=2}^{n} (\hat{\epsilon}_i - \hat{\epsilon}_{i-1})^2 / \sum_{i=1}^{n} \hat{\epsilon}_i^2.

The d statistic:
- Range of d: 0 <= d <= 4.
- If there is no correlation, d \approx 2.
- If observations are positively correlated, d < 2; with strong positive correlation, d \approx 0.
- If observations are negatively correlated, d > 2; with strong negative correlation, d \approx 4.

Example: trend in sales (run the code!)

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("SALES35.RData")
plot(SALES ~ T, data = SALES35, pch = 20)
fit = lm(SALES ~ T, SALES35)
summary(fit)
plot(fit)
library(car)
durbinWatsonTest(fit)

The usual four plots give no information about correlation.

Looking Ahead...

The 'arima()' function fits a regression model given an assumed model for the residual correlation. One simple model is the first-order autoregression, AR(1) (output in "output1.txt"):

fitAR = arima(SALES35$SALES, order = c(1, 0, 0), xreg = SALES35$T)
fitAR         # AR(1) model
summary(fit)  # regular linear regression

Note the increase in the standard error of the estimated trend, from 0.1069 to 0.1760.
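The d statistic can also be computed directly from the residuals, and the two standard errors quoted above can be read off the fitted objects. A minimal sketch (the side-by-side comparison is our addition to the slides; 'var.coef' is the coefficient covariance matrix stored in an 'Arima' object):

r = residuals(fit)
sum(diff(r)^2) / sum(r^2)   # the Durbin-Watson d statistic by hand
summary(fit)$coefficients   # lm trend SE: about 0.1069
sqrt(diag(fitAR$var.coef))  # arima SEs; the xreg (trend) entry: about 0.1760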