ST430: Introduction to Regression Analysis
Case Study 2
Luo Xiao
October 12, 2015


Case Study 2

Price of residential property

How does the sale price of a property relate to the appraised values of the land and the improvements on the property, and to the neighborhood it is in?

Two questions:
- Do the data indicate that price can be predicted from these variables?
- Is the relationship the same in different neighborhoods?

Available data for 176 sales between May 2008 and June 2009:
- Sale price, in thousands of dollars, Y;
- Appraised land value, in thousands of dollars, X1;
- Appraised improvement value, in thousands of dollars, X2;
- Neighborhood: Cheval, Davis Isles, Hunter's Green and Hyde Park:
  - baseline neighborhood: Cheval;
  - indicator variable X3 for Davis Isles;
  - indicator variable X4 for Hunter's Green;
  - indicator variable X5 for Hyde Park.

Load in data and plot them

    setwd("/Users/xiaoyuesixi/Dropbox/teaching/2015Fall/R_datasets")
    load("TAMSALES4.Rdata")  # load in data
    par(mfrow = c(1, 2), mar = c(4, 4.5, 2, 1))
    plot(SALES ~ LAND, data = TAMSALES4, pch = 20)
    plot(SALES ~ IMP, data = TAMSALES4, pch = 20)

[Figure: scatter plots of SALES versus LAND and SALES versus IMP]

Scatter plot: Y versus X1 (different neighborhoods)
[Figure: SALES versus LAND, with points distinguished by neighborhood: CHEVAL, DAVISISLES, HUNTERSGREEN, HYDEPARK]

Scatter plot: Y versus X2 (different neighborhoods)
[Figure: SALES versus IMP, with points distinguished by neighborhood]

Models (nested) to consider

Model 1: E(Y) = β0 + β1 X1 + β2 X2;

Model 2: E(Y) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5;

Model 3: E(Y) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5
                + β6 X1 X3 + β7 X1 X4 + β8 X1 X5 + β9 X2 X3 + β10 X2 X4 + β11 X2 X5;

Model 4: E(Y) = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5
                + β6 X1 X3 + β7 X1 X4 + β8 X1 X5 + β9 X2 X3 + β10 X2 X4 + β11 X2 X5
                + β12 X1 X2 + β13 X1 X2 X3 + β14 X1 X2 X4 + β15 X1 X2 X5.

Implications of the models when X3 = 1

Model 1: E(Y) = β0 + β1 X1 + β2 X2;
Model 2: E(Y) = (β0 + β3) + β1 X1 + β2 X2;
Model 3: E(Y) = (β0 + β3) + (β1 + β6) X1 + (β2 + β9) X2;
Model 4: E(Y) = (β0 + β3) + (β1 + β6) X1 + (β2 + β9) X2 + (β12 + β13) X1 X2.

Model formulas in R

    # Model 1
    fit1 = lm(SALES ~ LAND + IMP, data = TAMSALES4)
    # Model 2
    fit2 = lm(SALES ~ LAND + IMP + NBHD, data = TAMSALES4)
    # Model 3
    fit3 = lm(SALES ~ (LAND + IMP) * NBHD, data = TAMSALES4)
    # Model 4
    fit4 = lm(SALES ~ LAND * IMP * NBHD, data = TAMSALES4)

Summary of models

    Model    R^2      R_a^2    s
    1        .9242    .9233    112.9
    2        .9277    .9256    111.3
    3        .9334    .9290    108.7
    4        .9415    .9361    103.1

(R_a^2 is the adjusted R^2 and s is the residual standard error.)

Notes
- Models with more predictors always give larger R^2 (adding a predictor can never increase the SSE), so R^2 alone cannot be used to compare nested models.
- Models with higher R_a^2 always give smaller s.
- Small s and high R_a^2 are desirable.
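The three columns of the table can be read straight off summary() for each fit. A minimal sketch, assuming fit1 through fit4 have been created as on the "Model formulas in R" slide:

    # Collect R^2, adjusted R^2 and residual standard error s for each model
    fits <- list(fit1, fit2, fit3, fit4)
    round(t(sapply(fits, function(f) {
      sm <- summary(f)
      c(R2 = sm$r.squared, adjR2 = sm$adj.r.squared, s = sm$sigma)
    })), 4)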
Here, Model 4 is optimal on both criteria.

Model 1 versus Model 2

H0: β3 = β4 = β5 = 0;
Ha: at least one of β3, β4, β5 is not zero.

R code for testing nested models: anova(fit1, fit2) (output in "CS2_output1.txt").

Model 2 differs from Model 1 only by including NBHD, so we can also use anova(fit2) (output in "CS2_output2.txt").

The result of the test shows that we reject Model 1 in favor of Model 2 at the 5% level, but not at the 1% level.

Model 2 versus Model 3

H0: β6 = β7 = β8 = β9 = β10 = β11 = 0;
Ha: at least one of β6, ..., β11 is not zero.

Model 3 differs from Model 2 by including the interactions LAND:NBHD and IMP:NBHD, so there are two ways to calculate the F-test:
1. use anova(fit2, fit3) (output in "CS2_output3.txt");
2. do the calculation from the outputs of anova(fit2) and anova(fit3) (in "CS2_output2.txt" and "CS2_output4.txt").

F-test

Recall the F statistic for nested models:

    F = [(SSE_R − SSE_C) / (number of β's tested)] / MSE_C.

From "CS2_output2.txt" and "CS2_output4.txt": SSE_R = 2104582, SSE_C = 1936871, MSE_C = 11810, so

    F = [(2104582 − 1936871) / 6] / 11810 = 2.367.

This F statistic has an F distribution with 6 and 164 degrees of freedom.

R code for the p-value: 1 - pf(2.367, 6, 164)

F = 2.367 has a p-value of 0.0321, so we also reject Model 2 in favor of Model 3 at the 5% level.

Model 3 versus Model 4

H0: β12 = ... = β15 = 0;
Ha: at least one of β12, ..., β15 is not zero.

Model 4 differs from Model 3 by including the interactions LAND:IMP and LAND:IMP:NBHD. Use either anova(fit3, fit4) ("CS2_output5.txt") for the test, or the outputs from anova(fit3) ("CS2_output4.txt") and anova(fit4) ("CS2_output6.txt").

F = 5.5389 with a p-value of .0003, so we also reject Model 3 in favor of Model 4 at the 5% level, and at the 1% level.

Notes about the F-tests

Each of these tests answers the question: is there enough evidence against the simpler model to reject it?

This is not the same as: which of these models will give the best predictions?

Interpreting Model 4 (R output in "CS2_output7.txt")

The baseline neighborhood is Cheval, so the equation for that neighborhood is

    E(Y) = 155.2 − 0.8272 X1 + 0.9609 X2 + 0.00517 X1 X2,

a two-variable interaction model. For each other neighborhood, the equation is also a two-variable interaction model, but with different coefficients.

For another neighborhood, Davis Isles (indicator variable X3), we add the Davis Isles main effect and the corresponding interaction terms involving X3:
- NBHDDAVISISLES = −60.17 to the intercept;
- LAND:NBHDDAVISISLES = 2.012 to the coefficient of X1;
- IMP:NBHDDAVISISLES = −0.1977 to the coefficient of X2;
- LAND:IMP:NBHDDAVISISLES = −0.004278 to the coefficient of X1 X2.

We get

    E(Y) = (155.2 − 60.17) + (−0.8272 + 2.012) X1 + (0.9609 − 0.1977) X2 + (0.00517 − 0.004278) X1 X2
         = 95.03 + 1.1848 X1 + 0.7632 X2 + 0.000892 X1 X2.
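The Davis Isles equation can also be assembled directly from the fitted coefficients. A minimal sketch, assuming the coefficient names produced by lm() are spelled as in the output quoted above (e.g. "NBHDDAVISISLES", "LAND:NBHDDAVISISLES"):

    # Combine the baseline (Cheval) coefficients with the Davis Isles adjustments
    b <- coef(fit4)
    c(intercept = b[["(Intercept)"]] + b[["NBHDDAVISISLES"]],
      LAND      = b[["LAND"]]        + b[["LAND:NBHDDAVISISLES"]],
      IMP       = b[["IMP"]]         + b[["IMP:NBHDDAVISISLES"]],
      LAND.IMP  = b[["LAND:IMP"]]    + b[["LAND:IMP:NBHDDAVISISLES"]])

This should reproduce 95.03, 1.1848, 0.7632 and 0.000892 up to rounding.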
Prediction with Model 4

Predict the sale price of a residential property with an appraised land value of 150 thousand dollars and an appraised improvement value of 750 thousand dollars, located in Hyde Park.

R code (output in "CS2_output8.txt"):

    predict(fit4,
            newdata = data.frame(LAND = 150, IMP = 750, NBHD = subsets[[4]]$NBHD[1]),
            interval = "prediction")
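If the subsets object used above is not at hand, the new observation can be built directly, since NBHD only needs to match one of the factor levels seen when fitting. A sketch, assuming the level is spelled "HYDEPARK" as in the plot legends:

    # Equivalent prediction with the neighborhood level written out explicitly
    new_prop <- data.frame(LAND = 150, IMP = 750, NBHD = "HYDEPARK")
    predict(fit4, newdata = new_prop, interval = "prediction")

The "prediction" interval accounts for both the uncertainty in the estimated coefficients and the error variance of an individual sale, so it is wider than the corresponding "confidence" interval for the mean sale price.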