ST430: Introduction to Regression Analysis, Ch4, Sec 4.12-4.14
Luo Xiao
September 28, 2015

Multiple Linear Regression

Complete second-order models

When the first-order model

E(Y) = β0 + β1 X1 + β2 X2

is inadequate, the interaction model

E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2

may be better, but sometimes a complete second-order model is needed:

E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2 + β4 X1^2 + β5 X2^2

Example: cost of shipping packages

Example 4.8 in textbook.
Y: cost of package (dollars)
X1: package weight (pounds)
X2: distance shipped (miles)

Plot the data (see next slide). Model:

E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2 + β4 X1^2 + β5 X2^2

[Figure: scatter plots of Cost vs. Weight and Cost vs. Distance.]

Fit a complete second-order model in R

R code (see the text file: Output1):

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("EXPRESS.Rdata")  # load in data
fit = lm(Cost ~ Weight*Distance + I(Weight^2) + I(Distance^2),
         data = EXPRESS)  # fit the model
summary(fit)  # display the results

Comparing nested models

Two models are nested if one model contains all the terms of the other, and at least one additional term. The larger model is the complete (or full) model, and the smaller is the reduced (or restricted) model.

Example: with two independent variables X1 and X2, possible terms are X1, X2, X1 X2, X1^2, and so on.

Consider three models:

First order: E(Y) = β0 + β1 X1 + β2 X2;
Interaction: E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2;
Full second order: E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2 + β4 X1^2 + β5 X2^2.

The first order model is nested within both the Interaction model and the Full second order model. The Interaction model is nested within the Full second order model.

We usually want to use the simplest (most parsimonious) model that adequately fits the observed data. One way to decide between a full model and a reduced model is by testing

H0: the reduced model is adequate;
Ha: the full model is better.

When the full model has exactly one more term than the reduced model, we can use a t-test. E.g., testing the Interaction model

E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2

against the First order model

E(Y) = β0 + β1 X1 + β2 X2.

Here H0: "the reduced model is adequate" is the same as H0: β3 = 0, so the usual t-statistic is the relevant test statistic.

When the full model has more than one additional term, we use an F-test, which generalizes the t-test. Basic idea: fit both models, and test whether the full model fits significantly better than the reduced model:

F = (drop in SSE / number of extra terms) / (s^2 for the larger model),

where SSE is the sum of squared residuals. When H0 is true, F follows the F-distribution, which we use to find the P-value.
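As an illustration of this formula (not part of the textbook code), here is a minimal R sketch of the partial F-test. The function name partial_F_test and the use of deviance() to extract the SSEs are our own choices; fitR and fitC are assumed to be lm() fits of the reduced and complete models on the same data.

# Minimal sketch of the partial F-test for two nested lm() fits.
partial_F_test <- function(fitR, fitC) {
  sseR <- deviance(fitR)                              # SSE of the reduced model
  sseC <- deviance(fitC)                              # SSE of the complete model
  df_extra <- df.residual(fitR) - df.residual(fitC)   # number of extra terms
  s2 <- sseC / df.residual(fitC)                      # s^2 for the larger model
  Fstat <- ((sseR - sseC) / df_extra) / s2
  pval <- pf(Fstat, df_extra, df.residual(fitC), lower.tail = FALSE)
  c(F = Fstat, num_df = df_extra, den_df = df.residual(fitC), p_value = pval)
}

Its output should agree with the anova(fitR, fitC) comparison introduced below.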
E.g., testing the (full) second order model

E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2 + β4 X1^2 + β5 X2^2

against the (reduced) interaction model

E(Y) = β0 + β1 X1 + β2 X2 + β3 X1 X2.

Here H0 is β4 = β5 = 0, and Ha is the opposite.

Two ways of carrying out the test:
Manually, using the R functions "summary" or "anova";
Automatically, using the R function "anova".

Manual calculation

Note:

F = [(SSE_R − SSE_C) / (number of β parameters being tested)] / (s^2 for the larger model),

where SSE_R and SSE_C are the SSEs of the reduced and complete models. To find SSE_R, SSE_C, and s^2 for the larger model, use the R function "anova".

Example 4.8 revisited

Example 4.11 in textbook. R code (see text file: Output2):

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("EXPRESS.Rdata")  # load in data
fitR = lm(Cost ~ Weight*Distance, data = EXPRESS)  # interaction model
fitC = lm(Cost ~ Weight*Distance + I(Weight^2) + I(Distance^2),
          data = EXPRESS)  # complete second order model
anova(fitR)
anova(fitC)

See the 'Residuals' line in each set of results:

SSE(Full second order) = 2.745, SSE(Interaction) = 6.633.

For the Full second order model, s^2 = 0.196. So

F = [(6.633 − 2.745)/2] / 0.196 = 9.92,  P < .01.

Automatic calculation in R

Use the R function "anova(fitR, fitC)", where "fitR" is the fit from the reduced model and "fitC" is the fit from the complete model. R code (see the text file (Output3) for output):

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("EXPRESS.Rdata")  # load in data
fitR = lm(Cost ~ Weight*Distance, data = EXPRESS)  # interaction model
fitC = lm(Cost ~ Weight*Distance + I(Weight^2) + I(Distance^2),
          data = EXPRESS)  # complete second order model
anova(fitR, fitC)

When there is only one additional term in the complete model

We can use the t-test for the corresponding β parameter. We can also use the above F-test. The two tests are exactly the same test:
The F-statistic is exactly the square of the t-statistic.
The F critical values are exactly the squares of the (two-sided) t critical values.
So the P-value is exactly the same.

Complete example: road construction cost

Section 4.14 of textbook; data from the Florida Attorney General's office.
Y = winning contract cost ("COST" in data "FLAG")
X1 = DOT engineer's estimate of cost ("DOTEST" in data "FLAG")
X2 = indicator of fixed bidding ("STATUS" in data "FLAG"): X2 = 1 if fixed, 0 if competitive.

Get the data and plot them (see the sketch below and the plots on the next slide). What does the visualization tell us?
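A sketch of code that would produce plots like those on the next slide; it is not from the textbook output files, and the panel layout and graphics choices are assumptions.

# Sketch: COST vs. DOTEST, one panel per STATUS level, each with a 45 degree line.
load("FLAG.Rdata")                                    # assumes FLAG.Rdata is in the working directory
par(mfrow = c(1, 2))
plot(COST ~ DOTEST, data = subset(FLAG, STATUS == 0), main = "STATUS = 0")
abline(0, 1, lty = 2)                                 # 45 degree line
plot(COST ~ DOTEST, data = subset(FLAG, STATUS == 1), main = "STATUS = 1")
abline(0, 1, lty = 2)                                 # 45 degree line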
Y seems to be positively related to X1;
X2 seems to affect Y.

Further insights from the visualization:
It seems to suggest different slopes;
It seems that straight lines might be sufficient.

Visualization

[Figure: COST vs. DOTEST, separate panels for STATUS = 0 and STATUS = 1, each with a 45 degree line.]

Section 4.14 suggests beginning with the full second order model, and simplifying it as far as possible (but no further!). We'll take the opposite approach: begin with the first order model, and complicate it as far as necessary.

Because X2 is an indicator variable, the first order model is a pair of parallel straight lines (see plot on later slides).

R code for the additive model (see text file: Output4):

setwd("~/Dropbox/teaching/2015Fall/R_datasets/Exercises&Exampl
load("FLAG.Rdata")  # load in data
fitA = lm(COST ~ DOTEST + STATUS, data = FLAG)  # additive model
summary(fitA)

[Figure: COST vs. DOTEST with the overlaid fits from the additive model, one line for STATUS = 0 and one for STATUS = 1.]

Both variables look important. The 'DOTEST' coefficient is close to 1, so the winning bids roughly track the estimated cost. The positive 'STATUS' coefficient means the line for 'STATUS' = 1 is higher than the line for 'STATUS' = 0.

Are the slopes different? Try the interaction model (see text file: Output5):

fitR = lm(COST ~ DOTEST*STATUS, data = FLAG)  # interaction model

Interaction model involving a qualitative variable

The interaction model:

E(COST) = β0 + β1 DOTEST + β2 STATUS + β3 DOTEST*STATUS.

Hence,

E(COST) = β0 + β1 DOTEST, if STATUS = 0;
E(COST) = (β0 + β2) + (β1 + β3) DOTEST, if STATUS = 1.

The interaction term is highly significant: reject the first order model in favor of the interaction model. The slopes are 0.921336 for 'STATUS' = 0, and 0.921336 + 0.163282 = 1.084618 for 'STATUS' = 1. So the competitive auctions are won with bids that fall slightly below the estimated cost, while the fixed winning bids fall slightly above the estimated cost.

[Figure: COST vs. DOTEST, panels for STATUS = 0 and STATUS = 1, each showing the 45 degree line and the fitted line from the interaction model.]
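As a small aside (not from the textbook), the two slopes quoted above can be read directly off the fitted interaction model. This sketch assumes STATUS is stored as a 0/1 numeric variable, as in the data description, and that fitR is the lm() object from Output5.

# Sketch: slopes implied by the interaction model COST ~ DOTEST*STATUS.
b <- coef(fitR)                                      # assumes STATUS is numeric 0/1
slope0 <- unname(b["DOTEST"])                        # slope when STATUS = 0
slope1 <- unname(b["DOTEST"] + b["DOTEST:STATUS"])   # slope when STATUS = 1
c(STATUS0 = slope0, STATUS1 = slope1)                # should match 0.921336 and 1.084618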
To validate the interaction model, we compare it with (finally!) the full second order model.

Note: when some variables are qualitative, the "full second order model" consists of the full second order (i.e., quadratic) model in the quantitative variables, plus the interactions of those terms with the qualitative variables.

R code for the model (see text file: Output6):

fitC = lm(COST ~ DOTEST*STATUS + I(DOTEST^2) + I(DOTEST^2*STATUS),
          data = FLAG)  # complete second order model

F-test for the interaction model against the complete second order model

Use the R function: anova(fitR, fitC). See text file: Output7.

The significance of the two added terms is tested using an F statistic with 2 degrees of freedom in the numerator; the P-value is 0.3548, so we do not reject the null hypothesis that the interaction model is adequate.
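As a closing aside (not shown in the textbook output), anova() accepts more than two fitted models, so the whole nested sequence can be compared in one call, using the objects fitA, fitR, and fitC from the earlier slides.

# Sketch: sequential F-tests for the nested sequence
# additive -> interaction -> complete second order.
anova(fitA, fitR, fitC)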