ST430: Introduction to Regression Analysis
Chapter 5, Sections 5.7-5.11
Luo Xiao
October 19, 2015

Model Building

Models with one qualitative variable

Recall: a qualitative variable with k levels is represented by (k − 1) indicator (or dummy) variables.
For the chosen reference level, all the indicator variables are 0; for each other level, the corresponding indicator variable is 1 and the others are 0.

Example

Per-user software maintenance cost, by state (samples of 10 users per state).
Y: cost
Reference level: Kansas
X1: indicator variable for Kentucky
X2: indicator variable for Texas

R code for reading in and plotting the data:

setwd("/Users/xiaoyuesixi/Dropbox/teaching/2015Fall/R_datasets")
load("BIDMAINT.Rdata")  # load in data
par(mfrow = c(1, 1), mar = c(4, 4.5, 2, 1))
boxplot(COST ~ STATE, data = BIDMAINT)  # box plot

[Box plot of COST by STATE: Kansas, Kentucky, Texas]

Model: E(Y) = β0 + β1X1 + β2X2.

R code for the model fit (see "output1.txt"):

fit = lm(COST ~ STATE, data = BIDMAINT)
summary(fit)

The fitted equation is

E(Y) = 279.6 + 80.3X1 + 198.2X2,

where X1 = indicator variable for Kentucky and X2 = indicator variable for Texas.

For Kansas, X1 = X2 = 0, so E(Y) = 279.6. That is, the "intercept" is actually the expected value for the reference state, Kansas.

For Kentucky, X1 = 1 and X2 = 0, so E(Y) = 279.6 + 80.3 = 359.9. That is, the coefficient 'STATEKentucky' is the difference between the expected value for Kentucky and the expected value for the reference state. Similarly, the coefficient 'STATETexas' is the difference between the expected value for Texas and the expected value for the reference state.

In R, the default reference level is the first in alphabetical order. The default can be overridden, e.g., using 'relevel(STATE, ref = "Kentucky")' in the linear regression.

Often these differences themselves are of no special interest, and the focus is on testing whether there are any differences at all:

H0: β1 = β2 = ... = βk−1 = 0.

The value of the F-statistic is unaffected by the choice of reference level.

R code for the F-test: 'anova(fit)'
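As a minimal sketch (assuming the BIDMAINT data frame loaded above, with columns COST and STATE), changing the reference level and running the overall F-test might look like this:

# Refit with Kentucky as the reference level; the fitted means are unchanged,
# only the parameterization (which differences the coefficients represent) changes.
BIDMAINT$STATE <- relevel(factor(BIDMAINT$STATE), ref = "Kentucky")
fit2 <- lm(COST ~ STATE, data = BIDMAINT)
summary(fit2)   # coefficients are now differences from Kentucky

# Overall F-test of H0: all state effects are zero.
# The F-statistic and p-value do not depend on the chosen reference level.
anova(fit2)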
Two qualitative variables

E.g., two brands of diesel engine and three types of fuel.

[Plot of engine performance by fuel type (F1, F2, F3) and brand (B1, B2)]

Notation

Y: performance
Reference fuel type: F1
X1: indicator for fuel type F2
X2: indicator for fuel type F3
Reference brand: B1
X3: indicator for brand B2

Main effects model: E(Y) = β0 + β1X1 + β2X2 + β3X3.

Means for the main effects model E(Y) = β0 + β1X1 + β2X2 + β3X3:

        B1                    B2
F1      µ11 = β0              µ12 = β0 + β3
F2      µ21 = β0 + β1         µ22 = β0 + β1 + β3
F3      µ31 = β0 + β2         µ32 = β0 + β2 + β3

Main effects model: the mean response as a function of F and B when F and B affect E(Y) independently.

Interaction plot

[Interaction plot: mean of PERFORM against FUEL (lines for each BRAND) and against BRAND (lines for each FUEL)]

The interaction plot tells a more complicated story: for F1 and F2, the effects are additive, with B1 performing better than B2 and F2 performing better than F1; but for F3, B2 performs better than B1.

Interaction model: E(Y) = β0 + β1X1 + β2X2 + β3X3 + β4X1X3 + β5X2X3.

R code for fitting the interaction model:

lm(PERFORM ~ FUEL * BRAND, data = DIESEL)

Testing if the interaction effect is significant

Compare the main effects model

E(Y) = β0 + β1X1 + β2X2 + β3X3

with the interaction model

E(Y) = β0 + β1X1 + β2X2 + β3X3 + β4X1X3 + β5X2X3.

Test H0: β4 = β5 = 0 using the R function 'anova' (output in "output2.txt").

Three or more qualitative variables

With a response Y and independent variables a, b, c, ..., the model might contain:
main effects: a + b + c + ...;
two-way interactions: a + b + c + a:b + a:c + b:c + ...;
higher-order interactions: a + b + c + a:b + a:c + b:c + a:b:c + ...
Often only main effects and low-order interactions are significant.

To estimate the highest-order interactions, we need observations for all possible combinations of levels: a factorial design. E.g., 2 × 3 = 6 combinations for the diesel engines.

With several variables, all with at least 2 levels, the number of combinations can be large. Sometimes a carefully chosen fraction of all possible combinations is used: a fractional factorial design.

Models with both quantitative and qualitative variables

Example: diesel engine performance Y, as a function of:
engine speed, X1;
fuel type, with levels F1, F2, and F3; take F1 as the reference level, and X2 and X3 as indicators for F2 and F3, respectively.

Simple model, ignoring fuel type (second-order model in X1):

E(Y) = β0 + β1X1 + β2X1².

Additive model, including main effects of fuel type:

E(Y) = β0 + β1X1 + β2X1² + β3X2 + β4X3.

Switching fuel from F1 to F2 adds β3 to the expected performance, independently of engine speed X1.

Interaction model:

E(Y) = β0 + β1X1 + β2X1² + β3X2 + β4X3 + β5X1X2 + β6X1X3 + β7X1²X2 + β8X1²X3.

The interaction model is the complete second-order model. It allows E(Y) to be a different quadratic function of X1 at each level of fuel type.

These models form a nested hierarchy, and we could choose among them using F-tests. An intermediate model like

E(Y) = β0 + β1X1 + β2X1² + β3X2 + β4X3 + β5X1X2 + β6X1X3

(that is, the interaction model with β7 = β8 = 0) might also be of interest.
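A minimal sketch of how this nested comparison might be carried out in R, assuming the DIESEL data frame with the response PERFORM and the factor FUEL, and using SPEED as a placeholder name for the engine-speed variable (that name is not documented in the slides):

# Three nested models: quadratic in speed only, additive fuel effect,
# and the complete second-order model (a different quadratic per fuel type).
fit_simple   <- lm(PERFORM ~ SPEED + I(SPEED^2), data = DIESEL)
fit_additive <- lm(PERFORM ~ SPEED + I(SPEED^2) + FUEL, data = DIESEL)
fit_full     <- lm(PERFORM ~ (SPEED + I(SPEED^2)) * FUEL, data = DIESEL)

# Nested F-tests: each line tests the extra terms added by the next model.
anova(fit_simple, fit_additive, fit_full)

The same pattern, anova(smaller_model, larger_model) for nested lm fits, is also how the test of H0: β4 = β5 = 0 for the fuel-by-brand interaction above would be carried out.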
Model validation

Regression models are usually fitted in order to infer something about the behavior of the expected response, E(Y), beyond the particular sample used for fitting. Often the specific goal is to estimate E(Y) or predict Y for some combination of independent variables not in the fitting data.

A model that fits well (high R²) may not predict well; the adjusted R², Ra², is a step in the right direction, but only a step.

The best validation is to actually collect new data and compare predictions with the actual responses. Label the m new responses y_{n+1}, ..., y_{n+m}. Then

R²_prediction = 1 − [Σ_{i=n+1}^{n+m} (y_i − ŷ_i)²] / [Σ_{i=n+1}^{n+m} (y_i − ȳ)²],

MSE_prediction = [Σ_{i=n+1}^{n+m} (y_i − ŷ_i)²] / m.

Note: the denominator in MSE_prediction is m, not m − (k + 1).

Almost as good: divide the original data into one part used for model building and fitting, and another "hold-out" part for validation. But for a true validation, the hold-out data must remain completely unused until the model building and fitting are complete.

Cross validation

If we do not have enough data to hold out part for a true validation, we can use cross validation:
First build (and fit) a model to the whole data set.
Then leave out part of the data (one half? one third? one fifth? a single observation?), refit the model to the remainder, and use the refitted model to predict the held-out data (K-fold cross validation).
Repeat so that all parts of the data are held out in turn.
Calculate R² and MSE as for a true validation.

Cross validation is not true validation, because the whole data set is used to build the model and is then re-used in the validation.

The Jackknife

The jackknife is closely related to leave-one-out cross validation. It uses deletion residuals:
Delete one observation, say y_i;
Refit the model, and use it to predict the deleted observation as ŷ_(i);
The deletion residual (or prediction residual) is d_i = y_i − ŷ_(i).

The R²-like statistic:

R²_jackknife = 1 − [Σ (y_i − ŷ_(i))²] / [Σ (y_i − ȳ)²].
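As a minimal sketch (using the software-maintenance fit from earlier as the example; any lm fit could be substituted), the deletion residuals and the jackknife R² might be computed as follows:

fit <- lm(COST ~ STATE, data = BIDMAINT)
n   <- nrow(BIDMAINT)

# Deletion (prediction) residuals: refit without observation i,
# then predict the held-out observation from that refit.
d <- numeric(n)
for (i in 1:n) {
  fit_i <- lm(COST ~ STATE, data = BIDMAINT[-i, ])
  d[i]  <- BIDMAINT$COST[i] - predict(fit_i, newdata = BIDMAINT[i, , drop = FALSE])
}

# Jackknife R²: 1 minus the sum of squared deletion residuals
# over the total sum of squares about the mean.
R2_jackknife <- 1 - sum(d^2) / sum((BIDMAINT$COST - mean(BIDMAINT$COST))^2)

# For linear models the same residuals are available without refitting:
# d = residuals(fit) / (1 - hatvalues(fit)).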