Mathematics 243    Measuring fit/Categorical Variables    Day 43 - April 26

Observation = Model + Error

    $y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} + \varepsilon_i$

Observation = Fitted + Residual

    $y_i = b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i} + e_i$

Predicting success in college

Measuring Fit: Numbers to look at.

1. residual standard error
2. $R^2$
3. p-values for individual coefficients.

> model1=lm(Grad.GPA~HS.GPA,data=SampleGrads)
> model2=lm(Grad.GPA~HS.GPA+ACT.Comp,data=SampleGrads)
> summary(model2)

Call:
lm(formula = Grad.GPA ~ HS.GPA + ACT.Comp, data = SampleGrads)

Residuals:
     Min       1Q   Median       3Q      Max
-0.61454 -0.17686  0.02046  0.18161  0.69790

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.517982   0.221594   2.338  0.02147 *
HS.GPA      0.563748   0.083122   6.782 9.33e-10 ***
ACT.Comp    0.028844   0.009065   3.182  0.00197 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2668 on 97 degrees of freedom
Multiple R-squared: 0.6291,     Adjusted R-squared: 0.6215
F-statistic: 82.26 on 2 and 97 DF,  p-value: < 2.2e-16

> anova(model2)   # sums of squares analysis
Analysis of Variance Table

Response: Grad.GPA
          Df  Sum Sq Mean Sq F value    Pr(>F)
HS.GPA     1 10.9882 10.9882 154.403 < 2.2e-16 ***
ACT.Comp   1  0.7205  0.7205  10.125  0.001966 **
Residuals 97  6.9031  0.0712
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing two models

> anova(model1,model2)   # comparing two models
Analysis of Variance Table

Model 1: Grad.GPA ~ HS.GPA
Model 2: Grad.GPA ~ HS.GPA + ACT.Comp
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     98 7.6236
2     97 6.9031  1   0.72053 10.125 0.001966 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Setting: two models, model1 nested within model2. model1 is a linear model with $p_1$ parameters and model2 has $p_2 > p_1$ parameters.

    $$F = \frac{(SSE_1 - SSE_2)/(p_2 - p_1)}{SSE_2/(n - p_2)}$$

Here $n$ is the number of observations and $SSE_j$ is the residual sum of squares of model $j$ (the RSS column in the table above). The null hypothesis that F is used to check is that all the additional parameters are 0. (Qualitatively, F will be large if the null hypothesis is false.)
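As a check on the nested-model F statistic, the following sketch recomputes it by hand from the residual sums of squares and degrees of freedom printed in the anova(model1,model2) table. The variable names (sse1, df1, F_stat, and so on) are illustrative choices, not part of the handout, and the SampleGrads data are not needed because only the printed table values are used. The denominator is the larger model's residual mean square, SSE2/(n - p2), which is the quantity that reproduces the printed F.

```r
# Values copied from the anova(model1, model2) table (rounded as printed)
sse1 <- 7.6236   # RSS of model1: Grad.GPA ~ HS.GPA
df1  <- 98       # residual df of model1, n - p1
sse2 <- 6.9031   # RSS of model2: Grad.GPA ~ HS.GPA + ACT.Comp
df2  <- 97       # residual df of model2, n - p2

# Number of additional parameters in model2, p2 - p1
extra <- df1 - df2

# F = [(SSE1 - SSE2) / (p2 - p1)] / [SSE2 / (n - p2)]
F_stat <- ((sse1 - sse2) / extra) / (sse2 / df2)

# Upper-tail probability under an F distribution with (p2 - p1, n - p2) df
p_value <- pf(F_stat, df1 = extra, df2 = df2, lower.tail = FALSE)

print(F_stat)
print(p_value)
```

Up to rounding of the printed inputs, F_stat should agree with the 10.125 reported by R. Because model2 adds only one parameter, this F is also the square of the t value for ACT.Comp in summary(model2): 3.182 squared is about 10.125.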
Cats

A binary categorical variable can be coded as 0 and 1.

> model=lm(Hwt~Sex,data=cats)
> summary(model)

Call:
lm(formula = Hwt ~ Sex, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8227 -1.7227  0.0273  1.2273  9.1773

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2021     0.3251  28.308  < 2e-16 ***
SexM          2.1206     0.3961   5.354 3.38e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.229 on 142 degrees of freedom
Multiple R-squared: 0.168,      Adjusted R-squared: 0.1621
F-statistic: 28.66 on 1 and 142 DF,  p-value: 3.38e-07

> model1=lm(Hwt~Bwt,data=cats)
> model2=lm(Hwt~Bwt+Sex,data=cats)
> summary(model2)

Call:
lm(formula = Hwt ~ Bwt + Sex, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5833 -0.9700 -0.0948  1.0432  5.1016

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.4149     0.7273  -0.571    0.569
Bwt           4.0758     0.2948  13.826   <2e-16 ***
SexM         -0.0821     0.3040  -0.270    0.788
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.457 on 141 degrees of freedom
Multiple R-squared: 0.6468,     Adjusted R-squared: 0.6418
F-statistic: 129.1 on 2 and 141 DF,  p-value: < 2.2e-16

> anova(model1,model2)
Analysis of Variance Table

Model 1: Hwt ~ Bwt
Model 2: Hwt ~ Bwt + Sex
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    142 299.53
2    141 299.38  1    0.1548 0.0729 0.7875

Homework, Due Tuesday, May 1

The built-in R dataset stackloss has measurements on 21 days of operation of a plant for the oxidation of ammonia (NH3) to nitric acid (HNO3).

    Air.Flow     represents operation rate of plant
    Water.Temp   temperature of cooling water
    Acid.Conc.   parts per 1000 minus 500
    stack.loss   10 times percentage of ingoing ammonia that escapes

1. Fit the following model and report the coefficients.

    $\text{stack.loss} = \beta_0 + \beta_1\,\text{Air.Flow} + \beta_2\,\text{Water.Temp} + \beta_3\,\text{Acid.Conc.} + \varepsilon$

2. How well does the model fit the data? Use precise, quantitative measures of fit.
3. Is there any evidence to suggest that a smaller model (fewer explanatory variables) might be appropriate?
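Returning to the cats example: the statement that a binary categorical variable can be coded as 0 and 1 can be inspected directly in R. The sketch below assumes that cats is the dataset of that name from the MASS package (the handout does not say where it comes from); the names design and group_means are illustrative. It shows the 0/1 column that lm builds for Sex and compares the fitted coefficients with the group means of Hwt.

```r
library(MASS)   # assumption: cats is MASS::cats (columns Sex, Bwt, Hwt)

model <- lm(Hwt ~ Sex, data = cats)

# Design matrix used by lm: the SexM column is 0 for females, 1 for males
design <- model.matrix(model)
print(head(design))
print(table(cats$Sex, design[, "SexM"]))

# Mean heart weight within each sex
group_means <- tapply(cats$Hwt, cats$Sex, mean)
print(group_means)

# With 0/1 coding, the intercept is the mean for the baseline level (F)
# and the SexM coefficient is the difference in means, M minus F
print(coef(model))
print(group_means["M"] - group_means["F"])
```

Under that assumption about the data, the intercept and the SexM coefficient should correspond to the estimates 9.2021 and 2.1206 in the summary(model) output for Hwt ~ Sex.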