April 26

Mathematics 243
Measuring fit/Categorical Variables
Observation = Model + Error
$y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} + \varepsilon_i$
Observation = Fitted + Residual
$y_i = b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i} + e_i$
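The second decomposition can be checked numerically in R. What follows is only a minimal sketch, assuming the MASS package is installed (it supplies the cats data that appears later in these notes); the object names fit, y, fitted_y, and e are illustrative.

```r
library(MASS)  # assumption: MASS is installed; it provides the cats data

fit <- lm(Hwt ~ Bwt, data = cats)

y        <- cats$Hwt     # observations y_i
fitted_y <- fitted(fit)  # fitted values b0 + b1 * x_i
e        <- resid(fit)   # residuals e_i

# Observation = Fitted + Residual, up to floating-point rounding
decomposition_ok <- isTRUE(all.equal(unname(fitted_y + e), y))
print(decomposition_ok)

# With an intercept in the model, least-squares residuals sum to (numerically) zero
print(sum(e))
```

The errors $\varepsilon_i$ in the first decomposition are never observed; the residuals $e_i$ are their computable stand-ins.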
Predicting success in college
Measuring Fit: Numbers to look at.
1. residual standard error
2. $R^2$
3. p-values for individual coefficients.
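Each of these three numbers can be pulled out of a fitted model rather than read off the printed summary. A minimal sketch, assuming the MASS package is installed (it supplies the cats data that appears later in these notes); the names fit, s, rse, r2, and p_vals are illustrative.

```r
library(MASS)  # assumption: MASS is installed; it provides the cats data

fit <- lm(Hwt ~ Bwt + Sex, data = cats)
s   <- summary(fit)

rse    <- s$sigma                       # 1. residual standard error
r2     <- s$r.squared                   # 2. multiple R-squared
p_vals <- s$coefficients[, "Pr(>|t|)"]  # 3. p-value for each coefficient

# The residual standard error is sqrt(SSE / residual degrees of freedom)
rse_by_hand <- sqrt(sum(resid(fit)^2) / df.residual(fit))

print(c(rse = rse, rse_by_hand = rse_by_hand, r2 = r2))
print(p_vals)
```

The residual standard error is in the units of the response, while $R^2$ is unitless, so the two answer different questions about fit.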
> model1=lm(Grad.GPA~HS.GPA,data=SampleGrads)
> model2=lm(Grad.GPA~HS.GPA+ACT.Comp,data=SampleGrads)
> summary(model2)
Call:
lm(formula = Grad.GPA ~ HS.GPA + ACT.Comp, data = SampleGrads)

Residuals:
     Min       1Q   Median       3Q      Max
-0.61454 -0.17686  0.02046  0.18161  0.69790

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.517982   0.221594   2.338  0.02147 *
HS.GPA      0.563748   0.083122   6.782 9.33e-10 ***
ACT.Comp    0.028844   0.009065   3.182  0.00197 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2668 on 97 degrees of freedom
Multiple R-squared:  0.6291,    Adjusted R-squared:  0.6215
F-statistic: 82.26 on 2 and 97 DF,  p-value: < 2.2e-16
> anova(model2)    # sums of squares analysis
Analysis of Variance Table

Response: Grad.GPA
          Df  Sum Sq Mean Sq F value    Pr(>F)
HS.GPA     1 10.9882 10.9882 154.403 < 2.2e-16 ***
ACT.Comp   1  0.7205  0.7205  10.125  0.001966 **
Residuals 97  6.9031  0.0712
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Day 43 - April 26
Comparing two models
> anova(model1,model2)    # comparing two models
Analysis of Variance Table

Model 1: Grad.GPA ~ HS.GPA
Model 2: Grad.GPA ~ HS.GPA + ACT.Comp
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     98 7.6236
2     97 6.9031  1   0.72053 10.125 0.001966 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
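Setting: two models, with model1 nested within model2. model1 is a linear model with $p_1$ parameters and model2 has $p_2 > p_1$ parameters; both are fit to the same $n$ observations.
$$F = \frac{(SSE_1 - SSE_2)/(p_2 - p_1)}{SSE_2/(n - p_2)}$$
The null hypothesis tested by F is that all of the additional parameters in model2 are 0. (Qualitatively, F tends to be large if the null hypothesis is false.) In the comparison above, F is (7.6236 − 6.9031)/1 divided by 6.9031/97, which is about 10.12 and agrees with the anova output up to rounding.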
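The F statistic for a nested comparison can be computed by hand and checked against anova(). The sketch below is only illustrative: it assumes the MASS package is installed (it supplies the cats data used later in these notes), and it divides by the larger model's residual variance, $SSE_2/(n - p_2)$, which is the denominator anova() reports. The names sse1, sse2, f_stat, p_value, and tab are not from the notes.

```r
library(MASS)  # assumption: MASS is installed; it provides the cats data

model1 <- lm(Hwt ~ Bwt, data = cats)        # smaller model
model2 <- lm(Hwt ~ Bwt + Sex, data = cats)  # larger model; model1 is nested in it

sse1 <- sum(resid(model1)^2)
sse2 <- sum(resid(model2)^2)
n    <- nrow(cats)
p1   <- length(coef(model1))  # 2 parameters
p2   <- length(coef(model2))  # 3 parameters

# F = [(SSE1 - SSE2) / (p2 - p1)] / [SSE2 / (n - p2)]
f_stat  <- ((sse1 - sse2) / (p2 - p1)) / (sse2 / (n - p2))

# Upper-tail probability from the F distribution with (p2 - p1, n - p2) df
p_value <- pf(f_stat, df1 = p2 - p1, df2 = n - p2, lower.tail = FALSE)

tab <- anova(model1, model2)
print(c(by_hand = f_stat,  anova = tab$F[2]))
print(c(by_hand = p_value, anova = tab[["Pr(>F)"]][2]))
```

When model2 adds a single parameter, as here, F equals the square of that coefficient's t value in summary(model2), and the two p-values coincide.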
Cats
A binary categorical variable can be coded as 0 and 1.
> model=lm(Hwt~Sex,data=cats)
> summary(model)
Call:
lm(formula = Hwt ~ Sex, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8227 -1.7227  0.0273  1.2273  9.1773

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.2021     0.3251  28.308  < 2e-16 ***
SexM          2.1206     0.3961   5.354 3.38e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.229 on 142 degrees of freedom
Multiple R-squared:  0.168,    Adjusted R-squared:  0.1621
F-statistic: 28.66 on 1 and 142 DF,  p-value: 3.38e-07
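The coefficient name SexM reflects the 0/1 coding: R treats the first factor level (F) as the baseline, coded 0, and M is coded 1. The following is a minimal sketch of that coding, assuming the cats data come from the MASS package; sex_dummy, fit_factor, fit_dummy, and group_means are illustrative names.

```r
library(MASS)  # assumption: cats is the data frame from the MASS package

# Explicit 0/1 coding: 0 for female (baseline), 1 for male
sex_dummy <- as.numeric(cats$Sex == "M")

fit_factor <- lm(Hwt ~ Sex, data = cats)        # R builds the dummy variable itself
fit_dummy  <- lm(Hwt ~ sex_dummy, data = cats)  # same model with the hand-made dummy

# Mean heart weight within each sex
group_means <- tapply(cats$Hwt, cats$Sex, mean)

print(coef(fit_factor))
print(coef(fit_dummy))
print(group_means)
```

With a single binary predictor, the intercept is the mean response in the baseline group and the slope is the difference between the two group means.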
> model1=lm(Hwt~Bwt,data=cats)
> model2=lm(Hwt~Bwt+Sex,data=cats)
> summary(model2)
Call:
lm(formula = Hwt ~ Bwt + Sex, data = cats)

Residuals:
    Min      1Q  Median      3Q     Max
-3.5833 -0.9700 -0.0948  1.0432  5.1016

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.4149     0.7273  -0.571    0.569
Bwt           4.0758     0.2948  13.826   <2e-16 ***
SexM         -0.0821     0.3040  -0.270    0.788
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.457 on 141 degrees of freedom
Multiple R-squared:  0.6468,    Adjusted R-squared:  0.6418
F-statistic: 129.1 on 2 and 141 DF,  p-value: < 2.2e-16
> anova(model1,model2)
Analysis of Variance Table

Model 1: Hwt ~ Bwt
Model 2: Hwt ~ Bwt + Sex
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    142 299.53
2    141 299.38  1    0.1548 0.0729 0.7875
Homework, Due Tuesday, May 1
The built-in R dataset stackloss has measurements on 21 days of operation of a plant for the oxidation of ammonia
(NH3) to nitric acid (HNO3).
Air.Flow      represents operation rate of plant
Water.Temp    temperature of cooling water
Acid.Conc.    parts per 1000 minus 500
stack.loss    10 times percentage of ingoing ammonia that escapes
1. Fit the following model and report the coefficients.
stack.loss = β0 + β1 Air.Flow + β2 Water.Temp + β3 Acid.Conc. + ε
2. How well does the model fit the data? Use some precise, quantitative measures of fit.
3. Is there any evidence to suggest that a smaller model (fewer explanatory variables) might be appropriate?