Mathematics 243
Day 40 - April 19
Inference in Linear Models - Continued
Observation = Model + Error:      yi = β0 + β1 xi + εi
Observation = Fitted + Residual:  yi = b0 + b1 xi + ei

The sampling distribution of b1: the standardized statistic

    (b1 − β1) / se(b1)

has a t-distribution with n − 2 degrees of freedom.
Rust
The dataframe corrosion has data on corrosion in 13 specimens of 90-10 Cu-Ni alloys with varying iron content.
The bars were submerged in sea water for 60 days and the loss of material due to corrosion was recorded. The
variables are Fe (iron content in percent) and loss (the weight loss in mg per square decimeter per day). Of course,
Fe is the explanatory variable and loss is the response variable.
> rust=lm(loss~Fe,data=corrosion)
> summary(rust)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7980 -1.9464  0.2971  0.9924  5.7429

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  129.787      1.403   92.52  < 2e-16 ***
Fe           -24.020      1.280  -18.77 1.06e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.058 on 11 degrees of freedom
Multiple R-squared: 0.9697,    Adjusted R-squared: 0.967
F-statistic: 352.3 on 1 and 11 DF,  p-value: 1.055e-09
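As a check, the t-based 95% confidence interval for the slope can be built by hand from the Estimate and Std. Error columns above. This is only a sketch using the rounded values from the summary; confint(rust) gives the same interval directly.

    b1 = -24.020                    # estimated slope from the summary
    se.b1 = 1.280                   # its standard error
    tstar = qt(0.975, df = 11)      # critical value with n - 2 = 11 degrees of freedom
    b1 + c(-1, 1) * tstar * se.b1   # roughly (-26.8, -21.2)
    confint(rust)                   # same interval (up to rounding) in the Fe row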
1. Confidence interval for the mean of y given x = x0:
> x=data.frame(Fe=c(.25,.5,.75,1))
> predict(rust,newdata=x,interval='confidence')
       fit      lwr      upr
1 123.7816 121.2195 126.3437
2 117.7767 115.6346 119.9187
3 111.7717 109.8732 113.6702
4 105.7667 103.8662 107.6672
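These confidence limits are just fit ± t* · se(fit). A minimal sketch reproducing them from predict()'s standard errors, assuming the rust model and the data frame x defined above:

    p = predict(rust, newdata = x, se.fit = TRUE)   # returns fit, se.fit, df, residual.scale
    tstar = qt(0.975, df = p$df)                    # df = 11 here
    cbind(fit = p$fit,
          lwr = p$fit - tstar * p$se.fit,
          upr = p$fit + tstar * p$se.fit)           # matches the 'confidence' table above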
2. Prediction interval for an individual y given x = x0:
> x=data.frame(Fe=c(.25,.5,.75,1))
> predict(rust,newdata=x,interval='prediction')
       fit       lwr      upr
1 123.7816 116.58030 130.9829
2 117.7767 110.71385 124.8395
3 111.7717 104.77890 118.7645
4 105.7667 98.77338 112.7600
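The prediction intervals are wider because an individual y also varies around its mean, so the standard error picks up the residual standard error as well. A sketch of the same computation by hand, again assuming rust and x from above:

    p = predict(rust, newdata = x, se.fit = TRUE)
    se.pred = sqrt(p$se.fit^2 + p$residual.scale^2)  # adds the residual variance (3.058^2)
    tstar = qt(0.975, df = p$df)
    cbind(fit = p$fit,
          lwr = p$fit - tstar * se.pred,
          upr = p$fit + tstar * se.pred)             # matches the 'prediction' table above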
1980 Census Undercount
The dataset Ericksen in the package car has data on undercounting in the 1980 census. It covers the 16 largest
cities, the remainders of the states in which those cities are located, and the other US states. We are interested
in whether the percentage of minority residents (minority) is related to the percentage by which the 1980 census
undercounted the city or state population (undercount). A sketch of R commands that could be used for these
questions appears after the list.
1. Write a linear function to predict undercount percentage from minority population percentage.
2. Examine a residual plot to determine whether the three assumptions necessary for inference are plausible in this
situation.
3. Test the hypothesis that undercount percentage is positively related to minority percentage. Be sure to write the
hypotheses carefully in terms of the parameters of the model.
4. Write and interpret a 95% confidence interval for the slope of the regression line.
5. According to the 1980 census, Grand Rapids had a population of 181,843 and a minority population of 18.9%.
(a) Predict the undercount for Grand Rapids and, hence, the true population.
(b) Give a 95% prediction interval for the true population of Grand Rapids in 1980.
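One possible set of commands for this exercise. This is only a sketch: the model name census, the choice of a lattice residual plot, and the conversion to a true-population estimate in the last comment are choices made here, not part of the dataset.

    library(car)                                    # Ericksen data
    library(lattice)                                # for xyplot, if not already loaded
    census = lm(undercount ~ minority, data = Ericksen)   # part 1
    r = residuals(census)
    xyplot(r ~ minority, data = Ericksen)           # part 2: residual plot
    summary(census)                                 # part 3: slope t test (two-sided p-value; halve for the one-sided alternative)
    confint(census)                                 # part 4: 95% CI for intercept and slope
    gr = data.frame(minority = 18.9)                # part 5: Grand Rapids
    predict(census, newdata = gr, interval = 'prediction')
    # If the predicted undercount is u percent, one estimate of the true 1980
    # population is 181843 / (1 - u/100).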
Homework, due Thursday, April 26 (because of advising days)
1. A famous dataset (Pierce, 1948) contains data on the relationship between cricket chirps and temperature. The
dataset is in the M241 package and is named crickets. Here the variables are Temperature in degrees Fahrenheit
and Chirps giving the number of chirps per second of crickets at that temperature.
(a) Write the equation of the regression line that could be used to predict the temperature from the number of
cricket chirps per second.
(b) Write a 95% confidence interval for the slope of the line.
(c) Write a 95% confidence interval for the mean temperature for each of the values 12, 14, 16, and 18 of cricket
chirps per second.
(d) You hear a cricket chirping 15 times per second. What is an interval that is likely to capture the value of the
temperature? Explain what 'likely' means here.
(e) Is there anything in the residual plot that gives you any concern about using a linear model here?
2. The faraway package contains a dataset cpd which has the projected and actual sales of 20 different products of
a company. (The data were actually transformed to disguise the company.)
(a) Write a regression line that describes a linear relationship between projected and actual sales.
(b) Identify one data point that has particularly large influence on the regression (one way to check is sketched
after this problem).
(c) Recompute the regression line after removing the data point that you identified. How does the equation of
the line change?
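For problem 2, the Regression in R summary below covers the fitting itself; here is one sketch of how an influential point could be found and the line refit without it. The column names projected and actual are assumed from the problem statement, so check names(cpd) first.

    library(faraway)                         # cpd data
    names(cpd)                               # check the column names (assumed below)
    sales = lm(actual ~ projected, data = cpd)
    cooks.distance(sales)                    # unusually large values flag influential points
    big = which.max(cooks.distance(sales))   # row of the most influential point
    sales2 = lm(actual ~ projected, data = cpd[-big, ])
    coefficients(sales)                      # compare the two fitted lines
    coefficients(sales2)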
Regression in R:

> model=lm(height~foot,data=statstu)        # fits the linear model
> r=residuals(model)                        # defines the vector of residuals
> f=fitted(model)                           # defines the vector of fitted values
> bs=coefficients(model)                    # a vector with the intercept and slope
> xyplot(r~foot,data=statstu)               # plots the residuals
> xyplot(height~foot,data=statstu)          # plots the data
> ladd(panel.abline(model))                 # adds the regression line to the data plot
> summary(model)                            # summary statistics and inference
> confint(model)                            # confidence intervals for the betas
> x=data.frame(foot=c(25,30,35))            # a dataframe of x values
> predict(model,x,interval='confidence')    # predictions and confidence intervals for the mean of y at x
> predict(model,x,interval='prediction')    # predictions and prediction intervals for an individual y at x