
MATH 143 Activities for Mon., Nov. 28: The Standard Linear Model
Influential data points.
1. Load the following tab-delimited data set:
> animals = read.table("http://www.calvin.edu/~scofield/data/tab/rc/animals.dat",
+                      sep="\t", header=T)
Look at the relationship between longevity (as explanatory variable) and gestation (as
response) in this plot containing both points and least squares regression line.
> lmResult = lm(gestation ~ longevity, data=animals)
> plot(animals$longevity, animals$gestation, pch=19, col="navy", cex=.5)
> abline(lmResult)   # command to add the least squares regression line
Find the observation in the dataset that has the largest absolute residual, starting with
sort( abs( lmResult$residuals ) )
Create another data frame omitting the animal with the largest residual. As an
example, if it were the pig we wanted to leave out, our new data frame newAnim
could be derived from the old with the command
> newAnim = animals[animals$animal != "pig", ]
Find the equations of the regression lines, both with and without the animal whose
residual is largest. Plot the original data (all observations/animals) along with both
lines. Note: To overlay a second line, say, one with slope 14 and intercept 18, to the
previous scatterplot, you may use a command like
abline(a = 18, b = 14, col="red")
Is it accurate to say that the point with the largest residual greatly influences the
regression line? Is there another data point which is more influential? If so, by
how much do the slope and intercept change when this (new) point is removed? In
general, how might one identify grossly influential data points?
Answer:
> lm(gestation ~ longevity, data=animals)
Call:
lm(formula = gestation ~ longevity, data = animals)
Coefficients:
(Intercept)    longevity  
      21.71        13.13  
> newAnim = animals[animals$animal != "giraffe", ]
> lm(gestation ~ longevity, data=newAnim)
Call:
lm(formula = gestation ~ longevity, data = newAnim)
Coefficients:
(Intercept)    longevity  
      9.029       13.558  
> plot(animals$longevity, animals$gestation, pch=19, cex=.3, xlab="Longevity", ylab="Gestation")
> abline(a=21.71, b=13.13, col="black")
> abline(a=9.03, b=13.56, col="red")
[Figure: scatterplot of Gestation vs. Longevity for all animals, with the full-data regression line (black) and the line fit without the giraffe (red).]
It does not seem that the line changes much after removal of the giraffe (red) from
the original fit (black). The intercept changes noticeably (from 21.71 to 9.03), but the slope
is nearly unchanged, so the fitted values, and hence the residuals, change relatively little.
The elephant, observation 15, is the one point far removed (in the x-direction) from the
rest of the pack. We remove it.
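The commands below remove it by listing the remaining row numbers explicitly. An equivalent sketch, locating the row programmatically (this relies on the fact, noted below, that the elephant has the largest longevity in the data):
> elephantRow = which.max(animals$longevity)   # row of the animal with the largest longevity
> animMinusElephant = animals[-elephantRow, ]  # drop that single row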
> animMinusElephant = animals[c(1:14,16:40), ]
> lm(gestation ~ longevity, data=animMinusElephant)
Call:
lm(formula = gestation ~ longevity, data = animMinusElephant)
Coefficients:
(Intercept)    longevity  
      44.97        11.06  
> plot(animals$longevity, animals$gestation, pch=19, cex=.3, xlab="Longevity", ylab="Gestation")
> abline(a=21.71, b=13.13, col="black")
> abline(a=44.97, b=11.06, col="red")
[Figure: scatterplot of Gestation vs. Longevity for all animals, with the full-data regression line (black) and the line fit without the elephant (red).]
Both the slope and intercept change a great deal when the elephant is removed: the intercept
from 21.71 to 44.97, and the slope from 13.13 to 11.06.
What makes the elephant so influential
is the fact that it is far from the other data points horizontally (its explanatory value
is up around 40, while all others are about 25 or less).
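More generally, R computes standard influence diagnostics directly from a fitted model; a sketch (not required by the activity), using the full-data fit lmResult from above:
> lev  = hatvalues(lmResult)            # leverage: how extreme each x-value is
> cook = cooks.distance(lmResult)       # Cook's distance combines leverage and residual size
> head( sort(cook, decreasing=TRUE) )   # the most influential observations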
Checking Model Assumptions.
In fitting data with a model, we generally assume that for each value x of the explanatory variable, the response variable satisfies
Y = fit + error.
In the case of linear regression, we assume the fit is provided by the line α + βx, so that
Y = α + βx + ε.
In order to make inferences from data (about the true slope, for instance) we must further
assume that the error has a particular distribution, namely Norm(0, σ), and that the
error in one observation is independent from errors in other observations. A graphic of
these assumptions appears on p. 482 of the textbook. In today’s activities, we explore the
implications of these assumptions.
2. Let’s first construct numbers intentionally to fit the regression assumptions. We
don’t often get to choose the x-values measured, so let’s generate them randomly.
Execute commands that appear in shaded boxes like this one:
> n = 15
> x = runif(n, 17, 31)
Suppose the perfect relationship between x and y values were linear, with slope 2.7
and intercept 11.1. The next commands produce and plot the ideal y-values:
> y = 2.7*x + 11.1
> plot(x, y, pch=19, cex=.5, col="navy")
But in our model, the data come with "errors", or residuals. Just how large a residual
we are likely to see depends upon the size of σ.
> sigma = 1
> y = 2.7*x + 11.1 + rnorm(n, 0, sigma)
> plot(x, y, pch=19, cex=.5)
The residuals are just the differences between actual y-values and the fitted ones:
y − (α + βx).
> resids = y - (2.7*x + 11.1)
Two common plots of residuals are the histogram of residuals, and the scatterplot of
residuals vs. x-values. (A normal quantile plot of the residuals is also common, serving
the same purpose as the histogram of residuals.)
histogram(resids)                             # for the histogram of residuals
plot(x, resids, pch=19, cex=.5, col="navy")   # residuals vs. x-values
These plots should provide a sense of what one sees from data that perfectly satisfies
the assumptions of linear regression.
Repeat the commands of the previous boxes, individually changing σ to 5 and the
sample size n to 50.
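To avoid retyping, the commands can be wrapped in a small function; a sketch (the name simPlots is illustrative, and base R's hist() stands in for histogram()):
simPlots = function(n, sigma) {
  x = runif(n, 17, 31)
  y = 2.7*x + 11.1 + rnorm(n, 0, sigma)
  resids = y - (2.7*x + 11.1)
  par(mfrow=c(1,2))
  hist(resids)                                  # histogram of residuals
  plot(x, resids, pch=19, cex=.5, col="navy")   # residuals vs. x-values
}
simPlots(15, 5)   # sigma changed to 5
simPlots(50, 1)   # sample size changed to 50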
3. An important question that arises from the explorations above is whether it is possible
for data that contains errors to reveal to us the true relationship between explanatory
and response variables. From the File menu, select New R Script. In the space
that appears above the "Console" panel, place these commands:
n = 15
sigma = 5
x = runif(n, 17, 31)
y = 2.7*x + 11.1 + rnorm(n, 0, sigma)
lmResult = lm(y ~ x)
print(summary(lmResult))
Then click "Run All". What are the true slope (β) and intercept (α) for the relationship
between x and y? What estimates a, b for these are provided by your data? Why
aren’t the estimates precisely equal to the true values? Does a 95% confidence interval
for β contain the true slope? Does a 95% confidence interval for α contain the true
intercept?
Answer:
We have built the data so that the true values are α = 11.1 and β = 2.7. The estimates
a, b will vary from student to student, since the data are generated using some built-in
randomness. It is because of that randomness that the estimates from data do not match
the parameters α, β exactly. Whether a confidence interval contains the true α is random,
too; for a 95% CI, it will in about 19 out of 20 cases.
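One direct check (a sketch, using the lmResult produced by the R Script above): confint() reports both intervals at once.
> confint(lmResult, level=0.95)   # do the (Intercept) and x rows contain 11.1 and 2.7?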
4. Suppose you have control over n, the sample size. What does your intuition tell you
about the effect on the estimates a, b of α, β when n is changed? Use your intuition
to finish this sentence: "When all other factors remain constant, a rise in sample size
n will lead to estimates a, b of intercept and slope which are more certain."
Use RStudio to investigate whether you are correct. (Try modifying, as needed, the
commands in the "R Script" window.) What evidence do you seek to confirm or
refute your intuition?
Answer:
The main indicator is the standard error SE_b. As it grows, the width of a corresponding
confidence interval grows as well; a wide 95% confidence interval leaves open more possibilities
for the parameter than a narrow 95% confidence interval does. Increasing n (all else held
constant) shrinks SE_b, so the intervals narrow and the estimates become more certain.
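A small simulation makes the effect of n visible; a sketch (the helper seSlope is illustrative, not part of the activity):
seSlope = function(n, sigma=5) {     # simulate one data set of size n; return SE of the slope
  x = runif(n, 17, 31)
  y = 2.7*x + 11.1 + rnorm(n, 0, sigma)
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}
mean(replicate(200, seSlope(15)))    # typical SE_b when n = 15
mean(replicate(200, seSlope(50)))    # typically noticeably smaller when n = 50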
5. Suppose, instead, you have control over σ. Guess the effect on the estimates a, b of
α, β when σ is changed. Once again, finish this sentence: "When all other factors
(including sample size) remain constant, a rise in σ will lead to estimates a, b of
intercept and slope which are less certain."
Then use RStudio to investigate the correctness of your intuition.
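Reusing the illustrative seSlope helper from the previous answer, the effect of σ shows up the same way:
mean(replicate(200, seSlope(15, sigma=1)))    # small sigma: small SE_b
mean(replicate(200, seSlope(15, sigma=10)))   # large sigma: noticeably larger SE_b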
Of course, the greater the uncertainty in our estimates a, b of slope and intercept, the
greater the uncertainty in predicting both the mean response and the next individual response
at a given x. Review, as needed, both the meaning of the following ideas, and how to use
RStudio to find them.
• confidence interval for the mean response at a given x
• prediction interval for an individual response at a given x
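In R both intervals come from predict(); a sketch, assuming the x, y, and lmResult from the R Script above (x = 25 is just an illustrative value):
> newX = data.frame(x = 25)
> predict(lmResult, newX, interval="confidence")   # CI for the mean response at x = 25
> predict(lmResult, newX, interval="prediction")   # PI for an individual response at x = 25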
6. Look at plots of the residuals for the regression of the previous problem. Does it
seem the assumptions required for regression inference are in place? If not, which
assumptions are violated?
Answer:
Don’t grade this one.
There’s just too little clarity on what data gets used
(or too little uniformity in it, at least).
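For reference, whichever fit is intended, the residual plots can be produced along these lines (a sketch using the x and lmResult from the R Script in problem 3):
> par(mfrow=c(1,2))
> hist(lmResult$residuals)                      # roughly bell-shaped?
> plot(x, lmResult$residuals, pch=19, cex=.5)   # any pattern, curvature, or fanning?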
7. Sometimes other types of residual plots make sense. For instance, look at the data
relating children’s height (in inches) to their head circumference¹ found in the file
http://www.calvin.edu/~scofield/data/comma/bodyMeas.csv
Carry out least squares regression taking head circumference as the response variable
and height as explanatory. Then look at the usual plot of residuals vs. x (explanatory)
values. Is there an obvious lack of independence between residuals?
Now plot residuals vs. observation number. Is there an apparent lack of independence
between residuals on this plot? Explain what you see, and how such a thing might
arise.
Answer:
> bodyMeas = read.csv("http://www.calvin.edu/~scofield/data/comma/bodyMeas.csv")
> lm1 = lm(headCirc ~ height, data=bodyMeas)
> plot(bodyMeas$height, lm1$residuals, pch=19, cex=.4, col="navy")
[Figure: plot of lm1$residuals vs. bodyMeas$height.]
I detect no apparent lack of independence between residuals in the graph appearing above.
¹This data, from Exercise 21, p. 219 of Statistics: Informed Decisions Using Data, 3rd Ed., by Michael Sullivan, III (Prentice Hall), may well be simulated rather than taken from a real pediatrician.
> plot(bodyMeas$obs, lm1$residuals, pch=19, cex=.4, col="navy")
[Figure: plot of lm1$residuals vs. bodyMeas$obs (observation number).]
On the other hand, this plot, while based on just a few data points, may suggest a change
in the standard deviation σ around the middle of the plot. If true, this could be the
result of a piece of equipment warming up, or it may indicate the measurements were being
taken by one person at the start and a different person later.
Transforming Data
8. Load the tab-delimited dataset found at
http://www.calvin.edu/~scofield/data/tab/rc/tvlife.dat
and view a scatterplot with life expectancy as the response and per.TV as the explanatory
variable. What evidence is there that linear regression is not appropriate?
Now view the scatterplot
plot(log(tvlife$per.TV), tvlife$life.exp, pch=19, cex=.5, col="blue")
(a) Which variable, the explanatory or response, has been transformed to produce
the most recent graph? How, exactly, was it transformed? Does the relationship
appear more or less linear than for the original data?
(b) Look at plots of residuals to see whether the assumptions of linear regression
(inference) appear to hold in the transformed data.
(c) What is the correlation coefficient between life expectancy and log avg. number of
people per tv?
(d) What is the estimated least squares regression line using the transformed data?
What portion of variability in life expectancy is explained by this linear model?
(e) Write a relationship between original variables (per.TV and life expectancy).
Answer:
(a) It is the explanatory variable that has been transformed by taking its (natural)
log, and this seems to produce a plot that is more linear (on the right) than the
corresponding plot using original data (on the left):
> tvlife = read.table("http://www.calvin.edu/~scofield/data/tab/rc/tvlife.dat",
+                     header=T, sep="\t")
> par(mfrow=c(1,2))
> plot(tvlife$per.TV, tvlife$life.exp, pch=19, cex=.5, col="navy")
> plot(log(tvlife$per.TV), tvlife$life.exp, pch=19, cex=.5, col="navy")
[Figure: side-by-side scatterplots of tvlife$life.exp vs. tvlife$per.TV (left) and vs. log(tvlife$per.TV) (right).]
(b) > tvlifeLM = lm(life.exp ~ log(per.TV), data=tvlife)
> par(mfrow=c(1,2))
> hist(tvlifeLM$residuals)
> plot(tvlife$per.TV, tvlifeLM$residuals, pch=19, cex=.5, col="blue")
[Figure: histogram of tvlifeLM$residuals (left) and plot of tvlifeLM$residuals vs. tvlife$per.TV (right).]
There is the suggestion of nonnormality in the histogram of residuals. The plot of residuals
vs. x-values does not appear too bad, though there are some extreme residuals.
(c) > cor( tvlife$per.TV, tvlife$life.exp )
[1] -0.8038097
The correlation coefficient between life expectancy and the untransformed per.TV is −0.804.
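Note that the command above used the untransformed per.TV. For the correlation the question asks about, between life expectancy and log(per.TV), one would instead run the line below; since part (d) fits a simple linear model on log(per.TV), that correlation equals −√R² = −√0.8505 ≈ −0.92.
> cor( log(tvlife$per.TV), tvlife$life.exp )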
(d) > summary(tvlifeLM)
Call:
lm(formula = life.exp ~ log(per.TV), data = tvlife)
Residuals:
    Min      1Q  Median      3Q     Max 
-9.5318 -2.5100  0.7823  2.2028 10.2002 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  80.5919     1.7548   45.93  < 2e-16 ***
log(per.TV)  -5.7896     0.5427  -10.67 1.06e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.304 on 20 degrees of freedom
Multiple R-squared: 0.8505,  Adjusted R-squared: 0.843
F-statistic: 113.8 on 1 and 20 DF,  p-value: 1.056e-09
The estimated least-squares line is
(life expectancy) = 80.59 − 5.79 log(per.TV).
The proportion of variability in life expectancy explained by this model is the R²-value,
0.85 (or about 85%).
(e) This was a poor question. You already have such a relationship coming from letter
(d).
Don’t grade this one.
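Still, for reference, part (d) already gives the relationship in terms of the original variable, life expectancy ≈ 80.59 − 5.79 ln(per.TV), and predictions for a given per.TV come straight from predict(); a sketch (per.TV = 10 is an arbitrary illustrative value):
> predict(tvlifeLM, newdata=data.frame(per.TV = 10))   # predicted life expectancy at 10 people per TV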