
Homework set 10 - Due 4/26/2013
Math 3200 – Renato Feres
Preliminaries
The point of this assignment is to show how to do simple linear regression analysis in R. Not everything we describe
in these preliminaries is needed for the homework problems. Much of this material is taken from Using R for Introductory Statistics, by John Verzani, http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf, R Cookbook,
by Paul Teetor, and Introductory Statistics with R, by Peter Dalgaard.
The simple linear regression model. The basic model for linear regression is that pairs of data (x_i, y_i) are related through the equation

    y_i = β0 + β1 x_i + ε_i,

where β0 and β1 are constants and the ε_i are random deviations from the straight-line model; they are assumed to be independent values from a N(0, σ²) distribution. We will see later in these preliminaries how the validity of the various assumptions involved in this model of the data can be tested.
The quantity Q(β0, β1) = Σ ( y_i − (β0 + β1 x_i) )² is minimized when the coefficients β0, β1 take on the values

    β̂1 = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,    β̂0 = ȳ − β̂1 x̄.

Denoting by ŷ_i = β̂0 + β̂1 x_i the fitted values of y and defining the residuals e_i = y_i − ŷ_i, we can write the minimum value of Q, also known as the error sum of squares, or SSE, as follows:

    Q_min = SSE = Σ_{i=1}^{n} e_i².
Example. The maximum heart rate of a person is known to be related to age. The formula y = 220 − x is often
assumed, where y is maximum heart rate in beats per minute and x is the person’s age in years. To test this formula
empirically, 15 persons are tested for their maximum heart rate and the following data are found:
> x=c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37)
> y=c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178)
A scatter plot of the data and the regression line are shown in the next graph. The graph is obtained using the
commands:
> plot(x,y)          # make a scatter plot of the data
> abline(lm(y~x))    # plot the regression line
The key R-function used is lm, which stands for linear model. The resulting graph is shown next. A linear relation
between x and y is readily apparent.
[Scatter plot of y (maximum heart rate, bpm) against x (age, years) with the fitted regression line.]
To find the coefficients of the regression line, we call the lm function again:
> lm(y~x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   210.0485      -0.7977
From this we can read the equation of the regression line: ŷ = 210.049 − 0.798x. The coefficients can also be
obtained by calling the summary of the lm function:
> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-8.9258 -2.5383  0.3879  3.1867  6.6242

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
x            -0.79773    0.06996  -11.40 3.85e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.578 on 13 degrees of freedom
Multiple R-squared: 0.9091,    Adjusted R-squared: 0.9021
F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08
There is a great deal of information to be interpreted here. For the moment I simply draw your attention to the
column ‘Estimate’ under ‘Coefficients’ where the intercept and slope estimates are shown.
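As a quick sanity check, the closed-form least squares formulas given earlier can be evaluated directly on the data (a small sketch; b0 and b1 are just temporary variable names):

> b1=sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2)
> b0=mean(y)-b1*mean(x)
> c(b0,b1)    # should reproduce the estimates 210.0485 and -0.7977 above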
Statistical inference on β1 . There are three parameters in the model: σ, β0 , β1 . Recall that σ2 is the variance of the
random error ǫ. The slope β1 is of particular interest since it tells how important is the influence of x on y; the closer
β1 is to zero, the smaller the effect of x on y.
We first need an estimate for σ². From the general theory we know that

    s² = (1/(n − 2)) Σ e_i²

is an unbiased estimator of σ² with n − 2 degrees of freedom, where n is the sample size. The standard error for β̂1 can now be written as

    SE(β̂1) = s / √( Σ (x_i − x̄)² ).

If the null hypothesis is H0 : β1 = a versus the alternative hypothesis H1 : β1 ≠ a, then a test of H0 is done by means of the t-statistic

    t = (β̂1 − a) / SE(β̂1).
From this we can find the P-value from the Student’s t -distribution.
Example. Continuing with the maximum heart rate example, we can test whether a slope of −1 is correct. Let H0 : β1 = −1 against the alternative hypothesis H1 : β1 ≠ −1. We can create the test statistic and find the P-value by hand as follows:
> m=lm(y~x)
> e=resid(m)                      # extracts the vector of residuals e1, ..., en
> b1=(coef(m))[['x']]             # the x part of the coefficients
> n=length(x)
> s=sqrt(sum(e^2)/(n-2))
> SE=s/sqrt(sum((x-mean(x))^2))
> t=(b1-(-1))/SE                  # this is the t-statistic
> 2*pt(-t,13)                     # find the P-value using the CDF of the t-distribution
[1] 0.01262031
This shows that it is not very likely that the true slope is −1.
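Equivalently, since a two-sided t-test at level 0.05 and a 95% confidence interval are dual, we can check whether −1 falls inside a 95% confidence interval for β1 (a one-line sketch using the confint function described further below):

> confint(m,'x',level=0.95)   # -1 lies outside this interval, consistent with the P-value above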
The summary function automatically performs the hypothesis test of H0 : β1 = 0 vs. H1 : β1 ≠ 0. The corresponding P-value is shown under ‘Coefficients’ (row ‘x’, last column). Note that it is extremely unlikely that the true slope is 0.
Inference on β0 . We can similarly do inferences on β0 . Just as for β1 , the above summary includes the test for
H0 : β0 = 0. The estimator β̂0 of β0 is also unbiased and has standard error
    SE(β̂0) = s √( 1/n + x̄² / Σ (x_i − x̄)² ).

The test statistic for H0 : β0 = a is then

    t = (β̂0 − a) / SE(β̂0),

which has a t-distribution with n − 2 degrees of freedom.
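As an illustration (a small sketch, not part of the homework), this formula can be evaluated for the heart-rate data, reusing m, s, and n from the example above and taking a = 220 as a hypothetical value suggested by the rule-of-thumb formula y = 220 − x:

> b0=coef(m)[['(Intercept)']]
> SE0=s*sqrt(1/n+mean(x)^2/sum((x-mean(x))^2))   # should match the intercept Std. Error 2.86694 in the summary
> t0=(b0-220)/SE0                                # test statistic for H0: beta0 = 220
> 2*pt(-abs(t0),n-2)                             # two-sided P-value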
“Dissecting” summary. Let us go back and look further into the above summary of the regression analysis done by the function lm. The first interesting piece of information is under Residuals. This gives a quick view of the distribution of the residuals e_i, which can be used for a rough check of the distributional assumptions in the linear model. The average of the residuals is 0 by construction, so the median should not be far from 0, and the minimum and maximum should be roughly equal in absolute value.
Then, under Coefficients we see the intercept and slope estimates with their standard errors, t values, and P-values for the null hypotheses that β0 and β1 are each zero. The symbols on the right are graphical indicators of the level of significance. The line below the table shows the definition of these indicators; for example, one star means 0.01 < p < 0.05.
Next is the line for the Residual standard error. This is the residual variation, an expression of the variation
of the observations around the regression line, estimating the model parameter σ. It is the value of s defined above.
Then there is Multiple R-squared. This is the square of the sample correlation coefficient r (see textbook.)
And finally, the F-statistic. This is an F test of the hypothesis that the regression coefficient is zero. This test is not really interesting in a simple (as opposed to multiple) linear regression, since it just duplicates information already given; it becomes interesting when there is more than one explanatory (or predictor) variable. In fact, the F-statistic is the square of the t-statistic for the slope (here (−11.40)² ≈ 130, the value reported above). This is true in any model with 1 degree of freedom (i.e., with a single predictor variable).
Functions of lm. I summarize here a few functions of interest that you can explore. Each of the following functions
acts on m, defined by
> m=lm(y~x)
• coefficients(m) – model coefficients
• coef(m) – the same as coefficients(m)
• confint(m,level=0.99) – confidence intervals for the regression coefficients
• deviance(m) – residual sum of squares (SSE)
• fitted(m) – vector of fitted y values
• residuals(m) – vector of model residuals
• resid(m) – the same as residuals(m)
• summary(m) – the summary function already described
• vcov(m) – variance-covariance matrix of the main parameters.
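For instance, a few of these can be tried directly on the heart-rate fit m (a small sketch; output omitted):

> confint(m)    # 95% confidence intervals for the intercept and the slope
> deviance(m)   # SSE, the residual sum of squares
> vcov(m)       # variance-covariance matrix of the coefficient estimates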
As an example, suppose we want a scatter plot of the residuals vs. the fitted values. The following will do it:
> m=lm(y~x)
> e=resid(m)
> yhat=fitted(m)
> plot(yhat,e)
> abline(lm(e~yhat))
The result is the graph shown next.
[Scatter plot of the residuals e against the fitted values yhat, with the line from lm(e~yhat).]
Regressing on transformed variables. It is possible to use the linear regression method even when the model of the relationship between x and y is not given by a linear function. This is typically done by applying linear regression to functions of x and/or y. For example, if we expect y to be an exponential function of x, it makes sense to apply linear regression to log y and x. In this case we would do the above analysis starting from

> m=lm(log(y)~x)

In general, we can apply lm to two functions g(y) and f(x). Even when the model relationship between x and y is linear, we may still want to transform y so as to make the hypotheses H0 : β0 = 0 and H0 : β1 = 0 meaningful, since these are the default null hypotheses for the regression summary function. In the maximum heart rate example we wanted to test the initial (hypothetical) model y = 220 − x. The transformed variable y − 220 + x would, under H0, have intercept and slope equal to zero. In this case we may start our analysis with
> m=lm(y-220+x~x)
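Under H0 : y = 220 − x, both coefficients of this transformed regression should be close to zero, which can be checked with the usual summary (a sketch; output omitted):

> summary(m)   # under H0, neither the intercept nor the slope should differ significantly from 0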
Diagnosing a linear regression. Several types of diagnostic plots are produced when calling plot(lm(y~x)). For example, to obtain a scatter plot of the residuals vs. the fitted values (compare this to Figure 10.7, page 368 of the textbook) we do:
> plot(lm(y~x),which=1)
Note that, rather than drawing a regression line as I did in the previous graph, R draws a smoothed line through the residuals as a visual aid for spotting systematic patterns that would indicate deviation from zero mean. Ideally, the points in the residuals vs. fitted plot should be randomly scattered with no particular pattern.
Another useful plot for diagnostic purposes is the normal Q-Q plot of the residuals; the points should fall roughly on a line, indicating that the residuals follow a normal distribution. It is obtained by
> plot(lm(y~x),which=2)
The above two types of plots are shown in the next figure. The assumptions of normality and zero mean seem to hold tolerably well.
[Diagnostic plots for lm(y ~ x): Residuals vs Fitted (fitted values on the horizontal axis) and Normal Q−Q (theoretical quantiles against standardized residuals).]
Prediction and confidence bands. The concepts of prediction values and intervals are discussed in section 10.3.3
of the textbook. Fitted lines are often presented with uncertainty bands around them. There are two kinds of bands,
often referred to as the “narrow” and “wide” bands.
The narrow, or confidence bands, reflect uncertainty about the line itself; if there are many observations, they will
be quite narrow. These bands are typically curved, becoming narrower near the center of the point cloud. The wide,
or prediction bands, include the uncertainty about future observations.
Predicted values and prediction intervals may be extracted with the R function predict. With no arguments other than the fitted model, it just returns the fitted values:
> predict(m)
       1        2        3        4        5        6        7        8        9
195.6894 191.7007 190.1053 182.1280 158.1962 166.9712 182.9258 165.3758 152.6121
      10       11       12       13       14       15
194.8917 191.7007 176.5439 195.6894 178.9371 180.5326
If you add to the predict call the argument interval='confidence' or interval='prediction', then you get the predicted values together with lower and upper limits:
> predict(m,interval='confidence')
        fit      lwr      upr
1  195.6894 191.8087 199.5700
2  191.7007 188.3520 195.0495
3  190.1053 186.9437 193.2668
4  182.1280 179.5503 184.7058
5  158.1962 153.2965 163.0959
6  166.9712 163.3843 170.5582
7  182.9258 180.3230 185.5285
8  165.3758 161.5704 169.1811
9  152.6121 146.7833 158.4410
10 194.8917 191.1235 198.6598
11 191.7007 188.3520 195.0495
12 176.5439 173.8948 179.1931
13 195.6894 191.8087 199.5700
14 178.9371 176.3712 181.5030
15 180.5326 177.9786 183.0866
Here is what to do if you want to predict new values of y (level=0.95 is the default value):
> newdata=data.frame(x=20)
> predict(m,newdata,interval='predict',level=0.95)
       fit      lwr      upr
1 194.0939 183.5492 204.6386
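For comparison, this prediction interval can be reproduced by hand from the usual formula ŷ0 ± t(α/2, n−2) · s · √(1 + 1/n + (x0 − x̄)² / Σ(x_i − x̄)²) (a sketch; x0 denotes the new age value):

> m=lm(y~x)                   # the original heart-rate fit used by predict above
> x0=20; n=length(x)
> s=summary(m)$sigma          # residual standard error, 4.578
> Sxx=sum((x-mean(x))^2)
> fit0=coef(m)[1]+coef(m)[2]*x0
> half=qt(0.975,n-2)*s*sqrt(1+1/n+(x0-mean(x))^2/Sxx)
> c(fit0-half,fit0+half)      # should match lwr and upr above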
Plotting the two bands requires some care. Here is a way to do it:
> pred.frame=data.frame(x=18:72)
> pp=predict(m,interval='prediction',newdata=pred.frame)
> pc=predict(m,interval='confidence',newdata=pred.frame)
> plot(x,y,ylim=range(y,pp,na.rm=TRUE))
> pred.x=pred.frame$x
> matlines(pred.x,pc,lty=c(1,2,2),col='black')
> matlines(pred.x,pp,lty=c(1,3,3),col='black')
> grid()
This produced the following plot:

[Scatter plot of y against x with the fitted line (solid), confidence bands (dashed), and prediction bands (dotted).]
A few explanations for the above script: First we create a new data frame in which the x variable contains the values at which we want predictions to be made; pp and pc are then made to contain the result of predict for the new data in pred.frame with prediction limits and confidence limits, respectively.

For the plotting, we first create a standard scatter plot, except that we ensure that it has enough room for the prediction limits. This is obtained by setting ylim=range(y,pp,na.rm=TRUE). The function range returns a vector of length 2, containing the minimum and maximum value of its arguments. We need the na.rm=TRUE argument (“remove missing data”) to cause missing values to be skipped in the range computation, and y is included to ensure that points outside the prediction limits are not missed. Finally, the curves are added using the matlines (“multiple lines”) function.
Problems
1. Problem 10.6, page 388. The following data give the barometric pressure (in inches of mercury) and the boiling point (in °F) of water in the Alps. (I have imported the data under the variable name Data.)
> Data
    Pres  Temp
1  20.79 194.5
2  20.79 194.3
3  22.40 197.9
4  22.67 198.4
5  23.15 199.4
6  23.35 199.9
7  23.89 200.9
8  23.99 201.1
9  24.02 201.4
10 24.01 201.3
11 25.14 203.6
12 26.57 204.6
13 28.49 209.5
14 27.76 208.6
15 29.04 210.7
16 29.88 211.9
17 30.06 212.2
(a) Make a scatter plot of the boiling point by barometric pressure. Does the relationship appear to be approximately linear?
(b) Fit a least squares regression line. What proportion of variation in the boiling point is accounted for by
linear regression on the barometric pressure?
(c) Calculate the mean square error estimate of σ.
Solutions: (a) The following commands produce the desired scatter plot, shown next. The linear relationship is
very pronounced.
Pressure=Data[,1]
Temperature=Data[,2]
m=lm(Temperature~Pressure)
plot(Temperature~Pressure)
abline(m)
grid()
[Scatter plot of Temperature against Pressure with the fitted regression line.]
(b) The least squares line is also obtained by the above script (the fit m=lm(Temperature~Pressure)). The proportion of variation in the boiling point accounted for by the linear regression on pressure is the coefficient of determination r²; from the summary shown in part (c) below, Multiple R-squared: 0.9944, i.e., about 99.4%.
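This proportion can also be extracted programmatically from the summary object, which stores it as r.squared (a small sketch):

> summary(m)$r.squared   # about 0.9944, i.e. roughly 99.4% of the variation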
(c) Recall that the mean square error estimate of σ is

    s = √( (1/(n − 2)) Σ e_i² ),

where the sample size is n = 17. One way to obtain s is by using the definition:
> TemperatureHat=fitted(m)
> s=sqrt((1/15)*sum((Temperature-TemperatureHat)^2))
> s
[1] 0.4440299
Therefore, the mean square error estimate of σ is s=0.444.
We can obtain the same value simply by extracting from the summary report for lm:
> summary(lm(Temperature~Pressure))

Call:
lm(formula = Temperature ~ Pressure)

Residuals:
     Min       1Q   Median       3Q      Max
-1.22687 -0.22178  0.07723  0.19687  0.51001

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 155.29648    0.92734  167.47   <2e-16 ***
Pressure      1.90178    0.03676   51.74   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.444 on 15 degrees of freedom
Multiple R-squared: 0.9944,    Adjusted R-squared: 0.9941
F-statistic: 2677 on 1 and 15 DF,  p-value: < 2.2e-16
We can read the value under Residual standard error:
0.444 on 15 degrees of freedom.
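The same estimate can also be pulled out directly from the summary object for the model m fitted above, which stores the residual standard error as sigma:

> summary(m)$sigma
[1] 0.4440299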
2. Problem 10.11, page 389. Refer to the Old Faithful data in Exercise 10.4.
(a) Calculate a 95% PI for the time to the next eruption if the last eruption lasted 3 minutes.
(b) Calculate a 95% CI for the mean time to the next eruption for a last eruption lasting 3 minutes. Compare
this CI with the PI obtained in (a).
(c) Repeat (a) if the last eruption lasted 1 minute. Do you think this prediction is reliable? Why or why not?
Solution: We begin by importing the data into two variables Last and Next. The sample size is n=21.
(a) To find the prediction interval using R’s regression analysis functions simply call:
> m=lm(Next~Last)
> newdata=data.frame(Last=3)
> predict(m,newdata,interval='predict')
       fit     lwr      upr
1 60.38332 47.2377 73.52893
This gives the prediction interval [47.24, 73.53].
To do the same by hand we would use the values in inequality (10.19), page 362 of the textbook.
(b) To find the confidence interval for the mean of Next given that Last=3:
> m=lm(Next~Last)
> newdata=data.frame(Last=3)
> predict(m,newdata,interval='confidence')
       fit      lwr      upr
1 60.38332 57.51009 63.25654

Therefore, the confidence interval is [57.51, 63.26].
To find this interval by hand we would use inequalities (10.18), page 362 of the textbook.
(c) If Last=1:
> m=lm(Next~Last)
> newdata=data.frame(Last=1)
> predict(m,newdata,interval='predict')
       fit      lwr      upr
1 40.80318 26.33021 55.27614

Now the 95% prediction interval is [26.33, 55.28]. We should not expect this interval to be very reliable, since the value 1 lies outside the range of observed values of Last, which is the interval [1.7, 4.9]. (See the brief discussion in the middle of page 363 of the textbook.)
3. Problem 10.16, page 391. A prime number is a positive integer that has no integer factors other than 1 and
itself. (1 is not regarded as a prime.) The number of primes in any given interval of whole numbers is highly
irregular. However, the proportion of primes less than or equal to any given number x (denoted by p(x)) follows
a regular pattern as x increases. The following table gives the number and proportion of primes for x = 10^n, n = 1, 2, . . . , 10. The objective of the present exercise is to discover this pattern.
x        No. of Primes   Proportion of Primes (p(x))
10^1     4               0.400000
10^2     25              0.250000
10^3     168             0.168000
10^4     1,229           0.122900
10^5     9,592           0.095920
10^6     78,498          0.078498
10^7     664,579         0.066458
10^8     5,761,455       0.057615
10^9     50,847,534      0.050848
10^10    455,052,512     0.045505
(a) Plot the proportion of primes, p(x), against 10000/x, 1000/√x, and 1/log10 x. Which relationship appears most linear?

(b) Estimate the slope of the line p(x) = β0 + β1 (1/log10 x) and show that β̂1 ≈ log10 e = 0.4343.
(c) Explain how the relationship found in (b) roughly translates into the prime number theorem: for large x, p(x) ≈ 1/log_e x.
Solution: (a) We first import the data. The variable primes will represent the number of primes for each value of x (second column) and proportion the column for p(x). (R seems to interpret the values of x given in the data table provided by the textbook as character strings and did not allow me to do any calculations with them, so I am entering the values here “by hand.”) Now define the three functions of x and plot:
> x=10^c(1:10)
> # values of p(x), entered by hand from the table above
> proportion=c(0.400000,0.250000,0.168000,0.122900,0.095920,0.078498,0.066458,0.057615,0.050848,0.045505)
> f1=10000/x
> f2=1000/sqrt(x)
> f3=log(10)/log(x)
> plot(f1,proportion)
The last line above produces the scatter plot of proportion versus f1; the plots against f2 and f3 are obtained in the same way. The three graphs are shown next. It is clear that the graph of proportion versus the reciprocal of the logarithm of x is the most linear.

[Three scatter plots: proportion against f1, f2, and f3, respectively.]

(b) We now regress proportion on f3. Here is the result (using the functions coefficients and confint):
> m=lm(proportion~f3)
> coefficients(m)
(Intercept)          f3
 0.01518797  0.40419159
> confint(m,level=0.95)
                   2.5 %     97.5 %
(Intercept) -0.002615247 0.03299118
f3           0.358967988 0.44941519
We can read from this the estimated value of the slope, β̂1 = 0.404, and a confidence interval for it (not asked for by the problem), [0.36, 0.45]. The estimated value of the intercept is 0.015.
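For reference, the theoretical value log10 e can be evaluated directly in R; note that it lies inside the confidence interval just computed:

> log10(exp(1))
[1] 0.4342945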
(c) The correct asymptotic relation is p(x) ≈ 1/log_e x. This is a mathematical rather than a statistical result. Assuming that the correct value of the slope is β1 = 0.4343 = log10 e = 1/log_e 10 and the intercept is β0 = 0, then for large x

    p(x) ≈ β1 × (1/log10 x) = 1/(log_e 10 × log10 x) = 1/log_e x,

which is what the problem asks us to show.
4. Problem 10.20, page 393. To relate the stopping distance of a car to its speed, ten cars were tested at five
different speeds, two cars at each speed. The following data were obtained.
Speed x (mph)        20    20    30    30    40    40    50     50     60     60
Stop. Dist. y (ft)   16.3  26.7  39.2  63.5  65.7  98.4  104.1  155.6  217.2  160.8
(a) Fit an LS straight line to these data. Plot the residuals against the speed.
(b) Comment on the goodness of the fit based on the overall F -statistic and the residual plot. Which two
assumptions of the linear regression model seem to be violated?
(c) Based on the residual plot, what transformation of stopping distance should be used to linearize the relationship with respect to speed? A clue to finding this transformation is provided by the following engineering argument: In bringing a car to a stop, its kinetic energy is dissipated as its braking energy, and the two are roughly equal. The kinetic energy is proportional to the square of the car's speed, while the braking energy is proportional to the stopping distance, assuming a constant braking force.
(d) Make this linearizing transformation and check the goodness of fit. What is the predicted stopping distance
according to this model if the car is traveling at 40 mph?
Solution: (a) After importing the data as a table Data with two columns we can produce the scatter plot with the
regression line:
> speed=Data[,1]
> distance=Data[,2]
> m=lm(distance~speed)
> plot(speed,distance)
> abline(m)

[Scatter plot of distance against speed with the fitted regression line.]
To plot the residuals against speed:

> e=residuals(m)
> plot(speed,e)

[Scatter plot of the residuals e against speed.]
(b) The residual graph shows systematic deviation from the assumption of normality e_i ~ N(0, σ²). The mean values of e_i follow a clear nonlinear curve and the variance appears to increase with the speed. The F-statistic is 58.77 with P-value 5.9 × 10⁻⁵. The very small P-value means that we should reject the hypothesis that β1 = 0. Moreover, r² = 0.88 means that most (nearly 90%) of the variation in distance is accounted for by the regression on speed rather than random error. All of this suggests to me that there is a strong dependence of distance on speed, although this dependence is not well described by a linear relation.
(c) The physical argument proposed is that the kinetic energy, which is proportional to the square of the speed, must equal the energy dissipated through braking, which is proportional to the distance (assuming the braking force to be approximately constant). This suggests the following relationship:

    distance = β0 + β1 (speed)².
(d) Based on (c) we try the regression of distance on the square of speed. The coefficients are

> speedsq=speed^2
> lm(distance~speedsq)

Call:
lm(formula = distance ~ speedsq)

Coefficients:
(Intercept)      speedsq
    1.62064      0.05174
So the regression line for the linearized model is distance = 1.62 + 0.052 × (speed)². The graph is shown next.

[Scatter plot of distance against speedsq with the fitted regression line.]

To evaluate the goodness of fit, we can look at the residual plots produced by R. The graphs below show that the residuals follow the regression model assumptions much better, although the variance still shows a clear increase with distance or speed.
The predicted stopping distance at speed 40 mph is

    1.6206 + 0.05174 × (40)² ≈ 84.4 feet.
In order to obtain the prediction interval at speed 40 mph (which the problem does not ask for) at the 95% level,
we do the following:
> m=lm(distance~speedsq)
> newdata=data.frame(speedsq=1600)
> predict(m,newdata,interval='predict')
       fit      lwr      upr
1 84.40229 31.35449 137.4501
This gave the fitted value 84.40 and (the fairly wide) prediction interval [31.35, 137.45].
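The diagnostic plots shown next can be produced, as before, with the which argument of plot applied to the fitted model:

> plot(m,which=1)   # residuals vs. fitted values
> plot(m,which=2)   # normal Q-Q plot of the standardized residuals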
[Diagnostic plots for lm(distance ~ speedsq): Residuals vs Fitted and Normal Q−Q of the standardized residuals.]