Exercise 2.23

Villanova MAT 8406
September 7, 2015
Step 1: Understand the Question
Consider the simple linear regression model y = 50 + 10x + ε where ε is NID(0, 16). Suppose that
n = 20 pairs of observations are used to fit this model. Generate 500 samples of 20 observations,
drawing one observation for each level of x = 1, 1.5, 2, . . . , 10 for each sample.
R makes this easy because its normal random number generator, rnorm, does not require fixed values of the
parameters (the mean and standard deviation): you may vary them! Therefore you can generate one dataset
according to the preceding instructions by means of remarkably terse, efficient commands:
sigma.2 <- 16
beta <- c(50, 10)
x <- seq(1, 10, by=1/2)
y <- rnorm(length(x), beta[1] + beta[2]*x, sigma.2)
Before proceeding, let’s check that this is correct and matches what is intended in the problem. Always draw
a picture:
plot(x, y, main="First Try at Sampling")
[Figure: scatterplot of y against x, titled “First Try at Sampling”.]
Does it look correct? Is this a plot of 20 points that could be described by the model y ∼ NID(50 + 10x, 16)?
A quick check is afforded by fitting the OLS line and reading the summary output:
fit <- lm(y ~ x)
summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-14.587  -6.760  -1.073  10.555  22.435

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  46.4377     5.6664   8.195 2.62e-07 ***
x            11.9093     0.9222  12.913 3.25e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.01 on 17 degrees of freedom
Multiple R-squared:  0.9075, Adjusted R-squared:  0.902
F-statistic: 166.8 on 1 and 17 DF,  p-value: 3.25e-10
Scan it carefully, looking for evidence of every quantitative value that was used: the dataset size of 20, the
model y = 50 + 10x, and the variance of 16 in the errors.
There are two salient problems that need to be addressed. (It’s good we did this check before
proceeding with extensive simulation!)
1. The value of 17 for “DF” (degrees of freedom) is one less than we would expect. Indeed, x has only 19
elements!
(length(x))
[1] 19
Let’s just assume statisticians can’t count :-) and presume the question really is calling for generating
samples of size 19. (A quick scan through the rest of the question suggests none of it relies fundamentally
on the sample size being 20.)
2. The “residual standard error” of 11 suggests the error variance (its square) is around 121, which is far
larger than the intended value of 16. This kind of mistake is common but insidious: the textbook uses a
different parameterization of Normal distributions than the software does. R uses the mean and standard
deviation while the text uses the mean and variance. (Still other sources might use the precision, which
is the reciprocal of the variance, or even the logarithm of the variance for the second parameter.) This
problem is particularly acute with other distributions, like the Gamma distributions, for which there is
no clear convention for the parameters. It is crucial to understand what the parameters mean so that
you can perform calculations correctly!
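A one-line experiment makes the parameterization concrete. This sketch (the seed value is arbitrary, chosen only for reproducibility) draws a large sample with the third argument set to sqrt(16) and checks the resulting spread:

```r
# rnorm's third argument is the standard deviation, not the variance:
# passing sqrt(16) = 4 should yield draws whose variance is near 16.
set.seed(17)                        # arbitrary seed, for reproducibility
z <- rnorm(1e5, mean = 0, sd = sqrt(16))
c(sd = sd(z), var = var(z))         # expect values near 4 and 16
```

Had we passed 16 directly, the reported sd would have been near 16 and the variance near 256, reproducing the mistake diagnosed above.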
There may be additional problems: the intercept of 46.4 and the slope of 11.91 differ somewhat from the
intended intercept of 50 and slope of 10. However, they’re of the right order of magnitude, so let’s hope the
discrepancies are due to randomness, but we'll keep an eye on this issue and perform a fuller check later.
Fixing these problems is easy: (1) needs no change, while (2) requires us to convert the variance of 16 into its
square root:
y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
plot(x, y, main="Fixed-up Sample") # Always check!
[Figure: scatterplot of y against x, titled “Fixed-up Sample”.]
(You should re-run the lm and summary code to verify that you’re getting what you expected.)
Step 2: Do the Calculations
We are asked to “generate 500 samples” according to this model. Now that we have written and tested the
commands to generate one sample, there are many (easy) ways to generate 500 samples. Because 500 is a
relatively small number and each sample is small and requires relatively little calculation, we can afford to be
inefficient. Rather than extracting all the information requested in parts (a) - (d) of the question, let’s just
save all the samples and all the fits. We can then post-process them at our leisure. Here’s the command:
sim <- replicate(3, {
  y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
  lm(y ~ x)
})
To get started, the intended count of “500” has been replaced by “3”. That's large enough to practice with, yet small enough to avoid being overwhelmed by managing 500 different (complex) fits. One step at a time!
The result is an array of three (or later, 500) objects: each of them is the output of lm in the last line. It is
an R idiosyncrasy that each object will be considered to be indexed by a second coordinate. For instance, the
result of applying lm to the first sample is contained in sim[, 1], not sim[1, ]. You can confirm this by
inspecting sim (either in the “Global Environment” pane in RStudio or by computing dim(sim)).
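This indexing convention is worth verifying directly. A minimal, self-contained sketch (it re-creates a tiny three-sample simulation; the seed is arbitrary):

```r
# Each call to lm returns a list of components; replicate simplifies the
# collection into a matrix whose COLUMNS index the samples.
set.seed(17)
x <- seq(1, 10, by=1/2)
sim <- replicate(3, {
  y <- rnorm(length(x), 50 + 10*x, sqrt(16))
  lm(y ~ x)
})
dim(sim)         # the second dimension counts the samples
coef(sim[, 1])   # coefficients of the fit to the first sample
```

Note that coef still works on sim[, 1] even though the "lm" class has been stripped, because the default coef method simply extracts the coefficients component of the list.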
Question (a)
a. For each sample compute the least-squares estimates of the slope and intercept. Construct
histograms of the sample values of β̂0 and β̂1 . Discuss the shape of these histograms.
To apply some procedure, such as extracting the least-squares estimates of the coefficients, to an array like
sim, you will usually use one of the *apply functions in R: often apply, lapply, or sapply, with the first
being appropriate for looping over rows or columns of arrays. In this case we wish to treat sim as an array of
columns by looping over its second index (number 2). The coefficients of the fit in each column are extracted
using the coef function:
beta.hat <- apply(sim, 2, coef)
The output will have one column for each iteration of the loop. Because coef returns first the intercept and then
the slope, the intercepts will be found in the first row of beta.hat and the slopes in its second row. Let’s
look:
print(beta.hat)
                 [,1]      [,2]     [,3]
(Intercept) 51.445077 53.584327 49.01810
x            9.956447  9.213877 10.06204
That’s looking good! The first row is actually named “(Intercept)” and the second row, “x” (because x was
the name of the regressor in the call to lm). We may refer to the rows by name. This is usually a good idea
because it avoids mistakes made when we miscount the number of a row in which we are interested. Thus,
for instance, the histograms can be obtained with two calls to hist, one for each row. Since a histogram of
just three values won’t reveal much, first we go back and re-do the simulation with the full 500 values.
sim <- replicate(500, {
  y <- rnorm(length(x), beta[1] + beta[2]*x, sqrt(sigma.2))
  lm(y ~ x)
})
beta.hat <- apply(sim, 2, coef)
par(mfrow=c(1,2)) # Draws side-by-side histograms
hist(beta.hat["(Intercept)", ], freq=FALSE,
     main="", xlab=expression(hat(beta)[0]))
hist(beta.hat["x", ], freq=FALSE,
     main="", xlab=expression(hat(beta)[1]))
[Figure: side-by-side density histograms of the 500 estimates β̂0 and β̂1.]
“Discuss the shape of these histograms” should include quantitative evaluation of their centers and spreads,
along with either quantitative or qualitative assessment of other aspects of a distribution, such as its skewness,
heaviness of tails, presence of outliers, peakedness, numbers of modes, etc. If you have reason to suppose the
data shown by these histograms would look approximately like some well-known distributional shape (such
as Normal, Student t, etc) then compare them to that shape as a reference.
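For reference, the theoretical centers and spreads can be computed from the model alone: both estimators are Normal, centered at the true values 50 and 10, with standard errors determined by σ² = 16 and the fixed x values. A sketch of that computation (the variable names are my own):

```r
# Theoretical standard errors of the OLS estimators for this design:
#   sd(beta1.hat) = sigma / sqrt(Sxx)
#   sd(beta0.hat) = sigma * sqrt(1/n + xbar^2 / Sxx)
x <- seq(1, 10, by=1/2)
sigma <- sqrt(16)
Sxx <- sum((x - mean(x))^2)
se.beta1 <- sigma / sqrt(Sxx)
se.beta0 <- sigma * sqrt(1/length(x) + mean(x)^2 / Sxx)
c(se.beta0 = se.beta0, se.beta1 = se.beta1)
```

The histograms should therefore look approximately Normal, centered near 50 and 10, with standard deviations near 2.06 and 0.34, respectively.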
Question (b)
For each sample, compute an estimate of E(y|x = 5). Construct a histogram of the estimates you
obtained. Discuss the shape of the histogram.
The preferred way in R to estimate this expectation is with the predict function. It works in a strangely
restricted way: you must supply it a “data frame” of the values of x in which you are interested. To test,
note that you still have an object fit lying around from your initial testing. Let’s try out predict on it:
predict(object=fit, newdata=data.frame(x=5))
1
105.9842
fit is the name of the object containing the lm output (we chose it) and x is the name of the regressor
variable used by lm. The output value of 106 is reasonably close to the model value 50 + 10 × 5 = 100.
Having successfully done the calculation with one fit, we are ready to apply it to the entire simulation. As
before, all 500 values will be stored in a variable which is then fed to hist for visualization as a histogram.
y.hat.0 <- apply(sim, 2, function(f) {
  class(f) <- "lm"
  predict(f, newdata=data.frame(x=5))
})
As you can see, this is fussy: we are obliged to define a function on the fly that (re-)informs R that each
column of sim really is the output of lm just so we can apply predict. (R tends to be inconsistent: even core
procedures like lm, coef, and predict do not work together in a consistent manner.)
A simpler approach is to use your knowledge of least squares. The predicted value at x = 5 is given by the
estimated coefficients, which we already have computed (and stored as rows in beta.hat):
y.hat <- beta.hat["(Intercept)", ] + beta.hat["x", ] * 5
par(mfrow=c(1,2))
hist(y.hat.0, freq=FALSE, main="Output of `predict`", cex.main=0.95,
     xlab=expression(hat(y)[0]))
hist(y.hat, freq=FALSE, main="Manually computed predictions", cex.main=0.95,
     xlab=expression(hat(y)))
[Figure: side-by-side density histograms of the 500 predictions, titled “Output of `predict`” and “Manually computed predictions”.]
The results are the same, of course.
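Here, too, theory offers a reference for the discussion: the estimate of E(y|x = 5) is Normal with mean 50 + 10 × 5 = 100 and a standard error computable from the design. A sketch (names mine):

```r
# Standard error of the estimated mean response at x0:
#   sigma * sqrt(1/n + (x0 - xbar)^2 / Sxx)
x <- seq(1, 10, by=1/2)
x0 <- 5
Sxx <- sum((x - mean(x))^2)
se.Ey <- sqrt(16) * sqrt(1/length(x) + (x0 - mean(x))^2 / Sxx)
se.Ey
```

Because x0 = 5 lies near the center of the design, this is only slightly larger than σ/√n; the histograms should show a spread of roughly 0.93 around 100.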
Question (c)
c. For each sample, compute a 95% CI on the slope. How many of these intervals contain the
true value β1 = 10? Is this what you would expect?
It’s a good exercise to compute this CI using formulas from the book. In practice, though, you would look for
a built-in R function. It is confint:
confint(fit, "x", level=95/100)
     2.5 %   97.5 %
x 9.963523 13.85507
The art of statistical computing lies in continually checking that your understanding of the
software is correct. How do we know that this output really is providing a symmetric, two-sided, 95%
confidence interval for β1 ? One way is to compute the same interval in an alternative way. For instance, we
could inspect the summary table. For fit it included an estimate of β̂1 = 11.909 and a standard error of
0.9222. Using 19 − 2 = 17 degrees of freedom (also shown in the summary output) we may compute the
corresponding multiplier from the Student t distribution as κ = t_df^{-1}(1 − α/2). Here are the commands to
perform these calculations and display κ:
confidence <- 95/100
alpha <- 1 - confidence
df <- fit$df.residual
(multiplier <- qt(1 - alpha/2, df))
[1] 2.109816
The confidence interval is β̂1 ± κse(β̂1 ) = 11.909 ± 2.11 × 0.9222. It agrees with the output of confint. Now
we can feel comfortable using confint in our work.
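Assembling the pieces completes the check. This sketch re-creates a fit from scratch so that it stands alone (the seed is arbitrary) and confirms that the hand computation reproduces confint exactly:

```r
# Verify confint against the textbook formula: estimate +/- kappa * se.
set.seed(17)
x <- seq(1, 10, by=1/2)
y <- rnorm(length(x), 50 + 10*x, sqrt(16))
fit <- lm(y ~ x)
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
k   <- qt(1 - 0.05/2, fit$df.residual)   # the multiplier kappa
manual <- c(est - k*se, est + k*se)
all.equal(as.numeric(confint(fit, "x", level=95/100)), manual)  # TRUE
```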
Let’s apply this to the simulation:
CI.beta.1 <- apply(sim, 2, function(f) {
  class(f) <- "lm"
  confint(f, "x", level=95/100)
})
To count the number of intervals containing the true value, compare them with the true value:
covers <- CI.beta.1[1, ] <= beta[2] & beta[2] <= CI.beta.1[2, ]
print(paste0(sum(covers), " (", mean(covers)*100,
             "%) of the intervals cover the true value."))
[1] "475 (95%) of the intervals cover the true value."
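Is 475 out of 500 what we would expect? Because the samples are independent, the coverage count is Binomial(500, 0.95); computing its mean and standard deviation shows how much variation to anticipate:

```r
# The number of covering intervals is Binomial(n.sim, 0.95).
n.sim <- 500
p <- 0.95
c(mean = n.sim * p, sd = sqrt(n.sim * p * (1 - p)))
```

Counts within roughly two standard deviations of 475, about 465 to 485, are entirely consistent with the nominal 95% coverage, so 475 is exactly what we would expect.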
Question (d)
d. For each estimate of E(y|x = 5) in part b, compute the 95% CI, etc.
The R solution once again is predict. This function is overloaded: it does lots of different things, depending
on what you ask of it. As before, we should not rely on it until we have tested it:
predict(fit, newdata=data.frame(x=5), interval="confidence", level=95/100)
       fit      lwr     upr
1 105.9842 100.5674 111.401
Evidently it produces a vector of three values: the fit ŷ and the lower and upper (symmetric, two-sided)
confidence interval. We can deal with these exactly as we did with β̂: the result of apply will be three rows
of output which can be referenced by their names fit, lwr, and upr.
CI.y.hat <- apply(sim, 2, function(f) {
  class(f) <- "lm"
  # Take the first (only) row so the fit, lwr, upr names are retained
  predict(f, newdata=data.frame(x=5), interval="confidence", level=95/100)[1, ]
})
From this point on, emulate the calculations and the answer to part (c).
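For concreteness, here is a self-contained sketch of those remaining steps, using a smaller simulation with an arbitrary seed; the true mean response at x = 5 is 50 + 10 × 5 = 100:

```r
# Compute a 95% CI for E(y | x = 5) in each sample and count how many
# intervals cover the true value of 100.
set.seed(17)
x <- seq(1, 10, by=1/2)
CI.y <- replicate(100, {
  y <- rnorm(length(x), 50 + 10*x, sqrt(16))
  fit <- lm(y ~ x)
  predict(fit, newdata=data.frame(x=5),
          interval="confidence", level=95/100)[1, ]
})
covers <- CI.y["lwr", ] <= 100 & 100 <= CI.y["upr", ]
mean(covers)   # should be near 0.95
```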