MATH W80 Daily Notes
Linear Regression
Questions surrounding linear regression
1. Is there a relationship between the predictors and the response? Does the collection of
predictors actually have predictive value?
2. If there is a relationship, how strong is it? Are predictions accurate?
3. Which predictors actually have value?
4. How precisely can we estimate the effect of each predictor on the response?
5. How accurately, in general, can we predict the response?
6. Is the relationship a linear one?
7. Is there synergy/interaction between predictors?
In linear regression, we seek an approximate model function

$$\hat{f}(X) \;=\; [1\ X]^T \beta \;=\; \begin{bmatrix} 1 & X_1 & X_2 & \cdots & X_p \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \;=\; \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.$$
The β_j are, in fact, random variables, changing from one training dataset to another. The values for these from our dataset will have "hats" over them: β̂_j. Given a training observation (x_i, y_i), we call y_i the observed value,

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_p x_{ip}$$

the predicted value, and

$$e_i = y_i - \hat{y}_i$$

the residual.
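As a quick hands-on illustration (a minimal sketch using R's built-in mtcars data, a choice of convenience rather than an example from the text), we can compute predicted values and residuals directly and confirm they match what lm() reports:

fit = lm(mpg ~ wt + hp, data=mtcars)   # mtcars ships with R; any dataset would do
yhat = predict(fit)                    # predicted values ŷ_i
e = mtcars$mpg - yhat                  # residuals e_i = y_i − ŷ_i
all.equal(unname(e), unname(residuals(fit)))   # TRUE: matches lm()'s own residuals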
In the case of a single predictor variable (p = 1), we only require 2 dimensions (a scatterplot) to
graph the training data
(x1 , y1 ), . . . , (xN , yN ),
and the estimated model is a line Ŷ = β̂0 + β̂1 X. Figure 3.1 in our text depicts this situation. The
training data appears as red dots, and the residuals appear as vertical line segments from the data
to the regression line.
When there are p = 2 predictors, linear regression leads to a plane. Figure 3.4 depicts the data and
residuals in this case.
The residual sum of squares (RSS) is a measure of the appropriateness of the choice of the β̂_j:

$$\mathrm{RSS} \;=\; \sum_{i=1}^{N} e_i^2, \tag{1}$$

and the typical manner of choosing them, known as least squares linear regression, is to choose them so that the RSS is minimized. Two related quantities we use to evaluate how we have done are the R² value and the residual standard error (RSE). The first is given by
$$R^2 \;=\; \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} \;=\; 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \text{where} \qquad \mathrm{TSS} = \sum_i (y_i - \bar{y})^2, \tag{2}$$
with ȳ being the mean of the response values in the training data; R² measures the proportion of variation in the response values explained by the regression model. The latter is given by
$$\mathrm{RSE} \;=\; \sqrt{\frac{\mathrm{RSS}}{N - p - 1}}, \tag{3}$$
and estimates the standard deviation σ of the errors. Numbers such as these are used in the computation of the F-statistic, leading to a p-value which indicates whether a linear model on the predictors has any predictive value.
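Continuing the mtcars sketch above (again just an illustration, not one of the text's examples), these quantities can be computed by hand and checked against what summary() reports:

e = residuals(fit)                            # fit from the sketch above
RSS = sum(e^2)                                # equation (1)
TSS = sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total sum of squares, as in (2)
1 - RSS/TSS                                   # R², matches summary(fit)$r.squared
sqrt(RSS/(nrow(mtcars) - 2 - 1))              # RSE with p = 2, matches summary(fit)$sigma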
Let us address some of the important questions of linear regression.
1. Is at least one of the predictors useful in predicting the response?
There is a hypothesis test lying behind this question with hypotheses
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0, \qquad H_a: \text{at least one } \beta_j \text{ is nonzero}.$$
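For reference, the test is carried out with the F-statistic (the standard formula; compare the text's Section 3.2):

$$F \;=\; \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(N - p - 1)},$$

which, when H_0 holds and the errors are normal, follows an F-distribution with p and N − p − 1 degrees of freedom; a large value of F is evidence against H_0.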
In what follows, we read in the Advertising dataset used for examples in Chapter 3, and produce some results like those displayed in Section 3.2. The dataset must first be downloaded, which the read.csv() command below achieves.
Advertising = read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv",
                       row.names=1)
cor(Advertising)
# result similar to Table 3.5 in text
                  TV      Radio  Newspaper     Sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
Radio     0.05480866 1.00000000 0.35410375 0.5762226
Newspaper 0.05664787 0.35410375 1.00000000 0.2282990
Sales     0.78222442 0.57622257 0.22829903 1.0000000
lm.res = lm(Sales ~ TV + Radio + Newspaper, data=Advertising)
summary(lm.res)
# result similar to Table 3.4 in text
Call:
lm(formula = Sales ~ TV + Radio + Newspaper, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
Radio        0.188530   0.008611  21.893   <2e-16 ***
Newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
The F-statistic and, in particular, the p-value it yields give us our answer: we reject the null hypothesis in favor of the alternative, that some β_j ≠ 0.
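As a sanity check (not part of the text's output), the p-value can be recomputed from the F-statistic above using base R's F-distribution function:

pf(570.3, df1=3, df2=196, lower.tail=FALSE)   # upper-tail area: essentially 0 (< 2.2e-16)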
2. Do all the predictors help the response, or is only a subset of the predictors useful?
Following the process described as mixed selection, one can investigate the effect on Sales of each individual medium:
summary(lm(Sales ~ TV, data=Advertising))
summary(lm(Sales ~ Radio, data=Advertising))
summary(lm(Sales ~ Newspaper, data=Advertising))
It appears, from the p-values these commands generate, that each of the three predictor variables has predictive value for Sales but, as measured by the RSE and R² values, the strongest connection is with TV. We include it first in our model.
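To see that comparison at a glance, one can loop over the three single-predictor models and extract their R² values (a convenience sketch, not from the text; reformulate() builds the formula Sales ~ v):

sapply(c("TV", "Radio", "Newspaper"), function(v)
  summary(lm(reformulate(v, response="Sales"), data=Advertising))$r.squared)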
Next, we consider including a second predictor. Taking TV as given, we compare Radio vs. Newspaper:
summary(lm(Sales ~ . - Newspaper, data=Advertising))
summary(lm(Sales ~ . - Radio, data=Advertising))
Based on the same criteria, Radio appears to be the one of the two remaining variables with more predictive value for Sales.
When we use all three predictors, however, Newspaper has a p-value that is not significant, and it provides no real change in the R-squared or RSE values. The result of a hypothesis test is usually closely tied to a confidence interval and, indeed, the 95% confidence intervals for the coefficients β0, β1, β2, β3 displayed below do not rule out the possibility that β3 = 0.
lm.res = lm(Sales ~ ., data = Advertising)
confint(lm.res)
                  2.5 %     97.5 %
(Intercept)  2.32376228 3.55401646
TV           0.04301371 0.04851558
Radio        0.17154745 0.20551259
Newspaper   -0.01261595 0.01054097
A glance back at Table 3.5 (the correlation matrix above) reveals that Newspaper is more highly correlated with Radio than it is even with Sales.
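Pulling just those two correlations out directly confirms this:

cor(Advertising$Newspaper, Advertising$Radio)   # 0.354
cor(Advertising$Newspaper, Advertising$Sales)   # 0.228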
3. How well does the model fit the data?
We have introduced several measures of how we have done: RSE, R2 , the p-value associated
with the F-statistic. But this question gets at whether it was reasonably accurate, in the
beginning, to think Y depends on the predictors X linearly, except for errors ε which behave as if they are drawn independently from a normal distribution Norm(0, σ). (All of the hypothesis tests and confidence intervals we obtain presume so!) The residuals e_i = y_i − ŷ_i are instantiations of ε, and plots of these residuals can help us determine if they truly come randomly from some normal-like distribution. The command
lm.mod = lm(Sales ~ TV + Radio, data = Advertising)
par(mfrow=c(2,2))
plot(lm.mod)
# try with and without preceding par(mfrow=c(2,2)) command
produces a 2-by-2 array of plots which help to determine if ε ∼ Norm(0, σ). The most
important ones addressing this particular issue are the top two plots (residuals vs. fitted
values, and a quantile-quantile plot of residuals). These are reproducible (if not identically, then at least reasonably closely) by more hands-on commands as follows:
lm.mod = lm(Sales ~ TV + Radio, data = Advertising)
plot(predict(lm.mod), residuals(lm.mod))     # top left
qqnorm(residuals(lm.mod))                    # top right
plot(hatvalues(lm.mod), rstudent(lm.mod))    # bottom right
Based on the plots, the residuals have no obvious trends other than that they seem more widely spread about 0 when the fitted values ŷ are larger. The quantile-quantile plot makes it seem reasonable to assume ε comes from some normal distribution; it's just that it may not be a single normal distribution.
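When the residual spread grows with the fitted values like this, one common remedy (discussed among the potential problems in the text's Section 3.3) is to transform the response, e.g. with a logarithm; the following sketch refits with log(Sales) so the two residual plots can be compared side by side:

lm.log = lm(log(Sales) ~ TV + Radio, data=Advertising)   # log-transformed response
par(mfrow=c(1,2))
plot(predict(lm.mod), residuals(lm.mod))   # original fit: spread grows with ŷ
plot(predict(lm.log), residuals(lm.log))   # check whether the spread has evened out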
4. Given a set of predictor values, what response value should we predict, and how accurate is
our prediction?
The former question may be asking: what single value do I predict? If so, there is an immediate answer: use the regression equation. For instance, above we did linear regression of Sales against predictors TV and Radio. Repeating that here, we have coefficients
lm.res = lm(Sales ~ TV + Radio, data=Advertising)
lm.res$coefficients
(Intercept)          TV       Radio
 2.92109991  0.04575482  0.18799423
so the model is

Sales = 2.921 + 0.0458(TV) + 0.188(Radio).

The single number representing sales we predict at budget levels TV = 225 and Radio = 25 is

2.921 + 0.0458(225) + 0.188(25) ≈ 17.93,

or about 18 thousand units, in agreement with the predict() output below.
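The same number comes from predict() without an interval, or from a direct dot product with the coefficient vector:

predict(lm.res, data.frame(TV=225, Radio=25))   # 17.91579
sum(coef(lm.res) * c(1, 225, 25))               # same value, computed by hand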
But, coupled with the follow-up question, "how accurate is our prediction?", our best response may be a range of numbers. In this vein, two responses are common, based on two different understandings of what is desired. It may be that we want a likely sales range given that we plan to set the TV and Radio budgets at these levels just once. In this context, we respond with a prediction interval. It can be generated in R using the command
lm.res = lm(Sales ~ TV + Radio, data=Advertising)   # may have been done already
predict(lm.res, data.frame(TV=225, Radio=25), interval="prediction")
       fit      lwr      upr
1 17.91579 14.58485 21.24673
We may, however, envision setting these advertising budgets at 225 and 25 repeatedly. In
this context, we may want a range in which the mean level of sales lies. The answer to that
question is called a confidence interval for the mean response. Its R command (tailored for
90% confidence instead of 95%):
predict(lm.res, data.frame(TV=225, Radio=25), interval="confidence", level=.9)
       fit      lwr      upr
1 17.91579 17.64976 18.18182
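Note how much narrower the confidence interval is than the prediction interval at the same predictor values: the former accounts only for uncertainty in the estimated mean response, while the latter also includes the irreducible error in a single observation, and so is always at least as wide.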