Chapter 14
Inference for Regression
Lesson 14-1, Part 1
Review Least-Squares Regression
A family doctor is interested in examining the relationship between
patients' age and total cholesterol. He randomly selects 14 of his female
patients and obtains the data presented in Table 1. The data are based upon
results obtained from the National Center for Health Statistics.
Table 1

Age   Total Cholesterol   Age   Total Cholesterol
25    180                 42    183
25    195                 48    204
28    186                 51    221
32    180                 51    243
32    210                 58    208
32    197                 62    228
38    239                 65    269
Review Least-Squares Regression
1. What is the least-squares regression line for predicting total cholesterol from age for women?
The least-squares regression equation is ŷ = 151.3537 + 1.3991x, where ŷ represents the predicted total cholesterol for a female whose age is x.
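The equation above can be checked directly from the Table 1 data using the usual least-squares formulas b = Sxy/Sxx and a = ȳ − b·x̄. A minimal Python sketch (data taken from Table 1):

```python
# Sketch: reproducing the least-squares fit for the Table 1 data
# using the textbook formulas b = Sxy/Sxx and a = ybar - b*xbar.

ages = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(chol) / n

# Sxy = sum of (x - xbar)(y - ybar); Sxx = sum of (x - xbar)^2
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, chol))
s_xx = sum((x - x_bar) ** 2 for x in ages)

b = s_xy / s_xx          # slope estimate
a = y_bar - b * x_bar    # intercept estimate

print(f"yhat = {a:.4f} + {b:.4f}x")  # yhat = 151.3537 + 1.3991x
```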
Review Least-Squares Regression
2. What is the correlation coefficient between age and cholesterol? Interpret the correlation coefficient in the context of the problem.
The linear correlation coefficient is r = 0.718. There is a moderate, positive linear relationship between female age and total cholesterol.
Review Least-Squares Regression
3. What is the predicted cholesterol level of a 67-year-old female?
ŷ = 151.3537 + 1.3991x
cholesterol = 151.3537 + 1.3991(age)
            = 151.3537 + 1.3991(67)
            ≈ 245
Review Least-Squares Regression
4. Interpret the slope of the regression line in the context of the problem.
For each one-year increase in age, total cholesterol is predicted to increase by 1.3991.
Statistics and Parameters
• When doing inference for regression, we use ŷ = a + bx to estimate the population regression line.
▫ a and b are estimators of the population parameters α and β, the intercept and slope of the population regression line.
Conditions
• The conditions necessary for doing inference for
regression are:
▫ For each given value of x, the y-values of the response variable are independent and normally distributed.
▫ For each value of x, the standard deviation, σ, of the y-values is the same.
▫ The mean responses of the y-values for the fixed values of x are linearly related by the equation μy = α + βx.
Standard Error of the Regression Line
• Gives the variability of the vertical distances of the
y-values from the regression line
• Remember that a residual was the error involved
when making a prediction from the regression
equation
• The spread around the line is measured with the
standard deviation of the residual, s.
s = √( Σ(yᵢ − ŷᵢ)² / (n − 2) ) = √( Σ residuals² / (n − 2) )
Standard Error of the Slope of the
Regression Line
• Gives the variability of the estimates of the slope
of the regression line
SEb = s / √( Σ(xᵢ − x̄)² ), where s = √( Σ(yᵢ − ŷᵢ)² / (n − 2) )
Summary
• Inference for regression depends upon estimating
μy = α + βx with ŷ = a + bx
• For each x, the response values of y are independent
and follow a normal distribution, each distribution
having the same standard deviation.
• Inference for regression depends on the following
statistics:
▫ a, the estimate of the y intercept, α, of μy
▫ b, the estimate of the slope, β, of μy
▫ s, the standard error of the residuals
▫ SEb, the standard error of the slope of the regression line
Computing Standard Error of the Residual

Age, x   Total Cholesterol, y   ŷ = 151.3537 + 1.3991x   Residual (y − ŷ)   Residual² (y − ŷ)²
25       180                    186.33                    −6.33               40.0689
25       195                    186.33                     8.67               75.1689
28       186                    190.53                    −4.53               20.5209
32       180                    196.12                   −16.12              259.8544
32       210                    196.12                    13.88              192.6544
32       197                    196.12                     0.88                0.7744
38       239                    204.52                    34.48             1188.8704
…        …                      …                         …                  …
62       228                    238.10                   −10.10              102.01
65       269                    242.30                    26.70              712.89

Σ residuals² = 4553.708
Computing Standard Error

s = √( Σ residuals² / (n − 2) ) = √( 4553.708 / (14 − 2) ) ≈ 19.48
Example – Page 787, #14.2
Body weights and backpack weights were collected for eight students.

Weight (lbs):           120  187  109  103  131  165  158  116
Backpack weight (lbs):   26   30   26   24   29   35   31   28

These data were entered into a statistical package, and least-squares regression of backpack weight on body weight was requested. Here are the results.
Example – Page 787, #14.2
Predictor   Coef      Stdev     t-ratio   p
Constant    16.265    3.937     4.13      0.006
BodyWT      0.09080   0.02831   3.21      0.018

S = 2.270   R-sq = 63.2%   R-sq(adj) = 57.0%
A) What is the equation of the least-squares line?
Backpack weight = 16.265 + 0.09080(body weight)
Example – Page 787, #14.2
B) The model for regression inference has three parameters,
which we call α, β and σ. Can you determine the
estimates for α and β from the computer printout?
a = 16.265 estimates the true intercept α and b = 0.09080
estimates the true slope β.
Example – Page 787, #14.2
C) The computer output reports that s = 2.270. This is an
estimate of the parameter σ. Use the formula for s to
verify the computer’s value of s.
Use your TI to verify this.
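One way to verify this without a TI is to recompute s directly from the eight data points and the printout's coefficients. A minimal sketch:

```python
# Sketch verifying the printout's value s = 2.270 from the raw data,
# using s = sqrt( sum((y - yhat)^2) / (n - 2) ).
import math

body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]
n = len(body)

a, b = 16.265, 0.09080           # coefficients from the computer printout
residuals = [y - (a + b * x) for x, y in zip(body, pack)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

print(f"s = {s:.3f}")  # s ≈ 2.270
```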
Example – Page 788, #14.4
Exercise 3.71 on page 187 provided data on the speed of competitive runners and the number of steps they took per second. Good runners take more steps per second as they speed up. Here are the data again.

speed  15.86  16.88  17.50  18.62  19.97  21.06  22.11
steps   3.05   3.12   3.17   3.25   3.36   3.46   3.55
A) Enter the data into your calculator, perform least-squares regression, and plot the scatterplot with the least-squares line. What is the strength of the association between speed and steps per second?
Example – Page 788, #14.4
Steps = 1.77 + 0.0803(speed). There is a very strong positive linear relationship between speed and steps per second; r = 0.999. Nearly all the variation in steps per second (r² = 0.998, or 99.8%) is explained by the linear relationship.
Example – Page 788, #14.4
[Scatterplot of steps per second versus speed (feet per second), with the least-squares line]
Example – Page 788, #14.4
C) The model for regression inference has three parameters, α, β, and σ. Estimate these parameters from the data.
a = 1.766 is the estimate of α
b = 0.0803 is the estimate of β
s = 0.0091 is the estimate of σ
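These three estimates can be checked against the raw data using the same least-squares formulas used earlier in the lesson. A sketch:

```python
# Sketch verifying the estimates a, b, and s for the runner data
# from the least-squares formulas.
import math

speed = [15.86, 16.88, 17.50, 18.62, 19.97, 21.06, 22.11]
steps = [3.05, 3.12, 3.17, 3.25, 3.36, 3.46, 3.55]
n = len(speed)

x_bar = sum(speed) / n
y_bar = sum(steps) / n
s_xx = sum((x - x_bar) ** 2 for x in speed)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, steps))

b = s_xy / s_xx               # estimate of beta
a = y_bar - b * x_bar         # estimate of alpha
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(speed, steps))
s = math.sqrt(sse / (n - 2))  # estimate of sigma

print(f"a = {a:.4f}, b = {b:.4f}, s = {s:.4f}")  # a ≈ 1.766, b ≈ 0.0803, s ≈ 0.0091
```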
Lesson 14-1, Part 2
Inference for Regression
Significance Test for the Slope of a
Regression Line
• We want to test whether the slope of the
regression line is zero or not.
▫ If the slope of the line is zero, then there is no linear
relationship between x and y variables.
▫ Remember (formula for b) if r = 0, then b = 0
• Hypotheses
▫ Two-tailed: Ho: β = 0 and Ha: β ≠ 0
▫ Left-tailed: Ho: β = 0 and Ha: β < 0
▫ Right-tailed: Ho: β = 0 and Ha: β > 0
Test Statistic and Confidence Interval

t = (b − β) / SEb = b / SEb (under Ho: β = 0)

Confidence interval: b ± t* SEb

• t distribution with n − 2 degrees of freedom
• SEb = standard error of the slope:

SEb = s / √( Σ(xᵢ − x̄)² )
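As a worked sketch of this test, here is the age/cholesterol data from Part 1 run through the t statistic by hand; the P-value would then be looked up in Table C or with tcdf, so it is not computed here:

```python
# Sketch of the slope t test for the age/cholesterol data,
# testing Ho: beta = 0 against Ha: beta != 0.
import math

ages = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]
n = len(ages)

x_bar = sum(ages) / n
y_bar = sum(chol) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, chol)) / s_xx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, chol))
s = math.sqrt(sse / (n - 2))
se_b = s / math.sqrt(s_xx)

t = b / se_b                  # t statistic with n - 2 = 12 degrees of freedom
print(f"t = {t:.2f} with df = {n - 2}")  # t ≈ 3.57
```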
Reading Computer Printouts
Example – Page 794, #14.6
Exercise 14.1 (page 786) presents data on the lengths of two
bones in five fossil specimens of the extinct beast
Archaeopteryx. Here is part of the output from the S-PLUS
statistical software when we regress the length y of the
humerus on the length x of the femur.
Coefficients   Value     Std Error   t value   Pr(>|t|)
(Intercept)    −3.6596   4.4590      −0.8207   0.4719
Femur           1.1969   0.0751
Example – Page 794, #14.6
A) What is the equation of the least-squares regression line?
humerus = −3.6596 + 1.1969(femur)
Example – Page 794, #14.6
B) We left out the t statistic for testing Ho: β = 0 and its
P-value. Use the output to find t.
t = b / SEb = 1.1969 / 0.0751 ≈ 15.94
Example – Page 794, #14.6
C)How many degrees of freedom does t have? Use Table C
to approximate the P-value of t against the one-sided
alternative Ha: β > 0.
df = n − 2 = 3; since t > 12.92, we know the P-value < 0.0005.
tcdf(15.9374, 1E99, 3) ≈ 2.685 × 10⁻⁴
Example – Page 794, #14.6
D) Write a sentence to describe your conclusion about the slope of the true regression line.
There is very strong evidence that β > 0; that is, the line is useful for predicting the length of the humerus given the length of the femur.
Example – Page 794, #14.6
E) Determine a 99% confidence interval for the true slope
of the regression line.
Example – Page 794, #14.6
b  t *S b
1.1969  5.841(0.0751)
(0.758,1.636)
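The interval can be checked by hand; t* = 5.841 is the Table C critical value for 99% confidence with df = 3:

```python
# Sketch: 99% confidence interval b ± t* SE_b from the S-PLUS output,
# with t* = 5.841 for df = n - 2 = 3 (Table C).
b, se_b = 1.1969, 0.0751
t_star = 5.841

lower, upper = b - t_star * se_b, b + t_star * se_b
print(f"({lower:.3f}, {upper:.3f})")  # (0.758, 1.636)
```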
Example – Page 794, #14.8
There is some evidence that drinking moderate amounts
of wine helps prevent heart attacks. Exercise 3.63 (Page 183)
gives data on yearly wine consumption (liters of alcohol from
drinking wine, per person) and yearly deaths from heart
disease (deaths per 100,000 people) in 19 developed
nations.
A) Is there statistically significant evidence of a negative
association between wine consumption and heart disease
deaths? Carry out the appropriate test of significance and
write a summary statement about your conclusions.
Example – Page 794, #14.8
β is the slope relating wine consumption to heart disease deaths; a negative association means β < 0.
Ho: β = 0
Ha: β < 0
Example – Page 794, #14.8
Linear Regression T-test
Condition
1. The observations are independent.
2. The true relationship is linear (check the scatterplot to see that the overall pattern is linear, or plot the residuals against the predicted values).
3. The standard deviation of the response about the true
line is the same everywhere (make sure the spread
around the line is nearly constant)
4. The response varies normally about the true regression
line (normal probability plot of residuals is quite straight)
Example – Page 794, #14.8
b 22.97
t 
 6.47
S b 3.357
Sb 
s
2
(
x

x
)

p  value  2.96  106
P  value  0.0005
Reject Ho, since p-value = 0.0005 <  = 0.05 and conclude
that there a linear relationship between wine consumption
and heart disease deaths.
Example – Page 795, #14.10
Exercise 14.4 (page 788) presents data on the relationship
between the speed of runners (x, in feet per second) and
the number of steps y that they take in a second. Here
is part of the Data Desk Regression output for these data:
R squared = 99.8%
s = 0.0091 with 7 − 2 = 5 degrees of freedom

Variable   Coefficient   s.e. of Coeff   t-ratio   prob
Constant   1.76608       0.0307          57.6      <0.0001
speed      0.080284      0.0016          49.7      <0.0001
Example – Page 795, #14.10
A) How can you tell from this output, even without the
scatterplot, that there is a very strong straight-line
relationship between running speed and steps per second?
Example – Page 795, #14.10
r² is very close to 1, which means that nearly all the variation in steps per second is accounted for by foot speed. Also, the P-value for the test of β = 0 is small.
Example – Page 795, #14.10
B) What parameter in the regression model gives the rate at
which steps per second increase as running speed
increases? Give a 99% confidence interval for this rate.
Example – Page 795, #14.10
β (the slope) is this rate; the estimate is listed as the coefficient of "speed," 0.080284.
b ± t* SEb = 0.080284 ± 4.032(0.0016) = (0.074, 0.087)
Lesson 14-2, Part 1
Predictions and Conditions
Confidence Intervals
• Write the given value of the explanatory variable x as x*.
▫ The distinction between predicting a single outcome and predicting the mean of all outcomes when x = x* determines which margin of error is correct.
• To estimate the mean response µy = α + βx*, we use a confidence interval.
• To estimate an individual response y, we use a prediction interval.
Confidence Intervals for Regression Response
A level C confidence interval for the mean response
µy when x takes the value x* is
yˆ  t SE μˆ
*
The standard error


2
x x
1
SE μˆ  s

2
n  x  x 
*
Prediction Intervals for Regression Response
A level C prediction interval for a single observation
on y when x takes the value x*
yˆ  t SE yˆ
*
The standard error


2
x x
1
SE yˆ  s 1  
2
n  x  x 
*
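To see how the two standard errors differ, here is a sketch using the age/cholesterol data from Lesson 14-1 at x* = 67; the t* value 2.179 (df = 12, 95%) is taken from Table C. The extra "1 +" term always makes the prediction interval wider than the confidence interval:

```python
# Sketch contrasting the SE for the mean response with the SE for an
# individual response, using the age/cholesterol data at x* = 67.
import math

ages = [25, 25, 28, 32, 32, 32, 38, 42, 48, 51, 51, 58, 62, 65]
chol = [180, 195, 186, 180, 210, 197, 239, 183, 204, 221, 243, 208, 228, 269]
n = len(ages)

x_bar = sum(ages) / n
y_bar = sum(chol) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, chol)) / s_xx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(ages, chol)) / (n - 2))

x_star = 67
y_hat = a + b * x_star
se_mean = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
se_pred = s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / s_xx)

t_star = 2.179   # df = 12, 95% confidence (Table C)
ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
print(f"CI {ci}, PI {pi}")  # the PI is wider than the CI
```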
Conditions for Regression Inference
• The observations are independent
• The true relationship is linear
• The standard deviation of the response about the
true line is the same everywhere.
• The response varies normally about the true
regression line.
• Check conditions using the residuals.
Examine the residual plot to check that the relationship is
roughly linear and that the scatter about the line is the same
from end to end.
[Residual plot showing a violation of the regression conditions: the variation of the residuals is not constant.]
[Residual plot showing a violation of the regression conditions: a curved relationship between the response variable and the explanatory variable.]
Example – Page 802, #14.12
A) The residuals for the crying and IQ data appear in
Example 14.3 (page 785). Make a stemplot to display
the distribution of the residuals. Are there outliers or
signs of strong departures from normality?
19.20 31.13 22.65 15.18 12.18 15.15 16.63 6.18
1.70
9.14
9.82
20.85
22.60
1.66
10.82
24.35
6.68
6.14
0.37
18.94
6.17
12.60
8.85
32.89
9.15
0.34
10.87
18.47
23.58 9.14
8.62 2.85
19.34 10.89
51.32
2.80
14.30
2.55
Example – Page 802, #14.12
3 1
2 4 3 3
1 0 8 5 5 3 2
0 9 9 9 9 7 6 6 6 3 2 2 0
0 0 3 3 9
1
2
3
4
5
0 1 1 1 4 8 9 9
1 4
One residual (51.32) may be a high
3
outlier, but the stemplot does not
Show any other deviations from
normality.
1
Example – Page 802, #14.12
B) What other assumptions or conditions are required for
using inference for regression on these data? Check that
those conditions are satisfied and then describe your
findings.
Example – Page 802, #14.12
The scatter of the data points about the regression line varies somewhat as we move along the line, but the variation is not serious, as a residual plot shows. The other conditions can be assumed to be satisfied.
Example – Page 802, #14.12
C) Would a 95% prediction interval for x = 25 be narrower,
the same size, or wider than a 95% confidence interval?
Explain your reasoning.
A prediction interval would be wider. For a fixed
confidence level, the margin of error is always larger
when we are predicting a single observation than when
we are estimating the mean response.
Example – Page 802, #14.12
D) A computer package reports that the 95% prediction
interval for x = 25 is (91.85, 165.33). Explain what this
interval means in simple language.
We are 95% confident that when x (crying intensity) = 25, the corresponding value of y (IQ) will be between 91.85 and 165.33.
Example – Page 802, #14.14
In exercise 14.11 (page 795) we regressed the lean of the
Leaning Tower of Pisa on year to estimate the rate at which
the tower is tilting. Here are the residuals from that
regression, in order by years across the rows:
4.220  3.099  0.418  1.264  5.011  0.670  2.055  3.626  2.308  4.648  5.967  1.714  7.396
Use the residuals to check the regression conditions, and
describe your findings. Is the regression in exercise 14.11
trustworthy?
Example – Page 802, #14.14
In exercise 14.11 (page 795) we regressed the lean of the
Leaning Tower of Pisa on year to estimate the rate at which
the tower is tilting. Here are the residuals from that
regression, in order by years across the rows:
4.220
3.099 0.418 1.264
5.011 0.670
2.055 3.626 2.308
4.648 5.967 1.714
7.396
Use the residuals to check the regression conditions, and
describe your findings. Is the regression in exercise 14.11
trustworthy?
Example – Page 802, #14.14
[Scatterplot of residuals versus year; normal probability plot of residuals]

The scatterplot of the residuals versus year does not suggest any problems. The regression in Exercise 14.11 should be fairly reliable.
Example – Page 809, #14.24
Here are data on the time (in minutes) Professor Moore takes to swim 2000 yards and his pulse rate (beats per minute) after swimming:
Time:  34.12  35.72  34.72  34.05  34.13  35.72  36.17  35.57
Pulse:   152    124    152    146    128    136    144

Time:  35.57  35.43  36.05  34.85  34.70  34.75  33.93
Pulse:   148    144    124    148    144    140    156

Time:  34.00  34.35  35.62  35.68  35.28  35.97  35.37  34.60
Pulse:   148    132    124    132    139    136    140    136    148
Example – Page 809, #14.24
A scatterplot shows a negative linear relationship: a faster time (fewer minutes) is associated with a higher heart rate. Here is part of the output from the regression function in the Excel spreadsheet.
             Coefficients    Standard Error   t Stat        P-value
Intercept     479.9341457    66.22779275       7.246718119   3.87075E−07
X Variable    −9.694903394    1.888664503     −5.1332057     4.37908E−05
Give a 90% confidence interval for the slope of the true
regression line. Explain what your result tells us about the
relationship between the professor’s swimming time and
heart rate.
Example – Page 809, #14.24
b  t SEb
*
21
9.9649  1.721(1.8887)
– 12.9454 to – 6.4444 bpm per minute
With a 90% confidence, we can say that for each
1-minute increase in swimming time, pulse rate
drops by 6 to 13 bpm.
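The arithmetic above is easy to check; this sketch uses the full-precision values from the Excel output:

```python
# Sketch: 90% confidence interval for the slope from the Excel output,
# b ± t* SE_b with t* = 1.721 for df = 23 - 2 = 21 (Table C).
b, se_b = -9.694903394, 1.888664503
t_star = 1.721

lower, upper = b - t_star * se_b, b + t_star * se_b
print(f"({lower:.4f}, {upper:.4f})")  # ≈ (-12.945, -6.444)
```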
Example – Page 809, #14.24
Using the TI
Example – Page 809, #14.25
Exercise 14.24 gives data on a swimmer’s time and heart
rate. One day the swimmer completes his laps in 34.3
minutes but forgets to take his pulse. Minitab gives this
prediction for heart rate when x* = 34.3:
Fit      StDev Fit   90.0% CI            90.0% PI
147.40   1.97        (144.02, 150.78)    (135.79, 159.01)
A) Verify that “Fit” is the predicted heart rate from the least-squares line found in Exercise 14.24. Then choose one of the intervals from the output to estimate the swimmer's heart rate that day, and explain why you chose this interval.
Example – Page 809, #14.25
ŷ ( pulse)  479.9  9.6949 x(time)
when x = 34.3 minutes
ŷ ( pulse)  479.9  9.6949(34.3)  147.37
this agrees the output
Example – Page 809, #14.25
The prediction interval is appropriate for estimating one value (as opposed to the mean of many values): 135.79 to 159.01 bpm.
Example – Page 809, #14.25
B) Minitab gives only one of the two standard errors used in prediction. It is SEμ̂, the standard error for estimating the mean response. Use this fact and a critical value from Table C to verify Minitab's 90% confidence interval for the mean heart rate on days when the swimming time is 34.3 minutes.
Example – Page 809, #14.25
yˆ  t SEˆ
*
21
147.40  1.721(1.97)
144.01 to 150.79, which agrees with the computer
output
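The verification can be sketched in two lines of arithmetic:

```python
# Sketch verifying Minitab's 90% CI for the mean response:
# yhat ± t* SE_muhat with t* = 1.721 for df = 21 (Table C).
fit, se_fit = 147.40, 1.97
t_star = 1.721

lower, upper = fit - t_star * se_fit, fit + t_star * se_fit
print(f"({lower:.2f}, {upper:.2f})")  # (144.01, 150.79)
```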