ch 14 hw solutions

AP Statistics
Solutions to Packet 14
Inference for Regression
Inference about the Model
Predictions and Conditions
HW #32 1, 2, 6, 7
14.1 AN EXTINCT BEAST, I Archaeopteryx is an extinct beast having feathers like a bird but
teeth and a long bony tail like a reptile. Here are the lengths in centimeters of the femur (a leg bone) and
the humerus (a bone in the upper arm) for the five fossil specimens that preserve both bones:
Femur:     38   56   59   64   74
Humerus:   41   63   70   72   84
The strong linear relationship between the lengths of the two bones helped persuade scientists that all
five specimens belong to the same species.
(a) Examine the data. Make a scatterplot with femur length as the explanatory variable. Use your
calculator to obtain the correlation r and the equation of the least-squares regression line. Do you think
the femur length will allow good prediction of humerus length?
The correlation is r = 0.994, and linear regression gives ŷ = −3.660 + 1.1969x.
The scatterplot below shows a strong, positive, linear relationship, which is confirmed by r.
(b) Explain in words what the slope β of the true regression line says about Archaeopteryx. What is the
estimate of β from the data? What is your estimate of the intercept α of the true regression line?
β represents how much we can expect the humerus length to increase when femur length
increases by 1 cm, b (the estimate of β ) is 1.1969, and the estimate of α is a = −3.660.
(c) Calculate the residuals for the five data points. Check that their sum is 0 (up to roundoff error.) Use
the residuals to estimate the standard deviation σ in the regression model. You have now estimated all
three parameters in the model.
The residuals are −0.8226, −0.3668, 3.0425, −0.9420, and −0.9110; the sum is −0.0001 (but
carrying a different number of digits might change this). Squaring and summing the residuals
gives 11.79, so that s = √(11.79 / 3) = 1.982, where the divisor is n − 2 = 3.
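The calculator work in parts (a) through (c) can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original solution.

```python
import math

# Data from Exercise 14.1 (lengths in cm).
femur = [38, 56, 59, 64, 74]
humerus = [41, 63, 70, 72, 84]

n = len(femur)
x_bar = sum(femur) / n
y_bar = sum(humerus) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in femur)
syy = sum((y - y_bar) ** 2 for y in humerus)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(femur, humerus))

b = sxy / sxx                    # estimate of the slope beta
a = y_bar - b * x_bar            # estimate of the intercept alpha
r = sxy / math.sqrt(sxx * syy)   # correlation

# Residuals (observed minus predicted) and the estimate s of sigma.
residuals = [y - (a + b * x) for x, y in zip(femur, humerus)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(f"y-hat = {a:.3f} + {b:.4f} x, r = {r:.3f}, s = {s:.3f}")
```

The residuals sum to zero up to floating-point error, which matches the roundoff remark in part (c).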
14.2 BACKPACKS Body weights and backpack weights were collected for eight students.
Weight (lbs):            120   187   109   103   131   165   158   116
Backpack weight (lbs):    26    30    26    24    29    35    31    28
These data were entered into a statistics package and least-squares regression of backpack weight on
body weight was requested. Here are the results:
Predictor    Coef       Stdev      t-ratio    P
Constant     16.265     3.937      4.13       0.006
BodyWT       0.09080    0.02831    3.21       0.018
s = 2.270    R-sq = 63.2%    R-sq(adj) = 57.0%
(a) What is the equation of the least-squares line? (Hint: Look for the column “Coef.”) What is the
intercept? What is the slope?
backpack weight = 16.265 + 0.0908(body weight). The intercept is 16.265 and the slope is
0.0908.
(b) The model for regression inference has three parameters, which we call α, β, and σ. Can you
determine the estimates for α and β from the computer output? What are they?
The estimate for α is the intercept of the least-squares line, that is, 16.265.
The estimate for β is the slope of the least-squares line, that is, 0.0908.
(c) The computer output reports that s = 2.270. This is an estimate of the parameter σ. Use the formula
for s to verify the computer’s value of s.
The estimate for σ is s:
s = √(Σ resid² / (n − 2)) = √(30.9049 / 6) = 2.2695
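The sum of squared residuals used in part (c) can be checked directly from the data. The sketch below plugs the rounded coefficients from the printout into the residual formula; the variable names are illustrative.

```python
import math

# Data from Exercise 14.2 (weights in lbs).
body = [120, 187, 109, 103, 131, 165, 158, 116]
pack = [26, 30, 26, 24, 29, 35, 31, 28]

# Coefficients as printed in the "Coef" column of the output.
a, b = 16.265, 0.09080

# Residual = observed backpack weight minus predicted backpack weight.
residuals = [y - (a + b * x) for x, y in zip(body, pack)]
sse = sum(e ** 2 for e in residuals)

# s divides by n - 2 because two parameters (alpha, beta) were estimated.
s = math.sqrt(sse / (len(body) - 2))

print(f"sum of squared residuals = {sse:.4f}, s = {s:.4f}")
```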
14.6 AN EXTINCT BEAST, II Refer to exercise 14.1. Below is part of the output from the S-PLUS
statistical software when we regress the length y of the humerus on the length x of the femur.
Coefficients:
               Value      Std. Error   t value    Pr(>|t|)
(Intercept)    -3.6596    4.4590       -0.8207    0.4719
Femur           1.1969    0.0751
(a) What is the equation of the least-squares regression line?
ŷ = −3.6596 + 1.1969x
(b) We left out the t statistic for testing H0: β = 0 and its P-value. Use the output to find t.
t = b / SEb = 1.1969 / 0.0751 = 15.9374
(c) How many degrees of freedom does t have? Use Table C to approximate the P-value of t against the
one-sided alternative Ha: β > 0.
df = 3; since t > 12.92, we know that p < 0.0005.
(d) Write a sentence to describe your conclusions about the slope of the true regression line.
There is very strong evidence that β > 0, that is, that the line is useful for predicting the length
of the humerus given the length of the femur.
(e) Determine a 99% confidence interval for the true slope of the regression line. (Show your
calculation.) Interpret the interval.
For df = 3, the critical value for a 99% confidence interval is t* = 5.841.
The interval is 1.1969 ± (5.841)(0.0751) or 1.1969 ± 0.439, that is, 0.7579 to 1.6359.
We are 99% confident that the true slope of the LSRL of the length of humerus on the length
of femur is between 0.758 and 1.636.
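The arithmetic in parts (b) and (e) needs only the slope, its standard error, and the critical value. A short sketch, with t* = 5.841 taken from Table C as stated above and illustrative variable names:

```python
# Values read from the S-PLUS output in Exercise 14.6.
b = 1.1969       # estimated slope
se_b = 0.0751    # standard error of the slope
t_star = 5.841   # Table C critical value for 99% confidence, df = 3

# Test statistic for H0: beta = 0.
t = b / se_b

# 99% confidence interval for the true slope: b +/- t* SEb.
margin = t_star * se_b
lower, upper = b - margin, b + margin

print(f"t = {t:.4f}")
print(f"99% CI for slope: {lower:.3f} to {upper:.3f}")
```

Carrying the margin of error at full precision gives endpoints that agree with the interval above to three decimal places.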
14.7 JET SKIS, I Data for the number of jet skis in use and number of fatalities for the years 1987 to
1996 are given below.
Year    Number in use    Accidents    Fatalities
1987     92,756            376          5
1988    126,881            650         20
1989    178,510            844         20
1990    241,376          1,162         28
1991    305,915          1,513         26
1992    372,283          1,650         34
1993    454,545          2,236         35
1994    600,000          3,002         56
1995    760,000          4,028         68
1996    900,000          4,010         55
(a) Formulate null and alternative hypotheses about the slope of the true regression line. State a one-sided alternative hypothesis.
H0: β = 0 (there is no association between number of jet skis in use and number of fatalities).
Ha: β > 0 (there is a positive association between number of jet skis in use and number of
fatalities).
(b) What conditions or assumptions are necessary in order to perform a linear regression test of
significance? Are these reasonable assumptions in this situation?
y responses are independent – not given, proceed with caution.
True relationship is linear – yes
σ is constant – yes
y varies normally - yes
(c) Perform a linear regression t test. Report the t statistic, the degrees of freedom, and the P-value.
Write your conclusion in plain language.
LinRegTTest (TI-84) reports that t = 7.26 with df = 8. The P-value is reported as 0.000 (that is,
P < 0.0005). With the earlier caveat, there is sufficient evidence to reject H0 and conclude that
there is a positive association between the number of jet skis in use and the number of
fatalities. As the number of jet skis in use increases, the number of fatalities increases.
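The t statistic reported by LinRegTTest can be reproduced from the table. The sketch below is a hand-rolled least-squares fit, not the calculator's own routine; the function name slope_inference is illustrative, and the number in use is entered in thousands to keep the slope readable.

```python
import math

def slope_inference(x, y):
    """Return (a, b, se_b, t, df) for the least-squares line y-hat = a + b x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))      # estimate of sigma
    se_b = s / math.sqrt(sxx)         # standard error of the slope
    return a, b, se_b, b / se_b, n - 2

# Jet skis in use (thousands) and fatalities, 1987 to 1996.
in_use = [92.756, 126.881, 178.510, 241.376, 305.915,
          372.283, 454.545, 600.0, 760.0, 900.0]
fatalities = [5, 20, 20, 28, 26, 34, 35, 56, 68, 55]

a, b, se_b, t, df = slope_inference(in_use, fatalities)
print(f"b = {b:.4f} per thousand jet skis, SEb = {se_b:.5f}, t = {t:.2f}, df = {df}")
```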
(d) Determine a 98% confidence interval for the true slope of the regression line. (Show your
calculation.) Write your conclusion in plain language.
The confidence interval takes the form b ± t* SEb. With df = 8, t* = 2.896; with b = 0.0000662
and SEb = 0.00000913, the 98% confidence interval is approximately (0.0000398, 0.0000926).
We are 98% confident that the true slope of the LSRL of fatalities on number of jet skis in use
(in thousands) is between 0.040 and 0.093.
HW # 33 9 – 11, 13 – 15
14.9 DOES FAST DRIVING WASTE FUEL? The table below gives data on the fuel consumption of a
small car at various speeds from 10 to 150 kilometers per hour. Is there evidence of straight-line
dependence between speed and fuel use? Make a scatterplot and use it to explain the result of your test.
Speed (km/h):                   10      20      30      40      50      60      70      80
Fuel used (liters/100 km):   21.00   13.00   10.00    8.00    7.00    5.90    6.30    6.95

Speed (km/h):                   90     100     110     120     130     140     150
Fuel used (liters/100 km):    7.57    8.27    9.03    9.87   10.79   11.77   12.83
Regression of fuel consumption on speed gives b = −0.01466, SEb = 0.02334, and t = −0.63.
With df = 13, we see that p > 2(0.25) = 0.50 (software reports 0.541), so we have no evidence
to suggest a straight-line relationship. While the relationship between these two variables is
very strong, it is definitely not linear.
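The regression summary quoted above can be recomputed from the table, and the signs of the residuals show why a line is the wrong model. This is a sketch with illustrative names, using only the standard library.

```python
import math

speed = list(range(10, 160, 10))   # 10, 20, ..., 150 km/h
fuel = [21.00, 13.00, 10.00, 8.00, 7.00, 5.90, 6.30, 6.95,
        7.57, 8.27, 9.03, 9.87, 10.79, 11.77, 12.83]

n = len(speed)
x_bar, y_bar = sum(speed) / n, sum(fuel) / n
sxx = sum((x - x_bar) ** 2 for x in speed)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(speed, fuel))

b = sxy / sxx                      # least-squares slope
a = y_bar - b * x_bar              # least-squares intercept
residuals = [y - (a + b * x) for x, y in zip(speed, fuel)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
se_b = s / math.sqrt(sxx)          # standard error of the slope
t = b / se_b                       # test statistic for H0: beta = 0

print(f"b = {b:.5f}, SEb = {se_b:.5f}, t = {t:.2f}, df = {n - 2}")

# A curved pattern shows up as long runs of same-signed residuals.
print("".join("+" if e > 0 else "-" for e in residuals))
```

A near-zero slope with a systematic residual pattern is what a strong but nonlinear relationship looks like to a straight-line test.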
14.10 The table below presents data on the relationship between the speed of runners (x, in feet per
second) and the number of steps y that they take in a second.
Speed (ft/s):        15.86   16.88   17.50   18.62   19.97   21.06   22.11
Steps per second:     3.05    3.12    3.17    3.25    3.36    3.46    3.55
Here is part of the Data Desk regression output for these data:
R-squared = 99.8%
s = 0.0091 with 7 – 2 = 5 degrees of freedom
Variable    Coefficient    s.e. of coefficient    t-ratio    Prob
Constant    1.76608        0.0307                 57.6       < 0.0001
Speed       0.080284       0.0016                 49.7       < 0.0001
(a) How can you tell from this output, even without the scatterplot, that there is a very strong straight-line relationship between running speed and steps per second?
r² is very close to 1, which means that nearly all the variation in steps per second is accounted
for by running speed. Also, the P-value for β is small.
(b) What parameter in the regression model gives the rate at which steps per second increase as running
speed increases? Find and interpret a 99% confidence interval for this rate.
β (the slope) is this rate; the estimate is listed as the coefficient of “Speed,” 0.080284.
Using a t(5) distribution the confidence interval is
0.080284 ± (4.032)(0.0016) = 0.07383 to 0.08674.
We are 99% confident that the true slope of the LSRL of steps per second on running speed is
between 0.074 and 0.087 steps per second for each additional foot per second of speed.
14.11 THE LEANING TOWER OF PISA The Leaning Tower of Pisa leans more as time passes.
Here are measurements of the lean of the tower for the years 1975 to 1987. The lean is the distance
between where a point on the tower would be if the tower were straight and where it actually is. The
distances are tenths of a millimeter in excess of 2.9 meters. For example, the 1975 lean, which was
2.9642 meters, appears in the table as 642. We use only the last two digits of the year as our time
variable.
Year:    75    76    77    78    79    80    81    82    83    84    85    86    87
Lean:   642   644   656   667   673   688   696   698   713   717   725   742   757
Here is part of the output from the Data Desk regression procedure with year as the explanatory variable
and lean as the response variable:
Variable    Coefficient    s.e. of coefficient    t-ratio    prob
Constant    -61.1209       25.13                  -2.43      0.0333
year        9.31868        0.3099                 30.1       < 0.0001
(a) Plot the data. Briefly describe the shape, strength, and direction of the relationship. The tower is
tilting at a steady rate.
The plot (below) shows a strong positive linear relationship.
(b) The main purpose of the study is to estimate how fast the tower is tilting. What parameter in the
regression model gives the rate at which the tilt is increasing, in tenths of a millimeter per year?
β (the slope) is this rate; the estimate is listed as the coefficient of “year”: 9.31868.
(c) We want a 95% confidence interval for this rate. How many degrees of freedom does t have? Find
the critical value t* and the confidence interval. Interpret the interval.
df = 11; t* = 2.201;
9.31868 ± (2.201)(0.3099) = 8.6366 to 10.0008. We are 95% confident that the true slope of
the LSRL of tilt on year is between 8.6 and 10.0 tenths of a millimeter per year.
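The slope, its standard error, and the interval in part (c) can all be recovered from the thirteen data points. A sketch with illustrative names follows; t* = 2.201 is the Table C value quoted above.

```python
import math

year = list(range(75, 88))         # 75, 76, ..., 87
lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

n = len(year)
x_bar, y_bar = sum(year) / n, sum(lean) / n
sxx = sum((x - x_bar) ** 2 for x in year)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(year, lean))

b = sxy / sxx                      # rate of tilt, tenths of a mm per year
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(year, lean))
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t_star = 2.201                     # Table C, 95% confidence, df = 11
lower, upper = b - t_star * se_b, b + t_star * se_b

print(f"b = {b:.5f}, SEb = {se_b:.4f}")
print(f"95% CI for slope: {lower:.2f} to {upper:.2f}")
```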
14.13 THE GENTLE MANATEE The relationship between the number of powerboats registered
and the number of manatees killed each year was explored in Chapter 3. We will revisit the data below:
Year    Powerboat               Manatees    Year    Powerboat               Manatees
        registrations (1000)    killed              registrations (1000)    killed
1977    447                     13          1986    614                     33
1978    460                     21          1987    645                     39
1979    481                     24          1988    675                     43
1980    498                     16          1989    711                     50
1981    513                     24          1990    719                     47
1982    512                     20          1991    716                     53
1983    526                     15          1992    716                     38
1984    559                     34          1993    716                     35
1985    585                     33          1994    735                     49
We conducted inference on the manatee data earlier, but was this prudent? Check the conditions, and
report your interpretations.
The major difficulty is that the observations are not independent: the number of powerboat
registrations for any year is related to the number of registrations for the previous year.
The other conditions can be assumed to be satisfied:
The true relationship is linear.
The standard deviation of the response about the true line is the same everywhere.
The response varies normally about the true regression line.
14.14 PISA, PISA! In Exercise 14.11 we regressed the lean of the Leaning Tower of Pisa on year to
estimate the rate at which the tower is tilting. Here are the residuals from that regression, in order by
years across the rows:
 4.220   -3.099   -0.418    1.264   -2.055    3.626    2.308
-5.011    0.670   -4.648   -5.967    1.714    7.396
Use the residuals to check the regression conditions, and describe your findings. Is the regression in
Exercise 14.11 trustworthy?
The number of points is so small that it is hard to judge much from the stemplot. The scatterplot
of residuals vs. year does not suggest any problems. The regression in Exercise 14.11 should
be fairly reliable.
14.15 DO HEAVIER PEOPLE BURN MORE ENERGY? Metabolic rate, the rate at which the
body consumes energy, is important in studies of weight gain, dieting, and exercise. Lean body mass is
an important influence on metabolic rate. Men and women show a similar pattern, so we will ignore
gender. Here are the data on mass (in kilograms) and metabolic rate (in calories):
Mass:   62.0   62.9   36.1   54.6   48.5   42.0   47.4   50.6   42.0   48.7
Rate:   1792   1666    995   1425   1396   1418   1362   1502   1256   1614

Mass:   40.3   33.1   51.9   42.4   34.5   51.1   41.2   51.9   46.9
Rate:   1189    913   1460   1124   1052   1347   1204   1867   1439
Use your calculator or software to analyze these data. Make a scatterplot and find the least-squares line.
Give a 90% confidence interval for the slope β and explain clearly what your interval says about the
relationship between lean body mass and metabolic rate. Find the residuals and examine them. Are the
conditions for regression inference met?
The scatterplot (below) shows a positive association. The regression line is ŷ = 113.2 + 26.88x;
the linear relationship with body mass accounts for r² = 74.8% of the variation in metabolic rate.
Minitab output (on the next page) reports b = 26.879 and SEb = 3.786; with df = 17, the critical
value is t* = 1.740, so the 90% confidence interval for β is 26.879 ± (1.740)(3.786) = 20.29 to
33.47 cal/kg. We are 90% confident that each additional kilogram of lean body mass is
associated with an increase of about 20 to 33 calories in mean metabolic rate.
The residuals are listed on the next page (in order, down the columns). A stemplot (on the next
page) suggests that the distribution of residuals is right-skewed, and the largest residual may be
an outlier. A scatterplot (on the next page) of the residuals against the explanatory variable
gives some hint that the variation about the line is not constant (in violation of the regression
assumptions).
However, the three highest residuals account for most of that impression (as well as the
skewness of the distribution), so these three individuals may need to be examined further.
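For readers without Minitab, the fit, the 90% interval, and the residuals can be computed from the 19 data pairs. This is a sketch with illustrative names, not the Minitab procedure itself; t* = 1.740 is the critical value quoted above.

```python
import math

mass = [62.0, 62.9, 36.1, 54.6, 48.5, 42.0, 47.4, 50.6, 42.0, 48.7,
        40.3, 33.1, 51.9, 42.4, 34.5, 51.1, 41.2, 51.9, 46.9]
rate = [1792, 1666, 995, 1425, 1396, 1418, 1362, 1502, 1256, 1614,
        1189, 913, 1460, 1124, 1052, 1347, 1204, 1867, 1439]

n = len(mass)
x_bar, y_bar = sum(mass) / n, sum(rate) / n
sxx = sum((x - x_bar) ** 2 for x in mass)
syy = sum((y - y_bar) ** 2 for y in rate)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(mass, rate))

b = sxy / sxx
a = y_bar - b * x_bar
r_squared = sxy ** 2 / (sxx * syy)

residuals = [y - (a + b * x) for x, y in zip(mass, rate)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
se_b = s / math.sqrt(sxx)

t_star = 1.740                     # 90% confidence, df = 17
lower, upper = b - t_star * se_b, b + t_star * se_b

print(f"y-hat = {a:.1f} + {b:.2f} x, r^2 = {r_squared:.3f}")
print(f"90% CI for slope: {lower:.2f} to {upper:.2f}")

# The three largest residuals are the ones flagged for closer examination.
print(sorted(residuals, reverse=True)[:3])
```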
HW #34 19, 23
14.19 BEAVERS AND BEETLES Ecologists sometimes find rather strange relationships in our
environment. One study seems to show that beavers benefit beetles. The researchers laid out 23 circular
plots, each four meters in diameter, in an area where beavers were cutting down cottonwood trees. In
each plot, they measured the number of stumps from trees cut by beavers and the number of clusters of
beetle larvae. Here are the data:
Stumps:           2    2    1    3    3    4    3    1    2    5    1    3
Beetle Larvae:   10   30   12   24   36   40   43   11   27   56   18   40

Stumps:           2    1    2    2    1    1    4    1    2    1    4
Beetle Larvae:   25    8   21   14   16    6   54    9   13   14   50
(a) Make a scatterplot that shows how the number of beaver-caused stumps influences the number of
beetle larvae clusters. What does your plot show?
Stumps (the explanatory variable) should be on the horizontal axis; the plot shows a positive
linear association.
(b) Here is the Minitab regression output for these data:
Predictor    Coef      Stdev    T        P
Constant     -1.286    2.853    -0.45    0.657
Stumps       11.894    1.136    10.47    0.000
s = 6.419    R-sq = 83.9%
Find the least-squares regression line and draw it on your plot. What percent of the observed variation
in beetle larvae counts can be explained by straight-line dependence on beaver stump counts?
The regression line is ŷ = −1.286 + 11.894x. Regression on stump counts explains 83.9% of
the variation in the number of beetle larvae.
(c) Is there strong evidence that beaver stumps help explain beetle larvae counts? Give appropriate
statistical evidence to support your conclusion.
Our hypotheses are H0: β = 0 versus Ha: β ≠ 0, and the test statistic is t = 10.47 (df = 21).
The output shows p = 0.000, so we know that p < 0.0005; we have strong evidence that beaver
stump counts help explain beetle larvae counts.
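The slope, R-sq, and t statistic in the Minitab output can be checked against the 23 plots. A sketch with illustrative names:

```python
import math

stumps = [2, 2, 1, 3, 3, 4, 3, 1, 2, 5, 1, 3,
          2, 1, 2, 2, 1, 1, 4, 1, 2, 1, 4]
larvae = [10, 30, 12, 24, 36, 40, 43, 11, 27, 56, 18, 40,
          25, 8, 21, 14, 16, 6, 54, 9, 13, 14, 50]

n = len(stumps)
x_bar, y_bar = sum(stumps) / n, sum(larvae) / n
sxx = sum((x - x_bar) ** 2 for x in stumps)
syy = sum((y - y_bar) ** 2 for y in larvae)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(stumps, larvae))

b = sxy / sxx
a = y_bar - b * x_bar
r_squared = sxy ** 2 / (sxx * syy)   # fraction of variation explained

sse = syy - b * sxy                  # residual sum of squares
s = math.sqrt(sse / (n - 2))
t = b / (s / math.sqrt(sxx))         # test statistic for H0: beta = 0

print(f"y-hat = {a:.3f} + {b:.3f} x, R-sq = {100 * r_squared:.1f}%, "
      f"s = {s:.3f}, t = {t:.2f}, df = {n - 2}")
```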
14.23 WEEDS AMONG THE CORN Lamb’s quarter is a common weed that interferes with the
growth of corn. An agriculture researcher planted corn at the same rate in 16 small plots of ground, then
weeded the plots by hand to allow a fixed number of lamb’s quarter plants to grow in each meter of corn
row. No other weeds were allowed to grow. Here are the yields of corn (bushels per acre) in each of the
plots:
Weeds per meter    Corn yield (bushels per acre)
0                  166.7   172.2   165.0   176.9
1                  166.2   157.3   166.7   161.1
3                  158.6   176.4   153.1   156.0
9                  162.8   142.4   162.8   162.4
Use your calculator or software to analyze these data.
(a) Make a scatterplot and find the least-squares line. What percent of the observed variation in corn
yield can be explained by a linear relationship between yield and weeds per meter?
Scatterplot below. Regression gives ŷ = 166.5 − 1.099x; the linear relationship explains about
r² = 20.9% of the variation in yield.
(b) Is there good evidence that more weeds reduce corn yield?
The t statistic for testing H0: β = 0 vs. Ha: β < 0 is t = −1.92; with df = 14, the P-value is 0.0375.
This is significant at the 5% level but not at the 1% level, so we have moderate, not strong,
evidence that more weeds reduce corn yield.
(c) Explain from your findings in (a) and (b) why you expect predictions based on this regression to be
quite imprecise. Predict the mean corn yield under these experimental conditions when there are 6
weeds per meter of row.
The small value of r² and the modest t statistic indicate that this regression has
little predictive use. When x = 6, ŷ = 159.9 bu/acre; the 95% confidence interval for the mean
yield, with t* = 2.145 and SEμ̂ = 2.54, is
159.9 ± (2.145)(2.54), or about 154.5 to 165.3 bu/acre.
The width of this interval is another indication that the model has little practical use.
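The standard error of the estimated mean response comes from the usual formula SEμ̂ = s √(1/n + (x* − x̄)² / Σ(x − x̄)²). The sketch below applies it at x* = 6; the variable names are illustrative and t* = 2.145 is the value quoted above.

```python
import math

weeds = [0] * 4 + [1] * 4 + [3] * 4 + [9] * 4
yields = [166.7, 172.2, 165.0, 176.9,
          166.2, 157.3, 166.7, 161.1,
          158.6, 176.4, 153.1, 156.0,
          162.8, 142.4, 162.8, 162.4]

n = len(weeds)
x_bar, y_bar = sum(weeds) / n, sum(yields) / n
sxx = sum((x - x_bar) ** 2 for x in weeds)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weeds, yields))

b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(weeds, yields))
s = math.sqrt(sse / (n - 2))

x_star = 6
y_hat = a + b * x_star                                       # predicted mean yield
se_mu = s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / sxx)   # SE of the mean response

t_star = 2.145                                               # 95% confidence, df = 14
lower, upper = y_hat - t_star * se_mu, y_hat + t_star * se_mu

print(f"y-hat = {y_hat:.1f}, SE = {se_mu:.2f}, 95% CI: {lower:.1f} to {upper:.1f}")
```

This is an interval for the mean yield at 6 weeds per meter; a prediction interval for a single plot would be wider still.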