Regression Inference
1. Ms. President? The Gallup organization, over six decades,
periodically asked the following question:
If your party nominated a generally well-qualified person
for president who happened to be a woman, would you
vote for that person?
Here is a scatterplot of the percentage answering "yes" vs.
the year of the century (37 = 1937):
Here is the regression analysis:
Dependent variable is: Yes
R-squared = 94.2%
s = 4.274 with 16 - 2 = 14 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   Prob
Constant    -5.58269      4.582       -1.22     0.2432
Year        0.999373      0.0661      15.1      <0.0001
a) Explain in words and numbers what the regression
says.
b) State the hypothesis about the slope (both numerically
and in words) that describes how voters' thoughts have
changed about voting for a woman.
c) Assuming that the assumptions for inference are
satisfied, perform the hypothesis test and state your
conclusion. Be sure to state it in terms of voters'
opinions.
d) Explain what the R-squared in this regression means.
2. Drug use. The European School Study Project on Alcohol
and Other Drugs, published in 1995, investigated the use
of marijuana and other drugs. Data from 11 countries are
summarized in the scatterplot and regression analysis
below. They show the association between the percentage
of a country's ninth graders who report having smoked
marijuana and the percentage who have used other drugs
such as LSD, amphetamines, and cocaine.
Dependent variable is: Other
R-squared = 87.3%
s = 3.853 with 11 - 2 = 9 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   Prob
Constant    -3.06780      2.204       -1.39     0.1974
Marijuana   0.615003      0.0784      7.85      0.0001
a) Explain in words and numbers what the regression
says.
b) State the hypothesis about the slope (both
numerically and in words) that describes how use of
marijuana is associated with use of other drugs.
c) Assuming that the assumptions for inference are
satisfied, perform the hypothesis test and state your
conclusion in context.
d) Explain what the R-squared in this regression means.
e) Do these results indicate that marijuana use leads to
the use of harder drugs? Explain.
3. No opinion. Here's a regression of the percentage of
respondents whose response to the question about voting
for a woman president was "no opinion." We wonder if the
percentage of the public who have no opinion on this issue
has changed over the years. Assume that the conditions for
inference are satisfied.
Dependent variable is: No Opinion
R-squared = 9.5%
s = 2.280 with 16 - 2 = 14 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   Prob
Constant    7.69962       2.445       3.15      0.0071
Year        -0.042708     0.0353      ?         ?
a) State the appropriate hypothesis for the slope.
b) Test your hypothesis and state your conclusion in the
proper context.
c) Below is the scatterplot corresponding to the
regression for No Opinion. How does the scatterplot
change your opinion of the trend in "no opinion"
responses? Do you think the true slope is negative?
Does this change the conclusion of your hypothesis
test of part b? Explain.
4. Cholesterol. Does a person's cholesterol level tend to
increase with age? Data collected in Framingham, MA,
from 294 adults aged 45 to 62 produced the regression
analysis shown. Assuming that the data satisfy the conditions for inference, examine the association between age
and cholesterol level.
Dependent variable is: Chol
Variable    Coefficient   SE(Coeff)   t-ratio   Prob
Constant    196.619       33.21       5.92      0.0001
Age         0.745779      0.6075      ?         ?
a) State the appropriate hypothesis for the slope.
b) Test your hypothesis and state your conclusion in the
proper context.
5. Marriage age. The scatterplot suggests a decrease in the
difference in ages at first marriage for men and women
since 1975. We want to examine the regression to see if
this decrease is significant.
Dependent variable is: Men - Women
R-squared = 46.3%
s = 0.1866 with 24 - 2 = 22 degrees of freedom
Variable    Coefficient   SE(Coeff)   t-ratio   Prob
Constant    49.9021       10.93       4.56      0.0001
Year        -0.0293957    0.0055      ?         ?
a) Write appropriate hypotheses.
b) Here are the residuals plot and a histogram of the
residuals. Do you think the conditions for inference
are satisfied? Explain.
c) Test the hypothesis and state your conclusion about
the trend in age at first marriage.
d) Give a 95% confidence interval for the rate at which
the age gap is closing. Clearly explain what your
confidence interval means.
6. Used cars. Classified ads in a newspaper offered several
used Toyota Corollas for sale. Listed below are the ages of
the cars and the advertised prices.
Age (yr)   Prices advertised ($)
1          12,995; 10,950
2          10,495
3          10,995; 10,995
4          6,995; 7,990
5          8,700; 6,995
6          5,990; 4,995
9          3,200; 2,250; 3,995
11         2,900; 2,995
13         1,750
a) Make a scatterplot for these data.
b) Do you think a linear model is appropriate? Explain.
c) Find the equation of the regression line.
d) Check the residuals to see if the conditions for
inference are met.
e) Create a 95% confidence interval for the slope of the
regression line.
f) Explain what your confidence interval means.
7. Fuel economy. A consumer organization has reported test
data for 50 car models. We will examine the association
between the weight of the car (in thousands of pounds) and
the fuel efficiency (in miles per gallon). Shown are the
summary statistics, scatterplot, and regression analysis:
Variable   Count   Mean       StdDev
MPG        50      25.02000   4.83394
wt/1000    50      2.88780    0.511656
Dependent variable is: MPG
R-squared = 75.6%
s = 2.413 with 50 - 2 = 48 df
Variable   Coefficient   SE(Coeff)   t-ratio   Prob
Constant   48.7393       1.976       24.7      0.0001
Weight     -8.21362      0.6738      -12.2     0.0001
a) Is there strong evidence of an association between the
weight of a car and its gas mileage? Write an appropriate hypothesis.
b) Are the assumptions for regression satisfied?
c) Test your hypothesis and state your conclusion.
d) Create a 95% confidence interval for the slope of the
regression line.
e) Explain in this context what your confidence interval
means.
8. SAT scores. How strong is the association between student
scores on the Math and Verbal sections of the SAT? Scores
on this exam range from 200 to 800, and are widely used
by college admissions offices. Here are summaries and
plots of the scores for a recent graduating class at Ithaca
High School.
Variable   Count   Mean      Median   StdDev    Range   IQR
Verbal     162     596.296   610      99.5199   490     140
Math       162     612.099   630      98.1343   440     150
Dependent variable is: Math
R-squared = 46.9%
s = 71.75 with 162 - 2 = 160 df
Variable   Coefficient   SE(Coeff)   t-ratio   Prob
Constant   209.554       34.35       6.10      0.0001
Verbal     0.675075      0.0568      ?         ?
a) Is there evidence of an association between math and
verbal scores? Write an appropriate hypothesis.
b) Discuss the assumptions for inference.
c) Test your hypothesis and state an appropriate
conclusion.
d) Find a 90% confidence interval for the slope of the
true line describing the association between math and
verbal scores.
e) Explain in this context what your confidence interval
means.
9. Cereal. A healthy cereal should be low in both calories and
sodium. Data for 77 cereals were examined and judged
acceptable for inference. The 77 cereals had between 50
and 160 calories per serving and between 0 and 320 mg of
sodium per serving. The regression analysis is shown.
Dependent variable is: Sodium
R-squared = 9.0%
s = 80.49 with 77 - 2 = 75 degrees of freedom
Variable   Coefficient   SE(Coeff)   t-ratio   Prob
Constant   21.4143       51.47       0.416     0.6706
Calories   1.29357       0.4738      ?         ?
a) Is there an association between the number of
calories and the sodium content of cereals? Explain.
b) Do you think this association is strong enough to be
useful? Explain.
In questions 10-13, use the following printout of the linear
regression relating the SAT Math scores of 200 randomly
chosen college freshmen and their first semester GPAs.
The regression equation is GPA = 1.53 + 0.00170 Math
Predictor   Coef        StDev       T      P
Constant    1.5264      0.3981      3.83   0.00
Math        0.0016990   0.0006098   2.79   0.00
S = 0.5707   R-Sq = 3.8%   R-Sq(adj) = 3.3%
10. The value of sb for this regression is
a) 0.0006098
b) 0.0016990
c) 0.006
d) 0.3981
e) 1.5264
11. The test statistic for a test of significance for a non-zero
slope is:
a) 0.0006098
b) 0.3891
c) 2.79
d) 3.83
e) None of these.
12. Which of the following is a valid conclusion that could be
drawn from this regression analysis?
a) There is sufficient evidence to reject the hypothesis
that β1 = 0.
b) There is not sufficient evidence to reject the
hypothesis that β1 = 0.
c) This test is not significant at the 1% level.
d) Significance cannot be determined from this printout.
e) None of these is a valid conclusion.
13. Which of the following is the 95% confidence interval for
the population slope?
a) (0.0005, 0.0029)
b) (0.0129, 0.0211)
c) (-0.0170, 0.0340)
d) (0.0008, 0.0026)
e) None of these.
14. If the 90% confidence interval for the slope of regression
line does not contain 0, then which of the following is a
valid conclusion?
a) The confidence interval is not valid.
b) A significance test will not be significant at the 10%
level.
c) There is sufficient evidence to conclude that the slope
of the true regression line is 0.
d) There is sufficient evidence to conclude that the slope
of the true regression line is not 0.
e) None of these is valid
15. A new process designed to increase the temperature inside
steel girders shows great promise. In a test of 90 randomly
selected girders, the following regression was performed; a
partial computer printout is displayed:
Predictor   Coef      StDev     T      P
Constant    0.2074    0.2318    0.89   0
Temp 1      1.05651   0.02221   ?      ?
s = 0.6009   R-Sq = 96.3%   R-Sq(adj) = 96.2%
Temp 1 is the initial temperature and Temp 2 is the
temperature after the process has terminated.
a. State the regression equation.
b. Interpret the slope of the regression in the context of
the problem.
c. Interpret the value of R-Sq in words.
d. Find the values of T and P indicated by question marks in
the printout.
16. A midterm exam in Applied Mathematics consists of
problems in 8 topical areas. One of the teachers believes
that the most important of these, and the best indicator of
overall performance, is the section on problem solving. She
analyzes the scores of 36 randomly chosen students using
computer software and produces the following printout
relating the total score to the problem-solving subscore,
ProbSolv:
Predictor   Coef     StDev    T      P
Constant    12.960   6.228    2.08   0.045
ProbSolv    4.0162   0.5393   7.45   0.000
S = 0.6009   R-Sq = 62.0%   R-Sq(adj) = 60.9%
a. What is the regression equation?
b. Interpret the slope of the regression in the context of
the problem.
c. Interpret the value of R-Sq in words.
d. Interpret the value of S in words.
e. Calculate the 95% confidence interval of the slope of
the regression line for all Applied Mathematics
students.
f. Use the information provided to test whether there is
a significant relationship between the problem
solving subsection and the total score at the 5% level.
g. Are the decisions reached through the construction of
the confidence interval and through the use of a
significance test consistent? Explain the reasons for
your answer.
17. The following MINITAB output is from a data set relating
x = the length of a steelhead trout in inches and y = the
weight of a steelhead trout in pounds.
The regression equation is weight = -19.3 + 0.976 length
Predictor   Coef      StDev    t-ratio   P
Constant    -19.342   1.030    -18.78    0.000
Length      0.97567   0.0355   27.44     0.000
s = 0.215   R-Sq = 97.7%   R-Sq(adj) = 97.5%
Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    77.868   77.868   753.12   0.000
Error        18   1.861    0.103
Total        19   79.729
a. What is the equation of the regression line?
b. Give point estimates for the slope and y-intercept.
c. What is the magnitude of the typical deviation from
the estimated regression line?
d. What is the sample correlation between x and y?
e. Determine and interpret the value of r2.
18. A veterinary graduate student is studying the relationship
between the weight of a one year old golden retriever y (in
pounds) and the amount of dog food x that the dog is fed
each day (in pounds). A random sample of 10 golden
retrievers yielded the following data and summary
statistics.
Dog’s Weight (y)   Amount Fed (x)
42                 0.5
67                 1
47                 0.6
55                 1
62                 1.3
71                 1.5
36                 0.5
42                 0.7
50                 0.8
42                 0.6
a. Estimate the slope and the y-intercept of the
population regression line
b. Estimate the mean weight of a one year old golden
retriever that has been fed 1 pound of dog food per
day.
c. Compute the standard deviation of the statistic b.
d. Compute a 95% confidence interval for the slope of
the population regression line
e. Determine and interpret the value of r2.
19. A study concerning the relationship between the average
ozone level y (in parts per million) and the population of a
city x (in millions) resulted in the following MINITAB
output.
The regression equation is ozone = 8.89 + 16.6 popn
Predictor   Coef     StDev   t-ratio   P
Constant    8.892    2.395   3.71      0.002
popn        16.650   1.910   8.72      0.000
s = 5.454   R-Sq = 84.4%   R-Sq(adj) = 83.3%
Analysis of Variance
SOURCE       DF   SS       MS       F       p
Regression   1    2260.5   2260.5   76.00   0.000
Error        14   416.4    29.7
Total        15   2676.9
a. Give point estimates for the slope and y-intercept of
the population regression line.
b. Estimate the mean ozone level for a city having a
population of 1 million people.
c. Is the simple linear regression model useful for
predicting the ozone level from the population of a
city? Explain.
d. Estimate the slope β with a 95% confidence interval.
e. Determine the proportion of the observed variation in
ozone levels that can be attributed to the simple linear
regression model
20. A scientist is studying the relationship between x = age of
a forest (in years) and y = the average diameter of the trees
(in cm). One study reported the following data.
x   15   30   25   120   60   40   150   100   175
y   17   21   20   51    32   25   62    47    68
a. Construct a scatterplot for these data, including the
least squares regression line.
b. What is the equation of the least squares regression
line?
c. Is the simple linear regression model useful for
predicting the diameter of the trees from a given age
of the forest? Explain.
d. Predict the tree diameter of an 83 year old forest.
e. Estimate the slope β using a 90% confidence interval.
21. Let x be the number of vending machines and let y be the
time (in hours) it takes to stock them. The data are as
follows:
x   4   8   10    13   16   11    5     9     18   14   6     12   20
y   1   2   2.5   6    8    3.5   1.3   2.5   9    6    1.4   5    10.5
a. Construct a scatterplot for these data, including the
least squares regression line.
b. What is the equation of the estimated regression line?
c. What is the predicted time required to stock 7 vending
machines?
d. What percentage of observed variation (in y values)
can be explained by the simple linear regression
model?
e. What is the magnitude of the typical deviation from
the estimated regression line?
f. Estimate the slope β using a 95% confidence interval.
22. A scientist is studying the relationship between x = inches
of annual rainfall and y = inches of shoreline erosion. One
study reported the following data. Use the following
MINITAB output to answer the following questions.
The regression equation is y = -1.74 + 0.0731x
Predictor   Coef       Stdev      t-ratio   p
Constant    -1.7359    0.1882     -9.22     0.000
X           0.073099   0.002867   25.50     0.000
s = 0.2416   R-sq = 98.8%   R-sq(adj) = 98.6%
Analysis of Variance
SOURCE       DF   SS       MS       F        p
Regression   1    37.938   37.938   650.14   0.000
Error        8    0.467    0.058
Total        9    38.405
a. What is the equation of the estimated regression line?
b. Interpret the slope for these data.
c. Is the simple linear regression model useful for
predicting erosion from a given amount of rainfall?
Test the hypothesis H0: β = 0 versus Ha: β ≠ 0.
d. Predict the amount of erosion when the annual
rainfall is 40 inches.
e. Estimate the slope β using a 99% confidence interval.
f. What is the coefficient of determination? Interpret its
meaning.
Answers
1. a) % Yes = -5.583 + 0.999(Year). The percentage who
would vote for a woman has been increasing about 1% per
year.
b) H0: There has been no change in percentage of voters
who would vote for a woman candidate for president,
β1 = 0. HA: There has been a change, β1 ≠ 0.
c) t =15.1, P-value <0.0001. With such a low P-value, we
reject H0. There has been a significant change. Voters are
increasingly willing to vote for a woman for president.
d) 94.2% of the variation in percentage willing to vote for a
woman is accounted for by the regression on year.
2. a) Percent using other drugs = -3.068 + 0.615(percent using
marijuana). The percentage of ninth graders in these
countries who have used other drugs is estimated to
increase 0.615% for each 1% increase in the percentage of
ninth graders who have used marijuana.
b) H0: There is no (linear) relationship between marijuana
use and use of other drugs, β1 = 0. HA: There is a
relationship, β1 ≠ 0.
c) t = 7.85, P-value = 0.0001. With such a low P-value, we
reject H0. Percentage of teens using Other drugs is
positively related to percentage using marijuana.
d) Percentage using marijuana accounts for 87.3% of the
variation in other drug usage for ninth graders in these
countries.
e) The use of other drugs is associated with marijuana use,
but there is no proof of causality. There may be lurking
variables.
3. a) H0: There has been no change in the percentage
expressing no opinion about voting for a woman president,
β1 = 0. HA: There has been a change, β1 ≠ 0.
b) t = -1.21, P-value = 0.2458. Because the P-value is so
large, we do not reject H0. These data do not indicate the
percentage of "no opinion" responses has changed.
c) The plot indicates no real trend. The high value in 1945
and low value in 1990 are influential. True slope does not
appear to be negative after discounting influential points at
the ends. Answer to the last question remains the same
because we did not reject H0 in b.
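The t-ratio and P-value missing from the Problem 3 printout can be checked by hand: the t-ratio is the slope divided by its standard error, and the two-sided P-value is the two-tail area of a t distribution with 14 degrees of freedom. The sketch below is one way to do that check with only Python's standard library. The helper names `t_pdf` and `two_sided_p` are illustrative, and the tail area is approximated with Simpson's rule instead of being read from a table.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided P-value: 1 - 2 * (area under the density from 0 to |t|)."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    area = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 1 - 2 * area

# Values taken from the Problem 3 printout
slope, se_slope, df = -0.042708, 0.0353, 14
t_ratio = slope / se_slope
p_value = two_sided_p(t_ratio, df)
print(f"t = {t_ratio:.2f}, P-value = {p_value:.4f}")
```

The same two steps apply to the other printouts with "?" entries (Problems 4, 5, 8, and 9); for a one-sided alternative such as Problem 4, halve the two-sided area when the slope has the hypothesized sign.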
4. a) H0: There is no linear relationship between age and
cholesterol level, β1 = 0. HA: Cholesterol levels increase
with age, β1 > 0.
b) t = 1.23, P-value = 0.1103. Because the P-value is so
large, we do not reject H0. These data show no significant
(linear) relationship between age and cholesterol level.
The P-value of 0.1103 indicates over an 11% chance of a
slope as large as 0.746 or higher just from natural sampling
variation. The plot should be examined to see if a nonlinear
relationship is present.
5. a) H0: Difference in age at first marriage has not been
decreasing, β1 = 0. HA: Difference in age at first marriage
has been decreasing, β1 < 0.
b) Residual plot shows no obvious pattern; histogram is not
particularly Normal, but shows no obvious skewness or
outliers.
c) t = -5.34, P-value = 0.0001. With such a low P-value,
we reject H0. These data show evidence that difference in
age at first marriage is decreasing.
d) Based on these data, we are 95% confident that the
average difference in age at first marriage is changing at a
rate between -0.041 and -0.018 years per year.
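The interval in part d follows the usual pattern b1 ± t* × SE(b1). A minimal sketch of the arithmetic, assuming the critical value t* = 2.074 is read from a t-table for 22 degrees of freedom at 95% confidence (the variable names are illustrative):

```python
# Values taken from the Problem 5 printout
slope = -0.0293957      # change in the age gap (years) per calendar year
se_slope = 0.0055
df = 24 - 2
t_star = 2.074          # assumed: t* for 95% confidence with 22 df, from a t-table

margin = t_star * se_slope
lower, upper = slope - margin, slope + margin
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

The same pattern, with a different t* and standard error, gives the intervals in the other confidence-interval answers in this key.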
6. a)
b) Yes, the plot seems linear.
c) Predicted advertised price = 12319.59 – 924.00(age)
d) The residual plot shows some possible curvature. The
histogram of the residuals is not particularly Normal.
Inference may not be valid here, but we will proceed (with
caution).
e) (-1099.4, -748.6)
f) Based on these data, we are 95% confident that a used
car's value decreases between $748.60 and $1099.40 per
year.
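The Problem 6 answers can be reproduced directly from the listed ads. The sketch below fits the least squares line and builds the 95% interval for the slope using plain Python. It assumes each advertised price is a whole-dollar amount (for example 12,995) and that t* = 2.131 is read from a t-table for 15 degrees of freedom; the variable names are illustrative.

```python
import math

# (age in years, advertised price in dollars), read from the Problem 6 listing
cars = [(1, 12995), (1, 10950), (2, 10495), (3, 10995), (3, 10995),
        (4, 6995), (4, 7990), (5, 8700), (5, 6995), (6, 5990), (6, 4995),
        (9, 3200), (9, 2250), (9, 3995), (11, 2900), (11, 2995), (13, 1750)]

n = len(cars)
mean_x = sum(x for x, _ in cars) / n
mean_y = sum(y for _, y in cars) / n
sxx = sum((x - mean_x) ** 2 for x, _ in cars)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in cars)

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard deviation s and the standard error of the slope
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in cars)
s = math.sqrt(sse / (n - 2))
se_slope = s / math.sqrt(sxx)

t_star = 2.131  # assumed: t* for 95% confidence with n - 2 = 15 df, from a t-table
lower, upper = slope - t_star * se_slope, slope + t_star * se_slope
print(f"price = {intercept:.2f} + ({slope:.2f})(age)")
print(f"95% CI for slope: ({lower:.1f}, {upper:.1f})")
```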
7. a) H0: Fuel economy and weight are not (linearly) related,
β1 = 0. HA: Fuel economy changes with weight, β1 ≠ 0.
P-value < 0.0001, indicating strong evidence of an
association.
b) Yes, the conditions seem satisfied. Histogram of
residuals is unimodal and symmetric; residual plot looks
OK, but some "thickening" of the plot with increasing
values.
c) t = -12.2, P-value < 0.0001. These data show evidence
that gas mileage decreases with the weight of the car.
d) (-9.57,-6.86) mpg per 1000 pounds.
e) Based on these data we are 95% confident that fuel
mileage decreases between 6.86 and 9.57 miles per gallon
on average for each additional 1000 pounds of weight.
8. a) The scatterplot suggests a positive linear relationship. H0:
There is no (linear) relationship between SAT Verbal and
Math scores, β1 = 0. HA: There is a relationship, β1 ≠ 0.
b) Assumptions seem reasonable, since conditions are
satisfied. Residual plot shows no patterns (one outlier);
histogram is unimodal and roughly symmetric.
c) t = 11.9; P-value < 0.0001. These data show evidence of
a positive relationship between SAT Verbal and Math
scores.
d) (0.581, 0.769)
e) Based on the sample, we are 90% confident that average
Math SAT scores increase between 0.581 and 0.769 points
for each additional point scored on the Verbal test.
9. a) Yes. t = 2.73, P-value = 0.0079. With a P-value so low,
we reject H0. There is a positive relationship between
calories and sodium content.
b) No, R2 = 9% and s appears to be large, although without
seeing the data it is a bit hard to tell.
10. A
11. C
12. A. With a p-value of 0.006, we reject H0: β1 = 0 and
conclude that there is a relationship between the two
variables involved.
13. A. With 200 observations (in problem statement), df =
200 - 2 = 198. Using df = 100 in the table, we get critical
t = 1.984: 0.001699 ± 1.984(0.0006098) =
(0.0005, 0.0029)
14. D
15. a. Predicted Temp 2 = 0.2074 + 1.05651(Temp 1)
b. For each one-degree increase in Temp 1, the predicted
Temp 2 increases by about 1.057 degrees.
c. R2 indicates that 96.3% of the variation in Temp 2 values
can be explained by the variation in Temp 1 values.
d. t = (1.05651 - 0)/0.02221 = 47.57
We want to compare the t-value found above to a t-critical
value with 90 - 2 = 88 df. Using the table row for df = 100
as an approximation, we find that 47.57 is off the chart, so
the two-sided p-value is less than 2(0.0005) = 0.001 (and
probably much less than that!)
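16. a. Predicted total score = 12.96 + 4.0162(ProbSolv)
b. For each additional point on the problem-solving section,
the predicted total score increases by about 4.02 points.
c. R2 indicates that 62.0% of the variation in total scores
can be explained by the variation in ProbSolv scores.
d. s is an estimate of the standard deviation of total scores
about the regression line at each ProbSolv score.
e. 95% CI for β: n = 36, so df = 34. Using df = 30 in our
table, critical t = 2.042.
4.0162 ± 2.042(0.5393) = (2.91, 5.12)
f. With df = 30 for a two-sided 5% significance test, our
critical value would be t = 2.042. Since t = 7.45 exceeds
2.042, we reject H0: β = 0. There is a significant
relationship between the problem-solving subscore and the
total score.
g. Yes. The 95% confidence interval does not contain 0,
which agrees with rejecting H0 at the 5% level.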
17. a. ŷ = -19.3 + 0.976(length)
b. The point estimate of the slope is b = 0.976 and the point
estimate of the y-intercept is a = -19.3.
c. sb = 0.0355
d. r = 0.988
e. r2 = 0.977, which is the proportion of the observed
variation in the weight of a steelhead trout that can be
explained by the simple linear regression model.
18. a. b = 31.7 and a = 24.6
b. 56.3
c. sb = 4.079
d. t = 6.73 and p = 0.000, and thus, it appears that there is a
significant linear relationship.
e. r2 = 0.85, which is the proportion of the observed
variation in the weight of a golden retriever that can be
explained by the simple linear regression model.
19. a. a = 8.89 and b = 16.65
b. 25.54
c. t = 8.72 and p = 0.000, and thus, it appears that there is a
significant linear relationship.
d. (12.5, 20.7)
e. r2 = 0.844
20. a.
b. cm = 12.1 + 0.328 years
c. Yes, this is the equivalent of testing the hypothesis of the
slope = 0, where t = 49.77 and p-value = zero. So, we
reject the hypothesis that the slope is zero.
d. 39.277 cm
e. 0.328 ± (1.895)(0.00659) = (0.316, 0.340)
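The fitted line and the prediction for Problem 20 can be checked from the nine data pairs. This is a minimal sketch using the textbook formulas b = Sxy/Sxx and a = ȳ - b·x̄; the variable names are illustrative.

```python
ages = [15, 30, 25, 120, 60, 40, 150, 100, 175]       # x, years
diameters = [17, 21, 20, 51, 32, 25, 62, 47, 68]      # y, cm

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(diameters) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, diameters))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = intercept + slope * 83   # predicted diameter for an 83 year old forest
print(f"diameter = {intercept:.1f} + {slope:.3f}(age); at 83 years: {predicted:.3f} cm")
```

Note that the prediction uses the unrounded coefficients; plugging the rounded values 12.1 and 0.328 into the equation gives a slightly different figure.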
21. a.
b. hours = -2.54 + 0.628 machines
c. 1.856 hours
d. r2 = 0.957
e. 0.6882
f. 0.628 ± (2.201)(0.04025) = (0.539, 0.717)
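Parts d through f of Problem 21 all come from the same three sums of squares. The sketch below computes r2, the residual standard deviation s, and the standard error of the slope. It assumes the 13 (x, y) pairs line up in the order they are listed in the problem; the variable names are illustrative.

```python
import math

machines = [4, 8, 10, 13, 16, 11, 5, 9, 18, 14, 6, 12, 20]        # x
hours = [1, 2, 2.5, 6, 8, 3.5, 1.3, 2.5, 9, 6, 1.4, 5, 10.5]      # y

n = len(machines)
mean_x = sum(machines) / n
mean_y = sum(hours) / n
sxx = sum((x - mean_x) ** 2 for x in machines)
syy = sum((y - mean_y) ** 2 for y in hours)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(machines, hours))

slope = sxy / sxx
sse = syy - slope * sxy            # residual sum of squares
r_squared = 1 - sse / syy          # share of variation in y explained by the line
s = math.sqrt(sse / (n - 2))       # typical deviation from the line
se_slope = s / math.sqrt(sxx)      # standard error used in the interval for the slope
print(f"r^2 = {r_squared:.3f}, s = {s:.4f}, SE(slope) = {se_slope:.5f}")
```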
22. a. Y = -1.74 + 0.0731x
b. For each additional inch of rainfall, the shoreline erodes
0.0731 inch.
c. Yes, the value of t = 25.5 and p-value = 0
d. 1.188 inches
e. 0.0731 ± (3.3554)(0.002867) = (0.0635, 0.0827)
f. r2 = .988. About 98.8 % of observed variation in y
values can be explained by the simple regression line.
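For the problems that include an Analysis of Variance table (17, 19, and 22), r2 can also be recovered as the regression sum of squares divided by the total sum of squares, and r is its square root with the sign of the slope. A short sketch of that check, with the sums of squares copied from the printouts (the dictionary layout is illustrative):

```python
import math

# problem number -> (regression SS, total SS, sign of the fitted slope)
anova = {
    17: (77.868, 79.729, +1),
    19: (2260.5, 2676.9, +1),
    22: (37.938, 38.405, +1),
}

results = {}
for problem, (ss_reg, ss_total, sign) in anova.items():
    r_squared = ss_reg / ss_total
    r = sign * math.sqrt(r_squared)   # r carries the sign of the slope
    results[problem] = (r_squared, r)
    print(f"Problem {problem}: r^2 = {r_squared:.3f}, r = {r:.3f}")
```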