STAB22 section 2.4

2.82 Use the least-squares regression line given in Example 2.19. For
an individual with NEA increase 143, the predicted fat gain is
3.505 − (0.00344)(143) = 3.013 kg. The actual fat gain, 3.2, is
bigger than the predicted one (above the line), so the residual is
positive (despite what the question says): 3.2 − 3.013 = 0.187 kg.
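The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `predict` and `residual` are my own, and the intercept and slope are the ones quoted above from Example 2.19.

```python
def predict(intercept, slope, x):
    """Predicted response from a fitted line: y-hat = intercept + slope * x."""
    return intercept + slope * x


def residual(observed, predicted):
    """Residual = observed minus predicted; positive means above the line."""
    return observed - predicted


# Line from Example 2.19: fat gain = 3.505 - 0.00344 * (NEA increase)
fat_hat = predict(3.505, -0.00344, 143)
res = residual(3.2, fat_hat)
print(round(fat_hat, 3), round(res, 3))  # prints 3.013 0.187
```

The sign of `res` tells you which side of the line the observation is on, which is the point of the remark about the residual being positive.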
2.84 Type the data values into a Minitab worksheet. You can do
(a) and (b) in one go with a fitted line plot. Price should be
your response because it is the outcome: a bigger coffee costs
more. In fact, you can do most of (d) and (e) from the fitted line
plot too: after you select Regression and Fitted Line Plot, click on
Graphs and click on the bottom box, “residuals vs. the variables”.
Double-click on Size to get a plot of residuals against size. Click
OK. Also, click on Storage and check the box next to Residuals.
The residuals will be stored in the worksheet so you can look at
them. My fitted line plot is in Figure 1 and my residual plot is in
Figure 2.
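If you would rather check the Minitab output another way, here is a rough Python equivalent using NumPy. The three (size, price) pairs are my assumption, not copied from the exercise: they are values that happen to reproduce the regression line and residuals reported for this exercise, so substitute the data from the textbook.

```python
import numpy as np

# ASSUMED data (not from the source): sizes in ounces, prices in dollars.
size = np.array([12.0, 16.0, 24.0])
price = np.array([3.15, 3.65, 4.15])

# Least-squares line; polyfit returns the highest power first (slope, intercept).
slope, intercept = np.polyfit(size, price, 1)

fits = intercept + slope * size   # the fitted values (Minitab's FITS1)
resid = price - fits              # the residuals (Minitab's RESI1)

print(f"price-hat = {intercept:.3f} + {slope:.5f} * size")
print("residuals:", np.round(resid, 2))
print("sum of residuals is zero to rounding:", abs(resid.sum()) < 1e-9)
```

The residuals from any least-squares line with an intercept add up to zero, so the last line is a useful check that nothing was mistyped.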
Figure 1: Fitted line plot for price vs. size
The relationship between size and price appears to be increasing
but not linear. In Exercise 2.3 I speculated that there is a fixed
cost of serving you, so that a bigger drink is more expensive only
partly because of the cost of the coffee. Per ounce, a larger drink
is actually less expensive. So describing this relationship with a
line will not tell the whole story.
Figure 2: Plot of residuals vs. size
The regression line is ŷ = 2.257 + 0.08036x, where x is size and
y is price. The slope is positive, so the trend is indeed upwards.
The fitted line plot suggests that the residual for the 16-oz coffee
is positive (above the line) and the other two are negative (below).
The residuals themselves are under RESI1 in the Minitab
worksheet: −0.07, 0.11, −0.04, which do indeed sum to 0. My residual
plot is in Figure 2; it’s hard to say much with only three
data points, but there appears to be a down-up-down curve in
the residual plot, suggesting that the relationship is actually a
curve. (With only three points, we could get this pattern just
by chance; if we had, say, 10 size-price values and the pattern
persisted, we could be more confident. Maybe one day Starbucks
will introduce a 20-ounce “mezzo-Venti” and we’ll be able to tell
then.) But with many things you buy, the more you buy, the
smaller the cost per unit.
2.86 You can draw a scatterplot and mark the line on it by hand,
or you can use Minitab’s own facility for drawing regression lines
on scatterplots, called the Fitted Line Plot. Select Stat, Regression,
Fitted Line Plot; select your response, fuel usage, and your
explanatory variable, speed, and then click OK. My plot is in
Figure 3. The relationship is curved, so trying to fit a line through
it does a very poor job!
Figure 3: Fitted line plot for fuel usage data
Add the given residuals up to get −0.01, which is zero to within
rounding error.
You can either type the given residuals into a column and plot
them against speed, or you can have Minitab calculate and plot
the residuals. There’s nothing new in the first way (just a bit of
typing). To do the second way involves a new step. First, select
Stat, Regression and Regression, and select the response variable
(fuel usage) and explanatory “predictor” variable (speed). Before
you click OK, click on Graphs. Look for Residuals vs. the Variables
and click on the text box underneath; select the explanatory
variable Speed. You’ll get a plot like Figure 4 of residuals against
speed with the zero line shown. The fact that the residuals show
this curved pattern is a warning sign that the actual relationship
is a curve, not a straight line (indeed, any time a residual plot
shows a pattern, you have something to worry about). This is
hardly a surprise here, but when the curve is harder to see, it will
show up on the residual plot more noticeably than it will show up
elsewhere.
Figure 4: Residual plot for 2.62
2.87 Again, we can do (a) and (b) at once by using a fitted line graph
(or you can do them separately by drawing the scatterplot first
and then figuring out where the regression line goes and drawing
it on your graph). My plot is shown in Figure 5. While weight
certainly increases with age, it doesn’t seem to do so in a
straight-line way: growth is rapid at the beginning, then it levels off, and
then it grows faster again. The straight line doesn’t capture this
pattern at all. Though we can fit a line here, it doesn’t make
much sense to do so.
The residuals are given in the data set on the disk (and in the
text), so you can just plot them against age and draw in the
zero line. Or you can use the Graphs option of the Regression
command, just as in 2.86. Select Stat, Regression, Regression;
select Weight as your response and Age as your predictor; click
Graphs, and select Age into the bottom box to plot residuals
against Age. You can verify that the fitted intercept and slope (in
the Session window) are correct. My residual plot is in Figure 6.
The clear curved pattern in the residual plot indicates that we
should have fitted a curve in the first place (the curve is clearer
in the residual plot than on the original scatterplot). Here we see
that fitting a “good enough” line is not good enough!
The residuals as printed in the text add up to 0.01, which is zero
to within rounding error. (Adding up the residuals calculated by
Minitab will give you an answer closer to 0 because Minitab keeps
more than two decimal places.)
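To see why a curved relationship leaves a pattern in the residuals (as in 2.86 and 2.87), here is a small sketch with made-up data. The x and y values are hypothetical (y is simply x squared); they are not the fuel-usage or weight data from the exercises.

```python
import numpy as np

# HYPOTHETICAL curved data: y = x^2, chosen only to illustrate the idea.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2

# Fit a straight line anyway, then look at what is left over.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
signs = ["+" if r > 0 else "-" for r in resid]

print("residuals:", np.round(resid, 3))
print("signs:", signs)
print("sum is zero to rounding:", abs(resid.sum()) < 1e-9)
```

The residuals still add up to zero, but they are positive at both ends and negative in the middle. That systematic pattern, rather than the sum, is what signals that a line is the wrong shape.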
2.88 Put in 28 for x to get ŷ = 0.965 − 0.00045(28) = 0.9524. The
residual uses the observed y-value, 0.99 − 0.9524 = 0.0376, close
enough.
Figure 5: Fitted line plot for weight vs. age
To plot the residuals, first get the data for Exercise 2.80 into
Minitab. You can then either type the residual values into an
empty column and plot them against days stored, or you can run
the regression in Minitab and ask for a residual plot. I did the
latter: Stat, Regression, Regression; select fenthion as response
and days as explanatory, then click on Graphs and ask for a plot
of residuals against days (select days into the bottom box). My
residual plot is in Figure 7. There is a suggestion of a curved
pattern with the residuals in the middle tending to be negative
and the ones at the ends tending to be more positive, but the
pattern is very weak — for four of the five storage day values,
two of the four residuals are positive and two negative, and for
the middle value, one residual is positive and three are negative:
not exactly an uneven split! So there is a very weak curve visible,
and I do not think there is enough evidence to say that a curve
should be fitted instead of a line. (The line, even though the slope
is very close to 0, fits well: the R-squared is 86.7%).
Figure 6: Residual plot for weight-age data
Figure 7: Residual plot for predicting fenthion from days stored
2.89 The work in 2.87 says that if you start from a known age, you
can predict the mean weight for children of that age very accurately.
But individual children will vary in weight quite a bit. For
example, the mean weight of 4-month-old children is 6.3 kg, but
individual children might be 6.0, 6.1, 6.5, 6.6 kg: the mean weight
is 6.3, but none of the individual weights are close to that. If the
regression predicts the mean weight accurately, it doesn’t have to
predict individual weights equally accurately, and so r could be a
lot smaller.
2.91 Mean SAT score depends on two things: year (i.e. time), and high
school grade. Things are further complicated by the fact that
high school grade depends on time as well (that’s what “grade
inflation” is). The statement “mean SAT score has increased
over time” doesn’t account for grade at all, and if grade makes a
difference, we could be misled. Think about a B student 4 years
ago; if that same student graduated now, he or she might be an A
student because of grade inflation, but his/her SAT score would
not change very much. So the 2004 A students are better (in
terms of SAT scores) than 2008 A students, because an A “is not
worth as much as it used to be”.
The statistical moral is that any variables that might affect the
response should be included in the analysis; leaving them out can
be confusing or even misleading.
Figure 8: Fitted line plot for gas chromatography data
2.94 You can do this question by getting a fitted line plot (to do (a)
and (b)) and a plot of residuals against amount (for (c)). My
plots are in Figures 8 and 9.
The fitted line plot appears to show that the data lie close to the
straight line, but because the “response” values vary so much, it’s
hard to see detail.
The residual plot, however, shows deviations from a straight line
much more clearly, because the residuals are on a much more
“human” scale (they are all between −15 and 15). The residuals
for amounts 0 and 20 are all positive and those for amounts 1 and
5 are all negative. That is, there is a curved pattern, up-down-up,
and so the relationship is a curve rather than a straight line (this
is very hard to tell from the original scatterplot). Also, as you go
across the picture from left to right, the residuals appear to get
more spread out. (If you go on to STAB27, you’ll learn that a
transformation of the response values is called for, and that using,
say, the logarithms of the response values might work better.)
Figure 9: Residual plot for gas chromatography data
Figure 10: Scatterplot of silicon-isotope data
2.96 My scatterplot is shown in Figure 10, with the outlier marked
with an X. (To draw this, find the outlier in the data set, which is
the last (12th) row of values, and create a new column with
only an X in that row, by typing an X into the cell in the 12th
row and 3rd column. Then set up a Scatterplot in the usual way,
but before you click OK, click on Labels, select Data Labels, Use
Labels from Column, and select the column with the X in it.)
Ignoring the outlier, the trend is a moderate-to-strong negative
association.
To calculate the correlation and regression line with and without
the outlier, first calculate both using all the points, then replace
the value 337 in row 12 by * (for “missing”) and re-do the
calculations. My results are shown in Figures 11 and 12. I also used
the Storage option on Regression to save both sets of Fits (fitted
values), for my plot in (c). The correlation changed from −0.339
to −0.787, a pretty substantial change considering that we only
removed one observation. This shows that the correlation can be
strongly influenced by points not on the trend. Similarly, the
intercept and slope of the regression line both change substantially
(in particular, the slope becomes more negative).
Correlations (Pearson)
Correlation of isotope and silicon = -0.339, P-Value = 0.281
Regression Analysis
The regression equation is
silicon = - 493 - 33.8 isotope
Figure 11: Correlation and regression with the outlier
Correlations (Pearson)
Correlation of isotope and silicon = -0.787, P-Value = 0.004
Regression Analysis
The regression equation is
silicon = - 1372 - 75.5 isotope
Figure 12: Correlation and regression without the outlier
Since I saved both sets of Fitted Values, now I can superimpose
the fitted lines on the plot. This can be done by drawing three
plots and superimposing them on one graph. Call for the three
plots by putting, under Y and X in the dialog box, first Silicon
and Isotope, second Fits1 (the fits from the 1st regression) and
Isotope, third Fits2 (the fits from the 2nd regression) and Isotope.
To get these all on one graph, click on Multiple Graphs, select
Multiple Variables, and click on Overlaid on the Same Graph.
There are probably more elegant ways of doing it, but this one
worked. I got some points (in black) that are the scatterplot,
some points (in red) that are the regression line with the outlier
included, and some points (in green) that are the regression line
with the outlier excluded. As you see in Figure 13, the second
regression line has a much more negative slope.
Figure 13: Scatterplot of silicon-isotope data with two regression lines
If you don’t want to go to all this trouble, you can figure out
where the regression lines go (e.g. by finding the value of Silicon
when Isotope is −21.5 and −19.5 for each line, and then joining
the pairs of points by lines drawn by hand).
Finally, think about why we constructed this graph in the first
place. Including or excluding the outlier makes a substantial
difference to where the regression line goes; the red and green lines
on the graph are quite different.
2.97 In 2.45, we found that the Insight lies on the trend, and including
it makes the correlation higher. The scatterplot, with fitted line
superimposed, is shown in Figure 14. (This is a Fitted Line plot.)
The regression line is y = 6.78 + 1.01x where x is city mileage and
y is highway mileage.
Figure 14: Fitted line plot for gas mileage data
You can make the residual plot in two stages by storing the
residuals from the regression and then plotting them, or you can do
it in one go by using the Graphs option in Regression. Since we
actually want to look at the residuals this time, the first way is
the way to go. Select Stat, Regression and Regression, and select
the response and explanatory variables. Click on Storage, then
click the box next to Residuals. Click OK twice. You’ll see a
new column of data in the worksheet (bottom half of the screen)
called RESI1, containing the residuals. You can scan down this
column to find the biggest positive value, 1.99 in row 25, and the
biggest negative value, −2.89 in row 12. These are the BMW
330CI and the Lamborghini Murcielago. My residual plot is in
Figure 15. The strange near-horizontal lines are because the gas
mileages were only measured to the nearest whole number.
Figure 15: Residual plot for gas mileage data
The Insight isn’t an outlier for two reasons: it is more or less
on the line, so its residual should be close to zero, and because
the Insight’s x and y values are so far from the other cars, it
is influential, pulling the line closer to itself than the line would
otherwise go. You can try fitting the line with and without the
Honda Insight to see what the effect is.
2.98 This question is small enough to do by hand. Or, if you want
to use Minitab (like me), looking carefully at (a) and (b) reveals
that a fitted line plot will again do the job. My fitted line plot
is in Figure 16, from which you see that the observed values are
very close to the line (and r² = 99.8% is very high). So a straight
line does a very good job of describing the data. The regression
equation (from the plot) is y = 1.77 + 0.08x, where x is speed and
y is stride rate. (8.03E−2 is 8.03 × 10^−2, or 0.0803.)
Figure 16: Fitted line plot for stride rate data
For (c), we need to look at the residuals, so we can obtain them
first. Select Stat, Regression and Regression; select Stride Rate
as the response and Speed as explanatory. Before you click OK,
click on Storage, select Fits and Residuals (by clicking on the
boxes next to the words), then click OK twice. The residuals
appear as an extra column in the worksheet, marked RESI1, as
do the fitted values, FITS1. They are, to the accuracy shown,
0.011, −0.001, −0.001, −0.011, −0.009, 0.003, 0.009, which add up
to zero to within rounding (actually 0.001). You can see that
the residuals are very small compared to the stride rate values.
Then simply make a plot of the residuals against speed, as shown
in Figure 17. The residuals form a curved pattern (down and
then up), which shows that fitting a curved relationship would do
(even) better at predicting stride rate from speed. (But the fit of
the line is so good that we may be happy with that.)
The scatterplot doesn’t show any observations to be influential,
because none of the values are a long way from the rest. (Taking
out one point wouldn’t change the line very much.)
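One way to act on the remark about taking out one point is to refit the line with each point left out in turn and see how far the slope moves. The sketch below does that with NumPy. The data are hypothetical (they are not the stride-rate values), and `leave_one_out_slopes` is a name I made up for the illustration.

```python
import numpy as np


def leave_one_out_slopes(x, y):
    """Slope of the least-squares line refitted with each point removed in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slopes = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i   # boolean mask that drops point i
        slope, _ = np.polyfit(x[keep], y[keep], 1)
        slopes.append(slope)
    return np.array(slopes)


# HYPOTHETICAL data lying close to a line, with no unusual x-values.
x = [1, 2, 3, 4, 5, 6, 7]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9]
full_slope = np.polyfit(x, y, 1)[0]
shift = np.abs(leave_one_out_slopes(x, y) - full_slope).max()

# The same data plus one HYPOTHETICAL point far out in x and off the trend.
x_out = x + [20]
y_out = y + [8.0]
full_slope_out = np.polyfit(x_out, y_out, 1)[0]
shift_out = np.abs(leave_one_out_slopes(x_out, y_out) - full_slope_out).max()

print(f"largest slope change, no unusual point: {shift:.3f}")
print(f"largest slope change, with far-out point: {shift_out:.3f}")
```

With the evenly spread data, no single deletion moves the slope much, which is what "not influential" means. Adding one point far from the rest in x makes the slope depend heavily on that point.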
Figure 17: Residual plot for stride rate data
2.103 Think about businesses first: while having more education is
associated with a higher salary, most people employed by business
have only 3–5 years of education (a bachelor’s degree) and a
relatively high salary. Imagine 3 people: they might have 3 years
of education and a salary of $69,000, 4 years and $70,000, 5 years
and $71,000. Within this group, more education means a higher
salary.
Now turn to academia: economists there tend to have a PhD, but
earn less money than industry-employed economists do. To make
up some numbers: 6 years of education and $48,000; 7 years and
$50,000; 8 years and $52,000. (Your numbers may differ; it’s the
pattern that’s important.)
My scatterplot is shown in Figure 18. Within each group, more
education means more income, but overall the higher salaries go
to those with less education. (The correlation in my picture is
−0.813.) Salary depends on both education and location of
employment, so to consider the influence of only one of those on
salary is missing the whole picture.
Figure 18: Income-education scatterplot
2.104 Note that the values of x are the same for the first three data
sets, so this set of x-values appears only once in the data set on
the disk.
The four correlations are all 0.816, and all four regressions are
ŷ = 3 + 0.5x. (b) can be answered by drawing fitted line plots in
the four cases. See Figures 19, 20, 21 and 22.
Data set 1 looks very reasonable to fit with a regression line. Data
set 2 is a clear curve. Data set 3 has an outlier in the y-direction
(the second-last point), and in data set 4, the whole strength of
the relationship is formed by the influential point (x-outlier) on
the right. So only for data set 1 does it make sense to use a
regression line, and doing so in the other three cases would be
misleading.
Figure 19: Data set 1
Figure 20: Data set 2
Figure 21: Data set 3
Figure 22: Data set 4
If you think about it, it is really quite a feat to dream up four
data sets with the same correlations, intercepts and slopes, and
yet quite differently appropriate regressions.
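The claim in 2.104 that all four data sets share the same correlation and regression line can be checked numerically. The sketch below assumes the four data sets are Anscombe's well-known quartet. That is my assumption (the exercise's data are on the disk and are not reproduced in these notes), and the values are typed in from the published quartet, so check them against your own copy before relying on them.

```python
import numpy as np

# ASSUMED data: Anscombe's quartet, taken to match the exercise's four data sets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by data sets 1 to 3
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}

summary = {}
for k, (x, y) in datasets.items():
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    slope, intercept = np.polyfit(x, y, 1)   # least-squares line
    summary[k] = (r, intercept, slope)
    print(f"set {k}: r = {r:.3f}, y-hat = {intercept:.2f} + {slope:.3f}x")
```

The summaries agree to the accuracy quoted, yet the fitted line plots look nothing alike, which is why the numbers alone cannot tell you whether a regression line is appropriate.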