Solutions

STAB22 section 2.4
2.73 The four correlations are all 0.816, and all four
regressions are ŷ = 3 + 0.5x. (b) can be answered
by drawing fitted line plots in the four cases. See
Figures 1, 2, 3 and 4.
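If you would rather check the numbers than take my word for it, here is a minimal Python sketch. It assumes the exercise uses the standard published values of Anscombe's four data sets (the table is not reproduced in these solutions, so the data below are an assumption), and `fit_line` is just a helper name I made up.

```python
from math import sqrt

# ASSUMPTION: these are the commonly published Anscombe values; the
# exercise's own table is not reproduced in this solution.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


# One (intercept, slope, r) triple per data set.
summary = {k: fit_line(xs, ys) for k, (xs, ys) in ANSCOMBE.items()}
for k, (a, b, r) in summary.items():
    print(f"Data set {k}: y-hat = {a:.2f} + {b:.3f}x, r = {r:.3f}")
```

The summaries agree to the precision quoted, which is exactly why you need the plots to tell the four data sets apart.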
Figure 1: Data set 1
Figure 2: Data set 2
Data set 1 looks very reasonable to fit with a
regression line. Data set 2 is a clear curve. Data
set 3 has an outlier in the y-direction (the second-last point), and in data set 4, the whole strength
of the relationship is formed by the influential
point (x-outlier) on the right. So only for data
set 1 does it make sense to use a regression line,
and doing so in the other three cases would be
misleading.
Figure 3: Data set 3
Figure 4: Data set 4
If you think about it, it is really quite a feat to
dream up four data sets with the same correlations,
intercepts and slopes, and yet quite differently
appropriate regressions.

2.92 Use the least-squares regression line given in
Example 2.19. For an individual with NEA increase 143,
the predicted fat gain is 3.505 − (0.00344)(143) =
3.013 kg. The actual fat gain, 3.2, is bigger than the
predicted one (above the line), so the residual is
positive: 3.2 − 3.013 = 0.187 kg.

2.100 Type the data values into a worksheet in your
software. Instructions below are for Minitab.
You can do (a) and (b) in one go with a fitted line
plot. Price should be your response because it is
the outcome: a bigger coffee costs more. In fact,
you can do most of (d) and (e) from the fitted line
plot too: after you select Regression and Fitted
Line Plot, click on Graphs and click on the bottom
box, “residuals vs. the variables”. Double-click on
Size to get a plot of residuals against size. Click
OK. Also, click on Storage and check the box next to
Residuals. The residuals will be stored in the
worksheet so you can look at them.

In StatCrunch, after you enter the data in an
empty worksheet, you do this: Select Stat, Regression
and Simple; select Ounces as your x and Price as your
y (or use the names you gave the variables; you can
cursor up to the Var line at the top, delete the name
that’s there, and type in the name you want to use
for the variable). Click Next twice. On the Options
screen, check Save Residuals. On the Graphics screen,
select Plot the Fitted Line and Residuals vs. x-values
(this will do (d) and (e)). Now you can click
Calculate. The output comes out one screen at a
time; use Next to go to the next one.

In either case, you’ll have to do (c) by hand, unless
you are exceedingly clever.

The regression line is (rounding off) y = 2.61 +
0.08x, where y is price and x is ounces. My fitted
line plot is in Figure 5 and my residual plot is in
Figure 6.

Figure 5: Fitted line plot for price vs. size
Figure 6: Plot of residuals vs. size

The relationship between size and price appears
to be increasing but not linear. In Exercise 2.3
I speculated that there is a fixed cost of serving
you, so that a bigger drink is more expensive
only partly because of the cost of the coffee. Per
ounce, a larger drink is actually less expensive.
So describing this relationship with a line will not
tell the whole story. As you see in Figure 5, the
line doesn’t go through the points, though it does
go reasonably close to them.
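Here is a hedged sketch of the 2.100 calculation in Python, for anyone without Minitab or StatCrunch to hand. The three prices are NOT quoted in this solution: I am assuming $3.50, $4.00 and $4.50 for the 12-, 16- and 24-ounce drinks because those values reproduce the quoted line (2.61 + 0.08x), the quoted residuals (−0.071, 0.107, −0.036) and an R-squared of about 96%. Substitute the prices from your own worksheet.

```python
# Sizes in ounces and prices in dollars for the three drinks.
# ASSUMPTION: the prices are illustrative values chosen to match the
# quoted regression output; they do not come from this solution.
ounces = [12, 16, 24]
price = [3.50, 4.00, 4.50]

n = len(ounces)
xbar = sum(ounces) / n
ybar = sum(price) / n
sxx = sum((x - xbar) ** 2 for x in ounces)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(ounces, price))

# Least-squares slope and intercept.
slope = sxy / sxx
intercept = ybar - slope * xbar

# Residual = observed price minus price predicted by the line.
fitted = [intercept + slope * x for x in ounces]
residuals = [y - f for y, f in zip(price, fitted)]

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
sse = sum(e ** 2 for e in residuals)
sst = sum((y - ybar) ** 2 for y in price)
r_squared = 1 - sse / sst

print(f"price = {intercept:.2f} + {slope:.2f} * ounces")
print("residuals:", [round(e, 3) for e in residuals])
print(f"R-squared = {r_squared:.3f}")
```

The residuals are what you would otherwise read off the Storage column: negative, positive, negative, and summing to zero.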
Drawing a vertical line from each point to the line
is left “as an exercise for the reader”. The idea
is that the 16-ounce drink is the farthest from
the line (above), the 12-ounce drink is the next
farthest away (below), and the 24-ounce drink is
just a little below the line. The actual residuals
(which you’ll find have appeared in your worksheet
while you were looking at your graphs) are
−0.071, 0.107, −0.036. These do sum to zero as
they should. This echoes what we just said about
the 16-ounce drink being the farthest from the
line; if the line were correct, the Grande would
be about 10 cents cheaper.

Finally, looking at Figure 6, you see (as far as you
can see anything from just three data points) a
down-up-down pattern, indicating that a curve
would be a better description of the data. Note
a couple of things, though: (i) the vertical scale
on the residual plot is exaggerated: none of the
drinks is off the line by more than about 10 cents,
and (ii) with only 3 points, the simplest kind
of curve (a parabola) would go exactly through
the data values, with no kind of guarantee that
the curve you get would apply to any
drink prices beyond these three. In any case, the
R-squared for the straight-line regression is up
around 96%, so it’s hard to justify going beyond
a straight line with a data set as small as this.
Maybe one day Starbucks will introduce a 20-ounce
“mezzo-Venti” and we’ll be able to tell then
whether a straight line can be sensibly improved
upon. But, as with many things you buy, the
message is that the more you buy, the smaller
the cost per unit.

2.102 Again, we can do (a) and (b) at once by using
a fitted line graph (or you can do them separately
by drawing the scatterplot first and then
figuring out where the regression line goes and
drawing it on your graph). My plot is shown in
Figure 7. While weight certainly increases with
age, it doesn’t seem to do so in a straight-line
way: growth is rapid at the beginning, then it
levels off, and then it grows faster again. The
straight line doesn’t capture this pattern at all.
Though we can fit a line here, it doesn’t make
much sense to do so.

The residuals are given in the data set on the
disk (and in the text), so you can just plot them
Figure 8: Residual plot for weight-age data
against age and draw in the zero line. Or you
can use the Graphs option of the Regression command. Select Stat, Regression, Regression; select
Weight as your response and Age as your predictor; click Graphs, and select Age into the bottom
box to plot residuals against Age. You can verify
that the fitted intercept and slope (in the Session
window) are correct. My residual plot is in Figure 8. The clear curved pattern in the residual
plot indicates that we should have fitted a curve
in the first place (the curve is clearer in the residual plot than on the original scatterplot). Here
we see that fitting a “good enough” line is not
good enough!
Figure 7: Fitted line plot for weight vs. age
The residuals as printed in the text add up to
0.01, which is zero to within rounding error.
(Adding up the residuals calculated by Minitab
will give you an answer closer to 0 because
Minitab keeps more than two decimal places.)
2.103 The work in 2.102 says that if you start from a
known age, you can predict the mean weight for
children of that age very accurately. But individual
children will vary in weight quite a bit. For
example, the mean weight of 4-month-old children
is 6.3 kg, but individual children might be
6.0, 6.1, 6.5, 6.6 kg: the mean weight is 6.3,
but none of the individual weights are close to
that. If the regression predicts the mean weight
accurately, it doesn’t have to predict individual
weights equally accurately, and so r could be a
lot smaller.

2.105 Mean SAT score depends on two things: year
(ie. time), and high school grade. Things are
further complicated by the fact that high school
grade depends on time as well (that’s what “grade
inflation” is). The statement “mean SAT score
has increased over time” doesn’t account for
grade at all, and if grade makes a difference, we
could be misled. Think about a B student 4 years
ago; if that same student graduated now, he or
she might be an A student because of grade
inflation, but his/her SAT score would not change
very much. So the 2004 A students are better
(in terms of SAT scores) than 2008 A students,
because an A “is not worth as much as it used to
be”.

The statistical moral is that any variables that
might affect the response should be included in
the analysis; leaving them out can be confusing
or even misleading.

2.108 You can do this question by getting a fitted
line plot (to do (a) and (b)) and a plot of
residuals against amount (for (c)). My plots are in
Figures 9 and 10.

Figure 9: Fitted line plot for gas chromatography data
Figure 10: Residual plot for gas chromatography data

The fitted line plot appears to show that the data
lie close to the straight line, but because the
“response” values vary so much, it’s hard to see
detail.

The residual plot, however, shows deviations
from a straight line much more clearly, because
the residuals are on a much more “human” scale
(they are all between −15 and 15). The residuals
for amounts 0 and 20 are all positive and those
for amounts 1 and 5 are all negative. That is,
there is a curved pattern, up-down-up, and so the
relationship is a curve rather than a straight line
(this is very hard to tell from the original
scatterplot). Also, as you go across the picture from left
to right, the residuals appear to get more spread
out. (If you go on to STAB27, you’ll learn that
a transformation of the response values is called
for, and that using, say, the logarithms of the
response values might work better.)

2.110 My scatterplot is shown in Figure 11, with
the outlier marked with an X. (To draw this in
Minitab, find the outlier in the data set, which is
the last (12th) row of values, and create a new
column with only an X in that row, by typing an
X into the cell in the 12th row and 3rd column.
Then set up a Scatterplot in the usual way, but
before you click OK, click on Labels, select Data
Labels, Use Labels from Column, and select the
column with the X in it.)

Ignoring the outlier, the trend is a
moderate-to-strong negative association.

To calculate the correlation and regression line
with and without the outlier, first calculate both
using all the points, then replace the value 337 in
row 12 by * (for “missing”) and re-do the
calculations. My results are shown in Figures 12 and
13. I also used the Storage option on Regression
to save both sets of Fits (fitted values), for my
plot in (c). The correlation changed from −0.339
to −0.787, a pretty substantial change considering
that we only removed one observation. This
Figure 11: Scatterplot of silicon-isotope data

Correlations (Pearson)
Correlation of isotope and silicon = -0.339, P-Value = 0.281
Regression Analysis
The regression equation is
silicon = - 493 - 33.8 isotope
Figure 12: Correlation and regression with the outlier

Correlations (Pearson)
Correlation of isotope and silicon = -0.787, P-Value = 0.004
Regression Analysis
The regression equation is
silicon = - 1372 - 75.5 isotope
Figure 13: Correlation and regression without the outlier

Figure 14: Scatterplot of silicon-isotope data with two regression lines
shows that the correlation can be strongly influenced by points not on the trend. Similarly, the
intercept and slope of the regression line both
change substantially (in particular, the slope becomes more negative).
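To see how far apart the two lines really are, you can evaluate both printed equations at the same isotope values. This is only a sketch: the coefficients are the rounded ones from the Minitab output in Figures 12 and 13, the function names are my own, and −21.5 and −19.5 are the two isotope values this solution suggests for drawing the lines by hand.

```python
# Fitted equations as printed (rounded) in the Minitab output.
def silicon_with_outlier(isotope):
    """Line fitted to all 12 points (Figure 12)."""
    return -493 - 33.8 * isotope


def silicon_without_outlier(isotope):
    """Line fitted after setting the outlier to missing (Figure 13)."""
    return -1372 - 75.5 * isotope


# Predicted silicon at each end of the plotted range, for both lines.
endpoints = {
    iso: (silicon_with_outlier(iso), silicon_without_outlier(iso))
    for iso in (-21.5, -19.5)
}
for iso, (with_out, without_out) in endpoints.items():
    print(f"isotope {iso}: with outlier {with_out:.2f}, "
          f"without outlier {without_out:.2f}")
```

Joining each pair of predicted values gives the two lines, and the second one is much the steeper.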
Since I saved both sets of Fitted Values, now I can
superimpose the fitted lines on the plot. This can
be done by drawing three plots and superimposing
them on one graph. Call for the three plots by
putting, under Y and X in the dialog box, first
Silicon and Isotope, second Fits1 (the fits from
the 1st regression) and Isotope, third Fits2 (the
fits from the 2nd regression) and Isotope. To get
these all on one graph, click on Multiple Graphs,
select Multiple Variables, and click on Overlaid
on the Same Graph. There are probably more
elegant ways of doing it, but this one worked. I
got some points (in black) that are the scatterplot,
some points (in red) that are the regression
line with the outlier included, and some points
(in green) that are the regression line with the
outlier excluded. As you see in Figure 14, the
second regression line has a much more negative
slope.

If you don’t want to go to all this trouble, you
can figure out where the regression lines go (eg.
by finding the value of Silicon when Isotope is
−21.5 and −19.5 for each line, and then joining
the pairs of points by lines drawn by hand).

Finally, think about why we constructed this
graph in the first place. Including or excluding
the outlier makes a substantial difference to
where the regression line goes; the red and green
lines on the graph are quite different.

2.113 Think about businesses first: while having
more education is associated with a higher salary,
most people employed by business have only 3–5
years of education (a bachelor’s degree) and a
relatively high salary. Imagine 3 people: they might
have 3 years of education and a salary of $69,000,
4 years and $70,000, 5 years and $71,000. Within
this group, more education means a higher salary.
Now turn to academia: economists there tend to
have a PhD, but earn less money than
industry-employed economists do. To make up some
numbers: 6 years of education and $48,000; 7 years
and $50,000; 8 years and $52,000. (Your numbers
may differ; it’s the pattern that’s important.)

My scatterplot is shown in Figure 15. Within
each group, more education means more income,
but overall the higher salaries go to those with
less education. (The correlation in my picture
is −0.813.) Salary depends on both education
and location of employment, so to consider the
influence of only one of those on salary is missing
the whole picture.
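The made-up numbers in 2.113 are easy to check for yourself. The sketch below uses exactly the six (years of education, salary) pairs given in that solution; the `correlation` helper is my own name for the usual Pearson formula.

```python
from math import sqrt


def correlation(points):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    sxx = sum((x - xbar) ** 2 for x, _ in points)
    syy = sum((y - ybar) ** 2 for _, y in points)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points)
    return sxy / sqrt(sxx * syy)


# (years of education, salary in dollars), as made up in 2.113.
business = [(3, 69000), (4, 70000), (5, 71000)]
academia = [(6, 48000), (7, 50000), (8, 52000)]

r_business = correlation(business)
r_academia = correlation(academia)
r_all = correlation(business + academia)

print(f"business only: r = {r_business:.3f}")
print(f"academia only: r = {r_academia:.3f}")
print(f"both groups:   r = {r_all:.3f}")
```

Within each group the correlation is as positive as it can be, yet pooling the two groups gives the negative value quoted in 2.113. That sign flip is the whole point of the exercise.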
Figure 15: Income-education scatterplot

2.114 Look back at 2.73 to get an idea of what’s going
on here. The regression equations were all the
same (y = 3+0.5x), but a straight-line regression
was sometimes appropriate and sometimes not.
The story was clear enough for us to judge what
was happening from the scatterplots, so we can
guess what we might see from the residual plots.

The first residual plot is in Figure 16. It looks
like a random mess of nothing. This confirms
our impression from Figure 1 that a straight-line
regression is appropriate.

The second residual plot is in Figure 17. This is
the plainest down-up-down curve you could wish
to see, so, as we guessed in Figure 2, the actual
relationship is a curve not a straight line.

The third residual plot is in Figure 18. There is
a downward trend to the residuals except for the
second-last one. All the residuals are between
about −1 and 1 except for the second-last one,
which is about 3, which says to us “outlier in the
y direction”. This is the same thing we saw in
Figure 3.

The last residual plot is in Figure 19. This is the
oddest residual plot you’ll see in a while. There
are a bunch of different residuals corresponding
to x = 8, and then one residual at x = 19 which is
exactly zero. One x-value out on its own ought to
suggest “influential”, which is exactly what you
see in Figure 4.
The overall story here is that if you make a point
of looking at plots of the residuals whenever you
do a regression, you are much less likely to get
caught out by data that are not suitable for a
straight-line regression. These ones were pretty
clear just by looking at scatterplots, but you can
easily run into something like 2.108 where you
don’t see any problems until you look at a residual plot.
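The two oddest residual patterns in 2.114 (the lone large residual in the third data set, and the exactly-zero residual at x = 19 in the fourth) can be reproduced numerically. As before, this is a sketch that assumes the exercise uses the standard published Anscombe values; the row order here may differ from the textbook's, so "second-last" may not match, and `residuals` is my own helper name.

```python
# ASSUMPTION: commonly published Anscombe data sets 3 and 4; the
# exercise's own table (and its row order) is not reproduced here.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def residuals(xs, ys):
    """Residuals (observed minus fitted) from the least-squares line."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


res3 = residuals(x3, y3)
res4 = residuals(x4, y4)

# Data set 3: one residual stands far above the rest.
biggest3 = max(res3)
# Data set 4: the residual for the single point at x = 19.
at_19 = res4[x4.index(19)]

print("data set 3 residuals:", [round(e, 2) for e in res3])
print(f"largest residual in data set 3: {biggest3:.2f}")
print(f"residual at x = 19 in data set 4: {at_19:.2f}")
```

The zero residual at x = 19 is no accident: with every other point sharing x = 8, the least-squares line is forced to pass through the one point that sits out on its own.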
Figure 16: Residual plot for 1st Anscombe data set
Figure 17: Residual plot for 2nd Anscombe data set
Figure 18: Residual plot for 3rd Anscombe data set
Figure 19: Residual plot for 4th Anscombe data set