STAB22 section 2.1

2.2 We’re changing dog breed (categorical) to breed size (quantitative). This would enable us to see how life span depends on breed
size (if it does), which we could assess by drawing a scatterplot.
(A scatterplot is no good if one of the variables is categorical,
but if we had several life spans for each breed, we could draw
side-by-side boxplots to see how the breeds compare.)
2.3 Both ounces and price are quantitative variables, and so we could
draw a scatterplot to see how they are related. (We might expect
that bigger sizes cost more, though a Venti (24 ounces) costs less
than twice a Tall (12 ounces), even though it’s twice the size. I
have problems with a company that calls its smallest serving a
Tall, but that may just be me.) If you leave the variable Size as
categorical, there is no nice way to make a graph. The individuals
(cases) here are cups of Mocha Frappuccino.
2.4 The price of a drink depends on the size (rather than vice versa, logically). So price should be the response and size the explanatory variable, and on your scatterplot price should be on the vertical (y) scale. I typed the numbers into Minitab and produced the plot shown in Figure 1, though you could just as easily do this by hand. As the size goes up, the price goes up as well, but not in a straight-line way: the relationship looks less steep as the size increases (reflecting the fact that a 24-ounce drink costs the least per ounce of coffee, because the coffee itself is only one component of the price, and there is also the fixed cost of hiring a barista to serve you, however big a drink you have).
Figure 1: Scatterplot of price vs. size for Mocha Frappuccino
2.6 The first test comes before the final exam chronologically, so the final exam score should be the response (and go on the vertical scale on your scatterplot). Again, this one could be done either by hand or by using Minitab (your choice). I used Minitab, with the results shown in Figure 2. Select Scatterplot and Simple, then select the response as Y and the explanatory as X.
Figure 2: Scatterplot of first test and final exam scores
There is essentially no relationship between the two scores: if you knew the first test score, that would not help you at all in predicting the final exam score. This might be because the first test came very early in the course, and the material it tested was very different from that on the final exam. Or students might react to their first test result: a student who scores poorly might study hard for the final, and a student who scores well might relax a bit too much before the final.
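For anyone working outside Minitab, the steps described for 2.6 (response as Y, explanatory as X) have a direct analogue in code. This is only a sketch under stated assumptions: it assumes the matplotlib library is installed, the scores are made up for illustration and are not the textbook data, and the helper names `split_xy` and `simple_scatterplot` are hypothetical.

```python
def split_xy(records, explanatory, response):
    """Return (x values, y values), with the explanatory variable as x."""
    xs = [row[explanatory] for row in records]
    ys = [row[response] for row in records]
    return xs, ys


def simple_scatterplot(records, explanatory, response):
    """Draw a simple scatterplot with the response on the vertical axis."""
    import matplotlib.pyplot as plt  # assumption: matplotlib is installed

    xs, ys = split_xy(records, explanatory, response)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel(explanatory)  # explanatory variable on the horizontal axis
    ax.set_ylabel(response)     # response variable on the vertical axis
    return fig


# Made-up scores for illustration only; NOT the textbook data.
scores = [
    {"first_test": 153, "final_exam": 145},
    {"first_test": 144, "final_exam": 140},
    {"first_test": 162, "final_exam": 145},
    {"first_test": 149, "final_exam": 170},
    {"first_test": 127, "final_exam": 145},
]

xs, ys = split_xy(scores, "first_test", "final_exam")
```

Calling `simple_scatterplot(scores, "first_test", "final_exam")` would then draw the picture, with the final exam score on the vertical scale.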
2.7 Again, the final exam score will be the response. My scatterplot is shown in Figure 3. This appears to be something of a positive association (more so than in Figure 2, anyway), so that knowing the score on the second test helps a bit in predicting the final exam score. (Note that the student who does best on the second test, with 175, does well on the final, and the two students who score under 150 on the second test don’t do very well on the final either.)
By the time the second test comes around, typically late in the semester, it will usually be pretty clear what material is going to be tested (pretty much the same stuff that will be on the final), so a student who does well on one will probably do well on the other (and will know how hard they need to study for the final).
Figure 3: Scatterplot of second test and final exam scores
2.9 Think of whether one variable might be the cause of the other, or whether the two variables are just things that “happen to go together”.
In (b) and (e), the two values in each case are obtained at the same time, and so they just “go together” (or not): just explore the relationship in each case.
In (a), older children will tend to be heavier, so that if you knew the age of a child, you would be able to predict their weight. Being able to say “if I knew x, I would be able to predict y” means that x is explanatory and y is the response: here age is explanatory and weight is the response.
In (c), if you knew how many bedrooms the apartment has, you could make a guess at its rental price. Thus bedrooms is explanatory, and rental price is the response.
In (d), likewise, if you knew how much sugar a cup of coffee has, you would be able to guess how sweet it would taste. (A more interesting setup would be to have a friend prepare three cups of coffee with differing amounts of sugar in, and then, by tasting, you would rank them in order of sweetness. If you’re a big coffee drinker, you would probably get pretty close to the right order.)
In each of (a), (c) and (d) here, you could make a case for the explanatory and response variables being the other way around, but the major interest would be in the relationships as described above. For instance, if you knew the weight of a child, you could guess their age, but you would normally want to do it the other way around.
2.10 Parents’ income is explanatory and college debt is the response, because parental income influences college debt (it comes first). These variables are both quantitative (you would measure them). If the parents have a high income, the student will not have to borrow so much money, so the debt will be low; if the parents have a low income, the student will have to borrow a lot of money to pay tuition, living expenses and so on. So we would expect a negative association.
This is assuming that parents will pay their children’s college expenses, if they can. This isn’t always the case. Some students work while they’re at school (or during the summers) and save what they earn, and such students can be expected to graduate with a lower debt than they would otherwise have had.
2.11 IQ is supposed to be a measure of general intelligence, and we
would expect more intelligent children to be more interested in and more skilled in reading. This would be especially true for children in the same grade (and thus of about the same age).
In Figure 2.6, children with higher IQ scores generally have higher reading scores, though there is a lot of scatter. There are four children (with IQs between 100 and 130, and reading scores less than 20) who don’t seem to follow the general trend. Their reading scores are about 40 points less than you would expect based on their IQ; these children could have some kind of developmental problems that hinder their reading even though they score well on general intelligence.
Ignoring the outliers, the trend is roughly linear (there is no obvious curve to the relationship, which is how you tell). But it isn’t very strong: there is a lot of scatter in the picture, which is another way of saying that if you know a child’s IQ, you wouldn’t be able to predict their reading test score very accurately. (There is more to reading than general intelligence, in other words.)
2.13 As on a normal probability plot, when you see a “stair-step” pattern like this, it means that one of the variables only takes a few different values. Here, it’s the child’s self-estimate of reading ability, which can only be 1, 2, 3, 4 or 5. There are 60 children, so there are several with the same self-estimate. Having said that, children with a high test score also tend to have a high self-estimate (all of the children with test scores above 80 rate themselves 3 or better). Likewise, the children with a test score below 40 rate themselves 3 or worse, with one exception. This exception is the one outlier: a test score of about 10, and a self-estimate of 4, which is a serious over-estimate (looking at the plot, you would expect this child to have a self-rating of 1 or maybe 2).
2.16 This is most easily done with Minitab. You can get the data from Table 1.10 off the disk (with the textbook); you can open the “.mtp” file in Minitab.
First get rid of the Honda Insight. This is the car with the highest gas mileages, and is very different from the other cars. Click on the number 10; this highlights the whole 10th row of the data. Right-click on one of the highlighted cells, and select Delete Cells. (Or hit the Delete key while you have the whole row highlighted.) The Honda Insight disappears.
Here’s how to get the plot you need. First notice that the worksheet in Minitab has one table and three columns (unlike Table 1.10 in the text): the first column contains an M or a T corresponding to the type of car. We’re going to use this column to help us make the plot. Select Graph and Scatterplot. Select the second option, With Groups (in version 14). In the dialogue that appears, put the cursor under Y variables and select Hwy (the response), then put the cursor under X variables and select City (explanatory). Then make the groups: under Categorical Variables for Grouping, click on the box and select Type (which appears in the list on the left). When you’ve done all that, click OK. I got the plot shown in Figure 4.
Figure 4: Scatter plot of gas mileages
The plot shows black circles for minicompact cars, and red squares for two-seater cars, as shown in the legend on the right. There is a clear positive association; cars with good city gas mileage
have clearly better highway gas mileage also. The plot is roughly linear (ask yourself “is there a clear curve in the trend”, which here there is not). Imagine separating out the reds and the blacks; the relationship appears to be about the same for the two types of cars. The major difference is that there are some two-seater cars with very poor gas mileage (bottom left of plot). You can look back at the data to see which cars these are: the ones with highway gas mileage less than about 16. These are the two Lamborghini models and the two Ferrari models. If you own a Lamborghini or a Ferrari, gas mileage is not what you’re worried about!
2.17 I did this in Minitab again (though you could do this one by hand if you really want to). Get the data from the disk into Minitab; treat brain activity as the response. Select Graph and Plot, and select the two variables into Y and X with brain activity as Y. My plot is in Figure 5.
Figure 5: Scatter plot of brain activity against social distress
The relationship shows an upward trend: a higher score on the distress scale leads to a higher brain activity measurement. The relationship is more or less linear and fairly strong. I don’t see any outliers. The data do suggest that distress from social exclusion is related to brain activity in the “pain” region.
2.18 Same procedure in Minitab: get the data from the disk into a worksheet, and select Graph and Plot, with the right variable (cycle length, here) as the response, Y, variable. My plot is shown in Figure 6.
Figure 6: Plot of cycle length against day length
The point on the far right (with day length close to 24) is an outlier, because it is not part of the general pattern. You could claim that there is a positive association, but it is very weak: if you try to predict cycle length from day length, your prediction won’t be very accurate.
2.19 My plot of team value against revenue is in Figure 7. There appear to be five outliers: the three teams with revenue less than 80 and value higher than the other teams with the same revenue, and the two teams top right with the highest revenues. (You could argue that the latter two teams just happen to have high revenues but are on the line that marks the general trend.) To find out which teams these are, look back at the data: the Grizzlies,
Cavaliers and Rockets have higher values than the revenues suggest, while the Lakers and Knicks have high revenues and values.
There is a more or less linear trend with a positive association.
The relationship is quite strong.
Compare the plot of team value against operating income, Figure 8. There is much less of a trend, so it’s harder to talk of
“outliers”, just points that don’t fit the overall scatter. The Lakers and Knicks again stand out as the teams with highest value.
The team over on the left with negative operating income is the
Trailblazers. If you had to predict value, revenue is the better
variable to use because the relationship is stronger.
Figure 7: Plot of team value against revenue
Figure 8: Plot of team value against operating income
2.20 “Perhaps the severity of MA can help predict the severity of HAV” is the clue that MA is explanatory and HAV the response. So put HAV on the vertical scale and MA on the horizontal of your scatterplot. My scatterplot is in Figure 9.
Figure 9: Scatterplot of HAV angle vs. MA angle
There is something of a positive trend here (you might call it a weak-to-moderate trend). The patients with higher MA angle do tend to have a higher HAV angle. There is one outlier: the patient with HAV angle 50.
There is a relationship, but it’s not very strong, so MA angle
could be used to predict HAV angle. It’s just that there is so
much scatter that the predictions wouldn’t be very good.
2.21 This is the same idea as 2.16, and can be done the same way in Minitab. The last sentence of the first paragraph in the text gives you a clue as to what should be on the y-axis: rate is the response, and mass the explanatory variable.
So get a scatterplot of Rate against Mass, “with groups”, and use Sex as the grouping categorical variable. Your plot should look something like Figure 10.
Figure 10: Metabolic rate vs. lean body mass
Looking at all the data, the relationship is positive (larger lean body mass goes with larger metabolic rate), and the trend looks linear. The relationship looks quite strong, except perhaps at the upper end. Separating out the men and women, some of the men (red squares) have large lean body mass and large metabolic rate, and the trend overall for the men is not as clear as it is for the women (black circles). (Most of the larger values are men, and all of the smaller values, on both variables, are women.)
2.22 It’s tricky to sort out the roles of the variables here. Normally, fuel consumption would “lead to” speed, but here we are asking the opposite question: how does fuel consumption change as speed changes? So fuel consumption is the response, and speed the explanatory variable. There’s nothing else new about making the plot, as shown in Figure 11.
Figure 11: Fuel used against speed
The relationship goes down and then up, so you can’t describe it with a straight line. It’s a curve. Because of the way fuel usage is measured here, a low value is good: a lot of gas is used at low speeds and at high speeds, with a “best” value coming in between, here at about 60 km/h. (The same kind of picture happens for other cars: there is a “best” speed for fuel consumption which is less than typical highway speeds.) Because the relationship doesn’t go consistently either down or up, it doesn’t make sense to describe it as either a positive or negative association. The relationship is actually quite strong: if you were to use a curve to describe this relationship, you’d be able to predict fuel usage quite accurately from speed. You just wouldn’t be able to describe the relationship by a straight line. (Later, we learn to calculate a number called the correlation, which describes how strongly linear a relationship is; here the correlation would be quite small,
because, even though there is a strong relationship, it doesn’t look like a straight line.)
2.23 This one could be done by hand (as long as you take care to get the vertical scale sensible). Or you can do it in Minitab. The tricky part is getting the data in the right form; as the data come off the disk, all the years and record times are in one column (each), with the sex of the athletes in a third column. You can copy and paste the women’s years and times into two new columns, in a more or less obvious way: select the values you want to copy by clicking and dragging, move the cursor to the top of a new column, and then paste.
Then make a scatterplot of time (y) against year (x). My plot is in Figure 12. The plot shows a big jump before 1970, then a steady rate of improvement until the mid-80s, and a slower rate of improvement since then. (Note that the women’s 10,000 metres only became an Olympic event in 1988, so that more attention may have been paid to training since that time. That would explain the lack of large improvement since the mid-80s.)
Figure 12: Women’s record times for 10,000 m race
2.25 To get the plot with men’s and women’s records separately labelled, use the same idea as 2.16 and 2.21: do a scatterplot “with groups”, and select Sex as the grouping variable.
Figure 13: Men’s and women’s 10,000 m record times
Men (red squares) have been running this event for longer than women (black circles), so their history is longer. But the women’s record appears to have been dropping more quickly than the men’s. In recent years, though, the women’s record hasn’t dropped very much, while the men’s has dropped more quickly. So the data support the first claim of (b), but not the second (the men’s record is still less than the women’s, with no apparent sign that the women are going to catch up).
2.28 The 2002 returns are mostly negative, reflecting the fact that the stocks composing the mutual funds mostly fell, and the 2003 returns are mostly positive, for the opposite reason. I also drew a scatterplot of the 2002 and 2003 returns, as shown in Figure 14. There is one outlier on the right (as mentioned in the text); apart from this, there seems to be a downward trend (negative association), saying that funds that did badly in 2002 (lost a lot of their value) did well in 2003, and funds that did well (less badly) in 2002 did poorly in 2003. I don’t know what this says about prospects of success when you invest in mutual funds, but it does
suggest a “bounce-back” effect: funds that do especially badly
one year will recover the next. (Mutual funds are designed to be
“good” collections of stocks, to allow small investors to diversify
and protect themselves against extreme behaviour of the market.)
Figure 14: Scatterplot of 2002 and 2003 returns
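As a numerical footnote to the remark in 2.22 that a strong but curved relationship can still have a small correlation, here is a minimal sketch using the usual formula for the correlation coefficient. The speeds and fuel values are made up for illustration (a deliberately symmetric U-shape with its lowest point at 60 km/h) and are not the textbook data; the function name `correlation` is just a label for this example.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data for illustration only; NOT the textbook values.
speeds = [20, 40, 60, 80, 100]                           # km/h
fuel_curved = [6 + 0.002 * (s - 60) ** 2 for s in speeds]  # U-shaped, lowest at 60
fuel_linear = [2 * s + 1 for s in speeds]                  # exactly a straight line

r_curved = correlation(speeds, fuel_curved)
r_linear = correlation(speeds, fuel_linear)

print(f"curved relationship:   r = {r_curved:.3f}")
print(f"straight-line relationship: r = {r_linear:.3f}")
```

In the curved case fuel use is determined exactly by speed, yet r comes out essentially zero because the downward and upward halves cancel; real data would be less symmetric, so the correlation would be small rather than exactly zero.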