STAB22 section 2.4

2.73 The four correlations are all 0.816, and all four regressions are ŷ = 3 + 0.5x. (b) can be answered by drawing fitted line plots in the four cases. See Figures 1, 2, 3 and 4.

Figure 1: Data set 1
Figure 2: Data set 2
Figure 3: Data set 3
Figure 4: Data set 4

Data set 1 looks very reasonable to fit with a regression line. Data set 2 is a clear curve. Data set 3 has an outlier in the y-direction (the second-last point), and in data set 4, the whole strength of the relationship is formed by the influential point (x-outlier) on the right. So only for data set 1 does it make sense to use a regression line, and doing so in the other three cases would be misleading.

If you think about it, it is really quite a feat to dream up four data sets with the same correlations, intercepts and slopes, and yet with regressions that differ so much in how appropriate they are.

2.92 Use the least-squares regression line given in Example 2.19. For an individual with NEA increase 143, the predicted fat gain is 3.505 − (0.00344)(143) = 3.013 kg. The actual fat gain, 3.2, is bigger than the predicted one (above the line), so the residual is positive: 3.2 − 3.013 = 0.187 kg.

2.100 Type the data values into a worksheet in your software. Instructions below are for Minitab. You can do (a) and (b) in one go with a fitted line plot. Price should be your response because it is the outcome: a bigger coffee costs more. In fact, you can do most of (d) and (e) from the fitted line plot too: after you select Regression and Fitted Line Plot, click on Graphs and click on the bottom box, “residuals vs. the variables”. Double-click on Size to get a plot of residuals against size. Click OK. Also, click on Storage and check the box next to Residuals. The residuals will be stored in the worksheet so you can look at them.

In StatCrunch, after you enter the data in an empty worksheet, you do this: Select Stat, Regression and Simple; select Ounces as your x and Price as your y (or use the names you gave the variables; you can cursor up to the Var line at the top, delete the name that’s there, and type in the name you want to use for the variable). Click Next twice. On the Options screen, check Save Residuals. On the Graphics screen, select Plot the Fitted Line and Residuals vs. x-values (this will do (d) and (e)). Now you can click Calculate. The output comes out one screen at a time; use Next to go to the next one.

In either case, you’ll have to do (c) by hand, unless you are exceedingly clever. The regression line is (rounding off) y = 2.61 + 0.08x, where y is price and x is ounces. My fitted line plot is in Figure 5 and my residual plot is in Figure 6.

Figure 5: Fitted line plot for price vs. size
Figure 6: Plot of residuals vs. size

The relationship between size and price appears to be increasing but not linear. In Exercise 2.3 I speculated that there is a fixed cost of serving you, so that a bigger drink is more expensive only partly because of the cost of the coffee. Per ounce, a larger drink is actually less expensive. So describing this relationship with a line will not tell the whole story. As you see in Figure 5, the line doesn’t go through the points, though it does go reasonably close to them.

Drawing a vertical line from each point to the line is left “as an exercise for the reader”. The idea is that the 16-ounce drink is the farthest from the line (above), the 12-ounce drink is the next farthest away (below), and the 24-ounce drink is just a little below the line. The actual residuals (which you’ll find have appeared in your worksheet while you were looking at your graphs) are −0.071, 0.107, −0.036. These do sum to zero as they should. This echoes what we just said about the 16-ounce drink being the farthest from the line; if the line were correct, the Grande would be about 10 cents cheaper.

Finally, looking at Figure 6, you see (as far as you can see anything from just three data points) a down-up-down pattern, indicating that a curve would be a better description of the data. Note a couple of things, though: (i) the vertical scale on the residual plot is exaggerated: none of the drinks is off the line by more than about 10 cents, and (ii) with only 3 points, the simplest kind of curve (a parabola) would go exactly through the data values, with no kind of guarantee that the curve you get would apply to any drink prices beyond these three. In any case, the R-squared for the straight-line regression is up around 96%, so it’s hard to justify going beyond a straight line with a data set as small as this. Maybe one day Starbucks will introduce a 20-ounce “mezzo-Venti” and we’ll be able to tell then whether a straight line can be sensibly improved upon. But, as with many things you buy, the message is that the more you buy, the smaller the cost per unit.

2.102 Again, we can do (a) and (b) at once by using a fitted line graph (or you can do them separately by drawing the scatterplot first and then figuring out where the regression line goes and drawing it on your graph). My plot is shown in Figure 7. While weight certainly increases with age, it doesn’t seem to do so in a straight-line way: growth is rapid at the beginning, then it levels off, and then it grows faster again. The straight line doesn’t capture this pattern at all. Though we can fit a line here, it doesn’t make much sense to do so.

Figure 8: Residual plot for weight-age data

The residuals are given in the data set on the disk (and in the text), so you can just plot them against age and draw in the zero line. Or you can use the Graphs option of the Regression command. Select Stat, Regression, Regression; select Weight as your response and Age as your predictor; click Graphs, and select Age into the bottom box to plot residuals against Age.
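If you are working outside Minitab, the same check (fit a line, then look at the residuals against the predictor) can be sketched in plain Python. This is only a minimal sketch under stated assumptions: the helper name `fit_line` is made up for illustration, and the data are synthetic (y = x squared, chosen because it is obviously curved). They are not the weight-age values from the exercise.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Synthetic, deliberately curved data (NOT the data from the exercise).
x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]

a, b = fit_line(x, y)

# Residual = observed value minus value predicted by the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

for xi, ri in zip(x, residuals):
    print(f"x = {xi}: residual = {ri:+.2f}")
print("sum of residuals:", round(sum(residuals), 10))
```

For this made-up data the residuals are positive at both ends and negative in the middle, the kind of curved pattern that says a straight line is the wrong model, even though the residuals still sum to zero as least-squares residuals must.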
You can verify that the fitted intercept and slope (in the Session window) are correct. My residual plot is in Figure 8. The clear curved pattern in the residual plot indicates that we should have fitted a curve in the first place (the curve is clearer in the residual plot than on the original scatterplot). Here we see that fitting a “good enough” line is not good enough!

Figure 7: Fitted line plot for weight vs. age

The residuals as printed in the text add up to 0.01, which is zero to within rounding error. (Adding up the residuals calculated by Minitab will give you an answer closer to 0 because Minitab keeps more than two decimal places.)

2.103 The work in 2.102 says that if you start from a known age, you can predict the mean weight for children of that age very accurately. But individual children will vary in weight quite a bit. For example, the mean weight of 4-month-old children is 6.3 kg, but individual children might be 6.0, 6.1, 6.5, 6.6 kg: the mean weight is 6.3, but none of the individual weights are close to that. If the regression predicts the mean weight accurately, it doesn’t have to predict individual weights equally accurately, and so r could be a lot smaller.

2.105 Mean SAT score depends on two things: year (i.e. time), and high school grade. Things are further complicated by the fact that high school grade depends on time as well (that’s what “grade inflation” is). The statement “mean SAT score has increased over time” doesn’t account for grade at all, and if grade makes a difference, we could be misled. Think about a B student 4 years ago; if that same student graduated now, he or she might be an A student because of grade inflation, but his/her SAT score would not change very much. So the 2004 A students are better (in terms of SAT scores) than 2008 A students, because an A “is not worth as much as it used to be”. The statistical moral is that any variables that might affect the response should be included in the analysis; leaving them out can be confusing or even misleading.

2.108 You can do this question by getting a fitted line plot (to do (a) and (b)) and a plot of residuals against amount (for (c)). My plots are in Figures 9 and 10.

Figure 9: Fitted line plot for gas chromatography data
Figure 10: Residual plot for gas chromatography data

The fitted line plot appears to show that the data lie close to the straight line, but because the “response” values vary so much, it’s hard to see detail.

The residual plot, however, shows deviations from a straight line much more clearly, because the residuals are on a much more “human” scale (they are all between −15 and 15). The residuals for amounts 0 and 20 are all positive and those for amounts 1 and 5 are all negative. That is, there is a curved pattern, up-down-up, and so the relationship is a curve rather than a straight line (this is very hard to tell from the original scatterplot). Also, as you go across the picture from left to right, the residuals appear to get more spread out. (If you go on to STAB27, you’ll learn that a transformation of the response values is called for, and that using, say, the logarithms of the response values might work better.)

2.110 My scatterplot is shown in Figure 11, with the outlier marked with an X. (To draw this in Minitab, find the outlier in the data set, which is the last (12th) row of values, and create a new column with only an X in that row, by typing an X into the cell in the 12th row and 3rd column. Then set up a Scatterplot in the usual way, but before you click OK, click on Labels, select Data Labels, Use Labels from Column, and select the column with the X in it.)

Figure 11: Scatterplot of silicon-isotope data

Ignoring the outlier, the trend is a moderate-to-strong negative association.

To calculate the correlation and regression line with and without the outlier, first calculate both using all the points, then replace the value 337 in row 12 by * (for “missing”) and re-do the calculations. My results are shown in Figures 12 and 13. I also used the Storage option on Regression to save both sets of Fits (fitted values), for my plot in (c).

Figure 12: Correlation and regression with the outlier

    Correlations (Pearson)
    Correlation of isotope and silicon = -0.339, P-Value = 0.281

    Regression Analysis
    The regression equation is
    silicon = - 493 - 33.8 isotope

Figure 13: Correlation and regression without the outlier

    Correlations (Pearson)
    Correlation of isotope and silicon = -0.787, P-Value = 0.004

    Regression Analysis
    The regression equation is
    silicon = - 1372 - 75.5 isotope

The correlation changed from −0.339 to −0.787, a pretty substantial change considering that we only removed one observation. This shows that the correlation can be strongly influenced by points not on the trend. Similarly, the intercept and slope of the regression line both change substantially (in particular, the slope becomes more negative).

Since I saved both sets of Fitted Values, now I can superimpose the fitted lines on the plot. This can be done by drawing three plots and superimposing them on one graph. Call for the three plots by putting, under Y and X in the dialog box, first Silicon and Isotope, second Fits1 (the fits from the 1st regression) and Isotope, third Fits2 (the fits from the 2nd regression) and Isotope. To get these all on one graph, click on Multiple Graphs, select Multiple Variables, and click on Overlaid on the Same Graph. There are probably more elegant ways of doing it, but this one worked. I got some points (in black) that are the scatterplot, some points (in red) that are the regression line with the outlier included, and some points (in green) that are the regression line with the outlier excluded. As you see in Figure 14, the second regression line has a much more negative slope.

Figure 14: Scatterplot of silicon-isotope data with two regression lines

If you don’t want to go to all this trouble, you can figure out where the regression lines go (e.g. by finding the value of Silicon when Isotope is −21.5 and −19.5 for each line, and then joining the pairs of points by lines drawn by hand).

Finally, think about why we constructed this graph in the first place. Including or excluding the outlier makes a substantial difference to where the regression line goes; the red and green lines on the graph are quite different.

2.113 Think about businesses first: while having more education is associated with a higher salary, most people employed by business have only 3–5 years of education (a bachelor’s degree) and a relatively high salary. Imagine 3 people: they might have 3 years of education and a salary of $69,000, 4 years and $70,000, 5 years and $71,000. Within this group, more education means a higher salary.

Now turn to academia: economists there tend to have a PhD, but earn less money than industry-employed economists do. To make up some numbers: 6 years of education and $48,000; 7 years and $50,000; 8 years and $52,000. (Your numbers may differ; it’s the pattern that’s important.)

My scatterplot is shown in Figure 15. Within each group, more education means more income, but overall the higher salaries go to those with less education. (The correlation in my picture is −0.813.) Salary depends on both education and location of employment, so to consider the influence of only one of those on salary is missing the whole picture.

Figure 15: Income-education scatterplot

2.114 Look back at 2.73 to get an idea of what’s going on here. The regression equations were all the same (y = 3 + 0.5x), but a straight-line regression was sometimes appropriate and sometimes not. The story was clear enough for us to judge what was happening from the scatterplots, so we can guess what we might see from the residual plots.

The first residual plot is in Figure 16. It looks like a random mess of nothing. This confirms our impression from Figure 1 that a straight-line regression is appropriate.

The second residual plot is in Figure 17. This is the plainest down-up-down curve you could wish to see, so, as we guessed in Figure 2, the actual relationship is a curve, not a straight line.

The third residual plot is in Figure 18. There is a downward trend to the residuals except for the second-last one. All the residuals are between about −1 and 1 except for the second-last one, which is about 3, which says to us “outlier in the y direction”. This is the same thing we saw in Figure 3.

The last residual plot is in Figure 19. This is the oddest residual plot you’ll see in a while. There are a bunch of different residuals corresponding to x = 8, and then one residual at x = 19 which is exactly zero. One x-value out on its own ought to suggest “influential”, which is exactly what you see in Figure 4.

The overall story here is that if you make a point of looking at plots of the residuals whenever you do a regression, you are much less likely to get caught out by data that are not suitable for a straight-line regression. These ones were pretty clear just by looking at scatterplots, but you can easily run into something like 2.108 where you don’t see any problems until you look at a residual plot.

Figure 16: Residual plot for 1st Anscombe data set
Figure 17: Residual plot for 2nd Anscombe data set
Figure 18: Residual plot for 3rd Anscombe data set
Figure 19: Residual plot for 4th Anscombe data set
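The claims in 2.73 and 2.114 can also be checked numerically rather than by eye. The sketch below assumes the standard published values of Anscombe's quartet; they are typed in here from the well-known data set, not copied from this handout, and the rows may be in a different order from the textbook's table. The helper name `fit` is made up for illustration.

```python
from math import sqrt

def fit(x, y):
    """Return (intercept, slope, correlation, residuals) for a least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    r = sxy / sqrt(sxx * syy)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, r, residuals

# Assumed data: the standard values of Anscombe's four data sets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {k: fit(x, y) for k, (x, y) in anscombe.items()}

for k, (a, b, r, res) in results.items():
    # Report the residual that is farthest from zero in each data set.
    biggest = max(res, key=abs)
    print(f"Data set {k}: y-hat = {a:.2f} + {b:.2f}x, r = {r:.4f}, "
          f"largest residual = {biggest:+.2f}")
```

With these values the four fits agree with ŷ = 3 + 0.5x and the correlations agree with 0.816 to within about 0.001, so the summary numbers alone cannot tell the data sets apart. The residuals can: data set 3 has one residual a little over 3 (the point at x = 13), and in data set 4 the lone point at x = 19 has a residual of zero because the line is forced through it.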
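Going back to 2.113 for a moment, the made-up salary numbers given there are enough to reproduce the quoted correlation. This is a minimal sketch using only the six (education, salary) pairs from that answer; the function and variable names are invented for illustration.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# (years of education, salary in dollars): the made-up values from 2.113.
business = [(3, 69000), (4, 70000), (5, 71000)]
academia = [(6, 48000), (7, 50000), (8, 52000)]

def r_of(points):
    """Correlation for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return correlation(xs, ys)

r_business = r_of(business)
r_academia = r_of(academia)
r_overall = r_of(business + academia)

print(f"business only: r = {r_business:.3f}")
print(f"academia only: r = {r_academia:.3f}")
print(f"both groups:   r = {r_overall:.3f}")
```

Within each group the three points lie exactly on a rising line, so each within-group correlation is 1, while pooling the two groups gives about −0.813, the value quoted for the picture. The sign flips only because the grouping variable (where the person works) has been ignored.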
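For the hand-drawn alternative in 2.110, the two endpoints of each regression line come straight from the two equations in the Minitab output. The sketch below just does that arithmetic; the coefficients are the rounded ones printed in Figures 12 and 13, so the endpoints are approximate, and the variable names are made up.

```python
# (intercept, slope) as printed in the Minitab output; rounded values.
lines = {
    "with outlier": (-493.0, -33.8),
    "without outlier": (-1372.0, -75.5),
}

# The two isotope values suggested in the text for drawing each line.
isotope_values = (-21.5, -19.5)

endpoints = {}
for label, (intercept, slope) in lines.items():
    # Predicted silicon at each isotope value: intercept + slope * isotope.
    endpoints[label] = [(x, intercept + slope * x) for x in isotope_values]
    for x, silicon in endpoints[label]:
        print(f"{label}: isotope = {x}, predicted silicon = {silicon:.2f}")
```

Joining (−21.5, 233.7) to (−19.5, 166.1) gives the line with the outlier included, and joining (−21.5, 251.25) to (−19.5, 100.25) gives the line with it excluded. The second line drops about 151 units of silicon over the same span where the first drops about 68, which is the "much more negative slope" seen in Figure 14.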