Now that we have covered some useful preliminary material, it is time for us to develop the regression model, itself. In this chapter, we begin to do that, using bivariate data. That is, we are focusing on the relationship between a single independent variable (generically called X) and our dependent variable of interest (generically, Y). To state our objective in terms of the definition we provided back at the end of Chapter 1, we will develop a strategy for describing how the conditional distribution of our Y variable changes across the range of our X variable.

Note our emphasis on developing the regression model. We really intend to take you through a process that reveals the underlying logic of this kind of analysis. Once again, we do not think it would be very effective if we simply dropped some formulas in front of you. We will eventually get to the formulas— but, we want you to understand where they come from and why we use them. Also note that our focus on bivariate data is somewhat artificial. In real-world applications of theory-testing and forecasting, dependent variables are almost always affected by more than one independent variable. Nevertheless, the bivariate regression model provides us with the simplest case. We can lay out our basic approach and develop the components of the model in the most straightforward manner possible. We will see that all of the ideas we discuss here can be generalized very easily to situations involving several independent variables— that is, to multiple regression.

SCATTERPLOT SMOOTHING AND NONPARAMETRIC REGRESSION

How can we determine whether the conditional Y distributions vary across the values of X? One straightforward approach is to simply look at the data in a scatterplot. If the variables are related (according to the definitions we provided back in Chapter 1), then the Y values should differ in some systematic way, across the X values. Now, it would be difficult (if not impossible) to describe variability in the entirety of each conditional Y distribution. So, instead, we will focus on the most salient characteristic of each conditional distribution— its central tendency. This should make intuitive sense: Measures of central tendency provide the single numeric value that is most typical of the entire set of values within a variable's distribution. Therefore, it is reasonable to examine how this "most typical value" for Y varies across the range of values in X's distribution.

Based upon the preceding idea, our immediate objective will be to fit a curve to the scatterplot. This curve will trace out the path of the conditional means of Y from left to right across the plotting region of the scatterplot; that is, the conditional means will be calculated across the distinct values of X. This general process is called "scatterplot smoothing" because the goal is to generate a smooth curve which captures the basic functional dependence of Y on X. Hence, the smooth curve is actually a model of the full bivariate dataset depicted in the scatterplot. There are a variety of specific methods (called "scatterplot smoothers") for creating such curves, and we will cover several in this chapter.

Starting Simple: The Conditional Mean Smoother

Consider the following scenario: State government officials establish a number of local social service offices, to distribute public benefits to clientele. Some of these offices receive a large number of clients, while others are not as busy. Why is this the case?
We might speculate that the number of visitors to each office is affected by the length of time that the office has been in existence. In order to test the preceding conjecture (or, to use the more appropriate term, the preceding hypothesis), we obtain some data on 78 social service offices. For each office, we have values on two variables: The number of months the office has been in existence, and the mean number of clients per week, during the preceding month. Given the nature of our hypothesis, the first variable is our independent variable and we can designate it as X. That leaves the second variable as our dependent variable, Y. The full dataset is presented in the Appendix to this chapter.

Of course, with 156 values (78 observations times two variables), we probably cannot learn very much by looking at the data, themselves. We can probably see more by drawing a picture. Therefore, Figure 3.1 shows the scatterplot for the number of clients per week versus the number of months the office has been open. Note that jittering has been applied in the horizontal direction, only. We do this because, with only six distinct values, the X variable is relatively discrete. On the other hand, the Y variable takes on many distinct values across the 78 observations. So, the mean number of clients (i.e., Y) is relatively continuous, and does not need to be jittered. In the present context, we are more interested in the general structure of the bivariate data, rather than the specific values of the variables.

But, just from looking at the scatterplot, we can see that there are systematic shifts in the conditional Y distributions. For example, when X = 2, the Y values range from about 60 to about 125. But, when X = 5, the Y values span an interval from just under 150 to about 190. So, from a quick inspection of the display alone, we can see very easily that the Y values do tend to increase with X. But, can we be more specific? Can we make a more detailed statement about the way the conditional Y distributions shift across the range of X?

In order to do so, we will fit a very simple scatterplot smoother to the data in Figure 3.1. Again, our objective is to see how the conditional means of Y vary as X increases from its minimum value (i.e., one) to its maximum (i.e., six). In order to do that, we can simply group the 78 observations into six categories, based upon their X values, and calculate the mean Y value within each of the groups. Thus, we calculate the conditional means directly from the data. Table 3.1 shows the conditional means calculated for each of the six X values. So, for example, the first entry in the second column of the table is 64.50, and this is the mean of the Y values for the two observations for which X = 1. Similarly, 100.43 is the mean Y for the seven observations with X values of two, and so on. Note that there is only one observation with X = 6, so the conditional mean of Y for that X value is just the Y value for that single observation (or 170.00).

In Figure 3.2A, we repeat the scatterplot of mean clients per week versus number of months. Here, however, we superimpose six new plotted points, representing the conditional means of Y, plotted against their corresponding X values. These new points are shown as filled black diamonds to distinguish them from the points that represent the data. Note also that these new points are not jittered; they are plotted directly over the appropriate X values. Figure 3.2B uses line segments to connect adjacent conditional means.
And, then, Figure 3.3 shows the final smoothed scatterplot. Here, we remove the points for the conditional means, leaving only the data and the "curve" (admittedly, more a slightly jagged series of connected line segments) that traces out the locations of the conditional means.

It may not seem like we have done very much by adding the curve to the scatterplot. Believe it or not, however, we have just performed a regression analysis! The curve that we have "fitted" to the data in the scatterplot provides a succinct— albeit graphical— description of the variability in the conditional Y means across the X values. From the curve, we can see that the average Y tends to increase very sharply across X values from one to three. After that, the average Y continues to increase with larger X values, but the rate of increase decreases. In other words, after we move beyond three months (i.e., to the right within the plotting region of the scatterplot), each additional month of existence leads to a smaller increase in the average number of clients. So, our "regression analysis" really does enable us to make a more detailed statement about the way that an office's number of clients is related to the length of time it has been open. It has provided an insight about the data that we probably would not have seen by looking only at the data values, themselves.

Let us pause here to consider what we have accomplished. Stating the problem a bit differently from the way we did earlier, we can say that we have tried to account for the variance in our Y variable by hypothesizing that the distribution of the Yi's changes systematically across the values of X. The smooth curve in Figure 3.3 confirms our hypothesis, by showing that the average Y does increase as we move from smaller to larger X values. And, the curved appearance, itself, shows us how the rate of increase also varies systematically with the X values. Going back to some of the ideas presented in Chapter 2, we would conclude that there is a nonlinear relationship between the length of time an office has been open and the number of clients it serves, because the shape of the curve shows us that the "responsiveness" of Y to X varies across the range of X.

Of course, central tendency (as captured in the mean) is only one characteristic of a distribution. We have not attempted to describe any of the other possible characteristics of the conditional distributions (such as their spread or shape). Does that mean we are missing anything important? A general answer to that question depends upon the context. But, here, we would say no: While the number of observations at each X value varies a bit, the amount of vertical spread around each conditional mean seems to be fairly constant, except for the two endmost values, where there are very few observations. Similarly, the vertical dispersion around the conditional means appears to be fairly symmetric in each case (again, with the exceptions of the two endmost X categories). So, we might argue that the only "interesting" differences across the conditional distributions are their locations. And, this latter property is captured very adequately by the conditional means alone.

The procedure we have used to fit the smooth curve to the data in the scatterplot is sometimes called the "conditional mean" smoother. The origin of this name should be obvious, since the curve is created by simply connecting the conditional means of Y for each of the adjacent X values.
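For readers who want to try this on their own data, the following is a minimal sketch of a conditional-mean smoother in Python. The arrays and the function name are ours, invented for illustration; they are not the office data from the Appendix.

```python
import numpy as np
import matplotlib.pyplot as plt

def conditional_mean_smoother(x, y):
    """Return the distinct X values and the mean of Y at each one."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_levels = np.unique(x)                          # the distinct X values
    y_means = np.array([y[x == level].mean()         # conditional mean of Y
                        for level in x_levels])      # at each distinct X value
    return x_levels, y_means

# Hypothetical data with a discrete X (months open) and a continuous Y (clients per week).
x = np.array([1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6])
y = np.array([60, 69, 90, 101, 110, 128, 133, 150, 158, 162, 171, 170])

levels, cond_means = conditional_mean_smoother(x, y)

plt.scatter(x, y, alpha=0.6)                              # the data
plt.plot(levels, cond_means, marker="D", color="black")   # the conditional-mean "curve"
plt.xlabel("Months office has been open")
plt.ylabel("Mean clients per week")
plt.show()
```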
There are many other kinds of scatterplot smoothers, which differ in various ways from our relatively simple conditional mean procedure. They all fall under the general category of "nonparametric regression." Scatterplot smoothers are regression analyses because they are intended to describe the variability in the conditional Y distributions across the X values in a scatterplot. They are nonparametric in the sense that the shape of the curve is determined entirely by the data; it is not specified in advance by the researcher. This latter property makes scatterplot smoothers very useful tools for exploring structure in bivariate data, because they do not impose any particular form on that structure. While we will devote a little more attention to scatterplot smoothing in this chapter, it is not our main focus. But, we will return to it in Chapter 9 when we discuss nonlinear relationships between variables.

Dealing with Continuous X Variables: Slicing a Scatterplot

The basic conditional mean smoother works pretty well when the X variable is discrete. However, it can be problematic when X is a continuous variable because there are generally very few observations at each distinct value of X (at least, in samples of finite size). Therefore, the conditional means are generally not very reliable estimators for the central tendencies in the conditional Y distributions.

To see the problem that occurs, we will return to the data on Medicaid spending in the American states. Figure 3.4 shows the basic scatterplot between Medicaid expenditures (per capita) in 2005, and the number of Medicaid recipients as a percentage of the state population. As we have already seen, the Y values do tend to get larger as the X values get larger. In order to summarize this pattern, we could try fitting a conditional mean smoother to the data, producing the graph shown in Figure 3.5. Far from a "smooth" curve, the conditional mean procedure has generated a series of very sharp zigs and zags across the width of the plot. The zig-zag progression does seem to be tilted upward, such that the general path of the line segments moves from lower-left to upper-right. But, we already saw this basic feature in the original data; we didn't need this "smoother" to draw that conclusion.

The problem in Figure 3.5 is that the X variable is relatively continuous. There are 50 observations in the dataset (each one is a state), and there are 43 distinct values for the percentage of Medicaid recipients in the state population. Because of this, most of the conditional means are "calculated" from a single data value. It seems a bit nonsensical to regard such values as summary statistics for a conditional distribution, since there are no empirical Y distributions for 36 out of the 43 existing X values! The seven remaining X values have two observations apiece, so it is still a bit of a stretch to call these "distributions." The net result is that the contents of Figure 3.5 look more like the outcome of a connect-the-dots game than a model which captures the interesting features of the bivariate data. And, because we believe the zigzags in the not-so-smooth smoother represent fluctuations due to unreliable estimation of the conditional means, we are not particularly interested in them. We would prefer to remove the zigzags, somehow, so we can discern more interesting features of the bivariate data. What we need to do is bring more observations into the calculation of each conditional mean of Y.
Hopefully, this will produce more stable estimates of the average Y values that arise within different segments of X's range. The obvious way to proceed is to take the conditional Y means within intervals of X values, rather than at each specific X value. This leads to a general strategy called "slicing a scatterplot," so named because the intervals of X values form a sequence of adjacent vertical strips or slices, laid out along the horizontal axis of the plotting region. We first need to determine a strategy for creating the slices, and also decide how many slices we will use.

There are several ways we could define the scatterplot slices. For example, we could create fixed-width slices, such that each slice corresponds to a uniformly-sized interval along the X axis. The problem with this approach is that fixed-width slices would probably contain differing numbers of observations from one slice to the next. As a result, the stability of the conditional means would vary across the scatterplot. In order to avoid this, we will create the slices so that each one contains the same number of observations. Of course, this means that the physical widths of the slices may vary within the scatterplot; but, as we will see, this is not a serious problem at all.

Next, we need to decide how many observations will be included within each slice; of course, this decision effectively determines the number of slices. For now, we will simply specify that each slice contains five observations, or 10% of the total number of observations (since the observations are states, n = 50). Generally, we would say that arbitrary specifications of this type are not a good idea. But, we need something to get started. And, for that matter, we can always repeat our analysis with different numbers of slices and see what differences emerge from changing the slice definitions.

Table 3.2 shows the calculations for finding the conditional means. First, the fifty observations are sorted from smallest to largest, according to their values on the X variable. Then, we divide the observations into ten groups of five observations each. These groups are separated by the dashed lines in Table 3.2; you probably already realize that each of these groups represents the subset of observations that fall into one of the slices. Within each group, we calculate the mean of the Y values (by taking the sum of the Yi's and dividing by five) to produce our set of ten conditional means.

Now, we plan to plot these conditional means in a scatterplot. The conditional Y means are the vertical axis coordinates for the points, but we still need to specify some corresponding coordinates for each conditional mean along the horizontal axis. In principle, we could plot each conditional mean at any horizontal position within its slice; but, it seems to make the most sense to place each conditional mean at the center of the respective slices. And, we will operationalize the "centers" as the means of the X values that are contained within each slice. So, the calculations for the conditional X means (which we prefer to call the "slice centers") are also shown in the table.

Figure 3.6 shows how we use the results from Table 3.2 to create the smoothed scatterplot. The first panel in the figure shows the scatterplot of the data, with the slice boundaries drawn in. Panel B adds the slice centers, represented as the solid triangles near the bottom border of the plotting region.
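Before finishing the walkthrough of Figure 3.6, here is a minimal sketch of the slicing computation just described. The function name and the stand-in arrays are ours; they are not the actual Medicaid values from Table 3.2, and the sketch assumes that the number of observations divides evenly into slices of the chosen size.

```python
import numpy as np

def sliced_means(x, y, obs_per_slice=5):
    """Sort the data by X, cut it into equal-count slices, and return the
    slice centers (mean X per slice) and the conditional means (mean Y per slice)."""
    order = np.argsort(x)                        # sort observations by their X values
    x_sorted = np.asarray(x, dtype=float)[order]
    y_sorted = np.asarray(y, dtype=float)[order]
    n_slices = len(x_sorted) // obs_per_slice    # assumes n divides evenly into slices
    centers, cond_means = [], []
    for k in range(n_slices):
        lo, hi = k * obs_per_slice, (k + 1) * obs_per_slice
        centers.append(x_sorted[lo:hi].mean())       # the "slice center"
        cond_means.append(y_sorted[lo:hi].mean())    # the conditional mean of Y
    return np.array(centers), np.array(cond_means)

# Hypothetical stand-in data: 50 observations on a relatively continuous X.
rng = np.random.default_rng(0)
x = rng.uniform(7, 30, size=50)
y = 300 + 50 * x + rng.normal(0, 150, size=50)

centers, cond_means = sliced_means(x, y, obs_per_slice=5)   # ten slices of five
```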
Continuing with Figure 3.6, panel C plots points (i.e., the solid circles) at the height of each conditional Y mean directly over the respective slice centers. Finally, panel D "connects the dots" between the conditional means. Once we have done that, there is no reason to retain the slice boundaries, the locations of the slice centers, or the plotted solid circles. So, Figure 3.7 removes these elements from the graphical display and shows the final scatterplot with the conditional mean smoother superimposed over the points.

What can we see in the smoothed scatterplot? Our eyes are probably drawn to the "flutter" that occurs in the middle portion of the curve. But, looking on either side of that, the "ends" of the curve seem to trend upward (i.e., from lower-left toward the upper-right) within the plotting region at a fairly constant rate. That seems very reasonable: It suggests that Medicaid spending increases steadily with the size of the Medicaid population. But, do we have any substantive explanation for the fluttering in the middle of the graph? Probably not (at least, nothing occurs to us). Instead, we suspect that the zigzags within the fitted smooth curve represent random fluctuations which arise because the conditional Y means are based upon only five observations each. In other words, they probably stem from sampling error, which is not very interesting from a substantive perspective.

We could try to remove the zigzags from the smoother by increasing the number of data points included within each slice. That should help to stabilize the computed values of the conditional means, since more observations would be used in the calculation of each one. Of course, it also makes the slices wider and decreases the total number of slices. Doing that might cause us to miss details in the bivariate structure. We are not too worried about the latter problem here, because the zigzags in Figure 3.7 really do look like random fluctuations. But, it is important to understand and keep in mind that all of our decisions about fitting a model to data involve both benefits and costs. Here, our "model" is the conditional mean smoother. The anticipated benefit of increasing the number of observations within each slice is that the fitted smooth curve should be easier to comprehend. The potential cost is that we may inadvertently make the curve too smooth, in the sense that it "misses" interesting features of the data. This trade-off between costs and benefits generated by our decisions about a statistical model is a theme that we will confront many times throughout the rest of this book!

Returning to the problem at hand, we will create slices in the Medicaid data which contain ten observations each; since n = 50, this leaves us with five slices. Table 3.3 shows how the slices and the conditional means are defined. Apart from the wider slices, the steps in the process are identical to those that we used previously. And, Figure 3.8 shows the smoothed scatterplot that is produced from these calculations. As you can see, the fitted curve is definitely smoother than it was before. In fact, the "curve" is very nearly a straight line!

The evidence in Figure 3.8 suggests that Medicaid spending responds to the size of the Medicaid population in a manner that is constant, regardless of the specific values that either of these variables take on. So, looking at the graph, let's focus on the position along the horizontal axis where X = 10.
Scan upward, until you encounter the curve; then scan over in the horizontal direction to the left-hand vertical axis. The Y value there is approximately 750 or so— note that we are not saying this is the precise value of Y corresponding to an X of 10. We are just "eyeballing" the scatterplot right now, so this is definitely just an approximate figure. Anyway, move over to the right along the horizontal axis, to the position where X = 15, and repeat the process. That is, scan up to the curve, and then over to the vertical axis. Now, the Y value is about 1,000. So, a difference of five units along the X axis (i.e., from 10 to 15) corresponds to a difference of approximately 250 units in the conditional means that are associated with those two values (i.e., from about 750 to about 1000).

Since the curve is essentially linear, we know that we should see a similar increase in the conditional means for a five-unit positive difference anywhere along the range of X values. We can confirm this by looking at the difference that occurs for, say, the interval from X = 15 to X = 20. We've already said that the curve crosses the former X value at a Y value of about 1000. When X = 20, it looks like the curve corresponds to a Y value about half-way between the tick marks for 1000 and 1500 along the vertical axis; of course, this is at about 1250 (again, we are just eyeballing to obtain this value). So, here too, an increase of five units along the X axis corresponds to a 250-unit difference in the position of the curve along the vertical axis. This is just what should occur, given the linear relationship revealed by the smooth curve that we have fitted to the bivariate data.

So, what have we accomplished by carrying out this second exercise in scatterplot smoothing? Once again, we have performed a regression analysis. We say this because the smooth curve in Figure 3.8 describes the way the conditional Y means vary across the range of the X values. More specifically, we carried out a nonparametric regression, because we did not specify the shape of the curve before we fitted it to the data. The results enable us to make some very detailed statements about the way that state-level Medicaid spending relates to the size of the Medicaid-eligible populations within each state. Specifically, the relationship between the two is linear, telling us that the relative differences across the two variables are constant, across the entire range of the X variable.

In order to make our description of the relationship between the two variables a bit more efficient, we can take the results from our visual inspection of the smoothed scatterplot and take the ratio of the difference in the conditional Y means that corresponds to a given difference in the X values. Recall that a difference of five units on X corresponds to a difference of 250 units in the vertical location of the smooth (and almost perfectly linear) curve. Therefore, dividing the latter value by the former, we can say that the slope of the line tracing the conditional means is 250/5 = 50. Stated verbally, a one-unit positive difference in the size of the Medicaid population corresponds to a difference of 50 units in the average level of Medicaid spending. Thus, our simple nonparametric regression allows us to make a very precise and succinct statement about the nature of the relationship between these two variables.

The Transition from Nonparametric to Parametric Regression

Up to this point, we have performed simple types of nonparametric regression analysis.
Again, this approach is justified because we want to explore the relationship between two variables. But, the output from a nonparametric regression is entirely graphical: It consists of the smooth curve fitted to the data in the scatterplot. If the shape of the curve is relatively simple (as it was in the two examples we just analyzed) then we can interpret the pattern that emerges from the data in substantive terms. But, even though our interpretations are fairly detailed statements about the ways that the respective Y variables "respond" to their corresponding X variables, they still retain a high level of subjectivity. This is due to the fact that interpretation of a smoothed scatterplot proceeds through visual inspection of the fitted curve, alone.

In fact, we usually want to make relatively specific assessments of one variable's relationship with, or impact upon, another. And, this often leads analysts to quantify the central characteristics of the empirical relationship. We tried to do that with the Medicaid data, by "eyeballing" the smooth curve, discerning that it was linear, and "guesstimating" the slope of that line. Even though the outcome of that exercise (i.e., our conclusion that the slope of the line is about 50) seems like a reasonable description of the bivariate data, we can probably do better if we rely upon a strategy that eliminates the subjectivity and potential inaccuracy that may arise from visual inspection.

Proceeding along the latter path, we have noted that the conditional means in Figure 3.8 trace out a linear pattern across the scatterplot. And, we remember from our discussion in Chapter 2 that a line can be described pretty easily in quantitative terms, using an equation of the form Yi = a + bXi. Therefore, we might restate our research task (in somewhat informal terms) as follows: If a linear structure exists in the conditional Y distributions, across the range of X values, then we will fit a line to the point cloud in the scatterplot such that the line passes through the middle of the cloud. By "fitting a line," we mean finding the equation that describes that line. If we can accomplish this task, then the coefficients in the equation should be useful for describing the predominant pattern (i.e., the linear structure) that exists in the bivariate data.

Notice how the line (once we find it) is a model of our data. It is a simplified description, which only requires two numeric values (or "parameters"), the coefficients a and b, to represent the 2n numeric values (i.e., the values of the n observations on the X and Y variables) plotted in the original scatterplot. The equation for the line "describes" the data in the sense that the intercept (a) summarizes the "height" of the point cloud over the X axis, and the slope (b) gives the general orientation (or "tilt") of the cloud within the plotting region. Of course, the line will only be a general summary of the data; as a simplified representation, we lose some of the detail contained in the specific data values for the respective observations. But, if the point cloud truly does adhere to a generally linear shape, then, we could use the fitted line to characterize the interesting features of the data, without having to refer to the individual data points, themselves.

By stating the problem in the preceding manner, it may seem that we have moved away from our original strategy of tracing out the conditional Y means across the X values.
After all, the sentence about the point cloud doesn't mention conditional means or conditional distributions anywhere. But, this is not so worrisome. If the shape of the point cloud in the scatterplot really is linear, then passing a line through the middle of the cloud should really trace out a path that comes very close to (and may coincide with) the conditional Y means. And, in the next chapter, we will bring the discussion right back to the problem of placing the line through the conditional means, anyway. So, we are stating the problem a bit differently than before, but we are not really moving to any new research objective in any serious way.

Notice, however, that we have made a fairly important change in the kind of strategy that we are going to employ in achieving that research objective. With our previous nonparametric regression approach, we passed a curve through the conditional means without specifying in advance what the shape of that curve would be; in fact, determining the shape or form of the relationship between X and Y was an integral element of our analysis in each of the examples we presented. Now, we are willing to specify the shape of the curve before we go about fitting it to the data. So, we are satisfied that an interesting feature of the Medicaid data is the linear form of the relationship between the two variables. We just have to determine which line most accurately characterizes the bivariate data. Therefore, we are now employing a parametric regression strategy in the sense that we know which parameters we are looking for— the intercept and the slope of the line that passes through the middle of the point cloud. We simply need to devise a strategy for calculating the appropriate numeric values for the parameters, a and b.

FITTING A LINE TO BIVARIATE DATA: AN INTUITIVE APPROACH

We want to fit a line to the bivariate Medicaid data so that it passes through the middle of the point cloud. You are probably already getting tired of reading that phrase! But, it is an important statement because it describes our immediate analytic objective.

The "Middle" of What?

In order to proceed, we have to define precisely what we mean by "middle." Here, that word refers to a line within the scatterplot which is located so that data points fall on either side of it. That conception of "middle" could mean at least three different things. These are illustrated in the panels of Figure 3.9.

First, "middle" could be defined in terms of horizontal distances between the points and the line, as shown in Figure 3.9A. In other words, we would try to find a line that achieves a sort of balance in terms of points falling to the left and the right of that line. However, we probably don't want to do this, because a horizontal conception of "middle" would imply that we are using the values of the X variable to determine the line's placement. Remember that our ultimate objective is to determine the sources of variation in the Y variable, not X. We will simply regard the specific Xi's in our observations as given quantities. So, this is not the criterion that we will use to fit the line.

A second possibility would be to define the "middle" in terms of the direction that is perpendicular to the predominant orientation of the point cloud within the scatterplot. This would imply locating the line on the basis of the perpendicular distances from the points to the line, as shown in panel B of Figure 3.9.
At first, this might seem to have some intuitive appeal, since it corresponds to what appears to be the smallest possible way of measuring the distance from each point to the line. But, this criterion is not particularly useful for present purposes: Consider the point labeled "i" in the figure; the perpendicular segment from this point (shown as the dashed line) intersects the fitted line at the point labeled "î". We can use the Pythagorean theorem to calculate the distance between these points as

dist(i, î) = √[(Xi − Xî)² + (Yi − Yî)²].

Thus, we can see that we would still have to take the X values into account when we calculate the perpendicular distances. And, the measurement units for these distances would be difficult to interpret whenever the X and Y variables are measured in different units (as they are with the Medicaid data). So, we will not use this criterion, either.

The third alternative is to define "middle" in terms of the vertical direction, as shown in Figure 3.9C. This has considerable appeal. For one thing, this means that the distances from the points to the line are calculated using only the Y values— X does not appear in the formula for the distance when we measure it in the exactly vertical direction. Notice in the figure that data point i and the point, î, along the line share the same value of X, which we will designate as Xi. Therefore, the distance between these two points is:

dist(i, î) = √[(Xi − Xi)² + (Yi − Yî)²] = √[0 + (Yi − Yî)²] = √[(Yi − Yî)²] (1)

So, we have eliminated the X variable from our conception of middle. There is an added advantage in that the distances from the points to the line are now measured in the units of the Y variable— this is much simpler than the combined units that we would have had to use with the second criterion, above. And, using the middle in the vertical direction implies that we are locating the line on the basis of the dependent variable. That is probably a good thing, since it is really the variability in Y that is motivating our research in the first place. Thus, we will want our eventual fitted line to pass through the middle of the points in the scatterplot, with "middle" conceptualized in the vertical direction.

Defining Two New (Imaginary) Variables

Now that we have clarified our conception of "close," it is useful to re-state the line-fitting process. Our original dataset contains two variables which, for convenience, we label X and Y. We will add a third, new, variable to this dataset which we will call Ŷ. So, each observation, i = 1, 2, . . . , n, has not only an X value (Xi) and a Y value, Yi, but also a value on the new variable, Ŷi. Note that Ŷ is an "imaginary" variable, with no empirical existence; it is just a set of n numeric values that we will make up for our own analytic convenience.

Now, the distinguishing feature of this new imaginary variable is that it is a perfect linear transformation of the X variable, such that for any observation, i, the following relationship will hold:

Ŷi = a + bXi (2)

If we were to plot the Ŷi's against the Xi's for the full set of n observations, the points would (of course) fall perfectly along a line. This is true, by definition, since Ŷ is a linear transformation of X. Our task is to find values for the coefficients, a and b, that produce a line that meets our criterion. That is, it runs through the middle of the point cloud, with "middle" defined in terms of the vertical direction.
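To make these ideas concrete, here is a small sketch that builds the imaginary variable Ŷ from a candidate intercept and slope and then computes the two kinds of point-to-line distance discussed above. The numeric values for a, b, and the data points are made up purely for illustration.

```python
import numpy as np

a, b = 300.0, 50.0                             # hypothetical candidate intercept and slope
x = np.array([8.0, 12.0, 20.0, 27.0])          # hypothetical X values
y = np.array([640.0, 1100.0, 1180.0, 1700.0])  # hypothetical Y values

# The "imaginary" variable: a perfect linear transformation of X, as in equation (2).
y_hat = a + b * x

# Vertical distances from each point to the line, measured in Y units only.
vertical_dist = np.abs(y - y_hat)

# For comparison, the perpendicular (shortest) distance from one point to the line
# y = a + b*x; note how the slope, and hence the X scale, enters the calculation.
i = 1
perp_dist = np.abs(y[i] - (a + b * x[i])) / np.sqrt(1 + b ** 2)

print(y_hat)                 # [ 700.  900. 1300. 1650.]
print(vertical_dist)         # [ 60. 200. 120.  50.]
print(round(perp_dist, 1))   # about 4.0
```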
Let us agree upon some terminology involving this new variable, Ŷ. For many purposes, we could just call it "Y hat." In fact, many data analysts do exactly that. For the moment, though, we prefer to call the Ŷi's the "fitted values" for our observations, since they all fall along a line that we will (eventually— we promise!) fit to the bivariate data. Also note that many researchers and texts refer to the Ŷi's as the "predicted values" of the Y variable. We will eventually do that, too. But, this term suggests a substantive interpretation that we are not ready to impose; so, we'll defer using it for awhile.

For most observations, Ŷi will not be exactly equal to Yi, since the actual data points do not fall along a single line. Therefore, we will create still another new variable; this one measures the difference between the Yi and Ŷi for each observation:

ei = Yi − Ŷi (3)

This new variable, e, is called the "residual" and it represents the signed vertical discrepancy between each observation's value on Y and its value on Ŷ. Of course, the size of any observation's residual is just the distance from its point to the line, evaluated in the vertical direction. Note, however, that distances are always positive. Residuals can be either positive or negative.

Figure 3.10 shows a simple scatterplot with only five observations and a fitted line. The Y and Ŷ values for observation 1 (the leftmost data point) are shown along the left-hand vertical axis. Since the former is larger than the latter (as you can see from their relative positions along the axis), the residual for this observation will be a positive number (that is, e1 > 0). The Y and Ŷ values for observation 4 (the rightmost data point) are shown along the right-hand vertical axis. Here, Ŷ4 is larger than Y4; therefore, the residual for observation 4 is a negative number (e4 < 0). Observations with points located above the line will have positive residuals, while those represented by points below the line will have negative residuals. An observation with a point that falls exactly on the line (such as observation 3 in Figure 3.10) will have a residual of zero. We do not show the residuals explicitly for observations 2 and 5 in the figure. But, before proceeding, you should make sure that you can see how the former would have a relatively small negative residual, while the latter would have a relatively large positive residual.

Operationalizing "the Middle of the Point Cloud"

Now, we are ready to define what we mean when we say that the line (which is defined by the Ŷi's, which are in turn defined by the values of the coefficients, a and b) must pass through the middle of the point cloud in the scatterplot. Specifically, we will define "passing through the middle of the point cloud" in terms of three criteria.

First, we will require the line to pass through a location in the scatterplot called "the centroid." This strange-sounding term simply refers to a point with coordinates that are the means of the two variables, or (X̄, Ȳ). To state the criterion a little differently, we require that the fitted value of Y at X = X̄ must equal Ȳ. Verbally, the preceding expression says that the fitted value for an X value that is equal to the sample mean of X must be the sample mean of Y. Still another (and, perhaps, more succinct) way of showing the same thing is:

Ȳ = a + bX̄ (4)

The means measure the central tendencies of the two variables' distributions.
So, it seems entirely reasonable to us that a line going through the "middle" of the bivariate data must pass through this particular location. To provide some visual confirmation of what we are talking about, Figure 3.11 shows a scatterplot with 25 hypothetical observations. The observations are plotted as open circles, and the centroid is shown as the single solid black circle. As you can see, it truly does appear in the center of the point cloud. And, it is difficult to imagine how we could run a line through the middle of the cloud without intersecting this particular location!

Second, we will require that the residuals have a mean of zero, or ē = 0. That also implies that the residuals sum to zero across the n observations (Σ ei = 0), since the sum is the numerator of the sample mean. This requirement is also pretty reasonable: The only way we can obtain a mean residual of zero is when positive and negative values are summed, so they cancel each other out. And, as we saw from our discussion of Figure 3.10, this happens when some of the points are above the line and others are below the line. Once again, that is exactly what we would want from a line that runs through the middle of the points. In order to convince you that this second condition is important, Figure 3.12 shows two situations where Σ ei ≠ 0. The first panel of the figure shows a situation where the residuals sum to a positive number. The geometric implication is that the points tend to fall above the line (here, in fact, they all fall above the line!). Hence, it is not reasonable to say that this line passes through the middle of the point cloud. The second panel of Figure 3.12 shows the opposite situation— the residuals sum to a negative value, meaning that most observations fall below the line. Once again, you do not need a great deal of technical expertise to see that this line does not run through the middle of the data!

Third, we will require that the fitted line be positioned so that the values of the residuals are uncorrelated with the values of the X variable, or rX,e = 0. At first glance, this condition may seem a little strange— or, at least, less intuitive than the other two. Nevertheless, it is critically important for making sure that the line passes through the middle of the entire point cloud. If this condition were violated, the size of the residuals would trend along with the size of the X values, and the fitted line would be "off" from the point cloud, even if the previous two conditions are met.

Figure 3.13 uses the same hypothetical 50-observation dataset as Figure 3.12 to show what happens when rX,e ≠ 0. In panel A, the X variable and the residual are positively correlated, with rX,e = 0.851. The value of this coefficient implies that the values of the residual should tend to increase along with the values of X. We can see that the residuals near the left side of the plotting region in Figure 3.13A tend to be relatively large and negative (i.e., the plotted points are far below the line). As we move toward the right, the points tend to be closer to the line, although still below it. This means the values of the ei's are becoming larger (i.e., the negative numbers approach zero). On the right side of the plotting region, the residuals tend to be positive (i.e., most of the points are located above the line) and their values tend to increase as we approach the right-hand edge of the plotting region. So, the residual does tend to increase, moving from left to right in the scatterplot.
And, of course, the horizontal axis corresponds to the X variable, so its values definitely increase as we move in the same direction. That produces the positive correlation between X and e. And, this correlation, in turn, produces a fitted line that is "tilted" too far, producing an excessively shallow slope, relative to the point cloud. Figure 3.13B shows that the opposite occurs when X and e are negatively correlated. Here, rX,e = −0.908, so the ei's tend to decrease as the Xi's increase. This produces a fitted line that is not tilted enough, relative to the point cloud; its slope is too steep.

In contrast to the situations shown in the first two panels of Figure 3.13, the third panel shows the line that occurs when rX,e = 0 (again, we are assuming that the first two conditions are met). As you can see, there is no discernible trend in the values of the residuals. Within any narrow interval along the X axis, there are points plotted both above and below the line; there is no tendency for small X values to coincide with negative values of e, and vice versa. And, the result is a line that we could reasonably describe as running through the middle of the entire cloud, from left to right within the plotting region.

Fitting a Line to Bivariate Data (Finally!)

At last, we are ready to proceed with our task of fitting a line to bivariate data such that our line runs through the middle of the point cloud in the scatterplot. And, we'll use the operational definition of "through the middle of the point cloud" that we developed in the preceding sections. Let's begin by rearranging our definition of the residual for an arbitrarily-selected observation, i, as follows:

Yi − Ŷi = ei (5)
Yi = Ŷi + ei (6)
Yi = a + bXi + ei (7)

Line (5) in the preceding derivation simply repeats equation (3), which defines the residual for an observation. Next, Ŷi is added to both sides of the equation; of course, it cancels out on the left-hand side, producing equation (6). Finally, in equation (7), we expand the right-hand side using the definition of Ŷ as a linear transformation of X. That last line shows explicitly how the value of the dependent variable for any observation is connected to the value of the independent variable for that observation through the coefficients, a and b, and the residual for that observation, ei.

And, to remind you of something from Chapter 2, equation (7) expresses Y as a linear combination. In this linear combination, a is the coefficient on a pseudo-variable which has a value of one for every observation. Hence, the latter does not vary, and we do not need to show it explicitly in the expression for the linear combination. In a similar manner, e is a variable with a coefficient of one; therefore, that coefficient does not need to be shown. Of course, b is the coefficient for the third— and only "real"— variable in the linear combination, X.

Our next task is to use the information we have available to produce formulas for the coefficients; after doing that, it will be an easy matter to calculate the Ŷi's for the observations, and use them to obtain the ei's. We will begin with the b coefficient; as the slope of the linear transformation from X to Ŷ, it is (at least arguably) the most interesting part of the fitted line. The derivation of a formula for b is shown in Box 3.1. Our objective is to manipulate equation (7) in order to isolate b on the left-hand side. Hopefully, when we do so, the right-hand side will only contain information that we already possess.
If so, then we should be able to calculate the actual, numeric, value of b.

Line (1) in Box 3.1 simply repeats equation (7). In line (2), we subtract the mean of Y from each side. Remember that Y is a linear combination, and the mean of a linear combination is a weighted sum of the constituent means. That is what produces the sum within the second set of parentheses on the right-hand side of line (2). In line (3), we remove the parentheses from the right-hand side and rearrange the terms. Line (4) inserts three new pairs of parentheses on the right-hand side, grouping together related terms. Two important things occur in line (5): First, (a − a) is obviously zero, so we eliminate it from the equation. Second, we factor the b out of the difference between Xi and X̄. We could simplify line (5) a bit further, by removing ē. The latter is equal to zero, so it does not really need to appear there. But, we want to retain it, because it will help us recognize something important that occurs in a few more steps.

In line (6), we multiply both sides by (Xi − X̄). Please, just bear with us on this step— you may not see why we are doing it here, but it does facilitate our task of isolating the b coefficient (take our word for it!). Moving from line (6) to line (7) we distribute the (Xi − X̄) across the two right-hand side terms within the brackets. In line (8), we take the sum, across the n observations, of the terms on each side of the equation. Why do we do this? Remember that the line-fitting process needs to involve all of the data points, and not just the ith observation. Taking the sum seems like a straightforward way to bring all of our data (i.e., from all n observations) into the calculations. Notice, too, we have a particular name for the quantity that appears on the left-hand side: It is the sum of cross-products for X and Y.

Next, line (9) distributes the summation on the right-hand side across the two terms within the brackets (using the first rule for summations in Appendix 1). Now, remember that the b coefficient is a constant— it doesn't vary from one observation to the next. Therefore, the first term on the right-hand side of line (9) is the sum of a constant times a variable, and we apply the third rule of summations (again, from Appendix 1) in line (10) to factor the b coefficient out of that summation.

Take a very careful look at the right-hand side of line (10). The first term is the b coefficient times another quantity for which we have a particular name: The sum of squares for X. And, the second term is the sum of cross-products for X and e. (This is why we left the ē in there— so you'd recognize this a bit more easily.) Recall from Chapter 2 that the sum of cross-products is the numerator of the covariance for the two variables, and also the numerator of the correlation coefficient. Furthermore, remember our third criterion for passing the line through the middle of the point cloud: The residual (e) must be uncorrelated with X. Therefore, the sum of cross-products must be zero; that is the only way we can produce rX,e = 0. So, the last term on the right-hand side of line (10) drops out, leaving line (11). Next, line (12) divides both sides by the sum of squares in X. Of course, this sum of squares cancels out on the right-hand side of line (12), leaving only the b coefficient. And, line (13) follows tradition by reversing the two sides of the equation, placing the quantity we are trying to find on the left, and the formula for finding it on the right-hand side.
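For readers who prefer to see the algebra in one place, here is a compact restatement, in our own notation, of the argument just narrated for Box 3.1; it compresses several of the box's thirteen lines into single steps and is a summary of the argument, not a replacement for the box itself.

```latex
\begin{align*}
Y_i &= a + bX_i + e_i \\
Y_i - \bar{Y} &= (a + bX_i + e_i) - (a + b\bar{X} + \bar{e}) \\
Y_i - \bar{Y} &= b(X_i - \bar{X}) + (e_i - \bar{e}) \\
(X_i - \bar{X})(Y_i - \bar{Y}) &= b(X_i - \bar{X})^2 + (X_i - \bar{X})(e_i - \bar{e}) \\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})
  &= b\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{i=1}^{n}(X_i - \bar{X})(e_i - \bar{e})
\end{align*}
```

The last summation on the right-hand side is the sum of cross-products for X and e, which must equal zero because we require rX,e = 0; dropping it and dividing both sides by the sum of squares for X leaves b equal to the sum of cross-products for X and Y divided by the sum of squares for X.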
The preceding derivation may have felt like a lot of steps. But, we have achieved our objective, a formula for the b coefficient! Perhaps most important, this formula is composed entirely of quantities that we can calculate very easily from our bivariate data. So, this is an important accomplishment! Even better, the formula is quite straightforward. It consists of a fraction formed by two familiar quantities: The sum of cross-products for X and Y, divided by the sum of squares for X. We can go a step further and point out that b is actually the ratio of the covariance (between X and Y) to the variance of X. Remember that each of these two statistics has the degrees of freedom (n − 1) in its respective denominator. These cancel out when we form the fraction with the covariance in the numerator and the variance in the denominator. So, we feel fully justified in claiming that we have arrived at a particularly neat solution for the slope of the line that will pass through the middle of the point cloud in our scatterplot!

Satisfying as our preceding accomplishment may be, we cannot stop here. Remember that we also have to solve for the a coefficient, the intercept of the line that we are going to use in order to smooth our scatterplot. The procedure is, again, straightforward and we can apply directly the results we have already obtained. In other words, we can assume that we have already calculated the value of the b coefficient. Box 3.2 lays out the derivation. Line (1) in Box 3.2 just lays out the expression that shows Y as a linear combination of X, e, and the pseudo-variable associated with the a coefficient. In line (2), we take the mean of the linear combination, producing the weighted sum on the right-hand side. In line (3), we take advantage of our requirement that the residuals sum to zero, and remove ē from the right-hand side. Line (4) is produced by subtracting bX̄ from both sides of the equation. Of course, this isolates the a coefficient, giving us exactly what we are trying to achieve. And, line (5), once again, places the coefficient on the left and the formula on the right.

Looking at line (5), we see once again that the coefficient in question, a in this case, can be obtained very easily by applying a simple formula to quantities that we have probably calculated from our data in previous steps of the analysis. So, we needed X̄ and Ȳ to calculate the sum of squares and sum of cross-products needed to calculate b. And, we have already produced a formula for b, itself, which we can plug directly into line (5) of Box 3.2.

Box 3.3 repeats the formulas we have derived for the a and b coefficients, for convenience. Remember that our objective was to use these coefficients to generate the variable Ŷ, by plugging each of the n values of X in our dataset into the formula, Ŷi = a + bXi. And, the line upon which all of the resultant Ŷ's fall should pass through the middle of the point cloud in the scatterplot of X and Y. Does the line produced by our newly-derived formulas actually work this way?

Fitting a Line to the Medicaid Data

We will provide at least a partial answer to the preceding question by using the formulas in Box 3.3 to fit a line "by hand" to our bivariate Medicaid data. Table 3.4 shows the basic calculations. The first two columns give the actual values of the X and Y variables. The sums for each variable are shown at the bottom of the respective columns. The third and fourth columns subtract the means from the variable values to obtain the deviation scores.
Note that deviation scores always sum to zero (as shown at the bottom of these columns). Next, the fifth column multiplies the deviation scores for X and Y together to produce the cross-product for each observation. And, the sum of cross-products is shown at the bottom of this column. Finally, the sixth column squares the deviation score for X, and the sum of squares for X is shown at the bottom of this column.

Table 3.5 uses the various sums from Table 3.4 to calculate the values of the two coefficients. First, we divide the sum of cross-products (42,997.520) by the sum of squares for X (891.099) to produce the value for the b coefficient, or 48.252. Next, we divide the sum of the X values (724.800) by n (50) to obtain the mean of X, or X̄ = 14.496. The mean of Y is obtained the same way, producing Ȳ = 1000.600. These two means, along with the previously-calculated value of b (48.252), are inserted into the formula for a to produce an intercept of a = 301.136.

Next, Table 3.6 calculates the predicted values and residuals for each of the fifty observations in the dataset. Once again, the first two columns in the table simply repeat the data values, for convenience. In the third column, we obtain the fifty Ŷi values, by plugging each Xi into the formula Ŷi = a + bXi, with the appropriate numeric values substituted into the formula for a and b. That is, for each observation, Ŷi = 301.136 + 48.252Xi. The fourth column subtracts the Ŷi's from their corresponding Yi values to produce the residuals. And, the fifth column presents the cross-products between the Xi's and the corresponding ei's. Note that the deviation scores for X are not shown; they are identical to those that were presented in Table 3.4. Note also that the deviation scores for the residuals are just equal to the residuals, themselves, because ē = 0, by construction.

The column sums are once again shown at the bottom of the table. And, these sums have several interesting features. First, we can see that the residuals sum to zero. They were created to possess exactly this property, so this result may not be particularly surprising. Still, it is reassuring to verify that things work out as they should! Second, the sum of cross-products for X and e is exactly zero. But, this too, is to be expected: Remember that we required that X and e be uncorrelated. That requirement implies that the sum of cross-products between them must be zero, exactly as we see it is in the table. Third, look at the sum of the fitted values. The entry at the bottom of the third column in Table 3.6 shows that the Ŷi's sum to 50,030. But, this is exactly the same value as the sum of the original Yi's (verify this by looking at the bottom of the second column in Table 3.6). Of course, we know that the sum of a variable's values is the numerator of the mean for that variable. So, the result we have just seen tells us that the mean of the fitted values is identical to the mean of the dependent variable; using our generic nomenclature, the mean of the Ŷi's equals Ȳ. For the moment, we will regard this equivalence as an interesting "coincidence." But, we will soon see that it plays a very important role in developing the regression model.

Let us next reconstruct the scatterplot of Medicaid expenditures against the size of the Medicaid population within each state. Now, however, we will superimpose our fitted line over the 50 data points.
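Before looking at that figure, readers who want a cross-check on the arithmetic may find the sketch below useful. It strings together the same sequence of calculations for an arbitrary bivariate dataset: deviation scores, the sums of squares and cross-products, the coefficients from Box 3.3, the fitted values, and the residual properties discussed above. The arrays are placeholders, not the actual Medicaid values from the tables.

```python
import numpy as np

def fit_line(x, y):
    """Compute a and b with the Box 3.3 formulas, plus fitted values and residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev, y_dev = x - x.mean(), y - y.mean()    # deviation scores
    scp_xy = np.sum(x_dev * y_dev)               # sum of cross-products for X and Y
    ss_x = np.sum(x_dev ** 2)                    # sum of squares for X
    b = scp_xy / ss_x                            # slope
    a = y.mean() - b * x.mean()                  # intercept
    y_hat = a + b * x                            # fitted values
    e = y - y_hat                                # residuals
    return a, b, y_hat, e

# Hypothetical stand-in data on roughly the same scale as the Medicaid example.
rng = np.random.default_rng(1)
x = rng.uniform(7, 30, size=50)
y = 300 + 48 * x + rng.normal(0, 150, size=50)

a, b, y_hat, e = fit_line(x, y)

print(round(e.sum(), 8))                    # residuals sum to (essentially) zero
print(round(np.corrcoef(x, e)[0, 1], 8))    # X and e are uncorrelated
print(np.isclose(y_hat.mean(), y.mean()))   # mean of fitted values equals mean of Y
```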
Returning to the Medicaid data, we position the line by plotting the Ŷi's against the corresponding Xi's, and running a straight line through the points, to span the entire width of the scatterplot. This is shown in Figure 3.14.

The line shown in Figure 3.14 does exactly what we want it to do. Given the irregular shape of the point cloud, we believe this line approximates the "middle" of the cloud very well; at least, we can say that at any position along the horizontal axis, there are data points falling both above and below the fitted line. We would argue that the line shown in Figure 3.14 does successfully capture the predominant pattern in the way the values of the dependent variable tend to shift upward when the independent variable values are larger. Thus, the line provides us with a succinct summary of the interesting structure in the overall bivariate dataset. Rather than looking at the 100 distinct data values, we can use the line as a general description that transcends the entire set of 50 separate observations.

If the line summarizes the data, then the a and b coefficients describe that summary line. Therefore, we might claim that a summary of the way Medicaid expenditures respond to differences in the number of Medicaid recipients lies in the slope coefficient. The predominant linear pattern in these variables indicates that a one-unit positive difference in the size of the Medicaid population corresponds to a difference of 48.252 units in the fitted values. Now, this value of the b coefficient may not correspond to the actual differences in X and Y values for any particular pair of observations in the dataset. But, it is not intended to do that; instead, the slope tries to summarize the predominant pattern that occurs across all 50 observations in the full dataset. For now, we will not push the substantive interpretation any farther than this. But, try to bear with us for now: We can assure you that we will eventually provide a much more specific way to impose substantive meaning onto the calculated numeric value of the slope coefficient.

The value of b determines the directional orientation of the line (or, its steepness, relative to the axes) within the plotting region of Figure 3.14. The value of the intercept, a, determines the vertical position of the line, relative to the vertical axis. Here, the intercept is 301.136. In mathematical terms, this is the fitted value that corresponds to Xi = 0. Geometrically, if our vertical axis were anchored at the origin (i.e., the zero point) along the horizontal axis, the fitted line in Figure 3.14 would cross that vertical axis at the location corresponding to a value of 301.136. But, notice that the values of our X variable do not extend that far downward. The minimum percentage of the population in any state that receives Medicaid is 7.2; thus, Xi = 0 simply does not occur in our dataset. So, the intercept really is just a "height adjustment" term, used to guarantee that the fitted line passes through the point cloud in an appropriate manner (i.e., near the middle). Again, we will discuss the substantive interpretation of the intercept later on.

We have done a lot in this section! So, it is probably useful to catch our breath and summarize our accomplishments. First, we changed the strategy we are using to summarize the structure within our bivariate data. Since an earlier nonparametric regression suggested that this structure is linear, we moved to a parametric approach.
In other words, we specified that we would fit a "curve" with a linear shape (or, stated more clearly, a line). Then, it was just a matter of finding a specific line that meets our criteria— running through the middle of the point cloud. We accomplished that by operationalizing the criteria in very specific terms and then working through some straightforward algebra. The final result was a line that successfully summarizes our bivariate data— at least, according to our visual interpretation of the scatterplot, with the fitted line superimposed over the data points. And, to reiterate a very important point: This line is potentially useful because it seems to summarize, very succinctly, the position and orientation of the point cloud. And, in so doing, it helps us discern structure in our data that is potentially relevant for addressing the substantive concerns that led us to examine the data in the first place, such as theory-testing or forecasting (or, very likely, both).
FINDING THE "BEST" LINE
From our work in the preceding section, we have produced a line that seems to pass through the middle of our data point cloud in the scatterplot of the Medicaid data. But, how do we know that our line is the best possible line that could be used to represent the data in the scatterplot? Of course, our line conforms to the criteria that we laid out to operationalize the concept of "middle" in the scatterplot. But, another analyst could easily argue that our criteria have no particular status, and that other lines provide a summary representation of the systematic pattern in the data that is just as good as ours. For example, remember that we came up with a slope of 50 from our visual inspection of the conditional mean smoother that we fit to the sliced scatterplot. The value, 50, is a nice round number which is probably easier to remember than the slope that we just calculated (that is, b = 48.252). So, it might make just as much sense to use that as the slope of the line that we impose over the data. If we use b = 50, and still require the line to run through the centroid (which we do not really have to do— it just makes good sense to us), then the intercept will be 275.80.⁹ The first panel of Figure 3.15 shows this line in the scatterplot. Visually, the new line is just about identical to the one from the previous section. But, again, it is a different line in the sense that it is described by the equation Ŷi = 275.80 + 50Xi rather than the equation Ŷi = 301.136 + 48.252Xi. This new line does not meet all the criteria we laid out in the preceding section: Specifically, there is a nonzero correlation between the independent variable and the residuals. If we designate the residuals from this new line as e*, then rX,e* = −0.030. But, again, our criteria have no particular theoretical status; they just "made sense to us" and they produced a line that seemed to run through the middle of the points. The new line appears to do exactly the same thing, even though it doesn't meet our criteria. Therefore, one could claim it is just as good as our line. Still another analyst might criticize the lines we have looked at so far (that is, both ours and the new one) because their slope is too shallow.
To this person, the problem might be the position of the line near the right side of the scatterplot; one could argue that it passes very close to the points in the lower region of the scatterplot, but too far from the points in the upper region— particularly the one point in the upper right-hand corner of the plot. This individual might prefer that we use a steeper slope and a smaller intercept, producing the line shown in Figure 3.15B. Here, the equation for the line is Ŷi = 100 + 68Xi. The coefficients in this equation do not conform to any particular criteria; the line was simply oriented within the scatterplot so that it "looked right" to our analyst. From this individual's subjective perspective, the line in Figure 3.15B is just as clearly running through the "middle" of the scatterplot as was our line (from Figure 3.14) or the new line we just examined a moment ago (in Figure 3.15A). Of course, the general process could continue ad infinitum. The problem is that we have not defined any criterion that establishes the best line for a given scatterplot. And, until we do so, this indeterminacy about the most accurate representation of the data will continue. So, what do we mean by the "best" line? Since we are using the line to represent the data, we probably want to generate a line that comes as close as possible to as many of the points as possible. Furthermore, we may as well go ahead and define "close" in the vertical direction within the scatterplot. Our reasons for doing so are identical to the reasons we laid out for defining the middle of the point cloud in the vertical direction. Thus, we want to find a line such that the Ŷi's are as similar to the actual Yi's as possible. Stated a bit differently, we want to produce a line with the smallest possible residuals, since ei = Yi − Ŷi, for observations from i = 1 to n.
Operationalizing "Close": The Least-Squares Criterion
So far, we have stated our objective in relatively informal, but intuitive, terms. We need to be more specific in stating exactly what we mean when we say that we want the line to come as close as possible to as many points as possible, or that we want the line that makes the residuals as small as possible. And, just as we did when we operationalized the "middle" of the point cloud, we will consider and reject several potential strategies for operationalizing "closeness" (or "small residuals") before arriving at the one that we will actually use. As a starting point, let's think about making each observation's residual (that is, ei) as small as possible. Then we can just sum across the n observations. The general idea is that we would find the line that minimizes $\sum_{i=1}^{n} e_i$. But, this will not work: We can make $\sum_{i=1}^{n} e_i$ as small as we want, simply by drawing the line farther and farther above the point cloud within the scatterplot. As we saw back in Figure 3.12B, such a line is unacceptable because it does not run through the point cloud at all! As a second try, we could find the line that makes the residuals sum to zero. After all, that was one of our criteria for defining the middle of the point cloud, and the line we produced from that exercise worked pretty well. In fact, $\sum_{i=1}^{n} e_i = 0$ is a very reasonable requirement for our "best" line; it is just not enough, by itself. The problem is that there are many lines that will produce residuals which sum to zero.
Some of these lines will be good summaries of the bivariate data (such as the one that we calculated in the previous section, and showed in Figure 3.14). However, other lines will not coincide very closely with the predominant trend in the point cloud, even though they meet this requirement. For example, consider Figure 3.16; the line shown in this scatterplot produces $\sum_{i=1}^{n} e_i = 0$, even though it would be a silly depiction of the Medicaid data. That is, the line runs in a nearly horizontal direction through the plotting region, while the data run from lower-left to upper-right. The problem with the preceding attempt, of course, is that the negative residuals and positive residuals cancel each other out. This can occur for a wide variety of different lines, regardless of their orientation relative to the data point cloud. But, recognizing the problem also suggests the solution: Rather than minimizing the sum of the "raw" residuals across the n observations, we can perform some transformation on the residuals to make them all non-negative. Then, we could minimize the sum of these transformed versions of the residuals. One approach would be to transform the residuals by taking their absolute values, and minimizing $\sum_{i=1}^{n} |e_i|$. This strategy has some initial appeal, because the |ei|'s are the vertical distances from the points to the line. Thus, we really would be putting the line as close as possible to the points. An approach based upon this idea would work; in fact, it is actually used in some regression applications where the data include unusual observations with discrepant data values compared to most of the other observations. But, it also has a drawback: It is a relatively complicated mathematical problem to produce formulas for the a and b coefficients that minimize the sum of the absolute residuals. And, there is no need for us to bother with that complexity (at least, for the moment) because we have a much simpler approach available to us. As you probably know, we could also transform the residuals to a set of non-negative values by squaring them. Next, we would sum the squared residuals across the observations, producing $\sum_{i=1}^{n} e_i^2$. Then, we could find the line (or, more precisely, the coefficients a and b which define the line) that minimizes this sum of squared residuals. This is called the least-squares criterion for fitting a line to the data. This, in fact, is the strategy that we will use. As we will see, the mathematical steps involved in finding a and b with the least-squares criterion are straightforward. (We hesitate to say "simple" because they will involve calculus. But, please, don't be intimidated by that. We'll get through that part pretty easily, as you will see shortly!). And, in the next chapter, we will show that the resultant formulas have some very nice statistical properties— at least, when certain conditions are met in our data.
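To see concretely why the first two criteria fail and why transforming the residuals fixes the problem, consider the following small numerical illustration. The data and candidate lines here are made up by us purely for demonstration; only the three criteria themselves come from the discussion above.

# Made-up data, chosen only to make the comparison easy to follow.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]      # roughly y = 2x, with a little noise

def criteria(a, b):
    """Return (sum of residuals, sum of |residuals|, sum of squared residuals)."""
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return sum(e), sum(abs(ei) for ei in e), sum(ei ** 2 for ei in e)

print(criteria(0.0, 2.0))    # a sensible line: all three criteria are small
print(criteria(50.0, 2.0))   # a line floating far above the cloud: sum(e) is hugely negative
print(criteria(6.02, 0.0))   # a flat line at the mean of y: sum(e) = 0, but |e| and e^2 are large

The first criterion rewards lines that float far above (or below) the data; the zero-sum criterion is satisfied even by a flat line through ȳ; the absolute-value and squared-residual criteria both penalize distance from the points regardless of sign.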
Using Least-Squares to Fit a Line to Data
We want to find values for a and b in the equation Ŷi = a + bXi such that, when we calculate $\sum_{i=1}^{n} e_i^2$, that sum is as small as possible. At first, that idea may seem a bit strange to you since, conceptually, the residuals can only be found after the line, itself, has been created. If so, then how can we use the sum of squared residuals as the criterion for creating the line in the first place? Isn't that a statistical manifestation of putting the cart before the horse? In fact, we cannot calculate the ei's (and, hence, the e²i's) for the individual observations until we have actually fitted the line to the data. But, it is easy to show that the sum of the e²i's across the full set of n observations is determined entirely by the values of a and b that we use to create the line. Now, the residuals are defined as the difference between the actual values of Y and the fitted values (that is, Ŷ) for each observation. Therefore, the sum of squared residuals can be shown as:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Now, the fitted values for each observation are defined as Ŷi = a + bXi. So, we can substitute the latter into the formula for the sum of squared residuals as follows:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - [a + bX_i]\right)^2 = \sum_{i=1}^{n} \left(Y_i - a - bX_i\right)^2$$

Look at the right-hand side of the preceding equation. Four different quantities appear within the parentheses following the summation operator. Two of these are data values, Yi and Xi. Of course, the data are fixed by the observations in the sample before us. We cannot alter the values of Yi and Xi in any way! (There are severe professional sanctions for doing so!). This leaves us with a and b as the only quantities that are available for us to manipulate. So, again, we will try to obtain values for these two coefficients that minimize the term on the left-hand side, the sum of squared residuals. How might we go about this task? One strategy might be to conduct a search across different values of the coefficients. In other words, select a pair of numeric values for a and b, use them to calculate the Ŷi's for the n observations in the dataset, use these fitted values to calculate the ei's (remember, we have the Yi's from the data, themselves, so it is easy for us to calculate ei = Yi − Ŷi for each of the n observations), then square the ei's and sum them to obtain the sum of squared residuals, $\sum_{i=1}^{n} e_i^2$. Repeat this entire process many times, trying different values of a or b, or both, in each case. Record the sum of squared residuals for each combination of a and b, and simply select the pair of coefficient values that produces the smallest $\sum_{i=1}^{n} e_i^2$. The preceding search strategy might (and probably would) work if we searched across enough combinations of a and b values. It would involve a lot of computation (remember, we would have to calculate n residuals for each distinct pair of coefficients in order to obtain $\sum_{i=1}^{n} e_i^2$), but this is not really a problem with modern computing technology. More troublesome is the fact that this search strategy would have to be carried out separately for each new dataset that we might want to analyze; there is no formula to "unify" the process and make it more coherent. Furthermore, we would probably always worry that our search might have missed the set of coefficients that is truly the best, according to the least-squares criterion. So, even though the search strategy is not entirely unreasonable, we reject it for relatively practical reasons.
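Before turning to the calculus, here is what that brute-force search would look like in practice. This sketch is purely illustrative: the dataset is made up, and the grid of candidate values and its step size are arbitrary choices of ours.

# The same made-up data as before.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def sse(a, b):
    """Sum of squared residuals for the line Y-hat = a + b*X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

best = None
for ai in range(-100, 101):        # candidate intercepts: -5.00 to 5.00 in steps of 0.05
    for bi in range(-100, 101):    # candidate slopes:     -5.00 to 5.00 in steps of 0.05
        a_try, b_try = ai * 0.05, bi * 0.05
        score = sse(a_try, b_try)
        if best is None or score < best[0]:
            best = (score, a_try, b_try)

print(best)    # the (SSE, a, b) combination with the smallest sum of squared residuals on the grid

Even on this tiny example the search evaluates tens of thousands of candidate lines, and it can only ever be as precise as the grid we happen to choose— which is exactly why a formula is preferable.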
In fact, there is a much easier way for us to proceed which avoids all of the previous difficulties. This new strategy will involve some calculus. We promised not to use calculus in this book and we intend to keep that promise. But, we do want you to know what it does and why it is useful in this particular situation! Assume that we did, in fact, carry out our search process and that we tried a very wide range of different a and b values. Also assume that we already know the proper value for the a coefficient. We realize that this latter assumption is terribly unrealistic; we only adopt it for the moment, because it will make it easier for us to draw a picture of the problem that we are trying to solve. With the preceding assumptions in place, let's graph the $\sum_{i=1}^{n} e_i^2$ values, plotted against the b values that produced them (remember, we've fixed a at its "correct" value, so we don't need to show it in the graph). Figure 3.17A shows this graph; the figure includes a curve, rather than individual plotted points, because the b coefficient can (in principle) be any number from negative infinity to positive infinity. Each position along the horizontal axis in Figure 3.17 represents a particular numeric value of b. The height of the curve at any given horizontal position represents the sum of squared residuals associated with that value of b. Of course, we want to find the b value associated with the lowest point on the plotted curve— that is, at the very bottom of the "U."¹⁰ In order to accomplish that, we will take a brief sidestep, and ask you to consider the slope of the plotted curve. More specifically, look at how the curve's slope changes as we move from left to right within the plotting region. As we show in Figure 3.17B, the slope of the curve near the left side of the plotting region (at position b1 on the horizontal axis) is negative and very steep. As we move toward the right, the slope of the curve remains negative, but it gets shallower (i.e., closer to horizontal), as at position b2. After we "round the curve" at the bottom of the U-shape, the slope becomes positive. Near the middle of the plotting region (say, at position b3), the positive slope would be relatively shallow, but it becomes steeper as we move even further toward the right (say, to position b4). Now, we want to find the b value at the exact bottom of the U-shape; stated differently, we want the b value associated with the exact position along the curve where the slope changes from negative to positive. At this position, the slope would be precisely horizontal, and its value would be zero. Figure 3.17C shows this situation. So, we want to find the value of b which corresponds to the position where the plotted curve has a slope of zero (which happens to be the lowest point on that curve). Please make sure this makes sense to you before proceeding!
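You can trace this U-shaped curve numerically without any calculus at all. The sketch below continues the same made-up example from before (the fixed intercept and the list of candidate slopes are arbitrary choices of ours); it holds the intercept constant and tabulates the sum of squared residuals across a range of slopes.

# Still the same made-up data; hold the intercept fixed and vary the slope.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a_fixed = 0.05                         # pretend we already "know" the intercept

def sse(b):
    return sum((yi - (a_fixed + b * xi)) ** 2 for xi, yi in zip(x, y))

for b in [0.0, 1.0, 1.5, 1.9, 2.0, 2.1, 2.5, 3.0, 4.0]:
    print(b, round(sse(b), 3))
# The printed sums fall, bottom out near b = 2, and rise again: the quadratic "U"
# described in the text (and in note 10).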
Now, we'll return to our main course from our sidestep. Is there any way for us to find the lowest point on the curve in Figure 3.17 without calculating all of those possible b values to produce the curve in the first place? As it turns out, the answer is yes! We can use differential calculus to produce a formula which gives the slope of the curve for any possible value of b; note that we can obtain this formula without even having to draw the curve at all! All we need to do is get this formula, set it equal to zero, and "work backwards" to find the b value that occurs for that particular slope. At this stage, we can also remove our unrealistic assumption that we already know the value of a. Calculus can also give us a similar formula, which provides the slope of the curve for any possible value of a. Once again, we just obtain this formula, set it to zero, and find the value of a that results. These mysterious formulas are called partial derivatives. Speaking very informally (our mathematician friends will gnash their teeth at our definitions!), the partial derivative of $\sum_{i=1}^{n} e_i^2$ with respect to b gives the slope of the graph relating the sum of squared residuals to the value of b, for any specified value of b. Similarly, the partial derivative of $\sum_{i=1}^{n} e_i^2$ with respect to a gives the slope of the graph relating the sum of squared residuals to the value of a, for any specified value of a. Since we already know that $\sum_{i=1}^{n} e_i^2$ is determined by the combination of the two coefficients (a and b), you can probably guess that these two partial derivatives are intertwined with each other. In other words, to calculate the specific numeric value of one partial derivative, you must incorporate the information from the other partial derivative, and vice versa. Enough talk, already! Let's look at the actual formulas for these two partial derivatives (note that all summations are taken across the n observations in the dataset; therefore, we omit the limits of summation in the interest of "cleaning up" the formulas):

$$\frac{\partial \sum e_i^2}{\partial b} = -2\sum X_i Y_i + 2a\sum X_i + 2b\sum X_i^2 \qquad (7)$$

$$\frac{\partial \sum e_i^2}{\partial a} = -2\sum Y_i + 2na + 2b\sum X_i \qquad (8)$$

Those strange-looking fractions on the left-hand sides of equations (7) and (8) are just standard mathematical notation for partial derivatives. In equation (7), the left-hand side is read as "the partial derivative of $\sum e_i^2$ with respect to b." The left-hand side of equation (8) is similarly verbalized as "the partial derivative of $\sum e_i^2$ with respect to a." Now, we could plug in any pair of a and b values on the right-hand sides of these two equations to find the slope of the plotted curve for that pair of values. But, instead of doing this, we want to plug a zero in on each of the left-hand sides; the reasoning is that we already know this is the slope we want. We "merely" want to find the values of a and b that are associated with this slope. So, we show the derivatives as follows:

$$0 = -2\sum X_i Y_i + 2a\sum X_i + 2b\sum X_i^2 \qquad (9)$$

$$0 = -2\sum Y_i + 2na + 2b\sum X_i \qquad (10)$$

At this stage, we can carry out some very simple algebraic manipulations. Basically, we will isolate the terms involving a and b on the right-hand side of each equation and also divide each equation by 2. Those operations produce the following results:

$$\sum X_i Y_i = a\sum X_i + b\sum X_i^2 \qquad (11)$$

$$\sum Y_i = na + b\sum X_i \qquad (12)$$

These are two simultaneous equations, in two unknowns (a and b). This rather intimidating-sounding phrase simply means that the two equations are "connected" to each other (we already know that because the same coefficients appear in both of them) and that all of the quantities except for a and b can be calculated directly from our data. The fact that the number of unknown quantities is exactly equal to the number of separate equations (two, in this case) means that we can solve the two equations together in order to produce unique values for those unknown quantities. There are a number of specific ways we could do this, but in every case, the operations would produce the following formulas for the two coefficients:

$$b = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2} \qquad (13)$$

$$a = \frac{\sum Y_i}{n} - b\,\frac{\sum X_i}{n} \qquad (14)$$

These formulas give the values of a and b that coincide with the point on the plotted curve where the slope is zero. And, as we know, these are also the values that minimize the sum of squared residuals.
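The payoff of equations (13) and (14) is that they require nothing beyond a handful of sums computed from the data. Continuing the same made-up example used earlier (the numbers are ours, purely for illustration), the closed-form solution looks like this:

# Equations (13) and (14) applied to the same made-up data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # equation (13)
a = sum_y / n - b * (sum_x / n)                                # equation (14)
print(b, a)    # roughly 1.99 and 0.05, with no searching required

The answer agrees with what the brute-force search was groping toward, but it is exact and requires no grid at all.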
Notice that the right-hand sides of equations (13) and (14) only include quantities that we already know (n) or can calculate quite readily from the data (the various sums involving the Xi's and the Yi's). This is important because it means that we can actually calculate the values for a and b using the information that is available to us. In fact, we can even go a bit farther. Equations (13) and (14) can be manipulated algebraically in ways that allow us to express them in the following manner:

$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} = \frac{S_{XY}}{S_X^2} \qquad (15)$$

$$a = \bar{Y} - b\bar{X} \qquad (16)$$

As you can see, these are exactly the same formulas for a and b that we derived earlier, when we were trying to run a line through the middle of the point cloud in the scatterplot! So, back in that exercise, we not only produced some line; we actually found the best line in the sense that it runs through the middle of the point cloud, coming as close as possible to as many of the points as possible. Here, "close" is defined in terms of the least-squares criterion. In other words, these are the coefficients which generate the line that produces the smallest possible sum of squared residuals. Equations (15) and (16) (or their equivalents, equations (13) and (14)) are so important that they have a special name: They are usually called the ordinary least squares estimators for the slope and intercept, respectively. Obviously, "least squares" refers to the fact that these formulas minimize the sum of squared residuals from the fitted line to the data points. And, "ordinary" simply means that this is our typical, "default" method for fitting the line. In later chapters, we will see that we sometimes need to modify these formulas to take into account the characteristics of particular datasets. Note, too, that "ordinary least squares" is a bit of a mouthful, so it is often abbreviated to the acronym, "OLS." Because the line produced by the OLS formulas (which we will begin calling "the OLS line" for verbal convenience) is identical to the line we fitted earlier, the OLS line shares all of its characteristics. That is, the sum (and therefore, the mean) of the residuals from the OLS line is zero, the OLS line passes through the centroid of the bivariate data, and the residuals from the OLS line are uncorrelated with the values of the independent variable. Taken together, these characteristics imply that the OLS line should run through the middle of the data point cloud in the scatterplot, if the cloud really does have a linear shape (as it did, for example, with the Medicaid data shown in Figure 3.14). And, since it comes as close as possible to as many of the points as possible (in the least-squares sense), we can be confident that this is the single best line that we could come up with to summarize the linear pattern in the data. This is an incredible accomplishment!
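These properties are easy to confirm numerically. The following sketch is our own check on the toy data used above; numpy's polyfit with degree one is used here only as a convenient stand-in for the OLS formulas just derived.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)            # degree-1 polyfit returns (slope, intercept)
e = y - (a + b * x)                   # residuals from the OLS line

print(np.isclose(e.sum(), 0.0))                   # the residuals sum to zero
print(np.isclose(y.mean(), a + b * x.mean()))     # the line passes through the centroid
print(np.isclose((x * e).sum(), 0.0))             # zero cross-products between X and e

Each check should print True (up to floating-point rounding), exactly as the derivation guarantees.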
GOODNESS OF FIT
We have found a line that we claim gives the best fit to our bivariate data (as long as we define "best" in terms of the least-squares criterion). But, how good is it? In other words, when we make the informal claim that the line passes as close as possible to as many of the points as possible, just how close does it come, in general? The answer to this question is important from an analytic perspective because it addresses the degree to which the OLS line really does approximate the entire cloud of points within the bivariate scatterplot. And, substantively, this question is important because we are trying to use the line to provide a summary depiction of the way that our Y variable is related to the X variable. In either case, the closer the points actually fall to the line, the better the summary that the line provides for us— in either a geometric or a substantive sense. It would be nice if we had some statistic that, in a single numeric value, summarizes the overall degree to which the OLS line approximates the actual set of n observations. You might be thinking that we already have a summary statistic which describes the degree to which bivariate data conform to a linear pattern: The correlation coefficient.¹¹ But, there are two reasons why we do not want to simply fall back on this statistic. First, we don't know whether the correlation coefficient and the OLS estimators are focusing on the same line within the data. If the correlation measures the degree to which the data approximate some other line, then it would hardly suffice as a goodness-of-fit measure for the OLS line! Second, remember that we do not have a very clear interpretation for intermediate values of the correlation coefficient— those that fall between zero and either positive or negative one; instead, we only offered some broad and unsatisfying rules of thumb (with disclaimers!). It would certainly be better if we could produce a statistic with a more precise meaning for any specific value it might take on in a particular dataset.
Partitioning the Variance in Y
In order to address this problem, we will begin with the equation that breaks down any observation's value on Y into its separate components:

$$Y_i = a + bX_i + e_i \qquad (17)$$

We also know that the fitted values for the OLS line are obtained from the first two terms on the right-hand side of equation (17), or Ŷi = a + bXi. So, we can substitute this in, as follows:

$$Y_i = \hat{Y}_i + e_i \qquad (18)$$

Equation (18) shows that the variable, Y, is actually a linear combination of two other variables: The fitted values (that is, Ŷ) and the residuals (that is, e). (Note that the coefficients on the two variables in this linear combination are both implicitly equal to one). We could say that the dependent variable, Y, is divided into two parts. First, Ŷ is the part of Y that is related to X, through the OLS line. Second, e is the left-over part of Y which (by definition) is not at all linearly related to X. Now, despite our emphasis throughout this chapter on fitting curves and lines to bivariate data, the real, ultimate objective is to account for the variance in Y. Since we recognize that Y is a linear combination, its variance can be broken down as follows:

$$\mathrm{var}(Y) = \mathrm{var}(\hat{Y}) + \mathrm{var}(e) + 2\,\mathrm{cov}(\hat{Y}, e) \qquad (19)$$

But, Ŷ is a linear transformation of X and we know that X is uncorrelated with e. Therefore, the last term on the right-hand side is zero. (Why?). If we treat the variances strictly as descriptive quantities for our sample of n observations (i.e., there is absolutely no question about inferences to some broader population from which these observations were drawn), then we can re-express equation (19) as follows:

$$\frac{\sum (Y_i - \bar{Y})^2}{n-1} = \frac{\sum (\hat{Y}_i - \bar{\hat{Y}})^2}{n-1} + \frac{\sum (e_i - \bar{e})^2}{n-1} \qquad (20)$$

We can multiply both sides of equation (20) by n − 1 to get rid of the fractions:

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{\hat{Y}})^2 + \sum (e_i - \bar{e})^2 \qquad (21)$$

Remember, too, that the mean of the fitted values is equal to the mean of Y, and that the mean of the residuals is always zero.
Therefore, we can simplify even further:

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum e_i^2 \qquad (22)$$

Equation (22) shows a very important result. The left-hand side of the equation contains the numerator of the variance in Y, or the sum of squares in Y. We can abbreviate this as SSY. The first term on the right-hand side in equation (22) is the sum of squares (i.e., the sum of squared deviations around the mean) in the Ŷi's. Since these fitted values fall along the OLS line, we might call this term the "regression sum of squares." We'll abbreviate this as SSReg. Finally, the second term on the right-hand side is the sum of squared residuals, which we can show as SSRes. Just to remind you, the OLS coefficients create the line that makes this residual sum of squares as small as possible. Now, SSY represents the "total spread" in the Y values around the mean of Y. SSReg might be viewed as the part of this total spread that is associated with the X values. And, SSRes could be conceptualized as the part of Y's dispersion that is not at all linearly related to the X's. Our OLS line breaks the total sum of squares in Y neatly into these two additive components, because X (and, hence, Ŷ) is uncorrelated with e. Since the sums of squares for the fitted values and for the residuals must add up to the total sum of squares in Y, we know that as one gets larger, the other must get smaller. And, any sum of squares (regardless of whether it is SSY, SSReg, or SSRes) must be non-negative (why?). Therefore, the maximum possible value of either term on the right-hand side of equation (22) is SSY, and the minimum value must be zero.
A Summary Statistic for Goodness of Fit
Next, we can divide both sides of equation (22) by SSY, to express the two additive components as proportions of the total sum of squares in Y:

$$\frac{\sum (Y_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1.00 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} + \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2} \qquad (23)$$

We are going to focus on the first term on the right-hand side. In fact, we are going to give that first fraction a special name, as follows:

$$R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} \qquad (24)$$

R² expresses the regression sum of squares as a proportion of the total sum of squares in Y. And, as such, we can use it as a goodness-of-fit statistic for the OLS line. The reasoning is straightforward. The Ŷi's represent the part of each Yi that falls exactly along the fitted OLS line. Therefore, R² shows the portion of Y's total sum of squares that is due to dispersion in these Ŷi's. The maximum possible value of R² is one; this will only occur when the fitted values are exactly equal to the actual values of Y. That, in turn, occurs when the points in the scatterplot fall perfectly into a linear pattern. Notice that, in this case, the residual for each observation would be zero, and the residual sum of squares would also be zero. At the other extreme, R² has a minimum value of zero. This occurs when SSReg = 0. But, that only happens when Ŷi = Ȳ for every observation and that, in turn, can only happen when the OLS line has a slope of zero. The latter is equivalent to saying that Y exhibits no linear responsiveness to X whatsoever.
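Here is a compact numerical check of the decomposition in equation (22) and the R² in equation (24), again using the toy data from earlier (our own illustrative numbers, not the Medicaid values):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
e = y - y_hat

ss_y = ((y - y.mean()) ** 2).sum()         # total sum of squares, SSY
ss_reg = ((y_hat - y.mean()) ** 2).sum()   # regression sum of squares, SSReg
ss_res = (e ** 2).sum()                    # residual sum of squares, SSRes

print(np.isclose(ss_y, ss_reg + ss_res))   # the decomposition in equation (22)
r_squared = ss_reg / ss_y                  # equation (24)
r = np.corrcoef(x, y)[0, 1]
print(r_squared, np.isclose(r_squared, r ** 2))   # R-squared equals the squared correlation

The decomposition holds to floating-point precision, and the computed R² matches the squared correlation between X and Y— a point we return to below.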
Values of R² that fall in between these two extremes can be interpreted very precisely as the proportion of the total sum of squares for Y that is shared with X through the OLS line. As we know, the total sum of squares for Y is the numerator of Y's variance. Therefore, speaking colloquially, R² is often interpreted as the proportion of variance in the dependent variable that is "explained by" the independent variable. Even more succinctly, some people say that R² is a measure of "variance explained," period. This terminology is very popular; in fact, it is used almost universally to describe the meaning of the R² statistic. We will go with common practice and use this phrase ourselves. But, it is very important to remember that the word "explained" should definitely not be taken very literally when doing so. You might wonder why "R" is used in the name of this summary goodness-of-fit statistic. As it turns out, R² is equal to the square of the correlation coefficient between X and Y, or r²XY. The derivation of this interesting result is shown in Box 3.ZZ. The implications are important: First, it shows that the correlation coefficient does, in fact, assess the degree to which the data conform to exactly the same line as that obtained from the OLS formulas. Second, R² provides a simple transformation of the correlation coefficient (i.e., squaring it) which produces a much better interpretation of the strength of the relationship between X and Y. Interpreting R² in terms of the proportion of variance in Y that is "explained" by X is more precise and specific than the vague guidelines that we provided earlier for the correlation coefficient.
Goodness of Fit and the Medicaid Data
In our earlier calculations, the line we fitted to the Medicaid data had a slope of 48.252 and an intercept of 301.136. We now know that this is the best-fitting line, in the sense that this is the pair of coefficients which produces the line that minimizes the sum of squared residuals. We could also say that the OLS estimates of the linear structure in the Medicaid data produce the following equation:

$$Y_i = 301.136 + 48.252X_i + e_i \qquad (25)$$

In equation (25), Yi corresponds to Medicaid spending per capita in state i and Xi is the number of Medicaid recipients in state i, expressed as a percentage of the state's population. We have already argued that this line effectively summarizes the linear pattern of data points that we observed in the scatterplot for these variables. In order to measure the degree to which the OLS line captures the variance in the dependent variable values, we can calculate the R² for this equation. The relevant calculations are shown in Table 3.5. Once again, the leftmost two columns in the table simply give the data values for the 50 observations. The third column gives the Ŷi's for each observation, calculated from the OLS equation. The fourth and fifth columns give the deviation scores for Y and Ŷ, respectively. And, finally, the last two columns provide the squared deviation scores for these two variables. As in previous tables, the sums for the variables are given at the bottom of each column. From the latter, we can see that the total sum of squares for Y is SSY = 5,013,494. The regression sum of squares is SSReg = 2,074,726. So, R² = 2,074,726/5,013,494 = 0.414. For verbal convenience, we can multiply the value of R² by 100 in order to convert the proportion into a percentage. And then, we might say that 41.4% of the variance in state Medicaid spending per capita is explained by the linear relationship with the size of the state Medicaid population.
Addressing Some Common Questions About R²
We know that R² can vary between zero and one, because it is a proportion.
And, in most cases, we are looking for relationships between our independent and dependent variables, so we would like the calculated value of this statistic to be as close to one as possible. But, it is probably natural to wonder how large this proportion needs to be. In other words, when is R² large enough (or close enough to its maximum value of 1.00) to be useful, or to indicate that a linear relationship really does exist between the X and Y variables? Unfortunately, there is just no straightforward and general answer to this question. And, for a variety of reasons, we often have to contend with fairly weak empirical relationships between our independent and dependent variables. Therefore, the R² values associated with our regression equations sometimes seem to be a bit low. For example, in the Medicaid data, where we saw a fairly obvious linear pattern in the data, the OLS line still only accounts for less than half (i.e., about 41%) of the variance in the dependent variable. As we proceed through the rest of the material in this book, we will discuss several things that tend to increase the R² value in a regression equation, such as incorporating additional variables, allowing for nonlinear relationships, and reducing measurement error in our variables. But, it is important to understand that these involve substantive and theoretical considerations as much as, or even more than, statistical issues. Another question often asked is: Which is more important, the OLS coefficient estimates or the R²? A related version of this question asks about the relative importance of the OLS coefficients versus the correlation coefficient. In fact, either version is just the wrong question to ask. The OLS estimates on the one hand and the R² or the correlation coefficient on the other hand measure different things. The OLS equation tells us which line produces the best least-squares fit to the bivariate data. The R² and the correlation coefficient provide two measures of how well the data conform to that best-fitting line. These are two quite different properties of the relationship between the independent and dependent variables. In order to show the difference between the regression coefficients and the goodness of fit (or correlation), Figure 3.18 shows four scatterplots with OLS lines superimposed over the data. First, look at panels A and B. Each of these two scatterplots contains 100 observations. Notice that the regression line is identical in the two displays; the OLS equation for both is Yi = 10.05 + 0.62Xi + ei. However, the goodness of fit is very different in each case. The R² for the linear relationship depicted in Figure 3.18A is quite large, at 0.94, while the R² for Figure 3.18B is much smaller, at 0.37. Now, look at panels C and D. In these two displays, the R² is identical, at 0.85. However, the OLS lines are quite different: In Figure 3.18C, Yi = 4.19 + 0.51Xi + ei, while in Figure 3.18D, Yi = −1.61 + 1.48Xi + ei.
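The contrast between panels A and B is easy to recreate with simulated data of our own (the seed, sample size, and noise levels below are arbitrary choices, not the values behind Figure 3.18): two samples generated around the same underlying line yield nearly identical OLS coefficients but very different R² values.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=100)

def fit_and_score(noise_sd):
    """Simulate Y around the same line, fit OLS, and report (intercept, slope, R-squared)."""
    y = 10.0 + 0.6 * x + rng.normal(0, noise_sd, size=100)
    b, a = np.polyfit(x, y, 1)
    y_hat = a + b * x
    r2 = ((y_hat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return round(a, 2), round(b, 2), round(r2, 2)

print(fit_and_score(2.0))     # intercept near 10, slope near 0.6, R-squared close to 1
print(fit_and_score(15.0))    # nearly the same line, but a much smaller R-squared

The coefficients describe the line; R² describes how tightly the points hug it— two separate questions.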
Hopefully, our brief consideration of these two questions makes you aware of two important things: First, the "best" in the "best-fitting line" is always relative; there is no set criterion for saying how large an R² is good enough in any particular research setting. Second, the OLS line, itself, is different from the data's goodness of fit to that line. Both are important and substantively interesting characteristics of the relationship between the variables and they should always be included whenever the results of an analysis are conveyed to an audience (e.g., in a report, thesis, conference paper, etc.).
LOOKING AT THE RESIDUALS
We fitted a line to the Medicaid data because visual inspection of the scatterplot suggested that there is a linear pattern in the combinations of X and Y values that observations possess. In other words, we used a line, rather than some other type of curve, because it conforms to what we believe to be the shape of the systematic structure which exists in the data. This is a critically important idea, which we want to emphasize: The curve that we fit to the data always should be consistent with the structure that exists within those data. Building upon the preceding idea, if we truly believe the structure in our data is linear, then once we remove that linear relationship from the observations, there should be no remaining systematic structure within the data whatsoever; the "left-over" variability in Y after we remove the part of each Yi that is related to X should be nothing more than random, non-systematic "noise." If there is any nonrandom, discernible structure left in the data after we pull out the linear relationship, then it would suggest that the fitted line does not adequately capture all of the interesting elements in our bivariate data. In order to investigate this possibility, it is very useful to take a careful look at the residuals from our OLS line. As we know, the residual for each observation, i, is equivalent to the signed difference between the actual value of Yi and the fitted value, Ŷi. And, the fitted values represent the part of Y that is linearly related to X; this leaves the residuals as the part of Y that is not linearly related to X. We want to make sure that there is not any other kind of nonrandom structure remaining within these residuals. And, since we don't really know what kind of structure might exist, it is probably useful to look at the residuals, in a very literal sense, using a graphical display.
The Residual Plot for the OLS Line Fitted to the Medicaid Data
Figure 3.19 highlights the residuals from the OLS line in the Medicaid data by using vertical line segments that connect the fitted values with their corresponding Yi's. We do not necessarily need these line segments and we generally will not use them in future graphs (except for pedagogical purposes); we merely show them here to focus our attention on the relevant parts of the information in the scatterplot. The problem, however, is that the basic scatterplot is not very useful for assessing patterns in the residuals. For one thing, our visual impression is dominated by the linear structure between X and Y, captured by the OLS line. Another problem is that it is difficult to make comparisons of the residuals across observations, because they are shown as deviations from a "tilted" line. It would be better if we could remove the linear trend from the graphical display and "un-tilt" the baseline from which the residuals depart. This can be accomplished very easily, by creating a scatterplot with the residuals on the vertical axis, and the fitted values on the horizontal axis. Such a display is usually called the residual versus fitted plot or just the residual plot for the OLS line. Figure 3.20 shows the residual plot for the OLS line fitted to the Medicaid data.
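Constructing such a plot requires nothing more than the fitted values and residuals we already know how to compute. The sketch below uses simulated stand-in data (our own, chosen to be roughly on the scale of the Medicaid variables); with the real dataset, x and y would simply hold the observed values.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(5, 25, size=50)                   # stand-in for the X variable
y = 300 + 48 * x + rng.normal(0, 250, size=50)    # stand-in for the Y variable

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
e = y - y_hat

plt.scatter(y_hat, e)             # residuals (vertical axis) against fitted values (horizontal axis)
plt.axhline(0, linestyle="--")    # the dashed zero baseline
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

The dashed horizontal line at zero is the baseline discussed next.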
We want to point out several features that should hold for all residual plots. First, the number of points in the residual plot is equal to the number in the original scatterplot of the Y variable against the X variable. In fact, the points represent exactly the same observations across the two graphical displays. If you look carefully at Figure 3.20, you might be able to see that it really does represent an "un-tilted" version of Figure 3.19. So we are, indeed, looking at a version of the scatterplot with the predominant linear pattern linking the number of Medicaid recipients to the amount of Medicaid spending removed. Second, the residuals from an OLS fit always have a mean of zero; therefore, that value invariably appears somewhere near the middle of the vertical axis in a residual plot.¹² Third, and closely related, if an observation had a residual of exactly zero, then its point in the original scatterplot would fall exactly on the fitted OLS line; in the residual plot, it would fall exactly at the zero point on the vertical axis. The horizontal axis in the residual plot corresponds to the fitted values. Therefore, the zero point on the vertical axis can be viewed as a baseline for evaluating the residuals. In order to emphasize this, we draw a horizontal dashed line emanating from this position along the vertical axis. Fourth, the vertical line segments we used back in Figure 3.19 are now largely unnecessary because the linear pattern has been removed, and the residuals now represent vertical discrepancies from the horizontal line. Accordingly, visual assessment is facilitated and more accurate judgments are possible. Fifth, there is absolutely no linear structure between the two variables represented by the axes of the residual plot. We can say this with perfect confidence, because we know from the properties of the OLS fitting process that the residuals (i.e., the vertical-axis variable) are uncorrelated with the fitted values (i.e., the horizontal-axis variable). Turning our attention to the contents of the residual plot in Figure 3.20, we do not see any particular patterns in the graphical display. It appears to us that the plot shows a shapeless cloud of points. Therefore, we would conclude that our OLS line really does capture the systematic structure that connects the size of a state's Medicaid population to its level of Medicaid spending. Do you agree with our assessment of this residual plot? For now, our assessment is entirely subjective, so there is certainly room for disagreement and differing opinions. Later in this book, we will discuss a variety of strategies for making more systematic evaluations of the information in a residual plot. But, for now, we think it is fairly safe to conclude that there are no easily-discernible patterns in these residuals. It may seem a little strange to say this, but the lack of any clear structure is exactly what we want to find here. And, it confirms that our linear model, constructed by fitting the OLS line, truly does summarize the interesting elements of these bivariate data.
The Residual Plot for an OLS Line Fitted to the Social Service Office Data
Before leaving our discussion of residual plots, we want to show you an example in which a linear model does not adequately capture the structure within a bivariate dataset. Remember the data on state social service offices that we examined at the beginning of this chapter?
The scatterplot in Figure 3.1 showed a distinctly curved pattern in the points, and we concluded that the size of an office's clientele tends to increase with the amount of time the office has been open, but the average size of this increase decreases over time. Any attempt to use a straight line (obtained via OLS or any other method) in order to summarize these data would produce a distorted representation of the existing structure and relationship between the length of time an office has been open and the size of its clientele. Nevertheless, we could go ahead and apply the formulas for the OLS slope and intercept to the bivariate data on social service offices. At this point, hopefully, you are asking yourself, "Why would we do this, since we have already inspected these data and know that the two variables are related in a nonlinear way?" True enough, and we agree with you wholeheartedly. But, we are trying to show you what will happen when we try to fit a model that is inconsistent with the predominant structure in the data. And, we are sorry to say that many data analysts proceed in exactly this manner: They estimate the OLS coefficients without first examining the data to see whether it is really appropriate to do so. In any event, the OLS formulas applied to the social service office data produce the following equation:

$$Y_i = 62.649 + 22.312X_i + e_i \qquad (26)$$

Here, Yi is the mean number of clients per week for the ith social service office, and Xi is the number of months social service office i has been open. If this equation is the only evidence we examine, apart from the original data, we might well conclude that the two variables are linearly related. After all, the slope has a sizable positive value, at 22.312; thus, the mean number of weekly clients does seem to increase fairly substantially with each additional month that an office is open. Furthermore, the OLS equation seems to fit the data very well: The R² value associated with this OLS line is 0.735, telling us that 73.5% of the variance in the dependent variable is shared with, or explained by, the variability in the independent variable through the linear relationship. None of the preceding statements are completely incorrect. However, they do provide only an incomplete description of the structure that exists in the data. Figure 3.21 shows the residual plot for the OLS fit to the social service office data. Just as we did earlier, jittering is applied to the horizontal-axis variable in order to break up the plotting locations a bit. As is always the case, the residuals are uncorrelated with the predicted values, so there is no linear structure to be found in this graphical display. But, if you look closely, you can see that there is a sort of "hump" in the point cloud. That is, the points in the central area of the plotting region are far more likely to fall above the horizontal baseline, while the points near the left and right sides of the plotting region tend to fall below the baseline. Figure 3.22 fits a conditional mean smoother to these data and this further emphasizes the existence of a clear systematic, albeit nonlinear, pattern in these data. The OLS line misses this nonlinearity entirely. But, it does pull out whatever linear trend exists in the data. The "hump" in the residual plot represents the structure that is left over after doing so. Therefore, we would say that the OLS equation provides an incomplete description of the relationship between the two variables.
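This kind of failure is easy to reproduce with simulated data. The sketch below is our own illustration (the sample size, seed, and square-root shape of the curve are arbitrary choices that merely mimic the flattening pattern described above); it fits an OLS line to curved data and then inspects the conditional means of the residuals.

import numpy as np

rng = np.random.default_rng(2)
months = rng.integers(1, 7, size=78)                     # X: 1 through 6 months open
clients = 60 * np.sqrt(months) + rng.normal(0, 8, 78)    # Y: rises with X, but at a decreasing rate

b, a = np.polyfit(months, clients, 1)
e = clients - (a + b * months)

# Conditional means of the residuals, by value of X: they tend to be negative at the
# ends of the range and positive in the middle -- the "hump" described above.
for m in range(1, 7):
    print(m, round(e[months == m].mean(), 2))

The residual means tend to sit below zero at the ends of the X range and above zero in the middle— the same hump that the conditional mean smoother highlights in Figure 3.22.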
Relying solely on the linear model of the data, we would miss some potentially interesting nonlinear structure. In most contexts, we believe that doing this (i.e., ignoring nonlinearity when it is clearly present in data) is a bad thing. But, there is definitely a trade-off involved in taking the nonlinearity into account. We have already stated that nonlinear relationships are generally more complex than linear relationships because the connection between the independent and dependent variables is not constant across the range of X. As we will see in Chapter 9, the specification of the equations that we use to provide a quantitative description of these relationships must be adjusted accordingly. Now, parsimony is an explicit objective of scientific explanations. Therefore, we must decide whether adjusting our model to take the nonlinearity into account is worth the costs imposed by this additional complexity. Generally, we believe that it is. However, others could well disagree. And, we want to emphasize that this is always a modeling decision that is entirely up to the data analyst.
CONCLUSIONS
We have covered a lot of ground in this chapter! We started with the general objective of regression analysis: to provide a description of how the conditional Y distributions vary across the values of the X variable. And, we focused particularly on the central tendencies of these conditional distributions, in the form of the conditional means. Scatterplot smoothing is a general strategy for superimposing a curve over the points in a scatterplot such that the curve follows the means of the conditional Y distributions. The result is a graphical form of regression analysis. We called it "nonparametric regression" because the shape of the curve is not specified in advance of fitting it to the scatterplot; instead, the form of the fitted curve is determined entirely by the bivariate data. We noted earlier, and we want to reiterate, that scatterplot smoothing and nonparametric regression are potentially vast topics that deserve far more attention than we have given them so far. We will definitely return to these ideas later in this book, where we will use scatterplot smoothing as an important tool for ensuring that our regression models produce accurate descriptions of the structure within our data. But, for now, we used the smooth curves obtained through nonparametric regression more as an exploratory mechanism for determining what the nature of the structure is within the bivariate data in the first place. We moved to parametric regression when our scatterplot smoother revealed that the predominant structure within the bivariate data really is linear in form. And, we characterized the problem as one in which we try to find the single line that best approximates that linear structure. Notice that this is a manifestation of the overall idea of a descriptive statistic. In other words, any given descriptive statistic seeks to summarize a particular characteristic of data. For example, the mean summarizes the central tendency of a univariate distribution, while the standard deviation summarizes the spread or dispersion of a univariate distribution. In the same spirit, the slope and intercept of a regression line are intended to summarize the predominant linear pattern in a bivariate data distribution. So, we are really just continuing on with a process that you began back when you started studying statistics.
Once we decide that the predominant structure within a set of bivariate data really does conform to a linear pattern, the task becomes one of positioning a line within the data cloud to provide a representative summary of that pattern. We started with an intuitive approach, in which we merely try to place the line so that it runs through the middle of the point cloud in the scatterplot. Our intuition was pretty useful in this case, because we developed formulas for the slope and the intercept which are easy to calculate and can be expressed in terms of a few summary statistics that we encountered earlier (i.e., the covariance, the variance, and the mean). But, we couldn't stop at that point, because we didn't want to just have some line— we wanted the best line. We used the least-squares criterion to define precisely what we mean by "best." And, after doing so, we discovered that our intuitively-derived line through the middle of the point cloud is also the one that produces the smallest possible sum of squared residuals, hence making it the best possible line, according to our specified criterion. We called the formulas for finding this line the ordinary least squares estimators, and we began to call the line, itself, the OLS line. Once we find the best line for a given set of bivariate data, it is natural to ask how good the line is as a description of the pattern within the data. In order to answer that kind of question, we reached back to our original motivation for performing the data analysis in the first place: We want to account for the existence of variance in the dependent variable, Y. We found that the OLS line actually breaks the values of the dependent variable down into two uncorrelated sets of values: The fitted values, which we presented as the "new" variable, Ŷ, and the residuals, which we designated as the variable, e. The fitted values represent the part of Y that is linearly related to X, through the OLS line. So, we take the sum of squares in these fitted values (which we called the regression sum of squares and designated SSReg) as the numerator of a fraction in which the sum of squares for Y, itself (which we called the total sum of squares in Y, or SSY), is the denominator. This fraction gives the proportion of the variance in the dependent variable that is linearly related to variance in the independent variable through the fitted values (which are, themselves, linear transformations of the Xi's). Also, the fraction is called R² to reflect the fact that its value is equal to the square of the correlation coefficient between X and Y. Notice that the work we carried out in this chapter is nicely consistent with some of the ideas that we first raised back in Chapter 1. The OLS line is actually a model of bivariate data in the sense that it provides a parsimonious representation of an interesting aspect of those data. The line is determined by only two parameters— the slope and the intercept. But, if the structure in the data truly is linear, then this OLS line "stands in" for the information contained in the original set of 2n data values (that is, n values of the X variable and n values of the Y variable). Of course, we recognize that this fitted line does not tell us everything about what is going on in our data; models are inherently simplifications which omit some of the detail.
And, the R² provides goodness-of-fit information to tell us how much of the data (or, more precisely, what proportion of Y's variance) is captured in the OLS regression line. It is gratifying to see that our hard work in this chapter really has taken us along the road we laid out initially in Chapter 1, using the tools that we developed in Chapter 2! Apparently, we are traveling in the direction that we intended. But, this chapter also ended on something of a cautionary note: We argued that fitting the OLS line to the data is not sufficient, in itself, if we really do want that line to be a useful model of those data. Instead, we need to take a further diagnostic step and try to guarantee that the line which we have imposed on the data really does reflect the structure that exists in those data. This leads us to the residual plot for the OLS line, which we introduced as a tool to search for any additional nonrandom patterns that may linger in the data after we have pulled out any linear structure between X and Y. The residual plot is a specific tool that we will employ frequently in subsequent chapters of this book. But, for now, we want to emphasize the underlying line of thinking that led us to it in the first place— we believe this is really, really important! Models are always artificial, simplistic constructions that the analyst imposes on his or her data. It is incumbent upon that researcher (and no one else!) to make sure that the model really does provide an accurate and reasonably complete representation of the interesting elements within the dataset under investigation. Therefore, fitting a model, such as an OLS line, is just the beginning of a more continuous process. It is always important to inspect the correspondence between the fitted model and the data (using tools like the residual plot) in order to diagnose and address any problems that may exist. Indeed, we believe that good researchers are characterized by their aggressiveness in carrying out this introspective process of always questioning their regression models and applying suitable remedies when problems are discovered. This is a theme to which we will return many times throughout the rest of this book.

Box 3-1: Derivation of the formula for the b coefficient in a line that runs through the middle of the point cloud in a scatterplot.

$$Y_i = a + bX_i + e_i \qquad (1)$$
$$Y_i - \bar{Y} = (a + bX_i + e_i) - (a + b\bar{X} + \bar{e}) \qquad (2)$$
$$Y_i - \bar{Y} = a + bX_i + e_i - a - b\bar{X} - \bar{e} \qquad (3)$$
$$Y_i - \bar{Y} = (a - a) + (bX_i - b\bar{X}) + (e_i - \bar{e}) \qquad (4)$$
$$Y_i - \bar{Y} = b(X_i - \bar{X}) + (e_i - \bar{e}) \qquad (5)$$
$$(X_i - \bar{X})(Y_i - \bar{Y}) = (X_i - \bar{X})\left[ b(X_i - \bar{X}) + (e_i - \bar{e}) \right] \qquad (6)$$
$$(X_i - \bar{X})(Y_i - \bar{Y}) = b(X_i - \bar{X})^2 + (X_i - \bar{X})(e_i - \bar{e}) \qquad (7)$$
$$\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} \left[ b(X_i - \bar{X})^2 + (X_i - \bar{X})(e_i - \bar{e}) \right] \qquad (8)$$
$$\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} b(X_i - \bar{X})^2 + \sum_{i=1}^{n} (X_i - \bar{X})(e_i - \bar{e}) \qquad (9)$$
$$\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = b\sum_{i=1}^{n} (X_i - \bar{X})^2 + \sum_{i=1}^{n} (X_i - \bar{X})(e_i - \bar{e}) \qquad (10)$$
$$\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = b\sum_{i=1}^{n} (X_i - \bar{X})^2 \qquad (11)$$
$$\frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{b\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \qquad (12)$$
$$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \qquad (13)$$

(The step from (10) to (11) uses the requirement that X and e are uncorrelated, so the sum of cross-products between them is zero.)

Box 3-2: Derivation of the formula for the a coefficient in a line that runs through the middle of the point cloud in a scatterplot.

$$Y_i = a + bX_i + e_i \qquad (1)$$
$$\bar{Y} = a + b\bar{X} + \bar{e} \qquad (2)$$
$$\bar{Y} = a + b\bar{X} \qquad (3)$$
$$\bar{Y} - b\bar{X} = a \qquad (4)$$
$$a = \bar{Y} - b\bar{X} \qquad (5)$$

(Step (3) uses the requirement that ē = 0.)
NOTES

1. Mention other ways of defining the center of each slice.
2. Boundaries are placed halfway between the maximum X value in the lower slice and the minimum X value in the higher slice (see the illustrative sketch following the figure captions below).
3. Explain again that X could be a dependent variable in some other regression. Entirely context-specific!
4. Mention PCA, which does use this criterion to fit a line.
5. In fact, the first criterion for our operational definition of "middle" is unnecessary, given this second condition.
6. Mention that these are "well-behaved" point clouds. It is possible to construct situations where most observations fall on one side of the line, but one additional observation (or a small set of observations) has extremely large values on the other side of the line, and the line is thereby "forced" away from the main point cloud. We will consider precisely this kind of situation in Chapter 11 on unusual data.
7. Fifty observations is a pretty large dataset, and we certainly do not expect you to perform calculations in a dataset of such size by hand. But, in principle, it could be done, and we want to show you the exact steps that you might follow in order to do so. Still, like any sane person, we actually rely upon the labor-saving power of hand calculators and computers to do the calculating work for us!
8. We cannot help but point out that this calculated value of the slope is remarkably close to the estimate we produced by "eyeballing" the smooth curve that we obtained by calculating the conditional mean smoother for the sliced version of the scatterplot.
9. Explain how the intercept is calculated.
10. You might wonder how we know the shape of the curve. The answer is that the sum of squared residuals is a particular kind of formula called a quadratic, and graphs of quadratics always have this shape.
11. In fact, the covariance also measures this. But we stick with the correlation because of its interpretational advantages.
12. In fact, we generally set the vertical axis so that the range of values shown in the plot is symmetric around the zero point.

FIGURES

Figure 3.1: Scatterplot of data on state social service offices, showing the number of clients per week (during the preceding month) versus the number of months the office has been open.
Figure 3.2: Fitting a smooth curve to the scatterplot of the social service office data, using conditional means. (A. Scatterplot with conditional Y means superimposed over the data. B. Conditional means connected with line segments.)
Figure 3.3: Scatterplot of data on state social service offices, showing the final conditional mean smoother.
Figure 3.4: Scatterplot showing 2005 Medicaid expenditures in the American states (dollars per capita) versus the number of Medicaid recipients in each state (as a percentage of total state population).
Figure 3.5: Scatterplot of 2005 state Medicaid data, with a basic conditional mean smoother superimposed over the data points.
Figure 3.6: Slicing the scatterplot of the state Medicaid data into six slices, and fitting the conditional mean smoother across the slices. (A. Slice boundaries added to scatterplot. B. Slice centers shown within slices. C. Conditional Y means plotted over slice centers. D. Line segments connecting the conditional means.)
Figure 3.7: Scatterplot of the 2005 state Medicaid data, showing the final conditional mean smoother fitted across ten slices.
Figure 3.8: Scatterplot of the 2005 state Medicaid data, showing the final conditional mean smoother fitted across five slices.
Figure 3.11: Scatterplot showing the centroid of the data points.
Figure 3.12: Scatterplots with lines fitted such that the summed residuals do not equal zero. (A. ∑ei = 384.414. B. ∑ei = −365.586.)
Figure 3.13: Scatterplots with lines fitted such that the residuals are correlated with the X variable. (A. rX,e = 0.851. B. rX,e = −0.908.)
Figure 3.14: Scatterplot of 2005 state Medicaid spending versus size of the Medicaid recipient population, with a line fitted through the middle of the point cloud.
Figure 3.15: Scatterplots of the 2005 state Medicaid data, showing different lines running through the "middle" of the point cloud. (A. Line Ŷi = 275.8 + 50Xi fitted to the data. B. Line Ŷi = 100 + 68Xi fitted to the data.)
Figure 3.16: Scatterplot of 2005 state Medicaid data, showing a line that does not fit the data very well, but for which the sum of the residuals equals zero.
Figure 3.19: Scatterplot of 2005 state Medicaid data with the OLS line superimposed over the data points, and residuals drawn in as vertical line segments.
Figure 3.20: Residual versus fitted plot for the OLS regression line fitted to the 2005 state Medicaid data.
Figure 3.21: Residual versus fitted plot for the OLS line fitted to the data on state social service offices.
Figure 3.22: Residual versus fitted plot for the OLS line fitted to the data on state social service offices, with a conditional mean smoother superimposed over the data points.
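The sketch below illustrates the sliced conditional-mean smoother in the spirit of Figures 3.5 through 3.8 and note 2. It is only a sketch under simplifying assumptions: the slices are formed as groups with roughly equal numbers of observations ordered by X, the "slice center" is taken to be the midpoint of the X values falling in the slice (note 1 points out that other definitions are possible), and the data values are hypothetical rather than the chapter's actual Medicaid figures.

    # Sketch of a sliced conditional-mean smoother. Slice boundaries are
    # placed halfway between the largest X in one group and the smallest X
    # in the next group, following note 2. Data values are hypothetical.

    def conditional_mean_smoother(x, y, n_slices):
        """Return slice boundaries and (slice center, conditional mean of Y) pairs."""
        pairs = sorted(zip(x, y))                  # order observations by X
        n = len(pairs)
        # Form n_slices groups with roughly equal numbers of observations.
        cuts = [round(k * n / n_slices) for k in range(n_slices + 1)]
        # Boundaries halfway between adjacent groups (note 2).
        boundaries = [(pairs[c - 1][0] + pairs[c][0]) / 2 for c in cuts[1:-1]]
        points = []
        for lo, hi in zip(cuts, cuts[1:]):
            block = pairs[lo:hi]
            if not block:
                continue
            xs = [xi for xi, _ in block]
            ys = [yi for _, yi in block]
            center = (min(xs) + max(xs)) / 2       # one way to define the slice center
            points.append((center, sum(ys) / len(ys)))  # conditional mean of Y
        return boundaries, points

    # Hypothetical data, loosely shaped like the Medicaid scatterplot.
    x = [8, 9, 10, 11, 12, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20, 21]
    y = [450, 600, 520, 700, 640, 900, 780, 850, 1000, 760,
         1100, 950, 1300, 1250, 1500, 1600]

    boundaries, smooth = conditional_mean_smoother(x, y, n_slices=4)
    print("slice boundaries:", boundaries)
    for center, mean_y in smooth:
        print(f"slice center = {center:.1f}, conditional mean of Y = {mean_y:.1f}")

Connecting the resulting (center, conditional mean) points with line segments, as in panel D of Figure 3.6, produces the smooth curve that traces the path of the conditional Y means across the slices.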