Schneider-Jacoby manuscript, Chapter 3 Draft

Now that we have covered some useful preliminary material, it is time for us to develop
the regression model, itself. In this chapter, we begin to do that, using bivariate data. That
is, we are focusing on the relationship between a single independent variable (generically
called X) and our dependent variable of interest (generically, Y ). To state our objective in
terms of the definition we provided back at the end of Chapter 1, we will develop a strategy
for describing how the conditional distribution of our Y variable changes across the range of
our X variable.
Note our emphasis on developing the regression model. We really intend to take you
through a process that reveals the underlying logic of this kind of analysis. Once again, we
do not think it would be very effective if we simply dropped some formulas in front of you.
We will eventually get to the formulas— but, we want you to understand where they come
from and why we use them.
Also note that our focus on bivariate data is somewhat artificial. In real-world applications of theory-testing and forecasting, dependent variables are almost always affected by
more than one independent variable. Nevertheless, the bivariate regression model provides
us with the simplest case. We can lay out our basic approach and develop the components
of the model in the most straightforward manner possible. We will see that all of the ideas
we discuss here can be generalized very easily to situations involving several independent
variables— that is, to multiple regression.
SCATTERPLOT SMOOTHING AND NONPARAMETRIC REGRESSION
How can we determine whether the conditional Y distributions vary across the values
of X? One straightforward approach is to simply look at the data in a scatterplot. If the
variables are related (according to the definitions we provided back in Chapter 1), then the
Y values should differ in some systematic way, across the X values.
Now, it would be difficult (if not impossible) to describe variability in the entirety of
each conditional Y distribution. So, instead, we will focus on the most salient characteristic
of each conditional distribution— its central tendency. This should make intuitive sense:
Measures of central tendency provide the single numeric value that is most typical of the
entire set of values within a variable’s distribution. Therefore, it is reasonable to examine
how this “most typical value” for Y varies across the range of values in X’s distribution.
Based upon the preceding idea, our immediate objective will be to fit a curve to the
scatterplot. This curve will trace out the path of the conditional means of Y from left to right
across the plotting region of the scatterplot; that is, the conditional means will be calculated
across the distinct values of X. This general process is called “scatterplot smoothing” because
the goal is to generate a smooth curve which captures the basic functional dependence of Y
on X. Hence, the smooth curve is actually a model of the full bivariate dataset depicted in
the scatterplot. There are a variety of specific methods (called “scatterplot smoothers”) for
creating such curves, and we will cover several in this chapter.
Starting Simple: The Conditional Mean Smoother
Consider the following scenario: State government officials establish a number of local
social service offices, to distribute public benefits to clientele. Some of these offices receive
a large number of clients, while others are not as busy. Why is this the case? We might
speculate that the number of visitors to each office is affected by the length of time that the
office has been in existence.
In order to test the preceding conjecture (or, to use the more appropriate term, the
preceding hypothesis), we obtain some data on 78 social service offices. For each office, we
have values on two variables: The number of months the office has been in existence, and
the mean number of clients per week, during the preceding month. Given the nature of our
hypothesis, the first variable is our independent variable and we can designate it as X. That
leaves the second variable as our dependent variable, Y . The full dataset is presented in the
Appendix to this chapter.
Of course, with 156 values (78 observations times two variables), we probably cannot
learn very much by looking at the data, themselves. We can probably see more by drawing
a picture. Therefore, Figure 3.1 shows the scatterplot for the number of clients per week
versus the number of months the office has been open. Note that jittering has been applied
in the horizontal direction, only. We do this because, with only six distinct values, the X
variable is relatively discrete. On the other hand, the Y variable takes on many more distinct values
across the 78 observations. So, the mean number of clients (i.e., Y ) is relatively continuous,
and does not need to be jittered.
In the present context, we are more interested in the general structure of the bivariate
data, rather than the specific values of the variables. But, just from looking at the scatterplot,
we can see that there are systematic shifts in the conditional Y distributions. For example,
when X = 2, the Y values range from about 60 to about 125. But, when X = 5, the Y
values span an interval from just under 150 to about 190.
So, from a quick inspection of the display alone, we can see very easily that the Y values
do tend to increase with X. But, can we be more specific? Can we make a more detailed
statement about the way the conditional Y distributions shift across the range of X?
In order to do so, we will fit a very simple scatterplot smoother to the data in Figure
3.1. Again, our objective is to see how the conditional means of Y vary as X increases from
its minimum value (i.e., one) to its maximum (i.e., six). In order to do that, we can simply
group the 78 observations into six categories, based upon their X values and calculate the
mean Y value within each of the groups. Thus, we calculate the conditional means, directly
from the data.
Table 3.1 shows the conditional means calculated for each of the six X values. So, for
example, the first entry in the second column of the table is 64.50, and this is the mean
of the Y values for the two observations for which X = 1. Similarly, 100.43 is the mean
Y for the seven observations with X values of two, and so on. Note that there is only one
observation with X = 6, so the conditional mean of Y for that X value is just the Y value
for that single observation (or 170.00).
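For readers who want to see the bookkeeping, the following is a minimal sketch (in Python) of the conditional mean smoother for a discrete X variable. The function and the tiny illustrative dataset are our own, not the 78-office data from the Appendix; applied to the actual data, the same calculation would reproduce the conditional means reported in Table 3.1.

import numpy as np

def conditional_means(x, y):
    # Return the distinct X values and the mean of Y at each one.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    levels = np.unique(x)                                  # distinct X values, in order
    means = np.array([y[x == v].mean() for v in levels])   # conditional means of Y
    return levels, means

# Hypothetical mini-example (not the actual office data):
x = [1, 1, 2, 2, 2, 3, 3]
y = [60, 69, 95, 100, 106, 120, 130]
levels, means = conditional_means(x, y)
# Plotting 'means' against 'levels' and connecting adjacent points with line
# segments produces the smoother shown in Figures 3.2 and 3.3.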
In Figure 3.2A, we repeat the scatterplot of mean clients per week versus number of
months. Here, however, we superimpose six new plotted points, representing the conditional
means of Y , plotted against their corresponding X values. These new points are shown as
filled black diamonds to distinguish them from the points that represent the data. Note
also that these new points are not jittered; they are plotted directly over the appropriate
X values. Figure 3.2B uses line segments to connect adjacent conditional means. And,
then, Figure 3.3 shows the final smoothed scatterplot. Here, we remove the points for the
conditional means, leaving only the data and the “curve” (admittedly, more a slightly jagged
series of connected line segments) that traces out the locations of the conditional means.
It may not seem like we have done very much by adding the curve to the scatterplot.
Believe it or not, however, we have just performed a regression analysis! The curve that
we have “fitted” to the data in the scatterplot provides a succinct— albeit graphical—
description of the variability in the conditional Y means across the X values. From the
curve, we can see that the average Y tends to increase very sharply across X values from
one to three. After that, the average Y continues to increase with larger X values, but the
rate of increase decreases. In other words, after we move beyond three months (i.e., to the
right within the plotting region of the scatterplot), each additional month of existence leads
to a smaller increase in the average number of clients. So, our “regression analysis” really
does enable us to make a more detailed statement about the way that an office’s number of
clients is related to the length of time it has been open. It has provided an insight about the
data that we probably would not have seen by looking only at the data values, themselves.
Let us pause here to consider what we have accomplished. Stating the problem a bit
differently from the way we did earlier, we can say that we have tried to account for the
variance in our Y variable by hypothesizing that the distribution of the Yi ’s changes systematically across the values of X. The smooth curve in Figure 3.3 confirms our hypothesis, by
showing that the average Y does increase as we move from smaller to larger X values. And,
the curved appearance, itself, shows us how the rate of increase also varies systematically
with the X values. Going back to some of the ideas presented in Chapter 2, we would
conclude that there is a nonlinear relationship between the length of time an office has been
open and the number of clients it serves, because the shape of the curve shows us that the
“responsiveness” of Y to X varies across the range of X.
Of course, central tendency (as captured in the mean) is only one characteristic of a
distribution. We have not attempted to describe any of the other possible characteristics of
the conditional distributions (such as their spread or shape). Does that mean we are missing
anything important? A general answer to that question depends upon the context. But,
here, we would say no: While the number of observations at each X value varies a bit, the
amount of vertical spread around each conditional mean seems to be fairly constant, except
for the two endmost values, where there are very few observations. Similarly, the vertical
dispersion around the conditional means appears to be fairly symmetric in each case (again,
with the exceptions of the two endmost X categories). So, we might argue that the only
“interesting” differences across the conditional distributions are their locations. And, this
latter property is captured very adequately by the conditional means alone.
The procedure we have used to fit the smooth curve to the data in the scatterplot is
sometimes called the “conditional mean” smoother. The origin of this name should be
obvious, since the curve is created by simply connecting the conditional means of Y for
each of the adjacent X values. There are many other kinds of scatterplot smoothers, which
differ in various ways from our relatively simple conditional mean procedure. They all
fall under the general category of “nonparametric regression.” Scatterplot smoothers are
regression analyses because they are intended to describe the variability in the conditional
Y distributions across the X values in a scatterplot. They are nonparametric in the sense
that the shape of the curve is determined entirely by the data; it is not specified in advance
by the researcher. This latter property makes scatterplot smoothers very useful tools for
exploring structure in bivariate data, because they do not impose any particular form on
that structure. While we will devote a little more attention to scatterplot smoothing in this
chapter, it is not our main focus. But, we will return to it in Chapter 9 when we discuss
nonlinear relationships between variables.
Dealing with Continuous X Variables: Slicing a Scatterplot
The basic conditional mean smoother works pretty well when the X variable is discrete.
However, it can be problematic when X is a continuous variable because there are generally very few observations at each distinct value of X (at least, in samples of finite size).
Therefore, the conditional means are generally not very reliable estimators for the central
tendencies in the conditional Y distributions.
To see the problem that occurs, we will return to the data on Medicaid spending in
the American states. Figure 3.4 shows the basic scatterplot between Medicaid expenditures
(per capita) in 2005, and the number of Medicaid recipients as a percentage of the state
population. As we have already seen, the Y values do tend to get larger as the X values get
larger. In order to summarize this pattern, we could try fitting a conditional mean smoother
to the data, producing the graph shown in Figure 3.5. Far from a “smooth” curve, the
conditional mean procedure has generated a series of very sharp zigs and zags across the
width of the plot. The zig-zag progression does seem to be tilted upward, such that the
general path of the line segments moves from lower-left to upper-right. But, we already
saw this basic feature in the original data; we didn’t need this “smoother” to draw that
conclusion.
The problem in Figure 3.5 is that the X variable is relatively continuous. There are 50
observations in the dataset (each one is a state), and there are 43 distinct values for the
number of Medicaid recipients as a percentage of the state population. Because of this, most of the conditional means are
“calculated” from a single data value. It seems a bit nonsensical to regard such values as
summary statistics for a conditional distribution, since there are no empirical Y distributions
for 36 out of the 43 existing X values! The seven remaining X values have two observations
apiece, so it is still a bit of a stretch to call these “distributions.” The net result is that the
contents of Figure 3.5 look more like the outcome of a connect-the-dots game than a model
which captures the interesting features of the bivariate data. And, because we believe the
zigzags in the not-so-smooth smoother represent fluctuations due to unreliable estimation
of the conditional means, we are not particularly interested in them. We would prefer to
remove the zigzags, somehow, so we can discern more interesting features of the bivariate
data.
What we need to do is bring more observations into the calculation of each conditional
mean of Y . Hopefully, this will produce more stable estimates of the average Y values that
arise within different segments of X’s range. The obvious way to proceed is to take the
conditional Y means within intervals of X values, rather than at each specific X value. This
leads to a general strategy called “slicing a scatterplot,” so named because the intervals of
X values form a sequence of adjacent vertical strips or slices, laid out along the horizontal
axis of the plotting region.
We first need to determine a strategy for creating the slices, and also decide how many
slices we will use. There are several ways we could define the scatterplot slices. For example,
we could create fixed-width slices, such that each slice corresponds to a uniformly-sized
interval along the X axis. The problem with this approach is that fixed-width slices would
probably contain differing numbers of observations from one slice to the next. As a result,
the stability of the conditional means would vary across the scatterplot. In order to avoid
this, we will create the slices so that each one contains the same number of observations.
Of course, this means that the physical widths of the slices may vary within the scatterplot;
but, as we will see, this is not a serious problem at all. Next, we need to decide how
many observations will be included within each slice; of course, this decision effectively
determines the number of slices. For now, we will simply specify that each slice contains five
observations, or 10% of the total number of observations (since the observations are states,
n = 50). Generally, we would say that arbitrary specifications of this type are not a good
idea. But, we need something to get started. And, for that matter, we can always repeat
our analysis with different numbers of slices and see what differences emerge from changing
the slice definitions.
Table 3.2 shows the calculations for finding the conditional means. First, the fifty observations are sorted from smallest to largest, according to their values on the X variable.
Then, we divide the observations into ten groups of five observations each. These groups are
separated by the dashed lines in Table 3.2; you probably already realize that each of these
groups represents the subset of observations that fall into one of the slices. Within each
group, we calculate the mean of the Y values (by taking the sum of the Yi ’s and dividing by
five) to produce our set of ten conditional means. Now, we plan to plot these conditional
means in a scatterplot. The conditional Y means are the vertical axis coordinates for the
points, but we still need to specify some corresponding coordinates for each conditional mean
along the horizontal axis. In principle, we could plot each conditional mean at any horizontal
position within its slice; but, it seems to make the most sense to place each conditional mean
at the center of the respective slices. And, we will operationalize the “centers” as the means
of the X values that are contained within each slice.1 So, the calculations for the conditional
X means (which we prefer to call the “slice centers”) are also shown in the table.
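To make the slicing procedure concrete, here is a hedged Python sketch of the calculations behind Table 3.2; the function name and arguments are our own invention. Sorting on X, cutting the sorted observations into equal-count groups, and averaging X (the slice center) and Y (the conditional mean) within each group is all that is required.

import numpy as np

def slice_smoother(x, y, per_slice=5):
    # Equal-count slicing: return the slice centers (mean X within each slice)
    # and the conditional means of Y.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)                      # sort the observations by X
    x_sorted, y_sorted = x[order], y[order]
    n_slices = len(x) // per_slice             # e.g., 50 states / 5 per slice = 10 slices
    centers, cond_means = [], []
    for k in range(n_slices):
        group = slice(k * per_slice, (k + 1) * per_slice)
        centers.append(x_sorted[group].mean())       # the "slice center"
        cond_means.append(y_sorted[group].mean())    # conditional mean of Y
    return np.array(centers), np.array(cond_means)

# Connecting the (center, conditional mean) pairs with line segments gives the
# smoother superimposed on the data in Figures 3.6 and 3.7.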
Figure 3.6 shows how we use the results from Table 3.2 to create the smoothed scatterplot.
The first panel in the figure shows the scatterplot of the data, with the slice boundaries drawn
in.2 Panel B adds the slice centers, represented as the solid triangles near the bottom
border of the plotting region. Then, panel C plots points (i.e., the solid circles) at the
height of each conditional Y mean directly over the respective slice centers. Finally, panel
D “connects the dots” between the conditional means. Once we have done that, there is no
reason to retain the slice boundaries, the locations of the slice centers, or the plotted solid
circles. So, Figure 3.7 removes these elements from the graphical display and shows the final
scatterplot with the conditional mean smoother superimposed over the points.
What can we see in the smoothed scatterplot? Our eyes are probably drawn to the
“flutter” that occurs in the middle portion of the curve. But, looking on either side of that,
the “ends” of the curve seem to trend upward (i.e., from lower-left toward the upper-right)
within the plotting region at a fairly constant rate. That seems very reasonable: It suggests
that Medicaid spending increases steadily with the size of the Medicaid population. But, do
we have any substantive explanation for the fluttering in the middle of the graph? Probably
not (at least, nothing occurs to us). Instead, we suspect that the zigzags within the fitted
smooth curve represent random fluctuations which arise because the conditional Y means are
based upon only five observations each. In other words, they probably stem from sampling
error, which is not very interesting from a substantive perspective.
We could try to remove the zigzags from the smoother by increasing the number of data
points included within each slice. That should help to stabilize the computed values of the
conditional means, since more observations would be used in the calculation of each one. Of
course, it also makes the slices wider and decreases the total number of slices. Doing that
might cause us to miss details in the bivariate structure. We are not too worried about the
latter problem here, because the zigzags in Figure 3.7 really do look like random fluctuations.
But, it is important to understand and keep in mind that all of our decisions about fitting
a model to data involve both benefits and costs. Here, our “model” is the conditional mean
smoother. The anticipated benefit of increasing the number of observations within each slice
is that the fitted smooth curve should be easier to comprehend. The potential cost is that
we may inadvertently make the curve too smooth, in the sense that it “misses” interesting
features of the data. This trade-off between costs and benefits generated by our decisions
about a statistical model is a theme that we will confront many times throughout the rest
of this book!
Returning to the problem at hand, we will create slices in the Medicaid data which
contain ten observations each; since n = 50, this leaves us with five slices. Table 3.3 shows
how the slices and the conditional means are defined. Apart from the wider slices, the steps
in the process are identical to those that we used previously. And, Figure 3.8 shows the
smoothed scatterplot that is produced from these calculations. As you can see, the fitted
curve is definitely smoother than it was before. In fact, the “curve” is very nearly a straight
line!
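In terms of the hypothetical slice_smoother sketch from the previous section, this wider-slice version is just a one-argument change (the array names below are placeholders for the actual Medicaid variables):

# Ten observations per slice -> five slices when n = 50 (Table 3.3, Figure 3.8).
centers, cond_means = slice_smoother(medicaid_pct, medicaid_spending, per_slice=10)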
The evidence in Figure 3.8 suggests that Medicaid spending responds to the size of
the Medicaid population in a manner that is constant, regardless of the specific values that
either of these variables take on. So, looking at the graph, let’s focus on the position along the
horizontal axis where X = 10. Scan upward, until you encounter the curve; then scan over
in the horizontal direction to the left-hand vertical axis. The Y value there is approximately
750 or so— note that we are not saying this is the precise value of Y corresponding to
an X of 10. We are just “eyeballing” the scatterplot right now, so this is definitely just an
approximate figure. Anyway, move over to the right along the horizontal axis, to the position
where X = 15, and repeat the process. That is, scan up to the curve, and then over to the
vertical axis. Now, the Y value is about 1,000. So, a difference of five units along the X axis
(i.e., from 10 to 15) corresponds to a difference of approximately 250 units in the conditional
means that are associated with those two values (i.e., from about 750 to about 1000).
Since the curve is actually linear, we know that we should see a similar increase in the
conditional means for a five-unit positive difference anywhere along the range of X values.
We can confirm this by looking at the difference that occurs for, say, the interval from X = 15
to X = 20. We’ve already said that the curve crosses the former X value at a Y value of
about 1000. When X = 20, it looks like the curve corresponds to a Y value about half-way
between the tick marks for 1000 and 1500 along the vertical axis; of course, this is at about
1250 (again, we are just eyeballing to obtain this value). So, here too, an increase of five
units along the X axis corresponds to a 250-unit difference in the position of the curve along
the vertical axis. This is just what should occur, given the linear relationship revealed by
the smooth curve that we have fitted to the bivariate data.
So, what have we accomplished by carrying out this second exercise in scatterplot smoothing? Once again, we have performed a regression analysis. We say this because the smooth
curve in Figure 3.8 describes the way the conditional Y means vary across the range of the
X values. More specifically, we carried out a nonparametric regression, because we did not
specify the shape of the curve before we fitted it to the data. The results enable us to make
some very detailed statements about the way that state-level Medicaid spending relates to
the size of the Medicaid-eligible populations within each state. Specifically, the relationship
between the two is linear, telling us that the relative differences across the two variables are
constant, across the entire range of the X variable.
In order to make our description of the relationship between the two variables a bit more
efficient, we can take the results from our visual inspection of the smoothed scatterplot
and take the ratio of the difference in the conditional Y means that corresponds to a given
difference in the X values. Recall that a difference of five units on X corresponds to a
difference of 250 units in the vertical location of the smooth (and almost perfectly linear)
curve. Therefore, dividing the latter value by the former, we can say that the slope of the line
tracing the conditional means is 250/5 = 50. Stated verbally, a one-unit positive difference
in the size of the Medicaid population corresponds to a difference of 50 units in the average
level of Medicaid spending. Thus, our simple nonparametric regression allows us to make a
very precise and succinct statement about the nature of the relationship between these two
variables.
The Transition from Nonparametric to Parametric Regression
Up to this point, we have performed simple types of nonparametric regression analysis.
Again, this approach is justified because we want to explore the relationship between two
variables. But, the output from a nonparametric regression is entirely graphical: It consists
of the smooth curve fitted to the data in the scatterplot. If the shape of the curve is relatively
simple (as it was in the two examples we just analyzed) then we can interpret the pattern
that emerges from the data in substantive terms. But, even though our interpretations are
fairly detailed statements about the ways that the respective Y variables “respond” to their
corresponding X variables, they still retain a high level of subjectivity. This is due to the
fact that interpretation of a smoothed scatterplot proceeds through visual inspection of the
fitted curve, alone.
In fact, we usually want to make relatively specific assessments of one variable’s relationship with, or impact upon, another. And, this often leads analysts to quantify the central
characteristics of the empirical relationship. We tried to do that with the Medicaid data, by
“eyeballing” the smooth curve, discerning that it was linear, and “guesstimating” the slope
of that line. Even though the outcome of that exercise (i.e., our conclusion that the slope
of the line is about 50) seems like a reasonable description of the bivariate data, we can
probably do better if we rely upon a strategy that eliminates the subjectivity and potential
inaccuracy that may arise from visual inspection.
Proceeding along the latter path, we have noted that the conditional means in Figure 3.8
trace out a linear pattern across the scatterplot. And, we remember from our discussion in
Chapter 2 that a line can be described pretty easily in quantitative terms, using an equation
of the form Yi = a + bXi . Therefore, we might restate our research task (in somewhat
informal terms) as follows:
If a linear structure exists in the conditional Y distributions, across the range of
X values, then we will fit a line to the point cloud in the scatterplot such that
the line passes through the middle of the cloud.
By “fitting a line,” we mean finding the equation that describes that line. If we can
accomplish this task, then the coefficients in the equation should be useful for describing the
predominant pattern (i.e., the linear structure) that exists in the bivariate data. Notice how
the line (once we find it) is a model of our data. It is a simplified description, which only
requires two numeric values (or “parameters”), the coefficients a and b, to represent the 2n
numeric values (i.e., the values of the n observations on the X and Y variables) plotted in
the original scatterplot. The equation for the line “describes” the data in the sense that the
intercept (a) summarizes the “height” of the point cloud over the X axis, and the slope (b)
gives the general orientation (or “tilt”) of the cloud within the plotting region. Of course, the
line will only be a general summary of the data; as a simplified representation, we lose some
of the detail contained in the specific data values for the respective observations. But, if the
point cloud truly does adhere to a generally linear shape, then, we could use the fitted line
to characterize the interesting features of the data, without having to refer to the individual
data points, themselves.
By stating the problem in the preceding manner, it may seem that we have moved away
from our original strategy of tracing out the conditional Y means across the X values. After
all, the sentence about the point cloud doesn’t mention conditional means or conditional
distributions anywhere. But, this is not so worrisome. If the shape of the point cloud in
the scatterplot really is linear, then passing a line through the middle of the cloud should
really trace out a path that comes very close to (and may coincide with) the conditional Y
means. And, in the next chapter, we will bring the discussion right back to the problem of
placing the line through the conditional means, anyway. So, we are stating the problem a
bit differently than before, but we are not really moving to any new research objective in
any serious way.
Notice, however, that we have made a fairly important change in the kind of strategy
that we are going to employ in achieving that research objective. With our previous nonparametric regression approach, we passed a curve through the conditional means without
specifying in advance what the shape of that curve would be; in fact, determining the shape
or form of the relationship between X and Y was an integral element of our analysis in each
of the examples we presented. Now, we are willing to specify the shape of the curve before
we go about fitting it to the data. So, we are satisfied that an interesting feature of the
Medicaid data is the linear form of the relationship between the two variables. We just have
to determine which line most accurately characterizes the bivariate data. Therefore, we are
now employing a parametric regression strategy in the sense that we know which parameters
we are looking for— the intercept and the slope of the line that passes through the middle of
the point cloud. We simply need to devise a strategy for calculating the appropriate numeric
values for the parameters, a and b.
FITTING A LINE TO BIVARIATE DATA: AN INTUITIVE APPROACH
We want to fit a line to the bivariate Medicaid data so that it passes through the middle
of the point cloud. You are probably already getting tired of reading that phrase! But, it is
an important statement because it describes our immediate analytic objective.
The “Middle” of What?
In order to proceed, we have to define precisely what we mean by “middle.” Here, that
word refers to a line within the scatterplot which is located so that data points fall on either
side of it. That conception of “middle” could mean at least three different things. These
are illustrated in the panels of Figure 3.9. First, “middle” could be defined in terms of
horizontal distances between the points and the line, as shown in Figure 3.9A. In other
words, we would try to find a line that achieves a sort of balance in terms of points
falling to the left and the right of that line. However, we probably don’t want to do this,
because a horizontal conception of “middle” would imply that we are using the values of
the X variable to determine the line’s placement. Remember that our ultimate objective is
to determine the sources of variation in the Y variable, not X. We will simply regard the
specific Xi ’s in our observations as given quantities.3 So, this is not the criterion that we
will use to fit the line.
A second possibility would be to define the “middle” in terms of the direction that is
perpendicular to the predominant orientation of the point cloud within the scatterplot. This
would imply locating the line on the basis of the perpendicular distances from the points to
the line, as shown in panel B of Figure 3.9. At first, this might seem to have some intuitive
appeal, since it corresponds to the shortest possible distance from each point to
the line. But, this criterion is not particularly useful for
present purposes: Consider the point labeled “i” in the figure; the perpendicular segment
from this point (shown as the dashed line) intersects the fitted line at the point labeled
“î”. We can use the Pythagorean theorem to calculate the distance between these points as
dist(i, î) = [(Xi − Xî)² + (Yi − Yî)²]^0.5. Thus, we can see that we would still have to take the
X values into account when we calculate the perpendicular distances. And, the measurement
units for these distances would be difficult to interpret whenever the X and Y variables are
measured in different units (as they are with the Medicaid data). So, we will not use this
criterion, either.4
The third alternative is to define “middle” in terms of the vertical direction, as shown
in Figure 3.9.C. This has considerable appeal. For one thing, this means that the distances
from the points to the line are calculated using only the Y values— X does not appear in
the formula for the distance when we measure it in the exactly vertical direction. Notice in
the figure that data point i and the point, î, along the line share the same value of X, which
we will designate as Xi . Therefore, the distance between these two points is:
dist(i, î) = [(Xi − Xi)² + (Yi − Yî)²]^0.5
           = [0 + (Yi − Yî)²]^0.5
           = [(Yi − Yî)²]^0.5                                    (1)
So, we have eliminated the X variable from our conception of middle. There is an added
advantage in that the distances from the points to the line are now measured in the units of
the Y variable— this is much simpler than the combined units that we would have had to
use with the second criterion, above. And, using the middle in the vertical direction implies
that we are locating the line on the basis of the dependent variable. That is probably a good
thing, since it is really the variability in Y that is motivating our research in the first place.
Thus, we will want our eventual fitted line to pass through the middle of the points in the
scatterplot, with “middle” conceptualized in the vertical direction.
Defining Two New (Imaginary) Variables
Now that we have clarified our conception of “middle,” it is useful to re-state the line-fitting
process. Our original dataset contains two variables which, for convenience, we label
X and Y . We will add a third, new, variable to this dataset which we will call Ŷ . So, each
observation, i = 1, 2, . . . , n, has not only an X value (Xi ) and a Y value, Yi , but also a value
on the new variable, Ŷi . Note that Ŷ is an “imaginary” variable, with no empirical existence;
it is just a set of n numeric values that we will make up for our own analytic convenience.
Now, the distinguishing feature of this new imaginary variable is that it is a perfect linear
transformation of the X variable, such that for any observation, i, the following relationship
will hold:
Ŷi = a + bXi          (2)
If we were to plot the Ŷi ’s against the Xi ’s for the full set of n observations, the points
would (of course) fall perfectly along a line. This is true, by definition, since Ŷ is a linear
transformation of X. Our task is to find values for the coefficients, a and b, that produce a
line that meets our criterion. That is, it runs through the middle of the point cloud, with
“middle” defined in terms of the vertical direction.
Let us agree upon some terminology involving this new variable, Ŷ . For many purposes,
we could just call it “Y hat.” In fact, many data analysts do exactly that. For the moment,
though, we prefer to call the Ŷi ’s the “fitted values” for our observations, since they all fall
along a line that we will (eventually— we promise!) fit to the bivariate data. Also note that
many researchers and texts refer to the Ŷi ’s as the “predicted values” of the Y variable.
We will eventually do that, too. But, this term suggests a substantive interpretation that
we are not ready to impose; so, we’ll defer using it for awhile.
For most observations, Ŷi will not be exactly equal to Yi , since the actual data points
do not fall along a single line. Therefore, we will create still another new variable; this one
measures the difference between the Yi and Ŷi for each observation:
ei = Yi − Ŷi          (3)
This new variable, e, is called the “residual” and it represents the signed vertical discrepancy
between each observation’s value on Y and its value on Ŷ . Of course, the size of any
observation’s residual is just the distance from its point to the line, evaluated in the vertical
direction. Note, however, that distances are always positive. Residuals can be either positive
or negative.
Figure 3.10 shows a simple scatterplot with only five observations and a fitted line. The
Y and Ŷ values for observation 1 (the leftmost data point) are shown along the left-hand
vertical axis. Since the former is larger than the latter (as you can see from their relative
positions along the axis), the residual for this observation will be a positive number (that is,
e1 > 0). The Y and Ŷ values for observation 4 (the rightmost data point) are shown along the
right-hand vertical axis. Here, Ŷ4 is larger than Y4 ; therefore, the residual for observation
4 is a negative number (e4 < 0). Observations with points located above the line will
have positive residuals, while those represented by points below the line will have negative
residuals. An observation with a point that falls exactly on the line (such as observation
3 in Figure 3.10) will have a residual of zero. We do not show the residuals explicitly for
observations 2 and 5 in the figure. But, before proceeding, you should make sure that you
can see how the former would have a relatively small negative residual, while the latter would
have a relatively large positive residual.
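To make the sign convention concrete, here is a small numerical sketch; the five data points and coefficient values below are invented for illustration and are not those plotted in Figure 3.10.

import numpy as np

a, b = 2.0, 1.5                        # hypothetical intercept and slope
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.5, 4.0, 6.5, 9.0, 8.0])

y_hat = a + b * x                      # fitted values, all falling on the line
e = y - y_hat                          # residuals: positive above the line, negative below
# e works out to [1.0, -1.0, 0.0, 1.0, -1.5]; the third observation lies exactly on the line.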
Operationalizing “the Middle of the Point Cloud”
Now, we are ready to define what we mean when we say that the line (which is defined
by the Ŷi ’s, which are in turn defined by the values of the coefficients, a and b) must pass
through the middle of the point cloud in the scatterplot. Specifically, we will define “passing
through the middle of the point cloud” in terms of three criteria. First, we will require the
line to pass through a location in the scatterplot called “the centroid.” This strange-sounding
term simply refers to a point with coordinates that are the means of the two variables, or
(X̄, Ȳ ). To state the criterion a little differently, we require that ŶX̄ = Ȳ . Verbally, the
preceding expression says that the fitted value for an X value that is equal to the sample
mean of X must be the sample mean of Y . Still another (and, perhaps, more succinct) way
of showing the same thing is:
Ȳ = a + bX̄          (4)
The means measure the central tendencies of the two variables’ distributions. So, it seems
entirely reasonable to us that a line going through the “middle” of the bivariate data must
pass through this particular location. To provide some visual confirmation of what we
are talking about, Figure 3.11 shows a scatterplot with 25 hypothetical observations. The
observations are plotted as open circles, and the centroid is shown as the single solid black
circle. As you can see, it truly does appear in the center of the point cloud. And, it is difficult
to imagine how we could run a line through the middle of the cloud without intersecting this
particular location!
Second, we will require that the residuals have a mean of zero, or ē = 0. That also implies
that ∑ ei = 0 (summing over i = 1 to n), since the sum is the numerator of the sample mean.5 This requirement is
also pretty reasonable: The only way we can obtain a mean residual of zero is when positive
and negative values are summed, so they cancel each other out. And, as we saw from our
discussion of Figure 3.10, this happens when some of the points are above the line and others
are below the line. Once again, that is exactly what we would want from a line that runs
through the middle of the points. In order to convince you that this second condition is
important, Figure 3.12 shows two situations where ∑ ei ≠ 0. The first panel of the figure
shows a situation where the residuals sum to a positive number. The geometric implication is
that the points tend to fall above the line (here, in fact, they all fall above the line!). Hence,
it is not reasonable to say that this line passes through the middle of the point cloud. The
second panel of Figure 3.12 shows the opposite situation— the residuals sum to a negative
value, meaning that most observations fall below the line. Once again, you do not need a
great deal of technical expertise to see that this line does not run through the middle of the
data!6
Third, we will require that the fitted line be positioned so that the values of the residuals
are uncorrelated with the values of the X variable, or rX,e = 0. At first glance, this condition
may seem a little strange— or, at least, less intuitive than the other two. Nevertheless, it
is critically important for making sure that the line passes through the middle of the entire
point cloud. If this condition were violated, the size of the residuals would trend along with
the size of the X values, and the fitted line would be “off” from the point cloud, even if
the previous two conditions are met. Figure 3.13 uses the same hypothetical 50-observation
dataset as Figure 3.12 to show what happens when rX,e ≠ 0. In panel A, the X variable
and the residual are positively correlated, with rX,e = 0.851. The value of this coefficient
implies that the values of the residual should tend to increase along with the values of X.
We can see that the residuals near the left side of the plotting region in Figure 3.13A tend
to be relatively large and negative (i.e., the plotted points are far below the line). As we
move toward the right, the points tend to be closer to the line, although still below it. This
means the values of the ei ’s are becoming larger (i.e, the negative numbers approach zero).
On the right side of the plotting region, the residuals tend to be positive (i.e., most of the
points are located above the line) and their values tend to increase as we approach the right-hand edge of the plotting region. So, the residual does tend to increase, moving from left to
right in the scatterplot. And, of course, the horizontal axis corresponds to the X variable, so
its values definitely increase as we move in the same direction. That produces the positive
correlation between X and e. And, this correlation, in turn, produces a fitted line that is
“tilted” too far toward the horizontal, producing an excessively shallow slope, relative to the point cloud.
Figure 3.13B shows that the opposite occurs when X and e are negatively correlated.
Here, rX,e = −0.908, so the ei ’s tend to decrease as the Xi ’s increase. This produces a fitted
line that is not tilted enough toward the horizontal, relative to the point cloud; its slope is too steep.
In contrast to the situations shown in the first two panels of Figure 3.13, the third panel
shows the line that occurs when rX,e = 0 (again, we are assuming that the first two conditions
are met). As you can see, there is no discernible trend in the values of the residuals. Within
any narrow interval along the X axis, there are points plotted both above and below the
line; there is no tendency for small X values to coincide with negative values of e, and vice
versa. And, the result is a line that we could reasonably describe as running through the
middle of the entire cloud, from left to right within the plotting region.
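Because all three criteria are stated in terms of quantities we can compute, it is easy to check a candidate line numerically. The helper below is our own sketch, not something from the text: given the data and a proposed a and b, it reports whether the line passes through the centroid, whether the residuals average zero, and whether the residuals are uncorrelated with X (i.e., whether their sum of cross-products with X is zero).

import numpy as np

def check_middle_criteria(x, y, a, b, tol=1e-8):
    # Check the three "middle of the point cloud" conditions for a candidate line.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    e = y - (a + b * x)                                        # residuals
    through_centroid = abs(y.mean() - (a + b * x.mean())) < tol
    mean_residual_zero = abs(e.mean()) < tol
    uncorrelated_with_x = abs(np.sum((x - x.mean()) * e)) < tol
    return through_centroid, mean_residual_zero, uncorrelated_with_x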
Fitting a Line to Bivariate Data (Finally!)
At last, we are ready to proceed with our task of fitting a line to bivariate data such that
our line runs through the middle of the point cloud in the scatterplot. And, we’ll use the
operational definition of “through the middle of the point cloud” that we developed in the
preceding sections. Let’s begin by rearranging our definition of the residual for
arbitrarily-selected observation, i, as follows:
Yi − Ŷi = ei          (5)
Yi = Ŷi + ei          (6)
Yi = a + bXi + ei          (7)
Line (5) in the preceding derivation simply repeats equation (3), which defines the residual
for an observation. Next, Ŷi is added to both sides of the equation; of course, it cancels
out on the left-hand side, producing equation (6). Finally, in equation (7), we expand the
right-hand side using the definition of Ŷ as a linear transformation of X. That last line
shows explicitly how the value of the dependent variable for any observation is connected to
the value of the independent variable for that observation through the coefficients, a and b,
and the residual for that observation, ei . And, to remind you of something from Chapter
2, equation (7) expresses Y as a linear combination. In this linear combination, a is the
coefficient on a pseudo-variable which has a value of one for every observation. Hence, the
latter does not vary, and we do not need to show it explicitly in the expression for the linear
combination. In a similar manner, e is a variable with a coefficient of one; therefore, that
coefficient does not need to be shown. Of course, b is the coefficient for the third— and only
“real”— variable in the linear combination, X.
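Written out with the implicit pieces made explicit, the linear combination in equation (7) is (in LaTeX notation)

Y_i \;=\; a \cdot 1 \;+\; b \cdot X_i \;+\; 1 \cdot e_i ,

where the pseudo-variable multiplying a is the constant one, and the coefficient on e is also one.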
Our next task is to use the information we have available to produce formulas for the
coefficients; after doing that, it will be an easy matter to calculate the Ŷi ’s for the observations, and use them to obtain the ei ’s. We will begin with the b coefficient; as the slope of
the linear transformation from X to Ŷ , it is (at least arguably) the most interesting part of
the fitted line. The derivation of a formula for b is shown in Box 3.1. Our objective is to
manipulate equation (7) in order to isolate b on the left-hand side. Hopefully, when we do
so, the right-hand side will only contain information that we already possess. If so, then we
should be able to calculate the actual, numeric, value of b.
Line (1) in Box 3.1 simply repeats equation (7). In line (2), we subtract the mean of
Y from each side. Remember that Y is a linear combination, and the mean of a linear
combination is a weighted sum of the constituent means. That is what produces the sum
within the second set of parentheses on the right-hand side of line (2). In line (3), we remove
the parentheses from the right-hand side and rearrange the terms. Line (4) inserts three new
pairs of parentheses on the right-hand side, grouping together related terms.
Two important things occur in line (5): First, (a − a) is obviously zero, so we eliminate
it from the equation. Second, we factor the b out of the difference between Xi and X̄. We
could simplify line (5) a bit further, by removing ē. The latter is equal to zero, so it does
not really need to appear there. But, we want to retain it, because it will help us recognize
something important that occurs in a few more steps.
In line (6), we multiply both sides by (Xi − X̄). Please, just bear with us on this step—
you may not see why we are doing it here, but it does facilitate our task of isolating the b
coefficient (take our word for it!). Moving from Line (6) to line (7) we distribute the (Xi − X̄)
across the two right-hand side terms within the brackets.
In line (8), we take the sum, across the n observations, of the terms on each side of the
equation. Why do we do this? Remember that the line-fitting process needs to involve all of
the data points, and not just the ith observation. Taking the sum seems like a straightforward
way to bring all of our data (i.e., from all n observations) into the calculations. Notice, too,
we have a particular name for the quantity that appears on the left-hand side: It is the sum
of cross-products for X and Y .
Next, line (9) distributes the summation on the right-hand side across the two terms
within the brackets (using the first rule for summations in Appendix 1). Now, remember
that the b coefficient is a constant— it doesn’t vary from one observation to the next.
Therefore, the first term on the right-hand side of line (9) is the sum of a constant times a
variable. Therefore, we apply the third rule of summations (again, from Appendix 1) in line
(10) to factor the b coefficient out of that summation.
Take a very careful look at the right-hand side of line (10). The first term is the b
coefficient times another quantity for which we have a particular name: The sum of squares
for X. And, the second term is the sum of cross-products for X and e. (This is why we
left the ē in there— so you’d recognize this a bit more easily). Recall from Chapter 2 that
the sum of cross-products is the numerator of the covariance for the two variables, and also
the numerator of the correlation coefficient. Furthermore, remember our third criterion for
passing the line through the middle of the point cloud: The residual (e) must be uncorrelated
with X. Therefore, the sum of cross-products must be zero; that is the only way we can
produce rX,e = 0. So, the last term on the right-hand side of line (10) drops out, leaving line
(11).
Next, line (12) divides both sides by the sum of squares in X. Of course, this sum of
squares cancels out on the right-hand side of line (12), leaving only the b coefficient. And,
line (13) follows tradition by reversing the two sides of the equation, placing the quantity
we are trying to find on the left, and the formula for finding it on the right-hand side.
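Since Box 3.1 itself does not appear in this extract, here is a hedged LaTeX reconstruction of its thirteen lines, following the description above; it should be checked against the actual box.

\begin{align}
Y_i &= a + bX_i + e_i && (1)\\
Y_i - \bar{Y} &= (a + bX_i + e_i) - (a + b\bar{X} + \bar{e}) && (2)\\
Y_i - \bar{Y} &= a - a + bX_i - b\bar{X} + e_i - \bar{e} && (3)\\
Y_i - \bar{Y} &= (a - a) + (bX_i - b\bar{X}) + (e_i - \bar{e}) && (4)\\
Y_i - \bar{Y} &= b(X_i - \bar{X}) + (e_i - \bar{e}) && (5)\\
(X_i - \bar{X})(Y_i - \bar{Y}) &= (X_i - \bar{X})\left[b(X_i - \bar{X}) + (e_i - \bar{e})\right] && (6)\\
(X_i - \bar{X})(Y_i - \bar{Y}) &= b(X_i - \bar{X})^2 + (X_i - \bar{X})(e_i - \bar{e}) && (7)\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_{i=1}^{n}\left[b(X_i - \bar{X})^2 + (X_i - \bar{X})(e_i - \bar{e})\right] && (8)\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_{i=1}^{n} b(X_i - \bar{X})^2 + \sum_{i=1}^{n}(X_i - \bar{X})(e_i - \bar{e}) && (9)\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= b\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{i=1}^{n}(X_i - \bar{X})(e_i - \bar{e}) && (10)\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= b\sum_{i=1}^{n}(X_i - \bar{X})^2 && (11)\\
\frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} &= b && (12)\\
b &= \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} && (13)
\end{align}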
The preceding derivation may have felt like a lot of steps. But, we have achieved our
objective, a formula for the b coefficient! Perhaps most important, this formula is composed
entirely of quantities that we can calculate very easily from our bivariate data. So, this is
an important accomplishment! Even better, the formula is quite straightforward. It consists
of a fraction formed by two familiar quantities: The sum of cross-products for X and Y ,
divided by the sum of squares for X. We can go a step further and point out that b is actually
the ratio of the covariance (between X and Y ) to the variance of X. Remember that each
of these two statistics has the degrees of freedom (n − 1) in its denominator.
These cancel out when we form the fraction with the covariance in the numerator and the
variance in the denominator. So, we feel fully justified in claiming that we have arrived at a
particularly neat solution for the slope of the line that will pass through the middle of the
point cloud in our scatterplot!
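In symbols, the covariance-to-variance reading of the slope, with the two (n − 1) terms cancelling, is

\[
b \;=\; \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)}
  \;=\; \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})/(n-1)}{\sum_{i=1}^{n}(X_i-\bar{X})^2/(n-1)}
  \;=\; \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2}.
\]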
Satisfying as our preceding accomplishment may be, we cannot stop here. Remember
that we also have to solve for the a coefficient, the intercept of the line that we are going to
use in order to smooth our scatterplot. The procedure is, again, straightforward and we can
apply directly the results we have already obtained. In other words, we can assume that we
have already calculated the value of the b coefficient. Box 3.2 lays out the derivation.
Line (1) in Box 3.2 just lays out the expression that shows Y as a linear combination of
X, e, and the pseudo-variable associated with the a coefficient. In line (2), we take the mean
of the linear combination, producing the weighted sum on the right-hand side. In line (3),
we take advantage of our requirement that the residuals sum to zero, and remove ē from the
right-hand side. Line (4) is produced by subtracting bX̄ from both sides of the equation. Of
course, this isolates the a coefficient, giving us exactly what we are trying to achieve. And,
line (5) once again, places the coefficient on the left and the formula on the right.
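As with Box 3.1, here is a hedged LaTeX reconstruction of the five lines of Box 3.2, based on the description above:

\begin{align}
Y_i &= a + bX_i + e_i && (1)\\
\bar{Y} &= a + b\bar{X} + \bar{e} && (2)\\
\bar{Y} &= a + b\bar{X} && (3)\\
\bar{Y} - b\bar{X} &= a && (4)\\
a &= \bar{Y} - b\bar{X} && (5)
\end{align}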
Looking at line (5), we see once again that the coefficient in question, a in this case,
can be obtained very easily by applying a simple formula to quantities that we have probably
calculated from our data in previous steps of the analysis. So, we needed X̄ and Ȳ to
calculate the sum of squares and sum of cross-products needed to calculate b. And, we have
already produced a formula for b, itself, which we can plug directly into line (5) of Box 3.2.
Box 3.3 repeats the formulas we have derived for the a and b coefficients, for convenience.
Remember that our objective was to use these coefficients to generate the variable Ŷ , by
plugging each of the n values of X in our dataset into the formula, Ŷi = a + bXi . And,
the line upon which all of the resultant Ŷ ’s fall should pass through the middle of the point
cloud in the scatterplot of X and Y . Does the line produced by our newly-derived formulas
actually work this way?
Fitting a Line to the Medicaid Data
We will provide at least a partial answer to the preceding question by using the formulas
in Box 3.3 to fit a line “by hand” to our bivariate Medicaid data. Table 3.4 shows the basic
calculations.7 The first two columns give the actual values of the X and Y variables. The
sums for each variable are shown at the bottom of the respective columns. The third and
fourth columns subtract the means from the variable values to obtain the deviation scores.
Note that deviation scores always sum to zero (as shown at the bottom of these columns).
Next, the fifth column multiplies the deviation scores for X and Y together to produce the
cross-product for each observation. And, the sum of cross-products is shown at the bottom
of this column. Finally, the sixth column squares the deviation score for X, and the sum of
squares for X is shown at the bottom of this column.
Table 3.5 uses the various sums from Table 3.4 to calculate the values of the two coefficients. First, we divide the sum of cross-products (42,997.520) by the sum of squares for
X (891.099) to produce the value for the b coefficient, or 48.252.8 Next, we divide the sum
of the X values (724.800) by n (50) to obtain the mean of X, or X̄ = 14.496. The mean
of Y is obtained the same way, producing Ȳ = 1000.600. These two means, along with the
previously-calculated value of b (48.252) are inserted into the formula for a to produce an
intercept of a = 301.136.
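The arithmetic in Tables 3.4 and 3.5 is easy to mirror in a few lines of Python. The sums below are the ones reported in the text, so this is only a hedged check of the hand calculation; small rounding differences are expected.

# Sums reported in Tables 3.4 and 3.5 for the Medicaid data (n = 50).
n = 50
sum_x, sum_y = 724.800, 50030.0
sum_cross_products = 42997.520    # sum of (Xi - Xbar)(Yi - Ybar)
sum_squares_x = 891.099           # sum of (Xi - Xbar)^2

b = sum_cross_products / sum_squares_x    # about 48.252
x_bar, y_bar = sum_x / n, sum_y / n       # 14.496 and 1000.600
a = y_bar - b * x_bar                     # about 301.1 (Table 3.5 reports 301.136)

# With the full dataset in hand, y_hat = a + b * x would reproduce the fitted
# values in Table 3.6, and the residuals y - y_hat would sum to (essentially) zero.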
Next, Table 3.6 calculates the fitted values and residuals for each of the fifty observations in the dataset. Once again, the first two columns in the table simply repeat the data
values, for convenience. In the third column, we obtain the fifty Ŷi values, by plugging each
Xi into the formula Ŷi = a + bXi , with the appropriate numeric values substituted into the
formula for a and b. That is, for each observation, Ŷi = 301.136 + 48.252Xi . The fourth
column subtracts the Ŷi ’s from their corresponding Yi values to produce the residuals. And,
the fifth column presents the cross-products between the Xi ’s and the corresponding ei ’s.
Note that the deviation scores for X are not shown; they are identical to those that were
presented in Table 3.4. Note also that the deviation scores for the residuals are just equal
to the residuals, themselves, because ē = 0, by construction.
The column sums are once again shown at the bottom of the table. And, these sums
have several interesting features. First, we can see that the residuals sum to zero. They were
created to possess exactly this property, so this result may not be particularly surprising.
Still, it is reassuring to verify that things work out as they should! Second, the sum of
cross-products for X and e is exactly zero. But, this too, is to be expected: Remember that
we required that X and e be uncorrelated. That requirement implies that the sum of
cross-products between them must be zero, exactly as we see it is in the table. Third, look
at the sum of the fitted values. The entry at the bottom of the third column in Table 3.6
shows that ∑ Ŷi = 50,030. But, this is exactly the same value as the sum of the original
Yi ’s (verify this by looking at the bottom of the second column in Table 3.6). Of course,
we know that the sum of a variable’s values is the numerator of the mean for that variable.
So, the result we have just seen tells us that the mean of the fitted values is identical to the
mean of the dependent variable. Using our generic nomenclature, the mean of the Ŷi ’s equals Ȳ . For the moment,
we will regard this equivalence as an interesting “coincidence.” But, we will soon see that it
plays a very important role in developing the regression model.
Let us next reconstruct the scatterplot of Medicaid expenditures against the size of the
Medicaid population within each state. Now, however, we will superimpose our fitted line
over the 50 data points. We position the line in the plotting region by plotting the Ŷi ’s against the
corresponding Xi ’s, and running a straight line through the points, to span the entire width
of the scatterplot. This is shown in Figure 3.14.
The line shown in Figure 3.14 does exactly what we want it to do. Given the irregular
shape of the point cloud, we believe this line approximates the “middle” of the cloud very
well; at least, we can say that at any position along the horizontal axis, there are data
points falling both above and below the fitted line. We would argue that the line shown in
Figure 3.14 does successfully capture the predominant pattern in the way the values of the
dependent variable tend to shift upward when the independent variable values are larger.
Thus, the line provides us with a succinct summary of the interesting structure in the overall
bivariate dataset. Rather than looking at the 100 distinct data values, we can use the line
as a general description that transcends the entire set of 50 separate observations.
If the line summarizes the data, then the a and b coefficients describe that summary line.
Therefore, we might claim that a summary of the way Medicaid expenditures respond to
differences in the number of Medicaid recipients lies in the slope coefficient. The predominant
linear pattern in these variables indicates that a one-unit positive difference in the size of the
Medicaid population corresponds to a difference of 48.252 units in the fitted values. Now, this
value of the b coefficient may not correspond to the actual differences in X and Y values for
any particular pair of observations in the dataset. But, it is not intended to do that; instead,
the slope tries to summarize the predominant pattern that occurs across all 50 observations
in the full dataset. For now, we will not push the substantive interpretation any farther than
this. But, try to bear with us for now: We can assure you that we will eventually provide
a much more specific way to impose substantive meaning onto the calculated numeric value
of the slope coefficient.
The value of b determines the directional orientation of the line (or, its steepness, relative
to the axes) within the plotting region of Figure 3.14. The value of the intercept, a, determines
the vertical position of the line, relative to the vertical axis. Here, the intercept is 301.136.
In mathematical terms, this is the fitted value that corresponds to Xi = 0. Geometrically, if
our vertical axis were anchored at the origin (i.e., the zero point) along the horizontal axis,
the fitted line in Figure 3.14 would cross that vertical axis at the location corresponding to
a value of 301.136. But, notice that the values of our X variable do not extend that far
downward. The minimum percentage of the population in any state that receives Medicaid
is 7.2; thus, Xi = 0 simply does not occur in our dataset. So, the intercept really is just a
“height adjustment” term, used to guarantee that the fitted line passes through the point
cloud in an appropriate manner (i.e., near the middle). Again, we will discuss the substantive
interpretation of the intercept later on.
We have done a lot in this section! So, it is probably useful to catch our breath and
summarize our accomplishments. First, we changed the strategy we are using to summarize
the structure within our bivariate data. Since an earlier nonparametric regression suggested
that this structure is linear, we moved to a parametric approach. In other words, we specified
that we would fit a “curve” with a linear shape (or, stated more clearly, a line). Then, it
was just a matter of finding a specific line that meets our criteria— running through the
middle of the point cloud. We accomplished that by operationalizing the criteria in very
specific terms and then working through some straightforward algebra. The final result was
a line that successfully summarizes our bivariate data— at least, according to our visual
interpretation of the scatterplot, with the fitted line superimposed over the data points.
And, to reiterate a very important point: This line is potentially useful because it seems
to summarize very succinctly, the position and orientation of the point cloud. And, in so
doing, it helps us discern structure in our data that is potentially relevant for addressing the
substantive concerns that led us to examine the data in the first place, such as theory-testing
or forecasting (or, very likely, both).
FINDING THE “BEST” LINE
From our work in the preceding section, we have produced a line that seems to pass
through the middle of our data point cloud in the scatterplot of the Medicaid data. But,
how do we know that our line is the best possible line that could be used to represent the
data in the scatterplot? Of course, our line conforms to the criteria that we laid out to
operationalize the concept of “middle” in the scatterplot. But, another analyst could easily
argue that our criteria have no particular status, and that other lines provide a summary
representation of the systematic pattern in the data that is just as good as ours.
For example, remember that we came up with a slope of 50 from our visual inspection of
the conditional mean smoother that we fit to the sliced scatterplot. The value, 50, is a nice
round number which is probably easier to remember than the slope that we just calculated
(that is, b = 48.252). So, it might make just as much sense to use that as the slope of the
line that we impose over the data. If we use b = 50, and still require the line to run through
the centroid (which we do not really have to do— it just makes good sense to us), then the
intercept will be 275.80.9 The first panel of Figure 3.15 shows this line in the scatterplot.
Visually, the new line is just about identical to the one from the previous section. But, again,
it is a different line in the sense that it is described by the equation Ŷi = 275.80+50Xi rather
than the equation Ŷi = 301.136 + 48.252Xi . This new line does not meet all the criteria we
laid out in the preceding section: Specifically, there is a nonzero correlation between the
independent variable and the residual. If we designate the residuals from this new line as
e∗ , then rX,e∗ = −0.030. But, again, our criteria have no particular theoretical status; they
just “made sense to us” and they produced a line that seemed to run through the middle of
the points. The new line appears to do exactly the same thing, even though it doesn’t meet
our criteria. Therefore, one could claim it is just as good as our line.
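The arithmetic behind this alternative line is easy to sketch in Python. Again, the data arrays are hypothetical stand-ins; with the actual Medicaid values, the intercept works out to 275.80 and the correlation between X and the new residuals to about −0.030, as reported above.

    import numpy as np

    # Hypothetical stand-in data (NOT the actual Medicaid values).
    rng = np.random.default_rng(1)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)

    b_alt = 50.0                          # the "nice round" slope from the conditional mean smoother
    a_alt = y.mean() - b_alt * x.mean()   # forces the line through the centroid (Xbar, Ybar)

    e_star = y - (a_alt + b_alt * x)      # residuals from the alternative line
    print(a_alt)
    print(np.corrcoef(x, e_star)[0, 1])   # nonzero: this line violates our earlier criterion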
Still another analyst might criticize the lines we have looked at so far (that is, both ours
and the new one) because their slope is too shallow. To this person, the problem might be
the position of the line near the right side of the scatterplot; one could argue that it passes
very close to the points in the lower region of the scatterplot, but too far from the points
high in the upper region— particularly the one point in the upper right-hand corner of the
plot. This individual might prefer that we use a steeper slope and a smaller intercept,
producing the line shown in Figure 3.15B. Here, the equation for the line is Ŷi = 100 + 68Xi .
The coefficients in this equation do not conform to any particular criteria; the line was simply
oriented within the scatterplot so that it “looked right” to our analyst.
From this individual’s subjective perspective, the line in Figure 3.15B is just as clearly
running through the “middle” of the scatterplot as was our line (from Figure 3.14) or the
new line we just examined a moment ago (in Figure 3.15A). Of course, the general process
could continue ad infinitum. The problem is that we have not defined any criterion that
establishes the best line for a given scatterplot. And, until we do so, this indeterminacy
about the most accurate representation of the data will continue.
So, what do we mean by the “best” line? Since we are using the line to represent the
data, we probably want to generate a line that comes as close as possible to as many of the
points as possible. Furthermore, we may as well go ahead and define “close” in the vertical
direction within the scatterplot. Our reasons for doing so are identical to the reasons we
laid out for defining the middle of the point cloud in the vertical direction. Thus, we want
to find a line such that the Ŷi ’s are as similar to the actual Yi ’s as possible. Stated a bit
differently, we want to produce a line with the smallest possible residuals, since ei = Yi − Ŷi ,
for observations from i = 1 to n.
Operationalizing “Close”: The Least-Squares Criterion
So far, we have stated our objective in relatively informal, but intuitive, terms. We need
to be more specific in stating exactly what we mean when we say that we want the line to
come as close as possible to as many points as possible, or that we want the line that makes
the residuals as small as possible. And, just as we did when we operationalized the “middle”
of the point cloud, we will consider and reject several potential strategies for operationalizing
“closeness” (or “small residuals”) before arriving at the one that we will actually use.
As a starting point, let’s think about making each observation’s residual (that is, ei ) as
small as possible. Then we can just sum across the n observations. The general idea is that
we would find the line that minimizes $\sum_{i=1}^{n} e_i$. But, this will not work: We can make $\sum_{i=1}^{n} e_i$
as small as we want, simply by drawing the line farther and farther above the point cloud
within the scatterplot. As we saw back in Figure 3.12B, such a line is unacceptable because
it does not run through the point cloud at all!
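A quick numerical sketch shows why this criterion fails. Here we simply add larger and larger constants to the intercept, pushing the whole line farther above a hypothetical point cloud; the raw residual sum keeps "improving" without bound even though the line fits the data worse and worse.

    import numpy as np

    # Hypothetical stand-in data.
    rng = np.random.default_rng(2)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)

    for shift in (0, 1_000, 10_000, 100_000):
        e = y - ((301.136 + shift) + 48.252 * x)   # line shifted farther above the points
        print(shift, e.sum())                      # the sum of raw residuals heads toward minus infinity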
As a second try, we could find the line that makes the residuals sum to zero. After all, that
was one of our criteria for defining the middle of the point cloud, and the line we produced
from that exercise worked pretty well. In fact, $\sum_{i=1}^{n} e_i = 0$ is a very reasonable requirement
for our “best” line; it is just not enough, by itself. The problem is that there are many lines
that will produce residuals which sum to zero. Some of these lines will be good summaries
of the bivariate data (such as the one that we calculated in the previous section, and showed
in Figure 3.14). However, other lines will not coincide very closely with the predominant
trend in the point cloud, even though they meet this requirement. For example, consider
Figure 3.16; the line shown in this scatterplot produces $\sum_{i=1}^{n} e_i = 0$, even though it would be
a silly depiction of the Medicaid data. That is, the line runs in a nearly horizontal direction
through the plotting region, while the data run from lower-left to upper-right.
The problem with the preceding attempt, of course, is that the negative residuals and
positive residuals cancel each other out. This can occur for a wide variety of different lines,
regardless of their orientation relative to the data point cloud. But, recognizing the problem
also suggests the solution: Rather than minimizing the sum of the “raw” residuals across
the n observations, we can perform some transformation on the residuals to make them all
positive. Then, we could minimize the sum of these transformed versions of the residuals.
One approach would be to transform the residuals by taking their absolute values, and
minimizing $\sum_{i=1}^{n} |e_i|$. This strategy has some initial appeal, because the $|e_i|$’s are the vertical
distances from the points to the line. Thus, we really would be putting the line as close
as possible to the points. An approach based upon this idea would work; in fact, it is
actually used in some regression applications where the data include unusual observations
with discrepant data values compared to most of the other observations. But, it also has a
drawback: It is a relatively complicated mathematical problem to produce formulas for the
a and b coefficients that minimize the sum of the absolute residuals. And, there is no need
for us to bother with that complexity (at least, for the moment) because we have a much
simpler approach available to us.
As you probably know, we could also transform the residuals to a set of non-negative
values by squaring them. Next, we would sum the squared residuals across the observations,
producing $\sum_{i=1}^{n} e_i^2$. Then, we could find the line (or, more precisely, the coefficients a and
b which define the line) that minimizes this sum of squared residuals. This is called the
least-squares criterion for fitting a line to the data.
This, in fact, is the strategy that we will use. As we will see, the mathematical steps
involved in finding a and b with the least-squares criterion are straightforward. (We hesitate
to say “simple” because they will involve calculus. But, please, don’t be intimidated by that.
We’ll get through that part pretty easily, as you will see shortly!). And, in the next chapter,
we will show that the resultant formulas have some very nice statistical properties— at least,
when certain conditions are met in our data.
Using Least-Squares to Fit a Line to Data
We want to find values for a and b in the equation, Ŷi = a + bXi such that, when we
calculate $\sum_{i=1}^{n} e_i^2$, that sum is as small as possible. At first, that idea may seem a bit strange
to you, since conceptually, the residuals can only be found after the line, itself, has been
created. If so, then how can we use the sum of squared residuals as the criterion for creating
the line in the first place? Isn’t that a statistical manifestation of putting the cart before
the horse?
In fact, we cannot calculate the ei ’s and, hence, the e2i ’s for the individual observations
until we have actually fitted the line to the data. But, it is easy to show that the sum of the
e2i ’s across the full set of n observations is determined entirely by the values of a and b that
we use to create the line. Now, the residuals are defined as the difference between the actual
values of Y and the fitted values (that is, Ŷ ) for each observation. Therefore, the sum of
squared residuals can be shown as:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
Now, the fitted values for each observation are defined as Ŷi = a + bXi . So, we can substitute
the latter into the formula for the sum of squared residuals as follows:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - [a + bX_i])^2$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$$
Look at the right-hand side of the preceding equation. Four different quantities appear
within the parentheses following the summation operator. Two of these are data values,
Yi and Xi . Of course, the data are fixed by the observations in the sample before us. We
cannot alter the values of Yi and Xi in any way! (There are severe professional sanctions
for doing so!). This leaves us with a and b as the only quantities that are available for us to
manipulate. So, again, we will try to obtain values for these two coefficients that minimize
the term on the left-hand side, the sum of squared residuals.
How might we go about this task? One strategy might be to conduct a search across
different values of the coefficients. In other words, select a pair of numeric values for a and
b, use them to calculate the Ŷi ’s for the n observations in the dataset, use these fitted values
to calculate the ei ’s (remember, we have the Yi ’s from the data, themselves, so it is easy for
us to calculate ei = Yi − Ŷi for each of the n observations), then square them and sum the e2i
values to obtain the sum of squared residuals, or $\sum_{i=1}^{n} e_i^2$. Repeat this entire process many
times, trying different values of a or b, or both, in each case. Record the sum of squared
residuals for each combination of a and b, and simply select the pair of coefficient values that
produces the smallest $\sum_{i=1}^{n} e_i^2$.
The preceding search strategy might (and probably would) work if we searched across
enough combinations of a and b values. It would involve a lot of computation (remember,
we would have to calculate n residuals for each distinct pair of coefficients in order to obtain
the $\sum_{i=1}^{n} e_i^2$), but this is not really a problem with modern computing technology. More
troublesome is the fact that this search strategy would have to be carried out separately for
each new dataset that we might want to analyze; there is no formula to “unify” the process
and make it more coherent. Furthermore, we would probably always worry that our search
might have missed the set of coefficients that is truly the best, according to the least-squares
criterion. So, even though the search strategy is not entirely unreasonable, we reject it for
relatively practical reasons.
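For completeness, here is what that brute-force search might look like in Python. The data are hypothetical stand-ins, and the grid of candidate coefficients is deliberately narrow and coarse just to keep the example fast; a real search would have to be much wider and finer, which is exactly the practical worry we raised.

    import numpy as np

    # Hypothetical stand-in data.
    rng = np.random.default_rng(3)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)

    def sum_sq_resid(a, b):
        """Sum of squared residuals for the candidate line a + b*X."""
        e = y - (a + b * x)
        return np.sum(e ** 2)

    best = None
    for a in np.arange(200.0, 400.0, 1.0):        # candidate intercepts
        for b in np.arange(40.0, 60.0, 0.1):      # candidate slopes
            value = sum_sq_resid(a, b)
            if best is None or value < best[0]:
                best = (value, a, b)

    print(best)   # close to, but not necessarily exactly, the least-squares coefficients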
In fact, there is a much easier way for us to proceed which avoids all of the previous
difficulties. This new strategy will involve some calculus. We promised not to use calculus
in this book and we intend to keep that promise. But, we do want you to know what it does
and why it is useful in this particular situation!
Assume that we did, in fact, carry out our search process and that we tried a very wide
range of different a and b values. Also assume that we already know the proper value for the
a coefficient. We realize that this latter assumption is terribly unrealistic; we only adopt it
for the moment, because it will make it easier for us to draw a picture of the problem that
we are trying to solve.
With the preceding assumptions in place, let’s graph the $\sum_{i=1}^{n} e_i^2$ values, plotted against
the b values that produced them (remember, we’ve fixed a at its “correct” value, so we don’t
need to show it in the graph). Figure 3.17A shows this graph; the figure includes a curve,
rather than individual plotted points, because the b coefficient can (in principle) be any
number from negative infinity to positive infinity. Each position along the horizontal axis
in Figure 3.17 represents a particular numeric value of b. The height of the curve at any
given horizontal position represents the sum of squared residuals associated with that value
of b. Of course, we want to find the b value associated with the lowest point on the plotted
curve— that is, at the very bottom of the “U.”10
In order to accomplish that, we will take a brief sidestep, and ask you to consider the
slope of the plotted curve. More specifically, look at how the curve’s slope changes as we
move from left to right within the plotting region. As we show in Figure 3.17B, the slope
of the curve near the left side of the plotting region (at position b1 on the horizontal axis)
is negative and very steep. As we move toward the right, the slope of the curve remains
negative, but it gets shallower (i.e., closer to horizontal), as at position b2 . After we “round
the curve” at the bottom of the U-shape, the slope becomes positive. Near the middle of
the plotting region (say, at position b3 ), the positive slope would be relatively shallow, but
it becomes steeper as we move even further toward the right (say, to position b4 ). Now, we
want to find the b value at the exact bottom of the U-shape; stated differently, we want
the b value associated with the exact position along the curve where the slope changes from
negative to positive. At this position, the curve would be exactly horizontal, and the value of its slope
would be zero. Figure 3.17C shows this situation. So, we want to find the value of b which
corresponds to the position where the plotted curve has a slope of zero (which happens to be
the lowest point on that curve). Please make sure this makes sense to you before proceeding!
Now, we’ll return to our main course from our sidestep. Is there any way for us to find
the lowest point on the curve in Figure 3.17 without calculating all of those possible b values
to produce the curve in the first place? As it turns out, the answer is yes! We can use
differential calculus to produce a formula which gives the slope of the curve for any possible
value of b; note that we can obtain this formula without even having to draw the curve at all!
All we need to do is get this formula, set it equal to zero, and “work backwards” to find the b
value that occurs for that particular slope. At this stage, we can also remove our unrealistic
assumption that we already know the value of a. Calculus can also give us a similar formula,
which provides the slope of the curve for any possible value of a. Once again, we just obtain
this formula, set it to zero, and find the value of a that results.
These mysterious formulas are called partial derivatives. Speaking very informally (our
mathematician friends will gnash their teeth at our definitions!), the partial derivative of
$\sum_{i=1}^{n} e_i^2$ with respect to b gives the slope of the graph relating the sum of squared residuals
to the value of b, for any specified value of b. Similarly, the partial derivative of $\sum_{i=1}^{n} e_i^2$ with
respect to a gives the slope of the graph relating the sum of squared residuals to the value
of a, for any specified value of a. Since we already know that $\sum_{i=1}^{n} e_i^2$ is determined by the
combination of the two coefficients (a and b), you can probably guess that these two partial
derivatives are intertwined with each other. In other words, to calculate the specific numeric
value of one partial derivative, you must incorporate the information from the other partial
derivative, and vice versa.
Enough talk, already! Let’s look at the actual formulas for these two partial derivatives
(note that all summations are taken across the n observations in the dataset; therefore, we
omit the limits of summation in the interest of “cleaning up” the formulas):
$$\frac{\partial \sum e_i^2}{\partial b} = -2\sum X_i Y_i + 2a\sum X_i + 2b\sum X_i^2 \tag{8}$$
$$\frac{\partial \sum e_i^2}{\partial a} = -2\sum Y_i + 2na + 2b\sum X_i \tag{9}$$
Those strange-looking fractions on the left-hand sides of equations (8) and (9) are just
standard mathematical notation for partial derivatives. In equation (8), the left-hand side is
read as “the partial derivative of $\sum e_i^2$ with respect to b.” The left-hand side of equation (9)
is similarly verbalized as “the partial derivative of $\sum e_i^2$ with respect to a.” Now, we could
plug in any pair of a and b values on the right-hand sides of these two equations to find the
slope of the plotted curve for that pair of values. But, instead of doing this, we want to plug
a zero in on each of the left-hand sides; the reasoning is that we already know this is the
slope we want. We “merely” want to find the values of a and b that are associated with this
slope. So, we show the derivatives as follows:
$$0 = -2\sum X_i Y_i + 2a\sum X_i + 2b\sum X_i^2 \tag{10}$$
$$0 = -2\sum Y_i + 2na + 2b\sum X_i \tag{11}$$
At this stage, we can carry out some very simple algebraic manipulations. Basically, we will
isolate the terms involving a and b on the right-hand side of each equation and also divide
each equation by 2. Those operations produce the following results:
$$\sum X_i Y_i = a\sum X_i + b\sum X_i^2 \tag{12}$$
$$\sum Y_i = na + b\sum X_i \tag{13}$$
These are two simultaneous equations, in two unknowns (a and b). This rather intimidating-sounding phrase simply means that the two equations are “connected” to each other (we
already know that because the same coefficients appear in both of them) and that all of the
quantities except for a and b can be calculated directly from our data. The fact that the
number of unknown quantities is exactly equal to the number of separate equations (two, in
this case) means that we can solve the two equations together in order to produce unique values for those unknown quantities. There are a number of specific ways we could do this, but
in every case, the operations would produce the following formulas for the two coefficients:
$$b = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2} \tag{14}$$
$$a = \frac{\sum Y_i}{n} - b\,\frac{\sum X_i}{n} \tag{15}$$
These are the values of a and b that coincide with the point on the plotted curve where the
slope is zero. And, as we know, these are also the values that minimize the sum of squared
residuals. Notice that the right-hand sides of the preceding two equations only include
quantities that we already know (n) or can calculate quite readily from the data (the various
sums involving the Xi ’s and the Yi ’s). This is important because it means that we can
actually calculate the values for a and b using the information that is available to us.
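Translated directly into Python, equations (14) and (15) need nothing more than a handful of sums. The data arrays below are hypothetical stand-ins; the same arithmetic applied to the Medicaid table reproduces b = 48.252 and a = 301.136.

    import numpy as np

    # Hypothetical stand-in data.
    rng = np.random.default_rng(4)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)
    n = len(x)

    # Equation (14): the slope, from raw sums of the data values.
    b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)

    # Equation (15): the intercept, from the two means and the slope.
    a = np.sum(y) / n - b * np.sum(x) / n

    print(a, b)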
In fact, we can even go a bit farther. Equations (14) and (15) can be manipulated
algebraically in ways that allow us to express them in the following manner:
$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)} = \frac{S_{XY}}{S_X^2} \tag{16}$$
$$a = \bar{Y} - b\bar{X} \tag{17}$$
As you can see, these are exactly the same formulas for a and b that we derived earlier,
when we were trying to run a line through the middle of the point cloud in the scatterplot!
So, back in that exercise, we not only produced some line; we actually found the best line
in the sense that it runs through the middle of the point cloud, coming as close as possible
to as many of the points as possible. Here, “close” is defined in terms of the least-squares
criterion. In other words, these are the coefficients which generate the line that produces
the smallest possible sum of squared residuals.
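As a quick check on this equivalence, the covariance form in equation (16) gives exactly the same answer as the sums-based form in equation (14); a short verification in Python (same hypothetical data convention as the previous sketch):

    import numpy as np

    # Same hypothetical data as the previous sketch (same seed).
    rng = np.random.default_rng(4)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)

    b_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # equation (16)
    a_cov = y.mean() - b_cov * x.mean()                      # equation (17)
    print(a_cov, b_cov)                                      # identical to the sums-based result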
Equations (16) and (17) (or their equivalents, equations (14) and (15)) are so important
that they have a special name: They are usually called the ordinary least squares estimators
for the slope and intercept, respectively. Obviously, “least squares” refers to the fact that
these formulas minimize the sum of squared residuals from the fitted line to the data points.
And, “ordinary” simply means that this is our typical, “default” method for fitting the line.
In later chapters, we will see that we sometimes need to modify these formulas to take into
account the characteristics of particular datasets. Note, too, that “ordinary least squares”
is a bit of a mouthful, so it is often abbreviated to the acronym, “OLS.”
Because the line produced by the OLS formulas (which we will begin calling “the OLS
line” for verbal convenience) is identical to the line we fitted earlier, the OLS line shares all
of its characteristics. That is, the sum (and therefore, the mean) of the residuals from the
OLS line is zero, the OLS line passes through the centroid of the bivariate data, and the
residuals from the OLS line are uncorrelated with the values of the independent variable.
Taken together, these characteristics imply that the OLS line should run through the middle
of the data point cloud in the scatterplot, if the cloud really does have a linear shape (as
it did, for example, with the Medicaid data shown in Figure 3.14). And, since it comes as
close as possible to as many of the points as possible (in the least-squares sense), we can be
confident that this is the single best line that we could come up with to summarize the linear
pattern in the data. This is an incredible accomplishment!
GOODNESS OF FIT
We have found a line that we claim gives the best fit to our bivariate data (as long as we
define “best” in terms of the least-squares criterion). But, how good is it? In other words,
when we make the informal claim that the line passes as close as possible to as many of the
points as possible, just how close does it come, in general? The answer to this question is
important from an analytic perspective because it addresses the degree to which the OLS
line really does approximate the entire cloud of points within the bivariate scatterplot. And,
substantively, this question is important because we are trying to use the line to provide a
summary depiction of the way that our Y variable is related to the X variable. In either
case, the closer the points actually fall to the line, the better the summary that the line
provides for us— in either a geometric or a substantive sense. It would be nice if we had
some statistic that, in a single numeric value, summarizes the overall degree to which the
OLS line approximates the actual set of n observations.
You might be thinking that we already have a summary statistic which describes the
degree to which bivariate data conform to a linear pattern: The correlation coefficient.11 But,
there are two reasons why we do not want to simply fall back on this statistic. First, we don’t
know whether the correlation coefficient and the OLS estimators are focusing on the same
line within the data. If the correlation measures the degree to which the data approximate
some other line, then it would hardly suffice as a goodness-of-fit measure for the OLS line!
Second, remember that we do not have a very clear interpretation for intermediate values of
the correlation coefficient— those that fall between zero and either positive or negative one;
instead, we only offered some broad and unsatisfying rules of thumb (with disclaimers!). It
would certainly be better if we could produce a statistic with a more precise meaning for
any specific value it might take on in a particular dataset.
Partitioning the Variance in Y
In order to address this problem, we will begin with the equation that breaks down any
observation’s value on Y into its separate components:
$$Y_i = a + bX_i + e_i \tag{18}$$
We also know that the fitted values for the OLS line are obtained from the first two terms
on the right-hand side of equation (18), or Ŷi = a + bXi . So, we can substitute this in, as
follows:
$$Y_i = \hat{Y}_i + e_i \tag{19}$$
Equation (19) shows that the variable, Y , is actually a linear combination of two other
variables: The fitted values (that is, Ŷ ) and the residuals (that is, e). (Note that the
coefficients on the two variables in this linear combination are both implicitly equal to one).
We could say that the dependent variable, Y , is divided into two parts. First, Ŷ is the part
of Y that is related to X, through the OLS line. Second, e is the left-over part of Y which
(by definition) is not at all linearly related to X.
Now, despite our emphasis throughout this chapter on fitting curves and lines to bivariate
data, the real, ultimate objective is to account for the variance in Y . Since we recognize
that Y is a linear combination, its variance can be broken down as follows:
$$\mathrm{var}(Y) = \mathrm{var}(\hat{Y}) + \mathrm{var}(e) + 2\,\mathrm{cov}(\hat{Y}, e) \tag{20}$$
But, Ŷ is a linear transformation of X and we know that X is uncorrelated with e. Therefore,
the last term on the right-hand side is zero. (Why?). If we treat the variances strictly as
descriptive quantities for our sample of n observations (i.e., there is absolutely no question
about inferences to some broader population from which these observations were drawn),
then we can re-express equation (20) as follows:
$$\frac{\sum (Y_i - \bar{Y})^2}{n-1} = \frac{\sum (\hat{Y}_i - \bar{\hat{Y}})^2}{n-1} + \frac{\sum (e_i - \bar{e})^2}{n-1} \tag{21}$$
We can multiply both sides of equation (21) by n − 1 to get rid of the fractions:
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{\hat{Y}})^2 + \sum (e_i - \bar{e})^2 \tag{22}$$
Remember, too, that the mean of the fitted values is equal to the mean of Y , and that the
mean of the residuals is always zero. Therefore, we can simplify even further:
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum e_i^2 \tag{23}$$
Equation (23) shows a very important result. The left-hand side of the equation contains
the numerator of the variance in Y , or the sum of squares in Y . We can abbreviate this as
SSY . The first term on the right-hand side in equation (23) is the sum of squares (i.e., the
sum of squared deviations around the mean) in the Ŷi ’s. Since these fitted values fall along
the OLS line, we might call this term the “regression sum of squares.” We’ll abbreviate this
as SSReg . Finally, the second term on the right-hand side is the sum of squared residuals,
which we can show as SSRes . Just to remind you, the OLS coefficients create the line that
makes this residual sum of squares as small as possible.
Now, SSY represents the “total spread” in the Y values around the mean of Y . SSReg
might be viewed as the part of this total spread that is associated with the X values. And,
SSRes could be conceptualized as the part of Y ’s dispersion that is not at all linearly related
to the X’s. Our OLS line breaks the total sum of squares in Y neatly into these two additive
components, because X (and, hence, Ŷ ) is uncorrelated with e. Since the sums of squares
for the fitted values and for the residuals must add up to the total sum of squares in Y ,
we know that as one gets larger, the other must get smaller. And, any sum of squares
(regardless whether it is SSY , SSReg , or SSRes ) must be non-negative (why?). Therefore,
the maximum possible value of either term on the right-hand side of equation (23) is SSY ,
and the minimum value must be zero.
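The decomposition is easy to confirm numerically. With any OLS fit (hypothetical data below), SSY equals SSReg plus SSRes, apart from floating-point rounding.

    import numpy as np

    # Hypothetical stand-in data.
    rng = np.random.default_rng(5)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)

    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    e = y - y_hat

    ss_y = np.sum((y - y.mean()) ** 2)         # total sum of squares
    ss_reg = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares
    ss_res = np.sum(e ** 2)                    # residual sum of squares

    print(ss_y, ss_reg + ss_res)               # the two totals agree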
A Summary Statistic for Goodness of Fit
Next, we can divide both sides of equation (23) by SSY , to express the two additive
components as proportions of the total sum of squares in Y :
$$\frac{\sum (Y_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1.00 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} + \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2} \tag{24}$$
We are going to focus on the first term on the right-hand side. In fact, we are going to give
that first fraction a special name, as follows:
$$R^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} \tag{25}$$
R2 expresses the regression sum of squares as a proportion of the total sum of squares in Y .
And, as such, we can use it as a goodness-of-fit statistic for the OLS line. The reasoning is
straightforward. The Ŷi ’s represent the part of each Yi that falls exactly along the fitted OLS
line. Therefore, R2 shows the portion of Y ’s total sum of squares that is due to dispersion in
these Ŷi ’s. The maximum possible value of R2 is one; this will only occur when the fitted
values are exactly equal to the actual values of Y . That, in turn, occurs when the points
in the scatterplot fall perfectly into a linear pattern. Notice that, in this case, the residual
for each observation would be zero, and the residual sum of squares would also be zero. At
the other extreme, R2 has a minimum value of zero. This occurs when SSReg = 0. But,
that only happens when Ŷi = Ȳ for every observation and that, in turn, can only happen
when the OLS line has a slope of zero. The latter is equivalent to saying that Y exhibits no
linear responsiveness to X whatsoever. Values of R2 that fall in between these two extremes
can be interpreted very precisely as the proportion of the total sum of squares for Y that is
shared with X through the OLS line.
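Computationally, R2 is just the ratio in equation (25). The short sketch below (hypothetical data) calculates it from the fitted values and, anticipating a result we discuss next, shows that it matches the squared correlation between X and Y.

    import numpy as np

    # Hypothetical stand-in data.
    rng = np.random.default_rng(6)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)

    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)   # equation (25)
    print(r_squared)
    print(np.corrcoef(x, y)[0, 1] ** 2)   # the same number, obtained by squaring the correlation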
As we know, the total sum of squares for Y is the numerator of Y ’s variance. Therefore,
speaking colloquially, R2 is often interpreted as the proportion of variance in the dependent
variable that is “explained by” the independent variable. Even more succinctly, some people
say that R2 is a measure of “variance explained,” period. This terminology is very popular;
in fact, it is used almost universally to describe the meaning of the R2 statistic. We will go
with common practice and use this phrase ourselves. But, it is very important to remember
that the word “explained” should definitely not be taken very literally when doing so.
You might wonder why “R” is used in the name of this summary goodness-of-fit statistic.
As it turns out, R2 is equal to the square of the correlation coefficient between X and Y ,
or $r_{XY}^2$. The derivation of this interesting result is shown in Box 3.ZZ. The implications are
important: First, it shows that the correlation coefficient does, in fact, assess the degree to
which the data conform to exactly the same line as that obtained from the OLS formulas.
Second, R2 provides a simple transformation of the correlation coefficient (i.e., squaring it)
which produces a much better interpretation of the strength of the relationship between X
and Y . Interpreting R2 in terms of the proportion of variance in Y that is “explained” by
X is more precise and specific than the vague guidelines that we provided earlier for the
correlation coefficient.
Goodness of Fit and the Medicaid Data
In our earlier calculations, the line we fitted to the Medicaid data had a slope of 48.252
and an intercept of 301.136. We now know that this is the best-fitting line, in the sense
that it is the pair of coefficients which produces the line that minimizes the sum of squared
residuals. We could also say that the OLS estimates of the linear structure in the Medicaid
data produce the following equation:
$$Y_i = 301.136 + 48.252X_i + e_i \tag{26}$$
In equation (26), Yi corresponds to Medicaid spending per capita in state i and Xi is the
number of Medicaid recipients per thousand population in state i. We have already argued
that this line effectively summarizes the linear pattern of data points that we observed in the
scatterplot for these variables. In order to measure the degree to which the OLS line captures
the variance in the dependent variable values, we can calculate the R2 for this equation.
The relevant calculations are shown in Table 3.5. Once again, the leftmost two columns
in the table simply give the data values for the 50 observations. The third column gives the
Ŷi ’s for each observation, calculated from the OLS equation. The fourth and fifth columns
give the deviation scores for Y and Ŷ , respectively. And, finally, the last two columns provide
the squared deviation scores for these two variables. As in previous tables, the sums for the
variables are given at the bottom of each column. From the latter, we can see that the total
sum of squares for Y , SSY = 5,013,494. The regression sum of squares, SSReg = 2,074,726.
So, R2 = 2,074,726/5,013,494 = 0.414. For verbal convenience, we can multiply the value
of R2 by 100 in order to convert the proportion into a percentage. And then, we might say
that 41.4% of the variance in state Medicaid spending per capita is explained by the linear
relationship with the size of the state Medicaid population.
Addressing Some Common Questions About R2
We know that R2 can vary between zero and one, because it is a proportion. And, in most
cases, we are looking for relationships between our independent and dependent variables, so
we would like the calculated value of this statistic to be as close to one as possible. But, it
is probably natural to wonder how large this proportion needs to be. In other words, when
is R2 large enough (or close enough to its maximum value of 1.00) to be useful or to indicate
that a linear relationship really does exist between the X and Y variables?
Unfortunately, there is just no straightforward and general answer to this question. And,
for a variety of reasons, we often have to contend with fairly weak empirical relationships
between our independent and dependent variables. Therefore, the R2 values associated with
our regression equations sometimes seem to be a bit low. For example, in the Medicaid data,
where we saw a fairly obvious linear pattern in the data, the OLS line still only accounts
for less than half (i.e., about 41%) of the variance in the dependent variable. As we proceed
through the rest of the material in this book, we will discuss several things that tend to
increase the R2 value in a regression equation, such as incorporating additional variables,
allowing for nonlinear relationships, and reducing measurement error in our variables. But,
it is important to understand that these involve substantive and theoretical considerations
as much as, or even more so than, statistical issues.
Another question often asked is: Which is more important, the OLS coefficient estimates
or the R2 ? A related version of this question asks about the relative importance of the OLS
coefficients versus the correlation coefficient. In fact, either version is just the wrong question
to ask. The OLS estimates on the one hand and the R2 or the correlation coefficient on the
other hand measure different things. The OLS equation tells which line produces the best
least-squares fit to the bivariate data. The R2 and the correlation coefficient provide two
measures of how well the data conform to that best-fitting line. These are two quite different
properties of the relationship between the independent and dependent variables.
In order to show the difference between the regression coefficients and the goodness of fit
(or correlation), Figure 3.18 shows four scatterplots with OLS lines superimposed over the
data. First, look at panels A and B. Each of these two scatterplots contains 100 observations.
Notice that the regression line is identical in each of these two displays. The OLS line for
both is: Yi = 10.05 + 0.62Xi + ei . However, the goodness of fit is very different in each case.
The R2 for the linear relationship depicted in Figure 3.18A is quite large, at 0.94, while the
R2 for Figure 3.18B is much smaller, at 0.37. Now, look at panels C and D. In these two
displays, the R2 is identical, at 0.85. However, the OLS lines are quite different: In Figure
3.18C, Yi = 4.19 + 0.51Xi + ei . In Figure 3.18D, Yi = −1.61 + 1.48Xi + ei .
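One way to convince yourself of this distinction is to simulate data around a fixed line with different amounts of vertical scatter. The sketch below is not the construction behind Figure 3.18 (those panels are our own built examples); it simply shows that the estimated line can stay essentially the same while R2 changes dramatically.

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(0, 50, size=100)

    for noise_sd in (2.0, 12.0):                   # small scatter, then large scatter, around the same line
        y = 10.05 + 0.62 * x + rng.normal(0, noise_sd, size=100)
        b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
        a = y.mean() - b * x.mean()
        y_hat = a + b * x
        r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
        print(round(a, 2), round(b, 2), round(r2, 2))   # similar coefficients, very different fit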
Hopefully, our brief consideration of these two questions makes you aware of two important things: First, judgments about goodness of fit are always relative; there is no set
criterion for saying how large an R2 is “good enough” in any particular research setting. Second, the
OLS line, itself, is different from the data’s goodness of fit to that line. Both are important and substantively interesting characteristics of the relationship between the variables
and they should always be included whenever the results of an analysis are conveyed to an
audience (e.g., in a report, thesis, conference paper, etc.).
LOOKING AT THE RESIDUALS
We fitted a line to the Medicaid data because visual inspection of the scatterplot suggested
that there is a linear pattern in the combinations of X and Y values that observations possess.
In other words, we used a line, rather than some other type of curve, because it conforms to
what we believe to be the shape of the systematic structure which exists in the data. This is
a critically important idea, which we want to emphasize: The curve that we fit to the data
always should be consistent with the structure that exists within those data.
Building upon the preceding idea, if we truly believe the structure in our data is linear,
then once we remove that linear relationship from the observations, there should be no
remaining systematic structure within the data whatsoever; the “left-over” variability in Y
after we remove the part of each Yi that is related to X should be nothing more than random,
non-systematic, “noise.” If there is any nonrandom, discernible structure left in the data
after we pull out the linear relationship, then it would suggest that the fitted line does not
adequately capture all of the interesting elements in our bivariate data.
In order to investigate this possibility, it is very useful to take a careful look at the
residuals from our OLS line. As we know, the residual for each observation, i, is equivalent
to the signed difference between the actual value of Yi , and the fitted value, Ŷi . And, the
fitted values represent the part of Y that is linearly related to X; this leaves the residuals as
the part of Y that is not linearly related to X. We want to make sure that there is not any
other kind of nonrandom structure remaining within these residuals. And, since we don’t
really know what kind of structure might exist, it is probably useful to look at the residuals,
in a very literal sense, using a graphical display.
The Residual Plot for the OLS Line Fitted to the Medicaid Data
Figure 3.19 highlights the residuals from the OLS line in the Medicaid data by using
vertical line segments that connect the fitted values with their corresponding Yi ’s. We do
not necessarily need these line segments and we generally will not use them in future graphs
(except for pedagogical purposes); we merely show them here to focus our attention on the
relevant parts of the information in the scatterplot. The problem, however, is that the basic
scatterplot is not very useful for assessing patterns in the residuals. For one thing, our visual
impression is dominated by the linear structure between X and Y , captured by the OLS
line. Another problem is that it is difficult to make comparisons of the residuals across
observations, because they are shown as deviations from a “tilted” line. It would be better if
we could remove the linear trend from the graphical display and “un-tilt” the baseline from
which the residuals depart.
This can be accomplished very easily, by creating a scatterplot with the residuals on the
vertical axis, and the fitted values on the horizontal axis. Such a display is usually called
the residual versus fitted plot or just the residual plot for the OLS line. Figure 3.20 shows
the residual plot for the OLS line fitted to the Medicaid data.
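Producing such a display takes only a few lines of code. The sketch below uses Python's matplotlib library and hypothetical stand-in data; with the actual Medicaid values it would produce a plot like Figure 3.20.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical stand-in data.
    rng = np.random.default_rng(8)
    x = rng.uniform(5, 25, size=50)
    y = 301.136 + 48.252 * x + rng.normal(0, 150, size=50)

    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    e = y - y_hat

    plt.scatter(y_hat, e)                 # residuals (vertical axis) against fitted values (horizontal axis)
    plt.axhline(0, linestyle="--")        # dashed baseline at the mean of the residuals (zero)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()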
We want to point out several features that should hold for all residual plots. First, the
number of points in the residual plot is equal to the number in the original scatterplot of the
Y variable against the X variable. In fact, the points represent exactly the same observations
across the two graphical displays. If you look carefully at Figure 3.20, you might be able to see
that it really does represent an “un-tilted” version of Figure 3.19. So we are, indeed, looking
at a version of the scatterplot with the predominant linear pattern linking the number of
Medicaid recipients to the amount of Medicaid spending removed. Second, the residuals from
an OLS fit always have a mean of zero; therefore, that value invariably appears somewhere
near the middle of the vertical axis in a residual plot.12 Third, and closely related, if an
observation had a residual of exactly zero, then its plotted point would fall exactly on the
fitted OLS line. The horizontal axis in the residual plot corresponds to the fitted values.
Therefore, the zero point on the vertical axis can be viewed as a baseline for evaluating the
residuals. In order to emphasize this, we draw a horizontal, dashed line emanating from this
position along the vertical axis. Fourth, the vertical line segments we used back in Figure
3.19 are now largely unnecessary because the linear pattern has been removed, and the
residuals now represent vertical discrepancies from the horizontal line. Accordingly, visual
assessment is facilitated and more accurate judgments are possible. Fifth, there is absolutely
no linear structure between the two variables represented by the axes of the residual plot.
We can say this with perfect confidence, because we know from the properties of the OLS
fitting process that the residuals (i.e., the vertical-axis variable) are uncorrelated with the
fitted values (i.e., the horizontal axis variable).
Turning our attention to the contents of the residual plot in Figure 3.20, we do not see any
particular patterns in the graphical display. It appears to us that the plot shows a shapeless
cloud of points. Therefore, we would conclude that our OLS line really does capture the
systematic structure that connects the size of a state’s Medicaid population to its level of
Medicaid spending.
Do you agree with our assessment of this residual plot? For now, our assessment is
entirely subjective, so there is certainly room for disagreement and differing opinions. Later
in this book, we will discuss a variety of strategies for making more systematic evaluations
of the information in a residual plot.
But, for now, we think it is fairly safe to conclude that there are no easily-discernible
patterns in these residuals. It may seem a little strange to say this, but the lack of any
clear structure is exactly what we want to find here. And, it confirms that our linear model,
constructed by fitting the OLS line, truly does summarize the interesting elements of these
bivariate data.
The Residual Plot for an OLS Line Fitted to the Social Service Office Data
Before leaving our discussion of residual plots, we want to show you an example in
which a linear model does not adequately capture the structure within a bivariate dataset.
Remember the data on state social service offices that we examined at the beginning of this
chapter? The scatterplot in Figure 3.1 showed a distinctly curved pattern in the points,
and we concluded that the size of an office’s clientele tends to increase with the amount
of time the office has been open, but the average size of this increase decreases over time.
Any attempt to use a straight line (obtained via OLS or any other method) in order to
summarize these data would produce a distorted representation of the existing structure and
relationship between the length of time an office has been open and the size of its clientele.
Nevertheless, we could go ahead and apply the formulas for the OLS slope and intercept
to the bivariate data on social service offices. At this point, hopefully, you are asking
yourself, “Why would we do this, since we have already inspected these data and know that
the two variables are related in a nonlinear way?” True enough, and we agree with you
wholeheartedly. But, we are trying to show you what will happen when we try to fit a model
that is inconsistent with the predominant structure in the data. And, we are sorry to say
that many data analysts proceed in exactly this manner: They estimate the OLS coefficients
without first examining the data to see whether it is really appropriate to do so.
In any event, the OLS formulas applied to the social service office data produce the
following equation:
$$Y_i = 62.649 + 22.312X_i + e_i \tag{27}$$
Here, Yi is the mean number of clients per week for the ith social service office, and Xi is the
number of months social service office i has been open. If this equation is the only evidence
we examine, apart from the original data, we might well conclude that the two variables are
linearly related. After all, the slope has a sizable positive value, at 22.312; thus, the mean
number of weekly clients does seem to increase fairly substantially with each additional month
that an office is open. Furthermore, the OLS equation seems to fit the data very well: The
R2 value associated with this OLS line is 0.735, telling us that 73.5% of the variance in
the dependent variable is shared with, or explained by, the variability in the independent
variable through the linear relationship.
None of the preceding statements is completely incorrect. However, they provide only
an incomplete description of the structure that exists in the data. Figure 3.21 shows
the residual plot for the OLS fit to the social service office data. Just as we did earlier,
jittering is applied to the horizontal-axis variable in order to break up the plotting locations
a bit. As is always the case, the residuals are uncorrelated with the predicted values, so
there is no linear structure to be found in this graphical display. But, if you look closely, you
can see that there is a sort of “hump” in the point cloud. That is, the points in the central
area of the plotting region are far more likely to fall above the horizontal baseline, while
the points near the left and right sides of the plotting region tend to fall below the baseline.
Figure 3.22 fits a conditional mean smoother to these data and this further emphasizes the
existence of a clear systematic, albeit nonlinear, pattern in these data.
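A crude way to perform this kind of check by computer is to slice the fitted values into groups and look at the mean residual within each group; a "hump" like the one just described shows up as interior group means that sit above zero while the extreme groups sit below it. The sketch below uses hypothetical curved data standing in for the social service office dataset, which we do not reproduce here.

    import numpy as np

    # Hypothetical curved relationship standing in for the social service office data.
    rng = np.random.default_rng(9)
    x = rng.uniform(1, 24, size=60)                       # months the office has been open
    y = 300 * np.sqrt(x) + rng.normal(0, 40, size=60)     # clientele grows, but at a decreasing rate

    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)    # force a straight line onto the curved data
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    e = y - y_hat

    # Conditional mean of the residuals within five equal-count slices of the fitted values.
    order = np.argsort(y_hat)
    for k, idx in enumerate(np.array_split(order, 5)):
        print(k, round(e[idx].mean(), 1))                 # middle slices above zero, end slices below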
The OLS line misses this nonlinearity entirely. But, it does pull out whatever linear
trend exists in the data. The “hump” in the residual plot represents the structure that is left
over after doing so. Therefore, we would say that the OLS equation provides an incomplete
description of the relationship between the two variables. Relying solely on the linear model
of the data, we would miss some potentially interesting nonlinear structure.
In most contexts, we believe that doing this (i.e., ignoring nonlinearity when it is clearly
present in data) is a bad thing. But, there is definitely a trade-off involved in taking the
nonlinearity into account. We have already stated that nonlinear relationships are generally
more complex than linear relationships because the connection between the independent
and dependent variables is not constant across the range of X. As we will see in Chapter
9, the specification of the equations that we use to provide a quantitative description of
these relationships must be adjusted accordingly. Now, parsimony is an explicit objective of
scientific explanations. Therefore, we must decide whether adjusting our model to take the
nonlinearity into account is worth the costs imposed by this additional complexity. Generally,
we believe that it is. However, others could well disagree. And, we want to emphasize that
this is always a modeling decision that is entirely up to the data analyst.
CONCLUSIONS
We have covered a lot of ground in this chapter! We started with the general objective
of regression analysis to provide a description of how the conditional Y distributions vary
across the values of the X variable. And, we focused particularly on the central tendencies of
these conditional distributions, in the form of the conditional means. Scatterplot smoothing
is a general strategy for superimposing a curve over the points in a scatterplot such that the
curve follows the means of the conditional Y distributions. The result is a graphical form of
regression analysis. We called it “nonparametric regression” because the shape of the curve
is not specified in advance of fitting it to the scatterplot; instead, the form of the fitted curve
is determined entirely by the bivariate data.
We noted earlier and we want to reiterate that scatterplot smoothing and nonparametric
regression are potentially vast topics that deserve far more attention than we have given
them so far. We will definitely return to these ideas later in this book, where we will use
scatterplot smoothing as an important tool for ensuring that our regression models produce
accurate descriptions of the structure within our data. But, for now, we used the smooth
curves obtained through nonparametric regression more as an exploratory mechanism for
determining what the nature of the structure is within the bivariate data in the first place.
We moved to parametric regression when our scatterplot smoother revealed that the predominant structure within the bivariate data really is linear in form. And, we characterized
the problem as one in which we try to find the single line that best approximates that linear
structure. Notice that this is a manifestation of the overall idea of a descriptive statistic.
In other words, any given descriptive statistic seeks to summarize a particular characteristic
of data. For example, the mean summarizes the central tendency of a univariate distribution, while the standard deviation summarizes the spread or dispersion of a univariate
distribution. In the same spirit, the slope and intercept of a regression line are intended to
summarize the predominant linear pattern in a bivariate data distribution. So, we are really
just continuing on with a process that you began back when you started studying statistics.
Once we decide that the predominant structure within a set of bivariate data really does
conform to a linear pattern, the task becomes one of positioning a line within the data cloud
to provide a representative summary of that pattern. We started with an intuitive approach,
in which we merely try to place the line so that it runs through the middle of the point
cloud in the scatterplot. Our intuition was pretty useful in this case, because we developed
formulas for the slope and the intercept which are easy to calculate and can be expressed
in terms of a few summary statistics that we encountered earlier (i.e., the covariance, the
variance, and the mean).
But, we couldn’t stop at that point, because we didn’t want to just have some line— we
wanted the best line. We used the least-squares criterion to define precisely what we mean
by “best.” And, after doing so, we discovered that our intuitively-derived line through the
middle of the point cloud is also the one that produces the smallest possible sum of squared
residuals, hence making it the best possible line, according to our specified criterion. We
called the formulas for finding this line the ordinary least squares estimators, and we began
to call the line, itself, the OLS line.
Once we find the best line for a given set of bivariate data, it is natural to ask how good
the line is as a description of the pattern within the data. In order to answer that kind of
question, we reached back to our original motivation for performing the data analysis in the
first place: We want to account for the existence of variance in the dependent variable, Y .
We found that the OLS line actually breaks the values of the dependent variable down into
two uncorrelated sets of values: The fitted values, which we presented as the “new” variable,
Ŷ , and the residuals, which we designated as the variable, e. The fitted values represent the
part of Y that is linearly related to X, through the OLS line. So, we take the sum of squares
in these fitted values (which we called the regression sum of squares and designated SSReg )
as the numerator of a fraction in which the sum of squares for Y , itself (which we called the
total sum of squares in Y , or SSY ), is the denominator. This fraction gives the proportion of
the variance in the dependent variable that is linearly related to variance in the independent
variable through the fitted values (which are, themselves, linear transformations of the Xi ’s).
Also, the fraction is called R2 to reflect the fact that its value is equal to the square of the
correlation coefficient between X and Y .
Notice that the work we carried out in this chapter is nicely consistent with some of
the ideas that we first raised back in Chapter 1. The OLS line is actually a model of
bivariate data in the sense that it provides a parsimonious representation of an interesting
aspect of those data. The line is determined by only two parameters— the slope and the
intercept. But, if the structure in the data truly is linear, then this OLS line “stands in” for
the information contained in the original set of 2n data values (that is, n values of the X
variable and n values of the Y variable). Of course, we recognize that this fitted line does not
tell us everything about what is going on in our data; models are inherently simplifications
which omit some of the detail. And, the R² provides goodness-of-fit information to tell us
how much of the data (or, more precisely, what proportion of Y ’s variance) is captured in
the OLS regression line. It is gratifying to see that our hard work in this chapter really has
taken us along the road we laid out initially in Chapter 1, using the tools that we developed
in Chapter 2! Apparently, we are traveling in the direction that we intended.
But, this chapter also ended on something of a cautionary note: We argued that fitting
the OLS line to the data is not sufficient, in itself, if we really do want that line to be a useful
model of those data. Instead, we need to take a further diagnostic step and try to guarantee
that the line which we have imposed on the data really does reflect the structure that exists
in those data. This leads us to the residual plot for the OLS line, which we introduced as a
tool to search for any additional nonrandom patterns that may linger in the data after we
have pulled out any linear structure between X and Y .
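As a purely illustrative sketch (not the procedure used to produce the figures in this chapter), a residual-versus-fitted plot can be drawn with matplotlib as follows, again using the hypothetical data from the earlier sketches.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([90.0, 105.0, 118.0, 130.0, 150.0, 161.0])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
fitted = a + b * x
residuals = y - fitted

# Look for leftover systematic patterns (curvature, funnels, clusters);
# a patternless horizontal band suggests the line captures the structure.
plt.scatter(fitted, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```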
The residual plot is a specific tool that we will employ frequently in subsequent chapters
of this book. But, for now, we want to emphasize the underlying line of thinking that led us
to it in the first place— we believe this is really, really important! Models are always artificial,
simplistic constructions that the analyst imposes on his or her data. It is incumbent upon
that researcher (and no one else!) to make sure that the model really does provide an accurate
and reasonably complete representation of the interesting elements within the dataset under
investigation. Therefore, fitting a model, such as an OLS line, is just the beginning of an
ongoing process. It is always important to inspect the correspondence between the
fitted model and the data (using tools like the residual plot) in order to diagnose and address
any problems that may exist. Indeed, we believe that good researchers are characterized by
their aggressiveness in carrying out this introspective process of always questioning their
regression models and applying suitable remedies when problems are discovered. This is a
theme to which we will return many times throughout the rest of this book.
Box 3-1: Derivation of the formula for the b coefficient in a line that runs through the middle
of the point cloud in a scatterplot.
\begin{align}
Y_i &= a + bX_i + e_i \tag{1}\\
Y_i - \bar{Y} &= (a + bX_i + e_i) - (a + b\bar{X} + \bar{e}) \tag{2}\\
Y_i - \bar{Y} &= a + bX_i + e_i - a - b\bar{X} - \bar{e} \tag{3}\\
Y_i - \bar{Y} &= (a - a) + (bX_i - b\bar{X}) + (e_i - \bar{e}) \tag{4}\\
Y_i - \bar{Y} &= b(X_i - \bar{X}) + (e_i - \bar{e}) \tag{5}\\
(X_i - \bar{X})(Y_i - \bar{Y}) &= (X_i - \bar{X})\bigl[b(X_i - \bar{X}) + (e_i - \bar{e})\bigr] \tag{6}\\
(X_i - \bar{X})(Y_i - \bar{Y}) &= b(X_i - \bar{X})^2 + (X_i - \bar{X})(e_i - \bar{e}) \tag{7}\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_{i=1}^{n}\bigl[b(X_i - \bar{X})^2 + (X_i - \bar{X})(e_i - \bar{e})\bigr] \tag{8}\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= \sum_{i=1}^{n} b(X_i - \bar{X})^2 + \sum_{i=1}^{n}(X_i - \bar{X})(e_i - \bar{e}) \tag{9}\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= b\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{i=1}^{n}(X_i - \bar{X})(e_i - \bar{e}) \tag{10}\\
\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) &= b\sum_{i=1}^{n}(X_i - \bar{X})^2 \tag{11}\\
\frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} &= b\,\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \tag{12}\\
b &= \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \tag{13}
\end{align}
(The step from (10) to (11) uses the requirement that the residuals be uncorrelated with X, so that $\sum_{i=1}^{n}(X_i - \bar{X})(e_i - \bar{e}) = 0$.)
Box 3-2: Derivation of the formula for the a coefficient in a line that runs through the
middle of the point cloud in a scatterplot.
\begin{align}
Y_i &= a + bX_i + e_i \tag{1}\\
\bar{Y} &= a + b\bar{X} + \bar{e} \tag{2}\\
\bar{Y} &= a + b\bar{X} \tag{3}\\
\bar{Y} - b\bar{X} &= a \tag{4}\\
a &= \bar{Y} - b\bar{X} \tag{5}
\end{align}
(The step from (2) to (3) uses the requirement that the residuals sum to zero, so that $\bar{e} = 0$.)
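As a quick numerical sanity check (our addition, using made-up data), the formulas derived in Boxes 3-1 and 3-2 agree with the slope and intercept returned by a standard least-squares routine.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([90.0, 105.0, 118.0, 130.0, 150.0, 161.0])

# Box 3-1: slope formula
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Box 3-2: intercept formula
a = y.mean() - b * x.mean()

# NumPy's degree-1 polynomial fit returns (slope, intercept)
b_check, a_check = np.polyfit(x, y, deg=1)
print(b, b_check)  # should agree up to floating-point rounding
print(a, a_check)
```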
NOTES
1. Mention other ways of defining the center of each slice.
2. Boundaries placed halfway between max X in lower slice and min X in higher slice.
3. Explain again that X could be a dependent variable in some other regression. Entirely context-specific!
4. Mention PCA, which does use this criterion to fit a line.
5. In fact, the first criterion for our operational definition of “middle” is unnecessary, given
this second condition.
6. Mention that these are “well-behaved” point clouds. It is possible to construct situations
where most observations fall on one side of the line, but one additional observation (or a small
set of observations) has extremely large values on the other side of the line, and the line
is thereby “forced” away from the main point cloud. We will consider precisely this kind of
situation in Chapter 11 on unusual data.
7. Fifty observations is a pretty large dataset, and we certainly do not expect you to perform
calculations in a dataset of such size by hand. But, in principle, it could be done and we
want to show you the exact steps that you might follow in order to do so. But, like any sane
person, we actually rely upon the labor-saving power of hand calculators and computers to
do the calculating work for us!
8. We cannot help but point out that this calculated value of the slope is remarkably close to
the estimate we produced by “eyeballing” the smooth curve that we obtained by calculating
the conditional mean smoother for the sliced version of the scatterplot.
9. Explain how intercept calculated.
10. You might wonder how we know the shape of the curve. The answer is that the sum of squared
residuals is a particular kind of formula called a quadratic, and graphs of quadratics always
have this shape.
11. In fact, the covariance also measures this. But, we stick with the correlation because of its
interpretational advantages.
12. In fact, we generally set the vertical axis so that the range of values shown in the plot is symmetric around the zero point.
Figure 3.1: Scatterplot of data on state social service offices, showing the number of clients per week (during the preceding month) versus the number of months the office has been open. [X axis: Number of months office has been open; Y axis: Mean number of clients per week.]

Figure 3.2: Fitting a smooth curve to the scatterplot of the social service office data, using conditional means. [Panel A: Scatterplot with conditional Y means superimposed over data. Panel B: Conditional means are connected with line segments. X axis: Number of months office has been open; Y axis: Mean number of clients per week.]

Figure 3.3: Scatterplot of data on state social service offices, showing final conditional mean smoother. [X axis: Number of months office has been open; Y axis: Mean number of clients per week.]

Figure 3.4: Scatterplot showing 2005 Medicaid expenditures in the American states (dollars per capita) versus the number of Medicaid recipients in each state (as a percentage of total state population). [X axis: Medicaid recipients, as percentage of state population; Y axis: State Medicaid spending, 2005 (dollars per capita).]

Figure 3.5: Scatterplot of 2005 state Medicaid data, with a basic conditional mean smoother superimposed over the data points. [X axis: Medicaid recipients, as percentage of state population; Y axis: State Medicaid spending, 2005 (dollars per capita).]

Figure 3.6: Slicing the scatterplot of the state Medicaid data into six slices, and fitting the conditional mean smoother across the slices. [Panel A: Slice boundaries added to scatterplot. Panel B: Slice centers shown within slices. Panel C: Plot conditional Y means over slice centers. Panel D: Line segments connecting the conditional means. X axis: Medicaid recipients; Y axis: Medicaid spending.]

Figure 3.7: Scatterplot of the 2005 state Medicaid data, showing the final conditional mean smoother fitted across ten slices. [X axis: Medicaid recipients, as percentage of state population; Y axis: State Medicaid spending, 2005 (dollars per capita).]

Figure 3.8: Scatterplot of the 2005 state Medicaid data, showing the final conditional mean smoother fitted across five slices. [X axis: Medicaid recipients, as percentage of state population; Y axis: State Medicaid spending, 2005 (dollars per capita).]

Figure 3.11: Scatterplot showing centroid of data points. [X axis: X variable; Y axis: Y variable.]

Figure 3.12: Scatterplots with lines fitted such that summed residuals do not equal zero. [Panel A: ∑ei = 384.414. Panel B: ∑ei = −365.586. X axis: X variable; Y axis: Y variable.]

Figure 3.13: Scatterplots with lines fitted such that residuals are correlated with the X variable. [Panel A: rX,e = 0.851. Panel B: rX,e = −0.908. X axis: X variable; Y axis: Y variable.]

Figure 3.14: Scatterplot of 2005 state Medicaid spending versus size of Medicaid recipient population, with line fitted through the middle of the point cloud. [X axis: Medicaid recipients, as percentage of state population; Y axis: State Medicaid spending, 2005 (dollars per capita).]

Figure 3.15: Scatterplots of the 2005 state Medicaid data, showing different lines running through the “middle” of the point cloud. [Panel A: Line Ŷi = 275.8 + 50Xi fitted to data. Panel B: Line Ŷi = 100 + 68Xi fitted to data. X axis: Medicaid recipients; Y axis: Medicaid spending.]

Figure 3.16: Scatterplot of 2005 state Medicaid data, showing a line that does not fit the data very well, but for which the sum of the residuals equals zero. [X axis: Medicaid recipients, as percentage of state population; Y axis: State Medicaid spending, 2005 (dollars per capita).]

Figure 3.19: Scatterplot of 2005 state Medicaid data with OLS line superimposed over data points, and residuals drawn in as vertical line segments. [X axis: Medicaid recipients, as percentage of state population; Y axis: State Medicaid spending, 2005 (dollars per capita).]

Figure 3.20: Residual versus fitted plot for OLS regression line fitted to the 2005 state Medicaid data. [X axis: Fitted values from OLS line fitted to Medicaid data; Y axis: Residuals from OLS line fitted to Medicaid data.]

Figure 3.21: Residual versus fitted plot for OLS line fitted to data on state social service offices. [X axis: Predicted values (jittered) from OLS line fitted to social service office data; Y axis: Residuals from OLS line fitted to social service office data.]

Figure 3.22: Residual versus fitted plot for OLS line fitted to data on state social service offices, with conditional mean smoother superimposed over data points. [X axis: Predicted values (jittered) from OLS line fitted to social service office data; Y axis: Residuals from OLS line fitted to social service office data.]