Schneider-Jacoby manuscript, Chapter 2 Draft

This chapter will introduce some important concepts and tools which will prove useful for
our development of the regression model. The concepts include critical ideas like “linear,”
“covariance,” and “correlation.” The tools include scatterplots, linear transformations, linear
combinations, and the correlation coefficient. It is possible that some readers might consider
this material to be a digression from the major objectives of this text. However, we would
reply that the topics to be considered here will actually facilitate our broader discussion and
the presentation of the statistical material which follows in subsequent chapters.
If you have already read the first chapter (and, of course, we assume that you have
done so, if you are reading this!), then you have probably gained some sense of our own
approach: We want to be systematic and cumulative in the ways that we introduce new
material— explaining why we want to pursue a topic, and building upon results that we
have already established. We definitely do not want to take a superficial approach in which
concepts, results, and rules for usage simply “fall from the sky.” We want to convey to you
why the regression model looks and works as it does. To beginners, this may seem to be
a somewhat difficult way to introduce the subject matter. But, we are convinced that it
is useful to proceed this way because our approach provides much better understanding in
the end. So, let us take a look at some basic concepts and tools. As you will see, they will
appear repeatedly throughout the rest of the book. It is very helpful to develop a firm grasp
of these basic elements before we begin applying them to develop the regression model and
test its various elements.
THE SCATTERPLOT
The scatterplot is the basic graphical display for bivariate data. We already had a brief
discussion of scatterplots and their construction back in Chapter 1, and we saw a number
of examples there, too. So, why do we need a special section discussing them further in this
chapter? Primarily to lay out some principles and guidelines for creating good scatterplots!
Unfortunately, the default selections for creating scatterplots in some software packages
result in displays which do not present the information they contain in an optimal manner.
Of course, we want to avoid such problems whenever possible.
Back in Chapter 1, we explained that a scatterplot is set up by drawing two perpendicular
coordinate axes. Each axis represents one of the two variables; traditionally, X is assigned
to the horizontal axis, and Y to the vertical axis. If we are using the scatterplot to assess the
relationship between the two variables, then the horizontal axis is assigned to the independent
variable and the vertical axis is assigned to the dependent variable.
Each axis “represents” its variable in the sense that it is a number line which is calibrated
in that variable’s data units (note that the units need not be the same for the horizontal
and vertical axes). Each axis covers that variable’s range of possible values (or, at least,
the range of values which actually occurs in the empirical data at hand). Tick marks are
usually included on the axes to show the relative positions of the sequence of values within
each variable’s range.
Observations are plotted as points within the coordinate space defined by the two axes.
The process is simple: For any given observation, say the ith , find the location corresponding
to its value on X along the horizontal axis (this value is usually designated xi ) and the
location for its value on Y (that is, yi ) along the vertical axis. From each of these locations,
project imaginary lines into the space between the axes; the imaginary line from each axis
should be parallel to the other axis, so these imaginary lines will be perpendicular to each
other, just like the axes themselves. Place a point (or some other symbol) at the position
where these imaginary lines intersect. That point represents observation i. Repeat this
process for all n observations in the full dataset.
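To make the plotting procedure concrete, here is a minimal sketch in Python with matplotlib (our choice of tools; the x and y values are invented for illustration and are not taken from any table in this book):

```python
# A minimal sketch of the point-placement procedure described above.
# The x and y values are made-up illustrative numbers.
import matplotlib.pyplot as plt

x = [4.2, 7.1, 5.5, 9.0, 6.3]   # x_i: each observation's value on X
y = [1.8, 3.4, 2.6, 4.1, 2.9]   # y_i: each observation's value on Y

fig, ax = plt.subplots()
ax.scatter(x, y)                 # one point at each (x_i, y_i) intersection
ax.set_xlabel("X (independent variable)")
ax.set_ylabel("Y (dependent variable)")
plt.show()
```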
A Substantive Example: Medicaid Spending in the States
To illustrate a scatterplot, let us return to a substantive example first introduced in
Chapter 1. Table 2.1 shows bivariate data on 50 observations— the American states. Since
the information is arrayed in a tabular format, this kind of display is often called a “data
matrix.” In a data matrix, the rows correspond to observations and the columns correspond
to variables. There are two variables (note that the state names shown in the left-hand
margin of the table are row identifiers, although they also could be considered a categorical
variable): The leftmost column shows each state’s 2005 expenditure for Medicaid, in dollars
per capita. The right-hand column shows the number of people within each state who are
covered by Medicaid, expressed as a percentage of the state’s total population. Note that
the ordering of the observations in the data matrix is arbitrary. In Table 2.1, the states are
arranged in alphabetical order. However, this is only for our convenience; it has nothing
whatsoever to do with any subsequent analyses we may perform on these data.
Figure 2.1 shows a scatterplot of the bivariate data. We have assigned the size of the at-risk population to the horizontal axis, and health care expenditures to the vertical axis. This
placement reflects our substantive assumption that the former is the independent variable
and the latter is the dependent variable. The fifty states are plotted as small open circles
within the region defined by the perpendicular axes of the graph.
It is worth emphasizing that Figure 2.1 contains exactly the same information as Table
2.1. It is merely presented in a different format. Neither of these formats is any better or
worse than the other; instead, they tend to be useful for different purposes. The data matrix
enables direct observation (and manipulation) of the precise numeric quantities associated
with each observation. However, it is not particularly useful for discerning structure within
the overall dataset. The scatterplot enables a visual assessment of structure in the data.
But, it is difficult to recover the precise variable values from the graph. As we will see, both
of these kinds of tasks come into play during a data analysis project. It is important for the
researcher to have some sense about the kind of structure that might exist in the data, so
he or she knows what kind of model to construct. Once this is known, the data values can
be entered into the appropriate equations to estimate values for the model parameters. But,
again, we need to know the form of the model before we can estimate it. And, that is more
easily discerned from a graphical display than the raw data matrix, itself. For that reason,
we need to pay attention to the details of constructing “good” scatterplots.
Guidelines for Good Scatterplots
Scatterplots can be used for many different purposes. We will employ them primarily to
discern the existence of systematic structure within our data. Therefore, the details of the
scatterplot should be determined with that objective in mind. In this section, we will lay out
a set of principles for constructing good scatterplots. But, to reiterate the preceding idea,
the definition of a “good” scatterplot is not based upon aesthetic considerations. Rather, a
good scatterplot is one in which the analyst can see the bivariate information in a manner
that facilitates an assessment of patterned regularities across the observations in the dataset.
One broad generalization is that good graphical displays should “maximize the data-to-ink ratio” (Tufte 1983). This implies that all elements of the scatterplot should be aimed
at providing an accurate representation of the data, rather than devoted to decorative and
unnecessary trappings. Beyond this, there is no intrinsic ordering to our set of principles.
Still, in order to lay out our ideas in a logical manner, we will work “from the outside to the
inside” of the scatterplot.
Use an aspect ratio close to one. The aspect ratio of a graph is the ratio of its physical
height to its physical width. An aspect ratio of one means that the vertical and
horizontal axes will be equally sized. This is usually helpful because it makes it easier
to see how the Y values tend to shift across the range of X values than would be the
case with very wide graphs (i.e., aspect ratios much lower than one) or very tall graphs
(i.e., aspect ratios much larger than one).
Show the full scale rectangle. By “scale rectangle,” we mean explicit axes which delimit
all four sides of the plotting region for the scatterplot. Including all four axes facilitates
more accurate visual judgments about differences in the relative positions of the plotted
points. Some graphing software packages only draw the left and bottom axes, leaving
the right and top sides of the plot “open.”
Provide clear axis labels. The rationale for this point is fairly obvious. You want anyone
who looks at your scatterplot to understand what is being shown in the graphical
display. Therefore, you should put clear descriptions of the variables on the left and bottom
axes, rather than abbreviations or short acronyms, such as the kinds of variable names
required in many software packages.
Use relatively few tick marks on axes. We are not using the scatterplot to determine
the data values associated with the respective observations. Therefore, it is not necessary to provide too much detail about the number lines which represent the variables’
value ranges in the scatterplot. Three or four tick marks (with labeled values on the
left and bottom axes) should be sufficient to enable readers to estimate the general
data values that coincide with various locations within the plotting region.
Do not use a reference grid. There is generally no reason to include a system of horizontal and vertical grid lines in the plotting region of a scatterplot. For one thing, the
axis tick marks should be sufficient to help readers evaluate variability in the plotted
point locations. For another thing, a grid is a visually distracting element that might
make it more difficult to discern the data points accurately.
Data points should not be too close to the axes. The ranges of values on the horizontal and vertical axes should extend beyond the ranges of the respective variables in
the scatterplot. This guarantees that data points do not collide with the axes (which
would make them more difficult to see).
Select plotting symbols carefully. The plotting symbols used in a scatterplot need to
be large enough to enable visual detection of the observations (e.g., small dots or
periods are often difficult to see). At the same time, the symbols should not be unduly
affected by overplotting; that is, it should be possible to discern differences across
symbols which represent distinct observations with very similar data values (e.g., open
circles work better than squares, diamonds, or solid circles). Generally speaking, the
size of the plotting symbols is inversely related to the number of observations included
in the plot (scatterplots with large numbers of observations should use smaller plotting
symbols than scatterplots with small numbers of observations).
Avoid point labels. We are using the scatterplot to discover structure in data, defined
here as variability in the conditional distributions of Y values across X values. Thus,
we are interested in relative concentrations of data points at various positions within
the plotting region, rather than the identities of the observations that fall at particular
locations (which can be obtained more accurately from the data matrix, itself). In
addition to being extraneous to our objectives, observation labels are generally detrimental in a scatterplot, because they make it more difficult to observe the data points,
themselves.
The preceding list of principles should be regarded as guidelines, rather than hard and fast
rules. There are certainly situations where a researcher might want to ignore one or more of
the principles. But, for most situations, these guidelines should produce scatterplots which
optimize our ability to visually discern how observations’ Y values tend to shift across the
values of the X variable— in other words, we can see whether the dependent variable is
related to the independent variable, according to the definition of empirical “relatedness”
laid out in Chapter 1.
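For readers who build their graphs in software, the following sketch shows one way to implement these guidelines in Python with matplotlib; the plotted values are randomly generated stand-ins, not the actual state data from Table 2.1, and the specific settings are only one reasonable choice:

```python
# A sketch of a scatterplot constructed according to the guidelines above.
# The data are randomly generated placeholders, not the Medicaid data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(5, 25, size=50)            # hypothetical independent variable
y = 20 + 3 * x + rng.normal(0, 10, 50)     # hypothetical dependent variable

fig, ax = plt.subplots(figsize=(5, 5))     # aspect ratio close to one
ax.scatter(x, y, facecolors="none", edgecolors="black")   # open circles
ax.set_xlabel("Percent of population covered by Medicaid")
ax.set_ylabel("Medicaid spending (dollars per capita)")
ax.margins(0.08)                           # keep points away from the axes
ax.locator_params(nbins=4)                 # relatively few tick marks
ax.grid(False)                             # no reference grid
# matplotlib draws the full scale rectangle (all four axes) by default.
plt.show()
```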
Jittered Scatterplots for Discrete Data
Before leaving the topic of scatterplots, we need to consider a situation that arises quite
frequently in the social sciences: Scatterplots with discrete data. What do we mean by
this? You have probably already heard of the distinction between discrete and continuous
variables. Our definitions of these terms probably differ slightly from those you have already
encountered. For our purposes, a discrete variable is one in which the number of distinct
values is small, compared to the number of observations and (logically enough) a continuous
variable is one in which the number of distinct values is large. Note also that a variable’s
status as discrete or continuous is specific to the particular dataset under observation; that
is, these terms apply to the particular configuration of variable values that actually show up
in a given sample of observations.
We can easily specify the limiting cases for these two types of variables. A discrete
variable can not have any fewer than two values. Otherwise, of course, it is a constant rather
than a variable. At the other extreme, a continuous variable can have a maximum of n
distinct values, since a variable cannot take on more distinct values than there are observations in the
dataset. But, the “dividing line” between discrete and continuous status is not clear! In
fact, there is no particular need to specify a particular cut-off point for the number of values
at which a variable changes from discrete to continuous. Instead, we will simply make the
determination on a case-by-case basis.1
Getting back to the more general topic, what will happen if we try to construct a scatterplot using discrete, rather than continuous variables? In that case, there will likely be a
great deal of overplotting, in which more than one observation is located at exactly the same
position within the plotting region. This is a problem because the overplotting makes it
very difficult to discern accurately the varying densities of data points at different locations
within the scatterplot.
As a specific example, the 2004 CPS American National Election Study (NES) contains
data from a public opinion survey of the American electorate, conducted in fall 2004. Among
many other questions, the respondents to this survey were asked their personal party identifications, and their ideological self-placements. Both of these variables are measured on
bipolar scales (ranging from “strong Democrat” to “strong Republican” on the party identification scale and ranging from “extremely liberal” to “extremely conservative” on the
ideology scale). Political scientists often attach numeric values to these categories, such as
successive integers from one to seven. For theoretical reasons, it is interesting to consider
the relationship between these two variables. And, we might try to do this by examining a
scatterplot.
Figure 2.2 shows the scatterplot of ideology versus party identification for the 2004 NES
data. The graphical display actually includes 915 observations. But, there are only 45
distinct plotting positions, due to the discrete nature of the two variables.2 The scatterplot
shows an undifferentiated, nearly square, array of points, making it impossible to tell how
many observations actually fall at each location.
What can we do in such a situation? The basic problem is that data points are superimposed directly over each other. In order to change this, we might consider moving all of
the points very slightly, just to break up the specific plotting locations. This can be accomplished very easily by adding a small amount of random variation to each data value. This
process is called “jittering.” Figure 2.3 shows a jittered version of the scatterplot for party
identification versus ideology. Now, variations in the ink density make it easier to see that
increasing values on one of these variables do tend to correspond to increasing values on the
other; in other words, partisanship and ideology are related to each other, in a manner that
makes sense in substantive terms. Jittering enables us to see this in the scatterplot; the
nonrandom pattern in the placement of the data points was invisible in the “raw” version of
the graphical display.
You might be concerned that we are doing something illegitimate (i.e., cheating) by
jittering the points in a scatterplot. After all, by moving them around in this manner, we
are implicitly changing the data values associated with the plotted points. In fact, this is just
not problematic: We are only changing the point locations, not the data values themselves.
In doing so, we can observe something that we could not see before jittering: The variation
in the density of the data within different sectors of the plotting region. And, of course, the
latter is why we are looking at the scatterplot in the first place!
Furthermore, jittering does not move the points very much. The shift in any point’s
location along any of the coordinate axes should always be kept smaller than the rounding
limits on the respective variables. Sticking to this rule will minimize the possibility that
anyone looking at the scatterplot will mistake the jittering for real, substantive, variation in
the data values.
Finally, we need to emphasize that jittering only occurs in the contents of the scatterplot.
The random movements in the point locations do not get retained in the variable values which
compose the actual dataset, itself. Any and all calculations involving the data (e.g., summary
statistics) are carried out using the original, non-jittered values. In summary, jittering is a
tool that can be employed to optimize visual perception of relationships between discrete
variables in a scatterplot.
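A sketch of the jittering procedure, again in Python with matplotlib, appears below. The “survey responses” are randomly generated stand-ins for two 7-point scales, not the actual 2004 NES data; note that only the plotted positions are jittered, while the original values would be retained for any calculations:

```python
# Jittering a scatterplot of two discrete (7-point) variables.
# The responses are simulated placeholders, not the 2004 NES data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
party = rng.integers(1, 8, size=915)                            # 1-7 scale
ideology = np.clip(party + rng.integers(-2, 3, size=915), 1, 7)

jitter = 0.15   # well below the rounding limit of one scale unit
px = party + rng.uniform(-jitter, jitter, size=party.size)
iy = ideology + rng.uniform(-jitter, jitter, size=ideology.size)

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(px, iy, facecolors="none", edgecolors="black", s=15)
ax.set_xlabel("Party identification (1 = strong Democrat, 7 = strong Republican)")
ax.set_ylabel("Ideology (1 = extremely liberal, 7 = extremely conservative)")
plt.show()
```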
THAT UBIQUITOUS CONCEPT, “LINEAR”
One of the terms that seems to show up frequently in regression analysis and in statistics,
more generally, is the word, “linear.” So, for example, you have probably heard of “linear
relationships between variables” and “linear regression analysis.” As we proceed through
the material, we’ll see a few more appearances, such as “linear transformations,” “linear
combinations,” and “linear estimators.” Therefore, it seems reasonable to spend some time
explaining exactly what this term means.
It is, perhaps, tempting to say that “linear” refers to something that can be shown as
a line. Well, this is probably true. But, it is not very informative! As a more constructive
approach, we will begin by saying that the term, “linear,” is an adjective which describes a
particular kind of function. Of course, this raises an additional question: What is a function?
To answer that question, let us identify two sets, designated X and Y . A function is
simply a rule for associating each member of X with a single member of Y . It is no accident
that we used X and Y as the names of the two sets: For our purposes, these sets are two
variables, and their members are the possible values of those variables. Here is a generic
representation of a function relating the variable Y to the variable X:
Yi = f(Xi)    (1)
Here, Yi refers to a specific value of the variable, Y , and Xi refers to a specific value of the
variable, X. The “f ” in expression 1 just designates the rule that links these two values
together. So, the preceding expression could be stated verbally as “start with the ith value
of X, apply the rule, f, to it, and the result will be the ith value of Y.”
As a very simple example of a function, let’s say that f is the rule, “add six to”. If
Xi = 3, we would apply “add six to 3” in order to produce Yi = 9. Similarly, we could apply
the same rule to other legitimate values of X in order to obtain other Y values. The result
would be a set of Y values which are “linked” to the X values through the function, or rule,
which was used to create the specific Yi ’s from the respective Xi ’s.
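If it helps to see the rule written out explicitly, here is the “add six to” rule expressed as a short Python function (a sketch of ours, not anything required by the discussion):

```python
# The rule "add six to," written as a function: each legitimate X value
# is mapped to exactly one Y value.
def f(x_i):
    return x_i + 6

print(f(3))   # applying the rule to X_i = 3 produces Y_i = 9
```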
Linear and Nonlinear Functions
A linear function is a particular type of rule, in which the Xi ’s are each multiplied by a
constant, or a single numerical value which does not change across the different values of X.
Thus, one type of linear function would be:
Yi = bXi    (2)
Notice that the b in equation 2 does not have a subscript. This indicates that the numeric
value that is substituted for b does not change across the different values of X. This constant
is often called the “coefficient” in the equation relating X to Y . Let us assume that b = 2.
If that is the case, Xi = 1 will be associated with Yi = 2 = 2 · 1. Similarly, if Xi = 3 then
Yi = 6 = 2 · 3. Regardless which X value is input to the function, it is always multiplied by
two (i.e., the value of b) to obtain Yi .
Note that the function relating values of X to values of Y can contain other terms besides
the variable, X, itself. For example, consider the following:
Yi = a + bXi    (3)
Stated verbally, equation 3 says that, in order to obtain the Y value for observation i, the
constant, a, should be added to the product obtained by multiplying the coefficient, b, by the
X value for that observation (that is, Xi ). And, since the value of a does not change across
the observations (we know this because it does not have an “i” subscript), it is considered
another coefficient in the function relating X to Y .
Equation 3 still shows Y as a linear function of X, since its responsiveness to changes
in X values remains constant. That is, equal differences in pairs of X values will always
correspond to a fixed amount of difference in the associated pairs of Y values. Adding the
new coefficient, a, does not change that in any way. In fact, the kind of function shown
in equation 3 has a special name: It is called a “linear transformation.” We will look more
closely at linear transformations in the next section of this chapter.
What would a nonlinear function look like? It is difficult to provide a satisfying answer
to that question, because any other rule for obtaining Yi apart from multiplying Xi by a
constant value would be a nonlinear function. So, one example of a nonlinear function would
be the following:
Yi = bi Xi    (4)
Equation 4 is a nonlinear function because the coefficient, b, has the subscript, i, indicating
that the number we use to multiply Xi can, itself, change across the observations. Another
nonlinear function is:
Yi = Xi^b    (5)
Here, the X variable is not multiplied by the coefficient, b. Instead, the value of Xi is raised
to the power corresponding to the number, b.
What makes a linear function distinctive, compared to nonlinear functions? In a linear
function, the “responsiveness” of Y to X will be constant across the entire range of X values.
That is not the case with a nonlinear function. To see what we mean by this, let us return
to our earlier example of a linear function, Yi = bXi , with b = 2. We will assume that the
possible values of X range across the interval from one to ten. Now, consider two X values
at the lower end of the X distribution, X1 = 1 and X2 = 2. We apply the linear function to
each of these values and obtain 2 · 1 = 2 and 2 · 2 = 4, or Y1 = 2 and Y2 = 4. Notice that, as
the X values increased by one unit, applying the function produced Y values which differed
by two units. Next, let us move over to the other side of the X distribution and select two
large values, X3 = 9 and X4 = 10. Applying our linear function, Y3 = 2 · 9 = 18, and
Y4 = 2 · 10 = 20. The third and fourth X values are much larger than the first and second
values (i.e., 9 and 10, rather than 1 and 2). But, they still differ by one unit. And, note
that when we increase from 9 to 10, applying the function still produces a pair of Y values
which differ by two units (i.e., Y3 = 18 and Y4 = 20), just as before. And, you can verify
that the function will work the same way for any pair of X values that differ by a single
unit; regardless where that pair of values occurs within the range of X values, application
of the linear function (with b = 2) will always produce a difference of two units in the
resultant Y values. It is precisely this constancy in responsiveness of Y to X which defines
the linear nature of the function: If Y is a linear function of X, then for any coefficient b,
two observations which differ by one unit on X will always differ by exactly b units on Y ,
regardless of the specific values of X.
The preceding rule also differentiates linear functions from all nonlinear functions. For
example, consider the function in equation 5, with b = 2. In other words, Yi is obtained by
squaring Xi. Here, if X1 = 1 then Y1 = 1² = 1. And, if X2 = 2 then Y2 = 2² = 4. With
this function, increasing Xi from one to two corresponded to an increase of three units on
Y , from Y1 = 1 to Y2 = 4. Moving to the other end of X’s range, consider X3 = 9. Now,
Y3 = 9² = 81. If X4 = 10, then Y4 = 10² = 100. Here, we have still increased Xi by one
unit. But now, that movement occurs at a different location within X’s range, from 9 to 10.
And, the increase in Y is now much larger than before, at 19 units (i.e., from Y3 = 81 to
Y4 = 100). Here, the responsiveness of Y to X is not constant; instead, it depends upon the
particular X values over which the one-unit increase occurs.
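The contrast between constant and varying responsiveness can also be checked numerically. The short sketch below (in Python, with functions of our own choosing) computes the change in Y for a one-unit increase in X at the low and high ends of the range used above:

```python
# Comparing the responsiveness of Y to X for the linear rule Y_i = 2*X_i
# and the nonlinear rule Y_i = X_i**2, at both ends of the X range.
def linear(x):
    return 2 * x

def nonlinear(x):
    return x ** 2

for f in (linear, nonlinear):
    low_diff = f(2) - f(1)      # one-unit increase at the low end of X
    high_diff = f(10) - f(9)    # one-unit increase at the high end of X
    print(f.__name__, low_diff, high_diff)

# linear:    2 and 2  (constant responsiveness)
# nonlinear: 3 and 19 (responsiveness depends on where the increase occurs)
```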
Obtaining the Coefficients in a Linear Function
So far, we have seen how a particular configuration of coefficients defines a linear function.
To reiterate, if the values of a variable, X, are related to the values of another variable, Y ,
by a rule in which Yi = a + bXi , for all observations on Xi , i = 1, 2, 3, . . . , n − 2, n − 1, n,
then that rule is a linear function. But, we have left a fairly important question unanswered:
Where do a and b come from? We know that, in any particular example of a linear function,
the letters representing these coefficients will be replaced by specific numbers— one value
for a and another value for b. So, how are the numeric values determined?
The answer to the preceding question depends upon the circumstances. Sometimes, the
values of the coefficients are simply given, as in the example functions used in the previous
section. In other cases, the coefficients might be determined by pre-existing rules. So, for
example, the cost of an item in Australian dollars is a linear function of the cost of that item
in American dollars, with the b coefficient determined by the exchange rate between the two
currencies. At the time of this writing, the function would be expressed as:
AUDi = 1.076 USDi
where AUDi is the cost of item i in Australian dollars, USDi is the cost of item i in
American dollars, and b = 1.076.
In other situations, the values of the coefficients need to be calculated from previously existing data. (Indeed, much of the rest of this book will be occupied with calculating
coefficients in linear functions from empirical data!). Let us assume that we have n observations on two variables, X and Y , and that we also know that Y is a linear function of
X. This information implies that, for any observation, i, Yi = a + bXi . It is quite easy to
calculate the values of b and a, as follows:
First, select any two observations from the data. Here, we will call these observations 1
and 2, although we want to emphasize that they really can be any two observations selected
arbitrarily from the n observations. Of course, we have the X and Y values for each of these
observations, which can be shown as (X1 , Y1 ) for observation 1 and (X2 , Y2 ) for observation
2. Since we know that Y is a linear function of X, we can say that Y1 = a + bX1 and
Y2 = a + bX2 . Now, the process of calculating the values of b and a involves manipulating
the preceding information to generate a formula which isolates b on one side of the equation
and leaves previously-known information on the other side of the equation. We will begin
by simply subtracting Y1 from Y2 and working out the consequences:
Y2 − Y1 = (a + bX2 ) − (a + bX1 )
Y2 − Y1 = a + bX2 − a − bX1
Y2 − Y1 = a − a + bX2 − bX1
Y2 − Y1 = (a − a) + (bX2 − bX1 )
Y2 − Y1 = b(X2 − X1 )
Let us go through each of the preceding steps and explain what happened. In the first
line, the left side simply shows the difference between Y2 and Y1 , and the information on
the right-hand side follows from the information that Y is a linear function of X. In the
second line, nothing happens on the left-hand side. But, on the right-hand side, we remove
the parentheses; this means that we have to subtract both of the separate terms that were
originally in the rightmost set of parentheses (i.e., a and bX1 ) from the terms that were
originally in the other set of parentheses (i.e., a and bX2 ). In the third line, we rearrange
terms a bit, in order to put the two a’s adjacent to each other, and the two terms involving b
adjacent to each other. In the fourth line, we insert two sets of parentheses on the right-hand
side. Finally, two things occur on the right side of the fifth line (and, note that the left side
of the equation has remained unchanged throughout all of this): (1) The term within the
first set of parentheses is removed, since a − a = 0; (2) the b coefficient is factored out of the
difference contained within the second set of parentheses in the fourth line. At this point,
we simply divide both sides by (X2 − X1 ) and simplify, as follows:
(Y2 − Y1)/(X2 − X1) = b(X2 − X1)/(X2 − X1)

(Y2 − Y1)/(X2 − X1) = b(1)

(Y2 − Y1)/(X2 − X1) = b
The preceding three lines should be easy to follow. We start in the first line by performing
the division on both sides of the equation (the parentheses simply keep each numerator and
denominator grouped correctly). In the second line, the numerator and denominator on the right side
cancel; that is, they are equal to each other, so the fraction itself is equal to one. And, in the
last line, we simply eliminate the one from the right side, because any quantity multiplied
by one is just equal to that quantity (b, in this case). Finally, we will follow longstanding
tradition and reverse the two sides of the last line, so that the quantity we are seeking falls
on the left-hand side of the equation, and the formula to obtain that quantity falls on the
right-hand side:
b = (Y2 − Y1)/(X2 − X1)    (6)
It is instructive to look closely at equation 6, because it shows us how the b coefficient can be
interpreted as the “responsiveness” of Y to X. When reduced to lowest terms, the fraction
on the right-hand side gives the difference in the values of Y that will correspond to a
difference of one unit in X. The larger the size of this difference, the more pronounced is Y ’s
“response” to X, and vice versa. Again, since Y is a linear function of X, this responsiveness
is constant across the range of X values. In other words, we can use any pair of observations,
and the ratio of the difference in their Y values to the difference in their X values is always
equal to b.
Once we know the value of b, it is a simple matter to obtain the value for a:
Yi = a + bXi
Yi − bXi = a
The preceding two lines show that we can determine a by simply picking an arbitrary observation, i, and subtracting b times the X value for that observation from the Y value for
that observation. Again, following tradition, we will place the symbol for the coefficient on
the left-hand side, and the formula used to calculate it on the right-hand side:
a = Yi − bXi    (7)
Looking at equation 7, we can see that the a coefficient is really just an additive “adjustment”
that we apply in order to obtain the specific value of Yi . But, note that this adjustment takes
place after we take the responsiveness of Y to X into account, through the term bXi . For
this reason, it is somewhat tempting to say that a is less interesting than b, since it does not
seem to be part of the direct connection between the two variables. While we are somewhat
sympathetic to that interpretation, we do not agree with it. Instead, we prefer to say that
the a coefficient plays a different role than the b coefficient. Consider an observation where
the X value is zero, or Xi = 0. In that case, the second term on the right-hand side of
equation 7 drops out (since bXi = 0), leaving a = Yi . Thus, we might interpret a as a sort of
“baseline” value for Y . If situations in which Xi = 0 represent the absence of the variable,
X, then this baseline might be interesting in itself.
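The calculations in equations 6 and 7 are easy to carry out by hand, but a brief sketch may help fix the logic. The Python function below recovers b and a from any two observations, assuming that Y really is an exact linear function of X; the example values are ours:

```python
# Recovering the coefficients of a linear function from two observations,
# following equations 6 and 7. Assumes Y is exactly a + b*X.
def linear_coefficients(x1, y1, x2, y2):
    b = (y2 - y1) / (x2 - x1)   # equation 6: responsiveness of Y to X
    a = y1 - b * x1             # equation 7: additive adjustment (intercept)
    return a, b

# Two observations generated from Y = 4 + 3X:
a, b = linear_coefficients(2.0, 10.0, 5.0, 19.0)
print(a, b)   # 4.0 3.0
```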
Graphing Linear Functions
One of the themes that permeates this book is the idea that many mathematical and
statistical ideas can be conveyed in pictorial, as well as numeric, form. That is certainly the
case with functions. We can graph a function by taking a subset of n values from the range
of possible X values, using the function to obtain the Yi values associated with each of the
Xi ’s, and then using the ordered pairs, (Xi , Yi ), as the coordinates of points in a scatterplot.
By tradition, the range of X values to be plotted is shown on the horizontal axis, while the
range of Y values produced by the function (as it is applied to the n values shown in the
graph) is shown on the vertical axis.
As you can probably guess, the graph of a linear function will array the plotted points
in a straight line within the plotting region. As a first example, Figure 2.4A graphs the
function Yi = 2 · Xi for integer X values from -3 to 10. The linear pattern among the points
is obvious from even a casual visual inspection.
If the values of X vary continuously across the variable’s range, then the graph of the
function is often shown as a curve, rather than a set of discrete points.3 Figure 2.4B graphs
the same function as before, Yi = 2· Xi , under the assumption that X is a continuous variable
(which still ranges from -3 to 10). Of course, the “curve” in the figure is actually a straight
line segment. And, the line in Figure 2.4B covers the same locations as the individual points
that were plotted back in Figure 2.4A. Here, however, we can determine the Yi that would
be associated with any value of Xi within the interval from -3 to 10. We merely locate
Xi on the horizontal axis, and scan vertically, until we intersect the curve. At the point of
intersection, we scan horizontally over to the vertical axis, and read off the value of Yi that
falls at the intersection point. Thus, one could argue that the graph of the function using
the curve (Figure 2.4B) is a more succinct depiction of the function than the scatterplot (in
Figure 2.4A), because it is not tied to specific ordered pairs of Xi and Yi values.
The graphical representation of a linear function allows us to see several features that
would be a bit more difficult to convey by the equations alone. The coefficient, b, is often
called the “slope” of the function, because it effectively measures the orientation of the linear
array within the plotting region. If b is a positive value, then the linear array of points (for
discrete Xi ’s), or the line (for a continuous X variable) would run from lower left to upper
right within the plotting region. This is, of course, the case in the two panels of Figure 2.4.
For this reason, a function with a graph that runs from lower left to upper right is said to
have “positive slope.”
In contrast, a linear function with a negative value for b will be shown graphically as a
line which runs from upper left to lower right. As an example, Figure 2.5 shows the graph of
the function Yi = −2 · Xi , with X ranging continuously from -3 to 10. As you have probably
guessed, functions with graphs that run in this direction are said to have “negative slope.”
Irrespective of sign, the size of the slope coefficient measures the “steepness” of the line
corresponding to the function. Larger absolute values of b correspond to steeper lines, while
smaller absolute values of b correspond to shallower lines. For example, Figure 2.6 shows
the graphs of two different linear functions, plotted simultaneously in the same geometric
space. The solid line segment depicts the function, Yi = 2· Xi , while the dashed line segment
depicts Yi = 4 · Xi . Both functions are shown for the continuous set of X values from -3
through 10. Notice that the second function has a larger b coefficient than the first function
(i.e., b = 4 in the second function, while b = 2 in the first function); hence, the dashed
line (representing the second function) is steeper than the solid line (representing the first
function).
The same rule holds even if the value of b is negative. Figure 2.7 shows the graphs of two
more functions, Yi = −1 · Xi and Yi = −3 · Xi . As you can see, the graph of the second
function is, once again, steeper than that of the first function. But, the line segments run
in a different general direction than those in Figure 2.6, reflecting the negative signs of the
b coefficients in the latter two functions. To summarize, the general direction of the graph
corresponding to a linear function is determined by the sign of the slope coefficient, b, and
the steepness of the graph is determined by the size of the absolute value of b.
What about the a coefficient in a linear function of the form, Yi = a + bXi ? The value
of the a coefficient also has a direct graphical representation: It determines the vertical
position of the graph for a linear function. Specifically, a gives the value of Y that will occur
when Xi = 0. And, assuming that the vertical axis is located at the zero point along the
horizontal axis, then a is the value of Y corresponding to the location where the graph of
the linear function crosses the vertical axis. For this reason, the a coefficient is often called
“the intercept” of the linear function. Functions with positive a coefficients intersect the
Y axis above the zero point along that axis, while negative a coefficients cause a function
to intersect the vertical axis below the zero point. A function with a = 0 (which might be
shown as a function without an a coefficient at all, such as the functions graphed in Figures
2.4 through 2.7) will cross the vertical and horizontal axes at the point (0, 0). This location
is often called “the origin” of the space.
Figure 2.8 shows three linear functions with identical slopes but different intercepts. The
function shown by the solid line, Yi = 3 + 1.5Xi , has an intercept of 3, and it crosses the
Y axis in the upper half of the plotting region. The function shown by the dashed line,
Yi = 1.5Xi, has an intercept of zero, so it passes through the origin of the space, in the middle
of the plotting region. The function shown by the dotted line, Yi = −3 + 1.5Xi , has an
intercept of -3, so it crosses the vertical axis in the lower half of the space. Thus, the specific
value of the intercept for a linear function affects the vertical position of the graph for that
function, but not the orientation of the graph, relative to the axes.
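Readers who want to reproduce pictures like Figures 2.6 and 2.8 can do so with a few lines of plotting code. The sketch below (Python with matplotlib; the particular functions are our own choices) overlays lines with different slopes and intercepts:

```python
# Graphing linear functions with different slopes and intercepts,
# in the spirit of Figures 2.6 and 2.8.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 10, 200)                 # treat X as continuous over its range

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(x, 2 * x, label="Y = 2X (positive slope)")
ax.plot(x, 4 * x, linestyle="--", label="Y = 4X (steeper positive slope)")
ax.plot(x, 3 + 1.5 * x, linestyle=":", label="Y = 3 + 1.5X (intercept of 3)")
ax.axhline(0, linewidth=0.5)                 # horizontal axis through the origin
ax.axvline(0, linewidth=0.5)                 # vertical axis through the origin
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend()
plt.show()
```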
Up to this point, we have only looked at graphs of linear functions. But, we can graph
nonlinear functions as well. The general process of creating the graph is the same as with a
linear function. That is, plot points representing ordered pairs (Xi , Yi ) for some subset of n
observations, and connect adjacent points with line segments if X is a continuous variable,
and Y values exist for all points within the range of X that is to be plotted.4 The panels
of Figure 2.9 show graphs for several different nonlinear functions of a continuous variable,
X, which ranges from zero to ten. As you might expect from the name, each function shows
a curved pattern, rather than a straight line. And, the shape of the curve differs from one
function to the next.
We previously said that nonlinear functions are different from linear functions in that
Y ’s responsiveness to X varies across the range of X values. We can see this directly in the
graphs of the nonlinear functions. The relationship of Y to X is represented by the steepness
of the curve in the graph of the function. With the linear functions shown in Figures 2.4 to
2.8, this steepness is just the slope of the line segment corresponding to the function and,
again, it is constant across all X values. With a nonlinear function, the steepness changes
at different locations along the horizontal axis. For example, consider the graph in Figure
2.9A. Here, if we start near the left side of the horizontal axis, say in the region where
the Xi ’s vary from zero to about three, the curve is very shallow. This shows that the Y
values change very little as the X values increase. Now, look over at the right side of the
horizontal axis, say in the region where the Xi ’s vary from about eight to ten. Here, the
curve is very steep, indicating that a given amount of increase in X corresponds to a quite
large increase in the associated value of Y . With nonlinear functions, we must be attentive
to “local slopes,” defined as the orientation of the function’s curve within particular regions
or subsets of X values. These local slopes stand in contrast to the “global slope” of a linear
function, signifying that the orientation of the linear “curve” for the function does not change
across X’s range.
How can we measure these local slopes in nonlinear functions? Is there any analog to
the slope coefficient, which measures the amount of change in Y that corresponds to a unit
increase in X? In fact there is: A concept from differential calculus, the derivative of Y with
respect to X, measures the local slope of the function’s curve for any specified Xi within the
range of the function. But, we have decided not to use any calculus in this book. Therefore,
we will turn to a different strategy which is a little less precise but somewhat easier (at least
conceptually), called “first differences.” The latter simply give the difference in the Y values
that would correspond to a one-unit increase in X values, at different positions along the
range of X.
The procedure for calculating first differences is very simple, albeit somewhat cumbersome. First, select some “interesting” pairs of X values; for each such pair, the values in
that pair should differ by one unit. So, in Figure 2.9A, we might select a pair of X values
near the left side of the graph (say, X1 = 0 and X2 = 1) and another pair of X values
near the right side of the graph (say, X3 = 9 and X4 = 10). For each pair, calculate the
corresponding Y values. For X1 = 0, Y1 = a + (0)² = y1. For X2 = 1, Y2 = a + (1)² = y2.
The first difference is, then, Y2 − Y1 = y2 − y1 = cc. Now, repeat the process for the other
pair of X values. For X3 = 9, Y3 = a + (9)² = y3. For X4 = 10, Y4 = a + (10)² = y4.
And, the first difference is Y4 − Y3 = y4 − y3 = dd. These first differences summarize the
responsiveness of the Y values to changes in the X values within different regions of X’s
range. The calculated numerical values show that small values of X have little impact on
Y , since the first difference from X1 = 0 to X2 = 1 is only cc. But, with large values of X,
there is a stronger effect on Y , since the first difference from X3 = 9 to X4 = 10 is much
larger, at dd. While a first difference is not identical to a local slope, it does summarize the
set of local slopes which occur in the interval over which that first difference is calculated.
For that reason, it is probably a good idea to only calculate first differences across intervals
of X values for which the local slope (or, alternatively, the direction of the graphed curve)
does not change very much.
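As a sketch of the first-difference calculation, suppose (as the expressions above imply) that the curve in Figure 2.9A corresponds to a function of the form Yi = a + Xi²; the value of a is irrelevant here, because it cancels out of every first difference:

```python
# First differences for an assumed nonlinear function Y_i = a + X_i**2.
# The constant a cancels out of every first difference.
def f(x, a=0):
    return a + x ** 2

fd_low = f(1) - f(0)     # one-unit increase at the low end of X's range
fd_high = f(10) - f(9)   # one-unit increase at the high end of X's range
print(fd_low, fd_high)   # 1 and 19: Y responds far more strongly at high X values
```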
Our discussion of graphing functions suggests that linear functions often have a practical
advantage over nonlinear functions: For many purposes, linear functions are easier to deal
with, in the sense that it only takes one coefficient (the slope, b) to describe the way that Y
values respond to differences in the X values. With a nonlinear function, several separate
measures (i.e., values of the derivative at different Xi, or first differences calculated
for different regions within X’s range) are usually needed in order to represent the varying
responsiveness of Y to X across X’s range. We will eventually see that this same difference
persists when we employ regression analysis to construct models of linear and nonlinear
relationships between empirical variables.
LINEAR TRANSFORMATIONS
We will now move away from the relatively abstract discussion of functions and focus
more directly on empirical variables. We will still consider functions as rules which describe
relationships between variables. But, now, the emphasis will shift more toward the implications of the functions for measurement and the variables’ distributions rather than the
characteristics of the functions, themselves.
Sometimes, it will be useful to create a new variable, Y , by applying a linear function to
a previously-existing variable, X. That is, Yi = a + bXi for every observation, i = 1, 2, . . . , n.
In this case, Y is called a “linear transformation” of the variable X.
The Effect of a Linear Transformation on a Variable
A linear transformation contains essentially the same information as the original variable.
This is true because the relative sizes of the differences between the observations are preserved
across the transformation. However, the units of measurement change by a factor of size
b. And, after moving to the “new” measurement units, the origin (or zero point) in the
transformed variable is shifted by a units.
An easily-recognizable example of a linear transformation involves temperatures, measured in Celsius units (which we will call C) and in Fahrenheit units (which we will call
F ). Each of these temperature measurements is a linear transformation of the other. For
example, to transform a temperature in Celsius units into one that is expressed in Fahrenheit
units, we apply the following transformation:
Fi = 32 + 1.8Ci    (8)
In the preceding linear transformation, it is easy to see that b = 1.8 and a = 32. The
measurement units for F and C both happen to be called “degrees.” However, a Celsius
degree is “wider” than a Fahrenheit degree because one of the former is equivalent to 1.8 of
the latter. At the same time, the zero point on the Celsius scale falls 32 Fahrenheit units
away from the zero point on the Fahrenheit scale.
The nature of the linear transformation from Celsius to Fahrenheit is illustrated with the
parallel number lines shown in Figure 2.10; the right-hand line segment represents a range of
Celsius temperatures, and the left-hand line segment shows its Fahrenheit equivalent. The
figure also locates three observations on both of the scales. Notice that C1 = 0 and C2 = 10;
in other words, the first two observations differ from each other by ten Celsius units. But,
looking at the other scale, F1 = 32 and F2 = 50, for a difference of eighteen Fahrenheit units.
Of course, the sizes of the differences on these two scales reflect the value of the b coefficient
(1.8) in the linear transformation from Celsius to Fahrenheit degrees. Notice, too, that the
first observation falls at the zero point on the Celsius scale. However, F1 falls 32 Fahrenheit
units above zero on the latter scale. This corresponds to the size of the a coefficient in the
transformation.
As we said earlier, a linear transformation retains the relative differences between the
observations. We can see this in Figure 2.10 by comparing the difference between observations 1 and 2 to the difference between observations 2 and 3, using both scales. As we
have already said, observations 1 and 2 differ by 10 Celsius units. And, we can also see that
observations 2 and 3 differ by 20 Celsius units (since C3 − C2 = 30 − 10 = 20). So, the
difference between the second and third observations is two times the size of the difference
between the first and second observations, since (C3 − C2) = 2(C2 − C1). If we look at the
same pairs of observations in Fahrenheit units, we see that F2 − F1 = 50 − 32 = 18 and
F3 −F2 = 86−50 = 36. Once again, the difference between the second and third observations
is twice the difference between the first and second observations, or (F3 − F2 ) = 2(F2 − F1 ).
Linear Transformation and Summary Statistics
Linear transformations occur frequently in the everyday world of data analysis and statistics. In fact, they may be particularly common in the social sciences, where many of the
central variables are measured in arbitrary units. So, it is particularly important that we
understand how a linear transformation would affect the summary statistics calculated to
describe important characteristics of a variable’s distribution. Assume that we have n observations on a variable, Y , which is a linear transformation of a variable, X. How would
we calculate the sample mean for Y ?
One approach would be to simply perform the transformation, Yi = a + bXi for each
of the n observations, i = 1, 2, . . . , n and then calculate the sample mean, using the usual
formula, Ȳ = ∑ᵢ₌₁ⁿ Yi /n. That would certainly give the correct result. But, it involves a lot
of work, since the transformation would have to be carried out n times before calculating Ȳ .
Since we “started out” with the variable, X, perhaps we already know the value of its mean,
X̄. And, if so, we might be able to use that information to obtain Ȳ in a more efficient
manner.
In order to do this, you need to be familiar with some basic properties of summations.
These are listed in Appendix 1. But, the process is straightforward. We will just lay out
the formula for the linear transformation, manipulate it to create the formula for Ȳ on the
left-hand side, and see what happens on the right-hand side.
The derivation of the mean of a linear transformation is shown in Box 2.1. The first line
simply lays out the linear transformation. In the second line, we take the sum of the left- and
right-hand sides. In the third line, we make use of the first property of summations given in
Appendix 1, and distribute the summation sign across the two terms on the right-hand side.
Line 4 uses the other two properties of summations. The coefficient, a, is a constant, so when
that constant is summed across n observations, it simply becomes na. The other coefficient,
b, is also a constant so it can be factored out of the second summation on the right-hand side.
In lines 5 and 6, we divide both sides of the equation by n, and break the right-hand side into
two fractions, respectively. Line 7 cancels the n’s from the first fraction on the right-hand
side, and factors the b coefficient out of the numerator of the second fraction. Now, we can
recognize very easily that the fraction on the left-hand side simply gives the sample mean of
Y , while the only remaining fraction on the right-hand side similarly gives the sample mean
of X. So, that leaves us with line 8, which shows that the mean of a linear transformation is
that linear transformation applied to the mean of the original variable. Just as we suspected,
the coefficients of the linear transformation provide a shortcut to calculate Ȳ if we already
know the value of X̄.
Can we use a similar shortcut to obtain the variance and standard deviation of a linear
transformation? To answer this question, we will follow the same procedure we just used
for the sample mean. In Box 2.2, we begin with the formula for the linear transformation,
manipulate the left-hand side to produce the formula for the sample variance, and observe
the results on the right-hand side.
In the second line of Box 2.2, we subtract the mean of Y from the linear transformation;
the left-hand side subtracts Ȳ and the right-hand side uses the result from the previous
derivation. Lines 3 and 4 rearrange terms and simplify on the right-hand side. Note that
the a coefficient is removed from further consideration, since a − a = 0. In line 5, we square
both sides, and in line 6, the exponent is distributed across the two terms in the product on
the right-hand side. In line 7, both sides are summed across the observations. In line 8, we
use the third summation rule to factor the constant, b2 , out of the sum. Then, line 9 divides
both sides by n − 1 and line 10 factors the b2 term out of the fraction on the right-hand side.
At this point, you probably recognize that the left-hand side simply gives the formula for
the sample variance of Y , while the right-hand side shows the product of b2 and the sample
variance of X. Line 11 shows this explicitly. Then, lines 12 and 13 take the square root of
each side to show that the sample standard deviation of Y is equal to b (strictly, the absolute
value of b, since a standard deviation cannot be negative) times the sample standard deviation of X.
The derivation in Box 2.2 was a little more involved than that for the sample mean, back
in Box 2.1, but not by much. And, the results are equally straightforward. There are a
couple of features of the variance and standard deviation that should be emphasized. First,
note that the a coefficient does not appear in the formula relating the variance of X to the
variance of Y (of course, the same is true for the standard deviation). This is reasonable,
because a is a constant, added to every observation. As such, it shifts the location of each
observation along the number line, but it does not contribute at all to the dispersion of the
observations. So, it does not really have anything to do with the variance.
Second, the variance of X is multiplied by b2 , rather than by b. This makes sense, too: The
variance is a measure of the average squared deviation from the mean. So, the measurement
unit for the variance is the square of the measurement unit for the variable, itself. And,
remember that the b coefficient in a linear transformation changes the measurement units
from those of X to those of Y . So, it is reasonable that b must be squared, itself, when
determining the variance of the linear transformation, Y , from the variance of the original
variable, X.
Third, the sample standard deviation of Y is the standard deviation of X, multiplied by
b. Of course, the standard deviation is measured in the same units as the variable whose
dispersion it is representing (in fact, that is one of the central motivations for using the
standard deviation rather than the variance in many applications). Again, the b coefficient
adjusts the measurement units from those of X to those of Y . So, it performs a similar function in obtaining the standard deviation of the linear transformation, Y , from the standard
deviation of X.
To summarize, the salient features of a linear transformation’s distribution (specifically,
its central tendency and dispersion) are related to the same features of the original variable’s
distribution through some very simple rules. We have shown this to be the case for data in a
sample. However, it is also true for the population distributions from which the sample data
were drawn. These characteristics of linear transformations are useful, because they will be
applicable in several of the derivations that we will complete in the rest of the book. They
are also important because they will help us understand how changes in measurement units
affect the various elements of regression models.
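The results in Boxes 2.1 and 2.2 are easy to verify numerically. The sketch below uses the Celsius-to-Fahrenheit transformation from equation 8, applied to a handful of arbitrary Celsius values of our own choosing:

```python
# A numerical check of the rules for the mean, variance, and standard
# deviation of a linear transformation, using F_i = 32 + 1.8 * C_i.
import numpy as np

c = np.array([0.0, 10.0, 30.0, 22.5, -5.0])   # arbitrary Celsius values
a, b = 32.0, 1.8
f = a + b * c                                  # the linear transformation

print(f.mean(), a + b * c.mean())              # means agree (Box 2.1)
print(f.var(ddof=1), b**2 * c.var(ddof=1))     # variances agree (note b squared)
print(f.std(ddof=1), abs(b) * c.std(ddof=1))   # standard deviations agree
```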
LINEAR COMBINATIONS
Linear transformations create a new variable, Y , by applying a linear function to one
other variable, X. There are also many situations where we want to create a new variable
(which we will, again, designate as Y ) by combining information from several other variables,
say X and Z (although we could certainly have more than two such variables, we will use
only these two for now, as the simplest case). If Y is created by forming a weighted sum of
X and Z, then Y is called a linear combination of those two variables. This could be shown
as follows:
Yi = bXi + cZi    (9)
In the preceding formula, b and c are constants; their values do not change across the
observations, from i = 1, 2, . . . , n. Just as before, we call them the coefficients of the linear
combination. It is obvious that Y is a combination of the other two variables. And, it is a
linear combination because (once again) the impact of X on Y is constant across X’s range,
and the same is true of Z.
The coefficients in the linear combination are “weights” in the sense that they determine
the relative contributions of X and Z to the final variable, Y . In other words, if b is larger
than c, then the Xi ’s will have a greater impact on the Yi ’s than do the Zi ’s, and vice versa.
Of course, b can be equal to c, in which case both variables have an equal influence in
creating Y . It is also possible that b and c are both equal to one, in which case the linear
combination, Y , is just the simple sum of X and Z.
Linear combinations occur very frequently in the social sciences. To mention only a few
examples:
• A professor calculates each student’s overall course grade (Yi ) by summing the grades
received on the midterm examination (Xi ) and the final examination (Zi ).
• Another professor calculates the overall course grade by weighting the midterm examination to count for 40% (that is, b = 0.40) and the final examination to count for 60%
(c = 0.60).
• A sociologist develops a measure of socioeconomic status for respondents to a public opinion survey (Yi ) by adding together each person’s reported family income, in
thousands of dollars (Xi ), and years of schooling (Zi ). Furthermore, the sociologist
multiplies the first constituent variable by 1.5, to indicate that status is more a function of income than of education.
Of course, many other examples of linear combinations are possible. Again, they occur
whenever a new variable is created by taking a weighted sum of other variables.
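As a concrete illustration of the second example above, the weighted grading scheme is just a two-term linear combination, and it can be computed directly. This is a sketch of our own, using invented scores rather than any real course records:

    import numpy as np

    # Hypothetical midterm (X) and final (Z) scores for five students.
    midterm = np.array([78.0, 85.0, 92.0, 64.0, 88.0])
    final   = np.array([82.0, 79.0, 95.0, 70.0, 91.0])

    b, c = 0.40, 0.60                        # weights for the midterm and the final
    course_grade = b * midterm + c * final   # Yi = b*Xi + c*Zi

    print(course_grade)   # e.g., the first student's grade is 0.40*78 + 0.60*82 = 80.4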
The linear combinations we have considered so far are relatively simple, because they
have only included two terms on the right-hand side. But, the weighted sum which defines
the linear combination can actually contain any number of terms. And, at this point, we
encounter a bit of a problem with notation. Traditionally, statistical nomenclature has
employed letters from the end of the alphabet to represent variables (e.g., X, Y , and Z) and
letters from the beginning of the alphabet to represent coefficients (e.g., a, b, and c). This
system has worked so far, because our linear combinations have involved very few terms.
With more terms entering into a weighted sum, we will run out of distinct letters pretty
quickly. In order to avoid this problem, we will move to a more generic system that can be
employed no matter how many terms are contained in the linear combination. Specifically, we
will build upon our previous system, and designate that the variable on the left-hand side—
that is, the linear combination, itself— is shown as Y . The variables on the right-hand side—
that is, those comprising the terms that are summed to create the linear combination— are
shown as X’s with subscripts. So, if there are k variables going into the linear combination,
then they will be represented as X1 , X2 , . . . , Xj , . . . , Xk . From now on, we will reserve
the subscript, j, as a variable index, with k representing the total number of variables going
into the linear combination (that is, appearing on the right-hand side). At the same time,
we will use b with a subscript to represent the coefficient for each of the Xj ’s in a linear
combination. In other words, bj is the coefficient associated with Xj . Thus, we could show
a linear combination, Y , composed of k terms, as follows:
Y = b1 X1 + b2 X2 + . . . + bj Xj + . . . + bk Xk                    (10)

Or, in a more compact form:

Y = Σ_{j=1}^{k} bj Xj                    (11)
You need to look carefully at the limits of summation in the preceding expression, to see
that the sum is taken across the k variables, rather than across observations (as we have
done in previous applications of the summation operator). In fact, the observations don’t
show up at all in expressions 10 and 11. If we want to represent the ith observation’s values
in the linear combination, Y , then we will have to use a slightly more complicated form of
subscripting for the right-hand side variables, as follows:
Yi = b1 Xi1 + b2 Xi2 + . . . + bj Xij + . . . + bk Xik                    (12)
Now, the X’s have double subscripts. Following the notation we have already employed
in earlier equations and derivations, one of the subscripts is designated i and it indexes
observations. The other subscript is designated j and it indexes variables. Arbitrarily, we
will place the observation index first and the variable index second in the doubly-subscripted
versions of the variable. So, Xij refers to the ith observation’s value on variable Xj . Of
course, the coefficients only have single subscripts, because they are each constant across the
observations.
Notice that the linear combinations we have shown so far do not include an a coefficient
which might serve a similar purpose as the intercept in a linear transformation. We could
easily add one, as follows:
Yi = a + b1 Xi1 + b2 Xi2 + . . . + bj Xij + . . . + bk Xik                    (13)
However, we prefer to use a slightly different notation. Rather than a, we will show the intercept in a linear combination as b0. So, the linear combination is shown as:
Yi = b0 + b1 Xi1 + b2 Xi2 + . . . + bj Xij + . . . + bk Xik                    (14)
And, it is sometimes useful to think of b0 as the coefficient on a pseudo-variable which we will
call X0 . The latter is a pseudo-variable because its values equal one for every observation—
hence, this “variable” doesn’t actually vary! This may seem a bit nonsensical. But, later on,
we will see that thinking about the intercept this way helps us in several derivations. So, for
now, we note that a linear combination could be represented as:
Yi = b0 Xi0 + b1 Xi1 + b2 Xi2 + . . . + bj Xij + . . . + bk Xik

Yi = Σ_{j=0}^{k} bj Xij
Once again, look carefully at the limits of summation in the preceding expression. You will
see that the variable index begins at zero, rather than one, indicating that an intercept is
included in the linear combination. Of course, if a particular linear combination does not
include an intercept, we would just begin with j = 1, as usual.
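In code, the double-subscript notation corresponds naturally to a data matrix whose rows index observations (i) and whose columns index variables (j), with the pseudo-variable X0 added as a column of ones. The sketch below (with hypothetical numbers of our own) computes the linear combination in equation 14 this way:

    import numpy as np

    # Hypothetical data matrix: rows are observations (i), columns are variables (j).
    X = np.array([[2.0, 5.0, 1.0],
                  [4.0, 3.0, 6.0],
                  [7.0, 8.0, 2.0]])          # n = 3 observations on k = 3 variables

    b0 = 10.0                                # intercept, the coefficient on X0 = 1
    b  = np.array([1.5, -2.0, 0.5])          # coefficients b1, ..., bk

    # Prepend the pseudo-variable X0, which equals one for every observation.
    X_aug = np.column_stack([np.ones(X.shape[0]), X])
    coefs = np.concatenate([[b0], b])

    Y = X_aug @ coefs                        # Yi = sum over j = 0, ..., k of bj * Xij
    print(Y)
    print(b0 + X @ b)                        # the same values, computed without X0

Treating the intercept as the coefficient on a column of ones is exactly the device that will reappear in later derivations.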
Graphing the “Linear” Part of a Linear Combination
Earlier in this chapter, we saw that the graph of a linear function is (reasonably enough) a
line. We didn’t look at the graph of a linear transformation, but that is just a straightforward
application of our general rule. So, if Y is a linear transformation of X, then a graph of Y
versus X would produce a linear pattern of points if X is discrete, or a line if the values of
X vary continuously across the variable’s range. In either case, the linear pattern (or, the
line itself) would have an intercept of a and a slope of b.
Can we generalize this kind of graphical display to a linear combination? We will address
this question using an example. Assume we have two variables, X1 and X2 , each of which
takes on integer values from one to ten. Our data consist of 100 observations, comprising all
possible combinations of distinct X1 and X2 values. Let us create a linear combination, Y ,
from these variables, defined as follows:
Yi = 3Xi1 + (−2)Xi2                                        (15)
Figure 2.11 graphs Y against each of the two constituent variables in the linear combination.
But, neither scatterplot shows a line. What is going on here? Shouldn’t a linear combination
be manifested in linear graphical displays?
The answer is that the values of Y are affected simultaneously by both X1 and X2 . Therefore, if we graph Y against only one of the constituent variables in the linear combination
(say, X1 ), the locations of the plotted points are also affected by the values of X2 , even
though that variable is not explicitly included as one of the axes of the scatterplot. But, it
is nevertheless the effects of X2 which prevent us from seeing directly the linear relationship
between Y and X1 in the graph. Even though a simple scatterplot does not work for this
purpose, there are several strategies we might use to “remove” the effects of X2 from the
display. After we do that, we should be able to see X1 ’s linear impact on Y .
The first strategy is to condition on the values of X2 before graphing the relationship
between X1 and Y . In other words, we divide the 100 observations into subgroups, based
upon the values of X2 . Given the way we have defined our dataset, there will be ten such
subgroups: The first subgroup contains the ten observations for which X2 = 1, the second
subgroup contains the ten observations for which X2 = 2, and so on. Stated a bit differently,
Xi2 is held constant within each of the subgroups. Then, construct a separate scatterplot
of Y against X1 for each of the subgroups. Each scatterplot will show a linear array of
points, with slope b1 and intercept b2 Xi2 . The slope remains the same across the 10 graphs
(since, by definition, the linear responsiveness of Y to X1 is constant) but the intercept shifts
according to the value of X2 that is held constant within each graph. Figure 2.12 shows two
such scatterplots, with X2 held constant at values of four and seven, respectively. In both
panels of the figure, the slope of the linear array is 3, which is the value of b1 . In the first
panel, the intercept is equal to -8, which is b2 Xi2 for this subgroup (or −2 × 4). In the second
panel, the intercept is -14, or −2 × 7. Of course, we could also observe the linear relationship
between X2 and Y by conditioning on values of X1 .
The second strategy for isolating the linear effect of X1 is to remove directly the effects
of X2 on Y . We can accomplish this through some simple algebra:
Yi = 3Xi1 + (−2)Xi2
Yi − (−2)Xi2 = 3Xi1
Yi∗ = 3Xi1
The Y ∗ variable which appears on the left-hand side of the preceding expression is just a
version of the original Y variable that has been “purged” of the effects of X2 , by subtracting
it out. That is, Yi∗ = Yi − (−2)Xi2 = Yi + 2Xi2. The part of Y that remains after this “purging” operation
is linearly related to X1 . We can see this clearly in Figure 2.13, which graphs the “purged”
version of Y against X1 . The first panel of the figure shows the simple scatterplot. This
shows the expected linear array of points; however, there only appear to be ten points despite
the fact that the dataset contains 100 observations. The problem is that each plotted point
actually represents ten distinct observations; they are simply located directly on top of each
other. In order to correct for this overplotting problem, the second panel of Figure 2.13
shows a jittered version of the same plot. Here, the plotting locations have been broken up,
so we can see more easily the concentration of observations at each of the ten distinct pairs
of variable values that appear in the graph. The slope of the linear array in Figure 2.13 is
three, which is also the value of b1 in the linear combination. The intercept is equal to zero,
which can be verified by setting Xi1 equal to zero on the right-hand side of the equation
that created the purged version of Y , above. Again, we could observe the linear relationship
between X2 and Y by using the same strategy to remove the effects of X1 from Y .
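Both strategies are easy to verify numerically. The brief sketch below (our own illustration, rebuilding the 100-observation grid described above rather than using any replication file) conditions on one value of X2 and also constructs the purged variable Y∗; in each case the relationship with X1 is exactly linear with slope 3:

    import numpy as np

    # All 100 combinations of X1 and X2 values from 1 to 10.
    x1, x2 = np.meshgrid(np.arange(1, 11), np.arange(1, 11))
    x1, x2 = x1.ravel(), x2.ravel()
    y = 3 * x1 + (-2) * x2                   # the linear combination from equation (15)

    # Strategy 1: condition on X2 (here, the subgroup with X2 = 4).
    mask = (x2 == 4)
    slope, intercept = np.polyfit(x1[mask], y[mask], 1)
    print(slope, intercept)                  # approximately 3 and -8 (that is, b1 and b2*4)

    # Strategy 2: purge the effect of X2 from Y.
    y_star = y - (-2) * x2                   # Y* = Y - (-2)X2 = Y + 2*X2
    slope, intercept = np.polyfit(x1, y_star, 1)
    print(slope, intercept)                  # approximately 3 and 0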
Thus, there are two different strategies for isolating the linear effect of each constituent
variable within a linear combination. The first strategy involves physical control, by holding the values of the other variable constant. The second strategy involves mathematically
removing the other variable’s effects from the linear combination, before observing the relationship. Both of these strategies are employed in “real-life” data analysis. And, the second
strategy is conceptually similar to what we will eventually do in a multiple regression equation, which attempts to isolate each independent variable’s effect on a dependent variable by
mathematically removing the effects of all other independent variables from the measured
representation of that effect. For now, however, we want you to realize that a linear combination really does comprise a sum of linear effects, even though the bivariate graphs of the
linear combination against the constituent variables do not show simple linear patterns.
The Mean and Variance of a Linear Combination
As we will soon see, there are many situations where we will want to work with summary
statistics for the distribution of a variable that is a linear combination of two or more other
variables. And, as was the case with linear transformations, it will prove convenient to express those summary statistics in terms of the variables that comprise the linear combination.
Here, doing so will not only save us some computational work. Establishing the relationships
between the summary statistics for a linear combination and the summary statistics for its
constituent variables will eventually be very useful for deriving the statistical properties of
the regression model.
For the derivations in this section, we will use the simplest form of a linear combination—
that is, a weighted sum of two variables. At the same time, we will return to the earlier
notation, in which the linear combination is shown as:
Yi = bXi + cZi
Thus, we will be using distinct letters to represent each coefficient and variable on the
right-hand side, rather than subscripted b’s and X’s. We do this strictly to simplify the
current presentation and avoid the need to keep track of multiple subscripts throughout the
derivations. But, it remains important for you to understand how the derivations would
work the same way with more than two variables (that is, k > 2) and the more general
notation (i.e., bj ’s and Xij ’s) on the right-hand side.
Let us begin with the mean of a linear combination. The derivation is shown in Box 2.3
and it follows the same general strategy that we used earlier, for linear transformations. By
now, you should be comfortable enough with this process that we do not need to explain
each step in detail. The important point to take away from Box 2.3 is the final result: Stated
verbally, the mean of a linear combination is equal to the linear combination (or weighted
sum) of the means of the constituent variables. That was simple enough, right?
Now, let us turn to the variance of a linear combination, which is worked out in Box 2.4.
While this derivation is certainly not difficult, it is a bit longer than the previous one and it
requires us to manipulate a few more terms. So, we’ll follow the steps carefully. As always,
we begin with the information that is given— the definition of the linear combination, in
this case— and manipulate the terms (always making sure to perform the same operations
on both sides of the equation) until we produce the desired result— in this case, the formula for
the sample variance— on the left-hand side. After doing so, we look to see whether we can
make sense of, or gain new insights from, the information left on the right-hand side.
We know that the sample variance is defined as the sum of squares divided by the degrees
of freedom. In order to get to that result, we begin in Box 2.4 by subtracting the mean of the
linear combination from both sides of the linear combination’s definition (which is shown,
by itself, in the first line). On the left-hand side of line two, we simply subtract Ȳ ; on the
right-hand side, we insert the result we just obtained for the mean of the linear combination.
In the third line, we remove the parentheses on the right-hand side. Then, in line four, we
rearrange terms and enclose the differences that involve the same coefficients and variables
within sets of parentheses. The fifth line factors out the coefficients from the two differences
on the right-hand side. Note that the left-hand side of line five shows the deviation score for
Yi , or the signed difference of observation i from the mean of Y . Of course, the right-hand
side is a weighted sum of the difference scores for X and Z.
The sixth line in Box 2.4 squares the equation, with the product of the two sums written
out explicitly on the right-hand side. In the seventh line, we carry out the multiplication on
the right-hand side, making sure to multiply each term in the first set of brackets by each
term within the second set of brackets. The result is a rather messy-looking sum of four
products. So, the eighth line simplifies things a bit, showing products of identical terms as
squares, and summing together the second and fourth terms from the previous line (which
were identical to each other, albeit arranged a bit differently).
In line nine, we take the sum, across n observations, on each side. Line 10 distributes
the summation sign across the three terms on the right-hand side, and line 11 factors the
constants out of each sum. To recast this latter result in a bit of statistical language, the left-hand side of line 11 shows the sum of squares for Y. The right-hand side shows a weighted
sum, where the weights are the square of the first coefficient from the linear combination
(i.e., b2 ), the square of the second coefficient from the linear combination (i.e., c2 ), and two
times the product of the two coefficients (i.e., 2bc). The first term on the right-hand side
also includes the sum of squares for the variable, X, and the second term includes the sum of squares for Z. The sum in the third term on the right-hand side involves the deviation
scores for both X and Z. Therefore, we will call it the “sum of crossproducts” for that pair
of variables. We will eventually see that sums of squares and sums of crossproducts play a
very prominent role in several aspects of the regression model. But, for now, they are just a
step along the way to our more immediate objective, the sample variance of Y .
In the twelfth line of Box 2.4, we divide both sides by n − 1, the degrees of freedom.
Line 13 then separates the right-hand side into three fractions, and factors out the products
involving coefficients from each of those fractions. Take a close look at the information in
this line. On the left-hand side, the fraction corresponds to the formula for the sample
variance of the variable, Y ; and, of course, that is exactly what we were trying to obtain
in this derivation. On the right-hand side of line 13, the first term is b2 times a formula
that corresponds to the sample variance of X. And, the second term is c2 times the sample
variance of Z. So, these first two terms on the right-hand side of line 13 can be interpreted
in a straightforward manner, as some familiar statistical quantities.
The third term on the right-hand side of line 13 looks a little more complicated in that
it involves both of the constituent variables in the linear combination. As we have already
mentioned, the first part of this term is two times the product of the coefficients (i.e., 2bc).
The second part of the term is the sum of the crossproducts divided by the degrees of
freedom. This latter fraction looks very similar to a variance; the only difference is the fact
that the numerator involves the sum of the product of two different deviation scores, rather
than the sum of a single variable’s squared deviation scores. Because of its
similarity to the variance, and the fact that it involves two variables, this fraction is called
the covariance of X and Z. This is a very important bivariate summary statistic in itself,
and we will return to it for a closer look in the next section of this chapter.
For now, the fourteenth line of Box 2.4 replaces the fractions with symbols representing
the sample variances and the covariance. Note that the latter is shown as SXZ . While
this is relatively standard statistical notation for the covariance, we believe it is potentially
confusing; it is easy to mistake SXZ for a sample standard deviation, which is also usually
shown as S, but with a single subscript. In order to avoid any such problems, we prefer to
use the covariance operator, as shown in line 15: When a formula or equation includes “cov”
followed by a set of parentheses with two variable names enclosed within them, that term
within the equation is referring to the covariance between those two variables.
At last, we are ready to provide a verbal statement of the variance of a linear combination
composed of two variables: The variance of a linear combination is equal to the sum of each squared coefficient times the variance of its constituent variable, plus two times the product of the coefficients times the covariance between the variables. That definition is quite a mouthful!
But, it shows that the dispersion in the linear combination’s distribution is related to the
amount of dispersion in the constituent variables’ distributions, plus an “extra” contribution
from the covariance which actually involves both of the original variables in a somewhat
different way.
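This result is easy to confirm numerically. The following sketch (with arbitrary, made-up values for X and Z) compares the variance of Y = bX + cZ computed directly against the expression in line 15 of Box 2.4:

    import numpy as np

    # Arbitrary illustrative data.
    x = np.array([1.0, 4.0, 6.0, 9.0, 12.0, 15.0])
    z = np.array([3.0, 2.0, 8.0, 5.0, 11.0, 7.0])
    b, c = 2.0, -0.5

    y = b * x + c * z

    var_direct = np.var(y, ddof=1)
    cov_xz = np.cov(x, z, ddof=1)[0, 1]      # sample covariance of X and Z
    var_formula = (b**2 * np.var(x, ddof=1)
                   + c**2 * np.var(z, ddof=1)
                   + 2 * b * c * cov_xz)

    print(var_direct, var_formula)           # the two values agree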
Up to this point, we have not paid any attention to the standard deviation of a linear
combination. The latter is simply defined as the square root of the linear combination’s
variance. Both of these statistics measure the dispersion of the distribution for the linear
combination. But, there is an important “practical” difference between them: The variance of the linear combination can be broken down into a weighted sum of variances and
covariance; therefore, we can get a sense of each constituent variable’s contribution to the
overall variance. The standard deviation of the linear combination cannot be broken down
in a similar manner. For that reason, the latter will play a far less prominent role in the
development and evaluation of the regression model, even though the standard deviation is
(arguably) somewhat easier to interpret than the variance.
We can generalize the results we have obtained so far very easily. Assume that we have
a linear combination, Y , defined as a weighted sum of an arbitrary number of variables:
Yi = b1 Xi1 + b2 Xi2 + . . . + bj Xij + . . . + bk Xik

Yi = Σ_{j=1}^{k} bj Xij
The mean of the linear combination would be defined as follows:
Ȳ = Σ_{j=1}^{k} bj X̄j
In other words, the mean of the linear combination will always be a weighted sum of the
original variables’ means, regardless of how many variables there are in the linear combination.
Generalizing the variance of a linear combination is a bit trickier. The right-hand side will
always include a sum of k terms involving the variances of the constituent variables. But,
the variance also includes additional terms for the covariances between every possible pair of constituent variables, each weighted by twice the product of the two corresponding coefficients. If there are k variables, there will be k(k − 1)/2 distinct covariances. This can add up to a lot of terms very quickly! Regardless, the general form for the variance of the linear combination is:

SY² = b1²SX1² + b2²SX2² + . . . + bk²SXk² + 2b1b2 cov(X1, X2) + . . . + 2bk−1bk cov(Xk−1, Xk)

SY² = Σ_{j=1}^{k} bj²SXj² + 2 Σ_{j=1}^{k−1} Σ_{l=j+1}^{k} bjbl cov(Xj, Xl)
To show you some specific examples, let’s take a linear combination of three variables:
Yi = b1 Xi1 + b2 Xi2 + b3 Xi3
The mean of this linear combination would be as follows:
Ȳ = b1X̄1 + b2X̄2 + b3X̄3

And, the variance of this linear combination would be:

SY² = b1²SX1² + b2²SX2² + b3²SX3² + 2b1b2 cov(X1, X2) + 2b1b3 cov(X1, X3) + 2b2b3 cov(X2, X3)
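For readers who want to see the general formula at work, here is a short sketch (again with invented data of our own) that computes the variance of a three-variable linear combination both directly and by summing the weighted variances and pairwise covariances:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))            # 200 hypothetical observations on X1, X2, X3
    b = np.array([1.0, -2.0, 0.5])           # coefficients b1, b2, b3

    y = X @ b

    S = np.cov(X, rowvar=False, ddof=1)      # sample variance-covariance matrix
    var_formula = (sum(b[j]**2 * S[j, j] for j in range(3))
                   + 2 * sum(b[j] * b[l] * S[j, l]
                             for j in range(3) for l in range(j + 1, 3)))

    print(np.var(y, ddof=1), var_formula)    # the two numbers match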
At this point, you might be asking yourself why you would want to know (or care) about
the summary statistics for linear combinations. We think that is a very reasonable question!
And, we want to reassure you that we are not going through these derivations and examples
simply to make you do some math. Instead, we will see very shortly that linear combinations
are one of the most important tools for developing the various features of the regression
model. In fact, we’ll be encountering them repeatedly throughout the rest of the book.
COVARIANCE, LINEAR STRUCTURE, AND CORRELATION
In this section, we will take a closer look at that term which showed up in the derivation
of the variance of a linear combination, the covariance between two variables. To reiterate
what we already know, the covariance between two variables (say, X and Y ) is defined as
the sum of crossproducts for those variables, divided by n − 1. Or:

cov(X, Y) = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / (n − 1)
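The definition translates directly into code. The sketch below (with two short, made-up variables) computes the covariance from the sum of crossproducts and checks the result against a canned routine:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    y = np.array([1.0, 3.0, 2.0, 7.0, 9.0])

    n = len(x)
    cross_products = (x - x.mean()) * (y - y.mean())
    cov_xy = cross_products.sum() / (n - 1)       # sum of crossproducts over n - 1

    print(cov_xy, np.cov(x, y, ddof=1)[0, 1])     # identical values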
Now, we have already mentioned that the covariance is a summary statistic. But, unlike the
mean or the variance, which represent features of a univariate distribution, the covariance
measures an important characteristic of a bivariate distribution. The bivariate distribution,
itself, can be interpreted as an enumeration of all possible combinations of values on the
two variables, along with the number of observations that fall at each of those possible
combinations. A scatterplot can be interpreted as a graphical display which illustrates the
bivariate distribution for two particular variables.
The Covariance as a Summary Statistic
The covariance summarizes, in a single numeric value, the degree to which the two variables are linearly related to each other. Stated a bit differently, the covariance measures the
degree of linear structure in the bivariate distribution for the two variables. Given a dataset
in which two variables have fixed variances (or standard deviations), the absolute value of
the covariance will increase as the points plotted in a scatterplot of the two variables come
to approximate a perfectly linear pattern. Furthermore, if the line that is approximated by
the plotted points has a positive slope (i.e., it runs from lower left to upper right in the
scatterplot), the covariance will be a positive value. If the linear pattern is negative (i.e.,
runs from upper left to lower right), the covariance will also have a negative value.
From the opposite perspective, the absolute value of the covariance will be close to zero
when there is very little linear structure in the bivariate data. In such cases, it would be
very difficult to discern any kind of linear trend or pattern within the points plotted in a
scatterplot. The covariance has a value of exactly zero when there is no linear structure in
the data at all. This can occur for several different reasons. But, one important situation in
which the covariance is zero is when X and Y are statistically independent of each other. In
that case, knowing an observation’s value on one variable provides no information whatsoever
about that same observation’s value on the other variable. Speaking informally, we might
say that the two variables are completely unrelated to each other.
Table 2.XX and Figure 2.14 provide some examples that should illustrate these properties
of the covariance. The table shows a moderately-sized dataset with 50 observations and six
variables. The first variable, labeled X, has a mean of 4.95 and a standard deviation of
2.95. The second through sixth variables are labeled Y1 through Y5 ; each of these variables
has a mean of 10.00 and a standard deviation of 5.00. The five panels in Figure 2.14 show
the scatterplots of the X variable against each of the Y variables from the dataset. Even a
cursory glance at the five scatterplots reveals the differences in linear structure across them.
In panel A, the data points form a “shapeless” cloud within the plotting region. In panel B,
the two variables show a very faint structure, in which observations with smaller X values
also tend to have relatively small Y values, and vice versa. The data in panel C conform
to a moderately clear linear pattern, oriented in the positive direction. Similarly, the data
in panel D also show a moderately strong linear pattern, but it is oriented in the negative
direction. Finally, panel E shows a perfectly linear array of data points.
The various degrees of linear structure (or, one could say “the strength of the linear relationship”) are revealed graphically in the respective scatterplots. But, they are also reflected
in the values of the covariance calculated for each pair of variables. These covariances range
from 0.82 for X and Y1 up to 14.74 for X and Y5 . You should be sure that you can see how
larger absolute values correspond to clearer linear structures. And, note how panels C and
D show the same degree of linearity, but in opposite directions.
Why does the covariance work this way? You can gain an intuitive sense of the answer
by looking at the numerator in the formula for the covariance, which sums the crossproducts
for each observation. Each of the crossproducts multiplies a deviation score for X by a
deviation score for Y . Each of these deviation scores effectively measures the observation’s
distance above or below the mean on the appropriate variable. The (absolute value of the)
sum of these products will be maximized when an observation’s distance above (or below)
the mean on one variable is proportional to its distance above (or below) the mean on the
other variable. Stated differently, the deviation scores on one variable comprise a linear
function of the deviation scores on the other variable.
Thus, the covariance measures the degree to which two variables approximate a linear
relationship. And, we already know that linearity is a particularly important type of relationship between two variables, because it is the simplest form of bivariate structure. The
covariance is a summary statistic which measures precisely this kind of structure, so it is a
very important research tool on its own.
Some Problems with the Covariance
Nevertheless, there are some practical issues that arise when we try to use the covariance
to measure linearity in bivariate relationships. Most of these stem from the fact that it
is very difficult to interpret specific values of the covariance. For one thing, there is no
clear standard of comparison. Suppose you had not seen the scatterplots in Figure 2.14. If
someone told you that the covariance between X and Y3 is 12.69, you would have no way to evaluate
what this means, beyond the fact that any linear structure across these two variables would
have a positive slope. Similarly, there is nothing in the specific numeric value, 14.74, to
indicate that X and Y5 are perfectly linearly related to each other.
Another problem is that the value of the covariance is affected by the specific measurement units employed for each of the two variables. Let’s transform variable Y3 by dividing each observation’s value by two; we’ll call this new variable Y∗. Similarly, we can create a
new variable called Y ∗∗ by doubling each observation’s value on Y5 . Figure 2.15 shows the
scatterplots between X and each of these new variables. You can see that panel A in this
new figure is identical to panel C from Figure 2.14. Similarly, panel B is identical to panel E
from Figure 2.14. The only things that differ across the two figures are the specific numeric
scales shown on the vertical axes. Certainly, the amount of linear structure is identical across
the comparable scatterplots from the two figures. Nevertheless, the covariances differ: The
covariance between X and Y ∗ is 6.345, whereas the covariance between X and Y3 was 12.69.
And, the covariance between X and Y ∗∗ is 29.48, while the covariance between X and Y5
was 14.74.
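The rescaling effect is easy to demonstrate: halving one variable halves its covariance with X, and doubling it doubles the covariance, even though the shape of the scatterplot is unchanged. A minimal sketch, with arbitrary values of our own rather than the Table 2.XX data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.0, 1.0, 4.0, 6.0, 5.0, 9.0])

    def sample_cov(a, b):
        return np.cov(a, b, ddof=1)[0, 1]

    print(sample_cov(x, y))          # covariance on the original scale
    print(sample_cov(x, y / 2))      # exactly half as large
    print(sample_cov(x, 2 * y))      # exactly twice as large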
The problem is that the value of the covariance is affected by the measurement scales used
for the two variables, as well as the degree of linear structure between them. In scientific
research situations, we are often much more interested in the latter than in the former.
In fact, measurement schemes will typically be regarded as given information prior to any
attempts to examine relationships between variables. Therefore, it is probably useful to
“correct” the covariance, in order to remove its dependence on measurement scales. How
might we accomplish this?
Transforming the Covariance into the Correlation Coefficient
One way to proceed would be to transform the two variables whose covariance is being
measured so that they always have the same measurement units. And, you probably know
a strategy for doing this already, from your background work in introductory statistics: We
can just standardize, or perform a Z-transformation on, each of the variables. In other
words, start with variables X and Y , which we will assume have their own particular units
of measurement. Create new variables, ZX and ZY , by transforming X and Y as follows.
For each observation, i, where i ranges from 1 to n:
ZXi = (Xi − X̄)/SX    and    ZYi = (Yi − Ȳ)/SY
Then, apply the formula we have already seen to obtain the covariance between ZX and ZY :
cov(ZX, ZY) = Σ_{i=1}^{n} (ZXi − Z̄X)(ZYi − Z̄Y) / (n − 1)
            = Σ_{i=1}^{n} (ZXi − 0)(ZYi − 0) / (n − 1)
            = Σ_{i=1}^{n} ZXi ZYi / (n − 1)
We can also achieve exactly the same result in a slightly different way (that is, perhaps,
a bit easier to calculate). First, calculate the covariance in the usual manner (i.e., as the
sum of crossproducts, divided by n − 1). Second, divide the covariance by the product of
the standard deviations of the two variables. This can be shown as follows:
cov(ZX, ZY) = cov(X, Y) / (SX SY)
Regardless how you calculate it, this covariance of standardized variables is actually a
transformed— or standardized— version of the covariance, itself. And, the transformation effectively removes the effect of the original variables’ measurement units on this new
version of the covariance. As you know, Z-scores are always measured in the same units
(i.e., standard deviations) regardless of the measurement units used for the variables prior
to standardization. So, the question of the measurement units’ impact on the value of the
covariance is just irrelevant when standardized variables are used. And, therefore, if two
such standardized covariances differ, we can be sure that the difference is due to variation
in the degree of linear structure that exists across each pair of variables, and not due to
differences in the measurement units used in the variables.
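Both routes to the standardized covariance produce the same number, which is also what a canned correlation routine returns. A short sketch with made-up data:

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 7.0, 9.0, 12.0])
    y = np.array([3.0, 2.0, 6.0, 9.0, 8.0, 14.0])

    def sample_cov(a, b):
        return np.cov(a, b, ddof=1)[0, 1]

    # Route 1: covariance of the standardized (Z-scored) variables.
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    r1 = sample_cov(zx, zy)

    # Route 2: raw covariance divided by the product of the standard deviations.
    r2 = sample_cov(x, y) / (x.std(ddof=1) * y.std(ddof=1))

    print(r1, r2, np.corrcoef(x, y)[0, 1])   # all three values agree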
In fact, the standardized covariance is so important that it has a special name of its own.
It is called the correlation coefficient. The correlation coefficient is a bivariate summary
statistic, and it is generally represented by lower-case r. Usually (at least, in this book), the
r is given double subscripts which indicate the variables being correlated. Thus, rXY refers
to the correlation between variables X and Y . Again, the correlation can be obtained in
several ways including the covariance of standardized versions of the two variables, or the
covariance between the original (i.e., non-standardized) variables divided by the product of
their standard deviations. And, there are various situations where both of the preceding
manifestations of the correlation coefficient are useful (in the sense that they will help us
see something interesting in a formula or a derivation). But, many statistics texts use the
following formula to represent the correlation between X and Y :
rXY = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √[ Σ_{i=1}^{n} (Xi − X̄)² Σ_{i=1}^{n} (Yi − Ȳ)² ]
Now, the preceding formula may look a bit intimidating. But, it can be obtained pretty
easily by performing some straightforward operations on either of our earlier formulas for
the standardized covariance. And, the important thing to remember is that the correlation
coefficient is simply a standardized version of the covariance.
Interpreting the Value of the Correlation Coefficient
Again, we use the correlation coefficient rather than the “raw” covariance because it has
some interpretational advantages. We have already noted that it is not affected by differences
in measurement units. Even more important, the correlation coefficient has a maximum
absolute value of 1.00. In other words, rXY will always fall within the interval from -1.00
to 1.00, inclusive. And, the correlation coefficient will only take on its most extreme values,
+1.00 or -1.00, when Y is a perfect linear function of X, and vice versa. If |rXY | < 1.00, then
Y is not a perfect linear function of X (and vice versa). But, the larger the value of |rXY |,
or the closer it comes to 1.00, the greater the extent to which the bivariate data conform
to a linear pattern. Furthermore, the correlation coefficient can be either a positive number
or a negative number; the sign of rXY indicates whether the predominant linear pattern in
the data (whatever its clarity might be) has positive or negative slope. Speaking somewhat
informally, many people would say that the correlation coefficient measures the strength of
the linear relationship between two variables. If rXY is close to zero, there is no clear linear
pattern in the bivariate data. The closer rXY comes to +1.00, the greater the extent to
which the bivariate data conform to a linear relationship with a positive slope. And, the
closer rXY comes to -1.00, the greater the extent to which the bivariate data conform to a
linear relationship with a negative slope.
To appreciate the utility of the correlation coefficient, we can go back to the data that
were provided in Table 2.XX and calculate the correlation coefficients for each pair of variables. Now, we already know the standard deviations of all the variables (as shown in Table 2.XX, SX = 2.95 and SYj = 5.00 for j = 1, 2, . . . , 5). And, we have already calculated the covariances
for X and each of the five Yj ’s (they are shown at the bottom of columns for the respective
Yj ’s in Table 2.XX). So, the easiest way to obtain the correlation coefficient for each pair of
variables is to divide the covariance by the product of the two standard deviations:
rX,Y1 = 0.817 / [(2.950)(5.000)] = 0.055

rX,Y2 = 8.910 / [(2.950)(5.000)] = 0.604

rX,Y3 = 12.690 / [(2.950)(5.000)] = 0.860

rX,Y4 = −12.690 / [(2.950)(5.000)] = −0.860

rX,Y5 = 14.739 / [(2.950)(5.000)] = 1.000
With these values for the rX,Yj ’s let’s go back and re-examine the scatterplots in Figure 2.14.
Panel A in the figure shows the rectangular array of points— in other words, there is no
apparent linear structure. And, this is reflected in the value of rX,Y1 , which is very close to
zero (0.055). In panel B, there is a rather vague linear pattern and this corresponds to a larger
correlation between X and Y2 , at 0.604. Panel C shows a very clear linear pattern with a
positive slope, so the correlation coefficient for X and Y3 is a large (i.e., close to one) positive
value, 0.860. In panel D, Y4 is strongly, but negatively, related to X with a correlation of
-0.860. Finally, panel E shows that X and Y5 exhibit a “perfect” linear pattern, with a
positive slope; hence, rX,Y5 shows the largest possible value for the correlation coefficient, at
1.000.
Thus, the correlation coefficient measures the degree of linear relationship in bivariate
data. As such, it “summarizes” some of the visual information that we can obtain from a
scatterplot between two variables. Again, the value of the correlation coefficient is unaffected
by the measurement units of the two variables. We also know that the absolute value of the
correlation coefficient will always range from zero to one. Therefore, we know what its value
will be when there is no linear relationship between X and Y (zero) and when there is a
perfect, or deterministic, linear relationship between the two variables (either 1.00 or -1.00,
depending upon the slope of the linear function connecting X and Y ). But, neither of these
extremes shows up very frequently (if at all) with “real” data. So, how do we interpret
intermediate values of rXY ? In fact, this is a bit of a problem with the correlation coefficient
because, beyond the general statement that larger absolute values indicate stronger linear
relationships, there is no clear interpretation of specific values. We will offer the following
rules of thumb (but, please read the caveat afterward!):
If 0 < |rXY| ≤ 0.30, then X and Y are weakly related.
If 0.30 < |rXY| ≤ 0.70, then X and Y are moderately related.
If 0.70 < |rXY| < 1.00, then X and Y are strongly related.
Please note that the preceding rules of thumb are totally subjective; another analyst could
easily disagree with our definitions of weak, moderate, and strong relationships (in fact, the
two of us don’t entirely agree on the various cutoffs). Still, these ranges should give you
some general ideas about how you can interpret the value of the correlation coefficient in
any specific application. And, the descriptive terms that we have employed here do show up
in the research literature quite frequently. Just don’t take these guidelines as hard and fast
rules! In fact, we will eventually provide a more precise interpretation of the strength of a
linear relationship based upon the correlation coefficient (or, more specifically, the square of
the correlation coefficient) in Chapter 3.
Some Perils of the Correlation Coefficient
We often hear people say “the correlation coefficient measures the relationship between
two variables.” Strictly speaking, this statement is not correct. Can you tell what is wrong
with it? In order to see the problem, consider the data in Table 2.ZZZ. Here, we have twenty-five observations on three variables, X, Y1, and Y2. The covariance between X and Y1 is
exactly zero; this means that the correlation between these two variables is also zero (why?).
If we accept the preceding definition of the correlation coefficient, we would conclude that X
and Y1 are not related to each other at all. The covariance between X and Y2 is 117.57, and
the correlation between these two variables is 0.67. Applying our previous rules of thumb,
and using the preceding interpretation, we would conclude that X and Y2 exhibit a moderate
relationship.
Both of the preceding interpretations are incorrect. In order to see why that is the case, we
only need to display the data graphically. The two panels of Figure 2.16 show the scatterplots
between X and Y1 , and between X and Y2 , respectively. As you can see immediately, both
pairs of variables show clear patterns. In fact, the relationships are deterministic. This
means that, for any specified value of X (say, Xi ), we can predict the corresponding value
of Yi without any error whatsoever. Stated differently, X is perfectly related to each of the
two Y variables.
If the relationships between X and Y1 and between X and Y2 are both perfectly deterministic, then why are the respective values of the correlation coefficient less than one? The
answer is simple: The correlation coefficient only measures linear structure in bivariate data.
These pairs of variables are related to each other in nonlinear ways. The function relating
the values of X to the values of Y1 is as follows:
Yi1 = 3 + 0.5(13 − Xi )2
The function relating the values of X to the values of Y2 is:
Yi2 = 10(1 − 1/Xi )
Neither of the preceding two equations depicts a linear function. That is, we cannot manipulate the terms on the right-hand side of either one to produce an equation of the form,
Yij = a + bXi
Thus, the type of structure that exists in each of these two cases is simply not “captured”
by the correlation coefficient. Hence, its value is less than a “perfect” 1.00 in each case.
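To see this numerically, the sketch below builds two deterministic but nonlinear relationships of the same general form as those above. The X values are hypothetical (we are not reproducing the actual Table 2.ZZZ data), yet the point comes through: Y is perfectly predictable from X, but the correlations fall well short of 1.00.

    import numpy as np

    # Hypothetical X values; twenty-five observations, purely for illustration.
    x = np.arange(1.0, 26.0)

    y1 = 3 + 0.5 * (13 - x)**2       # U-shaped, deterministic function of X
    y2 = 10 * (1 - 1 / x)            # increasing but strongly nonlinear function of X

    print(np.corrcoef(x, y1)[0, 1])  # essentially zero: the U-shape has no linear trend
    print(np.corrcoef(x, y2)[0, 1])  # positive, but clearly less than 1.00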
We want to emphasize that this is not a problem with the correlation coefficient, itself.
Rather, it is a problem in the interpretation. In both of these cases, the correlation coefficient is being used as a tool in a situation for which it is not intended. So, it is important
to remember that the correlation coefficient only summarizes the degree of linear structure
within bivariate data. If any other type of structure happens to exist between an X variable and a Y variable, then rXY will miss it. So, “loose” interpretations of the correlation
coefficient can be very misleading.
Are nonlinear structures in data any less important than linear structures? Absolutely
not! In your own research, you may very well encounter variables that are related to each
other in nonlinear ways. As we will see in later chapters, it is usually very easy to adapt the
basic ideas that we have introduced here to fit these slightly more complicated situations.
CONCLUSIONS
In this chapter, we have introduced some ideas that will appear repeatedly throughout
the rest of this text. Some of these are tools that will facilitate various tasks for us. Others
are concepts that will play key roles in developing and interpreting regression models. Regardless, this is information with which you should feel comfortable before proceeding any
further. You will definitely encounter it all again!
Our discussion has actually come full circle. We began by discussing scatterplots as
convenient graphical tools for looking at bivariate data. Our main objective is to discern
relationships between variables. And, back in Chapter 1, we defined “relationships” in terms
of Y ’s functional dependence on X. So, we spent some time in this chapter thinking about
functions more generally. Then, we focused on linearity as a particular type of function.
Given the attention that we paid to it in this chapter, we should probably issue a bit of a
disclaimer regarding linearity: We want to emphasize that linear relationships are definitely
not better than nonlinear relationships, and they are probably not even more common in
the empirical world. For most purposes, the only reason we attribute any special status
to linearity is that linear relationships are simpler than nonlinear relationships. We will
capitalize on this simplicity in much of the work to follow.
Moving on, we discussed the process of creating a new variable that is linearly related
to one or more previously-existing variables. This led us to the definitions and properties of
linear transformations and linear combinations. Those, in turn, led us to yet another set of
useful tools, the covariance and the correlation coefficient.
With the latter two descriptive statistics, we complete the circle in this chapter, in the
sense that both the covariance and the correlation provide a numerical summary of a particular type of pattern that could appear in a scatterplot. We also emphasized that these
bivariate statistics only measure one kind of structure, even though other kinds of structure
can certainly exist in the data that we analyze. For that reason, it is critically important
that we use all of the tools available to us in complementary ways. The correlation coefficient
(or the covariance) captures the degree of linear structure across two variables in a single
numeric value. But, visual inspection of the scatterplot is necessary in order to confirm
that linearity is an appropriate description of the bivariate structure in the first place. We
believe that good data analysis always proceeds by moving back and forth between graphical
and numeric representations of the data under investigation. You will see many examples
of this throughout the rest of the book. And, we hope to convince you that there are very
important advantages to relying upon such complementary “dual views” of the data in your
own analyses.
Box 2-1: Deriving the mean of a linear transformation. All sums are taken over n observations, so the limits of summation are omitted for clarity.

Yi = a + bXi                                      (1)
Σ Yi = Σ (a + bXi)                                (2)
     = Σ a + Σ bXi                                (3)
     = na + b Σ Xi                                (4)
(Σ Yi)/n = (na + b Σ Xi)/n                        (5)
         = na/n + b (Σ Xi)/n                      (6)
(Σ Yi)/n = a + b (Σ Xi)/n                         (7)
Ȳ = a + bX̄                                        (8)
Box 2-2: Deriving the variance and standard deviation of a linear transformation. All sums are taken over n observations, so the limits of summation are omitted for clarity.

Yi = a + bXi                                                      (1)
Yi − Ȳ = (a + bXi) − (a + bX̄)                                     (2)
       = (a − a) + (bXi − bX̄)                                     (3)
       = 0 + b(Xi − X̄)                                            (4)
(Yi − Ȳ)² = [b(Xi − X̄)]²                                          (5)
          = b²(Xi − X̄)²                                           (6)
Σ (Yi − Ȳ)² = Σ b²(Xi − X̄)²                                       (7)
            = b² Σ (Xi − X̄)²                                      (8)
Σ (Yi − Ȳ)² / (n − 1) = Σ b²(Xi − X̄)² / (n − 1)                   (9)
Σ (Yi − Ȳ)² / (n − 1) = b² [Σ (Xi − X̄)² / (n − 1)]                (10)
SY² = b² SX²                                                      (11)
√(SY²) = √(b² SX²)                                                (12)
SY = |b| SX                                                       (13)
Box 2-3: Deriving the mean of a linear combination. All sums are taken over n observations, so the limits of summation are omitted for clarity.

Yi = a + bXi + cZi                                        (1)
Σ Yi = Σ (a + bXi + cZi)                                  (2)
Σ Yi = Σ a + Σ bXi + Σ cZi                                (3)
Σ Yi = na + b Σ Xi + c Σ Zi                               (4)
(1/n) Σ Yi = (1/n)(na + b Σ Xi + c Σ Zi)                  (5)
(1/n) Σ Yi = (1/n)na + b (1/n) Σ Xi + c (1/n) Σ Zi        (6)
(Σ Yi)/n = na/n + b (Σ Xi)/n + c (Σ Zi)/n                 (7)
(Σ Yi)/n = a + b (Σ Xi)/n + c (Σ Zi)/n                    (8)
Ȳ = a + bX̄ + cZ̄                                           (9)
Box 2-4: Deriving the variance of a linear combination. All sums are taken over n observations, so the limits of summation are omitted for clarity.

Yi = bXi + cZi                                                                      (1)
Yi − Ȳ = (bXi + cZi) − (bX̄ + cZ̄)                                                    (2)
Yi − Ȳ = bXi + cZi − bX̄ − cZ̄                                                        (3)
Yi − Ȳ = (bXi − bX̄) + (cZi − cZ̄)                                                    (4)
Yi − Ȳ = b(Xi − X̄) + c(Zi − Z̄)                                                      (5)
(Yi − Ȳ)² = [b(Xi − X̄) + c(Zi − Z̄)][b(Xi − X̄) + c(Zi − Z̄)]                          (6)
(Yi − Ȳ)² = b(Xi − X̄)b(Xi − X̄) + b(Xi − X̄)c(Zi − Z̄) + c(Zi − Z̄)b(Xi − X̄) + c(Zi − Z̄)c(Zi − Z̄)    (7)
(Yi − Ȳ)² = b²(Xi − X̄)² + c²(Zi − Z̄)² + 2bc(Xi − X̄)(Zi − Z̄)                         (8)
Σ (Yi − Ȳ)² = Σ [b²(Xi − X̄)² + c²(Zi − Z̄)² + 2bc(Xi − X̄)(Zi − Z̄)]                   (9)
Σ (Yi − Ȳ)² = Σ b²(Xi − X̄)² + Σ c²(Zi − Z̄)² + Σ 2bc(Xi − X̄)(Zi − Z̄)                 (10)
Σ (Yi − Ȳ)² = b² Σ (Xi − X̄)² + c² Σ (Zi − Z̄)² + 2bc Σ (Xi − X̄)(Zi − Z̄)              (11)
Σ (Yi − Ȳ)² / (n − 1) = [b² Σ (Xi − X̄)² + c² Σ (Zi − Z̄)² + 2bc Σ (Xi − X̄)(Zi − Z̄)] / (n − 1)    (12)
Σ (Yi − Ȳ)² / (n − 1) = b² [Σ (Xi − X̄)² / (n − 1)] + c² [Σ (Zi − Z̄)² / (n − 1)] + 2bc [Σ (Xi − X̄)(Zi − Z̄) / (n − 1)]    (13)
SY² = b² SX² + c² SZ² + 2bc SXZ                                                     (14)
SY² = b² SX² + c² SZ² + 2bc cov(X, Z)                                               (15)
Figure 2.1: Scatterplot showing 2005 state Medicaid spending (in dollars per capita) plotted against the percentage of each state’s population receiving Medicaid benefits.
[Figure: scatterplot. Horizontal axis: Number of Medicaid recipients, as percentage of state population. Vertical axis: Medicaid spending, in dollars per capita.]
Figure 2.2: Basic scatterplot showing party identification versus ideological self-placement in the 2004 CPS American National Election Study.
[Figure: scatterplot. Horizontal axis: Party identification. Vertical axis: Ideological self-placement.]
Figure 2.3: Scatterplot showing party identification versus ideological self-placement in the 2004 CPS American National Election Study. Plotted points are jittered to break up the plotting locations of discrete data values.
[Figure: scatterplot. Horizontal axis: Party identification. Vertical axis: Ideological self-placement.]
Figure 2.4: Graph of function, Yi = 2Xi.
[Figure: Panel A, Graph of function when X is discrete; Panel B, Graph of function when X is continuous. Horizontal axis: X variable. Vertical axis: Y variable.]
Figure 2.5: Graph of linear function with negative slope, Yi = -2Xi.
[Figure: Horizontal axis: X variable. Vertical axis: Y variable.]
Figure 2.6: Graph of two linear functions with different slopes. The solid line represents the function, Yi = 2Xi, and the dashed line represents the function Yi = 2Xi.
[Figure: Horizontal axis: X variable. Vertical axis: Y variable.]
Figure 2.7: Graph of two linear functions with different negative slopes. The solid line represents the function, Yi = -1Xi, and the dashed line represents the function Yi = -3Xi.
[Figure: Horizontal axis: X variable. Vertical axis: Y variable.]
Figure 2.8: Graph of three linear functions with identical slopes, but different intercepts. The solid line represents the function, Yi = 3 + 1.5Xi. The dashed line represents the function Yi = 1.5Xi. And, the dotted line represents Yi = -3 + 1.5Xi.
[Figure: Horizontal axis: X variable. Vertical axis: Y variable.]
Figure 2.11: Scatterplots showing a linear combination (Y) plotted against each of the two variables used to create it (X1 and X2).
[Figure: Panel A, Y plotted against X1; Panel B, Y plotted against X2. Horizontal axes: X1 variable, X2 variable. Vertical axis: Y variable.]
Figure 2.12: Plotting Y (the linear combination) against X1, while holding X2 constant at a fixed value.
[Figure: Panel A, Y versus X1 for observations with X2 = 4; Panel B, Y versus X1 for observations with X2 = 7. Horizontal axis: X1 variable. Vertical axis: Y variable.]
Figure 2.13: Plotting a “purged” version of Y versus X1. The vertical axis variable is defined as: Y∗ = Y − (−2)X2.
[Figure: Panel A, Plotting purged version of Y against X1; Panel B, Plotting purged version of Y against X1 with points jittered. Horizontal axis: X1 variable. Vertical axis: Y∗ variable.]
Figure 2.14: Scatterplots showing different degrees of linear structure in bivariate data.
[Figure: Panel A, Scatterplot shows no linear structure (X versus Y1); Panel B, Scatterplot shows weak linear structure (X versus Y2); Panel C, Scatterplot shows clear linear structure (X versus Y3). Horizontal axis: X. Vertical axes: Y1, Y2, Y3.]
Figure 2.14: Scatterplots showing different degrees of linear structure in bivariate data (Continued).
[Figure: Panel D, Scatterplot shows clear, but negative, linear structure (X versus Y4); Panel E, Scatterplot shows perfect linear structure (X versus Y5). Horizontal axis: X. Vertical axes: Y4, Y5.]
Figure 2.15: Scatterplots showing X versus Y∗ = Y3 / 2 and X versus Y∗∗ = 2 Y5, to demonstrate that changing the measurement units does not alter the degree of linear structure in the data.
[Figure: Panel A, X versus Y∗; Panel B, X versus Y∗∗. Horizontal axis: X. Vertical axes: Y∗, Y∗∗.]
Figure 2.16: Scatterplots of nonlinear deterministic functions.
[Figure: Panel A, X versus Y1; Panel B, X versus Y2. Horizontal axis: X variable. Vertical axes: Y1 variable, Y2 variable.]