This chapter will introduce some important concepts and tools which will prove useful for our development of the regression model. The concepts include critical ideas like “linear,” “covariance,” and “correlation.” The tools include scatterplots, linear transformations, linear combinations, and the correlation coefficient. It is possible that some readers might consider this material to be a digression from the major objectives of this text. However, we would reply that the topics to be considered here will actually facilitate our broader discussion and the presentation of the statistical material which follows in subsequent chapters. If you have already read the first chapter (and, of course, we assume that you have done so, if you are reading this!), then you have probably gained some sense of our own approach: We want to be systematic and cumulative in the ways that we introduce new material— explaining why we want to pursue a topic, and building upon results that we have already established. We definitely do not want to take a superficial approach in which concepts, results, and rules for usage simply “fall from the sky.” We want to convey to you why the regression model looks and works as it does. To beginners, this may seem to be a somewhat difficult way to introduce the subject matter. But, we are convinced that it is useful to proceed this way because our approach provides much better understanding in the end. So, let us take a look at some basic concepts and tools. As you will see, they will appear repeatedly throughout the rest of the book. It is very helpful to develop a firm grasp of these basic elements before we begin applying them to develop the regression model and test its various elements. THE SCATTERPLOT The scatterplot is the basic graphical display for bivariate data. We already had a brief discussion of scatterplots and their construction back in Chapter 1, and we saw a number of examples there, too. So, why do we need a special section discussing them further in this chapter? Primarily to lay out some principles and guidelines for creating good scatterplots! Unfortunately, the default selections for creating scatterplots in some software packages 1 result in displays which do not display the information they contain in an optimal manner. Of course, we want to avoid such problems whenever possible. Back in Chapter 1, we explained that a scatterplot is set up by drawing two perpendicular coordinate axes. Each axis represents one of the two variables; traditionally, X is assigned to the horizontal axis, and Y to the vertical axis. If we are using the scatterplot to assess the relationship between the two variables, then the horizontal axis is assigned to the independent variable and the vertical axis is assigned to the dependent variable. Each axis “represents” its variable in the sense that it is a number line which is calibrated in that variable’s data units (note that the units need not be the same for the horizontal and vertical axes). Each axis covers that variable’s range of possible values (or, at least, the range of values which actually occurs in the empirical data at hand). Tick marks are usually included on the axes to show the relative positions of the sequence of values within each variable’s range. Observations are plotted as points within the coordinate space defined by the two axes. 
The process is simple: For any given observation, say the ith , find the location corresponding to its value on X along the horizontal axis (this value is usually designated xi ) and the location for its value on Y (that is, yi ) along the vertical axis. From each of these locations, project imaginary lines into the space between the axes; the imaginary line from each axis should be parallel to the other axis, so these imaginary lines will be perpendicular to each other, just like the axes themselves. Place a point (or some other symbol) at the position where these imaginary lines intersect. That point represents observation i. Repeat this process for all n observations in the full dataset. A Substantive Example: Medicaid Spending in the States To illustrate a scatterplot, let us return to a substantive example first introduced in Chapter 1. Table 2.1 shows bivariate data on 50 observations— the American states. Since the information is arrayed in a tabular format, this kind of display is often called a “data 2 matrix.” In a data matrix, the rows correspond to observations and the columns correspond to variables. There are two variables (note that the state names shown in the left-hand margin of the table are row identifiers, although they also could be considered a categorical variable): The leftmost column shows each state’s 2005 expenditure for Medicaid, in dollars per capita. The righthand column shows the number of people within each state who are covered by Medicaid, expressed as a percentage of the state’s total population. Note that the ordering of the observations in the data matrix is arbitrary. In Table 2.1, the states are arranged in alphabetical order. However, this is only for our convenience; it has nothing whatsoever to do with any subsequent analyses we may perform on these data. Figure 2.1 shows a scatterplot of the bivariate data. We have assigned the size of the atrisk population to the horizontal axis, and health care expenditures to the vertical axis. This placement reflects our substantive assumption that the former is the independent variable and the latter is the dependent variable. The fifty states are plotted as small open circles within the region defined by the perpendicular axes of the graph. It is worth emphasizing that Figure 2.1 contains exactly the same information as Table 2.1. It is merely presented in a different format. Neither of these formats is any better or worse than the other; instead, they tend to be useful for different purposes. The data matrix enables direct observation (and manipulation) of the precise numeric quantities associated with each observation. However, it is not particularly useful for discerning structure within the overall dataset. The scatterplot enables a visual assessment of structure in the data. But, it is difficult to recover the precise variable values from the graph. As we will see, both of these kinds of tasks come into play during a data analysis project. It is important for the researcher to have some sense about the kind of structure that might exist in the data, so he or she knows what kind of model to construct. Once this is known, the data values can be entered into the appropriate equations to estimate values for the model parameters. But, again, we need to know the form of the model before we can estimate it. And, that is more easily discerned from a graphical display than the raw data matrix, itself. For that reason, 3 we need to pay attention to the details of constructing “good” scatterplots. 
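For readers who like to see the mechanics in code, here is a minimal Python sketch (using the matplotlib library) that builds a scatterplot from a small data matrix in exactly the way just described: rows are observations, columns are variables, and each observation becomes one plotted point. The numbers are made-up stand-ins, not the actual values from Table 2.1.

import matplotlib.pyplot as plt

# Hypothetical data matrix: each row is one observation (a state),
# each column is one variable. Illustrative values only.
pct_covered = [12.4, 18.9, 15.2, 21.7, 10.8]   # X: percent of population covered by Medicaid
spending    = [610, 980, 770, 1150, 540]       # Y: Medicaid spending per capita (dollars)

fig, ax = plt.subplots()
ax.scatter(pct_covered, spending,
           facecolors="none", edgecolors="black")   # open circles as plotting symbols
ax.set_xlabel("Percent of state population covered by Medicaid")
ax.set_ylabel("Medicaid spending per capita (dollars)")
plt.show()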
Guidelines for Good Scatterplots Scatterplots can be used for many different purposes. We will employ them primarily to discern the existence of systematic structure within our data. Therefore, the details of the scatterplot should be determined with that objective in mind. In this section, we will lay out a set of principles for constructing good scatterplots. But, to reiterate the preceding idea, the definition of a “good” scatterplot is not based upon aesthetic considerations. Rather, a good scatterplot is one in which the analyst can see the bivariate information in a manner that facilitates an assessment of patterned regularities across the observations in the dataset. One broad generalization is that good graphical displays should “maximize the data-toink ratio” (Tufte 1983). This implies that all elements of the scatterplot should be aimed at providing an accurate representation of the data, rather than devoted to decorative and unnecessary trappings. Beyond this, there is no intrinsic ordering to our set of principles. Still, in order to lay out our ideas in a logical manner, we will work “from the outside to the inside” of the scatterplot. Use an aspect ratio close to one. The aspect ratio of a graph is the ratio of its physical height to its physical width. An aspect ratio of one means that the vertical and horizontal axes will be equally sized. This is usually helpful because it makes it easier to see how the Y values tend to shift across the range of X values than would be the case with very wide graphs (i.e., aspect ratios much lower than one) or very tall graphs (i.e., aspect ratios much larger than one). Show the full scale rectangle. By “scale rectangle,” we mean explicit axes which delimit all four sides of the plotting region for the scatterplot. Including all four axes facilitates more accurate visual judgments about differences in the relative positions of the plotted points. Some graphing software packages only draw the left and bottom axes, leaving the right and top sides of the plot “open.” 4 Provide clear axis labels. The rationale for this point is fairly obvious. You want anyone who looks at your scatterplot to understand what is being shown in the graphical display. Therefore, you should put clear descriptions of the variables on left and bottom axes, rather than abbreviations or short acronyms, such as the kinds of variable names required in many software packages. Use relatively few tick marks on axes. We are not using the scatterplot to determine the data values associated with the respective observations. Therefore, it is not necessary to provide too much detail about the number lines which represent the variables’ value ranges in the scatterplot. Three or four tick marks (with labeled values on the left and bottom axes) should be sufficient to enable readers to estimate the general data values that coincide with various locations within the plotting region. Do not use a reference grid. There is generally no reason to include a system of horizontal and vertical grid lines in the plotting region of a scatterplot. For one thing, the axis tick marks should be sufficient to help readers evaluate variability in the plotted point locations. For another thing, a grid is a visually distracting element that might make it more difficult to discern the data points accurately. Data points should not be too close to the axes. The ranges of values on the horizontal and vertical axes should extend beyond the ranges of the respective variables in the scatterplot. 
This guarantees that data points do not collide with the axes (which would make them more difficult to see). Select plotting symbols carefully. The plotting symbols used in a scatterplot need to be large enough to enable visual detection of the observations (e.g., small dots or periods are often difficult to see). At the same time, the symbols should not be unduly affected by overplotting; that is, it should be possible to discern differences across symbols which represent distinct observations with very similar data values (e.g., open circles work better than squares, diamonds, or solid circles). Generally speaking the 5 size of the plotting symbols is inversely related to the number of observations included in the plot (scatterplots with large numbers of observations should use smaller plotting symbols than scatterplots with small numbers of observations, and vice versa). Avoid point labels. We are using the scatterplot to discover structure in data, defined here as variability in the conditional distributions of Y values across X values. Thus, we are interested in relative concentrations of data points at various positions within the plotting region, rather than the identities of the observations that fall at particular locations (which can be obtained more accurately from the data matrix, itself). In addition to being extraneous to our objectives, observation labels are generally detrimental in a scatterplot, because they make it more difficult to observe the data points, themselves. The preceding list of principles should be regarded as guidelines, rather than hard and fast rules. There are certainly situations where a researcher might want to ignore one or more of the principles. But, for most situations, these guidelines should produce scatterplots which optimize our ability to visually discern how observations’ Y values tend to shift across the values of the X variable— in other words, we can see whether the dependent variable is related to the independent variable, according to the definition of empirical “relatedness” laid out in Chapter 1. Jittered Scatterplots for Discrete Data Before leaving the topic of scatterplots, we need to consider a situation that arises quite frequently in the social sciences: Scatterplots with discrete data. What do we mean by this? You have probably already heard of the distinction between discrete and continuous variables. Our definitions of these terms probably differ slightly from those you have already encountered. For our purposes, a discrete variable is one in which the number of distinct values is small, compared to the number of observations and (logically enough) a continuous variable is one in which the number of distinct values is large. Note also that a variable’s 6 status as discrete or continuous is specific to the particular dataset under observation; that is, these terms apply to the particular configuration of variable values that actually show up in a given sample of observations. We can easily specify the limiting cases for these two types of variables. A discrete variable can not have any fewer than two values. Otherwise, of course, it is a constant rather than a variable. At the other extreme, a continuous variable can have a maximum of n distinct values, since it is impossible to have more data than there are observations in the dataset. But, the “dividing line” between discrete and continuous status is not clear! 
In fact, there is no particular need to specify a particular cut-off point for the number of values at which a variable changes from discrete to continuous. Instead, we will simply make the determination on a case-by-case basis.1 Getting back to the more general topic, what will happen if we try to construct a scatterplot using discrete, rather than continuous variables? In that case, there will likely be a great deal of overplotting, in which more than one observation is located at exactly the same position within the plotting region. This is a problem because the overplotting makes it very difficult to discern accurately the varying densities of data points at different locations within the scatterplot. As a specific example, the 2004 CPS American National Election Study (NES) contains data from a public opinion survey of the American electorate, conducted in fall 2004. Among many other questions, the respondents to this survey were asked their personal party identifications, and their ideological self-placements. Both of these variables are measured on bipolar scales (ranging from “strong Democrat” to “strong Republican” on the party identification scale and ranging from “extremely liberal” to “extremely conservative” on the ideology scale). Political scientists often attach numeric values to these categories, such as successive integers from one to seven. For theoretical reasons, it is interesting to consider the relationship between these two variables. And, we might try to do this by examining a scatterplot. 7 Figure 3.2 shows the scatterplot of ideology versus party identification for the 2004 NES data. The graphical display actually includes 915 observations. But, there are only 45 distinct plotting positions, due to the discrete nature of the two variables.2 The scatterplot shows an undifferentiated, nearly square, array of points, making it impossible to tell how many observations actually fall at each location. What can we do in such a situation? The basic problem is that data points are superimposed directly over each other. In order to change this, we might consider moving all of the points very slightly, just to break up the specific plotting locations. This can be accomplished very easily by adding a small amount of random variation to each data value. This process is called “jittering.” Figure 3.3 shows a jittered version of the scatterplot for party identification versus ideology. Now, variations in the ink density make it easier to see that increasing values on one of these variables do tend to correspond to increasing values on the other; in other words, partisanship and ideology are related to each other, in a manner that makes sense in substantive terms. Jittering enables us to see this in the scatterplot; the nonrandom pattern in the placement of the data points was invisible in the “raw” version of the graphical display. You might be concerned that we are doing something illegitimate (i.e., cheating) by jittering the points in a scatterplot. After all, by moving them around in this manner, we are implicitly changing the data values associated with the plotted points. In fact, this is just not problematic: We are only changing the point locations, not the data values themselves. In doing so, we can observe something that we could not see before jittering: The variation in the density of the data within different sectors of the plotting region. And, of course, the latter is why we are looking at the scatterplot in the first place! 
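As an illustration, the sketch below jitters a pair of discrete seven-point scales before plotting them. It assumes Python with numpy and matplotlib, and it fabricates stand-in values rather than using the actual survey responses; it also applies several of the guidelines from the previous section (near-square plot, small open circles, few tick marks, no grid).

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Fabricated stand-ins for two related seven-point scales (not the real survey data).
n = 900
party = rng.integers(1, 8, size=n)                        # 1 through 7
ideology = np.clip(party + rng.integers(-2, 3, size=n), 1, 7)

# Jitter: small uniform noise, kept well below the one-unit resolution of the scales.
def jitter(values, width=0.3):
    return values + rng.uniform(-width, width, size=values.shape)

fig, ax = plt.subplots(figsize=(5, 5))                    # aspect ratio close to one
ax.scatter(jitter(party), jitter(ideology),
           facecolors="none", edgecolors="black", s=12)   # small open circles
ax.set_xlabel("Party identification (seven-point scale)")
ax.set_ylabel("Ideological self-placement (seven-point scale)")
ax.set_xticks([1, 4, 7])
ax.set_yticks([1, 4, 7])                                  # few tick marks, no reference grid
plt.show()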
Furthermore, jittering does not move the points very much. The shift in any point’s location along any of the coordinate axes should always be kept smaller than the rounding limits on the respective variables. Sticking to this rule will minimize the possibility that anyone looking at the scatterplot will mistake the jittering for real, substantive, variation in 8 the data values. Finally, we need to emphasize that jittering only occurs in the contents of the scatterplot. The random movements in the point locations do not get retained in the variable values which compose the actual dataset, itself. Any and all calculations involving the data (e.g., summary statistics) are carried out using the original, non-jittered values. In summary, jittering is a tool that can be employed to optimize visual perception of relationships between discrete variables in a scatterplot. THAT UBIQUITOUS CONCEPT, “LINEAR” One of the terms that seems to show up frequently in regression analysis and in statistics, more generally, is the word, “linear.” So, for example, you have probably heard of “linear relationships between variables” and “linear regression analysis.” As we proceed through the material, we’ll see a few more appearances, such as “linear transformations,” “linear combinations,” and “linear estimators.” Therefore, it seems reasonable to spend some time explaining exactly what this term means. It is, perhaps, tempting to say that “linear” refers to something that can be shown as a line. Well, this is probably true. But, it is not very informative! As a more constructive approach, we will begin by saying that the term, “linear,” is an adjective which describes a particular kind of function. Of course, this raises an additional question: What is a function? To answer that question, let us identify two sets, designated X and Y . A function is simply a rule for associating each member of X with a single member of Y . It is no accident that we used X and Y as the names of the two sets: For our purposes, these sets are two variables, and their members are the possible values of those variables. Here is a generic representation of a function relating the variable Y to the variable X: Yi = f (Xi ) (1) Here, Yi refers to a specific value of the variable, Y , and Xi refers to a specific value of the 9 variable, X. The “f ” in expression 1 just designates the rule that links these two values together. So, the preceding expression could be stated verbally as “start with the ith value of X, apply the rule, f , to it, and the result will be the ith value of Y . As a very simple example of a function, let’s say that f is the rule, “add six to”. If Xi = 3, we would apply “add six to 3” in order to produce Yi = 9. Similarly, we could apply the same rule to other legitimate values of X in order to obtain other Y values. The result would be a set of Y values which are “linked” to the X values through the function, or rule, which was used to create the specific Yi ’s from the respective Xi ’s. Linear and Nonlinear Functions A linear function is a particular type of rule, in which the Xi ’s are each multiplied by a constant, or a single numerical value which does not change across the different values of X. Thus, one type of linear function would be: Yi = b Xi (2) Notice that the b in equation 2 does not have a subscript. This indicates that the numeric value that is substituted for b does not change across the different values of X. This constant is often called the “coefficient” in the equation relating X to Y . 
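Since a function is nothing more than a rule, it can help to see one written out explicitly. Here is a short sketch, in Python and purely for illustration, of the two rules discussed above.

def add_six(x):
    # the rule "add six to": f(x) = x + 6
    return x + 6

def linear(x, b=2):
    # a linear function of the form Y = b * X, with the coefficient b held constant
    return b * x

print(add_six(3))                          # 9
print([linear(x) for x in [1, 2, 3]])      # [2, 4, 6]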
Let us assume that b = 2. If that is the case, Xi = 1 will be associated with Yi = 2 = 2 · 1. Similarly, if Xi = 3 then Yi = 6 = 2 · 3. Regardless of which X value is input to the function, it is always multiplied by two (i.e., the value of b) to obtain Yi.

Note that the function relating values of X to values of Y can contain other terms besides the variable, X, itself. For example, consider the following:

Yi = a + bXi (3)

Stated verbally, equation 3 says that, in order to obtain the Y value for observation i, the constant, a, should be added to the product obtained by multiplying the coefficient, b, by the X value for that observation (that is, Xi). And, since the value of a does not change across the observations (we know this because it does not have an "i" subscript), it is considered another coefficient in the function relating X to Y. Equation 3 still shows Y as a linear function of X, since its responsiveness to changes in X values remains constant. That is, equal differences in pairs of X values will always correspond to a fixed amount of difference in the associated pairs of Y values. Adding the new coefficient, a, does not change that in any way. In fact, the kind of function shown in equation 3 has a special name: It is called a "linear transformation." We will look more closely at linear transformations in the next section of this chapter.

What would a nonlinear function look like? It is difficult to provide a satisfying answer to that question, because any other rule for obtaining Yi apart from multiplying Xi by a constant value would be a nonlinear function. So, one example of a nonlinear function would be the following:

Yi = bi Xi (4)

Equation 4 is a nonlinear function because the coefficient, b, has the subscript, i, indicating that the number we use to multiply Xi can, itself, change across the observations. Another nonlinear function is:

Yi = Xi^b (5)

Here, the X variable is not multiplied by the coefficient, b. Instead, the value of Xi is raised to the power corresponding to the number, b.

What makes a linear function distinctive, compared to nonlinear functions? In a linear function, the "responsiveness" of Y to X will be constant across the entire range of X values. That is not the case with a nonlinear function. To see what we mean by this, let us return to our earlier example of a linear function, Yi = bXi, with b = 2. We will assume that the possible values of X range across the interval from one to ten. Now, consider two X values at the lower end of the X distribution, X1 = 1 and X2 = 2. We apply the linear function to each of these values and obtain 2 · 1 = 2 and 2 · 2 = 4, or Y1 = 2 and Y2 = 4. Notice that, as the X values increased by one unit, applying the function produced Y values which differed by two units. Next, let us move over to the other side of the X distribution and select two large values, X3 = 9 and X4 = 10. Applying our linear function, Y3 = 2 · 9 = 18, and Y4 = 2 · 10 = 20. The third and fourth X values are much larger than the first and second values (i.e., 9 and 10, rather than 1 and 2). But, they still differ by one unit. And, note that when we increase from 9 to 10, applying the function still produces a pair of Y values which differ by two units (i.e., Y3 = 18 and Y4 = 20), just as before.
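The following few lines verify this constancy numerically for Yi = 2Xi: pick any pair of X values one unit apart, anywhere in the range, and the corresponding Y values always differ by exactly b = 2. (An illustrative Python check, not part of the formal development.)

b = 2
for x_low, x_high in [(1, 2), (4, 5), (9, 10)]:
    difference = b * x_high - b * x_low        # difference in Y for a one-unit step in X
    print(f"X from {x_low} to {x_high}: Y changes by {difference}")
# Each line reports a change of 2 units, the value of b.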
And, you can verify that the function will work the same way for any pair of X values that differ by a single unit; regardless of where that pair of values occurs within the range of X values, application of the linear function (with b = 2) will always produce a difference of two units in the resultant Y values. It is precisely this constancy in responsiveness of Y to X which defines the linear nature of the function: If Y is a linear function of X, then for any coefficient b, two observations which differ by one unit on X will always differ by exactly b units on Y, regardless of the specific values of X.

The preceding rule also differentiates linear functions from all nonlinear functions. For example, consider the function in equation 5, with b = 2. In other words, Yi is obtained by squaring Xi. Here, if X1 = 1 then Y1 = 1^2 = 1. And, if X2 = 2 then Y2 = 2^2 = 4. With this function, increasing Xi from one to two corresponded to an increase of three units on Y, from Y1 = 1 to Y2 = 4. Moving to the other end of X's range, consider X3 = 9. Now, Y3 = 9^2 = 81. If X4 = 10, then Y4 = 10^2 = 100. Here, we have still increased Xi by one unit. But now, that movement occurs at a different location within X's range, from 9 to 10. And, the increase in Y is now much larger than before, at 19 units (i.e., from Y3 = 81 to Y4 = 100). Here, the responsiveness of Y to X is not constant; instead, it depends upon the particular X values over which the one-unit increase occurs.

Obtaining the Coefficients in a Linear Function

So far, we have seen how a particular configuration of coefficients defines a linear function. To reiterate, if the values of a variable, X, are related to the values of another variable, Y, by a rule in which Yi = a + bXi, for all observations on Xi, i = 1, 2, 3, . . . , n − 2, n − 1, n, then that rule is a linear function. But, we have left a fairly important question unanswered: Where do a and b come from? We know that, in any particular example of a linear function, the letters representing these coefficients will be replaced by specific numbers— one value for a and another value for b. So, how are the numeric values determined?

The answer to the preceding question depends upon the circumstances. Sometimes, the values of the coefficients are simply given, as in the example functions used in the previous section. In other cases, the coefficients might be determined by pre-existing rules. So, for example, the cost of an item in Australian dollars is a linear function of the cost of that item in American dollars, with the b coefficient determined by the exchange rate between the two currencies. At the time of this writing, the function would be expressed as:

AUDi = 1.076 USDi

where AUDi is the cost of item i in Australian dollars, USDi is the cost of item i in American dollars, and b = 1.076.

In other situations, the values of the coefficients need to be calculated from previously-existing data. (Indeed, much of the rest of this book will be occupied with calculating coefficients in linear functions from empirical data!) Let us assume that we have n observations on two variables, X and Y, and that we also know that Y is a linear function of X. This information implies that, for any observation, i, Yi = a + bXi. It is quite easy to calculate the values of b and a, as follows: First, select any two observations from the data.
Here, we will call these observations 1 and 2, although we want to emphasize that they really can be any two observations selected arbitrarily from the n observations. Of course, we have the X and Y values for each of these observations, which can be shown as (X1, Y1) for observation 1 and (X2, Y2) for observation 2. Since we know that Y is a linear function of X, we can say that Y1 = a + bX1 and Y2 = a + bX2. Now, the process of calculating the values of b and a involves manipulating the preceding information to generate a formula which isolates b on one side of the equation and leaves previously-known information on the other side of the equation. We will begin by simply subtracting Y1 from Y2 and working out the consequences:

Y2 − Y1 = (a + bX2) − (a + bX1)
Y2 − Y1 = a + bX2 − a − bX1
Y2 − Y1 = a − a + bX2 − bX1
Y2 − Y1 = (a − a) + (bX2 − bX1)
Y2 − Y1 = b(X2 − X1)

Let us go through each of the preceding steps and explain what happened. In the first line, the left side simply shows the difference between Y2 and Y1, and the information on the right-hand side follows from the fact that Y is a linear function of X. In the second line, nothing happens on the left-hand side. But, on the right-hand side, we remove the parentheses; this means that we have to subtract both of the separate terms that were originally in the rightmost set of parentheses (i.e., a and bX1) from the terms that were originally in the other set of parentheses (i.e., a and bX2). In the third line, we rearrange terms a bit, in order to put the two a's adjacent to each other, and the two terms involving b adjacent to each other. In the fourth line, we insert two sets of parentheses on the right-hand side. Finally, two things occur on the right side of the fifth line (and, note that the left side of the equation has remained unchanged throughout all of this): (1) The term within the first set of parentheses is removed, since a − a = 0; (2) the b coefficient is factored out of the difference contained within the second set of parentheses in the fourth line.

At this point, we simply divide both sides by (X2 − X1) and simplify, as follows:

(Y2 − Y1) / (X2 − X1) = b(X2 − X1) / (X2 − X1)
(Y2 − Y1) / (X2 − X1) = b(1)
(Y2 − Y1) / (X2 − X1) = b

The preceding three lines should be easy to follow. We start in the first line by performing the division on both sides of the equation. In the second line, the numerator and denominator of the rightmost fraction cancel; that is, they are equal to each other, so the fraction itself is equal to one. And, in the last line, we simply eliminate the one from the right side, because any quantity multiplied by one is just equal to that quantity (b, in this case). Finally, we will follow longstanding tradition and reverse the two sides of the last line, so that the quantity we are seeking falls on the left-hand side of the equation, and the formula to obtain that quantity falls on the right-hand side:

b = (Y2 − Y1) / (X2 − X1) (6)

It is instructive to look closely at equation 6, because it shows us how the b coefficient can be interpreted as the "responsiveness" of Y to X. When reduced to lowest terms, the fraction on the right-hand side gives the difference in the values of Y that will correspond to a difference of one unit in X. The larger the size of this difference, the more pronounced is Y's "response" to X, and vice versa.
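Equation 6 is easy to check numerically. In the sketch below (illustrative values only), Y is generated from a known linear function of X, and the ratio of Y-differences to X-differences recovers the same b no matter which two observations are chosen.

a, b = 4.0, 1.5
x = [2.0, 5.0, 7.0, 11.0]
y = [a + b * xi for xi in x]          # Y is an exact linear function of X

def slope(i, j):
    # equation 6: b = (Yj - Yi) / (Xj - Xi)
    return (y[j] - y[i]) / (x[j] - x[i])

print(slope(0, 1), slope(1, 3), slope(0, 2))   # 1.5 1.5 1.5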
Again, since Y is a linear function of X, this responsiveness is constant across the range of X values. In other words, we can use any pair of observations, and the ratio of the difference in their Y values to the difference in their X values is always equal to b. Once we know the value of b, it is a simple matter to obtain the value for a: 15 Yi = a + bXi Yi − bXi = a The preceding two lines show that we can determine a by simply picking an arbitrary observation, i, and subtracting b times the X value for that observation from the Y value for that observation. Again, following tradition, we will place the symbol for the coefficient on the left-hand side, and the formula used to calculate it on the right-hand side: a = Yi − bXi (7) Looking at equation 7, we can see that the a coefficient is really just an additive “adjustment” that we apply in order to obtain the specific value of Yi . But, note that this adjustment takes place after we take the responsiveness of Y to X into account, through the term bXi . For this reason, it is somewhat tempting to say that a is less interesting than b, since it does not seem to be part of the direct connection between the two variables. While we are somewhat sympathetic to that interpretation, we do not agree with it. Instead, we prefer to say that the a coefficient plays a different role than the b coefficient. Consider an observation where the X value is zero, or Xi = 0. In that case, the second term on the right-hand side of equation 7 drops out (since bXi = 0), leaving a = Yi . Thus, we might interpret a as a sort of “baseline” value for Y . If situations in which Xi = 0 represent the absence of the variable, X, then this baseline might be interesting in itself. Graphing Linear Functions One of the themes that permeates this book is the idea that many mathematical and statistical ideas can be conveyed in pictorial, as well as numeric, form. That is certainly the case with functions. We can graph a function by taking a subset of n values from the range of possible X values, using the function to obtain the Yi values associated with each of the Xi ’s, and then using the ordered pairs, (Xi , Yi ), as the coordinates of points in a scatterplot. By tradition, the range of X values to be plotted is shown on the horizontal axis, while the 16 range of Y values produced by the function (as it is applied to the n values shown in the graph) is shown on the vertical axis. As you can probably guess, the graph of a linear function will array the plotted points in a straight line within the plotting region. As a first example, Figure 2.4A graphs the function Yi = 2 · Xi for integer X values from -3 to 10. The linear pattern among the points is obvious from even a casual visual inspection. If the values of X vary continuously across the variable’s range, then the graph of the function is often shown as a curve, rather than a set of discrete points.3 Figure 2.4B graphs the same function as before, Yi = 2· Xi , under the assumption that X is a continuous variable (which still ranges from -3 to 10). Of course, the “curve” in the figure is actually a straight line segment. And, the line in Figure 2.4B covers the same locations as the individual points that were plotted back in Figure 2.4A. Here, however, we can determine the Yi that would be associated with any value of Xi within the interval from -3 to 10. We merely locate Xi on the horizontal axis, and scan vertically, until we intersect the curve. 
At the point of intersection, we scan horizontally over to the vertical axis, and read off the value of Yi that falls at the intersection point. Thus, one could argue that the graph of the function using the curve (Figure 2.4B) is a more succinct depiction of the function than the scatterplot (in Figure 2.4A), because it it not tied to specific ordered pairs of Xi and Yi values. The graphical representation of a linear function allows us to see several features that would be a bit more difficult to convey by the equations alone. The coefficient, b is often called the “slope” of the function, because it effectively measures the orientation of the linear array within the plotting region. If b is a positive value, then the linear array of points (for discrete Xi ’s), r the line (for a continuous X variable) would run from lower left to upper right within the plotting region. This is, of course, the case in the two panels of Figure 2.4. For this reason, a function with a graph that runs from lower left to upper right is said to have “positive slope.” In contrast, a linear function with a negative value for b will be shown graphically as a 17 line which runs from upper left to lower right. As an example, Figure 2.5 shows the graph of the function Yi = −2 · Xi , with X ranging continuously from -3 to 10. As you have probably guessed, functions with graphs that run in this direction are said to have “negative slope.” Irrespective of sign, the size of the slope coefficient measures the “steepness” of the line corresponding to the function. Larger absolute values of b correspond to steeper lines, while smaller absolute values of b correspond to shallower lines. For example, Figure 2.6 shows the graphs of two different linear functions, plotted simultaneously in the same geometric space. The solid line segment depicts the function, Yi = 2· Xi , while the dashed line segment depicts Yi = 4 · Xi . Both functions are shown for the continuous set of X values from -3 through 10. Notice that the second function has a larger b coefficient than the first function (i.e., b = 4 in the second function, while b = 2 in the first function); hence, the dashed line (representing the second function) is steeper than the solid line (representing the first function). The same rule holds even if the value of b is negative. Figure 2.7 shows the graphs of two more functions, Yi = −1 · Xi and Yi = −3 · Xi . As you can see, the graph of the second function is, once again, steeper than that of the first function. But, the line segments run in a different general direction than those in Figure 2.3, reflecting the negative signs of the b coefficients in the latter two functions. To summarize, the general direction of the graph corresponding to a linear function is determined by the sign of the slope coefficient, b, and the steepness of the graph is determined by the size of the absolute value of b. What about the a coefficient in a linear function of the form, Yi = a + bXi ? The value of the a coefficient also has a direct graphical representation: It determines the vertical position of the graph for a linear function. Specifically, a gives the value of Y that will occur when Xi = 0. And, assuming that the vertical axis is located at the zero point along the horizontal axis, then a is the value of Y corresponding to the location where the graph of the linear function crosses the vertical axis. For this reason, the a coefficient is often called “the intercept” of the linear function. 
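Before turning to the role of the intercept, here is a sketch (Python, with numpy and matplotlib assumed) of the kind of slope comparison shown in Figures 2.6 and 2.7: the sign of b sets the direction of the line, and its absolute value sets the steepness.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 10, 200)          # a continuous range of X values

fig, ax = plt.subplots(figsize=(5, 5))
for b, style in [(2, "-"), (4, "--"), (-1, "-."), (-3, ":")]:
    ax.plot(x, b * x, style, color="black", label=f"Y = {b}X")
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.legend()
plt.show()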
Functions with positive a coefficients intersect the 18 Y axis above the zero point along that axis, while negative a coefficients cause a function to intersect the vertical axis below the zero point. A function with a = 0 (which might be shown as a function without an a coefficient at all, such as the functions graphed in Figures 2.2, 2.3, and 2.4) will cross the vertical and horizontal axes at the point (0. 0). This location is often called “the origin” of the space. Figure 2.8 shows three linear functions with identical slopes but different intercepts. The function shown by the solid line, Yi = 3 + 1.5Xi , has an intercept of 3, and it crosses the Y axis in the upper half of the plotting region. The function shown by the dashed line, Yi = 1.5Xi has an intercept of zero, so it intersects the origin of the space, in the middle of the plotting region. The function shown by the dotted line, Yi = −3 + 1.5Xi , has an intercept of -3, so it crosses the vertical axis in the lower half of the space. Thus, the specific value of the intercept for a linear function affects the vertical position of the graph for that function, but not the orientation of the graph, relative to the axes. Up to this point, we have only looked at graphs of linear functions. But, we can also graph nonlinear functions as well. The general process of creating the graph is the same as with a linear function. That is, plot points representing ordered pairs (Xi , Yi ) for some subset of n observations, and connect adjacent points with line segments if X is a continuous variable, and Y values exist for all points within the range of X that is to be plotted.4 The panels of Figure 2.9 show graphs for several different nonlinear functions of a continuous variable, X, which ranges from zero to ten. As you might expect from the name, each function shows a curved pattern, rather than a straight line. And, the shape of the curve differs from one function to the next. We previously said that nonlinear functions are different from linear functions in that Y ’s responsiveness to X varies across the range of X values. We can see this directly in the graphs of the nonlinear functions. The relationship of Y to X is represented by the steepness of the curve in the graph of the function. With the linear functions shown in Figures 2.4 to 2.8, this steepness is just the slope of the line segment corresponding to the function and, 19 again, it is constant across all X values. With a nonlinear function, the steepness changes at different locations along the horizontal axis. For example, consider the graph in Figure 2.9A. Here, if we start near the left side of the horizontal axis, say in the region where the Xi ’s vary from zero to about three, the curve is very shallow. This shows that the Y values change very little as the X values increase. Now, look over the the right side of the horizontal axis, say in the region where the Xi ’s vary from about eight to ten. Here, the curve is very steep, indicating that a given amount of increase in X corresponds to a quite large increase in the associated value of Y . With nonlinear functions, we must be attentive to “local slopes,” defined as the orientation of the function’s curve within particular regions or subsets of X values. These local slopes stand in contrast to the “global slope” of a linear function, signifying that the orientation of the linear “curve” for the function does not change across X’s range. How can we measure these local slopes in nonlinear functions? 
Is there any analog to the slope coefficient, which measures the amount of change in Y that corresponds to a unit increase in X? In fact there is: A concept from differential calculus, the derivative of Y with respect to X, measures the local slope of the function's curve for any specified Xi within the range of the function. But, we have decided not to use any calculus in this book. Therefore, we will turn to a different strategy which is a little less precise but somewhat easier (at least conceptually), called "first differences." The latter simply gives the difference in the Y values that would correspond to a one-unit increase in X values, at different positions along the range of X.

The procedure for calculating first differences is very simple, albeit somewhat cumbersome. First, select some "interesting" pairs of X values; for each such pair, the values in that pair should differ by one unit. So, in Figure 2.9A (where the curve has the form Yi = a + Xi^2), we might select a pair of X values near the left side of the graph (say, X1 = 0 and X2 = 1) and another pair of X values near the right side of the graph (say, X3 = 9 and X4 = 10). For each pair, calculate the corresponding Y values. For X1 = 0, Y1 = a + (0)^2 = a. For X2 = 1, Y2 = a + (1)^2 = a + 1. The first difference is, then, Y2 − Y1 = (a + 1) − a = 1. Now, repeat the process for the other pair of X values. For X3 = 9, Y3 = a + (9)^2 = a + 81. For X4 = 10, Y4 = a + (10)^2 = a + 100. And, the first difference is Y4 − Y3 = (a + 100) − (a + 81) = 19.

These first differences summarize the responsiveness of the Y values to changes in the X values within different regions of X's range. The calculated numerical values show that small values of X have little impact on Y, since the first difference from X1 = 0 to X2 = 1 is only 1. But, with large values of X, there is a stronger effect on Y, since the first difference from X3 = 9 to X4 = 10 is much larger, at 19. While a first difference is not identical to a local slope, it does summarize the set of local slopes which occur in the interval over which that first difference is calculated. For that reason, it is probably a good idea to only calculate first differences across intervals of X values for which the local slope (or, alternatively, the direction of the graphed curve) does not change very much.

Our discussion of graphing functions suggests that linear functions often have a practical advantage over nonlinear functions: For many purposes, linear functions are easier to deal with, in the sense that it only takes one coefficient (the slope, b) to describe the way that Y values respond to differences in the X values. With a nonlinear function, several separate measures (i.e., values of the derivative at different Xi, or first differences calculated for different regions within X's range) are usually needed in order to represent the varying responsiveness of Y to X across X's range. We will eventually see that this same difference persists when we employ regression analysis to construct models of linear and nonlinear relationships between empirical variables.
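Here is a short computational sketch of the first-difference calculation just described. It assumes a convex curve of the form Yi = a + Xi^2, as in the worked example above; the exact function behind Figure 2.9A is not reproduced here, so treat this purely as an illustration of the procedure.

a = 1.0                                   # any constant; it cancels out of the differences

def f(x):
    return a + x ** 2                     # assumed convex function, as in the example above

for x_low in (0, 9):                      # one pair near each end of X's range
    first_difference = f(x_low + 1) - f(x_low)
    print(f"first difference from X = {x_low} to X = {x_low + 1}: {first_difference}")
# Prints 1.0 near the low end and 19.0 near the high end:
# the responsiveness of Y to X grows as X increases.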
LINEAR TRANSFORMATIONS

We will now move away from the relatively abstract discussion of functions and focus more directly on empirical variables. We will still consider functions as rules which describe relationships between variables. But, now, the emphasis will shift more toward the implications of the functions for measurement and the variables' distributions rather than the characteristics of the functions, themselves.

Sometimes, it will be useful to create a new variable, Y, by applying a linear function to a previously-existing variable, X. That is, Yi = a + bXi for every observation, i = 1, 2, . . . , n. In this case, Y is called a "linear transformation" of the variable X.

The Effect of a Linear Transformation on a Variable

A linear transformation contains essentially the same information as the original variable. This is true because the relative sizes of the differences between the observations are preserved across the transformation. However, the units of measurement change by a factor of size b. And, after moving to the "new" measurement units, the origin (or zero point) in the transformed variable is shifted by a units.

An easily-recognizable example of a linear transformation involves temperatures, measured in Celsius units (which we will call C) and in Fahrenheit units (which we will call F). Each of these temperature measurements is a linear transformation of the other. For example, to transform a temperature in Celsius units into one that is expressed in Fahrenheit units, we apply the following transformation:

Fi = 32 + 1.8Ci (8)

In the preceding linear transformation, it is easy to see that b = 1.8 and a = 32. The measurement units for F and C both happen to be called "degrees." However, a Celsius degree is "wider" than a Fahrenheit degree because one of the former is equivalent to 1.8 of the latter. At the same time, the zero point on the Celsius scale falls 32 Fahrenheit units away from the zero point on the Fahrenheit scale.

The nature of the linear transformation from Celsius to Fahrenheit is illustrated with the parallel number lines shown in Figure 2.10; the right-hand line segment represents a range of Celsius temperatures, and the left-hand line segment shows its Fahrenheit equivalent. The figure also locates three observations on both of the scales. Notice that C1 = 0 and C2 = 10; in other words, the first two observations differ from each other by ten Celsius units. But, looking at the other scale, F1 = 32 and F2 = 50, for a difference of eighteen Fahrenheit units. Of course, the sizes of the differences on these two scales reflect the value of the b coefficient (1.8) in the linear transformation from Celsius to Fahrenheit degrees. Notice, too, that the first observation falls at the zero point on the Celsius scale. However, F1 falls 32 Fahrenheit units above zero on the latter scale. This corresponds to the size of the a coefficient in the transformation.

As we said earlier, a linear transformation retains the relative differences between the observations. We can see this in Figure 2.10 by comparing the difference between observations 1 and 2 to the difference between observations 2 and 3, using both scales. As we have already said, observations 1 and 2 differ by 10 Celsius units. And, we can also see that observations 2 and 3 differ by 20 Celsius units (since C3 − C2 = 30 − 10 = 20). So, the difference between the second and third observations is two times the size of the difference between the first and second observations, since (C3 − C2) = 2(C2 − C1). If we look at the same pairs of observations in Fahrenheit units, we see that F2 − F1 = 50 − 32 = 18 and F3 − F2 = 86 − 50 = 36. Once again, the difference between the second and third observations is twice the difference between the first and second observations, or (F3 − F2) = 2(F2 − F1).
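The temperature example translates directly into a few lines of code. The sketch below applies equation 8 to the three observations just discussed and confirms that differences are rescaled by b = 1.8 while their relative sizes are preserved.

def to_fahrenheit(c):
    # equation 8: F = 32 + 1.8 * C
    return 32 + 1.8 * c

celsius = [0, 10, 30]
fahrenheit = [to_fahrenheit(c) for c in celsius]
print(fahrenheit)                                   # [32.0, 50.0, 86.0]

# Differences of 10 and 20 Celsius units become 18 and 36 Fahrenheit units,
# so the second gap is still exactly twice the first.
print(fahrenheit[1] - fahrenheit[0], fahrenheit[2] - fahrenheit[1])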
Linear Transformation and Summary Statistics

Linear transformations occur frequently in the everyday world of data analysis and statistics. In fact, they may be particularly common in the social sciences, where many of the central variables are measured in arbitrary units. So, it is particularly important that we understand how a linear transformation would affect the summary statistics calculated to describe important characteristics of a variable's distribution.

Assume that we have n observations on a variable, Y, which is a linear transformation of a variable, X. How would we calculate the sample mean for Y? One approach would be to simply perform the transformation, Yi = a + bXi, for each of the n observations, i = 1, 2, . . . , n, and then calculate the sample mean using the usual formula, Ȳ = (Σ Yi)/n, with the sum taken over i = 1 to n. That would certainly give the correct result. But, it involves a lot of work, since the transformation would have to be carried out n times before calculating Ȳ. Since we "started out" with the variable, X, perhaps we already know the value of its mean, X̄. And, if so, we might be able to use that information to obtain Ȳ in a more efficient manner.

In order to do this, you need to be familiar with some basic properties of summations. These are listed in Appendix 1. But, the process is straightforward. We will just lay out the formula for the linear transformation, manipulate it to create the formula for Ȳ on the left-hand side, and see what happens on the right-hand side. The derivation of the mean of a linear transformation is shown in Box 2.1. The first line simply lays out the linear transformation. In the second line, we take the sum of the left- and right-hand sides. In the third line, we make use of the first property of summations given in Appendix 1, and distribute the summation sign across the two terms on the right-hand side. Line 4 uses the other two properties of summations. The coefficient, a, is a constant, so when that constant is summed across n observations, it simply becomes na. The other coefficient, b, is also a constant, so it can be factored out of the second summation on the right-hand side. In lines 5 and 6, we divide both sides of the equation by n, and break the right-hand side into two fractions, respectively. Line 7 cancels the n's from the first fraction on the right-hand side, and factors the b coefficient out of the numerator of the second fraction. Now, we can recognize very easily that the fraction on the left-hand side simply gives the sample mean of Y, while the only remaining fraction on the right-hand side similarly gives the sample mean of X. So, that leaves us with line 8, which shows that the mean of a linear transformation is that linear transformation applied to the mean of the original variable: Ȳ = a + bX̄. Just as we suspected, the coefficients of the linear transformation provide a shortcut to calculate Ȳ if we already know the value of X̄.

Can we use a similar shortcut to obtain the variance and standard deviation of a linear transformation? To answer this question, we will follow the same procedure we just used for the sample mean. In Box 2.2, we begin with the formula for the linear transformation, manipulate the left-hand side to produce the formula for the sample variance, and observe the results on the right-hand side. In the second line of Box 2.2, we subtract the mean of Y from both sides of the linear transformation; the left-hand side subtracts Ȳ directly, and the right-hand side uses the result from the previous derivation.
Lines 3 and 4 rearrange terms and simplify on the right-hand side. Note that the a coefficient is removed from further consideration, since a − a = 0. In line 5, we square both sides, and in line 6, the exponent is distributed across the two terms in the product on the right-hand side. In line 7, both sides are summed across the observations. In line 8, we use the third summation rule to factor the constant, b2 , out of the sum. Then, line 9 divides both sides by n − 1 and line 10 factors the b2 term out of the fraction on the right-hand side. At this point, you probably recognize that the left-hand side simply gives the formula for the sample variance of Y , while the right-hand side shows the product of b2 and the sample variance of X. Line 11 shows this explicitly. Then, lines 12 and 13 take the square root of each side to show that the sample standard deviation of Y is equal to b times the sample standard deviation of X. The derivation in Box 2.2 was a little more involved than that for the sample mean, back in Box 2.1, but not by much. And, the results are equally straightforward. There are a couple of features of the variance and standard deviation that should be emphasized. First, note that the a coefficient does not appear in the formula relating the variance of X to the variance of Y (of course, the same is true for the standard deviation). This is reasonable, because a is a constant, added to every observation. As such, it shifts the location of each observation along the number line, but it does not contribute at all to the dispersion of the observations. So, it does not really have anything to do with the variance. Second, the variance of X is multiplied by b2 , rather than by b. This makes sense, too: The variance is a measure of the average squared deviation from the mean. So, the measurement unit for the variance is the square of the measurement unit for the variable, itself. And, remember that the b coefficient in a linear transformation changes the measurement units 25 from those of X to those of Y . So, it is reasonable that b must be squared, itself, when determining the variance of the linear transformation, Y , from the variance of the original variable, X. Third, the sample standard deviation of Y is the standard deviation of X, multiplied by b. Of course, the standard deviation is measured in the same units as the variable whose dispersion it is representing (in fact, that is one of the central motivations for using the standard deviation rather than the variance in many applications). Again, the b coefficient adjusts the measurement units from those of X to those of Y . So, it performs a similar function in obtaining the standard deviation of the linear transformation, Y , from the standard deviation of X. To summarize, the salient features of a linear transformation’s distribution (specifically, its central tendency and dispersion) are related to the same features of the original variable’s distribution through some very simple rules. We have shown this to be the case for data in a sample. However, it is also true for the population distributions from which the sample data were drawn. These characteristics of linear transformations are useful, because they will be applicable in several of the derivations that we will complete in the rest of the book. They are also important because they will help us understand how changes in measurement units affect the various elements of regression models. 
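These results are easy to confirm numerically. The sketch below uses Python's statistics module and made-up data; note that taking the square root of b squared gives the absolute value of b, which only matters when b is negative (as it is here).

import statistics as st

x = [3.0, 7.0, 8.0, 12.0, 15.0]
a, b = 10.0, -2.0                     # a negative b, to show why the absolute value appears
y = [a + b * xi for xi in x]          # the linear transformation

print(st.mean(y), a + b * st.mean(x))               # mean of Y = a + b * mean of X
print(st.variance(y), b ** 2 * st.variance(x))      # variance of Y = b^2 * variance of X
print(st.stdev(y), abs(b) * st.stdev(x))            # sd of Y = |b| * sd of X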
LINEAR COMBINATIONS

Linear transformations create a new variable, Y, by applying a linear function to one other variable, X. There are also many situations where we want to create a new variable (which we will, again, designate as Y) by combining information from several other variables, say X and Z (although we could certainly have more than two such variables, we will use only these two for now, as the simplest case). If Y is created by forming a weighted sum of X and Z, then Y is called a linear combination of those two variables. This could be shown as follows:

Yi = bXi + cZi (9)

In the preceding formula, b and c are constants; their values do not change across the observations, from i = 1, 2, . . . , n. Just as before, we call them the coefficients of the linear combination. It is obvious that Y is a combination of the other two variables. And, it is a linear combination because (once again) the impact of X on Y is constant across X's range, and the same is true of Z. The coefficients in the linear combination are "weights" in the sense that they determine the relative contributions of X and Z to the final variable, Y. In other words, if b is larger than c, then the Xi's will have a greater impact on the Yi's than do the Zi's, and vice versa. Of course, b can be equal to c, in which case both variables have an equal influence in creating Y. It is also possible that b and c are both equal to one, in which case the linear combination, Y, is just the simple sum of X and Z.

Linear combinations occur very frequently in the social sciences. To mention only a few examples:

• A professor calculates each student's overall course grade (Yi) by summing the grades received on the midterm examination (Xi) and the final examination (Zi).

• Another professor calculates the overall course grade by weighting the midterm examination to count for 40% (that is, b = 0.40) and the final examination to count for 60% (c = 0.60).

• A sociologist develops a measure of socioeconomic status for respondents to a public opinion survey (Yi) by adding together each person's reported family income, in thousands of dollars (Xi), and years of schooling (Zi). Furthermore, the sociologist multiplies the first constituent variable by 1.5, to indicate that status is more a function of income than of education.

Of course, many other examples of linear combinations are possible. Again, they occur whenever a new variable is created by taking a weighted sum of other variables.

The linear combinations we have considered so far are relatively simple, because they have only included two terms on the right-hand side. But, the weighted sum which defines the linear combination can actually contain any number of terms. And, at this point, we encounter a bit of a problem with notation. Traditionally, statistical nomenclature has employed letters from the end of the alphabet to represent variables (e.g., X, Y, and Z) and letters from the beginning of the alphabet to represent coefficients (e.g., a, b, and c). This system has worked so far, because our linear combinations have involved very few terms. With more terms entering into a weighted sum, we will run out of distinct letters pretty quickly. In order to avoid this problem, we will move to a more generic system that can be employed no matter how many terms are contained in the linear combination. Specifically, we will build upon our previous system, and designate that the variable on the left-hand side— that is, the linear combination, itself— is shown as Y.
The variables on the right-hand side— that is, those comprising the terms that are summed to create the linear combination— are shown as X's with subscripts. So, if there are k variables going into the linear combination, then they will be represented as X1, X2, . . . , Xj, . . . , Xk. From now on, we will reserve the subscript, j, as a variable index, with k representing the total number of variables going into the linear combination (that is, appearing on the right-hand side). At the same time, we will use b with a subscript to represent the coefficient for each of the Xj's in a linear combination. In other words, bj is the coefficient associated with Xj. Thus, we could show a linear combination, Y, composed of k terms, as follows:

$$Y = b_1 X_1 + b_2 X_2 + \ldots + b_j X_j + \ldots + b_k X_k \qquad (10)$$

Or, in a more compact form:

$$Y = \sum_{j=1}^{k} b_j X_j \qquad (11)$$

You need to look carefully at the limits of summation in the preceding expression, to see that the sum is taken across the k variables, rather than across observations (as we have done in previous applications of the summation operator). In fact, the observations don't show up at all in expressions 10 and 11. If we want to represent the ith observation's values in the linear combination, Y, then we will have to use a slightly more complicated form of subscripting for the right-hand side variables, as follows:

$$Y_i = b_1 X_{i1} + b_2 X_{i2} + \ldots + b_j X_{ij} + \ldots + b_k X_{ik} \qquad (12)$$

Now, the X's have double subscripts. Following the notation we have already employed in earlier equations and derivations, one of the subscripts is designated i and it indexes observations. The other subscript is designated j and it indexes variables. Arbitrarily, we will place the observation index first and the variable index second in the doubly-subscripted versions of the variable. So, Xij refers to the ith observation's value on variable Xj. Of course, the coefficients only have single subscripts, because they are each constant across the observations. Notice that the linear combinations we have shown so far do not include an a coefficient which might serve a similar purpose as the intercept in a linear transformation. We could easily add one, as follows:

$$Y_i = a + b_1 X_{i1} + b_2 X_{i2} + \ldots + b_j X_{ij} + \ldots + b_k X_{ik} \qquad (13)$$

However, we prefer to use a slightly different notation. Rather than a, we will show the intercept as b0. So, the linear combination is shown as:

$$Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \ldots + b_j X_{ij} + \ldots + b_k X_{ik} \qquad (14)$$

And, it is sometimes useful to think of b0 as the coefficient on a pseudo-variable which we will call X0. The latter is a pseudo-variable because its values equal one for every observation— hence, this "variable" doesn't actually vary! This may seem a bit nonsensical. But, later on, we will see that thinking about the intercept this way helps us in several derivations. So, for now, we note that a linear combination could be represented as:

$$Y_i = b_0 X_{i0} + b_1 X_{i1} + b_2 X_{i2} + \ldots + b_j X_{ij} + \ldots + b_k X_{ik}$$

$$Y_i = \sum_{j=0}^{k} b_j X_{ij}$$

Once again, look carefully at the limits of summation in the preceding expression. You will see that the variable index begins at zero, rather than one, indicating that an intercept is included in the linear combination. Of course, if a particular linear combination does not include an intercept, we would just begin with j = 1, as usual.
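The doubly-subscripted notation maps naturally onto an array-based calculation. The sketch below is our own illustration (assuming Python with NumPy; the data values and coefficients are invented) of equation 14 and of the X0 pseudo-variable idea: appending a column of ones lets the intercept be treated as just another term in the sum over j.

```python
import numpy as np

# Made-up data: n = 5 observations on k = 3 right-hand-side variables (X_i1 ... X_i3).
X = np.array([[2.0, 7.0, 1.0],
              [4.0, 3.0, 6.0],
              [1.0, 9.0, 2.0],
              [5.0, 5.0, 5.0],
              [3.0, 8.0, 4.0]])
b = np.array([1.5, -0.5, 2.0])     # coefficients b_1 ... b_k
b0 = 10.0                          # intercept

# Equation (14): Y_i = b0 + b1*X_i1 + ... + bk*X_ik, for every observation at once.
y = b0 + X @ b

# Equivalent form using the pseudo-variable X_0 = 1 for every observation,
# so the sum runs from j = 0 to k.
X_aug = np.column_stack([np.ones(X.shape[0]), X])
b_aug = np.concatenate([[b0], b])
y_alt = X_aug @ b_aug

print(np.allclose(y, y_alt))       # True: the two representations are identical
```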
Graphing the "Linear" Part of a Linear Combination

Earlier in this chapter, we saw that the graph of a linear function is (reasonably enough) a line. We didn't look at the graph of a linear transformation, but that is just a straightforward application of our general rule. So, if Y is a linear transformation of X, then a graph of Y versus X would produce a linear pattern of points if X is discrete, or a line if the values of X vary continuously across the variable's range. In either case, the linear pattern (or, the line itself) would have an intercept of a and a slope of b. Can we generalize this kind of graphical display to a linear combination? We will address this question using an example. Assume we have two variables, X1 and X2, each of which takes on integer values from one to ten. Our data consist of 100 observations, comprising all possible combinations of distinct X1 and X2 values. Let us create a linear combination, Y, from these variables, defined as follows:

$$Y_i = 3X_{i1} + (-2)X_{i2} \qquad (15)$$

Figure 2.11 graphs Y against each of the two constituent variables in the linear combination. But, neither scatterplot shows a line. What is going on here? Shouldn't a linear combination be manifested in linear graphical displays? The answer is that the values of Y are affected simultaneously by both X1 and X2. Therefore, if we graph Y against only one of the constituent variables in the linear combination (say, X1), the locations of the plotted points are also affected by the values of X2, even though that variable is not explicitly included as one of the axes of the scatterplot. It is these effects of X2 which prevent us from seeing directly the linear relationship between Y and X1 in the graph. Even though a simple scatterplot does not work for this purpose, there are several strategies we might use to "remove" the effects of X2 from the display. After we do that, we should be able to see X1's linear impact on Y.

The first strategy is to condition on the values of X2 before graphing the relationship between X1 and Y. In other words, we divide the 100 observations into subgroups, based upon the values of X2. Given the way we have defined our dataset, there will be ten such subgroups: The first subgroup contains the ten observations for which X2 = 1, the second subgroup contains the ten observations for which X2 = 2, and so on. Stated a bit differently, Xi2 is held constant within each of the subgroups. Then, construct a separate scatterplot of Y against X1 for each of the subgroups. Each scatterplot will show a linear array of points, with slope b1 and an intercept equal to b2 times the value at which X2 is held constant. The slope remains the same across the 10 graphs (since, by definition, the linear responsiveness of Y to X1 is constant) but the intercept shifts according to the value of X2 that is held constant within each graph. Figure 2.12 shows two such scatterplots, with X2 held constant at values of four and seven, respectively. In both panels of the figure, the slope of the linear array is 3, which is the value of b1. In the first panel, the intercept is equal to -8, which is b2 times the fixed value of X2 for this subgroup (or −2 × 4). In the second panel, the intercept is -14, or −2 × 7. Of course, we could also observe the linear relationship between X2 and Y by conditioning on values of X1.

The second strategy for isolating the linear effect of X1 is to remove directly the effects of X2 on Y. We can accomplish this through some simple algebra:

$$Y_i = 3X_{i1} + (-2)X_{i2}$$
$$Y_i - (-2)X_{i2} = 3X_{i1}$$
$$Y_i^* = 3X_{i1}$$

The Y* variable which appears on the left-hand side of the preceding expression is just a version of the original Y variable that has been "purged" of the effects of X2, by subtracting it out. That is, Y*_i = Y_i − (−2)X_{i2} = Y_i + 2X_{i2}. The part of Y that remains after this "purging" operation is linearly related to X1. We can see this clearly in Figure 2.13, which graphs the "purged" version of Y against X1. The first panel of the figure shows the simple scatterplot. This shows the expected linear array of points; however, there only appear to be ten points despite the fact that the dataset contains 100 observations. The problem is that each plotted point actually represents ten distinct observations; they are simply located directly on top of each other. In order to correct for this overplotting problem, the second panel of Figure 2.13 shows a jittered version of the same plot. Here, the plotting locations have been broken up, so we can see more easily the concentration of observations at each of the ten distinct pairs of variable values that appear in the graph. The slope of the linear array in Figure 2.13 is three, which is also the value of b1 in the linear combination. The intercept is equal to zero, which can be verified by setting Xi1 equal to zero on the right-hand side of the equation that created the purged version of Y, above. Again, we could observe the linear relationship between X2 and Y by using the same strategy to remove the effects of X1 from Y.

Thus, there are two different strategies for isolating the linear effect of each constituent variable within a linear combination. The first strategy involves physical control, by holding the values of the other variable constant. The second strategy involves mathematically removing the other variable's effects from the linear combination, before observing the relationship. Both of these strategies are employed in "real-life" data analysis.
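Both strategies are easy to mimic numerically. The sketch below is our own illustration (assuming Python with NumPy; it rebuilds the 100-observation example dataset described above) of conditioning on X2 and of purging X2's effect from Y. In each case the recovered slope for X1 is 3, and the intercepts behave exactly as described.

```python
import numpy as np

# The example dataset from the text: all 100 combinations of integer values 1-10.
x1, x2 = np.meshgrid(np.arange(1, 11), np.arange(1, 11))
x1, x2 = x1.ravel(), x2.ravel()
y = 3 * x1 + (-2) * x2                     # equation (15)

# Strategy 1: condition on X2. Within a subgroup with X2 fixed, Y is an exact
# linear function of X1 with slope 3 and intercept -2 times the fixed X2 value.
for value in (4, 7):
    keep = x2 == value
    slope, intercept = np.polyfit(x1[keep], y[keep], deg=1)
    print(value, round(slope, 6), round(intercept, 6))   # 3 and -8, then 3 and -14

# Strategy 2: purge X2's effect from Y, then relate the purged variable to X1.
y_star = y - (-2) * x2                     # Y* = Y - (-2)X2
slope, intercept = np.polyfit(x1, y_star, deg=1)
print(round(slope, 6), round(intercept, 6))              # 3 and 0
```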
And, the second strategy is conceptually similar to what we will eventually do in a multiple regression equation, which attempts to isolate each independent variable's effect on a dependent variable by mathematically removing the effects of all other independent variables from the measured representation of that effect. For now, however, we want you to realize that a linear combination really does comprise a sum of linear effects, even though the bivariate graphs of the linear combination against the constituent variables do not show simple linear patterns.

The Mean and Variance of a Linear Combination

As we will soon see, there are many situations where we will want to work with summary statistics for the distribution of a variable that is a linear combination of two or more other variables. And, as was the case with linear transformations, it will prove convenient to express those summary statistics in terms of the variables that comprise the linear combination. Here, doing so will not only save us some computational work. Establishing the relationships between the summary statistics for a linear combination and the summary statistics for its constituent variables will eventually be very useful for deriving the statistical properties of the regression model. For the derivations in this section, we will use the simplest form of a linear combination— that is, a weighted sum of two variables.
At the same time, we will return to the earlier notation, in which the linear combination is shown as: Yi = bXi + cZi Thus, we will be using distinct letters to represent each coefficient and variable on the right-hand side, rather than subscripted b’s and X’s. We do this strictly to simplify the current presentation and avoid the need to keep track of multiple subscripts throughout the derivations. But, it remains important for you to understand how the derivations would work the same way with more than two variables (that is, k > 2) and the more general notation (i.e., bj ’s and Xij ’s) on the right-hand side. 33 Let us begin with the mean of a linear combination. The derivation is shown in Box 2.3 and it follows the same general strategy that we used earlier, for linear transformations. By now, you should be comfortable enough with this process that we do not need to explain each step in detail. The important point to take away from Box 2.3 is the final result: Stated verbally, the mean of a linear combination is equal to the linear combination (or weighted sum) of the means of the constituent variables. That was simple enough, right? Now, let us turn to the variance of a linear combination, which is worked out in Box 2.4. While this derivation is certainly not difficult, it is a bit longer than the previous one and it requires us to manipulate a few more terms. So, we’ll follow the steps carefully. As always, we begin with the information that is given— the definition of the linear combination, in this case— and manipulate the terms (always making sure to perform the same operations on both sides of equation) until we produce the desired result— in this case, the formula for the sample variance— on the left-hand side. After doing so, we look to see whether we can make sense of, or gain new insights from, the information left on the right-hand side. We know that the sample variance is defined as the sum of squares divided by the degrees of freedom. In order to get to that result, we begin in Box 2.4 by subtracting the mean of the linear combination from both sides of the linear combination’s definition (which is shown, by itself, in the first line). On the left-hand side of line two, we simply subtract Ȳ ; on the right-hand side, we insert the result we just obtained for the mean of the linear combination. In the third line, we remove the parentheses on the right-hand side. Then, in line four, we rearrange terms and enclose the differences that involve the same coefficients and variables within sets of parentheses. The fifth line factors out the coefficients from the two differences on the right-hand side. Note that the left-hand side of line five shows the deviation score for Yi , or the signed difference of observation i from the mean of Y . Of course, the right-hand side is a weighted sum of the difference scores for X and Z. The sixth line in Box 2.4 squares the equation, with the product of the two sums written out explicitly on the right-hand side. In the seventh line, we carry out the multiplication on 34 the right-hand side, making sure to multiply each term in the first set of brackets by each term within the second set of brackets. The result is a rather messy-looking sum of four products. So, the eighth line simplifies things a bit, showing products of identical terms as squares, and summing together the second and fourth terms from the previous line (which were identical to each other, albeit arranged a bit differently). 
In line nine, we take the sum, across n observations, on each side. Line 10 distributes the summation sign across the three terms on the right-hand side, and line 11 factors the constants out of each sum. To recast this latter result in a bit of statistical language, the left-hand side of line 11 shows the sum of squares for Y. The right-hand side shows a weighted sum, where the weights are the square of the first coefficient from the linear combination (i.e., b²), the square of the second coefficient from the linear combination (i.e., c²), and two times the product of the two coefficients (i.e., 2bc). The first term on the right-hand side also includes the sum of squares for the variable X, and the second term includes the sum of squares for Z. The sum in the third term on the right-hand side involves the deviation scores for both X and Z. Therefore, we will call it the "sum of crossproducts" for that pair of variables. We will eventually see that sums of squares and sums of crossproducts play a very prominent role in several aspects of the regression model. But, for now, they are just a step along the way to our more immediate objective, the sample variance of Y. In the twelfth line of Box 2.4, we divide both sides by n − 1, the degrees of freedom. Line 13 then separates the right-hand side into three fractions, and factors out the products involving coefficients from each of those fractions. Take a close look at the information in this line. On the left-hand side, the fraction corresponds to the formula for the sample variance of the variable, Y; and, of course, that is exactly what we were trying to obtain in this derivation. On the right-hand side of line 13, the first term is b² times a formula that corresponds to the sample variance of X. And, the second term is c² times the sample variance of Z. So, these first two terms on the right-hand side of line 13 can be interpreted in a straightforward manner, as some familiar statistical quantities. The third term on the right-hand side of line 13 looks a little more complicated in that it involves both of the constituent variables in the linear combination. As we have already mentioned, the first part of this term is two times the product of the coefficients (i.e., 2bc). The second part of the term is the sum of the crossproducts divided by the degrees of freedom. This latter fraction looks very similar to a variance; the only difference is the fact that the numerator involves the sum of the products of two different variables' deviation scores, rather than the sum of a single variable's squared deviation scores. Because of its similarity to the variance, and the fact that it involves two variables, this fraction is called the covariance of X and Z. This is a very important bivariate summary statistic in itself, and we will return to it for a closer look in the next section of this chapter. For now, the fourteenth line of Box 2.4 replaces the fractions with symbols representing the sample variances and the covariance. Note that the latter is shown as SXZ. While this is relatively standard statistical notation for the covariance, we believe it is potentially confusing; it is easy to mistake SXZ for a sample standard deviation, which is also usually shown as S, but with a single subscript.
In order to avoid any such problems, we prefer to use the covariance operator, as shown in line 15: When a formula or equation includes "cov" followed by a set of parentheses with two variable names enclosed within them, that term within the equation is referring to the covariance between those two variables. At last, we are ready to provide a verbal statement of the variance of a linear combination composed of two variables: The variance of a linear combination is equal to the squares of the coefficients times the variances of the constituent variables, plus two times the product of the coefficients times the covariance between the variables. That definition is quite a mouthful! But, it shows that the dispersion in the linear combination's distribution is related to the amount of dispersion in the constituent variables' distributions, plus an "extra" contribution from the covariance which actually involves both of the original variables in a somewhat different way.

Up to this point, we have not paid any attention to the standard deviation of a linear combination. The latter is simply defined as the square root of the linear combination's variance. Both of these statistics measure the dispersion of the distribution for the linear combination. But, there is an important "practical" difference between them: The variance of the linear combination can be broken down into a weighted sum of variances and covariance; therefore, we can get a sense of each constituent variable's contribution to the overall variance. The standard deviation of the linear combination cannot be broken down in a similar manner. For that reason, the latter will play a far less prominent role in the development and evaluation of the regression model, even though the standard deviation is (arguably) somewhat easier to interpret than the variance.

We can generalize the results we have obtained so far very easily. Assume that we have a linear combination, Y, defined as a weighted sum of an arbitrary number of variables:

$$Y_i = b_1 X_{i1} + b_2 X_{i2} + \ldots + b_j X_{ij} + \ldots + b_k X_{ik} = \sum_{j=1}^{k} b_j X_{ij}$$

The mean of the linear combination would be defined as follows:

$$\bar{Y} = \sum_{j=1}^{k} b_j \bar{X}_j$$

In other words, the mean of the linear combination will always be a weighted sum of the original variables' means, regardless how many variables there are in the linear combination. Generalizing the variance of a linear combination is a bit trickier. The right-hand side will always include a sum of k terms involving the variances of the constituent variables. But, the variance also includes additional terms for the covariances between every possible pair of constituent variables, each weighted by twice the product of the corresponding pair of coefficients. If there are k variables, there will be k(k − 1)/2 distinct covariances. This can add up to a lot of terms very quickly! Regardless, the general form for the variance of the linear combination is:

$$S_Y^2 = b_1^2 S_{X_1}^2 + \ldots + b_k^2 S_{X_k}^2 + 2b_1 b_2\,\mathrm{cov}(X_1, X_2) + \ldots + 2b_{k-1} b_k\,\mathrm{cov}(X_{k-1}, X_k)$$

$$S_Y^2 = \sum_{j=1}^{k} b_j^2 S_{X_j}^2 + 2\sum_{j=1}^{k-1} \sum_{l=j+1}^{k} b_j b_l\,\mathrm{cov}(X_j, X_l)$$

To show you some specific examples, let's take a linear combination of three variables:

$$Y_i = b_1 X_{i1} + b_2 X_{i2} + b_3 X_{i3}$$

The mean of this linear combination would be as follows:

$$\bar{Y} = b_1 \bar{X}_1 + b_2 \bar{X}_2 + b_3 \bar{X}_3$$

And, the variance of this linear combination would be:

$$S_Y^2 = b_1^2 S_{X_1}^2 + b_2^2 S_{X_2}^2 + b_3^2 S_{X_3}^2 + 2b_1 b_2\,\mathrm{cov}(X_1, X_2) + 2b_1 b_3\,\mathrm{cov}(X_1, X_3) + 2b_2 b_3\,\mathrm{cov}(X_2, X_3)$$

At this point, you might be asking yourself why you would want to know (or care) about the summary statistics for linear combinations. We think that is a very reasonable question! And, we want to reassure you that we are not going through these derivations and examples simply to make you do some math. Instead, we will see very shortly that linear combinations are one of the most important tools for developing the various features of the regression model. In fact, we'll be encountering them repeatedly throughout the rest of the book.
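As a quick check on these results, here is a small numerical sketch. It is our own, not from the text; it assumes Python with NumPy and uses simulated data, and it verifies the two-variable versions of the mean and variance rules (Boxes 2.3 and 2.4) directly.

```python
import numpy as np

# Simulated data for two constituent variables, X and Z, and arbitrary weights b and c.
rng = np.random.default_rng(1)
x = rng.normal(10, 3, size=500)
z = rng.normal(20, 5, size=500)
b, c = 0.4, 0.6
y = b * x + c * z                       # the linear combination

# Mean of the linear combination (Box 2.3): Y-bar = b*X-bar + c*Z-bar.
print(np.mean(y), b * np.mean(x) + c * np.mean(z))

# Variance of the linear combination (Box 2.4, line 15):
# S^2_Y = b^2 S^2_X + c^2 S^2_Z + 2bc cov(X, Z).
cov_xz = np.cov(x, z, ddof=1)[0, 1]     # sample covariance of X and Z
lhs = np.var(y, ddof=1)
rhs = b**2 * np.var(x, ddof=1) + c**2 * np.var(z, ddof=1) + 2 * b * c * cov_xz
print(lhs, rhs, np.allclose(lhs, rhs))
```

Because the X and Z values are correlated only by chance here, the covariance term is small but not zero; the identity holds regardless of its size or sign.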
COVARIANCE, LINEAR STRUCTURE, AND CORRELATION

In this section, we will take a closer look at that term which showed up in the derivation of the variance of a linear combination, the covariance between two variables. To reiterate what we already know, the covariance between two variables (say, X and Y) is defined as the sum of crossproducts for those variables, divided by n − 1. Or:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

Now, we have already mentioned that the covariance is a summary statistic. But, unlike the mean or the variance, which represent features of a univariate distribution, the covariance measures an important characteristic of a bivariate distribution. The bivariate distribution, itself, can be interpreted as an enumeration of all possible combinations of values on the two variables, along with the number of observations that fall at each of those possible combinations. A scatterplot can be interpreted as a graphical display which illustrates the bivariate distribution for two particular variables.

The Covariance as a Summary Statistic

The covariance summarizes, in a single numeric value, the degree to which the two variables are linearly related to each other. Stated a bit differently, the covariance measures the degree of linear structure in the bivariate distribution for the two variables. Given a dataset in which two variables have fixed variances (or standard deviations), the absolute value of the covariance will increase as the points plotted in a scatterplot of the two variables come to approximate a perfectly linear pattern. Furthermore, if the line that is approximated by the plotted points has a positive slope (i.e., it runs from lower left to upper right in the scatterplot), the covariance will be a positive value. If the linear pattern is negative (i.e., runs from upper left to lower right), the covariance will have a negative value. From the opposite perspective, the absolute value of the covariance will be close to zero when there is very little linear structure in the bivariate data. In such cases, it would be very difficult to discern any kind of linear trend or pattern within the points plotted in a scatterplot. The covariance has a value of exactly zero when there is no linear structure in the data at all. This can occur for several different reasons. But, one important situation in which the covariance is zero is when X and Y are statistically independent of each other. In that case, knowing an observation's value on one variable provides no information whatsoever about that same observation's value on the other variable. Speaking informally, we might say that the two variables are completely unrelated to each other.

Table 2.XX and Figure 2.14 provide some examples that should illustrate these properties of the covariance. The table shows a moderately-sized dataset with 50 observations and six variables. The first variable, labeled X, has a mean of 4.95 and a standard deviation of 2.95. The second through sixth variables are labeled Y1 through Y5; each of these variables has a mean of 10.00 and a standard deviation of 5.00.
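The covariance formula translates directly into code. The sketch below is our own illustration (assuming Python with NumPy, and using made-up data rather than the values in Table 2.XX); it computes the sample covariance exactly as defined above and illustrates the two properties just described: variables generated independently yield a covariance near zero, while a positive linear pattern yields a clearly positive value.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance: the sum of crossproducts divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

rng = np.random.default_rng(3)
x = rng.normal(5, 3, size=1000)

y_indep = rng.normal(10, 5, size=1000)                    # generated independently of x
y_linear = 10 + 1.2 * x + rng.normal(0, 2, size=1000)     # positive linear pattern plus noise

print(sample_cov(x, y_indep))    # close to zero
print(sample_cov(x, y_linear))   # clearly positive (near 1.2 times the variance of x)
```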
The five panels in Figure 2.14 show the scatterplots of the X variable against each of the Y variables from the dataset. Even a cursory glance at the five scatterplots reveals the differences in linear structure across them. In panel A, the data points form a “shapeless” cloud within the plotting region. In panel B, the two variables show a very faint structure, in which observations with smaller X values also tend to have relatively small Y values, and vice versa. The data in panel C conform to a moderately clear linear pattern, oriented in the positive direction. Similarly, the data in panel D also show a moderately strong linear pattern, but it is oriented in the negative direction. Finally, panel E shows a perfectly linear array of data points. The various degrees of linear structure (or, one could say “the strength of the linear relationship”) are revealed graphically in the respective scatterplots. But, they are also reflected in the values of the covariance calculated for each pair of variables. These covariances range from 0.82 for X and Y1 up to 14.74 for X and Y5 . You should be sure that you can see how larger absolute values correspond to clearer linear structures. And, note how panels C and D show the same degree of linearity, but in opposite directions. Why does the covariance work this way? You can gain an intuitive sense of the answer by looking at the numerator in the formula for the covariance, which sums the crossproducts for each observation. Each of the crossproducts multiplies a deviation score for X by a deviation score for Y . Each of these deviation scores effectively measures the observation’s distance above or below the mean on the appropriate variable. The (absolute value of the) 40 sum of these products will be maximized when an observation’s distance above (or below) the mean on one variable is proportional to its distance above (or below) the mean on the other variable. Stated differently, the deviation scores on one variable comprise a linear function of the deviation scores on the other variable. Thus, the covariance measures the degree to which two variables approximate a linear relationship. And, we already know that linearity is a particularly important type of relationship between two variables, because it is the simplest form of bivariate structure. The covariance is a summary statistic which measures precisely this kind of structure, so it is a very important research tool on its own. Some Problems with the Covariance Nevertheless, there are some practical issues that arise when we try to use the covariance to measure linearity in bivariate relationships. Most of these stem from the fact that it is very difficult to interpret specific values of the covariance. For one thing, there is no clear standard of comparison. Suppose you had not seen the scatterplots in Figure 2.14. If someone told you that the covariance between X and Y3 is 12.69, you have no way to evaluate what this means, beyond the fact that any linear structure across these two variables would have a positive slope. Similarly, there is nothing in the specific numeric value, 14.74, to indicate that X and Y5 are perfectly linearly related to each other. Another problem is that the value of the covariance is affected by the specific measurement units employed for each of the two variables. Let’s transform variable Y3 by dividing each observation’s value in half; we’ll call this new variable Y ∗ . Similarly, we can create a new variable called Y ∗∗ by doubling each observation’s value on Y5 . 
Figure 2.15 shows the scatterplots between X and each of these new variables. You can see that panel A in this new figure is identical to panel C from Figure 2.14. Similarly, panel B is identical to panel E from Figure 2.14. The only things that differ across the two figures are the specific numeric scales shown on the vertical axes. Certainly, the amount of linear structure is identical across the comparable scatterplots from the two figures. Nevertheless, the covariances differ: The covariance between X and Y* is 6.345, whereas the covariance between X and Y3 was 12.69. And, the covariance between X and Y** is 29.48, while the covariance between X and Y5 was 14.74. The problem is that the value of the covariance is affected by the measurement scales used for the two variables, as well as the degree of linear structure between them. In scientific research situations, we are often much more interested in the latter than in the former. In fact, measurement schemes will typically be regarded as given information prior to any attempts to examine relationships between variables. Therefore, it is probably useful to "correct" the covariance, in order to remove its dependence on measurement scales. How might we accomplish this?

Transforming the Covariance into the Correlation Coefficient

One way to proceed would be to transform the two variables whose covariance is being measured so that they always have the same measurement units. And, you probably know a strategy for doing this already, from your background work in introductory statistics: We can just standardize, or perform a Z-transformation on, each of the variables. In other words, start with variables X and Y, which we will assume have their own particular units of measurement. Create new variables, ZX and ZY, by transforming X and Y as follows, for each observation i, where i ranges from 1 to n:

$$Z_{Xi} = \frac{X_i - \bar{X}}{S_X} \qquad \text{and} \qquad Z_{Yi} = \frac{Y_i - \bar{Y}}{S_Y}$$

Then, apply the formula we have already seen to obtain the covariance between ZX and ZY:

$$\mathrm{cov}(Z_X, Z_Y) = \frac{\sum_{i=1}^{n}(Z_{Xi} - \bar{Z}_X)(Z_{Yi} - \bar{Z}_Y)}{n-1} = \frac{\sum_{i=1}^{n}(Z_{Xi} - 0)(Z_{Yi} - 0)}{n-1} = \frac{\sum_{i=1}^{n} Z_{Xi} Z_{Yi}}{n-1}$$

We can also achieve exactly the same result in a slightly different way (that is, perhaps, a bit easier to calculate). First, calculate the covariance in the usual manner (i.e., as the sum of crossproducts, divided by n − 1). Second, divide the covariance by the product of the standard deviations of the two variables. This can be shown as follows:

$$\mathrm{cov}(Z_X, Z_Y) = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$$

Regardless how you calculate it, this covariance of standardized variables is actually a transformed— or standardized— version of the covariance, itself. And, the transformation effectively removes the effect of the original variables' measurement units on this new version of the covariance. As you know, Z-scores are always measured in the same units (i.e., standard deviations) regardless of the measurement units used for the variables prior to standardization. So, the question of the measurement units' impact on the value of the covariance is just irrelevant when standardized variables are used. And, therefore, if two such standardized covariances differ, we can be sure that the difference is due to variation in the degree of linear structure that exists across each pair of variables, and not due to differences in the measurement units used in the variables. In fact, the standardized covariance is so important that it has a special name of its own. It is called the correlation coefficient.
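The following sketch is our own illustration (assuming Python with NumPy and invented data, not the variables from Figure 2.15) of the two claims just made: rescaling a variable changes the covariance but not the standardized covariance, and the standardized covariance can be computed either from Z-scores or as cov(X, Y) divided by the product of the standard deviations.

```python
import numpy as np

# Made-up data with a clear positive linear pattern plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + rng.normal(0, 3, size=200)
y_half = y / 2.0                      # same structure, different measurement units

def cov(a, b):
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

def corr(a, b):
    # The standardized covariance: cov(a, b) divided by the product of the sds.
    return cov(a, b) / (np.std(a, ddof=1) * np.std(b, ddof=1))

# Rescaling Y changes the covariance but leaves the standardized covariance untouched.
print(cov(x, y), cov(x, y_half))      # the second value is half the first
print(corr(x, y), corr(x, y_half))    # identical

# The same number obtained as the covariance of the Z-scored variables.
zx = (x - x.mean()) / np.std(x, ddof=1)
zy = (y - y.mean()) / np.std(y, ddof=1)
print(cov(zx, zy))                    # equals corr(x, y)
```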
The correlation coefficient is a bivariate summary statistic, and it is generally represented by lower-case r. Usually (at least, in this book), the r is given double subscripts which indicate the variables being correlated. Thus, rXY refers to the correlation between variables X and Y. Again, the correlation can be obtained in several ways, including the covariance of standardized versions of the two variables, or the covariance between the original (i.e., non-standardized) variables divided by the product of their standard deviations. And, there are various situations where both of the preceding manifestations of the correlation coefficient are useful (in the sense that they will help us see something interesting in a formula or a derivation). But, many statistics texts use the following formula to represent the correlation between X and Y:

$$r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$

Now, the preceding formula may look a bit intimidating. But, it can be obtained pretty easily by performing some straightforward operations on either of our earlier formulas for the standardized covariance. And, the important thing to remember is that the correlation coefficient is simply a standardized version of the covariance.

Interpreting the Value of the Correlation Coefficient

Again, we use the correlation coefficient rather than the "raw" covariance because it has some interpretational advantages. We have already noted that it is not affected by differences in measurement units. Even more important, the correlation coefficient has a maximum absolute value of 1.00. In other words, rXY will always fall within the interval from -1.00 to 1.00, inclusive. And, the correlation coefficient will only take on its most extreme values, +1.00 or -1.00, when Y is a perfect linear function of X, and vice versa. If |rXY| < 1.00, then Y is not a perfect linear function of X (and vice versa). But, the larger the value of |rXY|, or the closer it comes to 1.00, the greater the extent to which the bivariate data conform to a linear pattern. Furthermore, the correlation coefficient can be either a positive number or a negative number; the sign of rXY indicates whether the predominant linear pattern in the data (whatever its clarity might be) has a positive or negative slope. Speaking somewhat informally, many people would say that the correlation coefficient measures the strength of the linear relationship between two variables. If rXY is close to zero, there is no clear linear pattern in the bivariate data. The closer rXY comes to +1.00, the greater the extent to which the bivariate data conform to a linear relationship with a positive slope. And, the closer rXY comes to -1.00, the greater the extent to which the bivariate data conform to a linear relationship with a negative slope.

To appreciate the utility of the correlation coefficient, we can go back to the data that were provided in Table 2.XX and calculate the correlation coefficients for each pair of variables. Now, we already know the standard deviations of all the variables (as shown in Table 2.XX, SX = 2.95 and SYj = 5.00 for j = 1, 2, . . . , 5). And, we have already calculated the covariances for X and each of the five Yj's (they are shown at the bottom of the columns for the respective Yj's in Table 2.XX).
So, the easiest way to obtain the correlation coefficient for each pair of variables is to divide the covariance by the product of the two standard deviations:

$$r_{X,Y_1} = \frac{0.817}{(2.950)(5.000)} = 0.055$$

$$r_{X,Y_2} = \frac{8.910}{(2.950)(5.000)} = 0.604$$

$$r_{X,Y_3} = \frac{12.690}{(2.950)(5.000)} = 0.860$$

$$r_{X,Y_4} = \frac{-12.690}{(2.950)(5.000)} = -0.860$$

$$r_{X,Y_5} = \frac{14.739}{(2.950)(5.000)} = 1.000$$

With these values for the rX,Yj's, let's go back and re-examine the scatterplots in Figure 2.14. Panel A in the figure shows a shapeless, roughly rectangular array of points— in other words, there is no apparent linear structure. And, this is reflected in the value of rX,Y1, which is very close to zero (0.055). In panel B, there is a rather vague linear pattern, and this corresponds to a larger correlation between X and Y2, at 0.604. Panel C shows a very clear linear pattern with a positive slope, so the correlation coefficient for X and Y3 is a large (i.e., close to one) positive value, 0.860. In panel D, Y4 is strongly, but negatively, related to X, with a correlation of -0.860. Finally, panel E shows that X and Y5 exhibit a "perfect" linear pattern, with a positive slope; hence, rX,Y5 takes the largest possible value for the correlation coefficient, at 1.000.

Thus, the correlation coefficient measures the degree of linear relationship in bivariate data. As such, it "summarizes" some of the visual information that we can obtain from a scatterplot between two variables. Again, the value of the correlation coefficient is unaffected by the measurement units of the two variables. We also know that the absolute value of the correlation coefficient will always range from zero to one. Therefore, we know what its value will be when there is no linear relationship between X and Y (zero) and when there is a perfect, or deterministic, linear relationship between the two variables (either 1.00 or -1.00, depending upon the slope of the linear function connecting X and Y). But, neither of these extremes shows up very frequently (if at all) with "real" data. So, how do we interpret intermediate values of rXY? In fact, this is a bit of a problem with the correlation coefficient because, beyond the general statement that larger absolute values indicate stronger linear relationships, there is no clear interpretation of specific values. We will offer the following rules of thumb (but, please read the caveat afterward!):

If 0 < |rXY| ≤ 0.30, then X and Y are weakly related.
If 0.30 < |rXY| ≤ 0.70, then X and Y are moderately related.
If 0.70 < |rXY| < 1.00, then X and Y are strongly related.

Please note that the preceding rules of thumb are totally subjective; another analyst could easily disagree with our definitions of weak, moderate, and strong relationships (in fact, the two of us don't entirely agree on the various cutoffs). Still, these ranges should give you some general ideas about how you can interpret the value of the correlation coefficient in any specific application. And, the descriptive terms that we have employed here do show up in the research literature quite frequently. Just don't take these guidelines as hard and fast rules! In fact, we will eventually provide a more precise interpretation of the strength of a linear relationship, based upon the correlation coefficient (or, more specifically, the square of the correlation coefficient), in Chapter 3.

Some Perils of the Correlation Coefficient

We often hear people say "the correlation coefficient measures the relationship between two variables." Strictly speaking, this statement is not correct.
Can you tell what is wrong with it? In order to see the problem, consider the data in Table 2.ZZZ. Here, we have twenty-five observations on three variables: X, Y1, and Y2. The covariance between X and Y1 is exactly zero; this means that the correlation between these two variables is also zero (why?). If we accept the preceding statement about the correlation coefficient, we would conclude that X and Y1 are not related to each other at all. The covariance between X and Y2 is 117.57, and the correlation between these two variables is 0.67. Applying our previous rules of thumb, and using the preceding interpretation, we would conclude that X and Y2 exhibit a moderate relationship.

Both of the preceding interpretations are incorrect. In order to see why that is the case, we only need to display the data graphically. The two panels of Figure 2.16 show the scatterplots between X and Y1, and between X and Y2, respectively. As you can see immediately, both pairs of variables show clear patterns. In fact, the relationships are deterministic. This means that, for any specified value of X (say, Xi), we can predict the corresponding value of Yi without any error whatsoever. Stated differently, X is perfectly related to each of the two Y variables. If the relationships between X and Y1 and between X and Y2 are both perfectly deterministic, then why are the respective values of the correlation coefficient less than one? The answer is simple: The correlation coefficient only measures linear structure in bivariate data. These pairs of variables are related to each other in nonlinear ways. The function relating the values of X to the values of Y1 is as follows:

$$Y_{i1} = 3 + 0.5(13 - X_i)^2$$

The function relating the values of X to the values of Y2 is:

$$Y_{i2} = 10(1 - 1/X_i)$$

Neither of the preceding two equations depicts a linear function. That is, we cannot manipulate the terms on the right-hand side of either one to produce an equation of the form

$$Y_{ij} = a + bX_i$$

Thus, the type of structure that exists in each of these two cases is simply not "captured" by the correlation coefficient. Hence, its value is less than a "perfect" 1.00 in each case. We want to emphasize that this is not a problem with the correlation coefficient, itself. Rather, it is a problem in the interpretation. In both of these cases, the correlation coefficient is being used as a tool in a situation for which it is not intended. So, it is important to remember that the correlation coefficient only summarizes the degree of linear structure within bivariate data. If any other type of structure happens to exist between an X variable and a Y variable, then rXY will miss it. So, "loose" interpretations of the correlation coefficient can be very misleading. Are nonlinear structures in data any less important than linear structures? Absolutely not! In your own research, you may very well encounter variables that are related to each other in nonlinear ways. As we will see in later chapters, it is usually very easy to adapt the basic ideas that we have introduced here to fit these slightly more complicated situations.
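To see numbers like these emerge from the formulas, here is a short sketch of our own, assuming Python with NumPy. The text does not list the X values in Table 2.ZZZ, so we assume, purely for illustration, that X takes the integer values 1 through 25; with that assumption the parabola's correlation with X is essentially zero, and the second function's correlation is roughly 0.67, in line with the values reported above.

```python
import numpy as np

# Assumed X values: twenty-five observations, integers 1 through 25 (our guess,
# not taken from Table 2.ZZZ itself).
x = np.arange(1, 26, dtype=float)
y1 = 3 + 0.5 * (13 - x) ** 2          # the parabola from the text
y2 = 10 * (1 - 1 / x)                 # the second nonlinear function

# Both relationships are deterministic, yet neither correlation is +/- 1.
r1 = np.corrcoef(x, y1)[0, 1]
r2 = np.corrcoef(x, y2)[0, 1]
print(round(r1, 3))   # essentially 0: the parabola is symmetric around the mean of X
print(round(r2, 3))   # roughly 0.67 with these assumed X values
```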
CONCLUSIONS

In this chapter, we have introduced some ideas that will appear repeatedly throughout the rest of this text. Some of these are tools that will facilitate various tasks for us. Others are concepts that will play key roles in developing and interpreting regression models. Regardless, this is information with which you should feel comfortable before proceeding any further. You will definitely encounter it all again!

Our discussion has actually come full circle. We began by discussing scatterplots as convenient graphical tools for looking at bivariate data. Our main objective is to discern relationships between variables. And, back in Chapter 1, we defined "relationships" in terms of Y's functional dependence on X. So, we spent some time in this chapter thinking about functions more generally. Then, we focused on linearity as a particular type of function. Given the attention that we paid to it in this chapter, we should probably issue a bit of a disclaimer regarding linearity: We want to emphasize that linear relationships are definitely not better than nonlinear relationships, and they are probably not even more common in the empirical world. For most purposes, the only reason we attribute any special status to linearity is that linear relationships are simpler than nonlinear relationships. We will capitalize on this simplicity in much of the work to follow.

Moving on, we discussed the process of creating a new variable that is linearly related to one or more previously-existing variables. This led us to the definitions and properties of linear transformations and linear combinations. Those, in turn, led us to yet another set of useful tools, the covariance and the correlation coefficient. With the latter two descriptive statistics, we complete the circle in this chapter, in the sense that both the covariance and the correlation provide a numerical summary of a particular type of pattern that could appear in a scatterplot. We also emphasized that these bivariate statistics only measure one kind of structure, even though other kinds of structure can certainly exist in the data that we analyze. For that reason, it is critically important that we use all of the tools available to us in complementary ways. The correlation coefficient (or the covariance) captures the degree of linear structure across two variables in a single numeric value. But, visual inspection of the scatterplot is necessary in order to confirm that linearity is an appropriate description of the bivariate structure in the first place. We believe that good data analysis always proceeds by moving back and forth between graphical and numeric representations of the data under investigation. You will see many examples of this throughout the rest of the book. And, we hope to convince you that there are very important advantages to relying upon such complementary "dual views" of the data in your own analyses.

Box 2-1: Deriving the mean of a linear transformation. All sums are taken over n observations, so the limits of summation are omitted for clarity.

$$Y_i = a + bX_i \qquad (1)$$
$$\textstyle\sum Y_i = \sum (a + bX_i) \qquad (2)$$
$$\textstyle\sum Y_i = \sum a + \sum bX_i \qquad (3)$$
$$\textstyle\sum Y_i = na + b\sum X_i \qquad (4)$$
$$\frac{\sum Y_i}{n} = \frac{na + b\sum X_i}{n} \qquad (5)$$
$$\frac{\sum Y_i}{n} = \frac{na}{n} + \frac{b\sum X_i}{n} \qquad (6)$$
$$\frac{\sum Y_i}{n} = a + b\,\frac{\sum X_i}{n} \qquad (7)$$
$$\bar{Y} = a + b\bar{X} \qquad (8)$$

Box 2-2: Deriving the variance and standard deviation of a linear transformation. All sums are taken over n observations, so the limits of summation are omitted for clarity.

$$Y_i = a + bX_i \qquad (1)$$
$$Y_i - \bar{Y} = (a + bX_i) - (a + b\bar{X}) \qquad (2)$$
$$Y_i - \bar{Y} = (a - a) + (bX_i - b\bar{X}) \qquad (3)$$
$$Y_i - \bar{Y} = 0 + b(X_i - \bar{X}) \qquad (4)$$
$$(Y_i - \bar{Y})^2 = [b(X_i - \bar{X})]^2 \qquad (5)$$
$$(Y_i - \bar{Y})^2 = b^2 (X_i - \bar{X})^2 \qquad (6)$$
$$\textstyle\sum (Y_i - \bar{Y})^2 = \sum b^2 (X_i - \bar{X})^2 \qquad (7)$$
$$\textstyle\sum (Y_i - \bar{Y})^2 = b^2 \sum (X_i - \bar{X})^2 \qquad (8)$$
$$\frac{\sum (Y_i - \bar{Y})^2}{n-1} = \frac{b^2 \sum (X_i - \bar{X})^2}{n-1} \qquad (9)$$
$$\frac{\sum (Y_i - \bar{Y})^2}{n-1} = b^2\,\frac{\sum (X_i - \bar{X})^2}{n-1} \qquad (10)$$
$$S_Y^2 = b^2 S_X^2 \qquad (11)$$
$$\sqrt{S_Y^2} = \sqrt{b^2 S_X^2} \qquad (12)$$
$$S_Y = |b|\,S_X \qquad (13)$$

Box 2-3: Deriving the mean of a linear combination. All sums are taken over n observations, so the limits of summation are omitted for clarity.
$$Y_i = a + bX_i + cZ_i \qquad (1)$$
$$\textstyle\sum Y_i = \sum (a + bX_i + cZ_i) \qquad (2)$$
$$\textstyle\sum Y_i = \sum a + \sum bX_i + \sum cZ_i \qquad (3)$$
$$\textstyle\sum Y_i = \sum a + b\sum X_i + c\sum Z_i \qquad (4)$$
$$\textstyle\sum Y_i = na + b\sum X_i + c\sum Z_i \qquad (5)$$
$$\frac{1}{n}\textstyle\sum Y_i = \frac{1}{n}\left(na + b\sum X_i + c\sum Z_i\right) \qquad (6)$$
$$\frac{1}{n}\textstyle\sum Y_i = \frac{1}{n}na + \frac{1}{n}b\sum X_i + \frac{1}{n}c\sum Z_i \qquad (7)$$
$$\frac{\sum Y_i}{n} = \frac{na}{n} + \frac{b\sum X_i}{n} + \frac{c\sum Z_i}{n} \qquad (8)$$
$$\frac{\sum Y_i}{n} = a + b\,\frac{\sum X_i}{n} + c\,\frac{\sum Z_i}{n} \qquad (9)$$
$$\bar{Y} = a + b\bar{X} + c\bar{Z} \qquad (10)$$

Box 2-4: Deriving the variance of a linear combination. All sums are taken over n observations, so the limits of summation are omitted for clarity.

$$Y_i = bX_i + cZ_i \qquad (1)$$
$$Y_i - \bar{Y} = (bX_i + cZ_i) - (b\bar{X} + c\bar{Z}) \qquad (2)$$
$$Y_i - \bar{Y} = bX_i + cZ_i - b\bar{X} - c\bar{Z} \qquad (3)$$
$$Y_i - \bar{Y} = (bX_i - b\bar{X}) + (cZ_i - c\bar{Z}) \qquad (4)$$
$$Y_i - \bar{Y} = b(X_i - \bar{X}) + c(Z_i - \bar{Z}) \qquad (5)$$
$$(Y_i - \bar{Y})^2 = [b(X_i - \bar{X}) + c(Z_i - \bar{Z})][b(X_i - \bar{X}) + c(Z_i - \bar{Z})] \qquad (6)$$
$$(Y_i - \bar{Y})^2 = b(X_i - \bar{X})b(X_i - \bar{X}) + b(X_i - \bar{X})c(Z_i - \bar{Z}) + c(Z_i - \bar{Z})b(X_i - \bar{X}) + c(Z_i - \bar{Z})c(Z_i - \bar{Z}) \qquad (7)$$
$$(Y_i - \bar{Y})^2 = b^2(X_i - \bar{X})^2 + c^2(Z_i - \bar{Z})^2 + 2bc(X_i - \bar{X})(Z_i - \bar{Z}) \qquad (8)$$
$$\textstyle\sum (Y_i - \bar{Y})^2 = \sum [b^2(X_i - \bar{X})^2 + c^2(Z_i - \bar{Z})^2 + 2bc(X_i - \bar{X})(Z_i - \bar{Z})] \qquad (9)$$
$$\textstyle\sum (Y_i - \bar{Y})^2 = \sum b^2(X_i - \bar{X})^2 + \sum c^2(Z_i - \bar{Z})^2 + \sum 2bc(X_i - \bar{X})(Z_i - \bar{Z}) \qquad (10)$$
$$\textstyle\sum (Y_i - \bar{Y})^2 = b^2\sum (X_i - \bar{X})^2 + c^2\sum (Z_i - \bar{Z})^2 + 2bc\sum (X_i - \bar{X})(Z_i - \bar{Z}) \qquad (11)$$
$$\frac{\sum (Y_i - \bar{Y})^2}{n-1} = \frac{b^2\sum (X_i - \bar{X})^2 + c^2\sum (Z_i - \bar{Z})^2 + 2bc\sum (X_i - \bar{X})(Z_i - \bar{Z})}{n-1} \qquad (12)$$
$$\frac{\sum (Y_i - \bar{Y})^2}{n-1} = b^2\,\frac{\sum (X_i - \bar{X})^2}{n-1} + c^2\,\frac{\sum (Z_i - \bar{Z})^2}{n-1} + 2bc\,\frac{\sum (X_i - \bar{X})(Z_i - \bar{Z})}{n-1} \qquad (13)$$
$$S_Y^2 = b^2 S_X^2 + c^2 S_Z^2 + 2bc\,S_{XZ} \qquad (14)$$
$$S_Y^2 = b^2 S_X^2 + c^2 S_Z^2 + 2bc\,\mathrm{cov}(X, Z) \qquad (15)$$

Figure 2.1: Scatterplot showing 2005 state Medicaid spending (in dollars per capita) plotted against the percentage of each state's population receiving Medicaid benefits.

Figure 2.2: Basic scatterplot showing party identification versus ideological self-placement in the 2004 CPS American National Election Study.

Figure 2.3: Scatterplot showing party identification versus ideological self-placement in the 2004 CPS American National Election Study. Plotted points are jittered to break up the plotting locations of discrete data values.

Figure 2.4: Graph of function, Yi = 2 Xi. (Panel A: graph of the function when X is discrete; Panel B: graph of the function when X is continuous.)

Figure 2.5: Graph of linear function with negative slope, Yi = -2 Xi.

Figure 2.6: Graph of two linear functions with different positive slopes. The solid line represents the function Yi = 2 Xi; the dashed line represents a steeper linear function.

Figure 2.7: Graph of two linear functions with different negative slopes. The solid line represents the function Yi = -1 Xi, and the dashed line represents the function Yi = -3 Xi.

Figure 2.8: Graph of three linear functions with identical slopes, but different intercepts. The solid line represents the function Yi = 3 + 1.5 Xi. The dashed line represents the function Yi = 1.5 Xi. And, the dotted line represents Yi = -3 + 1.5 Xi.

Figure 2.11: Scatterplots showing a linear combination (Y) plotted against each of the two variables used to create it (X1 and X2). (Panel A: Y plotted against X1; Panel B: Y plotted against X2.)
Figure 2.12: Plotting Y (the linear combination) against X1, while holding X2 constant at a fixed value. (Panel A: Y versus X1 for observations with X2 = 4; Panel B: Y versus X1 for observations with X2 = 7.)

Figure 2.13: Plotting a "purged" version of Y versus X1. The vertical axis variable is defined as: Y* = Y - (-2) X2. (Panel A: purged version of Y against X1; Panel B: the same plot with points jittered.)

Figure 2.14: Scatterplots showing different degrees of linear structure in bivariate data. (Panel A: no linear structure; Panel B: weak linear structure; Panel C: clear linear structure; Panel D: clear, but negative, linear structure; Panel E: perfect linear structure.)

Figure 2.15: Scatterplots showing X versus Y* = Y3 / 2 and X versus Y** = 2 Y5, to demonstrate that changing the measurement units does not alter the degree of linear structure in the data. (Panel A: X versus Y*; Panel B: X versus Y**.)

Figure 2.16: Scatterplots of nonlinear deterministic functions. (Panel A: X versus Y1; Panel B: X versus Y2.)