Chapter 11  Linear Regression

Linear regression is one of the most frequently used statistical tools in business. It is used to determine whether a relationship exists between two variables. Suppose we were able to determine what relationship, if any, existed between sales and the amount spent on advertising. We could then use this relationship to calculate how an increase in advertising might affect sales. If we were really lucky we might even be able to predict the amount of sales a particular amount of advertising would generate. There are a large number of cases in which a business might be interested in how a change in one economic variable affects a company's profits, sales, or revenues. We will examine some of these ideas in this chapter.

11.1 The Pearson Correlation Coefficient

In this section we examine the correlation coefficient, a fairly simple way of detecting whether a relationship exists between two variables. We will designate one of these variables as x and the other as y. The Pearson correlation coefficient is defined to be

    r = SS_XY / √(SS_X · SS_Y)

where

    SS_XY = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ)
    SS_X  = Σ_{i=1}^{n} (x_i − x̄)²
    SS_Y  = Σ_{i=1}^{n} (y_i − ȳ)²

The SS terms are called sums of squares. Now let's suppose we have collected the data on x and y shown in Table 11.1.

    X:  1  2  3  4  5
    Y:  1  2  3  4  5

Table 11.1 Data for Example 11.1

This data suggests that every time x increases by one unit, y also increases by one unit. Not only that, but x and y are always equal. This could be taken as an example of an ideal relationship between x and y; it is hard to picture a more perfect relationship between two variables than one in which they are always equal.

Example 11.1 Let's calculate the correlation coefficient for the data in Table 11.1.

    x̄ = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
    ȳ = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3

    SS_X  = (1−3)² + (2−3)² + (3−3)² + (4−3)² + (5−3)² = 10
    SS_Y  = (1−3)² + (2−3)² + (3−3)² + (4−3)² + (5−3)² = 10
    SS_XY = (1−3)(1−3) + (2−3)(2−3) + (3−3)(3−3) + (4−3)(4−3) + (5−3)(5−3) = 10

So

    r = SS_XY / √(SS_X · SS_Y) = 10 / √(10 · 10) = 1

Data that has a correlation coefficient equal to one is said to be perfectly positively correlated, meaning that given a value for x (or y) you could perfectly predict the value of y (or x).

Now consider the correlation coefficient for the set of data shown in Table 11.2.

    X:  1   2   3   4   5
    Y: −1  −2  −3  −4  −5

Table 11.2 Data for Example 11.2

Example 11.2 Compute the correlation coefficient for the data in Table 11.2.

Computing the sums of squares gives SS_XY = −10, SS_X = 10, and SS_Y = 10, so that

    r = SS_XY / √(SS_X · SS_Y) = −10 / √(10 · 10) = −1

Data that fits exactly on a straight line with a positive slope will have a correlation coefficient r = 1, while data that fits exactly on a straight line with a negative slope will have a correlation coefficient r = −1. Finding data that fits exactly on a straight line is very, very rare. It's never happened to me. If it ever did, I would suspect something was wrong. Data may tend to fit on a straight line, however, and that tendency is what the correlation coefficient is used to detect. Consider the data in Table 11.3.

    X:  1  2  3  4  5
    Y:  1  2  4  4  5

Table 11.3 Data for Example 11.3

Example 11.3 Compute the correlation coefficient for the data in Table 11.3.

While we could compute the correlation coefficient by hand, we will use a computer package to do it for us. The package we will use is SPSS (the Statistical Package for the Social Sciences). The SPSS output for this data is shown in Figure 11.1.
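For readers who want to check the arithmetic outside SPSS, here is a minimal Python sketch of the sum-of-squares calculation for the Table 11.3 data (Python and the scipy library are used here only as a stand-in; the output discussed in the text comes from SPSS). The library call at the end returns the same r together with the p-value discussed below.

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 4, 4, 5], dtype=float)   # the data of Table 11.3

# sums of squares, exactly as defined above
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
ss_x = np.sum((x - x.mean()) ** 2)
ss_y = np.sum((y - y.mean()) ** 2)

r = ss_xy / np.sqrt(ss_x * ss_y)
print(r)                                     # about 0.962

r_check, p_value = stats.pearsonr(x, y)      # library check: same r, plus a p-value
print(r_check, p_value)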
Figure 11.1 The SPSS output for Example 11.3

The table gives the Pearson correlation coefficient r for Y correlated with Y, Y correlated with X, X correlated with Y, and X correlated with X. Y is perfectly correlated with itself, as is X, so r = 1 for both of those entries (the correlation of a variable with itself is pretty useless, but it is reported for compatibility with older computer packages). The interesting item is how X is correlated with Y, and for this r = 0.962. The table also gives the number of observations, N = 5. The other entry in the table is Sig. (2-tailed) = 0.009. Sig. (2-tailed) is actually the p-value for the hypothesis test

    H_0: ρ = 0 (X and Y are not correlated)
    H_A: ρ ≠ 0 (X and Y are correlated)

where ρ is the population correlation coefficient. The Pearson correlation coefficient r is the sample correlation coefficient. There is a test statistic that can be used to test this hypothesis:

    t = r / √((1 − r²)/(n − 2))

Note that if the null hypothesis is true, then ρ = 0 and we would expect sample values of r close to zero, which would make t close to zero. Values of t near zero would therefore be taken as evidence that the null is true; if the null is false, t will take on large values. For this test we use the t distribution just as we did for tests on the population mean, with n − 2 degrees of freedom, and reject the null if t > t_{α/2} or t < −t_{α/2}. So for this example the test would be

    H_0: ρ = 0 (X and Y are not correlated)
    H_A: ρ ≠ 0 (X and Y are correlated)
    α = 0.05
    Reject H_0 if t > t_{0.025} = 3.182 or t < −t_{0.025} = −3.182

    t = r / √((1 − r²)/(n − 2)) = 0.962 / √((1 − 0.925)/3) = 0.962 / √0.025 = 6.08

and we would reject the null hypothesis and claim that x and y are correlated. It is much easier to use the p-value: we note that p = 0.009 < 0.05, which also gives sufficient evidence to reject the null. We will use p-values from now on when we can. Also note the statement "** Correlation is significant at the 0.01 level." This is SPSS's notification that you would reject the null hypothesis at that level of significance.

A plot of the values of a set of data is called a scattergram. The scattergram of the data in Table 11.3 is shown in Figure 11.2. Note that the data seems to slope upward, as is reflected in the correlation coefficient r = 0.962.

Figure 11.2 The scattergram of the data in Table 11.3

    X:  1  2  3  4  5
    Y:  2  1  3  1  2

Table 11.4 Data for Example 11.4

Example 11.4 Is the data in Table 11.4 correlated? Test using a 1% level of significance.

First let's look at the scattergram shown in Figure 11.3.

Figure 11.3 The scattergram of the data in Table 11.4

It is very difficult to say whether there is any relationship between x and y from an inspection of the scattergram. If we look at the correlation analysis shown in Figure 11.4, however, we find that r = 0.00 and that p = 1.000 > α = 0.01.

Figure 11.4 The correlation analysis of the data in Table 11.4

Using this evidence we could not reject the null hypothesis for any level of significance.
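The t test for ρ is easy to reproduce by hand or in code. The sketch below (Python with scipy, not SPSS) plugs in the r = 0.962 and n = 5 from Example 11.3 and recovers the critical value, the test statistic, and the p-value quoted above.

import numpy as np
from scipy import stats

n, r = 5, 0.962
t_stat = r / np.sqrt((1 - r**2) / (n - 2))        # about 6.1
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # 3.182 for alpha = 0.05
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-tailed p-value, about 0.009
print(t_stat, t_crit, p_value)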
Table 11.5 gives some rules of thumb often used in interpreting correlation coefficients. The coefficient can only take values between plus and minus one, and values near zero suggest that no correlation exists. This table does not replace a hypothesis test, however.

    Pearson correlation coefficient    Conclusion
    r = 1                              Perfect positive correlation
    r = −1                             Perfect negative correlation
    r near 1                           Evidence for the existence of positive correlation
    r near −1                          Evidence for the existence of negative correlation
    r near 0                           Evidence for the absence of correlation

Table 11.5 Rough interpretation of the values of the Pearson correlation coefficient

Example 11.5 Conduct a correlation analysis of the data in Table 11.6. Test using a 2% level of significance.

Here we have three variables; a correlation analysis can be conducted on two or more variables. The SPSS output is shown in Figure 11.5.

    X:  1  2  3  4  5
    Y:  1  2  1  2  2
    Z:  1  2  2  3  3

Table 11.6 Data for Example 11.5

Figure 11.5 The correlation analysis of the data in Table 11.6

The p-values suggest that X and Y are not correlated (p = 0.308 > 0.02), X and Z are correlated (p = 0.015 < 0.02), and Y and Z are not correlated (p = 0.133 > 0.02). If we had chosen a 1% level of significance we would have concluded that none of the variables are correlated.
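A pairwise correlation table like the one in Figure 11.5 can be produced outside SPSS as well. The sketch below uses the pandas and scipy libraries on the Table 11.6 data; the .corr() call gives the matrix of Pearson correlations, and pearsonr gives the p-value for any single pair.

import pandas as pd
from scipy import stats

data = pd.DataFrame({"X": [1, 2, 3, 4, 5],
                     "Y": [1, 2, 1, 2, 2],
                     "Z": [1, 2, 2, 3, 3]})

print(data.corr())                            # pairwise Pearson correlations

r, p = stats.pearsonr(data["X"], data["Z"])   # one pair, with its p-value
print(r, p)                                   # p is about 0.015, as in the text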
Example 11.6 A statistician with nothing better to do has collected 20 years of data on rum prices and preachers' salaries. He puts the data into SPSS and obtains the scattergram shown in Figure 11.6.

Figure 11.6 A scattergram of rum prices and preachers' salaries

The SPSS correlation analysis is shown in Figure 11.7.

Figure 11.7 The correlation analysis of rum prices and preachers' salaries

If we test

    H_0: ρ = 0 (There is no correlation between rum prices and preachers' salaries)
    H_A: ρ ≠ 0 (There is a correlation between rum prices and preachers' salaries)

using a 1% level of significance, we would reject the null hypothesis (p = 0.007 < 0.01) and conclude that preachers' salaries and rum prices are correlated.

When we say that two variables are linearly correlated we mean that they tend to change together in a systematic manner. In this example, an increase in preachers' salaries seems to be related to increases in rum prices. This does not mean that increases in preachers' salaries cause increases in rum prices (unless the preachers are out guzzling all the rum). One suspects that both are increasing due to inflation. All correlation analysis does is tell you whether two variables seem to be related in a linear fashion. The interpretation of why this might be so is beyond the realm of statistics.

You will likely use correlation analysis heavily in finance courses, where correlations between the rates of return of various assets are of interest. If two stocks have a high correlation coefficient you can say there seems to be a relationship between their rates of return, but this does not tell you why. Quite possibly you might not care why. A good investment strategy is to pick a portfolio of assets whose returns are uncorrelated; in that case, if the return of one asset declines, the rest won't tend to follow suit.

In the next section we will look at a stronger assumption, one in which we believe that a change in one variable actually causes changes in the second variable.

11.2 Causal Relationships

In the previous section we examined the case where changes in one variable seemed to coincide with changes in a second variable. If the correlation coefficient between the two variables was large enough, we could take that as evidence that the two variables were related somehow. In this section we want to examine a stronger and more useful relationship between two variables, one in which a change in one variable causes a change in the other. Such a relationship is called a causal relationship.

Suppose that we are interested in the amount of corn produced, y, on a plot of ground as a function of the amount of water used, x, on the plot. It seems plausible that increases in the amount of water used should produce increases in the amount of corn produced. Note that this relationship works in only one direction: more water produces more corn, but more corn does not (and cannot) produce more water. For this reason y is called the dependent variable and x is called the independent variable. Restating this, the amount of y produced depends on the amount of x used, but the amount of x used is independent of y. We, or nature, control the amount of water used; the corn is not in control. If y is sales and x is the value of the advertising budget, then management may increase x in order to increase y. Management determines the amount of x (advertising), which they hope will produce the desired level of y (sales).

In this chapter we will examine the simplest causal relationship, a linear or straight-line relationship. Recall from your math classes that the formula for a straight line is something like

    y = mx + b

where b is called the intercept term and m is called the slope term. The intercept gives the value of y produced when x = 0. The slope indicates how much y will change for a given change in x. It is customary in statistics to use the symbol β_0 to designate the intercept term and β_1 to designate the slope term, so the representation we will use for a straight line is

    y = β_0 + β_1 x

There is a good reason for introducing this notation. In a more advanced study of, say, the factors that might affect consumer demand (y), we would include things like gross national product (x_2), the price of the good (x_3), and the price of substitute goods (x_4) in the equation as well as advertising (x_1). In that case the relationship would be

    y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4

Note that the subscript on each β is the same as the subscript on its x, which makes it easy to match the slope terms with their respective variables. In this chapter we will only consider a single independent variable, x_1, and because we only have one of them we will drop the subscript. From now on we will call the equation

    y = β_0 + β_1 x

a model. One of the things we will have to do is determine the values of the β terms; in particular, we will be most interested in whether the β_1 term is equal to zero or not. If β_1 = 0 then no relationship exists between x and y, because y would always have the same value regardless of the value of x. If y is gross domestic product and x is the phase of the moon, we would suspect that no relationship exists between the two and that β_1 would be equal to zero.

So the first task we have is that of finding the values of the β terms. As an example of the problem we face, let's consider a consumption function. If you have had macroeconomics you may recall that Keynes suggested that current consumption is determined by current income. So we could model a Keynesian consumption function as

    y = β_0 + β_1 x

where y is current consumption and x is current income. You might recognize β_1 as the marginal propensity to consume and β_0 as autonomous consumption. From your statistics course you hopefully know that the β terms are parameters (hence the Greek letters). In your economics course you may have used values like β_1 = 0.9 for the marginal propensity to consume. Where did this value come from? Out of the mind of the author of the text, that's where.
The author was using an easy value to illustrate the basic operation of the consumption function. The true value of the marginal propensity to consume is not determined by textbook authors, nor is it determined by the government. It is determined by us, the consumers. To find values for the β terms we will have to collect data on income and consumption and use that data to find estimates of the β terms.

Suppose we were able to find two groups of households, where every household in the first group earns exactly $20,000 and every household in the second group earns exactly $40,000. (In practice it is not necessary to collect data in this fashion, that is, in identical groups; we do so here to illustrate a point.) Not all households with incomes of $20,000 will spend exactly the same amount. One family may spend $18,934.25, a second might consume $19,224.21, a third some other amount, and so on. Consumption for families in this income group will vary from household to household. The same will hold true for families earning $40,000. In other words, consumption is a random variable. Recall that random variables have probability distributions, which have means and standard deviations. The situation is shown in Figure 11.8, where the population of consumption values for the $20,000 group is labeled A and that of the $40,000 group is labeled B.

Figure 11.8 A scattergram of a consumption function

Assume for the moment that we know the mean and variance of consumption for both these income groups (these are population parameters, and in practice you won't know them). Suppose a household earning $20,000 was selected at random and you were asked to predict consumption for that family. What answer would you give? It would be the population mean, of course. Likewise for a $40,000 household: the best single-value guess is the population mean. If you knew the variance of the population you could also attach a confidence interval to the guess. If the distribution were normal, you could say you were 95% confident that a household picked at random would consume in the range μ ± 1.96σ. There's nothing new here.

Suppose you were asked to make a prediction of consumption for a family earning $30,000 a year. How might you do it? Well, you can't, unless you throw in an assumption. Let's assume that the straight line passing through the means of the distributions for the $20,000 and $40,000 groups also passes through the mean of the distribution for the $30,000 group. Such a line is shown in Figure 11.9. If the assumption is true, then the line passes through the mean of the consumption distribution for the $30,000 group, and your best prediction of consumption would be the value of that mean.

Figure 11.9 A line passing through the means of the distributions of consumption values for given values of income

Now, would it be possible to find a 95% confidence interval for this prediction? The answer is no, unless we throw in yet another assumption. We will assume that every population distribution has exactly the same variance, σ². If this assumption is true then we have two additional lines, one 1.96σ below the line passing through the means and one 1.96σ above it. These lines are shown in Figure 11.10. Note that if these assumptions are true, then we can make predictions and calculate confidence intervals for any income group.
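To make the μ ± 1.96σ statement concrete, here is a tiny Python sketch for a single income group. The mean and standard deviation below are hypothetical numbers chosen purely for illustration; they do not come from any data in this chapter.

# 95% range mu +/- 1.96*sigma for one income group (hypothetical population values)
mu = 19_000.0      # assumed mean consumption for the $20,000 group
sigma = 800.0      # assumed standard deviation of consumption

low, high = mu - 1.96 * sigma, mu + 1.96 * sigma
print(low, high)   # roughly 17,432 to 20,568: the range containing about 95% of households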
These assumptions, that a straight line passes through the means and that the variances are all the same, may seem pretty bold, but they are the ones used in linear regression. The reason for the term "linear" should now be clear: we are dealing with straight lines.

Figure 11.10 95% confidence intervals about the population means

What we can't assume is that we know what the means and the variance are. If we can't do that for a single distribution, how can we possibly do it for several (an infinity, really) of distributions? We will have to estimate these population parameters, and we will do it much as we have in the past: take sample data and try to infer the population values from the sample information.

Next we want to see what kind of problems we run into in this estimation process. Recall that we want to find a line

    y = β_0 + β_1 x

that passes through the means of the populations. Using sample data we can find a line that approximates (estimates) this line. We will represent that line by

    ŷ = β̂_0 + β̂_1 x

called the regression line. Using our previous terminology, the β terms are parameters (if we had population data we could calculate them exactly), and the β̂ terms are the best point estimates of the β terms.

Next let's consider some of the properties of the β̂ terms. Suppose we find two families, one with income x = $20,000 and the other with income x = $40,000. Suppose the first family consumes $18,000 of its income and the second family consumes $36,000. Only two points are needed to fit a straight line, so we could infer the values β̂_0 = 0 and β̂_1 = 0.9 from these two points. Suppose we repeat the experiment and collect data from two other families with incomes of $20,000 and $40,000, respectively. Does it seem likely that these two families would consume exactly the same amounts as the first two? It would be an amazing coincidence if they did. These two families are likely to consume different amounts of their incomes, say $19,000 and $35,000. These values give β̂_0 = $3,000 and β̂_1 = 0.8. The situation we have just described is shown in Figure 11.11, where line 1 and line 2 are the sort of lines we might get.

Figure 11.11 Regression lines for different sets of data

The β̂'s change from experiment to experiment. In other words, the estimates are random variables. Like all random variables they have probability distributions with means and standard deviations. What we will do with these random variables is the same thing we have done with other random variables: construct confidence intervals and perform hypothesis tests.

You might feel that we could get better estimates of the β's if we took large samples from the $20,000 and $40,000 income groups. You would be correct in believing this. You might also wonder how in the world we could find households that fit so neatly into these categories. We might find one household that earned $33,124, another that earned $50,987, a third that earned yet a different amount, and so on. That is, we want to be able to analyze data like that shown in Figure 11.12. The linear regression procedure can cope with this as long as the assumptions of linear regression are met.

Figure 11.12 A scattergram of consumption-income data
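The point that β̂_0 and β̂_1 change from sample to sample is easy to see in a short simulation. In the Python sketch below the "true" line (intercept 0, slope 0.9) and the size of the household-to-household variation are hypothetical choices made only for illustration; each pass of the loop plays the role of one repetition of the two-family experiment described above.

import numpy as np

rng = np.random.default_rng(1)
incomes = np.array([20_000.0, 40_000.0])

for experiment in range(3):
    # draw one household at random from each income group
    consumption = 0.9 * incomes + rng.normal(0, 1_000, size=2)   # true intercept 0, slope 0.9
    b1 = (consumption[1] - consumption[0]) / (incomes[1] - incomes[0])
    b0 = consumption[0] - b1 * incomes[0]
    print(f"experiment {experiment + 1}: intercept {b0:,.0f}, slope {b1:.3f}")

Each run prints a different intercept and slope, which is exactly the sense in which the estimates are random variables.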
11.3 The Assumptions of Linear Regression

1. A straight line y = β_0 + β_1 x passes through the means of a number of normal populations.
2. Each population has the same variance σ².
3. Any one value of y is statistically independent of all other values of y.

The reasons for the first two of these assumptions have been discussed above. The reason for the third assumption is a bit subtle and will not be discussed here; as has been suggested in previous chapters, however, the assumption of independence is made to keep things (relatively) simple. All of the results we will find in this chapter depend on the assumptions being true. If any of the assumptions are false, the results will be suspect. Unfortunately, these assumptions are frequently violated in practice.

The Least Squares Criterion

We want to find the line that passes as close to the means of the populations as we can, given the data that we have. Put another way, we want to find the line that best "fits" the data. Think of it like this: the line is to be used for prediction. If you have an observation (a value of y at a certain x), then that observation represents something known. A good predictive line should reproduce the known information as closely as possible. We are going to submit the data to a mathematical procedure, likely implemented on a computer, and ask it to find the line that best reproduces the known results. The problem is that mathematical procedures, and computers for that matter, are pretty dumb. They don't know what "best" is, so we will have to tell them.

Let's try this as a criterion for "best". There is an infinity of lines that could be used to fit the data. Some may fit it fairly well, some may give a perfectly rotten fit, and one may give a better fit than any of its competitors. The line that works better than any of its competitors is the one we will say gives the "best" fit. Consider the following rule: pick the line where the distances from the observations to the line are as small as possible. Let

    ε_i = y_i^o − ŷ_i

where ε is the lower-case Greek letter epsilon, y_i^o is the observed value of y at x_i, and ŷ_i is the value of y taken from the line used to fit the data. So ε_i is how much the observed value of y deviates from the regression line; sampling error is another term used for ε_i. An example of sampling error is shown in Figure 11.13.

Figure 11.13 Sampling error ε_i

According to our discussion of "best", it might seem that we should pick the line where

    Σ_{i=1}^{n} ε_i

is a minimum. There is a problem with this scheme. Recall the discussion of the variance in Chapter 3. In that discussion we had to square the differences between the mean and the observed values; if we did not, the sum of the differences would always be zero. Something similar occurs here: in fact, there is an infinity of lines for which Σ ε_i = 0. The fix we will use is to pick the line where

    Σ_{i=1}^{n} ε_i²

is a minimum. What we do, then, is pick the line that makes the sum of the squared error terms a minimum. For this reason linear regression is often called ordinary least squares. Restating this, linear regression is a technique for finding β̂_0 and β̂_1 so that

    Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (y_i^o − ŷ_i)² = Σ_{i=1}^{n} (y_i^o − β̂_0 − β̂_1 x_i)²

is a minimum. The values of β̂_0 and β̂_1 are fairly easy to calculate for small sets of data:

    β̂_1 = SS_XY / SS_X
    β̂_0 = ȳ − β̂_1 x̄

To test these formulas let's create some hypothetical data. Consider a straight line (with no sampling error)

    y = x

and calculate some values of y for given values of x:

    X:  1  2  3  4  5
    Y:  1  2  3  4  5

Note that this is the first set of data that we examined in the section on the correlation coefficient.
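The sketch below applies these two formulas, in Python rather than by hand, to this error-free data. It should, and does, recover the line y = x that generated the data.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = x.copy()                                   # data generated from the line y = x

ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
ss_x = np.sum((x - x.mean()) ** 2)

b1 = ss_xy / ss_x               # slope estimate: 1.0
b0 = y.mean() - b1 * x.mean()   # intercept estimate: 0.0
print(b0, b1)                   # the procedure reproduces the line used to create the data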
For that data we found SS_XY = 10 and SS_X = 10, with x̄ = 3 and ȳ = 3, so that

    β̂_1 = SS_XY / SS_X = 10/10 = 1
    β̂_0 = ȳ − β̂_1 x̄ = 3 − (1)(3) = 0

This gives the regression equation

    ŷ = β̂_0 + β̂_1 x = 0 + (1)(x) = x

Because the regression procedure reproduces the equation used to create the data, this should give you some faith that the regression procedure does work.

SPSS also has a regression procedure, which produces the output shown in Figure 11.4.

Figure 11.4 An SPSS printout of a regression analysis

The column called Model lists two items, (Constant) and X. The term (Constant) refers to β̂_0 and the term X refers to β̂_1. There is also a column called Standardized Coefficients; forget about it. We are interested in the column called Unstandardized Coefficients. Its sub-column labeled B contains the values of β̂_0 and β̂_1: the value of β̂_0 is found where the row labeled (Constant) meets column B, and the value of β̂_1 is found where the row labeled X meets column B. Using this table we find β̂_0 = 0 and β̂_1 = 1.

Recall that β̂_0 and β̂_1 are random variables, and random variables have standard deviations. Call these s_β̂0 and s_β̂1. These values are located in the column labeled Std. Error, immediately to the right of the values of β̂_0 and β̂_1. In this case the standard deviations (or standard errors) are both zero because we have no sampling error. The columns labeled t and Sig. are used for hypothesis testing. You might not be too terribly surprised to learn that Sig. refers to a p-value.

11.4 Statistically Significant Relationships

One of our primary concerns when using regression analysis is to determine whether the slope term β_1 is equal to zero or not. Suppose that β_1 = 0. Then

    y = β_0 + β_1 x = β_0 + (0)x = β_0

and changes in x do not cause changes in y. We have:

    β_1 = 0 means that changes in x do not cause changes in y.
    β_1 ≠ 0 means that changes in x cause changes in y.

To try to determine whether β_1 has a value of zero or not, we perform the following hypothesis test:

    H_0: β_1 = 0 (Y does not depend on X)
    H_A: β_1 ≠ 0 (Y does depend on X)
    Test statistic: t = β̂_1 / s_β̂1 with (n − 2) degrees of freedom

If we reject the null hypothesis we will say that a statistically significant relationship exists between x and y. The number of degrees of freedom for the regression procedure is given by the formula

    df = n − (the number of β coefficients in the model)

For the problem we have been looking at so far there are two coefficients, β_0 and β_1, so df = n − 2.

Let's examine another set of data that we've seen before:

    X:  1  2  3  4  5
    Y:  1  2  4  4  5

Table 11.7 The data for Example 11.6

The SPSS printout is shown in Figure 11.5.

Figure 11.5 The regression analysis for Example 11.6

where we have

    β̂_0 = 0.200    s_β̂0 = 0.542
    β̂_1 = 1.000    s_β̂1 = 0.163

Let's use this information to determine whether we believe a statistically significant relationship exists between x and y.

Example 11.6 Does a statistically significant relationship exist between x and y for the data in Table 11.7? Test using a 5% level of significance. The SPSS output is shown in Figure 11.5.

    H_0: β_1 = 0 (Y does not depend on X)
    H_A: β_1 ≠ 0 (Y does depend on X)
    α = 0.05
    df = 5 − 2 = 3
    Reject H_0 if t > 3.182 or t < −3.182
    Test statistic: t = β̂_1 / s_β̂1 = 1.0/0.163 = 6.135

The test statistic is in the rejection region, so we reject the null hypothesis and say that a statistically significant relationship exists between x and y. It is much easier to use the p-value: p = 0.009 < 0.05, which also provides sufficient evidence to reject the null hypothesis of no relationship.
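The same numbers can be reproduced outside SPSS. The sketch below fits the Table 11.7 data with scipy's linregress, which returns the slope, the intercept, the standard error of the slope, and the p-value for the test of H_0: β_1 = 0.

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 4, 4, 5], dtype=float)

fit = stats.linregress(x, y)
print(fit.intercept, fit.slope)     # about 0.200 and 1.000
print(fit.stderr)                   # standard error of the slope, about 0.163

t_stat = fit.slope / fit.stderr     # about 6.1
print(t_stat, fit.pvalue)           # p-value about 0.009, as in the SPSS table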
Another way of determining whether a statistically significant relationship exists is to calculate a confidence interval for β_1. A (1 − α)·100% confidence interval for β_1 is

    β̂_1 ± t_{α/2} s_β̂1

Now suppose the resulting confidence interval includes the value zero. Then we would conclude that a statistically significant relationship does not exist. For example, suppose we calculated a 95% CI for β_1 and found the interval to be [−1, 3]. This would mean we are 95% confident that the slope has a value between −1 (a negative slope) and +3 (a positive slope). So we could be pretty certain that the slope is either positive or negative. Well, I am 100% confident of that without ever looking at any data. A confidence interval for a slope term that includes zero tells you that you have no idea whether the line slopes up or down. Suppose on the other hand the CI had been [1, 3]. In this case you could be pretty confident that the line slopes up. We will say a statistically significant relationship exists if the CI for β_1 does not include zero.

Example 11.7 Use a confidence interval to determine whether a statistically significant relationship exists between x and y for the data in Table 11.7. Use the same level of significance as in Example 11.6 and compare the results.

In Example 11.6 we used a 5% level of significance. To be consistent with that test we compute a 95% confidence interval and check whether it contains zero. From the SPSS printout we have β̂_1 = 1.0 and s_β̂1 = 0.163. For a 95% CI with df = 5 − 2 = 3, t_{α/2} = 3.182 (note that we use the same value of t in both cases):

    β̂_1 ± t_{α/2} s_β̂1 = 1.0 ± 3.182(0.163) = 1.0 ± 0.519 = [0.481, 1.519]

Because the resulting interval does not contain the value zero, we conclude that a statistically significant relationship exists between x and y. Recall that we also claimed a statistically significant relationship existed for the same data in Example 11.6. There we used a hypothesis test and found that the test statistic was in the critical region and, equivalently, that p < α. So we now have three ways of deciding whether a statistically significant relationship exists. Don't worry, there will be a fourth way later on.

11.5 Prediction Using Regression

After we have determined that a statistically significant relationship exists, we can use the regression equation for prediction. The regression output above gives the following regression equation (plug in the appropriate values of β̂_0 and β̂_1 from Figure 11.5):

    ŷ = β̂_0 + β̂_1 x = 0.200 + (1.0)x

Now if you want to make a prediction of y for a given value of x, say x = 10, use that value of x in the regression equation:

    ŷ = 0.200 + (1.0)(10) = 0.200 + 10.0 = 10.2
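Both the confidence interval and the prediction can be reproduced from the scipy fit used in the earlier sketch; the code below again uses the Table 11.7 data.

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 4, 4, 5], dtype=float)

fit = stats.linregress(x, y)
t_crit = stats.t.ppf(0.975, df=len(x) - 2)    # 3.182 for df = 3

ci_low = fit.slope - t_crit * fit.stderr
ci_high = fit.slope + t_crit * fit.stderr
print(ci_low, ci_high)                        # about [0.48, 1.52]; zero is not inside

y_hat = fit.intercept + fit.slope * 10        # point prediction at x = 10
print(y_hat)                                  # about 10.2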
11.6 The Coefficient of Determination

Some people think the coefficient of determination is a useful indicator of whether or not a regression equation is a good one; in particular, they like to use it to compare different regression equations. I believe it to be a somewhat useful qualitative measure.

Figure 11.6 The decomposition of the variance into sums of squares

Note that we can measure how the observations y_i^o vary about the mean ȳ of the y observations in a fashion somewhat similar to a variance or sum of squares. Define

    SST = Σ_{i=1}^{n} (y_i^o − ȳ)²    (Total Sum of Squares)

This variation about the mean can be broken down into two parts:

    y_i^o − ȳ = (y_i^o − ŷ_i) + (ŷ_i − ȳ)

Let's give names to the other terms:

    SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)²      (Regression Sum of Squares)
    SSE = Σ_{i=1}^{n} (y_i^o − ŷ_i)²  (Error Sum of Squares)

SSR measures the variation of the regression line about the mean, and SSE measures the variation of the y observations about the regression line. It can be shown (with a whole lot of effort) that

    SST = SSR + SSE

Let's rewrite this as SSR = SST − SSE and then divide both sides of the equation by SST, so that

    SSR/SST = (SST − SSE)/SST = 1 − SSE/SST

Now define

    R² = SSR/SST = 1 − SSE/SST

The term R² is called the coefficient of determination.

Figure 11.7 A perfect fit to the data

Now let's consider a situation where the regression line passes through the data exactly; that is, the regression line gives a perfect fit to the data, as shown in Figure 11.7. In that case y_i^o = ŷ_i for every observation, so

    SSE = Σ_{i=1}^{n} (y_i^o − ŷ_i)² = 0

and then

    R² = 1 − SSE/SST = 1 − 0 = 1

Thus a regression line that perfectly fits the data gives a value R² = 1; this is the situation in which every sampling error ε_i is zero.

Now let's look at a situation where β_1 = 0 and there is no straight line to fit to the data. In this case the regression line and the mean of the y observations are the same horizontal line. If you make a prediction for y you can ignore the value of x and just use the mean.

Figure 11.8 A poor fit to the data

Figure 11.8 shows a scattergram where there is a poor fit to the data. In that case ŷ_i = ȳ, so

    SSR = Σ_{i=1}^{n} (ŷ_i − ȳ)² = 0

and a value of R² = 0 is consistent with the situation of no statistically significant relationship between x and y.

Figure 11.9 R² analysis from SPSS for the data in Table 11.3

SPSS computes the coefficient of determination in its regression procedure; R² appears as "R Square" in Figure 11.9. The printout in Figure 11.9 also contains a column labeled Adjusted R Square, which will apply to the material in Chapter 12. The column labeled Std. Error of the Estimate is an estimate of σ, the standard deviation of the populations.

It is possible to test whether a statistically significant relationship exists between x and y using R², by computing r = √R² and using the test for the correlation coefficient from Section 11.1:

    H_0: ρ = 0 (no statistically significant relationship exists between x and y)
    H_A: ρ ≠ 0 (a statistically significant relationship exists between x and y)
    α = 0.05
    Reject H_0 if t > t_{α/2} or t < −t_{α/2}
    Test statistic: t = r / √((1 − r²)/(n − 2))

We will not use this test here. A more convenient test uses the ANOVA table computed by SPSS. Once again consider the data given in Table 11.3 and use it in a regression equation; the results are shown in Figure 11.10. The ANOVA table is used to test the hypothesis

    H_0: β_1 = 0 (No statistically significant relationship exists)
    H_A: β_1 ≠ 0 (A statistically significant relationship exists)

using the p-value from the ANOVA table. It is, as before, found in the column labeled Sig. For the regression in Figure 11.10, p = 0.009, and we would reject the null hypothesis of no statistically significant relationship for any α > 0.009. Note that this is exactly the same p-value found in the table containing information about the β's. That is because the hypothesis being tested is exactly the same. Things will change in the next chapter, where we have more than one independent variable.

Figure 11.10 The regression output for the data in Table 11.3
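The decomposition SST = SSR + SSE and the resulting R² are easy to verify numerically. The sketch below does so for the Table 11.3 data and checks R² against the square of the correlation coefficient found earlier.

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 4, 4, 5], dtype=float)

fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
sse = np.sum((y - y_hat) ** 2)           # error sum of squares

print(sst, ssr + sse)                    # the two agree: SST = SSR + SSE
print(ssr / sst, fit.rvalue ** 2)        # R squared two ways, about 0.926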
Example 11.8 Use the following SPSS printout to determine whether a statistically significant relationship exists. Use (a) a test on β_1 with the t statistic, (b) a confidence interval for β_1, (c) the p-value from the coefficient table, and (d) the p-value from the ANOVA table. If a statistically significant relationship exists, use the relationship to forecast a value of y for x = 50. Use the regression output in Figure 11.11 and a 5% level of significance.

Figure 11.11 The regression output for Example 11.8

From the printout we find

    β̂_1 = 2.087    s_β̂1 = 0.061    p = 0.000

(a) Use a hypothesis test on β_1 with the t statistic to determine whether a statistically significant relationship exists.

    H_0: β_1 = 0 (Y does not depend on X)
    H_A: β_1 ≠ 0 (Y does depend on X)
    α = 0.05
    df = 30 − 2 = 28
    Reject H_0 if t > 2.048 or t < −2.048
    Test statistic: t = β̂_1 / s_β̂1 = 2.087/0.061 = 34.271

So we reject the hypothesis that y does not depend on x, using the fact that t = 34.271 > 2.048.

(b) Test using a 95% CI for β_1:

    β̂_1 ± t_{α/2} s_β̂1 = 2.087 ± 2.048(0.061) = 2.087 ± 0.129 = [1.958, 2.216]

Note that the CI does not include the value zero, so we would reject the null hypothesis on that basis. The small width of the CI also gives us some assurance that we have fairly good knowledge of the value of β_1.

Note: The data for this example were created using the formula

    y = β_0 + β_1 x + ε = 20 + 2x + ε

where the sampling error ε was simulated using the random number generator provided by SPSS. The generator was used to create normally distributed random values with μ = 0 and σ = 3. The true value of the slope term is β_1 = 2 and the point estimate is β̂_1 = 2.087. The true value of the standard deviation is σ = 3, and the regression procedure gives s = 2.8875 as an estimate. It would seem that the regression procedure is producing pretty good estimates.

(c) Use the p-value from the coefficient table to test the hypothesis: p = 0.000 < 0.05, so we can reject the null hypothesis on that basis.

(d) Use the p-value from the ANOVA table to test the hypothesis: p = 0.000 < 0.05, and we would reject the null hypothesis on this basis as well. Note that the p-value is the same for both (c) and (d).

While we are not going to use it in a hypothesis test, the value of R Square is R² = 0.977, which is very near one. This would make us suspect the existence of a statistically significant relationship.

Now that we have determined that a statistically significant relationship exists, we will use it for prediction at x = 50. The regression equation is

    ŷ = β̂_0 + β̂_1 x = 18.061 + 2.087(50) = 122.411

and the best prediction we can make for y is 122.411.
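The setup described in the note above is easy to recreate. The sketch below generates 30 observations from y = 20 + 2x + ε with ε ~ N(0, 3) and fits them by ordinary least squares using the statsmodels library rather than SPSS, so the exact estimates will differ from those in the text, but they should land near the true values of 20 and 2.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.arange(1, 31, dtype=float)                 # 30 observations, as in the example
y = 20 + 2 * x + rng.normal(0, 3, size=x.size)    # true intercept 20, slope 2, sigma 3

X = sm.add_constant(x)                            # add the intercept column
fit = sm.OLS(y, X).fit()

print(fit.params)                                 # estimates of beta_0 and beta_1
print(fit.bse)                                    # their standard errors
print(fit.pvalues)                                # p-values, as in the coefficient table
print(fit.rsquared)                               # R Square
print(fit.params[0] + fit.params[1] * 50)         # prediction at x = 50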
Example 11.9 Use the following SPSS output to determine whether a statistically significant relationship exists. If so, use the relationship to forecast a value of y for x = 50. Use a 5% level of significance. The SPSS printout is in Figure 11.12.

(a) A test using the t statistic:

    H_0: β_1 = 0 (Y does not depend on X)
    H_A: β_1 ≠ 0 (Y does depend on X)
    α = 0.05
    df = 30 − 2 = 28
    Reject H_0 if t > 2.048 or t < −2.048
    Test statistic: t = β̂_1 / s_β̂1 = 0.0873/0.061 = 1.434

The t statistic is not in the rejection region, so we do not have sufficient evidence to suggest that a statistically significant relationship exists.

(b) A 95% CI for β_1 is

    β̂_1 ± t_{α/2} s_β̂1 = 0.0873 ± 2.048(0.061) = 0.0873 ± 0.125 = [−0.038, 0.212]

The CI includes the value zero, so this does not give sufficient evidence to claim that a statistically significant relationship exists.

Figure 11.12 The regression output for Example 11.9

(c) Using the p-value from the coefficient table: p = 0.163 > 0.05, and this does not provide evidence to reject the null.

(d) Using the p-value from the ANOVA table: p = 0.163 > 0.05, and this does not provide evidence to reject the null either.

We could also note here that R² = 0.068 is very small and does not lead one to suspect a statistically significant relationship. We will not use the regression equation for prediction, because a statistically significant relationship does not exist.

Causal and Statistically Significant Relationships

We started our study of regression analysis by considering causal relationships, ones where a change in x causes a change in y. We ended by discussing statistically significant relationships, ones where we rejected the hypothesis that the slope term has a zero value. Now, a causal relationship should certainly produce data that give a statistically significant relationship: changes in x that produce changes in y should reveal a slope term not equal to zero. But is the reverse true? If a set of data produces a statistically significant relationship, does that mean the relationship is causal?

Recall the rum price and preachers' salary example in the section on correlation analysis. I suspect that if you took data on preachers' salaries and rum prices you would find that a statistically significant relationship existed. But would this mean the relationship is causal? Would an increase in preachers' salaries cause rum prices to increase? Only if the preachers were buying most of the rum. What is most likely going on is that inflation is making the price of everything, including preachers' labor, increase.

If a statistically significant relationship does not exist between two variables, then it is very doubtful that a causal relationship exists either. If a statistically significant relationship does exist between two variables, then a causal relationship may exist as well. I know of no way to determine with certainty whether a causal relationship exists.

Once upon a time there existed the sunspot theory of economics. The general idea was that when sunspots were plentiful there was lots of rain. Lots of rain meant good crops. Good crops meant good income for farmers, and economic activity would be good. A statistically significant relationship existed between sunspots and economic activity, so sunspots could be used to forecast economic activity. The problem is that no causal relationship existed between the two; the statistically significant relationship occurred only by coincidence during the period when the data was taken. The relationship broke down and the theory was discredited.

Suppose you wanted to use the amount spent on advertising as a means of predicting sales. During the period when the data was acquired, a statistically significant relationship might well exist between advertising and sales. This relationship might break down at a later time due to increases in competition, reductions in the prices of substitutes, and a host of other reasons. Your prediction of sales based on a certain advertising budget might well be perfectly rotten. When using regression analysis (or any other statistical technique) the best advice is to be careful. Statistics is only a guide to decision making and should not be the only factor considered.

Problems

11.1 Calculate the correlation coefficient for the following set of data. Use a hypothesis test to determine whether there is a relationship between x and y (test the hypothesis ρ = 0).

    X:  1   2   3   4
    Y:  1  −1   1  −1

11.2 The SPSS output for problem 11.2 is shown below. Use the p-value to test whether a relationship exists between x and y.

11.3 Calculate the correlation coefficient for the following set of data.
Use a hypothesis test to determine whether there is a relationship between x and y (test the hypothesis ρ = 0).

    X:  1  2  3    4
    Y:  1  2  2.5  4.1

11.4 The SPSS output for the data in problem 11.3 is shown below. Use the p-value to test whether a relationship exists between x and y.

11.5 The SPSS printout below is a regression analysis of the effect of water on the production of corn. Water is applied to one-acre patches where corn has been planted. Corn is measured in bushels per acre; water is measured in acre-feet. Determine whether a statistically significant relationship exists between corn and water. Can you use the relationship to predict corn production? If so, make a prediction of the corn yield produced by 10 acre-feet of water. Use a 5% level of significance and test using all the methods we have discussed.

11.6 The scattergram of (aggregate) disposable personal income (dpi) and household consumption for the US over the period 1960-2000 is shown below. It certainly looks as if a linear relationship exists between income and consumption. The SPSS regression analysis is given next. Use it to determine whether a statistically significant relationship exists between income and consumption. What is the best guess for the value of the marginal propensity to consume?

11.7 Someone has proposed that the amount of money spent on advertising greatly affects the sales of a certain product. Data is collected on sales (in millions of dollars) and advertising (in millions of dollars). The SPSS regression results are shown below. Do these data support the contention that the amount spent on advertising serves as a good predictor of sales revenue? Test using a 10% level of significance. Would this information lead you to believe that a causal relationship exists between advertising expenditures and sales?