Chapter 11
Linear Regression
Linear regression is one of the statistical tools most frequently used in business. It is
used to determine whether a relationship exists between two variables. Suppose we were
able to determine what relationship, if any, exists between sales and the amount spent
on advertising. We could then use this relationship to calculate how an increase in advertising might
affect sales. If we were really lucky we might even be able to predict the amount of sales a
particular amount of advertising would generate. There are a large number of cases
where a business might be interested in how a change in one economic variable might
affect a company's profits, sales, or revenues. We will examine some of these ideas in
this chapter.
11.1 The Pearson Correlation Coefficient
In this section we will examine the correlation coefficient, a fairly simple way of
detecting whether a relationship exists between two variables. We will designate one of
these variables as x and the other as y. The Pearson correlation coefficient is defined to
be
r = \frac{SS_{XY}}{\sqrt{SS_X \, SS_Y}}

where

SS_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

SS_X = \sum_{i=1}^{n} (x_i - \bar{x})^2

SS_Y = \sum_{i=1}^{n} (y_i - \bar{y})^2
The SS terms are called sums of squares. Now let's suppose we have collected the
data on x and y shown in Table 11.1.
Y 1 2 3 4 5
X 1 2 3 4 5
Table 11.1 Data for Example 11.1
This data suggests that every time x increases by one unit, y also increases by one unit.
Not only that, but it seems that x and y are always equal. This could be taken as an
example of an ideal relationship between x and y. It is hard to picture a more perfect
relationship between two variables than one in which they are always equal.
Example 11.1 Let's calculate the correlation coefficient for this set of data in Table 11.1.
\bar{x} = \frac{1}{5}\sum_{i=1}^{5} x_i = \frac{1+2+3+4+5}{5} = \frac{15}{5} = 3

\bar{y} = \frac{1}{5}\sum_{i=1}^{5} y_i = \frac{1+2+3+4+5}{5} = \frac{15}{5} = 3
SS_X = \sum (x_i - \bar{x})^2 = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 10

SS_Y = \sum (y_i - \bar{y})^2 = (1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2 = 10

SS_{XY} = \sum (x_i - \bar{x})(y_i - \bar{y}) = (1-3)(1-3) + (2-3)(2-3) + (3-3)(3-3) + (4-3)(4-3) + (5-3)(5-3) = 10
So

r = \frac{SS_{XY}}{\sqrt{SS_X \, SS_Y}} = \frac{10}{\sqrt{(10)(10)}} = 1
Data that has a correlation coefficient equal to one is said to be perfectly positively
correlated, meaning that given a value for x (or y) you could perfectly predict the value of
y (or x).
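If you would rather let a computer do the arithmetic, the short Python sketch below reproduces the calculation above using the sum-of-squares formulas (this is only an illustration; the chapter itself relies on SPSS, and the variable names are my own):

    # A minimal sketch of the Pearson correlation computation,
    # using the sum-of-squares formulas defined above.
    import math

    x = [1, 2, 3, 4, 5]
    y = [1, 2, 3, 4, 5]

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    r = ss_xy / math.sqrt(ss_x * ss_y)
    print(ss_x, ss_y, ss_xy, r)   # 10.0 10.0 10.0 1.0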
Now consider the correlation coefficient for the set of data shown in Table 11.2.
Y -1 -2 -3 -4 -5
X 1 2 3 4 5
Table 11.2 Data for Example 11.2
Example 11.2 Compute the correlation coefficient for the data in Table 11.2.
Compute the sums of squares and get

SS_{XY} = -10
SS_X = 10
SS_Y = 10

so that

r = \frac{SS_{XY}}{\sqrt{SS_X \, SS_Y}} = \frac{-10}{\sqrt{(10)(10)}} = -1
Data that falls exactly on a straight line with a positive slope will have a correlation
coefficient r = 1, while data that falls exactly on a straight line with a negative slope will
have a correlation coefficient r = -1. Finding data that falls exactly on a straight line is
very, very rare. It's never happened to me. If it ever did, I would suspect something was
wrong. Data may tend to fall near a straight line, however, and that is what the
correlation coefficient is used to detect. Consider the data in Table 11.3.
Y 1 2 4 4 5
X 1 2 3 4 5
Table 11.3 Data for Example 11.3
Example 11.3 Compute the correlation coefficient for the data in Table 11.3.
While we could compute the correlation coefficient by hand, we will use a computer
package to do it for us. The package we will use is the SPSS (Statistical Package for the
Social Sciences) statistical package. The output from SPSS for this data is shown in Fig.
11.1.
Figure 11.1 The SPSS output for Example 11.3
The table gives the Pearson correlation coefficient, r, for Y correlated with Y, for Y
correlated with X, for X correlated with Y, and for X correlated with X. Y is perfectly
correlated with itself, as is X, so r = 1 for both of these correlations (the correlation of Y
with itself and of X with itself is pretty useless, but is reported for compatibility with older
computer packages). The interesting item is how X is correlated with Y, and for this
r = 0.962. The table also gives the number of observations, N = 5. The other entry in the
table is Sig.(2-tailed) = 0.009. Sig.(2-tailed) is actually the p-value for a hypothesis test

H_0: ρ = 0   (X and Y are not correlated)
H_A: ρ ≠ 0   (X and Y are correlated)

where ρ is the population correlation coefficient. The Pearson correlation coefficient r
is the sample correlation coefficient. There is a test statistic that can be used to test the
hypothesis, and that is

t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}}
Note that if the null hypothesis is true, ρ = 0, and since r is an estimate of ρ we would
expect to get values of r close to zero. In that case

t = \frac{0}{\sqrt{\dfrac{1-0^2}{n-2}}} = 0

and values of t near zero would be taken as evidence that the null is true. If the null is
false, t would take on large values. Actually, for this test we can use the t distribution
just as we did for tests on the population mean, and reject the null if t < -t_{α/2} or
t > t_{α/2}, using the t distribution with n - 2 degrees of freedom. So for this example the
test would be
H_0: ρ = 0   (X and Y are not correlated)
H_A: ρ ≠ 0   (X and Y are correlated)

α = 0.05
Reject H_0 if t < -t_{0.025} = -3.182 or t > t_{0.025} = 3.182

t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} = \frac{0.962}{\sqrt{\dfrac{1-0.962^2}{5-2}}} = \frac{0.962}{\sqrt{0.025}} = 6.08
and we would reject the null hypothesis and claim that x and y are correlated. It is much
easier to use the p-value. We note that p = 0.009 < α = 0.05, which also gives sufficient
evidence to reject the null. We will use p-values from now on when we can. Also note
the statement "** Correlation is significant at the 0.01 level." This is SPSS's notification
that you would reject the null hypothesis at this level of significance.
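For readers without SPSS, here is a minimal Python sketch (assuming the SciPy library is available) that reproduces the Pearson r and the two-tailed p-value reported in the output for this example:

    # Sketch: Pearson r and its two-tailed p-value for the Table 11.3 data.
    # Illustrative only; SPSS reports the same quantities.
    from scipy import stats

    x = [1, 2, 3, 4, 5]
    y = [1, 2, 4, 4, 5]

    r, p = stats.pearsonr(x, y)   # r is the sample correlation, p is Sig.(2-tailed)
    print(round(r, 3), round(p, 3))   # approximately 0.962 and 0.009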
A plot of the values of a set of data is called a scattergram. The scattergram of the data
in Table 11.3 is shown in Figure 11.2. Note that the data seems to slope upward, as is
reflected in the correlation coefficient r = 0.962.
Figure 11.2 The scattergram of the data in Table 11.3
Y 2 1 3 1 2
X 1 2 3 4 5
Table 11.4 Data for Example 11.4
Example 11.4 Is the data in Table 11.4 correlated? Test using a 1% level of significance.
First let’s look at the scattergram as shown in Figure 11.3
Figure 11.3 The scattergram of the data in Table 11.4
It is very difficult to say whether there is any relationship between x and y by an
inspection of the scattergram. If we look at the correlation analysis as shown in Figure
11.4, however, we find that
Figure 11.4 The correlation analysis of the data in Table 11.4
r = 0.00 and that p = 1.000 > α = 0.01. Using this evidence we could not reject the null
hypothesis for any level of significance. Table 11.5 gives some rules of thumb often
used in interpreting correlation coefficients. The coefficient can only take on values
between plus and minus one, and values near zero suggest that no correlation exists.
This table does not replace a hypothesis test, however.
Pearson correlation coefficient     Conclusion
r = 1                               Perfect positive correlation
r = -1                              Perfect negative correlation
r ≈ 1                               Evidence for the existence of positive correlation
r ≈ -1                              Evidence for the existence of negative correlation
r ≈ 0                               Evidence for the absence of correlation

Table 11.5 Rough interpretation of the values of the Pearson correlation coefficient.
Example 11.5 Conduct a correlation analysis of the data in Table 11.6. Here we have three
variables; you can conduct a correlation analysis on two or more variables. The SPSS output is
shown in Figure 11.5. Test using a 2% level of significance.
X 1 2 3 4 5
Y 1 2 1 2 2
Z 1 2 2 3 3
Table 11.6 Data for Example 11.5
Figure 11.5 The correlation analysis of the data in Table 11.6
The p-values suggest: X and Y are not correlated (p = 0.308 > α = 0.02); X and Z are
correlated (p = 0.015 < α = 0.02); and Y and Z are not correlated (p = 0.133 > α = 0.02).
If we had chosen a 1% level of significance we would have concluded that none of the
variables are correlated.
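As a check on the SPSS output, a short Python sketch (again assuming SciPy is available) can produce the same pairwise correlations and p-values for the data in Table 11.6:

    # Sketch: pairwise correlations and p-values for the data in Table 11.6.
    # Illustrative only.
    from itertools import combinations
    from scipy import stats

    data = {
        "X": [1, 2, 3, 4, 5],
        "Y": [1, 2, 1, 2, 2],
        "Z": [1, 2, 2, 3, 3],
    }

    for a, b in combinations(data, 2):
        r, p = stats.pearsonr(data[a], data[b])
        print(a, b, round(r, 3), round(p, 3))
    # The p-values (about 0.308, 0.015, and 0.133) match those discussed in the text.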
Example 11.6 A statistician with nothing better to do has collected 20 years of data on rum
prices and preachers' salaries. He puts the data into SPSS and obtains the scattergram
shown in Figure 11.6.
Figure 11.6 A scattergram of rum prices and preachers’ salaries
The SPSS correlation analysis is shown in Figure 11.7
Figure 11.7 The correlation analysis of rum prices and preachers’ salaries.
If we test

H_0: ρ = 0   (There is no correlation between rum prices and preachers' salaries)
H_A: ρ ≠ 0   (There is a correlation between rum prices and preachers' salaries)

with a 1% level of significance, we would reject the null hypothesis
(p = 0.007 < α = 0.01) and conclude that preachers' salaries and rum prices are
correlated.
When we say that two variables are linearly correlated we mean that they tend to change
together in a systematic manner. In the above example, an increase in preachers' salaries seems to
be related to increases in rum prices. This does not mean that increases in preachers'
salaries cause increases in rum prices (unless the preachers are out guzzling all the rum).
One suspects that both are increasing due to inflation. All correlation analysis does is tell
you whether two variables seem to be related in a linear fashion. The interpretation of why this
might be so is beyond the realm of statistics.
You will likely use correlation analysis heavily in finance courses, where the correlations
between rates of return of various assets are of interest. If two stocks have a high
correlation coefficient you can say there seems to be a relationship between their rates of
return, but this does not tell you why. Quite possibly you might not care why. A good
investment strategy is to pick a portfolio of assets whose returns are uncorrelated; in
that case, if the return of one asset declines the rest won't tend to follow suit. In the
next section we will look at a stronger assumption, one in which we believe that a change
in one variable actually causes changes in the second variable.
11.2 Causal Relationships.
In the previous section we examined the case where changes in one variable seemed to
coincide with changes in a second variable. If the correlation coefficient between the two
variables was large enough we could take that as evidence that the two variables were
related somehow. In this section we want to examine a stronger and more useful
relationship between two variables, one in which a change in one variable causes a
change in the other. Such a relationship is called a causal relationship. Suppose that we
are interested in the amount of corn produced, y, on a plot of ground as a function of the
amount of water used, x, on the plot. It seems plausible that increases in the amount of
water used should produce increases in the amount of corn produced. Note that this
relationship works in only one direction. More water produces more corn; more corn
does not produce more water. More corn cannot cause more water to occur. For this
reason the y variable is called the dependent variable and x is called the independent
variable. Restating this, the amount of y produced depends on the amount of x used but
the amount of x used is independent of y. We, or nature, control the amount of water
used; the corn is not in control. If y is sales and x is the value of the advertising budget,
then management may increase x to increase y. Management determines the amount of x
(advertising), which it hopes will produce the desired level of y (sales).
In this chapter we will examine the simplest causal relationship, a linear or straight-line
relationship. Recall from your math classes the formula for a straight line is something
like
y = mx + b
where b is called the intercept term and m is called the slope term. The intercept term
gives the amount of y produced when x = 0. The slope term indicates how much y will
change for a given change in x.
It is customary in statistics to use the symbol β_0 to designate the intercept term and β_1 to
designate the slope term. So the representation we will use for a straight line is

y = β_0 + β_1 x
There is a good reason for introducing this sort of notation. In a more advanced study of,
say, the factors that might affect consumer demand (y), we would include things like
gross national product (x_2), the price of the good (x_3), and the price of substitute goods
(x_4) in the equation, as well as advertising (x_1). In that case the relationship would be

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4
Note that the subscript on each β is the same as the subscript on its x; this makes it
easy to match the slope terms with their respective variables. In this chapter we will only
consider a single independent variable, x_1, and because we only have one of them we
will drop the subscript.
From now on we will call the equation

y = β_0 + β_1 x
a model. One of the things we will have to do is determine the values of the β terms; in
particular, we will be most interested in whether the β_1 term is equal to zero or not. If
β_1 = 0 then no relationship exists between x and y: y would always have the same
value regardless of the value of x. If y is gross domestic product and x is the phase of the
moon, we would suspect that no relationship exists between the two, and that the
term β_1 would be equal to zero.
So the first task we have is finding the values of the β terms. As an example of
the problem we face, let's consider a consumption function. If you have had
macroeconomics you may recall that Keynes suggested that current consumption is
determined by current income. So we could model a Keynesian consumption function
as

y = β_0 + β_1 x

where y is current consumption and x is current income. You might recognize that β_1 is
the marginal propensity to consume and β_0 is autonomous consumption. From your
statistics course you hopefully know that the β terms are parameters (the Greek letters).
In your economics course you may have used values like β_1 = 0.9 for the marginal
propensity to consume. Where did this value come from? Out of the mind of the author
of the text, that's where. The author was using an easy value to illustrate the basic operation
of the consumption function. The true value of the marginal propensity to consume is not
determined by textbook authors, nor is it determined by the government. It is determined
by us, the consumers. To find values for the β terms we will have to collect data on
income and consumption and use that data to find estimates of the β terms.
Suppose we were able to find two groups of households where every household in the
first group earns exactly $20,000 and every household in the second group earns exactly
$40,000 (in practice it is not necessary to collect data in this fashion, that is, in
identical groups; we do it here to illustrate a point). Not all households with incomes
of $20,000 will spend exactly the same amount. One family may spend $18,934.25, a
second might consume $19,224.21, a third some other amount, and so on. Consumption
for families in this income group will vary from household to household. The same will
hold true for families earning $40,000. In other words, consumption is a random variable.
Recall that random variables have probability distributions, which have means and
standard deviations. The situation is shown in the graph in Figure 11.8 where the
population of consumption values for the $20,000 group is labeled A and that of the
$40,000 labeled as B.
Figure 11.8 A scattergram of a consumption function
Assume for the moment that we know the mean and variance of consumption for both
these income groups (these are population parameters, and in practice you won't know
them). Suppose a household earning $20,000 was selected at random and you were asked
to predict consumption for that family. What answer would you give? It would be the
population mean, of course. Likewise for a $40,000 household. The best single-value
guess would be the population mean. If you knew the variance of the population you
could also assign a confidence interval to the guess. If the distribution were normal, you
could say you were 95% confident that a household picked at random would
consume in the range μ ± 1.96σ. There's nothing new here.
Suppose you were asked to make a prediction of consumption for a family earning
$30,000 a year. How might you do it? Well, you can't, unless you throw in an assumption.
Let's assume that the straight line passing through the means of the distributions for the
$20,000 and $40,000 groups also passes through the mean of the distribution for the
$30,000 group. Such a line is shown in Figure 11.9. If the assumption is true, then
the line passes through the mean of the consumption distribution for the $30,000 group,
and your best prediction for consumption would be the value of that mean.
Figure 11.9 A line passing through the means of distributions of consumption values for
given values of income.
Now would it be possible to find a 95% confidence interval for this prediction? The
answer is no, unless we throw in yet another assumption. We will assume that every
population distribution has exactly the same variance, σ². If this assumption is true then
we have two additional lines, one 1.96σ below the line passing through the means and
one 1.96σ above it. These lines are shown in Figure 11.10. Note that if these assumptions
are true then we can make predictions and calculate confidence intervals for any income
group. The assumptions that a straight line passes through the means and that the
variances are all the same may seem pretty bold, but they are the ones used in linear
regression. The reason for the term linear should now be clear; we are dealing with
straight lines.
Figure 11.10 95% confidence intervals about the population means
What we can't assume is that we know what the means and the variance are. If we can't
know them for a single distribution, how can we possibly know them for several (an
infinity, really) distributions? We will have to estimate these population parameters. We will do
this much as we have in the past. We will take sample data and try to infer the population
values based on the sample information. Next we want to see what kind of problems we
run into in this estimation process.
Recall that we want to find a line

y = β_0 + β_1 x

that passes through the means of the populations. Using sample data we can find a line
that approximates (estimates) this line. We will represent that line by

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x

called the regression line. Using our previous terminology, we will call the β terms
parameters (if we had population data we could calculate the β's exactly). The β̂ terms will
be the best point estimates of the β terms. Next let's consider some of the properties of
the β̂ terms.
Let's suppose we find two families, one with income x = $20,000 and the other with
income x = $40,000. Suppose the first family consumes $18,000 of its income and the
second family consumes $36,000. Only two points are needed to fit a straight line, so we
could infer the values β̂_0 = 0 and β̂_1 = 0.9 from these two points. Suppose we repeat the
experiment and collect data from two other families, which have incomes of $20,000 and
$40,000, respectively. Does it seem likely that these two families would consume exactly
the same amounts as the first two? It would be an amazing coincidence if they did.
These two families are likely to consume different amounts of their incomes, say $19,000
and $35,000. These values give β̂_0 = $3,000 and β̂_1 = 0.8. The situation we have just
described is shown in Figure 11.11, where line 1 and line 2 are the sort of lines we might
get.
Figure 11.11 Regression lines for different sets of data.
The β̂'s change from experiment to experiment. In other words, the estimates are random
variables. Like all random variables they have probability distributions with means and
standard deviations. What we will do with these random variables is the same thing we
have done with other random variables: construct confidence intervals and perform
hypothesis tests.
You might feel that we could get better estimates of the β's if we took large samples
from the $20,000 and $40,000 income groups. You would be correct in believing this.
You might also wonder how in the world we could find households that fit so
neatly into these categories. We might find one household that earned $33,124, another
household that earned $50,987, and yet a third that earned some other amount,
and so on. That is, we want to be able to analyze data like that shown in Figure 11.12.
The linear regression procedure can cope with this as long as the assumptions of linear
regression are met.
Figure 11.12 A scattergram of consumption—income data
11.3 The Assumptions of Linear Regression
1. A straight line y = β_0 + β_1 x passes through the means of a number of
normal populations.
2. Each population has the same variance σ².
3. Any one value of y is statistically independent of all other values of y.
The reasons for the first two of these assumptions have been discussed previously. The
reason for the third assumption is a bit subtle and will not be discussed here; as has been
suggested in previous chapters, however, the assumption of independence is made to
make things (relatively) simple. All of the results we will find in this chapter depend on
the assumptions being true. If any of the assumptions are false, the results will be
suspect. Unfortunately, these assumptions are frequently violated in practice.
The Least Squares Criterion
We want to be able to find the line that passes as close to the means of the populations as
we can, given the data that we have. Put another way we want to find the line that best
“fits” the data. Think of it like this. The line is to be used for prediction. If you have an
observation (a value of y at a certain x) then that observation represents something
known. A good predictive line should reproduce the known information as closely as
possible.
We are going to submit this data to a mathematical procedure, likely implemented on a
computer, and ask it to find the line that best reproduces the known results. The problem
is that mathematical procedures, and computers for that matter, are pretty dumb. They
don't know what “best” is, so we will have to tell them.
Let's try this as a criterion for "best". There is an infinity of lines that could be used to fit
the data. Some may fit it fairly well, some may give a perfectly rotten fit, and one may
give a better fit than any of its competitors. The line that works better than any of its
competitors is the one that we will say gives the "best" fit. Consider the following rule:
pick the line where the distances from the observations to the line are as small as
possible.
Let

\varepsilon_i = y_i^o - \hat{y}_i

where ε_i is the lower-case Greek letter "epsilon", y_i^o is the observed value of y at x_i, and
ŷ_i is the value of y taken from the line used to fit the data. So ε_i is how much the
observed value of y deviates from the regression line. Sampling error is another term used
for ε_i. An example of sampling error is shown in Figure 11.13.
Figure 11.13 Sampling error  i
According to our discussion of "best" it might seem that we should pick the line where

\sum_{i=1}^{n} \varepsilon_i

is a minimum. There is a problem with this scheme. Recall the discussion of the variance
in Ch. 3. In that discussion we had to square the differences between the mean and the
observed values; if we did not, the sum of these differences would always be zero.
Something similar occurs here. In fact there is an infinity of lines that can be found
where Σ ε_i = 0. The fix we will use here is to pick the line where

\sum_{i=1}^{n} \varepsilon_i^2

is a minimum. What we do, then, is pick the line that makes the sum of the squared error
terms a minimum. For this reason linear regression is often called ordinary least squares.
Restating this, we can say linear regression is a technique for finding β̂_0 and β̂_1 so that

\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i^o - \hat{y}_i \right)^2 = \sum_{i=1}^{n} \left( y_i^o - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2

is a minimum.

The values of β̂_0 and β̂_1 are fairly easy to calculate for small sets of data:

\hat{\beta}_1 = \frac{SS_{XY}}{SS_X}

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
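The least squares idea can also be checked numerically. The sketch below (an illustration only, assuming NumPy and SciPy are available) minimizes the sum of squared errors directly and compares the answer with the closed-form formulas above, using the data from Table 11.3:

    # Sketch: the least squares line found two ways for the Table 11.3 data.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.0, 2.0, 4.0, 4.0, 5.0])

    # Direct numerical minimization of the sum of squared errors.
    def sse(beta):
        b0, b1 = beta
        return np.sum((y - (b0 + b1 * x)) ** 2)

    numeric = minimize(sse, x0=[0.0, 0.0]).x

    # Closed-form least squares formulas from the text.
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
    ss_x = np.sum((x - x.mean()) ** 2)
    b1_hat = ss_xy / ss_x
    b0_hat = y.mean() - b1_hat * x.mean()

    print(numeric)           # approximately [0.2, 1.0]
    print(b0_hat, b1_hat)    # 0.2 and 1.0, the same line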
To test these formulae let's create some hypothetical data.
Consider a straight line (with no sampling error)
yx
and calculate some values of y for given values of x.
Y 1 2 3 4 5
X 1 2 3 4 5
Note that this is the first set of data that we examined in the section on the correlation
coefficient. There we found
SS_{XY} = 10 and SS_X = 10

x̄ = 3 and ȳ = 3, so that

\hat{\beta}_1 = \frac{SS_{XY}}{SS_X} = \frac{10}{10} = 1

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 3 - (1)(3) = 0

Note that this gives the regression equation

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 0 + (1)(x) = x
Because the regression procedure reproduces the equation used to create the data, this
should give you some faith that the regression procedure does work.
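If you prefer to let a library do the fitting, NumPy's polyfit routine reproduces the same line (a sketch for illustration only):

    # Sketch: checking the hand calculation with NumPy's least squares fit.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

    slope, intercept = np.polyfit(x, y, deg=1)   # fits y = slope*x + intercept
    print(intercept, slope)   # approximately 0 and 1, reproducing the line y = x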
SPSS also has a regression procedure. The regression procedure produces the following
output in Figure 11.4.
Figure 11.4 A SPSS printout of a regression analysis.
The column called Model lists two items: (Constant) and X. The term (Constant) refers
to β̂_0 and the term X refers to β̂_1. There is also a column called Standardized
Coefficients; forget about it. We are interested in the column called Unstandardized
Coefficients. The column labeled B contains the values of β̂_0 and β̂_1. The value of β̂_0 is
found where the row labeled (Constant) intersects the column labeled B. Likewise, the
intersection of the row labeled X with column B gives β̂_1. So using this table we find
β̂_0 = 0 and β̂_1 = 1.
Recall that β̂_0 and β̂_1 are random variables, and random variables have standard
deviations. Call these s_{β̂_0} and s_{β̂_1}. These values are located in the column labeled Std.
Error, immediately to the right of the values for β̂_0 and β̂_1. In this case the values of the
standard deviation (or standard error) are both zero because we have no sampling error.
The columns labeled t and Sig. are used for hypothesis testing. You might not be too
terribly surprised to learn that Sig refers to a p-value.
11.4 Statistically Significant Relationships
One of our primary concerns when using regression analysis is to determine whether the
slope term β_1 is equal to zero or not. Suppose that β_1 = 0; then

y = β_0 + β_1 x = β_0 + (0)x = β_0

and changes in x do not cause changes in y. We have

β_1 = 0 (this means that changes in x do not cause changes in y)
β_1 ≠ 0 (this means that changes in x cause changes in y)

To try to determine whether β_1 has a value of zero or not we will perform the following
hypothesis test:

H_0: β_1 = 0   (Y does not depend on X)
H_A: β_1 ≠ 0   (Y does depend on X)

Test statistic: t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} with (n - 2) df.

If we reject the null hypothesis we will say that a statistically significant relationship
exists between x and y. The number of degrees of freedom for the regression procedure is
given by the formula

df = n - (the number of β coefficients in the model)

For the problem we have been looking at so far there are two β coefficients, β_0 and β_1,
so we have df = n - 2 degrees of freedom.
Let's examine another set of data that we've seen before:
Y 1 2 4 4 5
X 1 2 3 4 5
Table 11.7 The data for Example 11.6
The SPSS printout is

Figure 11.5 The regression analysis for Example 11.6.

where we have

β̂_0 = 0.200
β̂_1 = 1.000
s_{β̂_0} = 0.542
s_{β̂_1} = 0.163
Let's use this information to determine whether we believe a statistically significant
relationship exists between x and y.
Example 11.6 Does a statistically significant relationship exist between x and y for the
data in Table 11.7? Test using a 5% level of significance. The SPSS output is shown in
Figure 11.5.
H_0: β_1 = 0   (Y does not depend on X)
H_A: β_1 ≠ 0   (Y does depend on X)

α = 0.05
df = 5 - 2 = 3
Reject H_0 if t < -3.182 or t > 3.182

Test statistic: t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{1.0}{0.163} = 6.135

The test statistic is in the rejection region, so we will reject the null hypothesis and say
that a statistically significant relationship exists between x and y. The p-value
is p = 0.009 < α = 0.05, which also provides sufficient evidence to reject the null
hypothesis of no relationship.
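The same test statistic and p-value can be reproduced outside SPSS. The sketch below (illustrative only, assuming NumPy and SciPy are available) computes the standard error of the slope, the t statistic, and the two-tailed p-value for the data in Table 11.7:

    # Sketch: the t test on the slope for the data in Table 11.7.
    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.0, 2.0, 4.0, 4.0, 5.0])
    n = len(x)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    residuals = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # estimate of sigma
    se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))     # standard error of the slope

    t = b1 / se_b1
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(round(se_b1, 3), round(t, 3), round(p, 3))
    # about 0.163, 6.12, and 0.009, matching the SPSS results up to rounding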
Another way of determining whether a statistically significant relationship exists is to calculate
a confidence interval for β_1.

A (1 - α)100% confidence interval for β_1 is

\hat{\beta}_1 \pm t_{\alpha/2} \, s_{\hat{\beta}_1}
Now suppose the resulting CI includes the value zero. Then we would conclude that a
statistically significant relationship does not exist. For example, suppose we calculated a
95% CI for β_1 and found the interval to be [-1, 3]. This would mean we are 95%
confident that the slope has a value between -1 (a negative slope) and +3 (a positive
slope). So we could be pretty certain that the slope is either positive or negative. Well, I
am 100% confident of that without ever looking at any data. A confidence interval that
includes zero for a slope term tells you that you have no idea whether the line slopes up
or down.

Suppose on the other hand the CI had been [1, 3]. In this case you could be pretty
confident that the line slopes up. We will say a statistically significant relationship exists if
the CI for β_1 does not include zero.
Example 11.7 Use a CI to determine if a statistically significant relationship exists
between x and y using the data in Table 11.7. Use the same level of significance as in
Example 11.6. Compare the result with that of Example 11.6.
In Example 11.6 we used a 5% level of significance. To be consistent with that test we
would compute a 95% confidence interval and check to see if it contained zero. From the
SPSS printout we have
ˆ1  1.0
sˆ  0.163
1
For a 95% CI with df  5  2  3 , t  3.182 (note we use the same value of t in both
cases)
ˆ1  t / 2 sˆ  1.0  3.182  0.163  1.0  0.519  0.481,1.519
1
Because the resulting interval does not contain the value zero we will conclude that a
statistically significant relationship exists between x and y. Recall that we also claimed a
statistically significant relationship existed for the same data in Example 11.6. There we
used a hypothesis test and found that the test statistic was in the critical region and also,
what is the same thing, that p   . So we have three ways of deciding if a statistically
significant relationship exists. Don’t worry there will be a fourth way later on.
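The confidence interval calculation is also easy to script. Here is a minimal sketch (assuming SciPy is available) using the point estimate and standard error from the SPSS output:

    # Sketch: a 95% confidence interval for the slope, using the values above.
    from scipy import stats

    b1_hat = 1.0       # point estimate of the slope from the SPSS output
    se_b1 = 0.163      # its standard error
    df = 5 - 2

    t_crit = stats.t.ppf(0.975, df)                   # about 3.182
    half_width = t_crit * se_b1
    print(b1_hat - half_width, b1_hat + half_width)   # about (0.481, 1.519)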
11.5 Prediction using Regression
After we have determined that a statistically significant relationship exists, we can use the
regression equation for prediction. The regression output above gives the following
regression equation (plug in the appropriate values of β̂_0 and β̂_1 from Figure 11.5):

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 0.200 + (1.0)x

Now if you want to make a prediction of y for a given value of x, say x = 10, use that value
of x in the regression equation:

\hat{y} = 0.200 + (1.0)(10) = 0.200 + 10.0 = 10.2
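As a quick sketch, the same prediction in Python (purely illustrative):

    # Sketch: using the fitted equation for prediction at x = 10.
    b0_hat, b1_hat = 0.200, 1.0     # coefficients from the regression output
    x_new = 10
    y_hat = b0_hat + b1_hat * x_new
    print(y_hat)   # 10.2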
11.6 The Coefficient of Determination
Some people think the coefficient of determination is a useful indicator of whether or not
a regression equation is a good one. In particular they like to use it to compare different
regression equations. I believe it to be a somewhat useful qualitative measure.
Figure 11.6. The decomposition of the variance into sums of squares.
Note that we can measure how y_i^o varies about the mean, ȳ, of the y observations in a
fashion somewhat similar to a variance or sum of squares. Now define

SST = \sum_{i=1}^{n} \left( y_i^o - \bar{y} \right)^2 = \text{Total Sum of Squares}

This variation about the mean can be broken down into two parts:

\left( y_i^o - \bar{y} \right) = \left( y_i^o - \hat{y}_i \right) + \left( \hat{y}_i - \bar{y} \right)

Let's give some names to the other terms:

SSR = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 = \text{Regression Sum of Squares}

and

SSE = \sum_{i=1}^{n} \left( y_i^o - \hat{y}_i \right)^2 = \text{Error Sum of Squares}

SSR measures the variation of the regression line about the mean, and SSE measures the
variation of the y observations about the regression line.

It can be shown (with a whole lot of effort) that

SST = SSR + SSE

Let's rewrite this as

SSR = SST - SSE

and then divide both sides of the equation by SST, so that

\frac{SSR}{SST} = \frac{SST}{SST} - \frac{SSE}{SST} = 1 - \frac{SSE}{SST}

Now define

R^2 = \frac{SSR}{SST}

so that

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

The term R² is called the coefficient of determination.
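The decomposition SST = SSR + SSE and the resulting R² can be verified numerically. The sketch below (illustrative only, assuming NumPy is available) does this for the data in Table 11.3:

    # Sketch: the sums of squares and R squared for the data in Table 11.3.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.0, 2.0, 4.0, 4.0, 5.0])

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    sst = np.sum((y - y.mean()) ** 2)
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)

    print(round(sst, 3), round(ssr, 3), round(sse, 3))   # 10.8, 10.0, 0.8 (SST = SSR + SSE)
    print(round(ssr / sst, 3))                           # R squared, about 0.926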
Figure 11.7 A perfect fit to the data
Now let's consider a situation where the regression line passes through the data exactly.
That is, the regression line gives a perfect fit to the data, as shown in Figure 11.7. In that
case

y_i^o = \hat{y}_i

for every observation, so

SSE = \sum_{i=1}^{n} \left( y_i^o - \hat{y}_i \right)^2 = \sum_{i=1}^{n} (0)^2 = 0

and then

R^2 = 1 - \frac{SSE}{SST} = 1 - 0 = 1

Thus a regression line that perfectly fits the data gives a value R² = 1. This is also
consistent with the situation where β_1 ≠ 0.
Now let's look at a situation where β_1 = 0 and there is no straight line to fit to the data. In
this case the regression line and the mean of the y observations are the same line. If you
make a prediction for y you can ignore the value of x and just use the mean.

Figure 11.8 A poor fit to data.

Figure 11.8 shows a scattergram where there is a poor fit to the data. In that case

\hat{y}_i = \bar{y}

so

SSR = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 = \sum_{i=1}^{n} (0)^2 = 0

and a value of R² = 0 is consistent with the situation of no statistically significant
relationship between x and y.
Figure 11.9 R² analysis from SPSS for the data in Table 11.3.
SPSS computes the coefficient of determination in its regression procedure. In SPSS,
R² appears as R Square in Fig. 11.9. The printout in Fig. 11.9 also contains a column labeled
Adjusted R Square; the Adjusted R Square will apply to the material in Chapter 12. The
column labeled Std. Error of the Estimate is an estimate of σ, the standard deviation of
the populations. It is possible to test whether a statistically significant relationship
exists between x and y using R² by computing

r = \sqrt{R^2}

and using the test for the correlation coefficient in Sec. 11.1, The Pearson Correlation
Coefficient:

H_0: ρ = 0   (no statistically significant relationship exists between x and y)
H_A: ρ ≠ 0   (a statistically significant relationship exists between x and y)

α = 0.05
Reject H_0 if t < -t_{α/2} or t > t_{α/2}

Test statistic: t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}}
We will not use this test here. A more convenient test is to use the ANOVA table
computed by SPSS. Once again consider the data given in Table 11.3 and use that data in
a regression equation. The results are shown in Fig. 11.10. The ANOVA table is used to
test the hypothesis

H_0: β_1 = 0   (No statistically significant relationship exists)
H_A: β_1 ≠ 0   (A statistically significant relationship exists)

using the p-value from the ANOVA table. It is, as before, found in the column labeled
Sig. For the regression in Fig. 11.10, p = 0.009 and we would reject the null hypothesis
of no statistically significant relationship for any α > 0.009. Note that this is exactly the
same value of p found in the table containing information about the β's. That is because
the hypothesis is exactly the same. Things will change in the next chapter, where we have
more than one independent variable.
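For readers who want to see where the ANOVA table comes from, the sketch below (illustrative only, assuming SciPy is available) builds the F statistic from the sums of squares computed earlier for the Table 11.3 data. For simple regression, F is just the square of the t statistic on the slope, which is why the two p-values agree:

    # Sketch: the ANOVA F test for the regression of Table 11.3,
    # and its link to the t test on the slope.
    from scipy import stats

    ssr, sse = 10.0, 0.8      # regression and error sums of squares (computed earlier)
    df_reg, df_err = 1, 3     # one slope term; n - 2 = 3 error degrees of freedom

    F = (ssr / df_reg) / (sse / df_err)
    p = stats.f.sf(F, df_reg, df_err)
    print(F, p)   # F = 37.5, the square of the earlier t statistic; p is about 0.009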
Figure 11.10. The regression output for the data in Table 11.3
Example 11.8 Use the following SPSS printout to determine if a statistically significant
relationship exists. Use (a) a test on β_1 using the t statistic, (b) a confidence interval for
β_1, (c) the p-value from the coefficient table, and (d) the p-value from the ANOVA table.
If a statistically significant relationship exists, use the relationship to forecast a value for
y if x = 50. Use the regression output in Fig. 11.11 and a 5% level of significance.
Figure 11.11 The regression output for Example 11.8.
From the printout we find

β̂_1 = 2.087
s_{β̂_1} = 0.061
p = 0.000
(a) Use a hypothesis test on β_1 with the t statistic to determine if a statistically
significant relationship exists.

H_0: β_1 = 0   (Y does not depend on X)
H_A: β_1 ≠ 0   (Y does depend on X)

α = 0.05
df = 30 - 2 = 28
Reject H_0 if t < -2.048 or t > 2.048

Test statistic: t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{2.087}{0.061} = 34.271

so we reject the hypothesis that y does not depend on x, using the fact that
t = 34.271 > 2.048.
(b) Test using a 95% CI for β_1.

\hat{\beta}_1 \pm t_{\alpha/2} \, s_{\hat{\beta}_1} = 2.087 \pm (2.048)(0.061) = 2.087 \pm 0.129 = [1.958, 2.216]

Note that the CI does not include the value zero, so we would reject the null hypothesis
on that basis. The small width of the CI also gives assurance that we have fairly good
knowledge of the value of β_1.
Note: The data for this example were created using the formula

y = β_0 + β_1 x + ε = 20 + 2x + ε

where the sampling error ε was simulated using a random number
generator provided by SPSS. The random number generator was used to
create normally distributed random values with μ = 0 and σ = 3. The true
value of the slope term is β_1 = 2, and the point estimate is β̂_1 = 2.087.
The true value of the standard deviation is σ = 3, and the regression
procedure gives s = 2.8875 as an estimate. It would seem that the
regression procedure is producing pretty good estimates.
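The simulation described in the note can be reproduced outside SPSS. The sketch below is illustrative only; the seed, sample size, and x values are my own choices, and only the model y = 20 + 2x + ε comes from the note. It generates similar data in Python and fits the regression line:

    # Sketch: simulating y = 20 + 2x + epsilon and fitting the regression line.
    import numpy as np

    rng = np.random.default_rng(seed=1)            # seed chosen for illustration
    n = 30
    x = np.linspace(1, 30, n)                      # assumed x values
    eps = rng.normal(loc=0.0, scale=3.0, size=n)   # sampling error, mean 0, sigma 3
    y = 20 + 2 * x + eps

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(b0, b1)   # the estimates should land near the true values 20 and 2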
(c) Use the p-value from the coefficient table to test the hypothesis:

p = 0.000 < α = 0.05

so we can reject the null hypothesis on that basis.

(d) Use the p-value from the ANOVA table to test the hypothesis:

p = 0.000 < α = 0.05

and we would reject the hypothesis on this basis. Note that the p-value is the same for
both (c) and (d). While we are not going to use it in a hypothesis test, the value of R
Square is R² = 0.977, which is very near one. This would make us suspect the
existence of a statistically significant relationship.

Now that we have determined that a statistically significant relationship exists, we will use
it for prediction at a value of x = 50. The regression equation is

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 18.061 + 2.087(50) = 122.411

and the best prediction we can make for y is 122.411.
Example 11.9 Use the following SPSS output to determine if a statistically significant
relationship exists. If so, use the relationship to forecast a value for y if x = 50. Use a
5% level of significance. The SPSS printout is in Fig. 11.12.

(a) A test using the t statistic

H_0: β_1 = 0   (Y does not depend on X)
H_A: β_1 ≠ 0   (Y does depend on X)

α = 0.05
df = 30 - 2 = 28
Reject H_0 if t < -2.048 or t > 2.048

Test statistic: t = \frac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \frac{0.0873}{0.061} = 1.434

The t statistic is not in the rejection region, so we do not have sufficient evidence to
suggest that a statistically significant relationship exists.

(b) A 95% CI for β_1 is

\hat{\beta}_1 \pm t_{\alpha/2} \, s_{\hat{\beta}_1} = 0.0873 \pm (2.048)(0.061) = 0.0873 \pm 0.125 = [-0.038, 0.212]

The CI includes the value zero, so this does not give sufficient evidence to claim that a
statistically significant relationship exists.
Figure 11.12 The regression output for Example 11.9
(c) Use the p-value from the coefficient table:

p = 0.163 > α = 0.05

and this does not provide evidence to reject the null.

(d) Use the p-value from the ANOVA table:

p = 0.163 > α = 0.05

and this does not provide evidence to reject the null. We could also note here that
R² = 0.068 is very small and does not lead one to suspect a statistically significant
relationship.
We will not use the regression equation for prediction because a statistically significant
relationship does not exist.
Causal and Statistically Significant Relationships
We started our study of regression analysis by considering causal relationships, ones
where a change in x caused a change in y. We ended by discussing statistically
significant relationships, ones where we rejected the hypothesis that the slope term had a
zero value. Now a causal relationship should certainly produce data that give a
statistically significant relationship. Changes in x that produce changes in y should reveal
a slope term not equal to zero. But is the reverse true? If a set of data produces a
statistically significant relationship does that mean that it is causal?
Recall the rum price and preachers’ salary example in the section on correlation analysis.
I suspect that if you took data on preachers’ salaries and rum prices you would find that a
statistically significant relationship existed. But would this mean the relationship is
causal? Would an increase in preachers’ salaries cause rum prices to increase? Yes, if the
preachers were buying most of the rum. What is most likely going on is that inflation is
making the price of everything, including preachers’ labor, increase.
If a statistically significant relationship does not exist between two variables, then it is
very doubtful that a causal relationship exists either. If a statistically significant relationship
exists between two variables, then a causal relationship may exist as well. I know of no way to
determine with certainty whether a causal relationship exists.
Once upon a time there existed the sunspot theory of economics. The general idea was
that when sunspots were plentiful there would be lots of rain. Lots of rain meant good crops.
Good crops meant good income for farmers, and economic activity would be good. A
statistically significant relationship existed between sunspots and economic activity.
Sunspots could then be used to forecast economic activity.

The problem is that no causal relationship existed between the two. The statistically
significant relationship occurred only because of coincidence at the time the data was
taken. The relationship broke down and the theory was discredited.
Suppose you wanted to use the amount spent on advertising as a means of predicting
sales. At the time the data was acquired, a statistically significant relationship might
well exist between advertising and sales. This relationship might break down at a later
time due to increases in competition, reductions in the prices of substitutes, or a host of other
reasons. Your prediction of sales based on a certain advertising budget might well be
perfectly rotten.
When using regression analysis (or any other statistical technique) the best advice is to be
careful. Statistics is only a guide to decision making and should not be the only factor
considered.
Problems
11.1 Calculate the correlation coefficient for the following set of data. Use a hypothesis
test to determine if there is a relationship between x and y (test the hypothesis
ρ = 0).
Y 1 -1 1 -1
X 1 2 3 4
11.2 The SPSS output for the data in problem 11.1 is shown below. Use the p-value to test
whether a relationship exists between x and y.
11.3 Calculate the correlation coefficient for the following set of data. Use a hypothesis
test to determine if there is a relationship between x and y (test the hypothesis
ρ = 0).
Y 1 2 2.5 4.1
X 1 2 3 4
11.4 The SPSS output for the data in problem 11.3 is shown below. Use the p-value to
test whether a relationship exists between x and y.
11.5 The SPSS printout below is a regression analysis of the effect of water on the
production of corn. Water is applied to certain acre patches where corn has been planted.
Corn is measured in terms of bushels per acre. Water is measured in terms of acre-feet of
water. Determine if a statistically significant relationship exists between corn and water.
Can you use the relationship to predict corn production? If so, make a prediction of the
corn yield produced by 10 acre-feet of water. Use a 5% level of significance. Test using
all the methods we have discussed.
11.6 The scattergram of (aggregate) disposable personal income (dpi) and household
consumption for the US for the time period 1960-2000 is shown below.
It certainly looks as if a linear relationship exists between income and consumption. The
SPSS regression analysis is given next. Use it to determine if a statistically significant
relationship exists between income and consumption. What is the best guess for the
value of the marginal propensity to consume?
11.7 Someone has proposed that the amount of money spent on advertising greatly
affects the sales of a certain product. Data is collected on sales (in units of
millions of dollars) and advertising (in terms of millions of dollars). The SPSS
regression results are shown below. Do these data support the contention that the
amount of advertising dollars serves as a good predictor of sales revenue? Test
using a 10% level of significance. Would this information lead you to believe
that a causal relationship exists between advertising expenditures and sales?