4. DIRECTION OF CAUSATION

In chapter 3, we learned how to regress the output of wheat against the amount of
fertilizer per acre. Does it make any difference whether we regress wheat against fertilizer or
fertilizer against wheat? And if so, how do we decide which variable is on the left-hand side of
the equation and which is on the right-hand side of the equation? And if we can’t determine
which variable belongs on which side, what do we do then?
A. INDEPENDENT vs. DEPENDENT VARIABLES
In ordinary mathematics the following equations are identical:
1. Y = A + BX
2. (Y − A)/B = X
3. X = (Y − A)/B = Y/B − A/B = −A/B + (1/B)Y
4. X = A′ + B′Y, where A′ = −A/B and B′ = 1/B
However, this is not generally true in regression: the least squares technique will give different answers depending on whether we estimate Ŷ = A + BX or X̂ = A′ + B′Y. When we regress Y against X, we minimize ∑(Yi − Ŷi)²; that is, we minimize the sum of the squared vertical distances between the line and the observations. However, when we regress X against Y, we minimize ∑(Xi − X̂i)², which is equivalent to minimizing the sum of the squared horizontal distances between the line and the observations.
All of this can be understood in terms of the least-squares formulae. To hammer the
ideas home, let us consider the following example:
Xi         Yi          (Xi − X̄)(Yi − Ȳ)          (Xi − X̄)²         (Yi − Ȳ)²
1          6           (1 − 2)(6 − 4) = −2        (1 − 2)² = 1       (6 − 4)² = 4
2          2           (2 − 2)(2 − 4) = 0         (2 − 2)² = 0       (2 − 4)² = 4
3          4           (3 − 2)(4 − 4) = 0         (3 − 2)² = 1       (4 − 4)² = 0
∑Xi = 6    ∑Yi = 12    ∑(Xi − X̄)(Yi − Ȳ) = −2    ∑(Xi − X̄)² = 2    ∑(Yi − Ȳ)² = 8

Here X̄ = 2 and Ȳ = 4, so

B = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)² = −2/2 = −1
A = Ȳ − BX̄ = 4 − (−1)(2) = 6

Ŷ = 6 − X. This regression can be seen in Figure 4:1A.
If we regress X against Y we substitute X for Y and Y for X in the least squares formula as
X and Y "change places."
X̂ = A′ + B′Y

B′ = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Yi − Ȳ)² = −2/8 = −1/4

A′ = X̄ − B′Ȳ = 2 − (−1/4)(4) = 3

X̂ = 3 − (1/4)Y, or equivalently Y = 12 − 4X̂. This can be seen in Figure 4:1B.
[Figure 4:1. Panel A: the regression of Y on X, Ŷ = 6 − X. Panel B: the regression of X on Y, plotted as Y = 12 − 4X̂. Both panels plot Y (vertical axis) against X (horizontal axis).]
Hence, when we regress Y against X we get a different line than when we regress X against Y. When Y is regressed against X, the least squares line minimizes the sum of squared errors between Yi and Ŷi. In contrast, when X is regressed against Y, the least squares line minimizes the sum of squared errors between Xi and X̂i.
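To make this concrete, here is a minimal sketch in Python (using NumPy, with the data from the example above; the variable names are only illustrative) that computes both least squares lines directly from the formulas:

import numpy as np

# Data from the worked example above
X = np.array([1.0, 2.0, 3.0])
Y = np.array([6.0, 2.0, 4.0])

x_dev = X - X.mean()   # deviations from the sample mean of X
y_dev = Y - Y.mean()   # deviations from the sample mean of Y

# Regress Y on X: minimize the squared vertical distances
B = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
A = Y.mean() - B * X.mean()
print(A, B)                      # 6.0 -1.0, i.e. Y-hat = 6 - X

# Regress X on Y: minimize the squared horizontal distances
B_prime = (x_dev * y_dev).sum() / (y_dev ** 2).sum()
A_prime = X.mean() - B_prime * Y.mean()
print(A_prime, B_prime)          # 3.0 -0.25, i.e. X-hat = 3 - Y/4

# B_prime is not 1/B, so the two fitted lines differ
print(B_prime, 1 / B)            # -0.25 vs -1.0

Note that both fitted lines pass through the point of sample means (X̄, Ȳ) = (2, 4), but they have different slopes.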
Except when all the observations fall on the same line, it does make a difference whether we regress Y against X or X against Y. How, then, does one choose the correct regression? We regress Y on X when X is the independent (exogenous) variable and Y the dependent (endogenous) variable (and X against Y when the reverse is the case). Typically in econometrics, X stands for the independent variable and Y for the dependent one, unless otherwise specified. The independent variable is plotted on the horizontal axis and the dependent variable on the vertical axis.
It is not always easy to know which is the independent and which is the dependent variable.
The determination often involves considerable knowledge on the part of the researcher and an
understanding of certain philosophical issues. A good rule of thumb is the following: If the
researcher has a prior belief that one variable depends on the other but not vice versa then the
former is the dependent variable and the latter is the independent variable. The determination of
which variable is the independent variable and which variable is the dependent variable is based
on the theoretical relationship to be analyzed. If the theory suggests a causal relationship, then
the regression model treats the dependent variable as being caused by the independent
variable(s). For example, our rudimentary knowledge of meteorology would suggest that the quantity of wheat grown is the dependent variable and the amount of sunshine is the independent variable, as sunshine causes wheat to grow and not the other way around. In the example presented above, the researcher determined the amount of fertilizer to
apply to each plot and then observed the bushels of wheat per acre. So the amount of fertilizer
applied determined the bushels per acre. Fertilizer per acre is the independent variable and
bushels per acre is the dependent variable. The bushels of wheat per acre did not determine the
amount of fertilizer that the researcher applied. Here is one more example to drive the point home: income is the independent variable and the price of the car purchased is the dependent variable.
Making more money provides the wherewithal to buy an expensive car; buying an expensive car
will not make you rich (if anything, it will make you poorer).
A more nuanced understanding of regression is gained by considering situations when there
are several variables. Going back to the wheat example, the number of bushels of wheat
produced per acre may depend on both the amount of rainfall that falls on a plot of land and the
amount of fertilizer that is applied. Assuming that the fertilizer choice is not affected by the
amount of rainfall, both rainfall and fertilizer are the independent (right-side) variables because
each is independent of the other, while bushels per acre is the dependent (left-side) variable
because it is not independent of the other two variables. Now it is true that the amount of
fertilizer is likely to depend on other variables (for example, whether the fertilizer spreader is
accurate), but as long as these other variables affect wheat production only through the amount
of fertilizer, fertilizer can be treated as the independent variable in finding the relationship
between fertilizer and wheat production.
In other cases, both variables are a priori felt to be dependent on each other or dependent on
a third variable. As an example, except for Rastafarians, it appears that there is a negative or
inverse relationship between going to church and smoking marijuana. It is not clear whether
going to church discourages one from smoking marijuana, smoking marijuana discourages one
from going to church, or whether both depend on a third variable such as parental attitudes. In
this case we would use correlation (to be explained below) instead of regression to find the
relationship between church going and smoking marijuana (or, when appropriate, use
simultaneous equation techniques to disentangle the interrelationships—see Chapter _). As
another example, the number of cell phones and the number of plasma screen televisions have
both increased over the last 30 months. But neither has caused the other. Instead both are due to
other factors (technological change, increase in income, etc.).
It should be noted that I have stressed the notion of a priori belief concerning dependent and
independent variables. The data cannot tell us which is the dependent variable or the direction of
causation. This is clear from the example of first regressing Y against X and then regressing X
against Y. In either case we get a regression line. Therefore (in the absence of some subsidiary
assumptions such as causation does not work backward in time), how could the present data tell
us the direction of causation?
B. THE ARROW OF CAUSATION
Once we view the right-hand variable as having an effect on the left-hand variable, we need to be very careful that this is indeed the case. Getting the direction wrong is a much more serious mistake than merely obtaining an incorrect coefficient: it means the researcher is mistaken about the arrow of causation. If one regresses the amount of sunshine on the amount of wheat produced, it is incorrect to assert that growing more wheat will increase the amount of sunshine.
Even if they have not taken a course in statistics, most people have a basic understanding
regarding cause and effect. Nevertheless, researchers sometimes are confused about the direction of causation. For example, consider a recent study by Stanley Kurtz, a social anthropologist (http://judiciary.house.gov/legacy/kurtz042204.htm), that was reported in a number of newspapers. Kurtz argued that allowing homosexuals to marry increased the number of out-of-wedlock births by heterosexuals. The author had looked at data from the Netherlands. He observed that out-of-wedlock births increased dramatically starting in 1996, when a law allowing marriage between homosexuals was introduced and debated, and continued to increase after 2000, when the law was fully implemented. He therefore concluded that the law caused heterosexuals
to have children outside of marriage. There are two problems with his analysis. First, he is basing
his conclusion on a limited sample (one country and one change in the law over a small number
of years). More relevant to the discussion in this section, the more plausible explanation is that the change in the law and the increase in out-of-wedlock births are both due to changes in attitudes about marriage, not that allowing homosexual marriage makes heterosexuals want to have
children outside of marriage. Even if you don’t undertake any econometric research, you should
always be prepared to question newspaper reports of research findings.
Let us consider one final example. To sharpen your wits, try to provide a counter-explanation before you read mine. Does watching "The Daily Show" with Jon Stewart
make the viewer more cynical? Suppose that the following was provided as evidence: People
who watch this cynical show are more cynical (in responding to a questionnaire) than those who
don’t watch the show.
The problem with this simple test is that cynical people are more likely to watch the show in
the first place. That is, the arrow of causation may be in the opposite direction. Actually, a
researcher did ask this question (Diana Mutz, 2004, “Comedy or News? Viewer Processing of
Political News”, Paper presented at the 3rd Annual Pre-APSA conference on Political
Communication, September 1, 2004. Chicago), but she got around the problem of two-way
causation by first dividing students into two groups: those who had watched the “Daily Show” in
the past and those who had not. Then the researcher further divided each of these groups into two, making one group watch "The Daily Show" and making the other group watch a regular news show. In this way, watching the TV show during the experiment was the independent variable, as the researcher chose which subject would watch which TV show, while cynicism was the dependent variable. She then asked all of the students to fill out a questionnaire. Those asked to watch "The Daily Show" were more cynical. In the "Daily Show" example, the experimenter caused the students to watch the show or a regular news format; the students' cynicism did not make the researcher choose which show a student would watch.
In a nutshell, make sure that you know the direction of causation. That is, make sure that the variable on the left-hand side of the equation is the dependent variable and the variable on the right-hand side is the independent variable, and not the reverse. Switching not only gives different
coefficients, but more important, may lead to incorrect conclusions about which variable is the
cause and which variable is the effect.
C. SAMPLE CORRELATION
If we do not know which is the independent and which is the dependent variable, or if we
believe that they are jointly dependent on each other or on a third variable, then we use the
sample correlation coefficient denoted by the symbol R.
4:1 DEFINITION:

R = ∑(Yi − Ȳ)(Xi − X̄) / √[∑(Xi − X̄)² ∑(Yi − Ȳ)²]

(all sums running from i = 1 to N) is the sample correlation coefficient.
R is the geometric average of the slopes found in regressing Y on X (Ŷ = A + BX) and X on Y (X̂ = A′ + B′Y):
BB′ = [∑(Yi − Ȳ)(Xi − X̄) / ∑(Xi − X̄)²] · [∑(Yi − Ȳ)(Xi − X̄) / ∑(Yi − Ȳ)²]
    = [∑(Yi − Ȳ)(Xi − X̄)]² / [∑(Xi − X̄)² ∑(Yi − Ȳ)²]
    = R²
Thus the formula for R reflects the fact that we do not know which variable is the independent
one and therefore we treat the variables symmetrically and choose the (geometric) average slope.
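As a quick numerical check, here is a short Python sketch (again using the three observations from the example earlier in this chapter) confirming that the product of the two regression slopes equals R²:

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([6.0, 2.0, 4.0])
x_dev, y_dev = X - X.mean(), Y - Y.mean()

B = (x_dev * y_dev).sum() / (x_dev ** 2).sum()        # slope from regressing Y on X: -1
B_prime = (x_dev * y_dev).sum() / (y_dev ** 2).sum()  # slope from regressing X on Y: -1/4

R = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())
print(R)                                  # -0.5
print(B * B_prime, R ** 2)                # 0.25 0.25: BB' equals R squared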
Sample correlation is a measure of the linear relationship between the two variables in the sample. If there is no linear relationship, then the sample correlation coefficient (R) is equal to 0. If the linear relationship in the sample is perfect, then R is equal to 1 or −1. The intuition can best be grasped by looking again at the formula and then at some examples.
4:2

R = ∑(Yi − Ȳ)(Xi − X̄) / √[∑(Xi − X̄)² ∑(Yi − Ȳ)²]
  = [∑(Yi − Ȳ)(Xi − X̄)/N] / √{[∑(Xi − X̄)²/N] · [∑(Yi − Ȳ)²/N]}
  = SAMPLE COV(X, Y) / √[SAMPLE VARIANCE(X) · SAMPLE VARIANCE(Y)]
The sample correlation, R, is the sample covariance or joint linear variability standardized (i.e.,
divided) by the square root of the individual variances. Ignoring the minus sign, R tells us what
percent of the total variability of X and Y is a linear covariability between X and Y.
If the sample covariability between X and Y is zero (i.e., sample Cov(X, Y) = 0), then knowing that X is greater than X̄ tells us nothing about whether Y is greater or less than Ȳ. Thus the sample correlation is zero.
What if there were a perfect linear relationship between X and Y (i.e., all of the
observations fell on a line)? Then each Yi = Ŷi = A + BXi (so that Ȳ = A + BX̄), and R would equal the following:

R = ∑(A + BXi − A − BX̄)(Xi − X̄) / √[∑(Xi − X̄)² ∑(A + BXi − A − BX̄)²]
  = B∑(Xi − X̄)(Xi − X̄) / √[∑(Xi − X̄)² · B²∑(Xi − X̄)²]
  = B/|B| = 1 (or −1 if B is negative)

Of course, if Yi = Xi, a glance at the formula for R will show that R = 1.
EXAMPLE 1:

Xi         Yi           (Xi − X̄)        (Yi − Ȳ)             (Xi − X̄)²   (Yi − Ȳ)²   (Xi − X̄)(Yi − Ȳ)
1          −3           (1 − 2) = −1     (−3 − (−5)) = 2       1            4            −2
2          −5           (2 − 2) = 0      (−5 − (−5)) = 0       0            0            0
3          −7           (3 − 2) = 1      (−7 − (−5)) = −2      1            4            −2
∑Xi = 6    ∑Yi = −15                                            ∑ = 2        ∑ = 8        ∑ = −4
X̄ = 2      Ȳ = −5

R = ∑(Xi − X̄)(Yi − Ȳ) / √[∑(Xi − X̄)² ∑(Yi − Ȳ)²] = −4/√(2 · 8) = −4/4 = −1
This can be seen in Figure 4:2A, where all the observations lie on a straight line with negative
slope and therefore R = –1.
[Figure 4:2. Panels A and B each plot Y (vertical axis) against X (horizontal axis).]

In both A and B the correlation is perfect: R = −1, as the observations all lie on a line of negative slope.
If Xi = −Yi (equivalently, Yi = −Xi), then R would again equal −1 (see Figure 4:2B).
Therefore R cannot be used to predict Y given X or vice versa. It just tells whether the sample
relationship is positive or negative and how close the relationship approximates a linear one, but
not the slope. R does not tell us as much as the regression equation Y = A + BX. This should be
expected. In regression, we bring in more information (which variable is the independent
variable), and therefore we can get more out of it. This is an example of an important rule in
econometrics: the more prior information that we bring to bear, the more precise our results are
likely to be.
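The following brief Python sketch (with made-up data) illustrates the point: two data sets with very different slopes can have exactly the same R, so R by itself tells us nothing about the slope.

import numpy as np

def sample_corr(x, y):
    # Sample correlation coefficient R
    x_dev, y_dev = x - x.mean(), y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

X = np.array([1.0, 2.0, 3.0])
Y1 = np.array([-3.0, -5.0, -7.0])   # Example 1 data: slope of Y on X is -2
Y2 = -X                             # slope of Y on X is -1

print(sample_corr(X, Y1))           # -1.0
print(sample_corr(X, Y2))           # -1.0: same R, different slopes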
EXAMPLE 2:

Xi          Yi          (Xi − X̄)         (Yi − Ȳ)        (Xi − X̄)²   (Yi − Ȳ)²   (Xi − X̄)(Yi − Ȳ)
−1          6           (−1 − 0) = −1     (6 − 4) = 2      1            4            −2
0           0           (0 − 0) = 0       (0 − 4) = −4     0            16           0
1           6           (1 − 0) = 1       (6 − 4) = 2      1            4            2
∑Xi = 0     ∑Yi = 12                                        ∑ = 2        ∑ = 24       ∑ = 0
X̄ = 0       Ȳ = 4

R = ∑(Xi − X̄)(Yi − Ȳ) / √[∑(Xi − X̄)² ∑(Yi − Ȳ)²] = 0/√(2 · 24) = 0
Notice that the sample correlation, R, and the sample covariance are both 0 even though Y = 6X². While there is a perfect nonlinear relationship between X and Y, there is no linear
relationship between X and Y. This example emphasizes the fact that regression and correlation
are concerned with linear relationships. The fact that R = 0 does not mean that there is no
relationship between X and Y, only that there is no linear relationship.
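A short Python sketch of this example shows the same thing; NumPy's built-in correlation agrees with the hand calculation:

import numpy as np

X = np.array([-1.0, 0.0, 1.0])
Y = 6 * X ** 2                      # a perfect nonlinear (quadratic) relationship: Y = 6X^2

x_dev, y_dev = X - X.mean(), Y - Y.mean()
sample_cov = (x_dev * y_dev).mean()
R = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print(sample_cov, R)                # 0.0 0.0
print(np.corrcoef(X, Y)[0, 1])      # 0.0 as well, despite the perfect quadratic relationship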
As already stated, the sample correlation between X and Y is

R = ∑(Yi − Ȳ)(Xi − X̄) / √[∑(Xi − X̄)² ∑(Yi − Ȳ)²]
  = SAMPLE COV(X, Y) / √[SAMPLE VARIANCE(X) · SAMPLE VARIANCE(Y)].
It is therefore not surprising that R has many of the characteristics of the sample
covariance. In particular, if the sample covariance between X and Y is positive (negative), so is
the sample correlation between X and Y. If, for example, when X is above its average X̄, Y tends to be below its average Ȳ, then both the sample covariance and the sample correlation are
negative. Insight into the characteristics of R can be obtained by reviewing Chapter 2 section D
on covariance.
An important difference between correlation and covariance is that correlation is
independent of scale. Suppose that we are finding the correlation between the husband's height
and the wife's height. If men of above average height marry women of above average height and
men of below average height marry women of below average height, then the sample correlation
is positive. If we first measure height in feet and then later in inches, the measure of correlation
does not change. Looking at the numerator in our definition of R, each deviation from Ȳ is multiplied by 12, and each deviation from X̄ is also multiplied by 12. So the numerator is 144 times as large when the measurement changes from feet to inches. Looking at the denominator, each deviation from Ȳ is again multiplied by 12. These deviations are then squared, so the squared deviations of Y are 144 times as large when the measurements are in inches instead of feet. Similarly, the squared deviations of X are 144 times as large. So in the denominator we
have the square root of (144 × 144) = 144. Since both the numerator and denominator are
multiplied by 144, there is no change in the correlation value when the units of measurement
change. The same is not true for sample covariance since it is multiplied by 144.
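A final Python sketch (with hypothetical height data, in feet) illustrates the scale invariance of R and the scale dependence of the covariance:

import numpy as np

husband_ft = np.array([5.5, 6.0, 5.8, 6.2])   # hypothetical heights in feet
wife_ft    = np.array([5.2, 5.7, 5.4, 5.9])

def sample_cov(x, y):
    return ((x - x.mean()) * (y - y.mean())).mean()

def sample_corr(x, y):
    return sample_cov(x, y) / np.sqrt(sample_cov(x, x) * sample_cov(y, y))

# Convert feet to inches: every observation is multiplied by 12
husband_in, wife_in = 12 * husband_ft, 12 * wife_ft

print(sample_corr(husband_ft, wife_ft))       # unchanged by the change of units
print(sample_corr(husband_in, wife_in))       # same value as above
print(sample_cov(husband_in, wife_in) / sample_cov(husband_ft, wife_ft))   # 144.0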
D. CONCLUDING REMARKS
It is important to know the arrow of causation – that is, it is important to know which
variable is the independent variable and which is the dependent variable or whether they both
depend on each other or a third variable. If one is mistaken in the direction of causation, not only
will one get a different least squares line, but also one will make mistaken inferences. If gay men
wear their hair shorter, one should not infer that having your hair cut short will make you gay.
This is obvious to everyone, but other cases are not so obvious, particularly when the arrow of
causation may be in both directions. In such cases, people who believe that the causation is in
one direction may marshal evidence that is perfectly consistent with the direction of causation
being in the opposite direction. Let us illustrate by considering the following example. If we observe that children who watch more television tend to be more violent, does that mean that watching television makes children more violent, or does it mean that those who are more violent have a preference for watching television when they are not engaging in violence? Both are
plausible hypotheses. So one cannot come to a definitive conclusion regarding direction of
causation by finding a positive correlation between the two.
Sometimes, there are natural experiments, where nature inadvertently creates an
experiment so that the direction of causation is only in one direction. To illustrate, we will look
at work by Matthew Gentzkow and Jesse Shapiro ("Preschool Television Viewing and Adolescent Test Scores: Historical Evidence from the Coleman Study," Quarterly Journal of Economics, 2008) as reported by Austan Goolsbee (http://www.slate.com/articles/business/the_dismal_science/2006/02/the_benefits_of_bozo.html):
According to most experts, TV for kids is basically a no-no. The American Academy of
Pediatrics recommends no TV at all for children under the age of 2, and for older children, one to
two hours a day of educational programming at most. Various studies have linked greater
amounts of television viewing to all sorts of problems, among them attention deficit disorder,
violent behavior, obesity, and poor performance in school and on standardized tests. Given that
kids watch an average of around four hours of TV a day, the risks would seem to be awfully
high. Most studies of the impact of television, however, are seriously flawed. They compare kids
who watch TV and kids who don't, when kids in those two groups live in very different
environments. Kids who watch no TV, or only a small amount of educational programming, as a
group are from much wealthier families than those who watch hours and hours. Because of their
income advantage, the less-TV kids have all sorts of things going for them that have nothing to
do with the impact of television. The problem with comparing them to kids who watch a lot of
TV is like the problem with a study that compared, say, kids who ride to school in a Mercedes
with kids who ride the bus. The data would no doubt show that Mercedes kids are more likely to
score high on their SATs, go to college, and go on to high-paying jobs. None of that has anything
to do with the car, but the comparison would make it look as if it did. The only way to really
know the long-term effect of TV on kids would be to run an experiment over time. But no one is
going to barrage kids with TV for five years and then see if their test scores go down (though I
know plenty of kids who would volunteer).
Gentzkow and Shapiro came up with a different way to test the long-run impact of
television on kids—by going back in time. When Americans first started getting television in the
1940s, the availability of the medium spread across the country unevenly. Some cities, like New
York, had television by 1940. Others, like Denver and Honolulu, didn't get their first broadcasts
until the early 1950s. Whenever television appeared, kids became immediate junkies: The key
point for Gentzkow and Shapiro's study is that depending on where you lived and when you were
born, the total amount of TV you watched in your childhood could differ vastly. A kid born in
1947 who grew up in Denver, where the first TV station didn't get under way until 1952, would
probably not have watched much TV at all until the age of 5. But a kid born the same year in
Seattle, where TV began broadcasting in 1948, could watch from the age of 1. If TV-watching
during the early years damages kids' brains, then the test scores of Denver high-school seniors in
1965 (the kids born in 1947) should be better than those of 1965 high-school seniors in Seattle.
What if you're concerned about differences between the populations of the two cities that could
affect the results? Then you compare test scores within the same city for kids born at different
times. Denver kids who were in sixth grade in 1965 would have spent their whole lives with
television; their 12th-grade counterparts wouldn't have. If TV matters, the test scores of these two
groups should differ, too. Think analogously about lead poisoning. Lead has been scientifically
proven to damage kids' brains. If, hypothetically, Seattle added lead to its water in 1948 and
Denver did so in 1952, you would see a difference in the test-score data when the kids got to
high school—the Seattle kids would score lower than the Denver kids, and the younger Denver
kids would score lower than the older Denver ones, because they would have started ingesting
lead at a younger age.
Gentzkow and Shapiro got 1965 test-score data for almost 300,000 kids. They looked for
evidence that greater exposure to television lowered test scores. They found none. After
controlling for socioeconomic status, there were no significant test-score differences between
kids who lived in cities that got TV earlier as opposed to later, or between kids of pre- and post-TV-age cohorts. Nor did the kids differ significantly in the amount of homework they did,
dropout rates, or the wages they eventually made. If anything, the data revealed a small positive
uptick in test scores for kids who got to watch more television when they were young. For kids
living in households in which English was a second language, or with a mother who had less
than a high-school education, the study found that TV had a more sizable positive impact on test
scores in reading and general knowledge.
The most innovative part of the above study is that it is able to determine the arrow of
causation. Clearly, the availability of television broadcasting is the independent variable and
watching TV is the dependent variable. This gets around the problem in other studies where the
arrow of causation is likely to go in both directions: watching more TV may result in lower
academic abilities; lower academic abilities may encourage a person to watch TV rather than
read a book or study. The study is also important because it controls for other variables (such as
the level of the parents' education), a point we will come to in a later chapter. It is also important
because other studies that are based on experiments can only determine short-term effects. If kids
watch a violent TV program, they may act more aggressively for the next hour or two than if
they had watched a ballet. But we do not know how this impacts their lives over the long run.
PROBLEMS
1. Why is the B in the linear regression Ŷ = A + BX not in general equal to 1/B′ from the linear regression X̂ = A′ + B′Y? (4) Under what circumstances are they identical? (2)

2. Why can't we use empirical data to tell us which variable is the independent variable and which is the dependent variable? (2)

3. What is the difference between correlation and linear regression? (4) When is each appropriate? (4)

4. Prove that a perfect linear relationship results in an R² of 1. (6)

5. In the equation Y = A + BX, which is the independent variable and which is the dependent variable? (2) What are the parameters and what are the variables? (4)

6. Explain in your own words the problem of disentangling the effect of TV watching on school performance. (6) Explain in your own words how the Gentzkow and Shapiro study resolved the problem of two-way causation. (4)

7. When do we use sample correlation? (4)
What is the formula for R? (2)
How does it differ from the formula for B? (4)
Provide an intuitive explanation for the differences between the two formulae. (2)
What is the relationship between R and the B in Y = A + BX and the B′ in X = A′ + B′Y? (4)
What is the intuitive explanation for this? (4)
What does R measure? (2)
If R is 1, what does it tell you about the slope of the line? (2)
Why does regression tell you more? (2)