Correlation, Causation, and Control Variables
Sociology 128
Fall 2013
There are three types of questions you might be trying to answer for which control variables are
helpful. (To ‘control’ for a variable only means that you compare results among units that have
the same value on that variable: you are comparing like with like.)
Case 1: Spurious relationships. I have found a correlation between A and B. Is it spurious?
That is, is something else causing A AND B?
For instance: taller kids know more math, on average. Does getting taller increase math
knowledge? (Maybe giving kids growth hormone would make them smarter!) No. Instead,
increasing age causes both height and math knowledge to increase. The relationship between
height and math scores is entirely spurious.
height ← age → math knowledge
Relationship of height to test scores

          Test score
Height    Low     High
3’6”      70%     30%
4’0”      40%     60%
Relationship of height to test scores, controlling for age

          Age 6           Age 8
Height    Low     High    Low     High
3’6”      80%     20%     35%     65%
4’0”      80%     20%     35%     65%
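
To see the arithmetic behind the two tables, here is a minimal Python sketch. The counts are invented for illustration (the handout gives only percentages); they were picked so that the percentages come out the same as in the tables. The names `counts` and `percent_low` are also just for this example.

```python
# Hypothetical counts of kids, keyed by (height, age, test score).
# These numbers are made up; only the resulting percentages match the handout.
counts = {
    ("3ft6in", 6, "low"): 112, ("3ft6in", 6, "high"): 28,
    ("3ft6in", 8, "low"): 14,  ("3ft6in", 8, "high"): 26,
    ("4ft0in", 6, "low"): 16,  ("4ft0in", 6, "high"): 4,
    ("4ft0in", 8, "low"): 56,  ("4ft0in", 8, "high"): 104,
}


def percent_low(height, age=None):
    """Percent of kids of this height with a low score.

    If age is given, only kids of that age are compared
    (that is, we control for age).
    """
    rows = [(k, n) for k, n in counts.items()
            if k[0] == height and (age is None or k[1] == age)]
    total = sum(n for _, n in rows)
    low = sum(n for k, n in rows if k[2] == "low")
    return round(100 * low / total)


heights = ["3ft6in", "4ft0in"]

# Without a control: shorter kids look worse at math.
uncontrolled = {h: percent_low(h) for h in heights}

# Controlling for age: within each age group, height makes no difference.
controlled = {age: {h: percent_low(h, age) for h in heights} for age in (6, 8)}

print(uncontrolled)
print(controlled)
```

The height difference shows up only because most of the shorter kids in these made-up counts are 6 and most of the taller kids are 8. Once you compare like with like, it is gone.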
Case 2: Antecedent variables. I have found a correlation between A and B. What is the causal
effect of A on B? (Note: this is logically the same as Case 1, but expressed differently.)
The Stouffer reading is full of examples of this type.
To answer this question, I need to know something more about the individuals (or firms, or states,
or whatever) at the beginning. For instance, imagine we want to know about the risk of harm
posed by various cancer treatments. If I tell you that the people who took Drug Q died more often
than people who took Drug X, you would not immediately conclude that Drug Q is more
dangerous. You might first like to know if the people who took Drug Q had less treatable
cancers, while the people who took Drug X had early-stage cancers.
For instance: students who take AP classes get better SAT test scores than students who don’t
take AP classes. Do AP classes teach useful skills, and therefore have a causal effect on SAT
scores? Or is the difference in SAT scores due to some boost in cognitive abilities that kids with
highly educated parents get at home from a young age? Watts would notice that we can make a
compelling story for either version.
We have two competing views of the world.
Parents’ education → Taking AP classes
Parents’ education → SAT scores
OR
Taking AP classes → SAT scores
We will probably find that kids of highly educated parents are both more likely to take AP classes
and to have high SAT scores, so this is a good candidate for an antecedent control variable. But
when you chart out the relationship of AP classes to test scores, controlling for parents’
education, I bet you’d find that there is still a relationship between AP classes and SAT scores, as
well as a relationship between parents’ education and SAT scores. If this is the real world, then:
Parents’ education → Taking AP classes → SAT scores
Parents’ education → SAT scores
For an exercise, try making a table like the one in Case 1 that matches this story.
Note again: Case 2 is logically similar to Case 1, just expressed differently. Both deal with
antecedent variables. Age is logically prior to both height and test scores. Parents’ education is
logically prior to both kids’ AP classes and SAT scores. If a relationship is spurious, then adding
an antecedent variable makes the apparent relationship between A and B disappear. More often
in the social world, the relationship is only attenuated: it is smaller once you hold the control
variable constant, but it does not go away completely.
Antecedent variables are also called “pre-treatment” controls.
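
Here is a minimal Python sketch of what attenuation looks like in numbers. Everything in it is hypothetical: the counts are made up to fit the AP story, not taken from any data, and the names `counts`, `percent_high_sat`, and `ap_gap` exist only for this example.

```python
# Hypothetical counts of students, keyed by
# (parents' education, took AP classes?, SAT score). Made-up numbers.
counts = {
    ("high", True, "high"): 80,  ("high", True, "low"): 20,
    ("high", False, "high"): 30, ("high", False, "low"): 20,
    ("low", True, "high"): 25,   ("low", True, "low"): 25,
    ("low", False, "high"): 30,  ("low", False, "low"): 70,
}


def percent_high_sat(took_ap, parents=None):
    """Percent with high SAT scores among AP takers (or non-takers).

    If parents is given, only students whose parents have that
    level of education are compared.
    """
    rows = [(k, n) for k, n in counts.items()
            if k[1] == took_ap and (parents is None or k[0] == parents)]
    total = sum(n for _, n in rows)
    high = sum(n for k, n in rows if k[2] == "high")
    return round(100 * high / total)


def ap_gap(parents=None):
    """AP minus non-AP difference, in percentage points."""
    return percent_high_sat(True, parents) - percent_high_sat(False, parents)


overall_gap = ap_gap()                                   # no control
gap_by_parents = {p: ap_gap(p) for p in ("high", "low")}  # with control

print(overall_gap, gap_by_parents)
```

With these counts the AP gap is 30 points before controlling for parents’ education and 20 points within each level of parents’ education: smaller, but not zero.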
Case 3: Intervening variables. I have found a correlation between A and B. I believe that A
causes B, but perhaps not directly. What is the mechanism by which A causes B?
For instance: I observe that children whose mothers have attended college are more fluent readers
in elementary school, on average, than children whose mothers have a high school degree or less.
But there is nothing about my mother having a piece of paper that certifies her college graduation
that should make me a fluent reader: there must be other factors that intervene.
Try listing possible intervening variables that might mediate this relationship. I bet you can come
up with at least five. Here’s one possibility:
Mother’s education → Mother’s vocabulary size → Child’s reading fluency
There are large differences by education in the number and variety of words that children hear at
home. If mother’s vocabulary is an intervening variable, then when I control for vocabulary size,
the apparent effect of mother’s education should go away. In other words, children whose mothers
use small vocabularies at home should have similar reading fluency whether or not their mothers
attended college, and the same should hold for children whose mothers use large vocabularies.
The difference is that a large percentage of mothers with large vocabularies have attended
college. In this world, mother’s education does cause child’s reading fluency, but the link is
explained by differences in mother’s vocabulary size.
Try making the tables that match this story.
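
If you want to check your tables, here is one possible version as a minimal Python sketch. The counts are made up so that vocabulary size fully explains the link; nothing here comes from real data, and the names `counts`, `percent_fluent`, and `education_gap` are only for this example.

```python
# Hypothetical counts of children, keyed by
# (mother attended college?, mother's vocabulary, child's reading).
counts = {
    (True, "large", "fluent"): 56,  (True, "large", "not fluent"): 24,
    (True, "small", "fluent"): 6,   (True, "small", "not fluent"): 14,
    (False, "large", "fluent"): 21, (False, "large", "not fluent"): 9,
    (False, "small", "fluent"): 21, (False, "small", "not fluent"): 49,
}


def percent_fluent(college, vocab=None):
    """Percent of fluent readers among children of (non-)college mothers.

    If vocab is given, only children whose mothers have that
    vocabulary size are compared.
    """
    rows = [(k, n) for k, n in counts.items()
            if k[0] == college and (vocab is None or k[1] == vocab)]
    total = sum(n for _, n in rows)
    fluent = sum(n for k, n in rows if k[2] == "fluent")
    return round(100 * fluent / total)


def education_gap(vocab=None):
    """College minus non-college difference, in percentage points."""
    return percent_fluent(True, vocab) - percent_fluent(False, vocab)


overall_gap = education_gap()
gap_by_vocab = {v: education_gap(v) for v in ("large", "small")}

print(overall_gap, gap_by_vocab)
```

The arithmetic is the same as in Case 1. What differs is where the control variable sits in the causal story: here it comes between A and B rather than before both.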
In reality, vocabulary size is probably one of several intervening variables between mother’s
education and child’s reading fluency, so controlling for mother’s vocabulary size might partially
explain the link between the two, but wouldn’t completely explain it:
Mother’s education → Mother’s vocabulary size → Child’s reading fluency
Mother’s education → Other intervening variables → Child’s reading fluency
Intervening variables are also called mediator variables. Researchers who are looking for
mechanisms search for intervening variables.
When do you stop? In most cases, you’d be able to carry on elaborating the model indefinitely.
In Case 3, for instance, does mother’s college attendance actually explain her vocabulary size, or
did she have a large vocabulary as a result of her childhood environment? What other intervening
variables could we consider? How exactly does a mother’s vocabulary get translated into her
child’s reading ability – are there other intervening variables in this stage of the model? The
decision of when to stop is a judgment call on the part of the researcher, depending on what
question they are trying to answer.
An Eastern guru affirms that the earth is supported on the back of a tiger. When asked what
supports the tiger, he says it stands upon an elephant; and when asked what supports the
elephant he says it is a giant turtle. When asked, finally, what supports the giant turtle, he is
briefly taken aback, but quickly replies, "Ah, after that it is turtles all the way down." (see
Wikipedia, “Turtles all the way down.”)