The Devil is in the Design:
Experimental logic and causal inference in
social science research
Mads Meier Jæger, Aarhus University
What you’ve learned so far
 Register data are a great (in fact, unique) resource
for social science research
 Longitudinal and life-course data provide excellent
opportunities to study important research questions
 (Sometimes you’ve got so much data that you don’t
know what to do with it!)
Intro #1
 You’ve got the idea, you’ve got the data, you took the
stats course … now you want to get started crunching
numbers and estimating the effect of x on y … let’s go!
 Hold on! What is your identification strategy?
 What is my what …?
Intro #2
 You claim that x affects y. Why should we believe you?
Which research strategy/design do you use to
substantiate your causal interpretation that x -> y?
“Identification strategy describes the manner in which a
researcher uses observational data to approximate a real
experiment” (Angrist/Pischke 2009)
 Problem: Most social scientists rarely (1) think of their
research designs as approximations to experiments and
(2) explicate the conditions under which they make causal
interpretations …
Intro #3
 The lamest answer: “My theory predicts that x affects y.
Consequently, my estimates are causal”
 The second-lamest answer: “I interpret my regression
coefficients as ‘elaborated’ correlations, not causal
effects”
 Both arguments are dubious:
1. Theory alone cannot be used to infer anything about reality. To do this, you
need data/design which allows you to make credible causal inference.
2. Regression doesn't care how you interpret coefficients. Regression models
assume that x affects y. If this is not the case, your model is mis-specified and
you get biased results.
“If the estimates you get are not the estimates you want, the fault lies in the
econometrician and not the econometrics!” (Angrist/Pischke)
Intro #4
 You puritan Nazi! Does all this matter? Well, yes. Let’s
look at a classic: Max Weber’s theory of The Protestant
Ethic and the Spirit of Capitalism …
Weber:
Reformation → Protestant ethic → Capitalism
Becker/Woessmann, QJE (using 19th century data on Prussia):
Reformation → [Luther: Thou shall read Bible in thy own language] → Increase in literacy → [When people can read, they can do business] → Capitalism
This talk
 Fundamental problems: Why quant data & which identification
strategies can we (probably) trust?
1. The hierarchy of identification strategies:
Experimental data: (a) Randomized Controlled Trials; (b) quasi/"natural" experiments.
Observational data: (c) longitudinal/cohort data; (d) cross-sectional data.
2. Examples: Slightly exotic examples & an education example: the causal
effect of class size on children's outcomes
 Take home message: We need to sharpen the way we practice
identification!
Why quant data?
The different philosophies of working with numbers &
statistics …
The Deniers
"There are three kinds of lies: lies, damned lies, and statistics" (M. Twain)
"Definition of Statistics: The science of producing unreliable facts from reliable figures" (E. Esar)
"Statistics are like lampposts: they are good to lean on, but they don't shed much light" (Storm P)
"Do not put faith in statistics until you have carefully considered what they do not say" (W. Watt)
"There are two kinds of statistics, the kind you look up and the kind you make up" (R. Stout)
"Oh, people can come up with statistics to prove anything, Kent. 14% of people know that" (H. Simpson)
The Pragmatics
"All models are wrong, some models are useful" (G.E.P. Box)
"Models are to be used, not believed" (H. Theil)
"It doesn't do to leave Dragon out of your calculations, if you live near him" (J.R.R. Tolkien)
The Overly Ambitious
"It is the mark of a truly intelligent person to be moved by statistics" (G.B. Shaw)
"Statistics is the grammar of science" (K. Pearson)
"The advancement and perfection of mathematics are intimately connected with the prosperity of the State" (N. Bonaparte)
Advantages of quant data
1. [We can't do real science from theory and common sense]
2. We can generalize results
3. Only quant data allow us to address causal research questions
"Blind commitment to a theory is not an intellectual virtue: It is an intellectual crime" (I. Lakatos)
"Theory is often just practice with the hard bits left out" (J.M. Robson)
"The plural of anecdote is not data" (R. Brinner)
"Facts are stupid things" (R. Reagan)
"When the going gets tough, the tough get empirical" (J. Carroll)
"When anyone says 'theoretically', they really mean 'not really'"
"… while the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician." (Sherlock Holmes, The Sign of the Four)
"There is occasions and causes why and wherefore in all things." (William Shakespeare, Henry V)
"Facts are meaningless. You could use facts to prove anything that's even remotely true" (H. Simpson)
“Nothing shocks me. I’m a
scientist” (Indiana Jones, The
Temple of Doom)
Causal explanations
Let's change focus from the theoretical niceness of
quant data to cruel reality …
 Quant data is a necessary but not a sufficient
condition for identifying causal effects
 Fancy statistical methods aren’t magic bullets:
We always need good designs
 The experiment is the natural design for
studying causal effects
Education

Are private schools better than public
schools?

Does class size have a negative effect on
student performance?

Do schools’ economic resources affect
student performance?
Private vs. Public Schools
Are private schools better than public schools?
Piece of cake! We collect data from both types of
schools and compare …
 GPA levels in private and
public schools
 Mean difference in GPA summarizes how much
better private schools are
For example:
Mean GPA in private schools: 8.2
Mean GPA in public schools: 7.8
Conclusion: Going to a private school increases GPA by 0.4 (8.2 − 7.8)
Not so fast! What if kids who
attend private schools differ
systematically from those who attend
public schools?
 We know that kids in private
schools tend to have better educated
and wealthier parents
 Better educated parents tend to live
in wealthier areas with high-quality
schools and small class sizes
 Better educated parents tend to be
particularly interested in their kids’
schooling
 Consequently: When doing naïve comparisons of
GPA across school types to evaluate causal effects, we
also pick up the effect of other factors (parents’
education, income, etc.)
 Fact of life: Most (if not all) of the time we don’t
observe all the relevant variables which lead to the
two groups having different GPAs → we need to deal
with unobserved factors
 This problem is often referred to as the evaluation
problem
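To make the evaluation problem concrete, here is a minimal simulation sketch in Python (all numbers and variable names are invented for illustration, not taken from any real data): an unobserved factor raises both the chance of attending a private school and GPA, so the naive mean comparison overstates the assumed causal effect.

```python
# Sketch of the evaluation problem with made-up numbers: parental resources
# (unobserved) push children into private schools AND raise GPA, so the naive
# private-public comparison mixes the causal effect with selection.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

parental_resources = rng.normal(size=n)                   # unobserved confounder
private = (parental_resources + rng.normal(size=n) > 1.0).astype(int)

true_effect = 0.1                                         # assumed causal effect on GPA
gpa = (7.0 + true_effect * private
       + 0.5 * parental_resources                         # the part we cannot control for
       + rng.normal(scale=0.5, size=n))

naive_diff = gpa[private == 1].mean() - gpa[private == 0].mean()
print(f"assumed effect: {true_effect:.2f}, naive comparison: {naive_diff:.2f}")
# The naive difference comes out far above 0.1 - that gap is selection bias.
```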
The Evaluation Problem
The evaluation problem arises from the fact that we
(almost) never observe the same individual both in
the treatment group (private school) and in the
control group (public school).
If we observed the same individual in the
treatment and control group, we could easily
calculate the difference in GPA and aggregate
across the population
[Two photos with the same caption — "Hello little friend. You look sweet. Would you like a present?" — one speaker looks like a creep, the other looks nice]
It's pretty important to have the right image
The Evaluation Problem
There is only one way of dealing effectively with
the evaluation problem:
Randomized Controlled Trials
 How? Participants allocated to treatment and control
group on the basis of lottery
 Why? Participants have no influence on which
group they are allocated to. Treatment and control
groups do not differ – on average – in terms of
unobserved factors which affect the outcome of interest
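A companion sketch (same hypothetical setup as above) of why the lottery works: random assignment balances the unobserved confounder across the two groups on average, so the simple difference in means now recovers the assumed effect.

```python
# Same made-up world as before, but treatment is now assigned by coin flip.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

parental_resources = rng.normal(size=n)        # still unobserved
treat = rng.integers(0, 2, size=n)             # lottery: 0 = control, 1 = treatment

true_effect = 0.1
gpa = 7.0 + true_effect * treat + 0.5 * parental_resources + rng.normal(scale=0.5, size=n)

balance = parental_resources[treat == 1].mean() - parental_resources[treat == 0].mean()
estimate = gpa[treat == 1].mean() - gpa[treat == 0].mean()
print(f"confounder imbalance: {balance:.3f}, estimated effect: {estimate:.3f}")
# Imbalance is ~0 and the estimate is close to 0.1 (up to sampling noise).
```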
The Evaluation Problem
 RCTs are the scientific “gold standard” for identifying
causal effects
 Unsurprisingly, it is often not possible to carry out
RCTs in social science research
”Congratulations Mrs. Jones. Your
son has been randomly selected to
receive only 7 years of schooling”
”No dental care for you
this year Bob, aren’t you
happy?”
”Bridget, you’re in the
control group. No
unemployment benefits for
this lady please!”
The Evaluation Problem
When we can’t do proper experiments we have two
options left:
1. Observational [i.e., non-experimental] data & fancy statistics →
Idea: “Repair” the fact that we don’t have experimental data …
Often ends up looking like this …
The Evaluation Problem
When we can’t do proper experiments we have
two options left:
2. Exploit "natural experiments" which serve the same function as real
experiments. Only, they weren't designed by us
The idea is the same as in man-made experiments. We look for "natural" or
quasi-experimental variation in the treatment of interest (private vs.
public school); i.e., variation which is unrelated to the participants in the
experiment. We let nature do the work and observe the outcome …
Slightly exotic examples of natural
experiments used in social science
research:
Research Question:
Do social norms (egoism vs.
altruism) affect people’s behavior?
Evaluation Problem:
People don’t take lab experiments
seriously …!
Stakes are too low; no incentive to act as if in
the “real” world …
We need an experiment in which there’s really
something to win … or lose
How about …
The sinking of the Titanic
Titanic:
 Titanic hit an iceberg late in the evening of 14 April 1912 and sank
2 hours and 40 minutes later
 2,223 people on board – lifeboats for only 1,178
Sinking can be seen as a natural experiment in which
people’s lives are at stake. How do people react: Save
yourself (egoism) or adhere to prevailing social
norms (women and children first!)?
Results: Probability of survival not random. Who
were more likely to survive:
 Women & children (not men)
 Titanic staff (not passengers)
 Passengers traveling on first class
 Other nationalities than Brits (and especially
Americans)
Strong evidence of altruistic behavior (especially
among Brits!)
Exotic Example #2:
What are the long-term consequences
of early childhood health?
Evaluation Problem:
 We wish to know if children’s early health has a causal effect
on how well they do later in adulthood (education, income, etc)
 The problem: Healthy mothers have healthy children. Healthy
mothers also have other resources (education, income, etc.)
which have a positive effect on child outcomes
 Our measure of childhood health captures the causal effect of
childhood health but also, indirectly, the effect of other
beneficial (but unobserved) resources in the child’s environment.
We wish to isolate the effect of child health …
We need a randomized shot of (ill-)health …
How about …
The Influenza Epidemic of
1918?
Influenza Epidemic of 1918
 Influenza epidemic hit most of the world from March 1918 to June 1920.
About 1/3 of world's population infected
 10-20% of those infected died; 3-6% of world's population died
 Particularly nasty variant of H1N1 influenza virus
(Trivia: Max Weber died from influenza in June 1920; Walt Disney also
got sick but survived!)
 Epidemic hit the US without any warning in October 1918 and was gone
by early 1919
 Epidemic affected pregnant women at random: Some women got sick,
others didn't
 Did children of mothers who got infected during pregnancy do worse
than children of mothers who didn't get infected?
Yes! Children whose mothers got sick during pregnancy …
 Had 15% lower probability of completing high
school
 Had 5-9% lower earnings throughout life
 Had 20% higher probability of developing a
disability
Conclusion: Poor health in early life has
significant, negative long-term effects!
Douglas Almond & Bhaskar Mazumder (2005): "The 1918 Influenza Pandemic and
Subsequent Health Outcomes: An Analysis of SIPP Data." American Economic
Review 95: 258-262.
Douglas Almond (2006): "Is the 1918 Pandemic Over? Long-Term Effects of In
Utero Influenza Exposure in the Post-1940 U.S. Population." Journal of
Political Economy 114: 672-712.
Influenza Epidemic of 1918
 Researchers have also used the famine that arose
during the winter of 1944 in the western part of the
Netherlands as a natural experiment for child health
 Same result: (Grand)children of (grand)mothers
who were affected by the famine had poorer health
throughout life
(Trivia: Audrey Hepburn grew up in the Netherlands during the war
and was affected by the famine; she suffered from poor health
throughout life. She ate chocolate every day)
Slightly depressing natural experiments. Let’s look at
some more fun ones …
(More) funny experiments
 The power outage in the US North-East on 14
August 2003 (biggest outage ever!). Result: Big surge
in number of babies born nine months after!
[Photos: smog, the day before vs. the day after]
 Happiness level in Denmark increased
permanently after we beat the Germans in the 1992
European Championship final
 The probability of an acute heart attack increases
significantly when (German) men watch an exciting
soccer match (even on TV)
 The probability of dying from a heart attack
increases significantly among (British) men when
their favorite team loses a match and when UK
national team plays in European Championships
 The probability of (women) experiencing domestic
violence increases significantly when (men’s) favorite
team doesn’t do very well …
Let’s get back to
education …
Example: Class size
Why class size?
1. Important policy issue & many common sense ideas …
2. Tons of research exists which uses different analytical
approaches. We can compare identification strategies
3. Tons of (old) cross-sectional studies: No clear results!
 Research question: Do children in large classes perform worse
than children in small classes? Intuition says yes!
However … consider that
Class size
 Wealthy/better educated parents tend to move to
residential areas with high-quality schools and smaller class
sizes*
 Parents of children in small classes possess other
resources (income, social capital, etc.) which make their
children perform well
 Evaluation problem: Class size is correlated with unobserved
characteristics of children which also affect academic
achievement → we don't measure the causal effect of class size
* A UK study recently found that having a high-quality school in your catchment area increases property
value by 50%
Class size
The Gold Standard: Project STAR (RCT):
 11,600 pupils, 1,300 teachers, and 76 schools in Tennessee, US
 Pupils randomly allocated to (1) small classes (13-17 pupils), (2)
regular classes (22-26 pupils), and (3) regular classes with a teacher
aide
 Teachers randomly allocated to classes
Main results: 1) pupils in small classes performed better in
achievement tests than pupils in large classes; 2) pupils in small
classes more likely to go to college; 3) particularly strong effects
of small class size for pupils from low-SES families
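For intuition about how such an RCT is analysed, here is an illustrative sketch on simulated data (not the actual STAR files; the +3-point effect, the score scale, and the sample size are invented): under random assignment the difference in mean test scores between small and regular classes estimates the causal effect, and a textbook standard error quantifies the uncertainty.

```python
# Illustrative STAR-style analysis on simulated data (hypothetical numbers).
import numpy as np

rng = np.random.default_rng(2)
n = 6_000
small = rng.integers(0, 2, size=n)                        # randomized: small class yes/no
score = 50 + 3.0 * small + rng.normal(scale=20, size=n)   # assumed +3-point effect

y1, y0 = score[small == 1], score[small == 0]
effect = y1.mean() - y0.mean()
se = np.sqrt(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))
print(f"estimated effect of small classes: {effect:.2f} points (SE {se:.2f})")
```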
Class size
 Project STAR is one of few examples of RCTs
 Did project STAR settle the matter? Well, like any RCT there
were problems:
 Pupils, teachers and parents knew they participated in an experiment
(Rosenthal effect)
 The “better” treatment (small class size) was known to everyone.
What happened to the people who got annoyed by ending up in a big
class? Did their behavior change?
 External validity: Would the experiment produce the same result if
carried out elsewhere?
 Project STAR convincingly showed a causal effect of class size on
academic achievement. However, the project cost more than $10 million
and effect sizes were small. Is it cost efficient to reduce class size
compared to other interventions which improve academic achievement?
Class size
 An alternative identification strategy is to look for a natural
experiment which affects class size but which is unrelated to
individual students
Any ideas …?
Class size
 A famous example is Angrist & Lavy (1999, QJE) who used
Maimonides’ rule as natural experiment for class size in Israel
 Maimonides was a 12th-century rabbi who interpreted the teaching on
class size in the Babylonian Talmud like this:
”Twenty-five children may be put in
charge of one teacher. If the number in
the class exceeds twenty-five but is no
more than forty, he should have an
assistant to help with the instruction”
 Maimonides' rule has been applied in Israel since 1969. It has the
following consequence:
Illustration of the rule: an enrollment of 20 gives one class of 20; an
enrollment of 40 gives one class of 40; an enrollment of 41 is split into
two classes of 20.5 on average
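A small sketch of the predicted class size implied by a 40-pupil cap of this kind (written directly from the rule as described above, which matches the standard Angrist & Lavy formulation): enrollment is split into the fewest classes of at most 40, so the predicted size jumps sharply at enrollments of 41, 81, 121, and so on — variation that has nothing to do with individual pupils.

```python
# Predicted class size under a Maimonides-style 40-pupil cap: the cohort is
# split into the smallest number of classes of at most 40 pupils each.
def predicted_class_size(enrollment: int) -> float:
    classes = (enrollment - 1) // 40 + 1       # number of classes needed
    return enrollment / classes                # average class size

for e in (20, 40, 41, 80, 81, 120, 121):
    print(f"enrollment {e:3d} -> predicted class size {predicted_class_size(e):.2f}")
# enrollment 40 -> 40.00, 41 -> 20.50, 81 -> 27.00, 121 -> 30.25, ...
```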
Class size
 Compare Maimonides’ rule with
actual class sizes in Israel
 Maimonides’ rule provides
exogenous/"experimental"
variation in class size (since the rule
has nothing to do with students or
their parents)
 Angrist & Lavy find a clear
negative effect of class size on
reading ability
Class size: Natural experiment #2
 Similar idea (Hoxby 2000, QJE):
1. Random variation from year to year with regard to the size of each
birth cohort starting in school → random variation in the number of
children who start in school
2. Administrative rules stipulate maximum class sizes (like
Maimonides’ rule) → if number of children entering a school
crosses threshold a new class is automatically formed
Result: No effect of class size on GPA in Connecticut, USA
Class size: Natural experiment #2
Hoxby’s design used by others:
Leuven et al. (2009, SJE): Norway, no effect of class size on GPA
Browning & Heinesen (2007, SJE): Denmark, negative effect of
class size on educational attainment
Bingley, Jensen & Walker (2006): Denmark, same result as
Browning & Heinesen, using a fancier method
Class size: Natural experiment #3
Case & Deaton (1999, QJE): South Africa during apartheid: Blacks
living in Townships had [in practice] no control over (1) where
they lived and (2) which school their children attended. This
makes class size effectively random. Result: Large negative effect
of class size on academic achievement
Heinesen (2010, EJ): Denmark: When pupils choose between
French and German before 7th grade they don’t know their
future class size. Result: Pupils in small French classes have higher
GPA at the end of 9th grade compared to pupils in larger
classes
Consequently:
 One study alone probably doesn't provide the truth …
 In the case of class size, results differ across studies using different
natural experiments: what is going on here?
 This is expected. We need to apply a “situational”
interpretation of the causal effect which the experiment
identifies. This phenomenon is known as …
 LATE (Local Average Treatment Effect). Who does the
experiment move? We give people a pill. Some people (1) always
eat the pill (“always-takers”), (2) always flush the pill down the
toilet (“never-takers”), and some people (3) eat the pill if they
like the taste (“compliers”). Who are the compliers?
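A toy simulation of the pill story (all shares and effect sizes invented) shows what LATE means in practice: the random offer of the pill only changes behavior for the compliers, so the Wald/IV estimator recovers their effect rather than a population-wide average.

```python
# Toy LATE illustration: always-takers, never-takers and compliers with
# different (made-up) treatment effects. Z = random offer of the pill,
# D = pill actually taken, Y = outcome.
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, size=n)                                       # randomized offer

kind = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
d = np.where(kind == "always", 1, np.where(kind == "never", 0, z))   # actual take-up

effect = np.where(kind == "always", 1.0,
                  np.where(kind == "complier", 2.0, 0.0))             # assumed effects
y = effect * d + rng.normal(size=n)

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(f"Wald/IV estimate: {wald:.2f} (true effect for compliers = 2.0)")
```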
Summary:
So what have we learned?
Identification: Hard facts
 (Register) data is a great thing – you’ll need it!
 Descriptive analysis is fine … but …
“Although purely descriptive research has an important role to play,
(…) most interesting research in social science is about questions of
cause and effect” (Angrist/Pischke)
 You need a credible identification strategy and …
“The most credible and influential research programs use random
assignment” (Angrist/Pischke)
 Fancy statistics try to emulate an experimental design. Usually a
bad substitute for the real thing. 1 trillion observations can't fix a
poor identification strategy …
Identification: Hard facts
 “I can’t really think of an experiment that might identify the
causal effect I’m after”. Well, tough luck:
“If you can’t devise an experiment that answers your question in a
world where everything goes, then the odds of generating useful
results with a modest budget and nonexperimental survey data seem
pretty slim” (Angrist/Pischke)
 How to find good instruments/natural experiments?
“So, where can you find an instrumental variable? Good instruments
come from a combination of institutional knowledge and ideas about
the processes determining the economic variable of interest”
(Angrist/Pischke)
Identification: Good news
Taking identification seriously forces you to think hard about the
mechanisms which generate the effect you're after → fosters
creativity! For example, I’m currently looking at …
Blood types of spouses as instruments for fertility
Pollen in the air as instrument for exam performance among (allergic) exam takers
Weather conditions (rain/sun) as instrument for happiness …
RCT on training of school teachers in classroom management
 Once you start looking for natural experiments they’re all
around you. Warning: There’s no going back …
 Economists are currently in the lead with respect to taking
identification seriously. Political science comes in second, and
sociology (and the like) still has a long way to go …
Case in point
The Review of Economics and Statistics, vol. 93(3), August 2011:
20 empirical articles: RCT: 0; natural experiment: 9; panel data: 6; cross-sectional/descriptive: 6.
American Political Science Review, vol. 105(2), May 2011
5 empirical (quant) articles: Natural experiment: 2; panel data: 1; cross-sectional/descriptive: 2 (both mainly theoretical).
American Sociological Review, vol. 76(4), August 2011:
4 empirical (quant) articles: RCT: 1 (lab experiment!); cross-sectional/descriptive: 3.
European Sociological Review, 27(4), August 2011:
6 (quant) articles: Panel data/econometrics: 2; Cross-sectional/descriptive:
4.
Case in point
The Review of Economics and Statistics, vol. 93(3), August 2011:
20 empirical articles: RCT: 0; natural experiment: 9; panel data: 6; cross-sectional/descriptive: 6.
Creativity:
 Gun shows in Texas and California as instrument for homicide/suicide rates
 (Re)location of German airports after WW2 as instrument for industry location
 The Holocaust as an instrument for altruistic behavior (rescuing Jews)
 Shifts in labor demand in California as instrument for social policy preferences
 Winning the Florida Lottery as instrument for (non)likelihood of bankruptcy
 Drought/floods in China 2,000 years ago as instrument for nomadic incursions
 And so on!!!
Thank you for your attention!