The Devil is in the Design: Experimental logic and causal inference in social science research
Mads Meier Jæger, Aarhus University

What you’ve learned so far
Register data are a great (in fact, unique) resource for social science research
Longitudinal and life-course data provide excellent opportunities to study important research questions
(Sometimes you’ve got so much data that you don’t know what to do with it!)

Intro #1
You’ve got the idea, you’ve got the data, you took the stats course … now you want to get started crunching numbers and estimating the effect of x on y … let’s go!
Hold on! What is your identification strategy?
What is my what …?

Intro #2
You claim that x affects y. Why should we believe you? Which research strategy/design do you use to substantiate your causal interpretation that x -> y?
“Identification strategy describes the manner in which a researcher uses observational data to approximate a real experiment” (Angrist/Pischke 2009)
Problem: Most social scientists rarely (1) think of their research designs as approximations to experiments and (2) explicate the conditions under which they make causal interpretations …

Intro #3
The lamest answer: “My theory predicts that x affects y. Consequently, my estimates are causal”
The second-lamest answer: “I interpret my regression coefficients as ‘elaborated’ correlations, not causal effects”
Both arguments are dubious:
1. Theory alone cannot be used to infer anything about reality. To do this, you need data/design which allows you to make credible causal inference.
2. Regression doesn’t care how you interpret coefficients. Regression models assume that x affects y. If this is not the case, your model is mis-specified and you get biased results.
“If the estimates you get are not the estimates you want, the fault lies in the econometrician and not the econometrics!” (Angrist/Pischke)

Intro #4
You puritan Nazi! Does all this matter? Well, yes.
Let’s look at a classic: Max Weber’s theory of The Protestant Ethic and the Spirit of Capitalism …
Weber: Reformation → Protestant ethic → Capitalism
Becker/Woessmann, QJE (using 19th century data on Prussia): Reformation → Increase in literacy → Capitalism
Luther: Thou shalt read the Bible in thy own language
When people can read, they can do business

This talk
Fundamental problems: Why quant data & which identification strategies can we (probably) trust?
1. The hierarchy of identification strategies:
Experimental data: (a) Randomized Controlled Trials; (b) Quasi/“Natural” experiments.
Observational data: (c) Longitudinal/cohort data; (d) cross-sectional data.
2. Examples: Slightly exotic & education example: The causal effect of class size on children’s outcomes
Take home message: We need to sharpen the way we practice identification!

Why quant data?
The different philosophies of working with numbers & statistics …
The Deniers, The Pragmatics, The Overly Ambitious
“There are three kinds of lies: lies, damned lies, and statistics” (M. Twain)
“All models are wrong, some models are useful” (G.E.P. Box)
“It is the mark of a truly intelligent person to be moved by statistics” (G.B. Shaw)
“Definition of Statistics: The science of producing unreliable facts from reliable figures” (E. Esar)
“Models are to be used, not believed” (H. Theil)
“Statistics is the grammar of science” (K. Pearson)
“Statistics are like lampposts: they are good to lean on, but they don't shed much light” (Storm P)
“Do not put faith in statistics until you have carefully considered what they do not say” (W. Watt)
“The advancement and perfection of mathematics are intimately connected with the prosperity of the State” (N. Bonaparte)
“There are two kinds of statistics, the kind you look up and the kind you make up” (R. Stout)
“It doesn’t do to leave Dragon out of your calculations, if you live near him” (J.R.R. Tolkien)
“Oh, people can come up with statistics to prove anything, Kent. 14% of people know that” (H.
Simpson)

Advantages of quant data
1. We can’t do real science from theory and common sense
2. We can generalize results
3. Only quant data allow us to address causal research questions
“Blind commitment to a theory is not an intellectual virtue: It is an intellectual crime” (I. Lakatos)
“Theory is often just practice with the hard bits left out” (J.M. Robson)
“The plural of anecdote is not data” (R. Brinner)
“Facts are stupid things” (R. Reagan)
“When the going gets tough, the tough get empirical” (J. Carroll)
“When anyone says ‘theoretically’, they really mean ‘not really’”
“… while the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what any one man will do, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician.” (Sherlock Holmes, The Sign of the Four)
“There is occasions and causes why and wherefore in all things.” (William Shakespeare, Henry V)
“Facts are meaningless. You could use facts to prove anything that’s even remotely true” (H. Simpson)
“Nothing shocks me. I’m a scientist” (Indiana Jones, The Temple of Doom)

Causal explanations
Let’s change focus from the theoretical niceness of quant data to cruel reality …
Quant data is a necessary but not a sufficient condition for identifying causal effects
Fancy statistical methods aren’t magic bullets: We always need good designs
The experiment is the natural design for studying causal effects

Education
Are private schools better than public schools?
Does class size have a negative effect on student performance?
Do schools’ economic resources affect student performance?

Private vs. Public Schools
Are private schools better than public schools?
Piece of cake! We collect data from both types of schools and compare …
GPA levels in private and public schools
Mean difference in GPA summarizes how much better private schools are
For example:
Mean GPA in private schools: 8.2
Mean GPA in public schools: 7.8
Conclusion: Going to a private school increases GPA by 0.4 (8.2 - 7.8)
Not so fast! What if kids who attend private schools differ systematically from those who attend public schools?
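One way to make this worry precise is potential-outcomes notation. The notation is an assumption of this sketch and does not appear on the slides: let $D_i = 1$ if child $i$ attends a private school, and let $Y_i(1)$ and $Y_i(0)$ be the GPA the child would obtain in a private and in a public school, respectively. The naive difference in means then splits into a causal part and a selection part:

```latex
\begin{align*}
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{\text{naive difference, e.g. } 8.2 - 7.8 = 0.4}
&= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{effect of private school on those who attend}} \\
&\quad + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{align*}
```

The 0.4 gap equals the causal effect only if the selection term is zero, that is, if private-school kids would have had the same average GPA as public-school kids had they attended a public school.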
We know that kids in private schools tend to have better educated and wealthier parents
Better educated parents tend to live in wealthier areas with high-quality schools and small class sizes
Better educated parents tend to be particularly interested in their kids’ schooling
Consequently: When doing naïve comparisons of GPA across school types to evaluate causal effects, we also pick up the effect of other factors (parents’ education, income, etc.)
Fact of life: Most (if not all) of the time we don’t observe all the relevant variables which lead to the two groups having different GPAs → we need to deal with unobserved factors
This problem is often referred to as the evaluation problem

The Evaluation Problem
The evaluation problem arises from the fact that we (almost) never observe the same individual both in the treatment group (private school) and in the control group (public school).
If we observed the same individual in the treatment and control group, we could easily calculate the difference in GPA and aggregate across the population
[Two images with the same line: “Hello little friend. You look sweet. Would you like a present?” ← Creep, Nice →]
It’s pretty important to have the right image

The Evaluation Problem
There is only one way of dealing effectively with the evaluation problem: Randomized Controlled Trials
How? Participants are allocated to the treatment and control groups on the basis of a lottery
Why? Participants have no influence on which group they are allocated to. The treatment and control groups do not differ, on average, in terms of unobserved factors which affect the outcome of interest

The Evaluation Problem
RCTs are the scientific “gold standard” for identifying causal effects
Unsurprisingly, it is often not possible to carry out RCTs in social science research
“Congratulations Mrs. Jones.
Your son has been randomly selected to receive only 7 years of schooling”
“No dental care for you this year Bob, aren’t you happy?”
“Bridget, you’re in the control group. No unemployment benefits for this lady please!”

The Evaluation Problem
When we can’t do proper experiments we have two options left:
1. Observational [i.e., non-experimental] data & fancy statistics → Idea: “Repair” the fact that we don’t have experimental data … Often ends up looking like this …
2. Exploit “natural experiments” which serve the same function as real experiments. Only, they weren’t designed by us
The idea is the same as in manmade experiments. We look for “natural” or quasi-experimental variation in the treatment of interest (private vs. public school); i.e., variation which is unrelated to the participants in the experiment. We let nature do the work and observe the outcome …

Slightly exotic examples of natural experiments used in social science research:
Research Question: Do social norms (egoism vs. altruism) affect people’s behavior?
Evaluation Problem: People don’t take lab experiments seriously …! Stakes are too low; no incentive to act as if in the “real” world …
We need an experiment in which there’s really something to win … or lose
How about … The sinking of the Titanic

Titanic:
Titanic hit an iceberg just before midnight on 14 April 1912 and sank 2 hours and 40 minutes later
2223 people on board, lifeboats for only 1178
Sinking can be seen as a natural experiment in which people’s lives are at stake. How do people react: Save yourself (egoism) or adhere to prevailing social norms (women and children first!)?
Results: Probability of survival not random.
Who were more likely to survive:
Women & children (not men)
Titanic staff (not passengers)
Passengers traveling on first class
Other nationalities than Brits (and especially Americans)
Strong evidence of altruistic behavior (especially among Brits!)

Exotic Example #2: What are the long-term consequences of early childhood health?
Evaluation Problem: We wish to know if children’s early health has a causal effect on how well they do later in adulthood (education, income, etc.)
The problem: Healthy mothers have healthy children. Healthy mothers also have other resources (education, income, etc.) which have a positive effect on child outcomes
Our measure of childhood health captures the causal effect of childhood health but also, indirectly, the effect of other beneficial (but unobserved) resources in the child’s environment. We wish to isolate the effect of child health …
We need a randomized shot of (ill-)health …
How about … The Influenza Epidemic of 1918?

Influenza Epidemic of 1918
Influenza epidemic hit most of the world from March 1918 to June 1920. About 1/3 of world’s population infected
10-20% of those infected died; 3-6% of world’s population died
Particularly nasty variant of H1N1 influenza virus
(Trivia: Max Weber died from influenza in June 1920; Walt Disney also got sick but survived!)
Epidemic hit the US without any warning in October 1918 and was gone by early 1919
Epidemic affected pregnant women at random: Some women got sick, others didn’t
Did children of mothers who got infected during pregnancy do worse than children of mothers who didn’t get infected?
Yes! Children whose mothers got sick during pregnancy …
Had 15% lower probability of completing high school
Had 5-9% lower earnings throughout life
Had 20% higher probability of developing a disability
Conclusion: Poor health in early life has significant, negative long-term effects!
Douglas Almond & Bhashkar Mazumder (2005): “The 1918 Influenza Pandemic and Subsequent Health Outcomes: An Analysis of SIPP Data.” American Economic Review 95: 258-262.
Douglas Almond (2006): “Is the 1918 Pandemic Over? Long-Term Effects of In Utero Influenza Exposure in the Post-1940 U.S. Population.” Journal of Political Economy 114: 672-712.

Influenza Epidemic of 1918
Researchers have also used the famine that arose during the winter of 1944 in the western part of the Netherlands as a natural experiment for child health
Same result: (Grand)children of (grand)mothers who were affected by the famine had poorer health throughout life
(Trivia: Audrey Hepburn grew up in the Netherlands during the war and was affected by the famine; she suffered from poor health throughout life. She ate chocolate every day)
Slightly depressing natural experiments. Let’s look at some more fun ones …

(More) funny experiments
The power outage in the US North-East on 14 August 2003 (biggest outage ever!). Result: Big surge in number of babies born nine months after!
[Images: “Smog day before”, “Smog day after”]
Happiness level in Denmark increased permanently after we beat the Germans in the 1992 European Championship final
The probability of an acute heart attack increases significantly when (German) men watch an exciting soccer match (even on TV)
The probability of dying from a heart attack increases significantly among (British) men when their favorite team loses a match and when the UK national team plays in European Championships
The probability of (women) experiencing domestic violence increases significantly when (men’s) favorite team doesn’t do very well …
Let’s get back to education …

Example: Class size
Why class size?
1. Important policy issue & many common sense ideas …
2. Tons of research exists which uses different analytical approaches. We can compare identification strategies
3. Tons of (old) cross-sectional studies: No clear results!
Research question: Do children in large classes perform worse than children in small classes?
Intuition says yes! However … consider that

Class size
Wealthy/better educated parents tend to move to residential areas with high-quality schools and smaller class sizes*
Parents of children in small classes possess other resources (income, social capital, etc.) which make their children perform well
Evaluation problem: Class size correlated with unobserved characteristics of children which also affect academic achievement → we don’t measure the causal effect of class size
* A UK study recently found that having a high-quality school in your catchment area increases property value by 50%

Class size
The Gold Standard: Project STAR (RCT):
11,600 pupils, 1,300 teachers, and 76 schools in Tennessee, US
Pupils randomly allocated to (1) small classes (13-17 pupils), (2) regular classes (22-26 pupils), and (3) regular classes with teacher aide
Teachers randomly allocated to classes
Main results: 1) pupils in small classes performed better in achievement tests than pupils in large classes; 2) pupils in small classes more likely to go to college; 3) particularly strong effects of small class size for pupils from low-SES families

Class size
Project STAR is one of few examples of RCTs
Did project STAR settle the matter? Well, like any RCT there were problems:
Pupils, teachers and parents knew they participated in an experiment (Rosenthal effect)
The “better” treatment (small class size) was known to everyone. What happened to the people who got annoyed by ending up in a big class? Did their behavior change?
External validity: Would the experiment produce the same result if carried out elsewhere?
Project STAR convincingly showed a causal effect of class size on academic achievement. However, the project cost more than $10 million and effect sizes were small. Is it cost efficient to reduce class size compared to other interventions which improve academic achievement?
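The contrast between an observational comparison and a lottery, as in Project STAR, can be illustrated with a small simulation. This is a minimal sketch with invented numbers, not Project STAR data: `resources` stands for unobserved parental resources, `true_effect` is an assumed effect of one extra pupil on test scores, and the helper names `ols_slope` and `simulate` are hypothetical.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def simulate(n=20_000, true_effect=-0.05, randomized=False, seed=1):
    """Return the estimated test-score change per extra pupil in class.

    All numbers are made up for illustration; they are not Project STAR data.
    """
    rng = random.Random(seed)
    class_size, score = [], []
    for _ in range(n):
        resources = rng.gauss(0, 1)  # unobserved parental resources
        if randomized:
            size = rng.choice([15, 24])  # lottery: small or regular class
        else:
            # resourceful parents sort into schools with smaller classes
            size = 22 - 3 * resources + rng.gauss(0, 2)
        # the score depends on class size AND on the unobserved resources
        y = 50 + true_effect * size + 2 * resources + rng.gauss(0, 1)
        class_size.append(size)
        score.append(y)
    return ols_slope(class_size, score)


naive = simulate(randomized=False)
experimental = simulate(randomized=True)
print(f"assumed true effect:    {-0.05:.3f}")
print(f"observational estimate: {naive:.3f}")
print(f"randomized estimate:    {experimental:.3f}")
```

Under these assumptions the observational slope comes out far more negative than the assumed effect, because parental resources lower class size and raise scores at the same time; the randomized slope lands close to the assumed effect.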
Class size
An alternative identification strategy is to look for a natural experiment which affects class size but which is unrelated to individual students
Any ideas …?

Class size
A famous example is Angrist & Lavy (1999, QJE) who used Maimonides’ rule as a natural experiment for class size in Israel
Maimonides was a 12th century rabbi who interpreted the teaching on class size in the Babylonian Talmud like this:
“Twenty-five children may be put in charge of one teacher. If the number in the class exceeds twenty-five but is no more than forty, he should have an assistant to help with the instruction”
Maimonides’ rule has been applied in Israel since 1969. It has the following consequence:
[Figure: Class size: 20; Class size: 40]
[Figure: Class size: 40; 41 pupils → Class size: 20.5]

Class size
Compare Maimonides’ rule with actual class sizes in Israel
Maimonides’ rule provides exogenous/“experimental” variation in class size (since the rule has nothing to do with students or their parents)
Angrist & Lavy find a clear negative effect of class size on reading ability

Class size: Natural experiment #2
Similar idea (Hoxby 2000, QJE):
1. Random variation from year to year in the size of each birth cohort → random variation in the number of children who start school
2. Administrative rules stipulate maximum class sizes (like Maimonides’ rule) → if the number of children entering a school crosses the threshold, a new class is automatically formed
Result: No effect of class size on GPA in Conn., USA

Class size: Natural experiment #2
Hoxby’s design used by others:
Leuven et al.
(2009, SJE): Norway, no effect of class size on GPA
Browning & Heinesen (2007, SJE): Denmark, negative effect of class size on educational attainment
Bingley, Jensen & Walker (2006): Denmark, same result as Browning & Heinesen, fancier method

Class size: Natural experiment #3
Case & Deaton (1999, QJE): South Africa during apartheid: Blacks living in Townships had [in practice] no control over (1) where they lived and (2) which school their children attended. This makes class size effectively random. Result: Large negative effect of class size on academic achievement
Heinesen (2010, EJ): Denmark: When pupils choose between French and German before 7th grade they don’t know their future class size. Result: Pupils in small French classes have higher GPA at the end of 9th grade compared to pupils in large classes

Consequently: One study alone probably doesn’t provide the truth …
In the case of class size, results differ for studies using different natural experiments: Was ist das für ein Ding? (What kind of thing is that?)
This is expected. We need to apply a “situational” interpretation of the causal effect which the experiment identifies. This phenomenon is known as … LATE (Local Average Treatment Effect).
Who does the experiment move? We give people a pill. Some people (1) always eat the pill (“always-takers”), some (2) always flush the pill down the toilet (“never-takers”), and some (3) eat the pill if they like the taste (“compliers”). Who are the compliers?

Summary: So what have we learned?

Identification: Hard facts
(Register) data is a great thing: you’ll need it!
Descriptive analysis is fine … but … “Although purely descriptive research has an important role to play, (…) most interesting research in social science is about questions of cause and effect” (Angrist/Pischke)
You need a credible identification strategy and … “The most credible and influential research programs use random assignment” (Angrist/Pischke)
Fancy statistics try to emulate an experimental design.
Usually bad substitute for the real thing. 1 trillion observations can’t fix a poor identification strategy …

Identification: Hard facts
“I can’t really think of an experiment that might identify the causal effect I’m after”. Well, tough luck: “If you can’t devise an experiment that answers your question in a world where everything goes, then the odds of generating useful results with a modest budget and nonexperimental survey data seem pretty slim” (Angrist/Pischke)
How to find good instruments/natural experiments? “So, where can you find an instrumental variable? Good instruments come from a combination of institutional knowledge and ideas about the processes determining the economic variable of interest” (Angrist/Pischke)

Identification: Good news
Taking identification seriously forces you to think hard about the mechanisms which generate the effect you’re after -> fosters creativity!
For example, I’m currently looking at …
Blood types of spouses as instruments for fertility
Pollen in the air as instrument for exam performance among (allergic) exam takers
Weather conditions (rain/sun) as instrument for happiness
… RCT on training of school teachers in classroom management
Once you start looking for natural experiments they’re all around you. Warning: There’s no going back …
Economists are currently in the lead with respect to taking identification seriously. Political science comes in second and sociology (and the like) still have a long way to go …

Case in point
The Review of Economics and Statistics, vol. 93(3), August 2011: 20 empirical articles: RCT: 0; natural experiment: 9; panel data: 6; cross-sectional/descriptive: 6.
American Political Science Review, vol. 105(2), May 2011: 5 empirical (quant) articles: Natural experiment: 2; panel data: 1; cross-sectional/descriptive: 2 (both mainly theoretical).
American Sociological Review, vol. 76(4), August 2011: 4 empirical (quant) articles: RCT: 1 (lab experiment!); cross-sectional/descriptive: 3.
European Sociological Review, 27(4), August 2011: 6 (quant) articles: Panel data/econometrics: 2; cross-sectional/descriptive: 4.

Case in point
The Review of Economics and Statistics, vol. 93(3), August 2011:
Creativity:
Gun shows in Texas and California as instrument for homicide/suicide rates
(Re)location of German airports after WW2 as instrument for industry location
The Holocaust as an instrument for altruistic behavior (rescuing Jews)
Shifts in labor demand in California as instrument for social policy preferences
Winning the Florida Lottery as instrument for (non)likelihood of bankruptcy
Drought/floods in China 2,000 years ago as instrument for nomadic incursions
And so on!!!

Thank you for your attention!
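As a closing sketch of the instrument logic and of the LATE idea from the pill example, the simulation below computes a simple Wald estimate: the difference in mean outcomes between the two instrument groups divided by the difference in take-up. Everything in it is a hypothetical illustration: the instrument `z` (a coin flip deciding whether the pill is offered in a tasty flavor), the shares of the three types, the effect sizes, and the function name `simulate_wald`.

```python
import random
import statistics


def simulate_wald(n=60_000, seed=7):
    """Wald (instrumental-variable) estimate in a made-up pill experiment.

    z = 1 if the pill is offered in a tasty flavor (coin-flip instrument),
    d = 1 if the person actually eats the pill, y = outcome.
    Effect sizes and type shares are hypothetical.
    """
    rng = random.Random(seed)
    # effect of the pill differs by type; only compliers respond to z
    effect = {"always": 1.0, "never": 5.0, "complier": 2.0}
    y_by_z = {0: [], 1: []}
    d_by_z = {0: [], 1: []}
    for _ in range(n):
        kind = rng.choices(["always", "never", "complier"],
                           weights=[0.2, 0.3, 0.5])[0]
        z = rng.randint(0, 1)
        if kind == "always":
            d = 1
        elif kind == "never":
            d = 0
        else:
            d = z  # compliers eat the pill only when it tastes good
        y = 10 + effect[kind] * d + rng.gauss(0, 1)
        y_by_z[z].append(y)
        d_by_z[z].append(d)
    # reduced form: effect of the instrument on the outcome
    reduced_form = statistics.fmean(y_by_z[1]) - statistics.fmean(y_by_z[0])
    # first stage: effect of the instrument on taking the pill
    first_stage = statistics.fmean(d_by_z[1]) - statistics.fmean(d_by_z[0])
    return reduced_form / first_stage, first_stage


late, complier_share = simulate_wald()
print(f"Wald estimate: {late:.2f} (complier effect set to 2.0)")
print(f"first stage:   {complier_share:.2f} (complier share set to 0.5)")
```

In this setup the estimate tracks the effect assumed for compliers and carries no information about the always-takers or never-takers, whose behavior the instrument does not move. A different instrument would move different people, which is one reading of why different natural experiments can give different answers.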