Understanding and misunderstanding randomized controlled trials

Angus Deaton and Nancy Cartwright
Princeton University
Durham University and UC San Diego
This version, August 2016
We acknowledge helpful discussions with many people over the many years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Bo Honoré, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1 of the paper. We have benefited from generous comments on an earlier version by Tim Besley, Chris Blattman, Sylvain Chassang, Steven Durlauf, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jim Heckman, Jeff Hammer, Macartan Humphreys, Helen Milner, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U). Deaton acknowledges financial support through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01 AG05842-14 and through Princeton University’s Roybal Center, Grant P30 AG024928.
ABSTRACT
RCTs are valuable tools whose use is spreading in economics and in other social sciences. They are seen as desirable aids in scientific discovery and for generating evidence for policy. Yet some of the enthusiasm for RCTs appears to be based on misunderstandings: that randomization provides a fair test by equalizing everything but the treatment and so allows a precise estimate of the treatment alone; that randomization is required to solve selection problems; that lack of blinding does little to compromise inference; and that statistical inference in RCTs is straightforward, because it requires only the comparison of two means. None of these statements is true. RCTs do indeed require minimal assumptions and can operate with little prior knowledge, an advantage when persuading distrustful audiences, but a crucial disadvantage for cumulative scientific progress, where randomization adds noise and undermines precision. The lack of connection between RCTs and other scientific knowledge makes it hard to use them outside of the exact context in which they are conducted. Yet, once they are seen as part of a cumulative program, they can play a role in building general knowledge and useful predictions, provided they are combined with other methods, including conceptual and theoretical development, to discover not “what works,” but why things work. Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.
Introduction
Randomized trials are currently much used in economics and are widely considered to be a desirable method of empirical analysis and discovery. There is a long history of such trials in the subject. There were four large federally sponsored negative income tax trials in the 1960s and 1970s. In the mid-1970s, there was a famous, and still frequently cited, trial on health insurance, the Rand health experiment. There was then a period during which randomized controlled trials (RCTs) received less attention from academic economists; even so, randomized trials on welfare, social policy, labor markets, and education have continued since the mid-1970s, some with substantial involvement and discussion by academic economists, see Greenberg and Shroder (2004).
Recent randomized trials in economic development have attracted attention, and the idea that such trials can discover “what works” has been widely adopted in economics, as well as in political science, education, and social policy. Among both researchers and the general public, RCTs are perceived to yield causal inferences and parameter estimates that are more credible than those from other empirical methods that do not involve the comparison of randomly selected treatment and control groups. RCTs are seen as largely exempt from many of the econometric problems that characterize observational studies. When RCTs are not feasible, researchers often mimic randomized designs by using observational data to construct two groups that, as far as possible, are identical and differ only in their exposure to treatment.
The preference for randomized trials has spread beyond trialists to the general public and the media, which typically reports favorably on them. They are seen as accurate, objective, and largely independent of “expert” knowledge that is often regarded as manipulable, politically biased, or otherwise suspect. There are now “What Works” centers using and recommending RCTs in a huge range of areas of social concern across Europe and the Anglophone world, such as the US Department of Education’s What Works Clearing House, The Campbell Collaboration (parallel to the Cochrane Collaboration in health), the Scottish Intercollegiate Guidelines Network (SIGN), the US Department of Health and Human Services Child Welfare Information Gateway, the US Social and Behavioral Sciences Team, and others. The British government has established eight new (well-financed) What Works Centers similar to the National Institute for Health and Care Excellence (NICE), with more planned. They extend NICE’s evaluation of health treatment into aging, early intervention, education, crime, local economic growth, Scottish service delivery, poverty, and wellbeing. These centers see randomized controlled trials as their preferred tool. There is a widespread desire for careful evaluation, to support what is sometimes called the “audit society,” and everyone assents to the idea that policy should be based on evidence of effectiveness, for which randomized trials appear to be ideally suited. Trials are easily, if not very precisely, explained along the lines that random selection generates two otherwise identical groups, one treated and one not; results are easy to compute, since all we need is the comparison of two averages; and, unlike other methods, an RCT seems to require no specialized understanding of the subject matter. It seems a truly general tool that (nominally) works in the same way in agriculture, medicine, sociology, economics, politics, and education. It is supposed to require no prior knowledge, whether suspect or not, which is seen as a great advantage.
In this paper, we present two sets of arguments, one on conducting RCTs and on how to interpret the results, and one on how to use the results once we have them. Although we do not care for the terms (for reasons that will become apparent), the two sections correspond roughly to internal and external validity.
Randomized controlled trials are often useful, and have been important sources of empirical evidence for causal claims and evaluation of effectiveness in many fields. Yet many of the popular interpretations, not only among the general public but also among trialists, are incomplete and sometimes misleading, and these misunderstandings can lead to unwarranted trust in the impregnability of results from RCTs, to a lack of understanding of their limitations, and to mistaken claims about how widely their results can be used. All these, in turn, can lead to flawed policy recommendations.
Among the misunderstandings are the following: (a) randomization ensures a fair trial by ensuring that, at least with high probability, treatment and control groups differ only in the treatment; (b) RCTs provide not only unbiased estimates of average treatment effects, but also precise estimates; (c) randomization is necessary to solve the selection problem; (d) lack of blinding, which is common in social science experiments, does not seriously compromise inference; (e) statistical inference in RCTs, which requires only the simple comparison of means, is straightforward, so that standard significance tests are reliable.
While many of the problems of RCTs are shared with observational studies, some are unique, for example the fact that randomizing itself can change outcomes independently of treatment. More generally, it is almost never the case that an RCT can be judged superior to a well-conducted observational study simply by virtue of being an RCT. The idea that all methods have their flaws, but RCTs always have fewest, is one of the deepest and most pernicious misunderstandings.
In the second part of the paper, we discuss the uses and limitations of results from RCTs for making policy. The non-parametric and theory-free nature of RCTs, which is arguably an advantage in estimation, is a serious disadvantage when we try to use the results outside of the context in which they were obtained. Much of the literature, in economic development and elsewhere, perhaps inspired by Campbell and Stanley’s (1963) famous “primacy of internal validity,” assumes that internal validity is enough to guarantee the usefulness of the estimates in different contexts. Without understanding RCTs within the context of the knowledge that we already possess about the world, much of it obtained by other methods, we do not know how to use trial results. But once the commitment has been made to seeing RCTs within this broader structure of knowledge and inference, and when they are designed to fit within it, they can play a useful role in building general knowledge and policy predictions; for example, an RCT can be a good way of estimating a key policy magnitude. The broader context within which RCTs need to be set includes not only models of economic structure, but also the previous experience that policymakers have accumulated about local settings and implementation. Most importantly for economic development, the use of RCT results should be sensitive to what people want, both individually and collectively. RCTs should not become yet another technical fix that is imposed on people by bureaucrats or foreigners; RCT results need to be incorporated into a democratic process of public reasoning, Sen (2011). Greenberg, Shroder, and Onstott (1999) document that, even before the recent wave of RCTs in development, most RCTs in economics have been carried out by rich people on poor people, and this fact should make us especially careful to avoid charges of paternalism.
Section 1: Interpreting the results of RCTs
1.1 Prolog
RCTs were first popularized by Fisher’s agricultural trials in the 1930s and are today often described by the Rubin counterfactual causal model, which itself traces back to Neyman in 1923; see Freedman (2006) for a description of the history. Each unit $i$ (a person, a pupil, a school, an agricultural plot) is assumed to have two possible outcomes, $Y_{i0}$ and $Y_{i1}$, the former occurring if there is no treatment at the time in question, the latter if the unit is treated. The difference between the two outcomes, $Y_{i1} - Y_{i0}$, is the individual treatment effect, which we shall denote $\beta_i$. Treatment effects are typically different for different units. No unit can be both treated and untreated at the same time, so only one or other of the outcomes occurs; the other is counterfactual, so that individual treatment effects are in principle unobservable.
We note parenthetically that while we use the counterfactual framework here, we do not endorse it, nor argue against other approaches that do not use it, such as the Cowles Commission econometric framework where the causal relations are coded as structural equations, see also Pearl (2009). Imbens and Wooldridge (2009, Introduction) provide an eloquent defense of the Rubin formulation, emphasizing the credibility that comes from a theory-free specification with unlimited heterogeneity in treatment effects. Heckman and Vytlacil (2007, Introduction) make an equally eloquent case against, noting that the treatments in RCTs are often unclearly specified and that the treatment effects are hard to link to invariant parameters that would be useful elsewhere.
The basic theorem governing RCTs is a remarkable one. It states that the average treatment effect is, in expectation over randomizations, the average outcome in the treatment group minus the average outcome in the control group. While we cannot observe the individual treatment effects, we can estimate their mean. The estimate of the average treatment effect (ATE) is simply the difference between the means in the two groups, and it has a standard error that can be estimated and used to make significance statements according to the statistical theory that applies to the difference of two means, on which more below in Section 1.3. The difference in means is an unbiased estimator of the mean treatment effect.
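The unbiasedness claim can be checked directly on a toy example. The following is a minimal sketch, not part of the original argument: the six pairs of potential outcomes are invented for illustration, and the code enumerates every way of treating three of the six units, something that is only possible here because both potential outcomes are assumed known.

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes for six units (illustrative numbers only).
y0 = [3.0, 5.0, 2.0, 8.0, 6.0, 4.0]    # outcome if untreated
y1 = [4.0, 9.0, 2.0, 7.0, 11.0, 6.0]   # outcome if treated

n = len(y0)
true_ate = mean(b - a for a, b in zip(y0, y1))

def diff_in_means(treated):
    """Difference in observed group means for one assignment (a set of treated indices)."""
    controls = [i for i in range(n) if i not in treated]
    return mean(y1[i] for i in treated) - mean(y0[i] for i in controls)

# Every way of assigning exactly three of the six units to treatment.
estimates = [diff_in_means(set(t)) for t in combinations(range(n), 3)]

average_estimate = mean(estimates)            # equals true_ate up to rounding
spread = (min(estimates), max(estimates))     # single-trial estimates vary widely

print(true_ate, average_estimate, spread)
```

Averaged over all twenty assignments, the difference in means reproduces the true average effect, while the estimate from any one assignment can lie far from it; the second fact is the subject of the discussion of precision.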
The theorem is remarkable because it requires so few assumptions; no model is required, no assumptions about covariates are needed, the treatment effects can be heterogeneous, and nothing is assumed about the shapes of statistical distributions other than the statistical question of the existence of the mean of the counterfactual outcome values. In terms of one of our running themes, it requires no expert knowledge, nor acceptance of priors, expert or otherwise. The theorem also has its limitations; the proof uses the fact that the difference in two means is the mean of the individual differences, i.e. the treatment effects. This is not true for the median (the difference in two medians is not the median of the differences, which is the median treatment effect). It also does not allow us to estimate any percentile of the distribution of treatment effects, or its variance. (Quantile estimates of treatment effects are not the quantiles of the distribution of treatment effects, but the differences in the quantiles of the two marginal distributions of treatments and controls; the two measures coincide if the experiment has no effect on ranks, an assumption that would be convenient but is hard to justify, at least in general.) All of these statistics can be of interest for policy but RCTs are not informative about them, or at least not without further assumptions, for example on the distribution of treatment effects, see Heckman, Smith, and Clements (1997), and much of the attraction of RCTs is the absence of such assumptions.
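The contrast between means and medians is easy to see numerically. The sketch below uses invented potential outcomes for five units (an assumption for illustration only) and compares the two marginal distributions with the distribution of individual effects, which in a real trial is never observed.

```python
from statistics import mean, median

# Hypothetical potential outcomes for five units (illustrative numbers only).
y0 = [1.0, 2.0, 3.0, 4.0, 5.0]   # outcome if untreated
y1 = [6.0, 7.0, 3.0, 4.0, 5.0]   # outcome if treated
effects = [b - a for a, b in zip(y0, y1)]   # individual treatment effects

diff_of_means = mean(y1) - mean(y0)
mean_of_effects = mean(effects)             # same as diff_of_means

diff_of_medians = median(y1) - median(y0)
median_of_effects = median(effects)         # differs from diff_of_medians

print(diff_of_means, mean_of_effects, diff_of_medians, median_of_effects)
```

Here the treatment reshuffles ranks (the two lowest untreated outcomes become the two highest treated ones), so the difference in medians says little about the median effect, even though the difference in means still equals the mean effect.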
The basic theorem tells us that the difference in means is an unbiased estimator of the average treatment effect but says nothing about the variance of this estimator. In general, a biased estimator that is typically closer to the truth will often be better than an unbiased estimator that is typically wide of the truth. There is nothing to say that a non-RCT estimator, in spite of bias, might not have a lower mean squared error (MSE), one measure of the distance of the estimate from the truth, or a lower value of a “loss function” that defines the loss to the experimenter of missing the target.
It is useful to think of the mean average treatment effect from an RCT in terms of sampling from a finite population, as when the Bureau of the Census estimates average income of the US population in 2013. For the RCT, the population is the population of units whose average treatment effect is of interest; note the importance of defining the population of interest because, given the heterogeneity of treatment effects, the average treatment effect will vary across different populations, just as average incomes differ across different subpopulations of the US. Finite population sampling theory tells us how to get accurate estimates of means from samples; in the RCT case, the sample is the study sample, both treatments and controls. In principle, the study sample could be a random sample of the parent population of interest, in which case it is representative of it, but that is seldom the case. Because the estimate is population specific, it is not (or need not be) thought of as the parameter of a super-population, or otherwise generalizable in any way. Average income in the US in 2013 may be of interest in its own right; but it will not be the same as average income in 2014, nor will it be the same as average income of whites, or of the populations of Wyoming or New York. Exactly the same is true of the estimate of an average treatment effect; it applies to the study sample in which the trial was done, at the time when it was done, and its use outside of those confines, though often possible, requires argument and justification. Without such an argument, we cannot claim that an ATE is “the” mean treatment effect any more than that average income in the US in 2013 is “the” average income of the US in any other year. Of course, knowing average income in 2013 can be useful for making other calculations, such as an estimate of income in 2014, or of a subpopulation that we know is richer or poorer; the fact that an estimate does not universally generalize does not make it useless. We shall return to these issues in Section 2.
1.2 Precision, balance, and randomization
1.2.1 Precision and bias
We should like our estimate of the average treatment effect to be as close to the truth as possible. One way to assess closeness is the mean square error (MSE), defined as
$$\mathrm{MSE} = E(\hat{\theta} - \theta)^2 \qquad (1)$$
where $\theta$ is the true average treatment effect, and $\hat{\theta}$ is its estimate from a particular trial. The expectation is taken over repeated randomizations of treatments and controls using the same study population. It is also standard to rewrite (1) as
$$\mathrm{MSE} = E\big((\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\big)^2 = \mathrm{var}(\hat{\theta}) + \mathrm{bias}(\hat{\theta}, \theta)^2 \qquad (2)$$
so that mean square error is the sum of the variance of the estimator (which we typically know something about from the estimated standard error) and the square of the bias (which in the case of an ideal randomized controlled trial is zero). The elementary, but crucial, point is that, while it is certainly good that the bias is zero, that fact does nothing to make the distance from the truth as small as it might be, which is what we really care about. An unbiased estimator that is nearly always wide of the target is not as useful as one that is always near to it, even if, on average, it is off center. More generally, it will often be desirable to trade in some unbiasedness for greater precision. Experiments are often expensive, so we cannot always rely on large samples to bring the estimate close to the truth and resolve these issues for us. Much of this Section is concerned with how to design experiments to maximize precision.
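The trade-off between bias and variance can be illustrated with a small simulation. This is a sketch under invented assumptions: the true effect, the two sampling distributions (normal, with the stated means and standard deviations), and the number of draws are all hypothetical choices, not quantities from any trial.

```python
import random
from statistics import mean, pvariance

random.seed(0)
theta = 1.0        # true ATE (assumed for illustration)
draws = 20_000     # hypothetical replications of each estimator

def summarize(estimates):
    """Return (MSE, variance, bias) of a list of estimates around theta."""
    mse = mean((e - theta) ** 2 for e in estimates)
    var = pvariance(estimates)
    bias = mean(estimates) - theta
    return mse, var, bias

# An unbiased but noisy estimator, and a biased but tight one.
unbiased_noisy = [random.gauss(theta, 2.0) for _ in range(draws)]
biased_tight = [random.gauss(theta + 0.5, 0.5) for _ in range(draws)]

mse_u, var_u, bias_u = summarize(unbiased_noisy)
mse_b, var_b, bias_b = summarize(biased_tight)

print(mse_u, var_u + bias_u ** 2)   # the two numbers agree: MSE = variance + bias^2
print(mse_b, var_b + bias_b ** 2)
```

With these made-up numbers the biased estimator has much the smaller mean square error, which is the sense in which unbiasedness alone does not settle which estimator is closer to the truth.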
Unbiasedness alone cannot therefore justify the often-expressed preference for RCTs over other estimators. The minimalist assumptions required for an RCT to be unbiased are an attraction although, as we shall see in this Section, this advantage usually comes at the cost of lowered precision and of difficulties in knowing how to use the result, as we shall see in Section 2. Yet there is an often expressed belief that RCTs are somehow guaranteed to be precise, simply because they are RCTs. Occasionally bias and precision are explicitly confused; the JPAL website, in its explanation of why it is good to randomize, says that RCTs “are generally considered the most rigorous and, all else equal, produce the most accurate (i.e. unbiased) results.” Shadish, Cook, and Campbell (2002), in what is (rightly) considered one of the bibles of causal inference in social science, state without qualification that “randomized experiments provide a precise answer about whether a treatment worked” (p. 276) and “The randomized experiment is often the preferred method for obtaining a precise and statistically unbiased estimate of the effects of an intervention,” (p. 277) our italics.
Contrast this with Cronbach et al. (1980), who quote Kendall’s (1957) pastiche of Longfellow, “Hiawatha designs an experiment,” where Hiawatha’s insistence on unbiasedness leads to his never hitting the target and to his eventual banishment.
1.2.2 Balance and precision in a linear all-cause model
A useful way to think about precision and what an RCT does and does not do is to use a schematic linear causal model of the form:
$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (3)$$
where, as before, $Y_i$ is the outcome for unit $i$, $T_i$ is a dichotomous (1, 0) treatment dummy indicating whether or not $i$ is treated, and $\beta_i$ is the individual treatment effect of the treatment on $i$. The x’s are the observed or unobserved other causes of the outcome, and we suppose that (3) captures all the causes of $Y_i$. $J$ may be very large. Because the heterogeneity of the individual treatment effects $\beta_i$ is unrestricted, we allow the possibility that the treatment interacts with the x’s or other variables, so that the effects of $T$ can depend on any other variables, and we shall have occasion to make this explicit below. An obvious and important example is when the treatment is effective only in the presence of a particular value of one of the x’s.
We do not need $i$ subscripts on the $\gamma$'s that control the effects of the other causes; if their effects differ across individuals, we include the interactions of individual characteristics with the original x’s as new x’s. Given that the x’s can be unobservable, this is not restrictive. Because the $\beta$'s can depend on the x’s, the effects of the x’s on the outcome can depend on $T_i$, or, equivalently, the effects of treatment can depend on covariates.
In an experiment, with or without randomization, we can represent the treatment group as having $T_i = 1$, and the control group as having $T_i = 0$. So when we subtract the average outcomes among the controls from the average outcomes among the treatments, we will get
$$\bar{Y}^1 - \bar{Y}^0 = \bar{\beta}^1 + \sum_{j=1}^{J} \gamma_j (\bar{x}^1_{ij} - \bar{x}^0_{ij}) = \bar{\beta}^1 + (\bar{S}^1 - \bar{S}^0) \qquad (4)$$
The first term on the far right hand side, which is the average treatment effect, is what we want, but the second term or error term, which is the sum of the net average balances of other causes across the two groups, will generally be non-zero (because of selection or many other reasons) and needs to be dealt with somehow. We get what we want when the means of all the other causes are identical in the two groups, or more precisely when the sum of their net differences $\bar{S}^1 - \bar{S}^0$ is zero; this is the case of perfect balance. With perfect balance, the difference between the two means is exactly equal to the average of the treatment effect among the treated, so that we have the ultimate precision and we know the answer exactly, at least in this linear case.
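The decomposition in (4) can be verified on simulated data. The sketch below is illustrative only: the number of units, the three other causes, their coefficients, and the distribution of individual effects are all invented, and the names `beta_bar_treated` and `imbalance` are hypothetical labels for the two terms on the right of (4).

```python
import random
from statistics import mean

random.seed(1)
n, J = 40, 3
gamma = [2.0, -1.0, 0.5]                                    # effects of the other causes (assumed)
x = [[random.gauss(0, 1) for _ in range(J)] for _ in range(n)]
beta = [random.uniform(0, 2) for _ in range(n)]             # heterogeneous treatment effects
treated = set(random.sample(range(n), n // 2))              # one assignment
controls = [i for i in range(n) if i not in treated]

def net_other_causes(i):
    """Sum over j of gamma_j * x_ij for unit i."""
    return sum(g * xj for g, xj in zip(gamma, x[i]))

def outcome(i):
    """Outcome generated by the linear all-cause model (3)."""
    t = 1 if i in treated else 0
    return beta[i] * t + net_other_causes(i)

diff_in_means = mean(outcome(i) for i in treated) - mean(outcome(i) for i in controls)
beta_bar_treated = mean(beta[i] for i in treated)
imbalance = (mean(net_other_causes(i) for i in treated)
             - mean(net_other_causes(i) for i in controls))

print(diff_in_means, beta_bar_treated + imbalance, imbalance)
```

The first two printed numbers coincide, and the third shows how much of the observed difference in means in this one assignment comes from imbalance in the other causes rather than from treatment.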
1.2.3 Balancing acts: real and magical
How do we get balance, or something close to it? What, exactly, is the role of randomization? In a laboratory experiment, where there is good background knowledge of the other causes, the experimenter has a good chance of controlling all of the other causes, aiming to ensure that the last term in (4) is close to zero. Failing such knowledge and control, an alternative is matching, frequently used in statistical, medical, and econometric work. For each treatment, a match is found that is as close as possible on all suspected causes, so that, once again, the last term in (4) can be kept small. Again, when we have a good idea of the causes, matching may also deliver a precise estimate. Of course, when there are important unknown or unobservable causes, neither laboratory control nor matching offers protection.
What does randomization do? Because the treatments and controls come from the same underlying distribution, randomization guarantees, by construction, that the last term on the right in (4) is zero in expectation at baseline (much can happen to disturb this beyond baseline). This is true whether or not the causes are observed. If the RCT is repeated many times on the same trial population, then the last term will be zero when averaged over an infinite number of (entirely hypothetical) trials. Of course, this does nothing to make it zero in any one trial where the difference in means will be equal to the average treatment effect among those treated plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in the randomization that limits its size; by chance, there can be one (or more) important excluded cause(s) that is very unequally distributed between treatment and controls. This imbalance will vary over replications of the trial, and its average size will ideally be captured by the standard error of the estimated ATE, which gives us some idea of how likely we are to be away from the truth. Getting the standard error and associated significance statements right is therefore of great importance.
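The distinction between balance in expectation and balance in a single trial can be seen by re-randomizing a fixed study population many times. In this sketch the per-unit net effect of the other causes (the list `s`), the population size, and the number of re-randomizations are invented for illustration; nothing here reproduces an actual trial.

```python
import random
from statistics import mean, pstdev

random.seed(2)
n = 50
# Net effect of all other causes for each unit, held fixed across re-randomizations (assumed values).
s = [random.gauss(0, 1) for _ in range(n)]

def imbalance_once():
    """Treated-minus-control difference in the mean net effect of other causes for one random assignment."""
    treated = set(random.sample(range(n), n // 2))
    s1 = sum(s[i] for i in treated) / len(treated)
    s0 = sum(s[i] for i in range(n) if i not in treated) / (n - len(treated))
    return s1 - s0

imbalances = [imbalance_once() for _ in range(2000)]
average_imbalance = mean(imbalances)       # near zero: balance in expectation
typical_imbalance = pstdev(imbalances)     # not near zero: imbalance in a typical single trial
largest_imbalance = max(abs(d) for d in imbalances)

print(average_imbalance, typical_imbalance, largest_imbalance)
```

The average over hypothetical replications is close to zero, as randomization guarantees, but the standard deviation across replications is what governs how far any one trial is likely to be from balance.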
Exactly what randomization does is frequently lost in the practical literature, and there is often a confusion between perfect control, on the one hand (as in a laboratory experiment or perfect matching with no unobservable causes), and control in expectation, which is what RCTs do. We suspect that at least some of the popular and professional enthusiasm for RCTs, as well as the belief that they are precise by construction, comes from misunderstandings about balance. These misunderstandings are not so much among the trialists who, when pressed, will give a correct account, but come from imprecise statements by trialists that are taken as gospel by the lay audience that the trialists are keen to reach.
Such a misunderstanding is well captured by the following quote from the World Bank’s online manual on impact evaluation:
“We can be very confident that our estimated average impact, given as the difference between the outcome under treatment (the mean outcome of the randomly assigned treatment group) and our estimate of the counterfactual (the mean outcome of the randomly assigned comparison group) constitute the true impact of the program, since by construction we have eliminated all observed and unobserved factors that might otherwise plausibly explain the difference in outcomes.” Gertler et al (2011) (our italics.)
This statement confuses actual balance in any single trial with balance in expectation over many entirely hypothetical trials. If the statement above were true, and if all factors were indeed controlled (and no imbalances were introduced post randomization), the difference would be an exact measure of the average treatment effect, at least in the absence of measurement error. We should not only be confident of our estimate; we would know the truth, as the quote says.
A similar quote comes from John List, one of the most imaginative and successful scholars who use RCTs:
“complications that are difficult to understand and control represent key reasons to conduct experiments, not a point of skepticism. This is because randomization acts as an instrumental variable, balancing unobservables across control and treatment groups.” Al-Ubaydli and List (2013) (italics in the original.)
And from Dean Karlan, founder and President of Yale’s Innovations for Poverty Action, which runs development RCTs around the world:
“As in medical trials, we isolate the impact of an intervention by randomly assigning subjects to treatments and control groups. This makes it so that all those other factors which could influence the outcome are present in treatment and control, and thus any difference in outcome can be confidently attributed to the intervention.” Karlan, Goldberg and Copestake (2009)
And from the medical literature, from a distinguished psychiatrist who is deeply skeptical of RCTs,
“The beauty of a randomized trial is that the researcher does not need to understand all the factors that influence outcomes. Say that an undiscovered genetic variation makes certain people unresponsive to medication. The randomizing process will ensure, or make it highly probable, that the arms of the trial contain equal numbers of subjects with that variation. The result will be a fair test.” (Kramer, 2016, p. 18)
Claims are even made that RCTs reveal knowledge without possibility of error. Judy Gueron, the long-time president of MDRC, which has been running RCTs on US government policy for 45 years, asks why federal and state officials were prepared to support randomization in spite of frequent difficulties and in spite of the availability of other methods, and concludes that it was because “they wanted to learn the truth,” Gueron and Rolston (2013, 429). There are many statements of the form “We know that [project X] worked because it was evaluated with a randomized trial,” Dynarski (2015).
Many writers are more cautious, and modify statements about treatment and control groups being identical with terms such as “statistically identical,” “reasonably similar” or do not differ “systematically.” And we have no doubt that all of the authors quoted above understand the need for these qualifications. But to the uninformed reader, the qualified statements are unlikely to be differentiated from the unqualified statements quoted above. Nor is it always clear what some of these terms mean. For example, if two people are selected at random from a population, and it so happens that one is female and one male, in what sense are they statistically identical? While it is true that they were randomly selected from the same parent distribution, which provides the basis for inference, the calculation of standard errors, and significance statements, it does nothing to help with balance or precision in any given trial.
1.2.4 Sample size and statistical inference in unbalanced trials
Is a single trial more likely to be balanced, and thus more precise, when the sample size is large? Indeed, as the sample size tends to infinity, the means of the x’s in the treatment and control groups will become arbitrarily close. Yet this is of little help in finite samples as Fisher (1926) noted: “Most experimenters on carrying out a random assignment will be shocked to find how far from equally the plots distribute themselves,” quoted in Morgan and Rubin (2012). Even with very large sample sizes, if there are a large number of causes, balance on each cause may be infeasible. Vandenbroucke (2004) notes that there are three million base pairs in the human genome, many or all of which could be relevant prognostic factors for the biological outcome that we are seeking to influence.
However, as (4) makes clear, we do not need balance on all causes, only on their net effect, the term $\bar{S}^1 - \bar{S}^0$, which does not require balance on each cause individually. Yet there is no guarantee that even the net effect will be small. For example, there may be only one omitted unobserved cause whose effect is large, one single base pair say, so that if that one cause is unbalanced across treatments and controls, the fact that there is individual or even net balance on other less important causes is not going to help.
Statements about large samples guaranteeing balance are not useful without guidelines about how large is large enough, and such statements cannot be made without knowledge of other causes and how they affect outcomes.
A simple case illustrates. Suppose that there is one hidden cause in (3), a binary variable $x$ that is unity with probability $p$ and 0 otherwise. With $n$ controls and $n$ treatments, the difference in fractions with $x = 1$ in the two groups has mean 0 and variance $2p(1-p)/n$. With $n = 100$ and $p = 0.5$, the standard error around 0 is about 0.07, so that, if this unobserved confounder has a large effect on the outcome, the imbalance could mask the effect of treatment, or be mistaken as evidence for the effectiveness of a truly ineffective treatment.
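The size of the chance imbalance in a single binary cause can be checked by simulation. The sketch below assumes that each unit carries the hidden cause independently with probability p, so that the difference in fractions between two arms of n units each has the binomial variance 2p(1-p)/n; the number of simulated trials and the ten-point cutoff are arbitrary illustrative choices.

```python
import random
from math import sqrt

random.seed(3)
n, p = 100, 0.5      # units per arm and prevalence of the hidden cause
trials = 10_000      # hypothetical re-randomizations

def count_with_x(size):
    """Number of units in one arm for which the hidden binary cause equals one."""
    return sum(random.random() < p for _ in range(size))

count_diffs = [count_with_x(n) - count_with_x(n) for _ in range(trials)]
diffs = [c / n for c in count_diffs]                  # difference in fractions with x = 1

sim_mean = sum(diffs) / trials
sim_sd = sqrt(sum((d - sim_mean) ** 2 for d in diffs) / trials)
binomial_sd = sqrt(2 * p * (1 - p) / n)               # analytic value under independent draws
share_10_points_or_more = sum(abs(c) >= 10 for c in count_diffs) / trials

print(sim_mean, sim_sd, binomial_sd, share_10_points_or_more)
```

The last number is the share of simulated trials in which the two arms differ by ten percentage points or more in the prevalence of the hidden cause, which gives a feel for how often a single trial of this size is noticeably unbalanced.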
Lack of balance in the above example or in the net effect of either observables or non-observables in (4) does not compromise the inference in an RCT in the sense of obtaining a standard error for the unbiased ATE, see Senn (2013) for a particularly clear statement. The randomization does not guarantee balance but it provides the basis for making probability statements about the various possible outcomes, which is also clear in the example in the previous paragraph. This was also Fisher’s argument for randomization. Senn writes “the probability calculation applied to a clinical trial automatically makes an allowance for the fact that the groups will almost certainly be unbalanced.” (italics in the original.) If the design is such that, even with perfect randomization, successive replications tend to generate large imbalances, the resulting imprecision of the ATE will show up in its standard error. Of course, the usefulness of this requires that the calculated standard errors permit correct significance statements, which, as we shall see in the next subsection, is often far from straightforward. In the example above, an extreme, but entirely possible, case occurs when, by chance, the unobserved confounder is perfectly correlated with the treatment; unless there are actual replications, the false certainty that such an experiment provides will be reinforced by false significance tests.
1.2.4 Testing for balance
In practice, trialists in economics (and in some other disciplines) usually carry out a statistical test for balance after randomization but before analysis, presumably with the aim of taking some appropriate action if balance fails. The first table of the paper typically presents the sample means of observable covariates (the observable x’s in (3), which are either causes in their own right or interact with the $\beta$'s) for the control and treatment groups, together with their differences, and tests for whether or not they are significantly different from zero, either variable by variable, or jointly. These tests are appropriate if we are concerned that the randomizing device might have failed (because we are drawing playing cards, rolling dice, or spinning bottle tops, though presumably not if the randomization is done by a random number generator, always supposing that there is such a thing as randomness, Singer and Pincus (1998)), or if we are worried that the randomization is undermined by non-blinded subjects or trialists systematically undermining the allocation. Otherwise, as the next paragraph shows, the test makes no sense and is not informative, which does not seem to stop it being routinely used.
If we write $\mu_0$ and $\mu_1$ for the (vectors of) population means (i.e. the means over all possible randomizations) of the observed x’s in the control and treatment groups at the point of assignment, the null hypothesis is (presumably, as judged by the typical balance test) that the two vectors are identical, with the alternative being that they are not. But if the randomization has been correctly done, the null hypothesis is true by construction, see e.g. Altman (1985) and Senn (1994), which may help explain why it so rarely fails in practice. Indeed, although we cannot “test” it, we know that the null hypothesis is also true for the unobservable components of x. Note the contrast with the statements quoted above claiming that RCTs guarantee balance on causes across treatment and control groups. Those statements refer to balance of causes at the point of assignment in any single trial, which is not guaranteed by randomization, whereas the balance tests are about the balance of causes at the point of assignment in expectation over many trials, which is guaranteed by randomization. The confusion is perhaps understandable, but it is confusion nevertheless. Of course, it makes sense to look for balance between observed covariates using some more appropriate distance measure, for example the normalized difference in means, Imbens and Wooldridge (2009, equation 3).
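As a rough sketch of such a distance measure, the function below computes a normalized difference in means for one covariate. It assumes the form (mean of treated minus mean of controls) divided by the square root of the sum of the two sample variances; the exact scaling should be checked against the cited equation. The ages are invented numbers.

```python
from math import sqrt
from statistics import mean, variance

def normalized_difference(x_treated, x_control):
    """Difference in group means scaled by sqrt(s1^2 + s0^2), the two sample variances."""
    scale = sqrt(variance(x_treated) + variance(x_control))
    return (mean(x_treated) - mean(x_control)) / scale

# Hypothetical baseline ages in the two arms of a trial (illustrative numbers only).
age_treated = [34, 41, 29, 50, 38, 45]
age_control = [31, 39, 27, 44, 36, 33]
nd_age = normalized_difference(age_treated, age_control)

print(nd_age)
```

Unlike a t-statistic, a measure of this kind is a description of how far apart the two groups are in a given trial, scaled by the spread of the covariate, and it does not grow mechanically with the sample size.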
1.2.5 Methods for balancing
One procedure to improve balance is to adapt the design before randomization, for example by stratification. Fisher, who, as the quote above illustrates, was well aware of the loss of precision from randomization, argued for “blocking” (stratification) in agricultural trials or for using Latin Squares, both of which restrict the amount of imbalance. Stratification, to be useful, requires some prior understanding of the factors that are likely to be important, and so it takes us away from the “no knowledge required,” or “no priors accepted” appeal of RCTs. But as Scriven (1974, 103) notes: “cause hunting, like lion hunting, is only likely to be successful if we have a considerable amount of relevant background knowledge,” or even more strongly, “no causes in, no causes out,” Cartwright (1994, Chapter 2). Stratification in RCTs, as in other forms of sampling, is a standard method for using background knowledge to increase the precision of an estimator. It has the further advantage that it allows for the exploration of different average treatment effects in different strata, which can be useful in adapting or transporting the results to other locations, see Section 2.
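A minimal sketch of stratified (blocked) assignment follows. The function name `stratified_assignment`, the soil-type labels, and the rule of treating half of each stratum (rounded down) are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

def stratified_assignment(strata, rng):
    """Randomize within each stratum, treating half (rounded down) of the units in it."""
    by_stratum = defaultdict(list)
    for unit, label in enumerate(strata):
        by_stratum[label].append(unit)
    treated = set()
    for units in by_stratum.values():
        shuffled = units[:]
        rng.shuffle(shuffled)
        treated.update(shuffled[: len(shuffled) // 2])
    return treated

rng = random.Random(4)
# Hypothetical stratifying variable, for example the soil type of each plot.
strata = ["clay"] * 6 + ["sand"] * 4 + ["loam"] * 8
treated = stratified_assignment(strata, rng)
treated_per_stratum = {
    label: sum(1 for u in treated if strata[u] == label) for label in set(strata)
}

print(treated_per_stratum)
```

Because assignment is random only within strata, the stratifying variable is balanced by design in every replication, not just in expectation; which units are treated within each stratum still varies from draw to draw.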
Stratification is not possible when there are too many covariates, or if each has many values, so that there are more cells than can be filled given the sample size. An alternative is to re-randomize, repeating the randomization until the distance between the observed covariates is less than some predetermined criterion. Morgan and Rubin (2012) suggest the Mahalanobis D-statistic, and use Fisher’s randomization inference (to be discussed further below) to calculate standard errors that take the re-randomization into account. An alternative, widely adopted in practice, is to adjust for covariates by running a regression (or covariance) analysis, with the outcome on the left hand side and the treatment dummy and the covariates as explanatory variables, including possible interactions between covariates and treatment dummies.
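Re-randomization can be sketched in a few lines. The code below is an illustration under stated assumptions, not the procedure of any cited paper: it uses two invented covariates, an unscaled squared Mahalanobis distance between the treated and control mean vectors, and an arbitrary acceptance threshold of 0.01; the function names are hypothetical.

```python
import random

def mean_vector(rows):
    """Column means of a list of equal-length rows."""
    k = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(k)]

def covariance_2x2(rows):
    """Sample variances and covariance of two covariates, returned as (sxx, sxy, syy)."""
    m = mean_vector(rows)
    n = len(rows)
    sxx = sum((r[0] - m[0]) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - m[1]) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_sq(d, cov):
    """d' inverse(cov) d for a two-dimensional d, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    return (syy * d[0] ** 2 - 2 * sxy * d[0] * d[1] + sxx * d[1] ** 2) / det

def rerandomize(x, threshold, rng, max_draws=10_000):
    """Redraw the assignment until the covariate-mean distance falls below the threshold."""
    n = len(x)
    cov = covariance_2x2(x)
    for draw in range(1, max_draws + 1):
        treated = set(rng.sample(range(n), n // 2))
        m1 = mean_vector([x[i] for i in treated])
        m0 = mean_vector([x[i] for i in range(n) if i not in treated])
        dist = mahalanobis_sq([m1[0] - m0[0], m1[1] - m0[1]], cov)
        if dist < threshold:
            return treated, dist, draw
    raise RuntimeError("no acceptable assignment found; loosen the threshold")

rng = random.Random(5)
x = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]   # two hypothetical covariates
treated, dist, draws_used = rerandomize(x, threshold=0.01, rng=rng)

print(len(treated), dist, draws_used)
```

The threshold must be fixed before any assignment is drawn, and the acceptance rule changes the set of assignments that could have occurred, which is why standard errors need to take the re-randomization into account.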
Freedman (2008) has analyzed this method and argues “if adjustment made a substantial difference, we would suggest much caution when interpreting the results.” But a substantial difference is exactly what we would like to see, at least some of the time, if the adjustment moves the estimate closer to the truth. Freedman shows that the adjusted estimate of the ATE is biased in finite samples, with the bias depending on the correlation between the squared treatment effect and the covariates. There is also no general guarantee that the regression adjustment will generate a more precise estimate, although it will do so if there are equal numbers of treatments and controls or if the treatment effects are constant over units (in which case there will also be no bias). Even with bias, the regression adjustment is attractive if it does indeed trade off bias for precision, though presumably not to RCT purists for whom unbiasedness is the sine qua non. Note again that the increased precision, when it exists, comes from using prior knowledge about the variables that are likely to be important for the outcome. That the background knowledge or theory is widely shared and understood will also provide some protection against data mining by searching through covariates in the search for (perhaps falsely) estimated precision.
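The mechanics of the regression adjustment can be sketched as below. The example assumes one covariate, a constant treatment effect of 2, and no noise, so it illustrates only how adjustment removes the effect of covariate imbalance; it does not reproduce Freedman's finite-sample bias, which requires heterogeneous effects. All names and numbers are hypothetical.

```python
import numpy as np

def adjusted_ate(y, d, x):
    """OLS of y on a constant, the treatment dummy, the demeaned covariate,
    and their interaction; returns the coefficient on the treatment dummy."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()                               # demean so the dummy coefficient is the ATE
    design = np.column_stack([np.ones_like(y), d, xc, d * xc])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([1, 0, 1, 0, 1, 0])                    # treated units happen to have lower x
y = 1.0 + 2.0 * d + 3.0 * x                         # constant treatment effect of 2
raw = y[d == 1].mean() - y[d == 0].mean()           # unadjusted difference in means
adj = adjusted_ate(y, d, x)
```

Here the unadjusted difference in means has the wrong sign because of the chance imbalance in x, while the adjusted estimate recovers the effect. The gain comes entirely from knowing that x matters for the outcome.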
1.2.6 Should we randomize?
The tension between randomization and precision goes back to the early debate between Fisher and Student (Gosset), who never accepted Fisher’s arguments for randomization; see also Ziliak (2014). In his debate with Fisher about agricultural trials, Student argued that randomization ignored relevant prior information, for example about how likely confounders would be distributed across the test plots, so that randomization wasted resources and led to unnecessarily poor estimates. This general question of whether randomization is desirable has been reopened in recent papers by Kasy (2016), Banerjee, Chassang, and Snowberg (2016) and Banerjee, Chassang, Montero, and Snowberg (2016).
Refer back to the MSE introduced above, and consider designing an experiment that will make this as small as possible. Unfortunately, this is not generally possible; for example, the “estimator” of 3, say, for the ATE has the lowest possible mean-squared error if the true ATE is actually 3. Instead, we need to average the MSE over a distribution of possible ATEs. This leads to a decision theory approach to estimation whereby a Bayesian econometrician will estimate the ATE by choosing the allocation of treatment and controls so as to minimize the expected value of a loss function, the MSE being one example. Such an approach requires us to specify a prior on the ATE, or more generally, on the expectation of outcomes conditional on the covariates. These priors are formal versions of the issue that has already come up repeatedly, that to get good estimators, we need to know something about how the covariates affect the outcome. Kasy (2016) solves this problem for the case of expected MSE and shows that randomization is undesirable; it simply adds noise and makes the MSE larger. He uses a non-parametric prior that has proved useful in a number of other applications (we could presumably do even better if we were prepared to commit further), and he provides code to implement his method, which shows a 20 percent reduction in MSE compared with randomization (14 percent for stratified randomization) for the well-known Tennessee STAR class-size experiment.
Banerjee et al propose a more general loss function and prove the comparable theorem, that randomization leads to larger losses than the optimal non-random purposive assignment. These authors recommend randomization on other grounds, which we will discuss below, but agree that, for standard statistical efficiency or maximization of expected utility, randomization should not be used in experimental design. Student was right.
Several points should be noted. First, the anti-randomization theorem is not a justification of any non-experimental design, for example one that compares outcomes of those who do or do not self-select into treatment. Selection effects are real enough, and if selection is based on unobservable causes, comparison of treated and controls will be biased. One acceptable non-random scheme is to use the observable covariates to divide the study sample into cells within which all observations have the same value and then divide each cell into treatments and controls. Within each cell, or for those units on which we have no information, we can choose any way we like, including randomly, though randomization has no advantage or disadvantage. Such allocations rule out self-selection (or doctor or program administrator selection) where the individual (doctor, or administrator) has information not visible to the person assigning treatments and controls. The key is that the person who makes the assignment (the analyst) uses all of the information that he or she possesses, and that once this has been taken into account, all units are interchangeable conditional on that information, so that assignment beyond that does not matter. Of course, the program administrators must enforce the analyst’s assignment, so that private information that they or the units possess is not allowed to affect the assignment, conditional on the information used by the analyst. Given this, selection on unobservables is ruled out, and does not affect the results. Randomization is not required to eliminate selection bias.
Whether it is really possible for the analyst to assign arbitrarily is an open question, as is whether “randomization” from a random-number generator will do so. Even machine-generated sequences have causes, and even if the analyst has only a set of uninformative labels for the units, those too must come from somewhere, so that it is possible that those causes are linked to the unobserved causes in the experiment. We do not attempt to deal here with these deep issues on the meaning of randomization, but see Singer and Pincus (1998).
According to Chalmers (2001) and Bothwell and Podolsky (2016), the development of randomization in medicine originated with Bradford-Hill, who used randomization in the first RCT in medicine (the streptomycin trial) because it prevented doctors selecting patients on the basis of perceived need (or against perceived need, leaning over backward as it were), an argument more recently echoed by Worrall (2007). Randomization serves this purpose, but so do other non-discretionary schemes; what is required is that the hidden information not affect the allocation. While it is true that doctors cannot be allowed to make the assignment, it is not true that randomization is the only scheme that can be enforced.
Second, the ideal rules by which units are allocated to treatment or control depend on the covariates, and on the investigators’ priors about how the covariates affect the outcomes. This opens up all sorts of methods of inference that are excluded by pure randomization. For example, the hypothetico-deductive method works by using theory to make a prediction that can be taken to the data; here the predictions would be of the form that a unit with characteristics x will respond in a particular way to treatment, falsification of which can be tested by an appropriate allocation of units to treatment. Banerjee, Chassang and Snowberg (2016) provide such examples.
Third, randomization, by running roughshod over prior information from theory and from the covariates, is wasteful and even unethical when it unnecessarily exposes people, or unnecessarily many people, to possible harm in a risky experiment; see Worrall (2002) for an egregious case of how an unthinking demand for randomization and the refusal to accept prior information put children’s lives directly at risk.
Fourth, the non-random methods use prior information, which is why they do better than randomization. This is both an advantage and a disadvantage, depending on one’s perspective. If prior information is not widely accepted, or is seen as non-credible by those we are seeking to persuade, we will generate more credible estimates if we do not use those priors. Indeed, this is why Banerjee, Chassang and Snowberg (2016) recommend randomized designs, including in medicine and in development economics. They develop a theory of an investigator who is facing an adversarial audience that will challenge any prior information and can even potentially veto results that are based on it (think administrative agencies or journal referees). The experimenter trades off his or her own desire for precision (and preventing possible harm to subjects), which uses prior information, against the wishes of the audience, who want nothing of the priors. Even then, the approval of this audience is only ex ante; once the fully randomized experiment has been done, nothing stops critics arguing that, in fact, the randomization did not offer a fair test. Among doctors who use RCTs, and especially meta-analysis, such arguments are (appropriately) common; see again Kramer (2016).
As we noted in the Introduction, much of the public has come to question expert prior knowledge, and Banerjee, Chassang, Montero and Snowberg (2016) have provided an elegant (positive) account of why RCTs will flourish in such an environment. In cases where there is good reason to doubt the good faith of experimenters, as in some pharmaceutical trials, randomization will indeed be the appropriate response. But we believe such arguments are deeply destructive for scientific endeavor and should be resisted as a general prescription for scientific research. Economists and other social scientists know a great deal, and there are many areas of theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable researchers. Such information needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive know-nothing ignorance. The systematic refusal to use prior knowledge and the associated preference for RCTs are recipes for preventing cumulative scientific progress. In the end, it is also self-defeating; to quote Rodrik (2016), “the promise of RCTs as theory-free learning machines is a false one.”
1.3 Statistical inference in RCTs
If we are to interpret the results of an RCT as demonstrating the causal effect of the treatment in the trial population, we must be able to tell whether the difference between the control and treatment means could have come about by chance. Any conclusion about causality is hostage to our ability to calculate standard errors and accurate p-values. But this is not generally possible without assumptions that go beyond those needed to support the basic theorem of RCTs. In particular, it has long been known that the mean (and a fortiori the difference between two means) is a statistic that is sensitive to outliers. Indeed Bahadur and Savage (1956) demonstrate that, without restrictions on the parent distributions, standard t-tests are inherently unreliable.
The key problem here is skewness; standard t-tests break down in distributions with large skewness, see Lehmann and Romano (2005, pp. 466–8). In consequence, RCTs will not work well when the distribution of the individual treatment effects is strongly asymmetric, at least if the standard two-sample t-statistics (or equivalently White’s (1980) heteroskedasticity-robust regression t-values) are used. While we may be willing to assume that treatment effects are symmetric in some cases, the need for such an assumption, which requires prior knowledge about the specific process being studied, undermines the argument that RCTs are largely assumption free and do not depend on such knowledge. There is a deep irony here. In the search for robustness and the desire to do away with unnecessary assumptions, the RCT can deliver the mean of the ATE, yet the mean (as opposed to the median, which cannot be estimated by an RCT) does not permit robust probability statements about the estimates of the ATE.
How difficult is it to maintain symmetry? And how badly is inference affected when the distribution of treatment effects is not symmetric? In economics, many trials have outcomes valued in money. Does an anti-poverty innovation, for example microfinance, increase the incomes of the participants? Income itself is not symmetrically distributed, and this might be true of the treatment effects too, if there are a few people who are talented but credit-constrained entrepreneurs and who have treatment effects that are large and positive, while the vast majority of borrowers fritter away their loans, or at best make positive but modest profits. Another important example is expenditures on healthcare. Most people have zero expenditure in any given period, but among those who do incur expenditures, a few individuals spend huge amounts that account for a large share of the total. Indeed, in the famous Rand health experiment, Manning, Newhouse et al. (1987, 1988), there is a single very large outlier. The authors realize that the comparison of means across treatment arms is fragile, and, although they do not see their problem exactly as described here, they obtain their preferred estimates using a structural approach that is designed to explicitly model the skewness of expenditures.
In some cases, it will be appropriate to deal with outliers by trimming, eliminating observations that have large effects on the estimates. But if the experiment is a project evaluation designed to estimate the net benefits of a policy, the elimination of genuine outliers, as in the Rand Health Experiment, will vitiate the analysis. It is precisely the outliers that make or break the program.
1.3.1 Spurious statistical significance: an illustrative example
We consider an example that illustrates what can happen in a realistic but simplified case. There is a parent population, or population of interest, defined as the collection of units for which we would like to estimate an average treatment effect. It might be all villages in India, or all recipients of food subsidies, or all users of healthcare in the US. From this population we have a sample that is available for randomization, the trial or experimental sample; in a randomized controlled trial, this will subsequently be randomly divided into treatments and controls. Ideally, the trial sample would be randomly selected from the parent population, so that the sample average treatment effect would be an unbiased estimator of the population average treatment effect; indeed in some cases the complete population of interest is available for the trial. Clearly, in these ideal cases, it is straightforward to use standard sampling theory to generalize the trial results from the sample to the population. However, for a number of practical and conceptual reasons, the trial sample is rarely either the whole population or a randomly selected subset; see Shadish et al (2002, pp. 341–8) for a good discussion of both practical and theoretical obstacles.
In our illustrative example, there is a parent population each member of which has his or her own treatment effect; these are continuously distributed with a shifted lognormal distribution with zero mean so that the population average treatment effect is zero. The individual treatment effects $\beta$ are distributed so that $\beta + e^{0.5} \sim \Lambda(0,1)$, for standardized lognormal distribution $\Lambda$. We have something like a microfinance trial in mind, where there is a long positive tail of rare individuals who can do amazing things with credit, while most people cannot use it effectively. A trial (experimental) sample of 2n individuals is randomly drawn from the parent and is randomly split between n treatments and n controls. In the absence of treatment, everyone in the sample records zero, so the sample average treatment effect in any one trial is simply the mean outcome among the n treatments. For values of n equal to 25, 50, 100, 200, and 500 we draw 100 trial/experimental samples each of size 2n; with five values of n, this gives us 500 trial/experimental samples in all. For each of these 500 samples, we randomize into n controls and n treatments, estimate the ATE and its estimated t-value (using the standard two-sample t-value, or equivalently, by running a regression with robust t-values), and then repeat 1,000 times, so we have 1,000 ATE estimates and t-values for each of the 500 trial samples; these allow us to assess the distribution of ATE estimates and their nominal t-values for each trial.
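The simulation design just described can be sketched in a few lines. This is a reconstruction under stated assumptions rather than the authors' code: the critical value is taken to be the normal 1.96, the seed is arbitrary, the number of randomizations per trial sample is cut to 200 to keep the run short, and the two-sample standard error is simplified using the fact that all controls record zero (so the control variance is zero). The numbers it produces will therefore not match Table 1 exactly.

```python
import numpy as np

def simulate_rejections(n, n_samples=100, n_randomizations=1000, crit=1.96, seed=0):
    """Monte Carlo for RCTs with shifted-lognormal treatment effects (true ATE = 0).

    Returns (mean ATE estimate, mean nominal t-value, percent of trials rejecting).
    """
    rng = np.random.default_rng(seed)
    shift = np.exp(0.5)                              # mean of a standard lognormal
    ates, tvals = [], []
    for _ in range(n_samples):
        # One trial sample of 2n individual treatment effects.
        beta = rng.lognormal(0.0, 1.0, size=2 * n) - shift
        # Each row of `order` is one random permutation of the 2n units.
        order = np.argsort(rng.random((n_randomizations, 2 * n)), axis=1)
        treated = beta[order[:, :n]]                 # outcomes of the n treated units
        ate = treated.mean(axis=1)                   # controls all record zero
        se = np.sqrt(treated.var(axis=1, ddof=1) / n)
        ates.append(ate)
        tvals.append(ate / se)
    ates = np.concatenate(ates)
    tvals = np.concatenate(tvals)
    return ates.mean(), tvals.mean(), 100.0 * np.mean(np.abs(tvals) > crit)

mean_ate, mean_t, pct = simulate_rejections(25, n_randomizations=200)
print(f"n=25: mean ATE {mean_ate:.4f}, mean t {mean_t:.4f}, rejected {pct:.2f}%")
```

Calling the function for n equal to 50, 100, 200, and 500 gives the remaining rows of a table laid out like Table 1.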
Table 1: RCTs with skewed treatment effects

Sample size    Mean of ATE    Mean of nominal    Fraction null
(n per arm)    estimates      t-values           rejected (percent)
 25             0.0268        -0.4274            13.54
 50             0.0266        -0.2952            11.20
100            -0.0018        -0.2600             8.71
200             0.0184        -0.1748             7.09
500            -0.0024        -0.1362             6.06

Note: 1,000 randomizations on each of 100 draws of the trial sample randomly drawn from a lognormal distribution of treatment effects shifted to have a zero mean.
The results are shown in Table 1. Each row corresponds to a sample size. In each row, we show the results of 100,000 individual trials, composed of 1,000 replications on each of the 100 trial (experimental) samples. The columns are averaged over all 100,000 trials.
The last column shows the fraction of times the true null is rejected and is the key result. When there are only 50 treatments and 50 controls (row 2), the (true) null is rejected 11.2 percent of the time, instead of the 5 percent that we would like and expect if we were unaware of the problem. When there are 500 units in each arm, the rejection rate is 6.06 percent, much closer to the nominal 5 percent.
Why does the standard application of the t-distribution give such strange results when all we are doing is estimating a mean? The problem cases are when the trial sample happens to contain one or more outliers, something that is always a risk given the long positive tail of the parent distribution. When this happens, everything depends on whether the outlier is among the treatments or the controls; in effect the outliers become the sample, reducing the effective number of degrees of freedom.

Figure 1: Estimates of an ATE with an outlier in the trial sample (histogram; horizontal axis: 1,000 estimates of average treatment effect; vertical axis: density)
Figure 1 illustrates the estimated average treatment effects from an extreme case from the simulations with 100 observations in total, the second row of Table 1; the histogram shows the 1,000 estimates of the ATE. The trial sample has a single large outlying treatment effect of 48.3; the mean (s.d.) of the other 99 observations is -0.51 (2.1); when the outlier is in the treatment group, we get the right-hand side of the figure, when it is not, we get the left-hand side. On the right-hand side, when the outlier is among the treatment group, the dispersion across outcomes is large, as is the estimated standard error, and so those outcomes rarely reject the null using the standard table of t-values. The over-rejections come from the left-hand side of the figure: when the outlier is in the control group, the outcomes are not so dispersed, and the t-values can be large, negative, and significant. While these cases of bimodal distributions may not be common, and depend on large outliers, they illustrate the process that generates the over-rejections and spurious significance.
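A back-of-the-envelope calculation shows where the two modes of such a histogram should lie. It assumes, as an approximation not stated in the text, that the 49 non-outlying treated units average about the same as the 99 non-outliers (-0.51), and it uses the fact that controls record zero, so the ATE estimate is the mean over the 50 treated units.

```latex
\hat{\beta}_{\text{outlier treated}} \approx \frac{48.3 + 49 \times (-0.51)}{50}
  = \frac{23.31}{50} \approx 0.47,
\qquad
\hat{\beta}_{\text{outlier in controls}} \approx -0.51 .
```

The two clusters are thus separated by roughly one unit, even though the true population ATE is zero.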
We could escape these problems if we could calculate the median treatment effect, but RCTs cannot (without further assumption) identify the median, only the mean, and it is the mean that is at risk because of the Bahadur-Savage theorem. Note too that there is only moderate comfort to be taken in large sample sizes. While the last row is certainly better than the others, there are still many trial samples that are going to give sample average effects that are significant, even when the number we want is zero. The proof of the Bahadur-Savage theorem works by noting that for any sample size, it is always possible to find an outlier that will give a misleading t-value. Nor is there an escape here by using the Fisher exact method for inference; the Fisher method tests the null hypothesis that all of the treatment effects are zero whereas what we are interested in here, at least if we want to do project evaluation or cost-benefit analysis, is that the average treatment effect is zero.
The problems illustrated above, which stem from the Bahadur-Savage theorem, are certainly not confined to RCTs, and occur more generally in econometric and statistical work. However, the analysis here illustrates that the simplicity of ideal RCTs, subtracting one mean from another, brings no exemption from troublesome problems of inference. Escape from these issues, as in the Rand Health Experiment, requires explicit modeling, or might be best handled by estimating quantiles of the treatment distribution, which again requires additional assumptions.
Our reading of the literature on RCTs in development suggests that they are not exempt from these concerns. Many development trials are run on (sometimes very) small samples, they have treatment effects where asymmetry is hard to rule out (especially when the outcomes are in money), and they often give results that are puzzling, or at least not easily interpreted in terms of economic theory. Neither Banerjee and Duflo (2012) nor Karlan and Appel (2011), who cite many RCTs, raise concerns about misleading inference, treating all results as solid. No doubt there are behaviors in the world that are inconsistent with standard economics, and some can be explained by standard biases in behavioral economics, but it would also be good to be suspicious of the significance tests before accepting that an unexpected finding is well supported and theory should be revised. Replication of results in different settings may be helpful, if they are the right kind of places (see our discussion in Section 2), but it hardly solves the problem given that the asymmetry may be in the same direction in different settings (and seems likely to be so in just those settings that are sufficiently like the original trial setting to be of use for inference about the trial population), and that the “significant” t-values will show departures from the null in the same direction, thus replicating spurious findings.
1.3.2 Significance tests: Fisher-Behrens, robust inference, and multiple hypotheses
Skewness of treatment effects is not the only threat to accurate significance tests. The two-sample t-statistic is computed by dividing the ATE by the estimated standard error whose square is given by
$$\hat{\sigma}^2 = \frac{(n_1 - 1)^{-1} \sum_{i \in 1} (Y_i - \hat{\mu}_1)^2}{n_1} + \frac{(n_0 - 1)^{-1} \sum_{i \in 0} (Y_i - \hat{\mu}_0)^2}{n_0} \qquad (5)$$
where 0 refers to controls and 1 to treatments, so that there are $n_1$ treatments and $n_0$ controls, and $\hat{\mu}_1$ and $\hat{\mu}_0$ are the two means. As has been long known, this t-statistic is not distributed as Student’s t if the two variances (treatment and control) are not identical; this is known as the Behrens–Fisher problem. In extreme cases, when one of the variances is zero, the t-statistic has effective degrees of freedom half of that of the nominal degrees of freedom, so that the test-statistic has thicker tails than allowed for, and there will be too many rejections when the null is true.
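The loss of degrees of freedom in the extreme case can be checked numerically. The sketch below computes the t-statistic from equation (5) together with the Welch–Satterthwaite approximation to the effective degrees of freedom; that approximation is a standard device brought in here for illustration and is not part of the text, and the data are made up.

```python
import numpy as np

def two_sample_t(y1, y0):
    """Two-sample t-statistic using the variance in equation (5), plus the
    Welch-Satterthwaite approximation to its effective degrees of freedom."""
    y1 = np.asarray(y1, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    n1, n0 = len(y1), len(y0)
    v1 = y1.var(ddof=1) / n1                        # treatment term of equation (5)
    v0 = y0.var(ddof=1) / n0                        # control term of equation (5)
    t = (y1.mean() - y0.mean()) / np.sqrt(v1 + v0)
    df = (v1 + v0) ** 2 / (v1 ** 2 / (n1 - 1) + v0 ** 2 / (n0 - 1))
    return t, df

y1 = np.array([1.0, 2.0, 3.0, 4.0, 6.0])            # treated outcomes (hypothetical)
y0 = np.zeros(5)                                    # controls with zero variance
t, df = two_sample_t(y1, y0)
nominal_df = len(y1) + len(y0) - 2
print(t, df, nominal_df)
```

With a zero control variance the effective degrees of freedom equal n1 - 1, half of the nominal n1 + n0 - 2 when the arms are the same size.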
In a remarkable recent paper, Young (2016) argues that this problem gets much worse when the trial results are analyzed by regressing outcomes not only on the treatment dummy, but also on additional controls, some of which might interact with the treatment dummy. Again the problem concerns outliers in combination with the use of clustered or robust standard errors. When the design matrix is such that the maximal influence is large, so that for some observations outcomes have large influence on their own predicted values, there is a reduction in the effective degrees of freedom for the t-value(s) of the average treatment effect(s), leading to spurious findings of significance.
Young looks at 2,003 regressions reported in 53 RCT papers in the American Economic Association journals and recalculates the significance of the estimates using Fisher’s randomization inference applied to the authors’ original data; see again Imbens and Wooldridge (2009) for a good modern account of Fisher’s method. In 30 to 40 percent of the estimated treatment effects in individual equations with coefficients that are reported as significant, he cannot reject the null of no effect; the fraction of spuriously significant results increases further when he simultaneously tests for all results in each paper. These spurious findings come in part from the well-known problem of multiple-hypothesis testing, both within regressions with several treatments and across regressions. Within regressions, treatments are largely orthogonal, but authors tend to emphasize significant t-values even when the corresponding F-tests are insignificant. Across equations, results are often strongly correlated, so that, at worst, different regressions are reporting variants of the same result, thus spuriously adding to the “kill count” of significant effects. At the same time, the pervasiveness of observations with high influence generates spurious significance on its own.
Our sense is that these issues are being taken more seriously in recent work, especially as concerns multiple hypothesis testing. Young himself is a strong proponent of RCTs in general and believes that randomization inference will yield correct inferences. Yet randomization inference can only test the null that all treatment effects are zero, that the experiment does nothing to anyone, whereas many investigators are interested in the weaker hypothesis that the average treatment effect is zero. This simply makes matters worse since the stronger hypothesis implies the weaker hypothesis and there are presumably undiscovered cases where the ATE is spuriously significant, even when the Fisher test rejects that all treatment effects are zero. Note that testing does not always match logic; it is possible to reject the null that the ATE is zero even when we can simultaneously accept the (joint) hypothesis that all treatment effects are zero; this is familiar from OLS regression, where an F-test can show joint insignificance, even when a t-test of some linear combination is significant.
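For readers unfamiliar with randomization inference, a minimal sketch of an exact test of the sharp null (every individual treatment effect is zero) is given below. It enumerates all possible assignments, which is feasible only for small samples; the absolute difference in means is used as the test statistic by assumption, and the data are hypothetical.

```python
from itertools import combinations

import numpy as np

def fisher_randomization_p(y, d):
    """Exact p-value for the sharp null of no effect for any unit.

    Under that null the outcomes are fixed, so the statistic can be
    recomputed under every assignment with the same number of treated units.
    """
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=int)
    n, n1 = len(y), int(d.sum())

    def diff(treated_idx):
        mask = np.zeros(n, dtype=bool)
        mask[list(treated_idx)] = True
        return y[mask].mean() - y[~mask].mean()

    observed = abs(diff(np.flatnonzero(d == 1)))
    draws = np.array([abs(diff(c)) for c in combinations(range(n), n1)])
    return float(np.mean(draws >= observed - 1e-12))

y = [10, 11, 12, 13, 14, 0, 1, 2, 3, 4]
d = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p = fisher_randomization_p(y, d)
print(p)
```

Note what the p-value refers to: it is computed under the hypothesis that the treatment does nothing to anyone, not under the weaker hypothesis that the average effect is zero.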
It is clear that, as of now, all reported significance levels from RCT results in economics should be treated with considerable caution. Greater care about skewness and outliers would help, as would greater use of the Fisher method and of procedures that deal correctly with multiple hypothesis testing. Yet if the null hypothesis is that the average treatment effect is zero, as in most project evaluation, the Fisher test is not available, so that we currently do not have a reliable set of procedures. Robust or clustered standard errors are necessary to allow for the possibility that treatment changes variances, and the inclusion of covariates is necessary to control for imbalance in finite samples.
1.4 Blinding
Blinding is rarely possible in economics or social science trials, and this is one of the major differences from most (although not all) RCTs in medicine, where blinding is standard, both for those receiving the treatment and those administering it. Indeed, the ability to blind has been one of the key arguments in favor of randomization, from Bradford-Hill in the 1950s, see Chalmers (2003), to welfare trials today, Gueron and Rolston (2013). Consider first the blinding of subjects. Subjects in social RCTs usually know whether they are receiving the treatment or not and so can react to their assignment in ways that can affect the outcome other than through the operation of the treatment; in econometric language, this is akin to a violation of exclusion restrictions, or a failure of exogeneity. In terms of (1), there is a pathway from the treatment assignment to another unobserved cause, which will result in a biased ATE. This is not to argue in favor of instrumental variables over RCTs, or vice versa, but simply to note that, without blinding, RCTs do not automatically solve the selection problem any more than IV estimation automatically solves the selection problem. In both cases, the exogeneity (exclusion restriction) argument needs to be explicitly made and justified. Yet the literature in economics gives great attention to the validity of exclusion restrictions in IV estimation, while tending to shrug off the essentially identical problems with lack of blinding in RCTs.
Note also that knowledge of their assignment may cause people to want to cross over from treatment to control, or vice versa, to drop out of the program, or to change their behavior in the trial depending on their assignment. In extreme cases, only those members of the trial sample who expect to benefit from the treatment will accept treatment. Consider, for example, a trial in which children are randomly allocated to two schools that teach in different languages, Russian or English, as happened during the breakup of the former Yugoslavia. The children (and their parents) know their allocation, and the more educated, wealthier, and less-ideologically committed parents whose children are assigned to the Russian-medium schools can (and did) remove their children to private English-medium schools. In a comparison of those who accepted their assignments, the effects of the language of instruction will be distorted in favor of the English schools by differences in family characteristics. This is a case where, even if the random number generator is fully functional, a later balance test will show systematic differences in observable background characteristics between the treatment and control groups; even if the balance test is passed, there may still be selection on unobservables for which we cannot test.
More generally, when people know their allocation, when they have a stake in the outcome, and when the treatment effect is different for different people, there are incentives and opportunities for selection in response to the randomization, and that selection can contaminate the estimated average treatment effect; see Heckman (1997), who makes the same point in the context of instrumental variables. Those who were randomized by a lottery into going to Vietnam will have different treatment effects depending on their labor market prospects, and those with better prospects are more likely to resist the draft. As we shall see in the next subsection, various statistical corrections are available for a few of the selection problems non-blinding presents, but all rely on the kind of assumptions that, while common in observational studies, RCTs are designed to avoid. Our own view is that assumptions and the use of prior knowledge are what we need to make progress in any kind of analysis, including RCTs whose promise of assumption-free learning is always likely to be illusory.
There may be a tendency in economics to focus on the selection bias effects of non-blinding because some solutions are available, but selection bias is not the only serious source of bias in social and medical trials. Concerns about the placebo, Pygmalion, Hawthorne, John Henry, and 'teacher/therapist' effects are widespread across studies of medical and social interventions. This literature argues that double blinding should be replaced by quadruple blinding; blinding should extend beyond participants and investigators and include those who measure outcomes and those who analyze the data, all of whom may be affected by both conscious and unconscious bias. The need for blinding in those who assess outcomes is particularly important in any cases where outcomes are not determined by strictly prescribed procedures whose application is transparent and checkable but require elements of judgment; a good example is therapists who are asked to assess the extent of depression in clinical trials of anti-depressants, see Kramer (2016).
The lesson here is that blinding matters and is very often missing. There is no reason to suppose that a poorly blinded trial with random assignment trumps better blinded studies with alternative allocation mechanisms, or matched studies.
1.5 What do RCTs do in practice?
The execution of an RCT will often deviate from its design. People may not accept their assignment, controls may manage to get treatment, and vice versa, and people may accept their assignment, but drop out before the completion of the study. In some designs, the trial works by giving people incentives to participate, for example by mailing them a voucher that gives them subsidized access to a school or to a savings product. If the aim is to evaluate the voucher scheme itself, no new issue arises. However, if the aim is to find out what the education or savings program does, and the voucher is simply a device to induce variation, much depends on whether or not people decide to use the voucher which, like attrition and crossover, is subject to purposive decisions by the subjects, inducing differences between treatments and controls.
Everything depends on the purpose of the trial. In the example above, we may want to evaluate the voucher program, or we may want to find out what the savings product does for people. We are sometimes interested in establishing causality, and sometimes in estimating an average treatment effect; in the economics literature, some writers define internal validity as getting the ATE right, while others, following the original definition of the term, define internal validity as getting causality right. Sometimes the trial limits itself to establishing causality (or to estimating an ATE) in only the trial sample, but some trials are more ambitious, and try to establish causality (or estimate an ATE) for a broader population of interest. When, as is common in economics trials, no limits are placed on the heterogeneity of treatment responses, different trial samples and different populations will generally have different ATEs and may have different causal outcomes, e.g. if the treatment has an effect in one population but none or the opposite effect in another. Our view is that the target of the trial, including the population of interest, needs to be defined in advance. Otherwise, almost any estimated number can be interpreted as a valid ATE for some population, we allow deviations from the design to define our target, and we have no way of knowing whether apparently contradictory results are really contradictory or are correct for the population on which they were derived. Differences in results, between different RCTs and between RCTs and observational studies, may owe less to the selection effects that RCTs are designed to remove, than to the fact that we are comparing non-comparable people, Heckman, Lalonde, and Smith (1999, p. 2082). Without a clear idea of how to characterize the population of individuals in the trial, whether we are looking for an ATE or to identify causality, and for which groups enrolled in the trial the results are supposed to hold, we have no basis for thinking about how to use the trial results in other contexts.
To illustrate some of the issues, consider a simple RCT in which a treatment T is administered to a trial sample that is split between a treatment group of size n and a control group of size n, but that only a fraction p of the treatment group accepts their assignment, with fraction $(1 - p)$ receiving no treatment. Suppose that the parameter of interest is the ATE in the original population, from which the trial sample was drawn randomly. Denote by $\beta$ the hypothetical ideal ATE estimate that would have been calculated if everyone had accepted assignment; as we have seen, this is an unbiased estimator of the parameter of interest for both the trial sample and the parent population. $\beta$ cannot be calculated, but there are various options.
Option one is to ignore the original assignment and calculate the difference in means between those who received the treatment and those who did not, including among the latter those who were intended to receive it but did not. Denote this (“as treated”) estimate $\beta_1$. Alternatively, option two is to compare the average outcome among those who were intended to be treated and those who were intended to be controls. Denote this estimate, the “intent to treat” (ITT) estimator, $\beta_2$. It is easy to show that one set of conditions for $\beta_1 = \beta$ is that those who were treated have the same ATE as those who were intended to be treated, and that those who broke their assignment have the same untreated mean as those who were assigned to be controls, conditions that may hold in some applications, for example where the treatment effects are identical.
The ITT estimator, $\beta_2$, will typically be closer to zero than is $\beta$, and it will certainly be so if the average treatment effect among those who break their assignment is the same as the overall ATE, in which case $\beta_2 = p\beta$. For these reasons, the ITT is often described as yielding a conservative estimate and is routinely advocated in medical trials even though it is an attenuated estimator of the ATE. A third estimator, $\beta_3$, the local average treatment estimator (LATE), is computed by running a regression of outcomes on an (actual) treatment dummy using the treatment assignment as an instrumental variable. In this case, the LATE is simply the ITT, scaled up by the reciprocal of p, so that $\beta_3 = \beta_2 / p$. From the above, the LATE is $\beta$ if the average treatment effect of those who break their assignment is the same as the average treatment effect in general, so that the ITT estimator is biased down by counting those who should have been treated as if they were controls. More generally, and with additional assumptions, Imbens and Angrist (1994) show that the LATE is the average treatment effect among those who were induced to accept the treatment by their assignment to treatment status, which can be a very different object from the original target of investigation. These various estimators, the ATE, the ITT, and the LATE, are all averages over different groups; more formally, Heckman and Vytlacil (2005) define a marginal treatment effect (MTE) as the ATE for those on the margin of treatment, whatever the assignment mechanism, and show that the other estimators can be thought of as averages of the MTEs over different populations.
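The relation between the as-treated, ITT, and LATE estimates can be checked on a small deterministic example. The following is a minimal Python sketch and is not taken from the paper: the function name `compliance_estimators`, the record layout, and the toy numbers are all illustrative assumptions. It assumes one-sided noncompliance (no control receives the treatment) and an identical treatment effect for everyone, which is the special case in which the ITT equals p times the ideal estimate and the LATE recovers it.

```python
from statistics import mean


def compliance_estimators(records):
    """Compute as-treated, ITT, and LATE (Wald) estimates.

    records: iterable of (assigned, treated, outcome), where assigned and
    treated are 0/1 flags. The layout is a hypothetical choice for this sketch.
    """
    records = list(records)
    y_assigned = [y for z, d, y in records if z == 1]
    y_control = [y for z, d, y in records if z == 0]
    y_treated = [y for z, d, y in records if d == 1]
    y_untreated = [y for z, d, y in records if d == 0]

    # Option one: ignore assignment, compare by treatment actually received.
    as_treated = mean(y_treated) - mean(y_untreated)
    # Option two: compare by original assignment (intent to treat).
    itt = mean(y_assigned) - mean(y_control)
    # Difference in take-up between arms; equals p when no control is treated.
    take_up = (mean([d for z, d, y in records if z == 1])
               - mean([d for z, d, y in records if z == 0]))
    # Instrumental-variable (Wald) estimate: ITT scaled up by 1 / take-up.
    late = itt / take_up
    return {"as_treated": as_treated, "itt": itt,
            "take_up": take_up, "late": late}


# Toy data (assumed): untreated outcome 1.0 and treatment effect 2.0 for all.
# Four people are assigned to treatment and three comply (p = 0.75);
# four people are assigned to control.
toy = [(1, 1, 3.0)] * 3 + [(1, 0, 1.0)] + [(0, 0, 1.0)] * 4
estimates = compliance_estimators(toy)
print(estimates)
```

With these numbers the ITT is 1.5, which is 0.75 times the common effect of 2, and both the LATE and the as-treated estimate equal 2. Once the effect differs between those who comply and those who do not, the three numbers separate, which is the case the text goes on to discuss.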
In general, and unless we are prepared to say more about the heterogeneity in the treatment effects, the three estimators will give different results because they are averages over different populations. Economists tend to believe that people act in their own interest, at least in part, so it is not attractive to believe that those who break their assignments have the same distribution of treatment effects as do those who accept them. In Heckman’s (1992) analogy, people are not like agricultural plots, which are in no position to evade the treatment when they see it coming. Such purposive behavior will generally also affect the composition of the trial sample compared with the parent population, with those who agree to participate different from those who do not. For example, people may dislike randomization because of the risks it entails, or people may seek to enter trials in the hope that they will receive a beneficial treatment that is otherwise unavailable. A famous example in economics is the Ashenfelter (1978) pre-program “dip,” where those who enter trials of training programs tend to be those whose earnings have fallen immediately prior to enrolment, see also Heckman and Smith (1999). People who participate in drug trials are more likely to be sick than those who do not, or are likely to be those who have failed on standard medication. Another example is Chyn’s (2016) evidence that those who applied for vouchers in the Moving to Opportunity experiment and were thus eligible for randomization (and only a quarter of those who were eligible actually did so) were those who were already making unusual efforts on their children’s behalf. These parents had effectively substituted for part of the better environment, so that the ATE from the trial understates the benefits to the average child of moving. Similar phenomena occur in medicine. In the 1954 trials of the Salk polio vaccine in the US, the rates of infection, while lowest among the treated children, were higher in the control children than in the general population at risk, so that the parents of those who selected into the trial presumably had some idea that they might have been exposed, Hausman and Wise (1985, p. 193–4). In this case, the average treatment effect in the trial sample exaggerates the ATE in the general population, which is what we want to know for public policy.
Given the non-parametric spirit of RCTs, and the unwillingness of many trialists to make assumptions or to incorporate prior information, the only way forward is to be very clear about the purpose of the trial and, in particular, which average we are trying to estimate. For those who focus on internal validity in terms of establishing causality by finding an ATE significantly different from zero, the definition of the population seems to be a secondary concern. The idea seems to be that if causality is established in some population, that finding is important in itself, with the task of exploring its applicability to other populations left as a secondary matter. For the many economic or cost–benefit analyses where the ATE is the parameter of interest, the population of interest is definitional, and the inference needs to focus on a path from the results of the trial to the parameter of interest. This is often difficult or even impossible without additional assumptions and/or modeling of behavior, including the decision to participate in the trial, and among participants, the decision not to drop out. Manski (1990, 1995, 2003) has shown that, without additional evidence, the population ATE is not (point) identified from the trial results, and has developed non-parametric bounds (an interval estimate) for the ATE. As with the ITT, these bounds are sometimes tight enough to be informative, though the interval defined by the bounds will often contain zero, see Manski (2013) for a discussion aimed at a broad audience. Faced with this, many scholars are prepared to make assumptions or to build models that give more precise results.
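One way to see why such bounds often contain zero is a simple worst-case calculation. The sketch below is an illustration under stated assumptions and is not Manski's own construction: it supposes that outcomes are known to lie in a bounded interval, that the trial identifies the ATE only for the share of the population that it covers, and that nothing at all is known about the rest. The function name `worst_case_ate_bounds` and every number are hypothetical.

```python
def worst_case_ate_bounds(trial_ate, share_covered, y_min, y_max):
    """Worst-case bounds on a population ATE when only part of it is covered.

    Assumes outcomes lie in [y_min, y_max], so that any individual effect
    lies in [-(y_max - y_min), y_max - y_min]. All names are illustrative.
    """
    if share_covered > 1.0 or 0.0 > share_covered:
        raise ValueError("share_covered must be between 0 and 1")
    width = y_max - y_min
    if abs(trial_ate) > width:
        raise ValueError("trial_ate is inconsistent with the outcome range")
    # Contribution of the covered share, where the ATE is taken as known.
    known_part = share_covered * trial_ate
    # The uncovered share could have any effect within the feasible range.
    unknown_share = 1.0 - share_covered
    return (known_part - unknown_share * width,
            known_part + unknown_share * width)


# Hypothetical numbers: a binary outcome, a trial ATE of 0.2, and a trial
# that speaks for only a quarter of the population of interest.
lower, upper = worst_case_ate_bounds(trial_ate=0.2, share_covered=0.25,
                                     y_min=0.0, y_max=1.0)
print(lower, upper)
```

In this example the interval runs from about -0.7 to about 0.8, so it contains zero even though the trial estimate is positive. The interval collapses to the trial estimate only when the covered share is one, which is why additional assumptions or models are attractive.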
RCTs may tell us about causality, even when they do not deliver a good estimate of the ATE. For example, if the ITT estimate is significantly different from zero, the treatment has a causal effect for at least some individuals in the population. The same is true if the LATE is significantly different from zero; again the treatment is causal for some sub-population, even if we may have difficulty characterizing it or accepting it as the population of interest. From this, we also learn that, provided we had a population with the right distribution of $\beta_i$’s and governed by the same potential outcome equation, the treatment would produce the effect in at least some individuals there.
Section 2: Using the results of randomized controlled trials
2.1 Introduction
Suppose we have the results of a well-conducted RCT. We have estimated an average treatment effect, and our standard error gives us reason to believe that the effect did not come about by chance. We thus have good warrant that the treatment causes the effect in our sample population, up to the limits of statistical inference. What are such findings good for? How should we use them?
The literature in economics, as indeed in medicine and in social policy, has paid more attention to obtaining results than to whether and how they should be adapted for use, often assuming that findings can be used “as is.” Much effort is devoted to demonstrating causality and estimating effect sizes in study populations, both in empirical work (more and better RCTs, or substitutes for RCTs, such as instrumental variables or regression discontinuity models) and in theoretical statistical work, for example on the conditions under which we can estimate an average treatment effect, or a local average treatment effect, and what these estimates mean. There is less theoretical or empirical work to guide us on how and for what purposes to use the findings of RCTs, such as the conditions under which the same results hold outside of the original settings, how they might be adapted for use elsewhere, or how they might be used for formulating, testing, understanding, or probing hypotheses beyond the immediate relation between the treatment and the outcome investigated in the study.
Yet it cannot be that knowing how to use results is less important than knowing how to demonstrate them. Any chain of evidence is only as strong as its weakest link, so that a rigorously established effect whose applicability is justified by a loose declaration of simile warrants little more than an estimate that was plucked out of thin air. If trials are to be useful, we need paths to their use that are as carefully constructed as are the trials themselves.
It is sometimes assumed that a parameter, once well established, is invariant across settings. The parameter may be difficult to estimate, because of selection or other issues, and it may be that only a well-conducted RCT can provide a credible estimate of it. If so, internal validity is all that is required, and debate about using the results becomes a debate about the conduct of the study. The argument for the “primacy of internal validity,” Shadish, Cook, and Campbell (2002), is reasonable as a warning that bad RCTs are unlikely to generalize, but it is sometimes incorrectly taken to imply that results of an internally valid trial will automatically or often apply ‘as is’ elsewhere, or that this is the default assumption failing arguments to the contrary. An invariance argument is often made in medicine, where it is sometimes plausible that a particular procedure or drug works the same way everywhere, though see Horton (2000) for a strong dissent and Rothwell (2005) for examples on both sides of the question. We should also note the recent movement to ensure that testing of drugs includes women and minorities because members of those groups suppose that the results of trials on mostly healthy young white males do not apply to them.
2.2 Using results, transportability, and external validity
Suppose a trial has established a result in a specific setting, and we are interested in using the result outside the original context. If “the same” result holds elsewhere, we say we have external validity, otherwise not. External validity may refer just to the transportability of the causal connection, or go further and require replication of the magnitude of the average treatment effect. Either way, the result holds (everywhere, or widely, or in some specific elsewhere) or it does not.
This binary concept of external validity is often unhelpful; it both overstates and understates the value of the results from an RCT. It directs us toward simple extrapolation (whether the same result will hold elsewhere) or simple generalization (whether it holds universally or at least widely) and away from possibly more complex but more useful applications of the evidence. Just as internal validity says nothing about whether or not a trial result will hold elsewhere, the failure of external validity interpreted as simple generalization or extrapolation says little about the value of the trial.
First, there are several uses of RCTs that do not require transportability beyond the original context; we discuss these in the next subsection. Second, there are often good reasons to expect that the results from a well-conducted, informative, and potentially useful RCT will not apply elsewhere in any simple way. Even successful replication by itself tells us little either for or against simple generalization or extrapolation. Without further understanding and analysis, even multiple replications cannot provide much support for, let alone guarantee, the conclusion that the next will work in the same way. Nor do failures of replication make the original result useless. We can often learn much from coming to understand why replication failed and use that knowledge to make appropriate use of the original findings, not by expecting replication, but by looking for how the factors that caused the original result might be expected to operate differently in different settings. Third, and particularly important for scientific progress, the RCT result can be incorporated into a network of evidence and hypotheses that test or explore claims that look very different from the results reported from the RCT. We shall give examples below of extremely useful RCTs that are not externally valid in the (usual) sense that their results do not hold elsewhere, whether in a specific target setting or in the more sweeping sense of holding everywhere.
Bertrand Russell’s chicken provides an excellent example of the limitations to straightforward extrapolation from repeated successful replication. The bird infers, based on multiply repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for Christmas dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed one for her, we would have obtained exactly the same result that she did. Her problem was not her methodology, but rather that she was studying surface relations, and that she did not understand the social and economic structure that gave rise to the causal relations that she observed. So she did not know how widely or how long they would obtain. Russell notes, “more refined views as to the uniformity of nature would have been useful to the chicken” (1912, p. 44). We often act as if the methods of investigation that served the chicken so badly will do perfectly well for us.
Establishing causality does nothing in and of itself to guarantee generalizability. Nor does the ability of an ideal RCT to eliminate bias from selection or from omitted variables mean that the resulting ATE will apply anywhere else. The issue is worth mentioning only because of the enormous weight that is currently attached in economics to the discovery and labeling of causal relations, a weight that is hard to justify for effects that may have only local applicability, what might (perhaps provocatively) be labeled ‘anecdotal causality’. The operation of a cause generally requires the presence of support or helping factors, without which a cause that produces the targeted effect in one place, even though it may be present and have the capacity to operate elsewhere, will remain latent and inoperative. What Mackie (1974) called INUS causality (Insufficient but Non-redundant parts of a condition that is itself Unnecessary but Sufficient for a contribution to the outcome) is often the kind of causality we see; a standard example is a house burning down because the television was left on, although televisions do not operate in this way without helping factors, such as wiring faults, the presence of tinder, and so on. This is standard fare in epidemiology, which uses the term “causal pie” to refer to the case where a set
of causes are jointly but not separately sufficient for an effect. We can rewrite (3) in the form
$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} = \left( \sum_{k=1}^{K} \theta_k w_{ik} \right) T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (6)$$
where $\theta_k$ controls how $w_{ik}$ affects individual $i$’s treatment effect $\beta_i$. The “helping” or “support” factors for the treatment are represented by the interactive variables $w_{ik}$, among which may be included some $x$’s. Since the ATE is the average of the $\beta_i$’s, two populations will have the same ATE only if, except by accident, they have the same average for the support factors necessary for the treatment to work. These are however just the kind of factors that are likely to be differently distributed in different populations, and indeed we do generally find different ATEs in different development (and other social policy) RCTs in different places even in the cases where (unusually) they all point in the same direction.
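The dependence of the ATE on the distribution of support factors can be made concrete with a small calculation. This is a minimal Python sketch under assumptions that are not in the paper: a single binary support factor, a hypothetical coefficient of 2.0, and two invented populations. The function names `individual_effect` and `population_ate` are illustrative. Each individual effect is computed as the weighted sum of support factors that multiplies the treatment in equation (6).

```python
from statistics import mean


def individual_effect(theta, w_i):
    """Treatment effect for one person: the sum over k of theta_k * w_ik."""
    return sum(t * w for t, w in zip(theta, w_i))


def population_ate(theta, w_rows):
    """Average of the individual effects over a population of w vectors."""
    return mean([individual_effect(theta, w_i) for w_i in w_rows])


theta = [2.0]  # assumed: one support factor, worth 2.0 when present

# Hypothetical populations: the support factor is present for 8 of 10 people
# at the trial site but for only 2 of 10 at the target site.
trial_site = [[1.0]] * 8 + [[0.0]] * 2
target_site = [[1.0]] * 2 + [[0.0]] * 8

ate_trial = population_ate(theta, trial_site)
ate_target = population_ate(theta, target_site)
print(ate_trial, ate_target)
```

The same causal structure and the same coefficient give an ATE of 1.6 at the trial site and 0.4 at the target site, purely because the support factor is distributed differently.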
Causal processes often require highly specialized economic, cultural, or social structures to enable them to work. Consider the Rube Goldberg machine that is rigged up so that flying a kite sharpens a pencil, Cartwright and Hardie (2012, 77), or another where a long chain of ropes and pulleys causes the insertion of food into the mouth to activate a face-wiping napkin. These are causal machines, but they are specially constructed to give a kind of causality that operates extremely locally and has no general applicability. The underlying structure affords a very specific form of (6) that will not describe causal processes elsewhere. Neither the same ATE nor the same qualitative causal relations can be expected to hold where the specific form for (6) is different.
Indeed, we continually attempt to design systems that will generate causal relations that we like and that will rule out causal relations that we do not like. Healthcare systems are designed to prevent nurses and doctors making errors; cars are designed so that drivers cannot start them in reverse; work schedules for pilots are designed so they do not fly too many consecutive hours without rest because alertness and performance are compromised.
As in the Rube Goldberg machines and in the design of cars and work schedules, the economic structure and equilibrium may differ in ways that support different kinds of causal relations and thus render a trial in one setting useless in another. For example, a trial that relies on providing incentives for personal promotion is of no use in a state in which a political system locks people into their social and economic positions. Conditional cash transfers cannot improve child health in the absence of functioning clinics. Policies targeted at men may not work for women. We use a lever to toast our bread, but levers only operate to toast bread in a toaster; we cannot brown toast by pressing an accelerator, even if the principle of the lever is the same in both a toaster and a car. If we misunderstand the setting, if we do not understand why the treatment in our RCT works, we run the same risks as Russell’s chicken.
2.3 When RCTs speak for themselves: no transportability required
For some things we want to learn, an RCT is enough by itself. An RCT may disprove a general theoretical proposition to which it provides a counterexample. The test might be of the general proposition itself (a simple refutation test), or of some consequence of it that is susceptible to testing using an RCT (a complex refutation test). Of course, counterexamples are often challenged (for example, it is not the general proposition that caused the rejection, but a special feature of the trial) but here we are on familiar inferential turf. An RCT may also confirm a prediction of a theory, and although this does not confirm the theory, it is evidence in its favor, especially if the prediction seems inherently unlikely in advance. Once again, this is familiar territory, and there is nothing unique about an RCT; it is simply one among many possible testing procedures. Even when there is no theory, or very weak theory, an RCT, by demonstrating causality in some population, can be thought of as proof of concept, that the treatment is capable of working somewhere. This is one of the arguments for the importance of internal validity.
Another case where no transportation is called for is when an RCT is used for evaluation, for example to satisfy donors that the project they funded actually achieved its aims in the population in which it was conducted. Even so, for such evaluations, say by the World Bank, to be global public goods requires the development of arguments and guidelines that justify using the results in some way elsewhere; the global public good is not an automatic by-product of the Bank fulfilling its fiduciary responsibility. When the components of treatments change across studies, evaluations need not lead to cumulative knowledge. Or as Heckman et al (1999, p. 1934) note, “the data produced from them [social experiments] are far from ideal for estimating the structural parameters of behavioral models. This makes it difficult to generalize findings across experiments or to use experiments to identify the policy-invariant structural parameters that are required for econometric policy evaluation.” Of course, when we ask exactly what those invariant structural parameters are, whether they exist, and how they should be modeled, we open up major fault lines in modern applied economics. For example, we do not intend to endorse intertemporal dynamic models of behavior as the only way of recovering the parameters that we need. We also recognize that the usefulness of simple price theory is not as universally accepted as it once was. But the point remains that we need something, some regularity, and that the something needed can rarely be recovered by simply generalizing across trials.
A third non-problematic and important use of an RCT is when the parameter of interest is the average treatment effect in a well-defined population from which the sample trial population (from which treatments and controls are randomly assigned) is itself a random sample. In this case the sample average treatment effect (SATE) is an unbiased estimator of the population average treatment effect (PATE) that, by assumption, is our target, see Imbens (2004) for these terms. We refer to this as the “public health” case; like many public health interventions, the target is the average, “population health,” not the health of individuals. One major (and widely recognized) danger of the public-health-style uses of RCTs is that the scaling up from (even a random) sample to the population will not go through in any simple way if the outcomes of individuals or groups of individuals change the behavior of others, which will be common in economic examples but perhaps less common in health. There is also an issue of timing if the results are to be implemented sometime after the trial.
In economics, a ‘public-health-style’ example is the imposition of a commodity tax, where the total tax revenue is of interest and we do not care who pays the tax. Indeed, theory can often identify a specific, well-defined magnitude whose measurement is key for the policy; see Deaton and Ng (1998) for an example of what Chetty (2009) calls a “sufficient” statistic. In this case, the behavior of a random sample of individuals might well provide a good guide to the tax revenue that can be expected. Another case comes from work on poverty programs where the interest of the sponsors is in the consequences for the budget of the state responsible for the program; we discuss these cases at the end of this Section. Even here, it is easy to imagine behavioral effects coming into play that drive a wedge between the trial and its full scale implementation, for example if compliance is higher when the scheme is widely publicized, or if government agencies implement the scheme differently from trialists.
2.4 Transporting results laterally and globally
The program of RCTs in development economics, as in other areas of social science, has the broader goal of finding out “what works.” At its most ambitious, this aims for universal reach, and the development literature frequently argues that “credible impact evaluations are global public goods in the sense that they can offer reliable guidance to international organizations, governments, donors, and nongovernmental organizations (NGOs) beyond national borders,” Kremer and Duflo (2008, p. 93). Sometimes the results of a single RCT are advocated as having wide applicability, with especially strong endorsement when there is at least one replication. For example, Kremer and Holla (2009) use a Kenyan trial as the basis for a blanket statement without context restriction, “Provision of free school uniforms, for example, leads to 10%-15% reductions in teen pregnancy and dropout rates.” Kremer and Duflo (2008), writing about another trial, are more cautious, citing two evaluations, and restricting themselves to India: “One can be relatively confident about recommending the scaling-up of this program, at least in India, on the basis of these estimates, since the program was continued for a period of time, was evaluated in two different contexts, and has shown its ability to be rolled out on a large scale.”
Of course, the problem of generalization extends beyond RCTs, to both “fully controlled” laboratory experiments and to most non-experimental findings. For example, ever since Alfred Marshall thought of it while sunbathing, economists have used the concept of an elasticity (as in the income elasticity of the demand for food, or the price elasticity of the supply of cotton) and have transported elasticities, which are conveniently dimensionless, from one context to another, as numerical estimates, or in ranges, such as high, medium, or low. Articles that collect such estimates are widely cited even though, as has long been known, the invariance of elasticities is not guaranteed in practice and is sometimes inconsistent with choice theory. Our argument here is that evidence from RCTs, like evidence on elasticities, is not automatically simply generalizable, and that its internal validity, when it exists, does not provide it with any unique invariance across context. We shall also argue that specific features of RCTs, such as their freedom from parametric assumptions, although advantageous in estimation, can be a serious handicap in use.
Most advocates of RCTs understand that “what works” needs to be qualified to “what works under which circumstances,” and try to say something about what those circumstances might be, for example, by replicating RCTs in different places, and thinking intelligently about the differences in outcomes when they find them. Sometimes this is done in a systematic way, for example by having multiple treatments within the same trial so that it is possible to estimate a “response surface,” that links outcomes to various combinations of treatments, see Greenberg and Schroder (2004) or Shadish et al (2002). For example, the RAND health experiment had multiple treatments, allowing investigation, not only of whether health insurance increased expenditures, but how much it did so under different circumstances. Some of the negative income tax experiments (NITs) in the 1960s and 1970s were designed to estimate response surfaces, with the number of treatments and controls in each arm optimized to maximize precision of estimated response functions subject to an overall cost limit, Conlisk (1973). Experiments on time-of-day pricing for electricity had a similar structure, see Aigner (1985).
The MDRC experiments have also been analyzed across cities in an effort to link city features to the results of the RCTs within them, Bloom, Hill, and Riccio (2005). Unlike the RAND and NIT examples, these are ex post analyses of completed trials; the same is true of Vivalt (2015) who assembles evidence on a large number of trials, and finds, for the collection of trials she studied, that development-related RCTs run by government agencies typically find smaller (standardized) effect sizes than RCTs run by academics or by NGOs. Bold et al (2013), who ran parallel RCTs on an intervention implemented either by an NGO or by the government of Kenya, found similar results there. Note that these analyses have a different purpose from those meta-analyses that assume that different trials estimate the same parameter up to noise and average in order to increase precision.
Although there are issues with all of these methods of investigating differences across trials, without some discipline it is too easy to come up with “just-so” or fairy stories that account for almost any differences. We risk a procedure that, if a result is replicated in full or in part in at least two places, puts that treatment into the “it works” box and, if the result does not replicate, causally interprets the difference in a way that allows at least some of the findings to survive.
How can we think about this more seriously? How can we do better than simple generalization and simple extrapolation? Many writers have emphasized the role of theory in transporting and using the results of trials, and we shall discuss this further in the next subsection. But statistical approaches are also widely used; these are designed to deal with the possibility that treatment effects vary systematically with other variables. Referring back to (6), suppose that the $\beta_i$’s, the individual treatment effects, are functions of a set of $K$ observable or unobservable support variables, $w_{ik}$, and that the non-vacuous $w$’s may even represent different features in different places. It is then clear that, provided the distribution of the $w$ values is the same in the new circumstances as the old, the ATE in the original trial will hold in the new circumstances. In general, of course, this condition will not hold, nor do we have any obvious way of checking it unless we know what the support factors are in both places.
One procedure to deal with interactions is post-experimental stratification, which parallels post-survey stratification in sample surveys. The trial is broken up into subgroups that have the same combination of known, observable $w$’s, the ATEs within each of the subgroups calculated, and then reassembled according to the configuration of $w$’s in the new context. For example, if the treatment effects vary with age, the age-specific ATEs can be estimated, and the age distribution in the new context used to reweight the age-specific ATEs to give a new, overall, ATE. This can be used to estimate the ATE in a new context, or to correct estimates to the parent population when the trial sample is not a random sample of the parent. Of course, this method will only work in special cases; for example, if we only know some of the $w$’s, there is no reason to suppose that reweighting for those alone will give a useful correction.
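The reweighting step just described can be sketched in a few lines of Python. The example is a hedged illustration, not the paper's procedure in detail: the function names `stratum_ates` and `reweighted_ate`, the record layout, the two age groups, and all numbers are assumptions made for this sketch. It also assumes that the strata capture every relevant support factor, which is exactly the condition the text warns may fail.

```python
from collections import defaultdict
from statistics import mean


def stratum_ates(records):
    """Difference in mean outcomes, treated minus control, within each stratum.

    records: iterable of (stratum, treated, outcome); layout is hypothetical.
    """
    treated = defaultdict(list)
    control = defaultdict(list)
    for stratum, d, y in records:
        (treated if d == 1 else control)[stratum].append(y)
    # Only strata with both treated and control units yield an estimate.
    return {s: mean(treated[s]) - mean(control[s])
            for s in treated if s in control}


def reweighted_ate(ates, target_shares):
    """Reassemble stratum ATEs using the stratum shares of the new context."""
    missing = [s for s in target_shares if s not in ates]
    if missing:
        raise ValueError(f"no trial estimate for strata: {missing}")
    total = sum(target_shares.values())
    return sum(ates[s] * share for s, share in target_shares.items()) / total


# Toy trial (assumed): the effect is 1.0 for the young and 3.0 for the old,
# and the trial sample is split evenly between the two groups.
trial = ([("young", 1, 2.0)] * 2 + [("young", 0, 1.0)] * 2
         + [("old", 1, 4.0)] * 2 + [("old", 0, 1.0)] * 2)
ates = stratum_ates(trial)
trial_ate = reweighted_ate(ates, {"young": 0.5, "old": 0.5})
# Hypothetical new context in which 80 percent of the population is old.
target_ate = reweighted_ate(ates, {"young": 0.2, "old": 0.8})
print(ates, trial_ate, target_ate)
```

Here the trial ATE is 2.0 while the reweighted ATE for the older target population is about 2.6. Asking for a stratum that the trial never sampled raises an error rather than returning a number, since no amount of reweighting can supply an estimate for a group that was not in the trial.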
Other methods also work when there are too many $w$’s for stratification, for example by estimating the probability of each observation in the population being included in the trial sample as a function of the $w$’s, then weighting each observation by the inverse of these propensity scores. A good reference for these methods is Stuart et al (2011), or in economics, Angrist (2004) and Hotz, Imbens, and Mortimer (2005).
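The inverse-probability weighting idea can be illustrated as follows. This is a minimal Python sketch under simplifying assumptions that are not in the paper: the inclusion probabilities are taken as already known for each value of a single discrete support variable, whereas in practice they would be estimated, and the function name `ipw_ate` and the toy data are hypothetical.

```python
def ipw_ate(records, inclusion_prob):
    """Inverse-probability-weighted difference in mean outcomes.

    records: iterable of (w, treated, outcome) from the trial sample.
    inclusion_prob: maps each w to the probability (assumed known here) that
    a population member with that w ends up in the trial sample.
    """
    sums = {1: 0.0, 0: 0.0}     # weighted outcome totals by treatment arm
    weights = {1: 0.0, 0: 0.0}  # total weight by treatment arm
    for w, d, y in records:
        weight = 1.0 / inclusion_prob[w]
        sums[d] += weight * y
        weights[d] += weight
    return sums[1] / weights[1] - sums[0] / weights[0]


# Toy trial (assumed): the "high" group is over-represented in the trial and
# has an effect of 3.0; the "low" group has an effect of 1.0.
trial = ([("high", 1, 4.0)] * 4 + [("high", 0, 1.0)] * 4
         + [("low", 1, 2.0)] + [("low", 0, 1.0)])

# Unweighted comparison: every observation gets the same weight.
naive = ipw_ate(trial, {"high": 1.0, "low": 1.0})
# Weighted comparison: "high" members were four times as likely to enrol.
weighted = ipw_ate(trial, {"high": 0.8, "low": 0.2})
print(naive, weighted)
```

With these inclusion probabilities the two groups are equally common in the population, so the population ATE is 2.0; the unweighted trial comparison gives about 2.6, and weighting by the inverse inclusion probabilities brings the estimate back to 2.0. The correction is only as good as the probabilities and the choice of $w$ that go into it.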
There are yet further reasons why these methods do not always work. As with any form of reweighting, the variables used to construct the weights must be present in both the original and new context. If treatment effects vary by sex, we cannot predict the outcomes for men using a trial sample that is entirely female. If we are to carry a result forward in time, we may not be able to extrapolate from a period of low inflation to a period of high inflation; as Hotz et al (2005) note, it will typically be necessary to rule out such “macro” effects, whether over time, or over locations. Reweighting also depends on assuming that the same governing equation (6) covers the trial and the target population. If they differ not only by what causal factors are present in what proportions but also in how (if at all) the causes contribute to the effects, re-weighting the effect sizes that occur in trial sub-populations will not produce good predictions about target population outcomes.
It should be clear from this that reweighting works only when the observable factors used for reweighting include all and only genuine interactive causes; we need data on all the relevant interactive factors. But as Muller (2015) notes, this takes us back to the situation that RCTs are designed to avoid, where we need to start from a complete and correct specification of the causal structure. RCTs can avoid this in estimation (which is one of their strengths, supporting their credibility) but the benefit vanishes as soon as we try to carry their results to a new context.
Pearl and Bareinboim (2014) use Pearl’s do-calculus to provide a fuller formal analysis for transportability of causal empirical findings across populations. They define transportability as “a license to transfer causal effects learned in RCTs to a new population, in which only observational studies can be conducted,” Pearl and Bareinboim (2015, p. 1). They consider both qualitative causal relations, which they represent in directed acyclic graphs, and probabilistic facts, such as the conditional probability of the outcome on a treatment conditional on some third factor. They then provide theorems about what the relationship between the causal and probabilistic facts in two populations must be if it is to be possible to infer a particular causal fact, such as the ATE, about population 2 from causal and probabilistic information about population 1 coupled with purely probabilistic information about population 2. Not surprisingly, for many things we should like to know about population 2, knowledge of even the full structure on population 1 will not suffice. Inferences to facts about a new population require not only that the facts we suppose about population 1 (like an ATE) are well grounded, that the RCT was well conducted, that the statistical inference is sound, but that we have equally good grounding for other assumptions we need about the relation between the two populations. For example, using the result described above for directly transporting the ATE from a trial population to some other (simple extrapolation) we need good grounds to suppose both that the average of the net effect of the interactive factors is the same in both populations and also that the same governing equation describes both populations.
This discussion leads to a number of points. First, we cannot get to general claims by simple generalization; there is no warrant for the convenient assumption that the ATE estimated in a specific RCT is an invariant parameter. We need to think through the causal chain that has generated the RCT result, and the underlying structures that support this causal chain, whether that causal chain might operate in a new setting and how it would do so with different joint distributions of the causal variables; we need to know why and whether that why will apply elsewhere. While it is true that there exist general causal claims (the force of gravity, or that people respond to incentives) they use relatively abstract concepts and operate at a much higher level than the claims that can be reasonably inferred from a typical RCT, and cannot, by themselves, guarantee the outcomes that we are considering here. That transportation is far from automatic also tells us why (even ideal) RCTs of similar interventions can be expected to give different answers in different settings. Such differences do not necessarily reflect methodological failings and will hold across perfectly executed RCTs just as they do across observational studies.
Second, thoughtful pre-experimental stratification in RCTs is likely to be valuable, or failing that, subgroup analysis, because it can provide information that may be useful for generalization or transportation. For example, Kremer and Holla (2009) note that, in their trials, school attendance is surprisingly sensitive to small subsidies, which they suggest is because there are a large number of students and parents who are on the (financial) margin between attending and not attending school; if this is indeed the mechanism for their results, a good variable for stratification would be the fraction of people near the relevant cutoff. We also need to know that the same mechanism works in any new setting where we consider using small subsidies to increase school attendance.
Third, we need to be explicit about causal structure, even if that means more model building and more (or different) assumptions than advocates of RCTs are often comfortable with. To be clear, modeling causal structure does not necessarily commit us to the elaborate and often incredible assumptions that characterize some structural modeling in economics, but there is no escape from thinking about the way things work, the why as well as the what.
Fourth, we will typically need to know more than the results of the RCT itself, for example about differences in social, economic, and cultural structures and about the joint distributions of causal variables, knowledge that will often only be available through a range of empirical strategies including observational studies. We will also need to be able to characterize the population to which the original RCT and its ATE applied because how the population is described is commonly taken to be some indication of which other populations the results are likely to be exportable to and which not. Many medical and psychological journals are explicit about this. For instance, the rules for submission recommended by the International Committee of Medical Journal Editors, ICMJE (2015, p 14) insist that article abstracts “Clearly describe the selection of observational or experimental participants (healthy individuals or patients, including controls), including eligibility and exclusion criteria and a description of the source population.” The problems of characterizing the population here go beyond those we faced in considering a LATE. An RCT is conducted on a population of specific individuals. The results obtained, whether we think in terms of an ATE or in terms of establishing causality, are features of that population, of those very individuals at that very time, not any other population with any different individuals that might, for example, satisfy one of the infinite set of descriptions that the trial population satisfies. How is the description of the population that is used in reporting the results to be chosen? For choose we must: the alternative to describing is naming, identifying each individual in the study by name, which is cumbersome and unhelpful and often unethical.
Thissameissueisconfrontedalreadyinstudydesign.Apartfromspecialcases,likepost
hocevaluationforpayment-for-results,wearenotespeciallyconcernedtolearnaboutthevery
populationenrolledinthetrial.Mostexperimentsare,andshouldbe,conductedwithaneyeto
whattheresultscanhelpuslearnaboutotherpopulations.Thiscannotbedonewithoutsignificantsubstantialassumptionsaboutwhatmightbeandwhatmightnotberelevanttotheproductionoftheoutcomestudied.(Forexample,theICMJEguidelinesgoontosay:“Becausethe
relevanceofsuchvariablesasage,sex,orethnicityisnotalwaysknownatthetimeofstudydesign,researchersshouldaimforinclusionofrepresentativepopulationsintoallstudytypesand
ataminimumprovidedescriptivedatafortheseandotherrelevantdemographicvariables,”
p14.)Sobothintelligentstudydesignandresponsiblereportingofstudyresultsinvolvesubstantialbackgroundassumptions.Ofcoursethisistrueforallstudies,notjustRCTs.ButRCTsrequire
special conditions if they are to be conducted at all and especially if they are to be conducted
successfully (local agreements, compliant subjects, affordable administrators, people competent to measure and record outcomes reliably, a setting where random allocation is morally and
politically acceptable, etc.), whereas observational data are often more readily and widely available. In the case of RCTs, there is a danger that these kinds of considerations have too much effect. This is especially worrisome where the features the study population should have are not
justified, made explicit, or subjected to serious critical review. Such careful description of the
study population is uncommon in economics, whether in RCTs or many observational studies.
The need for observational knowledge is one of many reasons why it is counterproductive to insist that RCTs are the unique gold standard, or that some categories of evidence
should be prioritized over others; these strategies leave us helpless in using RCTs beyond their
original context. The results of RCTs must be integrated with other knowledge, including the
practical wisdom of policymakers, if they are to be useable outside the context in which they
were constructed. Contrary to much practice in medicine as well as in economics, conflicts between RCTs and observational results need to be explained, for example by reference to the different populations in each, a process that will sometimes yield important evidence, including on
the range of applicability of the RCT itself. While the validity of the RCT will sometimes provide
an understanding of why the observational study found a different answer, there is no basis (or
excuse) for the common practice of dismissing the observational study simply because it was
not an RCT and therefore must be invalid. It is a basic tenet of scientific advance that new findings must be able to explain previous results, even results that are now thought to be invalid;
methodological prejudice is not an explanation.
These considerations can be seen in practice in the range of randomized controlled trials
in economics, which we shall explore in the final subsection below.
2.5 Using theory for generalization
Economists have been combining theory and randomized controlled trials since the early experiments. Orcutt and Orcutt (1968) laid out the inspiration for the income tax trials using a simple,
static theory of labor supply. According to this, people choose how to divide their time between
work and leisure in an environment in which they receive a minimum G if they do not work, and
where they receive an additional amount (1 − t)w for each hour they work, where w is the
wage rate, and t is a tax rate. The trials assigned different combinations of G and t to different
trial groups, so that the results traced out the labor supply function, allowing estimation of the
parameters of preferences, which could then be used in a wide range of policy calculations, for
example to raise revenue at minimum utility loss to workers.
Following these early trials, there has been a long and continuing tradition of using trial
results, together with the baseline data collected for the trial, to fit structural models that are to
be used more generally. Early examples include Moffitt (1979) on labor supply and Wise (1985)
on housing; a more recent example is Heckman, Pinto and Savelyev (2013) for the Perry preschool program. Development economics examples include Attanasio, Meghir and Santiago
(2012), Attanasio et al (2015), Todd and Wolpin (2006) and Duflo, Hanna and Ryan (2012). These structural models sometimes require formidable auxiliary assumptions on functional forms or
the distributions of unobservables, which makes many economists reluctant to embrace them,
but they have compensating advantages, including the ability to integrate theory and evidence,
to make out-of-sample predictions, and to analyze welfare (which always requires some understanding of why things happen), and the use of RCT evidence allows the relaxation of at least
some of the assumptions that are needed for identification. In this way, the structural models
borrow credibility from the RCTs and in return help set the RCT results within a coherent
framework. Without some such interpretation, the welfare implications of RCT results can be
problematic; knowing how people in general (let alone just people in the trial population, which
is what, as we keep repeating, the trial results tell us about) respond to some policy is rarely
enough to tell whether or not they are made better off. What works is not equivalent to what
should be.
In many papers, Heckman has developed ways to model how the beliefs and interests of
participants affect their participation in, behavior during, and outcomes in trials, for example using a Roy model of choice; see e.g. Heckman and Smith (1995), and more recently
Chassang, Padró i Miguel, and Snowberg (2012) and Chassang et al (2015). The modeling of beliefs and behavior allows predictions about the results of trials that differ from the base trial, or
where the risk and reward structures are different. Beyond that, and in line with a running
theme of this Section, thinking about how to handle new situations can be incorporated into the
design of the original trial so as to provide the information needed for transportation.
Light-touch theory can do much to extend and to use RCT results. In both the RAND
Health Experiment and negative income tax experiments, an immediate issue concerned the
difference between short-run and long-run responses; indeed, differences between immediate and
ultimate effects occur in a wide range of RCTs. Both health and tax RCTs aimed to discover what
would happen if consumers/workers were permanently faced with higher or lower prices/wages, but the trials could only run for a limited period. A temporarily high tax rate on earnings was effectively a “fire sale” on leisure, so that the experiment provided an opportunity to
take a vacation and make up the earnings later, an incentive that would be absent in a permanent scheme. How do we get from the short-run responses that come from the trial to the long-run responses that we want to know? Metcalf (1973) and Ashenfelter (1978) provided answers
for the income tax experiments, as did Arrow (1975) for the RAND Health Experiment.
Arrow’s analysis illustrates how to use both structure and observational data to
transport and adapt results from one setting to another. He models the health experiment as a
two-period model, in which the price of medical care is lowered in the first period only, and
shows how to derive what we want, which is the response in the first period if prices were lowered by the same proportion in both periods. The magnitude that we want is S, the compensated price derivative of medical care in period 1 in the face of identical increases in p1 and p2 in both periods 1 and 2, and this is equal to s11 + s12, the sum of the derivatives of period 1’s
demand with respect to the two prices. The trial gives only s11. But if we have post-trial data on
medical services for both treatments and controls, we can infer s21, the effect of the experimental price manipulation on post-experimental care. Choice theory, in the form of Slutsky
symmetry, allows Arrow to use this to infer s12 and thus S. He contrasts this with Metcalf’s alternative solution, which makes different assumptions: that two-period preferences are intertemporally additive, in which case the long-run elasticity can be obtained from knowledge of the
income elasticity of post-experimental medical care, which would have to come from an observational analysis. These two alternative approaches show how we can choose, based on our willingness to make assumptions and on the data we have, a suitable combination of (elementary
and transparent) theoretical assumptions and observational data in order to adapt and use the trial
results. Such analysis can also help design the original trial by clarifying what we need to know in
order to be able to use the results of a temporary treatment to estimate the permanent effects
that we need. Ashenfelter provides a third solution, noting that the two-period model is formally
identical to a two-person model, so that we can use information on two-person labor supply to
tell us about the dynamics.
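The steps in Arrow's argument can be laid out compactly. The notation is a sketch: the symbols x_1 and x_2 for compensated demand for medical care in periods 1 and 2 are labels introduced here for illustration, while S and the s_ij follow the text.

```latex
\begin{align*}
s_{ij} &= \left.\frac{\partial x_i}{\partial p_j}\right|_{u\ \text{constant}},
  \qquad i, j \in \{1, 2\} \\
S &= s_{11} + s_{12}
  && \text{(target: period-1 response to identical changes in $p_1$ and $p_2$)} \\
s_{12} &= s_{21}
  && \text{(Slutsky symmetry)} \\
S &= s_{11} + s_{21}
  && \text{($s_{11}$ from the trial, $s_{21}$ from post-trial data)}
\end{align*}
```

The only theoretical input is the symmetry in the third line; everything else is measured, which is what makes the assumption elementary and transparent.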
Theory can often allow us to reclassify new or unknown situations as analogous to situations where we already have background knowledge. One frequently useful way of doing this is
when the new policy can be recast as equivalent to a change in the budget constraint that respondents face. The consequences of a new policy may be easier to predict if we can reduce it
to equivalent changes in income and prices, whose effects are often well understood and well
studied. Todd and Wolpin (2008) make this point and provide examples. In the labor supply
case, an increase in the tax rate t has the same effect as a decrease in the wage rate w, so that
we can rely on previous literature to predict what will happen when tax rates are changed. In
the case of Mexico’s PROGRESA conditional cash transfer program, Todd and Wolpin note that
the subsidies paid to parents if their children go to school can be thought of as a combination of
a reduction in children’s wage rates and an increase in parents’ income, which allows them to
predict the results of the conditional cash experiment with limited additional assumptions. If
this works, as it partially does in their analysis, the trial helps consolidate previous knowledge
and contributes to an evolving body of theory and empirical, including trial, evidence.
The program of thinking about policy changes as equivalent to price and income changes has a long history in economics; much of rational choice theory can be so interpreted, see
Deaton and Muellbauer (1980) for many examples. When this conversion is credible, and when
a trial on some apparently unrelated topic can be modeled as equivalent to a change in prices
and incomes, and when we can assume that people in different settings respond relevantly similarly to changes in prices and incomes, we have a ready-made framework for incorporating the
trial results into previous knowledge, as well as for extending the trial results and using them
elsewhere. Of course, all depends on the validity and credibility of the theory; people may not in
fact think of a tax increase as a decrease in the price of leisure, and behavioral economics is full
of examples where apparently equivalent stimuli generate non-equivalent outcomes. The embrace of behavioral economics by many of the current generation of trialists may account for
their limited willingness to use conventional choice theory in this way; unfortunately, behavioral
economics does not yet offer a replacement for the general framework of choice theory that is
so useful in this regard.
Theory can also help with the problem we raised of delineating the population to which
the trial results immediately apply and with thinking about moving from this population to the
population of interest. Ashenfelter’s (1978) analysis is again a good illustration and predates
much similar work in later literature. The income tax experiments offered participation in the
trial to a random sample of the population of interest. Because there was no blinding and no
compulsion, people who were randomized into the treatment group were free to choose to refuse treatment. As in many subsequent analyses, Ashenfelter supposes that people choose to
participate if it is in their interest to do so, depending on what has become known in the RCT
and Instrumental Variables literature as their own idiosyncratic “gain.” The simple labor supply
model gives an approximate condition: if the treatment increases the tax rate from t0 to t1 with
an offsetting increase in G, then an individual assigned to the experimental group will decline to
participate if
\[
(t_1 - t_0)\, w_0 h_0 + \tfrac{1}{2}\, s_{00}\, (t_1 - t_0) > G_1 - G_0 \tag{7}
\]
where subscript 1 refers to the treatment situation, 0 to the control, h0 is hours worked, and
s00 is the (negative) utility-constant response of hours worked to the tax rate. If there is no substitution, the second term on the left-hand side is zero, and people will accept treatment if the
increase in G more than makes up for the increases in taxes payable, the “breakeven” condition.
In consequence, those with higher earnings are less likely to accept treatment. Some better-off
people with high substitution effects will also accept treatment if the opportunity to buy more
cheap leisure is sufficient enticement.
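Condition (7) can be sketched as a small function. This is an illustration only: the function name, the argument names, and all numbers are invented here. The substitution term is read as one half of s00 times the tax change, with s00 assumed to be expressed in money units per unit of tax rate, an assumption about units that the text does not spell out.

```python
def declines_treatment(t0, t1, w0, h0, s00, g0, g1):
    """Return True if condition (7) holds, i.e. the person declines to participate.

    All names are illustrative. s00 <= 0 is the utility-constant response to
    the tax rate, assumed here to be in money units per unit of tax rate.
    """
    extra_tax = (t1 - t0) * w0 * h0        # first term: extra tax on baseline earnings
    substitution = 0.5 * s00 * (t1 - t0)   # second term: zero when there is no substitution
    return extra_tax + substitution > g1 - g0


# Hypothetical case: the tax rate rises from 0.2 to 0.5 and G rises by 5,000.
low_earner = declines_treatment(0.2, 0.5, w0=10, h0=1000, s00=0, g0=0, g1=5000)
high_earner = declines_treatment(0.2, 0.5, w0=10, h0=2000, s00=0, g0=0, g1=5000)
high_earner_elastic = declines_treatment(0.2, 0.5, w0=10, h0=2000, s00=-10000, g0=0, g1=5000)

print(low_earner, high_earner, high_earner_elastic)
```

In this made-up case the low earner is below breakeven and accepts, the high earner with no substitution declines, and a high earner with a large enough substitution response accepts after all, which is the pattern described above.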
The selective acceptance of treatment limits the analyst’s ability to learn about the better-off or low-substitution people who decline treatment but who would have to accept it if the
policy were actually implemented. Both the ITT estimator and the “as treated” estimator that
compares the treated and the untreated are affected, not just by the labor supply effects that
the trial is designed to induce, but by the kind of selection effects that randomization is designed to eliminate. Of course, the analysis that leads to (7) can perhaps help us say something
about this and help us adjust the trial estimates back to what we would like to know. Yet this is
no easy matter because selection depends, not only on observables, such as pre-experimental
earnings and hours worked, but on (much harder to observe) labor supply responses that likely
vary across individuals. Paraphrasing Ashenfelter, we cannot estimate the effects of a permanent compulsory negative income tax program from a transitory voluntary trial without strong
assumptions or additional evidence.
Much of the modern literature, for example on training programs, wrestles with the issue of exactly who is represented by the RCT results, see again Heckman, LaLonde and Smith
(1999). When people are allowed to reject their randomly assigned treatment according to their
own (real or perceived) individual advantage, we have come a long way from the random
allocation in the standard conception of a randomized controlled trial. Moreover, the absence of
blinding is common in social and economic RCTs, and while there are trials, such as welfare trials, that effectively compel people to accept their assignments, and some where the treatment
is generous enough to do so, there are trials where subjects have much freedom and, in those
cases, it is less than obvious to us what role, if any, randomization plays in warranting the results.
2.6 Scaling up: using the average for populations
A typical RCT, especially in the development context, is small-scale and local, for example in a
few schools, clinics, or farms in a particular geographic, cultural, socio-economic setting. If successful according to a cost-effectiveness criterion, for example, it is a candidate for scaling-up,
applying the same intervention to a much larger area, often a whole country, or sometimes
even beyond, as when some treatment is considered for all relevant World Bank projects. The
fact that the intervention might work differently at scale has long been noted in the economics
literature, e.g. Garfinkel and Manski (1992), Heckman (1992), and Moffitt (1992), and is recognized in the recent review by Banerjee and Duflo (2009). We want here to emphasize the pervasiveness of such effects (that failure of the trial results to replicate at a larger scale is likely to
be the rule rather than the exception) as well as to note once again that, as in failures of transportability, this should not be taken as an argument against using RCTs, but only against the idea
that effects at scale are likely to be the same as in the trial. Using RCT results is not the same as
assuming the same results hold in all circumstances.
An example of what are often called general equilibrium effects comes from agriculture.
Suppose an RCT demonstrates that in the study population a new way of using fertilizer or insecticide had a substantial positive effect on, say, cocoa yields, so that farmers who used the new
methods saw increases in production and in incomes compared to those in the control group. If
the procedure is scaled up to the whole country, or to all cocoa farmers worldwide, the price
will drop, and if the demand for cocoa is price inelastic (as is usually thought to be the case, at
least in the short run), cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for
many crops is that farmers do best when the harvest is small, not large. Of course, these considerations might not be decisive in deciding whether or not to promote the innovation, and there
may still be long-term gains if, for example, some farmers find something better to do than
growing cocoa. But the basic point is that the scaled-up effect in this case is opposite in sign to
the trial effect. The problem here is not with the trial results, which can be usefully incorporated
into a more comprehensive market model that incorporates the responses estimated by the
trial. The problem arises only if we assume that the aggregate looks like the individual. That other
ingredients of the aggregate model must come from observational studies should not be a criticism, even for those who favor RCTs; it is simply the price of doing serious analysis.
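The sign reversal can be illustrated with a toy market. The sketch below assumes a constant-elasticity demand curve with a price elasticity of 0.5, 1,000 identical farmers, and a 10 percent yield gain; these numbers and the function names are made up for illustration and are not estimates from any trial.

```python
def market_price(total_quantity, scale=1000.0, elasticity=0.5):
    """Inverse of the assumed demand curve Q = scale * P ** (-elasticity)."""
    return (total_quantity / scale) ** (-1.0 / elasticity)


def farm_income(own_output, total_quantity):
    """Revenue of one farmer who sells own_output at the market price."""
    return own_output * market_price(total_quantity)


n_farmers = 1000
base_output = 1.0
yield_gain = 1.10  # hypothetical 10 percent increase in yield

baseline_total = n_farmers * base_output
baseline_income = farm_income(base_output, baseline_total)

# Trial: a single treated farmer; total supply, and so the price, barely moves.
trial_total = baseline_total + base_output * (yield_gain - 1.0)
trial_income = farm_income(base_output * yield_gain, trial_total)

# Scale-up: every farmer adopts, so total supply rises by 10 percent.
scaled_total = baseline_total * yield_gain
scaled_income = farm_income(base_output * yield_gain, scaled_total)

print(f"baseline {baseline_income:.3f}, trial {trial_income:.3f}, scaled up {scaled_income:.3f}")
```

Under these assumptions the treated farmer in the trial earns roughly 10 percent more than at baseline, while under full adoption every farmer earns roughly 9 percent less. The trial estimate of the yield response is still an input to the market model; what fails is reading the individual income gain as the aggregate one.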
There are many possible interventions that alter supply or demand whose effect, in aggregate, will change a price or a wage that is held constant in the original RCT. Education will
change the supplies of skilled versus unskilled labor, with implications for relative wage rates.
Conditional cash transfers increase the demand for (and perhaps supply of) schools and clinics,
which will change prices or waiting lines, or both. There are interactions between people that
will operate only at scale. Giving one child a voucher to go to private school might improve her
future, but doing so for everyone can decrease the quality of education for those children who
are left in the public schools, see the contrasting studies of Angrist et al (1999) and Hsieh and
Urquiola (2002). Educational or training programs may benefit those who are treated, but harm
those left behind; if the control group is selected from the latter, the RCT may generate a positive result in spite of hurting some and helping none; Crépon et al (2014) recognize the issue and
show how to adapt an RCT to deal with it.
Scaling up can also disturb the political equilibrium. An exploitative government may not
allow the mass transfer of money from abroad to a powerless segment of the population,
though it may permit a small-scale RCT of cash transfers. Provision of healthcare by foreign
NGOs may be successful in trials, but have unintended negative consequences at scale because
of general equilibrium effects on the supply of healthcare personnel, or because it disturbs the
nature of the contract between the people and a government that is using tax revenue to provide services. In India, the government spends large sums on food subsidies through a system
(the PDS) that is both corrupt and inefficient, with much of the grain that is procured failing to
find its way to the intended beneficiaries. Localized RCTs on whether or not families are better
off with cash transfers are not informative about how politicians would change the amount of
the transfer if faced with unanticipated inflation or, at least as important, about whether the government could cut procurement from relatively wealthy and politically powerful farmers. Without a political and general equilibrium analysis, it is impossible to think about the effects of replacing food subsidies with cash transfers, see e.g. Basu (2010).
Even in medicine, where biological interactions between people are less common than
are social interactions in social science, interactions can be important; infectious diseases are an
example, and immunization programs affect the dynamics of disease transmission through herd
immunity, so that the effects on an individual depend on how many others are vaccinated, Fine
and Clarkson (1986), Manski (2013, p52). The usual, if seldom correct, conception of an RCT in
medicine is of a biological process (for example, the administration of aspirin after a heart attack) where the effect is thought to be similar across individuals, and where there are no interactions. Yet even here, the social and economic setting affects how drugs are actually used and
the same issues can arise; the distinction between efficacy and effectiveness in clinical trials is in
part recognition of the fact.
2.7 Drilling down: using the average for individuals
Just as there are issues with scaling-up, it is not obvious how to use the results from RCTs at the
level of individual units, even individual units that were actually (or potentially) included in the
trial. A well-conducted RCT delivers an average treatment effect for a well-defined population
but, in general, that average does not apply to everyone. It is not true, for example, as argued in
JAMA’s “Users’ guide to the medical literature” that “if the patient would have been enrolled in
the study had she been there – that is she meets all of the inclusion criteria and doesn’t violate
any of the exclusion criteria – there is little question that the results are applicable,” Guyatt et al
(1994). Even more misleading are the often-heard statements that an RCT with an average
treatment effect insignificantly different from zero has shown that the treatment works for no
one, though such a conclusion would be better supported by a Fisher randomization test.
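A toy set of potential outcomes shows why a zero average says little about individuals. The numbers below are invented: half of the units gain two points from treatment and half lose two, so the average treatment effect is exactly zero even though treatment changes the outcome of every unit.

```python
from statistics import mean

# Hypothetical potential outcomes for eight units. In a real trial only one
# of the two outcomes is ever observed for each unit.
y_control = [10, 12, 9, 11, 10, 12, 9, 11]
individual_effects = [2, 2, 2, 2, -2, -2, -2, -2]
y_treated = [y0 + d for y0, d in zip(y_control, individual_effects)]

# Average treatment effect: difference between mean treated and mean control outcomes.
ate = mean(y_treated) - mean(y_control)

# Shares of units that are helped or harmed by treatment.
share_helped = mean([1 if d > 0 else 0 for d in individual_effects])
share_harmed = mean([1 if d < 0 else 0 for d in individual_effects])

print(ate, share_helped, share_harmed)
```

An estimated ATE near zero is what a trial on such a population would be expected to deliver, and it would be consistent with "works for no one" only under the additional assumption that effects are the same for everyone.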
These issues are familiar to physicians practicing evidence-based medicine whose guidelines require “integrating individual clinical expertise with the best available external clinical evidence from systematic research,” Sackett et al (1996). Exactly what this means is unclear; physicians know much more about their patients than is allowed for in the ATE from the RCT
(though, once again, stratification in the trial is likely to be helpful) and they often have intuitive
expertise from long practice that they rely on to help them identify features in a particular patient that are likely to affect the effectiveness of a given treatment for that patient. But there is
an odd balance being struck here. These judgments are deemed admissible in dealing with the
individual patient, at least for discussion with the patient as possible considerations, but they
don’t add up to evidence to be made publicly available, with the usual cautions about credibility,
by the standards adopted by most EBM sites. It is also true that physicians can have prejudices
and “knowledge” that might be anything but. Clearly, there are situations where forcing practitioners to follow the average will do better, even for individual patients, and others where the
opposite is true, see Kahneman and Klein (2009).
Whether or not averages are useful to individuals raises the same issue in social science
research. Imagine two schools, St Joseph’s and St Mary’s, both of which were included in an
RCT of a classroom innovation, or at least were eligible to be so. The innovation is successful on
average, but should the schools adopt it? Should St Mary’s be influenced by a previous attempt
in St Joseph’s that was judged a failure? Many would dismiss this experience as anecdotal and
ask how St Joseph’s could have known that it was a failure without benefit of “rigorous” evidence. Yet if St Mary’s is like St Joseph’s, with a similar mix of pupils, a similar curriculum, and
similar academic standing, might not St Joseph’s experience be more relevant to what might
happen at St Mary’s than is the positive average from the RCT? And might it not be a good idea
for the teachers and governors of St Mary’s to go to St Joseph’s and find out what happened and
why? They may be able to observe the mechanism of the failure, if such it was, and figure out
whether the same problems would apply for them, or whether they might be able to adapt the
innovation to make it work for them, perhaps even more successfully than the positive average
in the trial.
Once again, these questions are unlikely to be simply answered in practice; but, as with
transportability, there is no serious alternative to trying. Assuming that the average works for
you will often be wrong, and it will at least sometimes be possible to do better. As in the medical case, the advice to individual schools often lacks specificity. For example, the US Institute of
Education Sciences has provided a “user-friendly” guide to practices supported by rigorous evidence, US Department of Education (2003). The advice, which is very similar to recommendations in development economics, is that the intervention be demonstrated effective through
well-designed RCTs in more than one site of implementation, and that “the trials should demonstrate the intervention’s effectiveness in school settings similar to yours” (2003, p.17). No operational definition of “similar” is provided.
We note finally that these caveats, which apply to individuals (or schools) even if they
were in the trial, provide another reason why the concept of “external” validity is unhelpful. The
real issue is how to use the findings of a trial in new settings, including settings included in the
trial; external validity in the sense of invariance of the ATE emphasizes simple replication, which
guarantees nothing, while ignoring the possibility that lack of replication can be a key to understanding.
2.8 Examples and illustrations from economics
Our arguments in this Section should not be controversial, yet we believe that they represent an
approach that is different from most current practice. To document this and to fill out the arguments, we provide some examples. While these are occasionally critical, our purpose is constructive; indeed, we believe that misunderstandings about how to use RCTs have artificially
limited their usefulness, as well as alienated some who would otherwise use them.
Conditional cash transfers (CCTs) are interventions that have been tested using RCTs
(and other RCT-like methods) and are often cited as a leading example of how an evaluation
with strong internal validity leads to a rapid spread of the policy, e.g. Angrist and Pischke (2010)
among many others. Think through the causal chain that is required for CCTs to be successful:
people must like money, they must like (or at least not object too much to) their children being educated and vaccinated, there must exist schools and clinics that are close enough and well
enough staffed to do their job, and the government or agency that is running the scheme must
care about the wellbeing of families and their children. That such conditions hold in a wide
range of (although certainly not all) countries makes it unsurprising that CCTs “work” in many
replications, though they certainly will not work in places where the schools and clinics do not
exist, Levy (2001), nor in places where people strongly oppose education or vaccination.
Similarly, given that the helping factors will operate with different strengths and effectiveness in different places, it is also not surprising that the size of the ATE differs from place to
place; for example, Vivalt’s AidGrade website lists 29 estimates from a range of countries of the
standardized (divided by local standard deviation of the outcome) effects of conditional cash
transfers on school attendance; all but four show the expected positive effect, and the range
runs from -8 to +38 percentage points. Even in this leading case, where we might reasonably
conclude that CCTs “work” in getting children into school, it would be hard to calculate credible
cost-effectiveness numbers, or to come to a general conclusion about whether CCTs are more or
less cost effective than other possible policies. Both costs and effect sizes can be expected to
differ in new settings, just as they have in observed ones, making these predictions difficult.
The range of estimates illustrates that the simple view of external validity (that the ATE
should transport from one place to another) is not well defined. AidGrade uses standardized
measures of effect size divided by standard deviation of outcome at baseline, as does the major
multi-country study by Banerjee et al (2015). But we might prefer measures that have an economic interpretation, such as additional months of schooling per $100 spent (for example if a
donor is trying to decide where to spend, see below). Nutrition might be measured by height, or
by the log of height. Even if the ATE by one measure carries across, it will only do so using another measure if the relationship between the two measures is the same in both situations. This
is exactly the sort of thing that a formal analysis of transportability forces us to think about.
(Note also that the ATE in the original RCT can differ depending on whether the outcome is measured in levels or in logs; the two ATEs could even have different signs.)
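The remark about levels and logs can be checked with a small invented example. Two equally common types of unit are assumed: for one, the outcome rises from 10 to 20 under treatment; for the other, it falls from 100 to 85. The ATE is computed here directly from these hypothetical potential outcomes, which a real trial would estimate by comparing group means.

```python
from math import log
from statistics import mean

# Hypothetical potential outcomes (untreated, treated) for two equally common types.
outcomes = [(10.0, 20.0), (100.0, 85.0)]

# ATE with the outcome measured in levels.
ate_levels = mean([y1 - y0 for y0, y1 in outcomes])

# ATE with the same outcome measured in logs.
ate_logs = mean([log(y1) - log(y0) for y0, y1 in outcomes])

print(f"ATE in levels: {ate_levels:+.2f}, ATE in logs: {ate_logs:+.3f}")
```

The level ATE is negative because the large absolute loss of the second type dominates, while the log ATE is positive because the first type's doubling dominates in proportional terms. Neither measure is wrong; they answer different questions.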
Deworming is surely more complicated than conditional cash transfers though not because anyone disputes the desirability of removing parasitical worms or the biological efficacy of
the medicines, at least if they are repeatedly and effectively administered; that is the part of the
causal process that is transportable from one place to another. Yet nutritional or school attendance outcomes depend on reinfection from one person to another, which depends on local
customs about defecation (which vary from place to place and are subject to religious and cultural factors), particularly on the extent of open defecation and the density of population, on
whether or not children wear shoes, and on the availability and use of public and private sanitation; this last was crucial in the elimination of hookworm in the southern states of the U.S. according to Stiles (1939). Temperature may also be important; indeed, such “macro” variables are
likely to be important in a wide range of medical, employment, and production trials,
Rosenzweig and Udry (2016). There are two prominent positive studies in the economics literature, one in Kenya, Kremer and Miguel (2000) and one in India, Bobonis, Miguel and Puri-Sharma (2006); these are often cited as examples of the power of RCTs to come up with the
“right” answer, for example by Karlan and Appel (2008). Yet the Cochrane Collaboration review
of deworming and schooling, Taylor-Robinson et al (2015), which reviews one trial (from India)
covering more than a million participants, and 44 others covering 67,672 participants, including
Kremer and Miguel (2004), concludes that there is “substantial evidence” that deworming shows
no benefit in nutritional status, hemoglobin, cognition, school performance or death. The validity of this meta-analysis is disputed by Croke et al (2016). A replication, Aiken et al (2015) and reanalysis (using different methods) of Miguel and Kremer’s original data by Davey et al (2015)
concluded that the study “provided some evidence, but with high risk of bias,” provoking a
lengthy exchange, Hicks et al (2015) and Hargreaves et al (2015). Most of the differences in results come from different methodological choices, themselves largely based on disciplinary traditions, rather than from the effects of mistakes or errors. In an impressive and clear reanalysis,
Humphreys (2015) argues that one puzzling feature of Miguel and Kremer’s results is the absence of any clear effect of deworming on health, as was the case in the large Indian RCT. Yet
the effects of deworming on education, which are the main target of the paper, presumably
work through health, so that the absence of health effects (a failure of expected mediators) is
a puzzle, see also Miguel, Kremer and Hicks (2015), and Ahuja et al (2015). Recall too our earlier
discussion of the difficulty of interpreting the standard errors of the original study in the absence of randomization.
It is not our purpose here to try to adjudicate these competing claims but rather to relate this work to our general argument. First, it is not clear that there is a right answer to be discovered; given the causal chains involved, deworming might be helpful in one place but unhelpful in another. Yet the focus of the debate is almost entirely on internal validity, on whether the
original studies were correctly done. The Cochrane review, in line with this, and in line with
much meta-analysis of trials, seems to suppose that there is a single effect to be uncovered that,
once established, will be invariant to local and environmental differences. External validity, it
seems, is implied by internal validity. Indeed, Chalmers, one of the founders of the Cochrane
Collaboration, has explicitly argued (in response to one of us) that, in the absence of strong reasons to the contrary, results should be taken as applicable everywhere, Pettigrew and Chalmers
(2011).
Second, the debate makes it clear that the practice of RCTs in economic development
has done little to fulfill the original promise that their simplicity (how hard is it to subtract one
mean from another?) would dispose of the methodological and econometric disputes that
characterize so many observational studies and were thought to be one of their main flaws.
While RCTs tend to take some contentious issues of identification off the table, they leave much
to be disputed, including the handling of factors that interact with treatment effects, the appropriate level of randomization, the calculation of standard errors, the choice of outcome measure, the inclusion criteria for the sample, placebo and Hawthorne effects, and much more. The
claim that RCTs cut through the usual econometric disputes to deliver to policymakers a simple,
convincing, and easily understood answer is simply false. The deworming debates are perhaps
the leading illustration.
Much of the development literature, like the medical literature, works with the view of
external validity that, unless there is evidence to the contrary, the direction and size of treatment effects can be transported from one place to another. The J-PAL website reports its findings under a general head of policy relevance, subdivided by a selection of topics. Under each
topic, there is a list of relevant RCTs from a range of different settings around the world. These
are conveniently converted into a common cost-effectiveness measure so that, for example,
under ‘education’, subhead ‘student participation’, there are four studies from Africa: on informing parents about the returns to education in Madagascar, and on deworming, school uniforms, and merit scholarships, the last three all from Kenya. The units of measurement are additional years
of student education per $100, and among these four studies, the average effect sizes of spending $100 are 20.7 years, 13.9 years, 0.71 years and 0.27 years respectively. (Note that this is a
different, and superior, standardization from the effect size standardization discussed above.)
What can we conclude from such comparisons? For a philanthropic donor interested in
education, and if marginal and average effects are the same, they might indicate that the best
place to devote a marginal dollar is in Madagascar, where it would be used to inform parents
about the value of education. This is certainly useful, but it is not as useful as statements that
information or deworming programs are everywhere more cost-effective than programs involving school uniforms or scholarships, or if not everywhere, at least over some domain, and it is
this second kind of comparison that would genuinely fulfill the promise of “finding out what
works.” But such comparisons only make sense if we can transport the results from one place to
another, if the Kenyan results also hold in Madagascar, Mali, or Namibia, or some other list of
African or non-African places. J-PAL’s manual for cost-effectiveness, Dhaliwal et al (2012), explains in (entirely appropriate) detail how to handle variation in costs across sites, noting variable factors such as population density, prices, exchange rates, discount rates, inflation, and bulk
discounts. But it gives short shrift to cross-site variation in the size of average treatment effects,
which play an equal part in the calculations of cost effectiveness. The manual briefly notes that
diminishing returns (or the “last-mile” problem) might be important in theory, but argues that
the baseline levels of outcomes are likely to be similar in the pilot and replication areas, so that
the average treatment effect can be safely transported as is. All of this lacks a justification for
transportability, some understanding of when results transport, when they do not, or better
still, how they should be modified to make them transportable.
One of the largest and most technically impressive of the development RCTs is by
Banerjee et al (2015), which tests a “graduation” program designed to permanently lift extremely poor people from poverty by providing them with a gift of a productive asset (from guinea-pigs, (regular-) pigs, sheep, goats, or chickens depending on locale), training and support, life
skills coaching, as well as support for consumption, saving, and health services; the idea is that
this package of aid can help people break out of poverty traps in a way that would not be possible with one intervention at a time. Comparable versions of the program were tested in Ethiopia, Ghana, Honduras, India, Pakistan, and Peru and, excepting Honduras (where the chickens
died), the authors find largely positive and persistent effects, with similar (standardized) effect sizes, for a
range of outcomes (economic, mental and physical health, and female empowerment). One site
apart, essentially everyone accepted their assignment, so that many of the familiar caveats do
not apply. Replication of positive ATEs over such a wide range of places certainly provides proof
of concept for such a scheme. Yet Bauchet, Morduch, and Ravi (2015) fail to replicate the result
in South India, where the control group got access to much the same benefits, what Heckman,
Hohman, and Smith (2000) call ‘substitution bias’. Even so, the results are important because,
although there is a longstanding interest in poverty traps, many economists have long been
skeptical of their existence or that they could be sprung by such aid-based policies. In this sense,
the study is an important contribution to the theory of economic development; it tests a theoretical proposition and will (or should) change minds about it.
A number of difficulties remain. As the authors note, such trials cannot tell us which component of the treatment accounted for the results, or which might be dispensable (a much more expensive multifactorial trial would be required), though it seems likely in practice that the costliest component, the repeated visits for training and support, will be the first to be cut by cash-strapped politicians or administrators. And as noted, it is unclear what should count as (simple) replication in international comparisons; it is hard to think of the uses of standardized effect sizes, except to document that effects exist everywhere and that they are similarly large relative to local variation in such things.
The effect size, the average treatment effect expressed in numbers of standard deviations of the original outcome, though conveniently dimensionless, has little to recommend it. As with much of RCT practice, it strips out any economic content (no rates of return, or benefits minus costs) and it removes any discipline on what is being compared. Apples and oranges become immediately comparable, as do treatments whose inclusion in a meta-analysis is limited only by the imagination of the analysts in claiming similarity. In psychology, where the concept originated, there are endless disputes about what should and should not be pooled in a meta-analysis. Beyond that, as argued by Simpson (2016), restrictions on the trial sample, often good practice to reduce background noise and to help detect an effect, will reduce the baseline standard deviation and inflate the effect size. More generally, effect sizes are open to manipulation by exclusion rules. It makes no sense to claim replicability on the basis of effect sizes, let alone to use them to rank projects.
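The mechanics of this inflation can be illustrated with a small sketch. The data, the exclusion rule, and the function name are invented for illustration and are not taken from Simpson (2016). The ATE in original units is held fixed by assumption, so any change in the effect size comes only from the change in the baseline standard deviation.

```python
import statistics


def standardized_effect(ate: float, baseline_outcomes: list[float]) -> float:
    """ATE divided by the sample standard deviation of the baseline outcome."""
    return ate / statistics.stdev(baseline_outcomes)


# Hypothetical baseline outcomes for the full eligible population.
full_sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# A hypothetical exclusion rule that keeps only the more homogeneous middle range.
restricted_sample = [y for y in full_sample if 30 <= y <= 70]

ate = 5.0  # assumed identical in both samples, in the original units

es_full = standardized_effect(ate, full_sample)
es_restricted = standardized_effect(ate, restricted_sample)

print(f"full sample:       effect size = {es_full:.2f}")
print(f"restricted sample: effect size = {es_restricted:.2f}")
```

With these invented numbers the same five-unit effect appears as roughly 0.18 standard deviations in the full sample and roughly 0.32 in the restricted one, although nothing about the treatment has changed.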
The graduation study can be taken as the closest to fulfilling the “finding out what works” aim of the RCT movement in development. Yet it is silent on perhaps the crucial aspect for policy, which is that the trial was run entirely in partnership with NGOs, whereas what we would like to know is whether it could be replicated by governments, including those governments that are incapable of getting doctors, nurses, and teachers to show up to clinics or schools, Chaudhury et al (2005), Banerjee, Deaton and Duflo (2004), or of regulating the quality of medical care in either the public or private sectors, Filmer, Hammer and Pritchett (2000) or Das and Hammer (2005). In fact, we already know a great deal about “what works.” Vaccinations work, maternal and child healthcare services work, and classroom teaching works. Yet knowing this does not get those things done. Adding another program that works under ideal conditions is useful only where such conditions exist, and that would likely be unnecessary when they exist. Finding out what works is not the magic key to economic development. Technical knowledge, though always worth having, requires suitable institutions if it is to do any good.
A similar point is documented in the contrast between a successful trial that used cameras and threats of wage reductions to incentivize attendance of teachers in schools run by an NGO in Rajasthan in India, Duflo, Hanna, and Ryan (2012), and the subsequent failure of a follow-up program in the same state to tackle mass absenteeism of health workers, Banerjee, Duflo, and Glennerster (2008). In the schools, the cameras and timekeeping worked as intended, and teacher attendance increased. In the clinics, there was a short-run effect on nurse attendance, but it was quickly eliminated. (The ability of agents eventually to undermine policies that are initially effective is common enough and not easily handled within an RCT.) In both trials, there were incentives to improve attendance, and there were incentives to find a way to sabotage the monitoring and restore workers to their accustomed positions; the force of these incentives is a “high-level” cause, like gravity, or the principle of the lever, that works in much the same way everywhere. For the clinics, some sabotage was direct (the smashing of cameras) and some was subtler, when government supervisors provided official, though essentially specious, reasons for missing work. We can only conjecture why the causality was switched in the move from NGO to government; we suspect that working for a highly-respected local NGO is a different contract from working for the government, where not showing up for work is widely (if informally) understood to be part of the deal. The incentive lever works when it is wired up right, as with the NGOs, but not when the wiring cuts it out, as with the government. Knowing “what works” in the sense of the treatment effect on the trial population is of limited value without understanding the political and institutional environment in which it is set. This underlines the need to understand the underlying social, economic, and cultural structures, including the incentives and agency problems that inhibit service delivery, that are required to support the causal pathways that we should like to see at work.
Trials in economic development are susceptible to the critique that they take place in artificial environments. Drèze (2016) notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the “treators,” often from abroad, rather than because of the people who will be called to work it in reality.
There is also much to be learned from many years of economic trials in the United States, particularly from the work of the Manpower Demonstration Research Corporation (now known by its initials MDRC), from the early income tax trials, as well as from the RAND Health Experiment. Following the income tax trials, MDRC has run many randomized trials since the 1970s, mostly for the Federal government but also for individual states and for Canada; see the thorough and informative account by Gueron and Rolston (2013) for the factual information underlying the following discussion. MDRC’s program, like that of J-PAL in development, is intended to find out “what works” in the state and federal welfare programs. These programs are conditional cash transfers in which poor recipients are given cash provided they satisfy certain conditions which are often the subject of the trial. Should there be work requirements? Should there be remedial education before work requirements? What are the benefits and costs of various alternatives, both to the recipients and to the local and federal taxpayers? All of these programs are deeply politicized, with sharply different views over both facts and desirability. Many engaged in these disputes feel certain of what should be done and what its consequences will be so that, by their lights, control groups are unethical because they deprive some people of what the advocates “know” will be certain benefits. Given this, it is perhaps surprising that RCTs have become the accepted norm for this kind of policy evaluation in the US.
The reasons owe much to political institutions, as well as to the common faith that RCTs can reveal the truth. At the Federal level, prospective policies are vetted by the non-partisan Congressional Budget Office, which makes its own estimates of the budgetary implications of the program. Ideologues whose programs score poorly by the CBO have an incentive to support an RCT, not to convince themselves, but to convince their opponents; once again, RCTs are especially valuable when your opponents do not share your prior. And control groups are easier to put in place when there are insufficient funds to cover the whole population. There was also a widespread and largely uncritical belief that RCTs always give the right answer, at least for the budgetary implications, which, rather than the wellbeing of the recipients, were often the primary (and indeed sometimes the only) concern; note that all of these trials are on poor people by rich people who are typically more concerned with cost than with the wellbeing of the poor, Greenberg, Shroder and Onstott (1999). MDRC’s trials could therefore be effective dispute reconciliation mechanisms both for those who saw the need for evidence and for those who did not (except instrumentally). Note that the outcome here fits with our “public health” case; what the politicians need to know is not the outcomes for individuals, or even how the outcomes in one state might transport to another, but the average budgetary cost in a specific place for each poor person treated, something that a good RCT conducted on a representative sample of the target population is equipped to deliver, at least in the absence of general equilibrium effects, timing effects, etc.
These RCTs by MDRC and other contractors deserve much credit. They have demonstrated both the feasibility of large-scale social trials, including the possibility of randomization in these settings (where many participants were hostile to the idea), and their usefulness to policy makers. They also seem to have changed beliefs, for example in favor of the desirability of work requirements as a condition of welfare, even among many of those who were originally opposed. There are also limitations; the trials appear to have had at best a limited influence on scientific thinking about behavior in labor markets. The results of similar programs have often been different across different sites, and there has to date been no firm understanding of why; indeed, the trials are not designed to reveal this, Moffitt (2004). Finally, and perhaps crucially for the potential contribution to economic science, there has been little success in understanding either the underlying structures or chains of causation, in spite of a determined effort from the very beginning to peer into the black boxes. Without such mechanisms, transportability is always in doubt, it is impossible for policy makers or academics to purposively improve the policies, and the contributions to cumulative science are severely limited.
The RAND health experiment, Manning et al (1988a, b), provides a different but equally instructive story, if only because its results have permeated the academic and policy discussions about healthcare ever since. It was originally designed to test the question of whether more generous insurance would cause people to use more medical care and, if so, by how much. The incentive effects are hardly in doubt today; the immortality of the study comes rather from the fact that its multi-arm (response surface) design allowed the calculation of an elasticity for the study population of –0.1 to –0.2; that is, medical expenditures decreased by 0.1 to 0.2 percent for every one percent increase in the copayment. According to Aron-Dine, Einav, and Finkelstein (2013), it is this dimensionless and thus apparently transportable number that has been used ever since to discuss the design of healthcare policy; the elasticity has come to be treated as a universal constant. Ironically, they argue that the estimate cannot be replicated in recent studies, and it is even unclear that it is firmly based on the original evidence. This account points, once again, to the central importance of transportability for the long-term usefulness of a trial. Here, the simple direct transportability of the result seems to have been largely illusory though, as we have argued, this does not mean that more complex constructions based on the results of the trial would not have done better.
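For readers unfamiliar with how such a number is formed, the sketch below computes a midpoint (arc) elasticity from two insurance plans. The midpoint formula is one common convention and is used here as an assumption; the spending figures and copayment shares are hypothetical, chosen only so that the result falls inside the quoted range. Nothing in the calculation indicates whether the same number would hold in another population, which is the transportability question.

```python
def arc_elasticity(q1: float, q2: float, p1: float, p2: float) -> float:
    """Midpoint (arc) elasticity of quantity q with respect to price p."""
    pct_change_q = (q2 - q1) / ((q1 + q2) / 2)
    pct_change_p = (p2 - p1) / ((p1 + p2) / 2)
    return pct_change_q / pct_change_p


# Hypothetical plans: annual spending per person at two copayment shares.
spending_low_copay, low_copay = 1000.0, 0.25
spending_high_copay, high_copay = 800.0, 0.95

elasticity = arc_elasticity(spending_low_copay, spending_high_copay, low_copay, high_copay)
print(f"arc elasticity of spending with respect to copayment: {elasticity:.2f}")
```

The result is a pure number, which is what makes it tempting to carry from one setting to another; the calculation itself supplies no warrant for doing so.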
Conclusions
RCTs are the ultimate in credible estimation of average treatment effects in the population being studied because they make so few assumptions about heterogeneity, causal structure, choice of variables, and functional form. They are truly nonparametric. And indeed, this is sometimes just what we want, particularly where we have little credible prior information. RCTs are often convenient ways to introduce experimenter-controlled variance (if you want to see what happens, then kick it and see, twist the lion’s tail), but note that many experiments, including many of the most important (and Nobel Prize winning) experiments in economics, do not and did not use randomization, Harrison (2013), Svorencik (2015). But the credibility of the results, even internally, can be undermined by excessive heterogeneity in responses, and especially when the distribution of effects is asymmetric, where inference on means can be hazardous. Ironically, the price of the credibility in RCTs is that all we get are means. Yet, in the presence of outliers, means themselves do not provide the basis for reliable inference. And randomization in and of itself does nothing unless the details are right; purposive selection into the experimental population, like purposive selection into and out of assignment, undermines inference in just the same way as does selection in observational studies. Lack of blinding, whether of participants, trialists, data collectors, or analysts, undermines inference by permitting factors other than the treatment to affect the outcome, akin to a failure of exclusion restrictions in instrumental variable analysis.
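A minimal numerical sketch of the outlier problem follows. The data are invented and do not come from any trial discussed here, and the use of Welch's unequal-variance t-statistic is an assumption about how the two means would be compared. The two arms are identical apart from a single treated unit with a very large outcome.

```python
import math
import statistics


def welch_t(treated: list[float], control: list[float]) -> tuple[float, float]:
    """Return (difference in means, Welch t-statistic) for two samples."""
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return diff, diff / se


# Hypothetical outcomes: the arms differ only in the last treated unit.
control = [2, 3, 2, 3, 2, 3, 2, 3, 2, 3]
treated = [2, 3, 2, 3, 2, 3, 2, 3, 2, 1000]

diff, t_stat = welch_t(treated, control)
print(f"estimated ATE = {diff:.1f}, t = {t_stat:.2f}")

# Dropping the single extreme unit moves the estimate from large to slightly negative.
diff_trimmed, _ = welch_t(treated[:-1], control)
print(f"estimated ATE without the outlier = {diff_trimmed:.2f}")
```

Here the estimated ATE is close to 100 while the t-statistic is about 1, and the whole estimate rests on one observation; whether that unit reflects a real treatment response or noise cannot be settled by the comparison of means alone.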
The lack of structure can become seriously disabling when we try to use RCT results outside of a few contexts, such as program evaluation, hypothesis testing, or establishing proof of concept. Beyond that, we are in trouble. We cannot use the results to help make predictions elsewhere without more structure, without more prior information, and without having some idea of what makes treatment effects vary from place to place, or time to time. There is no option but to commit to some causal structure if we are to know how to use RCT evidence elsewhere, or to use the estimates out of the original context. Simple generalization and simple extrapolation just do not cut the mustard. This is true of any study, experimental or observational. But observational studies are familiar with, and routinely work with, the sort of assumptions that RCTs claim to avoid, so that if the aim is to use empirical evidence, any credibility advantage that RCTs have in estimation is no longer operative.
Yet once that commitment has been made, RCT evidence can be extremely useful, pinning down part of a structure, helping to build stronger understanding and knowledge, and helping to assess welfare consequences. As our examples show, this can often be done without committing to the full complexity of what are often thought of as structural models. Yet without the structure that allows us to place RCT results in context, or to understand the mechanisms behind those results, not only can we not transport whether “it works” elsewhere, but we cannot do the standard stuff of economics, which is to say whether or not the intervention is actually welfare improving; see Harrison (2014) for a vivid account that sharply identifies this and other issues. Without knowing why things happen and why people do things, we run the risk of worthless casual (“fairy story”) causal theorizing and have essentially given up on one of the central tasks of economics.
We must back away from the refusal to theorize, from the exultation in our ability to handle unlimited heterogeneity, and actually SAY something. Perhaps paradoxically, unless we are prepared to make assumptions, and to say what we know, making statements that will be incredible to some, all the credibility of the RCT is for naught.
In the specific context of development that has concerned us here, RCTs have proven their worth in providing proofs of concept and at testing predictions that some policies must always work or can never work. But, as elsewhere in economics, we cannot find out why something works by simply demonstrating that it does work, no matter how often, which leaves us uninformed as to whether the policy should be implemented. Beyond that, small-scale demonstration RCTs are not capable of telling us what would happen if these policies were implemented to scale, of capturing unintended consequences that typically cannot be included in the protocols, or of modeling what will happen if schemes are implemented by governments, whose motives and operating principles are different from the NGOs who typically run trials. While it is true that abstract knowledge is always likely to be beneficial to economic development, successful development depends on institutions and on politics, matters on which RCTs have little to say. In the end, RCTs are one of the many external technical fixes that have meandered off and on the development stage since the Second World War, including building infrastructure, getting prices right, and service delivery, none of which have faced up to the essential domestic political foundations for development.
Citations
Ahuja, Amrita, Sarah Baird, Joan Hamory Hicks, Michael Kremer, Edward Miguel, and Shawn Powers, 2015, “When should governments subsidize health? The case of mass deworming,” World Bank Economic Review, 29, S9–S24.
Aigner, Dennis J., 1985, “The residential electricity time-of-use pricing experiments. What have we learned?” in David A. Wise and Jerry A. Hausman, Social experimentation, Chicago, IL. Chicago University Press for National Bureau of Economic Research, 11–54.
Aiken, Alexander M., Calum Davey, James R. Hargreaves and Richard J. Hayes, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a pure replication,” International Journal of Epidemiology, 0(0), 1–9.
Al-Ubaydli, Omar, and John A. List, 2013, “On the generalizability of experimental results in economics,” in G. Frechette and A. Schotter, Methods of modern experimental economics, Oxford University Press.
Altman, Douglas G., 1985, “Comparability of randomized groups,” Journal of the Royal Statistical Society, Series D (The Statistician), 34(1), Statistics in health, 125–36.
Angrist, Joshua D., 2004, “Treatment effect heterogeneity in theory and practice,” Economic Journal, 114, C52–C83.
Angrist, Joshua D., Eric Bettinger, Erik Bloom, Elizabeth King and Michael Kremer, 2002, “Vouchers for private schooling in Colombia: evidence from a randomized natural experiment,” American Economic Review, 92(5), 1535–58.
Angrist, Joshua D., and Jörn-Steffen Pischke, 2010, “The credibility revolution in empirical economics: how better research design is taking the con out of econometrics,” Journal of Economic Perspectives, 24(2), 3–30.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein, 2013, “The RAND health insurance experiment, three decades later,” Journal of Economic Perspectives, 27(1), 197–222.
Arrow, Kenneth J., 1975, “Two notes on inferring long run behavior from social experiments,” Document No. P-5546, Santa Monica, CA. Rand Corporation.
Ashenfelter, Orley, 1978, “Estimating the effect of training programs on earnings,” Review of Economics and Statistics, 60(1), 47–57.
Ashenfelter, Orley, 1978, “The labor supply response of wage earners,” in John L. Palmer and Joseph A. Pechman, eds., Welfare in rural areas: the North Carolina–Iowa Income Maintenance Experiment, Washington, DC. The Brookings Institution. 109–38.
Attanasio, Orazio, Costas Meghir, and Ana Santiago, 2012, “Education choices in Mexico: using a structural model and a randomized experiment to evaluate PROGRESA,” Review of Economic Studies, 79(1), 37–66.
Attanasio, Orazio, Sarah Cattan, Emla Fitzsimons, Costas Meghir, and Marta Rubio Codina, 2015, “Estimating the production function for human capital: results from a randomized controlled trial in Colombia,” London. Institute for Fiscal Studies, Working Paper no W15/06.
Bahadur, R. R., and Leonard J. Savage, 1956, “The non-existence of certain statistical procedures in nonparametric problems,” Annals of Mathematical Statistics, 27(4), 1115–22.
Banerjee, Abhijit, Sylvain Chassang, Sergio Montero, and Erik Snowberg, 2016, “A theory of experimenters,” processed, July 2016.
Banerjee, Abhijit, Sylvain Chassang, and Erik Snowberg, 2016, “Decision theoretic approaches to experiment design and external validity,” Cambridge, MA. NBER Working Paper no 22167, April.
Banerjee, Abhijit, Angus Deaton, and Esther Duflo, 2004, “Health care delivery in rural Rajasthan,” Economic and Political Weekly, 39(9), 944–9.
Banerjee, Abhijit, and Esther Duflo, 2012, Poor economics: a radical rethinking of the way to fight global poverty, Public Affairs.
Banerjee, Abhijit, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christopher Udry, 2015, “A multifaceted program causes lasting progress for the very poor: evidence from six countries,” Science, 348(6236), 1260799.
Banerjee, Abhijit, Esther Duflo, and Rachel Glennerster, 2008, “Putting a band-aid on a corpse: incentives for nurses in the Indian public health care system,” Journal of the European Economic Association, 6(2–3), 487–500.
Banerjee, Abhijit V., and Ruimin He, 2003, “The World Bank of the future,” American Economic Review, 93(2), 39–44.
Bauchet, Jonathan, Jonathan Morduch and Shamika Ravi, 2015, “Failure vs displacement: why an innovative anti-poverty program showed no net impact in South India,” Journal of Development Economics, 116, 1–16.
Basu, Kaushik, 2010, “The economics of foodgrain management in India,” Ministry of Finance, Delhi. http://finmin.nic.in/workingpaper/Foodgrain.pdf
Bloom, Howard S., Carolyn J. Hill, and James A. Riccio, 2005, “Modeling cross-site experimental differences to find out why program effectiveness varies,” in Howard S. Bloom, ed., Learning more from social experiments: evolving analytical approaches, New York, NY. Russell Sage.
Bobonis, Gustavo, Edward Miguel, and Charu Puri-Sharma, 2006, “Anemia and school participation,” Journal of Human Resources, 41(4), 692–721.
Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a and Justin Sandefur, 2013, “Scaling up what works: experimental evidence on external validity in Kenyan education,” Washington, DC. Center for Global Development, Working Paper 321.
Bothwell, Laura E., and Scott H. Podolsky, 2016, “The emergence of the randomized, controlled trial,” New England Journal of Medicine, 375(6), 501–4. doi:10.1056/NEJMp1604635
Campbell, D. T., and J. C. Stanley, 1963, Experimental and quasi-experimental designs for research. Chicago. Rand McNally.
Cartwright, Nancy, 1994, Nature’s capacities and their measurement. Oxford. Clarendon Press.
Cartwright, Nancy, and Jeremy Hardie, 2012, Evidence based policy: a practical guide to doing it better, Oxford. Oxford University Press.
Chalmers, Iain, 2001, “Comparing like with like: some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments,” International Journal of Epidemiology, 30, 1156–64.
Chalmers, Iain, 2003, “Fisher and Bradford Hill: theory and pragmatism?” International Journal of Epidemiology, 32, 922–24.
Chassang, Sylvain, Gerard Padró I Miguel, and Erik Snowberg, 2012, “Selective trials: a principal–agent approach to randomized controlled experiments,” American Economic Review, 102(4), 1279–1309.
Chassang, Sylvain, Erik Snowberg, Ben Seymour, and Cayley Bowles, 2015, “Accounting for behavior in treatment effects: new applications for blind trials,” PLoS One, 10(6), e0127227. doi:10.1371/journal.pone.0127227.
Chaudhury, Nazmul, Jeffrey Hammer, Michael Kremer, Karthik Muralidharan and F. Halsey Rogers, 2005, “Missing in action: teacher and health worker absence in developing countries,” Journal of Economic Perspectives, 19(4), 91–116.
Chyn, Eric, 2016, “Moved to opportunity: the long-run effect of public housing demolition on labor market outcomes of children,” University of Michigan. http://wwwpersonal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf
Conlisk, John, 1973, “Choice of response functional form in designing subsidy experiments,” Econometrica, 41(4), 643–56.
Crépon, Bruno, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora, 2014, “Do labor market policies have displacement effects? evidence from a clustered randomized experiment,” Quarterly Journal of Economics, 128(2), 531–80.
Croke, Kevin, Joan Hamory Hicks, Eric Hsu, Michael Kremer, and Edward Miguel, 2016, “Does mass deworming affect children’s nutrition? Meta-analysis, cost effectiveness, and statistical power,” Cambridge, MA. NBER Working Paper No. 22382 (July).
Cronbach, Lee J., S. R. Ambron, S. M. Dornbusch, R. D. Hess, R. C. Hornick, D. C. Phillips, D. F. Walker, and S. S. Weiner, 1980, Towards reform of program evaluation, San Francisco, Jossey-Bass.
Das, Jishnu and Jeffrey Hammer, 2005, “Which doctor? Combining vignettes and item response to measure clinical competence,” Journal of Development Economics, 78, 348–83.
Davey, Calum, Alexander M. Aiken, Richard J. Hayes, and James R. Hargreaves, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a statistical replication of a cluster quasi-randomized stepped wedge trial,” International Journal of Epidemiology, 0(0), 1–12.
Deaton, Angus, and John Muellbauer, 1980, Economics and consumer behavior, New York. Cambridge University Press.
Dhaliwal, Iqbal, Esther Duflo, Rachel Glennerster, and Caitlin Tulloch, 2012, “Comparative cost-effectiveness analysis to inform policy in developing countries: a general framework with applications for education,” J-PAL, MIT, December 3rd. http://www.povertyactionlab.org/publication/cost-effectiveness
Drèze, Jean, 2016, Personal email communication.
Duflo, Esther, Rema Hanna, and Stephen P. Ryan, 2012, “Incentives work: getting teachers to come to school,” American Economic Review, 102(4), 1241–78.
Duflo, Esther, and Michael Kremer, 2008, “Use of randomization in the evaluation of development effectiveness,” in William Easterly, ed., Reinventing foreign aid. Washington, DC. Brookings, 93–120.
Dynarski, Susan, 2015, “Helping the poor in education: the power of a simple nudge,” New York Times, Jan 17, 2015.
Fine, Paul E. M., and Jacqueline A. Clarkson, 1986, “Individual versus public priorities in the determination of optimal vaccination policies,” American Journal of Epidemiology, 124(6), 1012–20.
Fisher, Ronald A., 1926, “The arrangement of field experiments,” Journal of the Ministry of Agriculture of Great Britain, 33, 503–13.
Filmer, Deon, Jeffrey Hammer, and Lant Pritchett, 2000, “Weak links in the chain: a diagnosis of health policy in poor countries,” World Bank Research Observer, 15(2), 199–204.
Freedman, David A., 2006, “Statistical models for causation: what inferential leverage do they provide?” Evaluation Review, 30, 691–713.
Freedman, David A., 2008, “On regression adjustments to experimental data,” Advances in Applied Mathematics, 40, 180–93.
Garfinkel, Irwin, and Charles F. Manski, 1992, “Introduction,” in Irwin Garfinkel and Charles F. Manski, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 1–22.
Gertler, Paul J., Sebastian Martinez, Patrick Premand, Laura B. Rawlings, and Christel M. J. Vermeersch, Impact evaluation in practice, Washington, DC. The World Bank.
Glewwe, Paul, Michael Kremer, Sylvie Moulin, and Eric Zitzewitz, 2004, “Retrospective vs. prospective analyses of school inputs: the case of flip-charts in Kenya,” Journal of Development Economics, 74, 251–68.
Greenberg, David and Mark Shroder, 2004, The digest of social experiments (3rd ed.), Washington, DC. Urban Institute Press.
Greenberg, David, Mark Shroder, and Matthew Onstott, 1999, “The social experiment market,” Journal of Economic Perspectives, 13(3), 157–72.
Gueron, Judith M., and Howard Rolston, 2013, Fighting for reliable evidence, New York, Russell Sage.
Guyatt, Gordon, David L. Sackett and Deborah J. Cook for the Evidence-Based Medicine Working Group, 1994, “Users’ guides to the medical literature II: how to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients?” Journal of the American Medical Association, 271(1), 59–63.
Hargreaves, James R., Alexander M. Aiken, Calum Davey, and Richard J. Hayes, 2015, “Authors’ response to: deworming externalities and school impacts in Kenya,” International Journal of Epidemiology, 0(0), 1–3.
Harrison, Glenn W., 2013, “Field experiments and methodological intolerance,” Journal of Economic Methodology, 20(2), 103–17.
Harrison, Glenn W., 2014, “Impact evaluation and welfare evaluation,” European Journal of Development Research, 26, 39–45.
Hausman, Jerry A., and David A. Wise, 1985, “Technical problems in social experimentation: cost versus ease of analysis,” in Jerry A. Hausman and David A. Wise, eds., Social Experimentation, Chicago, IL. Chicago University Press. 187–220.
Heckman, James J., 1992, “Randomization and social policy evaluation,” in Charles F. Manski and Irwin Garfinkel, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 547–70.
Heckman, James J., 1997, “Instrumental variables: a study of implicit behavioral assumptions used in making program evaluations,” Journal of Human Resources, 32(3), 441–62.
Heckman, James J., Neil Hohman, and Jeffrey Smith, with the assistance of Michael Khoo, 2000, “Substitution and dropout bias in social experiments: a study of an influential social experiment,” Quarterly Journal of Economics, 115(2), 651–94.
Heckman, James J., Robert J. Lalonde, and Jeffrey A. Smith, 1999, “The economics and econometrics of active labor markets,” Chapter 31 in Ashenfelter, Orley and David Card, eds. Handbook of labor economics, Amsterdam. North-Holland, 3(A), 1866–2097.
Heckman, James J., Rodrigo Pinto, and Peter Savelyev, 2013, “Understanding the mechanisms through which an influential early childhood program boosted adult outcomes,” American Economic Review, 103(6), 2052–86.
Heckman, James J., Jeffrey Smith, and Nancy Clements, 1997, “Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts,” Review of Economic Studies, 64(4), 487–535.
Heckman, James J., and Edward Vytlacil, 2005, “Structural equations, treatment effects, and econometric policy evaluation,” Econometrica, 73(3), 669–738.
Heckman, James J. and Edward J. Vytlacil, 2007, “Econometric evaluation of social programs, Part 1: causal models, structural models, and econometric policy evaluation,” Chapter 70 in James J. Heckman and Edward E. Leamer, eds., Handbook of Econometrics, 6B, 4779–874.
Hicks, Joan Hamory, Michael Kremer, and Edward Miguel, 2015, “Commentary: deworming externalities and schooling impacts in Kenya: a comment on Aiken et al (2015) and Davey et al (2015),” International Journal of Epidemiology, 0(0), 1–4.
Horton, Richard, 2000, “Common sense and figures: the rhetoric of validity in medicine: Bradford Hill memorial lecture 1999,” Statistics in Medicine, 19, 3149–64.
Hotz, V. Joseph, Guido W. Imbens and Julie H. Mortimer, 2005, “Predicting the efficacy of future training programs using past experience at other locations,” Journal of Econometrics, 125, 241–70.
Hsieh, Chang-tai and Miguel Urquiola, 2006, “The effects of generalized school choice on achievement and stratification: evidence from Chile’s voucher program,” Journal of Public Economics, 90, 1477–1503.
Humphreys, Macartan, 2015, “What has been learned from the deworming replications: a nonpartisan view,” Columbia University, Aug. http://www.columbia.edu/~mh2245/w/worms.html
Imbens, Guido W., 2004, “Nonparametric estimation of average treatment effects under exogeneity: a review,” Review of Economics and Statistics, 86(1), 4–29.
Imbens, Guido W., 2010, “Better LATE than nothing: some comments on Deaton (2009) and Heckman and Urzua,” Journal of Economic Literature, 48(2), 399–423.
Imbens, Guido W. and Joshua D. Angrist, 1994, “Identification and estimation of local average treatment effects,” Econometrica, 62(2), 467–75.
Imbens, Guido W., and Jeffrey M. Wooldridge, 2009, “Recent developments in the econometrics of program evaluation,” Journal of Economic Literature, 47(1), 5–86.
International Committee of Medical Journal Editors, 2015, Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals, http://www.icmje.org/icmje-recommendations.pdf (accessed, August 20, 2016).
Kahneman, Daniel and Gary Klein, 2009, “Conditions for intuitive expertise: a failure to disagree,” American Psychologist, 64(6), 515–26.
Karlan, Dean and Jacob Appel, 2011, More than good intentions: how a new economics is helping to solve global poverty, Dutton.
Karlan, Dean, Nathanael Goldberg and James Copestake, 2009, “Randomized controlled trials are the best way to measure impact of microfinance programs and improve microfinance product designs,” Enterprise Development and Microfinance, 20(3), 167–76.
Kasy, Maximilian, 2016, “Why experimenters might not want to randomize, and what they could do instead,” Political Analysis, 1–15. doi:10.1093/pan/mpw012
Kendall, Maurice G., 1959, “Hiawatha designs an experiment,” American Statistician, 13(5), 23–4.
Kramer, Peter, 2016, Ordinarily well: the case for antidepressants, Farrar, Straus, and Giroux.
Kremer, Michael, and Alaka Holla, 2009, “Improving education in the developing world: what have we learned from randomized evaluations?” Annual Review of Economics, 1, 513–42.
Lehmann, Erich L., and Joseph P. Romano, 2005, Testing statistical hypotheses (third edition), New York. Springer.
Levy, Santiago, 2006, Progress against poverty: sustaining Mexico’s Progresa-Oportunidades program, Washington, DC. Brookings.
Mackie, John L., 1974, The cement of the universe: a study of causation, Oxford. Oxford University Press.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler and Arleen Leibowitz, 1988a, “Health insurance and the demand for medical care: evidence from a randomized experiment,” American Economic Review, 77(3), 251–77.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler, Bernadette Benjamin, Arleen Leibowitz, M. Susan Marquis, and Jack Zwanziger, 1988b, Health insurance and the demand for medical care: evidence from a randomized experiment, Santa Monica, CA. RAND.
Manski, Charles F., 1990, “Nonparametric bounds on treatment effects,” American Economic Review, 80(2), 319–23.
Manski, Charles F., 1995, Identification problems in the social sciences, Cambridge, MA. Harvard University Press.
Manski, Charles F., 2003, Partial identification of probability distributions, New York. Springer.
Manski, Charles F., 2013, Public policy in an uncertain world: analysis and decisions, Cambridge, MA. Harvard University Press.
Metcalfe, Charles E., 1973, “Making inferences from controlled income maintenance experiments,” American Economic Review, 63(3), 478–83.
Miguel, Edward, and Michael Kremer, 2004, “Worms: identifying impacts on education and health in the presence of treatment externalities,” Econometrica, 72(1), 159–217.
Miguel, Edward, Michael Kremer, and Joan Hamory Hicks, 2015, “Comment on Macartan Humphreys’ and other recent discussions of the Miguel and Kremer (2004) study,” Berkeley, Dec. http://emiguel.econ.berkeley.edu/assets/miguel_research/63/Worms-Comment_2015-1221.pdf
Moffitt, Robert, 1979, “The labor supply response in the Gary experiment,” Journal of Human Resources, 14(4), 477–87.
Moffitt, Robert, 1992, “Evaluation methods for program entry effects,” Chapter 6 in Charles Manski and Irwin Garfinkel, Evaluating welfare and training programs, Cambridge, MA. Harvard University Press, 231–52.
Moffitt, Robert, 2004, “The role of randomized field trials in social science research: a perspective from evaluations of reforms of social welfare programs,” American Behavioral Scientist, 47(5), 506–40.
Morgan, Kari Lock, and Donald B. Rubin, 2012, “Rerandomization to improve covariate balance in experiments,” Annals of Statistics, 40(2), 1263–82.
Muller, Seán M., 2015, “Causal interaction and external validity: obstacles to the policy relevance of randomized evaluations,” World Bank Economic Review, 29, S217–S225.
Orcutt, Guy H., and Alice G. Orcutt, 1968, “Incentive and disincentive experimentation for income maintenance policy purposes,” American Economic Review, 58(4), 754–72.
Pearl, Judea, 2009, Causality: models, reasoning, and inference, 2nd edition, Cambridge. Cambridge University Press.
Pettigrew, Mark, and Iain Chalmers, 2011, “Use of research evidence in practice,” Lancet,
378(9804), 1696.
Rodrik, Dani, 2006, personal email communication.
Rosenzweig, Mark, and Christopher Udry, 2016, “External validity in a stochastic world,” Cambridge, MA. NBER Working Paper 22449 (July).
Rothwell, Peter M., 2005, “External validity of randomized controlled trials: ‘to whom do the
results of the trial apply’,” Lancet, 365, 82–93.
Russell, Bertrand, 2008 [1912], The problems of philosophy, Rockville, MD. Arc Manor.
Sackett, David L., William M. C. Rosenberg, J. A. Muir Gray, R. Brian Haynes, and W. Scott Richardson, 1996, “Evidence based medicine: what it is and what it isn’t,” British Medical Journal,
312 (January 13), 71–2.
Scriven, Michael, 1974, “Evaluation perspectives and procedures,” in W. James Popham, ed.,
Evaluation in education: current applications, Berkeley, CA. McCutchan Publishing Corporation.
Sen, Amartya K., 2011, The idea of justice, Cambridge, MA. Harvard University Press.
Senn, Stephen, 1994, “Testing for baseline balance in clinical trials,” Statistics in Medicine, 13,
1715–26.
Senn, Stephen, 2013, “Seven myths of randomization in clinical trials,” Statistics in Medicine, 32,
1439–50.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell, 2002, Experimental and quasi-experimental designs for generalized causal inference, Boston, MA. Houghton Mifflin.
Simpson, Adrian, 2016, “Comparing and combining standardized effect sizes: the misdirection of
public policy,” Working Paper, University of Durham (July).
Singer, Burton H., and Steve Pincus, 1998, “Irregular arrays and randomization,” Proceedings of
the National Academy of Sciences of the USA, 95, 1363–8.
Stiles, Charles Wardell, 1939, “Early history, in part esoteric, of the hookworm (uncinariasis)
campaign in our southern United States,” Journal of Parasitology, 25(4), 283–308.
Stuart, Elizabeth A., Stephen R. Cole, Catharine P. Bradshaw, and Philip J. Leaf, 2011, “The
use of propensity scores to assess the generalizability of results from randomized trials,”
Journal of the Royal Statistical Society A, 174(2), 369–86.
Svorencik, Andrej, 2015, The experimental turn in economics: a history of experimental economics, Utrecht School of Economics, Dissertation Series #29,
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026
Taylor-Robinson, David C., Nicola Maayan, Karla Soares-Weiser, Sarah Donegan, and Paul Garner, 2015, “Deworming drugs for soil-transmitted intestinal worms in children: effects on nutritional indicators, haemoglobin, and school performance (review),” The Cochrane Collaboration. Wiley.
http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD000371.pub6/abstract
Todd, Petra E., and Kenneth J. Wolpin, 2006, “Assessing the impact of a school subsidy program
in Mexico: using a social experiment to validate a dynamic behavioral model of child schooling and fertility,” American Economic Review, 96(5), 1384–1417.
Todd, Petra E., and Kenneth J. Wolpin, 2008, “Ex ante evaluation of social programs,” Annales
d’Economie et de la Statistique, 91/92, 263–91.
U.S. Department of Education, Institute of Education Sciences, National Center for Education
Evaluation and Regional Assistance, 2003, Identifying and implementing educational practices supported by rigorous evidence: a user friendly guide, Washington, DC. Institute of Education Sciences.
Vandenbroucke, Jan P., 2004, “When are observational studies as credible as randomized controlled trials?” The Lancet, 363, 1728–31.
Vivalt, Eva, 2015, “How much can we generalize from impact evaluations?” NYU, unpublished.
http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf
White, Halbert, 1980, “A heteroskedasticity-consistent covariance matrix estimator and a direct
test for heteroskedasticity,” Econometrica, 48(4), 817–38.
Wise, David A., 1985, “A behavioral model versus experimentation: the effects of housing subsidies on rent,” in P. Brucker and R. Pauly, eds., Methods of Operations Research, 50, Verlag
Anton Hain, 441–89.
Worrall, John, 2002, “What evidence in evidence-based medicine?” Philosophy of Science, 69,
S316–S330.
Worrall, John, 2007, “Evidence in medicine and evidence-based medicine,” Philosophy Compass,
2/6, 981–1022.
Young, Alwyn, 2016, “Channeling Fisher: randomization tests and the statistical insignificance of
seemingly significant experimental results,” London School of Economics, Working Paper,
Feb.
Ziliak, Stephen T., 2014, “Balanced versus randomized field experiments in economics: why W. S.
Gosset aka ‘Student’ matters,” Review of Behavioral Economics, 1, 167–208.