Understanding and misunderstanding randomized controlled trials

Angus Deaton and Nancy Cartwright

Princeton University
Durham University and UC San Diego

This version, August 2016

We acknowledge helpful discussions with many people over the many years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Bo Honoré, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1 of the paper. We have benefited from generous comments on an earlier version by Tim Besley, Chris Blattman, Sylvain Chassang, Steven Durlauf, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jim Heckman, Jeff Hammer, Macartan Humphreys, Helen Milner, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U). Deaton acknowledges financial support through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01 AG05842-14 and through Princeton University’s Roybal Center, Grant P30 AG024928.

ABSTRACT

RCTs are valuable tools whose use is spreading in economics and in other social sciences. They are seen as desirable aids in scientific discovery and for generating evidence for policy. Yet some of the enthusiasm for RCTs appears to be based on misunderstandings: that randomization provides a fair test by equalizing everything but the treatment and so allows a precise estimate of the treatment alone; that randomization is required to solve selection problems; that lack of blinding does little to compromise inference; and that statistical inference in RCTs is straightforward, because it requires only the comparison of two means.
None of these statements is true. RCTs do indeed require minimal assumptions and can operate with little prior knowledge, an advantage when persuading distrustful audiences, but a crucial disadvantage for cumulative scientific progress, where randomization adds noise and undermines precision. The lack of connection between RCTs and other scientific knowledge makes it hard to use them outside of the exact context in which they are conducted. Yet, once they are seen as part of a cumulative program, they can play a role in building general knowledge and useful predictions, provided they are combined with other methods, including conceptual and theoretical development, to discover not “what works,” but why things work. Unless we are prepared to make assumptions, and to stand on what we know, making statements that will be incredible to some, all the credibility of RCTs is for naught.

1 Introduction

Randomized trials are currently much used in economics and are widely considered to be a desirable method of empirical analysis and discovery. There is a long history of such trials in the subject. There were four large federally sponsored negative income tax trials in the 1960s and 1970s. In the mid-1970s, there was a famous, and still frequently cited, trial on health insurance, the Rand health experiment. There was then a period during which randomized controlled trials (RCTs) received less attention from academic economists; even so, randomized trials on welfare, social policy, labor markets, and education have continued since the mid-1970s, some with substantial involvement and discussion by academic economists, see Greenberg and Shroder (2004).
Recent randomized trials in economic development have attracted attention, and the idea that such trials can discover “what works” has been widely adopted in economics, as well as in political science, education, and social policy. Among both researchers and the general public, RCTs are perceived to yield causal inferences and parameter estimates that are more credible than those from other empirical methods that do not involve the comparison of randomly selected treatment and control groups. RCTs are seen as largely exempt from many of the econometric problems that characterize observational studies. When RCTs are not feasible, researchers often mimic randomized designs by using observational data to construct two groups that, as far as possible, are identical and differ only in their exposure to treatment.

The preference for randomized trials has spread beyond trialists to the general public and the media, which typically reports favorably on them. They are seen as accurate, objective, and largely independent of “expert” knowledge that is often regarded as manipulable, politically biased, or otherwise suspect. There are now “What Works” centers using and recommending RCTs in a huge range of areas of social concern across Europe and the Anglophone world, such as the US Department of Education’s What Works Clearing House, The Campbell Collaboration (parallel to the Cochrane Collaboration in health), the Scottish Intercollegiate Guidelines Network (SIGN), the US Department of Health and Human Services Child Welfare Information Gateway, the US Social and Behavioral Sciences Team, and others. The British government has established eight new (well-financed) What Works Centers similar to the National Institute for Health and Care Excellence (NICE), with more planned. They extend NICE’s evaluation of health treatment into aging, early intervention, education, crime, local economic growth, Scottish service delivery, poverty, and wellbeing. These centers see randomized controlled trials as their preferred tool. There is a widespread desire for careful evaluation (to support what is sometimes called the “audit society”) and everyone assents to the idea that policy should be based on evidence of effectiveness, for which randomized trials appear to be ideally suited. Trials are
easily, if not very precisely, explained along the lines that random selection generates two otherwise identical groups, one treated and one not; results are easy to compute, since all we need is the comparison of two averages; and, unlike other methods, a trial seems to require no specialized understanding of the subject matter. It seems a truly general tool that (nominally) works in the same way in agriculture, medicine, sociology, economics, politics, and education. It is supposed to require no prior knowledge, whether suspect or not, which is seen as a great advantage.

In this paper, we present two sets of arguments, one on conducting RCTs and on how to interpret the results, and one on how to use the results once we have them. Although we do not care for the terms, for reasons that will become apparent, the two sections correspond roughly to internal and external validity.

Randomized controlled trials are often useful, and have been important sources of empirical evidence for causal claims and evaluation of effectiveness in many fields. Yet many of the popular interpretations, not only among the general public but also among trialists, are incomplete and sometimes misleading, and these misunderstandings can lead to unwarranted trust in the impregnability of results from RCTs, to a lack of understanding of their limitations, and to mistaken claims about how widely their results can be used. All these, in turn, can lead to flawed policy recommendations.

Among the misunderstandings are the following: (a) randomization ensures a fair trial by ensuring that, at least with high probability, treatment and control groups differ only in the treatment; (b) RCTs provide not only unbiased estimates of average treatment effects, but also precise estimates; (c) randomization is necessary to solve the selection problem; (d) lack of blinding, which is common in social science experiments, does not seriously compromise inference; (e) statistical inference in RCTs, which requires only the simple comparison of means, is straightforward, so that standard significance tests are reliable.
While many of the problems of RCTs are shared with observational studies, some are unique, for example the fact that randomizing itself can change outcomes independently of treatment. More generally, it is almost never the case that an RCT can be judged superior to a well-conducted observational study simply by virtue of being an RCT. The idea that all methods have their flaws, but RCTs always have fewest, is one of the deepest and most pernicious misunderstandings.

In the second part of the paper, we discuss the uses and limitations of results from RCTs for making policy. The non-parametric and theory-free nature of RCTs, which is arguably an advantage in estimation, is a serious disadvantage when we try to use the results outside of the context in which they were obtained. Much of the literature, in economic development and elsewhere, perhaps inspired by Campbell and Stanley’s (1963) famous “primacy of internal validity,” assumes that internal validity is enough to guarantee the usefulness of the estimates in different contexts. Without understanding RCTs within the context of the knowledge that we already possess about the world, much of it obtained by other methods, we do not know how to use trial results. But once the commitment has been made to seeing RCTs within this broader structure of knowledge and inference, and when they are designed to fit within it, they can play a useful role in building general knowledge and policy predictions; for example, an RCT can be a good way of estimating a key policy magnitude. The broader context within which RCTs need to be set includes not only models of economic structure, but also the previous experience that policymakers have accumulated about local settings and implementation. Most importantly for economic development, the use of RCT results should be sensitive to what people want, both individually and collectively. RCTs should not become yet another technical fix that is imposed on people by bureaucrats or foreigners; RCT results need to be incorporated into a democratic process of public reasoning, Sen (2011). Greenberg, Shroder, and Onstott (1999) document that, even before the recent wave of RCTs in development, most RCTs in economics have been carried out by rich people on poor people, and the fact should make us especially sensitive to avoid
charges of paternalism.

Section 1: Interpreting the results of RCTs

1.1 Prolog

RCTs were first popularized by Fisher’s agricultural trials in the 1930s and are today often described by the Rubin counterfactual causal model, which itself traces back to Neyman in 1923, see Freedman (2006) for a description of the history. Each unit $i$ (a person, a pupil, a school, an agricultural plot) is assumed to have two possible outcomes, $Y_{i0}$ and $Y_{i1}$, the former occurring if there is no treatment at the time in question, the latter if the unit is treated. The difference between the two outcomes, $Y_{i1} - Y_{i0}$, is the individual treatment effect, which we shall denote $\beta_i$. Treatment effects are typically different for different units. No unit can be both treated and untreated at the same time, so only one or other of the outcomes occurs; the other is counterfactual so that individual treatment effects are in principle unobservable.

We note parenthetically that while we use the counterfactual framework here, we do not endorse it, nor argue against other approaches that do not use it, such as the Cowles commission econometric framework where the causal relations are coded as structural equations, see also Pearl (2009). Imbens and Wooldridge (2009, Introduction) provide an eloquent defense of the Rubin formulation, emphasizing the credibility that comes from a theory-free specification with unlimited heterogeneity in treatment effects. Heckman and Vytlacil (2007, Introduction) make an equally eloquent case against, noting that the treatments in RCTs are often unclearly specified and that the treatment effects are hard to link to invariant parameters that would be useful elsewhere.
The basic theorem governing RCTs is a remarkable one. It states that the average treatment effect is the average outcome in the treatment group minus the average outcome in the control group. While we cannot observe the individual treatment effects, we can observe their mean. The estimate of the average treatment effect (ATE) is simply the difference between the means in the two groups, and it has a standard error that can be estimated and used to make significance statements according to the statistical theory that applies to the difference of two means, on which more below in Section 1.3. The difference in means is an unbiased estimator of the mean treatment effect.

The theorem is remarkable because it requires so few assumptions; no model is required, no assumptions about covariates are needed, the treatment effects can be heterogeneous, and nothing is assumed about the shapes of statistical distributions other than the statistical question of the existence of the mean of the counterfactual outcome values. In terms of one of our running themes, it requires no expert knowledge, and no acceptance of priors, expert or otherwise. The theorem also has its limitations; the proof uses the fact that the difference in two means is the mean of the individual differences, i.e. the treatment effects. This is not true for the median (the difference in two medians is not the median of the differences, which is the median treatment effect). It also does not allow us to estimate any percentile of the distribution of treatment effects, or its variance. (Quantile estimates of treatment effects are not the quantiles of the distribution of treatment effects, but the differences in the quantiles of the two marginal distributions of treatments and controls; the two measures coincide if the experiment has no effect on ranks, an assumption that would be convenient but is hard to justify, at least in general.) All of these statistics can be of interest for policy but RCTs are not informative about them, or at least not without further assumptions, for example on the distribution of treatment effects, see Heckman, Smith, and Clements (1997), and much of the attraction of RCTs is the absence of such assumptions.
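The contrast between means and medians can be checked on a toy example. The sketch below is only an illustration under invented assumptions: six hypothetical units whose potential outcomes are made up, with only one unit responding to treatment. It enumerates every way of assigning three units to treatment and averages the two statistics over all assignments.

```python
from itertools import combinations
from statistics import mean, median

# Hypothetical potential outcomes for six units (invented for illustration).
y0 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # outcome if untreated
y1 = [7.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # outcome if treated: only unit 0 responds
effects = [b - a for a, b in zip(y0, y1)]

true_ate = mean(effects)               # mean of the individual effects
true_median_effect = median(effects)   # median of the individual effects

units = range(len(y0))
diff_means, diff_medians = [], []
# Enumerate every assignment of exactly three units to treatment.
for treated in combinations(units, 3):
    control = [i for i in units if i not in treated]
    t_out = [y1[i] for i in treated]
    c_out = [y0[i] for i in control]
    diff_means.append(mean(t_out) - mean(c_out))
    diff_medians.append(median(t_out) - median(c_out))

# Averages over all possible randomizations.
avg_diff_means = mean(diff_means)
avg_diff_medians = mean(diff_medians)
print(true_ate, avg_diff_means)
print(true_median_effect, avg_diff_medians)
```

In this example the difference in means, averaged over all randomizations, reproduces the ATE, while the averaged difference in medians does not reproduce the median treatment effect. Individual assignments still give differences in means that are far from the ATE, which is the subject of the next subsection.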
The basic theorem tells us that the difference in means is an unbiased estimator of the average treatment effect but says nothing about the variance of this estimator. In general, a biased estimator that is typically closer to the truth will often be better than an unbiased estimator that is typically wide of the truth. There is nothing to say that a non-RCT estimator, in spite of bias, might not have a lower mean squared error (MSE), one measure of the distance of the estimate from the truth, or a lower value of a “loss function” that defines the loss to the experimenter of missing the target.

It is useful to think of the average treatment effect from an RCT in terms of sampling from a finite population, as when the Bureau of the Census estimates average income of the US population in 2013. For the RCT, the population is the population of units whose average treatment effect is of interest; note the importance of defining the population of interest because, given the heterogeneity of treatment effects, the average treatment effect will vary across different populations, just as average incomes differ across different subpopulations of the US. Finite population sampling theory tells us how to get accurate estimates of means from samples; in the RCT case, the sample is the study sample, both treatments and controls. In principle, the study sample could be a random sample of the parent population of interest, in which case it is representative of it, but that is seldom the case. Because the estimate is population specific, it is not (or need not be) thought of as the parameter of a super-population, or otherwise generalizable in any way. Average income in the US in 2013 may be of interest in its own right; but it will not be the same as average income in 2014, nor will it be the same as average income of whites, or of the populations of Wyoming or New York. Exactly the same is true of the estimate of an average treatment effect; it applies to the study sample in which the trial was done, at the time when it was done, and its use outside of those confines, though often possible, requires argument and justification. Without such an argument, we cannot claim that an ATE is “the” mean treatment effect any more than that average income in the US in 2013 is “the” average income of the US in any other year. Of course, knowing average income in 2013
can be useful for making other calculations, such as an estimate of income in 2014, or of a subpopulation that we know is richer or poorer; the fact that an estimate does not universally generalize does not make it useless. We shall return to these issues in Section 2.

1.2 Precision, balance, and randomization

1.2.1 Precision and bias

We should like our estimate of the average treatment effect to be as close to the truth as possible. One way to assess closeness is the mean square error (MSE), defined as

$$\mathrm{MSE} = E(\hat{\theta} - \theta)^2 \qquad (1)$$

where $\theta$ is the true average treatment effect, and $\hat{\theta}$ is its estimate from a particular trial. The expectation is taken over repeated randomizations of treatments and controls using the same study population. It is also standard to rewrite (1) as

$$\mathrm{MSE} = E\big[(\hat{\theta} - E(\hat{\theta})) + (E(\hat{\theta}) - \theta)\big]^2 = \mathrm{var}(\hat{\theta}) + \mathrm{bias}(\hat{\theta}, \theta)^2 \qquad (2)$$

so that mean square error is the sum of the variance of the estimator, which we typically know something about from the estimated standard error, and the square of the bias, which in the case of an (ideal) randomized controlled trial is zero. The elementary, but crucial point is that, while it is certainly good that the bias is zero, that fact does nothing to make the distance from the truth as small as it might be, which is what we really care about. An unbiased estimator that is nearly always wide of the target is not as useful as one that is always near to it, even if, on average, it is off center. More generally, it will often be desirable to trade in some unbiasedness for greater precision. Experiments are often expensive, so we cannot always rely on large samples to bring the estimate close to the truth and resolve these issues for us. Much of this Section is concerned with how to design experiments to maximize precision.
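For readers who want the intermediate step behind (2), the expansion can be written out as follows. The cross term vanishes because $E(\hat{\theta} - E\hat{\theta}) = 0$. The numbers in the trailing comments are hypothetical and serve only to illustrate the trade-off between bias and variance.

```latex
\begin{align*}
\mathrm{MSE}
 &= E\big[(\hat{\theta} - E\hat{\theta}) + (E\hat{\theta} - \theta)\big]^2 \\
 &= E(\hat{\theta} - E\hat{\theta})^2
    + 2\,(E\hat{\theta} - \theta)\,E(\hat{\theta} - E\hat{\theta})
    + (E\hat{\theta} - \theta)^2 \\
 &= \mathrm{var}(\hat{\theta}) + 0 + \mathrm{bias}(\hat{\theta},\theta)^2 .
\end{align*}
% Hypothetical numbers, for illustration only:
%   unbiased but noisy:  bias = 0,   sd = 2    =>  MSE = 0    + 4    = 4
%   biased but tight:    bias = 0.5, sd = 0.5  =>  MSE = 0.25 + 0.25 = 0.5
```

With these invented values the biased estimator has one eighth of the MSE of the unbiased one.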
Unbiasedness alone cannot therefore justify the often-expressed preference for RCTs over other estimators. The minimalist assumptions required for an RCT to be unbiased are an attraction, although this advantage usually comes at the cost of lowered precision, as we shall see in this Section, and of difficulties in knowing how to use the result, as we shall see in Section 2. Yet there is an often expressed belief that RCTs are somehow guaranteed to be precise, simply because they are RCTs. Occasionally bias and precision are explicitly confused; the JPAL website, in its explanation of why it is good to randomize, says that RCTs “are generally considered the most rigorous and, all else equal, produce the most accurate (i.e. unbiased) results.” Shadish, Cook, and Campbell (2002), in what is (rightly) considered one of the bibles of causal inference in social science, state without qualification that “randomized experiments provide a precise answer about whether a treatment worked” (p. 276) and “The randomized experiment is often the preferred method for obtaining a precise and statistically unbiased estimate of the effects of an intervention,” (p. 277) our italics.

Contrast this with Cronbach et al (1980) who quote Kendall’s (1957) pastiche of Longfellow, “Hiawatha designs an experiment,” where Hiawatha’s insistence on unbiasedness leads to his never hitting the target and to his eventual banishment.

1.2.2 Balance and precision in a linear all-cause model

A useful way to think about precision and what an RCT does and does not do is to use a schematic linear causal model of the form:

$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (3)$$

where, as before, $Y_i$ is the outcome for unit $i$, $T_i$ is a dichotomous (1,0) treatment dummy indicating whether or not $i$ is treated, and $\beta_i$ is the individual treatment effect of the treatment on $i$. The $x$’s are the observed or unobserved other causes of the outcome, and we suppose that (3) captures all the causes of $Y_i$.
$J$ may be very large. Because the heterogeneity of the individual treatment effects $\beta_i$ is unrestricted, we allow the possibility that the treatment interacts with the $x$’s or other variables, so that the effects of $T$ can depend on any other variables, and we shall have occasion to make this explicit below. An obvious and important example is when the treatment is effective only in the presence of a particular value of one of the $x$’s.

We do not need $i$ subscripts on the $\gamma$’s that control the effects of the other causes; if their effects differ across individuals, we include the interactions of individual characteristics with the original $x$’s as new $x$’s. Given that the $x$’s can be unobservable, this is not restrictive. Because the $\beta$’s can depend on the $x$’s, the effects of the $x$’s on the outcome can depend on $T_i$, or, equivalently, the effects of treatment can depend on covariates.

In an experiment, with or without randomization, we can represent the treatment group as having $T_i = 1$, and the control group as having $T_i = 0$. So when we subtract the average outcomes among the controls from the average outcomes among the treatments, we will get

$$\bar{Y}^1 - \bar{Y}^0 = \bar{\beta}^1 + \sum_{j=1}^{J} \gamma_j (\bar{x}^1_{ij} - \bar{x}^0_{ij}) = \bar{\beta}^1 + (\bar{S}^1 - \bar{S}^0) \qquad (4)$$

where bars denote averages and the superscripts 1 and 0 refer to the treatment and control groups respectively. The first term on the far right hand side, which is the average treatment effect, is what we want, but the second term or error term, which is the sum of the net average balances of other causes across the two groups, will generally be non-zero (because of selection or many other reasons) and needs to be dealt with somehow. We get what we want when the means of all the other causes are identical in the two groups, or more precisely when the sum of their net differences $\bar{S}^1 - \bar{S}^0$ is zero; this is the case of perfect balance. With perfect balance, the difference between the two means is exactly equal to the average of the treatment effect among the treated, so that we have the ultimate precision and we know the answer exactly, at least in this linear case.
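The decomposition in (4) is an accounting identity, so it can be verified numerically for any single trial. The following is a minimal sketch under invented assumptions: the sample size, the three coefficients in `gammas`, and the distributions of the treatment effects and other causes are all hypothetical.

```python
import random
from statistics import mean

random.seed(0)
n, gammas = 40, [2.0, -1.0, 0.5]   # hypothetical sample size and coefficients

# Each unit has its own treatment effect beta_i and J = 3 other causes x_ij.
beta = [random.gauss(1.0, 1.0) for _ in range(n)]
x = [[random.gauss(0.0, 1.0) for _ in gammas] for _ in range(n)]
treated = set(random.sample(range(n), n // 2))
T = [1 if i in treated else 0 for i in range(n)]

# Equation (3): Y_i = beta_i * T_i + sum_j gamma_j * x_ij
Y = [beta[i] * T[i] + sum(g * v for g, v in zip(gammas, x[i])) for i in range(n)]

group1 = [i for i in range(n) if T[i] == 1]
group0 = [i for i in range(n) if T[i] == 0]

# Left-hand side of (4): the observed difference in mean outcomes.
diff_in_means = mean(Y[i] for i in group1) - mean(Y[i] for i in group0)

# Right-hand side of (4): average effect among the treated plus net imbalance.
ate_treated = mean(beta[i] for i in group1)
S1 = sum(g * mean(x[i][j] for i in group1) for j, g in enumerate(gammas))
S0 = sum(g * mean(x[i][j] for i in group0) for j, g in enumerate(gammas))
imbalance = S1 - S0

print(diff_in_means, ate_treated + imbalance, imbalance)
```

The two sides agree up to floating-point error, and the printed imbalance term is whatever this particular assignment happened to produce. Nothing in the identity forces it to be small.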
1.2.3 Balancing acts: real and magical

How do we get balance, or something close to it? What, exactly, is the role of randomization? In a laboratory experiment, where there is good background knowledge of the other causes, the experimenter has a good chance of controlling all of the other causes, aiming to ensure that the last term in (4) is close to zero. Failing such knowledge and control, an alternative is matching, frequently used in statistical, medical, and econometric work. For each treatment, a match is found that is as close as possible on all suspected causes, so that, once again, the last term in (4) can be kept small. Again, when we have a good idea of the causes, matching may also deliver a precise estimate. Of course, when there are important unknown or unobservable causes, neither laboratory control nor matching offers protection.

What does randomization do? Because the treatments and controls come from the same underlying distribution, randomization guarantees, by construction, that the last term on the right in (4) is zero in expectation at baseline (much can happen to disturb this beyond baseline). This is true whether or not the causes are observed. If the RCT is repeated many times on the same trial population, then the last term will be zero when averaged over an infinite number of (entirely hypothetical) trials. Of course, this does nothing to make it zero in any one trial where the difference in means will be equal to the average treatment effect among those treated plus a term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in the randomization that limits its size; by chance, there can be one (or more) important excluded cause(s) that is very unequally distributed between treatment and controls. This imbalance will vary over replications of the trial, and its average size will ideally be captured by the standard error of the estimated ATE, which gives us some idea of how likely we are to be away from the truth. Getting the standard error and associated significance statements right is therefore of great importance.
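The distinction between balance in expectation and balance in a single trial can be made concrete by enumerating every possible randomization of a small population. In the sketch below the ten values of the hidden cause are invented, with one unit deliberately given an extreme value. The imbalance is the difference between the treated and control means of that cause.

```python
from itertools import combinations
from statistics import mean

# Hypothetical values of one unobserved cause for ten units; unit 9 is extreme.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
units = range(len(x))

imbalances = []
# Every way of assigning exactly five of the ten units to treatment.
for treated in combinations(units, 5):
    control = [i for i in units if i not in treated]
    imbalances.append(mean(x[i] for i in treated) - mean(x[i] for i in control))

average_imbalance = mean(imbalances)             # balance in expectation
worst_imbalance = max(abs(d) for d in imbalances)
share_beyond_3 = sum(abs(d) > 3 for d in imbalances) / len(imbalances)

print(len(imbalances), average_imbalance, worst_imbalance, share_beyond_3)
```

Averaged over all 252 assignments the imbalance is zero, as randomization guarantees, yet the single assignment that a trial actually draws can be far from zero.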
Exactly what randomization does is frequently lost in the practical literature, and there is often a confusion between perfect control, on the one hand (as in a laboratory experiment or perfect matching with no unobservable causes), and control in expectation, which is what RCTs do. We suspect that at least some of the popular and professional enthusiasm for RCTs, as well as the belief that they are precise by construction, comes from misunderstandings about balance. These misunderstandings are not so much among the trialists who, when pressed, will give a correct account, but come from imprecise statements by trialists that are taken as gospel by the lay audience that the trialists are keen to reach.

Such a misunderstanding is well captured by the following quote from the World Bank’s online manual on impact evaluation:

“We can be very confident that our estimated average impact, given as the difference between the outcome under treatment (the mean outcome of the randomly assigned treatment group) and our estimate of the counterfactual (the mean outcome of the randomly assigned comparison group) constitute the true impact of the program, since by construction we have eliminated all observed and unobserved factors that might otherwise plausibly explain the difference in outcomes.” Gertler et al (2011) (our italics.)

This statement confuses actual balance in any single trial with balance in expectation over many entirely hypothetical trials. If the statement above were true, and if all factors were indeed controlled (and no imbalances were introduced post randomization), the difference would be an exact measure of the average treatment effect, at least in the absence of measurement error. We should not only be confident of our estimate; we would know the truth, as the quote says.

A similar quote comes from John List, one of the most imaginative and successful scholars who use RCTs:

“complications that are difficult to understand and control represent key reasons to conduct experiments, not a point of skepticism. This is because randomization acts as an instrumental variable, balancing unobservables across control and treatment groups.” Al-Ubaydli and List (2013) (italics in the original.)
And from Dean Karlan, founder and President of Yale’s Innovations for Poverty Action, which runs development RCTs around the world:

“As in medical trials, we isolate the impact of an intervention by randomly assigning subjects to treatments and control groups. This makes it so that all those other factors which could influence the outcome are present in treatment and control, and thus any difference in outcome can be confidently attributed to the intervention.” Karlan, Goldberg and Copestake (2009)

And from the medical literature, from a distinguished psychiatrist who is deeply skeptical of RCTs,

“The beauty of a randomized trial is that the researcher does not need to understand all the factors that influence outcomes. Say that an undiscovered genetic variation makes certain people unresponsive to medication. The randomizing process will ensure, or make it highly probable, that the arms of the trial contain equal numbers of subjects with that variation. The result will be a fair test.” (Kramer, 2016, p. 18)

Claims are even made that RCTs reveal knowledge without possibility of error. Judy Gueron, the long-time president of MDRC, which has been running RCTs on US government policy for 45 years, asks why federal and state officials were prepared to support randomization in spite of frequent difficulties and in spite of the availability of other methods, and concludes that it was because “they wanted to learn the truth,” Gueron and Rolston (2013, 429). There are many statements of the form “We know that [project X] worked because it was evaluated with a randomized trial,” Dynarski (2015).
Many writers are more cautious, and modify statements about treatment and control groups being identical with terms such as “statistically identical,” “reasonably similar” or do not differ “systematically.” And we have no doubt that all of the authors quoted above understand the need for these qualifications. But to the uninformed reader, the qualified statements are unlikely to be differentiated from the unqualified statements quoted above. Nor is it always clear what some of these terms mean. For example, if two people are selected at random from a population, and it so happens that one is female and one male, in what sense are they statistically identical? While it is true that they were randomly selected from the same parent distribution, which provides the basis for inference, the calculation of standard errors, and significance statements, it does nothing to help with balance or precision in any given trial.

1.2.4 Sample size and statistical inference in unbalanced trials

Is a single trial more likely to be balanced, and thus more precise, when the sample size is large? Indeed, as the sample size tends to infinity, the means of the $x$’s in the treatment and control groups will become arbitrarily close. Yet this is of little help in finite samples as Fisher (1926) noted: “Most experimenters on carrying out a random assignment will be shocked to find how far from equally the plots distribute themselves,” quoted in Morgan and Rubin (2012). Even with very large sample sizes, if there are a large number of causes, balance on each cause may be infeasible. Vandenbroucke (2004) notes that there are three billion base pairs in the human genome, many or all of which could be relevant prognostic factors for the biological outcome that we are seeking to influence.

However, as (4) makes clear, we do not need balance on all causes, only on their net effect, the term $\bar{S}^1 - \bar{S}^0$, which does not require balance on each cause individually. Yet there is no guarantee that even the net effect will be small. For example, there may only be one omitted unobserved cause whose effect is large, one single base pair say, so that if that one cause is unbalanced across treatments and controls, the fact that there is individual or even net balance on other less important causes is not going to help.
Statements about large samples guaranteeing balance are not useful without guidelines about how large is large enough, and such statements cannot be made without knowledge of other causes and how they affect outcomes.

A simple case illustrates. Suppose that there is one hidden cause in (3), a binary variable $x$ that is unity with probability $p$ and 0 otherwise. With $n$ controls and $n$ treatments, the difference in fractions with $x = 1$ in the two groups has mean 0 and variance $2p(1-p)/n$. With $n = 100$ and $p = 0.5$, the standard error around 0 is about 0.07 so that, if this unobserved confounder has a large effect on the outcome, the imbalance could easily mask the effect of treatment, or be mistaken as evidence for the effectiveness of a truly ineffective treatment.

Lack of balance in the above example or in the net effect of either observables or non-observables in (4) does not compromise the inference in an RCT in the sense of obtaining a standard error for the unbiased ATE, see Senn (2013) for a particularly clear statement. The randomization does not guarantee balance but it provides the basis for making probability statements about the various possible outcomes, which is also clear in the example in the previous paragraph. This was also Fisher’s argument for randomization. Senn writes “the probability calculation applied to a clinical trial automatically makes an allowance for the fact that the groups will almost certainly be unbalanced.” (italics in the original.) If the design is such that, even with perfect randomization, successive replications tend to generate large imbalances, the resulting imprecision of the ATE will show up in its standard error. Of course, the usefulness of this requires that the calculated standard errors permit correct significance statements, which, as we shall see in the next subsection, is often far from straightforward. In the example above, an extreme, but entirely possible, case occurs when, by chance, the unobserved confounder is perfectly correlated with the treatment; unless there are actual replications, the false certainty that such an experiment provides will be reinforced by false significance tests.
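The size of chance imbalance in the binary-cause example can be checked by simulation. The sketch below assumes two independent arms of $n = 100$ units each and $p = 0.5$; the number of replications and the seed are arbitrary choices. For two independent binomial fractions the variance of their difference is $p(1-p)/n + p(1-p)/n = 2p(1-p)/n$.

```python
import math
import random
from statistics import pstdev

random.seed(0)
n, p, reps = 100, 0.5, 4000   # n units per arm; reps is an arbitrary choice

def imbalance():
    """Difference between arms in the fraction of units with x = 1."""
    treated = sum(random.random() < p for _ in range(n)) / n
    control = sum(random.random() < p for _ in range(n)) / n
    return treated - control

draws = [imbalance() for _ in range(reps)]
simulated_sd = pstdev(draws)
analytic_sd = math.sqrt(2 * p * (1 - p) / n)   # binomial formula, about 0.0707

# Share of trials where the arms differ by more than ten percentage points
# (the 0.105 threshold avoids floating-point ties at exactly 0.10).
share_large = sum(abs(d) > 0.105 for d in draws) / reps

print(simulated_sd, analytic_sd, share_large)
```

The simulated spread should sit close to the analytic value, and `share_large` gives a rough sense of how often a single trial of this size ends up visibly unbalanced on the hidden cause.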
1.2.4 Testing for balance

In practice, trialists in economics (and in some other disciplines) usually carry out a statistical test for balance after randomization but before analysis, presumably with the aim of taking some appropriate action if balance fails. The first table of the paper typically presents the sample means of observable covariates (the observable $x$’s in (3), which are either causes in their own right or interact with the $\beta$’s) for the control and treatment groups, together with their differences, and tests for whether or not they are significantly different from zero, either variable by variable, or jointly. These tests are appropriate if we are concerned that the random number generator might have failed (because we are drawing playing cards, rolling dice, or spinning bottle tops, though presumably not if the randomization is done by a random number generator, always supposing that there is such a thing as randomness, Singer and Pincus (1998)), or if we are worried that the randomization is undermined by non-blinded subjects or trialists systematically undermining the allocation. Otherwise, as the next paragraph shows, the test makes no sense and is not informative, which does not seem to stop it being routinely used.
If we write $\mu_0$ and $\mu_1$ for the (vectors of) population means (i.e. the means over all possible randomizations) of the observed $x$’s in the control and treatment groups at the point of assignment, the null hypothesis is (presumably, as judged by the typical balance test) that the two vectors are identical, with the alternative being that they are not. But if the randomization has been correctly done, the null hypothesis is true by construction, see e.g. Altman (1985) and Senn (1994), which may help explain why it so rarely fails in practice. Indeed, although we cannot “test” it, we know that the null hypothesis is also true for the unobservable components of $x$. Note the contrast with the statements quoted above claiming that RCTs guarantee balance on causes across treatment and control groups. Those statements refer to balance of causes at the point of assignment in any single trial, which is not guaranteed by randomization, whereas the balance tests are about the balance of causes at the point of assignment in expectation over many trials, which is guaranteed by randomization. The confusion is perhaps understandable, but it is confusion nevertheless. Of course, it makes sense to look for balance between observed covariates using some more appropriate distance measure, for example the normalized difference in means, Imbens and Wooldridge (2009, equation 3).
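A sketch of one such distance measure follows. It assumes the normalized difference takes the form $(\bar{x}^1 - \bar{x}^0)/\sqrt{s_1^2 + s_0^2}$ with sample variances; this form is our reading of the cited equation, and some later references divide the sum of variances by two, so the scaling should be treated as an assumption. The function names and the data are hypothetical. The example compares the measure with a two-sample t-statistic on the same covariate values replicated at two sample sizes.

```python
import math
from statistics import mean, variance

def normalized_difference(x_treated, x_control):
    """(mean_1 - mean_0) / sqrt(s_1^2 + s_0^2); scaling is an assumed convention."""
    gap = mean(x_treated) - mean(x_control)
    return gap / math.sqrt(variance(x_treated) + variance(x_control))

def t_statistic(x_treated, x_control):
    """Unequal-variance two-sample t-statistic, shown for comparison."""
    gap = mean(x_treated) - mean(x_control)
    se = math.sqrt(variance(x_treated) / len(x_treated)
                   + variance(x_control) / len(x_control))
    return gap / se

# Hypothetical covariate values: the same gap and spread at two sample sizes.
small_t, small_c = [1, 2, 3], [0, 1, 2]
large_t, large_c = small_t * 100, small_c * 100

nd_small = normalized_difference(small_t, small_c)
nd_large = normalized_difference(large_t, large_c)
t_small = t_statistic(small_t, small_c)
t_large = t_statistic(large_t, large_c)

print(nd_small, nd_large)
print(t_small, t_large)
```

The normalized difference describes how far apart the two groups are in units of the covariate's spread and stays of similar magnitude when the sample is replicated, whereas the t-statistic grows with the sample size even though the groups are no more different than before.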
1.2.5 Methods for balancing

One procedure to improve balance is to adapt the design before randomization, for example by stratification. Fisher, who, as the quote above illustrates, was well aware of the loss of precision from randomization, argued for “blocking” (stratification) in agricultural trials or for using Latin Squares, both of which restrict the amount of imbalance. Stratification, to be useful, requires some prior understanding of the factors that are likely to be important, and so it takes us away from the “no knowledge required,” or “no priors accepted” appeal of RCTs. But as Scriven (1974, 103) notes: “cause hunting, like lion hunting, is only likely to be successful if we have a considerable amount of relevant background knowledge,” or even more strongly, “no causes in, no causes out,” Cartwright (1994, Chapter 2). Stratification in RCTs, as in other forms of sampling, is a standard method for using background knowledge to increase the precision of an estimator. It has the further advantage that it allows for the exploration of different average treatment effects in different strata which can be useful in adapting or transporting the results to other locations, see Section 2.

Stratification is not possible when there are too many covariates, or if each has many values, so that there are more cells than can be filled given the sample size. An alternative is to re-randomize, repeating the randomization until the distance between the observed covariates is less than some predetermined criterion. Morgan and Rubin (2012) suggest the Mahalanobis D-statistic, and use Fisher’s randomization inference (to be discussed further below) to calculate standard errors that take the re-randomization into account. An alternative, widely adopted in practice, is to adjust for covariates by running a regression (or covariance) analysis, with the outcome on the left hand side and the treatment dummy and the covariates as explanatory variables, including possible interactions between covariates and treatment dummies.
Freedman (2008) has analyzed this regression adjustment and argues “if adjustment made a substantial difference, we would suggest much caution when interpreting the results.” But a substantial difference is exactly what we would like to see, at least some of the time, if the adjustment moves the estimate closer to the truth. Freedman shows that the adjusted estimate of the ATE is biased in finite samples, with the bias depending on the correlation between the squared treatment effect and the covariates. There is also no general guarantee that the regression adjustment will generate a more precise estimate, although it will do so if there are equal numbers of treatments and controls or if the treatment effects are constant over units (in which case there will also be no bias). Even with bias, the regression adjustment is attractive if it does indeed trade off bias for precision, though presumably not to RCT purists for whom unbiasedness is the sine qua non. Note again that the increased precision, when it exists, comes from using prior knowledge about the variables that are likely to be important for the outcome. That the background knowledge or theory is widely shared and understood will also provide some protection against data mining by searching through covariates in the search for (perhaps falsely) estimated precision.

1.2.6 Should we randomize?

The tension between randomization and precision goes back to the early debate between Fisher and Student (Gosset), who never accepted Fisher’s arguments for randomization, see also Ziliak (2014). In his debate with Fisher about agricultural trials, Student argued that randomization ignored relevant prior information, for example about how likely confounders would be distributed across the test plots, so that randomization wasted resources and led to unnecessarily poor estimates. This general question of whether randomization is desirable has been reopened in recent papers by Kasy (2016), Banerjee, Chassang, and Snowberg (2016) and Banerjee, Chassang, Montero, and Snowberg (2016).
Refer back to the MSE introduced above, and consider designing an experiment that will make this as small as possible. Unfortunately, this is not generally possible; for example, the “estimator” of 3, say, for the ATE has the lowest possible mean-squared error if the true ATE is actually 3. Instead, we need to average the MSE over a distribution of possible ATEs. This leads to a decision theory approach to estimation whereby a Bayesian econometrician will estimate the ATE by choosing the allocation of treatment and controls so as to minimize the expected value of a loss function, the MSE being one example. Such an approach requires us to specify a prior on the ATE, or more generally, on the expectation of outcomes conditional on the covariates. These priors are formal versions of the issue that has already come up repeatedly, that to get good estimators, we need to know something about how the covariates affect the outcome. Kasy (2016) solves this problem for the case of expected MSE and shows that randomization is undesirable; it simply adds noise and makes the MSE larger. He uses a non-parametric prior that has proved useful in a number of other applications (we could presumably do even better if we were prepared to commit further), and he provides code to implement his method, which shows a 20 percent reduction in MSE compared with randomization (14 percent for stratified randomization) for the well-known Tennessee STAR class-size experiment.

Banerjee et al propose a more general loss function and prove the comparable theorem, that randomization leads to larger losses than the optimal non-random purposive assignment. These authors recommend randomization on other grounds, which we will discuss below, but agree that, for standard statistical efficiency or maximization of expected utility, randomization should not be used in experimental design. Student was right.
Several points should be noted. First, the anti-randomization theorem is not a justification of any non-experimental design, for example one that compares outcomes of those who do or do not self-select into treatment. Selection effects are real enough, and if selection is based on unobservable causes, comparison of treated and controls will be biased. One acceptable non-random scheme is to use the observable covariates to divide the study sample into cells within which all observations have the same value and then divide each cell into treatments and controls. Within each cell, or for those units on which we have no information, we can choose any way we like, including randomly, though randomization has no advantage or disadvantage. Such allocations rule out self-selection (or doctor or program administrator selection) where the individual (doctor, or administrator) has information not visible to the person assigning treatments and controls. The key is that the person who makes the assignment (the analyst) uses all of the information that he or she possesses, and that once this has been taken into account, all units are interchangeable conditional on that information, so that assignment beyond that does not matter. Of course, the program administrators must enforce the analyst’s assignment, so that private information that they or the units possess is not allowed to affect the assignment, conditional on the information used by the analyst. Given this, selection on unobservables is ruled out, and does not affect the results. Randomization is not required to eliminate selection bias.

Whether it is really possible for the analyst to assign arbitrarily is an open question, as is whether “randomization” from a random-number generator will do so. Even machine-generated sequences have causes, and even if the analyst has only a set of uninformative labels for the units, those too must come from somewhere, so that it is possible that those causes are linked to the unobserved causes in the experiment. We do not attempt to deal here with these deep issues on the meaning of randomization, but see Singer and Pincus (1998).
According to Chalmers (2001) and Bothwell and Podolsky (2016), the development of randomization in medicine originated with Bradford-Hill, who used randomization in the first RCT in medicine (the streptomycin trial) because it prevented doctors selecting patients on the basis of perceived need (or against perceived need, leaning over backward as it were), an argument more recently echoed by Worrall (2007). Randomization serves this purpose, but so do other non-discretionary schemes; what is required is that the hidden information not affect the allocation. While it is true that doctors cannot be allowed to make the assignment, it is not true that randomization is the only scheme that can be enforced.

Second, the ideal rules by which units are allocated to treatment or control depend on the covariates, and on the investigators’ priors about how the covariates affect the outcomes. This opens up all sorts of methods of inference that are excluded by pure randomization. For example, the hypothetico-deductive method works by using theory to make a prediction that can be taken to the data; here the predictions would be of the form that a unit with characteristics x will respond in a particular way to treatment, falsification of which can be tested by an appropriate allocation of units to treatment. Banerjee, Chassang and Snowberg (2016) provide such examples.

Third, randomization, by running roughshod over prior information from theory and from the covariates, is wasteful and even unethical when it unnecessarily exposes people, or unnecessarily many people, to possible harm in a risky experiment, see Worrall (2002) for an egregious case of how an unthinking demand for randomization and the refusal to accept prior information put children’s lives directly at risk.
Fourth, the non-random methods use prior information, which is why they do better than randomization. This is both an advantage and a disadvantage, depending on one’s perspective. If prior information is not widely accepted, or is seen as non-credible by those we are seeking to persuade, we will generate more credible estimates if we do not use those priors. Indeed, this is why Banerjee, Chassang and Snowberg (2016) recommend randomized designs, including in medicine and in development economics. They develop a theory of an investigator who is facing an adversarial audience that will challenge any prior information and can even potentially veto results that are based on it (think administrative agencies or journal referees). The experimenter trades off his or her own desire for precision (and preventing possible harm to subjects), which uses prior information, against the wishes of the audience, who want nothing of the priors. Even then, the approval of this audience is only ex ante; once the fully randomized experiment has been done, nothing stops critics arguing that, in fact, the randomization did not offer a fair test. Among doctors who use RCTs, and especially meta-analysis, such arguments are (appropriately) common; see again Kramer (2016).
As we noted in the Introduction, much of the public has come to question expert prior knowledge, and Banerjee, Chassang, Montero and Snowberg (2016) have provided an elegant (positive) account of why RCTs will flourish in such an environment. In cases where there is good reason to doubt the good faith of experimenters, as in some pharmaceutical trials, randomization will indeed be the appropriate response. But we believe such arguments are deeply destructive for scientific endeavor and should be resisted as a general prescription for scientific research. Economists and other social scientists know a great deal, and there are many areas of theory and prior knowledge that are jointly endorsed by large numbers of knowledgeable researchers. Such information needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive know-nothing ignorance. The systematic refusal to use prior knowledge and the associated preference for RCTs are recipes for preventing cumulative scientific progress. In the end, it is also self-defeating; to quote Rodrik (2016) “the promise of RCTs as theory-free learning machines is a false one.”

1.3 Statistical inference in RCTs

If we are to interpret the results of an RCT as demonstrating the causal effect of the treatment in the trial population, we must be able to tell whether the difference between the control and treatment means could have come about by chance. Any conclusion about causality is hostage to our ability to calculate standard errors and accurate p-values. But this is not generally possible without assumptions that go beyond those needed to support the basic theorem of RCTs. In particular, it has long been known that the mean, and a fortiori the difference between two means, is a statistic that is sensitive to outliers. Indeed Bahadur and Savage (1956) demonstrate that, without restrictions on the parent distributions, standard t-tests are inherently unreliable.
The key problem here is skewness; standard t-tests break down in distributions with large skewness, see Lehmann and Romano (2005, p. 466–8). In consequence, RCTs will not work well when the distribution of the individual treatment effects is strongly asymmetric, at least if the standard two-sample t-statistics (or equivalently White’s (1980) heteroskedastic robust regression t-values) are used. While we may be willing to assume that treatment effects are symmetric in some cases, the need for such an assumption, which requires prior knowledge about the specific process being studied, undermines the argument that RCTs are largely assumption free and do not depend on such knowledge. There is a deep irony here. In the search for robustness and the desire to do away with unnecessary assumptions, the RCT can deliver the mean of the ATE, yet the mean (as opposed to the median, which cannot be estimated by an RCT) does not permit robust probability statements about the estimates of the ATE.

How difficult is it to maintain symmetry? And how badly is inference affected when the distribution of treatment effects is not symmetric? In economics, many trials have outcomes valued in money. Does an anti-poverty innovation, for example microfinance, increase the incomes of the participants? Income itself is not symmetrically distributed, and this might be true of the treatment effects too, if there are a few people who are talented but credit-constrained entrepreneurs and who have treatment effects that are large and positive, while the vast majority of borrowers fritter away their loans, or at best make positive but modest profits. Another important example is expenditures on health care. Most people have zero expenditure in any given period, but among those who do incur expenditures, a few individuals spend huge amounts that account for a large share of the total. Indeed, in the famous Rand health experiment, Manning, Newhouse et al. (1987, 1988), there is a single very large outlier. The authors realize that the comparison of means across treatment arms is fragile, and, although they do not see their problem exactly as described here, they obtain their preferred estimates using a structural approach that is designed to explicitly model the skewness of expenditures.
In some cases, it will be appropriate to deal with outliers by trimming, eliminating observations that have large effects on the estimates. But if the experiment is a project evaluation designed to estimate the net benefits of a policy, the elimination of genuine outliers, as in the Rand Health Experiment, will vitiate the analysis. It is precisely the outliers that make or break the program.

1.3.1 Spurious statistical significance: an illustrative example

We consider an example that illustrates what can happen in a realistic but simplified case. There is a parent population, or population of interest, defined as the collection of units for which we would like to estimate an average treatment effect. It might be all villages in India, or all recipients of food subsidies, or all users of health care in the US. From this population we have a sample that is available for randomization, the trial or experimental sample; in a randomized controlled trial, this will subsequently be randomly divided into treatments and controls. Ideally, the trial sample would be randomly selected from the parent population, so that the sample average treatment effect would be an unbiased estimator of the population average treatment effect; indeed in some cases the complete population of interest is available for the trial. Clearly, in these ideal cases, it is straightforward to use standard sampling theory to generalize the trial results from the sample to the population. However, for a number of practical and conceptual reasons, the trial sample is rarely either the whole population or a randomly selected subset, see Shadish et al (2002, pp. 341–8) for a good discussion of both practical and theoretical obstacles.

In our illustrative example, there is a parent population each member of which has his or her own treatment effect; these are continuously distributed with a shifted lognormal distribution with zero mean so that the population average treatment effect is zero. The individual treatment effects $\beta$ are distributed so that $\beta + e^{0.5} \sim \Lambda(0,1)$, for standardized lognormal distribution $\Lambda$.
We have something like a microfinance trial in mind, where there is a long positive tail of rare individuals who can do amazing things with credit, while most people cannot use it effectively. A trial (experimental) sample of 2n individuals is randomly drawn from the parent and is randomly split between n treatments and n controls. In the absence of treatment, everyone in the sample records zero, so the sample average treatment effect in any one trial is simply the mean outcome among the n treatments. For values of n equal to 25, 50, 100, 200, and 500 we draw 100 trial/experimental samples each of size 2n; with five values of n, this gives us 500 trial/experimental samples in all. For each of these 500 samples, we randomize into n controls and n treatments, estimate the ATE and its estimated t-value (using the standard two-sample t-value, or equivalently, by running a regression with robust t-values), and then repeat 1,000 times, so we have 1,000 ATE estimates and t-values for each of the 500 trial samples; these allow us to assess the distribution of ATE estimates and their nominal t-values for each trial.

Table 1: RCTs with skewed treatment effects

Sample size (n)   Mean of ATE estimates   Mean of nominal t-values   Fraction null rejected (percent)
25                 0.0268                 -0.4274                    13.54
50                 0.0266                 -0.2952                    11.20
100               -0.0018                 -0.2600                     8.71
200                0.0184                 -0.1748                     7.09
500               -0.0024                 -0.1362                     6.06

Note: 1,000 randomizations on each of 100 draws of the trial sample randomly drawn from a lognormal distribution of treatment effects shifted to have a zero mean.

The results are shown in Table 1. Each row corresponds to a sample size. In each row, we show the results of 100,000 individual trials, composed of 1,000 replications on each of the 100 trial (experimental) samples. The columns are averaged over all 100,000 trials.

The last column shows the fractions of times the true null is rejected and is the key result. When there are only 50 treatments and 50 controls (row 2), the (true) null is rejected 11.2 percent of the time, instead of the 5 percent that we would like and expect if we were unaware of the problem. When there are 500 units in each arm, the rejection rate is 6.06 percent, much closer to the nominal 5 percent.
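The design behind Table 1 can be sketched in a few lines. This is not the authors’ code: it uses far fewer draws, and it assumes a fixed two-sided critical value of 1.96, so its output will only roughly resemble the tabulated rejection rates. All names are hypothetical.

```python
import math
import random

def rejection_rate(n, n_samples, n_randomizations, rng, critical=1.96):
    """Share of trials with nominal |t| above `critical` when the true ATE is zero."""
    shift = math.exp(0.5)  # mean of a standard lognormal, so effects have mean zero
    rejections = 0
    for _ in range(n_samples):
        # Individual treatment effects for one trial sample of 2n units.
        effects = [math.exp(rng.gauss(0.0, 1.0)) - shift for _ in range(2 * n)]
        for _ in range(n_randomizations):
            # Treated units record their own effect; controls all record zero.
            treated = rng.sample(effects, n)
            ate = sum(treated) / n
            var_t = sum((y - ate) ** 2 for y in treated) / (n - 1)
            # The control variance is zero, so only the treated term remains.
            t_value = ate / math.sqrt(var_t / n)
            if abs(t_value) > critical:
                rejections += 1
    return rejections / (n_samples * n_randomizations)

# 100 trial samples with 100 randomizations each, n = 50 per arm (a reduced design).
print(rejection_rate(50, 100, 100, random.Random(0)))
```

With a correctly sized test the printed share would be close to 0.05; the skewed distribution of treatment effects is what pushes it above that.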
Why does the standard application of the t-distribution give such strange results when all we are doing is estimating a mean? The problem cases are when the trial sample happens to contain one or more outliers, something that is always a risk given the long positive tail of the parent distribution. When this happens, everything depends on whether the outlier is among the treatments or the controls; in effect the outliers become the sample, reducing the effective number of degrees of freedom.

Figure 1: Estimates of an ATE with an outlier in the trial sample (histogram; horizontal axis: 1,000 estimates of average treatment effect; vertical axis: density)

Figure 1 illustrates the estimated average treatment effects from an extreme case from the simulations with 100 observations in total, the second row of Table 1; the histogram shows the 1,000 estimates of the ATE. The trial sample has a single large outlying treatment effect of 48.3; the mean (s.d.) of the other 99 observations is -0.51 (2.1); when the outlier is in the treatment group, we get the right-hand side of the figure, when it is not, we get the left-hand side. On the right-hand side, when the outlier is among the treatment group, the dispersion across outcomes is large, as is the estimated standard error, and so those outcomes rarely reject the null using the standard table of t-values. The over-rejections come from the left-hand side of the figure when the outlier is in the control group, the outcomes are not so dispersed, and the t-values can be large, negative, and significant. While these cases of bimodal distributions may not be common, and depend on large outliers, they illustrate the process that generates the over-rejections and spurious significance.
We could escape these problems if we could calculate the median treatment effect, but RCTs cannot (without further assumption) identify the median, only the mean, and it is the mean that is at risk because of the Bahadur-Savage theorem. Note too that there is only moderate comfort to be taken in large sample sizes. While the last row is certainly better than the others, there are still many trial samples that are going to give sample average effects that are significant, even when the number we want is zero. The proof of the Bahadur-Savage theorem works by noting that for any sample size, it is always possible to find an outlier that will give a misleading t-value. Nor is there an escape here by using the Fisher exact method for inference; the Fisher method tests the null hypothesis that all of the treatment effects are zero whereas what we are interested in here, at least if we want to do project evaluation or cost-benefit analysis, is that the average treatment effect is zero.

The problems illustrated above, that stem from the Bahadur-Savage theorem, are certainly not confined to RCTs, and occur more generally in econometric and statistical work. However, the analysis here illustrates that the simplicity of ideal RCTs, subtracting one mean from another, brings no exemption from troublesome problems of inference. Escape from these issues, as in the Rand Health Experiment, requires explicit modeling, or might be best handled by estimating quantiles of the treatment distribution, which again requires additional assumptions.
Our reading of the literature on RCTs in development suggests that they are not exempt from these concerns. Many development trials are run on (sometimes very) small samples, they have treatment effects where asymmetry is hard to rule out, especially when the outcomes are in money, and they often give results that are puzzling, or at least not easily interpreted in terms of economic theory. Neither Banerjee and Duflo (2012) nor Karlan and Appel (2011), who cite many RCTs, raise concerns about misleading inference, treating all results as solid. No doubt there are behaviors in the world that are inconsistent with standard economics, and some can be explained by standard biases in behavioral economics, but it would also be good to be suspicious of the significance tests before accepting that an unexpected finding is well supported and theory should be revised. Replication of results in different settings may be helpful, if they are the right kind of places (see our discussion in Section 2), but it hardly solves the problem given that the asymmetry may be in the same direction in different settings (and seems likely to be so in just those settings that are sufficiently like the original trial setting to be of use for inference about the trial population), and that the “significant” t-values will show departures from the null in the same direction, thus replicating spurious findings.
1.2.11: Significance tests: Fisher-Behrens, robust inference, and multiple hypotheses

Skewness of treatment effects is not the only threat to accurate significance tests. The two-sample t-statistic is computed by dividing the ATE by the estimated standard error whose square is given by

$$\hat{\sigma}^2 = \frac{(n_1 - 1)^{-1} \sum_{i \in 1} (Y_i - \hat{\mu}_1)^2}{n_1} + \frac{(n_0 - 1)^{-1} \sum_{i \in 0} (Y_i - \hat{\mu}_0)^2}{n_0} \qquad (5)$$

where 0 refers to controls and 1 to treatments, so that there are $n_1$ treatments and $n_0$ controls, and $\hat{\mu}_1$ and $\hat{\mu}_0$ are the two means. As has been long known, this t-statistic is not distributed as Student’s t if the two variances (treatment and control) are not identical; this is known as the Behrens-Fisher problem. In extreme cases, when one of the variances is zero, the t-statistic has effective degrees of freedom half of that of the nominal degrees of freedom, so that the test-statistic has thicker tails than allowed for, and there will be too many rejections when the null is true.

In a remarkable recent paper, Young (2016) argues that this problem gets much worse when the trial results are analyzed by regressing outcomes not only on the treatment dummy, but also on additional controls, some of which might interact with the treatment dummy. Again the problem concerns outliers in combination with the use of clustered or robust standard errors. When the design matrix is such that the maximal influence is large, so that for some observations outcomes have large influence on their own predicted values, there is a reduction in the effective degrees of freedom for the t-value(s) of the average treatment effect(s) leading to spurious findings of significance.
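The halving of the effective degrees of freedom in the Behrens-Fisher discussion above can be checked with the Welch-Satterthwaite approximation. That approximation is a standard device but is not named in the text, so its use here is an assumption of this sketch; $s_1^2$ and $s_0^2$ denote the two sample variances appearing in equation (5), and $\nu$ the approximate degrees of freedom.

```latex
% Welch-Satterthwaite approximation to the degrees of freedom of the
% two-sample t-statistic built from equation (5):
\nu \approx
\frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_0^2}{n_0} \right)^{2}}
     {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_0^2/n_0)^2}{n_0 - 1}}

% Extreme case: the control variance is zero (s_0^2 = 0) and the arms
% are of equal size (n_1 = n_0 = n):
\nu \approx \frac{(s_1^2/n)^2}{(s_1^2/n)^2 / (n - 1)} = n - 1
      = \tfrac{1}{2}\,(n_1 + n_0 - 2)
```

Under these assumptions the statistic behaves like a t with $n - 1$ degrees of freedom rather than the nominal $2(n - 1)$, which matches the extreme case described in the text.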
Young looks at 2,003 regressions reported in 53 RCT papers in the American Economic Association journals and recalculates the significance of the estimates using Fisher’s randomization inference applied to the authors’ original data; see again Imbens and Wooldridge (2009) for a good modern account of Fisher’s method. In 30 to 40 percent of the estimated treatment effects in individual equations with coefficients that are reported as significant, he cannot reject the null of no effect; the fraction of spuriously significant results increases further when he simultaneously tests for all results in each paper. These spurious findings come in part from the well-known problem of multiple-hypothesis testing, both within regressions with several treatments and across regressions. Within regressions, treatments are largely orthogonal, but authors tend to emphasize significant t-values even when the corresponding F-tests are insignificant. Across equations, results are often strongly correlated, so that, at worst, different regressions are reporting variants of the same result, thus spuriously adding to the “kill count” of significant effects. At the same time, the pervasiveness of observations with high influence generates spurious significance on its own.
Our sense is that these issues are being taken more seriously in recent work, especially as concerns multiple hypothesis testing. Young himself is a strong proponent of RCTs in general and believes that randomization inference will yield correct inferences. Yet randomization inference can only test the null that all treatment effects are zero, that the experiment does nothing to anyone, whereas many investigators are interested in the weaker hypothesis that the average treatment effect is zero. This simply makes matters worse since the stronger hypothesis implies the weaker hypothesis and there are presumably undiscovered cases where the ATE is spuriously significant, even when the Fisher test rejects that all treatment effects are zero. Note that testing does not always match logic; it is possible to reject the null that the ATE is zero even when we can simultaneously accept the (joint) hypothesis that all treatment effects are zero; this is familiar from OLS regression, where an F-test can show joint insignificance, even when a t-test of some linear combination is significant.

It is clear that, as of now, all reported significance levels from RCT results in economics should be treated with considerable caution. Greater care about skewness and outliers would help, as would greater use of the Fisher method and of procedures that deal correctly with multiple hypothesis testing. Yet if the null hypothesis is that the average treatment effect is zero, as in most project evaluation, the Fisher test is not available, so that we currently do not have a reliable set of procedures. Robust or clustered standard errors are necessary to allow for the possibility that treatment changes variances, and the inclusion of covariates is necessary to control for imbalance in finite samples.
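For concreteness, Fisher’s randomization inference can be sketched as a Monte Carlo permutation test. The sketch assumes a simple difference in means as the test statistic and a completely randomized design; it is an illustration rather than the procedure used in any of the papers cited, and the names and data are hypothetical. Note that it tests only the sharp null that no unit is affected, which is exactly the limitation discussed above.

```python
import random

def randomization_p_value(y, assign, draws, rng):
    """Two-sided Monte Carlo p-value for the sharp null of no effect on any unit."""
    def mean_difference(a):
        treated = [yi for yi, ai in zip(y, a) if ai == 1]
        control = [yi for yi, ai in zip(y, a) if ai == 0]
        return sum(treated) / len(treated) - sum(control) / len(control)

    observed = abs(mean_difference(assign))
    shuffled = list(assign)
    extreme = 0
    for _ in range(draws):
        # Under the sharp null, outcomes are fixed and only labels are re-drawn.
        rng.shuffle(shuffled)
        if abs(mean_difference(shuffled)) >= observed - 1e-12:
            extreme += 1
    # Count the observed allocation itself so the p-value is never zero.
    return (extreme + 1) / (draws + 1)

# Hypothetical outcomes: four treated units followed by four controls.
outcomes = [10.0, 11.0, 12.0, 13.0, 0.0, 1.0, 2.0, 3.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(randomization_p_value(outcomes, labels, 2000, random.Random(0)))
```

The reference distribution is generated by the allocation mechanism alone, so no distributional assumption about outcomes is needed, but a small p-value says only that some unit was affected, not that the average effect differs from zero.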
1.3 Blinding

Blinding is rarely possible in economics or social science trials, and this is one of the major differences from most (although not all) RCTs in medicine, where blinding is standard, both for those receiving the treatment and those administering it. Indeed, the ability to blind has been one of the key arguments in favor of randomization, from Bradford-Hill in the 1950s, see Chalmers (2003), to welfare trials today, Gueron and Rolston (2013). Consider first the blinding of subjects. Subjects in social RCTs usually know whether they are receiving the treatment or not and so can react to their assignment in ways that can affect the outcome other than through the operation of the treatment; in econometric language, this is akin to a violation of exclusion restrictions, or a failure of exogeneity. In terms of (1), there is a pathway from the treatment assignment to another unobserved cause, which will result in a biased ATE. This is not to argue in favor of instrumental variables over RCTs, or vice versa, but simply to note that, without blinding, RCTs do not automatically solve the selection problem any more than IV estimation automatically solves the selection problem. In both cases, the exogeneity (exclusion restriction) argument needs to be explicitly made and justified. Yet the literature in economics gives great attention to the validity of exclusion restrictions in IV estimation, while tending to shrug off the essentially identical problems with lack of blinding in RCTs.
Note also that knowledge of their assignment may cause people to want to cross over from treatment to control, or vice versa, to drop out of the program, or to change their behavior in the trial depending on their assignment. In extreme cases, only those members of the trial sample who expect to benefit from the treatment will accept treatment. Consider, for example, a trial in which children are randomly allocated to two schools that teach in different languages, Russian or English, as happened during the breakup of the former Yugoslavia. The children (and their parents) know their allocation, and the more educated, wealthier, and less-ideologically committed parents whose children are assigned to the Russian-medium schools can (and did) remove their children to private English-medium schools. In a comparison of those who accepted their assignments, the effects of the language of instruction will be distorted in favor of the English schools by differences in family characteristics. This is a case where, even if the random number generator is fully functional, a later balance test will show systematic differences in observable background characteristics between the treatment and control groups; even if the balance test is passed, there may still be selection on unobservables for which we cannot test.
More generally, when people know their allocation, when they have a stake in the outcome, and when the treatment effect is different for different people, there are incentives and opportunities for selection in response to the randomization, and that selection can contaminate the estimated average treatment effect, see Heckman (1997) who makes the same point in the context of instrumental variables. Those who were randomized by a lottery into going to Vietnam will have different treatment effects depending on their labor market prospects, and those with better prospects are more likely to resist the draft. As we shall see in the next subsection, various statistical corrections are available for a few of the selection problems non-blinding presents, but all rely on the kind of assumptions that, while common in observational studies, RCTs are designed to avoid. Our own view is that assumptions and the use of prior knowledge are what we need to make progress in any kind of analysis, including RCTs whose promise of assumption-free learning is always likely to be illusory.

There may be a tendency in economics to focus on the selection bias effects of non-blinding because some solutions are available, but selection bias is not the only serious source of bias in social and medical trials. Concerns about the placebo, Pygmalion, Hawthorne, John Henry, and 'teacher/therapist' effects are widespread across studies of medical and social interventions. This literature argues that double blinding should be replaced by quadruple blinding; blinding should extend beyond participants and investigators and include those who measure outcomes and those who analyze the data, all of whom may be affected by both conscious and unconscious bias. The need for blinding in those who assess outcomes is particularly important in any cases where outcomes are not determined by strictly prescribed procedures whose application is transparent and checkable but require elements of judgment; a good example is therapists who are asked to assess the extent of depression in clinical trials of anti-depressants, see Kramer (2016).
The lesson here is that blinding matters and is very often missing. There is no reason to suppose that a poorly blinded trial with random assignment trumps better blinded studies with alternative allocation mechanisms, or matched studies.

1.13 What do RCTs do in practice?

The execution of an RCT will often deviate from its design. People may not accept their assignment, controls may manage to get treatment, and vice versa, and people may accept their assignment, but drop out before the completion of the study. In some designs, the trial works by giving people incentives to participate, for example by mailing them a voucher that gives them subsidized access to a school or to a savings product. If the aim is to evaluate the voucher scheme itself, no new issue arises. However, if the aim is to find out what the education or savings program does, and the voucher is simply a device to induce variation, much depends on whether or not people decide to use the voucher which, like attrition and crossover, is subject to purposive decisions by the subjects inducing differences between treatments and controls.

Everything depends on the purpose of the trial. In the example above, we may want to evaluate the voucher program, or we may want to find out what the saving product does for people. We are sometimes interested in establishing causality, and sometimes in estimating an average treatment effect; in the economics literature, some writers define internal validity as getting the ATE right, while others, following the original definition of the term, define internal validity as getting causality right. Sometimes the trial limits itself to establishing causality (or to estimating an ATE) in only the trial sample, but some trials are more ambitious, and try to establish causality (or estimate an ATE) for a broader population of interest. When, as is common in economics trials, no limits are placed on the heterogeneity of treatment responses, different trial samples and different populations will generally have different ATEs and may have different causal outcomes, e.g. if the treatment has an effect in one population but none or the opposite effect in another. Our view is that the target of the trial, including the population of interest, needs to be defined in advance. Otherwise, almost any estimated number can be interpreted as
a valid ATE for some population, we allow deviations from the design to define our target, and we have no way of knowing whether apparently contradictory results are really contradictory or are correct for the population on which they were derived. Differences in results, between different RCTs and between RCTs and observational studies, may owe less to the selection effects that RCTs are designed to remove, than to the fact that we are comparing non-comparable people, Heckman, Lalonde, and Smith (1999, p. 2082). Without a clear idea of how to characterize the population of individuals in the trial, whether we are looking for an ATE or to identify causality, and for which groups enrolled in the trial the results are supposed to hold, we have no basis for thinking about how to use the trial results in other contexts.

To illustrate some of the issues, consider a simple RCT in which a treatment T is administered to a trial sample that is split between a treatment group of size n and a control group of size n, but that only a fraction p of the treatment group accepts their assignment, with fraction $(1 - p)$ receiving no treatment. Suppose that the parameter of interest is the ATE in the original population, from which the trial sample was drawn randomly. Denote by $\beta$ the hypothetical ideal ATE estimate that would have been calculated if everyone had accepted assignment; as we have seen, this is an unbiased estimator of the parameter of interest for both the trial sample and the parent population. $\beta$ cannot be calculated, but there are various options.

Option one is to ignore the original assignment and calculate the difference in means between those who received the treatment and those who did not, including among the latter those who were intended to receive it but did not. Denote this (“as treated”) estimate $\beta_1$. Alternatively, option two is to compare the average outcome among those who were intended to be treated and those who were intended to be controls. Denote this estimate, the “intent to treat” (ITT) estimator, $\beta_2$.
It is easy to show that one set of conditions for $\beta_1 = \beta$ is that those who were treated have the same ATE as those who were intended to be treated, and that those who broke their assignment have the same untreated mean as those who were assigned to be controls, conditions that may hold in some applications, for example where the treatment effects are identical.

The ITT estimator, $\beta_2$, will typically be closer to zero than is $\beta$, and it will certainly be so if the average treatment effect among those who break their assignment is the same as the overall ATE, in which case $\beta_2 = p\beta$. For these reasons, the ITT is often described as yielding a conservative estimate and is routinely advocated in medical trials even though it is an attenuated estimator of the ATE. A third estimator, $\beta_3$, the local average treatment estimator (LATE), is computed by running a regression of outcomes on an (actual) treatment dummy using the treatment assignment as an instrumental variable. In this case, the LATE is simply the ITT, scaled up by the reciprocal of p, so that $\beta_3 = \beta_2 / p$. From the above, the LATE is $\beta$ if the average treatment effect of those who break their assignment is the same as the average treatment effect in general, so that the ITT estimator is biased down by counting those who should have been treated as if they were controls. More generally, and with additional assumptions, Imbens and Angrist (1994) show that the LATE is the average treatment effect among those who were induced to accept the treatment by their assignment to treatment status, which can be a very different object from the original target of investigation. These various estimators, the ATE, the ITT, and the LATE, are all averages over different groups; more formally, Heckman and Vytlacil (2005) define a marginal treatment effect (MTE) as the ATE for those on the margin of treatment, whatever the assignment mechanism, and show that the other estimators can be thought of as averages of the MTEs over different populations.
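A small numerical sketch may help fix the three estimators. It assumes the simplest case described above: a constant treatment effect, an untreated outcome of zero for everyone, and non-compliance only in the treatment arm. The ratio used for the third estimator is the simple instrumental-variable (Wald) form for a binary instrument; the function name and data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def trial_estimators(assigned, treated, outcome):
    """Return (as_treated, itt, late) from unit-level 0/1 indicators and outcomes."""
    # Option one: compare by treatment actually received.
    as_treated = (mean([y for d, y in zip(treated, outcome) if d == 1])
                  - mean([y for d, y in zip(treated, outcome) if d == 0]))
    # Option two: compare by original assignment (intent to treat).
    itt = (mean([y for z, y in zip(assigned, outcome) if z == 1])
           - mean([y for z, y in zip(assigned, outcome) if z == 0]))
    # Difference in take-up between arms; equals p when no control is treated.
    take_up = (mean([d for z, d in zip(assigned, treated) if z == 1])
               - mean([d for z, d in zip(assigned, treated) if z == 0]))
    late = itt / take_up
    return as_treated, itt, late

# Four units assigned to treatment, of whom half (p = 0.5) comply; four controls.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
treated = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # constant effect of 2
print(trial_estimators(assigned, treated, outcome))
```

In this constructed case the ITT is half the true effect, and rescaling by the take-up rate recovers it. With heterogeneous effects and purposive non-compliance, the three numbers would generally differ and answer different questions.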
In general, and unless we are prepared to say more about the heterogeneity in the treatment effects, the three estimators will give different results because they are averages over different populations. Economists tend to believe that people act in their own interest, at least in part, so it is not attractive to believe that those who break their assignments have the same distribution of treatment effects as do those who accept them. In Heckman’s (1992) analogy, people are not like agricultural plots, which are in no position to evade the treatment when they see it coming. Such purposive behavior will generally also affect the composition of the trial sample compared with the parent population, with those who agree to participate different from those who do not. For example, people may dislike randomization because of the risks it entails, or people may seek to enter trials in the hope that they will receive a beneficial treatment that is otherwise unavailable. A famous example in economics is the Ashenfelter (1978) pre-program “dip,” where those who enter trials of training programs tend to be those whose earnings have fallen immediately prior to enrolment, see also Heckman and Smith (1999). People who participate in drug trials are more likely to be sick than those who do not, or are likely to be those who have failed on standard medication. Another example is Chyn’s (2016) evidence that those who applied for vouchers in the Moving to Opportunity experiment and were thus eligible for randomization (and only a quarter of those who were eligible actually did so) were those who were already making unusual efforts on their children’s behalf. These parents had effectively substituted for part of the better environment, so that the ATE from the trial understates the benefits to the average child of moving. Similar phenomena occur in medicine. In the 1954 trials of the Salk polio vaccine in the US, the rates of infection, while lowest among the treated children, were higher in the control children than in the general population at risk, so that the parents of those who selected into the trial presumably had some idea that they might have been exposed, Hausman and Wise (1985, p. 193–4). In this case, the average treatment effect in the trial sample exaggerates the ATE in the general population, which is what we want
to know for public policy.

Given the non-parametric spirit of RCTs, and the unwillingness of many trialists to make assumptions or to incorporate prior information, the only way forward is to be very clear about the purpose of the trial and, in particular, which average we are trying to estimate. For those who focus on internal validity in terms of establishing causality by finding an ATE significantly different from zero, the definition of the population seems to be a secondary concern. The idea seems to be that if causality is established in some population, that finding is important in itself, with the task of exploring its applicability to other populations left as a secondary matter. For the many economic or cost–benefit analyses where the ATE is the parameter of interest, the population of interest is definitional, and the inference needs to focus on a path from the results of the trial to the parameter of interest. This is often difficult or even impossible without additional assumptions and/or modeling of behavior, including the decision to participate in the trial, and among participants, the decision not to drop out. Manski (1990, 1995, 2003) has shown that, without additional evidence, the population ATE is not (point) identified from the trial results, and has developed non-parametric bounds (an interval estimate) for the ATE. As with the ITT, these bounds are sometimes tight enough to be informative, though the interval defined by the bounds will often contain zero, see Manski (2013) for a discussion aimed at a broad audience. Faced with this, many scholars are prepared to make assumptions or to build models that give more precise results.
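To see why an interval of this kind often contains zero, it may help to look at the simplest worst-case construction for an outcome known to lie in a bounded range. The sketch below is a textbook-style illustration and is not a reproduction of the specific bounds in the works cited; the function name and the eight data points are hypothetical. Each potential outcome that is never observed is replaced in turn by the lowest and the highest admissible value.

```python
def worst_case_ate_bounds(y, treated, y_min, y_max):
    """Worst-case bounds on the ATE when every outcome lies in [y_min, y_max].

    Nothing is assumed about why units ended up treated or untreated:
    each unobserved potential outcome is set to y_min or y_max.
    """
    n = len(y)
    y1 = [v for v, d in zip(y, treated) if d]       # outcomes seen under treatment
    y0 = [v for v, d in zip(y, treated) if not d]   # outcomes seen without treatment
    p1, p0 = len(y1) / n, len(y0) / n
    m1, m0 = sum(y1) / len(y1), sum(y0) / len(y0)
    # E[Y(1)] is observed for the treated share and unknown for the rest.
    lo1, hi1 = m1 * p1 + y_min * p0, m1 * p1 + y_max * p0
    # E[Y(0)] is observed for the untreated share and unknown for the rest.
    lo0, hi0 = m0 * p0 + y_min * p1, m0 * p0 + y_max * p1
    return lo1 - hi0, hi1 - lo0


# Hypothetical binary outcomes: 3 of 4 treated succeed, 1 of 4 untreated succeed.
lower, upper = worst_case_ate_bounds(
    y=[1, 1, 1, 0, 0, 0, 1, 0],
    treated=[1, 1, 1, 1, 0, 0, 0, 0],
    y_min=0, y_max=1,
)
print(lower, upper)
```

Under this construction the interval always has width `y_max - y_min`, so it necessarily straddles zero; narrowing it requires exactly the kind of additional assumptions or evidence discussed in the text.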
RCTs may tell us about causality, even when they do not deliver a good estimate of the ATE. For example, if the ITT estimate is significantly different from zero, the treatment has a causal effect for at least some individuals in the population. The same is true if the LATE is significantly different from zero; again the treatment is causal for some sub-population, even if we may have difficulty characterizing it or accepting it as the population of interest. From this, we also learn that, provided we had a population with the right distribution of $\beta_i$’s and governed by the same potential outcome equation, the treatment would produce the effect in at least some individuals there.

Section 2: Using the results of randomized controlled trials

2.1 Introduction

Suppose we have the results of a well-conducted RCT. We have estimated an average treatment effect, and our standard error gives us reason to believe that the effect did not come about by chance. We thus have good warrant that the treatment causes the effect in our sample population, up to the limits of statistical inference. What are such findings good for? How should we use them?

The literature in economics, as indeed in medicine and in social policy, has paid more attention to obtaining results than to whether and how they should be adapted for use, often assuming that findings can be used “as is.” Much effort is devoted to demonstrating causality and estimating effect sizes in study populations, both in empirical work (more and better RCTs, or substitutes for RCTs, such as instrumental variables or regression discontinuity models) and in theoretical statistical work, for example on the conditions under which we can estimate an average treatment effect, or a local average treatment effect, and what these estimates mean. There is less theoretical or empirical work to guide us how and for what purposes to use the findings of RCTs, such as the conditions under which the same results hold outside of the original settings, how they might be adapted for use elsewhere, or how they might be used for formulating, testing, understanding, or probing hypotheses beyond the immediate relation between the treatment and the outcome investigated in the study.
Yet it cannot be that knowing how to use results is less important than knowing how to demonstrate them. Any chain of evidence is only as strong as its weakest link, so that a rigorously established effect whose applicability is justified by a loose declaration of simile warrants little more than an estimate that was plucked out of thin air. If trials are to be useful, we need paths to their use that are as carefully constructed as are the trials themselves.

It is sometimes assumed that a parameter, once well established, is invariant across settings. The parameter may be difficult to estimate, because of selection or other issues, and it may be that only a well-conducted RCT can provide a credible estimate of it. If so, internal validity is all that is required, and debate about using the results becomes a debate about the conduct of the study. The argument for the “primacy of internal validity,” Shadish, Cook, and Campbell (2002), is reasonable as a warning that bad RCTs are unlikely to generalize, but it is sometimes incorrectly taken to imply that results of an internally valid trial will automatically or often apply ‘as is’ elsewhere, or that this is the default assumption failing arguments to the contrary. An invariance argument is often made in medicine, where it is sometimes plausible that a particular procedure or drug works the same way everywhere, though see Horton (2000) for a strong dissent and Rothwell (2005) for examples on both sides of the question. We should also note the recent movement to ensure that testing of drugs includes women and minorities because members of those groups suppose that the results of trials on mostly healthy young white males do not apply to them.

2.2 Using results, transportability, and external validity

Suppose a trial has established a result in a specific setting, and we are interested in using the result outside the original context. If “the same” result holds elsewhere, we say we have external validity, otherwise not. External validity may refer just to the transportability of the causal connection, or go further and require replication of the magnitude of the average treatment effect. Either way, the result holds (everywhere, or widely, or in some specific elsewhere) or it does not.
This binary concept of external validity is often unhelpful; it both overstates and understates the value of the results from an RCT. It directs us toward simple extrapolation (whether the same result will hold elsewhere) or simple generalization (whether it holds universally or at least widely) and away from possibly more complex but more useful applications of the evidence. Just as internal validity says nothing about whether or not a trial result will hold elsewhere, the failure of external validity interpreted as simple generalization or extrapolation says little about the value of the trial.

First, there are several uses of RCTs that do not require transportability beyond the original context; we discuss these in the next subsection. Second, there are often good reasons to expect that the results from a well-conducted, informative, and potentially useful RCT will not apply elsewhere in any simple way. Even successful replication by itself tells us little either for or against simple generalization or extrapolation. Without further understanding and analysis, even multiple replications cannot provide much support for, let alone guarantee, the conclusion that the next will work in the same way. Nor do failures of replication make the original result useless. We can often learn much from coming to understand why replication failed and use that knowledge to make appropriate use of the original findings, not by expecting replication, but by looking for how the factors that caused the original result might be expected to operate differently in different settings. Third, and particularly important for scientific progress, the RCT result can be incorporated into a network of evidence and hypotheses that test or explore claims that look very different from the results reported from the RCT. We shall give examples below of extremely useful RCTs that are not externally valid in the (usual) sense that their results do not hold elsewhere, whether in a specific target setting or in the more sweeping sense of holding everywhere.
Bertrand Russell’s chicken provides an excellent example of the limitations to straightforward extrapolation from repeated successful replication. The bird infers, based on multiply repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for Christmas dinner. Of course, our chicken did not base her inference on an RCT. But had we constructed one for her, we would have obtained exactly the same result that she did. Her problem was not her methodology, but rather that she was studying surface relations, and that she did not understand the social and economic structure that gave rise to the causal relations that she observed. So she did not know how widely or how long they would obtain. Russell notes, “more refined views as to the uniformity of nature would have been useful to the chicken” (1912, p. 44). We often act as if the methods of investigation that served the chicken so badly will do perfectly well for us.

Establishing causality does nothing in and of itself to guarantee generalizability. Nor does the ability of an ideal RCT to eliminate bias from selection or from omitted variables mean that the resulting ATE will apply anywhere else. The issue is worth mentioning only because of the enormous weight that is currently attached in economics to the discovery and labeling of causal relations, a weight that is hard to justify for effects that may have only local applicability, what might (perhaps provocatively) be labeled ‘anecdotal causality’. The operation of a cause generally requires the presence of support or helping factors, without which a cause that produces the targeted effect in one place, even though it may be present and have the capacity to operate elsewhere, will remain latent and inoperative. What Mackie (1974) called INUS causality (Insufficient but Non-redundant parts of a condition that is itself Unnecessary but Sufficient for a contribution to the outcome) is often the kind of causality we see; a standard example is a house burning down because the television was left on, although televisions do not operate in this way without helping factors, such as wiring faults, the presence of tinder, and so on. This is
standard fare in epidemiology, which uses the term “causal pie” to refer to the case where a set of causes are jointly but not separately sufficient for an effect. We can rewrite (3) in the form

$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} = \left( \sum_{k=1}^{K} \theta_k w_{ik} \right) T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (6)$$

where $\theta_k$ controls how $w_{ik}$ affects individual $i$’s treatment effect $\beta_i$. The “helping” or “support” factors for the treatment are represented by the interactive variables $w_{ik}$, among which may be included some x’s. Since the ATE is the average of the $\beta_i$’s, two populations will have the same ATE only if, except by accident, they have the same average for the support factors necessary for the treatment to work. These are however just the kind of factors that are likely to be differently distributed in different populations, and indeed we do generally find different ATEs in different development (and other social policy) RCTs in different places even in the cases where (unusually) they all point in the same direction.

Causal processes often require highly specialized economic, cultural, or social structures to enable them to work. Consider the Rube Goldberg machine that is rigged up so that flying a kite sharpens a pencil, Cartwright and Hardie (2012, 77), or another where a long chain of ropes and pulleys causes the insertion of food into the mouth to activate a face-wiping napkin. These are causal machines, but they are specially constructed to give a kind of causality that operates extremely locally and has no general applicability. The underlying structure affords a very specific form of (6) that will not describe causal processes elsewhere. Neither the same ATE nor the same qualitative causal relations can be expected to hold where the specific form for (6) is different.

Indeed, we continually attempt to design systems that will generate causal relations that we like and that will rule out causal relations that we do not like. Health care systems are designed to prevent nurses and doctors making errors; cars are designed so that drivers cannot start them in reverse; work schedules for pilots are designed so they do not fly too many consecutive hours without rest because alertness and performance are compromised.
As in the Rube Goldberg machines and in the design of cars and work schedules, the economic structure and equilibrium may differ in ways that support different kinds of causal relations and thus render a trial in one setting useless in another. For example, a trial that relies on providing incentives for personal promotion is of no use in a state in which a political system locks people into their social and economic positions. Conditional cash transfers cannot improve child health in the absence of functioning clinics. Policies targeted at men may not work for women. We use a lever to toast our bread, but levers only operate to toast bread in a toaster; we cannot brown toast by pressing an accelerator, even if the principle of the lever is the same in both a toaster and a car. If we misunderstand the setting, if we do not understand why the treatment in our RCT works, we run the same risks as Russell’s chicken.

2.3 When RCTs speak for themselves: no transportability required

For some things we want to learn, an RCT is enough by itself. An RCT may disprove a general theoretical proposition to which it provides a counterexample. The test might be of the general proposition itself (a simple refutation test), or of some consequence of it that is susceptible to testing using an RCT (a complex refutation test). Of course, counterexamples are often challenged (for example, it is not the general proposition that caused the rejection, but a special feature of the trial), but here we are on familiar inferential turf. An RCT may also confirm a prediction of a theory, and although this does not confirm the theory, it is evidence in its favor, especially if the prediction seems inherently unlikely in advance. Once again, this is familiar territory, and there is nothing unique about an RCT; it is simply one among many possible testing procedures. Even when there is no theory, or very weak theory, an RCT, by demonstrating causality in some population, can be thought of as proof of concept, that the treatment is capable of working somewhere. This is one of the arguments for the importance of internal validity.
Another case where no transportation is called for is when an RCT is used for evaluation, for example to satisfy donors that the project they funded actually achieved its aims in the population in which it was conducted. Even so, for such evaluations, say by the World Bank, to be global public goods requires the development of arguments and guidelines that justify using the results in some way elsewhere; the global public good is not an automatic by-product of the Bank fulfilling its fiduciary responsibility. When the components of treatments change across studies, evaluations need not lead to cumulative knowledge. Or as Heckman et al (1999, p. 1934) note, “the data produced from them [social experiments] are far from ideal for estimating the structural parameters of behavioral models. This makes it difficult to generalize findings across experiments or to use experiments to identify the policy-invariant structural parameters that are required for econometric policy evaluation.” Of course, when we ask exactly what those invariant structural parameters are, whether they exist, and how they should be modeled, we open up major fault lines in modern applied economics. For example, we do not intend to endorse intertemporal dynamic models of behavior as the only way of recovering the parameters that we need. We also recognize that the usefulness of simple price theory is not as universally accepted as it once was. But the point remains that we need something, some regularity, and that the something needed can rarely be recovered by simply generalizing across trials.
A third non-problematic and important use of an RCT is when the parameter of interest is the average treatment effect in a well-defined population from which the sample trial population (from which treatments and controls are randomly assigned) is itself a random sample. In this case the sample average treatment effect (SATE) is an unbiased estimator of the population average treatment effect (PATE) that, by assumption, is our target, see Imbens (2004) for these terms. We refer to this as the “public health” case; like many public health interventions, the target is the average, “population health,” not the health of individuals. One major (and widely recognized) danger of the public-health-style uses of RCTs is that the scaling up from (even a random) sample to the population will not go through in any simple way if the outcomes of individuals or groups of individuals change the behavior of others, which will be common in economic examples but perhaps less common in health. There is also an issue of timing if the results are to be implemented sometime after the trial.

In economics, a ‘public-health-style’ example is the imposition of a commodity tax, where the total tax revenue is of interest and we do not care who pays the tax. Indeed, theory can often identify a specific, well-defined magnitude whose measurement is key for the policy; see Deaton and Ng (1998) for an example of what Chetty (2009) calls a “sufficient” statistic. In this case, the behavior of a random sample of individuals might well provide a good guide to the tax revenue that can be expected. Another case comes from work on poverty programs where the interest of the sponsors is in the consequences for the budget of the state responsible for the program; we discuss these cases at the end of this Section. Even here, it is easy to imagine behavioral effects coming into play that drive a wedge between the trial and its full scale implementation, for example if compliance is higher when the scheme is widely publicized, or if government agencies implement the scheme differently from trialists.
2.4 Transporting results laterally and globally

The program of RCTs in development economics, as in other areas of social science, has the broader goal of finding out “what works.” At its most ambitious, this aims for universal reach, and the development literature frequently argues that “credible impact evaluations are global public goods in the sense that they can offer reliable guidance to international organizations, governments, donors, and nongovernmental organizations (NGOs) beyond national borders,” Kremer and Duflo (2008, p. 93). Sometimes the results of a single RCT are advocated as having wide applicability, with especially strong endorsement when there is at least one replication. For example, Kremer and Holla (2009) use a Kenyan trial as the basis for a blanket statement without context restriction, “Provision of free school uniforms, for example, leads to 10%-15% reductions in teen pregnancy and dropout rates.” Kremer and Duflo (2008), writing about another trial, are more cautious, citing two evaluations, and restricting themselves to India: “One can be relatively confident about recommending the scaling-up of this program, at least in India, on the basis of these estimates, since the program was continued for a period of time, was evaluated in two different contexts, and has shown its ability to be rolled out on a large scale.”

Of course, the problem of generalization extends beyond RCTs, to both “fully controlled” laboratory experiments and to most non-experimental findings. For example, ever since Alfred Marshall thought of it while sunbathing, economists have used the concept of an elasticity (as in the income elasticity of the demand for food, or the price elasticity of the supply of cotton) and have transported elasticities, which are conveniently dimensionless, from one context to another, as numerical estimates, or in ranges, such as high, medium, or low. Articles that collect such estimates are widely cited even though, as has long been known, the invariance of elasticities is not guaranteed in practice and is sometimes inconsistent with choice theory. Our argument here is that evidence from RCTs, like evidence on elasticities, is not automatically simply generalizable, and that its internal validity, when it exists, does not provide it with
any unique invariance across context. We shall also argue that specific features of RCTs, such as their freedom from parametric assumptions, although advantageous in estimation, can be a serious handicap in use.

Most advocates of RCTs understand that “what works” needs to be qualified to “what works under which circumstances,” and try to say something about what those circumstances might be, for example, by replicating RCTs in different places, and thinking intelligently about the differences in outcomes when they find them. Sometimes this is done in a systematic way, for example by having multiple treatments within the same trial so that it is possible to estimate a “response surface” that links outcomes to various combinations of treatments, see Greenberg and Shroder (2004) or Shadish et al (2002). For example, the RAND health experiment had multiple treatments, allowing investigation, not only of whether health insurance increased expenditures, but how much it did so under different circumstances. Some of the negative income tax experiments (NITs) in the 1960s and 1970s were designed to estimate response surfaces, with the number of treatments and controls in each arm optimized to maximize precision of estimated response functions subject to an overall cost limit, Conlisk (1973). Experiments on time-of-day pricing for electricity had a similar structure, see Aigner (1985).

The MDRC experiments have also been analyzed across cities in an effort to link city features to the results of the RCTs within them, Bloom, Hill, and Riccio (2005). Unlike the RAND and NIT examples, these are ex post analyses of completed trials; the same is true of Vivalt (2015) who assembles evidence on a large number of trials, and finds, for the collection of trials she studied, that development-related RCTs run by government agencies typically find smaller (standardized) effect sizes than RCTs run by academics or by NGOs. Bold et al (2013), who ran parallel RCTs on an intervention implemented either by an NGO or by the government of Kenya, found similar results there. Note that these analyses have a different purpose from those meta-analyses that assume that different trials estimate the same parameter up to noise and average in order to increase precision.
Although there are issues with all of these methods of investigating differences across trials, without some discipline it is too easy to come up with “just-so” or fairy stories that account for almost any differences. We risk a procedure that, if a result is replicated in full or in part in at least two places, puts that treatment into the “it works” box and, if the result does not replicate, causally interprets the difference in a way that allows at least some of the findings to survive.

How can we think about this more seriously? How can we do better than simple generalization and simple extrapolation? Many writers have emphasized the role of theory in transporting and using the results of trials, and we shall discuss this further in the next subsection. But statistical approaches are also widely used; these are designed to deal with the possibility that treatment effects vary systematically with other variables. Referring back to (6), suppose that the $\beta_i$’s, the individual treatment effects, are functions of a set of K observable or unobservable support variables, $w_{ik}$, and that the non-vacuous w’s may even represent different features in different places. It is then clear that, provided the distribution of the w values is the same in the new circumstances as the old, then the ATE in the original trial will hold in the new circumstances. In general, of course, this condition will not hold, nor do we have any obvious way of checking it unless we know what the support factors are in both places.
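The condition can be spelled out in a few lines. The derivation below is a sketch under two added assumptions that are ours, for illustration: the treatment effect takes the linear form in (6), and the coefficients $\theta_k$ are common to the original setting A and the new setting B. The notation $\bar{w}^{A}_k$ and $\bar{w}^{B}_k$ for the setting-specific means of the support factors is introduced here and does not appear elsewhere in the paper.

```latex
\begin{align*}
\beta_i &= \sum_{k=1}^{K} \theta_k w_{ik}, \\
\mathrm{ATE}^{A} &= \mathrm{E}^{A}[\beta_i] = \sum_{k=1}^{K} \theta_k \bar{w}^{A}_k, \\
\mathrm{ATE}^{B} - \mathrm{ATE}^{A} &= \sum_{k=1}^{K} \theta_k \left( \bar{w}^{B}_k - \bar{w}^{A}_k \right).
\end{align*}
```

With this linear form, equality of the means of the support factors is already enough for the two ATEs to coincide; equality of the full distributions, as stated in the text, is a stronger sufficient condition that would also cover non-linear dependence on the w's.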
One procedure to deal with interactions is post-experimental stratification, which parallels post-survey stratification in sample surveys. The trial is broken up into subgroups that have the same combination of known, observable w’s, the ATEs within each of the subgroups calculated, and then reassembled according to the configuration of w’s in the new context. For example, if the treatment effects vary with age, the age-specific ATEs can be estimated, and the age distribution in the new context used to reweight the age-specific ATEs to give a new, overall, ATE. This can be used to estimate the ATE in a new context, or to correct estimates to the parent population when the trial sample is not a random sample of the parent. Of course, this method will only work in special cases; for example, if we only know some of the w’s, there is no reason to suppose that reweighting for those alone will give a useful correction.

Other methods also work when there are too many w’s for stratification, for example by estimating the probability of each observation in the population being included in the trial sample as a function of the w’s, then weighting each observation by the inverse of these propensity scores. A good reference for these methods is Stuart et al (2011), or in economics, Angrist (2004) and Hotz, Imbens, and Mortimer (2005).

There are yet further reasons why these methods do not always work. As with any form of reweighting, the variables used to construct the weights must be present in both the original and new context. If treatment effects vary by sex, we cannot predict the outcomes for men using a trial sample that is entirely female. If we are to carry a result forward in time, we may not be able to extrapolate from a period of low inflation to a period of high inflation; as Hotz et al (2005) note, it will typically be necessary to rule out such “macro” effects, whether over time, or over locations. It also depends on assuming that the same governing equation (6) covers the trial and the target population. If they differ not only by what causal factors are present in what proportions but also in how (if at all) the causes contribute to the effects, re-weighting the effect sizes that occur in trial sub-populations will not produce good predictions about target population outcomes.
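The age example of post-experimental stratification can be sketched in a few lines. The function name, the stratum labels, and every number below are hypothetical and chosen only for illustration. The sketch also makes one of the limitations noted above concrete: a stratum that is present in the target population but absent from the trial cannot be reweighted at all.

```python
def reweighted_ate(stratum_ates, target_shares):
    """Reweight stratum-specific ATEs by the stratum shares of a target population."""
    if abs(sum(target_shares.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to one")
    missing = [s for s, share in target_shares.items()
               if share > 0 and s not in stratum_ates]
    if missing:
        # For example, a trial sample that is entirely female cannot speak to men.
        raise ValueError(f"no trial estimate for strata: {missing}")
    return sum(share * stratum_ates[s]
               for s, share in target_shares.items() if share > 0)


# Hypothetical age-specific ATEs estimated within the trial.
age_ates = {"young": 2.0, "old": 6.0}

trial_mix = {"young": 0.5, "old": 0.5}    # age shares in the trial sample
target_mix = {"young": 0.8, "old": 0.2}   # age shares in the new context

ate_trial = reweighted_ate(age_ates, trial_mix)
ate_target = reweighted_ate(age_ates, target_mix)
print(ate_trial, ate_target)
```

The calculation is only as good as its premises: it presumes that age is the only relevant support factor and that the age-specific effects themselves carry over, which is the same-governing-equation assumption discussed in the text.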
It should be clear from this that reweighting works only when the observable factors used for reweighting include all and only genuine interactive causes; we need data on all the relevant interactive factors. But as Muller (2015) notes, this takes us back to the situation that RCTs are designed to avoid, where we need to start from a complete and correct specification of the causal structure. RCTs can avoid this in estimation (which is one of their strengths, supporting their credibility), but the benefit vanishes as soon as we try to carry their results to a new context.

Pearl and Bareinboim (2014) use Pearl’s do-calculus to provide a fuller formal analysis for transportability of causal empirical findings across populations. They define transportability as “a license to transfer causal effects learned in RCTs to a new population, in which only observational studies can be conducted,” Pearl and Bareinboim (2015, p. 1). They consider both qualitative causal relations, which they represent in directed acyclic graphs, and probabilistic facts, such as the conditional probability of the outcome on a treatment conditional on some third factor. They then provide theorems about what the relationship between the causal and probabilistic facts in two populations must be if it is to be possible to infer a particular causal fact, such as the ATE, about population 2 from causal and probabilistic information about population 1 coupled with purely probabilistic information about population 2. Not surprisingly, for many things we should like to know about population 2, knowledge of even the full structure on population 1 will not suffice. Inferences to facts about a new population require not only that the facts we suppose about population 1 (like an ATE) are well grounded, that the RCT was well conducted, that the statistical inference is sound, but that we have equally good grounding for other assumptions we need about the relation between the two populations. For example, using the result described above for directly transporting the ATE from a trial population to some other (simple extrapolation), we need good grounds to suppose both that the average of the net effect of the interactive factors is the same in both populations and also that the same governing equation describes both populations.
This discussion leads to a number of points. First, we cannot get to general claims by simple generalization; there is no warrant for the convenient assumption that the ATE estimated in a specific RCT is an invariant parameter. We need to think through the causal chain that has generated the RCT result, and the underlying structures that support this causal chain, whether that causal chain might operate in a new setting and how it would do so with different joint distributions of the causal variables; we need to know why and whether that why will apply elsewhere. While it is true that there exist general causal claims (the force of gravity, or that people respond to incentives), they use relatively abstract concepts and operate at a much higher level than the claims that can be reasonably inferred from a typical RCT, and cannot, by themselves, guarantee the outcomes that we are considering here. That transportation is far from automatic also tells us why (even ideal) RCTs of similar interventions can be expected to give different answers in different settings. Such differences do not necessarily reflect methodological failings and will hold across perfectly executed RCTs just as they do across observational studies.

Second, thoughtful pre-experimental stratification in RCTs is likely to be valuable, or failing that, subgroup analysis, because it can provide information that may be useful for generalization or transportation. For example, Kremer and Holla (2009) note that, in their trials, school attendance is surprisingly sensitive to small subsidies, which they suggest is because there are a large number of students and parents who are on the (financial) margin between attending and not attending school; if this is indeed the mechanism for their results, a good variable for stratification would be the fraction of people near the relevant cutoff. We also need to know that the same mechanism works in any new setting where we consider using small subsidies to increase school attendance.
Third, we need to be explicit about causal structure, even if that means more model building and more (or different) assumptions than advocates of RCTs are often comfortable with. To be clear, modeling causal structure does not necessarily commit us to the elaborate and often incredible assumptions that characterize some structural modeling in economics, but there is no escape from thinking about the way things work, the why as well as the what.

Fourth, we will typically need to know more than the results of the RCT itself, for example about differences in social, economic, and cultural structures and about the joint distributions of causal variables, knowledge that will often only be available through a range of empirical strategies including observational studies. We will also need to be able to characterize the population to which the original RCT and its ATE applied because how the population is described is commonly taken to be some indication of which other populations the results are likely to be exportable to and which not. Many medical and psychological journals are explicit about this. For instance, the rules for submission recommended by the International Committee of Medical Journal Editors, ICMJE (2015, p. 14), insist that article abstracts “Clearly describe the selection of observational or experimental participants (healthy individuals or patients, including controls), including eligibility and exclusion criteria and a description of the source population.” The problems of characterizing the population here go beyond those we faced in considering a LATE. An RCT is conducted on a population of specific individuals. The results obtained, whether we think in terms of an ATE or in terms of establishing causality, are features of that population, of those very individuals at that very time, not any other population with any different individuals that might, for example, satisfy one of the infinite set of descriptions that the trial population satisfies. How is the description of the population that is used in reporting the results to be chosen? For choose we must: the alternative to describing is naming, identifying each individual in the study by name, which is cumbersome and unhelpful and often unethical.
This same issue is confronted already in study design. Apart from special cases, like post hoc evaluation for payment-for-results, we are not especially concerned to learn about the very population enrolled in the trial. Most experiments are, and should be, conducted with an eye to what the results can help us learn about other populations. This cannot be done without substantial assumptions about what might be and what might not be relevant to the production of the outcome studied. (For example, the ICMJE guidelines go on to say: “Because the relevance of such variables as age, sex, or ethnicity is not always known at the time of study design, researchers should aim for inclusion of representative populations into all study types and at a minimum provide descriptive data for these and other relevant demographic variables,” p. 14.) So both intelligent study design and responsible reporting of study results involve substantial background assumptions. Of course this is true for all studies, not just RCTs. But RCTs require special conditions if they are to be conducted at all and especially if they are to be conducted successfully (local agreements, compliant subjects, affordable administrators, people competent to measure and record outcomes reliably, a setting where random allocation is morally and politically acceptable, etc.), whereas observational data are often more readily and widely available. In the case of RCTs, there is danger that these kinds of considerations have too much effect. This is especially worrisome where the features the study population should have are not justified, made explicit, or subjected to serious critical review. Such careful description of the study population is uncommon in economics, whether in RCTs or many observational studies.
The need for observational knowledge is one of many reasons why it is counterproductive to insist that RCTs are the unique gold standard, or that some categories of evidence should be prioritized over others; these strategies leave us helpless in using RCTs beyond their original context. The results of RCTs must be integrated with other knowledge, including the practical wisdom of policymakers, if they are to be useable outside the context in which they were constructed. Contrary to much practice in medicine as well as in economics, conflicts between RCTs and observational results need to be explained, for example by reference to the different populations in each, a process that will sometimes yield important evidence, including on the range of applicability of the RCT itself. While the validity of the RCT will sometimes provide an understanding of why the observational study found a different answer, there is no basis (or excuse) for the common practice of dismissing the observational study simply because it was not an RCT and therefore must be invalid. It is a basic tenet of scientific advance that new findings must be able to explain previous results, even results that are now thought to be invalid; methodological prejudice is not an explanation.

These considerations can be seen in practice in the range of randomized controlled trials in economics, which we shall explore in the final subsection below.

2.5 Using theory for generalization

Economists have been combining theory and randomized controlled trials since the early experiments. Orcutt and Orcutt (1968) laid out the inspiration for the income tax trials using a simple, static theory of labor supply. According to this, people choose how to divide their time between work and leisure in an environment in which they receive a minimum $G$ if they do not work, and where they receive an additional amount $(1 - t)w$ for each hour they work, where $w$ is the wage rate, and $t$ is a tax rate. The trials assigned different combinations of $G$ and $t$ to different trial groups, so that the results traced out the labor supply function, allowing estimation of the parameters of preferences, which could then be used in a wide range of policy calculations, for example to raise revenue at minimum utility loss to workers.
Following these early trials, there has been a long and continuing tradition of using trial results, together with the baseline data collected for the trial, to fit structural models that are to be used more generally. Early examples include Moffitt (1979) on labor supply and Wise (1985) on housing; a more recent example is Heckman, Pinto and Savelyev (2013) for the Perry preschool program. Development economics examples include Attanasio, Meghir and Santiago (2012), Attanasio et al (2015), Todd and Wolpin (2006) and Duflo, Hanna and Ryan (2012). These structural models sometimes require formidable auxiliary assumptions on functional forms or the distributions of unobservables, which makes many economists reluctant to embrace them, but they have compensating advantages, including the ability to integrate theory and evidence, to make out-of-sample predictions, and to analyze welfare (which always requires some understanding of why things happen), and the use of RCT evidence allows the relaxation of at least some of the assumptions that are needed for identification. In this way, the structural models borrow credibility from the RCTs and in return help set the RCT results within a coherent framework. Without some such interpretation, the welfare implications of RCT results can be problematic; knowing how people in general (let alone just people in the trial population, which is what, as we keep repeating, the trial results tell us about) respond to some policy is rarely enough to tell whether or not they are made better off. What works is not equivalent to what should be.

In many papers, Heckman has developed ways to model how the beliefs and interests of participants affect their participation in, behavior during, and their outcomes in trials, for example using a Roy model of choice; see e.g. Heckman and Smith (1995), and more recently Chassang, Padró i Miguel, and Snowberg (2012) and Chassang et al (2015). The modeling of beliefs and behavior allows predictions about the results of trials that differ from the base trial, or where the risk and reward structures are different. Beyond that, and in line with a running theme of this Section, thinking about how to handle new situations can be incorporated into the design of the original trial so as to provide the information needed for transportation.
Light touch theory can do much to extend and to use RCT results. In both the RAND Health Experiment and negative income tax experiments, an immediate issue concerned the difference between short and long-run responses; indeed, differences between immediate and ultimate effects occur in a wide range of RCTs. Both health and tax RCTs aimed to discover what would happen if consumers/workers were permanently faced with higher or lower prices/wages, but the trials could only run for a limited period. A temporarily high tax rate on earnings was effectively a “fire sale” on leisure, so that the experiment provided an opportunity to take a vacation and make up the earnings later, an incentive that would be absent in a permanent scheme. How do we get from the short-run responses that come from the trial to the long-run responses that we want to know? Metcalf (1973) and Ashenfelter (1978) provided answers for the income tax experiments, as did Arrow (1975) for the RAND Health Experiment.

Arrow’s analysis illustrates how to use both structure and observational data to transport and adapt results from one setting to another. He models the health experiment as a two-period model, in which the price of medical care is lowered in the first period only, and shows how to derive what we want, which is the response in the first period if prices were lowered by the same proportion in both periods. The magnitude that we want is $S$, the compensated price derivative of medical care in period 1 in the face of identical increases in $p_1$ and $p_2$ in both periods 1 and 2, and this is equal to $s_{11} + s_{12}$, the sum of the derivatives of period 1’s demand with respect to the two prices. The trial gives only $s_{11}$. But if we have post-trial data on medical services for both treatments and controls, we can infer $s_{21}$, the effect of the experimental price manipulation on post-experimental care. Choice theory, in the form of Slutsky symmetry, allows Arrow to use this to infer $s_{12}$ and thus $S$. He contrasts this with Metcalf’s alternative solution, which makes different assumptions (that two-period preferences are intertemporally additive), in which case the long-run elasticity can be obtained from knowledge of the income elasticity of post-experimental medical care, which would have to come from an observational analysis. These two alternative approaches show how we can choose, based on our willingness to make assumptions and on the data we have, a suitable combination of (elementary and transparent) theoretical assumptions and observational data in order to adapt and use the trial results. Such analysis can also help design the original trial by clarifying what we need to know in order to be able to use the results of a temporary treatment to estimate the permanent effects that we need. Ashenfelter provides a third solution, noting that the two-period model is formally identical to a two-person model, so that we can use information on two-person labor supply to tell us about the dynamics.

Theory can often allow us to reclassify new or unknown situations as analogous to situations where we already have background knowledge. One frequently useful way of doing this is when the new policy can be recast as equivalent to a change in the budget constraint that respondents face. The consequences of a new policy may be easier to predict if we can reduce it to equivalent changes in income and prices, whose effects are often well understood and well studied. Todd and Wolpin (2008) make this point and provide examples. In the labor supply case, an increase in the tax rate $t$ has the same effect as a decrease in the wage rate $w$, so that we can rely on previous literature to predict what will happen when tax rates are changed. In the case of Mexico’s PROGRESA conditional cash transfer program, Todd and Wolpin note that the subsidies paid to parents if their children go to school can be thought of as a combination of a reduction in children’s wage rates and an increase in parents’ income, which allows them to predict the results of the conditional cash experiment with limited additional assumptions. If this works, as it partially does in their analysis, the trial helps consolidate previous knowledge and contributes to an evolving body of theory and empirical, including trial, evidence.
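Arrow’s two-period argument, described earlier in this subsection, can be collected in a single line. What follows is only a restatement in the notation already used, reading $s_{ij}$ as the compensated derivative of period-$i$ demand with respect to the period-$j$ price; that reading is our assumption about the notation, and no new quantities are introduced.

```latex
\begin{align*}
S &= s_{11} + s_{12} && \text{(equal price changes in both periods)} \\
  &= s_{11} + s_{21} && \text{(Slutsky symmetry: } s_{12} = s_{21}\text{)}
\end{align*}
```

The trial identifies $s_{11}$ and the post-trial data identify $s_{21}$, so theory supplies the one term that the experiment cannot deliver on its own.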
The program of thinking about policy changes as equivalent to price and income changes has a long history in economics; much of rational choice theory can be so interpreted, see Deaton and Muellbauer (1980) for many examples. When this conversion is credible, and when a trial on some apparently unrelated topic can be modeled as equivalent to a change in prices and incomes, and when we can assume that people in different settings respond relevantly similarly to changes in prices and incomes, we have a ready-made framework for incorporating the trial results into previous knowledge, as well as for extending the trial results and using them elsewhere. Of course, all depends on the validity and credibility of the theory; people may not in fact think of a tax increase as a decrease in the price of leisure, and behavioral economics is full of examples where apparently equivalent stimuli generate non-equivalent outcomes. The embrace of behavioral economics by many of the current generation of trialists may account for their limited willingness to use conventional choice theory in this way; unfortunately, behavioral economics does not yet offer a replacement for the general framework of choice theory that is so useful in this regard.
Theory can also help with the problem we raised of delineating the population to which the trial results immediately apply and for thinking about moving from this population to the population of interest. Ashenfelter’s (1978) analysis is again a good illustration and predates much similar work in later literature. The income tax experiments offered participation in the trial to a random sample of the population of interest. Because there was no blinding and no compulsion, people who were randomized into the treatment group were free to choose to refuse treatment. As in many subsequent analyses, Ashenfelter supposes that people choose to participate if it is in their interest to do so, depending on what has become known in the RCT and Instrumental Variables literature as their own idiosyncratic “gain.” The simple labor supply model gives an approximate condition: if the treatment increases the tax rate from $t_0$ to $t_1$ with an offsetting increase in $G$, then an individual assigned to the experimental group will decline to participate if

$$(t_1 - t_0)\,w_0 h_0 + \tfrac{1}{2}\,s_{00}\,(t_1 - t_0) > G_1 - G_0 \qquad (7)$$

where subscript 1 refers to the treatment situation, 0 to the control, $h_0$ is hours worked, and $s_{00}$ is the (negative) utility-constant response of hours worked to the tax rate. If there is no substitution, the second term on the left-hand side is zero, and people will accept treatment if the increase in $G$ more than makes up for the increases in taxes payable, the “breakeven” condition. In consequence, those with higher earnings are less likely to accept treatment. Some better-off people with high substitution effects will also accept treatment if the opportunity to buy more cheap leisure is sufficient enticement.
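The participation condition (7) can be sketched numerically. The snippet below is a minimal illustration, not Ashenfelter’s own calculation: it transcribes (7) as printed, and the function name, the tax rates, the earnings figures, and the value of `s00` are all hypothetical. It also assumes that `s00` is expressed in units that make the two left-hand terms commensurable.

```python
def declines_treatment(t0, t1, w0, h0, s00, g0, g1):
    """Return True if condition (7), as printed, says the person declines.

    All arguments are hypothetical illustration values; s00 is assumed to be
    non-positive and scaled so that both left-hand terms are in money units.
    """
    dt = t1 - t0
    extra_tax = dt * w0 * h0        # extra tax due on pre-trial earnings
    substitution = 0.5 * s00 * dt   # second term of (7); zero if no substitution
    return extra_tax + substitution > g1 - g0


# Tax rate rises from 0.2 to 0.5 and the guarantee rises by 3,000,
# so with no substitution the breakeven level of earnings is 10,000.
low_earner = declines_treatment(0.2, 0.5, w0=8.0, h0=1000, s00=0.0, g0=0, g1=3000)
high_earner = declines_treatment(0.2, 0.5, w0=12.0, h0=1000, s00=0.0, g0=0, g1=3000)
# The same high earner with a strong (negative) substitution response.
high_earner_elastic = declines_treatment(
    0.2, 0.5, w0=12.0, h0=1000, s00=-5000.0, g0=0, g1=3000
)
print(low_earner, high_earner, high_earner_elastic)
```

With these made-up numbers the low earner accepts, the high earner declines, and the high earner with a large substitution response accepts after all, which is the selection pattern described in the text.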
The selective acceptance of treatment limits the analyst’s ability to learn about the better-off or low-substitution people who decline treatment but who would have to accept it if the policy were actually implemented. Both the ITT estimator and the “as treated” estimator that compares the treated and the untreated are affected, not just by the labor supply effects that the trial is designed to induce, but by the kind of selection effects that randomization is designed to eliminate. Of course, the analysis that leads to (7) can perhaps help us say something about this and help us adjust the trial estimates back to what we would like to know. Yet this is no easy matter because selection depends, not only on observables, such as pre-experimental earnings and hours worked, but on (much harder to observe) labor supply responses that likely vary across individuals. Paraphrasing Ashenfelter, we cannot estimate the effects of a permanent compulsory negative income tax program from a transitory voluntary trial without strong assumptions or additional evidence.

Much of the modern literature, for example on training programs, wrestles with the issue of exactly who is represented by the RCT results, see again Heckman, LaLonde and Smith (1999). When people are allowed to reject their randomly assigned treatment according to their own (real or perceived) individual advantage, we have come a long way from the random allocation in the standard conception of a randomized controlled trial. Moreover, the absence of blinding is common in social and economic RCTs, and while there are trials, such as welfare trials, that effectively compel people to accept their assignments, and some where the treatment is generous enough to do so, there are trials where subjects have much freedom and, in those cases, it is less than obvious to us what role, if any, randomization plays in warranting the results.
2.6 Scaling up: using the average for populations

A typical RCT, especially in the development context, is small-scale and local, for example in a few schools, clinics, or farms in a particular geographic, cultural, socio-economic setting. If successful according to a cost-effectiveness criterion, for example, it is a candidate for scaling-up, applying the same intervention for a much larger area, often a whole country, or sometimes even beyond, as when some treatment is considered for all relevant World Bank projects. The fact that the intervention might work differently at scale has long been noted in the economics literature, e.g. Garfinkel and Manski (1992), Heckman (1992), and Moffitt (1992), and is recognized in the recent review by Banerjee and Duflo (2009). We want here to emphasize the pervasiveness of such effects (failure of the trial results to replicate at a larger scale is likely to be the rule rather than the exception), as well as to note once again that, as in failures of transportability, this should not be taken as an argument against using RCTs, but only against the idea that effects at scale are likely to be the same as in the trial. Using RCT results is not the same as assuming the same results hold in all circumstances.

An example of what are often called general equilibrium effects comes from agriculture.
Suppose an RCT demonstrates that in the study population a new way of using fertilizer or insecticide had a substantial positive effect on, say, cocoa yields, so that farmers who used the new methods saw increases in production and in incomes compared to those in the control group. If the procedure is scaled up to the whole country, or to all cocoa farmers worldwide, the price will drop, and if the demand for cocoa is price inelastic, as is usually thought to be the case, at least in the short run, cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is that farmers do best when the harvest is small, not large. Of course, these considerations might not be decisive in deciding whether or not to promote the innovation, and there may still be long term gains if, for example, some farmers find something better to do than growing cocoa. But the basic point is that the scaled-up effect in this case is opposite in sign to the trial effect. The problem here is not with the trial results, which can be usefully incorporated into a more comprehensive market model that incorporates the responses estimated by the trial. The problem arises only if we assume that the aggregate looks like the individual. That other ingredients of the aggregate model must come from observational studies should not be a criticism, even for those who favor RCTs; it is simply the price of doing serious analysis.

There are many possible interventions that alter supply or demand whose effect, in aggregate, will change a price or a wage that is held constant in the original RCT. Education will change the supplies of skilled versus unskilled labor, with implications for relative wage rates.
Conditional cash transfers increase the demand for (and perhaps supply of) schools and clinics, which will change prices or waiting lines, or both. There are interactions between people that will operate only at scale. Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for those children who are left in the public schools, see the contrasting studies of Angrist et al (1999) and Hsieh and Urquiola (2002). Educational or training programs may benefit those who are treated, but harm those left behind; if the control group is selected from the latter, the RCT may generate a positive result in spite of hurting some and helping none; Crépon et al (2014) recognize the issue and show how to adapt an RCT to deal with it.

Scaling up can also disturb the political equilibrium. An exploitative government may not allow the mass transfer of money from abroad to a powerless segment of the population, though it may permit a small-scale RCT of cash transfers. Provision of healthcare by foreign NGOs may be successful in trials, but have unintended negative consequences at scale because of general equilibrium effects on the supply of healthcare personnel, or because it disturbs the nature of the contract between the people and a government that is using tax revenue to provide services. In India, the government spends large sums on food subsidies through a system (the PDS) that is both corrupt and inefficient, with much of the grain that is procured failing to find its way to the intended beneficiaries. Localized RCTs on whether or not families are better off with cash transfers are not informative about how politicians would change the amount of the transfer if faced with unanticipated inflation, and at least as important, whether the government could cut procurement from relatively wealthy and politically powerful farmers. Without a political and general equilibrium analysis, it is impossible to think about the effects of replacing food subsidies with cash transfers, see e.g. Basu (2010).
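The cocoa example from earlier in this subsection can be made concrete with a deliberately simple market model. The sketch below assumes a constant-elasticity demand curve with a price elasticity of 0.5 and a 20 percent yield gain; both numbers, the number of farmers, and the function names are hypothetical choices for illustration, not estimates from any trial.

```python
def price(total_output, demand_scale=1.0, elasticity=0.5):
    """Market price from inverting Q = demand_scale * P**(-elasticity)."""
    return (total_output / demand_scale) ** (-1.0 / elasticity)


def farmer_income(own_output, total_output, elasticity=0.5):
    """Revenue of one farmer who sells own_output at the market price."""
    return price(total_output, elasticity=elasticity) * own_output


n_farmers = 1000
base_yield = 1.0
gain = 0.20  # hypothetical yield gain measured in the trial

baseline = farmer_income(base_yield, n_farmers * base_yield)

# Trial: one farmer in a thousand is treated, so total output barely moves.
trial_total = n_farmers * base_yield + gain * base_yield
trial_ratio = farmer_income(base_yield * (1 + gain), trial_total) / baseline

# Scale-up: every farmer adopts, so total output rises by the full 20 percent.
scaled_total = n_farmers * base_yield * (1 + gain)
scaled_ratio = farmer_income(base_yield * (1 + gain), scaled_total) / baseline

print(f"income relative to baseline, trial:    {trial_ratio:.3f}")
print(f"income relative to baseline, scale-up: {scaled_ratio:.3f}")
```

Under these assumptions the treated farmer in the trial gains almost the full 20 percent, while universal adoption leaves every farmer with lower income than before. The trial estimate is not wrong; it is one input into the market model, and the elasticity has to come from elsewhere.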
Even in medicine, where biological interactions between people are less common than are social interactions in social science, interactions can be important; infectious diseases are an example, and immunization programs affect the dynamics of disease transmission through herd immunity, so that the effects on an individual depend on how many others are vaccinated, Fine and Clarkson (1986), Manski (2013, p52). The usual, if seldom correct, conception of an RCT in medicine is of a biological process (for example, the administration of aspirin after a heart attack) where the effect is thought to be similar across individuals, and where there are no interactions. Yet even here, the social and economic setting affects how drugs are actually used and the same issues can arise; the distinction between efficacy and effectiveness in clinical trials is in part recognition of the fact.

2.7 Drilling down: using the average for individuals

Just as there are issues with scaling-up, it is not obvious how to use the results from RCTs at the level of individual units, even individual units that were actually (or potentially) included in the trial. A well-conducted RCT delivers an average treatment effect for a well-defined population but, in general, that average does not apply to everyone. It is not true, for example, as argued in JAMA’s “Users’ guide to the medical literature” that “if the patient would have been enrolled in the study had she been there, that is she meets all of the inclusion criteria and doesn’t violate any of the exclusion criteria, there is little question that the results are applicable,” Guyatt et al (1994). Even more misleading are the often-heard statements that an RCT with an average treatment effect insignificantly different from zero has shown that the treatment works for no one, though such a conclusion would be better supported by a Fisher randomization test.
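A small simulation illustrates why an average treatment effect near zero does not show that the treatment works for no one. The potential outcomes below are invented for the purpose: the treatment helps half of the units and harms the other half by the same amount, so every unit is affected while the average is exactly zero.

```python
import random
from statistics import mean

rng = random.Random(0)
n = 10_000

# Hypothetical potential outcomes: the untreated outcome is 0 for everyone;
# treatment helps half of the units (+1) and harms the other half (-1).
y0 = [0.0] * n
y1 = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]

true_ate = mean(b - a for a, b in zip(y0, y1))
share_affected = mean(1.0 if b != a else 0.0 for a, b in zip(y0, y1))

# Randomize half of the units to treatment and compare group means.
treated = set(rng.sample(range(n), n // 2))
est_ate = (mean(y1[i] for i in treated)
           - mean(y0[i] for i in range(n) if i not in treated))

print(f"true ATE: {true_ate}, share of units affected: {share_affected}")
print(f"estimated ATE from one randomization: {est_ate:.3f}")
```

The difference in means recovers the (zero) average well, yet it says nothing about the distribution of individual effects, which here is as far from “no effect for anyone” as it could be.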
These issues are familiar to physicians practicing evidence-based medicine whose guidelines require “integrating individual clinical expertise with the best available external clinical evidence from systematic research,” Sackett et al (1996). Exactly what this means is unclear; physicians know much more about their patients than is allowed for in the ATE from the RCT (though, once again, stratification in the trial is likely to be helpful) and they often have intuitive expertise from long practice that they rely on to help them identify features in a particular patient that are likely to affect the effectiveness of a given treatment for that patient. But there is an odd balance being struck here. These judgments are deemed admissible in dealing with the individual patient, at least for discussion with the patient as possible considerations, but they don’t add up to evidence to be made publicly available, with the usual cautions about credibility, by the standards adopted by most EBM sites. It is also true that physicians can have prejudices and “knowledge” that might be anything but. Clearly, there are situations where forcing practitioners to follow the average will do better, even for individual patients, and others where the opposite is true, see Kahneman and Klein (2009).
Whether or not averages are useful to individuals raises the same issue in social science research. Imagine two schools, St Joseph’s and St Mary’s, both of which were included in an RCT of a classroom innovation, or at least were eligible to be so. The innovation is successful on average, but should the schools adopt it? Should St Mary’s be influenced by a previous attempt in St Joseph’s that was judged a failure? Many would dismiss this experience as anecdotal and ask how St Joseph’s could have known that it was a failure without benefit of “rigorous” evidence. Yet if St Mary’s is like St Joseph’s, with a similar mix of pupils, a similar curriculum, and similar academic standing, might not St Joseph’s experience be more relevant to what might happen at St Mary’s than is the positive average from the RCT? And might it not be a good idea for the teachers and governors of St Mary’s to go to St Joseph’s and find out what happened and why? They may be able to observe the mechanism of the failure, if such it was, and figure out whether the same problems would apply for them, or whether they might be able to adapt the innovation to make it work for them, perhaps even more successfully than the positive average in the trial.

Once again, these questions are unlikely to be simply answered in practice; but, as with transportability, there is no serious alternative to trying. Assuming that the average works for you will often be wrong, and it will at least sometimes be possible to do better. As in the medical case, the advice to individual schools often lacks specificity. For example, the US Institute of Education Sciences has provided a “user-friendly” guide to practices supported by rigorous evidence, US Department of Education (2003). The advice, which is very similar to recommendations in development economics, is that the intervention be demonstrated effective through well-designed RCTs in more than one site of implementation, and that “the trials should demonstrate the intervention’s effectiveness in school settings similar to yours” (2003, p.17). No operational definition of “similar” is provided.
We note finally that these caveats, which apply to individuals (or schools) even if they were in the trial, provide another reason why the concept of “external” validity is unhelpful. The real issue is how to use the findings of a trial in new settings, including settings included in the trial; external validity in the sense of invariance of the ATE emphasizes simple replication, which guarantees nothing, while ignoring the possibility that lack of replication can be a key to understanding.

2.8 Examples and illustrations from economics

Our arguments in this Section should not be controversial, yet we believe that they represent an approach that is different from most current practice. To document this and to fill out the arguments, we provide some examples. While these are occasionally critical, our purpose is constructive; indeed, we believe that misunderstandings about how to use RCTs have artificially limited their usefulness, as well as alienated some who would otherwise use them.

Conditional cash transfers (CCTs) are interventions that have been tested using RCTs (and other RCT-like methods) and are often cited as a leading example of how an evaluation with strong internal validity leads to a rapid spread of the policy, e.g. Angrist and Pischke (2010) among many others. Think through the causal chain that is required for CCTs to be successful: people must like money, they must like (or at least not object too much to) their children being educated and vaccinated, there must exist schools and clinics that are close enough and well enough staffed to do their job, and the government or agency that is running the scheme must care about the wellbeing of families and their children. That such conditions hold in a wide range of (although certainly not all) countries makes it unsurprising that CCTs “work” in many replications, though they certainly will not work in places where the schools and clinics do not exist, Levy (2001), nor in places where people strongly oppose education or vaccination.
Similarly, given that the helping factors will operate with different strengths and effectiveness in different places, it is also not surprising that the size of the ATE differs from place to place; for example, Vivalt’s AidGrade website lists 29 estimates from a range of countries of the standardized (divided by local standard deviation of the outcome) effects of conditional cash transfers on school attendance; all but four show the expected positive effect, and the range runs from –8 to +38 percentage points. Even in this leading case, where we might reasonably conclude that CCTs “work” in getting children into school, it would be hard to calculate credible cost-effectiveness numbers, or to come to a general conclusion about whether CCTs are more or less cost effective than other possible policies. Both costs and effect sizes can be expected to differ in new settings, just as they have in observed ones, making these predictions difficult.

The range of estimates illustrates that the simple view of external validity (that the ATE should transport from one place to another) is not well defined. AidGrade uses standardized measures of effect size divided by standard deviation of outcome at baseline, as does the major multi-country study by Banerjee et al (2015). But we might prefer measures that have an economic interpretation, such as additional months of schooling per $100 spent (for example if a donor is trying to decide where to spend, see below). Nutrition might be measured by height, or by the log of height. Even if the ATE by one measure carries across, it will only do so using another measure if the relationship between the two measures is the same in both situations. This is exactly the sort of thing that a formal analysis of transportability forces us to think about. (Note also that the ATE in the original RCT can differ depending on whether the outcome is measured in levels or in logs; the two ATEs could even have different signs.)
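The closing remark about levels and logs is easy to verify with a two-unit example. The potential outcomes below are invented purely for illustration: treatment lowers the outcome a great deal for one unit and raises it for the other.

```python
from math import log
from statistics import mean

# Hypothetical potential outcomes for two units.
y0 = [10.0, 10.0]   # outcomes without treatment
y1 = [1.0, 25.0]    # outcomes with treatment

ate_levels = mean(y1) - mean(y0)
ate_logs = mean([log(y) for y in y1]) - mean([log(y) for y in y0])

print(f"ATE in levels: {ate_levels:+.3f}")
print(f"ATE in logs:   {ate_logs:+.3f}")
```

The average effect is positive in levels and negative in logs, so even the sign of “the” ATE depends on the scale in which the outcome is measured.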
Deworming is surely more complicated than conditional cash transfers though not because anyone disputes the desirability of removing parasitical worms or the biological efficacy of the medicines, at least if they are repeatedly and effectively administered; that is the part of the causal process that is transportable from one place to another. Yet nutritional or school attendance outcomes depend on reinfection from one person to another, which depends on local customs about defecation (which vary from place to place and are subject to religious and cultural factors), particularly on the extent of open defecation and the density of population, on whether or not children wear shoes, and on the availability and use of public and private sanitation; this last was crucial in the elimination of hookworm in the southern states of the U.S. according to Stiles (1939). Temperature may also be important; indeed, such “macro” variables are likely to be important in a wide range of medical, employment, and production trials, Rosenzweig and Udry (2016). There are two prominent positive studies in the economics literature, one in Kenya, Kremer and Miguel (2000) and one in India, Bobonis, Miguel and Puri-Sharma (2006); these are often cited as examples of the power of RCTs to come up with the “right” answer, for example by Karlan and Appel (2008). Yet the Cochrane Collaboration review of deworming and schooling, Taylor-Robinson et al (2015), which reviews one trial (from India) covering more than a million participants, and 44 others covering 67,672 participants, including Kremer and Miguel (2004), concludes that there is “substantial evidence” that deworming shows no benefit in nutritional status, hemoglobin, cognition, school performance or death. The validity of this meta-analysis is disputed by Croke et al (2016). A replication, Aiken et al (2015), and a reanalysis (using different methods) of Miguel and Kremer’s original data by Davey et al (2015) concluded that the study “provided some evidence, but with high risk of bias,” provoking a lengthy exchange, Hicks et al (2015) and Hargreaves et al (2015). Most of the differences in results come from different methodological choices, themselves largely based on disciplinary traditions, rather than from the effects of mistakes or errors. In an impressive and clear reanalysis, Humphreys (2015) argues that one puzzling feature of Miguel and Kremer’s results is the absence of any clear effect of deworming on health, as was the case in the large Indian RCT. Yet the effects of deworming on education, which are the main target of the paper, presumably work through health, so that the absence of health effects (a failure of expected mediators) is a puzzle, see also Miguel, Kremer and Hicks (2015), and Ahuja et al (2015). Recall too our earlier discussion of the difficulty of interpreting the standard errors of the original study in the absence of randomization.

It is not our purpose here to try to adjudicate these competing claims but rather to relate this work to our general argument. First, it is not clear that there is a right answer to be discovered; given the causal chains involved, deworming might be helpful in one place but unhelpful in another. Yet the focus of the debate is almost entirely on internal validity, on whether the original studies were correctly done. The Cochrane review, in line with this, and in line with much meta-analysis of trials, seems to suppose that there is a single effect to be uncovered that, once established, will be invariant to local and environmental differences. External validity, it seems, is implied by internal validity. Indeed, Chalmers, one of the founders of the Cochrane Collaboration, has explicitly argued (in response to one of us) that, in the absence of strong reasons to the contrary, results should be taken as applicable everywhere, Pettigrew and Chalmers (2011).

Second, the debate makes it clear that the practice of RCTs in economic development has done little to fulfill the original promise that their simplicity (how hard is it to subtract one mean from another?) would dispose of the methodological and econometric disputes that characterize so many observational studies and were thought to be one of their main flaws.
While RCTs tend to take some contentious issues of identification off the table, they leave much to be disputed, including the handling of factors that interact with treatment effects, the appropriate level of randomization, the calculation of standard errors, the choice of outcome measure, the inclusion criteria for the sample, placebo and Hawthorne effects, and much more. The claim that RCTs cut through the usual econometric disputes to deliver to policymakers a simple, convincing, and easily understood answer is simply false. The deworming debates are perhaps the leading illustration.

Much of the development literature, like the medical literature, works with the view of external validity that, unless there is evidence to the contrary, the direction and size of treatment effects can be transported from one place to another. The J-PAL website reports its findings under a general head of policy relevance, subdivided by a selection of topics. Under each topic, there is a list of relevant RCTs from a range of different settings around the world. These are conveniently converted into a common cost-effectiveness measure so that, for example, under ‘education’, subhead ‘student participation’, there are four studies from Africa: on informing parents about the returns to education in Madagascar, and on deworming, on school uniforms, and on merit scholarships, the last three all from Kenya. The units of measurement are additional years of student education per $100, and among these four studies, the average effect sizes of spending $100 are 20.7 years, 13.9 years, 0.71 years and 0.27 years respectively. (Note that this is a different, and superior, standardization from the effect size standardization discussed above.)
What can we conclude from such comparisons? For a philanthropic donor interested in education, and if marginal and average effects are the same, they might indicate that the best place to devote a marginal dollar is in Madagascar, where it would be used to inform parents about the value of education. This is certainly useful, but it is not as useful as statements that information or deworming programs are everywhere more cost-effective than programs involving school uniforms or scholarships, or if not everywhere, at least over some domain, and it is these second kinds of comparison that would genuinely fulfill the promise of “finding out what works.” But such comparisons only make sense if we can transport the results from one place to another, if the Kenyan results also hold in Madagascar, Mali, or Namibia, or some other list of African or non-African places. J-PAL’s manual for cost-effectiveness, Dhaliwal et al (2012), explains in (entirely appropriate) detail how to handle variation in costs across sites, noting variable factors such as population density, prices, exchange rates, discount rates, inflation, and bulk discounts. But it gives short shrift to cross-site variation in the size of average treatment effects, which play an equal part in the calculations of cost effectiveness. The manual briefly notes that diminishing returns (or the “last-mile” problem) might be important in theory, but argues that the baseline levels of outcomes are likely to be similar in the pilot and replication areas, so that the average treatment effect can be safely transported as is. All of this lacks a justification for transportability, some understanding of when results transport, when they do not, or better still, how they should be modified to make them transportable.
One of the largest and most technically impressive of the development RCTs is by Banerjee et al (2015), which tests a “graduation” program designed to permanently lift extremely poor people from poverty by providing them with a gift of a productive asset (from guinea pigs, (regular) pigs, sheep, goats, or chickens depending on locale), training and support, life skills coaching, as well as support for consumption, saving, and health services; the idea is that this package of aid can help people break out of poverty traps in a way that would not be possible with one intervention at a time. Comparable versions of the program were tested in Ethiopia, Ghana, Honduras, India, Pakistan, and Peru and, excepting Honduras (where the chickens died), the authors find largely positive and persistent effects, with similar (standardized) effect sizes, for a range of outcomes (economic, mental and physical health, and female empowerment). One site apart, essentially everyone accepted their assignment, so that many of the familiar caveats do not apply. Replication of positive ATEs over such a wide range of places certainly provides proof of concept for such a scheme. Yet Bauchet, Morduch, and Ravi (2015) fail to replicate the result in South India, where the control group got access to much the same benefits, what Heckman, Hohman, and Smith (2000) call ‘substitution bias’. Even so, the results are important because, although there is a longstanding interest in poverty traps, many economists have long been skeptical of their existence or that they could be sprung by such aid-based policies. In this sense, the study is an important contribution to the theory of economic development; it tests a theoretical proposition and will (or should) change minds about it.
A number of difficulties remain. As the authors note, such trials cannot tell us which component of the treatment accounted for the results, or which might be dispensable (a much more expensive multifactorial trial would be required), though it seems likely in practice that the costliest component, the repeated visits for training and support, will be the first to be cut by cash-strapped politicians or administrators. And as noted, it is unclear what should count as (simple) replication in international comparisons; it is hard to think of the uses of standardized effect sizes, except to document that effects exist everywhere and that they are similarly large relative to local variation in such things.
The effect size, the average treatment effect expressed in numbers of standard deviations of the original outcome, though conveniently dimensionless, has little to recommend it. As with much of RCT practice, it strips out any economic content (no rates of return, or benefits minus costs) and it removes any discipline on what is being compared. Apples and oranges become immediately comparable, as do treatments whose inclusion in a meta-analysis is limited only by the imagination of the analysts in claiming similarity. In psychology, where the concept originated, there are endless disputes about what should and should not be pooled in a meta-analysis. Beyond that, as argued by Simpson (2016), restrictions on the trial sample (often good practice to reduce background noise and to help detect an effect) will reduce the baseline standard deviation and inflate the effect size. More generally, effect sizes are open to manipulation by exclusion rules. It makes no sense to claim replicability on the basis of effect sizes, let alone to use them to rank projects.
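The argument that sample restrictions inflate the effect size can be made concrete with a small sketch. It assumes, purely for illustration, that the treatment adds the same one unit to everyone's outcome, so that the raw average treatment effect is identical in the full and the restricted sample; the data and the function name are hypothetical and do not come from Simpson (2016) or from any trial.

```python
from statistics import mean, stdev

def standardized_effect(control, treated):
    """ATE expressed in units of the control-group standard deviation."""
    return (mean(treated) - mean(control)) / stdev(control)

RAW_EFFECT = 1.0  # hypothetical: treatment adds one unit for everyone

# Hypothetical baseline outcomes for a broad population.
full_control = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
# Exclusion rule: keep only participants with mid-range baseline outcomes.
restricted_control = [y for y in full_control if 4.0 <= y <= 8.0]

full_es = standardized_effect(
    full_control, [y + RAW_EFFECT for y in full_control])
restricted_es = standardized_effect(
    restricted_control, [y + RAW_EFFECT for y in restricted_control])

print(round(full_es, 3), round(restricted_es, 3))  # 0.231 0.5
```

The raw effect is one unit in both cases, yet the standardized effect size more than doubles once the exclusion rule narrows the baseline variation, which is why two trials with different inclusion criteria cannot be compared, or ranked, on effect sizes alone.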
The graduation study can be taken as the closest to fulfilling the “finding out what works” aim of the RCT movement in development. Yet it is silent on perhaps the crucial aspect for policy, which is that the trial was run entirely in partnership with NGOs, whereas what we would like to know is whether it could be replicated by governments, including those governments that are incapable of getting doctors, nurses, and teachers to show up to clinics or schools, Chaudhury et al (2005), Banerjee, Deaton and Duflo (2004), or of regulating the quality of medical care in either the public or private sectors, Filmer, Hammer and Pritchett (2000) or Das and Hammer (2005). In fact, we already know a great deal about “what works.” Vaccinations work, maternal and child healthcare services work, and classroom teaching works. Yet knowing this does not get those things done. Adding another program that works under ideal conditions is useful only where such conditions exist, and that would likely be unnecessary when they exist. Finding out what works is not the magic key to economic development. Technical knowledge, though always worth having, requires suitable institutions if it is to do any good.
A similar point is documented in the contrast between a successful trial that used cameras and threats of wage reductions to incentivize attendance of teachers in schools run by an NGO in Rajasthan in India, Duflo, Hanna, and Ryan (2012), and the subsequent failure of a follow-up program in the same state to tackle mass absenteeism of health workers, Banerjee, Duflo, and Glennerster (2008). In the schools, the cameras and timekeeping worked as intended, and teacher attendance increased. In the clinics, there was a short-run effect on nurse attendance, but it was quickly eliminated. (The ability of agents eventually to undermine policies that are initially effective is common enough and not easily handled within an RCT.) In both trials, there were incentives to improve attendance, and there were incentives to find a way to sabotage the monitoring and restore workers to their accustomed positions; the force of these incentives is a “high-level” cause, like gravity, or the principle of the lever, that works in much the same way everywhere. For the clinics, some sabotage was direct (the smashing of cameras) and some was subtler, when government supervisors provided official, though essentially specious, reasons for missing work. We can only conjecture why the causality was switched in the move from NGO to government; we suspect that working for a highly-respected local NGO is a different contract from working for the government, where not showing up for work is widely (if informally) understood to be part of the deal. The incentive lever works when it is wired up right, as with the NGOs, but not when the wiring cuts it out, as with the government. Knowing “what works” in the sense of the treatment effect on the trial population is of limited value without understanding the political and institutional environment in which it is set. This underlines the need to understand the underlying social, economic, and cultural structures, including the incentives and agency problems that inhibit service delivery, that are required to support the causal pathways that we should like to see at work.
Trials in economic development are susceptible to the critique that they take place in artificial environments. Drèze (2016) notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the “treators,” often from abroad, rather than because of the people who will be called to work it in reality.
There is also much to be learned from many years of economic trials in the United States, particularly from the work of the Manpower Demonstration Research Corporation (now known by its initials MDRC), from the early negative income tax trials, as well as from the Rand Health Experiment. Following the income tax trials, MDRC has run many randomized trials since the 1970s, mostly for the Federal government but also for individual states and for Canada; see the thorough and informative account by Gueron and Rolston (2013) for the factual information underlying the following discussion. MDRC’s program, like that of J-PAL in development, is intended to find out “what works” in the state and federal welfare programs. These programs are conditional cash transfers in which poor recipients are given cash provided they satisfy certain conditions, which are often the subject of the trial. Should there be work requirements? Should there be remedial education before work requirements? What are the benefits and costs of various alternatives, both to the recipients and to the local and federal taxpayers? All of these programs are deeply politicized, with sharply different views over both facts and desirability. Many engaged in these disputes feel certain of what should be done and what its consequences will be so that, by their lights, control groups are unethical because they deprive some people of what the advocates “know” will be certain benefits. Given this, it is perhaps surprising that RCTs have become the accepted norm for this kind of policy evaluation in the US.
The reasons owe much to political institutions, as well as to the common faith that RCTs can reveal the truth. At the Federal level, prospective policies are vetted by the non-partisan Congressional Budget Office, which makes its own estimates of the budgetary implications of the program. Ideologues whose programs score poorly by the CBO have an incentive to support an RCT, not to convince themselves, but to convince their opponents; once again, RCTs are especially valuable when your opponents do not share your prior. And control groups are easier to put in place when there are insufficient funds to cover the whole population. There was also a widespread and largely uncritical belief that RCTs always give the right answer, at least for the budgetary implications, which, rather than the wellbeing of the recipients, were often the primary (and indeed sometimes the only) concern; note that all of these trials are on poor people by rich people who are typically more concerned with cost than with the wellbeing of the poor, Greenberg, Shroder and Onstott (1999). MDRC’s trials could therefore be effective dispute-reconciliation mechanisms both for those who saw the need for evidence and for those who did not (except instrumentally). Note that the outcome here fits with our “public health” case; what the politicians need to know is not the outcomes for individuals, or even how the outcomes in one state might transport to another, but the average budgetary cost in a specific place for each poor person treated, something that a good RCT conducted on a representative sample of the target population is equipped to deliver, at least in the absence of general equilibrium effects, timing effects, etc.
These RCTs by MDRC and other contractors deserve much credit. They have demonstrated both the feasibility of large-scale social trials, including the possibility of randomization in these settings (where many participants were hostile to the idea), as well as their usefulness to policy makers. They also seem to have changed beliefs, for example in favor of the desirability of work requirements as a condition of welfare, even among many of those who were originally opposed. There are also limitations; the trials appear to have had at best a limited influence on scientific thinking about behavior in labor markets. The results of similar programs have often been different across different sites, and there has to date been no firm understanding of why; indeed, the trials are not designed to reveal this, Moffitt (2004). Finally, and perhaps crucially for the potential contribution to economic science, there has been little success in understanding either the underlying structures or chains of causation, in spite of a determined effort from the very beginning to peer into the black boxes. Without such mechanisms, transportability is always in doubt, it is impossible for policy makers or academics to purposively improve the policies, and the contributions to cumulative science are severely limited.
The RAND health experiment, Manning et al (1988a, b), provides a different but equally instructive story, if only because its results have permeated the academic and policy discussions about healthcare ever since. It was originally designed to test the question of whether more generous insurance would cause people to use more medical care and, if so, by how much. The incentive effects are hardly in doubt today; the immortality of the study comes rather from the fact that its multi-arm (response surface) design allowed the calculation of an elasticity for the study population, that medical expenditures changed by –0.1 to –0.2 percent for every one percent increase in the copayment. According to Aron-Dine, Einav, and Finkelstein (2013), it is this dimensionless and thus apparently transportable number that has been used ever since to discuss the design of healthcare policy; the elasticity has come to be treated as a universal constant. Ironically, they argue that the estimate cannot be replicated in recent studies, and it is even unclear that it is firmly based on the original evidence. This account points, once again, to the central importance of transportability for the long-term usefulness of a trial. Here, the simple direct transportability of the result seems to have been largely illusory though, as we have argued, this does not mean that more complex constructions based on the results of the trial would not have done better.
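What it means to treat the elasticity as a universal constant can be sketched in a few lines. The example assumes a constant-elasticity relationship between spending and the copayment, which is a functional-form assumption made here for illustration rather than anything established by the trial; the function name, the baseline spending, and the size of the copayment change are hypothetical.

```python
def spending_after_copay_change(baseline_spending, copay_ratio, elasticity):
    """Constant-elasticity prediction: spending scales by copay_ratio ** elasticity."""
    return baseline_spending * copay_ratio ** elasticity

BASELINE = 1000.0   # hypothetical annual medical spending per person
COPAY_RATIO = 1.10  # hypothetical policy: copayment raised by 10 percent

# The two ends of the elasticity range quoted in the text.
low = spending_after_copay_change(BASELINE, COPAY_RATIO, -0.1)
high = spending_after_copay_change(BASELINE, COPAY_RATIO, -0.2)

print(round(low, 1), round(high, 1))  # about 990.5 and 981.1
```

The calculation is trivial once the number is taken as given, which is precisely its appeal; the prediction is only as good as the assumption that the same elasticity and the same functional form hold in the new population and the new insurance design.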
Conclusions
RCTs are the ultimate in credible estimation of average treatment effects in the population being studied because they make so few assumptions about heterogeneity, causal structure, choice of variables, and functional form. They are truly nonparametric. And indeed, this is sometimes just what we want, particularly where we have little credible prior information. RCTs are often convenient ways to introduce experimenter-controlled variance (if you want to see what happens, then kick it and see, twist the lion’s tail), but note that many experiments, including many of the most important (and Nobel Prize winning) experiments in economics, do not and did not use randomization, Harrison (2013), Svorencik (2015). But the credibility of the results, even internally, can be undermined by excessive heterogeneity in responses, and especially when the distribution of effects is asymmetric, where inference on means can be hazardous. Ironically, the price of the credibility in RCTs is that all we get are means. Yet, in the presence of outliers, means themselves do not provide the basis for reliable inference. And randomization in and of itself does nothing unless the details are right; purposive selection into the experimental population, like purposive selection into and out of assignment, undermines inference in just the same way as does selection in observational studies. Lack of blinding, whether of participants, trialists, data collectors, or analysts, undermines inference by permitting factors other than the treatment to affect the outcome, akin to a failure of exclusion restrictions in instrumental variable analysis.
The lack of structure can become seriously disabling when we try to use RCT results outside of a few contexts, such as program evaluation, hypothesis testing, or establishing proof of concept. Beyond that, we are in trouble. We cannot use the results to help make predictions elsewhere without more structure, without more prior information, and without having some idea of what makes treatment effects vary from place to place, or time to time. There is no option but to commit to some causal structure if we are to know how to use RCT evidence elsewhere, or to use the estimates out of the original context. Simple generalization and simple extrapolation just do not cut the mustard. This is true of any study, experimental or observational. But observational studies are familiar with, and routinely work with, the sort of assumptions that RCTs claim to avoid, so that if the aim is to use empirical evidence, any credibility advantage that RCTs have in estimation is no longer operative.
Yet once that commitment has been made, RCT evidence can be extremely useful, pinning down part of a structure, helping to build stronger understanding and knowledge, and helping to assess welfare consequences. As our examples show, this can often be done without committing to the full complexity of what are often thought of as structural models. Yet without the structure that allows us to place RCT results in context, or to understand the mechanisms behind those results, not only can we not transport whether “it works” elsewhere, but we cannot do the standard stuff of economics, which is to say whether or not the intervention is actually welfare improving; see Harrison (2014) for a vivid account that sharply identifies this and other issues. Without knowing why things happen and why people do things, we run the risk of worthless casual (“fairy story”) causal theorizing and have essentially given up on one of the central tasks of economics.
We must back away from the refusal to theorize, from the exultation in our ability to handle unlimited heterogeneity, and actually SAY something. Perhaps paradoxically, unless we are prepared to make assumptions, and to say what we know, making statements that will be incredible to some, all the credibility of the RCT is for naught.
In the specific context of development that has concerned us here, RCTs have proven their worth in providing proofs of concept and at testing predictions that some policies must always work or can never work. But, as elsewhere in economics, we cannot find out why something works by simply demonstrating that it does work, no matter how often, which leaves us uninformed as to whether the policy should be implemented. Beyond that, small scale, demonstration RCTs are not capable of telling us what would happen if these policies were implemented to scale, of capturing unintended consequences that typically cannot be included in the protocols, or of modeling what will happen if schemes are implemented by governments, whose motives and operating principles are different from the NGOs who typically run trials. While it is true that abstract knowledge is always likely to be beneficial to economic development, successful development depends on institutions and on politics, matters on which RCTs have little to say. In the end, RCTs are one of the many external technical fixes that have meandered off and on the development stage since the Second World War, including building infrastructure, getting prices right, and service delivery, none of which have faced up to the essential domestic political foundations for development.
Citations
Ahuja, Amrita, Sarah Baird, Joan Hamory Hicks, Michael Kremer, Edward Miguel, and Shawn Powers, 2015, “When should governments subsidize health? The case of mass deworming,” World Bank Economic Review, 29, S9–S24.
Aigner, Dennis J., 1985, “The residential electricity time-of-use pricing experiments. What have we learned?” in David A. Wise and Jerry A. Hausman, Social experimentation, Chicago, Il. Chicago University Press for National Bureau of Economic Research, 11–54.
Aiken, Alexander M., Calum Davey, James R. Hargreaves and Richard J. Hayes, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a pure replication,” International Journal of Epidemiology, 0(0), 1–9.
Al-Ubaydli, Omar, and John A. List, 2013, “On the generalizability of experimental results in economics,” in G. Frechette and A. Schotter, Methods of modern experimental economics, Oxford University Press.
Altman, Douglas G., 1985, “Comparability of randomized groups,” Journal of the Royal Statistical Society, Series D (The Statistician), 34(1), Statistics in health, 125–36.
Angrist, Joshua D., 2004, “Treatment effect heterogeneity in theory and practice,” Economic Journal, 114, C52–C83.
Angrist, Joshua D., Eric Bettinger, Erik Bloom, Elizabeth King and Michael Kremer, 2002, “Vouchers for private schooling in Colombia: evidence from a randomized natural experiment,” American Economic Review, 92(5), 1535–58.
Angrist, Joshua D., and Jörn-Steffen Pischke, 2010, “The credibility revolution in empirical economics: how better research design is taking the con out of econometrics,” Journal of Economic Perspectives, 24(2), 3–30.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein, 2013, “The RAND health insurance experiment, three decades later,” Journal of Economic Perspectives, 27(1), 197–222.
Arrow, Kenneth J., 1975, “Two notes on inferring long run behavior from social experiments,” Document No. P-5546, Santa Monica, CA. Rand Corporation.
Ashenfelter, Orley, 1978, “Estimating the effect of training programs on earnings,” Review of Economics and Statistics, 60(1), 47–57.
Ashenfelter, Orley, 1978, “The labor supply response of wage earners,” in John L. Palmer and Joseph A. Pechman, eds., Welfare in rural areas: the North Carolina–Iowa Income Maintenance Experiment, Washington, DC. The Brookings Institution, 109–38.
Attanasio, Orazio, Costas Meghir, and Ana Santiago, 2012, “Education choices in Mexico: using a structural model and a randomized experiment to evaluate PROGRESA,” Review of Economic Studies, 79(1), 37–66.
Attanasio, Orazio, Sarah Cattan, Emla Fitzsimons, Costas Meghir, and Marta Rubio Codina, 2015, “Estimating the production function for human capital: results from a randomized controlled trial in Colombia,” London. Institute for Fiscal Studies, Working Paper no W15/06.
Bahadur, R. R., and Leonard J. Savage, 1956, “The non-existence of certain statistical procedures in nonparametric problems,” Annals of Mathematical Statistics, 27, 1115–22.
Banerjee, Abhijit, Sylvain Chassang, Sergio Montero, and Erik Snowberg, 2016, “A theory of experimenters,” processed, July 2016.
Banerjee, Abhijit, Sylvain Chassang, and Erik Snowberg, 2016, “Decision theoretic approaches to experiment design and external validity,” Cambridge, MA. NBER Working Paper no 22167, April.
Banerjee, Abhijit, Angus Deaton, and Esther Duflo, 2004, “Health care delivery in rural Rajasthan,” Economic and Political Weekly, 39(9), 944–9.
Banerjee, Abhijit, and Esther Duflo, 2012, Poor economics: a radical rethinking of the way to fight global poverty, Public Affairs.
Banerjee, Abhijit, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christopher Udry, 2015, “A multifaceted program causes lasting progress for the very poor: evidence from six countries,” Science, 348(6236), 1260799.
Banerjee, Abhijit, Esther Duflo, and Rachel Glennerster, 2008, “Putting a band-aid on a corpse: incentives for nurses in the Indian public health care system,” Journal of the European Economic Association, 6(2–3), 487–500.
Banerjee, Abhijit V., and Ruimin He, 2003, “The World Bank of the future,” American Economic Review, 93(2), 39–44.
Bauchet, Jonathan, Jonathan Morduch and Shamika Ravi, 2015, “Failure vs displacement: why an innovative anti-poverty program showed no net impact in South India,” Journal of Development Economics, 116, 1–16.
Basu, Kaushik, 2010, “The economics of foodgrain management in India,” Ministry of Finance, Delhi. http://finmin.nic.in/workingpaper/Foodgrain.pdf
Bloom, Howard S., Carolyn J. Hill, and James A. Riccio, 2005, “Modeling cross-site experimental differences to find out why program effectiveness varies,” in Howard S. Bloom, ed., Learning more from social experiments: evolving analytical approaches, New York, NY. Russell Sage.
Bobonis, Gustavo, Edward Miguel, and Charu Puri-Sharma, 2006, “Anemia and school participation,” Journal of Human Resources, 41(4), 692–721.
Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a and Justin Sandefur, 2013, “Scaling up what works: experimental evidence on external validity in Kenyan education,” Washington, DC. Center for Global Development, Working Paper 321.
Bothwell, Laura E., and Scott H. Podolsky, 2016, “The emergence of the randomized, controlled trial,” New England Journal of Medicine, 375(6), 501–4. doi:10.1056/NEJMp1604635
Campbell, D. T., and J. C. Stanley, 1963, Experimental and quasi-experimental designs for research. Chicago. Rand McNally.
Cartwright, Nancy, 1994, Nature’s capacities and their measurement. Oxford. Clarendon Press.
Cartwright, Nancy, and Jeremy Hardie, 2012, Evidence based policy: a practical guide to doing it better, Oxford. Oxford University Press.
Chalmers, Iain, 2001, “Comparing like with like: some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments,” International Journal of Epidemiology, 30, 1156–64.
Chalmers, Iain, 2003, “Fisher and Bradford Hill: theory and pragmatism?” International Journal of Epidemiology, 32, 922–24.
Chassang, Sylvain, Gerard Padró i Miguel, and Erik Snowberg, 2012, “Selective trials: a principal–agent approach to randomized controlled experiments,” American Economic Review, 102(4), 1279–1309.
Chassang, Sylvain, Erik Snowberg, Ben Seymour, and Cayley Bowles, 2015, “Accounting for behavior in treatment effects: new applications for blind trials,” PLoS One, 10(6), e0127227. doi:10.1371/journal.pone.0127227.
Chaudhury, Nazmul, Jeffrey Hammer, Michael Kremer, Karthik Muralidharan and F. Halsey Rogers, 2005, “Missing in action: teacher and health worker absence in developing countries,” Journal of Economic Perspectives, 19(4), 91–116.
Chyn, Eric, 2016, “Moved to opportunity: the long-run effect of public housing demolition on labor market outcomes of children,” University of Michigan. http://www-personal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf
Conlisk, John, 1973, “Choice of response functional form in designing subsidy experiments,” Econometrica, 41(4), 643–56.
Crépon, Bruno, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora, 2014, “Do labor market policies have displacement effects? evidence from a clustered randomized experiment,” Quarterly Journal of Economics, 128(2), 531–80.
Croke, Kevin, Joan Hamory Hicks, Eric Hsu, Michael Kremer, and Edward Miguel, 2016, “Does mass deworming affect children’s nutrition? Meta analysis, cost effectiveness, and statistical power,” Cambridge, MA. NBER Working Paper No. 22382 (July).
Cronbach, Lee J., S. R. Ambron, S. M. Dornbusch, R. D. Hess, R. C. Hornick, D. C. Phillips, D. F. Walker, and S. S. Weiner, 1980, Towards reform of program evaluation, San Francisco, Jossey-Bass.
Das, Jishnu and Jeffrey Hammer, 2005, “Which doctor? Combining vignettes and item response to measure clinical competence,” Journal of Development Economics, 78, 348–83.
Davey, Calum, Alexander M. Aiken, Richard J. Hayes, and James R. Hargreaves, 2015, “Re-analysis of health and educational impacts of a school-based deworming programme in western Kenya: a statistical replication of a cluster quasi-randomized stepped wedge trial,” International Journal of Epidemiology, 0(0), 1–12.
Deaton, Angus, and John Muellbauer, 1980, Economics and consumer behavior, New York. Cambridge University Press.
Dhaliwal, Iqbal, Esther Duflo, Rachel Glennerster, and Caitlin Tulloch, 2012, “Comparative cost-effectiveness analysis to inform policy in developing countries: a general framework with applications for education,” J-PAL, MIT, December 3rd. http://www.povertyactionlab.org/publication/cost-effectiveness
Drèze, Jean, 2016, Personal email communication.
Duflo, Esther, Rema Hanna, and Stephen P. Ryan, 2012, “Incentives work: getting teachers to come to school,” American Economic Review, 102(4), 1241–78.
Duflo, Esther, and Michael Kremer, 2008, “Use of randomization in the evaluation of development effectiveness,” in William Easterly, ed., Reinventing foreign aid. Washington, DC. Brookings, 93–120.
Dynarski, Susan, 2015, “Helping the poor in education: the power of a simple nudge,” New York Times, Jan 17, 2015.
Fine, Paul E. M., and Jacqueline A. Clarkson, 1986, “Individual versus public priorities in the determination of optimal vaccination policies,” American Journal of Epidemiology, 124(6), 1012–20.
Fisher, Ronald A., 1926, “The arrangement of field experiments,” Journal of the Ministry of Agriculture of Great Britain, 33, 503–13.
Filmer, Deon, Jeffrey Hammer, and Lant Pritchett, 2000, “Weak links in the chain: a diagnosis of health policy in poor countries,” World Bank Research Observer, 15(2), 199–204.
Freedman, David A., 2006, “Statistical models for causation: what inferential leverage do they provide?” Evaluation Review, 30, 691–713.
Freedman, David A., 2008, “On regression adjustments to experimental data,” Advances in Applied Mathematics, 40, 180–93.
Garfinkel, Irwin, and Charles F. Manski, 1992, “Introduction,” in Irwin Garfinkel and Charles F. Manski, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press, 1–22.
Gertler, Paul J., Sebastian Martinez, Patrick Premand, Laura B. Rawlings, and Christel M. J. Vermeersch, Impact evaluation in practice, Washington, DC. The World Bank.
Glewwe, Paul, Michael Kremer, Sylvie Moulin, and Eric Zitzewitz, 2004, “Retrospective vs. prospective analyses of school inputs: the case of flip-charts in Kenya,” Journal of Development Economics, 74, 251–68.
Greenberg, David and Mark Shroder, 2004, The digest of social experiments (3rd ed.), Washington, DC. Urban Institute Press.
Greenberg, David, Mark Shroder, and Matthew Onstott, 1999, “The social experiment market,” Journal of Economic Perspectives, 13(3), 157–72.
Gueron, Judith M., and Howard Rolston, 2013, Fighting for reliable evidence, New York, Russell Sage.
Guyatt, Gordon, David L. Sackett and Deborah J. Cook for the Evidence-Based Medicine Working Group, 1994, “Users’ guides to the medical literature II: how to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients?” Journal of the American Medical Association, 271(1), 59–63.
Hargreaves, James R., Alexander M. Aiken, Calum Davey, and Richard J. Hayes, 2015, “Authors’ response to: deworming externalities and school impacts in Kenya,” International Journal of Epidemiology, 0(0), 1–3.
Harrison, Glenn W., 2013, “Field experiments and methodological intolerance,” Journal of Economic Methodology, 20(2), 103–17.
Harrison, Glenn W., 2014, “Impact evaluation and welfare evaluation,” European Journal of Development Research, 26, 39–45.
Hausman, Jerry A., and David A. Wise, 1985, “Technical problems in social experimentation: cost versus ease of analysis,” in Jerry A. Hausman and David A. Wise, eds., Social Experimentation, Chicago, IL. Chicago University Press, 187–220.
Heckman, James J., 1992, “Randomization and social policy evaluation,” in Charles F. Manski and Irwin Garfinkel, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press, 547–70.
Heckman, James J., 1997, “Instrumental variables: a study of implicit behavioral assumptions used in making program evaluations,” Journal of Human Resources, 32(3), 441–62.
Heckman, James J., Neil Hohman, and Jeffrey Smith, with the assistance of Michael Khoo, 2000, “Substitution and dropout bias in social experiments: a study of an influential social experiment,” Quarterly Journal of Economics, 115(2), 651–94.
Heckman, James J., Robert J. Lalonde, and Jeffrey A. Smith, 1999, “The economics and econometrics of active labor markets,” Chapter 31 in Ashenfelter, Orley and David Card, eds., Handbook of labor economics, Amsterdam. North-Holland, 3(A), 1866–2097.
Heckman, James J., Rodrigo Pinto, and Peter Savelyev, 2013, “Understanding the mechanisms through which an influential early childhood program boosted adult outcomes,” American Economic Review, 103(6), 2052–86.
Heckman, James J., Jeffrey Smith, and Nancy Clements, 1997, “Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts,” Review of Economic Studies, 64(4), 487–535.
Heckman, James J., and Edward Vytlacil, 2005, “Structural equations, treatment effects, and econometric policy evaluation,” Econometrica, 73(3), 669–738.
Heckman, James J. and Edward J. Vytlacil, 2007, “Econometric evaluation of social programs, Part 1: causal models, structural models, and econometric policy evaluation,” Chapter 70 in James J. Heckman and Edward E. Leamer, eds., Handbook of Econometrics, 6B, 4779–874.
Hicks, Joan Hamory, Michael Kremer, and Edward Miguel, 2015, “Commentary: deworming externalities and schooling impacts in Kenya: a comment on Aiken et al (2015) and Davey et al (2015),” International Journal of Epidemiology, 0(0), 1–4.
Horton, Richard, 2000, “Common sense and figures: the rhetoric of validity in medicine: Bradford Hill memorial lecture 1999,” Statistics in Medicine, 19, 3149–64.
Hotz, V. Joseph, Guido W. Imbens and Julie H. Mortimer, 2005, “Predicting the efficacy of future training programs using past experience at other locations,” Journal of Econometrics, 125, 241–70.
Hsieh, Chang-tai and Miguel Urquiola, 2006, “The effects of generalized school choice on achievement and stratification: evidence from Chile’s voucher program,” Journal of Public Economics, 90, 1477–1503.
Humphreys, Macartan, 2015, “What has been learned from the deworming replications: a nonpartisan view,” Columbia University, Aug. http://www.columbia.edu/~mh2245/w/worms.html
Imbens, Guido W., 2004, “Nonparametric estimation of average treatment effects under exogeneity: a review,” Review of Economics and Statistics, 86(1), 4–29.
Imbens, Guido W., 2010, “Better LATE than nothing: some comments on Deaton (2009) and Heckman and Urzua,” Journal of Economic Literature, 48(2), 399–423.
Imbens, Guido W. and Joshua D. Angrist, 1994, “Identification and estimation of local average treatment effects,” Econometrica, 62(2), 467–75.
Imbens, Guido W., and Jeffrey M. Wooldridge, 2009, “Recent developments in the econometrics of program evaluation,” Journal of Economic Literature, 47(1), 5–86.
International Committee of Medical Journal Editors, 2015, Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals, http://www.icmje.org/icmje-recommendations.pdf (accessed August 20, 2016).
Kahneman, Daniel and Gary Klein, 2009, “Conditions for intuitive expertise: a failure to disagree,” American Psychologist, 64(6), 515–26.
Karlan, Dean and Jacob Appel, 2011, More than good intentions: how a new economics is helping to solve global poverty, Dutton.
Karlan, Dean, Nathanael Goldberg and James Copestake, 2009, “Randomized controlled trials are the best way to measure impact of microfinance programs and improve microfinance product designs,” Enterprise Development and Microfinance, 20(3), 167–76.
Kasy, Maximilian, 2016, “Why experimenters might not want to randomize, and what they could do instead,” Political Analysis, 1–15. doi:10.1093/pan/mpw012
Kendall, Maurice G., 1959, “Hiawatha designs an experiment,” American Statistician, 13(5), 23–4.
Kramer, Peter, 2016, Ordinarily well: the case for antidepressants, Farrar, Straus, and Giroux.
Kremer, Michael, and Alaka Holla, 2009, “Improving education in the developing world: what have we learned from randomized evaluations?” Annual Review of Economics, 1, 513–42.
Lehmann, Erich L., and Joseph P. Romano, 2005, Testing statistical hypotheses (third edition), New York. Springer.
Levy, Santiago, 2006, Progress against poverty: sustaining Mexico’s Progresa-Oportunidades program, Washington, DC. Brookings.
Mackie, John L., 1974, The cement of the universe: a study of causation, Oxford. Oxford University Press.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler and Arleen Leibowitz, 1988a, “Health insurance and the demand for medical care: evidence from a randomized experiment,” American Economic Review, 77(3), 251–77.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler, Bernadette Benjamin, Arleen Leibowitz, M. Susan Marquis, and Jack Zwanziger, 1988b, Health insurance and the demand for medical care: evidence from a randomized experiment, Santa Monica, CA. RAND.
Manski, Charles F., 1990, “Nonparametric bounds on treatment effects,” American Economic Review, 80(2), 319–23.
Manski, Charles F., 1995, Identification problems in the social sciences, Cambridge, MA. Harvard University Press.
Manski, Charles F., 2003, Partial identification of probability distributions, New York. Springer.
Manski, Charles F., 2013, Public policy in an uncertain world: analysis and decisions, Cambridge, MA. Harvard University Press.
Metcalf, Charles E., 1973, “Making inferences from controlled income maintenance experiments,” American Economic Review, 63(3), 478–83.
Miguel, Edward, and Michael Kremer, 2004, “Worms: identifying impacts on education and health in the presence of treatment externalities,” Econometrica, 72(1), 159–217.
Miguel, Edward, Michael Kremer, and Joan Hamory Hicks, 2015, “Comment on Macartan Humphreys’ and other recent discussions of the Miguel and Kremer (2004) study,” Berkeley, Dec. http://emiguel.econ.berkeley.edu/assets/miguel_research/63/Worms-Comment_2015-1221.pdf
Moffitt, Robert, 1979, “The labor supply response in the Gary experiment,” Journal of Human Resources, 14(4), 477–87.
Moffitt, Robert, 1992, “Evaluation methods for program entry effects,” Chapter 6 in Charles Manski and Irwin Garfinkel, Evaluating welfare and training programs, Cambridge, MA. Harvard University Press, 231–52.
Moffitt, Robert, 2004, “The role of randomized field trials in social science research: a perspective from evaluations of reforms of social welfare programs,” American Behavioral Scientist, 47(5), 506–40.
Morgan, Kari Lock, and Donald B. Rubin, 2012, “Rerandomization to improve covariate balance in experiments,” Annals of Statistics, 40(2), 1263–82.
Muller, Seán M., 2015, “Causal interaction and external validity: obstacles to the policy relevance of randomized evaluations,” World Bank Economic Review, 29, S217–S225.
Orcutt, Guy H., and Alice G. Orcutt, 1968, “Incentive and disincentive experimentation for income maintenance policy purposes,” American Economic Review, 58(4), 754–72.
Pearl, Judea, 2009, Causality: models, reasoning, and inference, 2nd edition, Cambridge. Cambridge University Press.
Petticrew, Mark, and Iain Chalmers, 2011, “Use of research evidence in practice,” Lancet, 378(9804), 1696.
Rodrik, Dani, 2006, personal email communication.
Rosenzweig, Mark and Christopher Udry, 2016, “External validity in a stochastic world,” Cambridge, MA. NBER Working Paper 22449 (July).
Rothwell, Peter M., 2005, “External validity of randomized controlled trials: ‘to whom do the results of the trial apply’,” Lancet, 365, 82–93.
Russell, Bertrand, 2008 [1912], The problems of philosophy, Rockville, MD. Arc Manor.
Sackett, David L., William M. C. Rosenberg, J. A. Muir Gray, R. Brian Haynes and W. Scott Richardson, 1996, “Evidence based medicine: what it is and what it isn’t,” British Medical Journal, 312 (January 13), 71–2.
Scriven, Michael, 1974, “Evaluation perspectives and procedures,” in W. James Popham, ed., Evaluation in education: current applications, Berkeley, CA. McCutchan Publishing Corporation.
Sen, Amartya K., 2011, The idea of justice, Cambridge, MA. Harvard University Press.
Senn, Stephen, 1994, “Testing for baseline balance in clinical trials,” Statistics in Medicine, 13, 1715–26.
Senn, Stephen, 2013, “Seven myths of randomization in clinical trials,” Statistics in Medicine, 32, 1439–50.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell, 2002, Experimental and quasi-experimental designs for generalized causal inference, Boston, MA. Houghton Mifflin.
Simpson, Adrian, 2016, “Comparing and combining standardized effect sizes: the misdirection of public policy,” Working Paper, University of Durham (July).
Singer, Burton H., and Steve Pincus, 1998, “Irregular arrays and randomization,” Proceedings of the National Academy of Sciences of the USA, 95, 1363–8.
Stiles, Charles Wardell, 1939, “Early history, in part esoteric, of the hookworm (uncinariasis) campaign in our southern United States,” Journal of Parasitology, 25(4), 283–308.
Stuart, Elizabeth A., Stephen R. Cole, Catharine P. Bradshaw, and Philip J. Leaf, 2011, “The use of propensity scores to assess the generalizability of results from randomized trials,” Journal of the Royal Statistical Society A, 174(2), 369–86.
Svorencik, Andrej, 2015, The experimental turn in economics: a history of experimental economics, Utrecht School of Economics, Dissertation Series #29, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026
Taylor-Robinson, David C., Nicola Maayan, Karla Soares-Weiser, Sarah Donegan, and Paul Garner, 2015, “Deworming drugs for soil-transmitted intestinal worms in children: effects on nutritional indicators, haemoglobin, and school performance (review),” The Cochrane Collaboration. Wiley. http://onlinelibrary.wiley.com/doi/10.1002/14651858.CD000371.pub6/abstract
Todd, Petra E., and Kenneth J. Wolpin, 2006, “Assessing the impact of a school subsidy program in Mexico: using a social experiment to validate a dynamic behavioral model of child schooling and fertility,” American Economic Review, 96(5), 1384–1417.
Todd, Petra E., and Kenneth J. Wolpin, 2008, “Ex ante evaluation of social programs,” Annales d’Economie et de la Statistique, 91/92, 263–91.
U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, 2003, Identifying and implementing educational practices supported by rigorous evidence: a user friendly guide, Washington, DC. Institute of Education Sciences.
Vandenbroucke, Jan P., 2004, “When are observational studies as credible as randomized controlled trials?” The Lancet, 363, 1728–31.
Vivalt, Eva, 2015, “How much can we generalize from impact evaluations?” NYU, unpublished. http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf
White, Halbert, 1980, “A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity,” Econometrica, 50(1), 1–25.
Wise, David A., 1985, “A behavioral model versus experimentation: the effects of housing subsidies on rent,” in P. Brucker and R. Pauly, eds., Methods of Operations Research, 50, Verlag Anon Hain, 441–89.
Worrall, John, 2002, “What evidence in evidence-based medicine?” Philosophy of Science, 69, S316–S330.
Worrall, John, 2007, “Evidence in medicine and evidence-based medicine,” Philosophy Compass, 2/6, 981–1022.
Young, Alwyn, 2016, “Channeling Fisher: randomization tests and the statistical insignificance of seemingly significant experimental results,” London School of Economics, Working Paper, Feb.
Ziliak, Stephen T., 2014, “Balanced versus randomized field experiments in economics: why W. S. Gosset aka ‘Student’ matters,” Review of Behavioral Economics, 1, 167–208.