Understanding and misunderstanding randomized controlled trials
Angus Deaton and Nancy Cartwright
Princeton University, NBER, and University of Southern California
Durham University and UC San Diego
This version, October 2017
We acknowledge helpful discussions with many people over the several years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia, and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Jim Heckman, Bo Honoré, Chuck Manski, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1. We have benefited from generous comments on an earlier version by Christopher Adams, Tim Besley, Chris Blattman, Sylvain Chassang, Jishnu Das, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jeff Hammer, Glenn Harrison, Macartan Humphreys, Michal Kolesár, Helen Milner, Tamlyn Munslow, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Williams, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U), the Spencer Foundation, and the National Science Foundation (award 1632471). Deaton acknowledges financial support from the National Institute on Aging through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01AG05842-14 and through Princeton University’s Roybal Center, Grant P30AG024928.
ABSTRACT
RCTs would be more useful if there were more realistic expectations of them and if their pitfalls were better recognized. For example, and contrary to many claims in the applied literature, randomization does not equalize everything but the treatment across treatments and controls, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) confounders. Estimates apply to the trial sample only, sometimes a convenience sample, and usually selected; justification is required to extend them to other groups, including any population to which the trial sample belongs. Demanding “external validity” is unhelpful because it expects too much of an RCT while undervaluing its contribution. Statistical inference on ATEs involves hazards that are not always recognized. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon and not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not “what works,” but “why things work”.
Introduction
Randomized controlled trials (RCTs) are widely visible in economics today
and have been used in the subject at least since the 1960s (see Greenberg and
Shroder (2004) for a compendium). It is often claimed that such trials can discover
“what works” in economics, as well as in political science, education, and social
policy. Among both researchers and the general public, RCTs are perceived to yield
causal inferences and estimates of average treatment effects (ATEs) that are more
reliable and more credible than those from any other empirical method. They are
taken to be largely exempt from the myriad econometric problems that characterize
observational studies, to require minimal substantive assumptions, little or no prior
information, and to be largely independent of “expert” knowledge that is often
regarded as manipulable, politically biased, or otherwise suspect.
There are now “What Works” centers using and recommending RCTs in a
range of areas of social concern across Europe and the Anglophone world. These
centers see RCTs as their preferred tool and indeed often prefer RCT evidence
lexicographically. As one of many examples, the US Department of Education’s standard
for “strong evidence of effectiveness” requires a “well-designed and implemented”
RCT; no observational study can earn such a label. This “gold standard” claim about
RCTs is less common in economics, but Imbens (2010, 407) writes that “randomized
experiments do occupy a special place in the hierarchy of evidence, namely at the
very top.” The Abdul Latif Jameel Poverty Action Lab (J-PAL), whose stated mission
is “to reduce poverty by ensuring that policy is informed by scientific evidence”,
advertises that its affiliated professors “conduct randomized evaluations to test and
improve the effectiveness of programs and policies aimed at reducing poverty”,
J-PAL (2017). The lead page of its website (echoed in the ‘Evaluation’ section) notes
“843 ongoing and completed randomized evaluations in 80 countries” with no
mention of any studies that are not randomized.
In medicine, the gold standard view has long been widespread, e.g. for drug
trials by the FDA; a notable exception is the recent paper by Frieden (2017),
ex-director of the U.S. Centers for Disease Control and Prevention, who lists key
limitations of RCTs as well as a range of contexts where RCTs, even when feasible, are
dominated by other methods.
We argue that any special status for RCTs is unwarranted. Which method is
most likely to yield a good causal inference depends on what we are trying to
discover as well as on what knowledge is already available. When little prior
knowledge is available, no method is likely to yield well-supported conclusions. This
paper is not a criticism of RCTs in and of themselves, let alone an attempt to identify
good and bad studies. Instead, we will argue that, depending on what we want to
discover, why we want to discover it, and what we already know, there will often be
superior routes of investigation.
We present two sets of arguments. The first is an enquiry into the idea that
ATEs estimated from RCTs are likely to be closer to the truth than those estimated
in other ways. The second explores how to use the results of RCTs once we have
them. In the first section, our discussion runs in familiar terms of bias and efficiency,
or expected loss. None of this material is new, but we know of no similar treatment,
and we wish to dispute many of the claims that are frequently made in the applied
literature. Some routine misunderstandings are: (a) randomization ensures a fair
trial by ensuring that, at least with high probability, treatment and control groups
differ only in the treatment; (b) RCTs provide not only unbiased estimates of ATEs
but also precise estimates; (c) statistical inference in RCTs, which requires only the
simple comparison of means, is straightforward, so that standard significance tests
are reliable.
Nothing we say in the paper should be taken as a general argument against
RCTs; we are simply trying to challenge unjustifiable claims, and expose
misunderstandings. We are not against RCTs, only magical thinking about them. The
misunderstandings are important because we believe that they contribute to the common
perception that RCTs always provide the strongest evidence for causality and for
effectiveness.
In the second part of the paper, we discuss how to use the evidence from
RCTs. The non-parametric and theory-free nature of RCTs, which is arguably an
advantage in estimation, is often a disadvantage when we try to use the results outside
of the context in which the results were obtained; credibility in estimation can lead
to incredibility in use. Much of the literature, perhaps inspired by Campbell and
Stanley’s (1963) famous “primacy of internal validity”, appears to believe that
internal validity is not only necessary but almost sufficient to guarantee the usefulness of
the estimates in different contexts. But you cannot know how to use trial results
without first understanding how the results from RCTs relate to the knowledge that
you already possess about the world, and much of this knowledge is obtained by
other methods. Once the commitment has been made to seeing RCTs within this
broader structure of knowledge and inference, and when they are designed to
enhance it, they can be enormously useful, not just for warranting claims of
effectiveness but for scientific progress more generally. Cumulative science is not advanced
through magical thinking.
The literature on the precision of ATEs estimated from RCTs goes back to the
very beginning. Gosset (writing as ‘Student’) never accepted Fisher’s arguments for
randomization in agricultural field trials and argued convincingly that his own
non-random designs for the placement of treatment and controls yielded more precise
estimates of treatment effects (see Student (1938) and Ziliak (2014)). Gosset
worked for Guinness where inefficiency meant lost revenue, so he had reasons to
care, as should we. Fisher won the argument in the end, not because Gosset was
wrong about efficiency, but because, unlike Gosset’s procedures, randomization
provides a sound basis for statistical inference, and thus for judging whether an
estimated ATE is different from zero by chance. Moreover, Fisher’s blocking procedures
can limit the inefficiency from randomization (see Yates (1939)). Gosset’s
reservations were echoed much later in Savage’s (1962) comment that a Bayesian should
not choose the allocation of treatments and controls at random but in such a way
that, given what else is known about the topic and the subjects, their placement
reveals the most to the researcher. These issues about how to incorporate prior
information into randomized trials are central to Section 1.
In economics, the strengths and weaknesses of RCTs are well explored in the
volumes by Hausman and Wise (1985) and by Garfinkel and Manski (1992); in the
latter, the introduction by Garfinkel and Manski is a balanced summary of what
randomized trials can and cannot do. The paper in that volume by Heckman (1992)
raises many of the issues that he and his coauthors have explored in subsequent
papers, see in particular Heckman and Smith (1995), and Heckman, Lalonde and Smith
(1999) who focus on labor market experiments. Manski (2013) contains a good
summary of both strengths and weaknesses.
There is also a more contested recent literature. On the one hand, there are
procedures that take as fundamental the unrestricted individual treatment effects of
individuals and seek non-parametric approaches to estimating their average. On the
other hand, these procedures are contrasted with an approach that uses elements of
economic theory to define parameters of interest and to identify magnitudes that
are likely to be invariant to policy manipulation or across contexts, where
invariance is defined in the sense of Hurwicz (1966). The introduction in Imbens and
Wooldridge (2009) provides an eloquent defense of the treatment-effect formulation.
It emphasizes the credibility that comes from a theory-free specification with almost
unlimited heterogeneity in treatment effects. The introduction in Heckman and
Vytlacil (2007) makes an equally eloquent case against, noting that the crucial
ingredients of treatments in RCTs are often not clearly specified (so that we often do not
know what the treatment really is) and that the treatment effects are hard to link to
invariant parameters that would be useful elsewhere. Aspects of the same debate
feature in Imbens (2010), Athey and Imbens (2017), Angrist and Pischke (2017),
Heckman (2005, 2008, 2010) and Heckman and Urzua (2010).
Deaton (2010) complains about the use of instrumental variables, including
randomization, as a substitute for thinking about and constructing models of
economic development. He argues against the idea that using RCTs to evaluate projects
to discover “what works” can ever yield a systematic body of scientific knowledge
that can be used to reduce or eliminate poverty. That paper is an argument against
the usefulness of the heterogeneous treatment approach. It argues that refusing to
model heterogeneity, though avoiding assumptions, precludes the sort of
cumulative research program that might yield useful policy. The paper’s claim that RCTs
have no special claim to generate credible and useful knowledge was challenged by
Imbens (2010); some of his arguments are answered below. Cartwright (2007) and
Cartwright and Munro (2010) challenge any “gold standard” view of RCTs.
Cartwright (2011, 2012, 2016) and Cartwright and Hardie (2012) focus on the question
of how to use the results of RCTs and what we can learn when an experiment shows
that some policy works somewhere. Section 2 pursues these issues in general and
through case studies.
Section 1: Do RCTs give good estimates of Average Treatment Effects?
In this section, we explore how to estimate average treatment effects (ATEs) and the
role of randomization. We note first that estimating ATEs is only one of many uses
for the data generated by an RCT. We start from a trial sample, a collection of
subjects that will be allocated randomly to either the treatment or control arm of the
trial. This “sample” might be, but rarely is, a random sample from some population
of interest. More frequently, it is selected in some way, for example limited to those
willing to participate, or is simply a convenience sample that is available to the trialists.
Given random allocation to treatments and controls, the data from the trial allow the
identification of two distributions, $F_1(Y_1)$ and $F_0(Y_0)$, of outcomes $Y_1$ and $Y_0$ in the
treated and untreated cases within the trial sample. The estimated ATE is the
difference in means of the two distributions and is the focus of much of the literature in
social science and medicine. Yet policymakers and researchers may well be
interested in other features of the two distributions. For example, if Y is income, they
may be interested in whether a treatment reduced income inequality, or in what it
did to the 10th or 90th percentiles of the income distribution, even though different
people occupy those percentiles in the treatment and control distributions (see
Bitler et al (2006) for an example in US welfare policy). Cancer trials standardly use the
median difference in survival, which compares the times until half the patients have
died in each arm. More comprehensively, policymakers may wish to compare
expected utilities for treated and untreated under the two distributions and consider
optimal expected-utility maximizing treatment rules conditional on the
characteristics of subjects (see Manski (2004) and Manski and Tetenov (2016); Bhattacharya
and Dupas (2012) contains an application). These uses are important, but we focus
on ATEs here and do not consider these other uses of RCTs any further in this paper.
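As an illustration of why the difference in means is only one summary of the two outcome distributions, the following minimal sketch compares the ATE with differences at the median and at the 10th and 90th percentiles. The data and the function names `quantile` and `summarize_arms` are hypothetical and chosen for illustration; they do not come from any of the studies cited.

```python
import statistics

def quantile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def summarize_arms(treated, control):
    """Compare the treated and control outcome distributions on several features."""
    return {
        "ate": statistics.mean(treated) - statistics.mean(control),
        "median_diff": statistics.median(treated) - statistics.median(control),
        "p10_diff": quantile(treated, 0.10) - quantile(control, 0.10),
        "p90_diff": quantile(treated, 0.90) - quantile(control, 0.90),
    }

# Hypothetical incomes: the treatment compresses the distribution
# but leaves its mean unchanged.
control_income = [10, 20, 30, 40, 50]
treated_income = [20, 25, 30, 35, 40]
summary = summarize_arms(treated_income, control_income)
print(summary)
```

In this made-up example the estimated ATE is zero although the lower tail rises and the upper tail falls, which is the kind of effect on inequality that a focus on means alone would miss.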
1.1 What does randomization do?
A useful way to think about the estimation of treatment effects is to use a schematic
linear causal model of the form:
$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \tag{1}$$
where $Y_i$ is the outcome for unit i, $T_i$ is a dichotomous (1,0) treatment dummy
indicating whether or not i is treated, and $\beta_i$ is the individual treatment effect of the
treatment on i. The x’s are the observed or unobserved other linear causes of the
outcome, and we suppose that (1) captures a minimal set of causes of $Y_i$ sufficient to
fix its value. J may be (very) large. Because the heterogeneity of the individual
treatment effects, $\beta_i$, is unrestricted, we allow the possibility that the treatment interacts
with the x’s or other variables, so that the effects of T can depend on any other
variables. Note that we do not need i subscripts on the $\gamma$’s that control the effects of the
other causes; if their effects differ across individuals, we include the interactions of
individual characteristics with the original x’s as new x’s. Given that the x’s can be
unobservable, this is not restrictive.
Consider an experiment that aims to tell us something about the treatment
effects; this might or might not use randomization. Either way, we can represent the
treatment group as having $T_i = 1$ and the control group as having $T_i = 0$. Given the
study (or trial) sample, subtracting the average outcomes among the controls from
the average outcomes among the treatments, we get
$$\bar{Y}_1 - \bar{Y}_0 = \bar{\beta}_1 + \sum_{j=1}^{J} \gamma_j \left(\bar{x}^{1}_{ij} - \bar{x}^{0}_{ij}\right) = \bar{\beta}_1 + \left(\bar{S}_1 - \bar{S}_0\right) \tag{2}$$
Here bars denote averages within the treatment (1) and control (0) groups.
The first term on the far-right-hand side of (2), which is the ATE in the trial sample,
is what we want, but the second term or error term, which is the sum of the net
average balance of other causes across the two groups, will generally be non-zero and
needs to be dealt with somehow. We get what we want when the means of all the
other causes are identical in the two groups, or more precisely (and less onerously)
when the sum of their net differences $\bar{S}_1 - \bar{S}_0$ is zero; this is the case of perfect
balance. With perfect balance, the difference between the two means is exactly equal to
the average of the treatment effects among the treated, so that we have the ultimate
precision in that we know the truth in the trial sample, at least in this linear case. As
always, the “truth” here refers to the trial sample, and it is always important to be
aware that the trial sample may not be representative of the population that is
ultimately of interest, including the population from which the trial sample comes; any
such extension requires further argument.
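The decomposition of the difference in means into the average treatment effect among the treated plus the net imbalance of the other causes can be checked numerically. The sketch below is a minimal illustration under assumed values: the sample size, the number of causes, and the distributions of the individual effects and of the other causes are all hypothetical choices made only for this example.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def simulate_trial(n=200, n_causes=3, seed=1):
    """Generate one trial sample from the schematic linear causal model."""
    rng = random.Random(seed)
    gammas = [rng.uniform(-2, 2) for _ in range(n_causes)]             # gamma_j
    x = [[rng.gauss(0, 1) for _ in range(n_causes)] for _ in range(n)]  # x_ij
    beta = [rng.gauss(1.0, 0.5) for _ in range(n)]                      # beta_i
    treated_ids = set(rng.sample(range(n), n // 2))                     # random allocation
    t = [1 if i in treated_ids else 0 for i in range(n)]
    y = [beta[i] * t[i] + sum(g * xij for g, xij in zip(gammas, x[i]))
         for i in range(n)]
    return gammas, x, beta, t, y

def decompose(gammas, x, beta, t, y):
    """Return the difference in means, the average effect among the
    treated, and the net imbalance of the other causes."""
    treated = [i for i, ti in enumerate(t) if ti == 1]
    control = [i for i, ti in enumerate(t) if ti == 0]
    diff_means = mean(y[i] for i in treated) - mean(y[i] for i in control)
    beta_bar_1 = mean(beta[i] for i in treated)
    s1 = mean(sum(g * xij for g, xij in zip(gammas, x[i])) for i in treated)
    s0 = mean(sum(g * xij for g, xij in zip(gammas, x[i])) for i in control)
    return diff_means, beta_bar_1, s1 - s0

diff_means, beta_bar_1, imbalance = decompose(*simulate_trial())
print("difference in means:        ", diff_means)
print("beta_bar_1 + (S_1 - S_0):   ", beta_bar_1 + imbalance)
print("net imbalance in this trial:", imbalance)
```

The first two printed numbers agree up to floating-point error, while the third shows that the imbalance term in a single randomized allocation is generally not zero.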
How do we get balance, or something close to it? What, exactly, is the role of
randomization? In a laboratory experiment, where there is usually much prior
knowledge of the other causes, the experimenter has a good chance of controlling
(or subtracting away the effects of) the other causes, aiming to ensure that the last
term in (2) is close to zero. Failing such knowledge and control, an alternative is
matching, which is frequently used in statistical, medical, and econometric work. For
each subject, a match is found that is as close as possible on all suspected causes, so
that, once again, the last term in (2) can be kept small. When we have a good idea of
the causes, matching may also deliver a precise estimate. Of course, when there are
unknown or unobservable causes that have important effects, neither laboratory
control nor matching offers protection.
What does randomization do? Since the treatments and controls come from
the same underlying distribution, randomization guarantees, by construction, that
the last term on the right in (2) is zero in expectation, subject to the caveat that no
correlations of the x’s with Y are introduced post-randomization, for example by
subjects not accepting their assignment. The expectation here is taken over
repeated randomizations on the trial sample, each with its own allocation of
treatments and controls. Assuming that our caveat holds, the last term in (2) will be zero
when averaged over this infinite number of (entirely hypothetical) replications, and
the average of the estimated ATEs will be the true ATE in the trial sample. So the
difference in means $\bar{Y}_1 - \bar{Y}_0$ delivers an unbiased estimate of the ATE among the
treated in the trial sample, and it does so whether or not the causes are observed.
Unbiasedness does not require us to know anything about the other causes though it
does require that they not change after randomization so as to make them correlated
with the treatment, which is an important caveat to which we shall return. Of
course, none of this is true in any one trial where the difference in means will be
equal to the average treatment effect among those treated plus the term that reflects
the imbalance in the net effects of the other causes. We do not know the size of this
error term, and there is nothing in randomization that limits its size; by chance the
randomization in our single trial can over-represent an important excluded cause(s)
in one arm over the other, in which case there will be a difference between the
means of the two groups that is not caused by the treatment.
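The distinction between unbiasedness over repeated randomizations and the error in any single trial can be illustrated by simulation. The sketch below is hypothetical throughout: it fixes one made-up trial sample with heterogeneous individual effects and a single strong, skewed other cause, re-randomizes it many times, and compares the average of the estimates with their spread. Here the "true ATE" is taken to be the mean individual effect over the whole trial sample.

```python
import random
import statistics

def repeated_randomizations(n=100, reps=2000, seed=0):
    """Re-randomize one fixed trial sample and collect difference-in-means estimates."""
    rng = random.Random(seed)
    # Fixed trial sample: individual effects and one strong, skewed other cause.
    beta = [rng.gauss(1.0, 0.5) for _ in range(n)]
    other = [5 * rng.expovariate(1.0) for _ in range(n)]
    true_ate = statistics.mean(beta)
    estimates = []
    for _ in range(reps):
        treated = set(rng.sample(range(n), n // 2))   # one hypothetical allocation
        y1 = [beta[i] + other[i] for i in treated]
        y0 = [other[i] for i in range(n) if i not in treated]
        estimates.append(statistics.mean(y1) - statistics.mean(y0))
    return true_ate, estimates

true_ate, estimates = repeated_randomizations()
print("true ATE in the trial sample:", round(true_ate, 3))
print("mean of the estimates:       ", round(statistics.mean(estimates), 3))
print("std. dev. of the estimates:  ", round(statistics.stdev(estimates), 3))
print("largest single-trial error:  ",
      round(max(abs(e - true_ate) for e in estimates), 3))
```

Under these assumptions the mean of the estimates sits close to the true value, while individual estimates can miss it by a wide margin because the other cause happens to be unevenly split between the arms.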
The unbiasedness result can easily be compromised. In particular, the
treatment must not be correlated with any other cause. Random assignment is designed
to aid with this, but it is not sufficient if, for example, there is lack of blinding so that
individuals are aware of their assignment, or if those administering the treatment
are so aware, and if that awareness triggers another cause. Similarly, researchers
sometimes return to individuals who were randomized years before, so that there
has been time for the subjects or others to learn their assignment or for other causes
to be influenced by the assignment. This again opens up the possibility of
unbalanced effects of causes other than the treatment we are interested in. We have
already noted that unbiasedness refers to the trial sample, which may or may not be
representative of the population of interest.
If we were to repeat the trial many times, the over-representation of the
unbalanced causes will sometimes be in the treatments and sometimes in the controls.
The imbalance will vary over replications of the trial, and although we cannot see
this from our single trial, we should be able to capture its effects on our estimate of
the ATE from an estimated standard error. This was Fisher’s innovation: not that
randomization balanced other causes between treatments and controls but that,
conditional on our caveat above, randomization provides the basis for calculating
the size of the error. Getting the standard error and associated significance
statements right is of the greatest importance. Given the absence of treatment-related
post-randomization changes in other causes, randomization yields an unbiased
estimate of the ATE in the trial sample as well as a sound method for measuring error of
estimation in that sample; therein lies its virtue, not that it yields precise estimates
through balance.
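One way to see how randomization itself supplies a basis for inference is a randomization (permutation) test. The sketch below is a minimal illustration, not a reconstruction of any procedure in the works cited: it assumes the sharp null hypothesis that the treatment has no effect on any unit, approximates the randomization distribution by random reshuffling, and uses made-up outcome data and a hypothetical function name.

```python
import random
import statistics

def randomization_p_value(treated, control, draws=5000, seed=0):
    """Two-sided p-value under the sharp null of no effect on any unit,
    approximated by re-drawing the allocation `draws` times."""
    rng = random.Random(seed)
    observed = statistics.mean(treated) - statistics.mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(draws):
        rng.shuffle(pooled)  # under the sharp null, labels are exchangeable
        diff = (statistics.mean(pooled[:n_treated])
                - statistics.mean(pooled[n_treated:]))
        if abs(diff) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / draws

# Hypothetical outcomes for six treated and six control units.
treated_y = [12.1, 9.8, 11.4, 13.0, 10.7, 12.5]
control_y = [9.9, 10.2, 8.7, 10.9, 9.4, 10.1]
observed_diff, p_value = randomization_p_value(treated_y, control_y)
print("observed difference in means:", round(observed_diff, 3))
print("randomization p-value:       ", p_value)
```

The p-value here refers only to the sharp null of no effect for anyone; a null about the average effect alone is a different hypothesis and is not what this reshuffling tests.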
1.2 Misunderstandings: claiming too much
Everything so far should be perfectly familiar, but exactly what randomization does
is frequently lost in the practical literature. There is often confusion between perfect
control, on the one hand (as in a laboratory experiment or perfect matching with no
unobservable causes), and control in expectation on the other, which is what
randomization contributes. If we knew enough about the problem to be able to control
well, that is what we would do. Randomization is an alternative when we do not
know enough, but is generally inferior to good control. We suspect that at least some
of the popular and professional enthusiasm for RCTs, as well as the belief that they
are precise by construction, comes from misunderstandings about balance. These
misunderstandings are not so much among the trialists who will often give a correct
account when pressed. They come from imprecise statements by trialists that are
taken literally by the lay audience that the trialists are keen to reach.
Such a misunderstanding is well captured by a quote from the second edition
of the online manual on impact evaluation jointly issued by the Inter-American
Development Bank and the World Bank (the first, 2011 edition is similar):
“We can be confident that our estimated impact constitutes the true impact
of the program, since we have eliminated all observed and unobserved
factors that might otherwise plausibly explain the difference in outcomes.”
Gertler, Martinez, Premand, Rawlings, and Vermeersch (2016, 69).
This statement is false, because it confuses actual balance in any single trial with
balance in expectation over many (hypothetical) trials. If it were true, and if all
factors were indeed controlled (and no imbalances were introduced post-randomization),
the difference would be an exact measure of the average treatment effect
among the treated in the trial population (at least in the absence of measurement
error). We should not only be confident of our estimate but, as the quote says, we
would know that it is the truth. Note that the statement contains no reference to
sample size; we get the truth by virtue of balance, not from a large number of
observations.
A similar quote comes from John List, one of the most imaginative and
successful scholars who use RCTs:
“complications that are difficult to understand and control represent key
reasons to conduct experiments, not a point of skepticism. This is because
randomization acts as an instrumental variable, balancing unobservables across
control and treatment groups.” Al-Ubaydli and List (2013) (italics in the
original.)
And from Dean Karlan, founder and President of Yale’s Innovations for Poverty
Action, which runs development RCTs around the world:
“As in medical trials, we isolate the impact of an intervention by randomly
assigning subjects to treatments and control groups. This makes it so that all
those other factors which could influence the outcome are present in
treatment and control, and thus any difference in outcome can be confidently
attributed to the intervention.” Karlan, Goldberg and Copestake (2009)
And from the medical literature, from a distinguished psychiatrist who is deeply
skeptical of the use of evidence from RCTs,
“The beauty of a randomized trial is that the researcher does not need to
understand all the factors that influence outcomes. Say that an undiscovered
genetic variation makes certain people unresponsive to medication. The
randomizing process will ensure, or make it highly probable, that the arms of
the trial contain equal numbers of subjects with that variation. The result will
be a fair test.” (Kramer, 2016, p. 18)
Claims are even made that RCTs reveal knowledge without possibility of error. Judy
Gueron, the long-time president of MDRC, which has been running RCTs on US
government policy for 45 years, asks why federal and state officials were prepared to
support randomization in spite of frequent difficulties and in spite of the availability
of other methods and concludes that it was because “they wanted to learn the truth,”
Gueron and Rolston (2013, 429). There are many statements of the form “We know
that [project X] worked because it was evaluated with a randomized trial,” Dynarski
(2015).
It is common to treat the ATE from an RCT as if it were the truth, not just in
the trial sample but more generally. In economics, a famous example is Lalonde’s
(1986) study of training programs, whose results were at odds with a number of
previous non-randomized studies. The paper prompted a large-scale re-examination
of the observational studies to try to bring them into line, though it now seems just
as likely that the differences lie in the fact that the study results apply to different
populations (Heckman, Lalonde, and Smith (1999)). In epidemiology, Davey-Smith
and Ibrahim (2002) state that “observational studies propose, RCTs dispose”. A
good example is the RCT of hormone replacement therapy (HRT) for
post-menopausal women. HRT had previously been supported by positive results from a
high-quality and long-running observational study, but the RCT was stopped in the face of
excess deaths in the treatment group. The negative result of the RCT led to
widespread abandonment of the therapy, which might have been a mistake (see
Vandenbroucke (2009) and Frieden (2017)). Yet the medical and popular literature
routinely states that the RCT was right and the earlier study wrong, simply because the
earlier study was not randomized. The gold standard or “truth” view does harm
when it undermines the obligation of science to reconcile RCT results with other
evidence in a process of cumulative understanding.
The false belief in automatic precision suggests that we need pay no
attention to the other causes in (1) or (2). Indeed, Gerber and Green (2012), in their
standard text for RCTs in political science, write that running an RCT is “a research
strategy that does not require, let alone measure, all potential confounders.” This is
true if we are happy with estimates that are arbitrarily far from the truth, just so
long as the errors cancel out over a series of imaginary experiments. In reality, the
causality that is being attributed to the treatment might, in fact, be coming from an
imbalance in some other cause in our particular trial; limiting this requires serious
thought about possible confounders.
1.3 Sample size, balance, and precision
At the time of randomization and in the absence of post-randomization changes in
other causes, a trial is more likely to be balanced when the sample size is large. As
the sample size tends to infinity, the means of the x’s in the treatment and control
groups will become arbitrarily close. Yet this is of little help in finite samples. As
Fisher (1926) noted: “Most experimenters on carrying out a random assignment
will be shocked to find how far from equally the plots distribute themselves,” quoted
in Morgan and Rubin (2012). Even with very large sample sizes, if there is a large
number of causes, balance on each cause may be infeasible. Even with just three
causes with three values each, there are 27 cells to balance, and in most social and
medical cases there will be more. Vandenbroucke (2004) notes that there are three
billion base pairs in the human genome, many or all of which could be relevant
prognostic factors for the biological outcome that we are seeking to influence. It is true,
as (2) makes clear, that we do not need balance on each cause individually, only on
their net effect, the term $\bar{S}_1 - \bar{S}_0$. But consider the human genome base pairs. Out of
all of those billions, only one might be important, and if that one is unbalanced, the
results of a single trial can be far from the truth. Statements about large samples
guaranteeing balance are not useful without guidelines about how large is large
enough, and such statements cannot be made without knowledge of other causes
and how they affect outcomes. Of course, lack of balance in the net effect of either
observables or non-observables in (2) does not compromise the inference in an RCT
in the sense of obtaining a standard error for the unbiased ATE (see Senn (2013) for
a particularly clear statement).
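How likely a single important cause is to be unbalanced at a given sample size can be explored by simulation. The following sketch is purely illustrative: it assumes one binary cause carried by 10 percent of subjects, an equal split between arms, and an arbitrary threshold of five percentage points for calling the arms "unbalanced". All of these values and the function name are hypothetical.

```python
import random

def imbalance_probability(n, prevalence=0.1, threshold=0.05, reps=500, seed=0):
    """Share of randomizations in which the two arms differ by more than
    `threshold` in the fraction of subjects carrying one binary cause."""
    rng = random.Random(seed)
    n_carriers = int(round(prevalence * n))
    carriers = [1] * n_carriers + [0] * (n - n_carriers)
    half = n // 2
    unbalanced = 0
    for _ in range(reps):
        rng.shuffle(carriers)  # one hypothetical random allocation
        share_treated = sum(carriers[:half]) / half
        share_control = sum(carriers[half:]) / (n - half)
        if abs(share_treated - share_control) > threshold:
            unbalanced += 1
    return unbalanced / reps

p_small = imbalance_probability(50)
p_large = imbalance_probability(2000)
print("n = 50:  ", p_small)
print("n = 2000:", p_large)
```

The answer to "how large is large enough" in this sketch depends entirely on the assumed prevalence and threshold, which is the point made in the text: without knowledge of the other causes and their effects, no general guideline is available.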
Having run an RCT, it makes good sense to examine any available covariates
for balance between the treatments and controls; if we suspect that an observed
variable x is a possible cause, and its means in the two groups are very different, we
should treat our results with appropriate suspicion. In practice, trialists in
economics (and in some other disciplines) usually carry out a statistical test for balance
after randomization but before analysis, presumably with the aim of taking some
appropriate action if balance fails. The first table of the paper typically presents the
sample means of observable covariates (the observable x’s in (1) or interactive
factors represented in β) for the control and treatment groups, together with their
differences, and tests for whether or not they are significantly different from zero,
either variable by variable, or jointly. These tests are appropriate for unbiasedness if
we are concerned that the random number generator might have failed, or if we are
worried that the randomization is undermined by non-blinded subjects who
systematically undermine the allocation. Otherwise, unbiasedness is guaranteed by the
randomization, whatever the tests show, and as the next paragraph demonstrates,
the test is not informative about the balance that would lead to precision.
If we write $\mu_0$ and $\mu_1$ for the (vectors of) true means in the trial sample (i.e.
the means over all possible randomizations) of the observed causes of Y in the
control and treatment groups at the point of assignment, the null hypothesis is
(presumably, as judged by the typical balance test) that the two vectors are identical,
with the alternative being that they are not. But if the randomization has been
correctly done the null hypothesis is true by construction (see e.g. Altman (1985) and
Senn (1994)), which may help explain why it so rarely fails in practice. As Begg
(1990) notes, “(I)t is a test of a null hypothesis that is known to be true. Therefore, if
the test turns out to be significant it is, by definition, a false positive.” This is, of
course, consistent with Fisher’s comments about the plots in the field, which note
that two samples of plots randomly drawn from the same field can look very
unbalanced. Indeed, although we cannot “test” it in this way, we know that the null
hypothesis is also true for the unobservable causes. Note the contrast with the
statements quoted above claiming that RCTs guarantee balance on causes across
treatment and control groups. Those statements refer to balance of causes at the point of
assignment in any single trial, which is not guaranteed by randomization, whereas
the balance tests are about the balance of causes at the point of assignment in
expectation over many trials, which is guaranteed by randomization. The confusion is
perhaps understandable, but it is a confusion nevertheless. Of course, it is always good
practice to look for imbalances between observed covariates in any single trial using
some more appropriate distance measure, for example the normalized difference in
means (Imbens and Wooldridge (2009, equation 3)). Similarly, it would have been
good practice for Fisher to abandon a randomization in which there were clear
patterns in the (random) distribution of plots across the field, even though the
treatment and control plots were random selections that, by construction, could not
differ “significantly” using the standard (incorrect) balance test. Whether such
imbalances should be seen as undermining the estimate of the ATE depends on our
priors about which covariates are likely to be important, and how important, which
is (not coincidentally) the same thought experiment that is routinely undertaken in
observational studies when we worry about confounding.
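A normalized difference in means can be computed in a few lines. The sketch below assumes one common form of the measure, the difference in group means divided by the square root of the sum of the two sample variances; the exact normalization intended by the cited equation is an assumption here (some authors divide the summed variances by two), and the covariate values are made up.

```python
import math
import statistics

def normalized_difference(x_treated, x_control):
    """Difference in means scaled by the square root of the summed sample
    variances (assumed form; some authors halve the summed variances)."""
    diff = statistics.mean(x_treated) - statistics.mean(x_control)
    scale = math.sqrt(statistics.variance(x_treated)
                      + statistics.variance(x_control))
    return diff / scale

# Hypothetical ages of subjects in each arm of a small trial.
age_treated = [34, 41, 29, 50, 38]
age_control = [30, 36, 27, 44, 33]
nd = normalized_difference(age_treated, age_control)
print("normalized difference in mean age:", round(nd, 3))
```

Unlike a t-statistic, this measure does not grow mechanically with the sample size, so it describes the size of the imbalance in the trial at hand instead of testing a null hypothesis that randomization has already made true.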
One procedure to improve balance is to adapt the design before
randomization, for example, by stratification. Fisher, who as the quote above illustrates, was
well aware of the loss of precision from randomization, argued for “blocking”
(stratification) in agricultural trials or for using Latin Squares, both of which restrict the
amount of imbalance. Stratification, to be useful, requires some prior understanding
of the factors that are likely to be important, and so it takes us away from the “no
knowledge required” or “no priors accepted” appeal of RCTs; it requires thinking
about and measuring covariates. But as Scriven (1974, 103) notes: “(C)ause hunting,
like lion hunting, is only likely to be successful if we have a considerable amount of
relevant background knowledge”. Cartwright (1994, Chapter 2) puts it even more
strongly, “no causes in, no causes out”. Stratification in RCTs, as in other forms of
sampling, is a standard method for using background knowledge to increase the
precision of an estimator. It has the further advantage that it allows for the
exploration of different ATEs in different strata which can be useful in adapting or
transporting the results to other locations (see Section 2).
Stratification is not possible if there are too many covariates, or if each has
many values, so that there are more cells than can be filled given the sample size.
With five covariates, and ten values on each, and no priors to limit the structure, we
would have 100,000 possible strata. Filling these is well beyond the sample sizes in
most trials. An alternative that works more generally is to re-randomize. If the
randomization gives an obvious imbalance on known covariates (treatment plots all
on one side of the field, all of the treatment clinics in one region, too many rich and
too few poor in the control group) we try again, and keep trying until we get a
balance measured as a small enough distance between the means of the observed
covariates in the two groups. Morgan and Rubin (2012) suggest the Mahalanobis
D-statistic be used as a criterion and use Fisher’s randomization inference (to be
discussed further below) to calculate standard errors that take the re-randomization
into account. An alternative, widely adopted in practice, is to adjust for covariates by
running a regression (or covariance) analysis, with the outcome on the left-hand
side and the treatment dummy and the covariates as explanatory variables,
including possible interactions between covariates and treatment dummies. Freedman
(2008) shows that the adjusted estimate of the ATE is biased in finite samples, with
the bias depending on the correlation between the squared treatment effect and the
covariates. Accepting some bias in exchange for greater precision will often make
sense, though it certainly undermines any gold standard argument that relies on
unbiasedness without consideration of precision.
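The re-randomization idea can be sketched for the simple case of two observed covariates. This is an illustration under stated assumptions, not the procedure of any cited paper: the covariates, the acceptance threshold, and the function names are hypothetical, and the distance is the plain quadratic form of the difference in covariate means in the inverse sample covariance matrix, without any further scaling factor.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def covariance_2x2(x1, x2):
    """Sample covariance matrix entries (s11, s12, s22) for two covariates."""
    m1, m2 = mean(x1), mean(x2)
    n = len(x1)
    s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    return s11, s12, s22

def mahalanobis_sq(d1, d2, s11, s12, s22):
    """d' S^{-1} d for a 2-vector d, using the closed-form 2x2 inverse."""
    det = s11 * s22 - s12 ** 2
    return (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det

def rerandomize(x1, x2, threshold=0.01, max_tries=10000, seed=0):
    """Redraw the allocation until the covariate means are close enough."""
    rng = random.Random(seed)
    n = len(x1)
    s11, s12, s22 = covariance_2x2(x1, x2)
    for attempt in range(1, max_tries + 1):
        treated = set(rng.sample(range(n), n // 2))
        control = [i for i in range(n) if i not in treated]
        d1 = mean(x1[i] for i in treated) - mean(x1[i] for i in control)
        d2 = mean(x2[i] for i in treated) - mean(x2[i] for i in control)
        dist = mahalanobis_sq(d1, d2, s11, s12, s22)
        if dist <= threshold:
            return sorted(treated), dist, attempt
    raise RuntimeError("no acceptable allocation found; loosen the threshold")

# Hypothetical covariates for 40 subjects.
data_rng = random.Random(42)
income = [data_rng.gauss(50, 10) for _ in range(40)]
age = [0.3 * inc + data_rng.gauss(20, 5) for inc in income]
treated_ids, distance, attempts = rerandomize(income, age)
print("accepted after", attempts, "draws; squared distance =", round(distance, 5))
```

Because allocations are accepted only when they pass the balance criterion, standard errors computed as if the allocation were a single unrestricted draw no longer apply, which is why the text notes that inference must take the re-randomization into account.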
1.4 Should we randomize?
The tension between randomization and precision that goes back to Fisher, Gosset,
and Savage has been reopened in recent papers by Kasy (2016), Banerjee, Chassang,
and Snowberg (BCS) (2016) and Banerjee, Chassang, Montero, and Snowberg
(BCMS) (2016).
The trade-off between bias and precision can be formalized in several ways,
for example by specifying a loss or utility function that depends on how a user is
affected by deviations of the estimate of the ATE from the truth and then choosing an
estimator or an experimental design that minimizes expected loss or maximizes
expected utility. As Savage (1962) noted, for a Bayesian, this involves allocating
treatments and controls in “the specific layout that promised to tell him the most,” but
without randomization. Of course, this requires serious and perhaps difficult thought
about the mechanisms underlying the ATE, which randomization avoids. Savage also
notes that several people with different priors may be involved in an investigation
and that individual priors may be unreliable because of “vagueness and temptation
to self-deception,” defects that randomization may alleviate, or at least evade. BCMS
(2016) provide a proof of a Bayesian no-randomization theorem, and BCS (2016)
provide an illustration of a school administrator who has long believed that school
outcomes are determined, not by school quality, but by parental background, and
who can learn the most by placing deprived children in (supposed) high-quality
schools and privileged children in (supposed) low-quality schools, which is the kind
of study setting that case study methodology is well attuned to. As BCS note, this
allocation would not persuade those with different priors, and they propose
randomization as a means of satisfying skeptical observers.
Several points are important. First, the anti-randomization theorem is not a
justification of any non-randomized design, for example, one that allows selection
on unobservables, but only of the optimal design that is most informative. According to
Chalmers (2001) and Bothwell and Podolsky (2016), the development of randomization
in medicine originated with Bradford-Hill, who used randomization in the
first RCT in medicine (the streptomycin trial) because it prevented doctors selecting
patients on the basis of perceived need (or against perceived need, leaning over
backward as it were), an argument recently echoed by Worrall (2007). Randomization
serves this purpose, but so do other non-discretionary schemes; what is
required is that hidden information should not be allowed to affect the allocation.
Second, the ideal rules by which units are allocated to treatment or control
depend on the covariates and on the investigators’ priors about how the covariates
affect the outcomes. This opens up all sorts of methods of inference that are long
familiar to economists but that are excluded by pure randomization. For example,
what philosophers call the hypothetico-deductive method works by using theory to
make a prediction that can be taken to the data for potential falsification (as in the
school example above). This is the way that physicists learn, as do economists when
they use theory to derive predictions that can be tested against the data, perhaps in
an RCT, but more frequently not. Some of the most fruitful research programs in
economics have been generated by the puzzles that result when the data fail to
match such theoretical predictions, such as the equity premium puzzle, various
purchasing power parity puzzles, the Feldstein-Horioka puzzle, the consumption
smoothness puzzle, the puzzle of calorie decline in the face of malnourishment and
income growth, and many others.
Third, randomization, by running roughshod over prior information from
theory and from covariates, is wasteful and even unethical when it unnecessarily
exposes people, or unnecessarily many people, to possible harm in a risky experiment.
Worrall (2008) documents the (extreme) case of ECMO, a new treatment for
newborns with persistent pulmonary hypertension that was developed in the 1970s by
intelligent and directed trial and error within a well-understood theory of the
disease. In early experimentation by the inventors, mortality was reduced from 80 to
20 percent. The investigators felt compelled to conduct an RCT, albeit with an
adaptive ‘play-the-winner’ design in which each success in an arm increased the
probability of the next baby being assigned to that arm. One baby received conventional
therapy and died, 11 received ECMO and lived. Even so, a standard randomized
controlled trial was thought necessary. With a stopping rule of four deaths, four more
babies (out of ten) died in the control group and none of the nine who received
ECMO.
Fourth, the non-random methods use prior information, which is why they
do better than randomization. This is both an advantage and a disadvantage,
depending on one’s perspective. If prior information is not widely accepted, or is seen
as non-credible by those we are seeking to persuade, we will generate more credible
estimates if we do not use those priors. Indeed, this is why BCS (2016) recommend
randomized designs, including in medicine and in development economics. They
develop a theory of an investigator who is facing an adversarial audience who will
challenge any prior information and can even potentially veto results based on it
(think of administrative agencies such as the FDA or journal referees). The
experimenter trades off his or her own desire for precision (and preventing possible harm
to subjects), which would require prior information, against the wishes of the
audience, who wants nothing to do with those priors. Even then, the approval of the
audience is only ex ante; once the fully randomized experiment has been done, nothing
stops critics arguing that, in fact, the randomization did not offer a fair test because
important other causes were not balanced. Among doctors who use RCTs, and
especially meta-analysis, such arguments are (appropriately) common (see Kramer
(2016)).
Today, when the public has come to question expert prior knowledge, RCTs
will flourish. In cases where there is good reason to doubt the good faith of
experimenters, as in many pharmaceutical trials, randomization will indeed be an
appropriate response. But we believe such arguments are destructive for scientific
endeavor (which is not the purpose of the FDA) and should be resisted as a general
prescription in scientific research. Previous knowledge needs to be built on and
incorporated into new knowledge, not discarded in the face of aggressive ignorance.
The systematic refusal to use prior knowledge and the associated preference for
RCTs are recipes for preventing cumulative scientific progress. In the end, it is also
self-defeating. To quote Rodrik (2016) “the promise of RCTs as theory-free learning
machines is a false one.”
1.5 Statistical inference in RCTs
The estimated ATE in a simple RCT is the difference in the means between the
treatment and control groups. When covariates are allowed for, as in most RCTs in
economics, the ATE is usually estimated from the coefficient on the treatment dummy
in a regression that looks like (1), but with the heterogeneity in $\beta$ ignored. Modern
work calculates standard errors allowing for the possibility that residual variances
may be different in the treatment and control groups, usually by clustering the
standard errors, which is equivalent to the familiar two sample standard error in
the case with no covariates. Statistical inference is done with t-values in the usual
way. These procedures do not always give the right answer.
Looking back at (1), the underlying objects of interest are the individual
treatment effects $\beta_i$ for each of the individuals in the trial sample. Neither they, nor
their distribution $G(\beta)$ is identified from an RCT; because RCTs make so few
assumptions which, in many cases, is their strength, they can identify only the mean of
the distribution. In many observational studies, researchers are prepared to make
more assumptions on functional forms or on distributions, and for that price we are
able to identify other quantities of interest. Without these assumptions, inferences
must be based on the difference in the two means, a statistic that is sometimes
ill-behaved, as we shall discuss below. This ill-behavior has nothing to do with RCTs,
per se, but within RCTs, and their minimal assumptions, we cannot easily switch
from the mean to some other quantity of interest.
Fisher proposed that statistical inference should be done using what has
become known as “randomization” inference, a procedure that is as non-parametric as
the RCT-based estimate of an ATE itself. To test the null hypothesis that $\beta_i = 0$ for
all i, note that, under the null that the treatment has no effect on any individual, an
estimated nonzero ATE must be a consequence of the particular random allocation
that generated it. By tabulating all possible combinations of treatments and controls
in our trial sample, and the ATE associated with each, we can calculate the exact
distribution of the estimated ATE under the null. This allows us to calculate the
probability of obtaining an estimate as large as our actual estimate when there are no
effects of treatment. This randomization test requires a finite sample, but it will work
for any sample size (see Imbens and Wooldridge (2009) for an excellent account of
the procedure). Imbens (2010) argues that it is this randomization inference plus
the unbiasedness of the ATE that provides the twin non-parametric pillars that
support placing RCTs at the “very top” of the hierarchy of evidence.
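As a concrete illustration of the tabulation described above, the following minimal sketch enumerates every possible allocation for a tiny, made-up trial and computes the exact p-value under the sharp null. The outcome values, the function name, and the two-sided comparison in absolute value are assumptions chosen for the example, not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def randomization_p_value(outcomes, treated_idx):
    """Exact randomization test of the sharp null that no unit is affected.

    Under that null the observed outcomes are fixed, so the estimated ATE
    can be recomputed for every way of choosing the treated group.
    """
    n, k = len(outcomes), len(treated_idx)

    def ate(idx):
        chosen = set(idx)
        treated = [y for i, y in enumerate(outcomes) if i in chosen]
        control = [y for i, y in enumerate(outcomes) if i not in chosen]
        return mean(treated) - mean(control)

    observed = ate(treated_idx)
    null_ates = [ate(idx) for idx in combinations(range(n), k)]
    # Two-sided: count allocations at least as extreme as the observed one.
    extreme = sum(1 for a in null_ates if abs(a) >= abs(observed) - 1e-12)
    return observed, extreme / len(null_ates)


# Hypothetical trial: units 0-3 were treated, units 4-7 were controls.
outcomes = [12.0, 15.0, 11.0, 14.0, 9.0, 8.0, 10.0, 7.0]
observed_ate, p_value = randomization_p_value(outcomes, (0, 1, 2, 3))
print(observed_ate, p_value)  # 4.5 and 2/70, about 0.029
```

Full enumeration is feasible only for small samples (here 70 allocations); in larger trials the same distribution is usually approximated by drawing allocations at random.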
Randomization inference can be used for null hypotheses that specify that all
of the treatment effects are zero, as in the above example, but it cannot be used to
test the hypothesis that the average treatment effect is zero, which will often be of
interest. In agricultural trials, and in medicine, the stronger (sharp) hypothesis that
the treatment has no effect whatever is often of interest. In many economic
applications that involve money, such as welfare experiments or cost-benefit analyses, we
are interested in whether the net effect of the treatment is positive or negative, and
in these cases, randomization inference cannot be used. None of which argues
against its wider use in social sciences when appropriate.
In cases where randomization inference cannot be used, we must construct
tests for the differences in two means. Standard procedures will often work well, but
there are two potential pitfalls. One, the ‘Fisher-Behrens problem’, comes from the
fact that, when the two samples have different variances (which we typically want
to permit), the usual t-statistic does not have the t-distribution. The second
problem, which is much harder to address, occurs when the distribution of treatment
effects is not symmetric (Bahadur and Savage (1956)). Neither pitfall is specific to
RCTs, but RCTs force us to work with means in estimating treatment effects and,
with only a very few exceptions in the literature, social scientists who use RCTs
appear to be unaware of the difficulties.
In the simple case of comparing two means in an RCT without covariates,
inference is usually based on the two-sample t-statistic which is computed by
dividing the ATE by the estimated standard error whose square is given by
$$\hat{\sigma}^2 = \frac{(n_1 - 1)^{-1} \sum_{i \in 1} (Y_i - \bar{Y}_1)^2}{n_1} + \frac{(n_0 - 1)^{-1} \sum_{i \in 0} (Y_i - \bar{Y}_0)^2}{n_0} \qquad (3)$$
where 0 refers to controls and 1 to treatments, so that there are $n_1$ treatments and
$n_0$ controls, and $\bar{Y}_1$ and $\bar{Y}_0$ are the two means. As has long been known, the
“t-statistic” based on (3) is not distributed as Student’s t if the two variances (treatment and
control) are not identical but has the Behrens–Fisher distribution. In extreme cases,
when one of the variances is zero, the t-statistic has effective degrees of freedom
half of that of the nominal degrees of freedom, so that the test-statistic has thicker
tails than allowed for, and there will be too many rejections when the null is true.
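The halving of the degrees of freedom can be checked numerically. The sketch below computes the statistic based on (3) together with the Welch–Satterthwaite approximation to the effective degrees of freedom; that approximation is a standard textbook device brought in here for illustration, not something the text prescribes, and the data and function name are hypothetical.

```python
from statistics import mean, variance


def two_sample_t(treated, control):
    """t-statistic using the squared standard error in (3), plus the
    Welch-Satterthwaite approximate degrees of freedom (an assumption of
    this sketch, used as one common approximation)."""
    n1, n0 = len(treated), len(control)
    v1 = variance(treated) / n1  # treatment-arm term of (3)
    v0 = variance(control) / n0  # control-arm term of (3)
    se_squared = v1 + v0
    t = (mean(treated) - mean(control)) / se_squared ** 0.5
    df = se_squared ** 2 / (v1 ** 2 / (n1 - 1) + v0 ** 2 / (n0 - 1))
    return t, df


# Hypothetical extreme case: the control arm has zero variance.
treated = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
control = [3.0] * 10
t_stat, effective_df = two_sample_t(treated, control)
nominal_df = len(treated) + len(control) - 2
print(t_stat, effective_df, nominal_df)
```

With equal arm sizes and one arm of zero variance, the approximate degrees of freedom come out at n − 1 (here 9), half the nominal 2(n − 1) (here 18), matching the extreme case described above.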
Young (2016) argues that this problem is worse when the trial results are
analyzed by regressing outcomes not only on the treatment dummy but also on
additional controls and when using clustered or robust standard errors. When the
design matrix is such that the maximal influence is large, so that for some observations
outcomes have large influence on their own predicted values, there is a reduction in
the effective degrees of freedom for the t-value(s) of the average treatment effect(s)
leading to spurious findings of significance. Young looks at 2,003 regressions
reported in 53 RCT papers in the American Economic Association journals and
recalculates the significance of the estimates using randomization inference applied to the
authors’ original data. In 30 to 40 percent of the estimated treatment effects in
individual equations with coefficients that are reported as significant, he cannot reject
the null of no effect for any observation; the fraction of spuriously significant results
increases further when he simultaneously tests for all results in each paper. These
spurious findings come in part from issues of multiple-hypothesis testing, both
within regressions with several treatments and across regressions. Within
regressions, treatments are largely orthogonal, but authors tend to emphasize significant
t-values even when the corresponding F-tests are insignificant. Across equations,
results are often strongly correlated, so that, at worst, different regressions are
reporting variants of the same result, thus spuriously adding to the “kill count” of
significant effects. At the same time, the pervasiveness of observations with high
influence generates spurious significance on its own.
These issues are now being taken more seriously. In addition to Young
(2016), Imbens and Kolesár (2016) provide practical advice for dealing with the
Fisher-Behrens problem, and the best current practice tries to be careful about
multiple hypothesis testing. Yet it remains the case that many of the results in the
literature are spuriously significant.
Spurious significance also arises when the distribution of treatment effects
contains outliers or, more generally, is not symmetric. Standard t-tests break down
in distributions with enough skewness (see Lehmann and Romano (2005, p. 466–8)).
How difficult is it to maintain symmetry? And how badly is inference affected
when the distribution of treatment effects is not symmetric? In economics, many
trials have outcomes valued in money. Does an anti-poverty innovation, for example
microfinance, increase the incomes of the participants? Income itself is not
symmetrically distributed, and this might be true of the treatment effects too if there are
a few people who are talented but credit-constrained entrepreneurs and who have
treatment effects that are large and positive, while the vast majority of borrowers
fritter away their loans, or at best make positive but modest profits. A recent
summary of the literature is consistent with this (see Banerjee, Karlan, and Zinman
(2015)). Another important example is expenditures on healthcare. Most people
have zero expenditure in any given period, but among those who do incur
expenditures, a few individuals spend huge amounts that account for a large share of the
total. Indeed, in the famous Rand health experiment (see Manning, Newhouse et al.
(1987, 1988)), there is a single very large outlier. The authors realize that the
comparison of means across treatment arms is fragile, and, although they do not see
their problem exactly as described here, they obtain their preferred estimates using
a structural approach that is explicitly designed to model the skewness of
expenditures.
In some cases, it will be appropriate to deal with outliers by trimming,
transforming, or eliminating observations that have large effects on the estimates. But if
the experiment is a project evaluation designed to estimate the net benefits of a
policy, the elimination of genuine outliers, as in the Rand Health Experiment, will
vitiate the analysis. It is precisely the outliers that make or break the program.
Transformations, such as taking logarithms, may help to produce symmetry, but they
change the nature of the question being asked; a cost benefit analysis must be done
in dollars, not log dollars.
We consider an example that illustrates what can happen in a realistic but
simplified case; the full results are reported in the Appendix. We imagine a
population of individuals, each with a treatment effect $\beta_i$. The parent population mean of
the treatment effects is zero, but there is a long tail of positive values; we use a
left-shifted lognormal distribution. We have a microfinance trial in mind, where there is
a long positive tail of rare individuals who can do amazing things with credit while
most people cannot use it effectively. A trial sample of 2n individuals is randomly
drawn from the parent population and is randomly split between n treatments and
n controls. Within each trial sample, whose true ATE will generally differ from zero
because of the sampling, we run many RCTs and tabulate the values of the ATE for
each.
Using standard t-tests, the (true in the parent distribution) hypothesis that
the ATE is zero is rejected between 14 ($n = 25$) and 6 percent ($n = 500$) of the time.
These rejections come from two separate issues, both of which are relevant in
practice; (a) that the ATE in the trial sample differs from the ATE in the parent population of
interest, and (b) that the t-values are not distributed as t in the presence of outliers.
The problem cases are when the trial sample happens to contain one or more
outliers, something that is always a risk given the long positive tail of the parent
distribution. When this happens, everything depends on whether the outlier is among the
treatments or the controls; in effect, the outliers become the sample, reducing the
effective number of degrees of freedom. In extreme cases, one of which is illustrated
in Figure A.1, the distribution of estimated ATEs is bimodal, depending on the group
to which the outlier is assigned. When the outlier is in the treatment group, the
dispersion across outcomes is large, as is the estimated standard error, and so those
outcomes rarely reject the null using the standard table of t-values. The
over-rejections come from cases when the outlier is in the control group, the outcomes are not
so dispersed, and the t-values can be large, negative, and significant. While these
cases of bimodal distributions may not be common and depend on the existence of
large outliers, they illustrate the process that generates the over-rejections and
spurious significance. Note that there is no remedy through randomization inference
here, given that our interest is in the hypothesis that the average treatment effect is
zero.
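A rough version of this thought experiment can be simulated directly. The sketch below is not the Appendix design: it assumes standard normal baseline outcomes, treatment effects drawn as a lognormal shifted to have mean zero, one randomization per freshly drawn trial sample, and a fixed critical value of 2.0 standing in for the 5 percent two-sided t cutoff. All names and parameter values are hypothetical, and the rejection rates it produces should not be expected to reproduce the figures quoted above.

```python
import math
import random
from statistics import mean, variance


def one_trial_rejects(n, rng, sigma=1.0, critical_value=2.0):
    """Draw a trial sample of 2n units, randomize, and test ATE = 0.

    Assumptions of this sketch: outcome = baseline + effect * treated,
    baseline ~ N(0, 1), effect ~ lognormal(0, sigma) shifted to mean zero.
    """
    shift = math.exp(sigma ** 2 / 2)  # mean of lognormal(0, sigma)
    effects = [rng.lognormvariate(0.0, sigma) - shift for _ in range(2 * n)]
    baseline = [rng.gauss(0.0, 1.0) for _ in range(2 * n)]
    order = list(range(2 * n))
    rng.shuffle(order)
    treated_ids = set(order[:n])
    treated = [baseline[i] + effects[i] for i in treated_ids]
    control = [baseline[i] for i in range(2 * n) if i not in treated_ids]
    se = math.sqrt(variance(treated) / n + variance(control) / n)
    t_value = (mean(treated) - mean(control)) / se
    return abs(t_value) > critical_value


def rejection_rate(n, reps, seed=0):
    """Share of simulated trials that reject the (true) null of zero ATE."""
    rng = random.Random(seed)
    return sum(one_trial_rejects(n, rng) for _ in range(reps)) / reps


for n in (25, 500):
    print(n, rejection_rate(n, reps=200))
```

Raising `sigma` lengthens the positive tail, which is one way to explore how sensitive the rejection rate is to rare, very large treatment effects.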
Our reading of the literature on RCTs in development economics suggests
that they are not exempt from these concerns. Many development trials are run on
(sometimes very) small samples, they have treatment effects where asymmetry is
hard to rule out, especially when the outcomes are in money, and they often give
results that are puzzling, or at least not easily interpreted in terms of economic
theory. Neither Banerjee and Duflo (2012) nor Karlan and Appel (2011), who cite many
RCTs, raise concerns about misleading inference, implicitly treating all results as
reliable. No doubt there are behaviors in the world that are inconsistent with standard
economics, and some can be explained by standard biases in behavioral economics,
but it would also be good to be suspicious of the significance tests before accepting
that an unexpected finding is well-supported and that theory must be revised.
Replication of results in different settings may be helpful, if they are the right kind of
places (see our discussion in Section 2). Yet it hardly solves the problem given that
the asymmetry may be in the same direction in different settings, that it seems likely
to be so in just those settings that are sufficiently like the original trial setting to be
of use for inference about the population of interest, and that the “significant”
t-values will show departures from the null in the same direction. This, then, replicates
the spurious findings.
A summary
What do the arguments of this section mean about the importance of randomization
and the interpretation that should be given to an estimated ATE from a randomized
trial? First, we should be sure that an unbiased estimate of an ATE for the trial
population is likely to be useful enough to warrant the costs of running the trial. Second,
since randomization does not ensure orthogonality, care must be taken (e.g. by
blinding) that there are no significant post-randomization correlates with the
treatment. This is a well-known lesson but many social and economic trials are not
blinded and insufficient defense is offered that unbiasedness is not undermined.
Indeed, lack of blinding is not the only source of post-randomization bias. Treatments
and controls may be handled in different places, or by differently trained
practitioners, or at different times of day, and these differences can bring with them
systematic differences in the other causes to which the two groups are exposed. These can,
and should, be guarded against. But doing so requires an understanding of what
these causally relevant factors might be. Third, the inference problems reviewed
here cannot just be presumed away. When there is substantial heterogeneity, the
ATE in the trial sample can be quite different from the ATE in the population of
interest, even if the trial is randomly selected from that population; in practice, the
relationship between the trial sample and the population is often obscure.
Beyond that, in many cases, the statistical inference will be fine, but serious
attention should be given to the possibility that there are outliers in treatment
effects, something that knowledge of the problem can suggest and where inspection of
the marginal distributions of treatments and controls may be informative. For
example, if both are symmetric, it seems unlikely (though certainly not impossible) that
the treatment effects are highly skewed. Measures to deal with Fisher-Behrens
should be used and randomization inference considered when appropriate to the
hypothesis of interest.
All of this can be regarded as recommendations for improvement to current
practice, not a challenge to it. More fundamentally, we strongly contest the
often-expressed idea that the ATE calculated from an RCT is automatically reliable, that
randomization automatically controls for unobservables, or worst of all, that the
calculated ATE is true. If, by chance, it is close to the truth, the truth we are referring to is
the truth in the trial sample only. To make any inference beyond that requires an
argument of the kind we consider in the next section. We have also argued that,
depending on what we are trying to measure and what we want to use that measure
for, there is no presumption that an RCT is the best means of estimating it. That too
requires an argument, not a presumption.
Section 2: Using the results of randomized controlled trials
2.1 Introduction
Suppose we have estimated an ATE from a well-conducted RCT on a trial sample,
and our standard error gives us reason to believe that the effect did not come about
by chance. We thus have good warrant that the treatment causes the effect in our
trial sample, up to the limits of statistical inference. What are such findings good for?
The literature in economics, as indeed in medicine and in social policy, has
paid more attention to obtaining results than to considering what can be done with
them. There is little theoretical or empirical work to guide us how and for what
purposes to use the findings of RCTs, such as the conditions under which the same
results hold outside of the original settings, how they might be adapted for use
elsewhere, or how they might be used for formulating, testing, understanding, or
probing hypotheses beyond the immediate relation between the treatment and the
outcome investigated in the study. Yet it cannot be that knowing how to use results is
less important than knowing how to demonstrate them. Any chain of evidence is only
as strong as its weakest link, so that a rigorously established effect whose
applicability is justified by a loose declaration of simile warrants little. If trials are to be useful,
we need paths to their use that are as carefully constructed as are the trials
themselves.
The argument for the “primacy of internal validity” made by Shadish, Cook,
and Campbell (2002) may be reasonable as a warning that bad RCTs are unlikely to
generalize, but it is sometimes incorrectly taken to imply that results of an internally
valid trial will automatically, or often, apply ‘as is’ elsewhere, or that this should be
the default assumption failing arguments to the contrary, as if a parameter, once
well established, can be expected to be invariant across settings. An invariance
assumption is often made in medicine, for example, where it is sometimes plausible
that a particular procedure or drug works the same way everywhere (though see
Horton (2000) for a strong dissent and Rothwell (2005) for examples on both sides
of the question). We should also note the recent movement to ensure that testing of
drugs includes women and minorities because members of those groups suppose
that the results of trials on mostly healthy young white males do not apply to them.
2.2 Using results, transportability, and external validity
Suppose a trial has established a result in a specific setting. If ‘the same’ result holds
elsewhere, it is said to have ‘external validity’. External validity may refer just to the
transportability of the causal connection, or go further and require replication of the
magnitude of the ATE. Either way, the result holds (everywhere, or widely, or in
some specific elsewhere) or it does not.
This binary concept of external validity is often unhelpful because it asks the
results of an RCT to satisfy a condition that is neither necessary nor sufficient for a
trial to be useful, and so both overstates and understates their value. It directs us
toward simple extrapolation (whether the same result holds elsewhere) or simple
generalization (it holds universally or at least widely) and away from more
complex but more useful applications of the results. The failure of external validity
interpreted as simple generalization or extrapolation says little about the value of the
trial.
First, there are several uses of RCTs that do not require transportability
beyond the original context; we discuss these in the next subsection. Second, there are
often good reasons to expect that the results from a well-conducted, informative,
and potentially useful RCT will not apply elsewhere in any simple way. Without
further understanding and analysis, even successful replication tells us little either for
or against simple generalization or to support the conclusion that the next will
work in the same way. Nor do failures of replication make the original result useless.
We often learn much from coming to understand why replication failed and can use
that knowledge in looking for how the factors that caused the original result might
operate differently in different settings. Third, and particularly important for
scientific progress, the RCT result can be incorporated into a network of evidence and
hypotheses that test or explore claims that look very different from the results
reported from the RCT. We shall give examples below of extremely useful RCTs that
are not externally valid in the (usual) sense that their results do not hold elsewhere,
whether in a specific target setting or in the more sweeping sense of holding
everywhere.
Bertrand Russell’s chicken (Russell (1912)) provides an excellent example of
the limitations to straightforward extrapolation from repeated successful
replication. The bird infers, on repeated evidence, that when the farmer comes in the
morning, he feeds her. The inference serves her well until Christmas morning, when
he wrings her neck and serves her for dinner. Though this chicken did not base her
inference on an RCT, had we constructed one for her, we would have obtained the
same result that she did. Her problem was not her methodology, but rather that she
did not understand the social and economic structure that gave rise to the causal
relations that she observed.
So, establishing causality does nothing in and of itself to guarantee
generalizability. Nor does the ability of an ideal RCT to eliminate bias from selection or from
omitted variables mean that the resulting ATE from the trial sample will apply
anywhere else. The issue is worth mentioning only because of the enormous weight
that is currently attached in economics to the discovery and labeling of causal
relations, a weight that is hard to justify for effects that may have only local
applicability, what might be labeled ‘anecdotal causality’. The operation of a cause generally
requires the presence of “support factors”, without which a cause that produces the
targeted effect in one place, even though it may be present and have the capacity to
operate elsewhere, will remain latent and inoperative. What Mackie (1974) called
INUS causality (Insufficient but Non-redundant parts of a condition that is itself
Unnecessary but Sufficient for a contribution to the outcome) is often the kind of
causality we see. A standard example is a house burning down because the television
was left on, although televisions do not operate in this way without support factors,
such as wiring faults, the presence of tinder, and so on. This is standard fare in
epidemiology, which uses the term ‘causal pie’ to refer to a set of causes that are jointly
but not separately sufficient for an effect.
We can rewrite (1) in the form
$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} = \theta(w_i) T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (4)$$
where the function $\theta(\cdot)$ controls how a k-vector $w_i$ of k ‘support factors’ affects
individual i’s treatment effect $\beta_i$. The support factors may include some of the x’s. Since
the ATE is the average of the $\beta_i$s, two populations will have the same ATE if and only
if they have the same average for the net effect of the support factors necessary for
the treatment to work, i.e. for the quantity in front of $T_i$. These are however just the
kind of factors that are likely to be differently distributed in different populations,
and indeed we do generally find different ATEs in different economic (and other
social policy) RCTs in different places even in the cases where (unusually) they all
point in the same direction.
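To make the role of the support factors in (4) concrete, the following sketch compares two hypothetical populations that share exactly the same function θ but differ in how often the support factors are present. The two binary factors, their prevalences, and the effect size of 2.0 are invented for illustration only.

```python
import random
from statistics import mean


def theta(w):
    """Hypothetical treatment effect: 2.0 only when both support factors
    are present, zero otherwise (an INUS-style conjunction)."""
    has_clinic, has_road = w
    return 2.0 if (has_clinic and has_road) else 0.0


def population_ate(p_clinic, p_road, size, rng):
    """Average of theta(w_i) over a simulated population in which the two
    support factors occur independently with the given probabilities."""
    effects = []
    for _ in range(size):
        w = (rng.random() < p_clinic, rng.random() < p_road)
        effects.append(theta(w))
    return mean(effects)


rng = random.Random(0)
ate_a = population_ate(0.9, 0.8, 20_000, rng)  # support factors common
ate_b = population_ate(0.3, 0.2, 20_000, rng)  # support factors rare
print(ate_a, ate_b)  # close to 2 * 0.72 = 1.44 and 2 * 0.06 = 0.12
```

The causal mechanism is identical in both populations, yet the ATEs differ by an order of magnitude purely because the support factors are distributed differently, which is the sense in which an ATE from one setting need not transport to another.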
Causal processes often require highly specialized economic, cultural, or social
structures to enable them to work. Consider the Rube Goldberg machine that is
rigged up so that flying a kite sharpens a pencil (Cartwright and Hardie (2012, 77)).
The underlying structure affords a very specific form of (4) that will not describe
causal processes elsewhere. Neither the same ATE nor the same qualitative causal
relations can be expected to hold where the specific form for (4) is different. Indeed,
we continually attempt to design systems that will generate causal relations that we
like and that will rule out causal relations that we do not like. Healthcare systems
are designed to prevent nurses and doctors making errors; cars are designed so that
drivers cannot start them in reverse; work schedules for pilots are designed so they
do not fly too many consecutive hours without rest because alertness and
performance are compromised.
As in the Rube Goldberg machine and in the design of cars and work
schedules, the economic structure and equilibrium may differ in ways that support
different kinds of causal relations and thus render a trial in one setting useless in another.
For example, a trial that relies on providing incentives for personal promotion is of
no use in a state in which a political system locks people into their social and
economic positions. Cash transfers that are conditional on parents taking their children
to clinics cannot improve child health in the absence of functioning clinics. Policies
targeted at men may not work for women. We use a lever to toast our bread, but
levers only operate to toast bread in a toaster; we cannot brown toast by pressing an
accelerator, even if the principle of the lever is the same in both a toaster and a car.
If we misunderstand the setting, if we do not understand why the treatment in our
RCT works, we run the same risks as Russell’s chicken.
2.3 When RCTs speak for themselves: no transportability required
For some things we want to learn, an RCT is enough by itself. An RCT may provide a
counterexample to a general theoretical proposition, either to the proposition itself
(a simple refutation test) or to some consequence of it (a complex refutation test).
An RCT may also confirm a prediction of a theory, and although this does not
confirm the theory, it is evidence in its favor, especially if the prediction seems
inherently unlikely in advance. This is all familiar territory, and there is nothing unique
about an RCT; it is simply one among many possible testing procedures. Even when
there is no theory, or very weak theory, an RCT, by demonstrating causality in some
population, can be thought of as proof of concept, that the treatment is capable of
working somewhere. This is one of the arguments for the importance of internal
validity.
Nor is transportation called for when an RCT is used for evaluation, for
example to satisfy donors that the project they funded achieved its aims in the population
in which it was conducted. Even so, for such evaluations, say by the World Bank, to
be global public goods requires arguments and guidelines that justify using the
results in some way elsewhere; the global public good is not an automatic by-product
of the Bank fulfilling its fiduciary responsibility. When the components of
treatments change across studies, evaluations need not lead to cumulative knowledge. Or
as Heckman et al (1999, 1934) note, “the data produced from them [social
experiments] are far from ideal for estimating the structural parameters of behavioral
models. This makes it difficult to generalize findings across experiments or to use
experiments to identify the policy-invariant structural parameters that are required
for econometric policy evaluation.”
Of course, when we ask exactly what those invariant structural parameters
are, whether they exist, and how they should be modeled, we open up major fault
lines in modern applied economics. For example, we do not intend to endorse
intertemporal dynamic models of behavior as the only way of recovering the parameters
that we need. We also recognize that the usefulness of simple price theory is not as
universally accepted as it once was. But the point remains that we need something,
some regularity or some invariance, and that something can rarely be recovered by
simply generalizing across trials.
A third non-problematic and important use of an RCT is when the parameter
of interest is the ATE in a well-defined population from which the trial sample is
itself a random sample. In this case the sample average treatment effect (SATE) is an
unbiased estimator of the population average treatment effect (PATE) that, by
assumption, is our target (see Imbens (2004) for these terms). We refer to this as the
‘public health’ case; like many public health interventions, the target is the average,
‘population health,’ not the health of individuals. One major (and widely
recognized) danger of this use of RCTs is that scaling up from (even a random) sample to
the population will not go through in any simple way if the outcomes of individuals
or groups of individuals change the behavior of others, which will be common in
economic examples but perhaps less so in health. There is also an issue of timing if
time elapses between the trial and the implementation.
In economics, a ‘public-health-style’ example is the imposition of a
commodity tax, where the total tax revenue is of interest and policymakers do not care who
pays the tax. Indeed, theory can often identify a specific, well-defined quantity
whose measurement is key for a policy (see Deaton and Ng (1998) for an example of
what Chetty (2009) calls a “sufficient” statistic). In this case, the behavior of a
random sample of individuals might well provide a good guide to the tax revenue that
can be expected. Another case comes from work on poverty programs where the
sponsors are most concerned about the budget; we discuss these cases at the end of
this Section. Even here, it is easy to imagine behavioral effects coming into play that
drive a wedge between the trial and its full-scale implementation, for example if
compliance is higher when the scheme is widely publicized, or if government
agencies implement the scheme differently from trialists.
2.4 Transporting results laterally and globally
The program of RCTs in economics, as in other areas of social science, has the
broader goal of finding out ‘what works.’ At its most ambitious, this aims for
universal reach, and the development economics literature frequently argues that
“credible impact evaluations are global public goods in the sense that they can offer
reliable guidance to international organizations, governments, donors, and
nongovernmental organizations (NGOs) beyond national borders”, Duflo and Kremer (2008,
93). Sometimes the results of a single RCT are advocated as having wide
applicability, with especially strong endorsement when there is at least one replication. For
example, Kremer and Holla (2009, 3) use a Kenyan trial as the basis for a blanket
statement without specifying context, “Provision of free school uniforms, for
example, leads to 10%-15% reductions in teen pregnancy and dropout rates.” Duflo and
Kremer (2008, 104), writing about another trial, are more cautious, citing two
evaluations and restricting themselves to India: “One can be relatively confident about
recommending the scaling-up of this program, at least in India, on the basis of these
estimates, since the program was continued for a period of time, was evaluated in
two different contexts, and has shown its ability to be rolled out on a large scale.”
Even a number of replications do not provide a sound basis for inference. Without
theory to support the projection of results, this is just induction by simple
enumeration: swan 1 is white, swan 2 is white, ..., so all swans are white.
The problem of generalization extends beyond RCTs, to both ‘fully
controlled’ laboratory experiments and to most non-experimental findings. Our
argument here is that evidence from RCTs is not automatically generalizable, and
that its superior internal validity, if and when it exists, does not provide it with any
unique invariance across contexts. That transportation is far from automatic also
tells us why (even ideal) RCTs of similar interventions give different answers in
different settings. Such differences do not necessarily reflect methodological failings
and will hold across perfectly executed RCTs just as they do across observational
studies.
Many advocates of RCTs understand that ‘what works’ needs to be qualified
to ‘what works under which circumstances’ and try to say something about what
those circumstances might be, for example, by replicating RCTs in different places
and thinking intelligently about the differences in outcomes when they find them.
Sometimes this is done in a systematic way, for example by having multiple
treatments within the same trial so that it is possible to estimate a ‘response surface’ that
links outcomes to various combinations of treatments (see Greenberg and Schroder
(2004) or Shadish et al (2002)). For example, the RAND health experiment had
multiple treatments, allowing investigation of how much health insurance
increased expenditures under different circumstances. Some of the negative income
tax experiments (NITs) in the 1960s and 1970s were designed to estimate response
surfaces, with the number of treatments and controls in each arm optimized to
maximize precision of estimated response functions subject to an overall cost limit (see
Conlisk (1973)). Experiments on time-of-day pricing for electricity had a similar
structure (see Aigner (1985)).
The experiments by MDRC (originally known as the Manpower Development
Research Corporation) have also been analyzed across cities in an effort to link city
features to the results of the RCTs within them (see Bloom, Hill, and Riccio (2005)).
Unlike the RAND and NIT examples, these are ex post analyses of completed trials;
the same is true of Vivalt (2015), who finds, for the collection of trials she studied,
that development-related RCTs run by government agencies typically find smaller
(standardized) effect sizes than RCTs run by academics or by NGOs. Bold et al
(2013), who ran parallel RCTs on an intervention implemented either by an NGO or
by the government of Kenya, found similar results there. Note that these analyses
have a different purpose from meta-analyses that assume that different trials
estimate the same parameter up to noise and average in order to increase precision.
Although there are issues with all of these methods of investigating differences
across trials, without some discipline it is too easy to come up with ‘just-so’ or fairy
stories that account for differences. We risk a procedure that, if a result is replicated
in full or in part in at least two places, puts that treatment into the ‘it works’ box
and, if the result does not replicate, casually interprets the difference in a way that
allows at least some of the findings to survive.
How can we do better than simple generalization and simple extrapolation?
Many writers emphasize the role of theory in transporting and using the results of
trials, and we shall discuss this in the next subsection. But statistical approaches are
also widely used; these are designed to deal with the possibility that treatment
effects vary systematically with other variables. Referring back to (4), it is clear that,
supposing the same form of (4) obtains, if the distribution of the w values is the
same in the new circumstances as in the old, the ATE in the original trial will hold in
the new circumstances. In general, of course, this condition will not hold, nor do we
have any obvious way of checking it unless we know what the support factors are in
both places. One procedure to deal with interactions is post-experimental
stratification, which parallels post-survey stratification in sample surveys. The trial is broken
up into subgroups that have the same combination of known, observable w’s (age,
race, gender for example), then the ATEs within each of the subgroups are
calculated, and then they are reassembled according to the configuration of w’s in the
new context. This can be used to estimate the ATE in a new context, or to correct
estimates to the parent population when the trial sample is not a random sample of
the parent. Other methods can be used when there are too many w’s for
stratification, for example by estimating the probability that each observation in the population
is included in the trial sample as a function of the w’s, then weighting each observation
by the inverse of these propensity scores. A good reference for these methods is
Stuart et al (2011), or in economics, Angrist (2004) and Hotz, Imbens, and Mortimer
(2005).
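The post-experimental stratification just described can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the stratum labels, subgroup ATEs, and population shares are invented numbers, not results from any trial, and the function name `poststratified_ate` is hypothetical. The sketch also takes for granted that the strata capture all of the relevant interactive factors, which is exactly the condition that may fail in practice.

```python
def poststratified_ate(stratum_ates, stratum_shares):
    """Average subgroup ATEs using the shares of each stratum in some population."""
    if set(stratum_ates) != set(stratum_shares):
        raise ValueError("ATEs and shares must cover the same strata")
    total = sum(stratum_shares.values())
    return sum(stratum_ates[s] * stratum_shares[s] / total for s in stratum_ates)


# Hypothetical subgroup ATEs estimated within the trial (illustrative only).
trial_ates = {"young": 4.0, "old": 1.0}

# Hypothetical composition of the trial sample and of the new context.
trial_shares = {"young": 0.5, "old": 0.5}
target_shares = {"young": 0.2, "old": 0.8}

# Same subgroup ATEs, reassembled with two different sets of weights.
ate_in_trial = poststratified_ate(trial_ates, trial_shares)
ate_in_target = poststratified_ate(trial_ates, target_shares)
print(ate_in_trial, ate_in_target)
```

With these made-up numbers the reweighted ATE for the new context (about 1.6) is well below the trial ATE (2.5), even though the treatment effect within each stratum is assumed to be identical in both places.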
These methods are often not applicable, however. First, reweighting works
only when the observable factors used for reweighting include all (and only)
genuine interactive causes. Second, as with any form of reweighting, the variables used to
construct the weights must be present in both the original and new context. For
example, if we are to carry a result forward in time, we may not be able to extrapolate
from a period of low inflation to a period of high inflation. As Hotz et al (2005) note,
it will typically be necessary to rule out such ‘macro’ effects, whether over time, or
over locations. Third, it also depends on assuming that the same governing equation
(4) covers the trial and the target population.
Pearl and Bareinboim (2011, 2014) and Bareinboim and Pearl (2013, 2014)
provide strategies for inferring information about new populations from trial
results that are more general than reweighting. They suppose we have available both
causal information and probabilistic information for population A (e.g. the
experimental one), while for population B (the target) we have only (some) probabilistic
information, and also that we know that certain probabilistic and causal facts are
shared between the two and certain ones are not. They offer theorems describing
what causal conclusions about population B are thereby fixed. Their work
underlines the fact that exactly what conclusions about one population can be supported
by information about another depends on exactly what causal and probabilistic facts
they have in common. But as Muller (2015) notes, this, like the problem with simple
reweighting, takes us back to the situation that RCTs are designed to avoid, where
we need to start from a complete and correct specification of the causal structure.
RCTs can avoid this in estimation (which is one of their strengths, supporting their
credibility) but the benefit vanishes as soon as we try to carry their results to a new
context.
This discussion leads to a number of points. First, we cannot get to general
claims by simple generalization; there is no warrant for the convenient assumption
that the ATE estimated in a specific RCT is an invariant parameter, nor that the
kinds of interventions and outcomes we measure in typical RCTs participate in
general causal relations. While it is true that general causal claims exist (that
gravitational masses attract each other, or that people respond to incentives), these use
relatively abstract concepts and operate at a much higher level than the claims that
can be reasonably inferred from a typical RCT.
Second, thoughtful pre-experimental stratification in RCTs is likely to be
valuable, or failing that, subgroup analysis, because it can provide information that may
be useful for generalization or transportation. For example, Kremer and Holla
(2009) note that, in their trials, school attendance is surprisingly sensitive to small
subsidies, which they suggest is because there are a large number of students and
parents who are on the (financial) margin between attending and not attending
school; if this is indeed the mechanism for their results, a good variable for
stratification would be distance from the relevant cutoff. We also need to know that this
same mechanism works in any new target setting.
Third, we need to be explicit about causal structure, even if that means more
model building and more (or different) assumptions than advocates of RCTs are
often comfortable with. To be clear, modeling causal structure does not commit us
to the elaborate and often incredible assumptions that characterize some structural
modeling in economics, but there is no escape from thinking about the way things
work; the why as well as the what.
Fourth, we will typically need to know more than the results of the RCT itself,
for example about differences in social, economic, and cultural structures and about
the joint distributions of causal variables, knowledge that will often only be
available through observational studies. We will also need external information, both
theoretical and empirical, to settle on an informative characterization of the population
enrolled in the RCT because how that population is described is commonly taken to
be some indication of which other populations the results are likely to be exportable
to. Many medical and psychological journals are explicit about this. For instance, the
rules for submission recommended by the International Committee of Medical
Journal Editors, ICMJE (2015, 14) insist that article abstracts “Clearly describe the
selection of observational or experimental participants (healthy individuals or patients,
including controls), including eligibility and exclusion criteria and a description of
the source population.” An RCT is conducted on a specific trial sample, somehow
drawn from a population of specific individuals. The results obtained are features of
that sample, of those very individuals at that very time, not any other population
with any different individuals that might, for example, satisfy one of the infinite set
of descriptions that the trial sample satisfies.
This same issue is confronted already in study design. Apart from special
cases, like post hoc evaluation for payment-for-results, we are not especially
concerned to learn about the very individuals enrolled in the trial. Most experiments
are, and should be, conducted with an eye to what the results can help us learn
about other populations. This cannot be done without substantial assumptions
about what might and what might not be relevant to the production of the outcome
studied. (For example, the ICMJE guidelines (2015, 14) go on to say: “Because the
relevance of such variables as age, sex, or ethnicity is not always known at the time
of study design, researchers should aim for inclusion of representative populations
into all study types and at a minimum provide descriptive data for these and other
relevant demographic variables.”) So both intelligent study design and
responsible reporting of study results involve substantial background assumptions.
Of course, this is true for all studies. But RCTs require special conditions if
they are to be conducted at all and especially if they are to be conducted
successfully (for example, local agreements, compliant subjects, affordable administrators,
multiple blinding, people competent to measure and record outcomes reliably, a
setting where random allocation is morally and politically acceptable, etc.), whereas
observational data are often more readily and widely available. In the case of RCTs,
there is danger that these kinds of considerations have too much effect. This is
especially worrisome where the features that the trial sample should have are not
justified, made explicit, or subjected to serious critical review.
The need for observational knowledge is one of many reasons why it is
counter-productive to insist that RCTs are the gold standard or that some categories of
evidence should be prioritized over others; these strategies leave us helpless in
using RCTs beyond their original context. The results of RCTs must be integrated with
other knowledge, including the practical wisdom of policymakers, if they are to be
useable outside the context in which they were constructed.
Contrary to much practice in medicine as well as in economics, conflicts
between RCTs and observational results need to be explained, for example by
reference to the different populations in each, a process that will sometimes yield
important evidence, including on the range of applicability of the RCT results
themselves. While the validity of the RCT will sometimes provide an understanding of
why the observational study found a different answer, there is no basis (or excuse)
for the common practice of dismissing the observational study simply because it
was not an RCT and therefore must be invalid. It is a basic tenet of scientific advance
that, as collective knowledge advances, new findings must be able to explain and be
integrated with previous results, even results that are now thought to be invalid;
methodological prejudice is not an explanation.
2.5 Using theory for generalization
Economists have been combining theory and randomized controlled trials since the
early experiments. Orcutt and Orcutt (1968) laid out the inspiration for the income
tax trials using a simple static theory of labor supply. According to this, people
choose how to divide their time between work and leisure in an environment in
which they receive a minimum G if they do not work, and where they receive an
additional amount (1-t)w for each hour they work, where w is the wage rate, and t is a
tax rate. The trials assigned different combinations of G and t to different trial
groups, so that the results traced out the labor supply function, allowing estimation
of the parameters of preferences, which could then be used in a wide range of policy
calculations, for example to raise revenue at minimum utility loss to workers.
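The budget constraint in this static model is simple enough to write down directly. The following Python sketch is illustrative only: the function name `disposable_income`, the wage, the hours, and the four (G, t) arms are all hypothetical values chosen for the example, not the designs used in the actual experiments.

```python
def disposable_income(hours, wage, guarantee, tax_rate):
    """Income under the scheme: guarantee G plus (1 - t) * w per hour worked."""
    return guarantee + (1.0 - tax_rate) * wage * hours


# Hypothetical treatment arms, each a (G, t) pair; values are illustrative only.
arms = [(100.0, 0.3), (100.0, 0.5), (150.0, 0.5), (150.0, 0.7)]

# A hypothetical worker with a fixed wage and fixed hours, before any response.
wage, hours = 10.0, 20.0
incomes = {arm: disposable_income(hours, wage, *arm) for arm in arms}

for (guarantee, tax_rate), income in incomes.items():
    print(f"G={guarantee:.0f}, t={tax_rate:.1f}: income={income:.1f}")
```

The sketch holds hours fixed; the point of assigning several (G, t) combinations in a trial is precisely that hours respond, and it is that response across arms that traces out the labor supply function.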
Following these early trials, there has been a continuing tradition of using
trial results, together with the baseline data collected for the trial, to fit structural
models that are to be used more generally. Early examples include Moffitt (1979) on
labor supply and Wise (1985) on housing; a more recent example is Heckman, Pinto
and Savelyev (2013) for the Perry pre-school program. Development economics
examples include Attanasio, Meghir, and Santiago (2012), Attanasio et al (2015), Todd
and Wolpin (2006), Wolpin (2013), and Duflo, Hanna, and Ryan (2012). These
structural models sometimes require formidable auxiliary assumptions on
functional forms or the distributions of unobservables, but they have compensating
advantages, including the ability to integrate theory and evidence, to make
out-of-sample predictions, and to analyze welfare, and the use of RCT evidence allows the
relaxation of at least some of the assumptions that are needed for identification. In this
way, the structural models borrow credibility from the RCTs and in return help set
the RCT results within a coherent framework. Without some such interpretation, the
welfare implications of RCT results can be problematic; knowing how people in
general (let alone just people in the trial population) respond to some policy is rarely
enough to tell whether or not they are made better off, Harrison (2014a, b).
Traditional welfare economics draws a link from preferences to behavior, a link that is
respected in structural work but often lost in the ‘what works’ literature, and without
which we have no basis for inferring welfare from behavior. What works is not
equivalent to what should be.
Light touch theory can do much to interpret, to extend, and to use RCT
results. In both the RAND Health Experiment and negative income tax experiments, an
immediate issue concerned the difference between short and long-run responses;
indeed, differences between immediate and ultimate effects occur in a wide range of
RCTs. Both health and tax RCTs aimed to discover what would happen if
consumers/workers were permanently faced with higher or lower prices/wages, but the
trials could only run for a limited period. A temporarily high tax rate on earnings is
effectively a ‘fire sale’ on leisure, so that the experiment provided an opportunity to
take a vacation and make up the earnings later, an incentive that would be absent in
a permanent scheme. How do we get from the short-run responses that come from
the trial to the long-run responses that we want to know? Metcalf (1973) and
Ashenfelter (1978) provided answers for the income tax experiments, as did Arrow
(1975) for the RAND Health Experiment.
Arrow’s analysis illustrates how to use both structure and observational data
to transport and adapt results from one setting to another. He models the health
experiment as a two-period model in which the price of medical care is lowered in the
first period only, and shows how to derive what we want, which is the response in
the first period if prices were lowered by the same proportion in both periods. The
magnitude that we want is $S$, the compensated price derivative of medical care in
period 1 in the face of identical increases in $p_1$ and $p_2$ in both periods 1 and 2. This is
equal to $s_{11} + s_{12}$, the sum of the derivatives of period 1’s demand with respect to
the two prices. The trial gives only $s_{11}$. But if we have post-trial data on medical
services for both treatments and controls, we can infer $s_{21}$, the effect of the
experimental price manipulation on post-experimental care. Choice theory, in the form of
Slutsky symmetry, says that $s_{12} = s_{21}$ and so allows Arrow to infer $s_{12}$ and thus $S$. He
contrasts this with Metcalf’s alternative solution, which makes different
assumptions: that two-period preferences are intertemporally additive, in which case the
long-run elasticity can be obtained from knowledge of the income elasticity of
post-experimental medical care, which would have to come from an observational
analysis. These two alternative approaches show how we can choose, based on our
willingness to make assumptions and on the data that we have, a suitable combination
of (elementary and transparent) theoretical assumptions and observational data in
order to adapt and use trial results. Such analysis can also help design the original
trial by clarifying what we need to know in order to use the results of a temporary
treatment to estimate the permanent effects that we need. Ashenfelter provides a
third solution, noting that the two-period model is formally identical to a two-person
model, so that we can use information on two-person labor supply to tell us about
the dynamics.
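Arrow’s argument can be laid out as a short chain of equalities. The display below only restates the steps given in the text; the superscripts ‘trial’ and ‘post’ are labels added here for exposition and are not part of the original notation.

```latex
\begin{align*}
S      &= s_{11} + s_{12}
       && \text{(target: period-1 response to equal price changes in both periods)} \\
s_{11} &= s_{11}^{\text{trial}}
       && \text{(estimated directly from the trial)} \\
s_{21} &= s_{21}^{\text{post}}
       && \text{(estimated from post-trial data on treatments and controls)} \\
s_{12} &= s_{21}
       && \text{(Slutsky symmetry)} \\
S      &= s_{11}^{\text{trial}} + s_{21}^{\text{post}}
       && \text{(substituting the three lines above)}
\end{align*}
```

The only theoretical input is the symmetry condition in the fourth line; everything else comes from data, some of it collected after the trial has ended.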
Theory can often allow us to reclassify new or unknown situations as
analogous to situations where we already have background knowledge. One frequently
useful way of doing this is when the new policy can be recast as equivalent to a
change in the budget constraint that respondents face. The consequences of a new
policy may be easier to predict if we can reduce it to equivalent changes in income
and prices, whose effects are often well understood and well studied. Todd and
Wolpin (2008) and Wolpin (2013) make this point and provide examples. In the
labor supply case, an increase in the tax rate has the same effect as a decrease in the
wage rate, so that we can rely on previous literature to predict what will happen
when tax rates are changed. In the case of Mexico’s PROGRESA conditional cash
transfer program, Todd and Wolpin note that the subsidies paid to parents if their
children go to school can be thought of as a combination of a reduction in children’s
wages and an increase in parents’ income, which allows them to predict the results
of the conditional cash experiment with limited additional assumptions. If this
works, as it partially does in their analysis, the trial helps consolidate previous
knowledge and contributes to an evolving body of theory and of empirical evidence,
including evidence from trials.
The program of thinking about policy changes as equivalent to price and
income changes has a long history in economics; much of rational choice theory can be
so interpreted (see Deaton and Muellbauer (1980) for many examples). When this
conversion is credible, and when a trial on some apparently unrelated topic can be
modeled as equivalent to a change in prices and incomes, and when we can assume
that people in different settings respond relevantly similarly to changes in prices
and incomes, we have a ready-made framework for incorporating the trial results
into previous knowledge, as well as for extending the trial results and using them
elsewhere. Of course, all depends on the validity and credibility of the theory;
people may not in fact treat a tax increase as a decrease in the price of leisure, and
behavioral economics is full of examples where apparently equivalent stimuli generate
non-equivalent outcomes. The embrace of behavioral economics by many of the
current generation of trialists may account for their limited willingness to use
conventional choice theory in this way. Unfortunately, behavioral economics does not yet
offer a replacement for the general framework of choice theory that is so useful in
this regard.
Theory can also help with the problem we raised of delineating the
population to which the trial results immediately apply and for thinking about moving
from this population to populations of interest. Ashenfelter’s (1978) analysis is
again a good illustration and predates much similar work in later literature. The
income tax experiments offered participation in the trial to a random sample of the
population of interest. Because there was no blinding and no compulsion, people
who were randomized into the treatment group were free to choose to refuse
treatment. As in many subsequent analyses, Ashenfelter supposes that people choose to
participate if it is in their interest to do so, depending on what has become known in
the RCT and Instrumental Variables literature as their own idiosyncratic ‘gain.’ The
simple labor supply model gives an approximate condition: If the treatment
increases the tax rate from $t_0$ to $t_1$ with an offsetting increase in G, then an individual
assigned to the experimental group will decline to participate if

$$(t_1 - t_0)\, w_0 h_0 + \tfrac{1}{2}\, s_{00}\, (t_1 - t_0) > G_1 - G_0 \qquad (5)$$

where subscript 1 refers to the treatment situation, 0 to the control, $h_0$ is hours
worked, and $s_{00}$ is the (negative) utility-constant response of hours worked to the
tax rate. If there is no substitution, the second term on the left-hand side is zero, and
people will accept treatment if the increase in G more than makes up for the
increases in taxes payable, the ‘breakeven’ condition. In consequence, those with
higher earnings are less likely to accept treatment. Some better-off people with high
substitution effects will also accept treatment if the opportunity to buy more cheap
leisure is sufficient enticement.
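Condition (5) can be turned into a small decision rule to see how selection into treatment works. The Python sketch below transcribes the inequality as it is printed in the text; the function name `declines_treatment` and all numerical values (tax rates, wage, hours, substitution response, and guarantee) are hypothetical and chosen only to illustrate the three cases discussed above.

```python
def declines_treatment(t0, t1, w0, h0, s00, g0, g1):
    """True if condition (5), as printed in the text, holds for this individual."""
    extra_tax = (t1 - t0) * w0 * h0            # first-order increase in tax payable
    substitution_term = 0.5 * s00 * (t1 - t0)  # zero when there is no substitution
    return extra_tax + substitution_term > g1 - g0


# Hypothetical treatment: tax rate rises from 0.0 to 0.5, guarantee rises by 160.
policy = dict(t0=0.0, t1=0.5, g0=0.0, g1=160.0)

# Higher earner, no substitution: extra tax (200) exceeds the gain in G (160).
high_earner = declines_treatment(w0=10.0, h0=40.0, s00=0.0, **policy)

# Lower earner, no substitution: extra tax (100) is below the gain in G.
low_earner = declines_treatment(w0=10.0, h0=20.0, s00=0.0, **policy)

# Higher earner with a strong (negative) substitution response.
high_earner_substituting = declines_treatment(w0=10.0, h0=40.0, s00=-200.0, **policy)

print(high_earner, low_earner, high_earner_substituting)
```

In this made-up example the higher earner without substitution declines, while the lower earner and the higher earner with a large substitution response both accept, which is the pattern of selective acceptance that the breakeven argument describes.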
The selective acceptance of treatment limits the analyst’s ability to learn
about the better-off or low-substitution people who decline treatment but who
would have to accept it if the policy were implemented. Both the intention-to-treat
estimator and the ‘as treated’ estimator that compares the treated and the
untreated are affected, not just by the labor supply effects that the trial is designed to
induce, but by the kind of selection effects that randomization is designed to
eliminate. Of course, the analysis that leads to (5) can perhaps help us say something
about this and help us adjust the trial estimates back to what we would like to know.
Yet this is no easy matter because selection depends, not only on observables, such
as pre-experimental earnings and hours worked, but on (much harder to observe)
labor supply responses that likely vary across individuals. Paraphrasing Ashenfelter,
we cannot estimate the effects of a permanent compulsory negative income tax
program from a transitory voluntary trial without strong assumptions or additional
evidence.
Much of the modern literature, for example on training programs, wrestles
with the issue of exactly who is represented by the RCT results, including not only
who participates in the first place but who leaves before the trial is completed (see
again Heckman, Lalonde and Smith (1999)). As in the examples above, modeling
attrition within a trial can yield estimates of behavioral responses that can be used to
transport the findings to other settings (see Chan and Hamilton (2006), Chassang,
Padró i Miguel, and Snowberg (2012) and Chassang et al (2015)). When people are
allowed to reject their randomly assigned treatment according to their own (real or
perceived) advantage, or to drop out of a trial on an estimate of the benefits and
costs from doing so, we have come a long way away from the random allocation in
the standard conception of a randomized controlled trial. Moreover, the absence of
blinding is common in social and economic RCTs, and while there are trials, such as
welfare trials, that effectively compel people to accept their assignments, and some
where the treatment is generous enough to do so, there are trials where subjects
have much freedom and, in those cases, it is less than obvious to us what role, if any,
randomization plays in warranting the results.
2.6 Scaling up: using the average for populations
Many RCTs are small-scale and local, for example in a few schools, clinics, or farms
in a particular geographic, cultural, socio-economic setting. If an intervention is
successful according to a cost-effectiveness criterion, for example, it is a candidate for
scaling up, applying the same intervention to a much larger area, often a whole
country, or sometimes even beyond, as when some treatment is considered for all
relevant World Bank projects. The fact that the intervention might work differently at
scale has long been noted in the economics literature, e.g. Garfinkel and Manski (1992),
Heckman (1992), and Moffitt (1992), and is recognized in the recent review by Banerjee and
Duflo (2009). We want here to emphasize the pervasiveness of such effects as well
as to note again that this should not be taken as an argument against using RCTs but
only against the idea that effects at scale are likely to be the same as in the trial.
An example of what are often called ‘general equilibrium effects’ comes from
agriculture. Suppose an RCT demonstrates that in the study population a new way of
using fertilizer had a substantial positive effect on, say, cocoa yields, so that farmers
who used the new methods saw increases in production and in incomes compared
to those in the control group. If the procedure is scaled up to the whole country, or
to all cocoa farmers worldwide, the price will drop, and if the demand for cocoa is
price inelastic (as is usually thought to be the case, at least in the short run) cocoa
farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is that
farmers do best when the harvest is small, not large. Of course, these considerations
might not be decisive in deciding whether or not to promote the innovation, and
there may still be long term gains if, for example, some farmers find something
better to do than growing cocoa. But, in this case, the scaled-up effect is opposite in sign
to the trial effect. The problem is not with the trial results, which can be usefully
incorporated into a more comprehensive market model that incorporates the
responses estimated by the trial. The problem is only if we assume that the aggregate
looks like the individual. That other ingredients of the aggregate model must come
from observational studies should not be a criticism, even for those who favor RCTs;
it is simply the price of doing serious analysis.
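The sign reversal in the cocoa example can be reproduced with a toy market model. The Python sketch below assumes a constant-elasticity demand curve, a 10 percent yield gain, and a price elasticity of demand of 0.5 in absolute value; all of these, along with the function name `market_price`, are hypothetical choices made for illustration, not estimates for any real cocoa market.

```python
def market_price(total_output, demand_elasticity, base_output=100.0, base_price=1.0):
    """Constant-elasticity demand curve, inverted to give price as a function of output."""
    return base_price * (total_output / base_output) ** (-1.0 / demand_elasticity)


yield_gain = 0.10   # hypothetical trial ATE on yields: +10 percent
elasticity = 0.5    # hypothetical absolute price elasticity of demand (inelastic)

# Trial: only a few farmers are treated, so the market price does not move.
trial_price = market_price(100.0, elasticity)
trial_income_change = ((1 + yield_gain) * trial_price) / trial_price - 1

# Scale-up: every farmer adopts, total output rises by 10 percent, the price falls.
scaled_output = 100.0 * (1 + yield_gain)
scaled_revenue = scaled_output * market_price(scaled_output, elasticity)
scaled_income_change = scaled_revenue / (100.0 * trial_price) - 1

print(f"trial: {trial_income_change:+.1%}, at scale: {scaled_income_change:+.1%}")
```

Under these assumptions a treated farmer in the trial gains income, while farmers as a group lose income when everyone adopts. The trial estimate of the yield gain is still an essential input; what changes the sign is the demand side of the market, which has to come from elsewhere.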
There are many possible interventions that alter supply or demand whose
effect, in aggregate, will change a price or a wage that is held constant in the original
RCT. Education will change the supplies of skilled versus unskilled labor, with
implications for relative wage rates. Conditional cash transfers increase the demand for
(and perhaps supply of) schools and clinics, which will change prices or waiting
lines, or both. There are interactions between people that will operate only at scale.
Giving one child a voucher to go to private school might improve her future, but
doing so for everyone can decrease the quality of education for those children who are
left in the public schools (see the contrasting studies of Angrist et al (2002) and
Hsieh and Urquiola (2002)). Educational or training programs may benefit those
who are treated but harm those left behind; Crépon et al (2014) recognize the issue
and show how to adapt an RCT to deal with it.
Scaling up can also disturb the political equilibrium. An exploitative
government may not allow the mass transfer of money from abroad to a powerless
segment of the population, though it may permit a small-scale RCT of cash transfers,
perhaps even in the hope that a large-scale implementation will yield opportunities
for predation. Provision of health care by foreign NGOs may be successful in trials,
but have unintended negative consequences at scale because of general equilibrium
effects on the supply of health care personnel, or because it disturbs the nature of
the contract between the people and a government that is using tax revenue to
provide services. In India, the government spends large sums on food subsidies through
a system (the PDS) that is both corrupt and inefficient, with much of the grain that is
procured failing to find its way to the intended beneficiaries. Localized RCTs on
whether or not families are better off with cash transfers are not informative about
how politicians would change the amount of the transfer if faced with unanticipated
inflation, and at least as important, whether the government could cut procurement
from relatively wealthy and politically powerful farmers. Without a political and
general equilibrium analysis, it is impossible to think about the effects of replacing
food subsidies with cash transfers (see e.g. Basu (2010)).
Even in medicine, where biological interactions between people are less
common than are social interactions in social science, interactions can be important.
Infectious diseases are one well-known example, where immunization programs
affect the dynamics of disease transmission through herd immunity (see Fine and
Clarkson (1986) and Manski (2013, 52)). The social and economic setting also
affects how drugs are actually used and the same issues can arise; the distinction
between efficacy and effectiveness in clinical trials is in part recognition of the fact.
2.7 Drilling down: using the average for individuals
Just as there are issues with scaling-up, it is not obvious how to use the results from
RCTs at the level of individual units, even individual units that were included in the
trial. A well-conducted RCT delivers an ATE for the trial population but, in general,
that average does not apply to everyone. It is not true, for example, as argued in the
American Medical Association’s “Users’ guide to the medical literature” that “if the
patient would have been enrolled in the study had she been there, that is she meets
all of the inclusion criteria and doesn’t violate any of the exclusion criteria, there is
little question that the results are applicable” (see Guyatt et al (1994, 60)). Even
more misleading are the often-heard statements that an RCT with an average
treatment effect insignificantly different from zero has shown that the treatment works
for no one.
These issues are familiar to physicians practicing evidence-based medicine
whose guidelines require “integrating individual clinical expertise with the best
available external clinical evidence from systematic research,” Sackett et al (1996,
71). Exactly what this means is unclear; physicians know much more about their
patients than is allowed for in the ATE from the RCT (though, once again, stratification
in the trial is likely to be helpful) and they often have intuitive expertise from long
practice that can help them identify features in a particular patient that may
influence the effectiveness of a given treatment for that patient. But there is an odd
balance struck here. These judgments are deemed admissible in discussion with the
individual patient, but they don’t add up to evidence to be made publicly available,
with the usual cautions about credibility, by the standards adopted by most EBM
sites. It is also true that physicians can have prejudices and ‘knowledge’ that might
be anything but. Clearly, there are situations where forcing practitioners to follow
the average will do better, even for individual patients, and others where the
opposite is true, Kahneman and Klein (2009).
Whether or not averages are useful to individuals raises the same issue
throughout social science research. Imagine two schools, St Joseph’s and St Mary’s,
both of which were included in an RCT of a classroom innovation. The innovation is
successful on average, but should the schools adopt it? Should St Mary’s be
influenced by a previous attempt in St Joseph’s that was judged a failure? Many would
dismiss this experience as anecdotal and ask how St Joseph’s could have known that
it was a failure without benefit of ‘rigorous’ evidence. Yet if St Mary’s is like St
Joseph’s, with a similar mix of pupils, a similar curriculum, and similar academic
standing, might not St Joseph’s experience be more relevant to what might happen
at St Mary’s than is the positive average from the RCT? And might it not be a good
idea for the teachers and governors of St Mary’s to go to St Joseph’s and find out
what happened and why? They may be able to observe the mechanism of the failure,
if such it was, and figure out whether the same problems would apply for them, or
whether they might be able to adapt the innovation to make it work for them,
perhaps even more successfully than the positive average in the trial.
Once again, these questions are unlikely to be easily answered in practice;
but, as with transportability, there is no serious alternative to trying. Assuming that
the average works for you will often be wrong, and it will at least sometimes be
possible to do better. As in the medical case, the advice to individual schools often lacks
specificity. For example, the U.S. Institute of Education Sciences has provided a
“user-friendly” guide to practices supported by rigorous evidence, US Department of
Education (2003). The advice, which is similar to recommendations in development
economics, is that the intervention be demonstrated effective through well-designed
RCTs in more than one site and that “the trials should demonstrate the
intervention’s effectiveness in school settings similar to yours” (2003, 17). No operational
definition of “similar” is provided.
2.8 Examples and illustrations from economics
Our arguments in this Section should not be controversial, yet we believe that they
represent an approach that is different from most current practice. To document
this and to fill out the arguments, we provide some examples. While these are
occasionally critical, our purpose is constructive; indeed, we believe that
misunderstandings about how to use RCTs have artificially limited their usefulness, as well as
alienated some who would otherwise use them.
Conditional cash transfers (CCTs) are interventions that have been tested
using RCTs (and other RCT-like methods) and are often cited as a leading example of
how an evaluation with strong internal validity leads to a rapid spread of the policy,
e.g. Angrist and Pischke (2010) among many others. Think through the causal chain
that is required for CCTs to be successful: People must like money, they must like
(or at least not object too much to) their children being educated and vaccinated, there
must exist schools and clinics that are close enough and well enough staffed to do
their job, and the government or agency that is running the scheme must care about
the wellbeing of families and their children. That such conditions hold in a wide
range of (although certainly not all) countries makes it unsurprising that CCTs
‘work’ in many replications, though they certainly will not work in places where the
schools and clinics do not exist, e.g. Levy (2001), nor in places where people
strongly oppose education or vaccination.
Similarly, given that the support factors will operate with different strengths
and effectiveness in different places, it is also not surprising that the size of the ATE
differs from place to place; for example, Vivalt’s AidGrade website lists 29 estimates
from a range of countries of the standardized (divided by local standard deviation of
the outcome) effects of CCTs on school attendance; all but four show the expected
positive effect, and the range runs from –8 to +38 percentage points, Vivalt (2015).
Even in this leading case, where we might reasonably conclude that CCTs ‘work’ in
getting children into school, it would be hard to calculate credible cost-effectiveness
numbers or to come to a general conclusion about whether CCTs are more or less
cost effective than other possible policies. Both costs and effect sizes can be
expected to differ in new settings, just as they have in observed ones, making these
predictions difficult.
The range of estimates illustrates that the simple view of external validity,
that the ATE transports from one place to another, is not reasonable. AidGrade
uses standardized measures of effect size divided by standard deviation of outcome
at baseline, as does the major multi-country study by Banerjee et al (2015). But we
might prefer measures that have an economic interpretation, such as additional
months of schooling per $100 spent (for example if a donor is trying to decide
where to spend, see below). Nutrition might be measured by height, or by the log of
height. Even if the ATE by one measure carries across, it will only do so using
another measure if the relationship between the two measures is the same in both
situations. This is exactly the sort of thing that a formal analysis of transportability
forces us to think about. (Note also that the ATE in the original RCT can differ
depending on whether the outcome is measured in levels or in logs; it is easy to
construct examples where the two ATEs have different signs.)
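The remark about levels and logs can be made concrete with a small numerical sketch.
The two-person potential outcomes below are invented purely for illustration and are
not taken from any trial; they show one way in which the ATE can be positive in levels
and negative in logs.

```python
import math

# Hypothetical potential outcomes (y0 = untreated, y1 = treated) for two people.
# These numbers are illustrative assumptions, not data from any study.
potential_outcomes = [(10.0, 1.0), (10.0, 20.0)]


def ate(pairs, transform=lambda y: y):
    """Average treatment effect after applying `transform` to each outcome."""
    effects = [transform(y1) - transform(y0) for y0, y1 in pairs]
    return sum(effects) / len(effects)


ate_levels = ate(potential_outcomes)          # mean of (-9, +10)
ate_logs = ate(potential_outcomes, math.log)  # mean of (log 0.1, log 2)

print(f"ATE in levels: {ate_levels:+.3f}")
print(f"ATE in logs:   {ate_logs:+.3f}")
```

Because the logarithm is concave, the large proportional loss for the first person
outweighs the gain for the second once outcomes are logged, even though the gain is
slightly larger in levels.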
Much of the economics literature, like the medical literature, works with the
view of external validity that, unless there is evidence to the contrary, the direction
and size of treatment effects can be transported from one place to another. The
J-PAL website reports its findings under a general heading of policy relevance,
subdivided by a selection of topics. Under each topic, there is a list of relevant RCTs from
a range of different settings around the world. These are conveniently converted
into a common cost-effectiveness measure so that, for example, under “education”,
subhead “student participation”, there are four studies from Africa: on informing
parents about the returns to education in Madagascar, on deworming, on school
uniforms, and on merit scholarships, all from Kenya. The units of measurement are
additional years of student education per $100, and among these four studies, the
average effects of spending $100 are 20.7 years, 13.9 years, 0.71 years and 0.27 years
respectively. (Note that this is a different, and much superior, standardization
from the effect size standardization discussed below.)
What can we conclude from such comparisons? For a philanthropic donor
interested in education, and if marginal and average effects are the same, they might
indicate that the best place to devote a marginal dollar is in Madagascar, where it
would be used to inform parents about the value of education. This is certainly
useful, but it is not as useful as statements that information or deworming programs
are everywhere more cost-effective than programs involving school uniforms or
scholarships, or if not everywhere, at least over some domain, and it is these second
kinds of comparison that would genuinely fulfill the promise of ‘finding out what
works.’ But such comparisons only make sense if we can transport the results from
one place to another, if the Kenyan results also hold in Madagascar, Mali, or
Namibia, or some other list of places. J-PAL’s manual for cost-effectiveness, Dhaliwal et
al (2012), explains in (entirely appropriate) detail how to handle variation in costs
across sites, noting variable factors such as population density, prices, exchange
rates, discount rates, inflation, and bulk discounts. But it gives short shrift to
cross-site variation in the size of ATEs, which play an equal part in the calculations of cost
effectiveness. The manual briefly notes that diminishing returns (or the last-mile
problem) might be important in theory but argues that the baseline levels of
outcomes are likely to be similar in the pilot and replication areas, so that the ATE can
be safely transported as is. All of this lacks a justification for transportability, some
understanding of when results transport, when they do not, or better still, how they
should be modified to make them transportable.
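The arithmetic behind the equal roles of costs and ATEs is simple enough to sketch.
Assume, purely for illustration, that cost-effectiveness is computed as the ATE (in
years of schooling per child) divided by the cost per child and scaled to $100. The
function name and all numbers below are hypothetical and are not taken from the
J-PAL manual.

```python
def years_per_100_dollars(ate_years, cost_per_child):
    """Additional years of schooling bought by $100, given the ATE per child
    (in years) and the cost per child (in dollars). Hypothetical helper."""
    return 100.0 * ate_years / cost_per_child


# Illustrative pilot-site values (assumptions, not figures from any study).
pilot = years_per_100_dollars(ate_years=0.2, cost_per_child=10.0)

# New site A: same ATE as the pilot, but costs are twice as high.
site_a = years_per_100_dollars(ate_years=0.2, cost_per_child=20.0)

# New site B: same cost as the pilot, but the ATE is half as large.
site_b = years_per_100_dollars(ate_years=0.1, cost_per_child=10.0)

print(pilot, site_a, site_b)
```

Under this assumed formula, halving the ATE lowers cost-effectiveness by exactly as
much as doubling the cost, so adjusting only the cost side for a new site leaves the
other input to the ratio unexamined.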
One of the largest and most technically impressive of the development RCTs
is by Banerjee et al (2015), which tests a “graduation” program designed to
permanently lift extremely poor people from poverty by providing them with a gift of a
productive asset (from guinea-pigs, (regular-) pigs, sheep, goats, or chickens
depending on locale), training and support, and life-skills coaching, as well as support
for consumption, saving, and health services. The idea is that this package of aid can
help people break out of poverty traps in a way that would not be possible with one
intervention at a time. Comparable versions of the program were tested in Ethiopia,
Ghana, Honduras, India, Pakistan, and Peru and, excepting Honduras (where the
chickens died), the study finds largely positive and persistent effects, with similar
(standardized) effect sizes, for a range of outcomes (economic, mental and physical health,
and female empowerment). One site apart, essentially everyone accepted their
assignment. Replication of positive ATEs over such a wide range of places certainly
provides proof of concept for such a scheme. Yet Bauchet, Morduch, and Ravi (2015)
fail to replicate the result in South India, where the control group got access to much
the same benefits, what Heckman, Hohman, and Smith (2000) call “substitution
bias”. Even so, the results are important because, although there is a longstanding
interest in poverty traps, many economists have been skeptical of their existence or
that they could be sprung by such aid-based policies. In this sense, the study is an
important contribution to the theory of economic development; it tests a theoretical
proposition and will (or should) change minds about it.
A number of difficulties remain. As the authors note, such trials cannot tell us
which component of the treatment accounted for the results, or which might be
dispensable (a much more expensive multifactorial trial would be required), though it
seems likely in practice that the costliest component, the repeated visits for training
and support, will be the first to be cut by cash-strapped politicians or
administrators. And as noted, it is not clear what should count as (simple) replication
in international comparisons; it is hard to think of the uses of standardized effect
sizes, except to document that effects exist everywhere and that they are similarly
large relative to local variation in such things.
The effect size (the ATE standardized by being expressed in numbers of
standard deviations of the original outcome), though conveniently dimensionless,
has little to recommend it. As with much of RCT practice, it strips out any economic
content (no rates of return, or benefits minus costs) and it removes any discipline
on what is being compared. Apples and oranges become immediately comparable,
as do treatments whose inclusion in a meta-analysis is limited only by the
imagination of the analysts in claiming similarity. Training programs for physical fitness can
be pooled with training programs for welding, or marketing, or even obedience
training for pets. In psychology, where the concept originated, this results in endless
disputes about what should and should not be pooled in a meta-analysis. Goldberger
and Manski (1995, 769) note that “standardization accomplishes nothing except to
give quantities in noncomparable units the superficial appearance of being in
comparable units. This accomplishment is worse than useless – it yields misleading
inferences.” Beyond that, Simpson (2017) notes that restrictions on the trial sample
(often good practice to reduce background noise and to help detect an effect) will
reduce the baseline standard deviation and inflate the effect size. More generally,
effect sizes are open to manipulation by exclusion rules. It makes no sense to claim
replicability on the basis of effect sizes, let alone to use them to rank projects. Effect
sizes are irrelevant for policy making.
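The point about sample restrictions inflating the effect size can be illustrated with
a minimal sketch. The baseline outcomes and the raw ATE below are invented for
illustration, and the sketch assumes, for simplicity, that the treatment adds the same
number of raw units in both samples.

```python
import statistics

# Hypothetical baseline outcomes; the numbers are illustrative assumptions.
full_population = [0.0, 5.0, 10.0, 15.0, 20.0]
restricted_sample = [8.0, 10.0, 12.0]  # trial admits only a narrow middle band

# Assumption: the same raw ATE of 2 units in both samples.
raw_ate = 2.0


def effect_size(ate, baseline_outcomes):
    """ATE divided by the sample standard deviation of the baseline outcome."""
    return ate / statistics.stdev(baseline_outcomes)


es_full = effect_size(raw_ate, full_population)
es_restricted = effect_size(raw_ate, restricted_sample)

print(f"effect size, full population:   {es_full:.2f}")
print(f"effect size, restricted sample: {es_restricted:.2f}")
```

The treatment does exactly the same thing in raw units in both cases; only the
denominator has changed, yet the standardized effect in the restricted sample is
roughly four times as large.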
The graduation study can be taken as the closest to fulfilling the ‘finding out
what works’ aim of the RCT movement in development. Yet it is silent on perhaps
the crucial aspect for policy, which is that the trial was run in partnership with
NGOs, whereas what we would like to know is whether it could be replicated by
governments, including those governments that are incapable of getting doctors,
nurses, and teachers to show up to clinics or schools, Chaudhury et al (2005),
Banerjee, Deaton and Duflo (2004), or of regulating the quality of medical care in
either the public or private sectors, Filmer, Hammer and Pritchett (2000) or Das and
Hammer (2005). In fact, we already know a great deal about ‘what works.’
Vaccinations work, maternal and child healthcare services work, and classroom teaching
works. Yet knowing this does not get those things done. Adding another program
that works under ideal conditions is useful only where conditions are in fact ideal, in
which case it would likely be unnecessary. Finding out what works is not the magic
key to economic development. Technical knowledge, though always worth having,
requires suitable institutions and suitable incentives if it is to do any good.
A similar point is documented in the contrast between a successful trial that
used cameras and threats of wage reductions to incentivize attendance of teachers
in schools run by an NGO in Rajasthan in India, Duflo, Hanna, and Ryan (2012), and
the subsequent failure of a follow-up program in the same state to tackle mass
absenteeism of health workers, Banerjee, Duflo, and Glennerster (2008). In the
schools, the cameras and timekeeping worked as intended, and teacher attendance
increased. In the clinics, there was a short-run effect on nurse attendance, but it was
quickly eliminated. (The ability of agents eventually to undermine policies that are
initially effective is common enough and not easily handled within an RCT.) In both
trials, there were incentives to improve attendance, and there were incentives to
find a way to sabotage the monitoring and restore workers to their accustomed
positions; the force of these incentives is a ‘high-level’ cause, like gravity, or the
principle of the lever, that works in much the same way everywhere. For the clinics, some
sabotage was direct (the smashing of cameras) and some was subtler, when
government supervisors provided official, though specious, reasons for missing work.
We can only conjecture why the causality was switched in the move from NGO to
government; we suspect that working for a highly-respected local NGO is a different
contract from working for the government, where not showing up for work is
widely (if informally) understood to be part of the deal. The incentive lever works
when it is wired up right, as with the NGOs, but not when the wiring cuts it out, as
with the government. Knowing ‘what works’ in the sense of the treatment effect on
the trial population is of limited value without understanding the political and
institutional environment in which it is set. This underlines the need to understand the
underlying social, economic, and cultural structures (including the incentives and
agency problems that inhibit service delivery) that are required to support the
causal pathways that we should like to see at work.
Trials in economic development often take place in artificial environments.
Drèze (2016) notes, based on extensive experience in India, “when a foreign agency
comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’
whether through a local NGO or government or whatever, there is a lot going on
other than the treatment.” There is also the suspicion that a treatment that works
does so because of the presence of the ‘treators,’ often from abroad, and may not
do so with the people who will work it in practice.
There is also much to be learned from many years of economic trials in the
United States, particularly from the work of MDRC, from the early income tax trials,
as well as from the RAND Health Experiment. Following the income tax trials, MDRC
has run many randomized trials since the 1970s, mostly for the Federal government
but also for individual states and for Canada (see the thorough and informative
account by Gueron and Rolston (2011) for the factual information underlying the
following discussion). MDRC’s program, like that of J-PAL in development, is intended
to find out ‘what works’ in the state and federal welfare programs. These programs
are conditional cash transfers in which poor recipients are given cash provided they
satisfy certain conditions such as work requirements or training, which are often
the subject of the trial. What are the benefits and costs of various alternatives, both
to the recipients and to the local and federal taxpayers? All of these programs are
deeply politicized, with sharply different views over both facts and desirability.
Many engaged in these disputes feel certain of what should be done and what its
consequences will be so that, by their lights, control groups are unethical because
they deprive some people of what the advocates ‘know’ will be certain benefits.
Given this, it is perhaps surprising that RCTs have become the accepted norm for
this kind of policy evaluation in the US.
The reasons owe much to political institutions, as well as to the common
belief, explored in Section 1, that RCTs can reveal the truth. At the Federal level,
prospective policies are vetted by the non-partisan Congressional Budget Office (CBO),
which makes its own estimates of the budgetary implications of the program.
Ideologues whose programs are scored poorly by the CBO have an incentive to support
an RCT, not to convince themselves, but to convince opponents; once again, RCTs
are valuable when your opponents do not share your prior. And control groups are
easier to put in place when there are insufficient funds to cover the whole
population. There was also a widespread and largely uncritical belief that RCTs give the
right answer, at least for the budgetary implications, which, rather than the
wellbeing of the recipients, were often the primary concern; note that all of these trials are
on poor people by rich people who are typically more concerned with cost than with
the wellbeing of the poor, Greenberg, Schroder and Onstott (1999). MDRC’s trials
could therefore be effective dispute-reconciliation mechanisms both for those who
saw the need for evidence and for those who did not (except instrumentally). The
outcome here fits with our ‘public health’ case; what the politicians need to know is
not the outcomes for individuals, or even how the outcomes in one state might
transport to another, but the average budgetary cost in a specific place, something
that a good RCT conducted on a representative sample of the target population can
deliver, at least in the absence of general equilibrium effects, timing effects, etc.
These RCTs by MDRC and other contractors have demonstrated both the
feasibility of large-scale social trials, including the possibility of randomization in these
settings (where many participants were hostile to the idea), and their
usefulness to policymakers. They also seem to have changed beliefs, for example in favor
of the desirability of work requirements as a condition of welfare, even among many
originally opposed. There are also limitations; the trials appear to have had at best a
minor influence on scientific thinking about behavior in labor markets and, in that
sense, they are more about ‘plumbing’ than science, Duflo (2017). The results of
similar programs have often been different across different sites, and there has to
date been no firm understanding of why; indeed, the trials are not designed to
reveal this, Moffitt (2004). Finally, and perhaps crucially for the potential contribution
to economic science, there has been little success in understanding either the
underlying structures or chains of causation, in spite of a determined effort from the
beginning to open the black boxes.
The RAND health experiment, Manning et al (1975a, b), provides a different
but equally instructive story if only because its results have permeated the academic
and policy discussions about healthcare ever since. It was originally designed to test
whether more generous insurance causes people to use more medical care and, if so,
by how much. The incentive effects are hardly in doubt today; the immortality of the
study comes rather from the fact that its multi-arm (response surface) design
allowed the calculation of an elasticity for the study population, that medical
expenditures decreased by 0.1 to 0.2 percent for every one percent increase in the
copayment (an elasticity of –0.1 to –0.2). According to Aron-Dine, Einav, and Finkelstein
(2013), it is this dimensionless and thus apparently transportable number that has been
used ever since to discuss the design of healthcare policy; the elasticity has come to
be treated as a universal constant. Ironically, they argue that the estimate cannot be
replicated in recent studies, and it is even unclear that it is firmly based on the
original evidence. This points, once again, to the central importance of transportability
for the usefulness, both short and long term, of a trial. Here, the simple direct
transportability of the result seems to have been largely illusory though, as we have
argued, this does not mean that more complex constructions based on the results of the
trial would not have done better.
Conclusions
It is useful to respond to two challenges that are often put to us, one from medicine
and one from social science. The medical challenge is, “If you are being prescribed a
new drug, wouldn’t you want it to have been through an RCT?” The second (related)
challenge is, “OK, you have highlighted some of the problems with RCTs, but other
methods have all of those problems, plus problems of their own.” We believe that we
have answered both of these in the paper but that it is helpful to recapitulate.
The medical challenge is about you, a specific person, so that one answer
would be that you may be different from the average, and you are entitled to and
ought to ask about theory and evidence about whether it will work for you. This
would be in the form of a conversation between you and your physician, who knows
a lot about you. You would want to know how this class of drug is supposed to work
and whether that mechanism is likely to work for you. Is there any evidence from
other patients, especially patients like you, with your condition and in your
circumstances, or are there suggestions from theory? What scientific work has been done
to identify what support factors matter for success with this kind of drug? If the only
information available is from the pharmaceutical company, an RCT might seem like
a good idea. But even then, and although knowledge of the mean effect among some
group is certainly of value, you might give little weight to an RCT whose participants
are selected in the way they were selected in the trial, or where there is little
information about whether the outcomes are relevant to you. Recall that many new drugs
are prescribed ‘off-label’, for a purpose for which they were not tested, and beyond
that, that many new drugs are administered in the absence of an RCT because you
are actually being enrolled in one. For patients whose last chance is to participate in
a trial of some new drug, this is exactly the sort of conversation you should have
with your physician (followed by one asking her to reveal whether you are in the
active arm, so that you can switch if not), and such conversations need to take place
for all prescriptions that are new to you. In these conversations, the results of an
RCT may have marginal value. If your physician tells you that she endorses
evidence-based medicine, and that the drug will most likely work for you because an RCT has
shown that ‘it works’, it is time to find a new physician.
The second challenge claims that other methods are always dominated by an
RCT. This kind of challenge is not well-formulated. Dominated for answering what
question, for what purposes? The chief advantage of the RCT is that it can, if
well-conducted, give an unbiased estimate of an ATE in a study (trial) sample and thus
provide evidence that the treatment caused the outcome in some individuals in that
sample. If that is what you want to know and there’s little background knowledge
available and the price is right, then an RCT may be the best choice. As to other
questions, the RCT result can be part (but usually only a small part) of the defense
of (a) a general claim, (b) a claim that the treatment will cause that outcome for
some other individuals, or even (c) a claim about what the ATE will be in some other
population. But they do little for these enterprises on their own. What is the best
overall package of research work for tackling these questions (most cost-effective
and most likely to produce correct results) depends on what we know and what
different kinds of research will cost.
There are examples where an RCT does better than an observational study,
and these seem to be the cases that come to mind for defenders of RCTs. For
example, regressions of whether people who get Medicaid do better or worse than people
with private insurance are vitiated by gross differences in the other characteristics
of the two populations. But it is a long step from that to saying that an RCT can solve
the problem, let alone that it is the only way to solve the problem. It will not only be
expensive per subject, but it can only enroll a selected and almost certainly
unrepresentative study sample, it can be run only temporarily, and the recruitment to the
experiment will necessarily be different from recruitment in a scheme that is
permanent and open to the full qualified population. None of this removes the
blemishes of the observational study, but there are many methods of mitigating its
difficulties, so that, in the end, an observational study with credible corrections and a
more relevant and much larger study sample (today often the complete population
of interest through administrative records) may provide a better estimate.
Everything has to be judged on a case-by-case basis. There is no rigorous argument for a
lexicographic preference for RCTs.
There is also an important line of enquiry that goes, not only beyond RCTs,
but beyond the ‘method of differences’ that is common to RCTs, regressions, or any
form of controlled or uncontrolled comparison. The hypothetico-deductive method
confronts theory-based deductions with the data, either observational or
experimental. As noted above, economists routinely use theory to tease out a new
implication that can be taken to the data, and there are also good examples in medicine
such as Bleyer and Welch (2012)’s demonstration of the limited impact on breast
cancer incidence of mammography screening, a topic where other methods have
generated great controversy and little consensus.
RCTs are the ultimate in non-parametric estimation of average treatment
effects in the trial samples because they make so few assumptions about
heterogeneity, causal structure, choice of variables, and functional form. RCTs are often
convenient ways to introduce experimenter-controlled variance (if you want to see what
happens, then kick it and see, twist the lion’s tail) but note that many experiments,
including many of the most important (and Nobel Prize winning) experiments in
economics, do not and did not use randomization, Harrison (2013), Svorencik
(2015). But the credibility of the results, even internally, can be undermined by
unbalanced covariates and by excessive heterogeneity in responses, especially when
the distribution of effects is asymmetric, where inference on means can be
hazardous. Ironically, the price of the credibility in RCTs is that we can only recover the
mean of the distribution of treatment effects, and that only for the trial sample. Yet,
in the presence of outliers, reliable inference on means is difficult. And
randomization in and of itself does nothing unless the details are right; purposive selection into
the experimental population, like purposive selection into and out of assignment,
undermines inference in just the same way as does selection in observational
studies. Lack of blinding, whether of participants, trialists, data collectors, or analysts,
undermines inference, akin to a failure of exclusion restrictions in instrumental
variable analysis.
The lack of structure can be seriously disabling when we try to use RCT
results outside of a few contexts, such as program evaluation, hypothesis testing, or
establishing proof of concept. Beyond that, the results cannot be used to help make
predictions beyond the trial sample without more structure, without more prior
information, and without having some idea of what makes treatment effects vary from
place to place or time to time. There is no option but to commit to some causal
structure if we are to know how to use RCT evidence out of the original context.
Simple generalization and simple extrapolation do not cut the mustard. This is true
of any study, experimental or observational. But observational studies are familiar
with, and routinely work with, the sort of assumptions that RCTs claim to avoid, so
that if the aim is to use empirical evidence, any credibility advantage that RCTs have
in estimation is no longer operative. And because RCTs tell us so little about why
results happen, they have a disadvantage over studies that use a wider range of prior
information and data to help nail down mechanisms.
Yet once that commitment has been made, RCT evidence can be extremely
useful, pinning down part of a structure, helping to build stronger understanding
and knowledge, and helping to assess welfare consequences. As our examples show,
this can often be done without committing to the full complexity of what are often
thought of as structural models. Yet without the structure that allows us to place
RCT results in context, or to understand the mechanisms behind those results, not
only can we not transport whether ‘it works’ elsewhere, but we cannot do the
standard stuff of economics, which is to say whether the intervention is actually welfare
improving. Without knowing why things happen and why people do things, we run
the risk of worthless casual (‘fairy story’) causal theorizing and have given up on
one of the central tasks of economics.
We must back away from the refusal to theorize, from the exultation in our
ability to handle unlimited heterogeneity, and actually SAY something. Perhaps
paradoxically, unless we are prepared to make assumptions, and to say what we know,
making statements that will be incredible to some, all the credibility of the RCT is for
naught.
RCTs in economics on health, labor, and development have proven their
worth in providing proofs of concept and in testing predictions that some policies
must always work or can never work. But, as elsewhere in economics, we cannot
find out why something works by simply demonstrating that it does work, no matter
how often, which leaves us uninformed as to whether the policy should be
implemented. Beyond that, small scale, demonstration RCTs are not capable of telling us
what would happen if these policies were implemented to scale, of capturing
unintended consequences that typically cannot be included in the protocols, or of
modeling what will happen if schemes are implemented differently than in the trial, for
example by governments, whose motives and operating principles are different from
the NGOs or academics who typically run trials. While it is true that abstract
knowledge is always likely to be beneficial, successful policy depends on institutions
and on politics, matters on which RCTs have little to say. The results of RCTs can and
should feed into public debate about what should be done, but we are on dangerous
ground when they are used, on grounds of their supposed epistemic superiority, to
insulate policy from democratic processes.
Citations
Aigner, Dennis J., 1985, “The residential electricity time-of-use pricing experiments. What have we learned?” in David A. Wise and Jerry A. Hausman, Social experimentation, Chicago, Il. Chicago University Press for National Bureau of Economic Research, 11–54.
Al-Ubaydli, Omar, and John A. List, 2013, “On the generalizability of experimental results in economics,” in G. Frechette and A. Schotter, Methods of modern experimental economics, Oxford University Press.
Altman, Douglas G., 1985, “Comparability of randomized groups,” Journal of the Royal Statistical Society, Series D (The Statistician), 34(1), Statistics in health, 125–36.
Angrist, Joshua D., 2004, “Treatment effect heterogeneity in theory and practice,” Economic Journal, 114, C52–C83.
Angrist, Joshua D., Eric Bettinger, Erik Bloom, Elizabeth King, and Michael Kremer, 2002, “Vouchers for private schooling in Colombia: evidence from a randomized natural experiment,” American Economic Review, 92(5), 1535–58.
Angrist, Joshua D. and Jörn-Steffen Pischke, 2010, “The credibility revolution in empirical economics: how better research design is taking the con out of econometrics,” Journal of Economic Perspectives, 24(2), 3–30.
Angrist, Joshua D. and Jörn-Steffen Pischke, 2017, “Undergraduate econometrics instruction: through our classes, darkly,” Journal of Economic Perspectives, 31(2), 125–44.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein, 2013, “The RAND health insurance experiment, three decades later,” Journal of Economic Perspectives, 27(1), 197–222.
Arrow, Kenneth J., 1975, “Two notes on inferring long run behavior from social experiments,” Document No. P-5546, Santa Monica, CA. Rand Corporation.
Ashenfelter, Orley, 1978, “The labor supply response of wage earners,” in John L. Palmer and Joseph A. Pechman, eds., Welfare in rural areas: the North Carolina–Iowa Income Maintenance Experiment, Washington, DC. The Brookings Institution. 109–38.
Athey, Susan and Guido W. Imbens, 2017, “The state of applied econometrics: causality and policy evaluation,” Journal of Economic Perspectives, 31(2), 3–32.
Attanasio, Orazio, Costas Meghir, and Ana Santiago, 2012, “Education choices in Mexico: using a structural model and a randomized experiment to evaluate PROGRESA,” Review of Economic Studies, 79(1), 37–66.
Attanasio, Orazio, Sarah Cattan, Emla Fitzsimons, Costas Meghir, and Marta Rubio Codina, 2015, “Estimating the production function for human capital: results from a randomized controlled trial in Colombia,” London. Institute for Fiscal Studies, Working Paper no W15/06.
Bahadur, R. R., and Leonard J. Savage, 1956, “The non-existence of certain statistical procedures in nonparametric problems,” Annals of Mathematical Statistics, 25: 1115–22.
Banerjee, Abhijit, Sylvain Chassang, Sergio Montero, and Erik Snowberg, 2016, “A theory of experimenters,” processed, July 2016.
Banerjee, Abhijit, Sylvain Chassang, and Erik Snowberg, 2016, “Decision theoretic approaches to experiment design and external validity,” Cambridge, MA. NBER Working Paper no 22167, April.
Banerjee, Abhijit, Angus Deaton, and Esther Duflo, 2004, “Health care delivery in rural Rajasthan,” Economic and Political Weekly, 39(9), 944–9.
Banerjee, Abhijit and Esther Duflo, 2009, “The experimental approach to development economics,” Annual Review of Economics, 1, 151–78.
Banerjee, Abhijit and Esther Duflo, 2012, Poor economics: a radical rethinking of the way to fight global poverty, Public Affairs.
Banerjee, Abhijit, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christopher Udry, 2015, “A multifaceted program causes lasting progress for the very poor: evidence from six countries,” Science, 348(6236), 1260799.
Banerjee, Abhijit, Esther Duflo, and Rachel Glennerster, 2008, “Putting a band-aid on a corpse: incentives for nurses in the Indian public health care system,” Journal of the European Economic Association, 6(2–3), 487–500.
Banerjee, Abhijit V. and Ruimin He, 2003, “The World Bank of the future,” American Economic Review, 93(2), 39–44.
Banerjee, Abhijit, Dean Karlan, and Jonathan Zinman, 2015, “Six randomized evaluations of microcredit: introduction and further steps,” American Economic Journal: Applied Economics, 7(1), 1–21.
Bareinboim, Elias and Judea Pearl, 2013, “A general algorithm for deciding transportability of experimental results,” Journal of Causal Inference, 1(1), 107–34.
Bareinboim, Elias and Judea Pearl, 2014, “Transportability from multiple environments with limited experiments: completeness results,” in M. Welling, Z. Ghahramani, C. Cortes, and N. Lawrence, eds., Advances of Neural Information Processing, 27, (NIPS Proceedings), 280–8.
Bauchet, Jonathan, Jonathan Morduch, and Shamika Ravi, 2015, “Failure vs displacement: why an innovative anti-poverty program showed no net impact in South India,” Journal of Development Economics, 116, 1–16.
Basu, Kaushik, 2010, “The economics of foodgrain management in India,” Ministry of Finance, Delhi. http://finmin.nic.in/workingpaper/Foodgrain.pdf
Begg, Colin B., 1990, “Significance tests of covariance imbalance in clinical trials,” Controlled Clinical Trials, 11(4), 223–5.
Bhattacharya, Debopam and Pascaline Dupas, 2012, “Inferring welfare maximizing treatment assignment under budget constraints,” Journal of Econometrics, 167(1), 168–96.
Bitler, Marianne P., Jonah B. Gelbach, and Hilary W. Hoynes, 2006, “What mean impacts miss: distributional effects of welfare reform experiments,” American Economic Review, 96(4), 988–1012.
Bleyer, Archie, and H. Gilbert Welch, 2012, “Effect of three decades of screening mammography on breast-cancer incidence,” New England Journal of Medicine, 367, 1998–2005.
Bloom, Howard S., Carolyn J. Hill, and James A. Riccio, 2005, “Modeling cross-site experimental differences to find out why program effectiveness varies,” in Howard S. Bloom, ed., Learning more from social experiments: evolving analytical approaches, New York, NY. Russell Sage.
Bold, Tessa, Mwangi Kimenyi, Germano Mwabu, Alice Ng’ang’a, and Justin Sandefur, 2013, “Scaling up what works: experimental evidence on external validity in Kenyan education,” Washington, DC. Center for Global Development, Working Paper 321.
Bothwell, Laura E. and Scott H. Podolsky, 2016, “The emergence of the randomized, controlled trial,” New England Journal of Medicine, 375(6), 501–4. doi:10.1056/NEJMp1604635
Campbell, D. T. and J. C. Stanley, 1963, Experimental and quasi-experimental designs for research. Chicago. Rand McNally.
Cartwright, Nancy, 1994, Nature’s capacities and their measurement. Oxford. Clarendon Press.
Cartwright, Nancy, 2007, “Are RCTs the gold standard?” Biosocieties, 2, 11–20.
Cartwright, Nancy, 2011, “A philosopher’s view of the long road from RCTs to effectiveness,” The Lancet, 377, 1400–01.
Cartwright, Nancy, 2012, “Presidential address: will this policy work for you? Predicting effectiveness better: how philosophy helps,” Philosophy of Science, 79, 973–89.
Cartwright, Nancy, 2016, “Where is the rigor when you need it?” in I. Marinovic, ed., Foundations and Trends in Accounting: special issue on causal inference in capital markets research, 10(2–4): 106–24.
Cartwright, Nancy and Jeremy Hardie, 2012, Evidence based policy: a practical guide to doing it better, Oxford. Oxford University Press.
Cartwright, Nancy and Eileen Munro, 2010, “The limitations of RCTs in predicting effectiveness,” Journal of Experimental Child Psychology, 16(2),
Chalmers, Iain, 2001, “Comparing like with like: some historical milestones in the evolution of methods to create unbiased comparison groups in therapeutic experiments,” International Journal of Epidemiology, 30, 1156–64.
Chan, Tat Y. and Barton H. Hamilton, 2006, “Learning, private information, and the economic evaluation of randomized experiments,” Journal of Political Economy, 114(6), 997–1040.
Chassang, Sylvain, Gerard Padró i Miguel, and Erik Snowberg, 2012, “Selective trials: a principal–agent approach to randomized controlled experiments,” American Economic Review, 102(4), 1279–1309.
Chassang, Sylvain, Erik Snowberg, Ben Seymour, and Cayley Bowles, 2015, “Accounting for behavior in treatment effects: new applications for blind trials,” PLoS One, 10(6), e0127227. doi:10.1371/journal.pone.0127227.
Chaudhury, Nazmul, Jeffrey Hammer, Michael Kremer, Karthik Muralidharan, and F. Halsey Rogers, 2005, “Missing in action: teacher and health worker absence in developing countries,” Journal of Economic Perspectives, 19(4), 91–116.
Chyn, Eric, 2016, “Moved to opportunity: the long-run effect of public housing demolition on labor market outcomes of children,” University of Michigan. http://www-personal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf
Chetty, Raj, 2009, “Sufficient statistics for welfare analysis: a bridge between structural and reduced-form methods,” Annual Review of Economics, 1, 451–87.
Conlisk, John, 1973, “Choice of response functional form in designing subsidy experiments,” Econometrica, 41(4), 643–56.
Crépon, Bruno, Esther Duflo, Marc Gurgand, Roland Rathelot, and Philippe Zamora, 2014, “Do labor market policies have displacement effects? evidence from a clustered randomized experiment,” Quarterly Journal of Economics, 128(2), 531–80.
Das, Jishnu and Jeffrey Hammer, 2005, “Which doctor? Combining vignettes and item response to measure clinical competence,” Journal of Development Economics, 78, 348–83.
Davey-Smith, George, and Shah Ibrahim, 2002, “Data dredging, bias, or confounding,” British Medical Journal, 325, 1437–8.
Deaton, Angus, 2010, “Instruments, randomization, and learning about development,” Journal of Economic Literature, 48(2), 424–55.
Deaton, Angus and Nancy Cartwright, 2016, “Understanding and misunderstanding randomized controlled trials,” http://www.princeton.edu/~deaton/down-load.html?pdf=Deaton_Cartwright_RCTs_with_ABSTRACT_August_25.pdf
Deaton, Angus and John Muellbauer, 1980, Economics and consumer behavior, New York. Cambridge University Press.
Deaton, Angus and Serena Ng, 1998, “Parametric and nonparametric approaches to price and tax reform,” Journal of the American Statistical Association, 93(443), 900–9.
Dhaliwal, Iqbal, Esther Duflo, Rachel Glennerster, and Caitlin Tulloch, 2012, “Comparative cost-effectiveness analysis to inform policy in developing countries: a general framework with applications for education,” J-PAL, MIT, December 3rd. http://www.povertyactionlab.org/publication/cost-effectiveness
Drèze, Jean, 2016, Personal email communication.
Duflo, Esther, 2017, “The economist as plumber,” American Economic Review, 107(5), 1–26.
Duflo, Esther, Rema Hanna, and Stephen P. Ryan, 2012, “Incentives work: getting teachers to come to school,” American Economic Review, 102(4), 1241–78.
Duflo, Esther and Michael Kremer, 2008, “Use of randomization in the evaluation of development effectiveness,” in William Easterly, ed., Reinventing foreign aid. Washington, DC. Brookings, 93–120.
Dynarski, Susan, 2015, “Helping the poor in education: the power of a simple nudge,” New York Times, Jan 17, 2015.
Fine, Paul E. M. and Jacqueline A. Clarkson, 1986, “Individual versus public priorities in the determination of optimal vaccination policies,” American Journal of Epidemiology, 124(6), 1012–20.
Fisher, Ronald A., 1926, “The arrangement of field experiments,” Journal of the Ministry of Agriculture of Great Britain, 33, 503–13.
Filmer, Deon, Jeffrey Hammer, and Lant Pritchett, 2000, “Weak links in the chain: a diagnosis of health policy in poor countries,” World Bank Research Observer, 15(2), 199–204.
Freedman, David A., 2008, “On regression adjustments to experimental data,” Advances in Applied Mathematics, 40, 180–93.
Frieden, Thomas R., 2017, “Evidence for health decision making – beyond randomized, controlled trials,” New England Journal of Medicine, 377, 465–75.
Garfinkel, Irwin and Charles F. Manski, 1992, “Introduction,” in Irwin Garfinkel and Charles F. Manski, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 1–22.
Gerber, Alan S. and Donald P. Green, 2012, Field Experiments, New York. Norton.
Gertler, Paul J., Sebastian Martinez, Patrick Premand, Laura B. Rawlings, and Christel M. J. Vermeersch, 2016, Impact evaluation in practice, 2nd Edition, Washington, DC. Inter-American Development Bank and World Bank.
Goldberger, Arthur S. and Charles F. Manski, 1995, “Review Article: The Bell Curve by Herrnstein and Murray,” Journal of Economic Literature, 33(2), 762–76.
Greenberg, David and Mark Shroder, 2004, The digest of social experiments (3rd ed.), Washington, DC. Urban Institute Press.
Greenberg, David, Mark Shroder, and Matthew Onstott, 1999, “The social experiment market,” Journal of Economic Perspectives, 13(3), 157–72.
Gueron, Judith M. and Howard Rolston, 2013, Fighting for reliable evidence, New York, Russell Sage.
Guyatt, Gordon, David L. Sackett, and Deborah J. Cook for the Evidence-Based Medicine Working Group, 1994, “Users’ guides to the medical literature II: how to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients?” Journal of the American Medical Association, 271(1), 59–63.
Harrison, Glenn W., 2013, “Field experiments and methodological intolerance,” Journal of Economic Methodology, 20(2), 103–17.
Harrison, Glenn W., 2014, “Impact evaluation and welfare evaluation,” European Journal of Development Research, 26, 39–45.
Harrison, Glenn W., 2014, “Cautionary notes on the use of field experiments to address policy issues,” Oxford Review of Economic Policy, 30(4), 753–63.
Hausman, Jerry A. and David A. Wise, 1985, “Technical problems in social experimentation: cost versus ease of analysis,” in Jerry A. Hausman and David A. Wise, eds., Social Experimentation, Chicago, IL. Chicago University Press. 187–220.
Heckman, James J., 1992, “Randomization and social policy evaluation,” in Charles F. Manski and Irwin Garfinkel, eds., Evaluating welfare and training programs, Cambridge, MA. Harvard University Press. 547–70.
Heckman, James J., 1997, “Instrumental variables: a study of implicit behavioral assumptions used in making program evaluations,” Journal of Human Resources, 32(3), 441–62.
Heckman, James J., 2005, “The scientific model of causality,” Sociological Methodology, 35(1), 1–97.
Heckman, James J., 2008, “Econometric causality,” International Statistical Review, 76(1), 1–27.
Heckman, James J., 2010, “Building bridges between structural and program evaluation approaches to evaluating policy,” Journal of Economic Literature, 48(2), 356–98.
Heckman, James J., Neil Hohman, and Jeffrey Smith, with the assistance of Michael Khoo, 2000, “Substitution and dropout bias in social experiments: a study of an influential social experiment,” Quarterly Journal of Economics, 115(2), 651–94.
Heckman, James J., Robert J. Lalonde, and Jeffrey A. Smith, 1999, “The economics and econometrics of active labor markets,” Chapter 31 in Ashenfelter, Orley and David Card, eds. Handbook of labor economics, Amsterdam. North-Holland, 3(A), 1866–2097.
Heckman, James J., Rodrigo Pinto, and Peter Savelyev, 2013, “Understanding the mechanisms through which an influential early childhood program boosted adult outcomes,” American Economic Review, 103(6), 2052–86.
Heckman, James J. and Jeffrey Smith, 1995, “Assessing the case for social experiments,” Journal of Economic Perspectives, 9(2), 85–110.
Heckman, James J., Jeffrey Smith, and Nancy Clements, 1997, “Making the most out of programme evaluations and social experiments: accounting for heterogeneity in programme impacts,” Review of Economic Studies, 64(4), 487–535.
Heckman, James J. and Sergio Urzúa, 2010, “Comparing IV with structural models: what simple IV can and cannot identify,” Journal of Econometrics, 156, 27–37.
Heckman, James J. and Edward Vytlacil, 2005, “Structural equations, treatment effects, and econometric policy evaluation,” Econometrica, 73(3), 669–738.
Heckman, James J. and Edward J. Vytlacil, 2007, “Econometric evaluation of social programs, Part 1: causal models, structural models, and econometric policy evaluation,” Chapter 70 in James J. Heckman and Edward E. Leamer, eds., Handbook of Econometrics, 6B, 4779–874.
Horton, Richard, 2000, “Common sense and figures: the rhetoric of validity in medicine: Bradford Hill memorial lecture 1999,” Statistics in Medicine, 19, 3149–64.
Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer, 2005, “Predicting the efficacy of future training programs using past experience at other locations,” Journal of Econometrics, 125, 241–70.
Hsieh, Chang-tai and Miguel Urquiola, 2006, “The effects of generalized school choice on achievement and stratification: evidence from Chile’s voucher program,” Journal of Public Economics, 90, 1477–1503.
Hurwicz, Leonid, 1966, “On the structural form of interdependent systems,” Studies in Logic and the Foundations of Mathematics, 44, 232–9.
Imbens, Guido W., 2004, “Nonparametric estimation of average treatment effects under exogeneity: a review,” Review of Economics and Statistics, 86(1), 4–29.
Imbens, Guido W., 2010, “Better LATE than nothing: some comments on Deaton (2009) and Heckman and Urzua,” Journal of Economic Literature, 48(2), 399–423.
Imbens, Guido W. and Joshua D. Angrist, 1994, “Identification and estimation of local average treatment effects,” Econometrica, 62(2), 467–75.
Imbens, Guido W. and Michal Kolesár, 2016, “Robust standard errors in small samples: some practical advice,” Review of Economics and Statistics, 98(4), 701–12.
Imbens, Guido W. and Jeffrey M. Wooldridge, 2009, “Recent developments in the econometrics of program evaluation,” Journal of Economic Literature, 47(1), 5–86.
International Committee of Medical Journal Editors, 2015, Recommendations for the conduct, reporting, editing, and publication of scholarly work in medical journals, http://www.icmje.org/icmje-recommendations.pdf (accessed August 20, 2016).
J-PAL, 2017, https://www.povertyactionlab.org/about-j-pal (accessed August 21, 2017).
Kahneman, Daniel and Gary Klein, 2009, “Conditions for intuitive expertise: a failure to disagree,” American Psychologist, 64(6), 515–26.
Karlan, Dean and Jacob Appel, 2011, More than good intentions: how a new economics is helping to solve global poverty, New York. Dutton.
Karlan, Dean, Nathaneal Goldberg and James Copestake, 2009, “Randomized controlled trials are the best way to measure impact of microfinance programs and improve microfinance product designs,” Enterprise Development and Microfinance, 20(3), 167–76.
Kasy, Maximilian, 2016, “Why experimenters might not want to randomize, and what they could do instead,” Political Analysis, 1–15. doi:10.1093/pan/mpw012
Kramer, Peter, 2016, Ordinarily well: the case for antidepressants, New York. Farrar, Straus, and Giroux.
Kremer, Michael and Alaka Holla, 2009, “Improving education in the developing world: what have we learned from randomized evaluations?” Annual Review of Economics, 1, 513–42.
Lalonde, Robert J., 1986, “Evaluating the econometric evaluations of training programs with experimental data,” American Economic Review, 76(4), 604–20.
Lehmann, Erich L. and Joseph P. Romano, 2005, Testing statistical hypotheses (third edition), New York. Springer.
Levy, Santiago, 2006, Progress against poverty: sustaining Mexico’s Progresa-Oportunidades program, Washington, DC. Brookings.
Mackie, John L., 1974, The cement of the universe: a study of causation, Oxford. Oxford University Press.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler, and Arleen Leibowitz, 1988a, “Health insurance and the demand for medical care: evidence from a randomized experiment,” American Economic Review, 77(3), 251–77.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett Keeler, Bernadette Benjamin, Arleen Leibowitz, M. Susan Marquis, and Jack Zwanziger, 1988b, Health insurance and the demand for medical care: evidence from a randomized experiment, Santa Monica, CA. RAND.
Manski, Charles F., 2004, “Treatment rules for heterogeneous populations,” Econometrica, 72(4), 1221–46.
Manski, Charles F., 2013, Public policy in an uncertain world: analysis and decisions, Cambridge, MA. Harvard University Press.
Manski, Charles F. and Aleksey Tetenov, 2016, “Sufficient trial size to inform clinical practice,” PNAS, 113(38), 10518–23.
Metcalf, Charles E., 1973, “Making inferences from controlled income maintenance experiments,” American Economic Review, 63(3), 478–83.
Moffitt, Robert, 1979, “The labor supply response in the Gary experiment,” Journal of Human Resources, 14(4), 477–87.
Moffitt, Robert, 1992, “Evaluation methods for program entry effects,” Chapter 6 in Charles Manski and Irwin Garfinkel, Evaluating welfare and training programs, Cambridge, MA. Harvard University Press, 231–52.
Moffitt, Robert, 2004, “The role of randomized field trials in social science research: a perspective from evaluations of reforms of social welfare programs,” American Behavioral Scientist, 47(5), 506–40.
Morgan, Kari Lock and Donald B. Rubin, 2012, “Rerandomization to improve covariate balance in experiments,” Annals of Statistics, 40(2), 1263–82.
Muller, Seán M., 2015, “Causal interaction and external validity: obstacles to the policy relevance of randomized evaluations,” World Bank Economic Review, 29, S217–S225.
Orcutt, Guy H. and Alice G. Orcutt, 1968, “Incentive and disincentive experimentation for income maintenance policy purposes,” American Economic Review, 58(4), 754–72.
Pearl, Judea and Elias Bareinboim, 2011, “Transportability of causal and statistical relations: a formal approach,” Proceedings of the 25th AAAI Conference on Artificial Intelligence, AAAI Press, 247–54.
Pearl, Judea and Elias Bareinboim, 2014, “External validity: from do-calculus to transportability across populations,” Statistical Science, 29(4), 579–95.
Rodrik, Dani, 2006, personal email communication.
Rothwell, Peter M., 2005, “External validity of randomized controlled trials: ‘to whom do the results of the trial apply’,” Lancet, 365, 82–93.
Russell, Bertrand, 2008 [1912], The problems of philosophy, Rockville, MD. Arc Manor.
Sackett, David L., William M. C. Rosenberg, J. A. Muir Gray, R. Brian Haynes and W. Scott Richardson, 1996, “Evidence based medicine: what it is and what it isn’t,” British Medical Journal, 312 (January 13), 71–2.
Savage, Leonard J., 1962, “Subjective probability and statistical practice,” in G. A. Barnard and D. R. Cox, eds., The Foundations of Statistical Inference, London. Methuen. 9–35.
Scriven, Michael, 1974, “Evaluation perspectives and procedures,” in W. James Popham, ed., Evaluation in education – current applications, Berkeley, CA. McCutchan Publishing Corporation.
Senn, Stephen, 1994, “Testing for baseline balance in clinical trials,” Statistics in Medicine, 13, 1715–26.
Senn, Stephen, 2013, “Seven myths of randomization in clinical trials,” Statistics in Medicine, 32, 1439–50.
Shadish, William R., Thomas D. Cook, and Donald T. Campbell, 2002, Experimental and quasi-experimental designs for generalized causal inference, Boston, MA. Houghton Mifflin.
Simpson, Adrian, 2017, “The misdirection of public policy: comparing and combining standardised effect sizes,” Journal of Educational Policy, 32(4), 450–66.
Stuart, Elizabeth A., Stephen R. Cole, Catharine P. Bradshaw, and Philip J. Leaf, 2011, “The use of propensity scores to assess the generalizability of results from randomized trials,” Journal of the Royal Statistical Society A, 174(2), 369–86.
Student (W. S. Gosset), 1938, “Comparison between balanced and random arrangements of field plots,” Biometrika, 29(3/4), 363–78.
Svorencik, Andrej, 2015, The experimental turn in economics: a history of experimental economics, Utrecht School of Economics, Dissertation Series #29, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026
Todd, Petra E. and Kenneth I. Wolpin, 2006, “Assessing the impact of a school subsidy program in Mexico: using a social experiment to validate a dynamic behavioral model of child schooling and fertility,” American Economic Review, 96(5), 1384–1417.
Todd, Petra E. and Kenneth I. Wolpin, 2008, “Ex ante evaluation of social programs,” Annales d’Economie et de la Statistique, 91/92, 263–91.
U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, 2003, Identifying and implementing educational practices supported by rigorous evidence: a user friendly guide, Washington, DC. Institute of Education Sciences.
Vandenbroucke, Jan P., 2004, “When are observational studies as credible as randomized controlled trials?” The Lancet, 363, 1728–31.
Vandenbroucke, Jan P., 2009, “The HRT controversy: observational studies and RCTs fall in line,” The Lancet, 373, 1233–5.
Vivalt, Eva, 2015, “How much can we generalize from impact evaluations?” NYU, unpublished. http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf
White, Halbert, 1980, “A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity,” Econometrica, 48(4), 817–38.
Wise, David A., 1985, “A behavioral model versus experimentation: the effects of housing subsidies on rent,” in P. Brucker and R. Pauly, eds. Methods of Operations Research, 50, Verlag Anton Hain. 441–89.
Wolpin, Kenneth I., 2013, The limits of inference without theory, Cambridge, MA. MIT Press.
Worrall, John, 2007, “Evidence in medicine and evidence-based medicine,” Philosophy Compass, 2/6, 981–1022.
Worrall, John, 2008, “Evidence and ethics in medicine,” Perspectives in Biology and Medicine, 51(3), 418–31.
Yates, Frank, 1939, “The comparative advantages of systematic and randomized arrangements in the design of agricultural and biological experiments,” Biometrika, 30(3/4), 440–66.
Young, Alwyn, 2016, “Channeling Fisher: randomization tests and the statistical insignificance of seemingly significant experimental results,” London School of Economics, Working Paper, Feb.
Ziliak, Stephen T., 2014, “Balanced versus randomized field experiments in economics: why W. S. Gosset aka ‘Student’ matters,” Review of Behavioral Economics, 1, 167–208.
Appendix: Monte Carlo experiment for an RCT with outliers

In this illustrative example, there is a parent population, each member of which has his or her own treatment effect; these are continuously distributed with a shifted lognormal distribution with zero mean, so that the population ATE is zero. The individual treatment effects β are distributed so that $\beta + e^{0.5} \sim \Lambda(0,1)$, where Λ(0,1) is the standard lognormal distribution, whose mean is $e^{0.5}$. In the absence of treatment, everyone in the sample records zero, so the ATE estimate in any one trial is simply the mean outcome among the n treated units. For values of n equal to 25, 50, 100, 200, and 500, we draw from the parent population 100 trial samples each of size 2n; with five values of n, this gives us 500 trial samples in all. Because of sampling, the true ATEs in the individual trial samples will not be zero. For each of these 500 samples, we randomize into n controls and n treatments, estimate the ATE and its t-value (using the standard two-sample t-value or, equivalently, by running a regression with robust t-values), and then repeat 1,000 times, so we have 1,000 ATE estimates and t-values for each of the 500 trial samples. These allow us to assess the distribution of ATE estimates and their nominal t-values for each trial.
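The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce the reported results: the function names, the seed, the 1.96 critical value, and the unequal-variance form of the standard error are choices made here, and the sizes in the usage example are kept small so that the sketch runs quickly.

```python
import math
import random


def draw_trial_sample(n, rng):
    """Draw 2n treatment effects from a standard lognormal shifted to zero mean."""
    shift = math.exp(0.5)  # mean of the standard lognormal
    return [rng.lognormvariate(0.0, 1.0) - shift for _ in range(2 * n)]


def estimate_once(effects, n, rng):
    """Randomize n units into treatment; return the ATE estimate and its t-value.

    Untreated outcomes are all zero, so the controls contribute neither to the
    difference in means nor to the (assumed unequal-variance) standard error.
    """
    treated = rng.sample(effects, n)
    ate_hat = sum(treated) / n
    var_t = sum((y - ate_hat) ** 2 for y in treated) / (n - 1)
    return ate_hat, ate_hat / math.sqrt(var_t / n)


def simulate(n, n_samples=100, n_reps=1000, crit=1.96, seed=0):
    """Average the estimates, t-values, and rejections over all randomizations."""
    rng = random.Random(seed)
    total = n_samples * n_reps
    sum_ate = sum_t = 0.0
    rejections = 0
    for _ in range(n_samples):
        effects = draw_trial_sample(n, rng)  # one trial sample of size 2n
        for _ in range(n_reps):
            ate_hat, t = estimate_once(effects, n, rng)
            sum_ate += ate_hat
            sum_t += t
            rejections += abs(t) > crit  # nominal 5 percent two-sided test (assumed)
    return {
        "mean_ate": sum_ate / total,
        "mean_t": sum_t / total,
        "reject_pct": 100.0 * rejections / total,
    }


if __name__ == "__main__":
    # Reduced sizes for a quick run; the text uses 100 samples and 1,000 replications.
    for n in (25, 50):
        print(n, simulate(n, n_samples=20, n_reps=200))
```

Because the control outcomes are identically zero, the t-value depends only on the treated units, which is why a single large treatment effect can dominate both the estimate and its standard error.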
The results are shown in Table A1. Each row corresponds to a sample size. In each row, we show the results of 100,000 individual trials, composed of 1,000 replications on each of the 100 trial (experimental) samples. The columns are averaged over all 100,000 trials.
Table A1: RCTs with skewed treatment effects

Sample size    Mean of ATE    Mean of nominal    Fraction null
(n per arm)    estimates      t-values           rejected (percent)
 25             0.0268        –0.4274            13.54
 50             0.0266        –0.2952            11.20
100            –0.0018        –0.2600             8.71
200             0.0184        –0.1748             7.09
500            –0.0024        –0.1362             6.06

Note: 1,000 randomizations on each of 100 draws of the trial sample, randomly drawn from a lognormal distribution of treatment effects shifted to have a zero mean.
The last column shows the fraction of times that the null, which is true in the population, is rejected in the trial samples; this is our key result. When there are only 50 treatments and 50 controls (row 2), the (true) null is rejected 11.2 percent of the time, instead of the 5 percent that we would like and expect if we were unaware of the problem. When there are 500 units in each arm, the rejection rate is 6.06 percent, much closer to the nominal 5 percent.
Figure A1: Estimates of an ATE with an outlier in the trial sample

Figure A1 illustrates the estimated ATEs from an extreme trial sample from the simulations in the second row, with 100 observations in total; the histogram shows the 1,000 estimates of the ATE for that trial sample. This trial sample has a single large outlying treatment effect of 48.3; the mean (s.d.) of the other 99 observations is –0.51 (2.1). When the outlier is in the treatment group, we get the right-hand side of the figure; when it is in the control group, we get the left-hand side.
[Figure A1: histogram. Horizontal axis: 1,000 estimates of average treatment effect (ticks from –0.5 to 2); vertical axis: density (ticks from 0 to 1.5).]
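To see why a single outlier produces a two-humped histogram, the following sketch re-randomizes one hypothetical trial sample many times and separates the estimates according to the arm in which the outlier lands. The 99 non-outlying effects are stand-in normal draws with mean –0.51 and s.d. 2.1, an assumption made purely for illustration; the actual trial sample behind Figure A1 is not reproduced here, and the function name and seed are likewise hypothetical.

```python
import random


def outlier_split(outlier=48.3, n=50, n_reps=1000, seed=0):
    """Re-randomize one trial sample and group ATE estimates by the outlier's arm."""
    rng = random.Random(seed)
    # Stand-in for the other 2n - 1 treatment effects (assumed normal, not the paper's data).
    effects = [outlier] + [rng.gauss(-0.51, 2.1) for _ in range(2 * n - 1)]
    in_treatment, in_control = [], []
    for _ in range(n_reps):
        treated_idx = rng.sample(range(2 * n), n)
        # Controls record zero, so the estimate is the mean effect among the treated.
        ate_hat = sum(effects[i] for i in treated_idx) / n
        # Index 0 is the outlier; file the estimate under the arm it was assigned to.
        (in_treatment if 0 in treated_idx else in_control).append(ate_hat)
    return in_treatment, in_control


if __name__ == "__main__":
    treated_case, control_case = outlier_split()
    print("outlier treated:", len(treated_case), sum(treated_case) / len(treated_case))
    print("outlier control:", len(control_case), sum(control_case) / len(control_case))
```

Whenever the outlier is assigned to treatment it adds outlier/n to the estimate, here 48.3/50, or roughly 0.97, which is what separates the two humps.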