
Understanding and misunderstanding randomized controlled trials

Angus Deaton and Nancy Cartwright

Princeton University, NBER, and University of Southern California

Durham University and UC San Diego

This version, October 2017

We acknowledge helpful discussions with many people over the several years this paper has been in preparation. We would particularly like to note comments from seminar participants at Princeton, Columbia, and Chicago, the CHESS research group at Durham, as well as discussions with Orley Ashenfelter, Anne Case, Nick Cowen, Hank Farber, Jim Heckman, Bo Honoré, Chuck Manski, and Julian Reiss. Ulrich Mueller had a major influence on shaping Section 1. We have benefited from generous comments on an earlier version by Christopher Adams, Tim Besley, Chris Blattman, Sylvain Chassang, Jishnu Das, Jean Drèze, William Easterly, Jonathan Fuller, Lars Hansen, Jeff Hammer, Glenn Harrison, Macartan Humphreys, Michal Kolesár, Helen Milner, Tamlyn Munslow, Suresh Naidu, Lant Pritchett, Dani Rodrik, Burt Singer, Richard Williams, Richard Zeckhauser, and Steve Ziliak. Cartwright’s research for this paper has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 667526 K4U), the Spencer Foundation, and the National Science Foundation (award 1632471). Deaton acknowledges financial support from the National Institute on Aging through the National Bureau of Economic Research, Grants 5R01AG040629-02 and P01AG05842-14 and through Princeton University’s Roybal Center, Grant P30AG024928.


ABSTRACT

RCTs would be more useful if there were more realistic expectations of them and if their pitfalls were better recognized. For example, and contrary to many claims in the applied literature, randomization does not equalize everything but the treatment across treatments and controls, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) confounders. Estimates apply to the trial sample only, sometimes a convenience sample, and usually selected; justification is required to extend them to other groups, including any population to which the trial sample belongs. Demanding “external validity” is unhelpful because it expects too much of an RCT while undervaluing its contribution. Statistical inference on ATEs involves hazards that are not always recognized. RCTs do indeed require minimal assumptions and can operate with little prior knowledge. This is an advantage when persuading distrustful audiences, but it is a disadvantage for cumulative scientific progress, where prior knowledge should be built upon and not discarded. RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program, combining with other methods, including conceptual and theoretical development, to discover not “what works,” but “why things work”.


Introduction

Randomized controlled trials (RCTs) are widely visible in economics today and have been used in the subject at least since the 1960s (see Greenberg and Shroder (2004) for a compendium). It is often claimed that such trials can discover “what works” in economics, as well as in political science, education, and social policy. Among both researchers and the general public, RCTs are perceived to yield causal inferences and estimates of average treatment effects (ATEs) that are more reliable and more credible than those from any other empirical method. They are taken to be largely exempt from the myriad econometric problems that characterize observational studies, to require minimal substantive assumptions, little or no prior information, and to be largely independent of “expert” knowledge that is often regarded as manipulable, politically biased, or otherwise suspect.

There are now “What Works” centers using and recommending RCTs in a range of areas of social concern across Europe and the Anglophone world. These centers see RCTs as their preferred tool and indeed often prefer RCT evidence lexicographically. As one of many examples, the US Department of Education’s standard for “strong evidence of effectiveness” requires a “well-designed and implemented” RCT; no observational study can earn such a label. This “gold standard” claim about RCTs is less common in economics, but Imbens (2010, 407) writes that “randomized experiments do occupy a special place in the hierarchy of evidence, namely at the very top.” The Abdul Latif Jameel Poverty Action Lab (J-PAL), whose stated mission is “to reduce poverty by ensuring that policy is informed by scientific evidence”, advertises that its affiliated professors “conduct randomized evaluations to test and improve the effectiveness of programs and policies aimed at reducing poverty”, J-PAL (2017). The lead page of its website (echoed in the ‘Evaluation’ section) notes “843 ongoing and completed randomized evaluations in 80 countries” with no mention of any studies that are not randomized.

In medicine, the gold standard view has long been widespread, e.g. for drug trials by the FDA; a notable exception is the recent paper by Frieden (2017), ex-director of the U.S. Centers for Disease Control and Prevention, who lists key limitations of RCTs as well as a range of contexts where RCTs, even when feasible, are dominated by other methods.

We argue that any special status for RCTs is unwarranted. Which method is most likely to yield a good causal inference depends on what we are trying to discover as well as on what knowledge is already available. When little prior knowledge is available, no method is likely to yield well-supported conclusions. This paper is not a criticism of RCTs in and of themselves, let alone an attempt to identify good and bad studies. Instead, we will argue that, depending on what we want to discover, why we want to discover it, and what we already know, there will often be superior routes of investigation.

We present two sets of arguments. The first is an enquiry into the idea that ATEs estimated from RCTs are likely to be closer to the truth than those estimated in other ways. The second explores how to use the results of RCTs once we have them. In the first section, our discussion runs in familiar terms of bias and efficiency, or expected loss. None of this material is new, but we know of no similar treatment, and we wish to dispute many of the claims that are frequently made in the applied literature. Some routine misunderstandings are: (a) randomization ensures a fair trial by ensuring that, at least with high probability, treatment and control groups differ only in the treatment; (b) RCTs provide not only unbiased estimates of ATEs but also precise estimates; (c) statistical inference in RCTs, which requires only the simple comparison of means, is straightforward, so that standard significance tests are reliable.

Nothing we say in the paper should be taken as a general argument against RCTs; we are simply trying to challenge unjustifiable claims, and expose misunderstandings. We are not against RCTs, only magical thinking about them. The misunderstandings are important because we believe that they contribute to the common perception that RCTs always provide the strongest evidence for causality and for effectiveness.


In the second part of the paper, we discuss how to use the evidence from RCTs. The non-parametric and theory-free nature of RCTs, which is arguably an advantage in estimation, is often a disadvantage when we try to use the results outside of the context in which the results were obtained; credibility in estimation can lead to incredibility in use. Much of the literature, perhaps inspired by Campbell and Stanley’s (1963) famous “primacy of internal validity”, appears to believe that internal validity is not only necessary but almost sufficient to guarantee the usefulness of the estimates in different contexts. But you cannot know how to use trial results without first understanding how the results from RCTs relate to the knowledge that you already possess about the world, and much of this knowledge is obtained by other methods. Once the commitment has been made to seeing RCTs within this broader structure of knowledge and inference, and when they are designed to enhance it, they can be enormously useful, not just for warranting claims of effectiveness but for scientific progress more generally. Cumulative science is not advanced through magical thinking.

The literature on the precision of ATEs estimated from RCTs goes back to the very beginning. Gosset (writing as ‘Student’) never accepted Fisher’s arguments for randomization in agricultural field trials and argued convincingly that his own non-random designs for the placement of treatment and controls yielded more precise estimates of treatment effects (see Student (1938) and Ziliak (2014)). Gosset worked for Guinness where inefficiency meant lost revenue, so he had reasons to care, as should we. Fisher won the argument in the end, not because Gosset was wrong about efficiency, but because, unlike Gosset’s procedures, randomization provides a sound basis for statistical inference, and thus for judging whether an estimated ATE is different from zero by chance. Moreover, Fisher’s blocking procedures can limit the inefficiency from randomization (see Yates (1939)). Gosset’s reservations were echoed much later in Savage’s (1962) comment that a Bayesian should not choose the allocation of treatments and controls at random but in such a way that, given what else is known about the topic and the subjects, their placement reveals the most to the researcher. These issues about how to incorporate prior information into randomized trials are central to Section 1.


In economics, the strengths and weaknesses of RCTs are well explored in the volumes by Hausman and Wise (1985) and by Garfinkel and Manski (1992); in the latter, the introduction by Garfinkel and Manski is a balanced summary of what randomized trials can and cannot do. The paper in that volume by Heckman (1992) raises many of the issues that he and his coauthors have explored in subsequent papers; see in particular Heckman and Smith (1995), and Heckman, Lalonde and Smith (1999), who focus on labor market experiments. Manski (2013) contains a good summary of both strengths and weaknesses.

There is also a more contested recent literature. On the one hand, there are procedures that take as fundamental the unrestricted treatment effects of individuals and seek non-parametric approaches to estimating their average. On the other hand, these procedures are contrasted with an approach that uses elements of economic theory to define parameters of interest and to identify magnitudes that are likely to be invariant to policy manipulation or across contexts, where invariance is defined in the sense of Hurwicz (1966). The introduction in Imbens and Wooldridge (2009) provides an eloquent defense of the treatment-effect formulation. It emphasizes the credibility that comes from a theory-free specification with almost unlimited heterogeneity in treatment effects. The introduction in Heckman and Vytlacil (2007) makes an equally eloquent case against, noting that the crucial ingredients of treatments in RCTs are often not clearly specified (so that we often do not know what the treatment really is) and that the treatment effects are hard to link to invariant parameters that would be useful elsewhere. Aspects of the same debate feature in Imbens (2010), Athey and Imbens (2017), Angrist and Pischke (2017), Heckman (2005, 2008, 2010) and Heckman and Urzua (2010).

Deaton (2010) complains about the use of instrumental variables, including randomization, as a substitute for thinking about and constructing models of economic development. He argues against the idea that using RCTs to evaluate projects to discover “what works” can ever yield a systematic body of scientific knowledge that can be used to reduce or eliminate poverty. That paper is an argument against the usefulness of the heterogeneous treatment approach. It argues that refusing to model heterogeneity, though avoiding assumptions, precludes the sort of cumulative research program that might yield useful policy. The paper’s claim that RCTs have no special claim to generate credible and useful knowledge was challenged by Imbens (2010); some of his arguments are answered below. Cartwright (2007) and Cartwright and Munro (2010) challenge any “gold standard” view of RCTs. Cartwright (2011, 2012, 2016) and Cartwright and Hardie (2012) focus on the question of how to use the results of RCTs and what we can learn when an experiment shows that some policy works somewhere. Section 2 pursues these issues in general and through case studies.

Section 1: Do RCTs give good estimates of Average Treatment Effects

In this section, we explore how to estimate average treatment effects (ATEs) and the role of randomization. We note first that estimating ATEs is only one of many uses for the data generated by an RCT. We start from a trial sample, a collection of subjects that will be allocated randomly to either the treatment or control arm of the trial. This “sample” might be, but rarely is, a random sample from some population of interest. More frequently, it is selected in some way, for example restricted to those willing to participate, or is simply a convenience sample that is available to the trialists. Given random allocation to treatments and controls, the data from the trial allow the identification of two distributions, $F_1(Y_1)$ and $F_0(Y_0)$, of outcomes $Y_1$ and $Y_0$ in the treated and untreated cases within the trial sample. The estimated ATE is the difference in means of the two distributions and is the focus of much of the literature in social science and medicine. Yet policy makers and researchers may well be interested in other features of the two distributions. For example, if $Y$ is income, they may be interested in whether a treatment reduced income inequality, or in what it did to the 10th or 90th percentiles of the income distribution, even though different people occupy those percentiles in the treatment and control distributions (see Bitler et al. (2006) for an example in US welfare policy). Cancer trials standardly use the median difference in survival, which compares the times until half the patients have died in each arm. More comprehensively, policy makers may wish to compare expected utilities for treated and untreated under the two distributions and consider optimal expected-utility maximizing treatment rules conditional on the characteristics of subjects (see Manski (2004) and Manski and Tetenov (2016); Bhattacharya and Dupas (2012) contains an application). These uses are important, but we focus on ATEs here and do not consider these other uses of RCTs any further in this paper.

1.1 What does randomization do?

A useful way to think about the estimation of treatment effects is to use a schematic linear causal model of the form:

$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (1)$$

where $Y_i$ is the outcome for unit $i$, $T_i$ is a dichotomous (1,0) treatment dummy indicating whether or not $i$ is treated, and $\beta_i$ is the individual treatment effect of the treatment on $i$. The $x$’s are the observed or unobserved other linear causes of the outcome, and we suppose that (1) captures a minimal set of causes of $Y_i$ sufficient to fix its value. $J$ may be (very) large. Because the heterogeneity of the individual treatment effects, $\beta_i$, is unrestricted, we allow the possibility that the treatment interacts with the $x$’s or other variables, so that the effects of $T$ can depend on any other variables. Note that we do not need $i$ subscripts on the $\gamma$’s that control the effects of the other causes; if their effects differ across individuals, we include the interactions of individual characteristics with the original $x$’s as new $x$’s. Given that the $x$’s can be unobservable, this is not restrictive.

Consider an experiment that aims to tell us something about the treatment effects; this might or might not use randomization. Either way, we can represent the treatment group as having $T_i = 1$ and the control group as having $T_i = 0$. Given the study (or trial) sample, subtracting the average outcomes among the controls from the average outcomes among the treatments, we get

$$\bar{Y}_1 - \bar{Y}_0 = \bar{\beta}_1 + \sum_{j=1}^{J} \gamma_j \left(\bar{x}^{1}_{ij} - \bar{x}^{0}_{ij}\right) = \bar{\beta}_1 + \left(\bar{S}_1 - \bar{S}_0\right) \qquad (2)$$

The first term on the far-right-hand side of (2), which is the ATE in the trial sample, is what we want, but the second term or error term, which is the sum of the net average balance of other causes across the two groups, will generally be non-zero and needs to be dealt with somehow. We get what we want when the means of all the other causes are identical in the two groups, or more precisely (and less onerously) when the sum of their net differences $\bar{S}_1 - \bar{S}_0$ is zero; this is the case of perfect balance. With perfect balance, the difference between the two means is exactly equal to the average of the treatment effects among the treated, so that we have the ultimate precision in that we know the truth in the trial sample, at least in this linear case. As always, the “truth” here refers to the trial sample, and it is always important to be aware that the trial sample may not be representative of the population that is ultimately of interest, including the population from which the trial sample comes; any such extension requires further argument.
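
The step from (1) to (2) can be spelled out as a short worked derivation. This is only a sketch under the linear model above; the notation $\bar{S}_1$ and $\bar{S}_0$ for the $\gamma$-weighted sums of arm means is our reading of the abbreviation used in (2), and the bars denote averages over the treated (superscript or subscript 1) and control (0) units.

```latex
\begin{aligned}
\bar{Y}_1 &= \bar{\beta}_1 + \sum_{j=1}^{J} \gamma_j \bar{x}^{1}_{ij}
  && \text{(average of (1) over units with } T_i = 1\text{)} \\
\bar{Y}_0 &= \sum_{j=1}^{J} \gamma_j \bar{x}^{0}_{ij}
  && \text{(average of (1) over units with } T_i = 0\text{)} \\
\bar{Y}_1 - \bar{Y}_0 &= \bar{\beta}_1
  + \sum_{j=1}^{J} \gamma_j \left(\bar{x}^{1}_{ij} - \bar{x}^{0}_{ij}\right)
  = \bar{\beta}_1 + \left(\bar{S}_1 - \bar{S}_0\right),
  && \bar{S}_k \equiv \sum_{j=1}^{J} \gamma_j \bar{x}^{k}_{ij}.
\end{aligned}
```

Perfect balance is then the single condition $\bar{S}_1 = \bar{S}_0$, which is weaker than requiring $\bar{x}^{1}_{ij} = \bar{x}^{0}_{ij}$ for every $j$.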

How do we get balance, or something close to it? What, exactly, is the role of randomization? In a laboratory experiment, where there is usually much prior knowledge of the other causes, the experimenter has a good chance of controlling (or subtracting away the effects of) the other causes, aiming to ensure that the last term in (2) is close to zero. Failing such knowledge and control, an alternative is matching, which is frequently used in statistical, medical, and econometric work. For each subject, a match is found that is as close as possible on all suspected causes, so that, once again, the last term in (2) can be kept small. When we have a good idea of the causes, matching may also deliver a precise estimate. Of course, when there are unknown or unobservable causes that have important effects, neither laboratory control nor matching offers protection.

What does randomization do? Since the treatments and controls come from the same underlying distribution, randomization guarantees, by construction, that the last term on the right in (2) is zero in expectation, subject to the caveat that no correlations of the $x$’s with $Y$ are introduced post-randomization, for example by subjects not accepting their assignment. The expectation here is taken over repeated randomizations on the trial sample, each with its own allocation of treatments and controls. Assuming that our caveat holds, the last term in (2) will be zero when averaged over this infinite number of (entirely hypothetical) replications, and the average of the estimated ATEs will be the true ATE in the trial sample. So the difference in means, $\bar{Y}_1 - \bar{Y}_0$, delivers an unbiased estimate of the ATE among the treated in the trial sample, and it does so whether or not the causes are observed. Unbiasedness does not require us to know anything about the other causes, though it does require that they not change after randomization so as to make them correlated with the treatment, which is an important caveat to which we shall return. Of course, none of this is true in any one trial, where the difference in means will be equal to the average treatment effect among those treated plus the term that reflects the imbalance in the net effects of the other causes. We do not know the size of this error term, and there is nothing in randomization that limits its size; by chance the randomization in our single trial can over-represent one or more important excluded causes in one arm over the other, in which case there will be a difference between the means of the two groups that is not caused by the treatment.
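
The distinction between balance in expectation and balance in a single trial can be illustrated with a small simulation. The sketch below is not from the paper: the sample size, the coefficient `gamma`, and the distributions of the effects and of the single unobserved cause are all hypothetical choices made for illustration. It holds one trial sample fixed, re-randomizes it many times, and compares the average of the estimates with their spread.

```python
import random
import statistics


def simulate(n=50, reps=5000, seed=0):
    """Re-randomize one fixed trial sample many times under model (1)."""
    rng = random.Random(seed)
    # Hypothetical trial sample: heterogeneous effects beta_i and a single
    # unobserved cause x_i with coefficient gamma (all values are assumptions).
    beta = [rng.gauss(1.0, 0.5) for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    gamma = 3.0
    true_ate = statistics.fmean(beta)  # ATE in this trial sample

    units = list(range(n))
    estimates = []
    for _ in range(reps):
        rng.shuffle(units)  # one random allocation
        treated, controls = units[: n // 2], units[n // 2:]
        y1 = statistics.fmean(beta[i] + gamma * x[i] for i in treated)
        y0 = statistics.fmean(gamma * x[i] for i in controls)
        estimates.append(y1 - y0)  # difference in means, as in (2)
    return true_ate, estimates


true_ate, estimates = simulate()
mean_estimate = statistics.fmean(estimates)  # close to true_ate: unbiased
spread = statistics.pstdev(estimates)        # single-trial error is not small
worst = max(abs(e - true_ate) for e in estimates)
print(f"true ATE {true_ate:.2f}, mean estimate {mean_estimate:.2f}, "
      f"sd across randomizations {spread:.2f}, worst error {worst:.2f}")
```

Under these assumptions the average over replications sits close to the trial-sample ATE, while individual allocations can miss it by a wide margin purely because the unobserved cause happens to be unbalanced.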

The unbiasedness result can easily be compromised. In particular, the treatment must not be correlated with any other cause. Random assignment is designed to aid with this, but it is not sufficient if, for example, there is lack of blinding so that individuals are aware of their assignment, or if those administering the treatment are so aware, and if that awareness triggers another cause. Similarly, researchers sometimes return to individuals who were randomized years before, so that there has been time for the subjects or others to learn their assignment or for other causes to be influenced by the assignment. This again opens up the possibility of unbalanced effects of causes other than the treatment we are interested in. We have already noted that unbiasedness refers to the trial sample, which may or may not be representative of the population of interest.

If we were to repeat the trial many times, the over-representation of the unbalanced causes will sometimes be in the treatments and sometimes in the controls. The imbalance will vary over replications of the trial, and although we cannot see this from our single trial, we should be able to capture its effects on our estimate of the ATE from an estimated standard error. This was Fisher’s innovation: not that randomization balanced other causes between treatments and controls but that, conditional on our caveat above, randomization provides the basis for calculating the size of the error. Getting the standard error and associated significance statements right is of the greatest importance. Given the absence of treatment-related post-randomization changes in other causes, randomization yields an unbiased estimate of the ATE in the trial sample as well as a sound method for measuring error of estimation in that sample; therein lies its virtue, not that it yields precise estimates through balance.
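
One way to see how randomization itself supplies a basis for inference is a randomization (permutation) test: under the sharp null hypothesis that the treatment has no effect on any unit, every re-allocation of the observed outcomes was equally likely. The sketch below is an illustration under that assumption, not the paper's own procedure; the outcome values are made up and the function name is ours.

```python
import random
import statistics


def randomization_p_value(treated, controls, reps=2000, seed=0):
    """Two-sided p-value for the difference in means under the sharp null."""
    rng = random.Random(seed)
    observed = statistics.fmean(treated) - statistics.fmean(controls)
    pooled = list(treated) + list(controls)
    n1 = len(treated)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # a hypothetical re-allocation of the same units
        diff = statistics.fmean(pooled[:n1]) - statistics.fmean(pooled[n1:])
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / reps


# Made-up outcomes for a tiny trial (illustrative only).
treated_outcomes = [12.1, 9.8, 11.4, 13.0, 10.9, 12.6]
control_outcomes = [9.9, 10.4, 8.7, 11.2, 9.5, 10.1]
observed, p_value = randomization_p_value(treated_outcomes, control_outcomes)
print(f"difference in means {observed:.3f}, randomization p-value {p_value:.3f}")
```

The reference distribution here comes entirely from the allocation mechanism; nothing in the calculation requires the two arms to be balanced on other causes.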

1.2 Misunderstandings: claiming too much

Everything so far should be perfectly familiar, but exactly what randomization does is frequently lost in the practical literature. There is often confusion between perfect control, on the one hand (as in a laboratory experiment or perfect matching with no unobservable causes), and control in expectation on the other, which is what randomization contributes. If we knew enough about the problem to be able to control well, that is what we would do. Randomization is an alternative when we do not know enough, but is generally inferior to good control. We suspect that at least some of the popular and professional enthusiasm for RCTs, as well as the belief that they are precise by construction, comes from misunderstandings about balance. These misunderstandings are not so much among the trialists, who will often give a correct account when pressed. They come from imprecise statements by trialists that are taken literally by the lay audience that the trialists are keen to reach.

Such a misunderstanding is well captured by a quote from the second edition of the online manual on impact evaluation jointly issued by the Inter-American Development Bank and the World Bank (the first, 2011 edition is similar):

“We can be confident that our estimated impact constitutes the true impact of the program, since we have eliminated all observed and unobserved factors that might otherwise plausibly explain the difference in outcomes.” Gertler, Martinez, Premand, Rawlings, and Vermeersch (2016, 69).


This statement is false, because it confuses actual balance in any single trial with balance in expectation over many (hypothetical) trials. If it were true, and if all factors were indeed controlled (and no imbalances were introduced post-randomization), the difference would be an exact measure of the average treatment effect among the treated in the trial population (at least in the absence of measurement error). We should not only be confident of our estimate but, as the quote says, we would know that it is the truth. Note that the statement contains no reference to sample size; we get the truth by virtue of balance, not from a large number of observations.

A similar quote comes from John List, one of the most imaginative and successful scholars who use RCTs:

“complications that are difficult to understand and control represent key reasons to conduct experiments, not a point of skepticism. This is because randomization acts as an instrumental variable, balancing unobservables across control and treatment groups.” Al-Ubaydli and List (2013) (italics in the original).

And from Dean Karlan, founder and President of Yale’s Innovations for Poverty Action, which runs development RCTs around the world:

“As in medical trials, we isolate the impact of an intervention by randomly assigning subjects to treatments and control groups. This makes it so that all those other factors which could influence the outcome are present in treatment and control, and thus any difference in outcome can be confidently attributed to the intervention.” Karlan, Goldberg and Copestake (2009)

And from the medical literature, from a distinguished psychiatrist who is deeply skeptical of the use of evidence from RCTs,

“The beauty of a randomized trial is that the researcher does not need to understand all the factors that influence outcomes. Say that an undiscovered genetic variation makes certain people unresponsive to medication. The randomizing process will ensure, or make it highly probable, that the arms of the trial contain equal numbers of subjects with that variation. The result will be a fair test.” (Kramer, 2016, p. 18)


Claims are even made that RCTs reveal knowledge without possibility of error. Judy Gueron, the long-time president of MDRC, which has been running RCTs on US government policy for 45 years, asks why federal and state officials were prepared to support randomization in spite of frequent difficulties and in spite of the availability of other methods and concludes that it was because “they wanted to learn the truth,” Gueron and Rolston (2013, 429). There are many statements of the form “We know that [project X] worked because it was evaluated with a randomized trial,” Dynarski (2015).

It is common to treat the ATE from an RCT as if it were the truth, not just in the trial sample but more generally. In economics, a famous example is Lalonde’s (1986) study of training programs, whose results were at odds with a number of previous non-randomized studies. The paper prompted a large-scale re-examination of the observational studies to try to bring them into line, though it now seems just as likely that the differences lie in the fact that the study results apply to different populations (Heckman, Lalonde, and Smith (1999)). In epidemiology, Davey-Smith and Ibrahim (2002) state that “observational studies propose, RCTs dispose”. A good example is the RCT of hormone replacement therapy (HRT) for post-menopausal women. HRT had previously been supported by positive results from a high-quality and long-running observational study, but the RCT was stopped in the face of excess deaths in the treatment group. The negative result of the RCT led to widespread abandonment of the therapy, which might have been a mistake (see Vandenbroucke (2009) and Frieden (2017)). Yet the medical and popular literature routinely states that the RCT was right and the earlier study wrong, simply because the earlier study was not randomized. The gold standard or “truth” view does harm when it undermines the obligation of science to reconcile RCT results with other evidence in a process of cumulative understanding.

The false belief in automatic precision suggests that we need pay no attention to the other causes in (1) or (2). Indeed, Gerber and Green (2012), in their standard text for RCTs in political science, write that running an RCT is “a research strategy that does not require, let alone measure, all potential confounders.” This is true if we are happy with estimates that are arbitrarily far from the truth, just so long as the errors cancel out over a series of imaginary experiments. In reality, the causality that is being attributed to the treatment might, in fact, be coming from an imbalance in some other cause in our particular trial; limiting this requires serious thought about possible confounders.

1.3 Sample size, balance, and precision

At the time of randomization and in the absence of post-randomization changes in other causes, a trial is more likely to be balanced when the sample size is large. As the sample size tends to infinity, the means of the $x$’s in the treatment and control groups will become arbitrarily close. Yet this is of little help in finite samples. As Fisher (1926) noted: “Most experimenters on carrying out a random assignment will be shocked to find how far from equally the plots distribute themselves,” quoted in Morgan and Rubin (2012). Even with very large sample sizes, if there is a large number of causes, balance on each cause may be infeasible. Even with just three causes with three values each, there are 27 cells to balance, and in most social and medical cases there will be more. Vandenbroucke (2004) notes that there are three billion base pairs in the human genome, many or all of which could be relevant prognostic factors for the biological outcome that we are seeking to influence. It is true, as (2) makes clear, that we do not need balance on each cause individually, only on their net effect, the term $\bar{S}_1 - \bar{S}_0$. But consider the human genome base pairs. Out of all of those billions, only one might be important, and if that one is unbalanced, the results of a single trial can be far from the truth. Statements about large samples guaranteeing balance are not useful without guidelines about how large is large enough, and such statements cannot be made without knowledge of other causes and how they affect outcomes. Of course, lack of balance in the net effect of either observables or non-observables in (2) does not compromise the inference in an RCT in the sense of obtaining a standard error for the unbiased ATE (see Senn (2013) for a particularly clear statement).
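
How quickly chance imbalance shrinks with sample size can be checked numerically. The following sketch is our own illustration with hypothetical settings (one standard-normal cause, equal-sized arms, 500 re-randomizations per sample size); it reports the typical absolute gap between arm means of that cause.

```python
import random
import statistics


def typical_imbalance(n, reps=500, seed=1):
    """Mean absolute gap in arm means of one cause over many randomizations."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # hypothetical single cause
    units = list(range(n))
    half = n // 2
    gaps = []
    for _ in range(reps):
        rng.shuffle(units)
        mean_treated = statistics.fmean(x[i] for i in units[:half])
        mean_control = statistics.fmean(x[i] for i in units[half:])
        gaps.append(abs(mean_treated - mean_control))
    return statistics.fmean(gaps)


imbalance = {n: typical_imbalance(n) for n in (20, 200, 2000)}
for n, gap in imbalance.items():
    print(f"n = {n:5d}: typical |gap in means| = {gap:.3f} standard deviations")
```

The gap falls roughly with the square root of the sample size but never reaches zero, and whether a given gap matters depends on the size of the cause's effect, which the simulation has to assume rather than learn from randomization.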

Having run an RCT, it makes good sense to examine any available covariates for balance between the treatments and controls; if we suspect that an observed variable $x$ is a possible cause, and its means in the two groups are very different, we should treat our results with appropriate suspicion. In practice, trialists in economics (and in some other disciplines) usually carry out a statistical test for balance after randomization but before analysis, presumably with the aim of taking some appropriate action if balance fails. The first table of the paper typically presents the sample means of observable covariates (the observable $x$’s in (1) or interactive factors represented in $\beta$) for the control and treatment groups, together with their differences, and tests for whether or not they are significantly different from zero, either variable by variable, or jointly. These tests are appropriate for unbiasedness if we are concerned that the random number generator might have failed, or if we are worried that the randomization is undermined by non-blinded subjects who systematically undermine the allocation. Otherwise, unbiasedness is guaranteed by the randomization, whatever the tests show, and as the next paragraph demonstrates, the test is not informative about the balance that would lead to precision.

If we write $\mu_0$ and $\mu_1$ for the (vectors of) true means in the trial sample (i.e. the means over all possible randomizations) of the observed causes of $Y$ in the control and treatment groups at the point of assignment, the null hypothesis is (presumably, as judged by the typical balance test) that the two vectors are identical, with the alternative being that they are not. But if the randomization has been correctly done the null hypothesis is true by construction (see e.g. Altman (1985) and Senn (1994)), which may help explain why it so rarely fails in practice. As Begg (1990) notes, “(I)t is a test of a null hypothesis that is known to be true. Therefore, if the test turns out to be significant it is, by definition, a false positive.” This is, of course, consistent with Fisher’s comments about the plots in the field, which note that two samples of plots randomly drawn from the same field can look very unbalanced. Indeed, although we cannot “test” it in this way, we know that the null hypothesis is also true for the unobservable causes. Note the contrast with the statements quoted above claiming that RCTs guarantee balance on causes across treatment and control groups. Those statements refer to balance of causes at the point of assignment in any single trial, which is not guaranteed by randomization, whereas the balance tests are about the balance of causes at the point of assignment in expectation over many trials, which is guaranteed by randomization. The confusion is perhaps understandable, but it is a confusion nevertheless. Of course, it is always good practice to look for imbalances between observed covariates in any single trial using some more appropriate distance measure, for example the normalized difference in means (Imbens and Wooldridge (2009, equation 3)). Similarly, it would have been good practice for Fisher to abandon a randomization in which there were clear patterns in the (random) distribution of plots across the field, even though the treatment and control plots were random selections that, by construction, could not differ “significantly” using the standard (incorrect) balance test. Whether such imbalances should be seen as undermining the estimate of the ATE depends on our priors about which covariates are likely to be important, and how important, which is (not coincidentally) the same thought experiment that is routinely undertaken in observational studies when we worry about confounding.
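
The difference between a balance test and a distance measure can be made concrete. The sketch below assumes that the normalized difference takes the form of the gap in arm means divided by the square root of the sum of the two arm variances; this is our recollection of the measure cited above and should be checked against the original. The covariate values are made up. Copying the same data ten times leaves the normalized difference roughly where it was, while the t-statistic grows with the number of observations.

```python
import math
import statistics


def normalized_difference(x1, x0):
    """Gap in means scaled by sqrt(s1^2 + s0^2); assumed form, see lead-in."""
    gap = statistics.fmean(x1) - statistics.fmean(x0)
    return gap / math.sqrt(statistics.variance(x1) + statistics.variance(x0))


def t_statistic(x1, x0):
    """Conventional two-sample t-statistic, which scales with sample size."""
    gap = statistics.fmean(x1) - statistics.fmean(x0)
    se = math.sqrt(statistics.variance(x1) / len(x1)
                   + statistics.variance(x0) / len(x0))
    return gap / se


# Made-up covariate values (for example, age) in each arm.
treated_x = [52, 48, 55, 60, 45, 50]
control_x = [47, 44, 51, 53, 42, 49]

nd_small = normalized_difference(treated_x, control_x)
t_small = t_statistic(treated_x, control_x)
# Same imbalance, ten times the observations.
nd_big = normalized_difference(treated_x * 10, control_x * 10)
t_big = t_statistic(treated_x * 10, control_x * 10)
print(f"normalized difference: {nd_small:.2f} (n=6 per arm), {nd_big:.2f} (n=60)")
print(f"t-statistic:           {t_small:.2f} (n=6 per arm), {t_big:.2f} (n=60)")
```

A scale-free measure of this kind describes how far apart the arms actually are in the trial at hand, which is the quantity that matters for the error term in (2).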

One procedure to improve balance is to adapt the design before randomization, for example, by stratification. Fisher, who, as the quote above illustrates, was well aware of the loss of precision from randomization, argued for “blocking” (stratification) in agricultural trials or for using Latin Squares, both of which restrict the amount of imbalance. Stratification, to be useful, requires some prior understanding of the factors that are likely to be important, and so it takes us away from the “no knowledge required” or “no priors accepted” appeal of RCTs; it requires thinking about and measuring covariates. But as Scriven (1974, 103) notes: “(C)ause hunting, like lion hunting, is only likely to be successful if we have a considerable amount of relevant background knowledge”. Cartwright (1994, Chapter 2) puts it even more strongly, “no causes in, no causes out”. Stratification in RCTs, as in other forms of sampling, is a standard method for using background knowledge to increase the precision of an estimator. It has the further advantage that it allows for the exploration of different ATEs in different strata, which can be useful in adapting or transporting the results to other locations (see Section 2).

Stratification is not possible if there are too many covariates, or if each has many values, so that there are more cells than can be filled given the sample size. With five covariates, and ten values on each, and no priors to limit the structure, we would have 100,000 possible strata. Filling these is well beyond the sample sizes in most trials. An alternative that works more generally is to re-randomize. If the randomization gives an obvious imbalance on known covariates (treatment plots all on one side of the field, all of the treatment clinics in one region, too many rich and too few poor in the control group), we try again, and keep trying until we get a balance measured as a small enough distance between the means of the observed covariates in the two groups. Morgan and Rubin (2012) suggest that the Mahalanobis D-statistic be used as a criterion and use Fisher’s randomization inference (to be discussed further below) to calculate standard errors that take the re-randomization into account. An alternative, widely adopted in practice, is to adjust for covariates by running a regression (or covariance) analysis, with the outcome on the left-hand side and the treatment dummy and the covariates as explanatory variables, including possible interactions between covariates and treatment dummies. Freedman (2008) shows that the adjusted estimate of the ATE is biased in finite samples, with the bias depending on the correlation between the squared treatment effect and the covariates. Accepting some bias in exchange for greater precision will often make sense, though it certainly undermines any gold standard argument that relies on unbiasedness without consideration of precision.
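The re-randomization procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Morgan and Rubin's procedure: for simplicity, the balance criterion is the sum of squared standardized differences in covariate means rather than the Mahalanobis statistic, and the function names, the threshold, and the simulated covariates are all hypothetical.

```python
import random
from statistics import fmean, pstdev


def imbalance(covariates, treated):
    """Sum over covariates of squared standardized differences in group means.

    A simplified stand-in (an assumption of this sketch) for the Mahalanobis
    statistic; it ignores correlations between covariates.
    """
    treated = set(treated)
    n = len(covariates)
    total = 0.0
    for k in range(len(covariates[0])):
        column = [row[k] for row in covariates]
        mean_t = fmean(column[i] for i in range(n) if i in treated)
        mean_c = fmean(column[i] for i in range(n) if i not in treated)
        total += ((mean_t - mean_c) / pstdev(column)) ** 2
    return total


def rerandomize(covariates, n_treated, threshold, rng, max_draws=10_000):
    """Redraw the allocation until the imbalance measure is small enough."""
    units = range(len(covariates))
    for draw in range(1, max_draws + 1):
        treated = rng.sample(units, n_treated)
        if imbalance(covariates, treated) <= threshold:
            return sorted(treated), draw
    raise RuntimeError("no acceptable allocation found within max_draws")


rng = random.Random(0)  # fixed seed so the sketch is reproducible
covariates = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]  # made-up data
treated, draws = rerandomize(covariates, n_treated=20, threshold=0.05, rng=rng)
print(f"accepted after {draws} draw(s); imbalance = {imbalance(covariates, treated):.4f}")
```

Because allocations are accepted only when balance is good, the set of possible allocations is restricted, which is why the standard errors need to take the re-randomization into account.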

1.4 Should we randomize?

The tension between randomization and precision that goes back to Fisher, Gosset, and Savage has been reopened in recent papers by Kasy (2016), Banerjee, Chassang, and Snowberg (BCS) (2016) and Banerjee, Chassang, Montero, and Snowberg (BCMS) (2016).

The trade-off between bias and precision can be formalized in several ways, for example by specifying a loss or utility function that depends on how a user is affected by deviations of the estimate of the ATE from the truth and then choosing an estimator or an experimental design that minimizes expected loss or maximizes expected utility. As Savage (1962) noted, for a Bayesian, this involves allocating treatments and controls in “the specific layout that promised to tell him the most,” but without randomization. Of course, this requires serious and perhaps difficult thought about the mechanisms underlying the ATE, which randomization avoids. Savage also notes that several people with different priors may be involved in an investigation and that individual priors may be unreliable because of “vagueness and temptation to self-deception,” defects that randomization may alleviate, or at least evade. BCMS (2016) provide a proof of a Bayesian no-randomization theorem, and BCS (2016) provide an illustration of a school administrator who has long believed that school outcomes are determined, not by school quality, but by parental background, and who can learn the most by placing deprived children in (supposed) high-quality schools and privileged children in (supposed) low-quality schools, which is the kind of study setting that case study methodology is well attuned to. As BCS note, this allocation would not persuade those with different priors, and they propose randomization as a means of satisfying skeptical observers.

Several points are important. First, the anti-randomization theorem is not a justification of any non-randomized design, for example, one that allows selection on unobservables, but only of the optimal design that is most informative. According to Chalmers (2001) and Bothwell and Podolsky (2016), the development of randomization in medicine originated with Bradford-Hill, who used randomization in the first RCT in medicine (the streptomycin trial) because it prevented doctors selecting patients on the basis of perceived need (or against perceived need, leaning over backward as it were), an argument recently echoed by Worrall (2007). Randomization serves this purpose, but so do other non-discretionary schemes; what is required is that hidden information should not be allowed to affect the allocation.

Second, the ideal rules by which units are allocated to treatment or control depend on the covariates and on the investigators’ priors about how the covariates affect the outcomes. This opens up all sorts of methods of inference that are long familiar to economists but that are excluded by pure randomization. For example, what philosophers call the hypothetico-deductive method works by using theory to make a prediction that can be taken to the data for potential falsification (as in the school example above). This is the way that physicists learn, as do economists when they use theory to derive predictions that can be tested against the data, perhaps in an RCT, but more frequently not. Some of the most fruitful research programs in economics have been generated by the puzzles that result when the data fail to match such theoretical predictions, such as the equity premium puzzle, various purchasing power parity puzzles, the Feldstein-Horioka puzzle, the consumption smoothness puzzle, the puzzle of calorie decline in the face of malnourishment and income growth, and many others.

Third, randomization, by running roughshod over prior information from theory and from covariates, is wasteful and even unethical when it unnecessarily exposes people, or unnecessarily many people, to possible harm in a risky experiment. Worrall (2008) documents the (extreme) case of ECMO, a new treatment for newborns with persistent pulmonary hypertension that was developed in the 1970s by intelligent and directed trial and error within a well-understood theory of the disease. In early experimentation by the inventors, mortality was reduced from 80 to 20 percent. The investigators felt compelled to conduct an RCT, albeit with an adaptive ‘play-the-winner’ design in which each success in an arm increased the probability of the next baby being assigned to that arm. One baby received conventional therapy and died, 11 received ECMO and lived. Even so, a standard randomized controlled trial was thought necessary. With a stopping rule of four deaths, four more babies (out of ten) died in the control group and none of the nine who received ECMO.

Fourth, the non-random methods use prior information, which is why they do better than randomization. This is both an advantage and a disadvantage, depending on one’s perspective. If prior information is not widely accepted, or is seen as non-credible by those we are seeking to persuade, we will generate more credible estimates if we do not use those priors. Indeed, this is why BCS (2016) recommend randomized designs, including in medicine and in development economics. They develop a theory of an investigator who is facing an adversarial audience who will challenge any prior information and can even potentially veto results based on it (think of administrative agencies such as the FDA or journal referees). The experimenter trades off his or her own desire for precision (and preventing possible harm to subjects), which would require prior information, against the wishes of the audience, who wants nothing to do with those priors. Even then, the approval of the audience is only ex ante; once the fully randomized experiment has been done, nothing stops critics arguing that, in fact, the randomization did not offer a fair test because important other causes were not balanced. Among doctors who use RCTs, and especially meta-analysis, such arguments are (appropriately) common (see Kramer (2016)).

Today, when the public has come to question expert prior knowledge, RCTs will flourish. In cases where there is good reason to doubt the good faith of experimenters, as in many pharmaceutical trials, randomization will indeed be an appropriate response. But we believe such arguments are destructive for scientific endeavor (which is not the purpose of the FDA) and should be resisted as a general prescription in scientific research. Previous knowledge needs to be built on and incorporated into new knowledge, not discarded in the face of aggressive ignorance. The systematic refusal to use prior knowledge and the associated preference for RCTs are recipes for preventing cumulative scientific progress. In the end, it is also self-defeating. To quote Rodrik (2016), “the promise of RCTs as theory-free learning machines is a false one.”

1.5 Statistical inference in RCTs

The estimated ATE in a simple RCT is the difference in the means between the treatment and control groups. When covariates are allowed for, as in most RCTs in economics, the ATE is usually estimated from the coefficient on the treatment dummy in a regression that looks like (1), but with the heterogeneity in $\beta$ ignored. Modern work calculates standard errors allowing for the possibility that residual variances may be different in the treatment and control groups, usually by clustering the standard errors, which is equivalent to the familiar two-sample standard error in the case with no covariates. Statistical inference is done with t-values in the usual way. These procedures do not always give the right answer.

Looking back at (1), the underlying objects of interest are the individual treatment effects $\beta_i$ for each of the individuals in the trial sample. Neither they, nor their distribution $G(\beta)$, is identified from an RCT; because RCTs make so few assumptions, which, in many cases, is their strength, they can identify only the mean of the distribution. In many observational studies, researchers are prepared to make more assumptions on functional forms or on distributions, and for that price we are able to identify other quantities of interest. Without these assumptions, inferences must be based on the difference in the two means, a statistic that is sometimes ill-behaved, as we shall discuss below. This ill-behavior has nothing to do with RCTs, per se, but within RCTs, and their minimal assumptions, we cannot easily switch from the mean to some other quantity of interest.

Fisher proposed that statistical inference should be done using what has become known as “randomization” inference, a procedure that is as non-parametric as the RCT-based estimate of an ATE itself. To test the null hypothesis that $\beta_i = 0$ for all i, note that, under the null that the treatment has no effect on any individual, an estimated nonzero ATE must be a consequence of the particular random allocation that generated it. By tabulating all possible combinations of treatments and controls in our trial sample, and the ATE associated with each, we can calculate the exact distribution of the estimated ATE under the null. This allows us to calculate the probability of obtaining an estimate as large as our actual estimate when there are no effects of treatment. This randomization test requires a finite sample, but it will work for any sample size (see Imbens and Wooldridge (2009) for an excellent account of the procedure). Imbens (2010) argues that it is this randomization inference plus the unbiasedness of the ATE that provides the twin non-parametric pillars that support placing RCTs at the “very top” of the hierarchy of evidence.
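The tabulation described above can be made concrete with a small sketch. It assumes a tiny made-up trial of eight units so that every allocation can be enumerated, and it uses the absolute value of the estimated ATE as the test statistic (a two-sided choice made here for illustration); the function names are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def estimated_ate(outcomes, treated):
    """Difference in mean outcomes between treated and control units."""
    treated = set(treated)
    mean_t = fmean(y for i, y in enumerate(outcomes) if i in treated)
    mean_c = fmean(y for i, y in enumerate(outcomes) if i not in treated)
    return mean_t - mean_c


def randomization_p_value(outcomes, treated):
    """Exact two-sided p-value under the sharp null of no effect for anyone.

    Under that null the outcomes are fixed, so every possible allocation of
    the same size can be enumerated and its ATE estimate computed.
    """
    observed = abs(estimated_ate(outcomes, treated))
    allocations = combinations(range(len(outcomes)), len(treated))
    estimates = [abs(estimated_ate(outcomes, a)) for a in allocations]
    # Small tolerance guards against floating-point ties.
    extreme = sum(e >= observed - 1e-12 for e in estimates)
    return extreme / len(estimates)


outcomes = [12.0, 15.0, 11.0, 14.0, 3.0, 5.0, 4.0, 6.0]  # made-up data
treated = (0, 1, 2, 3)  # indices of the units that received treatment
p_value = randomization_p_value(outcomes, treated)
print(f"{p_value:.4f}")  # 2 of the 70 possible allocations are this extreme
```

The number of allocations grows very quickly with the sample size, so for trials of realistic size the full enumeration is usually replaced by a large random sample of allocations.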

Randomization inference can be used for null hypotheses that specify that all of the treatment effects are zero, as in the above example, but it cannot be used to test the hypothesis that the average treatment effect is zero, which will often be of interest. In agricultural trials, and in medicine, the stronger (sharp) hypothesis that the treatment has no effect whatever is often of interest. In many economic applications that involve money, such as welfare experiments or cost-benefit analyses, we are interested in whether the net effect of the treatment is positive or negative, and in these cases, randomization inference cannot be used. None of which argues against its wider use in social sciences when appropriate.

In cases where randomization inference cannot be used, we must construct tests for the differences in two means. Standard procedures will often work well, but there are two potential pitfalls. One, the ‘Fisher-Behrens problem’, comes from the fact that, when the two samples have different variances (which we typically want to permit), the usual t-statistic does not have the t-distribution. The second problem, which is much harder to address, occurs when the distribution of treatment effects is not symmetric (Bahadur and Savage (1956)). Neither pitfall is specific to RCTs, but RCTs force us to work with means in estimating treatment effects and, with only a very few exceptions in the literature, social scientists who use RCTs appear to be unaware of the difficulties.

In the simple case of comparing two means in an RCT without covariates, inference is usually based on the two-sample t-statistic, which is computed by dividing the ATE by the estimated standard error whose square is given by

$$\hat{\sigma}^2 = \frac{(n_1 - 1)^{-1} \sum_{i \in 1} (Y_i - \bar{Y}_1)^2}{n_1} + \frac{(n_0 - 1)^{-1} \sum_{i \in 0} (Y_i - \bar{Y}_0)^2}{n_0} \qquad (3)$$

where 0 refers to controls and 1 to treatments, so that there are $n_1$ treatments and $n_0$ controls, and $\bar{Y}_1$ and $\bar{Y}_0$ are the two means. As has long been known, the “t-statistic” based on (3) is not distributed as Student’s t if the two variances (treatment and control) are not identical but has the Behrens-Fisher distribution. In extreme cases, when one of the variances is zero, the t-statistic has effective degrees of freedom half of that of the nominal degrees of freedom, so that the test statistic has thicker tails than allowed for, and there will be too many rejections when the null is true.
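As a rough numerical check on the extreme case just described, the sketch below computes the standard error in (3) and then applies the Welch-Satterthwaite approximation to the effective degrees of freedom. That approximation is a standard textbook device brought in here for illustration; it is not taken from the text above, and the data and function name are made up.

```python
import math
from statistics import fmean, variance


def two_sample_t(y1, y0):
    """t-statistic using the standard error in (3), plus Welch-Satterthwaite df."""
    v1 = variance(y1) / len(y1)  # (n_1 - 1)^{-1} * sum (Y_i - Ybar_1)^2 / n_1
    v0 = variance(y0) / len(y0)  # same term for the controls
    se = math.sqrt(v1 + v0)
    t = (fmean(y1) - fmean(y0)) / se
    # Welch-Satterthwaite approximation (an assumption of this sketch).
    df = (v1 + v0) ** 2 / (v1 ** 2 / (len(y1) - 1) + v0 ** 2 / (len(y0) - 1))
    return t, se, df


treated = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up outcomes
controls = [2.0, 2.0, 2.0, 2.0, 2.0]  # zero variance: the extreme case
t, se, df = two_sample_t(treated, controls)
nominal_df = len(treated) + len(controls) - 2
print(f"t = {t:.3f}, effective df = {df:.1f}, nominal df = {nominal_df}")
```

With equal group sizes and one variance equal to zero, the approximate degrees of freedom come out at 4 against a nominal 8, which matches the halving described above.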

Young (2016) argues that this problem is worse when the trial results are analyzed by regressing outcomes not only on the treatment dummy but also on additional controls and when using clustered or robust standard errors. When the design matrix is such that the maximal influence is large, so that for some observations outcomes have large influence on their own predicted values, there is a reduction in the effective degrees of freedom for the t-value(s) of the average treatment effect(s), leading to spurious findings of significance. Young looks at 2,003 regressions reported in 53 RCT papers in the American Economic Association journals and recalculates the significance of the estimates using randomization inference applied to the authors’ original data. In 30 to 40 percent of the estimated treatment effects in individual equations with coefficients that are reported as significant, he cannot reject the null of no effect for any observation; the fraction of spuriously significant results increases further when he simultaneously tests for all results in each paper. These spurious findings come in part from issues of multiple-hypothesis testing, both within regressions with several treatments and across regressions. Within regressions, treatments are largely orthogonal, but authors tend to emphasize significant t-values even when the corresponding F-tests are insignificant. Across equations, results are often strongly correlated, so that, at worst, different regressions are reporting variants of the same result, thus spuriously adding to the “kill count” of significant effects. At the same time, the pervasiveness of observations with high influence generates spurious significance on its own.

These issues are now being taken more seriously. In addition to Young (2016), Imbens and Kolesár (2016) provide practical advice for dealing with the Fisher-Behrens problem, and the best current practice tries to be careful about multiple hypothesis testing. Yet it remains the case that many of the results in the literature are spuriously significant.

Spurious significance also arises when the distribution of treatment effects contains outliers or, more generally, is not symmetric. Standard t-tests break down in distributions with enough skewness (see Lehmann and Romano (2005, p. 466–8)). How difficult is it to maintain symmetry? And how badly is inference affected when the distribution of treatment effects is not symmetric? In economics, many trials have outcomes valued in money. Does an anti-poverty innovation (for example, microfinance) increase the incomes of the participants? Income itself is not symmetrically distributed, and this might be true of the treatment effects too if there are a few people who are talented but credit-constrained entrepreneurs and who have treatment effects that are large and positive, while the vast majority of borrowers fritter away their loans, or at best make positive but modest profits. A recent summary of the literature is consistent with this (see Banerjee, Karlan, and Zinman (2015)). Another important example is expenditures on health care. Most people have zero expenditure in any given period, but among those who do incur expenditures, a few individuals spend huge amounts that account for a large share of the total. Indeed, in the famous Rand health experiment (see Manning, Newhouse et al. (1987, 1988)), there is a single very large outlier. The authors realize that the comparison of means across treatment arms is fragile, and, although they do not see their problem exactly as described here, they obtain their preferred estimates using a structural approach that is explicitly designed to model the skewness of expenditures.

In some cases, it will be appropriate to deal with outliers by trimming, transforming, or eliminating observations that have large effects on the estimates. But if the experiment is a project evaluation designed to estimate the net benefits of a policy, the elimination of genuine outliers, as in the Rand Health Experiment, will vitiate the analysis. It is precisely the outliers that make or break the program. Transformations, such as taking logarithms, may help to produce symmetry, but they change the nature of the question being asked; a cost-benefit analysis must be done in dollars, not log dollars.

We consider an example that illustrates what can happen in a realistic but simplified case; the full results are reported in the Appendix. We imagine a population of individuals, each with a treatment effect $\beta_i$. The parent population mean of the treatment effects is zero, but there is a long tail of positive values; we use a left-shifted lognormal distribution. We have a microfinance trial in mind, where there is a long positive tail of rare individuals who can do amazing things with credit while most people cannot use it effectively. A trial sample of 2n individuals is randomly drawn from the parent population and is randomly split between n treatments and n controls. Within each trial sample, whose true ATE will generally differ from zero because of the sampling, we run many RCTs and tabulate the values of the ATE for each.

Using standard t-tests, the (true in the parent distribution) hypothesis that the ATE is zero is rejected between 14 ($n = 25$) and 6 percent ($n = 500$) of the time. These rejections come from two separate issues, both of which are relevant in practice: (a) that the ATE in the trial sample differs from the ATE in the parent population of interest, and (b) that the t-values are not distributed as t in the presence of outliers. The problem cases are when the trial sample happens to contain one or more outliers, something that is always a risk given the long positive tail of the parent distribution. When this happens, everything depends on whether the outlier is among the treatments or the controls; in effect, the outliers become the sample, reducing the effective number of degrees of freedom. In extreme cases, one of which is illustrated in Figure A.1, the distribution of estimated ATEs is bimodal, depending on the group to which the outlier is assigned. When the outlier is in the treatment group, the dispersion across outcomes is large, as is the estimated standard error, and so those outcomes rarely reject the null using the standard table of t-values. The over-rejections come from cases when the outlier is in the control group, the outcomes are not so dispersed, and the t-values can be large, negative, and significant. While these cases of bimodal distributions may not be common and depend on the existence of large outliers, they illustrate the process that generates the over-rejections and spurious significance. Note that there is no remedy through randomization inference here, given that our interest is in the hypothesis that the average treatment effect is zero.
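A simulation in the spirit of this example can be sketched as follows. It is not the design reported in the Appendix: the lognormal dispersion `sigma`, the outcome model (treatment effect plus standard normal noise), the use of 1.96 as the critical value, and the drawing of a fresh trial sample for every replication are all assumptions made here for illustration, so the rejection rates it prints need not match the figures quoted above.

```python
import math
import random


def mean(xs):
    return math.fsum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return math.fsum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def rejection_rate(n, reps, sigma=1.5, critical=1.96, seed=0):
    """Share of simulated trials in which |t| exceeds the critical value."""
    rng = random.Random(seed)
    shift = math.exp(sigma ** 2 / 2)  # mean of a lognormal(0, sigma) variable
    rejections = 0
    for _ in range(reps):
        # Treatment effects: lognormal shifted left so the parent mean is zero.
        beta = [rng.lognormvariate(0.0, sigma) - shift for _ in range(2 * n)]
        # Assumed outcome model: Y_i = beta_i * T_i + standard normal noise.
        # The draws are i.i.d., so treating the first n units is a random split.
        y1 = [b + rng.gauss(0, 1) for b in beta[:n]]
        y0 = [rng.gauss(0, 1) for _ in beta[n:]]
        se = math.sqrt(sample_variance(y1) / n + sample_variance(y0) / n)
        t = (mean(y1) - mean(y0)) / se
        rejections += abs(t) > critical
    return rejections / reps


for n in (25, 500):
    print(n, rejection_rate(n, reps=500))
```

Splitting the rejections by the sign of the t-value, or plotting the estimated ATEs for a single trial sample that contains one very large treatment effect, is a natural next step for seeing the asymmetry described in the text.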

Our reading of the literature on RCTs in development economics suggests that they are not exempt from these concerns. Many development trials are run on (sometimes very) small samples, they have treatment effects where asymmetry is hard to rule out, especially when the outcomes are in money, and they often give results that are puzzling, or at least not easily interpreted in terms of economic theory. Neither Banerjee and Duflo (2012) nor Karlan and Appel (2011), who cite many RCTs, raise concerns about misleading inference, implicitly treating all results as reliable. No doubt there are behaviors in the world that are inconsistent with standard economics, and some can be explained by standard biases in behavioral economics, but it would also be good to be suspicious of the significance tests before accepting that an unexpected finding is well-supported and that theory must be revised. Replication of results in different settings may be helpful, if they are the right kind of places (see our discussion in Section 2). Yet it hardly solves the problem given that the asymmetry may be in the same direction in different settings, that it seems likely to be so in just those settings that are sufficiently like the original trial setting to be of use for inference about the population of interest, and that the “significant” t-values will show departures from the null in the same direction. This, then, replicates the spurious findings.

A summary

What do the arguments of this section mean about the importance of randomization and the interpretation that should be given to an estimated ATE from a randomized trial? First, we should be sure that an unbiased estimate of an ATE for the trial population is likely to be useful enough to warrant the costs of running the trial. Second, since randomization does not ensure orthogonality, care must be taken (e.g. by blinding) that there are no significant post-randomization correlates with the treatment. This is a well-known lesson, but many social and economic trials are not blinded and insufficient defense is offered that unbiasedness is not undermined. Indeed, lack of blinding is not the only source of post-randomization bias. Treatments and controls may be handled in different places, or by differently trained practitioners, or at different times of day, and these differences can bring with them systematic differences in the other causes to which the two groups are exposed. These can, and should, be guarded against. But doing so requires an understanding of what these causally relevant factors might be. Third, the inference problems reviewed here cannot just be presumed away. When there is substantial heterogeneity, the ATE in the trial sample can be quite different from the ATE in the population of interest, even if the trial is randomly selected from that population; in practice, the relationship between the trial sample and the population is often obscure.

Beyond that, in many cases, the statistical inference will be fine, but serious attention should be given to the possibility that there are outliers in treatment effects, something that knowledge of the problem can suggest and where inspection of the marginal distributions of treatments and controls may be informative. For example, if both are symmetric, it seems unlikely (though certainly not impossible) that the treatment effects are highly skewed. Measures to deal with Fisher-Behrens should be used and randomization inference considered when appropriate to the hypothesis of interest.


All of this can be regarded as recommendations for improvement to current practice, not a challenge to it. More fundamentally, we strongly contest the often-expressed idea that the ATE calculated from an RCT is automatically reliable, that randomization automatically controls for unobservables, or worst of all, that the calculated ATE is true. If, by chance, it is close to the truth, the truth we are referring to is the truth in the trial sample only. To make any inference beyond that requires an argument of the kind we consider in the next section. We have also argued that, depending on what we are trying to measure and what we want to use that measure for, there is no presumption that an RCT is the best means of estimating it. That too requires an argument, not a presumption.

Section 2: Using the results of randomized controlled trials

2.1 Introduction

Suppose we have estimated an ATE from a well-conducted RCT on a trial sample, and our standard error gives us reason to believe that the effect did not come about by chance. We thus have good warrant that the treatment causes the effect in our trial sample, up to the limits of statistical inference. What are such findings good for?

The literature in economics, as indeed in medicine and in social policy, has paid more attention to obtaining results than to considering what can be done with them. There is little theoretical or empirical work to guide us on how and for what purposes to use the findings of RCTs, such as the conditions under which the same results hold outside of the original settings, how they might be adapted for use elsewhere, or how they might be used for formulating, testing, understanding, or probing hypotheses beyond the immediate relation between the treatment and the outcome investigated in the study. Yet it cannot be that knowing how to use results is less important than knowing how to demonstrate them. Any chain of evidence is only as strong as its weakest link, so that a rigorously established effect whose applicability is justified by a loose declaration of simile warrants little. If trials are to be useful, we need paths to their use that are as carefully constructed as are the trials themselves.

The argument for the “primacy of internal validity” made by Shadish, Cook, and Campbell (2002) may be reasonable as a warning that bad RCTs are unlikely to generalize, but it is sometimes incorrectly taken to imply that results of an internally valid trial will automatically, or often, apply ‘as is’ elsewhere, or that this should be the default assumption failing arguments to the contrary, as if a parameter, once well established, can be expected to be invariant across settings. An invariance assumption is often made in medicine, for example, where it is sometimes plausible that a particular procedure or drug works the same way everywhere (though see Horton (2000) for a strong dissent and Rothwell (2005) for examples on both sides of the question). We should also note the recent movement to ensure that testing of drugs includes women and minorities because members of those groups suppose that the results of trials on mostly healthy young white males do not apply to them.

2.2 Using results, transportability, and external validity

Suppose a trial has established a result in a specific setting. If ‘the same’ result holds elsewhere, it is said to have ‘external validity’. External validity may refer just to the transportability of the causal connection, or go further and require replication of the magnitude of the ATE. Either way, the result holds (everywhere, or widely, or in some specific elsewhere) or it does not.

This binary concept of external validity is often unhelpful because it asks the results of an RCT to satisfy a condition that is neither necessary nor sufficient for a trial to be useful, and so both overstates and understates their value. It directs us toward simple extrapolation (whether the same result holds elsewhere) or simple generalization (it holds universally or at least widely) and away from more complex but more useful applications of the results. The failure of external validity interpreted as simple generalization or extrapolation says little about the value of the trial.

First, there are several uses of RCTs that do not require transportability beyond the original context; we discuss these in the next subsection. Second, there are often good reasons to expect that the results from a well-conducted, informative, and potentially useful RCT will not apply elsewhere in any simple way. Without further understanding and analysis, even successful replication tells us little either for or against simple generalization, nor does it do much to support the conclusion that the next implementation will work in the same way. Nor do failures of replication make the original result useless. We often learn much from coming to understand why replication failed and can use that knowledge in looking for how the factors that caused the original result might operate differently in different settings. Third, and particularly important for scientific progress, the RCT result can be incorporated into a network of evidence and hypotheses that test or explore claims that look very different from the results reported from the RCT. We shall give examples below of extremely useful RCTs that are not externally valid in the (usual) sense that their results do not hold elsewhere, whether in a specific target setting or in the more sweeping sense of holding everywhere.

Bertrand Russell’s chicken (Russell (1912)) provides an excellent example of the limitations to straightforward extrapolation from repeated successful replication. The bird infers, on repeated evidence, that when the farmer comes in the morning, he feeds her. The inference serves her well until Christmas morning, when he wrings her neck and serves her for dinner. Though this chicken did not base her inference on an RCT, had we constructed one for her, we would have obtained the same result that she did. Her problem was not her methodology, but rather that she did not understand the social and economic structure that gave rise to the causal relations that she observed.

So, establishing causality does nothing in and of itself to guarantee generalizability. Nor does the ability of an ideal RCT to eliminate bias from selection or from omitted variables mean that the resulting ATE from the trial sample will apply anywhere else. The issue is worth mentioning only because of the enormous weight that is currently attached in economics to the discovery and labeling of causal relations, a weight that is hard to justify for effects that may have only local applicability, what might be labeled ‘anecdotal causality’. The operation of a cause generally requires the presence of “support factors”, without which a cause that produces the targeted effect in one place, even though it may be present and have the capacity to operate elsewhere, will remain latent and inoperative. What Mackie (1974) called INUS causality (Insufficient but Non-redundant parts of a condition that is itself Unnecessary but Sufficient for a contribution to the outcome) is often the kind of causality we see. A standard example is a house burning down because the television was left on, although televisions do not operate in this way without support factors, such as wiring faults, the presence of tinder, and so on. This is standard fare in epidemiology, which uses the term ‘causal pie’ to refer to a set of causes that are jointly but not separately sufficient for an effect.

We can rewrite (1) in the form

$$Y_i = \beta_i T_i + \sum_{j=1}^{J} \gamma_j x_{ij} = \theta(w_i) T_i + \sum_{j=1}^{J} \gamma_j x_{ij} \qquad (4)$$

where the function $\theta(\cdot)$ controls how a k-vector $w_i$ of k ‘support factors’ affects individual i’s treatment effect $\beta_i$. The support factors may include some of the x’s. Since the ATE is the average of the $\beta_i$s, two populations will have the same ATE if and only if they have the same average for the net effect of the support factors necessary for the treatment to work, i.e. for the quantity in front of $T_i$. These are, however, just the kind of factors that are likely to be differently distributed in different populations, and indeed we do generally find different ATEs in different economic (and other social policy) RCTs in different places, even in the cases where (unusually) they all point in the same direction.
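A toy calculation may help fix the role of the support factors in (4). The sketch below assumes a single binary support factor and a made-up function `theta` that delivers an effect of 2.0 only where the factor is present; the two populations share the same `theta` and differ only in how common the factor is. All names and numbers are hypothetical.

```python
from statistics import fmean


def theta(w):
    """Hypothetical treatment effect: 2.0 where the support factor is present, else 0."""
    return 2.0 * w


# The share of units with the support factor (w = 1) differs across populations.
population_a = [1] * 80 + [0] * 20
population_b = [1] * 30 + [0] * 70

# The ATE in each population is the average of theta(w_i) over its members.
ate_a = fmean(theta(w) for w in population_a)
ate_b = fmean(theta(w) for w in population_b)
print(ate_a, ate_b)  # 1.6 and 0.6
```

The causal structure is identical in the two populations, yet an ATE estimated in one would be a poor guide to the other unless the distribution of the support factor were known.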

Causal processes often require highly specialized economic, cultural, or social structures to enable them to work. Consider the Rube Goldberg machine that is rigged up so that flying a kite sharpens a pencil (Cartwright and Hardie (2012, 77)). The underlying structure affords a very specific form of (4) that will not describe causal processes elsewhere. Neither the same ATE nor the same qualitative causal relations can be expected to hold where the specific form for (4) is different. Indeed, we continually attempt to design systems that will generate causal relations that we like and that will rule out causal relations that we do not like. Health care systems are designed to prevent nurses and doctors making errors; cars are designed so that drivers cannot start them in reverse; work schedules for pilots are designed so they do not fly too many consecutive hours without rest because alertness and performance are compromised.


As in the Rube Goldberg machine and in the design of cars and work schedules, the economic structure and equilibrium may differ in ways that support different kinds of causal relations and thus render a trial in one setting useless in another. For example, a trial that relies on providing incentives for personal promotion is of no use in a state in which a political system locks people into their social and economic positions. Cash transfers that are conditional on parents taking their children to clinics cannot improve child health in the absence of functioning clinics. Policies targeted at men may not work for women. We use a lever to toast our bread, but levers only operate to toast bread in a toaster; we cannot brown toast by pressing an accelerator, even if the principle of the lever is the same in both a toaster and a car. If we misunderstand the setting, if we do not understand why the treatment in our RCT works, we run the same risks as Russell’s chicken.

2.3 When RCTs speak for themselves: no transportability required

For some things we want to learn, an RCT is enough by itself. An RCT may provide a counterexample to a general theoretical proposition, either to the proposition itself (a simple refutation test) or to some consequence of it (a complex refutation test). An RCT may also confirm a prediction of a theory, and although this does not confirm the theory, it is evidence in its favor, especially if the prediction seems inherently unlikely in advance. This is all familiar territory, and there is nothing unique about an RCT; it is simply one among many possible testing procedures. Even when there is no theory, or very weak theory, an RCT, by demonstrating causality in some population, can be thought of as proof of concept, that the treatment is capable of working somewhere. This is one of the arguments for the importance of internal validity.

Nor is transportation called for when an RCT is used for evaluation, for example to satisfy donors that the project they funded achieved its aims in the population in which it was conducted. Even so, for such evaluations, say by the World Bank, to be global public goods requires arguments and guidelines that justify using the results in some way elsewhere; the global public good is not an automatic by-product of the Bank fulfilling its fiduciary responsibility. When the components of treatments change across studies, evaluations need not lead to cumulative knowledge. Or as Heckman et al (1999, 1934) note, “the data produced from them [social experiments] are far from ideal for estimating the structural parameters of behavioral models. This makes it difficult to generalize findings across experiments or to use experiments to identify the policy-invariant structural parameters that are required for econometric policy evaluation.”

Of course, when we ask exactly what those invariant structural parameters are, whether they exist, and how they should be modeled, we open up major fault lines in modern applied economics. For example, we do not intend to endorse intertemporal dynamic models of behavior as the only way of recovering the parameters that we need. We also recognize that the usefulness of simple price theory is not as universally accepted as it once was. But the point remains that we need something, some regularity or some invariance, and that something can rarely be recovered by simply generalizing across trials.

A third non-problematic and important use of an RCT is when the parameter of interest is the ATE in a well-defined population from which the trial sample is itself a random sample. In this case the sample average treatment effect (SATE) is an unbiased estimator of the population average treatment effect (PATE) that, by assumption, is our target (see Imbens (2004) for these terms). We refer to this as the ‘public health’ case; like many public health interventions, the target is the average, ‘population health,’ not the health of individuals. One major (and widely recognized) danger of this use of RCTs is that scaling up from (even a random) sample to the population will not go through in any simple way if the outcomes of individuals or groups of individuals change the behavior of others, which will be common in economic examples but perhaps less so in health. There is also an issue of timing if time elapses between the trial and the implementation.

In economics, a ‘public-health-style’ example is the imposition of a commodity tax, where the total tax revenue is of interest and policymakers do not care who pays the tax. Indeed, theory can often identify a specific, well-defined quantity whose measurement is key for a policy (see Deaton and Ng (1998) for an example of what Chetty (2009) calls a “sufficient” statistic). In this case, the behavior of a random sample of individuals might well provide a good guide to the tax revenue that can be expected. Another case comes from work on poverty programs where the sponsors are most concerned about the budget; we discuss these cases at the end of this Section. Even here, it is easy to imagine behavioral effects coming into play that drive a wedge between the trial and its full-scale implementation, for example if compliance is higher when the scheme is widely publicized, or if government agencies implement the scheme differently from trialists.

2.4 Transporting results laterally and globally

The program of RCTs in economics, as in other areas of social science, has the broader goal of finding out `what works.’ At its most ambitious, this aims for universal reach, and the development economics literature frequently argues that “credible impact evaluations are global public goods in the sense that they can offer reliable guidance to international organizations, governments, donors, and nongovernmental organizations (NGOs) beyond national borders”, Duflo and Kremer (2008, 93). Sometimes the results of a single RCT are advocated as having wide applicability, with especially strong endorsement when there is at least one replication. For example, Kremer and Holla (2009, 3) use a Kenyan trial as the basis for a blanket statement without specifying context, “Provision of free school uniforms, for example, leads to 10%-15% reductions in teen pregnancy and dropout rates.” Duflo and Kremer (2008, 104), writing about another trial, are more cautious, citing two evaluations and restricting themselves to India: “One can be relatively confident about recommending the scaling-up of this program, at least in India, on the basis of these estimates, since the program was continued for a period of time, was evaluated in two different contexts, and has shown its ability to be rolled out on a large scale.” Even a number of replications do not provide a sound basis for inference. Without theory to support the projection of results, this is just induction by simple enumeration: swan 1 is white, swan 2 is white, ..., so all swans are white.

The problem of generalization extends beyond RCTs, to both `fully controlled’ laboratory experiments and to most non-experimental findings. Our argument here is that evidence from RCTs is not automatically generalizable, and that its superior internal validity, if and when it exists, does not provide it with any unique invariance across context. That transportation is far from automatic also tells us why (even ideal) RCTs of similar interventions give different answers in different settings. Such differences do not necessarily reflect methodological failings and will hold across perfectly executed RCTs just as they do across observational studies.

Many advocates of RCTs understand that `what works’ needs to be qualified to `what works under which circumstances’ and try to say something about what those circumstances might be, for example, by replicating RCTs in different places and thinking intelligently about the differences in outcomes when they find them. Sometimes this is done in a systematic way, for example by having multiple treatments within the same trial so that it is possible to estimate a `response surface’ that links outcomes to various combinations of treatments (see Greenberg and Schroder (2004) or Shadish et al (2002)). For example, the RAND health experiment had multiple treatments, allowing investigation of how much health insurance increased expenditures under different circumstances. Some of the negative income tax experiments (NITs) in the 1960s and 1970s were designed to estimate response surfaces, with the number of treatments and controls in each arm optimized to maximize precision of estimated response functions subject to an overall cost limit (see Conlisk (1973)). Experiments on time-of-day pricing for electricity had a similar structure (see Aigner (1985)).

The experiments by MDRC (originally known as the Manpower Development Research Corporation) have also been analyzed across cities in an effort to link city features to the results of the RCTs within them (see Bloom, Hill, and Riccio (2005)). Unlike the RAND and NIT examples, these are ex post analyses of completed trials; the same is true of Vivalt (2015), who finds, for the collection of trials she studied, that development-related RCTs run by government agencies typically find smaller (standardized) effect sizes than RCTs run by academics or by NGOs. Bold et al (2013), who ran parallel RCTs on an intervention implemented either by an NGO or by the government of Kenya, found similar results there. Note that these analyses have a different purpose from meta-analyses that assume that different trials estimate the same parameter up to noise and average in order to increase precision.
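To make the contrast concrete, the following minimal Python sketch shows the kind of fixed-effect pooling that such meta-analyses perform. It is an illustration only: the function name `pooled_estimate` and the numbers are hypothetical and are not taken from any of the studies cited. The calculation is warranted only under the assumption that every trial estimates the same parameter, which is exactly what the cross-site analyses above call into question.

```python
def pooled_estimate(estimates, std_errors):
    """Inverse-variance (fixed-effect) pooling of trial estimates.

    Assumes, as a labeled simplification, that all trials estimate one
    common parameter and differ only by independent sampling noise.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = total_weight ** -0.5
    return pooled, pooled_se


# Hypothetical standardized effect sizes from two trials of equal precision.
pooled, pooled_se = pooled_estimate([0.2, 0.4], [0.1, 0.1])
```

The pooled standard error is smaller than either trial's own standard error, which is the gain in precision; if the two sites in fact have different treatment effects, the pooled number describes neither of them.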

Although there are issues with all of these methods of investigating differences across trials, without some discipline it is too easy to come up with `just-so’ or fairy stories that account for differences. We risk a procedure that, if a result is replicated in full or in part in at least two places, puts that treatment into the `it works’ box and, if the result does not replicate, casually interprets the difference in a way that allows at least some of the findings to survive.

How can we do better than simple generalization and simple extrapolation? Many writers emphasize the role of theory in transporting and using the results of trials, and we shall discuss this in the next subsection. But statistical approaches are also widely used; these are designed to deal with the possibility that treatment effects vary systematically with other variables. Referring back to (4), it is clear that, supposing the same form of (4) obtains, if the distribution of the w values is the same in the new circumstances as in the old, the ATE in the original trial will hold in the new circumstances. In general, of course, this condition will not hold, nor do we have any obvious way of checking it unless we know what the support factors are in both places. One procedure to deal with interactions is post-experimental stratification, which parallels post-survey stratification in sample surveys. The trial is broken up into subgroups that have the same combination of known, observable w’s (age, race, gender for example), then the ATEs within each of the subgroups are calculated, and then they are reassembled according to the configuration of w’s in the new context. This can be used to estimate the ATE in a new context, or to correct estimates to the parent population when the trial sample is not a random sample of the parent. Other methods can be used when there are too many w’s for stratification, for example by estimating the probability that each observation in the population is included in the trial sample as a function of the w’s, then weighting each observation by the inverse of these propensity scores. A good reference for these methods is Stuart et al (2011), or in economics, Angrist (2004) and Hotz, Imbens, and Mortimer (2005).
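The post-experimental stratification procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any paper cited: the function names, the two strata, and all the numbers are hypothetical, and the sketch assumes that every stratum in the target context also appears in the trial with both treated and control observations.

```python
from collections import defaultdict


def stratum_ates(records):
    """Treated-minus-control difference in mean outcomes within each stratum.

    records: iterable of (stratum, treated, outcome), where `stratum` labels
    one combination of observed w's. Every stratum is assumed to contain at
    least one treated and one control observation.
    """
    cells = defaultdict(lambda: [0.0, 0, 0.0, 0])  # sum_t, n_t, sum_c, n_c
    for stratum, treated, outcome in records:
        cell = cells[stratum]
        if treated:
            cell[0] += outcome
            cell[1] += 1
        else:
            cell[2] += outcome
            cell[3] += 1
    return {s: c[0] / c[1] - c[2] / c[3] for s, c in cells.items()}


def reweighted_ate(ates, shares):
    """Reassemble stratum ATEs using the stratum shares of some context."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("stratum shares must sum to one")
    return sum(share * ates[s] for s, share in shares.items())


# Hypothetical trial data: (stratum, treated?, outcome).
trial = [
    ("young", 1, 12.0), ("young", 1, 10.0), ("young", 0, 7.0), ("young", 0, 5.0),
    ("old", 1, 6.0), ("old", 1, 8.0), ("old", 0, 6.0), ("old", 0, 6.0),
]
ates = stratum_ates(trial)
ate_in_trial = reweighted_ate(ates, {"young": 0.5, "old": 0.5})
ate_in_target = reweighted_ate(ates, {"young": 0.2, "old": 0.8})
```

In this made-up example the treatment effect is larger for the young, so a target context with fewer young people gets a smaller reweighted ATE than the trial. The arithmetic is trivial; the substantive burden lies in knowing that the chosen strata capture the genuine interactive causes.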

These methods are often not applicable, however. First, reweighting works only when the observable factors used for reweighting include all (and only) genuine interactive causes. Second, as with any form of reweighting, the variables used to construct the weights must be present in both the original and new context. For example, if we are to carry a result forward in time, we may not be able to extrapolate from a period of low inflation to a period of high inflation. As Hotz et al (2005) note, it will typically be necessary to rule out such `macro’ effects, whether over time, or over locations. Third, it also depends on assuming that the same governing equation (4) covers the trial and the target population.

Pearl and Bareinboim (2011, 2014) and Bareinboim and Pearl (2013, 2014) provide strategies for inferring information about new populations from trial results that are more general than reweighting. They suppose we have available both causal information and probabilistic information for population A (e.g. the experimental one), while for population B (the target) we have only (some) probabilistic information, and also that we know that certain probabilistic and causal facts are shared between the two and certain ones are not. They offer theorems describing what causal conclusions about population B are thereby fixed. Their work underlines the fact that exactly what conclusions about one population can be supported by information about another depends on exactly what causal and probabilistic facts they have in common. But as Muller (2015) notes, this, like the problem with simple reweighting, takes us back to the situation that RCTs are designed to avoid, where we need to start from a complete and correct specification of the causal structure. RCTs can avoid this in estimation (which is one of their strengths, supporting their credibility), but the benefit vanishes as soon as we try to carry their results to a new context.

This discussion leads to a number of points. First, we cannot get to general claims by simple generalization; there is no warrant for the convenient assumption that the ATE estimated in a specific RCT is an invariant parameter, nor that the kinds of interventions and outcomes we measure in typical RCTs participate in general causal relations. While it is true that general causal claims exist (that gravitational masses attract each other, or that people respond to incentives), these use relatively abstract concepts and operate at a much higher level than the claims that can be reasonably inferred from a typical RCT.

Second, thoughtful pre-experimental stratification in RCTs is likely to be valuable, or failing that, subgroup analysis, because it can provide information that may be useful for generalization or transportation. For example, Kremer and Holla (2009) note that, in their trials, school attendance is surprisingly sensitive to small subsidies, which they suggest is because there are a large number of students and parents who are on the (financial) margin between attending and not attending school; if this is indeed the mechanism for their results, a good variable for stratification would be distance from the relevant cutoff. We also need to know that this same mechanism works in any new target setting.

Third, we need to be explicit about causal structure, even if that means more model building and more (or different) assumptions than advocates of RCTs are often comfortable with. To be clear, modeling causal structure does not commit us to the elaborate and often incredible assumptions that characterize some structural modeling in economics, but there is no escape from thinking about the way things work; the why as well as the what.

Fourth, we will typically need to know more than the results of the RCT itself, for example about differences in social, economic, and cultural structures and about the joint distributions of causal variables, knowledge that will often only be available through observational studies. We will also need external information, both theoretical and empirical, to settle on an informative characterization of the population enrolled in the RCT because how that population is described is commonly taken to be some indication of which other populations the results are likely to be exportable to. Many medical and psychological journals are explicit about this. For instance, the rules for submission recommended by the International Committee of Medical Journal Editors, ICMJE (2015, 14) insist that article abstracts “Clearly describe the selection of observational or experimental participants (healthy individuals or patients, including controls), including eligibility and exclusion criteria and a description of the source population.” An RCT is conducted on a specific trial sample, somehow drawn from a population of specific individuals. The results obtained are features of that sample, of those very individuals at that very time, not any other population with any different individuals that might, for example, satisfy one of the infinite set of descriptions that the trial sample satisfies.

This same issue is confronted already in study design. Apart from special cases, like post hoc evaluation for payment-for-results, we are not especially concerned to learn about the very individuals enrolled in the trial. Most experiments are, and should be, conducted with an eye to what the results can help us learn about other populations. This cannot be done without substantial assumptions about what might and what might not be relevant to the production of the outcome studied. (For example, the ICMJE guidelines (2015, 14) go on to say: “Because the relevance of such variables as age, sex, or ethnicity is not always known at the time of study design, researchers should aim for inclusion of representative populations into all study types and at a minimum provide descriptive data for these and other relevant demographic variables.”) So both intelligent study design and responsible reporting of study results involve substantial background assumptions.

Of course, this is true for all studies. But RCTs require special conditions if they are to be conducted at all and especially if they are to be conducted successfully (for example, local agreements, compliant subjects, affordable administrators, multiple blinding, people competent to measure and record outcomes reliably, a setting where random allocation is morally and politically acceptable, etc.), whereas observational data are often more readily and widely available. In the case of RCTs, there is danger that these kinds of considerations have too much effect. This is especially worrisome where the features that the trial sample should have are not justified, made explicit, or subjected to serious critical review.

The need for observational knowledge is one of many reasons why it is counter-productive to insist that RCTs are the gold standard or that some categories of evidence should be prioritized over others; these strategies leave us helpless in using RCTs beyond their original context. The results of RCTs must be integrated with other knowledge, including the practical wisdom of policy makers, if they are to be useable outside the context in which they were constructed.

Contrary to much practice in medicine as well as in economics, conflicts between RCTs and observational results need to be explained, for example by reference to the different populations in each, a process that will sometimes yield important evidence, including on the range of applicability of the RCT results themselves. While the validity of the RCT will sometimes provide an understanding of why the observational study found a different answer, there is no basis (or excuse) for the common practice of dismissing the observational study simply because it was not an RCT and therefore must be invalid. It is a basic tenet of scientific advance that, as collective knowledge advances, new findings must be able to explain and be integrated with previous results, even results that are now thought to be invalid; methodological prejudice is not an explanation.

2.5 Using theory for generalization

Economists have been combining theory and randomized controlled trials since the early experiments. Orcutt and Orcutt (1968) laid out the inspiration for the income tax trials using a simple static theory of labor supply. According to this, people choose how to divide their time between work and leisure in an environment in which they receive a minimum G if they do not work, and where they receive an additional amount (1-t)w for each hour they work, where w is the wage rate, and t is a tax rate. The trials assigned different combinations of G and t to different trial groups, so that the results traced out the labor supply function, allowing estimation of the parameters of preferences, which could then be used in a wide range of policy calculations, for example to raise revenue at minimum utility loss to workers.
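In symbols, and as a sketch only, the budget constraint in this static model can be written as follows, where h for hours worked and y for total income are labels introduced here for illustration and are not notation from the original:

```latex
y = G + (1 - t)\, w\, h
```

Each trial arm fixes one pair (G, t), so comparing hours worked across arms shows how h responds to the guarantee G and to the net wage (1-t)w.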

Following these early trials, there has been a continuing tradition of using trial results, together with the baseline data collected for the trial, to fit structural models that are to be used more generally. Early examples include Moffitt (1979) on labor supply and Wise (1985) on housing; a more recent example is Heckman, Pinto and Savelyev (2013) for the Perry pre-school program. Development economics examples include Attanasio, Meghir, and Santiago (2012), Attanasio et al (2015), Todd and Wolpin (2006), Wolpin (2013), and Duflo, Hanna, and Ryan (2012). These structural models sometimes require formidable auxiliary assumptions on functional forms or the distributions of unobservables, but they have compensating advantages, including the ability to integrate theory and evidence, to make out-of-sample predictions, and to analyze welfare, and the use of RCT evidence allows the relaxation of at least some of the assumptions that are needed for identification. In this way, the structural models borrow credibility from the RCTs and in return help set the RCT results within a coherent framework. Without some such interpretation, the welfare implications of RCT results can be problematic; knowing how people in general (let alone just people in the trial population) respond to some policy is rarely enough to tell whether or not they are made better off, Harrison (2014a, b). Traditional welfare economics draws a link from preferences to behavior, a link that is respected in structural work but often lost in the `what works’ literature, and without which we have no basis for inferring welfare from behavior. What works is not equivalent to what should be.

Light touch theory can do much to interpret, to extend, and to use RCT results. In both the RAND Health Experiment and negative income tax experiments, an immediate issue concerned the difference between short-run and long-run responses; indeed, differences between immediate and ultimate effects occur in a wide range of RCTs. Both health and tax RCTs aimed to discover what would happen if consumers/workers were permanently faced with higher or lower prices/wages, but the trials could only run for a limited period. A temporarily high tax rate on earnings is effectively a `fire sale’ on leisure, so that the experiment provided an opportunity to take a vacation and make up the earnings later, an incentive that would be absent in a permanent scheme. How do we get from the short-run responses that come from the trial to the long-run responses that we want to know? Metcalf (1973) and Ashenfelter (1978) provided answers for the income tax experiments, as did Arrow (1975) for the RAND Health Experiment.

Arrow’s analysis illustrates how to use both structure and observational data to transport and adapt results from one setting to another. He models the health experiment as a two-period model in which the price of medical care is lowered in the first period only, and shows how to derive what we want, which is the response in the first period if prices were lowered by the same proportion in both periods. The magnitude that we want is S, the compensated price derivative of medical care in period 1 in the face of identical increases in $p_1$ and $p_2$ in both periods 1 and 2. This is equal to $s_{11} + s_{12}$, the sum of the derivatives of period 1’s demand with respect to the two prices. The trial gives only $s_{11}$. But if we have post-trial data on medical services for both treatments and controls, we can infer $s_{21}$, the effect of the experimental price manipulation on post-experimental care. Choice theory, in the form of Slutsky symmetry, says that $s_{12} = s_{21}$ and so allows Arrow to infer $s_{12}$ and thus S. He contrasts this with Metcalf’s alternative solution, which makes different assumptions (that two-period preferences are intertemporally additive), in which case the long-run elasticity can be obtained from knowledge of the income elasticity of post-experimental medical care, which would have to come from an observational analysis. These two alternative approaches show how we can choose, based on our willingness to make assumptions and on the data that we have, a suitable combination of (elementary and transparent) theoretical assumptions and observational data in order to adapt and use trial results. Such analysis can also help design the original trial by clarifying what we need to know in order to use the results of a temporary treatment to estimate the permanent effects that we need. Ashenfelter provides a third solution, noting that the two-period model is formally identical to a two-person model, so that we can use information on two-person labor supply to tell us about the dynamics.
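Arrow's argument, as summarized here, can be laid out as a short chain of equalities. This is only a restatement of the steps in the text, with $s_{ij}$ denoting the compensated derivative of period i's demand for medical care with respect to the period j price:

```latex
\begin{aligned}
S      &= s_{11} + s_{12} && \text{(target: the same price change in both periods)} \\
s_{12} &= s_{21}          && \text{(Slutsky symmetry, from choice theory)} \\
S      &= s_{11} + s_{21} && \text{(both terms now estimable)}
\end{aligned}
```

The first term on the last line comes from the trial itself, and the second from comparing the post-trial use of medical services by treatments and controls. The theoretical restriction in the middle line is what connects the two.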

Theory can often allow us to reclassify new or unknown situations as analogous to situations where we already have background knowledge. One frequently useful way of doing this is when the new policy can be recast as equivalent to a change in the budget constraint that respondents face. The consequences of a new policy may be easier to predict if we can reduce it to equivalent changes in income and prices, whose effects are often well understood and well-studied. Todd and Wolpin (2008) and Wolpin (2013) make this point and provide examples. In the labor supply case, an increase in the tax rate has the same effect as a decrease in the wage rate, so that we can rely on previous literature to predict what will happen when tax rates are changed. In the case of Mexico’s PROGRESA conditional cash transfer program, Todd and Wolpin note that the subsidies paid to parents if their children go to school can be thought of as a combination of reduction in children’s wages and an increase in parents’ income, which allows them to predict the results of the conditional cash experiment with limited additional assumptions. If this works, as it partially does in their analysis, the trial helps consolidate previous knowledge and contributes to an evolving body of theory and empirical, including trial, evidence.

The program of thinking about policy changes as equivalent to price and income changes has a long history in economics; much of rational choice theory can be so interpreted (see Deaton and Muellbauer (1980) for many examples). When this conversion is credible, and when a trial on some apparently unrelated topic can be modeled as equivalent to a change in prices and incomes, and when we can assume that people in different settings respond in relevantly similar ways to changes in prices and incomes, we have a ready-made framework for incorporating the trial results into previous knowledge, as well as for extending the trial results and using them elsewhere. Of course, all depends on the validity and credibility of the theory; people may not in fact treat a tax increase as a decrease in the price of leisure, and behavioral economics is full of examples where apparently equivalent stimuli generate non-equivalent outcomes. The embrace of behavioral economics by many of the current generation of trialists may account for their limited willingness to use conventional choice theory in this way. Unfortunately, behavioral economics does not yet offer a replacement for the general framework of choice theory that is so useful in this regard.

Theory can also help with the problem we raised of delineating the population to which the trial results immediately apply and for thinking about moving from this population to populations of interest. Ashenfelter’s (1978) analysis is again a good illustration and predates much similar work in later literature. The income tax experiments offered participation in the trial to a random sample of the population of interest. Because there was no blinding and no compulsion, people who were randomized into the treatment group were free to choose to refuse treatment. As in many subsequent analyses, Ashenfelter supposes that people choose to participate if it is in their interest to do so, depending on what has become known in the RCT and Instrumental Variables literature as their own idiosyncratic `gain.’ The simple labor supply model gives an approximate condition: If the treatment increases the tax rate from $t_0$ to $t_1$ with an offsetting increase in G, then an individual assigned to the experimental group will decline to participate if

$$(t_1 - t_0)\, w_0 h_0 + \tfrac{1}{2}\, s_{00}\, (t_1 - t_0) > G_1 - G_0 \qquad (5)$$

where subscript 1 refers to the treatment situation, 0 to the control, $h_0$ is hours worked, and $s_{00}$ is the (negative) utility-constant response of hours worked to the tax rate. If there is no substitution, the second term on the left-hand side is zero, and people will accept treatment if the increase in G more than makes up for the increases in taxes payable, the `breakeven’ condition. In consequence, those with higher earnings are less likely to accept treatment. Some better-off people with high substitution effects will also accept treatment if the opportunity to buy more cheap leisure is sufficient enticement.
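The no-substitution `breakeven’ case is simple enough to sketch in Python. The sketch below covers only that special case, in which the substitution term is zero; the function names and the numbers are hypothetical and are not drawn from Ashenfelter's analysis or from the experiments themselves.

```python
def breakeven_earnings(t0, t1, g0, g1):
    """Pre-trial earnings (w0 * h0) at which the extra tax payable exactly
    offsets the extra guarantee, assuming no substitution response."""
    if t1 <= t0:
        raise ValueError("the treatment is assumed to raise the tax rate")
    return (g1 - g0) / (t1 - t0)


def accepts_treatment(earnings, t0, t1, g0, g1):
    """True if the increase in G more than makes up for the increase in
    taxes payable on unchanged earnings (the no-substitution case only)."""
    return g1 - g0 > (t1 - t0) * earnings


# Hypothetical arm: the tax rate rises from 20% to 50% and the guarantee
# rises from 0 to 3000, so breakeven earnings are about 10000.
threshold = breakeven_earnings(0.2, 0.5, 0.0, 3000.0)
low_earner_accepts = accepts_treatment(8000.0, 0.2, 0.5, 0.0, 3000.0)
high_earner_accepts = accepts_treatment(12000.0, 0.2, 0.5, 0.0, 3000.0)
```

People below the threshold accept and people above it decline, which is the selection on earnings described in the text. Allowing for substitution would move some higher earners into the accepting group, but that depends on responses that are not directly observed.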

The selective acceptance of treatment limits the analyst’s ability to learn about the better-off or low-substitution people who decline treatment but who would have to accept it if the policy were implemented. Both the intention-to-treat estimator and the `as treated’ estimator that compares the treated and the untreated are affected, not just by the labor supply effects that the trial is designed to induce, but by the kind of selection effects that randomization is designed to eliminate. Of course, the analysis that leads to (5) can perhaps help us say something about this and help us adjust the trial estimates back to what we would like to know. Yet this is no easy matter because selection depends, not only on observables, such as pre-experimental earnings and hours worked, but on (much harder to observe) labor supply responses that likely vary across individuals. Paraphrasing Ashenfelter, we cannot estimate the effects of a permanent compulsory negative income tax program from a transitory voluntary trial without strong assumptions or additional evidence.

Much of the modern literature, for example on training programs, wrestles with the issue of exactly who is represented by the RCT results, including not only who participates in the first place but who leaves before the trial is completed (see again Heckman, Lalonde and Smith (1999)). As in the examples above, modeling attrition within a trial can yield estimates of behavioral responses that can be used to transport the findings to other settings (see Chan and Hamilton (2006), Chassang, Padró i Miguel, and Snowberg (2012) and Chassang et al (2015)). When people are allowed to reject their randomly assigned treatment according to their own (real or perceived) advantage, or to drop out of a trial on an estimate of the benefits and costs from doing so, we have come a long way away from the random allocation in the standard conception of a randomized controlled trial. Moreover, the absence of blinding is common in social and economic RCTs, and while there are trials, such as welfare trials, that effectively compel people to accept their assignments, and some where the treatment is generous enough to do so, there are trials where subjects have much freedom and, in those cases, it is less than obvious to us what role, if any, randomization plays in warranting the results.

2.6 Scaling up: using the average for populations

Many RCTs are small-scale and local, for example in a few schools, clinics, or farms in a particular geographic, cultural, socio-economic setting. If an intervention is successful according to a cost-effectiveness criterion, for example, it is a candidate for scaling-up, applying the same intervention for a much larger area, often a whole country, or sometimes even beyond, as when some treatment is considered for all relevant World Bank projects. The fact that the intervention might work differently at scale has long been noted in the economics literature, e.g. Garfinkel and Manski (1992), Heckman (1992), and Moffitt (1992), and is recognized in the recent review by Banerjee and Duflo (2009). We want here to emphasize the pervasiveness of such effects as well as to note again that this should not be taken as an argument against using RCTs but only against the idea that effects at scale are likely to be the same as in the trial.

An example of what are often called `general equilibrium effects’ comes from agriculture. Suppose an RCT demonstrates that in the study population a new way of using fertilizer had a substantial positive effect on, say, cocoa yields, so that farmers who used the new methods saw increases in production and in incomes compared to those in the control group. If the procedure is scaled up to the whole country, or to all cocoa farmers worldwide, the price will drop, and if the demand for cocoa is price inelastic (as is usually thought to be the case, at least in the short run), cocoa farmers’ incomes will fall. Indeed, the conventional wisdom for many crops is that farmers do best when the harvest is small, not large. Of course, these considerations might not be decisive in deciding whether or not to promote the innovation, and there may still be long term gains if, for example, some farmers find something better to do than growing cocoa. But, in this case, the scaled-up effect is opposite in sign to the trial effect. The problem is not with the trial results, which can be usefully incorporated into a more comprehensive market model that incorporates the responses estimated by the trial. The problem is only if we assume that the aggregate looks like the individual. That other ingredients of the aggregate model must come from observational studies should not be a criticism, even for those who favor RCTs; it is simply the price of doing serious analysis.
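The sign reversal in the cocoa example can be illustrated with a toy market model. The Python sketch below assumes a constant-elasticity demand curve, which is a simplification chosen here for illustration and not a model from the text; the elasticity of 0.5, the 20 percent yield gain, and the function names are all hypothetical.

```python
def market_price(total_quantity, elasticity, scale=1.0):
    """Inverse demand for a constant-elasticity demand curve.

    `elasticity` is the absolute value of the price elasticity of demand;
    values below one mean demand is price inelastic.
    """
    return scale * total_quantity ** (-1.0 / elasticity)


def income(own_quantity, price):
    """Farm income from selling `own_quantity` at the market price."""
    return own_quantity * price


base_supply = 100.0   # total output of many identical farmers, one unit each
yield_gain = 1.2      # hypothetical 20 percent higher yield per farmer
elasticity = 0.5      # hypothetical, price-inelastic demand

p0 = market_price(base_supply, elasticity)

# Trial: a few treated farmers raise output, leaving the market price unchanged.
trial_income_change = income(yield_gain, p0) - income(1.0, p0)

# Scale-up: every farmer raises output, so total supply rises and the price falls.
p1 = market_price(base_supply * yield_gain, elasticity)
scaled_income_change = income(yield_gain, p1) - income(1.0, p0)
```

With inelastic demand the trial effect on income is positive and the scaled-up effect is negative. The yield response is the ingredient a trial can supply; the demand elasticity has to come from elsewhere.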

There are many possible interventions that alter supply or demand whose effect, in aggregate, will change a price or a wage that is held constant in the original RCT. Education will change the supplies of skilled versus unskilled labor, with implications for relative wage rates. Conditional cash transfers increase the demand for (and perhaps supply of) schools and clinics, which will change prices or waiting lines, or both. There are interactions between people that will operate only at scale. Giving one child a voucher to go to private school might improve her future, but doing so for everyone can decrease the quality of education for those children who are left in the public schools (see the contrasting studies of Angrist et al (2002) and Hsieh and Urquiola (2002)). Educational or training programs may benefit those who are treated but harm those left behind; Crépon et al (2014) recognize the issue and show how to adapt an RCT to deal with it.

Scaling up can also disturb the political equilibrium. An exploitative government may not allow the mass transfer of money from abroad to a powerless segment of the population, though it may permit a small-scale RCT of cash transfers, perhaps even in the hope that a large-scale implementation will yield opportunities for predation. Provision of healthcare by foreign NGOs may be successful in trials, but have unintended negative consequences at scale because of general equilibrium effects on the supply of healthcare personnel, or because it disturbs the nature of the contract between the people and a government that is using tax revenue to provide services. In India, the government spends large sums on food subsidies through a system (the PDS) that is both corrupt and inefficient, with much of the grain that is procured failing to find its way to the intended beneficiaries. Localized RCTs on whether or not families are better off with cash transfers are not informative about how politicians would change the amount of the transfer if faced with unanticipated inflation, and at least as important, whether the government could cut procurement from relatively wealthy and politically powerful farmers. Without a political and general equilibrium analysis, it is impossible to think about the effects of replacing food subsidies with cash transfers (see e.g. Basu (2010)).

Even in medicine, where biological interactions between people are less common than are social interactions in social science, interactions can be important. Infectious diseases are one well-known example, where immunization programs affect the dynamics of disease transmission through herd immunity (see Fine and Clarkson (1986) and Manski (2013, 52)). The social and economic setting also affects how drugs are actually used and the same issues can arise; the distinction between efficacy and effectiveness in clinical trials is in part recognition of the fact.

2.7 Drilling down: using the average for individuals

Just as there are issues with scaling-up, it is not obvious how to use the results from RCTs at the level of individual units, even individual units that were included in the trial. A well-conducted RCT delivers an ATE for the trial population but, in general, that average does not apply to everyone. It is not true, for example, as argued in the American Medical Association’s “Users’ guide to the medical literature” that “if the patient would have been enrolled in the study had she been there, that is she meets all of the inclusion criteria and doesn’t violate any of the exclusion criteria, there is little question that the results are applicable” (see Guyatt et al (1994, 60)). Even more misleading are the often-heard statements that an RCT with an average treatment effect insignificantly different from zero has shown that the treatment works for no one.
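A few lines of Python make the last point concrete. The individual treatment effects below are invented for illustration; in a real trial they are never observed directly, because each unit is seen under only one arm.

```python
# Hypothetical individual treatment effects: the treatment helps half of the
# units and harms the other half by the same amount.
individual_effects = [4.0, 4.0, -4.0, -4.0]

ate = sum(individual_effects) / len(individual_effects)
share_helped = sum(e > 0 for e in individual_effects) / len(individual_effects)
```

Here the ATE is exactly zero even though the treatment works for half of the units, so a zero average is compatible with large effects on individuals.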

These issues are familiar to physicians practicing evidence-based medicine whose guidelines require “integrating individual clinical expertise with the best available external clinical evidence from systematic research,” Sackett et al (1996, 71). Exactly what this means is unclear; physicians know much more about their patients than is allowed for in the ATE from the RCT (though, once again, stratification in the trial is likely to be helpful) and they often have intuitive expertise from long practice that can help them identify features in a particular patient that may influence the effectiveness of a given treatment for that patient. But there is an odd balance struck here. These judgments are deemed admissible in discussion with the individual patient, but they don’t add up to evidence to be made publicly available, with the usual cautions about credibility, by the standards adopted by most EBM sites. It is also true that physicians can have prejudices and `knowledge’ that might be anything but. Clearly, there are situations where forcing practitioners to follow the average will do better, even for individual patients, and others where the opposite is true, Kahneman and Klein (2009).

Whether or not averages are useful to individuals raises the same issue throughout social science research. Imagine two schools, St Joseph’s and St Mary’s, both of which were included in an RCT of a classroom innovation. The innovation is successful on average, but should the schools adopt it? Should St Mary’s be influenced by a previous attempt in St Joseph’s that was judged a failure? Many would dismiss this experience as anecdotal and ask how St Joseph’s could have known that it was a failure without benefit of `rigorous’ evidence. Yet if St Mary’s is like St Joseph’s, with a similar mix of pupils, a similar curriculum, and similar academic standing, might not St Joseph’s experience be more relevant to what might happen at St Mary’s than is the positive average from the RCT? And might it not be a good idea for the teachers and governors of St Mary’s to go to St Joseph’s and find out what happened and why? They may be able to observe the mechanism of the failure, if such it was, and figure out whether the same problems would apply for them, or whether they might be able to adapt the innovation to make it work for them, perhaps even more successfully than the positive average in the trial.

Once again, these questions are unlikely to be easily answered in practice; but, as with transportability, there is no serious alternative to trying. Assuming that the average works for you will often be wrong, and it will at least sometimes be possible to do better. As in the medical case, the advice to individual schools often lacks specificity. For example, the U.S. Institute of Education Sciences has provided a “user-friendly” guide to practices supported by rigorous evidence, US Department of Education (2003). The advice, which is similar to recommendations in development economics, is that the intervention be demonstrated effective through well-designed RCTs in more than one site and that “the trials should demonstrate the intervention’s effectiveness in school settings similar to yours” (2003, 17). No operational definition of “similar” is provided.

2.8 Examples and illustrations from economics

Our arguments in this Section should not be controversial, yet we believe that they represent an approach that is different from most current practice. To document this and to fill out the arguments, we provide some examples. While these are occasionally critical, our purpose is constructive; indeed, we believe that misunderstandings about how to use RCTs have artificially limited their usefulness, as well as alienated some who would otherwise use them.

Conditional cash transfers (CCTs) are interventions that have been tested using RCTs (and other RCT-like methods) and are often cited as a leading example of how an evaluation with strong internal validity leads to a rapid spread of the policy, e.g. Angrist and Pischke (2010) among many others. Think through the causal chain that is required for CCTs to be successful: people must like money, they must like (or at least not object too much to) their children being educated and vaccinated, there must exist schools and clinics that are close enough and well enough staffed to do their job, and the government or agency that is running the scheme must care about the wellbeing of families and their children. That such conditions hold in a wide range of (although certainly not all) countries makes it unsurprising that CCTs ‘work’ in many replications, though they certainly will not work in places where the schools and clinics do not exist, e.g. Levy (2001), nor in places where people strongly oppose education or vaccination.

Similarly, given that the support factors will operate with different strengths and effectiveness in different places, it is also not surprising that the size of the ATE differs from place to place; for example, Vivalt’s AidGrade website lists 29 estimates from a range of countries of the standardized (divided by local standard deviation of the outcome) effects of CCTs on school attendance; all but four show the expected positive effect, and the range runs from –8 to +38 percentage points, Vivalt (2015). Even in this leading case, where we might reasonably conclude that CCTs ‘work’ in getting children into school, it would be hard to calculate credible cost-effectiveness numbers or to come to a general conclusion about whether CCTs are more or less cost effective than other possible policies. Both costs and effect sizes can be expected to differ in new settings, just as they have in observed ones, making these predictions difficult.

The range of estimates illustrates that the simple view of external validity—that the ATE transports from one place to another—is not reasonable. AidGrade uses standardized measures of effect size divided by standard deviation of outcome at baseline, as does the major multi-country study by Banerjee et al (2015). But we might prefer measures that have an economic interpretation, such as additional months of schooling per $100 spent (for example if a donor is trying to decide where to spend, see below). Nutrition might be measured by height, or by the log of height. Even if the ATE by one measure carries across, it will only do so using another measure if the relationship between the two measures is the same in both situations. This is exactly the sort of thing that a formal analysis of transportability forces us to think about. (Note also that the ATE in the original RCT can differ depending on whether the outcome is measured in levels or in logs; it is easy to construct examples where the two ATEs have different signs.)
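The point about levels and logs can be made concrete with a minimal sketch. The two units and their potential outcomes below are invented purely for illustration and are not taken from any trial; the helper name `ate` is likewise hypothetical. One unit gains a lot proportionally from a low base, while the other loses a larger absolute amount from a high base.

```python
import math

# Hypothetical potential outcomes (untreated, treated) for two units.
# These numbers are assumptions chosen only to illustrate the sign flip.
potential_outcomes = [(1.0, 10.0), (100.0, 80.0)]


def ate(outcomes, transform=lambda y: y):
    """Average over units of transform(treated) - transform(untreated)."""
    diffs = [transform(y1) - transform(y0) for y0, y1 in outcomes]
    return sum(diffs) / len(diffs)


ate_levels = ate(potential_outcomes)           # outcome measured in levels
ate_logs = ate(potential_outcomes, math.log)   # outcome measured in logs

print(f"ATE in levels: {ate_levels:+.2f}")
print(f"ATE in logs:   {ate_logs:+.2f}")
```

With these assumed numbers the ATE is negative in levels and positive in logs, so even the direction of the "effect" depends on the scale chosen for the outcome.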

Much of the economics literature, like the medical literature, works with the view of external validity that, unless there is evidence to the contrary, the direction and size of treatment effects can be transported from one place to another. The J-PAL website reports its findings under a general heading of policy relevance, subdivided by a selection of topics. Under each topic, there is a list of relevant RCTs from a range of different settings around the world. These are conveniently converted into a common cost-effectiveness measure so that, for example, under “education”, subhead “student participation”, there are four studies from Africa: on informing parents about the returns to education in Madagascar, on deworming, on school uniforms, and on merit scholarships, all from Kenya. The units of measurement are additional years of student education per $100, and among these four studies, the average effects of spending $100 are 20.7 years, 13.9 years, 0.71 years and 0.27 years respectively. (Note that this is a different—and much superior—standardization from the effect size standardization discussed below.)

What can we conclude from such comparisons? For a philanthropic donor interested in education, and if marginal and average effects are the same, they might indicate that the best place to devote a marginal dollar is in Madagascar, where it would be used to inform parents about the value of education. This is certainly useful, but it is not as useful as statements that information or deworming programs are everywhere more cost-effective than programs involving school uniforms or scholarships, or if not everywhere, at least over some domain, and it is these second kinds of comparison that would genuinely fulfill the promise of ‘finding out what works.’ But such comparisons only make sense if we can transport the results from one place to another, if the Kenyan results also hold in Madagascar, Mali, or Namibia, or some other list of places. J-PAL’s manual for cost-effectiveness, Dhaliwal et al (2012), explains in (entirely appropriate) detail how to handle variation in costs across sites, noting variable factors such as population density, prices, exchange rates, discount rates, inflation, and bulk discounts. But it gives short shrift to cross-site variation in the size of ATEs, which play an equal part in the calculations of cost effectiveness. The manual briefly notes that diminishing returns (or the last-mile problem) might be important in theory but argues that the baseline levels of outcomes are likely to be similar in the pilot and replication areas, so that the ATE can be safely transported as is. All of this lacks a justification for transportability, some understanding of when results transport, when they do not, or better still, how they should be modified to make them transportable.
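The claim that ATEs and costs play an equal part in cost-effectiveness can be sketched with simple arithmetic. The function name, the per-child cost, and the ATE values below are hypothetical assumptions, not figures from J-PAL or from any of the studies cited; the sketch assumes only that cost-effectiveness is computed as outcome gained per $100 spent.

```python
import math


def years_per_100_dollars(ate_years, cost_per_child):
    """Additional years of schooling bought by $100 (assumed ratio form)."""
    return 100.0 * ate_years / cost_per_child


# Hypothetical pilot site: ATE of 0.25 years at a cost of $12.50 per child.
pilot = years_per_100_dollars(ate_years=0.25, cost_per_child=12.5)

# New site where only the cost differs (twice as expensive to deliver).
cost_doubles = years_per_100_dollars(ate_years=0.25, cost_per_child=25.0)

# New site where only the ATE differs (half the effect, same cost).
ate_halves = years_per_100_dollars(ate_years=0.125, cost_per_child=12.5)

print(pilot, cost_doubles, ate_halves)
```

Under these assumptions, a halved ATE damages the transported cost-effectiveness figure exactly as much as a doubled cost, so adjusting carefully for cost differences while carrying the ATE across unchanged corrects only one of the two terms in the ratio.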

One of the largest and most technically impressive of the development RCTs is by Banerjee et al (2015), which tests a “graduation” program designed to permanently lift extremely poor people from poverty by providing them with a gift of a productive asset (from guinea-pigs, (regular-) pigs, sheep, goats, or chickens depending on locale), training and support, and life-skills coaching, as well as support for consumption, saving, and health services. The idea is that this package of aid can help people break out of poverty traps in a way that would not be possible with one intervention at a time. Comparable versions of the program were tested in Ethiopia, Ghana, Honduras, India, Pakistan, and Peru and, excepting Honduras (where the chickens died), show largely positive and persistent effects—with similar (standardized) effect sizes—for a range of outcomes (economic, mental and physical health, and female empowerment). One site apart, essentially everyone accepted their assignment. Replication of positive ATEs over such a wide range of places certainly provides proof of concept for such a scheme. Yet Bauchet, Morduch, and Ravi (2015) fail to replicate the result in South India, where the control group got access to much the same benefits, what Heckman, Hohman, and Smith (2000) call “substitution bias”. Even so, the results are important because, although there is a longstanding interest in poverty traps, many economists have been skeptical of their existence or that they could be sprung by such aid-based policies. In this sense, the study is an important contribution to the theory of economic development; it tests a theoretical proposition and will (or should) change minds about it.

A number of difficulties remain. As the authors note, such trials cannot tell us which component of the treatment accounted for the results, or which might be dispensable—a much more expensive multifactorial trial would be required—though it seems likely in practice that the costliest component—the repeated visits for training and support—will be the first to be cut by cash-strapped politicians or administrators. And as noted, it is not clear what should count as (simple) replication in international comparisons; it is hard to think of the uses of standardized effect sizes, except to document that effects exist everywhere and that they are similarly large relative to local variation in such things.

The effect size—the ATE standardized by being expressed in numbers of standard deviations of the original outcome—though conveniently dimensionless, has little to recommend it. As with much of RCT practice, it strips out any economic content—no rates of return, or benefits minus costs—and it removes any discipline on what is being compared. Apples and oranges become immediately comparable, as do treatments whose inclusion in a meta-analysis is limited only by the imagination of the analysts in claiming similarity. Training programs for physical fitness can be pooled with training programs for welding, or marketing, or even obedience training for pets. In psychology, where the concept originated, this results in endless disputes about what should and should not be pooled in a meta-analysis. Goldberger and Manski (1995, 769) note that “standardization accomplishes nothing except to give quantities in noncomparable units the superficial appearance of being in comparable units. This accomplishment is worse than useless—it yields misleading inferences.” Beyond that, Simpson (2017) notes that restrictions on the trial sample—often good practice to reduce background noise and to help detect an effect—will reduce the baseline standard deviation and inflate the effect size. More generally, effect sizes are open to manipulation by exclusion rules. It makes no sense to claim replicability on the basis of effect sizes, let alone to use them to rank projects. Effect sizes are irrelevant for policymaking.
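The inflation of effect sizes by sample restriction can be illustrated with a short sketch. The baseline outcomes, the exclusion rule, and the assumed ATE below are all invented for illustration, and the ATE is held fixed by assumption so that only the denominator changes.

```python
import statistics


def effect_size(ate, baseline_outcomes):
    """Standardized effect size: ATE divided by the baseline standard deviation."""
    return ate / statistics.stdev(baseline_outcomes)


# Assumption: the same ATE of 5 units holds in both samples.
assumed_ate = 5.0

# Hypothetical baseline outcomes for the full eligible population.
full_sample = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# Hypothetical exclusion rule: enroll only units with baseline in [40, 60].
restricted_sample = [y for y in full_sample if 40 <= y <= 60]

es_full = effect_size(assumed_ate, full_sample)
es_restricted = effect_size(assumed_ate, restricted_sample)

print(f"effect size, full sample:       {es_full:.2f}")
print(f"effect size, restricted sample: {es_restricted:.2f}")
```

With these made-up numbers the identical ATE produces an effect size of roughly 0.18 in the full sample and 0.5 in the restricted one, which is the sense in which effect sizes are open to manipulation by exclusion rules.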

The graduation study can be taken as the closest to fulfilling the ‘finding out what works’ aim of the RCT movement in development. Yet it is silent on perhaps the crucial aspect for policy, which is that the trial was run in partnership with NGOs, whereas what we would like to know is whether it could be replicated by governments, including those governments that are incapable of getting doctors, nurses, and teachers to show up to clinics or schools, Chaudhury et al (2005), Banerjee, Deaton and Duflo (2004), or of regulating the quality of medical care in either the public or private sectors, Filmer, Hammer and Pritchett (2000) or Das and Hammer (2005). In fact, we already know a great deal about ‘what works.’ Vaccinations work, maternal and child healthcare services work, and classroom teaching works. Yet knowing this does not get those things done. Adding another program that works under ideal conditions is useful only where conditions are in fact ideal, in which case it would likely be unnecessary. Finding out what works is not the magic key to economic development. Technical knowledge, though always worth having, requires suitable institutions and suitable incentives if it is to do any good.

A similar point is documented in the contrast between a successful trial that used cameras and threats of wage reductions to incentivize attendance of teachers in schools run by an NGO in Rajasthan in India, Duflo, Hanna, and Ryan (2012), and the subsequent failure of a follow-up program in the same state to tackle mass absenteeism of health workers, Banerjee, Duflo, and Glennerster (2008). In the schools, the cameras and timekeeping worked as intended, and teacher attendance increased. In the clinics, there was a short-run effect on nurse attendance, but it was quickly eliminated. (The ability of agents eventually to undermine policies that are initially effective is common enough and not easily handled within an RCT.) In both trials, there were incentives to improve attendance, and there were incentives to find a way to sabotage the monitoring and restore workers to their accustomed positions; the force of these incentives is a ‘high-level’ cause, like gravity, or the principle of the lever, that works in much the same way everywhere. For the clinics, some sabotage was direct—the smashing of cameras—and some was subtler, when government supervisors provided official, though specious, reasons for missing work. We can only conjecture why the causality was switched in the move from NGO to government; we suspect that working for a highly-respected local NGO is a different contract from working for the government, where not showing up for work is widely (if informally) understood to be part of the deal. The incentive lever works when it is wired up right, as with the NGOs, but not when the wiring cuts it out, as with the government. Knowing ‘what works’ in the sense of the treatment effect on the trial population is of limited value without understanding the political and institutional environment in which it is set. This underlines the need to understand the underlying social, economic, and cultural structures—including the incentives and agency problems that inhibit service delivery—that are required to support the causal pathways that we should like to see at work.

Trials in economic development often take place in artificial environments. Drèze (2016) notes, based on extensive experience in India, “when a foreign agency comes in with its heavy boots and suitcases of dollars to administer a ‘treatment,’ whether through a local NGO or government or whatever, there is a lot going on other than the treatment.” There is also the suspicion that a treatment that works does so because of the presence of the ‘treators,’ often from abroad, and may not do so with the people who will work it in practice.

There is also much to be learned from many years of economic trials in the United States, particularly from the work of MDRC, from the early income tax trials, as well as from the RAND Health Experiment. Following the income tax trials, MDRC has run many randomized trials since the 1970s, mostly for the Federal government but also for individual states and for Canada (see the thorough and informative account by Gueron and Rolston (2011) for the factual information underlying the following discussion). MDRC’s program, like that of J-PAL in development, is intended to find out ‘what works’ in the state and federal welfare programs. These programs are conditional cash transfers in which poor recipients are given cash provided they satisfy certain conditions such as work requirements or training, which are often the subject of the trial. What are the benefits and costs of various alternatives, both to the recipients and to the local and federal taxpayers? All of these programs are deeply politicized, with sharply different views over both facts and desirability. Many engaged in these disputes feel certain of what should be done and what its consequences will be so that, by their lights, control groups are unethical because they deprive some people of what the advocates ‘know’ will be certain benefits. Given this, it is perhaps surprising that RCTs have become the accepted norm for this kind of policy evaluation in the US.

The reasons owe much to political institutions, as well as to the common belief, explored in Section 1, that RCTs can reveal the truth. At the Federal level, prospective policies are vetted by the non-partisan Congressional Budget Office (CBO), which makes its own estimates of the budgetary implications of the program. Ideologues whose programs are scored poorly by the CBO have an incentive to support an RCT, not to convince themselves, but to convince opponents; once again, RCTs are valuable when your opponents do not share your prior. And control groups are easier to put in place when there are insufficient funds to cover the whole population. There was also a widespread and largely uncritical belief that RCTs give the right answer, at least for the budgetary implications, which, rather than the wellbeing of the recipients, were often the primary concern; note that all of these trials are on poor people by rich people who are typically more concerned with cost than with the wellbeing of the poor, Greenberg, Schroder and Onstott (1999). MDRC’s trials could therefore be effective dispute-reconciliation mechanisms both for those who saw the need for evidence and for those who did not (except instrumentally). The outcome here fits with our ‘public health’ case; what the politicians need to know is not the outcomes for individuals, or even how the outcomes in one state might transport to another, but the average budgetary cost in a specific place, something that a good RCT conducted on a representative sample of the target population can deliver, at least in the absence of general equilibrium effects, timing effects, etc.

These RCTs by MDRC and other contractors have demonstrated both the feasibility of large-scale social trials, including the possibility of randomization in these settings (where many participants were hostile to the idea), and their usefulness to policymakers. They also seem to have changed beliefs, for example in favor of the desirability of work requirements as a condition of welfare, even among many originally opposed. There are also limitations; the trials appear to have had at best a minor influence on scientific thinking about behavior in labor markets and, in that sense, they are more about ‘plumbing’ than science, Duflo (2017). The results of similar programs have often been different across different sites, and there has to date been no firm understanding of why; indeed, the trials are not designed to reveal this, Moffitt (2004). Finally, and perhaps crucially for the potential contribution to economic science, there has been little success in understanding either the underlying structures or chains of causation, in spite of a determined effort from the beginning to open the black boxes.

The RAND health experiment, Manning et al (1975a, b), provides a different but equally instructive story if only because its results have permeated the academic and policy discussions about healthcare ever since. It was originally designed to test whether more generous insurance causes people to use more medical care and, if so, by how much. The incentive effects are hardly in doubt today; the immortality of the study comes rather from the fact that its multi-arm (response surface) design allowed the calculation of an elasticity for the study population, that medical expenditures decreased by 0.1 to 0.2 percent for every one percent increase in the copayment (an elasticity of –0.1 to –0.2). According to Aron-Dine, Einav, and Finkelstein (2013), it is this dimensionless and thus apparently transportable number that has been used ever since to discuss the design of healthcare policy; the elasticity has come to be treated as a universal constant. Ironically, they argue that the estimate cannot be replicated in recent studies, and it is even unclear that it is firmly based on the original evidence. This points, once again, to the central importance of transportability for the usefulness, both short and long term, of a trial. Here, the simple direct transportability of the result seems to have been largely illusory though, as we have argued, this does not mean that more complex constructions based on the results of the trial would not have done better.

Conclusions

It is useful to respond to two challenges that are often put to us, one from medicine and one from social science. The medical challenge is, “If you are being prescribed a new drug, wouldn’t you want it to have been through an RCT?” The second (related) challenge is, “OK, you have highlighted some of the problems with RCTs, but other methods have all of those problems, plus problems of their own.” We believe that we have answered both of these in the paper but that it is helpful to recapitulate.

The medical challenge is about you, a specific person, so that one answer would be that you may be different from the average, and you are entitled to and ought to ask about theory and evidence about whether it will work for you. This would be in the form of a conversation between you and your physician, who knows a lot about you. You would want to know how this class of drug is supposed to work and whether that mechanism is likely to work for you. Is there any evidence from other patients, especially patients like you, with your condition and in your circumstances, or are there suggestions from theory? What scientific work has been done to identify what support factors matter for success with this kind of drug? If the only information available is from the pharmaceutical company, an RCT might seem like a good idea. But even then, and although knowledge of the mean effect among some group is certainly of value, you might give little weight to an RCT whose participants are selected in the way they were selected in the trial, or where there is little information about whether the outcomes are relevant to you. Recall that many new drugs are prescribed ‘off-label’, for a purpose for which they were not tested, and beyond that, that many new drugs are administered in the absence of an RCT because you are actually being enrolled in one. For patients whose last chance is to participate in a trial of some new drug, this is exactly the sort of conversation you should have with your physician (followed by one asking her to reveal whether you are in the active arm, so that you can switch if not), and such conversations need to take place for all prescriptions that are new to you. In these conversations, the results of an RCT may have marginal value. If your physician tells you that she endorses evidence-based medicine, and that the drug will most likely work for you because an RCT has shown that ‘it works’, it is time to find a new physician.

The second challenge claims that other methods are always dominated by an RCT. This kind of challenge is not well-formulated. Dominated for answering what question, for what purposes? The chief advantage of the RCT is that it can, if well-conducted, give an unbiased estimate of an ATE in a study (trial) sample and thus provide evidence that the treatment caused the outcome in some individuals in that sample. If that is what you want to know and there’s little background knowledge available and the price is right, then an RCT may be the best choice. As to other questions, the RCT result can be part—but usually only a small part—of the defense of (a) a general claim, (b) a claim that the treatment will cause that outcome for some other individuals, or even (c) a claim about what the ATE will be in some other population. But they do little for these enterprises on their own. What is the best overall package of research work for tackling these questions—most cost-effective and most likely to produce correct results—depends on what we know and what different kinds of research will cost.

There are examples where an RCT does better than an observational study, and these seem to be the cases that come to mind for defenders of RCTs. For example, regressions of whether people who get Medicaid do better or worse than people with private insurance are vitiated by gross differences in the other characteristics of the two populations. But it is a long step from that to saying that an RCT can solve the problem, let alone that it is the only way to solve the problem. It will not only be expensive per subject, but it can only enroll a selected and almost certainly unrepresentative study sample, it can be run only temporarily, and the recruitment to the experiment will necessarily be different from recruitment in a scheme that is permanent and open to the full qualified population. None of this removes the blemishes of the observational study, but there are many methods of mitigating its difficulties, so that, in the end, an observational study with credible corrections and a more relevant and much larger study sample—today often the complete population of interest through administrative records—may provide a better estimate. Everything has to be judged on a case-by-case basis. There is no rigorous argument for a lexicographic preference for RCTs.

There is also an important line of enquiry that goes, not only beyond RCTs, but beyond the ‘method of differences’ that is common to RCTs, regressions, or any form of controlled or uncontrolled comparison. The hypothetico-deductive method confronts theory-based deductions with the data—either observational or experimental. As noted above, economists routinely use theory to tease out a new implication that can be taken to the data, and there are also good examples in medicine such as Bleyer and Welch (2012)’s demonstration of the limited impact on breast cancer incidence of mammography screening, a topic where other methods have generated great controversy and little consensus.

RCTs are the ultimate in non-parametric estimation of average treatment effects in the trial samples because they make so few assumptions about heterogeneity, causal structure, choice of variables, and functional form. RCTs are often convenient ways to introduce experimenter-controlled variance—if you want to see what happens, then kick it and see, twist the lion’s tail—but note that many experiments, including many of the most important (and Nobel Prize winning) experiments in economics, do not and did not use randomization, Harrison (2013), Svorencik (2015). But the credibility of the results, even internally, can be undermined by unbalanced covariates and by excessive heterogeneity in responses, especially when the distribution of effects is asymmetric, where inference on means can be hazardous. Ironically, the price of the credibility in RCTs is that we can only recover the mean of the distribution of treatment effects, and that only for the trial sample. Yet, in the presence of outliers, reliable inference on means is difficult. And randomization in and of itself does nothing unless the details are right; purposive selection into the experimental population, like purposive selection into and out of assignment, undermines inference in just the same way as does selection in observational studies. Lack of blinding, whether of participants, trialists, data collectors, or analysts, undermines inference, akin to a failure of exclusion restrictions in instrumental variable analysis.
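The hazard that outliers pose for inference on means can be seen in a small enumeration. The six outcomes below are hypothetical, and the sketch assumes that the treatment has no effect on anyone, so each unit's outcome is the same in either arm. It lists every way of assigning three of the six units to treatment and computes the difference in means for each.

```python
import itertools
import statistics

# Hypothetical outcomes; by assumption the true treatment effect is zero
# for every unit, so outcomes do not depend on the arm. One unit is an outlier.
outcomes = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
n_treated = 3

estimates = []
for treated in itertools.combinations(range(len(outcomes)), n_treated):
    treated_mean = statistics.mean(outcomes[i] for i in treated)
    control_mean = statistics.mean(
        outcomes[i] for i in range(len(outcomes)) if i not in treated
    )
    estimates.append(treated_mean - control_mean)

# Averaged over all possible randomizations the estimator is unbiased...
mean_estimate = statistics.mean(estimates)
# ...yet every single randomization gives an estimate far from zero.
smallest_abs_estimate = min(abs(e) for e in estimates)

print(len(estimates), mean_estimate, round(smallest_abs_estimate, 1))
```

In this constructed case the estimates average to zero across the 20 possible assignments, but no individual trial comes within 330 units of the truth; the sign of the estimate is decided by which arm happens to receive the outlier.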

The lack of structure can be seriously disabling when we try to use RCT results outside of a few contexts, such as program evaluation, hypothesis testing, or establishing proof of concept. Beyond that, the results cannot be used to help make predictions beyond the trial sample without more structure, without more prior information, and without having some idea of what makes treatment effects vary from place to place or time to time. There is no option but to commit to some causal structure if we are to know how to use RCT evidence out of the original context. Simple generalization and simple extrapolation do not cut the mustard. This is true of any study, experimental or observational. But observational studies are familiar with, and routinely work with, the sort of assumptions that RCTs claim to avoid, so that if the aim is to use empirical evidence, any credibility advantage that RCTs have in estimation is no longer operative. And because RCTs tell us so little about why results happen, they have a disadvantage over studies that use a wider range of prior information and data to help nail down mechanisms.

Yet once that commitment has been made, RCT evidence can be extremely useful, pinning down part of a structure, helping to build stronger understanding and knowledge, and helping to assess welfare consequences. As our examples show, this can often be done without committing to the full complexity of what are often thought of as structural models. Yet without the structure that allows us to place RCT results in context, or to understand the mechanisms behind those results, not only can we not transport whether ‘it works’ elsewhere, but we cannot do the standard stuff of economics, which is to say whether the intervention is actually welfare improving. Without knowing why things happen and why people do things, we run the risk of worthless casual (‘fairy story’) causal theorizing and have given up on one of the central tasks of economics.

We must back away from the refusal to theorize, from the exultation in our ability to handle unlimited heterogeneity, and actually SAY something. Perhaps paradoxically, unless we are prepared to make assumptions, and to say what we know, making statements that will be incredible to some, all the credibility of the RCT is for naught.

RCTs in economics on health, labor, and development have proven their worth in providing proofs of concept and in testing predictions that some policies must always work or can never work. But, as elsewhere in economics, we cannot find out why something works by simply demonstrating that it does work, no matter how often, which leaves us uninformed as to whether the policy should be implemented. Beyond that, small scale, demonstration RCTs are not capable of telling us what would happen if these policies were implemented to scale, of capturing unintended consequences that typically cannot be included in the protocols, or of modeling what will happen if schemes are implemented differently than in the trial, for example by governments, whose motives and operating principles are different from those of the NGOs or academics who typically run trials. While it is true that abstract knowledge is always likely to be beneficial, successful policy depends on institutions and on politics, matters on which RCTs have little to say. The results of RCTs can and should feed into public debate about what should be done, but we are on dangerous ground when they are used, on grounds of their supposed epistemic superiority, to insulate policy from democratic processes.

Citations

Aigner, Dennis J., 1985, “The residential electricity time-of-use pricing experiments. What have we learned?” in David A. Wise and Jerry A. Hausman, Social experimentation, Chicago, Il. Chicago University Press for National Bureau of Economic Research, 11–54.

Al-Ubaydli, Omar, and John A. List, 2013, “On the generalizability of experimental results in economics,” in G. Frechette and A. Schotter, Methods of modern experimental economics, Oxford University Press.

Altman, Douglas G., 1985, “Comparability of randomized groups,” Journal of the Royal Statistical Society, Series D (The Statistician), 34(1), Statistics in health, 125–36.

Angrist, Joshua D., 2004, “Treatment effect heterogeneity in theory and practice,” Economic Journal, 114, C52–C83.

Angrist, Joshua D., Eric Bettinger, Erik Bloom, Elizabeth King, and Michael Kremer, 2002, “Vouchers for private schooling in Colombia: evidence from a randomized natural experiment,” American Economic Review, 92(5), 1535–58.

Angrist, Joshua D. and Jörn-Steffen Pischke, 2010, “The credibility revolution in empirical economics: how better research design is taking the con out of econometrics,” Journal of Economic Perspectives, 24(2), 3–30.

Angrist, Joshua D. and Jörn-Steffen Pischke, 2017, “Undergraduate econometrics instruction: through our classes, darkly,” Journal of Economic Perspectives, 31(2), 125–44.

Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein, 2013, “The RAND health insurance experiment, three decades later,” Journal of Economic Perspectives, 27(1), 197–222.

Arrow, Kenneth J., 1975, “Two notes on inferring long run behavior from social experiments,” Document No. P-5546, Santa Monica, CA. Rand Corporation.

Ashenfelter, Orley, 1978, “The labor supply response of wage earners,” in John L. Palmer and Joseph A. Pechman, eds., Welfare in rural areas: the North Carolina–Iowa Income Maintenance Experiment, Washington, DC. The Brookings Institution. 109–38.

Athey, Susan and Guido W. Imbens, 2017, “The state of applied econometrics: causality and policy evaluation,” Journal of Economic Perspectives, 31(2), 3–32.

Attanasio, Orazio, Costas Meghir, and Ana Santiago, 2012, “Education choices in Mexico: using a structural model and a randomized experiment to evaluate PROGRESA,” Review of Economic Studies, 79(1), 37–66.

Attanasio, Orazio, Sarah Cattan, Emla Fitzsimons, Costas Meghir, and Marta Rubio Codina, 2015, “Estimating the production function for human capital: results from a randomized controlled trial in Colombia,” London. Institute for Fiscal Studies, Working Paper no W15/06.

Bahadur, R. R., and Leonard J. Savage, 1956, “The non-existence of certain statistical procedures in nonparametric problems,” Annals of Mathematical Statistics, 27: 1115–22.

Banerjee, Abhijit, Sylvain Chassang, Sergio Montero, and Erik Snowberg, 2016, “A theory of experimenters,” processed, July 2016.

Banerjee, Abhijit, Sylvain Chassang, and Erik Snowberg, 2016, “Decision theoretic approaches to experiment design and external validity,” Cambridge, MA. NBER Working Paper no 22167, April.

Banerjee, Abhijit, Angus Deaton, and Esther Duflo, 2004, “Health care delivery in rural Rajasthan,” Economic and Political Weekly, 39(9), 944–9.

Banerjee, Abhijit and Esther Duflo, 2009, “The experimental approach to development economics,” Annual Review of Economics, 1, 151–78.

Banerjee, Abhijit and Esther Duflo, 2012, Poor economics: a radical rethinking of the way to fight global poverty, Public Affairs.

Banerjee, Abhijit, Esther Duflo, Nathanael Goldberg, Dean Karlan, Robert Osei, William Parienté, Jeremy Shapiro, Bram Thuysbaert, and Christopher Udry, 2015, “A multifaceted program causes lasting progress for the very poor: evidence from six countries,” Science, 348(6236), 1260799.

Banerjee, Abhijit, Esther Duflo, and Rachel Glennerster, 2008, “Putting a band-aid on a corpse: incentives for nurses in the Indian public health care system,” Journal of the European Economic Association, 6(2–3), 487–500.

Banerjee, Abhijit V. and Ruimin He, 2003, “The World Bank of the future,” American Economic Review, 93(2), 39–44.

Banerjee, Abhijit, Dean Karlan, and Jonathan Zinman, 2015, “Six randomized evaluations of microcredit: introduction and further steps,” American Economic Journal: Applied Economics, 7(1), 1–21.

Bareinboim, Elias and Judea Pearl, 2013, “A general algorithm for deciding transportability of experimental results,” Journal of Causal Inference, 1(1), 107–34.

Bareinboim, Elias and Judea Pearl, 2014, “Transportability from multiple environments with limited experiments: completeness results,” in M. Welling, Z. Ghahramani, C. Cortes, and N. Lawrence, eds., Advances of Neural Information Processing, 27, (NIPS Proceedings), 280–8.

Bauchet, Jonathan, Jonathan Morduch, and Shamika Ravi, 2015, “Failure vs displacement: why an innovative anti-poverty program showed no net impact in South India,” Journal of Development Economics, 116, 1–16.

Basu, Kaushik, 2010, “The economics of foodgrain management in India,” Ministry of Finance, Delhi. http://finmin.nic.in/workingpaper/Foodgrain.pdf

Begg, Colin B., 1990, “Significance tests of covariance imbalance in clinical trials,” Controlled Clinical Trials, 11(4), 223–5.

Bhattacharya, Debopam and Pascaline Dupas, 2012, “Inferring welfare maximizing treatment assignment under budget constraints,” Journal of Econometrics, 167(1), 168–96.

Bitler, Marianne P., Jonah B. Gelbach, and Hilary W. Hoynes, 2006, “What mean impacts miss: distributional effects of welfare reform experiments,” American Economic Review, 96(4), 988–1012.

Bleyer, Archie, and H. Gilbert Welch, 2012, “Effect of three decades of screening mammography on breast-cancer incidence,” New England Journal of Medicine, 367, 1998–2005.

Bloom, Howard S., Carolyn J. Hill, and James A. Riccio, 2005, “Modeling cross-site experimental differences to find out why program effectiveness varies,” in Howard S. Bloom, ed., Learning more from social experiments: evolving analytical approaches, New York, NY. Russell Sage.

Bold,Tessa,MwangiKimenyi,GermanoMwabu,AliceNg’ang’a,andJustinSandefur,2013,“Scalingupwhatworks:experimentalevidenceonexternalvalidityinKen-yaneducation,”Washington,DC.CenterforGlobalDevelopment,WorkingPaper321.

Bothwell,LauraE.andScottH.Podolsky,2016,“Theemergenceoftherandomized,controlledtrial,”NewEnglandJournalofMedicine,375(6),501–4.doi:10.1056/NEJMp1604635

Campbell,D.T.andJ.C.Stanley,1963,Experimentalandquasi-experimentaldesignsforresearch.Chicago.RandMcNally.

Cartwright,Nancy,1994,Nature’scapacitiesandtheirmeasurement.Oxford.Claren-donPress.

Cartwright,Nancy,2007,“AreRCTsthegoldstandard?”Biosocieties,2,11-20.Cartwright,Nancy,2011,“Aphilosopher’sviewofthelongroadfromRCTstoeffec-tiveness,”TheLancet,377,1400-01.

Cartwright,Nancy,2012,“Presidentialaddress:willthispolicyworkforyou?Pre-dictingeffectivenessbetter:howphilosophyhelps,”PhilosophyofScience,79,973-89.

Cartwright,Nancy,2016.“Whereistherigorwhenyouneedit?”inI.Marinovic,ed.,FoundationsandTrendsinAccounting:specialissueoncausalinferenceincapitalmarketsresearch,10(2-4):106-24.

Cartwright,NancyandJeremyHardie,2012,Evidencebasedpolicy:apracticalguidetodoingitbetter,Oxford.OxfordUniversityPress.

Cartwright,NancyandEileenMunro,2010,“ThelimitationsofRCTsinpredictingeffectiveness,”JournalofExperimentalChildPsychology,16(2),

Chalmers,Iain,2001,“Comparinglikewithlike:somehistoricalmilestonesintheevolutionofmethodstocreateunbiasedcomparisongroupsintherapeuticexper-iments,”InternationalJournalofEpidemiology,30,1156–64.

Chan,TatY.andBartonH.Hamilton,2006,“Learning,privateinformation,andtheeconomicevaluationofrandomizedexperiments,”JournalofPoliticalEconomy,114(6),997-1040.

Chassang,Sylvain,GerardPadróIMiguel,andErikSnowberg,2012,“Selectivetrials:aprincipal–agentapproachtorandomizedcontrolledexperiments,”AmericanEconomicReview,102(4),1279–1309.

Chassang,Sylvain,ErikSnowberg,BenSeymour,andCayleyBowles,2015,“Ac-countingforbehaviorintreatmenteffects:newapplicationsforblindtrials,”PLoSOne,10(6),e0127227.doi:10:1371/journal.pone.0127227.

Chaudhury,Nazmul,JeffreyHammer,MichaelKremer,KarthikMuralidharan,andF.HalseyRogers,2005,“Missinginaction:teacherandhealthworkerabsenceinde-velopingcountries,”JournalofEconomicPerspectives,19(4),91–116.

Chyn,Eric,2016,“Movedtoopportunity:thelong-runeffectofpublichousingdemo-litiononlabormarketoutcomesofchildren,”UniversityofMichigan.http://www-personal.umich.edu/~ericchyn/Chyn_Moved_to_Opportunity.pdf

63

Chetty,Raj,2009,“Sufficientstatisticsforwelfareanalysis:abridgebetweenstruc-turalandreduced-formmethods,”AnnualReviewofEconomics,1,451-87.

Conlisk,John,1973,“Choiceofresponsefunctionalformindesigningsubsidyexper-iments,”Econometrica,41(4),643–56.

Crépon,Bruno,EstherDuflo,MarcGurgand,RolandRathelot,andPhilippeZamora,2014,“Dolabormarketpolicieshavedisplacementeffects?evidencefromaclus-teredrandomizedexperiment,”QuarterlyJournalofEconomics,128(2),531–80.

Das,JishnuandJeffreyHammer,2005,”Whichdoctor?Combiningvignettesanditemresponsetomeasureclinicalcompetence,”JournalofDevelopmentEconom-ics,78,348–83.

Davey-Smith,George,andShahIbrahim,2002,“Datadredging,bias,orconfound-ing,”BritishMedicalJournal,325,1437-8.

Deaton,Angus,2010,“Instruments,randomization,andlearningaboutdevelop-ment,”JournalofEconomicLiterature,48(2),424-55.

Deaton,AngusandNancyCartwright,2016,“Understandingandmisunderstandingrandomizedcontrolledtrials,”http://www.princeton.edu/~deaton/down-load.html?pdf=Deaton_Cartwright_RCTs_with_ABSTRACT_August_25.pdf

Deaton,AngusandJohnMuellbauer,1980,Economicsandconsumerbehavior,NewYork.CambridgeUniversityPress.

Deaton,AngusandSerenaNg,1998,“Parametricandnonparametricapproachestopriceandtaxreform,”JournaloftheAmericanStatisticalAssociation,93(443),900-9.

Dhaliwal,Iqbal,EstherDuflo,RachelGlennerster,andCaitlinTulloch,2012,“Com-parativecost-effectivenessanalysistoinformpolicyindevelopingcountries:ageneralframeworkwithapplicationsforeducation,”J–PAL,MIT,December3rd.http://www.povertyactionlab.org/publication/cost-effectiveness

Drèze,Jean,2016,Personalemailcommunication.Duflo,Esther,2017,“Theeconomistasplumber,”AmericanEconomicReview,107(5),1-26.

Duflo,Esther,RemaHanna,andStephenP.Ryan,2012,“Incentiveswork:gettingteacherstocometoschool,”AmericanEconomicReview,102(4),1241–78.

Duflo,EstherandMichaelKremer,2008,“Useofrandomizationintheevaluationofdevelopmenteffectiveness,”inWilliamEasterly,ed.,Reinventingforeignaid.Washington,DC.Brookings,93–120.

Dynarski,Susan,2015,“Helpingthepoorineducation:thepowerofasimplenudge,”NewYorkTimes,Jan17,2015.

Fine,PaulE.M.andJacquelineA.Clarkson,1986,“Individualversuspublicprioritiesinthedeterminationofoptimalvaccinationpolicies,”AmericanJournalofEpide-miology,124(6),1012–20.

Fisher,RonaldA.,1926,“Thearrangementoffieldexperiments,”JournaloftheMin-istryofAgricultureofGreatBritain,33,503–13.

Filmer,Deon,JeffreyHammer,andLantPritchett,2000,“Weaklinksinthechain:adiagnosisofhealthpolicyinpoorcountries,”WorldBankResearchObserver,15(2),199–204.

64

Freedman,DavidA.,2008,“Onregressionadjustmentstoexperimentaldata,”Ad-vancesinAppliedMathematics,40,180–93.

Frieden,ThomasR.,2017,“Evidenceforhealthdecisionmaking—beyondrandom-ized,controlledtrials,”NewEnglandJournalofMedicine,377,465-75.

Garfinkel,IrwinandCharlesF.Manski,1992,“Introduction,”inIrwinGarfinkelandCharlesF.Manski,eds.,Evaluatingwelfareandtrainingprograms,Cambridge,MA.HarvardUniversityPress.1–22.

Gerber,AlanS.andDonaldP.Green,2012,FieldExperiments,NewYork.Norton.Gertler,PaulJ.,SebastianMartinez,PatrickPremand,LauraB.Rawlings,andChristelM.J.Vermeersch,2016,Impactevaluationinpractice,2ndEdition,Washington,DC.Inter-AmericanDevelopmentBankandWorldBank.

Goldberger,ArthurS.andCharlesF.Manski,1995,“ReviewArticle:TheBellCurvebyHerrnsteinandMurray,”JournalofEconomicLiterature,33(2),762-76.

Greenberg,DavidandMarkShroder,2004,Thedigestofsocialexperiments(3rded.),Washington,DC.UrbanInstitutePress.

Greenberg,David,MarkShroder,andMatthewOnstott,1999,“Thesocialexperi-mentmarket,”JournalofEconomicPerspectives,13(3),157–72.

Gueron,JudithM.andHowardRolston,2013,Fightingforreliableevidence,NewYork,RussellSage.

Guyatt,Gordon,DavidL.Sackett,andDeborahJ.CookfortheEvidence-BasedMedi-cineWorkingGroup,1994,“Users’guidestothemedicalliteratureII:howtouseanarticleabouttherapyorprevention.B.Whatweretheresultsandwilltheyhelpmeincaringformypatients?”JournaloftheAmericanMedicalAssociation,271(1),59–63.

Harrison,GlennW.,2013,“Fieldexperimentsandmethodologicalintolerance,”Jour-nalofEconomicMethodology,20(2),103–17.

Harrison,GlennW.,2014,“Impactevaluationandwelfareevaluation,”EuropeanJournalofDevelopmentResearch,26,39–45.

Harrison,GlennW.,2014,“Cautionarynotesontheuseoffieldexperimentstoad-dresspolicyissues,”OxfordReviewofEconomicPolicy,30(4),753-63.

Hausman,JerryA.andDavidA.Wise,1985,“Technicalproblemsinsocialexperi-mentation:costversuseaseofanalysis,”inJerryA.HausmanandDavidA.Wise,eds.,SocialExperimentation,Chicago,IL.ChicagoUniversityPress.187–220.

Heckman,JamesJ.,1992,“Randomizationandsocialpolicyevaluation,”inCharlesF.ManskiandIrwinGarfinkel,eds.,Evaluatingwelfareandtrainingprograms,Cam-bridge,MA.HarvardUniversityPress.547–70.

Heckman,JamesJ.,1997,“Instrumentalvariables:astudyofimplicitbehavioralas-sumptionsusedinmakingprogramevaluations,”JournalofHumanResources,32(3),441–62.

Heckman,JamesJ.,2005,“Thescientificmodelofcausality,”SociologicalMethodol-ogy,35(1),1-97.

Heckman,JamesJ.,2008,“Econometriccausality,”InternationalStatisticalReview,76(1),1-27.

65

Heckman,JamesJ.,2010,“Buildingbridgesbetweenstructuralandprogramevalua-tionapproachestoevaluatingpolicy,”JournalofEconomicLiterature,48(2),356-98.

Heckman,JamesJ.,NeilHohman,andJeffreySmith,withtheassistanceofMichaelKhoo,2000,“Substitutionanddropoutbiasinsocialexperiments:astudyofaninfluentialsocialexperiment,”QuarterlyJournalofEconomics,115(2),651–94.

Heckman,JamesJ.,RobertJ.Lalonde,andJeffreyA.Smith,1999,“Theeconomicsandeconometricsofactivelabormarkets,”Chapter31inAshenfelter,OrleyandDa-vidCard,eds.Handbookoflaboreconomics,Amsterdam.North-Holland,3(A),1866–2097.

Heckman,JamesJ.,RodrigoPinto,andPeterSavelyev,2013,“Understandingthemechanismsthroughwhichaninfluentialearlychildhoodprogramboostedadultoutcomes,”AmericanEconomicReview,103(6),2052–86.

Heckman,JamesJ.andJeffreySmith,1995,“Assessingthecaseforsocialexperi-ments,”JournalofEconomicPerspectives,9(2),85-110.

Heckman,JamesJ.,JeffreySmith,andNancyClements,1997,“Makingthemostoutofprogrammeevaluationsandsocialexperiments:accountingforheterogeneityinprogrammeimpacts,”ReviewofEconomicStudies,64(4),487–535.

Heckman,JamesJ.andSergioUrzúa,2010,“ComparingIVwithstructuralmodels:whatsimpleIVcanandcannotidentify,”JournalofEconometrics,156,27-37.

Heckman,JamesJ.andEdwardVytlacil,2005,“Structuralequations,treatmentef-fects,andeconometricpolicyevaluation,”Econometrica,73(3),669–738.

Heckman,JamesJ.andEdwardJ.Vytlacil,2007,“Econometricevaluationofsocialprograms,Part1:causalmodels,structuralmodels,andeconometricpolicyeval-uation,”Chapter70inJamesJ.HeckmanandEdwardE.Leamer,eds.,HandbookofEconometrics,6B,4779–874.

Horton,Richard,2000,“Commonsenseandfigures:therhetoricofvalidityinmedi-cine:BradfordHillmemoriallecture1999,”Statisticsinmedicine,19,3149–64.

Hotz,V.Joseph,GuidoW.Imbens,andJulieH.Mortimer,2005,“Predictingtheeffi-cacyoffuturetrainingprogramsusingpastexperienceatotherlocations,”Jour-nalofEconometrics,125,241–70.

Hsieh,Chang-taiandMiguelUrquiola,2006,“Theeffectsofgeneralizedschoolchoiceonachievementandstratification:evidencefromChile’svoucherpro-gram,”JournalofPublicEconomics,90,1477–1503.

Hurwicz,Leonid,1966,“Onthestructuralformofinterdependentsystems,”StudiesinLogicandtheFoundationsofMathematics,44,232-9.

Imbens,GuidoW.,2004,“Nonparametricestimationofaveragetreatmenteffectsunderexogeneity:areview,”ReviewofEconomicsandStatistics,86(1),4–29.

Imbens,GuidoW.,2010,“BetterLATEthannothing:somecommentsonDeaton(2009)andHeckmanandUrzua,”JournalofEconomicLiterature,48(2),399–423.

Imbens,GuidoW.andJoshuaD.Angrist,1994,“Identificationandestimationoflocalaveragetreatmenteffects,”Econometrica,62(2),467–75.

Imbens,GuidoW.andMichalKolesár,2016,“Robuststandarderrorsinsmallsam-ples:somepracticaladvice,”ReviewofEconomicsandStatistics,98(4),701-12.

66

Imbens,GuidoW.andJeffreyM.Wooldridge,2009,“Recentdevelopmentsintheeconometricsofprogramevaluation,”JournalofEconomicLiterature,47(1),5–86.

InternationalCommitteeofMedicalJournalEditors,2015,Recommendationsfortheconduct,reporting,editing,andpublicationofscholarlyworkinmedicaljournals,http://www.icmje.org/icmje-recommendations.pdf(accessed,August20,2016.)

J_PAL,2017,https://www.povertyactionlab.org/about-j-pal,(accessed,August21,2017).

Kahneman,DanielandGaryKlein,2009,“Conditionsforintuitiveexpertise:afailuretodisagree,”AmericanPsychologist,64(6),515–26.

Karlan,DeanandJacobAppel,2011,Morethangoodintentions:howaneweconom-icsishelpingtosolveglobalpoverty,NewYork.Dutton.

Karlan,Dean,NathanealGoldbergandJamesCopestake,2009,“Randomizedcon-trolledtrialsarethebestwaytomeasureimpactofmicrofinanceprogramsandimprovemicrofinanceproductdesigns,”EnterpriseDevelopmentandMicro-finance,20(3),167–76.

Kasy,Maximilian,2016,“Whyexperimentersmightnotwanttorandomize,andwhattheycoulddoinstead,”PoliticalAnalysis,1–15doi:10.1093/pan/mpw012

Kramer,Peter,2016,Ordinarilywell:thecaseforantidepressants,NewYork.Farrar,Straus,andGiroux.

Kremer,MichaelandAlakaHolla,2009,“Improvingeducationinthedevelopingworld:whathavewelearnedfromrandomizedevaluations?”AnnualReviewofEconomics,1,513–42.

Lalonde,RobertJ.,1986,“Evaluatingtheeconometricevaluationsoftrainingpro-gramswithexperimentaldata,”AmericanEconomicReview,76(4),604-20.

Lehman,Erich.L.andJosephP.Romano,2005,Testingstatisticalhypotheses(thirdedition),NewYork.Springer.

Levy,Santiago,2006,Progressagainstpoverty:sustainingMexico’sProgresa-Opor-tunidadesprogram,Washington,DC.Brookings.

Mackie,JohnL.,1974,Thecementoftheuniverse:astudyofcausation,Oxford.Ox-fordUniversityPress.

Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeeler,andArleenLeibowitz,1988a,“Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedexperiment,”AmericanEconomicReview,77(3),251–77.

Manning,WillardG.,JosephP.Newhouse,NaihuaDuan,EmmettKeeler,BernadetteBenjamin,ArleenLeibowitz,M.SusanMarquis,andJackZwanziger,1988b,Healthinsuranceandthedemandformedicalcare:evidencefromarandomizedex-periment,SantaMonica,CA.RAND.

Manski,CharlesF.,2004,“Treatmentrulesforheterogeneouspopulations,”Econo-metrica,72(4),1221-46.

Manski,CharlesF.,2013,Publicpolicyinanuncertainworld:analysisanddecisions,Cambridge,MA.HarvardUniversityPress.

Manski,CharlesF.andAlekseyTetenov,2016,“Sufficienttrialsizetoinformclinicalpractice,”PNAS,113(38),10518-23.

67

Metcalf,CharlesE.,1973,“Makinginferencesfromcontrolledincomemaintenanceexperiments,”AmericanEconomicReview,63(3),478–83.

Moffitt,Robert,1979,“ThelaborsupplyresponseintheGaryexperiment,”JournalofHumanResources,14(4),477–87.

Moffitt,Robert,1992,“Evaluationmethodsforprogramentryeffects,”Chapter6inCharlesManskiandIrwinGarfinkel,Evaluatingwelfareandtrainingprograms,Cambridge,MA.HarvardUniversityPress,231–52.

Moffitt,Robert,2004,“Theroleofrandomizedfieldtrialsinsocialscienceresearch:aperspectivefromevaluationsofreformsofsocialwelfareprograms,”AmericanBehavioralScientist,47(5),506–40

Morgan,KariLockandDonaldB.Rubin,2012,“Rerandomizationtoimprovecovari-atebalanceinexperiments,”AnnalsofStatistics,40(2),1263–82.

Muller,SeánM.,2015,“Causalinteractionandexternalvalidity:obstaclestothepol-icyrelevanceofrandomizedevaluations,”WorldBankEconomicReview,29,S217–S225.

Orcutt,GuyH.andAliceG.Orcutt,1968,“Incentiveanddisincentiveexperimenta-tionforincomemaintenancepolicypurposes,”AmericanEconomicReview,58(4),754–72.

Pearl,JudeaandEliasBareinboim,2011,“Transportabilityofcausalandstatisticalrelations:aformalapproach,”Proceedingsofthe25thAAAIConferenceonArtificialIntelligence,AAAIPress,247-54,

Pearl, Judea and Elias Bareinboim, 2014, “External validity: from do-calculus to trans-portability across populations,” Statistical Science, 29(4), 579-95.

Rodrik,Dani,2006,personalemailcommunication.Rothwell,PeterM.,2005,“Externalvalidityofrandomizedcontrolledtrials:‘towhomdotheresultsofthetrialapply’”,Lancet,365,82–93.

Russell,Bertrand,2008[1912],Theproblemsofphilosophy,Rockville,MD.ArcManor.

Sackett,DavidL.,WilliamM.C.Rosenberg,J.A.MuirGray,R.BrianHaynesandW.ScottRichardson,1996,“Evidencebasedmedicine:whatitisandwhatitisn’t,”BritishMedicalJournal,312(January13),71–2.

Savage,LeonardJ.,1962,“Subjectiveprobabilityandstatisticalpractice,”inG.A.Bar-nardandD.R.Cox,eds.,TheFoundationsofStatisticalInference,London.Me-thuen.9-35.

Scriven,Michael,1974,“Evaluationperspectivesandprocedures,”inW.JamesPop-ham,ed.,Evaluationineducation—currentapplications,Berkeley,CA.McCutchanPublishingCorporation.

Senn,Stephen,1994,“Testingforbaselinebalanceinclinicaltrials,”StatisticsinMedicine,13,1715–26.

Senn,Stephen,2013,“Sevenmythsofrandomizationinclinicaltrials,”StatisticsinMedicine32,1439–50.

Shadish,WilliamR.,ThomasD.Cook,andDonaldT.Campbell,2002,Experimentalandquasi-experimentaldesignsforgeneralizedcausalinference,Boston,MA.HoughtonMifflin.

Simpson,Adrian,2017,“Themisdirectionofpublicpolicy:comparingandcombin-ingstandardisedeffectsizes,”JournalofEducationalPolicy32(4),450-66.

68

Stuart,ElizabethA.,StephenR.Cole,andCatharineP.Bradshaw,andPhilipJ.Leaf,2011,“Theuseofpropensityscorestoassessthegeneralizabilityofresultsfromrandomizedtrials,”JournaloftheRoyalStatisticalSocietyA,174(2),369–86.

Student(W.S.Gosset),1938,“Comparisonbetweenbalancedandrandomarrange-mentsoffieldplots,”Biometrika,29(3/4),363-78.

Svorencik,Andrej,2015,Theexperimentalturnineconomics:ahistoryofexperi-mentaleconomics,UtrechtSchoolofEconomics,DissertationSeries#29,http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2560026

Todd,PetraE.andKennethJ.Wolpin,2006,“Assessingtheimpactofaschoolsub-sidyprograminMexico:usingasocialexperimenttovalidateadynamicbehav-ioralmodelofchildschoolingandfertility,”AmericanEconomicReview,96(5),1384–1417.

Todd,PetraE.andKennethJ.Wolpin,2008,“Exanteevaluationofsocialprograms,”Annalesd’EconomieetdelaStatistique,91/92,263–91.

U.S.DepartmentofEducation,InstituteofEducationSciences,NationalCenterforEducationEvaluationandRegionalAssistance,2003,Identifyingandimplement-ingeducationalpracticessupportedbyrigorousevidence:auserfriendlyguide,Washington,DC.InstituteofEducationSciences.

Vandenbroucke,JanP.,2004,“Whenareobservationalstudiesascredibleasran-domizedcontrolledtrials?”TheLancet,363:1728–31.

Vandenbroucke,JanP..2009,“TheHRTcontroversy:observationalstudiesandRCTsfallinline,”TheLancet,373,1233-5.

Vivalt,Eva,2015,“Howmuchcanwegeneralizefromimpactevaluations?”NYU,un-published.http://evavivalt.com/wp-content/uploads/2014/10/Vivalt-JMP-10.27.14.pdf

White,Halbert,1980,“Aheteroskedasticity-consistentcovariancematrixestimatorandadirecttestforheteroskedasticity,”Econometrica,50(1),1–25.

Wise,DavidA.,1985,“Abehavioralmodelversusexperimentation:theeffectsofhousingsubsidiesonrent,”inP.BruckerandR.Pauly,eds.MethodsofOperationsResearch,50,VerlagAnonHain.441–89.

Wolpin,KennethI.,2013,Thelimitsofinferencewithouttheory,Camridge,MA.MITPress.

Worrall,John,2007,“Evidenceinmedicineandevidence-basedmedicine,”Philoso-phyCompass,2/6,981–1022.

Worrall,John,2008,“Evidenceandethicsinmedicine,”PerspectivesinBiologyandMedicine,51(3),418-31.

Yates,Frank,1939,“Thecomparativeadvantagesofsystematicandrandomizedar-rangementsinthedesignofagriculturalandbiologicalexperiments,”Biometrika,30(3/4),440-66.

Young,Alwyn,2016,“ChannelingFisher:randomizationtestsandthestatisticalin-significanceofseeminglysignificantexperimentalresults,”LondonSchoolofEco-nomics,WorkingPaper,Feb.

Ziliak,StephenT.,2014,“Balancedversusrandomizedfieldexperimentsineconom-ics:whyW.S.Gossetaka‘Student’matters,”ReviewofBehavioralEconomics,1,167–208.

69

Appendix: Monte Carlo experiment for an RCT with outliers

In this illustrative example, there is a parent population, each member of which has his or her own treatment effect; these are continuously distributed with a shifted lognormal distribution with zero mean, so that the population ATE is zero. The individual treatment effects β are distributed so that $\beta + e^{0.5} \sim \Lambda(0, 1)$, for the standardized lognormal distribution Λ. In the absence of treatment, everyone in the sample records zero, so the estimated average treatment effect in any one trial is simply the mean outcome among the n treatments. For values of n equal to 25, 50, 100, 200, and 500, we draw from the parent population 100 trial samples each of size 2n; with five values of n, this gives us 500 trial samples in all. Because of sampling, the true ATE in each trial sample will not be zero. For each of these 500 samples, we randomize into n controls and n treatments, estimate the ATE and its t-value (using the standard two-sample t-value or, equivalently, by running a regression with robust t-values), and then repeat 1,000 times, so that we have 1,000 ATE estimates and t-values for each of the 500 trial samples. These allow us to assess the distribution of ATE estimates and their nominal t-values for each trial.
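The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used for Table A1: the function names `draw_effects` and `simulate`, the seed, and the use of 1.96 as the two-sided 5 percent critical value are all choices made here, not taken from the text. Because all untreated outcomes are zero and the two arms are the same size, the pooled and unequal-variance two-sample standard errors both reduce to the treated-group standard deviation divided by the square root of n, which is what the sketch uses.

```python
import numpy as np

SHIFT = np.exp(0.5)  # mean of the standard lognormal, so shifted effects have mean zero


def draw_effects(rng, size):
    """Individual treatment effects: standard lognormal shifted to zero mean."""
    return rng.lognormal(mean=0.0, sigma=1.0, size=size) - SHIFT


def simulate(n, n_samples=100, n_randomizations=1000, crit=1.96, seed=0):
    """Return (mean ATE estimate, mean nominal t-value, rejection rate in percent).

    crit=1.96 is an assumed critical value; the text does not state which was used.
    """
    rng = np.random.default_rng(seed)
    estimates, t_values = [], []
    for _ in range(n_samples):
        beta = draw_effects(rng, 2 * n)  # one trial sample of size 2n
        # Each row is one randomization: a random set of n treated units.
        order = np.argsort(rng.random((n_randomizations, 2 * n)), axis=1)
        y_treated = beta[order[:, :n]]
        # Untreated outcomes are all zero, so the control mean and variance are
        # zero and the difference in means is just the treated mean.
        ate_hat = y_treated.mean(axis=1)
        se = y_treated.std(axis=1, ddof=1) / np.sqrt(n)
        estimates.append(ate_hat)
        t_values.append(ate_hat / se)
    estimates = np.concatenate(estimates)
    t_values = np.concatenate(t_values)
    rejected = 100.0 * np.mean(np.abs(t_values) > crit)
    return estimates.mean(), t_values.mean(), rejected
```

Calling `simulate(50)`, for example, runs the design with 50 treatments and 50 controls. Since the random number generator and seed are assumptions of this sketch, it would not reproduce the reported figures digit for digit.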

The results are shown in Table A1. Each row corresponds to a sample size. In each row, we show the results of 100,000 individual trials, composed of 1,000 replications on each of the 100 trial (experimental) samples. The columns are averaged over all 100,000 trials.

Table A1: RCTs with skewed treatment effects

Sample size    Mean of ATE    Mean of nominal    Fraction null
(n per arm)    estimates      t-values           rejected (percent)
 25             0.0268        –0.4274            13.54
 50             0.0266        –0.2952            11.20
100            –0.0018        –0.2600             8.71
200             0.0184        –0.1748             7.09
500            –0.0024        –0.1362             6.06

Note: 1,000 randomizations on each of 100 draws of the trial sample, randomly drawn from a lognormal distribution of treatment effects shifted to have a zero mean.

The last column shows the fraction of times that the null, which is true in the population, is rejected in the trial samples; this is our key result. When there are only 50 treatments and 50 controls (row 2), the (true) null is rejected 11.2 percent of the time, instead of the 5 percent that we would like and would expect if we were unaware of the problem. When there are 500 units in each arm, the rejection rate is 6.06 percent, much closer to the nominal 5 percent.

Figure A1: Estimates of an ATE with an outlier in the trial sample

Figure A1 illustrates the estimated ATEs from an extreme trial sample from the simulations in the second row, with 100 observations in total; the histogram shows the 1,000 estimates of the ATE for that trial sample. This trial sample has a single large outlying treatment effect of 48.3; the mean (s.d.) of the other 99 observations is –0.51 (2.1). When the outlier is in the treatment group, we get the right-hand side of the figure; when it is in the control group, we get the left-hand side.
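A short calculation, using only the rounded numbers quoted above, suggests why the estimates split into two groups. The estimate is the mean outcome of the 50 treated units. If the outlier is treated, the other 49 treated units are a random draw from the remaining 99 observations, each with expected value –0.51; if the outlier is in the control group, all 50 treated units are drawn from those 99. The conditional expectations are therefore approximately:

```latex
\begin{align*}
E\bigl[\widehat{\mathrm{ATE}} \mid \text{outlier treated}\bigr]
  &= \frac{48.3 + 49 \times (-0.51)}{50} = \frac{23.31}{50} \approx 0.47, \\
E\bigl[\widehat{\mathrm{ATE}} \mid \text{outlier in control}\bigr]
  &= -0.51 .
\end{align*}
```

The two clusters of estimates are thus centered almost a full unit apart (about 0.98), even though the population ATE is zero, and which cluster a given trial lands in depends only on the arm to which the single outlier happens to be assigned.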

[Figure A1: histogram. Horizontal axis: 1,000 estimates of average treatment effect. Vertical axis: density.]
