How Much Can We Generalize? Measuring the External
Validity of Impact Evaluations
Eva Vivalt∗
New York University
August 31, 2015
Abstract
Impact evaluations aim to predict the future, but they are rooted in particular contexts and to what extent they generalize is an open and important question. I founded an organization to systematically collect and synthesize impact evaluation results on a wide variety of interventions in development. These data allow me to answer this and other questions for the first time using a large data set of studies. I consider several measures of generalizability, discuss the strengths and limitations of each metric, and provide benchmarks based on the data. I use the example of the effect of conditional cash transfers on enrollment rates to show how some of the heterogeneity can be modelled and the effect this can have on the generalizability measures. The predictive power of the model improves over time as more studies are completed. Finally, I show how researchers can estimate the generalizability of their own study using their own data, even when data from no comparable studies exist.
∗ E-mail: [email protected]. I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Cornell University, Princeton University, the University of Toronto, the London School of Economics, the Australian National University, and the University of Ottawa, among others, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I am also grateful for the hard work put in by many at AidGrade over the duration of this project, including but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Mi Shen, Ning Zhang, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto.
1 Introduction
In the last few years, impact evaluations have become extensively used in development
economics research. Policymakers and donors typically fund impact evaluations precisely to
figure out how effective a similar program would be in the future to guide their decisions
on what course of action they should take. However, it is not yet clear how much we can
extrapolate from past results or under which conditions. Further, there is some evidence
that even a similar program, in a similar environment, can yield different results. For ex-
ample, Bold et al. (2013) carry out an impact evaluation of a program to provide contract
teachers in Kenya; this was a scaled-up version of an earlier program studied by Duflo, Du-
pas and Kremer (2012). The earlier intervention studied by Duflo, Dupas and Kremer was
implemented by an NGO, while Bold et al. compared implementation by an NGO and the
government. While Duflo, Dupas and Kremer found positive effects, Bold et al. showed
significant results only for the NGO-implemented group. The different findings in the same
country for purportedly similar programs point to the substantial context-dependence of im-
pact evaluation results. Knowing this context-dependence is crucial in order to understand
what we can learn from any impact evaluation.
While the main reason to examine generalizability is to aid interpretation and improve
predictions, it would also help to direct research attention to where it is most needed. If
generalizability were higher in some areas, fewer papers would be needed to understand how
people would behave in a similar situation; conversely, if there were topics or regions where
generalizability was low, it would call for further study. With more information, researchers
can better calibrate where to direct their attentions to generate new insights.
It is well-known that impact evaluations only happen in certain contexts. For example,
Figure 1 shows a heat map of the geocoded impact evaluations in the data used in this paper
overlaid by the distribution of World Bank projects (black dots). Both sets of data are geo-
graphically clustered, and whether or not we can reasonably extrapolate from one to another
depends on how much related heterogeneity there is in treatment effects. Allcott (forthcom-
ing) recently showed that site selection bias was an issue for randomized controlled trials
(RCTs) on a firm’s energy conservation programs. Microfinance institutions that run RCTs
and hospitals that conduct clinical trials are also selected (Allcott, forthcoming), and World
Bank projects that receive an impact evaluation are different from those that do not (Vivalt,
2015). Others have sought to explain heterogeneous treatment effects in meta-analyses of
specific topics (e.g. Saavedra and Garcia, 2013, among many others for conditional cash
transfers), or to argue they are so heterogeneous they cannot be adequately modelled (e.g.
Deaton, 2011; Pritchett and Sandefur, 2013).
Figure 1: Growth of Impact Evaluations and Location Relative to Programs
The figure on the left shows a heat map of the impact evaluations in AidGrade’s database overlaid by black
dots indicating where the World Bank has done projects. While there are many other development
programs not done by the World Bank, this figure illustrates the great numbers and geographical
dispersion of development programs. The figure on the right plots the number of studies that came out in
each year that are contained in each of three databases described in the text: 3ie’s title/abstract/keyword
database of impact evaluations; J-PAL’s database of affiliated randomized controlled trials; and AidGrade’s
database of impact evaluation results data.
Impact evaluations are still increasing exponentially in number and in the resources devoted to them. The World Bank recently received a major grant from the UK aid
agency DFID to expand its already large impact evaluation work; the Millennium Challenge
Corporation has committed to conduct rigorous impact evaluations for 50% of its activities,
with “some form of credible evaluation of impact” for every activity (Millennium Challenge
Corporation, 2009); and the U.S. Agency for International Development is also increasingly
invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of
program funds to evaluation.1
Yet while impact evaluations are still growing in development, a few thousand are al-
ready complete. Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL,
a center for development economics research, have completed each year; alongside are the
number of development-related impact evaluations released that year according to 3ie, which
keeps a directory of titles, abstracts, and other basic information on impact evaluations more
broadly, including quasi-experimental designs; finally, the dashed line shows the number of
papers that came out in each year that are included in AidGrade’s database of impact eval-
uation results, which will be described shortly.
1 While most of these are less rigorous “performance evaluations”, country mission leaders are supposed to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year plans (USAID, 2011).
In short, while we do impact evaluation to figure out what will happen in the future,
many issues have been raised about how well we can extrapolate from past impact evalua-
tions, and despite the importance of the topic, previously we were able to do little more
than guess or examine the question in narrow settings as we did not have the data. Now we
have the opportunity to address speculation, drawing on a large, unique dataset of impact
evaluation results.
I founded a non-profit organization dedicated to gathering this data. That organization,
AidGrade, seeks to systematically understand which programs work best where, a task that
requires also knowing the limits of our knowledge. To date, AidGrade has conducted 20
meta-analyses and systematic reviews of different development programs.2 Data gathered
through meta-analyses are the ideal data to answer the question of how much we can ex-
trapolate from past results, and since data on these 20 topics were collected in the same
way, coding the same outcomes and other variables, we can look across different types of
programs to see if there are any more general trends. Currently, the data set contains 647 pa-
pers on 210 narrowly-defined intervention-outcome combinations, with the greater database
containing 15,021 estimates.
I define generalizability and discuss several metrics with which to measure it. Other
disciplines have considered generalizability more, so I draw on the literature relating to
meta-analysis, which has been most well-developed in medicine, as well as the psychometric
literature on generalizability theory (Higgins and Thompson, 2002; Shavelson and Webb,
2006; Briggs and Wilson, 2007). The measures I discuss could also be used in conjunction
with any model that seeks to explain variation in treatment effects (e.g. Dehejia, Pop-Eleches
and Samii, 2015) to quantify the proportion of variation that such a model explains. Since
some of the analyses will draw upon statistical methods not commonly used in economics,
I will use the concrete example of conditional cash transfers (CCTs), which are relatively
well-understood and on which many papers have been written, to elucidate the issues.
While this paper focuses on results for impact evaluations of development programs, this
is only one of the first areas within economics to which these kinds of methods can be applied.
In many of the sciences, knowledge is built through a combination of researchers conducting
individual studies and other researchers synthesizing the evidence through meta-analysis.
This paper begins that natural next step.
2 Throughout, I will refer to all 20 as meta-analyses, but some did not have enough comparable outcomes for meta-analysis and became systematic reviews.
2 Theory
2.1 Heterogeneous Treatment Effects
I model treatment effects as potentially depending on the context of the intervention.
Each impact evaluation is on a particular intervention and covers a number of outcomes.
The relationship between an outcome, the inputs that were part of the intervention, and the
context of the study is complex. In the simplest model, we can imagine that context can be
represented by a “contextual variable”, C, such that:
Zj = α + βTj + δCj + γTjCj + εj    (1)
where j indexes the individual, Z represents the value of an aggregate outcome such as
“enrollment rates”, T indicates being treated, and C represents a contextual variable, such
as the type of agency that implemented the program.3
In this framework, a particular impact evaluation might explicitly estimate:
Zj = α + β1Tj + εj    (2)
but, as Equation 1 can be re-written as Zj = α + (β + γCj)Tj + δCj + εj, what β1 is really
capturing is the effect β1 = β + γC. When C varies, unobserved, in different contexts, the
variance of β1 increases.
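To make the mechanism concrete, the following sketch simulates studies run in different contexts under Equation 1 and shows how unobserved variation in C inflates the spread of the estimated β1 across studies. It is purely illustrative: the coefficient values, sample sizes, and the uniform distribution of C are my own assumptions, not figures from the paper.

```python
# Illustrative simulation (not from the paper): how an unobserved contextual
# variable C inflates the across-study variance of estimated treatment effects.
# All parameter values below are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, delta, gamma = 0.0, 0.10, 0.05, 0.20  # hypothetical true coefficients
n_studies, n_per_study = 200, 1000

betas_hat = []
for _ in range(n_studies):
    C = rng.uniform(0, 1)                    # context varies across studies, unobserved
    T = rng.integers(0, 2, n_per_study)      # individual-level treatment assignment
    eps = rng.normal(0, 1, n_per_study)
    Z = alpha + beta * T + delta * C + gamma * T * C + eps
    # Each study estimates Z = a + b1*T + e, so b1 recovers beta + gamma*C on average
    b1 = Z[T == 1].mean() - Z[T == 0].mean()
    betas_hat.append(b1)

print("mean of estimated effects:", np.mean(betas_hat))      # about beta + gamma*E[C]
print("variance of estimated effects:", np.var(betas_hat))   # sampling noise + gamma^2*var(C)
```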
This is the simplest case. One can imagine that the true state of the world has “interac-
tion effects all the way down”.
Interaction terms are often considered a second-order problem. However, that intuition
could stem from the fact that we usually look for interaction terms within an already fairly
homogeneous dataset - e.g. data from a single country, at a single point in time, on a par-
ticularly selected sample.
Not all aspects of context need matter to an intervention’s outcomes. The set of con-
textual variables can be divided into a critical set on which outcomes depend and a set on
which they do not; I will ignore the latter. Further, the relationship between Z and C can
vary by intervention or outcome. For example, school meals programs might have more of
an effect on younger children, but scholarship programs could plausibly affect older children
more. If one were to regress effect size on the contextual variable “age”, we would get differ-
ent results depending on which intervention and outcome we were considering. Therefore,
3 Z can equally well be thought of as the average individual outcome for an intervention. Throughout, I take high values for an outcome to represent a beneficial change unless otherwise noted; if an outcome represents a negative characteristic, like incidence of a disease, its sign will be flipped before analysis.
it will be important in this paper to look only at a restricted set of contextual variables
which could plausibly work in a similar way across different interventions. Additional anal-
ysis could profitably be done within some interventions, but this is outside the scope of this
paper.
Generalizability will ultimately depend on the heterogeneity of treatment effects. The
next section formally defines generalizability for use in this paper.
2.2 Generalizability: Definitions and Measurement
Definition 1 Generalizability is the ability to predict results accurately out of sample.
Definition 2 Local generalizability is the ability to predict results accurately in a particular
out-of-sample group.
There are several ways to operationalize these definitions. The ability to predict
results hinges both on the variability of the results and the proportion that can be
explained. For example, if the overall variability in a set of results is high, this might not
be as concerning if the proportion of variability that can be explained is also high.
It is straightforward to measure the variance in results. However, these statistics need
to be benchmarked in order to know what is a “high” or “low” variance. One advantage
of the large data set used in this paper is that I can use it to benchmark the results
from different intervention-outcome combinations against each other. This is not the first
paper to tentatively suggest a scale. Other rules of thumb have also been created in this
manner, such as those used to consider the magnitude of effect sizes (0-0.2 SD = “small”,
0.2-0.5 = “medium”, > 0.5 SD = “large”) (Cohen, 1988) or the measure of the impact
of heterogeneity on meta-analysis results, I2 (0.25=“low”, 0.5=“medium”, 0.75=“high”)
(Higgins et al., 2003). I can also compare across-paper variation to within-paper variation,
with the idea that within-study variation should represent a lower bound to across-study
variation within the same intervention-outcome combination. Further, I can create variance
benchmarks based on back-of-the-envelope calculations for what the variance would imply
for predictive power under a set of assumptions. This will be discussed in more detail later.
One potential drawback to considering the variance of studies’ results is that we might
be concerned that studies that have higher effect sizes or are measured in terms of units
with larger scales have larger variances. This would limit us to making comparisons only
between data with the same scale. We could: 1) restrict attention to those outcomes
in the same natural units (e.g. enrollment rates in percentage points); 2) convert results to
be in terms of a common unit, such as standard deviations4; 3) scale the standard deviation
by the mean result, creating the coefficient of variation. The coefficient of variation
represents the inverse of the signal-to-noise ratio, and as a unitless figure can be compared
across intervention-outcome combinations with different natural units. It is not immune to
criticism, however, particularly in that it may result in large values as the mean approaches
zero.5
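As a small illustration of this caveat, the sketch below computes the coefficient of variation for two hypothetical sets of effect sizes, one with a mean well away from zero and one with a mean close to zero; the numbers are invented solely to show how the CV blows up in the latter case.

```python
# Minimal illustration of the coefficient of variation (CV = sd / |mean|) and its
# instability when the mean effect size approaches zero. Values are hypothetical.
import numpy as np

effects_stable = np.array([0.15, 0.20, 0.25, 0.30])        # mean well away from zero
effects_near_zero = np.array([-0.05, 0.02, 0.01, 0.03])    # mean close to zero

for name, es in [("stable mean", effects_stable), ("near-zero mean", effects_near_zero)]:
    cv = es.std(ddof=1) / abs(es.mean())
    print(f"{name}: mean={es.mean():.3f}, sd={es.std(ddof=1):.3f}, CV={cv:.1f}")
```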
All the measures discussed so far focus on variation. However, if we could explain the
variation, it would no longer worsen our ability to make predictions in a new setting, so
long as we had all the necessary data from that setting, such as covariates, with which to
extrapolate.
To explain variation, we need a model. The meta-analysis literature suggests two
general types of models which can be parameterized in many ways: fixed-effect models and
random-effects models.
Fixed-effect models assume there is one true effect of a particular program and all
differences between studies can be attributed simply to sampling error. In other words:
Yi = θ + εi    (3)
where Yi is the observed effect size of a particular study, θ is the true effect and εi is the
error term.
Random-effects models do not make this assumption; the true effect could potentially
vary from context to context. Here,
Yi = θi + εi    (4)
   = θ + ηi + εi    (5)
where θi is the effect size for a particular study i, θ is the mean true effect size, ηi is a
particular study’s divergence from that mean true effect size, and εi is the error. Random-
effects models are more plausible and they are necessary if we think there are heterogeneous
treatment effects, so I use them in this paper. Random-effects models can also be modified
by the addition of explanatory variables, at which point they are called mixed models; I will
also use mixed models in this paper.
Sampling variance, var(Yi|θi), is denoted σ² and between-study variance, var(θi), τ².
4 This can be problematic if the standard deviations themselves vary but is a common approach in the meta-analysis literature in lieu of a better option.
5 This paper follows convention and reports the absolute value of the coefficient of variation wherever it appears.
This variation in observed effect sizes is then:
var(Yi) = τ² + σ²    (6)
and the proportion of the variation that is not sampling error is:
I² = τ² / (τ² + σ²)    (7)
The I2 is an established metric in the meta-analysis literature that helps determine
whether a fixed or random effects model is more appropriate; the higher I2, the less plausible
it is that sampling error drives all the variation in results. I2 is considered “low” at 0.25,
“medium” at 0.5, and “high” at 0.75 (Higgins et al., 2003).6
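To make Equation 7 concrete, here is a minimal sketch of computing I² for a hypothetical set of study estimates. The paper itself estimates τ² with the hierarchical Bayesian model of Section 2.3; this sketch instead uses the simpler DerSimonian-Laird method-of-moments estimator and a plain average of the squared standard errors as the “typical” sampling variance, both of which are my simplifications.

```python
# Sketch of computing I^2 from study effect sizes and standard errors, using the
# DerSimonian-Laird estimator for tau^2 (a simplification relative to the paper's
# hierarchical Bayesian approach). Effect sizes and standard errors are hypothetical.
import numpy as np

y = np.array([0.10, 0.25, 0.05, 0.40, 0.15])    # study effect sizes (SMDs)
se = np.array([0.05, 0.08, 0.06, 0.10, 0.07])   # their standard errors

w = 1 / se**2                                   # inverse-variance (fixed-effect) weights
y_fe = np.sum(w * y) / np.sum(w)                # fixed-effect pooled estimate
Q = np.sum(w * (y - y_fe)**2)                   # Cochran's Q
df = len(y) - 1
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)                   # between-study variance, tau^2
sigma2 = np.mean(se**2)                         # one simple choice of "typical" sampling variance
I2 = tau2 / (tau2 + sigma2)                     # equation (7)

print(f"tau^2 = {tau2:.4f}, I^2 = {I2:.2f}")
```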
If we wanted to explain more of the variation, we could do moderator or mediator analysis,
in which we examine how results vary with the characteristics of the study, characteristics of
its sample, or details about the intervention and its implementation. A linear meta-regression
is one way of accomplishing this goal, explicitly estimating:
Yi = β0 + Σn βnXn + ηi + εi

where Xn are explanatory variables. This is a mixed model and, upon estimating it, we can calculate several additional statistics: the amount of residual variation in Yi after accounting for Xn, varR(Yi - Ŷi); the coefficient of residual variation, CVR(Yi - Ŷi); and the residual I²R. Further, we can examine the R² of the meta-regression.
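A rough sketch of these quantities for a toy meta-regression follows. For simplicity it fits an unweighted OLS moderator regression rather than the full mixed model with a study-level random effect, and the effect sizes and covariate values are hypothetical.

```python
# Sketch of a linear meta-regression and the residual heterogeneity statistics
# discussed above. Unweighted OLS for simplicity; the paper's mixed models also
# include a study-level random effect. All data are hypothetical.
import numpy as np

y = np.array([0.32, 0.15, 0.08, 0.22, 0.05, 0.12])      # study effect sizes
X = np.column_stack([
    np.ones(6),                                          # intercept
    np.array([0.45, 0.80, 0.92, 0.60, 0.95, 0.85]),      # e.g. baseline enrollment rate
])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
resid = y - y_hat

var_total = np.var(y, ddof=1)                  # var(Yi)
var_resid = np.var(resid, ddof=X.shape[1])     # var_R(Yi - Yi_hat)
cv_resid = np.sqrt(var_resid) / abs(y.mean())  # CV_R, scaled by the mean result
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(f"var(Y)={var_total:.4f}, var_R={var_resid:.4f}, CV_R={cv_resid:.2f}, R^2={r2:.2f}")
```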
It should be noted that a linear meta-regression is only one way of modelling variation in
Yi. The I2, for example, is analogous to the reliability coefficient of classical test theory or
the generalizability coefficient of generalizability theory (a branch of psychometrics), both
of which estimate the proportion of variation that is not error. In this literature, additional
heterogeneity is usually modelled using ANOVA rather than meta-regression. Modelling
variation in treatment effects also does not have to occur only retrospectively at the conclu-
sion of studies; we can imagine that a carefully-designed study could anticipate and estimate
some of the potential sources of variation experimentally.
Table 1 summarizes the different indicators, dividing them into measures of variation and
measures of the proportion of variation that is systematic.
Each of these metrics has its advantages and disadvantages. Table 2 summarizes the
6 The Cochrane Collaboration uses a slightly different set of norms, saying 0-0.4 “might not be important”, 0.3-0.6 “may represent moderate heterogeneity”, 0.5-0.9 “may represent substantial heterogeneity”, and 0.75-1 “considerable heterogeneity” (Higgins and Green, 2011).
Table 1: Summary of heterogeneity measures

var(Yi): measure of variation
varR(Yi - Ŷi): measure of variation; makes use of explanatory variables
CV(Yi): measure of variation
CVR(Yi - Ŷi): measure of variation; makes use of explanatory variables
I²: measure of the proportion of variation that is systematic
I²R: measure of the proportion of variation that is systematic; makes use of explanatory variables
R²: measure of the proportion of variation that is systematic; makes use of explanatory variables
Table 2: Desirable properties of a measure of heterogeneity

var(Yi) and varR(Yi - Ŷi): do not depend on the number of studies in a cell, the precision of individual estimates, or the mean result in the cell; do depend on the estimates' units
CV(Yi) and CVR(Yi - Ŷi): do not depend on the number of studies in a cell, the precision of individual estimates, or the estimates' units; do depend on the mean result in the cell
I² and I²R: do not depend on the number of studies in a cell, the estimates' units, or the mean result in the cell; do depend on the precision of individual estimates
R²: does not depend on the number of studies in a cell, the precision of individual estimates, the estimates' units, or the mean result in the cell
A “cell” here refers to an intervention-outcome combination. The “precision” of an estimate refers to its
standard error.
desirable properties of a measure of heterogeneity and which properties are possessed by each
of the discussed indicators. Measuring heterogeneity using the variance of Yi requires the
Yi to have comparable units. Using the coefficient of variation requires the assumption that
the mean effect size is an appropriate measure with which to scale sd(Yi). The variance and
coefficient of variation also do not have anything to say about the amount of heterogeneity
that can be explained. Adding explanatory variables also has its limitations. In any model,
we have no way to guarantee that we are indeed capturing all the relevant factors. While
I2 has the nice property that it disaggregates sampling variance as a source of variation,
estimating it depends on the weights applied to each study’s results and thus, in turn, on
the sample sizes of the studies. The R2 has its own well-known caveats, such as that it can
be artificially inflated by over-fitting.
Having discussed the different measures of generalizability I will use in this paper, I turn
to describe how I will estimate the parameters of the random effects or mixed models.
2.3 Hierarchical Bayesian Analysis
This paper uses meta-analysis as a tool to synthesize evidence.
As a quick review, there are many steps in a meta-analysis, most of which have to do
with the selection of the constituent papers. The search and screening of papers will be
described in the data section; here, I merely discuss the theory behind how meta-analyses
combine results and estimate the parameters σ2 and τ 2 that will be used to generate I2.
I begin by presenting the random effects model, followed by the related strategy to
estimate a mixed model.
2.4 Estimating a Random Effects Model
To build a hierarchical Bayesian random effects model, I first assume the data are nor-
mally distributed:
Yij | θi ~ N(θi, σ²)    (8)
where j indexes the individuals in the study. I do not have individual-level data, but instead
can use sufficient statistics:
Yi | θi ~ N(θi, σi²)    (9)
where Yi is the sample mean and σi² the sample variance. This provides the likelihood for θi.
I also need a prior for θi. I assume between-study normality:
θi ~ N(µ, τ²)    (10)
where µ and τ are unknown hyperparameters.
Conditioning on the distribution of the data, given by Equation 9, I get a posterior:
θi | µ, τ, Y ~ N(θ̂i, Vi)    (11)
where
θ̂i = (Yi/σi² + µ/τ²) / (1/σi² + 1/τ²),    Vi = 1 / (1/σi² + 1/τ²)    (12)
I then need to pin down µ|τ and τ by constructing their posterior distributions given
non-informative priors and updating based on the data. I assume a uniform prior for µ|τ ,
and as the Yi are estimates of µ with variance (σi² + τ²), obtain:
µ | τ, Y ~ N(µ̂, Vµ)    (13)
where

µ̂ = [Σi Yi/(σi² + τ²)] / [Σi 1/(σi² + τ²)],    Vµ = 1 / Σi [1/(σi² + τ²)]    (14)
For τ, note that p(τ|Y) = p(µ, τ|Y) / p(µ|τ, Y). The denominator follows from Equation 13; for the
numerator, we can observe that p(µ, τ|Y) is proportional to p(µ, τ)p(Y|µ, τ), and we know
the marginal distribution of Yi | µ, τ:

Yi | µ, τ ~ N(µ, σi² + τ²)    (15)
I use a uniform prior for τ , following Gelman et al. (2005). This yields the posterior for
the numerator:
p(µ, τ|Y) ∝ p(µ, τ) Πi N(Yi | µ, σi² + τ²)    (16)
Putting together all the pieces in reverse order, I first simulate τ from p(τ|Y), then generate p(µ|τ, Y) using τ, followed by µ and finally θi.
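The following sketch implements this simulation order for a hypothetical set of study estimates, evaluating p(τ|Y) on a grid and then drawing τ, µ, and the θi in turn. It is a minimal illustration under the uniform priors described above, not AidGrade's actual estimation code; the grid range and the input numbers are my own choices.

```python
# Compact sketch of the sampling scheme described above: evaluate p(tau|Y) on a
# grid, draw tau, then mu | tau, Y, then each theta_i. Flat priors on mu|tau and
# tau, as in the text. Effect sizes and standard errors are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0.10, 0.25, 0.05, 0.40, 0.15])        # study estimates Y_i
s2 = np.array([0.05, 0.08, 0.06, 0.10, 0.07])**2    # sigma_i^2

def log_p_tau(tau, y, s2):
    """log p(tau|Y) up to a constant: p(mu_hat, tau|Y) / p(mu_hat|tau, Y) with flat priors."""
    v = s2 + tau**2
    V_mu = 1 / np.sum(1 / v)
    mu_hat = np.sum(y / v) * V_mu
    return 0.5 * np.log(V_mu) - 0.5 * np.sum(np.log(v)) - 0.5 * np.sum((y - mu_hat)**2 / v)

taus = np.linspace(1e-4, 1.0, 500)                   # grid over tau
log_post = np.array([log_p_tau(t, y, s2) for t in taus])
p = np.exp(log_post - log_post.max()); p /= p.sum()

n_draws = 2000
tau_draws = rng.choice(taus, size=n_draws, p=p)      # simulate tau from p(tau|Y)
theta_draws = np.empty((n_draws, len(y)))
for k, tau in enumerate(tau_draws):
    v = s2 + tau**2
    V_mu = 1 / np.sum(1 / v)
    mu = rng.normal(np.sum(y / v) * V_mu, np.sqrt(V_mu))    # mu | tau, Y    (13)-(14)
    V_i = 1 / (1 / s2 + 1 / tau**2)
    theta_hat = (y / s2 + mu / tau**2) * V_i                # theta_i | mu, tau, Y    (11)-(12)
    theta_draws[k] = rng.normal(theta_hat, np.sqrt(V_i))

print("posterior means of theta_i:", theta_draws.mean(axis=0).round(3))
```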
2.5 Estimating a Mixed Model
The strategy here is similar. Appendix D contains a derivation.
3 Data
This paper uses a database of impact evaluation results collected by AidGrade, a U.S.
non-profit research institute that I founded in 2012. AidGrade focuses on gathering the
results of impact evaluations and analyzing the data, including through meta-analysis. Its
data on impact evaluation results were collected in the course of its meta-analyses from
2012-2014 (AidGrade, 2015).
AidGrade’s meta-analyses follow the standard stages: (1) topic selection; (2) a search
for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In
addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will
discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more
detail is provided in Appendix B.
3.1 Selection of Papers
The interventions that were selected for meta-analysis were selected largely on the basis
of there being a sufficient number of studies on that topic. Five AidGrade staff members each
independently made a preliminary list of interventions for examination; the lists were then
combined and searches done for each topic to determine if there were likely to be enough
impact evaluations for a meta-analysis. The remaining list was voted on by the general
public online and partially randomized. Appendix B provides further detail.
A comprehensive literature search was done using a mix of the search aggregators Sci-
Verse, Google Scholar, and EBSCO/PubMed. The online databases of J-PAL, IPA, CEGA
and 3ie were also searched for completeness. Finally, the references of any existing system-
atic reviews or meta-analyses were collected.
Any impact evaluation which appeared to be on the intervention in question was included,
barring those in developed countries.7 Any paper that tried to consider the counterfactual
was considered an impact evaluation. Both published papers and working papers were in-
cluded. The search and screening criteria were deliberately broad. There is not enough
room to include the full text of the search terms and inclusion criteria for all 20 topics in
this paper, but these are available in an online appendix as detailed in Appendix A.
3.2 Data Extraction
The subset of the data on which I am focusing is based on those papers that passed all
screening stages in the meta-analyses. Again, the search and screening criteria were very
broad and, after passing the full text screening, the vast majority of papers that were later
excluded were excluded merely because they had no outcome variables in common or did
not provide adequate data (for example, not providing data that could be used to calculate
the standard error of an estimate, or for a variety of other quirky reasons, such as displaying
results only graphically). The small overlap of outcome variables is a surprising and notable
feature of the data. Ultimately, the data I draw upon for this paper consist of 15,021 results
(double-coded and then reconciled by a third researcher) across 647 papers covering the 20
types of development program listed in Table 3.8 For the sake of comparison, though the two organizations clearly do different things, at the time of writing this is more impact eval-
7 High-income countries, according to the World Bank’s classification system.
8 Three titles here may be misleading. “Mobile phone-based reminders” refers specifically to SMS or voice reminders for health-related outcomes. “Women’s empowerment programs” required an educational component to be included in the intervention and it could not be an unrelated intervention that merely disaggregated outcomes by gender. Finally, micronutrients were initially too loosely defined; this was narrowed down to focus on those providing zinc to children, but the other micronutrient papers are still included in the data, with a tag, as they may still be useful.
uations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318
of these papers both overlapped in outcomes with another paper and were able to be stan-
dardized and thus included in the main results which rely on intervention-outcome groups.
Outcomes were defined under several rules of varying specificity, as will be discussed shortly.
Table 3: List of Development Programs Covered
2012: Conditional cash transfers; Deworming; Improved stoves; Insecticide-treated bed nets; Microfinance; Safe water storage; Scholarships; School meals; Unconditional cash transfers; Water treatment
2013: Contract teachers; Financial literacy training; HIV education; Irrigation; Micro health insurance; Micronutrient supplementation; Mobile phone-based reminders; Performance pay; Rural electrification; Women's empowerment programs
73 variables were coded for each paper. Additional topic-specific variables were coded for
some sets of papers, such as the median and mean loan size for microfinance programs. This
paper focuses on the variables held in common across the different topics. These include
which method was used; if randomized, whether it was randomized by cluster; whether it
was blinded; where it was (village, province, country - these were later geocoded in a sepa-
rate process); what kind of institution carried out the implementation; characteristics of the
population; and the duration of the intervention from the baseline to the midline or endline
results, among others. A full set of variables and the coding manual is available online, as
detailed in Appendix A.
As this paper pays particular attention to the program implementer, it is worth discussing
how this variable was coded in more detail. There were several types of implementers that
could be coded: governments, NGOs, private sector firms, and academics. There was also a
code for “other” (primarily collaborations) or “unclear”. The vast majority of studies were
implemented by academic research teams and NGOs. This paper considers NGOs and aca-
demic research teams together because it turned out to be practically difficult to distinguish
between them in the studies, especially as the passive voice was frequently used (e.g. “X
was done” without noting who did it). There were only a few private sector firms involved,
so they are considered with the “other” category in this paper.
Studies tend to report results for multiple specifications. AidGrade focused on those
results least likely to have been influenced by author choices: those with the fewest con-
trols, apart from fixed effects. Where a study reported results using different methodologies,
coders were instructed to collect the findings obtained under the authors’ preferred method-
ology; where the preferred methodology was unclear, coders were advised to follow the
internal preference ordering of prioritizing randomized controlled trials, followed by regres-
sion discontinuity designs and differences-in-differences, followed by matching, and to collect
multiple sets of results when they were unclear on which to include. Where results were
presented separately for multiple subgroups, coders were similarly advised to err on the side
of caution and to collect both the aggregate results and results by subgroup except where the
author appeared to be only including a subgroup because results were significant within that
subgroup. For example, if an author reported results for children aged 8-15 and then also
presented results for children aged 12-13, only the aggregate results would be recorded, but
if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups
would be coded as well as the aggregate result when presented. Authors only rarely reported
isolated subgroups, so this was not a major issue in practice.
When considering the variation of effect sizes within a group of papers, the definition of
the group is clearly critical. Two different rules were initially used to define outcomes: a
strict rule, under which only identical outcome variables are considered alike, and a loose
rule, under which similar but distinct outcomes are grouped into clusters.
The precise coding rules were as follows:
1. We consider outcome A to be the same as outcome B under the “strict rule” if out-
comes A and B measure the exact same quality. Different units may be used, pending
conversion. The outcomes may cover different timespans (e.g. encompassing both
outcomes over “the last month” and “the last week”). They may also cover different
populations (e.g. children or adults). Examples: height; attendance rates.
2. We consider outcome A to be the same as outcome B under the “loose rule” if they
do not meet the strict rule but are clearly related. Example: parasitemia greater than
4000/µl with fever and parasitemia greater than 2500/µl.
Clearly, even under the strict rule, differences between the studies may exist; however, using
two different rules allows us to isolate the potential sources of variation, and other variables
were coded to capture some of this variation, such as the age of those in the sample. If one
were to divide the studies by these characteristics, however, the data would usually be too
sparse for analysis.
Interventions were also defined separately and coders were also asked to write a short
description of the details of each program. Program names were recorded so as to identify
those papers on the same program, such as the various evaluations of PROGRESA.
After coding, the data were then standardized to make results easier to interpret and
so as not to overly weight those outcomes with larger scales. The typical way to compare
results across different outcomes is by using the standardized mean difference, defined as:
SMD = (µ1 - µ2) / σp
where µ1 is the mean outcome in the treatment group, µ2 is the mean outcome in the control
group, and σp is the pooled standard deviation. When data are not available to calculate the
pooled standard deviation, it can be approximated by the standard deviation of the depen-
dent variable for the entire distribution of observations or as the standard deviation in the
control group (Glass, 1976). If that is not available either, due to standard deviations not
having been reported in the original papers, one can use the typical standard deviation for
the intervention-outcome. I follow this approach to calculate the standardized mean differ-
ence, which is then used as the effect size measure for the rest of the paper unless otherwise
noted.
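A minimal sketch of this calculation, including the fallback choices for the standardizing SD, is below. The function names and argument layout are mine, not AidGrade's, and the example numbers are hypothetical.

```python
# Sketch of the standardized mean difference calculation described above, with
# fallbacks for the standardizing SD. Names and example values are hypothetical.
import math

def pooled_sd(sd_treat, n_treat, sd_control, n_control):
    """Pooled standard deviation from the two groups' SDs and sample sizes."""
    return math.sqrt(((n_treat - 1) * sd_treat**2 + (n_control - 1) * sd_control**2)
                     / (n_treat + n_control - 2))

def smd(mean_treat, mean_control, sd_pooled=None, sd_control=None, sd_typical=None):
    """Standardized mean difference (mu1 - mu2) / sigma_p, using the best available SD."""
    for sd in (sd_pooled, sd_control, sd_typical):
        if sd is not None and sd > 0:
            return (mean_treat - mean_control) / sd
    raise ValueError("no usable standard deviation available")

# Example: an enrollment-rate outcome reported in percentage points (hypothetical numbers)
print(smd(85.0, 80.0, sd_pooled=pooled_sd(12.0, 500, 13.0, 500)))  # roughly 0.4 SD
```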
This paper uses the “strict” outcomes where available, but the “loose” outcomes where
that would keep more data. For papers which were follow-ups of the same study, the most
recent results were used for each outcome.
Finally, one paper appeared to misreport results, suggesting implausibly low values and
standard deviations for hemoglobin. These results were excluded and the paper’s correspond-
ing author contacted. Excluding this paper’s results, effect sizes range between -1.5 and 1.8
SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual
results, especially with the small number of papers in some intervention-outcome groups, I
restrict attention to those standardized effect sizes less than 2 SD away from 0, dropping 1
additional observation. I report main results including this observation in the Appendix.
3.3 Data Description
Figure 2 summarizes the distribution of studies covering the interventions and outcomes
considered in this paper that can be standardized. Attention will typically be limited to
those intervention-outcome combinations on which we have data for at least three papers.
Table 13 in Appendix C lists the interventions and outcomes and describes their results in
a bit more detail, providing the distribution of significant and insignificant results. It should
be emphasized that the number of negative and significant, insignificant, and positive and
significant results per intervention-outcome combination only provide ambiguous evidence
of the typical efficacy of a particular type of intervention. Simply tallying the numbers in
each category is known as “vote counting” and can yield misleading results if, for example,
some studies are underpowered.
Table 4 further summarizes the distribution of papers across interventions and highlights
the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent
with the story of researchers each wanting to publish one of the first papers on a topic. Vivalt
(2015a) finds that later papers on the same intervention-outcome combination more often
remain as working papers.
A note must be made about combining data. When conducting a meta-analysis, the
Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the
data to one observation per intervention-outcome-paper, and I do this for generating the
within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had
been reported for multiple subgroups (e.g. women and men), I aggregated them as in the
Cochrane Handbook’s Table 7.7.a. Where results were reported for multiple time periods
(e.g. 6 months after the intervention and 12 months after the intervention), I used the most
comparable time periods across papers. When combining across multiple outcomes, which
has limited use but will come up later in the paper, I used the formulae from Borenstein et
al. (2009), Chapter 24.
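For concreteness, the sketch below applies the subgroup-combination formula from the Cochrane Handbook's Table 7.7.a to collapse two reported subgroups into a single mean and SD; the subgroup numbers are hypothetical.

```python
# Sketch of collapsing two reported subgroups (e.g. girls and boys) into a single
# group mean and SD, following the Cochrane Handbook Table 7.7.a formula
# referenced above. The numbers in the example are hypothetical.
import math

def combine_subgroups(n1, m1, sd1, n2, m2, sd2):
    """Combine two subgroups' sample sizes, means, and SDs into one group."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
           + (n1 * n2 / n) * (m1**2 + m2**2 - 2 * m1 * m2)) / (n - 1)
    return n, m, math.sqrt(var)

# Example: outcomes reported separately for girls and boys (hypothetical values)
print(combine_subgroups(n1=240, m1=0.82, sd1=0.38, n2=260, m2=0.78, sd2=0.41))
```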
Figure 2: Within-Intervention-Outcome Number of Papers
Table 4: Descriptive Statistics: Distribution of Narrow Outcomes
Intervention | Number of outcomes | Mean papers per outcome | Max papers per outcome
Within-paper values are based on those papers which report results for different subsets of the data. For closer comparison of the across- and within-paper statistics, the across-paper values are based on the same data set, aggregating the within-paper results to one observation per
intervention-outcome-paper, as discussed. Each paper needs to have reported 3 results for an intervention-outcome combination for it to be included in the calculation, in addition to the requirement of there being 3 papers on the intervention-outcome combination. Due to the slightly different sample, the across-paper statistics diverge slightly from those reported in Table 5. Occasionally, within-paper measures of the mean equal or approach zero, making the coefficient of variation undefined or unreasonable; “*” denotes those coefficients of variation that were either undefined or greater than 10,000,000.
Figure 4: Distribution of within and across-paper heterogeneity measures
We can also gauge the magnitudes of these measures by comparison with effect sizes. We know effect sizes are typically considered “small” if they are less than 0.2 SDs and that the largest coefficient of variation typically considered in the medical literature is 0.5 (Tian, 2005; Ng, 2014). Taking 0.5 as a very conservative upper bound for a “small” coefficient of variation, this would imply a variance of less than 0.01 for an effect size of 0.2. That the actual mean effect size in the data is closer to 0.1 makes this even more of an upper bound; applying the same reasoning to an effect size of 0.1 would result in the threshold being set at a variance of 0.0025.
Finally, we can try to set bounds more directly, based on the expected prediction error. Here it is immediately apparent that what counts as large or small error depends on the policy question. In some cases, it might not matter if an effect size were mis-predicted by 25%; in others, a prediction error of this magnitude could mean the difference between choosing one program over another or determine whether a program is worthwhile to pursue at all.
Still, if we take the mean effect size within an intervention-outcome to be our “best guess” of how a program will perform and, as an illustrative example, want the prediction error to be less than 25% at least 50% of the time, this would imply a certain cut-off threshold for the variance if we assume that results are normally distributed. Note that the assumption that results are drawn from the same normal distribution and the mean and variance of this distribution can be approximated by the mean and variance of observed results is a simplification for the purpose of a back-of-the-envelope calculation. We would expect results to be drawn from different distributions.
Table 7 summarizes the implied bounds for the variance for the prediction error to be less than 25% and 50%, respectively, alongside the actual variance in results within each intervention-outcome. In only 1 of 51 cases is the true variance in results smaller than the variance implied by the 25% prediction error cut-off threshold, and in 9 other cases it is below the 50% prediction error threshold. In other words, the variance of results within each intervention-outcome would imply a prediction error of more than 50% more than 80% of the time.
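The back-of-the-envelope threshold can be computed directly. Under the normality simplification above, the prediction error is below a fraction q of the mean at least half the time when q times the mean is at least about 0.674 standard deviations; the sketch below turns this into a variance threshold, using an illustrative mean effect size of 0.1.

```python
# Back-of-the-envelope check of the prediction-error thresholds described above:
# if results are (approximately) normal with mean m and variance v, the prediction
# error is below a fraction q of m with probability p when q*|m| >= z*sqrt(v),
# where z is the appropriate normal quantile. The mean effect size is illustrative.
from scipy.stats import norm

def variance_threshold(mean_effect, max_rel_error=0.25, prob=0.5):
    """Largest variance such that |draw - mean| <= max_rel_error*|mean| with probability prob."""
    z = norm.ppf(0.5 + prob / 2)   # about 0.674 for prob = 0.5
    return (max_rel_error * abs(mean_effect) / z) ** 2

for q in (0.25, 0.50):
    print(f"mean=0.1, error<{q:.0%}: variance must be below {variance_threshold(0.1, q):.4f}")
```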
Table 7: Actual Variance vs. Variance for Prediction Error Thresholds
4.2.2 Within an Intervention-Outcome Combination: The Case of CCTs and
Enrollment Rates
The previous results used the across-intervention-outcome data, which were aggregated
to one result per intervention-outcome-paper. However, we might think that more variation
could be explained by carefully modelling results within a particular intervention-outcome
combination. This section provides an example, using the case of conditional cash transfers
and enrollment rates, the intervention-outcome combination covered by the most papers.
Suppose we were to try to explain as much variability in outcomes as possible, using
sample characteristics. The available variables which might plausibly have a relationship to
effect size are: the baseline enrollment rates9; the sample size; whether the study was done
in a rural or urban setting, or both; results for other programs in the same region10; and
the age and gender of the sample under consideration.
Table 9 shows the results of OLS regressions of the effect size on these variables, in turn.
The baseline enrollment rates show the strongest relationship to effect size, as reflected in
the R2 and significance levels: it is easier to have large gains where initial rates are low.
Some papers pay particular attention to those children that were not enrolled at baseline or
that were enrolled at baseline. These are coded as a “0%” or “100%” enrollment rate at
baseline but are also represented by two dummy variables. Larger studies and studies done
in urban areas also tend to find smaller effect sizes than smaller studies or studies done in
rural or mixed urban/rural areas. Finally, for each result I calculate the mean result in the
same region, excluding results from the program in question. Results do appear slightly
correlated across different programs in the same region.
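As an illustration of how this last regressor can be constructed, the sketch below computes, for each result, the mean effect size of other programs in the same region (a leave-one-out regional mean). The program names, regions, and effect sizes are hypothetical placeholders, not AidGrade data.

```python
# Sketch of constructing the "mean result of other programs in the same region"
# regressor described above: for each result, average the effect sizes of studies
# in the same region, excluding the program the result comes from. Hypothetical data.
from collections import defaultdict

results = [
    {"program": "CCT-A", "region": "Latin America", "es": 0.06},
    {"program": "CCT-B", "region": "Latin America", "es": 0.04},
    {"program": "CCT-C", "region": "Latin America", "es": 0.08},
    {"program": "CCT-D", "region": "Africa",        "es": 0.12},
    {"program": "CCT-E", "region": "Africa",        "es": 0.05},
]

by_region = defaultdict(list)
for r in results:
    by_region[r["region"]].append(r)

for r in results:
    others = [x["es"] for x in by_region[r["region"]] if x["program"] != r["program"]]
    r["region_mean_other"] = sum(others) / len(others) if others else None
    print(r["program"], r["region_mean_other"])
```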
9 In some cases, only endline enrollment rates are reported. This variable is therefore constructed by using baseline rates for both the treatment and control group where they are available, followed by, in turn, the baseline rate for the control group; the baseline rate for the treatment group; the endline rate for the control group; the endline rate for the treatment and control group; and the endline rate for the treatment group.
10 Regions include: Latin America, Africa, the Middle East and North Africa, East Asia, and South Asia, following the World Bank's geographical divisions.
Table 9: Regression of Projects’ Effect Sizes on Characteristics (CCTs on Enrollment Rates)
Columns (1)-(10); the dependent variable in each column is the effect size (ES).
Microfinance:
- Expansion of credit and/or savings
- Provision of technological innovations
- Introduction or expansion of financial education, or other program to increase financial literacy or awareness

Outcomes:
- Individual and household income
- Small and micro-business income
- Household and business assets
- Household consumption
- Small and micro-business investment
- Small, micro-business or agricultural output
- Measures of poverty
- Measures of well-being or stress
- Business ownership
- Any other outcome covered by multiple papers
Figure 11 illustrates the difference.
For this reason, minimal screening was done during the screening stage. Instead, data
was collected broadly and re-screening was allowed at the point of doing the analysis. This
is highly beneficial for the purpose of this paper, as it allows us to look at the largest
possible set of papers and all subsets.
After screening criteria were developed, two volunteers independently screened the titles
to determine which papers in the spreadsheet were likely to meet the screening criteria
developed in Stage 3.1. Any differences in coding were arbitrated by a third volunteer. All
volunteers received training before beginning, based on the AidGrade Training Manual and
a test set of entries. Volunteers’ training inputs were screened to ensure that only proficient
Figure 9: AidGrade’s Strategy
volunteers would be allowed to continue. Of those papers that passed the title screening,
two volunteers independently determined whether the papers in the spreadsheet met the
screening criteria developed in Stage 3.1 judging by the paper abstracts. Any differences in
coding were again arbitrated by a third volunteer. The full text was then found for those
papers which passed both the title and abstract checks. Any paper that proved not to
be a relevant impact evaluation using the aforementioned criteria was discarded at this stage.
Stage 4: Coding
Two AidGrade members each independently used the data extraction form developed
in Stage 4.1 to extract data from the papers that passed the screening in Stage 3. Any
disputes were arbitrated by a third AidGrade member. These AidGrade members received
much more training than those who screened the papers, reflecting the increased difficulty
of their work, and also did a test set of entries before being allowed to proceed. The data
extraction form was organized into three sections: (1) general identifying information; (2)
paper and study characteristics; and (3) results. Each section contained qualitative and
quantitative variables that captured the characteristics and results of the study.
Stage 5: Analysis
A researcher was assigned to each meta-analysis topic and specialized in determining which of the interventions and results were similar enough to be combined. If in doubt,
researchers could consult the original papers. In general, researchers were encouraged to
focus on all the outcome variables for which multiple papers had results.
When a study had multiple treatment arms sharing the same control, researchers would
check whether enough data was provided in the original paper to allow estimates to be
combined before the meta-analysis was run. This is a best practice to avoid double-counting
the control group; for details, see the Cochrane Handbook for Systematic Reviews of
Interventions (2011). If a paper did not provide sufficient data for this, the researcher would
make the decision as to which treatment arm to focus on. Data were then standardized
within each topic to be more comparable before analysis (for example, units were converted).
The subsequent steps of the meta-analysis process are irrelevant for the purposes of
this paper. It should be noted that the first set of ten topics followed a slightly different
procedure for stages (1) and (2). Only one list of potential topics was created in Stage
1.1, so Stage 1.2 (Consolidation of Lists) was only vacuously followed. There was also no
randomization after public voting (Stage 1.7) and no scripted scraping searches (Stage 2.3),
as all searches were manually conducted using specific strings. A different search engine was
also used: SciVerse Hub, an aggregator that includes SciVerse Scopus, MEDLINE, PubMed
Central, ArXiv.org, and many other databases of articles, books and presentations. The
search strings for both rounds of meta-analysis, manual and scripted, are detailed in another