How Much Can We Generalize? Measuring the External
Validity of Impact Evaluations
Eva Vivalt∗
New York University
August 31, 2015
Abstract
Impact evaluations aim to predict the future, but they are rooted in particular contexts and to what extent they generalize is an open and important question. I founded an organization to systematically collect and synthesize impact evaluation results on a wide variety of interventions in development. These data allow me to answer this and other questions for the first time using a large data set of studies. I consider several measures of generalizability, discuss the strengths and limitations of each metric, and provide benchmarks based on the data. I use the example of the effect of conditional cash transfers on enrollment rates to show how some of the heterogeneity can be modelled and the effect this can have on the generalizability measures. The predictive power of the model improves over time as more studies are completed. Finally, I show how researchers can estimate the generalizability of their own study using their own data, even when data from no comparable studies exist.
∗ E-mail: [email protected]. I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Cornell University, Princeton University, the University of Toronto, the London School of Economics, the Australian National University, and the University of Ottawa, among others, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I am also grateful for the hard work put in by many at AidGrade over the duration of this project, including but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Mi Shen, Ning Zhang, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto.
1 Introduction
In the last few years, impact evaluations have become extensively used in development
economics research. Policymakers and donors typically fund impact evaluations precisely to
figure out how effective a similar program would be in the future to guide their decisions
on what course of action they should take. However, it is not yet clear how much we can
extrapolate from past results or under which conditions. Further, there is some evidence
that even a similar program, in a similar environment, can yield different results. For ex-
ample, Bold et al. (2013) carry out an impact evaluation of a program to provide contract
teachers in Kenya; this was a scaled-up version of an earlier program studied by Duflo, Du-
pas and Kremer (2012). The earlier intervention studied by Duflo, Dupas and Kremer was
implemented by an NGO, while Bold et al. compared implementation by an NGO and the
government. While Duflo, Dupas and Kremer found positive effects, Bold et al. showed
significant results only for the NGO-implemented group. The different findings in the same
country for purportedly similar programs point to the substantial context-dependence of im-
pact evaluation results. Knowing this context-dependence is crucial in order to understand
what we can learn from any impact evaluation.
While the main reason to examine generalizability is to aid interpretation and improve
predictions, it would also help to direct research attention to where it is most needed. If
generalizability were higher in some areas, fewer papers would be needed to understand how
people would behave in a similar situation; conversely, if there were topics or regions where
generalizability was low, it would call for further study. With more information, researchers
can better calibrate where to direct their attentions to generate new insights.
It is well-known that impact evaluations only happen in certain contexts. For example,
Figure 1 shows a heat map of the geocoded impact evaluations in the data used in this paper
overlaid by the distribution of World Bank projects (black dots). Both sets of data are geo-
graphically clustered, and whether or not we can reasonably extrapolate from one to another
depends on how much related heterogeneity there is in treatment effects. Allcott (forthcom-
ing) recently showed that site selection bias was an issue for randomized controlled trials
(RCTs) on a firm’s energy conservation programs. Microfinance institutions that run RCTs
and hospitals that conduct clinical trials are also selected (Allcott, forthcoming), and World
Bank projects that receive an impact evaluation are different from those that do not (Vivalt,
2015). Others have sought to explain heterogeneous treatment effects in meta-analyses of
specific topics (e.g. Saavedra and Garcia, 2013, among many others for conditional cash
transfers), or to argue they are so heterogeneous they cannot be adequately modelled (e.g.
Deaton, 2011; Pritchett and Sandefur, 2013).
Figure 1: Growth of Impact Evaluations and Location Relative to Programs
The figure on the left shows a heat map of the impact evaluations in AidGrade’s database overlaid by black
dots indicating where the World Bank has done projects. While there are many other development
programs not done by the World Bank, this figure illustrates the great numbers and geographical
dispersion of development programs. The figure on the right plots the number of studies that came out in
each year that are contained in each of three databases described in the text: 3ie’s title/abstract/keyword
database of impact evaluations; J-PAL’s database of affiliated randomized controlled trials; and AidGrade’s
database of impact evaluation results data.
Impact evaluations are still increasing exponentially in number and in the resources devoted to them. The World Bank recently received a major grant from the UK aid
agency DFID to expand its already large impact evaluation work; the Millennium Challenge
Corporation has committed to conduct rigorous impact evaluations for 50% of its activities,
with “some form of credible evaluation of impact” for every activity (Millennium Challenge
Corporation, 2009); and the U.S. Agency for International Development is also increasingly
invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of
program funds to evaluation.1
Yet while impact evaluations are still growing in development, a few thousand are al-
ready complete. Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL,
a center for development economics research, have completed each year; alongside are the
number of development-related impact evaluations released that year according to 3ie, which
keeps a directory of titles, abstracts, and other basic information on impact evaluations more
broadly, including quasi-experimental designs; finally, the dashed line shows the number of
papers that came out in each year that are included in AidGrade’s database of impact eval-
uation results, which will be described shortly.
1 While most of these are less rigorous “performance evaluations”, country mission leaders are supposed to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year plans (USAID, 2011).
In short, while we do impact evaluation to figure out what will happen in the future,
many issues have been raised about how well we can extrapolate from past impact evalua-
tions, and despite the importance of the topic, previously we were able to do little more
than guess or examine the question in narrow settings as we did not have the data. Now we
have the opportunity to address speculation, drawing on a large, unique dataset of impact
evaluation results.
I founded a non-profit organization dedicated to gathering this data. That organization,
AidGrade, seeks to systematically understand which programs work best where, a task that
requires also knowing the limits of our knowledge. To date, AidGrade has conducted 20
meta-analyses and systematic reviews of different development programs.2 Data gathered
through meta-analyses are the ideal data to answer the question of how much we can ex-
trapolate from past results, and since data on these 20 topics were collected in the same
way, coding the same outcomes and other variables, we can look across different types of
programs to see if there are any more general trends. Currently, the data set contains 647 pa-
pers on 210 narrowly-defined intervention-outcome combinations, with the greater database
containing 15,021 estimates.
I define generalizability and discuss several metrics with which to measure it. Other
disciplines have considered generalizability more, so I draw on the literature relating to
meta-analysis, which has been most well-developed in medicine, as well as the psychometric
literature on generalizability theory (Higgins and Thompson, 2002; Shavelson and Webb,
2006; Briggs and Wilson, 2007). The measures I discuss could also be used in conjunction
with any model that seeks to explain variation in treatment effects (e.g. Dehejia, Pop-Eleches
and Samii, 2015) to quantify the proportion of variation that such a model explains. Since
some of the analyses will draw upon statistical methods not commonly used in economics,
I will use the concrete example of conditional cash transfers (CCTs), which are relatively
well-understood and on which many papers have been written, to elucidate the issues.
While this paper focuses on results for impact evaluations of development programs, this
is only one of the first areas within economics to which these kinds of methods can be applied.
In many of the sciences, knowledge is built through a combination of researchers conducting
individual studies and other researchers synthesizing the evidence through meta-analysis.
This paper begins that natural next step.
2 Throughout, I will refer to all 20 as meta-analyses, but some did not have enough comparable outcomes for meta-analysis and became systematic reviews.
2 Theory
2.1 Heterogeneous Treatment Effects
I model treatment effects as potentially depending on the context of the intervention.
Each impact evaluation is on a particular intervention and covers a number of outcomes.
The relationship between an outcome, the inputs that were part of the intervention, and the
context of the study is complex. In the simplest model, we can imagine that context can be
represented by a “contextual variable”, C, such that:
Zj = α + βTj + δCj + γTjCj + εj    (1)
where j indexes the individual, Z represents the value of an aggregate outcome such as
“enrollment rates”, T indicates being treated, and C represents a contextual variable, such
as the type of agency that implemented the program.3
In this framework, a particular impact evaluation might explicitly estimate:
Zj = α + β1Tj + εj    (2)
but, as Equation 1 can be re-written as Zj = α + (β + γCj)Tj + δCj + εj, what β1 is really
capturing is the effect β1 = β + γC. When C varies, unobserved, in different contexts, the
variance of β1 increases.
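To make the mechanism concrete, the following sketch simulates studies run in different contexts under Equation 1 and shows how unobserved variation in C inflates the spread of the estimated β1 across studies. It is purely illustrative: the coefficient values, sample sizes, and the uniform distribution of C are my own assumptions, not figures from the paper.

```python
# Illustrative simulation (not from the paper): how an unobserved contextual
# variable C inflates the across-study variance of estimated treatment effects.
# All parameter values below are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, delta, gamma = 0.0, 0.10, 0.05, 0.20  # hypothetical true coefficients
n_studies, n_per_study = 200, 1000

betas_hat = []
for _ in range(n_studies):
    C = rng.uniform(0, 1)                    # context varies across studies, unobserved
    T = rng.integers(0, 2, n_per_study)      # individual-level treatment assignment
    eps = rng.normal(0, 1, n_per_study)
    Z = alpha + beta * T + delta * C + gamma * T * C + eps
    # Each study estimates Z = a + b1*T + e, so b1 recovers beta + gamma*C on average
    b1 = Z[T == 1].mean() - Z[T == 0].mean()
    betas_hat.append(b1)

print("mean of estimated effects:", np.mean(betas_hat))      # about beta + gamma*E[C]
print("variance of estimated effects:", np.var(betas_hat))   # sampling noise + gamma^2*var(C)
```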
This is the simplest case. One can imagine that the true state of the world has “interac-
tion effects all the way down”.
Interaction terms are often considered a second-order problem. However, that intuition
could stem from the fact that we usually look for interaction terms within an already fairly
homogeneous dataset - e.g. data from a single country, at a single point in time, on a par-
ticularly selected sample.
Not all aspects of context need matter to an intervention’s outcomes. The set of con-
textual variables can be divided into a critical set on which outcomes depend and a set on
which they do not; I will ignore the latter. Further, the relationship between Z and C can
vary by intervention or outcome. For example, school meals programs might have more of
an effect on younger children, but scholarship programs could plausibly affect older children
more. If one were to regress effect size on the contextual variable “age”, we would get differ-
ent results depending on which intervention and outcome we were considering. Therefore,
3 Z can equally well be thought of as the average individual outcome for an intervention. Throughout, I take high values for an outcome to represent a beneficial change unless otherwise noted; if an outcome represents a negative characteristic, like incidence of a disease, its sign will be flipped before analysis.
it will be important in this paper to look only at a restricted set of contextual variables
which could plausibly work in a similar way across different interventions. Additional anal-
ysis could profitably be done within some interventions, but this is outside the scope of this
paper.
Generalizability will ultimately depend on the heterogeneity of treatment effects. The
next section formally defines generalizability for use in this paper.
2.2 Generalizability: Definitions and Measurement
Definition 1 Generalizability is the ability to predict results accurately out of sample.
Definition 2 Local generalizability is the ability to predict results accurately in a particular
out-of-sample group.
There are several ways to operationalize these definitions. The ability to predict
results hinges both on the variability of the results and the proportion that can be
explained. For example, if the overall variability in a set of results is high, this might not
be as concerning if the proportion of variability that can be explained is also high.
It is straightforward to measure the variance in results. However, these statistics need
to be benchmarked in order to know what is a “high” or “low” variance. One advantage
of the large data set used in this paper is that I can use it to benchmark the results
from different intervention-outcome combinations against each other. This is not the first
paper to tentatively suggest a scale. Other rules of thumb have also been created in this
manner, such as those used to consider the magnitude of effect sizes (0-0.2 SD = “small”,
0.2-0.5 = “medium”, > 0.5 SD = “large”) (Cohen, 1988) or the measure of the impact
of heterogeneity on meta-analysis results, I2 (0.25=“low”, 0.5=“medium”, 0.75=“high”)
(Higgins et al., 2003). I can also compare across-paper variation to within-paper variation,
with the idea that within-study variation should represent a lower bound to across-study
variation within the same intervention-outcome combination. Further, I can create variance
benchmarks based on back-of-the-envelope calculations for what the variance would imply
for predictive power under a set of assumptions. This will be discussed in more detail later.
One potential drawback to considering the variance of studies’ results is that we might
be concerned that studies that have higher effect sizes or are measured in terms of units
with larger scales have larger variances. This would limit us to making comparisons only
between data with the same scale. We could: 1) restrict attention to those outcomes
in the same natural units (e.g. enrollment rates in percentage points); 2) convert results to
be in terms of a common unit, such as standard deviations4; 3) scale the standard deviation
by the mean result, creating the coefficient of variation. The coefficient of variation
represents the inverse of the signal-to-noise ratio, and as a unitless figure can be compared
across intervention-outcome combinations with different natural units. It is not immune to
criticism, however, particularly in that it may result in large values as the mean approaches
zero.5
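As a small illustration of this caveat, the sketch below computes the coefficient of variation for two hypothetical sets of effect sizes, one with a mean well away from zero and one with a mean close to zero; the numbers are invented solely to show how the CV blows up in the latter case.

```python
# Minimal illustration of the coefficient of variation (CV = sd / |mean|) and its
# instability when the mean effect size approaches zero. Values are hypothetical.
import numpy as np

effects_stable = np.array([0.15, 0.20, 0.25, 0.30])        # mean well away from zero
effects_near_zero = np.array([-0.05, 0.02, 0.01, 0.03])    # mean close to zero

for name, es in [("stable mean", effects_stable), ("near-zero mean", effects_near_zero)]:
    cv = es.std(ddof=1) / abs(es.mean())
    print(f"{name}: mean={es.mean():.3f}, sd={es.std(ddof=1):.3f}, CV={cv:.1f}")
```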
All the measures discussed so far focus on variation. However, if we could explain the
variation, it would no longer worsen our ability to make predictions in a new setting, so
long as we had all the necessary data from that setting, such as covariates, with which to
extrapolate.
To explain variation, we need a model. The meta-analysis literature suggests two
general types of models which can be parameterized in many ways: fixed-effect models and
random-effects models.
Fixed-effect models assume there is one true effect of a particular program and all
differences between studies can be attributed simply to sampling error. In other words:
Yi = θ + εi    (3)
where Yi is the observed effect size of a particular study, θ is the true effect and εi is the
error term.
Random-effects models do not make this assumption; the true effect could potentially
vary from context to context. Here,
Yi = θi + εi    (4)
   = θ + ηi + εi    (5)
where θi is the effect size for a particular study i, θ is the mean true effect size, ηi is a
particular study’s divergence from that mean true effect size, and εi is the error. Random-
effects models are more plausible and they are necessary if we think there are heterogeneous
treatment effects, so I use them in this paper. Random-effects models can also be modified
by the addition of explanatory variables, at which point they are called mixed models; I will
also use mixed models in this paper.
Sampling variance, var(Yi|θi), is denoted σ² and between-study variance, var(θi), τ².
4 This can be problematic if the standard deviations themselves vary but is a common approach in the meta-analysis literature in lieu of a better option.
5 This paper follows convention and reports the absolute value of the coefficient of variation wherever it appears.
This variation in observed effect sizes is then:
var(Yi) = τ² + σ²    (6)
and the proportion of the variation that is not sampling error is:
I² = τ² / (τ² + σ²)    (7)
The I2 is an established metric in the meta-analysis literature that helps determine
whether a fixed or random effects model is more appropriate; the higher I2, the less plausible
it is that sampling error drives all the variation in results. I2 is considered “low” at 0.25,
“medium” at 0.5, and “high” at 0.75 (Higgins et al., 2003).6
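To make Equation 7 concrete, here is a minimal sketch of computing I² for a hypothetical set of study estimates. The paper itself estimates τ² with the hierarchical Bayesian model of Section 2.3; this sketch instead uses the simpler DerSimonian-Laird method-of-moments estimator and a plain average of the squared standard errors as the “typical” sampling variance, both of which are my simplifications.

```python
# Sketch of computing I^2 from study effect sizes and standard errors, using the
# DerSimonian-Laird estimator for tau^2 (a simplification relative to the paper's
# hierarchical Bayesian approach). Effect sizes and standard errors are hypothetical.
import numpy as np

y = np.array([0.10, 0.25, 0.05, 0.40, 0.15])    # study effect sizes (SMDs)
se = np.array([0.05, 0.08, 0.06, 0.10, 0.07])   # their standard errors

w = 1 / se**2                                   # inverse-variance (fixed-effect) weights
y_fe = np.sum(w * y) / np.sum(w)                # fixed-effect pooled estimate
Q = np.sum(w * (y - y_fe)**2)                   # Cochran's Q
df = len(y) - 1
C = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)                   # between-study variance, tau^2
sigma2 = np.mean(se**2)                         # one simple choice of "typical" sampling variance
I2 = tau2 / (tau2 + sigma2)                     # equation (7)

print(f"tau^2 = {tau2:.4f}, I^2 = {I2:.2f}")
```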
If we wanted to explain more of the variation, we could do moderator or mediator analysis,
in which we examine how results vary with the characteristics of the study, characteristics of
its sample, or details about the intervention and its implementation. A linear meta-regression
is one way of accomplishing this goal, explicitly estimating:
Yi = β0 + Σn βnXn + ηi + εi

where Xn are explanatory variables. This is a mixed model and, upon estimating it, we can calculate several additional statistics: the amount of residual variation in Yi after accounting for Xn, varR(Yi - Ŷi); the coefficient of residual variation, CVR(Yi - Ŷi); and the residual I²R. Further, we can examine the R² of the meta-regression.
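A rough sketch of these quantities for a toy meta-regression follows. For simplicity it fits an unweighted OLS moderator regression rather than the full mixed model with a study-level random effect, and the effect sizes and covariate values are hypothetical.

```python
# Sketch of a linear meta-regression and the residual heterogeneity statistics
# discussed above. Unweighted OLS for simplicity; the paper's mixed models also
# include a study-level random effect. All data are hypothetical.
import numpy as np

y = np.array([0.32, 0.15, 0.08, 0.22, 0.05, 0.12])      # study effect sizes
X = np.column_stack([
    np.ones(6),                                          # intercept
    np.array([0.45, 0.80, 0.92, 0.60, 0.95, 0.85]),      # e.g. baseline enrollment rate
])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
resid = y - y_hat

var_total = np.var(y, ddof=1)                  # var(Yi)
var_resid = np.var(resid, ddof=X.shape[1])     # var_R(Yi - Yi_hat)
cv_resid = np.sqrt(var_resid) / abs(y.mean())  # CV_R, scaled by the mean result
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

print(f"var(Y)={var_total:.4f}, var_R={var_resid:.4f}, CV_R={cv_resid:.2f}, R^2={r2:.2f}")
```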
It should be noted that a linear meta-regression is only one way of modelling variation in
Yi. The I2, for example, is analogous to the reliability coefficient of classical test theory or
the generalizability coefficient of generalizability theory (a branch of psychometrics), both
of which estimate the proportion of variation that is not error. In this literature, additional
heterogeneity is usually modelled using ANOVA rather than meta-regression. Modelling
variation in treatment effects also does not have to occur only retrospectively at the conclu-
sion of studies; we can imagine that a carefully-designed study could anticipate and estimate
some of the potential sources of variation experimentally.
Table 1 summarizes the different indicators, dividing them into measures of variation and
measures of the proportion of variation that is systematic.
Each of these metrics has its advantages and disadvantages. Table 2 summarizes the
6 The Cochrane Collaboration uses a slightly different set of norms, saying 0-0.4 “might not be important”, 0.3-0.6 “may represent moderate heterogeneity”, 0.5-0.9 “may represent substantial heterogeneity”, and 0.75-1 “considerable heterogeneity” (Higgins and Green, 2011).
Table 1: Summary of heterogeneity measures

var(Yi): measure of variation
varR(Yi - Ŷi): measure of variation; makes use of explanatory variables
CV(Yi): measure of variation
CVR(Yi - Ŷi): measure of variation; makes use of explanatory variables
I²: measure of the proportion of variation that is systematic
I²R: measure of the proportion of variation that is systematic; makes use of explanatory variables
R²: measure of the proportion of variation that is systematic; makes use of explanatory variables
Table 2: Desirable properties of a measure of heterogeneity

var(Yi) and varR(Yi - Ŷi): do not depend on the number of studies in a cell, the precision of individual estimates, or the mean result in the cell; do depend on the estimates' units
CV(Yi) and CVR(Yi - Ŷi): do not depend on the number of studies in a cell, the precision of individual estimates, or the estimates' units; do depend on the mean result in the cell
I² and I²R: do not depend on the number of studies in a cell, the estimates' units, or the mean result in the cell; do depend on the precision of individual estimates
R²: does not depend on the number of studies in a cell, the precision of individual estimates, the estimates' units, or the mean result in the cell
A “cell” here refers to an intervention-outcome combination. The “precision” of an estimate refers to its
standard error.
desirable properties of a measure of heterogeneity and which properties are possessed by each
of the discussed indicators. Measuring heterogeneity using the variance of Yi requires the
Yi to have comparable units. Using the coefficient of variation requires the assumption that
the mean effect size is an appropriate measure with which to scale sd(Yi). The variance and
coefficient of variation also do not have anything to say about the amount of heterogeneity
that can be explained. Adding explanatory variables also has its limitations. In any model,
we have no way to guarantee that we are indeed capturing all the relevant factors. While
I2 has the nice property that it disaggregates sampling variance as a source of variation,
estimating it depends on the weights applied to each study’s results and thus, in turn, on
the sample sizes of the studies. The R2 has its own well-known caveats, such as that it can
be artificially inflated by over-fitting.
Having discussed the different measures of generalizability I will use in this paper, I turn
to describe how I will estimate the parameters of the random effects or mixed models.
2.3 Hierarchical Bayesian Analysis
This paper uses meta-analysis as a tool to synthesize evidence.
As a quick review, there are many steps in a meta-analysis, most of which have to do
with the selection of the constituent papers. The search and screening of papers will be
described in the data section; here, I merely discuss the theory behind how meta-analyses
combine results and estimate the parameters σ2 and τ 2 that will be used to generate I2.
I begin by presenting the random effects model, followed by the related strategy to
estimate a mixed model.
2.4 Estimating a Random Effects Model
To build a hierarchical Bayesian random effects model, I first assume the data are nor-
mally distributed:
Yij | θi ~ N(θi, σ²)    (8)
where j indexes the individuals in the study. I do not have individual-level data, but instead
can use sufficient statistics:
Yi | θi ~ N(θi, σi²)    (9)
where Yi is the sample mean and σi² the sample variance. This provides the likelihood for θi.
I also need a prior for θi. I assume between-study normality:
θi ~ N(µ, τ²)    (10)
where µ and τ are unknown hyperparameters.
Conditioning on the distribution of the data, given by Equation 9, I get a posterior:
θi | µ, τ, Y ~ N(θ̂i, Vi)    (11)
where
θ̂i = (Yi/σi² + µ/τ²) / (1/σi² + 1/τ²),    Vi = 1 / (1/σi² + 1/τ²)    (12)
I then need to pin down µ|τ and τ by constructing their posterior distributions given
non-informative priors and updating based on the data. I assume a uniform prior for µ|τ ,
and as the Yi are estimates of µ with variance (σi² + τ²), obtain:
µ | τ, Y ~ N(µ̂, Vµ)    (13)
where

µ̂ = [Σi Yi/(σi² + τ²)] / [Σi 1/(σi² + τ²)],    Vµ = 1 / Σi [1/(σi² + τ²)]    (14)
For τ, note that p(τ|Y) = p(µ, τ|Y) / p(µ|τ, Y). The denominator follows from Equation 13; for the
numerator, we can observe that p(µ, τ|Y) is proportional to p(µ, τ)p(Y|µ, τ), and we know
the marginal distribution of Yi | µ, τ:

Yi | µ, τ ~ N(µ, σi² + τ²)    (15)
I use a uniform prior for τ , following Gelman et al. (2005). This yields the posterior for
the numerator:
p(µ, τ|Y) ∝ p(µ, τ) Πi N(Yi | µ, σi² + τ²)    (16)
Putting together all the pieces in reverse order, I first simulate τ from p(τ|Y), then generate p(µ|τ, Y) using τ, followed by µ and finally θi.
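The following sketch implements this simulation order for a hypothetical set of study estimates, evaluating p(τ|Y) on a grid and then drawing τ, µ, and the θi in turn. It is a minimal illustration under the uniform priors described above, not AidGrade's actual estimation code; the grid range and the input numbers are my own choices.

```python
# Compact sketch of the sampling scheme described above: evaluate p(tau|Y) on a
# grid, draw tau, then mu | tau, Y, then each theta_i. Flat priors on mu|tau and
# tau, as in the text. Effect sizes and standard errors are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0.10, 0.25, 0.05, 0.40, 0.15])        # study estimates Y_i
s2 = np.array([0.05, 0.08, 0.06, 0.10, 0.07])**2    # sigma_i^2

def log_p_tau(tau, y, s2):
    """log p(tau|Y) up to a constant: p(mu_hat, tau|Y) / p(mu_hat|tau, Y) with flat priors."""
    v = s2 + tau**2
    V_mu = 1 / np.sum(1 / v)
    mu_hat = np.sum(y / v) * V_mu
    return 0.5 * np.log(V_mu) - 0.5 * np.sum(np.log(v)) - 0.5 * np.sum((y - mu_hat)**2 / v)

taus = np.linspace(1e-4, 1.0, 500)                   # grid over tau
log_post = np.array([log_p_tau(t, y, s2) for t in taus])
p = np.exp(log_post - log_post.max()); p /= p.sum()

n_draws = 2000
tau_draws = rng.choice(taus, size=n_draws, p=p)      # simulate tau from p(tau|Y)
theta_draws = np.empty((n_draws, len(y)))
for k, tau in enumerate(tau_draws):
    v = s2 + tau**2
    V_mu = 1 / np.sum(1 / v)
    mu = rng.normal(np.sum(y / v) * V_mu, np.sqrt(V_mu))    # mu | tau, Y    (13)-(14)
    V_i = 1 / (1 / s2 + 1 / tau**2)
    theta_hat = (y / s2 + mu / tau**2) * V_i                # theta_i | mu, tau, Y    (11)-(12)
    theta_draws[k] = rng.normal(theta_hat, np.sqrt(V_i))

print("posterior means of theta_i:", theta_draws.mean(axis=0).round(3))
```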
2.5 Estimating a Mixed Model
The strategy here is similar. Appendix D contains a derivation.
3 Data
This paper uses a database of impact evaluation results collected by AidGrade, a U.S.
non-profit research institute that I founded in 2012. AidGrade focuses on gathering the
results of impact evaluations and analyzing the data, including through meta-analysis. Its
data on impact evaluation results were collected in the course of its meta-analyses from
2012-2014 (AidGrade, 2015).
AidGrade’s meta-analyses follow the standard stages: (1) topic selection; (2) a search
for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In
addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will
discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more
detail is provided in Appendix B.
3.1 Selection of Papers
The interventions that were selected for meta-analysis were selected largely on the basis
of there being a sufficient number of studies on that topic. Five AidGrade staff members each
independently made a preliminary list of interventions for examination; the lists were then
combined and searches done for each topic to determine if there were likely to be enough
impact evaluations for a meta-analysis. The remaining list was voted on by the general
public online and partially randomized. Appendix B provides further detail.
A comprehensive literature search was done using a mix of the search aggregators Sci-
Verse, Google Scholar, and EBSCO/PubMed. The online databases of J-PAL, IPA, CEGA
and 3ie were also searched for completeness. Finally, the references of any existing system-
atic reviews or meta-analyses were collected.
Any impact evaluation which appeared to be on the intervention in question was included,
barring those in developed countries.7 Any paper that tried to consider the counterfactual
was considered an impact evaluation. Both published papers and working papers were in-
cluded. The search and screening criteria were deliberately broad. There is not enough
room to include the full text of the search terms and inclusion criteria for all 20 topics in
this paper, but these are available in an online appendix as detailed in Appendix A.
3.2 Data Extraction
The subset of the data on which I am focusing is based on those papers that passed all
screening stages in the meta-analyses. Again, the search and screening criteria were very
broad and, after passing the full text screening, the vast majority of papers that were later
excluded were excluded merely because they had no outcome variables in common or did
not provide adequate data (for example, not providing data that could be used to calculate
the standard error of an estimate, or for a variety of other quirky reasons, such as displaying
results only graphically). The small overlap of outcome variables is a surprising and notable
feature of the data. Ultimately, the data I draw upon for this paper consist of 15,021 results
(double-coded and then reconciled by a third researcher) across 647 papers covering the 20
types of development program listed in Table 3.8 For the sake of comparison, though the two organizations clearly do different things, at the time of writing this is more impact eval-
7 High-income countries, according to the World Bank’s classification system.
8 Three titles here may be misleading. “Mobile phone-based reminders” refers specifically to SMS or voice reminders for health-related outcomes. “Women’s empowerment programs” required an educational component to be included in the intervention and it could not be an unrelated intervention that merely disaggregated outcomes by gender. Finally, micronutrients were initially too loosely defined; this was narrowed down to focus on those providing zinc to children, but the other micronutrient papers are still included in the data, with a tag, as they may still be useful.
uations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318
of these papers both overlapped in outcomes with another paper and were able to be stan-
dardized and thus included in the main results which rely on intervention-outcome groups.
Outcomes were defined under several rules of varying specificity, as will be discussed shortly.
Table 3: List of Development Programs Covered
2012: Conditional cash transfers; Deworming; Improved stoves; Insecticide-treated bed nets; Microfinance; Safe water storage; Scholarships; School meals; Unconditional cash transfers; Water treatment
2013: Contract teachers; Financial literacy training; HIV education; Irrigation; Micro health insurance; Micronutrient supplementation; Mobile phone-based reminders; Performance pay; Rural electrification; Women's empowerment programs
73 variables were coded for each paper. Additional topic-specific variables were coded for
some sets of papers, such as the median and mean loan size for microfinance programs. This
paper focuses on the variables held in common across the different topics. These include
which method was used; if randomized, whether it was randomized by cluster; whether it
was blinded; where it was (village, province, country - these were later geocoded in a sepa-
rate process); what kind of institution carried out the implementation; characteristics of the
population; and the duration of the intervention from the baseline to the midline or endline
results, among others. A full set of variables and the coding manual is available online, as
detailed in Appendix A.
As this paper pays particular attention to the program implementer, it is worth discussing
how this variable was coded in more detail. There were several types of implementers that
could be coded: governments, NGOs, private sector firms, and academics. There was also a
code for “other” (primarily collaborations) or “unclear”. The vast majority of studies were
implemented by academic research teams and NGOs. This paper considers NGOs and aca-
demic research teams together because it turned out to be practically difficult to distinguish
between them in the studies, especially as the passive voice was frequently used (e.g. “X
was done” without noting who did it). There were only a few private sector firms involved,
so they are considered with the “other” category in this paper.
Studies tend to report results for multiple specifications. AidGrade focused on those
results least likely to have been influenced by author choices: those with the fewest con-
trols, apart from fixed effects. Where a study reported results using different methodologies,
coders were instructed to collect the findings obtained under the authors’ preferred method-
ology; where the preferred methodology was unclear, coders were advised to follow the
internal preference ordering of prioritizing randomized controlled trials, followed by regres-
sion discontinuity designs and differences-in-differences, followed by matching, and to collect
multiple sets of results when they were unclear on which to include. Where results were
presented separately for multiple subgroups, coders were similarly advised to err on the side
of caution and to collect both the aggregate results and results by subgroup except where the
author appeared to be only including a subgroup because results were significant within that
subgroup. For example, if an author reported results for children aged 8-15 and then also
presented results for children aged 12-13, only the aggregate results would be recorded, but
if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups
would be coded as well as the aggregate result when presented. Authors only rarely reported
isolated subgroups, so this was not a major issue in practice.
When considering the variation of effect sizes within a group of papers, the definition of
the group is clearly critical. Two different rules were initially used to define outcomes: a
strict rule, under which only identical outcome variables are considered alike, and a loose
rule, under which similar but distinct outcomes are grouped into clusters.
The precise coding rules were as follows:
1. We consider outcome A to be the same as outcome B under the “strict rule” if out-
comes A and B measure the exact same quality. Different units may be used, pending
conversion. The outcomes may cover different timespans (e.g. encompassing both
outcomes over “the last month” and “the last week”). They may also cover different
populations (e.g. children or adults). Examples: height; attendance rates.
2. We consider outcome A to be the same as outcome B under the “loose rule” if they
do not meet the strict rule but are clearly related. Example: parasitemia greater than
4000/µl with fever and parasitemia greater than 2500/µl.
Clearly, even under the strict rule, differences between the studies may exist; however, using
two different rules allows us to isolate the potential sources of variation, and other variables
were coded to capture some of this variation, such as the age of those in the sample. If one
were to divide the studies by these characteristics, however, the data would usually be too
sparse for analysis.
Interventions were also defined separately and coders were also asked to write a short
description of the details of each program. Program names were recorded so as to identify
those papers on the same program, such as the various evaluations of PROGRESA.
After coding, the data were then standardized to make results easier to interpret and
so as not to overly weight those outcomes with larger scales. The typical way to compare
results across different outcomes is by using the standardized mean difference, defined as:
SMD = (µ1 - µ2) / σp
where µ1 is the mean outcome in the treatment group, µ2 is the mean outcome in the control
group, and σp is the pooled standard deviation. When data are not available to calculate the
pooled standard deviation, it can be approximated by the standard deviation of the depen-
dent variable for the entire distribution of observations or as the standard deviation in the
control group (Glass, 1976). If that is not available either, due to standard deviations not
having been reported in the original papers, one can use the typical standard deviation for
the intervention-outcome. I follow this approach to calculate the standardized mean differ-
ence, which is then used as the effect size measure for the rest of the paper unless otherwise
noted.
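A minimal sketch of this calculation, including the fallback choices for the standardizing SD, is below. The function names and argument layout are mine, not AidGrade's, and the example numbers are hypothetical.

```python
# Sketch of the standardized mean difference calculation described above, with
# fallbacks for the standardizing SD. Names and example values are hypothetical.
import math

def pooled_sd(sd_treat, n_treat, sd_control, n_control):
    """Pooled standard deviation from the two groups' SDs and sample sizes."""
    return math.sqrt(((n_treat - 1) * sd_treat**2 + (n_control - 1) * sd_control**2)
                     / (n_treat + n_control - 2))

def smd(mean_treat, mean_control, sd_pooled=None, sd_control=None, sd_typical=None):
    """Standardized mean difference (mu1 - mu2) / sigma_p, using the best available SD."""
    for sd in (sd_pooled, sd_control, sd_typical):
        if sd is not None and sd > 0:
            return (mean_treat - mean_control) / sd
    raise ValueError("no usable standard deviation available")

# Example: an enrollment-rate outcome reported in percentage points (hypothetical numbers)
print(smd(85.0, 80.0, sd_pooled=pooled_sd(12.0, 500, 13.0, 500)))  # roughly 0.4 SD
```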
This paper uses the “strict” outcomes where available, but the “loose” outcomes where
that would keep more data. For papers which were follow-ups of the same study, the most
recent results were used for each outcome.
Finally, one paper appeared to misreport results, suggesting implausibly low values and
standard deviations for hemoglobin. These results were excluded and the paper’s correspond-
ing author contacted. Excluding this paper’s results, effect sizes range between -1.5 and 1.8
SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual
results, especially with the small number of papers in some intervention-outcome groups, I
restrict attention to those standardized effect sizes less than 2 SD away from 0, dropping 1
additional observation. I report main results including this observation in the Appendix.
3.3 Data Description
Figure 2 summarizes the distribution of studies covering the interventions and outcomes
considered in this paper that can be standardized. Attention will typically be limited to
those intervention-outcome combinations on which we have data for at least three papers.
Table 13 in Appendix C lists the interventions and outcomes and describes their results in
a bit more detail, providing the distribution of significant and insignificant results. It should
be emphasized that the number of negative and significant, insignificant, and positive and
significant results per intervention-outcome combination only provide ambiguous evidence
of the typical efficacy of a particular type of intervention. Simply tallying the numbers in
each category is known as “vote counting” and can yield misleading results if, for example,
some studies are underpowered.
Table 4 further summarizes the distribution of papers across interventions and highlights
the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent
with the story of researchers each wanting to publish one of the first papers on a topic. Vivalt
(2015a) finds that later papers on the same intervention-outcome combination more often
remain as working papers.
A note must be made about combining data. When conducting a meta-analysis, the
Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the
data to one observation per intervention-outcome-paper, and I do this for generating the
within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had
been reported for multiple subgroups (e.g. women and men), I aggregated them as in the
Cochrane Handbook’s Table 7.7.a. Where results were reported for multiple time periods
(e.g. 6 months after the intervention and 12 months after the intervention), I used the most
comparable time periods across papers. When combining across multiple outcomes, which
has limited use but will come up later in the paper, I used the formulae from Borenstein et
al. (2009), Chapter 24.
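For concreteness, the sketch below applies the subgroup-combination formula from the Cochrane Handbook's Table 7.7.a to collapse two reported subgroups into a single mean and SD; the subgroup numbers are hypothetical.

```python
# Sketch of collapsing two reported subgroups (e.g. girls and boys) into a single
# group mean and SD, following the Cochrane Handbook Table 7.7.a formula
# referenced above. The numbers in the example are hypothetical.
import math

def combine_subgroups(n1, m1, sd1, n2, m2, sd2):
    """Combine two subgroups' sample sizes, means, and SDs into one group."""
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
           + (n1 * n2 / n) * (m1**2 + m2**2 - 2 * m1 * m2)) / (n - 1)
    return n, m, math.sqrt(var)

# Example: outcomes reported separately for girls and boys (hypothetical values)
print(combine_subgroups(n1=240, m1=0.82, sd1=0.38, n2=260, m2=0.78, sd2=0.41))
```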
Figure 2: Within-Intervention-Outcome Number of Papers
Table 4: Descriptive Statistics: Distribution of Narrow Outcomes
Intervention | Number of outcomes | Mean papers per outcome | Max papers per outcome
Within-paper values are based on those papers which report results for different subsets of the data. For closer comparison of the across- and within-paper statistics, the across-paper values are based on the same data set, aggregating the within-paper results to one observation per
intervention-outcome-paper, as discussed. Each paper needs to have reported 3 results for an intervention-outcome combination for it to be included in the calculation, in addition to the requirement of there being 3 papers on the intervention-outcome combination. Due to the slightly different sample, the across-paper statistics diverge slightly from those reported in Table 5. Occasionally, within-paper measures of the mean equal or approach zero, making the coefficient of variation undefined or unreasonable; “*” denotes those coefficients of variation that were either undefined or greater than 10,000,000.
Figure 4: Distribution of within and across-paper heterogeneity measures
We can also gauge the magnitudes of these measures by comparison with effect sizes. We know effect sizes are typically considered “small” if they are less than 0.2 SDs and that the largest coefficient of variation typically considered in the medical literature is 0.5 (Tian, 2005; Ng, 2014). Taking 0.5 as a very conservative upper bound for a “small” coefficient of variation, this would imply a variance of less than 0.01 for an effect size of 0.2. That the actual mean effect size in the data is closer to 0.1 makes this even more of an upper bound; applying the same reasoning to an effect size of 0.1 would result in the threshold being set at a variance of 0.0025.
Finally, we can try to set bounds more directly, based on the expected prediction error. Here it is immediately apparent that what counts as large or small error depends on the policy question. In some cases, it might not matter if an effect size were mis-predicted by 25%; in others, a prediction error of this magnitude could mean the difference between choosing one program over another or determine whether a program is worthwhile to pursue at all.
Still, if we take the mean effect size within an intervention-outcome to be our “best guess” of how a program will perform and, as an illustrative example, want the prediction error to be less than 25% at least 50% of the time, this would imply a certain cut-off threshold for the variance if we assume that results are normally distributed. Note that the assumption that results are drawn from the same normal distribution and the mean and variance of this distribution can be approximated by the mean and variance of observed results is a simplification for the purpose of a back-of-the-envelope calculation. We would expect results to be drawn from different distributions.
Table 7 summarizes the implied bounds for the variance for the prediction error to be less than 25% and 50%, respectively, alongside the actual variance in results within each intervention-outcome. In only 1 of 51 cases is the true variance in results smaller than the variance implied by the 25% prediction error cut-off threshold, and in 9 other cases it is below the 50% prediction error threshold. In other words, the variance of results within each intervention-outcome would imply a prediction error of more than 50% more than 80% of the time.
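The back-of-the-envelope threshold can be computed directly. Under the normality simplification above, the prediction error is below a fraction q of the mean at least half the time when q times the mean is at least about 0.674 standard deviations; the sketch below turns this into a variance threshold, using an illustrative mean effect size of 0.1.

```python
# Back-of-the-envelope check of the prediction-error thresholds described above:
# if results are (approximately) normal with mean m and variance v, the prediction
# error is below a fraction q of m with probability p when q*|m| >= z*sqrt(v),
# where z is the appropriate normal quantile. The mean effect size is illustrative.
from scipy.stats import norm

def variance_threshold(mean_effect, max_rel_error=0.25, prob=0.5):
    """Largest variance such that |draw - mean| <= max_rel_error*|mean| with probability prob."""
    z = norm.ppf(0.5 + prob / 2)   # about 0.674 for prob = 0.5
    return (max_rel_error * abs(mean_effect) / z) ** 2

for q in (0.25, 0.50):
    print(f"mean=0.1, error<{q:.0%}: variance must be below {variance_threshold(0.1, q):.4f}")
```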
Table 7: Actual Variance vs. Variance for Prediction Error Thresholds
4.2.2 Within an Intervention-Outcome Combination: The Case of CCTs and
Enrollment Rates
The previous results used the across-intervention-outcome data, which were aggregated
to one result per intervention-outcome-paper. However, we might think that more variation
could be explained by carefully modelling results within a particular intervention-outcome
combination. This section provides an example, using the case of conditional cash transfers
and enrollment rates, the intervention-outcome combination covered by the most papers.
Suppose we were to try to explain as much variability in outcomes as possible, using
sample characteristics. The available variables which might plausibly have a relationship to
effect size are: the baseline enrollment rates9; the sample size; whether the study was done
in a rural or urban setting, or both; results for other programs in the same region10; and
the age and gender of the sample under consideration.
Table 9 shows the results of OLS regressions of the effect size on these variables, in turn.
The baseline enrollment rates show the strongest relationship to effect size, as reflected in
the R2 and significance levels: it is easier to have large gains where initial rates are low.
Some papers pay particular attention to those children that were not enrolled at baseline or
that were enrolled at baseline. These are coded as a “0%” or “100%” enrollment rate at
baseline but are also represented by two dummy variables. Larger studies and studies done
in urban areas also tend to find smaller effect sizes than smaller studies or studies done in
rural or mixed urban/rural areas. Finally, for each result I calculate the mean result in the
same region, excluding results from the program in question. Results do appear slightly
correlated across different programs in the same region.
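As an illustration of how this last regressor can be constructed, the sketch below computes, for each result, the mean effect size of other programs in the same region (a leave-one-out regional mean). The program names, regions, and effect sizes are hypothetical placeholders, not AidGrade data.

```python
# Sketch of constructing the "mean result of other programs in the same region"
# regressor described above: for each result, average the effect sizes of studies
# in the same region, excluding the program the result comes from. Hypothetical data.
from collections import defaultdict

results = [
    {"program": "CCT-A", "region": "Latin America", "es": 0.06},
    {"program": "CCT-B", "region": "Latin America", "es": 0.04},
    {"program": "CCT-C", "region": "Latin America", "es": 0.08},
    {"program": "CCT-D", "region": "Africa",        "es": 0.12},
    {"program": "CCT-E", "region": "Africa",        "es": 0.05},
]

by_region = defaultdict(list)
for r in results:
    by_region[r["region"]].append(r)

for r in results:
    others = [x["es"] for x in by_region[r["region"]] if x["program"] != r["program"]]
    r["region_mean_other"] = sum(others) / len(others) if others else None
    print(r["program"], r["region_mean_other"])
```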
9 In some cases, only endline enrollment rates are reported. This variable is therefore constructed by using baseline rates for both the treatment and control group where they are available, followed by, in turn, the baseline rate for the control group; the baseline rate for the treatment group; the endline rate for the control group; the endline rate for the treatment and control group; and the endline rate for the treatment group.
10 Regions include: Latin America, Africa, the Middle East and North Africa, East Asia, and South Asia, following the World Bank's geographical divisions.
Table 9: Regression of Projects’ Effect Sizes on Characteristics (CCTs on Enrollment Rates)
Columns (1)-(10); the dependent variable in each column is the effect size (ES).
Microfinance:
- Expansion of credit and/or savings
- Provision of technological innovations
- Introduction or expansion of financial education, or other program to increase financial literacy or awareness

Outcomes:
- Individual and household income
- Small and micro-business income
- Household and business assets
- Household consumption
- Small and micro-business investment
- Small, micro-business or agricultural output
- Measures of poverty
- Measures of well-being or stress
- Business ownership
- Any other outcome covered by multiple papers
Figure 11 illustrates the difference.
For this reason, minimal screening was done during the screening stage. Instead, data
was collected broadly and re-screening was allowed at the point of doing the analysis. This
is highly beneficial for the purpose of this paper, as it allows us to look at the largest
possible set of papers and all subsets.
After screening criteria were developed, two volunteers independently screened the titles
to determine which papers in the spreadsheet were likely to meet the screening criteria
developed in Stage 3.1. Any differences in coding were arbitrated by a third volunteer. All
volunteers received training before beginning, based on the AidGrade Training Manual and
a test set of entries. Volunteers’ training inputs were screened to ensure that only proficient
Figure 9: AidGrade’s Strategy
volunteers would be allowed to continue. Of those papers that passed the title screening,
two volunteers independently determined whether the papers in the spreadsheet met the
screening criteria developed in Stage 3.1 judging by the paper abstracts. Any differences in
coding were again arbitrated by a third volunteer. The full text was then found for those
papers which passed both the title and abstract checks. Any paper that proved not to
be a relevant impact evaluation using the aforementioned criteria was discarded at this stage.
Stage 4: Coding
Two AidGrade members each independently used the data extraction form developed
in Stage 4.1 to extract data from the papers that passed the screening in Stage 3. Any
disputes were arbitrated by a third AidGrade member. These AidGrade members received
much more training than those who screened the papers, reflecting the increased difficulty
of their work, and also did a test set of entries before being allowed to proceed. The data
extraction form was organized into three sections: (1) general identifying information; (2)
paper and study characteristics; and (3) results. Each section contained qualitative and
quantitative variables that captured the characteristics and results of the study.
Stage 5: Analysis
A researcher was assigned to each meta-analysis topic and specialized in determining which of the interventions and results were similar enough to be combined. If in doubt,
researchers could consult the original papers. In general, researchers were encouraged to
focus on all the outcome variables for which multiple papers had results.
When a study had multiple treatment arms sharing the same control, researchers would
check whether enough data was provided in the original paper to allow estimates to be
combined before the meta-analysis was run. This is a best practice to avoid double-counting
the control group; for details, see the Cochrane Handbook for Systematic Reviews of
Interventions (2011). If a paper did not provide sufficient data for this, the researcher would
make the decision as to which treatment arm to focus on. Data were then standardized
within each topic to be more comparable before analysis (for example, units were converted).
The subsequent steps of the meta-analysis process are irrelevant for the purposes of
this paper. It should be noted that the first set of ten topics followed a slightly different
procedure for stages (1) and (2). Only one list of potential topics was created in Stage
1.1, so Stage 1.2 (Consolidation of Lists) was only vacuously followed. There was also no
randomization after public voting (Stage 1.7) and no scripted scraping searches (Stage 2.3),
as all searches were manually conducted using specific strings. A different search engine was
also used: SciVerse Hub, an aggregator that includes SciVerse Scopus, MEDLINE, PubMed
Central, ArXiv.org, and many other databases of articles, books and presentations. The
search strings for both rounds of meta-analysis, manual and scripted, are detailed in another