Evaluating Anti-Poverty Programs: Concepts and Methods Norbert Schady Development Research Group
Jan 16, 2016
Outline of presentation
Introduction: The evaluation problem
Possible solutions
1. Experimental evaluations
• Randomization
2. Quasi-experimental evaluations
• Instrumental variables
• Regression discontinuity
3. Non-experimental evaluations
• OLS
• Matching methods
• Differences-in-differences
Learning more from evaluations
Big disclaimer! I will frequently be drawing on my own work in this presentation for examples
The evaluation problem
Assigned programs
Some units (individuals, households, villages) get the program; some do not
Examples:
• Social fund selects from applicants
• School construction: some villages get a new school, others get nothing
• Cash transfers to eligible households only
Ex-post evaluation
The evaluation problem
Impact is the difference between the relevant outcome indicator with the program and that without it
However, we can never observe someone in two different states of nature at the same time
While a post-intervention indicator is observed, its value in the absence of the program is not: it is a counterfactual
So all evaluation is essentially a problem of missing data
Calls for counterfactual analysis
Naïve comparisons can be deceptive
Common practices:
• Compare outcomes after the intervention to those before, or
• Compare units (people, households, villages) with and without the anti-poverty program
Potential biases from failure to control for:
• Other changes over time under the counterfactual, or
• Unit characteristics that influence program placement
We observe an outcome indicator, Y0, at t=0, before the intervention, and its value rises to Y1 (observed) at t=1, after the program
However, we need to identify the counterfactual, Y1*, since only then can we determine the impact of the intervention
Impact = Y1 - Y1*
[Figure: outcome indicator plotted against time, with the intervention at t=0; shows Y0 at t=0 and, at t=1, the observed Y1 and the counterfactual Y1*. The impact is the gap between Y1 and Y1*.]
The evaluation problem
However, we never observe the counterfactual, and so have to estimate it…
Making comparisons between “treated” and “control” (or “comparison”) groups
Alternative solutions
Experimental evaluations (“Social experiments”)
• Program is randomly assigned
• If properly carried out, corrects for observable and unobservable differences between treated and controls
• Estimates ATE
Quasi-experimental evaluations
• Instrumental variables
• Regression discontinuity
• Can correct for observable and unobservable differences, but estimated treatment effect is “local”
Non-experimental evaluations (“observational studies”)
• OLS, matching techniques: exogenous placement conditional on observables
• Differences in differences or higher-order differencing: can correct for time-invariant, additive differences (including in unobservables) between treated and controls
Randomization
Lottery used to assign households to “treatment” and “control” groups
If sample is large enough, this equates all characteristics—observable and unobservable—of both groups
Differences in outcomes can then be credibly interpreted as program impacts
No need for complicated econometrics or conditioning variables
A simple difference of means suffices
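The contrast between a lottery and a naïve with/without comparison can be illustrated with a small simulation. This is only a sketch on made-up data: the outcome equation, the "ability" unobservable, the true effect of 2.0, and the function names are all assumptions introduced for illustration, not taken from any study cited here.

```python
import random
from statistics import fmean


def simulate(assignment, n=20000, true_effect=2.0, seed=0):
    """Return (treated, control) outcomes for a hypothetical program.

    assignment="lottery": a coin flip decides who gets the program.
    assignment="self-selection": units with high unobserved ability enroll.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)  # unobservable that also raises the outcome
        if assignment == "lottery":
            gets_program = rng.random() < 0.5
        else:
            gets_program = ability > 0.0
        outcome = 10.0 + ability + rng.gauss(0.0, 1.0)
        if gets_program:
            treated.append(outcome + true_effect)
        else:
            control.append(outcome)
    return treated, control


def difference_in_means(treated, control):
    """Simple difference of mean outcomes between the two groups."""
    return fmean(treated) - fmean(control)


randomized = difference_in_means(*simulate("lottery"))
self_selected = difference_in_means(*simulate("self-selection"))
print(f"lottery: {randomized:.2f}  self-selection: {self_selected:.2f}  (true effect 2.0)")
```

Under the lottery the difference in means should land close to the true effect; under self-selection the same calculation also picks up the ability gap between the groups.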
Randomization
Randomization checks
• Check random assignment
• Check whether conditioning on X variables makes a difference
• Check whether cross-sectional and first-differenced analyses yield similar results
Conclusion: Randomization
Randomization is the benchmark for quasi-experimental and non-experimental evaluation methods
Has become much more popular in developing countries in recent decades, and with good reason
Groundbreaking example of PROGRESA
However, randomization is no panacea
1. Often infeasible: Political and moral difficulties of denying treatment to eligible beneficiaries who have lost a lottery
2. Be thoughtful about extrapolating from estimated parameters
What is the estimated parameter? Is it policy-relevant?
Randomization estimates the Average Treatment Effect (ATE) if all households in the treatment group receive the treatment and all those in the control group do not
If compliance in the treatment group is imperfect, then can estimate Intent-to-Treat (ITT): the impact of being offered the program
Or can scale up the ITT by dividing it by program take-up to estimate Treatment-on-the-Treated (TT)
• Program take-up: R
• ITT estimate of program effect: β1
• TT estimate of program effect: β2 = β1/R
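A worked example of the rescaling β2 = β1/R may help. The data below are hypothetical, the function name is invented for illustration, and the sketch assumes that nobody in the control group receives the program.

```python
from statistics import fmean


def itt_and_tt(y_offered, took_up, y_control):
    """Return (ITT, take-up rate R, TT = ITT / R).

    Assumes nobody in the control group receives the program.
    """
    itt = fmean(y_offered) - fmean(y_control)  # beta_1: effect of being offered
    take_up = fmean(took_up)                   # R: share of offered units treated
    if take_up <= 0:
        raise ValueError("take-up must be positive to rescale the ITT")
    return itt, take_up, itt / take_up         # beta_2 = beta_1 / R


# Hypothetical data: 1 = child enrolled in school, 0 = not enrolled.
y_offered = [1, 1, 1, 1, 1, 1, 0, 0]  # households that won the lottery
took_up = [1, 1, 1, 1, 1, 1, 0, 0]    # which of them actually took the program
y_control = [1, 1, 1, 1, 0, 0, 0, 0]  # households that lost the lottery

itt, take_up, tt = itt_and_tt(y_offered, took_up, y_control)
print(itt, take_up, round(tt, 3))
```

Here the offer raises enrollment by 25 percentage points, three quarters of offered households take the program, and the implied effect on those actually treated is therefore one third.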
What is the estimated parameter? Is it policy-relevant?
Deeper problem: Randomization often implemented in small-scale pilots, with highly-motivated staff
Impact of large-scale, perhaps nationwide program may be very different
US literature: the impact of attending preschool on school outcomes
• Perry Preschool program compared to Head Start
Difficult problem to overcome
In some cases, randomization takes place in the context of a large-scale program
• PROGRESA in Mexico
However, this tends to be politically difficult to sustain
• Oportunidades evaluations in Mexico
Quasi-experimental analysis: RDD
Threshold M below which individuals are eligible for treatment, above which they are ineligible
Intuition behind approach is you compare individuals “just above” and “just below” this threshold value
Proxy means: Determines eligibility for programs
• Scholarships in Cambodia (Filmer and Schady 2008)
• School fee reduction program in Bogota (Barrera, Linden and Urquiola 2007)
Geographic jurisdiction: Program implemented in some areas but not others
• Piso Firme in Mexico: Comparisons of households just across the border in Coahuila and Durango states (Cattaneo et al. 2007)
Class size on test scores in Bolivia (Urquiola 2007)
Quasi-experimental analysis: RDD
Sharp RDD: the threshold M perfectly predicts who receives a given treatment and who does not
Regress the outcome on a flexible formulation of the control function f(Ci) and a dummy for treatment
Estimate Yi = α + δf(Ci) + Φ(Ci<M) + εi
Note that, by definition, the indicator (Ci<M) = Ti
Can also estimate control function nonparametrically, above and below threshold
Fuzzy RDD: The threshold is a significant but imperfect predictor of treatment
Estimate Yi = α + δf(Ci) + ΦTi + εi, where Ti is instrumented with Ci<M
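A minimal sketch of the sharp case, on simulated data. It assumes a linear control function fitted separately on each side of the threshold, which is one simple choice and not the quartic or non-parametric fits used in the studies cited here. The data-generating process, the true jump of 0.2, the bandwidth, and all function names are hypothetical.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def sharp_rdd(scores, outcomes, threshold, bandwidth):
    """Jump in the outcome at the threshold; units with score < threshold are treated."""
    below = [(c - threshold, y) for c, y in zip(scores, outcomes)
             if threshold - bandwidth <= c < threshold]
    above = [(c - threshold, y) for c, y in zip(scores, outcomes)
             if threshold <= c <= threshold + bandwidth]
    # With the score centred on the threshold, each intercept is the fitted
    # outcome exactly at the threshold, approached from that side.
    at_threshold_treated, _ = fit_line(*zip(*below))
    at_threshold_untreated, _ = fit_line(*zip(*above))
    return at_threshold_treated - at_threshold_untreated


# Hypothetical data: eligibility score C in [-25, 25], threshold M = 0, true jump 0.2.
rng = random.Random(1)
scores = [rng.uniform(-25.0, 25.0) for _ in range(4000)]
outcomes = [0.5 + 0.004 * c + (0.2 if c < 0 else 0.0) + rng.gauss(0.0, 0.1)
            for c in scores]

estimate = sharp_rdd(scores, outcomes, threshold=0.0, bandwidth=25.0)
print(round(estimate, 3))
```

In applied work the bandwidth and the functional form are choices that need to be defended; narrowing the bandwidth trades bias for variance.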
Quasi-experimental analysis: RDD
Identifying assumption: No discontinuity in counterfactual values at threshold
Essentially: threshold is given exogenously and individuals respond mechanically to it
Can be violated if there is sorting
Urquiola and Verhoogen (2008): sorting in Chilean education system
• Schools don’t want to add another class because it is expensive
• Increase fees to limit enrollment
• Parents understand school behavior, and more highly educated parents sort themselves into schools with smaller class sizes
Discontinuity in observable (and perhaps unobservable) characteristics at threshold violates identifying assumption
RDD check: present evidence of no observable differences at threshold
Quasi-experimental analysis: RDD
[Figure: two panels plotting Probability (0 to 1) against Relative ranking (-25 to 25). Left panel: 0 = $45 cutoff. Right panel: 0 = $60 cutoff. Each panel shows a Quartic and a Non-parametric fit.]
Intent-to-treat effects of $45 versus no scholarship (LHS) and $60 versus $45 (RHS)
Source: Filmer and Schady (2008)
What is the estimated parameter? Is it policy-relevant?
RDD estimates treatment effects at the threshold
If there is heterogeneity of treatment effects, this may not correspond to the ATE
However, it may be a policy-relevant parameter for a “small” expansion of the program near the threshold
For example, for targeted programs, it will estimate effect of expanding coverage of program to incorporate “marginal” individuals
Quasi-experimental analysis: IV
Intuition: Identifying exogenous variation using a 3rd variable
Outcome regression: Yi = βTi + ΦXi + εi
Concern is that there are differences between treated (T=1) and control (T=0) individuals that are not captured by vector Xi
Induces correlation between Ti and εi
Biased estimates of program effects
Solution: identify a variable Zi that is correlated with Ti (“first stage”) but is uncorrelated with εi (“exclusion restriction”)
Quasi-experimental analysis: IV
Steps:
1. Regress Ti = β1Zi + Φ1Xi + νi
• Predict T-hati
• This gives you the exogenous variation in Ti
2. Regress Yi = β2T-hati + Φ2Xi + ωi
In practice, this is done in one step to get the correct standard errors
Practical difficulty: finding convincing instruments (the exclusion restriction cannot be tested)
If exclusion restriction does not hold, biases can be severe
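The two steps can be sketched on simulated data. This is an illustration under strong assumptions: a single binary instrument (a lottery), no covariates Xi, a constant treatment effect of 2.0, and a made-up data-generating process. All function names are hypothetical, and the sketch returns a point estimate only because, as noted above, standard errors from a manual second stage are not correct.

```python
import random
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(z, t, y):
    """Point estimate of the effect of t on y, using z as the instrument."""
    a0, a1 = fit_line(z, t)             # step 1: first stage, T on Z
    t_hat = [a0 + a1 * zi for zi in z]  # predicted treatment (exogenous variation)
    _, beta2 = fit_line(t_hat, y)       # step 2: outcome on predicted treatment
    return beta2


def simulate(n=20000, true_effect=2.0, seed=0):
    """Hypothetical lottery with imperfect compliance driven by an unobservable u."""
    rng = random.Random(seed)
    z, t, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)                  # unobserved, affects both T and Y
        zi = 1.0 if rng.random() < 0.5 else 0.0  # randomized lottery
        ti = 1.0 if zi + 0.5 * u + rng.gauss(0.0, 0.5) > 0.5 else 0.0
        z.append(zi)
        t.append(ti)
        y.append(1.0 + true_effect * ti + u + rng.gauss(0.0, 1.0))
    return z, t, y


z, t, y = simulate()
_, ols = fit_line(t, y)
iv = two_stage_least_squares(z, t, y)
print(f"OLS: {ols:.2f}  IV: {iv:.2f}  (true effect 2.0)")
```

Because u raises both take-up and the outcome, the OLS coefficient overstates the effect, while the instrumented estimate should sit close to the true value.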
Quasi-experimental analysis: IV
Some examples: Partially randomized design
Angrist et al. (2002) on impact of vouchers on test scores in Colombia
Schady and Araujo (2008) on impact of cash transfers on enrollment in Ecuador
• Lottery to determine access to Bono de Desarrollo Humano cash transfer program
• But substantial contamination of control group, which appears to be non-random
• Want to determine impact of program on enrollment
• Solution: regress enrollment on treatment, with treatment instrumented by the lottery
• Since the lottery was randomized, it is not correlated with regression error term
Quasi-experimental analysis: IV
Some examples: Political variables as instruments
Want to assess the impact of new school infrastructure on enrollment in Peru
But placement of school infrastructure may be endogenous
Maybe communities with “tastes” for education clamor more for a new school, and tastes are unobserved
Maybe program administrators want to place schools in very disadvantaged areas or in areas where they expect the returns to be highest
In any of these cases, a “simple” regression of school outcomes (enrollment, test scores) on new school infrastructure could be biased
Quasi-experimental analysis: IV
Schady (2000) shows that the distribution of expenditures on school infrastructure in the Fujimori administration was partially determined by political considerations:
Districts that had voted for Fujimori in 1990 but against Fujimori in 1993 were more likely to receive school investments than other, comparable districts (a “buy-back” strategy)
Paxson and Schady (2002) use this to construct an instrument for school infrastructure:
Regress enrollment on school infrastructure, with school infrastructure instrumented with the change in the share voting for Fujimori
Exclusion restriction: changes in vote share uncorrelated with regression error term
Quasi-experimental analysis: IV
Some examples: Program glitches
Impact of Bolsa Alimentação CCT program
Software used by program could not read special characters
As a result, people whose names had special characters (for example, Ângela, João, José, Gonçalves) were rejected by the system, and did not receive Bolsa payments in a first phase
Interested in estimating the effect of Bolsa on an outcome, but participation in program may be endogenous
Regress outcome (say, height-for-age z-score) on Bolsa, with Bolsa instrumented with whether or not applicant had special character in name
What is the estimated parameter? Is it policy-relevant?
If identifying assumptions hold, IV estimates are LATE: they estimate the impact of treatment on the outcome for “complier” households (Imbens and Angrist 1994; Angrist, Imbens and Rubin 1996)
These are households whose probability of receiving the treatment was affected by the instrument
So, in “partial randomization” example, these exclude individuals who would have received transfers no matter what, as well as those who would not have received transfers no matter what
Note that this is a counterfactual comparison—we cannot identify these individuals in practice
So, if there is heterogeneity of treatment effects, so that some households respond differently to an intervention than others, it is hard to extrapolate to another population—even if IV estimator is unbiased
What is the estimated parameter? Is it policy-relevant?
Also, if there is selection on expected returns (Card 1999; Heckman and Vytlacil 2005), so that those who stand to benefit the most are most likely to select into the program, this selection effect is incorporated into the estimated treatment effects
Imagine creating a program that randomly assigns fee waivers to some districts in a country but not others
Since program is randomized, you can estimate impact of fee waiver on school attainment without additional complications
What is the estimated parameter? Is it policy-relevant?
But say you also want to use this design to estimate the impact of school attainment on wages
In theory, you could run a regression of wages on schooling, with schooling instrumented with whether a district was selected into the fee reduction program
However, if those who stood to gain the most from schooling were also more likely to respond to the fee waiver program, as seems plausible—so-called Roy selection—then the IV estimates of schooling on wages include (i) the effect of schooling on wages, and (ii) a selection effect
Heckman calls this “essential heterogeneity”
Card (1999; 2001) argues that this is the reason why, contrary to expectations, instrumenting schooling generally results in higher estimates of the returns to schooling than those obtained by OLS
Detour # 1: What is the estimated parameter? Is it policy-relevant?
So, is the estimated parameter policy relevant?
Not if you are interested in estimating the effect of schooling on wages for the population at large
However, it may be the right parameter if you are considering expanding the fee waiver program and you want to assess how this will affect wages
Conclusion: Quasi-experimental methods
Quasi-experimental methods can be appealing because, in the best of circumstances, they approach the design of a randomized study
Can control for observable and unobservable differences between treated and control households
However, estimates are generally “local” in one way or another
Makes it difficult to extrapolate to other population groups if there is heterogeneity of effects
Also (especially with instrumental variables) they are opportunistic, and the exclusion restriction is untestable
Cannot count on “finding” a good instrument after a program has been rolled out and using this to assess impact
Observational methods: OLS
The intuition behind OLS and matching estimators of impact is that you can correct for differences between “treated” and “control” groups by including a vector of characteristics Xi
Equivalently, that there is selection on observables only
Basic set-up:
Yi = βTi + ΦXi + εi
The coefficient β is then an estimate of the average treatment effect
Concerns:
• Selection on unobservables
• Using observations outside the region of “common support”
• Parametric assumption
Observational methods: Matching
Match on the probability of participation
• Ideally we would match on the entire vector X of observed characteristics
• However, this is practically impossible, since X could be huge
• PSM: match on the basis of the propensity score (Rosenbaum and Rubin 1983)
Basic steps
Step 1: Regress participation on observable characteristics
• Ti = β1Xi + εi
• Predict T-hati, the propensity score
Step 2: Restrict sample to ensure “common support”
• Failure of common support is an important source of bias in observational studies (Heckman et al. 1997)
[Figure: densities of the propensity score (0 to 1) for participants and for non-participants, with the region of common support marked on the propensity score axis.]
Observational methods: Matching
Basic steps (continued)
Step 3: For each participant, find a sample of non-participants with similar propensity scores
• Various weighting schemes
Step 4: Compare the outcome indicators
• The difference is the estimate of the gain due to the program for that observation
Step 5: Calculate the mean of these individual gains to obtain the average overall gain
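Steps 3 to 5 can be sketched with the simplest weighting scheme: one-to-one nearest-neighbour matching with replacement. The propensity scores are taken as given here, the caliper of 0.1 is an arbitrary illustrative choice, and the data and function name are hypothetical.

```python
from statistics import fmean


def matched_gain(participants, non_participants, caliper=0.1):
    """Average gain for participants using nearest-neighbour matching.

    Each argument is a list of (propensity_score, outcome) pairs. Matching is
    with replacement; participants with no neighbour within `caliper` are dropped.
    """
    gains = []
    for score, outcome in participants:
        # Step 3: nearest non-participant by propensity score.
        match_score, match_outcome = min(
            non_participants, key=lambda pair: abs(pair[0] - score))
        if abs(match_score - score) > caliper:
            continue  # no comparable non-participant for this unit
        # Step 4: the difference in outcomes is the gain for this observation.
        gains.append(outcome - match_outcome)
    if not gains:
        raise ValueError("no participant has a match within the caliper")
    # Step 5: the mean of the individual gains.
    return fmean(gains)


# Hypothetical (propensity score, outcome) pairs.
participants = [(0.30, 12.0), (0.55, 15.0), (0.80, 18.0)]
non_participants = [(0.10, 8.0), (0.28, 10.0), (0.50, 13.0), (0.77, 17.0)]

print(round(matched_gain(participants, non_participants), 3))
```

In this example the three participants are matched to the non-participants with scores 0.28, 0.50 and 0.77, giving individual gains of 2, 2 and 1. Because only participants are matched, the result is an average gain for participants rather than for the whole population.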
Observational methods: Matching
Many recent developments in the matching literature
For example, Hirano, Imbens, and Ridder (2003) show that a reweighting of the data by the propensity score performs well:
Step 1: Predict propensity score, T-hati, as before
Step 2: Run OLS for outcome equation, weighting treated households by 1/T-hati and comparison households by 1/(1 - T-hati)
This produces a fully efficient estimator of the Average Treatment Effect with conservative standard errors
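A sketch of the reweighting idea on simulated data. Two simplifications are assumptions of this example and not part of the cited method: the propensity score is estimated as the share of treated units within each cell of a single discrete covariate (instead of a regression on many X variables), and the weighted OLS of the outcome on a constant and the treatment dummy is computed directly as a difference of weighted means, which yields the same coefficient. The data-generating process and names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import fmean


def cell_propensity_scores(x, t):
    """Share of treated units within each cell of a discrete covariate."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [treated, total]
    for xi, ti in zip(x, t):
        counts[xi][0] += ti
        counts[xi][1] += 1
    return {cell: treated / total for cell, (treated, total) in counts.items()}


def ipw_ate(x, t, y):
    """Weighted difference in means: treated get 1/p, comparison units 1/(1-p)."""
    scores = cell_propensity_scores(x, t)
    sum_w1 = sum_wy1 = sum_w0 = sum_wy0 = 0.0
    for xi, ti, yi in zip(x, t, y):
        p = scores[xi]
        if p <= 0.0 or p >= 1.0:
            continue  # cell lies outside the region of common support
        if ti == 1:
            w = 1.0 / p
            sum_w1 += w
            sum_wy1 += w * yi
        else:
            w = 1.0 / (1.0 - p)
            sum_w0 += w
            sum_wy0 += w * yi
    return sum_wy1 / sum_w1 - sum_wy0 / sum_w0


def simulate(n=30000, true_effect=1.0, seed=0):
    """Hypothetical data with selection on the observable x only."""
    rng = random.Random(seed)
    take_up_by_cell = {0: 0.2, 1: 0.5, 2: 0.8}
    x, t, y = [], [], []
    for _ in range(n):
        xi = rng.randrange(3)
        ti = 1 if rng.random() < take_up_by_cell[xi] else 0
        x.append(xi)
        t.append(ti)
        y.append(1.0 + 2.0 * xi + true_effect * ti + rng.gauss(0.0, 1.0))
    return x, t, y


x, t, y = simulate()
naive = (fmean([yi for ti, yi in zip(t, y) if ti == 1])
         - fmean([yi for ti, yi in zip(t, y) if ti == 0]))
weighted = ipw_ate(x, t, y)
print(f"naive: {naive:.2f}  reweighted: {weighted:.2f}  (true effect 1.0)")
```

The naive comparison is inflated because treated units have higher x; the reweighted estimate should be close to the true effect, but only because selection in this simulation depends on observables alone.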
Conclusion: OLS, matching
Low cost, and can use existing data sets (censuses, surveys)
However, need high-quality data with information on many X variables for treated and comparison observations
Matching is more flexible than OLS and does not make use of data outside the region of common support
This can be an important advantage
However, both methods are based on the assumption of selection on observables only (no selection on unobservables)
This is untestable and has to be argued on a case-by-case basis
In practice, single-difference OLS and matching can often be badly biased by unobserved heterogeneity that is correlated with treatment
Observational methods: DD and higher-order differences
Observed changes over time for non-participants provide the counterfactual for participants
Steps:
• Collect baseline data on non-participants and (probable) participants before the program
• Compare with data after the program
• Subtract the two differences, or use a regression with a dummy variable for participant
This allows for selection bias but it must be time-invariant and additive
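Diff-in-diff requires that the bias be additive and time-invariant
[Figure: outcome plotted against time (t=0, t=1), starting from Y0 and showing the observed Y1 and the counterfactual Y1*; the impact is the gap between Y1 and Y1*.]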
Observational methods: DD and higher-order differences
In practice, estimate a regression of the following form:
Ei = βTi + δYi + Φ(Yi*Ti) + εi
where Ti is a dummy for participants, Yi is a dummy for the post-program period, and Φ is the difference-in-difference estimate of program impact
Note that this is equivalent to a regression in first differences:
Ei,t - Ei,t-1 = δ + ΦTi + (εi,t - εi,t-1)
Both approaches can also be supplemented with a vector of characteristics Xi
Can also combine with matching:
Step 1: match observations on the basis of their baseline observable characteristics
Step 2: Test whether outcome grew by more in treated than in comparison units (individuals, schools, districts)
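With two groups and two periods and no additional covariates, the interaction coefficient in the levels regression equals a difference of four cell means, so the calculation can be sketched by hand. The enrollment rates below are hypothetical, as is the function name.

```python
from statistics import fmean


def diff_in_diff(treated_before, treated_after, comparison_before, comparison_after):
    """Change among treated units minus change among comparison units."""
    change_treated = fmean(treated_after) - fmean(treated_before)
    change_comparison = fmean(comparison_after) - fmean(comparison_before)
    return change_treated - change_comparison


# Hypothetical village enrollment rates before and after a program.
treated_before = [0.50, 0.54, 0.52]
treated_after = [0.70, 0.72, 0.74]
comparison_before = [0.60, 0.62, 0.64]
comparison_after = [0.70, 0.72, 0.68]

before_after = fmean(treated_after) - fmean(treated_before)    # ignores the time trend
with_without = fmean(treated_after) - fmean(comparison_after)  # ignores baseline gap
dd = diff_in_diff(treated_before, treated_after, comparison_before, comparison_after)
print(round(before_after, 2), round(with_without, 2), round(dd, 2))
```

In this example the before/after comparison gives 0.20 and the with/without comparison gives 0.02, while the difference-in-difference estimate is 0.12: it nets out both the common time trend (0.08) and the baseline gap between the groups.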
Observational methods: DD and higher-order differences
Example 1: Galiani, Gertler and Schargrodsky (2005) on impact of privatization of water services on child mortality in Argentina
• Did child mortality decrease by more in districts that privatized water than in those that did not?
• More convincing when you can show that pre-existing trends were the same in both groups (as they do)
Example 2: Berlinski, Galiani and Gertler (2005) on impact of preschool attendance on test scores in primary school
• Preschool construction program: Did test scores increase by more in provinces and among cohorts exposed to the construction program when they were of preschool age?
• More convincing with “placebo” experiment: only the affected cohorts in provinces that received the preschool intervention saw gains in test scores
Observational methods: DD and higher-order differences
Example 3: Filmer and Schady (2008)
Did female school enrollment grow by more in schools that offered female scholarships than in other schools in Cambodia?
Yes, but …
• These same schools appear to have higher pre-intervention growth rates in female enrollment
So, triple-differencing:
Did the school enrollment of girls, relative to that of boys, grow by more in schools that offered female scholarships?
• Yes, and there were no pre-existing differences between treated and control schools in the growth rate of the boy-girl enrollment ratio
Conclusion: DD and higher-order differencing
More convincing than OLS or matching with a single, post-intervention survey
Requires careful planning for baseline
Particularly convincing when there are placebo experiments
Things that you would not expect to change don’t change
Scholarship program for 7th graders should have no effects (or very small effects) on enrollment in (say) 1st grade
Cohorts not exposed to program should not behave differently from those who are
No apparent differences in pre-existing trends in outcomes
Detour: spillover effects
What if the effects of the treatment spill over to the control group, or if there are “general equilibrium effects”?
Intervention 1: Provide deworming drugs in Kenya (Miguel and Kremer)
Program benefits extend not just to those who receive the drugs, but also to other children in the study areas
Intervention 2: scholarships to low-SES girls in Cambodia (Filmer and Schady 2008)
Concern that increased enrollment among scholarship recipients affects enrollment decisions of other children in same grade
Can be serious threat to identification
Possible solution: move to a higher unit of aggregation, and compare “treated” and “control” villages or schools, rather than individuals
Detour: anticipation effects
What if people in the control group expect that they will be incorporated into the treatment in the future and change their behavior accordingly?
Consumption smoothing
• Simple version of permanent income hypothesis: all of short-term transfer income should be invested
Or maybe control households change their behavior (schooling, health-seeking, asset ownership) because they think that this makes it more likely that they will receive benefits?
Very hard to rule out
• Collect qualitative data
• Collect data from before baseline, and test for unexpected changes in behavior among controls
Conclusions and future challenges
Moving beyond averages: Assessing the impact of program on different population groups
Great deal of accumulating evidence of heterogeneity of treatment effects
A positive overall effect may hide a great deal of variability, possibly including zero or negative effects for some groups
Conclusions and future challenges
Open up the “Black box” provided by impact evaluations
What features of program matter?
For example, in explaining the impact of a CCT on outcomes, is it the cash that matters? the condition? the fact that transfers are made to women?
Various options for trying to untangle possible explanations:
1. Structural models (Todd and Wolpin 2007) or ex-ante simulation (Bourguignon, Ferreira and Leite 2003)
2. Randomize alternative program features, perhaps on a small-scale pilot basis (forthcoming evaluation of a CCT in Morocco)
3. Collect information on other “intermediate” outcomes, and see whether these help shed light on underlying mechanisms
Conclusions and future challenges
Example: Macours, Schady and Vakis (2008) on impact of the Atención a Crisis CCT program in Nicaragua on child cognitive development among children of preschool age
Program resulted in an improvement in language ability of ~.17 to ~.22 standard deviations
Was it the cash, the social marketing of the program, or the gender of the beneficiaries?
Literature identifies two key risk factors for inadequate cognitive development in poor countries:
• Inadequate nutrition (calories, proteins, micronutrients)
• Inadequate early stimulation
Program resulted in increase in food expenditures and diversification of food consumption (out of staples, and into fruits, vegetables, animal proteins), and increase in stimulation “inputs”
Table 7: Treatment effects on intermediate inputs
Variable | Obs | Mean control | Treatment effect | s.e.
Hh-level food consumption per capita (logs)
Total food | 2265 | 8.028 | 0.310*** | (0.043)
Staples | 2270 | 7.214 | 0.195*** | (0.049)
Animal protein | 2270 | 5.488 | 1.071*** | (0.12)
Fruit and vegetables | 2270 | 4.580 | 1.005*** | (0.12)
Child food intake (nr days in last week)
Tortilla | 3499 | 5.950 | 0.0432 | (0.10)
Milk | 3499 | 1.580 | 1.141*** | (0.23)
Meat | 3497 | 0.564 | 0.764*** | (0.075)
Eggs | 3499 | 1.594 | 1.258*** | (0.14)
Fruit | 3498 | 2.552 | 0.452** | (0.20)
Vegetables | 3499 | 1.468 | 0.713*** | (0.21)
Stimulus
Got toy in last 6 months | 3445 | 0.271 | 0.0681** | (0.031)
Has pen and paper to draw | 3480 | 0.690 | 0.101*** | (0.025)
Has books | 3490 | 0.073 | 0.0673*** | (0.021)
Somebody tells stories | 3494 | 0.520 | 0.125*** | (0.038)
Somebody reads to child | 3491 | 0.080 | 0.0811*** | (0.021)
Nr of hours read to | 3476 | 0.134 | 0.257*** | (0.061)
Preventive health care and health outcomes
waz-score | 3137 | -0.958 | -0.0516 | (0.10)
haz-score | 3136 | -1.243 | -0.0521 | (0.12)
Consulted doctor if sick or diarrhea | 2568 | 0.730 | 0.0569** | (0.028)
Weighed in last 6 months | 3473 | 0.705 | 0.0626*** | (0.020)
Received vitamin A or iron in last 6 months | 3490 | 0.734 | 0.0857*** | (0.019)
Received deworming drugs in last 6 months | 3489 | 0.566 | 0.0663*** | (0.025)
Caregivers
CESD depression scale | 1855 | 0 | -0.113* | (0.064)
HOME scale | 1939 | 0 | 0.0138 | (0.11)
NOTE: Variables for caregivers are standardized using mean and standard deviation of the control group. Standard errors corrected for clustering at community level.
Conclusions and future challenges
Example: Macours, Schady and Vakis (2008)—continued
But can the changes in inputs be fully explained by the increase in income?
Engel curve analysis