Evaluating Anti-Poverty Programs: Concepts and Methods Norbert Schady Development Research Group.


Outline of presentation

Introduction: The evaluation problem

Possible solutions:

1. Experimental evaluations
• Randomization

2. Quasi-experimental evaluations
• Instrumental variables
• Regression discontinuity

3. Non-experimental evaluations
• OLS
• Matching methods
• Differences-in-differences

Learning more from evaluations

Outline of presentation

Big disclaimer! I will frequently be drawing on my own work in this presentation for examples

The evaluation problem

Assigned programs
• Some units (individuals, households, villages) get the program
• Some do not

Examples:
• Social fund selects from applicants
• School construction: some villages get a new school, others get nothing
• Cash transfers to eligible households only

Ex-post evaluation

The evaluation problem

Impact is the difference between the relevant outcome indicator with the program and that without it

However, we can never observe someone in two different states of nature at the same time

While a post-intervention indicator is observed, its value in the absence of the program is not: it is a counterfactual

So all evaluation is essentially a problem of missing data

Calls for counterfactual analysis

Naïve comparisons can be deceptive

Common practices:
• Compare outcomes after the intervention to those before, or
• Compare units (people, households, villages) with and without the anti-poverty program

Potential biases from failure to control for:
• Other changes over time under the counterfactual, or
• Unit characteristics that influence program placement

[Figure, built up over four slides: an outcome indicator plotted against time, with the intervention taking place between t=0 and t=1.]

We observe an outcome indicator, Y0, at t=0, and its value rises after the program to Y1 (observed) at t=1.

However, we need to identify the counterfactual, Y1* (the value at t=1 in the absence of the program), since only then can we determine the impact of the intervention:

Impact = Y1 - Y1*
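The logic of the figure can be sketched with a few made-up numbers. The values of Y0, Y1 and Y1* below are assumptions chosen purely for illustration, not data from any program; the point is only that a before/after comparison overstates the impact whenever the outcome would have risen anyway.

```python
# Hypothetical values, chosen only to illustrate the figure.
y0 = 10.0                  # outcome before the intervention (t=0)
y1_observed = 16.0         # outcome after the intervention (t=1)
y1_counterfactual = 13.0   # outcome at t=1 without the program (never observed in practice)

before_after = y1_observed - y0             # naive before/after comparison
impact = y1_observed - y1_counterfactual    # Impact = Y1 - Y1*
bias = before_after - impact                # change that would have happened anyway

print(f"before/after: {before_after}, impact: {impact}, bias: {bias}")
```

In this sketch half of the observed change would have occurred without the program, so the naive comparison is twice the true impact.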

The evaluation problem

However, we never observe the counterfactual, and so have to estimate it…

Making comparisons between “treated” and “control” (or “comparison”) groups

Alternative solutions

Experimental evaluations (“Social experiments”)
• Program is randomly assigned
• If properly carried out, corrects for observable and unobservable differences between treated and controls
• Estimates the average treatment effect (ATE)

Quasi-experimental evaluations
• Instrumental variables
• Regression discontinuity
• Can correct for observable and unobservable differences, but estimated treatment effect is “local”

Non-experimental evaluations (“observational studies”)
• OLS, matching techniques: exogenous placement conditional on observables
• Differences in differences or higher-order differencing: can correct for time-invariant, additive differences (including in unobservables) between treated and controls

Randomization

Lottery used to assign households to “treatment” and “control” groups

If sample is large enough, this equates all characteristics—observable and unobservable—of both groups

Differences in outcomes can then be credibly interpreted as program impacts

No need for complicated econometrics or conditioning variables

A simple difference of means suffices
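As a rough illustration of why the lottery is enough, the sketch below simulates households with a characteristic the evaluator never observes, assigns treatment by a coin flip, and compares means. All numbers (sample size, effect size, outcome equation) are assumptions for the simulation, not estimates from any study.

```python
import random
import statistics

random.seed(0)
N = 20000            # hypothetical number of households
TRUE_EFFECT = 1.5    # assumed program impact, built into the simulated data

outcomes = {True: [], False: []}        # keyed by "won the lottery?"
unobservables = {True: [], False: []}
for _ in range(N):
    u = random.gauss(0, 1)                  # characteristic the evaluator never sees
    won_lottery = random.random() < 0.5     # random assignment
    y = 2.0 + TRUE_EFFECT * won_lottery + u + random.gauss(0, 1)
    outcomes[won_lottery].append(y)
    unobservables[won_lottery].append(u)

# Simple difference of means between treatment and control groups.
impact_estimate = statistics.mean(outcomes[True]) - statistics.mean(outcomes[False])
# The lottery also balances the unobservable across groups (gap close to zero).
balance_gap = statistics.mean(unobservables[True]) - statistics.mean(unobservables[False])

print(f"difference of means: {impact_estimate:.2f}")
print(f"gap in unobservable: {balance_gap:.2f}")
```

With a large sample the difference of means lands close to the assumed effect even though the unobservable is never controlled for.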

Randomization

Randomization checks
• Check random assignment
• Check whether conditioning on X variables makes a difference
• Check whether cross-sectional and first-differenced analysis yields similar results

Conclusion: Randomization

Randomization is the benchmark for quasi-experimental and non-experimental evaluation methods

Has become much more popular in developing countries in recent decades, and with good reason

Groundbreaking example of PROGRESA

However, randomization is no panacea

1. Often infeasible: Political and moral difficulties of denying treatment to eligible beneficiaries who have lost a lottery

2. Be thoughtful about extrapolating from estimated parameters

What is the estimated parameter? Is it policy-relevant?

Randomization estimates Average Treatment Effect (ATE) if all households in treatment group receive the treatment and all those in control group do not

If compliance in treatment group is imperfect, then can estimate Intent-to-Treat (ITT)—the impact of being offered the program

Or can inflate ITT by program take-up to estimate Treatment-on-the-Treated (TT)

Program take-up: R

ITT estimate of program effect: β1

TT estimate of program effect: β2 = β1/R
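The rescaling from ITT to TT is simple arithmetic, sketched below. The function name and the numbers are hypothetical, and the sketch assumes that nobody in the control group receives the program, so that take-up in the treatment group is the only source of non-compliance.

```python
def treatment_on_treated(itt: float, take_up: float) -> float:
    """Scale the intent-to-treat effect by program take-up: beta2 = beta1 / R."""
    if not 0.0 < take_up <= 1.0:
        raise ValueError("take-up R must lie in (0, 1]")
    return itt / take_up


itt = 0.06       # hypothetical ITT estimate (beta1)
take_up = 0.75   # hypothetical share of the treatment group that took up the program (R)
tt = treatment_on_treated(itt, take_up)
print(f"TT estimate: {tt:.3f}")
```

With these made-up values the TT estimate is 0.08: the lower the take-up, the more the ITT understates the effect on those actually treated.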


Deeper problem: Randomization often implemented in small-scale pilots, with highly-motivated staff

Impact of large-scale, perhaps nationwide program may be very different

US literature: the impact of attending preschool on school outcomes

Perry Preschool program compared to Head Start

Difficult problem to overcome

In some cases, randomization takes place in the context of a large-scale program

PROGRESA in Mexico

However, this tends to be politically difficult to sustain

Oportunidades evaluations in Mexico

Quasi-experimental analysis: RDD

Threshold M below which individuals are eligible for treatment, above which they are ineligible

Intuition behind approach is you compare individuals “just above” and “just below” this threshold value

Proxy means: determines eligibility for programs
• Scholarships in Cambodia (Filmer and Schady 2008)
• School fee reduction program in Bogota (Barrera, Linden and Urquiola 2007)

Geographic jurisdiction: program implemented in some areas but not others
• Piso Firme in Mexico: comparisons of households just across the border in Coahuila and Durango states (Cattaneo et al. 2007)

Class size on test scores in Bolivia (Urquiola 2007)


Sharp RDD: the threshold M perfectly predicts who receives a given treatment and who does not

Regress the outcome on a flexible formulation of the control function and a dummy for treatment

Estimate Yi = α + δf(Ci) + Φ·1(Ci<M) + εi, where 1(Ci<M) equals 1 if Ci is below the threshold and 0 otherwise

Note that, by definition, 1(Ci<M) = Ti

Can also estimate control function nonparametrically, above and below threshold

Fuzzy RDD: The threshold is a significant but imperfect predictor of treatment

Estimate Yi = α + δf(Ci) + ΦTi + εi, where Ti is instrumented with 1(Ci<M)
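A minimal sketch of the sharp case, on simulated data, is given below. It is a simplification rather than the specification used in any of the cited papers: the control function f(C) is taken to be linear and is fitted separately on each side of the threshold within a bandwidth, and the estimate of Φ is the gap between the two fitted lines at C = M. The helper `fit_line`, the bandwidth and all data-generating numbers are assumptions made for illustration.

```python
import random


def fit_line(x, y):
    """Closed-form OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(1)
M = 0.0            # eligibility threshold
TRUE_JUMP = 0.2    # assumed treatment effect at the threshold
BANDWIDTH = 10.0   # hypothetical window around the threshold

below_c, below_y, above_c, above_y = [], [], [], []
for _ in range(20000):
    c = random.uniform(-25, 25)      # running variable (e.g. a proxy-means score)
    treated = c < M                  # sharp design: eligibility fully determines treatment
    y = 0.5 + 0.01 * c + TRUE_JUMP * treated + random.gauss(0, 0.1)
    if abs(c - M) > BANDWIDTH:
        continue                     # keep only observations near the threshold
    if treated:
        below_c.append(c - M)
        below_y.append(y)
    else:
        above_c.append(c - M)
        above_y.append(y)

# With the running variable centred at M, each intercept is the fitted outcome at the threshold.
intercept_below, _ = fit_line(below_c, below_y)
intercept_above, _ = fit_line(above_c, above_y)
rdd_estimate = intercept_below - intercept_above
print(f"estimated jump at threshold: {rdd_estimate:.3f}")
```

In the fuzzy case the same eligibility dummy would instead serve as an instrument for actual treatment, which this sketch does not cover.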

Quasi-experimental analysis: RDD

Identifying assumption: no discontinuity in counterfactual values at threshold
• Essentially: threshold is given exogenously and individuals respond mechanically to it
• Can be violated if there is sorting

Urquiola and Verhoogen (2008): sorting in Chilean education system
• Schools don’t want to add another class because it is expensive
• Increase fees to limit enrollment
• Parents understand school behavior, and more highly educated parents sort themselves into schools with smaller class sizes

Discontinuity in observable (and perhaps unobservable) characteristics at threshold violates identifying assumption

RDD check: present evidence of no observable differences at threshold


[Figure: two panels plotting probability (vertical axis, 0 to 1) against relative ranking (horizontal axis, -25 to 25), each with a quartic and a non-parametric fit. Left panel: 0 = $45 cutoff. Right panel: 0 = $60 cutoff.]

Intent-to-treat effects of $45 versus no scholarship (LHS) and $60 versus $45 (RHS). Source: Filmer and Schady (2008)


RDD estimates treatment effects at the threshold

If there is heterogeneity of treatment effects, this may not correspond to the ATE

However, it may be a policy-relevant parameter for a “small” expansion of the program near the threshold

For example, for targeted programs, it will estimate effect of expanding coverage of program to incorporate “marginal” individuals

Quasi-experimental analysis: IV

Intuition: Identifying exogenous variation using a 3rd variable

Outcome regression: Yi = βTi + ΦXi + εi

Concern is that there are differences between treated (T=1) and control (T=0) individuals that are not captured by vector Xi

Induces correlation between Ti and εi, and hence biased estimates of program effects

Solution: identify a variable Zi that is correlated with Ti (“first stage”) but is uncorrelated with εi (“exclusion restriction”)

Quasi-experimental analysis: IV

Steps:

1. Regress Ti = β1Zi + Φ1Xi + νi
• Predict T-hati
• This gives you the exogenous variation in Ti

2. Regress Yi = β2T-hati + Φ2Xi + ωi

In practice, this is done in one step to get the correct standard errors

Practical difficulty: finding convincing instruments (the exclusion restriction cannot be tested)

If exclusion restriction does not hold, biases can be severe
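The two steps can be sketched on simulated data as follows. This is an illustration under strong assumptions, not a recipe: the instrument is a randomized binary lottery, there are no covariates X, both regressions include a constant, and the treatment effect is the same for everyone. The helper `fit_line` and all numbers are hypothetical. As noted above, the manual second stage gives the right point estimate but not the right standard errors, so none are reported.

```python
import random


def fit_line(x, y):
    """Closed-form OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(2)
TRUE_EFFECT = 2.0   # assumed effect of treatment on the outcome

z, t, y = [], [], []
for _ in range(20000):
    zi = 1 if random.random() < 0.5 else 0     # instrument: randomized lottery
    u = random.gauss(0, 1)                     # unobservable that drives both T and Y
    ti = 1 if 2.0 * zi - 1.0 + u + random.gauss(0, 1) > 0 else 0
    yi = 1.0 + TRUE_EFFECT * ti + u + random.gauss(0, 1)
    z.append(zi)
    t.append(ti)
    y.append(yi)

# Naive regression of Y on T: biased, because T is correlated with the error term.
_, ols_estimate = fit_line(t, y)

# Step 1: first stage, regress T on Z and predict T-hat.
a1, b1 = fit_line(z, t)
t_hat = [a1 + b1 * zi for zi in z]

# Step 2: regress Y on T-hat (point estimate only).
_, iv_estimate = fit_line(t_hat, y)

print(f"OLS: {ols_estimate:.2f}   IV: {iv_estimate:.2f}   truth: {TRUE_EFFECT}")
```

In this simulation OLS overstates the effect because the unobservable raises both participation and the outcome, while the IV estimate is close to the assumed value.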

Quasi-experimental analysis: IV

Some examples: Partially randomized design

Angrist et al. (2002) on impact of vouchers on test scores in Colombia

Schady and Araujo (2008) on impact of cash transfers on enrollment in Ecuador
• Lottery to determine access to Bono de Desarrollo Humano cash transfer program
• But substantial contamination of control group, which appears to be non-random
• Want to determine impact of program on enrollment
• Solution: regress enrollment on treatment, with treatment instrumented by the lottery
• Since the lottery was randomized, it is not correlated with regression error term


Some examples: Political variables as instruments

Want to assess the impact of new school infrastructure on enrollment in Peru

But placement of school infrastructure may be endogenous

Maybe communities with “tastes” for education clamor more for a new school, and tastes are unobserved

Maybe program administrators want to place schools in very disadvantaged areas or in areas where they expect the returns to be highest

In any of these cases, a “simple” regression of school outcomes (enrollment, test scores) on new school infrastructure could be biased


Schady (2000) shows that the distribution of expenditures on school infrastructure in the Fujimori administration was partially determined by political considerations:

Districts that had voted for Fujimori in 1990 but against Fujimori in 1993 were more likely to receive school investments than other, comparable districts (a “buy-back” strategy)

Paxson and Schady (2002) use this to construct an instrument for school infrastructure:

Regress enrollment on school infrastructure, with school infrastructure instrumented with the change in the share voting for Fujimori

Exclusion restriction: changes in vote share uncorrelated with regression error term


Some examples: Program glitches

Impact of Bolsa Alimentação CCT program
• Software used by program could not read special characters
• As a result, people whose names had special characters (for example, Ângela, João, José, Gonçalves) were rejected by the system, and did not receive Bolsa payments in a first phase

Interested in estimating the effect of Bolsa on an outcome, but participation in program may be endogenous

Regress outcome (say, height-for-age z-score) on Bolsa, with Bolsa instrumented with whether or not applicant had special character in name


If identifying assumptions hold, IV estimates are LATE—they estimate the impact of treatment on outcome on “complier” households (Imbens and Angrist 1994; Angrist, Imbens and Rubin 1996)

These are households whose probability of receiving the treatment was affected by the instrument

So, in “partial randomization” example, these exclude individuals who would have received transfers no matter what, as well as those who would not have received transfers no matter what

Note that this is a counterfactual comparison—we cannot identify these individuals in practice

So, if there is heterogeneity of treatment effects, so that some households respond differently to an intervention than others, it is hard to extrapolate to another population—even if IV estimator is unbiased


Also, if there is selection on expected returns (Card 1999; Heckman and Vytlacil 2005), so that those who stand to benefit the most are most likely to select into the program, this selection effect is incorporated into the estimated treatment effects

Imagine creating a program that randomly assigns fee waivers to some districts in a country but not others

Since program is randomized, you can estimate impact of fee waiver on school attainment without additional complications


But say you also want to use this design to estimate the impact of school attainment on wages

In theory, you could run a regression of wages on schooling, with schooling instrumented with whether a district was selected into the fee reduction program

However, if those who stood to gain the most from schooling were also more likely to respond to the fee waiver program, as seems plausible—so-called Roy selection—then the IV estimates of schooling on wages include (i) the effect of schooling on wages, and (ii) a selection effect

Heckman calls this “essential heterogeneity”

Card (1999; 2001) argues that this is the reason why, contrary to expectations, instrumenting schooling generally results in higher estimates of the returns to schooling than those obtained by OLS

Detour #1: What is the estimated parameter? Is it policy-relevant?

So, is the estimated parameter policy relevant?

Not if you are interested in estimating the effect of schooling on wages for the population at large

However, it may be the right parameter if you are considering expanding the fee waiver program and you want to assess how this will affect wages

Conclusion: Quasi-experimental methods

Quasi-experimental methods can be appealing because, in the best of circumstances, they approach the design of a randomized study

Can control for observable and unobservable differences between treated and control households

However, estimates are generally “local” in one way or another

Makes it difficult to extrapolate to other population groups if there is heterogeneity of effects

Also (especially with instrumental variables) they are opportunistic, and the exclusion restriction is untestable

Cannot count on “finding” a good instrument after a program has been rolled out and using this to assess impact

Observational methods: OLS

The intuition behind OLS and matching estimators of impact is that you can correct for differences between “treated” and “control” groups by including a vector of characteristics Xi

Equivalently, that there is selection on observables only

Basic set-up: Yi = βTi + ΦXi + εi

The coefficient β is then an estimate of the average treatment effect

Concerns:
• Selection on unobservables
• Using observations outside the region of “common support”
• Parametric assumption

Observational methods: Matching

Match on the probability of participation

• Ideally we would match on the entire vector X of observed characteristics

• However, this is practically impossible, since X could be huge

• PSM: match on the basis of the propensity score (Rosenbaum and Rubin 1983)

Basic steps

Step 1: Regress participation on observable characteristics
• Ti = β1Xi + εi
• Predict T-hati, the propensity score

Step 2: Restrict sample to ensure “common support”
• Failure of common support is an important source of bias in observational studies (Heckman et al. 1997)

[Figure, built up over three slides: density of the propensity score (horizontal axis, 0 to 1) for participants and for non-participants, with the region of common support marked where the two densities overlap.]

Observational methods: Matching

Basic steps (continued)

Step 3: For each participant, find a sample of non-participants with similar propensity scores
• Various weighting schemes

Step 4: Compare the outcome indicators
• The difference is the estimate of the gain due to the program for that observation

Step 5: Calculate the mean of these individual gains to obtain the average overall gain
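The five steps can be sketched on simulated data as below. Several simplifications are assumptions of this sketch rather than part of the method as described: there is a single observable X, the propensity score comes from a linear regression of participation on X (in practice a probit or logit is common), each participant is matched to a single nearest non-participant with replacement, and selection is on X only. The helper `fit_line` and all numbers are hypothetical.

```python
import bisect
import random


def fit_line(x, y):
    """Closed-form OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(3)
TRUE_EFFECT = 1.0   # assumed program impact
x, t, y = [], [], []
for _ in range(4000):
    xi = random.random()                                  # observable characteristic
    ti = 1 if random.random() < 0.2 + 0.6 * xi else 0     # participation depends on X only
    x.append(xi)
    t.append(ti)
    y.append(1.0 + 2.0 * xi + TRUE_EFFECT * ti + random.gauss(0, 0.5))

# Step 1: regress participation on X and predict the propensity score.
a, b = fit_line(x, t)
score = [a + b * xi for xi in x]

treated = [(s, yi) for s, ti, yi in zip(score, t, y) if ti == 1]
controls = sorted((s, yi) for s, ti, yi in zip(score, t, y) if ti == 0)
control_scores = [s for s, _ in controls]

# Step 2: keep participants whose score lies in the region of common support.
low = max(min(s for s, _ in treated), control_scores[0])
high = min(max(s for s, _ in treated), control_scores[-1])
treated = [(s, yi) for s, yi in treated if low <= s <= high]

# Steps 3 and 4: nearest non-participant by score, then the individual gain.
gains = []
for s, yi in treated:
    k = bisect.bisect_left(control_scores, s)
    neighbours = [j for j in (k - 1, k) if 0 <= j < len(controls)]
    j = min(neighbours, key=lambda idx: abs(control_scores[idx] - s))
    gains.append(yi - controls[j][1])

# Step 5: average the individual gains.
matched_gain = sum(gains) / len(gains)

treated_mean = sum(yi for yi, ti in zip(y, t) if ti == 1) / sum(t)
control_mean = sum(yi for yi, ti in zip(y, t) if ti == 0) / (len(t) - sum(t))
naive_gap = treated_mean - control_mean
print(f"naive gap: {naive_gap:.2f}   matched gain: {matched_gain:.2f}")
```

Because participants in this simulation have higher X on average, the naive gap overstates the assumed effect, while the matched gain is close to it.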

Observational methods: Matching

Many recent developments in the matching literature

For example, Hirano, Imbens, and Ridder (2003) show that a reweighting of the data by the propensity score performs well:

Step 1: Predict propensity score, T-hati, as before

Step 2: Run OLS for outcome equation, weighting treated households by 1/T-hati and comparison households by 1/(1 - T-hati)

This produces a fully efficient estimator of the Average Treatment Effect with conservative standard errors
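A minimal sketch of the reweighting idea follows. It is not the estimator of Hirano, Imbens, and Ridder (2003) in full: the score here comes from a linear regression on a single simulated X, and it is clipped away from 0 and 1 as a precaution, which is a choice of this sketch. With no other regressors, weighted OLS of the outcome on a constant and a treatment dummy reduces to a difference of weighted means, which is what the code computes. The helper `fit_line` and all numbers are hypothetical.

```python
import random


def fit_line(x, y):
    """Closed-form OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(5)
TRUE_EFFECT = 1.0   # assumed program impact
x, t, y = [], [], []
for _ in range(10000):
    xi = random.random()
    ti = 1 if random.random() < 0.2 + 0.6 * xi else 0     # selection on X only
    x.append(xi)
    t.append(ti)
    y.append(1.0 + 2.0 * xi + TRUE_EFFECT * ti + random.gauss(0, 0.5))

# Step 1: predict the propensity score (clipped to avoid extreme weights).
a, b = fit_line(x, t)
score = [min(max(a + b * xi, 0.01), 0.99) for xi in x]

# Step 2: weight treated units by 1/score and comparison units by 1/(1 - score).
weights = [1 / s if ti == 1 else 1 / (1 - s) for s, ti in zip(score, t)]


def weighted_mean(group):
    """Weighted mean outcome for treated (group=1) or comparison (group=0) units."""
    total = sum(w * yi for w, yi, ti in zip(weights, y, t) if ti == group)
    norm = sum(w for w, ti in zip(weights, t) if ti == group)
    return total / norm


ate_estimate = weighted_mean(1) - weighted_mean(0)
print(f"reweighted estimate of the ATE: {ate_estimate:.2f}")
```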

Conclusion: OLS, matching

Low cost, and can use existing data sets (censuses, survey)

However, need high-quality data with information on many X variables for treated and comparison observations

Matching is more flexible than OLS and does not make use of data outside the region of common support

This can be an important advantage

However, both methods are based on the assumption of no selection on unobservables

This is untestable and has to be argued on a case-by-case basis

In practice, single-difference OLS and matching can often be badly biased by unobserved heterogeneity that is correlated with treatment

Observational methods: DD and higher-order differences

Observed changes over time for non-participants provide the counterfactual for participants

Steps:
1. Collect baseline data on non-participants and (probable) participants before the program
2. Compare with data after the program
3. Subtract the two differences, or use a regression with a dummy variable for participant

This allows for selection bias, but the bias must be time-invariant and additive

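[Figure: outcome plotted against time (t=0, t=1), showing Y0, the observed Y1, the counterfactual Y1*, and the impact as the gap between Y1 and Y1*. Title: Diff-in-diff requires that the bias be additive and time-invariant.]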


In practice, estimate a regression of the following form:

Ei = βTi + δYi + Φ(Yi*Ti) + εi

where Ei is the outcome, Ti is a dummy for participants, Yi is a dummy for the post-program period, and Φ is the difference-in-difference estimate of program impact

Note that this is equivalent to a regression in first differences:

Ei,t - Ei,t-1 = δ + ΦTi + (εi,t - εi,t-1)

Both approaches can also be supplemented with a vector of characteristics Xi

Can also combine with matching:

Step 1: match observations on the basis of their baseline observable characteristics

Step 2: Test whether outcome grew by more in treated than in comparison units (individuals, schools, districts)
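The basic difference-in-difference calculation (without the matching step) can be sketched on a simulated two-period panel. The numbers are assumptions: participants start from a higher level for reasons the evaluator does not observe, both groups share a common trend, and the program adds a fixed amount. Without covariates, the coefficient on the interaction term in the regression equals the difference in mean changes computed here.

```python
import random
import statistics

random.seed(4)
TRUE_IMPACT = 1.0     # assumed program impact
COMMON_TREND = 0.5    # change over time that affects both groups

change = {True: [], False: []}       # outcome change by participation status
post_level = {True: [], False: []}   # post-program outcome by participation status
for i in range(4000):
    treated = i % 2 == 0
    # Time-invariant, additive selection bias: participants start 2.0 higher.
    fixed = random.gauss(0, 1) + (2.0 if treated else 0.0)
    e_before = 5.0 + fixed + random.gauss(0, 1)
    e_after = 5.0 + fixed + COMMON_TREND + TRUE_IMPACT * treated + random.gauss(0, 1)
    change[treated].append(e_after - e_before)
    post_level[treated].append(e_after)

# Difference-in-difference: change for participants minus change for non-participants.
dd_estimate = statistics.mean(change[True]) - statistics.mean(change[False])
# Single difference after the program: contaminated by the pre-existing gap.
single_difference = statistics.mean(post_level[True]) - statistics.mean(post_level[False])

print(f"single difference: {single_difference:.2f}   diff-in-diff: {dd_estimate:.2f}")
```

Differencing removes the fixed gap between groups, so the estimate is close to the assumed impact; it would not be if the groups had different underlying trends.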


Example 1: Galiani, Gertler and Schargrodsky (2005) on impact of privatization of water services on child mortality in Argentina

• Did child mortality decrease by more in districts that privatized water than in those that did not?

• More convincing when you can show that pre-existing trends were the same in both groups (as they do)

Example 2: Berlinski, Galiani and Gertler (2005) on impact of preschool attendance on test scores in primary school

• Preschool construction program: Did test scores increase by more in provinces and among cohorts exposed to the construction program when they were of preschool age?

• More convincing with “placebo” experiment: only the affected cohorts in provinces that received the preschool intervention saw gains in test scores


Example 3: Filmer and Schady (2008)

Did female school enrollment grow by more in schools that offered female scholarships than in other schools in Cambodia?

Yes, but …
• These same schools appear to have higher pre-intervention growth rates in female enrollment

So, triple-differencing:

Did the school enrollment of girls, relative to that of boys, grow by more in schools that offered female scholarships?

• Yes, and there were no pre-existing differences between treated and control schools in the growth rate of the boy-girl enrollment ratio
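The arithmetic of the triple difference can be sketched with invented growth rates. The figures below are hypothetical and are NOT taken from Filmer and Schady (2008); they only show how differencing girls against boys removes a school-level trend that a double difference would wrongly attribute to the scholarships.

```python
# Hypothetical enrollment growth (percentage points) over the program period.
# These numbers are invented for illustration and do not come from the study.
growth = {
    ("scholarship", "girls"): 12.0,
    ("scholarship", "boys"): 5.0,
    ("comparison", "girls"): 6.0,
    ("comparison", "boys"): 4.0,
}

# Double difference: growth for girls in scholarship schools versus comparison schools.
double_difference = growth[("scholarship", "girls")] - growth[("comparison", "girls")]

# Triple difference: girl-boy growth gap in scholarship schools
# minus the girl-boy growth gap in comparison schools.
gap_scholarship = growth[("scholarship", "girls")] - growth[("scholarship", "boys")]
gap_comparison = growth[("comparison", "girls")] - growth[("comparison", "boys")]
triple_difference = gap_scholarship - gap_comparison

print(f"double difference: {double_difference}   triple difference: {triple_difference}")
```

Here boys' enrollment also grew faster in scholarship schools, so part of the double difference reflects a school-wide trend; the triple difference nets it out.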

Conclusion: DD and higher-order differencing

More convincing than OLS or matching with a single, post-intervention survey

Requires careful planning for baseline

Particularly convincing when there are placebo experiments
• Things that you would not expect to change don’t change
• Scholarship program for 7th graders should have no effects (or very small effects) on enrollment in (say) 1st grade
• Cohorts not exposed to program should not behave differently from those who are
• No apparent differences in pre-existing trends in outcomes

Detour: spillover effects

What if the effects of the treatment spill over to the control group, or if there are “general equilibrium effects”?

Intervention 1: Provide deworming drugs in Kenya (Miguel and Kremer)

Program benefits extend not just to those who receive the drugs, but also to other children in the study areas

Intervention 2: scholarships to low-SES girls in Cambodia (Filmer and Schady 2008)

Concern that increased enrollment among scholarship recipients affects enrollment decisions of other children in same grade

Can be serious threat to identification

Possible solution: move to a higher unit of aggregation, that is, compare “treated” and “control” villages or schools, rather than individuals

Detour: anticipation effects

What if people in the control group expect that they will be incorporated into the treatment in the future and change their behavior accordingly?

Consumption smoothing
• Simple version of permanent income hypothesis: all of short-term transfer income should be invested

Or maybe control households change their behavior (schooling, health-seeking, asset ownership) because they think that this makes it more likely that they will receive benefits?

Very hard to rule out
• Collect qualitative data
• Collect data from before baseline, and test for unexpected changes in behavior among controls

Conclusions and future challenges

Moving beyond averages: assessing the impact of program on different population groups
• Great deal of accumulating evidence of heterogeneity of treatment effects
• A positive overall effect may hide a great deal of variability, possibly including zero or negative effects for some groups


Open up the “Black box” provided by impact evaluations

What features of program matter?
• For example, in explaining the impact of a CCT on outcomes, is it the cash that matters? The condition? The fact that transfers are made to women?

Various options for trying to untangle possible explanations:
1. Structural models (Todd and Wolpin 2007) or ex-ante simulation (Bourguignon, Ferreira and Leite 2003)
2. Randomize alternative program features, perhaps on a small-scale pilot basis (forthcoming evaluation of a CCT in Morocco)
3. Collect information on other “intermediate” outcomes, and see whether these help shed light on underlying mechanisms


Example: Macours, Schady and Vakis (2008) on impact of the Atención a Crisis CCT program in Nicaragua on child cognitive development among children of preschool age

Program resulted in an improvement in language ability of ~.17 to ~.22 standard deviations

Was it the cash, the social marketing of the program, or the gender of the beneficiaries?

Literature identifies two key risk factors for inadequate cognitive development in poor countries:

• Inadequate nutrition (calories, proteins, micronutrients)
• Inadequate early stimulation

Program resulted in increase in food expenditures and diversification of food consumption (out of staples, and into fruits, vegetables, animal proteins), and increase in stimulation “inputs”

Table 7: Treatment effects on intermediate inputs

Variable                                      Obs    Mean control   Treatment effect   s.e.

Hh-level food consumption per capita (logs)
Total food                                    2265   8.028          0.310***           (0.043)
Staples                                       2270   7.214          0.195***           (0.049)
Animal protein                                2270   5.488          1.071***           (0.12)
Fruit and vegetables                          2270   4.580          1.005***           (0.12)

Child food intake (nr days in last week)
Tortilla                                      3499   5.950          0.0432             (0.10)
Milk                                          3499   1.580          1.141***           (0.23)
Meat                                          3497   0.564          0.764***           (0.075)
Eggs                                          3499   1.594          1.258***           (0.14)
Fruit                                         3498   2.552          0.452**            (0.20)
Vegetables                                    3499   1.468          0.713***           (0.21)

Stimulus
Got toy in last 6 months                      3445   0.271          0.0681**           (0.031)
Has pen and paper to draw                     3480   0.690          0.101***           (0.025)
Has books                                     3490   0.073          0.0673***          (0.021)
Somebody tells stories                        3494   0.520          0.125***           (0.038)
Somebody reads to child                       3491   0.080          0.0811***          (0.021)
Nr of hours read to                           3476   0.134          0.257***           (0.061)

Preventive health care and health outcomes
waz-score                                     3137   -0.958         -0.0516            (0.10)
haz-score                                     3136   -1.243         -0.0521            (0.12)
Consulted doctor if sick or diarrhea          2568   0.730          0.0569**           (0.028)
Weighed in last 6 months                      3473   0.705          0.0626***          (0.020)
Received vitamin A or iron in last 6 months   3490   0.734          0.0857***          (0.019)
Received deworming drugs in last 6 months     3489   0.566          0.0663***          (0.025)

Caregivers
CESD depression scale                         1855   0              -0.113*            (0.064)
HOME scale                                    1939   0              0.0138             (0.11)

NOTE: Variables for caregivers are standardized using mean and standard deviation of the control group. Standard errors corrected for clustering at community level.


Example: Macours, Schady and Vakis (2008)—continued

But can the changes in inputs be fully explained by the increase in income?

Engel curve analysis



