7. Measuring Impact (Martinez) Manila - World Bankpubdocs.worldbank.org/en/165451526071510608/14-Measuring... · Sebastian Martinez The World Bank Impact Evaluation Methods for Policymakers

Impact Evaluation

Measuring Impact

Sebastian Martinez
The World Bank

Impact Evaluation Methods for Policymakers

Note: slides by Sebastian Martinez. The content of this presentation reflects the views of the author, and not necessarily those of the World Bank. December 2007.

2

Measuring Impact

1) Causal Inference
   Counterfactuals
   Counterfeit Counterfactuals:
      Before and After (pre-post)
      Enrolled-not enrolled (apples and oranges)

2) IE Methods Toolbox:
   Randomized Controls
   Randomized Promotion (IV)
   Discontinuity Design (RDD)
   Difference in Difference (Diff-in-diff)
   Matching (P-score matching)

3

Impact Evaluation

Logical Framework
Theory
Measuring Impact
Identification Strategy
Data
Operational Plan
Resources

4

Measuring Impact

1) Causal Inference
   Counterfactuals
   Counterfeit Counterfactuals:



5

Our Objective:

Estimate the CAUSAL effect (impact) of

intervention P (program or treatment)

on outcome Y (indicator, measure of success)

Example: what is the effect of a cash transfer program (P) on household consumption (Y)?

6

Causal Inference

What is the effect of P on Y?

Answer:

α= (Y | P=1)-(Y | P=0)

Can we all go home?

7

Problem of MISSING DATA

For a program beneficiary:

we observe (Y | P=1): Consumption level (Y) with a cash transfer program (P)

but we do not observe (Y | P=0): Consumption level (Y) without a cash transfer program (P)

α= (Y | P=1)-(Y | P=0)

8

Solution

Estimate what would have happened to Y in the absence of P

We call this the…………

COUNTERFACTUAL

Hint: The key to a good impact evaluation is a valid counterfactual!

9

Estimating Impact of P on Y

OBSERVE (Y | P=1)
   Intention to Treat (ITT): those offered treatment
   Treatment on the Treated (TOT): those receiving treatment

ESTIMATE counterfactual for (Y | P=0)

Use comparison or control group

α= (Y | P=1)-(Y | P=0)

IMPACT = outcome with treatment - counterfactual

10

The perfect “Clone”

Beneficiary: 6 Candies
Control: 4 Candies

Impact = 6 - 4 = 2 Candies

11

In reality, use statistics

Beneficiary: Average Y = 6 Candies
Control: Average Y = 4 Candies

Impact = 6 - 4 = 2 Candies
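A minimal sketch of the "use statistics" idea: with groups instead of a perfect clone, the impact estimate is the difference between group averages. The individual candy counts below are hypothetical, chosen only so that the averages come out to 6 and 4 as on the slide.

```python
from statistics import mean


def difference_in_means(treated_outcomes, control_outcomes):
    """Average outcome of the treated group minus average outcome of the controls."""
    return mean(treated_outcomes) - mean(control_outcomes)


# Hypothetical individual outcomes (not from the slides).
beneficiaries = [5, 6, 7, 6]   # average is 6
controls = [4, 3, 5, 4]        # average is 4

impact = difference_in_means(beneficiaries, controls)
print(impact)
```

The estimate is only as good as the control group: the difference equals the impact only if the controls are a valid counterfactual for the beneficiaries.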

12

Getting Good Counterfactuals

Understand the DATA GENERATION process
   Behavioral process by which program participation (treatment) is determined
   How are benefits assigned?
   What are the eligibility rules?

The treated observation and the counterfactual:
   have identical characteristics, except for benefiting from the intervention

Hint: With a good counterfactual, the only reason for different outcomes between treatments and controls is the intervention (P)

13

Case Study

What is the effect of a cash transfer program (P) on household consumption (Y)?

PROGRESA/OPORTUNIDADES Program
   National anti-poverty program in Mexico
      Started 1997
      5 million beneficiaries by 2004
      Eligibility based on poverty index
   Cash transfers
      conditional on school and health care attendance

Rigorous impact evaluation with rich data
   506 communities, 24K households
   Baseline 1997, follow-up 2008

Many outcomes of interest. Here we consider:
   Standard of living: consumption per capita

14

Case Study: Eligibility and Enrollment

[Diagram: households classified as Eligibles (Poor) or Ineligibles (Non-Poor), and as Enrolled or Not Enrolled]

15

Measuring Impact

1) Causal Inference
   Counterfactuals
   Counterfeit Counterfactuals:
      Before and After (pre-post)
      Enrolled-not enrolled (apples and oranges)


16

Counterfeit Counterfactuals

Two common counterfactuals to be avoided!!

Before and After (pre-post)
   Data on the same individuals before and after an intervention

Enrolled-not enrolled (apples and oranges)
   Data on a group of individuals that enrolled in a program, and another group that did not
   We don’t know why

Both counterfactuals may lead to biased results

17

Counterfeit Counterfactual #1

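Before and After

[Figure: outcome Y against Time, from T=0 (Baseline) to T=1 (Endline). B is the outcome at baseline, A is the observed outcome at endline, and C is the counterfactual at endline.]

A-B = 4
A-C = 2
IMPACT?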

18

Case 1: Before and After

What is the effect of a cash transfer program (P) on household consumption (Y)?

2 Points in Time
   Measure beneficiaries’:
      Consumption at T=0
      Consumption at T=1

Estimate of counterfactual
   $(Y_{i,t} \mid P=0) = (Y_{i,t-1} \mid P=0)$

“Impact” = A-B = 35

[Figure: Y against Time. B = 233 at T=0, A = 268 at T=1, α = 35.]

19


Case 1 - Before and After

                          Control - Before    Treatment - After    t-stat
Mean                      233.48              268.75               16.3

                          Linear Regression   Multivariate Linear Regression
Estimated Impact on CPC   35.27**             34.28**
                          (2.16)              (2.11)

** Significant at 1% level


20


What’s the Problem?

2 Points in Time
   Only measure beneficiaries:
      Consumption at T=0
      Consumption at T=1

Estimate of counterfactual
   $(Y_{i,t} \mid P=0) = (Y_{i,t-1} \mid P=0)$

“Impact” = A-B = 35

Does not control for time-varying factors
   Boom: Impact = A-C, so A-B is an overestimate
   Recession: Impact = A-D, so A-B is an underestimate

[Figure: Y against Time. B = 233 at T=0 (1997), A = 268 at T=1 (1998), α = 35. C? and D? mark possible counterfactuals at T=1, and the impact is the distance from A to each.]

21

Measuring Impact

1) Causal Inference
   Counterfactuals
   Counterfeit Counterfactuals:
      Before and After (pre-post)
      Enrolled-not enrolled (apples and oranges)


22

Counterfeit Counterfactual #2
Enrolled-not enrolled

Post-treatment data on 2 groups
   Enrolled: treatment group
   Not-enrolled: “control” group (counterfactual)
      Those ineligible to participate
      Those that choose NOT to participate

Selection Bias
   Reason for not enrolling may be correlated with outcome (Y)
      Control for observables
      But not unobservables!!
   Estimated impact is confounded with other things

23

Case 2: Enrolled-not enrolled

Measure outcomes in post-treatment (1998)

[Diagram: Eligibles (Poor) and Ineligibles (Non-Poor). Not Enrolled: Y = 290. Enrolled: Y = 268.]

In what ways might enrolled/not enrolled be different, other than program?

24

Case 2: Enrolled-not enrolled

Case 2 - Enrolled/Not Enrolled

                          Not Enrolled        Enrolled             t-stat
Mean CPC                  290.16              268.7541             5.6

                          Linear Regression   Multivariate Linear Regression
Estimated Impact on CPC   -22.7**             -4.15
                          (3.78)              (4.05)

** Significant at 1% level

25

What is going on??

Case Study

                          Case 1 - Before and After            Case 2 - Enrolled/Not Enrolled
                          Linear       Multivariate Linear     Linear       Multivariate Linear
                          Regression   Regression              Regression   Regression
Estimated Impact on CPC   35.27**      34.28**                 -22.7**      -4.15
                          (2.16)       (2.11)                  (3.78)       (4.05)

** Significant at 1% level

Which of these do we believe?

Problem with Before-After:
   Cannot control for other time-varying factors

Problem with Enrolled-Not Enrolled:
   Do not know if other factors, beyond the intervention, are affecting the outcome

26

Measuring Impact



2) IE Methods Toolbox:
   Randomized Controls
   Randomized Promotion (IV)
   Discontinuity Design (RDD)
   Difference in Difference (Diff-in-diff)
   Matching (P-score matching)

27

Choosing your methods…..

To identify an IE method for your program, consider:

   Prospective/retrospective
   Eligibility rules
   Roll-out plan (pipeline)
   Is universe of eligibles larger than available resources at a given point in time?
      Budget and capacity constraints?
      Excess demand for program?
      Eligibility criteria?
      Geographic targeting? Etc.

Hint: Choose the most robust strategy that fits the operational context

28

Choosing your methods

Identify the “best” possible design given the operational context

Best design = fewest risks for contamination

Have we controlled for “everything”?
   Internal validity

Is the result valid for “everyone”?
   External validity
   Local versus global treatment effect

29

Measuring Impact



2) IE Methods Toolbox:
   Randomized Controls
   Randomized Promotion (IV)
   Discontinuity Design (RDD)
   Difference in Difference (Diff-in-diff)
   Matching (P-score matching)

30

Randomized Controls

When universe of eligibles > # benefits:
   Randomize! Lottery for who is offered benefits
   Fair, transparent and ethical way to assign benefits to equally deserving populations

Oversubscription:
   Give each eligible unit the same chance of receiving treatment
   Compare those offered treatment with those not offered treatment (controls)

Randomized phase in:
   Give each eligible unit the same chance of receiving treatment first, second, third….
   Compare those offered treatment first, with those offered treatment later (controls)
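The oversubscription lottery described above can be sketched in a few lines. This is an illustration only: the function name `lottery`, the community labels and the seed are hypothetical, and a real evaluation would document and audit the draw.

```python
import random


def lottery(eligible_units, n_benefits, seed=0):
    """Give every eligible unit the same chance of being offered treatment.

    Returns (offered, controls). A fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(eligible_units)
    rng.shuffle(shuffled)
    # The first n_benefits units in the random order are offered the program;
    # the remainder serve as controls (or are phased in later).
    return shuffled[:n_benefits], shuffled[n_benefits:]


# Hypothetical universe of 10 eligible communities with only 4 benefit slots.
communities = [f"community_{i}" for i in range(1, 11)]
offered, controls = lottery(communities, n_benefits=4, seed=2007)
print(len(offered), len(controls))
```

For a randomized phase in, the same shuffled order could be read as the roll-out sequence: units early in the list are treated first and later units act as controls until their turn.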

31

Randomization

[Diagram: 1. Universe (Eligible and Ineligible units). 2. Random Sample of Eligibles (External Validity). 3. Randomize Treatment into Enrolled and Not Enrolled (Internal Validity).]

32

Unit of Randomization

Choose according to type of program:
   Individual/Household
   School/Health Clinic/catchment area
   Block/Village/Community
   Ward/District/Region

Keep in mind:
   Need “sufficiently large” number of units to detect minimum desired impact
   Spillovers/contamination
   Operational and survey costs

Hint: As a rule of thumb, choose to randomize at the minimum viable unit of implementation.

33

Case 3: Randomization

Oportunidades Evaluation Sample

Unit of randomization: community

Random phase in:
   320 treatment communities (14,446 households)
      First transfers distributed April 1998
   186 control communities (9,630 households)
      First transfers November 1999

34

Case 3: Baseline Balance

RANDOMIZATION

Variables                        Treatment (4,670)   Control (2,727)   t-stats
Consumption per capita           233.47              233.4             -0.04
                                 1.02                1.3
Head's age                       41.94               42.35             1.2
                                 0.2                 0.27
Head's education                 2.95                2.81              -2.16
                                 0.04                0.05
Spouse's age                     37.02               36.96             -0.38
                                 0.7                 0.22
Spouse's education               2.76                2.76              -0.006
                                 0.03                0.04
Speaks an indigenous language    41.69               41.95             0.21
                                 0.007               0.009
Head is female                   0.073               0.078             0.66
                                 0.003               0.005
Household at baseline            5.76                5.7               -1.21
                                 0.02                0.038
Bathroom at baseline             0.57                0.56              -1.04
                                 0.007               0.009
Total hectares of land           1.63                1.72              1.35
                                 0.03                0.05
Min. Distance loc-urban          109.28              106.59            -1.02
                                 0.6                 0.81
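A hedged sketch of how a balance t-statistic like those in the table could be computed. It assumes that the unlabeled second line under each variable is the standard error of the group mean and that the two groups are independent; neither assumption is stated on the slide. The function name `balance_t_stat` is illustrative.

```python
from math import sqrt


def balance_t_stat(mean_treat, se_treat, mean_control, se_control):
    """t-statistic for the difference in baseline means (control minus treatment).

    Assumes independent groups, so the variance of the difference is the
    sum of the squared standard errors.
    """
    return (mean_control - mean_treat) / sqrt(se_treat ** 2 + se_control ** 2)


# Values read from the baseline balance table.
t_cpc = balance_t_stat(233.47, 1.02, 233.4, 1.3)    # consumption per capita
t_age = balance_t_stat(41.94, 0.2, 42.35, 0.27)     # head's age

print(round(t_cpc, 2), round(t_age, 2))
```

Under these assumptions the results are close to the tabulated t-stats for these two rows, which is what one would expect if randomization produced comparable groups at baseline.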

35

Case 3: Randomization

Case 3 - Randomization

                          Control             Treatment            t-stat
Mean CPC Baseline         233.40              233.47               0.04
Mean CPC Followup         239.5               268.75               9.6

                          Linear Regression   Multivariate Linear Regression
Estimated Impact on CPC   29.25**             29.79**
                          (3.03)              (3.00)

** Significant at 1% level
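The Case 3 means can be used to see, with simple arithmetic, why the before-and-after comparison overstates the impact. The sketch below uses only the four means from the Case 3 table; the variable names are illustrative.

```python
# Mean consumption per capita (CPC) from the Case 3 table.
baseline_control, baseline_treatment = 233.40, 233.47
followup_control, followup_treatment = 239.5, 268.75

# Randomized comparison: treatment vs. control at follow-up.
randomized_estimate = followup_treatment - followup_control

# Change in the control group: what happened over time without the program.
control_trend = followup_control - baseline_control

# Before-and-after comparison for the treatment group alone.
before_after = followup_treatment - baseline_treatment

print(round(randomized_estimate, 2))
print(round(control_trend, 2))
print(round(before_after, 2))
```

The randomized difference reproduces the linear regression estimate of 29.25. The before-and-after difference is larger by roughly the amount that consumption rose in the control group, which is the time-varying factor that the pre-post design cannot separate from the program.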

36







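Case Study

Case                                 Method                           Estimated Impact on CPC
Case 1 - Before and After            Multivariate Linear Regression   34.28** (2.11)
Case 2 - Enrolled/Not Enrolled       Multivariate Linear Regression   -4.15 (4.05)
Case 3 - Randomization               Multivariate Linear Regression   29.79** (3.00)

** Significant at 1% level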

37

Measuring Impact



2) IE Methods Toolbox:
   Randomized Controls
   Randomized Promotion (IV)
   Discontinuity Design (RDD)
   Difference in Difference (Diff-in-diff)
   Matching (P-score matching)

38

Randomized Promotion (IV)

Common scenarios:
   National Program with universal eligibility
   Voluntary enrollment in program

Can we compare enrolled to not enrolled?
   Selection Bias!

39

Randomized Promotion (IV)

Possible solution: provide additional promotion, encouragement or incentives to a random sub-sample:

   Information
   Encouragement (small gift or prize)
   Transport
   Other help/incentives

Necessary conditions:
   1. Promoted and non-promoted groups are comparable:
         Promotion not correlated with population characteristics
         Guaranteed by randomization
   2. Promoted group has higher enrollment in the program
   3. Promotion does not affect outcomes directly

40

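Randomized Promotion

[Diagram: Universal Eligibility (all units Eligible). Randomize Promotion into Promotion and No Promotion groups. Enrollment is shown for each group, with some units Never enrolling and some Always enrolling.]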

41

Randomized Promotion

[Diagram: NOT Promoted and Promoted groups, each made up of Always Enroll, Enroll if Encouraged and Never Enroll units]

NOT Promoted: Enrolled = 30%, Y = 80

IMPACT
   ∆ Enrolled = 0.5
   ∆ Y = 20
   Impact = 40
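The arithmetic behind "Impact = 40" is the difference in outcomes divided by the difference in enrollment rates between the promoted and non-promoted groups (often called a Wald estimator). A minimal sketch, with the function name `promotion_impact` chosen here for illustration:

```python
def promotion_impact(delta_outcome, delta_enrollment):
    """Impact on those who enroll only because of promotion.

    delta_outcome: difference in mean outcome, promoted minus not promoted.
    delta_enrollment: difference in enrollment rates between the same groups.
    """
    if delta_enrollment == 0:
        raise ValueError("promotion must change enrollment for the ratio to be defined")
    return delta_outcome / delta_enrollment


# Numbers from the slide: enrollment differs by 0.5 and the outcome by 20.
impact = promotion_impact(delta_outcome=20, delta_enrollment=0.5)
print(impact)
```

Dividing by the enrollment difference scales the outcome gap up to the units whose enrollment was actually changed by the promotion, which is why the estimate is local to that group.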


42

Examples

Maternal Child Health Insurance in Argentina
   Intensive information campaigns

Employment Program in Argentina
   Transport voucher

Community Based School Management in Nepal
   Assistance from NGO

Health Risk Funds in India
   Assistance from Community Resource Teams

43

Randomized Promotion

Pilot test promotion strategy vigorously!

Produces additional information of interest:
   How to increase enrollment

Don’t have to “exclude” anyone, but…..
   Strategy depends on success and validity of promotion
   Produces a local average treatment effect

Randomized Promotion is an Instrumental Variable (IV)
   A variable correlated with treatment but nothing else (i.e. random promotion)
   More details in the appendix

44

Case 4: IV

[Diagram: Randomized Control and Randomized Treatment (Promoted) groups, made up of Enroll if Encouraged and Never Enroll units]

IMPACT
   ∆ Enrolled = 0.92
   ∆ Y = 29
   TOT Impact = 31

45

Case 4: IV - TOT

Estimate TOT effect of Oportunidades on consumption
Run 2SLS regression

Case 4 - IV

46

Measuring Impact



2) IE Methods Toolbox:
   Randomized Controls
   Randomized Promotion (IV)
   Discontinuity Design (RDD)
   Difference in Difference (Diff-in-diff)
   Matching (P-score matching)

47

Discontinuities in Eligibility

Social programs often target benefits according to an eligibility index:
   Anti-poverty programs: targeted to households below a given poverty index
   Pension programs: targeted to population above a certain age
   Scholarships: targeted to students with high scores on a standardized test

Hint: For a discontinuity design, you need:
   - Continuous eligibility index
   - Clearly defined eligibility cut-off

48

Example:

Eligibility index (score) from 1 to 100
   Based on pre-intervention characteristics

Score <=50 are eligible
Score >50 are not eligible
Offer treatment to eligibles

49

[Figure: Regression Discontinuity Design - Baseline. Outcome (60 to 80) against Score (20 to 80).]

50

[Figure: Regression Discontinuity Design - Baseline. Outcome (60 to 80) against Score (20 to 80), with the Eligible and Not Eligible regions marked.]

51

[Figure: Regression Discontinuity Design - Post Intervention. Outcome (65 to 80) against Score (20 to 80).]

52

[Figure: Regression Discontinuity Design - Post Intervention. Outcome (65 to 80) against Score (20 to 80), with the IMPACT marked.]

53

Case 5: Discontinuity Design

Oportunidades assigned benefits based on a poverty index

Where
   Treatment = 1 if score <=750
   Treatment = 0 if score >750

54

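[Figure: Baseline – No treatment. Fitted values (axis labeled 153.578 and 379.224) against the estimated targeting score ("puntaje estimado en focalizacion", axis labeled 276 and 1294).]

$y_i = \beta_0 + \beta_1 Treatment_i + \delta(score_i) + \varepsilon_i$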


55

[Figure: Treatment Period. Fitted values (axis labeled 183.647 and 399.51) against the estimated targeting score ("puntaje estimado en focalizacion", axis labeled 276 and 1294).]

Case 5 - Regression Discontinuity

                          Multivariate Linear Regression
Estimated Impact on CPC   30.58**
                          (5.93)

56

Potential Disadvantages of RD

Local average treatment effects: not always generalizable

Power: effect is estimated at the discontinuity, so we generally have fewer observations than in a randomized experiment with the same sample size

Specification can be sensitive to functional form: make sure the relationship between the assignment variable and the outcome variable is correctly modeled, including:
   Nonlinear Relationships
   Interactions

57

Advantages of RD for Evaluation

RD yields an unbiased estimate of treatment effect at the discontinuity

Can often take advantage of a known rule for assigning the benefit; such rules are common in the design of social policy
   No need to “exclude” a group of eligible households/individuals from treatment

58

Measuring Impact



2) IE Methods Toolbox:
   Randomized Controls
   Randomized Promotion (IV)
   Discontinuity Design (RDD)
   Difference in Difference (Diff-in-diff)
   Matching (P-score matching)

59

Diff-in-Diff

Compare change in outcomes between treatments and non-treatment

Impact is the difference in the change in outcomes

Impact = (Yt1-Yt0) - (Yc1-Yc0)

60

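[Figure: Outcome against Time for the Treatment Group and the Control Group, with the start of Treatment marked. Points A, B, C and D mark the group outcomes, and the Average Treatment Effect is indicated.]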

61

[Figure: Outcome against Time for the Treatment Group and the Control Group, with the start of Treatment marked. The Estimated Average Treatment Effect is contrasted with the Average Treatment Effect.]

62

Diff in diff

Fundamental assumption that trends (slopes) are the same in treatments and controls

Need a minimum of three points in time to verify this and estimate treatment (two pre-intervention)

63

Case 6: Diff-in-Diff

Case 6 - Diff in Diff

             Not Enrolled   Enrolled   t-stat
Mean ∆CPC    8.26           35.92      10.31
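A minimal sketch of the difference-in-differences arithmetic, Impact = (Yt1-Yt0) - (Yc1-Yc0), applied to the two mean changes in the Case 6 table. The function names are illustrative, and the result is the raw difference of mean changes, which need not equal a regression-based estimate that adds control variables.

```python
def diff_in_diff(change_treated, change_control):
    """Difference between the treated group's change and the control group's change."""
    return change_treated - change_control


def diff_in_diff_from_levels(yt1, yt0, yc1, yc0):
    """Same estimator written in levels: (Yt1 - Yt0) - (Yc1 - Yc0)."""
    return diff_in_diff(yt1 - yt0, yc1 - yc0)


# Mean change in consumption per capita from the Case 6 table.
raw_estimate = diff_in_diff(change_treated=35.92, change_control=8.26)
print(round(raw_estimate, 2))
```

Subtracting the change among the not enrolled removes whatever moved consumption for everyone over the period, which is valid only if both groups would have followed the same trend without the program.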





64

Case Study

Case                                 Method                           Estimated Impact on CPC
Case 1 - Before and After            Multivariate Linear Regression   34.28** (2.11)
Case 2 - Enrolled/Not Enrolled       Multivariate Linear Regression   -4.15 (4.05)
Case 3 - Randomization               Multivariate Linear Regression   29.79** (3.00)
Case 4 - IV (TOT)                    2SLS                             30.44** (3.07)
Case 5 - Regression Discontinuity    Multivariate Linear Regression   30.58** (5.93)
Case 6 - Diff in Diff                Multivariate Linear Regression   25.53** (2.77)

** Significant at 1% level

65

Measuring Impact



2) IE Methods Toolbox:
   Randomized Controls
   Randomized Promotion (IV)
   Discontinuity Design (RDD)
   Difference in Difference (Diff-in-diff)
   Matching (P-score matching)

66

Matching

Pick the ideal comparison that matches the treatment group from a larger survey.
The matches are selected on the basis of similarities in observed characteristics.
This assumes no selection bias based on unobservable characteristics.

Source: Martin Ravallion

67

Propensity-Score Matching (PSM)

Controls: non-participants with same characteristics as participants
   In practice, it is very hard. The entire vector of X observed characteristics could be huge.

Rosenbaum and Rubin: match on the basis of the propensity score
   P(Xi) = Pr(Di=1 | X)
   Instead of aiming to ensure that the matched control for each participant has exactly the same value of X, the same result can be achieved by matching on the probability of participation.
   This assumes that participation is independent of outcomes given X.

68

Steps in Score Matching

1. Representative and highly comparable surveys of non-participants and participants.

2. Pool the two samples and estimate a logit (or probit) model of program participation.

3. Restrict samples to assure common support (important source of bias in observational studies)

4. For each participant find a sample of non-participants that have similar propensity scores

5. Compare the outcome indicators. The difference is the estimate of the gain due to the program for that observation.

6. Calculate the mean of these individual gains to obtain the average overall gain.
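Steps 3 to 6 above can be sketched as follows. This is a simplified illustration: it assumes the propensity scores from step 2 have already been estimated, uses hypothetical data, and matches each participant to all non-participants whose score lies within a fixed distance (a "caliper" of 0.05, chosen arbitrarily). The function name `psm_estimate` is not from the slides.

```python
def psm_estimate(participants, non_participants, caliper=0.05):
    """Average gain from matching on a pre-estimated propensity score.

    Each input is a list of (propensity_score, outcome) pairs.
    """
    # Step 3: keep only the region of common support.
    low = max(min(p for p, _ in participants), min(p for p, _ in non_participants))
    high = min(max(p for p, _ in participants), max(p for p, _ in non_participants))
    treated = [(p, y) for p, y in participants if low <= p <= high]
    pool = [(p, y) for p, y in non_participants if low <= p <= high]

    gains = []
    for p, y in treated:
        # Step 4: non-participants with similar propensity scores.
        matches = [yc for pc, yc in pool if abs(pc - p) <= caliper]
        if not matches:
            continue  # no comparable non-participant for this observation
        # Step 5: gain for this participant relative to the matched outcomes.
        gains.append(y - sum(matches) / len(matches))

    if not gains:
        raise ValueError("no participants could be matched")
    # Step 6: average of the individual gains.
    return sum(gains) / len(gains)


# Hypothetical (score, outcome) pairs.
participants = [(0.30, 12.0), (0.50, 15.0), (0.70, 18.0), (0.95, 30.0)]
non_participants = [(0.05, 5.0), (0.28, 10.0), (0.32, 10.0), (0.52, 13.0), (0.68, 16.0)]

average_gain = psm_estimate(participants, non_participants)
print(average_gain)
```

In this toy data the participants with scores 0.70 and 0.95 fall outside the common support and are dropped, so the estimate describes only participants who have comparable non-participants.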

69

[Figure: Density of propensity scores for participants, plotted over propensity scores from 0 to 1, with the region of common support marked]

70

PSM vs an experiment

Pure experiment does not require the untestable assumption of independence conditional on observables

PSM requires large samples and good data

71

Lessons on Matching Methods

Typically used when randomization, RD or other quasi-experimental options are not possible (i.e. no baseline)
   Be cautious of ex-post matching
      Matching on endogenous variables

Matching helps control for OBSERVABLE heterogeneity

Matching at baseline can be very useful:
   Estimation:
      combine with other techniques (e.g. diff in diff)
      Know the assignment rule (match on this rule)
   Sampling:
      selecting non-randomized evaluation samples

Need good quality data
   Common support can be a problem

72

Case 7: P-Score Matching

Case 7 - PROPENSITY SCORE: Pr(treatment=1)

Variable       Coef.         Std. Err.
Age Head       -0.0282433    0.0024553
Educ Head      -0.054722     0.0086369
Age Spouse     -0.0171695    0.0028683
Educ Spouse    -0.0643569    0.0093801
Ethnicity      0.4166998     0.0397539
Female Head    -0.2260407    0.0714199
_cons          1.6048        0.1013011

P-score Quintiles

              Quintile 1            Quintile 2            Quintile 3            Quintile 4            Quintile 5
Xi            T      C      t-score T      C      t-score T      C      t-score T      C      t-score T      C      t-score
Age Head      68.04  67.45  -1.2    53.61  53.38  -0.51   44.16  44.68  1.34    37.67  38.2   1.72    32.48  32.14  -1.18
Educ Head     1.54   1.97   3.13    2.39   2.69   1.67    3.25   3.26   -0.04   3.53   3.43   -0.98   2.98   3.12   1.96
Age Spouse    55.95  55.05  -1.43   46.5   46.41  0.66    39.54  40.01  1.86    34.2   34.8   1.84    29.6   29.19  -1.44
Educ Spouse   1.89   2.19   2.47    2.61   2.64   0.31    3.17   3.19   0.23    3.34   3.26   -0.78   2.37   2.72   1.99
Ethnicity     0.16   0.11   -2.81   0.24   0.27   -1.73   0.3    0.32   1.04    0.14   0.13   -0.11   0.7    0.66   -2.3
Female Head   0.19   0.21   0.92    0.42   0.16   -1.4    0.092  0.088  -0.35   0.35   0.32   -0.34   0.008  0.008  0.83

73


Case 7: P-Score Matching

Case 7 - Matching

                          Linear Regression   Multivariate Linear Regression
Estimated Impact on CPC   1.16                7.06+
                          (3.59)              (3.65)

** Significant at 1% level, + Significant at 10% level

74

Case Study: Results Summary

Case                                 Method                           Estimated Impact on CPC
Case 1 - Before and After            Multivariate Linear Regression   34.28** (2.11)
Case 2 - Enrolled/Not Enrolled       Multivariate Linear Regression   -4.15 (4.05)
Case 3 - Randomization               Multivariate Linear Regression   29.79** (3.00)
Case 4 - IV (TOT)                    2SLS                             30.44** (3.07)
Case 5 - Regression Discontinuity    Multivariate Linear Regression   30.58** (5.93)
Case 6 - Diff in Diff                Multivariate Linear Regression   25.53** (2.77)
Case 7 - Matching                    Multivariate Linear Regression   7.06+ (3.65)

** Significant at 1% level

75

Methods Summary

[Diagram: Randomization, Randomized Promotion (IV), Discontinuity Design, Diff-in-Diff and Matching compared by their Risks to Internal Validity and External Validity]

76

Measuring Impact




Combinations of the above

77

Remember…..

Objective of impact evaluation is to estimate the CAUSAL effect of a program on outcomes of interest

In designing the program we must understand the data generation process
   behavioral process that generates the data
   how benefits are assigned

Fit the best evaluation design to the operational context

78

Appendix 1: Two Stage Least Squares (2SLS)

Model with endogenous Treatment (T):

$y = \alpha + \beta_1 T + \beta_2 x + \varepsilon$

Stage 1: Regress endogenous variable on the IV (Z) and other exogenous regressors

$T = \delta_0 + \delta_1 x + \theta_1 Z + \tau$

Calculate predicted value for each observation: T hat

79

Appendix 1: Two Stage Least Squares (2SLS)

Stage 2: Regress outcome y on predicted variable (and other exogenous variables)

$y = \alpha + \beta_1 \hat{T} + \beta_2 x + \varepsilon$

Need to correct Standard Errors (they are based on T hat rather than T)

In practice just use STATA - ivreg

Intuition: T has been “cleaned” of its correlation with ε.
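The two stages can be illustrated with a small hand-built example. This sketch makes several simplifying assumptions that are not in the slides: there are no other exogenous regressors x, the data are hypothetical and constructed so that the true effect of T on y is 10, and only the point estimate is computed (the standard error correction mentioned above is not implemented). The helper `ols` is written here for the example.

```python
def ols(x, y):
    """One-regressor least squares; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: Z = randomized promotion, T = enrollment, y = outcome.
# Built as y = 50 + 10*T + u, where u is correlated with T but averages
# zero within each promotion group.
Z = [0, 0, 0, 0, 1, 1, 1, 1]
T = [0, 0, 0, 1, 0, 1, 1, 1]
y = [48, 50, 48, 64, 46, 60, 62, 62]

# Naive regression of y on T: biased because T is correlated with u.
_, naive = ols(T, y)

# Stage 1: regress the endogenous variable T on the instrument Z, then predict T hat.
delta_0, theta_1 = ols(Z, T)
T_hat = [delta_0 + theta_1 * z for z in Z]

# Stage 2: regress the outcome y on T hat.
_, beta_2sls = ols(T_hat, y)

print(round(naive, 2), round(beta_2sls, 2))
```

In this constructed data the naive comparison overstates the effect, while the second stage recovers the value used to build the outcomes, because T hat varies only with the randomized promotion.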
