
Assessing Threats to Validity and Implications for Use of Impact Evaluation Findings

Michael Woolcock
Development Research Group, World Bank

Kennedy School of Government, Harvard University
mwoolcock@worldbank.org

InterAction, May 13, 2013

Overview

• Background
• The art, science and politics of evaluation
• Forms and sources of validity
  – Construct, Internal, External
• Applications to ‘complex’ interventions
  – With a focus on External Validity
    • If it works there, will it work here?
• Expanding range of ideas, methods and strategies

[T]he bulk of the literature presently recommended for policy decisions… cannot be used to identify ‘what works here’. And this is not because it may fail to deliver in some particular cases [; it] is not because its advice fails to deliver what it can be expected to deliver… The failing is rather that it is not designed to deliver the bulk of the key facts required to conclude that it will work here.

Nancy Cartwright and Jeremy Hardie (2012) Evidence-Based Policy: A Practical Guide to Doing It Better (New York: Oxford University Press), p. 137

Contesting Development: Participatory Projects and Local Conflict Dynamics in Indonesia

Patrick Barron, Rachael Diprose and Michael Woolcock

Yale University Press, 2011

The art, science and politics of evaluation

• The Art…
  – Sensibility, experience
  – Optimizing under (numerous) constraints
  – Taking implementation, monitoring and context seriously
• The Science…
  – Skills, theory
  – Modes of causal reasoning (statistical, logical, legal), time
• …and the Politics
  – Competence and confidence under pressure
  – Picking battles…

Making, assessing impact claims

Quality of empirical knowledge claims turns on…

1. Construct validity
   • Do key concepts (‘property rights’, ‘informal’) mean the same thing to different people? What gets “lost in translation”?

2. Internal validity…
   • In connecting ‘cause’ (better schools) and ‘effect’ (smarter children), have we considered other factors that might actually be driving the result (home environment, community safety, cultural norms)? Programs are rarely placed randomly…

3. …assessed against a ‘theory of change’
   • Specification of how the project’s components (and their interactions) and processes generate outcomes
   • Reasoned expectations: where, by when?

4. External validity (how generalizable are the claims?)
   • If it works here, will it work there? If it works with this group, will it work with that group? Will bigger be better?

1. Construct validity

• Asking, answering and interpreting questions
  – To what extent do all parties share similar understandings of key concepts?
    • E.g., ‘poverty’, ‘ethnicity’, ‘violence’, ‘justice’…
• Can be addressed using mixed methods:
  – Iterative field testing of questionnaire items and their sequencing
    • NOT cut-and-paste from elsewhere
  – ‘Anchoring vignettes’ (Gary King et al.)
    • Assessing “quality of government” in China and Mexico

2. Internal validity

In Evaluation 101, we assume…

Impact = f (Design) | Selection, Confounding Variables

Adequate for ‘simple’ interventions with a ‘good-enough’ counterfactual.

But this is inadequate for assessing ‘complex’ interventions:
* design is multi-faceted (i.e., many ‘moving parts’)
* interaction with context is pervasive, desirable
* implementation quality is vital
* trajectories of change are probably non-linear (perhaps unknowable ex ante)
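To make the conditioning clause concrete, here is a minimal simulation sketch (my illustration, not material from the talk; all variable names and magnitudes are invented). When program placement tracks a confounder such as baseline disadvantage, a naive treated-vs-untreated comparison misstates, and can even reverse the sign of, the true effect that randomized placement recovers:

```python
# Minimal sketch of selection/confounding (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)
n, true_effect = 10_000, 2.0

# Confounder: baseline disadvantage drives both placement and outcomes.
disadvantage = rng.normal(size=n)

# Non-random placement: the program targets more disadvantaged sites.
targeted = rng.random(n) < 1 / (1 + np.exp(-disadvantage))
y_targeted = true_effect * targeted - 3.0 * disadvantage + rng.normal(size=n)

# Random placement: assignment is independent of disadvantage.
randomized = rng.random(n) < 0.5
y_random = true_effect * randomized - 3.0 * disadvantage + rng.normal(size=n)

naive = y_targeted[targeted].mean() - y_targeted[~targeted].mean()
rct = y_random[randomized].mean() - y_random[~randomized].mean()
print(f"true effect: {true_effect:+.2f}")
print(f"naive comparison, targeted placement: {naive:+.2f}")  # badly biased
print(f"comparison under random placement:    {rct:+.2f}")    # near truth
```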

Pervasive problem

• Such projects are inherently very complex, thus:
  – Very hard to isolate ‘true’ impact
  – Very hard to make claims about likely impact elsewhere
  – Understanding how (not just whether) impact is achieved is also very important
• Process evaluations, or ‘realist evaluations’, can be most helpful (see work of Ray Pawson, Patricia Rogers et al.)
• Mixed methods, theory, and experience are all crucial for investigating these aspects

Evaluating ‘complex’ projects

Impact = f ([DQ, CD], SF) | SE, CV, RE

DQ = Design quality (weak, strong)
CD = Causal density (low, high)
SF = Support factors: implementation, context
SE = Selection effects (non-random placement, participation)
CV = Confounding variables
RE = Reasoned expectations (where, by when?)

In Social Development projects (cf. roads, immunizations):
* CD is high, loose, often unobserved (unobservable?)
* Implementation and context are highly variable
* RE is often unknown (unknowable?)
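The f(·) above is conceptual, but a toy parameterization (mine, not the author’s) shows why high causal density undermines generalization: once impact flows through an implementation × context interaction, the ‘same’ design yields very different results across sites.

```python
# Toy version of Impact = f([DQ, CD], SF); all functional forms are assumptions.
import itertools

def impact(design_quality, causal_density, implementation, context):
    """Stylized impact on a 0-1 scale: a direct design channel plus an
    implementation-x-context channel whose weight is the causal density."""
    direct = design_quality * (1 - causal_density)
    interactive = design_quality * causal_density * implementation * context
    return direct + interactive

for impl, ctx in itertools.product([0.2, 0.9], repeat=2):
    simple = impact(0.8, causal_density=0.1, implementation=impl, context=ctx)
    complex_ = impact(0.8, causal_density=0.9, implementation=impl, context=ctx)
    print(f"impl={impl}, context={ctx}: "
          f"simple={simple:.2f}, complex={complex_:.2f}")
# Low-CD ('simple') impact barely moves across sites; high-CD ('complex')
# impact swings with the support factors -- hence weak external validity.
```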


3. Theory of Change, Reasoned Expectations: Understanding impact trajectories

[Figures: net impact plotted against time, from t = 0 to t = 1, for several possible impact trajectories; later builds mark evaluation points ‘A’, ‘B’ and ‘C’ along the curves, then extend the horizon to t = 2 with a further point ‘D’ and an unknown continuation (‘?’)]

“Same” impact claim, but entirely a function of when the assessment was done…

If an evaluation was done at ‘A’ or ‘B’, what claims about impact would be made?
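The timing point can be made concrete with a hypothetical trajectory (the J-curve functional form and evaluation dates below are invented for illustration): the ‘same’ project supports a claim of harm, of no effect, or of large gains depending solely on when it is measured.

```python
# Hypothetical J-curve trajectory: early dip, break-even, accelerating gains.
def net_impact(t):
    return 3 * t**2 - 2 * t  # assumed shape, not data from the talk

evaluation_points = [("A (early)", 0.33), ("B (midline)", 0.67),
                     ("C (endline, t=1)", 1.0), ("D (t=2)", 2.0)]
for label, t in evaluation_points:
    print(f"evaluated at {label}: net impact = {net_impact(t):+.2f}")
# A reports harm (-0.33), B reports ~nothing (+0.01), C reports gains (+1.00),
# D reports much larger gains (+8.00) -- identical project, different claims.
```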

4. External Validity: Some Background

• Rising obsession with causality, RCTs as ‘gold standard’
  – Pushed by donors, foundations (e.g., Gates), researchers
    • Campbell Collaboration, Cochrane Collaboration, NIJ, J-PAL, et al.
  – For the “busy policymaker”, “warehouses” of interventions that “work”
• Yet also serious critiques…
  – In medicine: Rothwell (2005), Groopman (2008)
  – In philosophy: Cartwright (2011)
  – In economics: Deaton (2010), Heckman (1992), Ravallion (2009)
    • Reddy (2013) on Poor Economics: “from rigor to rigor mortis”; a radical approach to defining development down, delimiting the innovation space
• …especially as it pertains to external validity…
  – NYT (2013), Engber (2011) on ‘Black 6’ mice (biomedical research)
  – Henrich et al. (2011) on ‘WEIRD’ people (social psychology)
  – Across time, space, groups, scale, units of analysis
• …and understanding of mechanisms
  – A true “science of delivery” requires knowledge of how, not just whether, something ‘works’ (Cartwright and Hardie 2012)

Evaluating ‘complex’ projects

Impact = f ([DQ, CD], SF) | SE, CV, RE

DQ = Design quality (weak, strong)
CD = Causal density (low, high)
SF = Support factors: implementation, context
SE = Selection effects (non-random placement, participation)
CV = Confounding variables
RE = Reasoned expectations (where, by when?)

In Social Development projects (cf. roads, immunizations):
* CD is high, loose, often unobserved (unobservable?)
* Implementation and context are highly variable
* RE is often unknown (unknowable?)

From IV to EV, ‘simple’ to ‘complex’

1. Causal density
2. Support factors
   – Implementation
   – Context
3. Reasoned expectations

Central claim: the higher the intervention’s complexity, the lower its external validity

1. ‘Causal density’

Which way up? RCTs vs QICs (quality improvement collaboratives)

Eppstein et al. (2012) ‘Searching the Clinical Fitness Landscape’ PLoS ONE 7(11): e49901

How ‘simple’ or ‘complex’ is your policy/project? Specific questions to ask:

• To what extent does producing successful outcomes from your policy/project require…
  – that the implementing agents make fine distinctions about the “state of the world”? Are these distinctions difficult for a third party to assess/verify?
    • Local discretion
  – that many agents (rather than a few) act, over extended time periods?
    • Transaction intensity
  – that the agents resist large temptations/pressures to do something besides implement the policy?
    • High stakes
  – that agents innovate to achieve desired outcomes?
    • Known technology

Classification of “activities” in health

| Activity | Local discretion? | Transaction intensive? | Contentious, ‘temptations’ to do otherwise? | Known technology? | Classification |
| --- | --- | --- | --- | --- | --- |
| Iodization of salt | No | No | No | Yes | Technocratic (implementation light; policy decree) |
| Vaccinations | No | Yes | No | Yes | Logistical (implementation intensive, but easy) |
| Ambulatory curative care | Yes | Yes | No(ish) | Yes | Implementation intensive ‘downstream’ (of services) |
| Regulation of private providers | Yes | Yes | Yes | Yes | Implementation intensive ‘upstream’ (of obligations) |
| Encouraging preventive health | Yes | Yes | No | No | Complex (implementation intensive, motivation hard); needs (continuous?) innovation |
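As a compact restatement of the table’s logic, the small helper below (function and argument names are mine, purely hypothetical) maps the four yes/no questions to the five categories:

```python
# Hypothetical encoding of the classification logic above (names are mine).
def classify_activity(discretion: bool, transaction_intensive: bool,
                      contentious: bool, known_technology: bool) -> str:
    if not known_technology:
        return "Complex (implementation intensive, motivation hard)"
    if not discretion:
        return ("Logistical (implementation intensive, but easy)"
                if transaction_intensive
                else "Technocratic (implementation light; policy decree)")
    if contentious:
        return "Implementation intensive 'upstream' (of obligations)"
    return "Implementation intensive 'downstream' (of services)"

# Reproduces the table's rows, e.g.:
print(classify_activity(False, False, False, True))  # iodization -> Technocratic
print(classify_activity(True, True, True, True))     # regulation -> 'upstream'
print(classify_activity(True, True, False, False))   # preventive -> Complex
```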

2. Implementation: Using RCTs to test the EV of RCTs

• Bold, Sandefur et al. (2013)
  – Take a project (contract teachers) with a positive impact in India, as determined by an RCT…
  – …to Kenya: 192 schools randomly split into three groups:
    • a control group
    • schools receiving a contract teacher through an NGO (World Vision)
    • schools receiving a contract teacher through the MoE
  – Result?

Implementation matters (a lot)

Bold et al. (2013)

The fact is that RCTs come at the end, when you have already decided that it will probably work, here and maybe anywhere… To know that this is a good bet, you have to have thought about causal roles and support factors… [A]nswering the how question is made easier in science by background knowledge of how things work.

Nancy Cartwright and Jeremy Hardie (2012) Evidence-Based Policy: A Practical Guide to Doing It Better (New York: Oxford University Press), p. 125

Learning from intra-project variation: ‘Complex’ projects

[Figures: impact plotted against time, from t = 0 to t = 1, comparing trajectories ‘A’ and ‘B’ within the same project; a final build illustrates iterative, adaptive learning along these trajectories]
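A final sketch (site names and effect sizes are invented) of the ‘projects as laboratories’ idea behind these figures: within one ‘complex’ project, the spread of site-level effects is itself evidence to be interrogated, not noise to be averaged away.

```python
# Invented per-site impact estimates within a single 'complex' project.
from statistics import mean, stdev

site_effects = {"site A": 0.9, "site B": 0.1, "site C": -0.2,
                "site D": 1.4, "site E": 0.3}

pooled = mean(site_effects.values())
spread = stdev(site_effects.values())
print(f"pooled 'headline' impact: {pooled:+.2f}")
print(f"between-site spread:      {spread:.2f}")
# A spread this large relative to the pooled mean is a cue for process
# evaluation and mixed methods: ask *how* impact arose where it did.
```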


Putting it all together

[Summary table: the five project design types (Technocratic; Logistical; Implementation intensive ‘downstream’; Implementation intensive ‘upstream’; Complex), each crossed with implementation quality (strong/weak) and context compatibility (+/−). External validity runs from High at the technocratic end to Low at the complex end; the utility of case studies, process evaluations and mixed methods (MM) runs the other way, from Low to High.]

Even with low-EV interventions, the ideas and processes behind them may still travel well

Implications

• Take the analytics of knowledge claims surrounding EV as seriously as we do those surrounding IV
• Engage with the vast array of social science tools available for rigorously assessing complex interventions
  – Within and beyond economics
    • RCTs as one tool among many
    • New literature on case studies (Mahoney), QCA (Ragin), complexity
  – See especially ‘realist evaluation’ (Pawson, Tilley)
• Make implementation cool; it really matters…
  – Learning from intra-project variation; projects themselves as laboratories, as “policy experiments” (Rondinelli 1993)
• A ‘science of delivery’ must know how, not just whether, interventions work (mechanisms, theory of change)
  – Especially important for engaging with ‘complex’ interventions
• Need the ‘counter-temporal’ (not just the counterfactual)
  – Reasoned expectations about what and where, by when?

Primary source material

• Bamberger, Michael, Vijayendra Rao and Michael Woolcock (2010) ‘Using Mixed Methods in Monitoring and Evaluation: Experiences from International Development’, in Abbas Tashakkori and Charles Teddlie (eds.) Handbook of Mixed Methods (2nd revised edition), Thousand Oaks, CA: Sage Publications, pp. 613-641

• Barron, Patrick, Rachael Diprose and Michael Woolcock (2011) Contesting Development: Participatory Projects and Local Conflict Dynamics in Indonesia, New Haven: Yale University Press

• Pritchett, Lant, Salimah Samji and Jeffrey Hammer (2012) ‘It’s All About MeE: Using Experiential Learning to Navigate the Design Space’, Center for Global Development Working Paper

• Woolcock, Michael (2009) ‘Toward a Plurality of Methods in Project Evaluation: A Contextualized Approach to Understanding Impact Trajectories and Efficacy’, Journal of Development Effectiveness 1(1): 1-14

• Woolcock, Michael (forthcoming) ‘Using Case Studies to Explore the External Validity of Complex Development Interventions’, Evaluation
