Why Non-Experimental Methods are Not Good Enough and Why Experimental Methods Are: Challenging the Folk Lore of Evaluation Research David Weisburd Hebrew.

Why Non-Experimental Methods are “Not Good Enough” and Why Experimental Methods Are:

Challenging the Folk Lore of Evaluation Research

David Weisburd

Hebrew University

George Mason University

Oliver Wendell Holmes

Where I am Going

Describe how non-experimental evaluation studies attempt to gain unbiased results in a world where outcomes are confounded. Define the fundamental weakness of this approach.

Critically examine the “folklore” that suggests non-experimental studies are “good enough” despite this weakness.

Folk lore: the traditional beliefs, customs, and stories of a community, passed through the generations by word of mouth. (Oxford Pocket Dictionary)

Experiments are Good Enough

Experimental studies provide a statistical solution to the problem of confounding. They should be “good enough.”

Critically examine the “folk lore” that seems to suggest that experiments are not “good enough” despite their statistical advantages.

Neutralizing Confounding in Non-Experimental Research

The Key Question

In evaluating treatments or programs the key issue is getting an unbiased estimate of the treatment effect.

Without that, any other considerations, such as the ability to generalize results, are superfluous.

The main problem we face is that treatment is confounded with other factors.

The Problem We Need to Solve

Example: We measure the effect of prison on recidivism.

We find that prison increases recidivism. But the reason for this may be that we have not taken into account the fact that “prisoners” are more likely to recidivate in the first place because they have, on average, more severe prior records.

Treatment (prison) is confounded with prior record.
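The prison example can be illustrated with a small simulation. This is only a sketch under invented assumptions: the probabilities below are hypothetical, and prison is given no true effect on recidivism at all, so any apparent "effect" comes purely from the confounding with prior record.

```python
import random

def simulate_offenders(n=40000, seed=1):
    """Return (severe_prior, prison, recidivated) tuples for n hypothetical offenders.

    Assumption for illustration only: prison has NO true effect on recidivism;
    only the prior record changes the chance of reoffending.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe_prior = rng.random() < 0.5                        # the confounding cause
        prison = rng.random() < (0.8 if severe_prior else 0.2)   # severe records go to prison more often
        recidivated = rng.random() < (0.6 if severe_prior else 0.3)  # does not depend on prison
        rows.append((severe_prior, prison, recidivated))
    return rows

def recidivism_rate(rows, prison, severe_prior=None):
    """Share who recidivate for a given prison status (and prior record, if specified)."""
    outcomes = [r for s, p, r in rows
                if p == prison and (severe_prior is None or s == severe_prior)]
    return sum(outcomes) / len(outcomes)

rows = simulate_offenders()

# Naive comparison: prisoners versus non-prisoners, ignoring prior record.
naive_gap = recidivism_rate(rows, True) - recidivism_rate(rows, False)

# Same comparison within each level of the confounding cause.
gap_severe = recidivism_rate(rows, True, True) - recidivism_rate(rows, False, True)
gap_minor = recidivism_rate(rows, True, False) - recidivism_rate(rows, False, False)

print(f"naive prison 'effect':         {naive_gap:+.3f}")
print(f"within severe prior records:   {gap_severe:+.3f}")
print(f"within minor prior records:    {gap_minor:+.3f}")
```

The naive gap is clearly positive, while the gaps within each prior-record group are close to zero. The comparison only works here because the single confounding cause is known and measured, which is exactly what cannot be guaranteed in a real non-experimental study.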

Creating Unbiased Estimates in Non-Experimental Studies

Non-experimental methods such as regression techniques or matching rely on a similar logic.

If we know what the factors are that confound treatment we can take them into account.

The primary method of doing this is statistical (Multivariate Statistical Methods).

But quasi-experiments that rely on matching or propensity scores are based on the same logic.

Solving the Problem Statistically: CC is the Confounding Cause

$$
b_{YTr \cdot CC} = \left( \frac{r_{Y,Tr} - r_{Y,CC}\, r_{Tr,CC}}{1 - r_{Tr,CC}^{2}} \right) \frac{S_{Y}}{S_{Tr}}
$$
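A short numerical check of this statistical solution may help. The sketch below assumes the standard formula for the slope of treatment (Tr) controlling for one confounding cause (CC), built from the three pairwise correlations and the standard deviations of Y and Tr. The data are simulated, and the true coefficients (2 for Tr, 3 for CC) and function names are hypothetical choices made for illustration.

```python
import random
from statistics import fmean, pstdev

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def bivariate_slope(y, tr):
    """Slope of Y on Tr alone, ignoring the confounding cause."""
    return corr(y, tr) * pstdev(y) / pstdev(tr)

def partial_slope(y, tr, cc):
    """Slope of Y on Tr controlling for CC, from correlations and standard deviations."""
    r_y_tr, r_y_cc, r_tr_cc = corr(y, tr), corr(y, cc), corr(tr, cc)
    return (r_y_tr - r_y_cc * r_tr_cc) / (1 - r_tr_cc ** 2) * pstdev(y) / pstdev(tr)

# Hypothetical data: CC raises both the chance of treatment and the outcome.
rng = random.Random(2)
n = 20000
cc = [rng.gauss(0, 1) for _ in range(n)]
tr = [0.6 * c + 0.8 * rng.gauss(0, 1) for c in cc]                 # treatment is correlated with CC
y = [2.0 * t + 3.0 * c + rng.gauss(0, 1) for t, c in zip(tr, cc)]  # true treatment slope is 2

b_bivariate = bivariate_slope(y, tr)
b_partial = partial_slope(y, tr, cc)

print(f"slope ignoring CC:        {b_bivariate:.2f}")
print(f"slope controlling for CC: {b_partial:.2f}")
```

Ignoring CC gives a slope well above the true value of 2, while controlling for CC recovers it closely. The correction works only because CC is the sole confounding cause in this simulation and has been measured.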

Elegant Solution, But…

If we want to get an unbiased estimate of treatment in a non-experimental study we would in theory have to identify all “confounding causes.”

That “assumption” is on its face unrealistic, but evaluation researchers often use “folk lore” to argue that non-experimental studies are in any event “good enough.”

The Folk Lore of Non-Experimental Evaluations

Non-Experimental Methods are “Good Enough”?

1) Overall We Identify the Most Important Causes

Aren’t we Doing Well Enough?

A common “defense” for non-experimental methods is that our “models” take into account the “most important factors.”

The assumption here is that in practice we don’t have to worry about excluded variables.

The major ones (that might affect the outcomes in meaningful ways) are already known and accounted for in the model.

Impact of Small Excluded Effects With Little Influence is Small

How Well do Criminologists Explain Crime

Alex Piquero and I have recently published an article in Crime and Justice in which we examined this assumption.

We reviewed all the articles in Criminology that used multivariate statistical modeling to examine a criminological theory and provided some measurement of “variance explained.”

While my concern here is isolating a treatment effect, the question is similar, since we would not expect our understanding of treatments or programs to be very different than our underlying understanding of crime and justice.

Average Variance Explained

Across the articles that reported an R² value over the time period covered, the average R² was .389.

Some 25% of the 169 articles exhibit R² values below .20, while over 70% have an R² under .50.

[Figure: Average R-Square Over Time. Aggregate R² value over time (N=169 articles). X-axis: Year; y-axis: R-Square, 0 to 0.8. Note: Years with zero observations are removed for ease of presentation.]

The Folk Lore is Most Likely Wrong

There is a good deal left unexplained, most often more than half the variance.

It would seem very difficult to assume that in all of this unexplained variance there are not very meaningful confounding factors that are routinely excluded.

2) If the Effect of Treatment is Large then You can Assume that Excluded Causes Would not Change that Estimate in a Meaningful Way

This Effect is Large Enough Not to Worry About!

Another piece of folk lore often used to defend a reliance on non-experimental methods is that very large and robust effects are not likely to be meaningfully altered even if there are unmeasured confounding factors.

Statisticians, in contrast, have often noted the “instability” of regression parameters under differing assumptions.

AOC Death Penalty Study

Joe Naus from Rutgers University and I were asked by the AOC of New Jersey to assess the effects of race on death penalty sentencing.

Following an approach that identified major factors influencing death penalty sentencing, we developed a model that showed a very significant effect of race of victim on the likelihood of advancement to penalty trial.

White Victim is the Single Most “Significant” Effect on Advancement to Penalty Trial

Regional Effects

The State Prosecutor argued that the effect of race of victim was confounded by district of prosecution.

He noted that counties that had large numbers of “white victims” were places where it was more likely for a case to go to penalty trial for other reasons.

For example, the cases with large numbers of white victims were in counties with many fewer “death eligible cases.” Prosecutors in such counties were more likely to focus more aggressively on such cases.

White Victim Controlling for County

3) We can Assume that the Biases are Balanced

Everything Will Balance Off in the End

A common folklore is that the excluded variables “balance each other,” so we can assume that the parameter estimate is unbiased.

This assumption relies on a model in which the exclusion of variables is random, and therefore we would assume an unbiased estimate of b.

If this assumption had any basis to it we could just rely on the bivariate model. No one would argue that that model provides an unbiased estimate!

Knowledge Development is not Random

Indeed, there is good reason to believe that we identify variables in clusters around specific theoretical constructs (like poverty or social disorganization).

By definition we are then missing clusters which are likely to cause bias in specific directions.

Data restrictions (e.g. gathering official data) are likely to be even more systematic in their biases.

So Why are Experiments Good Enough?

Randomized Experiments: A Naïve Approach

Because treatment has been allocated randomly, in theory it is not going to be related systematically to other factors such as gender, race, age, attitudes, etc.

THERE ARE NO CONFOUNDING CAUSES!

No Confounding!

So Rather than Taking Confounding Causes Into Account a Randomized Experiment Makes the Confounding Irrelevant

$$
b_{YTr \cdot CC} = \left( \frac{r_{Y,Tr} - r_{Y,CC}\, r_{Tr,CC}}{1 - r_{Tr,CC}^{2}} \right) \frac{S_{Y}}{S_{Tr}}
$$

The product of the correlations, $r_{Y,CC}\, r_{Tr,CC}$, is zero in a randomized experiment.
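The claim about the correlations can be checked with a small simulation. In this sketch treatment is assigned by a coin flip, so it cannot be systematically related to the confounding cause (CC). The data, the true treatment effect of 2, and the variable names are hypothetical; in any finite sample the correlation between treatment and CC is only approximately zero.

```python
import random
from statistics import fmean, pstdev

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

rng = random.Random(3)
n = 20000

# CC still strongly affects the outcome, but the coin flip ignores it.
cc = [rng.gauss(0, 1) for _ in range(n)]
tr = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]        # random allocation
y = [2.0 * t + 3.0 * c + rng.gauss(0, 1) for t, c in zip(tr, cc)]  # true treatment effect is 2

r_tr_cc = corr(tr, cc)

# With no confounding, a simple difference in group means estimates the effect.
treated = [v for v, t in zip(y, tr) if t == 1.0]
control = [v for v, t in zip(y, tr) if t == 0.0]
effect_estimate = fmean(treated) - fmean(control)

print(f"correlation of treatment with CC: {r_tr_cc:+.3f}")
print(f"difference in means (true = 2):   {effect_estimate:.2f}")
```

The estimate lands near the true effect even though CC was never entered into the analysis, and the same would hold for confounding causes that were never measured or even identified.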

The Folk Lore of Why Experiments are Not Good Enough

1) Experiments are Not Ethical

Many people still claim that it is not ethical to carry out social experiments.

It seems that at least in crime and justice evaluation researchers don’t really accept this folk lore (Lum and Yang, 2003).

“Randomized experimental design is the best method of linking cause and effect.”

[Figure: bar chart of Average Score per Group (scale 0 to 5) for Non-Experiments and Experiments; bar labels 4.81 and 4.16; t = -2.70*, p = .010.]

“Randomized experiments cannot be carried out ethically in criminal justice settings.”

[Figure: bar chart of Average Score per Group (scale 0 to 5) for Non-Experiments and Experiments; bar labels 1.9 and 1.49; t = -1.98, p = .051.]

2) Experiments Cannot be Implemented in the Real World

Crime Reduction Experiments 1945-1993 (N=267)

[Figure: bar chart by period, y-axis 0 to 14 (axis units not labeled in the source). Bar labels, in the order they appear: 1945-1950: 0; 1951-1955: 0.2; 1956-1960: 0.6; 1961-1965: 1.8; 1966-1970: 6.2; 1971-1975: 9.4; 1976-1980: 7.8; 1981-1985: 11; 1986-1990: 9.4; 1991-1993: 11.6.]

3) Experiments Have Low External Validity

Only innovative agencies are willing to participate in experiments.

“Ordinary” agencies may be brought on board if there is strong governmental encouragement and financial support that rewards participation.

Experiments operate in an artificial world that is controlled and not dynamic.

There is no free lunch!

Randomized Experiments are Good Enough

Everything Else is Commentary

The great Talmudic scholar Hillel responded, when asked to explain Judaism on one foot, that its essence was the dictum: “Treat others as you would like them to treat you.” He then noted that everything else is “commentary.”

In our case, the essence of evaluation research is that “experiments are good enough.”

Non-experimental methods are “not good enough.”

The Commentary

Of course, as in the case of Hillel, the commentary is very important.

But a simple rule should follow. We should begin any study with an assumption that an experimental design is required. We should only then get to the commentary.
