

Does Research Design Affect Study Outcomes in Criminal Justice?

By DAVID WEISBURD, CYNTHIA M. LUM, and ANTHONY PETROSINO

ABSTRACT: Does the type of research design used in a crime and justice study influence its conclusions? Scholars agree in theory that randomized experimental studies have higher internal validity than do nonrandomized studies. But there is not consensus regarding the costs of using nonrandomized studies in coming to conclusions regarding criminal justice interventions. To examine these issues, the authors look at the relationship between research design and study outcomes in a broad review of research evidence on crime and justice commissioned by the National Institute of Justice. Their findings suggest that design does have a systematic effect on outcomes in criminal justice studies. The weaker a design, indicated by internal validity, the more likely a study is to report a result in favor of treatment and the less likely it is to report a harmful effect of treatment. Even when comparing randomized studies with strong quasi-experimental research designs, systematic and statistically significant differences are observed.

ANNALS, AAPSS, 578, November 2001


David Weisburd is a senior research fellow in the Department of Criminology and Criminal Justice at the University of Maryland and a professor of criminology at the Hebrew University Law School in Jerusalem.

Cynthia M. Lum is a doctoral student in the Department of Criminology and Criminal Justice at the University of Maryland.

Anthony Petrosino is a research fellow at the Center for Evaluation, Initiative for Children Program at the American Academy of Arts and Sciences and a research associate at Harvard University. He is also the coordinator of the Campbell Crime and Justice Coordinating Group.

NOTE: We are indebted to a number of colleagues for helpful comments in preparing this article. We especially want to thank Iain Chalmers, John Eck, David Farrington, Denise Gottfredson, Doris MacKenzie, Joan McCord, Lawrence Sherman, Brandon Welsh, Charles Wellford, and David Wilson.


THERE is a growing consensus among scholars, practitioners, and policy makers that crime control practices and policies should be rooted as much as possible in scientific research (Cullen and Gendreau 2000; MacKenzie 2000; Sherman 1998). This is reflected in the steady growth in interest in evaluation of criminal justice programs and practices in the United States and the United Kingdom over the past decade and by large increases in criminal justice funding for research during this period (Visher and Weisburd 1998). Increasing support for research and evaluation in criminal justice may be seen as part of a more general trend toward utilization of scientific research for establishing rational and effective practices and policies. This trend is perhaps most prominent in the health professions, where the idea of evidence-based medicine has gained strong government and professional support (Millenson 1997; Zuger 1997), though the evidence-based paradigm is also developing in other fields (see Nutley and Davies 1999; Davies, Nutley, and Smith 2000).

A central component of the movement toward evidence-based practice and policy is reliance on systematic review of prior research and evaluation (Davies 1999). Such review allows policy makers and practitioners to identify what programs and practices are most effective and in which contexts. The Cochrane Collaboration, for example, seeks to prepare, maintain, and make accessible systematic reviews of research on the effects of health care interventions (see Chalmers and Altman 1995; www.cochrane.org). The Cochrane Library is now widely recognized as the single best source of evidence on the effectiveness of health care and medical treatments and has played an important part in the advancement of evidence-based medicine (Egger and Smith 1998). More recently, social scientists following the Cochrane model established the Campbell Collaboration for developing systematic reviews of research evidence in the area of social and educational interventions (see Boruch, Petrosino, and Chalmers 1999). In recognition of the growing importance of evidence-based policies in criminal justice, the Campbell Collaboration commissioned a coordinating group to deal with crime and justice issues. This group began with the goal of providing the best evidence on “what works in crime and justice” through the development of “systematic reviews of research” on the effects of crime and justice interventions (Farrington and Petrosino 2001 [this issue]).

In the Cochrane Collaboration, and in medical research in general, clinical trials that randomize participants to treatment and control or comparison groups are considered more reliable than studies that do not employ randomization. And the recognition that experimental designs form the gold standard for drawing conclusions about the effects of treatments or programs is not restricted to medicine. There is broad agreement among social and behavioral scientists that randomized experiments provide the best method for drawing causal inferences between treatments and programs and their outcomes (for example, see Boruch, Snyder, and DeMoya 2000; Campbell and Boruch 1975; Farrington 1983; Feder, Jolin, and Feyerherm 2000). Indeed, a task force convened by the Board of Scientific Affairs of the American Psychological Association to look into statistical methods concluded that “for research involving causal inferences, the assignments of units to levels of the causal variable is critical. Random assignment (not to be confused with random selection) allows for the strongest possible causal inferences free of extraneous assumptions” (Wilkinson and Task Force on Statistical Inference 1999).

While reliance on experimental studies in drawing conclusions about treatment outcomes has become common in the development of evidence-based medicine, the Campbell Collaboration Crime and Justice Coordinating Group has concluded that it is unrealistic at this time to restrict systematic reviews on the effects of interventions relevant to crime and justice to experimental studies. In developing its Standards for Inclusion of Studies in Systematic Reviews (Farrington 2000), the group notes that it does not require that reviewers select only randomized experiments:

This might possibly be the case for an intervention where there are many randomized experiments (e.g. cognitive-behavioral skills training). However, randomized experiments to evaluate criminological interventions are relatively uncommon. If reviews were restricted to randomized experiments, they would be relevant to only a small fraction of the key questions for policy and practice in criminology. Where there are few randomized experiments, it is expected that reviewers will select both randomized and non-randomized studies for inclusion in detailed reviews. (3)

In this article we examine a central question relevant both to the Campbell Collaboration crime and justice effort and to the more general emphasis on developing evidence-based practice in criminal justice: Does the type of research design used in a crime and justice study influence the conclusions that are reached? Assuming that experimental designs are the gold standard for evaluating practices and policies, it is important to ask what price we pay in including other types of studies in our reviews of what works in crime and justice. Are we likely to overestimate or underestimate the positive effects of treatment? Or conversely, might we expect that the use of well-designed nonrandomized studies will lead to about the same conclusions as we would gain from randomized experimental evaluations?

To examine these issues, we look at the relationship between research design and study outcomes in a broad review of research evidence on crime and justice commissioned by the National Institute of Justice. Generally referred to as the Maryland Report because it was developed in the Department of Criminology and Criminal Justice at the University of Maryland at College Park, the study was published under the title Preventing Crime: What Works, What Doesn’t, What’s Promising (Sherman et al. 1997). The Maryland Report provides an unusual opportunity for assessing the impact of study design on study outcomes in crime and justice both because it sought to be comprehensive in identifying available research and because the principal investigators of the study devoted specific attention to the nature of the research designs of the studies included. Below we detail the methods we used to examine how study design affects study outcomes in crime and justice research and report on our main findings. We turn first, however, to a discussion of why randomized experiments as contrasted with quasi-experimental and nonexperimental research designs are generally considered a gold standard for making causal inferences. We also examine what prior research suggests regarding the questions we raise.

WHY ARE RANDOMIZED EXPERIMENTS CONSIDERED THE GOLD STANDARD?

The key to understanding the strength of experimental research designs is found in what scholars refer to as the internal validity of a study. A research design in which the effects of treatment or intervention can be clearly distinguished from other effects has high internal validity. A research design in which the effects of treatment are confounded with other factors is one in which there is low internal validity. For example, suppose a researcher seeks to assess the effects of a specific drug treatment program on recidivism. If at the end of the evaluation the researcher can present study results and confidently assert that the effects of treatment have been isolated from other confounding causes, the internal validity of the study is high. But if the researcher has been unable to ensure that other factors such as the seriousness of prior records or the social status of offenders have been disentangled from the influence of treatment, he or she must note that the effects observed for treatment may be due to such confounding causes. In this case internal validity is low.

In randomized experimental studies, internal validity is developed through the process of random allocation of the units of treatment or intervention to experimental and control or comparison groups. This means that the researcher has randomized other factors besides treatment itself, since there is no systematic bias that brings one type of subject into the treatment group and another into the control or comparison group. Although the groups are not necessarily the same on every characteristic—indeed, simply by chance, there are likely to be differences—such differences can be assumed to be distributed randomly and are part and parcel of the stochastic processes taken into account in statistical tests. Random allocation thus allows the researcher to assume that the only systematic differences between the treatment and comparison groups are found in the treatments or interventions that are applied. When the study is complete, the researcher can argue with confidence that if a difference has been observed between treatment and comparison groups, it is likely the result of the treatment itself (since randomization has isolated the treatment effect from other possible causes).

In nonrandomized studies, two methods may be used for isolating treatment or program effects. Quasi-experiments, like randomized experiments, rely on the design of a research study to isolate the effects of treatment. Using matching or other methods in an attempt to establish equivalence between groups, quasi-experiments mimic experimental designs in that they attempt to rule out competing causes by identifying groups that are similar except in the nature of the treatment that they receive in the study. Importantly, however, quasi-experiments do not randomize out the effects of other causes as is the case in randomized experimental designs; rather they seek to maximize the equivalence between the units studied through matching or other methods. Threats to internal validity in quasi-experimental studies derive from the fact that it is seldom possible to find or to create treatment and control groups that are not systematically different in one respect or another.

Nonexperimental studies rely primarily on statistical techniques to distinguish the effects of the intervention or treatment from other confounding causes. In practice, quasi-experimental studies often rely as well on statistical approaches to increase the equivalence of the comparisons made.1 However, in nonexperimental studies, statistical controls are the primary method applied in attempts to increase the level of a study’s internal validity. In this case, multivariate statistical methods are used to isolate the effects of treatment from that of other causes. This demands of course that the researcher clearly identify and measure all other factors that may threaten the internal validity of the study outcomes. Only if all such factors are included in the multivariate models estimated can the researcher be confident that the effects of treatment that have been reported are not confounded with other causes.
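As a purely illustrative sketch (not drawn from any study reviewed here; the variable names and data are hypothetical), the following simulation shows how a measured confounder can be adjusted for with a simple regression model, and why statistical control depends entirely on the confounder having been measured:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data only: more serious prior records raise recidivism and
# lower the chance of entering treatment (a "creaming" selection pattern).
prior_record = rng.normal(size=n)
treatment = (rng.normal(size=n) - 0.8 * prior_record > 0).astype(float)
recidivism = 0.5 * prior_record - 0.3 * treatment + rng.normal(size=n)  # true effect = -0.3

def treatment_coefficient(y, predictors):
    """Least-squares coefficient on the first predictor (the treatment indicator)."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Unadjusted comparison confounds selection with the treatment effect.
print(treatment_coefficient(recidivism, [treatment]))                # roughly -0.8
# Adjusting for the measured confounder isolates the treatment effect.
print(treatment_coefficient(recidivism, [treatment, prior_record]))  # roughly -0.3
```

If the confounder were unknown or unmeasured, only the first, biased estimate would be available, which is precisely the threat to internal validity described above.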

In theory, the three methods described here are equally valid for solving the problem of isolating treatment or program effects. Each can ensure high internal validity when applied correctly. In practice, however, as Feder and Boruch (2000) note, “there is little disagreement that experiments provide a superior method for assessing the effectiveness of a given intervention” (292). Randomization, according to Kunz and Oxman (1998), “is the only means of controlling for unknown and unmeasured differences between comparison groups as well as those that are known and measured” (1185). While random allocation itself ensures high internal validity in experimental research, for quasi-experimental and nonexperimental research designs, unknown and unmeasured causes are generally seen as representing significant potential threats to the internal validity of the comparisons made.2


INTERNAL VALIDITY AND STUDY OUTCOMES IN PRIOR REVIEWS

While there is general agreement that experimental studies are more likely to ensure high internal validity than are quasi-experimental or nonexperimental studies, it is difficult to specify at the outset the effects that this will have on study outcomes. On one hand, it can be assumed that weaker internal validity is likely to lead to biases in assessment of the effects of treatments or interventions. However, the direction of that bias in any particular study is likely to depend on factors related to the specific character of the research that is conducted. For example, if nonrandomized studies do not account for important confounding causes that are positively related to treatment, they may on average overestimate program outcomes. However, if such unmeasured causes are negatively related to treatment, nonrandomized studies would be expected to underestimate program outcomes. Heinsman and Shadish (1996) suggested that whatever the differences in research design, if nonrandomized and randomized studies are equally well designed and implemented (and thus internal validity is maximized in each), there should be little difference in the estimates gained. Much of what is known empirically about these questions is drawn from reviews in such fields as medicine, psychology, economics, and education (for example, see Burtless 1995; Hedges 2000; Kunz and Oxman 1998; Lipsey and Wilson 1993). Following what one would expect in theory, a general conclusion that can be reached from the literature is that there is not a consistent bias that results from use of nonrandomized research designs. At the same time, a few studies suggest that differences, in whatever direction, will be smallest when nonrandomized studies are well designed and implemented.

Kunz and Oxman (1998), for example, using studies drawn from the Cochrane database, found varying results when analyzing 18 meta-analyses (incorporating 1211 clinical trials) in the field of health care. Of these 18 systematic reviews, 4 found randomized and higher-quality studies3 to give higher estimates of effects than nonrandomized and lower-quality studies, and 8 reviews found randomized or high-quality studies to produce lower estimates of effect sizes than nonrandomized or lower-quality studies. Five other reviews found little or inconclusive differences between different types of research designs, and in one review, low-quality studies were found to be more likely to report findings of harmful effects of treatments.

Mixed results are also found in systematic reviews in the social sciences. Some reviews suggest that nonrandomized studies will on average underestimate program effects. For example, Heinsman and Shadish (1996) looked at four meta-analyses that focused on interventions in four different areas: drug use, effects of coaching on Scholastic Aptitude Test performance, ability grouping of pupils in secondary schools, and psychosocial interventions for postsurgery outcomes. Included in their analysis were 98 published and unpublished studies. As a whole, randomized experiments were found to yield larger effect sizes than studies where randomization was not used. In contrast, Friedlander and Robins (2001), in a review of social welfare programs, found that nonexperimental statistical approaches often yielded estimates larger than those gained in randomized studies (see also Cox, Davidson, and Bynum 1995; LaLonde 1986).

In a large-scale meta-analysis examining the efficacy of psychological, educational, and behavioral treatment, Lipsey and Wilson (1993) suggested that conclusions reached on the basis of nonrandomized studies are not likely to strongly bias conclusions regarding treatment or program effects. Although studies varied greatly in both directions as to whether nonrandomized designs overestimated or underestimated effects as compared with randomized designs, no consistent bias in either direction was detected. Lipsey and Wilson, however, did find a notable difference between studies that employed a control/comparison design and those that used one-group pre and post designs. The latter studies produced consistently higher estimates of treatment effects.

Support for the view that stronger nonrandomized studies are likely to provide results similar to randomized experimental designs is provided by Shadish and Ragsdale (1996). In a review of 100 studies of marital or family psychotherapy, they found overall that randomized experiments yielded significantly larger weighted average effect sizes than nonequivalent control group designs. Nonetheless, the difference between randomized and nonrandomized studies decreased when confounding variables related to the quality of the design of the study were included.

Works that specifically address the relationship between study design and study outcomes are scarce in criminal justice. In turn, assessment of this relationship is most often not a central focus of the reviews developed, and reviewers generally examine a specific criminal justice area, most often corrections (for example, see Bailey 1966; MacKenzie and Hickman 1998; Whitehead and Lab 1989). Results of these studies provide little guidance for specifying a general relationship between study design and study outcomes for criminal justice research. In an early review of 100 reports of correctional treatment between 1940 and 1960, for example, Bailey (1966) found that research design had little effect on the claimed success of treatment, though he noted a slight positive relationship between the “rigor” of the design and study outcome. Logan (1972), who also reviewed correctional treatment programs, found a slight negative correlation between study design and claimed success.

Recent studies are no more conclusive. Wilson, Gallagher, and MacKenzie (2000), in a meta-analysis of corrections-based education, vocation, and work programs, found that run-of-the-mill quasi-experimental studies produced larger effects than did randomized experiments. However, such studies also produced larger effects than did low-quality designs that clearly lacked comparability among groups. In a review of 165 school-based prevention programs, Whitehead and Lab (1989) found little difference in the size of effects in randomized and nonrandomized studies. Interestingly however, they reported that nonrandomized studies were much less likely to report a backfire effect whereby treatment was found to exacerbate rather than ameliorate the problem examined. In contrast, a more recent review by Wilson, Gottfredson, and Najaka (in press) found overall that nonrandomized studies yielded results on average significantly lower than randomized experiments’ results, even accounting for a series of other design characteristics (including the overall quality of the implementation of the study). However, it should be noted that many of these studies did not include delinquency measures, and schools rather than individuals were often the unit of random allocation.4

THE STUDY

We sought to define the influence of research design on study outcomes across a large group of studies representing the different types of research design as well as a broad array of criminal justice areas. The most comprehensive source we could identify for this purpose has come to be known as the Maryland Report (Sherman et al. 1997). The Maryland Report was commissioned by the National Institute of Justice to identify “what works, what doesn’t, and what’s promising” in preventing crime. It was conducted at the University of Maryland’s Department of Criminology and Criminal Justice over a yearlong period between 1996 and 1997. The report attempted to identify all available research relevant to crime prevention in seven broad areas: communities, families, schools, labor markets, places, policing, and criminal justice (corrections). Studies chosen for inclusion in the Maryland Report met minimal methodological requirements.5

Though the Maryland Report did not examine the relationship between study design and study outcomes, it did define the quality of the methods used to evaluate the strength of the evidence provided through a scientific methods scale (SMS). This SMS was coded with numbers 1 through 5, with “5 being the strongest scientific evidence” (Sherman et al. 1997, 2.18). Overall, studies higher on the scale have higher internal validity, and studies with lower scores have lower internal validity. The 5-point scale was broadly defined in the Maryland Report (Sherman et al. 1997) as follows:

1: Correlation between a crime prevention program and a measure of crime or crime risk factors.

2: Temporal sequence between the program and the crime or risk outcome clearly observed, or a comparison group present without the demonstrated comparability to the treatment group.

3: A comparison between two or more units of analysis, one with and one without the program.

4: Comparison between multiple units with and without the program, controlling for other factors, or a nonequivalent comparison group has only minor differences evident.

5: Random assignment and analysis of comparable units to program and comparison groups. (2.18-2.19)

A score of 5 on this scale suggests a randomized experimental design, and a score of 1 a nonexperimental approach. Scores of 3 and 4 may be associated with quasi-experimental designs, with 4 distinguished from 3 by a greater concern with control for threats to internal validity. A score of 2 represents a stronger nonexperimental design or a weaker quasi-experimental approach. However, the overall rating given to a study could be affected by other design criteria such as response rate, attrition, use of statistical tests, and statistical power. It is impossible to tell from the Maryland Report how much influence such factors had on each study’s rating. However, correspondence with four of the main study investigators suggests that adjustments based on these other factors were uncommon and generally would result in an SMS decrease or increase of only one level.

Although the Maryland Report included a measure of study design, it did not contain a standardized measure of study outcome. Most prior reviews have relied on standardized effect measures as a criterion for studying the relationship between design type and study findings. Although in some of the area reviews in the Maryland Report, standardized effect sizes were calculated for specific studies, this was not the case for the bulk of the studies reviewed in the report. Importantly, in many cases it was not possible to code such information because the original study authors did not provide the specific details necessary for calculating standardized effect coefficients. But the approach used by the Maryland investigators also reflected a broader philosophical decision that emphasized the bottom line of what was known about the effects of crime and justice interventions. In criminal justice, the outcome of a study is often considered more important than the effect size noted. This is the case in good part because there are often only a very small number of studies that examine a specific type of treatment or intervention. In addition, policy decisions are made not on the basis of a review of the effect sizes that are reported but rather on whether one or a small group of studies suggests that the treatment or intervention works.

From the data available in the Maryland Report, we developed an overall measure of study outcomes that we call the investigator reported result (IRR). The IRR was created as an ordinal scale with three values: 1, 0, and –1, reflecting whether a study concluded that the treatment or intervention worked, had no detected effect, or led to a backfire effect. It is defined by what is reported in the tables of the Maryland Report and is coded as follows:6

1: The program or treatment is reported to have had an intended positive effect for the criminal justice system or society. Outcomes in this case supported the position that interventions or treatments lead to reductions in crime, recidivism, or related measures.7

0: The program or treatment was reported to have no detected effect, or the effect was reported as not statistically significant.

–1: The program or treatment had an unintended backfire effect for the criminal justice system or society. Outcomes in this case supported the position that interventions or treatments were harmful and lead to increases in crime, recidivism, or related measures.8

This scale provides an overall measure of the conclusions reached by investigators in the studies that were reviewed in the Maryland Report. However, we think it is important to note at the outset some specific features of the methodology used that may affect the findings we gain using this approach. Perhaps most significant is the fact that Maryland reviewers generally relied on the reported conclusions of investigators unless there was obvious evidence to the contrary.9 This approach led us to term the scale the investigator reported result and reinforces the fact that we examine the impacts of study design on what investigators report rather than on the actual outcomes of the studies examined.

While the Maryland reviewers examined tests of statistical significance in coming to conclusions about which programs or treatments work,10 they did not require that statistical tests be reported by investigators to support the specific conclusions reached in each study. In turn, the tables in the Maryland Report often do not note whether specific studies employed statistical tests of significance. Accordingly, in reviewing the Maryland Report studies, we cannot assess whether the presence or absence of such tests influences our conclusions. Later in our article we reexamine our results, taking into account statistical significance in the context of a more recent review in the corrections area that was modeled on the Maryland Report.

Finally, as we noted earlier, most systematic reviews of study outcomes have come to use standardized effect size as a criterion. While we think that the IRR scale is useful for gaining an understanding of the relationship between research design and reported study conclusions, we recognize that a different set of conclusions might have been reached had we focused on standardized effect sizes. Again, we use the corrections review referred to above to assess how our conclusions might have differed if we had focused on standardized effect sizes rather than the IRR scale.

We coded the Scientific Methods Scale and the IRR directly from the tables reported in Preventing Crime: What Works, What Doesn’t, What’s Promising (Sherman et al. 1997). We do not include all of the studies in the Maryland Report in our review. First, given our interest in the area of criminal justice, we excluded studies that did not have a crime or delinquency outcome measure. Second, we excluded studies that did not provide an SMS score (a feature of some tables in the community and family sections of the report). Finally, we excluded the school-based area from review because only selected studies were reported in tables.11 All other studies reviewed in the Maryland Report were included, which resulted in a sample of 308 studies. Tables 1 and 2 display the breakdown of these studies by SMS and IRR.

TABLE 1
STUDIES CATEGORIZED BY SMS

SMS        n    Percentage
1         10         3
2         94        31
3        130        42
4         28         9
5         46        15
Total    308       100

TABLE 2
STUDIES CATEGORIZED BY THE IRR

IRR        n    Percentage
–1        34        11
0         76        25
1        198        64
Total    308       100

As is apparent from Table 1, there is wide variability in the nature of the research methods used in the studies that are reviewed. About 15 percent were coded in the highest SMS category, which demands a randomized experimental design. Only 10 studies included were coded in the lowest SMS category, though almost a third fall in category 2. The largest category is score 3, which required simply a comparison between two units of analysis, one with and one without treatment. About 1 in 10 cases were coded as 4, suggesting a quasi-experimental study with strong attention to creating equivalence between the groups studied.

The most striking observation that is drawn from Table 2 is that almost two-thirds of the crime and justice studies reviewed in the Maryland Report produced a reported result in the direction of success for the treatment or intervention examined. This result is very much at odds with reviews conducted in earlier decades that suggested that most interventions had little effect on crime or related problems (for example, see Lipton, Martinson, and Wilks 1975; Logan 1972; Martinson 1974).12 At the same time, a number of the studies examined, about 1 in 10, reported a backfire effect for treatment or intervention.

RELATING STUDY DESIGN AND STUDY OUTCOMES

In Tables 3 and 4 we present our basic findings regarding the relationship between study design and study outcomes in the Maryland Report sample. Table 3 provides mean IRR outcome scores across the five SMS design categories. While the mean IRR scores in this case present a simple method for examining the results, we also provide an overall statistical measure of correlation, Tau-c (and the associated significance level), which is more appropriate for data of this type. In Table 4 we provide the cross-tabulation of IRR and SMS scores. This presentation of the results allows us to examine more carefully the nature of the relationship both in terms of outcomes in the expected treatment direction and outcomes that may be classified as backfire effects.

Overall Tables 3 and 4 suggest that there is a linear inverse relationship between the SMS and the IRR. The mean IRR score decreases with each increase in step in the SMS score (see Table 3). While fully nonexperimental designs have a mean IRR score of .80, randomized experiments have a mean of only .22. The run-of-the-mill quasi-experimental designs represented in category 3 have a mean IRR score of .56, while the strongest quasi-experiments (category 4) have a mean of .39. The overall correlation between study design and study outcomes is moderate and negative (–.18), and the relationship is statistically significant at the .001 level.

TABLE 3
MEAN IRR SCORES ACROSS SMS CATEGORIES

SMS      Mean      n    Standard Deviation
1         .80     10         .42
2         .66     94         .63
3         .56    130         .67
4         .39     28         .83
5         .22     46         .70
Total     .53    308         .69

NOTE: Tau-c = –.181. p < .001.

Looking at the cross-tabulation of SMS and IRR scores, our findings are reinforced. The stronger the method in terms of internal validity as measured by the SMS, the less likely is a study to conclude that the intervention or treatment worked. The weaker the method, the less likely the study is to conclude that the intervention or treatment backfired.

While 8 of the 10 studies in the lowest SMS category and 74 percent of those in category 2 show a treatment impact in the desired direction, this was true for only 37 percent of the randomized experiments in category 5. Only in the case of backfire outcomes in categories 4 and 5 does the table not follow our basic findings, and this departure is small. Overall the relationship observed in the table is statistically significant at the .005 level.

TABLE 4
CROSS-TABULATION OF SMS AND IRR

                SMS 1        SMS 2        SMS 3        SMS 4        SMS 5
IRR          n    %       n    %       n    %       n    %       n    %
–1           0    0       8    9      13   10       6   21       7   15
0            2   20      16   17      31   24       5   18      22   48
1            8   80      70   74      86   66      17   61      17   37
Total       10  100      94  100     130  100      28  100      46  100

NOTE: Chi-square = 25.487 with 8 df (p < .005).
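As a check on the reported statistics, the following sketch (our own code, not part of the Maryland Report; we assume the Tau-c in Table 3 is Stuart's tau-c computed over the full SMS-by-IRR cross-tabulation) recomputes the correlation and the chi-square statistic directly from the Table 4 counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 4: rows are SMS scores 1-5,
# columns are IRR outcomes (-1 backfire, 0 no detected effect, 1 positive).
table = np.array([
    [ 0,  2,  8],   # SMS 1
    [ 8, 16, 70],   # SMS 2
    [13, 31, 86],   # SMS 3
    [ 6,  5, 17],   # SMS 4
    [ 7, 22, 17],   # SMS 5
])

def stuarts_tau_c(counts):
    """Stuart's tau-c for an r x c contingency table of two ordinal variables."""
    counts = np.asarray(counts, dtype=float)
    r, c = counts.shape
    n = counts.sum()
    concordant = discordant = 0.0
    for i in range(r):
        for j in range(c):
            concordant += counts[i, j] * counts[i + 1:, j + 1:].sum()
            discordant += counts[i, j] * counts[i + 1:, :j].sum()
    m = min(r, c)
    return 2.0 * m * (concordant - discordant) / (n ** 2 * (m - 1))

print(round(stuarts_tau_c(table), 3))   # about -0.181
chi2, p, dof, _ = chi2_contingency(table)
print(round(chi2, 3), dof)              # about 25.49 with 8 df
```

These values reproduce, to rounding, the figures reported in the notes to Tables 3 and 4.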

Comparing the highest-quality nonrandomized studies with randomized experiments

As noted earlier, some scholars argue that higher-quality nonrandomized studies are likely to have outcomes similar to outcomes of randomized evaluations. This hypothesis is not supported by our data. In Table 5 we combine quasi-experimental studies in SMS categories 3 and 4 and compare them with randomized experimental studies placed in SMS category 5. Again we find a statistically significant negative relationship (p < .01). While 37 percent of the level 5 experimental studies show a treatment effect in the desired direction, this was true for 65 percent of the quasi-experimental studies.

TABLE 5
COMPARING QUASI-EXPERIMENTAL STUDIES (SMS = 3 OR 4)
WITH RANDOMIZED EXPERIMENTS (SMS = 5)

             SMS 3 or 4        SMS 5
IRR           n    %          n    %
–1           19   12          7   15
0            36   23         22   48
1           103   65         17   37
Total       158  100         46  100

NOTE: Chi-square = 12.971 with 2 df (p < .01).

Even if we examine only the highest-quality quasi-experimental studies as represented by category 4 and compare these to the randomized studies included in category 5, the relationship between study outcomes and study design remains statistically significant at the .05 level (see Table 6). There is little difference between the two groups in the proportion of backfire outcomes reported; however, there remains a very large gap between the proportion of SMS category 4 and SMS category 5 studies that report an outcome in the direction of treatment effectiveness. While 61 percent of the category 4 SMS studies reported a positive treatment or intervention effect, this was true for only 37 percent of the randomized studies in category 5. Accordingly, even when comparing those nonrandomized studies with the highest internal validity with randomized experiments, we find significant differences in terms of reported study outcomes.

TABLE 6
COMPARING HIGH-QUALITY QUASI-EXPERIMENTAL DESIGNS (SMS = 4)
WITH RANDOMIZED DESIGNS (SMS = 5)

              SMS 4           SMS 5
IRR           n    %          n    %
–1            6   21          7   15
0             5   18         22   48
1            17   61         17   37
Total        28  100         46  100

NOTE: Chi-square = 6.805 with 2 df (p < .05).

Taking into account tests of statistical significance

It might be argued that had we used a criterion of statistical significance, the overall findings would not have been consistent with the analyses reported above. While we cannot examine this question in the context of the Maryland Report, since statistical significance is generally not reported in the tables or the text of the report, we can review this concern in the context of a more recent review conducted in the corrections area by one of the Maryland investigators, which uses a similar methodology and reports Maryland SMS (see MacKenzie and Hickman 1998). MacKenzie and Hickman (1998) examined 101 studies in their 1998 review of what works in corrections, of which 68 are reported to have included tests of statistical significance.

Developing the IRR score for each of MacKenzie and Hickman’s (1998) studies proved more complex than the coding done for the Maryland Report. MacKenzie and Hickman reported all of the studies’ results, sometimes breaking up results by gender, employment, treatment mix, or criminal history, to list a few examples. Rather than count each result as a separate study, we developed two different methods that followed different assumptions for coding the IRR index.

The first simply notes whether any significant findings were found supporting a treatment effect and codes a backfire effect when there are statistically significant negative findings with no positive treatment effects (scale A).13 The second (scale B) is more complex and gives weight to each result in each study.14
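A minimal sketch of how the two scales can be operationalized, based on our reading of the coding rules in notes 13 and 14 (the function and variable names are ours, and per-study counts of significant positive, significant negative, and nonsignificant results are assumed to be available):

```python
def scale_a(sig_positive, sig_negative, nonsignificant):
    """Scale A: any significant positive finding scores 1; otherwise any
    significant negative (backfire) finding scores -1; otherwise 0."""
    if sig_positive > 0:
        return 1
    if sig_negative > 0:
        return -1
    return 0

def scale_b(sig_positive, sig_negative, nonsignificant):
    """Scale B: weights results within a study. Mostly significant positive
    results (more than half of all results) score 2; some significant positive
    results score 1; only significant backfire results score -1; otherwise 0."""
    total = sig_positive + sig_negative + nonsignificant
    if total == 0:
        return 0
    if sig_positive > total / 2:
        return 2
    if sig_positive > 0:
        return 1
    if sig_negative > 0:
        return -1
    return 0

# Example: a study with one significant positive result and three
# nonsignificant results would be coded 1 on scale A and 1 on scale B.
print(scale_a(1, 0, 3), scale_b(1, 0, 3))
```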

Taking this approach, our findings analyzing the MacKenzie and Hickman (1998) data follow those reported when analyzing the Maryland Report. The correlation between study design and study outcomes is negative and statistically significant (p < .005) irrespective of the approach we used to define the IRR outcome scale (see Table 7). Using scale A, the correlation observed is –.29, while using scale B, the observed correlation is –.31.

TABLE 7
RELATING SMS AND IRR ONLY FOR STUDIES IN MACKENZIE AND HICKMAN (1998)
THAT INCLUDE TESTS OF STATISTICAL SIGNIFICANCE

             Scale A            Scale B
SMS       Mean      n        Mean      n
1           —       0          —       0
2          0.83    24         1.46    24
3          0.62    26         1.04    26
4          0.36    11         0.64    11
5          0.00     7         0.14     7
Total       .59    68         1.03    68

NOTE: Tau-c for scale A = –.285 (p < .005). Tau-c for scale B = –.311 (p < .005).

Comparing effect size and IRR score results

It might be argued that our overall findings are related to specific characteristics of the IRR scale rather than the underlying relationship between study design and study outcomes. We could not test this question directly using the Maryland Report data because, as noted earlier, standardized effect sizes were not consistently recorded in the report. However, MacKenzie and Hickman (1998) did report standardized effect size coefficients, and thus we are able to reexamine this question in the context of corrections-based criminal justice studies.

Using the average standardized effect size reported for each study reviewed by MacKenzie and Hickman (1998) for the entire sample (including studies where statistical significance is not reported), the results follow those gained from relating IRR and SMS scores using the Maryland Report sample (see Table 8). Again the correlation between SMS and study outcomes is negative; in this case the correlation is about –.30. The observed relationship is also statistically significant at the .005 level. Accordingly, these findings suggest that our observation of a negative relationship between study design and study outcomes in the Maryland Report sample is not an artifact of the particular codings of the IRR scale.

TABLE 8
RELATING AVERAGE EFFECT SIZE AND SMS FOR STUDIES
IN MACKENZIE AND HICKMAN (1998)

          Effect Size Available from the Entire Sample
SMS            Mean        n
1                —         0
2               .29       39
3               .23       30
4               .19       13
5               .00        7
Total           .23       89
Missing values            12

NOTE: Correlation (r) = –.296 (p < .005).

DISCUSSION

Our review of the Maryland Report studies suggests that in criminal justice, there is a moderate inverse relationship between the quality of a research design, defined in terms of internal validity, and the outcomes reported in a study. This relationship continues to be observed even when comparing the highest-quality nonrandomized studies with randomized experiments. Using a related database concentrating only on the corrections area, we also found that our findings are consistent when taking into account only studies that employed statistical tests of significance. Finally, using the same database, we were able to examine whether our results would have differed had we used standardized effect size measures rather than the IRR index that was drawn from the Maryland Report. We found our results to be consistent using both methods. Studies that were defined as including designs with higher internal validity were likely to report smaller effect sizes than studies with designs associated with lower internal validity.

Prior reviews of the relationship between study design and study outcomes do not predict our findings. Indeed, as we noted earlier, the main lesson that can be drawn from prior research is that the impact of study design is very much dependent on the characteristics of the particular area or studies that are reviewed. In theory as well, there is no reason to assume that there will be a systematic type of bias in studies with lower internal validity. What can be said simply is that such studies, all else being equal, are likely to provide biased findings as compared with results drawn from randomized experimental designs. Why then do we find in reviewing a broad group of crime and justice studies what appears to be a systematic relationship between study design and study outcomes?

One possible explanation for our findings is that they are simply an artifact of combining a large number of studies drawn from many different areas of criminal justice. Indeed, there are generally very few studies that examine a very specific type of treatment or intervention in the Maryland Report. And it may be that were we able to explore the impacts of study design on study outcomes for specific types of treatments or interventions, we would find patterns different from the aggregate ones reported here. We think it is likely that for specific areas of treatment or specific types of studies in criminal justice, the relationship between study design and study outcomes will differ from those we observe. Nonetheless, review of this question in the context of one specific type of treatment examined by the Campbell Collaboration (where there was a substantial enough number of randomized and nonrandomized studies for comparison) points to the salience of our overall conclusions even within specific treatment areas (see Petrosino, Petrosino, and Buehler 2001). We think this example is particularly important because it suggests the potential confusion that might result from drawing conclusions from nonrandomized studies.

Relying on a systematic review conducted by Petrosino, Petrosino, and Buehler (2001) on Scared Straight and other kids-visit programs, we identified 20 programs that included crime-related outcome measures. Of these, 9 were randomized experiments, 4 were quasi-experimental trials, and 7 were fully nonexperimental studies. Petrosino, Petrosino, and Buehler reported on the randomized experimental trials in their Campbell Collaboration review. They concluded that Scared Straight and related programs do not evidence any benefit in terms of recidivism and actually increase subsequent delinquency. However, a very different picture of the effectiveness of these programs is drawn from our review of the quasi-experimental and nonexperimental studies. Overall, these studies, in contrast to the experimental evaluations, suggest that Scared Straight programs not only are not harmful but are more likely than not to produce a crime prevention benefit.

We believe that our findings, however preliminary, point to the possibility of an overall positive bias in nonrandomized criminal justice studies. This bias may in part reflect a number of other factors that we could not control for in our data, for example, publication bias or differential attrition rates across designs (see Shadish and Ragsdale 1996). However, we think that a more general explanation for our findings is likely to be found in the norms of criminal justice research and practice.

Such norms are particularly important in the development of nonrandomized studies. Randomized experiments provide little freedom to the researcher in defining equivalence between treatment and comparison groups. Equivalence in randomized experiments is defined simply through the process of randomization. However, nonrandomized studies demand much insight and knowledge in the development of comparable groups of subjects. Not only must the researcher understand the factors that influence treatment so that he or she can prevent confounding in the study results, but such factors must be measured and then controlled for through some statistical or practical procedure.

It may be that such manipulation is particularly difficult in criminal justice study. Criminal justice practitioners may not be as strongly socialized to the idea of experimentation as are practitioners in other fields like medicine. And in this context, it may be that a subtle form of creaming in which the cases considered most amenable to intervention are placed in the intervention group is common. In specific areas of criminal justice, such creaming may be exacerbated by self-selection of subjects who are motivated toward rehabilitation. Nonrandomized designs, even in relatively rigorous quasi-experimental studies, may be unable to compensate or control for why a person is considered amenable and placed in the intervention group. Matching on traditional control variables like age and race, in turn, might not identify the subtle components that make individuals amenable to treatment and thus more likely to be placed in intervention or treatment categories.

Of course, we have so far assumed that nonrandomized studies are biased in their overestimation of program effects. Some scholars might argue just the opposite. The inflexibility of randomized experimental designs has sometimes been seen as a barrier to development of effective theory and practice in criminology (for example, see Clarke and Cornish 1972; Eck 2001; Pawson and Tilley 1997). Here it is argued that in a field in which we still know little about the root causes and processes that underlie phenomena we seek to influence, randomized studies may not allow investigators the freedom to carefully explore how treatments or programs influence their intended subjects. While this argument has merit in specific circumstances, especially in exploratory analyses of problems and treatments, we think our data suggest that it can lead in more developed areas of our field to significant misinterpretation and confusion.

CONCLUSION

We asked at the outset of our article whether the type of research design used in criminal justice influences the conclusions that are reached. Our findings, based on the Maryland Report, suggest that design does matter and that its effect in criminal justice study is systematic. The weaker a design, as indicated by internal validity, the more likely was a study to report a result in favor of treatment and the less likely it was to report a harmful effect of treatment. Even when comparing studies defined as randomized designs in the Maryland Report with strong quasi-experimental research designs, systematic and statistically significant differences were observed. Though our study should be seen only as a preliminary step in understanding how research design affects study outcomes in criminal justice, it suggests that systematic reviews of what works in criminal justice may be strongly biased when including nonrandomized studies. In efforts such as those being developed by the Campbell Collaboration, such potential biases should be taken into account in coming to conclusions about the effects of interventions.

Notes

1. Statistical adjustments for random group differences are sometimes employed in experimental studies as well.

2. We should note that we have assumed so far that external validity (the degree to which it can be inferred that outcomes apply to the populations that are the focus of treatment) is held constant in these comparisons. Some scholars argue that experimental studies are likely to have lower external validity because it is often difficult to identify institutions that are willing to randomize participants. Clearly, where randomized designs have lower external validity, the assumption that they are to be preferred to nonrandomized studies is challenged.

3. Kunz and Oxman (1998) not only compared randomized and nonrandomized studies but also adequately and inadequately concealed randomized trials and high-quality versus low-quality studies. Generally, high-quality randomized studies included adequately concealed allocation, while lower-quality randomized trials were inadequately concealed. In addition, the general terms high-quality trials and low-quality trials indicate a difference where “the specific effect of randomization or allocation concealment could not be separated from the effect of other methodological manoeuvres such as double blinding” (Kunz and Oxman 1998, 1185).

4. Moreover, it may be that the finding of higher standardized effect sizes for randomized studies in this review was due to school-level as opposed to individual-level assignment. When only those studies that include a delinquency outcome are examined, a larger effect is found when school rather than student is the unit of analysis (Denise Gottfredson, personal communication, 2001).

5. As the following Scientific Methods Scale illustrates, the lowest acceptable type of evaluation for inclusion in the Maryland Report is a simple correlation between a crime prevention program and a measure of crime or crime risk factors. Thus studies that were descriptive or contained only process measures were excluded.

6. There were also (although rarely) studies in the Maryland Report that reported two findings in opposite directions. For instance, in Sherman and colleagues’ (1997) section on specific deterrence (8.18-8.19), studies of arrest for domestic violence had positive results for employed offenders and backfire results for nonemployed offenders. In these isolated cases, the study was coded twice with the same scientific methods scores and each of the investigator-reported result scores (of 1 and –1) separately.

7. For studies examining the absence of a program (such as a police strike) where social conditions worsened or crime increased, this would be coded as 1.

8. For studies examining the absence of a program (such as a police strike) where social conditions improved or crime decreased, this would be coded as –1.

9. Only in the school-based area was there a specific criterion for assessing the investigator’s conclusions. As noted below, however, the school-based studies are excluded from our review for other reasons.

10. For example, the authors of the Maryland Report noted in discussing criteria for deciding which programs work, “These are programs that we are reasonably certain of preventing crime or reducing risk factors for crime in the kinds of social contexts in which they have been evaluated, and for which the findings should be generalizable to similar settings in other places and times. Programs coded as ‘working’ by this definition must have at least two level 3 evaluations with statistical significance tests showing effectiveness and the preponderance of all available evidence supporting the same conclusion” (Sherman et al. 1997, 2-20).

11. It is the case that many of the studies in this area would have been excluded anyway since they often did not have a crime or delinquency outcome measure (but rather examined early risk factors for crime and delinquency).

12. While the Maryland Report is consistent with other recent reviews that also point to greater success in criminal justice interventions during the past 20 years (for example, see Poyner 1993; Visher and Weisburd 1998; Weisburd 1997), we think the very high percentage of studies showing a treatment impact is likely influenced by publication bias. The high rate of positive findings is also likely influenced by the general weaknesses of the study designs employed. This is suggested by our findings reported later: that the weaker a research design in terms of internal validity, the more likely is the study to report a positive treatment outcome.

13. The coding scheme for scale A was as follows. A value of 1 indicates that the study had any statistically significant findings supporting a positive treatment effect, even if findings included results that were not significant or had negative or backfire findings. A value of 0 indicates that the study had only nonsignificant findings. A value of –1 indicates that the study had only statistically significant negative or backfire findings or statistically significant negative findings with other nonsignificant results.

14. Scale B was created according to the following rules. A value of 2 indicates that the study had only or mostly statistically significant findings supporting a treatment effect (more than 50 percent) when including all results, even nonsignificant ones. A value of 1 indicates that the study had some statistically significant findings supporting a treatment effect (50 percent or less, counting both positive significant and nonsignificant results) even if the nonsignificant results outnumbered the positive statistically significant results. A value of 0 indicates that no statistically significant findings were reported. A value of –1 indicates that the study evidenced statistically significant backfire effects (even if nonsignificant results were present) but no statistically significant results supporting the effectiveness of treatment.
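
To make the coding rules in notes 13 and 14 concrete, the sketch below shows one way the two outcome scales could be computed from a study's individual findings. It is only an illustration (written here in Python), not code used in the Maryland Report or in our recoding: it assumes each finding has already been reduced to +1 (statistically significant in favor of treatment), 0 (nonsignificant), or -1 (statistically significant backfire effect), and the function names are invented for the example.

    # Illustrative sketch only: the data representation and function names are
    # assumptions made for this example, not part of the original coding protocol.

    def code_scale_a(findings):
        # Scale A (note 13): 1 if any significant finding favors treatment;
        # -1 if there are significant backfire findings but none favoring
        # treatment; 0 if only nonsignificant findings were reported.
        if any(f == 1 for f in findings):
            return 1
        if any(f == -1 for f in findings):
            return -1
        return 0

    def code_scale_b(findings):
        # Scale B (note 14): 2 if more than 50 percent of all reported
        # findings are significant and favor treatment; 1 if some but 50
        # percent or fewer do; -1 if only significant backfire findings
        # appear; 0 if nothing reported is significant.
        positive = sum(1 for f in findings if f == 1)
        backfire = sum(1 for f in findings if f == -1)
        if positive > len(findings) / 2:
            return 2
        if positive > 0:
            return 1
        if backfire > 0:
            return -1
        return 0

    # Example: one significant positive finding and two nonsignificant ones
    # yields 1 on both scales.
    print(code_scale_a([1, 0, 0]), code_scale_b([1, 0, 0]))

Consistent with note 6, the rare study reporting significant findings in opposite directions was entered twice, once with each investigator-reported result score, rather than being forced into a single code.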

References

Bailey, Walter C. 1966. Correctional Outcome: An Evaluation of 100 Reports. Journal of Criminal Law, Criminology and Police Science 57:153-60.

Boruch, Robert F., Anthony Petrosino, and Iain Chalmers. 1999. The Campbell Collaboration: A Proposal for Systematic, Multi-National, and Continuous Reviews of Evidence. Background paper for the meeting at University College–London, School of Public Policy, July.

Boruch, Robert F., Brook Snyder, and Dorothy DeMoya. 2000. The Importance of Randomized Field Trials. Crime & Delinquency 46:156-80.

Burtless, Gary. 1995. The Case for Randomized Field Trials in Economic and Policy Research. Journal of Economic Perspectives 9:63-84.

Campbell, Donald T. and Robert F. Boruch. 1975. Making the Case for Randomized Assignment to Treatments by Considering the Alternatives: Six Ways in Which Quasi-Experimental Evaluations in Compensatory Education Tend to Underestimate Effects. In Evaluation and Experiment: Some Critical Issues in Assessing Social Programs, ed. Carl Bennett and Arthur Lumsdaine. New York: Academic Press.

Chalmers, Iain and Douglas G. Altman. 1995. Systematic Reviews. London: British Medical Journal Press.

Clarke, Ronald V. and Derek B. Cornish. 1972. The Controlled Trial in Institutional Research: Paradigm or Pitfall for Penal Evaluators? London: HMSO.

Cox, Stephen M., William S. Davidson, and Timothy S. Bynum. 1995. A Meta-Analytic Assessment of Delinquency-Related Outcomes of Alternative Education Programs. Crime & Delinquency 41:219-34.

Cullen, Francis T. and Paul Gendreau. 2000. Assessing Correctional Rehabilitation: Policy, Practice, and Prospects. In Policies, Processes, and Decisions of the Criminal Justice System: Criminal Justice 3, ed. Julie Horney. Washington, DC: U.S. Department of Justice, National Institute of Justice.

Davies, Huw T. O., Sandra Nutley, and Peter C. Smith. 2000. What Works: Evidence-Based Policy and Practice in Public Services. London: Policy Press.

Davies, Philip. 1999. What Is Evidence-Based Education? British Journal of Educational Studies 47:108-21.

Eck, John. 2001. Learning from Experience in Problem Oriented Policing and Crime Prevention: The Positive Functions of Weak Evaluations and the Negative Functions of Strong Ones. Unpublished manuscript.

Egger, Matthias and G. Davey Smith. 1998. Bias in Location and Selection of Studies. British Medical Journal 316:61-66.

Farrington, David P. 1983. Randomized Experiments in Crime and Justice. In Crime and Justice: An Annual Review of Research, ed. Norval Morris and Michael Tonry. Chicago: University of Chicago Press.

Farrington, David P. 2000. Standards for Inclusion of Studies in Systematic Reviews. Discussion paper for the Campbell Collaboration Crime and Justice Coordinating Group.

Farrington, David P. and Anthony Petrosino. 2001. The Campbell Collaboration Crime and Justice Group. Annals of the American Academy of Political and Social Science 578:35-49.

Feder, Lynette and Robert F. Boruch. 2000. The Need for Experiments in Criminal Justice Settings. Crime & Delinquency 46:291-94.

Feder, Lynette, Annette Jolin, and William Feyerherm. 2000. Lessons from Two Randomized Experiments in Criminal Justice Settings. Crime & Delinquency 46:380-400.

Friedlander, Daniel and Philip K. Robins. 2001. Evaluating Program Evaluations: New Evidence on Commonly Used Non-Experimental Methods. American Economic Review 85:923-37.

Hedges, Larry V. 2000. Using Converging Evidence in Policy Formation: The Case of Class Size Research. Evaluation and Research in Education 14:193-205.

Heinsman, Donna T. and William R. Shadish. 1996. Assignment Methods in Experimentation: When Do Nonrandomized Experiments Approximate Answers from Randomized Experiments? Psychological Methods 1:154-69.

Kunz, Regina and Andy Oxman. 1998. The Unpredictability Paradox: Review of Empirical Comparisons of Randomized and Non-Randomized Clinical Trials. British Medical Journal 317:1185-90.

LaLonde, Robert J. 1986. Evaluating the Econometric Evaluations of Training Programs with Experimental Data. American Economic Review 76:604-20.

Lipsey, Mark W. and David B. Wilson. 1993. The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis. American Psychologist 48:1181-209.

Lipton, Douglas S., Robert M. Martinson, and Judith Wilks. 1975. The Effectiveness of Correctional Treatment: A Survey of Treatment Evaluation Studies. New York: Praeger.

Logan, Charles H. 1972. Evaluation Research in Crime and Delinquency—A Reappraisal. Journal of Criminal Law, Criminology and Police Science 63:378-87.

MacKenzie, Doris L. 2000. Evidence-Based Corrections: Identifying What Works. Crime & Delinquency 46:457-71.

MacKenzie, Doris L. and Laura J. Hickman. 1998. What Works in Corrections (Report submitted to the State of Washington Legislature Joint Audit and Review Committee). College Park: University of Maryland.

Martinson, Robert. 1974. What Works? Questions and Answers About Prison Reform. Public Interest 35:22-54.

Millenson, Michael L. 1997. Demanding Medical Excellence: Doctors and Accountability in the Information Age. Chicago: University of Chicago Press.

Nutley, Sandra and Huw T. O. Davies. 1999. The Fall and Rise of Evidence in Criminal Justice. Public Money & Management 19:47-54.

Pawson, Ray and Nick Tilley. 1997. Realistic Evaluation. London: Sage.

Petrosino, Anthony, Carolyn Petrosino, and John Buehler. 2001. Pilot Test: The Effects of Scared Straight and Other Juvenile Awareness Programs on Delinquency. Unpublished manuscript.

Poyner, Barry. 1993. What Works in Crime Prevention: An Overview of Evaluations. In Crime Prevention Studies. Vol. 1, ed. Ronald V. Clarke. Monsey, NY: Criminal Justice Press.

Shadish, William R. and Kevin Ragsdale. 1996. Random Versus Nonrandom Assignment in Controlled Experiments: Do You Get the Same Answer? Journal of Consulting and Clinical Psychology 64:1290-305.

Sherman, Lawrence W. 1998. Evidence-Based Policing. In Ideas in American Policing. Washington, DC: Police Foundation.

Sherman, Lawrence W., Denise C. Gottfredson, Doris Layton MacKenzie, John E. Eck, Peter Reuter, and Shawn D. Bushway. 1997. Preventing Crime: What Works, What Doesn’t, What’s Promising. Washington, DC: U.S. Department of Justice, National Institute of Justice.

Visher, Christy A. and David Weisburd. 1998. Identifying What Works: Recent Trends in Crime Prevention. Crime, Law and Social Change 28:223-42.

Weisburd, David. 1997. Reorienting Crime Prevention Research and Policy: From the Causes of Criminality to the Context of Crime (Research Report NIJ 16504). Washington, DC: U.S. Department of Justice, National Institute of Justice.

Whitehead, John T. and Steven P. Lab. 1989. A Meta-Analysis of Juvenile Correctional Treatment. Journal of Research in Crime and Delinquency 26:276-95.

Wilkinson, Leland and Task Force on Statistical Inference. 1999. Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist 54:594-604.

Wilson, David B., Catherine A. Gallagher, and Doris L. MacKenzie. 2000. A Meta-Analysis of Corrections-Based Education, Vocation, and Work Programs for Adult Offenders. Journal of Research in Crime and Delinquency 37:347-68.

Wilson, David B., Denise C. Gottfredson, and Stacy S. Najaka. In Press. School-Based Prevention of Problem Behaviors: A Meta-Analysis. Journal of Quantitative Criminology.

Zuger, Abigail. 1997. New Way of Doctoring: By the Book. New York Times, 16 Dec.
