IES Workshop on Evaluating State and District Level Interventions

Mark W. Lipsey
Director, Center for Evaluation Research and Methodology
Vanderbilt University

David Holdzkom
Assistant Superintendent for Evaluation and Research
Wake County Public School System, North Carolina

April 24, 2008 Washington, DC

Purpose

To help schools, districts, and states design and implement rigorous evaluations of the effects of promising practices, programs, and policies on educational outcomes.

Why encourage locally initiated impact evaluation?

• Many interventions are not effective; users and interested others need to know.

• The interventions most relevant to improving outcomes are those that schools and districts believe are promising and feasible.

• IES has funding to support research initiated by schools, districts, and states.

What kinds of interventions might be evaluated?

• Practices, e.g., one-on-one tutoring, educational software, acceleration of high ability students, cooperative learning.

• Programs, e.g., Reading Recovery, Ladders to Literacy, Cognitive Tutor algebra, Saxon Math, Caring School Community (character education).

• Policies, e.g., reduced class size, pre-K, alternative high schools, all year calendar.

Key Issues in Designing Impact Evaluations for Education Interventions

Logic Models, Variables, and Evaluation Questions

Logic model: 1. Specifying the problem the intervention addresses

Nature of the need:

• What and for whom (e.g., kindergarten students who aren’t ready for school).

• Why (e.g., poor pre-literacy skills, inappropriate school behavior).

• Rationale/evidence supporting the intervention target (e.g., at kindergarten entry students need to be ready to learn or they will begin to fall behind; research shows school readiness can be enhanced for at-risk 4-year-olds).

Logic model: 2. Specifying the planned intervention

What the intervention does that addresses the need:

• Content: What the students should know or be able to do; why this meets the need.

• Pedagogy: Instructional techniques and methods to be used; why appropriate.

• Delivery system: How the intervention will arrange to deliver the instruction.

• The key factors or core ingredients most essential and distinctive to the intervention.

Logic model: 3. Specifying the theory of change

[Figure: theory-of-change diagram in four stages.]

Target population: 4-year-old at-risk children
Intervention: Pre-K with literacy curriculum
Proximal outcomes: positive attitudes to school; improved pre-literacy skills; learn appropriate school behavior
Distal outcomes: increased school readiness; greater learning gains in K

Mapping variables onto the intervention theory: Sample characteristics

[Figure: the theory-of-change diagram, with the variables below mapped onto the target population (4-year-old at-risk children).]

Sample descriptors:
* Basic demographics
* Diagnostic, need/eligibility identification
* Baseline performance

Potential moderators:
* Setting, context
* Personal and family characteristics
* Prior experience

Mapping variables onto the intervention theory: Intervention characteristics

[Figure: the theory-of-change diagram, with the variables below mapped onto the intervention (Pre-K with literacy curriculum).]

Independent variable:
* T vs. C comparison conditions

Generic fidelity:
* T and C exposure to the generic aspects of the intervention (type, amount, quality)

Specific fidelity:
* T and C (?) exposure to distinctive aspects of the intervention (type, amount, quality)

Potential moderators:
* Characteristics of personnel
* Intervention setting, context, e.g., class size

Mapping variables onto the intervention theory: Intervention outcomes

[Figure: the theory-of-change diagram, with the variables below mapped onto the proximal and distal outcomes for children exposed to the intervention.]

Focal dependent variables:
* Pretests (pre-intervention)
* Posttests (at end of intervention)
* Follow-ups (lagged after end of intervention)

Other dependent variables:
* Side effects: possible unplanned positive or negative outcomes
* Mediators: DVs on causal pathways from intervention to other DVs

Research questions: Relationships of (possible) interest

• Intervention effects: Causal relationship between intervention and outcomes.

• Duration of effects post-intervention.

• Moderator relationships: Differential intervention effects for different subgroups.

• Mediator relationships: Stepwise causal relationship with effects on a proximal outcome causing effects on a distal outcome.

Research Designs for Estimating Intervention Effects

What is an intervention effect and why is it so difficult to determine?

Research designs to discuss

• Two strong ones
1. Randomized experiment
2. Regression-discontinuity

• Two weak ones
3. Nonrandomized comparison groups with statistical controls
4. Comparative interrupted time series

1. Randomized experiment

[Figure: randomized experiment. A research sample of students, teachers, classrooms, schools, etc. is blocked on the pretest (high, medium-high, medium-low, low), and units within each block are randomly assigned to conditions: receive the experimental intervention, or do not receive it. The intervention effect is the difference between the posttest outcomes of the two randomly assigned groups.]
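The pretest blocking and within-block random assignment shown in the diagram can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name `blocked_random_assignment`, the student IDs, the pretest scores, and the choice of four equal-sized blocks are all hypothetical.

```python
import random

def blocked_random_assignment(pretest_scores, n_blocks=4, seed=0):
    """Sort units by pretest, cut into blocks, and randomize within each block.

    pretest_scores: dict mapping unit id -> pretest score.
    Returns a dict mapping unit id -> (block index, "T" or "C").
    """
    rng = random.Random(seed)
    ordered = sorted(pretest_scores, key=pretest_scores.get)  # low to high
    block_size = -(-len(ordered) // n_blocks)                 # ceiling division
    assignment = {}
    for b in range(n_blocks):
        block = ordered[b * block_size:(b + 1) * block_size]
        rng.shuffle(block)                                    # random order within block
        half = len(block) // 2
        for k, unit in enumerate(block):
            assignment[unit] = (b, "T" if k < half else "C")
    return assignment

# Hypothetical pretest scores for 16 students.
scores = {f"student_{i:02d}": s for i, s in enumerate(
    [12, 35, 27, 44, 18, 51, 39, 22, 30, 47, 15, 41, 25, 33, 49, 20])}
assignment = blocked_random_assignment(scores, n_blocks=4, seed=42)
for unit, (block, cond) in sorted(assignment.items(), key=lambda kv: kv[1]):
    print(block, cond, unit, scores[unit])
```

Because assignment is balanced inside every pretest block, T and C start out with nearly identical pretest distributions, which is the point of blocking.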

Circumstances conducive to randomized experiments

• More demand than supply for program– allocate scarce resource by lottery.

• New program that can be phased in– wait-list control has a delayed start.

• Pull-out or add-on program for selected students– randomly select from among those eligible.

• Volunteers willing to opt in for a chance to receive the program.

Example: Junior high algebra curriculum

• In 2000-01 the Moore Oklahoma Independent School District conducted a study of the effectiveness of the Cognitive Tutor Algebra I program on students in their junior high school system.

• Students in 5 junior high schools were randomly assigned to either the Cognitive Tutor Algebra I course or the ‘practice as usual’ algebra courses. Cognitive Tutor teachers received the curriculum materials and 4 days of training.

• Outcome measures included the ETS Algebra I end-of-course exam, course grades, and a survey of student attitudes towards mathematics.

Example: Alternative high school for students at risk of dropping out

• Horizon High School in Las Vegas identified 9th and 10th grade students behind grade level and at risk of dropping out.

• A random sample of these students was assigned to attend an alternative high school that featured a focus on cooperative learning, small group instruction, and support services.

• Outcomes were compared for the alternative and regular high schools on dropout rates, self-esteem, drug use, and arrest rates.

Example: Remedial reading programs for elementary students

• The Allegheny Intermediate Unit (AIU), which serves 42 suburban school districts in Allegheny County, Pennsylvania, randomly assigned 50 schools to one of four commercially available remedial reading interventions.

• Within each school struggling readers in grades 3 and 5 were identified and randomly assigned to instruction as usual or the remedial reading program designated for that school.

• In each program, 3 students met with a trained teacher one hour/day for 20 weeks.

• Measures of reading skill were administered at the beginning and end of the school year for program and control students.

2. Regression-discontinuity (aka the cutting-point design)

• When well-executed, its ability to provide an unbiased estimate of the intervention effect is strong– comparable to a randomized experiment.

• It is adaptable to many circumstances where it may be difficult to apply a randomized design.

Consider first a posttest on pretest regression for a randomized experiment with no effect

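[Figure: posttest (Y) plotted against pretest (S) for T and C. Both groups share the same mean S and the same mean Y; there is no intervention effect.]

Corresponding regression equation (T: 1 = treatment, 0 = control):

$Y_i = B_0 + B_S S_i + B_T T_i + e_i$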

Pretest-posttest randomized experiment, now with an intervention effect

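[Figure: posttest (Y) plotted against pretest (S). T and C share the same mean S, but the T mean Y lies above the C mean Y by Δ. Regression: $Y_i = B_0 + B_S S_i + B_T T_i + e_i$.]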

Consider now the same regression with no effect but with a cutting point applied

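[Figure: posttest (Y) plotted against the selection variable (S), with a cutting point separating T from C; the T mean Y and C mean Y are marked. Regression: $Y_i = B_0 + B_S S_i + B_T T_i + e_i$.]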

Regression discontinuity scatterplot (null case)


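[Figure: scatterplot of posttest (Y) against the selection variable, with T and C on opposite sides of the cutting point and a single continuous regression line. Regression: $Y_i = B_0 + B_S S_i + B_T T_i + e_i$.]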

Now add an intervention effect


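[Figure: posttest (Y) plotted against the selection variable; the T and C regression lines are offset by Δ at the cutting point. Regression: $Y_i = B_0 + B_S S_i + B_T T_i + e_i$.]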

Regression discontinuity scatterplot with effect


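[Figure: scatterplot of posttest (Y) against the selection variable, with T and C on opposite sides of the cutting point and a visible break in the regression line at the cutting point. Regression: $Y_i = B_0 + B_S S_i + B_T T_i + e_i$.]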

The effect estimated by R-D is the same as that from the randomized experiment


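[Figure: posttest (Y) plotted against the selection variable for T and C, with Δ marked at the cutting point. Regression: $Y_i = B_0 + B_S S_i + B_T T_i + e_i$.]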
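A small simulation can illustrate the claim that the same regression recovers the same effect under either assignment rule. This is a sketch under stated assumptions: the data are simulated with a linear posttest-on-S relationship, the true effect (5 points), the cutting point (50), and all other numbers are hypothetical, and the `ols` helper is a bare-bones least-squares solver written only for this example.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

TRUE_EFFECT = 5.0   # hypothetical Delta built into the simulated data
CUT = 50.0          # hypothetical cutting point on the selection variable

def estimate(n, assign, rng):
    """Simulate n cases, assign T with `assign`, and fit Y on (1, S - CUT, T)."""
    X, y = [], []
    for _ in range(n):
        s = rng.gauss(50, 10)
        t = 1.0 if assign(s, rng) else 0.0
        X.append([1.0, s - CUT, t])
        y.append(10 + 0.8 * s + TRUE_EFFECT * t + rng.gauss(0, 5))
    b0, b_s, b_t = ols(X, y)
    return b_s, b_t

rng = random.Random(2008)
# Randomized experiment: a coin flip decides T regardless of S.
bs_rand, bt_rand = estimate(4000, lambda s, r: r.random() < 0.5, rng)
# Regression-discontinuity: everyone below the cutting point gets T.
bs_cut, bt_cut = estimate(4000, lambda s, r: s < CUT, rng)
print(f"randomized: B_S = {bs_rand:.2f}, B_T = {bt_rand:.2f}")
print(f"cutoff:     B_S = {bs_cut:.2f}, B_T = {bt_cut:.2f}")
```

Both B_T estimates should land near the built-in effect, with the cutoff-based estimate somewhat noisier. The agreement depends on the linear functional form being correct, which is the first of the special issues for this design.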

The selection variable for R-D

• A continuous quantitative variable measured on every candidate for assignment to T or C who will participate in the study.

• Assignment to T or C strictly on the basis of the score obtained and the predetermined cutting point.

• Does not have to correlate highly with the outcome variable.

• Can be tailored to represent an appropriate basis for the assignment decision in the setting.

Special issues with the R-D design

• Correctly fitting the functional form
– possibility that it is not linear
– curvilinear functions
– interaction with the cutting point.

• Statistical power
– requires about 3 times the sample size of a comparable randomized experiment
– covariates correlated with the outcome but not the selection variable are helpful.
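One way to see where a figure of roughly 3 comes from is the variance inflation caused by the correlation between the selection variable and the treatment indicator. The sketch below is an illustration under specific assumptions, not a general power analysis: S is normally distributed, the linear model is correctly specified, and the randomized experiment being compared places the same share of cases in T. The function name `rd_sample_size_ratio` is hypothetical.

```python
from statistics import NormalDist

def rd_sample_size_ratio(cut_quantile):
    """Ratio of R-D to randomized-experiment sample size for equal precision.

    With T = 1 for cases below the cutting point, the squared correlation
    between a standard normal S and T is pdf(c)^2 / (p * (1 - p)), and the
    variance of the B_T estimate is inflated by 1 / (1 - r^2).
    """
    std = NormalDist()
    c = std.inv_cdf(cut_quantile)      # cutting point in z-units
    p = cut_quantile                   # share of cases below the cutting point
    r_squared = std.pdf(c) ** 2 / (p * (1 - p))
    return 1 / (1 - r_squared)

for q in (0.5, 0.3, 0.1):
    print(f"cut at quantile {q}: ratio = {rd_sample_size_ratio(q):.2f}")
```

With the cutting point at the mean of S the ratio is about 2.75, in line with the rule of thumb above. Fitting curvilinear or interaction terms costs additional precision that this calculation does not capture.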

Circumstances conducive to the regression-discontinuity design

• The situation involves a selection from some larger group of who will, or should, receive the intervention and who will not.

• The basis for selection is or can be made explicit and systematic enough to be captured in a quantitative rating or ranking.

• The allocation of the intervention can be made strictly on the basis of the selection score and cutting point in a high proportion of cases. Exceptions can be identified in advance and exempted from the scheme.

Example: Effects of universal pre-k in Tulsa, Oklahoma

• Eligibility for pre-k determined strictly on the basis of age– cutoff by birthday.

• Overall sample of 1,567 children just beginning pre-k plus 1,461 children just beginning kindergarten who had been in pre-k the previous year.

• WJ Letter-Word, Spelling, and Applied Problems as outcome variables.

Entry into pre-k selected by birthday

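[Figure: WJ test score plotted against age, split at the September 1 birthday cutoff. T: born before September 1; completed pre-K; tested at beginning of K. C: born after September 1; no pre-K yet; tested at beginning of pre-K year. The size of the jump at the cutoff is marked "?".]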

Samples and testing

[Figure: timeline of samples and testing. First cohort: pre-k in Year 1, kindergarten in Year 2. Second cohort: pre-k in Year 2. Administer WJ tests.]

Excerpts from Regression Analysis

Variable                    Letter-Word   Spelling   Applied Probs
                            B coeff       B coeff    B coeff
Treatment (T)                3.00*         1.86*      1.94*
Age: Days ± from Sept 1       .01           .01*       .02*
Days²                         .00           .00        .00
Days × T                      .00          -.01       -.01
Days² × T                     .00           .00        .00
Free lunch                  -1.28*         -.89*     -1.38*
Black                         .04          -.44*     -2.34*
Hispanic                    -1.70*         -.48*     -3.66*
Female                        .92*         1.05*       .76*
Mother’s educ: HS             .59*          .57*      1.25*

* p < .05
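The rows of the table suggest a regression of roughly the following form, which extends the basic R-D equation with curvilinear and interaction terms around the cutoff. The notation here is an assumption made for illustration: $D_i$ stands for days from September 1, and $X_{ki}$ for the demographic covariates (the excerpt may not list all of them).

```latex
Y_i = B_0 + B_T T_i + B_1 D_i + B_2 D_i^2 + B_3 (D_i \times T_i) + B_4 (D_i^2 \times T_i) + \sum_k B_k X_{ki} + e_i
```

In this form $B_T$ is the estimated discontinuity at the cutoff ($D_i = 0$), and the $D_i^2$ and interaction terms address the functional-form issues noted for the R-D design.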

3. Nonrandomized comparison groups with statistical controls

• Statistical controls: Analysis of covariance and multiple regression

• Matching on the control variables

• Propensity scores derived from the control variables.

Nonequivalent comparison analog to the randomized experiment

[Figure: nonequivalent comparison design. From a population of students, teachers, classrooms, schools, etc., units are selected through some nonrandom, more-or-less natural process either to receive the experimental intervention or not to receive it. The difference between the outcomes of the two groups is the intervention effect (??).]

Issues for obtaining good intervention effect estimates from nonrandomized comparison groups

• The fundamental problem: selection bias

• Knowing/measuring the variables necessary and sufficient to statistically control for the selection bias
– characteristics related to the outcome on which the groups differ

• Using an analysis model that properly adjusts for the selection bias, given appropriate control variables

Nonequivalent comparison groups: Pretest/covariate and posttest means

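[Figure: posttest (Y) plotted against pretest/covariate(s) (X) for T and C, marking the difference in pretest/covariate means and the difference in posttest means. Regression: $Y_i = B_0 + B_X X_i + B_T T_i + e_i$.]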

Nonequivalent comparison groups: Covariate-adjusted treatment effect estimate


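[Figure: posttest (Y) plotted against the pretest/covariate for T and C; Δ, the vertical distance between the T and C regression lines, is the covariate-adjusted treatment effect estimate. Regression: $Y_i = B_0 + B_X X_i + B_T T_i + e_i$.]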
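The difference between the raw posttest gap and the covariate-adjusted estimate can be made concrete with simulated data. This sketch assumes the single measured covariate X fully accounts for how the groups differ, which is exactly the assumption that is hard to defend in practice; the group means, the true effect of 5, and the function names are hypothetical.

```python
import random
from statistics import mean

def ancova_adjusted_effect(x_t, y_t, x_c, y_c):
    """Covariate-adjusted T-C difference using the pooled within-group slope."""
    def within_sums(x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return sxy, sxx
    sxy_t, sxx_t = within_sums(x_t, y_t)
    sxy_c, sxx_c = within_sums(x_c, y_c)
    b_x = (sxy_t + sxy_c) / (sxx_t + sxx_c)          # pooled within-group slope (B_X)
    raw = mean(y_t) - mean(y_c)                      # difference in posttest means
    adjusted = raw - b_x * (mean(x_t) - mean(x_c))   # Delta: the B_T estimate
    return raw, adjusted, b_x

rng = random.Random(7)
TRUE_EFFECT = 5.0   # hypothetical effect built into the simulated data

def simulate_group(n, pretest_mean, treated):
    """Simulated pretest and posttest scores for one nonrandomly formed group."""
    x = [rng.gauss(pretest_mean, 10) for _ in range(n)]
    y = [10 + 0.8 * xi + TRUE_EFFECT * treated + rng.gauss(0, 5) for xi in x]
    return x, y

x_t, y_t = simulate_group(2000, 55, 1)   # T group starts 10 points higher on X
x_c, y_c = simulate_group(2000, 45, 0)
raw, adjusted, b_x = ancova_adjusted_effect(x_t, y_t, x_c, y_c)
print(f"raw difference in posttest means: {raw:.2f}")
print(f"covariate-adjusted estimate:      {adjusted:.2f} (slope B_X = {b_x:.2f})")
```

The raw difference mixes the effect with the pre-existing gap between groups, while the adjusted estimate should be close to the built-in effect. If the groups also differed on an unmeasured variable related to the outcome, the adjusted estimate would remain biased.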

Covariate-adjusted treatment effect estimate with a relevant covariate left out


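[Figure: posttest (Y) for T and C with the estimated Δ marked. Model with both relevant covariates: $Y_i = B_0 + B_{X1} X_{1i} + B_{X2} X_{2i} + B_T T_i + e_i$.]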

Using control variables via matching

• Groupwise matching: select control comparison to be groupwise similar to intervention group, e.g., schools with similar demographics, geography, etc. Generally a good idea.

• Individual matching: select individuals from the potential control pool that match intervention individuals on one or more observed characteristics. May not be a good idea.

Potential problems with individual level matching

• Basic problem with nonequivalent designs– need to match on all relevant variables to obtain a good estimate of the intervention effect.

• If match on too few variables, may omit some that are important to control.

• If try to match on too many variables, the sample will be restricted to the cases that can be matched; may be overly narrow.

• If must select disproportionately from one tail of the treatment distribution and the other tail of the control distribution, may have regression to the mean artifact.

Regression to the mean: Matching on the pretest

[Figure: overlapping pretest distributions for T and C; matches can be found only in the area where the two distributions overlap.]

Propensity scores as control variables

• The propensity score is the probability of being in the intervention group instead of the comparison group.

• It is estimated (“predicted”) from data on the characteristics of the individuals already in each group, typically using logistic regression.

• It thus combines all the control variables into a single variable optimized to differentiate the intervention sample from the control sample.

One option: Use the propensity score to create matched groups

[Figure: treatment group and control group cases sorted into propensity score quintiles, with matches formed within quintiles.]
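The two steps described above, estimating the propensity score by logistic regression and then grouping cases into quintiles, can be sketched as follows. The data are simulated, the covariates (`pretest`, `free_lunch`) and their coefficients are hypothetical, and the plain gradient-ascent fit stands in for whatever logistic regression routine a statistical package would provide.

```python
import math
import random
from statistics import mean, quantiles

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, t, lr=0.5, n_iter=500):
    """Fit P(T=1 | x) = sigmoid(w0 + w.x) by gradient ascent on the log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x, ti in zip(X, t):
            resid = ti - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
            grad[0] += resid
            for j, xj in enumerate(x, start=1):
                grad[j] += resid * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

# Simulated cases: selection into T depends on the two control variables.
rng = random.Random(11)
X, t = [], []
for _ in range(1000):
    pretest = rng.gauss(0, 1)                         # standardized pretest
    free_lunch = 1.0 if rng.random() < 0.4 else 0.0
    p_true = sigmoid(-0.3 + 0.9 * pretest - 0.7 * free_lunch)
    X.append([pretest, free_lunch])
    t.append(1 if rng.random() < p_true else 0)

w = fit_logistic(X, t)
scores = [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))) for x in X]

# Stratify on the estimated propensity score: four cut points give five strata.
cuts = quantiles(scores, n=5)
strata = [sum(s > c for c in cuts) for s in scores]
table = [[0, 0] for _ in range(5)]                    # [control count, treatment count]
for q, ti in zip(strata, t):
    table[q][ti] += 1
for q, (n_c, n_t) in enumerate(table, start=1):
    print(f"quintile {q}: control = {n_c}, treatment = {n_t}")
```

Treatment and control cases within the same quintile have similar estimated propensities, so comparisons are made within quintiles. Quintiles with very few cases from one group signal limited overlap between the samples.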

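Another option: Use the propensity score as a covariate in ANCOVA or MR

[Figure: posttest (Y) plotted against propensity score (P) for T and C, with Δ the vertical distance between the T and C regression lines. Regression: $Y_i = B_0 + B_P P_i + B_T T_i + e_i$.]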

Circumstances appropriate for the nonequivalent comparison design

• A stronger design is truly not feasible.

• A sample of relatively comparable units not receiving the intervention is available.

• A full account can be given of the differences between the groups potentially related to the outcomes of interest.

• Data on those differences can be obtained and used for statistical control.

Example: Effects of a professional development program for teachers

• In the Montgomery County Public Schools, MD, some 3rd grade teachers had received the Studying Skillful Teaching training, some had not.

• The reading and math achievement test scores for students of teachers with and without training were compared.

• Analysis of covariance was used to test for differences in student outcomes with a propensity score control variable and covariates representing teacher credentials, student pretest, reduced/free lunch status, ethnicity, and special ed or ELL service.

4. Comparative interrupted time series

[Figure: mean achievement by school year for 9th grade program schools and 9th grade other schools, with program onset marked.]

Requirements for a good intervention effect estimate from comparative interrupted time series

• The fundamental problem: changes stemming from other sources.

• Sufficient pre-intervention time series data showing relative stability.

• No other potentially influential event coincides with the program onset, or with each of the staggered onsets if the program began at different times in different units.

• Comparison time series for very similar units in same environment but without the program.

• An analysis model that properly estimates changes and differences with autocorrelated data.
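The core comparison can be sketched as deviations from each group's projected pre-program trend, with the comparison group's deviation subtracted from the program group's. This is a deliberately simplified illustration: the yearly means are invented, years are coded relative to program onset, and the sketch ignores autocorrelation and produces no standard errors, so it does not satisfy the last requirement above.

```python
def linear_trend(years, values):
    """Least-squares intercept and slope for a single time series."""
    n = len(years)
    mean_year = sum(years) / n
    mean_value = sum(values) / n
    slope = (sum((yr - mean_year) * (v - mean_value) for yr, v in zip(years, values))
             / sum((yr - mean_year) ** 2 for yr in years))
    return mean_value - slope * mean_year, slope

def mean_deviation_from_trend(pre_years, pre_values, post_years, post_values):
    """Average gap between observed post-onset values and the projected pre-onset trend."""
    intercept, slope = linear_trend(pre_years, pre_values)
    deviations = [v - (intercept + slope * yr) for yr, v in zip(post_years, post_values)]
    return sum(deviations) / len(deviations)

# Hypothetical yearly mean achievement; year 0 is the first program year.
pre_years, post_years = [-3, -2, -1], [0, 1, 2, 3, 4]
program_pre, program_post = [50.0, 51.0, 52.0], [56.0, 58.0, 59.0, 60.0, 61.0]
comparison_pre, comparison_post = [52.0, 53.0, 54.0], [55.5, 56.0, 57.5, 58.0, 59.0]

program_dev = mean_deviation_from_trend(pre_years, program_pre, post_years, program_post)
comparison_dev = mean_deviation_from_trend(pre_years, comparison_pre, post_years, comparison_post)
estimate = program_dev - comparison_dev
print(f"program schools deviation:    {program_dev:.2f}")
print(f"comparison schools deviation: {comparison_dev:.2f}")
print(f"comparative ITS estimate:     {estimate:.2f}")
```

Subtracting the comparison group's deviation removes changes that affected all schools at the time of program onset, which is the reason for adding a comparison series to a simple interrupted time series.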

Circumstances appropriate for comparative interrupted time series

• A stronger design is truly not feasible.

• Time series data on a relevant outcome for those exposed to the program are available for periods before and after the onset of the program.

• Sufficient data points are available, with no change in the nature of the measure, to establish stable statistical trends.

• Data on the same measure over the same time period are available for comparable cases without the program.

Example: The ninth grade Success Academy in Philadelphia

• The Success Academy grouped 9th graders together in small learning communities with a specialized curriculum and a small group of dedicated teachers.

• Implemented by 7 of the 22 nonselective high schools in 2003-04.

• The outcomes were attendance, academic credit earned, promotion to 10th grade, achievement test scores, and graduation rates.

• Outcomes are compared for 9th graders during the 3 years prior and 5 years after program onset and for the program schools vs. a matched group of schools without the program.

Other Important Aspects of the Research Plan

Statistical power

• Statistical power = probability of statistical significance when there is an effect.

• Power is mainly a function of:
– alpha level for significance testing
– minimum effect size to detect in standard deviation units
– the sample size: number of students, classrooms, schools, etc.
– the covariates included in the analysis
– the research design and corresponding analysis model.

Power: Critical considerations

• A realistic identification of the minimal effect size with practical significance that the research should be powered to detect.

• The unit that is assigned to conditions (students, classrooms, schools, etc.).

• The intracluster correlations (ICC) expected for student outcomes when students are nested within the units assigned.

• The expected correlations with outcomes of any covariates measured on the units assigned to conditions.

• The number of schools, classrooms, students, etc. available for the study.

• Specific design issues such as the need for 3-4 times as many units for regression-discontinuity as for a comparable randomized experiment.
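How these considerations combine can be sketched for the common case in which whole clusters (e.g., schools) are randomly assigned. The function below uses a normal approximation for a balanced two-arm design; it is a rough planning sketch, not a substitute for dedicated power software, which uses more exact distributions. The function names, the parameter `r2_between` (share of between-cluster variance explained by a cluster-level covariate), and the example values are assumptions.

```python
import math
from statistics import NormalDist

def design_effect(cluster_size, icc):
    """Factor by which clustering inflates the variance of a mean."""
    return 1 + (cluster_size - 1) * icc

def cluster_power(effect_size, n_clusters, cluster_size, icc,
                  alpha=0.05, r2_between=0.0):
    """Approximate power for a balanced two-arm cluster-randomized design.

    effect_size is in standard deviation units; n_clusters is the total
    number of clusters across both arms. Normal approximation only.
    """
    variance = 4 * (icc * (1 - r2_between) + (1 - icc) / cluster_size) / n_clusters
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size / math.sqrt(variance) - z_crit)

# Hypothetical plan: 40 schools, 25 students each, ICC = .15, effect size = .25.
base = cluster_power(0.25, n_clusters=40, cluster_size=25, icc=0.15)
with_cov = cluster_power(0.25, n_clusters=40, cluster_size=25, icc=0.15, r2_between=0.5)
more_schools = cluster_power(0.25, n_clusters=80, cluster_size=25, icc=0.15)
print(f"design effect:             {design_effect(25, 0.15):.1f}")
print(f"power, no covariate:       {base:.2f}")
print(f"power, school covariate:   {with_cov:.2f}")
print(f"power, twice the schools:  {more_schools:.2f}")
```

The example shows why the number of assigned units and the ICC dominate the calculation: adding students within schools shrinks only the second term of the variance, while adding schools or a strong school-level covariate shrinks the first.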

Computer program for power estimation in multilevel designs

• Raudenbush, S. W., Spybrook, J., Liu, X., & Congdon, R. (2006). Optimal design for longitudinal and multilevel research: Documentation for the “Optimal Design” software. Optimal Design Version 1.76

http://sitemaker.umich.edu/group-based/optimal_design_software

Multilevel Data Analysis

• Applicable when sampling and assignment to conditions occur at the level of one unit (e.g., classrooms, schools) and outcomes are measured on units nested within it (e.g., students).

• Requires specialized computer programs, e.g., HLM, MLWin, SAS Proc Mixed, SPSS Mixed Models.
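The nesting that motivates multilevel analysis is usually summarized by the intracluster correlation. The sketch below estimates it with the one-way ANOVA estimator for equal-sized clusters; it is meant only to show what the ICC measures, not to replace the multilevel programs listed above. The simulated data (100 schools of 20 students, true ICC of .20) and the function name `anova_icc` are hypothetical.

```python
import random
from statistics import mean

def anova_icc(clusters):
    """One-way ANOVA estimate of the ICC for equal-sized clusters."""
    j = len(clusters)          # number of clusters
    n = len(clusters[0])       # units per cluster
    grand = mean(v for c in clusters for v in c)
    cluster_means = [mean(c) for c in clusters]
    ms_between = n * sum((m - grand) ** 2 for m in cluster_means) / (j - 1)
    ms_within = (sum((v - m) ** 2 for c, m in zip(clusters, cluster_means) for v in c)
                 / (j * (n - 1)))
    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)

# Simulated student scores: 20% of the variance lies between schools.
rng = random.Random(3)
schools = []
for _ in range(100):
    school_effect = rng.gauss(0, math_sd := 0.2 ** 0.5)
    schools.append([school_effect + rng.gauss(0, 0.8 ** 0.5) for _ in range(20)])

icc = anova_icc(schools)
print(f"estimated ICC: {icc:.2f}")
```

An ICC well above zero means students in the same school resemble one another, so an analysis that treats them as independent observations would overstate the precision of the effect estimate.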

References and readings

Experimental and quasi-experimental design

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.

Bickman, L., & Rog, D. J. (eds)(2008). The SAGE Handbook of Applied Social Research Methods (Second Edition). Sage Publications.

Regression-discontinuity

Hahn, J., Todd, P. and Van der Klaauw, W. (2002). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1), 201-209.

Cappelleri J.C. and Trochim W. (2000). Cutoff designs. In Chow, Shein-Chung (Ed.) Encyclopedia of Biopharmaceutical Statistics, 149-156. NY: Marcel Dekker.

Cappelleri, J., Darlington, R.B. and Trochim, W. (1994). Power analysis of cutoff-based randomized clinical trials. Evaluation Review, 18, 141-152.

Nonequivalent comparison designs

Rosenbaum, P.R., & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.

Luellen, J. K., Shadish, W.R., & Clark, M.H. (2005). Propensity scores: An introduction and experimental test. Evaluation Review 29(6): 530-558.

Schochet, P.Z., & Burghardt, J. (2007). Using propensity scoring to estimate program-related subgroup impacts in experimental program evaluations. Evaluation Review 31(2): 95-120.

Time series

Bloom, H. S. (2003). Using “short” interrupted time-series analysis to measure the impact of whole school reforms. Evaluation Review, 27(1), 3-49.

Chatfield, C. (2003). The analysis of time series: An introduction (Sixth Ed.) Chapman & Hall/CRC.

Multilevel analysis

Hox, J. (2002). Multilevel analysis: Techniques and applications. Lawrence Erlbaum.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Second ed.). Sage publications.

Examples used in this presentation

Morgan, P., & Ritter, S. (2002). An experimental study of the effects of Cognitive Tutor® Algebra I on student knowledge and attitude. Carnegie Learning, Inc.

Dynarski, M., Gleason, P., Rangarajan, A., & Wood, R. (1998). Impacts of Dropout Prevention Programs. Final Report. Mathematica Policy Research, Princeton, NJ.

Torgesen, J., Myers, D., Schirm, A., et al. (2006). National Assessment of Title I Interim Report to Congress: Volume II: Closing the Reading Gap, First Year Findings from a Randomized Trial of Four Reading Interventions for Striving Readers. Washington, DC: U.S. Department of Education, Institute of Education Sciences.

Gormley, W. T., Gayer, T., Phillips, D., & Dawson, B. (2005). The effects of universal pre-k on cognitive development. Developmental Psychology, 41(6), 872-884.

Modarresi, S., & Wolanin, N. (2007). The effects of Studying Skillful Teaching training program on students’ reading and mathematics achievement. Evaluation Brief, February. Montgomery County Public Schools, MD.

Kemple, J. J., Herlihy, C. M., & Smith, T. J. (2005). Making progress toward graduation: Evidence from the Talent Development High School model. New York: MDRC. [Includes 9th grade Success Academy].

Raudenbush, S. W., Spybrook, J., Liu, X., & Congdon, R. (2006). Optimal design for longitudinal and multilevel research: Documentation for the “Optimal Design” software. Optimal Design Version 1.76
