IES Workshop on Evaluating State and District Level Interventions
Mark W. Lipsey, Director, Center for Evaluation Research and Methodology, Vanderbilt University
David Holdzkom, Assistant Superintendent for Evaluation and Research, Wake County Public School System, North Carolina
April 24, 2008, Washington, DC
Purpose
To help schools, districts, and states design and implement rigorous evaluations of the effects of promising practices, programs, and policies on educational outcomes.
• Many interventions are not effective; users and interested others need to know.
• The interventions most relevant to improving outcomes are those that schools and districts believe are promising and feasible.
• IES has funding to support research initiated by schools, districts, and states.
What kinds of interventions might be evaluated?
• Practices, e.g., one-on-one tutoring, educational software, acceleration of high ability students, cooperative learning.
• Programs, e.g., Reading Recovery, Ladders to Literacy, Cognitive Tutor algebra, Saxon Math, Caring School Community (character education).
• Policies, e.g., reduced class size, pre-K, alternative high schools, all year calendar.
Key Issues in Designing Impact Evaluations for Education Interventions
Logic Models, Variables, and Evaluation Questions
Logic model: 1. Specifying the problem the intervention addresses
Nature of the need:
• What and for whom (e.g., kindergarten students who aren’t ready for school).
• Why (e.g., poor pre-literacy skills, inappropriate school behavior).
• Rationale/evidence supporting the intervention target (e.g., at entry K students need to be ready to learn or they will begin to fall behind; research shows school readiness can be enhanced for at-risk 4 year olds).
Logic model: 2. Specifying the planned intervention
What the intervention does that addresses the need:
• Content: What the students should know or be able to do; why this meets the need.
• Pedagogy: Instructional techniques and methods to be used; why appropriate.
• Delivery system: How the intervention will arrange to deliver the instruction.
• The key factors or core ingredients most essential and distinctive to the intervention.
• Setting, context
• Personal and family characteristics
• Prior experience
Mapping variables onto the intervention theory: Intervention characteristics
[Logic model diagram. Boxes: 4 year old at-risk children; Pre-K with literacy curriculum; Positive attitudes to school; Improved pre-literacy skills; Learn appropriate school behavior; Increased school readiness; Greater learning gains in K]
Independent variable:
• T vs. C comparison conditions
Generic fidelity:
• T and C exposure to the generic aspects of the intervention (type, amount, quality)
Specific fidelity:
• T and C (?) exposure to distinctive aspects of the intervention (type, amount, quality)
Potential moderators:
• Characteristics of personnel
• Intervention setting, context, e.g., class size
Mapping variables onto the intervention theory: Intervention outcomes
[Logic model diagram. Boxes: 4 year old at-risk children; Exposed to intervention (Pre-K with literacy curriculum); Positive attitudes to school; Improved pre-literacy skills; Learn appropriate school behavior; Increased school readiness; Greater learning gains in K]
Focal dependent variables:
• Pretests (pre-intervention)
• Posttests (at end of intervention)
• Follow-ups (lagged after end of intervention)
Other dependent variables:
• Side effects: possible unplanned positive or negative outcomes
• Mediators: DVs on causal pathways from intervention to other DVs
Research questions: Relationships of (possible) interest
• Intervention effects: Causal relationship between intervention and outcomes.
• Duration of effects post-intervention.
• Moderator relationships: Differential intervention effects for different subgroups.
• Mediator relationships: Stepwise causal relationship with effects on a proximal outcome causing effects on a distal outcome.
Research Designs for Estimating Intervention Effects
What is an intervention effect and why is it so difficult to determine?
Research designs to discuss
• Two strong ones
1. Randomized experiment
2. Regression-discontinuity
• Two weak ones
3. Nonrandomized comparison groups with statistical controls
4. Comparative interrupted time series
1. Randomized experiment
[Diagram: A research sample of students, teachers, classrooms, schools, etc. is blocked on the pretest (high, medium-high, medium-low, low), then randomly assigned to conditions: receive the experimental intervention or do not receive the experimental intervention. The intervention effect is the difference between the posttest outcomes of the two randomly assigned groups.]
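The pretest blocking and random assignment steps in this design can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `blocked_assignment`, the four equal-sized blocks, the even T/C split within each block, and the made-up pretest scores are all hypothetical and are not taken from the workshop materials.

```python
import random

def blocked_assignment(pretest, n_blocks=4, seed=0):
    """Block units on pretest, then randomize to T or C within each block.

    pretest: dict mapping a unit id to its pretest score.
    Returns a dict mapping each unit id to "T" or "C".
    """
    rng = random.Random(seed)
    ranked = sorted(pretest, key=pretest.get)   # unit ids, lowest pretest first
    block_size = -(-len(ranked) // n_blocks)    # ceiling division
    assignment = {}
    for start in range(0, len(ranked), block_size):
        block = ranked[start:start + block_size]
        rng.shuffle(block)                      # random order within the block
        half = len(block) // 2
        for unit in block[:half]:
            assignment[unit] = "T"
        for unit in block[half:]:
            assignment[unit] = "C"
    return assignment

# Hypothetical sample: 40 students with made-up pretest scores.
score_rng = random.Random(42)
scores = {f"student_{i:02d}": score_rng.gauss(50.0, 10.0) for i in range(40)}
assignment = blocked_assignment(scores)
n_treated = sum(1 for cond in assignment.values() if cond == "T")
print(n_treated, "of", len(assignment), "students assigned to T")
```

Because assignment is randomized separately within each pretest block, the T and C groups are balanced on the pretest by construction, which is the purpose of the blocking step.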
Circumstances conducive to randomized experiments
• More demand than supply for program– allocate scarce resource by lottery.
• New program that can be phased in– wait-list control group has a delayed start.
• Pull-out or add-on program for selected students– randomly select from among those eligible.
• Volunteers willing to opt in for a chance to receive the program.
Example: Junior high algebra curriculum
• In 2000-01 the Moore Oklahoma Independent School District conducted a study of the effectiveness of the Cognitive Tutor Algebra I program on students in their junior high school system.
• Students in 5 junior high schools were randomly assigned to either the Cognitive Tutor Algebra I course or the ‘practice as usual’ algebra courses. Cognitive Tutor teachers received the curriculum materials and 4 days of training.
• Outcome measures included the ETS Algebra I end-of-course exam, course grades, and a survey of student attitudes towards mathematics.
Example: Alternative high school for students at risk of dropping out
• Horizon High School in Las Vegas identified 9th and 10th grade students behind grade level and at risk of dropping out.
• A random sample of these students was assigned to attend an alternative high school that featured a focus on cooperative learning, small group instruction, and support services.
• Outcomes were compared for the alternative and regular high schools on dropout rates, self-esteem, drug use, and arrest rates.
Example: Remedial reading programs for elementary students
• The Allegheny Intermediate Unit (AIU), which serves 42 suburban school districts in Allegheny County, Pennsylvania, randomly assigned 50 schools to one of four commercially available remedial reading interventions.
• Within each school struggling readers in grades 3 and 5 were identified and randomly assigned to instruction as usual or the remedial reading program designated for that school.
• In each program, 3 students met with a trained teacher one hour/day for 20 weeks.
• Measures of reading skill were administered at the beginning and end of the school year for program and control students.
2. Regression-discontinuity (aka the cutting-point design)
• When well-executed, its ability to provide an unbiased estimate of the intervention effect is strong– comparable to a randomized experiment.
• It is adaptable to many circumstances where it may be difficult to apply a randomized design.
Consider first a posttest on pretest regression for a randomized experiment with no effect
Pretest-posttest randomized experiment, now with an intervention effect
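[Figure: Posttest (Y) plotted against Pretest (S), with separate regression lines for T and C. T and C have the same mean S; Δ is the difference between T mean Y and C mean Y.]
$Y_i = B_0 + B_S S_i + B_T T_i + e_i$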
Consider now the same regression with no effect but with a cutting point applied
[Figure: Posttest (Y) plotted against the Selection Variable (S), with a cutting point separating T from C; T mean Y and C mean Y are marked.]
$Y_i = B_0 + B_S S_i + B_T T_i + e_i$
Regression discontinuity scatterplot (null case)
[Scatterplot: Posttest (Y) against the Selection Variable (S), with T and C on either side of the cutting point.]
$Y_i = B_0 + B_S S_i + B_T T_i + e_i$
Now add an intervention effect
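[Figure: Posttest (Y) plotted against the Selection Variable (S), with a cutting point separating T from C; Δ marks the intervention effect.]
$Y_i = B_0 + B_S S_i + B_T T_i + e_i$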
Regression discontinuity scatterplot with effect
[Scatterplot: Posttest (Y) against the Selection Variable (S), with T and C on either side of the cutting point.]
$Y_i = B_0 + B_S S_i + B_T T_i + e_i$
The effect estimated by R-D is the same as that from the randomized experiment
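[Figure: Posttest (Y) plotted against the Selection Variable (S), with T and C regression lines, the cutting point, and the effect Δ.]
$Y_i = B_0 + B_S S_i + B_T T_i + e_i$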
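The regression on these slides, Y_i = B_0 + B_S S_i + B_T T_i + e_i, can be tried out on simulated data to see that the coefficient on T recovers the discontinuity at the cutting point. The following is a minimal sketch, not the presenters' analysis: the data are simulated, the rule that scores below the cutting point receive the intervention is an assumption, and `ols` and `simulate_rd` are hypothetical helper names written with the standard library only.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def simulate_rd(n=2000, cut=0.0, effect=5.0, seed=1):
    """Simulated R-D data: assignment depends only on S and the cutting point."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        s = rng.gauss(0.0, 1.0)                 # selection variable
        t = 1.0 if s < cut else 0.0             # assumption: low scorers get T
        rows.append([1.0, s - cut, t])          # intercept, centered S, T
        y.append(50.0 + 4.0 * s + effect * t + rng.gauss(0.0, 2.0))
    return rows, y

rows, y = simulate_rd()
b_0, b_s, b_t = ols(rows, y)
print(f"estimated B_S = {b_s:.2f}, estimated B_T = {b_t:.2f}")
```

In this simulation the true slope is 4 and the true effect is 5, so the estimates should land near those values. The sketch fits only the linear form; it does not address the functional form issues discussed later.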
The selection variable for R-D
• A continuous quantitative variable measured on every candidate for assignment to T or C who will participate in the study.
• Assignment to T or C strictly on the basis of the score obtained and the predetermined cutting point.
• Does not have to correlate highly with the outcome variable.
• Can be tailored to represent an appropriate basis for the assignment decision in the setting.
Special issues with the R-D design
• Correctly fitting the functional form
– possibility that it is not linear
– curvilinear functions
– interaction with the cutting point.
• Statistical power
– requires about 3 times the sample size of a comparable randomized experiment
– covariates correlated with the outcome but not the selection variable are helpful.
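One way to see where a sample size multiplier of roughly 3 can come from: in the linear model Y = B_0 + B_S S + B_T T + e, the sampling variance of the estimate of B_T is inflated by 1 / (1 − r²), where r is the correlation between T and S. In a randomized experiment r is about 0; in R-D, T is determined by S, so r is large. The sketch below is an illustration under assumptions that are not stated in the slides: a correctly specified linear model and a cutting point at the center of the S distribution. The function name is hypothetical.

```python
import math

def rd_sample_size_multiplier(r2_ts):
    """Variance inflation of the T coefficient when T correlates with S.

    r2_ts: squared correlation between the treatment indicator T and
    the selection variable S (0 in a randomized experiment).
    """
    if not 0.0 <= r2_ts < 1.0:
        raise ValueError("r2_ts must be in [0, 1)")
    return 1.0 / (1.0 - r2_ts)

# S normally distributed, cut at its mean: r^2 = 2/pi.
normal_multiplier = rd_sample_size_multiplier(2.0 / math.pi)
# S uniformly distributed, cut at its median: r^2 = 3/4.
uniform_multiplier = rd_sample_size_multiplier(0.75)
print(f"normal S: {normal_multiplier:.2f}, uniform S: {uniform_multiplier:.2f}")
```

Under these assumptions the multiplier is about 2.75 for a normally distributed selection variable and 4 for a uniform one. Cutting points away from the center, or more flexible functional forms, change the figure.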
Circumstances conducive to the regression-discontinuity design
• The situation involves a selection from some larger group of who will, or should, receive the intervention and who will not.
• The basis for selection is or can be made explicit and systematic enough to be captured in a quantitative rating or ranking.
• The allocation of the intervention can be made strictly on the basis of the selection score and cutting point in a high proportion of cases. Exceptions can be identified in advance and exempted from the scheme.
Example: Effects of universal pre-k in Tulsa, Oklahoma
• Eligibility for pre-k determined strictly on the basis of age– cutoff by birthday.
• Overall sample of 1,567 children just beginning pre-k plus 1,461 children just beginning kindergarten who had been in pre-k the previous year.
• WJ Letter-Word, Spelling, and Applied Problems as outcome variables.
Entry into pre-k selected by birthday
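[Figure: WJ test score plotted against age, with the cutoff at a September 1 birthday. T: born before September 1, completed pre-K, tested at beginning of K. C: born after September 1, no pre-K yet, tested at beginning of pre-K year. The size of the discontinuity at the cutoff is marked "?".]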
Samples and testing
[Diagram: First cohort: pre-k in Year 1, kindergarten in Year 2. Second cohort: pre-k in Year 2. Administer WJ tests.]
Excerpts from Regression Analysis

Variable                    Letter-Word   Spelling   Applied Probs
                            B coeff       B coeff    B coeff
Treatment (T)                3.00*         1.86*      1.94*
Age: Days ± from Sept 1       .01           .01*       .02*
Days²                         .00           .00        .00
Days × T                      .00          -.01       -.01
Days² × T                     .00           .00        .00
Free lunch                  -1.28*         -.89*     -1.38*
Black                         .04          -.44*     -2.34*
Hispanic                    -1.70*         -.48*     -3.66*
Female                        .92*         1.05*       .76*
Mother’s educ: HS             .59*          .57*      1.25*
* p<.05
3. Nonrandomized comparison groups with statistical controls
• Statistical controls: Analysis of covariance and multiple regression
• Matching on the control variables
• Propensity scores derived from the control variables.
Nonequivalent comparison analog to the randomized experiment
[Diagram: From a population of students, teachers, classrooms, schools, etc., units are selected through some nonrandom, more-or-less natural process; some receive the experimental intervention and others do not. The difference between the outcomes of the two groups is the intervention effect (??).]
Issues for obtaining good intervention effect estimates from nonrandomized comparison groups
• The fundamental problem: selection bias
• Knowing/measuring the variables necessary and sufficient to statistically control for the selection bias
– characteristics related to the outcome on which the groups differ
• Using an analysis model that properly adjusts for the selection bias, given appropriate control variables
Nonequivalent comparison groups: Pretest/covariate and posttest means
Covariate-adjusted treatment effect estimate with a relevant covariate left out
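[Figure: Posttest (Y) plotted against Pretest/Covariate(s) (X), with regression lines for T and C; Δ is the covariate-adjusted difference between them.]
$Y_i = B_0 + B_{X_1} X_{1i} + B_{X_2} X_{2i} + B_T T_i + e_i$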
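The consequence of leaving a relevant covariate out of the model Y_i = B_0 + B_X1 X_1i + B_X2 X_2i + B_T T_i + e_i can be illustrated with simulated data. This is a sketch under assumptions: the data-generating values (true effect of 2, selection into T driven by both X_1 and X_2) are invented for the illustration, and `ols` and `simulate` are hypothetical helpers.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):                          # Gauss-Jordan with pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def simulate(n=2000, effect=2.0, seed=3):
    """Nonrandom selection: units high on X1 and X2 tend to end up in T."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        t = 1.0 if x1 + x2 + rng.gauss(0.0, 1.0) > 0 else 0.0
        y = 10.0 + 2.0 * x1 + 3.0 * x2 + effect * t + rng.gauss(0.0, 1.0)
        data.append((x1, x2, t, y))
    return data

data = simulate()
y = [d[3] for d in data]
# Model with both covariates: coefficient on T is the last one.
b_t_full = ols([[1.0, x1, x2, t] for x1, x2, t, _ in data], y)[3]
# Model with X2 left out: T absorbs part of the X2 difference.
b_t_omitted = ols([[1.0, x1, t] for x1, _, t, _ in data], y)[2]
print(f"with X2: {b_t_full:.2f}; X2 omitted: {b_t_omitted:.2f}; true effect 2.00")
```

With both covariates in the model the estimate should sit near the true effect; with X_2 omitted it is pushed upward, because the groups differ on X_2 and X_2 is related to the outcome.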
Using control variables via matching
• Groupwise matching: select control comparison to be groupwise similar to intervention group, e.g., schools with similar demographics, geography, etc. Generally a good idea.
• Individual matching: select individuals from the potential control pool that match intervention individuals on one or more observed characteristics. May not be a good idea.
Potential problems with individual level matching
• Basic problem with nonequivalent designs– need to match on all relevant variables to obtain a good estimate of the intervention effect.
• If match on too few variables, may omit some that are important to control.
• If try to match on too many variables, the sample will be restricted to the cases that can be matched; may be overly narrow.
• If must select disproportionately from one tail of the treatment distribution and the other tail of the control distribution, may have regression to the mean artifact.
Regression to the mean: Matching on the pretest
[Figure: Overlapping pretest distributions for T and C; matches can be found only in the area where the two distributions overlap.]
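The regression to the mean artifact can be demonstrated with a small simulation in which the intervention has no effect at all. The sketch assumes two populations with different true means (40 for T, 60 for C), equally noisy pretests and posttests, and greedy one-to-one matching within a caliper of 1 point; all of these choices, and the function names, are hypothetical.

```python
import random
from statistics import mean

def simulate_group(n, true_mean, rng):
    """Return (pretest, posttest) pairs; both are noisy measures of ability."""
    people = []
    for _ in range(n):
        ability = rng.gauss(true_mean, 10.0)
        pre = ability + rng.gauss(0.0, 10.0)
        post = ability + rng.gauss(0.0, 10.0)   # no intervention effect added
        people.append((pre, post))
    return people

def match_on_pretest(treated, controls, caliper=1.0):
    """Greedy 1:1 matching without replacement on the observed pretest."""
    pool = list(controls)
    pairs = []
    for t in treated:
        if not pool:
            break
        best = min(pool, key=lambda c: abs(c[0] - t[0]))
        if abs(best[0] - t[0]) <= caliper:
            pairs.append((t, best))
            pool.remove(best)
    return pairs

rng = random.Random(7)
treated = simulate_group(500, 40.0, rng)    # lower-scoring population
controls = simulate_group(500, 60.0, rng)   # higher-scoring population
pairs = match_on_pretest(treated, controls)
pre_gap = mean(t[0] - c[0] for t, c in pairs)
post_gap = mean(t[1] - c[1] for t, c in pairs)
print(f"{len(pairs)} pairs; pretest gap {pre_gap:.2f}; posttest gap {post_gap:.2f}")
```

The matched groups look equivalent on the pretest, yet the posttest gap is clearly negative: matched T cases come from the upper tail of their distribution and matched C cases from the lower tail of theirs, and each regresses back toward its own population mean.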
Propensity scores as control variables
• The propensity score is the probability of being in the intervention group instead of the comparison group.
• It is estimated (“predicted”) from data on the characteristics of the individuals already in each group, typically using logistic regression.
• It thus combines all the control variables into a single variable optimized to differentiate the intervention sample from the control sample.
One option: Use the propensity score to create matched groups
[Figure: Treatment Group and Control Group cases arrayed by Propensity Score Quintiles, with the matches between them marked.]
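A propensity score analysis of this kind can be sketched end to end: estimate the score with logistic regression, divide the sample into propensity score quintiles, and average the within-quintile T vs. C differences. The code below is an illustration on simulated data, not the procedure of any study cited here. The gradient-ascent fitting routine, the two covariates, and the true effect of 1.0 are assumptions, and all function names are hypothetical.

```python
import math
import random
from statistics import mean

def fit_logistic(X, t, lr=1.0, steps=200):
    """Logistic regression by plain gradient ascent on the mean log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, ti in zip(X, t):
            p = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, row))))
            for j, xj in enumerate(row):
                grad[j] += (ti - p) * xj
        w = [wj + lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def propensity(w, row):
    """Predicted probability of being in the intervention group."""
    return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w, row))))

def stratified_effect(scores, t, y, n_strata=5):
    """Size-weighted average of within-stratum T minus C mean differences."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    size = -(-len(order) // n_strata)           # ceiling division
    total, used = 0.0, 0
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        y_t = [y[i] for i in idx if t[i] == 1]
        y_c = [y[i] for i in idx if t[i] == 0]
        if y_t and y_c:                         # skip strata lacking T or C
            total += len(idx) * (mean(y_t) - mean(y_c))
            used += len(idx)
    return total / used

rng = random.Random(11)
true_effect = 1.0
X, t, y = [], [], []
for _ in range(1000):
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    p_true = 1.0 / (1.0 + math.exp(-(0.8 * x1 + 0.8 * x2)))
    ti = 1 if rng.random() < p_true else 0      # nonrandom selection into T
    X.append([1.0, x1, x2])
    t.append(ti)
    y.append(2.0 * x1 + 2.0 * x2 + true_effect * ti + rng.gauss(0.0, 1.0))

w = fit_logistic(X, t)
scores = [propensity(w, row) for row in X]
naive = (mean(y[i] for i in range(len(y)) if t[i] == 1)
         - mean(y[i] for i in range(len(y)) if t[i] == 0))
adjusted = stratified_effect(scores, t, y)
print(f"naive: {naive:.2f}; quintile-adjusted: {adjusted:.2f}; true: {true_effect:.2f}")
```

The quintile-adjusted estimate should fall much closer to the true effect than the naive difference, though some bias usually remains within strata. The adjustment works here only because every variable driving selection is measured, which is the condition the slides flag as hard to meet in practice.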
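Another option: Use the propensity score as a covariate in ANCOVA or MR
[Figure: Posttest (Y) plotted against the Propensity score (P), with regression lines for T and C; Δ is the adjusted difference between them.]
$Y_i = B_0 + B_P P_i + B_T T_i + e_i$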
Circumstances appropriate for the nonequivalent comparison design
• A stronger design is truly not feasible.
• A sample of relatively comparable units not receiving the intervention is available.
• A full account can be given of the differences between the groups potentially related to the outcomes of interest.
• Data on those differences can be obtained and used for statistical control.
Example: Effects of a professional development program for teachers
• In the Montgomery County Public Schools, MD, some 3rd grade teachers had received the Studying Skillful Teaching training, some had not.
• The reading and math achievement test scores for students of teachers with and without training were compared.
• Analysis of covariance was used to test for differences in student outcomes with a propensity score control variable and covariates representing teacher credentials, student pretest, reduced/free lunch status, ethnicity, and special ed or ELL service.
4. Comparative interrupted time series
[Figure: Mean achievement by school year for 9th grade program schools and 9th grade other schools, with program onset marked.]
Requirements for a good intervention effect estimate from comparative interrupted time series
• The fundamental problem: changes stemming from other sources.
• Sufficient pre-intervention time series data showing relative stability.
• No other potentially influential event coincides with the program onset or staggered onsets if available.
• Comparison time series for very similar units in same environment but without the program.
• An analysis model that properly estimates changes and differences with autocorrelated data.
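The core comparison in this design can be reduced to a small calculation: the change from the pre-onset period to the post-onset period in the program series, minus the same change in the comparison series. The sketch below is deliberately simplified. It uses made-up yearly means (3 years before and 5 years after onset), compares period means only, and does not model trends or autocorrelation, which a full analysis would need to do. The function name and the data are hypothetical.

```python
from statistics import mean

def cits_estimate(program, comparison, onset):
    """Pre-to-post change in the program series minus that in the comparison.

    program, comparison: lists of yearly means on the same outcome measure.
    onset: index of the first post-onset year.
    """
    def change(series):
        return mean(series[onset:]) - mean(series[:onset])
    return change(program) - change(comparison)

# Hypothetical yearly mean achievement: 3 years before onset, 5 years after.
program_schools = [60.0, 61.0, 59.0, 66.0, 67.0, 68.0, 67.0, 67.0]
comparison_schools = [62.0, 63.0, 61.0, 63.0, 64.0, 63.0, 64.0, 61.0]

estimate = cits_estimate(program_schools, comparison_schools, onset=3)
print(f"estimated program effect: {estimate:.1f} points")
```

Subtracting the comparison schools' change is what guards against changes stemming from other sources, provided those sources affected both sets of schools alike.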
Circumstances appropriate for comparative interrupted time series
• A stronger design is truly not feasible.
• Time series data on a relevant outcome for those exposed to the program are available for periods before and after the onset of the program.
• Sufficient data points are available, with no change in the nature of the measure, to establish stable statistical trends.
• Data on the same measure over the same time period are available for comparable cases without the program.
Example: The ninth grade Success Academy in Philadelphia
• The Success Academy grouped 9th graders together in small learning communities with a specialized curriculum and a small group of dedicated teachers.
• Implemented by 7 of the 22 nonselective high schools in 2003-04.
• The outcomes were attendance, academic credit earned, promotion to 10th grade, achievement test scores, and graduation rates.
• Outcomes are compared for 9th graders during the 3 years prior and 5 years after program onset and for the program schools vs. a matched group of schools without the program.
Other Important Aspects of the Research Plan
Statistical power
• Statistical power = probability of statistical significance when there is an effect.
• Power is mainly a function of:
– alpha level for significance testing
– minimum effect size to detect in standard deviation units
– the sample size: number of students, classrooms, schools, etc.
– the covariates included in the analysis
– the research design and corresponding analysis model.
Power: Critical considerations
• A realistic identification of the minimal effect size with practical significance that the research should be powered to detect.
• The unit that is assigned to conditions (students, classrooms, schools, etc.).
• The intracluster correlations (ICC) expected for student outcomes when students are nested within the units assigned.
• The expected correlations with outcomes of any covariates measured on the units assigned to conditions.
• The number of schools, classrooms, students, etc. available for the study.
• Specific design issues such as the need for 3-4 times as many units for regression-discontinuity as for a comparable randomized experiment.
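How these considerations combine can be illustrated with a widely used approximation for the minimum detectable effect size (MDES) when whole clusters, such as schools, are randomly assigned. The sketch below uses a normal approximation for the multiplier (it ignores degrees-of-freedom corrections) and is not the algorithm of any particular power program. The function name, parameter names, and example values are assumptions.

```python
import math
from statistics import NormalDist

def mdes_cluster_randomized(n_clusters, cluster_size, icc,
                            r2_cluster=0.0, r2_student=0.0,
                            prop_treated=0.5, alpha=0.05, power=0.80):
    """Approximate MDES (in SD units) for a two-level cluster-randomized design.

    icc: intracluster correlation of the outcome.
    r2_cluster, r2_student: proportion of cluster-level and student-level
    outcome variance explained by covariates.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)  # about 2.8
    pq = prop_treated * (1.0 - prop_treated)
    variance = (icc * (1.0 - r2_cluster) / (pq * n_clusters)
                + (1.0 - icc) * (1.0 - r2_student)
                / (pq * n_clusters * cluster_size))
    return multiplier * math.sqrt(variance)

# Hypothetical study: 40 schools, 60 students per school, ICC = 0.15.
mdes_no_covariate = mdes_cluster_randomized(40, 60, icc=0.15)
# Same study with a school-level pretest explaining 60% of school variance.
mdes_with_covariate = mdes_cluster_randomized(40, 60, icc=0.15, r2_cluster=0.6)
print(f"MDES without covariate: {mdes_no_covariate:.2f}")
print(f"MDES with school-level covariate: {mdes_with_covariate:.2f}")
```

In this example the number of schools and the ICC dominate the result, while adding students per school changes little. A covariate that is strongly correlated with the outcome at the level of the assigned unit lowers the MDES substantially.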
Computer program for power estimation in multilevel designs
• Raudenbush, S. W., Spybrook, J., Liu, X., & Congdon, R. (2006). Optimal design for longitudinal and multilevel research: Documentation for the “Optimal Design” software. Optimal Design Version 1.76
• Applicable when sampling and assignment to conditions occurs with one unit (e.g., classrooms, schools) and outcomes are measured on units nested within (e.g., students).
Experimental and quasi-experimental design
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin.
Bickman, L., & Rog, D. J. (eds)(2008). The SAGE Handbook of Applied Social Research Methods (Second Edition). Sage Publications.
Regression-discontinuity
Hahn, J., Todd, P., & Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1), 201-209.
Cappelleri J.C. and Trochim W. (2000). Cutoff designs. In Chow, Shein-Chung (Ed.) Encyclopedia of Biopharmaceutical Statistics, 149-156. NY: Marcel Dekker.
Cappelleri, J., Darlington, R.B. and Trochim, W. (1994). Power analysis of cutoff-based randomized clinical trials. Evaluation Review, 18, 141-152.
Nonequivalent comparison designs
Rosenbaum, P.R., & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1): 41-55.
Luellen, J. K., Shadish, W.R., & Clark, M.H. (2005). Propensity scores: An introduction and experimental test. Evaluation Review 29(6): 530-558.
Schochet, P.Z., & Burghardt, J. (2007). Using propensity scoring to estimate program-related subgroup impacts in experimental program evaluations. Evaluation Review 31(2): 95-120.
Time series
Bloom, H. S. (2003). Using “short” interrupted time-series analysis to measure the impact of whole school reforms. Evaluation Review, 27(1), 3-49.
Chatfield, C. (2003). The analysis of time series: An introduction (Sixth Ed.) Chapman & Hall/CRC.
Multilevel analysis
Hox, J. (2002). Multilevel analysis: Techniques and applications. Lawrence Erlbaum.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Second ed.). Sage publications.
Examples used in this presentation
Morgan, P., & Ritter, S. (2002). An experimental study of the effects of Cognitive Tutor® Algebra I on student knowledge and attitude. Carnegie Learning, Inc.
Dynarski, M., Gleason, P., Rangarajan, A., & Wood, R. (1998). Impacts of Dropout Prevention Programs. Final Report. Mathematica Policy Research, Princeton, NJ.
Torgesen, J., Myers, D., Schirm, A., et al. (2006). National Assessment of Title I Interim Report to Congress: Volume II: Closing the Reading Gap, First Year Findings from a Randomized Trial of Four Reading Interventions for Striving Readers. Washington, DC: U.S. Department of Education, Institute of Education Sciences.
W. T. Gormley, T. Gayer, D. Phillips, & B. Dawson (2005). The effects of universal pre-k on cognitive development. Developmental Psychology, 41(6), 872-884.
Modarresi, S., & Wolanin, N. (2007). The effects of Studying Skillful Teaching training program on students’ reading and mathematics achievement. Evaluation Brief, February. Montgomery County Public Schools, MD.
Kemple, J. J., Herlihy, C. M., & Smith, T. J. (2005). Making progress toward graduation: Evidence from the Talent Development High School model. New York: MDRC. [Includes 9th grade Success Academy].
Raudenbush, S. W., Spybrook, J., Liu, X., & Congdon, R. (2006). Optimal design for longitudinal and multilevel research: Documentation for the “Optimal Design” software. Optimal Design Version 1.76