SWPBS Forum October 2008
Claudia Vincent and Scott Spaulding
[email protected] [email protected]
University of Oregon
Provide information about desirable features of SWPBS evaluation tools
Provide an overview of the extent to which SWPBS evaluation tools meet these desirable features
[Figure: SWPBS evaluation cycle. A PBS self-assessment feeds an action plan; teams implement systems to support practices and implement the practices, leading to improved student outcomes. Evaluation data (fidelity measures and student outcome measures) are interpreted and used for decision-making.]
1. Drive implementation decisions
2. Provide evidence for SWPBS impact on student outcomes
A measure that drives implementation decisions should be:
◦ socially valid
◦ contextually appropriate
◦ sufficiently reliable (reliable enough to make defensible decisions)
◦ easy to use
A measure that builds the evidence base for SWPBS should:
◦ have known reliability
◦ have known validity
◦ clearly link implementation status to student outcomes
Measurement scores have two components:
◦ True score, e.g. a school’s true performance on “teaching behavioral expectations”
◦ Error, e.g. features of the measurement process itself
Our goal is to use tools that
1. maximize true score and minimize measurement error, and therefore
2. yield precise and interpretable data, and therefore
3. lead to sound implementation decisions and defensible evidence.
True score (relevant to construct)
Error (noise)
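In classical test theory terms (a standard formulation, not spelled out on the original slide), the decomposition is:

X = T + E,  Var(X) = Var(T) + Var(E),  reliability = Var(T) / Var(X)

where X is the observed score, T the true score, and E the error, assumed uncorrelated with T. A technically adequate tool keeps Var(E) small relative to Var(X).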
True score is maximized and error minimized if the evaluation tool is technically adequate, i.e.
◦ can be applied consistently (has good reliability)
◦ measures the construct of interest (has good validity)
Sound implementation decisions are made if the evaluation tool is practical, i.e. data
◦ are cost efficient to collect (low impact)
◦ are easy to aggregate across units of analysis (e.g. students, classrooms, schools, districts, states; see the sketch after this list)
◦ are consistently used to make meaningful decisions (have high utility)
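As a minimal sketch of aggregating scores across units of analysis (all column names and scores here are hypothetical), school-level data can be rolled up with grouped means:

import pandas as pd

# Hypothetical long-format evaluation data: one row per school
scores = pd.DataFrame({
    "state":     ["OR", "OR", "OR", "OR"],
    "district":  ["A", "A", "B", "B"],
    "school":    ["S1", "S2", "S3", "S4"],
    "set_total": [82, 74, 91, 68],
})

# Roll school-level scores up to higher units of analysis
print(scores.groupby("district")["set_total"].mean())  # district level
print(scores.groupby("state")["set_total"].mean())     # state level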
Consistency across
◦ Items/subscales/total scales (“internal consistency”)
◦ Data collectors (“inter-rater reliability” or “inter-observer agreement”)
◦ Time (“test-retest reliability”)
Definition:
◦ Extent to which the items on an instrument adequately and randomly sample a cohesive construct, e.g. “SWPBS implementation”
Assessment:
◦ If the instrument adequately and randomly samples one construct, and if it were divided into two equal parts, both parts should correlate strongly
Metric:
◦ coefficient alpha (the average split-half correlation based on all possible divisions of an instrument into two parts)
Interpretation:
◦ α ≥ .70 (adequate for measures under development)
◦ α ≥ .80 (adequate for basic research)
◦ α ≥ .90 (adequate for measures on which consequential decisions are based)
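As a minimal sketch of how coefficient alpha can be computed (the data here are hypothetical, not from any actual school sample):

import numpy as np

def cronbach_alpha(items):
    # items: (n_respondents x n_items) matrix of item scores
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 schools scored on 4 items (0 = not in place, 2 = in place)
scores = [[2, 2, 1, 2],
          [1, 1, 1, 0],
          [2, 2, 2, 2],
          [0, 1, 0, 1],
          [2, 1, 2, 2]]
print(f"alpha = {cronbach_alpha(scores):.2f}")  # compare against the .70/.80/.90 benchmarks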
Definition:
◦ Extent to which the instrument measures the same construct regardless of who collects the data
Assessment:
◦ If the same construct were observed by two data collectors, their ratings should be almost identical
Metric:
◦ Expressed as percentage of agreement between two data collectors
Interpretation:
◦ ≥ 90% good
◦ ≥ 80% acceptable
◦ < 80% problematic
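A minimal sketch of percentage agreement between two data collectors (the ratings below are hypothetical):

import numpy as np

def percent_agreement(rater_a, rater_b):
    # Item-by-item agreement between two observers, as a percentage
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return 100.0 * np.mean(a == b)

# Two hypothetical observers rating the same 10 items
a = [2, 1, 2, 0, 2, 1, 1, 2, 0, 2]
b = [2, 1, 2, 1, 2, 1, 1, 2, 0, 2]
print(f"IOA = {percent_agreement(a, b):.0f}%")  # 90% here: "good" by the benchmarks above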
Definition:
◦ Extent to which the instrument yields consistent results at two points in time
Assessment:
◦ The measure is administered at two points in time. The time interval is set so that no improvement is expected to occur between the first and second administration.
Metric:
◦ Expressed as a correlation between pairs of scores from the same schools obtained at the two administrations
Interpretation:
◦ r ≥ .6 acceptable
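A minimal sketch of estimating test-retest reliability with a Pearson correlation (scores are hypothetical):

from scipy.stats import pearsonr

# Hypothetical total scores from the same 6 schools at two administrations
time1 = [78, 85, 62, 90, 70, 81]
time2 = [75, 88, 60, 92, 73, 79]
r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f} (p = {p:.3f})")  # r >= .6 considered acceptable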
Interpretability of data!
Did these schools truly differ in the extent to which they taught behavioral expectations?
Or… did these schools obtain different scores because
◦ the tool’s items captured only some schools’ approach to teaching expectations? (tool lacked internal consistency)
◦ they had different data collectors? (tool lacked inter-rater agreement)
◦ some collected data in week 1 and some in week 2 of the same month? (tool lacked test-retest reliability)
Content validity
Criterion-related validity
◦ Concurrent validity
◦ Predictive validity
Construct validity
Definition:
◦ Extent to which the items on an instrument relate to the construct of interest, e.g. “student behavior”
Assessment:
◦ Expert judgment of whether items measure content theoretically or empirically linked to the construct
Metric:
◦ Expressed as percentage of expert agreement
Interpretation:
◦ ≥ 80% agreement desirable
Definition:
◦ Extent to which the instrument correlates with another instrument measuring a similar aspect of the construct of interest and administered concurrently or subsequently
Assessment:
◦ Concurrent validity: compare data from concurrently administered measures for agreement
◦ Predictive validity: compare data from subsequently administered measures for predictive accuracy
Metric:
◦ Expressed as a correlation between two measures
Interpretation:
◦ Moderate to high correlations are desirable
◦ Concurrent validity: very high correlations might indicate redundancy of measures
Definition:
◦ Extent to which the instrument measures what it is supposed to measure (e.g. the theorized construct “student behavior”)
Assessment:
◦ Factor analyses yielding information about the instrument’s dimensions (e.g. aspects of “student behavior”)
◦ Correlations between constructs hypothesized to impact each other (e.g. “student behavior” and “student reading achievement”)
Metric:
◦ Statistical model fit indices (e.g. chi-square)
Interpretation:
◦ Statistical significance
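As a minimal sketch of an exploratory factor analysis on instrument data (all data simulated; this illustrates one common approach, not the specific analyses behind the tools reviewed later):

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulated responses from 200 schools on a 6-item instrument:
# items 1-3 driven by one latent dimension, items 4-6 by another
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 200))
items = np.column_stack([f1 + rng.normal(0, .5, 200) for _ in range(3)] +
                        [f2 + rng.normal(0, .5, 200) for _ in range(3)])

fa = FactorAnalysis(n_components=2).fit(items)
print(fa.components_.round(2))  # loadings should roughly separate the two item clusters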
Interpretability of data!
Can we truly conclude that student behavior is better in school F than school J?
◦ Does the tool truly measure well-defined behaviors? (content validity)
◦ Do student behaviors measured with this tool have any relevance for the school’s overall climate? For students’ long-term success? (concurrent, predictive validity)
◦ Does the tool actually measure “student behavior”, or does it measure “teacher behavior”, “administrator behavior”, or “parent behavior”? (construct validity)
Consider sample size
◦ Psychometric data derived from large samples are better than psychometric data derived from small samples.
Consider sample characteristics
◦ Psychometric data derived from specific samples (e.g. elementary schools) do not automatically generalize to all contexts (e.g. middle schools, high schools).
Making implementation decisions based on evaluation data
◦ When has a school reached “full” implementation?
◦ “Criterion” scores on implementation measures should be calibrated based on student outcomes
[Figure: hypothetical calibration graph plotting implementation scores (10-100) against student outcome goals, with a criterion line indicating the implementation level at which academic and social achievement goals are met.]
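As a toy sketch of calibrating a criterion score against student outcomes (all numbers hypothetical), mean outcomes can be inspected within implementation-score bands:

import pandas as pd

# Hypothetical data: implementation score (0-100) and % of students
# meeting academic and social outcome goals, one row per school
df = pd.DataFrame({
    "implementation":    [25, 40, 55, 62, 70, 78, 85, 92],
    "academic_goal_met": [41, 48, 55, 63, 71, 80, 83, 86],
    "social_goal_met":   [38, 45, 58, 66, 74, 82, 85, 88],
})

# Mean outcomes by implementation band suggest where to set the criterion
bands = pd.cut(df["implementation"], bins=[0, 40, 60, 80, 100])
print(df.groupby(bands, observed=True)[["academic_goal_met", "social_goal_met"]].mean())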
Evaluation data lead to consequential decisions, e.g.
◦ additional trainings when data indicate insufficient implementation
◦ emphasis on specific supports where data indicate greatest student needs
To make sure we arrive at defensible decisions, we need to collect evaluation data with tools that
◦ have documented reliability and validity
◦ clearly link implementation to student outcomes
1. Collect evaluation data regularly
2. Collect evaluation data with tools that have good reliability and validity
3. Guide implementation decisions with evaluation data clearly linked to student outcomes
Provide information about desirable features of SWPBS evaluation tools
Provide an overview of the extent to which SWPBS evaluation tools meet these desirable features
How is my school doing?
My school is “80/80”. Now what?
My school is just beginning SWPBS. Where do I start?
How do we handle the kids still on support plans?
I’ve heard about school climate. What is that?
What about the classroom problems we still have?
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
Focus on the whole school
School-wide PBS began with a focus on multiple systems
Evaluation of a process
Evaluation of an outcome
Growth beyond initial implementation
[Figure: four interlocking SWPBS systems: school-wide, non-classroom, classroom, and individual student (Sugai & Horner, 2002)]
Continuum of School-wide Positive Behavior Support
Primary Prevention: school-/classroom-wide systems for all students, staff, and settings (~80% of students)
Secondary Prevention: specialized group systems for students with at-risk behavior (~15%)
Tertiary Prevention: specialized individualized systems for students with high-risk behavior (~5%)
[Figure: measurement grid crossing unit of measurement and analysis (student, classroom, non-classroom, school) with dimension of measurement (academic achievement, social behavior) at each prevention tier (primary, secondary, tertiary), for both process and outcome measures; question marks indicate cells where measures are not yet established.]
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
1. Drive implementation decisions
2. Provide evidence for SWPBS impact on student outcomes
Measures have been developed to support research-quality assessment of SWPBS
Measures have been developed to assist teams in monitoring their progress
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
Some commonly used measures:
◦ Effective Behavior Supports Survey
◦ Team Implementation Checklist
◦ Benchmarks of Quality
◦ School-wide Evaluation Tool
◦ Implementation Phases Inventory
Newer measures:
◦ Individual Student Schoolwide Evaluation Tool
◦ Checklist for Individual Student Systems
◦ Self-assessment and Program Review
Tier      | Whole-School       | Non-classroom | Classroom
Tertiary  | ISSET, CISS        |               |
Secondary | ISSET, CISS        |               |
Universal | EBS, TIC, SET, BoQ | EBS           | EBS
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
Is it important, acceptable, and meaningful?
Can we use it in our school?
Is it consistent?
Is it easy to use?
Is it “expensive”?
Does it measure what it’s supposed to?
Does it link implementation to outcome?
Effective Behavior Supports Survey (EBS)
School-wide Evaluation Tool (SET)
Benchmarks of Quality (BoQ)
Effective Behavior Supports Survey
◦ Sugai, Horner, & Todd (2003)
◦ Hagan-Burke et al. (2005)
◦ Safran (2006)
Evidence documented (╳): internal consistency; test-retest, inter-rater, content, criterion, and construct evidence not documented.
46-item support team self-assessment
Facilitates initial and annual action planning
Current status and priority for improvement rated across four systems:
◦ School-wide
◦ Specific Setting
◦ Classroom
◦ Individual Student
Summary by domain, action planning activities
20-30 minutes; conducted at initial assessment and at quarterly and annual intervals
Internal consistency
◦ Sample of 3 schools
◦ Current status: α = .85
◦ Improvement priority: α = .94
◦ Subscale α from .60 to .75 for “current status” and .81 to .92 for “improvement priority”
Internal consistency for the School-wide subscale
◦ Sample of 37 schools
◦ α = .88 for “current status”
◦ α = .94 for “improvement priority”
School-wide Evaluation Tool
◦ Sugai, Horner & Todd (2000)
◦ Horner et al. (2004)
Evidence documented (╳): internal consistency, test-retest, inter-rater, content, construct; criterion evidence not documented.
28-item research evaluation of universal implementation
Total implementation score and 7 subscale scores:
1. school-wide behavioral expectations defined
2. school-wide behavioral expectations taught
3. acknowledgement system
4. consequences for problem behavior
5. system for monitoring problem behavior
6. administrative support
7. district support
2-3 hours, external evaluation, annual
Internal consistency
◦ Sample of 45 middle and elementary schools
◦ α = .96 for total score
◦ Subscale α from .71 (district-level support) to .91 (administrative support)
Test-retest analysis
◦ Sample of 17 schools
◦ Total score: IOA = 97.3%
◦ Individual subscales: IOA from 89.8% (acknowledgement of appropriate behaviors) to 100% (district-level support)
Content validity
◦ Developed in collaboration with teachers, staff, and administrators at 150 middle and elementary schools over a 3-year period
Construct validity
◦ Sample of 31 schools
◦ SET correlated with EBS Survey
◦ Pearson r = .75, p < .01
Sensitivity to differences in implementation across schools
◦ Sample of 13 schools
◦ Comparison of average scores before and after implementation
◦ t = 7.63, df = 12, p < .001
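A minimal sketch of the kind of before/after comparison reported above, using a paired t-test on hypothetical scores (not the original study data):

from scipy.stats import ttest_rel

# Hypothetical SET totals for the same 13 schools before and after implementation
before = [45, 52, 38, 60, 47, 55, 41, 58, 49, 44, 53, 50, 46]
after  = [78, 85, 70, 92, 80, 88, 74, 90, 82, 76, 86, 83, 79]
t, p = ttest_rel(after, before)
print(f"t = {t:.2f}, df = {len(before) - 1}, p = {p:.4f}")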
Schoolwide Benchmarks of Quality
◦ Kincaid, Childs, & George (2005)
◦ Cohen, Kincaid, & Childs (2007)
Evidence documented (╳): internal consistency, test-retest, inter-rater, content, criterion; construct evidence not documented.
Used to identify areas of success and improvement
Self-assessment completed by all team members
53 items rating level of implementation
Team coaches create a summary form, noting discrepancies in ratings
Areas of strength, areas needing development, and areas of discrepancy noted for discussion and planning
1-1.5 hours (1 team member plus coach)
Completed annually in spring
Items grouped into 10 subscales:
1. PBS team
2. faculty commitment
3. effective discipline procedures
4. data entry
5. expectations and rules
6. reward system
7. lesson plans for teaching behavioral expectations
8. implementation plans
9. crisis plans
10. evaluation
Internal consistency
◦ Sample of 105 schools in Florida and Maryland (44 elementary, 35 middle, 10 high, and 16 center schools)
◦ Overall α = .96
◦ Subscale α from .43 (“PBS team”) to .87 (“lesson plans for teaching expectations”)
Test-retest reliability
◦ Sample of 28 schools
◦ Coaches’ scores only
◦ Total score: r = .94, p < .01
◦ Subscale r from .63 (“implementation plan”) to .93 (“evaluation”): acceptable test-retest reliability
Inter-observer agreement (IOA)
◦ Sample of 32 schools
◦ IOA = 89%
Content validity
◦ Based on the Florida PBS training manual and core SWPBS elements
◦ Feedback from 20 SWPBS research and evaluation professionals
◦ Interviewing to identify response error in the items
◦ Pilot efforts with 10 support teams
Concurrent validity
◦ Sample of 42 schools
◦ Correlation between BoQ and SET
◦ Pearson r = .51, p < .05
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
What measures do I use?
How do I translate a score into “practice”?
What school variables affect measurement choices?
◦ SWPBS implementation status
Evaluation template
Fidelity Tool               | Year 1      | Year 2 | Year 3
EBS Survey                  | X (initial) |        |
Universal: TIC              | X X X       | X X X  | X X X
Universal: SET / BoQ        | X           | X      | X
Secondary/tertiary: CISS    | X X X       | X X X  | X X X
Secondary/tertiary: ISSET   | X           | X      | X
Classroom: internal         | X X X       | X X X  | X X X
Classroom: external         | X           | X      | X
(TIC, CISS, and internal classroom measures three times per year; SET/BoQ, ISSET, and external classroom measures once per year)
Evaluation of School-wide PBS occurs for implementation and outcomes
Evidence of a “good” measure depends on its intended use
The quality of implementation decisions depends on the quality of evaluation tools
Evaluation occurs throughout the implementation process, with different tools for different purposes at different stages