SWPBS Forum October 2008
Claudia Vincent and Scott Spaulding
[email protected] [email protected]
University of Oregon
Provide information about desirable features of SWPBS evaluation tools
Provide an overview of the extent to which SWPBS evaluation tools meet these desirable features
[Figure: SWPBS evaluation cycle. A PBS self-assessment feeds an action plan; teams implement systems to support practices and implement the practices, leading to improved student outcomes. Evaluation data (fidelity measures and student outcome measures) are interpreted and used for decision-making.]
1. Drive implementation decisions
2. Provide evidence for SWPBS impact on student outcomes
A measure that drives implementation decisions should be:
◦ socially valid
◦ contextually appropriate
◦ sufficiently reliable (reliable enough to make defensible decisions)
◦ easy to use
A measure that builds the evidence base for SWPBS should:
◦ have known reliability
◦ have known validity
◦ clearly link implementation status to student outcomes
Measurement scores have two components:
◦ True score, e.g. a school’s true performance on “teaching behavioral expectations”
◦ Error, e.g. features of the measurement process itself
Our goal is to use tools that
1. maximize true score and minimize measurement error, and therefore
2. yield precise and interpretable data, and therefore
3. lead to sound implementation decisions and defensible evidence.
True score (relevant to construct)
Error (noise)
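In classical test theory terms (a standard formulation, not spelled out on the original slide), the decomposition is:

X = T + E,  Var(X) = Var(T) + Var(E),  reliability = Var(T) / Var(X)

where X is the observed score, T the true score, and E the error, assumed uncorrelated with T. A technically adequate tool keeps Var(E) small relative to Var(X).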
True score is maximized and error minimized if the evaluation tool is technically adequate, i.e.
◦ can be applied consistently (has good reliability)
◦ measures the construct of interest (has good validity)
Sound implementation decisions are made if the evaluation tool is practical, i.e. data
◦ are cost efficient to collect (low impact)
◦ are easy to aggregate across units of analysis (e.g. students, classrooms, schools, districts, states; see the sketch after this list)
◦ are consistently used to make meaningful decisions (have high utility)
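As a minimal sketch of aggregating scores across units of analysis (all column names and scores here are hypothetical), school-level data can be rolled up with grouped means:

import pandas as pd

# Hypothetical long-format evaluation data: one row per school
scores = pd.DataFrame({
    "state":     ["OR", "OR", "OR", "OR"],
    "district":  ["A", "A", "B", "B"],
    "school":    ["S1", "S2", "S3", "S4"],
    "set_total": [82, 74, 91, 68],
})

# Roll school-level scores up to higher units of analysis
print(scores.groupby("district")["set_total"].mean())  # district level
print(scores.groupby("state")["set_total"].mean())     # state level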
Consistency across
◦ Items/subscales/total scales (“internal consistency”)
◦ Data collectors (“inter-rater reliability” or “inter-observer agreement”)
◦ Time (“test-retest reliability”)
Definition:
◦ Extent to which the items on an instrument adequately and randomly sample a cohesive construct, e.g. “SWPBS implementation”
Assessment:
◦ If the instrument adequately and randomly samples one construct, and if it were divided into two equal parts, both parts should correlate strongly
Metric:
◦ coefficient alpha (the average split-half correlation based on all possible divisions of an instrument into two parts)
Interpretation:
◦ α ≥ .70 (adequate for measures under development)
◦ α ≥ .80 (adequate for basic research)
◦ α ≥ .90 (adequate for measures on which consequential decisions are based)
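As a minimal sketch of how coefficient alpha can be computed (the data here are hypothetical, not from any actual school sample):

import numpy as np

def cronbach_alpha(items):
    # items: (n_respondents x n_items) matrix of item scores
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 schools scored on 4 items (0 = not in place, 2 = in place)
scores = [[2, 2, 1, 2],
          [1, 1, 1, 0],
          [2, 2, 2, 2],
          [0, 1, 0, 1],
          [2, 1, 2, 2]]
print(f"alpha = {cronbach_alpha(scores):.2f}")  # compare against the .70/.80/.90 benchmarks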
Definition:
◦ Extent to which the instrument measures the same construct regardless of who collects the data
Assessment:
◦ If the same construct were observed by two data collectors, their ratings should be almost identical
Metric:
◦ Expressed as percentage of agreement between two data collectors
Interpretation:
◦ ≥ 90% good
◦ ≥ 80% acceptable
◦ < 80% problematic
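A minimal sketch of percentage agreement between two data collectors (the ratings below are hypothetical):

import numpy as np

def percent_agreement(rater_a, rater_b):
    # Item-by-item agreement between two observers, as a percentage
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return 100.0 * np.mean(a == b)

# Two hypothetical observers rating the same 10 items
a = [2, 1, 2, 0, 2, 1, 1, 2, 0, 2]
b = [2, 1, 2, 1, 2, 1, 1, 2, 0, 2]
print(f"IOA = {percent_agreement(a, b):.0f}%")  # 90% here: "good" by the benchmarks above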
Definition:
◦ Extent to which the instrument yields consistent results at two points in time
Assessment:
◦ The measure is administered at two points in time. The time interval is set so that no improvement is expected to occur between the first and second administration.
Metric:
◦ Expressed as a correlation between pairs of scores from the same schools obtained at the two administrations
Interpretation:
◦ r ≥ .6 acceptable
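A minimal sketch of estimating test-retest reliability with a Pearson correlation (scores are hypothetical):

from scipy.stats import pearsonr

# Hypothetical total scores from the same 6 schools at two administrations
time1 = [78, 85, 62, 90, 70, 81]
time2 = [75, 88, 60, 92, 73, 79]
r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f} (p = {p:.3f})")  # r >= .6 considered acceptable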
Interpretability of data!
Did these schools truly differ in the extent to which they taught behavioral expectations?
Or… did these schools obtain different scores because
◦ the tool’s items captured only some schools’ approach to teaching expectations? (tool lacked internal consistency)
◦ they had different data collectors? (tool lacked inter-rater agreement)
◦ some collected data in week 1 and some in week 2 of the same month? (tool lacked test-retest reliability)
Content validity
Criterion-related validity
◦ Concurrent validity
◦ Predictive validity
Construct validity
Definition:
◦ Extent to which the items on an instrument relate to the construct of interest, e.g. “student behavior”
Assessment:
◦ Expert judgment of whether items measure content theoretically or empirically linked to the construct
Metric:
◦ Expressed as percentage of expert agreement
Interpretation:
◦ ≥ 80% agreement desirable
Definition:
◦ Extent to which the instrument correlates with another instrument measuring a similar aspect of the construct of interest and administered concurrently or subsequently
Assessment:
◦ Concurrent validity: compare data from concurrently administered measures for agreement
◦ Predictive validity: compare data from subsequently administered measures for predictive accuracy
Metric:
◦ Expressed as a correlation between two measures
Interpretation:
◦ Moderate to high correlations are desirable
◦ Concurrent validity: very high correlations might indicate redundancy of measures
Definition:
◦ Extent to which the instrument measures what it is supposed to measure (e.g. the theorized construct “student behavior”)
Assessment:
◦ Factor analyses yielding information about the instrument’s dimensions (e.g. aspects of “student behavior”)
◦ Correlations between constructs hypothesized to impact each other (e.g. “student behavior” and “student reading achievement”)
Metric:
◦ Statistical model fit indices (e.g. chi-square)
Interpretation:
◦ Statistical significance
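As a minimal sketch of an exploratory factor analysis on instrument data (all data simulated; this illustrates one common approach, not the specific analyses behind the tools reviewed later):

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulated responses from 200 schools on a 6-item instrument:
# items 1-3 driven by one latent dimension, items 4-6 by another
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 200))
items = np.column_stack([f1 + rng.normal(0, .5, 200) for _ in range(3)] +
                        [f2 + rng.normal(0, .5, 200) for _ in range(3)])

fa = FactorAnalysis(n_components=2).fit(items)
print(fa.components_.round(2))  # loadings should roughly separate the two item clusters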
Interpretability of data!
Can we truly conclude that student behavior is better in school F than school J?
◦ Does the tool truly measure well-defined behaviors? (content validity)
◦ Do student behaviors measured with this tool have any relevance for the school’s overall climate? For students’ long-term success? (concurrent, predictive validity)
◦ Does the tool actually measure “student behavior”, or does it measure “teacher behavior”, “administrator behavior”, or “parent behavior”? (construct validity)
Consider sample size
◦ Psychometric data derived from large samples are better than psychometric data derived from small samples.
Consider sample characteristics
◦ Psychometric data derived from specific samples (e.g. elementary schools) do not automatically generalize to all contexts (e.g. middle schools, high schools).
Making implementation decisions based on evaluation data
◦ When has a school reached “full” implementation?
◦ “Criterion” scores on implementation measures should be calibrated based on student outcomes
[Figure: hypothetical calibration graph plotting implementation scores (10-100) against student outcome goals, with a criterion line indicating the implementation level at which academic and social achievement goals are met.]
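As a toy sketch of calibrating a criterion score against student outcomes (all numbers hypothetical), mean outcomes can be inspected within implementation-score bands:

import pandas as pd

# Hypothetical data: implementation score (0-100) and % of students
# meeting academic and social outcome goals, one row per school
df = pd.DataFrame({
    "implementation":    [25, 40, 55, 62, 70, 78, 85, 92],
    "academic_goal_met": [41, 48, 55, 63, 71, 80, 83, 86],
    "social_goal_met":   [38, 45, 58, 66, 74, 82, 85, 88],
})

# Mean outcomes by implementation band suggest where to set the criterion
bands = pd.cut(df["implementation"], bins=[0, 40, 60, 80, 100])
print(df.groupby(bands, observed=True)[["academic_goal_met", "social_goal_met"]].mean())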
Evaluation data lead to consequential decisions, e.g.
◦ additional trainings when data indicate insufficient implementation
◦ emphasis on specific supports where data indicate greatest student needs
To make sure we arrive at defensible decisions, we need to collect evaluation data with tools that
◦ have documented reliability and validity
◦ clearly link implementation to student outcomes
1. Collect evaluation data regularly
2. Collect evaluation data with tools that have good reliability and validity
3. Guide implementation decisions with evaluation data clearly linked to student outcomes
Provide information about desirable features of SWPBS evaluation tools
Provide an overview of the extent to which SWPBS evaluation tools meet these desirable features
How is my school doing?
My school is “80/80”. Now what?
My school is just beginning SWPBS. Where do I start?
How do we handle the kids still on support plans?
I’ve heard about school climate. What is that?
What about the classroom problems we still have?
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
Focus on the whole school
School-wide PBS began with a focus on multiple systems
Evaluation of a process
Evaluation of an outcome
Growth beyond initial implementation
[Figure: four interlocking SWPBS systems: school-wide, non-classroom, classroom, and individual student (Sugai & Horner, 2002)]
Continuum of School-wide Positive Behavior Support
Primary Prevention: school-/classroom-wide systems for all students, staff, and settings (~80% of students)
Secondary Prevention: specialized group systems for students with at-risk behavior (~15%)
Tertiary Prevention: specialized individualized systems for students with high-risk behavior (~5%)
[Figure: measurement grid crossing unit of measurement and analysis (student, classroom, non-classroom, school) with dimension of measurement (academic achievement, social behavior) at each prevention tier (primary, secondary, tertiary), for both process and outcome measures; question marks indicate cells where measures are not yet established.]
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
1. Drive implementation decisions
2. Provide evidence for SWPBS impact on student outcomes
Measures have been developed to support research-quality assessment of SWPBS
Measures have been developed to assist teams in monitoring their progress
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
Some commonly used measures:
◦ Effective Behavior Supports Survey
◦ Team Implementation Checklist
◦ Benchmarks of Quality
◦ School-wide Evaluation Tool
◦ Implementation Phases Inventory
Newer measures:
◦ Individual Student Schoolwide Evaluation Tool
◦ Checklist for Individual Student Systems
◦ Self-assessment and Program Review
Tier      | Whole-School       | Non-classroom | Classroom
Tertiary  | ISSET, CISS        |               |
Secondary | ISSET, CISS        |               |
Universal | EBS, TIC, SET, BoQ | EBS           | EBS
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
Is it important, acceptable, and meaningful?
Can we use it in our school?
Is it consistent?
Is it easy to use?
Is it “expensive”?
Does it measure what it’s supposed to?
Does it link implementation to outcome?
Effective Behavior Supports Survey (EBS)
School-wide Evaluation Tool (SET)
Benchmarks of Quality (BoQ)
Effective Behavior Supports Survey
◦ Sugai, Horner, & Todd (2003)
◦ Hagan-Burke et al. (2005)
◦ Safran (2006)
Evidence documented (╳): internal consistency; test-retest, inter-rater, content, criterion, and construct evidence not documented.
46-item support team self-assessment
Facilitates initial and annual action planning
Current status and priority for improvement rated across four systems:
◦ School-wide
◦ Specific Setting
◦ Classroom
◦ Individual Student
Summary by domain, action planning activities
20-30 minutes; conducted at initial assessment and at quarterly and annual intervals
Internal consistency
◦ Sample of 3 schools
◦ Current status: α = .85
◦ Improvement priority: α = .94
◦ Subscale α from .60 to .75 for “current status” and .81 to .92 for “improvement priority”
Internal consistency for the School-wide subscale
◦ Sample of 37 schools
◦ α = .88 for “current status”
◦ α = .94 for “improvement priority”
School-wide Evaluation Tool
◦ Sugai, Horner & Todd (2000)
◦ Horner et al. (2004)
Evidence documented (╳): internal consistency, test-retest, inter-rater, content, construct; criterion evidence not documented.
28-item research evaluation of universal implementation
Total implementation score and 7 subscale scores:
1. school-wide behavioral expectations defined
2. school-wide behavioral expectations taught
3. acknowledgement system
4. consequences for problem behavior
5. system for monitoring problem behavior
6. administrative support
7. district support
2-3 hours, external evaluation, annual
Internal consistency
◦ Sample of 45 middle and elementary schools
◦ α = .96 for total score
◦ Subscale α from .71 (district-level support) to .91 (administrative support)
Test-retest analysis
◦ Sample of 17 schools
◦ Total score: IOA = 97.3%
◦ Individual subscales: IOA from 89.8% (acknowledgement of appropriate behaviors) to 100% (district-level support)
Content validity
◦ Developed in collaboration with teachers, staff, and administrators at 150 middle and elementary schools over a 3-year period
Construct validity
◦ Sample of 31 schools
◦ SET correlated with EBS Survey
◦ Pearson r = .75, p < .01
Sensitivity to differences in implementation across schools
◦ Sample of 13 schools
◦ Comparison of average scores before and after implementation
◦ t = 7.63, df = 12, p < .001
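A minimal sketch of the kind of before/after comparison reported above, using a paired t-test on hypothetical scores (not the original study data):

from scipy.stats import ttest_rel

# Hypothetical SET totals for the same 13 schools before and after implementation
before = [45, 52, 38, 60, 47, 55, 41, 58, 49, 44, 53, 50, 46]
after  = [78, 85, 70, 92, 80, 88, 74, 90, 82, 76, 86, 83, 79]
t, p = ttest_rel(after, before)
print(f"t = {t:.2f}, df = {len(before) - 1}, p = {p:.4f}")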
Schoolwide Benchmarks of Quality
◦ Kincaid, Childs, & George (2005)
◦ Cohen, Kincaid, & Childs (2007)
Evidence documented (╳): internal consistency, test-retest, inter-rater, content, criterion; construct evidence not documented.
Used to identify areas of success and improvement
Self-assessment completed by all team members
53 items rating level of implementation
Team coaches create a summary form, noting discrepancies in ratings
Areas of strength, areas needing development, and areas of discrepancy noted for discussion and planning
1-1.5 hours (1 team member plus coach)
Completed annually in spring
Items grouped into 10 subscales:
1. PBS team
2. faculty commitment
3. effective discipline procedures
4. data entry
5. expectations and rules
6. reward system
7. lesson plans for teaching behavioral expectations
8. implementation plans
9. crisis plans
10. evaluation
Internal consistency
◦ Sample of 105 schools in Florida and Maryland (44 elementary, 35 middle, 10 high, and 16 center schools)
◦ Overall α = .96
◦ Subscale α from .43 (“PBS team”) to .87 (“lesson plans for teaching expectations”)
Test-retest reliability
◦ Sample of 28 schools
◦ Coaches’ scores only
◦ Total score: r = .94, p < .01
◦ Subscale r from .63 (“implementation plan”) to .93 (“evaluation”): acceptable test-retest reliability
Inter-observer agreement (IOA)
◦ Sample of 32 schools
◦ IOA = 89%
Content validity
◦ Based on the Florida PBS training manual and core SWPBS elements
◦ Feedback from 20 SWPBS research and evaluation professionals
◦ Interviewing to identify response error in the items
◦ Pilot efforts with 10 support teams
Concurrent validity
◦ Sample of 42 schools
◦ Correlation between BoQ and SET
◦ Pearson r = .51, p < .05
Measurement within SWPBS
Research or evaluation?
What tools do we have?
What evidence exists for use of these tools?
Guidelines for using the measures
What measures do I use?
How do I translate a score into “practice”?
What school variables affect measurement choices?
◦ SWPBS implementation status
Evaluation template
Fidelity Tool               | Year 1      | Year 2 | Year 3
EBS Survey                  | X (initial) |        |
Universal: TIC              | X X X       | X X X  | X X X
Universal: SET / BoQ        | X           | X      | X
Secondary/tertiary: CISS    | X X X       | X X X  | X X X
Secondary/tertiary: ISSET   | X           | X      | X
Classroom: internal         | X X X       | X X X  | X X X
Classroom: external         | X           | X      | X
(TIC, CISS, and internal classroom measures three times per year; SET/BoQ, ISSET, and external classroom measures once per year)
Evaluation of School-wide PBS occurs for implementation and outcomes
Evidence of a “good” measure depends on its intended use
The quality of implementation decisions depends on the quality of evaluation tools
Evaluation occurs throughout the implementation process, with different tools for different purposes at different stages