Andy Hegedus, Ed.D. Kingsbury Center at NWEA June 2014 Using Assessment Data for Educator and Student Growth
Nov 27, 2014
• Increase your understanding of various urgent assessment-related topics
– Ask better questions
– Useful for making all types of decisions with data
My Purpose
1. Alignment between the content assessed and the content to be taught
2. Selection of an appropriate assessment
• Used for the purpose for which it was designed (proficiency vs. growth)
• Can accurately measure the knowledge of all students
• Adequate sensitivity to growth
3. Adjust for context/control for factors outside a teacher’s direct control (value-added)
Three primary conditions
1. Assessment results used wisely as part of a dialogue to help teachers set and meet challenging goals
2. Use of tests as a “yellow light” to identify teachers who may be in need of additional support or are ready for more
Two approaches we like
• What we’ve known to be true is now being shown to be true
– Using data thoughtfully improves student achievement and growth rates
– 12% mathematics, 13% reading
• There are dangers present, however
– Unintended Consequences
Go forth thoughtfully with care
Slotnik, W. J., Smith, M. D., It’s more than money, February 2013, retrieved from http://www.ctacusa.com/PDFs/MoreThanMoney-report.pdf
“What gets measured (and attended to), gets done”
Remember the old adage?
• NCLB
– Cast light on inequities
– Improved performance of “Bubble Kids”
– Narrowed taught curriculum
The same dynamic happens inside your schools
An infamous example
It’s what we do that counts
A patient’s health doesn’t change because we know their blood pressure
It’s our response that makes all the difference
Be considerate of the continuum of stakes involved
Support
Compensate
Terminate
Increasing levels of required rigor
Increasing risk
[Figure: Marcus’ growth, shown with Normal Growth and Needed Growth lines against the College readiness standard]
The Test
The Growth Metric
The Evaluation
The Rating
There are four key steps required to answer this question
Top-Down Model
Assessment 1
Goal Setting
Assessment(s)
Results and Analysis
Evaluation (Rating)
How does the other popular process work?
Bottom-Up Model (Student Learning Objectives)
An understanding of all four top-down elements is needed here
The Test
The Growth Metric
The Evaluation
The Rating
Let’s begin at the beginning
3rd Grade ELA Standards
3rd Grade ELA Teacher?
3rd Grade Social Studies Teacher?
Elem. Art Teacher?
What is measured should be aligned to what is to be taught
1. Answer questions to demonstrate understanding of text….
2. Determine the main idea of a text….
3. Determine the meaning of general academic and domain specific words…
Would you use a general reading assessment in the evaluation of a….
~30% of teachers teach in tested subjects and grades
The Other 69 Percent: Fairly Rewarding the Performance of Teachers of Nontested Subjects and Grades, http://www.cecr.ed.gov/guides/other69Percent.pdf
• Assessments should align with the teacher’s instructional responsibility
– Specific advanced content
• HS teachers teaching discipline specific content, especially 11th and 12th grade
• MS teachers teaching HS content to advanced students
– Non-tested subjects
• School-wide results are more likely “professional responsibility” rather than reflecting competence
– HS teachers providing remedial services
What is measured should be aligned to what is to be taught
• Many assessments are not designed to measure growth
• Others do not measure growth equally well for all students
The purpose and design of the instrument is significant
Let’s ensure we have similar meaning
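[Figure: a scale running from Beginning Literacy to Adult Reading with 5th Grade marked; one student's score (x) plotted at Time 1 and at Time 2, labeled Status and Growth]
Two assumptions:
1. Measurement accuracy, and
2. Vertical interval scale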
Accurately measuring growth depends on accurately measuring achievement
Questions surrounding the student’s achievement level
The more questions the merrier
What does it take to accurately measure achievement?
Teachers encounter a distribution of student performance
[Figure: a scale running from Beginning Literacy to Adult Reading with 5th Grade marked; many student scores (x) spread across the scale; label: Grade Level Performance]
Adaptive testing works differently
Item bank can span full range of achievement
How about accurately measuring height?
What if the yardstick stopped in the middle of his back?
Items available need to match student ability
California STAR NWEA MAP
How about accurately measuring height?
What if we could only mark within a pre-defined six inch range?
5th Grade Level Items
These differences impact measurement error
[Figure: test information (y-axis, .00 to .12) by scale score (x-axis, 160 to 240) for a Fully Adaptive Test and a Constrained Adaptive or Paper/Pencil Test, annotated “Significantly Different Error”]
To determine growth, achievement measurements must be related through a scale
If I was measured as: 5’ 9”
And a year later I was: 1.82 m
Did I grow? Yes. ~2.5”
How do you know?
Let’s measure height again
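The height example can be worked through as a short sketch. It assumes the standard definition of 2.54 cm per inch; the function and variable names are illustrative and do not come from the presentation. The exact difference comes out near 2.65 inches, in line with the rough figure on the slide.

```python
# Illustrative sketch: two measurements can only be compared once they are
# expressed on the same scale. All names here are hypothetical.
INCHES_PER_FOOT = 12
METERS_PER_INCH = 0.0254  # exact by definition of the inch


def feet_inches_to_inches(feet, inches):
    """Convert a feet-and-inches height to total inches."""
    return feet * INCHES_PER_FOOT + inches


def meters_to_inches(meters):
    """Convert a height in meters to inches."""
    return meters / METERS_PER_INCH


first = feet_inches_to_inches(5, 9)   # 5' 9" expressed in inches
second = meters_to_inches(1.82)       # 1.82 m expressed in inches
growth = second - first               # comparable only on a common scale

print(f"Year 1: {first:.2f} in, Year 2: {second:.2f} in, growth: {growth:.2f} in")
```

The same logic is why test scores from two different terms need a common scale before a difference between them can be read as growth.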
Traditional assessment uses items reflecting the grade level standards
[Figure: a scale running from Beginning Literacy to Adult Reading, with Grade Level Standards bands for 4th Grade, 5th Grade, and 6th Grade and the Traditional Assessment Item Bank marked]
Traditional assessment uses items reflecting the grade level standards
[Figure: a scale running from Beginning Literacy to Adult Reading, with Grade Level Standards bands for 4th Grade, 5th Grade, and 6th Grade]
Overlap allows linking and scale construction
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5: 1, 1-53
• …when science is defined in terms of knowledge of facts that are taught in school…(then) those students who have been taught the facts will know them, and those who have not will…not. A test that assesses these skills is likely to be highly sensitive to instruction.
The instrument must be able to detect instruction
Black, P. and Wiliam, D. (2007) 'Large-scale assessment systems: Design principles drawn from international comparisons', Measurement: Interdisciplinary Research & Perspective, 5: 1, 1-53
• When ability in science is defined in terms of scientific reasoning…achievement will be less closely tied to age and exposure, and more closely related to general intelligence. In other words, science reasoning tasks are relatively insensitive to instruction.
The more complex, the harder to detect and attribute to one teacher
• Tests specifically designed to inform classroom instruction and school improvement in formative ways
No incentive in the system for inaccurate data
Using tests in high stakes ways creates new dynamic
[Figure: bar chart by school (x-axis, schools 1 to 71; y-axis, -6.00 to 10.00) comparing students taking 10+ minutes longer in spring than in fall with all other students]
New phenomenon when used as part of a compensation program
Mean value-added growth by school
Cheating
• Atlanta Public Schools
• Crescendo Charter Schools
• Philadelphia Public Schools
• Washington DC Public Schools
• Houston Independent School District
• Michigan Public Schools
When teachers are evaluated on growth using a once per year assessment, one teacher who cheats disadvantages the next teacher
Other consequence
• Both a proctor and the teacher should be present during testing
– Teacher can best guide students and ensure effort
– Proctor protects integrity of results and can support defense of teacher if results are challenged
• Have all students test each term
– Need two terms to determine growth
– The more students aggregated, the more you know
Proctoring
• Important for reliable test data, particularly when determining growth
• Use Testing Condition Indicators as KPIs
– Accuracy, duration, changes in duration
– Formative conversations to improve over time
• Short test durations are worth considering for follow-up
– Apply criteria each test event
• Be concerned more with consistency in test duration than duration itself
Consistent Testing Conditions
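One way to turn test duration into a Testing Condition Indicator is sketched below. It assumes per-student test durations in minutes are available for fall and spring. The record layout, the function name, and the 10-minute threshold (borrowed from the earlier chart comparing students who took 10+ minutes longer in spring than in fall) are illustrative assumptions, not an NWEA rule.

```python
# Hypothetical sketch: flag students whose test duration changed a lot
# between two test events. The threshold is an assumption for illustration.
def flag_duration_changes(records, threshold_minutes=10):
    """Return (student id, change) pairs where the spring test ran at least
    `threshold_minutes` longer or shorter than the fall test."""
    flagged = []
    for student_id, fall_minutes, spring_minutes in records:
        change = spring_minutes - fall_minutes
        if abs(change) >= threshold_minutes:
            flagged.append((student_id, change))
    return flagged


# Made-up durations in minutes: (student id, fall, spring)
sample = [
    ("s1", 45, 47),
    ("s2", 40, 58),   # 18 minutes longer in spring
    ("s3", 50, 35),   # 15 minutes shorter in spring
]

flags = flag_duration_changes(sample)
print(flags)
```

The check looks at the change in either direction because the concern on the slide is consistency of duration, not duration itself.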
• Pause or terminate before completion
– Preferred option
– Address when problems are identified
– Not subject to challenge that student retested simply because the score wasn’t good enough
• Monitor students as testing is going on
– Ensure effort
– Support students as they struggle
– G&T
• Show that accurate data is important
Early Intervention
• Define “Significant” decline between test events
– Apply significant decline criteria each test term
• Simply missing cut score is not an acceptable reason to retest
Retesting
Testing is complete . . . What is useful to answer our question?
The Test
The Growth Metric
The Evaluation
The Rating
[Figure: bar chart by grade, Grade 2 to Grade 8 (y-axis, 0 to 100), with bars for Reading and Math]
The metric matters - Let’s go underneath “Proficiency”
Difficulty of New York Cut Score Between Level 2 and 3
National Percentile
College Readiness
A study of the alignment of the NWEA RIT scale with the New York State (NYS) Testing Program, November 2013
Difficulty of ACT college readiness standards
The metric matters - Let’s go underneath “Proficiency”
Dahlin, M. and Durant, S., The State of Proficiency, Kingsbury Center at NWEA, July 2011
[Figure: Mathematics. Number of Students (y-axis) by Fall RIT (x-axis); legend: No Change, Down, Up; markers: Proficiency, College Readiness]
What gets measured and attended to really does matter
One district’s change in 5th grade mathematics performance relative to the KY proficiency cut scores
[Figure: Mathematics. Number of Students (y-axis) by student’s score in fall (x-axis); legend: Below projected growth, Met or above projected growth]
Number of 5th grade students meeting projected mathematics growth in the same district
Changing from Proficiency to Growth means all kids matter
• What did you just learn?
• How will you change what you typically do?
Guiding Questions
How can we make it fair?
The Test
The Growth Metric
The Evaluation
The Rating
Without context what is “Good”?
[Figure: a scale running from Beginning Reading to Adult Literacy shown alongside reference points: National Percentile (Norms Study), ACT College Readiness Benchmarks, State Test Performance Levels (“Meets” Proficiency), and Common Core Performance Levels (Proficient)]
Normative data for growth is a bit different
Fall Score
Subject: Reading
Grade: 4th
7 points
FRL vs. non-FRL?
IEP vs. non-IEP?
ESL vs. non-ESL?
Outside of a teacher’s direct control
Starting Achievement
Instructional Weeks
Basic Factors
Typical growth
How did we address requirements in New York?
State Tested Grades / Subjects (4-8 Math and Reading)
[Pie chart: APPR = Observations 60%, State Test Growth 20%, EA Value-Added 20%]
Other Grades / Subjects for which there is an available non-state test
[Pie chart: APPR = Observations 60%, Local Measure 2 (SLO) 20%, EA Value-Added 20%]
Partnered with Education Analytics on VAM
The Oak Tree Analogy* – a conceptual introduction to the metric
*Developed at the Value-Added Research Center
An Introduction to Value-Added
The Oak Tree Analogy
Gardener A Gardener B
Explaining Value-Added by Evaluating Gardener Performance
• For the past year, these gardeners have been tending to their oak trees trying to maximize the height of the trees.
This method is analogous to using an Achievement Model.
Gardener A: 61 in.
Gardener B: 72 in.
Method 1: Measure the Height of the Trees Today (One Year After the Gardeners Began)
• Using this method, Gardener B is the more effective gardener.
Gardener A: Oak A, age 3 (1 year ago): 47 in.; age 4 (today): 61 in.
Gardener B: Oak B, age 3 (1 year ago): 52 in.; age 4 (today): 72 in.
This Achievement Result is not the Whole Story
• We need to find the starting height for each tree in order to more fairly evaluate each gardener’s performance during the past year.
This is analogous to a Simple Growth Model, also called Gain.
Gardener A: Oak A, age 3 (1 year ago): 47 in.; age 4 (today): 61 in.; gain: +14 in.
Gardener B: Oak B, age 3 (1 year ago): 52 in.; age 4 (today): 72 in.; gain: +20 in.
Method 2: Compare Starting Height to Ending Height
• Oak B had more growth this year, so Gardener B is the more effective gardener.
Gardener A Gardener B
What About Factors Outside the Gardener’s Influence?
• This is an “apples to oranges” comparison.
• For our oak tree example, three environmental factors we will examine are: Rainfall, Soil Richness, and Temperature.
External condition   Oak Tree A   Oak Tree B
Rainfall amount      High         Low
Soil richness        Low          High
Temperature          High         Low
Gardener A Gardener B
Gardener A Gardener B
How Much Did These External Factors Affect Growth?
• We need to analyze real data from the region to predict growth for these trees.
• We compare the actual height of the trees to their predicted heights to determine if the gardener’s effect was above or below average.
In order to find the impact of rainfall, soil richness, and temperature, we will plot the growth of each individual oak in the region compared to its environmental conditions.
Growth in inches relative to the average:
Condition       Low   Medium   High
Rainfall        -5    -2       +3
Soil Richness   -3    -1       +2
Temperature     +5    -3       -8
Calculating Our Prediction Adjustments Based on Real Data
Gardener A: Oak A, age 3 (1 year ago): 47 in.; +20 average; Oak A prediction: 67 in.
Gardener B: Oak B, age 3 (1 year ago): 52 in.; +20 average; Oak B prediction: 72 in.
Make Initial Prediction for the Trees Based on Starting Height
• Next, we will refine our prediction based on the growing conditions for each tree. When we are done, we will have an “apples to apples” comparison of the gardeners’ effect.
Gardener A: 47 in. +20 average, +3 for rainfall; prediction: 70 in.
Gardener B: 52 in. +20 average, -5 for rainfall; prediction: 67 in.
Based on Real Data, Customize Predictions based on Rainfall
• For having high rainfall, Oak A’s prediction is adjusted by +3 to compensate.
• Similarly, for having low rainfall, Oak B’s prediction is adjusted by -5 to compensate.
Gardener A: 47 in. +20 average, +3 for rainfall, -3 for soil; prediction: 67 in.
Gardener B: 52 in. +20 average, -5 for rainfall, +2 for soil; prediction: 69 in.
Adjusting for Soil Richness
• For having poor soil, Oak A’s prediction is adjusted by -3.
• For having rich soil, Oak B’s prediction is adjusted by +2.
Gardener A: 47 in. +20 average, +3 for rainfall, -3 for soil, -8 for temp; prediction: 59 in.
Gardener B: 52 in. +20 average, -5 for rainfall, +2 for soil, +5 for temp; prediction: 74 in.
Adjusting for Temperature
• For having high temperature, Oak A’s prediction is adjusted by -8.
• For having low temperature, Oak B’s prediction is adjusted by +5.
Gardener A: +20 average, +3 for rainfall, -3 for soil, -8 for temp = +12 inches during the year; 47 in. to a predicted 59 in.
Gardener B: +20 average, -5 for rainfall, +2 for soil, +5 for temp = +22 inches during the year; 52 in. to a predicted 74 in.
Our Gardeners are Now on a Level Playing Field
• The predicted height for trees in Oak A’s conditions is 59 inches.
• The predicted height for trees in Oak B’s conditions is 74 inches.
Gardener A: predicted Oak A 59 in.; actual Oak A 61 in. (+2)
Gardener B: predicted Oak B 74 in.; actual Oak B 72 in. (-2)
Compare the Predicted Height to the Actual Height
• Oak A’s actual height is 2 inches more than predicted. We attribute this to the effect of Gardener A.
• Oak B’s actual height is 2 inches less than predicted. We attribute this to the effect of Gardener B.
This is analogous to a Value-Added measure.
Gardener A: predicted Oak A 59 in.; actual Oak A 61 in. (+2): Above Average Value-Added
Gardener B: predicted Oak B 74 in.; actual Oak B 72 in. (-2): Below Average Value-Added
Method 3: Compare the Predicted Height to the Actual Height
• By accounting for last year’s height and environmental conditions of the trees during this year, we found the “value” each gardener “added” to the growth of the trees.
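The three methods in the analogy can be reproduced in a few lines, using the heights and adjustments given on the slides. This is a sketch of the analogy's simple additive arithmetic only; the dictionary layout and function name are illustrative, and actual value-added models estimate such adjustments statistically across many students.

```python
# Sketch of the oak tree analogy using the numbers from the slides.
# Adjustments are growth in inches relative to the average, by condition level.
ADJUSTMENTS = {
    "rainfall": {"low": -5, "medium": -2, "high": 3},
    "soil": {"low": -3, "medium": -1, "high": 2},
    "temperature": {"low": 5, "medium": -3, "high": -8},
}
AVERAGE_GROWTH = 20  # inches of growth for an average tree in the region

trees = {
    "Oak A": {"start": 47, "actual": 61,
              "conditions": {"rainfall": "high", "soil": "low", "temperature": "high"}},
    "Oak B": {"start": 52, "actual": 72,
              "conditions": {"rainfall": "low", "soil": "high", "temperature": "low"}},
}


def predicted_height(tree):
    """Starting height plus average growth plus each condition's adjustment."""
    adjustment = sum(ADJUSTMENTS[factor][level]
                     for factor, level in tree["conditions"].items())
    return tree["start"] + AVERAGE_GROWTH + adjustment


results = {}
for name, tree in trees.items():
    results[name] = {
        "achievement": tree["actual"],                          # Method 1
        "gain": tree["actual"] - tree["start"],                 # Method 2
        "predicted": predicted_height(tree),
        "value_added": tree["actual"] - predicted_height(tree)  # Method 3
    }
    print(name, results[name])
```

Methods 1 and 2 both favor Gardener B (72 in. and +20 in.), while Method 3 favors Gardener A (+2 versus -2), which is the point of the analogy.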
Gardener A
Value-Added is a Group Measure
• To statistically isolate a gardener’s effect, we need data from many trees under that gardener’s care.
Gardener B
How does this analogy relate to value added in the education context?
What are we evaluating?
– Oak Tree Analogy: Gardeners
– Value-Added in Education: Districts, Schools, Grades, Classrooms, Programs and Interventions
What are we using to measure success?
– Oak Tree Analogy: Relative height improvement in inches
– Value-Added in Education: Relative improvement on standardized test scores
Sample
– Oak Tree Analogy: Single oak tree
– Value-Added in Education: Groups of students
Control factors
– Oak Tree Analogy: Tree’s prior height; other factors beyond the gardener’s control: Rainfall, Soil richness, Temperature
– Value-Added in Education: Students’ prior test performance (usually most significant predictor); other demographic characteristics such as: Grade level, Gender, Race / Ethnicity, Low-Income Status, ELL Status, Disability Status, Section 504 Status
• What if I skip this step?
– Comparison is likely against normative data so the comparison is to “typical kids in typical settings”
• How fair is it to disregard context?
– Good teacher – bad school
– Good teacher – challenging kids
Consider . . .
• Control for measurement error
– All models attempt to address this issue
• Population size
• Multiple data points
– Error is compounded when combining two test events
– Many teachers’ value-added scores will fall within the range of statistical error
A variety of errors means more stability only at the extremes
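Why error compounds across two test events, and why population size helps, can be sketched with two textbook formulas. The sketch assumes that measurement errors on the two tests are independent and that student growth scores in a class are independent of one another; the standard errors and the spread used below are made-up numbers, not NWEA values.

```python
import math


def growth_standard_error(se_first, se_second):
    """Standard error of a difference of two scores with independent errors."""
    return math.sqrt(se_first ** 2 + se_second ** 2)


def mean_standard_error(sd_growth, n_students):
    """Standard error of a group's mean growth for n independent students."""
    return sd_growth / math.sqrt(n_students)


# Made-up example: each test event has a standard error of 3 scale points,
# so the growth score is noisier than either test score alone.
se_growth = growth_standard_error(3.0, 3.0)
print(f"One student's growth SE: {se_growth:.2f}")

# Made-up spread of student growth of 8 points: more students, less error.
for n in (1, 20, 100):
    print(n, round(mean_standard_error(8.0, n), 2))
```

Under these assumptions the error of a single growth score is larger than the error of either test, and it shrinks only as more students are aggregated.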
[Figure: Mathematics Growth Index Distribution by Teacher - Validity Filtered. y-axis: Average Growth Index Score and Range, -12.00 to 12.00; teachers grouped by quintile, Q5 to Q1]
Each line in this display represents a single teacher. The graphic shows the average growth index score for each teacher (green line), plus or minus the standard error of the growth index estimate (black line). We removed students who had tests of questionable validity and teachers with fewer than 20 students.
Range of teacher value-added estimates
With one teacher, error means a lot
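A display like the one described above could be built from per-student growth index scores roughly as follows. This is a sketch under assumptions: the data layout and function name are hypothetical, the scores are made up, the standard error is computed as the sample standard deviation divided by the square root of the class size, and a growth index of 0 is taken to mean typical growth.

```python
import math
import statistics


def teacher_summary(growth_index_by_teacher, min_students=20):
    """Mean growth index and a plus-or-minus one standard error range per
    teacher, skipping teachers with fewer than `min_students` students."""
    summary = {}
    for teacher, scores in growth_index_by_teacher.items():
        if len(scores) < min_students:
            continue
        mean = statistics.mean(scores)
        std_error = statistics.stdev(scores) / math.sqrt(len(scores))
        summary[teacher] = (mean, mean - std_error, mean + std_error)
    return summary


# Made-up growth index scores; "T3" has too few students and is dropped.
data = {
    "T1": [1.0, -1.0] * 10,   # 20 students, mean 0.0
    "T2": [3.0, 5.0] * 12,    # 24 students, mean 4.0
    "T3": [2.0, 2.5, 3.0],
}

summary = teacher_summary(data)
for teacher, (mean, low, high) in summary.items():
    spans_zero = low <= 0.0 <= high
    print(teacher, round(mean, 2), round(low, 2), round(high, 2), spans_zero)
```

A teacher whose range includes 0 cannot be told apart from typical growth on this evidence alone, which is the situation most teachers in the middle of the distribution are in.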
Because we want students to learn more!
• Research view
– Setting goals improves performance
Why should we care about goal setting in education?
What does research say on goal setting?
Locke, E. A. & Latham, G. P. (2002). Building a practically useful theory of goal setting and task motivation: A 35-year odyssey. American Psychologist. American Psychological Association.
[Figure: Essential Elements of Goal-Setting Theory and the High-Performance Cycle. Elements: Goals, Moderators, Mechanisms, Performance, Satisfaction with Performance and Rewards, Willingness to commit]
• Specificity
• Difficulty
– Performance and learning goals
– Proximal goals
Goals
Goals Explanation
• Specific goals are typically stronger than “Do your best” goals
• Moderately challenging is better than too easy or too hard
– If complex and new knowledge or skills needed, set learning goals
• Master five new ways to assess each student’s learning in the moment
– If complex, set short term goals to gauge progress and feel rewarded
• Lack of a historical context
– What have this teacher and these students done in the past?
• Lack of comparison groups
– What have other teachers done in the past?
• What is the objective?
– Is the objective to meet a standard of performance or demonstrate improvement?
• Do you set safe goals or challenging goals?
Challenges with goal setting
• Goals and targets themselves
– Appropriately balance moderately challenging goals with consequences
• Only use “Stretch” goals for the organization to stimulate creativity and create unconventional solutions
Suggestions
Locke, E. A., & Latham, G. P. (2013). New developments in goal setting and task performance.
• Goals and targets themselves (cont.)
– Set additional learning goals if complex and new
– Set interim benchmarks for progress monitoring
– Carefully consider what will not happen to attain the goal
• Can you live with the consequences?
• How will you look for other unintended ones?
Suggestions
Locke, E. A., & Latham, G. P. (2013). New developments in goal setting and task performance.
How tests are used to evaluate teachers
The Test
The Growth Metric
The Evaluation
The Rating
• How would you translate a rank order to a rating?
• Data can be provided
• A value judgment is ultimately the basis for setting cut scores for points or rating
Translation into ratings can be difficult to inform with data
• What is far below a district’s expectation is subjective
• What about
– Obligation to help teachers improve?
– Quality of replacement teachers?
Decisions are value based, not empirical
• System for combining elements and producing a rating is also a value based decision
– Multiple measures and principal judgment must be included
– Evaluate the extremes to make sure it makes sense
Even multiple measures need to be used well
Leadership Courage Is A Key
[Figure: ratings (y-axis, 0 to 5) for Teacher 1, Teacher 2, and Teacher 3; series: Observation, Assessment]
Ratings can be driven by the assessment
Real or Noise?
If evaluators do not differentiate their ratings, then all differentiation comes from the test
Big Message
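The point about undifferentiated ratings can be illustrated with a small sketch. It assumes a simple weighted average of an observation rating and an assessment-based rating on the same 1 to 5 scale; the 60% observation weight, the function name, and the ratings are hypothetical and do not describe any district's actual scoring rules.

```python
# Hypothetical sketch: a weighted composite of an observation rating and an
# assessment-based rating, both on a 1-5 scale. Weights are assumptions.
def composite(observation, assessment, observation_weight=0.6):
    """Weighted average of the two ratings."""
    return observation_weight * observation + (1 - observation_weight) * assessment


# Made-up ratings: every teacher receives the same observation rating.
teachers = {
    "Teacher 1": {"observation": 4, "assessment": 2},
    "Teacher 2": {"observation": 4, "assessment": 3},
    "Teacher 3": {"observation": 4, "assessment": 5},
}

scores = {name: composite(r["observation"], r["assessment"])
          for name, r in teachers.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
assessment_ranking = sorted(teachers, key=lambda n: teachers[n]["assessment"],
                            reverse=True)

print(scores)
print(ranking == assessment_ranking)
```

Even with most of the weight on observation, the ordering of teachers here is decided entirely by the assessment component, because the observation ratings do not vary.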
1. Alignment between the content assessed and the content to be taught
2. Selection of an appropriate assessment
• Used for the purpose for which it was designed (proficiency vs. growth)
• Can accurately measure the knowledge of all students
• Adequate sensitivity to growth
3. Adjust for context/control for factors outside a teacher’s direct control (value-added)
Please be thoughtful about . . .
• Presentations and other recommended resources are available at:
– www.nwea.org
– www.kingsburycenter.org
– www.slideshare.net
• Contacting us:
NWEA Main Number 503-624-1951
E-mail: [email protected]
More information