Comparative Judgment as a Novel Approach to Operational Scoring, Rangefinding, and other Assessment Activities Jeffrey Steedle and Steve Ferrara Center for Next Generation Learning and Assessment CCSSO National Conference on Student Assessment, June 24, 2015
Transcript
Page 1

Comparative Judgment as a Novel Approach to Operational Scoring, Rangefinding, and other Assessment Activities

Jeffrey Steedle and Steve Ferrara
Center for Next Generation Learning and Assessment
CCSSO National Conference on Student Assessment, June 24, 2015

Page 2

Which of these essays is of higher quality?

A time when i felt free was, when i finally got released from being in the hospital for four days. The reason i was in the hospital was because i had a kidney stones which hurted really bad that i couldn't eat and stand up straight.So i decided to go to the emergency room to see what was going on.This was before i found out i had kidney stones…

A time I felt like I was free was when I was fifteen years old. At age fifteen, everybody is curious and anxious to do things on there own without parental consent. I was just another one of those fifteen year olds anxious to get my turn at something, but then I learned how to drive. A lot of people enjoy driving around, some people do it because they have to get to their job or because they need to go from one place to another…

Copyright © 2015 Pearson Education, Inc. or its affiliates. All rights reserved. 2

Page 3

Traditional, Rubric-Based Scoring

“Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected” (AERA, APA, & NCME, 2014).

Page 4

Comparative Judgment

Copyright © 2015 Pearson Education, Inc. or its affiliates. All rights reserved. 4

Prompt and responses from http://tea.texas.gov/student.assessment/staar/writing/

Page 5

Comparative Judgment Background

• Not a new idea (Law of Comparative Judgment; Thurstone, 1927)
• Relative judgments are more accurate than absolute judgments for
  – psychophysical phenomena (Stewart et al., 2005)
  – estimating distances, counting spelling errors (Shah et al., 2014)
  – evaluating physics and history exams (Gill & Bramley, 2008)
• Past uses in educational assessment
  – comparing the alignment of passing standards over time (Bramley, Bell, & Pollitt, 1998; Curcin et al., 2009)
  – estimating item difficulty (Walker et al., 2005)
  – scoring essays, portfolios, and short-answer responses (Pollitt, 2004; Whitehouse & Pollitt, 2012; Kimbell et al., 2009; Pollitt, 2012; Attali, 2014)

Page 6

Comparative Judgments

Copyright © 2015 Pearson Education, Inc. or its affiliates. All rights reserved. 6

Rubric-Based Scoring | Comparative Judgment
Scorers must internalize the definition of each score point | Judges must internalize the definition of "quality"
Scorers must agree exactly with the trainer and "anchor papers" | Judges must agree with the trainer about the relative quality of responses
Lengthy training and qualification (e.g., 16 hours) | Brief training and qualification (e.g., 3 hours)
Longer time per evaluation | Shorter time per evaluation
Requires fewer evaluations per response | Requires more evaluations per response

Page 7

Comparative Judgment Advantages

• Eliminating certain scorer biases / increased validity
• Faster time per evaluation
• Reduced cognitive demand
• Minimal training, qualification, and monitoring
• Reduced costs

Research is needed to test the potential advantages.

Page 8

Potential Applications in Scoring


Field Test Scoring (few responses to a large number of prompts):

• Rubric Scoring: many lengthy trainings, shorter overall evaluation time
• Comparative Judgment: many brief trainings, longer overall evaluation time

Educator Scoring (educators get buy-in and professional development):

• Rubric Scoring: fewer teachers in lengthy trainings; lower overall productivity, narrow PD reach
• Comparative Judgment: more teachers in brief trainings; greater overall productivity, expanded PD reach (possibly more efficient)

Page 9

Research Questions

1. How closely do comparative judgment measures correspond to rubric scores?

2. Do comparative judgments take less time than rubric scoring decisions?

3. How do comparative judgment measures and rubric scores compare in terms of validity coefficients?

4. How is the reliability of comparative judgment measures associated with the number of judgments per essay response?

Page 10

Method: Essay Prompts

• Two essay prompts from online administrations of a high school achievement testing program in a large state

• 4-point holistic rubric scoring, at least two scores per response, exact agreement required

• Samples of 200 responses for each prompt


Prompt | Exact Agmt. | Adj. Agmt. | r | Rubric Score Distribution (1 / 2 / 3 / 4)
1 | 70% | 29% | .81 | 25% / 40% / 25% / 10%
2 | 69% | 30% | .85 | 25% / 40% / 25% / 10%

Page 11

Method: Participants

• All with secondary English teaching experience

• No professional scorers to avoid interference between methods of evaluating student responses


• Prompt 1: 4 judges
• Prompt 2: 5 judges

Page 12

Method: Training

• Conducted via web conference by an experienced scoring trainer

• Judges learned rubric criteria (focus, organization, development, etc.), but the rubric was never shown

• Judges practiced making comparative judgments on “anchor pairs” involving “anchor papers” used in rubric-based training

• Qualification test accuracy ranged from 11 to 15 out of 15

• Training durations were 3 and 3.75 hours

Page 13

Method: Statistical Model

• Multivariate generalization of the Bradley-Terry model (Bradley & Terry, 1952)
• µA is the latent location of response A on a continuum of writing quality

[Figure: category response curves showing the probability of each judgment ("Prefer A," "Options equal," "Prefer B") as a function of the locations of responses A and B]

When µB < µA, “Prefer A” is the most probable judgment

When µB > µA, “Prefer B” is the most probable judgment

“Options equal” is never the most probable judgment

$$P(Y_{AB} = j \mid \mu_A, \mu_B, \boldsymbol{\tau}) = \pi_{ABj} = \frac{\exp\left(\sum_{s=1}^{j}\left[\mu_A - (\mu_B + \tau_s)\right]\right)}{\sum_{y=1}^{J}\exp\left(\sum_{s=1}^{y}\left[\mu_A - (\mu_B + \tau_s)\right]\right)}$$
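The probability formula on this slide can be computed directly. The sketch below is an illustrative implementation of the displayed adjacent-categories expression (function name and threshold values are our own, not from the deck), not the authors' estimation code.

```python
import math

def judgment_probs(mu_a, mu_b, tau):
    """Probability of each judgment category j = 1..J for the pair (A, B)
    under the Bradley-Terry extension shown above. `tau` holds the J
    thresholds tau_1..tau_J; categories run from "Prefer B" (low j)
    to "Prefer A" (high j)."""
    # Cumulative sums over s = 1..j of [mu_A - (mu_B + tau_s)]
    cums, running = [], 0.0
    for t in tau:
        running += mu_a - (mu_b + t)
        cums.append(running)
    m = max(cums)  # subtract the max before exponentiating, for stability
    exps = [math.exp(c - m) for c in cums]
    z = sum(exps)
    return [e / z for e in exps]
```

With symmetric thresholds such as tau = (-1.0, 0.0, 1.0), a response located above its partner makes the last category ("Prefer A") most probable, consistent with the bullets above.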

Page 14

Method: Pairing Responses

• Note: The most information about a response’s latent location is obtained by comparing it to another response of similar quality.

• The Generalized Grading Model (GGM) provided a predicted score for each response on the 1–4 rubric scale (based on text complexity, coherence, length, spelling, and vocabulary).

• Each response was paired with
  – 16 other responses (with the same or adjacent predicted score)
  – 2 anchor papers

• 2,000 judgments per prompt

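The pairing rule above can be sketched as follows. This is a hypothetical implementation (function name, sampling scheme, and seed are assumptions); the deck does not specify the authors' exact pairing algorithm, only that partners had the same or adjacent GGM-predicted score.

```python
import random

def make_pairs(predicted_scores, n_partners=16, seed=0):
    """Pair each response with up to `n_partners` others whose predicted
    rubric score (1-4) is the same or adjacent. `predicted_scores` maps
    response id -> predicted score. Illustrative sketch only."""
    rng = random.Random(seed)
    ids = list(predicted_scores)
    pairs = []
    for rid in ids:
        # Candidates: other responses within one predicted score point
        candidates = [other for other in ids if other != rid
                      and abs(predicted_scores[other] - predicted_scores[rid]) <= 1]
        for other in rng.sample(candidates, min(n_partners, len(candidates))):
            pairs.append((rid, other))
    return pairs
```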

Page 15

Method: Data Collection

• Responses were “chained” so that a judge only read one new response per judgment


A vs. B → B vs. C → C vs. D → D vs. E
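The chaining scheme above is simple to express in code; a minimal sketch (function name assumed):

```python
def chain(responses):
    """Order judgments so that consecutive pairs share one response:
    ["A", "B", "C", "D"] -> [("A", "B"), ("B", "C"), ("C", "D")].
    After the first pair, each judgment asks the judge to read
    only one new response."""
    return list(zip(responses, responses[1:]))
```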

Page 16

Results: Parameter Estimation


[Figure: histograms of comparative judgment measures by prompt]

Prompt 1: Mean = 2.4, SD = 0.99
Prompt 2: Mean = 2.13, SD = 1.04

Scale anchored by anchor paper scores, so most measures fall between 1.0 and 4.0

Page 17

Results: Correspondence


Measure | Prompt 1 Rubric | Prompt 1 Rounded CJ | Prompt 2 Rubric | Prompt 2 Rounded CJ
Mean | 2.20 | 2.40 | 2.20 | 2.21
Std. Deviation | 0.93 | 0.97 | 0.93 | 0.98

Rubric vs. Rounded CJ | Prompt 1 | Prompt 2
Exact Agmt. | 60.0% | 64.0%
Adj. Agmt. | 38.5% | 33.5%
Correlation | .78 | .76

60.0% exact agreement between rubric scores and rounded comparative judgment scores on Prompt 1

Slight tendency for comparative judgment to overestimate on Prompt 1

Better agreement overall on Prompt 2
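The exact and adjacent agreement rates reported above follow the usual definitions; a minimal sketch (function name assumed):

```python
def agreement(scores_a, scores_b):
    """Exact and adjacent agreement rates between two parallel lists of
    integer scores (e.g., rubric scores vs. rounded CJ measures)."""
    n = len(scores_a)
    exact = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    adjacent = sum(abs(a - b) == 1 for a, b in zip(scores_a, scores_b)) / n
    return exact, adjacent
```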

Page 18

Results: Judgment Time


 | Prompt 1 | Prompt 2 | Both
Mean (Rubric) | 121.2 s | 116.4 s | 119.4 s
Mean (CJ) | 116.7 s | 70.45 s | 93.5 s
Median (CJ) | 83.0 s | 45.0 s | 62.0 s

Some huge outliers in these data (e.g., 2,760 seconds)

Medians likely provide better measures of central tendency

Page 19

Results: Validity Coefficients


Correlations with a multiple-choice writing test (Prompt 1, Prompt 2):

• Rubric Score: .63, .69
• Continuous Comparative Judgment Measure: .67, .72
• Rounded Comparative Judgment Measure: .66, .71

Page 20

Results: Reliability

• In this context, "reliability" reflects judge behavior and is therefore akin to inter-rater reliability.
• High reliability translates into greater precision in estimating the perceived relative quality of responses.
• Reliability does not reflect correspondence between estimated scores and "true" scores. Studying this would require multiple responses from each student.


Reliability = consistency in judgments about the quality of a response relative to other responses
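The deck does not state how its reliability coefficient is computed. A common choice in comparative judgment studies is the scale-separation index, the share of observed variance in the measures that is not measurement error; the sketch below implements that index as an assumption, not the authors' documented method.

```python
from statistics import pvariance, mean

def separation_reliability(measures, std_errors):
    """Scale-separation reliability for comparative judgment measures:
    (observed variance - mean squared standard error) / observed variance.
    Assumed formula; the source deck does not define its coefficient."""
    observed = pvariance(measures)
    error = mean(se ** 2 for se in std_errors)
    return (observed - error) / observed
```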

Page 21

Results: Reliability

• Remove random samples of judgments, refit the model, recalculate reliability.


[Figure: reliability plotted against the average number of comparisons per response (0-20) for Prompt 1 and Prompt 2]

Reliability drops below .80 with a 50% reduction (~9 judgments per response)

Page 22

A Note on Number of Judgments

• TRUE or FALSE: If you have 200 responses and you want reliability of .80, you need about 200×9 = 1,800 judgments.

• FALSE: A judgment provides information about 2 responses, so you would need about 900 judgments (or 4.5 judgments per unique response).
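The arithmetic above is easy to get wrong, so here it is as a one-liner (function name assumed):

```python
def judgments_needed(n_responses, comparisons_per_response):
    """Each judgment covers two responses, so the total number of
    judgments is n * k / 2, not n * k."""
    return n_responses * comparisons_per_response // 2

# 200 responses at ~9 comparisons each -> 900 judgments, not 1,800
```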

Page 23

Conclusions

• Scores from comparative judgment correspond to rubric scores at a rate similar to that observed between two scorers (60–70% exact agreement; Ferrara & DeMauro, 2006).

• Comparative judgment measures appear to have higher validity coefficients than rubric scores.

• With 3-4 hours of comparative judgment training, judges can consistently judge the relative quality of responses, as reflected by high reliability coefficients.

• Time per comparative judgment appears to be less than time per rubric score.

Page 24

Future Research

• Agreement might be improved with improvements in the pairing process

• Potentially improve accuracy and efficiency by implementing adaptive comparative judgment (Pollitt, 2012)
  – Initial pairings are random
  – Subsequent pairings are based on preliminary score estimates
• Pilot rangefinding study
• Data-free form assembly and equating

Page 25

Pilot Rangefinding Results

• Six panelists made 106 judgments about 15 responses in 16 minutes (with reliability = .97).


[Figure: caterpillar plot of comparative judgment measures for Papers 01-15, ordered from lowest to highest measure, with bands marking the 1s through 5s score regions]

Page 26

Data-Free Forms Assembly and Equating

• Field testing (especially embedded) is useful for estimating item difficulties for forms assembly and/or pre-equating
• Problems with field testing:
  – It is not permitted or valued in some countries
  – There is backlash against it in the U.S. (i.e., using kids as unpaid laborers)
  – Test security may be compromised because performance tasks and essays are highly memorable
  – Examinees may not be motivated

Page 27

Which of these items is more difficult?


What single transformation is shown below?

• Reflection
• Rotation
• Translation
• No single transformation is shown.

The masses of two gorillas are given below.

A female gorilla has a mass of 85,000 grams.
A male gorilla has a mass of 220 kilograms.

What is the difference between these two masses in grams?

• 135,000 g
• 84,780 g
• 63,000 g
• 305,000 g

http://tea.texas.gov/Student_Testing_and_Accountability/Testing/State_of_Texas_Assessments_of_Academic_Readiness_(STAAR)/STAAR_Released_Test_Questions/

Page 28

Data-Free Forms Assembly and Equating

• To the extent that such judgments are accurate, comparative judgment can be used to put items (from different test forms) on a common scale of perceived item difficulty.

• Those measures could be used for
  – developing test forms of similar difficulty
  – equating test forms (with no common items or persons)

Page 29

Example Equating Process


1. Calibrate Form X (prior administration)
2. Calibrate Form Y (current administration)
3. Compare a sample of Form Y items to a sample of Form X "equating" items to calculate an equating constant
4. Apply the constant to all of Form Y
5. Locate the Form X performance standard on Form Y
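The equating-constant step can be sketched as a mean shift between the comparative judgment difficulty measures of the matched equating items. This is an assumed computation for illustration (the deck does not specify the method, and function names are hypothetical):

```python
from statistics import mean

def equating_constant(form_x_equating, form_y_equating):
    """Mean-difference constant from the CJ difficulty measures of the
    equating items on each form. Adding it to Form Y measures places
    them on the Form X scale. (Assumed mean-shift method.)"""
    return mean(form_x_equating) - mean(form_y_equating)

def equate_form_y(form_y_measures, constant):
    """Apply the constant to all Form Y item measures."""
    return [m + constant for m in form_y_measures]
```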

Page 30

Data-Free Forms Assembly and Equating

• Prior research has demonstrated that comparative judgment measures can be highly correlated with empirical item difficulties (e.g., Heldsinger & Humphry, 2014).

• Our study will focus on the accuracy of the comparative judgment measures and subsequent accuracy of raw-to-theta pre-equating tables, equating of performance standards across forms, and inferences about the relative difficulty of test forms.

Page 31

THANK YOU!

Center for Next Generation Learning and Assessment
Research and Innovation Network

[email protected]@pearson.com


Page 32

References

AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Attali, Y. (2014). A ranking method for evaluating constructed responses. Educational and Psychological Measurement, Online First, 1-14.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39, 324-345.
Bramley, T., Bell, J. F., & Pollitt, A. (1998). Assessing changes in standards over time using Thurstone paired comparisons. Education Research and Perspectives, 25(2), 1-24.
Curcin, M., Black, B., & Bramley, T. (2009). Standard maintaining by expert judgment on multiple-choice tests: A new use for the rank-ordering method. Paper presented at the British Educational Research Association Annual Conference, Manchester.
Elliot, S., Ferrara, S., Fisher, T., Klein, S., Pitoniak, M., & Steedle, J. (2010). Developing the EdSteps continuum. Washington, DC: Council of Chief State School Officers.
Ferrara, S., & DeMauro, G. E. (2006). Standardized assessment of individual achievement in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 579-621). Westport, CT: Praeger.
Gill, T., & Bramley, T. (2008). How accurate are examiners' judgments of script quality? An investigation of absolute and relative judgments in two units, one with a wide and one with a narrow 'zone of uncertainty'. Paper presented at the British Educational Research Association Annual Conference, Edinburgh, Scotland.
Heldsinger, S., & Humphry, S. (2010). Using the method of pairwise comparison to obtain reliable teacher assessments. The Australian Educational Researcher, 37(2), 1-19.
Heldsinger, S., & Humphry, S. (2014). Maintaining consistent metrics in standard setting. Murdoch, Western Australia: Murdoch University.
Kimbell, R., Wheeler, T., Stables, K., Shepard, T., Martin, F., Davies, D., . . . Whitehouse, G. (2009). E-scape portfolio assessment: Phase 3 report. London: Technology Education Research Unit, Goldsmiths College, University of London.
Pollitt, A. (2004). Let's stop marking exams. Paper presented at the IAEA Conference, Philadelphia, PA.
Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281-300.
Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., & Wainwright, M. (2014). When is it better to compare than to score? arXiv. http://arxiv.org/abs/1406.6618
Stewart, N., Brown, G. D. A., & Chater, N. (2005). Absolute identification by relative judgment. Psychological Review, 112(4), 881-911.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273-286.
Walker, M. E., Dorans, N. J., Kim, S., Vafis, G., & Fecko-Curtis, E. (2005). Alternative methods for obtaining item difficulty information. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
Whitehouse, C., & Pollitt, A. (2012). Using adaptive comparative judgement to obtain a highly reliable rank order in summative assessment. Manchester: The Assessment and Qualifications Alliance.
Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37.
Zahner, D., & Steedle, J. T. (2014). Evaluating performance task scoring comparability in an international testing program. Paper presented at the National Council on Measurement in Education Annual Meeting, Philadelphia, PA.
