Comparative Judgment as a Novel Approach to Operational Scoring, Rangefinding, and other Assessment Activities Jeffrey Steedle and Steve Ferrara Center for Next Generation Learning and Assessment CCSSO National Conference on Student Assessment, June 24, 2015
Which of these essays is of higher quality?
A time when i felt free was, when i finally got released from being in the hospital for four days. The reason i was in the hospital was because i had a kidney stones which hurted really bad that i couldn't eat and stand up straight.So i decided to go to the emergency room to see what was going on.This was before i found out i had kidney stones…
A time I felt like I was free was when I was fifteen years old. At age fifteen, everybody is curious and anxious to do things on there own without parental consent. I was just another one of those fifteen year olds anxious to get my turn at something, but then I learned how to drive. A lot of people enjoy driving around, some people do it because they have to get to their job or because they need to go from one place to another…
“Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected” (AERA, APA, & NCME, 2014).
• Rubric scoring: scorers must internalize the definition of each score point. Comparative judgment: judges must internalize the definition of “quality.”
• Rubric scoring: scorers must agree exactly with the trainer and “anchor papers.” Comparative judgment: judges must agree with the trainer about the relative quality of responses.
• Rubric scoring: lengthy training and qualification (e.g., 16 hours). Comparative judgment: brief training and qualification (e.g., 3 hours).
• Rubric scoring: longer time per evaluation. Comparative judgment: shorter time per evaluation.
• Rubric scoring: requires fewer evaluations per response. Comparative judgment: requires more evaluations per response.
Comparative Judgment Advantages
• Eliminating certain scorer biases (increased validity)
• Faster time per evaluation
• Reduced cognitive demand
• Minimal training, qualification, and monitoring
• Reduced costs
Research is needed to test the potential advantages.
When µB < µA, “Prefer A” is the most probable judgment
When µB > µA, “Prefer B” is the most probable judgment
“Options equal” is never the most probable judgment
\[
P(Y_{AB} = j \mid \mu_A, \mu_B, \tau) = \pi_{ABj} =
\frac{\exp\left(\sum_{s=1}^{j} \left[\mu_A - (\mu_B + \tau_s)\right]\right)}
     {\sum_{y=1}^{J} \exp\left(\sum_{s=1}^{y} \left[\mu_A - (\mu_B + \tau_s)\right]\right)}
\]
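As a concrete illustration, the category probabilities under this model can be computed from cumulative sums of the differences μ_A − (μ_B + τ_s). This is a minimal sketch; the function name and the threshold values are illustrative choices, not taken from the presentation.

```python
import math

def judgment_probs(mu_a, mu_b, tau):
    """Category probabilities for one paired comparison under the
    rating-scale-style model above: P(Y_AB = j) is proportional to
    exp(sum over s = 1..j of [mu_a - (mu_b + tau_s)]).

    tau: list of J category thresholds, one per judgment category.
    """
    cumsum, numerators = 0.0, []
    for t in tau:
        cumsum += mu_a - (mu_b + t)
        numerators.append(math.exp(cumsum))
    z = sum(numerators)
    return [n / z for n in numerators]

# Three categories, ordered "prefer B", "options equal", "prefer A".
# These threshold values are hypothetical, picked for illustration.
tau = [0.0, 1.0, -1.0]
p = judgment_probs(1.0, 0.0, tau)  # case with mu_A > mu_B
```

With these thresholds the sketch reproduces the properties listed above: when μ_A > μ_B the "Prefer A" category has the highest probability, and even when μ_A = μ_B the middle "options equal" category is not the most probable judgment.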
Method: Pairing Responses
• Note: The most information about a response’s latent location is obtained by comparing it to another response of similar quality.
• The Generalized Grading Model (GGM) provided a predicted score for each response on the 1–4 rubric scale (based on text complexity, coherence, length, spelling, and vocabulary).
• Each response was paired with:
– 16 other responses (with the same or adjacent predicted score)
– 2 anchor papers
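A minimal sketch of this pairing scheme, assuming random selection among eligible partners (the function name and selection rule are assumptions, not the authors’ implementation; anchor papers could be appended in the same way):

```python
import random

def build_pairs(responses, pairs_per_response=16):
    """responses: list of (response_id, predicted_score) tuples, where
    predicted_score is the GGM-style prediction on the 1-4 rubric scale.
    Pairs each response with others whose predicted score is the same
    or adjacent (differs by at most 1)."""
    pairs = []
    for rid, score in responses:
        eligible = [r for r, s in responses
                    if r != rid and abs(s - score) <= 1]
        chosen = random.sample(eligible,
                               min(pairs_per_response, len(eligible)))
        pairs.extend((rid, other) for other in chosen)
    return pairs
```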
Results: Reliability
• In this context, “reliability” reflects judge behavior and is therefore akin to inter-rater reliability.
• High reliability translates into greater precision in estimating the perceived relative quality of responses.
• Reliability does not reflect correspondence between estimated scores and “true” scores. Studying this would require multiple responses from each student.
• Scores from comparative judgment correspond to rubric scores at a rate similar to that observed between two scorers (60–70% exact agreement; Ferrara & DeMauro, 2006).
• Comparative judgment measures appear to have higher validity coefficients than rubric scores.
• With 3-4 hours of comparative judgment training, judges can consistently judge the relative quality of responses, as reflected by high reliability coefficients.
• Time per comparative judgment appears to be less than time per rubric score.
• Agreement might be improved with improvements in the pairing process.
• Potentially improve accuracy and efficiency by implementing adaptive comparative judgment (Pollitt, 2012):
– Initial pairings are random
– Subsequent pairings are based on preliminary score estimates
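The adaptive step can be sketched as pairing responses that sit next to each other on the preliminary estimated scale, since comparisons between responses of similar quality are the most informative. This is a simplified sketch under that assumption, not Pollitt’s algorithm:

```python
def adaptive_pairs(estimates):
    """estimates: dict mapping response_id -> preliminary quality
    estimate (e.g., from an initial round of random pairings).
    Returns pairs of responses that are adjacent on the estimated
    scale, so each comparison is between similar responses."""
    ordered = sorted(estimates, key=estimates.get)
    return list(zip(ordered, ordered[1:]))
```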
• Pilot rangefinding study
• Data-free form assembly and equating
• To the extent that such judgments are accurate, comparative judgment can be used to put items (from different test forms) on a common scale of perceived item difficulty.
• Those measures could be used for– Developing test forms of similar difficulty– Equating test forms (with no common items or persons)
1. Compare a sample of Form Y items to a sample of Form X “equating” items to calculate an equating constant.
2. Apply the constant to all of Form Y.
3. Locate the Form X performance standard on Form Y.
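Under a simple mean-shift assumption, the steps above can be sketched as follows (a hypothetical illustration of the idea; an operational program would use an established equating design):

```python
def equating_constant(x_difficulties, y_difficulties):
    """Mean difference between the perceived difficulties of the
    Form X "equating" items and the matched sample of Form Y items
    (a mean-shift sketch, assumed for illustration)."""
    mean_x = sum(x_difficulties) / len(x_difficulties)
    mean_y = sum(y_difficulties) / len(y_difficulties)
    return mean_x - mean_y

def place_on_x_scale(y_difficulties, constant):
    """Apply the constant to all of Form Y, putting its items on the
    Form X scale so the Form X performance standard can be located."""
    return [d + constant for d in y_difficulties]
```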
Data-Free Forms Assembly and Equating
• Prior research has demonstrated that comparative judgment measures can be highly correlated with empirical item difficulties (e.g., Heldsinger & Humphry, 2014).
• Our study will focus on the accuracy of the comparative judgment measures and subsequent accuracy of raw-to-theta pre-equating tables, equating of performance standards across forms, and inferences about the relative difficulty of test forms.
References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Attali, Y. (2014). A ranking method for evaluating constructed responses. Educational and Psychological Measurement, Online First, 1-14.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39, 324-345.
Bramley, T., Bell, J. F., & Pollitt, A. (1998). Assessing changes in standards over time using Thurstone paired comparisons. Education Research and Perspectives, 25(2), 1-24.
Curcin, M., Black, B., & Bramley, T. (2009). Standard maintaining by expert judgment on multiple-choice tests: A new use for the rank-ordering method. Paper presented at the British Educational Research Association Annual Conference, Manchester.
Elliot, S., Ferrara, S., Fisher, T., Klein, S., Pitoniak, M., & Steedle, J. (2010). Developing the EdSteps continuum. Washington, DC: Council of Chief State School Officers.
Ferrara, S., & DeMauro, G. E. (2006). Standardized assessment of individual achievement in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 579-621). Westport, CT: Praeger.
Gill, T., & Bramley, T. (2008). How accurate are examiners’ judgments of script quality? An investigation of absolute and relative judgments in two units, one with a wide and one with a narrow ‘zone of uncertainty’. Paper presented at the British Educational Research Association Annual Conference, Edinburgh, Scotland.
Heldsinger, S., & Humphry, S. (2010). Using the method of pairwise comparison to obtain reliable teacher assessments. The Australian Educational Researcher, 37(2), 1-19.
Heldsinger, S., & Humphry, S. (2014). Maintaining consistent metrics in standard setting. Murdoch, Western Australia: Murdoch University.
Kimbell, R., Wheeler, T., Stables, K., Shepard, T., Martin, F., Davies, D., . . . Whitehouse, G. (2009). E-scape portfolio assessment: Phase 3 report. London: Technology Education Research Unit, Goldsmiths College, University of London.
Pollitt, A. (2004). Let’s stop marking exams. Paper presented at the IAEA Conference, Philadelphia, PA.
Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281-300.
Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., & Wainwright, M. (2014). When is it better to compare than to score? arXiv. http://arxiv.org/abs/1406.6618
Stewart, N., Brown, G. D. A., & Chater, N. (2005). Absolute identification by relative judgment. Psychological Review, 112(4), 881-911.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273-286.
Walker, M. E., Dorans, N. J., Kim, S., Vafis, G., & Fecko-Curtis, E. (2005). Alternative methods for obtaining item difficulty information. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
Whitehouse, C., & Pollitt, A. (2012). Using adaptive comparative judgement to obtain a highly reliable rank order in summative assessment. Manchester: The Assessment and Qualifications Alliance.
Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37.
Zahner, D., & Steedle, J. T. (2014). Evaluating performance task scoring comparability in an international testing program. Paper presented at the National Council on Measurement in Education Annual Meeting, Philadelphia, PA.