PARCC Research Results. Karen E. Lochbaum, Pearson. June 22, 2016. Presented at the National Conference on Student Assessment, Philadelphia, PA.

PARCC Research Results

Karen E. Lochbaum

Pearson

June 22, 2016

Presented at the National Conference on Student Assessment, Philadelphia, PA


Research Questions

• Do scores assigned by the Intelligent Essay Assessor (IEA) agree with human scores as well as human scores agree with each other?
  ‒ Across all prompts and traits for all responses?
  ‒ Across prompts and traits for responses across subgroups?

• Do scores assigned by IEA agree with scores assigned by experts to validity papers as well as human scores do?


Series of Studies and Results

• 2014: Field Test Study
  • Promising Initial Results

• 2015: Year 1 Operational Studies
  • Performance
  • Validity responses
  • Subgroups

• 2016: Year 2 Operational Performance


2015 Research Summary


Year 1 Operational Study

• IEA served as 10% second score

• A subset of prompts received an additional human score
  • One of each prompt type
  • In each grade level

• Study compared IEA-human to human-human performance on 26 prompts


Summary of Human vs. IEA Exact Agreement Rates

The exact agreement between IEA and human readers was higher than it was between two human readers, and higher still between IEA and more experienced human backread scorers.


Summary of Human vs. IEA Exact Agreement Rates on Validity Responses

IEA’s exact agreement on validity responses was higher than that of human scorers.


Human vs. IEA Exact Agreement Rates by Subgroup

Min N count: 1,379/14,370 (2+ Races); Max N count: 43,693/448,339 (Whites)

The exact agreement between IEA and human readers was higher than it was between two human readers for various demographic subgroups.

Comparison            Af Am   Asian   Hispanic   2+ Races   Native Am
Human 2 vs. Human 1   68.6%   62.8%   67.1%      69.8%      65.4%
IEA Op vs. Human 1    74.0%   68.1%   72.5%      72.6%      72.6%

Comparison            White   ELL     SWD        Female     Male
Human 2 vs. Human 1   65.0%   71.2%   75.5%      63.9%      68.2%
IEA Op vs. Human 1    69.9%   76.3%   78.6%      69.0%      73.0%
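To make the subgroup comparison easier to check, the short sketch below recomputes the gap between IEA-human and human-human exact agreement for each subgroup. The percentages are copied from the table on this slide; the variable names are only illustrative.

```python
# Exact agreement rates (percent) copied from the subgroup table.
human_human = {
    "Af Am": 68.6, "Asian": 62.8, "Hispanic": 67.1, "2+ Races": 69.8,
    "Native Am": 65.4, "White": 65.0, "ELL": 71.2, "SWD": 75.5,
    "Female": 63.9, "Male": 68.2,
}
iea_human = {
    "Af Am": 74.0, "Asian": 68.1, "Hispanic": 72.5, "2+ Races": 72.6,
    "Native Am": 72.6, "White": 69.9, "ELL": 76.3, "SWD": 78.6,
    "Female": 69.0, "Male": 73.0,
}

# Gap in percentage points: positive means IEA agreed with Human 1
# more often than Human 2 did.
gaps = {group: round(iea_human[group] - human_human[group], 1)
        for group in human_human}

for group, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{group:10s} {gap:+.1f}")
```

Every gap is positive, ranging from 2.8 points (2+ Races) to 7.2 points (Native Am).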


2016 Operational Performance


A Reminder: Criteria for Operationally Deploying the AI Scoring Model

1. Primary Criteria – Based on validity responses
   • With smart routing applied as needed, IEA agreement is as good as or better than human agreement for both trait scores

2. Contingent Primary Criteria (if validity responses are not available)
   • With smart routing applied as needed, IEA-Human exact agreement is within 5.25% of Human-Human exact agreement for both trait scores

3. Secondary Criteria – Based on the training responses
   • With smart routing applied as needed, IEA-human differences on statistical measures for both traits are evaluated against quality criteria tolerances for subgroups with at least 50 responses
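The primary and contingent criteria can be sketched as a small decision function. This is only one reading of the slide: it assumes agreement is expressed in percent, that "within 5.25%" means IEA may be at most 5.25 percentage points below the human-human rate, and that smart routing has already been applied to the inputs. The function names and trait keys are hypothetical.

```python
TOLERANCE = 5.25  # percentage points; taken from the slide's "within 5.25%"

def passes_primary(iea, human):
    """Primary criterion: on validity responses, IEA agreement is at
    least as high as human agreement for every trait."""
    return all(iea[trait] >= human[trait] for trait in human)

def passes_contingent(iea_human, human_human, tolerance=TOLERANCE):
    """Contingent criterion (assumed reading): IEA-human exact agreement
    is no more than `tolerance` points below human-human, per trait."""
    return all(iea_human[trait] >= human_human[trait] - tolerance
               for trait in human_human)

def passes_deployment_check(validity=None, operational=None):
    """Use validity responses when available, otherwise fall back to the
    contingent criterion. Each argument is an (iea, human) pair of dicts
    keyed by trait, or None when that data is unavailable."""
    if validity is not None:
        return passes_primary(*validity)
    if operational is not None:
        return passes_contingent(*operational)
    raise ValueError("no agreement data supplied")

# Hypothetical numbers, for illustration only.
validity = ({"conventions": 82.0, "expression": 79.0},
            {"conventions": 80.0, "expression": 79.0})
print(passes_deployment_check(validity=validity))
```

The secondary criterion is a separate subgroup check and is not modeled here.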


Summary of Results: Comparison of IEA and Human Scores

• Means and standard deviations of IEA and human scores across all prompts were very close

• Some variability compared to the first human scorer might be expected item-by-item because IEA was trained on the “best” score available (backread, resolution, first read)
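One way to read "trained on the best score available" is as a fallback chain over score sources. The sketch below assumes the parenthetical lists the sources in priority order (backread, then resolution, then first read); that ordering, the function name, and the dictionary keys are assumptions rather than something the slide states.

```python
# Assumed priority order, most trusted first.
SCORE_PRIORITY = ("backread", "resolution", "first_read")

def best_available_score(scores):
    """Return the highest-priority score present for a response.

    `scores` maps a source name to a score, with missing sources
    either absent or set to None. Returns None if nothing is present."""
    for source in SCORE_PRIORITY:
        value = scores.get(source)
        if value is not None:
            return value
    return None

# Hypothetical responses: the first has a backread score, the second
# only a first-read score.
print(best_available_score({"first_read": 2, "backread": 3}))
print(best_available_score({"first_read": 1}))
```

Under this reading, a response's training label can differ from its first-read score, which is consistent with the item-by-item variability noted above.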


IEA Mean vs. Human Mean: Conventions Trait

IEA SD vs. Human SD: Conventions Trait

IEA Mean vs. Human Mean: Expressions Trait

IEA SD vs. Human SD: Expressions Trait


IEA vs. Human Validity Agreement: Conventions Trait

Blue means IEA performance exceeds human by > 5.25

Blue-Green means IEA at or above human

Green means IEA performance within 5.25 of human

Red means IEA performance lower than human by > 5.25

[Color-coded table; rows labeled by grade, 3 through 11; columns: Grade, Exact, SP0, SP1, SP2, SP3. Cell colors are not preserved in this transcript.]
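The color legend used in these agreement tables amounts to a four-way classification of the IEA-minus-human difference. The sketch below is one way to express it; the handling of exact boundary values (for example, a difference of exactly 5.25) is an assumption, since the legend does not spell it out, and the function name is hypothetical.

```python
def legend_color(iea, human, tolerance=5.25):
    """Map an IEA agreement rate and a human agreement rate (both in
    percent) to the legend color for that table cell."""
    diff = iea - human
    if diff > tolerance:
        return "Blue"        # IEA exceeds human by more than the tolerance
    if diff >= 0:
        return "Blue-Green"  # IEA at or above human
    if diff >= -tolerance:
        return "Green"       # IEA below human, but within the tolerance
    return "Red"             # IEA lower than human by more than the tolerance

# Hypothetical agreement rates, for illustration only.
for iea, human in [(80, 70), (72, 70), (66, 70), (60, 70)]:
    print(iea, human, legend_color(iea, human))
```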


IEA vs. Human Validity Agreement: Expressions Trait

Blue exceeds by > 5.25
Blue-Green exceeds
Green within 5.25
Red lower by > 5.25

[Color-coded table; rows labeled by grade, 3 through 11; columns: Grade, Exact, SP0, SP1, SP2, SP3, SP4. Cell colors are not preserved in this transcript.]


IEA vs. Human Agreement: Conventions Trait

Blue exceeds by > 5.25
Blue-Green exceeds
Green within 5.25
Red lower by > 5.25

[Color-coded table; rows labeled by grade, 3 through 11; columns: Grade, Exact, SP0, SP1, SP2, SP3. Cell colors are not preserved in this transcript.]


IEA vs. Human Agreement: Expressions Trait

Blue exceeds by > 5.25
Blue-Green exceeds
Green within 5.25
Red lower by > 5.25

[Color-coded table; rows labeled by grade, 3 through 11; columns: Grade, Exact, SP0, SP1, SP2, SP3, SP4. Cell colors are not preserved in this transcript.]


A Reminder: Subgroup Analyses

• For each prompt, we evaluated the performance of IEA for various subgroups

• We calculated various agreement indices (r, Kappa, Quadratic Kappa, Exact Agreement), comparing human-human results with IEA-human results

• We also looked at standardized mean differences (SMDs) between IEA and human scores

• We flagged differences for any groups based on the quality criteria:

Measure                        Threshold           Human-Machine Difference
Pearson Correlation            Less than 0.7       Greater than 0.1
Kappa                          Less than 0.4       Greater than 0.1
Quadratic Weighted Kappa       Less than 0.7       Greater than 0.1
Exact Agreement                Less than 65%       Greater than 5.25%
Standardized Mean Difference   Greater than 0.15
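A minimal sketch of how the listed indices and flags might be computed for one prompt and subgroup follows. The thresholds come from the quality criteria table; everything else is an assumption made for illustration: integer scores starting at 0, an SMD defined as the mean difference over a pooled standard deviation, the threshold applied to the IEA-human value, and the difference taken as human-human minus IEA-human.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev

def exact_agreement(a, b):
    """Fraction of responses on which two scorers gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def kappa(a, b, n_categories, quadratic=False):
    """Cohen's kappa, or quadratic weighted kappa when quadratic=True.
    Scores are assumed to be integers in range(n_categories)."""
    n = len(a)
    pairs, row, col = Counter(zip(a, b)), Counter(a), Counter(b)
    cats = range(n_categories)

    def weight(i, j):  # disagreement weight
        if quadratic:
            return (i - j) ** 2 / (n_categories - 1) ** 2
        return float(i != j)

    observed = sum(weight(i, j) * pairs[(i, j)] for i in cats for j in cats) / n
    expected = sum(weight(i, j) * row[i] * col[j] for i in cats for j in cats) / n ** 2
    return 1.0 - observed / expected  # undefined if expected is 0

def pearson_r(a, b):
    """Pearson correlation between two score lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

def smd(machine, human):
    """Standardized mean difference (assumed definition): machine mean
    minus human mean, divided by the pooled standard deviation."""
    pooled = sqrt((pstdev(machine) ** 2 + pstdev(human) ** 2) / 2)
    return (mean(machine) - mean(human)) / pooled

# measure: (threshold, largest allowed human-machine difference)
CRITERIA = {"pearson_r": (0.7, 0.1), "kappa": (0.4, 0.1),
            "qwk": (0.7, 0.1), "exact": (0.65, 0.0525)}

def quality_flags(human_human, iea_human, smd_value, smd_limit=0.15):
    """Return the list of quality flags raised for one prompt and subgroup."""
    flags = []
    for name, (threshold, max_diff) in CRITERIA.items():
        if iea_human[name] < threshold:
            flags.append(name + ": below threshold")
        if human_human[name] - iea_human[name] > max_diff:
            flags.append(name + ": below human-human")
    if abs(smd_value) > smd_limit:
        flags.append("smd: above limit")
    return flags
```

The measure dictionaries passed to `quality_flags` would be built by calling the functions above on human-human and IEA-human score pairs.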


Subgroup Analyses

• 29/55 prompts had no flags on either trait

• When flags did occur
  • Only for one or two groups
  • Only one or two of the quality measures
  • None sufficiently concerning to consider retraining

• Sometimes different measures indicated different results
  • Lower than humans on exact agreement
  • Higher on quadratic weighted kappa

• SMD flags were rare
  • Always indicated higher IEA scores than human scores
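The observation that exact agreement and quadratic weighted kappa can point in different directions follows from how the two are defined: exact agreement counts every disagreement equally, while quadratic weighted kappa penalizes large disagreements far more than adjacent ones. The toy example below uses invented scores (not study data) to show a scorer with lower exact agreement but higher quadratic weighted kappa.

```python
from collections import Counter

def exact_agreement(a, b):
    """Fraction of responses on which two scorers gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def quadratic_weighted_kappa(a, b, n_categories):
    """QWK for integer scores in range(n_categories). The usual
    1/(k-1)^2 weight normalization cancels in the ratio, so it is omitted."""
    n = len(a)
    pairs, row, col = Counter(zip(a, b)), Counter(a), Counter(b)
    cats = range(n_categories)
    observed = sum((i - j) ** 2 * pairs[(i, j)] for i in cats for j in cats) / n
    expected = sum((i - j) ** 2 * row[i] * col[j] for i in cats for j in cats) / n ** 2
    return 1.0 - observed / expected

# Invented scores on a 0-3 scale, for illustration only.
human_1 = [0, 0, 1, 1, 2, 2, 3, 3, 1, 2]
human_2 = [3, 0, 1, 1, 2, 2, 0, 3, 1, 2]  # two disagreements, each 3 points apart
iea     = [0, 1, 1, 1, 2, 2, 3, 2, 1, 1]  # three disagreements, each 1 point apart

print("exact:", exact_agreement(human_1, human_2), exact_agreement(human_1, iea))
print("QWK:  ", quadratic_weighted_kappa(human_1, human_2, 4),
      quadratic_weighted_kappa(human_1, iea, 4))
```

Here the second human matches the first on 8 of 10 responses and the automated scorer on only 7 of 10, yet the automated scorer's quadratic weighted kappa is much higher (about 0.82 versus about 0.14) because its misses are all adjacent.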


Summary of Subgroup Analyses


Spring 2016 Continuous Flow Performance


With 6.5M responses scored YTD


Summary

• Extensive research was conducted over three years to validate the use of the Continuous Flow system on the PARCC assessment

• Initial results indicate its successful operational use in 2016

• Continuous Flow combines the strengths and benefits of both human and automated scoring

• Continuous Flow performance exceeds that of a human-only scoring system while routing potentially challenging responses for further review
