NCSA, Detroit, June 2010
Kevin King, Utah State Office of Education
Sarah Susbury, Virginia Department of Education
Dona Carling, Measured Progress
Kelly Burling, Pearson
Chris Domaleski, National Center for the Improvement of Educational Assessment
For understandable reasons, many states deliver statewide assessments in both paper and computer modes.
Just as unsurprisingly, the comparability of scores across the two modes is a concern.
What about comparability issues when switching computer-based testing interfaces?
This session will explore the idea of comparability in the context of statewide testing using two administration modes.
Challenges addressed will include:
◦ satisfying peer review,
◦ the impact of switching testing providers,
◦ changing interfaces,
◦ adding tools and functionality, and
◦ adding item types.
The ability of a system to deliver data that can be compared in standard units of measurement and by standard statistical techniques with the data delivered by other systems. (online statistics dictionary)
Comparability refers to the commonality of score meaning across testing conditions including delivery modes, computer platforms, and scoring presentation. (Bennett, 2003)
From Peer Review Guidance (2007), Comparability of results: Many uses of State assessment results assume comparability of different types: comparability from year to year, from student to student, and from school to school. Although this is difficult to implement and to document, States have an obligation to show that they have made a reasonable effort to attain comparability, especially where locally selected assessments are part of the system.
Section 4, Technical Quality 4.4: When different test forms or formats are used, the State must ensure that the meaning and interpretation of results are consistent.
Has the State taken steps to ensure consistency of test forms over time?
If the State administers both an online and paper and pencil test, has the State documented the comparability of the electronic and paper forms of the test?
Section 4, Technical Quality 4.4, Possible Evidence: Documentation describing the State's approach to ensuring comparability of assessments and assessment results across groups and time.
Documentation of equating studies that confirm the comparability of the State’s assessments and assessment results across groups and across time, as well as follow-up documentation describing how the State has addressed any deficiencies.
Bennett, R.E. (2003). Online assessment and the comparability of score meaning. Princeton, NJ: Educational Testing Service.
U.S. Department of Education (2009). Standards and Assessments Peer Review Guidance: Information and Examples for Meeting Requirements of the No Child Left Behind Act of 2001. Washington, DC: U.S. Government Printing Office.
Kevin King, Assessment Development Coordinator
Previous concerns have centered on comparability between PBT and CBT
There is potentially more variability between one CBT system and another than between CBT and PBT
There are policy considerations about when to pursue comparability studies and when not to
• Which tests
  – 27 multiple-choice CRTs
    • Grades 3 – 11 English language arts
    • Grades 3 – 7 math, Pre-Algebra, Algebra 1, Geometry, Algebra 2
    • Grades 4 – 8 science, Earth Systems Science, Physics, Chemistry, Biology
• Timeline (total test administrations approximately 1.2 million); CBT participation by year:
  – 2001 – 2006: 4% – 8%
  – 2007: 8%
  – 2008: 50%
  – 2009: 66%
  – 2010: 80%
Focus on PBT to CBT
◦ Prior to this year, that is what was warranted

2006 (8% CBT participation)
◦ Item-by-item performance comparison
◦ Matched Samples Comparability Analyses, using NRT as a basis for the matched set
◦ ELA (3, 5, & 8), Math (3, 5, & 8), Science (5 & 8)
◦ Results
◦ Actionable conclusions
2006 Study Results

Content   Grade   Higher/Lower on CBT compared to PBT   Statistically significant
ELA       3       Lower                                 Yes
ELA       5       Lower                                 No
ELA       8       Higher                                No
Math      3       Higher                                Yes
Math      5       Higher                                Yes
Math      8       Higher                                Yes
Science   5       Lower                                 No
Science   8       Higher                                No
Conclusions: Additional investigations of mode-by-item interactions should be conducted.
Policy overtones: Rapid movement to 100% CBT impacts these investigations.
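The kind of matched-samples mode comparison reported above can be sketched in a few lines. The scores and the `mode_comparison` helper below are illustrative assumptions, not the study's actual data or procedure; a real analysis would use the NRT-matched samples and a proper significance test.

```python
# Hedged sketch of a matched-samples CBT vs. PBT comparison.
# All score values are invented illustration data.
import math
from statistics import mean, stdev

def mode_comparison(cbt_scores, pbt_scores):
    """Welch's t statistic and Cohen's d for CBT vs. PBT scale scores."""
    m1, m2 = mean(cbt_scores), mean(pbt_scores)
    s1, s2 = stdev(cbt_scores), stdev(pbt_scores)
    n1, n2 = len(cbt_scores), len(pbt_scores)
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the mean difference
    t = (m1 - m2) / se                       # Welch's t (approx. normal for large n)
    pooled = math.sqrt((s1**2 + s2**2) / 2)
    d = (m1 - m2) / pooled                   # Cohen's d effect size
    return t, d

cbt = [158, 162, 149, 171, 165, 155, 160, 168]
pbt = [150, 157, 152, 166, 159, 148, 154, 161]
t, d = mode_comparison(cbt, pbt)
print(f"t = {t:.2f}, d = {d:.2f}  (positive means CBT higher)")
```

A sign of the difference plus an effect size is what the "Higher/Lower on CBT" and "Statistically significant" columns of the 2006 results table summarize.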
• 2008 (50% CBT participation)
  – Focus on mode transition (i.e., from PBT one year to CBT the next year)
  – Determine PBT and CBT raw-score-to-scale-score (rs-ss) tables for all courses
  – Benefit to CBT if the PBT scale score is lower than the CBT scale score
  – Very few variances between rs-ss tables (no variances at the proficiency cut)
• Conclusions and Policy: move forward with CBT as the base for data decisions
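The rs-ss table check above amounts to diffing two lookup tables and asking whether any variance falls at the proficiency cut. The tables, scale scores, and cut below are invented for illustration.

```python
# Hedged sketch (invented numbers): diffing PBT vs. CBT raw-score-to-
# scale-score (rs-ss) tables and flagging any variance at the proficiency cut.
PROFICIENCY_CUT_RAW = 3  # hypothetical raw-score cut for "proficient"

pbt_table = {0: 120, 1: 135, 2: 147, 3: 160, 4: 172, 5: 185}  # raw -> scale
cbt_table = {0: 120, 1: 136, 2: 147, 3: 160, 4: 173, 5: 185}

# Collect raw scores where the two modes map to different scale scores.
variances = {raw: (pbt_table[raw], cbt_table[raw])
             for raw in pbt_table
             if pbt_table[raw] != cbt_table[raw]}

print("variances:", variances)
print("variance at proficiency cut:", PROFICIENCY_CUT_RAW in variances)
```

In this toy example there are a few one-point variances but none at the cut, mirroring the 2008 finding that supported moving forward with CBT as the base.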
Issues
◦ Variations due to local configurations
  Screen resolution (e.g., 800x600 vs. 1280x1024)
  Monitor size
  Browser differences
How much of an issue is this?
A current procurement dilemma
◦ How to mitigate
What could be different will be different
◦ Item displays
  How items are displayed
  As graphic images, with text wrapping
  How graphics are handled by the different systems
◦ Item transfer between providers and interoperability concerns
  Brought about re-entering of items and potential variations in presentation
How to make decisions as the same system with the same vendor advances
◦ Tool availability
◦ Text wrapping
◦ HTML coding

When do you take advantage of technology advances, but sacrifice item similarity?
◦ Portal/interface for item access
  How students navigate the system
  How tools function (e.g., highlighter, cross out, item selection)
  How advanced tools function
    Text enlarger
    Color contrast
Utah will be bringing on Technology Enhanced items.
How will the different tests from year to year be comparable?
What about PBT versions administered as accommodations?
Technical disruptions during testing
Student familiarity with workstation and environment
PBT and CBT
CBT and PBT
Vendors
Operating systems and browsers
Curriculum changes
Forced us to really address
◦ why is comparability important
◦ AND
◦ what does that mean?
Is “equal” always “fair”?
Sarah Susbury, Director of Test Administration, Scoring, and Reporting
• Which tests
  • Grades 3 – 8 Reading and End-of-Course (EOC) Reading
  • Grades 3 – 8 Math, EOC Algebra 1, EOC Geometry, EOC Algebra 2
  • Grades 3, 5, and 8 Science, EOC Earth Science, EOC Biology, EOC Chemistry
  • Grade 3 History, Grades 5 – 8 History, EOC VA & US History, EOC World History I, EOC World History II, EOC World Geography
• Phased approach (EOC --> MS --> ES)
• Participation by districts was voluntary
• Timeline (growth in online tests administered)
  – 2001: 1,700
  – Spring 2004: 265,000
  – Spring 2006: 924,000
  – Spring 2009: 1.67 million
  – Spring 2010: 2.05 million
• Conducted comparability studies with the online introduction of each new EOC subject (2001 – 2004)
• Students were administered a paper/pencil test form and an equivalent online test form in the same administration
  – Students were not provided scores until both tests were completed
  – Students were aware they would be awarded the higher of the two scores (motivation)
• Results indicated varying levels of comparability.
• Due to Virginia's graduation requirement that students pass certain EOC tests, the decision was made to equate online tests and paper/pencil tests separately.
• Required planning to ensure adequate n-counts in both modes would be available for equating purposes.
• Comparability has improved over time.
Some accommodations transfer easily between modes:
◦ Read-aloud and audio test administrations
◦ Allowable manipulatives (bilingual dictionary, calculator, etc.)
◦ Answer transcription
◦ Visual aids, magnifiers

Other accommodations do not readily transfer:
◦ Flexible administration (variable number of test items at a time)
◦ Braille (cost of Braille writers)
◦ Large Print forms (ability to hold the test form)
• Screen resolution: changing from 800 x 600 to 1024 x 768 pixels
  – Eliminating older hardware from use for online testing
  – Changing the amount of scrolling needed for full view
  – Revise flow of text?
• Desktop vs. laptop computers
  – Less of an issue than in the early 2000s
• Laptop computers vs. "netbooks"
  – Variability of "netbooks"
New vendor or changes in current vendor's system
◦ Test navigation may vary
  Advancing or returning to items
  Submitting a test for scoring
  Audio player controls
◦ Test-taking tools & strategies may vary
  Available tools (e.g., highlighter, choice eliminator, mark for review, etc.)
◦ Changes in test security
  Prominent display of student's name on screen?
Virginia is implementing technology-enhanced items simultaneously with revised Standards of Learning (SOL)
◦ Mathematics SOL
  Revised mathematics standards approved in 2009
  Field testing (embedded) new technology-enhanced math items during 2010-2011 test administrations
  Revised math assessments implemented in 2011-2012 with new standard setting
◦ English and Science SOL
  Revised English and science standards approved in 2010
  Field testing (embedded) new English and science items during 2010-2011 test administrations
  Revised assessments (reading, writing, and science) implemented in 2013
Sometimes there is no choice:
◦ New technology; prior technology becomes obsolete
◦ Procurement changes/decisions

Sometimes there is a choice (or timing options):
◦ Advances in technology
◦ Advances in assessment concepts

A Shared Responsibility
◦ Provide teachers with training and exposure to changes in time to impact instruction and test prep.
  Systems, test navigation, test items, accommodations, etc.
◦ Provide students with training and exposure to changes prior to test preparation and prior to testing
  Practice tests in the new environment, sample new item types, etc.
◦ Communicate changes and change processes to all stakeholders in the educational community
Kelly Burling, Pearson
When
What
Why
How
What's Next

Whenever!

Comparability Studies for Operational Systems & Comparability Studies for Research Purposes
Operational
  Introducing CBT
  Curricular changes
  New item types
  New provider
  New interface
  Any time there are changes in a high-stakes environment

Research
  Introducing CBT
  Curricular changes
  New item types
  New provider
  New interface
  Changes over time
  Changes in technology use in schools
See slides 5, 6, 7, & 8
Evaluation Criteria
◦ Validity
◦ Psychometric
◦ Statistical

Wang, T., & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive testing: Issues, criteria and an example. Journal of Educational Measurement, 38, 19–49.

User experience & systems designs
Cognitive psychology
Construct dimensionality
Relationships to external criterion variables
Sub-group differences

Score distribution
Reliability
Conditional SEM
Mean difference
Propensity scores

Item level
◦ Mean difference
◦ IRT parameter differences
◦ Response distributions
◦ DIF
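One common way to operationalize item-level DIF is the Mantel-Haenszel procedure, here with mode (CBT vs. PBT) as the grouping variable and examinees stratified by ability level. The counts below are invented, and this is a sketch of the statistic only, not any vendor's implementation.

```python
# Hedged sketch: Mantel-Haenszel odds ratio for mode DIF on a single item.
# Each stratum holds (CBT right, CBT wrong, PBT right, PBT wrong) counts;
# all numbers are invented illustration data.
import math

strata = [
    (30, 70, 28, 72),   # low scorers
    (55, 45, 50, 50),   # middle scorers
    (80, 20, 78, 22),   # high scorers
]

# alpha_MH = sum(A*D/N) / sum(B*C/N) with CBT as the reference group.
num = sum(r1 * w2 / (r1 + w1 + r2 + w2) for r1, w1, r2, w2 in strata)
den = sum(r2 * w1 / (r1 + w1 + r2 + w2) for r1, w1, r2, w2 in strata)
alpha_mh = num / den                   # ~1.0 means no mode DIF on this item

delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta metric; |delta| < 1 is negligible
print(f"MH odds ratio = {alpha_mh:.3f}, delta = {delta_mh:.3f}")
```

Flagging items whose delta exceeds a chosen threshold is one way to answer "to what extent do the same items produce different statistics by mode?"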
Evaluating the assumptions underlying the
◦ scoring model
◦ test characteristics
◦ study design

Performance assessments
  E-portfolio with digitally created content, e-portfolio with traditional content, physical portfolio

Platforms
Devices
Data mining
Chris Domaleski, National Center for the Improvement of Educational Assessment
• NCLB Federal Peer Review Guidance (4.4) requires the state to document the comparability of the electronic and paper forms of the test
• AERA, APA, and NCME Joint Standards (4.10) “a clear rationale and supporting evidence should be provided for any claim that scores earned on different forms of a test may be used interchangeably.”
• Design
  – Item and form development processes (e.g., comparable blueprints and specifications)
  – Procedures to ensure comparable presentation of and interaction with items (e.g., can examinees review the entire passage when responding to passage-dependent items?)
    • Don't forget within-mode consistency. For example, do items render consistently for all computer users?
  – Adequate pilot and field testing of each mode
• Administration
  – Certification process for computer-based administration to ensure technical requirements are met
  – Examinees have an opportunity to gain familiarity with the assessment mode
  – Resources (e.g., calculator, marking tools) are equivalent
  – Accommodations are equivalent
• Analyses
  – Comparability of item statistics
    • To what extent do the same items produce different statistics by mode?
    • Are there differences by item 'bundles' (e.g., passage or stimulus dependent items)?
    • DIF studies
  – Comparability of scores
    • Comparability of total score by tests (e.g., grade, content)
    • Comparability of total score by group (e.g., SWD, ELL, etc.)
    • Comparability of total score by ability regions (e.g., by quartiles, TCC correspondence)
    • DTF
    • Classification consistency studies
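A classification consistency study asks whether matched examinees land in the same achievement level under each mode. The cut scores and matched score pairs below are hypothetical, and the helper is ours for illustration.

```python
# Hedged sketch (invented data): classification consistency across modes,
# i.e., the proportion of matched examinees placed in the same achievement
# level by their CBT and PBT scale scores.
CUTS = [400, 450, 500]   # hypothetical scale-score cuts (basic/proficient/advanced)

def level(score):
    """Achievement level index implied by the scale-score cuts (0..3)."""
    return sum(score >= c for c in CUTS)

# Matched (CBT score, PBT score) pairs for the same examinees.
pairs = [(395, 410), (455, 452), (470, 445), (510, 505), (430, 433)]

agree = sum(level(cbt) == level(pbt) for cbt, pbt in pairs)
consistency = agree / len(pairs)
print(f"classification consistency = {consistency:.2f}")
```

Disagreements concentrated near a cut (as in the first and third pairs here) matter most in a high-stakes setting, which is why consistency at the proficiency cut gets special attention.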
No single approach to demonstrating comparability exists, and no single piece of evidence is likely to be sufficient.

Don't assume that findings apply to all grades, content areas, or subgroups.

Item type may interact with presentation mode.

Design considerations
◦ Are inferences based on equivalent groups? If so, how is this supported?
◦ Are inferences based on repeated measures? If so, are order effects addressed?

Be clear about the standard of evidence required.
◦ Effect size?
◦ Classification differences?