NCSA, Detroit, June 2010
Kevin King, Utah State Office of Education
Sarah Susbury, Virginia Department of Education
Dona Carling, Measured Progress
Kelly Burling, Pearson
Chris Domaleski, National Center for the Improvement of Educational Assessment
For understandable reasons, many states deliver statewide assessments in both paper and computer modes.
Just as unsurprisingly, the comparability of scores across the two modes is a concern.
What about comparability issues when switching computer-based testing interfaces?
This session will explore the idea of comparability in the context of statewide testing using two administration modes.
Challenges addressed will include:
◦ satisfying peer review,
◦ the impact of switching testing providers,
◦ changing interfaces,
◦ adding tools and functionality, and
◦ adding item types.
The ability of a system to deliver data that can be compared in standard units of measurement and by standard statistical techniques with the data delivered by other systems. (online statistics dictionary)
Comparability refers to the commonality of score meaning across testing conditions including delivery modes, computer platforms, and scoring presentation. (Bennett, 2003)
From Peer Review Guidance (2007), Comparability of results: Many uses of State assessment results assume comparability of different types: comparability from year to year, from student to student, and from school to school. Although this is difficult to implement and to document, States have an obligation to show that they have made a reasonable effort to attain comparability, especially where locally selected assessments are part of the system.
Section 4, Technical Quality 4.4: When different test forms or formats are used, the State must ensure that the meaning and interpretation of results are consistent.
Has the State taken steps to ensure consistency of test forms over time?
If the State administers both an online and paper and pencil test, has the State documented the comparability of the electronic and paper forms of the test?
Section 4, Technical Quality 4.4, Possible Evidence: Documentation describing the State's approach to ensuring comparability of assessments and assessment results across groups and time.
Documentation of equating studies that confirm the comparability of the State’s assessments and assessment results across groups and across time, as well as follow-up documentation describing how the State has addressed any deficiencies.
Bennett, R.E. (2003). Online assessment and the comparability of score meaning. Princeton, NJ: Educational Testing Service.
U.S. Department of Education (2009). Standards and Assessments Peer Review Guidance: Information and Examples for Meeting Requirements of the No Child Left Behind Act of 2001. Washington, DC: U.S. Government Printing Office.
Kevin King, Assessment Development Coordinator
Previous concerns have centered on comparability between PBT and CBT
There is potentially more variability between one CBT system and another than between CBT and PBT
There are policy considerations about when to pursue comparability studies and when not to
• Which tests
  – 27 multiple-choice CRTs
    • Grades 3 – 11 English language arts
    • Grades 3 – 7 math, Pre-Algebra, Algebra 1, Geometry, Algebra 2
    • Grades 4 – 8 science, Earth Systems Science, Physics, Chemistry, Biology
• Timeline (total test administrations approximately 1.2 million); CBT participation by year:
  – 2001 – 2006: 4% – 8%
  – 2007: 8%
  – 2008: 50%
  – 2009: 66%
  – 2010: 80%
Focus on PBT to CBT
◦ Prior to this year, that is what was warranted

2006 (8% CBT participation)
◦ Item-by-item performance comparison
◦ Matched Samples Comparability Analyses, using NRT as a basis for the matched set
◦ ELA (3, 5, & 8), Math (3, 5, & 8), Science (5 & 8)
◦ Results
◦ Actionable conclusions
2006 Study Results

Content   Grade   Higher/Lower on CBT compared to PBT   Statistically significant
ELA       3       Lower                                 Yes
ELA       5       Lower                                 No
ELA       8       Higher                                No
Math      3       Higher                                Yes
Math      5       Higher                                Yes
Math      8       Higher                                Yes
Science   5       Lower                                 No
Science   8       Higher                                No
Conclusions: Additional investigations of mode-by-item interactions should be conducted.
Policy overtones: Rapid movement to 100% CBT impacts these investigations.
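The kind of matched-samples mode comparison reported above can be sketched in a few lines. The scores and the `mode_comparison` helper below are illustrative assumptions, not the study's actual data or procedure; a real analysis would use the NRT-matched samples and a proper significance test.

```python
# Hedged sketch of a matched-samples CBT vs. PBT comparison.
# All score values are invented illustration data.
import math
from statistics import mean, stdev

def mode_comparison(cbt_scores, pbt_scores):
    """Welch's t statistic and Cohen's d for CBT vs. PBT scale scores."""
    m1, m2 = mean(cbt_scores), mean(pbt_scores)
    s1, s2 = stdev(cbt_scores), stdev(pbt_scores)
    n1, n2 = len(cbt_scores), len(pbt_scores)
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the mean difference
    t = (m1 - m2) / se                       # Welch's t (approx. normal for large n)
    pooled = math.sqrt((s1**2 + s2**2) / 2)
    d = (m1 - m2) / pooled                   # Cohen's d effect size
    return t, d

cbt = [158, 162, 149, 171, 165, 155, 160, 168]
pbt = [150, 157, 152, 166, 159, 148, 154, 161]
t, d = mode_comparison(cbt, pbt)
print(f"t = {t:.2f}, d = {d:.2f}  (positive means CBT higher)")
```

A sign of the difference plus an effect size is what the "Higher/Lower on CBT" and "Statistically significant" columns of the 2006 results table summarize.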
• 2008 (50% CBT participation)
  – Focus on mode transition (i.e., from PBT one year to CBT the next year)
  – Determine PBT and CBT raw-score-to-scale-score (rs-ss) tables for all courses
  – Benefit to CBT if the PBT scale score is lower than the CBT scale score
  – Very few variances between rs-ss tables (no variances at the proficiency cut)
• Conclusions and Policy: move forward with CBT as the base for data decisions
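The rs-ss table check above amounts to diffing two lookup tables and asking whether any variance falls at the proficiency cut. The tables, scale scores, and cut below are invented for illustration.

```python
# Hedged sketch (invented numbers): diffing PBT vs. CBT raw-score-to-
# scale-score (rs-ss) tables and flagging any variance at the proficiency cut.
PROFICIENCY_CUT_RAW = 3  # hypothetical raw-score cut for "proficient"

pbt_table = {0: 120, 1: 135, 2: 147, 3: 160, 4: 172, 5: 185}  # raw -> scale
cbt_table = {0: 120, 1: 136, 2: 147, 3: 160, 4: 173, 5: 185}

# Collect raw scores where the two modes map to different scale scores.
variances = {raw: (pbt_table[raw], cbt_table[raw])
             for raw in pbt_table
             if pbt_table[raw] != cbt_table[raw]}

print("variances:", variances)
print("variance at proficiency cut:", PROFICIENCY_CUT_RAW in variances)
```

In this toy example there are a few one-point variances but none at the cut, mirroring the 2008 finding that supported moving forward with CBT as the base.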
Issues
◦ Variations due to local configurations
  Screen resolution (e.g., 800x600 vs. 1280x1024)
  Monitor size
  Browser differences
How much of an issue is this?
A current procurement dilemma
◦ How to mitigate
What could be different will be different
◦ Item displays
  How items are displayed
  As graphic images, with text wrapping
  How graphics are handled by the different systems
◦ Item transfer between providers and interoperability concerns
  Brought about re-entering of items and potential variations in presentation
How to make decisions as the same system with the same vendor advances
◦ Tool availability
◦ Text wrapping
◦ HTML coding

When do you take advantage of technology advances, but sacrifice item similarity?
◦ Portal/interface for item access
  How students navigate the system
  How tools function (e.g., highlighter, cross out, item selection)
  How advanced tools function
    Text enlarger
    Color contrast
Utah will be bringing on Technology Enhanced items.
How will the different tests from year to year be comparable?
What about PBT versions administered as accommodations?
Technical disruptions during testing
Student familiarity with workstation and environment
PBT and CBT
CBT and PBT
Vendors
Operating systems and browsers
Curriculum changes
Forced us to really address
◦ why is comparability important
◦ AND
◦ what does that mean?
Is “equal” always “fair”?
Sarah Susbury, Director of Test Administration, Scoring, and Reporting
• Which tests
  • Grades 3 – 8 Reading and End-of-Course (EOC) Reading
  • Grades 3 – 8 Math, EOC Algebra 1, EOC Geometry, EOC Algebra 2
  • Grades 3, 5, and 8 Science, EOC Earth Science, EOC Biology, EOC Chemistry
  • Grade 3 History, Grades 5 – 8 History, EOC VA & US History, EOC World History I, EOC World History II, EOC World Geography
• Phased approach (EOC --> MS --> ES)
• Participation by districts was voluntary
• Timeline (growth in online tests administered)
  – 2001: 1,700
  – Spring 2004: 265,000
  – Spring 2006: 924,000
  – Spring 2009: 1.67 million
  – Spring 2010: 2.05 million
• Conducted comparability studies with the online introduction of each new EOC subject (2001 – 2004)
• Students were administered a paper/pencil test form and an equivalent online test form in the same administration
  – Students were not provided scores until both tests were completed
  – Students were aware they would be awarded the higher of the two scores (motivation)
• Results indicated varying levels of comparability.
• Due to Virginia's graduation requirement that students pass certain EOC tests, the decision was made to equate online tests and paper/pencil tests separately.
• Required planning to ensure adequate n-counts in both modes would be available for equating purposes.
• Comparability has improved over time.
Some accommodations transfer easily between modes:
◦ Read-aloud and audio test administrations
◦ Allowable manipulatives (bilingual dictionary, calculator, etc.)
◦ Answer transcription
◦ Visual aids, magnifiers

Other accommodations do not readily transfer:
◦ Flexible administration (variable number of test items at a time)
◦ Braille (cost of Braille writers)
◦ Large Print forms (ability to hold the test form)
• Screen resolution: changing from 800 x 600 to 1024 x 768 pixels
  – Eliminating older hardware from use for online testing
  – Changing the amount of scrolling needed for full view
  – Revise flow of text?
• Desktop vs. laptop computers
  – Less of an issue than in the early 2000s
• Laptop computers vs. "netbooks"
  – Variability of "netbooks"
New vendor or changes in current vendor's system
◦ Test navigation may vary
  Advancing or returning to items
  Submitting a test for scoring
  Audio player controls
◦ Test-taking tools & strategies may vary
  Available tools (e.g., highlighter, choice eliminator, mark for review, etc.)
◦ Changes in test security
  Prominent display of student's name on screen?
Virginia is implementing technology-enhanced items simultaneously with revised Standards of Learning (SOL)
◦ Mathematics SOL
  Revised mathematics standards approved in 2009
  Field testing (embedded) new technology-enhanced math items during 2010-2011 test administrations
  Revised math assessments implemented in 2011-2012 with new standard setting
◦ English and Science SOL
  Revised English and science standards approved in 2010
  Field testing (embedded) new English and science items during 2010-2011 test administrations
  Revised assessments (reading, writing, and science) implemented in 2013
Sometimes there is no choice:
◦ New technology; prior technology becomes obsolete
◦ Procurement changes/decisions

Sometimes there is a choice (or timing options):
◦ Advances in technology
◦ Advances in assessment concepts

A Shared Responsibility
◦ Provide teachers with training and exposure to changes in time to impact instruction and test prep.
  Systems, test navigation, test items, accommodations, etc.
◦ Provide students with training and exposure to changes prior to test preparation and prior to testing
  Practice tests in the new environment, sample new item types, etc.
◦ Communicate changes and change processes to all stakeholders in the educational community
Kelly Burling, Pearson
When
What
Why
How
What's Next

Whenever!

Comparability Studies for Operational Systems & Comparability Studies for Research Purposes
Operational
  Introducing CBT
  Curricular changes
  New item types
  New provider
  New interface
  Any time there are changes in a high-stakes environment

Research
  Introducing CBT
  Curricular changes
  New item types
  New provider
  New interface
  Changes over time
  Changes in technology use in schools
See slides 5, 6, 7, & 8
Evaluation Criteria
◦ Validity
◦ Psychometric
◦ Statistical

Wang, T., & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive testing: Issues, criteria and an example. Journal of Educational Measurement, 38, 19–49.

User experience & systems designs
Cognitive psychology
Construct dimensionality
Relationships to external criterion variables
Sub-group differences

Score distribution
Reliability
Conditional SEM
Mean difference
Propensity scores

Item level
◦ Mean difference
◦ IRT parameter differences
◦ Response distributions
◦ DIF
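One common way to operationalize item-level DIF is the Mantel-Haenszel procedure, here with mode (CBT vs. PBT) as the grouping variable and examinees stratified by ability level. The counts below are invented, and this is a sketch of the statistic only, not any vendor's implementation.

```python
# Hedged sketch: Mantel-Haenszel odds ratio for mode DIF on a single item.
# Each stratum holds (CBT right, CBT wrong, PBT right, PBT wrong) counts;
# all numbers are invented illustration data.
import math

strata = [
    (30, 70, 28, 72),   # low scorers
    (55, 45, 50, 50),   # middle scorers
    (80, 20, 78, 22),   # high scorers
]

# alpha_MH = sum(A*D/N) / sum(B*C/N) with CBT as the reference group.
num = sum(r1 * w2 / (r1 + w1 + r2 + w2) for r1, w1, r2, w2 in strata)
den = sum(r2 * w1 / (r1 + w1 + r2 + w2) for r1, w1, r2, w2 in strata)
alpha_mh = num / den                   # ~1.0 means no mode DIF on this item

delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta metric; |delta| < 1 is negligible
print(f"MH odds ratio = {alpha_mh:.3f}, delta = {delta_mh:.3f}")
```

Flagging items whose delta exceeds a chosen threshold is one way to answer "to what extent do the same items produce different statistics by mode?"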
Evaluating the assumptions underlying the
◦ scoring model
◦ test characteristics
◦ study design

Performance assessments
  E-portfolio with digitally created content, e-portfolio with traditional content, physical portfolio

Platforms
Devices
Data mining
Chris Domaleski, National Center for the Improvement of Educational Assessment
• NCLB Federal Peer Review Guidance (4.4) requires the state to document the comparability of the electronic and paper forms of the test
• AERA, APA, and NCME Joint Standards (4.10) “a clear rationale and supporting evidence should be provided for any claim that scores earned on different forms of a test may be used interchangeably.”
• Design
  – Item and form development processes (e.g., comparable blueprints and specifications)
  – Procedures to ensure comparable presentation of and interaction with items (e.g., can examinees review the entire passage when responding to passage-dependent items?)
    • Don't forget within-mode consistency. For example, do items render consistently for all computer users?
  – Adequate pilot and field testing of each mode
• Administration
  – Certification process for computer-based administration to ensure technical requirements are met
  – Examinees have an opportunity to gain familiarity with the assessment mode
  – Resources (e.g., calculator, marking tools) are equivalent
  – Accommodations are equivalent
• Analyses
  – Comparability of item statistics
    • To what extent do the same items produce different statistics by mode?
    • Are there differences by item 'bundles' (e.g., passage or stimulus dependent items)?
    • DIF studies
  – Comparability of scores
    • Comparability of total score by tests (e.g., grade, content)
    • Comparability of total score by group (e.g., SWD, ELL, etc.)
    • Comparability of total score by ability regions (e.g., by quartiles, TCC correspondence)
    • DTF
    • Classification consistency studies
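A classification consistency study asks whether matched examinees land in the same achievement level under each mode. The cut scores and matched score pairs below are hypothetical, and the helper is ours for illustration.

```python
# Hedged sketch (invented data): classification consistency across modes,
# i.e., the proportion of matched examinees placed in the same achievement
# level by their CBT and PBT scale scores.
CUTS = [400, 450, 500]   # hypothetical scale-score cuts (basic/proficient/advanced)

def level(score):
    """Achievement level index implied by the scale-score cuts (0..3)."""
    return sum(score >= c for c in CUTS)

# Matched (CBT score, PBT score) pairs for the same examinees.
pairs = [(395, 410), (455, 452), (470, 445), (510, 505), (430, 433)]

agree = sum(level(cbt) == level(pbt) for cbt, pbt in pairs)
consistency = agree / len(pairs)
print(f"classification consistency = {consistency:.2f}")
```

Disagreements concentrated near a cut (as in the first and third pairs here) matter most in a high-stakes setting, which is why consistency at the proficiency cut gets special attention.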
No single approach to demonstrating comparability exists, and no single piece of evidence is likely to be sufficient.

Don't assume that findings apply to all grades, content areas, or subgroups.

Item type may interact with presentation mode.

Design considerations
◦ Are inferences based on equivalent groups? If so, how is this supported?
◦ Are inferences based on repeated measures? If so, are order effects addressed?

Be clear about the standard of evidence required.
◦ Effect size?
◦ Classification differences?