Can Large-Scale Tests be Fair to All Students? Bias Issues Related to WASL
Catherine S. Taylor, University of Washington/OSPI
Yoonsun Lee, OSPI
Johnnie McKinley, University of Washington
March 29, 2007
Jan 13, 2016
Focus of this presentation is on three studies:
• Study 1: What can we learn from Bias and Sensitivity Review procedures used for WASL (2004)
• Study 2: Report of input from two Public Forums on Bias and Sensitivity (2004)
– Yakima
– Seattle
• Study 3: Investigation of ‘Differential Item Functioning’ (AKA statistical bias) in WASL test items (1997-2001)
WASL test items are developed using state of the art procedures:
• Test Specifications: define how many and what types of items will be on a test
• Item Specifications: define exactly what kinds of items will assess each Grade Level Expectation (GLE)
• Item writing: overseen by skilled test developers
• Item reviews: check for match to GLEs by teachers
• Bias and sensitivity reviews: by individuals who represent the diversity of WA State students
WASL test items are developed using state of the art procedures:
• Item pilots: items are randomly assigned to students throughout WA State
• Item data reviews: based on students’ performances
– Statistical difficulty: Is the item easy or difficult because of content tested NOT some flaw in the item?
– Statistical validity: Do high performing students do better on the item than low performing students?
– Statistical bias: Is item performance related to level of knowledge and skill NOT group membership?
Study 1: Bias & Sensitivity Reviews
• Committee members represent diversity in the student population (regions, ethnicity, gender, socio-economic status, religion, special population issues)
• Members review reading passages and items for:
– Implied or overt stereotyping or negative representations of any group
– Too much or too little representation of any group
– Terms that may be confusing to students based on language, region, culture, socio-economic status, etc.
– Controversial issues and topics that may affect some groups more than others
Procedures Used to Observe Bias & Sensitivity Reviews:
• Participant-observer
• Recorded panelists’ comments during review process
• Cross-checked records with facilitator notes
• Looked for patterns in notes/records in relation to reading passages and items
Results of Bias and Sensitivity Review Observations:
• Few passages or test items are identified as problematic
• Reading passages present the greatest potential for bias
• Sources of bias in reading passages are subtle
Reading passages present the greatest potential for bias:
• WASL reading passages include:
– narrative and informative passages
– passages with social studies, science, and literary content
• WASL reading passages are from published sources
• Authors resist changes to their published writing (even when changes lessen bias/stereotyping)
Sources of bias in reading passages are subtle:
• Alterations of original narratives:
– Legends and folk tales may be altered to fit Western notions of literature
– Language changes can change meaning (first feast vs. barbeque)
• “Othering”:
– Biographies may focus on how individuals overcame or coped with their minority status (Jackie Robinson; Helen Keller)
– Informational passages about cultural groups may have a patronizing tone (i.e., aren’t “their” ways cute)
• Interpretations: Items may focus on interpretations that are unique to middle class values rather than values of the culture of origin
Questions?
Study 2: Bias & Sensitivity Forums
• Two community forums (Yakima and Seattle)
• Community members came together to discuss concerns about WASL
• Participants included:
– Teachers and school administrators
– Tribal elders
– Latino community leaders
– Parents and community members
Procedures used to Gather Data during Bias & Sensitivity Forums
• Did mock bias & sensitivity review
• Presented methods used for statistical “bias” analysis (also called differential item functioning, or DIF)
• Showed items flagged for DIF and asked for likely causes
• Small group discussion with reports to larger group
• Recorded participant ideas about bias issues in WASL
• Examined written notes and chart paper for themes
Themes in Participant Comments
• Need for involvement of minority teachers in all stages of WASL development work – this may require community involvement
• Need for sensitivity to cultural values in selection of reading passages, item content, and the types of questions (particularly in reading)
• Need for inclusion of tribal elders in selection of text and contexts for WASL items
• Need for inclusion of individuals with cultural expertise in bias/sensitivity review panels
Study 3: Differential Item Functioning (DIF) Analyses
What is Differential Item Functioning (DIF or Item Bias)?
• When examinees, from different groups, with the same level of ability, have a different chance of answering an item correctly (Dorans & Holland, 1993)
• Most bias analyses focus on cultural differences, with students grouped according to some inherent demographic attribute (Scheuneman & Gerritz, 1990; Schmitt & Dorans, 1990; Wang & Lane, 1996).
Usual Focus of DIF/Bias Studies
• Two comparable groups of examinees
– Reference group – Larger or more dominant group
– Focal group – Smaller or less dominant group
• Common demographic dimensions:
– Males compared with females
– Students speaking English as first language compared with students speaking English as second language
– European American students compared with students from other American races/cultures
Multidimensionality as an Explanation for DIF/Bias
Multidimensionality occurs when:
1. An item requires use of two or more abilities (e.g., reading and mathematics) to respond correctly
2. DIF/Bias for multi-dimensional items occurs when individuals from different groups have:
– identical ability on the primary dimension
– unequal ability on the secondary dimension
– and, as a result, a different likelihood of answering the item correctly
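The multidimensionality explanation can be illustrated with a small sketch. It assumes a compensatory two-dimensional logistic item model; the function name `p_correct`, the weights `a1` and `a2`, the difficulty `b`, and the ability values are all hypothetical choices for illustration and are not taken from the WASL studies.

```python
import math

def p_correct(theta1, theta2, a1=1.0, a2=0.8, b=0.0):
    """Probability of a correct answer when an item draws on two abilities.

    theta1 is the primary ability (e.g., mathematics) and theta2 the
    secondary ability (e.g., reading). a1, a2, and b are illustrative
    (assumed) item parameters.
    """
    return 1.0 / (1.0 + math.exp(-(a1 * theta1 + a2 * theta2 - b)))

# Two examinees with identical primary ability but unequal secondary ability.
p_reference = p_correct(theta1=0.0, theta2=0.5)
p_focal = p_correct(theta1=0.0, theta2=-0.5)

print(f"reference group examinee: {p_reference:.3f}")
print(f"focal group examinee:     {p_focal:.3f}")
```

Under these assumed parameters the two examinees are matched on the primary dimension, yet the one with lower secondary ability has a lower chance of answering correctly, which is the pattern a DIF analysis would flag.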
Typical Steps in a DIF Analysis:
• Identify two groups to be compared
• Compute item performance for students in each group at each total test score
• Summarize the differences in performance across all test scores
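The three steps above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the WASL studies: the record layout `(group, total_score, item_correct)`, the function names, the made-up data, and the choice to weight each score level by the number of focal-group students are all assumptions.

```python
from collections import defaultdict

def proportion_correct_by_score(records, group):
    """Step 2: proportion of `group` answering the item correctly at each total score."""
    totals = defaultdict(lambda: [0, 0])  # score -> [number correct, number of students]
    for g, score, correct in records:
        if g == group:
            totals[score][0] += correct
            totals[score][1] += 1
    return {score: (c / n, n) for score, (c, n) in totals.items()}

def summarize_dif(records, reference="R", focal="F"):
    """Step 3: focal-weighted average of (focal - reference) proportion correct."""
    ref = proportion_correct_by_score(records, reference)
    foc = proportion_correct_by_score(records, focal)
    shared_scores = sorted(set(ref) & set(foc))  # compare only matched score levels
    n_focal = sum(foc[s][1] for s in shared_scores)
    weighted = sum(foc[s][1] * (foc[s][0] - ref[s][0]) for s in shared_scores)
    return weighted / n_focal

# Step 1: two groups, R (reference) and F (focal); data are made up.
records = [
    ("R", 10, 1), ("R", 10, 1), ("R", 10, 0), ("R", 10, 1),
    ("F", 10, 1), ("F", 10, 0),
    ("R", 20, 1), ("R", 20, 1),
    ("F", 20, 1), ("F", 20, 0),
]

print(summarize_dif(records))
```

With this sign convention, a negative summary means focal-group students did worse on the item than reference-group students with the same total score; a positive summary means they did better.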
DIF/Bias Statistical Procedures
• Mantel-Haenszel (Holland & Thayer, 1988)
• Logistic regression (Swaminathan & Rogers, 1990)
• SIBTEST, the simultaneous item bias test (Shealy & Stout, 1993)
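As an illustration of the first procedure, the sketch below computes the Mantel-Haenszel common odds ratio across total-score levels and converts it to the delta-scale index often reported as MH D-DIF. The counts are made up, and the function name and tuple layout are assumptions for this example.

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across score levels.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct, focal_incorrect),
    one tuple per total-score level. Raises ZeroDivisionError if the denominator
    sum is zero (no stratum contains both reference-incorrect and focal-correct
    responses).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

# Made-up counts at three score levels (low, middle, high total score).
strata = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]

alpha = mantel_haenszel_odds_ratio(strata)
mh_d_dif = -2.35 * math.log(alpha)  # delta-scale transformation of the odds ratio

print(f"MH odds ratio: {alpha:.3f}")
print(f"MH D-DIF:      {mh_d_dif:.3f}")
```

An odds ratio above 1 (a negative MH D-DIF) indicates that the item favors the reference group among students matched on total score; an odds ratio below 1 indicates that it favors the focal group.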
SIBTEST
• A nonparametric statistical test to simultaneously detect DIF/Bias present in one or more items of a test
• An outgrowth of the multidimensional IRT modeling of DIF/Bias (Nandakumar & Stout, 1993; Ackerman, 1994; Roussos & Stout, 1996)
[Figure: Example of Equal Abilities Distribution on the Primary Dimension; Different Abilities on the Secondary Dimension. Axes: θ1 and θ2 (labeled Mathematics and Reading), each from -3 to 3; groups R (reference) and F (focal); R = F on θ1]
[Figure: Comparison of White Students' and Black Students' Performance on a Hypothetical Item. x-axis: Scale Score on the Test (270 to 315); y-axis: Percent of Students with Correct Answer (0% to 90%); series: White Students, Black Students]
DIF Can Go Both Ways:
• When individual students get their total scores from different items – that’s normal
• When there is a pattern in how groups of students get their total scores - that’s DIF
• When students in a group do better than expected on an item based on their total test score DIF is in favor of the group
• When students in a group do more poorly than expected on an item based on their total test score, DIF is against the group.
Typical Causes of DIF:
• Impact: Students from different groups receive different educational experiences such that item performance differences reflect true differences in knowledge/skills.
• Culture/Background: Students from different backgrounds bring unique perspectives to bear on test items.
• Language: Language used in items is differentially familiar to students.
• Effort: Examinees from different groups may attempt different items based on perceived likelihood of success.
Research on DIF for WASL Test Items:
• Studies were conducted after items had been:
– reviewed by bias & sensitivity committee
– examined for statistical bias
– used in an operational test
• Compared performance of:
– Males and Females
– White students and Black/African American students
– White students and Latino/Hispanic students
– White students and Native American students
– White students and Asian/Pacific Islander students
Research on DIF for WASL Test Items:
• Examined test items from:
– 1997, 1998, 1999, 2000, 2001 Grade 4 Reading and Mathematics
– 1998, 1999, 2000, 2001 Grade 7 Reading and Mathematics
– 1999, 2000, 2001 Grade 10 Reading and Mathematics
DIF Results for Reading:
• Most reading items showed no statistical bias
• Reading items flagged for Gender DIF:
– Multiple choice items tend to favor boys
– Performance items tend to favor girls
– Items favoring boys tend to be related to informational passages
• Reading items flagged for Ethnic DIF:
– Multiple-choice items asking for text interpretation tend to favor white students
– Performance items asking for text interpretation tend to favor minority students
– Patterns became more extreme across grade levels
Mean Number of Reading Items Flagged for DIF (Males & Females)
(MC = multiple-choice items; P = performance items)
Grade  Item Type  Favor Males  Favor Females
4      MC         1.20         0.00
4      P          0.80         3.00
7      MC         4.50         0.50
7      P          0.50         5.00
10     MC         5.33         0.33
10     P          2.33         6.00
Mean Number of Reading Items Flagged for DIF (Asian/Pacific Islander & White)
Grade  Item Type  Favor Asians/Pacific Islanders  Favor Whites
4      MC         0.20                            1.40
4      P          2.20                            0.60
7      MC         0.00                            4.25
7      P          5.50                            0.00
10     MC         0.00                            4.00
10     P          6.67                            1.67
Mean Number of Reading Items Flagged for DIF (Black/African & White)
Grade  Item Type  Favor Blacks/Africans  Favor Whites
4      MC         0.20                   0.60
4      P          2.00                   0.40
7      MC         0.00                   2.25
7      P          3.25                   0.25
10     MC         0.67                   2.33
10     P          5.33                   1.33
Mean Number of Reading Items Flagged for DIF (Native American & White)
Grade  Item Type  Favor Native Americans  Favor Whites
4      MC         0.00                    0.00
4      P          1.00                    0.20
7      MC         0.00                    0.25
7      P          1.00                    0.25
10     MC         0.00                    1.00
10     P          1.67                    0.67
Mean Number of Reading Items Flagged for DIF (Latino/Hispanic & White)
Grade  Item Type  Favor Latinos/Hispanics  Favor Whites
4      MC         0.40                     1.20
4      P          2.20                     0.20
7      MC         0.00                     3.25
7      P          5.50                     0.00
10     MC         0.00                     3.00
10     P          6.00                     1.67
Excerpt from a reading passage:
The best looking fences are often the simplest. A simple fence around a beautiful home can be like a frame around a picture. The house isn’t hidden; its beauty is enhanced by the frame. But a fence can be a massive, ugly thing, too, made of bricks and mortar. Sometimes the insignificant little fences do their job just as well as the ten-foot walls. Maybe it’s only a string stretched between here and there in a field. The message is clear; don’t cross here.
Every fence has its own personality and some don’t have much. There are friendly fences. A friendly fence takes kindly to being leaned on. There are friendly fences around some playgrounds. And some playgrounds fences are more fun to play on than anything they surround. There are more mean fences than friendly fences overall, though. Some have their own built-in invitation not to be sat upon. Unfriendly fences get it right back sometimes. You seldom see one that hasn’t been hit, bashed, or bumped or in some way broken or knocked down.
Example of a Reading Item that Shows Statistical Bias in Favor of Focal Groups:
In the sixth paragraph, the author talks about friendly and unfriendly fences. How can you tell them apart?
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
* Favors Latinos, Blacks/African Americans, and Asian/Pacific Islanders
Example of a Reading Item that Shows Statistical Bias in Favor of Focal Groups:
What is the author’s attitude toward fences? Give three pieces of evidence from the essay to support your point.
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
* Favors females, Asian/Pacific Islanders, and Latinos
Example of a Reading Item that Shows Statistical Bias in Favor of Males and Whites
DIF Results for Mathematics:
• Most mathematics items showed no statistical DIF
• Mathematics items flagged for Gender DIF:
– Multiple choice items tend to favor boys
– Performance items tend to favor girls
– DIF items favoring boys tend to require simple applications of mathematical procedures in number, algebra, geometry, and statistics
– DIF items favoring girls tend to assess data analysis, measurement, complex applications, reasoning, and problem-solving
– Number of items flagged for DIF increased across grade levels
DIF Results for Mathematics:
• Ethnic DIF statistical patterns:
– Performance items were flagged for DIF more often than multiple-choice items
– Slightly more of the flagged performance items favored minority students, although differences were small
Mean Number of Mathematics Items Flagged for DIF (Males & Females)
Grade  Item Type  Favor Males  Favor Females
4      MC         2.20         0.00
4      P          2.00         5.20
7      MC         3.50         0.50
7      P          1.75         5.25
10     MC         5.67         0.00
10     P          3.67         7.33
Mean Number of Mathematics Items Flagged for DIF (Asian/Pacific Islander & White)
Grade  Item Type  Favor Asians/Pacific Islanders  Favor Whites
4      MC         1.00                            2.00
4      P          1.80                            1.60
7      MC         1.50                            2.50
7      P          5.75                            3.00
10     MC         3.00                            1.33
10     P          3.67                            4.67
Mean Number of Mathematics Items Flagged for DIF (Black/African & White)
Grade  Item Type  Favor Blacks/Africans  Favor Whites
4      MC         1.00                   0.80
4      P          2.00                   1.40
7      MC         0.25                   0.75
7      P          3.25                   1.50
10     MC         2.33                   1.33
10     P          3.00                   3.00
Mean Number of Mathematics Items Flagged for DIF (Native American & White)
Grade  Item Type  Favor Native Americans  Favor Whites
4      MC         0.00                    0.00
4      P          1.80                    1.00
7      MC         0.00                    0.50
7      P          1.75                    1.25
10     MC         0.00                    0.67
10     P          3.00                    2.00
Mean Number of Mathematics Items Flagged for DIF (Latino/Hispanic & White)
Grade  Item Type  Favor Latinos/Hispanics  Favor Whites
4      MC         0.80                     1.00
4      P          2.60                     0.80
7      MC         0.25                     1.25
7      P          3.50                     1.75
10     MC         0.33                     0.67
10     P          3.00                     2.00
DIF Results for Mathematics:
• Content analysis of Mathematics items flagged for Ethnic DIF:
– Flagged items favoring Asian/Pacific Islander students generally assessed number concepts, computation, geometric procedures, algebraic procedures, and simple statistics
– Flagged items favoring Black/African, Native American, and Latino/Hispanic students generally assessed number, number patterns, computation, and logical reasoning
– Flagged items favoring White students generally assessed data analysis, data representation, measurement, reasoning, and problem-solving
Example of a Mathematics Item that Shows Statistical Bias in Favor of Focal Groups:
* Favors Latinos, Native Americans, Asian/Pacific Islanders, Black/African Americans, and Females
Example of a Mathematics Item that Shows Statistical Bias in Favor of Focal Groups:
* Favors Asian/Pacific Islanders
Conclusions from DIF Studies:
• Results suggest:
– Exclusive reliance on multiple-choice items for reading tests may result in bias against girls and minority students – particularly when items assess interpretation of text
– Exclusive reliance on multiple-choice items for mathematics tests may result in bias against girls
– Ethnic DIF results in mathematics suggest that content of instruction differs for students in different groups
Additional Points
• Similar results have been found in studies of other tests
• However, these results can only be generalized when:
– Items are written in the same way as WASL items (structured, not too open-ended)
– Diverse, appropriate interpretations and problem solutions are selected for use to train scorers
Can Standardized Tests be Fair to All Students?
Yes, under some conditions:
– Use of reading passages that maintain cultural characteristics
– Well developed performance items that present clear directions to students
– Use of item writers from diverse backgrounds
– Selection of anchor papers and training papers that represent diverse, valid responses
– Cultural experts in bias & sensitivity reviews
Appendix
The following pages give the mathematical model for SIBTEST
Estimated True Score (Regression Correction)
where,
Bias index (After regression correction applied)
Bias statistic
where
Bias index
where
Gfk: proportion of people in focal group getting score of k
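For orientation, the quantities named in this appendix are commonly written as follows in the SIBTEST literature (Shealy & Stout, 1993). This is a hedged sketch of the standard form, not a reproduction of the original slides: apart from G_Fk, which is defined above, the symbols are assumptions. Here k is a score on the matching subtest of n items, \bar{Y}^{*}_{gk} is the regression-adjusted mean score on the studied item for group g (R = reference, F = focal) at matching score k, \bar{X}_g is the group's mean matching score, and \hat{\rho}_g is a reliability estimate for the matching subtest in group g.

```latex
% Estimated true score (regression correction) for group g at matching score k
\hat{V}_{g}(k) = \bar{X}_{g} + \hat{\rho}_{g}\,\bigl(k - \bar{X}_{g}\bigr),
\qquad g \in \{R, F\}

% Bias index: focal-weighted difference in adjusted item means
\hat{\beta} = \sum_{k=0}^{n} G_{Fk}\,\bigl(\bar{Y}^{*}_{Rk} - \bar{Y}^{*}_{Fk}\bigr)

% Bias statistic: the index divided by its estimated standard error
B = \frac{\hat{\beta}}{\hat{\sigma}(\hat{\beta})}
```

Under this convention a positive bias index means the item favors the reference group, and B is referred to the standard normal distribution to test whether the index differs from zero.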