Can Large-Scale Tests be Fair to All Students? Bias Issues Related to WASL Catherine S. Taylor University of Washington/OSPI Yoonsun Lee OSPI Johnnie McKinley.

Can Large-Scale Tests be Fair to All Students?

Bias Issues Related to WASL

Catherine S. Taylor, University of Washington/OSPI

Yoonsun Lee, OSPI

Johnnie McKinley, University of Washington

March 29, 2007

Focus of this presentation is on three studies:

• Study 1: What can we learn from Bias and Sensitivity Review procedures used for WASL (2004)

• Study 2: Report of input from two Public Forums on Bias and Sensitivity (2004)
  – Yakima
  – Seattle

• Study 3: Investigation of ‘Differential Item Functioning’ (AKA statistical bias) in WASL test items (1997-2001)

WASL test items are developed using state of the art procedures:

• Test Specifications: define how many and what types of items will be on a test

• Item Specifications: define exactly what kinds of items will assess each Grade Level Expectation (GLE)

• Item writing: overseen by skilled test developers

• Item reviews: check for match to GLEs by teachers

• Bias and sensitivity reviews: by individuals who represent the diversity of WA State students

WASL test items are developed using state of the art procedures:

• Item pilots: items are randomly assigned to students throughout WA State

• Item data reviews: based on students’ performances
  – Statistical difficulty: Is the item easy or difficult because of content tested NOT some flaw in the item?
  – Statistical validity: Do high performing students do better on the item than low performing students?
  – Statistical bias: Is item performance related to level of knowledge and skill NOT group membership?
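The first two item data review questions are often answered with classical item statistics. The sketch below is illustrative only: the slides do not say which statistics WASL reviewers used, so the function names, the use of proportion correct for difficulty, and the use of an item-rest correlation for validity are all assumptions.

```python
from statistics import mean, pstdev

def item_difficulty(item_scores):
    """Assumed difficulty index: mean item score (proportion correct for 0/1 items)."""
    return mean(item_scores)

def item_discrimination(item_scores, total_scores):
    """Assumed validity index: correlation between the item score and the
    rest-of-test score (total score with this item removed)."""
    rest = [t - x for x, t in zip(item_scores, total_scores)]
    mx, mr = mean(item_scores), mean(rest)
    cov = mean([(x - mx) * (r - mr) for x, r in zip(item_scores, rest)])
    sx, sr = pstdev(item_scores), pstdev(rest)
    if sx == 0 or sr == 0:
        return float("nan")  # undefined when everyone has the same score
    return cov / (sx * sr)

# Hypothetical data: eight students, one 0/1 item, total test scores.
item = [1, 1, 1, 0, 1, 0, 0, 0]
total = [9, 8, 7, 6, 5, 4, 3, 2]
print(item_difficulty(item))
print(item_discrimination(item, total))
```

A positive discrimination value means that students with higher scores on the rest of the test tend to do better on the item, which is the pattern the "statistical validity" question looks for.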

Study 1: Bias & Sensitivity Reviews

• Committee members represent diversity in the student population (regions, ethnicity, gender, socio-economic status, religion, special population issues)

• Members review reading passages and items for:
  – Implied or overt stereotyping or negative representations of any group
  – Too much or too little representation of any group
  – Terms that may be confusing to students based on language, region, culture, socio-economic status, etc.
  – Controversial issues and topics that may affect some groups more than others

Procedures Used to Observe Bias & Sensitivity Reviews:

• Participant-observer

• Recorded panelists’ comments during review process

• Cross-checked records with facilitator notes

• Looked for patterns in notes/records in relation to reading passages and items

Results of Bias and Sensitivity Review Observations:

• Few passages or test items are identified as problematic

• Reading passages present the greatest potential for bias

• Sources of bias in reading passages are subtle

Reading passages present the greatest potential for bias:

• WASL reading passages include:
  – narrative and informative passages
  – passages with social studies, science, and literary content

• WASL reading passages are from published sources

• Authors resist changes to their published writing (even when changes lessen bias/stereotyping)

Sources of bias in reading passages are subtle:

• Alterations of original narratives:
  – Legends and folk tales may be altered to fit Western notions of literature
  – Language changes can change meaning (first feast vs. barbeque)

• “Othering”:
  – Biographies may focus on how individuals overcame or coped with their minority status (Jackie Robinson; Helen Keller)
  – Informational passages about cultural groups may have a patronizing tone (i.e., aren’t “their” ways cute)

• Interpretations: Items may focus on interpretations that are unique to middle class values rather than values of the culture of origin

Questions?

Study 2: Bias & Sensitivity Forums

• Two community forums (Yakima and Seattle)

• Community members came together to discuss concerns about WASL

• Participants included:

– Teachers and school administrators

– Tribal elders

– Latino community leaders

– Parents and community members

Procedures used to Gather Data during Bias & Sensitivity Forums

• Conducted a mock bias & sensitivity review

• Presented methods used for statistical “bias” analysis (also called differential item functioning (DIF))

• Showed items flagged for DIF and asked for likely causes

• Small group discussion with reports to larger group

• Recorded participant ideas about bias issues in WASL

• Examined written notes and chart paper for themes

Themes in Participant Comments

• Need for involvement of minority teachers in all stages of WASL development work – this may require community involvement

• Need for sensitivity to cultural values in selection of reading passages, item content, and the types of questions (particularly in reading)

• Need for inclusion of tribal elders in selection of text and contexts for WASL items

• Need for inclusion of individuals with cultural expertise in bias/sensitivity review panels

Study 3: Differential Item Functioning (DIF) Analyses

What is Differential Item Functioning (DIF or Item Bias)?

• DIF occurs when examinees from different groups, with the same level of ability, have a different chance of answering an item correctly (Dorans & Holland, 1993)

• Most bias analyses focus on cultural differences, with students grouped according to some inherent demographic attribute (Scheuneman & Gerritz, 1990; Schmitt & Dorans, 1990; Wang & Lane, 1996).

Usual Focus of DIF/Bias Studies

• Two comparable groups of examinees
  – Reference group: larger or more dominant group
  – Focal group: smaller or less dominant group

• Common demographic dimensions:
  – Males compared with females
  – Students speaking English as first language compared with students speaking English as second language
  – European American students compared with students from other American races/cultures

Multidimensionality as an Explanation for DIF/Bias

Multidimensionality occurs when:

1. An item requires use of two or more abilities (e.g., reading and mathematics) to respond correctly

2. DIF/Bias for multi-dimensional items occurs when individuals from different groups have:
  – identical ability on the primary dimension
  – unequal ability on the secondary dimension
  – a different likelihood of answering the item correctly

Typical Steps in a DIF Analysis:

• Identify two groups to be compared

• Compute item performance for students in each group at each total test score

• Summarize the differences in performance across all test scores
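The three steps above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the WASL studies: the record layout `(group, total_score, item_correct)`, the group labels "R" and "F", and the choice to weight each score level by the number of focal-group students are all assumptions made for the example.

```python
from collections import defaultdict

def conditional_p_values(records, group):
    """Step 2: for one group, map total score -> (proportion correct, n)."""
    counts = defaultdict(lambda: [0, 0])  # total score -> [n_correct, n]
    for g, total, correct in records:
        if g == group:
            counts[total][0] += correct
            counts[total][1] += 1
    return {k: (c / n, n) for k, (c, n) in counts.items()}

def weighted_difference(records, reference="R", focal="F"):
    """Step 3: average (focal - reference) difference in proportion correct,
    weighted by focal-group counts at each shared total score.
    Positive values favor the focal group; negative values favor the reference group."""
    ref = conditional_p_values(records, reference)
    foc = conditional_p_values(records, focal)
    shared = [k for k in foc if k in ref]
    n_focal = sum(foc[k][1] for k in shared)
    if n_focal == 0:
        raise ValueError("no total score is shared by both groups")
    return sum(foc[k][1] * (foc[k][0] - ref[k][0]) for k in shared) / n_focal

# Step 1: two groups, identified by hypothetical labels "R" and "F".
records = [
    ("R", 10, 1), ("R", 10, 0), ("F", 10, 1), ("F", 10, 1),
    ("R", 12, 1), ("R", 12, 1), ("F", 12, 1), ("F", 12, 1),
]
print(weighted_difference(records))
```

Matching on total score first is what separates DIF from a simple difference in group means: students are compared only with students from the other group who earned the same total score.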

DIF/Bias Statistical Procedures

• Mantel-Haenszel (Holland & Thayer, 1988)

• Logistic regression (Swaminathan & Rogers, 1990)

• SIBTEST, the simultaneous item bias test (Shealy & Stout, 1993)
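As an illustration of the first procedure in the list, the sketch below computes the Mantel-Haenszel common odds ratio across matched score levels and converts it to the delta scale often reported with it. The function names and the example counts are hypothetical, and the slides do not state how these procedures were configured for WASL.

```python
import math

def mantel_haenszel_odds_ratio(tables):
    """Common odds ratio across matched score levels.

    tables: one (A, B, C, D) tuple per total-score level, where
      A = reference correct,  B = reference incorrect,
      C = focal correct,      D = focal incorrect.
    Values above 1 mean the item favors the reference group.
    """
    num = den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return num / den

def mh_d_dif(alpha):
    """Delta-scale transformation, -2.35 * ln(alpha).
    Negative values indicate DIF against the focal group."""
    return -2.35 * math.log(alpha)

# Hypothetical counts at two total-score levels.
tables = [(30, 10, 20, 20), (40, 10, 30, 20)]
alpha = mantel_haenszel_odds_ratio(tables)
print(alpha, mh_d_dif(alpha))
```

When both groups have the same odds of a correct answer at every score level, the odds ratio is 1 and the delta value is 0, which corresponds to no DIF.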

SIBTEST

• A nonparametric statistical test to sequentially detect DIF/Bias present in one or more items of a test

• An outgrowth of the multidimensional IRT modeling of DIF/Bias (Nandakumar & Stout, 1993; Ackerman, 1994; Roussos & Stout, 1996)

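Example of Equal Abilities Distribution on the Primary Dimension; Different Abilities on the Secondary Dimension

[Figure: ability distributions for the reference (R) and focal (F) groups plotted on axes θ1 and θ2, each running from -3 to 3, with the dimensions labeled Mathematics and Reading; on θ1, R = F.]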

Comparison of White Students' and Black Students' Performance on a Hypothetical Item

[Figure: line chart. x-axis: Scale Score on the Test (270 to 315); y-axis: Percent of Students with Correct Answer (0% to 90%); series: White Students, Black Students.]

DIF Can Go Both Ways:

• When individual students get their total scores from different items – that’s normal

• When there is a pattern in how groups of students get their total scores – that’s DIF

• When students in a group do better than expected on an item based on their total test score, DIF is in favor of the group

• When students in a group do more poorly than expected on an item based on their total test score, DIF is against the group

Typical Causes of DIF:

• Impact: Students from different groups receive different educational experiences such that item performance differences reflect true differences in knowledge/skills.

• Culture/Background: Students from different backgrounds bring unique perspectives to bear on test items.

• Language: Language used in items is differentially familiar to students.

• Effort: Examinees from different groups may attempt different items based on perceived likelihood of success.

Research on DIF for WASL Test Items:

• Studies were conducted after items had been:
  – reviewed by bias & sensitivity committee
  – examined for statistical bias
  – used in an operational test

• Compared performance of:

– Males and Females

– White students and Black/African American students

– White students and Latino/Hispanic students

– White students and Native American students

– White students and Asian/Pacific Islander students

Research on DIF for WASL Test Items:

• Examined test items from:

1997, 1998, 1999, 2000, 2001 Grade 4 Reading and Mathematics

1998, 1999, 2000, 2001 Grade 7 Reading and Mathematics

1999, 2000, 2001 Grade 10 Reading and Mathematics

DIF Results for Reading:

• Most reading items showed no statistical bias

• Reading items flagged for Gender DIF:
  – Multiple-choice items tend to favor boys
  – Performance items tend to favor girls
  – Items favoring boys tend to be related to informational passages

• Reading items flagged for Ethnic DIF:
  – Multiple-choice items asking for text interpretation tend to favor white students
  – Performance items asking for text interpretation tend to favor minority students
  – Patterns became more extreme across grade levels

Mean Number of Reading Items Flagged for DIF (Males & Females)

Grade  Item Type  Favor Males  Favor Females
4      MC         1.20         0.00
4      P          0.80         3.00
7      MC         4.50         0.50
7      P          0.50         5.00
10     MC         5.33         0.33
10     P          2.33         6.00


Mean Number of Reading Items Flagged for DIF (Asian/Pacific Islander & White)

Grade  Item Type  Favor Asians/Pacific Islanders  Favor Whites
4      MC         0.20                            1.40
4      P          2.20                            0.60
7      MC         0.00                            4.25
7      P          5.50                            0.00
10     MC         0.00                            4.00
10     P          6.67                            1.67

Mean Number of Reading Items Flagged for DIF (Black/African & White)

Grade  Item Type  Favor Blacks/Africans  Favor Whites
4      MC         0.20                   0.60
4      P          2.00                   0.40
7      MC         0.00                   2.25
7      P          3.25                   0.25
10     MC         0.67                   2.33
10     P          5.33                   1.33


Mean Number of Reading Items Flagged for DIF (Native American & White)

Grade  Item Type  Favor Native Americans  Favor Whites
4      MC         0.00                    0.00
4      P          1.00                    0.20
7      MC         0.00                    0.25
7      P          1.00                    0.25
10     MC         0.00                    1.00
10     P          1.67                    0.67


Mean Number of Reading Items Flagged for DIF (Latino/Hispanic & White)

Grade  Item Type  Favor Latinos/Hispanics  Favor Whites
4      MC         0.40                     1.20
4      P          2.20                     0.20
7      MC         0.00                     3.25
7      P          5.50                     0.00
10     MC         0.00                     3.00
10     P          6.00                     1.67

Excerpt from a reading passage:

The best looking fences are often the simplest. A simple fence around a beautiful home can be like a frame around a picture. The house isn’t hidden; its beauty is enhanced by the frame. But a fence can be a massive, ugly thing, too, made of bricks and mortar. Sometimes the insignificant little fences do their job just as well as the ten-foot walls. Maybe it’s only a string stretched between here and there in a field. The message is clear; don’t cross here.

Every fence has its own personality and some don’t have much. There are friendly fences. A friendly fence takes kindly to being leaned on. There are friendly fences around some playgrounds. And some playgrounds fences are more fun to play on than anything they surround. There are more mean fences than friendly fences overall, though. Some have their own built-in invitation not to be sat upon. Unfriendly fences get it right back sometimes. You seldom see one that hasn’t been hit, bashed, or bumped or in some way broken or knocked down.

Example of a Reading Item that Shows Statistical Bias in Favor of Focal Groups:

In the sixth paragraph, the author talks about friendly and unfriendly fences. How can you tell them apart?

________________________________________

* Favors Latinos, Blacks/African Americans, and Asian/Pacific Islanders

Example of a Reading Item that Shows Statistical Bias in Favor of Focal Groups:

What is the author’s attitude toward fences? Give three pieces of evidence from the essay to support your point.

________________________________________

* Favors females, Asian/Pacific Islanders, and Latinos

Example of a Reading Item that Shows Statistical Bias in Favor of Males and Whites

DIF Results for Mathematics:

• Most mathematics items showed no statistical DIF

• Mathematics items flagged for Gender DIF:
  – Multiple-choice items tend to favor boys
  – Performance items tend to favor girls
  – DIF items favoring boys tend to require simple applications of mathematical procedures in number, algebra, geometry, and statistics
  – DIF items favoring girls tend to assess data analysis, measurement, complex applications, reasoning, and problem-solving
  – Number of items flagged for DIF increased across grade levels


• Ethnic DIF statistical patterns:
  – Performance items were flagged for DIF more often than multiple-choice items
  – Slightly more of the flagged performance items favored minority students, although differences were small

Mean Number of Mathematics Items Flagged for DIF (Males & Females)

Grade  Item Type  Favor Males  Favor Females
4      MC         2.20         0.00
4      P          2.00         5.20
7      MC         3.50         0.50
7      P          1.75         5.25
10     MC         5.67         0.00
10     P          3.67         7.33

Mean Number of Mathematics Items Flagged for DIF (Asian/Pacific Islander & White)

Grade  Item Type  Favor Asians/Pacific Islanders  Favor Whites
4      MC         1.00                            2.00
4      P          1.80                            1.60
7      MC         1.50                            2.50
7      P          5.75                            3.00
10     MC         3.00                            1.33
10     P          3.67                            4.67


Mean Number of Mathematics Items Flagged for DIF (Black/African & White)

Grade  Item Type  Favor Blacks/Africans  Favor Whites
4      MC         1.00                   0.80
4      P          2.00                   1.40
7      MC         0.25                   0.75
7      P          3.25                   1.50
10     MC         2.33                   1.33
10     P          3.00                   3.00


Mean Number of Mathematics Items Flagged for DIF (Native American & White)

Grade  Item Type  Favor Native Americans  Favor Whites
4      MC         0.00                    0.00
4      P          1.80                    1.00
7      MC         0.00                    0.50
7      P          1.75                    1.25
10     MC         0.00                    0.67
10     P          3.00                    2.00


Mean Number of Mathematics Items Flagged for DIF (Latino/Hispanic & White)

Grade  Item Type  Favor Latinos/Hispanics  Favor Whites
4      MC         0.80                     1.00
4      P          2.60                     0.80
7      MC         0.25                     1.25
7      P          3.50                     1.75
10     MC         0.33                     0.67
10     P          3.00                     2.00


• Content analysis of Mathematics items flagged for Ethnic DIF:
  – Flagged items favoring Asian/Pacific Islander students generally assessed number concepts, computation, geometric procedures, algebraic procedures, and simple statistics
  – Flagged items favoring Black/African, Native American, and Latino/Hispanic students generally assessed number, number patterns, computation, and logical reasoning
  – Flagged items favoring White students generally assessed data analysis, data representation, measurement, reasoning, and problem-solving

Example of a Mathematics Item that Shows Statistical Bias in Favor of Focal Groups:

* Favors Latinos, Native Americans, Asian/Pacific Islanders, Black/African Americans, and Females

Example of a Mathematics Item that Shows Statistical Bias in Favor of Focal Groups:

* Favors Asian/Pacific Islanders

Conclusions from DIF Studies:

• Results suggest:
  – Exclusive reliance on multiple-choice items for reading tests may result in bias against girls and minority students – particularly when items assess interpretation of text
  – Exclusive reliance on multiple-choice items for mathematics tests may result in bias against girls
  – Ethnic DIF results in mathematics suggest that content of instruction differs for students in different groups

Additional Points

• Similar results have been found in studies of other tests

• However, these results can only be generalized when:
  – Items are written in the same way as WASL items (structured, not too open-ended)
  – Diverse, appropriate interpretations and problem solutions are selected for use to train scorers

Can Standardized Tests be Fair to All Students?

Yes, under some conditions:
  – Use of reading passages that maintain cultural characteristics
  – Well developed performance items that present clear directions to students
  – Use of item writers from diverse backgrounds
  – Selection of anchor papers and training papers that represent diverse, valid responses
  – Cultural experts in bias & sensitivity reviews

Appendix

The following pages give the mathematical model for SIBTEST

Estimated True Score (Regression Correction)

where,

Bias index

(After regression correction applied)

Bias statistic

where

Bias index

where

Gfk: proportion of people in focal group getting score of k
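For orientation, the bias index and bias statistic named above are commonly written as follows in the SIBTEST literature (Shealy & Stout, 1993). This is a sketch of the standard form and may not match the notation of the original slides: the symbols $\bar{Y}^{*}_{Rk}$ and $\bar{Y}^{*}_{Fk}$ (regression-corrected mean item scores of reference and focal examinees with matching score $k$), $n$ (the maximum matching score), and $\hat{\sigma}(\hat{\beta})$ (the estimated standard error of the index) are assumptions, while $G_{Fk}$ follows the definition given above.

```latex
\hat{\beta} = \sum_{k=0}^{n} G_{Fk}\,\bigl(\bar{Y}^{*}_{Rk} - \bar{Y}^{*}_{Fk}\bigr),
\qquad
B = \frac{\hat{\beta}}{\hat{\sigma}(\hat{\beta})}
```

Under the hypothesis of no DIF, $B$ is treated as approximately standard normal; with the difference taken as reference minus focal, a positive $\hat{\beta}$ indicates an item that favors the reference group.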
