Vertical and horizontal test equating in educational research Eveline Gebhardt & Wolfram Schulz

Vert&Hor Equating 111024

Dec 15, 2014


Vertical and horizontal equating and measurement invariance
Page 1: Vert&Hor Equating 111024

Vertical and horizontal test equating in educational research

Eveline Gebhardt & Wolfram Schulz

Page 2: Vert&Hor Equating 111024

Method for estimating

• change over time in student abilities
• growth between year levels

Page 3: Vert&Hor Equating 111024

A method based on

CLASSICAL TEST THEORY

Page 4: Vert&Hor Equating 111024

Classical test theory

• Student performance: % correct on set of items
  – Compare students that respond to identical set of items
• Item difficulty: % of students responding correctly
  – Compare items that were administered to the same group of students

Page 5: Vert&Hor Equating 111024

Constraints

• Limited number of items to measure a domain

• All items need to be kept secure

Page 6: Vert&Hor Equating 111024

Problematic

• Comparing students from different age groups (ceiling or floor effect)

• Comparing student abilities over time when not all items can be kept secure

• Item difficulty and student performance are confounded

Page 7: Vert&Hor Equating 111024

A method based on

ITEM RESPONSE THEORY

Page 8: Vert&Hor Equating 111024

Rasch model

Common scale for item difficulties and student abilities
  – If ability = difficulty, the student has a 50% chance to respond correctly to that item
  – If ability > difficulty, most likely to respond correctly
  – If ability < difficulty, most likely to respond incorrectly
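The three cases above follow from the Rasch response function. A minimal sketch, assuming abilities and difficulties are expressed in logits; the function name and the numbers are illustrative and not taken from the slides:

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values only (hypothetical student and item locations).
p_equal = rasch_probability(0.5, 0.5)   # ability = difficulty
p_above = rasch_probability(1.5, 0.5)   # ability > difficulty
p_below = rasch_probability(-0.5, 0.5)  # ability < difficulty

print(p_equal, p_above, p_below)
```

Only the distance between ability and difficulty matters, which is why both can be placed on one common scale.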

Page 9: Vert&Hor Equating 111024

Example scale – Year 6

[Figure: Year 6 students (x) and items 1 to 10 located on a common scale running from -3 to 3.]

Page 10: Vert&Hor Equating 111024

Example scale – Year 10

[Figure: Year 10 students (x) and items 1 to 15 located on a common scale running from -3 to 3.]

Page 11: Vert&Hor Equating 111024

Example scale – Combined

[Figure: Year 6 and Year 10 students (x) and items 1 to 15 located on one common scale running from -3 to 3.]

Page 12: Vert&Hor Equating 111024

Vertical and horizontal equating

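[Diagram: Year 6 and Year 10 assessed in 2011 and 2014. The vertical direction links the two year levels; the horizontal direction links the two assessment years.]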

Page 13: Vert&Hor Equating 111024

COMMON ITEM EQUATING

Three methods

Page 14: Vert&Hor Equating 111024

Several methods

• Average item difficulty of set of link items needs to be equal in both tests

• Three common methods:
  – Shift method (trends)
  – Joint scaling (booklets)
  – Anchoring item difficulties

Page 15: Vert&Hor Equating 111024

SHIFT METHOD

Method 1

Page 16: Vert&Hor Equating 111024

Shift method

• Test 1 and test 2 are scaled separately
• Average difficulty of items B in test 1 (MN1) and test 2 (MN2) is computed
• Difference between averages (d = MN1 - MN2) is computed
• Difference is added to the student abilities of test 2 (θ2* = θ2 + d)

          Items A   Items B   Items C
Test 1       X         X
Test 2                 X         X

Page 17: Vert&Hor Equating 111024

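[Figure: the scales of test 1 and test 2, each with its own 0 point, showing the link-item averages MN1 and MN2 and the difference d between them.]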

Page 18: Vert&Hor Equating 111024

Set   Item   Difficulty T1   Difficulty T2
A       1        -1.1
A       2         1.6
A       3        -2.6
A       4         0.9
A       5        -1.8
B       6         0.8            -1.2
B       7         1.7            -0.3
B       8         0.9            -1.1
B       9        -0.2            -2.2
B      10        -0.9            -2.9
C      11                         2.1
C      12                         1.5
C      13                         0.4
C      14                         2.4
C      15                         1.2

AVG all           0.0             0.0
AVG link          0.5            -1.5      2.0 = shift
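The shift in the table can be reproduced from the link-item difficulties alone. A minimal sketch: the difficulties of items 6 to 10 are copied from the table, while the test 2 abilities in `theta2` are hypothetical values added only to show how the shift is applied:

```python
from statistics import mean

# Link items B (items 6-10): difficulties from separate scaling of each test.
link_t1 = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9}
link_t2 = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9}

mn1 = mean(link_t1.values())  # average link difficulty in test 1
mn2 = mean(link_t2.values())  # average link difficulty in test 2
d = mn1 - mn2                 # equating shift

# Hypothetical abilities of three test 2 students (not from the slides).
theta2 = [-0.8, 0.1, 1.3]
theta2_equated = [theta + d for theta in theta2]  # now on the test 1 scale

print(round(mn1, 2), round(mn2, 2), round(d, 2))
print([round(theta, 2) for theta in theta2_equated])
```

The unrounded averages are 0.46 and -1.54, which the table reports as 0.5 and -1.5.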

Page 19: Vert&Hor Equating 111024

JOINT SCALING

Method 2

Page 20: Vert&Hor Equating 111024

Joint scaling

• Data of test 1 and 2 are joined in one data set
• Test 1 and 2 are scaled together
• Difficulties of items B are estimated only once
• Difficulties of items B are identical for test 1 and 2
• Tests are on the same scale
• Also called concurrent equating

Page 21: Vert&Hor Equating 111024

Joint scaling - Data file (items A = i1 to i5, B = i6 to i10, C = i11 to i15)

Std  Year   i1  i2  i3  i4  i5   i6  i7  i8  i9  i10   i11  i12  i13  i14  i15
  1     6    0   1   0   0   1    0   0   1   1    1     n    n    n    n    n
  2     6    0   0   1   1   1    0   1   0   0    0     n    n    n    n    n
  3     6    1   0   1   1   1    0   1   1   0    1     n    n    n    n    n
  4     6    0   0   1   1   0    1   1   0   0    1     n    n    n    n    n
  5     6    1   1   0   1   1    1   1   1   1    1     n    n    n    n    n
  6    10    n   n   n   n   n    0   0   0   0    0     0    1    0    0    0
  7    10    n   n   n   n   n    0   1   0   1    0     0    0    0    0    0
  8    10    n   n   n   n   n    1   1   0   1    1     1    1    1    1    1
  9    10    n   n   n   n   n    1   1   0   0    1     1    1    0    1    1
 10    10    n   n   n   n   n    1   1   1   1    1     1    1    0    1    1
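The joint data file above can be assembled by padding each test's responses with a missing-value code for the items that were not part of that test. A minimal sketch, assuming that `n` marks items not administered and using `None` for it; only the first two students of each year level are copied from the data file, and the helper names are hypothetical:

```python
# Items A = i1-i5 (Year 6 only), B = i6-i10 (both), C = i11-i15 (Year 10 only).
N_A, N_B, N_C = 5, 5, 5

# Responses to the items each group actually saw (students 1-2 and 6-7).
year6 = [
    [0, 1, 0, 0, 1, 0, 0, 1, 1, 1],   # items A + B
    [0, 0, 1, 1, 1, 0, 1, 0, 0, 0],
]
year10 = [
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],   # items B + C
    [0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
]


def pad_year6(row):
    """Append missing codes for items C, which Year 6 did not take."""
    return row + [None] * N_C


def pad_year10(row):
    """Prepend missing codes for items A, which Year 10 did not take."""
    return [None] * N_A + row


joint = [pad_year6(r) for r in year6] + [pad_year10(r) for r in year10]

# Number of observed responses per item: only items B are seen by both groups.
n_responses = [sum(v is not None for v in column) for column in zip(*joint)]
print(n_responses)
```

Because only items B have responses from both groups, they are what ties the two tests to one scale when the joint file is scaled.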

Page 22: Vert&Hor Equating 111024

ANCHORING

Method 3

Page 23: Vert&Hor Equating 111024

Anchoring

• Test 1 (items A and B) is scaled
• Difficulties of items B are copied
• Test 2 (items B and C) is scaled, anchoring items B to the same values as test 1

Page 24: Vert&Hor Equating 111024

Set   Item   Difficulty T1   Difficulty T2
A       1        -1.1
A       2         1.6
A       3        -2.6
A       4         0.9
A       5        -1.8
B       6         0.8             0.8*
B       7         1.7             1.7*
B       8         0.9             0.9*
B       9        -0.2            -0.2*
B      10        -0.9            -0.9*
C      11                         4.1
C      12                         3.5
C      13                         2.4
C      14                         4.4
C      15                         3.2

AVG all           0.0             2.5
AVG link          0.5             0.5
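In this constructed example the anchored test 2 difficulties appear to be the separately scaled test 2 difficulties plus the shift of 2.0 from the shift method. A minimal sketch that checks this for items 6 to 15, with all values copied from the two example tables and `d` taken from the shift table:

```python
# Test 2 difficulties from separate scaling (shift-method table).
separate_t2 = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9,
               11: 2.1, 12: 1.5, 13: 0.4, 14: 2.4, 15: 1.2}

# Test 2 difficulties with items 6-10 anchored to their test 1 values.
anchored_t2 = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9,
               11: 4.1, 12: 3.5, 13: 2.4, 14: 4.4, 15: 3.2}

d = 2.0  # shift reported in the shift-method table

# Largest absolute gap between "separate + shift" and "anchored".
max_gap = max(abs(separate_t2[i] + d - anchored_t2[i]) for i in separate_t2)
print(max_gap < 1e-9)
```

With real data the two methods would not agree exactly, because anchoring fixes each link item individually while the shift method only matches the link-item averages.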

Page 25: Vert&Hor Equating 111024

EVALUATION OF LINK ITEMS

Before equating tests

Page 26: Vert&Hor Equating 111024

Link item invariance

• Relative item difficulty
• Discrimination
• Differential item functioning (DIF)

Page 27: Vert&Hor Equating 111024

Evaluation of

RELATIVE ITEM DIFFICULTY

Page 28: Vert&Hor Equating 111024

Relative item difficulty

[Figure: plot of relative item difficulties; axis ticks run from -5.0 to 5.0 and from -3.0 to 3.0.]
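One way to inspect relative item difficulty, sketched here under assumptions, is to centre the link-item difficulties within each test and compare the centred values item by item. The difficulties are those of items 6 to 10 in the shift-method example; the flagging threshold is a hypothetical value, not one given in the slides:

```python
from statistics import mean


def relative_difficulties(difficulties):
    """Centre difficulties on the average of the link items in that test."""
    centre = mean(difficulties.values())
    return {item: b - centre for item, b in difficulties.items()}


link_t1 = {6: 0.8, 7: 1.7, 8: 0.9, 9: -0.2, 10: -0.9}
link_t2 = {6: -1.2, 7: -0.3, 8: -1.1, 9: -2.2, 10: -2.9}

rel_t1 = relative_difficulties(link_t1)
rel_t2 = relative_difficulties(link_t2)

# Change in relative difficulty per link item between the two tests.
change = {item: rel_t2[item] - rel_t1[item] for item in rel_t1}

THRESHOLD = 0.5  # hypothetical cut-off in logits, for illustration only
flagged = [item for item, c in change.items() if abs(c) > THRESHOLD]
print(flagged)
```

In the constructed example every link item keeps its relative difficulty, so no item is flagged; in practice items with large changes are candidates for removal from the link.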

Page 29: Vert&Hor Equating 111024

Evaluation of

ITEM DISCRIMINATION

Page 30: Vert&Hor Equating 111024

Item discrimination

• Discriminate between able and less able
• Some items discriminate more than others
• Average abilities of students:

            Item 1   Item 2
Answer A     1.00     0.62
Answer B    -0.22     0.61
Answer C    -0.15     0.81
Answer D    -0.02     0.53

Page 31: Vert&Hor Equating 111024

Slopes

• Level of discrimination is reflected by the slope of the item characteristic curve

Page 32: Vert&Hor Equating 111024

Assumption

• Assumption of the Rasch model: slopes are equal across items

• However, in practice slopes always vary a little within a test

• The expected slope is the average slope of all items in a test

• Steeper average slopes reflect a larger spread in abilities in the population

Page 33: Vert&Hor Equating 111024

Link items & discrimination

• The average discrimination of link items can vary between tests

• Individual link items can vary in discrimination between tests

Page 34: Vert&Hor Equating 111024

Experiment - 1

• Same test with 10 items is used in Year 6 and Year 10

• Spread in abilities is larger in Year 10 than in Year 6

• Items discriminate more in Year 10 than in Year 6

Page 35: Vert&Hor Equating 111024

Results experiment 1

           Average discrimination    Population variance    True variance
           Separate    Joint         Separate    Joint
Year 6       0.25      0.34            0.76      1.07           0.80
Year 10      0.41      0.34            1.89      1.49           2.00

Page 36: Vert&Hor Equating 111024

Evaluation of

DIFFERENTIAL ITEM FUNCTIONING

Page 37: Vert&Hor Equating 111024

Differential Item Functioning

• Assumption of the Rasch model: all students with the same ability have the same probability of responding correctly to an item, independent of the subgroup a student belongs to

• The violation of this assumption is called Differential Item Functioning (DIF)
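A small numerical illustration of what a violation looks like, under the assumption that DIF is expressed as a difference in item difficulty between subgroups. The group-specific difficulties and the sign convention below are hypothetical and chosen only for illustration:

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Two students with the same ability but from different subgroups.
ability = 0.0

# Hypothetical difficulties of the same item estimated within each subgroup.
difficulty_boys = -0.2
difficulty_girls = 0.2

p_boys = rasch_probability(ability, difficulty_boys)
p_girls = rasch_probability(ability, difficulty_girls)

# DIF as a difficulty difference; positive means the item is easier for boys
# (sign convention assumed here).
dif = difficulty_girls - difficulty_boys

print(round(p_boys, 3), round(p_girls, 3), round(dif, 2))
```

Without DIF the two difficulties coincide and both students have the same probability of success.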

Page 38: Vert&Hor Equating 111024

Example: sex DIF

Page 39: Vert&Hor Equating 111024

Link items & DIF

• Set of link items needs to have the same average DIF as the non-link items in both tests

• The following experiment shows why

Page 40: Vert&Hor Equating 111024

Experiment 2

• Item pool of 105 items for assessment at time 1

• Selection of 55 trend items all favouring boys

• Scale two sets of items on the same set of student responses

Page 41: Vert&Hor Equating 111024

Results experiment 2

[Figure: Abilities by subgroup (M, F), estimated on all items and on the link items favouring boys. Values shown in the figure: 0.44, 0.50, 0.60, 0.44.]

Page 42: Vert&Hor Equating 111024

Conclusion experiment 2

• Selecting link items that on average favour a subgroup of students changes the gap in performance between subgroups

• The average DIF should be as close to 0 as possible

Page 43: Vert&Hor Equating 111024

Evaluation of

OTHER ITEM CHARACTERISTICS

Page 44: Vert&Hor Equating 111024

Link items & sub-domains

• Equating shift should be based on a set of items that is representative of the whole test

• Equating shifts can be slightly different for different sub-domains

• Best practice to have equal proportions of sub-domains in trend items and in the total item pool

Page 45: Vert&Hor Equating 111024

Link items & item types

• Equating shifts can be slightly different for multiple-choice items than for open-ended items

• Best practice to have equal proportions of item types in trend items and in the total item pool

Page 46: Vert&Hor Equating 111024

EQUATING EXAMPLE

Horizontal and vertical equating in NAP Civics and Citizenship

Page 47: Vert&Hor Equating 111024

Equating in practice

• NAP-CC survey
• Year 6 and Year 10
• Assessment every 3 years since 2004

Page 48: Vert&Hor Equating 111024

Equating overview

Page 49: Vert&Hor Equating 111024

45 horizontal link items in Year 10

[Figure: Relative difficulties of link items; axis ticks run from -5.0 to 5.0 and from -3.0 to 3.0.]

Page 50: Vert&Hor Equating 111024

Average discrimination

                 2007    2010
45 link items    0.43    0.45

Page 51: Vert&Hor Equating 111024

Plot discrimination

[Figure: plot of link-item discrimination; axis ticks run from 0.20 to 1.00 and from -0.10 to 0.70.]

Page 52: Vert&Hor Equating 111024

Average gender DIF

                 2007     2010
45 link items   -0.027   -0.014

Page 53: Vert&Hor Equating 111024

Selection of link items

• 32 of 45 items were selected to use as link items based on:
  – change in relative difficulty
  – change in discrimination
  – average gender DIF

Page 54: Vert&Hor Equating 111024

[Figure: Relative difficulties, 45 link items; axis ticks run from -5.0 to 5.0 and from -3.0 to 3.0.]

[Figure: Relative difficulties, 32 link items; same axes.]

Page 55: Vert&Hor Equating 111024

Average discrimination

                 2007    2010
45 link items    0.43    0.45
32 link items    0.41    0.42

Page 56: Vert&Hor Equating 111024

[Figure: Discrimination, 45 items; axis ticks run from 0.20 to 1.00 and from -0.10 to 0.70.]

[Figure: Discrimination, 32 items; same axes.]

Page 57: Vert&Hor Equating 111024

Average gender DIF

                 2007     2010
45 link items   -0.027   -0.014
32 link items   -0.035   -0.023

Page 58: Vert&Hor Equating 111024

Horizontal equating Year 6

• The process for Year 6 was identical
• 24 out of 27 link items could be used for equating from 2010 to 2007

Page 59: Vert&Hor Equating 111024

Equating shifts

                           Year 6    Year 10
Average difficulty 2010     0.384     0.618
Average difficulty 2007    -0.089    -0.159
Difference (=shift)        -0.473    -0.777
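The shifts in the table are the 2007 average minus the 2010 average of the link-item difficulties, which matches d = MN1 - MN2 in the shift method if 2007 is treated as test 1 and 2010 as test 2 (an interpretation, not stated on the slide). A minimal sketch reproducing the arithmetic from the reported averages:

```python
# Average link-item difficulties reported for each year level and cycle.
avg_difficulty = {
    "Year 6": {2010: 0.384, 2007: -0.089},
    "Year 10": {2010: 0.618, 2007: -0.159},
}

# Shift = average difficulty in 2007 minus average difficulty in 2010.
shifts = {
    level: round(values[2007] - values[2010], 3)
    for level, values in avg_difficulty.items()
}
print(shifts)
```

Adding the shift for a year level to the 2010 abilities of that year level would then place them on the 2007 scale.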

Page 60: Vert&Hor Equating 111024

Equating overview

Page 61: Vert&Hor Equating 111024

Related to common item equating is the

EQUATING ERROR

Page 62: Vert&Hor Equating 111024

Uncertainty in the link

• The equating shift depends on the change in relative difficulty of each item

• Different sets of items will lead to slightly different shifts

• An uncertainty is associated with equating two tests due to sampling of items

Page 63: Vert&Hor Equating 111024

Equating error

• Expressed as a standard error, just like the student sampling error

• Take into account when estimating change over time

• The equating error is added to the standard error of the difference when comparing across time
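The slides do not give a formula, so the following is only a sketch of one common formulation, stated here as an assumption: the equating error is taken as the standard error of the mean change in relative difficulty across link items, and it is combined with the two sampling standard errors as the square root of the sum of squares. All function names and numbers are hypothetical:

```python
import math
from statistics import mean


def equating_error(difficulties_t1, difficulties_t2):
    """Standard error of the mean item-level shift across link items.

    Assumed formulation: centre each test's link-item difficulties,
    take the item-wise differences, and divide their standard
    deviation by the square root of the number of link items.
    """
    n = len(difficulties_t1)
    m1, m2 = mean(difficulties_t1), mean(difficulties_t2)
    changes = [(b1 - m1) - (b2 - m2)
               for b1, b2 in zip(difficulties_t1, difficulties_t2)]
    # The changes average to zero by construction, so no further centring.
    variance = sum(c ** 2 for c in changes) / (n - 1)
    return math.sqrt(variance / n)


def se_of_difference(se_time1, se_time2, eq_error):
    """Standard error of a change over time, including the equating error."""
    return math.sqrt(se_time1 ** 2 + se_time2 ** 2 + eq_error ** 2)


# Hypothetical link-item difficulties at two time points.
t1 = [0.8, 1.7, 0.9, -0.2, -0.9]
t2 = [-1.0, -0.4, -1.1, -2.3, -2.9]

eq_err = equating_error(t1, t2)
print(round(eq_err, 3))
print(round(se_of_difference(0.05, 0.06, eq_err), 3))
```

A different selection of link items would give a different set of changes, which is the item-sampling uncertainty that the equating error is meant to capture.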