The Accuracy of Small Sample Equating: An Investigative/ Comparative study of small sample Equating Methods. Kinge Mbella Liz Burton Rob Keller Nambury.


Kinge Mbella
Liz Burton
Rob Keller
Nambury Raju Psychometric Internship
Measured Progress
July 24, 2009

Presentation Outline

• Introduction
  – Background to Study
  – Research Hypothesis
• Small Sample Equating
  – Identity Equating
  – Chained Linear
  – Synthetic Linking Function
  – Chained Log-linear Pre-smoothing
  – Circle-arc
• Methodology
  – Research Design
  – Procedure
• Results
• Discussion and Conclusion

Introduction

The primary motivation is the 2007 paper by Livingston and Kim, “Small Sample Equating by the Circle-arc Method.”

Empirical research findings confirm that this method produces smaller random and systematic errors when equating with samples smaller than 50 per form (Darby & Mbella, NCME 2009).

Technological innovation is increasing the flexibility of test administration and reporting. Most tests have multiple forms taken by smaller samples of students on different test dates. The need to provide accurate equated scores in a timely manner is pressing.

Practical circumstances in most certification programs dictate the use of small samples.

Research Objectives

This research used empirical data to compare random and systematic errors associated with small sample equating methods.

The ultimate goal is to provide practitioners with objective and valid results to effectively examine the small sample equating dilemma.

It is my intention that these results will provide scientific and logical evidence that “Yes, we may be able to equate accurately with smaller samples.”

Background Into Equating

Mislevy (1992)

“Test construction and equating are inseparable, when they are applied in concert, equated scores from parallel test forms provide virtually exchangeable evidence about students’ behavior on some domain…”

Kolen and Brennan (2004, p. 269)

Research and Equating Jargon

• Form X: The test form administered to the 2007/08 examinees (New Form).
• Form Y: The test form administered to the 2006/07 examinees (Old Form).
• Population: Scored responses for all students on a test form for a particular year.
• Small samples: The sample sizes selected for this research are 22, 35, 44, and 70. Example: SE_22_22.
• Experimental test forms Y and X: Test forms assembled from an operational test form and response matrix.
• CING: The common-item nonequivalent groups design.
• Criterion estimate: The equipercentile equating results of the Form X observed scores equated onto the Form Y observed score scale for that particular grade level and subject area.

Equating Methods

Linear Methods

–Identity Equating

–Chained Linear

–Chained Log linear

•Synthetic linking Function

Non Linear

–Circle-arc

–Equipercentile

Chained Linear Function

$$ l_y(X_i) = \mu_y + \frac{\sigma_y}{\sigma_{yv}} \left[ \mu_{xv} + \frac{\sigma_{xv}}{\sigma_x} (X_i - \mu_x) - \mu_{yv} \right] $$

µ : Sample mean

σ: Sample standard deviation

yv= Anchor Old Form (Y)

xv = Anchor New Form (X)
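A minimal Python sketch of the chained linear function may help make the two links concrete. It assumes the standard two-step form: Form X total score to the anchor scale in the new-form sample, then the anchor scale to the Form Y total score in the old-form sample. The function name `chained_linear` and the toy scores are illustrative assumptions, not the study's code or data.

```python
from statistics import mean, stdev


def chained_linear(x_total, x_anchor, y_total, y_anchor):
    """Build a chained linear function from Form X raw scores to the Form Y scale.

    x_total, x_anchor: total and anchor scores of the new-form (X) sample.
    y_total, y_anchor: total and anchor scores of the old-form (Y) sample.
    """
    mu_x, sd_x = mean(x_total), stdev(x_total)
    mu_xv, sd_xv = mean(x_anchor), stdev(x_anchor)
    mu_y, sd_y = mean(y_total), stdev(y_total)
    mu_yv, sd_yv = mean(y_anchor), stdev(y_anchor)

    def l_y(x):
        # Link 1: Form X total score -> anchor scale, using the new-form sample.
        v = mu_xv + (sd_xv / sd_x) * (x - mu_x)
        # Link 2: anchor scale -> Form Y total score, using the old-form sample.
        return mu_y + (sd_y / sd_yv) * (v - mu_yv)

    return l_y


# Made-up toy data (hypothetical), not the study's data.
x_total, x_anchor = [20, 28, 31, 40, 45], [5, 7, 8, 11, 12]
y_total, y_anchor = [24, 30, 36, 41, 50], [5, 8, 9, 10, 13]
equate = chained_linear(x_total, x_anchor, y_total, y_anchor)
print(round(equate(30), 2))
```

Only ratios of standard deviations taken within the same sample enter the function, so the choice between the n and n − 1 divisor does not change the result.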

Identity Function

$$ ID_y(X_i) = X_i $$

“Identity equating” is the technical term for performing no equating: the equated score equals the observed score.

Synthetic Linking Function

$$ syn_y(X_i) = w \, l_y(X_i) + (1 - w) \, ID_y(X_i) $$

The synthetic linking function is a weighted average of an equating function (in this case Chained Linear) and the Identity function.

In this study, w = 0.5.
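The weighted average can be sketched in a few lines of Python. The helper name `synthetic` and the example linear function are hypothetical, chosen only to show the arithmetic with w = 0.5.

```python
def synthetic(equate, w=0.5):
    """Blend an equating function with the identity function using weight w."""
    def syn(x):
        # w * equated score + (1 - w) * observed score (identity)
        return w * equate(x) + (1 - w) * x
    return syn


# Hypothetical linear equating function, for illustration only.
def example_linear(x):
    return 0.5 * x + 20


syn = synthetic(example_linear, w=0.5)
print(syn(30))  # 0.5 * 35 + 0.5 * 30 = 32.5
```

Because the identity component has no sampling variability, the standard error of the synthetic function is w times that of the equating function it blends.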

Chained log Linear

$$ e_{y(chain)}(x) = e_{y2}[e_{v1}(x)] $$

In chained equipercentile equating, the linking is done through the common items. A Form X score is first converted to the common-item score with the same percentile rank in the new-form group (e_{v1}). That common-item score is then converted to the Form Y score with the same percentile rank in the old-form group (e_{y2}). The result is the chained equipercentile equivalent of that Form X observed score.

Using an adaptation of the log-linear function developed by Rosenbaum and Thayer (1987), the observed score distributions are pre-smoothed before equating, preserving their first two univariate moments.

Circle-arc

Livingston and Kim in 2007 proposed an innovative method with potential to considerably reduce sampling error of equating in small samples while introducing very little systematic error. Their rationale is based on the fact that the relationship between test forms is always curvilinear when forms differ in difficulty. Empirical research has shown that the circle-arc method is the most accurate method in modeling the equipercentile relationship in small samples.

Circle-arc

Circle-arc is a very simple model. It relies entirely on the characteristics of the observed scores.

The main properties are:
• The minimum and maximum possible observed scores are fixed for both test forms.
• A middle point is empirically determined by carrying out any of the linear equating transformations based on the data collection method.

A combination of mathematical formulae that forces an arc of a circle to pass through these three points is used to produce the Circle-arc equating function.
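The three-point construction can be sketched directly as geometry: find the circle through the low end point, the middle point, and the high end point, then read the equated score off the arc. This is a minimal sketch of that idea only; Livingston and Kim's published formulation may differ in detail, and the function name `circle_arc` and the middle point used below are hypothetical.

```python
import math


def circle_arc(p_low, p_mid, p_high):
    """Return y = f(x) for the circular arc through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = p_low, p_mid, p_high
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        # Collinear points: the "arc" degenerates to the straight line.
        slope = (y3 - y1) / (x3 - x1)
        return lambda x: y1 + slope * (x - x1)
    s1, s2, s3 = x1**2 + y1**2, x2**2 + y2**2, x3**2 + y3**2
    # Center (xc, yc) of the circle through the three points.
    xc = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d
    yc = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d
    r2 = (x1 - xc) ** 2 + (y1 - yc) ** 2
    # Assumption: all three points lie on the same (upper or lower) half
    # of the circle, which holds for gently curved equating relationships.
    sign = 1.0 if y2 > yc else -1.0
    return lambda x: yc + sign * math.sqrt(max(0.0, r2 - (x - xc) ** 2))


# End points fixed at the minimum and maximum of two 60-item forms;
# the middle point (32, 36) is a made-up value for illustration.
arc = circle_arc((0.0, 0.0), (32.0, 36.0), (60.0, 60.0))
print(round(arc(45), 2))
```

In practice the middle point would come from one of the linear methods, for example the chained linear equivalent of the Form X mean.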

The Circle-Arc Method

[Figure: Empirical Equating Curves, “Equating Transformations by Method.” X-axis: New Form Raw Scale (X), 0 to 60. Y-axis: Old Form Raw Scale (Y), 0 to 60. Curves: Equip, Chain_Lin, Circ.arc.]

Research Questions

I. How similar are the various small sample equating methods in terms of equating errors?

II. How do differences in test form difficulty affect the accuracy and consistency of the various equating methods?

III. What is the minimum sample size at which the standard error of equating becomes unacceptable?

Research Methodology

Using real examinees’ responses on a Math and Reading Standardized test, two experimental test forms were created for each subject area and grade level. The Common Item Non Equivalent Group (CING) design with an internal anchor was used as the basis for collecting data for equating purposes.

Data Specification (Grade 7)

                        Math     Reading
Form Y (2006/2007)
  N total               33121    33098
  N common items        15       15
  N items total         60       60
Form X (2007/2008)
  N total               34116    34105
  N common items        15       15
  N items total         60       60

Operational Item Bank
• Define test statistical and content objectives.
• Create a database of items based on CTT statistics and content specifications.
• Select the common items.

Automated Test Assembly using Builder2.EXE
• Build the experimental test form based on predefined statistical and content constraints.

Experimental Test Form (Reading Grade 7, Form Y)

Reconstructed Response Matrix
• Based on the new test form for all examinees.

Operational Response Matrix

Common Item Bank

Descriptive Statistics for Reading Grade 7

                     Year 2006/07 (Form Y)      Year 2007/08 (Form X)
                     Anchor    Total Score      Anchor    Total Score
N                         33098                      34105
# of Items           15        60               15        60
Mean                 9.04      36.33            9.16      31.94
SD                   3.77      11.876           3.76      12.32
Minimum              0         3                0         5
Maximum              15        57               15        60
Reliability (α)      0.815     0.918            0.815     0.922
Correlation               0.919                      0.905

Standardized mean difference between forms: 0.03 (anchor), 0.36 (total score).
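The slides do not state how the standardized mean difference was computed. As a quick check, and assuming it is the absolute mean difference divided by an unweighted pooled standard deviation, the reported means and SDs reproduce the reported values of 0.03 and 0.36. The function name below is hypothetical.

```python
import math


def standardized_mean_diff(mean_a, sd_a, mean_b, sd_b):
    """Absolute mean difference over the unweighted pooled SD (an assumed formula)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


# Means and SDs taken from the Reading Grade 7 descriptive statistics.
anchor = standardized_mean_diff(9.04, 3.77, 9.16, 3.76)
total = standardized_mean_diff(36.33, 11.876, 31.94, 12.32)
print(round(anchor, 2), round(total, 2))  # 0.03 0.36
```

The anchor scores indicate that the two groups are nearly equal in ability, while the total scores differ by about a third of a standard deviation, which points to a difference in form difficulty.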

Procedure

• Large Sample Equating
An equipercentile equating was done on the full populations of Form Y and Form X for each subject and grade level. The unsmoothed equipercentile conversion was used as the base equating for comparison.

• Small Sample Equating
Using a bootstrap sampling method without replacement, small samples were drawn from each population and concurrently equated using all 5 equating methods. The sampling and equating were repeated 250 times, and the average equated score at each score point by method was used as the estimated equated score of Form X on the Form Y observed scale.
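The resampling loop can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the populations are synthetic, and simple mean equating (`mean_equating`, which ignores the anchor) stands in for the five methods so the example stays short. All names are hypothetical.

```python
import random
from statistics import mean


def mean_equating(sample_x, sample_y):
    """Stand-in method (mean equating), used only to keep the sketch short."""
    shift = mean(sample_y) - mean(sample_x)
    return lambda x: x + shift


def resample_and_equate(pop_x, pop_y, n_x, n_y, fit, score_points,
                        reps=250, seed=2009):
    rng = random.Random(seed)
    replications = []
    for _ in range(reps):
        sample_x = rng.sample(pop_x, n_x)  # drawn without replacement
        sample_y = rng.sample(pop_y, n_y)
        equate = fit(sample_x, sample_y)
        replications.append([equate(x) for x in score_points])
    # Average equated score at each score point across the replications.
    averages = [mean(col) for col in zip(*replications)]
    return replications, averages


# Synthetic populations of total scores on 60-item forms (made-up data).
gen = random.Random(1)
pop_x = [min(60, max(0, round(gen.gauss(32, 12)))) for _ in range(5000)]
pop_y = [min(60, max(0, round(gen.gauss(36, 12)))) for _ in range(5000)]
reps, avg = resample_and_equate(pop_x, pop_y, 22, 22, mean_equating,
                                [20, 30, 40, 50])
print([round(a, 2) for a in avg])
```

Keeping every replication, not only the averages, is what allows the sampling variability at each score point to be estimated afterwards.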

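Procedure: Result Analysis

Standard Error (SE) (error due to sampling variability)

$$ SE_{EQ}(x_i) = \sqrt{\frac{1}{250} \sum_{j=1}^{250} \left[ \hat{Y}_{EQ,j}(x_i) - \bar{Y}_{EQ}(x_i) \right]^2} $$

Conditional bias (error due to method effect)

$$ bias_{EQ}(x_i) = \frac{1}{250} \sum_{j=1}^{250} \left[ \hat{Y}_{EQ,j}(x_i) - Y_{CR}(x_i) \right] $$

Conditional RMSE

$$ RMSE_{EQ}(x_i) = \sqrt{SE_{EQ}(x_i)^2 + bias_{EQ}(x_i)^2} $$

Here x_i is a Form X raw score point, \hat{Y}_{EQ,j}(x_i) is its equated score under method EQ in replication j, \bar{Y}_{EQ}(x_i) is the mean of those equated scores over the 250 replications, and Y_{CR}(x_i) is the criterion equated score.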
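The three indices can be computed per score point as in the sketch below. It assumes that SE is the standard deviation of the equated scores over replications (divisor equal to the number of replications), that bias is the mean equated score minus the criterion, and that RMSE squared is SE squared plus bias squared. The function name and toy numbers are hypothetical.

```python
import math


def conditional_errors(replications, criterion):
    """Return (SE, bias, RMSE) at each score point.

    replications: one list of equated scores per replication.
    criterion: criterion equated score at each score point.
    """
    n_reps = len(replications)
    results = []
    for i, y_cr in enumerate(criterion):
        values = [rep[i] for rep in replications]
        avg = sum(values) / n_reps
        se = math.sqrt(sum((v - avg) ** 2 for v in values) / n_reps)
        bias = avg - y_cr
        rmse = math.sqrt(se ** 2 + bias ** 2)
        results.append((se, bias, rmse))
    return results


# Two replications at two score points (made-up numbers).
errors = conditional_errors([[10.0, 20.0], [12.0, 20.0]], [10.0, 21.0])
print(errors)
```

With 250 replications the same function applies unchanged; only the length of `replications` grows.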

Research Design Matrix

                             Old Form (Form Y) sample sizes
                             Reading Grade 7       Math Grade 7
                             22   35   44   70     22   35   44   70
New Form (Form X)      22
sample sizes           35
                       44
                       70

Bootstrap Mean Distribution

[Figure: Bootstrap Means, Reading Grade 7 (Form X). X-axis: Samples (PoP_3500, Samp22, Samp35, Samp44, Samp70). Y-axis: 25 to 40.]

[Figure: Bootstrap Means, Reading Grade 7 (Form Y). X-axis: Samples (Pop_3675, Samp22, Samp35, Samp44, Samp70). Y-axis: 28 to 42.]

Standard Error Results (SE)

[Figure: Standard Errors of Equating Methods by Condition (Reading Grade 7). X-axis: Raw Score Scale, 20 to 50. Y-axis: SE in Raw Score Scale, 0 to 4. Panels: SE_22_22, SE_22_44, SE_22_70, SE_35_35, SE_35_44, SE_35_70, SE_44_44, SE_44_70, SE_70_70. Legend: Pop_SE, Chain_Lin, Chainlog, Cir_arc, Synth.]


Preliminary Results

I. How similar are the various equating methods in terms of equating error?

The following conclusions have been reached based on these preliminary analyses:

i. On average, the Circle-arc method appears to have the smallest random error across the entire scale.

ii. The Synthetic linking function has the smallest random error variance for scores within one standard deviation of the mean.

iii. For all methods, the general trend is that the overall random error variance tends to decrease as sample size increases.

Conditional Bias Estimates Reading Grade 7

[Figure: Bias Summary for selected conditions. X-axis: Raw Score Scale, 20 to 50. Y-axis: Bias in Raw Score Scale, -5 to 5. Panels: Bias_22_22, Bias_22_44, Bias_22_70, Bias_35_35, Bias_35_44, Bias_35_70, Bias_44_44, Bias_44_70, Bias_70_70. Legend: Identity, Chain_Lin, Chainlog, Cir_arc, Synth.]

Conditional RMSE Estimates Reading Grade 7

[Figure: RMSE Summary (Reading Grade 7). X-axis: Raw Score Scale, 20 to 50. Y-axis: RMSE in Raw Score Scale, 0 to 6. Panels: RMSE_22_22, RMSE_22_44, RMSE_22_70, RMSE_35_35, RMSE_35_44, RMSE_35_70, RMSE_44_44, RMSE_44_70, RMSE_70_70.]

Conditional RMSE Estimates Math Grade 7

[Figure: RMSE Summary (Math Grade 7). X-axis: Raw Score Scale, 20 to 50. Y-axis: RMSE in Raw Score Scale, 0 to 10. Panels: RMSE_22_22, RMSE_22_44, RMSE_22_70, RMSE_35_35, RMSE_35_44, RMSE_35_70, RMSE_44_44, RMSE_44_70, RMSE_70_70.]

[Figure: Average RMSE for selected sample sizes (Reading Grade 7). X-axis: Equating Methods (Chain_Lin, Chainlog, Cir_arc, Identity, Synth). Y-axis: 0 to 5.]

[Figure: Average RMSE for selected sample sizes (Math Grade 7). X-axis: Equating Methods (Chain_Lin, Chainlog, Cir_arc, Identity, Synth). Y-axis: 0 to 5.]

Exploratory MANOVA

Graphical MANOVA Summary

• Reading Grade 7
  – Preliminary results suggest that the within error variance due to sample variability is not significant.
  – There appears to be a significant mean difference between the various equating methods in terms of the RMSE index. The mean RMSE for Circle-arc appears to be significantly different from that of the other methods.
• Math Grade 7
  – The exploratory MANOVA results for Math Grade 7 lead to a slightly different conclusion. Both the within and between error variances appear not to be significantly different for all methods and sample conditions.

Results Summary: Reading Grade 7

                   Chain_Lin       Chainlog        Cir_arc         Identity        Synth
                   Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD
N = 22_22   SE     2.43    0.75    2.05    0.65    0.97    0.48    0.00            1.21    0.37
            Bias   1.61    2.65    1.04    2.10   -0.59    1.27   -3.34    2.60   -0.87    2.62
            RMSE   3.34    1.90    2.82    1.34    1.89    1.00    3.73    2.72    2.75    1.23
N = 35_35   SE     1.99    0.65    1.66    0.59    0.76    0.38    0.00            0.99    0.32
            Bias   1.57    2.67    1.01    2.11   -0.62    1.27   -3.34    2.60   -0.89    2.62
            RMSE   2.84    1.84    2.38    1.25    1.70    0.94    3.73    2.71    2.54    1.32
N = 44_44   SE     1.99    0.65    1.66    0.59    0.76    0.38    0.00            0.99    0.32
            Bias   1.67    2.72    1.10    2.16   -0.58    1.27   -3.34    2.60   -0.84    2.64
            RMSE   2.73    1.87    2.27    1.29    1.67    0.93    3.73    2.72    2.49    1.35
N = 70_70   SE     1.36    0.41    1.17    0.40    0.56    0.28    0.00    0.00    0.68    0.20
            Bias   1.46    2.71    0.93    2.16   -0.68    1.28   -3.34    2.60   -0.94    2.63
            RMSE   2.51    1.88    2.05    1.26    1.62    0.93    3.73    2.72    2.41    1.41

Conclusion

From this first phase of analyses, the Circle-arc method appears to produce, on average, the smallest amount of systematic and random error.

However, the interpretation of which method produces the least amount of error depends on where the cut scores are set on the scale.

An important recommendation from this study is that if the cut score is set around the mean, then any of these methods will produce similar equating errors proportional to the difference in form difficulty.

Future Directions

1. I would like to look at the effects of differences in test form difficulty on the various methods.

2. I also intend to explore even smaller samples to estimate the minimum sample sizes for each method where equating becomes unrealistic.

3. My ultimate goal is to explore new ways to build test forms to meet predefined statistical and content characteristics in small sample situations.

Questions and Comments

I would like to thank everyone in the Psychometrics Department and Measured Progress for making the whole experience very enjoyable and the actual research as painless as possible.

Thank you

Kinge Mbella
Doctoral Student, UNC Greensboro
