The Accuracy of Small Sample Equating: An Investigative/Comparative Study of Small Sample Equating Methods
Kinge Mbella, Liz Burton, Rob Keller
Nambury Raju Psychometric Internship, Measured Progress
July 24, 2009
Presentation Outline
Introduction: Background to Study, Research Hypothesis
Small Sample Equating: Identity Equating, Chained Linear, Synthetic Linking Function, Chained Log-Linear Pre-Smoothing, Circle-arc
Methodology: Research Design, Procedure
Results, Discussion and Conclusion
Introduction
The primary motivation comes from the 2007 paper by Livingston and Kim, “Small Sample Equating by the Circle-arc Method.”
Empirical research findings confirm that this method produces smaller random and systematic errors when equating with samples smaller than 50 per form (Darby & Mbella, NCME 2009).
Technological innovation is increasing the flexibility of test administration and reporting. Most tests have multiple forms taken by smaller samples of students on different test dates. The need to provide accurate equated scores in a timely manner is pressing.
Practical circumstances in most certification programs dictate the use of small samples.
Research Objectives
This research used empirical data to compare random and systematic errors associated with small sample equating methods.
The ultimate goal is to provide practitioners with objective and valid results to effectively examine the small sample equating dilemma.
It is my intention that these results will provide scientific and logical evidence that “Yes, we may be able to equate accurately with smaller samples.”
Background Into Equating
Mislevy (1992)
“Test construction and equating are inseparable, when they are applied in concert, equated scores from parallel test forms provide virtually exchangeable evidence about students’ behavior on some domain…”
Kolen and Brennan (2004, p. 269)
Research and Equating Jargon
Form X: The test form administered to the 2007/08 examinees (New Form).
Form Y: The test form administered to the 2006/07 examinees (Old Form).
Population: Scored responses for all students on a test form for a particular year.
Small samples: The sample sizes selected for this research are 22, 35, 44, and 70. Example: SE_22_22.
Experimental test forms Y and X: Test forms assembled from an operational test form and response matrix.
CING: The common item non-equivalent group design.
Criterion estimate: The equipercentile equating results of the Form X observed scores equated onto the Form Y observed scale scores for that particular grade level and subject area.
Equating Methods
Linear Methods
–Identity Equating
–Chained Linear
–Chained Log linear
•Synthetic linking Function
Non Linear
–Circle-arc
–Equipercentile
Chained Linear Function
$$l_y(X_i) = \mu_y + \frac{\sigma_y}{\sigma_{yv}}\left[\mu_{xv} + \frac{\sigma_{xv}}{\sigma_x}\,(X_i - \mu_x) - \mu_{yv}\right]$$
µ : Sample mean
σ: Sample standard deviation
yv= Anchor Old Form (Y)
xv = Anchor New Form (X)
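As a minimal sketch of the chained linear function, the snippet below links a new-form score to the anchor scale using the new-form group's means and standard deviations, then links that anchor value to the old-form scale using the old-form group's statistics. The function name `chained_linear` and the toy score lists are illustrative assumptions, not part of the study.

```python
from statistics import mean, stdev

def chained_linear(x, new_total, new_anchor, old_total, old_anchor):
    """Chained linear equivalent of new-form score x on the old-form scale."""
    mu_x, sd_x = mean(new_total), stdev(new_total)
    mu_xv, sd_xv = mean(new_anchor), stdev(new_anchor)
    mu_y, sd_y = mean(old_total), stdev(old_total)
    mu_yv, sd_yv = mean(old_anchor), stdev(old_anchor)
    # Step 1: link X to the anchor scale using the new-form group.
    v = mu_xv + (sd_xv / sd_x) * (x - mu_x)
    # Step 2: link the anchor to Y using the old-form group.
    return mu_y + (sd_y / sd_yv) * (v - mu_yv)

# Toy data (made up for illustration): total and anchor scores per group.
new_total, new_anchor = [30, 35, 40, 45, 50], [6, 8, 9, 11, 12]
old_total, old_anchor = [34, 38, 44, 48, 54], [6, 8, 9, 11, 12]
equated = chained_linear(40, new_total, new_anchor, old_total, old_anchor)
```

Because each standard deviation ratio is formed within a single group, using the sample or the population standard deviation gives the same result.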
Identity Function
$$ID_y(X_i) = X_i$$
The identity equating function is a technical way of saying that no equating is done: the equated score equals the observed score.
Synthetic Linking Function
$$syn(x_i) = w \cdot l_y(x_i) + (1 - w) \cdot ID_y(x_i)$$
The synthetic linking function is a weighted average of an equating function (in this case Chained Linear) and the Identity function.
w = 0.5
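A small sketch of the weighted average may help. Here `synthetic_link` is a hypothetical helper name, and the equated score passed in is assumed to come from some equating function such as chained linear.

```python
def synthetic_link(x, equated, w=0.5):
    """Weighted average of an equated score and the identity (observed) score x."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * equated + (1.0 - w) * x

# With w = 0.5 the result lies halfway between the observed and equated score.
halfway = synthetic_link(40, 43.6)
```

Setting w = 0 reproduces identity equating, and w = 1 reproduces the equating function unchanged.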
Chained log Linear
$$e_{y(chain)}(x) = e_{y2}[e_{v1}(x)]$$
In chained equipercentile equating, the linking is done through the common items. First, a Form X score is converted to the common-item score with the same percentile rank in the new-form group ($e_{v1}$). That common-item score is then converted to the Form Y score with the same percentile rank in the old-form group ($e_{y2}$). The resulting Form Y score is the chained equipercentile equivalent of that Form X observed score.
Using an adaptation of the log-linear function developed by Rosenbaum and Thayer (1987), the observed score distributions are pre-smoothed before equating with a model that preserves their first two univariate moments.
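The two-step chain can be sketched as follows. This is a simplified illustration only: it omits the log-linear pre-smoothing step, uses a mid-point percentile rank convention with linear interpolation, and clamps at the ends of the scale. All function names are hypothetical.

```python
from bisect import bisect_left

def percentile_ranks(scores, max_score):
    """Percentile rank (0-100) of each integer score 0..max_score:
    percent below the score plus half the percent at the score."""
    n = len(scores)
    freq = [0] * (max_score + 1)
    for s in scores:
        freq[s] += 1
    ranks, below = [], 0
    for f in freq:
        ranks.append(100.0 * (below + 0.5 * f) / n)
        below += f
    return ranks

def rank_at(value, ranks):
    """Percentile rank of a possibly fractional score, by linear interpolation."""
    value = min(max(value, 0.0), len(ranks) - 1.0)
    lo = int(value)
    hi = min(lo + 1, len(ranks) - 1)
    return ranks[lo] + (value - lo) * (ranks[hi] - ranks[lo])

def score_at(p, ranks):
    """Score whose percentile rank is p (inverse of rank_at, clamped at the ends)."""
    idx = bisect_left(ranks, p)
    if idx == 0:
        return 0.0
    if idx == len(ranks):
        return float(len(ranks) - 1)
    lo, hi = ranks[idx - 1], ranks[idx]
    return (idx - 1) + (p - lo) / (hi - lo)

def chained_equipercentile(x, new_total, new_anchor, old_total, old_anchor,
                           max_total, max_anchor):
    # Step 1 (new-form group): X score -> anchor score at the same percentile rank.
    p = rank_at(x, percentile_ranks(new_total, max_total))
    v = score_at(p, percentile_ranks(new_anchor, max_anchor))
    # Step 2 (old-form group): anchor score -> Y score at the same percentile rank.
    q = rank_at(v, percentile_ranks(old_anchor, max_anchor))
    return score_at(q, percentile_ranks(old_total, max_total))
```

A quick sanity check on this sketch: when the two groups have identical total and anchor score distributions, a mid-scale score should map back to itself.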
Circle-arc
Livingston and Kim in 2007 proposed an innovative method with potential to considerably reduce sampling error of equating in small samples while introducing very little systematic error. Their rationale is based on the fact that the relationship between test forms is always curvilinear when forms differ in difficulty. Empirical research has shown that the circle-arc method is the most accurate method in modeling the equipercentile relationship in small samples.
Circle-arc
Circle-arc is a very simple model. It relies entirely on the characteristics of the observed scores.
The main properties are:
• The minimum and maximum possible observed scores are fixed for both test forms.
• A middle point is empirically determined by carrying out any of the linear equating transformations based on the data collection method.
A combination of mathematical formulae that forces an arc of a circle to pass through these three points is used to produce the Circle-arc equating function.
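One way to force an arc through the three points can be sketched as below. It follows a simplified approach: split the function into the straight line through the two end points plus a circular arc fitted to the middle point's departure from that line. Whether this matches the exact computation used in the study is an assumption, and the function name and example points are illustrative.

```python
import math

def circle_arc(x, low, mid, high):
    """Equated score at x on the arc through three (x, y) points.

    low and high are the fixed end points (for example the minimum and
    maximum possible scores on both forms); mid is the empirically
    determined middle point from a linear equating.
    """
    (x1, y1), (x2, y2), (x3, y3) = low, mid, high
    slope = (y3 - y1) / (x3 - x1)

    def line(s):
        # Linear component through the two end points.
        return y1 + slope * (s - x1)

    d = y2 - line(x2)  # how far the middle point sits off the line
    if abs(d) < 1e-12:
        return line(x)  # collinear points: no curvature to add
    # Circle through (x1, 0), (x2, d), (x3, 0); its centre is midway between x1 and x3.
    xc = (x1 + x3) / 2.0
    yc = ((x2 - xc) ** 2 + d ** 2 - (x1 - xc) ** 2) / (2.0 * d)
    r2 = (x1 - xc) ** 2 + yc ** 2
    root = math.sqrt(max(r2 - (x - xc) ** 2, 0.0))
    arc = yc + root if d > 0 else yc - root
    return line(x) + arc

# Illustrative points: end points (0, 0) and (60, 60), middle point (32, 36).
curve = [circle_arc(s, (0, 0), (32, 36), (60, 60)) for s in range(0, 61)]
```

By construction the curve reproduces all three points, so the only sample-dependent quantity is the middle point.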
The Circle-Arc Method
Empirical Equating Curves
[Figure: Equating Transformations by Method. x-axis: New Form Raw Scale (X), 0 to 60; y-axis: Old Form Raw Scale (Y), 0 to 60; curves: Equip, Chain_Lin, Circ.arc]
Research Questions
I. How similar are the various small sample equating methods in terms of equating errors?
II. How do differences in test form difficulty affect the accuracy and consistency of the various equating methods?
III. What is the minimum sample size at which the standard error of equating becomes unacceptable?
Research Methodology
Using real examinees’ responses on standardized Math and Reading tests, two experimental test forms were created for each subject area and grade level. The Common Item Non-Equivalent Group (CING) design with an internal anchor was used as the basis for collecting data for equating.
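Data Specification
Grade 7 | Math | Reading
Form Y (2006/2007): N total | 33121 | 33098
Form Y (2006/2007): N common items | 15 | 15
Form Y (2006/2007): N items total | 60 | 60
Form X (2007/2008): N total | 34116 | 34105
Form X (2007/2008): N common items | 15 | 15
Form X (2007/2008): N items total | 60 | 60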
Operational Item Bank
• Define the test's statistical and content objectives.
• Create a database of items based on CTT statistics and content specifications.
• Select the common items.
Automated Test Assembly using Builder2.EXE
• Build the experimental test form based on predefined statistical and content constraints.
Experimental Test Form (Reading Grade 7, Form Y)
• Operational response matrix; common item bank.
• Reconstructed response matrix, based on the new test form for all examinees.
Descriptive Statistics for Reading Grade 7
Statistic | Form Y (2006/07) Anchor | Form Y (2006/07) Total Score | Form X (2007/08) Anchor | Form X (2007/08) Total Score
# of Items | 15 | 60 | 15 | 60
Mean | 9.04 | 36.33 | 9.16 | 31.94
SD | 3.77 | 11.876 | 3.76 | 12.32
Minimum | 0 | 3 | 0 | 5
Maximum | 15 | 57 | 15 | 60
Reliability α | 0.815 | 0.918 | 0.815 | 0.922
N: 33098 (Form Y), 34105 (Form X)
Standardized Mean Diff.: 0.03 (Anchor), 0.36 (Total Score)
Correlation: 0.919 (Form Y), 0.905 (Form X)
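The slide does not state how the standardized mean difference was computed. As a hedged check, the sketch below assumes it is the absolute difference between the two form means divided by the root mean square of the two standard deviations. Under that assumption the tabled means and SDs give the reported 0.03 and 0.36. The function name is hypothetical.

```python
import math

def standardized_mean_diff(mean_a, sd_a, mean_b, sd_b):
    """Absolute mean difference in units of the root mean square of the two SDs."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd

anchor_smd = standardized_mean_diff(9.04, 3.77, 9.16, 3.76)
total_smd = standardized_mean_diff(36.33, 11.876, 31.94, 12.32)
print(round(anchor_smd, 2), round(total_smd, 2))  # 0.03 0.36
```

The near-zero anchor difference alongside the larger total-score difference suggests the two groups are similar in ability while the forms differ in difficulty.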
Procedure
• Large Sample Equating
An equipercentile equating was done on the full populations of Form Y and Form X for each subject and grade level. The unsmoothed equipercentile conversion was used as the base equating for comparison.
• Small Sample Equating
Using a bootstrap-style resampling procedure (sampling without replacement), small samples were drawn from each population and each pair of samples was equated using all 5 equating methods. The sampling and equating were repeated 250 times, and the average equated score at each score point, by method, was used as the estimated equated score of Form X on the Form Y observed scale.
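The resampling loop described above can be sketched as follows. Everything here is illustrative: `replicate_equating` and `mean_equating` are hypothetical names, the populations are randomly generated stand-ins rather than the study data, and `mean_equating` is a toy placeholder for a real equating method.

```python
import random
from statistics import mean

def replicate_equating(pop_x, pop_y, n_x, n_y, equate, score_points,
                       n_reps=250, seed=0):
    """Draw small samples without replacement, equate, and collect the results.

    pop_x, pop_y: lists of (total, anchor) score pairs for each population.
    equate: callable (x, sample_x, sample_y) -> equated score.
    Returns {score point: list of n_reps equated scores}.
    """
    rng = random.Random(seed)
    results = {x: [] for x in score_points}
    for _ in range(n_reps):
        sample_x = rng.sample(pop_x, n_x)  # sampling without replacement
        sample_y = rng.sample(pop_y, n_y)
        for x in score_points:
            results[x].append(equate(x, sample_x, sample_y))
    return results

def mean_equating(x, sample_x, sample_y):
    """Toy stand-in for a real method: shift x by the difference in sample means."""
    return x + mean(t for t, _ in sample_y) - mean(t for t, _ in sample_x)

# Made-up populations of (total, anchor) pairs, for illustration only.
demo_rng = random.Random(1)
pop_x = [(demo_rng.randint(5, 60), demo_rng.randint(0, 15)) for _ in range(2000)]
pop_y = [(demo_rng.randint(3, 57), demo_rng.randint(0, 15)) for _ in range(2000)]

reps = replicate_equating(pop_x, pop_y, 22, 22, mean_equating, [20, 30, 40, 50])
average_equated = {x: mean(values) for x, values in reps.items()}
```

Keeping every replication, rather than only the running average, makes it possible to compute the spread of the equated scores at each score point afterwards.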
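Procedure: Result Analysis
Standard Error (SE) (error due to sampling variability)
$$SE_{EQ}(x_i) = \sqrt{\frac{1}{250}\sum_{j=1}^{250}\left(Y_{EQ,j}(x_i) - \bar{Y}_{EQ}(x_i)\right)^2}$$
Conditional bias (error due to method effect)
$$bias_{EQ}(x_i) = \frac{1}{250}\sum_{j=1}^{250}\left(Y_{EQ,j}(x_i) - Y_{CR}(x_i)\right)$$
Conditional RMSE
$$RMSE_{EQ}(x_i) = \sqrt{SE_{EQ}(x_i)^2 + bias_{EQ}(x_i)^2}$$
Here $Y_{EQ,j}(x_i)$ is the equated score at score point $x_i$ from method EQ in replication $j$, $\bar{Y}_{EQ}(x_i)$ is its mean over the 250 replications, and $Y_{CR}(x_i)$ is the criterion equated score.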
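As a sketch of how the three indices relate, the helper below computes them at a single score point from a list of replicated equated scores and a criterion value. The name `error_indices` and the four example values are assumptions made for illustration.

```python
import math

def error_indices(equated, criterion):
    """SE, bias and RMSE at one score point.

    equated: equated scores from the replications at that score point.
    criterion: the criterion (large-sample) equated score at that point.
    """
    n = len(equated)
    avg = sum(equated) / n
    se = math.sqrt(sum((y - avg) ** 2 for y in equated) / n)   # sampling variability
    bias = sum(y - criterion for y in equated) / n             # method effect
    rmse = math.sqrt(se ** 2 + bias ** 2)
    return se, bias, rmse

# Four made-up replications around a criterion of 32.
se, bias, rmse = error_indices([31, 33, 32, 34], criterion=32)
```

With these definitions the squared RMSE equals the mean squared deviation of the replicated scores from the criterion, so RMSE summarizes random and systematic error together.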
Research Design Matrix
For each subject (Reading Grade 7 and Math Grade 7), the New Form (Form X) sample sizes 22, 35, 44, and 70 are crossed with the Old Form (Form Y) sample sizes 22, 35, 44, and 70.
Bootstrap Mean Distribution
[Figure: Bootstrap Means Reading Grade 7 (Form X). x-axis: Samples (PoP_3500, Samp22, Samp35, Samp44, Samp70)]
[Figure: Reading Grade 7 (Form Y). x-axis: Samples (Pop_3675, Samp22, Samp35, Samp44, Samp70)]
Standard Error Results (SE)
[Figure: Standard Errors of Equating Methods by Condition (Reading Grade 7). x-axis: Raw Score Scale (20 to 50); y-axis: SE in Raw Score Scale (0 to 4); panels: SE_22_22, SE_22_44, SE_22_70, SE_35_35, SE_35_44, SE_35_70, SE_44_44, SE_44_70, SE_70_70; curves: Pop_SE, Chain_Lin, Chainlog, Cir_arc, Synth, added one at a time across four builds of the slide]
Preliminary Results
I. How similar are the various equating methods in terms of equating error?
The following conclusions have been reached based on these preliminary analyses:
i. On average, the Circle-arc method appears to have the smallest random error across the entire scale.
ii. The Synthetic linking function has the smallest random error variance for scores within one standard deviation of the mean.
iii. For all methods, the overall random error variance tends to decrease as sample size increases.
Bias Summary for selected conditions
[Figure: Conditional Bias Estimates Reading Grade 7. x-axis: Raw Score Scale (20 to 50); y-axis: Bias in Raw Score Scale (-5 to 5); panels: Bias_22_22, Bias_22_44, Bias_22_70, Bias_35_35, Bias_35_44, Bias_35_70, Bias_44_44, Bias_44_70, Bias_70_70; curves: Identity, Chain_Lin, Chainlog, Cir_arc, Synth]
RMSE Summary (Reading Grade 7)
[Figure: Conditional RMSE Estimates Reading Grade 7. x-axis: Raw Score Scale (20 to 50); y-axis: RMSE in Raw Score Scale (0 to 6); panels: RMSE_22_22, RMSE_22_44, RMSE_22_70, RMSE_35_35, RMSE_35_44, RMSE_35_70, RMSE_44_44, RMSE_44_70, RMSE_70_70; curves: Identity, Chain_Lin, Chainlog, Cir_arc, Synth]
RMSE Summary (Math Grade 7)
[Figure: Conditional RMSE Estimates Math Grade 7. x-axis: Raw Score Scale (20 to 50); y-axis: RMSE in Raw Score Scale (0 to 10); panels: RMSE_22_22, RMSE_22_44, RMSE_22_70, RMSE_35_35, RMSE_35_44, RMSE_35_70, RMSE_44_44, RMSE_44_70, RMSE_70_70; curves: Identity, Chain_Lin, Chainlog, Cir_arc, Synth]
[Figure: Average RMSE for selected Sample sizes (Reading Grade 7). x-axis: Equating Methods (Chain_Lin, Chainlog, Cir_arc, Identity, Synth); y-axis: 0 to 5]
[Figure: Average RMSE for selected Sample sizes (Math Grade 7). x-axis: Equating Methods (Chain_Lin, Chainlog, Cir_arc, Identity, Synth); y-axis: 0 to 5]
Exploratory MANOVA
Graphical MANOVA Summary
• Reading Grade 7
– Preliminary results suggest that the within error variance due to sample variability is not significant.
– There appears to be a significant mean difference between the various equating methods in terms of the RMSE index. The mean RMSE for Circle-arc appears to be significantly different from the other methods.
• Math Grade 7
– The exploratory MANOVA results for Math Grade 7 lead to a slightly different conclusion. Both the within and between error variances appear not to be significantly different for all methods and sample conditions.
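Results Summary: Reading Grade 7
Condition | Index | Chain_Lin Mean | Chain_Lin SD | Chainlog Mean | Chainlog SD | Cir_arc Mean | Cir_arc SD | Identity Mean | Identity SD | Synth Mean | Synth SD
N=22_22 | SE | 2.43 | 0.75 | 2.05 | 0.65 | 0.97 | 0.48 | 0.00 | | 1.21 | 0.37
N=22_22 | Bias | 1.61 | 2.65 | 1.04 | 2.10 | -0.59 | 1.27 | -3.34 | 2.60 | -0.87 | 2.62
N=22_22 | RMSE | 3.34 | 1.90 | 2.82 | 1.34 | 1.89 | 1.00 | 3.73 | 2.72 | 2.75 | 1.23
N=35_35 | SE | 1.99 | 0.65 | 1.66 | 0.59 | 0.76 | 0.38 | 0.00 | | 0.99 | 0.32
N=35_35 | Bias | 1.57 | 2.67 | 1.01 | 2.11 | -0.62 | 1.27 | -3.34 | 2.60 | -0.89 | 2.62
N=35_35 | RMSE | 2.84 | 1.84 | 2.38 | 1.25 | 1.70 | 0.94 | 3.73 | 2.71 | 2.54 | 1.32
N=44_44 | SE | 1.99 | 0.65 | 1.66 | 0.59 | 0.76 | 0.38 | 0.00 | | 0.99 | 0.32
N=44_44 | Bias | 1.67 | 2.72 | 1.10 | 2.16 | -0.58 | 1.27 | -3.34 | 2.60 | -0.84 | 2.64
N=44_44 | RMSE | 2.73 | 1.87 | 2.27 | 1.29 | 1.67 | 0.93 | 3.73 | 2.72 | 2.49 | 1.35
N=70_70 | SE | 1.36 | 0.41 | 1.17 | 0.40 | 0.56 | 0.28 | 0.00 | 0.00 | 0.68 | 0.20
N=70_70 | Bias | 1.46 | 2.71 | 0.93 | 2.16 | -0.68 | 1.28 | -3.34 | 2.60 | -0.94 | 2.63
N=70_70 | RMSE | 2.51 | 1.88 | 2.05 | 1.26 | 1.62 | 0.93 | 3.73 | 2.72 | 2.41 | 1.41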
Conclusion
From this first phase of analyses, the Circle-arc method appears to produce, on average, the smallest amount of systematic and random error.
However, the interpretation of which method produces the least amount of error depends on where the cut scores are set on the scale.
An important recommendation from this study is that if the cut score is set around the mean, then any of these methods will produce similar equating errors, proportional to the difference in form difficulty.
Future Directions
1. I would like to look at the effects of differences in test form difficulty on the various methods.
2. I also intend to explore even smaller samples to estimate the minimum sample sizes for each method where equating becomes unrealistic.
3. My ultimate goal is to explore new ways to build test forms to meet predefined statistical and content characteristics in small sample situations.
Questions and Comments
I would like to thank everyone in the Psychometrics Department and Measured Progress for making the whole experience very enjoyable and the actual research as painless as possible.
Thank you
Kinge Mbella
Doctoral Student, UNC Greensboro