A Comparison of Four Test Equating Methods
Report Prepared for the Education Quality and Accountability Office (EQAO) by
Xiao Pang, Ph.D., Psychometrician, EQAO
Ebby Madera, Ph.D., Psychometrician, EQAO
Nizam Radwan, Ph.D., Psychometrician, EQAO
Su Zhang, Ph.D., Psychometrician, EQAO
APRIL 2010
Education Quality and Accountability Office, 2 Carlton Street, Suite 1200, Toronto ON M5B 2M9, 1-888-327-7377, www.eqao.com
About the Education Quality and Accountability Office
The Education Quality and Accountability Office (EQAO) is an independent provincial agency funded by the Government of Ontario. EQAO’s mandate is to conduct province-wide tests at key points in every student’s primary, junior and secondary education and report the results to educators, parents and the public.

EQAO acts as a catalyst for increasing the success of Ontario students by measuring their achievement in reading, writing and mathematics in relation to Ontario Curriculum expectations. The resulting data provide a gauge of quality and accountability in the Ontario education system.

The objective and reliable assessment results are evidence that adds to current knowledge about student learning and serves as an important tool for improvement at all levels: for individual students, schools, boards and the province.
About EQAO Research
EQAO undertakes research for two main purposes:
• to maintain best-of-class practices and to ensure that the agency remains at the forefront of large-scale assessment and
• to promote the use of EQAO data for improved student achievement through the investigation of means to inform policy directions and decisions made by educators, parents and the government.

EQAO research projects delve into the factors that influence student achievement and education quality, and examine the statistical and psychometric processes that result in high-quality assessment data.
Acknowledgements
This research was conducted under the direction of Michael Kozlow and the
EQAO scholars in residence, Todd Rogers and Mark Reckase, who provided guidance on
the development of the proposal and the conduct of the study. They provided extensive
and valuable advice on the research procedures, input at different stages of the analysis
and review and editorial comments on the final report. Qi Chen provided academic and
technical assistance to speed up the process of the analysis. Yunmei Xu provided timely
assistance in completing the analysis. The authors are grateful to them for the significant
contributions they made to improve the academic quality of this research.
Abstract
This research evaluated the effectiveness of four commonly used equating
methods in identifying students’ real gains: concurrent calibration
(CC) equating, fixed common item parameter (FCIP) equating, Stocking and Lord test
characteristic curve (TCC) equating, and mean/sigma (M/S) equating. The performance
of the four procedures was evaluated using simulated data for a test design with multiple
item formats. Five gain conditions (-0.3, -0.1, 0.0, 0.1 and 0.3 on the θ-scale) were built
into the simulation to mimic the Ontario Secondary School Literacy Test (OSSLT), the
Test provincial de compétences linguistiques (TPCL), the Assessments of Reading,
Writing and Mathematics, Primary and Junior Divisions and the applied version of the
English Grade 9 Assessment of Mathematics. Twenty replications were conducted. The
estimated percentages at multiple achievement levels and in the successful and
unsuccessful categories were compared with the respective true percentages obtained
from the known θ-distributions. The results across seven assessments showed that the
FCIP, TCC and M/S equating procedures based on separate calibrations performed
equally well and much better than the CC procedure.
Introduction
One of the goals of the Education Quality and Accountability Office (EQAO) is
to provide evidence concerning changes in student achievement from year to year in the
province of Ontario.1 Yearly assessments in both English and French are conducted at the
primary (Grade 3) and junior (Grade 6) levels (reading, writing and mathematics) and in
Grade 9 (academic and applied mathematics). The results for these assessments are
reported in terms of the percentage of students at each of five achievement levels (Not
Enough Evidence for Level 1 [NE1] or Below Level 1 and Levels 1, 2, 3 and 4). The
provincial standard for acceptable performance is Level 3. In addition to these
assessments, EQAO is responsible for two literacy tests: the Ontario Secondary School
Literacy Test (OSSLT) in English and the Test provincial de compétences linguistiques
(TPCL) in French, either of which is a required credential for graduation from high
school.2
When reporting evidence of change in performance between two years, it is
important that a distinction be made between differences in difficulty of the test forms
used to assess the students and real gains or losses in achievement between the two years.
The purpose of equating is to adjust for test difficulty differences so that only real
differences in performance are reported.
There are, however, different procedures for equating tests, some of which are
based on classical test score theory (CTST) and others on item response theory (IRT).
Some research has shown that equating based on CTST and IRT provides similar results
for horizontal equating. For example, Hills, Subhiyah and Hirsch (1988) found similar
results with linear equating, concurrent calibration (CC) using the Rasch model and the
three-parameter IRT model, and separate calibration using the three-parameter IRT
model with fixed common item parameter (FCIP) equating and mean/sigma (M/S)
equating (Marco, 1977). However, Kolen and Brennan (1995) pointed out that since
many large assessment programs use IRT models to develop and calibrate tests, the use
of IRT-based equating methods is often the logical choice. Therefore, since EQAO uses procedures based on IRT to calibrate and equate the items in each of its assessments, the equating methods considered in the present study were restricted to IRT-based equating methods.

1. EQAO is an arm’s-length agency of the Ontario Ministry of Education that administers large-scale provincial assessments.
2. Students who are unsuccessful on the OSSLT may take it again the next year or enrol in the Ontario Secondary School Literacy Course.
The most commonly used IRT equating procedures are the CC procedure
(Wingersky & Lord, 1984), which is based on a concurrent calibration of a sample
consisting of the students assessed in each of two years to be equated; the FCIP
procedure; the test characteristic curve (TCC) procedure (Loyd & Hoover, 1980) and the
M/S procedure. The FCIP, TCC and M/S procedures are based on separate calibrations of
the two samples. Unfortunately, these procedures do not always yield the same results.
Therefore, understanding the behavior of different equating methods is critical to
ensuring that the interpretation of estimates of change is valid.
EQAO currently uses separate IRT calibration and the FCIP equating procedure.
However, no research has examined the effectiveness of this approach in recovering
gains or differences between two years for the EQAO assessments, or whether or not one
of the other IRT equating methods might better recover such changes.
Purpose of the Study
The purpose of the present study is to assess the effectiveness of the four
different equating procedures identified above (CC, FCIP, TCC and M/S) in identifying
the real changes in student performance across years. Specifically, the four procedures
were compared in terms of how accurately the results they yielded represented known
changes in the percentages of students at each achievement level for the primary (Grade
3), junior (Grade 6) and Grade 9 assessments and in the two achievement categories for
the OSSLT and TPCL (successful and unsuccessful).
Review of Equating Methods
When the common-item nonequivalent group design and IRT-based equating
methods are used, one of two approaches can be taken: concurrent or separate calibration.
With the concurrent calibration and equating approach (Wingersky & Lord, 1984), the
students’ responses from the two tests to be equated are combined into one data file
through the alignment of the common items. The tests are then simultaneously calibrated.
As a result, the parameter estimates for the items in the tests are put on a common scale.
The students’ ability scores for the two tests are estimated separately using the corresponding
scaled item parameters, and the means of the two tests are then compared to determine
the direction and magnitude of the change. Theoretically, CC is expected to yield more
stable results than the separate-calibration methods that employ transformations, and CC
is also expected to minimize the impact of sampling fluctuations in the estimation of the
pseudo-guessing parameter due to an increase in the number of low-ability examinees.
With separate calibration, the calibrations are performed separately for the two
tests and common items are used to put the two tests on a common scale. The test used to
set the common scale is referred to as “the reference test” and the second test is referred
to as “the equated test.” A linear transformation can then be used to place the item
parameters from the equated test on the scale of the reference test based on the items
common to the two tests. Equating procedures that use a linear transformation include the
mean/mean approach (M/M) (Loyd & Hoover, 1980), the M/S method (Marco, 1977) and
the TCC approach (Li, Lissitz & Yang, 1999; Stocking & Lord, 1983). While the M/M and
M/S procedures are theoretically sound, they use summary statistics of the item parameter
estimates separately to estimate the equating coefficients. In contrast, the TCC method is a
simultaneous estimation procedure that takes better account of the information provided
(Li et al., 1999).
FCIP is an alternative two-step calibration and equating method. In it, the
reference test is calibrated first. When the equated test is calibrated, the parameters of its
common items are fixed at the estimates obtained through the calibration of the reference
test. As a result, the equated test score distribution is placed on the reference test scale
(for a technically detailed description of FCIP, refer to Kim, 2006). The FCIP procedure
is expected to produce results superior to those produced by the M/M, M/S and TCC
procedures because of the avoidance of incorrect transformation functions.
While some research has been conducted to evaluate different IRT equating
approaches (Hanson & Beguin, 2002; Hills, Subhiyah & Hirsch, 1988; Kim & Cohen,
Notes to Table 1: a. x (y): the number of open-response items and the total number of possible points for these items. b. Combined winter and spring samples. c. The descriptive statistics were based on the θ-scale from the operational calibrations.
IRT Models. The IRT model used to generate the item responses for the OSSLT
and TPCL was a modified Rasch model with guessing fixed to 0.20 for multiple-choice
items and the a-parameter fixed to 0.588. This value of the a-parameter effectively sets
the discrimination to 1.0 because the a-parameter is multiplied by 1.7 in the model. For
the primary, junior and Grade 9 subtests, the two-parameter model with a fixed guessing
parameter added was used for multiple-choice items. For all tests and subtests, the
generalized partial credit model was used for open-response items. These IRT models
appear to be the most appropriate for EQAO’s assessments (Xie, 2006).
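To make these response models concrete, the following is a minimal Python sketch (the study itself used Matlab programs, described below under Computer Programs). The 1.7 scaling constant, the fixed guessing value of 0.20 and the fixed a-parameter of 0.588 come from the text; the function names, the remaining item parameters and the θ value are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 1.7  # scaling constant in the logistic IRT models

def p_3pl(theta, a, b, c=0.20):
    # Probability of a correct multiple-choice response; with a = 0.588,
    # D * a is approximately 1.0, i.e., an effective discrimination of 1.0.
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def gpc_probs(theta, a, b, d):
    # Category probabilities under the generalized partial credit model;
    # d holds the category parameters, giving len(d) + 1 score categories.
    z = np.concatenate(([0.0], np.cumsum(D * a * (theta - b + np.asarray(d)))))
    ez = np.exp(z - z.max())  # subtract the max for numerical stability
    return ez / ez.sum()

# One examinee's simulated responses (item parameters are illustrative)
theta = 0.4
mc_score = int(rng.random() < p_3pl(theta, a=0.588, b=-0.2))
or_score = rng.choice(4, p=gpc_probs(theta, a=0.7, b=0.1, d=[0.5, 0.0, -0.5]))
```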
Steps for Data Simulation. The following two questions guided the development
of the computer simulation for each assessment:
a. What are the true changes (in percentage) at each achievement level?
b. What would the gains be at each achievement level in a real testing situation
after the four equating processes of interest are applied, and how close are
these estimated changes to the true changes?
The following data simulation steps were carried out to help answer these questions.
1. True Percentages
To identify the true percentages in each achievement category, the known θ-
distributions for Year 1 and Year 2 were simulated from the Pearson type-IV family,
matched to the mean, standard deviation, skewness and kurtosis of the θ-distributions
from the Year 1 and Year 2 operational tests, respectively (see Table 1). Since true changes
between the two years are not known, five possible “true” changes (-0.3, -0.1, 0.0, 0.1
and 0.3 units on the θ-scale) were modelled in the data simulations to reflect different
performance changes. These values span the range of changes in performance that might
be seen in realistic educational settings, although the ±0.3 conditions represent changes
that are larger than those that have been generally observed in the EQAO assessments. To
create the five gain conditions for Year 2, the five gains were added to the mean of the
Year 1 θ-distribution. The known θ-distribution for Year 2 was then simulated for each of
the five gain conditions for each selected test or subtest. The sample sizes used in the
simulations were chosen to be close to the equating samples used in practice for each
assessment. In the equating samples, the students who were accommodated with special
versions and the students who did not respond were excluded. In the case of the OSSLT,
the students who had previously been eligible to write were also excluded from the
equating sample.
Cut scores were determined on the known Year 1 θ-distribution using the EQAO-
reported percentages for each achievement level. These cut scores were then applied to
the five simulated known Year 2 θ-distributions to identify the true percentage for each
achievement level.
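As an illustration of this step, the following hedged Python sketch derives cut scores from assumed cumulative percentages on a simulated Year 1 θ-distribution and applies them to the five Year 2 gain conditions. The gain values come from the text; the normal distributions (standing in for the Pearson type-IV draws), the sample size and the cumulative percentages are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the known Year 1 theta distribution (the study drew from a
# Pearson type-IV distribution matched to operational moments).
theta_y1 = rng.normal(0.0, 1.0, 50_000)

# Illustrative cumulative percentages below each cut score; the real values
# come from the EQAO-reported percentages at each achievement level.
cum_pct_below = [5.0, 20.0, 45.0, 80.0]       # four cuts -> five levels
cuts = np.percentile(theta_y1, cum_pct_below)  # cut scores on the Year 1 scale

for gain in (-0.3, -0.1, 0.0, 0.1, 0.3):
    theta_y2 = rng.normal(gain, 1.0, 50_000)   # known Year 2 distribution
    levels = np.searchsorted(cuts, theta_y2)   # level index 0..4 per student
    true_pct = 100 * np.bincount(levels, minlength=5) / theta_y2.size
    print(gain, np.round(true_pct, 2))         # true percentage per level
```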
2. Empirical Percentages
To obtain the empirical percentages, matrix data that mimic the EQAO
assessments had to be simulated. The data simulation included two stages: a) simulating
the full data set for the Year 1 and Year 2 students and b) using the full data set to generate
the matrix data set for calibration. To simulate the full data set, the operational-item
parameters from Years 1 and 2 were combined into one file. The known Year 1 θ-
distribution was also combined with each of the five known Year 2 θ-distributions. The
item-response vectors for the students were then generated for the Year 1 and Year 2 test
forms for each gain condition based on the combined parameter file and true θ-
distributions (see Figure 2).
Figure 2. Full Data Structure (rows: Sample Year 1, Sample Year 2; columns: Year 1 Test Form, Year 2 Test Form)
The vertical axis of the diagram represents students. Those above the mid-point
on the vertical axis are from Year 1 and those below the mid-point are from Year 2. The
horizontal axis represents items. Items to the left of the mid-point on the horizontal axis
are included in the form administered in Year 1 and items to the right are included in the
form administered in Year 2. To create the Year 1 matrix equating sample and the Year 2
operational equating sample, the light grey parts in the diagram are removed from the full
data set. It is believed that the best way to get good information about changes in
students’ performance would be to have both cohorts of students take both tests.
Therefore, creating the equating samples from the ideal full data set seemed reasonable.
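The following minimal sketch illustrates how the equating samples can be carved out of the ideal full data set in Figure 2. The sample sizes, test lengths and missing-data code are illustrative, and the 0/1 response values are placeholders rather than model-generated data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative sizes: n1/n2 examinees per cohort, k1/k2 items per form
n1, n2, k1, k2 = 3000, 3000, 40, 40

# Full data set: both cohorts "take" both forms (placeholder responses)
full = rng.integers(0, 2, size=(n1 + n2, k1 + k2))

NOT_ADMINISTERED = -9  # missing-by-design code used at calibration
equating = np.full(full.shape, NOT_ADMINISTERED)
equating[:n1, :k1] = full[:n1, :k1]   # Year 1 sample x Year 1 form (kept)
equating[n1:, k1:] = full[n1:, k1:]   # Year 2 sample x Year 2 form (kept)
# The two off-diagonal blocks (the light grey parts of Figure 2) are dropped.
```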
After the usual equating data sets were created, equating was conducted using the
CC, FCIP, TCC and M/S equating methods to obtain empirical percentages. For the CC
procedure, the Year 1 and Year 2 data sets were combined and calibrated together. In the
case of the TCC and M/S procedures, the Year 1 and Year 2 data sets were first calibrated
separately. Then the TCC and M/S procedures were applied to obtain the linear
transformation coefficients to scale the Year 1 test (equated) to the Year 2 test
(reference). With the FCIP procedure, the two tests were calibrated separately with the
matrix item parameters of the Year 1 test fixed at the values of the Year 2 operational-
item parameters to place the Year 1 test on the Year 2 scale. Similar procedures to those
used in Step 1 were applied to identify a cut score and obtain an empirical percentage for
each achievement level and each gain condition.
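As an illustration of the linear transformation step, the following sketch computes mean/sigma coefficients from the common items' difficulty estimates (the TCC method would instead choose the coefficients that minimize the distance between the two test characteristic curves). The b-values shown are toy numbers, not estimates from the study.

```python
import numpy as np

def mean_sigma(b_ref, b_eq):
    # Linear coefficients that put the equated scale onto the reference
    # scale, computed from the common items' b-parameter estimates.
    A = np.std(b_ref, ddof=1) / np.std(b_eq, ddof=1)
    B = np.mean(b_ref) - A * np.mean(b_eq)
    return A, B

# Toy b-estimates for five common items from the two separate calibrations
b_ref = np.array([-1.2, -0.4, 0.3, 0.9, 1.5])  # reference calibration
b_eq = np.array([-1.0, -0.3, 0.5, 1.0, 1.8])   # equated calibration
A, B = mean_sigma(b_ref, b_eq)

b_rescaled = A * b_eq + B  # b* = A*b + B; also a* = a/A and theta* = A*theta + B
```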
Computer Programs. The examinees’ item responses were simulated using
the Matlab programs Datagengpcmv and Datagen3plt. Datagengpcmv was used to simulate
responses for the open-response items, and Datagen3plt was used to simulate responses for the
multiple-choice items. The simulated item-response distributions were compared with the
actual item-response distributions, and they showed very similar patterns for each of the
selected assessments. PARSCALE was chosen to conduct calibrations because EQAO
uses it for operational IRT calibration and scoring. MULTILOG and PARSCALE
generate similar parameter estimates (Childs & Chen, 1999; Hanson & Beguin, 2002).
However, PARSCALE produces an overall item location parameter and reproduces the
category parameters by centring them to zero (Childs & Chen, 1999). Further, the number
of examinees PARSCALE can handle is much greater.
Evaluation of the FCIP, CC, TCC and M/S Equating Methods. The performance
of the FCIP, CC, TCC and M/S equating methods was evaluated by comparing the
estimated percentage with the corresponding true percentage at each of the four
achievement levels for the primary, junior and Grade 9 assessments and for the successful
and unsuccessful categories for the OSSLT and TPCL. Twenty replications of each
simulation were carried out. The inclusion of a wide variety of assessments was also
considered to be very important for this study.
Descriptive statistics of the empirical percentages across the 20 replications were
computed for each achievement level, equating method and change condition. Each
average estimated percentage was compared to the corresponding true percentage to
determine the bias in the empirical estimate:
$$\mathrm{Bias}_{l} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{\Delta}_{il} - \Delta_{l}\right) = \bar{\hat{\Delta}}_{l} - \Delta_{l} \qquad (1)$$

where $\bar{\hat{\Delta}}_{l} = \frac{1}{n}\sum_{i=1}^{n}\hat{\Delta}_{il}$,

$\Delta_{l}$ is the true value for achievement level $l$,

$\hat{\Delta}_{il}$ is the estimated value for the $i$th replication at achievement level $l$, and

$n = 20$ is the number of replications (Sinharay & Holland, 2007).
If the bias is negative, then the true percentage is underestimated; if the bias is
positive, then the true percentage is overestimated.
The stability of the empirical percentages across replications was assessed using
the root mean square error (RMSE):
$$\mathrm{RMSE}_{l} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\Delta}_{il} - \Delta_{l}\right)^{2}} \qquad (2)$$
The smaller the RMSE is, the closer the estimated values are to the true values.
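A minimal Python sketch of equations (1) and (2), assuming the estimated percentages for one achievement level and change condition are collected in an array (the values shown are toy numbers; the study used n = 20 replications):

```python
import numpy as np

def bias_and_rmse(estimates, true_pct):
    # estimates: empirical percentages across replications for one
    # achievement level and change condition; true_pct: the true percentage.
    bias = estimates.mean() - true_pct                    # equation (1)
    rmse = np.sqrt(((estimates - true_pct) ** 2).mean())  # equation (2)
    return bias, rmse

estimates = np.array([30.4, 29.8, 30.9, 30.1, 30.6])  # toy values
print(bias_and_rmse(estimates, true_pct=30.0))
```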
For the purposes of this study, bias and RMSE values smaller than or equal to 1%
were considered to be negligible. Differences in bias between two methods had to
exceed 0.50% to be considered meaningful. Many large-scale assessment
programs consider a change of 1% from one year to the next to be meaningful.
Results
The results for the OSSLT and TPCL are presented first, followed by the results
for the primary, junior and Grade 9 subtests selected for this study.
OSSLT and TPCL
OSSLT. The results for the OSSLT are reported in the top panel of Table 2. The
pattern of bias for the four equating methods is complex. For example, while the CC
method recovered the true change of zero, this procedure did not recover the other
changes as well. While the FCIP, TCC and M/S procedures recovered changes in the
percentages of unsuccessful students equally well across all change conditions, the TCC
and M/S procedures recovered the -0.1 and 0.3 changes much better than the FCIP and
CC procedures did. Overestimates were observed for the positive changes (i.e.,
increases in the percentage of unsuccessful students), with the bias of the CC procedure
more pronounced than that of the FCIP, TCC and M/S procedures (e.g., 3.64%
vs. 0.72%, 0.38% and 0.37% for a true gain of 0.3 on the θ-scale). Underestimates
were observed for the negative changes, with the bias of the CC equating procedure larger
than that of the other three procedures for a true change of -0.3 (-5.76% vs. -0.99%,
-1.06% and -1.05%). Overall, the TCC and M/S methods fared slightly better than the
FCIP method, and these three performed much better than the CC method. Expressed in
terms of average RMSE across the five gain conditions, the four equating methods rank
as follows: TCC (0.45%), M/S (0.49%), FCIP (0.59%) and CC (2.52%).
TPCL. As shown in the lower panel of Table 2, the performance of the CC
method was again the poorest for changes in both directions, with large overestimates
for the positive changes (2.48% for a change of 0.1 and 4.29% for a change of 0.3) and
underestimates for the negative changes (-0.89% for a change of -0.1 and -4.57% for a
change of -0.3). The FCIP, TCC and M/S procedures also overestimated the zero gain.
Interestingly, except for the -0.3 condition, the FCIP, TCC and M/S procedures
overestimated the remaining changes, with the overestimation more pronounced for the
-0.1 and 0.1 conditions. Again, the FCIP, TCC and M/S procedures ranked first, each with
an average RMSE of around 1% across the five gain conditions. The CC procedure showed
an average RMSE of 2.66%, which was substantially larger. Lastly, the magnitude of the
RMSE for the TPCL was generally larger than that for the OSSLT.
Table 2
Equating Results for the OSSLT and TPCL: Percentage Unsuccessful for Year 2 Theta