
Journal of Educational Measurement
Fall 2012, Vol. 49, No. 3, pp. 285–311

A Comparison of Item Calibration Procedures in the Presence of Test Speededness

Youngsuk Suh
Rutgers, The State University of New Jersey

Sun-Joo Cho
Peabody College of Vanderbilt University

James A. Wollack
University of Wisconsin-Madison

In the presence of test speededness, the parameter estimates of item response theory models can be poorly estimated due to conditional dependencies among items, particularly for end-of-test items (i.e., speeded items). This article conducted a systematic comparison of five item calibration procedures—a two-parameter logistic (2PL) model, a one-dimensional mixture model, a two-step strategy (a combination of the one-dimensional mixture and the 2PL), a two-dimensional mixture model, and a hybrid model—by examining how sample size, percentage of speeded examinees, percentage of missing responses, and way of scoring missing responses (incorrect vs. omitted) affect item parameter estimation in speeded tests. For nonspeeded items, all five procedures showed similar results in recovering item parameters. For speeded items, the one-dimensional mixture model, the two-step strategy, and the two-dimensional mixture model provided largely similar results and performed better than the 2PL model and the hybrid model in calibrating slope parameters. However, those three procedures performed similarly to the hybrid model in estimating intercept parameters. As expected, the 2PL model did not appear to be as accurate as the other models in recovering item parameters, especially when there were large numbers of examinees showing speededness and a high percentage of missing responses with incorrect scoring. Real data analysis further described the similarities and differences between the five procedures.

Copyright © 2012 by the National Council on Measurement in Education

There are many reasons why examinees might not answer all items on a test. For example, answers may be omitted because an examinee carelessly and inadvertently skipped an item or entered the answer on the wrong line of the answer sheet. In such instances, the data are said to be missing at random, meaning that the probability that an observation is missing is not related to the other missing observations after controlling for all the observed data (Little & Rubin, 1987). Rubin (1976) explained mathematically why, under direct maximum likelihood and Bayesian estimation, data missing at random can be ignored without affecting parameter estimation.

Other sources of nonresponses may be more systematic, such as when an examinee has insufficient time to consider answering the items. The effects of time limits on test performances have been referred to as speededness effects (Evans & Reilly, 1972). When tests are administered under time constraints, some examinees may become speeded and their performance on items at the end of the test may change. Examinees who are hurried may find end-of-test items harder than examinees with ample time, or they may even fail to attempt some items. As a result, test speededness1 has been shown to cause local item dependence among the items at the end of the test (Yen, 1993). Because of the systematic nature underlying omitted responses on speeded tests, these omitted responses cannot be classified as missing at random.

In the presence of test speededness, the parameters of item response models can be poorly estimated, particularly for items located near the end of tests (Douglas, Kim, Habing, & Gao, 1998; Oshima, 1994). Item response theory (IRT) applications that are particularly sensitive to the accuracy of the item parameter estimates, such as differential item functioning (DIF) analysis, standard-setting studies, or even item analysis, can produce inaccurate and misleading results when speededness is present, especially for the end-of-test items.

Augmenting this perspective, several IRT calibration approaches have been proposed in an attempt to improve the estimation of parameters for items at the end of tests in the presence of test speededness (e.g., Bolt, Cohen, & Wollack, 2002; Bolt, Mroch, & Kim, 2003; De Boeck, Cho, & Wilson, 2011; Mroch, Bolt, & Wollack, 2005; Wollack, Cohen, & Wells, 2003; Yamamoto & Everson, 1997). For example, Bolt et al. (2002) proposed a two-class mixture Rasch model and showed that the parameter estimates obtained for end-of-test items in the nonspeeded class were nearly identical to the estimates obtained when those same items were administered under nonspeeded conditions. Wollack et al. (2003) used a more traditional way of dealing with the speededness effect. Under the premise that using a mixture model for scoring purposes is impractical because all items need to be scored the same way for all examinees, they used the mixture Rasch, or one-parameter logistic (1PL), model to study the effect of removing speeded examinees from the calibration sample on the stability of a score scale. They showed that conducting the Rasch (or 1PL) item calibration and linking using only those examinees classified as nonspeeded produced a more unidimensional scale, smaller effects of item parameter drift, and less scale drift when compared to the results based on the total group of examinees. Yamamoto and Everson (1997) proposed a 2PL hybrid model, which assumes that examinees' responses are modeled by a 2PL model up to the point where test speededness begins due to the time limit, but the responses thereafter are modeled by a random guessing component. In their simulation study, the accuracy of the estimated item and ability parameters was markedly improved using the hybrid model compared to the ordinary 2PL model. Most recently, De Boeck et al. (2011) proposed a two-dimensional mixture IRT model for explaining individual differences due to test speededness on items at the end of a test. Through a real data analysis, they showed that the 2PL mixture model with one dimension in the nonspeeded class and two dimensions in the speeded class provided a comparatively better model fit than other alternative models, implying that the speededness effect on difficulty parameters of end-of-test items can be explained as stemming from a secondary dimension.

Although several approaches have been proposed to account for test speededness, each using a real data set, no further work has been reported comparing these various approaches. Since any IRT calibration procedure selected to estimate model parameters in real data will at best provide an approximation to the "true" underlying response process, comparing several existing item calibration procedures via a Monte Carlo simulation study can provide test practitioners with valuable insights into the relative merits of each approach. Although missing responses are sometimes considered useful indicators of speededness (Mroch & Bolt, 2006), very few studies have focused on the effects of both the percentage of missing responses and the treatment of the missing responses on item parameter estimation. Furthermore, missing responses may lower content validity because target content areas measured by a test are not maintained for the examinees who have not reached the end of the test due to test speededness (Lu & Sireci, 2007). Therefore, the purpose of this study is to investigate, through simulations, the relative performance of five item calibration procedures for speeded item responses (described in more detail below) under various testing conditions: two sample sizes, two percentages of speeded examinees, two percentages of missing responses for end-of-test items, and two ways of scoring missing responses (either incorrect or omitted), and to investigate the utility of these procedures in a real data application.

The remaining sections of the article are laid out as follows. In the following section, the five item calibration procedures are described. Next, a section addresses the simulation study, presenting the simulation design, estimation, and results comparing the performance of the five procedures under various simulation conditions. The next section presents the results of a real data analysis utilizing the five procedures. The final section provides a brief summary of the findings and a discussion of the practical implications of using these procedures.

Item Calibration Procedures

One-Dimensional Mixture 2PL Model

Bolt et al. (2002) used a two-class mixture Rasch model (Rost, 1990) to model test speededness. Under this model, examinees are classified into one of two latent classes: speeded and nonspeeded. These mixture classes are defined by imposing a set of constraints on the item difficulty parameters. In particular, for items early in the test (which are assumed to be free from test speededness effects), the item difficulty parameters are constrained to be equal in the two classes. However, the item difficulty parameters for end-of-test items (which are assumed to be affected by the time constraints) are constrained to be more difficult in the speeded class than in the nonspeeded class. The model was extended to the mixture 2PL model (Bolt et al., 2003), and using a slope-intercept parameterization, the one-dimensional mixture 2PL model (referred to as the "one-dimensional mixture model" throughout the article) is given by:

\[
P_i(U = 1 \mid g, \theta_{jg}) \;=\; \frac{\exp(a_{ig}\theta_{jg} + b_{ig})}{1 + \exp(a_{ig}\theta_{jg} + b_{ig})}, \tag{1}
\]

where g indexes the latent class (g = 1, 2), θ_jg is the latent ability of examinee j in class g, a_ig and b_ig are the slope (discrimination) and intercept parameters of item i in class g, respectively, and P_i(U = 1 | g, θ_jg) is the probability of a correct response to item i by examinee j in class g with ability θ_jg. For the purpose of model identification, the constraint θ_j1 ~ normal(0, 1) is imposed.
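To make the class-conditional structure of Equation 1 concrete, the following minimal Python sketch (all names and values are illustrative, not from the article) computes the within-class probability and marginalizes over the two latent classes under assumed mixing proportions:

```python
import numpy as np

def class_prob_2pl(theta, a, b):
    # Equation 1: class-conditional 2PL probability in slope-intercept form
    return 1.0 / (1.0 + np.exp(-(a * theta + b)))

# Marginal probability of a correct response, mixing over the two classes.
# The mixing proportions and item parameters below are made-up examples.
pi_g = np.array([0.8, 0.2])   # nonspeeded class, speeded class
a_g = np.array([1.2, 1.2])    # class-specific slopes
b_g = np.array([0.3, -0.9])   # an end-of-test item: harder in the speeded class
theta = 0.5
p_marginal = float(np.sum(pi_g * class_prob_2pl(theta, a_g, b_g)))
```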


In this article, a slightly modified one-dimensional mixture model is used to calibrate item parameters in the presence of speededness. Here, although the equality constraints for the items early in the test are still present, the ordinal constraints for end-of-test items used in Bolt et al. (2002, 2003) are not imposed, because it has been empirically found that these constraints are unnecessary for the purpose of accounting for the speededness effect and that, on average, the item difficulties for the speeded class are higher than those for the nonspeeded class (see De Boeck et al., 2011). It should be noted that if there is no latent class variable in Equation 1 (i.e., g is fixed at 1), the model reduces to the ordinary 2PL model, which is also evaluated in this article along with the one-dimensional mixture model.

Two-Dimensional Mixture 2PL Model

De Boeck et al. (2011) proposed a two-dimensional mixture 2PL model (referred to as the "two-dimensional mixture model") in the context of DIF, with the potential to explain away the DIF effect with respect to difficulty on the condition that all DIF has one underlying basis (speededness on end-of-test items). In this study, the reference mixture group is a nonspeeded class and the focal mixture group is a speeded class. Under the two-dimensional mixture model, the nonspeeded class (g = 1) has one dimension (θ_j1) as an interaction between all items and examinees in the nonspeeded class. The speeded class (g = 2) has two dimensions, where the first dimension (θ_j1) applies to all items as in the nonspeeded class and the second dimension (θ_j2) arises from an interaction between the end-of-test items (i.e., speeded items) and examinees in the speeded class. The two-dimensional mixture model is described as follows:

\[
P_i(U = 1 \mid g, \theta_{j1}, \theta_{j2g}) \;=\; \frac{\exp(a_{i1}\theta_{j1} + a_{i2g}\theta_{j2g} Z_j + b_i)}{1 + \exp(a_{i1}\theta_{j1} + a_{i2g}\theta_{j2g} Z_j + b_i)}, \tag{2}
\]

where g indexes the latent class (g = 1, 2), θ_j1 is the first latent ability of examinee j, θ_j2g is the second latent ability of examinee j in class g, a_i1 is the slope parameter of item i for the first latent ability, a_i2g is the slope parameter of item i for the second ability and class g, b_i is the intercept parameter of item i, Z_j is an indicator variable with Z_j = 1 if an examinee is classified into the speeded class (g = 2) and Z_j = 0 otherwise, and P_i(U = 1 | g, θ_j1, θ_j2g) is the probability of a correct response to item i by examinee j in class g with abilities θ_j1 and θ_j2g. For the purpose of model identification, constraints are set as follows: (1) θ_j1 ~ normal(0, 1), (2) θ_j2(g=2) ~ normal(0, 1), (3) Cov(θ_j1, θ_j2(g=2)) = 0, and (4) the item slope parameter is fixed at 1 for one item in the speeded class on the second dimension.2

As was done with the one-dimensional mixture model, the item parameters for the first dimension (θ_j1) are constrained to be equal in the two classes. The logic of the model suggests that by introducing the second dimension only for the end-of-test items in the speeded class, that dimension can account for the speededness effects, leaving the first dimension and its corresponding item parameter estimates (i.e., a_i1 and b_i) as purified estimates under nonspeeded conditions.
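A minimal sketch of Equation 2, with illustrative names only: the second dimension enters the linear predictor only when the examinee is in the speeded class (Z_j = 1), and its slope is zero for nonspeeded items.

```python
import numpy as np

def two_dim_mixture_prob(theta1, theta2, a1, a2g, b, z_j):
    # Equation 2: the first dimension applies to all items; the second
    # dimension (theta2, slope a2g) contributes only for examinees in the
    # speeded class (z_j = 1), and a2g is zero for nonspeeded items.
    eta = a1 * theta1 + a2g * theta2 * z_j + b
    return 1.0 / (1.0 + np.exp(-eta))
```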


A 2PL Hybrid Model

Yamamoto and Everson (1997) proposed a 2PL hybrid model (hereafter the "hybrid model") to account for test speededness and produce parameter estimates purified of the contaminating effects of speededness. The hybrid model assumes that when an examinee has insufficient time to consider answering the items on a speeded test, the examinee will switch from a thoughtful response strategy to a strategy of random responses (i.e., the examinee guesses). Unlike the one- and two-dimensional mixture models, the hybrid model does not assume that there exist only two latent classes. The model allows examinees to become speeded at different points on a test, thereby having different latent classes of examinees that differ in the number of consecutively guessed items at the end of the test. The hybrid model is described as follows:

\[
P_i(U = 1 \mid g, \theta_j) \;=\; \left(\frac{\exp(a_i\theta_j + b_i)}{1 + \exp(a_i\theta_j + b_i)}\right)^{1 - Z_{ig}} \times (r_{ig})^{Z_{ig}}, \tag{3}
\]

where g indexes the latent class (g = 1, 2, . . . , k), θ_j is the latent ability of examinee j, a_i and b_i are the slope (discrimination) and intercept parameters of item i, respectively, r_ig is the probability of examinees in class g randomly guessing the correct answer to item i, Z_ig is an indicator variable with Z_ig = 1 if item i is speeded in class g and Z_ig = 0 otherwise, and P_i(U = 1 | g, θ_j) is the probability of a correct response to item i by examinee j in class g with ability θ_j.

A 2PL model is fit to the nonspeeded item responses (before examinees switch to random guessing). When items become speeded, the probability of a correct response is modeled as a random guessing probability, r_ig, which is equal to the reciprocal of the number of alternatives.3 In the model, the proportions of classes of examinees who become speeded at different points on a test are estimated along with the 2PL item parameters. By eliminating the influence of the random-response subpopulations on the item parameter estimation, the estimated item parameters can be regarded as purified estimates under nonspeeded conditions.
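The switching structure of Equation 3 can be sketched as follows (illustrative code, not the authors' implementation); for five-option multiple-choice items the guessing probability r would be 1/5:

```python
import numpy as np

def hybrid_prob(theta, a, b, r, z_ig):
    # Equation 3: a 2PL response probability before the class-specific
    # switching point (z_ig = 0) and a fixed random-guessing probability
    # r afterward (z_ig = 1).
    p_2pl = 1.0 / (1.0 + np.exp(-(a * theta + b)))
    return np.where(z_ig == 1, r, p_2pl)
```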

Item Calibration Models Used in This Study

In this article, five IRT calibration procedures are considered:

1) Model 1: the traditional 2PL model,
2) Model 2: the one-dimensional mixture model (Bolt et al., 2003) without the ordinal constraints for end-of-test items,
3) Model 3: a two-step approach, first using the one-dimensional mixture model to identify and remove speeded examinees and then applying the 2PL model to only those examinees in the nonspeeded class for the purpose of item calibration (similar to the Wollack et al., 2003, approach),
4) Model 4: the two-dimensional mixture model (De Boeck et al., 2011), and
5) Model 5: the hybrid model (Yamamoto & Everson, 1997).

For Model 1 (2PL) and Model 3 (two-step), ordinary 2PL model item slope and intercept parameters are compared with the true parameters in the simulation study (described below). The difference between these models is that Model 1 is applied to the entire examinee group (including nonspeeded and speeded classes), whereas Model 3 applies the 2PL model only to the examinees identified as nonspeeded by Model 2 (one-dimensional mixture). Like these two models, Model 5 (hybrid) also produces one set of slope and intercept parameter estimates, which are evaluated in relation to the true parameters. For Model 2 (one-dimensional mixture), there are two sets of item parameter estimates: one for the nonspeeded class and one for the speeded class. The item parameter estimates for the nonspeeded class4 are compared with the true parameters. Finally, Model 4 (two-dimensional mixture) produces two sets of item slope parameter estimates (a_i1 for the entire group and a_i2g only for the speeded class) in addition to one common set of item intercept parameter estimates (b_i). As in Model 2, only the purified item parameter estimates (i.e., a_i1 and b_i in Equation 2) are evaluated in relation to the true parameters.

Simulation Study

Simulation Design

The performance of any IRT model depends on the fit of the model to the data. When one of the estimating models is used to generate data, a perfect fit of the chosen model to the data might overstate the performance of that model relative to the other estimating models. In order to minimize this model fit issue, an alternative speeded IRT model was used to generate the speededness data. A speededness-generating model (Wollack & Cohen, 2004; Goegebeur, De Boeck, Wollack, & Cohen, 2008) has been proposed to provide a more realistic view of test speededness. A 2PL version of the model was used to simulate the responses affected by test speededness and is given by:

\[
P^{*}_{ij} \;=\; P_i(U = 1 \mid \theta_j) \times \min\!\left(1, \left[\,1 - \left(\tfrac{i}{n} - \eta_j\right)\right]\right)^{\lambda_j}, \tag{4}
\]

where P_i(U = 1 | θ_j) is the probability of examinee j answering item i correctly under the 2PL model, η_j (0 ≤ η_j ≤ 1) is the speededness point parameter for examinee j, and λ_j (λ_j ≥ 0) is the speededness rate parameter for examinee j.

In Equation 4, the function min(x, y) selects the smaller of the two values, x and y. η_j equals the percentage of items, proceeding from the beginning of the test, that will be completed before examinee j first experiences speededness. Therefore, smaller values of η_j indicate that examinee j becomes speeded earlier in the test. For example, η_j = .75 indicates that examinee j becomes speeded three-quarters of the way through the test. Once an examinee passes the speededness point, [1 − (i/n − η_j)] is raised to the power λ_j, which serves to control the speed at which P*_ij decreases. Examinees with η_j = 1 or λ_j = 0 are not speeded for any items. In such cases, P*_ij reduces to the 2PL model.

As shown in Table 1, the design of the simulation study includes 16 conditions resulting from four fully crossed factors: sample size, percentage of speededness (speeded examinees), percentage of missing responses, and way of scoring missing responses (either incorrect or omitted).


Table 1
Simulation Design

Sample Size   Percentage of Speededness   Percentage of Missing on Speeded Items   Way of Scoring Missing Responses
500           10% (50 examinees)          20% (.8%)                                Incorrect; Omitted
500           10% (50 examinees)          40% (1.6%)                               Incorrect; Omitted
500           30% (150 examinees)         20% (2.4%)                               Incorrect; Omitted
500           30% (150 examinees)         40% (4.8%)                               Incorrect; Omitted
2000          10% (200 examinees)         20% (.8%)                                Incorrect; Omitted
2000          10% (200 examinees)         40% (1.6%)                               Incorrect; Omitted
2000          30% (600 examinees)         20% (2.4%)                               Incorrect; Omitted
2000          30% (600 examinees)         40% (4.8%)                               Incorrect; Omitted

Note. The percentages in the parentheses represent the percentages of missing responses in terms of the total responses.

The number of items was fixed at 50. Item parameters were generated from the following distributions: a ~ lognormal(0, .5) and b ~ normal(0, 1). Examinee parameters were generated differently for speeded and nonspeeded examinees. For speeded examinees, η_j and λ_j (the speededness point and rate parameters, respectively) were generated as follows: η ~ beta(120, 80) and λ ~ lognormal(3.912, 1). These distributions were chosen to resemble values observed in previous simulation studies (see Suh, Kang, Wollack, & Kim, 2006; Wollack & Cohen, 2004). The distribution for λ_j represents a quite extreme speededness condition (i.e., the probability of correct responses tends to drop at a fast rate). For nonspeeded examinees, η_j and λ_j were fixed at 1 and 0, respectively, which indicates the special case for which Equation 4 is identical to the 2PL model. For all examinees, θ_j was generated such that θ ~ normal(0, 1). Given the multitude of ways in which examinees may exhibit speeded behavior, simulating only two latent classes of examinees might be thought of as unrealistic. However, the speededness-generating model allows different examinees to be heterogeneous with respect to speededness behavior. Within the speeded group, examinees become speeded at different points in the test; more importantly, they are allowed to have their performance deteriorate at different rates, thereby simulating varying degrees of rushing behavior as well as completely random guessing. Also, Mroch et al. (2005) showed that a two-class mixture Rasch model (Bolt et al., 2002) performed similarly to a multiclass mixture Rasch model (with seven latent classes) when data were simulated to exhibit speededness at different points near the end of the test.


Two sample sizes were considered: 500 and 2,000, designated small and large samples, respectively. Two percentages of speededness were simulated: low and high. Low speededness was simulated by having 10% of the total number of examinees be speeded, while high speededness was simulated by having 30%5 of the total number of examinees be speeded. Two different percentages of missing responses due to test speededness were simulated: 20% and 40%. By using η ~ beta(120, 80), examinees, on average, become speeded around the 30th item and, hence, the 20 items at the end of the test are considered items showing speededness. It should be noted that these percentages indicate the percentage of omitted responses among the items showing speededness, not among the total number of items. In terms of the total responses, the percentages of missing responses range from .8% to 4.8%.6 In order to create missing responses for the end-of-test items, an odd-even approach was used: missing responses were substituted for the originally generated responses on odd-numbered items at the end of the test for odd-numbered examinees in the speeded class, and responses were changed to missing on even-numbered items for even-numbered examinees. This odd-even approach is needed to prevent all speeded examinees from omitting responses on the same set of items, which would leave no available data from which to estimate the speeded item parameters.7 As a result of this procedure, for each examinee in the speeded class, responses for the last four (odd or even) of the 20 speeded items were systematically changed to be missing in the 20% missing conditions, while in the 40% missing conditions responses for the last eight (odd or even) items were made missing.
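A sketch of the odd-even masking just described (illustrative only; the helper assumes the last 20 items of a 50-item test are the speeded items and uses np.nan to mark an omit; n_mask would be 4 for the 20% conditions and 8 for the 40% conditions):

```python
import numpy as np

def apply_odd_even_missing(resp, speeded_rows, n_mask, n_speeded=20):
    # For odd-numbered speeded examinees, blank the last n_mask
    # odd-positioned speeded items; for even-numbered speeded examinees,
    # blank the last n_mask even-positioned speeded items.
    resp = resp.astype(float)
    n_items = resp.shape[1]
    speeded_pos = np.arange(n_items - n_speeded + 1, n_items + 1)  # 1-based
    odd = speeded_pos[speeded_pos % 2 == 1][-n_mask:]
    even = speeded_pos[speeded_pos % 2 == 0][-n_mask:]
    for j in speeded_rows:
        target = odd if (j + 1) % 2 == 1 else even
        resp[j, target - 1] = np.nan  # back to 0-based column indices
    return resp
```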

Estimation and Evaluative Measures

Each data set was analyzed using the five item calibration models described above. Each of the models was applied twice to each data set, once by scoring missing responses as incorrect and once by treating missing responses as omitted. Fifty replications were simulated for each condition. For Models 1–4, a marginal maximum likelihood estimation (MMLE) algorithm was used to estimate item parameters using the Latent GOLD 4.5 Syntax module (Vermunt & Magidson, 2007). By default, the number of quadrature points used in the module is 10. However, in order to obtain more precise parameter estimates, different numbers of quadrature points were examined. The number of nonadaptive quadrature points was increased from 10 in increments of 5 until the change in maximized log-likelihood associated with an increment became less than .1. The number of quadrature points varied depending on the calibration models and simulation conditions, with values ranging from 40 to 90. The estimation of mixture models, in general, is prone to yielding multiple local maxima. The usual method of checking for local maxima is to run the model with multiple different starting values (McLachlan & Peel, 2000). Ten different sets of initial values were used for each model estimation. All results reported in this article are from convergent solutions. For Model 5 (hybrid), a Markov chain Monte Carlo (MCMC) algorithm8 was used to estimate the model parameters using the WinBUGS 1.4.3 software (Spiegelhalter, Thomas, & Best, 2003). Convergence of the MCMC solution was determined by inspecting autocorrelation plots and the Gelman and Rubin (1992) method in WinBUGS 1.4.3. A conservative burn-in of 5,000 iterations was used in this study, followed by 5,000 post-burn-in iterations for all conditions.


Table 2
Correct Classification of Class Membership

Sample   Percentage of        Percentage   Way of      Model 2 (One-Dim.   Model 4 (Two-Dim.   Model 5
Size     Speededness          of Missing   Scoring     Mixture)            Mixture)            (Hybrid)
500      10% (50 examinees)   20%          Incorrect   .97                 .97                 .91
                                           Omitted     .97                 .96                 .92
                              40%          Incorrect   .97                 .97                 .91
                                           Omitted     .96                 .95                 .91
         30% (150 examinees)  20%          Incorrect   .99                 .98                 .89
                                           Omitted     .98                 .97                 .89
                              40%          Incorrect   .99                 .98                 .88
                                           Omitted     .97                 .96                 .87
2000     10% (200 examinees)  20%          Incorrect   .99                 .99                 .97
                                           Omitted     .99                 .99                 .97
                              40%          Incorrect   .99                 .99                 .97
                                           Omitted     .98                 .98                 .96
         30% (600 examinees)  20%          Incorrect   .99                 .99                 .94
                                           Omitted     .98                 .98                 .95
                              40%          Incorrect   .99                 .99                 .95
                                           Omitted     .97                 .97                 .93
Average                                                .98                 .98                 .93


Several different evaluative measures were adopted to investigate the relative performance of the five item calibration procedures. For the simulation study, the estimates of item parameters were compared to the true parameters with respect to bias and root mean square error (RMSE), after scaling them to a common metric using the mean/mean method (Loyd & Hoover, 1980). For example, bias and RMSE for the slope parameter (a_i) are defined as follows:

\[
\mathrm{bias}(a_i) \;=\; \frac{\sum_{r=1}^{R} (\hat{a}_{i}^{(r)} - a_i)}{R}
\qquad\text{and}\qquad
\mathrm{RMSE}(a_i) \;=\; \sqrt{\frac{\sum_{r=1}^{R} (\hat{a}_{i}^{(r)} - a_i)^2}{R}}, \tag{5}
\]

where R is the number of replications and â_i^(r) is the estimate of a_i in replication r. Also, the classification accuracy (i.e., speeded or nonspeeded class) was examined by comparing the true class membership with the estimated class membership obtained from the one-dimensional mixture, two-dimensional mixture, and hybrid models. The latent classification probability for each latent class (i.e., the posterior probability) was calculated for each person and produced as an outcome of model estimation. Persons were then assigned to the latent class with the highest posterior probability.
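Assuming the estimates have already been placed on the true metric via the mean/mean linking step described above, Equation 5 reduces to a few lines (a minimal sketch with illustrative names):

```python
import numpy as np

def bias_and_rmse(est, true):
    # Equation 5: est is an (R x n_items) array of linked estimates across
    # R replications; true is a length-n_items array of true parameters.
    err = est - true
    return err.mean(axis=0), np.sqrt((err ** 2).mean(axis=0))
```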


Results

The percentages of correct membership classification for the three models (one-dimensional mixture, two-dimensional mixture, and hybrid) are shown in Table 2. It should be noted that the hybrid model identified one nonspeeded class and multiple latent classes that became speeded at different points among the end-of-test items. Since data were generated using two classes (nonspeeded and speeded), all assigned latent classes showing speededness at any set of items were combined into one speeded group. Accordingly, the two assigned group memberships were evaluated in relation to the true group memberships.

The values for the two-class mixture models (Models 2 and 4) ranged from 95% to 99%, implying that both models performed well in classifying examinees into the specified groups regardless of the simulation conditions. However, the values for Model 5 ranged from 87% to 97%, indicating less accurate classification compared to the two-class mixture models. In particular, the 500-examinee conditions showed lower correct classification rates than the 2,000-examinee conditions. This tendency was more apparent when 30% of the examinees were speeded.

Tables 3 and 4 show bias values for the slope and intercept parameters across all simulation conditions, with each condition having three bias values: for speeded items, for nonspeeded items, and for total items. Except for the 2PL and hybrid models, bias was close to zero for both parameters for all other models on average across almost all simulation conditions; this implies no apparent evidence of systematic underestimation or overestimation. For the 2PL model, the evidence suggested that slope and intercept parameters were both overestimated for the speeded items. Such bias was anticipated because the 2PL model did not in any way correct item parameter estimates for speededness effects. For the hybrid model, the slope parameter was overestimated for the speeded items, while there was no apparent evidence of under- or overestimation on average for the intercept parameter.

For the slope parameter (Table 3), the incorrect scoring consistently showed larger bias than the omitted scoring across all conditions under the 2PL and hybrid models. This tendency was severe particularly when 30% of the examinees were speeded. For the 20% missing condition under the 2PL model, bias for the slope parameter with the incorrect scoring was twice as large as with the omitted scoring. This pattern was more apparent in the 40% missing condition. For example, under the 500-examinee and 30% speededness condition, bias for the slope parameter with the incorrect scoring was about six times larger than with the omitted scoring. Finally, the large percentage of speeded examinees (30% condition) produced larger bias than the small percentage condition (10%), regardless of the sample sizes. Similar patterns were observed for the 2PL model in Table 4.

Tables 5 and 6 provide RMSEs for the slope and intercept parameters. As expected, the 2,000-examinee conditions yielded smaller RMSEs than the 500-examinee conditions for both parameters, and nonspeeded item parameters were better estimated than speeded item parameters, particularly for the 2PL and hybrid models. In general, the 2PL model showed the largest RMSE values for the slope parameter, and the hybrid model showed the next largest values. The remaining three models (Models 2–4) showed comparable performance across conditions.


Table 3
Bias for the Slope Parameter

Sample  Percentage of        Percentage  Way of     Item      Model 1  Model 2  Model 3  Model 4  Model 5
Size    Speededness          of Missing  Scoring
500     10% (50 examinees)   20%         Incorrect  Speed      .21      .00      .00      .00      .09
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .11      .00      .00      .00      .04
                                         Omitted    Speed      .12      .00      .00     −.02      .08
                                                    Nonspeed   .00      .00      .00     −.01      .00
                                                    Total      .06      .00      .00     −.01      .04
                             40%         Incorrect  Speed      .22      .00     −.01      .00      .09
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .11      .00     −.01      .00      .04
                                         Omitted    Speed      .06      .01      .00      .02      .07
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .03      .00      .00      .01      .03
        30% (150 examinees)  20%         Incorrect  Speed     1.11      .00      .00      .01      .21
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .58      .00      .00      .00      .09
                                         Omitted    Speed      .49      .00     −.01      .01      .17
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .26      .00     −.01      .00      .07
                             40%         Incorrect  Speed     1.09      .01      .00      .01      .21
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .56      .00      .00      .00      .09
                                         Omitted    Speed      .17      .01     −.01      .02      .14
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .09      .00     −.01      .01      .06
2000    10% (200 examinees)  20%         Incorrect  Speed      .20      .00     −.01      .00      .06
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .10      .00     −.01      .00      .03
                                         Omitted    Speed      .11      .00     −.01      .00      .05
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .05      .00     −.01      .00      .02
                             40%         Incorrect  Speed      .20      .00     −.01      .00      .06
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .10      .00     −.01      .00      .03
                                         Omitted    Speed      .05      .00      .00      .01      .05
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .02      .00      .00      .00      .02
        30% (600 examinees)  20%         Incorrect  Speed     1.07      .01      .00      .01      .21
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .60      .00      .00      .01      .09
                                         Omitted    Speed      .51      .01      .00      .01      .16
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .29      .00      .00      .01      .07
                             40%         Incorrect  Speed     1.07      .01      .00      .01      .20
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .60      .00      .00      .01      .08
                                         Omitted    Speed      .23      .02      .00      .02      .13
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .13      .01      .00      .01      .05
Average                                             Speed      .43      .01      .00      .01      .12
                                                    Nonspeed   .00      .00      .00      .00      .00
                                                    Total      .23      .00      .00      .00      .05

Notes. Models 1–5 represent the 2PL model, the one-dimensional mixture model, the two-step procedure, the two-dimensional mixture model, and the hybrid model, respectively. Averages were calculated based on all sample size conditions.

Table 4
Bias for the Intercept Parameter

Sample  Percentage of        Percentage  Way of     Item      Model 1  Model 2  Model 3  Model 4  Model 5
Size    Speededness          of Missing  Scoring
500     10% (50 examinees)   20%         Incorrect  Speed      .18     −.01      .00     −.01     −.05
                                                    Nonspeed  −.04     −.01      .00     −.01     −.01
                                                    Total      .08     −.01      .00     −.01     −.03
                                         Omitted    Speed      .16     −.01      .00      .01     −.05
                                                    Nonspeed  −.03     −.01      .00      .00     −.02
                                                    Total      .07     −.01      .00      .00     −.03
                             40%         Incorrect  Speed      .28      .03      .05      .04     −.03
                                                    Nonspeed   .04      .03      .05      .04      .01
                                                    Total      .16      .03      .05      .04     −.01
                                         Omitted    Speed      .19      .05      .04      .04     −.02
                                                    Nonspeed   .04      .04      .04      .05      .01
                                                    Total      .12      .04      .04      .05     −.01
        30% (150 examinees)  20%         Incorrect  Speed      .65     −.04     −.05     −.05     −.05
                                                    Nonspeed  −.05     −.03     −.04     −.02     −.02
                                                    Total      .31     −.03     −.04     −.04     −.03
                                         Omitted    Speed      .56     −.03     −.04     −.04     −.06
                                                    Nonspeed  −.05     −.03     −.04     −.03     −.02
                                                    Total      .26     −.03     −.04     −.04     −.04
                             40%         Incorrect  Speed      .63      .00     −.01     −.02     −.03
                                                    Nonspeed  −.07     −.02     −.03     −.02      .01
                                                    Total      .29     −.01     −.02     −.02     −.01
                                         Omitted    Speed      .47      .01     −.01     −.01     −.03
                                                    Nonspeed  −.04     −.02     −.03     −.02     −.01
                                                    Total      .23     −.01     −.02     −.02     −.02
2000    10% (200 examinees)  20%         Incorrect  Speed      .19     −.01     −.01     −.02     −.02
                                                    Nonspeed  −.03     −.02     −.02     −.02      .00
                                                    Total      .08     −.02     −.01     −.02     −.01
                                         Omitted    Speed      .16     −.01     −.01     −.02     −.01
                                                    Nonspeed  −.03     −.02     −.02     −.02      .00
                                                    Total      .07     −.01     −.01     −.02     −.01
                             40%         Incorrect  Speed      .21      .00      .00     −.01     −.01
                                                    Nonspeed  −.02      .00      .00      .00      .01
                                                    Total      .10      .00      .00      .00      .00
                                         Omitted    Speed      .14      .01      .00      .00      .00
                                                    Nonspeed  −.01      .00      .00      .00      .00
                                                    Total      .07      .00      .00      .00      .00
        30% (600 examinees)  20%         Incorrect  Speed      .66      .01      .00     −.01      .03
                                                    Nonspeed  −.02     −.01     −.01     −.01      .02
                                                    Total      .36      .00      .00     −.01      .02
                                         Omitted    Speed      .55      .01      .01     −.01      .02
                                                    Nonspeed  −.02     −.01     −.01     −.01      .01
                                                    Total      .30      .00      .00     −.01      .01
                             40%         Incorrect  Speed      .66      .00     −.01     −.01      .03
                                                    Nonspeed  −.01     −.01     −.02     −.01      .02
                                                    Total      .37      .00     −.01     −.01      .02
                                         Omitted    Speed      .45      .02      .00      .00      .03
                                                    Nonspeed  −.02      .00     −.01      .00      .01
                                                    Total      .24      .01     −.01      .00      .02
Average                                             Speed      .38      .00      .00     −.01     −.02
                                                    Nonspeed  −.02     −.01     −.01     −.01      .00
                                                    Total      .19      .00      .00     −.01     −.01

Notes. Models 1–5 represent the 2PL model, the one-dimensional mixture model, the two-step procedure, the two-dimensional mixture model, and the hybrid model, respectively. Averages were calculated based on all sample size conditions.

For the intercept parameter, the 2PL model still showed the largest values among the five models, while the four mixture models showed results similar to one another, especially for the 2,000-examinee conditions. RMSE values for the intercept parameter (Table 6) were slightly higher for Model 3 (two-step) in some conditions, such as the 500-examinee and 30% speededness conditions. For Models 2–5, all differences between the incorrect and omitted scoring methods were also quite small, except those for Model 5 under the 30% speededness conditions in Table 5 and those for Model 3 under the 500-examinee, 30% speededness, and 40% missing condition in Table 6, where the RMSE for the incorrect scoring was a little higher than for the omitted scoring (see Figure 1, two-step-incorrect and two-step-omitted for the intercept parameter and hybrid-incorrect and hybrid-omitted for the slope parameter). In addition, there were very minor differences in RMSEs between nonspeeded and speeded items for the four mixture models, except for the slope parameters estimated under the hybrid model. For the hybrid model, the large percentage of speeded examinees (30%) produced larger RMSEs in the slope parameters for the speeded items than the small percentage of speeded examinees (10%), regardless of the sample sizes.

For the 2PL model, patterns similar to those reported in Tables 3 and 4 were found. First, the incorrect scoring consistently showed larger RMSEs than the omitted scoring across all conditions. This tendency was severe for the RMSEs of the slope parameters in the 30% speededness conditions. For example, large declines in RMSE when moving from the incorrect to the omitted scoring can be seen in the 40% missing conditions (approximately a 76% decrease for the speeded items, from 1.16 (2PL-incorrect) to .28 (2PL-omitted); see Figure 1). Except for some 500-examinee conditions, there was not a large difference in parameter recovery between the 20% and 40% missing response conditions when compared under the same scoring method. As with the hybrid model, the 30% speededness condition produced larger RMSEs in the slope parameters than the 10% condition. The 2PL model showed results similar to the other models for nonspeeded items, especially when there were small numbers of examinees showing speededness (the 10% condition). Finally, the difference in the accuracy of item parameter estimation between the 2PL model and the other models became larger when there were larger percentages of examinees showing speededness and the missing responses were scored as incorrect.


Table 5
RMSE for the Slope Parameter

Sample  Percentage of        Percentage  Way of     Item      Model 1  Model 2  Model 3  Model 4  Model 5
Size    Speededness          of Missing  Scoring
500     10% (50 examinees)   20%         Incorrect  Speed      .29      .18      .18      .19      .21
                                                    Nonspeed   .16      .14      .15      .14      .15
                                                    Total      .22      .16      .17      .17      .17
                                         Omitted    Speed      .21      .18      .18      .19      .20
                                                    Nonspeed   .15      .14      .15      .14      .15
                                                    Total      .18      .16      .17      .17      .17
                             40%         Incorrect  Speed      .29      .17      .17      .18      .20
                                                    Nonspeed   .15      .15      .16      .15      .15
                                                    Total      .22      .16      .17      .16      .17
                                         Omitted    Speed      .18      .17      .18      .18      .20
                                                    Nonspeed   .15      .15      .15      .15      .15
                                                    Total      .16      .16      .16      .17      .17
        30% (150 examinees)  20%         Incorrect  Speed     1.18      .18      .18      .21      .31
                                                    Nonspeed   .21      .14      .17      .14      .16
                                                    Total      .72      .17      .18      .18      .22
                                         Omitted    Speed      .57      .19      .19      .21      .28
                                                    Nonspeed   .18      .14      .17      .15      .16
                                                    Total      .38      .17      .18      .18      .21
                             40%         Incorrect  Speed     1.16      .19      .19      .21      .32
                                                    Nonspeed   .21      .15      .18      .15      .16
                                                    Total      .70      .17      .18      .18      .23
                                         Omitted    Speed      .28      .19      .19      .21      .27
                                                    Nonspeed   .16      .15      .18      .15      .16
                                                    Total      .22      .17      .18      .18      .21
2000    10% (200 examinees)  20%         Incorrect  Speed      .22      .08      .08      .08      .11
                                                    Nonspeed   .08      .07      .08      .07      .07
                                                    Total      .16      .08      .08      .08      .09
                                         Omitted    Speed      .14      .08      .08      .08      .11
                                                    Nonspeed   .08      .07      .08      .07      .07
                                                    Total      .11      .08      .08      .08      .09
                             40%         Incorrect  Speed      .23      .09      .09      .09      .11
                                                    Nonspeed   .08      .07      .08      .07      .08
                                                    Total      .16      .08      .08      .08      .09
                                         Omitted    Speed      .10      .09      .09      .09      .11
                                                    Nonspeed   .08      .07      .08      .07      .08
                                                    Total      .09      .08      .08      .08      .09
        30% (600 examinees)  20%         Incorrect  Speed     1.09      .09      .09      .10      .23
                                                    Nonspeed   .14      .07      .09      .07      .09
                                                    Total      .67      .08      .09      .09      .15
                                         Omitted    Speed      .54      .09      .09      .10      .19
                                                    Nonspeed   .11      .07      .09      .07      .09
                                                    Total      .35      .08      .09      .09      .13
                             40%         Incorrect  Speed     1.09      .10      .10      .11      .23
                                                    Nonspeed   .14      .07      .09      .07      .09
                                                    Total      .67      .09      .09      .09      .15
                                         Omitted    Speed      .30      .10      .10      .11      .17
                                                    Nonspeed   .09      .07      .09      .07      .08
                                                    Total      .21      .09      .09      .09      .12
Average                                             Speed      .49      .14      .14      .15      .20
                                                    Nonspeed   .14      .11      .12      .11      .12
                                                    Total      .33      .12      .13      .13      .15

Notes. Models 1–5 represent the 2PL model, the one-dimensional mixture model, the two-step procedure, the two-dimensional mixture model, and the hybrid model, respectively. Averages were calculated based on all sample size conditions.

Table 6
RMSE for the Intercept Parameter

Sample  Percentage of        Percentage  Way of     Item      Model 1  Model 2  Model 3  Model 4  Model 5
Size    Speededness          of Missing  Scoring
500     10% (50 examinees)   20%         Incorrect  Speed      .29      .20      .25      .22      .19
                                                    Nonspeed   .17      .19      .23      .20      .15
                                                    Total      .23      .19      .24      .21      .17
                                         Omitted    Speed      .27      .20      .26      .21      .19
                                                    Nonspeed   .17      .19      .25      .19      .15
                                                    Total      .22      .20      .25      .20      .17
                             40%         Incorrect  Speed      .50      .20      .26      .24      .19
                                                    Nonspeed   .35      .19      .26      .23      .17
                                                    Total      .43      .20      .26      .24      .18
                                         Omitted    Speed      .35      .23      .23      .26      .19
                                                    Nonspeed   .27      .21      .23      .25      .16
                                                    Total      .31      .22      .23      .26      .17
        30% (150 examinees)  20%         Incorrect  Speed      .98      .22      .25      .23      .25
                                                    Nonspeed   .36      .20      .25      .19      .17
                                                    Total      .68      .21      .25      .21      .20
                                         Omitted    Speed      .67      .22      .26      .22      .24
                                                    Nonspeed   .22      .20      .25      .19      .17
                                                    Total      .46      .21      .25      .21      .20
                             40%         Incorrect  Speed      .90      .20      .34      .20      .24
                                                    Nonspeed   .33      .17      .33      .17      .17
                                                    Total      .63      .18      .33      .19      .20
                                         Omitted    Speed      .55      .19      .22      .20      .24
                                                    Nonspeed   .18      .17      .22      .17      .17
                                                    Total      .37      .18      .22      .19      .19
2000    10% (200 examinees)  20%         Incorrect  Speed      .23      .08      .08      .08      .09
                                                    Nonspeed   .08      .07      .08      .07      .07
                                                    Total      .16      .07      .08      .08      .08
                                         Omitted    Speed      .20      .08      .08      .08      .08
                                                    Nonspeed   .08      .07      .08      .07      .07
                                                    Total      .14      .07      .08      .08      .08
                             40%         Incorrect  Speed      .24      .08      .09      .09      .09
                                                    Nonspeed   .08      .08      .08      .08      .08
                                                    Total      .17      .08      .08      .08      .09
                                         Omitted    Speed      .18      .09      .09      .09      .09
                                                    Nonspeed   .08      .08      .08      .08      .08
                                                    Total      .13      .08      .08      .08      .09
        30% (600 examinees)  20%         Incorrect  Speed      .71      .09      .10      .09      .11
                                                    Nonspeed   .10      .07      .09      .07      .08
                                                    Total      .44      .08      .10      .08      .09
                                         Omitted    Speed      .59      .09      .10      .09      .11
                                                    Nonspeed   .09      .07      .09      .07      .08
                                                    Total      .37      .08      .10      .08      .09
                             40%         Incorrect  Speed      .72      .10      .10      .10      .12
                                                    Nonspeed   .12      .09      .10      .09      .09
                                                    Total      .46      .09      .10      .09      .10
                                         Omitted    Speed      .49      .10      .11      .10      .13
                                                    Nonspeed   .10      .09      .10      .09      .09
                                                    Total      .32      .09      .11      .09      .10
Average                                             Speed      .49      .15      .18      .16      .16
                                                    Nonspeed   .17      .13      .17      .14      .12
                                                    Total      .35      .14      .17      .15      .14

Notes. Models 1–5 represent the 2PL model, the one-dimensional mixture model, the two-step procedure, the two-dimensional mixture model, and the hybrid model, respectively. Averages were calculated based on all sample size conditions.

Figure 1. RMSEs of speeded items for both slope and intercept parameters under the condition of 500 examinees, 30% speededness, and 40% missing responses. "I" stands for incorrect scoring and "O" stands for omitted scoring. "1DMix" represents the one-dimensional mixture model and "2DMix" represents the two-dimensional mixture model.

An Empirical Illustration

A sample of 1,000 examinees who took a 53-item college-level reading placement test in 1998 was analyzed. All items were multiple-choice items with five response categories, including one correct answer. Since missing responses are sometimes considered useful indicators of speededness (Mroch & Bolt, 2006), we first examined the frequency distributions of item responses for each item to see whether missing responses increased across end-of-test items. There was no systematic pattern in missingness until item 35 (with missing percentages less than 2.5% for each item). However, there was an increasing pattern starting from item 36 (with 2.7%) to the last item (with 33%). In terms of total responses, the missing response percentage was approximately 4%. By assuming that the first 35 items were free from test speededness effects (i.e., nonspeeded items) and the last 18 items were affected by the time constraints (i.e., speeded items), we conducted a preliminary analysis using the one-dimensional mixture model. Based on the item parameter estimation results from the model, we found that the end-of-test items (the 18 speeded items) were more difficult in the speeded class than in the nonspeeded class, which is consistent with the rationale for imposing the ordinal constraints used by Bolt et al. (2002, 2003).

The data set was analyzed using the five models investigated in the simulation study. Each of the models was applied twice to the data set, once by scoring missing responses as incorrect and once by treating missing responses as omitted. In the absence of the true item parameters, the different models were compared by calculating the root mean square difference (RMSD) for each pair of models. For example, the RMSD between Model 1 and Model 2 for the slope parameter estimate (â_i) is defined as follows:

\[
\mathrm{RMSD}(a_i) \;=\; \sqrt{\frac{\sum_{i=1}^{n} (\hat{a}_{i1} - \hat{a}_{i2})^2}{n}}, \tag{6}
\]

where n is the number of items, â_i1 is the slope parameter estimate for item i from Model 1, and â_i2 is the slope parameter estimate for item i from Model 2.
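Equation 6 is straightforward to compute; a minimal sketch (names illustrative):

```python
import numpy as np

def rmsd(est_a, est_b):
    # Equation 6: root mean square difference between two models' item
    # parameter estimates over the n items being compared.
    d = np.asarray(est_a) - np.asarray(est_b)
    return float(np.sqrt(np.mean(d ** 2)))
```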


Table 7
Root Mean Square Differences for the Slope and Intercept Parameter Estimates, Reading Test

                         Incorrect                  Omitted
Comparison               Speed  Nonspeed  Total     Speed  Nonspeed  Total
Slope
Model 1 vs. Model 2       .16    .07       .11       .16    .10       .13
Model 1 vs. Model 3       .15    .10       .12       .19    .11       .15
Model 1 vs. Model 4       .22    .09       .15       .34    .09       .21
Model 1 vs. Model 5       .16    .08       .11       .09    .04       .06
Model 2 vs. Model 3       .01    .06       .05       .07    .08       .08
Model 2 vs. Model 4       .12    .02       .07       .25    .01       .15
Model 2 vs. Model 5       .24    .03       .14       .18    .07       .12
Model 3 vs. Model 4       .12    .05       .09       .30    .08       .19
Model 3 vs. Model 5       .24    .06       .15       .22    .09       .15
Model 4 vs. Model 5       .28    .02       .16       .29    .06       .18
Intercept
Model 1 vs. Model 2       .33    .03       .20       .44    .11       .27
Model 1 vs. Model 3       .34    .06       .20       .51    .12       .31
Model 1 vs. Model 4       .35    .04       .21       .31    .14       .21
Model 1 vs. Model 5       .46    .02       .27       .22    .01       .13
Model 2 vs. Model 3       .01    .05       .04       .07    .05       .06
Model 2 vs. Model 4       .05    .01       .03       .20    .03       .12
Model 2 vs. Model 5       .22    .02       .13       .32    .11       .21
Model 3 vs. Model 4       .05    .05       .05       .26    .06       .16
Model 3 vs. Model 5       .21    .05       .13       .38    .11       .24
Model 4 vs. Model 5       .21    .03       .12       .23    .14       .17

Note. Models 1–5 represent the 2PL model, the one-dimensional mixture model, the two-step procedure, the two-dimensional mixture model, and the hybrid model, respectively.

Table 7 shows RMSDs between each pair of the five models for the slope and intercept parameter estimates. In general, RMSDs for the pairs among the three two-class mixture approaches (Models 2–4) were smaller than those for the other comparisons involving the 2PL or hybrid models, implying greater similarity among the three approaches. There was one exception, however: for the slope parameter estimates with the omitted scoring, the two-dimensional mixture model produced large differences when paired with the other two mixture models as well as with the 2PL and hybrid models, especially with respect to speeded items. When the incorrect scoring was used, the hybrid model showed large differences when paired with the other models. For the intercept parameter estimates, the differences involving the 2PL model tended to be higher than those involving the other models, regardless of the scoring method. Among the three two-class mixture approaches (Models 2–4), Model 2 (one-dimensional mixture) and Model 3 (two-step) were most similar to each other; these two models differed somewhat from Model 4 (two-dimensional mixture) for the slope parameter estimates with the omitted scoring. The incorrect scoring method tended to produce smaller differences among the five models than did the omitted scoring.

Finally, the missing patterns between the two latent classes (speeded and nonspeeded) identified by the one- and two-dimensional mixture models were examined. As expected, missing percentages tended to be higher in the speeded class than in the nonspeeded class with the omitted scoring. This tendency was more apparent in the results of the two-dimensional mixture model than in the results of the one-dimensional mixture model. The difference in missing percentages between the two latent classes became larger toward the end of the test. For example, the differences in missing percentages between the two latent classes identified from the one-dimensional mixture model for the last four items (items 50, 51, 52, and 53) were 1.1%, 5.0%, 5.2%, and 15.0%, respectively, whereas the corresponding differences resulting from the two-dimensional mixture model were 10.6%, 12.8%, 12.8%, and 16.8%, respectively. This indicates that the speeded class for the two-dimensional mixture model is particularly sensitive to examinees who omit answers at the end of the test, a characteristic that is conceptually appealing in a speededness model.

Conclusion and Discussion

When a test is suspected to be speeded to some degree, important consideration must be given to the scoring of missing responses as well as to the speeded items. Although the evaluation of item calibration procedures is very important for the successful application of IRT item calibration, the results of different treatments of missing data and different IRT calibration procedures for speeded tests have not been comprehensively studied. In this regard, this study evaluated the relative performance of five item calibration procedures for speeded item responses under various testing conditions. The results from this simulation study would appear to have several implications for how practitioners select and understand item calibration procedures when working with incomplete data due to test speededness.

First, the one-dimensional mixture, two-step procedure, and two-dimensional mixture models performed similarly in estimating item parameters. In particular, when the sample size was large, the three models were indistinguishable from one another regardless of the percentage of speeded examinees or the percentage of missing responses. However, under the small-sample conditions (500 examinees), item intercept parameters tended to be recovered better by the one- and two-dimensional mixture models than by the two-step procedure. This may be attributed in part to the fact that the two-step procedure used a smaller sample than the other two models, because examinees in the speeded class were excluded from the analysis in the two-step procedure (see the sketch below). As expected, the influence of using only the nonspeeded class (i.e., a smaller sample) on item estimation diminished as the total sample size increased (i.e., the 2,000-examinee conditions). Therefore, the item calibration procedure should be selected with caution when dealing with a small sample in the presence of test speededness. The performance of the hybrid model was somewhat different from that of these three models: it was less accurate in estimating slope parameters but performed similarly in estimating intercept parameters. In this regard, it would be interesting to investigate the relative performance of the five procedures with a fixed slope (i.e., controlling for the effect of slope) when the Rasch model fits the data.
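The sample reduction that distinguishes the two-step procedure can be made concrete with a short sketch. The posterior class probabilities here stand in for first-stage mixture output and are simulated, not estimated; the 0.5 cutoff corresponds to modal class assignment.

    import numpy as np

    rng = np.random.default_rng(2)
    n_persons, n_items = 500, 53
    responses = rng.integers(0, 2, (n_persons, n_items))
    post_speeded = rng.beta(1, 4, n_persons)   # placeholder P(speeded | data)

    nonspeeded = post_speeded < 0.5            # modal class assignment
    calib_sample = responses[nonspeeded]       # only these rows are refit
    print(f"{calib_sample.shape[0]} of {n_persons} examinees retained "
          "for the second-stage 2PL calibration")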

Second, for all four mixture models, the differences between the omitted and incorrect scoring methods appeared to be very small for almost all simulated conditions. However, this was not the case for the 2PL model. Because the 2PL model did not consider speededness, the omitted scoring method performed better than the incorrect scoring method. Especially when there were large numbers of examinees showing speededness (30%) and a high percentage of missing responses (40%), the omitted scoring worked much better than the incorrect scoring, regardless of the sample size. However, as noted earlier, the 40% missing response conditions represented only up to 4.8% of the total responses. Therefore, the impact of using the omitted rather than the incorrect scoring can be expected to be even greater in testing situations with higher missing response percentages. Thus, the omitted scoring should be preferred for item calibration if one has no choice but to use the 2PL model on heavily speeded data with missing responses; however, in such instances, for the purpose of score reporting in real testing situations, it may be required that omitted responses be treated as incorrect. In this case, practitioners might need to seriously consider one of the other approaches discussed in the present study for calibrating item parameters.
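The two scoring treatments themselves amount to simple recodings of the response matrix. A minimal sketch, assuming missing responses are stored as np.nan:

    import numpy as np

    resp = np.array([[1.0, 0.0, np.nan, np.nan],
                     [1.0, 1.0, 0.0,    np.nan]])

    # Incorrect scoring: every missing response is recoded as wrong (0).
    incorrect_scored = np.nan_to_num(resp, nan=0.0)

    # Omitted scoring: missing responses stay missing, so the estimation
    # routine can skip them rather than treat them as wrong.
    omitted_scored = np.ma.masked_invalid(resp)

    print(incorrect_scored)
    print(omitted_scored)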

Third, except for the 2PL model parameter recovery under the 500-examinee conditions, there was not a large difference between the 20% and 40% missing response conditions when the two results obtained from the same scoring method were compared. It should be noted again that these percentages reflect the percentage of missing responses relative to the speeded item responses, not to the total number of responses. The percentages of the total number of responses varied from .8% to 4.8% depending on the simulation conditions (see Table 1), implying that no more than 5% of all responses were scored as missing in this study (which, as explained earlier, seems to be in keeping with what is often observed in real testing situations). However, there might be situations where higher missing response percentages occur due to test speededness with a strict time limit. Therefore, in circumstances where the 2PL model is used to calibrate items for a sample of around 500 or smaller, careful attention should be paid to missing response percentages.
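To see how a 40% missing rate on speeded items can translate into less than 5% of all responses, consider the following back-of-the-envelope computation. The count of speeded items is hypothetical, chosen only to produce a value inside the reported .8%–4.8% range; it is not taken from the study's design.

    n_items = 53
    n_speeded_items = 12          # hypothetical count of end-of-test items
    p_speeded_examinees = 0.30    # 30% speeded condition
    miss_rate_on_speeded = 0.40   # 40% missing condition

    # Overall rate = within-speeded-item rate, diluted by the fraction of
    # responses that are both from speeded examinees and on speeded items.
    overall = miss_rate_on_speeded * p_speeded_examinees * (n_speeded_items / n_items)
    print(f"overall missing rate: {100 * overall:.1f}% of all responses")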

Finally, as expected, item parameters were better estimated for all models considered as the sample size increased and the percentage of speededness decreased. Also, all five models showed relatively similar results for nonspeeded items, especially under the 10% speeded conditions. Under the 30% speeded conditions, Models 2–5 tended to perform slightly better than the 2PL model. Therefore, if the percentage of speededness is large (i.e., 30% or more), the four mixture approaches provide more accurate item parameter estimates than the 2PL model, even for nonspeeded items.

There are several limitations to the current study. Regarding the simulation conditions, fixed effects of the speededness point parameter (η_j), the speededness rate parameter (λ_j), and test length were considered in this study; the impact of different simulation conditions needs to be considered in a future study. In the present study, the 2PL version of the speededness generating model was used to simulate the responses affected by test speededness. That model was chosen because it has been proposed as a realistic model for speededness (Wollack & Cohen, 2004) and because it differs from all of the estimating models considered in this study. One small concern remains in that the form of the generating model used in this study, being based on the 2PL model, did not include a pseudo-guessing parameter. Consequently, examinees with large speededness rate parameters could be simulated with below chance-level probabilities of correct response for some items. Although this might be somewhat unrealistic, the number of examinees for which this was the case was relatively small and therefore figures to have only a very small effect, if any, on the final outcomes. Ultimately, it is our belief that the model used here to generate speeded data is the most flexible and realistic option available; however, it is important to recognize that any of the other models studied in this article could have been used to generate speeded data sets. It would be valuable to examine the effect of different generating models on the recovery of item parameters. Furthermore, continued study of the psychology of test speededness is needed to help researchers better understand how to generate speeded test-taking behaviors and missing response patterns as realistically as possible.
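For readers who wish to experiment with such generating models, the sketch below implements one plausible 2PL-based form with a gradual decline beyond the speededness point, in the spirit of the gradual-process-change model (Goegebeur et al., 2008). The functional form and all parameter values are our assumptions, not the study's exact generating specification.

    import numpy as np

    def p_correct(theta, a, b, position, eta, lam):
        # Base 2PL success probability.
        p_2pl = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        # Fraction of the post-speededness-point region already traversed.
        overrun = np.maximum(0.0, (position - eta) / (1.0 - eta))
        # Attenuate the 2PL probability at rate lam beyond the point eta.
        return p_2pl * (1.0 - overrun) ** lam

    theta, a, b = 0.0, 1.2, 0.3                # placeholder person and item
    for j in [40, 46, 50, 53]:                 # positions on a 53-item test
        pos = j / 53
        print(j, round(p_correct(theta, a, b, pos, eta=0.8, lam=2.0), 3))

Note that the attenuated probability can fall below the chance level of a five-option item, which is precisely the pseudo-guessing concern raised above.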

In estimating item and person parameters, two different estimation algorithms were used: MMLE for Models 1–4 and MCMC for Model 5 (the hybrid model). It would be ideal to use the same estimation method across all models when comparing parameter estimates; however, as noted earlier, we were unable to obtain a software package that implements MMLE for the hybrid model. To make the estimates as comparable as possible between MMLE and MCMC, diffuse priors on the parameters were employed in implementing the MCMC algorithm. In addition, because the sample sizes and the number of items we considered were relatively large, the results of MCMC are expected to be closely comparable to those of MMLE.

Different kinds of mixture IRT modeling approaches were used in this study to classify examinees into two latent classes, a speeded class and a nonspeeded class. After latent class membership is assigned to an examinee, an important step is to validate the characteristics of the latent classes. In the empirical study, missing responses, as an indication of test speededness, were found more frequently in the speeded class than in the nonspeeded class. Because test speededness mainly occurs as a result of examinees having a limited amount of time in which to test, the item response time for each examinee could also be a good explanatory variable with which to characterize and validate the latent classes. In this regard, an item response time distribution, such as a mixture lognormal distribution, could be jointly modeled with a mixture Rasch model to classify examinees with respect to their rapid-guessing behavior (see Meyer, 2010). A response time model and an item response model can also be modeled jointly without any mixture modeling to improve the precision of item and person parameter estimation (van der Linden, Klein Entink, & Fox, 2010). Such joint models could improve item calibration with a different structure from that of mixture IRT models, by modeling response time (person-by-item response times) along with response accuracy (person-by-item responses) as dependent variables. These models were not included in this study because our purpose was to evaluate the relative performance of existing mixture IRT models with which the speeded group can be detected, so that, by eliminating (or modeling) those examinees, purer item parameter estimates can be obtained. In addition, many testing programs remain entirely paper based and so are not able to capture response time data. Therefore, it is important to improve our understanding of the methods that are available to address test speededness when response times are unavailable. For this reason, we did not simulate item response time information in the data generation stage. It therefore is left for a future study to compare these newly developed models with the models presented in this study.
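As one concrete way of using response times when they are available, a two-component mixture fit to log response times can separate a fast rapid-guessing component from a slower solution-behavior component (cf. Meyer, 2010). The sketch below is illustrative only; the lognormal components, their parameters, and the class split are assumptions, and a Gaussian mixture on the log scale is used as a convenient stand-in for a mixture lognormal model.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(3)
    rt = np.concatenate([rng.lognormal(1.0, 0.3, 300),    # rapid responses
                         rng.lognormal(3.0, 0.5, 700)])   # solution behavior

    log_rt = np.log(rt)[:, None]
    gm = GaussianMixture(n_components=2, random_state=0).fit(log_rt)
    labels = gm.predict(log_rt)
    fast = labels == np.argmin(gm.means_.ravel())         # faster component
    print(f"{fast.mean():.0%} of responses flagged as rapid")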

Here, we focused on comparing the relative performance of different item calibration procedures with different scoring schemes in terms of the recovery of item parameters. The effect of using these different calibration procedures on ability estimation has not yet been studied. Particularly useful and important would be an examination of the handling of missing responses as incorrect versus omitted, and of missing response percentages on end-of-test items, under different speededness effects. In addition, the difficulty of the end-of-test items would be another interesting condition to examine, because tests often are constructed so that items at the end are more difficult. Results and implications from such studies would provide test practitioners with useful information concerning the relative value of choosing different calibration procedures for ability estimation.

Notes

1. It should be noted that a speeded test is different from a speed test. A speed test is designed so that speed is a construct of interest and individual differences are assessed based on speed of performance. Such a test is made up of all relatively easy items, and the time limit is made sufficiently short that no one can finish all items in the provided time. However, in this article, we use "test speededness" to refer to speededness effects in the context of an examination administered with time constraints but for which the speed at which one responds to questions is a nuisance variable.

2. ai22 for item 33 was constrained to be 1 in the simulation study, while ai22 for item 41 was set to 1 in the real data analysis.

3. Five response alternatives were assumed for all items in the simulation study, like the actual test items analyzed in the empirical illustration section.

4. In order to identify a set of item parameter estimates for the nonspeeded class, the item difficulty parameter estimates for the speeded items need to be compared between the two classes. The class with lower item difficulty values (less difficult) is labeled the nonspeeded class, whereas the class with higher values (more difficult) is labeled the speeded class.
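A sketch of this labeling rule, with placeholder difficulty estimates for the speeded items:

    import numpy as np

    b_class1 = np.array([0.2, 0.5, 0.9, 1.4])   # difficulties, speeded items
    b_class2 = np.array([0.8, 1.3, 1.9, 2.6])

    # The class with the lower speeded-item difficulties is nonspeeded.
    nonspeeded_class = 1 if b_class1.mean() < b_class2.mean() else 2
    print("nonspeeded class:", nonspeeded_class)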

5. Thirty percent was chosen to be similar to the percentages observed in previous studies (Mroch et al., 2005; Wollack & Cohen, 2004), and 10% was chosen to examine the effect of a smaller percentage on estimation accuracy.

6. Based on the authors' experience, these values appear to be realistic (usually less than 5%) for actual testing programs (which are not pure speed tests, as noted in footnote 1) like the empirical illustration described below, where 4% was observed in terms of the total responses.

7. Although it seems more realistic that an examinee keeps giving responses until s/he becomes speeded and then stops responding, there is also a possibility that the examinee randomly tries to respond to some end-of-test items, resulting in intermediate missing responses. As one feasible way of generating such a random pattern, the odd-even approach was chosen; it also fulfilled the purpose of ensuring that there is enough information to estimate all item parameters.

8. Although there is a computer program, HYBIL (Yamamoto, 1990), which implements an MMLE algorithm for the hybrid model, it was not available. In order to compare the performance of MCMC and MMLE, we used diffuse priors for the item parameters.

References

Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Applications of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331–348.

Bolt, D. M., Mroch, A. A., & Kim, J.-S. (2003, April). An empirical investigation of the hybrid IRT model for improving item parameter estimation in speeded tests. Paper presented at the meeting of the American Educational Research Association, Chicago, IL.

De Boeck, P., Cho, S.-J., & Wilson, M. (2011). Explanatory secondary dimension modeling of latent differential item functioning. Applied Psychological Measurement, 35, 583–603.

Douglas, J., Kim, H. R., Habing, B., & Gao, F. (1998). Investigating local dependence with conditional covariance functions. Journal of Educational & Behavioral Statistics, 23, 129–151.

Evans, F. R., & Reilly, R. R. (1972). A study of speededness as a source of test bias. Journal of Educational Measurement, 9, 123–131.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–472.

Goegebeur, Y., De Boeck, P., Wollack, J. A., & Cohen, A. S. (2008). A speeded item response model with gradual process change. Psychometrika, 73, 65–87.

Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York, NY: Wiley.

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179–193.

Lu, Y., & Sireci, S. G. (2007). Validity issues in test speededness. Educational Measurement: Issues and Practice, 26(4), 29–37.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York, NY: Wiley.

Meyer, J. P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34, 521–538.

Mroch, A. A., Bolt, D. M., & Wollack, J. A. (2005, April). A new multi-class mixture Rasch model for test speededness. Paper presented at the meeting of the National Council on Measurement in Education, Montreal, Quebec.

Mroch, A. A., & Bolt, D. M. (2006, April). An IRT-based response likelihood approach for addressing test speededness. Paper presented at the meeting of the National Council on Measurement in Education, San Francisco, CA.

Oshima, T. C. (1994). The effect of speededness on parameter estimation in item response theory. Journal of Educational Measurement, 31, 200–219.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271–282.

Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.

Spiegelhalter, D. J., Thomas, A., & Best, N. G. (2003). WinBUGS version 1.4.3 [Computer program]. Cambridge, UK: MRC Biostatistics Unit, Institute of Public Health.


Suh, Y., Kang, T., Wollack, J. A., & Kim, S.-Y. (2006, April). A comparison of test scoring methods in the presence of test speededness. Paper presented at the meeting of the National Council on Measurement in Education, San Francisco, CA.

van der Linden, W. J., Klein Entink, R. H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34, 327–347.

Vermunt, J. K., & Magidson, J. (2007). Latent GOLD 4.5 syntax module [Computer program]. Belmont, MA: Statistical Innovations Inc.

Wollack, J. A., & Cohen, A. S. (2004, April). A model for simulating speeded test data. Paper presented at the meeting of the American Educational Research Association, San Diego, CA.

Wollack, J. A., Cohen, A. S., & Wells, C. S. (2003). A method for maintaining scale stability in the presence of test speededness. Journal of Educational Measurement, 40, 307–330.

Yamamoto, K. (1990). HYBIL: A computer program to estimate HYBRID model parameters. Princeton, NJ: Educational Testing Service.

Yamamoto, K., & Everson, H. (1997). Modeling the effects of test length and test time on parameter estimation using the HYBRID model. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences (pp. 89–98). New York, NY: Waxmann.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.

Authors

YOUNGSUK SUH is Assistant Professor of Educational Psychology, Rutgers University, 10 Seminary Place, Room 321A, New Brunswick, NJ 08901; [email protected]. Her primary research interests include the development and application of latent variable models.

SUN-JOO CHO is Assistant Professor of Psychology and Human Development, Peabody College of Vanderbilt University, Hobbs 213a, 230 Appleton Place, Nashville, TN 37203; [email protected]. Her primary research interests include generalized latent variable models and their parameter estimation, with a focus on item response models.

JAMES A. WOLLACK is Associate Professor of Educational Psychology and Director of Testing & Evaluation Services and the Center for Placement Testing, University of Wisconsin-Madison, 1025 West Johnson Street, #373, Madison, WI 53706; [email protected]. His primary research interests include psychometric methods, test security, and item response theory.
