Article

Applied Psychological Measurement 36(2) 122–146
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621612438725
http://apm.sagepub.com

Using the Graded Response Model to Control Spurious Interactions in Moderated Multiple Regression

Brendan J. Morse¹, George A. Johanson², and Rodger W. Griffeth²

¹ Bridgewater State University, MA, USA
² Ohio University, Athens, USA

Corresponding author: Brendan J. Morse, Department of Psychology, Bridgewater State University, 90 Burrill Avenue, 340 Hart Hall, Bridgewater, MA 02325, USA
Email: [email protected]

Abstract

Recent simulation research has demonstrated that using simple raw scores to operationalize a latent construct can result in inflated Type I error rates for the interaction term of a moderated statistical model when the interaction (or lack thereof) is proposed at the latent variable level. Rescaling the scores using an appropriate item response theory (IRT) model can mitigate this effect under similar conditions. However, this work has thus far been limited to dichotomous data. The purpose of this study was to extend this investigation to multicategory (polytomous) data using the graded response model (GRM). Consistent with previous studies, inflated Type I error rates were observed under some conditions when polytomous number-correct scores were used, and were mitigated when the data were rescaled with the GRM. These results support the proposition that IRT-derived scores are more robust to spurious interaction effects in moderated statistical models than simple raw scores under certain conditions.

Keywords

graded response model, item response theory, polytomous models, simulation

Operationalizing a latent construct such as an attitude or ability is a common practice in psychological research. Stine (1989) described this process as the creation of a mathematical structure (scores) that represents the empirical structure (construct) of interest. Typically, researchers will use simple raw scores (e.g., either as a sum or a mean) from a scale or test as the mathematical structure for a latent construct. However, much debate regarding the properties of such scores has ensued since S. S. Stevens’s classic publication of the nominal, ordinal, interval, and ratio scales of measurement (Stevens, 1946). Although it is beyond the scope of this article to enter the scale of measurement fray, an often agreed-on position is that simple raw scores for latent constructs do not exceed an ordinal scale of measurement. This scale imbues such scores with limited mathematical properties and permissible
transformations that are necessary for the appropriate application of parametric statistical models. Nonparametric, or distribution-free, statistics have been proposed as a solution for the scale of measurement problem. However, many researchers are reluctant to use nonparametric techniques because they are often associated with a loss of information pertaining to the nature of the variables (Gardner, 1975). McNemar (1969) articulated this point by saying, “Consequently, in using a non-parametric method as a short-cut, we are throwing away dollars in order to save pennies” (p. 432).
Assuming that simple raw scores are limited to the ordinal scale of measurement and researchers typically prefer parametric models to their nonparametric analogues, the empirical question regarding the robustness of various parametric statistical models to scale violations arises. Davison and Sharma (1988) and Maxwell and Delaney (1985) demonstrated through mathematical derivations that there is little cause for concern when comparing mean group differences in the independent samples t test when the assumptions of normality and homogeneity of variance are met. However, Davison and Sharma (1990) subsequently demonstrated that scaling-induced spurious interaction effects could occur with ordinal-level observed scores in multiple regression analyses. These findings suggest that scaling may become a problem when a multiplicative interaction term is introduced into a parametric statistical model.
Scaling and Item Response Theory (IRT)
An alternative solution to the scale of measurement issue for parametric statistics is to rescale the raw data itself into an interval-level metric, and a variety of methods for this rescaling have been proposed (see Embretson, 2006; Granberg-Rademacker, 2010; Harwell & Gatti, 2001). A potential method for producing scores with near interval-level scaling properties is the application of IRT models to operationalize number-correct scores into estimated theta scores, the IRT-derived estimate of an individual’s ability or latent construct standing. Conceptually, the attractiveness of this method rests with the invariance property in IRT scaling, and such scores may provide a more appropriate metric for use in parametric statistical analyses.¹ Reise, Ainsworth, and Haviland (2005) stated that

Trait-level estimates in IRT are superior to raw total scores because (a) they are optimal scalings of individual differences (i.e., no scaling can be more precise or reliable) and (b) latent-trait scales have relatively better (i.e., closer to interval) scaling properties. (p. 98, italics in original)

In addition, Reise and Haviland (2005) gave an elegant treatment of this condition by demonstrating that the log-odds of endorsing an item and the theta scale form a linearly increasing relationship. Specifically, the rate of change on the theta scale is preserved (for all levels of theta) in relation to the log-odds of item endorsement.
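The linear relationship between theta and the log-odds of endorsement can be checked numerically. The following is a minimal sketch, assuming a two-parameter logistic item; the function names and the values a = 1.2 and b = 0.5 are hypothetical and are not taken from Reise and Haviland (2005).

```python
import math

def p_endorse(theta, a, b):
    """Probability of endorsing an item under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

a, b = 1.2, 0.5  # hypothetical discrimination and difficulty
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
logits = [log_odds(p_endorse(t, a, b)) for t in thetas]

# Change in log-odds for each one-unit step along the theta scale.
steps = [logits[i + 1] - logits[i] for i in range(len(logits) - 1)]
print(steps)
```

Under this assumed model every one-unit step on theta changes the log-odds by the same amount (the discrimination a), whereas the endorsement probability itself changes by unequal amounts across the scale.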
Empirical Evidence of IRT Scaling
In a simulation testing the effect of scaling and test difficulty on interaction effects in factorial analysis of variance (ANOVA), Embretson (1996) demonstrated that Type I and Type II errors for the interaction term could be exacerbated when simple raw scores are used under nonoptimal psychometric conditions. Such errors occurred primarily due to the ordinal-level scaling limitations of simple raw scores, and the ceiling and floor effects imposed when an assessment is either too easy or too difficult for a group of individuals, a condition known as assessment inappropriateness (see Figure 1). Embretson fitted the one-parameter logistic (Rasch) model to the data and was able to mitigate the null hypothesis errors using the estimated theta scores rather than the simple raw scores. These results illuminated the usefulness of IRT scaling for dependent variables in factorial models, especially under suboptimal psychometric conditions. Embretson argued that researchers are often unaware when these conditions are present and can benefit from using appropriately fitted IRT models to generate scores that are more appropriate for use with parametric analyses.

Figure 1. A representation of the latent construct distribution and test information (reliability) distributions for appropriate assessments (top) and inappropriate assessments (bottom)
[Figure not reproduced. Panels: Assessment Appropriateness (top), Assessment Inappropriateness (bottom); curves: Trait Scores, Test Information; horizontal axis: Theta, from −4 to 4.]
An important question that now arises is whether these characteristics extend to more complex IRT models such as the two- and three-parameter logistic models (dichotomous models with a discrimination and guessing parameter, respectively) and polytomous models. Although the Rasch model demonstrates desirable measurement characteristics (i.e., true parameter invariance; Embretson & Reise, 2000; Fischer, 1995; Perline, Wright, & Wainer, 1979), it is sometimes too restrictive to use in practical contexts. However, the consensus regarding the likelihood that non-Rasch models could achieve interval-level scaling properties is “yes”
In Equation 6, β₁ and β₂ are the simulated regression weights and e is an error term. Note that the intercept term, β₀, was set to equal 0 and thus omitted from the model. The term $\sqrt{1 - (\beta_1^2 + \beta_2^2)}$ was included to represent an appropriate error variance component for each level of β. See the Appendix for a derivation of the error term.
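The role of the error multiplier can be seen from a short variance calculation. The following is a sketch rather than the Appendix derivation: it assumes that Equation 6 has the form shown in the first line and that θ₁, θ₂, and e are mutually independent with unit variance (the independence of θ₁ and θ₂ is an assumption made here for illustration).

```latex
\begin{align*}
\theta_3 &= \beta_1\theta_1 + \beta_2\theta_2 + \sqrt{1 - (\beta_1^2 + \beta_2^2)}\; e \\
\operatorname{Var}(\theta_3) &= \beta_1^2\operatorname{Var}(\theta_1) + \beta_2^2\operatorname{Var}(\theta_2)
  + \bigl[1 - (\beta_1^2 + \beta_2^2)\bigr]\operatorname{Var}(e) \\
&= \beta_1^2 + \beta_2^2 + 1 - \beta_1^2 - \beta_2^2 = 1
\end{align*}
```

Under these assumptions θ₃ stays on the same unit-variance metric as θ₁ and θ₂ at every level of β. For example, if both weights equal .3, the multiplier is √(1 − .18) ≈ .906; if both equal .5, it is √.5 ≈ .707.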
Number-Correct Scores

To generate the number-correct scores, X₁, X₂, and X₃, the values of the previously defined construct scores θ₁, θ₂, and θ₃ were entered into the GRM algorithm (Equations 1 and 2) for each simulated participant.

A matrix of response scores was generated by reporting the number-correct score (1, 2, 3, 4, or 5) corresponding to the highest-category response likelihood for each simulated participant on each item. These values were derived using an algorithm written by the first author in the R language based on response probabilities calculated in Equations 1 and 2. Actual number-correct score responses were generated by comparing a randomly selected value from a uniform distribution, U(0.0, 1.0), with the relative response probabilities that are generated for each level of theta (or individual) and each item. This process can be thought of as determining the relative likelihood of a category response given the item and person parameters with a realistic level of decision-making error (Kang & Waller, 2005; Stone, 1992). This integration of response error is important so as to not assume perfect responding by simulated individuals. A mean score for X₁, X₂, and X₃ for each simulated individual was calculated from the number-correct score response matrices for analysis in the regression models.
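The response-generation step can be illustrated with a short sketch. It is written in Python rather than the authors' R code, and it assumes that Equations 1 and 2 take the usual logistic GRM form (boundary curves differenced into category probabilities). The function names, the seed, and the item parameters below are hypothetical.

```python
import math
import random

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a logistic graded response model.

    thresholds holds the ordered boundary locations b_1 < ... < b_(m-1).
    """
    # Boundary curves: probability of responding in category k + 1 or higher.
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [p_star[k] - p_star[k + 1] for k in range(len(thresholds) + 1)]

def sample_response(theta, a, thresholds, rng):
    """Pick a category (1..m) by comparing a U(0, 1) draw with cumulative probabilities."""
    probs = grm_category_probs(theta, a, thresholds)
    u = rng.random()
    cumulative = 0.0
    for category, p in enumerate(probs, start=1):
        cumulative += p
        if u < cumulative:
            return category
    return len(probs)  # guard against floating-point round-off

rng = random.Random(2012)                       # arbitrary seed for reproducibility
items = [(0.8, [-1.5, -0.5, 0.5, 1.5])] * 15    # hypothetical 15-item, 5-category scale
theta = 0.25                                    # one simulated participant
responses = [sample_response(theta, a, b, rng) for a, b in items]
mean_score = sum(responses) / len(responses)
print(responses, mean_score)
```

Because the category is drawn at random in proportion to its probability, a simulated participant does not always give the single most likely response, which is the kind of response error the text describes.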
Estimated Theta Scores
Finally, theta scores θ̂₁, θ̂₂, and θ̂₃ were estimated from the simulated raw data using PARSCALE 4.1 (Muraki & Bock, 2003). PARSCALE was set to derive the person (latent construct scores) and item parameters using the expected a posteriori (EAP) method. This method calculates θ̂₁, θ̂₂, and θ̂₃ as the mean (expected value) of the posterior distribution of theta given the observed response pattern (Baker & Kim, 2004) and is a preferred estimation method for assessments that are moderate to short in item length (Mislevy & Stocking, 1989).
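To make the EAP idea concrete, the following sketch computes a posterior mean of theta on a grid. It is not PARSCALE's implementation: it assumes a logistic GRM, a standard normal prior, known item parameters, and an evenly spaced 61-point grid from −4 to 4, all of which are illustrative choices, and the function names are hypothetical.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a logistic graded response model."""
    p_star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [p_star[k] - p_star[k + 1] for k in range(len(thresholds) + 1)]

def eap_estimate(responses, items, n_points=61, lower=-4.0, upper=4.0):
    """Posterior mean of theta on an evenly spaced grid with a standard normal prior."""
    step = (upper - lower) / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for q in range(n_points):
        theta = lower + q * step
        prior = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) density
        likelihood = 1.0
        for response, (a, thresholds) in zip(responses, items):
            likelihood *= grm_category_probs(theta, a, thresholds)[response - 1]
        numerator += theta * likelihood * prior
        denominator += likelihood * prior
    return numerator / denominator

# Hypothetical 15-item scale with equal discrimination and symmetric thresholds.
items = [(0.8, [-1.5, -0.5, 0.5, 1.5])] * 15
theta_hat_middle = eap_estimate([3] * 15, items)
theta_hat_high = eap_estimate([5] * 15, items)
theta_hat_low = eap_estimate([1] * 15, items)
print(theta_hat_low, theta_hat_middle, theta_hat_high)
```

Because the estimate is a weighted average over the whole posterior, it remains finite even for all-lowest or all-highest response patterns, which is one reason EAP scoring is convenient for short scales.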
Iterations
For the purposes of estimating Type I error rates in Monte Carlo studies, Robey and Barcikowski (1992) specify that approximately 1,000 iterations will achieve a power equal to .90 when approximating an alpha level of α = .05 and using the interval α ± ½α as a robustness interval. Therefore, 1,000 iterations per condition were conducted. This number allowed for adequate reduction in sampling variance for the IRT parameter estimates (Harwell, Stone, Hsu, & Kirisci, 1996), achieved a power of .90 around the interval .025 ≤ α ≤ .075 (Robey & Barcikowski, 1992), and doubled the number of iterations used by Kang and Waller (2005).
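The arithmetic behind the robustness interval is simple enough to sketch. The helper names below are hypothetical, and the Monte Carlo standard error is a standard binomial approximation added for illustration, not a quantity reported in the article.

```python
import math

def robustness_interval(alpha):
    """Lower and upper bounds of the alpha +/- (1/2) alpha interval."""
    return (alpha - 0.5 * alpha, alpha + 0.5 * alpha)

def monte_carlo_se(alpha, iterations):
    """Standard error of an empirical rejection rate when the true rate equals alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / iterations)

lower, upper = robustness_interval(0.05)
se = monte_carlo_se(0.05, 1000)
print(lower, upper, round(se, 4))
```

With 1,000 iterations the standard error of an empirical rate near .05 is roughly .007, so the bounds of .025 and .075 sit several standard errors away from the nominal level.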
Simulation-Dependent Variables
Type I errors. The primary dependent variable for this study was the empirical Type I error rate (p) that is observed for the interaction term of the MMR models. The specific value of p was identified in a three-step process. First, in each iteration of the simulation, the variance in θ₃ accounted for by θ₁ and θ₂ was recorded as the R² value for the additive and multiplicative regression models specified in Equations 3 through 5b. Second, the significance of the change in variance accounted for, ΔR², between the respective additive and multiplicative models was tested at an alpha level of .05, and recorded as 1 for a significant result and 0 for a nonsignificant result. Finally, the empirical alpha level p was recorded as the proportion (x/1,000) of iterations resulting in a significant ΔR² for the actual latent trait scores θ₃, the number-correct scores X₃, and the estimated theta scores θ̂₃.
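The second and third steps can be sketched as follows. The F statistic is the usual test of an R² change when a single interaction term is added; the error degrees of freedom (n − 3 − 1) assume that the fitted multiplicative model contains two predictors, their product, and an intercept, which is an assumption rather than something stated in the text. Function names and the example values are hypothetical.

```python
def delta_r2_f(r2_additive, r2_multiplicative, n, n_predictors_full=3):
    """F statistic (1 numerator df) for the R-squared change from adding one term.

    Returns the statistic and the error degrees of freedom, assuming the full
    model has n_predictors_full predictors plus an intercept.
    """
    df_error = n - n_predictors_full - 1
    delta = r2_multiplicative - r2_additive
    return delta / ((1.0 - r2_multiplicative) / df_error), df_error

def empirical_type1_rate(flags):
    """Proportion of iterations flagged 1 (significant) rather than 0."""
    return sum(flags) / len(flags)

# Hypothetical single iteration with n = 250.
f_value, df_error = delta_r2_f(0.25, 0.26, 250)

# Hypothetical record of four iterations (1 = significant interaction).
rate = empirical_type1_rate([1, 0, 0, 1])
print(round(f_value, 3), df_error, rate)
```

In a full run, each iteration's F value would be compared with the critical value for (1, df_error) degrees of freedom at the .05 level, and the resulting 0/1 flags would be averaged over the 1,000 iterations of a condition.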
Procedure
The simulation for the current study was conducted in the R environment (version 2.9.0; Ihaka & Gentleman, 1996; R Development Core Team, 2008) using a series of functions written by the authors, contributed code from Kang and Waller (2005), and PARSCALE 4.1 (Muraki & Bock, 2003). For ease of interpretation, four separate simulations were conducted. The four simulations were separated based on sample size (n = 250, 750) and scale fidelity (normal, high). In each simulation, the independent variables of scale length, regression weights, discrimination, and difficulty were systematically varied. Therefore, the summary statistics for each simulation are included in four tables, each with 24 rows.
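The layout of the design can be enumerated directly. The sketch below uses the factor levels reported in Tables 1 through 4 (three difficulty distributions, two discrimination ranges, two regression weights, two scale lengths); the ordering of the loops and the dictionary keys are choices made here so that the numbering matches the condition column of the tables.

```python
from itertools import product

fidelities = ["normal", "high"]
sample_sizes = [250, 750]
difficulties = ["Easy", "Moderate", "Difficult"]
discriminations = ["Low", "High"]
betas = [0.3, 0.5]
scale_lengths = [15, 30]

conditions = []
for number, combo in enumerate(
    product(fidelities, sample_sizes, difficulties, discriminations, betas, scale_lengths),
    start=1,
):
    fidelity, n, difficulty, discrimination, beta, k = combo
    conditions.append({
        "c": number, "fidelity": fidelity, "n": n, "difficulty": difficulty,
        "discrimination": discrimination, "beta": beta, "k": k,
    })

appropriate = [c for c in conditions if c["difficulty"] == "Moderate"]
inappropriate = [c for c in conditions if c["difficulty"] != "Moderate"]
print(len(conditions), len(appropriate), len(inappropriate))
```

The counts reproduce the 96 conditions overall, of which 32 use the moderate (appropriate) difficulty distribution and 64 use the easy or difficult (inappropriate) distributions.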
Each simulation was run using the following process. First, using the pseudorandom number generator in R, theta vectors were sampled from a standard normal distribution N(0.0, 1.0) for θ₁ and θ₂. Next, corresponding vectors for θ₃ were calculated using Equation 6. These vectors were saved as the actual latent construct scores. To calculate the number-correct score matrices, X₁, X₂, and X₃, each of these three score vectors was evaluated in an algorithm written by the first author that implements Equation 1 and Equation 2 to determine the probability of a category response. Final number-correct score values were determined by the comparison of a randomly selected value from a uniform distribution as previously described. Finally, the estimated theta scores θ̂₁, θ̂₂, and θ̂₃ were derived using PARSCALE 4.1 (Muraki & Bock, 2003). To accomplish this task, the number-correct score matrices were “batched” out to PARSCALE with an accompanying syntax file following the structure identified by Gagne, Furlow, and Ross (2009). The estimated theta scores from PARSCALE were then returned to R as the vectors θ̂₁, θ̂₂, and θ̂₃.
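The first two steps of this process can be sketched in a few lines. The example is in Python rather than R, and it assumes that Equation 6 is an additive model with no interaction term whose error is a standard normal draw scaled so that the third score has unit variance. The function name, the seed, and the sample size used for the check are hypothetical.

```python
import math
import random
import statistics

def simulate_latent_scores(n, beta1, beta2, rng):
    """Draw theta1 and theta2 from N(0, 1) and build theta3 with no interaction term."""
    error_sd = math.sqrt(1.0 - (beta1 ** 2 + beta2 ** 2))
    theta1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    theta2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    theta3 = [
        beta1 * t1 + beta2 * t2 + error_sd * rng.gauss(0.0, 1.0)
        for t1, t2 in zip(theta1, theta2)
    ]
    return theta1, theta2, theta3

rng = random.Random(2012)  # arbitrary seed for reproducibility
theta1, theta2, theta3 = simulate_latent_scores(20000, 0.3, 0.3, rng)
variance_theta3 = statistics.pvariance(theta3)
print(round(variance_theta3, 3))
```

A large n is used here only to make the unit-variance property visible; the study itself used n = 250 or 750 per iteration. Because no product term enters the generating model, any significant interaction found later in these data is a Type I error by construction.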
Finally, the nine score vectors were entered into the corresponding additive and multiplicative regression models specified in Equations 3 through 5b, and the change in variance accounted for between the two corresponding models was recorded. The final summary statistics and tables were generated using portions of code provided by Niels Waller and used in the Kang and Waller (2005) study.
Results
Using the α ± ½α criterion, the results indicated meaningfully inflated Type I error rates in 53 of 96 conditions (55%) when number-correct scores were used to operationalize the latent constructs, and in 33 of 96 conditions (34%) when estimated theta scores were used to operationalize the latent constructs (see conditions marked with superscript b under the columns labeled p_X and p_θ̂, respectively, in Tables 1-4). In addition, a binomial test was conducted as a measure of statistically significant departures from the nominal (α = .05) alpha level for each scoring method. The binomial test was a slightly stricter criterion and indicated significantly inflated Type I error rates in 63 of 96 conditions (66%) when number-correct scores were used to operationalize the latent constructs and in 44 of 96 conditions (46%) when estimated theta scores were used to operationalize the latent constructs (see conditions marked with superscript a under the columns labeled p_X and p_θ̂, respectively, in Tables 1-4).
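The two flagging rules can be compared on a single condition. The sketch below assumes a one-sided exact binomial test of the observed number of rejections against the nominal rate of .05; the article does not state whether its test was one- or two-sided, so this is an illustrative choice, and the function names are hypothetical. The example rate of .068 is the number-correct value reported for condition 2 in Table 1.

```python
import math

def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p), summed in log space."""
    total = 0.0
    for k in range(successes, trials + 1):
        log_pmf = (
            math.lgamma(trials + 1) - math.lgamma(k + 1) - math.lgamma(trials - k + 1)
            + k * math.log(p) + (trials - k) * math.log(1.0 - p)
        )
        total += math.exp(log_pmf)
    return total

def outside_interval(rate, alpha=0.05):
    """True when the rate falls outside the alpha +/- (1/2) alpha interval."""
    return rate < 0.5 * alpha or rate > 1.5 * alpha

rejections, iterations = 68, 1000          # empirical rate of .068
tail_probability = binomial_upper_tail(rejections, iterations, 0.05)
flag_binomial = tail_probability < 0.05
flag_interval = outside_interval(rejections / iterations)
print(flag_binomial, flag_interval)
```

Under these assumptions a rate of .068 departs significantly from .05 by the binomial test yet remains inside the .025 to .075 interval, which illustrates how the binomial test can flag conditions that the interval criterion does not.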
In addition, Figure 2 represents the frequency of the empirical Type I error rates for the number-correct scores and estimated theta scores. For the number-correct scores, these data indicate a positively skewed distribution (skew = 2.04) ranging from 3.1% to 84.9%, with a mean empirical Type I error rate of 17.5%, median of 8.7%, and a standard deviation of 20%. Restricting the summary to only those occurrences outside of the α ± ½α interval, the mean empirical Type I error rate was 27.6% with values ranging from 7.8% to 84.9%. The distribution for the estimated theta scores was also positively skewed (skew = 2.97) and ranged from 3.7% to 43.6% with a mean empirical Type I error rate of 9.0%, median of 5.9%, and a standard deviation of 8.0%. Restricting the summary to only those occurrences outside of the α ± ½α interval, the mean empirical Type I error rate was 15.9%, with values ranging from 7.6% to 43.6%.
These results indicate that there were instances of spurious interactions regardless of the
scoring method. However, it is clear that the number-correct scores performed much worse than
the estimated theta scores in comparable conditions. Finally, an important finding to highlight
is that, of the conditions with meaningfully inflated Type I error rates for the estimated theta
scores, none were unique with regard to the number-correct scores (see Tables 1-4). In other
words, no meaningful inflations existed for the estimated theta scores that did not also exist for
the number-correct scores.
Assessment Appropriateness and Type I Errors
The results of this simulation also clearly indicate the anticipated effect of scoring method and assessment appropriateness on the occurrence of Type I errors for the interaction term of an MMR analysis. Figure 3 represents the mean and maximum Type I error rates for each scoring method collapsed across the 32 assessment appropriateness conditions. Under these conditions, there is no significant departure from the nominal Type I error rate, regardless of whether one uses simple raw scores or estimated theta scores in the MMR analysis. These results are consistent with previous findings related to scaling effects on Type I error rates for moderated statistical models (Davison & Sharma, 1990; Embretson, 1996; Kang & Waller, 2005).
However, striking differences in the empirical Type I error rate can be observed for each scoring method when the assessment is inappropriate for the individuals. Figure 4 represents the mean and maximum Type I error rates for each scoring method collapsed across the 64 assessment inappropriateness (easy/difficult) conditions. Number-correct scores resulted in empirical Type I error rates that were above the acceptable interval in 53 of the 64 (83%) inappropriate assessment conditions. At the iteration level, a direct logistic regression analysis indicated that the likelihood of committing a Type I error was 8.13 times greater when number-correct scores
Table 1. Results of Simulation 1 (Normal Fidelity, Distribution of Latent Construct Scores = Standard Normal N(0,1))

c   n    b_i,j−1    a_i   β   k   p_θ    ΔR²_θ  p_X       ΔR²_X  p_θ̂       ΔR²_θ̂   α    RMSE  SW_θ  SW_X  SW_θ̂  sk_X  sk_θ̂
1   250  Easy       Low   .3  15  0.062  0.018  0.055     0.021  0.047     0.022   .66  0.86  0.96  0.06  0.78  0.54  0.23
2   250  Easy       Low   .3  30  0.062  0.018  0.068a    0.020  0.052     0.023   .80  0.84  0.96  0.06  0.77  0.56  0.22
3   250  Easy       Low   .5  15  0.062  0.011  0.088a,b  0.018  0.058     0.018   .66  0.86  0.96  0.23  0.84  0.54  0.24
4   250  Easy       Low   .5  30  0.062  0.011  0.113a,b  0.017  0.061     0.017   .80  0.83  0.96  0.30  0.82  0.57  0.24
5   250  Easy       High  .3  15  0.062  0.018  0.055     0.021  0.047     0.022   .66  1.44  0.96  0.06  0.78  0.54  0.23
6   250  Easy       High  .3  30  0.061  0.018  0.105a,b  0.022  0.082a,b  0.016   .92  1.44  0.96  0.00  0.41  1.01  0.34
7   250  Easy       High  .5  15  0.061  0.011  0.296a,b  0.020  0.089a,b  0.012   .86  1.44  0.96  0.01  0.57  0.99  0.34
8   250  Easy       High  .5  30  0.062  0.011  0.372a,b  0.019  0.098a,b  0.014   .92  1.44  0.96  0.02  0.61  1.01  0.34
9   250  Moderate   Low   .3  15  0.062  0.018  0.046     0.020  0.056     0.020   .69  0.70  0.96  0.83  0.86  0.02  0.01
10  250  Moderate   Low   .3  30  0.062  0.018  0.042     0.021  0.048     0.020   .82  0.70  0.96  0.86  0.86  0.02  0.02
11  250  Moderate   Low   .5  15  0.062  0.011  0.041     0.016  0.052     0.017   .69  0.70  0.96  0.91  0.86  0.02  0.02
12  250  Moderate   Low   .5  30  0.062  0.011  0.037     0.015  0.037     0.018   .82  0.70  0.96  0.93  0.86  0.02  0.03
13  250  Moderate   High  .3  15  0.062  0.018  0.051     0.019  0.047     0.020   .88  0.53  0.96  0.43  0.88  0.04  0.01
14  250  Moderate   High  .3  30  0.062  0.018  0.048     0.019  0.059     0.019   .94  0.52  0.96  0.50  0.88  0.03  0.00
15  250  Moderate   High  .5  15  0.062  0.011  0.041     0.014  0.050     0.014   .88  0.53  0.96  0.86  0.91  0.04  0.00
16  250  Moderate   High  .5  30  0.061  0.011  0.039     0.013  0.055     0.014   .94  0.52  0.96  0.89  0.91  0.03  0.00
17  250  Difficult  Low   .3  15  0.061  0.018  0.064a    0.020  0.053     0.021   .65  0.89  0.96  0.03  0.72  0.58  0.21
18  250  Difficult  Low   .3  30  0.061  0.018  0.067a    0.021  0.052     0.022   .79  0.88  0.96  0.04  0.70  0.60  0.22
19  250  Difficult  Low   .5  15  0.061  0.011  0.106a,b  0.019  0.075a    0.018   .65  0.89  0.96  0.15  0.79  0.58  0.21
20  250  Difficult  Low   .5  30  0.061  0.011  0.138a,b  0.017  0.055     0.019   .79  0.88  0.96  0.23  0.78  0.60  0.20
21  250  Difficult  High  .3  15  0.061  0.018  0.112a,b  0.023  0.098a,b  0.011   .85  1.58  0.96  0.00  0.30  1.06  0.37
22  250  Difficult  High  .3  30  0.061  0.018  0.123a,b  0.022  0.070a    0.014   .92  1.59  0.96  0.00  0.26  1.08  0.38
23  250  Difficult  High  .5  15  0.061  0.011  0.317a,b  0.021  0.120a,b  0.011   .85  1.59  0.96  0.00  0.49  1.06  0.38
24  250  Difficult  High  .5  30  0.061  0.011  0.386a,b  0.020  0.102a,b  0.012   .92  1.59  0.96  0.01  0.47  1.08  0.38

Note: c = condition; n = number of individuals; b_i,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 1), Moderate (assessment appropriateness) = N(0, 1), Difficult (assessment inappropriateness) = N(1.5, 1); a_i = item discrimination distribution, Low = U(.31, .58), High = U(.58, 1.13); β = regression weight; k = number of items; p_θ = empirical Type I error rate for actual theta scores; ΔR²_θ = average effect size for significant interactions for actual theta scores; p_X = empirical Type I error rate for number-correct scores; ΔR²_X = average effect size for significant interactions for number-correct scores; p_θ̂ = empirical Type I error rate for estimated theta scores; ΔR²_θ̂ = average effect size for significant interactions for estimated theta scores; α = average internal consistency for the number-correct scores; RMSE = root mean square error for the estimated theta scores; SW_θ = proportion of n.s. Shapiro–Wilk tests for the actual theta scores; SW_X = proportion of n.s. Shapiro–Wilk tests for the number-correct scores; SW_θ̂ = proportion of n.s. Shapiro–Wilk tests for the estimated theta scores; sk_X = |skewness| for the number-correct scores (abs. value); sk_θ̂ = |skewness| for the estimated theta scores. Iterations per condition = 1,000.
a Significant Type I error rate based on the results of a binomial test.
b Significant Type I error rate based on the alpha +/- .5 alpha criterion.
Table 2. Results of Simulation 2 (Normal Fidelity, Distribution of Latent Construct Scores = Standard Normal N(0,1))

c   n    b_i,j−1    a_i   β   k   p_θ    ΔR²_θ  p_X       ΔR²_X  p_θ̂       ΔR²_θ̂   α    RMSE  SW_θ  SW_X  SW_θ̂  sk_X  sk_θ̂
25  750  Easy       Low   .3  15  0.049  0.006  0.074a    0.007  0.056     0.007   .66  0.82  0.95  0.00  0.44  0.55  0.23
26  750  Easy       Low   .3  30  0.049  0.006  0.069a    0.007  0.057     0.007   .80  0.84  0.95  0.00  0.45  0.57  0.24
27  750  Easy       Low   .5  15  0.049  0.004  0.167a,b  0.007  0.089a,b  0.006   .66  0.82  0.95  0.00  0.65  0.55  0.22
28  750  Easy       Low   .5  30  0.049  0.004  0.222a,b  0.006  0.079a,b  0.006   .80  0.84  0.95  0.01  0.66  0.56  0.24
29  750  Easy       High  .3  15  0.049  0.006  0.162a,b  0.008  0.084a,b  0.006   .86  1.31  0.95  0.00  0.11  1.00  0.31
30  750  Easy       High  .3  30  0.049  0.006  0.142a,b  0.007  0.066a    0.006   .92  1.31  0.95  0.00  0.11  1.01  0.32
31  750  Easy       High  .5  15  0.049  0.004  0.627a,b  0.009  0.173a,b  0.006   .86  1.31  0.95  0.00  0.52  1.00  0.31
32  750  Easy       High  .5  30  0.049  0.004  0.710a,b  0.009  0.158a,b  0.006   .92  1.31  0.95  0.00  0.52  1.00  0.32
33  750  Moderate   Low   .3  15  0.049  0.006  0.056     0.006  0.052     0.006   .69  0.68  0.95  0.32  0.76  0.02  0.01
34  750  Moderate   Low   .3  30  0.049  0.006  0.046     0.007  0.043     0.007   .82  0.66  0.95  0.44  0.77  0.02  0.02
35  750  Moderate   Low   .5  15  0.049  0.004  0.046     0.006  0.055     0.005   .69  0.68  0.95  0.69  0.82  0.02  0.02
36  750  Moderate   Low   .5  30  0.049  0.004  0.043     0.005  0.038     0.006   .82  0.66  0.95  0.81  0.82  0.02  0.02
37  750  Moderate   High  .3  15  0.049  0.006  0.050     0.006  0.053     0.006   .88  0.51  0.95  0.01  0.69  0.03  0.00
38  750  Moderate   High  .3  30  0.049  0.006  0.044     0.006  0.044     0.006   .94  0.53  0.95  0.01  0.71  0.03  0.01
39  750  Moderate   High  .5  15  0.049  0.004  0.065a    0.005  0.068a    0.005   .88  0.50  0.95  0.51  0.79  0.03  0.00
40  750  Moderate   High  .5  30  0.049  0.004  0.056     0.004  0.055     0.005   .94  0.53  0.95  0.67  0.80  0.03  0.02
41  750  Difficult  Low   .3  15  0.048  0.006  0.075a    0.007  0.058     0.007   .66  0.86  0.95  0.00  0.34  0.58  0.21
42  750  Difficult  Low   .3  30  0.049  0.006  0.081a,b  0.008  0.043     0.007   .79  0.87  0.95  0.00  0.36  0.60  0.20
43  750  Difficult  Low   .5  15  0.049  0.004  0.164a,b  0.007  0.075a    0.007   .66  0.86  0.95  0.00  0.56  0.58  0.18
44  750  Difficult  Low   .5  30  0.049  0.004  0.269a,b  0.007  0.059     0.006   .79  0.87  0.95  0.00  0.60  0.60  0.19
45  750  Difficult  High  .3  15  0.049  0.006  0.159a,b  0.008  0.066a    0.007   .85  1.40  0.95  0.00  0.07  1.07  0.33
46  750  Difficult  High  .3  30  0.049  0.006  0.180a,b  0.008  0.065a    0.007   .92  1.41  0.95  0.00  0.07  1.09  0.33
47  750  Difficult  High  .5  15  0.049  0.004  0.635a,b  0.010  0.193a,b  0.006   .85  1.40  0.95  0.00  0.49  1.07  0.33
48  750  Difficult  High  .5  30  0.049  0.004  0.765a,b  0.010  0.192a,b  0.006   .92  1.41  0.95  0.00  0.48  1.09  0.33

Note: c = condition; n = number of individuals; b_i,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 1), Moderate (assessment appropriateness) = N(0, 1), Difficult (assessment inappropriateness) = N(1.5, 1); a_i = item discrimination distribution, Low = U(.31, .58), High = U(.58, 1.13); β = regression weight; k = number of items; p_θ = empirical Type I error rate for actual theta scores; ΔR²_θ = average effect size for significant interactions for actual theta scores; p_X = empirical Type I error rate for number-correct scores; ΔR²_X = average effect size for significant interactions for number-correct scores; p_θ̂ = empirical Type I error rate for estimated theta scores; ΔR²_θ̂ = average effect size for significant interactions for estimated theta scores; α = average internal consistency for the number-correct scores; RMSE = root mean square error for the estimated theta scores; SW_θ = proportion of n.s. Shapiro–Wilk tests for the actual theta scores; SW_X = proportion of n.s. Shapiro–Wilk tests for the number-correct scores; SW_θ̂ = proportion of n.s. Shapiro–Wilk tests for the estimated theta scores; sk_X = |skewness| for the number-correct scores (abs. value); sk_θ̂ = |skewness| for the estimated theta scores. Iterations per condition = 1,000.
a Significant Type I error rate based on the results of a binomial test.
b Significant Type I error rate based on the alpha +/- .5 alpha criterion.
Tab
le3.
Res
ults
ofSi
mula
tion
3(H
igh
Fidel
ity,
Dis
trib
ution
ofLa
tent
Const
ruct
Score
s=
Stan
dar
dN
orm
alN
(0,1
))
cn
b i,j2
1a i
bk
pu
DR
2 up
xD
R2 x
pu
DR
2 ua
RM
SESW
uSW
xSW
usk
xsk
u
49
250
Eas
yLo
w.3
15
0.0
62
0.0
18
0.0
67
a0.0
21
0.0
55
0.0
20
.64
0.7
0.9
60.0
10.7
80.6
40.2
350
250
Eas
yLo
w.3
30
0.0
62
0.0
18
0.0
78
a,b
0.0
20
0.0
54
0.0
22
.78
0.6
90.9
60.0
10.7
50.6
70.2
451
250
Eas
yLo
w.5
15
0.0
62
0.0
11
0.0
99
a,b
0.0
18
0.0
64
a0.0
18
.64
0.7
00.9
60.1
00.8
90.6
40.2
352
250
Eas
yLo
w.5
30
0.0
62
0.0
11
0.1
32
a,b
0.0
18
0.0
57
0.0
19
.78
0.6
90.9
60.1
50.8
30.6
70.2
453
250
Eas
yH
igh
.315
0.0
62
0.0
18
0.1
28
a,b
0.0
24
0.0
79
a,b
0.0
22
.84
1.5
70.9
60.0
00.0
31.3
40.7
654
250
Eas
yH
igh
.330
0.0
61
0.0
18
0.1
52
a,b
0.0
23
0.0
85
a,b
0.0
21
.91
1.5
60.9
60.0
00.0
41.3
80.7
655
250
Eas
yH
igh
.515
0.0
61
0.0
11
0.3
90
a,b
0.0
23
0.2
15
a,b
0.0
19
.84
1.5
70.9
60.0
00.2
61.3
40.7
756
250
Eas
yH
igh
.530
0.0
62
0.0
11
0.4
67
a,b
0.0
23
0.2
24
a,b
0.0
19
.91
1.5
60.9
60.0
00.2
71.3
70.7
657
250
Moder
ate
Low
.315
0.0
62
0.0
18
0.0
47
0.0
20
0.0
44
0.0
21
.68
0.5
60.9
60.7
70.9
60.0
10.0
058
250
Moder
ate
Low
.330
0.0
62
0.0
18
0.0
44
0.0
20
0.0
58
0.0
20
.81
0.5
60.9
60.8
00.9
70.0
10.0
059
250
Moder
ate
Low
.515
0.0
62
0.0
11
0.0
41
0.0
16
0.0
50
0.0
16
.68
0.5
60.9
60.8
90.9
70.0
10.0
060
250
Moder
ate
Low
.530
0.0
62
0.0
11
0.0
40
0.0
15
0.0
50
0.0
17
.81
0.5
60.9
60.9
00.9
60.0
10.0
061
250
Moder
ate
Hig
h.3
15
0.0
62
0.0
18
0.0
47
0.0
19
0.0
56
0.0
18
.88
0.3
80.9
60.1
10.9
30.0
20.0
162
250
Moder
ate
Hig
h.3
30
0.0
62
0.0
18
0.0
42
0.0
20
0.0
59
0.0
19
.93
0.3
90.9
60.1
40.9
30.0
20.0
163
250
Moder
ate
Hig
h.5
15
0.0
62
0.0
11
0.0
31
0.0
14
0.0
42
0.0
14
.88
0.3
80.9
60.7
90.9
50.0
20.0
164
250
Moder
ate
Hig
h.5
30
0.0
61
0.0
11
0.0
32
0.0
14
0.0
54
0.0
14
.93
0.3
90.9
60.8
60.9
50.0
20.0
1
c   n    b_i,j−1    a_i   β   k   p_θ    ΔR²_θ  p_x       ΔR²_x  p_θ̂       ΔR²_θ̂  α    RMSE  SW_θ  SW_x  SW_θ̂  sk_x  sk_θ̂
65  250  Difficult  Low   .3  15  0.061  0.018  0.075a    0.021  0.050     0.020  .63  0.72  0.96  0.01  0.74  0.66  0.24
66  250  Difficult  Low   .3  30  0.061  0.018  0.066a    0.022  0.049     0.021  .78  0.71  0.96  0.01  0.73  0.69  0.24
67  250  Difficult  Low   .5  15  0.061  0.011  0.115a,b  0.020  0.071a    0.018  .63  0.72  0.96  0.07  0.83  0.66  0.24
68  250  Difficult  Low   .5  30  0.061  0.011  0.150a,b  0.018  0.067a    0.018  .78  0.70  0.96  0.11  0.83  0.69  0.24
69  250  Difficult  High  .3  15  0.061  0.018  0.141a,b  0.025  0.098a,b  0.022  .83  1.65  0.96  0.00  0.03  1.39  0.80
70  250  Difficult  High  .3  30  0.061  0.018  0.155a,b  0.024  0.093a,b  0.022  .91  1.65  0.96  0.00  0.02  1.42  0.80
71  250  Difficult  High  .5  15  0.061  0.011  0.411a,b  0.025  0.246a,b  0.021  .83  1.65  0.96  0.00  0.22  1.40  0.81
72  250  Difficult  High  .5  30  0.061  0.011  0.488a,b  0.025  0.235a,b  0.020  .91  1.65  0.96  0.00  0.24  1.43  0.81
Note: c = condition; n = number of individuals; b_i,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 0.5), Moderate (assessment appropriateness) = N(0, 0.5), Difficult (assessment inappropriateness) = N(1.5, 0.5); a_i = item discrimination distribution, Low = U(.31, .58), High = U(.58, 1.13); β = regression weight; k = number of items; p_θ = empirical Type I error rate for actual theta scores; ΔR²_θ = average effect size for significant interactions for actual theta scores; p_x = empirical Type I error rate for number-correct scores; ΔR²_x = average effect size for significant interactions for number-correct scores; p_θ̂ = empirical Type I error rate for estimated theta scores; ΔR²_θ̂ = average effect size for significant interactions for estimated theta scores; α = average internal consistency for the number-correct scores; RMSE = root mean square error for the estimated theta scores; SW_θ = proportion of n.s. Shapiro–Wilk tests for the actual theta scores; SW_x = proportion of n.s. Shapiro–Wilk tests for the number-correct scores; SW_θ̂ = proportion of n.s. Shapiro–Wilk tests for the estimated theta scores; sk_x = |skewness| for the number-correct scores (abs. value); sk_θ̂ = |skewness| for the estimated theta scores. Iterations per condition = 1,000.
a Significant Type I error rate based on the results of a binomial test.
b Significant Type I error rate based on the alpha ± .5 alpha criterion.
Table 4. Results of Simulation 4 (High Fidelity, Distribution of Latent Construct Scores = Standard Normal N(0, 1))
c   n    b_i,j−1    a_i   β   k   p_θ    ΔR²_θ  p_x       ΔR²_x  p_θ̂       ΔR²_θ̂  α    RMSE  SW_θ  SW_x  SW_θ̂  sk_x  sk_θ̂
73  750  Easy       Low   .3  15  0.049  0.006  0.086a,b  0.007  0.066b    0.007  .64  0.65  0.95  0.00  0.16  0.65  0.26
74  750  Easy       Low   .3  30  0.049  0.006  0.080a,b  0.007  0.056     0.007  .78  0.65  0.95  0.00  0.16  0.67  0.26
75  750  Easy       Low   .5  15  0.049  0.004  0.183a,b  0.007  0.100a,b  0.006  .64  0.65  0.95  0.00  0.54  0.65  0.26
76  750  Easy       Low   .5  30  0.049  0.004  0.268a,b  0.006  0.086a,b  0.006  .78  0.65  0.95  0.00  0.55  0.67  0.26
77  750  Easy       High  .3  15  0.049  0.006  0.216a,b  0.008  0.109a,b  0.007  .84  1.36  0.95  0.00  0.00  1.35  0.62
78  750  Easy       High  .3  30  0.049  0.006  0.199a,b  0.009  0.095a,b  0.007  .91  1.36  0.95  0.00  0.00  1.38  0.62
79  750  Easy       High  .5  15  0.049  0.004  0.740a,b  0.011  0.407a,b  0.007  .84  1.36  0.95  0.00  0.07  1.35  0.62
80  750  Easy       High  .5  30  0.049  0.004  0.842a,b  0.012  0.388a,b  0.007  .91  1.35  0.95  0.00  0.08  1.38  0.62
81  750  Moderate   Low   .3  15  0.049  0.006  0.047     0.006  0.049     0.006  .68  0.56  0.95  0.15  0.90  0.01  0.00
82  750  Moderate   Low   .3  30  0.049  0.006  0.045     0.007  0.046     0.006  .81  0.56  0.95  0.23  0.89  0.01  0.00
83  750  Moderate   Low   .5  15  0.049  0.004  0.043     0.005  0.047     0.005  .68  0.56  0.95  0.61  0.93  0.01  0.00
84  750  Moderate   Low   .5  30  0.049  0.004  0.046     0.005  0.042     0.006  .81  0.56  0.95  0.74  0.94  0.01  0.00
85  750  Moderate   High  .3  15  0.049  0.006  0.043     0.006  0.049     0.006  .88  0.39  0.95  0.00  0.70  0.02  0.00
86  750  Moderate   High  .3  30  0.049  0.006  0.041     0.006  0.042     0.006  .93  0.39  0.95  0.00  0.72  0.02  0.00
87  750  Moderate   High  .5  15  0.049  0.004  0.045     0.005  0.054     0.004  .88  0.39  0.95  0.32  0.92  0.02  0.00
88  750  Moderate   High  .5  30  0.049  0.004  0.040     0.004  0.047     0.005  .93  0.39  0.95  0.46  0.92  0.02  0.00
89  750  Difficult  Low   .3  15  0.048  0.006  0.080a,b  0.008  0.059     0.007  .64  0.68  0.95  0.00  0.12  0.67  0.26
90  750  Difficult  Low   .3  30  0.049  0.006  0.094a,b  0.007  0.041     0.007  .78  0.67  0.95  0.00  0.12  0.69  0.26
91  750  Difficult  Low   .5  15  0.049  0.004  0.180a,b  0.007  0.094a,b  0.007  .64  0.68  0.95  0.00  0.50  0.67  0.26
92  750  Difficult  Low   .5  30  0.049  0.004  0.315a,b  0.007  0.076a,b  0.006  .78  0.67  0.95  0.00  0.51  0.69  0.27
93  750  Difficult  High  .3  15  0.049  0.006  0.199a,b  0.009  0.106a,b  0.008  .84  1.46  0.95  0.00  0.00  1.40  0.66
94  750  Difficult  High  .3  30  0.049  0.006  0.236a,b  0.009  0.107a,b  0.008  .91  1.46  0.95  0.00  0.00  1.43  0.65
95  750  Difficult  High  .5  15  0.049  0.004  0.734a,b  0.012  0.436a,b  0.008  .84  1.46  0.95  0.00  0.06  1.40  0.66
96  750  Difficult  High  .5  30  0.049  0.004  0.849a,b  0.013  0.404a,b  0.008  .91  1.46  0.95  0.00  0.05  1.43  0.65
Note: c = condition; n = number of individuals; b_i,j−1 = item category difficulty distribution, Easy (assessment inappropriateness) = N(−1.5, 0.5), Moderate (assessment appropriateness) = N(0, 0.5), Difficult (assessment inappropriateness) = N(1.5, 0.5); a_i = item discrimination distribution, Low = U(.31, .58), High = U(.58, 1.13); β = regression weight; k = number of items; p_θ = empirical Type I error rate for actual theta scores; ΔR²_θ = average effect size for significant interactions for actual theta scores; p_x = empirical Type I error rate for number-correct scores; ΔR²_x = average effect size for significant interactions for number-correct scores; p_θ̂ = empirical Type I error rate for estimated theta scores; ΔR²_θ̂ = average effect size for significant interactions for estimated theta scores; α = average internal consistency for the number-correct scores; RMSE = root mean square error for the estimated theta scores; SW_θ = proportion of n.s. Shapiro–Wilk tests for the actual theta scores; SW_x = proportion of n.s. Shapiro–Wilk tests for the number-correct scores; SW_θ̂ = proportion of n.s. Shapiro–Wilk tests for the estimated theta scores; sk_x = |skewness| for the number-correct scores (abs. value); sk_θ̂ = |skewness| for the estimated theta scores. Iterations per condition = 1,000.
a Significant Type I error rate based on the results of a binomial test.
b Significant Type I error rate based on the results of the alpha ± .5 alpha criterion.
Figure 3. Empirical Type I error rates for the interaction term of a simulated moderated multiple regression model under conditions of assessment appropriateness
Figure 2. Distribution of spurious interactions for number-correct scores and estimated theta scores
from inappropriate assessments were used, χ²(1, N = 96,000) = 5,008.55, p < .001, odds ratio =
8.13. In addition, estimated theta scores resulted in empirical Type I error rates that were above
the acceptable interval in 33 of the 64 (51%) inappropriate assessment conditions. A direct
logistic regression analysis indicated that the odds of committing a Type I error were 2.4
times greater when estimated theta scores from inappropriate assessments were used, χ²(1, N =
96,000) = 918.15, p < .001, odds ratio = 2.40.
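The two flags attached to the empirical Type I error rates in the tables (a binomial test, and the alpha ± .5 alpha interval, which spans .025 to .075 for a nominal alpha of .05) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the use of a one-sided exact binomial test evaluated at the .05 level is an assumption; the table notes do not state which form of the binomial test was applied.

```python
from math import exp, lgamma, log

NOMINAL_ALPHA = 0.05   # nominal Type I error rate
ITERATIONS = 1000      # iterations per condition, from the table notes


def binomial_upper_tail(errors: int, iterations: int, alpha: float) -> float:
    """Exact P(X >= errors) for X ~ Binomial(iterations, alpha)."""
    log_p, log_q = log(alpha), log(1.0 - alpha)
    total = 0.0
    for k in range(errors, iterations + 1):
        # Log of the binomial coefficient via lgamma, to avoid overflow.
        log_coef = (lgamma(iterations + 1) - lgamma(k + 1)
                    - lgamma(iterations - k + 1))
        total += exp(log_coef + k * log_p + (iterations - k) * log_q)
    return total


def type_i_flags(rate: float, iterations: int = ITERATIONS,
                 alpha: float = NOMINAL_ALPHA) -> str:
    """Return "a", "b", "a,b", or "" for an empirical Type I error rate."""
    errors = round(rate * iterations)
    flags = []
    # Flag a (assumed one-sided): more errors than expected under alpha.
    if binomial_upper_tail(errors, iterations, alpha) < 0.05:
        flags.append("a")
    # Flag b: rate lies outside [0.5 * alpha, 1.5 * alpha]. The small
    # tolerance keeps boundary values such as .075 inside the interval.
    if rate < 0.5 * alpha - 1e-12 or rate > 1.5 * alpha + 1e-12:
        flags.append("b")
    return ",".join(flags)


for rate in (0.049, 0.075, 0.086):
    print(rate, type_i_flags(rate) or "none")
```

Under these assumptions a rate of .075 is flagged by the binomial test only, whereas .086 falls outside the interval as well and receives both flags, which agrees with how those two values are marked in the tables.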
Impact of the Independent Variables on the Empirical Type I Error Rate
Table 5 presents the mean empirical Type I error rate as well as direct logistic regression tests
for the levels of each independent variable. The dependent variable in the logistic regression
analyses was the occurrence of a Type I error at the iteration level, coded as 1 if the
ΔR² between the additive and multiplicative models was significant (p < .05) and 0 if it was not.
All of the independent variables were entered into the model simultaneously as categorical
predictors. A general pattern can be identified in these results such that higher empirical
Type I error rates were observed for the stronger level of each independent variable. This
pattern indicates that each psychometric characteristic that was varied in the simulations had
an overall effect on the empirical Type I error rates for the interaction term.
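For reference, the significance of ΔR² in a hierarchical comparison of this kind is conventionally evaluated with an F test. The expression below is the standard textbook form rather than a formula quoted from the article. It assumes that the additive model contains two predictors (the focal predictor and the moderator) and that the multiplicative model adds only their product; the labels R²_add and R²_mult and the residual degrees of freedom n − 4 are therefore assumptions of this sketch.

```latex
\Delta R^2 = R^2_{\text{mult}} - R^2_{\text{add}}, \qquad
F(1,\, n - 4) = \frac{\Delta R^2 / 1}{\left(1 - R^2_{\text{mult}}\right) / (n - 4)}
```

Under this formulation, an iteration would be coded 1 when the p value associated with F falls below .05 and 0 otherwise.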
Several important findings can be identified from these results. First, the psychometric
characteristics that were manipulated in this simulation had a stronger overall effect on Type I
errors when the variables were operationalized as number-correct scores than when they were
operationalized as estimated theta scores. These results suggest that number-correct scores are more sensitive to
Figure 4. Empirical Type I error rates for the interaction term of a simulated moderated multiple regression model under conditions of assessment inappropriateness
Table 5. Impact of Individual Predictors on Empirical Type I Error Rates
                                                Direct logistic regression for         Direct logistic regression for
                                                number-correct score Type I errors     estimated theta score Type I errors
                              M                 (note a)                               (note b)
Predictor        p_θ   p_x   p_θ̂    Wald χ²      df  B      OR (note c)    Wald χ²      df  B      OR (note c)
Appropriateness (difficulty)
  Appropriate    0.06  0.04  0.05   5,008.55***  1   2.096  8.13           918.15***    1   0.875  2.40
  Inappropriate  0.06  0.24  0.11
Discrimination
  Low            0.06  0.10  0.06   4,437.19***  1   1.339  3.82           1,185.01***  1   0.836  2.30
  High           0.06  0.25  0.12
Sample size
  250            0.06  0.16  0.09   154.09***    1   0.234  1.26           9.93***      1   0.073  1.08
  750            0.06  0.19  0.09
Fidelity
  Normal         0.06  0.16  0.07   165.42***    1   0.242  1.27           364.77***    1   0.446  1.56
  High           0.06  0.19  0.11
Items
  15             0.06  0.16  0.09   154.09***    1   0.234  1.26           9.93***      1   0.073  1.08
  30             0.06  0.19  0.09
Beta weights
  .3             0.06  0.09  0.06   4,876.13***  1   1.417  4.12           881.98***    1   0.710  2.03
  .5             0.06  0.26  0.12
Note: OR = odds ratio.
a Omnibus full model, χ²(1, N = 96,000) = 17,157.51, p < .001, R² = .27.
b Omnibus full model, χ²(1, N = 96,000) = 3,571.47, p < .001, R² = .08.
c In each case, the OR reported corresponds to increases in the predictor variable (e.g., increased assessment inappropriateness results in higher likelihoods of Type I errors, increases in discrimination results in higher likelihoods of Type I errors, etc.).
***p < .001.
measurement effects in parametric analyses than are IRT-derived theta estimates. For both
dependent variables, assessment appropriateness was the most impactful predictor of Type I
errors, followed by item discrimination and regression weights. This result confirms and
extends the effects of assessment appropriateness identified by Kang and Waller (2005), as well
as arguments raised by Busemeyer (1980) on the role of assessment difficulty in parametric
statistics.
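The B and OR columns of Table 5 are linked by the usual logistic regression identity OR = exp(B). The short Python sketch below checks that identity against the number-correct coefficients transcribed from Table 5; the list and function names are hypothetical, and small discrepancies are expected because the published coefficients are rounded to three decimals.

```python
from math import exp

# (predictor, B, reported OR) for number-correct scores, transcribed from Table 5.
NUMBER_CORRECT = [
    ("appropriateness", 2.096, 8.13),
    ("discrimination", 1.339, 3.82),
    ("sample size", 0.234, 1.26),
    ("fidelity", 0.242, 1.27),
    ("items", 0.234, 1.26),
    ("beta weights", 1.417, 4.12),
]


def odds_ratio(b: float) -> float:
    """Convert a logistic regression coefficient to an odds ratio."""
    return exp(b)


for name, b, reported in NUMBER_CORRECT:
    print(f"{name:16s} B = {b:.3f}  exp(B) = {odds_ratio(b):.2f}  "
          f"reported OR = {reported:.2f}")
```

Read this way, the coefficient of 2.096 for assessment appropriateness corresponds to roughly eightfold higher odds of a Type I error under inappropriate assessments, which is the odds ratio of 8.13 reported in the text.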
Strength of Spurious Interaction Effects
Finally, the authors were interested in understanding how assessment appropriateness affected
the strength of spurious interactions for the different scoring methods. The columns labeled
ΔR² for each respective scoring method in Tables 1 through 4 indicate the average strength of
the interaction when a spurious interaction was identified. Because sample size is known to
affect the strength of interaction effects, the authors used a multivariate analysis of covariance
(MANCOVA) to determine the effect of assessment appropriateness using sample size as a covariate.
After adjusting for the effects of sample size, the results indicated a significant effect of
assessment appropriateness on the strength of spurious interaction effects for number-correct
scores, F(1, 93) = 51.92, p < .001, partial η² = .36, such that the average interaction strength in
the inappropriate assessment conditions (M = 0.015, SD = 0.007) was significantly greater than in
the appropriate assessment conditions (M = 0.011, SD = 0.006). A similar, albeit much weaker,
result was also identified for estimated theta scores, F(1, 93) = 4.47, p < .05, partial η² = .05,
such that the average interaction strength in the inappropriate assessment conditions (M = 0.013,
SD = 0.006) was significantly greater than in the appropriate assessment conditions (M = 0.012,
SD = 0.006). No significant difference was identified for actual theta scores. These results
indicate that assessment appropriateness has an effect on the strength of spurious interaction effects
for number-correct scores and estimated theta scores and that the effect is considerably stronger
in the number-correct score conditions.
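The partial η² values reported above can be recovered from the F statistics and their degrees of freedom through the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The Python sketch below applies it to the two reported tests; the function name is hypothetical, and the identity is the general univariate relationship rather than a computation described in the article.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Reported tests of assessment appropriateness on interaction strength.
number_correct = partial_eta_squared(51.92, 1, 93)
estimated_theta = partial_eta_squared(4.47, 1, 93)

print(f"number-correct scores:  partial eta squared = {number_correct:.2f}")
print(f"estimated theta scores: partial eta squared = {estimated_theta:.2f}")
```

Both results round to the reported values (.36 and .05), so the effect of assessment appropriateness on interaction strength is about seven times larger for number-correct scores than for estimated theta scores.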
Discussion
Theoretical and empirical evidence has emerged to suggest that using IRT to operationalize an
individual’s standing on a latent construct has important measurement implications over the use