Assessment
1–12
© The Author(s) 2014
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/1073191114560882
asm.sagepub.com

Article

Detecting and Explaining Aberrant Responding to the Outcome Questionnaire–45

Judith M. Conijn,1 Wilco H. M. Emons,1 Kim De Jong,2,3 and Klaas Sijtsma1

Abstract

We applied item response theory based person-fit analysis (PFA) to data of the Outcome Questionnaire–45 (OQ-45) to investigate the prevalence and causes of aberrant responding in a sample of Dutch clinical outpatients. The l_z^p person-fit statistic was used to detect misfitting item-score patterns, and the standardized residual statistic was used to identify the source of the misfit in the item-score patterns identified as misfitting. Logistic regression analysis was used to predict person misfit from clinical diagnosis, OQ-45 total score, and Global Assessment of Functioning code. The l_z^p statistic classified 12.6% of the item-score patterns as misfitting. Person misfit was positively related to the severity of psychological distress. Furthermore, patients with psychotic disorders, somatoform disorders, or substance-related disorders were more likely to show misfit than the baseline group of patients with mood and anxiety disorders. The results suggest that general outcome measures such as the OQ-45 are not equally appropriate for patients with different disorders. Our study emphasizes the importance of person-misfit detection in clinical practice.

Keywords

aberrant responding, item response theory, outcome measurement, Outcome Questionnaire–45, person-fit analysis

1 School of Social and Behavioral Sciences, Tilburg University, Tilburg, Netherlands
2 Institute of Psychology, Leiden University, Leiden, Netherlands
3 Research Department, GGZ Noord-Holland-Noord, Netherlands

Corresponding Author:
Judith M. Conijn, Department of Clinical Psychology, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, Netherlands
Email: j.m.conijn@fsw.leidenuniv.nl

During the previous two decades, the growing interest in the quality of mental health care has led to an increased use of self-report outcome measures (Holloway, 2002). To monitor the effectiveness of treatment for individual patients, outcome measures that assess symptom severity and daily functioning are repeatedly administered during treatment. Based on the repeated measurements, the treatment plan can be altered if recovery does not proceed as expected (Lambert & Shimokawa, 2011). Furthermore, mental health care providers use these outcome data to evaluate treatment results at the institutional level, and insurance companies, health care managers, and other regulatory bodies use outcome measures for policy decisions aimed at improving cost-effectiveness (Bickman & Salzer, 1997). Examples of frequently used outcome measures are the Outcome Questionnaire–45 (OQ-45; Lambert et al., 2004), the Brief Symptom Inventory (BSI; Derogatis, 1993), and the Clinical Outcomes in Routine Evaluation–Outcome Measure (CORE-OM; Evans et al., 2002).

Given the importance of outcome measures for individual decision making in mental health care, their psychometric properties are a major concern (e.g., Doucette & Wolf, 2009; Pirkis et al., 2005). However, even if instruments have excellent psychometric properties, persons may respond aberrantly to clinical and personality scales, thus producing invalid test scores. In fact, response inconsistency to personality and psychopathology self-report inventories was found to be positively related to indicators of psychological distress, psychological problems, and negative affect (Conijn, Emons, Van Assen, Pedersen, & Sijtsma, 2013; Reise & Waller, 1993; Woods, Oltmanns, & Turkheimer, 2008), which suggests that mental health care patients may be inclined to respond aberrantly. Cognitive deficits that are commonly observed in mental illness may explain concentration problems that interfere with the quality of self-reports (Atre-Vaidya et al., 1998; Cuijpers, Li, Hofmann, & Andersson, 2010). However, potential causes of aberrant responding are numerous, including lack of motivation, response styles, idiosyncratic interpretation of item content, and low traitedness. Traitedness refers to the applicability of the trait to the respondent (Tellegen, 1988).
Downloaded from asm.sagepub.com at Seoul National University on December 18, 2014.
Aberrant responding provides clinicians with invalid information and, as a result, adversely affects the quality of treatment and diagnosis decisions (Conrad et al., 2010; Handel, Ben-Porath, Tellegen, & Archer, 2010). Person-fit analysis (PFA) involves statistical methods to detect aberrant item-score patterns that are due to aberrant responding. Conrad et al. (2010) provided an example of the potential of PFA for mental health care by using PFA to screen for atypical symptom profiles among persons at intake for drug or alcohol dependence treatment. They found that the persons with aberrant item-score patterns required different treatments than persons with model-consistent item-score patterns, and concluded that PFA may detect inconsistencies that have important implications for treatment and diagnosis decisions. As self-report outcome measures are increasingly used to make treatment decisions in clinical practice, PFA may be a valuable screening tool in outcome measurement.

The importance of detecting aberrant responding has long been recognized. Both the original and current versions of the Minnesota Multiphasic Personality Inventory (Butcher et al., 2001; Handel et al., 2010) include scales to detect different types of aberrant responding. Examples are lie scales to detect faking good or faking bad, and indices based on the consistency of the responses to items either highly similar or opposite with respect to content, such as the Variable Response Inconsistency (VRIN) scale to detect random responding and the True Response Inconsistency (TRIN) scale to detect acquiescence. Despite validity scales' importance, outcome questionnaires typically do not include specialized scales for detecting aberrant responding (Lambert & Hawkins, 2004). One possible explanation is that, with the increasing demand for cost-effectiveness, time for assessment has been reduced greatly (Wood, Garb, Lilienfeld, & Nezworski, 2002). Consequently, outcome questionnaires are required to be short and efficient, which limits the use of validity scales consisting of additional items (e.g., lie scales) and limits the construction of TRIN and VRIN scales, because fewer item pairs with similar or opposite content are available.
Person-Fit Analysis in Outcome Measurement
In this study, we used PFA to investigate the prevalence and possible causes of aberrant responding in outcome measurement by means of the OQ-45. In PFA, person-fit statistics signal whether an individual's item-score pattern is inconsistent with the item-score pattern expected under the particular measurement model (Meijer & Sijtsma, 2001). A significant discrepancy between the observed item-score pattern and the expected item-score pattern provides evidence of person misfit. Person misfit means that the individual's test score is unlikely to be meaningful in terms of the trait being measured. For noncognitive data, the l_z person-fit statistic (Drasgow, Levine, & McLaughlin, 1987) is one of the best performing and most popular person-fit statistics (Emons, 2008; Ferrando, 2012; Karabatsos, 2003). To determine whether an item-score pattern shows significant misfit, statistic l_z is compared with a cutoff value obtained under the item response theory (IRT; Embretson & Reise, 2000) model that serves as the null model of consistency (De la Torre & Deng, 2008; Nering, 1995). Statistic l_z detects various types of aberrant responding, such as acquiescence and extreme response style, but the statistic is most powerful for detecting random responding (Emons, 2008). In detecting random responding to 57 items measuring the Big Five personality factors, PFA has been found to outperform an inconsistency index based on the rationale of the Minnesota Multiphasic Personality Inventory VRIN scale (Egberink, 2010, pp. 94-100).
An advantage of statistic l_z and other person-fit statistics for application to outcome measurement is that they can be used to detect invalid test scores on any self-report scale that is consistent with an IRT model. Also, the rise of computerized and IRT-based outcome monitoring (e.g., Patient-Reported Outcomes Measurement Information System; Cella et al., 2007) renders the implementation of PFA feasible. Along with the computer-generated test score, a person-fit value may be provided to the clinician, serving as an alarm bell warning that the test score may be invalid and that further inquiry may be useful.
Follow-up PFA of item-score patterns flagged as misfitting can help the clinician to infer possible explanations for an individual's observed aberrant responding. In personality measurement, Ferrando (2010) used item-score residuals for follow-up PFA and found that a person who had an aberrant item-score pattern on an extraversion scale showed unexpectedly low scores on items referring to situations where the person could make a fool of himself. This result suggested that the aberrant responding was due to fear of being rejected. For another person, follow-up PFA suggested inattentiveness to reversed item wording. In outcome measurement for individual patients, follow-up PFA can inform the clinician about the sources of the misfit, and clinicians can discuss the unexpected item scores with the patients to obtain a better understanding of the patient's psychological profile.
PFA primarily focuses on individuals but can also be used to explain individual differences in aberrant responding at the group level; for examples, see Conijn, Emons, and Sijtsma (2014) and Conijn et al. (2013). In outcome measurement, PFA can be used to investigate the extent to which general measures are suited for assessing patients suffering from different disorders. General outcome measures, such as the OQ-45 and the CORE-OM, use items that assess the most common symptoms of psychopathology, such as those observed in depression and anxiety disorders (Lambert & Hawkins, 2004), and are also used to assess patients suffering from different specific disorders, varying from somatoform disorders to psychotic disorders and addiction. For rare or specific disorders, many of the general measures' items are irrelevant, and low traitedness may lead to inconsistent or unmotivated completion of outcome measures.
Goal of the Study

We investigated the prevalence and the causes of aberrant responding to the OQ-45 (Lambert et al., 2004). The OQ-45 is one of the most popular general outcome measures used in mental health care. We used OQ-45 data from a large Dutch clinical outpatient sample suffering from a large variety of disorders. The l_z person-fit statistic (Drasgow et al., 1987) was used to identify misfitting item-score patterns, and standardized item-score residuals (Ferrando, 2010, 2012) were used to investigate sources of item-score pattern misfit. We employed logistic regression analyses using the l_z statistic as the dependent variable to investigate whether patients suffering from specific disorders (e.g., somatoform disorders and psychotic disorders) and severely distressed patients are more predisposed to produce aberrant item-score patterns on the OQ-45 than other patients. Based on the results for the OQ-45, we discuss the possible causes of aberrant responding in outcome measurement in general and the potential of PFA for improving outcome-measurement practice.
Method
Participants
We performed a secondary analysis on data collected in routine mental health care. Participants were 2,906 clinical outpatients (42.1% male) from a mental health care institution with four different locations situated in Noord-Holland, a predominantly rural province in the Netherlands. Participants' age ranged from 17 to 77 years (M = 37, SD = 13). Apart from gender and age, no other demographic information was collected.

Most patients completed the OQ-45 at intake, but 160 (5.5%) patients completed the OQ-45 after treatment started. The sample included 2,632 (91%) patients with a clinician-rated Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) primary diagnosis on Axis I, 192 (7%) persons with a primary diagnosis on Axis II, and 82 (3%) patients for whom the primary diagnosis was missing. The most frequent primary diagnoses were depression (38%), anxiety disorders (20%), disorders usually first diagnosed in infancy, childhood, or adolescence (8%), personality disorders (7%), adjustment disorders (6%), somatoform disorders (3%), eating disorders (2%), and substance-related disorders (2%). Of the diagnosed patients, 13.1% had comorbidity between Axis I and Axis II, 32.0% had comorbidity within Axis I, and 0.1% had comorbidity within Axis II. The clinician had access to the OQ-45 data, but since the OQ-45 is not a diagnostic instrument, it is unlikely that diagnosis was based on the OQ-45 results.
Measures
The Outcome Questionnaire–45. The OQ-45 (Lambert et al., 2004) uses three subscales to measure symptom severity and daily functioning. The Social Role Performance (SR; 9 items, of which 3 are reversely worded) subscale measures dissatisfaction, distress, and conflicts concerning one's employment, education, or leisure pursuits. An example item is "I feel stressed at work/school." The Interpersonal Relations (IR; 11 items, of which 4 are reversely worded) subscale measures difficulty with family, friends, and the marital relationship. An example item is "I feel lonely." The Symptom Distress (SD; 25 items, of which 3 are reversely worded) subscale measures symptoms of the most frequently diagnosed mental disorders, in particular anxiety and depression. An example item is "I feel no interest in things." Respondents are instructed to express their feelings with respect to the past week on a 5-point rating scale, with scores ranging from 0 (never) through 4 (almost always), higher scores indicating more psychological distress.
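The scoring rule just described can be sketched in a few lines (a hedged Python illustration: the 0-4 scale and reverse-coding follow the description above, but the reverse-keyed item positions used in the example are hypothetical, not the actual OQ-45 key):

```python
# Hedged sketch of OQ-45-style scoring on the 0-4 scale described above:
# reversely worded items are recoded as 4 - score before summing, so that
# higher totals always indicate more distress. The reverse-keyed item
# positions in the example are hypothetical, not the actual OQ-45 key.

def score_scale(responses, reverse_keyed=()):
    """Sum 0-4 item scores after reverse-coding the given item positions."""
    total = 0
    for i, x in enumerate(responses):
        if not 0 <= x <= 4:
            raise ValueError(f"item {i}: score {x} outside the 0-4 range")
        total += (4 - x) if i in reverse_keyed else x
    return total

# A 9-item subscale with three (hypothetical) reverse-keyed positions.
answers = [3, 2, 1, 4, 2, 0, 3, 2, 1]
print(score_scale(answers, reverse_keyed={2, 5, 8}))  # prints 26
```

The three subscale totals obtained this way sum to the OQ-45 total score used later in the analyses.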
In this study, we used the Dutch OQ-45 (De Jong & Nugter, 2004). The Dutch OQ-45 total score has good concurrent and criterion-related validity (De Jong et al., 2007). In our sample, coefficient alpha for the subscale total scores equaled .65 (SR), .77 (IR), and .91 (SD). Results concerning the OQ-45 factor structure are ambiguous. Some studies provided support for the theoretical three-factor model for both the original OQ-45 (Bludworth, Tracey, & Glidden-Tracey, 2010) and the Dutch OQ-45 (De Jong et al., 2007). Other studies found poor fit of the theoretical three-factor model and suggested that a three-factor model showed better fit when it was based on a reduced item set (Kim, Beretvas, & Sherry, 2010) or that a one-factor model was preferable (Mueller, Lambert, & Burlingame, 1998). In this study, we further investigated the fit of the theoretical three-factor model.
Explanatory Variables for Person Misfit
Severity of distress. The OQ-45 total score and the clinician-rated DSM-IV Global Assessment of Functioning (GAF) code were taken as measures of the patient's severity of distress. The GAF code ranges from 1 to 100, with higher values indicating better psychological, social, and occupational functioning. The GAF code was missing for 187 (6%) patients.

Diagnosis category. The clinician-rated DSM-IV diagnosis was classified into nine categories representing the most common types of disorders present in the sample. Table 1 describes the diagnosis categories and the number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that for patients suffering from these symptoms misfit was less likely than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously in one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).
Statistical Analysis
Model-Fit Evaluation

We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. The core of the GRM is the set of item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
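To make the ISRF idea concrete, the GRM's category probabilities can be sketched as follows (a minimal Python illustration; the discrimination and threshold values are made up for the example, not OQ-45 estimates):

```python
import math

# Minimal sketch of the graded response model's item step response
# functions (ISRFs): P(X >= k | theta) is a logistic curve in theta, and
# category probabilities are differences of adjacent ISRFs. The parameter
# values below are illustrative, not OQ-45 estimates.

def isrf(theta, a, b_k):
    """P(X >= k | theta): logistic item step response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b_k)))

def grm_category_probs(theta, a, b):
    """Probabilities of scores 0..m for an item with ordered thresholds b."""
    steps = [1.0] + [isrf(theta, a, bk) for bk in b] + [0.0]
    return [steps[k] - steps[k + 1] for k in range(len(b) + 1)]

# A 5-category (0-4) item with discrimination a = 1.5; at theta = 0 the
# symmetric thresholds make the middle category most likely.
probs = grm_category_probs(theta=0.0, a=1.5, b=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
assert abs(sum(probs) - 1.0) < 1e-12  # category probabilities sum to 1
```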
Satisfactory GRM fit to the data is a prerequisite for application of GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale, we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean squared error of approximation (RMSEA) and the standardized root mean square residual (SRMR; Muthén & Muthén, 2007). RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust with respect to the identified OQ-45 model misfit (Conijn et al., 2014).
Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis.

Category | Common DSM-IV diagnoses included | n | Mean l_zm^p | Detected (n) | Detected (%)
Mood and anxiety disorder(a) | Depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder | 1,786 | 0.28 | 229 | 12.8
Somatoform disorder | Pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder | 82 | 0.16 | 16 | 19.5
Attention deficit hyperactivity disorder (ADHD) | Predominantly inattentive; combined hyperactive-impulsive and inattentive | 198 | 0.08 | 15 | 7.6
Psychotic disorder | Schizophrenia, psychotic disorder not otherwise specified | 26 | −0.10 | 7 | 26.9
Borderline personality disorder | Borderline personality disorder | 53 | 0.35 | 2 | 3.8
Impulse-control disorders not elsewhere classified | Impulse-control disorder, intermittent explosive disorder | 58 | 0.02 | 10 | 17.2
Eating disorder | Eating disorder not otherwise specified, bulimia nervosa | 67 | 0.38 | 4 | 6.0
Substance-related disorder | Cannabis-related disorders, alcohol-related disorders | 58 | 0.09 | 13 | 22.4
Social and relational problem | Phase of life problem, partner relational problem, identity problem | 186 | 0.26 | 20 | 10.8

a. Including 65 patients with a mood disorder.
Person-Fit Analysis
Detection of misfit. We used statistic l_z for polytomous item scores, denoted l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics.

Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reversely worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they still may be suspicious, as they may reflect gross under- or overreporting of symptoms.
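The standardized log-likelihood idea behind l_z^p can be sketched as follows (a hedged Python illustration using the standard mean and variance formulas for the polytomous case; the category probabilities are made up and held identical across items, whereas in the actual analysis they come from the fitted GRM at the person's estimated θ):

```python
import math

# Hedged sketch of the l_z^p idea: the log-likelihood l_0 of the observed
# pattern is standardized by its model-implied mean and variance. Here
# probs_per_item[i][k] is the probability of score k on item i at the
# person's theta; below these are illustrative, not fitted GRM values.

def lz_poly(pattern, probs_per_item):
    """Standardized log-likelihood person-fit statistic for polytomous items."""
    l0 = sum(math.log(p[x]) for x, p in zip(pattern, probs_per_item))
    mean = sum(sum(pk * math.log(pk) for pk in p) for p in probs_per_item)
    var = sum(
        sum(pk * math.log(pk) ** 2 for pk in p)
        - sum(pk * math.log(pk) for pk in p) ** 2
        for p in probs_per_item
    )
    return (l0 - mean) / math.sqrt(var)

# Ten identical illustrative items scored 0-4.
p = [0.05, 0.15, 0.60, 0.15, 0.05]
items = [p] * 10
print(lz_poly([2] * 10, items))    # consistent pattern: positive value
print(lz_poly([0, 4] * 5, items))  # erratic pattern: large negative value
```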
Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.
Under the null model of fit to the IRT model, and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each generated item-score pattern, we re-estimated θ and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p. The percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++; algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
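The authors implemented this bootstrap in dedicated C++ software; the logic of the procedure can be roughly sketched in Python (simplified: θ is not re-estimated for each simulated pattern, the raw log-likelihood stands in for l_zm^p, and the category probabilities are illustrative rather than fitted):

```python
import math
import random

# Rough sketch of the parametric bootstrap for person-fit p values:
# simulate many item-score patterns at the person's estimated theta,
# compute the fit statistic for each, and take the observed statistic's
# percentile as the p value. Simplifications vs. the full procedure:
# theta is not re-estimated per simulated pattern, the raw log-likelihood
# stands in for l_zm^p, and the probabilities are illustrative.

def loglik(pattern, probs_per_item):
    return sum(math.log(p[x]) for x, p in zip(pattern, probs_per_item))

def bootstrap_p(pattern, probs_per_item, reps=5000, seed=1):
    """One-tailed p value: share of simulated patterns at least as unlikely."""
    rng = random.Random(seed)
    observed = loglik(pattern, probs_per_item)
    cats = list(range(len(probs_per_item[0])))
    count = 0
    for _ in range(reps):
        sim = [rng.choices(cats, weights=p)[0] for p in probs_per_item]
        if loglik(sim, probs_per_item) <= observed:
            count += 1
    return count / reps

p = [0.05, 0.15, 0.60, 0.15, 0.05]
items = [p] * 20
print(bootstrap_p([0, 4] * 10, items))  # erratic pattern: p value near 0
```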
Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).
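A minimal sketch of such a standardized residual, assuming the model-implied item mean and variance are computed from GRM category probabilities at the person's θ (illustrative probabilities, not fitted values):

```python
import math

# Hedged sketch of a standardized item-score residual: the observed score
# is compared with the model-implied expected score and scaled by the
# model-implied standard deviation at the person's theta. |z| > 1.64
# corresponds to the alpha = .10 two-tailed flag used in the text. The
# category probabilities below are illustrative, not fitted GRM values.

def standardized_residual(x, probs):
    """z = (x - E[X|theta]) / SD[X|theta] from category probabilities."""
    exp = sum(k * pk for k, pk in enumerate(probs))
    var = sum(k * k * pk for k, pk in enumerate(probs)) - exp ** 2
    return (x - exp) / math.sqrt(var)

probs = [0.05, 0.15, 0.60, 0.15, 0.05]   # expected score 2.0
print(standardized_residual(2, probs))   # approximately 0: as expected
print(standardized_residual(4, probs))   # > 1.64: unexpectedly high score
```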
Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
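The explanatory step can be sketched as follows (a hedged Python illustration on synthetic data with a single hypothetical "distress" predictor; a real analysis would include the diagnosis dummies and control variables and use a statistics package):

```python
import math
import random

# Hedged sketch of the explanatory step: logistic regression of the
# dichotomous misfit flag (1 = flagged, 0 = not flagged) on a single
# synthetic "distress" predictor. The control variables from the text
# (gender, measurement occasion) are omitted for brevity, and the fit
# uses plain gradient ascent on the log-likelihood.

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Maximum-likelihood intercept and slope via gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient of the log-likelihood w.r.t. b0
            g1 += (y - p) * x    # ... and w.r.t. b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Synthetic data: higher distress raises the misfit probability.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(-1.5 + 0.8 * x))) else 0
      for x in xs]
b0, b1 = fit_logistic(xs, ys)
print(round(b0, 2), round(b1, 2))  # intercept negative, slope positive
```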
Results
First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which the l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.
Model-Fit Evaluation
Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measure substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.

For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate (RMSEA = .13, SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were found only for some items of the SD and SR subscales, respectively. Thus, EFA results suggested that, more than other model violations, multidimensionality caused the subscale data to show GRM misfit.

[Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54, upper panel) and misfitting patient 2752 (l_zm^p = −7.96, lower panel).]
To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, with 88% of the responses in the most extreme category). We concluded that, despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.
Detection of Misfit and Follow-Up Analyses
For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, thus showing that her item scores were consistent with the expected GRM item scores.

Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale
misfit may be that his problems were limited to this relationship. On the SD subscale, he had several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, an extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.
Explanatory Person-Fit Analysis
For each of the diagnosis categories, Table 1 shows the
average l_z^mp value and the number and percentage of
patients classified as misfitting. For patients with mood and
anxiety disorders (i.e., the baseline category) the detection
rate was substantial (12.8%) but not high relative to most of
the other diagnosis categories.
Table 2 shows the results of the logistic regression analysis.
Model 1 included gender, measurement occasion, and
the diagnosis categories as predictors of person misfit.
Diagnosis category had a significant overall effect, χ2(8) =
26.47, p < .001. The effects of somatoform disorder, ADHD,
psychotic disorder, and substance abuse disorder were significant.
Patients with ADHD were unlikely to show misfit
relative to the baseline category of patients with mood or
anxiety disorders. Patients with somatoform disorders, psychotic
disorders, and substance-related disorders were more likely to show misfit.
Model 2 (Table 2, third column) also included GAF code
and OQ-45 total score. Both effects were significant and
suggested that patients with higher levels of psychological
distress were more likely to show misfit. After controlling
for GAF code and OQ-45 score, the negative effect of
ADHD was not significant. Hence, patients with ADHD
were less likely to show misfit because they had less severe
levels of distress. For the baseline category, the estimated
probability of misfit was .13. For patients with somatoform
disorders, psychotic disorders, and substance-related disorders,
the probability was .23, .31, and .22, respectively.
To investigate whether patients in the same diagnosis
category showed similar person misfit we compared the
standardized item-score residuals of the misfitting patterns
produced by patients with psychotic disorders (n = 7)
somatoform disorders (n = 16) or substance-related disor-
ders (n = 13) Most patients with a psychotic disorder had
low or average θ levels for each of the subscales Misfit was
due to several unexpected high scores indicating severe
symptoms In general these patients did not have large
Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_z^mp (1 = Significant Misfit at the
5% Level, 0 = No Misfit)

                         Model 1         Model 2
Intercept                −1.84 (0.11)    −1.93 (0.11)
Gender                   −0.12 (0.13)    −0.12 (0.13)
Measurement occasion     −0.17 (0.27)    −0.18 (0.27)
Diagnosis category^a
  Somatoform              0.57 (0.29)     0.74 (0.29)
  ADHD                   −0.58 (0.28)    −0.39 (0.28)
  Psychotic               1.05 (0.46)     1.13 (0.47)
  Borderline             −1.30 (0.72)    −1.39 (0.73)
  Impulse control         0.35 (0.36)     0.57 (0.36)
  Eating disorders       −1.10 (0.60)    −0.97 (0.60)
  Substance related       0.66 (0.33)     0.69 (0.33)
  Social/relational      −0.20 (0.26)     0.08 (0.27)
GAF code                  —              −0.17 (0.07)
OQ-45 total score         —               0.26 (0.07)

Note. N = 2,434. The correlation between the GAF code and OQ-45 total score equaled −.26.
^a The mood and anxiety disorders category is used as the baseline category. *p < .05. **p < .01. ***p < .001.
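The estimated misfit probabilities quoted in the text follow directly from the Model 2 coefficients via the inverse-logit transformation. A minimal check, assuming the coefficients read as −1.93 (intercept), 0.74 (somatoform), 1.13 (psychotic), and 0.69 (substance related):

```python
import math

def inv_logit(z):
    """Convert a log-odds value into a probability."""
    return 1 / (1 + math.exp(-z))

intercept = -1.93  # Model 2 intercept (baseline: mood/anxiety disorders)
effects = {"somatoform": 0.74, "psychotic": 1.13, "substance related": 0.69}

baseline = inv_logit(intercept)  # ~ .13 for the baseline category
probs = {dx: inv_logit(intercept + b) for dx, b in effects.items()}
# probs ~ {"somatoform": .23, "psychotic": .31, "substance related": .22}
```

The match with the probabilities reported in the text supports the decimal placement assumed here.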
residuals for the same items. However, unexpected high
scores on Item 25 ("disturbing thoughts come into my mind
that I cannot get rid of") were frequent (4 patients), which is
consistent with the symptoms characterizing psychotic disorders.
Low to average θ levels combined with several
unexpected high item scores suggested that patients with
psychotic disorders showed misfit because many OQ-45
items were irrelevant to them suggesting lack of traitedness
(Reise amp Waller 1993) For the misfitting patients with a
somatoform disorder or a substance-related disorder the
standardized residuals showed no common pattern and
thus they did not show similar person misfit
Discussion
We investigated prevalence and explanations of aberrant
responding to the OQ-45 by means of IRT-based person-fit
methods Reise and Waller (2009) suggested that IRT mod-
els fit poorly to psychopathology data and that misfit may
adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn
et al 2014) our simulation results suggested that l_z^mp is
robust to model misfit. The low empirical Type I error rates
suggested that GRM misfit did not lead to incorrect classi-
fication of model-consistent item-score patterns as misfit-
ting The current findings are valuable because they are
obtained under realistic conditions using a psychometri-
cally imperfect outcome measure that is frequently used in
practice We notice that this is the rule rather than the excep-
tion in general measurement instrumentsrsquo psychometric
properties are imperfect. However, our findings concerning
the robustness of l_z^mp may only hold for the kind of model
misfit we found for the OQ-45 Future studies should inves-
tigate the robustness of PFA methods to different IRT model
violations
The detection rate of 12.6% in the OQ-45 data is substantial
and comparable with detection rates found in other
studies using different measures For example using
repeated measurements on the StatendashTrait Anxiety
Inventory (Spielberger Gorsuch Lushene Vagg amp Jacobs
1983), Conijn et al (2013) found detection rates of 11% to
14% in a sample of cardiac patients, and Conijn et al (2014)
also found 16% misfit on the International Personality Item
Pool 50-item questionnaire (Goldberg et al 2006) in a
panel sample from the general population.
Consistent with previous research (Conijn et al 2013;
Reise & Waller 1993; Woods et al 2008), we found that
more severely distressed patients were more likely to show misfit.
Also, patients with somatoform disorders, psychotic disorders,
and substance-related disorders were more likely to show
misfit. Plausible explanations for these results include
patients tending to deny their mental problems (somatoform
disorders) symptoms of disorder and confusion and the
irrelevance of most OQ-45 items (psychotic disorders)
being under the influence during test-taking and the nega-
tive effect of long-term substance use on cognitive abilities
(substance-related disorders) The residual analysis con-
firmed the explanation of limited item relevance for patients
with psychotic disorders but also suggested that patients
identified with the same type of disorder generally did not
show the same type of misfit. However, the simulation
study revealed that the residual statistic had low power,
suggesting that only item scores reflecting large misfit are
identified, so results should be interpreted with caution.
There are two possible general explanations for finding
group differences with respect to the tendency to show per-
son misfit First person misfit may be due to a mismatch
between the OQ-45 and the specific disorder and second
misfit may be due to a general tendency to show misfit on
self-report measures Each explanation has a unique impli-
cation for outcome measurement The mismatch explana-
tion implies that disease-specific outcome measures rather
than general outcome measures should be used Examples
of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for
assessing somatic symptoms and the Severe Outcome
Questionnaire (Burlingame Thayer Lee Nelson amp
Lambert 2007) to diagnose severe mental illnesses such as
psychotic disorders and bipolar disorder
The general-misfit explanation implies that methods
other than self-reports should be used for patients' treatment
decisions, for example, clinician-rated outcome measures
such as the Health of the Nation Outcome Scales (Wing et
al 1998) Also these patientsrsquo self-report results should be
excluded from cost-effectiveness studies to prevent poten-
tial negative effects on policy decisions Future studies may
address the scale-specific misfit versus general-misfit
explanations by means of explanatory PFA of data from
other outcome measures
Residual statistics have been shown to be useful in real-data
applications for analyzing causes of aberrant responding to
unidimensional personality scales containing at least 30
items (Ferrando 2010 2012) but their performance has not
been evaluated previously by means of simulations Our
simulation study and real-data application question the use-
fulness of the residual statistic for retrieving causes of mis-
fit in outcome measures consisting of multiple short
subscales. An alternative method is the inspection of item
content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such
a qualitative approach to follow-up PFA could be particularly
useful when clinicians use the OQ-45 to provide feedback
to the patient.
Person-fit statistics such as l_z^mp can potentially detect
aberrant responding due to low traitedness low motivation
cognitive deficits and concentration problems However
an important limitation for outcome measurement is that the
statistics have low power to identify response styles and
malingering These types of aberrant responding result in
biased total scores but do not necessarily produce an incon-
sistent pattern of item scores across the complete measure
(Ferrando amp Chico 2001) Hence future research might
also use measures especially designed to identify response
styles (Cheung amp Rensvold 2000) and malingering and
compare the results with those from general-purpose per-
son-fit statistics Inconsistency scales such as the TRIN and
VRIN potentially detect the same types of generic response
inconsistencies as person-fit indices but to our knowledge
only one study has compared their performance to that of
PFA (Egberink 2010) PFA was found to outperform the
inconsistency scale but relative performance may depend
on dimensionality and test length Hence more research is
needed on this subject
Other suggestions for future research include the fol-
lowing A first question is whether the prevalence of aber-
rant responding increases (eg due to diminishing
motivation) or decreases (eg due to familiarity with the
questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant
responding justifies routine application of PFA in clinical
settings such as Patient Reported Outcomes Measurement
Information System If prevalence among the tested
patients is low the number of item-score patterns incor-
rectly classified as aberrant (ie Type I errors) may out-
number the correctly identified aberrant item-score
patterns (Piedmont McCrae Riemann amp Angleitner
2000) Then PFA becomes very inefficient and it is
unlikely to improve the quality of individual decision
making in clinical practice Third future research may use
group-level analysis such as differential item functioning
analysis (Thissen Steinberg amp Wainer 1993) or IRT mix-
ture modeling (Rost 1990) to study whether patients with
the same disorder showed similar patterns of misfit
To conclude our results have two main implications
pertaining to psychometrics and psychopathology First
the simulation study results suggest that l_z^mp is useful for
application to outcome measures despite moderate model
misfit due to multidimensionality As data of other psy-
chopathology measures also have been shown to be incon-
sistent with assumptions of IRT models the results of the
simulation study are valuable when considering application
of the l_z^mp statistic to data of outcome measures.
Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients
with different disorders. Also, more severely distressed
patients, for whom psychological intervention is most
needed, appear to be at the greatest risk of producing invalid
outcome scores Overall our results emphasize the impor-
tance of person misfit identification in outcome measure-
ment and demonstrate that PFA may be useful for
preventing incorrect decision making in clinical practice
due to aberrant responding
Appendix A
Statistic l_z^p

Suppose the data are polytomous scores on J items (items
are indexed j; j = 1, …, J) with M + 1 ordered answer categories.
Let the score on item j be denoted by X_j, with
possible realizations x_j = 0, …, M. The latent trait is
denoted by θ, and P(X_j = x_j | θ) is the probability of a
score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, …, M), and 0
otherwise. The unstandardized log-likelihood of an item-score
pattern x of person i is given by

l_p(x_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m | θ_i).  (A1)

The standardized log-likelihood is defined as

l_z^p(x_i) = [l_p(x_i) − E(l_p(x_i))] / [VAR(l_p(x_i))]^{1/2},  (A2)

where E(l_p) is the expected value and VAR(l_p) is the
variance of l_p.
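Equations A1 and A2 translate directly into code. The sketch below assumes the category probabilities P(X_j = m | θ_i) have already been obtained from a fitted IRT model and are supplied as a J × (M + 1) matrix; the function name and its inputs are illustrative, not taken from the article.

```python
import numpy as np

def lz_polytomous(x, probs):
    """Standardized log-likelihood person-fit statistic (l_z^p).

    x     : length-J sequence of observed item scores (0..M)
    probs : J x (M+1) array with probs[j, m] = P(X_j = m | theta_i),
            assumed to come from an already-fitted IRT model
    """
    probs = np.asarray(probs, dtype=float)
    J = probs.shape[0]
    logp = np.log(probs)

    # Equation A1: log-likelihood of the observed item-score pattern
    l_p = logp[np.arange(J), x].sum()

    # Expectation and variance of l_p under the model (local independence)
    e_lp = (probs * logp).sum()
    var_lp = ((probs * logp**2).sum(axis=1)
              - (probs * logp).sum(axis=1)**2).sum()

    # Equation A2: large negative values signal person misfit
    return (l_p - e_lp) / np.sqrt(var_lp)
```

A pattern of modal (model-consistent) responses yields a positive value, whereas a pattern of low-probability responses yields a large negative value.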
Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

e_ij = X_ij − E(X_ij),  (A3)

where E(X_ij) = \sum_{m=0}^{M} m P(X_ij = m | θ_i) is the
expected value of X_ij. The residual e_ij has a mean
of 0 and variance equal to

VAR(e_ij) = E(X_ij^2) − [E(X_ij)]^2.  (A4)

The standardized residual is given by

z_e_ij = e_ij / [VAR(e_ij)]^{1/2}.  (A5)

To compute z_e_ij, latent trait θ_i needs to be replaced by its
estimated value. This may bias the standardization of e_ij.
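A companion sketch for Equations A3 to A5, under the same assumption that model-implied category probabilities at the estimated θ are available (the function name and inputs are illustrative):

```python
import numpy as np

def standardized_residuals(x, probs):
    """Standardized item-score residuals (Equations A3-A5).

    x     : length-J sequence of observed item scores (0..M)
    probs : J x (M+1) array of model-implied category probabilities
            evaluated at the estimated theta (assumed input)
    """
    probs = np.asarray(probs, dtype=float)
    m = np.arange(probs.shape[1])

    e_x = probs @ m           # E(X_ij): expected item score (Equation A3)
    e_x2 = probs @ m**2       # E(X_ij^2)
    var = e_x2 - e_x**2       # Equation A4

    return (np.asarray(x) - e_x) / np.sqrt(var)   # Equation A5
```

Large positive residuals indicate unexpectedly high item scores; large negative residuals indicate unexpectedly low ones.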
Appendix B
For each OQ-45 subscale we estimated an exploratory mul-
tidimensional IRT (MIRT Reckase 2009) model Based on
results of exploratory factor analyses conducted in Mplus
(Muthén & Muthén, 2007), we used a MIRT model with
two factors for both the SR and IR subscales and three factors
for the SD subscale. The MIRT model is defined as
logit[p(θ)] = 1.702(α′θ + δ). Vector θ denotes the latent
traits and has a multivariate standard normal distribution,
where the θ correlations are estimated along with the item
parameters of the MIRT model. Higher α values indicate
better discriminating items, and higher δ values indicate
more popular answer categories.
We used the MIRT parameter estimates (Table B1) for
generating replicated OQ-45 data sets. In each replication
we included two types of misfitting item-score patterns,
due to either random error or acquiescence.
Following St-Onge, Valois, Abdous, and Germain (2011),
we simulated varying levels of random error (random scores
on 10, 20, or 30 items) and varying levels of acquiescence
(weak, moderate, and strong; based on Van Herk, Poortinga,
and Verhallen, 2004). The total percentage of misfitting patterns
in the data was 20% (Conijn et al 2013; Conijn et al
2014). Based on 100 replicated data sets, the average Type
I error rate and the average detection rates of l_z^mp and the
standardized residual were computed. For computing the
person-fit statistics, GRM parameters were estimated using
MULTILOG 7 (Thissen et al 2003).
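The data-generating step can be sketched as follows. The graded-response probabilities use the cumulative-logit form with the 1.702 scaling constant described above, and 20% of simulated persons are replaced by random responders; the item parameters in the example are placeholders rather than the Table B1 estimates, and the acquiescence condition is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_probs(theta, a, deltas):
    """Category probabilities for one item under a graded response model:
    cumulative P(X >= m) = logistic(1.702 * (a * theta + delta_m))."""
    deltas = np.asarray(deltas, dtype=float)   # must be decreasing in m
    cum = 1 / (1 + np.exp(-1.702 * (a * theta + deltas)))
    cum = np.concatenate(([1.0], cum, [0.0]))  # P(X >= 0) = 1; P(X > M) = 0
    return cum[:-1] - cum[1:]                  # P(X = m) for m = 0..M

def simulate(n, a, deltas, prop_random=0.20):
    """Simulate n response patterns; a proportion responds purely at random."""
    J = len(a)
    n_cat = len(deltas[0]) + 1
    theta = rng.standard_normal(n)             # standard-normal latent trait
    data = np.empty((n, J), dtype=int)
    for i in range(n):
        for j in range(J):
            data[i, j] = rng.choice(n_cat, p=grm_probs(theta[i], a[j], deltas[j]))
    k = int(prop_random * n)                   # inject random-error misfit
    data[:k] = rng.integers(0, n_cat, size=(k, J))
    return data
```

Person-fit statistics computed on such data can then be checked against the known misfitting rows to estimate detection rates and Type I error.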
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                 SR subscale                IR subscale                SD subscale
Item parameter   M (SD)        Range        M (SD)        Range        M (SD)        Range
Discrimination
  α_θ1           0.76 (0.47)   0.23, 1.46   0.84 (0.37)   0.36, 1.65   0.90 (0.26)   0.52, 1.32
  α_θ2           0.26 (0.50)   −0.09, 1.31  0.02 (0.37)   −0.48, 0.73  0.03 (0.40)   −0.74, 0.54
  α_θ3           —             —            —             —            0.02 (0.27)   −0.56, 0.53
Threshold
  δ_0            —             —            —             —            —             —
  δ_1            0.98 (0.70)   −0.25, 1.84  1.12 (0.66)   0.14, 2.21   1.76 (0.84)   −0.11, 3.08
  δ_2            0.10 (0.60)   −0.88, 0.80  0.07 (0.52)   −0.98, 0.95  0.77 (0.65)   −0.70, 1.75
  δ_3            −0.82 (0.53)  −1.57, −0.19 −1.11 (0.67)  −2.40, −0.18 −0.40 (0.53)  −1.67, 0.62
  δ_4            −1.84 (0.50)  −2.60, −1.26 −2.28 (0.87)  −3.62, −1.09 −1.74 (0.53)  −2.87, −0.70
Latent-trait correlations
  r_θ1θ2         .20                        .50                        −.42
  r_θ1θ3         —                          —                          .53
  r_θ2θ3         —                          —                          −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the
positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average
percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, percentages were 20% and 45%; and for
strong acquiescence, percentages were 12% and 88%.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect
to the research authorship andor publication of this article
Funding
The author(s) disclosed receipt of the following financial supportfor the research authorship andor publication of this article
This research was supported by a grant from the Netherlands
Organization of Scientific Research (NWO 400-06-087 first
author)
References
Atre-Vaidya N Taylor M A Seidenberg M Reed R
Perrine A amp Glick-Oberwise F (1998) Cognitive deficits
psychopathology and psychosocial functioning in bipolar
mood disorder Neuropsychiatry Neuropsychology and
Behavioral Neurology 11 120-126
Baker F B amp Kim S-H (2004) Item response theory
Parameter estimation techniques (2nd ed) New York NY
Marcel Dekker
Bickman L amp Salzer M S (1997) Introduction Measuring
quality in mental health services Evaluation Review 21 285-291
Bludworth J L Tracey T J G amp Glidden-Tracey C (2010)
The bilevel structure of the Outcome Questionnairendash45
Psychological Assessment 22 350-355
Burlingame G M Thayer S D Lee J A Nelson P L amp
Lambert M J (2007) Administration amp scoring manual for
the Severe Outcome Questionnaire (SOQ) Salt Lake City
UT OQ Measures
Butcher J N Graham J R Ben-Porath Y S Tellegen
A Dahlstrom W G amp Kaemmer B (2001) MMPI-2
(Minnesota Multiphasic Personality Inventory-2) Manual
for administration and scoring (Rev ed) Minneapolis
University of Minnesota Press
Cella D Yount S Rothrock N Gershon R Cook K amp
Reeve B on behalf of the PROMIS Cooperative Group
(2007) The Patient Reported Outcomes Measurement
Information System (PROMIS) Progress of an NIH Roadmap
Cooperative Group during its first two years Medical Care45 S3-S11
Cheung G W amp Rensvold R B (2000) Assessing extreme and
acquiescence response sets in cross-cultural research using
structural equations modeling Journal of Cross-Cultural
Psychology 31 187-212
Conijn J M Emons W H M amp Sijtsma K (2014) Statistic l z
based person-fit methods for non-cognitive multiscale mea-
sures Applied Psychological Measurement 38 122-136
Conijn J M Emons W H M Van Assen M A L M Pedersen
S S amp Sijtsma K (2013) Explanatory multilevel person-fit
analysis of response consistency on the Spielberger State-Trait
Anxiety Inventory Multivariate Behavioral Research 48
692-718
Conrad K J Bezruczko N Chan Y F Riley B DiamondG amp Dennis M L (2010) Screening for atypical suicide
risk with person fit statistics among people presenting to alco-
hol and other drug treatment Drug and Alcohol Dependence
106 92-100
Cuijpers P Li J Hofmann S G amp Andersson G (2010) Self-
reported versus clinician-rated symptoms of depression as
outcome measures in psychotherapy research on depression
A meta-analysis Clinical Psychology Review 30 768-778
De Jong K amp Nugter A (2004) De Outcome Questionnaire
Psychometrische kenmerken van de Nederlandse vertal-
ing [The Outcome Questionnaire Psychometric properties
of the Dutch translation] Nederlands Tijdschrift voor de
Psychologie 59 76-79
De Jong K Nugter M A Polak M G Wagenborg J E
A Spinhoven P amp Heiser W J (2007) The Outcome
Questionnaire-45 in a Dutch population A cross cultural vali-
dation Clinical Psychology amp Psychotherapy 14 288-301
De la Torre J amp Deng W (2008) Improving person-fit
assessment by correcting the ability estimate and its refer-
ence distribution Journal of Educational Measurement 45
159-177
Derogatis L R (1993) BSI Brief Symptom Inventory
Administration scoring and procedures manual (4th ed)
Minneapolis MN National Computer Systems
Drasgow F Levine M V amp McLaughlin M E (1987)
Detecting inappropriate test scores with optimal and practical
appropriateness indices Applied Psychological Measurement 11 59-79
Drasgow F Levine M V amp McLaughlin M E (1991)
Appropriateness measurement for some multidimensional test
batteries Applied Psychological Measurement 15 171-191
Drasgow F Levine M V Tsien S Williams B amp Mead A
D (1995) Fitting polytomous item response theory models
to multiple-choice tests Applied Psychological Measurement
19 143-165
Drasgow F Levine M V amp Williams E A (1985) Appropriateness
measurement with polychotomous item response models and
standardized indices British Journal of Mathematical and
Statistical Psychology 38 67-86
Doucette A amp Wolf A W (2009) Questioning the measure-
ment precision of psychotherapy research Psychotherapy
Research 19 374-389
Egberink I J L (2010) The use of different types of validity
indicators in personality assessment (Doctoral dissertation)
University of Groningen Netherlands httpirsubrugnl ppn32993466X
Embretson S E amp Reise S P (2000) Item response theory for
psychologists Mahwah NJ Erlbaum
Emons W H M (2008) Nonparametric person-fit analysis of
polytomous item scores Applied Psychological Measurement
32 224-247
Evans C Connell J Barkham M Margison F Mellor-Clark
J McGrath G amp Audin K (2002) Towards a standardised
brief outcome measure Psychometric properties and utility
of the CORE-OM British Journal of Psychiatry 180 51-60
Ferrando P J (2010) Some statistics for assessing person-fit
based on continuous-response models Applied Psychological
Measurement 34 219-237
Ferrando P J (2012) Assessing inconsistent responding in E and N measures An application of person-fit analysis in personal-
ity Personality and Individual Differences 52 718-722
Ferrando P J amp Chico E (2001) Detecting dissimulation in
personality test scores A comparison between person-fit
indices and detection scales Educational and Psychological
Measurement 61 997-1012
Forero C G amp Maydeu-Olivares A (2009) Estimation of IRT
graded response models Limited versus full information
methods Psychological Methods 14 275-299
Goldberg L R Johnson J A Eber H W Hogan R
Ashton M C Cloninger C R amp Gough H C (2006)
The International Personality Item Pool and the future of
public-domain personality measures Journal of Research in
Personality 40 84-96
Handel R W Ben-Porath Y S Tellegen A amp Archer R
P (2010) Psychometric functioning of the MMPI-2-RF
VRIN-r and TRIN-r scales with varying degrees of random-
ness acquiescence and counter-acquiescence Psychological
Assessment 22 87-95
Holloway F (2002) Outcome measurement in mental healthmdash
Welcome to the revolution British Journal of Psychiatry
181 1-2
Karabatsos G (2003) Comparing the aberrant response detec-
tion performance of thirty-six person-fit statistics Applied
Measurement in Education 16 277-298
Kenny D A Kaniskan B amp McCoach D B (2014) The per-
formance of RMSEA in models with small degrees of free-dom Sociological Methods and Research Advance online
publication doi1011770049124114543236
Kim S Beretvas S N amp Sherry A R (2010) A validation
of the factor structure of OQ-45 scores using factor mixture
modeling Measurement and Evaluation in Counseling and
Development 42 275-295
Kroenke K Spitzer R L amp Williams J B (2002) The PHQ-
15 Validity of a new measure for evaluating the severity
of somatic symptoms Psychosomatic Medicine 64
258-266
Lambert M J amp Hawkins E J (2004) Measuring outcome in
professional practice Considerations in selecting and utiliz-
ing brief outcome instruments Professional Psychology
Research and Practice 35 492-499
Lambert M J Morton J J Hatfield D Harmon C Hamilton
S Reid R C Burlingame G M (2004) Administration
and scoring manual for the OQ-452 (Outcome Questionnaire)
(3rd ed) Wilmington DE American Professional Credential Services
Lambert M J amp Shimokawa K (2011) Collecting client feed-
back Psychotherapy 48 72-79
MacCallum R C Browne M W amp Sugawara H M (1996)
Power analysis and determination of sample size for covari-
ance structure modeling Psychological Methods 1 130-149
Meijer R R amp Sijtsma K (2001) Methodology review
Evaluating person fit Applied Psychological Measurement
25 107-135
Mueller R M Lambert M J amp Burlingame G M (1998)
Construct validity of the outcome questionnaire A confirma-
tory factor analysis Journal of Personality Assessment 70
248-262
Mutheacuten B O amp Mutheacuten L K (2007) Mplus Statistical analy-sis with latent variables (Version 50) Los Angeles CA
Statmodel
Mutheacuten L K amp Mutheacuten B O (2009) Mplus Short Course vid-
eos and handouts Retrieved from httpwwwstatmodelcom
downloadTopic201-v11pdf
Nering M L (1995) The distribution of person fit using true
and estimated person parameters Applied Psychological
Measurement 19 121-129
Piedmont R L McCrae R R Riemann R amp Angleitner
A (2000) On the invalidity of validity scales Evidence
from self-reports and observer ratings in volunteer samples
Journal of Personality and Social Psychology 78 582-593
Pirkis J E Burgess P M Kirk P K Dodson S Coombs
T J amp Williamson M K (2005) A review of the psycho-
metric properties of the Health of the Nation Outcome Scales
(HoNOS) family of measures Health and Quality of Life
Outcomes 3 76-87
Pitts S C West S G amp Tein J (1996) Longitudinal measure-
ment models in evaluation research Examining stability and
change Evaluation and Program Planning 19 333-350
Reckase M D (2009) Multidimensional item response theory
New York NY Springer
Reise S P amp Waller N G (1993) Traitedness and the assess-
ment of response pattern scalability Journal of Personality
and Social Psychology 65 143-151
Reise S P amp Waller N G (2009) Item response theory and
clinical measurement Annual Review of Clinical Psychology
5 27-48
Rost J (1990) Rasch models in latent classes An integration
of two approaches to item analysis Applied Psychological
Measurement 3 271-282
Samejima F (1997) Graded response model In W J van der
Linden amp R Hambleton (Eds) Handbook of modern itemresponse theory (pp 85-100) New York NY Springer
Schmitt N Chan D Sacco J M McFarland L A amp Jennings
D (1999) Correlates of person-fit and effect of person-fit
on test validity Applied Psychological Measurement 23
41-53
Spielberger C D Gorsuch R L Lushene R Vagg P R amp
Jacobs G A (1983) Manual for the State-Trait Anxiety
Inventory (Form Y) Palo Alto CA Consulting Psychologists
Press
St-Onge C Valois P Abdous B amp Germain S (2011) Person-
fit statisticsrsquo accuracy A Monte Carlo study of the aberrance
ratersquos influence Applied Psychological Measurement 35
419-432
Tellegen A (1988) The analysis of consistency in personalityassessment Journal of Personality 56 621-663
Thissen D Chen W H amp Bock R D (2003) MULTILOG for
Windows (Version 7) Lincolnwood IL Scientific Software
International
Thissen D Steinberg L amp Wainer H (1993) Detection of dif-
ferential functioning using the parameters of item response
models In P W Holland amp H Wainer (Eds) Differential
Item Functioning (pp 67-113) Hillsdale NJ Lawrence
Erlbaum
Van Herk H Poortinga Y H amp Verhallen T M M (2004)
Response styles in rating scales Evidence of method bias
in data from six EU countries Journal of Cross-Cultural
Psychology 35 346-360
Wing J K Beevor A S Curtis R H Park B G Hadden S
amp Burns H (1998) Health of the Nation Outcome Scales
(HoNOS) Research and development British Journal of
Psychiatry 172 11-18
Wood J M Garb H N Lilienfeld S O amp Nezworski M T
(2002) Clinical assessment Annual Review of Psychology
53 519-543
Woods C M Oltmanns T F amp Turkheimer E (2008)
Detection of aberrant responding on a personality scale in a
military sample An application of evaluating person fit with
two-level logistic regression Psychological Assessment 20
159-168
Aberrant responding provides clinicians with invalid
information and as a result adversely affects the quality of
treatment and diagnosis decisions (Conrad et al 2010
Handel Ben-Porath Tellegen amp Archer 2010) Person-fit
analysis (PFA) involves statistical methods to detect aber-
rant item-score patterns that are due to aberrant responding
Conrad et al (2010) provided an example of the potential of
PFA for mental health care by using PFA to screen for atypi-
cal symptom profiles among persons at intake for drug or
alcohol dependence treatment They found that the persons
with aberrant item-score patterns required different treat-
ments than persons with model-consistent item-score pat-
terns and concluded that PFA may detect inconsistencies that
have important implications for treatment and diagnosis
decisions As self-report outcome measures are increasingly
used to make treatment decisions in clinical practice PFA
may be a valuable screening tool in outcome measurement
The importance of detecting aberrant responding has
long been recognized. Both the original and current
versions of the Minnesota Multiphasic Personality Inventory (Butcher et al 2001; Handel et al 2010) include scales to
detect different types of aberrant responding Examples are
lie scales to detect faking good or faking bad and indices
based on the consistency of the responses to items either
highly similar or opposite with respect to content such as
the Variable Response Inconsistency (VRIN) scale to detect
random responding and the True Response Inconsistency
(TRIN) scale to detect acquiescence Despite validity
scalesrsquo importance outcome questionnaires typically do not
include specialized scales for detecting aberrant responding
(Lambert & Hawkins, 2004). One possible explanation is
that with the increasing demand for cost-effectiveness, time
for assessment has been reduced greatly (Wood, Garb,
Lilienfeld, & Nezworski, 2002). Consequently, outcome
questionnaires are required to be short and efficient, which
limits the use of validity scales consisting of additional
items (e.g., lie scales) and limits construction of TRIN and
VRIN scales because fewer item pairs with similar or opposite
content are available.
Person-Fit Analysis in Outcome Measurement
In this study we used PFA to investigate the prevalence and
possible causes of aberrant responding in outcome measurement
by means of the OQ-45. In PFA, person-fit statistics signal whether an individual's item-score pattern is
inconsistent with the item-score pattern expected under the
particular measurement model (Meijer amp Sijtsma 2001) A
significant discrepancy between the observed item-score
pattern and the expected item-score pattern provides evi-
dence of person misfit Person misfit means that the indi-
vidualrsquos test score is unlikely to be meaningful in terms of
the trait being measured. For noncognitive data, the l_z person-fit
statistic (Drasgow, Levine, & McLaughlin, 1987) is
one of the best performing and most popular person-fit
statistics (Emons 2008; Ferrando 2012; Karabatsos 2003).
To determine whether an item-score pattern shows signifi-
cant misfit statistic l z is compared with a cutoff value
obtained under the item response theory (IRT Embretson amp
Reise 2000) model that serves as the null model of consis-
tency (De la Torre amp Deng 2008 Nering 1995) Statistic l z
detects various types of aberrant responding such as acqui-
escence and extreme response style but the statistic is most
powerful for detecting random responding (Emons 2008)
In detecting random responding to 57 items measuring the
Big Five personality factors PFA has been found to outper-
form an inconsistency index based on the rationale of the
Minnesota Multiphasic Personality Inventory VRIN scale
(Egberink 2010 pp 94-100)
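This cutoff-based decision can be illustrated with a small sketch: a dichotomous version of l_z (the polytomous case works analogously) computed from model-implied probabilities under an already-fitted IRT model, flagged when the statistic falls below the one-sided 5% standard-normal critical value. The item probabilities below are placeholder values, not OQ-45 estimates.

```python
import numpy as np

def lz_dichotomous(x, p):
    """l_z for 0/1 item scores given model probabilities p = P(X_j = 1 | theta)."""
    x, p = np.asarray(x, float), np.asarray(p, float)
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))    # observed log-likelihood
    e = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))     # its expectation
    v = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)      # its variance
    return (l0 - e) / np.sqrt(v)

CUTOFF = -1.645  # one-sided 5% critical value under the standard-normal null

def flag_misfit(x, p):
    """True when the item-score pattern is significantly inconsistent."""
    return lz_dichotomous(x, p) < CUTOFF
```

A pattern that agrees with the model (endorsing the high-probability items) is not flagged, whereas a reversed, inconsistent pattern is.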
An advantage of statistic l z and other person-fit statistics
for application to outcome measurement is that they can be
used to detect invalid test scores on any self-report scale
that is consistent with an IRT model. Also, the rise of computerized
and IRT-based outcome monitoring (e.g., Patient Reported Outcomes Measurement Information System;
Cella et al 2007) renders the implementation of PFA feasible.
Along with the computer-generated test score, a person-fit
value may be provided to the clinician, serving as an
alarm bell that warns that the test score may be invalid
and further inquiry may be useful.
Follow-up PFA of item-score patterns flagged as misfitting can help the clinician to infer possible explanations for an individual's observed aberrant responding. In personality measurement, Ferrando (2010) used item-score residuals for follow-up PFA and found that a person who had an aberrant item-score pattern on an extraversion scale showed unexpectedly low scores on items referring to situations where the person could make a fool of himself. This result suggested that the aberrant responding was due to fear of being rejected. For another person, follow-up PFA suggested inattentiveness to reversed item wording. In outcome measurement for individual patients, follow-up PFA can inform the clinician about the sources of the misfit, and clinicians can discuss the unexpected item scores with the patients to obtain a better understanding of the patient's psychological profile.
PFA primarily focuses on individuals but can also be used to explain individual differences in aberrant responding at the group level; for examples, see Conijn, Emons, and Sijtsma (2014) and Conijn et al. (2013). In outcome measurement, PFA can be used to investigate the extent to which general measures are suited for assessing patients suffering from different disorders. General outcome measures, such as the OQ-45 and the CORE-OM, use items that assess the most common symptoms of psychopathology, such as those observed in depression and anxiety disorders (Lambert & Hawkins, 2004), and are also used to assess patients suffering from different specific disorders,
varying from somatoform disorders to psychotic disorders and addiction. For rare or specific disorders, many of the general measures' items are irrelevant, and low traitedness may lead to inconsistent or unmotivated completion of outcome measures.
Goal of the Study

We investigated the prevalence and the causes of aberrant responding to the OQ-45 (Lambert et al., 2004), one of the most popular general outcome measures used in mental health care. We used OQ-45 data from a large Dutch clinical outpatient sample suffering from a large variety of disorders. The l_z person-fit statistic (Drasgow et al., 1987) was used to identify misfitting item-score patterns, and standardized item-score residuals (Ferrando, 2010, 2012) were used to investigate sources of item-score pattern misfit. We employed logistic regression analyses, using the l_z statistic as the dependent variable, to investigate whether patients suffering from specific disorders (e.g., somatoform disorders and psychotic disorders) and severely distressed patients are more predisposed to produce aberrant item-score patterns on the OQ-45 than other patients. Based on the results for the OQ-45, we discuss the possible causes of aberrant responding in outcome measurement in general and the potential of PFA for improving outcome-measurement practice.
Method
Participants
We performed a secondary analysis on data collected in routine mental health care. Participants were 2,906 clinical outpatients (42.1% male) from a mental health care institution with four different locations situated in Noord-Holland, a predominantly rural province in the Netherlands. Participants' age ranged from 17 to 77 years (M = 37, SD = 13). Apart from gender and age, no other demographic information was collected.
Most patients completed the OQ-45 at intake, but 160 (5.5%) patients completed the OQ-45 after treatment started. The sample included 2,632 (91%) patients with a clinician-rated Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) primary diagnosis on Axis I, 192 (7%) persons with a primary diagnosis on Axis II, and 82 (3%) patients for whom the primary diagnosis was missing. The most frequent primary diagnoses were depression (38%), anxiety disorders (20%), disorders usually first diagnosed in infancy, childhood, or adolescence (8%), personality disorders (7%), adjustment disorders (6%), somatoform disorders (3%), eating disorders (2%), and substance-related disorders (2%). Of the diagnosed patients, 13.1% had comorbidity between Axis I and Axis II, 32.0% had comorbidity within Axis I, and 0.1% had comorbidity within Axis II. The clinician had access to the OQ-45 data, but since the OQ-45 is not a diagnostic instrument, it was unlikely that diagnosis was based on the OQ-45 results.
Measures
The Outcome Questionnaire–45. The OQ-45 (Lambert et al., 2004) uses three subscales to measure symptom severity and daily functioning. The Social Role Performance (SR; 9 items, of which 3 are reversely worded) subscale measures dissatisfaction, distress, and conflicts concerning one's employment, education, or leisure pursuits. An example item is "I feel stressed at work/school." The Interpersonal Relations (IR; 11 items, of which 4 are reversely worded) subscale measures difficulty with family, friends, and marital relationship. An example item is "I feel lonely." The Symptom Distress (SD; 25 items, of which 3 are reversely worded) subscale measures symptoms of the most frequently diagnosed mental disorders, in particular anxiety and depression. An example item is "I feel no interest in things." Respondents are instructed to express their feelings with respect to the past week on a 5-point rating scale, with scores ranging from 0 (never) through 4 (almost always), higher scores indicating more psychological distress.
In this study, we used the Dutch OQ-45 (De Jong & Nugter, 2004). The Dutch OQ-45 total score has good concurrent and criterion-related validity (De Jong et al., 2007). In our sample, coefficient alpha for the subscale total scores equaled .65 (SR), .77 (IR), and .91 (SD). Results concerning the OQ-45 factor structure are ambiguous. Some studies provided support for the theoretical three-factor model for both the original OQ-45 (Bludworth, Tracey, & Glidden-Tracey, 2010) and the Dutch OQ-45 (De Jong et al., 2007). Other studies found poor fit of the theoretical three-factor model and suggested that a three-factor model based on a reduced item set showed better fit (Kim, Beretvas, & Sherry, 2010) or that a one-factor model was preferable (Mueller, Lambert, & Burlingame, 1998). In this study, we further investigated the fit of the theoretical three-factor model.
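Coefficient alpha, as reported above for the subscale total scores, can be computed from a respondents-by-items score matrix. The sketch below is a generic illustration with simulated data, not the study's data.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_var = scores.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_var / total_var)

# Illustration: items sharing a common factor yield high alpha;
# pure-noise items yield alpha near zero (simulated, hypothetical data).
rng = np.random.default_rng(0)
trait = rng.normal(size=(500, 1))
related = trait + rng.normal(size=(500, 6))  # 6 roughly parallel items
noise = rng.normal(size=(500, 6))            # 6 unrelated items
```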
Explanatory Variables for Person Misfit
Severity of distress. The OQ-45 total score and the clinician-rated DSM-IV Global Assessment of Functioning (GAF) code were taken as measures of the patient's severity of distress. The GAF code ranges from 1 to 100, with higher values indicating better psychological, social, and occupational functioning. The GAF code was missing for 187 (6%) patients.
Diagnosis category. The clinician-rated DSM-IV diagnosis was classified into nine categories representing the most common types of disorders present in the sample. Table 1 describes the diagnosis categories and the
number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that for patients suffering from these symptoms misfit was less likely than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously into one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).
Statistical Analysis
Model-Fit Evaluation

We conducted PFA based on the graded response model (GRM; Samejima, 1997), an IRT model for polytomous items. The core of the GRM is formed by the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
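To make the ISRF definition concrete: under the GRM, the probability of scoring in a specific category is the difference between two adjacent ISRFs. The sketch below illustrates this; the discrimination and step parameters are hypothetical, not OQ-45 estimates.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category response probabilities for one GRM item.
    a: discrimination; b: increasing step parameters, one per boundary
    between adjacent categories (hypothetical values in the test below).
    Returns P(X = 0), ..., P(X = m)."""
    b = np.asarray(b, dtype=float)
    # ISRFs: P(X >= k | theta) for k = 1..m, logistic in theta
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P(X >= 0) = 1 and P(X >= m+1) = 0, then difference
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]
```

For a 5-point item (as on the OQ-45), four step parameters yield five category probabilities that sum to 1.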
Satisfactory GRM fit to the data is a prerequisite for application of GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale, we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR; Muthén & Muthén, 2007); RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust with respect to the identified OQ-45 model misfit (Conijn et al., 2014).
Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis

Category | Common DSM-IV diagnoses included | n | Mean l_zm^p | Detected (n) | Detected (%)
Mood and anxiety disorder^a | Depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder | 1,786 | 0.28 | 229 | 12.8
Somatoform disorder | Pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder | 82 | 0.16 | 16 | 19.5
Attention deficit hyperactivity disorder (ADHD) | Predominantly inattentive, combined hyperactive-impulsive and inattentive | 198 | 0.08 | 15 | 7.6
Psychotic disorder | Schizophrenia, psychotic disorder not otherwise specified | 26 | −0.10 | 7 | 26.9
Borderline personality disorder | Borderline personality disorder | 53 | 0.35 | 2 | 3.8
Impulse-control disorders not elsewhere classified | Impulse-control disorder, intermittent explosive disorder | 58 | 0.02 | 10 | 17.2
Eating disorder | Eating disorder not otherwise specified, bulimia nervosa | 67 | 0.38 | 4 | 6.0
Substance-related disorder | Cannabis-related disorders, alcohol-related disorders | 58 | 0.09 | 13 | 22.4
Social and relational problem | Phase of life problem, partner relational problem, identity problem | 186 | 0.26 | 20 | 10.8

a. Including 65 patients with a mood disorder.
Person-Fit Analysis
Detection of misfit. We used statistic l_z for polytomous item scores, denoted l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics. Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reversely worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they still may be suspicious, as they may reflect gross under- or overreporting of symptoms.
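The polytomous statistic follows directly from this definition: sum the log-probabilities of the observed category scores under the GRM, then standardize by the model-implied mean and standard deviation of that sum. The sketch below uses hypothetical item parameters; the study's exact equations are in its Appendix A.

```python
import numpy as np

def grm_probs(theta, a, b_steps):
    """GRM category probabilities for one item (hypothetical parameters)."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b_steps, float))))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]

def lz_poly(scores, theta, a_list, b_list):
    """Polytomous person-fit statistic l_z^p: standardized log-likelihood
    of the item-score pattern under the GRM. Larger negative values
    indicate stronger person misfit."""
    l0 = e = v = 0.0
    for x, a, b in zip(scores, a_list, b_list):
        p = grm_probs(theta, a, b)
        lp = np.log(p)
        l0 += lp[x]                                   # observed term
        e += np.sum(p * lp)                           # its expectation
        v += np.sum(p * lp ** 2) - np.sum(p * lp) ** 2  # its variance
    return (l0 - e) / np.sqrt(v)
```

For example, with 10 identical hypothetical items, a pattern of modal-category scores yields a positive l_z^p, whereas a pattern alternating between the extreme categories 0 and 4 yields a strongly negative value.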
Because the GRM is a model for unidimensional datawe computed statistic l z
p for each OQ-45 subscale sepa-
rately To categorize persons as fitting or misfitting with
respect to the complete OQ-45 we used the multiscale per-
son-fit statistic l zm p
(Conijn et al 2014 Drasgow Levine
amp McLaughlin 1991) which equals the standardized sum
of the subscale l z p values across all subscales
Under the null model of fit to the IRT model, and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each item-score pattern, we re-estimated θ and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p, and the percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++; algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
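A minimal version of this bootstrap can be sketched as follows. It substitutes a crude grid-search θ estimate for the Baker and Kim (2004) algorithms, uses far fewer replications than the study's 5,000, and uses hypothetical item parameters throughout.

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_probs(theta, a, b):
    """GRM category probabilities for one item (hypothetical parameters)."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, float))))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]

def lz_poly(scores, theta, items):
    """Standardized log-likelihood person-fit statistic l_z^p."""
    l0 = e = v = 0.0
    for x, (a, b) in zip(scores, items):
        p = grm_probs(theta, a, b)
        lp = np.log(p)
        l0 += lp[x]
        e += np.sum(p * lp)
        v += np.sum(p * lp ** 2) - np.sum(p * lp) ** 2
    return (l0 - e) / np.sqrt(v)

def ml_theta(scores, items, grid=np.linspace(-4, 4, 41)):
    """Crude grid-search maximum-likelihood estimate of theta."""
    ll = [sum(np.log(grm_probs(t, a, b)[x]) for x, (a, b) in zip(scores, items))
          for t in grid]
    return grid[int(np.argmax(ll))]

def bootstrap_p(scores, items, n_rep=200):
    """Parametric bootstrap p value (De la Torre & Deng, 2008 rationale):
    simulate patterns at the estimated theta, re-estimate theta for each,
    and take the percentile rank of the observed statistic
    (one-tailed; low values indicate misfit)."""
    theta_hat = ml_theta(scores, items)
    obs = lz_poly(scores, theta_hat, items)
    null = []
    for _ in range(n_rep):
        sim = [rng.choice(len(b) + 1, p=grm_probs(theta_hat, a, b))
               for a, b in items]
        null.append(lz_poly(sim, ml_theta(sim, items), items))
    return obs, float(np.mean(np.asarray(null) <= obs))
```

With 10 hypothetical items, a pattern alternating between extreme categories yields a p value near 0, whereas a modal-category pattern does not.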
Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).
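One common formulation of such a residual divides the difference between the observed item score and its model-implied expectation by the model-implied standard deviation. Ferrando's (2010, 2012) exact definition may differ in detail, and the item parameters below are hypothetical.

```python
import numpy as np

def grm_probs(theta, a, b):
    """GRM category probabilities for one item (hypothetical parameters)."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, float))))
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]

def item_residual(x, theta, a, b):
    """Standardized item-score residual:
    (observed score - expected score) / model standard deviation."""
    p = grm_probs(theta, a, b)
    k = np.arange(len(p))
    mean = np.sum(k * p)                       # expected item score
    sd = np.sqrt(np.sum(p * (k - mean) ** 2))  # model-implied SD
    return (x - mean) / sd
```

A residual beyond ±1.96 flags an unexpected item score at α = .05; the ±1.64 cutoffs correspond to α = .10.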
Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
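The structure of this explanatory model can be illustrated with simulated data. The predictors, effect sizes, and sample below are invented for illustration and are unrelated to the study's estimates; the fit uses a plain Newton-Raphson implementation of logistic regression.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson (intercept added here).
    Returns [intercept, b1, b2, ...]."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = p * (1 - p)                       # IRLS weights
        hessian = Xd.T @ (Xd * w[:, None])
        beta = beta + np.linalg.solve(hessian, Xd.T @ (y - p))
    return beta

# Hypothetical data: misfit probability rises with distress and with a
# diagnosis dummy (numbers are made up, not the study's results).
rng = np.random.default_rng(7)
n = 2000
distress = rng.normal(size=n)           # stand-in for OQ-45 total score
diagnosis = rng.binomial(1, 0.05, n)    # stand-in diagnosis dummy
logit = -1.9 + 0.3 * distress + 1.1 * diagnosis
misfit = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
beta = fit_logistic(np.column_stack([distress, diagnosis]), misfit.astype(float))
```

Positive estimated coefficients indicate predictors associated with a higher probability of person misfit, mirroring the interpretation of Table 2.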
Results
First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which the l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.
Model-Fit Evaluation
Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measure substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.
For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate
Figure 1. Standardized residuals plotted by item number, per OQ-45 subscale (SR, IR, and SD), for fitting patient 663 (l_zm^p = 3.54; upper panel) and misfitting patient 2752 (l_zm^p = −7.96; lower panel).
(RMSEA = .13, SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were found only for some items of the SD and SR subscales, respectively. Thus, the EFA results suggested that multidimensionality, more than other model violations, caused the subscale data to show GRM misfit.
To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., for 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, with 88% of the responses in the most extreme category). We concluded that despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.
Detection of Misfit and Follow-Up Analyses
For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Given the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, showing that her item scores were consistent with the expected GRM item scores.
Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, the residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms; hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.
Explanatory Person-Fit Analysis
For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.
Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance-related disorder were significant. Patients with ADHD were unlikely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included the GAF code and the OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels on each of the subscales. Their misfit was due to several unexpectedly high scores indicating severe symptoms. In general, these patients did not have large
Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit)

Predictor | Model 1 | Model 2
Intercept | −1.84 (0.11) | −1.93 (0.11)
Gender | −0.12 (0.13) | −0.12 (0.13)
Measurement occasion | −0.17 (0.27) | −0.18 (0.27)
Diagnosis category^a:
  Somatoform | 0.57 (0.29) | 0.74 (0.29)
  ADHD | −0.58 (0.28) | −0.39 (0.28)
  Psychotic | 1.05 (0.46) | 1.13 (0.47)
  Borderline | −1.30 (0.72) | −1.39 (0.73)
  Impulse control | 0.35 (0.36) | 0.57 (0.36)
  Eating disorders | −1.10 (0.60) | −0.97 (0.60)
  Substance related | 0.66 (0.33) | 0.69 (0.33)
  Social/relational | −0.20 (0.26) | 0.08 (0.27)
GAF code | — | −0.17 (0.07)
OQ-45 total score | — | 0.26 (0.07)

Note. N = 2,434. Standard errors in parentheses. The correlation between the GAF code and the OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.
residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus these patients did not show similar person misfit.
Discussion
We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that such misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.
The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test-taking and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, as the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.
There are two possible general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder; second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for diagnosing severe mental illnesses such as psychotic disorders and bipolar disorder.
The general-misfit explanation implies that methods other than self-reports should be used for these patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.
Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not previously been evaluated by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.
Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that these statistics have low power to identify response styles and malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.
Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.
To conclude, our results have two main implications pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_z^mp is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data of other psychopathology measures have also been shown to be inconsistent with assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_z^mp statistic to data of outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.
Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j; j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern x_i of person i is given by

l_p(x_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m | θ_i).   (A1)

The standardized log-likelihood is defined as

l_z^p(x_i) = [l_p(x_i) − E(l_p(x_i))] / [VAR(l_p(x_i))]^{1/2},   (A2)

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.
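In code, Equations (A1) and (A2) reduce to a few array operations once the GRM category probabilities at the estimated θ are available. The following Python sketch is an illustration only (not the authors' implementation); the probability matrix `probs` is assumed to be supplied by a fitted GRM:

```python
import numpy as np

def lz_p(pattern, probs):
    """Standardized log-likelihood person-fit statistic (Equations A1-A2).

    pattern : length-J array of observed item scores (0..M).
    probs   : J x (M+1) array with probs[j, m] = P(X_j = m | theta),
              e.g. GRM category probabilities at the estimated theta.
    """
    probs = np.asarray(probs, dtype=float)
    logp = np.log(probs)
    # observed log-likelihood l_p (Equation A1)
    l_obs = logp[np.arange(probs.shape[0]), pattern].sum()
    # E(l_p) and VAR(l_p): sums of per-item moments of log P(X_j | theta)
    item_mean = (probs * logp).sum(axis=1)
    item_var = (probs * logp ** 2).sum(axis=1) - item_mean ** 2
    # standardized statistic (Equation A2); large negative values = misfit
    return (l_obs - item_mean.sum()) / np.sqrt(item_var.sum())
```

An improbable pattern yields a more negative value than a model-consistent one, which is exactly how misfit is flagged.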
Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

e_ij = X_ij − E(X_ij),   (A3)

where E(X_ij) is the expected value of X_ij, which equals E(X_ij) = \sum_{m=0}^{M} m P(X_ij = m | θ_i). The residual e_ij has a mean of 0 and variance equal to

VAR(e_ij) = E(X_ij^2) − [E(X_ij)]^2.   (A4)

The standardized residual is given by

ze_ij = e_ij / [VAR(e_ij)]^{1/2}.   (A5)

To compute ze_ij, latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_ij.
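A matching sketch for Equations (A3) to (A5), again assuming the GRM category probabilities for one item at the person's estimated θ are given:

```python
import numpy as np

def standardized_residual(x, probs):
    """Standardized item-score residual z_e (Equations A3-A5).

    x     : observed score on the item (0..M).
    probs : length-(M+1) array of category probabilities
            P(X = m | theta) at the person's estimated theta.
    """
    probs = np.asarray(probs, dtype=float)
    m = np.arange(probs.size)
    expected = (m * probs).sum()                        # E(X), below Equation A3
    variance = (m ** 2 * probs).sum() - expected ** 2   # VAR(e), Equation A4
    return (x - expected) / np.sqrt(variance)           # z_e, Equation A5
```

In the follow-up analyses described in the Method section, residuals outside ±1.96 would be flagged at α = .05.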
Appendix B

For each OQ-45 subscale we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as logit[p(θ)] = 1.702(αθ + δ). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_z^mp and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

Parameter | SR subscale: M (SD), Range | IR subscale: M (SD), Range | SD subscale: M (SD), Range
Discrimination
α_θ1 | 0.76 (0.47), 0.23 to 1.46 | 0.84 (0.37), 0.36 to 1.65 | 0.90 (0.26), 0.52 to 1.32
α_θ2 | 0.26 (0.50), −0.09 to 1.31 | 0.02 (0.37), −0.48 to 0.73 | 0.03 (0.40), −0.74 to 0.54
α_θ3 | — | — | 0.02 (0.27), −0.56 to 0.53
Threshold
δ_0 | — | — | —
δ_1 | 0.98 (0.70), −0.25 to 1.84 | 1.12 (0.66), 0.14 to 2.21 | 1.76 (0.84), −0.11 to 3.08
δ_2 | 0.10 (0.60), −0.88 to 0.80 | 0.07 (0.52), −0.98 to 0.95 | 0.77 (0.65), −0.70 to 1.75
δ_3 | −0.82 (0.53), −1.57 to −0.19 | −1.11 (0.67), −2.40 to −0.18 | −0.40 (0.53), −1.67 to 0.62
δ_4 | −1.84 (0.50), −2.60 to −1.26 | −2.28 (0.87), −3.62 to −1.09 | −1.74 (0.53), −2.87 to −0.70
Latent-trait correlations
r_θ1θ2 | .20 | .50 | −.42
r_θ1θ3 | — | — | .53
r_θ2θ3 | — | — | −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, percentages were 20% and 45%; and for strong acquiescence, percentages were 12% and 88%.
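To illustrate how such threshold shifts mimic acquiescence, the sketch below generates graded-response item scores from a unidimensional simplification of the model above. This is an illustration, not the authors' simulation code (which used multidimensional θ and the MULTILOG estimates); the α and δ values used in testing are the SR-subscale means from Table B1, and `shift` corresponds to the 1-, 2-, or 3-point δ shifts described in the Note.

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_category_probs(theta, alpha, deltas):
    """Category probabilities for one graded-response item.

    deltas are thresholds delta_1..delta_M in decreasing order
    (higher delta = more popular step), with
    logit[P(X >= m)] = 1.702 * (alpha * theta + delta_m).
    """
    deltas = np.asarray(deltas, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-1.702 * (alpha * theta + deltas)))
    cum = np.concatenate(([1.0], cum, [0.0]))  # P(X >= 0) = 1, P(X >= M+1) = 0
    return cum[:-1] - cum[1:]                  # P(X = m) for m = 0..M

def simulate_item(theta, alpha, deltas, shift=0.0):
    """Draw one item score; a positive `shift` raises every threshold,
    mimicking acquiescence (1-, 2-, or 3-point shifts for weak,
    moderate, and strong acquiescence, as in the Note above)."""
    p = grm_category_probs(theta, alpha, np.asarray(deltas) + shift)
    return int(rng.choice(len(p), p=p))
```

Raising the thresholds pushes probability mass toward the high answer categories, which is precisely the acquiescent signature the simulation injects.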
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organisation for Scientific Research (NWO 400-06-087, first author).
References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.

Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236

Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.

Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., . . . Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.

Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.

Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.

Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.

Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.

Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.

Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.

Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.

Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
varying from somatoform disorders to psychotic disorders and addiction. For rare or specific disorders, many of the general measures' items are irrelevant, and low traitedness may lead to inconsistent or unmotivated completion of outcome measures.
Goal of the Study

We investigated the prevalence and the causes of aberrant responding to the OQ-45 (Lambert et al., 2004). The OQ-45 is one of the most popular general outcome measures used in mental health care. We used OQ-45 data from a large Dutch clinical outpatient sample suffering from a large variety of disorders. The l_z person-fit statistic (Drasgow et al., 1987) was used to identify misfitting item-score patterns, and standardized item-score residuals (Ferrando, 2010, 2012) were used to investigate sources of item-score pattern misfit. We employed logistic regression analyses, using the l_z statistic as the dependent variable, to investigate whether patients suffering from specific disorders (e.g., somatoform disorders and psychotic disorders) and severely distressed patients are more predisposed to produce aberrant item-score patterns on the OQ-45 than other patients. Based on the results for the OQ-45, we discuss the possible causes of aberrant responding in outcome measurement in general and the potential of PFA for improving outcome-measurement practice.
Method

Participants

We performed a secondary analysis on data collected in routine mental health care. Participants were 2,906 clinical outpatients (42.1% male) from a mental health care institution with four different locations situated in Noord-Holland, a predominantly rural province in the Netherlands. Participants' age ranged from 17 to 77 years (M = 37, SD = 13). Apart from gender and age, no other demographic information was collected.
Most patients completed the OQ-45 at intake, but 160 (5.5%) patients completed the OQ-45 after treatment started. The sample included 2,632 (91%) patients with a clinician-rated Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) primary diagnosis on Axis I, 192 (7%) persons with a primary diagnosis on Axis II, and 82 (3%) patients for whom the primary diagnosis was missing. The most frequent primary diagnoses were depression (38%), anxiety disorders (20%), disorders usually first diagnosed in infancy, childhood, or adolescence (8%), personality disorders (7%), adjustment disorders (6%), somatoform disorders (3%), eating disorders (2%), and substance-related disorders (2%). Of the diagnosed patients, 13.1% had comorbidity between Axis I and Axis II, 32.0% had comorbidity within Axis I, and 0.1% had comorbidity within Axis II. The clinician had access to the OQ-45 data, but since the OQ-45 is not a diagnostic instrument, it was unlikely that diagnosis was based on the OQ-45 results.
Measures

The Outcome Questionnaire-45. The OQ-45 (Lambert et al., 2004) uses three subscales to measure symptom severity and daily functioning. The Social Role Performance (SR; 9 items, of which 3 are reversely worded) subscale measures dissatisfaction, distress, and conflicts concerning one's employment, education, or leisure pursuits. An example item is "I feel stressed at work/school." The Interpersonal Relations (IR; 11 items, of which 4 are reversely worded) subscale measures difficulty with family, friends, and marital relationship. An example item is "I feel lonely." The Symptom Distress (SD; 25 items, of which 3 are reversely worded) subscale measures symptoms of the most frequently diagnosed mental disorders, in particular anxiety and depression. An example item is "I feel no interest in things." Respondents are instructed to express their feelings with respect to the past week on a 5-point rating scale, with scores ranging from 0 (never) through 4 (almost always), higher scores indicating more psychological distress.
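Scoring therefore requires recoding the reverse-worded items so that higher scores consistently indicate more distress. A minimal sketch (the set of reversed item positions passed in is hypothetical; the actual items are specified in the OQ-45 scoring manual):

```python
def recode_reversed(scores, reversed_idx):
    """Recode reverse-worded items on the 0-4 OQ-45 scale so that higher
    scores always indicate more distress (recoded score = 4 - raw score).
    `reversed_idx` holds the positions of reverse-worded items; the real
    positions come from the OQ-45 scoring manual."""
    return [4 - s if j in reversed_idx else s for j, s in enumerate(scores)]
```

After recoding, subscale totals are simple sums of the item scores.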
In this study we used the Dutch OQ-45 (De Jong & Nugter, 2004). The Dutch OQ-45 total score has good concurrent and criterion-related validity (De Jong et al., 2007). In our sample, coefficient alpha for the subscale total scores equaled .65 (SR), .77 (IR), and .91 (SD). Results concerning the OQ-45 factor structure are ambiguous. Some studies provided support for the theoretical three-factor model for both the original OQ-45 (Bludworth, Tracey, & Glidden-Tracey, 2010) and the Dutch OQ-45 (De Jong et al., 2007). Other studies found poor fit of the theoretical three-factor model and suggested either a three-factor model based on a reduced item set (Kim, Beretvas, & Sherry, 2010) or a one-factor model (Mueller, Lambert, & Burlingame, 1998). In this study we further investigated the fit of the theoretical three-factor model.
Explanatory Variables for Person Misfit

Severity of distress. The OQ-45 total score and the clinician-rated DSM-IV Global Assessment of Functioning (GAF) code were taken as measures of the patient's severity of distress. The GAF code ranges from 1 to 100, with higher values indicating better psychological, social, and occupational functioning. The GAF code was missing for 187 (6%) patients.
Diagnosis category. The clinician-rated DSM-IV diagnosis was classified into nine categories representing the most common types of disorders present in the sample. Table 1 describes the diagnosis categories and the number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that for patients suffering from these symptoms misfit was less likely than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously in one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).
Statistical Analysis

Model-Fit Evaluation. We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. The core of the GRM are the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
Satisfactory GRM fit to the data is a prerequisite for application of GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean squared error of approximation (RMSEA) and the standardized root mean residual (SRMR; Muthén & Muthén, 2007). RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust with respect to the identified OQ-45 model misfit (Conijn et al., 2014).
Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis

Category | Common DSM-IV diagnoses included | n | Mean l_z^mp | Detected, n | Detected, %
Mood and anxiety disorder^a | Depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder | 1,786 | 0.28 | 229 | 12.8
Somatoform disorder | Pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder | 82 | 0.16 | 16 | 19.5
Attention deficit hyperactivity disorder (ADHD) | Predominantly inattentive, combined hyperactive-impulsive and inattentive | 198 | 0.08 | 15 | 7.6
Psychotic disorder | Schizophrenia, psychotic disorder not otherwise specified | 26 | −0.10 | 7 | 26.9
Borderline personality disorder | Borderline personality disorder | 53 | 0.35 | 2 | 3.8
Impulse-control disorders not elsewhere classified | Impulse-control disorder, intermittent explosive disorder | 58 | 0.02 | 10 | 17.2
Eating disorder | Eating disorder not otherwise specified, bulimia nervosa | 67 | 0.38 | 4 | 6.0
Substance-related disorder | Cannabis-related disorders, alcohol-related disorders | 58 | 0.09 | 13 | 22.4
Social and relational problem | Phase of life problem, partner relational problem, identity problem | 186 | 0.26 | 20 | 10.8

a. Including 65 patients with a mood disorder.
Person-Fit Analysis

Detection of misfit. We used statistic l_z for polytomous item scores, denoted by l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics.

Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reversely worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they may still be suspicious, as they may reflect gross under- or overreporting of symptoms.
Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_z^mp (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.
Under the null model of fit to the IRT model, and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_z^mp values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each item-score pattern, we again estimated θ and computed the corresponding l_z^mp statistic. The 5,000 bootstrap replications of l_z^mp determined the person-specific null distribution of l_z^mp. The percentile rank of the observed l_z^mp value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure we developed dedicated software in C++. Algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
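The bootstrap logic can be sketched as follows. This is a simplified illustration, not the authors' C++ program: θ is held fixed at its estimate rather than re-estimated for each replicated pattern, and the generic `stat_fn` argument stands in for the l_z^p / l_z^mp computation.

```python
import numpy as np

rng = np.random.default_rng(7)

def bootstrap_pvalue(observed_stat, probs, stat_fn, reps=5000):
    """Parametric-bootstrap p value for a person-fit statistic.

    probs   : J x (M+1) matrix of model category probabilities at the
              person's estimated theta.
    stat_fn : maps a simulated item-score pattern to the statistic value.

    Simplification relative to the paper: theta is held fixed at its
    estimate instead of being re-estimated for every replicated pattern.
    """
    cum = np.asarray(probs, dtype=float).cumsum(axis=1)  # P(X <= m) per item
    u = rng.random((reps, cum.shape[0]))
    # inverse-CDF draws: score = number of cumulative probs below u
    patterns = (u[:, :, None] > cum[None, :, :]).sum(axis=2)
    null = np.array([stat_fn(p) for p in patterns])
    # one-tailed test: proportion of replications at or below the observed value
    return float(np.mean(null <= observed_stat))
```

A pattern that is unlikely under the model lands in the lower tail of its own bootstrapped null distribution and receives a small p value.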
Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).
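As a sketch, this screening step reduces to comparing each standardized residual with the normal cutoffs quoted above (a hypothetical helper, not part of the authors' software):

```python
def flag_items(z_residuals, alpha=0.10):
    """Flag items whose standardized residual exceeds the two-tailed
    normal cutoff: 1.96 for alpha = .05, 1.64 for alpha = .10.
    Positive residuals mark unexpectedly high item scores, negative
    residuals unexpectedly low ones.
    """
    cutoff = 1.96 if alpha == 0.05 else 1.64
    return {item: ("high" if z > 0 else "low")
            for item, z in z_residuals.items() if abs(z) > cutoff}

# Hypothetical residuals for three items: Items 7 and 19 are flagged.
print(flag_items({7: 2.3, 19: 1.7, 40: -0.4}))
```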
Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
Results
First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.
Model-Fit Evaluation
Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measured substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.
For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate
Figure 1. Standardized residuals plotted by item number (grouped by SR, IR, and SD subscale) for fitting patient 663 (l_zm^p = 3.54; upper panel) and misfitting patient 2752 (l_zm^p = −7.96; lower panel).
(RMSEA = .13, SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were only found for some items of the SD and SR subscales, respectively. Thus, EFA results suggested that, more than other model violations, multidimensionality caused the subscale data to show GRM misfit.
To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).
We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, 88% of the responses in the most extreme category). We concluded that despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.
Detection of Misfit and Follow-Up Analyses
For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.
Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, thus showing that her item scores were consistent with the expected GRM item scores.
Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale
misfit may be that his problems were limited to this relationship. On the SD subscale, he had several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.
Explanatory Person-Fit Analysis
For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.
Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance abuse disorder were significant. Patients with ADHD were unlikely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.
Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.
To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpectedly high scores indicating severe symptoms. In general, these patients did not have large
Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit).

                        Model 1         Model 2
Intercept               −1.84 (0.11)    −1.93 (0.11)
Gender                  −0.12 (0.13)    −0.12 (0.13)
Measurement occasion    −0.17 (0.27)    −0.18 (0.27)
Diagnosis category(a)
  Somatoform             0.57 (0.29)     0.74 (0.29)
  ADHD                  −0.58 (0.28)    −0.39 (0.28)
  Psychotic              1.05 (0.46)     1.13 (0.47)
  Borderline            −1.30 (0.72)    −1.39 (0.73)
  Impulse control        0.35 (0.36)     0.57 (0.36)
  Eating disorders      −1.10 (0.60)    −0.97 (0.60)
  Substance related      0.66 (0.33)     0.69 (0.33)
  Social/relational     −0.20 (0.26)     0.08 (0.27)
GAF code                    —           −0.17 (0.07)
OQ-45 total score           —            0.26 (0.07)

Note. N = 2,434. The correlation between the GAF code and the OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category. *p < .05. **p < .01. ***p < .001.
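The estimated probabilities of misfit quoted in the text follow from the Model 2 log-odds in Table 2 via the inverse-logit transform. A worked check, with the control variables at their reference values:

```python
import math

def inv_logit(x):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

intercept = -1.93  # Model 2 intercept from Table 2
for label, beta in [("baseline", 0.00), ("somatoform", 0.74),
                    ("psychotic", 1.13), ("substance related", 0.69)]:
    print(label, round(inv_logit(intercept + beta), 2))
# prints: baseline 0.13, somatoform 0.23, psychotic 0.31,
#         substance related 0.22 -- matching the text
```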
residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.
Discussion
We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.
The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.
Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion, and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test taking, and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, as the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.
There are two possible general explanations for finding group differences with respect to the tendency to show person misfit: first, person misfit may be due to a mismatch between the OQ-45 and the specific disorder, and second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for diagnosing severe mental illnesses such as psychotic disorders and bipolar disorder.
The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.
Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not previously been evaluated by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.
Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and
malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.
Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.
To conclude, our results have two main implications, pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data from outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.
Appendix A

Statistic l_z^p
Suppose the data are polytomous scores on J items (items are indexed j; j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern x of person i is given by

l_p(\mathbf{x}_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i). \quad (A1)

The standardized log-likelihood is defined as

l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E[l_p(\mathbf{x}_i)]}{\mathrm{VAR}[l_p(\mathbf{x}_i)]^{1/2}}, \quad (A2)

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.
Standardized Residual Statistic
The unstandardized residual for person i on item j is given by

e_{ij} = X_{ij} - E(X_{ij}), \quad (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and a variance equal to

\mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2. \quad (A4)

The standardized residual is given by

z_{e_{ij}} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \quad (A5)

To compute z_{e_{ij}}, latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
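Equations A1 to A5 can be implemented directly. The sketch below assumes a standard logistic GRM parameterization, with cumulative probability P(X ≥ m) = logistic(α(θ + δ_m)); this parameterization is an illustrative assumption, and the code is not the authors' software:

```python
import math

def grm_category_probs(theta, alpha, deltas):
    """Category probabilities under a graded response model.
    deltas must be strictly decreasing so the cumulative
    probabilities are ordered."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-alpha * (theta + d)))
                   for d in deltas] + [0.0]
    return [cum[m] - cum[m + 1] for m in range(len(deltas) + 1)]

def lz_p(pattern, theta, items):
    """Standardized log-likelihood l_z^p of Equations A1-A2.
    items is a list of (alpha, deltas) pairs."""
    l = e = v = 0.0
    for x, (alpha, deltas) in zip(pattern, items):
        probs = grm_category_probs(theta, alpha, deltas)
        l += math.log(probs[x])                         # Equation A1
        logs = [math.log(p) for p in probs]
        mean = sum(p * lg for p, lg in zip(probs, logs))
        e += mean                                       # E(l_p)
        v += sum(p * (lg - mean) ** 2                   # VAR(l_p)
                 for p, lg in zip(probs, logs))
    return (l - e) / math.sqrt(v)                       # Equation A2

def z_residual(x, theta, alpha, deltas):
    """Standardized item-score residual of Equations A3-A5."""
    probs = grm_category_probs(theta, alpha, deltas)
    ex = sum(m * p for m, p in enumerate(probs))        # E(X_ij)
    ex2 = sum(m * m * p for m, p in enumerate(probs))
    return (x - ex) / math.sqrt(ex2 - ex ** 2)          # (A3)-(A5)
```

For instance, with five hypothetical four-category items of equal parameters, an all-middle-category pattern yields a lower l_z^p than a pattern alternating between the extreme categories, consistent with the observation in the text that all-0/all-4 patterns can be model-consistent.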
Appendix B
For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as logit[p(θ)] = −1.702(αθ + δ). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.
We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, & Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation.

                 SR subscale                IR subscale                SD subscale
Item parameter   M (SD)        Range        M (SD)        Range        M (SD)        Range
Discrimination
  α_θ1           0.76 (0.47)   0.23, 1.46   0.84 (0.37)   0.36, 1.65   0.90 (0.26)   0.52, 1.32
  α_θ2           0.26 (0.50)  −0.09, 1.31   0.02 (0.37)  −0.48, 0.73   0.03 (0.40)  −0.74, 0.54
  α_θ3           —             —            —             —            0.02 (0.27)  −0.56, 0.53
Threshold
  δ_0            —             —            —             —            —             —
  δ_1            0.98 (0.70)  −0.25, 1.84   1.12 (0.66)   0.14, 2.21   1.76 (0.84)  −0.11, 3.08
  δ_2            0.10 (0.60)  −0.88, 0.80   0.07 (0.52)  −0.98, 0.95   0.77 (0.65)  −0.70, 1.75
  δ_3           −0.82 (0.53)  −1.57, −0.19 −1.11 (0.67)  −2.40, −0.18 −0.40 (0.53)  −1.67, 0.62
  δ_4           −1.84 (0.50)  −2.60, −1.26 −2.28 (0.87)  −3.62, −1.09 −1.74 (0.53)  −2.87, −0.70
Latent-trait correlations
  r_θ1θ2         .20                        .50                       −.42
  r_θ1θ3         —                          —                          .53
  r_θ2θ3         —                          —                         −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.
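The two misfit mechanisms can be mimicked with simple stand-ins. These are illustrative assumptions, not the study's exact procedure: random error overwrites a fixed number of scores with random categories, and acquiescence is approximated here by a replacement probability rather than the threshold shifts of Table B1:

```python
import random

def random_error(pattern, n_items, rng, n_cats=5):
    """Overwrite n_items randomly chosen item scores with random
    categories (cf. the 10-, 20-, or 30-item random-error levels)."""
    out = list(pattern)
    for j in rng.sample(range(len(out)), n_items):
        out[j] = rng.randrange(n_cats)
    return out

def acquiescence(pattern, strength, rng, n_cats=5):
    """Push item scores toward the most extreme category with
    probability `strength` (weak/moderate/strong acquiescence)."""
    return [n_cats - 1 if rng.random() < strength else x
            for x in pattern]

rng = random.Random(0)
clean = [2] * 20          # a hypothetical model-consistent pattern
noisy = random_error(clean, 10, rng)
yea_saying = acquiescence(clean, 0.9, rng)
```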
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization of Scientific Research (NWO 400-06-087, first author).
References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health: Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236
Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.
Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., & Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the outcome questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.
Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.
Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf
Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.
Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.
Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.
Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.
Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.
Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.
Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.
Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that misfit was less likely for patients suffering from these symptoms than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously into one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).
Statistical Analysis
Model-Fit Evaluation. We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. The core of the GRM is formed by the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
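To make the ISRF definition concrete, the category probabilities of a single GRM item can be sketched in a few lines of code (a minimal illustration; the function name and the parameter values are ours, not taken from the OQ-45 calibration):

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities of one GRM item.

    theta: latent trait value; a: item discrimination; b: M ordered
    thresholds, giving M + 1 answer categories. Each ISRF P(X >= m | theta)
    is logistic; category probabilities are differences of adjacent ISRFs.
    """
    b = np.asarray(b, dtype=float)
    isrf = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # P(X >= m), m = 1..M
    upper = np.concatenate(([1.0], isrf))          # P(X >= m), m = 0..M
    lower = np.concatenate((isrf, [0.0]))          # P(X >= m + 1)
    return upper - lower                           # P(X = m), m = 0..M

# Example: a 5-category item (scores 0-4) with hypothetical parameters.
probs = grm_category_probs(theta=0.0, a=1.5, b=[-1.0, 0.0, 1.0, 2.0])
```

Because the thresholds are ordered, the ISRFs are nested and the differences are guaranteed to be valid probabilities that sum to 1.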
Satisfactory GRM fit to the data is a prerequisite for application of GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full information (GRM) and the limited information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean squared error of approximation (RMSEA) and the standardized root mean residual (SRMR; Muthén & Muthén, 2007). RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust with respect to the identified OQ-45 model misfit (Conijn et al., 2014).
Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis

Category | Common DSM-IV diagnoses included | n | Mean l_zm^p | Detected (n) | Detected (%)
Mood and anxiety disorder^a | Depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder | 1,786 | 0.28 | 229 | 12.8
Somatoform disorder | Pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder | 82 | 0.16 | 16 | 19.5
Attention deficit hyperactivity disorder (ADHD) | Predominantly inattentive, combined hyperactive-impulsive and inattentive | 198 | 0.08 | 15 | 7.6
Psychotic disorder | Schizophrenia, psychotic disorder not otherwise specified | 26 | −0.10 | 7 | 26.9
Borderline personality disorder | Borderline personality disorder | 53 | 0.35 | 2 | 3.8
Impulse-control disorders not elsewhere classified | Impulse-control disorder, intermittent explosive disorder | 58 | 0.02 | 10 | 17.2
Eating disorder | Eating disorder not otherwise specified, bulimia nervosa | 67 | 0.38 | 4 | 6.0
Substance-related disorder | Cannabis-related disorders, alcohol-related disorders | 58 | 0.09 | 13 | 22.4
Social and relational problem | Phase of life problem, partner relational problem, identity problem | 186 | 0.26 | 20 | 10.8

a. Including 65 patients with a mood disorder.
Person-Fit Analysis
Detection of misfit. We used statistic l_z for polytomous item scores, denoted by l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics.

Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reverse-worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they still may be suspicious, as they may reflect gross under- or overreporting of symptoms.
Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.
Under the null model of fit to the IRT model and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each simulated item-score pattern, we re-estimated θ and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p, and the percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++; algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
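The bootstrap logic can be sketched compactly. The sketch below simplifies the procedure described above in two respects: θ is held fixed at its estimate rather than re-estimated for each simulated pattern, and a single scale is used rather than the multiscale sum; the item parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_probs(theta, a, b):
    """Category probabilities of one GRM item with logistic ISRFs."""
    isrf = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    return np.concatenate(([1.0], isrf)) - np.concatenate((isrf, [0.0]))

def lz(pattern, theta, items):
    """Standardized log-likelihood l_z of a polytomous item-score pattern.

    items is a list of (a, b) parameter tuples, one per item.
    """
    loglik = mean = var = 0.0
    for x, (a, b) in zip(pattern, items):
        p = grm_probs(theta, a, b)
        logp = np.log(p)
        loglik += logp[x]
        m = np.dot(p, logp)                 # E[ln P] for this item
        mean += m
        var += np.dot(p, (logp - m) ** 2)   # VAR[ln P] for this item
    return (loglik - mean) / np.sqrt(var)

def bootstrap_p_value(pattern, theta_hat, items, reps=5000):
    """Person-specific null distribution of l_z via parametric bootstrap."""
    obs = lz(pattern, theta_hat, items)
    null = np.empty(reps)
    for r in range(reps):
        sim = [rng.choice(len(b) + 1, p=grm_probs(theta_hat, a, b))
               for a, b in items]
        null[r] = lz(sim, theta_hat, items)
    return np.mean(null <= obs)   # one-tailed: misfit = large negative l_z
```

In the article's procedure, θ was re-estimated per replication and l_zm^p was the standardized sum of subscale l_z^p values; both refinements are dropped here for brevity.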
Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).
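Under the definitions in Appendix A, the residual statistic reduces to a few lines (a sketch with hypothetical item parameters; the flagging thresholds correspond to the cutoffs above):

```python
import numpy as np

def standardized_residual(x, theta, a, b):
    """Standardized item-score residual ze = (x - E[X]) / sqrt(VAR[X]) under the GRM."""
    b = np.asarray(b, dtype=float)
    isrf = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    p = np.concatenate(([1.0], isrf)) - np.concatenate((isrf, [0.0]))
    m = np.arange(p.size)
    ex = np.dot(m, p)                   # E[X | theta]
    var = np.dot(m ** 2, p) - ex ** 2   # VAR[X | theta]
    return (x - ex) / np.sqrt(var)

# With symmetric thresholds and theta = 0, the expected score is 2:
# a score of 2 yields a residual near 0, a score of 4 a large positive one.
z_mid = standardized_residual(2, 0.0, 1.5, [-1.5, -0.5, 0.5, 1.5])
z_high = standardized_residual(4, 0.0, 1.5, [-1.5, -0.5, 0.5, 1.5])
```

A residual beyond ±1.96 (or ±1.64 at α = .10) would flag the item score as unexpected for that person.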
Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
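The explanatory model is an ordinary binary logistic regression, which can be fitted by Newton-Raphson in a few lines. This sketch uses synthetic data with made-up coefficients (not the paper's estimates) to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        grad = X.T @ (y - p)             # score vector
        hess = X.T @ (X * W[:, None])    # observed information matrix
        beta += np.linalg.solve(hess, grad)
    return beta

# Synthetic example: misfit probability increases with a distress score.
n = 4000
distress = rng.standard_normal(n)
X = np.column_stack([np.ones(n), distress])   # intercept + predictor
true_beta = np.array([-1.9, 0.4])             # hypothetical values
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta_hat = fit_logistic(X, y)
```

In the study itself, the predictor set contained gender, measurement occasion, diagnosis-category dummies, and (in Model 2) GAF code and OQ-45 total score.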
Results
First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.
Model-Fit Evaluation
Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measured substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items, 2 items excluded), IR (10 items, 1 item excluded), and SD (24 items, 1 item excluded) subscales equaled .67, .78, and .91, respectively.
For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate
Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54; upper panel) and misfitting patient 2752 (l_zm^p = −7.96; lower panel). [Figure not reproduced: each panel plots standardized residuals against OQ-45 item numbers, grouped by SR, IR, and SD subscale.]
(RMSEA = .13, SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were only found for some items of the SD and SR subscales, respectively. Thus, the EFA results suggested that multidimensionality, more than other model violations, caused the subscale data to show GRM misfit.
To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).
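The two kinds of simulated aberrance can be operationalized roughly as below. This is our simplified reading of the design; the exact generation scheme of Conijn et al. (2014) may differ.

```python
import numpy as np

rng = np.random.default_rng(7)

def add_random_error(pattern, k, n_cats=5):
    """Random-error misfit: overwrite k randomly chosen item scores
    with uniformly drawn categories."""
    out = np.array(pattern)
    idx = rng.choice(out.size, size=k, replace=False)
    out[idx] = rng.integers(0, n_cats, size=k)
    return out

def add_acquiescence(pattern, strength, n_cats=5):
    """Acquiescence misfit: push a proportion `strength` of the scores
    into the most extreme category."""
    out = np.array(pattern)
    out[rng.random(out.size) < strength] = n_cats - 1
    return out

base = np.full(30, 2)                  # a bland model-consistent pattern
noisy = add_random_error(base, k=10)   # "random scores on 10 items" level
yea = add_acquiescence(base, 0.88)     # "strong" level: ~88% extreme scores
```

Patterns distorted this way can then be fed to a person-fit statistic to estimate detection rates, while undistorted model-generated patterns yield the empirical Type I error rate.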
We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, 88% of the responses in the most extreme category). We concluded that despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.
Detection of Misfit and Follow-Up Analyses
For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, thus showing that her item scores were consistent with the expected GRM item scores.
Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, the residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale
misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, an extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.
Explanatory Person-Fit Analysis
For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.
Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ²(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance-related disorder were significant. Patients with ADHD were unlikely to show misfit relative to the baseline category of patients with mood or anxiety disorders, whereas patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the effect of ADHD was no longer significant; hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

To investigate whether patients in the same diagnosis
category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpectedly high scores indicating severe symptoms. In general, these patients did not have large
Table 2. Estimated Regression Coefficients (Standard Errors) of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit)

Predictor | Model 1 | Model 2
Intercept | −1.84 (0.11) | −1.93 (0.11)
Gender | −0.12 (0.13) | −0.12 (0.13)
Measurement occasion | −0.17 (0.27) | −0.18 (0.27)
Diagnosis category^a:
  Somatoform | 0.57 (0.29) | 0.74 (0.29)
  ADHD | −0.58 (0.28) | −0.39 (0.28)
  Psychotic | 1.05 (0.46) | 1.13 (0.47)
  Borderline | −1.30 (0.72) | −1.39 (0.73)
  Impulse control | 0.35 (0.36) | 0.57 (0.36)
  Eating disorders | −1.10 (0.60) | −0.97 (0.60)
  Substance related | 0.66 (0.33) | 0.69 (0.33)
  Social/relational | −0.20 (0.26) | 0.08 (0.27)
GAF code | — | −0.17 (0.07)
OQ-45 total score | — | 0.26 (0.07)

Note. N = 2,434. The correlation between the GAF code and OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category. *p < .05. **p < .01. ***p < .001.
residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.
Discussion
We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception; in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.
The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State-Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) also found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were also more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion, and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test taking, and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, as the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.
There are two possible general explanations for finding group differences in the tendency to show person misfit: first, person misfit may be due to a mismatch between the OQ-45 and the specific disorder, and second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for diagnosing severe mental illnesses such as psychotic disorders and bipolar disorder.
The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.
Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not been evaluated previously by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.
Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that these statistics have low power to identify response styles and malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.
Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.
To conclude, our results have two main implications, pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data from outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the greatest risk of producing invalid outcome scores. Overall, our results emphasize the importance of identifying person misfit in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.
Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j, j = 1, . . . , J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, . . . , M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, . . . , M), and d_j(m) = 0 otherwise. The unstandardized log-likelihood of the item-score pattern x_i of person i is given by

$$ l_p(\mathbf{x}_i) = \sum_{j=1}^{J}\sum_{m=0}^{M} d_j(m)\,\ln P(X_j = m \mid \theta_i). \qquad (A1) $$

The standardized log-likelihood is defined as

$$ l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E[l_p(\mathbf{x}_i)]}{\mathrm{VAR}[l_p(\mathbf{x}_i)]^{1/2}}, \qquad (A2) $$

where E[l_p(x_i)] is the expected value and VAR[l_p(x_i)] is the variance of l_p(x_i).
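Equations (A1) and (A2) can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' software; the function name and the (J, M+1) probability-matrix layout are our own assumptions:

```python
import numpy as np

def lz_p(x, P):
    """Standardized log-likelihood person-fit statistic, Equations (A1)-(A2).

    x : (J,) observed item scores, each in 0..M
    P : (J, M+1) array, P[j, m] = P(X_j = m | theta_i) under the model
    """
    logP = np.log(P)
    # (A1): log-likelihood of the observed pattern
    l_p = logP[np.arange(len(x)), x].sum()
    # model-implied mean and variance of l_p (items independent given theta)
    e_lp = (P * logP).sum()
    var_lp = (P * logP**2).sum() - ((P * logP).sum(axis=1) ** 2).sum()
    # (A2): large negative values indicate person misfit
    return (l_p - e_lp) / np.sqrt(var_lp)
```

Under probabilities that make low scores likely, a pattern of likely scores yields a positive value, and an improbable pattern yields a negative value.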
Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

$$ e_{ij} = X_{ij} - E(X_{ij}), \qquad (A3) $$

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = Σ_{m=0}^{M} m P(X_{ij} = m | θ_i). The residual e_{ij} has a mean of 0 and variance equal to

$$ \mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2. \qquad (A4) $$

The standardized residual is given by

$$ ze_{ij} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \qquad (A5) $$

To compute ze_{ij}, latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
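A matching sketch of Equations (A3) through (A5), under the same assumed (J, M+1) probability-matrix layout (illustrative; the function name is hypothetical):

```python
import numpy as np

def standardized_residuals(x, P):
    """Standardized item-score residuals ze_ij, Equations (A3)-(A5).

    x : (J,) observed item scores 0..M
    P : (J, M+1) category probabilities at the (estimated) theta
    """
    m = np.arange(P.shape[1])
    e_x = (P * m).sum(axis=1)                  # E(X_ij)
    var_x = (P * m**2).sum(axis=1) - e_x**2    # (A4)
    return (x - e_x) / np.sqrt(var_x)          # (A3) and (A5)
```

Residuals beyond ±1.64 (α = .10) or ±1.96 (α = .05) would then be flagged as unexpected item scores, as in the follow-up analyses reported in the article.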
Appendix B
For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as

$$ P(\theta) = \left[1 + \exp\bigl(-1.702(\boldsymbol{\alpha}'\boldsymbol{\theta} + \delta)\bigr)\right]^{-1}. $$

Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.
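The model equation can be illustrated as follows. This is a sketch under the stated convention that higher δ makes a category boundary more popular; the function names and the differencing of boundary curves into category probabilities are our own assumptions:

```python
import numpy as np

def mirt_boundary_prob(theta, alpha, delta):
    """Boundary probability: P = 1 / (1 + exp(-1.702 (alpha . theta + delta)))."""
    return 1.0 / (1.0 + np.exp(-1.702 * (np.dot(alpha, theta) + delta)))

def mirt_category_probs(theta, alpha, deltas):
    """Category probabilities as differences of adjacent boundary
    probabilities; deltas must be ordered from highest (delta_1) to
    lowest (delta_M), as in Table B1."""
    bounds = np.array([mirt_boundary_prob(theta, alpha, d) for d in deltas])
    cum = np.concatenate(([1.0], bounds, [0.0]))
    return cum[:-1] - cum[1:]
```

With two latent traits and δ values in the Table B1 range, the category probabilities are nonnegative and sum to one, and raising δ raises the boundary probability (a more popular category boundary).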
We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

| Item parameter | SR subscale M (SD) | SR Range | IR subscale M (SD) | IR Range | SD subscale M (SD) | SD Range |
| --- | --- | --- | --- | --- | --- | --- |
| Discrimination | | | | | | |
| α_θ1 | 0.76 (0.47) | 0.23 to 1.46 | 0.84 (0.37) | 0.36 to 1.65 | 0.90 (0.26) | 0.52 to 1.32 |
| α_θ2 | 0.26 (0.50) | −0.09 to 1.31 | 0.02 (0.37) | −0.48 to 0.73 | 0.03 (0.40) | −0.74 to 0.54 |
| α_θ3 | — | — | — | — | 0.02 (0.27) | −0.56 to 0.53 |
| Threshold | | | | | | |
| δ_0 | — | — | — | — | — | — |
| δ_1 | 0.98 (0.70) | −0.25 to 1.84 | 1.12 (0.66) | 0.14 to 2.21 | 1.76 (0.84) | −0.11 to 3.08 |
| δ_2 | 0.10 (0.60) | −0.88 to 0.80 | 0.07 (0.52) | −0.98 to 0.95 | 0.77 (0.65) | −0.70 to 1.75 |
| δ_3 | −0.82 (0.53) | −1.57 to −0.19 | −1.11 (0.67) | −2.40 to −0.18 | −0.40 (0.53) | −1.67 to 0.62 |
| δ_4 | −1.84 (0.50) | −2.60 to −1.26 | −2.28 (0.87) | −3.62 to −1.09 | −1.74 (0.53) | −2.87 to −0.70 |
| Latent-trait correlations | | | | | | |
| r_θ1θ2 | .20 | | .50 | | −.42 | |
| r_θ1θ3 | — | | — | | .53 | |
| r_θ2θ3 | — | | — | | −.55 | |

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.
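The two types of simulated misfit can be sketched as follows. This is a simplified illustration of the design with hypothetical helper names; the study's actual generator followed St-Onge et al. (2011) and Cheung and Rensvold (2000):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_random_error(x, n_items, n_cats=5):
    """Random-error misfit: overwrite the scores of n_items randomly
    chosen items (10, 20, or 30 in the study) with uniform random scores."""
    x = x.copy()
    idx = rng.choice(len(x), size=n_items, replace=False)
    x[idx] = rng.integers(0, n_cats, size=n_items)
    return x

def shift_thresholds(delta, shift, positively_worded):
    """Acquiescence misfit: shift the delta parameters by 1, 2, or 3
    points (weak/moderate/strong); the shift is positive for positively
    worded items and negative for negatively worded items."""
    sign = np.where(positively_worded, 1.0, -1.0)
    return delta + shift * sign[:, None]
```

Data generated from the shifted thresholds then overrepresent the extreme answer categories, mimicking an acquiescent respondent.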
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization for Scientific Research (NWO 400-06-087, first author).
References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.

Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236

Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.

Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., . . . Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.

Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.

Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11pdf

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.

Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.

Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.

Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.

Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.

Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.

Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
Person-Fit Analysis

Detection of misfit. We used statistic l_z for polytomous item scores, denoted by l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics.

Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reverse-worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they may still be suspicious, as they may reflect gross under- or overreporting of symptoms.
Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.
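If the subscale statistics are treated as independent and approximately standard normal, the standardized sum reduces to dividing the sum by the square root of the number of subscales. A minimal sketch under that simplifying assumption (the exact standardization in Conijn et al., 2014, may differ; the function name is ours):

```python
import numpy as np

def lzm_p(lz_subscales):
    """Multiscale person-fit statistic: standardized sum of the
    subscale l_z^p values. Subscales with a missing l_z^p value
    (e.g., all-0 or all-4 patterns) are skipped, as in the study.
    Assumes independent, approximately standard-normal inputs."""
    lz = np.asarray(lz_subscales, dtype=float)
    lz = lz[~np.isnan(lz)]
    return lz.sum() / np.sqrt(len(lz))
```

A pattern with consistently negative subscale values thus yields a clearly negative multiscale value, flagging misfit on the questionnaire as a whole.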
Under the null model of fit to the IRT model and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each item-score pattern, we again estimated the θ value and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p. The percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++. Algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
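The bootstrap procedure can be sketched in Python. The published computations used dedicated C++ software and MULTILOG parameter estimates; here the GRM parameterization, the crude grid-based θ estimator, and all function names are our own simplifications, and the single-scale statistic stands in for l_zm^p:

```python
import numpy as np

def grm_probs(theta, a, b):
    """Category probabilities under the GRM for one person.
    a: (J,) discriminations; b: (J, M) ascending thresholds."""
    pstar = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))   # P(X_j >= m)
    cum = np.hstack([np.ones((len(a), 1)), pstar, np.zeros((len(a), 1))])
    return cum[:, :-1] - cum[:, 1:]                           # (J, M+1)

def lz_p(x, P):
    """Standardized log-likelihood statistic (Appendix A)."""
    logP = np.log(P)
    l = logP[np.arange(len(x)), x].sum()
    e = (P * logP).sum()
    v = (P * logP**2).sum() - ((P * logP).sum(axis=1) ** 2).sum()
    return (l - e) / np.sqrt(v)

def theta_mle(x, a, b, grid=np.linspace(-4, 4, 161)):
    """Crude grid-based maximum-likelihood estimate of theta."""
    ll = [np.log(grm_probs(t, a, b)[np.arange(len(x)), x]).sum() for t in grid]
    return grid[int(np.argmax(ll))]

def bootstrap_p(x, a, b, B=5000, rng=np.random.default_rng(1)):
    """Person-specific parametric-bootstrap p value: simulate B patterns
    at the estimated theta, re-estimate theta for each replicated
    pattern, and take the percentile rank of the observed statistic."""
    t_hat = theta_mle(x, a, b)
    P_hat = grm_probs(t_hat, a, b)
    obs = lz_p(x, P_hat)
    null = np.empty(B)
    for r in range(B):
        xr = np.array([rng.choice(P_hat.shape[1], p=P_hat[j])
                       for j in range(len(x))])
        tr = theta_mle(xr, a, b)
        null[r] = lz_p(xr, grm_probs(tr, a, b))
    return obs, (null <= obs).mean()    # one-tailed p value
```

With B = 5,000 replications per person, as in the study, the percentile rank of the observed statistic in the person-specific null distribution gives the one-tailed p value.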
Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing, with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).
Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
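Such an explanatory model can be sketched with a small self-contained Newton-Raphson fit. This is illustrative only; the study's estimates in Table 2 came from standard statistical software, and the simulated predictor data below are hypothetical:

```python
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Logistic regression of a 0/1 misfit indicator on predictors via
    Newton-Raphson; X must include an intercept column.
    Returns coefficient estimates and their standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    H = X.T @ ((p * (1.0 - p))[:, None] * X)   # information matrix
    return beta, np.sqrt(np.diag(np.linalg.inv(H)))

# illustrative data: misfit ~ intercept + gender (0 = male, 1 = female)
rng = np.random.default_rng(7)
n = 4000
gender = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), gender])
true_beta = np.array([-1.8, -0.1])             # roughly the Table 2 scale
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta_hat, se_hat = logistic_fit(X, y)
```

A coefficient more than about two standard errors from zero would then be reported as a significant predictor of misfit, as in Table 2.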
Results

First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis, in which the l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.
Model-Fit Evaluation

Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measure substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items, 2 items excluded), IR (10 items, 1 item excluded), and SD (24 items, 1 item excluded) subscales equaled .67, .78, and .91, respectively.

For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate
Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54, upper panel) and misfitting patient 2752 (l_zm^p = −7.96, lower panel).
(RMSEA = .13, SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were found only for some items of the SD and SR subscales, respectively. Thus, the EFA results suggested that multidimensionality, more than other model violations, caused the subscale data to show GRM misfit.
To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, with 88% of the responses in the most extreme category). We concluded that despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.
Detection of Misfit and Follow-Up Analyses

For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, thus showing that her item scores were consistent with the expected GRM item scores.
Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, the residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale misfit may be that his problems were limited to this relationship. On the SD subscale, he had several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.
Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance abuse disorder were significant. Patients with ADHD were unlikely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the positive effect of ADHD was not significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpectedly high scores indicating severe symptoms. In general, these patients did not have large
Table 2. Estimated Regression Coefficients (Standard Errors) of the Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit)

| Predictor | Model 1 | Model 2 |
| --- | --- | --- |
| Intercept | −1.84 (0.11) | −1.93 (0.11) |
| Gender | −0.12 (0.13) | −0.12 (0.13) |
| Measurement occasion | −0.17 (0.27) | −0.18 (0.27) |
| Diagnosis category (a) | | |
| Somatoform | 0.57 (0.29) | 0.74 (0.29) |
| ADHD | −0.58 (0.28) | −0.39 (0.28) |
| Psychotic | 1.05 (0.46) | 1.13 (0.47) |
| Borderline | −1.30 (0.72) | −1.39 (0.73) |
| Impulse control | 0.35 (0.36) | 0.57 (0.36) |
| Eating disorders | −1.10 (0.60) | −0.97 (0.60) |
| Substance related | 0.66 (0.33) | 0.69 (0.33) |
| Social/relational | −0.20 (0.26) | 0.08 (0.27) |
| GAF code | — | −0.17 (0.07) |
| OQ-45 total score | — | 0.26 (0.07) |

Note. N = 2,434. The correlation between the GAF code and the OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.
residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting a lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.
Discussion
We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We notice that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.
The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion, and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test-taking, and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.
There are two possible general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder; second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for diagnosing severe mental illnesses such as psychotic disorders and bipolar disorder.
The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not been evaluated previously by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.
Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and
malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.
Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient-Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.
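The base-rate trade-off described here can be made concrete with a short calculation: if π is the prevalence of aberrant responding, p the detection rate, and α the Type I error rate, then the share of flagged patterns that are truly aberrant equals πp / (πp + (1 − π)α). A minimal sketch; the numbers are illustrative only, not estimates from this study:

```python
def flag_precision(prevalence, power, alpha):
    """Share of flagged item-score patterns that are truly aberrant
    (positive predictive value of a person-fit flag)."""
    true_flags = prevalence * power
    false_flags = (1.0 - prevalence) * alpha
    return true_flags / (true_flags + false_flags)

# With 20% aberrant patterns, power .77, and Type I error rate .01,
# nearly every flag is a genuine hit.
high_prevalence = flag_precision(0.20, 0.77, 0.01)
# With 1% prevalence, false alarms start to rival the correct detections.
low_prevalence = flag_precision(0.01, 0.77, 0.01)
```

The second case illustrates Piedmont et al.'s (2000) point: with low prevalence, even a very small Type I error rate produces a substantial share of false flags.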
To conclude, our results have two main implications pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data of other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data of outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.
Appendix A
Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j, j = 1, …, J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, …, M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, …, M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern x_i of person i is given by

$$l_p(\mathbf{x}_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i). \tag{A1}$$

The standardized log-likelihood is defined as

$$l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E[l_p(\mathbf{x}_i)]}{\mathrm{VAR}[l_p(\mathbf{x}_i)]^{1/2}}, \tag{A2}$$

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.
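Equations (A1) and (A2) can be computed directly from the model-implied category probabilities; under local independence, the expectation and variance of l_p are sums over items of the category-wise moments of ln P. A minimal sketch (the probability values below are hypothetical; in practice they come from the fitted GRM at the estimated θ):

```python
import math

def lz_p(probs, pattern):
    """Standardized log-likelihood statistic l_z^p (Equations A1-A2).

    probs[j][m] = P(X_j = m | theta_i); pattern[j] = observed score on item j.
    """
    # Equation (A1): log-likelihood of the observed item-score pattern
    l_p = sum(math.log(p_j[x_j]) for p_j, x_j in zip(probs, pattern))
    # Moments of l_p under the model, assuming local independence
    e_lp = sum(p * math.log(p) for p_j in probs for p in p_j if p > 0)
    var_lp = sum(
        sum(p * math.log(p) ** 2 for p in p_j if p > 0)
        - sum(p * math.log(p) for p in p_j if p > 0) ** 2
        for p_j in probs
    )
    # Equation (A2): standardize; large negative values signal unlikely patterns
    return (l_p - e_lp) / math.sqrt(var_lp)

# A model-consistent pattern yields a higher value than a reversed one
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
consistent = lz_p(probs, [0, 2])    # expected scores under the model
reversed_ = lz_p(probs, [2, 0])     # unexpected scores
```

Negative values of the statistic indicate item-score patterns that are unlikely under the model, which is why misfit is flagged in the lower tail.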
Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

$$e_{ij} = X_{ij} - E(X_{ij}), \tag{A3}$$

where $E(X_{ij})$ is the expected value of $X_{ij}$, which equals $E(X_{ij}) = \sum_{m=0}^{M} m \, P(X_{ij} = m \mid \theta_i)$. The residual $e_{ij}$ has a mean of 0 and variance equal to

$$\mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2. \tag{A4}$$

The standardized residual is given by

$$ze_{ij} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \tag{A5}$$

To compute $ze_{ij}$, latent trait $\theta_i$ needs to be replaced by its estimated value. This may bias the standardization of $e_{ij}$.
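Equations (A3) through (A5) can be sketched for a single item as follows (again with hypothetical category probabilities standing in for the fitted GRM):

```python
import math

def standardized_residual(probs_j, x_ij):
    """Standardized item-score residual ze_ij (Equations A3-A5).

    probs_j[m] = P(X_ij = m | theta_i) for one item; x_ij = observed score.
    """
    e_x = sum(m * p for m, p in enumerate(probs_j))        # E(X_ij)
    e_x2 = sum(m ** 2 * p for m, p in enumerate(probs_j))  # E(X_ij^2)
    var_e = e_x2 - e_x ** 2                                # Equation (A4)
    return (x_ij - e_x) / math.sqrt(var_e)                 # Equations (A3), (A5)

# With alpha = .10, as in the article, |ze| > 1.64 flags an unexpected score
z_high = standardized_residual([0.7, 0.2, 0.1], 2)  # unexpectedly high score
z_ok = standardized_residual([0.7, 0.2, 0.1], 0)    # score consistent with model
```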
Appendix B
For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus
(Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as logit[P(θ)] = 1.702(αθ + δ). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong, based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
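The δ-shift manipulation described in the note to Table B1 can be sketched for a single graded-response item: shifting all thresholds upward pushes probability mass toward the high answer categories, mimicking acquiescence without any change in the latent trait. A simplified unidimensional sketch with illustrative parameters (not the study's estimates):

```python
import math

def grm_category_probs(alpha, deltas, theta, shift=0.0):
    """Category probabilities for one graded-response item, with
    P(X >= m | theta) = logistic(1.702 * (alpha * theta + delta_m)).
    A positive shift of the deltas mimics acquiescence."""
    cum = [1.0] + [
        1.0 / (1.0 + math.exp(-1.702 * (alpha * theta + d + shift)))
        for d in deltas
    ] + [0.0]
    # Category probability = difference of adjacent cumulative probabilities
    return [cum[m] - cum[m + 1] for m in range(len(deltas) + 1)]

def expected_score(probs):
    return sum(m * p for m, p in enumerate(probs))

# Illustrative thresholds, roughly in the range of Table B1 (decreasing in m)
deltas = [1.0, 0.1, -0.8, -1.8]
honest = grm_category_probs(0.8, deltas, theta=0.0)
acquiescent = grm_category_probs(0.8, deltas, theta=0.0, shift=3.0)
# The shift inflates the expected item score at the same theta
```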
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                     SR subscale                 IR subscale                 SD subscale
Item parameter       M (SD)        Range         M (SD)        Range         M (SD)        Range
Discrimination
  α_θ1               0.76 (0.47)   0.23, 1.46    0.84 (0.37)   0.36, 1.65    0.90 (0.26)   0.52, 1.32
  α_θ2               0.26 (0.50)   −0.09, 1.31   0.02 (0.37)   −0.48, 0.73   0.03 (0.40)   −0.74, 0.54
  α_θ3               —             —             —             —             0.02 (0.27)   −0.56, 0.53
Threshold
  δ_0                —             —             —             —             —             —
  δ_1                0.98 (0.70)   −0.25, 1.84   1.12 (0.66)   0.14, 2.21    1.76 (0.84)   −0.11, 3.08
  δ_2                0.10 (0.60)   −0.88, 0.80   0.07 (0.52)   −0.98, 0.95   0.77 (0.65)   −0.70, 1.75
  δ_3                −0.82 (0.53)  −1.57, −0.19  −1.11 (0.67)  −2.40, −0.18  −0.40 (0.53)  −1.67, 0.62
  δ_4                −1.84 (0.50)  −2.60, −1.26  −2.28 (0.87)  −3.62, −1.09  −1.74 (0.53)  −2.87, −0.70
Latent-trait correlations
  r_θ1θ2             .20                         .50                         −.42
  r_θ1θ3             —                           —                           .53
  r_θ2θ3             —                           —                           −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, 12% and 88%.
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organisation for Scientific Research (NWO 400-06-087, first author).
References
Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire–45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2
(Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236
Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.
Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., … Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.
Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.
Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf
Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.
Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.
Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.
Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 3, 271-282.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.
Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.
Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.
Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.
Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
[Figure 1 appeared here: three panels of standardized residuals plotted by item number for the SR, IR, and SD subscales.] Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54, upper panel) and misfitting patient 2752 (l_zm^p = −7.96, lower panel).
(RMSEA = .13, SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were only found for some items of the SD and SR subscales, respectively. Thus, EFA results suggested that, more than other model violations, multidimensionality caused the subscale data to show GRM misfit.

To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, with 88% of the responses in the most extreme category). We concluded that despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45, but it lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.
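Because the simulation generates data in which it is known which patterns are aberrant, the empirical Type I error rate and the detection rate reduce to two proportions: the share of flagged patterns among the model-consistent simulees and among the aberrant simulees, respectively. A minimal sketch of that bookkeeping (the flags below are hypothetical):

```python
def empirical_rates(flagged, is_aberrant):
    """Type I error rate = proportion of normal simulees flagged;
    detection rate = proportion of aberrant simulees flagged."""
    normal = [f for f, ab in zip(flagged, is_aberrant) if not ab]
    aberrant = [f for f, ab in zip(flagged, is_aberrant) if ab]
    return sum(normal) / len(normal), sum(aberrant) / len(aberrant)

# 2 of 8 normal simulees flagged (Type I errors); 1 of 2 aberrant detected
flags = [True, False, False, True, False, False, False, False, True, False]
truth = [False] * 8 + [True] * 2
type_i, detection = empirical_rates(flags, truth)
```

In the study, these proportions were averaged over 100 replicated data sets.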
Detection of Misfit and Follow-Up Analyses
For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, thus showing that her item scores were consistent with the expected GRM item scores.
Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale
misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpected high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpected low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.
Explanatory Person-Fit Analysis
For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ²(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance abuse disorder were significant. Patients with ADHD were unlikely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the negative effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpected high scores indicating severe symptoms. In general, these patients did not have large
Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit)

                         Model 1         Model 2
Intercept                −1.84 (0.11)    −1.93 (0.11)
Gender                   −0.12 (0.13)    −0.12 (0.13)
Measurement occasion     −0.17 (0.27)    −0.18 (0.27)
Diagnosis category^a
  Somatoform              0.57 (0.29)     0.74 (0.29)
  ADHD                   −0.58 (0.28)    −0.39 (0.28)
  Psychotic               1.05 (0.46)     1.13 (0.47)
  Borderline             −1.30 (0.72)    −1.39 (0.73)
  Impulse control         0.35 (0.36)     0.57 (0.36)
  Eating disorders       −1.10 (0.60)    −0.97 (0.60)
  Substance related       0.66 (0.33)     0.69 (0.33)
  Social/relational      −0.20 (0.26)     0.08 (0.27)
GAF code                  —              −0.17 (0.07)
OQ-45 total score         —               0.26 (0.07)

Note. N = 2,434. The correlation between the GAF code and the OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.
Downloaded from asm.sagepub.com at Seoul National University on December 18, 2014.
residuals for the same items. However, unexpected high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpected high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.
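The estimated misfit probabilities reported for Model 2 follow directly from the logistic regression coefficients in Table 2 via the inverse-logit transform, with the diagnosis effects acting as log-odds shifts relative to the mood/anxiety baseline. A quick numerical check (a sketch in Python; the coefficient values are those reported in Table 2, with covariates at their reference values):

```python
import math

def inv_logit(x):
    """Logistic function: converts a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Model 2 intercept and selected diagnosis-category coefficients (Table 2);
# baseline = mood or anxiety disorders.
intercept = -1.93
effects = {"baseline": 0.0, "somatoform": 0.74, "psychotic": 1.13, "substance": 0.69}

# Reproduces the probabilities reported in the text: .13, .23, .31, and .22.
for group, beta in effects.items():
    print(f"{group:>10}: P(misfit) = {inv_logit(intercept + beta):.2f}")
```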
Discussion
We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.
The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) also found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test-taking and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.
There are two possible general explanations for finding group differences with respect to the tendency to show person misfit: first, person misfit may be due to a mismatch between the OQ-45 and the specific disorder, and second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for diagnosing severe mental illnesses such as psychotic disorders and bipolar disorder.
The general-misfit explanation implies that methods other than self-report should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.
Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance has not previously been evaluated by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.
Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and
malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.
Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.
To conclude, our results have two main implications, pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data from outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.
Appendix A
Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j, j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by \theta, and P(X_j = x_j \mid \theta) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern \mathbf{x} of person i is given by

l_p(\mathbf{x}_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i). \quad (A1)

The standardized log-likelihood is defined as

l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E[l_p(\mathbf{x}_i)]}{\mathrm{VAR}[l_p(\mathbf{x}_i)]^{1/2}}, \quad (A2)

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.
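Equations (A1) and (A2) can be computed directly from the model-implied category probabilities; under local independence, the expectation and variance of l_p sum over items. A minimal sketch in Python (the function name and data layout are illustrative, not from the article):

```python
import math

def lz_p(probs, pattern):
    """Standardized log-likelihood person-fit statistic (Equations A1-A2).

    probs[j][m] = P(X_j = m | theta_i), the model-implied category
    probabilities for person i (assumed strictly positive);
    pattern[j] = observed score on item j.
    """
    # (A1) unstandardized log-likelihood of the observed item-score pattern
    l_p = sum(math.log(probs[j][x]) for j, x in enumerate(pattern))
    # Under local independence, E(l_p) and VAR(l_p) sum over items.
    e_lp = sum(p * math.log(p) for item in probs for p in item)
    var_lp = sum(
        sum(p * math.log(p) ** 2 for p in item)
        - sum(p * math.log(p) for p in item) ** 2
        for item in probs
    )
    # (A2) standardize; large negative values flag person misfit
    return (l_p - e_lp) / math.sqrt(var_lp)
```

For example, with five items whose category probabilities are (0.7, 0.2, 0.1), a pattern of modal scores yields a positive l_z^p, whereas a pattern of consistently unlikely scores yields a value well below the 5%-level critical value of -1.645.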
Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

e_{ij} = X_{ij} - E(X_{ij}), \quad (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m \, P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and variance equal to

\mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2. \quad (A4)

The standardized residual is given by

z_{e_{ij}} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \quad (A5)

To compute z_{e_{ij}}, latent trait \theta_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
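Equations (A3) to (A5) amount to standardizing each observed item score against the first two moments of its model-implied score distribution. A short illustrative sketch (names are hypothetical, not from the article):

```python
import math

def standardized_residual(probs_ij, x_ij):
    """Standardized item-score residual z_e (Equations A3-A5).

    probs_ij[m] = P(X_ij = m | theta_i); x_ij = observed score on item j.
    """
    e_x = sum(m * p for m, p in enumerate(probs_ij))        # E(X_ij)
    e_x2 = sum(m ** 2 * p for m, p in enumerate(probs_ij))  # E(X_ij^2)
    var_e = e_x2 - e_x ** 2                                 # (A4)
    return (x_ij - e_x) / math.sqrt(var_e)                  # (A3) and (A5)
```

For category probabilities (0.7, 0.2, 0.1), an observed score of 2 gives a large positive residual (about 2.41), i.e., an unexpectedly high item score of the kind inspected in the residual analysis.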
Appendix B

For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus
(Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as logit[p(θ)] = 1.702(αθ + δ). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong, based on Van Herk, Poortinga, & Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                  SR subscale                 IR subscale                 SD subscale
Item parameter    M (SD)       Range          M (SD)       Range          M (SD)       Range
Discrimination
  α_θ1            0.76 (0.47)  0.23, 1.46     0.84 (0.37)  0.36, 1.65     0.90 (0.26)  0.52, 1.32
  α_θ2            0.26 (0.50)  -0.09, 1.31    0.02 (0.37)  -0.48, 0.73    0.03 (0.40)  -0.74, 0.54
  α_θ3            —            —              —            —              0.02 (0.27)  -0.56, 0.53
Threshold
  δ_0             —            —              —            —              —            —
  δ_1             0.98 (0.70)  -0.25, 1.84    1.12 (0.66)  0.14, 2.21     1.76 (0.84)  -0.11, 3.08
  δ_2             0.10 (0.60)  -0.88, 0.80    0.07 (0.52)  -0.98, 0.95    0.77 (0.65)  -0.70, 1.75
  δ_3             -0.82 (0.53) -1.57, -0.19   -1.11 (0.67) -2.40, -0.18   -0.40 (0.53) -1.67, 0.62
  δ_4             -1.84 (0.50) -2.60, -1.26   -2.28 (0.87) -3.62, -1.09   -1.74 (0.53) -2.87, -0.70
Latent-trait correlations
  r_θ1θ2          .20                         .50                         -.42
  r_θ1θ3          —                           —                           .53
  r_θ2θ3          —                           —                           -.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization for Scientific Research (NWO 400-06-087, first author).
References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2
(Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236
Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.
Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., & Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.
Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.
Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf
Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.
Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.
Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.
Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.
Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.
Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.
Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.
Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
relative to the baseline category of patients with mood or
anxiety disorders Patients with somatoform disorders psy-
chotic disorders and substance-related disorders were morelikely to show misfit
Model 2 (Table 2 third column) also included GAF code
and OQ-45 total score Both effects were significant and
suggested that patients with higher levels of psychological
distress were more likely to show misfit After controlling
for GAF code and OQ-45 score the positive effect of
ADHD was not significant Hence patients with ADHD
were less likely to show misfit because they had less severe
levels of distress For the baseline category the estimated
probability of misfit was 13 For patients with somatoform
disorders psychotic disorders and substance-related disor-
ders the probability was 23 31 and 22 respectivelyTo investigate whether patients in the same diagnosis
category showed similar person misfit we compared the
standardized item-score residuals of the misfitting patterns
produced by patients with psychotic disorders (n = 7)
somatoform disorders (n = 16) or substance-related disor-
ders (n = 13) Most patients with a psychotic disorder had
low or average θ levels for each of the subscales Misfit was
due to several unexpected high scores indicating severe
symptoms In general these patients did not have large
Table 2 Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l zmp (1 = Significant Misfit at the
5 Level 0 = No Misfit)
Model 1 Model 2
Intercept minus184 (011) minus193 (011)
Gender minus012 (013) minus012 (013)
Measurement occasion minus017 (027) minus018 (027)
Diagnosis categorya
Somatoform 057 (029) 074 (029)
ADHD minus058 (028) minus039 (028)
Psychotic 105 (046) 113 (047)
Borderline minus130 (072) minus139 (073)
Impulse control 035 (036) 057 (036)
Eating disorders minus110 (060) minus097 (060)
Substance related 066 (033) 069 (033)
Socialrelational minus020 (026) 008 (027)
GAF code mdash minus017 (007)
OQ-45 total score mdash 026 (007)
Note N = 2434 The correlation between the GAF code and OQ-45 total score equaled minus26
a The mood and anxiety disorders category is used as the baseline categoryp lt 05 p lt 01 p lt 001
at Seoul National University on December 18 2014asmsagepubcomDownloaded from
7212019 Conijn 2014
httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 812
8 Assessment
residuals for the same items However unexpected high
scores on Item 25 (ldquodisturbing thoughts come into my mind
that I cannot get rid ofrdquo) were frequent (4 patients) which is
consistent with the symptoms characterizing psychotic dis-
orders Low to average θ levels combined with several
unexpected high item scores suggested that patients with
psychotic disorders showed misfit because many OQ-45
items were irrelevant to them suggesting lack of traitedness
(Reise amp Waller 1993) For the misfitting patients with a
somatoform disorder or a substance-related disorder the
standardized residuals showed no common pattern and
thus they did not show similar person misfit
Discussion
We investigated prevalence and explanations of aberrant
responding to the OQ-45 by means of IRT-based person-fit
methods Reise and Waller (2009) suggested that IRT mod-
els fit poorly to psychopathology data and that misfit may
adversely affect PFA The GRM failed to fit the OQ-45data but consistent with previous research results (Conijn
et al 2014) our simulation results suggested that l zm p is
robust to model misfit The low empirical Type I error rates
suggested that GRM misfit did not lead to incorrect classi-
fication of model-consistent item-score patterns as misfit-
ting The current findings are valuable because they are
obtained under realistic conditions using a psychometri-
cally imperfect outcome measure that is frequently used in
practice We notice that this is the rule rather than the excep-
tion in general measurement instrumentsrsquo psychometric
properties are imperfect However our findings concerning
the robustness of l zm p
may only hold for the kind of model
misfit we found for the OQ-45 Future studies should inves-
tigate the robustness of PFA methods to different IRT model
violations
The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) also found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion, and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test-taking and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified; these results should therefore be interpreted with caution.
There are two possible general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder; second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for diagnosing severe mental illnesses such as psychotic disorders and bipolar disorder.
The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.
Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not been evaluated previously by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.
Person-fit statistics such as l_z^mp can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and
malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.
Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings, such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.
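The base-rate argument above can be made concrete with a small calculation. All numbers below are hypothetical, chosen only to illustrate how a low prevalence makes false alarms outnumber hits even for a reasonably powerful person-fit test:

```python
# Hypothetical illustration of the base-rate problem in routine PFA screening.
n_patients = 1000
prevalence = 0.02   # assume only 2% of patterns are truly aberrant
alpha = 0.05        # nominal Type I error rate of the person-fit test
power = 0.60        # assumed detection rate for truly aberrant patterns

# Expected counts among the screened patients
false_positives = (1 - prevalence) * n_patients * alpha  # fitting patterns flagged
true_positives = prevalence * n_patients * power         # aberrant patterns flagged

print(false_positives, true_positives)  # 49.0 vs. 12.0: false alarms dominate
```

Under these assumptions, roughly four out of five flagged patterns would be false alarms, which is the inefficiency the paragraph above describes.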
To conclude, our results have two main implications pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_z^mp is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_z^mp statistic to data from outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.
Appendix A
Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j; j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and d_j(m) = 0 otherwise. The unstandardized log-likelihood of an item-score pattern x of person i is given by

l_p(\mathbf{x}_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i)   (A1)

The standardized log-likelihood is defined as

l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E[l_p(\mathbf{x}_i)]}{\mathrm{VAR}[l_p(\mathbf{x}_i)]^{1/2}}   (A2)

where E[l_p(\mathbf{x}_i)] is the expected value and VAR[l_p(\mathbf{x}_i)] is the variance of l_p(\mathbf{x}_i).
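Given a vector of observed item scores and the model-implied category probabilities at a person's (estimated) trait value, Equations A1 and A2 can be computed directly. The sketch below is our own illustration, not the authors' implementation; the array layout and function name are assumptions. It exploits the fact that d_j(m) simply picks out the probability of the observed category:

```python
import numpy as np

def lz_p(x, P):
    """Standardized log-likelihood person-fit statistic l_z^p (Eqs. A1-A2).

    x : (J,) observed item scores, integer values 0..M
    P : (J, M+1) model-implied category probabilities P(X_j = m | theta_i),
        evaluated at the person's (estimated) trait value.
    """
    J = len(x)
    logP = np.log(P)
    # Eq. A1: log-likelihood of the observed pattern (d_j(m) selects one term)
    l_obs = logP[np.arange(J), x].sum()
    # Per-item expectation and variance of the item log-likelihood
    e_item = (P * logP).sum(axis=1)
    v_item = (P * logP**2).sum(axis=1) - e_item**2
    # Eq. A2: standardize against the model-implied mean and variance
    return (l_obs - e_item.sum()) / np.sqrt(v_item.sum())
```

For instance, a pattern that always uses each item's most likely category yields a positive l_z^p (better fit than expected), whereas consistently choosing each item's least likely category yields a strongly negative value.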
Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

e_{ij} = X_{ij} - E(X_{ij})   (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and variance equal to

\mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2   (A4)

The standardized residual is given by

z_{e_{ij}} = \frac{e_{ij}}{[\mathrm{VAR}(e_{ij})]^{1/2}}   (A5)

To compute z_{e_{ij}}, the latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
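Equations A3 to A5 reduce to a few lines of code for a single person-item combination. The sketch below is our own illustration (function name and inputs are assumptions); it takes the category probabilities at the estimated trait value as given:

```python
import numpy as np

def standardized_residual(x_ij, P_j):
    """Standardized item residual z_e_ij (Eqs. A3-A5).

    x_ij : observed score on item j (integer 0..M)
    P_j  : (M+1,) category probabilities P(X_ij = m | theta_i) evaluated
           at the person's estimated trait value.
    """
    m = np.arange(len(P_j))
    e_x = (m * P_j).sum()        # E(X_ij)
    e_x2 = (m**2 * P_j).sum()    # E(X_ij^2)
    var_e = e_x2 - e_x**2        # Eq. A4
    return (x_ij - e_x) / np.sqrt(var_e)  # Eqs. A3 and A5 combined
```

A score equal to the item's expected value gives a residual of 0; scores far above or below the expectation give large positive or negative z values.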
Appendix B
For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus
(Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as logit[p(θ)] = −1.702(αθ + δ). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.
We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, & Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_z^mp and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
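The two misfit mechanisms can be sketched as follows. This is a simplified, non-authoritative illustration: the study generated data from the estimated MIRT model and induced acquiescence by shifting the δ thresholds by 1, 2, or 3 points before data generation, whereas this sketch operates directly on already-generated score vectors; the function names and seed are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 4   # OQ-45 items have five ordered categories, scored 0..4
J = 45  # number of OQ-45 items

def add_random_error(scores, n_items, rng):
    """Random-error misfit: replace the scores on n_items randomly chosen
    items (10, 20, or 30 in the study) with uniform random category scores."""
    out = scores.copy()
    items = rng.choice(len(out), size=n_items, replace=False)
    out[items] = rng.integers(0, M + 1, size=n_items)
    return out

def add_acquiescence(scores, shift):
    """Acquiescence misfit, approximated here as an upward shift of the
    observed scores clipped to the score range (a coarse stand-in for the
    threshold shifts used in the study)."""
    return np.clip(scores + shift, 0, M)

# Demo on an arbitrary score vector (uniform scores, for brevity only)
base = rng.integers(0, M + 1, size=J)
noisy = add_random_error(base, 20, rng)  # moderate random error
acq = add_acquiescence(base, 2)          # moderate acquiescence
```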
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                 SR subscale                  IR subscale                  SD subscale
Item parameter   M (SD)        Range          M (SD)        Range          M (SD)        Range
Discrimination
  α_θ1           0.76 (0.47)   0.23 to 1.46   0.84 (0.37)   0.36 to 1.65   0.90 (0.26)   0.52 to 1.32
  α_θ2           0.26 (0.50)   −0.09 to 1.31  0.02 (0.37)   −0.48 to 0.73  0.03 (0.40)   −0.74 to 0.54
  α_θ3           —             —              —             —              0.02 (0.27)   −0.56 to 0.53
Threshold
  δ_0            —             —              —             —              —             —
  δ_1            0.98 (0.70)   −0.25 to 1.84  1.12 (0.66)   0.14 to 2.21   1.76 (0.84)   −0.11 to 3.08
  δ_2            0.10 (0.60)   −0.88 to 0.80  0.07 (0.52)   −0.98 to 0.95  0.77 (0.65)   −0.70 to 1.75
  δ_3            −0.82 (0.53)  −1.57 to −0.19 −1.11 (0.67)  −2.40 to −0.18 −0.40 (0.53)  −1.67 to 0.62
  δ_4            −1.84 (0.50)  −2.60 to −1.26 −2.28 (0.87)  −3.62 to −1.09 −1.74 (0.53)  −2.87 to −0.70
Latent-trait correlations
  r_θ1θ2         .20                          .50                          −.42
  r_θ1θ3         —                            —                            .53
  r_θ2θ3         —                            —                            −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization for Scientific Research (NWO 400-06-087; first author).
References
Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire–45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236
Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.
Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., … Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.
Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.
Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic%201-v11.pdf
Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.
Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.
Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.
Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.
Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.
Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.
Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.
Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
Page 8
7212019 Conijn 2014
httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 812
8 Assessment
residuals for the same items However unexpected high
scores on Item 25 (ldquodisturbing thoughts come into my mind
that I cannot get rid ofrdquo) were frequent (4 patients) which is
consistent with the symptoms characterizing psychotic dis-
orders Low to average θ levels combined with several
unexpected high item scores suggested that patients with
psychotic disorders showed misfit because many OQ-45
items were irrelevant to them suggesting lack of traitedness
(Reise amp Waller 1993) For the misfitting patients with a
somatoform disorder or a substance-related disorder the
standardized residuals showed no common pattern and
thus they did not show similar person misfit
Discussion
We investigated prevalence and explanations of aberrant
responding to the OQ-45 by means of IRT-based person-fit
methods Reise and Waller (2009) suggested that IRT mod-
els fit poorly to psychopathology data and that misfit may
adversely affect PFA The GRM failed to fit the OQ-45data but consistent with previous research results (Conijn
et al 2014) our simulation results suggested that l zm p is
robust to model misfit The low empirical Type I error rates
suggested that GRM misfit did not lead to incorrect classi-
fication of model-consistent item-score patterns as misfit-
ting The current findings are valuable because they are
obtained under realistic conditions using a psychometri-
cally imperfect outcome measure that is frequently used in
practice We notice that this is the rule rather than the excep-
tion in general measurement instrumentsrsquo psychometric
properties are imperfect However our findings concerning
the robustness of l zm p
may only hold for the kind of model
misfit we found for the OQ-45 Future studies should inves-
tigate the robustness of PFA methods to different IRT model
violations
The detection rate of 126 in the OQ-45 data is sub-
stantial and comparable with detection rates found in other
studies using different measures For example using
repeated measurements on the StatendashTrait Anxiety
Inventory (Spielberger Gorsuch Lushene Vagg amp Jacobs
1983) Conijn et al (2013) found detection rates of 11 to
14 in a sample of cardiac patients and Conijn et al (2014)
also found 16 misfit on the International Personality Item
Pool 50-item questionnaire (Goldberg et al 2006) in a
panel sample from the general populationConsistent with previous research (Conijn et al 2013
Reise amp Waller 1993 Woods et al 2008) we found that
more severely distressed patients more likely showed mis-
fit Also patients with somatoform disorders psychotic dis-
orders and substance-related disorders more likely showed
misfit Plausible explanations for these results include
patients tending to deny their mental problems (somatoform
disorders) symptoms of disorder and confusion and the
irrelevance of most OQ-45 items (psychotic disorders)
being under the influence during test-taking and the nega-
tive effect of long-term substance use on cognitive abilities
(substance-related disorders) The residual analysis con-
firmed the explanation of limited item relevance for patients
with psychotic disorders but also suggested that patients
identified with the same type of disorder generally did not
show the same type of misfit However as the simulation
study revealed that the residual statistic had low power sug-
gesting only item scores reflecting large misfit are identi-
fied results should be interpreted with caution
There are two possible general explanations for finding
group differences with respect to the tendency to show per-
son misfit First person misfit may be due to a mismatch
between the OQ-45 and the specific disorder and second
misfit may be due to a general tendency to show misfit on
self-report measures Each explanation has a unique impli-
cation for outcome measurement The mismatch explana-
tion implies that disease-specific outcome measures rather
than general outcome measures should be used Examples
of disease-specific outcome measures are the Patient HealthQuestionnaire-15 (Kroenke Spitzer amp Williams 2002) for
assessing somatic symptoms and the Severe Outcome
Questionnaire (Burlingame Thayer Lee Nelson amp
Lambert 2007) to diagnose severe mental illnesses such as
psychotic disorders and bipolar disorder
The general-misfit explanation implies that other meth-
ods than self-reports should be used for patientsrsquo treatment
decisions for example clinician-rated outcome measures
such as the Health of the Nation Outcome Scales (Wing et
al 1998) Also these patientsrsquo self-report results should be
excluded from cost-effectiveness studies to prevent poten-
tial negative effects on policy decisions Future studies may
address the scale-specific misfit versus general-misfit
explanations by means of explanatory PFA of data from
other outcome measures
Residual statistics have shown to be useful in real-data
applications for analyzing causes of aberrant responding to
unidimensional personality scales containing at least 30
items (Ferrando 2010 2012) but their performance has not
been evaluated previously by means of simulations Our
simulation study and real-data application question the use-
fulness of the residual statistic for retrieving causes of mis-
fit in outcome measures consisting of multiple short
subscales An alternative method is the inspection of item
content to identify unlikely combinations of item scores in particular when outcome measures contain few items Such
a qualitative approach to follow-up PFA could be particu-
larly useful when clinicians use the OQ-45 to provide feed-
back to the patient
Person-fit statistics such as l zm p can potentially detect
aberrant responding due to low traitedness low motivation
cognitive deficits and concentration problems However
an important limitation for outcome measurement is that the
statistics have low power to identify response styles and
at Seoul National University on December 18 2014asmsagepubcomDownloaded from
7212019 Conijn 2014
httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 912
Conijn et al 9
malingering These types of aberrant responding result in
biased total scores but do not necessarily produce an incon-
sistent pattern of item scores across the complete measure
(Ferrando amp Chico 2001) Hence future research might
also use measures especially designed to identify response
styles (Cheung amp Rensvold 2000) and malingering and
compare the results with those from general-purpose per-
son-fit statistics Inconsistency scales such as the TRIN and
VRIN potentially detect the same types of generic response
inconsistencies as person-fit indices but to our knowledge
only one study has compared their performance to that of
PFA (Egberink 2010) PFA was found to outperform the
inconsistency scale but relative performance may depend
on dimensionality and test length Hence more research is
needed on this subject
Other suggestions for future research include the fol-
lowing A first question is whether the prevalence of aber-
rant responding increases (eg due to diminishing
motivation) or decreases (eg due to familiarity with the
questions) with repeated outcome measurements A sec-ond question is whether the prevalence of aberrant
responding justifies routine application of PFA in clinical
settings such as Patient Reported Outcomes Measurement
Information System If prevalence among the tested
patients is low the number of item-score patterns incor-
rectly classified as aberrant (ie Type I errors) may out-
number the correctly identified aberrant item-score
patterns (Piedmont McCrae Riemann amp Angleitner
2000) Then PFA becomes very inefficient and it is
unlikely to improve the quality of individual decision
making in clinical practice Third future research may use
group-level analysis such as differential item functioning
analysis (Thissen Steinberg amp Wainer 1993) or IRT mix-
ture modeling (Rost 1990) to study whether patients with
the same disorder showed similar patterns of misfit
To conclude our results have two main implications
pertaining to psychometrics and psychopathology First
the simulation study results suggest that l zm p is useful for
application to outcome measures despite moderate model
misfit due to multidimensionality As data of other psy-
chopathology measures also have been shown to be incon-
sistent with assumptions of IRT models the results of the
simulation study are valuable when considering applica-
tion of the l zm p
statistic to data of outcome measures
Second our results suggest that general outcome measuressuch as the OQ-45 may not be equally suitable for patients
with different disorders Also more severely distressed
patients for whom psychological intervention is mostly
needed appear to be at the largest risk to produce invalid
outcome scores Overall our results emphasize the impor-
tance of person misfit identification in outcome measure-
ment and demonstrate that PFA may be useful for
preventing incorrect decision making in clinical practice
due to aberrant responding
Appendix A

Statistic $l_z^p$

Suppose the data are polytomous scores on $J$ items (items are indexed $j$; $j = 1, \ldots, J$) with $M + 1$ ordered answer categories. Let the score on item $j$ be denoted by $X_j$, with possible realizations $x_j = 0, \ldots, M$. The latent trait is denoted by $\theta$, and $P(X_j = x_j \mid \theta)$ is the probability of a score $X_j = x_j$. Let $d_j(m) = 1$ if $x_j = m$ $(m = 0, \ldots, M)$, and 0 otherwise. The unstandardized log-likelihood of an item-score pattern $\mathbf{x}$ of person $i$ is given by

$$ l_p(\mathbf{x}_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i) \quad \text{(A1)} $$
The standardized log-likelihood is defined as

$$ l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E[l_p(\mathbf{x}_i)]}{\mathrm{VAR}[l_p(\mathbf{x}_i)]^{1/2}} \quad \text{(A2)} $$

where $E[l_p(\mathbf{x}_i)]$ is the expected value and $\mathrm{VAR}[l_p(\mathbf{x}_i)]$ is the variance of $l_p$.
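A direct implementation of Equations A1 and A2 is short once the category probabilities are available. The sketch below assumes the probabilities $P(X_j = m \mid \theta_i)$ have already been computed from a fitted IRT model; the function name and inputs are our own:

```python
import math

def lz_p(item_probs, responses):
    """Standardized log-likelihood person-fit statistic l_z^p (Eqs. A1-A2).

    item_probs: per-item lists of category probabilities P(X_j = m | theta_i).
    responses:  observed category x_j for each item.
    """
    # Eq. A1: log-likelihood of the observed item-score pattern
    l = sum(math.log(p[x]) for p, x in zip(item_probs, responses))
    # Expected value of l_p under the model
    e = sum(p[m] * math.log(p[m]) for p in item_probs for m in range(len(p)))
    # Variance of l_p: per-item variance of ln P, summed over items
    v = sum(
        sum(p[m] * math.log(p[m]) ** 2 for m in range(len(p)))
        - sum(p[m] * math.log(p[m]) for m in range(len(p))) ** 2
        for p in item_probs
    )
    return (l - e) / math.sqrt(v)  # Eq. A2
```

Likely response patterns yield positive values; improbable patterns yield negative values, flagging possible misfit.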
Standardized Residual Statistic

The unstandardized residual for person $i$ on item $j$ is given by

$$ e_{ij} = X_{ij} - E(X_{ij}) \quad \text{(A3)} $$

where $E(X_{ij})$ is the expected value of $X_{ij}$, which equals $E(X_{ij}) = \sum_{m=0}^{M} m \, P(X_{ij} = m \mid \theta_i)$. The residual $e_{ij}$ has a mean of 0 and variance equal to

$$ \mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2 \quad \text{(A4)} $$
The standardized residual is given by

$$ ze_{ij} = \frac{e_{ij}}{\mathrm{VAR}(e_{ij})^{1/2}} \quad \text{(A5)} $$

To compute $ze_{ij}$, latent trait $\theta_i$ needs to be replaced by its estimated value. This may bias the standardization of $e_{ij}$.
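In code, Equations A3 to A5 amount to a few lines (a sketch with names of our own choosing; in practice the category probabilities would be evaluated at the estimated trait value, which is the source of the bias noted above):

```python
import math

def standardized_residual(probs, x):
    """Standardized residual ze_ij for one person-item pair (Eqs. A3-A5).

    probs: category probabilities P(X_ij = m | theta_i) for m = 0..M.
    x:     observed item score.
    """
    m_vals = range(len(probs))
    e_x = sum(m * probs[m] for m in m_vals)        # E(X_ij)
    e_x2 = sum(m ** 2 * probs[m] for m in m_vals)  # E(X_ij^2)
    var_e = e_x2 - e_x ** 2                        # Eq. A4
    return (x - e_x) / math.sqrt(var_e)            # Eqs. A3 and A5
```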
Appendix B

For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as

$$ P(\boldsymbol{\theta}) = \left\{ 1 + \exp\left[ -1.702 \left( \boldsymbol{\alpha}'\boldsymbol{\theta} + \delta \right) \right] \right\}^{-1} $$

Vector $\boldsymbol{\theta}$ denotes the latent traits and has a multivariate standard normal distribution; the $\theta$ correlations are estimated along with the item parameters of the MIRT model. Higher $\alpha$ values indicate better discriminating items, and higher $\delta$ values indicate more popular answer categories.
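As a concrete reading of this equation, the sketch below evaluates the logistic probability for a two-trait item (the function name and example parameter values are ours, drawn loosely from Table B1):

```python
import math

def cumulative_prob(theta, alpha, delta, scale=1.702):
    """Logistic MIRT probability P = 1 / (1 + exp(-1.702 * (alpha'theta + delta))).

    theta, alpha: sequences with one entry per latent trait (alpha'theta is
    their inner product); delta: category threshold parameter.
    """
    z = scale * (sum(a * t for a, t in zip(alpha, theta)) + delta)
    return 1.0 / (1.0 + math.exp(-z))
```

At θ = 0 the probability is governed by δ alone; for example, shifting the average SR δ₄ of −1.84 upward by 3 points (the strong-acquiescence manipulation in the Table B1 note) gives a probability of about .88, matching the reported percentage of 4-scores.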
We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of $l_{zm}^p$ and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
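The random-error manipulation can be sketched as follows (a minimal version of our own; the exact generation scheme of St-Onge et al., 2011, may differ in detail):

```python
import random

def inject_random_error(pattern, n_random, n_categories=5, rng=None):
    """Overwrite scores on n_random randomly chosen items with uniform random
    categories (0..n_categories-1), mimicking careless responding on part of
    the questionnaire. Assumes 5 answer categories, as on the OQ-45."""
    rng = rng or random.Random()
    out = list(pattern)
    for j in rng.sample(range(len(out)), n_random):
        out[j] = rng.randrange(n_categories)
    return out
```

Applied to a model-generated 45-item pattern with `n_random` set to 10, 20, or 30, this reproduces the three random-error conditions.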
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                          SR subscale                IR subscale                SD subscale
Item parameter       M (SD)        Range         M (SD)        Range         M (SD)        Range
Discrimination
  α_θ1               0.76 (0.47)   0.23, 1.46    0.84 (0.37)   0.36, 1.65    0.90 (0.26)   0.52, 1.32
  α_θ2               0.26 (0.50)   −0.09, 1.31   0.02 (0.37)   −0.48, 0.73   0.03 (0.40)   −0.74, 0.54
  α_θ3               —             —             —             —             0.02 (0.27)   −0.56, 0.53
Threshold
  δ_0                —             —             —             —             —             —
  δ_1                0.98 (0.70)   −0.25, 1.84   1.12 (0.66)   0.14, 2.21    1.76 (0.84)   −0.11, 3.08
  δ_2                0.10 (0.60)   −0.88, 0.80   0.07 (0.52)   −0.98, 0.95   0.77 (0.65)   −0.70, 1.75
  δ_3                −0.82 (0.53)  −1.57, −0.19  −1.11 (0.67)  −2.40, −0.18  −0.40 (0.53)  −1.67, 0.62
  δ_4                −1.84 (0.50)  −2.60, −1.26  −2.28 (0.87)  −3.62, −1.09  −1.74 (0.53)  −2.87, −0.70
Latent-trait correlations
  r_θ1θ2             .20                         .50                         −.42
  r_θ1θ3             —                           —                           .53
  r_θ2θ3             —                           —                           −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27 and 25, respectively; for moderate acquiescence, the percentages were 20 and 45; and for strong acquiescence, the percentages were 12 and 88.
Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organisation for Scientific Research (NWO 400-06-087, first author).
References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway, F. (2002). Outcome measurement in mental health – welcome to the revolution. British Journal of Psychiatry, 181, 1-2.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.

Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236

Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.

Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.

Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., & Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.

Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.

Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.

Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.

Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.

Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.

Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.

Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.

Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
Page 9
7212019 Conijn 2014
httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 912
Conijn et al 9
malingering These types of aberrant responding result in
biased total scores but do not necessarily produce an incon-
sistent pattern of item scores across the complete measure
(Ferrando amp Chico 2001) Hence future research might
also use measures especially designed to identify response
styles (Cheung amp Rensvold 2000) and malingering and
compare the results with those from general-purpose per-
son-fit statistics Inconsistency scales such as the TRIN and
VRIN potentially detect the same types of generic response
inconsistencies as person-fit indices but to our knowledge
only one study has compared their performance to that of
PFA (Egberink 2010) PFA was found to outperform the
inconsistency scale but relative performance may depend
on dimensionality and test length Hence more research is
needed on this subject
Other suggestions for future research include the fol-
lowing A first question is whether the prevalence of aber-
rant responding increases (eg due to diminishing
motivation) or decreases (eg due to familiarity with the
questions) with repeated outcome measurements A sec-ond question is whether the prevalence of aberrant
responding justifies routine application of PFA in clinical
settings such as Patient Reported Outcomes Measurement
Information System If prevalence among the tested
patients is low the number of item-score patterns incor-
rectly classified as aberrant (ie Type I errors) may out-
number the correctly identified aberrant item-score
patterns (Piedmont McCrae Riemann amp Angleitner
2000) Then PFA becomes very inefficient and it is
unlikely to improve the quality of individual decision
making in clinical practice Third future research may use
group-level analysis such as differential item functioning
analysis (Thissen Steinberg amp Wainer 1993) or IRT mix-
ture modeling (Rost 1990) to study whether patients with
the same disorder showed similar patterns of misfit
To conclude our results have two main implications
pertaining to psychometrics and psychopathology First
the simulation study results suggest that l zm p is useful for
application to outcome measures despite moderate model
misfit due to multidimensionality As data of other psy-
chopathology measures also have been shown to be incon-
sistent with assumptions of IRT models the results of the
simulation study are valuable when considering applica-
tion of the l zm p
statistic to data of outcome measures
Second our results suggest that general outcome measuressuch as the OQ-45 may not be equally suitable for patients
with different disorders Also more severely distressed
patients for whom psychological intervention is mostly
needed appear to be at the largest risk to produce invalid
outcome scores Overall our results emphasize the impor-
tance of person misfit identification in outcome measure-
ment and demonstrate that PFA may be useful for
preventing incorrect decision making in clinical practice
due to aberrant responding
Appendix A
Statistic l z p
Suppose the data are polytomous scores on J items (items
are indexed j j = 1 J ) with M + 1 ordered answer cat-
egories Let the score on item j be denoted by X j with
possible realizations x M j = hellip0 The latent trait isdenoted by θ and P X x j j( | )= θ is the probability of a
score X j = x j Let d m j ( ) =1 if x m m M j = = hellip( )0 and 0
otherwise The unstandardized log-likelihood of an item-
score pattern x of person i is given by
l d m P X m pi
j
J
m
M
j j i( ) x = ( ) =( )= =
sumsum1 0
ln |θ (A1)
The standardized log-likelihood is defined as
l
l E l
VAR l
z p i
pi
pi
pi
x
x x
x
( ) = ( ) minus ( )
( )
( )
1
2
(A2)
where E l p( ) is the expected value and VAR l p( ) is the
variance of l p
Standardized Residual Statistic
The unstandardized residual for person i on item j is given by
e X E X ij ij ij= minus ( ) (A3)
where E X ij( ) is the expected value of X ij which equals
E X m P X mij
m
M
ij i( ) = =( )=
sum0
|θ The residual eij has a mean
of 0 and variance equal to
VAR e E X E X ij ij ij( ) = ( ) minus ( ) 2 2
(A4)
The standardized residual is given by
ze e
VAR eij
ij
ij
=
( )
(A5)
To compute zeij latent trait θ
i needs to be replaced by its
estimated value This may bias the standardization of eij
Appendix B
For each OQ-45 subscale we estimated an exploratory mul-
tidimensional IRT (MIRT Reckase 2009) model Based on
results of exploratory factor analyses conducted in Mplus
at Seoul National University on December 18 2014asmsagepubcomDownloaded from
7212019 Conijn 2014
httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 1012
10 Assessment
(Mutheacuten amp Mutheacuten 2007) we used a MIRT model with
two factors for both the SR and IR subscale and three fac-
tors for the SD subscale The MIRT model is defined as
logit[ p(θ )] = minus1702(αθ + δ) Vector θ denotes the latent
traits and has a multivariate standard normal distribution
where the θ correlations are estimated along with the item
parameters of the MIRT model Higher α values indicate
better discriminating items and higher δ values indicate
more popular answer categories
We used the MIRT parameter estimates (Table B1) for
generating replicated OQ-45 data sets In each replication
we included two types of misfitting item-score patterns
patterns were due to random error or acquiescence
Following St-Onge Valois Abdous and Germain (2011)
we simulated varying levels of random error (random scores
on 10 20 or 30 items) and varying levels of acquiescence
(weak moderate and strong based on Van Herk Poortinga
and Verhallen 2004) The total percentage of misfitting pat-
terns in the data was 20 (Conijn et al 2013 Conijn et al
2014) Based on 100 replicated data sets the average Type
I error rate and the average detection rates of l zm p and the
standardized residual were computed For computing the
person-fit statistics GRM parameters were estimated using
MULTILOG 7 (Thissen et al 2003)
Table B1 Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation
SR subscale IR subscale SD subscale
Item parameter M (SD) Range M (SD) Range M (SD) Range
Discrimination
α θ 1 076 (047) 023 146 084 (037) 036 165 090 (026) 052 132
α θ 2 026 (050) minus009 131 002 (037) minus048 073 003 (040) minus074 054
α θ 3 mdash mdash mdash mdash 002 (027) minus056 053
Threshold
δ 0 mdash mdash mdash mdash mdash mdash
δ 1 098 (070) minus025 184 112 (066) 014 221 176 (084) minus011 308
δ 2 010 (060) minus088 080 007 (052) minus098 095 077 (065) minus070 175
δ 3 minus082 (053) minus157 minus019 minus111 (067) minus240 minus018 minus040 (053) minus167 062
δ 4 minus184 (050) minus260 minus126 minus228 (087) minus362 minus109 minus174 (053) minus287 minus070
Latent-trait correlations
r θ θ 1 2 20 50 minus42r
θ θ 1 3 mdash mdash 53
r θ θ 2 3 mdash mdash minus55
Note To simulate weak moderate and strong acquiescence the δ s were shifted 1 2 or 3 points respectively (Cheung amp Rensvold 2000) For thepositively worded items the δ shift was positive and for the negatively worded items the δ shift was negative For weak acquiescence the averagepercentages of respondentsrsquo 3-scores and 4-scores were 27 and 25 respectively for moderate acquiescence percentages were 20 and 45 and forstrong acquiescence percentages were 12 and 88
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect
to the research authorship andor publication of this article
Funding
The author(s) disclosed receipt of the following financial supportfor the research authorship andor publication of this article
This research was supported by a grant from the Netherlands
Organization of Scientific Research (NWO 400-06-087 first
author)
References
Atre-Vaidya N Taylor M A Seidenberg M Reed R
Perrine A amp Glick-Oberwise F (1998) Cognitive deficits
psychopathology and psychosocial functioning in bipolar
mood disorder Neuropsychiatry Neuropsychology and
Behavioral Neurology 11 120-126
Baker F B amp Kim S-H (2004) Item response theory
Parameter estimation techniques (2nd ed) New York NY
Marcel Dekker
Bickman L amp Salzer M S (1997) Introduction Measuring
quality in mental health services Evaluation Review 21285-291
Bludworth J L Tracey T J G amp Glidden-Tracey C (2010)
The bilevel structure of the Outcome Questionnairendash45
Psychological Assessment 22 350-355
Burlingame G M Thayer S D Lee J A Nelson P L amp
Lambert M J (2007) Administration amp scoring manual for
the Severe Outcome Questionnaire (SOQ) Salt Lake City
UT OQ Measures
Butcher J N Graham J R Ben-Porath Y S Tellegen
A Dahlstrom W G amp Kaemmer B (2001) MMPI-2
at Seoul National University on December 18 2014asmsagepubcomDownloaded from
7212019 Conijn 2014
httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 1112
Conijn et al 11
(Minnesota Multiphasic Personality Inventory-2) Manual
for administration and scoring (Rev ed) Minneapolis
University of Minnesota Press
Cella D Yount S Rothrock N Gershon R Cook K amp
Reeve B on behalf of the PROMIS Cooperative Group
(2007) The Patient Reported Outcomes Measurement
Information System (PROMIS) Progress of an NIH Roadmap
Cooperative Group during its first two years Medical Care45 S3-S11
Cheung G W amp Rensvold R B (2000) Assessing extreme and
acquiescence response sets in cross-cultural research using
structural equations modeling Journal of Cross-Cultural
Psychology 31 187-212
Conijn J M Emons W H M amp Sijtsma K (2014) Statistic l z
based person-fit methods for non-cognitive multiscale mea-
sures Applied Psychological Measurement 38 122-136
Conijn J M Emons W H M Van Assen M A L M Pedersen
S S amp Sijtsma K (2013) Explanatory multilevel person-fit
analysis of response consistency on the Spielberger State-Trait
Anxiety Inventory Multivariate Behavioral Research 48
692-718
Conrad K J Bezruczko N Chan Y F Riley B DiamondG amp Dennis M L (2010) Screening for atypical suicide
risk with person fit statistics among people presenting to alco-
hol and other drug treatment Drug and Alcohol Dependence
106 92-100
Cuijpers P Li J Hofmann S G amp Andersson G (2010) Self-
reported versus clinician-rated symptoms of depression as
outcome measures in psychotherapy research on depression
A meta-analysis Clinical Psychology Review 30 768-778
De Jong K amp Nugter A (2004) De Outcome Questionnaire
Psychometrische kenmerken van de Nederlandse vertal-
ing [The Outcome Questionnaire Psychometric properties
of the Dutch translation] Nederlands Tijdschrift voor de
Psychologie 59 76-79
De Jong K Nugter M A Polak M G Wagenborg J E
A Spinhoven P amp Heiser W J (2007) The Outcome
Questionnaire-45 in a Dutch population A cross cultural vali-
dation Clinical Psychology amp Psychotherapy 14 288-301
De la Torre J amp Deng W (2008) Improving person-fit
assessment by correcting the ability estimate and its refer-
ence distribution Journal of Educational Measurement 45
159-177
Derogatis L R (1993) BSI Brief Symptom Inventory
Administration scoring and procedures manual (4th ed)
Minneapolis MN National Computer Systems
Drasgow F Levine M V amp McLaughlin M E (1987)
Detecting inappropriate test scores with optimal and practical
appropriateness indices Applied Psychological Measurement 11 59-79
Drasgow F Levine M V amp McLaughlin M E (1991)
Appropriateness measurement for some multidimensional test
batteries Applied Psychological Measurement 15 171-191
Drasgow F Levine M V Tsien S Williams B amp Mead A
D (1995) Fitting polytomous item response theory models
to multiple-choice tests Applied Psychological Measurement
19 143-165
Drasgow F Levine M V amp Williams E A (1985) Appropriateness
measurement with polychotomous item response models and
standardized indices British Journal of Mathematical and
Statistical Psychology 38 67-86
Doucette A amp Wolf A W (2009) Questioning the measure-
ment precision of psychotherapy research Psychotherapy
Research 19 374-389
Egberink I J L (2010) The use of different types of validity
indicators in personality assessment (Doctoral dissertation)
University of Groningen Netherlands httpirsubrugnl ppn32993466X
Embretson S E amp Reise S P (2000) Item response theory for
psychologists Mahwah NJ Erlbaum
Emons W H M (2008) Nonparametric person-fit analysis of
polytomous item scores Applied Psychological Measurement
32 224-247
Evans C Connell J Barkham M Margison F Mellor-Clark
J McGrath G amp Audin K (2002) Towards a standardised
brief outcome measure Psychometric properties and utility
of the CORE-OM British Journal of Psychiatry 180 51-60
Ferrando P J (2010) Some statistics for assessing person-fit
based on continuous-response models Applied Psychological
Measurement 34 219-237
Ferrando P J (2012) Assessing inconsistent responding in E and N measures An application of person-fit analysis in personal-
ity Personality and Individual Differences 52 718-722
Ferrando P J amp Chico E (2001) Detecting dissimulation in
personality test scores A comparison between person-fit
indices and detection scales Educational and Psychological
Measurement 61 997-1012
Forero C G amp Maydeu-Olivares A (2009) Estimation of IRT
graded response models Limited versus full information
methods Psychological Methods 14 275-299
Goldberg L R Johnson J A Eber H W Hogan R
Ashton M C Cloninger C R amp Gough H C (2006)
The International Personality Item Pool and the future of
public-domain personality measures Journal of Research in
Personality 40 84-96
Handel R W Ben-Porath Y S Tellegen A amp Archer R
P (2010) Psychometric functioning of the MMPI-2-RF
VRIN-r and TRIN-r scales with varying degrees of random-
ness acquiescence and counter-acquiescence Psychological
Assessment 22 87-95
Holloway F (2002) Outcome measurement in mental healthmdash
Welcome to the revolution British Journal of Psychiatry
181 1-2
Karabatsos G (2003) Comparing the aberrant response detec-
tion performance of thirty-six person-fit statistics Applied
Measurement in Education 16 277-298
Kenny D A Kaniskan B amp McCoach D B (2014) The per-
formance of RMSEA in models with small degrees of free-dom Sociological Methods and Research Advance online
publication doi1011770049124114543236
Kim S Beretvas S N amp Sherry A R (2010) A validation
of the factor structure of OQ-45 scores using factor mixture
modeling Measurement and Evaluation in Counseling and
Development 42 275-295
Kroenke K Spitzer R L amp Williams J B (2002) The PHQ-
15 validity of a new measure for evaluating the severity
of somatic symptoms Psychosomsatic Medicine 2002
258-266
at Seoul National University on December 18 2014asmsagepubcomDownloaded from
(Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as logit[p(θ)] = 1.702(αθ + δ). The vector θ denotes the latent traits and has a multivariate standard normal distribution; the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better-discriminating items, and higher δ values indicate more popular answer categories.
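To make the model concrete, the category probabilities it implies can be computed from the cumulative logits. The following is a minimal sketch, not the authors' code: the function name is illustrative, the 1.702 scaling and threshold structure follow the definition above, and the example parameter values are only illustrative choices near the Table B1 means.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def category_probs(theta, alpha, delta):
    """Category probabilities for one five-category item under the
    multidimensional graded response model sketched above:
    logit[P(X >= k)] = 1.702 * (alpha . theta + delta_k), k = 1..4,
    with delta_1 > ... > delta_4, so item scores run 0..4."""
    z = 1.702 * (np.dot(alpha, theta) + np.asarray(delta))
    p_geq = np.concatenate(([1.0], logistic(z)))  # P(X >= 0) = 1
    p_gt = np.concatenate((logistic(z), [0.0]))   # P(X >= 5) = 0
    return p_geq - p_gt                           # P(X = k), k = 0..4

# Illustrative values near the Table B1 means (two latent traits):
probs = category_probs(theta=np.array([0.0, 0.0]),
                       alpha=np.array([0.76, 0.26]),
                       delta=[0.98, 0.10, -0.82, -1.84])
print(probs.round(3))  # five probabilities that sum to 1
```

Because the thresholds decrease, the cumulative probabilities decrease in k and every category probability is nonnegative.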
We used the MIRT parameter estimates (Table B1) to generate replicated OQ-45 data sets. In each replication we included two types of misfitting item-score patterns: patterns due to random error and patterns due to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, & Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of the l^p_zm statistic and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
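The random-error manipulation can be sketched as follows. This is not the authors' simulation code: the function and variable names are illustrative, and the "clean" pattern below is a uniform placeholder where the study drew model-consistent scores from the MIRT model; only the corruption step is shown.

```python
import numpy as np

rng = np.random.default_rng(1)

def inject_random_error(scores, n_items_aberrant, n_cats=5):
    """Replace a random subset of item scores with uniformly random
    categories, mimicking careless (random-error) responding."""
    out = scores.copy()
    items = rng.choice(len(scores), size=n_items_aberrant, replace=False)
    out[items] = rng.integers(0, n_cats, size=n_items_aberrant)
    return out

# A 45-item, five-category pattern (placeholder for model-generated scores),
# then the three random-error levels used in the simulation:
clean = rng.integers(0, 5, size=45)
for k in (10, 20, 30):
    aberrant = inject_random_error(clean, k)
    print(k, int((aberrant != clean).sum()))  # at most k items change
```

Drawing a replacement score uniformly means it can coincide with the original, so the number of visibly changed items is at most the nominal level.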
Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                            SR subscale               IR subscale               SD subscale
Item parameter              M (SD)        Range       M (SD)        Range       M (SD)        Range
Discrimination
  α_θ1                      0.76 (0.47)   0.23, 1.46  0.84 (0.37)   0.36, 1.65  0.90 (0.26)   0.52, 1.32
  α_θ2                      0.26 (0.50)  -0.09, 1.31  0.02 (0.37)  -0.48, 0.73  0.03 (0.40)  -0.74, 0.54
  α_θ3                      —             —           —             —           0.02 (0.27)  -0.56, 0.53
Threshold
  δ_0                       —             —           —             —           —             —
  δ_1                       0.98 (0.70)  -0.25, 1.84  1.12 (0.66)   0.14, 2.21  1.76 (0.84)  -0.11, 3.08
  δ_2                       0.10 (0.60)  -0.88, 0.80  0.07 (0.52)  -0.98, 0.95  0.77 (0.65)  -0.70, 1.75
  δ_3                      -0.82 (0.53)  -1.57, -0.19 -1.11 (0.67) -2.40, -0.18 -0.40 (0.53) -1.67, 0.62
  δ_4                      -1.84 (0.50)  -2.60, -1.26 -2.28 (0.87) -3.62, -1.09 -1.74 (0.53) -2.87, -0.70
Latent-trait correlations
  r_θ1θ2                    .20                       .50                       -.42
  r_θ1θ3                    —                         —                          .53
  r_θ2θ3                    —                         —                         -.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27 and 25, respectively; for moderate acquiescence, the percentages were 20 and 45; and for strong acquiescence, the percentages were 12 and 88.
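The δ-shift manipulation described in the note can be sketched as below (an illustrative helper, not the authors' code). Because higher thresholds raise P(X ≥ k) at every category boundary, shifting all δs upward for positively worded items pushes responses toward the agree end of the scale.

```python
def shift_thresholds(delta, shift, positively_worded=True):
    """Acquiescence as a threshold shift (after Cheung & Rensvold, 2000):
    delta_k -> delta_k + shift for positively worded items, where
    shift = 1, 2, or 3 for weak/moderate/strong acquiescence, and
    delta_k -> delta_k - shift for negatively worded items."""
    s = shift if positively_worded else -shift
    return [d + s for d in delta]

delta = [0.98, 0.10, -0.82, -1.84]  # illustrative thresholds
for shift, label in [(1, "weak"), (2, "moderate"), (3, "strong")]:
    print(label, shift_thresholds(delta, shift))
```

With the strong shift (+3), even the highest threshold is well above zero, which is why almost all simulated responses land in the top categories.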
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organisation for Scientific Research (NWO 400-06-087, first author).
References
Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2
(Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y.-F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. G. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health: Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods & Research. Advance online publication. doi:10.1177/0049124114543236
Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.
Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.
Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., . . . Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.
Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.
Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf
Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.
Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.
Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.
Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.
Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.
Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.
Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.
Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.