Detecting and Explaining Aberrant Responding to the Outcome Questionnaire–45

Judith M. Conijn (1), Wilco H. M. Emons (1), Kim De Jong (2,3), and Klaas Sijtsma (1)

Assessment, 1–12. © The Author(s) 2014. DOI: 10.1177/1073191114560882

(1) School of Social and Behavioral Sciences, Tilburg University, Tilburg, Netherlands
(2) Institute of Psychology, Leiden University, Leiden, Netherlands
(3) Research Department, GGZ Noord-Holland-Noord

Corresponding Author: Judith M. Conijn, Department of Clinical Psychology, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, Netherlands. Email: [email protected]

Abstract

We applied item response theory based person-fit analysis (PFA) to data of the Outcome Questionnaire-45 (OQ-45) to investigate the prevalence and causes of aberrant responding in a sample of Dutch clinical outpatients. The l_z^p person-fit statistic was used to detect misfitting item-score patterns, and the standardized residual statistic was used to identify the source of the misfit in the item-score patterns flagged as misfitting. Logistic regression analysis was used to predict person misfit from clinical diagnosis, OQ-45 total score, and Global Assessment of Functioning code. The l_z^p statistic classified 12.6% of the item-score patterns as misfitting. Person misfit was positively related to the severity of psychological distress. Furthermore, patients with psychotic disorders, somatoform disorders, or substance-related disorders were more likely to show misfit than the baseline group of patients with mood and anxiety disorders. The results suggest that general outcome measures such as the OQ-45 are not equally appropriate for patients with different disorders. Our study emphasizes the importance of person-misfit detection in clinical practice.

Keywords: aberrant responding, item response theory, outcome measurement, Outcome Questionnaire–45, person-fit analysis

During the previous two decades, the growing interest in the quality of mental health care has led to an increased use of self-report outcome measures (Holloway, 2002). To monitor the effectiveness of treatment for individual patients, outcome measures that assess symptom severity and daily functioning are repeatedly administered during treatment. Based on the repeated measurements, the treatment plan can be altered if recovery does not proceed as expected (Lambert & Shimokawa, 2011). Furthermore, mental health care providers use these outcome data to evaluate treatment results at the institutional level, and insurance companies, health care managers, and other regulatory bodies use outcome measures for policy decisions aimed at improving cost-effectiveness (Bickman & Salzer, 1997). Examples of frequently used outcome measures are the Outcome Questionnaire–45 (OQ-45; Lambert et al., 2004), the Brief Symptom Inventory (BSI; Derogatis, 1993), and the Clinical Outcomes in Routine Evaluation–Outcome Measure (CORE-OM; Evans et al., 2002).

Given the importance of outcome measures for individual decision making in mental health care, their psychometric properties are a major concern (e.g., Doucette & Wolf, 2009; Pirkis et al., 2005). However, even if instruments have excellent psychometric properties, persons may respond aberrantly to clinical and personality scales, thus producing invalid test scores. In fact, response inconsistency on personality and psychopathology self-report inventories was found to be positively related to indicators of psychological distress, psychological problems, and negative affect (Conijn, Emons, Van Assen, Pedersen, & Sijtsma, 2013; Reise & Waller, 1993; Woods, Oltmanns, & Turkheimer, 2008), which suggests that mental health care patients may be inclined to respond aberrantly. Cognitive deficits that are commonly observed in mental illness may explain concentration problems that interfere with the quality of self-reports (Atre-Vaidya et al., 1998; Cuijpers, Li, Hofmann, & Andersson, 2010). However, potential causes of aberrant responding are numerous, including lack of motivation, response styles, idiosyncratic interpretation of item content, and low traitedness. Traitedness refers to the applicability of the trait to the respondent (Tellegen, 1988).


Aberrant responding provides clinicians with invalid information and, as a result, adversely affects the quality of treatment and diagnosis decisions (Conrad et al., 2010; Handel, Ben-Porath, Tellegen, & Archer, 2010). Person-fit analysis (PFA) involves statistical methods to detect aberrant item-score patterns that are due to aberrant responding. Conrad et al. (2010) provided an example of the potential of PFA for mental health care by using PFA to screen for atypical symptom profiles among persons at intake for drug or alcohol dependence treatment. They found that the persons with aberrant item-score patterns required different treatments than persons with model-consistent item-score patterns, and they concluded that PFA may detect inconsistencies that have important implications for treatment and diagnosis decisions. As self-report outcome measures are increasingly used to make treatment decisions in clinical practice, PFA may be a valuable screening tool in outcome measurement.

The importance of detecting aberrant responding has long been recognized. Both the original and current versions of the Minnesota Multiphasic Personality Inventory (Butcher et al., 2001; Handel et al., 2010) include scales to detect different types of aberrant responding. Examples are lie scales to detect faking good or faking bad, and indices based on the consistency of the responses to items that are either highly similar or opposite with respect to content, such as the Variable Response Inconsistency (VRIN) scale to detect random responding and the True Response Inconsistency (TRIN) scale to detect acquiescence. Despite validity scales' importance, outcome questionnaires typically do not include specialized scales for detecting aberrant responding (Lambert & Hawkins, 2004). One possible explanation is that, with the increasing demand for cost-effectiveness, time for assessment has been greatly reduced (Wood, Garb, Lilienfeld, & Nezworski, 2002). Consequently, outcome questionnaires are required to be short and efficient, which limits the use of validity scales consisting of additional items (e.g., lie scales) and limits the construction of TRIN and VRIN scales because fewer item pairs with similar or opposite content are available.

Person-Fit Analysis in Outcome Measurement

In this study, we used PFA to investigate the prevalence and possible causes of aberrant responding in outcome measurement by means of the OQ-45. In PFA, person-fit statistics signal whether an individual's item-score pattern is inconsistent with the item-score pattern expected under the particular measurement model (Meijer & Sijtsma, 2001). A significant discrepancy between the observed item-score pattern and the expected item-score pattern provides evidence of person misfit. Person misfit means that the individual's test score is unlikely to be meaningful in terms of the trait being measured. For noncognitive data, the l_z person-fit statistic (Drasgow, Levine, & McLaughlin, 1987) is one of the best performing and most popular person-fit statistics (Emons, 2008; Ferrando, 2012; Karabatsos, 2003). To determine whether an item-score pattern shows significant misfit, statistic l_z is compared with a cutoff value obtained under the item response theory (IRT; Embretson & Reise, 2000) model that serves as the null model of consistency (De la Torre & Deng, 2008; Nering, 1995). Statistic l_z detects various types of aberrant responding, such as acquiescence and extreme response style, but the statistic is most powerful for detecting random responding (Emons, 2008). In detecting random responding to 57 items measuring the Big Five personality factors, PFA has been found to outperform an inconsistency index based on the rationale of the Minnesota Multiphasic Personality Inventory VRIN scale (Egberink, 2010, pp. 94-100).

An advantage of statistic l_z and other person-fit statistics for application to outcome measurement is that they can be used to detect invalid test scores on any self-report scale that is consistent with an IRT model. Also, the rise of computerized and IRT-based outcome monitoring (e.g., Patient Reported Outcomes Measurement Information System; Cella et al., 2007) renders the implementation of PFA feasible. Along with the computer-generated test score, a person-fit value may be provided to the clinician, serving as an alarm bell warning that the test score may be invalid and that further inquiry may be useful.

Follow-up PFA of item-score patterns flagged as misfitting can help the clinician to infer possible explanations for an individual's observed aberrant responding. In personality measurement, Ferrando (2010) used item-score residuals for follow-up PFA and found that a person who had an aberrant item-score pattern on an extraversion scale showed unexpectedly low scores on items referring to situations in which the person could make a fool of himself. This result suggested that the aberrant responding was due to fear of being rejected. For another person, follow-up PFA suggested inattentiveness to reversed item wording. In outcome measurement for individual patients, follow-up PFA can inform the clinician about the sources of the misfit, and clinicians can discuss the unexpected item scores with the patients to obtain a better understanding of the patient's psychological profile.

PFA primarily focuses on individuals but can also be used to explain individual differences in aberrant responding at the group level; for examples, see Conijn, Emons, and Sijtsma (2014) and Conijn et al. (2013). In outcome measurement, PFA can be used to investigate the extent to which general measures are suited for assessing patients suffering from different disorders. General outcome measures, such as the OQ-45 and the CORE-OM, use items that assess the most common symptoms of psychopathology, such as those observed in depression and anxiety disorders (Lambert & Hawkins, 2004), and are also used to assess patients suffering from different specific disorders, varying from somatoform disorders to psychotic disorders and addiction. For rare or specific disorders, many of the general measures' items are irrelevant, and low traitedness may lead to inconsistent or unmotivated completion of outcome measures.

Goal of the Study

We investigated the prevalence and the causes of aberrant responding to the OQ-45 (Lambert et al., 2004). The OQ-45 is one of the most popular general outcome measures used in mental health care. We used OQ-45 data from a large Dutch clinical outpatient sample suffering from a large variety of disorders. The l_z person-fit statistic (Drasgow et al., 1987) was used to identify misfitting item-score patterns, and standardized item-score residuals (Ferrando, 2010, 2012) were used to investigate sources of item-score pattern misfit. We employed logistic regression analyses, using the l_z statistic as the dependent variable, to investigate whether patients suffering from specific disorders (e.g., somatoform disorders and psychotic disorders) and severely distressed patients are more predisposed to produce aberrant item-score patterns on the OQ-45 than other patients. Based on the results for the OQ-45, we discuss the possible causes of aberrant responding in outcome measurement in general and the potential of PFA for improving outcome-measurement practice.

Method

Participants

We performed a secondary analysis on data collected in routine mental health care. Participants were 2,906 clinical outpatients (42.1% male) from a mental health care institution with four different locations situated in Noord-Holland, a predominantly rural province in the Netherlands. Participants' age ranged from 17 to 77 years (M = 37, SD = 13). Apart from gender and age, no other demographic information was collected.

Most patients completed the OQ-45 at intake, but 160 (5.5%) patients completed the OQ-45 after treatment started. The sample included 2,632 (91%) patients with a clinician-rated Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) primary diagnosis on Axis I, 192 (7%) persons with a primary diagnosis on Axis II, and 82 (3%) patients for whom the primary diagnosis was missing. The most frequent primary diagnoses were depression (38%), anxiety disorders (20%), disorders usually first diagnosed in infancy, childhood, or adolescence (8%), personality disorders (7%), adjustment disorders (6%), somatoform disorders (3%), eating disorders (2%), and substance-related disorders (2%). Of the diagnosed patients, 13.1% had comorbidity between Axis I and Axis II, 32.0% had comorbidity within Axis I, and 0.1% had comorbidity within Axis II. The clinician had access to the OQ-45 data, but because the OQ-45 is not a diagnostic instrument, it was unlikely that diagnosis was based on the OQ-45 results.

Measures

The Outcome Questionnaire–45. The OQ-45 (Lambert et al., 2004) uses three subscales to measure symptom severity and daily functioning. The Social Role Performance (SR; 9 items, of which 3 are reversely worded) subscale measures dissatisfaction, distress, and conflicts concerning one's employment, education, or leisure pursuits. An example item is "I feel stressed at work/school." The Interpersonal Relations (IR; 11 items, of which 4 are reversely worded) subscale measures difficulty with family, friends, and the marital relationship. An example item is "I feel lonely." The Symptom Distress (SD; 25 items, of which 3 are reversely worded) subscale measures symptoms of the most frequently diagnosed mental disorders, in particular anxiety and depression. An example item is "I feel no interest in things." Respondents are instructed to express their feelings with respect to the past week on a 5-point rating scale with scores ranging from 0 (never) through 4 (almost always), higher scores indicating more psychological distress.

In this study, we used the Dutch OQ-45 (De Jong & Nugter, 2004). The Dutch OQ-45 total score has good concurrent and criterion-related validity (De Jong et al., 2007). In our sample, coefficient alpha for the subscale total scores equaled .65 (SR), .77 (IR), and .91 (SD). Results concerning the OQ-45 factor structure are ambiguous. Some studies provided support for the theoretical three-factor model for both the original OQ-45 (Bludworth, Tracey, & Glidden-Tracey, 2010) and the Dutch OQ-45 (De Jong et al., 2007). Other studies found poor fit of the theoretical three-factor model and suggested that a three-factor model based on a reduced item set (Kim, Beretvas, & Sherry, 2010) or a one-factor model (Mueller, Lambert, & Burlingame, 1998) showed better fit. In this study, we further investigated the fit of the theoretical three-factor model.
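For readers who wish to reproduce reliability estimates of this kind, the following minimal Python sketch computes coefficient alpha from a persons-by-items score matrix. The array `sd_scores` is a hypothetical stand-in for a subscale's item scores, not the study data, so the printed value is meaningless apart from illustrating the computation.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for an n_persons x n_items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    sum_item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - sum_item_var / total_var)

# Hypothetical 0-4 scored subscale matrix (random data, so alpha will be near 0;
# replace with real subscale scores to reproduce values such as .91 for SD).
sd_scores = np.random.default_rng(0).integers(0, 5, size=(200, 24))
print(round(cronbach_alpha(sd_scores), 2))
```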

Explanatory Variables for Person Misfit

Severity of distress. The OQ-45 total score and the clinician-rated DSM-IV Global Assessment of Functioning (GAF) code were taken as measures of the patient's severity of distress. The GAF code ranges from 1 to 100, with higher values indicating better psychological, social, and occupational functioning. The GAF code was missing for 187 (6%) patients.

Diagnosis category. The clinician-rated DSM-IV diagnosis was classified into nine categories representing the most common types of disorders present in the sample. Table 1 describes the diagnosis categories and the number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that misfit was less likely for patients suffering from these symptoms than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously in one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).

Statistical Analysis

Model-Fit Evaluation. We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. The core of the GRM consists of the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
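As an illustration of how the ISRFs determine category probabilities, the sketch below computes the response-category probabilities of a single polytomous item under the GRM; the discrimination and threshold values are hypothetical, not OQ-45 estimates.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities P(X = m | theta), m = 0..M, for one GRM item.

    theta : latent trait value
    a     : item discrimination
    b     : M ordered thresholds; the ISRF P(X >= m | theta) is logistic in
            a * (theta - b_m), and category probabilities are differences
            of adjacent ISRFs.
    """
    b = np.asarray(b, dtype=float)
    isrf = 1.0 / (1.0 + np.exp(-a * (theta - b)))      # P(X >= 1), ..., P(X >= M)
    upper = np.concatenate(([1.0], isrf))              # P(X >= 0) = 1
    lower = np.concatenate((isrf, [0.0]))              # P(X >= M + 1) = 0
    return upper - lower                               # sums to 1

# A hypothetical 5-category item rated 0 (never) to 4 (almost always):
print(grm_category_probs(theta=0.5, a=1.2, b=[-1.5, -0.5, 0.6, 1.8]))
```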

Satisfactory GRM fit to the data is a prerequisite for the application of GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR; Muthén & Muthén, 2007); RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust with respect to the identified OQ-45 model misfit (Conijn et al., 2014).

Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis

Category                                Common DSM-IV diagnoses included                                      n      Mean l_zm^p   n detected   % detected
Mood and anxiety disorder (a)           Depressive disorders, generalized anxiety disorders, phobias,        1,786   0.28          229          12.8
                                        panic disorders, posttraumatic stress disorder
Somatoform disorder                     Pain disorder, somatization disorder, hypochondriasis,                  82   0.16           16          19.5
                                        undifferentiated somatoform disorder
Attention deficit hyperactivity         Predominantly inattentive, combined hyperactive-impulsive              198   0.08           15           7.6
disorder (ADHD)                         and inattentive
Psychotic disorder                      Schizophrenia, psychotic disorder not otherwise specified               26   −0.10           7          26.9
Borderline personality disorder         Borderline personality disorder                                         53   0.35            2           3.8
Impulse-control disorders not           Impulse-control disorder, intermittent explosive disorder               58   0.02           10          17.2
elsewhere classified
Eating disorder                         Eating disorder not otherwise specified, bulimia nervosa                67   0.38            4           6.0
Substance-related disorder              Cannabis-related disorders, alcohol-related disorders                   58   0.09           13          22.4
Social and relational problem           Phase of life problem, partner relational problem,                     186   0.26           20          10.8
                                        identity problem

(a) Including 65 patients with a mood disorder.


Person-Fit Analysis

Detection of misfit. We used statistic l_z for polytomous item scores, denoted l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics.

Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reversely worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they may still be suspicious, as they may reflect gross under- or overreporting of symptoms.

Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.

Under the null model of fit to the IRT model and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each item-score pattern, we re-estimated θ and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p. The percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++. Algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
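The following Python sketch illustrates the logic of this bootstrap procedure for a single subscale; it is not the authors' C++ program. It uses a simple EAP grid estimator for θ in place of the Baker and Kim (2004) algorithms, hypothetical item parameters, and fewer replications than the 5,000 used in the study. The multiscale statistic l_zm^p would then be obtained by standardizing the sum of the subscale l_z^p values.

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_probs(theta, a, b):
    """Category probabilities for all items; a: (J,), b: (J, M) ordered thresholds."""
    isrf = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))
    upper = np.hstack([np.ones((len(a), 1)), isrf])
    lower = np.hstack([isrf, np.zeros((len(a), 1))])
    return upper - lower                                    # shape (J, M + 1)

def lz_p(pattern, theta, a, b):
    """Standardized log-likelihood person-fit statistic (Drasgow et al., 1985)."""
    P = grm_probs(theta, a, b)
    logP = np.log(P)
    loglik = logP[np.arange(len(pattern)), pattern].sum()
    e_item = (P * logP).sum(axis=1)                         # expected contribution per item
    v_item = (P * logP ** 2).sum(axis=1) - e_item ** 2      # variance per item
    return (loglik - e_item.sum()) / np.sqrt(v_item.sum())

def eap_theta(pattern, a, b, grid=np.linspace(-4, 4, 81)):
    """EAP estimate of theta under a standard normal prior (stand-in estimator)."""
    like = np.array([grm_probs(t, a, b)[np.arange(len(pattern)), pattern].prod()
                     for t in grid])
    post = like * np.exp(-grid ** 2 / 2)
    return (grid * post).sum() / post.sum()

def bootstrap_p_value(pattern, a, b, n_rep=1000):
    """Person-specific null distribution of lz_p (after De la Torre & Deng, 2008)."""
    theta_hat = eap_theta(pattern, a, b)
    observed = lz_p(pattern, theta_hat, a, b)
    P = grm_probs(theta_hat, a, b)
    null = np.empty(n_rep)
    for r in range(n_rep):
        sim = np.array([rng.choice(P.shape[1], p=P[j]) for j in range(P.shape[0])])
        null[r] = lz_p(sim, eap_theta(sim, a, b), a, b)
    return observed, (null <= observed).mean()              # one-tailed p value
```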

Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).
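A corresponding sketch for the standardized item-score residuals is shown below; the GRM item parameters and the estimated θ are assumed to be supplied (the values would be hypothetical, not the study's estimates).

```python
import numpy as np

def grm_probs(theta, a, b):
    """Category probabilities under the GRM; a: (J,), b: (J, M) ordered thresholds."""
    isrf = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))
    upper = np.hstack([np.ones((len(a), 1)), isrf])
    lower = np.hstack([isrf, np.zeros((len(a), 1))])
    return upper - lower

def standardized_residuals(pattern, theta, a, b):
    """z_e = (X - E[X]) / sqrt(VAR[X]) per item, with moments taken under the GRM."""
    P = grm_probs(theta, a, b)
    m = np.arange(P.shape[1])
    expected = (P * m).sum(axis=1)
    variance = (P * m ** 2).sum(axis=1) - expected ** 2
    return (np.asarray(pattern) - expected) / np.sqrt(variance)

# Items with |z_e| > 1.96 (alpha = .05) or |z_e| > 1.64 (alpha = .10) are flagged.
```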

Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
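A minimal sketch of such an explanatory model in Python (statsmodels) is shown below; the data frame and its column names are hypothetical placeholders for the analysis file described above, filled here with random values only so that the example runs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500
# Hypothetical analysis file: one row per patient.
df = pd.DataFrame({
    "misfit": rng.integers(0, 2, n),        # 1 = significant misfit on l_zm^p
    "gender": rng.integers(0, 2, n),        # 0 = male, 1 = female
    "occasion": rng.integers(0, 2, n),      # 0 = at intake, 1 = during treatment
    "diagnosis": rng.choice(["mood_anxiety", "somatoform", "psychotic"], n),
    "gaf": rng.normal(size=n),              # standardized GAF code
    "oq_total": rng.normal(size=n),         # standardized OQ-45 total score
})

model = smf.logit(
    "misfit ~ gender + occasion"
    " + C(diagnosis, Treatment(reference='mood_anxiety'))"
    " + gaf + oq_total",
    data=df,
).fit()
print(model.summary())
```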

Results

First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which the l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.

Model-Fit Evaluation

Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measure substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.

For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate (RMSEA = .13, SRMR = .05). For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were found only for some items of the SD and SR subscales, respectively. Thus, EFA results suggested that, more than other model violations, multidimensionality caused the subscale data to show GRM misfit.

[Figure 1 appears here: for each patient, standardized residuals (vertical axis, approximately −4 to 4) are plotted against OQ-45 item number, grouped by the SR, IR, and SD subscales.]

Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54, upper panel) and misfitting patient 2752 (l_zm^p = −7.96, lower panel).

To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, 88% of the responses in the most extreme category). We concluded that, despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.

Detection of Misfit and Follow-Up Analyses

For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, showing that her item scores were consistent with the expected GRM item scores.

Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale misfit may be that his problems were limited to this relationship. On the SD subscale, he had several unexpectedly high as well as unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.

Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance-related disorder were significant. Patients with ADHD were less likely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels on each of the subscales. Misfit was due to several unexpectedly high scores indicating severe symptoms. In general, these patients did not have large residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.

Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit)

                              Model 1          Model 2
Intercept                     −1.84 (0.11)     −1.93 (0.11)
Gender                        −0.12 (0.13)     −0.12 (0.13)
Measurement occasion          −0.17 (0.27)     −0.18 (0.27)
Diagnosis category (a)
  Somatoform                   0.57 (0.29)      0.74 (0.29)
  ADHD                        −0.58 (0.28)     −0.39 (0.28)
  Psychotic                    1.05 (0.46)      1.13 (0.47)
  Borderline                  −1.30 (0.72)     −1.39 (0.73)
  Impulse control              0.35 (0.36)      0.57 (0.36)
  Eating disorders            −1.10 (0.60)     −0.97 (0.60)
  Substance related            0.66 (0.33)      0.69 (0.33)
  Social/relational           −0.20 (0.26)      0.08 (0.27)
GAF code                       –               −0.17 (0.07)
OQ-45 total score              –                0.26 (0.07)

Note. N = 2,434. Standard errors are in parentheses. The correlation between the GAF code and OQ-45 total score equaled −.26.
(a) The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.
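As a quick arithmetic check, applying the inverse logit to the Model 2 intercept and diagnosis coefficients in Table 2 reproduces the probabilities reported above, under the assumption that the remaining predictors are held at 0 (i.e., at their reference codes or centered values):

```python
from math import exp

def inv_logit(x):
    return 1.0 / (1.0 + exp(-x))

intercept = -1.93                                   # Model 2 intercept (Table 2)
for label, coef in [("mood/anxiety (baseline)", 0.00),
                    ("somatoform", 0.74),
                    ("psychotic", 1.13),
                    ("substance related", 0.69)]:
    print(f"{label}: {inv_logit(intercept + coef):.2f}")
# -> 0.13, 0.23, 0.31, 0.22, matching the estimated probabilities in the text
```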

Discussion

We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.

The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements of the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders), symptoms of disorder and confusion and the irrelevance of most OQ-45 items (psychotic disorders), and being under the influence during test taking and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.

There are two possible general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder; second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) to diagnose severe mental illnesses such as psychotic disorders and bipolar disorder.

The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not previously been evaluated by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.

Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.

Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.

To conclude, our results have two main implications pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data from outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.

Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (indexed j = 1, ..., J), each with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of the item-score pattern x of person i is given by

l_{p_i}(\mathbf{x}) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i). \quad (A1)

The standardized log-likelihood is defined as

l_{z_p}(\mathbf{x}_i) = \frac{l_{p_i}(\mathbf{x}) - E[l_{p_i}(\mathbf{x})]}{\{\mathrm{VAR}[l_{p_i}(\mathbf{x})]\}^{1/2}}, \quad (A2)

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.

Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

e_{ij} = X_{ij} - E(X_{ij}), \quad (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m \, P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and variance equal to

\mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2. \quad (A4)

The standardized residual is given by

z_{e_{ij}} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \quad (A5)

To compute z_{e_{ij}}, the latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.

Appendix B

For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and the IR subscale and three factors for the SD subscale. The MIRT model is defined as \mathrm{logit}[P(\boldsymbol{\theta})] = 1.702(\boldsymbol{\alpha}'\boldsymbol{\theta} + \delta). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
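The sketch below illustrates, in simplified unidimensional form, how such contaminated data sets can be generated: model-consistent patterns are drawn from the estimated model, random-error patterns overwrite a subset of item scores, and acquiescent patterns are produced by shifting the thresholds toward the agree end (cf. the note to Table B1). The parameters and helper names are illustrative, not the study's MIRT estimates.

```python
import numpy as np

rng = np.random.default_rng(7)

def grm_probs(theta, a, b):
    """Category probabilities under a unidimensional GRM; a: (J,), b: (J, M)."""
    isrf = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))
    upper = np.hstack([np.ones((len(a), 1)), isrf])
    lower = np.hstack([isrf, np.zeros((len(a), 1))])
    return upper - lower

def simulate_patterns(theta, a, b, misfit="none", n_random_items=20, shift=2.0):
    """One item-score pattern per element of theta, optionally contaminated."""
    J, M = b.shape
    data = np.empty((len(theta), J), dtype=int)
    for i, t in enumerate(theta):
        b_i = b - shift if misfit == "acquiescence" else b   # shift pushes scores upward
        P = grm_probs(t, a, b_i)
        data[i] = [rng.choice(M + 1, p=P[j]) for j in range(J)]
        if misfit == "random":
            idx = rng.choice(J, size=n_random_items, replace=False)
            data[i, idx] = rng.integers(0, M + 1, size=n_random_items)
    return data

# Hypothetical 24-item, 5-category subscale:
J, M = 24, 4
a = rng.uniform(0.5, 1.5, J)
b = np.sort(rng.normal(0.0, 1.0, (J, M)), axis=1)
clean = simulate_patterns(rng.normal(0.0, 1.0, 400), a, b)
noisy = simulate_patterns(rng.normal(0.0, 1.0, 50), a, b, misfit="random")
acq = simulate_patterns(rng.normal(0.0, 1.0, 50), a, b, misfit="acquiescence")
```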

Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                      SR subscale                  IR subscale                  SD subscale
Item parameter        M (SD)        Range          M (SD)        Range          M (SD)        Range
Discrimination
  α_θ1                0.76 (0.47)   0.23, 1.46     0.84 (0.37)   0.36, 1.65     0.90 (0.26)   0.52, 1.32
  α_θ2                0.26 (0.50)   −0.09, 1.31    0.02 (0.37)   −0.48, 0.73    0.03 (0.40)   −0.74, 0.54
  α_θ3                –             –              –             –              0.02 (0.27)   −0.56, 0.53
Threshold
  δ_0                 –             –              –             –              –             –
  δ_1                 0.98 (0.70)   −0.25, 1.84    1.12 (0.66)   0.14, 2.21     1.76 (0.84)   −0.11, 3.08
  δ_2                 0.10 (0.60)   −0.88, 0.80    0.07 (0.52)   −0.98, 0.95    0.77 (0.65)   −0.70, 1.75
  δ_3                 −0.82 (0.53)  −1.57, −0.19   −1.11 (0.67)  −2.40, −0.18   −0.40 (0.53)  −1.67, 0.62
  δ_4                 −1.84 (0.50)  −2.60, −1.26   −2.28 (0.87)  −3.62, −1.09   −1.74 (0.53)  −2.87, −0.70
Latent-trait correlations
  r_θ1θ2              .20                          .50                          −.42
  r_θ1θ3              –                            –                            .53
  r_θ2θ3              –                            –                            −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence the percentages were 20% and 45%; and for strong acquiescence the percentages were 12% and 88%.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization of Scientific Research (NWO 400-06-087, first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire–45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway F (2002) Outcome measurement in mental healthmdash

Welcome to the revolution British Journal of Psychiatry

181 1-2

Karabatsos G (2003) Comparing the aberrant response detec-

tion performance of thirty-six person-fit statistics Applied

Measurement in Education 16 277-298

Kenny D A Kaniskan B amp McCoach D B (2014) The per-

formance of RMSEA in models with small degrees of free-dom Sociological Methods and Research Advance online

publication doi1011770049124114543236

Kim S Beretvas S N amp Sherry A R (2010) A validation

of the factor structure of OQ-45 scores using factor mixture

modeling Measurement and Evaluation in Counseling and

Development 42 275-295

Kroenke K Spitzer R L amp Williams J B (2002) The PHQ-

15 validity of a new measure for evaluating the severity

of somatic symptoms Psychosomsatic Medicine 2002

258-266

at Seoul National University on December 18 2014asmsagepubcomDownloaded from

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 1212

12 Assessment

Lambert M J amp Hawkins E J (2004) Measuring outcome in

professional practice Considerations in selecting and utiliz-

ing brief outcome instruments Professional Psychology

Research and Practice 35 492-499

Lambert M J Morton J J Hatfield D Harmon C Hamilton

S Reid R C Burlingame G M (2004) Administration

and scoring manual for the OQ-452 (Outcome Questionnaire)

(3th ed) Wilmington DE American Professional CredentialServices

Lambert M J amp Shimokawa K (2011) Collecting client feed-

back Psychotherapy 48 72-79

MacCallum R C Browne M W amp Sugawara H M (1996)

Power analysis and determination of sample size for covari-

ance structure modeling Psychological Methods 1 130-149

Meijer R R amp Sijtsma K (2001) Methodology review

Evaluating person fit Applied Psychological Measurement

25 107-135

Mueller R M Lambert M J amp Burlingame G M (1998)

Construct validity of the outcome questionnaire A confirma-

tory factor analysis Journal of Personality Assessment 70

248-262

Mutheacuten B O amp Mutheacuten L K (2007) Mplus Statistical analy-sis with latent variables (Version 50) Los Angeles CA

Statmodel

Mutheacuten L K amp Mutheacuten B O (2009) Mplus Short Course vid-

eos and handouts Retrieved from httpwwwstatmodelcom

downloadTopic201-v11pdf

Nering M L (1995) The distribution of person fit using true

and estimated person parameters Applied Psychological

Measurement 19 121-129

Piedmont R L McCrae R R Riemann R amp Angleitner

A (2000) On the invalidity of validity scales Evidence

from self-reports and observer ratings in volunteer samples

Journal of Personality and Social Psychology 78 582-593

Pirkis J E Burgess P M Kirk P K Dodson S Coombs

T J amp Williamson M K (2005) A review of the psycho-

metric properties of the Health of the Nation Outcome Scales

(HoNOS) family of measures Health and Quality of Life

Outcomes 3 76-87

Pitts S C West S G amp Tein J (1996) Longitudinal measure-

ment models in evaluation research Examining stability and

change Evaluation and Program Planning 19 333-350

Reckase M D (2009) Multidimensional item response theory

New York NY Springer

Reise S P amp Waller N G (1993) Traitedness and the assess-

ment of response pattern scalability Journal of Personality

and Social Psychology 65 143-151

Reise S P amp Waller N G (2009) Item response theory and

clinical measurement Annual Review of Clinical Psychology

5 27-48

Rost J (1990) Rasch models in latent classes An integration

of two approaches to item analysis Applied Psychological

Measurement 3 271-282

Samejima F (1997) Graded response model In W J van der

Linden amp R Hambleton (Eds) Handbook of modern itemresponse theory (pp 85-100) New York NY Springer

Schmitt N Chan D Sacco J M McFarland L A amp Jennings

D (1999) Correlates of person-fit and effect of person-fit

on test validity Applied Psychological Measurement 23

41-53

Spielberger C D Gorsuch R L Lushene R Vagg P R amp

Jacobs G A (1983) Manual for the State-Trait Anxiety

Inventory (Form Y) Palo Alto CA Consulting Psychologists

Press

St-Onge C Valois P Abdous B amp Germain S (2011) Person-

fit statisticsrsquo accuracy A Monte Carlo study of the aberrance

ratersquos influence Applied Psychological Measurement 35

419-432

Tellegen A (1988) The analysis of consistency in personalityassessment Journal of Personality 56 621-663

Thissen D Chen W H amp Bock R D (2003) MULTILOG for

Windows (Version 7) Lincolnwood IL Scientific Software

International

Thissen D Steinberg L amp Wainer H (1993) Detection of dif-

ferential functioning using the parameters of item response

models In P W Holland amp H Wainer (Eds) Differential

Item Functioning (pp 67-113) Hillsdale NJ Lawrence

Erlbaum

Van Herk H Poortinga Y H amp Verhallen T M M (2004)

Response styles in rating scales Evidence of method bias

in data from six EU countries Journal of Cross-Cultural

Psychology 35 346-360

Wing J K Beevor A S Curtis R H Park B G Hadden S

amp Burns H (1998) Health of the Nation Outcome Scales

(HoNOS) Research and development British Journal of

Psychiatry 172 11-18

Wood J M Garb H N Lilienfeld S O amp Nezworski M T

(2002) Clinical assessment Annual Review of Psychology

53 519-543

Woods C M Oltmanns T F amp Turkheimer E (2008)

Detection of aberrant responding on a personality scale in a

military sample An application of evaluating person fit with

two-level logistic regression Psychological Assessment 20

159-168


Aberrant responding provides clinicians with invalid information and, as a result, adversely affects the quality of treatment and diagnosis decisions (Conrad et al., 2010; Handel, Ben-Porath, Tellegen, & Archer, 2010). Person-fit analysis (PFA) involves statistical methods to detect aberrant item-score patterns that are due to aberrant responding. Conrad et al. (2010) provided an example of the potential of PFA for mental health care by using PFA to screen for atypical symptom profiles among persons at intake for drug or alcohol dependence treatment. They found that persons with aberrant item-score patterns required different treatments than persons with model-consistent item-score patterns, and concluded that PFA may detect inconsistencies that have important implications for treatment and diagnosis decisions. As self-report outcome measures are increasingly used to make treatment decisions in clinical practice, PFA may be a valuable screening tool in outcome measurement.

The importance of detecting aberrant responding has long been recognized. Both the original and current versions of the Minnesota Multiphasic Personality Inventory (Butcher et al., 2001; Handel et al., 2010) include scales to detect different types of aberrant responding. Examples are lie scales to detect faking good or faking bad, and indices based on the consistency of responses to items that are either highly similar or opposite in content, such as the Variable Response Inconsistency (VRIN) scale to detect random responding and the True Response Inconsistency (TRIN) scale to detect acquiescence. Despite the importance of validity scales, outcome questionnaires typically do not include specialized scales for detecting aberrant responding (Lambert & Hawkins, 2004). One possible explanation is that, with the increasing demand for cost-effectiveness, time for assessment has been reduced greatly (Wood, Garb, Lilienfeld, & Nezworski, 2002). Consequently, outcome questionnaires are required to be short and efficient, which limits the use of validity scales consisting of additional items (e.g., lie scales) and limits the construction of TRIN and VRIN scales, because fewer item pairs with similar or opposite content are available.

Person-Fit Analysis in Outcome Measurement

In this study, we used PFA to investigate the prevalence and possible causes of aberrant responding in outcome measurement by means of the OQ-45. In PFA, person-fit statistics signal whether an individual's item-score pattern is inconsistent with the item-score pattern expected under the particular measurement model (Meijer & Sijtsma, 2001). A significant discrepancy between the observed and the expected item-score pattern provides evidence of person misfit. Person misfit means that the individual's test score is unlikely to be meaningful in terms of the trait being measured. For noncognitive data, the l_z person-fit statistic (Drasgow, Levine, & McLaughlin, 1987) is one of the best performing and most popular person-fit statistics (Emons, 2008; Ferrando, 2012; Karabatsos, 2003). To determine whether an item-score pattern shows significant misfit, statistic l_z is compared with a cutoff value obtained under the item response theory (IRT; Embretson & Reise, 2000) model that serves as the null model of consistency (De la Torre & Deng, 2008; Nering, 1995). Statistic l_z detects various types of aberrant responding, such as acquiescence and extreme response style, but the statistic is most powerful for detecting random responding (Emons, 2008). In detecting random responding to 57 items measuring the Big Five personality factors, PFA has been found to outperform an inconsistency index based on the rationale of the Minnesota Multiphasic Personality Inventory VRIN scale (Egberink, 2010, pp. 94-100).

An advantage of statistic l_z and other person-fit statistics for application to outcome measurement is that they can be used to detect invalid test scores on any self-report scale that is consistent with an IRT model. Also, the rise of computerized and IRT-based outcome monitoring (e.g., the Patient Reported Outcomes Measurement Information System; Cella et al., 2007) renders the implementation of PFA feasible. Along with the computer-generated test score, a person-fit value may be provided to the clinician, serving as an alarm bell warning that the test score may be invalid and that further inquiry may be useful.

Follow-up PFA of item-score patterns flagged as misfitting can help the clinician infer possible explanations for an individual's observed aberrant responding. In personality measurement, Ferrando (2010) used item-score residuals for follow-up PFA and found that a person who had an aberrant item-score pattern on an extraversion scale showed unexpectedly low scores on items referring to situations in which the person could make a fool of himself. This result suggested that the aberrant responding was due to fear of being rejected. For another person, follow-up PFA suggested inattentiveness to reversed item wording. In outcome measurement for individual patients, follow-up PFA can inform the clinician about the sources of the misfit, and clinicians can discuss the unexpected item scores with the patients to obtain a better understanding of the patient's psychological profile.

PFA primarily focuses on individuals but can also be used to explain individual differences in aberrant responding at the group level; for examples, see Conijn, Emons, and Sijtsma (2014) and Conijn et al. (2013). In outcome measurement, PFA can be used to investigate the extent to which general measures are suited for assessing patients suffering from different disorders. General outcome measures such as the OQ-45 and the CORE-OM use items that assess the most common symptoms of psychopathology, such as those observed in depression and anxiety disorders (Lambert & Hawkins, 2004), and are also used to assess patients suffering from different specific disorders,


varying from somatoform disorders to psychotic disorders and addiction. For rare or specific disorders, many of the general measures' items are irrelevant, and low traitedness may lead to inconsistent or unmotivated completion of outcome measures.

Goal of the Study

We investigated the prevalence and the causes of aberrant responding to the OQ-45 (Lambert et al., 2004). The OQ-45 is one of the most popular general outcome measures used in mental health care. We used OQ-45 data from a large Dutch clinical outpatient sample suffering from a large variety of disorders. The l_z person-fit statistic (Drasgow et al., 1987) was used to identify misfitting item-score patterns, and standardized item-score residuals (Ferrando, 2010, 2012) were used to investigate sources of item-score pattern misfit. We employed logistic regression analyses, using the l_z statistic as the dependent variable, to investigate whether patients suffering from specific disorders (e.g., somatoform disorders and psychotic disorders) and severely distressed patients are more predisposed to produce aberrant item-score patterns on the OQ-45 than other patients. Based on the results for the OQ-45, we discuss the possible causes of aberrant responding in outcome measurement in general and the potential of PFA for improving outcome-measurement practice.

Method

Participants

We performed a secondary analysis on data collected in routine mental health care. Participants were 2,906 clinical outpatients (42.1% male) from a mental health care institution with four different locations situated in Noord-Holland, a predominantly rural province in the Netherlands. Participants' age ranged from 17 to 77 years (M = 37, SD = 13). Apart from gender and age, no other demographic information was collected.

Most patients completed the OQ-45 at intake, but 160 (5.5%) patients completed the OQ-45 after treatment had started. The sample included 2,632 (91%) patients with a clinician-rated Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM-IV) primary diagnosis on Axis I, 192 (7%) persons with a primary diagnosis on Axis II, and 82 (3%) patients for whom the primary diagnosis was missing. The most frequent primary diagnoses were depression (38%), anxiety disorders (20%), disorders usually first diagnosed in infancy, childhood, or adolescence (8%), personality disorders (7%), adjustment disorders (6%), somatoform disorders (3%), eating disorders (2%), and substance-related disorders (2%). Of the diagnosed patients, 13.1% had comorbidity between Axis I and Axis II, 32.0% had comorbidity within Axis I, and 0.1% had comorbidity within Axis II. The clinicians had access to the OQ-45 data, but because the OQ-45 is not a diagnostic instrument, it is unlikely that diagnoses were based on the OQ-45 results.

Measures

The Outcome Questionnaire–45. The OQ-45 (Lambert et al., 2004) uses three subscales to measure symptom severity and daily functioning. The Social Role Performance (SR; 9 items, of which 3 are reversely worded) subscale measures dissatisfaction, distress, and conflicts concerning one's employment, education, or leisure pursuits. An example item is "I feel stressed at work/school." The Interpersonal Relations (IR; 11 items, of which 4 are reversely worded) subscale measures difficulty with family, friends, and the marital relationship. An example item is "I feel lonely." The Symptom Distress (SD; 25 items, of which 3 are reversely worded) subscale measures symptoms of the most frequently diagnosed mental disorders, in particular anxiety and depression. An example item is "I feel no interest in things." Respondents are instructed to express their feelings with respect to the past week on a 5-point rating scale with scores ranging from 0 (never) through 4 (almost always), higher scores indicating more psychological distress.

In this study, we used the Dutch OQ-45 (De Jong & Nugter, 2004). The Dutch OQ-45 total score has good concurrent and criterion-related validity (De Jong et al., 2007). In our sample, coefficient alpha for the subscale total scores equaled .65 (SR), .77 (IR), and .91 (SD). Results concerning the OQ-45 factor structure are ambiguous. Some studies provided support for the theoretical three-factor model for both the original OQ-45 (Bludworth, Tracey, & Glidden-Tracey, 2010) and the Dutch OQ-45 (De Jong et al., 2007). Other studies found poor fit of the theoretical three-factor model and suggested that better fit was obtained by a three-factor model based on a reduced item set (Kim, Beretvas, & Sherry, 2010) or by a one-factor model (Mueller, Lambert, & Burlingame, 1998). In this study, we further investigated the fit of the theoretical three-factor model.

Explanatory Variables for Person Misfit

Severity of distress. The OQ-45 total score and the clinician-rated DSM-IV Global Assessment of Functioning (GAF) code were taken as measures of the patient's severity of distress. The GAF code ranges from 1 to 100, with higher values indicating better psychological, social, and occupational functioning. The GAF code was missing for 187 (6%) patients.

Diagnosis category. The clinician-rated DSM-IV diagnosis was classified into nine categories representing the most common types of disorders present in the sample. Table 1 describes the diagnosis categories and the


number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that misfit was less likely for patients suffering from these symptoms than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously into one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).

Statistical Analysis

Model-Fit Evaluation. We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. The core of the GRM consists of the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
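To make the ISRFs concrete, the sketch below computes GRM category probabilities for a single 5-category item. It is a minimal illustration with assumed (not estimated) item parameters, not the software used in the study.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities for one polytomous item under the GRM.

    theta : latent trait value (scalar)
    a     : item discrimination
    b     : array of M ordered thresholds (M + 1 response categories)

    The ISRFs are P(X >= m | theta) = logistic(a * (theta - b[m-1]));
    category probabilities are differences of adjacent ISRF values.
    """
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # ISRFs for m = 1..M
    p_star = np.concatenate(([1.0], p_star, [0.0]))   # boundaries for m = 0 and m = M + 1
    return p_star[:-1] - p_star[1:]                   # P(X = m), m = 0..M

# Example: a 5-category OQ-45-style item (hypothetical parameters)
probs = grm_category_probs(theta=0.5, a=1.4, b=[-1.5, -0.5, 0.6, 1.8])
print(probs, probs.sum())  # probabilities for scores 0..4, summing to 1
```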

Satisfactory GRM fit to the data is a prerequisite for applying GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale, we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR; Muthén & Muthén, 2007). RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of the GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust to the identified OQ-45 model misfit (Conijn et al., 2014).
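The article reports the conventional cutoffs but not the index definitions. For reference, the commonly used (maximum-likelihood-based) forms are given below, where $\chi^{2}_{M}$ and $df_{M}$ are the model chi-square and degrees of freedom, $N$ is the sample size, $s_{ij}$ the observed (here polychoric) correlations, $\hat{\sigma}_{ij}$ their model-implied counterparts, and $p$ the number of items; Mplus's categorical-data estimators use analogous quantities.

$$\mathrm{RMSEA}=\sqrt{\max\!\left(\frac{\chi^{2}_{M}-df_{M}}{df_{M}\,(N-1)},\,0\right)},\qquad \mathrm{SRMR}=\sqrt{\frac{\sum_{i\le j}\bigl(s_{ij}-\hat{\sigma}_{ij}\bigr)^{2}}{p(p+1)/2}}.$$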

Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis.

| Category | Common DSM-IV diagnoses included | n | Mean l_zm^p | Detected (n) | Detected (%) |
|---|---|---|---|---|---|
| Mood and anxiety disorder^a | Depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder | 1,786 | 0.28 | 229 | 12.8 |
| Somatoform disorder | Pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder | 82 | 0.16 | 16 | 19.5 |
| Attention deficit hyperactivity disorder (ADHD) | Predominantly inattentive, combined hyperactive-impulsive and inattentive | 198 | 0.08 | 15 | 7.6 |
| Psychotic disorder | Schizophrenia, psychotic disorder not otherwise specified | 26 | −0.10 | 7 | 26.9 |
| Borderline personality disorder | Borderline personality disorder | 53 | 0.35 | 2 | 3.8 |
| Impulse-control disorders not elsewhere classified | Impulse-control disorder, intermittent explosive disorder | 58 | 0.02 | 10 | 17.2 |
| Eating disorder | Eating disorder not otherwise specified, bulimia nervosa | 67 | 0.38 | 4 | 6.0 |
| Substance-related disorder | Cannabis-related disorders, alcohol-related disorders | 58 | 0.09 | 13 | 22.4 |
| Social and relational problem | Phase of life problem, partner relational problem, identity problem | 186 | 0.26 | 20 | 10.8 |

a. Including 65% patients with a mood disorder.


Person-Fit Analysis

Detection of misfit. We used statistic l_z for polytomous item scores, denoted l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics. Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reversely worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they may still be suspicious, as they may reflect gross under- or overreporting of symptoms.
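A minimal sketch of the l_z^p computation (the exact equations are in Appendix A), assuming the GRM category probabilities at the estimated θ are already available, for example stacked from the kind of category probabilities sketched above:

```python
import numpy as np

def lzp(scores, probs):
    """Standardized log-likelihood person-fit statistic l_z^p for one
    respondent (Drasgow et al., 1985; Appendix A, Equations A1-A2).

    scores : length-J integer array of observed item scores (0..M)
    probs  : J x (M + 1) array, probs[j, m] = P(X_j = m | theta_hat)
             under the fitted GRM (assumed strictly positive)
    """
    scores = np.asarray(scores)
    J = probs.shape[0]
    log_p = np.log(probs)

    # observed log-likelihood of the item-score pattern (Equation A1)
    l_obs = log_p[np.arange(J), scores].sum()

    # model-implied expectation and per-item variance of the log-likelihood
    e_l = (probs * log_p).sum()
    var_l = (probs * log_p ** 2).sum(axis=1) - (probs * log_p).sum(axis=1) ** 2

    # standardize (Equation A2); large negative values signal misfit
    return (l_obs - e_l) / np.sqrt(var_l.sum())
```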

Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.
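Read literally, and assuming the subscale l_z^p values are approximately independent and standard normal under the model, this amounts to

$$l_{zm}^{p}=\frac{1}{\sqrt{S}}\sum_{s=1}^{S} l_{z,s}^{p},$$

where $S$ is the number of subscales entering the computation (three here, or two when a subscale value is missing).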

Under the null model of fit to the IRT model, and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each item-score pattern, we re-estimated θ and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p. The percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++; algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
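A sketch of this parametric bootstrap; the three GRM routines (pattern simulation, θ estimation, and the l_zm^p computation) are user-supplied callables rather than reproduced here, so this shows the procedure's structure only, not the study's C++ implementation.

```python
import numpy as np

def bootstrap_p_value(lzm_observed, theta_hat, item_params,
                      simulate_pattern, estimate_theta, compute_lzmp,
                      n_rep=5000, seed=0):
    """Person-specific parametric-bootstrap p value for l_zm^p
    (De la Torre & Deng, 2008; procedure described above).

    simulate_pattern(theta, item_params, rng) -> item-score pattern
    estimate_theta(pattern, item_params)      -> re-estimated theta
    compute_lzmp(pattern, theta, item_params) -> person-fit value
    """
    rng = np.random.default_rng(seed)
    null_values = np.empty(n_rep)
    for r in range(n_rep):
        pattern = simulate_pattern(theta_hat, item_params, rng)          # step 1
        theta_rep = estimate_theta(pattern, item_params)                 # step 2
        null_values[r] = compute_lzmp(pattern, theta_rep, item_params)   # step 3
    # One-tailed test: low l_zm^p values signal misfit, so the p value is
    # the proportion of bootstrap values at or below the observed value.
    return float(np.mean(null_values <= lzm_observed))
```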

Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).

Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
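A sketch of such an analysis in Python with statsmodels; the data frame, variable names, and values below are illustrative synthetic stand-ins for the study's variables, not the study's data or code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data standing in for: misfit flag from l_zm^p, gender,
# measurement occasion, diagnosis category, GAF code, OQ-45 total score.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "misfit":   rng.binomial(1, 0.13, n),
    "gender":   rng.binomial(1, 0.58, n),   # 0 = male, 1 = female
    "occasion": rng.binomial(1, 0.06, n),   # 0 = intake, 1 = during treatment
    "diagnosis": rng.choice(["mood_anxiety", "somatoform",
                             "psychotic", "substance"], n),
    "gaf":        rng.normal(55, 10, n),
    "oq45_total": rng.normal(80, 25, n),
})

# Model 1: control variables and diagnosis (mood/anxiety as baseline)
model1 = smf.logit(
    "misfit ~ gender + occasion + C(diagnosis, Treatment('mood_anxiety'))",
    data=df).fit()
# Model 2: additionally controls for severity of distress
model2 = smf.logit(
    "misfit ~ gender + occasion + C(diagnosis, Treatment('mood_anxiety'))"
    " + gaf + oq45_total",
    data=df).fit()
print(model1.summary())
print(model2.summary())
```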

Results

First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which the l_zm^p person-misfit classification was predicted from clinical diagnosis and severity of disorder.

Model-Fit Evaluation

Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measure substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.

For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated by the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate (RMSEA = .13, SRMR = .05).


[Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54; upper panel) and misfitting patient 2752 (l_zm^p = −7.96; lower panel). Panels show the SR, IR, and SD subscales; vertical axes give the standardized residuals.]

For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08, SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07, SRMR = .03). Violations of local independence and violations of a logistic ISRF were found only for some items of the SD and SR subscales, respectively. Thus, the EFA results suggested that multidimensionality, more than other model violations, caused the subscale data to show GRM misfit.

To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used a data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., for 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, with 88% of the responses in the most extreme category). We concluded that, despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistics' performance.

Detection of Misfit and Follow-Up Analyses

For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Given the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were all smaller than 1.64, showing that her item scores were consistent with the expected GRM item scores.

Patient 2752 (lower panel) was diagnosed with an adjustment disorder with depressed mood. He had large residuals on each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, the residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale


misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms; hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, an extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and arrive at a more definite explanation for the person misfit.

Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance-related disorder were significant. Patients with ADHD were less likely to show misfit than the baseline category of patients with mood or anxiety disorders; patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included the GAF code and the OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the effect of ADHD was no longer significant; hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13; for patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.
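These probabilities follow, to rounding, from the Model 2 coefficients in Table 2 with the remaining predictors held at their reference (zero) values:

$$\Pr(\text{misfit})=\frac{1}{1+\exp\!\left[-(\hat{\beta}_{0}+\hat{\beta}_{\text{diagnosis}})\right]}:\quad
\frac{1}{1+e^{1.93}}\approx .13\ \text{(mood/anxiety)},\quad
\frac{1}{1+e^{1.93-0.74}}\approx .23\ \text{(somatoform)},\quad
\frac{1}{1+e^{1.93-1.13}}\approx .31\ \text{(psychotic)},\quad
\frac{1}{1+e^{1.93-0.69}}\approx .22\ \text{(substance-related)}.$$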

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels on each of the subscales; their misfit was due to several unexpectedly high scores indicating severe symptoms.

Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit).

| Predictor | Model 1 | Model 2 |
|---|---|---|
| Intercept | −1.84 (0.11) | −1.93 (0.11) |
| Gender | −0.12 (0.13) | −0.12 (0.13) |
| Measurement occasion | −0.17 (0.27) | −0.18 (0.27) |
| Diagnosis category^a | | |
| Somatoform | 0.57 (0.29) | 0.74 (0.29) |
| ADHD | −0.58 (0.28) | −0.39 (0.28) |
| Psychotic | 1.05 (0.46) | 1.13 (0.47) |
| Borderline | −1.30 (0.72) | −1.39 (0.73) |
| Impulse control | 0.35 (0.36) | 0.57 (0.36) |
| Eating disorders | −1.10 (0.60) | −0.97 (0.60) |
| Substance related | 0.66 (0.33) | 0.69 (0.33) |
| Social/relational | −0.20 (0.26) | 0.08 (0.27) |
| GAF code | — | −0.17 (0.07) |
| OQ-45 total score | — | 0.26 (0.07) |

Note. N = 2,434. The correlation between the GAF code and the OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.


In general, these patients did not have large residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting a lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern; thus, these patients did not show similar person misfit.

Discussion

We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that such misfit may adversely affect PFA. The GRM indeed failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to this model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may hold only for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.

The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements of the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were also more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion, and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test taking, and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.

There are two possible general explanations for the observed group differences in the tendency to show person misfit: first, person misfit may be due to a mismatch between the OQ-45 and the specific disorder, and second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures, rather than general outcome measures, should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for assessing severe mental illnesses such as psychotic disorders and bipolar disorder.

The general-misfit explanation implies that methods other than self-reports should be used to inform these patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not previously been evaluated by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.

Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that these statistics have low power to identify response styles and


malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance with that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length; hence, more research is needed on this subject.

Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) across repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). In that case, PFA becomes very inefficient and is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analyses, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.

To conclude, our results have two main implications, pertaining to psychometrics and to psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data from outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Moreover, the more severely distressed patients, for whom psychological intervention is most needed, appear to be at the greatest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.

Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on $J$ items (items are indexed $j$, $j = 1, \ldots, J$) with $M + 1$ ordered answer categories. Let the score on item $j$ be denoted by $X_j$, with possible realizations $x_j = 0, \ldots, M$. The latent trait is denoted by $\theta$, and $P(X_j = x_j \mid \theta)$ is the probability of a score $X_j = x_j$. Let $d_j(m) = 1$ if $x_j = m$ $(m = 0, \ldots, M)$ and 0 otherwise. The unstandardized log-likelihood of an item-score pattern $\mathbf{x}$ of person $i$ is given by

$$l_p(\mathbf{x}_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i). \tag{A1}$$

The standardized log-likelihood is defined as

$$l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E\!\left[l_p(\mathbf{x}_i)\right]}{\left\{\mathrm{VAR}\!\left[l_p(\mathbf{x}_i)\right]\right\}^{1/2}}, \tag{A2}$$

where $E(l_p)$ is the expected value and $\mathrm{VAR}(l_p)$ is the variance of $l_p$.

Standardized Residual Statistic

The unstandardized residual for person $i$ on item $j$ is given by

$$e_{ij} = X_{ij} - E(X_{ij}), \tag{A3}$$

where $E(X_{ij})$ is the expected value of $X_{ij}$, which equals $E(X_{ij}) = \sum_{m=0}^{M} m\, P(X_{ij} = m \mid \theta_i)$. The residual $e_{ij}$ has a mean of 0 and variance equal to

$$\mathrm{VAR}(e_{ij}) = E(X_{ij}^{2}) - \left[E(X_{ij})\right]^{2}. \tag{A4}$$

The standardized residual is given by

$$z_{e_{ij}} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \tag{A5}$$

To compute $z_{e_{ij}}$, the latent trait $\theta_i$ needs to be replaced by its estimated value. This may bias the standardization of $e_{ij}$.
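A minimal implementation of Equations A3 to A5, assuming the GRM category probabilities at the estimated θ are available; the example probabilities are hypothetical.

```python
import numpy as np

def standardized_residual(score, probs):
    """Standardized item-score residual z_e for one person-item
    combination (Appendix A, Equations A3-A5).

    score : observed item score (0..M)
    probs : length-(M + 1) array with P(X = m | theta_hat) under the GRM
    """
    m = np.arange(len(probs))
    e_x = (m * probs).sum()                      # E(X)
    var_x = (m ** 2 * probs).sum() - e_x ** 2    # VAR(e) = E(X^2) - E(X)^2  (A4)
    return (score - e_x) / np.sqrt(var_x)        # z_e  (A5)

# Example with hypothetical GRM probabilities for a 5-category item
z = standardized_residual(4, np.array([0.05, 0.15, 0.40, 0.30, 0.10]))
print(round(z, 2))  # positive: score higher than expected under the model
```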

Appendix B

For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus


(Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and the IR subscale and three factors for the SD subscale. The MIRT model specifies the probability of responding in category $m$ or higher as $P^{*}_{jm}(\boldsymbol{\theta}) = 1 / \{1 + \exp[-1.702(\boldsymbol{\alpha}_j'\boldsymbol{\theta} + \delta_{jm})]\}$. Vector $\boldsymbol{\theta}$ denotes the latent traits and has a multivariate standard normal distribution, where the $\theta$ correlations are estimated along with the item parameters of the MIRT model. Higher $\alpha$ values indicate better discriminating items, and higher $\delta$ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) to generate replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong, based on Van Herk, Poortinga, & Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).

Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation.

| Item parameter | SR: M (SD) | SR: Range | IR: M (SD) | IR: Range | SD: M (SD) | SD: Range |
|---|---|---|---|---|---|---|
| Discrimination | | | | | | |
| α_θ1 | 0.76 (0.47) | 0.23, 1.46 | 0.84 (0.37) | 0.36, 1.65 | 0.90 (0.26) | 0.52, 1.32 |
| α_θ2 | 0.26 (0.50) | −0.09, 1.31 | 0.02 (0.37) | −0.48, 0.73 | 0.03 (0.40) | −0.74, 0.54 |
| α_θ3 | — | — | — | — | 0.02 (0.27) | −0.56, 0.53 |
| Threshold | | | | | | |
| δ_0 | — | — | — | — | — | — |
| δ_1 | 0.98 (0.70) | −0.25, 1.84 | 1.12 (0.66) | 0.14, 2.21 | 1.76 (0.84) | −0.11, 3.08 |
| δ_2 | 0.10 (0.60) | −0.88, 0.80 | 0.07 (0.52) | −0.98, 0.95 | 0.77 (0.65) | −0.70, 1.75 |
| δ_3 | −0.82 (0.53) | −1.57, −0.19 | −1.11 (0.67) | −2.40, −0.18 | −0.40 (0.53) | −1.67, 0.62 |
| δ_4 | −1.84 (0.50) | −2.60, −1.26 | −2.28 (0.87) | −3.62, −1.09 | −1.74 (0.53) | −2.87, −0.70 |
| Latent-trait correlations | | | | | | |
| r_θ1θ2 | .20 | | .50 | | −.42 | |
| r_θ1θ3 | — | | — | | .53 | |
| r_θ2θ3 | — | | — | | −.55 | |

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.
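A sketch of the data-generating step under this MIRT graded model, including the threshold shift used to mimic acquiescence for positively worded items; the parameter values below are illustrative, not the Table B1 estimates.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_mirt_graded(theta, alpha, delta, D=1.702):
    """Draw polytomous scores for one respondent under the MIRT graded
    model of Appendix B: P(X_j >= m | theta) = logistic(D * (alpha_j . theta + delta_jm)).

    theta : (n_traits,) latent trait vector
    alpha : (J, n_traits) discrimination matrix
    delta : (J, M) thresholds for categories 1..M, decreasing within an item
    """
    lin = alpha @ theta
    p_geq = 1.0 / (1.0 + np.exp(-D * (lin[:, None] + delta)))          # P(X >= m)
    p_geq = np.hstack([np.ones((len(lin), 1)), p_geq, np.zeros((len(lin), 1))])
    probs = np.clip(p_geq[:, :-1] - p_geq[:, 1:], 0, None)             # category probs
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(probs.shape[1], p=p) for p in probs])

# Illustrative parameters: 10 items, 2 traits, 5 categories
J, M = 10, 4
alpha = np.abs(rng.normal(0.9, 0.3, size=(J, 2)))
delta = np.sort(rng.normal(0, 1, size=(J, M)), axis=1)[:, ::-1]        # decreasing
theta = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]])

clean_scores = simulate_mirt_graded(theta, alpha, delta)
# "Moderate" acquiescence: shift the thresholds by +2 (for reversed items
# the shift would be negative, per the Table B1 note).
acquiescent_scores = simulate_mirt_graded(theta, alpha, delta + 2.0)
print(clean_scores, acquiescent_scores)
```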

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization for Scientific Research (NWO 400-06-087; first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire–45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic l_z-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The per-

formance of RMSEA in models with small degrees of free-dom Sociological Methods and Research Advance online

publication doi1011770049124114543236

Kim S Beretvas S N amp Sherry A R (2010) A validation

of the factor structure of OQ-45 scores using factor mixture

modeling Measurement and Evaluation in Counseling and

Development 42 275-295

Kroenke K Spitzer R L amp Williams J B (2002) The PHQ-

15 validity of a new measure for evaluating the severity

of somatic symptoms Psychosomsatic Medicine 2002

258-266

at Seoul National University on December 18 2014asmsagepubcomDownloaded from

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 1212

12 Assessment

Lambert M J amp Hawkins E J (2004) Measuring outcome in

professional practice Considerations in selecting and utiliz-

ing brief outcome instruments Professional Psychology

Research and Practice 35 492-499

Lambert M J Morton J J Hatfield D Harmon C Hamilton

S Reid R C Burlingame G M (2004) Administration

and scoring manual for the OQ-452 (Outcome Questionnaire)

(3th ed) Wilmington DE American Professional CredentialServices

Lambert M J amp Shimokawa K (2011) Collecting client feed-

back Psychotherapy 48 72-79

MacCallum R C Browne M W amp Sugawara H M (1996)

Power analysis and determination of sample size for covari-

ance structure modeling Psychological Methods 1 130-149

Meijer R R amp Sijtsma K (2001) Methodology review

Evaluating person fit Applied Psychological Measurement

25 107-135

Mueller R M Lambert M J amp Burlingame G M (1998)

Construct validity of the outcome questionnaire A confirma-

tory factor analysis Journal of Personality Assessment 70

248-262

Mutheacuten B O amp Mutheacuten L K (2007) Mplus Statistical analy-sis with latent variables (Version 50) Los Angeles CA

Statmodel

Mutheacuten L K amp Mutheacuten B O (2009) Mplus Short Course vid-

eos and handouts Retrieved from httpwwwstatmodelcom

downloadTopic201-v11pdf

Nering M L (1995) The distribution of person fit using true

and estimated person parameters Applied Psychological

Measurement 19 121-129

Piedmont R L McCrae R R Riemann R amp Angleitner

A (2000) On the invalidity of validity scales Evidence

from self-reports and observer ratings in volunteer samples

Journal of Personality and Social Psychology 78 582-593

Pirkis J E Burgess P M Kirk P K Dodson S Coombs

T J amp Williamson M K (2005) A review of the psycho-

metric properties of the Health of the Nation Outcome Scales

(HoNOS) family of measures Health and Quality of Life

Outcomes 3 76-87

Pitts S C West S G amp Tein J (1996) Longitudinal measure-

ment models in evaluation research Examining stability and

change Evaluation and Program Planning 19 333-350

Reckase M D (2009) Multidimensional item response theory

New York NY Springer

Reise S P amp Waller N G (1993) Traitedness and the assess-

ment of response pattern scalability Journal of Personality

and Social Psychology 65 143-151

Reise S P amp Waller N G (2009) Item response theory and

clinical measurement Annual Review of Clinical Psychology

5 27-48

Rost J (1990) Rasch models in latent classes An integration

of two approaches to item analysis Applied Psychological

Measurement 3 271-282

Samejima F (1997) Graded response model In W J van der

Linden amp R Hambleton (Eds) Handbook of modern itemresponse theory (pp 85-100) New York NY Springer

Schmitt N Chan D Sacco J M McFarland L A amp Jennings

D (1999) Correlates of person-fit and effect of person-fit

on test validity Applied Psychological Measurement 23

41-53

Spielberger C D Gorsuch R L Lushene R Vagg P R amp

Jacobs G A (1983) Manual for the State-Trait Anxiety

Inventory (Form Y) Palo Alto CA Consulting Psychologists

Press

St-Onge C Valois P Abdous B amp Germain S (2011) Person-

fit statisticsrsquo accuracy A Monte Carlo study of the aberrance

ratersquos influence Applied Psychological Measurement 35

419-432

Tellegen A (1988) The analysis of consistency in personalityassessment Journal of Personality 56 621-663

Thissen D Chen W H amp Bock R D (2003) MULTILOG for

Windows (Version 7) Lincolnwood IL Scientific Software

International

Thissen D Steinberg L amp Wainer H (1993) Detection of dif-

ferential functioning using the parameters of item response

models In P W Holland amp H Wainer (Eds) Differential

Item Functioning (pp 67-113) Hillsdale NJ Lawrence

Erlbaum

Van Herk H Poortinga Y H amp Verhallen T M M (2004)

Response styles in rating scales Evidence of method bias

in data from six EU countries Journal of Cross-Cultural

Psychology 35 346-360

Wing J K Beevor A S Curtis R H Park B G Hadden S

amp Burns H (1998) Health of the Nation Outcome Scales

(HoNOS) Research and development British Journal of

Psychiatry 172 11-18

Wood J M Garb H N Lilienfeld S O amp Nezworski M T

(2002) Clinical assessment Annual Review of Psychology

53 519-543

Woods C M Oltmanns T F amp Turkheimer E (2008)

Detection of aberrant responding on a personality scale in a

military sample An application of evaluating person fit with

two-level logistic regression Psychological Assessment 20

159-168

Page 3: Conijn, 2014

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 312

Conijn et al 3

varying from somatoform disorders to psychotic disorders and addiction. For rare or specific disorders, many of the general measures' items are irrelevant, and low traitedness may lead to inconsistent or unmotivated completion of outcome measures.

Goal of the Study

We investigated the prevalence and the causes of aberrant responding to the OQ-45 (Lambert et al., 2004). The OQ-45 is one of the most popular general outcome measures used in mental health care. We used OQ-45 data of a large Dutch clinical outpatient sample suffering from a large variety of disorders. The l_z person-fit statistic (Drasgow et al., 1987) was used to identify misfitting item-score patterns, and standardized item-score residuals (Ferrando, 2010, 2012) were used to investigate sources of item-score pattern misfit. We employed logistic regression analyses using the l_z statistic as the dependent variable to investigate whether patients suffering from specific disorders (e.g., somatoform disorders and psychotic disorders) and severely distressed patients are more predisposed to produce aberrant item-score patterns on the OQ-45 than other patients. Based on the results for the OQ-45, we discuss the possible causes of aberrant responding in outcome measurement in general and the potential of PFA for improving outcome-measurement practice.

ment practice

Method

Participants

We performed a secondary analysis on data collected in

routine mental health care Participants were 2906 clinical

outpatients (421 male) from a mental health care institu-

tion with four different locations situated in Noord-Holland

a predominantly rural province in the Netherlands

Participantsrsquo age ranged from 17 to 77 years ( M = 37 SD =

13) Apart from gender and age no other demographic infor-

mation was collected

Most patients completed the OQ-45 at intake but 160

(55) patients completed the OQ-45 after treatment

started The sample included 2632 (91) patients with a

clinician-rated Diagnostic and Statistical Manual of Mental

Disorders (4th ed DSM-IV ) primary diagnosis at Axis I192 (7) persons with a primary diagnosis at Axis II and

82 (3) patients for which the primary diagnosis was miss-

ing Most frequent primary diagnoses were depression

(38) anxiety disorders (20) disorders usually first

diagnosed in infancy childhood or adolescence (8) per-

sonality disorders (7) adjustment disorders (6)

somatoform disorders (3) eating disorders (2) and

substance-related disorders (2) Of the diagnosed patients

131 had comorbidity between Axis 1 and Axis 2 320

had comorbidity within Axis 1 and 01 had comorbidity

within Axis 2 The clinician had access to the OQ-45 data

but since the OQ-45 is not a diagnostic instrument it was

unlikely that diagnosis was based on the OQ-45 results

Measures

The Outcome Questionnaire–45. The OQ-45 (Lambert et al., 2004) uses three subscales to measure symptom severity and daily functioning. The Social Role Performance (SR; 9 items, of which 3 are reversely worded) subscale measures dissatisfaction, distress, and conflicts concerning one's employment, education, or leisure pursuits. An example item is "I feel stressed at work/school." The Interpersonal Relations (IR; 11 items, of which 4 are reversely worded) subscale measures difficulty with family, friends, and marital relationship. An example item is "I feel lonely." The Symptom Distress (SD; 25 items, of which 3 are reversely worded) subscale measures symptoms of the most frequently diagnosed mental disorders, in particular anxiety and depression. An example item is "I feel no interest in things." Respondents are instructed to express their feelings with respect to the past week on a 5-point rating scale with scores ranging from 0 (never) through 4 (almost always), higher scores indicating more psychological distress.
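As an illustration of how a measure of this kind is scored, the sketch below reverse-scores the reversely worded items and sums item scores into subscale and total scores. It is a minimal sketch only: the item-to-subscale assignment and the positions of the reversed items are placeholders, because the article does not list them.

```python
import numpy as np

# Hypothetical index sets; the real OQ-45 item assignments are not reproduced in this article.
SR_ITEMS, IR_ITEMS, SD_ITEMS = range(0, 9), range(9, 20), range(20, 45)
REVERSED_ITEMS = [0, 12, 20]   # placeholder positions of reversely worded items

def score_oq45(responses):
    """responses: 45 item scores on the 0-4 scale; returns subscale and total scores."""
    x = np.asarray(responses, dtype=float)
    x[REVERSED_ITEMS] = 4 - x[REVERSED_ITEMS]   # recode so that higher always means more distress
    return {
        "SR": x[list(SR_ITEMS)].sum(),
        "IR": x[list(IR_ITEMS)].sum(),
        "SD": x[list(SD_ITEMS)].sum(),
        "total": x.sum(),                        # 0-180 for the full 45-item form
    }
```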

In this study we used the Dutch OQ-45 (De Jong & Nugter, 2004). The Dutch OQ-45 total score has good concurrent and criterion-related validity (De Jong et al., 2007). In our sample, coefficient alpha for the subscale total scores equaled .65 (SR), .77 (IR), and .91 (SD). Results concerning the OQ-45 factor structure are ambiguous. Some studies provided support for the theoretical three-factor model for both the original OQ-45 (Bludworth, Tracey, & Glidden-Tracey, 2010) and the Dutch OQ-45 (De Jong et al., 2007). Other studies found poor fit of the theoretical three-factor model and suggested that a three-factor model showed better fit when it was based on a reduced item set (Kim, Beretvas, & Sherry, 2010) or a one-factor model (Mueller, Lambert, & Burlingame, 1998). In this study we further investigated the fit of the theoretical three-factor model.

Explanatory Variables for Person Misfit

Severity of distress. The OQ-45 total score and the clinician-rated DSM-IV Global Assessment of Functioning (GAF) code were taken as measures of the patient's severity of distress. The GAF code ranges from 1 to 100, with higher values indicating better psychological, social, and occupational functioning. The GAF code was missing for 187 (6%) patients.

Diagnosis category. The clinician-rated DSM-IV diagnosis was classified into nine categories representing the most common types of disorders present in the sample. Table 1 describes the diagnosis categories and the number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that for patients suffering from these symptoms misfit was less likely than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected that the probability of aberrant responding depends on the specific symptoms the patient experienced, we categorized patients into diagnosis categories that are defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously in one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2514 categorized patients (87%).

Statistical Analysis

Model-Fit Evaluation. We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. The core of the GRM are the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
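For reference, the ISRFs described above take the usual logistic form. This is a standard-notation sketch (the article itself defers the formulas to the cited sources), with a_j the discrimination and b_{j1} < ... < b_{jM} the ordered thresholds of item j:

```latex
% Item step response functions (ISRFs) of the graded response model
P^{*}_{jm}(\theta) = P(X_j \ge m \mid \theta)
                   = \frac{1}{1 + \exp\!\left[-a_j\left(\theta - b_{jm}\right)\right]},
\qquad m = 1, \dots, M,
\\[4pt]
P(X_j = m \mid \theta) = P^{*}_{jm}(\theta) - P^{*}_{j,m+1}(\theta),
\qquad P^{*}_{j0}(\theta) \equiv 1, \quad P^{*}_{j,M+1}(\theta) \equiv 0 .
```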

Satisfactory GRM fit to the data is a prerequisite for application of GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean squared error of approximation (RMSEA) and the standardized root mean residual (SRMR; Muthén & Muthén, 2007). RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust with respect to the identified OQ-45 model misfit (Conijn et al., 2014).

Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in Multiple Regression Analysis

Category | Common DSM-IV diagnoses included | n | Mean l_zm^p | n detected | % detected
Mood and anxiety disorder(a) | Depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder | 1786 | 0.28 | 229 | 12.8
Somatoform disorder | Pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder | 82 | 0.16 | 16 | 19.5
Attention deficit hyperactivity disorder (ADHD) | Predominantly inattentive, combined hyperactive-impulsive and inattentive | 198 | 0.08 | 15 | 7.6
Psychotic disorder | Schizophrenia, psychotic disorder not otherwise specified | 26 | −0.10 | 7 | 26.9
Borderline personality disorder | Borderline personality disorder | 53 | 0.35 | 2 | 3.8
Impulse-control disorders not elsewhere classified | Impulse-control disorder, intermittent explosive disorder | 58 | 0.02 | 10 | 17.2
Eating disorder | Eating disorder not otherwise specified, bulimia nervosa | 67 | 0.38 | 4 | 6.0
Substance-related disorder | Cannabis-related disorders, alcohol-related disorders | 58 | 0.09 | 13 | 22.4
Social and relational problem | Phase of life problem, partner relational problem, identity problem | 186 | 0.26 | 20 | 10.8

a. Including 65 patients with a mood disorder.


Person-Fit Analysis

Detection of misfit. We used statistic l_z for polytomous item scores, denoted by l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics.

Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reversely worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they still may be suspicious, as they may reflect gross under- or overreporting of symptoms.

Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.

Under the null model of fit to the IRT model and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each item-score pattern, we again estimated the θ value and computed the corresponding l_zm^p statistic. The 5000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p. The percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++. Algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
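The following is a minimal single-subscale sketch of this bootstrap in Python/NumPy, under the GRM parameterization sketched in Appendix A. It is not the authors' C++ implementation: the θ estimate uses a crude grid search rather than the Baker and Kim (2004) algorithms, the function names are ours, and the multiscale statistic l_zm^p would additionally standardize the sum of the subscale values.

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_category_probs(theta, a, b):
    """P(X_j = m | theta) under the GRM; a: (J,) discriminations, b: (J, M) ordered thresholds."""
    pstar = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))             # P(X_j >= m), m = 1..M
    pstar = np.hstack([np.ones((a.size, 1)), pstar, np.zeros((a.size, 1))])
    return np.clip(pstar[:, :-1] - pstar[:, 1:], 1e-12, 1.0)            # (J, M+1)

def lz_p(pattern, theta, a, b):
    """Standardized log-likelihood person-fit statistic l_z^p (cf. Appendix A)."""
    probs = grm_category_probs(theta, a, b)
    logp = np.log(probs)
    ll = logp[np.arange(pattern.size), pattern].sum()                   # observed log-likelihood
    e_item = (probs * logp).sum(axis=1)                                 # expectation per item
    var_item = (probs * logp ** 2).sum(axis=1) - e_item ** 2            # variance per item
    return (ll - e_item.sum()) / np.sqrt(var_item.sum())

def ml_theta(pattern, a, b, grid=np.linspace(-4, 4, 161)):
    """Crude grid-based ML estimate of theta."""
    loglik = [np.log(grm_category_probs(t, a, b))[np.arange(pattern.size), pattern].sum()
              for t in grid]
    return grid[int(np.argmax(loglik))]

def bootstrap_p_value(pattern, a, b, n_rep=5000):
    """Person-specific parametric bootstrap p value for l_z^p (one-tailed, lower)."""
    theta_hat = ml_theta(pattern, a, b)
    observed = lz_p(pattern, theta_hat, a, b)
    probs = grm_category_probs(theta_hat, a, b)
    null = np.empty(n_rep)
    for r in range(n_rep):
        sim = np.array([rng.choice(p.size, p=p / p.sum()) for p in probs])   # simulate under the GRM
        null[r] = lz_p(sim, ml_theta(sim, a, b), a, b)                       # re-estimate theta each time
    return observed, (null <= observed).mean()    # flag misfit when p < .05, as in the article
```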

Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).

Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
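A minimal sketch of such a model in Python with statsmodels is shown below, assuming a data frame with one row per patient. The column names, the file name, and the dummy coding are illustrative only, with the mood/anxiety category as the reference, mirroring the description above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# df is assumed to hold one row per patient with:
#   misfit    : 1 = significant l_zm^p misfit, 0 = no misfit
#   gender    : 0 = male, 1 = female
#   occasion  : 0 = at intake, 1 = during treatment
#   diagnosis : categorical, with "mood_anxiety" as the baseline category
#   gaf, oq_total : severity measures (Model 2 only)
df = pd.read_csv("oq45_personfit.csv")  # hypothetical file name

# Model 1: control variables and diagnosis category
m1 = smf.logit(
    "misfit ~ gender + occasion + C(diagnosis, Treatment(reference='mood_anxiety'))",
    data=df,
).fit()

# Model 2: add severity of distress (GAF code and OQ-45 total score)
m2 = smf.logit(
    "misfit ~ gender + occasion + C(diagnosis, Treatment(reference='mood_anxiety'))"
    " + gaf + oq_total",
    data=df,
).fit()

print(m1.summary())
print(m2.summary())
```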

Results

First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which the l_zm^p person-misfit classification was predicted by means of clinical diagnosis and severity of disorder.

Model-Fit Evaluation

Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measured substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.

For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated due to the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate (RMSEA = .13, SRMR = .05).

[Figure 1 appears here: three panels per patient (SR, IR, and SD subscales) showing standardized residuals on the vertical axis against item number on the horizontal axis.]

Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54, upper panel) and misfitting patient 2752 (l_zm^p = −7.96, lower panel).

For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of a logistic ISRF were only found for some items of the SD and SR subscales, respectively. Thus, the EFA results suggested that, more than other model violations, multidimensionality caused the subscale data to show GRM misfit.

To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, 88% of the responses in the most extreme category). We concluded that despite mild GRM model misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.

Detection of Misfit and Follow-Up Analyses

For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, thus showing that her item scores were consistent with the expected GRM item scores.

Patient 2752 (lower panel) was diagnosed to suffer from adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, residuals suggested unexpected high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpected high and low item scores. Two of the three items with unexpected high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpected low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.

Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance abuse disorder were significant. Patients with ADHD were less likely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit)

Predictor | Model 1 | Model 2
Intercept | −1.84 (0.11) | −1.93 (0.11)
Gender | −0.12 (0.13) | −0.12 (0.13)
Measurement occasion | −0.17 (0.27) | −0.18 (0.27)
Diagnosis category(a)
  Somatoform | 0.57 (0.29) | 0.74 (0.29)
  ADHD | −0.58 (0.28) | −0.39 (0.28)
  Psychotic | 1.05 (0.46) | 1.13 (0.47)
  Borderline | −1.30 (0.72) | −1.39 (0.73)
  Impulse control | 0.35 (0.36) | 0.57 (0.36)
  Eating disorders | −1.10 (0.60) | −0.97 (0.60)
  Substance related | 0.66 (0.33) | 0.69 (0.33)
  Social/relational | −0.20 (0.26) | 0.08 (0.27)
GAF code | — | −0.17 (0.07)
OQ-45 total score | — | 0.26 (0.07)

Note. N = 2434. The correlation between the GAF code and the OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpected high scores indicating severe symptoms. In general, these patients did not have large residuals for the same items. However, unexpected high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpected high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.

Discussion

We investigated prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We notice that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.

The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) also found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients more likely showed misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders more likely showed misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test-taking and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.

There are two possible general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder, and second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures rather than general outcome measures should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) to diagnose severe mental illnesses such as psychotic disorders and bipolar disorder.

The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not been evaluated previously by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.

Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.

needed on this subject

Other suggestions for future research include the fol-

lowing A first question is whether the prevalence of aber-

rant responding increases (eg due to diminishing

motivation) or decreases (eg due to familiarity with the

questions) with repeated outcome measurements A sec-ond question is whether the prevalence of aberrant

responding justifies routine application of PFA in clinical

settings such as Patient Reported Outcomes Measurement

Information System If prevalence among the tested

patients is low the number of item-score patterns incor-

rectly classified as aberrant (ie Type I errors) may out-

number the correctly identified aberrant item-score

patterns (Piedmont McCrae Riemann amp Angleitner

2000) Then PFA becomes very inefficient and it is

unlikely to improve the quality of individual decision

making in clinical practice Third future research may use

group-level analysis such as differential item functioning

analysis (Thissen Steinberg amp Wainer 1993) or IRT mix-

ture modeling (Rost 1990) to study whether patients with

the same disorder showed similar patterns of misfit

To conclude, our results have two main implications, pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data of other psychopathology measures have also been shown to be inconsistent with assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data of outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.

Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j; j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern x of person i is given by

l_{p}^{i}(\mathbf{x}) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i).   (A1)

The standardized log-likelihood is defined as

l_{z}^{p}(\mathbf{x}_i) = \frac{l_{p}^{i}(\mathbf{x}) - E\!\left[l_{p}^{i}(\mathbf{x})\right]}{\mathrm{VAR}\!\left[l_{p}^{i}(\mathbf{x})\right]^{1/2}},   (A2)

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.

Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

e_{ij} = X_{ij} - E(X_{ij}),   (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m \, P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and variance equal to

\mathrm{VAR}(e_{ij}) = E(X_{ij}^{2}) - \left[E(X_{ij})\right]^{2}.   (A4)

The standardized residual is given by

z_{e_{ij}} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}.   (A5)

To compute z_{e_{ij}}, the latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.

estimated value This may bias the standardization of eij

Appendix B

For each OQ-45 subscale we estimated an exploratory mul-

tidimensional IRT (MIRT Reckase 2009) model Based on

results of exploratory factor analyses conducted in Mplus

at Seoul National University on December 18 2014asmsagepubcomDownloaded from

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 1012

10 Assessment

(Mutheacuten amp Mutheacuten 2007) we used a MIRT model with

two factors for both the SR and IR subscale and three fac-

tors for the SD subscale The MIRT model is defined as

logit[ p(θ )] = minus1702(αθ + δ) Vector θ denotes the latent

traits and has a multivariate standard normal distribution

where the θ correlations are estimated along with the item

parameters of the MIRT model Higher α values indicate

better discriminating items and higher δ values indicate

more popular answer categories

We used the MIRT parameter estimates (Table B1) for

generating replicated OQ-45 data sets In each replication

we included two types of misfitting item-score patterns

patterns were due to random error or acquiescence

Following St-Onge Valois Abdous and Germain (2011)

we simulated varying levels of random error (random scores

on 10 20 or 30 items) and varying levels of acquiescence

(weak moderate and strong based on Van Herk Poortinga

and Verhallen 2004) The total percentage of misfitting pat-

terns in the data was 20 (Conijn et al 2013 Conijn et al

2014) Based on 100 replicated data sets the average Type

I error rate and the average detection rates of l zm p and the

standardized residual were computed For computing the

person-fit statistics GRM parameters were estimated using

MULTILOG 7 (Thissen et al 2003)

Table B1 Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

SR subscale IR subscale SD subscale

Item parameter M (SD) Range M (SD) Range M (SD) Range

Discrimination

α θ 1 076 (047) 023 146 084 (037) 036 165 090 (026) 052 132

α θ 2 026 (050) minus009 131 002 (037) minus048 073 003 (040) minus074 054

α θ 3 mdash mdash mdash mdash 002 (027) minus056 053

Threshold

δ 0 mdash mdash mdash mdash mdash mdash

δ 1 098 (070) minus025 184 112 (066) 014 221 176 (084) minus011 308

δ 2 010 (060) minus088 080 007 (052) minus098 095 077 (065) minus070 175

δ 3 minus082 (053) minus157 minus019 minus111 (067) minus240 minus018 minus040 (053) minus167 062

δ 4 minus184 (050) minus260 minus126 minus228 (087) minus362 minus109 minus174 (053) minus287 minus070

Latent-trait correlations

r θ θ 1 2 20 50 minus42r

θ θ 1 3 mdash mdash 53

r θ θ 2 3 mdash mdash minus55

Note To simulate weak moderate and strong acquiescence the δ s were shifted 1 2 or 3 points respectively (Cheung amp Rensvold 2000) For thepositively worded items the δ shift was positive and for the negatively worded items the δ shift was negative For weak acquiescence the averagepercentages of respondentsrsquo 3-scores and 4-scores were 27 and 25 respectively for moderate acquiescence percentages were 20 and 45 and forstrong acquiescence percentages were 12 and 88
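Below is a compact sketch of the acquiescence manipulation described in the note, using a single latent trait and made-up item parameters rather than the authors' full MIRT setup; the shift is added to the δ intercepts of regular items and subtracted for reversely worded items, whose positions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def category_probs(theta, a, deltas):
    """Cumulative-logit category probabilities with intercepts delta:
    P(X >= m) = 1 / (1 + exp(-1.702 * (a * theta + delta_m))); a: (J,), deltas: (J, M)."""
    pstar = 1.0 / (1.0 + np.exp(-1.702 * (a[:, None] * theta + deltas)))
    pstar = np.hstack([np.ones((a.size, 1)), pstar, np.zeros((a.size, 1))])
    return np.clip(pstar[:, :-1] - pstar[:, 1:], 1e-12, 1.0)

def simulate_pattern(theta, a, deltas, shift=0.0, reversed_items=None):
    """Draw one item-score pattern; `shift` (1, 2, or 3) implements acquiescence,
    applied positively to regular items and negatively to reversely worded items."""
    sign = np.ones(a.size)
    if reversed_items is not None:
        sign[reversed_items] = -1.0                              # negative delta shift, per the note
    probs = category_probs(theta, a, deltas + sign[:, None] * shift)
    return np.array([rng.choice(p.size, p=p / p.sum()) for p in probs])

# Illustrative (made-up) parameters for a 10-item, 5-category scale
J = 10
a = rng.uniform(0.5, 1.5, J)
deltas = np.sort(rng.normal(0, 1, (J, 4)))[:, ::-1]              # decreasing intercepts, as in Table B1
regular = simulate_pattern(theta=0.0, a=a, deltas=deltas)                      # model-consistent pattern
strong_acq = simulate_pattern(theta=0.0, a=a, deltas=deltas, shift=3.0,
                              reversed_items=[0, 1])                            # hypothetical reversed items
print(regular, strong_acq)
```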

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization of Scientific Research (NWO 400-06-087, first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire–45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.

Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236

Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.

Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.

Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., . . . Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.

Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the outcome questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.

Muthén, L. K., & Muthén, B. O. (2009). Mplus Short Course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.

Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 3, 271-282.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.

Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.

Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.

Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.

Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.

Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.

Page 4: Conijn, 2014

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 412

4 Assessment

number of patients classified in each category by their primary diagnosis. Three remarks are in order. First, because mood and anxiety symptoms dominate the OQ-45 (Lambert et al., 2004), we assumed that misfit was less likely for patients suffering from these symptoms than for patients with other diagnoses. Hence, we classified patients with mood and anxiety diagnoses into the same category and used this category as the baseline for testing the effects of the other diagnosis categories on person fit. Second, because we expected the probability of aberrant responding to depend on the specific symptoms the patient experienced, we categorized patients into diagnosis categories defined by symptomatology. Third, if we were unable to categorize a patient's diagnosis unambiguously in one of the categories (e.g., adjustment disorder with predominant disturbance of conduct), we treated the diagnosis as missing. Our approach resulted in 2,514 categorized patients (87%).

Statistical Analysis

Model-Fit Evaluation. We conducted PFA based on the graded response model (GRM; Samejima, 1997). The GRM is an IRT model for polytomous items. The core of the GRM consists of the item step response functions (ISRFs), which specify the relationship between the probability of a response in a specific or higher answer category and the latent trait the test measures. The GRM is defined by three assumptions: unidimensionality of the latent trait, absence of structural influences other than the latent trait on item responding (i.e., local independence), and logistic ISRFs. A detailed discussion of the GRM is beyond the scope of this study; the interested reader is referred to Embretson and Reise (2000) and Samejima (1997).
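To make the ISRF idea concrete, the minimal sketch below (not part of the original analysis; the item parameters are hypothetical) shows how the GRM ISRFs, P(X >= m | theta), translate into category probabilities for a single five-category item such as those of the OQ-45.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities P(X = m | theta), m = 0..M, for one GRM item.

    theta : latent trait value
    a     : item discrimination
    b     : ordered thresholds b_1 < ... < b_M defining the M ISRFs
    """
    b = np.asarray(b, dtype=float)
    isrf = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # P(X >= m | theta), m = 1..M
    upper = np.concatenate(([1.0], isrf))           # P(X >= 0) = 1
    lower = np.concatenate((isrf, [0.0]))           # P(X >= M + 1) = 0
    return upper - lower

# Hypothetical parameters for a 0-4 scored item
print(grm_category_probs(theta=0.5, a=1.3, b=[-1.5, -0.4, 0.7, 1.9]))
```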

Satisfactory GRM fit to the data is a prerequisite for applying GRM-based PFA to the OQ-45 subscale data. Forero and Maydeu-Olivares (2009) showed that differences between parameter estimates obtained from the full-information (GRM) and the limited-information (factor analysis on the polychoric correlation matrix) approaches are negligible. Hence, for each OQ-45 subscale we used exploratory factor analysis (EFA) for categorical data in Mplus (Muthén & Muthén, 2007) to assess the GRM assumptions of unidimensionality and local independence. For comparing one-factor models with multidimensional models, we used the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR; Muthén & Muthén, 2007). RMSEA ≤ .08 and SRMR < .05 suggest acceptable model fit (MacCallum, Browne, & Sugawara, 1996; Muthén & Muthén, 2009). To detect local dependence, we used the residual correlations under the one-factor solution. We assessed the logistic shape of the ISRFs by means of a graphical analysis, comparing the observed ISRFs with the ISRFs expected under the GRM (Drasgow, Levine, Tsien, Williams, & Mead, 1995). In case substantial violations of the GRM assumptions were identified, we used a simulation study to investigate whether PFA was sufficiently robust to the identified OQ-45 model misfit (Conijn et al., 2014).
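As a reminder of how these fit criteria behave, the short sketch below computes the standard RMSEA point estimate from a model chi-square, its degrees of freedom, and the sample size (the chi-square values are hypothetical); it also illustrates the small-df inflation noted by Kenny, Kaniskan, and McCoach (2014) and discussed in the Results.

```python
import math

def rmsea(chi2, df, n):
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Hypothetical values: the same chi-square yields a much larger RMSEA
# when df is small, as for a two-factor model with df = 7.
print(round(rmsea(chi2=250.0, df=7, n=2900), 3))    # ~0.11
print(round(rmsea(chi2=250.0, df=100, n=2900), 3))  # ~0.02
```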

Table 1. Description of the Diagnosis Categories Used as Explanatory Variables in the Multiple Regression Analysis. For each category: common DSM-IV diagnoses included; n; mean l_zm^p; number detected; percentage detected.

- Mood and anxiety disorder (a): depressive disorders, generalized anxiety disorders, phobias, panic disorders, posttraumatic stress disorder; n = 1,786; mean l_zm^p = 0.28; detected = 229 (12.8%)
- Somatoform disorder: pain disorder, somatization disorder, hypochondriasis, undifferentiated somatoform disorder; n = 82; mean l_zm^p = 0.16; detected = 16 (19.5%)
- Attention deficit hyperactivity disorder (ADHD): predominantly inattentive, combined hyperactive-impulsive and inattentive; n = 198; mean l_zm^p = 0.08; detected = 15 (7.6%)
- Psychotic disorder: schizophrenia, psychotic disorder not otherwise specified; n = 26; mean l_zm^p = −0.10; detected = 7 (26.9%)
- Borderline personality disorder: borderline personality disorder; n = 53; mean l_zm^p = 0.35; detected = 2 (3.8%)
- Impulse-control disorders not elsewhere classified: impulse-control disorder, intermittent explosive disorder; n = 58; mean l_zm^p = 0.02; detected = 10 (17.2%)
- Eating disorder: eating disorder not otherwise specified, bulimia nervosa; n = 67; mean l_zm^p = 0.38; detected = 4 (6.0%)
- Substance-related disorder: cannabis-related disorders, alcohol-related disorders; n = 58; mean l_zm^p = 0.09; detected = 13 (22.4%)
- Social and relational problem: phase of life problem, partner relational problem, identity problem; n = 186; mean l_zm^p = 0.26; detected = 20 (10.8%)

a. Including 65 patients with a mood disorder.


Person-Fit Analysis

Detection of misfit. We used statistic l_z for polytomous item scores, denoted l_z^p (Drasgow, Levine, & Williams, 1985), to identify item-score patterns that show misfit relative to the GRM. Statistic l_z^p is the standardized log-likelihood of a person's item-score pattern given the response probabilities under the GRM, with larger negative l_z^p values indicating a higher degree of misfit (see Appendix A for the equations). Emons (2008) found that l_z^p had a higher detection rate than several other person-fit statistics. Because item-score patterns that contain only 0s or only 4s (i.e., after recoding the reverse-worded items) always fit under the postulated GRM, the corresponding l_z^p statistics are meaningless and were therefore treated as missing values. Twenty-two respondents (1%) had a missing l_z^p value due to only 0 or 4 scores. We may add that even though these perfect patterns are consistent with the model, they may still be suspicious, as they may reflect gross under- or overreporting of symptoms.

Because the GRM is a model for unidimensional data, we computed statistic l_z^p for each OQ-45 subscale separately. To categorize persons as fitting or misfitting with respect to the complete OQ-45, we used the multiscale person-fit statistic l_zm^p (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the standardized sum of the subscale l_z^p values across all subscales.

Under the null model of fit to the IRT model and given the true θ value, statistic l_z^p is standard normally distributed (Drasgow et al., 1985), but when the unknown true θ value is replaced by the estimated θ value, statistic l_z^p is no longer standard normal (Nering, 1995). Therefore, following De la Torre and Deng (2008), we used the following parametric bootstrap procedure to compute the l_z^p and l_zm^p values and the corresponding p values. For each person, we generated 5,000 item-score patterns under the postulated GRM, using the item parameters and the person's estimated θ value. For each generated item-score pattern, we re-estimated θ and computed the corresponding l_zm^p statistic. The 5,000 bootstrap replications of l_zm^p determined the person-specific null distribution of l_zm^p. The percentile rank of the observed l_zm^p value in this bootstrapped distribution provided the p value. We used one-tailed significance testing and a .05 significance level (α). The GRM item parameters were estimated using MULTILOG (Thissen, Chen, & Bock, 2003). For the bootstrap procedure, we developed dedicated software in C++. Algorithms for estimating θ were obtained from Baker and Kim (2004). The software, including the source code, is available on request from the first author.
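The bootstrap itself was implemented in dedicated C++ software; the Python sketch below only outlines the logic for a single subscale under stated assumptions (hypothetical item parameters, crude grid-based ML estimation of θ, 500 replications instead of 5,000, and the single-scale l_z^p rather than the multiscale l_zm^p).

```python
import numpy as np

rng = np.random.default_rng(1)

def grm_probs(theta, a, b):
    """Category probabilities (items x categories) under the GRM."""
    isrf = 1.0 / (1.0 + np.exp(-a[:, None] * (theta - b)))   # P(X >= m), m = 1..M
    upper = np.hstack([np.ones((len(a), 1)), isrf])
    lower = np.hstack([isrf, np.zeros((len(a), 1))])
    return upper - lower

def lz_p(x, theta, a, b):
    """Standardized log-likelihood person-fit statistic (Drasgow et al., 1985)."""
    p = grm_probs(theta, a, b)
    logp = np.log(p)
    loglik = logp[np.arange(len(x)), x].sum()
    expected = (p * logp).sum()
    variance = ((p * logp ** 2).sum(axis=1) - (p * logp).sum(axis=1) ** 2).sum()
    return (loglik - expected) / np.sqrt(variance)

def estimate_theta(x, a, b, grid=np.linspace(-4, 4, 161)):
    """Crude maximum-likelihood estimate of theta over a grid."""
    ll = [np.log(grm_probs(t, a, b))[np.arange(len(x)), x].sum() for t in grid]
    return grid[int(np.argmax(ll))]

def bootstrap_p_value(x, a, b, reps=500):
    """Parametric bootstrap p value, in the spirit of De la Torre and Deng (2008)."""
    theta_hat = estimate_theta(x, a, b)
    observed = lz_p(x, theta_hat, a, b)
    null = []
    for _ in range(reps):
        probs = grm_probs(theta_hat, a, b)
        x_rep = np.array([rng.choice(probs.shape[1], p=pj) for pj in probs])
        t_rep = estimate_theta(x_rep, a, b)        # re-estimate theta per replication
        null.append(lz_p(x_rep, t_rep, a, b))
    return observed, np.mean(np.array(null) <= observed)   # one-tailed: misfit = low values

# Hypothetical 10-item, 5-category subscale and an erratic item-score pattern
a = rng.uniform(0.8, 2.0, size=10)
b = np.sort(rng.normal(0, 1, size=(10, 4)), axis=1)
x = np.array([0, 4, 0, 4, 0, 4, 0, 4, 0, 4])
print(bootstrap_p_value(x, a, b))
```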

Follow-up analyses. For each item-score pattern that l_zm^p classified as misfitting, we used standardized item-score residuals to identify the source of the person misfit (Ferrando, 2010, 2012). Negative residuals indicate that the person's observed item score is lower than expected under the estimated GRM, and positive residuals indicate that the item score is higher than expected (Appendix A). To test residuals for significance, we used critical values based on the standard normal distribution and two-tailed significance testing with α = .05 (i.e., cutoff values of −1.96 and 1.96) or α = .10 (i.e., cutoff values of −1.64 and 1.64).

Explanatory person-fit analysis. We used logistic regression to relate type of disorder and severity of psychological distress to person misfit on the OQ-45. The dependent variable was the dichotomous person-fit classification based on l_zm^p (1 = significant misfit, 0 = no misfit). Based on previous research results, gender (0 = male, 1 = female) and measurement occasion (0 = at intake, 1 = during treatment) were included in the model as control variables (e.g., Pitts, West, & Tein, 1996; Schmitt, Chan, Sacco, McFarland, & Jennings, 1999; Woods et al., 2008).
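A minimal sketch of such an explanatory model in Python (statsmodels), assuming a data frame with hypothetical file and column names; the original analysis was not necessarily run this way.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame, one row per patient:
# misfit: 1 = significant l_zm^p misfit, 0 = no misfit
# diagnosis: category label, with 'mood_anxiety' as the baseline
# gender: 0 = male, 1 = female; occasion: 0 = intake, 1 = during treatment
df = pd.read_csv("oq45_personfit.csv")  # hypothetical file

model1 = smf.logit(
    "misfit ~ gender + occasion + C(diagnosis, Treatment(reference='mood_anxiety'))",
    data=df,
).fit()

# Model 2 adds the severity indicators (GAF code and OQ-45 total score).
model2 = smf.logit(
    "misfit ~ gender + occasion + C(diagnosis, Treatment(reference='mood_anxiety'))"
    " + gaf + oq45_total",
    data=df,
).fit()
print(model2.summary())
```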

Results

First, we discuss model fit and the implications of the identified model misfit for the application of PFA to the OQ-45 data. Second, we discuss the number of item-score patterns that the l_zm^p statistic classified as misfitting (prevalence), and we illustrate how standardized item-score residuals may help infer possible causes of misfit for individual respondents. Third, we discuss the results of the logistic regression analysis in which the l_zm^p person-misfit classification was predicted from clinical diagnosis and severity of disorder.

Model-Fit Evaluation

Inspection of multiple correlation coefficients and item-rest correlations showed that Items 11, 26, and 32, which measure substance abuse, and Item 14 ("I work/study too much") fitted poorly in their subscales. As these results were consistent with previous research (De Jong et al., 2007; Mueller et al., 1998), we excluded these items from further analysis. Coefficient alphas for the SR (7 items; 2 items excluded), IR (10 items; 1 item excluded), and SD (24 items; 1 item excluded) subscales equaled .67, .78, and .91, respectively.

For the subscale data, EFA showed that the first factor explained 38.6% to 40.0% of the variance and that one-factor models fitted the subscale data poorly (RMSEA > .10 and SRMR > .06). For the SR subscale, three factors were needed to produce an RMSEA ≤ .08. The two-factor model produced an RMSEA of .13, but the SRMR of .05 was acceptable. The RMSEA value may have been inflated by the small number of degrees of freedom (i.e., df = 7) of the two-factor model (Kenny, Kaniskan, & McCoach, 2014). Because parallel analysis based on the polychoric correlation matrix suggested that two factors explained the data, we decided that a two-factor solution was most appropriate (RMSEA = .13, SRMR = .05).


[Figure 1: two panels of standardized residuals (vertical axis, −4 to 4) plotted against OQ-45 item numbers, grouped into the SR, IR, and SD subscales.]

Figure 1. Standardized residuals plotted by item number for fitting patient 663 (l_zm^p = 3.54; upper panel) and misfitting patient 2752 (l_zm^p = −7.96; lower panel).

For the IR subscale, a two-factor solution provided acceptable fit (RMSEA = .08 and SRMR = .04), and for the SD subscale, a three-factor solution provided acceptable fit (RMSEA = .07 and SRMR = .03). Violations of local independence and violations of logistic ISRFs were found only for some items of the SD and SR subscales, respectively. Thus, the EFA results suggested that multidimensionality, more than other model violations, caused the subscale data to show GRM misfit.

To investigate the performance of statistic l_zm^p and the standardized item-score residuals for detecting person misfit on the OQ-45 in the presence of mild model misfit, we used data simulation to assess the Type I error rates and the detection rates of these statistics. Data were simulated using methods proposed by Conijn et al. (2014; also see Appendix B). The types of misfit included were random error (three levels: random scores on 10, 20, or 30 items) and acquiescence (three levels: weak, moderate, and strong).

We found that for statistic l_zm^p the Type I error rate equaled .01, meaning that the risk of incorrectly classifying normal respondents as misfitting was small and much lower than the nominal Type I error rate. Furthermore, the power of l_zm^p to detect substantial random error ranged from .77 to .95 (i.e., 20 to 30 random item scores), and the power to detect acquiescence equaled at most .51 (i.e., for strong acquiescence, with 88% of the responses in the most extreme category). We concluded that despite mild GRM misfit, l_zm^p is useful for application to the OQ-45 but lacks power to detect acquiescence. For the residual statistic, we found modest power to detect deviant item scores due to random error and low power to detect deviant item scores due to acquiescence. Even though the residuals had low power in our simulation study, we decided to use the residual statistic for the OQ-45 data analysis to obtain further insight into the statistic's performance.

Detection of Misfit and Follow-Up Analyses

For 90 (3%) patients with a missing l_z^p value for one of the subscales, l_zm^p was computed across the two other OQ-45 subscales. Statistic l_zm^p ranged from −7.96 to 3.54 (M = 0.45, SD = 3.54). For 367 (12.6%) patients, statistic l_zm^p classified the observed item-score pattern as misfitting. With respect to age, gender, and measurement occasion, we did not find substantial differences between detection rates.

Based on the residual statistic's low power in the simulation study, we used α = .10 for identifying unexpected item scores. We use two cases to illustrate the use of the residual statistic. Figure 1 shows the standardized residuals for female patient 663, who had the highest l_zm^p value (l_zm^p = 3.54, p > .99), and for male patient 2752, who had the lowest l_zm^p value (l_zm^p = −7.96, p < .001). Patient 663 (upper panel) was diagnosed with posttraumatic stress disorder. The patient's absolute residuals were smaller than 1.64, showing that her item scores were consistent with the item scores expected under the GRM.

Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, the residuals suggested unexpectedly high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale


misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpectedly high and several unexpectedly low item scores. Two of the three items with unexpectedly high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpectedly low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, an extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.

Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance-related disorder were significant. Patients with ADHD were less likely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.
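These probabilities follow directly from the Model 2 coefficients in Table 2 by inverting the logit at the reference values of the other predictors; the short check below is a reading aid, not part of the original analysis.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

intercept = -1.93  # Model 2 intercept (baseline: mood and anxiety disorders)
for label, beta in [("baseline", 0.0), ("somatoform", 0.74),
                    ("psychotic", 1.13), ("substance-related", 0.69)]:
    print(label, round(inv_logit(intercept + beta), 2))
# baseline 0.13, somatoform 0.23, psychotic 0.31, substance-related 0.22
```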

category showed similar person misfit we compared the

standardized item-score residuals of the misfitting patterns

produced by patients with psychotic disorders (n = 7)

somatoform disorders (n = 16) or substance-related disor-

ders (n = 13) Most patients with a psychotic disorder had

low or average θ levels for each of the subscales Misfit was

due to several unexpected high scores indicating severe

symptoms In general these patients did not have large

Table 2. Estimated Regression Coefficients (Standard Errors) of the Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit).

Predictor                  Model 1          Model 2
Intercept                  −1.84 (0.11)     −1.93 (0.11)
Gender                     −0.12 (0.13)     −0.12 (0.13)
Measurement occasion       −0.17 (0.27)     −0.18 (0.27)
Diagnosis category (a)
  Somatoform                0.57 (0.29)      0.74 (0.29)
  ADHD                     −0.58 (0.28)     −0.39 (0.28)
  Psychotic                 1.05 (0.46)      1.13 (0.47)
  Borderline               −1.30 (0.72)     −1.39 (0.73)
  Impulse control           0.35 (0.36)      0.57 (0.36)
  Eating disorders         −1.10 (0.60)     −0.97 (0.60)
  Substance related         0.66 (0.33)      0.69 (0.33)
  Social/relational        −0.20 (0.26)      0.08 (0.27)
GAF code                       --           −0.17 (0.07)
OQ-45 total score              --            0.26 (0.07)

Note. N = 2,434. The correlation between the GAF code and the OQ-45 total score equaled −.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.


In general, these patients did not have large residuals for the same items. However, unexpectedly high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpectedly high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting a lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.

Discussion

We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM indeed failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to this model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception; in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.

The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State-Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients were more likely to show misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion, together with the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test taking as well as the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.

There are two possible general explanations for finding group differences in the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder; second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific rather than general outcome measures should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for severe mental illnesses such as psychotic disorders and bipolar disorder.

The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not previously been evaluated by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.

Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and


malingering. These types of aberrant responding bias total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance with that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.

Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient-Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). In that case, PFA becomes very inefficient and is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.

To conclude, our results have two main implications, pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data from other outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.

Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j, j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M) and 0 otherwise. The unstandardized log-likelihood of item-score pattern x of person i is given by

$$ l_{p_i}(\mathbf{x}) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i). \qquad (A1) $$

The standardized log-likelihood is defined as

$$ l_z^p(\mathbf{x}_i) = \frac{l_{p_i}(\mathbf{x}) - E[l_{p_i}(\mathbf{x})]}{\left\{ \mathrm{VAR}[l_{p_i}(\mathbf{x})] \right\}^{1/2}}, \qquad (A2) $$

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.
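Equations A1 and A2 translate directly into code; the sketch below assumes the GRM category probabilities for the person have already been computed (the probability matrix shown is hypothetical).

```python
import numpy as np

def lz_p(x, probs):
    """Standardized log-likelihood l_z^p (Equations A1-A2).

    x     : observed item scores, length J, values 0..M
    probs : J x (M + 1) matrix with P(X_j = m | theta_i)
    """
    logp = np.log(probs)
    loglik = logp[np.arange(len(x)), x].sum()                  # Equation A1
    expected = (probs * logp).sum()
    variance = ((probs * logp ** 2).sum(axis=1)
                - (probs * logp).sum(axis=1) ** 2).sum()
    return (loglik - expected) / np.sqrt(variance)             # Equation A2

# Tiny hypothetical example: 3 items, 5 categories
probs = np.array([[.10, .20, .40, .20, .10],
                  [.05, .15, .30, .30, .20],
                  [.40, .30, .15, .10, .05]])
print(lz_p(np.array([2, 3, 0]), probs))   # positive: pattern at least as likely as expected
print(lz_p(np.array([4, 0, 4]), probs))   # large negative: unlikely, misfitting pattern
```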

Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

$$ e_{ij} = X_{ij} - E(X_{ij}), \qquad (A3) $$

where E(X_{ij}) is the expected value of X_{ij}, which equals $E(X_{ij}) = \sum_{m=0}^{M} m \, P(X_{ij} = m \mid \theta_i)$. The residual e_{ij} has a mean of 0 and variance equal to

$$ \mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2. \qquad (A4) $$

The standardized residual is given by

$$ z_{e_{ij}} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \qquad (A5) $$

To compute z_{e_{ij}}, the latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
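The residuals of Equations A3 to A5 can be computed from the same category probabilities; the example values below are again hypothetical.

```python
import numpy as np

def standardized_residuals(x, probs):
    """Standardized item-score residuals z_e (Equations A3-A5).

    x     : observed item scores, length J
    probs : J x (M + 1) matrix with P(X_j = m | estimated theta)
    """
    m = np.arange(probs.shape[1])
    expected = probs @ m                       # E(X_ij), used in Equation A3
    variance = probs @ m ** 2 - expected ** 2  # VAR(e_ij), Equation A4
    return (x - expected) / np.sqrt(variance)  # z_e, Equation A5

probs = np.array([[.10, .20, .40, .20, .10],
                  [.05, .15, .30, .30, .20],
                  [.40, .30, .15, .10, .05]])
z = standardized_residuals(np.array([4, 0, 4]), probs)
print(np.round(z, 2))  # values beyond +/-1.64 flag unexpected item scores (alpha = .10)
```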

Appendix B

For each OQ-45 subscale we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of the exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and the IR subscale and three factors for the SD subscale. The MIRT model is defined as logit[P(θ)] = −1.702(α′θ + δ). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) to generate replicated OQ-45 data sets. In each replication we included two types of misfitting item-score patterns: patterns due to random error and patterns due to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, & Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).
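A compact sketch of the model-consistent part of this data-generating step, assuming a multidimensional graded parameterization in which P(X ≥ m | θ) = logistic(1.702(α′θ + δ_m)), which is consistent with higher δ values indicating more popular categories; the dimensions and parameter values are hypothetical stand-ins for the Table B1 estimates, and the injection of random-error and acquiescent patterns described above is omitted.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_subscale(n, alpha, delta, corr):
    """Simulate polytomous item scores under a two-dimensional graded MIRT model.

    n     : number of respondents
    alpha : J x 2 discrimination matrix
    delta : J x M matrix of category intercepts, decreasing across m
    corr  : correlation between the two latent traits
    """
    cov = np.array([[1.0, corr], [corr, 1.0]])
    theta = rng.multivariate_normal(np.zeros(2), cov, size=n)       # latent traits
    # Assumed parameterization: P(X >= m | theta) = logistic(1.702 * (alpha'theta + delta_m))
    lin = 1.702 * ((theta @ alpha.T)[:, :, None] + delta[None, :, :])
    isrf = 1.0 / (1.0 + np.exp(-lin))
    cum = np.concatenate([np.ones(isrf.shape[:2] + (1,)), isrf,
                          np.zeros(isrf.shape[:2] + (1,))], axis=2)
    probs = cum[:, :, :-1] - cum[:, :, 1:]                          # category probabilities
    u = rng.random(probs.shape[:2])[:, :, None]
    return (u > np.cumsum(probs, axis=2)).sum(axis=2)               # sampled scores 0..M

# Hypothetical 7-item, 5-category subscale in the spirit of the SR parameters
alpha = np.column_stack([rng.uniform(0.3, 1.5, 7), rng.uniform(-0.1, 1.3, 7)])
delta = np.sort(rng.normal(0, 1, size=(7, 4)), axis=1)[:, ::-1]     # decreasing intercepts
scores = simulate_subscale(n=1000, alpha=alpha, delta=delta, corr=0.20)
print(scores.shape, scores.min(), scores.max())
```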

Table B1. Descriptive Statistics for the Multidimensional IRT Model Parameters Used for Data Simulation. Entries are M (SD) and range, per subscale.

Discrimination
  α_θ1: SR 0.76 (0.47), 0.23 to 1.46; IR 0.84 (0.37), 0.36 to 1.65; SD 0.90 (0.26), 0.52 to 1.32
  α_θ2: SR 0.26 (0.50), −0.09 to 1.31; IR 0.02 (0.37), −0.48 to 0.73; SD 0.03 (0.40), −0.74 to 0.54
  α_θ3: SR --; IR --; SD 0.02 (0.27), −0.56 to 0.53
Threshold
  δ_0: -- for all subscales
  δ_1: SR 0.98 (0.70), −0.25 to 1.84; IR 1.12 (0.66), 0.14 to 2.21; SD 1.76 (0.84), −0.11 to 3.08
  δ_2: SR 0.10 (0.60), −0.88 to 0.80; IR 0.07 (0.52), −0.98 to 0.95; SD 0.77 (0.65), −0.70 to 1.75
  δ_3: SR −0.82 (0.53), −1.57 to −0.19; IR −1.11 (0.67), −2.40 to −0.18; SD −0.40 (0.53), −1.67 to 0.62
  δ_4: SR −1.84 (0.50), −2.60 to −1.26; IR −2.28 (0.87), −3.62 to −1.09; SD −1.74 (0.53), −2.87 to −0.70
Latent-trait correlations
  r_θ1θ2: SR .20; IR .50; SD −.42
  r_θ1θ3: SR --; IR --; SD .53
  r_θ2θ3: SR --; IR --; SD −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organisation for Scientific Research (NWO 400-06-087, first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway, F. (2002). Outcome measurement in mental health: Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.

Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236

Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.

Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.

Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., . . . Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credentialing Services.

Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.

Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.

Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.

Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.

Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.

Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.

Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.

Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.

Page 5: Conijn, 2014

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 512

Conijn et al 5

Person-Fit Analysis

Detection of misfit We used statistic l z for polytomous

item scores denoted by l z p (Drasgow Levine amp Williams

1985) to identify item-score patterns that show misfit rela-

tive to the GRM Statistic l z p

is the standardized log-like-

lihood of a personrsquos item-score pattern given the response

probabilities under the GRM with larger negative l z p

values indicating a higher degree of misfit (see Appendix

A for the equations) Emons (2008) found that l z p

had a

higher detection rate than several other person-fit statistics

Because item-score patterns that contain only 0s or only

4s (ie after recoding the reversed worded items) always

fit under the postulated GRM corresponding l z p

statistics

are meaningless and therefore treated as missing values

Twenty-two respondents (1) had a missing l z p

value due

to only 0 or 4 scores We may add that even though these

perfect patterns are consistent with the model they still may

be suspicious as they may reflect gross under- or overre-

porting of symptoms

Because the GRM is a model for unidimensional datawe computed statistic l z

p for each OQ-45 subscale sepa-

rately To categorize persons as fitting or misfitting with

respect to the complete OQ-45 we used the multiscale per-

son-fit statistic l zm p

(Conijn et al 2014 Drasgow Levine

amp McLaughlin 1991) which equals the standardized sum

of the subscale l z p values across all subscales

Under the null model of fit to the IRT model and given

the true θ value statistic l z p

is standard normally distributed

(Drasgow et al 1985) but when the unknown true θ value

is replaced by the estimated θ value statistic l z p is no longer

standard normal (Nering 1995) Therefore following De la

Torre and Deng (2008) we used the following parametric

bootstrap procedure to compute the l z p and l zm p values and

the corresponding p values For each person we generated

5000 item-score patterns under the postulated GRM using

the item parameters and the personrsquos estimated θ value For

each item-score pattern we again estimated the estimated θ

value and computed the corresponding l zm p statistic The

5000 bootstrap replications of l zm p

determined the person-

specific null distribution of l zm p

The percentile rank of the

observed lzm p

value in this bootstrapped distribution pro-

vided the p value We used one-tailed significance testing

and a 05 significance level (α) The GRM item parameters

were estimated using MULTILOG (Thissen Chen amp Bock

2003) For the bootstrap procedure we developed dedicatedsoftware in C++ Algorithms for estimating θ were obtained

from Baker and Kim (2004) The software including the

source code is available on request from the first author

Follow-up analyses For each item-score pattern l zm p classi-

fied as misfitting we used standardized item-score residuals

to identify the source of the person misfit (Ferrando 2010

2012) Negative residuals indicate that the personrsquos observed

item score is lower than expected under the estimated GRM

and positive residuals indicate that the item score is higher

than expected (Appendix A) To test residuals for signifi-

cance we used critical values based on the standard normal

distribution and two-tailed significance testing with α = 05

(ie cutoff values of minus196 and 196) or α = 10 (ie cutoff

values of minus164 and 164)

Explanatory person-fit analysis We used logistic regres-

sion to relate type of disorder and severity of psychological

distress to person misfit on the OQ-45 The dependent vari-

able was the dichotomous person-fit classification based on

l zm p

(1 = significant misfit 0 = no misfit) Based on pre-

vious research results gender (0 = male 1 = female) and

measurement occasion (0 = at intake 1 = during treatment)

were included in the model as control variables (eg Pitts

West amp Tein 1996 Schmitt Chan Sacco McFarland amp

Jennings 1999 Woods et al 2008)

Results

First we discuss model fit and implications of the identified

model misfit for the application of PFA to the OQ-45 data

Second we discuss the number of item-score patterns that

the l zm p

statistic classified as misfitting (prevalence) and we

illustrate how standardized item-score residuals may help

infer possible causes of misfit for individual respondents

Third we discuss the results of logistic regression analysis

in which l zm p

person misfit classification was predicted by

means of clinical diagnosis and severity of disorder

Model-Fit Evaluation

Inspection of multiple correlation coefficients and item-rest

correlations showed that the Items 11 26 and 32 which

measured substance abuse and Item 14 (ldquoI workstudy too

muchrdquo) fitted poorly in their subscales As these results

were consistent with previous research (De Jong et al

2007 Mueller et al 1998) we excluded these items from

further analysis Coefficient alphas for the SR (7 items 2

items excluded) IR (10 items 1 item excluded) and SD (24

items 1 item excluded) subscales equaled 67 78 and 91

respectively

For the subscale data EFA showed that the first factor

explained 386 to 400 of the variance and that one-fac-

tor models fitted the subscale data poorly (RMSEA gt 10 andSRMR gt 06) For the SR subscale three factors were needed

to produce an RMSEA le 08 The two-factor model pro-

duced an RMSEA of 13 but the SRMR of 05 was accept-

able The RMSEA value may have been inflated due to the

small number of degrees of freedom (ie df = 7) of the two-

factor model (Kenny Kaniskan amp McCoach 2014)

Because parallel analysis based on the polychoric correla-

tion matrix suggested that two factors explained the data we

decided that a two-factor solution was most appropriate

at Seoul National University on December 18 2014asmsagepubcomDownloaded from

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 612

6 Assessment

- 4

- 2

0

2

4

S t a n

d a r d i z e d r e s i d u a l s

4 12 21 28 38

39 44

SR subscale

1

7

16

17

18 19

20 30 37

43

IR subscale

2 3

56

8

9 101315

22

232425

27

2931

3334

3536

40

41

42

45

SD subscale

- 4

- 2

0

2

4

S t a n d a r d i z e d r e s i d u a l s

4

12

21

28

38 39 44

1

7

16

17

18

19

20

30 37

43

2

3

5 6

8

9

10

13

15

22

23

24

2527

29

3133

34

3536

40

41

42

45

Figure 1 Standardized residuals plotted by item number for fitting patient 663 (l zmp = 354 upper panel) and misfitting patient

2752 ( l zmp

= minus796 lower panel)

(RMSEA = 13 SRMR = 05) For the IR subscale a two-

factor solution provided acceptable fit (RMSEA = 08 and

SRMR = 04) and for the SD subscale a three-factor solu-

tion provided acceptable fit (RMSEA = 07 and SRMR =

03) Violations of local independence and violations of a

logistic ISRF were only found for some items of the SD and

SR subscales respectively Thus EFA results suggested that

more than other model violations multidimensionality

caused the subscale data to show GRM misfit

To investigate the performance of statistic l zm p

and the

standardized item-score residuals for detecting person mis-

fit on OQ-45 in the presence of mild model misfit we used

data simulation to assess the Type I error rates and the

detection rates of these statistics Data were simulated using

methods proposed by Conijn et al (2014 also see Appendix

B) The types of misfit included were random error (three

levels random scores on 10 20 or 30 items) and acquies-

cence (three levels weak moderate and strong)

We found that for statistic l zm p

Type I error rate equaled

01 meaning that the risk of incorrectly classifying normal

respondents as misfitting was small and much lower than

nominal Type I error rate Furthermore the power of l zm p

to

detect substantial random error ranged from 77 to 95 (ie20 to 30 random item scores) and the power to detect acqui-

escence equaled at most 51 (ie for strong acquiescence

88 of the responses in the most extreme category) We

concluded that despite mild GRM model misfit l zm p

is use-

ful for application to the OQ-45 but lacks power to detect

acquiescence For the residual statistic we found modest

power to detect deviant item scores due to random error

and low power to detect deviant item scores due to acquies-

cence Even though the residuals had low power in our

simulation study we decided to use the residual statistic for

the OQ-45 data analysis to obtain further insight in the sta-

tisticsrsquo performance

Detection of Misfit and Follow-Up Analyses

For 90 (3) patients with a missing l z p

value for one of the

subscales l zm p was computed across the two other OQ-45

subscales Statistic l zm p ranged from minus796 to 354 ( M =

045 SD = 354) For 367 (126) patients statistic l zm p

classified the observed item-score pattern as misfitting

With respect to age gender and measurement occasion we

did not find substantial differences between detection rates

Based on the residual statisticrsquos low power in the simula-

tion study we used α = 10 for identifying unexpected item

scores We use two cases to illustrate the use of the residual

statistic Figure 1 shows the standardized residuals for

female patient 663 who had the highest l zm p

value ( l zm p

=

354 p gt 99) and for male patient 2752 who had the low-

est l zm p value ( l zm

p = minus796 p lt 001) Patient 663 (upper

panel) was diagnosed with posttraumatic stress disorder

The patientrsquos absolute residuals were smaller than 164

thus showing that her item scores were consistent with theexpected GRM item scores

Patient 2752 (lower panel) was diagnosed with adjustment disorder with depressed mood. He had large residuals for each of the OQ-45 subscales, but misfit was largest on the IR subscale (l_z^p = −5.44) and the SD subscale (l_z^p = −7.66). On the IR subscale, residuals suggested unexpected high distress on Items 7, 19, and 20. One of these items concerned his "marriage/significant other relationship." Therefore, a possible cause of the IR subscale misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpected high and several unexpected low item scores. Two of the three items with unexpected high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpected low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.
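
In practice this screen-and-inspect routine is easy to automate. The sketch below assumes the person-fit value and the standardized item residuals have already been computed (see Appendix A) and uses the cutoffs from the text: a conventional lower-tail test at the 5% level for l_zm^p and |z| > 1.64 (two-sided, α = .10) for flagging individual items; the function name and inputs are ours.

    from scipy.stats import norm

    def flag_and_inspect(lzm_p, z_resid, item_labels, alpha_person=.05, alpha_item=.10):
        # Screen the respondent with the person-fit statistic (lower tail: large negative
        # values indicate misfit), then list items whose residuals exceed the critical value.
        misfitting = lzm_p < norm.ppf(alpha_person)        # e.g., flag if below -1.645
        crit = norm.ppf(1 - alpha_item / 2)                # 1.645 for alpha = .10
        suspect = [lab for lab, z in zip(item_labels, z_resid) if abs(z) > crit]
        return misfitting, suspect

    # For a case like patient 2752 (l_zm^p = -7.96), the pattern would be flagged and the items
    # with large residuals (e.g., Items 7, 19, and 20 on the IR subscale) would be listed as
    # starting points for the conversation with the patient.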

Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_zm^p value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance abuse disorder were significant. Patients with ADHD were less likely to show misfit than the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the negative effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpected high scores indicating severe symptoms.

Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_zm^p (1 = Significant Misfit at the 5% Level, 0 = No Misfit).

                            Model 1           Model 2
  Intercept                 −1.84 (0.11)      −1.93 (0.11)
  Gender                    −0.12 (0.13)      −0.12 (0.13)
  Measurement occasion      −0.17 (0.27)      −0.18 (0.27)
  Diagnosis category(a)
    Somatoform               0.57 (0.29)       0.74 (0.29)
    ADHD                    −0.58 (0.28)      −0.39 (0.28)
    Psychotic                1.05 (0.46)       1.13 (0.47)
    Borderline              −1.30 (0.72)      −1.39 (0.73)
    Impulse control          0.35 (0.36)       0.57 (0.36)
    Eating disorders        −1.10 (0.60)      −0.97 (0.60)
    Substance related        0.66 (0.33)       0.69 (0.33)
    Social/relational       −0.20 (0.26)       0.08 (0.27)
  GAF code                   —                −0.17 (0.07)
  OQ-45 total score          —                 0.26 (0.07)

Note. N = 2,434. The correlation between the GAF code and OQ-45 total score equaled −.26. (a) The mood and anxiety disorders category is used as the baseline category. *p < .05. **p < .01. ***p < .001.


In general, these patients did not have large residuals for the same items. However, unexpected high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpected high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.
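
The probabilities reported above follow directly from the Model 2 coefficients in Table 2 by inverting the logit, assuming the remaining predictors are held at their reference or centered values (our reading of how the reported figures were obtained). A minimal check:

    import math

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    intercept = -1.93   # Model 2 intercept (baseline: mood and anxiety disorders)
    effects = {"baseline": 0.00, "somatoform": 0.74, "psychotic": 1.13, "substance related": 0.69}

    for group, b in effects.items():
        print(group, round(inv_logit(intercept + b), 2))
    # baseline 0.13, somatoform 0.23, psychotic 0.31, substance related 0.22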

Discussion

We investigated the prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_zm^p is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_zm^p may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.

The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State–Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) also found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients more likely showed misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders more likely showed misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test-taking and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, suggesting that only item scores reflecting large misfit are identified, these results should be interpreted with caution.

There are two possible general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder, and second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures rather than general outcome measures should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) to diagnose severe mental illnesses such as psychotic disorders and bipolar disorder.

The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not been evaluated previously by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.

Person-fit statistics such as l_zm^p can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that the statistics have low power to identify response styles and malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.

Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000). Then PFA becomes very inefficient, and it is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.

To conclude, our results have two main implications, pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_zm^p is useful for application to outcome measures despite moderate model misfit due to multidimensionality. As data of other psychopathology measures have also been shown to be inconsistent with assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_zm^p statistic to data of outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.

Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j; j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern x_i of person i is given by

l_p(\mathbf{x}_i) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i).    (A1)

The standardized log-likelihood is defined as

l_z^p(\mathbf{x}_i) = \frac{l_p(\mathbf{x}_i) - E[l_p(\mathbf{x}_i)]}{\{VAR[l_p(\mathbf{x}_i)]\}^{1/2}},    (A2)

where E(l_p) is the expected value and VAR(l_p) is the variance of l_p.

Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

e_{ij} = X_{ij} - E(X_{ij}),    (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and variance equal to

VAR(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2.    (A4)

The standardized residual is given by

z_{e_{ij}} = \frac{e_{ij}}{\{VAR(e_{ij})\}^{1/2}}.    (A5)

To compute z_{e_{ij}}, latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
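
For readers who want to compute these quantities, the sketch below implements (A1)-(A5) for a single respondent, given the GRM category probabilities evaluated at the θ estimate. The closed forms used for E(l_p) and VAR(l_p) assume local independence, as in Drasgow, Levine, and Williams (1985); the array layout is our own convention, and the aggregation of subscale-level values into l_zm^p (Conijn et al., 2014) is not reproduced here.

    import numpy as np

    def lz_p(x, P):
        # x: length-J vector of observed item scores (0..M).
        # P: J x (M+1) array with P[j, m] = P(X_j = m | theta), at the estimated theta.
        x = np.asarray(x, dtype=int)
        logP = np.log(P)
        l_obs = logP[np.arange(x.size), x].sum()                                  # (A1)
        e_l = (P * logP).sum()                                                    # E(l_p)
        var_l = ((P * logP**2).sum(axis=1) - (P * logP).sum(axis=1)**2).sum()     # VAR(l_p)
        return (l_obs - e_l) / np.sqrt(var_l)                                     # (A2)

    def standardized_residuals(x, P):
        x = np.asarray(x, dtype=int)
        m = np.arange(P.shape[1])
        e_x = P @ m                                # E(X_j)
        var_x = P @ m**2 - e_x**2                  # (A4)
        return (x - e_x) / np.sqrt(var_x)          # (A3), (A5)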

Appendix B

For each OQ-45 subscale we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as P(θ) = [1 + exp(−1.702(αθ + δ))]^(−1). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.
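
As an illustration of how item scores can be drawn from this model, the sketch below treats P(θ) as the category boundary probability of a multidimensional graded model, P(X ≥ m | θ) = [1 + exp(−1.702(α·θ + δ_m))]^(−1) for m = 1, ..., M, which is our reading of the definition above. The parameter values are placeholders in the range of Table B1, not the estimates actually used in the study.

    import numpy as np

    rng = np.random.default_rng(7)

    def simulate_graded(alpha, delta, theta, rng):
        # delta holds the thresholds delta_1 .. delta_M in decreasing order, so the
        # boundary probabilities P(X >= m | theta) are nonincreasing in m.
        z = 1.702 * (alpha @ theta + delta)
        p_geq = 1.0 / (1.0 + np.exp(-z))
        u = rng.random()
        return int(np.sum(u < p_geq))      # number of boundaries passed = item score (0..M)

    alpha = np.array([0.8, 0.1])               # placeholder discriminations on two latent traits
    delta = np.array([1.1, 0.1, -1.1, -2.3])   # placeholder thresholds for a five-category item
    theta = rng.multivariate_normal(np.zeros(2), np.array([[1.0, 0.2], [0.2, 1.0]]))
    score = simulate_graded(alpha, delta, theta, rng)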

We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication we included two types of misfitting item-score patterns: patterns due to random error or to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong; based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_zm^p and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).

Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation.

                         SR subscale                  IR subscale                  SD subscale
  Item parameter         M (SD)        Range          M (SD)        Range          M (SD)        Range
  Discrimination
    α_θ1                 0.76 (0.47)   0.23, 1.46     0.84 (0.37)   0.36, 1.65     0.90 (0.26)   0.52, 1.32
    α_θ2                 0.26 (0.50)   −0.09, 1.31    0.02 (0.37)   −0.48, 0.73    0.03 (0.40)   −0.74, 0.54
    α_θ3                 —             —              —             —              0.02 (0.27)   −0.56, 0.53
  Threshold
    δ_0                  —             —              —             —              —             —
    δ_1                  0.98 (0.70)   −0.25, 1.84    1.12 (0.66)   0.14, 2.21     1.76 (0.84)   −0.11, 3.08
    δ_2                  0.10 (0.60)   −0.88, 0.80    0.07 (0.52)   −0.98, 0.95    0.77 (0.65)   −0.70, 1.75
    δ_3                  −0.82 (0.53)  −1.57, −0.19   −1.11 (0.67)  −2.40, −0.18   −0.40 (0.53)  −1.67, 0.62
    δ_4                  −1.84 (0.50)  −2.60, −1.26   −2.28 (0.87)  −3.62, −1.09   −1.74 (0.53)  −2.87, −0.70
  Latent-trait correlations
    r_θ1θ2               .20                          .50                          −.42
    r_θ1θ3               —                            —                            .53
    r_θ2θ3               —                            —                            −.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization of Scientific Research (NWO 400-06-087, first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.
Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health—Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236
Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.

Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.
Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., & Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.
Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.
Muthén, L. K., & Muthén, B. O. (2009). Mplus Short Course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic%201-v11.pdf
Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.
Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.
Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.
Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.
Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person fit and effect of person fit on test validity. Applied Psychological Measurement, 23, 41-53.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.
Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.
Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.
Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.


Research and Practice 35 492-499

Lambert M J Morton J J Hatfield D Harmon C Hamilton

S Reid R C Burlingame G M (2004) Administration

and scoring manual for the OQ-452 (Outcome Questionnaire)

(3th ed) Wilmington DE American Professional CredentialServices

Lambert M J amp Shimokawa K (2011) Collecting client feed-

back Psychotherapy 48 72-79

MacCallum R C Browne M W amp Sugawara H M (1996)

Power analysis and determination of sample size for covari-

ance structure modeling Psychological Methods 1 130-149

Meijer R R amp Sijtsma K (2001) Methodology review

Evaluating person fit Applied Psychological Measurement

25 107-135

Mueller R M Lambert M J amp Burlingame G M (1998)

Construct validity of the outcome questionnaire A confirma-

tory factor analysis Journal of Personality Assessment 70

248-262

Mutheacuten B O amp Mutheacuten L K (2007) Mplus Statistical analy-sis with latent variables (Version 50) Los Angeles CA

Statmodel

Mutheacuten L K amp Mutheacuten B O (2009) Mplus Short Course vid-

eos and handouts Retrieved from httpwwwstatmodelcom

downloadTopic201-v11pdf

Nering M L (1995) The distribution of person fit using true

and estimated person parameters Applied Psychological

Measurement 19 121-129

Piedmont R L McCrae R R Riemann R amp Angleitner

A (2000) On the invalidity of validity scales Evidence

from self-reports and observer ratings in volunteer samples

Journal of Personality and Social Psychology 78 582-593

Pirkis J E Burgess P M Kirk P K Dodson S Coombs

T J amp Williamson M K (2005) A review of the psycho-

metric properties of the Health of the Nation Outcome Scales

(HoNOS) family of measures Health and Quality of Life

Outcomes 3 76-87

Pitts S C West S G amp Tein J (1996) Longitudinal measure-

ment models in evaluation research Examining stability and

change Evaluation and Program Planning 19 333-350

Reckase M D (2009) Multidimensional item response theory

New York NY Springer

Reise S P amp Waller N G (1993) Traitedness and the assess-

ment of response pattern scalability Journal of Personality

and Social Psychology 65 143-151

Reise S P amp Waller N G (2009) Item response theory and

clinical measurement Annual Review of Clinical Psychology

5 27-48

Rost J (1990) Rasch models in latent classes An integration

of two approaches to item analysis Applied Psychological

Measurement 3 271-282

Samejima F (1997) Graded response model In W J van der

Linden amp R Hambleton (Eds) Handbook of modern itemresponse theory (pp 85-100) New York NY Springer

Schmitt N Chan D Sacco J M McFarland L A amp Jennings

D (1999) Correlates of person-fit and effect of person-fit

on test validity Applied Psychological Measurement 23

41-53

Spielberger C D Gorsuch R L Lushene R Vagg P R amp

Jacobs G A (1983) Manual for the State-Trait Anxiety

Inventory (Form Y) Palo Alto CA Consulting Psychologists

Press

St-Onge C Valois P Abdous B amp Germain S (2011) Person-

fit statisticsrsquo accuracy A Monte Carlo study of the aberrance

ratersquos influence Applied Psychological Measurement 35

419-432

Tellegen A (1988) The analysis of consistency in personalityassessment Journal of Personality 56 621-663

Thissen D Chen W H amp Bock R D (2003) MULTILOG for

Windows (Version 7) Lincolnwood IL Scientific Software

International

Thissen D Steinberg L amp Wainer H (1993) Detection of dif-

ferential functioning using the parameters of item response

models In P W Holland amp H Wainer (Eds) Differential

Item Functioning (pp 67-113) Hillsdale NJ Lawrence

Erlbaum

Van Herk H Poortinga Y H amp Verhallen T M M (2004)

Response styles in rating scales Evidence of method bias

in data from six EU countries Journal of Cross-Cultural

Psychology 35 346-360

Wing J K Beevor A S Curtis R H Park B G Hadden S

amp Burns H (1998) Health of the Nation Outcome Scales

(HoNOS) Research and development British Journal of

Psychiatry 172 11-18

Wood J M Garb H N Lilienfeld S O amp Nezworski M T

(2002) Clinical assessment Annual Review of Psychology

53 519-543

Woods C M Oltmanns T F amp Turkheimer E (2008)

Detection of aberrant responding on a personality scale in a

military sample An application of evaluating person fit with

two-level logistic regression Psychological Assessment 20

159-168


misfit may be that his problems were limited to this relationship. On the SD subscale, he had both several unexpected high and low item scores. Two of the three items with unexpected high scores reflected mood symptoms of depression: feeling blue (Item 42) and not being happy (Item 13). The third item concerned suicidal thoughts (Item 8). Most items with unexpected low scores concerned low self-worth and incompetence (Items 15, 24, and 40) and hopelessness (Item 10), which are all cognitive symptoms of depression. A plausible explanation of the SD subscale misfit, consistent with the patient's diagnosis, is that an external source of psychological distress caused the patient to experience only the mood symptoms of depression but not the cognitive symptoms. Hence, the cognitive symptoms constituted a separate dimension on which the patient had a lower trait value. Furthermore, except for 10 items, all other item scores were either 0s or 4s. Hence, apart from potential content-related misfit, extreme response style may have been another cause of the severe misfit. In practice, the clinician may discuss unexpected item scores and potential explanations with the patient and suggest a more definite explanation for the person misfit.

Explanatory Person-Fit Analysis

For each of the diagnosis categories, Table 1 shows the average l_z^mp value and the number and percentage of patients classified as misfitting. For patients with mood and anxiety disorders (i.e., the baseline category), the detection rate was substantial (12.8%) but not high relative to most of the other diagnosis categories.

Table 2 shows the results of the logistic regression analysis. Model 1 included gender, measurement occasion, and the diagnosis categories as predictors of person misfit. Diagnosis category had a significant overall effect, χ2(8) = 26.47, p < .001. The effects of somatoform disorder, ADHD, psychotic disorder, and substance-related disorder were significant. Patients with ADHD were unlikely to show misfit relative to the baseline category of patients with mood or anxiety disorders. Patients with somatoform disorders, psychotic disorders, and substance-related disorders were more likely to show misfit.

Model 2 (Table 2, third column) also included GAF code and OQ-45 total score. Both effects were significant and suggested that patients with higher levels of psychological distress were more likely to show misfit. After controlling for GAF code and OQ-45 score, the negative effect of ADHD was no longer significant. Hence, patients with ADHD were less likely to show misfit because they had less severe levels of distress. For the baseline category, the estimated probability of misfit was .13. For patients with somatoform disorders, psychotic disorders, and substance-related disorders, the probability was .23, .31, and .22, respectively.

To investigate whether patients in the same diagnosis category showed similar person misfit, we compared the standardized item-score residuals of the misfitting patterns produced by patients with psychotic disorders (n = 7), somatoform disorders (n = 16), or substance-related disorders (n = 13). Most patients with a psychotic disorder had low or average θ levels for each of the subscales. Misfit was due to several unexpected high scores indicating severe symptoms.

Table 2. Estimated Regression Coefficients of Logistic Regression Predicting Person Misfit Based on l_z^mp (1 = Significant Misfit at the 5% Level, 0 = No Misfit)

                         Model 1          Model 2
Intercept                -1.84 (0.11)     -1.93 (0.11)
Gender                   -0.12 (0.13)     -0.12 (0.13)
Measurement occasion     -0.17 (0.27)     -0.18 (0.27)
Diagnosis category^a
  Somatoform              0.57 (0.29)      0.74 (0.29)
  ADHD                   -0.58 (0.28)     -0.39 (0.28)
  Psychotic               1.05 (0.46)      1.13 (0.47)
  Borderline             -1.30 (0.72)     -1.39 (0.73)
  Impulse control         0.35 (0.36)      0.57 (0.36)
  Eating disorders       -1.10 (0.60)     -0.97 (0.60)
  Substance related       0.66 (0.33)      0.69 (0.33)
  Social/relational      -0.20 (0.26)      0.08 (0.27)
GAF code                    --            -0.17 (0.07)
OQ-45 total score           --             0.26 (0.07)

Note. N = 2,434. The correlation between the GAF code and the OQ-45 total score equaled -.26.
a. The mood and anxiety disorders category is used as the baseline category.
*p < .05. **p < .01. ***p < .001.
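To see how the Model 2 estimates in Table 2 map onto the probabilities reported in the text, the short sketch below applies the inverse-logit transformation to the intercept plus the relevant diagnosis coefficient. This is our own illustrative check (Python), not part of the original analysis; treating the remaining predictors as absorbed into the intercept is a simplification.

```python
import math

def inv_logit(x):
    # Inverse-logit (logistic) function: converts log-odds to a probability.
    return 1.0 / (1.0 + math.exp(-x))

# Model 2 coefficients from Table 2 (log-odds scale).
intercept = -1.93
diagnosis_effect = {
    "mood/anxiety (baseline)": 0.00,
    "somatoform": 0.74,
    "psychotic": 1.13,
    "substance related": 0.69,
}

for group, beta in diagnosis_effect.items():
    print(f"{group}: {inv_logit(intercept + beta):.2f}")
# Prints approximately 0.13, 0.23, 0.31, and 0.22, matching the probabilities
# of person misfit reported in the text.
```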


In general, these patients did not have large residuals for the same items. However, unexpected high scores on Item 25 ("disturbing thoughts come into my mind that I cannot get rid of") were frequent (4 patients), which is consistent with the symptoms characterizing psychotic disorders. Low to average θ levels combined with several unexpected high item scores suggested that patients with psychotic disorders showed misfit because many OQ-45 items were irrelevant to them, suggesting lack of traitedness (Reise & Waller, 1993). For the misfitting patients with a somatoform disorder or a substance-related disorder, the standardized residuals showed no common pattern, and thus they did not show similar person misfit.

Discussion

We investigated prevalence and explanations of aberrant responding to the OQ-45 by means of IRT-based person-fit methods. Reise and Waller (2009) suggested that IRT models fit poorly to psychopathology data and that misfit may adversely affect PFA. The GRM failed to fit the OQ-45 data, but consistent with previous research results (Conijn et al., 2014), our simulation results suggested that l_z^mp is robust to model misfit. The low empirical Type I error rates suggested that GRM misfit did not lead to incorrect classification of model-consistent item-score patterns as misfitting. The current findings are valuable because they were obtained under realistic conditions, using a psychometrically imperfect outcome measure that is frequently used in practice. We note that this is the rule rather than the exception: in general, measurement instruments' psychometric properties are imperfect. However, our findings concerning the robustness of l_z^mp may only hold for the kind of model misfit we found for the OQ-45. Future studies should investigate the robustness of PFA methods to different IRT model violations.

The detection rate of 12.6% in the OQ-45 data is substantial and comparable with detection rates found in other studies using different measures. For example, using repeated measurements on the State-Trait Anxiety Inventory (Spielberger, Gorsuch, Lushene, Vagg, & Jacobs, 1983), Conijn et al. (2013) found detection rates of 11% to 14% in a sample of cardiac patients, and Conijn et al. (2014) found 16% misfit on the International Personality Item Pool 50-item questionnaire (Goldberg et al., 2006) in a panel sample from the general population.

Consistent with previous research (Conijn et al., 2013; Reise & Waller, 1993; Woods et al., 2008), we found that more severely distressed patients more likely showed misfit. Also, patients with somatoform disorders, psychotic disorders, and substance-related disorders more likely showed misfit. Plausible explanations for these results include patients tending to deny their mental problems (somatoform disorders); symptoms of disorder and confusion and the irrelevance of most OQ-45 items (psychotic disorders); and being under the influence during test-taking and the negative effect of long-term substance use on cognitive abilities (substance-related disorders). The residual analysis confirmed the explanation of limited item relevance for patients with psychotic disorders, but it also suggested that patients identified with the same type of disorder generally did not show the same type of misfit. However, because the simulation study revealed that the residual statistic had low power, which implies that only item scores reflecting large misfit are identified, these results should be interpreted with caution.

There are two possible general explanations for finding group differences with respect to the tendency to show person misfit. First, person misfit may be due to a mismatch between the OQ-45 and the specific disorder; second, misfit may be due to a general tendency to show misfit on self-report measures. Each explanation has a unique implication for outcome measurement. The mismatch explanation implies that disease-specific outcome measures rather than general outcome measures should be used. Examples of disease-specific outcome measures are the Patient Health Questionnaire-15 (Kroenke, Spitzer, & Williams, 2002) for assessing somatic symptoms and the Severe Outcome Questionnaire (Burlingame, Thayer, Lee, Nelson, & Lambert, 2007) for diagnosing severe mental illnesses such as psychotic disorders and bipolar disorder.

The general-misfit explanation implies that methods other than self-reports should be used for patients' treatment decisions, for example, clinician-rated outcome measures such as the Health of the Nation Outcome Scales (Wing et al., 1998). Also, these patients' self-report results should be excluded from cost-effectiveness studies to prevent potential negative effects on policy decisions. Future studies may address the scale-specific-misfit versus general-misfit explanations by means of explanatory PFA of data from other outcome measures.

Residual statistics have been shown to be useful in real-data applications for analyzing causes of aberrant responding to unidimensional personality scales containing at least 30 items (Ferrando, 2010, 2012), but their performance had not been evaluated previously by means of simulations. Our simulation study and real-data application question the usefulness of the residual statistic for retrieving causes of misfit in outcome measures consisting of multiple short subscales. An alternative method is the inspection of item content to identify unlikely combinations of item scores, in particular when outcome measures contain few items. Such a qualitative approach to follow-up PFA could be particularly useful when clinicians use the OQ-45 to provide feedback to the patient.

Person-fit statistics such as l_z^mp can potentially detect aberrant responding due to low traitedness, low motivation, cognitive deficits, and concentration problems. However, an important limitation for outcome measurement is that these statistics have low power to identify response styles and


malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.

Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000), as illustrated in the calculation following this paragraph. In that case, PFA becomes very inefficient and is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analyses, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder showed similar patterns of misfit.
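The following back-of-the-envelope calculation illustrates the prevalence argument above; the prevalence, Type I error rate, and detection rate are assumed values chosen only for illustration and are not estimates from this study.

```python
# Assumed rates, for illustration only (not estimates from this study).
n_patients = 1000
prevalence = 0.05      # proportion of patients who actually respond aberrantly
type_i_error = 0.05    # proportion of consistent patterns flagged by mistake
detection_rate = 0.60  # proportion of aberrant patterns correctly flagged

false_positives = (1 - prevalence) * type_i_error * n_patients  # 47.5 patients
true_positives = prevalence * detection_rate * n_patients       # 30.0 patients

print(false_positives, true_positives)
# Even with reasonable power, flagged patterns are more often Type I errors
# than genuine misfit when aberrant responding is rare.
```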

To conclude, our results have two main implications, pertaining to psychometrics and psychopathology. First, the simulation study results suggest that l_z^mp is useful for application to outcome measures, despite moderate model misfit due to multidimensionality. As data of other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the results of the simulation study are valuable when considering application of the l_z^mp statistic to data of outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Also, more severely distressed patients, for whom psychological intervention is most needed, appear to be at the largest risk of producing invalid outcome scores. Overall, our results emphasize the importance of person-misfit identification in outcome measurement and demonstrate that PFA may be useful for preventing incorrect decision making in clinical practice due to aberrant responding.

Appendix A

Statistic l_z^p

Suppose the data are polytomous scores on J items (items are indexed j; j = 1, ..., J) with M + 1 ordered answer categories. Let the score on item j be denoted by X_j, with possible realizations x_j = 0, ..., M. The latent trait is denoted by θ, and P(X_j = x_j | θ) is the probability of a score X_j = x_j. Let d_j(m) = 1 if x_j = m (m = 0, ..., M), and 0 otherwise. The unstandardized log-likelihood of an item-score pattern x of person i is given by

    l_i^p(\mathbf{x}) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i).    (A1)

The standardized log-likelihood is defined as

    l_z^p(\mathbf{x}_i) = \frac{l_i^p(\mathbf{x}) - E[l_i^p(\mathbf{x})]}{\mathrm{VAR}[l_i^p(\mathbf{x})]^{1/2}},    (A2)

where E(l^p) is the expected value and VAR(l^p) is the variance of l^p.
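As a concrete illustration of Equations (A1) and (A2), the sketch below (our own minimal Python example, not the authors' code) computes l_z^p for a single item-score pattern from model-implied category probabilities, using the standard moment expressions for the polytomous log-likelihood under local independence (Drasgow, Levine, & Williams, 1985).

```python
import math

def lz_p(pattern, probs):
    """Standardized log-likelihood person-fit statistic for one respondent.

    pattern : observed category scores x_j (0, ..., M) for the J items
    probs   : probs[j][m] = P(X_j = m | theta_i), evaluated at the estimated theta
    """
    # Equation (A1): observed log-likelihood of the item-score pattern.
    l_p = sum(math.log(p_j[x_j]) for x_j, p_j in zip(pattern, probs))
    # Expected value and variance of the log-likelihood under local independence.
    e_lp = sum(p * math.log(p) for p_j in probs for p in p_j)
    var_lp = sum(
        sum(p * math.log(p) ** 2 for p in p_j)
        - sum(p * math.log(p) for p in p_j) ** 2
        for p_j in probs
    )
    # Equation (A2): standardize the log-likelihood.
    return (l_p - e_lp) / math.sqrt(var_lp)

# Toy example: three 5-category items with made-up category probabilities.
probs = [[0.10, 0.20, 0.40, 0.20, 0.10],
         [0.05, 0.15, 0.30, 0.30, 0.20],
         [0.25, 0.35, 0.20, 0.15, 0.05]]
print(round(lz_p([2, 3, 0], probs), 2))  # about 1.12: consistent with the model
print(round(lz_p([4, 0, 4], probs), 2))  # about -4.73: flags potential misfit
```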

Standardized Residual Statistic

The unstandardized residual for person i on item j is given by

    e_{ij} = X_{ij} - E(X_{ij}),    (A3)

where E(X_{ij}) is the expected value of X_{ij}, which equals E(X_{ij}) = \sum_{m=0}^{M} m P(X_{ij} = m \mid \theta_i). The residual e_{ij} has a mean of 0 and variance equal to

    \mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2.    (A4)

The standardized residual is given by

    z_{e_{ij}} = \frac{e_{ij}}{\mathrm{VAR}(e_{ij})^{1/2}}.    (A5)

To compute z_{e_{ij}}, latent trait θ_i needs to be replaced by its estimated value. This may bias the standardization of e_{ij}.
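Equations (A3) to (A5) can be sketched in the same way; again, the category probabilities below are made up for illustration and are not taken from the OQ-45 analysis.

```python
import math

def standardized_residual(x_ij, p_j):
    # p_j[m] = P(X_ij = m | theta_i); expected score and variance follow (A3)-(A4).
    e_x = sum(m * p for m, p in enumerate(p_j))
    e_x2 = sum(m * m * p for m, p in enumerate(p_j))
    return (x_ij - e_x) / math.sqrt(e_x2 - e_x ** 2)   # Equation (A5)

# A score of 4 on an item whose model-implied expected score is 2.0:
print(round(standardized_residual(4, [0.10, 0.20, 0.40, 0.20, 0.10]), 2))  # 1.83
```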

Appendix B

For each OQ-45 subscale, we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on results of exploratory factor analyses conducted in Mplus


(Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and IR subscales and three factors for the SD subscale. The MIRT model is defined as logit[p(θ)] = -1.702(αθ + δ). Vector θ denotes the latent traits and has a multivariate standard normal distribution, where the θ correlations are estimated along with the item parameters of the MIRT model. Higher α values indicate better discriminating items, and higher δ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication, we included two types of misfitting item-score patterns: patterns were due to either random error or acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong, based on Van Herk, Poortinga, and Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of l_z^mp and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003).

Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

                          SR subscale                 IR subscale                 SD subscale
Item parameter            M (SD)        Range         M (SD)        Range         M (SD)        Range
Discrimination
  α_θ1                    0.76 (0.47)   0.23, 1.46    0.84 (0.37)   0.36, 1.65    0.90 (0.26)   0.52, 1.32
  α_θ2                    0.26 (0.50)   -0.09, 1.31   0.02 (0.37)   -0.48, 0.73   0.03 (0.40)   -0.74, 0.54
  α_θ3                    --            --            --            --            0.02 (0.27)   -0.56, 0.53
Threshold
  δ_0                     --            --            --            --            --            --
  δ_1                     0.98 (0.70)   -0.25, 1.84   1.12 (0.66)   0.14, 2.21    1.76 (0.84)   -0.11, 3.08
  δ_2                     0.10 (0.60)   -0.88, 0.80   0.07 (0.52)   -0.98, 0.95   0.77 (0.65)   -0.70, 1.75
  δ_3                     -0.82 (0.53)  -1.57, -0.19  -1.11 (0.67)  -2.40, -0.18  -0.40 (0.53)  -1.67, 0.62
  δ_4                     -1.84 (0.50)  -2.60, -1.26  -2.28 (0.87)  -3.62, -1.09  -1.74 (0.53)  -2.87, -0.70
Latent-trait correlations
  r_θ1θ2                  .20                         .50                         -.42
  r_θ1θ3                  --                          --                          .53
  r_θ2θ3                  --                          --                          -.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27% and 25%, respectively; for moderate acquiescence, the percentages were 20% and 45%; and for strong acquiescence, the percentages were 12% and 88%.
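A minimal sketch of the simulation design described in this appendix is given below. It is our own illustration rather than the authors' code: the item parameters are random stand-ins instead of the Table B1 estimates, a unidimensional graded response model is used for brevity, and only the random-error type of misfit is injected.

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items, n_cats = 500, 45, 5

# Stand-in GRM item parameters (NOT the Table B1 estimates).
alpha = rng.uniform(0.5, 1.5, n_items)
thresholds = np.sort(rng.normal(0.0, 1.0, (n_items, n_cats - 1)), axis=1)

def grm_response(theta, a, b):
    # Graded response model in a common parameterization:
    # P(X >= m) = logistic(1.702 * a * (theta - b_m)), m = 1, ..., M.
    p_geq = 1.0 / (1.0 + np.exp(-1.702 * a * (theta - b)))
    p_cat = -np.diff(np.concatenate(([1.0], p_geq, [0.0])))  # P(X = 0), ..., P(X = M)
    return rng.choice(n_cats, p=p_cat)

theta = rng.normal(size=n_persons)
data = np.array([[grm_response(t, alpha[j], thresholds[j]) for j in range(n_items)]
                 for t in theta])

# Inject misfit: 20% of respondents give random scores on 20 of the 45 items.
misfitting = rng.choice(n_persons, size=n_persons // 5, replace=False)
for i in misfitting:
    items = rng.choice(n_items, size=20, replace=False)
    data[i, items] = rng.integers(0, n_cats, size=20)

# In the study design, GRM parameters would then be re-estimated, l_z^mp and the
# standardized residuals computed per respondent, and the Type I error and
# detection rates averaged over 100 replications.
```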

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization of Scientific Research (NWO 400-06-087, first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.

Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.

Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire-45. Psychological Assessment, 22, 350-355.

Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.

Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.


Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.

Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.

Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.

Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.

Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.

De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.

De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

Derogatis, L. R. (1993). BSI Brief Symptom Inventory: Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.

Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.

Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.

Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.

Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.

Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.

Holloway, F. (2002). Outcome measurement in mental health - welcome to the revolution. British Journal of Psychiatry, 181, 1-2.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.

Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236

Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.

Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.


Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.

Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., . . . Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credential Services.

Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.

MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the Outcome Questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.

Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic201-v11.pdf

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.

Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.

Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.

Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.

Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.

Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person fit and effect of person fit on test validity. Applied Psychological Measurement, 23, 41-53.

Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.

Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.

Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.

Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.

Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.

Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.


A (2000) On the invalidity of validity scales Evidence

from self-reports and observer ratings in volunteer samples

Journal of Personality and Social Psychology 78 582-593

Pirkis J E Burgess P M Kirk P K Dodson S Coombs

T J amp Williamson M K (2005) A review of the psycho-

metric properties of the Health of the Nation Outcome Scales

(HoNOS) family of measures Health and Quality of Life

Outcomes 3 76-87

Pitts S C West S G amp Tein J (1996) Longitudinal measure-

ment models in evaluation research Examining stability and

change Evaluation and Program Planning 19 333-350

Reckase M D (2009) Multidimensional item response theory

New York NY Springer

Reise S P amp Waller N G (1993) Traitedness and the assess-

ment of response pattern scalability Journal of Personality

and Social Psychology 65 143-151

Reise S P amp Waller N G (2009) Item response theory and

clinical measurement Annual Review of Clinical Psychology

5 27-48

Rost J (1990) Rasch models in latent classes An integration

of two approaches to item analysis Applied Psychological

Measurement 3 271-282

Samejima F (1997) Graded response model In W J van der

Linden amp R Hambleton (Eds) Handbook of modern itemresponse theory (pp 85-100) New York NY Springer

Schmitt N Chan D Sacco J M McFarland L A amp Jennings

D (1999) Correlates of person-fit and effect of person-fit

on test validity Applied Psychological Measurement 23

41-53

Spielberger C D Gorsuch R L Lushene R Vagg P R amp

Jacobs G A (1983) Manual for the State-Trait Anxiety

Inventory (Form Y) Palo Alto CA Consulting Psychologists

Press

St-Onge C Valois P Abdous B amp Germain S (2011) Person-

fit statisticsrsquo accuracy A Monte Carlo study of the aberrance

ratersquos influence Applied Psychological Measurement 35

419-432

Tellegen A (1988) The analysis of consistency in personalityassessment Journal of Personality 56 621-663

Thissen D Chen W H amp Bock R D (2003) MULTILOG for

Windows (Version 7) Lincolnwood IL Scientific Software

International

Thissen D Steinberg L amp Wainer H (1993) Detection of dif-

ferential functioning using the parameters of item response

models In P W Holland amp H Wainer (Eds) Differential

Item Functioning (pp 67-113) Hillsdale NJ Lawrence

Erlbaum

Van Herk H Poortinga Y H amp Verhallen T M M (2004)

Response styles in rating scales Evidence of method bias

in data from six EU countries Journal of Cross-Cultural

Psychology 35 346-360

Wing J K Beevor A S Curtis R H Park B G Hadden S

amp Burns H (1998) Health of the Nation Outcome Scales

(HoNOS) Research and development British Journal of

Psychiatry 172 11-18

Wood J M Garb H N Lilienfeld S O amp Nezworski M T

(2002) Clinical assessment Annual Review of Psychology

53 519-543

Woods C M Oltmanns T F amp Turkheimer E (2008)

Detection of aberrant responding on a personality scale in a

military sample An application of evaluating person fit with

two-level logistic regression Psychological Assessment 20

159-168

Page 9: Conijn, 2014

7212019 Conijn 2014

httpslidepdfcomreaderfullconijn-2014-56dab45b83e53 912

Conijn et al 9

malingering. These types of aberrant responding result in biased total scores but do not necessarily produce an inconsistent pattern of item scores across the complete measure (Ferrando & Chico, 2001). Hence, future research might also use measures especially designed to identify response styles (Cheung & Rensvold, 2000) and malingering, and compare the results with those from general-purpose person-fit statistics. Inconsistency scales such as the TRIN and VRIN potentially detect the same types of generic response inconsistencies as person-fit indices, but to our knowledge only one study has compared their performance to that of PFA (Egberink, 2010). PFA was found to outperform the inconsistency scale, but relative performance may depend on dimensionality and test length. Hence, more research is needed on this subject.

Other suggestions for future research include the following. A first question is whether the prevalence of aberrant responding increases (e.g., due to diminishing motivation) or decreases (e.g., due to familiarity with the questions) with repeated outcome measurements. A second question is whether the prevalence of aberrant responding justifies routine application of PFA in clinical settings such as the Patient Reported Outcomes Measurement Information System. If prevalence among the tested patients is low, the number of item-score patterns incorrectly classified as aberrant (i.e., Type I errors) may outnumber the correctly identified aberrant item-score patterns (Piedmont, McCrae, Riemann, & Angleitner, 2000); a simple base-rate illustration is given after this paragraph. PFA then becomes very inefficient and is unlikely to improve the quality of individual decision making in clinical practice. Third, future research may use group-level analysis, such as differential item functioning analysis (Thissen, Steinberg, & Wainer, 1993) or IRT mixture modeling (Rost, 1990), to study whether patients with the same disorder show similar patterns of misfit.
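To make the base-rate concern concrete, consider a purely hypothetical calculation; the rates below are illustrative assumptions, not estimates from this study. Suppose 5% of patients respond aberrantly, the person-fit statistic detects 80% of the aberrant patterns, and the Type I error rate is 5%. The expected proportions of flagged item-score patterns are then

$$ \Pr(\text{flagged, aberrant}) = .80 \times .05 = .040, \qquad \Pr(\text{flagged, fitting}) = .05 \times .95 = .0475, $$

so under these assumed rates the false positives already slightly outnumber the correct detections, and fewer than half of the flagged item-score patterns would actually be aberrant.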

To conclude, our results have two main implications, pertaining to psychometrics and to psychopathology. First, the simulation study results suggest that $l_{zm}^{p}$ is useful for application to outcome measures despite moderate model misfit due to multidimensionality. Because data from other psychopathology measures have also been shown to be inconsistent with the assumptions of IRT models, the simulation results are valuable when considering application of the $l_{zm}^{p}$ statistic to data from outcome measures. Second, our results suggest that general outcome measures such as the OQ-45 may not be equally suitable for patients with different disorders. Moreover, the more severely distressed patients, for whom psychological intervention is most needed, appear to be at the greatest risk of producing invalid outcome scores. Overall, our results emphasize the importance of identifying person misfit in outcome measurement and demonstrate that PFA may be useful for preventing incorrect clinical decisions due to aberrant responding.

Appendix A

Statistic $l_z^p$

Suppose the data are polytomous scores on $J$ items (items are indexed $j$, $j = 1, \ldots, J$) with $M + 1$ ordered answer categories. Let the score on item $j$ be denoted by $X_j$, with possible realizations $x_j = 0, \ldots, M$. The latent trait is denoted by $\theta$, and $P(X_j = x_j \mid \theta)$ is the probability of a score $X_j = x_j$. Let $d_j(m) = 1$ if $x_j = m$ $(m = 0, \ldots, M)$, and 0 otherwise. The unstandardized log-likelihood of an item-score pattern $\mathbf{x}$ of person $i$ is given by

$$ l_{pi}(\mathbf{x}) = \sum_{j=1}^{J} \sum_{m=0}^{M} d_j(m) \ln P(X_j = m \mid \theta_i). \qquad \text{(A1)} $$

The standardized log-likelihood is defined as

$$ l_z^p(\mathbf{x}_i) = \frac{l_{pi}(\mathbf{x}) - E[\,l_{pi}(\mathbf{x})\,]}{\mathrm{VAR}[\,l_{pi}(\mathbf{x})\,]^{1/2}}, \qquad \text{(A2)} $$

where $E(l_p)$ is the expected value and $\mathrm{VAR}(l_p)$ is the variance of $l_p$.
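As an illustration of Equations A1 and A2, the following sketch computes $l_z^p$ for one respondent from a matrix of model-implied category probabilities. It is a minimal example, not the authors' implementation: the probability matrix `probs` (items by categories, evaluated at the respondent's estimated $\theta$) and the item scores `x` are assumed inputs, for instance taken from a fitted graded response model.

```python
import numpy as np

def lz_p(x, probs):
    """Standardized log-likelihood person-fit statistic (Equations A1-A2).

    x     : array of J item scores, each in 0..M
    probs : J x (M+1) array; probs[j, m] = P(X_j = m | estimated theta)
    """
    probs = np.asarray(probs, dtype=float)
    x = np.asarray(x, dtype=int)
    J = probs.shape[0]
    log_p = np.log(probs)

    # A1: observed log-likelihood of the item-score pattern
    l_obs = log_p[np.arange(J), x].sum()

    # Expected value and variance of the log-likelihood under the model
    e_l = (probs * log_p).sum()
    var_l = ((probs * log_p**2).sum(axis=1)
             - (probs * log_p).sum(axis=1) ** 2).sum()

    # A2: standardize; large negative values indicate misfit
    return (l_obs - e_l) / np.sqrt(var_l)

# Toy usage with made-up probabilities for 3 five-category items
probs = np.array([[0.10, 0.20, 0.40, 0.20, 0.10],
                  [0.05, 0.15, 0.30, 0.30, 0.20],
                  [0.20, 0.30, 0.30, 0.15, 0.05]])
print(lz_p([2, 3, 0], probs))
```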

Standardized Residual Statistic

The unstandardized residual for person $i$ on item $j$ is given by

$$ e_{ij} = X_{ij} - E(X_{ij}), \qquad \text{(A3)} $$

where $E(X_{ij})$ is the expected value of $X_{ij}$, which equals $E(X_{ij}) = \sum_{m=0}^{M} m \, P(X_{ij} = m \mid \theta_i)$. The residual $e_{ij}$ has a mean of 0 and variance equal to

$$ \mathrm{VAR}(e_{ij}) = E(X_{ij}^2) - [E(X_{ij})]^2. \qquad \text{(A4)} $$

The standardized residual is given by

$$ z_{e_{ij}} = \frac{e_{ij}}{\sqrt{\mathrm{VAR}(e_{ij})}}. \qquad \text{(A5)} $$

To compute $z_{e_{ij}}$, the latent trait $\theta_i$ needs to be replaced by its estimated value. This may bias the standardization of $e_{ij}$.
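The standardized residuals can be computed from the same kind of model-implied probability matrix used in the previous sketch. Again, this is a minimal sketch under the same assumptions (hypothetical `probs` and `x`), not the authors' code.

```python
import numpy as np

def standardized_residuals(x, probs):
    """Standardized item residuals (Equations A3-A5).

    x     : array of J item scores, each in 0..M
    probs : J x (M+1) array; probs[j, m] = P(X_j = m | estimated theta)
    """
    probs = np.asarray(probs, dtype=float)
    x = np.asarray(x, dtype=float)
    m = np.arange(probs.shape[1])        # category scores 0..M

    e_x = (probs * m).sum(axis=1)        # E(X_ij)
    e_x2 = (probs * m**2).sum(axis=1)    # E(X_ij^2)
    var_x = e_x2 - e_x**2                # A4

    return (x - e_x) / np.sqrt(var_x)    # A3 and A5

# Toy usage; large absolute residuals point to the items causing the misfit
probs = np.array([[0.10, 0.20, 0.40, 0.20, 0.10],
                  [0.05, 0.15, 0.30, 0.30, 0.20],
                  [0.20, 0.30, 0.30, 0.15, 0.05]])
print(standardized_residuals([2, 3, 0], probs))
```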

Appendix B

For each OQ-45 subscale we estimated an exploratory multidimensional IRT (MIRT; Reckase, 2009) model. Based on the results of exploratory factor analyses conducted in Mplus (Muthén & Muthén, 2007), we used a MIRT model with two factors for both the SR and the IR subscale and three factors for the SD subscale. The MIRT model is defined as $\mathrm{logit}[p(\theta)] = -1.702(\alpha\theta + \delta)$. Vector $\theta$ denotes the latent traits and has a multivariate standard normal distribution, where the $\theta$ correlations are estimated along with the item parameters of the MIRT model. Higher $\alpha$ values indicate better discriminating items, and higher $\delta$ values indicate more popular answer categories.

We used the MIRT parameter estimates (Table B1) for generating replicated OQ-45 data sets. In each replication we included two types of misfitting item-score patterns: patterns due to random error and patterns due to acquiescence. Following St-Onge, Valois, Abdous, and Germain (2011), we simulated varying levels of random error (random scores on 10, 20, or 30 items) and varying levels of acquiescence (weak, moderate, and strong, based on Van Herk, Poortinga, & Verhallen, 2004). The total percentage of misfitting patterns in the data was 20% (Conijn et al., 2013; Conijn et al., 2014). Based on 100 replicated data sets, the average Type I error rate and the average detection rates of $l_{zm}^{p}$ and the standardized residual were computed. For computing the person-fit statistics, GRM parameters were estimated using MULTILOG 7 (Thissen et al., 2003). A sketch of this data-generating scheme, under simplifying assumptions, is given after Table B1.

Table B1. Descriptive Statistics for Multidimensional IRT Model Parameters Used for Data Simulation

Item parameter   SR subscale: M (SD), Range        |  IR subscale: M (SD), Range        |  SD subscale: M (SD), Range
Discrimination
  α_θ1           0.76 (0.47), 0.23 to 1.46         |  0.84 (0.37), 0.36 to 1.65         |  0.90 (0.26), 0.52 to 1.32
  α_θ2           0.26 (0.50), -0.09 to 1.31        |  0.02 (0.37), -0.48 to 0.73        |  0.03 (0.40), -0.74 to 0.54
  α_θ3           —                                 |  —                                 |  0.02 (0.27), -0.56 to 0.53
Threshold
  δ_0            —                                 |  —                                 |  —
  δ_1            0.98 (0.70), -0.25 to 1.84        |  1.12 (0.66), 0.14 to 2.21         |  1.76 (0.84), -0.11 to 3.08
  δ_2            0.10 (0.60), -0.88 to 0.80        |  0.07 (0.52), -0.98 to 0.95        |  0.77 (0.65), -0.70 to 1.75
  δ_3            -0.82 (0.53), -1.57 to -0.19      |  -1.11 (0.67), -2.40 to -0.18      |  -0.40 (0.53), -1.67 to 0.62
  δ_4            -1.84 (0.50), -2.60 to -1.26      |  -2.28 (0.87), -3.62 to -1.09      |  -1.74 (0.53), -2.87 to -0.70
Latent-trait correlations
  r_θ1θ2         .20                               |  .50                               |  -.42
  r_θ1θ3         —                                 |  —                                 |  .53
  r_θ2θ3         —                                 |  —                                 |  -.55

Note. To simulate weak, moderate, and strong acquiescence, the δs were shifted 1, 2, or 3 points, respectively (Cheung & Rensvold, 2000). For the positively worded items the δ shift was positive, and for the negatively worded items the δ shift was negative. For weak acquiescence, the average percentages of respondents' 3-scores and 4-scores were 27 and 25, respectively; for moderate acquiescence, the percentages were 20 and 45; and for strong acquiescence, the percentages were 12 and 88.
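Below is a minimal sketch of the data-generating scheme described above, under stated assumptions: it draws graded-response-style item scores from given category probabilities and then distorts a subset of the simulated patterns, either by replacing the scores on a chosen number of items with random scores (random error) or by pushing observed scores toward the agreement end of the scale. The latter is only a crude stand-in for the acquiescence manipulation, which in the study was implemented by shifting the δ parameters (see the note to Table B1). The item count, trait model, and the `category_probs` function are hypothetical placeholders, not the estimates in Table B1.

```python
import numpy as np

rng = np.random.default_rng(1)
J, M = 25, 4                     # hypothetical: 25 items, scores 0..4

def category_probs(theta):
    """Toy stand-in for model-implied category probabilities at trait value theta."""
    raw = np.exp(-0.5 * (np.arange(M + 1) - (2 + theta)) ** 2)  # bell over categories
    p = np.tile(raw, (J, 1))
    return p / p.sum(axis=1, keepdims=True)

def simulate_pattern(theta):
    """Draw one model-consistent item-score pattern."""
    probs = category_probs(theta)
    return np.array([rng.choice(M + 1, p=probs[j]) for j in range(J)])

def add_random_error(x, n_items):
    """Replace the scores on n_items randomly chosen items with random scores."""
    x = x.copy()
    idx = rng.choice(J, size=n_items, replace=False)
    x[idx] = rng.integers(0, M + 1, size=n_items)
    return x

def add_acquiescence(x, shift):
    """Push scores toward the highest categories by `shift` points (capped at M)."""
    return np.minimum(x + shift, M)

# 20% misfitting patterns, split evenly over the two types (illustrative split)
n_persons = 1000
thetas = rng.standard_normal(n_persons)
data = np.array([simulate_pattern(t) for t in thetas])
misfit_idx = rng.choice(n_persons, size=int(0.2 * n_persons), replace=False)
half = len(misfit_idx) // 2
for i in misfit_idx[:half]:
    data[i] = add_random_error(data[i], n_items=10)
for i in misfit_idx[half:]:
    data[i] = add_acquiescence(data[i], shift=1)
```

In a full study one would repeat this over many replicated data sets, compute the person-fit statistics on each, and tabulate average Type I error and detection rates, as described above.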

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by a grant from the Netherlands Organization of Scientific Research (NWO 400-06-087; first author).

References

Atre-Vaidya, N., Taylor, M. A., Seidenberg, M., Reed, R., Perrine, A., & Glick-Oberwise, F. (1998). Cognitive deficits, psychopathology, and psychosocial functioning in bipolar mood disorder. Neuropsychiatry, Neuropsychology, and Behavioral Neurology, 11, 120-126.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York, NY: Marcel Dekker.
Bickman, L., & Salzer, M. S. (1997). Introduction: Measuring quality in mental health services. Evaluation Review, 21, 285-291.
Bludworth, J. L., Tracey, T. J. G., & Glidden-Tracey, C. (2010). The bilevel structure of the Outcome Questionnaire–45. Psychological Assessment, 22, 350-355.
Burlingame, G. M., Thayer, S. D., Lee, J. A., Nelson, P. L., & Lambert, M. J. (2007). Administration & scoring manual for the Severe Outcome Questionnaire (SOQ). Salt Lake City, UT: OQ Measures.
Butcher, J. N., Graham, J. R., Ben-Porath, Y. S., Tellegen, A., Dahlstrom, W. G., & Kaemmer, B. (2001). MMPI-2 (Minnesota Multiphasic Personality Inventory-2): Manual for administration and scoring (Rev. ed.). Minneapolis: University of Minnesota Press.

Cella, D., Yount, S., Rothrock, N., Gershon, R., Cook, K., & Reeve, B., on behalf of the PROMIS Cooperative Group. (2007). The Patient Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care, 45, S3-S11.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Conijn, J. M., Emons, W. H. M., & Sijtsma, K. (2014). Statistic lz-based person-fit methods for non-cognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.
Conijn, J. M., Emons, W. H. M., Van Assen, M. A. L. M., Pedersen, S. S., & Sijtsma, K. (2013). Explanatory multilevel person-fit analysis of response consistency on the Spielberger State-Trait Anxiety Inventory. Multivariate Behavioral Research, 48, 692-718.
Conrad, K. J., Bezruczko, N., Chan, Y. F., Riley, B., Diamond, G., & Dennis, M. L. (2010). Screening for atypical suicide risk with person fit statistics among people presenting to alcohol and other drug treatment. Drug and Alcohol Dependence, 106, 92-100.
Cuijpers, P., Li, J., Hofmann, S. G., & Andersson, G. (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.
De Jong, K., & Nugter, A. (2004). De Outcome Questionnaire: Psychometrische kenmerken van de Nederlandse vertaling [The Outcome Questionnaire: Psychometric properties of the Dutch translation]. Nederlands Tijdschrift voor de Psychologie, 59, 76-79.
De Jong, K., Nugter, M. A., Polak, M. G., Wagenborg, J. E. A., Spinhoven, P., & Heiser, W. J. (2007). The Outcome Questionnaire-45 in a Dutch population: A cross-cultural validation. Clinical Psychology & Psychotherapy, 14, 288-301.
De la Torre, J., & Deng, W. (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.
Derogatis, L. R. (1993). BSI: Brief Symptom Inventory. Administration, scoring, and procedures manual (4th ed.). Minneapolis, MN: National Computer Systems.
Doucette, A., & Wolf, A. W. (2009). Questioning the measurement precision of psychotherapy research. Psychotherapy Research, 19, 374-389.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143-165.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.
Egberink, I. J. L. (2010). The use of different types of validity indicators in personality assessment (Doctoral dissertation). University of Groningen, Netherlands. Retrieved from http://irs.ub.rug.nl/ppn/32993466X
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.
Evans, C., Connell, J., Barkham, M., Margison, F., Mellor-Clark, J., McGrath, G., & Audin, K. (2002). Towards a standardised brief outcome measure: Psychometric properties and utility of the CORE-OM. British Journal of Psychiatry, 180, 51-60.
Ferrando, P. J. (2010). Some statistics for assessing person-fit based on continuous-response models. Applied Psychological Measurement, 34, 219-237.
Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52, 718-722.
Ferrando, P. J., & Chico, E. (2001). Detecting dissimulation in personality test scores: A comparison between person-fit indices and detection scales. Educational and Psychological Measurement, 61, 997-1012.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275-299.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Handel, R. W., Ben-Porath, Y. S., Tellegen, A., & Archer, R. P. (2010). Psychometric functioning of the MMPI-2-RF VRIN-r and TRIN-r scales with varying degrees of randomness, acquiescence, and counter-acquiescence. Psychological Assessment, 22, 87-95.
Holloway, F. (2002). Outcome measurement in mental health – Welcome to the revolution. British Journal of Psychiatry, 181, 1-2.
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Kenny, D. A., Kaniskan, B., & McCoach, D. B. (2014). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research. Advance online publication. doi:10.1177/0049124114543236
Kim, S., Beretvas, S. N., & Sherry, A. R. (2010). A validation of the factor structure of OQ-45 scores using factor mixture modeling. Measurement and Evaluation in Counseling and Development, 42, 275-295.
Kroenke, K., Spitzer, R. L., & Williams, J. B. (2002). The PHQ-15: Validity of a new measure for evaluating the severity of somatic symptoms. Psychosomatic Medicine, 64, 258-266.


Lambert, M. J., & Hawkins, E. J. (2004). Measuring outcome in professional practice: Considerations in selecting and utilizing brief outcome instruments. Professional Psychology: Research and Practice, 35, 492-499.
Lambert, M. J., Morton, J. J., Hatfield, D., Harmon, C., Hamilton, S., Reid, R. C., ... Burlingame, G. M. (2004). Administration and scoring manual for the OQ-45.2 (Outcome Questionnaire) (3rd ed.). Wilmington, DE: American Professional Credentialing Services.
Lambert, M. J., & Shimokawa, K. (2011). Collecting client feedback. Psychotherapy, 48, 72-79.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.
Mueller, R. M., Lambert, M. J., & Burlingame, G. M. (1998). Construct validity of the outcome questionnaire: A confirmatory factor analysis. Journal of Personality Assessment, 70, 248-262.
Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Statmodel.
Muthén, L. K., & Muthén, B. O. (2009). Mplus short course videos and handouts. Retrieved from http://www.statmodel.com/download/Topic%201-v11.pdf
Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.
Piedmont, R. L., McCrae, R. R., Riemann, R., & Angleitner, A. (2000). On the invalidity of validity scales: Evidence from self-reports and observer ratings in volunteer samples. Journal of Personality and Social Psychology, 78, 582-593.
Pirkis, J. E., Burgess, P. M., Kirk, P. K., Dodson, S., Coombs, T. J., & Williamson, M. K. (2005). A review of the psychometric properties of the Health of the Nation Outcome Scales (HoNOS) family of measures. Health and Quality of Life Outcomes, 3, 76-87.
Pitts, S. C., West, S. G., & Tein, J. (1996). Longitudinal measurement models in evaluation research: Examining stability and change. Evaluation and Program Planning, 19, 333-350.
Reckase, M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Reise, S. P., & Waller, N. G. (1993). Traitedness and the assessment of response pattern scalability. Journal of Personality and Social Psychology, 65, 143-151.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27-48.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Samejima, F. (1997). Graded response model. In W. J. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.
Schmitt, N., Chan, D., Sacco, J. M., McFarland, L. A., & Jennings, D. (1999). Correlates of person-fit and effect of person-fit on test validity. Applied Psychological Measurement, 23, 41-53.
Spielberger, C. D., Gorsuch, R. L., Lushene, R., Vagg, P. R., & Jacobs, G. A. (1983). Manual for the State-Trait Anxiety Inventory (Form Y). Palo Alto, CA: Consulting Psychologists Press.
St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Person-fit statistics' accuracy: A Monte Carlo study of the aberrance rate's influence. Applied Psychological Measurement, 35, 419-432.
Tellegen, A. (1988). The analysis of consistency in personality assessment. Journal of Personality, 56, 621-663.
Thissen, D., Chen, W. H., & Bock, R. D. (2003). MULTILOG for Windows (Version 7). Lincolnwood, IL: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.
Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346-360.
Wing, J. K., Beevor, A. S., Curtis, R. H., Park, B. G., Hadden, S., & Burns, H. (1998). Health of the Nation Outcome Scales (HoNOS): Research and development. British Journal of Psychiatry, 172, 11-18.
Wood, J. M., Garb, H. N., Lilienfeld, S. O., & Nezworski, M. T. (2002). Clinical assessment. Annual Review of Psychology, 53, 519-543.
Woods, C. M., Oltmanns, T. F., & Turkheimer, E. (2008). Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment, 20, 159-168.
