
ORIGINAL ARTICLE

Psychometric Evaluation and Calibration of Health-Related Quality of Life Item Banks

Plans for the Patient-Reported Outcomes Measurement Information System (PROMIS)

Bryce B. Reeve,* Ron D. Hays,† Jakob B. Bjorner,‡ Karon F. Cook,§ Paul K. Crane,∥ Jeanne A. Teresi,¶ David Thissen,∥ Dennis A. Revicki,** David J. Weiss,†† Ronald K. Hambleton,‡‡ Honghu Liu,†† Richard Gershon,§§ Steven P. Reise,†† Jin-shei Lai,§§ and David Cella,§§ on behalf of the PROMIS Cooperative Group

Background: The construction and evaluation of item banks to measure unidimensional constructs of health-related quality of life (HRQOL) is a fundamental objective of the Patient-Reported Outcomes Measurement Information System (PROMIS) project.

Objectives: Item banks will be used as the foundation for developing short-form instruments and enabling computerized adaptive testing. The PROMIS Steering Committee selected 5 HRQOL domains for initial focus: physical functioning, fatigue, pain, emotional distress, and social role participation. This report provides an overview of the methods used in the PROMIS item analyses and proposed calibration of item banks.

Analyses: Analyses include evaluation of data quality (eg, logic and range checking, spread of response distribution within an item), descriptive statistics (eg, frequencies, means), item response theory model assumptions (unidimensionality, local independence, monotonicity), model fit, differential item functioning, and item calibration for banking.

Recommendations: Summarized are key analytic issues; recommendations are provided for future evaluations of item banks in HRQOL assessment.

Key Words: item response theory, unidimensionality, model fit, differential item functioning, computerized adaptive testing

(Med Care 2007;45: S22–S31)

The Patient-Reported Outcomes Measurement Information System (PROMIS) project provides a unique opportunity to use advanced psychometric methods to construct, analyze, and refine item banks, from which improved patient-reported outcome (PRO) instruments can be developed.1,2 PRO measures include instruments that measure domains like health-related quality of life (HRQOL) and satisfaction with medical care. Presented in this report are the methodological considerations for analyzing both existing data from a number of sources and new data to be collected by the PROMIS network. These methods and approaches were adopted by the PROMIS network. The PROMIS project will produce item banks that will be used for both computerized-adaptive testing (CAT)3 and nonadaptive (ie, fixed length) assessment of HRQOL domains, including pain, fatigue, emotional distress, physical functioning, and social-role participation as the initial focus.

In the beginning, PROMIS investigators identified available datasets containing more than 50,000 respondents (n > 1000 per dataset) and multi-item PRO responses in cancer, heart disease, HIV disease, diabetes, gastrointestinal disorders, hepatitis C, mental health, and other chronic health conditions. Results from analyses of these datasets were used to refine the proposed methods and offer candidate item banks before the development of the PROMIS item banks. In particular, secondary data analyses allowed PROMIS investigators to examine the dimensionality of domains; identify candidate items that represent the domains of interest; and evaluate the optimal number of response categories to field in the PROMIS data collection phase.

From the *National Cancer Institute, NIH, Bethesda, Maryland; †UCLA Division of General Internal Medicine & Health Services Research, Los Angeles, California; ‡QualityMetric Inc., Lincoln, Rhode Island, and the Health Assessment Lab, Waltham, Massachusetts; §University of Washington, Seattle; ¶Columbia University Stroud Center and Faculty of Medicine, New York State Psychiatric Institute, and Research Division, Hebrew Home for the Aged in Riverdale, New York, New York; ∥Psychology Department, University of North Carolina at Chapel Hill; **Center for Health Outcomes Research, United BioSource Corporation, Bethesda, Maryland; ††Psychology Department, University of Minnesota, Minneapolis; ‡‡Center for Educational Assessment, University of Massachusetts at Amherst; and §§Northwestern University Feinberg School of Medicine and Evanston Northwestern Healthcare, Evanston, Illinois.

Preparation of this work by non-NIH employees was supported by the National Institutes of Health through the NIH Roadmap for Medical Research Grant (AG015815), PROMIS Project.

Reprints: Bryce B. Reeve, PhD, Outcomes Research Branch, National Cancer Institute, NIH, EPN 4005, 6130 Executive Blvd. MSC 7344, Bethesda, MD 20892-7344. E-mail: [email protected].

Copyright © 2007 by Lippincott Williams & Wilkins
ISSN: 0025-7079/07/4500-0022


The secondary analyses also allowed PROMIS researchers to anticipate psychometric challenges in developing the PROMIS item banks. For example, analyses suggested substantial floor and/or ceiling effects for many domains, which underscored the importance of identifying items that discriminated well at very low and at very high levels of the traits being measured. The same psychometric considerations apply to analysis of newly collected PROMIS data: confirm assumptions about dimensionality of the data; examine item properties; test for differential item functioning (DIF)4 across sociodemographic or clinical groups; and calibrate the items for CAT and short forms.

Because researchers recognize the many challenges in analyzing HRQOL data, the plan provides flexibility with respect to the methods used to explore psychometric properties. Some methods were identified as primary and others as exploratory. The results obtained using exploratory methods will be evaluated based on whether they add substantively to the results obtained using the primary methods. Examples of applying the methods discussed in this analysis plan can be found in articles included in this supplement.5,6 Further, the psychometrics field is evolving for measuring PROs, both in terms of methods development (eg, recent advances in the bi-factor model and the full information factor analysis for polytomous response items discussed later in this article) and application (eg, advances in electronic-PRO assessment). As the state of the measurement field changes, the PROMIS network will adapt its analytic plans.

PROMIS DATA COLLECTION AND SAMPLING PLAN

From July 2006 to March 2007, the PROMIS research sites collected data from the US general population (n = 7523) and multiple disease populations, including those with cancer (n = 1000), heart disease (n = 500), rheumatoid arthritis (n = 500), osteoarthritis (n = 500), psychiatric conditions (n = 500), spinal cord injury (n = 500), and chronic obstructive pulmonary disease (n = 500). The general population sample will be constructed to ensure adequate representation with respect to key demographic characteristics such as gender (50% each), age (20% of each age group in years: 18–29, 30–44, 45–59, 60–74, 75+), ethnicity (12% black, 12% Hispanic), and education (25% with high school education or less). A health condition checklist will also be included in the assessment. Beyond demographic and clinical characteristics, PRO data in the areas of pain, fatigue, emotional distress, physical functioning, and social-role participation will be collected for inclusion in item banks developed within the PROMIS network. All candidate items for the PROMIS item banks have been thoroughly examined using qualitative methods such as cognitive testing and expert item review panels.7 The first wave of data is being collected via a computer or laptop linked to a web-based questionnaire.

A detailed data sampling plan was developed for collecting initial item responses to the candidate items from the targeted PROMIS domains. This sampling plan was designed to best accommodate a number of purposes: (1) create item calibrations for all of the items in each of the subdomains; (2) estimate profile scores for various disease populations; (3) create linking metrics to legacy questionnaires (eg, SF-36); (4) confirm the factor structure of the primary and subdomains; and (5) conduct item and bank analyses. However, because of the large total number of items (>1000), it is not possible for participants to respond to the entire set of items in each pool. Based on an estimate of 4 questions per minute, the length of the PROMIS questionnaires in the first wave of testing is limited to approximately 150 items, which are expected to take about 40 minutes to answer. Two data collection designs (“full bank” and “block administration”) will be implemented during wave 1 to address the 5 purposes listed previously.

“Full bank” testing will be conducted using the general population sample (n = 3507). Each respondent will answer all of the items in 2 of the primary item banks, for example, depression and anxiety or fatigue impact and fatigue experience. Data collected from full bank testing will be analyzed to confirm the factor structure of the PROMIS domains, test for DIF, and perform CAT simulations.

For “blocked administration” in both the general population sample and the samples of individuals with chronic diseases, a balanced incomplete block design will be used in which a subset of items from each item pool is administered to every person.8,9 Item blocks will be designed to allow simultaneous item response theory (IRT)-based estimation of item parameters, and of population mean differences and standard deviation ratios.

ANALYTIC METHODS

Advanced psychometric methods will be used throughout the instrument development process to inform our understanding of the latent constructs, particularly with respect to the populations studied, and to develop adaptive and nonadaptive instruments with appropriate psychometric properties for implementation in a range of research applications.

This process, outlined in Table 1, will include the analysis of item and scale properties using both traditional (ie, classic) and modern (ie, IRT) psychometric methods. Factor analysis will be used to examine the underlying structures of the measured constructs and to evaluate the assumptions of the IRT model. DIF testing will evaluate whether items perform differently across key demographic or disease groups when controlling for the underlying level of the trait assessed by the scale. Finally, items will be calibrated to an IRT model and used in CAT. The plan builds on previous PRO item bank development work by different research groups10–12; however, the PROMIS project entails a more extensive testing strategy than performed previously. The steps noted in Table 1 are presented sequentially, but often many steps can be carried out in parallel, and results from later steps may suggest returning to earlier steps to re-evaluate findings based on different interpretations or methods. Herein, we describe each of these methods, review available analytic options and, when evidence supports it, suggest preferred methods and criteria. Decisions about model selection, fit, and/or satisfaction of assumptions will not be based solely on statistical criteria, but will incorporate expert judgment from both psychometric and content experts who will review the evidence to make interpretations and to determine the next steps.

DESCRIPTIVE STATISTICS

A variety of descriptive statistics will be used, including measures of central tendency (mean, median), spread (standard deviation, range), skewness and kurtosis, and response category frequencies. Patterns and frequency of missing data will be examined to identify the likelihood of systematic or random patterns. For example, if missing data were more prevalent later in the sequence of administered items, this would suggest that the cause may be response burden or lack of time for completing the questionnaire. The content of items that draw substantial missing responses will be examined by content experts to evaluate whether missing responses may be due to sensitive item content.

TABLE 1. PROMIS Psychometric Evaluation and Calibration Plan

I. Traditional Descriptive Statistics

A. Item analysis

1. Response frequency, mean, standard deviation, range, skewness, and kurtosis

2. Inter-item correlation matrix, item-scale correlations, drop in coefficient alpha

B. Scale analysis

1. Mean, standard deviation, range, skewness, kurtosis, internal consistency reliability (coefficient alpha)

II. Evaluate Assumptions of the Item Response Theory (IRT) Model

A. Unidimensionality

1. Confirmatory Factor Analysis (CFA) using polychoric correlations (one-factor and bi-factor models)

2. Exploratory Factor Analysis will be performed if CFA shows poor fit

B. Local independence

1. Examine residual correlation matrix after first factor removed in factor analysis

2. IRT-based tests of local dependence

C. Monotonicity

1. Graph item mean scores conditional on total score minus item score

2. Examine initial probability functions from nonparametric IRT models

III. Fit Item Response Theory (IRT) Model to Data

A. Estimate IRT model parameters

1. Samejima’s Graded Response Model for unidimensional polytomous response data

B. Examine model fit

1. Compare observed and expected response frequencies

2. Examine fit indices: S-X2, Bock's χ2, and Q1 statistics

C. Evaluate item properties

1. IRT category response curves

2. IRT item information curves

D. Evaluate scale properties

1. IRT scale information function

IV. Evaluate Differential Item Functioning (DIF) Among Key Demographic and Clinical Groups

A. Qualitative analyses and generation of DIF hypotheses

B. Evaluation of presence, magnitude, and impact of DIF using IRT-based methods:

1. General IRT-based likelihood-ratio test

2. Raju’s signed and unsigned area tests, Differential Functioning of Items and Tests (DFIT) methods and expected scale scores

C. Evaluation of presence, magnitude, and impact of DIF using non-IRT-based methods

1. Ordinal logistic regression

2. Multigroup multiple-indicator, multiple cause model using structural equation modeling to evaluate impact of DIF

V. Item Calibration for Item Banking

A. Design for administration of PROMIS items for calibration phase

1. Full bank and incomplete block designs for administration of items to respondents for each item pool

B. Standardize theta metric

1. Standardizing the metric so that the general US population has a mean of 0 and a standard deviation of 1. All disease/disorder groups will have a population mean and standard deviation ratio relative to this reference group

C. Assign item properties for each item in the bank

1. Calibrate each item with a discrimination and threshold parameter using Samejima’s Graded Response Model

2. Design CAT algorithms


Several basic classic test theory statistics will be estimated to provide descriptive information about the performance of the item set. These include inter-item correlations, item-scale correlations, and internal consistency reliability. Cronbach's coefficient alpha13 will be used to examine internal consistency, with 0.70 to 0.80 as an accepted minimum for group-level measurement and 0.90 to 0.95 as an accepted minimum for individual-level measurement. Internal consistency estimates are based on the assumption that the item set is homogeneous; because high internal consistency can be achieved with multidimensional data, this statistic does not provide sufficient evidence of unidimensionality.
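These classical statistics are simple to compute directly. Below is a minimal sketch in Python (the function names and simulated data are illustrative, not part of the PROMIS toolchain), assuming item responses sit in a pandas DataFrame with one row per respondent:

```python
import numpy as np
import pandas as pd

def coefficient_alpha(items: pd.DataFrame) -> float:
    """Cronbach's coefficient alpha for a set of items (rows = respondents)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_rest_correlations(items: pd.DataFrame) -> pd.Series:
    """Correlation of each item with the sum of the remaining items."""
    total = items.sum(axis=1)
    return pd.Series({c: items[c].corr(total - items[c]) for c in items.columns})

# Simulated 5-category responses to 6 items driven by one latent trait
rng = np.random.default_rng(0)
trait = rng.normal(size=500)
data = pd.DataFrame({f"item{i}": pd.cut(trait + rng.normal(scale=1.0, size=500),
                                        bins=5, labels=False)
                     for i in range(6)})
print(f"alpha = {coefficient_alpha(data):.2f}")  # compare against the 0.70-0.95 minima
print(item_rest_correlations(data).round(2))
```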

EVALUATE ASSUMPTIONS OF THE IRT MODEL

Before applying IRT models, it is important to evaluate the core assumptions of the model: unidimensionality, local independence, and monotonicity. The methods for testing these assumptions are described below. The order in which the assumptions are tested can vary.

Unidimensionality

One critical assumption of IRT models is that a person's response to an item that measures a construct is accounted for by his/her level (amount) on that trait, and not by other factors. For example, a highly depressed person is more likely to endorse “true” for the statement “I don't care what happens to me” than a person with low depression. The assumption is that a person's depression level is the main factor that gives rise to his/her response to the item. No item set will ever perfectly meet strictly defined unidimensionality assumptions.14 Thus, one wants to assess whether scales are “essentially” or “sufficiently” unidimensional15 to permit unbiased scaling of individuals on a common latent trait. One important criterion is the robustness of item parameter estimates, which can be examined by removing items that may represent a significant dimension. If the item parameters (in particular the item discrimination parameters or factor loadings) significantly change, then this may indicate insufficient unidimensionality.16,17 A number of researchers have recommended methods and considerations for evaluating essential unidimensionality, as reviewed below.14,15,18–20

Factor Analytic Methods to Assess Unidimensionality

Confirmatory factor analysis (CFA) will be performed to evaluate the extent to which the item pool measures a dominant trait that is consistent with the content experts' definition of the domain. CFA was selected over an exploratory analysis as the first step because each potential pool of items was carefully selected by experts to represent a dominant PRO construct through an exhaustive literature review and feedback from patients through focus groups and cognitive testing.7 Because of the ordinal nature of the PRO data, appropriate software (eg, MPLUS21 or LISREL22) is required to evaluate polychoric correlations using an appropriate estimator for factor analysis, such as weighted least squares with adjustments for the mean and variance (WLSMV23 in MPLUS21) or diagonally weighted least squares (DWLS in LISREL22).

CFA model fit will be assessed by examining multiple indices. Noting that statistical criteria like the χ2 statistic are sensitive to sample size, a range of practical fit indices will be examined, such as the comparative fit index (CFI > 0.95 for good fit), root mean square error of approximation (RMSEA < 0.06 for good fit), Tucker-Lewis Index (TLI > 0.95 for good fit), standardized root mean residuals (SRMR < 0.08 for good fit), and average absolute residual correlations (< 0.10 for good fit).15,24–28 If the CFA shows poor fit, then we will conduct an exploratory factor analysis and examine the magnitude of eigenvalues for the larger factors (at least 20% of the variability on the first factor is especially desirable), differences in the magnitude of eigenvalues between factors (a ratio in excess of 4 is supportive of the unidimensionality assumption), the scree test, parallel analysis, correlations among factors, and factor loadings to determine the underlying structural patterns.
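The two eigenvalue screens named above can be sketched as follows. For brevity the example operates on an arbitrary correlation matrix supplied by the analyst; the plan itself calls for polychoric correlations estimated in dedicated software such as MPLUS or LISREL:

```python
import numpy as np

def unidimensionality_checks(R: np.ndarray) -> dict:
    """Eigenvalue screens for an inter-item correlation matrix R:
    share of variability on the first factor (>= ~20% desired) and the
    ratio of the first to the second eigenvalue (> 4 supports
    unidimensionality)."""
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]  # descending order
    return {"first_factor_share": eigvals[0] / eigvals.sum(),
            "eig1_to_eig2_ratio": eigvals[0] / eigvals[1]}

# Toy 6-item matrix with one dominant factor (loadings all 0.7)
loadings = np.full(6, 0.7)
R = np.outer(loadings, loadings)
np.fill_diagonal(R, 1.0)
print(unidimensionality_checks(R))  # share ~0.57, ratio ~6.8: both screens pass
```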

An alternate method to determine whether the items are “sufficiently” unidimensional is McDonald's bifactor model15 (see also Gibbons29,30). McDonald's approach to assessing unidimensionality (which he terms “homogeneity”) is to assign each item to a specific subdomain based on theoretical considerations. A model is then fit with each item loading on a common factor and on a specific subdomain (group factor). The common factor is defined by all the items, whereas each subdomain is defined by a subset of items in the pool. The factors are constrained to be mutually uncorrelated so that all covariance is partitioned either into loadings on the common factor or onto the subdomain factors. If the standardized loadings on the common factor are all salient (defined as > 0.30) and substantially larger than loadings on the group factors, the item pool is thought to be “sufficiently homogeneous.”15 Furthermore, one can compare individual scores under a bifactor and a unidimensional model. If scores are highly correlated (eg, r > 0.90), this is further evidence that the effects of multidimensionality are ignorable.31

To illustrate the active evolution of psychometric procedures applicable to the analysis of PROs, during the writing of this article an implementation of full information (exploratory) factor analysis for polytomous item responses became available in version 8.8 of the computer software LISREL.22,32 In addition, Edwards33 has illustrated the use of a Markov chain Monte Carlo (MCMC) algorithm for CFA of polytomous item responses such as those obtained in measurement of PROs. It is likely that those procedures, and others that may become available soon, will also be useful in the examination of dimensionality of the PROMIS scales.

Local Independence

Local independence assumes that once the dominant factor influencing a person's response to an item is controlled, there should be no significant association among item responses.34–36 The existence of local dependencies that influence IRT parameter estimates poses a problem for scale construction or CAT implementation. Further, scoring respondents based on misspecified models will result in inaccurate estimates of their level on the underlying trait. In other words, uncontrolled local dependence (LD) among items in a CAT assessment could result in a score different from the HRQOL construct being measured.

Identification of LD among polytomous response items includes examining the residual correlation matrix produced by the single-factor CFA. High residual correlations (greater than 0.2) will be flagged and considered as possible LD. In addition, IRT-based tests of LD will be used; among them are Yen's Q3 statistic37 and Chen and Thissen's LD indices.38 These statistics are based on a process that involves fitting a unidimensional IRT model to the data and then examining the residual covariation between pairs of items, which should be zero if the unidimensional model fits. For example, Steinberg and Thissen34 described the use of Chen and Thissen's G2 LD index to identify locally dependent items among 16 dichotomous items on a scale measuring history of violent activity.
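As a rough sketch of the residual-correlation screen (assuming standardized loadings from a one-factor model fitted elsewhere; the 0.2 cutoff is the one named above):

```python
import numpy as np

def flag_local_dependence(R, loadings, threshold=0.2):
    """Flag item pairs whose correlation exceeds what one factor explains.

    The residual correlation for items i and j is r_ij - l_i * l_j;
    pairs with |residual| above the 0.2 cutoff named in the text are
    returned as possible LD."""
    residual = R - np.outer(loadings, loadings)
    k = R.shape[0]
    return [(i, j, round(float(residual[i, j]), 3))
            for i in range(k) for j in range(i + 1, k)
            if abs(residual[i, j]) > threshold]

# Items 4 and 5 share wording, inflating their correlation beyond the
# 0.49 implied by the common factor alone
l = np.full(6, 0.7)
R = np.outer(l, l)
np.fill_diagonal(R, 1.0)
R[4, 5] = R[5, 4] = 0.85
print(flag_local_dependence(R, l))  # [(4, 5, 0.36)]
```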

The modification indices (MIs) of structural equation modeling (SEM) software may also serve as statistics to detect LD. When inter-item polychoric correlations are fitted with a one-factor model, the result is a limited information parameter estimation scheme for the graded normal ogive model. The MIs for such a model are 1-degree-of-freedom χ2-scaled statistics that suggest un-modeled excess covariation between items, which in the context of item factor analysis is indicative of LD. Hill, Edwards, Thissen, et al describe the use of MIs to detect LD in the PedsQL™ Social Functioning Scale and other examples.6

Items that are flagged as LD will be examined to evaluate their effect on IRT parameter estimates. One test is to remove one of the items with LD and to examine changes in IRT model parameter estimates and in factor loadings for all other items.

One solution to control the influence of LD on item and person parameter estimates is omitting one of the items with LD. If this is not feasible because both items provide a substantial amount of information, then LD items can be marked as “enemies,” preventing them from both being administered in a single assessment to any individual. Further, the LD must be controlled in the calibration step to remove the influence of the highly correlated items. In all cases, the LD items should be evaluated to understand the source of the dependency. LD may exist for nonsubstantive reasons such as structural similarity in wording or content: when the wording of 2 or more item stems is so similar that the respondent cannot differentiate what the questions are asking, they will mark the same response for both items.

Monotonicity

The assumption of monotonicity means that the probability of endorsing or selecting an item response indicative of better health status should increase as the underlying level of health increases. This is a basic requirement for IRT models for items with ordered response categories. Approaches for studying monotonicity include examining graphs of item mean scores conditional on “rest-scores” (ie, total raw scale score minus the item score) or fitting a nonparametric IRT model39 to the data that yields initial IRT probability curve estimates, using programs such as the Mokken scale analysis for polytomous items (MSP40) software. A nonparametric IRT model fits trace lines for each response to an item without any a priori specification of the order of the responses. The data analyst may then examine those fitted trace lines to determine which response alternatives are (empirically) associated with lower levels of the domain and which are associated with higher levels. The shapes of the trace lines may also indicate other departures from monotonicity, such as bimodality, if they exist. Although nonparametric IRT may not be the most (statistically) efficient way to produce the final item analysis and scores for a scale, it can be very informative about the tenability of the assumptions of parametric IRT.
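The rest-score graph is straightforward to approximate. A minimal sketch on simulated data (bin counts and variable names are illustrative):

```python
import numpy as np
import pandas as pd

def rest_score_means(items: pd.DataFrame, item: str, n_bins: int = 5) -> pd.Series:
    """Mean score on `item` conditional on the binned rest-score
    (total raw scale score minus the item's own score)."""
    rest = items.sum(axis=1) - items[item]
    bins = pd.qcut(rest, q=n_bins, duplicates="drop")
    return items.groupby(bins, observed=True)[item].mean()

# Under monotonicity the conditional means rise across rest-score bins
rng = np.random.default_rng(1)
trait = rng.normal(size=1000)
data = pd.DataFrame({f"item{i}": pd.cut(trait + rng.normal(scale=1.0, size=1000),
                                        bins=5, labels=False)
                     for i in range(8)})
print(rest_score_means(data, "item0"))
```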

FIT ITEM RESPONSE THEORY MODEL TO DATA

Once the assumptions have been confirmed, IRT models are fit to the data both for item and scale analysis and for item calibration to set the stage for CAT. IRT refers to a family of models that describe, in probabilistic terms, the relationship between a person's response to a survey question and his or her standing (level) on the PRO latent construct (eg, pain) that the scale measures.41,42 For every item in a scale, a set of properties (item parameters) is estimated. The item slope or discrimination parameter describes how well the item performs in the scale in terms of the strength of the relationship between the item and the scale. The item difficulty or threshold parameter(s) identifies the location along the construct's latent continuum where the item best discriminates among individuals. This information can be used to evaluate properties of the items in the scale or used by the CAT algorithm to select items that are appropriately matched to the respondent's estimated level on the measured trait, based on their responses to previously administered items.

Although there are well more than 100 varieties of IRT models41–43 to handle various data characteristics, such as dichotomous and polytomous response data, ordinal and nominal data, and unidimensional and multidimensional data, only a handful have been used in item analysis and scoring. In initial analyses of existing data sets, the PROMIS network evaluated both a general IRT model, Samejima's Graded Response Model44,45 (GRM), and 2 models based on the Rasch model framework, the Partial Credit Model46 and the Rating Scale Model.47,48 On the basis of these analyses, the PROMIS network decided to focus on the GRM in future item bank development work.

The GRM is a very flexible model from the parametric, unidimensional, polytomous-response IRT family of models. Because it allows discrimination to vary item by item, it typically fits response data better than a one-parameter (ie, Rasch) model.43,49 Further, compared with alternative 2-parameter models such as the generalized partial credit model, the GRM is relatively easy to understand and illustrate to “consumers,” and it retains its functional form when response categories are merged. Thus, the GRM offers a flexible framework for modeling participant responses to examine item and scale properties, to calibrate the items of the item bank, and to score individual response patterns in the PRO assessment. However, the PROMIS network will further examine the fit and added value of alternate IRT models using PROMIS data.

The unidimensional GRM is a generalization of the IRT 2-parameter logistic model for dichotomous response data. The GRM is based on the logistic function that describes, given the level of the trait being measured, the probability that an item response will be observed in category k or higher; the probability of a response in category k is then the difference between adjacent cumulative probabilities. For ordered responses X = k, k = 1, 2, 3, …, m_i, where response m_i reflects the highest θ value, this probability is defined44,45,50 as

P(X_i = k \mid \theta, a_i, b_i) = \frac{1}{1 + \exp[-a_i(\theta - b_{i,k})]} - \frac{1}{1 + \exp[-a_i(\theta - b_{i,k+1})]}    (1)

(with b_{i,1} = -\infty and b_{i,m_i+1} = +\infty by convention, so the category probabilities sum to 1). This function models the probability of observing each category as a function of the underlying construct. The subscript on m above indicates that the number of response categories does not need to be equal across items. The discrimination (slope) parameter a_i varies by item i in a scale. The threshold parameters b_{i,k} vary within an item with the constraint b_{i,k-1} < b_{i,k} < b_{i,k+1}, and each represents the point on the θ axis at which the probability that the response is in category k or higher passes 50%.

Figure 1 presents the category response curves (CRCs) for a 4-response-category item with IRT GRM parameters: a = 2.26, b1 = −1.00, b2 = 0.00, and b3 = 1.50. Each curve (one for each response category) represents the probability of a respondent selecting category k, given his/her level (θ) on the underlying construct. If a person's estimated θ is less than −1.00, then he/she is more likely to endorse the first response category. Likewise, if a person's estimated θ is between −1.00 and 0.00, then he/she is more likely to endorse the second category. A person with estimated θ above 1.50 will have the greatest likelihood of endorsing the fourth response category. In a calibration of the GRM to the item responses, category response curves such as those shown in Figure 1 are estimated for every item.

Once these response curves are estimated on a group of respondents from the first wave of PROMIS data collection, the curves are then used to estimate the θ levels of new respondents to the PROMIS questionnaires. For example, if a person selects response 3 for the item in Figure 1, it is likely that his/her θ level is between 0.0 and 1.5. Using this kind of information for additional items, a person's θ level is estimated by identifying which response they chose for each administered item. Thus, a person's level on the trait (θ) and an associated standard error are estimated, using maximum likelihood or Bayesian estimation methods, based on the complete pattern of responses given by each person in conjunction with the probability functions associated with each item response.
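One common Bayesian option is the expected-a-posteriori (EAP) estimate, sketched below under a standard normal prior (a choice consistent with the standardized θ metric described later); the sketch reuses grm_category_probs from the block above, and the toy bank is illustrative:

```python
import numpy as np

# Builds on grm_category_probs from the sketch above.
def eap_theta(responses, item_params, n_nodes=81):
    """Expected-a-posteriori theta estimate and standard error for one
    response pattern, under a standard normal prior.

    responses   -- chosen category (1-based) for each administered item
    item_params -- list of (a, thresholds) pairs from the calibrated bank"""
    theta = np.linspace(-4, 4, n_nodes)
    log_post = -0.5 * theta**2  # log N(0, 1) prior, up to a constant
    for x, (a, b) in zip(responses, item_params):
        probs = np.array([grm_category_probs(t, a, b)[x - 1] for t in theta])
        log_post += np.log(probs)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    est = float((theta * post).sum())
    se = float(np.sqrt((((theta - est) ** 2) * post).sum()))
    return est, se

bank = [(2.26, np.array([-1.0, 0.0, 1.5])), (1.50, np.array([-0.5, 0.8, 2.0]))]
print(eap_theta([3, 2], bank))  # point estimate and its standard error
```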

IRT model fit will be assessed using a number of indices, recognizing that universally accepted fit statistics do not exist. Also note that if model assumptions are supported by the data, then strict adherence to model fit statistics is not vital, given the limits of acceptable fit indices. Residuals between observed and expected response frequencies by item response category will be compared based on analyses of the size of the differences (residuals). Common fit statistics such as Q1, Bock's χ2, and others43,51 will be examined; also considered will be generalizations of Orlando and Thissen's S-X2 to polytomous data.52,53 The ultimate issue is to what degree misfit affects model performance in terms of the valid scaling of individual differences.54

Once analysts are satisfied with the fit of the IRT model to the response data, attention is shifted to analyzing the item and scale properties of the PROMIS domains. The psychometric properties of the items will be examined by review of their item parameter estimates, CRCs, and item information curves.55,56 Information curves indicate the range of θ where an item is best at discriminating among individuals. Higher information denotes more precision for measuring a person's trait level. The height of the curves (denoting more information) is a function of the discrimination power (a parameter) of the item. The location of the information curves is determined by the threshold (b) parameter(s) of the item. Information curves indicate which items are most useful for measuring different levels of the measured construct. This is critical for the item selection process in CAT and in the development of short forms.
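For the GRM, the item information function can be computed from the category probabilities as I(θ) = Σ_k [P′_k(θ)]² / P_k(θ). A sketch using numerical derivatives and the Figure 1 item (again building on grm_category_probs from the earlier sketch):

```python
import numpy as np

# Builds on grm_category_probs from the earlier sketch.
def grm_item_information(theta_grid: np.ndarray, a: float, b: np.ndarray) -> np.ndarray:
    """Fisher information I(theta) = sum_k P_k'(theta)^2 / P_k(theta),
    with the derivatives taken numerically."""
    eps = 1e-4
    info = np.empty_like(theta_grid, dtype=float)
    for i, t in enumerate(theta_grid):
        p = grm_category_probs(t, a, b)
        dp = (grm_category_probs(t + eps, a, b)
              - grm_category_probs(t - eps, a, b)) / (2 * eps)
        info[i] = np.sum(dp**2 / p)
    return info

grid = np.linspace(-3, 3, 61)
info = grm_item_information(grid, a=2.26, b=np.array([-1.0, 0.0, 1.5]))
print(f"peak information {info.max():.2f} at theta = {grid[np.argmax(info)]:.1f}")
```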

Poorly performing items will be reviewed by content experts before the item bank is established. Misfitting items may be retained or revised when they are identified as clinically relevant and no better-fitting alternative is available. Low-discriminating items in the tails of the theta distribution (at low or at high levels of the trait being measured) also may be retained or revised to add information for extreme scores, where they would not have been retained in better-populated regions of the continuum. It is at the extremes of the trait continuum that CAT is most effective, but only if items exist that provide good measurement along these portions of the continuum.

Future research by PROMIS will examine the added value of more complex models, including multidimensional IRT models.57–63

FIGURE 1. Category response curves for a four-response-category item with IRT GRM parameters: a = 2.26, b1 = −1.00, b2 = 0.00, and b3 = 1.50.


The attraction of these methods is reduced respondent burden and a more realistic underlying measurement model. Multidimensional models take advantage of the correlations among subdomains to inform the measurement of the target constructs; thus precise theta estimates are obtained with fewer items. However, it should be noted that multidimensional IRT has all the rotation problems and complexity that factor analysis does; it greatly complicates DIF analyses, and the meaning of scores is often unclear when subscales are highly correlated. In addition, essentially unidimensional constructs are often more desirable from a theoretical and clinical perspective.

EVALUATION OF DIFFERENTIAL ITEM FUNCTIONING

According to the IRT model, an item displays differential item functioning (DIF) if the probabilities of responding in different categories vary across studied groups, given equivalent levels of the underlying attribute.4,41,43,64 In other words, DIF exists when, for example, women at moderate levels of emotional distress are more likely to report crying than are men at the same moderate level of distress. One reason that instruments containing items with DIF may have reduced validity for between-group comparisons is that their scores indicate attributes other than the one the scale is intended to measure.64 The impact of DIF on CAT may be greater than in fixed-length assessments because only a small item set is administered.

In the context of PROMIS, DIF may occur across groups of different races, gender, age groups, or disease conditions. The question of whether or not DIF should be tested with respect to a specific disease category is one that should be considered by content experts. Roussos and Stout18 recommended a first step in DIF analyses that includes substantive (qualitative) reviews in which DIF hypotheses are generated, and it is decided whether or not unintended “adverse” DIF is present as a secondary factor. Because this process is largely based on judgment, there may be some error at this step. Substantive reviewers may use 4 sources to inform the DIF hypotheses: previously published DIF analyses; substantive content considerations and judgment regarding current items; review of archival data (review of contexts present in other similar data); and use of archival or pretest data for testing bundles of items according to some organizing principle. The stage 2 statistical analyses comprise confirmatory tests of DIF hypotheses. This type of procedure can be extended to health-related quality of life measures through use of qualitative methods proposed in the PROMIS effort, including the use of expert review, focus groups, cognitive interviews, and the generation of possible hypotheses regarding subgroups for which DIF might be observed.

IRT provides a useful framework for identifying items with DIF. The category response curves of an item calibrated based on the responses of 2 different groups can be displayed simultaneously. If the model fits, IRT item parameters (ie, threshold and discrimination parameters) are assumed to be linearly invariant with respect to group membership. Therefore, differences between the CRCs, after linking the θ metric between each group, indicate that respondents at the same level of the underlying trait, but from different groups, have different probabilities of endorsing the item. DIF can occur in the threshold or discrimination parameter. Uniform DIF refers to DIF in the threshold parameter of the model, which indicates that the focal and reference groups have uniformly different response probabilities for the tested item. Nonuniform DIF appears in the discrimination parameter and suggests an interaction between the underlying measured variable and group membership; that is, the degree to which an item relates to the underlying construct depends on the group being measured.41,64,65

Determination of DIF is optimized when the samples are as representative as possible of the populations from which they are drawn. Most DIF procedures rely on the identification of a core set of anchor items that are thought to be free of DIF and are used to link the 2 groups on a common scale. DIF detection methods use scores based on these items to control for underlying differences between the comparison groups while testing for DIF in the item under scrutiny. There are numerous approaches to assessing DIF.66 Herein, we describe the DIF methods being considered by the PROMIS analytic team. It is prudent to evaluate DIF using multiple methods and to flag those items identified consistently.

There are basically 2 IRT-based methods that will be used to identify DIF: the log-likelihood IRT approach, accompanied by byproducts of differential functioning of items and tests (DFIT) to examine DIF magnitude, and the IRT/ordinal logistic regression (OLR) approach with built-in tests of magnitude. The approach recommended is that used by PROMIS investigators, in which significant DIF is first identified using either likelihood ratio (LR)-based significance tests (IRT-LR), or significance tests and changes in beta coefficients (IRT/OLR). The IRT-LR approach also incorporates a correction for multiple comparisons. Finally, both approaches examine the magnitude of DIF in the determination of the final items that are flagged. If either method flags an item as having DIF according to these rules, the item will be considered as having DIF. Details regarding the steps in the analyses can be found elsewhere.67–69

The IRT-LR test64 will be used to identify both uniform and nonuniform DIF. The procedure compares hierarchically nested IRT models, with 1 model that fully constrains the IRT parameters to be equal between the 2 comparison groups and other models that allow the item parameters to be freely estimated between groups. One key difference between the IRT-LR method and many other DIF methods is how differences between comparison groups are estimated from the anchor items. Other DIF methods use the simple summed score of the anchor set, but the IRT-LR procedure estimates a person's theta score based on his/her responses to the anchor set. This approach is similar to that used in CAT. Thus, IRT-LR procedures make an easy transition to the detection of DIF for data collected in a CAT environment.70

Used in conjunction with IRT-LR, and based on IRT, are Raju's signed and unsigned area tests, combined with the Differential Functioning of Items and Tests (DFIT) framework.71 This framework includes a noncompensatory DIF (NCDIF) index, which reflects the average squared difference between the item-level scores for the focal and reference groups. Several magnitude measures are available in the context of area statistics and the DFIT methodology developed by Raju and colleagues.71–73 For binary items, the exact area methods compare the areas between the item response functions estimated in 2 different groups; Cohen et al74 extended these area statistics for the graded response model.

The second DIF method is ordinal logistic regression (OLR),75 in which a series of 3 logistic models predicting the probability of item response are compared. The independent variables in Model 1 are the trait estimate (eg, raw scale score or theta estimate), group, and the interaction between group and trait. Model 2 includes only the main effects of trait and group, and Model 3 includes only the trait estimate. Nonuniform DIF is detected if there is a statistically significant difference between the likelihood values for Model 1 and Model 2. Uniform DIF is evident if there is a significant difference between the likelihood values for Models 2 and 3. Crane et al76 suggested that, in addition to statistical significance, the relative change in beta coefficients between Models 2 and 3 should be considered. On the basis of simulations by Maldonado and Greenland,77 a 10% change in beta has been recommended as a criterion for uniform DIF.
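A sketch of this 3-model OLR procedure, assuming the OrderedModel class from the statsmodels package (an assumption about tooling; the simulated data and variable names are illustrative, not the PROMIS implementation):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from statsmodels.miscmodels.ordinal_model import OrderedModel

def olr_dif_tests(item, trait, group):
    """Three-model OLR DIF procedure described in the text.

    Model 1: trait + group + trait*group (vs Model 2 -> nonuniform DIF)
    Model 2: trait + group              (vs Model 3 -> uniform DIF)
    Model 3: trait only; also reports the relative change in the trait
    coefficient between Models 3 and 2 (the 10% rule)."""
    endog = pd.Series(pd.Categorical(item, ordered=True))
    X3 = pd.DataFrame({"trait": trait})
    X2 = X3.assign(group=group)
    X1 = X2.assign(interaction=trait * group)
    m1, m2, m3 = (OrderedModel(endog, X, distr="logit").fit(method="bfgs", disp=False)
                  for X in (X1, X2, X3))
    return {
        "nonuniform_p": chi2.sf(2 * (m1.llf - m2.llf), df=1),
        "uniform_p": chi2.sf(2 * (m2.llf - m3.llf), df=1),
        "beta_change": abs(m2.params["trait"] - m3.params["trait"])
                       / abs(m3.params["trait"]),
    }

# Simulated item with uniform DIF against the focal group (group = 1)
rng = np.random.default_rng(2)
n = 2000
theta = rng.normal(size=n)
grp = rng.integers(0, 2, size=n)
latent = 1.5 * theta - 0.6 * grp + rng.logistic(size=n)
item = np.digitize(latent, [-1.5, 0.0, 1.5])
print(olr_dif_tests(item, pd.Series(theta), pd.Series(grp)))
# expect a small uniform_p flagging the simulated uniform DIF
```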

PROMIS also will evaluate items for DIF using a hybrid approach that combines the strengths of OLR and IRT.68,78 This iterative approach uses IRT theta estimates in OLR models to determine whether items have uniform or nonuniform DIF. To account for spurious DIF (false-positive or false-negative DIF found due to DIF in other items), demographic-specific item parameters are estimated for items found on the initial run to have DIF; items free of DIF serve as anchor items. DIF detection is repeated using these updated IRT estimates, and these procedures (DIF detection and IRT estimation) are repeated until the same items are identified on successive runs.

Advantages of the techniques reviewed above include the rapid empirical identification of anchor items and the determination of the presence and magnitude of DIF. Another advantage is the possibility of using demographic-specific item parameters in a CAT context if that is considered a viable option.

Multiple-indicator, multiple-cause (MIMIC) models offer an attractive framework for examining DIF in the context of evaluating its impact.79 Based on a modification of structural equation modeling, the single-group MIMIC model permits examination of the direct effect of background variables on items, while controlling for the level of the attribute studied.80 The MIMIC model also allows background variables like demographic characteristics to be used as covariates to account for differences among the comparison populations when examining DIF. Although the MIMIC model does not permit tests of nonuniform DIF, an advantage is that impact can be examined by comparing the estimated group effects in models with, and without, adjustment for DIF.81

There are several options for treating items with DIF. One extreme option is to eliminate the item from the bank. If the analyses suggest that there are large numbers of items without consequential DIF, this option will be considered. On the other hand, if many items have DIF, especially in key areas of the trait continuum that are sparsely populated by items, or if content experts determine that the items with DIF are central to the meaning of the construct, other options are to ignore DIF if it is small, to revise items to be free of DIF, to tag items that should not be administered to specific groups, or to control for DIF by using demographic-specific item parameters.

ITEM CALIBRATION FOR BANKING

After a comprehensive review of the item properties, including evaluation of DIF across key demographic and clinically different groups, the final selected item set will be calibrated using the GRM, and CAT algorithms will be developed. One set of IRT item parameters will be established for all items unless DIF evidence suggests that some items should have different calibrations based on key groups to be measured by the PROMIS system. The item pools for each unidimensional PROMIS domain will include a large set of items, with most pools containing more than 50 items.

To identify the metric for the PROMIS item parameter estimates, the scale for person parameters must be fixed in some manner, typically by specifying that the mean in the reference population is 0 and the standard deviation is 1. The PROMIS network has selected the US general population as the reference population. This will allow interpretation of the difficulty (threshold) parameter(s) relative to the general US population mean and the discrimination parameters relative to the population standard deviation. Calibrated in this manner, in the dichotomous response case, an item with a difficulty parameter estimate of b = 1.5 suggests that a person who is 1.5 standard deviations above the mean will have a 50% probability of endorsing the item. Population mean differences and standard deviation ratios will be computed for each disease population tested within PROMIS to allow benchmarking. Thus, a person can compare his/her symptom severity or functioning to people with a similar disease or to the general US population.

This standardized metric will facilitate the conversion of the IRT z-score metric to the T-score distribution adopted by the PROMIS steering committee. For the purposes of computing the proportion of the norming/calibration sample that scores below each theta level and identifying the z-score corresponding to that percentage from a normal distribution, the IRT scale score estimates will be treated as raw scores. These pseudo-normalized z-scores will be converted to T-scores with a mean of 50 and a standard deviation of 10. For PRO domains where the normal distribution is not appropriate, theta estimates will be converted to T-scores by a linear conversion.
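Both conversions are easy to express directly; a minimal sketch (function names are illustrative):

```python
import numpy as np
from scipy.stats import norm, rankdata

def pseudo_normalized_t_scores(theta_hat: np.ndarray) -> np.ndarray:
    """Treat theta estimates as raw scores: take each score's percentile
    rank in the norming/calibration sample, map it to a normal z-score,
    and rescale to mean 50, SD 10."""
    pct = rankdata(theta_hat) / (theta_hat.size + 1)  # proportion at or below
    return 50 + 10 * norm.ppf(pct)

def linear_t_scores(theta_hat: np.ndarray) -> np.ndarray:
    """Linear conversion for domains where normalization is not appropriate."""
    return 50 + 10 * theta_hat

sample = np.random.default_rng(3).normal(size=1000)
print(pseudo_normalized_t_scores(sample)[:5].round(1))
```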

Each of the PROMIS item banks calibrated from the wave one data will be examined for its ability to provide precise measurement across the construct continuum, as assessed by scale information and standard error of measurement curves. Further, CAT simulations will examine the discriminative ability of the item bank at any level of the construct continuum82,83; a minimal item-selection step is sketched after this paragraph. The ideal is to have high precision and discrimination ability across the continuum of symptom severity or functional ability. Likely, there will be less precision in the extremes of the distributions (eg, high physical functioning or absence of depression); however, the PROMIS content experts are taking great care to write items that may help reduce floor and ceiling effects. The PROMIS network will review the findings from these analyses and will follow up with additional work to: (1) write new items to fill gaps in the construct continuum; (2) examine alternate psychometric methods that may improve precision or efficiency; (3) evaluate the items and scales for clinical application; and (4) review the bank items to ensure their relevance in different disease and demographic populations not covered or poorly covered in the calibration data.
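As a minimal illustration of the CAT machinery these simulations exercise, the following sketch selects the next item by maximum information at the current θ estimate (building on the GRM functions sketched earlier; the toy bank is illustrative):

```python
import numpy as np

# Builds on grm_item_information from the earlier sketch.
def cat_next_item(theta_hat, bank, administered):
    """Maximum-information CAT step: return the index of the
    unadministered bank item that is most informative at the current
    theta estimate."""
    grid = np.array([theta_hat])
    best, best_info = -1, -np.inf
    for idx, (a, b) in enumerate(bank):
        if idx in administered:
            continue
        info = grm_item_information(grid, a, b)[0]
        if info > best_info:
            best, best_info = idx, info
    return best

bank = [(2.26, np.array([-1.0, 0.0, 1.5])),   # the Figure 1 item
        (1.10, np.array([-2.0, -1.0, 0.5])),  # informative at low theta
        (1.80, np.array([0.5, 1.5, 2.5]))]    # informative at high theta
print(cat_next_item(theta_hat=1.2, bank=bank, administered={0}))  # -> 2
```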

CONCLUSIONS

This report has presented an overview of the psychometric methods that will be used in the PROMIS project, both to examine the properties of the items and domains and to calibrate items with properties that will allow the CAT procedure to select the most informative set of items to estimate a person's level of health. The PROMIS project is faced with an enormous challenge to create psychometrically sound and valid banks in a short amount of time. Multiple item banks will be developed, and at least 7 disease populations and a general US population that vary across a range of key demographic characteristics will be represented in the initial calibration sample collected in wave one. The enormity of the project requires the PROMIS psychometric team to be flexible in terms of the methods used. The design presented herein was developed to be robust to violations of the assumptions required to reach project goals. It is also expected that a large-scale evaluation phase will follow the initial wave of testing to examine alternative methods that may yield more interpretable and efficient results.

REFERENCES
1. Ader DN. Developing the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45(Suppl 1):S1–S2.
2. Cella D, Yount S, Rothrock N, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap Cooperative Group during its first two years. Med Care. 2007;45(Suppl 1):S3–S11.
3. Cook KF, O'Malley KJ, Roddey TS. Dynamic assessment of health outcomes: time to let the CAT out of the bag? Health Services Res. 2005;40(Part II):1694–1711.
4. Teresi JA. Statistical methods of examination of differential item functioning with applications to cross-cultural measurement of functional, physical and mental health. J Mental Health Aging. 2001;7:31–40.
5. Hays RD, Liu H, Spritzer K, et al. Item response theory analyses of physical functioning items in the Medical Outcomes Study. Med Care. 2007;45(Suppl 1):S32–S38.
6. Hill CD, Edwards MC, Thissen D, et al. Practical issues in the application of item response theory: a demonstration using items from the Pediatric Quality of Life Inventory (PedsQL) 4.0 Generic Core Scales. Med Care. 2007;45(Suppl 1):S39–S47.
7. DeWalt DA, Rothrock N, Yount S, et al. Evaluation of item candidates: the PROMIS qualitative item review. Med Care. 2007;45(Suppl 1):S12–S21.
8. Kutner MH, Nachtsheim CJ, Neter J, et al. Applied Linear Statistical Models. 5th ed. New York, NY: McGraw-Hill/Irwin; 2005:664–665, 1173–1183.
9. Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research. 4th ed. Malden, MA: Blackwell Science; 2002:261–264.
10. Ware JE Jr, Bjorner JB, Kosinski M. Practical implications of item response theory and computerized adaptive testing: a brief summary of ongoing studies of widely used headache impact scales. Med Care. 2000;38:II73–II82.
11. Bjorner JB, Kosinski M, Ware JE Jr. Calibration of an item pool for assessing the burden of headaches: an application of item response theory to the headache impact test (HIT). Qual Life Res. 2003;12:913–933.
12. Lai JS, Cella D, Chang CH, et al. Item banking to improve, shorten and computerize self-reported fatigue: an illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Qual Life Res. 2003;12:485–501.
13. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
14. McDonald RP. The dimensionality of tests and items. Br J Mathematical Stat Psychol. 1981;34:100–117.
15. McDonald RP. Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum; 1999.
16. Drasgow F, Parsons CK. Application of unidimensional item response theory models to multidimensional data. Appl Psychol Measure. 1983;7:189–199.
17. Harrison DA. Robustness of IRT parameter estimation to violations of the unidimensionality assumption. J Educational Stat. 1986;11:91–115.
18. Roussos L, Stout W. A multidimensionality-based DIF analysis paradigm. Appl Psychol Measure. 1996;20:355–371.
19. Stout W. A nonparametric approach for assessing latent trait unidimensionality. Psychometrika. 1987;52:589–617.
20. Lai J-S, Crane PK, Cella D. Factor analysis techniques for assessing sufficient unidimensionality of cancer related fatigue. Qual Life Res. 2006;15:1179–1190.
21. Muthen LK, Muthen BO. Mplus User's Guide. Los Angeles, CA: Muthen & Muthen; 1998.
22. Joreskog KG, Sorbom D, Du Toit S, et al. LISREL 8: New Statistical Features. Third printing with revisions. Lincolnwood, IL: Scientific Software International; 2003.
23. Muthen B, du Toit SHC, Spisic D. Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes. Psychometrika. 1997.
24. Kline RB. Principles and Practice of Structural Equation Modeling. New York, NY: Guilford Press; 1998.
25. Bentler P. Comparative fit indices in structural models. Psychol Bull. 1990;107:238–246.
26. West SG, Finch JF, Curran PJ. SEM with nonnormal variables. In: Hoyle RH, ed. Structural Equation Modeling: Concepts, Issues and Applications. Thousand Oaks, CA: Sage Publications; 1995:56–75.
27. Hu LT, Bentler P. Cutoff criteria for fit indices in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling. 1999;6:1–55.
28. Browne MW, Cudeck R. Alternative ways of assessing model fit. In: Bollen KA, Long JS, eds. Testing Structural Equation Models. Newbury Park, CA: Sage Publications; 1993.
29. Gibbons RD, Hedeker DR, Bock RD. Full-information item bi-factor analysis. Psychometrika. 1992;57:423–436.
30. Gibbons RD, Bock RD, Hedeker D, et al. Full-information item bi-factor analysis of graded response data. Appl Psychol Measure. In press.
31. Reise SP, Haviland MG. Item response theory and the measurement of clinical change. J Personality Assess. 2005;84:228–238.
32. Joreskog KG, Moustaki I. Factor analysis of ordinal variables with full information maximum likelihood. Unpublished manuscript downloaded from http://www.ssicentral.com. Accessed January 6, 2007.
33. Edwards MC. A Markov Chain Monte Carlo Approach to Confirmatory Item Factor Analysis [dissertation]. Chapel Hill, NC: University of North Carolina; 2005.
34. Steinberg L, Thissen D. Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychol Methods. 1996;1:81–97.
35. Wainer H, Thissen D. How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measure. 1996;15:22–29.
36. Yen WM. Scaling performance assessments: strategies for managing local item dependence. J Educational Measure. 1993;30:187–213.
37. Yen WM. Effect of local item dependence on the fit and equating performance of the three-parameter logistic model. Appl Psychol Measure. 1984;8:125–145.
38. Chen W-H, Thissen D. Local dependence indexes for item pairs using item response theory. J Educ Behav Stat. 1997;22:265–289.
39. Ramsay JO. A functional approach to modeling test data. In: van der Linden WJ, Hambleton RK, eds. Handbook of Modern Item Response Theory. New York, NY: Springer; 1997:381–394.
40. Molenaar IW, Sijtsma K. Users Manual MSP5 for Windows: A Program for Mokken Scale Analysis for Polytomous Items [software manual]. Groningen, the Netherlands: iec ProGAMMA; 2000.
41. Hambleton RK, Swaminathan H, Rogers H. Fundamentals of Item Response Theory. Newbury Park, CA: Sage; 1991.
42. Embretson SE, Reise SP. Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum; 2000.
43. van der Linden WJ, Hambleton RK, eds. Handbook of Modern Item Response Theory. New York, NY: Springer-Verlag; 1997.
44. Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monogr. 1969;No. 17.
45. Samejima F. Graded response model. In: van der Linden WJ, Hambleton RK, eds. Handbook of Modern Item Response Theory. New York, NY: Springer; 1997:85–100.
46. Masters GN. A Rasch model for partial credit scoring. Psychometrika. 1982;47:149–174.
47. Andrich D. A rating formulation for ordered response categories. Psychometrika. 1978;43:561–573.
48. Wright BD, Masters GN. Rating Scale Analysis. Chicago, IL: MESA Press; 1982.
49. Thissen D, Orlando M. Item response theory for items scored in two categories. In: Thissen D, Wainer H, eds. Test Scoring. Mahwah, NJ: Lawrence Erlbaum; 2001:73–140.
50. Thissen D, Nelson L, Rosa K, et al. Item response theory for items scored in more than two categories. In: Thissen D, Wainer H, eds. Test Scoring. Mahwah, NJ: Lawrence Erlbaum; 2001:141–186.
51. Yen WM. Using simulation results to choose a latent trait model. Appl Psychol Measure. 1981;5:245–262.
52. Orlando M, Thissen D. Likelihood-based item-fit indices for dichotomous item response theory models. Appl Psychol Measure. 2000;24:50–64.
53. Orlando M, Thissen D. Further examination of the performance of S-X2, an item fit index for dichotomous item response theory models. Appl Psychol Measure. 2003;27:289–298.
54. Hambleton RK, Han N. Assessing the fit of IRT models to educational and psychological test data: a five step plan and several graphical displays. In: Lenderking WR, Revicki D, eds. Advances in Health Outcomes Research Methods, Measurement, Statistical Analysis, and Clinical Applications. Washington, DC: International Society for Quality of Life Research; 2005:57–78.
55. Reeve BB. Item response theory modeling in health outcomes measurement. Exp Rev Pharmacoeconomics Outcomes Res. 2003;3:131–145.
56. Reeve BB, Fayers P. Applying item response theory modeling for evaluating questionnaire item and scale properties. In: Fayers P, Hays RD, eds. Assessing Quality of Life in Clinical Trials: Methods of Practice. 2nd ed. Oxford, NY: Oxford University Press; 2005:55–73.
57. Briggs DC, Wilson M. An introduction to multidimensional measurement using Rasch models. J Appl Measure. 2003;4:87–100.
58. te Marvelde JM, Glas CAW, Landeghem GV, et al. Application of multidimensional item response theory models to longitudinal data. Educational Psychol Measure. 2006;66:5–34.
59. Segall DO. Multidimensional adaptive testing. Psychometrika. 1996;61:331–354.
60. van der Linden WJ. Multidimensional adaptive testing with a minimum error-variance criterion. J Educ Behav Stat. 1999;24:398–412.
61. Gardner W, Kelleher KJ, Pajer KA. Multidimensional adaptive testing for mental health problems in primary care. Med Care. 2002;40:812–823.
62. Ackerman TA, Gierl MJ, Walker CM. Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measure. 2003;22:37–51.
63. Petersen MA, Groenvold M, Aaronson N, et al. Multidimensional computerized adaptive testing of the EORTC QLQ-C30: basic developments and evaluation. Qual Life Res. 2006;15:315–329.
64. Thissen D, Steinberg L, Wainer H. Detection of differential item functioning using the parameters of item response models. In: Holland PW, Wainer H, eds. Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates; 1993:67–113.
65. Teresi JA, Kleinman MK, Ocepek-Welikson K. Modern psychometric methods for detection of differential item functioning: application to cognitive assessment measures. Stat Med. 2000;19:1651–1683.
66. Millsap RE, Everson HT. Methodology review: statistical approaches for assessing measurement bias. Appl Psychol Measure. 1993;17:297–334.
67. Orlando M, Thissen D, Teresi J, et al. Identification of differential item functioning using item response theory and the likelihood-based model comparison approach: application to the Mini-Mental State Examination. Med Care. 2006;44(Suppl 3):S134–S142.
68. Crane PK, Gibbons LE, Jolley L, et al. Differential item functioning analysis with ordinal logistic regression techniques: DIFdetect and difwithpar. Med Care. 2006;44(Suppl 3):S115–S123.
69. Teresi JA. Different approaches to differential item functioning in health applications: advantages, disadvantages, and some neglected topics. Med Care. 2006;44(Suppl 3):S152–S170.
70. Thissen D. IRTLRDIF v2.0b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning [computer software]; 2001.
71. Raju NS, van der Linden WJ, Fleer PF. IRT-based internal measures of differential functioning of items and tests. Appl Psychol Measure. 1995;19:353–368.
72. Flowers CP, Oshima TC, Raju NS. A description and demonstration of the polytomous DFIT framework. Appl Psychol Measure. 1999;23:309–326.
73. Raju NS. DFITP5: A Fortran Program for Calculating Dichotomous DIF/DTF [computer program]. Chicago, IL: Illinois Institute of Technology; 1999.
74. Cohen AS, Kim SH, Baker FB. Detection of differential item functioning in the graded response model. Appl Psychol Measure. 1993;17:335–350.
75. Zumbo BD. A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-type (Ordinal) Item Scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense; 1999.
76. Crane PK, van Belle G, Larson EB. Test bias in a cognitive test: differential item functioning in the CASI. Stat Med. 2004;23:241–256.
77. Maldonado G, Greenland S. Simulation study of confounder-selection strategies. Am J Epidemiol. 1993;138:923–936.
78. Crane PK, Hart DL, Gibbons LE, et al. A 37-item shoulder functional status item pool had negligible differential item functioning. J Clin Epidemiol. 2006;59:478–484.
79. Muthen BO. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika. 1984;49:115–132.
80. Fleishman JA, Lawrence WF. Demographic variation in SF-12 scores: true differences or differential item functioning. Med Care. 2003;41:III-75–III-86.
81. Jones RN, Gallo JJ. Education and sex differences in the Mini-Mental State Examination: effects of differential item functioning. J Gerontol. 2000;55B:273–282.
82. Fliege H, Becker J, Walter OB, et al. Development of a computer-adaptive test for depression. Qual Life Res. 2005;14:2277–2291.
83. Hart DL, Cook KF, Mioduski JE, et al. Simulated computerized adaptive test for patients with shoulder impairments was efficient and produced valid measures of function. J Clin Epidemiol. 2006;59:290–298.
