
Psychological Methods, 2000, Vol. 5, No. 1, 125-146

Copyright 2000 by the American Psychological Association, Inc.

1082-989X/00/$5.00  DOI: 10.1037//1082-989X.5.1.125

Using IRT to Separate Measurement Bias From True Group

Differences on Homogeneous and Heterogeneous Scales:

An Illustration With the MMPI

Niels G. Waller
Vanderbilt University

Jane S. Thompson and Ernst Wenk
University of California, Davis

The authors present a didactic illustration of how item response theory (IRT) can

be used to separate measurement bias from true group differences on homogeneous

and heterogeneous scales. Several bias detection methods are illustrated with 12

unidimensional Minnesota Multiphasic Personality Inventory (MMPI) factor scales

(Waller, 1999) and the 13 multidimensional MMPI validity and clinical scales. The

article begins with a brief review of MMPI bias research and nontechnical reviews

of the 2-parameter logistic model (2-PLM) and several IRT-based methods for bias

detection. A goal of this article is to demonstrate that homogeneous and heteroge-

neous scales that are composed of biased items do not necessarily yield biased test

scores. To that end, the authors perform differential item- and test-functioning

analyses on the MMPI factor, validity, and clinical scales using data from 511

Blacks and 1,277 Whites from the California Youth Authority.

Few topics in applied psychometrics have gener-

ated as much controversy and confusion as the

complementary issues of measurement invariance and

measurement bias. Although formal definitions of

measurement invariance (Holland & Wainer, 1993;

Meredith, 1993) are widely known by psychometri-

cians, and psychometrically defensible methods for

identifying biased items and tests have been available

Niels G. Waller, Department of Psychology and Human

Development, Vanderbilt University; Jane S. Thompson

and Ernst Wenk, Department of Psychology, University of

California, Davis.

We express our sincere thanks to Chris Fraley, Tim

Gaffney, Lew Goldberg, Caprice Niccoli-Waller, Steve

Reise, Auke Tellegen, Craig Thompson, and three anony-

mous reviewers for helpful comments on an earlier version

of this article. All the tests for differential item and test

functioning described in this article can be calculated with

an S-PLUS function called LINKDIF (Waller, 1998). Re-

searchers interested in applying the item response theory

methods described in this article may download LINKDIF

from the following website: http://peabody.vanderbilt.edu/

depts/psych_and_hd/faculty/wallern/.

Correspondence concerning this article should be ad-

dressed to Niels G. Waller, Department of Psychology and

Human Development, Box 512, Peabody College, Vander-

bilt University, Nashville, Tennessee 37203. Electronic

mail may be sent to [email protected].

for some time (Oshima, Raju, & Flowers, 1997; Raju,

1988, 1990; for nontechnical reviews see Reise,

Widaman, & Pugh, 1993; Widaman & Reise, 1997),

the assessment community has yet to fully assimilate

this work. As a case in point, consider the voluminous

literature about racial bias on the Minnesota Multi-

phasic Personality Inventory (MMPI; Hathaway &

McKinley, 1940) and the MMPI-2 (Butcher, Dahl-

strom, Graham, Tellegen, & Kaemmer, 1989). Much

of this literature concerns Black-White differences on

the MMPI validity and clinical scales (for reviews,

see Dahlstrom & Gynther, 1986; Greene, 1987, 1991,

chap. 8) and the supposed implications of these dif-

ferences for valid test interpretation (Gynther &

Green, 1980; McNulty, Graham, Ben-Porath, & Stein,

1997; Pritchard & Rosenblatt, 1980a; Timbrook &

Graham, 1994). Given that the MMPI and MMPI-2

are the most widely used psychological tests in the

world (Butcher & Rouse, 1996; Lubin, Larsen, Mata-

razzo, & Seever, 1985), any biases that are found in

the inventory would have legal, ethical, and practical

implications (Gottfredson, 1994). Thus, it is surpris-

ing that modern techniques for assessing measure-

ment bias have not been applied to the MMPI inven-

tories.

In this article we review several contemporary item

response theory (IRT; Hambleton, Swaminathan, &

Rogers, 1991; Lord, 1980) models for assessing item



and scale bias on unidimensional (Raju, 1988, 1990)

and multidimensional (Oshima et al., 1997) scales.

One of our goals is to encourage the assessment com-

munity to use these psychometric models when con-

ducting group comparisons research (e.g., in racial,

ethnic, cross-cultural, or gender group comparisons).

Toward this end we demonstrate how IRT can be used

to elucidate the psychometric properties of 12 homo-

geneous factor scales that can be scored on the MMPI

(Waller, 1999). We realize that most MMPI and

MMPI-2 scales are not homogeneous, a fact that

likely explains why IRT models for item and scale

bias have heretofore not been applied to the MMPI.

Although scale heterogeneity (i.e., multidimensional-

ity) is a putative impediment in the application of IRT

to the MMPI validity and clinical scales, we demon-

strate how unidimensional IRT models can be used to

assess measurement invariance on these scales.

This article is organized as follows. The first sec-

tion provides a brief review of the MMPI bias litera-

ture with respect to Black-White differences in scale

elevation. The second section provides a relatively

nontechnical introduction to the two-parameter logis-

tic IRT model (2-PLM; Birnbaum, 1968; Lord, 1980).

The third section describes how this model can be

used to separate measurement bias from true group

differences on estimated latent variables. The fourth

section characterizes the samples used in our didactic

example and reports the results of a series of analyses

aimed at detecting differential item and test function-

ing on the MMPI factor, validity, and clinical scales.

Finally, in the last section, we discuss the implications

of our analyses for future research aimed at distin-

guishing measurement bias from true group differ-

ences on homogeneous and heterogeneous scales.

Black-White Differences on the MMPI: A Brief Review of the Literature

The literature on MMPI Black-White differences

has been characterized by a level of passion not often

found in academic writing. During the 1970s and

1980s, for example, articles routinely appeared with

provocative titles such as "Is the MMPI an Appropri-

ate Assessment Device for Blacks?" (Gynther, 1981)

and "White Norms and Black MMPIs: A Prescription

for Discrimination?" (Gynther, 1972). Reviewers of

this literature (Gynther & Green, 1980; Pritchard &

Rosenblatt, 1980a, 1980b) expressed strong opinions,

and they frequently came away with widely opposing

conclusions when reviewing similar bodies of work

(Greene, 1987; Gynther, 1989).

Two points of contention galvanized the contro-

versy during this period. The first was whether Blacks

scored significantly higher than Whites on various

MMPI scales, and the second was whether those dif-

ferences, if they existed, were attributable to biased

measurement. Dozens of articles compared Blacks

and Whites on the MMPI validity and clinical scales

(reviewed in Dahlstrom, Lachar, & Dahlstrom, 1986;

Greene, 1991). Many others attempted to document

Black-White differences at the item level (Bertelson,

Marks, & May, 1982; Costello, 1973, 1977; Gynther

& Witt, 1976; Harrison & Kass, 1967, 1968; Jones,

1978; Miller, Knapp, & Daniels, 1968; Witt &

Gynther, 1975).

Gynther (1972, 1989; Gynther & Green, 1980;

Gynther, Lachar, & Dahlstrom, 1978), in an influen-

tial series of articles, argued that a close examination

of the literature revealed a consistent pattern in which

"blacks, whether normal or institutionalized, gener-

ally obtain higher scores than whites on Scales F [In-

frequency], 8 [Sc; Schizophrenia] and 9 [Ma; Hypo-

mania]" (Gynther, 1972, p. 386). He also suggested

that these differences stemmed from inherent biases in

the test, and consequently he called for race-based

norms for scoring and interpreting the MMPI

(Gynther & Green, 1980; Gynther et al., 1978;

Gynther, 1972). Other researchers (e.g., Pritchard &

Rosenblatt, 1980a, 1980b) were quick to disagree, and

some pointed out that without further information

Black-White score differences could not speak to is-

sues of test bias or test fairness. Pritchard and Rosen-

blatt (1980b), for instance, noted that "scale differ-

ences between racial subgroups imply differential

rates of classification error only when the racial sub-

groups in a sample have equivalent base rates for

psychopathology" (p. 273, emphasis added). These

authors also noted that none of the comparisons cited

by Gynther and others had adequately matched their

samples on psychopathology, and thus the implica-

tions of those studies with respect to differential as-

sessment validity were uninterpretable.

Following Pritchard and Rosenblatt's (1980b) com-

mentary, numerous studies matched Black and White

samples on several moderator variables that were be-

lieved to account for the observed group differences

on the MMPI (Bertelson et al., 1982; Butcher,

Braswell, & Raney, 1983; Newmark, Gentry, Warren,

& Finch, 1981; Patterson, Charles, Woodward, Rob-

erts, & Penk, 1981; Penk et al., 1982). Years later, in

his review of this literature, Greene (1987) concluded

that "moderator variables, such as socioeconomic sta-


tus, education, and intelligence, as well as profile va-

lidity, are more important determinants of MMPI per-

formance than ethnic status" (p. 509). More recently,

Graham (1993) opined that "differences between Af-

rican Americans and Caucasians are small when

groups are matched on variables such as age and so-

cioeconomic status" (p. 199).

We concur that matching is an important design

feature of valid measurement bias research. We also

believe, however, that the most important matching

variables in this regard have been conspicuously miss-

ing from this literature. To wit, no studies to our

knowledge have matched groups on the estimated la-

tent variables that hypothetically determine MMPI

item response behavior. The absence of such studies

is unfortunate because, as noted by Meredith and

Millsap (1992), "bias detection procedures which rely

exclusively on manifest variables are not generally

diagnostic of bias, or lack of bias" (p. 310). These

authors suggest that "logical alternatives to manifest

variable procedures are procedures that model the

manifest variables in terms of latent variables"

(p. 310).

There are several reasons why bias researchers

should match groups on the latent variables measured

by a test. (In the present case, this is analogous to

matching groups on psychopathology.) First, bias can

create or accentuate group differences on manifest

variables in situations in which there are no differ-

ences on the latent variables. Second, measurement

bias can mask true differences on latent variables such

that differences at the manifest level fail to emerge.

Third, multiitem inventories, such as the MMPI and

MMPI-2, may contain biased items that, when aggre-

gated, do not produce biased scales. This can occur

when a subgroup of items is biased against the ma-

jority group and a different subgroup of items is bi-

ased against the minority group. When such items are

combined, the effects of the biased items can be mini-

mized or eliminated at the scale level (Harrison &

Kass, 1967; Raju, van der Linden, & Fleer, 1995).

A central theme of this article is that group differ-

ences at the item or scale level can arise from mea-

surement bias, actual group differences, or a combi-

nation of these influences. Thus, studies that report

group differences on observed scores cannot unam-

biguously resolve the question of whether those

scores are equally precise, or equally valid, for dif-

ferent groups. Although the inability of group com-

parisons to resolve issues of measurement bias has

been recognized by the psychometrics community for

some time (Holland & Thayer, 1988; Jensen, 1980),

this uninformative method continues to be used in

many assessment literatures. In the MMPI literature,

for instance, group comparisons of scale means are

sometimes called the difference of means test (Pritch-

ard & Rosenblatt, 1980a, p. 263; see also Greene,

1991, chap. 8; Whitworth & McBlaine, 1993).

In the next two sections, we review the fundamen-

tals of the 2-PLM IRT model and several methods that

are based on this model for identifying biased items

and scales. These methods for assessing differential

functioning of items and tests do not suffer from the

flaws of the so-called difference of means test de-

scribed above. We focus on IRT methods because of

our strong conviction, and the consensus of the psy-

chometrics community (Camilli & Shepard, 1994;

Holland & Wainer, 1993; Millsap & Everson, 1993;

Thissen, Steinberg, & Gerrard, 1986), that IRT mod-

els provide the most powerful methods for detecting

differential functioning of items and scales in group

comparisons research (e.g., racial bias, cross-cultural

comparison, and questionnaire translation research;

see Ellis, Becker, & Kimmel, 1993; Ellis, Minsel, and

Becker, 1989; Huang & Church, 1997; Hulin &

Mayer, 1986). Readers who are familiar with the

2-PLM for dichotomous items (Hambleton et al.,

1991; Reise & Waller, 1990) may wish to proceed to

the third section.

A Brief Overview of the Two-Parameter Logistic IRT Model

The rubric IRT covers an extended family of psy-

chometric models (van der Linden & Hambleton,

1997), and thus we make no attempt to describe these

models in detail. Rather, we briefly describe the fun-

damentals of the 2-PLM because it is the most appro-

priate IRT model for modeling the (MMPI) data in

our didactic example.

An important component of the 2-PLM is called the

item response function (IRF). The IRF is the nonlin-

ear regression line that characterizes the probability of

a [0/1] keyed response as a function of an underlying

trait value. These nonlinear response functions are

also called item characteristic curves or item trace

lines. An example IRF for the 2-PLM is illustrated in

Figure 1A. Notice that in this figure trait levels are

represented on the x-axis, and the item response prob-

ability is represented on the y-axis. Trait levels are

customarily scaled to a z-score metric such that the

population of trait values has a mean of 0.00 and a

standard deviation of 1.00, although other scalings are


Figure 1. Item response theory-based graphical methods for assessing differential item functioning (DIF) and differential test functioning. A: Item response function. B: Uniform DIF. C: Nonuniform DIF. D: Test characteristic curve. In each panel, theta (trait level) is plotted on the horizontal axis.

possible and sometimes used in IRT applications. No-

tice also that the trait level is called theta in this plot.

In this article, we often use the Greek letter θ (theta)

to denote latent trait values.

The 2-PLM derives its name because the nonlinear

item-trait regression function, like the IRF in Figure

1, is defined by a two-parameter logistic function.

This function can be mathematically defined as

$$P_j(\theta_i) = \frac{1}{1 + e^{-1.7\,\alpha_j(\theta_i - \beta_j)}} \qquad (1)$$

where P_j(θ_i) denotes the probability that an individual

with θ level i will endorse item j in the keyed direc-

tion. The e in Equation 1 denotes the transcendental

number that is approximated by 2.71828. The 1.7 in

this equation is a scaling factor that makes the logistic

IRF similar to the IRF in a two-parameter Normal

Ogive item response model (Lord, 1980). The two

parameters of the 2-PLM are the item slope (α) and the item threshold (β) parameters. An important characteristic of this model is that item thresholds (β) and participant trait levels (θ) are scaled on a common metric. For a particular item, the value of the item threshold equals the θ level that corresponds to a .50 probability of endorsing the item in the keyed direction. Thus, relatively difficult items—that is, items with low endorsement frequencies—will have high threshold values (i.e., large β parameters), and they will be located at the high end of the θ continuum. Items with low thresholds will be located at the low end of the θ continuum. The value of the item slope (α) is a function of the steepness of the IRF at that point on the trait continuum where θ_i = β_j. This parameter is related to the item factor loading such that items with steep slopes have large factor loadings (Takane & de Leeuw, 1987). In other words, items


with large slope parameters (α) have comparatively

high saturations of trait relevant variance. Although

we have left out many important details about IRFs,

such as how the IRF is estimated (see Baker, 1992, for

a review), we have presented the requisite information

for understanding how IRT uses the IRF to distinguish

item and scale bias from true group differences on

latent variables.
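To make this concrete, here is a minimal sketch (ours, not part of the original article) of the 2-PLM IRF in Equation 1; the item parameters are hypothetical.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF (Equation 1), with the 1.7 scaling factor.

    theta : trait level(s) on the z-score metric
    a     : item slope parameter (alpha)
    b     : item threshold parameter (beta)
    """
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

# A hypothetical item with a moderate slope and a relatively high threshold:
theta = np.linspace(-4, 4, 9)
print(np.round(irf_2pl(theta, a=0.9, b=1.5), 3))
```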

Using Item Response Theory to Separate Measurement Bias From True Group

Differences on Estimated Latent Variables

At the item level, measurement equivalence is ob-

tained whenever the IRFs for two groups do not differ

(Reise et al., 1993). In effect, this implies that the

probability of endorsing an item in the keyed direction

is the same for two individuals with equal trait values

(i.e., individuals who are perfectly matched on the

latent trait) regardless of group membership. Notice

that in this definition of measurement equivalence, we

are not assuming that individuals from different

groups will have identical endorsement probabilities.

On the contrary, measurement equivalence requires

that we observe equal endorsement probabilities for

individuals with equal trait values.

Consider the two IRFs in Figure IB. For purposes

of illustration, we can imagine that these IRFs repre-

sent the item-trait regression functions for Blacks and

Whites on an MMPI item. Let the solid line denote the

IRF for Whites and the dashed line the IRF for

Blacks. Notice that the probability of endorsing this

item in the keyed direction is higher for Whites than

it is for Blacks at virtually all θ levels. When this

occurs we say that the item shows evidence of uni-

form differential item functioning1 (DIF; Camilli &

Shepard, 1994; Holland & Wainer, 1993). Impor-

tantly, the amount of DIF is not constant across trait

levels. Specifically, at very low (-4 to -2) and very

high (+2 to +4) trait levels the IRFs are not dramati-

cally different, though at more moderate trait levels

the endorsement probabilities differ by as much as .80

on the probability scale.

Figure 1C shows an example of nonuniform DIF.

This plot illustrates the dangers that can arise when-

ever groups are compared on items that have group-

specific IRFs. Notice, for example, that in groups composed of individuals with low trait levels, Whites

endorse the item more frequently than Blacks. In

groups composed of high trait level individuals, the

opposite pattern occurs. Namely, in high-scoring

groups, Blacks endorse this item more frequently than

do Whites. The possibility of nonuniform DIF should

clearly make one pause before trying to draw conclu-

sions from the literature on group differences in item

endorsement frequencies. For instance, Black-White

differences in MMPI item endorsement frequencies

have been reported in a number of studies (Costello,

1977; Dahlstrom & Gynther, 1986). Greene (1987)

recently noted that "although from 58 to 213 [out of

566] items have been found to differentiate Blacks

from Whites [on the MMPI] in a given study, there

has been limited overlap among these items across

studies" (p. 503). If many MMPI items show evidence

of nonuniform DIF, we would expect such a diversity

of findings in studies with samples that vary in aver-

age trait level.

Item Response Theory Tests of Differential Item Functioning

From an IRT perspective, several methods are

available for detecting the magnitude and significance

of different IRFs (see Holland & Wainer, 1993, for a

review). In this study we calculated five IRT mea-

sures of DIF: (a) Lord's χ² test of DIF (Lord, 1980),

(b) Raju's signed area (SA) measure of DIF (Raju,

1988, 1990), (c) Raju's unsigned area (USA) measure

of DIF (Raju, 1988, 1990), (d) Raju et al.'s measure of

compensatory DIF (CDIF; Raju et al., 1995), and

(e) Raju et al.'s measure of noncompensatory DIF

(NCDIF; Raju et al., 1995). We now briefly describe

these measures.

Lord's χ² measure of DIF (Lord, 1980) is computed as the weighted squared difference between the item parameter estimates from two groups. This index is used to test whether the estimated item slopes ($\hat{\alpha}_{Wj}$, $\hat{\alpha}_{Bj}$) and item thresholds ($\hat{\beta}_{Wj}$, $\hat{\beta}_{Bj}$) differ for two groups, say, for groups of Whites (W) and Blacks (B). Lord's test statistic is computed by weighting the squared discrepancy between the estimated item parameters by the pooled (estimated) parameter information matrix. The information matrix is the inverse of the variance-covariance matrix of the estimated parameters. Specifically, let

$$\mathbf{v} = (\hat{\alpha}_{Wj} - \hat{\alpha}_{Bj},\; \hat{\beta}_{Wj} - \hat{\beta}_{Bj}) \qquad (2)$$

1 Technically, the terms DIF and item bias are not syn-

onymous. DIF is a psychometric property of an item in two

or more groups, whereas item bias is a value judgment

concerning the personal, social, or ethical implications of

DIF. In this article we often use the terms interchangeably.


equal the vector of differences between the estimated parameters and

$$\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_{Wj} + \boldsymbol{\Sigma}_{Bj} \qquad (3)$$

equal the pooled variance-covariance matrix of the parameter estimates. Then Lord's χ² is calculated as

$$\chi_j^2 = \mathbf{v}'\,\boldsymbol{\Sigma}^{-1}\,\mathbf{v}. \qquad (4)$$

In large samples, Lord's χ² has a central chi-square distribution with two degrees of freedom.
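As an illustration of Equations 2-4, the following sketch (ours; the parameter estimates and covariance matrices are hypothetical, not values from the MMPI analyses) computes Lord's χ² for a single item.

```python
import numpy as np
from scipy.stats import chi2

def lords_chi2(est_w, est_b, cov_w, cov_b):
    """Lord's chi-square for a single item (Equations 2-4).

    est_w, est_b : (slope, threshold) estimates for the White and Black groups
    cov_w, cov_b : 2 x 2 covariance matrices of those estimates
    """
    v = np.asarray(est_w, dtype=float) - np.asarray(est_b, dtype=float)      # Equation 2
    sigma = np.asarray(cov_w, dtype=float) + np.asarray(cov_b, dtype=float)  # pooled covariance
    stat = float(v @ np.linalg.inv(sigma) @ v)                               # Equation 4
    return stat, float(chi2.sf(stat, df=2))                                  # 2 degrees of freedom

# Hypothetical linked estimates and covariance matrices for one item:
stat, p = lords_chi2((0.90, 1.10), (0.70, 1.50),
                     np.diag([0.010, 0.020]), np.diag([0.015, 0.030]))
print(round(stat, 2), round(p, 4))
```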

Like many asymptotic test statistics, Lord's χ² suf-

fers from the fact that the statistic is valid only in large

samples, yet in large samples almost any difference

between the estimated item parameters of two groups

will reach statistical significance. Thus, there has been

a trend in recent years to place more emphasis on IRT

measures of DIF that yield measures of effect size as

well as tests of significance. Two such measures that

were calculated in the present study are Raju's SA and

USA indices (Raju, 1988, 1990). Both of these indices

quantify DIF by integrating the area (i.e., summing up

the distances) between two IRFs. Specifically, if P_W and P_B denote the estimated IRFs for two groups, say, Whites and Blacks, then for item j,

$$\mathrm{SA}_j = \int_{-\infty}^{\infty} (P_W - P_B)\, d\theta \qquad (5)$$

and

$$\mathrm{USA}_j = \int_{-\infty}^{\infty} |P_W - P_B|\, d\theta. \qquad (6)$$

In words, Equation 5 says that the SA index is com-

puted by adding up the area between the two IRFs for

all trait levels (θ) from negative infinity to positive

infinity. Note that when the IRFs show evidence of

nonuniform DIF, as in Figure 1C, for some trait levels

the area between the IRFs will be positive and for

other trait levels the area will be negative. Thus, the

total signed area, represented by SA, might be small

(or even 0.00) even when the IRFs differ substan-

tially. The SA can take on positive or negative values.

The USA, on the other hand, equals the sum of the

absolute values of the differences between the IRFs,

and thus the USA can take on only positive values.

The derivations of Equations 5 and 6 are presented in

Raju's (1988) study. Raju (1990) presented formulas

for determining the statistical significance of the es-

timated SA and USA indices.
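A sketch of how the SA and USA integrals can be approximated numerically (our illustration, with hypothetical item parameters; a finite θ grid from -6 to 6 stands in for the infinite limits of integration):

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF (Equation 1)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def area_indices(a_w, b_w, a_b, b_b, lo=-6.0, hi=6.0, n=2001):
    """Approximate Raju's SA (Equation 5) and USA (Equation 6) for one item
    by summing the IRF differences over a fine, finite theta grid."""
    theta = np.linspace(lo, hi, n)
    d_theta = theta[1] - theta[0]
    diff = irf_2pl(theta, a_w, b_w) - irf_2pl(theta, a_b, b_b)
    sa = np.sum(diff) * d_theta            # signed area: opposite-signed gaps can cancel
    usa = np.sum(np.abs(diff)) * d_theta   # unsigned area: always nonnegative
    return sa, usa

# Hypothetical linked item parameters for the two groups:
sa, usa = area_indices(a_w=0.9, b_w=1.1, a_b=0.7, b_b=1.5)
print(round(sa, 3), round(usa, 3))
```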

A Scale Composed of Biased Items Is Not

Necessarily Biased

Returning to our running example, when compar-

ing Black-White differences on the MMPI, some re-

searchers (e.g., Harrison & Kass, 1967) have sug-

gested that the largest differences occur at the item

level rather than the scale level. For example, accord-

ing to Harrison and Kass (1967), the MMPI validity

and clinical scales "are not very sensitive to race dif-

ferences, whereas the items are remarkably sensitive.

A canceling-out process must be at work in each

scale" (p. 462). Although, as noted previously, nu-

merous researchers have reported Black-White dif-

ferences for MMPI scales (Dahlstrom & Gynther,

1986; Gynther, 1972), the notion that item differ-

ences—or more interestingly, item biases—might

cancel out at the scale level is an interesting idea that

warrants further consideration. Fortunately, this topic

has received increased attention in the psychometrics

community in recent years (Nandakumar, 1993; Raju

et al., 1995; Shealy & Stout, 1993); some psychome-

tricians use the terms DIF amplification and cancel-

lation when describing an item's contribution to dif-

ferential test functioning (DTF) or test bias.

To better understand the concept of DTF or test

bias from an IRT perspective requires that we intro-

duce another important idea from IRT: the test char-

acteristic curve (TCC; Hambleton et al., 1991). Sim-

ply put, the (estimated) TCC is the nonlinear

regression of observed scores on the (estimated) IRT

latent trait values. By logical extension of Equation 1,

the predicted true score (T_i) for an individual with estimated latent trait level i is calculated by summing

the predicted item endorsement probabilities across

all items of a scale. This idea can be mathematically

expressed as

$$T_i = \sum_{j=1}^{J} P_j(\hat{\theta}_i) \qquad (7)$$

where T_i is the predicted true score for subject i, J is the number of items on the scale, and the remaining terms are defined as before.
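A minimal sketch of Equation 7 (ours; the five-item scale below is hypothetical): the predicted true score is simply the sum of the 2-PLM endorsement probabilities over a scale's items.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF (Equation 1)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def true_score(theta, slopes, thresholds):
    """Predicted true score (Equation 7): the sum of the item endorsement
    probabilities across all items of a scale at a given trait level."""
    return float(np.sum(irf_2pl(theta, np.asarray(slopes), np.asarray(thresholds))))

# A hypothetical 5-item scale evaluated at theta = 0:
print(round(true_score(0.0, [0.6, 0.8, 1.0, 0.7, 0.9], [-1.0, -0.5, 0.0, 0.5, 1.0]), 3))
```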

As noted above, the TCC is the nonlinear regres-

sion of (predicted) true scores on (estimated) latent

trait levels. An example TCC for a hypothetical 25-

item test is depicted in Figure 1D. Using this concept,

we can say that a test (such as an MMPI factor scale)

provides equal expected scores for individuals with

the same latent trait level regardless of group membership when the TCCs for those groups are identical.


If the TCCs are not identical, then at some point along

the trait continuum the expected observed scores for

the two groups will differ. The TCC is the sum of the

IRFs for a particular scale, a fact that clearly explains

why DIF can be amplified or canceled when summing

over items.

Raju et al. (1995) have introduced a measure of

DTF that is calculated from the TCCs of two groups. In terms of our running example, Raju et al.'s DTF index is calculated as

$$\mathrm{DTF} = \frac{1}{n_B} \sum_{i=1}^{n_B} (T_{Bi} - T_{Wi})^2, \qquad (8)$$

where T_B and T_W are the true scores that are derived

from the test characteristic curves for the Black and

White examinees. Notice that for the Black partici-

pants only (n_B; in Equation 8 we are averaging over

the trait scores of Black examinees), we are asking the

following question: Would the estimated true scores

(i.e., expected observed scores) differ if the items

were scored using the estimated item parameters cali-

brated on the White group versus the estimated item

parameters calibrated on the Black group? If the an-

swer to this question is yes, that is, if the TCCs for the

two groups differ, then Equation 8 will yield a large

positive number and we can confidently conclude that

the scale provides differential measurement for

Blacks. However, if the TCCs are similar, then Blacks

and Whites with similar trait estimates (θ) will have

similar predicted true scores (within the boundaries

imposed by measurement error) and Equation 8 will

yield a small number. The square root of Equation 8,

the root differential test function (rDTF) expresses the

differences between the TCCs in the metric of the

observed scores. Thus, the rDTF serves as a useful

size measure of bias. Raju et al. (1995) provided equa-

tions for determining the statistical significance of the

DTF.
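The following sketch (ours, not the LINKDIF implementation; all parameters and θ values are simulated) evaluates Equation 8 by scoring a sample of focal-group trait estimates through both groups' test characteristic curves.

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF (Equation 1)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, slopes, thresholds):
    """Test characteristic curve: predicted true scores for a vector of thetas."""
    theta = np.asarray(theta, dtype=float)[:, None]
    return irf_2pl(theta, np.asarray(slopes), np.asarray(thresholds)).sum(axis=1)

def dtf(theta_black, params_w, params_b):
    """Raju et al.'s (1995) DTF (Equation 8), averaged over Black examinees' thetas."""
    t_w = tcc(theta_black, *params_w)   # true scores under the White item parameters
    t_b = tcc(theta_black, *params_b)   # true scores under the Black item parameters
    return np.mean((t_b - t_w) ** 2)

# Hypothetical linked parameters for a 3-item scale and simulated focal-group thetas:
rng = np.random.default_rng(0)
theta_b = rng.normal(size=1000)
params_w = ([0.8, 0.9, 1.0], [0.0, 0.5, 1.0])
params_b = ([0.8, 0.7, 1.0], [0.1, 0.9, 1.0])
value = dtf(theta_b, params_w, params_b)
print(round(value, 4), round(float(np.sqrt(value)), 4))   # DTF and its square root (rDTF)
```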

Raju (Raju, van der Linden, & Fleer, 1995) has also introduced an index of so-called compensatory DIF (CDIF). Raju's CDIF index measures an item's additive contribution to a scale's DTF:

$$\mathrm{DTF} = \sum_{j=1}^{J} \mathrm{CDIF}_j.$$

Thus, an item may show substantial DIF in terms of Lord's χ² or Raju's SA and USA measures but show relatively little, if any, CDIF. This would happen if the item DIF was in the opposite direction of the DIF of other items. Raju (Raju, van der Linden, & Fleer, 1995) also introduced a measure of noncompensatory DIF, which is calculated as

$$\mathrm{NCDIF}_j = \int \left[ P_{Bj}(\theta) - P_{Wj}(\theta) \right]^2 f_B(\theta)\, d\theta.$$

In words, NCDIF is simply the average squared dif-

ference between the expected item endorsement prob-

abilities, where the expectations are calculated from

the two sets of item parameters. As before, the aver-

aging is computed over the distribution of estimated

trait levels for Blacks. In other words, for a given

estimated trait level (θ) for an individual from the

Black sample, we (a) calculate the probability that

item j will be endorsed in the keyed direction when

using the estimated item parameters from the White

calibration, (b) calculate the probability that item j

will be endorsed in the keyed direction when using the

estimated item parameters from the Black calibration,

(c) square the difference between these probabilities,

and (d) calculate the weighted average of the squared

differences for all Black participants in our sample

(f_B(θ_i) denotes the relative frequency of θ_i).
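Steps a through d amount to the short computation sketched below (our illustration; the item parameters and focal-group trait estimates are simulated).

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF (Equation 1)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def ncdif(theta_black, a_w, b_w, a_b, b_b):
    """Noncompensatory DIF for one item: the mean squared difference between
    the Black- and White-calibrated endorsement probabilities, averaged over
    the estimated trait levels of the Black examinees."""
    p_w = irf_2pl(theta_black, a_w, b_w)   # step (a): White calibration
    p_b = irf_2pl(theta_black, a_b, b_b)   # step (b): Black calibration
    return float(np.mean((p_b - p_w) ** 2))  # steps (c) and (d): average squared gap

# Hypothetical linked item parameters and simulated focal-group trait estimates:
rng = np.random.default_rng(1)
theta_b = rng.normal(size=1000)
print(round(ncdif(theta_b, a_w=0.9, b_w=1.1, a_b=0.7, b_b=1.5), 4))
```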

Detecting Differential Item and Test

Functioning: An Empirical Example With

the MMPI

Method

Participants. Our total sample included MMPI

item response data from 1,277 Whites and 511

Blacks. At the time of testing, all participants were

young male offenders committed to the California

Youth Authority (CYA) between January 1964 and

December 1965. These 1,788 individuals are a subset

of the 4,164 consecutive CYA intakes from the Re-

ception Guidance Center at the Deuel Vocational In-

stitution in Tracy, California. These data were origi-

nally collected as part of a larger study designed to

investigate the criminal career paths of youth offend-

ers (Wenk, 1990). Only MMPI protocols that satisfied

purposely conservative selection criteria (described

below) were included in the sample. When the data

were collected, the average age of the White male

offenders was 19.01 years (Mdn = 19, SD = 0.98,

range = 17-23), and the average age of the Black

male offenders was 18.97 years (Mdn = 19, SD =

0.94, range = 16-24). Participant race was coded

from official CYA documents (probation records, ar-

rest records, assessment records, etc.). The youth of-


fenders in this sample had committed a variety of

crimes including murder, auto theft, rape, robbery,

burglary, possession of drugs, assault with a deadly

weapon, arson, and kidnapping.

As part of the normal CYA intake, all youth of-

fenders are administered an extensive test battery.

Thus, we had access to an unusually rich collection of

data. For instance, our data set contained numerous IQ

and achievement measures for each participant. Al-

though these other tests are not the focus of this study,

a few summary findings from these data deserve men-

tion. In particular, as noted previously, several re-

searchers (reviewed in Greene, 1987) have claimed

that observed group differences on the MMPI and

MMPI-2 are minimized or eradicated when the groups

are matched on IQ or other moderator variables. An

examination of the IQ and achievement data for our

participants revealed group differences in the range

found in many other studies (Jensen, 1998). For ex-

ample, on the G Factor of the General Aptitude Test

Battery (Science Research Associates, 1947), Whites

achieved an average score of 99.25 (SD = 16.54),

and Blacks achieved an average score of 84.12 (SD =

13.14). On Raven's Progressive Matrices (Raven,

1960), Whites achieved an average score of 45.92 (SD

= 7.18), and Blacks achieved an average score of

41.80 (SD = 8.50). These group differences are in

line with those reported for other samples during the

mid 1960s. In this study, no attempt was made to

match the two groups on the IQ or achievement data.

Selection of MMPI protocols. When conducting

research involving group comparisons, it is particu-

larly important to exclude potentially invalid proto-

cols from the analyses. Unfortunately, not all studies

in the MMPI literature have taken this precaution.

Greene (1987) noted, for instance, that almost one

third of the MMPI racial bias studies included in his

review made no mention of how invalid protocols

were identified—if indeed they were. For the present

study, we decided to use purposely conservative se-

lection criteria that would err on the side of excluding

possible valid protocols rather than including possible

invalid protocols. After reviewing the literature on

MMPI profile validity (Graham, 1993, chap. 3;

Greene, 1991, chap. 3), we settled on the following

criteria. Protocols were selected if (a) the number of

"Cannot say" (omitted) responses was <30, (b) the

Gough F-K index was £ 11, (c) Greene's (1978) Care-

lessness score was -£5, (d) the Lie (L) scale score was

£7, and (e) the raw F scale score was <15. Several

studies have found that Blacks (Gynther et al., 1978),

delinquent adolescents (McKegney, 1965), and young

adults in general (Archer, 1984, 1987) endorse items

on the F scale at higher rates than do individuals from

the Minnesota normative sample, and thus our selec-

tion criteria may have resulted in the exclusion of

several valid MMPI protocols. Fortunately, our rela-

tively large samples allowed us to use stringent selec-

tion criteria for deeming a protocol valid.

The application of the aforementioned selection cri-

teria to the total sample of Blacks and Whites (N =

2,284) resulted in the exclusion of 25% of the avail-

able protocols from Blacks and 20% of the available

protocols from Whites. Because slightly more Blacks

were deleted from the final sample, we wondered

whether the two samples of excluded protocols dif-

fered in important ways. Specifically, we wondered

whether the samples differed in their mean endorse-

ment rates on the F (Infrequency) scale. Among the

scales included in our analyses, the F scale holds a

unique position because (a) previous researchers have

reported Black-White differences of F (e.g., Gynther,

1972), (b) scores on F are positively correlated with

clinical scales (e.g., Sc) on which group differences

have been found, and (c) F was used to select valid

MMPI protocols. We did not wish the selection cri-

teria to minimize group differences on F, and our

analyses revealed that they did not. The median F

score for the 174 excluded Blacks was 16.00 (M =

15.52, SD = 8.89), and the median F score for the

322 excluded Whites was also 16.00 (M = 14.29, SD

= 9.28). A two-tailed Wilcoxon rank sum test re-

vealed that the two groups did not differ significantly

on F (z = -1.37, p = .17).

Results

Our discussion of Black-White differences on the

MMPI is divided into three parts. First, to characterize

the personality profiles of our samples, we compare

the performance of Whites and Blacks on the MMPI

validity and clinical scales. We then tackle the ques-

tion of MMPI measurement equivalence, at both the

item and scale levels, by conducting IRT analyses of

12 MMPI factor scales (Waller, 1999) in the two sub-

groups. Using the item parameter estimates and esti-

mated latent trait values from these analyses, we then

examine differential functioning of items and tests on

the (unidimensional) factor scales and the (multidi-

mensional) MMPI validity and clinical scales.

Table 1 reports raw score summary statistics for 13

commonly scored validity and clinical scales of the


Table 1
MMPI Scale Scores and Effect Sizes for Whites and Blacks

             Total      Whites (n = 1,277)            Blacks (n = 511)
MMPI scale   items      M       SD      Range         M       SD      Range      Effect size^a
L              15        3.85    1.60    0-7           3.80    1.62    0-7          .03
F              64        6.52    3.21    0-15          7.06    3.18    0-15        -.17
K              30       13.27    4.22    1-26         12.92    3.54    3-26         .08
1 Hs           33        4.44    3.10    0-23          4.93    2.80    0-21        -.16
2 D            60       21.04    4.12    6-39         20.49    3.56   10-41         .13
3 Hy           60       19.37    3.87    4-36         18.61    3.55                 .20
4 Pd           50       24.63    3.85    7-39         23.71    3.36    9-38         .24
5 Mf           60       23.12    4.12    9-41         23.44    3.85   10-38        -.08
6 Pa           40       10.63    2.97    3-24         10.31    3.04    2-21         .11
7 Pt           48       15.39    7.29                 15.53    6.14    2-39        -.02
8 Sc           78       15.79    7.47    1-50         17.37    6.72    2-39        -.21
9 Ma           46       19.58    3.97    6-33         21.83    3.55    8-34        -.57
0 Si           70       27.47    8.47    6-57         26.14    6.58   11-50         .16

Note. Minnesota Multiphasic Personality Inventory (MMPI) scale names: L = Lie; F = Infrequency; K = Defensiveness; Hs = Hypochondriasis; D = Depression; Hy = Hysteria; Pd = Psychopathic Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt = Psychasthenia; Sc = Schizophrenia; Ma = Hypomania; Si = Social Introversion.
^a Effect size is calculated as $(\bar{w}_{.025} - \bar{b}_{.025})/\sigma_{w.025}$, where $\bar{w}_{.025}$, $\bar{b}_{.025}$, and $\sigma_{w.025}$ equal the 2.5% trimmed means and standard deviation of the White (w) and Black (b) groups.

MMPI. When computing the score means and stan-

dard deviations, we trimmed 2.5% off the lower and

upper score distributions to minimize the effects of

outliers on the obtained results (Wilcox, 1998). As

evidenced by these findings, the average profiles for

the Blacks and Whites are very similar. A quantitative

measure of this similarity is provided by the effect

size measures (calculated as the difference between

the White and Black trimmed means, divided by the

White trimmed standard deviation) that are reported

in the final column of Table 1. In our opinion, these

effect sizes are more informative than the results of

simple t tests for each scale. Nevertheless, considering

the large samples in this study—and hence the large

statistical power—it is noteworthy that 5 of the 13

scale comparisons did not reach statistical significance

at the .05 alpha level (L, K, Mf, Pa, and Pt; note that

trimmed means and standard deviations were also

used when calculating the t tests for each scale). As

quantified by the effect sizes, most of the differences

are relatively small. Only Scales 4 (Psychopathic De-

viate; Pd) and 9 (Hypomania; Ma) have moderate

effect sizes according to Cohen's (1988) widely

adopted criteria.
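For readers who wish to compute the same kind of index on other data, a small sketch (ours; the simulated scores are purely illustrative, and the exact trimming conventions may differ slightly from those used here) of the trimmed-mean effect size defined in the note to Table 1:

```python
import numpy as np

def trimmed(x, prop=0.025):
    """Sort the scores and drop the lowest and highest `prop` proportion."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(prop * x.size))
    return x[k:x.size - k] if k > 0 else x

def trimmed_effect_size(white_scores, black_scores, prop=0.025):
    """(White trimmed mean - Black trimmed mean) / White trimmed SD,
    the effect size defined in the note to Table 1."""
    w = trimmed(white_scores, prop)
    b = trimmed(black_scores, prop)
    return (w.mean() - b.mean()) / w.std(ddof=1)

# Toy example with simulated raw scale scores for the two groups:
rng = np.random.default_rng(2)
print(round(trimmed_effect_size(rng.normal(20.0, 4.0, 1277),
                                rng.normal(22.0, 3.6, 511)), 2))
```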

Perhaps an easier way to grasp the similarity of

these profiles is to look at the plotted scores in Figure

2. Note that the profiles in Figure 2 portray the aver-

age T scores of the Whites and Blacks. Our results

would have differed slightly if we had converted the

profiles of average raw scores (that are reported in

Table 1) into T scores because the MMPI does not use

linear T scores (a linear T score equals 10z + 50).

Because our samples include adolescents and young

adults, we have plotted non-K-corrected T scores,

consistent with standard practice for these age groups

(Archer, 1984, 1987). An inspection of these plots

bolsters our initial impression that the average profiles

for Blacks and Whites are remarkably similar. The

small differences that exist are not of sufficient mag-

nitude to warrant different interpretations of the av-

erage profiles. Both profiles show the characteristic

4-9 code type (i.e., highest elevations on scales Pd

[Psychopathic Deviate] and Ma [Hypomania]) that is

so often seen in delinquent and offender populations

(Graham, 1993).

Although the findings in Table 1 and Figure 2 fail

to show large Black-White differences on the stan-

dard MMPI validity and clinical scales, these results

should not be interpreted as indicating measurement

equivalence for the two groups. For reasons already

stated, observed group differences are irrelevant to

questions of measurement bias unless the groups are

perfectly matched on the latent variables that are mea-

sured by the scales. In the present case, the absence of

group differences may stem from a form of bias that

masks true differences on the latent variables. To rule


Figure 2. Average Minnesota Multiphasic Personality Inventory (MMPI) profiles of White and Black male youth offenders. The x-axis shows the MMPI validity and clinical scales. MMPI scale names: L = Lie; F = Infrequency; K = Defensiveness; Hs = Hypochondriasis; D = Depression; Hy = Hysteria; Pd = Psychopathic Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt = Psychasthenia; Sc = Schizophrenia; Ma = Hypomania; Si = Social Introversion.

out this possibility, it is necessary to focus our analy-

ses at the latent variable level.

Item response theory analyses of differential item

and test functioning on MMPI factor scales. To rig-

orously test hypotheses of item and scale bias from a

model-based perspective (Embretson, 1996), we per-

formed IRT analyses on 12 unidimensional factor

scales that can be scored on the MMPI. These scales

are a subset of the 16 factor scales that are described

in Waller (1999). Each MMPI factor scale was de-

signed to measure a single latent trait. Although no

test is strictly unidimensional, the MMPI factor scales

are dominated by large first dimensions, and thus they

can be considered unidimensional for practical pur-

poses. Four MMPI scales were either too short or

otherwise unsuitable for an IRT study and thus are not

discussed in this article (e.g., our IRT analyses sug-

gested that the IRFs were not monotonically increas-

ing for the Stereotypic Masculine Interests factor

scale). The 12 factor scales (with sample items and

keyed responses) that were analyzed are called (a)

General Maladjustment (Gm; "Life is a strain for me

much of the time"; true), (b) Psychotic Ideation (Ps;

"I commonly hear voices without knowing where they

come from"; true), (c) Antisocial Tendencies (At;

"During one period when I was a youngster I engaged

in petty thievery"; true), (d) Stereotypic Feminine In-

terests (Fe; "I enjoy reading love stories"; true), (e)

Extroversion (Ex; "I enjoy the excitement of a

crowd"; true), (f) Family Attachment (Fm; "There is

very little love and companionship in my family as

compared to other homes"; false), (g) Christian Fun-

damentalism (Cf; "I believe in the Second Coming of

Christ"; true), (h) Phobias and Fears (Ph; "I am afraid

of fire"; true), (i) Social Inhibition (So; "I wish I were

not so shy"; true), (j) Cynicism (Cy; "Most people

make friends because friends are likely to be useful to

them"; true), (k) Assertiveness (As; "I like to let

people know where I stand on things"; true), and (1)

Somatic Complaints (Sm; "Much of the time my head

seems to hurt all over"; true). A fuller description of

these scales, such as their item composition, average


Table 2
Item Parameters and Differential Item Functioning (DIF) for MMPI Phobias and Fears Scale

MMPI                                                   Whites         Blacks
no.   Abbreviated content                              α      β       α      β       χ²      p      SA     z(SA)   USA    z(USA)   CDIF     NCDIF^a
128   Blood does not frighten me. (F)                  0.63   1.81    0.62   2.22     8.97    .01    0.41    2.07   0.41    2.07    0.007    0.005
131   Do not worry about catching diseases. (F)        0.43   0.91    0.41   0.70     1.81    .40   -0.21   -1.22   0.22    1.26    0.000    0.001
166   Afraid when looking down from heights. (T)       0.86   1.06    0.70   1.51    14.15   <.01    0.45    3.76   0.46    3.53    0.010    0.011
176   No fear of snakes. (F)                           0.75   0.86    0.72   0.69     3.01    .22   -0.17   -1.64   0.17    1.70    0.000    0.002
270   Do not worry whether the door is locked
      and windows closed. (F)                          0.23   0.40    0.33  -1.20    45.02   <.01   -1.60   -4.55   1.78    5.37   -0.010    0.039
367   Afraid of fire. (T)                              0.83   1.19    0.88   1.38     5.50    .06    0.19    1.85   0.19    1.97    0.002    0.003
392   Afraid of windstorms. (T)                        0.67   2.91    0.75   2.44     3.74    .15   -0.47   -1.68   0.48    1.65   -0.008    0.003
401   No fear of water. (F)                            0.78   1.61    0.70   1.89     3.74    .15    0.28    1.86   0.29    1.75    0.006    0.003
480   Afraid of the dark. (T)                          1.29   2.22    0.60   3.73    18.91   <.01    1.51    3.74   1.57    3.67    0.021    0.024
492   Afraid of earthquakes. (T)                       0.44   0.66    0.52   0.44     3.40    .18   -0.21   -1.44   0.33    1.60   -0.004    0.002
522   No fear of spiders. (F)                          0.87   0.31    1.01   0.65    19.05   <.01    0.35    4.22   0.35    4.39   -0.001    0.009

Note. z equals the parameter estimate divided by the estimate's standard error. Direction of keying follows abbreviated content: T = true; F = false. MMPI = Minnesota Multiphasic Personality Inventory; SA = signed area index; USA = unsigned area index; CDIF = compensatory DIF; NCDIF = noncompensatory DIF.
^a All p values for NCDIF are less than .01.

reliabilities in diverse samples, and correlations with

MMPI clinical and validity scales, is reported in

Waller (1999).

As a first step in the IRT analyses, marginal maxi-

mum-likelihood IRT item parameters were estimated

for the items of the 12 factor scales. These analyses

were conducted separately for Whites and Blacks and

for each factor scale using BILOG 3.10 (Mislevy &

Bock, 1990). BILOG 3.10 is a Windows-based pro-

gram for estimating the parameters of the 1 -, 2-, or

3-parameter logistic (unidimensional) IRT models by

marginal maximum likelihood. All BILOG program

defaults were used in these analyses.2 After calibrat-

ing the items, we compared the estimated and the

empirical 2-PLM IRFs for each item in the two

samples. The estimated 2-PLM IRFs are calculated by

substituting the group-specific item parameter esti-

mates in Equation 1. The empirical IRFs are calcu-

lated by grouping the maximum likelihood θ estimates from the IRT analyses into nine nonoverlapping intervals and then determining the average item endorsement frequency for each θ interval.
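A sketch of that empirical-IRF check (ours; the θ estimates and item responses are simulated, and equal-count intervals are one reasonable way to form the nine groups):

```python
import numpy as np

def empirical_irf(theta_hat, responses, n_bins=9):
    """Group ML trait estimates into nonoverlapping (equal-count) intervals and
    return the mean theta and observed endorsement proportion in each interval."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    responses = np.asarray(responses, dtype=float)
    order = np.argsort(theta_hat)
    bins = np.array_split(order, n_bins)
    midpoints = np.array([theta_hat[idx].mean() for idx in bins])
    proportions = np.array([responses[idx].mean() for idx in bins])
    return midpoints, proportions

# Simulated example: 0/1 responses generated from a 2-PLM item (a = 0.9, b = 1.0):
rng = np.random.default_rng(3)
theta = rng.normal(size=1500)
p = 1.0 / (1.0 + np.exp(-1.7 * 0.9 * (theta - 1.0)))
y = rng.binomial(1, p)
mids, props = empirical_irf(theta, y)
print(np.round(mids, 2))
print(np.round(props, 2))
```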

When we compared the estimated and empirical

2-PLM IRFs, we found that virtually all of the items

on the 12 factor scales could be successfully cali-

brated with the 2-PLM. Specifically, the vast majority

of points of the empirical IRFs fell within the 95%

tolerance intervals of the estimated 2-PLM IRFs. We

should note that these comparisons were conducted

after linking the two sets of item parameters to a com-

mon metric. To accomplish the item linking, we used

the linking procedure of Stocking and Lord (1983) as

implemented in the software routine LINKDIF

(Waller, 1998).
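The Stocking-Lord procedure chooses the linear transformation of the focal-group metric, θ* = Aθ + B (with slopes rescaled to α/A and thresholds mapped to Aβ + B), that minimizes the squared distance between the two groups' test characteristic curves. A rough sketch of that criterion follows (ours, not the LINKDIF code; a coarse grid search stands in for the usual numerical optimizer, and all item parameters are hypothetical).

```python
import numpy as np

def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF (Equation 1)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def tcc(theta, a, b):
    """Test characteristic curve evaluated at a vector of theta values."""
    return irf_2pl(theta[:, None], a, b).sum(axis=1)

def stocking_lord(a_ref, b_ref, a_foc, b_foc, theta=np.linspace(-4, 4, 81)):
    """Grid search for the (A, B) that minimizes the squared TCC difference
    after rescaling the focal-group parameters onto the reference metric."""
    target = tcc(theta, a_ref, b_ref)
    best = (1.0, 0.0, np.inf)
    for A in np.linspace(0.5, 1.5, 101):
        for B in np.linspace(-1.0, 1.0, 101):
            loss = np.mean((target - tcc(theta, a_foc / A, A * b_foc + B)) ** 2)
            if loss < best[2]:
                best = (A, B, loss)
    return best[:2]

# Hypothetical reference parameters; the focal calibration sits on a metric
# that differs from the reference metric by A = 1.1 and B = 0.2:
a_ref = np.array([0.8, 1.0, 1.2])
b_ref = np.array([-0.5, 0.0, 0.8])
a_foc = 1.1 * a_ref
b_foc = (b_ref - 0.2) / 1.1
print(stocking_lord(a_ref, b_ref, a_foc, b_foc))   # recovers approximately (1.1, 0.2)
```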

Having estimated the item parameters in the two

groups, we were finally in a position to look for DIF

and DTF in the 12 factor scales. LINKDIF (Waller, 1998) was also used to calculate the five DIF and DTF

measures introduced in the previous sections. Table 2

reports our findings for the 11 items of the Phobias

and Fears (Ph) factor scale. A graphical display of

2 BILOG uses a normal prior for the latent trait distribution in the marginal maximum-likelihood estimation of item parameters. For some scales, such as the MMPI Ps scale, a normal prior for θ may be unreasonable. We investigated the influence of the prior distribution of θ on the final item parameter estimates by also analyzing the data using empirically generated prior distributions (starting from either normal or uniform distributions). These empirical priors were estimated during the item parameter estimation phase (using the BILOG FREE command on the CALIB line). Our results suggested that the form of the prior had little effect on the final item parameter estimates (though it did have a noticeable effect on the estimated distribution of θ). Thus, without further information, we believe that a normal prior can be justified in these moderately sized samples.


these findings is also provided by the 2-PLM IRFs in

Figure 3. Several aspects of these results and plots

warrant discussion. First, notice that a number of

items from the Phobias and Fears factor scale show

significant DIF as measured by Lord's χ² and Raju's

signed (SA) and unsigned (USA) measures. (We have

calculated agreement indices [kappas] for all pairs of

DIF indices for the 383 items that are included on the

12 MMPI factor scales. A summary of these results

can be obtained from Niels G. Waller.) On the basis of

Lord's χ², five items show significant item parameter

differences at the .01 significance level: Items 128,

166, 270, 480, and 522. The other DIF indices reported in Table 2 also identify these items as showing DIF. Notice, however, that these differences are not always in the same direction. For instance, at a θ level

of 2.00, Blacks are more likely than Whites to endorse

Item 270, "When I leave home I do not worry about

whether the door is locked and the windows closed,"

in the keyed direction, false. It is not difficult to imag-

ine why Blacks and Whites have different endorse-

ment probabilities on this item when the groups are

matched on the latent Phobias and Fears construct.

Many of the Black youth offenders in our sample

lived in crime-ridden neighborhoods and housing pro-

jects where unlocked doors and windows would be

invitations for robbery. Thus, although Item 270 is a

valid measure of a more general Phobias and Fears

construct, it also taps a specific fear (i.e., the fear of

being a crime victim) that may be a realistic concern

in some environments. For Item 480, on the other

hand, the situation is quite different. At a θ level of

2.00, Whites are more likely than Blacks to answer

"True" to the statement "I am often afraid of the

dark." We do not know why Blacks and Whites re-

spond differently to this item. We do know that our

IRT analyses have elucidated many interesting item

differences that provide hypotheses for further study.

As interesting as these item differences are, we re-

mind the readers that differential item functioning

does not imply differential test functioning. In other

words, although many items on a scale may show

evidence of DIF in two groups, the scale may none-

theless provide valid measurement for both groups.

For instance, although several items in the Phobias

and Fears factor scale show evidence of DIF (see

Table 2 or Figure 3), the scale produces unbiased

scores for Blacks and Whites. This statement is sup-

ported by the fact that Raju's DTF index for the Pho-

bias and Fears factor scale was only 0.02 (p = .05) in

our samples. The square root of the DTF, which

equals 0.15 for this comparison, suggests that for a

given trait level estimate, Blacks and Whites will dif-

fer, on average, by 0.15 of a single point on the ob-

served scores.
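As a rough sketch of the quantity being reported here (the exact estimator and its significance test are given by Raju et al., 1995, and are implemented in LINKDIF), the DTF can be approximated by averaging the squared difference between the focal- and reference-group true scores over the focal group's estimated trait levels; the rDTF is its square root, expressed in the raw score metric. The function names below (tcc, dtf.index) and the D = 1.7 scaling constant are illustrative assumptions.

tcc <- function(theta, a, b, D = 1.7) {
  # true score (expected raw score): sum of 2-PLM keyed-response
  # probabilities across the items of a scale, at each value of theta
  sapply(theta, function(t) sum(1 / (1 + exp(-D * a * (t - b)))))
}
dtf.index <- function(theta.focal, a.f, b.f, a.r, b.r) {
  # average squared true-score difference over the focal group's
  # estimated trait levels; rDTF is reported in raw score points
  d <- tcc(theta.focal, a.f, b.f) - tcc(theta.focal, a.r, b.r)
  c(DTF = mean(d^2), rDTF = sqrt(mean(d^2)))
}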

Our analyses of the 12 MMPI factor scales revealed

numerous instances of DIF. On average, 38% of the

items on each scale produced significant values of

Lord's χ² at the .01 significance level. Of course, with

large sample sizes, and hence large statistical power,

one expects to find numerous significant and uninter-

esting differences when groups are compared on any

set of psychological variables (Meehl, 1967, 1978).

These findings were neither surprising—in that we

would expect similar results on any set of broadband

factor scales—nor troublesome. Our results would

give us cause for worry if the item differences pro-

duced biased scales. However, the plots in Figure 4

(which also report the rDTF index) demonstrate that

the 12 MMPI factor scales are not biased against

Blacks or Whites.

Figure 4 displays the test characteristic curves

(TCCs) for the 12 MMPI factor scales. Each plot con-

tains two TCCs, one produced from the estimated

item parameters of the Black participants and one

produced from the estimated item parameters of the

White participants. Notice that in virtually all cases,

the TCCs are similar and that in many cases they are

not visually distinguishable. Only the Assertiveness

(As) and Extroversion (Ex) factor scales show any

appreciable evidence of biased measurement, and in

both cases the amount of score distortion that is pro-

duced by differential item functioning would not yield

different clinical or personological interpretations at

any score level on these scales. For instance, the

rDTF divided by the total number of scale items is

only 0.04 and 0.05 for Assertiveness and Extrover-

sion, respectively.

The TCC plots in Figure 4 reassure us that the 12

MMPI factor scales can be used to make meaningful

group comparisons for Blacks and Whites. In other

words, in samples where Blacks (Whites) score rela-

tively higher (lower) than Whites (Blacks) on the ob-

served scores from these scales, we can be confident

that Blacks (Whites) also score higher (lower) than

Whites (Blacks) on the latent variables that are mea-

sured by these scales.
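Plots such as those in Figure 4 are straightforward to construct once group-specific item parameters are in hand. A minimal sketch follows, reusing the tcc helper defined above and assuming placeholder parameter vectors a.black, b.black, a.white, and b.white for one factor scale.

theta.grid <- seq(-3, 3, length = 61)
plot(theta.grid, tcc(theta.grid, a.white, b.white), type = "l",
     xlab = "Theta", ylab = "Predicted total score")
lines(theta.grid, tcc(theta.grid, a.black, b.black), lty = 2)
# visually coincident curves indicate comparable test functioning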

The previous analyses provided strong evidence for

Black-White measurement equivalence on our 12

MMPI factor scales. We noted that even if the TCCs

showed that these scales could not be used to make

valid comparisons at the observed score level, we


[Figure 3. Two-parameter logistic model item response functions (probability of a keyed response as a function of θ) for items of the Phobias and Fears factor scale, estimated separately for Black and White participants.]


" 8U.M

8

°8U.C/3

OZ SI. 01 S 0

3JOOS letOl PSP!p3Jd

S3 OZ SI 01 9 0

3JODS IB101

s?

01 S 9 fr Z 0

3JOOS IBJOl papjpaJd

ii 8LLW

01 9 9 f Z 0

SJODS ROl pappaJd

o

s

IILLW

I '-

I -S

•S^M IS '

S S xf = h-

OS Ot- OE OZ 01 0

3JODS |E}oi

9 9 » Z

|-|£o y E

K !•

u cC

o

fi°

LJ-OT

oe oz 01

11 8U-OT

0|

tl II 01 3 9 f Z 0

Q ^S

dlOE OZ 01 0

aioos |B|oi pspipay

SS

tl

-a -O -S

'•S K

CMoq

CM

II

l£l^a

on osi 001 oe 09 OP oz o si 01

3JO3:

B

I

SI 01

ajoos leioi pspipsjd

S5 g «

II:-i"a s si?^(2


could still validly compare our groups at the estimated

latent trait level. In other words, because the maxi-

mum-likelihood trait estimates (from the IRT analy-

ses) are calculated from the group-specific (estimated)

item parameters, we can always compare groups on

the estimated θ levels after the item and estimated

trait levels have been linked to a common metric (this

statement is just another way of saying that partial

measurement invariance is often a sufficient condition

for making valid group comparisons; see Byrne,

Shavelson, & Muthen, 1989; Reise et al., 1993).
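To make the idea of a common metric concrete, consider the simple mean/sigma approach, in which a linear transformation estimated from the thresholds of common items places the focal group's parameters and trait estimates on the reference group's metric. The sketch below is only an illustration of that simpler method, with hypothetical argument names; it is not the linking procedure implemented in LINKDIF.

link.mean.sigma <- function(a.f, b.f, theta.f, b.r) {
  # place focal-group parameters and trait estimates on the
  # reference-group metric using the thresholds of common items
  A <- sqrt(var(b.r)) / sqrt(var(b.f))
  B <- mean(b.r) - A * mean(b.f)
  list(a = a.f / A,              # rescaled slopes
       b = A * b.f + B,          # rescaled thresholds
       theta = A * theta.f + B)  # rescaled trait estimates
}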

Are the MMPI and MMPI-2 validity and clinical

scales biased against Blacks? It is well known that

the MMPI clinical scales were primarily constructed

by the method of contrasted groups (Greene, 1991,

chap. 1). Scales that are developed by this method are

notorious for being highly multidimensional, and the

MMPI clinical scales are no exception. Consequently,

the internal structure of the clinical scales and the

MMPI factor scales are notably different. The clinical

scales contain items of diverse content—as measured,

for example, by the Harris and Lingoes (1955) sub-

scales—whereas the factor scales are relatively con-

tent pure. These differences raise an intriguing ques-

tion. Namely, when individuals complete an MMPI

protocol, are their item responses determined by the

constructs underlying the multidimensional clinical

scales or the unidimensional factor scales? Surpris-

ingly, no one has addressed this question, although the

answer to this query has important implications for

much applied and theoretical MMPI work. For in-

stance, as we show below, our answer will help us

determine whether the MMPI validity and clinical

scales are biased against Blacks or Whites.

We tackled the aforementioned question by posing

the following hypothesis: If the unidimensional con-

structs, which are represented by the factor scales, are

of prime importance in determining MMPI item re-

sponses, then scores on the validity and clinical scales

should be recoverable from the factor-scale θ esti-

mates. For instance, using only the trait estimates

from our previous IRT analyses, we should be able to

accurately reproduce the MMPI profiles that are dis-

played in Figure 2.

To test the aforementioned hypothesis requires IRT

item parameter estimates for the 383 unique items that

are scored on the MMPI validity and clinical scales.

There are also, coincidentally, 383 unique items on

the MMPI factor scales, although the items on the

validity and clinical scales do not overlap completely

with the items on the factor scales. In particular, we

did not have item parameter estimates for 87 items

(hereinafter called the missing items) that are scored

on one or more of the validity and clinical scales.

Thus, it was necessary to estimate item parameters for

as many of these items as possible. These estimates

were obtained as follows.3

First, in both the White and Black samples, we

calculated biserial correlations (using PRELIS 2;

Joreskog & Sorbom, 1996) between each of the 87

missing items and the estimated θ levels from the

previous IRT analyses of the MMPI factor scales.

These correlations were used to assign a missing item

to one of the factor scales. An item was assigned to a

scale if its correlation with that scale was higher than

its correlation with any other factor scale in both the

White and Black samples. Moreover, the absolute

value of the highest item-factor correlation was re-

quired to exceed .20. Seventy of the 87 missing items

met these liberal criteria and were thus provisionally

assigned to a factor scale. An item was retained on its

assigned scale only if it could be well modeled by the 2-PLM.
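The assignment rule just described is easy to automate. The sketch below substitutes ordinary Pearson (point-biserial) correlations for the biserial correlations we obtained from PRELIS 2, and it reads the requirement that the highest correlation exceed .20 as applying in both samples; the function and argument names are illustrative.

assign.item <- function(item.white, item.black, theta.white, theta.black, min.r = .20) {
  # item.white, item.black: 0/1 responses to one candidate item;
  # theta.white, theta.black: n x 12 matrices of factor-scale trait estimates
  r.w <- cor(item.white, theta.white)   # item-factor correlations, one per scale
  r.b <- cor(item.black, theta.black)
  best.w <- rev(order(abs(r.w)))[1]     # scale with the highest |r| in each sample
  best.b <- rev(order(abs(r.b)))[1]
  if (best.w == best.b && abs(r.w[best.w]) > min.r && abs(r.b[best.b]) > min.r)
    best.w else NA                      # factor-scale assignment, or no assignment
}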

At this point we wanted to estimate item parameters

for the 70 recently assigned items in a manner that

would not bias the trait level estimates or the item

parameter estimates from our original IRT analyses of

the factor scales. To accomplish this goal we used

marginal maximum-likelihood item parameter estima-

tion as implemented in BILOG 3.10 (Mislevy &

Bock, 1990). This program is well suited to our task

because it allows item parameter estimates to be fixed

or freely estimated. Parameters are fixed (i.e., con-

strained) to user-specified values when a tight Bayes-

ian prior (i.e., a prior with a user-specified mean and

a small standard deviation) is placed on the parameter

estimate. Parameters are freely estimated when the

Bayesian priors are loose or when no prior is speci-

fied. Thus, by making judicious use of this option, we

were able to link the parameter estimates of the re-

cently assigned items to the metric of the previously

calibrated items. We simply assigned tight priors to

the slopes and thresholds of the original items—

thereby fixing their values to their previously calcu-

lated estimates—and assigned loose or no priors (we

used the BILOG default values) to the slopes and

thresholds for the recently added items. An example

BILOG file that demonstrates how to fix and free

parameter estimates is reproduced in the Appendix.

3 We would like to thank an anonymous reviewer for

suggesting the following analyses.


Using the aforementioned procedure, we ran 22

BILOG jobs to calculate slope and threshold estimates

for the 70 missing items that are needed to score the

MMPI validity and clinical scales (no additional items

were assigned to the MMPI Christian Fundamental-

ism scale). After completing these runs, we had

group-specific slope and threshold estimates for 96%

of the items needed to score the MMPI validity and

clinical scales. As previously noted, it was not pos-

sible to estimate item parameters for 17 items. Having

no parameter estimates for these items, however,

posed no problems for our ultimate goal because

MMPI protocols are considered potentially valid as

long as no more than 30 items are omitted. Thus,

using the 366 items for which group-specific param-

eter estimates were available, we were able to simu-

late item response vectors for Whites and Blacks with

identical latent trait values. These response vectors

were then used to score the MMPI validity and clini-

cal scales and to look for possible scale biases.

Specifically, 511 MMPI protocols were simulated

for Whites, and 511 protocols were simulated for

Blacks. These protocols (i.e., item response vectors)

were paired such that the same profile of latent trait

values was used to generate a White and a Black item

response vector. One aspect of this analysis deserves

emphasis: Namely, we used the same latent trait es-

timates in both simulations (in both cases we used the

estimated θ levels from the Blacks); thus, our two

samples are perfectly matched on the latent traits that

are measured by the MMPI factor scales. Any differ-

ences found at the observed score level on the validity

and clinical scales can only arise from the use of the

two sets of estimated item parameters (i.e., the esti-

mated parameters from the Blacks and Whites).

Moreover, to minimize the error that is inherent in any

probabilistic item response model, when scoring the

MMPI scales, we summed the simulated item re-

sponse probabilities rather than simulated [0/1] raw

item responses. Thus, our analyses were conducted on

simulated true scores for Whites and Blacks.
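A compact way to see what this scoring step involves is sketched below, assuming items have been rescored so that 1 is the keyed response. The helper true.score sums the model-implied keyed-response probabilities for the items of a scale, with each item's probability evaluated at the trait value of the factor scale to which that item belongs; all object names (theta.black, a.black, sc.items, and so on) are placeholders for the quantities described in the text.

true.score <- function(theta.mat, a, b, scale.of.item, items, D = 1.7) {
  # theta.mat: n x 12 matrix of factor-scale trait values (one row per person);
  # a, b: one group's item parameters; scale.of.item: the factor scale each
  # item is keyed to; items: indices of the items scored on the MMPI scale
  apply(theta.mat, 1, function(th)
    sum(1 / (1 + exp(-D * a[items] * (th[scale.of.item[items]] - b[items])))))
}
# same trait profiles, two sets of item parameters:
sc.black <- true.score(theta.black, a.black, b.black, scale.of.item, sc.items)
sc.white <- true.score(theta.black, a.white, b.white, scale.of.item, sc.items)
mean(sc.black - sc.white)  # any difference reflects the item parameters alone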

Figure 5 displays the average MMPI profiles from

the two samples of reproduced item responses. Two

features of this plot bear directly on the question that

Figure 5. Reproduced Minnesota Multiphasic Personality Inventory (MMPI) profiles of Black and White youth offenders using model-based item response probabilities. L = Lie; F = Infrequency; K = Defensiveness; Hs = Hypochondriasis; D = Depression; Hy = Hysteria; Pd = Psychopathic Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt = Psychasthenia; Sc = Schizophrenia; Ma = Hypomania; Si = Social Introversion.


motivated our analyses. The first is that the profiles in

Figure 5 show a reassuringly close resemblance to the

profiles in Figure 2. This finding suggests that MMPI

item responses are largely determined by the homo-

geneous constructs that are tapped by the MMPI fac-

tor scales. It does not prove our hypothesis, nor does

it rule out the possibility that latent taxonic variables

(Waller & Meehl, 1998) also influence MMPI item

response behavior. It does suggest, however, that the

latent dimensions that are measured by the factor

scales are useful explanatory constructs that can be

used to predict an individual's item response profile.

The second noteworthy feature of Figure 5 is that the

two profiles—which were reproduced from the White

and Black estimated item parameters—are remark-

ably close (the small differences are not clinically

meaningful). This finding strongly supports the con-

tention that groups of Blacks and Whites can be

meaningfully compared on the MMPI validity and

clinical scales.

The above conclusions are further bolstered by con-

sidering the multivariate extension of Raju's index of

DTF (Raju et al., 1995). Oshima et al. (1997) have

recently demonstrated how this index can be mean-

ingfully applied to multidimensional scales. To do so

one computes the estimated true score by summing

the (keyed) item response probabilities for all items of

a multidimensional scale. These response probabili-

ties are determined by the unidimensional (e.g., Equa-

tion 1, supra) or multidimensional IRFs (Ackerman,

1996) that were used to characterize the items. In our

didactic example, all MMPI items were modeled by

the (unidimensional) 2-PLM. A person's estimated

true score for a multidimensional scale, such as an

MMPI validity or clinical scale, is calculated by sum-

ming the (keyed) response probabilities for all items

on the scale. For example, although MMPI Items 13,

23, and 30 are scored on the (multidimensional) clini-

cal Depression scale, each item is scored on a separate

(unidimensional) factor scale. Specifically, Item 13

(work under tension) is scored on General Maladjust-

ment, Item 23 (troubled by nausea) is scored on So-

matic Complaints, and Item 30 (feel like swearing) is

scored on Antisocial Tendencies. Thus, when calcu-

lating estimated item response probabilities for these

items it is necessary to consider latent trait values on

three dimensions (General Maladjustment, Somatic

Complaints, and Antisocial Tendencies). Once these

response probabilities have been calculated they can

be summed to produce estimated true scores on De-

pression.
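For concreteness, the calculation for these three items might be sketched as follows, with hypothetical trait values and item parameters (they are not our estimates) standing in for the quantities just described:

p2pl <- function(theta, a, b, D = 1.7) 1 / (1 + exp(-D * a * (theta - b)))
theta <- c(GM = 0.5, SomC = -0.2, AT = 1.0)  # trait values on General Maladjustment,
                                             # Somatic Complaints, Antisocial Tendencies
a <- c(i13 = 1.2, i23 = 0.8, i30 = 1.0)      # hypothetical slopes
b <- c(i13 = 0.3, i23 = 1.1, i30 = -0.4)     # hypothetical thresholds
sum(p2pl(theta, a, b))  # these items' contribution to the estimated true
                        # score on the Depression scale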

When working with multidimensional scales, such

as the MMPI validity and clinical scales, it is not

possible to portray the TCCs. Thus, we cannot com-

pare group-specific TCCs for the validity and clinical

scales as we did for the unidimensional factor scales

in Figure 4. However, as noted above, it is certainly

possible and desirable to compute the multidimen-

sional version of the DTF or the square root of this

index, the rDTF (recall that the rDTF is placed on the

metric of the original scores). Moreover, Oshima et al.

(1997) provided formulas for computing chi-square

test statistics for the DTF in the multidimensional

case. Table 3 reports the rDTF for the 13 validity and

clinical scales of our analyses. Note that the rDTF

values for all scales except K (Defensiveness) and Si

(Social Introversion) are statistically significant (p <

.05). Note also, however, that the rDTF values for all

scales are small. The two scales with the largest rDTF

are Scales 8 (Sc; Schizophrenia) and 9 (Ma; Hypo-

mania). These findings are interesting because previ-

ous investigators have speculated that MMPI Scales 8

and 9 are biased against Blacks. Although our find-

ings support that contention, they also forcefully sug-

gest that the degree of bias in these scales is minimal.

Table 3
Root Differential Test Functioning (rDTF) for MMPI Validity and Clinical Scales for Black and White Item Parameter Estimates

Validity and clinical scales   No. items   No. missing   rDTF
L                              12          3             0.32
F                              63          1             0.38
K                              30          0             0.20*
1 Hs                           32          1             0.35
2 D                            57          3             0.56
3 Hy                           60          0             0.47
4 Pd                           46          4             0.63
5 Mf                           54          6             0.41
6 Pa                           38          2             0.63
7 Pt                           48          0             0.82
8 Sc                           76          2             1.89
9 Ma                           44          2             1.56
0 Si                           69          1             0.57*

Note. rDTF is reported in the raw score metric. Minnesota Multiphasic Personality Inventory (MMPI) scale names: L = Lie; F = Infrequency; K = Defensiveness; Hs = Hypochondriasis; D = Depression; Hy = Hysteria; Pd = Psychopathic Deviate; Mf = Masculinity-Femininity; Pa = Paranoia; Pt = Psychasthenia; Sc = Schizophrenia; Ma = Hypomania; Si = Social Introversion.
* p > .05 (i.e., differential test functioning is not significantly different from 0.00).

In particular, the average score differences between Whites and Blacks with equal latent trait values on the

dimensions that are tapped by the MMPI are only 1.89

and 1.56 raw score points on Sc and Ma, respectively.

These small differences would not result in different

clinical interpretations. Nevertheless, researchers

wishing to reduce the amount of measurement bias in

these scales can easily do so by deleting those items

that maximally contribute to the DTF. For instance,

our item analyses indicated that MMPI Item 157 con-

tributes the most to the Sc DTF. This item asks ex-

aminees whether they have been punished without

cause. It is not difficult to understand why Blacks

(especially during the mid 1960s when our data were

collected) endorse this item more frequently than

Whites in our racially tense society, irrespective of

their standing on the constructs tapped by the MMPI.

By removing Item 157 the rDTF for Sc is lowered from

1.89 to 1.71. We could continue to remove biased

items from Sc until the rDTF fell below a specific

threshold, if desired.
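Raju et al. (1995) show that the DTF decomposes into additive item-level contributions (their CDIF index), and it is these contributions that identify the most profitable items to delete. The sketch below estimates such contributions by averaging, over the focal group's trait estimates, the product of each item's IRF difference with the scale-level true-score difference; the function name cdif, the object names in the commented call, and the D = 1.7 scaling constant are our own illustrative choices.

cdif <- function(theta.focal, a.f, b.f, a.r, b.r, D = 1.7) {
  p <- function(a, b) sapply(theta.focal, function(t) 1 / (1 + exp(-D * a * (t - b))))
  d <- p(a.f, b.f) - p(a.r, b.r)            # item-by-person IRF differences
  tot <- apply(d, 2, sum)                   # per-person true-score differences
  apply(d, 1, function(di) mean(di * tot))  # item contributions; their sum equals the DTF
}
# items with the largest contributions are candidates for removal, for example:
# rev(sort(cdif(theta.black, a.black, b.black, a.white, b.white)))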

Discussion

In this article we have described several methods

for separating true group differences from measure-

ment bias on unidimensional and multidimensional

scales. These methods are based on IRT and differ

from less formal procedures, such as the difference of

means test (Pritchard & Rosenblatt, 1980a), by equat-

ing the groups on the underlying latent traits that are

being measured. We have described how unidimen-

sional scales can sometimes be used to generate IRT

slope and threshold estimates for items on multidi-

mensional scales, and we have shown how these es-

timated item parameters can be used to elucidate dif-

ferential item and test functioning.

Our didactic example includes MMPI data from

1,277 White and 511 Black young adult criminal of-

fenders. Several findings from our analyses were no-

table. For instance, many MMPI items show evidence

of bias against Whites or Blacks. This finding was not

surprising. Any omnibus inventory, such as the

MMPI, the California Psychological Inventory

(Gough & Bradley, 1996), or the Multidimensional

Personality Questionnaire (Tellegen, 1982; Smith &

Reise, 1998), is likely to contain numerous items that

perform differently across various homogeneous

groups. However, most psychological research is con-

ducted at the scale level and thus the more important

question to ask is whether the items yield biased test

scores after they have been aggregated into scales.

Our analyses of the MMPI factor, validity, and clini-

cal scales suggest that Whites and Blacks can be

meaningfully compared on these scales with little fear

that obtained group differences are due to measure-

ment bias. We note that a small amount of bias was

found for two MMPI clinical scales (Sc and Ma) but

that the magnitude of the bias was insufficient to af-

fect clinical interpretations. Nevertheless, we showed

how the IRT methods that we used in this article could

also be used to identify items whose removal would

be most effective in reducing scale score bias.

An important thesis of this article is that group

differences at the item or scale level can arise from

measurement bias, actual group differences, or a com-

bination of these influences. A corollary of this thesis

is that bias research must necessarily focus on latent

trait (unobserved) scores rather than manifest (ob-

served) scores (Meredith & Millsap, 1992). To con-

duct bias research at the latent trait level requires one

to fit a formal psychometric model to the observed

item responses (Embretson, 1996; Lord, 1980). In this

article, we demonstrated how the 2-PLM IRT model

could be used to obtain latent trait estimates for the

underlying constructs that are tapped by the MMPI

factor scales. Model-data fit analyses demonstrated

that this psychometric model accurately describes

item response behavior on the MMPI. This last point

is particularly noteworthy because previous research-

ers have paid scant attention to the latent structure of

the MMPI. Consider, for example, the MMPI Depres-

sion scale, also known as Scale 2. There is no con-

sensus in the MMPI community on whether elevated

scores on this scale signify higher levels of depres-

sion—that is, higher scores on an underlying latent

Depression dimension—or whether higher scores im-

ply higher probabilities of belonging to a latent de-

pression taxon (Waller & Meehl, 1998). This problem

is compounded exponentially when one considers that

most of the MMPI clinical scales, including Scale 2,

are highly multidimensional when modeled by factor

analysis (a model that is arguably inappropriate if

Scale 2 measures a latent taxon). For this and other

reasons, groups that are matched on manifest (i.e.,

observed) MMPI clinical scales are almost certainly

not matched on the underlying latent constructs that

the scales implicitly measure. These problems are not

unique to the MMPI but plague numerous personality

and psychopathology scales.

In recent years there have been repeated calls for

model-based personality and psychopathology assess-

ment (Embretson, 1996; Embretson & Hershberger,

1999; Waller, 1999; Waller & Meehl, 1998; Waller,


Tellegen, McDonald, & Lykken, 1996). In this article

we have attempted to demonstrate the benefits of

model-based assessment with the MMPI by demon-

strating two IRT models that can be used to assess

measurement bias at item and scale levels on both

homogeneous and heterogeneous scales. Importantly,

these models show that scales that contain biased

items may nonetheless provide unbiased estimates of

the underlying latent traits that influence scale-score

performance. Stated more formally, the presence of

differential item functioning does not lead inexorably

to differential test functioning. Item bias may become

amplified or canceled when aggregated at the total

score level. These important characteristics of items

and tests will remain hidden, however, until research-

ers adopt a model-based approach to psychological

assessment.

References

Ackerman, T. (1996). Graphical representation of multidi-

mensional item response theory analyses. Applied Psy-

chological Measurement, 20, 311-329.

Archer, R. P. (1984). Use of the MMPI with adolescents: A

review of salient issues. Clinical Psychology Review, 4,

241-251.

Archer, R. P. (1987). Using the MMPI with adolescents.

Hillsdale, NJ: Erlbaum.

Baker, F. B. (1992). Item response theory: Parameter esti-

mation techniques. New York: Marcel Dekker.

Bertelson, A. D., Marks, P. A., & May, G. D. (1982). MMPI

and race: A controlled study. Journal of Consulting and

Clinical Psychology, 50, 316-318.

Birnbaum, A. (1968). Some latent trait models and their use

in inferring an examinee's ability. In F. M. Lord & M. R.

Novick (Eds.), Statistical theories of mental test scores

(pp. 395-479). Reading, MA: Addison-Wesley.

Butcher, J. N., Braswell, L., & Raney, D. (1983). A cross-

cultural comparison of American Indian, Black, and

White inpatients on the MMPI and presenting symptoms.

Journal of Consulting and Clinical Psychology, 51, 587-

594.

Butcher, J. N., Dahlstrom, W. G., Graham, J. R., Tellegen,

A., & Kaemmer, B. (1989). MMPI-2 manual for admin-

istration and scoring. Minneapolis: University of Minne-

sota Press.

Butcher, J. N., & Rouse, S. V. (1996). Personality: Indi-

vidual differences and clinical assessment. Annual Re-

view of Psychology, 47, 87-111.

Byrne, B. M., Shavelson, R. J., & Muthen, B. (1989). Test-

ing for the equivalence of factor covariance and mean

structures: The issue of partial measurement invariance.

Psychological Bulletin, 105, 456-466.

Camilli, G., & Shepard, L. A. (1994). Methods for identi-

fying biased test items. Thousand Oaks, CA: Sage.

Cohen, J. (1988). Statistical power analysis for the behav-

ioral sciences. Hillsdale, NJ: Erlbaum.

Costello, R. M. (1973). Item level racial differences on the

MMPI. Journal of Social Psychology, 91, 161-162.

Costello, R. M. (1977). Construction and cross-validation of

an MMPI Black-White scale. Journal of Personality As-

sessment, 41, 514-519.

Dahlstrom, W. G., & Gynther, M. D. (1986). Previous

MMPI research on Black Americans. In W. G. Dahl-

strom, D. Lachar, & L. E. Dahlstrom (Eds.), MMPI pat-

terns of American minorities. Minneapolis: University of

Minnesota Press.

Dahlstrom, W. G., Lachar, D., & Dahlstrom, L. E. (Eds.).

(1986). MMPI patterns of American minorities. Minne-

apolis: University of Minnesota Press.

Ellis, B. B., Becker, P., & Kimmel, H. D. (1993). An item

response theory evaluation of an English version of the

Trier Personality Inventory (TPI). Journal of Cross-

Cultural Psychology, 24, 133-148.

Ellis, B. B., Minsel, B., & Becker, P. (1989). Evaluation of

attitude survey translations: An investigation using item

response theory. International Journal of Psychology, 24,

665-684.

Embretson, S.E. (1996). The new rules of measurement.

Psychological Assessment, 8, 341-349.

Embretson, S. E., & Hershberger, S. L. (1999). The new rules

of measurement: What every psychologist and educator

should know. Mahwah, NJ: Erlbaum.

Gottfredson, L. S. (1994). The science and politics of race

norming. American Psychologist, 49, 955-963.

Gough, H. G., & Bradley, P. (1996). Manual for the Cali-

fornia Psychological Inventory. Palo Alto, CA: Consult-

ing Psychologists Press.

Graham, J. R. (1993). MMPI-2: Assessing personality and

psychopathology. Oxford, England: Oxford University

Press.

Greene, R.L. (1978). An empirically derived MMPI care-

lessness scale. Journal of Clinical Psychology, 34, 407-

410.

Greene, R. L. (1987). Ethnicity and MMPI performance: A

review. Journal of Consulting and Clinical Psychology,

55, 497-512.

Greene, R. L. (1991). The MMPI-2/MMPI: An interpretive

manual. Needham Heights, MA: Allyn and Bacon.

Gynther, M. D. (1972). White norms and Black MMPIs: A

prescription for discrimination? Psychological Bulletin,

78, 386-402.


Gynther, M. D. (1981). Is the MMPI an appropriate assess-

ment device for Blacks? Journal of Black Psychology, 7,

67-75.

Gynther, M. D. (1989). MMPI comparisons of Blacks and

Whites: A review and commentary. Journal of Clinical

Psychology, 45, 878-883.

Gynther, M. D., & Green, S. B. (1980). Accuracy may make

a difference, but does a difference make for accuracy? A

response to Pritchard and Rosenblatt. Journal of Consulting and Clinical

Psychology, 48, 268-272.

Gynther, M. D., Lachar, D., & Dahlstrom, W. G. (1978).

Are special norms for minorities needed? Development

of an MMPI F scale for Blacks. Journal of Consulting

and Clinical Psychology, 46, 1403-1408.

Gynther, M. D., & Witt, P. H. (1976). Windstorms and im-

portant persons: Personality characteristics of Black edu-

cators. Journal of Clinical Psychology, 32, 613-616.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J.

(1991). Fundamentals of item response theory. Newbury

Park, CA: Sage.

Harris, R., & Lingoes, J. (1955). Subscales for the Minne-

sota Multiphasic Personality Inventory. Ann Arbor, MI:

The Langley Porter Clinic.

Harrison, R. H., & Kass, E. H. (1967). Differences between

Negro and White pregnant women on the MMPI. Journal

of Consulting Psychology, 31, 454-463.

Harrison, R. H., & Kass, E. H. (1968). MMPI correlates of

Negro acculturation in a northern city. Journal of Per-

sonality and Social Psychology, 10, 262-270.

Hathaway, S. R., & McKinley, J. C. (1940). A multiphasic

personality schedule (Minnesota): I. Construction of the

schedule. Journal of Psychology, 10, 249-254.

Holland, P. W., & Thayer, D. T. (1988). Differential item

performance and the Mantel-Haenszel procedure. In H.

Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145).

Hillsdale, NJ: Erlbaum.

Holland, P. W., & Wainer, H. (Eds.). (1993). Differential

item functioning. Hillsdale, NJ: Erlbaum.

Huang, C. D., & Church, A. T. (1997). Identifying cultural

differences in items and traits: Differential item function-

ing in the NEO Personality Inventory. Journal of Cross-

Cultural Psychology, 28, 192-218.

Hulin, C. L., & Mayer, L. J. (1986). Psychometric equiva-

lence of a translation of the Job Descriptive Index into

Hebrew. Journal of Applied Psychology, 71, 83-94.

Jensen, A. R. (1980). Bias in mental testing. New York:

Free Press.

Jensen, A. R. (1998). The g factor: The science of mental

ability. Westport, CT: Praeger.

Jones, E. E. (1978). Black-White personality differences:

Another look. Journal of Personality Assessment, 42,

244-252.

Joreskog, K. G., & Sorbom, D. (1996). PRELIS 2 user's

reference guide. Chicago: Scientific Software Interna-

tional.

Lord, F. M. (1980). Applications of item response theory.

Hillsdale, NJ: Erlbaum.

Lubin, B., Larsen, R. M., Matarazzo, J. D., & Seever, M.

(1985). Psychological test usage patterns in five profes-

sional settings. American Psychologist, 40, 857-861.

McKegney, F. P. (1965). An item analysis of the MMPI F

scale in juvenile delinquents. Journal of Clinical Psy-

chology, 21, 201-205.

McNulty, J. L., Graham, J. R., Ben-Porath, Y. S., & Stein,

L. A. R. (1997). Comparative validity of MMPI-2 scores

of African American and Caucasian mental health center

clients. Psychological Assessment, 9, 464-470.

Meehl, P. E. (1967). Theory-testing in psychology and

physics: A methodological paradox. Philosophy of Sci-

ence, 34, 103-115.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks:

Sir Karl, Sir Ronald, and the slow progress of soft psy-

chology. Journal of Consulting and Clinical Psychology,

46, 806-834.

Meredith, W. (1993). Measurement invariance, factorial

analysis, and factorial invariance. Psychometrika, 58,

525-543.

Meredith, W., & Millsap, R. E. (1992). On the misuse of

manifest variables in the detection of measurement bias.

Psychometrika, 57, 289-311.

Miller, C., Knapp, S. C., & Daniels, C. W. (1968). MMPI

study of Negro mental hygiene clinic patients. Journal of

Abnormal Psychology, 73, 168-173.

Millsap, R. E., & Everson, H. (1993). Methodology review:

Statistical approaches for assessing measurement bias.

Applied Psychological Measurement, 17, 297-334.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analy-

sis and test scoring with binary logistic models. Chicago:

Scientific Software International.

Nandakumar, R. (1993). Simultaneous DIF amplification

and cancellation: Shealy-Stout's test for DIF. Journal of

Educational Measurement, 30, 293-311.

Newmark, C. S., Gentry, L., Warren, N., & Finch, A. J.

(1981). Racial bias in an MMPI index of schizophrenia.

Journal of Clinical Psychology, 20, 215-216.

Oshima, T. C., Raju, N. S., & Flowers, C. P. (1997). Devel-

opment and demonstration of multidimensional IRT-

based internal measures of differential function of items

and tests. Journal of Educational Measurement, 34, 253-

272.

Patterson, E. T., Charles, H. L., Woodward, W. A., Roberts,


W. R., & Penk, W. E. (1981). Differences in measures of

personality and family environment among Black and

White alcoholics. Journal of Consulting and Clinical

Psychology, 49, 1-9.

Penk, W. E., Roberts, W. R., Robinowitz, R., Dolan, M. P.,

Atkins, H. G., & Woodward, W. A. (1982). MMPI dif-

ferences of Black and White male polydrug abusers seek-

ing treatment. Journal of Consulting and Clinical Psy-

chology, 50, 463-465.

Pritchard, D. A., & Rosenblatt, A. (1980a). Racial bias in

the MMPI: A methodological review. Journal of Con-

sulting and Clinical Psychology, 48, 263-267.

Pritchard, D. A., & Rosenblatt, A. (1980b). Reply to

Gynther and Green. Journal of Consulting and Clinical

Psychology, 48, 273-274.

Raju, N. S. (1988). The area between two item characteristic

curves. Psychometrika, 53, 495-502.

Raju, N. S. (1990). Determining the significance of esti-

mated signed and unsigned areas between two item re-

sponse functions. Applied Psychological Measurement,

14, 197-207.

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995).

IRT-based internal measures of differential functioning

of items and tests. Applied Psychological Measurement,

19, 353-368.

Raven, J. C. (1960). Guide to the standard progressive ma-

trices. London: H. K. Lewis.

Reise, S. P., & Waller, N. G. (1990). Fitting the two-

parameter model to personality data. Applied Psychologi-

cal Measurement, 14, 45-58.

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Con-

firmatory factor analysis and item response theory: Two

approaches for exploring measurement invariance. Psy-

chological Bulletin, 114, 552-566.

Science Research Associates. (1947). Army general classi-

fication test examiners manual. Chicago: Author.

Shealy, R., & Stout, W. (1993). An item response theory

model for test bias. In P. W. Holland & H. Wainer (Eds.),

Differential item functioning (pp. 197-239). Hillsdale,

NJ: Erlbaum.

Smith, L. L., & Reise, S. P. (1998). Gender differences on

negative affectivity: An IRT study of differential item

functioning on the Multidimensional Personality Ques-

tionnaire Stress Reaction scale. Journal of Personality

and Social Psychology, 75, 1350-1362.

Stocking, M. L., & Lord, F. M. (1983). Developing a com-

mon metric in item response theory. Applied Psychologi-

cal Measurement, 7, 201-210.

Takane, Y., & De Leeuw, J. (1987). On the relationship

between item response theory and factor analysis of dis-

cretized variables. Psychometrika, 52, 393-408.

Tellegen, A. (1982). Manual for the Multidimensional Per-

sonality Questionnaire. Minneapolis: University of Min-

nesota, Department of Psychology.

Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond

group differences: The concept of item bias. Psychologi-

cal Bulletin, 99, 118-128.

Timbrook, R. E., & Graham, J. R. (1994). Ethnic differ-

ences on the MMPI-2. Psychological Assessment, 6, 212-

217.

van der Linden, W. J., & Hambleton, R. K. (1997). Hand-

book of modern item response theory. New York: Springer.

Waller, N. G. (1998). LINKDIF: An S-PLUS routine for

linking item parameters and calculating IRT measures of

differential functioning of items and tests. Applied Psy-

chological Measurement, 22, 392.

Waller, N.G. (1999). Searching for structure in the MMPI.

In S. E. Embretson & S. L. Hershberger (Eds.), The new

rules of measurement: What every psychologist and edu-

cator should know (pp. 185-217). Hillsdale, NJ: Erlbaum.

Waller, N. G., & Meehl, P. E. (1998). Multivariate taxomet-

ric procedures: Distinguishing types from continua.

Thousand Oaks, CA: Sage.

Waller, N. G., Tellegen, A., McDonald, R. P., & Lykken,

D. T. (1996). Exploring nonlinear models in personality

assessment: Development and preliminary validation of a

negative emotionality scale. Journal of Personality, 64,

545-576.

Wenk, E. (1990). Criminal careers: Criminal violence and

substance abuse (final report). Washington, DC: United

States Department of Justice, National Institute of Justice.

Whitworth, R. H., & McBlaine, D. D. (1993). Comparison

of the MMPI and the MMPI-2 administered to Anglo-

and Hispanic-American university students. Journal of

Personality Assessment, 61, 19-27.

Widaman, K. F., & Reise, S. P. (1997). Exploring the mea-

surement invariance of psychological instruments: Appli-

cations in the substance use domain. In K. J. Bryant, M.

Windle, & S. G. West (Eds.), The science of prevention:

Methodological advances from alcohol and substance

abuse research (pp. 281-324). Washington, DC: Ameri-

can Psychological Association.

Wilcox, R. R. (1998). How many discoveries have been lost

by ignoring modern statistical methods? American Psy-

chologist, 53, 300-314.

Witt, P. H., & Gynther, M. D. (1975). Another explanation

for Black-White MMPI differences. Journal of Clinical

Psychology, 31, 69-70.

(Appendix follows)


Appendix

BILOG Program

The following BILOG file demonstrates how to estimate

item response theory (IRT) item parameters for the two-

parameter logistic IRT model with fixed and free parameter

constraints. In this example the parameters for the first 11

items are fixed to their previously estimated values by

specifying tight priors on the PRIOR command line. TMU

and TSIGMA specify the means and standard deviations of

the prior distributions for the item thresholds. Note that the

means for the prior threshold distributions match the esti-

mated threshold values (for Whites) that are reported in

Table 2. Note also that the standard deviations for these

prior distributions are exceedingly small. Specifically, for

the 11 original items of the Phobias and Fears factor scale,

the standard deviations of the threshold prior distributions

are uniformly .005. Hence, by specifying tight priors,

BILOG constrains the estimated thresholds (for these 11

items) to the means of the prior distributions. Means and

standard deviations for prior distributions are not specified

for the two items that were added to the scale (Items 240

and 287). Notice also that the means of the prior distribu-

tions for the slope parameters (SMU) are equal to the natu-

ral log of the slope estimates that are reported in Table 2 for

the original 11 items on the Phobias and Fears factor scale.

To fix these items to their previously estimated slope values,

the standard deviations for the slope prior distributions are

uniformly equal to .001.

Phobias and Fears Whites: 2PLM

Items 240 (-) & 287 (-) added to augment the original factor scale

>COMMENTS: Method = 1 (maximum likelihood scoring of theta)

>COMMENTS:

>GLOBAL NPArm = 2, SAVE, DFName = 'c:\PsyMeth\PhobW.DAT';
>SAVE PARM = 'c:\PsyMeth\PhobW.PRM',
GRAPH = 'c:\PsyMeth\PhobW.PLT',
SCORE = 'c:\PsyMeth\PhobW.SCR';
>LENGTH NITems = 13;
>INPUT NTOtal = 13, NIDch = 4, SAMPLE = 1277;
(4A1, 2X, 6A1, 2X, 5A1, 2X, 2A1)
>TEST TNAme = "Phobias", INAmes = (I128, I131, I166, I176, I270, I367, I392, I401, I480, I492,
I522, R240, R287);

>CALIB TPRIOR, READPRI, NEW = 50, NFULL = 500, PLOT =1.0, CHISOR = 0.0;

>PRIOR TMU = (1.8063, 0.9110, 1.0659, 0.8611, 0.3967, 1.1900, 2.9153, 1.6149, 2.2257, 0.6577, 0.3088),
TSIGMA = (.005(0)11),
SMU = (-0.46777, -0.83264, -0.15106, -0.28157, -1.47447, -0.18971, -0.40557, -0.24974,
0.25518, -0.82098, -0.14030),

SSIGMA = (.001(0)11);

>SCORE METHOD =1, NQPT = 20, IDIST = 3;

Received July 31, 1998

Revision received July 5, 1999

Accepted October 2, 1999