USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
(Under the Direction of Seock-Ho Kim)
ABSTRACT
This study uses the multiple-indicator/multiple cause (MIMIC)
latent variable model and multiple group analysis to evaluate
Likert-type items for gender-related differential item functioning
(DIF) on an aggression measure in an adolescent population. The
MIMIC model allows for the simultaneous examination of group
differences in the latent factor of interest (i.e., aggression) and
response to measurement (i.e., DIF). Multiple group analysis
provides an overall examination of measurement invariance across
groups. This study tests for gender DIF in two scales of an
aggression measure, physical aggression and relational
aggression.
INDEX WORDS: Measurement invariance, MIMIC model, Differential
item functioning
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
B.S.Ed., University of Georgia, 2002
A Thesis Submitted to the Graduate Faculty of The University of
Georgia in Partial Fulfillment
of the Requirements for the Degree
MASTER OF ARTS
ATHENS, GEORGIA
2008
© 2008
Katherine Raczynski
All Rights Reserved
USING MIMIC MODELING AND MULTIPLE-GROUP ANALYSIS TO DETECT GENDER-RELATED DIF IN LIKERT-TYPE ITEMS ON AN AGGRESSION MEASURE
by
KATHERINE RACZYNSKI
Major Professor: Seock-Ho Kim
Committee: Deborah Bandalos
Stephen Olejnik

Electronic Version Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
December 2008
ACKNOWLEDGEMENTS
I would like to thank my major advisor, Seock-Ho Kim, for
providing support and
guidance, along with the other members of my committee, Deborah
Bandalos and Stephen
Olejnik. The suggestions and assistance provided by my committee
were of tremendous value.
I also owe a debt of gratitude to Andy Horne, Pamela Orpinas,
and the Youth Violence
Prevention Project, for allowing me access to the data and for
being great friends and role
models. Finally, thank you to my family for providing unflagging
support and encouragement.
TABLE OF CONTENTS
                                                                        Page
ACKNOWLEDGEMENTS ........................................................ iv
LIST OF TABLES .......................................................... vii
CHAPTER
   1 INTRODUCTION AND THEORETICAL FRAMEWORK .............................. 1
      Introduction ....................................................... 1
      Measurement ........................................................ 1
      Measurement Invariance ............................................. 3
      Differential Item Functioning ...................................... 3
      Item Response Theory ............................................... 4
      Structural Equation Modeling ....................................... 6
      Connections between CFA and IRT .................................... 8
   2 LITERATURE REVIEW ................................................... 10
      The MIMIC Model .................................................... 10
      Advantages of the MIMIC Model ...................................... 11
      DIF Detection using MIMIC Models: A Comparison to IRT Models ....... 12
      Prior Studies using MIMIC Modeling to Detect DIF ................... 13
      Gender-related DIF in Measures of Aggression ....................... 16
   3 PROCEDURE ........................................................... 18
      Sample ............................................................. 18
      Instrumentation .................................................... 19
      Computer Program ................................................... 20
      Detection of Gender-related DIF .................................... 20
      Multiple-Indicator/Multiple Cause Modeling ......................... 21
      Multiple Group Analysis ............................................ 23
   4 RESULTS ............................................................. 25
      Descriptive Statistics ............................................. 25
      Outliers ........................................................... 29
      Missing Value Treatment ............................................ 29
      Physical Aggression Scale .......................................... 29
      Relational Aggression Scale ........................................ 34
   5 SUMMARY AND DISCUSSION .............................................. 38
      Summary ............................................................ 38
      Discussion ......................................................... 39
      Limitations and Future Research .................................... 42
REFERENCES .............................................................. 45
APPENDICES .............................................................. 51
   A MPLUS SYNTAX ........................................................ 51
LIST OF TABLES
                                                                        Page
Table 1: Physical and Relational Aggression Items Means and Standard
   Deviations for Boys and Girls ........................................ 27
Table 2: Physical and Relational Aggression Items Intercorrelations,
   Skewness and Kurtosis ................................................ 28
Table 3: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for
   the Physical Aggression Scale ........................................ 30
Table 4: Sequential Chi Square Tests of Invariance for the Physical and
   Relational Aggression Scales ......................................... 32
Table 5: Multiple-Indicator/Multiple Causes (MIMIC) Model Estimates for
   the Relational Aggression Scale ...................................... 35
CHAPTER 1
INTRODUCTION AND THEORETICAL FRAMEWORK
Introduction
This study uses the multiple-indicator/multiple cause (MIMIC)
latent variable model and
multiple group analysis to evaluate Likert-type items for
gender-related differential item
functioning (DIF) on an aggression measure in an adolescent
population. The MIMIC model
allows for the simultaneous examination of group differences in
the latent factor of interest (i.e.,
aggression) and response to measurement (i.e., DIF). Multiple
group analysis provides an
overall examination of measurement invariance across groups.
This study tests for gender DIF
in two scales of an aggression measure, physical aggression and
relational aggression. Because
gender differences in levels of victimization may contribute to
differences in levels of
aggression, the model includes a measure of victimization as a
covariate.
Measurement
Measurement is the term given to the systematic act of assigning
numbers on variables to
represent properties or characteristics of people, events, or
objects (Stevens, 1946; Lord &
Novick, 1968, p. 16). Within education and psychology,
measurement is used to aid in the
understanding of unobserved or latent variables that are of
interest to the researcher, such as
academic achievement or attitudes toward violence. While
researchers believe that these internal
characteristics exist, there is no direct way to observe them.
Instead, researchers rely on theory
to develop survey instruments that indirectly measure constructs
of interest.
The objective of any well-designed survey instrument is to
obtain observed item
responses that are reflective of respondents’ levels of an
unobserved latent trait. Ideally,
individuals with the same level of the underlying trait should
obtain the same score on an
instrument measuring that trait. However, according to classical
test theory, survey responses
always include an unknown quantity of error or unexplained
variation. That is, other factors,
apart from the level of latent trait, influence how participants
respond to items. Classical test
theory models this variation using the equation
X = T + E, (1)
where X is the observed score, T is the true score, and E is the
error or unexplained variation
(Lord & Novick, 1968, p. 34). Theoretically, E is normally distributed, with a mean of zero and variance σ²E. The error component is also assumed to be nonsystematic in nature and uncorrelated with T. Because T and E are theoretically uncorrelated, it follows that the variance of the observed scores (σ²X) can be parceled out into two components, true score variation (σ²T) and error variation (σ²E), as modeled in

σ²X = σ²T + σ²E. (2)
A derivation of this model allows researchers to conceptualize
test reliability:
rXX′ = σ²T/(σ²T + σ²E) = σ²T/σ²X. (3)
The above model demonstrates that high reliability depends on variation in true scores (T) making up a large proportion of the variation in observed scores (X). Because T is not directly
observable, researchers rely on other techniques (e.g., Pearson
correlation, test-retest reliability)
to estimate reliability. Regardless of the measure, the goal in
any testing situation is to obtain
observed scores (X) that are substantially made up of T, with
negligible amounts of E. It is the
responsibility of the researcher to minimize the amount of error
likely to be contained in
responses through rigorous confirmation of the instrument’s
reliability and validity evidence.
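The classical test theory relations above can be made concrete with a short simulation. The following Python sketch is not part of the thesis: the true-score and error variances are invented, and it simply checks that the variance ratio in Equation 3 behaves as expected when X = T + E.

```python
import numpy as np

# Illustrative sketch (not from the thesis): simulate X = T + E and check
# that reliability is approximately var(T) / var(X), per Equations 1-3.
rng = np.random.default_rng(0)
n = 100_000

true_var, error_var = 9.0, 4.0                 # invented variances for the sketch
T = rng.normal(50.0, np.sqrt(true_var), n)     # true scores
E = rng.normal(0.0, np.sqrt(error_var), n)     # nonsystematic error, uncorrelated with T
X = T + E                                      # observed scores (Equation 1)

reliability = T.var() / X.var()                # Equation 3
print(round(reliability, 2))                   # close to 9 / (9 + 4), about 0.69
```

With a large error variance relative to the true-score variance, the same computation yields a correspondingly low reliability, which is the point of Equation 3.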
Classical test theory provides a framework for understanding the
measurement properties
of observed variables in terms of their reliability and
validity. However, more recent analytical
techniques have allowed researchers to address additional
aspects of reliability and validity that
were previously inaccessible using solely classical test theory
methods of inquiry (Vandenberg &
Lance, 2000). One such topic is the systematic evaluation of
measurement invariance.
Measurement Invariance
Measurement invariance refers to a test’s ability to measure the
same latent variable
under different measurement conditions, such as with different
populations of respondents (Horn
& McArdle, 1992). Therefore, measurement invariance is
primarily concerned with the
generalizability of interpretations of test responses across
different sets of circumstances. Before
conducting comparisons across groups on a common measure,
researchers should evaluate
whether different groups of respondents conceptually respond to
and interpret the measure in a
similar way.
Without evidence of adequate measurement invariance,
interpretations of observed scores
may be flawed. Differences in group means may be related to
actual group differences (e.g., one
group has more of the latent variable assessed) or to group
differences in response to the measure
(e.g., differences in frame of reference). In order to
meaningfully explore true differences, it is
necessary to discount the possibility of substantial measurement
differences.
Differential Item Functioning
This paper is concerned with a type of measurement invariance
analysis called
differential item functioning (DIF). DIF analysis involves
evaluating differences in item
performance across distinct groups of respondents after matching
the groups on ability (on
achievement measures) or “severity” (on psychological measures)
(Angoff, 1993). DIF occurs
when subgroups of respondents with equal amounts of the latent
trait respond differently to
items, causing potentially serious threats to the validity of
the test.
DIF can be uniform or non-uniform. Uniform DIF occurs when one
group consistently
scores higher than the other tested group, across all levels of
ability. An example of uniform DIF
is in the case when a group of girls outperforms a group of boys
on a math test when the two
groups of children possess an equal amount of underlying math
ability. That is, some other
factor is interfering to give the girls a consistent advantage
over boys.
Non-uniform DIF occurs when items discriminate differently
between different ability
levels within groups. An example would be a math problem that
average ability girls can answer
correctly, but only high ability boys (and not average ability
boys) are able to answer. In effect,
this item differentiates between low-ability and average-ability
girls and average-ability and
high-ability boys.
Item Response Theory
Researchers have primarily examined DIF using item response
theory (IRT) techniques.
IRT models evaluate variation among respondents by analyzing
item-level data (Edelen et al.,
2006). In effect, IRT modeling assigns respondents as well as
items to a scale of measurement,
which is conceptualized as having a range of negative infinity
to positive infinity, a mean of
zero, and a unit of measurement of one (Baker, 2001).
The fundamental elements of IRT are a latent variable (such as
ability or severity),
generally called θ , and an item characteristic curve (ICC) for
each item. The ICC graphically
represents the probability of correct answer choice or
endorsement along a continuum of θ.
Therefore, the ICC is conceptualized as a function of θ. When analyzing a polytomous item (e.g., Likert-type), IRT methods assume that the item true score function, a nonlinear monotonic function, connects θ to the expected answer choice. There are also additional assumptions underlying polytomous IRT methods, namely, that the set of items of interest is unidimensional and locally independent (i.e., uncorrelated after controlling for θ).
One IRT model that can be used to examine polytomous items is the graded response model (Samejima, 1969). This model is applied when items have Likert-type or categorical answer choices, which is a common characteristic of personality and attitude measures (Embretson & Reise, 2000, p. 308). In the graded response model, when the item has K answer choices or categories, k = 1, 2, …, K, the item true score function of θ, T(θ), is modeled as

T(θ) = Σ k × Pk(θ), (4)

with the sum taken over k, where Pk(θ), the item category response function for category k, is

Pk(θ) = Pk−1*(θ) − Pk*(θ). (5)
The boundary response function, or probability of responding above k, is

Pk*(θ) = 1/{1 + exp[−a(θ − bk)]}, (6)

where a is the slope parameter and the bk are the threshold parameters. Additionally, P0*(θ) = 1 and PK*(θ) = 0. For an item with five response categories, there
are five item response functions and
four boundary response functions. In this case, there would be
five total item parameters.
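Equations 4 through 6 can be traced with a small numeric sketch. The Python below is not from the thesis, and the parameter values (one slope a and four thresholds bk) are invented; it only verifies that the category probabilities implied by the boundary functions sum to one and yield an expected response between 1 and K.

```python
import math

# Illustrative sketch of Samejima's graded response model for a single
# 5-category item; the parameter values (a, b1..b4) are invented.
a = 1.2
b = [-1.5, -0.5, 0.5, 1.5]            # four thresholds for K = 5 categories

def boundary(theta, bk):
    """Pk*(theta): probability of responding above category k (Equation 6)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - bk)))

def category_probs(theta):
    """Pk(theta) = Pk-1*(theta) - Pk*(theta), with P0* = 1 and PK* = 0."""
    star = [1.0] + [boundary(theta, bk) for bk in b] + [0.0]
    return [star[k] - star[k + 1] for k in range(5)]

probs = category_probs(0.3)
total = sum(probs)                                        # should equal 1
expected = sum((k + 1) * p for k, p in enumerate(probs))  # T(theta), Equation 4
print(round(total, 6), round(expected, 3))
```

Because the thresholds are ordered, each boundary probability is smaller than the previous one, so every category probability in Equation 5 is nonnegative.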
In order to evaluate an instrument for DIF using a graded
response model, the researcher
designates a reference group and a focal group of respondents.
DIF is said to be present if item
true score functions are not equal among these groups, such that TR(θ) ≠ TF(θ), where the subscripts signify the reference and focal groups, respectively.
Structural Equation Modeling
Another class of techniques used to evaluate measurement
invariance is structural
equation modeling (SEM). In particular, confirmatory factor
analysis (CFA), a type of SEM
analysis, has been used extensively to study measurement
invariance. CFA methods for
detecting measurement invariance involve an overall test of
comparability of parameter values
across groups, followed by a series of more specialized
comparisons to identify the source of
lack of equivalence, if indicated.
The measurement model for CFA can be written as

x = τ + Λxξ + δ, (7)

(Vandenberg & Lance, 2000). In this equation, x represents a q × 1 vector of observed variables, ξ represents an n × 1 vector of latent variables, Λx is a q × n matrix of factor loadings, and δ is a q × 1 vector of measurement error in x. This equation also includes the τ vector of intercepts, although generally intercepts are assumed to be zero and are not estimated. In order to obtain a covariance matrix, Λxξ + δ is multiplied by its transpose. Following the assumption that measurement errors are uncorrelated with each other and with the latent construct, the covariance matrix (Σx) may be expressed as

Σx = ΛxΦΛx′ + Θδ, (8)
where Φ is the covariance matrix of the latent variables and Θδ
is the diagonal matrix of
measurement error variances. While Equation 8 is identical to
the measurement model for
exploratory factor analysis (EFA), CFA places restrictions on Λx
which differentiates CFA from
EFA.
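Equation 8 can be verified numerically for a small model. The Python sketch below is not from the thesis; the one-factor, three-indicator loadings and unique variances are invented, chosen so that each implied variance equals loading² plus uniqueness.

```python
import numpy as np

# Illustrative sketch of the CFA-implied covariance matrix,
# Sigma = Lambda * Phi * Lambda' + Theta (Equation 8), with invented values.
Lam = np.array([[0.8], [0.7], [0.6]])   # q x n factor loadings (q = 3, n = 1)
Phi = np.array([[1.0]])                 # factor variance fixed to 1
Theta = np.diag([0.36, 0.51, 0.64])     # diagonal matrix of unique variances

Sigma = Lam @ Phi @ Lam.T + Theta
print(Sigma)
# each diagonal entry is loading^2 + uniqueness, e.g. 0.8^2 + 0.36 = 1.00;
# each off-diagonal entry is the product of the two loadings, e.g. 0.8 * 0.7
```

Restricting which entries of Λx are free versus fixed to zero is exactly the constraint structure that distinguishes CFA from EFA in this computation.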
Measurement invariance can be examined on many different levels
using CFA. In that
Σx can be different for different populations, it is possible to
test for equivalence of Σx , Λx , Φ,
and Θδ (Raju, Laffitte, & Byrne, 2002). A test of Σx = Σx′,
where Σx refers to the covariance
matrix of the reference group and Σx′ refers to the covariance
matrix of the comparison group,
can be thought of as an omnibus test of measurement invariance.
If the null hypothesis is not
rejected using chi-square and other goodness-of-fit methods, the
measure is generally accepted
as invariant and further tests are unnecessary. However, if the
null hypothesis is rejected, lack of
invariance is indicated, and further tests should be conducted to identify the source of the lack of invariance (Schmitt, 1982).
There are several types of tests of invariance to identify the
source of lack of
measurement equivalence in a measure. Although there has been
some inconsistency in the
literature with regard to terminology, and number of necessary
tests and order of necessary tests
of invariance (Vandenberg & Lance, 2000), I will summarize
the tests described in Vandenberg
and Lance (2000) in the order in which they are recommended.
The test of configural invariance evaluates whether patterns of significant and non-significant factor loadings across groups are similar. The test of metric invariance (Λx = Λx′) is concerned with the equality of the values of factor loadings across groups. The test of scalar invariance (τx = τx′) indicates whether intercepts on the latent variable are equivalent across groups. The test of the invariance of the unique variances across groups (Θδ = Θδ′) examines whether like items’ uniquenesses are equivalent between groups. The test of factor variance invariance (Φ = Φ′) examines whether respondents in different groups utilized a similar range of responses along the answer continuum. Vandenberg and Lance (2000) also discuss the test of equal factor covariances, although they do not find this test useful.
There have been several sets of step-by-step recommendations for
undertaking the
aforementioned tests (e.g., Steenkamp & Baumgartner, 1998; Vandenberg & Lance, 2000).
While different sequences of testing have been reported, the
overarching idea is that each test is
undertaken sequentially, and at each step, more restrictive
constraints are added to the model
(e.g., setting equal factor loadings on like items across
groups). The more restricted model is
compared in terms of goodness of fit (i.e., χ2 value and other
goodness of fit indices) to the less
restricted or baseline model. The source of lack of measurement
invariance is indicated at the
level of testing when the more restricted model does not meet
acceptable standards for goodness-
of-fit.
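The sequential comparison of nested models is typically carried out as a chi-square difference test. The Python sketch below is not from the thesis; all fit statistics in it are invented for illustration, and the 5% critical value is taken from a standard chi-square table.

```python
# Illustrative chi-square difference test between nested invariance models;
# the chi-square values and degrees of freedom below are invented.
chisq_configural, df_configural = 120.4, 48   # less restricted (baseline) model
chisq_metric, df_metric = 131.9, 53           # loadings constrained equal across groups

delta_chisq = chisq_metric - chisq_configural  # change in chi-square
delta_df = df_metric - df_configural           # change in degrees of freedom

critical = 11.07   # 5% critical value for 5 df, from a chi-square table
reject = delta_chisq > critical                # True -> constrained model fits worse
print(round(delta_chisq, 1), delta_df, reject)
```

A significant difference at a given step flags that step's constraints (here, equal loadings) as the locus of noninvariance, which is the logic of the sequential procedure described above.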
Connections between CFA and IRT
Raju, Laffitte, and Byrne (2002) compared and contrasted CFA and
IRT techniques for
evaluating measurement invariance. They reported that CFA and
IRT methods are alike in that
they are both concerned with determining whether true scores are
equivalent across groups for
respondents with equal amounts of the latent trait. The two
techniques similarly allow for group
differences in the distributions of scores across subgroups. CFA
and IRT also both give
information about the source and extent of any lack of
measurement invariance identified.
While CFA and IRT are alike in certain respects, Raju, Laffitte,
and Byrne (2002) also
noted several differences between the two techniques. Primarily,
CFA modeling assumes a
linear relationship between the construct and items, while IRT
assumes a non-linear relationship.
They found that with dichotomously scored items, the IRT
approach is a more appropriate model
in terms of expressing the relationship between the measured
variable and the continuous latent
construct. They found that the literature describing CFA models
is more advanced in terms of
simultaneously looking at multiple latent constructs and
multiple populations, although the IRT
literature is progressing in this direction. While CFA has been
used to examine equivalence of
error variances across populations, IRT methods have not in
practice examined an equivalent
statistic: the invariance of the standard error of measurement
for θ across populations of
respondents. The CFA framework, on the other hand, does not have
a means for determining the
probability of a respondent with a given θ selecting a
particular response category. These two
techniques, when used in concert, can provide complementary information (Meade & Lautenschlager, 2004; Reise et al., 1993).
Several connections between the statistical frameworks of IRT
and CFA models have
been discussed recently. In particular, Takane and de Leeuw
(1987) provided proofs showing
that a two-parameter normal ogive IRT model is equivalent to a
CFA with dichotomous
variables. Researchers have built on this parallel to
investigate DIF using a multiple-
indicator/multiple cause (MIMIC) model. The MIMIC model is a
special case of the factor
analysis model which includes causal variables. Work by Muthén
et al. (1991) provided
equations for converting MIMIC model parameters to IRT
discrimination and difficulty
parameters. MacIntosh and Hashim (2003) presented a procedure
for converting standard errors
for these parameters from the MIMIC model parameters.
CHAPTER 2
LITERATURE REVIEW
The MIMIC Model
This paper is primarily concerned with demonstrating the use of
MIMIC modeling to
identify DIF. The MIMIC model is an SEM-based alternative to the
multiple-sample CFA
analysis described earlier. In a MIMIC model, one or more
grouping (or background) variables
function simultaneously as contributors to differences in the
latent trait and as covariates upon
which the outcome variables are regressed (Muthén, 1989). In
this study, one dichotomous
grouping variable (gender) is included in the model.
The MIMIC model with a dichotomous grouping variable can be
expressed as
η = γx′x + γz′z + ζ, (9)
where η is the latent trait, x represents observed background
variables, z represents a dummy
variable, and ζ represents an error term, which is normally
distributed and independent of x and
z (MacIntosh & Hashim, 2003). MIMIC modeling with
categorical data, like IRT-based DIF
detection, includes comparing a latent response variable (yj*) to a threshold (τj). If the threshold is exceeded (yj* > τj), then the indicator of the latent response variable (yj) is one. If the threshold is not exceeded, yj is zero.
The latent response variable (yj*) can be modeled as a combination of the indirect effects through the latent trait variable (η) and the direct effects of the dummy variable (zk), which is a measure of the grouping variable (e.g., gender) that is being examined for potential contributions to DIF:
yj* = λjη + βjzk + εj, (10)

where λj is the factor loading, βj is the slope relating the grouping variable to the response variable, and εj is the random error (Finch, 2005).
The procedure for assessing DIF using MIMIC modeling entails
estimating the direct and
indirect effects of group membership on the latent trait and
item response (Finch, 2005).
Significant indirect effects of group membership on the item
indicate that differences in item
response are influenced by group differences on the mean of the
latent factor. For example,
assessing indirect effects can indicate whether a greater level
of the latent trait “aggression” in
boys contributes to higher endorsement of items on an aggression
scale.
Significant direct effects between the grouping variable and the item indicate that group membership directly impacts item response, apart from any group difference on the latent trait.
Evaluating direct effects, after controlling for indirect
effects, is the procedure for assessing
uniform DIF. For example, this procedure can indicate whether
boys are endorsing higher levels
of aggression items after controlling for differences in the
underlying trait “aggression” (Finch,
2005).
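The distinction between a group effect routed through η and a direct effect on the item can be seen in a simulation of Equation 10. The Python sketch below is not from the thesis (the analyses there use Mplus), and all parameter values are invented; purely for illustration, η is treated as observed so that an ordinary regression can recover the direct effect, whereas in an actual MIMIC analysis η is latent and estimated within the SEM.

```python
import numpy as np

# Illustrative sketch (invented parameters) of uniform DIF in a MIMIC setup:
# y* = lambda * eta + beta * z + eps (Equation 10). A nonzero beta is a direct
# effect of group membership z on the item, over and above the latent trait eta.
rng = np.random.default_rng(1)
n = 50_000

z = rng.integers(0, 2, n)               # grouping variable (e.g., gender)
eta = 0.4 * z + rng.normal(0, 1, n)     # groups also differ on the latent trait
lam, beta = 0.9, 0.5                    # beta != 0 -> uniform DIF on this item
y_star = lam * eta + beta * z + rng.normal(0, 1, n)

# Regress y* on eta and z jointly: the z coefficient isolates the direct
# (DIF) effect after controlling for the indirect effect through eta.
X = np.column_stack([eta, z, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y_star, rcond=None)
print(np.round(coef[:2], 2))            # close to the generating values 0.9 and 0.5
```

Had beta been set to zero, the z coefficient would estimate approximately zero even though the groups differ on η, mirroring the case of a group mean difference without DIF.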
Advantages of the MIMIC model
There are several unique advantages to using a MIMIC model to
identify DIF. A MIMIC
model can be used to simultaneously obtain estimates of group
difference in item response (DIF)
and amount of the latent trait. That is, the MIMIC model
provides information about the
structural model and the measurement model (Muthén, 1989). MIMIC
models can be estimated
for data of ordinal or continuous scales, data with multiple
grouping variables (including
grouping variables with more than two groups), and data with
multiple independent variables,
including categorical or continuous variables (Glöckner-Rist & Hoijtink, 2003; Muthén, 1988).
Whereas IRT models require large sample sizes (Reise, Widaman,
& Pugh, 1993), MIMIC
models can accommodate smaller sample sizes. Finally,
researchers may be more familiar with
CFA-based procedures than analyses requiring knowledge of IRT
and IRT software.
There are also some disadvantages of MIMIC modeling cited in the
literature. Jones
(2006) notes that MIMIC modeling cannot account for a guessing
parameter, unlike IRT-based
methods. However, in the behavior rating scales evaluated in
this study, guessing is unlikely.
Another disadvantage is that a single-group MIMIC model can only
identify uniform DIF,
although this study utilizes a multiple-group MIMIC approach,
which is able to test for uniform
and non-uniform DIF.
DIF Detection using MIMIC Models: A Comparison to IRT Models
Finch (2005) conducted Monte Carlo simulations to compare MIMIC model detection of DIF to SIBTEST, IRT LR, and the Mantel-Haenszel (MH) statistic. He evaluated Type I error rate and power under several sets of conditions, varying the size of the reference group (100 or 500), the number of items (20 or 50), the parameter model assessed (three-parameter logistic model or two-parameter logistic model), the amount of DIF contamination in the anchor items (none or 15%), and the amount of DIF present in the target item (0 or .6). The study conditions
were completely crossed, and each combination of specifications
was tested with 500
replications.
He found that the MIMIC model performed as well as or better than traditional DIF detection methods under some, but not all, conditions. Specifically, the MIMIC model performed well on the 50-item test and when the two-parameter logistic model was used. For the 50-item test, the MIMIC model was more resistant than the other techniques to Type I error inflation when DIF contamination was present in the anchor items. However, in the case where the exam was short (20 items) and the three-parameter logistic model was used, the MIMIC model had an undesirably high rate of Type I error.
In terms of power, the MIMIC model performed well. Power was especially good when the exam had 50 items and when the two-parameter logistic model was used. Under these conditions, the MIMIC model matched or exceeded the power of the other techniques. Specifically, the MIMIC model outperformed the MH statistic and SIBTEST when the reference group was smaller (100) and the level of DIF contamination in the anchor items was higher (15%).
Prior Studies Using MIMIC Modeling to Detect DIF
Several prior studies have used MIMIC modeling to check for DIF.
Grayson et al. (2000)
used a MIMIC model to assess a depression scale for uniform DIF
associated with demographic
(e.g., age), disability and physical disorder variables. They
conducted their analyses in three
steps. First, a confirmatory factor analysis tested the
acceptability of the structural model.
Second, each of the predictor variables (demographic,
disability, physical disorder) was
introduced into the model serially. Each model included an
indirect path from the predictor to
each item via the latent variable as well as a direct path to
each item. Notably, the direct paths
from the predictor to the items were estimated in the same model
and not sequentially. One
referent item was constrained to have zero bias for model
identification reasons. For each
predictor, the researchers flagged significant parameter
estimates for these paths linking items to
the predictor. The researchers frame this step as a screening
procedure. That is, they were
interested in identifying predictors that had no significant
direct effects on the items, and were
thus unlikely to be contributing to DIF. These predictors were
then eliminated from the final
model. In the third and final step of the analyses, each of the
significant predictors was added
into the model together, and the resulting multivariate model
was estimated. The researchers
were again interested in identifying significant effects from
the predictor to the items, although
in the multivariate model, all other predictors are held
constant in the estimation procedure.
The researchers conducted the analysis using maximum likelihood
parameter estimates.
The goodness of fit index (GFI), the Tucker-Lewis index (TLI),
and the root mean square error
of approximation (RMSEA) values assessed model fit. The
researchers were concerned about
violations to multivariate normality; therefore they used
bootstrapping to obtain confidence
intervals. The researchers determined that a biased loading
parameter estimate that exceeded
twice its standard error (z score of 2) was statistically
significant in this context. The loadings on
the latent variable required a z score of 1.5 to reach
significance.
In reporting the results, the researchers partitioned the
effects of each predictor into a bias
effect (i.e., the sum of the direct effects from the predictor
to the items) and an actual effect (i.e.,
the sum of the direct effects from the latent variable to the
items, multiplied by the effect from
the predictor to the latent variable). These effects were
compared to the critical ratios described
above (2 for bias parameters, 1.5 for direct effect from the
predictor to the latent variable). Items
that exceeded the bias cut-off on one or more predictors were
identified as exhibiting DIF.
Gallo et al. (1994) used MIMIC modeling to identify age-related
uniform DIF in a
depression scale with marital status, minority status, cognitive
status, and recent unemployment
in the model. The latent variable, depression, was regressed on
each covariate. The analysis was
conducted by successively testing for significant parameter
estimates between age and each item.
Analyses were conducted using LISCOMP’s limited-information
generalized least
squares estimator for dichotomous response. Model fit was
reported using the descriptive fit
value, the goodness-of-fit index, the adjusted goodness-of-fit
index, and the critical number.
Christensen et al. (1999) examined a depression scale and an
anxiety scale for age-related
uniform DIF using a MIMIC model. They evaluated a two-factor
measurement model including
five demographic covariates (age, sex, marital status, financial
status, and level of education).
First, the researchers conducted a CFA to assess the fit of the
measurement model.
Next, the researchers conducted analyses to select a referent
item for the substantive DIF
analysis. For identification purposes, one item must be selected as the referent (i.e., no-DIF) item. The researchers tested each item individually (i.e., testing the significance of the direct paths to the item from each covariate consecutively) to determine which items showed no DIF across all of the covariates.
associations with the covariates
was selected as the referent.
Finally, the substantive model was analyzed. The model included
paths from each
covariate to both latent variables, and to each item. This model
was similar to the model
described in Grayson et al. (2000), in that DIF was estimated
for each of the items
simultaneously (i.e., paths from the covariate to each of the
items were included in one model).
Analyses were conducted using maximum likelihood parameter
estimates using Amos
3.6.1. Model fit was reported using the goodness-of-fit index
(GFI), the non-normed fit index
(NNFI) and the root mean square error of approximation (RMSEA).
The researchers were
concerned about violations to multivariate normality; therefore
bootstrapping was used to obtain
confidence intervals.
In an article that demonstrated a procedure for calculating the
standard error of the
estimates of IRT difficulty and discrimination from MIMIC model
parameters, MacIntosh and
Hashim (2003) employed MIMIC modeling to identify uniform
gender-related DIF on a scale
measuring racial prejudice. They included gender in the model as
the predictor variable, along
with three other covariates (educational status, political
conservatism, and religious
fundamentalism). The researchers conducted their analyses in two
steps. First, they ran the
model with a path from gender to the latent variable and with
one path directly from gender to an
item. The researchers obtained the squared multiple correlation for the latent variable (R²) from the output of this analysis. This value was used to set the residual variance of the latent variable for the second run to 1 − R². Once this variance was set, the model was run sequentially to test each item for DIF in the same manner as Gallo et al. (1994); that is, with a parameter estimated from gender to each item.
The Mplus program was used to run the analysis. The researchers
reported model fit
using the chi-squared value.
Gender-related DIF in Measures of Aggression
Obtaining accurate measures of self-reported aggression in
adolescents is of interest to
researchers, educators, and policy-makers alike. Aggression
scales have been used to calculate
the prevalence of aggression in schools (e.g., Nansel et al.,
2001) and as outcome variables for
evaluating the impact of violence-prevention programs (e.g.,
Farrell, Meyer, & White, 2001).
Evaluating gender differences in levels of aggression and type of perpetration (e.g., physical, relational) has been a topic of particular consideration in the literature. Boys have generally scored higher than girls on measures of physical aggression (e.g., Bongers et al., 2004; Broidy, Nagin, Tremblay, Bates, Brame, Dodge, et al., 2003), although researchers have recently argued that adolescent aggression is not constrained only to physical acts. Notably, Crick and Grotpeter (1995) coined the term “relational aggression” to encompass behaviors that are purposefully damaging to the victim’s peer relationships. Crick and Grotpeter (1995) argue that these aggressive behaviors are more common in girls than physical aggression.
While measuring gender differences on different types of
aggression has been a topic of
considerable interest, there are significant gaps in the
literature regarding the validity of the
instruments used, particularly in terms of measurement
invariance. Typically, researchers have
evaluated gender differences in aggression by comparing means
(e.g., Crick & Grotpeter, 1995;
Björkqvist, Lagerspetz, & Kaukiainen, 1992). As discussed
earlier, a simple means comparison
may not be interpretable without first obtaining evidence of
measurement invariance.
The purpose of this study is to evaluate the measurement invariance of two measures of aggression (physical and relational) across gender groups in order to determine whether differences in self-reported aggression are due solely to differences in the latent trait, or whether differences are partially due to factors related to measurement.
This topic may be of particular
importance because the wording of items on the aggression scale
may lend itself to differential
item functioning. In particular, the Crick and Grotpeter (1995)
relational aggression scale was
developed with the aim of capturing behaviors that they
theorized were common in girls. If the
authors were particularly concerned with female-oriented
behavior, the wording of items may
reflect subtle gender-related bias. On the other hand, because
physical aggression is more
commonly associated with boys, measures of physical aggression
may include an over-
representation of items that are more salient to male
respondents.
CHAPTER 3
PROCEDURE
Sample
Data for this study are from the GREAT Schools and Families project (GSF), a seven-year, multi-site violence prevention project described in detail in a supplement to the American Journal of Preventive Medicine (Horne, 2004). The data used here comprise the Spring 2004 student survey responses of a randomly selected sample of students from
one cohort at the University of
Georgia (UGA) site. Students attended one of nine middle schools
in Northeast Georgia. Of the
719 students who were eligible to participate in GSF, 623 (87%)
completed the Spring 2004
student assessment. At this assessment wave, all respondents
were in 7th grade, unless they had
been retained.
The sample was 49% female. Of the 612 students who selected one
race, 53% were
white, 34% were black, less than 1% were American Indian or
Asian Indian, 2% were other
Asian, and 10% were some other race. There were 14 students (2%)
who selected more than one
race. Twelve percent of students were Hispanic. Students ranged
in age from 12 to 15.
Participants took the survey via computer assisted survey
interview (CASI) using laptop
computers. Participants wore headphones, and the CASI program
read the questions aloud.
Respondents recorded their answers via the keyboard. All surveys
were proctored by GSF staff.
Students were assigned an ID number, and the survey was
confidential.
This study examines three types of behaviors: physical
aggression, relational aggression,
and overt victimization.
Instrumentation
The items that are examined in this study are taken from the Problem Behavior Frequency Scale, a collection of 47 items grouped into 7 subscales that
assess the 30-day frequency of
problem behaviors, such as aggression and delinquency. This
study is concerned with three
subscales (henceforth called scales): physical aggression,
relational aggression, and overt
victimization.
Physical aggression is a 7-item scale that measures
self-reported physical aggression in
the past 30 days. The stem is, “In the past 30 days, how many
times have you,” followed by
descriptions of physical aggression (e.g., hitting) or other
serious aggressive behavior (e.g.,
threatening someone with a weapon). Relational aggression is a
6-item scale that measures self-
reported relational aggression in the past 30 days. The stem is,
“In the past 30 days, how many
times have you,” followed by descriptions of relational
aggression, such as spreading a false
rumor about someone. Overt victimization is a 6-item scale that
measures self-reported
victimization in the past 30 days. The stem is, “In the past 30
days, how many times has this
happened to you,” followed by descriptions of victimization
(e.g., been pushed).
For each item, the six Likert-type response categories range
from “never” to “20 or more
times.” These responses are coded from one to six, respectively,
and the scale score is the
average of the item scores.
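The coding and scoring rule can be written out directly; a minimal sketch (the “1-2 times” and “3-5 times” labels are assumptions, since the text names only “never,” “6-9 times,” “10-19 times,” and “20 or more times”):

```python
# Map the six Likert-type response categories to codes 1-6.
# The two intermediate labels are assumed, not quoted from the instrument.
CATEGORIES = ["never", "1-2 times", "3-5 times", "6-9 times",
              "10-19 times", "20 or more times"]
CODES = {label: i + 1 for i, label in enumerate(CATEGORIES)}

def scale_score(item_responses):
    """Scale score: the average of the item codes (1-6)."""
    codes = [CODES[r] for r in item_responses]
    return sum(codes) / len(codes)

# A respondent answering "never" to five items and "1-2 times" to one:
print(round(scale_score(["never"] * 5 + ["1-2 times"]), 2))  # 1.17
```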
In terms of reliability, each scale demonstrated acceptable
internal consistency based on
the current dataset (physical aggression, α = .85; relational
aggression, α = .81; overt
victimization, α = .86). High internal consistency indicates
that respondents demonstrated
consistency in answer selection across subsets of items (Crocker
& Algina, 1986, p. 135). These
values are consistent with or higher than other reported internal consistencies calculated from other data sources (Farrell et al., 2001; Crick & Bigbee, 1998; Orpinas & Frankowski, 2001).
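Coefficient α, the internal-consistency index reported above, is computed from the item variances and the variance of the total score. A minimal sketch with hypothetical responses, not the study's data:

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(col) for col in items) / variance(totals))

# Three items answered by four hypothetical respondents:
item1 = [1, 2, 4, 5]
item2 = [1, 3, 4, 6]
item3 = [2, 2, 5, 5]
print(round(cronbach_alpha([item1, item2, item3]), 2))  # 0.97
```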
In general, published validity evidence on the scales used is
scarce. In many cases, the
internal consistency of the measure is offered as the only
indicator of validity evidence (see
Dahlberg et al., 2005). More comprehensive validity evidence is
available for one of the
measures used. The physical aggression scale was adapted in part
from Orpinas and
Frankowski’s 2001 Aggression Scale. Validity studies conducted
on this scale using three
samples of middle school students indicated that scores on the
Aggression Scale were
significantly and positively related to teacher-rated aggression and to self-reports of drug use, weapon-carrying, and injuries due to fights (Orpinas & Frankowski, 2001).
Computer Program
The data were analyzed using SPSS Version 16.0 and Mplus Version 5 (Muthén &
Muthén, 2007). Syntax for all analyses is provided in Appendix
A.
Detection of Gender-related DIF
Gender-related DIF detection was conducted on the two scales of
interest: physical
aggression and relational aggression. A measure of victimization
was included in the model as a
covariate. Victimization scores were categorized into three
groups: no victimization reported
(30% of respondents), one instance of victimization reported
(13% of respondents) and more
than one instance of victimization reported (57% of
respondents). By adding victimization to the
model, it is possible to evaluate gender-related DIF while adjusting for the effect of
victimization on the aggression measure. Experiences of victimization may partially account for the relationship between gender and the tendency to endorse items on the aggression scale. In other
words, differences in victimization may explain some of the
gender differences in item
endorsement that I would have otherwise attributed to DIF.
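The recoding described above can be sketched as follows (the function and category labels are mine; the percentages in the comments are those reported in the text):

```python
def victimization_group(instances: int) -> str:
    """Collapse a count of reported victimization instances into the
    three covariate categories used in the model."""
    if instances == 0:
        return "none"           # 30% of respondents
    elif instances == 1:
        return "one"            # 13% of respondents
    else:
        return "more than one"  # 57% of respondents

print([victimization_group(n) for n in (0, 1, 5)])
# ['none', 'one', 'more than one']
```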
Two approaches were used to evaluate the measurement invariance
of the aggression
measures. First, the MIMIC approach was used to test each item
for DIF. Second, I employed
multiple group analysis to obtain a test of the overall
invariance of the measures across groups.
Each of the procedures was conducted for the physical aggression
scale and the relational
aggression scale independently.
Multiple-Indicator/Multiple Cause Modeling
Single group MIMIC modeling was used to identify items that
exhibit DIF. For
identification purposes, the mean of the latent variable was set
to zero and the variance set to one
according to a procedure documented in MacIntosh and Hashim
(2003). To set the latent mean
to zero, the exogenous variables (i.e., gender, victimization)
are mean centered. To set the latent
factor variance to one, a two-step procedure was demonstrated in
MacIntosh and Hashim (2003).
First, the model is run with no constraints on the variance of
the latent factor to obtain the R2
value for the latent factor. Second, the model is estimated again with the variance of the latent factor set to 1 − R². I received a warning when running the first step of the procedure regarding the identification of the model, and the program was not able to calculate the R² value for the
latent factor. In order to identify the model for the first
step, I set the variance of the latent
variable to 1. After re-running the model, I was able to obtain
the R2 value for the latent factor.
The second step of the model was run exactly as outlined above
and no further problems were
encountered.
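The logic of this constraint can be checked with simple arithmetic (the R² value below is hypothetical, not an estimate from this study): with mean-centered covariates, the variance of the latent factor equals the explained portion (R²) plus the residual variance, so fixing the residual variance at 1 − R² forces the total variance to one.

```python
# Illustration of the MacIntosh and Hashim (2003) identification logic
# with a hypothetical value (not an estimate from this study).

# Step 1: run the unconstrained model and read off the R-squared for the
# latent factor, i.e., the proportion of its variance explained by the
# mean-centered covariates (here, gender and victimization).
r_squared = 0.25  # hypothetical value from a first run

# Step 2: re-run with the residual variance of the factor fixed at 1 - R^2.
residual_variance = 1 - r_squared

# The total latent variance is explained variance plus residual variance;
# for a unit-variance factor, the explained share is R-squared itself.
explained_variance = r_squared
total_variance = explained_variance + residual_variance
print(total_variance)  # 1.0, i.e., the latent factor is standardized
```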
First, a baseline model containing no direct effects from gender
to item responses (i.e., no
DIF) was analyzed. Girls are coded zero, and boys are coded one.
Model fit was evaluated
using the chi-square (χ²) statistic along with two other fit indices: the root mean square error of approximation (RMSEA) and the comparative fit index (CFI). The χ² test evaluates the hypothesis that the original covariance matrix and the estimated or reproduced covariance matrix are identical. However, the χ² statistic has been shown to be sensitive to trivial differences among these matrices under large sample sizes (Bentler, 1990). In other words, significant p-values may be obtained even in cases where the model fits the data well. In order to obtain a more comprehensive view of model fit, additional indices are typically reported alongside the χ² statistic.
The RMSEA is a stand-alone fit index that adjusts for the
complexity of the model,
favoring parsimony. It is a standardized measure of the degree
to which the population data do
not fit the model. Hu and Bentler (1998) suggest that values of
.06 or lower are indicative of
good fit. The CFI is an incremental fit index that compares the
amount of non-centrality in the χ2
distribution of the specified model to a baseline (null) model.
Hu and Bentler (1998) recommend
using .95 (or above) as a cut off for good model fit for the
CFI.
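For reference, the RMSEA point estimate can be computed directly from a model's χ², degrees of freedom, and sample size. The sketch below uses the standard point-estimate formula (not Mplus output), applied to the baseline physical aggression model reported in the Results (χ²(20) = 73.308, N = 621):

```python
import math

def rmsea(chi_sq: float, df: int, n: int) -> float:
    """Point estimate of the root mean square error of approximation."""
    return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))

# Baseline (no-DIF) physical aggression model: chi-square(20) = 73.308, N = 621.
print(round(rmsea(73.308, 20, 621), 3))  # 0.066, just above the .06 cutoff
```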
Weighted least squares means and variances (WLSMV) estimation
was used for all
analyses. WLSMV is a robust weighted least squares (WLS) estimator which, like WLS, relies on an asymptotic covariance matrix, making both estimators attractive options for use with non-normal data. However, WLS requires extremely large sample sizes and is thus not practical for most datasets. WLSMV is less computationally intensive than WLS, which results in smaller sample size requirements. WLSMV adjusts the mean and variance of the χ² value, along with parameter estimates and standard errors, to account for the level of non-normality in the data (Finney & DiStefano, 2006, p. 298).
Next, a series of models were run to test for uniform DIF. Each
item was tested
sequentially by adding a direct path (β) from gender to the item in the model. A significant parameter estimate for this direct path is indicative of uniform
DIF. That is, gender is explaining
differences in item means above and beyond that which is
explained via the indirect path of
gender to the item through the latent variable.
In order to test whether including DIF effects (β) results in significantly improved model fit, a DIF model was created by including all significant β’s in the model. The fit of the DIF model
was compared to the baseline model using a χ2 difference test.
It is important to note that under WLSMV estimation, the χ² values and degrees of freedom behave differently than under maximum likelihood (ML) based estimators. According to comments posted by Linda Muthén on the Mplus Discussion Board (2007), the χ² values and degrees of freedom obtained from WLSMV estimation cannot be interpreted in the same way as values obtained under ML estimation (where the degrees of freedom equal the number of elements in the covariance matrix minus the number of estimated parameters), because these values obtained
under WLSMV are adjusted to obtain accurate p-values. Similarly,
a χ2 difference test between
two nested models estimated using WLSMV cannot be conducted in a
straightforward manner
by subtracting the χ2 value and degrees of freedom of the
unconstrained model from the
constrained model. Instead, the DIFFTEST command in Mplus must
be used to obtain the
accurate p-value for this test.
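To make the caution concrete, the sketch below performs the naive ML-style subtraction using the physical aggression values from Table 4; the mismatch with the reported DIFFTEST result illustrates why simple subtraction is invalid under WLSMV:

```python
# Chi-square difference by naive subtraction, as one would do under ML
# estimation (values from Table 4, physical aggression scale):
baseline_chi, baseline_df = 73.308, 20  # no-DIF model
dif_chi, dif_df = 54.971, 18            # model with significant DIF paths

naive_delta_chi = baseline_chi - dif_chi  # 18.337
naive_delta_df = baseline_df - dif_df     # 2

# Table 4 reports the DIFFTEST result instead: delta chi-square = 23.711
# on 3 df. Neither number matches the naive subtraction, because WLSMV
# chi-square values and degrees of freedom are themselves adjusted to
# give accurate p-values and cannot simply be differenced.
print(round(naive_delta_chi, 3), naive_delta_df)  # 18.337 2
```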
Multiple Group Analysis
Multiple group analysis was used to further evaluate the
equivalence of the factor
structures of the physical and relational aggression scales
across gender. Multiple group
analysis considers the fit of the model as equality constraints
are placed on two single-gender
groups. For each scale, the data from each single-gender group
are first fit to a baseline or
configural model without equality constraints to determine
whether the same pattern of
significant and non-significant parameter estimates holds across
groups. If so, there is evidence
of configural invariance, and a more restrictive model is fit to the data. According to the
guidelines provided in the Mplus manual for conducting multiple
group analysis with ordered
categorical data, the tests of metric invariance and scalar
invariance described earlier are
combined into one test. That is, the factor loadings and thresholds are constrained to be equal across groups in one step,
“because the item probability curve is influenced by both
parameters” (Muthén & Muthén, 2007,
p. 399). Because the two models are nested, χ2 difference tests
are conducted to determine if the
added constraints result in significantly poorer fit.
CHAPTER 4
RESULTS
Descriptive Statistics
Means and standard deviations for the physical aggression and
relational aggression
scales are presented separately by gender in Table 1. Spearman
item intercorrelations, skewness,
and kurtosis, are provided in Table 2.
An examination of descriptive statistics (i.e., skewness,
kurtosis) revealed violations of
univariate normality in several items. Because these scales
measure the 30-day frequency of
aggressive behaviors, it is unsurprising that the responses will
be skewed toward “never.” Two
items on the physical aggression scale and four items on the
relational aggression scale had
skewness and kurtosis values outside of the acceptable range suggested by Kline (2005): |3| for skewness and |10| for kurtosis. Because these values fall
outside of the acceptable
range, these items cannot be assumed to be univariate normal
(D’Agostino, 1986), and the data
as a whole cannot be assumed to be multivariate normal.
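The screening rule can be sketched as follows (plain-moment formulas; the function names are mine, and statistical packages may apply small-sample corrections to these moments):

```python
import math

def skewness(xs):
    """Sample skewness: the third standardized moment."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)

def kurtosis_excess(xs):
    """Excess kurtosis: the fourth standardized moment minus 3."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum((x - mean) ** 4 for x in xs) / (n * sd ** 4) - 3

def flags_nonnormal(xs, skew_cut=3.0, kurt_cut=10.0):
    """Kline's (2005) screening rule: |skewness| > 3 or |kurtosis| > 10."""
    return abs(skewness(xs)) > skew_cut or abs(kurtosis_excess(xs)) > kurt_cut
```

For example, a frequency item answered “never” (code 1) by 95 of 100 hypothetical respondents and at the top category (code 6) by the remaining 5 is flagged, while a roughly uniform spread of codes 1 through 6 is not.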
Because of the non-normality in the data, I selected an
estimation method that was not
based on normal theory, weighted least squares means and
variances (WLSMV). As discussed
earlier, WLSMV accounts for the categorical nature of the data
and adjusts for non-normality,
specifically through weighted least squares parameter estimates, mean- and variance-adjusted χ² values, and scaled standard errors.
Prior to conducting the MIMIC and multiple group analyses, means and standard deviations of items were examined for gender differences. For
the physical aggression scale,
boys reported more aggression than girls across all items except
for one. The mean of item 3
(“threatened to hurt a teacher”) was slightly higher for girls
(M=1.17) than boys (M=1.11).
Overall, the item standard deviations were smaller for girls
than for boys. For the relational
aggression scale, item means for girls’ and boys’ scores were
quite similar, except for one item.
The mean of item 6 (“said things about another student to make
other students laugh”) was
higher for boys (M=2.43) than girls (M=2.08). All of the item
standard deviations were smaller
for girls than for boys.
Table 1
Physical and Relational Aggression Items: Means and Standard Deviations for Boys and Girls

                                                                  Boys           Girls
Scale / Item                                                      M      SD      M      SD
Physical Aggression
  1. Thrown something at another student to hurt them             1.87   1.283   1.59   0.999
  2. Been in a fight in which someone was hit                     1.81   1.253   1.41   0.910
  3. Threatened to hurt a teacher                                 1.11   0.498   1.17   0.705
  4. Shoved or pushed another kid                                 2.50   1.583   2.05   1.391
  5. Threatened someone with a weapon (gun, knife, club, etc.)    1.19   0.755   1.10   0.535
  6. Hit or slapped another kid                                   2.06   1.486   1.82   1.154
  7. Threatened to hit or physically harm another kid             1.80   1.392   1.56   1.133
Relational Aggression
  1. Didn't let another student be in your group anymore
     because you were mad at them                                 1.63   1.094   1.64   0.987
  2. Told another kid you wouldn't like them unless they did
     what you wanted them to do                                   1.19   0.701   1.20   0.656
  3. Tried to keep others from liking another kid by saying
     mean things about him/her                                    1.46   1.065   1.44   0.869
  4. Spread a false rumor about someone                           1.35   0.940   1.31   0.757
  5. Left another kid out on purpose when it was time to do
     an activity                                                  1.46   1.068   1.43   0.940
  6. Said things about another student to make other students
     laugh                                                        2.43   1.640   2.08   1.314
N = 621 (313 boys, 308 girls)
Table 2
Physical and Relational Aggression Items: Intercorrelations, Skewness, and Kurtosis

          Physical Aggression                     Relational Aggression
Item      1     2     3     4     5     6     7      1     2     3     4     5     6
PA 1      --
PA 2     .45    --
PA 3     .21   .21    --
PA 4     .48   .42   .20    --
PA 5     .36   .32   .31   .31    --
PA 6     .49   .44   .19   .65   .31    --
PA 7     .48   .43   .30   .50   .41   .54    --
RA 1     .34   .30   .16   .30   .11   .24   .29     --
RA 2     .30   .29   .22   .29   .23   .22   .28    .36    --
RA 3     .35   .31   .17   .38   .24   .33   .36    .38   .44    --
RA 4     .38   .30   .22   .36   .21   .30   .39    .36   .42   .53    --
RA 5     .35   .25   .16   .32   .19   .29   .36    .44   .36   .56   .44    --
RA 6     .41   .31   .13   .55   .26   .51   .48    .31   .26   .41   .31   .37    --
Skewness 2.11  2.23  5.63  1.21  5.69  1.68  2.20   2.31  4.59  3.02  3.62  3.05  1.31
Kurtosis 4.51  4.93 34.88  0.51 34.85  2.24  4.16   6.05 24.39 10.18 14.88  9.90  0.78
Outliers
The existence of outliers in data can be problematic. On one
hand, outliers can exert
disproportionate influence on results and can sometimes be the
result of data entry errors or
untruthful respondents. On the other hand, outliers can reflect
true variation in respondents (e.g.,
highly deviant behaviors) and are therefore of interest to the
researcher.
In order to screen for outliers, I used the macro given in
DeCarlo (1997). DeCarlo’s macro calculates the Mahalanobis distance for each observation and tests its significance as a multivariate outlier. Fifty-three observations, or 9% of the total
sample, met the criteria for multivariate
outliers, using the critical value at the .05 level of
significance. These outliers were not removed
from the dataset because I hypothesized that these values were
most likely due to true differences
in students and not errors in the dataset or untruthful
responses.
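DeCarlo's SPSS macro is not reproduced here, but the computation underlying it can be sketched. For multivariate-normal data, the squared Mahalanobis distance is distributed approximately χ² with degrees of freedom equal to the number of variables, so an observation is flagged when its distance exceeds the chosen critical value. A minimal two-variable illustration with an identity covariance matrix (function names are mine):

```python
def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance: (x - mean)' * cov_inv * (x - mean)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    # Row vector (x - mean)' times the inverse covariance matrix.
    t = [sum(d[i] * cov_inv[i][j] for i in range(len(d))) for j in range(len(d))]
    return sum(t[j] * d[j] for j in range(len(d)))

CHI_SQ_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}  # chi-square .05 critical values

# An observation at (3, 4) when the centroid is (0, 0) and the covariance
# is the identity matrix: squared distance = 3^2 + 4^2 = 25.
d_sq = mahalanobis_sq([3, 4], [0, 0], [[1, 0], [0, 1]])
print(d_sq, d_sq > CHI_SQ_CRIT_05[2])  # 25 True -> flagged as an outlier
```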
Missing Value Treatment
There was very little missing data in the dataset. In total,
only two students did not
answer every item. The Mplus program accommodates missing values
using full information
maximum likelihood estimation.
Physical Aggression Scale
Parameter estimates and standard errors for the MIMIC analysis
for the physical
aggression scale are presented in Table 3. For the baseline
(i.e., no DIF) model, the fit of the
physical aggression scale was adequate (χ²(20) = 73.308, p < .001). Three items had statistically significant direct effects (β) from gender: item 2 (“been in a fight in which someone was hit”), item 3 (“threatened to hurt a teacher”), and item 6 (“hit or slapped another kid”). These significant values provide evidence
that these items may exhibit
different measurement characteristics based on the gender of the
respondent.
Next, a DIF model was estimated by including all three significant β’s in the model. The fit of the model was adequate (χ²(18) = 54.971, p < .001).
Table 3
Multiple-Indicator/Multiple-Causes (MIMIC) Model Estimates for the Physical Aggression Scale (continued)

                        Estimate   SE      Est./S.E.
τ (thresholds)
  Y3  4                 2.290      0.157   14.593
      5                 2.560      0.196   13.052
  Y4  1                -0.259      0.055   -4.719
      2                 0.594      0.056   10.688
      3                 1.062      0.063   16.757
      4                 1.393      0.068   20.586
      5                 1.661      0.076   21.868
  Y5  1                -3.121      0.308  -10.120
      2                 1.608      0.118   13.592
      3                 2.002      0.139   14.417
      4                 2.234      0.163   13.725
      5                 2.357      0.178   13.244
  Y6  1                 0.055      0.053    1.039
      2                 0.882      0.061   14.535
      3                 1.320      0.070   18.943
      4                 1.628      0.080   20.344
      5                 1.829      0.090   20.401
  Y7  1                 0.532      0.057    9.323
      2                 1.142      0.068   16.809
      3                 1.526      0.075   20.248
      4                 1.704      0.081   21.114
      5                 1.901      0.087   21.822
β
  Y1                    0.023      0.082    0.286
  Y2                    0.364      0.095    3.817
  Y3                   -0.319      0.155   -2.061
  Y4                    0.101      0.077    1.318
  Y5                    0.008      0.165    0.050
  Y6                   -0.222      0.076   -2.903
  Y7                   -0.101      0.084   -1.204
γ
  Gender                0.136      0.081
  Victimization         0.519      0.046
Ψ
  Physical Aggression   0.297
Table 4
Sequential Chi Square Tests of Invariance for Physical and Relational Aggression Scales

Scale                  Analysis        Model      χ²        df    p      Δχ²      Δdf    p
Physical Aggression    MIMIC           Model 0     73.308   20   .000    --       --    --
                                       Model 1     54.971   18   .000    23.711    3   .000
                       Multiple Group  Model 2    119.991   22   .000    --       --    --
                                       Model 3    112.916   28   .000    37.041   15   .001
Relational Aggression  MIMIC           Model 0     56.385   15   .000    --       --    --
                                       Model 1     50.111   14   .000     6.154    1   .013
                       Multiple Group  Model 2     56.414   22   .000    --       --    --
                                       Model 3     35.598   18   .007     0.340    7   .999

Model 0: No direct DIF effects
Model 1: All significant DIF effects included
Model 2: No invariance constraints
Model 3: Invariance of loadings and thresholds
I next used multiple group analysis to further evaluate the
equivalence of the factor
structures across gender. Before beginning the process, it was
necessary to collapse the fifth and
sixth response categories (“10-19 times” and “20 or more times”) for item 2 (“been in a fight in which someone was hit”) because no females endorsed the fifth response category. In this case, the
program cannot estimate a threshold and the categories must be
collapsed. Because of problems
running the model (i.e., non-positive definite matrix),
victimization was removed from the
model. After removing victimization, no additional problems were
encountered running the
analyses.
In order to identify the model, I followed the recommendations
from the Mplus Manual
(Muthén & Muthén, 2007, p. 398). That is, for the baseline
model (i.e., test of configural
invariance), all of the item error variances are set to one, the
factor means are set to zero for both
groups, one item loading is set to one in both groups, and one
threshold per item is held invariant
(one additional threshold is held invariant for the item with
the factor loading set to one). It is
important to note that by fixing the reference loading to one in both groups and holding the reference thresholds equal across groups, these reference parameters are assumed to be invariant. The loading of the first
item was chosen as the reference
parameter because this item showed little evidence of lack of
invariance in the MIMIC analysis.
The highest threshold was chosen to be held invariant on the
theoretical grounds that there may
be less variation across gender on the top end of the scale
(i.e., differences between respondents
endorsing one of two answer categories corresponding to high
frequency of aggression) than at
the bottom (i.e., differences between respondents endorsing one
of two answer categories
corresponding to no aggression versus low frequency of
aggression).
Table 4 provides a summary of the sequential χ2 difference tests
for the multiple group
analysis. The fit of the baseline model was mediocre (χ²(22) = 119.991, p < .001). After the factor loadings and thresholds were held equal across groups, the fit of the constrained model was χ²(28) = 112.916, p < .001. The χ² difference test was statistically significant (Δχ²(15) = 37.041, p = .001), indicating that placing the equality constraints on the loadings and thresholds resulted in significantly poorer model fit; the physical aggression scale does not appear to exhibit full measurement invariance across gender.
Relational Aggression Scale
Parameter estimates and standard errors for the MIMIC analysis
for the relational
aggression scale are presented in Table 5. The fit of the
relational aggression scale for the
MIMIC analysis was adequate (χ²(15) = 56.385, p < .001). One item, item 6 (“said things about another student to make other students laugh”), had a statistically significant direct effect (β) from gender (estimate = 0.211, Est./S.E. = 2.479). A DIF model including this effect fit the data significantly better than the baseline model (Δχ²(1) = 6.154, p = .013).
Table 5
Multiple-Indicator/Multiple-Causes (MIMIC) Model Estimates for the Relational Aggression Scale

                          Estimate   SE      Est./S.E.
λ
  Y1                      0.730      0.035   20.561
  Y2                      0.852      0.040   21.320
  Y3                      0.960      0.024   40.630
  Y4                      0.856      0.036   23.545
  Y5                      0.905      0.027   33.667
  Y6                      0.641      0.033   19.639
τ (thresholds)
  Y1  1                   0.278      0.052    5.361
      2                   1.174      0.066   17.853
      3                   1.652      0.084   19.704
      4                   1.901      0.097   19.621
      5                   2.055      0.112   18.370
  Y2  1                   1.221      0.069   17.781
      2                   1.749      0.089   19.557
      3                   2.099      0.122   17.137
      4                   2.353      0.150   15.736
      5                   2.411      0.156   15.477
  Y3  1                   0.651      0.056   11.634
      2                   1.401      0.078   18.060
      3                   1.806      0.095   19.010
      4                   2.029      0.110   18.375
      5                   2.057      0.112   18.330
  Y4  1                   0.933      0.065   14.328
      2                   1.713      0.091   18.767
      3                   2.061      0.111   18.620
      4                   2.199      0.121   18.199
      5                   2.342      0.133   17.634
  Y5  1                   0.710      0.057   12.501
      2                   1.429      0.083   17.129
      3                   1.779      0.097   18.284
      4                   1.960      0.106   18.466
      5                   2.062      0.114   18.006
  Y6  1                  -0.300      0.052   -5.710
      2                   0.630      0.058   10.919
      3                   1.030      0.061   16.760
      4                   1.315      0.068   19.346
      5                   1.534      0.075   20.438
β
  Y1                     -0.092      0.086   -1.061
  Y2                     -0.015      0.114   -0.135
  Y3                     -0.121      0.083   -1.458
  Y4                      0.016      0.097    0.161
  Y5                     -0.013      0.091   -0.146
  Y6                      0.211      0.085    2.479
γ
  Gender                 -0.060      0.093   -0.649
  Victimization           0.468      0.052    9.067
Ψ
  Relational Aggression   0.175
--  No responses
I next used multiple group analysis to further evaluate the
equivalence of the factor
structures across gender. Similar to the physical aggression
scale, it was necessary to collapse
the 5th and 6th response categories (“10-19 times” and “20 or
more times”) for two items. For
item 2 (“told another kid you wouldn’t like them unless they did
what you wanted them to”), no
respondents endorsed the 5th response category. For item 3 (“tried to keep others from liking
another kid by saying mean things about him/her”) no girls
endorsed the 5th response category.
No other problems were encountered, and the model was estimated
with victimization in the
model.
The same procedures that were described earlier were followed to
identify the model for
multiple groups analysis. The first item loading was held equal
across groups, because this item
did not display evidence of DIF in the MIMIC modeling procedure.
The highest threshold for
each item was held invariant for the same reason described in
the physical aggression section.
Table 4 provides a summary of the sequential χ2 difference tests
for the multiple group
analysis. The fit of the baseline model was mediocre (χ²(22) = 56.414, p < .001; CFI = .98). After holding the factor loadings and thresholds equal across groups, the model fit adequately (χ²(18) = 35.598, p = .007; RMSEA = .056, CFI = .99).
The χ² difference test was not significant (Δχ²(7) = 0.340, p = .999). This result indicates that adding the additional constraints to the model does not result in a statistically significant decrease in model fit, which supports the hypothesis that the relational aggression scale does exhibit measurement invariance across gender.
CHAPTER 5
SUMMARY AND DISCUSSION
Summary
The purpose of this study was to evaluate two measures of
aggression in adolescents for
gender-related DIF. MIMIC modeling was used to test for direct
effects of gender on each item
in the two scales. Another procedure, multiple group analysis,
was used to provide additional
information about the measurement invariance of the scales
across gender.
For the physical aggression scale, three items displayed
evidence of DIF. The model that
did not include these DIF effects fit the data significantly
worse than the model that did include
the DIF effects. This finding that including the DIF effects
improves model fit suggests that there
are measurement differences between boys and girls on these
items. The results of the multiple
group analysis buttressed this finding. Placing equality
constraints on the loadings and
thresholds across the single-gender groups resulted in
significantly poorer model fit.
For the relational aggression scale, one item displayed evidence
of DIF. The model that
did not account for this DIF effect fit the data significantly
worse than the model that did include
the DIF effect, indicating a lack of invariance for this item.
Although there is evidence that one
item displays DIF, the results of the multiple group analysis
indicated that the scale as a whole
does exhibit gender-related measurement invariance. Imposing
equality constraints on the
loadings and thresholds of the two groups did not significantly
decrease the fit of the model in
comparison to the unconstrained model.
Discussion
While statistical procedures can flag items for DIF and identify
lack of invariance in
scales, it is up to the researcher to provide possible
explanations for the measurement
differences. Although it is not possible to conclusively speak
about the reasons for the
measurement differences, additional investigation of the flagged
items can provide some
tentative explanations of the findings.
In the physical aggression scale, item 2 (“been in a fight in
which someone was hit”)
exhibited DIF. The estimate of β for this item was .364. Because
boys are coded 1, this estimate
indicates that in the model either boys endorsed higher values
for this item due to gender above
and beyond the indirect effect of gender through latent physical
aggression, or alternately, that
girls endorsed lower values. In order to gain additional
information about patterns of responses
among boys and girls, I examined item intercorrelations and
response frequencies from the
marginal tables (not shown), although this information is not
based on the estimated parameters of the
structural equation model. An examination of item
intercorrelations did not reveal any
clear pattern of differences between item 2 and the other items
(i.e., item 2 correlations were
mostly consistent with the magnitude of the correlations among
other items). An examination of
the response frequencies contributes additional information.
Boys are less likely than girls to
choose the response “never” (56.9% to 77.3%, respectively).
There are differences at the top end
of the scale as well. While 4.9% of girls selected one of the
last three response categories (“6-9
times,” “10-19 times,” or “20 or more times”), 9.9% of boys
chose one of these categories. It is
possible that boys are more likely than girls to endorse higher
levels of this item for reasons
other than increased physical aggression. For example, boys may
be more likely to engage in
play fighting, which may result in one of the participants
getting hit. Although this behavior can
be considered aggressive, it is different in nature than the
malevolent aggression that this scale is
intended to measure.
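Tabulating marginal response frequencies by group, as done above, is straightforward. The sketch below uses hypothetical responses on the study's frequency scale, not the study's actual data, and only standard-library tools.

```python
from collections import Counter

def response_percentages(responses):
    """Percentage of respondents choosing each response category."""
    counts = Counter(responses)
    n = len(responses)
    return {category: 100 * count / n for category, count in counts.items()}

# Hypothetical item responses (made-up data, not the study's responses).
boys = ["never"] * 6 + ["1-2 times"] * 3 + ["20 or more times"]
girls = ["never"] * 8 + ["1-2 times"] * 2

print(response_percentages(boys)["never"])   # 60.0
print(response_percentages(girls)["never"])  # 80.0
```

Comparing such tables across groups can suggest where in the response scale the two distributions diverge, complementing the model-based DIF results.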
Item 3 (“threatened to hurt a teacher”) also exhibited DIF. The
estimate of β for this item
was -.319, indicating that in the model either girls were more
likely or boys were less likely to
endorse higher values for this item due to gender-related
factors beyond differences in latent
physical aggression. Examining the item intercorrelation matrix
revealed that the correlations
for this item were somewhat lower than those for the other items on the
physical aggression scale. In terms of
response category distribution from the marginal table, the
pattern for girls and boys looks quite
similar. However, more girls than boys indicated that they had
threatened to hurt a teacher 10-19
or 20 or more times (2% versus .6%, respectively). While the
reason this item exhibits DIF is unclear, boys who are otherwise
highly aggressive may be less likely
to threaten, or to admit to threatening, teachers. This could
possibly be because many teachers are
female and there is a strong social norm against males hitting
or threatening females in the
Southern U.S.
Item 6 (“hit or slapped another kid”) was also flagged for DIF.
The β estimate for this
item was -.222, which indicates that in the model females may be
more likely or males may be
less likely to select higher values on this item due to
measurement differences rather than true
differences in physical aggression. Examining the item
intercorrelation matrix revealed that the
correlations for this item were fairly consistent with the
correlations between the other items. In
terms of response category frequencies from the marginal table,
substantial differences exist in
the high end of the scale. Whereas 2.6% of girls reported hitting
or slapping another kid 20 or
more times, 7% of boys reported this behavior. It is possible
that the word “slapping” resonates
more with girls, as this behavior in our culture is more often
associated with females.
For the relational aggression scale, item 6 (“said things about
another student to make
other students laugh”) demonstrated DIF. The β estimate for this
item was .211, which indicates
that males may be more likely or females may be less likely to
select higher values on this item
than would be expected based on their level of latent relational
aggression. Interitem correlations
for this item were somewhat lower than correlations among other
items on the relational
aggression scale. In terms of response category frequencies from
the marginal table, boys were
more likely than girls to select the response “20 or more times”
(9.9% versus 5.5%,
respectively). It is possible that for adolescent boys, the act
of making fun of others is a more
common experience that is not necessarily always tied to a
malevolent or aggressive motive.
One important aspect of testing for measurement invariance is
deciding what to do if full
measurement invariance is not supported. It may not be practical
to remove all items from a
scale that exhibit DIF. Hambleton (2006) suggests that it is
common to find many items that
display DIF in psychological, attitudinal, or personality
measures. He notes that it is possible
that these effects, taken as a whole, may cancel each other out.
That is, while some items may
exhibit DIF in one direction, others counter those effects, so
overall the measure does not favor
one group over the other.
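Hambleton's cancellation argument can be illustrated with the three direct effects estimated for the physical aggression scale in this study. Summing the β's as a rough index of net scale-level DIF is a simplification (it ignores loadings, thresholds, and item metrics), but it conveys the idea:

```python
# Direct-effect (beta) estimates for the three flagged physical
# aggression items, as reported in this study (boys coded 1).
betas = {"item 2": 0.364, "item 3": -0.319, "item 6": -0.222}

# Crude net-DIF index: positive favors boys, negative favors girls.
net_dif = sum(betas.values())
print(round(net_dif, 3))  # -0.177: the effects largely offset one another
```

Because the positive and negative direct effects partially cancel, the scale-level impact of DIF may be smaller than the item-level results alone would suggest.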
It is also possible that while statistically significant DIF
effects may exist in the model,
the impact of these effects is not of practical importance. With
larger sample sizes the power to
detect DIF increases. Because the present dataset is reasonably
large, it is possible that some of
the significant DIF effects correspond to measurement
differences that are not practically
meaningful. In future studies, it may be desirable to examine
effect size measures of the DIF
items.
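One simple, hedged possibility for such an effect size, sketched below, is to scale the direct effect by the standard deviation of the latent response, analogous to a standardized mean difference. This is only one of several possible definitions and is not a procedure used in this study.

```python
import math

def standardized_dif(direct_effect, latent_response_var):
    """Express a MIMIC direct effect in latent-response SD units."""
    return direct_effect / math.sqrt(latent_response_var)

# With probit-type scaling the latent response variance is often fixed
# near 1, so the standardized effect is close to beta itself.
print(round(standardized_dif(0.364, 1.0), 3))  # 0.364
```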
Limitations and Future Research
There are several limitations inherent to the research presented
in this paper. An
important limitation is that for all of the tested models, fit
was less than ideal. While the fit of
the MIMIC models was adequate, the fit was mediocre for the
multiple group analysis models,
especially for physical aggression. When model fit is borderline
in terms of acceptability, it
becomes harder to draw conclusions based on those models. For
example, when overall fit is poor, the result of a χ2
difference test may reflect that general misfit rather than lack of
fit attributable solely to the introduction of equality constraints. The less
than ideal fit of the models for
multiple group analysis may be due to constraints that are imposed
on the data for identification
purposes. This possibility is discussed later in this
section.
Another problem is that the generalizability of the findings may
be limited by the study
sample. All of the participants were middle school students from
one region of the U.S. While
in this population the aggression measures examined may not
demonstrate full measurement
invariance, this result may not hold for other groups of
students, or for the same students at a
different point in time (e.g., in high school).
Another limitation concerns the estimation method. Although WLSMV
was likely the most
appropriate estimator given the categorical and
non-normal nature of the data,
it also has some disadvantages. First, WLSMV
is a relatively new
estimation method, and little research has
examined its functioning. Overall,
however, research has supported its use as an improvement upon
WLS estimation (Muthén,
1993).
WLSMV also poses challenges for identifying the model
when thresholds are
estimated. In multiple group analysis, one threshold per
item must be held invariant across
groups. Ideally, the constrained threshold should truly be invariant
across groups, although it is difficult
to determine whether this is the case. Systematically testing each
threshold becomes time-consuming
when there are many thresholds per item. However, if a
non-invariant threshold is set to equality
in the multiple group analysis, the overall model fit may be
worsened. Additional research into
this aspect of multiple group analysis with categorical data is
warranted.
Second, WLSMV provides less information in terms of multiple
group analysis than
estimators designed for continuous data. For example, under MLM estimation, tests
of metric and scalar
invariance could have been conducted separately, whereas under
WLSMV these tests are
combined. By conducting metric and scalar tests separately, the
source of lack of invariance
(slopes or intercepts) can be investigated. While under WLSMV it
is possible to identify the
model in such a way as to estimate some differences in loadings
and thresholds, this procedure is
not recommended in the Mplus manual because the loadings and
thresholds together contribute
to the item characteristic curve (Muthén & Muthén, 2007, p.
299).
Another drawback to using WLSMV estimation is that modification
indices are not
available. Under multiple group analysis, modification indices
can be used to identify items that
potentially are problematic in terms of non-invariant loadings
or intercepts. Modification indices
are often used to test for partial invariance, where some
parameters are held invariant, while
problematic items are freely estimated. Without modification
indices, I was unable to compare
whether the two methods (MIMIC modeling and multiple group
analysis) identified the same
problematic items.
Certain properties of the present dataset may make accurate
estimation difficult.
Specifically, several of the items used were severely
non-normal, with kurtosis values as high as
34.882. While WLSMV does adjust parameter estimates and χ2
values for non-normality, the
accuracy of WLSMV adjustments in cases of severe non-normality
has not been studied
extensively. In some cases, the WLSMV
estimates contradict intuition,
as when more restricted models attain better fit index
values than less restricted models.
Additional studies of the functioning of WLSMV under conditions
of severe non-normality
would be useful.
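Because kurtosis formulas differ across software packages (see DeCarlo, 1997), the 34.882 reported above depends on which definition and sample correction were applied. A minimal sketch of the moment-based excess (Fisher) kurtosis, with no sample-size correction, is:

```python
def excess_kurtosis(data):
    """Population-style excess kurtosis: m4 / m2**2 - 3, where m2 and
    m4 are the second and fourth central moments. Values near 0 are
    normal-like; large positive values indicate heavy tails."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

# A mostly-"never" item with a few extreme responses is sharply
# leptokurtic, much like the aggression items in this study.
item = [0] * 95 + [5] * 5
print(round(excess_kurtosis(item), 2))  # 15.05
```

Software that applies a small-sample correction (as in the common sample-kurtosis formula) will return slightly different values, which is worth keeping in mind when comparing reported kurtosis across studies.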
The findings of this study did not support the full measurement
invariance of two
measures of adolescent aggression. While lack of invariance was
indicated for several items, the
findings of this paper are less than conclusive. In particular,
for the relational aggression scale,
one item displayed DIF, yet invariance of factor loadings and
thresholds was supported. It is
clear that additional research is needed to investigate the
measurement properties of the scales
included in this study, as well as other measures of adolescent
aggression.
REFERENCES
Angoff, W. H. (1993). Perspectives on differential item
functioning methodology. In P. W.
Holland & H. Wainer (Eds.), Differential item functioning
(pp. 3-23). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Baker, F. B. (2001). The basics of item response theory (2nd ed.).
College Park, MD: ERIC Clearinghouse on Assessment and
Evaluation.
Bentler, P. M. (1990). Comparative fit indexes in structural
models. Psychological Bulletin, 107,
238-246.
Björkqvist, K., Lagerspetz, K. M., & Kaukiainen, A. (1992).
Do girls manipulate and boys fight?
Developmental trends in regard to direct and indirect
aggression. Aggressive Behavior,
18, 117-127.
Bongers, I. L., Koot, H. M., van der Ende, J., & Verhulst,
F. C. (2004). Developmental
trajectories of externalizing behaviors in childhood and
adolescence. Child Development,
75, 1523-1537.
Broidy, L. M., Nagin, D. S., Tremblay, R. E., Bates, J. E.,
Brame, B., Dodge, K. A., et al. (2003).
Developmental trajectories of childhood disruptive behaviors and
adolescent
delinquency: A six-site, cross-national study. Developmental
Psychology, 39, 222-245.
Christensen, H., Jorm, A. F., Mackinnon, A. J., Korten, A. E.,
Jacomb, P. A., Henderson, A. S.,
et al. (1999). Age differences in depression and anxiety
symptoms: A structural equation
modelling analysis of data from a general population sample.
Psychological Medicine,
29, 325-339.
Crick, N. R., & Bigbee, M. A. (1998). Relational and overt
forms of peer victimization: A
multiinformant approach. Journal of Consulting and Clinical
Psychology, 66, 337-347.
Crick, N. R., & Grotpeter, J. K. (1995). Relational
aggression, gender, and social-psychological
adjustment. Child Development, 66, 710-722.
Crocker, L., & Algina, J. (1986). Introduction to classical
and modern test theory. Fort Worth,
TX: Harcourt Brace.
D'Agostino, R. B. (1986). Tests for the normal distribution. In
R. B. D'Agostino & M. A.
Stephens (Eds.), Goodness-of-fit techniques (pp. 367-419). New
York: Marcel Dekker.
Dahlberg, L. L., Toal, S. B., Swahn, M., & Behrens, C. B.
(2005). Measuring violence-related
attitudes, behaviors, and influences among youths: A compendium
of assessment tools.
(2nd ed.) Atlanta, GA: Centers for Disease Control and
Prevention, National Center for
Injury Prevention and Control.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis.
Psychological Methods, 2, 292-307.
Edelen, M. O., Thissen, D., Teresi, J. A., Kleinman, M., &
Ocepek-Welikson, K. (2006).
Identification of differential item functioning using item
response theory and the
likelihood-based model comparison approach: Application to the
Mini-Mental State
Examination. Medical Care, 44, 134-142.
Embretson, S. E., & Reise, S. P. (2000). Item response
theory for psychologists. Mahwah, NJ:
Lawrence Erlbaum Associates Publishers.
Farrell, A. D., Meyer, A. L., & White, K. S. (2001).
Evaluation of responding in peaceful and
positive ways (RIPP): A school-based prevention program for
reducing violence among
urban adolescents. Journal of Clinical Child Psychology, 30,
451-463.
Finch, H. (2005). The MIMIC model as a method for detecting DIF:
Comparison with Mantel-
Haenszel, SIBTEST, and the IRT likelihood ratio. Applied
Psychological Measurement,
29(4), 278-295.
Finney, S. J., & Distefano, C. (2006). Non-normal and
categorical data in structural equation
modeling. In G. R. Hancock (Ed.), Structural equation modeling:
A second course (pp.
269-313). Greenwich, CT: Information Age Publishing.
Gallo, J. J., Anthony, J. C., & Muthén, B. O. (1994).
Age differences in the symptoms of
depression: A latent trait analysis. Journals of Gerontology,
49, 251-264.
Glockner-Rist, A., & Hoijtink, H. (2003). The best of both
worlds: Factor analysis of
dichotomous data using item response theory and structural
equation modeling.
Structural Equation Modeling: A Multidisciplinary Journal, 10,
544-565.
Grayson, D. A., Mackinnon, A., Jorm, A. F., Creasey, H., &
Broe, G. A. (2000). Item bias in the
Center for Epidemiologic Studies Depression Scale: Effects of
physical disorders and
disability in an elderly community sample. Journals of
Gerontology Series B-
Psychological Sciences and Social Sciences, 55, 273-282.
Hambleton, R. K. (2006). Good practices for identifying
differential item functioning:
Commentary. Medical Care, 44(11), S182-S188.
Horn, J. L., & McArdle, J. J. (1992). A practical and
theoretical guide to measurement invariance
in aging research. Experimental Aging Research, 18, 117-144.
Horne, A. M. (2004). The multisite violence prevention project:
Background and overview.
American Journal of Preventive Medicine, 26(Suppl1), 3-11.
Hu, L. T., & Bentler, P. M. (1998). Fit indices in
covariance structure modeling: Sensitivity to
underparameterized model misspecification. Psychological
Methods, 3, 424-453.
Jones, R. N. (2003). Racial bias in the assessment of cognitive
functioning of older adults. Aging
& Mental Health, 7, 83-102.
Jones, R. N. (2006). Identification of measurement differences
between English- and Spanish-
language versions of the Mini-Mental State Examination:
Detecting differential item
functioning using MIMIC modeling. Medical Care, 44, 124-133.
Kline, R. B. (2005). Principles and practice of structural
equation modeling (2nd ed.). New
York: The Guilford Press.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of
mental test scores. Oxford, England:
Addison-Wesley.
MacIntosh, R., & Hashim, S. (2003). Variance estimation for
converting MIMIC model
parameters to IRT parameters in DIF analysis. Applied
Psychological Measurement, 27,
372-379.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison
of item response theory and
confirmatory factor analytic methodologies for e