APPLIED MEASUREMENT IN EDUCATION, 23: 286–306, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 0895-7347 print / 1532-4818 online
DOI: 10.1080/08957347.2010.486289

Using Confirmatory Factor Analysis and the Rasch Model to Assess Measurement Invariance in a High Stakes Reading Assessment

Jennifer Randall
Research and Evaluation Methods Program, University of Massachusetts, Amherst

George Engelhard, Jr.
Education Studies, Emory University

Correspondence should be addressed to Professor Jennifer Randall, Ph.D., University of Massachusetts, Hills House South, Room 171, Amherst, MA 01003. E-mail: [email protected]
The psychometric properties and multigroup measurement invariance of scores across subgroups, items, and persons on the Reading for Meaning items from the Georgia Criterion Referenced Competency Test (CRCT) were assessed in a sample of 788 seventh-grade students. Specifically, we sought to determine the extent to which score-based inferences on a high stakes state assessment hold across several subgroups within the population of students. To that end, both confirmatory factor analysis (CFA) and Rasch (1980) models were used to assess measurement invariance. Results revealed a unidimensional construct with factorial-level measurement invariance across disability status (students with and without specific learning disabilities), but not across test accommodations (resource guide, read-aloud, and standard administrations). Item-level analysis using the Rasch Model also revealed minimal differential item functioning across disability status, but not accommodation status.
The federal government, with the Individuals with Disabilities Education Act of 2004 (IDEA), defines the term child with a disability to mean a child with mental retardation, hearing impairments (including deafness), speech or language impairments,
visual impairments (including blindness), serious emotional disturbance, orthopedic impairments, autism, traumatic brain injury, other health impairments, or specific learning disabilities and who, by reason thereof, needs special education and related services (Public Law 108-446, 108th Congress). Over 6.5 million infants, children, and youth are currently served under IDEA legislation (U.S. Department of Education, 2007b), which requires that all public school systems provide students with disabilities a free and appropriate education in the least restrictive environment. This least restrictive environment mandate often requires schools and school systems to place students with disabilities in regular, nonspecial education classrooms.
In addition to IDEA, No Child Left Behind (NCLB, 2002) seeks to address, and to prevent, the practice of excluding disabled students from quality instruction and, consequently, assessment. Although the U.S. Department of Education (DOE) does not require students with significant cognitive disabilities to achieve at the same levels as non-disabled students under NCLB, the DOE does demand that all other students with less severe disabilities make progress similar to that of their non-disabled peers. Because many students with disabilities must be assessed using the same tests as students without disabilities, the need for testing accommodations that compensate for their unique needs and disabilities becomes apparent. Yet some may argue that the non-standard accommodations required by special needs students could undermine the meaningfulness of scores obtained on a standardized test.
The inclusion of students with disabilities (SWDs) certainly presents some measurement challenges. Federal law requires that the mandatory assessments of SWDs meet current psychometric and test standards related to validity, reliability, and fairness of the scores. States must (i) identify those accommodations for each assessment that do not invalidate the score; and (ii) instruct IEP Teams to select, for each assessment, only those accommodations that do not invalidate the score (U.S. Department of Education, 2007b, p. 17781). The Standards for Educational and Psychological Testing (AERA, APA, and NCME, 1999) dictate:
Standard 10.1
In testing individuals with disabilities, test developers, test administrators, and test users should take steps to ensure that the test score inferences accurately reflect the intended construct rather than any disabilities and their associated characteristics extraneous to the intent of the measurement. (p. 106)
Standard 10.7
When sample sizes permit, the validity of inferences made from test scores and the reliability of scores on tests administered to individuals with various disabilities should be investigated and reported by the agency or publisher that makes the modification.
Such investigations should examine the effects of modifications made for people with various disabilities on resulting scores, as well as the effects of administering standard unmodified tests to them. (p. 107)
This study seeks to address these standards by examining evidence of measurement invariance for a set of reading items used on the Georgia Criterion Referenced Competency Test (CRCT). The basic measurement problem addressed is whether or not the probability of an observed score on these reading items depends on an individual's group membership. In other words, measurement invariance requires that students from different groups (students with disabilities and students without disabilities, as well as students who receive resource guide, read-aloud, or standard administrations), but with the same true score, have the same observed score (Wu, Li, & Zumbo, 2007). Meredith (1993) provides a statistical definition of measurement invariance:
The random variable X is measurement invariant with respect to selection on V if F(x|w, v) = F(x|w) for all (x, w, v) in the sample space, where X denotes an observed random variable with realization x; W denotes the latent variable, with realization w, that underlies, or measures, X; and V denotes a random variable, with realization v, that functions as a selection of a subpopulation from the parent population by the function s(V), 0 ≤ s(v) ≤ 1. (see Meredith, 1993, p. 528)
Wu et al. (2007) assert that such a general definition is useful in that it can be applied to any observed variables at the item or test level, consequently providing "a statistical basis for psychometric techniques such as factor analytic invariance, as well as differential item functioning, or item response theory methods" (p. 3). At the test level, factor analysis provides an excellent psychometric framework in that the factor score acts as a surrogate for an individual's true score, and the observed random variables are represented by the items. When assessing data with dichotomous outcomes, factorial invariance is established if the factor loadings and thresholds are equivalent across multiple subpopulations. At the item level, item response models provide an appropriate psychometric framework in that a person's expected score on any one item acts as a proxy for the true score and the observed score on that same item represents the observed random variable. Item-level invariance is established if the item parameters are equivalent across multiple populations. In other words, for all values of the underlying, or latent, construct, the item true scores are identical across groups. Both factorial and item-level equivalence are necessary when one seeks to provide evidence of measurement equivalence. As pointed out by Bock and Jones (1968), "in a well developed science, measurement can be made to yield invariant results over a variety of measurement methods and over a range of experimental conditions for any one method" (p. 9).
Previously, several methods have been employed to establish the measurement invariance of assessment results for SWDs receiving test accommodations. Analysis of variance and analysis of covariance procedures have been used to measure the effects of extended time (Munger & Loyd, 1991; Runyan, 1991) and read-aloud, or oral (Meloy, Deville, & Frisbie, 2000; Bolt & Thurlow, 2006; Elbaum, 2007; Elbaum, Arguelles, Campbell, & Saleh, 2004), accommodations. Factor-analytic methods have been used to examine factorial invariance of assessments for SWDs receiving various accommodations such as oral administration of items (Huynh & Barton, 2006; Huynh, Meyer, & Gallant, 2004), extended time (Huesman & Frisbie, 2000; Rock et al., 1987), and large type (Rock, Bennett, & Kaplan, 1985). Similarly, methods that examine item-level equivalence have been utilized across Braille (Bennett, Rock, & Novatkoski, 1989), use of calculator (Fuchs, 2000a), read-aloud (Bielinski, Thurlow, Ysseldyke, Friedebach, & Friedebach, 2001; Bolt & Ysseldyke, 2006; Fuchs, 2000b), and extended time (Cohen, Gregg, & Deng, 2005; Fuchs, 2000b) accommodations.
The purpose of the present study is to describe a coherent framework that can be used to explore systematically whether or not specific accommodations meet psychometric criteria of measurement invariance for students with specific learning disabilities (SLD) on items designed to assess reading for meaning. The framework proceeds in two stages. The first stage utilizes the confirmatory factor analysis (CFA) model to establish unidimensionality and to assess measurement invariance across several subgroups at the test level, specifically factorial invariance. In the second stage, we present a different approach to assessing measurement invariance using the Rasch (1980) Model, an item response theory model, to investigate item-level equivalence. First, we assessed the factor structure of the reading for meaning items by examining whether a single factor underlay the items. Next, we sought to determine whether a one-factor measurement model for reading for meaning was invariant across disability status and type of test administration (i.e., assessing factorial invariance). In the second stage of the analysis, we examined the data to ensure overall fit to the Rasch model. Finally, we sought to test item invariance over disability status and test administration using the Rasch Model. This conceptualization of measurement invariance includes a consideration of test-level invariance as defined within the framework of confirmatory factor analysis, as well as item-level and person-level invariance as conceived within Rasch measurement theory.
METHOD
Participants

The students included in this study were drawn from a larger study in Georgia that examined the effects of two test modifications (resource guide and read-aloud) on the performance of students with and without identified disabilities (Engelhard, Fincher, & Domaleski, 2006).
The original study included students from 76 schools with a wide range of physical, mental, and intellectual disabilities. Because the value and impact of a test accommodation can vary in relation to the specific disability, we chose to focus only on students identified as having a specific learning disability (SLD) within the broader category of students with disabilities. Table 1 provides a description of the demographic characteristics of the students by disability status (N = 788). Table 2 provides the demographic characteristics by accommodation category (resource guides, read-aloud, and standard administration). Consistent with previous research indicating that male students are disproportionately identified as having learning disabilities (DOE, 2007a; Wagner, Cameto, & Guzman, 2003; Wagner, Marder, Blackorby, & Cardoso, 2002), 70% of the 219 students with specific learning disabilities were male. According to the Georgia Department of Education website, over 700,000 full academic year students participated in the statewide Criterion Referenced Competency Test in reading. Due to NCLB mandates, student ethnicity must also be tracked and reported. This information can be used to infer the overall demographic make-up of all test-takers (as all K–8 students in Georgia are required to complete the CRCT) in order to assess the representativeness of our sample. Across all ethnic groups the sample and population proportions were nearly identical. For example, 47.3% of public school students in Georgia are White, and 48.0% of our sample was composed of White students. Similarly, 38.1% of Georgia's population of students are Black, and 40.3% of our sample was composed of Black students. Hispanic students compose 8.99% of the student population and 6.0% of our sample.
TABLE 1
Demographic Characteristics of Seventh-Grade Students by Disability Status

                                       SWOD             SLD              Total
                                       n = 569 (72.2%)  n = 219 (27.8%)  n = 788

Gender (percentages)
  1. Male (n = 410)                    32.4             19.7             52.0
  2. Female (n = 376)                  39.6              8.1             47.7
Race/Ethnicity (percentages)
  1. Asian, Pacific Islander            2.3              0.8              3.1
  2. Black, Non-Hispanic               30.0             10.3             40.3
  3. Hispanic                           3.9              2.0              6.0
  4. American Indian, Alaskan Native    0.0              0.1              0.1
  5. White, Non-Hispanic               34.2             13.7             48.0
  6. Multiracial                        1.7              0.9              2.5

Note. SWOD = Students Without Disabilities; SLD = Specific Learning Disability.
In an effort to achieve equal group sizes, students with disabilities were oversampled in the original study. We would like to note, however, that 13.39% of Georgia's tested public school students have identified disabilities. We feel confident that our sample adequately represents the student population of Georgia.
Instrument
Developed in 2000, the Georgia Criterion Referenced Competency Test (CRCT) is a large state standardized assessment designed to measure how well public K–8 students in Georgia have mastered the Quality Core Curriculum (QCC). The QCCs are the curriculum content standards developed by the Georgia Department of Education for its public schools. Georgia law requires that all students in grades 1–8 be assessed in the content areas of reading, English/language arts, and mathematics. In addition, students in grades 3–8 are assessed in social studies and science as well. The CRCT yields information on academic achievement that can be used to diagnose individual student strengths and weaknesses as related to instruction of the QCC, and to gauge the quality of education throughout Georgia. The reading CRCT for sixth-grade students consists of 40 operational selected-response items and 10 embedded field test (FT) items (FT items do not contribute to the student's score) within four content domains: reading for vocabulary improvement, reading for meaning, reading for critical analysis, and reading for locating and recalling. Twelve items from the reading for meaning domain were selected and analyzed here because this domain most closely represents what is commonly referred to as reading comprehension. Reading for meaning is defined as the recognition of underlying and overall themes and concepts in fiction and nonfiction literature, as well as the main idea and details of the text. It also includes the recognition of the structure of information in fiction and nonfiction. Items in the reading for meaning domain include identifying literary forms; identifying purpose of text; identifying characters and their traits; recognizing sequence of events; recognizing text organization/structure; recognizing explicit main idea; and retelling or summarizing.
Data Collection

All state schools were stratified into one of three categories based on the proportion of students receiving free and reduced lunch in each school. Within those categories, schools were then randomly selected and assigned to one of three conditions (the resource guide test modification; the read-aloud, or oral, test modification; or the standard test condition), and all students (both students with disabilities and without disabilities) within the school were tested under the same condition. Two of the three conditions involved the use of a test modification, and the third condition involved the standard administration of the test.
It should be noted that, for the purposes of the larger original study, all students were tested under standard operational conditions at the end of the sixth grade during the regular operational administration of the reading exam. The assignment to one of three conditions involved a second administration of the same test, given the following spring when students were in the seventh grade. In summary, every student completed the reading exam under standard, operational conditions and then a second time under one of three conditions. Data from the second experimental administration were analyzed for the purposes of this study.
Description of Test Modifications

The resource guide consisted of a single page (front and back) that provided students with key definitions and examples that were hypothesized to be helpful. The resource guides were designed to provide students with scaffolded support, much like the support they would receive in the classroom or that English language learners receive from a translation dictionary. The guides included commonly used definitions of academic terms and vocabulary words (provided in alphabetical order, as in a dictionary) that could be applied to the test. These terms were not assessed by the exam; rather, the guide provided explanations of construct-irrelevant words, expressions, or phrases that might be found in the passages or within the item stems. For example, a question may ask the student to identify the central idea of the passage. The resource guide indicated that the central idea meant the main point. Similarly, vocabulary within a passage that a student may not be familiar with, but that was not directly assessed, was defined in hopes that providing such support would increase the student's comprehension of the overall passage. Vocabulary that was assessed was not defined. The guides were developed by a committee of Georgia Department of Education specialists from assessment, curriculum, and special education offices. Careful attention was given to the constructs measured by the test items. The intent of the guides was to provide students with information they could use to answer questions on the test, but not to provide the students with the answers themselves. It was hypothesized that the removal of construct-irrelevant vocabulary, or expressions, would improve student performance on the exam, as students would be able to focus on the intended construct without confusion or frustration. One could imagine the resource guide as a glossary of important terms used throughout the exam. Because the use of resource guides was new for most students, students were given the opportunity to work through a sample test using the resource guide. Teachers were allowed to review the sample test with students and provide pointers, if necessary, on how the sample test related to the resource guide. Because the test material is secure, it is not possible to reveal the actual content of the resource guides here.
The read-aloud administration involved the teacher reading the entire test to students, including reading passages and questions. Teachers were instructed to read the test to students at a natural pace. Students were encouraged to read along silently with teachers. The third type of administration was simply the standard administration, in which the test was administered in the standard format as if it were an operational administration. Engelhard et al. (2006) should be consulted for additional details regarding the full study.
Procedures

Data analyses were conducted in two stages. In the first stage, analyses were conducted with Mplus computer software (Muthén & Muthén, 1998–2007) using the confirmatory factor-analytic model for dichotomous data as defined by Muthén and Christofferson (1981):

  x_g = τ_g + Λ_g ξ_g + δ_g,  (1)

where
  x_g is a vector of observed scores for each group,
  τ_g is a vector of item intercepts (or thresholds),
  Λ_g is a matrix of the factor loadings,
  ξ_g is a vector of factor scores (latent variables), and
  δ_g is a vector of errors.

With the CFA model, the relationship between observed variables (in this case, 12 reading items) and the underlying construct they are intended to measure (in this case, reading for meaning) is modeled with the observed response to an item represented as a linear function of the latent construct (ξ), an intercept/threshold (τ), and an error term (δ). The factor loading (λ) describes the amount of change in x due to a unit change in the latent construct (ξ). The parameters of the thresholds (τ^(g)) and the factor loadings (Λ^(g)) describe the measurement properties of dichotomous variables. If these measurement properties are invariant across groups, then

  τ^(1) = τ^(2) = ... = τ^(G), and
  Λ^(1) = Λ^(2) = ... = Λ^(G),

where G represents each group (see Muthén & Christofferson, 1981, p. 408).
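To make the threshold parameterization concrete, here is a minimal Python sketch (not Mplus output) of the model-implied probability of a correct response under the probit link that WLSMV estimation assumes; the function name p_correct and all parameter values are illustrative assumptions, not estimates from this study:

```python
import numpy as np
from scipy.stats import norm

# Illustrative only: model-implied probability of a correct response for a
# dichotomous item under the probit (threshold) parameterization. Symbols
# follow Equation (1): lam = factor loading, tau = threshold, xi = latent
# reading-for-meaning score.
def p_correct(xi, lam, tau):
    """P(x = 1 | xi) under a probit item factor model."""
    return norm.cdf(lam * xi - tau)

# Measurement invariance requires each group's curve to coincide: for the
# same xi, two hypothetical groups should yield the same probability.
xi = np.linspace(-3, 3, 7)
g1 = p_correct(xi, lam=0.9, tau=-0.4)   # hypothetical group 1 parameters
g2 = p_correct(xi, lam=0.9, tau=-0.4)   # invariance: identical parameters
print(np.allclose(g1, g2))              # True when loadings/thresholds match
```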
Confirmatory factor analysis (CFA) was used to test the one-factor measurement model of the 12 reading for meaning items with six groups: one for the full sample and one for each of the five subgroups of interest (students with specific learning disabilities, students without disabilities, students who received the resource guides, students who received the read-aloud administration, and students who received the standard administration). All items were hypothesized to be a function of a single latent factor, and error terms were hypothesized to be uncorrelated. In each model, the factor loading from the latent factor to the first item was constrained to 1.0 to set the scale of measurement.
All parameters were estimated using robust weighted least squares (WLSMV) with delta parameterization. With WLSMV estimation, probit regressions for the factor indicators regressed on the factors are estimated. We used the chi-square statistic to assess how well the model reproduced the covariance matrix. Because this statistic is sensitive to sample size and may not be a practical test of model fit (Cheung & Rensvold, 2002), we used two additional goodness-of-fit indexes less vulnerable to sample size: the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). CFI values near 1.0 are optimal, with values greater than .90 indicating acceptable model fit (Byrne, 2006). With RMSEA, values of 0.0 indicate the best fit between the population covariance matrix and the covariance matrix implied by the model and estimated with sample data. Generally, values less than .08 are considered reasonable, with values less than .05 indicating a closer approximate fit (Kline, 2005).
Because identical model specification for each subgroup does not guarantee that item measurement is equivalent across groups (Byrne & Campbell, 1999), we conducted a series of tests for multigroup invariance by examining two increasingly restrictive hierarchical CFA models. Models were run separately by disability status as well as accommodation category, and the fit statistics were used to verify adequate model fit before proceeding to subsequent steps (Byrne, 2006). Muthén and Muthén (1998–2007) recommend a set of models to be considered for measurement invariance of categorical variables, noting that, because the item probability curve is influenced by both parameters, factor loadings and thresholds must be constrained in tandem. With the baseline model, thresholds and factor loadings are free across groups; scale factors are fixed at one in all groups; and the factor means are fixed at zero in all groups (to ensure model identification). This baseline model provides a model against which the subsequent invariance model can be compared. In the second model, factor loadings and thresholds were constrained to be invariant across the groups; scale factors were fixed at one in one group and free in the others; and factor means were fixed at zero in one group and free in the others (the Mplus default). This is because the variances of the latent response variables are not required to be equal across subgroups (Muthén & Muthén, 1998–2007).
Because the chi-square values from WLSMV estimation cannot be used directly for chi-square difference tests, we compared the fit of the two models using the DIFFTEST option to determine if an argument for factorial invariance could be supported. The DIFFTEST procedure (available in Mplus) tests the null hypothesis that the restricted model fits no worse than the unconstrained model; a significant result indicates that the constraints worsen model fit (i.e., the unconstrained model is a better model). Non-significant p-values therefore indicate equivalent model-data fit, consistent with factorial invariance. In the absence of full factorial invariance, data were also examined to determine if partial measurement invariance was present. Partial measurement invariance applies when factors are configurally invariant (as in the baseline model) but do not demonstrate metric (factor loading) invariance (Byrne, Shavelson, & Muthén, 1989). Byrne et al. (1989) assert that further tests of invariance and analysis can continue as long as configural invariance has been established and at least one item is metrically invariant. In such cases, Steenkamp and Baumgartner (1998) recommend that invariance constraints be relaxed for highly significant modification indices in order to minimize chance model improvement and maximize the cross-validity of the model.
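The sequence just described can be summarized as a small decision rule. This Python sketch is our paraphrase of the procedure, with an invented function name and no claim to reproduce the DIFFTEST internals in Mplus:

```python
# Hedged sketch of the decision sequence described above, not Mplus itself:
# compare the constrained model to the baseline with a difference test, and
# fall back to partial invariance by freeing the worst-fitting items.
def invariance_decision(difftest_p: float, alpha: float = 0.05) -> str:
    if difftest_p >= alpha:
        return "full factorial invariance supported"
    return "free loadings/thresholds for items with large modification indices, retest"

print(invariance_decision(0.62))  # disability-status comparison reported later
print(invariance_decision(0.00))  # administration-type comparison reported later
```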
In the second stage, data analyses were conducted with FACETS, a multifaceted Rasch measurement (MRM) computer program (Linacre, 2007). Three models were fit to these data. Model I can be written as follows:

  ln(P_nijk1 / P_nijk0) = θ_n − δ_i − α_j − λ_k,  (2)

where
  P_nijk1 = probability of person n succeeding on item i for group j and administration k,
  P_nijk0 = probability of person n failing on item i for group j and administration k,
  θ_n = location of person n on the latent variable,
  δ_i = difficulty of item i,
  α_j = location of group j, and
  λ_k = location of administration k.

This model dictates that student achievement in reading for meaning is the latent variable that is made observable through a set of 12 reading items, and that the items vary in their locations on this latent variable. Unlike the CFA model, the Rasch Model (a) allows for person and item parameters to be estimated independently of each other and (b) includes no item discrimination parameter (or item loadings), as discrimination is assumed to be equal across all items. The observed responses are dichotomous (correct or incorrect), and they are a function of both person achievement and the difficulty of the item. Group membership (dichotomously scored as student with a specific learning disability or student without disability) and type of administration (standard, resource guide, read-aloud) may influence person achievement levels.
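For readers who want to see Equation (2) in action, the following Python sketch evaluates the model-implied probability for hypothetical (not estimated) facet values:

```python
import math

# Illustrative computation of Equation (2): the log-odds of a correct
# response are theta_n - delta_i - alpha_j - lambda_k. Parameter values
# below are made up for demonstration; they are not estimates from the study.
def p_success(theta, delta, alpha, lam):
    """Probability of success under the four-facet Rasch model."""
    logit = theta - delta - alpha - lam
    return 1.0 / (1.0 + math.exp(-logit))

# A student at 0.94 logits (the sample mean reported later) on an
# average-difficulty item, with group and administration effects of zero:
print(round(p_success(theta=0.94, delta=0.0, alpha=0.0, lam=0.0), 3))  # ~0.719
```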
Once estimates of the main-effect parameters were obtained, the residuals were defined. The unstandardized residual reflects the difference between the observed and expected responses:

  R_nijk = x_nijk − P_nijk.  (3)

A standardized residual can also be defined as follows:

  Z_nijk = (x_nijk − P_nijk) / [P_nijk(1 − P_nijk)]^(1/2).  (4)

These residuals can be summarized to create mean square error (MSE) statistics (labeled Infit and Outfit statistics in the FACETS computer program) for each item and person. These MSE statistics can also be summarized over items and persons, as well as subsets of items and subgroups of persons. See Wright and Masters (1982) for a description of the Rasch-based fit statistics.
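A brief Python sketch of Equations (3) and (4), together with unweighted (outfit-style) and information-weighted (infit-style) mean squares in the form given by Wright and Masters (1982); the response vector and probabilities are fabricated for illustration:

```python
import numpy as np

# Sketch of Equations (3)-(4) and mean-square fit summaries: outfit is the
# mean of squared standardized residuals; infit weights each squared
# residual by the response variance P(1 - P). All data here are invented.
x = np.array([1, 0, 1, 1, 0])            # observed dichotomous responses
P = np.array([0.8, 0.3, 0.6, 0.9, 0.4])  # model-expected probabilities

R = x - P                                # Equation (3): unstandardized residual
W = P * (1 - P)                          # Bernoulli variance of each response
Z = R / np.sqrt(W)                       # Equation (4): standardized residual

outfit = np.mean(Z ** 2)                 # unweighted mean square
infit = np.sum(W * Z ** 2) / np.sum(W)   # information-weighted mean square
print(round(outfit, 2), round(infit, 2))  # values near 1.0 indicate good fit
```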
In addition to establishing item parameters and model fit statistics, the FACETS program was used to examine uniform differential item functioning (DIF). DIF is present when the locations of the items vary, beyond sampling error, across group characteristics, such as gender, race, or disability status. If, as a researcher, one suspects that certain characteristics may interact or behave differently than others, one can simply add an interaction term for those two characteristics. Model II focuses on examining the interaction effects between items i and groups j (δ_iα_j). This can be written as follows:

  ln(P_nijk1 / P_nijk0) = θ_n − δ_i − α_j − λ_k − δ_iα_j.  (5)

Student groups are defined as students with and without disabilities. This model explores whether or not the items are functioning invariantly over disability status.

The final model examined, Model III, assesses possible interaction effects between items i and administrations k (δ_iλ_k). It can be written as

  ln(P_nijk1 / P_nijk0) = θ_n − δ_i − α_j − λ_k − δ_iλ_k.  (6)

This model explores whether or not the items functioned invariantly across test administrations. For both Model II and Model III, FACETS provides bias measures in terms of logits.
These estimates are reported as t-scores (the bias measure divided by its standard error) with finite degrees of freedom. When dealing with more than 30 observations, t-scores greater than the absolute value of two are considered statistically significant, indicating differential item functioning and a threat to item-level invariance. Because we can expect statistically significant results to appear by chance due to the use of multiple significance tests, we used the Bonferroni multiple comparison correction to guard against spurious significance. Testing the hypothesis at the p < .05 level that there is no DIF in this test, the most significant DIF effect must have p less than .05 divided by the number of item-DIF contrasts.
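The screening rule can be sketched in Python as follows; the t-scores passed in are hypothetical (aside from one value echoing the Item 7 contrast reported in the Results), and flag_dif is our own helper, not a FACETS routine:

```python
from scipy import stats

# Sketch of the Bonferroni screen described above: flag an item-group
# contrast as DIF only if |t| > 2 and its p-value clears the corrected alpha.
def flag_dif(t_values, df, alpha=0.05):
    """Return indices of contrasts surviving a Bonferroni correction."""
    corrected_alpha = alpha / len(t_values)   # alpha / number of contrasts
    flags = []
    for idx, t in enumerate(t_values):
        p = 2 * stats.t.sf(abs(t), df)        # two-sided p-value
        if abs(t) > 2 and p < corrected_alpha:
            flags.append(idx)
    return flags

# Hypothetical t-scores for 12 item-by-group contrasts:
contrasts = [0.4, -1.1, 3.53, 0.9, -0.2, 1.8, -0.7, 0.3, 2.1, -1.5, 0.6, -0.9]
print(flag_dif(contrasts, df=200))  # only the 3.53 contrast survives
```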
RESULTS
Study results are discussed within the frameworks of the two stages: first, the results of the multigroup confirmatory factor analysis using Mplus software; and second, the results of the Rasch analyses using FACETS software.
Stage 1: Confirmatory Factor Analyses

Results for Stage 1 of the data analysis are divided into three subsections. The first subsection addresses the fit of the measurement model within each subgroup: students with SLD, students without disabilities, and students who received the resource guide, read-aloud, and standard administrations. The second subsection details the examination of factorial invariance across test administration. The final subsection describes the examination of factorial invariance across disability status.
Model fit within each subgroup. Recall that five separate CFAs were conducted to examine the measurement models of reading for meaning for each subgroup of interest. The reading for meaning measurement model demonstrated excellent model fit for students without a specific learning disability, χ² (45) = 41.20, p = .63, CFI = 1.00, RMSEA = .00; for students with SLD, χ²* (40) = 53.22, p = .08, CFI = .96, RMSEA = .04; for students who received the resource guide test administration, χ² (38) = 40.24, p = .37, CFI = 1.00, RMSEA = .02; for students who received the read-aloud administration, χ² (36) = 44.48, p = .16, CFI = 0.98, RMSEA = .03; and for students who received the standard administration, χ² (38) = 31.97, p = .74, CFI = 1.00, RMSEA = .00. Consequently, when testing groups for factorial invariance, we specified the same model for all subgroups.

*Degrees of freedom for these groups differ due to the way in which they are computed for the WLSMV estimator.
Factorial invariance across administration type. Recall that to assess between-group invariance, we examined change in fit statistics between the baseline model (i.e., factor loadings and thresholds free) and Model 2, in which factor loadings and thresholds were constrained to be equal, or invariant. Our findings, presented in Table 3, suggest excellent overall fit for the baseline model, χ² (112) = 117.76, p = .34, CFI = 1.00, RMSEA = .01. Model 2 also reflects adequate model fit, χ² (122) = 152.80, p = .03, CFI = .98, RMSEA = .03. Using the DIFFTEST option in Mplus, we assessed whether Model 2 (the nested model) was significantly different from the less restrictive model: χ² (18) = 43.27, p < .01. Results suggest that the factor structure of the reading for meaning domain is not invariant across the three test administrations. Consequently, the data were investigated to determine if partial measurement invariance (Byrne, Shavelson, & Muthén, 1989) could be established across test administrations.
TABLE 2
Demographic Characteristics of Seventh-Grade Students by Test Administration

                                       Resource Guides  Read Aloud       Standard         Total
                                       n = 254 (32.3%)  n = 257 (32.7%)  n = 275 (35.0%)  n = 786

Gender (percentages)
  1. Male (n = 410)                    16.8             16.9             18.4             52.2
  2. Female (n = 376)                  15.5             15.8             16.5             47.8
Race/Ethnicity (percentages)
  1. Asian, Pacific Islander            1.5              0.1              1.4              3.1
  2. Black, Non-Hispanic                9.2             13.2             17.9             40.3
  3. Hispanic                           1.7              2.4              1.9              6.0
  4. American Indian, Alaskan Native    0.0              0.0              0.1              0.1
  5. White, Non-Hispanic               18.8             16.0             13.1             48.0
  6. Multiracial                        1.1              0.9              0.5              2.5
TABLE 3
Tests for Invariance for Reading for Meaning Measurement Model Across Test Administration: Summary of Goodness of Fit Statistics

Equality Test                                χ²      df   CFI   RMSEA  Δχ²    p-value
No Constraints (Configural)                  117.76  112  1.00  .01    __     __
Factor Loadings & Thresholds (2 vs. 1)       152.80  122   .98  .03    43.27  .00
Free Items 6 & 12 (3 vs. 1)                  133.05  120   .99  .02    22.81  .08

Note. χ² = chi-square statistic based on robust weighted least squares estimation; df = degrees of freedom; CFI = comparative fit index; RMSEA = root mean square error of approximation. Robust statistics are reported. Students who received resource guide (n = 254), students who received read-aloud (n = 257), students who received standard administration (n = 275).
Examination of the modification indices revealed that releasing the equality constraints on both the factor loadings and thresholds of Items 6 and 12 resulted in a better overall model, χ² (120) = 133.05, p = .20, CFI = 0.99, RMSEA = .02, and a non-significant chi-square difference test, χ² (15) = 22.81, p = .08. Closer examination of the unconstrained parameter estimates, displayed in Table 4, revealed that Item 6 was less discriminating and easier for students in the read-aloud test administration than for students in the resource guide and standard test administrations. Furthermore, Item 12 was more discriminating and easier for students in the resource guide test administration than in the read-aloud or standard administrations. These findings suggest partial measurement invariance, or factorial invariance for a majority of the items.
Factorial invariance across disability type. In the next series of models within Stage 1, we examined factorial invariance across disability status, as presented in Table 5. As in the analyses of test administration, results indicate excellent overall fit across disability status with the baseline model, χ² (84) = 95.40, p = .19, CFI = .99, RMSEA = .01. Model 2 (factor loadings and thresholds constrained) also demonstrated adequate fit, χ² (90) = 98.95, p = .24, CFI = .99, RMSEA = .02. Again, using the DIFFTEST option in Mplus, we assessed whether Model 2 (the nested model) was significantly different from the less restrictive model: χ² (10) = 8.14, p = .62. Results support complete factorial invariance across disability type.
TABLE 4
Item 6 and Item 12 Unconstrained Parameter Estimates

        Resource Guide             Read Aloud                 Standard
Item    Factor Loading  Threshold  Factor Loading  Threshold  Factor Loading  Threshold
6       1.198           −.644      .735            −.885      1.030           −.523
12       .923           −.239      .838            −.042       .835           −.080
TABLE 5
Tests for Invariance for Reading for Meaning Measurement Model Across Disability Status: Summary of Goodness of Fit Statistics

Equality Test                                χ²     df  CFI  RMSEA  Δχ²   p-value
No Constraints (Configural)                  95.40  84  .99  .01    __    __
Factor Loadings & Thresholds (2 vs. 1)       98.95  90  .99  .02    8.14  .62

Note. χ² = chi-square statistic based on robust weighted least squares estimation; df = degrees of freedom; CFI = comparative fit index; RMSEA = root mean square error of approximation. Robust statistics are reported. Regular education students (n = 569), students with specific learning disabilities (n = 219).
Given evidence that the measurement model representing the latent reading ability for the reading for meaning factor was invariant across disability status and demonstrated partial invariance across test administration, we ran a final CFA for the full sample using the original model (all items loading on the latent factor reading for meaning). This final full model showed excellent fit to the data, χ² (47) = 56.89, p = .15, CFI = 1.00, RMSEA = .02. Stage 1 results provide strong evidence that at the test level (a) the reading for meaning domain is a unidimensional construct and (b) the factorial structure is fully invariant across disability status and partially invariant across administration type.
Stage 2: Multifaceted Rasch Measurement

Next, we turned our attention to Stage 2 of our data analysis, based on the Rasch measurement model. The results within this stage are divided into three subsections. The first subsection presents the main effects model (Model I). The second and third subsections explore the interactions between items and disability status (Model II) and between items and test administration (Model III).
Model I: main effects model. Figure 1 displays a variable map representing the calibrations of the students, items, conditions, and groups. The FACETS computer program (Linacre, 2007) was used to calibrate the four facets. The first column of Figure 1 represents the logit scale. The second column of the variable map displays the student measures of reading (for meaning) achievement. Higher ability students appear at the top of the column, while lower ability students appear at the bottom. Each asterisk represents 8 students. The student achievement measures range from −4.36 logits to 4.49 logits (M = .94, SD = 1.64, N = 786). The third column shows the locations of the administration conditions on the latent variable. Administrations appearing higher in this column yielded higher achievement. In the case of the reading for meaning items, the read-aloud administration yielded slightly higher results than both the standard and resource guide administrations; the resource guide administration yielded the lowest results overall. Group differences are shown in column four of the variable map. As expected, the overall achievement of the students without specific learning disabilities was higher on average as compared to the students with specific learning disabilities. The fifth and final column represents the locations of the reading for meaning items, with item difficulty ranging from −1.02 logits to 1.86 logits (M = .00, SD = .84, N = 12).

Table 6 presents a variety of summary statistics related to the FACETS analyses. The items, administrations, and disability status are anchored at zero by definition. In order to define an unambiguous frame of reference for the model, only one facet (student measure) is allowed to vary.
FIGURE 1 Variable map of reading ability (columns, left to right: logit scale, students, type of administration, disability status, items). * = 8 students. Higher values for the student, type of administration, and disability status facets indicate higher scores on the reading ability construct. Higher values on the item facet indicate harder items. [Variable map not reproduced here.]
The overall model-data fit is quite good. The expected value of the mean square error statistics (infit and outfit) is 1.00 with a standard deviation of .20, and the expected values for these statistics are very close to the observed values. The most prominent exception is the student facet, which has more variation than expected for the outfit statistic (M = 1.00, SD = .59).

As shown in Table 6, all four of the reliability of separation statistics are statistically significant (p < .01): students, disability status, type of administration, and items. The reliability of separation statistic is conceptually equivalent to Cronbach's coefficient alpha, and it is used to test the hypothesis of whether or not there are significant differences between the elements within a facet. The largest reliability of separation index is .99 (items), indicating a good spread of the reading for meaning items on the latent variable. The smallest reliability of separation index is .65 (students); given the small number of items (N = 12), this is comparable to the values obtained for other subtests in similar situations. Both type of administration (.83) and disability status (.98) were also well differentiated.
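For intuition, here is a minimal Python sketch of a separation-reliability computation, assuming the usual true-variance-over-observed-variance form; the item calibrations and standard errors are invented, not FACETS output:

```python
import numpy as np

# Assumed form of the reliability of separation: the proportion of observed
# variance in facet estimates that is not attributable to measurement error
# (analogous to coefficient alpha, as noted above). Data below are invented.
def separation_reliability(measures, standard_errors):
    observed_var = np.var(measures, ddof=1)          # variance of estimates
    error_var = np.mean(np.square(standard_errors))  # mean square error
    return (observed_var - error_var) / observed_var

# Hypothetical item calibrations (logits) and their standard errors:
items = np.array([-1.0, -0.8, -0.9, -0.7, 0.2, -0.9, 1.9, -0.3, 0.1, 0.6, 0.7, 1.1])
se = np.full(12, 0.09)
print(round(separation_reliability(items, se), 2))  # near .99 with well-spread items
```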
Model II: item × disability status interactions. Model II explores the interactions between items and disability status. This model explores whether or not the items are functioning invariantly over groups (i.e., differential item functioning). Two items (4 and 7) exhibited statistically significant differential item functioning. Recall that the use of multiple significance tests can result in spurious significance, so Bonferroni multiple comparison tests were used to confirm any apparent DIF.
TABLE 6
Summary Statistics for FACETS Analysis (Reading for Meaning Items, Grade 7)

                            Students  Item   Administration  Disability Status
Measures
  Mean                       .94       .00    .00             .00
  SD                        1.64       .84    .12             .30
  n                          786       12     3               2
INFIT
  Mean                      1.00       .99   1.00            1.02
  SD                         .24       .11    .01             .05
OUTFIT
  Mean                      1.00      1.00   1.00            1.03
  SD                         .59       .23    .02             .07
Reliability of Separation    .65*      .99*   .83*            .98*
Chi-Square Statistic        1795.6    884.8  17.6            107.4
Degrees of Freedom           785       11     2               1

*p < .01.
Comparison tests revealed only one statistically significant item: Item 7 was differentially easier for students with specific learning disabilities, with an observed score of 0.32 but an expected score of 0.23, t = 3.53.
Model III: item × test administration interactions. Model III explores item-level invariance across test administrations. We found no statistically significant interaction bias between test administration and the reading for meaning items, suggesting complete item-level invariance across the type of test administration.
SUMMARY AND DISCUSSION
The major contribution of this study is to encourage a systematic approach to establishing measurement invariance on large-scale state assessments with dichotomous data. By combining and integrating both confirmatory factor-analytic and Rasch measurement procedures, practitioners are able to develop a more complete picture of the extent to which score-based inferences from these measures hold across several subgroups within a population of students. Although establishing measurement invariance is essential for all tests/measures that seek to make inferences across multiple groups, it is particularly necessary when these inferences have high stakes consequences (i.e., promotion/retention/graduation). Add to this the legal obligation of a school system to accurately assess protected or vulnerable groups (i.e., students with disabilities), and the significance of this study becomes apparent.
A two-stage approach was utilized. The first stage works within a CFA framework to establish both unidimensionality and test-level measurement invariance, specifically factorial equivalence. Assuming factorial equivalence is established in the first stage, the second stage works within a narrower conceptual framework focusing on invariance at the item level, using a model that allows for the separation of item and person parameters. These complementary methods enable the practitioner to address issues of model-data misfit to ensure accurate interpretation of test scores.
The results of this study provide strong evidence that the reading for meaning items of the CRCT exhibit test-level invariance across students with SLD and students without disabilities. The factorial invariance across test administration, however, is less clear. Multigroup confirmatory factor analysis revealed a one-factor model with partial measurement invariance (when Items 6 and 12 are freely estimated). These findings suggest that the use of read-aloud and resource guide administrations may change the underlying structure of the exam. Further examination into the utility and appropriateness of these test accommodations may be necessary.
Analyses using the Rasch Model also suggest overall good item fit (outfit = 0.99), with only one item exhibiting evidence consistent with differential item functioning across disability status.
Students with SLD performed differentially better than expected on Item 7. Closer examination of this item also reveals some mild item misfit (outfit = 1.26). The tendency of this item to function differentially across disability status, its lack of fit to the measurement model, as well as its extremely low p-value (0.39) suggest that it is a threat to item-level measurement invariance and should be examined more closely by measurement professionals and practitioners. Indeed, such results suggest a clear need for detailed qualitative interpretations of the quantitative analysis. The two-stage approach to assessing measurement invariance described in this article provides a useful template that can be used, in conjunction with qualitative evaluations, to aid in establishing fairness and equity in high stakes testing.
ACKNOWLEDGMENTS
We thank Chris Domaleski and Melisa Fincher for providing us
with access tothe data set. The opinions express in this article
are those of the authors, and theydo not reflect the views of the
Georgia Department of Education.
REFERENCES
Asparouhov, T., & Muthén, B. (2006). Robust chi-square difference testing with mean and variance adjusted test statistics (Mplus Web Notes No. 10). Retrieved from http://www.statmodel.com/download/webnotes/webnote10.pdf

Bennett, R., Rock, D., & Novatkoski, I. (1989). Differential item functioning on the SAT-M Braille edition. Journal of Educational Measurement, 26(1), 67–79.

Bielinski, J., Thurlow, M., Ysseldyke, J., Friedebach, J., & Friedebach, M. (2001). Read aloud accommodations: Effects on multiple choice reading and math items (Technical Report 31). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San Francisco: Holden-Day.

Bolt, S., & Thurlow, M. (2006). Item level effects of the read aloud accommodation for students with disabilities (Synthesis Report 65). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Bolt, S., & Ysseldyke, J. (2006). Comparing DIF across math and reading/language arts tests for students receiving a read-aloud accommodation. Applied Measurement in Education, 19(4), 329–355.

Byrne, B. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Byrne, B., & Campbell, T. L. (1999). Cross-cultural comparisons and the presumption of equivalent measurement and theoretical structure: A look beneath the surface. Journal of Cross-Cultural Psychology, 30(5), 555–574.

Byrne, B., Shavelson, R., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456–466.

Cheung, G., & Rensvold, R. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 9, 233–245.

Cohen, A., Gregg, N., & Deng, M. (2005). The role of extended time and item content on a high-stakes mathematics test. Learning Disabilities Research and Practice, 20(4), 225–233.

Elbaum, B., Arguelles, M., Campbell, Y., & Saleh, M. (2004). Effects of a student-reads-aloud accommodation on the performance of students with and without learning disabilities on a test of reading comprehension. Exceptionality, 12(2), 71–87.

Elbaum, B. (2007). Effects of an oral testing accommodation on the mathematics performance of secondary students with and without learning disabilities. The Journal of Special Education, 40(4), 218–229.

Engelhard, G., Fincher, M., & Domaleski, C. S. (2006). Examining the reading and mathematics performance of students with disabilities under modified conditions: The Georgia Department of Education modification research study. Atlanta: Georgia Department of Education.

Fuchs, L. (2000a, July). The validity of test accommodations for students with disabilities: Differential item performance on mathematics tests as a function of test accommodations and disability status. Final report: U.S. Department of Education through the Delaware Department of Education.

Fuchs, L. (2000b, July). The validity of test accommodations for students with disabilities: Differential item performance on reading tests as a function of test accommodations and disability status. Final report: U.S. Department of Education through the Delaware Department of Education.

Huesman, R., & Frisbie, D. (2000, April). The validity of ITBS reading comprehension test scores for learning disabled and non-learning disabled students under extended time conditions. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Huynh, H., & Barton, K. (2006). Performance of students with disabilities under regular and oral administrations of a high-stakes reading examination. Applied Measurement in Education, 19(1), 21–39.

Huynh, H., Meyer, J., & Gallant, D. (2004). Comparability of student performance between regular and oral administrations for a high stakes mathematics test. Applied Measurement in Education, 17(1), 39–57.

Kline, R. (2005). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford.

Linacre, J. M. (2007). A user's guide to FACETS: Rasch-model computer programs. Chicago: winsteps.com.

Meloy, L., Deville, C., & Frisbie, D. (2000, April). The effect of a reading accommodation on standardized test scores of learning disabled students. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525–543.

Munger, G., & Loyd, B. (1991). Effect of speededness on test performance of handicapped and nonhandicapped examinees. Journal of Educational Research, 85(1), 53–57.

Muthén, B., & Christofferson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46(4), 407–419.

Muthén, L. K., & Muthén, B. O. (1998–2007). Mplus user's guide (5th ed.). Los Angeles, CA: Muthén & Muthén.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960)

Rock, D., Bennett, R., & Kaplan, B. (1985). Internal construct validity of the SAT across handicapped and nonhandicapped populations (ETS Research Report RR-85-50). Princeton, NJ: Educational Testing Service.

Rock, D., Bennett, R., & Kaplan, B. (1987). Internal construct validity of a college admissions test across handicapped and nonhandicapped groups. Educational and Psychological Measurement, 47(1), 193–205.

Runyan, M. (1991). The effect of extra time on reading comprehension scores for university students with and without learning disabilities. Journal of Learning Disabilities, 24(2), 104–108.

Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78–90.

U.S. Department of Education. (2007a). Demographic and school characteristics of students receiving special education in elementary grades (NCES Publication 2007-005). Jessup, MD: National Center for Education Statistics.

U.S. Department of Education. (2007b). Title I: Improving the academic achievement of the disadvantaged; Individuals with Disabilities Education Act (IDEA); Final rule. Federal Register, Vol. 72, No. 67, Monday, April 9, 2007.

Wagner, M., Cameto, R., & Guzman, A. (2003). Who are secondary students in special education today? (A report from the National Longitudinal Transition Study). Retrieved September 1, 2008, from http://www.ncset.org/publications

Wagner, M., Marder, C., Blackorby, J., & Cardoso, D. (2002). The children we serve: The demographic characteristics of elementary and middle school students with disabilities and their households. Menlo Park, CA: SRI International.

Wright, B. D., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.

Wu, A., Li, Z., & Zumbo, B. (2007). Decoding the meaning of factorial invariance and updating the practice of multi-group confirmatory factor analysis: A demonstration with TIMSS data. Practical Assessment, Research, & Evaluation, 12(3), 1–23.