Transcript
Statistical and Qualitative Approaches for Facilitating Comparability in Cross-Lingual Assessments

Stephen G. Sireci
Center for Educational Assessment, University of Massachusetts, USA

Presentation delivered at the OECD Seminar on Translating and Adapting Instruments in Large-Scale Assessments
Assessing people who communicate in different languages by using multiple language versions of an assessment.
Why have multiple language versions of a test?
- Need to compare individuals who function in different languages
  - International comparisons
  - Cross-cultural research
  - Certification testing
  - Employee surveys
  - Employee selection
  - Admissions testing
  - District testing
Why have multiple language versions of a test (cont.)?
- Enhance fairness:
  - Remove the language barrier.
  - Adjust tests for culture- or language-related differences that are not relevant to the construct measured.
  - Provide options for people to select their best language.
Multiple-language versions of a test

- Seen as one way to promote FAIRNESS by allowing examinees to access and interact with the test in their native language.
The Standards for Educational & Psychological Testing

A test that is fair within the meaning of the Standards:
a) reflects the same construct(s) for all test takers;
b) produces scores that have the same meaning for all individuals in the intended population;
c) does not advantage or disadvantage some individuals because of characteristics irrelevant to the intended construct.
The (1999) Standards pointed out…

"any test that employs language is, in part, a measure of [students'] language skills… This is of particular concern for test takers whose first language is not the language of the test… In such instances, test results may not reflect accurately the qualities and competencies intended to be measured." (AERA et al., 1999, p. 91)
Federal law in the United States

NCLB required states to assess English learners using "assessments in the language and form most likely to yield accurate data on what such students know and can do in academic content areas, until such students have achieved English language proficiency" (NCLB, 2002, cited in Stansfield, 2003, p. 203).
Interest in "multilingualism" in the USA is very different from 100 years ago:

"If English was good enough for Jesus, it's good enough for the school children of Texas."

Texas Governor James "Pa" Ferguson (1917), after vetoing a bill to finance the teaching of foreign languages in classrooms.
How do we assess people who communicate using different languages?
The most common procedure is to translate (adapt) an existing test into other languages.
In this talk, I will describe

1. validity issues
2. "quality control" procedures
3. research designs
4. statistical methods

for developing and evaluating tests designed for cross-lingual assessment.
I will also discuss

5. Standards and Guidelines relevant to cross-lingual assessment
   - AERA et al. (2014) Standards
   - ITC Guidelines on Translating and Adapting Tests
ITC Guidelines on Translating and Adapting Tests

- Most famous of all the ITC guidelines
- First published by Hambleton (1994)
- Many versions & revisions: 2005, 2010, 2017
Standards for test score interpretation (1)

AERA et al. Standards (2014): "Simply translating a test from one language to another does not ensure that the translation produces a version of the test that is comparable in content and difficulty level to the original version of the test, or that the translated test produces scores that are equally reliable/precise and valid..." (p. 60)
ITC "Test Adaptation" Guidelines (2017)

- Instrument developers/publishers should implement systematic judgmental evidence, both linguistic and psychological, to improve the accuracy of the adaptation process and compile evidence on the equivalence of all language versions.
- Test developers/publishers should apply appropriate statistical techniques to (1) establish the equivalence of the language versions of the test, and (2) identify problematic components or aspects of the test that may be inadequate in one or more of the tested populations.
SUMMARY: What do the professional standards say?

- Adapted tests cannot be assumed to be equivalent.
- Research is necessary to demonstrate test and score comparability.
- Statistical methods can help evaluate item/test comparability.
Test Adaptation and Research Methodology

Good research designs are needed for
- The translation/adaptation process (qualitative)
- Evaluating the similarity (comparability) of the scores from different language tests (quantitative)
- Establishing formal relationships among the different tests (quantitative)
Adaptation and Validity

- All of these research designs aim to provide evidence of validity
  - of score interpretations
  - for the use of the test scores
But what is validity?
Standards' Definition

"Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests" (AERA, APA, & NCME, 2014, p. 11)
Therefore, validity issues in cross-lingual assessment must consider the purpose of the assessment.

- Validation involves gathering data to support (defend) the use of a test for a particular purpose.
- Different research designs are needed to support different uses of test scores (i.e., different interpretations of scores).
Adapting a test to use in another language/culture

- If test score interpretations are to be made WITHIN a language group, no evidence of comparability across languages is needed.
  - Evidence that scores are appropriate for whatever they are used for is (obviously) still needed.
  - E.g., scores on the Prueba de Aptitud Academica used only for selection into graduate school in Puerto Rico
However, if scores are to be compared ACROSS language groups:

- Need to validate comparative inferences
  - E.g., TIMSS, PISA
- How?
  - Demonstrate construct equivalence
  - Rule out method bias
  - Rule out item bias
  (van de Vijver)
Are scores from different language versions of an exam supposed to be "comparable"?

- If comparable means "equated," or on the same scale, this is probably impossible.
  - In its strictest sense, equating implies examinees would get the same score no matter which "form" of the exam they took.
- However, comparability does not need to mean "equivalent" or equated.
Can test scores from different language versions of an exam be considered "comparable"?

- Through (a) careful adaptation procedures and (b) statistical evaluation of score comparability, an argument can be made that cross-linguistic inferences are appropriate.
Adapting "comparable" tests across languages

Involves:
1. Quality adaptation involving multiple steps, multiple translators, and multiple quality control checks
2. Statistical analysis of structural equivalence and differential item functioning
3. Qualitative analysis of item bias and method bias
4. Developing a sound validity argument for comparative inferences
What can we do to promote fairness in cross-lingual assessment?

- Qualitative: Develop and translate tests carefully
- Quantitative: Evaluate the statistical characteristics of test and item scores
  - factor structure
  - differential item functioning
  - consistency of criterion-related validity evidence
Questions that statistical techniques can help us answer

- How "similar" are tests?
- How "similar" are items?
- How "comparable" are test scores?
Methods for evaluating construct equivalence

- Differential predictive validity
- Exploratory factor analysis: common practice is to conduct separate analyses in each language group and compare solutions.
  - Same number of factors?
  - Do items load (cluster) in the same way?
  - Subjective comparison of solutions.
- Unlike exploratory factor analysis, both SEM and MDS allow for simultaneous analysis across multiple groups.
CFA Model

y = Λη + ε

where y is a (p × 1) column vector of scores for person i on the p items, Λ is a (p × q) matrix of loadings of the p items on the q latent variables, η is a (q × 1) vector of latent variables, and ε is a (p × 1) column vector of measurement residuals.
CFA Model (continued)

y = Λη + ε

Taking the variance of both sides gives the implied covariance structure:

Σx = ΛxΦΛx′ + Θδ

where Σx = observed variance/covariance matrix, Φ = variance/covariance matrix of the latent variables, and Θδ = diagonal matrix of error variances.

Multiple-group CFA involves evaluating the invariance of Σx, Φ, and/or Θδ across groups.
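To make the covariance structure concrete, the sketch below (with hypothetical loadings and error variances, not data from any study discussed here) builds Σ = ΛΦΛ′ + Θ for two language groups and shows that fully invariant parameter matrices imply identical implied covariance matrices.

```python
# Minimal sketch with hypothetical numbers: the implied covariance structure
# Sigma = Lambda * Phi * Lambda' + Theta for a one-factor, three-item model.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def implied_cov(Lam, Phi, Theta_diag):
    """Sigma = Lambda Phi Lambda' + Theta, with Theta diagonal."""
    S = matmul(matmul(Lam, Phi), transpose(Lam))
    for i, e in enumerate(Theta_diag):
        S[i][i] += e
    return S

# Hypothetical loadings for two language groups (p = 3 items, q = 1 factor)
Lam_g1 = [[0.8], [0.7], [0.6]]
Lam_g2 = [[0.8], [0.7], [0.6]]   # identical -> loading invariance holds
Phi = [[1.0]]                    # factor variance fixed to 1
Theta = [0.36, 0.51, 0.64]       # unique (error) variances

S1 = implied_cov(Lam_g1, Phi, Theta)
S2 = implied_cov(Lam_g2, Phi, Theta)

# With invariant Lambda, Phi, and Theta, the implied covariance
# matrices are identical across the two groups.
assert S1 == S2
```

In practice the group-specific matrices are estimated and their equality is tested statistically; this sketch only illustrates the algebra of the model.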
Weighted MDS model

d_ijk = [ Σ_{a=1}^{r} w_ka (x_ia − x_ja)² ]^{1/2}

where i, j = items, k = group, a = dimension, x = coordinate, and w = group weight on dimension.
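The formula can be illustrated directly: in the sketch below (hypothetical coordinates and weights), the same pair of items yields different distances for two groups because each group applies its own dimension weights.

```python
import math

# Minimal sketch with hypothetical values: the weighted MDS distance
# d_ijk = sqrt(sum_a w_ka * (x_ia - x_ja)**2), where group k stretches
# or shrinks the shared dimensions by its own weights w_ka.
def weighted_distance(x_i, x_j, w_k):
    return math.sqrt(sum(w * (xi - xj) ** 2
                         for xi, xj, w in zip(x_i, x_j, w_k)))

# Two items located in a common 2-dimensional space
x_i = [1.0, 0.0]
x_j = [0.0, 1.0]

w_group1 = [1.0, 1.0]    # group 1 weights both dimensions equally
w_group2 = [4.0, 0.25]   # group 2 emphasizes dimension 1

d1 = weighted_distance(x_i, x_j, w_group1)
d2 = weighted_distance(x_i, x_j, w_group2)

# Different group weights imply different distances between the
# same pair of items -- the core idea of the weighted model.
assert d1 != d2
```

Comparing the estimated group weights (rather than these illustrative ones) is what allows a simultaneous structural comparison across language groups.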
Example of using CFA & MDS to evaluate different language versions of a test

- Sireci, Bastari, & Allalouf (1998)
- Psychometric Entrance Test (PET)
  - Used in Israel for postsecondary admissions decisions
  - We looked at Hebrew and Russian versions of items from the Verbal Reasoning section of the exam.
PET: Analysis of Construct Equivalence

- Verbal reasoning test
- Cross-lingual DIF analyses also catch differences in cultural familiarity.
Important Note

- Methods for detecting DIF were not designed for studying translated/adapted items.
- Problem:
  (a) Cannot assume translated/adapted items are equivalent
  (b) Cannot assume different language groups are equivalent
How is translation DIF different from "normal" DIF?

- Items are NOT the "same" for studied groups.
- Groups cannot be considered randomly equivalent.
- Different groups, different items…
How can this problem be solved?

- It cannot be solved.
- However, there are (at least) 4 things that can help:
  1) Careful adaptation procedures
  2) Advanced research designs
  3) Aforementioned statistical analyses
  4) Making certain assumptions

Systematic bias must be ruled out to justify the conditioning variable in DIF analyses.
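One of the "relatively simple" DIF procedures referred to later in this talk is the Mantel-Haenszel common odds ratio, computed within strata of the conditioning variable (typically total score). The sketch below uses simulated counts (all numbers hypothetical, not from any study mentioned here) to show the computation for a single item.

```python
# Minimal sketch with simulated counts: a Mantel-Haenszel DIF check for one
# item, conditioning on total-score strata.
# Each stratum: (ref_right, ref_wrong, focal_right, focal_wrong)
strata = [
    (40, 60, 20, 30),   # low scorers
    (70, 30, 34, 16),   # middle scorers
    (90, 10, 45, 5),    # high scorers
]

# MH common odds-ratio estimate: sum(A*D/N) / sum(B*C/N), pooled over strata
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den

# A value near 1.0 suggests no DIF once ability is conditioned on;
# values far from 1.0 flag the item for review by translation teams.
print(round(alpha_mh, 3))  # → 1.037, close to 1.0: little evidence of DIF
```

Because the conditioning score itself may contain biased items, this check is usually paired with the purification step described later.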
1) Careful adaptation procedures

Example: Angoff and Cook (1988)
- Items in Spanish translated to English.
- Items in English translated to Spanish.
- Independent translators evaluated translations.
- Also, iterative DIF screening procedures.
2) Advanced research designs

Example: Sireci & Berberoglu (2000)
- Bilingual group design
- Two (randomly equivalent) groups of Turkish-English bilinguals took counterbalanced English/Turkish surveys.
- Random equivalence was tested.
- Polytomous (Likert) data:
  - Samejima's graded response model was used (IRT-LR)
2) Advanced research designs (continued)

Sireci & Berberoglu (2000)
- Items identified for DIF were removed from the conditioning variable (theta) before making comparisons across groups (i.e., purified criterion)
- Non-DIF items could be used to anchor the scale across languages
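The purification step above can be sketched simply: items flagged by a DIF screen are dropped from the matching total score, and the surviving items become candidate anchors. All item responses and flags below are hypothetical.

```python
# Minimal sketch with hypothetical data: purifying the DIF conditioning
# variable. Flagged items are removed from the matching total score before
# between-group comparisons; the remaining items can anchor the scale.
responses = {                      # hypothetical 0/1 scores on 5 items
    "person_a": [1, 1, 0, 1, 0],
    "person_b": [0, 1, 1, 1, 1],
}
dif_flagged = {2}                  # item index 2 flagged by a DIF screen

def purified_score(item_scores, flagged):
    """Total score over non-DIF items only (the 'purified' criterion)."""
    return sum(s for i, s in enumerate(item_scores) if i not in flagged)

anchors = [i for i in range(5) if i not in dif_flagged]
scores = {p: purified_score(r, dif_flagged) for p, r in responses.items()}

print(anchors)   # [0, 1, 3, 4] -> items usable to anchor the scale
print(scores)    # {'person_a': 3, 'person_b': 3}
```

In a full analysis this screening is iterated: DIF is re-estimated with the purified criterion until the flagged set stabilizes.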
Other advanced research design idea

- Link score scales through an external criterion (Wainer, 1993)
  - Separate predictive validity studies
  - Logical, but has it ever been used?
- Problem: How to validate the criterion.
Other approaches

- Use DIF screening to identify items to form the link.
  - E.g., PET exam: Allalouf, A., Rapp, J., & Stoller, R. (2009). What item types are best suited to the linking of verbal adapted tests? International Journal of Testing, 9, 92-107.
- Also used by PIRLS, PISA, TIMSS
3) Statistical analyses

- Analysis of structural equivalence
  - Evaluate factorial invariance
  - Justify matching variable for DIF analysis
- "Double linking" equating evaluation
  - Rapp & Allalouf (2003). Evaluating cross-lingual equating. International Journal of Testing, 3, 101-117.
4) Making certain assumptions

- Groups are randomly equivalent
  - Canada
- Anchor items are equivalent across languages (Allalouf et al., 2009)
  - Screened items
  - Non-verbal items
- No systematic bias

These assumptions must be defended!
SUMMARY: Evaluating Cross-lingual Assessments: What have we learned?

1. Judgmental methods for evaluating adapted tests and items are not enough.
2. Statistical procedures are also necessary.
3. Relatively simple DIF detection procedures are effective for identifying "big" problem items, even when sample sizes are small.
Lessons Learned (continued)

4. CFA and MDS are useful for evaluating construct equivalence across different language versions of a test.
   - Need to look at structure simultaneously across multiple groups.
   - CFA: confirmatory
   - MDS: exploratory
Lessons Learned (continued)

5. DIF is often explainable:
   - Differences in cultural relevance
   - Translation problems
   - Administration problems
Lessons Learned (continued)
6. Validate testing purpose(s) associated with cross-lingual assessment
If scores/people are compared across languages

- Some type of linking is needed.
- Linking based on the subset of items most likely to be psychologically and statistically equivalent may be defensible:
  - Confirmed by translation teams
  - Pass DIF screening procedures
  - Representative of total test
  - Common stratifying variable(s) defended via structural analysis
Lessons Learned (continued)

7. Given that linking is not equating, cross-lingual inferences should be cautiously interpreted.
   - What cautions should international assessments provide for cross-lingual comparisons?
   - What cautions should they provide for same-language, cross-cultural comparisons?
   - What cautions should be provided for same language/culture comparisons over time?
Next Steps/Future Research

- What are the best ways to interpret and communicate cross-lingual inferences?
- Using the AERA et al. Standards' five sources of validity evidence to evaluate cross-lingual inferences.
- Using lessons learned from DIF analyses to avoid future adaptation problems.
Concluding remarks

We continue to make progress on methods for evaluating adapted tests, but of course, more research is needed.

We continue to make progress on the process of test adaptation.

The ITC Guidelines (2017) are a good foundation upon which we should build.
Resources for Learning More

International Test Commission