Evaluating Measurement Invariance with Cross Cultural Sensitivity
Leslie R. Hawley, Betty-Jean Usher-Tate, Sara E. Gonzalez, & Natalie Koziol
Why is this topic important?
• Evaluating Measurement Invariance with Cross Cultural Sensitivity
• No man’s an island
  – Diversity
  – Globalization
  – Litigation
  – Professional Standards
  – Score Interpretation & Comparisons
Why is this topic important?
• $$$$$$
  – School Finance
  – Business of testing
  – Philanthropists
  – Accountability Reforms
• Legal Issues
  – US Constitution (State Control)
  – Federal (ESEA, NCLB, RTTT, ESSA)
  – Litigation
• Decisions Based on Scores
  – High-stakes and low-stakes
  – Benefits & Consequences
  – Unintended Consequences
Concepts: Validity and Validation
The validity question – Does the test/instrument do what it is designed to do and do so consistently?
– Validity has a long history in psychology & testing
– Validity is assessed through validation research
– Validation focuses upon the research that substantiates the evidential basis for test uses
– The validation process utilizes both empirical evidence and theoretical bases to support
(Geisinger, in press)
Validation is the responsibility of both the test user (consumer) and the test publisher (vendor)
2014 Standards (APA, AERA & NCME)
• Standards for Educational and Psychological Testing (Standards) defines validity as:
  – “a unitary concept”
  – “the degree to which all accumulated evidence supports the intended interpretation of test scores for the proposed use” (AERA, APA, & NCME, 2014, p. 14)
• The Standards considers validity:
  – “the most fundamental consideration in developing and evaluating tests” (AERA, APA, & NCME, 2014, p. 11)
• Validity (property of the scores)
  – Interpretation
  – Decisions (high-stake, low-stake, consequences)
  – Fairness
  – Comparability
  – Trust/Confidence
Validity Argument & Evidence
The validity argument is constructed much like arguments in a court case; there are expectations or standards to uphold and there may also be important mitigating circumstances unique to the measure or sample.
• Where one may look for evidence for a validity argument:
• Development process
  – Blueprint [definition and outline of construct]
  – Define intended uses for scores
  – Test manual
  – Norm sample
• Content of the instrument
• Response processes by test takers
• Consistency [internal structure of the assessment]
• Fairness [relationship of test scores with other variables]
• Outcome impact [benefits and consequences]
Validity as a Unitary Concept
• Historically, validity had been conceptualized categorically: content, construct, discriminant, convergent, . . .
• What may seem like different types of validity are now viewed as different sources of evidence related to the overarching unitary concept of validity
  – “all validity is of one kind, namely, construct validity” (Messick, 1998, p. 37)
• Accumulating construct evidence is an umbrella approach that subsumes all validation
  – Includes, but is not limited to:
    • Reliability evidence
    • Statistical conclusion validity evidence
    • Content evidence
    • Convergent, discriminant, and factorial evidence
    • Evaluation of group differences
Validity Evidence
Validation requires a clear argument for the proposed interpretations and uses of scores (Kane, 2006)
– Interpretive argument → inferences from the observed data to any claims we hypothesize
  • Outlines reasoning and provides specific claims that need to be evaluated
  • Framework for evaluation
– Validity argument → evaluation of the interpretive argument
“Validity is an inductive summary of both the existing evidence for and the actual as well as potential consequences of score interpretation and use” (Messick, 1989, p. 5)
Validity Evidence
Evidence is based on a particular use and interpretation
• Specific to how we define the construct
• Determines how we can interpret scores from our measure
• Validity is a property of the scores and not the instrument
Evidence should be multifaceted
• Variety of sources and methods
• Need to provide “a convincing, comprehensive validity argument” (Sireci, 2009, p. 33)
Validity Evidence
“Multiple lines of evidence ... consonant with the inference, while establishing that alternative inferences are less well supported” (Messick, 1989, p. 5)
Multiple sources for accumulating validity evidence
– Considerations for cultural and linguistic differences
– Test platform and issues of access and/or familiarity
– Today’s focus is primarily on content and construct sources of evidence
Perception, Trust and Confidence
• Face Validity
• Not always seen as a legitimate component of the validity argument
• Empirical methods
Content-related Evidence
Potential questions of interest:
– How well does the measure reflect the intended construct, knowledge, skills?
  • Relevance
  • Representativeness
– How were items developed?
– Were items evaluated prior to administration?
– Were multiple groups (e.g., women, minorities) represented in the development process?
Know Yourself
• Cultural Background
• Language
• Language Modality (i.e., Verbal, Nonverbal)
• Education (e.g., level, field)
History: Continuum of Procedures
• Literal translations were [are] standard practice.
  – Forward translation: native speaker of the target language and fluent in the source language.
  – Backward translation: native speaker of the source language and fluent in the target language.
• Societal shifts (i.e., globalization) led to increasing awareness of problems with translations alone.
• Need for adaptations and standardization of procedures arose!
What are the options?
• Literal translation
  – Pro: maintains metric equivalence
  – Con: does not take into account cultural differences; may not be adequate
• Adaptation
  – Pro: Adaptable to specific culture/group
  – Con: Increased difficulty to compare cross-culturally
• New test
  – Pro: Flexible; specific to culture/group
  – Con: Nearly no equivalence maintained
Reasons for Test Adaptation
• Knowledge and skills of interest are often the same across language groups
  – Test adaptation ensures consistency of content
• More efficient than developing a new test
• Test equivalence and fairness is simpler to establish
Steps for Adapting Measures
1. Checking content and format equivalence
2. Decide on suitability of translation/adaptation or creating of new measure/test
3. Select well-qualified translators
4. Translating and adapting process
5. Reviewing the adapted version
6. Conducting a small tryout of the adapted version
7. Carrying out a more ambitious study (check for validity and equivalence – to be discussed later)
8. Document the process
Hambleton & Patsula, 1999
ITC Guidelines
• Documentation of adaptation should be provided, along with evidence of the equivalence.
• Score differences among samples of populations cannot be taken at face value.
  – Researcher has responsibility to verify with other empirical evidence
• Comparisons can only be made at the level of invariance established for the scale.
ITC Guidelines
• Specific information on the ways in which the socio-cultural and ecological contexts potentially affect performance should be provided.
  – Test developers should suggest procedures to account for these effects in the interpretation of results.
• Apply appropriate statistical techniques to:
  – establish equivalence of different versions
  – identify problematic components
Progress?
• Has progress been made in test adaptation methodology?
• The Buros Center for Testing is changing the way individuals
  – assess their knowledge of testing diverse populations
  – partake in appropriate test selection.
Pruebas Publicadas en Español
• Resource that provides descriptive and analytical information about commercially available tests available in Spanish.
• Material presented in a bilingual manner
• Efforts to point out the need for adaptation
  – availability of norms for Spanish-speaking population
  – country/language the test originated
  – translation or adaptation processes implemented
  – test components
  – original name of the test
Carlson & Gonzalez, 2015
CONSTRUCT-RELATED EVIDENCE
[Figure: diagram with the labels Demographics, Scores, Groups 1 to 5, Skills, Theory, Traits, Purpose, Reliability, Scale, fairness, expert review, and DIF]
Construct-related Evidence
• What seems like different types of validity are different sources of evidence related to the overarching concept of construct validity
  – “all validity is of one kind, namely, construct validity” (Messick, 1998, p. 37)
• Constructs
  – Unobserved, latent characteristics given meaning through the combination of measurable attributes, skills, or traits
    • Ex: Depression, IQ, Conflict, Self-Efficacy, Motivation
  – Operationalization of constructs is guided by theory
Construct-related Evidence
• Construct evidence is based on a particular use and interpretation
  – Specific to how we define our construct
  – Determines how we can interpret scores from an instrument
• For instance, if we want to use a particular instrument to make comparisons between two groups we need to provide evidence of invariance
  – Is my construct measured the same way across groups?
Invariance
• In cross-cultural research we assume that both the instrument and the construct being measured are working the same way across different groups
• We assume the following are equal between groups:
  – Number of factors
  – Pattern of loadings on factors
  – Perception of item content
  – Loading size
  – Item means
  – Construct Dimensionality
    • Relationships between construct dimensions
Invariance
• If our assumptions between groups do not hold then our instrument may not represent the construct equally well across groups and we may not be able to interpret scores from the instrument across groups
• Consequently, it is important to test the validity of these assumptions
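One way to make these assumptions concrete is to write them as equality constraints in a multiple-group factor model. The notation here is a standard textbook sketch, not something taken from the slides: x_{ig} is the vector of item responses for person i in group g, \xi_{ig} is the latent factor score, and \tau_g, \Lambda_g, and \varepsilon_{ig} are the item intercepts, factor loadings, and residuals.

```latex
\begin{align*}
x_{ig} &= \tau_g + \Lambda_g\,\xi_{ig} + \varepsilon_{ig},
  \qquad g = 1, \dots, G \\[4pt]
\text{Configural:}\quad & \text{same number of factors and same pattern of zero and nonzero loadings in every } \Lambda_g \\
\text{Metric (weak):}\quad & \Lambda_1 = \Lambda_2 = \dots = \Lambda_G \\
\text{Scalar (strong):}\quad & \Lambda_1 = \dots = \Lambda_G \ \text{ and } \ \tau_1 = \dots = \tau_G
\end{align*}
```

Each level adds constraints to the one before it, so the models are nested and can be compared step by step.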
Invariance – Data Example
• 2012 Programme for International Student Assessment (PISA)
• 5 item scale: Teacher Support in Mathematics Classes
  – “The teacher shows an interest in every student’s learning”
  – “The teacher gives extra help when students need it”
  – “The teacher helps students with their learning”
• Data were collected using complex sampling techniques (students nested within schools)
• Two Countries: USA & Finland
Invariance – Data Example
• Initial analyses attempted to incorporate multilevel structure into invariance testing but the ICCs of the variables were close to 0 (e.g., .05) and models failed to converge
  – PISA sampling strategy
• Due to multilevel non-convergence, a single level approach was used in the subsequent examples
  – Multiple-Group Confirmatory Factor Analysis (MGCFA)
  – In instances of low ICCs, conventional MGCFA approaches will often provide unbiased estimates (Julian, 2001)
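To show what a low ICC means in practice, here is a minimal Python sketch of the one-way ANOVA estimator, often called ICC(1). The slides do not say which ICC estimator was used, so the choice of estimator, the function name `icc1`, and the school data are all assumptions made for illustration.

```python
from typing import Sequence


def icc1(clusters: Sequence[Sequence[float]]) -> float:
    """ICC(1) from a one-way random-effects ANOVA; clusters may differ in size."""
    sizes = [len(c) for c in clusters]
    n_clusters = len(clusters)
    n_total = sum(sizes)
    grand_mean = sum(sum(c) for c in clusters) / n_total
    means = [sum(c) / len(c) for c in clusters]

    # Between- and within-cluster sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for c, m in zip(clusters, means) for x in c)

    ms_between = ss_between / (n_clusters - 1)
    ms_within = ss_within / (n_total - n_clusters)

    # Effective cluster size; equals the common size when clusters are balanced.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (n_clusters - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Hypothetical item responses for students in three schools (made-up numbers).
schools = [[2, 3, 3, 4], [3, 4, 4, 5], [1, 2, 3, 2]]
print(round(icc1(schools), 3))
```

A value near 0 indicates that very little of the item variance lies between schools, which is the situation the slide describes. The estimate can be negative in small samples and is commonly truncated at 0.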
Invariance
• Configural Invariance (Baseline Model)
  – Does the same general factor structure (configuration) hold across countries?
[Figure: path diagrams for the United States and Finland; in each, the Support factor has indicators X1 to X5 with residuals e1 to e5]
• Metric (weak) Invariance
  – Do individual items behave similarly across countries?
    • Constraint: Factor loadings (λ) are held equal
  – Partial metric invariance is necessary to make valid inferences in latent factor means (Byrne, Shavelson & Muthén, 1989)
Invariance
• Constraint: Factor loadings (λ) are held equal
[Figure: Metric Invariance. Path diagrams for the United States and Finland; in each, the Support factor has indicators X1 to X5 with loadings λ1 to λ5 and residuals e1 to e5]
Metric Syntax (Mplus)
• Loadings held equal across groups
• Factor variance in reference group fixed to 1
[Syntax panels: Finland (Reference), USA]
Partial Metric Invariance
• Model fit (i.e., H0 LL; MLR scaling correction) was compared between Configural and Metric
  – Model fit was significantly worse with full metric invariance
  – Modification indices were used to iteratively adjust the model until fit was not significantly worse than the configural model
• Partial metric invariance was achieved after 2 iterations (only one constraint relaxed at a time)
  – 1 loading was freed
  – 1 residual covariance added for USA only
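The comparison of H0 loglikelihoods with MLR scaling corrections can be sketched as a scaled chi-square difference test. The Python below follows the loglikelihood-based formula that Mplus documents for MLR difference testing, as we understand it. The function names and all input values are hypothetical, not results from the PISA analysis. The chi-square tail probability is computed with a simple series so the example needs only the standard library.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail chi-square probability via the series for the regularized
    lower incomplete gamma function (adequate for moderate x)."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10_000):
        term *= half_x / (a + n)
        total += term
        if term < total * 1e-15:
            break
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def scaled_lr_test(ll0, c0, p0, ll1, c1, p1):
    """Scaled difference test; model 0 is the nested (more constrained) model.

    ll = loglikelihood, c = scaling correction factor, p = free parameters.
    """
    cd = (p0 * c0 - p1 * c1) / (p0 - p1)  # difference-test scaling correction
    trd = -2.0 * (ll0 - ll1) / cd         # scaled chi-square difference
    df = p1 - p0
    return trd, df, chi2_sf(trd, df)


# Hypothetical values: metric model (nested) vs. configural model.
trd, df, p = scaled_lr_test(ll0=-1010.0, c0=1.25, p0=26,
                            ll1=-1000.0, c1=1.20, p1=30)
print(f"scaled chi-square = {trd:.2f}, df = {df}, p = {p:.5f}")
```

A small p-value suggests the constrained model fits significantly worse, which is the signal to free one constraint and test again.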
• Scalar (strong) Invariance
  – Is the meaning of the construct and of the items equal across countries?
    • Constraint: Intercepts (τ) and loadings (λ) held equal
  – Scalar invariance is necessary to compare sum scores or observed means (van de Schoot, Lugtig & Hox, 2012)
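A short derivation, in assumed standard notation rather than notation from the slides, suggests why intercepts matter for observed means. Take item j in group g with intercept \tau_{jg}, loading \lambda_{jg}, latent mean \kappa_g, and zero-mean residuals:

```latex
\begin{align*}
E[x_{jg}] &= \tau_{jg} + \lambda_{jg}\,\kappa_g \\
E[x_{j1}] - E[x_{j2}] &= (\tau_{j1} - \tau_{j2}) + (\lambda_{j1}\kappa_1 - \lambda_{j2}\kappa_2) \\
&= \lambda_j\,(\kappa_1 - \kappa_2)
  \quad \text{when } \tau_{j1} = \tau_{j2} \text{ and } \lambda_{j1} = \lambda_{j2} = \lambda_j
\end{align*}
```

Only when both equalities hold does a difference in observed item means reflect nothing but a difference in latent means. Otherwise, intercept or loading differences are mixed into the comparison.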
Invariance
• Constraint: Intercepts (τ) and loadings (λ) held equal
[Figure: Scalar Invariance. Path diagrams for the United States and Finland; in each, the Support factor has indicators X1 to X5 with loadings λ1 to λ5, intercepts τ1 to τ5, and residuals e1 to e5]
Partial Scalar Syntax (Mplus)
• Loadings held equal across groups
• Intercepts held equal across groups
• Factor variance in reference group fixed to 1
• Factor mean of USA now free
[Syntax panels: Finland (Reference), USA]
Partial Scalar Invariance
• Model fit (i.e., H0 LL; MLR scaling correction) was compared between conditions
  – Model fit was significantly worse between partial metric and partial scalar conditions
  – Modification indices were used to iteratively adjust the model until fit was not significantly worse than the partial metric model
• Partial scalar invariance was achieved after 4 iterations
  – 4 intercepts were freed
How did this instrument do?
• Obtained: Partial Scalar invariance
• Minimum Goal: Partial Metric invariance
  – Inferences between latent factor means (Byrne, Shavelson & Muthén, 1989)
How did this instrument do?
• Possible reasons for finding non-invariance
  – Instrument translation
    • Per earlier content discussion
  – Bias (3 types)
    • Construct
      – Differential meanings across groups
    • Method
      – Sample, instrument, administrative
    • Item
      – Content, terminology, unclear wording
Byrne, 2012
What if we have more than 2 groups?
• Limitations of invariance methods with MGCFA and large number of groups
  – Number of groups compared at one time
  – Scalar invariance is rarely achieved with a large number of groups
• Alignment method (Asparouhov & Muthén, 2014)
  – Potential option for multiple groups (up to 100)
  – Mplus 7.1
  – Goal is to provide a method for comparing factor means & variances while permitting approximate measurement invariance
FINAL THOUGHTS
• Validity evidence should be multifaceted
  – Variety of sources and methods
• Evidence is based on a particular use and interpretation
  – Determines how we can interpret scores from our measure
• Cannot ignore cultural components that may influence our constructs
  – Need evidence to demonstrate equality of measurement to interpret scores across groups
• The validation process (accumulation of evidence) is a continual process
FINAL THOUGHTS
• Validity is at the crux of meaningful use of test scores, whether for decisions or comparisons.
• Based on analyses of test reviews published in the Mental Measurements Yearbooks ... “favorable evaluations of a test tend to be associated with greater provision of validity evidence.” (Cizek, Rosenberg, & Koons, 2008)