Evaluating Measurement Invariance with Cross Cultural … cyfs.unl.edu/cyfsprojects/videoPPT/71567726faaf05c291a4d5921f31113...

Evaluating Measurement Invariance with Cross Cultural Sensitivity

Leslie R. Hawley, Betty-Jean Usher-Tate, Sara E. Gonzalez, & Natalie Koziol

Why is this topic important?

•  Evaluating Measurement Invariance with Cross Cultural Sensitivity
•  No man’s an island
–  Diversity
–  Globalization
–  Litigation
–  Professional Standards
–  Score Interpretation & Comparisons

Why is this topic important?

$$$$$$
–  School Finance
–  Business of testing
–  Philanthropists
–  Accountability Reforms

Legal Issues
–  US Constitution (State Control)
–  Federal (ESEA, NCLB, RTTT, ESSA)
–  Litigation

Decisions Based on Scores
–  High-stakes and low-stakes
–  Benefits & Consequences
–  Unintended Consequences

Concepts: Validity and Validation

The validity question – Does the test/instrument do what it is designed to do and do so consistently?
–  Validity has a long history in psychology & testing
–  Validity is assessed through validation research
–  Validation focuses upon the research that substantiates the evidential basis for test uses
–  The validation process utilizes both empirical evidence and theoretical bases to support

(Geisinger, in press)

Validation is the responsibility of both the test user (consumer) and the test publisher (vendor)

2014 Standards (APA, AERA & NCME)

•  Standards for Educational and Psychological Testing (Standards) defines validity as:
–  “a unitary concept”
–  “the degree to which all accumulated evidence supports the intended interpretation of test scores for the proposed use” (AERA, APA, & NCME, 2014, p. 14)
•  The Standards considers validity:
–  “the most fundamental consideration in developing and evaluating tests” (AERA, APA, & NCME, 2014, p. 11)
•  Validity (property of the scores)
–  Interpretation
–  Decisions (high-stake, low-stake, consequences)
–  Fairness
–  Comparability
–  Trust/Confidence

Validity Argument & Evidence

The validity argument is constructed much like arguments in a court case; there are expectations or standards to uphold and there may also be important mitigating circumstances unique to the measure or sample.

•  Where one may look for evidence for a validity argument:
•  Development process
–  Blueprint [definition and outline of construct]
–  Define intended uses for scores
–  Test manual
–  Norm sample
•  Content of the instrument
•  Response processes by test takers
•  Consistency [internal structure of the assessment]
•  Fairness [relationship of test scores with other variables]
•  Outcome impact [benefits and consequences]

Validity as a Unitary Concept

•  Historically, validity had been conceptualized categorically: content, construct, discriminant, convergent, . . .

•  What may seem like different types of validity are now viewed as different sources of evidence related to the overarching unitary concept of validity
–  “all validity is of one kind, namely, construct validity” (Messick, 1998, p. 37)
•  Accumulating construct evidence is an umbrella approach that subsumes all validation
–  Includes, but is not limited to:
•  Reliability evidence
•  Statistical conclusion validity evidence
•  Content evidence
•  Convergent, discriminant, and factorial evidence
•  Evaluation of group differences

Validity Evidence

Validation requires a clear argument for the proposed interpretations and uses of scores (Kane, 2006)
–  Interpretive argument → inferences from the observed data to any claims we hypothesize
•  Outlines reasoning and provides specific claims that need to be evaluated
•  Framework for evaluation
–  Validity argument → evaluation of the interpretive argument

“Validity is an inductive summary of both the existing evidence for and the actual as well as potential consequences of score interpretation and use” (Messick, 1989, p. 5)

Validity Evidence

Evidence is based on a particular use and interpretation
•  Specific to how we define the construct
•  Determines how we can interpret scores from our measure
•  Validity is a property of the scores and not the instrument

Evidence should be multifaceted
•  Variety of sources and methods
•  Need to provide “a convincing, comprehensive validity argument” (Sireci, 2009, p. 33)

Validity Evidence

“Multiple lines of evidence ... consonant with the inference, while establishing that alternative inferences are less well supported” (Messick, 1989, p. 5)

Multiple sources for accumulating validity evidence
–  Considerations for cultural and linguistic differences
–  Test platform and issues of access and/or familiarity
–  Today’s focus is primarily on content and construct sources of evidence

Perception, Trust and Confidence

•  Face Validity
•  Not always seen as a legitimate component of the validity argument
•  Empirical methods

Content-related Evidence

Potential questions of interest:
–  How well does the measure reflect the intended construct, knowledge, skills?
•  Relevance
•  Representativeness
–  How were items developed?
–  Were items evaluated prior to administration?
–  Were multiple groups (e.g., women, minorities) represented in the development process?

Example of Content-Related Evidence

Know Yourself

•  Cultural Background
•  Language
•  Language Modality (i.e., Verbal, Nonverbal)
•  Education (e.g., level, field)

History: Continuum of Procedures

•  Literal translations were [are] standard practice.
–  Forward translation: native speaker of the target language and fluent in the source language.
–  Backward translation: native speaker of the source language and fluent in the target language.
•  Societal shifts (i.e., globalization) led to increasing awareness of problems with translations alone.
•  Need for adaptations and standardization of procedures arose!

What are the options?

•  Literal translation
–  Pro: maintains metric equivalence
–  Con: does not take into account cultural differences; may not be adequate
•  Adaptation
–  Pro: Adaptable to specific culture/group
–  Con: Increased difficulty to compare cross-culturally
•  New test
–  Pro: Flexible; specific to culture/group
–  Con: Nearly no equivalence maintained

Reasons for Test Adaptation

•  Knowledge and skills of interest are often the same across language groups
–  Test adaptation ensures consistency of content
•  More efficient than developing a new test
•  Test equivalence and fairness is simpler to establish

Steps for Adapting Measures

1.  Checking content and format equivalence
2.  Decide on suitability of translation/adaptation or creating of new measure/test
3.  Select well-qualified translators
4.  Translating and adapting process
5.  Reviewing the adapted version
6.  Conducting a small tryout of the adapted version
7.  Carrying out a more ambitious study (check for validity and equivalence – to be discussed later)
8.  Document the process

Hambleton & Patsula, 1999

ITC Guidelines

•  Documentation of adaptation should be provided, along with evidence of the equivalence.
•  Score differences among samples of populations cannot be taken at face value.
–  Researcher has responsibility to verify with other empirical evidence
•  Comparisons can only be made at the level of invariance established for the scale.

ITC Guidelines

•  Specific information on ways in which the socio-cultural and ecological contexts potentially affect performance should be provided.
–  Test developers should suggest procedures to account for these effects in the interpretation of results.
•  Apply appropriate statistical techniques to:
–  establish equivalence of different versions
–  identify problematic components

Progress?

•  Has progress been made in test adaptation methodology?
•  The Buros Center for Testing is changing the way individuals
–  assess their knowledge of testing diverse populations
–  partake in appropriate test selection.

Pruebas Publicadas en Español

•  Resource that provides descriptive and analytical information about commercially available tests available in Spanish.
•  Material presented in a bilingual manner
•  Efforts to point out the need for adaptation
–  availability of norms for Spanish-speaking population
–  country/language the test originated
–  translation or adaptation processes implemented
–  test components
–  original name of the test

Carlson & Gonzalez, 2015

CONSTRUCT-RELATED EVIDENCE

[Figure: diagram with the labels Demographics, Scores, Groups 1–5, Skills, Theory, Traits, Purpose, Reliability, Scale, fairness, expert review, DIF]

Construct-related Evidence

•  What seems like different types of validity are different sources of evidence related to the overarching concept of construct validity
–  “all validity is of one kind, namely, construct validity” (Messick, 1998, p. 37)
•  Constructs
–  Unobserved, latent characteristics given meaning through the combination of measurable attributes, skills, or traits
•  Ex: Depression, IQ, Conflict, Self-Efficacy, Motivation
–  Operationalization of constructs is guided by theory

Construct-related Evidence

•  Construct evidence is based on a particular use and interpretation
•  Specific to how we define our construct
•  Determines how we can interpret scores from an instrument
•  For instance, if we want to use a particular instrument to make comparisons between two groups we need to provide evidence of invariance
–  Is my construct measured the same way across groups?

Invariance

•  In cross-cultural research we assume that both the instrument and the construct being measured are working the same way across different groups
•  We assume the following are equal between groups:
–  Number of factors
–  Pattern of loadings on factors
–  Perception of item content
–  Loading size
–  Item means
–  Construct Dimensionality
•  Relationships between construct dimensions
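
The equalities listed above can be written compactly with the usual common factor model. The block below is a sketch for a single factor. The symbols λ (loadings) and τ (intercepts) follow the later invariance slides; ξ (latent factor), ε (residual), κ (factor mean), φ (factor variance), and θ (residual variance) are notation assumed here for illustration. Item j for person i in group g is modeled as:

```latex
\[
x_{ijg} = \tau_{jg} + \lambda_{jg}\,\xi_{ig} + \varepsilon_{ijg}
\]
\[
\mathrm{E}(x_{jg}) = \tau_{jg} + \lambda_{jg}\,\kappa_{g},
\qquad
\mathrm{Var}(x_{jg}) = \lambda_{jg}^{2}\,\phi_{g} + \theta_{jg}
\]
```

In this notation, "loading size" being equal between two groups means λ_{j1} = λ_{j2} for every item j, and item means are comparable only when the intercepts τ_{j1} = τ_{j2} are equal as well.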

Invariance

•  If our assumptions between groups do not hold then our instrument may not represent the construct equally well across groups and we may not be able to interpret scores from the instrument across groups
•  Subsequently, it is important to test the validity of these assumptions

Invariance – Data Example

•  2012 Programme for International Student Assessment (PISA)
•  5 item scale: Teacher Support in Mathematics Classes
–  “The teacher shows an interest in every student’s learning”
–  “The teacher gives extra help when students need it”
–  “The teacher helps students with their learning”
•  Data were collected using complex sampling techniques (students nested within schools)
•  Two Countries: USA & Finland

Invariance – Data Example

•  Initial analyses attempted to incorporate multilevel structure into invariance testing but the ICCs of the variables were close to 0 (e.g., .05) and models failed to converge
–  PISA sampling strategy
•  Due to multilevel non-convergence, a single level approach was used in the subsequent examples
–  Multiple-Group Confirmatory Factor Analysis (MGCFA)
–  In instances of low ICCs, conventional MGCFA approaches will often provide unbiased estimates (Julian, 2001)
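
To make the ICC concrete, here is a minimal sketch of the one-way random-effects ICC(1) for a single observed item, with students grouped by school. The function name `icc1` and the toy scores are invented for illustration. It is not the model-based ICC that SEM software reports, and it is not the PISA data.

```python
from statistics import mean


def icc1(clusters):
    """One-way random-effects ICC(1) for a list of clusters (lists of scores).

    Hypothetical helper for illustration: share of score variance that lies
    between clusters (e.g., schools) rather than within them.
    """
    k = len(clusters)
    n_total = sum(len(c) for c in clusters)
    grand_mean = sum(sum(c) for c in clusters) / n_total

    # Between- and within-cluster sums of squares from a one-way ANOVA.
    ss_between = sum(len(c) * (mean(c) - grand_mean) ** 2 for c in clusters)
    ss_within = sum((x - mean(c)) ** 2 for c in clusters for x in c)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective cluster size; equals the common size when clusters are balanced.
    n0 = (n_total - sum(len(c) ** 2 for c in clusters) / n_total) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Toy data: three "schools" with three students each (made-up scores).
schools = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(round(icc1(schools), 3))
```

Values near 0, like the .05 mentioned above, indicate that very little of the item variance is attributable to schools. Note that this estimator can come out negative in small samples.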

Invariance

The following steps were conducted to evaluate measurement invariance:

(Millsap, 2011)

Invariance

•  Configural Invariance (Baseline Model)
–  Does the same general factor structure (configuration) hold across countries?

[Figure: one-factor path diagrams for the United States and Finland; in each, Support is measured by items X1–X5 with residuals e1–e5]

Configural Syntax (Mplus)

•  Baseline model
•  Everything is separate across groups

Finland (Reference), USA

•  Metric (weak) Invariance
–  Do individual items behave similarly across countries?
•  Constraint: Factor loadings (λ) are held equal
–  Partial metric invariance is necessary to make valid inferences in latent factor means (Byrne, Shavelson & Muthén, 1989)

Invariance

•  Constraint: Factor loadings (λ) are held equal

[Figure: Metric Invariance. One-factor path diagrams for the United States and Finland; in each, Support is measured by items X1–X5 with loadings λ1–λ5 and residuals e1–e5]

Metric Syntax (Mplus)

•  Loadings held equal across groups
•  Factor variance in reference group fixed to 1


Partial Metric Invariance

•  Model fit (i.e., H0 LL; MLR scaling correction) was compared between Configural and Metric
–  Model fit was significantly worse with full metric invariance
–  Modification indices were used to iteratively adjust the model until fit was not significantly worse than the configural model
•  Partial metric invariance was achieved after 2 iterations (only one constraint relaxed at a time)
–  1 loading was freed
–  1 residual covariance added for USA only
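
The comparison of H0 log-likelihoods with MLR scaling correction factors is usually done with a scaled (Satorra-Bentler type) difference test. The sketch below assumes that standard formula. The function name `scaled_ll_difference` and all numbers are hypothetical; they are not the values from the PISA analysis.

```python
def scaled_ll_difference(ll0, c0, p0, ll1, c1, p1):
    """Scaled log-likelihood difference test for two nested MLR models.

    Model 0 is the more constrained (nested) model, e.g. metric;
    model 1 is the less constrained comparison model, e.g. configural.
    ll = H0 log-likelihood, c = scaling correction factor,
    p = number of free parameters.
    Returns (test statistic, degrees of freedom).
    """
    if p1 <= p0:
        raise ValueError("model 1 must have more free parameters than model 0")

    # Scaling correction for the difference test.
    cd = (p0 * c0 - p1 * c1) / (p0 - p1)
    if cd <= 0:
        raise ValueError("non-positive difference scaling correction")

    # Scaled statistic, referred to a chi-square with p1 - p0 df.
    trd = -2.0 * (ll0 - ll1) / cd
    return trd, p1 - p0


# Hypothetical inputs: constrained model vs. less constrained model.
trd, df = scaled_ll_difference(ll0=-1000.0, c0=1.2, p0=10,
                               ll1=-990.0, c1=1.3, p1=14)
print(round(trd, 3), df)
```

The statistic is compared with a chi-square distribution on `df` degrees of freedom. A significant result means the added equality constraints worsen fit, which is the signal used above to free one constraint at a time.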

•  Scalar (strong) Invariance
–  Are the meaning of the construct and items equal across countries?
•  Constraint: Intercepts (τ) and loadings (λ) held equal
–  Scalar invariance is necessary to compare sum scores or observed means (van de Schoot, Lugtig & Hox, 2012)
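
A short worked equation shows why intercepts matter for observed means. This is a sketch for one factor with item j in Finland (FIN) and the USA. λ and τ are the loadings and intercepts from the slides; κ, the latent factor mean, is notation assumed here. The three nested models differ only in which parameters are held equal:

```latex
\begin{align*}
\text{Configural:}\quad & \lambda_{j,\mathrm{FIN}},\ \lambda_{j,\mathrm{USA}},\ \tau_{j,\mathrm{FIN}},\ \tau_{j,\mathrm{USA}} \text{ all free (same one-factor pattern)} \\
\text{Metric:}\quad & \lambda_{j,\mathrm{FIN}} = \lambda_{j,\mathrm{USA}} = \lambda_{j} \\
\text{Scalar:}\quad & \lambda_{j,\mathrm{FIN}} = \lambda_{j,\mathrm{USA}} = \lambda_{j},
\qquad \tau_{j,\mathrm{FIN}} = \tau_{j,\mathrm{USA}} = \tau_{j}
\end{align*}

\begin{align*}
\text{Under metric invariance:}\quad
\mathrm{E}(x_{j,\mathrm{USA}}) - \mathrm{E}(x_{j,\mathrm{FIN}})
  &= (\tau_{j,\mathrm{USA}} - \tau_{j,\mathrm{FIN}})
   + \lambda_{j}\,(\kappa_{\mathrm{USA}} - \kappa_{\mathrm{FIN}}) \\
\text{Under scalar invariance:}\quad
\mathrm{E}(x_{j,\mathrm{USA}}) - \mathrm{E}(x_{j,\mathrm{FIN}})
  &= \lambda_{j}\,(\kappa_{\mathrm{USA}} - \kappa_{\mathrm{FIN}})
\end{align*}
```

Only in the scalar case does a difference in observed item means reflect nothing but a difference in the latent means; with unequal intercepts, part of the observed gap is an item-level offset unrelated to the construct.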

Invariance

•  Constraint: Intercepts (τ) and loadings (λ) held equal

[Figure: Scalar Invariance. One-factor path diagrams for the United States and Finland; in each, Support is measured by items X1–X5 with loadings λ1–λ5, intercepts τ1–τ5, and residuals e1–e5]

Partial Scalar Syntax (Mplus)

•  Loadings held equal across groups
•  Intercepts held equal across groups
•  Factor variance in reference group fixed to 1
•  Factor mean of USA now free


Partial Scalar Invariance

•  Model fit (i.e., H0 LL; MLR scaling correction) was compared between conditions
–  Model fit was significantly worse between partial metric and partial scalar conditions
–  Modification indices were used to iteratively adjust the model until fit was not significantly worse than the partial metric model
•  Partial scalar invariance was achieved after 4 iterations
–  4 intercepts were freed

How did this instrument do?

•  Obtained: Partial Scalar invariance
•  Minimum Goal: Partial Metric invariance
–  Inferences between latent factor means (Byrne, Shavelson & Muthén, 1989)

How did this instrument do?

•  Possible reasons for finding non-invariance
–  Instrument translation
•  Per earlier content discussion
–  Bias (3 types)
•  Construct
–  Differential meanings across groups
•  Method
–  Sample, instrument, administrative
•  Item
–  Content, terminology, unclear wording

Byrne, 2012

What if we have more than 2 groups?

•  Limitations of invariance methods with MGCFA and large number of groups
–  Number of groups compared at one time
–  Scalar invariance is rarely achieved with a large number of groups
•  Alignment method (Asparouhov & Muthén, 2014)
–  Potential option for multiple groups (up to 100)
–  Mplus 7.1
–  Goal is to provide a method for comparing factor means & variances while permitting approximate measurement invariance

FINAL THOUGHTS

FINAL THOUGHTS

•  Validity evidence should be multifaceted
–  Variety of sources and methods
•  Evidence is based on a particular use and interpretation
–  Determines how we can interpret scores from our measure
•  Cannot ignore cultural components that may influence our constructs
–  Need evidence to demonstrate equality of measurement to interpret scores across groups
•  The validation process (accumulation of evidence) is a continual process

FINAL THOUGHTS

•  Validity is at the crux for meaningful use of test scores, whether for decisions or comparisons.
•  Based on analyses of test reviews published in the Mental Measurements Yearbooks ... “favorable evaluations of a test tend to be associated with greater provision of validity evidence.” (Cizek, Rosenberg, & Koons, 2008)

Questions?

