Exploring Data: The Beast of Bias
Sources of Bias
A bit of revision. We’ve seen that having collected data we usually fit a model that represents the hypothesis that we want to test. This model is usually a linear model, which takes the form of:
When we fit a model, we often estimate the parameters (b) using the method of least squares (known as ordinary least squares or OLS). We’re not interested in our sample so much as a general population, so we use the sample data to estimate the value of the parameters in the population (that’s why we call them estimates rather than values). When we estimate a parameter we also compute an estimate of how well it represents the population such as a standard error or confidence interval. We can test hypotheses about these parameters by computing test statistics and their associated probabilities (p-values). Therefore, when we think about bias, we need to think about it within three contexts:
These situations are related because confidence intervals and test statistics both rely on the standard error. It is important that we identify and eliminate anything that might affect the information that we use to draw conclusions about the world, so we explore data to look for bias.
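To see why the standard error sits at the centre of all this, here is a minimal sketch in Python (not part of the handout, which uses SPSS). It fits a one-predictor model by least squares and then builds both a confidence interval and a test statistic from the same standard error. The data and the function name ols_simple are made up for illustration, and 3.182 is the two-tailed 5% critical value of t with 3 degrees of freedom.

```python
import math

def ols_simple(x, y):
    """Least-squares fit of y = b0 + b1*x; returns b0, b1 and the standard error of b1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual (error) sum of squares, then the standard error of the slope
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b0, b1, se_b1

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, se_b1 = ols_simple(x, y)

t_crit = 3.182                      # t critical value, df = n - 2 = 3
ci_low = b1 - t_crit * se_b1        # confidence interval uses the standard error
ci_high = b1 + t_crit * se_b1
t_stat = b1 / se_b1                 # so does the test statistic
print(b1, se_b1, (ci_low, ci_high), t_stat)
```

Anything that biases the standard error therefore biases the interval and the test together.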
Spotting outliers with graphs
A biologist was worried about the potential health effects of music festivals. So, one year she went to the Download Music Festival (http://www.downloadfestival.co.uk) and measured the hygiene of 810 concert goers over the three days of the festival. In theory each person was measured on each day but because it was difficult to track people down, there were some missing data on days 2 and 3. Hygiene was measured using a standardised technique (don’t worry it wasn’t licking the person’s armpit) that results in a score ranging between 0 (you smell like a rotting corpse that’s hiding up a skunk’s arse) and 5 (you smell of sweet roses on a fresh spring day). Now I know from bitter experience that sanitation is not always great at these places (Reading festival seems particularly bad…) and so this researcher predicted that personal hygiene would go down dramatically over the three days of the festival. The data file is called DownloadFestival.sav.
• Stacked histogram and Population Pyramid: If you had a grouping variable (e.g. whether men or women attended the festival) you could produce a histogram in which each bar is split by group (stacked histogram), or a population pyramid, which puts the outcome (in this case hygiene) on the vertical axis and each group (i.e. men vs. women) on the horizontal (i.e., the histograms for men and women appear back to back on the graph).
The resulting histogram is shown in Figure 3 and the first thing that should leap out at you is that there appears to be one case that is very different to the others. All of the scores appear to be squished up one end of the distribution because they are all less than 5 (yielding what is known as a leptokurtic distribution!) except for one, which has a value of 20! This score is an outlier. What’s odd about this outlier is that it has a score of 20, which is above the top of our scale (remember our hygiene scale ranged only from 0-5) and so it must be a mistake (or the person had obsessive compulsive disorder and had washed themselves into a state of extreme cleanliness). However, with 810 cases, how on earth do we find out which case it was? You could just look through the data, but that would certainly give you a headache and so instead we can use a boxplot.
You encountered boxplots or box-whisker diagrams last year. At the centre of the plot is the median, which is surrounded by a box the top and bottom of which are the limits within which the middle 50% of observations fall (the interquartile range). Sticking out of the top and bottom of the box are two whiskers which extend to the most and least extreme scores respectively. Outliers are shown as dots or stars (see my book for details).
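The question of which case holds the impossible score can also be answered numerically. The sketch below is Python rather than SPSS and is only an illustration: the scores are invented, flag_outliers is a hypothetical helper, and the fences use the common rule of 1.5 times the interquartile range beyond the box, which is an assumption about how the boxplot marks outliers.

```python
from statistics import quantiles

def flag_outliers(scores, scale_min=0.0, scale_max=5.0):
    """Return (case index, score) pairs that are off the scale or beyond the boxplot fences."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1                               # height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    flagged = []
    for i, x in enumerate(scores):
        off_scale = x < scale_min or x > scale_max
        beyond_fence = x < low_fence or x > high_fence
        if off_scale or beyond_fence:
            flagged.append((i, x))
    return flagged

# Hypothetical hygiene scores; one case has been mistyped as 20
day1 = [1.5, 2.1, 1.8, 2.4, 20.0, 1.2, 2.9, 1.7, 2.0, 2.3]
print(flag_outliers(day1))
```

The returned index tells you which row of the data to inspect and correct.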
• Simple boxplot: Use this option when you want to plot a boxplot of a single variable, but you want different boxplots produced for different categories in the data (for these hygiene data we could produce separate boxplots for men and women).
• Clustered boxplot: This option is the same as the simple boxplot except that you can select a second categorical variable on which to split the data. Boxplots for this second variable are produced in different colours. For example, we might have measured whether our festival-goer was staying in a tent or a nearby hotel during the festival. We could produce boxplots not just for men and women, but within men and women we could have different-coloured boxplots for those who stayed in tents and those who stayed in hotels.
Additivity and Linearity
The vast majority of statistical models in my book are based on the linear model, which takes this form:
$$\text{outcome}_i = b_1 X_{1i} + b_2 X_{2i} + \cdots + b_n X_{ni} + \text{error}_i$$
The assumption of additivity and linearity means that the outcome variable is, in reality, linearly related to any predictors (i.e., their relationship can be summed up by a straight line) and that if you have several predictors then their combined effect is best described by adding their effects together. In other words, it means that the process we’re trying to model can be accurately described as:
2. For significance tests of models (and the parameter estimates that define them) to be accurate the sampling distribution of what’s being tested must be normal. For example, if testing whether two means are different, the data do not need to be normally distributed, but the sampling distribution of means (or differences between means) does. Similarly, if looking at relationships between variables, the significance tests of the parameter estimates that define those relationships (the bs in Eq. 1) will be accurate only when the sampling distribution of the estimate is normal.
When you have a categorical predictor variable (such as people falling into different groups) you wouldn’t expect the overall distribution of the outcome (or residuals) to be normal. For example, if you have seen the movie ‘the Muppets’, you will know that Muppets live among us. Imagine you predicted that Muppets are happier than humans (on TV they seem to be). You collect happiness scores in some Muppets and some Humans and plot the frequency distribution. You get the graph on the left of Figure 7 and decide that your data are not normal: you think that you have violated the assumption of normality. However, you haven’t because you predicted that Humans and Muppets will differ in happiness; in other words, you predict that they come from different populations. If we plot separate frequency distributions for humans and Muppets (right of Figure 7) you’ll notice that within each group the distribution of scores is very normal. The data are as you predicted: Muppets are happier than humans and so the centre of their distribution is higher than that of humans. When you combine all of the scores this gives you a bimodal distribution (i.e., two humps). This example illustrates that it is not the normality of the outcome (or residuals) overall that matters, but normality at each unique level of the predictor variable.
1. Confidence intervals: The central limit theorem tells us that in large samples, the estimate will have come from a normal distribution regardless of what the sample or population data look like. Therefore, if we are interested in computing confidence intervals then we don’t need to worry about the assumption of normality if our sample is large enough.
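The claim about large samples can be checked with a small simulation. This is a sketch in Python under stated assumptions, not something from the handout: the population is a strongly skewed exponential distribution with a true mean of 1, the function ci_coverage is hypothetical, and the interval is the simple mean ± 1.96 standard errors. It counts how often that interval captures the true mean for small and for large samples.

```python
import math
import random
from statistics import mean, stdev

def ci_coverage(sample_size, n_samples=2000, seed=1):
    """Proportion of normal-theory 95% intervals that contain the true population mean."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    true_mean = 1.0                 # mean of an exponential distribution with rate 1
    hits = 0
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        m = mean(sample)
        se = stdev(sample) / math.sqrt(sample_size)
        if m - 1.96 * se <= true_mean <= m + 1.96 * se:
            hits += 1
    return hits / n_samples

small = ci_coverage(5)      # tiny samples from a skewed population
large = ci_coverage(100)    # large samples from the same population
print(small, large)
```

With the skewed population the intervals from tiny samples should miss the true mean noticeably more often than the nominal 5%, whereas the large samples should come close to 95% coverage even though the raw data are far from normal.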
Homogeneity of Variance/Homoscedasticity
The second assumption we’ll explore relates to variance and it can impact on the two main things that we might do when we fit models to data:
In designs in which you test several groups of participants this assumption means that each of these samples comes from populations with the same variance. In correlational designs, this assumption means that the variance of the outcome variable should be stable at all levels of the predictor variable. In other words, as you go through levels of the predictor variable, the variance of the outcome variable should not change.
When does homoscedasticity/homogeneity of variance matter?
Unequal variances/heteroscedasticity also creates bias and inconsistency in the estimate of the standard error associated with the parameter estimates (Hayes & Cai, 2007). This basically means that confidence intervals and
Independence
This assumption means that the errors in your model (the error_i in Eq. 1) are not related to each other. Imagine Paul and Julie were participants in an experiment where they had to indicate whether they remembered having seen particular photos. If Paul and Julie were to confer about whether they’d seen certain photos then their answers would not be independent: Julie’s response to a given question would depend on Paul’s answer. We know already that if we estimate a model to predict their responses, there will be error in those predictions and because Paul and Julie’s scores are not independent the errors associated with these predicted values will also not be independent. If Paul and Julie were unable to confer (if they were locked in different rooms) then the error terms should be independent (unless they’re telepathic): the error in predicting Paul’s response should not be influenced by the error in predicting Julie’s response.
The equation that we use to estimate the standard error is valid only if observations are independent. Remember that we use the standard error to compute confidence intervals and significance tests, so if we violate the assumption of independence then our confidence intervals and significance tests will be invalid. If we use the method of least squares, then model parameter estimates will still be valid but not optimal (we could get better estimates using a different method). In general, if this assumption is violated, there are techniques you can use described in Chapter 20 of Field (2013).
Numerically, SPSS uses methods to calculate skew and kurtosis (see Field (2013) if you have forgotten what these concepts are) that give values of zero in a normal distribution. Positive values of skewness indicate a pile-up of scores on the left of the distribution, whereas negative values indicate a pile-up on the right. Positive values of kurtosis indicate a pointy and heavy-tailed distribution, whereas negative values indicate a flat and light-tailed distribution. The further the value is from zero, the more likely it is that the data are not normally distributed.
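To make the sign conventions concrete, here is a minimal Python sketch of skewness and (excess) kurtosis based on simple moments. It is an illustration only: the function names and data are invented, and statistical packages typically apply small-sample corrections, so their values will differ somewhat from these for the same data.

```python
from statistics import mean

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so that a normal distribution scores 0."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

symmetric = [1, 2, 3, 4, 5]          # hypothetical scores, evenly spread
piled_left = [1, 1, 1, 2, 10]        # hypothetical scores piled up on the left

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(piled_left))
```

The evenly spread scores give a skew of zero and a negative kurtosis (flat), while the scores piled up on the left give a positive skew.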
Figure 8 shows the dialog boxes for the Explore command ( ). First, enter any variables of interest in the box labelled Dependent List by highlighting them on the left-hand side and transferring them by clicking on . For this example, select the hygiene scores for the three days. If you click on a dialog box appears, but the default option is fine (it will produce means, standard deviations and so on). The more interesting option for our current purposes is accessed by clicking on . In this dialog box select the option , and this will produce both the K-S test and some normal Q-Q plots. You can also split the analysis by a factor or grouping variable (for example, we could do a separate analysis for males and females by dragging gender to the Factor List box; we’ll do this later in the handout).
Figure 9 shows the histograms (from the self-test tasks) and the corresponding Q-Q plots. The day 1 scores look quite normal; the Q-Q plot echoes this view because the data points all fall very close to the ‘ideal’ diagonal line. However, the distributions for days 2 and 3 are not nearly as symmetrical as day 1: they both look positively skewed. Again, this can be seen in the Q-Q plots by the data points deviating away from the diagonal. In general, this seems to suggest that by days 2 and 3, hygiene scores were much more clustered around the low end of the scale. Remember that the lower the score, the less hygienic the person is, so generally people became smellier as the festival progressed. The skew occurs because a substantial minority insisted on upholding their levels of hygiene (against all odds) over the course of the festival (baby wet-wipes are indispensable I find).
For day 1 the K-S test is just about not significant (p = .097), which is surprisingly close to significant given how normal the day 1 scores looked in the histogram (Figure 3). However, the sample size on day 1 is very large (N = 810) and the near-significance of the K-S test for these data shows how in large samples even small and unimportant deviations from normality might be deemed significant by this test (Jane Superbrain Box). For days 2 and 3 the test is highly significant, indicating that these distributions are not normal, which is likely to reflect the skew seen in the histograms for these data but could again be down to the large sample (Figure 9).
✓ The hygiene scores on day 1, D(810) = 0.029, p = .097, did not deviate significantly from normal; however, day 2, D(264) = 0.121, p < .001, and day 3, D(123) = 0.140, p < .001, scores were both significantly non-normal.
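The D reported above is the largest gap between the cumulative distribution of the observed scores and that of a normal distribution with the same mean and standard deviation. The following Python sketch shows one way to compute such a statistic. It is an illustration under assumptions: ks_d and the two artificial data sets are invented, and the p-value is deliberately left out because it needs a correction for estimating the mean and standard deviation from the same data.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_d(scores):
    """Largest gap between the empirical CDF and a normal CDF fitted to the scores."""
    xs = sorted(scores)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Artificial data: evenly spaced normal quantiles, and a skewed version of them
n = 200
near_normal = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [math.exp(z) for z in near_normal]

print(ks_d(near_normal), ks_d(skewed))
```

The near-normal scores give a D close to zero, and the skewed scores give a clearly larger one, which mirrors the pattern from day 1 to days 2 and 3.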
Throughout this handout we will look at various significance tests that have been devised to look at whether assumptions are violated. These include tests of whether a distribution is normal (the Kolmogorov-Smirnov and Shapiro-Wilk tests), tests of homogeneity of variances (Levene’s test), and tests of significance of skew and kurtosis. All of these tests are based on null hypothesis significance testing and this means that (1) in large samples they can be significant even for small and unimportant effects, and (2) in small samples they will lack power to detect violations of assumptions.
Testing homogeneity of variance/homoscedasticity
Like normality, you can look at the variances using graphs, numbers and significance tests. Graphically, we can create a scatterplot of the values of the residuals plotted against the values of the outcome predicted by our model. In doing so we’re looking at whether there is a systematic relationship between what comes out of the model (the predicted values) and the errors in the model. Normally we convert the predicted values and errors to z-scores so this plot is sometimes referred to as zpred vs. zresid. If linearity and homoscedasticity hold true then there should be no systematic relationship between the errors in the model and what the model predicts. Looking at this graph can, therefore, kill two birds with one stone. If this graph funnels out, then the chances are that there is heteroscedasticity in the data. If there is any sort of curve in this graph then the chances are that the data have broken the assumption of linearity.
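As a rough numerical companion to the zpred vs. zresid plot, the Python sketch below converts predicted values and residuals to z-scores and compares the spread of the residuals in the upper and lower halves of the predicted values. The function names, the made-up numbers and the idea of splitting at the halfway point are all assumptions for illustration; a ratio far from 1 hints at a funnel shape, but it is no substitute for looking at the plot.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardise values to have a mean of 0 and a standard deviation of 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def spread_ratio(predicted, residuals):
    """SD of residuals for the highest predicted values divided by SD for the lowest."""
    pairs = sorted(zip(predicted, residuals))    # order cases by predicted value
    half = len(pairs) // 2
    lower = [r for _, r in pairs[:half]]
    upper = [r for _, r in pairs[-half:]]
    return pstdev(upper) / pstdev(lower)

# Hypothetical model output in which the errors fan out as predictions increase
predicted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
residuals = [0.1, -0.1, 0.0, -1.0, 1.0, 0.0]

zpred = z_scores(predicted)      # x-axis of the plot
zresid = z_scores(residuals)     # y-axis of the plot
ratio = spread_ratio(predicted, residuals)
print(ratio)
```

Here the residuals are ten times more spread out at high predicted values than at low ones, which is the kind of pattern that shows up as a funnel in the plot.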
• If Levene’s test is significant at p ≤ .05 then you conclude that the variances are significantly different; therefore, the assumption of homogeneity of variances has been violated.
• If, however, Levene’s test is non-significant (i.e., p > .05) then the variances are roughly equal and the assumption is tenable.
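For readers curious about what sits behind the p-value in the two rules above, here is a minimal Python sketch of the Levene statistic in its mean-based form: a one-way ANOVA carried out on the absolute deviations of each score from its own group mean. The function name levene_w and the toy groups are invented for illustration. The sketch stops at the statistic itself, which would then be compared against an F distribution with k − 1 and N − k degrees of freedom to obtain p.

```python
from statistics import mean

def levene_w(*groups):
    """Mean-based Levene statistic for two or more groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean
    z = []
    for g in groups:
        m = mean(g)
        z.append([abs(x - m) for x in g])
    z_means = [mean(zg) for zg in z]
    z_grand = sum(sum(zg) for zg in z) / n_total
    between = sum(len(zg) * (zm - z_grand) ** 2 for zg, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zg, zm in zip(z, z_means) for v in zg)
    return (n_total - k) / (k - 1) * between / within

# Toy groups: same spread in the first pair, very different spread in the second
same_spread = levene_w([1, 2, 3], [11, 12, 13])
different_spread = levene_w([1, 2, 3], [0, 10, 20])
print(same_spread, different_spread)
```

Groups with identical spread give a statistic of zero regardless of how far apart their means are, and the statistic grows as the spreads diverge.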
We can get Levene’s test using the explore menu that we used in the previous section. Sticking with the hygiene scores, we’ll compare the variances of males and females on day 1 of the festival. Use
Probably the best of these choices is to use robust tests, which is a term applied to a family of procedures to estimate statistics that are reliable even when the normal assumptions of the statistic are not met. For the purposes of this handout we’ll look at transforming data, and throughout the module we’ll use bootstrapping (which is a robust method explained in your lecture), but you can find more detail on these techniques and the others in Chapter 5 of Field (2013).
You can also change the confidence level by typing a number other than 95 in the box labelled Level(%). By default, SPSS uses 1000 bootstrap samples, which is a reasonable number and you certainly wouldn’t need to use more than 2000.
http://youtu.be/mNrxixgwA2M
Figure 12: The standard bootstrap dialog box
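The two bootstrap settings mentioned above, the number of samples and the confidence level, map directly onto a few lines of code. The sketch below is a simple percentile bootstrap for a mean written in Python. It is an assumption-laden illustration rather than a description of what SPSS does internally: bootstrap_ci is a hypothetical function, the scores are invented, and the percentile method is only one of several ways to form the interval.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of scores."""
    rng = random.Random(seed)               # fixed seed so the run is repeatable
    n = len(scores)
    # Resample the data with replacement n_boot times and store each mean
    boot_means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_boot))
    # Number of bootstrap means to cut off in each tail
    cut = round(n_boot * (100 - level) / 200)
    return boot_means[cut], boot_means[n_boot - cut - 1]

# Hypothetical, positively skewed hygiene scores
day2 = [0.0, 0.2, 0.3, 0.5, 0.5, 0.7, 0.8, 1.0, 1.1, 1.4, 2.0, 3.4]

low95, high95 = bootstrap_ci(day2)                 # defaults: 1000 samples, 95%
low99, high99 = bootstrap_ci(day2, level=99)       # a wider interval
print((low95, high95), (low99, high99))
```

Because the interval comes from the resampled means themselves, it does not rely on the sampling distribution being normal.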
Transforming Data
The final thing that you can do to combat problems with normality and linearity is to transform your data. The idea behind transformations is that you do something to every score to correct for distributional problems, outliers, lack of linearity or unequal variances. If you are looking at relationships between variables (e.g., regression) just transform the problematic variable, but if you are looking at differences between variables (e.g., change in a variable over time) then you need to transform all of those variables. For example, our festival hygiene data were not normal on days 2 and 3 of the festival. Now, we might want to look at how hygiene levels changed across the three days (i.e., compare the mean
Log transformation (log(Xi)): Taking the logarithm of a set of numbers squashes the right tail of the distribution. As such it’s a good way to reduce positive skew. This transformation is also very useful if you have problems with linearity (it can sometimes make a curvilinear relationship linear). However, you can’t get a log value of zero or negative numbers, so if your data tend to zero or produce negative numbers you need to add a constant to all of the data before you do the transformation. For example, if you have zeros in the data then do log(Xi + 1), or if you have negative numbers add whatever value makes the smallest number in the data set positive.
Reciprocal transformation (1/Xi): Dividing 1 by each score also reduces the impact of large scores. The transformed variable will have a lower limit of 0 (very large numbers will become close to 0). One thing to bear in mind with this transformation is that it reverses the scores: scores that were originally large in the data set become small (close to zero) after the transformation, but scores that were originally small become big after the transformation. For example, imagine two scores of 1 and 10; after the transformation they become 1/1 = 1, and 1/10 = 0.1: the small score becomes bigger than the large score after the transformation. However, you can avoid this by reversing the scores before the transformation, by finding the highest score and changing each score to the highest score minus the score you’re looking at. So, you do a transformation 1/(XHighest − Xi). Like the log transformation, you can’t take the reciprocal of 0 (because 1/0 = infinity) so if you have zeros in the data you need to add a constant to all scores before doing the transformation.
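The order reversal is easy to verify with the two scores from the example. The following Python sketch is an illustration with hypothetical function names. Note one assumption beyond the text: reversing the scores turns the highest score into 0, so the sketch adds a constant of 1 before taking the reciprocal, in line with the advice about zeros.

```python
def reciprocal(scores):
    """Plain reciprocal transformation: 1 / score."""
    return [1.0 / x for x in scores]

def reciprocal_reversed(scores, constant=1.0):
    """Reverse the scores first, then take the reciprocal.

    The constant is added because the highest score becomes 0 after reversal.
    """
    highest = max(scores)
    return [1.0 / (highest - x + constant) for x in scores]

scores = [1, 10]
plain = reciprocal(scores)                  # [1.0, 0.1]: the order has flipped
kept_order = reciprocal_reversed(scores)    # [0.1, 1.0]: the order is preserved
print(plain, kept_order)
```

After the plain reciprocal the originally smaller score is the larger one, whereas reversing first keeps the larger original score on top.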
First type a variable name in the box labelled Target Variable, then click on and another dialog box appears, where you can give the variable a descriptive label and specify whether it is a numeric or string variable (see your handout from week 1). Then when you have written your command for SPSS to execute, click on to run the command and create the new variable. There are functions for calculating means, standard deviations and sums of columns. We’re going to use the square root and logarithm functions, which are useful for transforming data that are skewed.
Figure 13: Dialog box for the Compute command
Log Transformation
Let’s return to our Download festival data. Open the main Compute dialog box by selecting . Enter the name logday1 into the box labelled Target Variable, click on and give the variable a more descriptive name such as Log transformed hygiene scores for day 1 of Download festival. In the list box labelled Function group click on Arithmetic and then in the box labelled Functions and Special Variables click on Lg10 (this is the log transformation to base 10, Ln is the natural log) and transfer it to the command area by clicking on . When the
For the day 2 hygiene scores there is a value of 0 in the original data, and there is no logarithm of the value 0. To overcome the problem we add a constant to our original scores before we take the log of those scores. Any constant will do (although sometimes it can matter), provided that it makes all of the scores greater than 0. In this case our lowest score is 0 in the data so we could add 1 to all of the scores to ensure that all scores are greater than zero. Even though this problem affects the day 2 scores, we need to be consistent and do the same to the day 1 scores as we will do with the day 2 scores. Therefore, make sure the cursor is still inside the brackets and click on and then . The final dialog box should look like Figure 13. Note that the expression reads LG10(day1 + 1); that is, SPSS will add one to each of the day 1 scores and then take the log of the resulting values. Click on to create a new variable logday1 containing the transformed values.
To use the square root transformation, we could run through the same process, by using a name such as sqrtday1 and selecting the SQRT(numexpr) function from the list. This will appear in the box labelled Numeric Expression: as SQRT(?), and you can simply replace the question mark with the variable you want to change (in this case day1). The final expression will read SQRT(day1).
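For comparison, the same two computations, LG10(day1 + 1) and SQRT(day1), can be written out in Python. This is a sketch with invented scores and a hypothetical helper called compute. Missing scores are represented here by None and simply stay missing, and the same constant is added to both days so that the transformed variables remain comparable.

```python
import math

def compute(scores, func):
    """Apply func to every score, leaving missing values (None) as missing."""
    return [None if x is None else func(x) for x in scores]

def log10_plus_one(x):
    """Equivalent of the expression LG10(x + 1)."""
    return math.log10(x + 1)

# Hypothetical hygiene scores; day 2 contains a zero and a missing value
day1 = [4.0, 0.25, 1.0]
day2 = [1.0, None, 0.0]

logday1 = compute(day1, log10_plus_one)
logday2 = compute(day2, log10_plus_one)     # same constant as for day 1
sqrtday1 = compute(day1, math.sqrt)
print(logday1, logday2, sqrtday1)
```

The zero on day 2 becomes log10(1) = 0 rather than causing an error, which is the reason for adding the constant in the first place.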
slightly negatively skewed for the log and square root transformation, and positively skewed for the reciprocal transformation. If we’re using scores from day 2 alone or looking at the relationship between day 1 and day 2, then we could use the transformed scores; however, if we wanted to look at the change in scores then we’d have to weigh up whether the benefits of the transformation for the day 2 scores outweigh the problems it creates in the day 1 scores (data analysis can be frustrating sometimes).
Multiple Choice Test
Go to https://studysites.uk.sagepub.com/field4e/study/mcqs.htm and test yourself on the multiple choice questions for Chapter 5. If you get any wrong, re-read this handout (or Field, 2013, Chapter 5) and do them again until you get them all correct.
This document is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/); basically you can use it for teaching and non-profit activities but not meddle with it without permission from the author.