Outliers
Chapter 5.3 Data Screening
Outliers can Bias a Parameter Estimate
…and the Error associated with that Estimate
Outliers
• Outlier – a case with an extreme value on one variable or on multiple variables
• Why?
  – Data input error
  – Not a population you meant to sample
  – From the population, but the distribution has really long tails and very extreme values
Outliers
• Outliers – Two Types
• Univariate – for basic univariate statistics
  – Use these when you have ONE DV or Y variable.
• Multivariate – for some univariate statistics and all multivariate statistics
  – Use these when you have multiple continuous variables or lots of DVs.
Outliers
• Univariate
• In a standard normal (z) distribution, a z-score beyond +/- 3 puts someone in roughly the most extreme 0.3% of the population.
• Therefore, we want to eliminate people whose scores are SO far away from the mean that they are very strange.
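A minimal sketch of the univariate z-score screen described above. The vector scores is made up for illustration and is not part of the course dataset; the +/- 3 rule is the one from the slide.

```r
# Hypothetical data: 200 ordinary scores plus one extreme value.
set.seed(42)
scores <- c(rnorm(200, mean = 50, sd = 10), 120)

# Proportion of a normal distribution beyond +/- 3 (both tails together).
tail_prop <- 2 * pnorm(-3)

# Convert to z-scores; scale() returns a matrix, so flatten it to a vector.
z <- as.numeric(scale(scores))

# How many cases are within +/- 3? (FALSE = flagged as outliers)
summary(abs(z) < 3)

# Keep only the cases within +/- 3.
noout_uni <- scores[abs(z) < 3]
```

With many columns you would have to repeat this for each one, which is the motivation for the multivariate approach.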
Outliers
• Univariate outliers are fine and dandy, but you may have lots of data and don't want to do each column one at a time.
  – Plus, the multivariate outlier analysis works just as well if it's one column or 500, so let's just do that.
Outliers
• Multivariate
  – Now we need some way to measure distance from the mean (because z-scores are the distance from the mean), but from the mean of means (all the means at once!).
• Mahalanobis distance
  – Creates a distance from the centroid (mean of means)
Outliers
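• Mahalanobis
• The centroid is the point where the means of all the variables meet (picture it in 3D for three variables), and each case's distance from that point is measured.
  – Similar to Euclidean distance, but it also takes the variances and correlations of the variables into account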
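To make the comparison with Euclidean distance concrete, here is a sketch on made-up data (dat, x1, x2, and x3 are hypothetical names). It computes the distance by hand, as each case's deviation from the centroid weighted by the inverse covariance matrix, and checks the result against R's mahalanobis(), which returns the squared distance.

```r
# Hypothetical data: 50 cases on three continuous variables.
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

centroid <- colMeans(dat)  # the "mean of means"
S <- cov(dat)              # variances and covariances

# Deviation of every case from the centroid.
diffs <- sweep(as.matrix(dat), 2, centroid)

# Squared Euclidean distance ignores variances and correlations.
d2_euclid <- rowSums(diffs^2)

# Squared Mahalanobis distance weights deviations by the inverse covariance.
d2_manual <- rowSums((diffs %*% solve(S)) * diffs)

# The built-in function should give the same squared distances.
d2_builtin <- mahalanobis(dat, centroid, S)
all.equal(unname(d2_manual), unname(d2_builtin))
```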
Outliers
• Mahalanobis
• No set cut off rule
  – Use a chi-square table.
  – DF = # of variables (DVs, variables that you used to calculate Mahalanobis)
  – Use p<.001
NOTE: DF here has NOTHING to do with the DF for hypothesis testing.
Outliers
• So do I delete them?
• Yes: they are far away from the middle!
• No: they may not affect your analysis!
• It depends: I need the sample size!
• SO?!
  – Try it with and without them. See what happens.
FISH!
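A sketch of "try it with and without them", using a correlation as a stand-in analysis. The data frame dat and the planted extreme case are made up for illustration; the cutoff uses the chi-square rule at p < .001.

```r
# Hypothetical data: two correlated variables plus one planted extreme case.
set.seed(123)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
dat <- rbind(data.frame(x = x, y = y), data.frame(x = 6, y = -6))

# Flag multivariate outliers with Mahalanobis distance.
mahal <- mahalanobis(dat, colMeans(dat), cov(dat))
cutoff <- qchisq(1 - .001, ncol(dat))
noout <- dat[mahal < cutoff, ]

# Run the same analysis both ways and compare.
cor_with <- cor(dat$x, dat$y)
cor_without <- cor(noout$x, noout$y)
c(with_outliers = cor_with, without_outliers = cor_without)
```

If the two results tell the same story, the outliers are not driving the analysis; if they differ, report which cases were excluded and why.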
Outliers
• Important side notes:
  – For ANOVA, t-tests, correlation: you will use a fake regression analysis. It's considered fake because it's not the real analysis, just a way to get the information you need to do data screening.
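The slides do not show the fake regression at this point, so the following is only one possible reading, stated as an assumption: regress a randomly generated variable on the screening columns, purely to obtain residuals and fitted values for screening. The names screen_data, random, and fake are hypothetical, as is the choice of a chi-square random variable.

```r
# Hypothetical screening data: three continuous columns.
set.seed(7)
screen_data <- data.frame(a = rnorm(60), b = rnorm(60), c = rnorm(60))

# Assumption: a random outcome with no real relationship to the predictors.
random <- rchisq(nrow(screen_data), df = 7)

# The "fake" regression: not the real analysis, just a screening device.
fake <- lm(random ~ ., data = screen_data)

# Values typically inspected during screening.
standardized <- rstudent(fake)           # studentized residuals
fitted_z <- scale(fake$fitted.values)    # standardized fitted values
```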
Outliers
• Important side notes:
  – For regression based tests: you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you make notes in the regression section on what's different.
• You will also use other regression based values for this analysis.
Outliers
• Important side note:
  – Many functions in R have their own data screening options. This guide is for global screening, not specific to one analysis.
Outliers
• First, figure out which columns are factors and leave them out, as all columns passed to the function need to be int or num.
  – filledin_none[ , -c(1,2)]
  – Use that dataset code in the next function.
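A sketch of how you might find the factor columns, using a small made-up data frame (filledin_example and its column names are hypothetical stand-ins, with the factors placed in columns 1 and 2 to mirror the -c(1,2) code above).

```r
# Hypothetical stand-in: two factor columns followed by numeric columns.
filledin_example <- data.frame(
  id    = factor(c("a", "b", "c", "d")),
  group = factor(c("x", "y", "x", "y")),
  q1    = c(1L, 3L, 2L, 5L),
  q2    = c(2.5, 3.1, 4.8, 1.2)
)

# str() lists each column's type, so the factors are easy to spot.
str(filledin_example)

# Or find the numeric (int or num) columns programmatically.
numeric_cols <- sapply(filledin_example, is.numeric)
screen <- filledin_example[ , numeric_cols]

# Same columns as dropping the first two by position.
by_position <- filledin_example[ , -c(1, 2)]
names(screen)
```

Selecting by type rather than by position is a design choice: it keeps working if the column order changes.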
Outliers
• Mahalanobis function
• mahalanobis(
  – dataset name,
  – colMeans(dataset name, na.rm = TRUE),
  – cov(dataset name, use = "pairwise.complete.obs")
  – )
Outliers
• mahal = mahalanobis(filledin_none[ , -c(1,2)],
                      colMeans(filledin_none[ , -c(1,2)], na.rm = TRUE),
                      cov(filledin_none[ , -c(1,2)], use = "pairwise.complete.obs"))
Outliers
• Now, let's get rid of people with bad scores
  – But what is a bad score?
  – Use a chi-square table.
  – DF = # of variables (DVs, variables that you used to calculate Mahalanobis)
  – Use p<.001
• Oh, let's make R do it.
Outliers
• Use the qchisq function, which finds the cut off score for you.
  – qchisq(1-pvalue, Number of columns)
• cutoff = qchisq(.999, ncol(dataset))
• cutoff = qchisq(.999, ncol(filledin_none[ , -c(1,2)]))
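A small sketch of what qchisq is doing, so the cutoff is not a black box. The range of 1 to 5 variables is an arbitrary choice for illustration.

```r
# Cutoffs at p < .001 for 1 to 5 variables (df = number of columns).
p_value <- .001
n_vars <- 1:5
cutoffs <- qchisq(1 - p_value, df = n_vars)
data.frame(n_vars, cutoff = round(cutoffs, 2))

# Sanity check: the chance of exceeding each cutoff is the p-value.
upper_tail <- pchisq(cutoffs, df = n_vars, lower.tail = FALSE)
upper_tail
```

The cutoff grows with the number of columns, which is why DF has to match the columns used to calculate Mahalanobis.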
Outliers
• So, let's see how many are bad (FALSE counts the outliers)
  – summary(mahal < cutoff)
• Let's get rid of those peeps
  – noout = filledin_none[ mahal < cutoff, ]
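The steps above can be run end to end on made-up data. The sketch below assumes a hypothetical data frame, example_data, with two factor columns in front (as in filledin_none) and one planted extreme case; none of these values come from the course dataset.

```r
# Hypothetical data: two factor columns, three numeric columns.
set.seed(2024)
n <- 150
example_data <- data.frame(
  id    = factor(seq_len(n)),
  group = factor(rep(c("a", "b", "c"), length.out = n)),
  q1    = rnorm(n),
  q2    = rnorm(n),
  q3    = rnorm(n)
)
# Plant one extreme case in the last row.
example_data[n, c("q1", "q2", "q3")] <- c(8, -8, 8)

# Drop the factor columns before screening.
screen <- example_data[ , -c(1, 2)]

# Squared Mahalanobis distance from the centroid.
mahal <- mahalanobis(screen,
                     colMeans(screen, na.rm = TRUE),
                     cov(screen, use = "pairwise.complete.obs"))

# Chi-square cutoff at p < .001, df = number of screened columns.
cutoff <- qchisq(1 - .001, ncol(screen))

summary(mahal < cutoff)           # FALSE = outliers
flagged <- which(mahal >= cutoff) # row positions of the outliers

# Keep everyone under the cutoff, including their factor columns.
noout <- example_data[mahal < cutoff, ]
n_removed <- nrow(example_data) - nrow(noout)
```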