Top Banner
Garrett Grolemund Phd Student / Rice University Department of Statistics Data cleaning
56
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 18 cleaning

Garrett GrolemundPhd Student / Rice University

Department of Statistics

Data cleaning

Page 2: 18 cleaning

1. Intro to data cleaning

2. What you can’t fix

3. What you can fix

4. Intro to reshape

Page 3: 18 cleaning

Your turn

Do you think men or women leave a larger tip when dining out? What data would you collect to test this belief? What would prompt you to change your belief?

Page 4: 18 cleaning

Data Analysis

Data

Residuals

ModelCompare

Visualize

Transform

Page 5: 18 cleaning

Data Analysis

Data

Residuals

ModelCompare

Visualize

Transform

Page 6: 18 cleaning

Data Analysis

Data

Residuals

ModelCompare

Visualize

Transform

Page 7: 18 cleaning

Data Analysis

Data

Residuals

ModelCompare

Visualize

Transform

Page 8: 18 cleaning

Data Analysis

Data

Residuals

ModelCompare

Visualize

Transform

Page 9: 18 cleaning

Data Analysis

Data

Residuals

ModelCompare

Visualize

Transform

Page 10: 18 cleaning

10 - 20% of an analysis

Page 11: 18 cleaning

Data Cleaning

Data

Residuals

ModelCompare

VisualizeTransform

Page 12: 18 cleaning

Data cleaning

Page 13: 18 cleaning

“Happy families are all alike; every unhappy family is unhappy in its own way.”—Leo Tolstoy

Page 14: 18 cleaning

“Clean datasets are all alike; every messy dataset is messy in its own way.”—Hadley Wickham

Page 15: 18 cleaning

Clean data is:

Complete

Correct(factual and internally consistent)

Concise

Compatible(required variables: observations in rows, one column per variable)

Page 16: 18 cleaning

What you can’t fix:

Page 17: 18 cleaning

Complete

Correct

Page 18: 18 cleaning

Correct

Can’t restore incorrect values without original data but can remove clearly incorrect values

Options:

Remove entire row

Mark incorrect value as missing (NA)

Page 19: 18 cleaning

When two rows present the same information with different values, at least one row is wrong.

Whenever there is inconsistency, you are going to have to make some tradeoff to ensure concision.

Detecting inconsistency is not always easy.

Inconsistency = incorrect

Page 20: 18 cleaning

General strategy

To find incorrect values you need to be creative, combining graphics and data processing.

Page 21: 18 cleaning

Tipping data

One waiter recorded information about each tip he received over a period of a few months

244 records

Do men or women tip more?

Page 22: 18 cleaning

Your turn

Subset the tipping data to include only rows without NA’s. Judge whether you think all of the data points are correct. How will you make your decision?

Page 23: 18 cleaning

tips <- read.csv("tipping.csv", stringsAsFactors = FALSE)

summary(tips)

tips <- subset(tips, !is.na(smoker) & !is.na(non_smoker))

qplot(tip, data = tips, binwidth = .5)qplot(total_bill, data = tips, binwidth = 2)qplot(total_bill, tip, data = tips)

Page 24: 18 cleaning

nrow(tips)

sum(tips$male)

sum(tips$female)

subset(tips, male != female)

Page 25: 18 cleaning

What you can fix:

Page 26: 18 cleaning

Concise(each fact represented once)

Repeating facts: 1. wastes memory 2. creates opportunities for inconsistency

Page 27: 18 cleaning

Compatible(Data is compatible with your analysis

in both form and fact)

1. Do you have the relevant variables for your analysis?

Page 28: 18 cleaning

This often requires some type of calculation. For example,

proportion = sucesses / attempts

Avg score per game per team = ?

join(), transform(), summarise(), ddply(), plyr address this need

Page 29: 18 cleaning

Compatible(Data is compatible with your analysis

in both form and fact)

2. Is the data in the right form for your analysis and visualization tools? (reshape)

Page 30: 18 cleaning

Rectangular

Page 31: 18 cleaning

Observations in rows

Page 32: 18 cleaning

Variables in columns

(1 column per variable)

Page 33: 18 cleaning

Your turn

What are the variables in tipping.csv? How are they arranged in rows and columns? Can you form the variables into two groups?

Page 34: 18 cleaning

Reshape

Page 35: 18 cleaning

install.packages("reshape")library(reshape)library(stringr)head(tips)

Page 36: 18 cleaning

Molten data

We can use melt to put each variable into its own column.“Protect” the good columns. “Melt” the offending columns.Then subset.

Page 37: 18 cleaning

1. ID variables - identify the object that measurements will take place on (we know these before the experiment)

2. Measured variables - the features of the object that will be measured (we have to do an experiment to observe these)

Two types of variables

Page 38: 18 cleaning

object

ID Variables

Bruce Wayne

Batman

SSN: 555-89-3000

Measured Var.

Height (6’1’’)

IQ (180)

Age (71)

Page 39: 18 cleaning

ID Variables

Gotham City +

male +

Top 1% tax bracket

Page 40: 18 cleaning

Identifier variable Measured variable

Index of random variable

Random variable

Dimension Measure

Experimental design Measurement

predictors (Xi) response (Y)

Page 41: 18 cleaning

Molten data

Molten data collapses all the measured variables into two columns: 1) the variable being measured and 2) the value. Sometimes called “long” form.

To protect a column from being melted, label it as an id variable.

reshape::melt(data, id)

Page 42: 18 cleaning

tips1 <- melt(tips, id = c("customer_ID", "total_bill", "tip", "smoker", "non_smoker"))

# assign an appropriate variable namenames(tips1)[6] <- "sex"

# subset out unwanted rowstips1 <- subset(tips1, value == 1)tips1 <- tips1[ , c(1,2,6,4,5,3)]

Page 43: 18 cleaning

Use melt to fix the smoking variable. One column should be enough to record whether a person smokes or not.

Your turn

Page 44: 18 cleaning

Rectangular data are much easier to work with!

qplot(total_bill, tip, data = tips1, color = sex)

# vs.

qplot(total_bill, tip, data = tip, colour = ?)

Page 45: 18 cleaning

qplot(total_bill, tip, data = tips1, color = sex) + geom_smooth(method = lm)

Page 46: 18 cleaning

Clean data is:

Complete

Correct(factual and internally consistent)

Concise

Compatible(required variables: observations in rows, one column per variable)

Page 47: 18 cleaning

Resource

Wickham, H. (2007) Reshaping data with the reshape package. Journal of Statistical Software. 22 (12)

http://www.jstatsoft.org/v21/i12

Page 48: 18 cleaning

Summary

Page 49: 18 cleaning

Clean data is:

Rectangular(observations in rows, one column per variable)

Consistent

Concise

Complete

Correct

Page 50: 18 cleaning

Data

Residuals

ModelCompare

VisualizeTransform

Page 51: 18 cleaning

Data

Residuals

ModelCompare

VisualizeTransform

ggplot2

Page 52: 18 cleaning

Data

Residuals

ModelCompare

VisualizeTransform

ggplot2plyr

Page 53: 18 cleaning

Data

Residuals

ModelCompare

VisualizeTransform

ggplot2plyr

reshape

Page 54: 18 cleaning

Data

Residuals

ModelCompare

Visualize

Transform

most statistics classes

Page 55: 18 cleaning
Page 56: 18 cleaning

This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.