Class Website CX4242: Data Cleaning Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech

Aug 31, 2020


Class Website

CX4242: Data Cleaning

Mahdi Roozbahani

Lecturer, Computational Science and Engineering, Georgia Tech


Data Cleaning

How dirty is real data?


How dirty is real data?

Examples

• Jan 19, 2016

• January 19, 16

• 1/19/16

• 2006-01-19

• 19/1/16

http://blogs.verdantis.com/wp-content/uploads/2015/02/Data-cleansing.jpg
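The date strings above can all be read by trying one candidate format after another. The following is a minimal sketch, not a general-purpose date parser: the names `FORMATS` and `parse_date` are hypothetical, month names are assumed to be English, and the choice to try month-first before day-first is an assumption that decides how ambiguous strings such as 1/2/16 are interpreted.

```python
from datetime import datetime

# Candidate formats, tried in order (hypothetical list for illustration).
# Assumption: month-first is tried before day-first, so an ambiguous
# string like "1/2/16" is read as January 2, not February 1.
FORMATS = [
    "%b %d, %Y",  # Jan 19, 2016
    "%B %d, %y",  # January 19, 16
    "%Y-%m-%d",   # 2006-01-19
    "%m/%d/%y",   # 1/19/16
    "%d/%m/%y",   # 19/1/16
]


def parse_date(text):
    """Return a datetime.date for the first format that matches, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None


for raw in ["Jan 19, 2016", "January 19, 16", "1/19/16", "2006-01-19", "19/1/16"]:
    print(raw, "->", parse_date(raw))
```

Note that the format list cannot resolve ambiguity on its own: a string where both numbers are 12 or less parses under either convention, so the source of the data has to tell you which one applies.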


How dirty is real data?

Discuss with your neighbors (groups of 2-3)

60 seconds

Come up with 5+ kinds of “data dirtiness”


How dirty is real data?

• Missing or corrupted (NaN, null)

• Numbers stored as strings (“1232”)

• Different units

• Spelling/typos

• Different string encodings

• Outliers (due to data recording)

• Geocoding, time zone offsets (missing +, -)

• Duplicate data

• Fake data (malicious)

• SQL injection

• Different software versions generating slightly different formats

• Caps lock

• Semi-colons

• Structure (JSON objects)

• Invisible characters

• Different delimiters

• Indentation
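A few of the kinds of dirtiness listed above (missing values, numbers stored as strings, caps lock, invisible characters, duplicates) can be handled with the Python standard library alone. This is a minimal sketch under stated assumptions: the helper names and the `MISSING_TOKENS` set are hypothetical, and real data will need its own list of "missing" spellings.

```python
import math
import unicodedata

# Assumed spellings of "missing"; extend for your own data.
MISSING_TOKENS = {"", "nan", "null", "none", "n/a"}


def clean_text(value):
    """Normalize Unicode, drop invisible format characters, trim, and casefold."""
    text = unicodedata.normalize("NFKC", value)
    # Category "Cf" covers format characters such as zero-width spaces and BOMs.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return text.strip().casefold()


def to_number(value):
    """Return a float, or None when the value is missing or not numeric."""
    text = clean_text(value)
    if text in MISSING_TOKENS:
        return None
    try:
        number = float(text.replace(",", ""))  # assumes "," is a thousands separator
    except ValueError:
        return None
    return None if math.isnan(number) else number


def drop_duplicates(rows):
    """Keep the first occurrence of each row, comparing rows after cleaning."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(clean_text(cell) for cell in row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


print(to_number(" 1232\u200b"))
print(to_number("NaN"))
print(drop_duplicates([("Atlanta", "GA"), ("ATLANTA ", "ga"), ("Boston", "MA")]))
```

Treating a comma as a thousands separator is itself a data-cleaning decision: in locales that use a decimal comma, the same rule would silently corrupt values.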


Importance of Data Cleaning


“80%” Time Spent on Data Preparation

Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says [Forbes]

http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#73bf5b137f75


Data Janitor


Writing “Clean Code”

• Be careful with trailing whitespace

Trailing whitespace is evil. Don't commit evil into your repo.

http://codeimpossible.com/2012/04/02/trailing-whitespace-is-evil-don-t-commit-evil-into-your-repo/

• Indent code (spaces vs tabs) following the coding practices of your team/company

https://google.github.io/styleguide/javaguide.html#s4.2-block-indentation

…there’s no way I'm going to be with someone who uses spaces over tabs…

http://www.businessinsider.com/tabs-vs-spaces-from-silicon-valley-2016-5
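Both points about clean code can be checked mechanically. The sketch below is illustrative only: `strip_trailing_whitespace` and `indentation_styles` are hypothetical helper names, and in practice an editor setting or a linter adopted by your team would usually do this job.

```python
def strip_trailing_whitespace(source):
    """Remove spaces and tabs at the end of every line, keeping a final newline."""
    cleaned = "\n".join(line.rstrip() for line in source.splitlines())
    return cleaned + ("\n" if source.endswith("\n") else "")


def indentation_styles(source):
    """Report which characters are used for leading indentation."""
    styles = set()
    for line in source.splitlines():
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            styles.add("tabs")
        if " " in indent:
            styles.add("spaces")
    return styles


code = "if x:  \n\ty = 1\t\n    z = 2\n"
print(repr(strip_trailing_whitespace(code)))
print(indentation_styles(code))  # both styles present means the file mixes them
```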


Both available free for GT students on http://safaribooksonline.com/


Data Cleaners

Watch videos

• Data Wrangler (research at Stanford)

• Open Refine (previously Google Refine)

Write down

• Examples of data dirtiness

• Tool features demoed (or that you like)

We will collectively summarize similarities and differences afterwards

Open Refine: http://openrefine.org

Data Wrangler: http://vis.stanford.edu/wrangler/



What can Open Refine and Wrangler do?

• [w,o] undo, redo

• [o,w] history of data

• [o] transform data (e.g., take log)

• [w] data editing/highlighting/interaction may be easier

• [o] clustering

• [w] transpose/pivot

• [w] fill in missing data

• [w] suggestions + preview

o = Open Refine

w = Data Wrangler
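Two of the features above, clustering and filling in missing data, can be approximated in a few lines to show the idea behind them. This is a simplified sketch, not the implementation used by either tool: `fingerprint`, `cluster`, and `fill_down` are hypothetical names, the clustering is a basic key-collision approach (values that reduce to the same normalized key are grouped), and filling is done by copying the last seen value downward.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a key that ignores case, accents, punctuation, and word order."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep only groups with differing spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


def fill_down(column):
    """Replace None or empty cells with the most recent non-empty value above them."""
    filled = []
    last = None
    for cell in column:
        if cell is None or cell == "":
            filled.append(last)
        else:
            last = cell
            filled.append(cell)
    return filled


print(cluster(["Georgia Tech", "georgia tech.", "Tech, Georgia", "MIT"]))
print(fill_down(["GA", "", None, "MA", ""]))
```

A cluster is only a suggestion that the spellings refer to the same thing; a person still has to confirm the merge, which is why both tools pair such suggestions with previews and undo.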


The videos only show some of the tools’ features. Try them out!

Open Refine: http://openrefine.org

Data Wrangler: http://vis.stanford.edu/wrangler/
