DATA JOURNALISM Dr. Bahareh Heravi @Bahareh360 Week 8 Cleaning and Analysing Data

Data Journalism - Cleaning Data

Apr 14, 2017



Bahareh Heravi
Transcript
Page 1: Data Journalism - Cleaning Data

DATA JOURNALISM

Dr. Bahareh Heravi @Bahareh360

Week 8: Cleaning and Analysing Data

Page 2: Data Journalism - Cleaning Data

DATA is often ugly

& MESSY

Page 3: Data Journalism - Cleaning Data

Data Profiling: Assess the current state of your data.

Data Cleaning: Correct the issues you found during 'data profiling'.

Page 4: Data Journalism - Cleaning Data

Exploring data
Checking data
Filtering data
Cleaning data
Reshaping data
Annotating data
Linking data

Page 5: Data Journalism - Cleaning Data

Dataset

Powerhouse Museum objects collection

Download from: http://data.freeyourmetadata.org/powerhouse-museum/phm-collection.tsv

Open Refine and load the dataset.
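For readers who want to inspect the file outside OpenRefine, here is a minimal Python sketch of reading tab-separated data. The sample text and the column names `Record ID` and `Categories` are assumptions made for illustration; check the header row of the downloaded file before relying on them.

```python
import csv
import io

def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

# Tiny made-up sample standing in for the real file (column names are assumed).
sample = "Record ID\tCategories\n1\tCostume\n2\tAgricultural equipment\n"
rows = read_tsv(sample)
print(len(rows), "rows; columns:", list(rows[0].keys()))
```

To read the downloaded file instead, pass the contents of `open("phm-collection.tsv", encoding="utf-8").read()` to `read_tsv`.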

Page 6: Data Journalism - Cleaning Data

Sorting data

Page 7: Data Journalism - Cleaning Data

Faceting data

To select a subset of your data to work on.

To get useful insight into your data.

To apply a transformation to a subset of your data.
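The three uses of faceting listed above can be approximated in plain Python. This is only a sketch of the idea, not how OpenRefine implements facets; the rows and the `county` and `td` column names are made up.

```python
from collections import Counter

def text_facet(rows, column):
    """Count how often each distinct value appears in one column."""
    return Counter(row[column] for row in rows)

def select(rows, column, value):
    """Keep only the rows whose column equals the chosen facet value."""
    return [row for row in rows if row[column] == value]

# Made-up rows for illustration.
rows = [
    {"county": "Dublin", "td": "A"},
    {"county": "Cork", "td": "B"},
    {"county": "Dublin", "td": "C"},
]
facet = text_facet(rows, "county")        # insight: which values occur, and how often
subset = select(rows, "county", "Dublin")  # a subset to work on or transform
print(facet.most_common())
```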

Page 8: Data Journalism - Cleaning Data

Types of Facets

Text facets for text

Numeric facets for numbers and dates

Predefined/customised facets

Page 9: Data Journalism - Cleaning Data

Text facets

Text facets are used for faceting text.

Examples: County or city names, TD names

Page 10: Data Journalism - Cleaning Data

Text facets

Page 11: Data Journalism - Cleaning Data

Numeric facets

Numeric facets are used for faceting numerical values and ranges.

Examples: Expenditure, crime rate
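A numeric facet groups values into ranges. The sketch below shows one simple way to do that with fixed-width buckets; the bucket width and the expenditure figures are invented for illustration.

```python
from collections import Counter

def numeric_facet(values, bucket_size):
    """Group numbers into fixed-width ranges and count the values in each range."""
    counts = Counter()
    for v in values:
        start = (v // bucket_size) * bucket_size
        counts[(start, start + bucket_size)] += 1
    return dict(sorted(counts.items()))

expenditure = [120, 480, 510, 990, 1500]  # made-up values
facet = numeric_facet(expenditure, 500)
for (low, high), n in facet.items():
    print(f"{low} to {high}: {n}")
```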

Page 12: Data Journalism - Cleaning Data

Numeric facets

Page 13: Data Journalism - Cleaning Data

Detecting blanks

Page 14: Data Journalism - Cleaning Data

Removing blanks
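Detecting and removing blanks amounts to splitting rows on whether a cell is empty. A minimal sketch, assuming a cell counts as blank when it is missing or contains only whitespace; the rows and the `title` column are hypothetical.

```python
def is_blank(value):
    """Treat missing values and whitespace-only strings as blank."""
    return value is None or value.strip() == ""

def split_blanks(rows, column):
    """Return (rows with a value, rows that are blank) for one column."""
    kept = [r for r in rows if not is_blank(r.get(column))]
    blanks = [r for r in rows if is_blank(r.get(column))]
    return kept, blanks

# Made-up rows for illustration.
rows = [{"id": "1", "title": "Vase"}, {"id": "2", "title": "  "}, {"id": "3"}]
kept, blanks = split_blanks(rows, "title")
print(len(blanks), "blank rows detected;", len(kept), "rows kept")
```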

Page 15: Data Journalism - Cleaning Data

Detecting duplicates

Page 16: Data Journalism - Cleaning Data

Removing duplicates

Warning: If we remove all matching rows, the original records will also be removed!
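The warning can be illustrated with a short sketch: flagging duplicates marks every row that shares a key, including the first copy, so deleting all flagged rows loses the record entirely. The `id` key and the rows are made up.

```python
from collections import Counter

def flag_duplicates(rows, key):
    """Return every row whose key occurs more than once (originals included)."""
    counts = Counter(r[key] for r in rows)
    return [r for r in rows if counts[r[key]] > 1]

def drop_repeats(rows, key):
    """Keep the first row for each key and drop only the later repeats."""
    seen = set()
    kept = []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            kept.append(r)
    return kept

rows = [{"id": "1"}, {"id": "2"}, {"id": "2"}, {"id": "3"}]  # made-up rows
flagged = flag_duplicates(rows, "id")
naive = [r for r in rows if r not in flagged]  # record 2 disappears completely
safe = drop_repeats(rows, "id")                # record 2 survives once
print(len(naive), len(safe))
```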

Page 17: Data Journalism - Cleaning Data

Removing duplicates

Page 18: Data Journalism - Cleaning Data

Removing duplicates

Now you can remove.  

Facet by blank  

Page 19: Data Journalism - Cleaning Data

Congratulations, you have removed all blank and duplicate values.

Page 20: Data Journalism - Cleaning Data

Simple cell transformations
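Simple cell transformations usually mean trimming whitespace, collapsing repeated spaces and changing case. Rough Python equivalents are sketched below; the function names are invented for this example and are not OpenRefine's.

```python
def trim(value):
    """Remove leading and trailing whitespace."""
    return value.strip()

def collapse_whitespace(value):
    """Replace runs of whitespace with a single space and trim the ends."""
    return " ".join(value.split())

def to_titlecase(value):
    """Capitalise the first letter of each word."""
    return value.title()

messy = "  agricultural   equipment "
clean = to_titlecase(collapse_whitespace(messy))
print(repr(clean))
```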

Page 21: Data Journalism - Cleaning Data

Advanced data operations
Clustering
Transformations
Multi-valued cells
Derived columns
Splitting data across columns

Regular Expressions
GREL (General Refine Expression Language)

Page 22: Data Journalism - Cleaning Data

Multi-valued cells

To split a cell in
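One common reading of this step is turning a cell that holds several values into one row per value. The sketch below assumes the values are joined by a `|` separator and uses a hypothetical `Categories` column; both are assumptions, not details confirmed by the slide.

```python
def split_multi_valued(rows, column, separator="|"):
    """Emit one copy of each row per value found in a multi-valued cell."""
    out = []
    for row in rows:
        for part in row[column].split(separator):
            new_row = dict(row)          # copy so the input rows stay unchanged
            new_row[column] = part.strip()
            out.append(new_row)
    return out

rows = [{"id": "1", "Categories": "Costume|Textiles"}]  # made-up row
expanded = split_multi_valued(rows, "Categories")
print(expanded)
```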

Page 23: Data Journalism - Cleaning Data

Clustering

To cluster syntactically similar items together.

To be used to fix inconsistencies, typos, etc.

Examples in the dataset: Agricultural equipment & Agricultural Equipment

Costume & Costumes
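One way to group syntactically similar values is key collision: reduce each value to a normalised "fingerprint" and group values that share it. The sketch below is a simplified approximation of that idea, not OpenRefine's exact algorithm. Note that it merges the two spellings of "Agricultural equipment" but not "Costume" and "Costumes", which differ by a letter and would need a different method, for example one based on edit distance.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Normalise case, accents, punctuation and word order into a grouping key."""
    value = value.strip().lower()
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))

def cluster(values):
    """Group distinct values that share a fingerprint; keep groups of two or more."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

values = ["Agricultural equipment", "Agricultural Equipment", "Costume", "Costumes"]
clusters = cluster(values)
print(clusters)
```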

Page 24: Data Journalism - Cleaning Data

Clustering

Page 25: Data Journalism - Cleaning Data

Clustering

Page 26: Data Journalism - Cleaning Data

Transforming cell values

Page 27: Data Journalism - Cleaning Data

Transforming cell values

GREL (General Refine Expression Language)
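In GREL the current cell is available as `value`, and an expression such as `value.trim()` is applied to each cell in a column. The Python sketch below mimics that pattern with an ordinary function; `transform_column` and the sample rows are invented for illustration.

```python
import re

def transform_column(rows, column, expression):
    """Apply an expression to one column of every row, returning new rows."""
    return [{**row, column: expression(row[column])} for row in rows]

rows = [{"Categories": " Costumes "}, {"Categories": "Costume"}]  # made-up rows

# Roughly what value.trim() does in GREL.
trimmed = transform_column(rows, "Categories", lambda value: value.strip())

# A regular-expression transformation: drop a trailing "s" (illustrative only).
singular = transform_column(trimmed, "Categories", lambda value: re.sub(r"s$", "", value))
print(singular)
```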

Page 28: Data Journalism - Cleaning Data

Resources

Using OpenRefine by Ruben Verborgh and Max De Wilde

http://freeyourmetadata.org/cleanup/

Cleaning Data with Refine, School of Data

The Bastard Book of Regular Expressions by Dan Nguyen

GREL: https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language

Page 29: Data Journalism - Cleaning Data

Questions?

Bahareh R. Heravi

@Bahareh360