20130206 open refine

May 07, 2015

10 presentation of OpenRefine (formerly Google Refine) for the Toronto Data Science Group.
Transcript
Page 1: 20130206  open refine

2013-02-06 Toronto Data Science Group

We are surrounded by data

Page 2: 20130206  open refine

We are surrounded by MESSY data

- Multiple standards and formats

Structured vs unstructured

Field naming and format vary ...

- Human error (misspellings, errors, etc.)

- Non-normalized inputs (free-text entries, the "other" option)

- Incomplete data (laziness)

....

Page 3: 20130206  open refine

Lack of

Time

Skills

» Software

Page 4: 20130206  open refine

OpenRefine, the

- Swiss army knife for data manipulation!

- glue step between your IT systems

Page 5: 20130206  open refine

What's OpenRefine (formerly Google Refine, formerly Gridworks)?

- A cross-platform web application that runs locally

- A community-based project hosted on GitHub

- Which has two distributions and multiple extensions

- Something between a spreadsheet and SQL

Page 6: 20130206  open refine

Three use cases

1. Data Cleaning

2. ETL (Extract Transform Load) Prototyping

3. Data extension (reconciliation & linked data)

Page 7: 20130206  open refine

#1 Data Cleaning

Graphical interface

Facet option

Cluster similar records

Supports three languages:

- GREL, Jython, Clojure

+ regex

Page 8: 20130206  open refine

Facet example
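
The slide image is not part of this transcript. As a rough illustration of what a text facet does (count how often each distinct value occurs in a column so inconsistencies stand out), here is a minimal Python sketch. The function name `text_facet` and the sample rows are hypothetical; this is not OpenRefine's own code.

```python
from collections import Counter

def text_facet(rows, column):
    """Count how often each distinct value appears in one column."""
    counts = Counter(row.get(column, "") for row in rows)
    # Most frequent values first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical sample data with a case inconsistency and a blank.
rows = [
    {"city": "Toronto"},
    {"city": "toronto"},
    {"city": "Toronto"},
    {"city": "Montreal"},
    {"city": ""},
]

facet = text_facet(rows, "city")
for value, count in facet:
    print(f"{value or '(blank)'}: {count}")
```

Seeing "Toronto" and "toronto" listed as separate values is the kind of signal a facet is meant to surface before any cleaning is done.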

Page 9: 20130206  open refine

Cluster example
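
The slide image is not part of this transcript. One common way to cluster similar records is key collision: compute a normalized key for every value and group the values that share a key. The sketch below is written in the spirit of a "fingerprint" key and is an assumption about the general technique, not a copy of OpenRefine's implementation; the function names and sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: ASCII-folded, lowercased, punctuation-free,
    with unique tokens sorted alphabetically."""
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values whose fingerprints collide; keep groups of two or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical messy values.
values = [
    "Toronto Data Science",
    "toronto data science ",
    "Science, Data Toronto",
    "Montréal",
    "Montreal",
    "Ottawa",
]

clusters = cluster(values)
for group in clusters:
    print(group)
```

Each printed group is a candidate for merging into a single canonical value; a person still decides whether the merge is correct.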

Page 10: 20130206  open refine

#2 ETL Prototyping (Extract - Transform - Load)

Transform

- Understand your data

- Test the transformations that need to be done

- Undo / Redo

- Export transformation in JSON format

- Automate using the Python or Ruby extension

Extract & Load

Supports:

- tabular (csv, xls)

- hierarchical (xml, json)

Page 11: 20130206  open refine

History and JSON export
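
The slide image is not part of this transcript. To illustrate why an exported step history is useful for ETL prototyping, the Python sketch below replays a list of recorded steps on new rows. The step format here (`op`, `column`, `new_name`) is a hypothetical, simplified stand-in and not OpenRefine's actual export schema; the function names are also assumptions.

```python
import json

# Hypothetical, simplified step format; NOT OpenRefine's exact export schema.
HISTORY_JSON = """
[
  {"op": "trim", "column": "name"},
  {"op": "lowercase", "column": "email"},
  {"op": "rename", "column": "name", "new_name": "full_name"}
]
"""

def apply_step(rows, step):
    """Apply one recorded step to every row and return new rows."""
    op, column = step["op"], step["column"]
    if op == "trim":
        return [{**row, column: row[column].strip()} for row in rows]
    if op == "lowercase":
        return [{**row, column: row[column].lower()} for row in rows]
    if op == "rename":
        new_name = step["new_name"]
        return [
            {(new_name if key == column else key): val for key, val in row.items()}
            for row in rows
        ]
    raise ValueError(f"unknown operation: {op}")

def replay(rows, history_json):
    """Replay the recorded steps in order."""
    for step in json.loads(history_json):
        rows = apply_step(rows, step)
    return rows

rows = [{"name": "  Ada Lovelace ", "email": "ADA@Example.org"}]
cleaned = replay(rows, HISTORY_JSON)
print(cleaned)
```

The point of the design is that the steps are data: once a sequence works on a sample, the same sequence can be applied to the next file without redoing the work by hand.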

Page 12: 20130206  open refine

#3 Extend your Data (reconciliation & linked data)

- Cross-reference between OpenRefine projects (like VLOOKUP)

- Fetch URL and call web services (API)

Reconcile against

- RDF file & Local SPARQL endpoints

- Online databases
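
The VLOOKUP-like lookup between projects listed above can be sketched in Python as a keyed lookup from one table into another. This is only an illustration of the idea; the function name `cross_lookup`, the column names, and the sample rows are all hypothetical and do not reflect OpenRefine's internals.

```python
def cross_lookup(rows, key_column, other_rows, other_key, other_value):
    """Add `other_value` to each row by matching `key_column` against
    `other_key` in a second table. The first match wins, as in VLOOKUP."""
    lookup = {}
    for other in other_rows:
        lookup.setdefault(other[other_key], other[other_value])
    # Rows without a match get None for the new column.
    return [{**row, other_value: lookup.get(row[key_column])} for row in rows]

# Hypothetical "projects": one with orders, one with country names.
orders = [
    {"order": 1, "country_code": "CA"},
    {"order": 2, "country_code": "FR"},
    {"order": 3, "country_code": "XX"},
]
countries = [
    {"code": "CA", "country_name": "Canada"},
    {"code": "FR", "country_name": "France"},
]

extended = cross_lookup(orders, "country_code", countries, "code", "country_name")
for row in extended:
    print(row)
```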

Page 13: 20130206  open refine

Reconciliation example
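
The slide image is not part of this transcript. Reconciliation, in general terms, matches a free-text name against a reference list of known entities and returns scored candidates. The Python sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; real reconciliation services use their own matching logic. The identifiers, the threshold, and the function name `reconcile` are hypothetical.

```python
from difflib import SequenceMatcher

def reconcile(name, reference, threshold=0.8):
    """Return (id, label, score) candidates whose similarity to `name`
    meets the threshold, best match first."""
    candidates = []
    for entity_id, label in reference.items():
        score = SequenceMatcher(None, name.lower(), label.lower()).ratio()
        if score >= threshold:
            candidates.append((entity_id, label, round(score, 2)))
    return sorted(candidates, key=lambda candidate: -candidate[2])

# Hypothetical reference list with made-up identifiers.
reference = {
    "city-1": "Toronto",
    "city-2": "Montreal",
    "city-3": "Ottawa",
}

matches = reconcile("Torotno", reference)
no_matches = reconcile("Vancouver", reference)
print(matches)
print(no_matches)
```

A misspelled value resolves to a stable identifier, which is what later makes it possible to pull in extra attributes for that entity.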

Page 14: 20130206  open refine

OpenRefine

http://openrefine.org

@OpenRefine

Martin Magdinier

[email protected]

@magdmartin

Thanks!