Data Cleaning Jacob Lurye CS265 KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye SIGMOD Conference 2015

CS265 Jacob Lurye Data Cleaning - Harvard SEASdaslab.seas.harvard.edu/classes/cs265/files/presentations/katara.pdf · Data Cleaning Jacob Lurye CS265 ... Let’s talk dirty data.

Jun 05, 2018


Data Cleaning

Jacob Lurye

CS265

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye SIGMOD Conference 2015


Let’s talk dirty data


Source: BigDansing: A System for Big Data Cleansing. Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin SIGMOD Conference 2015


Pr(city = ‘LA’ | zipcode = 90210)
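As a rough illustration of the statistical idea behind this expression, the sketch below estimates such a conditional probability by simple counting. The rows and the helper name `conditional_prob` are made up for the example and are not taken from the slides or from BigDansing.

```python
def conditional_prob(rows, given, target):
    """Estimate Pr(target | given) by counting rows that match the condition."""
    (g_attr, g_val), (t_attr, t_val) = given, target
    matching = [r for r in rows if r[g_attr] == g_val]
    if not matching:
        return None  # the condition never occurs, so the estimate is undefined
    hits = sum(1 for r in matching if r[t_attr] == t_val)
    return hits / len(matching)


# Hypothetical data: one of the four 90210 rows carries a suspicious city.
rows = [
    {"zipcode": "90210", "city": "LA"},
    {"zipcode": "90210", "city": "LA"},
    {"zipcode": "90210", "city": "LA"},
    {"zipcode": "90210", "city": "SF"},
    {"zipcode": "10001", "city": "NY"},
]

p = conditional_prob(rows, ("zipcode", "90210"), ("city", "LA"))
print(p)
```

A value that has low probability given the rest of its row is a candidate error, which is the intuition a learning-based cleaner builds on.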


A      B          C         D        E          F         G
Rossi  Italy      Rome      Verona   Italian    Proto     1.78
Klate  S. Africa  Pretoria  Pirates  Afrikaans  P. Eliz.  1.69
Pirlo  Italy      Madrid    Juve     Italian    Flero     1.77

Integrity constraints?

Machine learning?


We need something more...


Enter KATARA...


What is KATARA?

1. Table pattern definition and discovery (using KBs)
2. Table pattern validation via crowdsourcing
3. Data annotation
4. Repair recommendation


What is a knowledge base, and how can it help us clean data?


Resource Description Framework


Resource (and URIs)


Literals

10,500,000


Properties

10,500,000

directorOf

budgetOf


Classes and Instances

Spielberg is an instance of class Director

E.T is an instance of class Sci-Fi Movie which is a subclass of Movie
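The RDF vocabulary on these slides (resources, literals, properties, classes, subclasses) can be sketched with a toy in-memory triple set. This is only an illustration: the property names `type` and `subClassOf` stand in for `rdf:type` and `rdfs:subClassOf`, and the direction assumed for `directorOf` and `budgetOf` is a guess, since the slides show only the property names and the literal 10,500,000.

```python
# Toy knowledge base: a set of (subject, property, object) triples.
TRIPLES = {
    ("Spielberg", "type", "Director"),
    ("E.T", "type", "Sci-Fi Movie"),
    ("Sci-Fi Movie", "subClassOf", "Movie"),
    ("Spielberg", "directorOf", "E.T"),  # resource-to-resource (assumed direction)
    ("E.T", "budgetOf", 10_500_000),     # resource-to-literal (assumed direction)
}


def types_of(resource, triples):
    """Return the direct types of a resource plus all of their superclasses."""
    found = set()
    frontier = [o for s, p, o in triples if s == resource and p == "type"]
    while frontier:
        cls = frontier.pop()
        if cls in found:
            continue
        found.add(cls)
        # Walk up the class hierarchy one subClassOf edge at a time.
        frontier.extend(o for s, p, o in triples if s == cls and p == "subClassOf")
    return found


et_types = types_of("E.T", TRIPLES)
print(sorted(et_types))
```

Following `subClassOf` edges is what lets a lookup for E.T return both Sci-Fi Movie and Movie.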


So how does this all relate back to data cleaning?


Table pattern semantics

Looks a lot like RDF!


Formalizing pattern matching


So what do we do with this formalization?


KBs and table patterns: a few possibilities

Full KB Coverage


Partial KB Coverage

“Does S. Africa hasCapital Pretoria?”



Not covered by the KB

“What are the possible relationships between Rossi and 1.78?”


How do we actually get knowledge from KBs?


SPARQL: a language for querying KBs

Get types and supertypes of resources with value t[Ai]
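The query on this slide is not preserved in the transcript. As a hedged sketch of what a types-and-supertypes lookup for one cell value could look like, the helper below (a hypothetical name, `types_query`) builds a SPARQL 1.1 query string using a property path. Matching the cell value against a plain `rdfs:label` literal is an assumption; real KBs often attach language tags to labels.

```python
def types_query(cell_value):
    """Build a SPARQL query for the types and supertypes of resources labelled cell_value."""
    # Escape backslashes and double quotes so the value is a valid string literal.
    label = cell_value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?c WHERE {\n"
        f'  ?x rdfs:label "{label}" .\n'
        "  ?x rdf:type/rdfs:subClassOf* ?c .\n"
        "}"
    )


query = types_query("Rome")
print(query)
```

The path `rdf:type/rdfs:subClassOf*` returns the direct type and every ancestor class in one query, so no client-side hierarchy walk is needed.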


Q1: get relationships where both attributes are resources

Q2: get relationships with one resource and one literal
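The bodies of Q1 and Q2 are likewise missing from the transcript. The sketch below shows one plausible shape for each, again as query-string builders with hypothetical names. It makes two simplifications that are assumptions rather than the paper's queries: super-properties are not followed, and a literal is matched as a plain string, which will not match typed numeric literals.

```python
PREFIXES = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"


def _quote(value):
    """Escape a cell value for use as a SPARQL string literal."""
    return '"' + str(value).replace("\\", "\\\\").replace('"', '\\"') + '"'


def q1_resource_resource(value_i, value_j):
    """Q1 sketch: properties linking a resource labelled value_i to one labelled value_j."""
    return (
        PREFIXES
        + "SELECT DISTINCT ?p WHERE {\n"
        + f"  ?x rdfs:label {_quote(value_i)} .\n"
        + f"  ?y rdfs:label {_quote(value_j)} .\n"
        + "  ?x ?p ?y .\n"
        + "}"
    )


def q2_resource_literal(value_i, literal_j):
    """Q2 sketch: properties linking a resource labelled value_i to the literal literal_j."""
    return (
        PREFIXES
        + "SELECT DISTINCT ?p WHERE {\n"
        + f"  ?x rdfs:label {_quote(value_i)} .\n"
        + f"  ?x ?p {_quote(literal_j)} .\n"
        + "}"
    )


q1 = q1_resource_resource("Italy", "Rome")
q2 = q2_resource_literal("Rossi", 1.78)
print(q1)
print(q2)
```

Q1 would be issued for a column pair such as B and C in the example table, and Q2 for a pair such as A and G, where the second value is a number, not an entity.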


So we’ve run our queries — what next?


Evaluating KB types (Ti) for table attributes (Ai)


Evaluating possible column relationships

prob. of any entity appearing in the subject of property P

prob. of any entity being of type T

prob. of an entity being of type T and appearing as subject of P


From PMI we get a measure of semantic coherence
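The three probabilities listed earlier are exactly the inputs of pointwise mutual information. The sketch below computes standard PMI from them; the normalization to [-1, 1] and the rescaling to [0, 1] are one common way to turn PMI into a bounded coherence score and are an assumption here, since the slide's formula is not preserved. The function names are illustrative.

```python
import math


def pmi(p_t, p_p, p_tp):
    """Pointwise mutual information between type T and property P."""
    return math.log(p_tp / (p_t * p_p))


def normalized_pmi(p_t, p_p, p_tp):
    """PMI divided by -log Pr(T, P), giving a value in [-1, 1]; assumes 0 < p_tp < 1."""
    return pmi(p_t, p_p, p_tp) / -math.log(p_tp)


def coherence(p_t, p_p, p_tp):
    """Rescale normalized PMI from [-1, 1] to [0, 1] (assumed scoring convention)."""
    return (normalized_pmi(p_t, p_p, p_tp) + 1) / 2


# Hypothetical KB statistics: 10% of entities have type T, 5% appear as
# subject of P, and 4% are both.
score = coherence(p_t=0.10, p_p=0.05, p_tp=0.04)
print(round(score, 3))
```

When T and P are independent the score is 0.5, and when they always co-occur it reaches 1, so a higher score means the type is a more natural subject for the property.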


Finally! A metric for scoring table patterns:


Generating the top-k patterns


We have the top-k patterns — now what?



Asking the crowd for help — some challenges

vs.


Decomposing table patterns into questions

Column Type Validation

Relationship Validation


So we have our questions — in what order should we ask them?


Maximizing uncertainty reduction

prob. of pattern pi


Pattern validation

First, identify the variable that maximizes expected entropy reduction.

Query the crowd to validate that variable.

Remove the candidate patterns that violate the validated value, and repeat until one pattern is left.
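The validation loop can be sketched as below. Each candidate pattern fixes a value for every variable, so the expected reduction in pattern entropy from learning one variable equals the entropy of that variable's value distribution; the sketch therefore picks the highest-entropy variable at each step. The candidate patterns, their probabilities, and the `ask_crowd` callback are hypothetical stand-ins, not data from the paper.

```python
import math
from collections import defaultdict


def variable_entropy(patterns, var):
    """Entropy (in bits) of one variable's value distribution over the candidates."""
    mass = defaultdict(float)
    for assignment, prob in patterns:
        mass[assignment[var]] += prob
    total = sum(mass.values())
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)


def validate(patterns, ask_crowd):
    """Prune candidates until one is left, asking about the most uncertain variable first."""
    patterns = list(patterns)
    asked = []
    while len(patterns) > 1:
        remaining = [v for v in patterns[0][0] if v not in asked]
        if not remaining:
            break  # the remaining candidates cannot be told apart
        var = max(remaining, key=lambda v: variable_entropy(patterns, v))
        asked.append(var)
        options = sorted({assignment[var] for assignment, _ in patterns})
        answer = ask_crowd(var, options)
        survivors = [(a, p) for a, p in patterns if a[var] == answer]
        if survivors:  # keep the old set if the crowd rejected every option
            patterns = survivors
    return patterns, asked


# Hypothetical candidates for columns B and C of the example table.
candidates = [
    ({"B": "country", "C": "capital"}, 0.5),
    ({"B": "economy", "C": "capital"}, 0.3),
    ({"B": "country", "C": "city"}, 0.2),
]
truth = {"B": "country", "C": "capital"}  # simulated crowd answers
final, asked = validate(candidates, lambda var, options: truth[var])
print(asked, final)
```

In this toy run column B is asked first because its values split the probability mass 0.7 to 0.3, which is more uncertain than column C's 0.8 to 0.2.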


Wait — what about the “data cleaning” part?


Recognizing erroneous tuples

Just execute a SPARQL query on the tuple.

Fully covered? Otherwise, we need the crowd.
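A minimal sketch of this check, assuming the validated pattern is a list of (subject column, property, object column) edges and using an in-memory set of facts in place of a real SPARQL endpoint. Column types are ignored for brevity, and the KB contents are invented for the example.

```python
# Hypothetical KB facts; a real system would query the KB with SPARQL instead.
KB = {
    ("Italy", "hasCapital", "Rome"),
}

# Validated table pattern: column B hasCapital column C.
PATTERN = [("B", "hasCapital", "C")]


def annotate(row, pattern, kb):
    """Return ('kb', []) if every pattern edge is a KB fact, else ('crowd', questions)."""
    questions = []
    for subj_col, prop, obj_col in pattern:
        fact = (row[subj_col], prop, row[obj_col])
        if fact not in kb:
            # The KB does not cover this edge, so hand it to the crowd.
            questions.append(f"Does {fact[0]} {prop} {fact[2]}?")
    return ("kb", []) if not questions else ("crowd", questions)


rossi = {"A": "Rossi", "B": "Italy", "C": "Rome"}
klate = {"A": "Klate", "B": "S. Africa", "C": "Pretoria"}

print(annotate(rossi, PATTERN, KB))
print(annotate(klate, PATTERN, KB))
```

The first row is fully covered by the KB, while the second produces the kind of crowd question shown earlier for partial KB coverage.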


Table pattern implies this, crowd says yes.

Table pattern implies this, crowd says no.

Opportunity to enrich the KB


1 change 5 changes


Experiments


Setup

Algorithms:

(i) RankJoin (KATARA)
(ii) Support
(iii) MaxLike
(iv) PGM

Data:

Wikitables: 28 tables, avg. 32 tuples

Webtables: 30 tables, avg. 60 tuples

RelationalTables: 3 tables
Person: 317K tuples
Soccer: 1625 tuples
University: 1357 tuples

Ground truth table patterns


Effectiveness – pattern matching

RankJoin requires the fewest top-k patterns to achieve a high F-measure


Paper’s reason for fast convergence:

DBpedia: 865 types
Yago: 317K types


Efficiency – pattern matching

RankJoin outperforms MaxLike and PGM, and is nearly as fast as Support.

All are faster on DBpedia.


Effectiveness – crowdsourcing

10 students validate patterns, given 5 tuples per question


Most improvement from first question alone


Question order matters.

MUVF (most uncertain variable first)

AVF (all variables independent)


Crowd offers substantial annotation error reduction across the board


Effectiveness – repair suggestion

EQ: equivalence-class approach

SCARE: ML approach


Randomly generated errors?

Only RelationalTables?


Effectiveness – repair suggestions

Repairs are ranked well


Thoughts on the experiments?


Next steps?


Some possible next steps

Cold start – no KBs, pure crowdsourced knowledge bootstrapping

Nth degree relationships – person is from city that is located in state that is located in country

Leveraging multiple KBs at once: DBpedia and Yago, not just either/or