Transcript
Page 1: IEJoin and Big Data Cleansing

Fast and Scalable Inequality Joins

-- for Data Cleansing on Scale --

Zuhair Khayyat

PhD Candidate @ InfoCloud group

King Abdullah University of Science and Technology (KAUST)

Page 2: IEJoin and Big Data Cleansing

Data Cleansing

● Two customers having the same zip cannot be in different cities

Name     Zip    City
Winnie   91340  San Francisco
Robbert  91340  New York
Emma     91340  San Francisco
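The rule on this slide can be checked mechanically. The following is a minimal Python sketch, not taken from the talk: the function name `fd_violations` and the dictionary layout are assumptions made for illustration, while the three rows are the ones shown in the table.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return {lhs value: sorted distinct rhs values} for every lhs value
    that maps to more than one rhs value, i.e. violates lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}

# Rows from the slide's example table.
customers = [
    {"Name": "Winnie", "Zip": "91340", "City": "San Francisco"},
    {"Name": "Robbert", "Zip": "91340", "City": "New York"},
    {"Name": "Emma", "Zip": "91340", "City": "San Francisco"},
]

print(fd_violations(customers, "Zip", "City"))
# {'91340': ['New York', 'San Francisco']}
```

Grouping by the left-hand side needs only one pass over the data, which is why equality-based rules like this one are comparatively cheap to check.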

Page 3: IEJoin and Big Data Cleansing

● “inaccurate data has a direct impact ... the average company losing 12% of its revenue” -- Ben Davis (Econsultancy)

Data Cleansing

Page 4: IEJoin and Big Data Cleansing

● “This is the digital universe. It is growing 40% a year into the next decade” -- EMC2

Big Data Cleansing

Page 5: IEJoin and Big Data Cleansing

Data Cleansing System

Diagram: Data (Dirty); Quality Rules; Violations

Page 6: IEJoin and Big Data Cleansing

Data Cleansing System

Diagram: Data (Dirty); Quality Rules; Violations; Repair Algorithms; Fixes

Page 7: IEJoin and Big Data Cleansing

Data Cleansing System

Diagram: Data (Partially Clean); Quality Rules; Violations; Repair Algorithms; Fixes
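One plausible reading of the diagram labels is a loop: quality rules detect violations, a repair algorithm proposes fixes, and the partially clean data is checked again. The Python sketch below illustrates that reading only. The majority-vote repair, the function names, and the `max_rounds` parameter are hypothetical and are not described in the slides.

```python
from collections import Counter, defaultdict

def detect(rows):
    """Quality rule Zip -> City: return the zips that map to more than one city."""
    cities = defaultdict(set)
    for r in rows:
        cities[r["Zip"]].add(r["City"])
    return [z for z, c in cities.items() if len(c) > 1]

def repair(rows, bad_zips):
    """Hypothetical repair: overwrite City with the most frequent city per violating zip."""
    fixed = [dict(r) for r in rows]  # copy so the input stays untouched
    for z in bad_zips:
        counts = Counter(r["City"] for r in fixed if r["Zip"] == z)
        majority = counts.most_common(1)[0][0]
        for r in fixed:
            if r["Zip"] == z:
                r["City"] = majority
    return fixed

def cleanse(rows, max_rounds=10):
    """Alternate detection and repair until no violations remain (or rounds run out)."""
    for _ in range(max_rounds):
        bad = detect(rows)
        if not bad:
            break
        rows = repair(rows, bad)
    return rows

dirty = [
    {"Name": "Winnie", "Zip": "91340", "City": "San Francisco"},
    {"Name": "Robbert", "Zip": "91340", "City": "New York"},
    {"Name": "Emma", "Zip": "91340", "City": "San Francisco"},
]
clean = cleanse(dirty)
print([r["City"] for r in clean])
```

A real repair algorithm may need several rounds because fixing one violation can create or expose another; the loop bound is there only to keep the sketch from running forever.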

Page 8: IEJoin and Big Data Cleansing

Big Data Cleansing System

Diagram: Data (Partially Clean); Quality Rules; Violations; Repair Algorithms; Fixes; Big

Page 9: IEJoin and Big Data Cleansing

BigDansing: A System for Big Data Cleansing

In SIGMOD 2015

Diagram: Quality rules; Repair Algorithms; Dirty datasets; BigDansing; Clean datasets

Page 11: IEJoin and Big Data Cleansing

BigDansing: A System for Big Data Cleansing

In SIGMOD 2015

Functional dependencies

Inclusion dependencies

Denial constraints

Entity resolution

Domain Specific Language

Optimized execution plan

Page 12: IEJoin and Big Data Cleansing

BigDansing: A System for Big Data Cleansing

In SIGMOD 2015

Diagram: Violations; S1, S2, S3; Repair Algorithm (×3)

Page 13: IEJoin and Big Data Cleansing

BigDansing: A System for Big Data Cleansing

In SIGMOD 2015

Two customers having the same zip cannot be in different cities (FD: Zip → City)

Figure: BigDansing on Spark vs. Spark SQL. X-axis: Dataset Size (10M, 20M, 40M). Y-axis: Time (Sec), ticks 0 to 700.

Page 14: IEJoin and Big Data Cleansing

Complex Quality rule: Inequality joins

● If one person earns a higher salary than another, that person must also pay more tax

Page 15: IEJoin and Big Data Cleansing

Complex Quality rule: Inequality joins

● SELECT * FROM D t1 JOIN D t2 ON
    t1.Salary > t2.Salary AND
    t1.Tax < t2.Tax;

● Processed as a Cartesian product: O(n²)
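To make the quadratic cost concrete, here is a minimal Python sketch (not the authors' code) that evaluates the same predicate with a nested loop. The table name `D` and the Salary and Tax columns come from the query; the `id` field, the function name, and the use of the four tuples from the deck's example table are choices made for illustration.

```python
def naive_inequality_join(rows):
    """Compare every tuple with every other tuple: n * n predicate checks."""
    return [
        (a["id"], b["id"])
        for a in rows
        for b in rows
        if a["Salary"] > b["Salary"] and a["Tax"] < b["Tax"]
    ]

D = [
    {"id": "t1", "Salary": 100, "Tax": 5},
    {"id": "t2", "Salary": 90, "Tax": 9},
    {"id": "t3", "Salary": 150, "Tax": 15},
    {"id": "t4", "Salary": 120, "Tax": 10},
]

print(naive_inequality_join(D))  # [('t1', 't2')]
```

Only t1 and t2 violate the rule (t1 earns more but pays less tax), yet the loop still performs all 16 comparisons. Doubling the number of rows quadruples the work.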

Page 18: IEJoin and Big Data Cleansing

Lightning Fast and Space Efficient Inequality Joins

In VLDB 2015

Tuple  Salary  Tax
t1     100     5
t2     90      9
t3     150     15
t4     120     10

● Sort on Salary: t3(150) t4(120) t1(100) t2(90)

Positions: 0 1 2 3

● Sort on Tax: t3(15) t4(10) t2(9) t1(5)

Permutation array: 0 1 3 2

● Bit-array: 0 0 0 0

O(n log n)
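The slide's ingredients (two sorted orders, a permutation array, and a bit-array) can be combined as in the following Python sketch. It is an illustration under assumptions, not the published IEJoin implementation: it covers only the self-join on Salary and Tax from the earlier query, the function and variable names are invented here, and an explicit predicate check is added to keep results correct when values tie.

```python
def iejoin_self(rows):
    """Sketch: find pairs (a, b) with a.Salary > b.Salary AND a.Tax < b.Tax.
    Returns (permutation array, list of (a_id, b_id))."""
    n = len(rows)
    # Row indices sorted by Salary and by Tax, both descending as on the slide.
    by_salary = sorted(range(n), key=lambda i: rows[i]["Salary"], reverse=True)
    by_tax = sorted(range(n), key=lambda i: rows[i]["Tax"], reverse=True)

    # salary_pos[i] = position of row i in the Salary order.
    salary_pos = [0] * n
    for pos, i in enumerate(by_salary):
        salary_pos[i] = pos

    # Permutation array: for each entry of the Tax order, its Salary position.
    perm = [salary_pos[i] for i in by_tax]

    bits = [0] * n  # bit-array over Salary positions, initially all zero
    result = []
    for k, i in enumerate(by_tax):  # visit rows from highest to lowest Tax
        a = rows[i]
        # Set bits to the right of a's Salary position belong to rows that were
        # visited earlier (Tax >= a.Tax) and sort later by Salary (<= a.Salary).
        for pos in range(perm[k] + 1, n):
            if bits[pos]:
                b = rows[by_salary[pos]]
                # Assumption of this sketch: re-check strictness to handle ties.
                if a["Salary"] > b["Salary"] and a["Tax"] < b["Tax"]:
                    result.append((a["id"], b["id"]))
        bits[perm[k]] = 1  # mark a as visited
    return perm, result

D = [
    {"id": "t1", "Salary": 100, "Tax": 5},
    {"id": "t2", "Salary": 90, "Tax": 9},
    {"id": "t3", "Salary": 150, "Tax": 15},
    {"id": "t4", "Salary": 120, "Tax": 10},
]

perm, pairs = iejoin_self(D)
print(perm)   # [0, 1, 3, 2]
print(pairs)  # [('t1', 't2')]
```

The computed permutation array matches the one on the slide. In this sketch the two sorts cost O(n log n); the bit-array scan adds work that depends on how many positions are inspected, so the sketch makes no claim about matching the performance of the real system.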

Page 19: IEJoin and Big Data Cleansing

IEJoin vs. DBMS (Single machine)

Figure: PG-IEJoin vs. Postgres, MonetDB, and DBMS-X. X-axis: Dataset size (10K, 50K, 100K). Y-axis: Runtime (Sec), logarithmic ticks from 0.01 to 10000.

Page 20: IEJoin and Big Data Cleansing

IEJoin vs. Spark SQL (Distributed)

Figure: IEJoin on Spark SQL vs. Spark SQL. X-axis: Salary-Tax Range Intersection. Y-axis: Runtime (Hours), ticks 0 to 24.

100M rows on 6 machines

Page 21: IEJoin and Big Data Cleansing

IEJoin on 8B Rows

● A cluster of 16 workers

● 8B rows, 287 GB on Disk

● Runtime: 13 hours

● Close to the ideal speedup

Figure: IEJoin on Spark SQL vs. Ideal Speedup. X-axis: Cluster Size (ticks 0 to 18). Y-axis: Hours (ticks 0 to 100).

Page 22: IEJoin and Big Data Cleansing

Visit us!

● Zuhair Khayyat – cloud.kaust.edu.sa

● SIGMOD 15 – BigDansing paper

● VLDB 15 – IEJoin paper (to be presented at VLDB 16)

● SIGMOD 16 – Demo paper

