Floris Geerts (University of Antwerp) Giansalvatore Mecca, Donatello Santoro (Università della Basilicata) Paolo Papotti (Qatar Computing Research Institute ) ICDE2014 April, 1st
Page 1: Floris Geerts  ( University of Antwerp )

Floris Geerts (University of Antwerp)
Giansalvatore Mecca, Donatello Santoro (Università della Basilicata)
Paolo Papotti (Qatar Computing Research Institute)

ICDE 2014, April 1st

Page 2: Floris Geerts  ( University of Antwerp )

Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014

Overview

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 4: Floris Geerts  ( University of Antwerp )


A Mapping and Cleaning Task

Source 1
Source 2
Source 3

Target

Schema Mapping System
Data Cleaning Tools

STRONGLY INTERRELATED PROBLEMS

Page 5: Floris Geerts  ( University of Antwerp )


A Motivating Example

TARGET

Treatments
       SSN  Salary  Insurance  Treat.    Date
  t6   222  10k     Abx        Dental    10/01/11
  t7   222  25k     Abx        Cholest.  08/12/12

Customers
       SSN  Name       Phone     PhConf  Str.  City  CC#
  t4   222  L. Lennon  122-1876  0.9     Null  NY    781658
  t5   222  L. Lennon  000-0000  0.0     Fry   SF    784659

SOURCE #1 (CONFIDENCE 0.7)

MedicalTreatments
       SSN  Name      Phone     Str        City  Insur  Treat   Date
  t1   124  W. Smith  324-3455  Pico Blvd  LA    Med    Lapar.  03/11/2013

SOURCE #2 (CONFIDENCE 0.5)

Patients
       SSN  Name      Phone     Str        City
  t2   123  W. Smith  324-0000  Pico Blv.  LA

Surgeries
       SSN  Insurance  Treat.     Date
  t3   123  Med        Eye surg.  12/01/2013

SOURCE #3 (CONFIDENCE 1.0)

Master Data
       SSN  Name       Phone     Street   City
  tm   222  F. Lennon  122-1876  Sky Dr.  SF

Page 6: Floris Geerts  ( University of Antwerp )


A Motivating Example

Step 1: exchange data from source to target

Page 7: Floris Geerts  ( University of Antwerp )


A Motivating Example

ST-TGD

Schema mapping transformations can be expressed as a set of source-to-target tuple-generating dependencies (st-tgds) [Popa et al., VLDB ’02]

Page 8: Floris Geerts  ( University of Antwerp )


A Motivating Example

TARGET

Treatments
       SSN  Salary  Insurance  Treat.    Date
  t6   222  10k     Abx        Dental    10/01/11
  t7   222  25k     Abx        Cholest.  08/12/12
  t9   124  Null    Med        Lapar.    03/11/2013

Customers
       SSN  Name       Phone     PhConf  Str.  City  CC#
  t4   222  L. Lennon  122-1876  0.9     Null  NY    781658
  t5   222  L. Lennon  000-0000  0.0     Fry   SF    784659
  t8   124  W. Smith   324-3455  0.7     Pico  LA    Null

SOURCE #1 (CONFIDENCE 0.7)

MedicalTreatments
       SSN  Name      Phone     Str        City  Insur  Treat   Date
  t1   124  W. Smith  324-3455  Pico Blvd  LA    Med    Lapar.  03/11/2013

ST-TGD (source-to-target TGD)
MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d)
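The effect of this st-tgd on the source tuple t1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout, the function names `fresh_null` and `apply_m1`, and the null labels `N1`, `N2` are hypothetical and are not taken from the slides or from the Llunatic system. Each existential variable (Y3, Y4) is filled with a fresh labeled null, which corresponds to the Null entries of t8 and t9.

```python
from itertools import count

_fresh = count(1)

def fresh_null():
    """Return a new labeled null such as 'N1', 'N2' (the naming is an assumption)."""
    return f"N{next(_fresh)}"

def apply_m1(med_treatments, confidence=0.7):
    """Apply the st-tgd to every MedTreat tuple; return (customers, treatments)."""
    customers, treatments = [], []
    for ssn, name, phone, street, city, insur, treat, date in med_treatments:
        # Y3 and Y4 are existentially quantified: invent one fresh null for each.
        y3, y4 = fresh_null(), fresh_null()
        customers.append((ssn, name, phone, confidence, street, city, y3))
        treatments.append((ssn, y4, insur, treat, date))
    return customers, treatments

# Source tuple t1 from MedicalTreatments.
t1 = ("124", "W. Smith", "324-3455", "Pico Blvd", "LA", "Med", "Lapar.", "03/11/2013")
cust, treat = apply_m1([t1])
print(cust)   # one Customers tuple, with a labeled null in the CC# position
print(treat)  # one Treatments tuple, with a labeled null in the Salary position
```

The constant 0.7 is the confidence of Source #1, copied into PhConf exactly as the head of the dependency prescribes.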

Page 9: Floris Geerts  ( University of Antwerp )


A Motivating Example

TARGET

Treatments
       SSN  Salary  Insurance  Treat.     Date
  t6   222  10k     Abx        Dental     10/01/11
  t7   222  25k     Abx        Cholest.   08/12/12
  t9   124  Null    Med        Lapar.     03/11/2013
  t11  123  Null    Med        Eye surg.  12/01/2013

Customers
       SSN  Name       Phone     PhConf  Str.  City  CC#
  t4   222  L. Lennon  122-1876  0.9     Null  NY    781658
  t5   222  L. Lennon  000-0000  0.0     Fry   SF    784659
  t8   124  W. Smith   324-3455  0.7     Pico  LA    Null
  t10  123  W. Smith   324-0000  0.5     Pico  LA    Null

SOURCE #2 (CONFIDENCE 0.5)

Patients
       SSN  Name      Phone     Str        City
  t2   123  W. Smith  324-0000  Pico Blv.  LA

Surgeries
       SSN  Insurance  Treat.     Date
  t3   123  Med        Eye surg.  12/01/2013

ST-TGD (source-to-target TGD)
Pat(ssn, n, p, s, c), Surg(ssn, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.5, s, c, Y3), Treat(ssn, Y4, i, t, d)

Pre-Solution for the TGDs

Page 10: Floris Geerts  ( University of Antwerp )


A Motivating Example

Step 2: ensure data quality

Page 11: Floris Geerts  ( University of Antwerp )


A Motivating Example

Functional Dependencies (FD)
fd1. Cust: SSN → Name, Phone, Str, City, CC#
fd2. Cust: Name, Str, City → SSN
fd3. Treat: SSN → Salary

Inclusion Dependencies (ID)
id4. Treat[SSN] ⊆ Customers[SSN]

Conditional Functional Dependencies (CFD)
cfd5. Treat: Insur[‘Abx’] → Tr[‘Dental’]
cfd6. IF Treat: Insur[‘Abx’] THEN Cust: City[‘SF’]

Editing Rules (ER)
er7. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Street from MD

Master Data
       SSN  Name       Phone     Street
  tm   222  F. Lennon  122-1876  Sky Dr.
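To see which of these rules the pre-solution violates, a functional dependency can be checked by grouping tuples on the left-hand side and comparing their right-hand sides. The sketch below is a hedged illustration, not the authors' implementation: the function name `fd_violations` and the dictionary layout are assumptions, and nulls are represented by `None` and compared like ordinary values, which is a simplification.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return sorted groups of tuple ids that agree on lhs but differ on rhs."""
    groups = defaultdict(list)
    for tid, row in rows.items():
        groups[tuple(row[a] for a in lhs)].append(tid)
    violations = []
    for tids in groups.values():
        # More than one distinct right-hand side within a group is a violation.
        projections = {tuple(rows[t][a] for a in rhs) for t in tids}
        if len(projections) > 1:
            violations.append(sorted(tids))
    return violations

# Customers after the data exchange step (None stands for Null).
customers = {
    "t4":  {"SSN": "222", "Name": "L. Lennon", "Phone": "122-1876",
            "Str": None, "City": "NY", "CC#": "781658"},
    "t5":  {"SSN": "222", "Name": "L. Lennon", "Phone": "000-0000",
            "Str": "Fry", "City": "SF", "CC#": "784659"},
    "t8":  {"SSN": "124", "Name": "W. Smith", "Phone": "324-3455",
            "Str": "Pico", "City": "LA", "CC#": None},
    "t10": {"SSN": "123", "Name": "W. Smith", "Phone": "324-0000",
            "Str": "Pico", "City": "LA", "CC#": None},
}

fd1 = fd_violations(customers, ["SSN"], ["Name", "Phone", "Str", "City", "CC#"])
fd2 = fd_violations(customers, ["Name", "Str", "City"], ["SSN"])
print("fd1 violations:", fd1)
print("fd2 violations:", fd2)
```

On this data, fd1 is violated by the two tuples with SSN 222, and fd2 by the two W. Smith tuples that share name, street and city but carry different SSNs.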

Page 12: Floris Geerts  ( University of Antwerp )


A Motivating Example

VIOLATIONS

Page 13: Floris Geerts  ( University of Antwerp )


A Motivating Example

Previous Semantics?

Page 14: Floris Geerts  ( University of Antwerp )


Data Exchange [Fagin et al., TCS ’05]
• Elegant semantics
• Scalable algorithms

Customers
       SSN  Name       Phone     PhConf  Str.  City  CC#
  t4   222  L. Lennon  122-1876  0.9     Null  NY    781658
  t5   222  L. Lennon  000-0000  0.0     Fry   SF    784659
  t8   124  W. Smith   324-3455  0.7     Pico  LA    Null
  t10  123  W. Smith   324-0000  0.5     Pico  LA    Null

ST-TGD  FD  ID  CFD  ER

fd1. Cust: SSN → Name, Phone, Str, City, CC#

Soft Violation
Hard Violation

Page 15: Floris Geerts  ( University of Antwerp )


Data Repairing

Hard Violation
• Many approaches and techniques [Bohannon SIGMOD ’05] [Cong VLDB ’07] [Kolahi ICDT ’09] [Fan VLDB ’10] [Beskales VLDB ’10]
• No support for mapping
• No way to handle our example
• Main-memory implementation only!

TGD  FD  ID  CFD  ER  INTERACTION!

Page 16: Floris Geerts  ( University of Antwerp )


Pipeline

Source 1, Source 2 → Data Exchange → Pre-solution for ST-TGDs → Data Repairing → Cleaned Target

✔ Mappings  ✗ Cleaning Rules
✔ Cleaning Rules  ✗ Mappings
Cleaned Target: ✔ Mappings  ✔ Cleaning Rules

• Negative result: there exist scenarios for which the pipeline does not return a solution
• Even when it works, the quality of its result is usually poor

Page 17: Floris Geerts  ( University of Antwerp )


Pipeline

TARGET
  Treatments: t6, t7
  Customers: t4, t5

SOURCE #1 (CONFIDENCE 0.7)
  MedicalTreatments: t1

SOURCE #2 (CONFIDENCE 0.5)
  Patients: t2
  Surgeries: t3

Page 18: Floris Geerts  ( University of Antwerp )


Pipeline

TARGET
  Treatments: t6, t7, t9, t11
  Customers: t4, t5, t8, t10

SOURCE #1 (CONFIDENCE 0.7)
  MedicalTreatments: t1

SOURCE #2 (CONFIDENCE 0.5)
  Patients: t2
  Surgeries: t3

Pre-solution for TGDs

Page 19: Floris Geerts  ( University of Antwerp )


Pipeline

TARGET
  Treatments: t6, t7, t9, t11
  Customers: t4, t5, t8, t10

SOURCE #1 (CONFIDENCE 0.7)
  MedicalTreatments: t1

SOURCE #2 (CONFIDENCE 0.5)
  Patients: t2
  Surgeries: t3

fd2. Cust: Name, Str, City → SSN

t8 and t10 agree on Name, Str. and City but have different SSNs (124 and 123), so they violate fd2. The slide highlights the value 123.

Page 20: Floris Geerts  ( University of Antwerp )


Pipeline

TARGET
  Treatments: t6, t7, t9, t11
  Customers: t4, t5, t8, t10, and the new tuple
       SSN  Name      Phone     PhConf  Str.  City  CC#
  t12  124  W. Smith  324-3455  0.7     Pico  LA    Null

SOURCE #1 (CONFIDENCE 0.7)
  MedicalTreatments: t1

SOURCE #2 (CONFIDENCE 0.5)
  Patients: t2
  Surgeries: t3

Highlighted value: 123

Page 21: Floris Geerts  ( University of Antwerp )


Contributions

A Uniform Framework for
• Type 1: Schema Mapping Scenarios
• Type 2: Data Repairing Scenarios
• Type 3: Mapping and Cleaning Scenarios

Constraint classes shown: ST-TGD, TGD, FD, ID, CFD, ER, MD

With a fast and general-purpose chase engine

Page 22: Floris Geerts  ( University of Antwerp )


Overview

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 23: Floris Geerts  ( University of Antwerp )


Llunatic Data Repairing [Geerts et al., VLDB ’13]
• An extension of the data-repairing framework
• Let’s see a quick summary…

1. Partial Order
2. Cell Groups
3. LLUNs
4. Upgrades

Page 24: Floris Geerts  ( University of Antwerp )


Llunatic Data Repairing [Geerts et al., VLDB ’13]
• The Partial Order Π
  • Elegant way to model preference rules

Annotations on the Treatments, Customers and Master Data tables:
• Standard preference rules
• Ordering attribute
• No order
• PREFERRED VALUE

Page 25: Floris Geerts  ( University of Antwerp )


Llunatic Data Repairing [Geerts et al., VLDB ’13]
• The Partial Order Π
  • Elegant way to model preference rules
• LLUNs
  • a new class of symbols
  • placeholders used to mark conflicts

L0

Page 26: Floris Geerts  ( University of Antwerp )


Llunatic Data Repairing [Geerts et al., VLDB ’13]
• The Partial Order Π
  • Elegant way to model preference rules
• LLUNs
  • a new class of symbols
• Cell Groups
  • Represent the set of changes

g1 = <122 → {t4.phn, t5.phn}>
g2 = <Sky → {t4.str, t5.str} by {tm.str^auth}>

Values shown in the Customers cells: Phone 122-1876, Str. Sky
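A cell group such as g1 or g2 can be read as "this value, written into these cells, optionally justified by these authoritative cells". The following is a minimal sketch of that reading, assuming a hypothetical `CellGroup` class and `apply_cell_group` helper; neither name comes from the slides or from Llunatic, and the full phone number is used where the slide abbreviates it to 122.

```python
from dataclasses import dataclass, field

@dataclass
class CellGroup:
    value: str                  # value assigned to every occurrence
    occurrences: set            # target cells as (tuple_id, attribute) pairs
    justifications: set = field(default_factory=set)  # authoritative cells backing the value

def apply_cell_group(instance, group):
    """Return a copy of the instance in which each occurrence holds group.value."""
    updated = {tid: dict(row) for tid, row in instance.items()}
    for tid, attr in group.occurrences:
        updated[tid][attr] = group.value
    return updated

# Phone and Str. cells of t4 and t5 (None stands for Null).
customers = {
    "t4": {"Phone": "122-1876", "Str": None},
    "t5": {"Phone": "000-0000", "Str": "Fry"},
}

g1 = CellGroup("122-1876", {("t4", "Phone"), ("t5", "Phone")})
g2 = CellGroup("Sky", {("t4", "Str"), ("t5", "Str")}, {("tm", "Street")})

repaired = apply_cell_group(apply_cell_group(customers, g1), g2)
print(repaired)
```

Keeping the justifications next to the value records why a change was made, here that the street comes from the master-data tuple tm.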

Page 27: Floris Geerts  ( University of Antwerp )


Upgrades

J
       SSN  Name       CC#
  t2   222  L. Lennon  111
  t3   222  L. Lennon  555

e1. Cust(ssn, n, ph, c, cc), Cust(ssn, n’, ph’, c’, cc’) → cc = cc’

Update 1: g1 <L0 → {t4.cc, t5.cc}>
       SSN  Name       CC#
  t2   222  L. Lennon  L0
  t3   222  L. Lennon  L0

Update 2: g2 <L1 → {t4.ssn}>
       SSN  Name       CC#
  t2   L1   L. Lennon  111
  t3   222  L. Lennon  555

Update 3: g3 <555 → {t4.cc, t5.cc}>
       SSN  Name       CC#
  t2   222  L. Lennon  555
  t3   222  L. Lennon  555

Update 4: g4 <777 → {t4.ssn}>
       SSN  Name       CC#
  t2   777  L. Lennon  111
  t3   222  L. Lennon  555

Update 5: g5 <333 → {t4.cc, t5.cc}>
       SSN  Name       CC#
  t2   222  L. Lennon  333
  t3   222  L. Lennon  333

Labels on the slide: Forward, Backward, Cardinality Minimal, Upgrades, Not an upgrade

Upgrade: an improvement over J, since it contains better values wrt Π

Page 28: Floris Geerts  ( University of Antwerp )


Upgrades

J
       SSN  Name       CC#
  t2   222  L. Lennon  111
  t3   222  L. Lennon  555

e1. Cust(ssn, n, ph, c, cc), Cust(ssn, n’, ph’, c’, cc’) → cc = cc’

Update 1: g1 <L0 → {t4.cc, t5.cc}>
Update 2: g2 <L1 → {t4.ssn}>

Update 6: g6 <L2 → {all cells}>
       SSN  Name  CC#
  t2   L2   L2    L2
  t3   L2   L2    L2

Labels on the slide: Forward, Backward, Minimal Solutions, over-generalization

Page 29: Floris Geerts  ( University of Antwerp )


ST-TGDs, T-TGDs, User Inputs

Non-trivial extension!

Page 30: Floris Geerts  ( University of Antwerp )


Mapping and Cleaning Scenario

M&C Scenario M = {S, Sa, T, Σt, Σe, Π, USER}
• S: source schema; Sa: authoritative source tables
• T: target schema; Σt: TGDs; Σe: EGDs
• Π: the partial order specification
• USER: a partial function to abstract user interaction
• Solution: given M, an instance I of S, and an instance J of T, a solution is an instance J’ such that:
  • it is a repair, i.e., “I and J’ satisfy Σt ∪ Σe”
  • and “J’ is an upgrade of J according to Π”

Page 31: Floris Geerts  ( University of Antwerp )


How to handle TGDs

• We model it in terms of cell groups and updates

TARGET
  Treatments: t6, t7
  Customers: t4, t5

SOURCE #1 (CONFIDENCE 0.7)
  MedicalTreatments: t1

m1: MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d)

New target tuples
  Customers   t8  124  W. Smith  324-3455  0.7  Pico  LA  Null
  Treatments  t9  124  Null  Med  Lapar.  03/11/2013

g1 = <124 → {t8.ssn^new, t9.ssn^new} by {t1.ssn}>
g2 = <W. Smith → {t8.name^new} by {t1.name}> ...

The superscript "new" marks new cells.

KEY INTUITION: we do not disrupt key-fkey equality in the following

Page 32: Floris Geerts  ( University of Antwerp )


User Inputs
• In the presence of inconsistencies, user inputs are crucial. The user may
  • change the value of a cell group
  • refuse a cell group
• We model user interaction using a partial function over cell groups

g1 <123 → {t4.ph, t5.ph}>
       SSN  Name       Phone
  t2   222  L. Lennon  123
  t3   222  L. Lennon  123

g2 <L1 → {t4.ssn}>
       SSN  Name       Phone
  t2   L1   L. Lennon  123
  t3   222  L. Lennon  000

Value shown on the slide: 555

Page 33: Floris Geerts  ( University of Antwerp )


Non-trivial extension
• Data cleaning semantics has some nice properties
  • scenario C always has a solution for <I, J>
  • the chase always terminates (it never fails)
• Adding TGDs and User Inputs
  • the concept of upgrade changes significantly
  • requires completely reworking upgrades

Page 34: Floris Geerts  ( University of Antwerp )


Upgrades
• Must take into account many issues
  • some target cells are “better” than others
  • source cells may be authoritative
  • compare instances with different new values
  • compare instances with different numbers of tuples
  • some cells may be changed by users

Page 35: Floris Geerts  ( University of Antwerp )


A Few Results
• Conservative extension of data exchange
  • Every (core) solution of a data exchange scenario corresponds to a (minimal) solution of its associated mapping scenario, and vice versa
• Given an M&C scenario, if Σt is a set of weakly-acyclic tgds, then the chase terminates
  • in essence we may re-use termination conditions for data exchange
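Weak acyclicity is the standard termination condition from data exchange, and it can be tested on the dependency graph over relation positions. The sketch below is an illustration under stated assumptions rather than the authors' code: the function name `is_weakly_acyclic` and the atom encoding are hypothetical, every atom argument is treated as a variable (constants are not handled), and the second dependency in the usage example is invented to show a failing case.

```python
def is_weakly_acyclic(tgds):
    """tgds: list of (body, head); each side is a list of (relation, [variables])."""
    regular, special = set(), set()
    for body, head in tgds:
        body_pos, head_pos = {}, {}
        for atoms, index in ((body, body_pos), (head, head_pos)):
            for rel, args in atoms:
                for i, var in enumerate(args):
                    index.setdefault(var, set()).add((rel, i))
        existential = [v for v in head_pos if v not in body_pos]
        for var, sources in body_pos.items():
            if var not in head_pos:
                continue  # variable is not propagated to the head
            for p in sources:
                for q in head_pos[var]:
                    regular.add((p, q))      # value is copied from p to q
                for y in existential:
                    for q in head_pos[y]:
                        special.add((p, q))  # a new null is created at q
    successors = {}
    for p, q in regular | special:
        successors.setdefault(p, set()).add(q)

    def reaches(start, goal):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return False

    # A cycle goes through the special edge p -> q iff p is reachable from q.
    return not any(reaches(q, p) for p, q in special)

# id4 written as a tgd over simplified two-column relations.
id4 = ([("Treat", ["ssn", "sal"])], [("Cust", ["ssn", "name"])])
# Hypothetical dependency that feeds invented values back into Treat.
back = ([("Cust", ["ssn", "name"])], [("Treat", ["name", "s"])])

print(is_weakly_acyclic([id4]))
print(is_weakly_acyclic([id4, back]))
```

With id4 alone the graph has no cycle, so the chase is guaranteed to terminate; adding the invented dependency creates a cycle through a null-creating edge, so the guarantee is lost.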

Page 36: Floris Geerts  ( University of Antwerp )


Overview

‣ Motivations and Goals

‣ Semantics

‣ Experimental Results

Page 37: Floris Geerts  ( University of Antwerp )


Chase Tree

Chase algorithm for chasing egds and tgds

[Figure: chase tree rooted at J with nodes R1 to R6 and R10 to R15; branches labeled (e0, b1), (e0, b2), (e0, f), (e1, b1), (e1, b2), (e1, f); one subtree shows the e0-e1 sequence, the other the e1-e0 sequence]

Different orders of application give different results

Page 38: Floris Geerts  ( University of Antwerp )


Scalability Techniques
• Chase implementation based on equivalence classes
• Delta Databases
  • a representation system for chase trees
• Cost managers
  • pluggable strategies to prune the chase tree

Page 39: Floris Geerts  ( University of Antwerp )

Scalability

[Figure: DOCTORS-MC; time in sec. (0 to 5000) at 100000 K, 400000 K, 700000 K and 1000000 K; series: LLUNATIC-FR-S1, LLUNATIC-FR-S5, LLUNATIC-FR-S10, LLUNATIC-FR-S50]

Page 40: Floris Geerts  ( University of Antwerp )


Quality of Repairs

[Figure: HOSPITAL-MC NORM, max. rep-rate(Rep, DBexp); three panels (5k, 6%-10%; 10k, 6%-10%; 25k, 6%-10%), each with ticks 1% to 5%; scale from -30% to 60%; series: LLUNATIC-FR-S1, PIPELINE]

Page 41: Floris Geerts  ( University of Antwerp )


That’s all Folks!