Floris Geerts (University of Antwerp), Giansalvatore Mecca, Donatello Santoro (Università della Basilicata), Paolo Papotti (Qatar Computing Research Institute)
ICDE 2014, April 1st
Feb 23, 2016
Mapping and Cleaning – F. Geerts, G. Mecca, P. Papotti, D. Santoro April, 1 2014
Overview
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
A Mapping and Cleaning Task
[Figure: Source 1, Source 2, and Source 3 are connected to the Target; labels: Schema Mapping System, Data Cleaning Tools]
STRONGLY INTERRELATED PROBLEMS
A Motivating Example

SOURCE #1 (CONFIDENCE 0.7)
MedicalTreatments
      SSN  Name      Phone     Str        City  Insur  Treat   Date
t1    124  W. Smith  324-3455  Pico Blvd  LA    Med    Lapar.  03/11/2013

SOURCE #2 (CONFIDENCE 0.5)
Patients
      SSN  Name      Phone     Str        City
t2    123  W. Smith  324-0000  Pico Blv.  LA
Surgeries
      SSN  Insurance  Treat.     Date
t3    123  Med        Eye surg.  12/01/2013

SOURCE #3 (CONFIDENCE 1.0)
Master Data
      SSN  Name       Phone     Street   City
tm    222  F. Lennon  122-1876  Sky Dr.  SF

TARGET
Customers
      SSN  Name       Phone     PhConf  Str.  City  CC#
t4    222  L. Lennon  122-1876  0.9     Null  NY    781658
t5    222  L. Lennon  000-0000  0.0     Fry   SF    784659
Treatments
      SSN  Salary  Insurance  Treat.    Date
t6    222  10k     Abx        Dental    10/01/11
t7    222  25k     Abx        Cholest.  08/12/12
Step 1: Exchange data from the sources to the target
Schema mappings: the transformation can be expressed as a set of source-to-target tuple-generating dependencies (st-tgds) [Popa et al., VLDB ’02]
A Motivating Example

SOURCE #1 (CONFIDENCE 0.7)
MedicalTreatments
      SSN  Name      Phone     Str        City  Insur  Treat   Date
t1    124  W. Smith  324-3455  Pico Blvd  LA    Med    Lapar.  03/11/2013

TARGET
Customers
      SSN  Name       Phone     PhConf  Str.  City  CC#
t4    222  L. Lennon  122-1876  0.9     Null  NY    781658
t5    222  L. Lennon  000-0000  0.0     Fry   SF    784659
t8    124  W. Smith   324-3455  0.7     Pico  LA    Null
Treatments
      SSN  Salary  Insurance  Treat.    Date
t6    222  10k     Abx        Dental    10/01/11
t7    222  25k     Abx        Cholest.  08/12/12
t9    124  Null    Med        Lapar.    03/11/2013

Source-to-Target TGD (ST-TGD):
MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d)
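The MedTreat dependency above can be read operationally: each source tuple produces one Customers tuple and one Treatments tuple, with a fresh labeled null for every existential variable. The following is a minimal Python sketch of that reading, not the authors' chase engine; the helper names (`fresh_null`, `apply_m1`) and the null naming scheme `N1, N2, ...` are assumptions made for illustration.

```python
from itertools import count

_fresh = count(1)

def fresh_null():
    """Return a new labeled null (hypothetical naming scheme N1, N2, ...)."""
    return f"N{next(_fresh)}"

def apply_m1(med_treat_rows, confidence=0.7):
    """Apply the MedTreat st-tgd: one Cust and one Treat tuple per source row."""
    customers, treatments = [], []
    for ssn, name, phone, street, city, insur, treat, date in med_treat_rows:
        y3, y4 = fresh_null(), fresh_null()  # existential variables Y3 and Y4
        customers.append((ssn, name, phone, confidence, street, city, y3))
        treatments.append((ssn, y4, insur, treat, date))
    return customers, treatments

# Source tuple t1 from the example.
t1 = ("124", "W. Smith", "324-3455", "Pico Blvd", "LA", "Med", "Lapar.", "03/11/2013")
customers, treatments = apply_m1([t1])
print(customers[0])
print(treatments[0])
```

The shared `ssn` variable is what keeps the two generated tuples joinable, while the CC# and Salary positions receive distinct nulls.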
A Motivating Example

SOURCE #2 (CONFIDENCE 0.5)
Patients
      SSN  Name      Phone     Str        City
t2    123  W. Smith  324-0000  Pico Blv.  LA
Surgeries
      SSN  Insurance  Treat.     Date
t3    123  Med        Eye surg.  12/01/2013

TARGET
Customers
      SSN  Name       Phone     PhConf  Str.  City  CC#
t4    222  L. Lennon  122-1876  0.9     Null  NY    781658
t5    222  L. Lennon  000-0000  0.0     Fry   SF    784659
t8    124  W. Smith   324-3455  0.7     Pico  LA    Null
t10   123  W. Smith   324-0000  0.5     Pico  LA    Null
Treatments
      SSN  Salary  Insurance  Treat.     Date
t6    222  10k     Abx        Dental     10/01/11
t7    222  25k     Abx        Cholest.   08/12/12
t9    124  Null    Med        Lapar.     03/11/2013
t11   123  Null    Med        Eye surg.  12/01/2013

Source-to-Target TGD (ST-TGD):
Pat(ssn, n, p, s, c), Surg(ssn, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.5, s, c, Y3), Treat(ssn, Y4, i, t, d)

Pre-Solution for the TGDs
Step 2: Ensure data quality
A Motivating Example
Functional Dependencies (FD)
fd1. Cust: SSN → Name, Phone, Str, City, CC#
fd2. Cust: Name, Str, City → SSN
fd3. Treat: SSN → Salary

Inclusion Dependencies (ID)
id4. Treat[SSN] ⊆ Customers[SSN]

Conditional Functional Dependencies (CFD)
cfd5. Treat: Insur[‘Abx’] → Tr[‘Dental’]
cfd6. IF Treat: Insur[‘Abx’] THEN Cust: City[‘SF’]

Editing Rules (ER)
er7. IF Cust.SSN = MD.SSN, Cust.Phone = MD.Phone → TAKE Name, Street from MD

Master Data
      SSN  Name       Phone     Street
tm    222  F. Lennon  122-1876  Sky Dr.
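VIOLATIONS
The target pre-solution does not satisfy these constraints. For example, t4 and t5 share SSN 222 but disagree on Phone, Str, City, and CC#, which violates fd1.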
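To make the functional-dependency violations concrete, here is a small Python sketch that groups tuples on the left-hand side of an FD and reports groups that disagree on the right-hand side. It is an illustration only: the function name `fd_violations` is hypothetical, and nulls are treated as ordinary values (`None`), which is a simplification of the semantics discussed in the talk.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return groups of tuple ids that agree on lhs but disagree on rhs."""
    groups = defaultdict(list)
    for tid, row in rows.items():
        groups[tuple(row[a] for a in lhs)].append(tid)
    violations = []
    for tids in groups.values():
        rhs_values = {tuple(rows[t][a] for a in rhs) for t in tids}
        if len(rhs_values) > 1:  # same lhs, different rhs
            violations.append(sorted(tids))
    return violations

# Target pre-solution from the example (None stands for Null).
customers = {
    "t4": {"SSN": "222", "Name": "L. Lennon", "Phone": "122-1876", "Str": None, "City": "NY", "CC#": "781658"},
    "t5": {"SSN": "222", "Name": "L. Lennon", "Phone": "000-0000", "Str": "Fry", "City": "SF", "CC#": "784659"},
    "t8": {"SSN": "124", "Name": "W. Smith", "Phone": "324-3455", "Str": "Pico", "City": "LA", "CC#": None},
    "t10": {"SSN": "123", "Name": "W. Smith", "Phone": "324-0000", "Str": "Pico", "City": "LA", "CC#": None},
}
treatments = {
    "t6": {"SSN": "222", "Salary": "10k"},
    "t7": {"SSN": "222", "Salary": "25k"},
    "t9": {"SSN": "124", "Salary": None},
    "t11": {"SSN": "123", "Salary": None},
}

fd1 = fd_violations(customers, ["SSN"], ["Name", "Phone", "Str", "City", "CC#"])
fd2 = fd_violations(customers, ["Name", "Str", "City"], ["SSN"])
fd3 = fd_violations(treatments, ["SSN"], ["Salary"])
print(fd1, fd2, fd3)
```

Note that fd2 is violated only by tuples that the mappings generated (t8 and t10), which is one way to see why mapping and cleaning interact.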
Previous Semantics?
Data Exchange [Fagin et al., TCS ’05]
• Elegant semantics
• Scalable algorithms

Customers
      SSN  Name       Phone     PhConf  Str.  City  CC#
t4    222  L. Lennon  122-1876  0.9     Null  NY    781658
t5    222  L. Lennon  000-0000  0.0     Fry   SF    784659
t8    124  W. Smith   324-3455  0.7     Pico  LA    Null
t10   123  W. Smith   324-0000  0.5     Pico  LA    Null

fd1. Cust: SSN → Name, Phone, Str, City, CC#
Slide annotations on the table: Soft Violation, Hard Violation
Data Repairing
Hard Violation
• Many approaches and techniques [Bohannon SIGMOD ’05] [Cong VLDB ’07] [Kolahi ICDT ’09] [Fan VLDB ’10] [Beskales VLDB ’10]
• No support for mapping
• No way to handle our example
• Main-memory implementation only!
TGD, FD, ID, CFD, ER: INTERACTION!
Pipeline
• Negative result: there exist scenarios for which the pipeline does not return a solution
• Even when it works, the quality of its result is usually poor
[Figure: Source 1 and Source 2 → Data Exchange (✔ Mappings, ✗ Cleaning Rules) → PreSolution for ST-TGDs → Data Repairing (✔ Cleaning Rules, ✗ Mappings) → Cleaned Target (✔ Mappings, ✔ Cleaning Rules)]
Pipeline: PreSolution for TGDs
[Figure: Sources #1 and #2 and the target after data exchange. Customers contains t4, t5, t8 (SSN 124), and t10 (SSN 123); Treatments contains t6, t7, t9, and t11.]
Pipeline: repairing the pre-solution
fd2. Cust: Name, Str, City → SSN
[Figure: same sources and target as the pre-solution. Tuples t8 and t10 agree on Name, Str, and City but have SSNs 124 and 123; the slide highlights the value 123.]
Pipeline: the target after the next step
Customers now also contains a new tuple, with the same values that the mapping generates from source tuple t1:
      SSN  Name      Phone     PhConf  Str.  City  CC#
t12   124  W. Smith  324-3455  0.7     Pico  LA    Null
Contributions
A Uniform Framework for
• Type 1: Schema Mapping Scenarios
• Type 2: Data Repairing Scenarios
• Type 3: Mapping and Cleaning Scenarios
With a fast and general-purpose chase engine
Dependency labels on the slide: ST-TGD, TGD, FD, ID, CFD, ER, MD
Llunatic Data Repairing [Geerts et al., VLDB ‘13]
• An extension of the data-repairing framework
• Let’s see a quick summary…
  1. Partial Order
  2. Cell Groups
  3. LLUNs
  4. Upgrades
Llunatic Data Repairing [Geerts et al., VLDB ‘13]
• The Partial Order Π
  • Elegant way to model preference rules
[Figure: the target tables and Master Data, annotated with: Standard preference rules, Ordering attribute, No order, PREFERRED VALUE]
Llunatic Data Repairing [Geerts et al., VLDB ‘13]
• LLUNs
  • a new class of symbols
  • placeholders used to mark conflicts
[Figure: the target tables and Master Data, with the LLUN L0 shown in place of a conflicting value]
Llunatic Data Repairing [Geerts et al., VLDB ‘13]
• Cell Groups
  • Represent the set of changes
g1 = <122 → {t4.phn, t5.phn}>
g2 = <Sky → {t4.str, t5.str} by {tm.str^auth}>
[Figure: the target tables and Master Data, with the values 122-1876 (Phone) and Sky (Str) highlighted]
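A cell group pairs a value with the target cells that receive it and, optionally, the authoritative cells that justify it. The sketch below shows one possible Python encoding; the class and function names are hypothetical, and writing g1's value out as the full phone number 122-1876 (the slide abbreviates it as 122) is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CellGroup:
    value: str
    occurrences: frozenset                    # target cells (tuple id, attribute) set to `value`
    justifications: frozenset = frozenset()   # authoritative cells backing the value

def apply_cell_group(instance, group):
    """Return a copy of the instance with every occurrence cell set to the group value."""
    updated = {tid: dict(row) for tid, row in instance.items()}
    for tid, attr in group.occurrences:
        updated[tid][attr] = group.value
    return updated

# Phone and street cells of t4 and t5 (None stands for Null).
customers = {
    "t4": {"phn": "122-1876", "str": None},
    "t5": {"phn": "000-0000", "str": "Fry"},
}
g1 = CellGroup("122-1876", frozenset({("t4", "phn"), ("t5", "phn")}))
g2 = CellGroup("Sky", frozenset({("t4", "str"), ("t5", "str")}), frozenset({("tm", "str")}))

repaired = apply_cell_group(apply_cell_group(customers, g1), g2)
print(repaired)
```

Keeping the justification cells alongside the value records where a repair came from, which is what lets authoritative sources be told apart from arbitrary changes.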
Upgrades

e1. Cust(ssn, n, ph, c, cc), Cust(ssn, n’, ph’, c’, cc’) → cc = cc’

J
      SSN  Name       CC#
t2    222  L. Lennon  111
t3    222  L. Lennon  555

Update 1: g1 <L0 → {t4.cc, t5.cc}>
      SSN  Name       CC#
t2    222  L. Lennon  L0
t3    222  L. Lennon  L0

Update 2: g2 <L1 → {t4.ssn}>
      SSN  Name       CC#
t2    L1   L. Lennon  111
t3    222  L. Lennon  555

Update 3: g3 <555 → {t4.cc, t5.cc}>
      SSN  Name       CC#
t2    222  L. Lennon  555
t3    222  L. Lennon  555

Update 4: g4 <777 → {t4.ssn}>
      SSN  Name       CC#
t2    777  L. Lennon  111
t3    222  L. Lennon  555

Update 5: g5 <333 → {t4.cc, t5.cc}>
      SSN  Name       CC#
t2    222  L. Lennon  333
t3    222  L. Lennon  333

Upgrade: an improvement over J, since it contains better values w.r.t. Π
Slide annotations: Forward, Backward, Cardinality Minimal, Upgrades, Not an upgrade
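The first two updates can be sketched in Python as two ways of repairing a violation of e1: a forward repair equates the conclusion cells (CC#), while a backward repair changes a premise cell (SSN) so that the dependency no longer applies. This is a hedged illustration, not the Llunatic implementation; the function names are hypothetical, and the tuple ids follow the tables (t2, t3) rather than the cell-group labels on the slide.

```python
from itertools import count

_llun_ids = count()

def fresh_llun():
    """Return a fresh LLUN symbol: L0, L1, ..."""
    return f"L{next(_llun_ids)}"

def forward_repair(instance, tids, attr):
    """Equate the conclusion cells by assigning one fresh LLUN to all of them."""
    llun = fresh_llun()
    repaired = {tid: dict(row) for tid, row in instance.items()}
    for tid in tids:
        repaired[tid][attr] = llun
    return repaired

def backward_repair(instance, tid, attr):
    """Break the premise by replacing one premise cell with a fresh LLUN."""
    repaired = {t: dict(row) for t, row in instance.items()}
    repaired[tid][attr] = fresh_llun()
    return repaired

def violates_e1(instance):
    """e1: tuples with equal SSN must have equal CC#."""
    seen = {}
    for row in instance.values():
        if seen.setdefault(row["SSN"], row["CC#"]) != row["CC#"]:
            return True
    return False

J = {
    "t2": {"SSN": "222", "Name": "L. Lennon", "CC#": "111"},
    "t3": {"SSN": "222", "Name": "L. Lennon", "CC#": "555"},
}
update1 = forward_repair(J, ["t2", "t3"], "CC#")  # both CC# cells share one LLUN
update2 = backward_repair(J, "t2", "SSN")         # t2.SSN becomes a LLUN
print(update1)
print(update2)
```

Both results satisfy e1 without committing to a constant that the data does not support, which is the role LLUNs play as conflict placeholders.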
Upgrades (continued)

Update 6: g6 <L2 → {all cells}>
      SSN  Name  CC#
t2    L2   L2    L2
t3    L2   L2    L2

Slide annotations: Forward, Backward, Minimal Solutions, over-generalization
[Figure: data repairing + ST-TGDs, T-TGDs, and User Inputs = ?]
Non-trivial extension!
Mapping and Cleaning Scenario
M&C Scenario M = {S, Sa, T, Σt, Σe, Π, USER}
• S: source schema; Sa: authoritative source tables; T: target schema; Σt: TGDs; Σe: EGDs
• Π: the partial order specification
• USER: a partial function to abstract user interaction
• Solution: given M, an instance I of S, and an instance J of T, a solution is an instance J’ such that:
  • it is a repair, i.e., “I and J’ satisfy Σt ∪ Σe”
  • and “J’ is an upgrade of J according to Π”
How to handle TGDs
• We model it in terms of cell groups and updates

m1: MedTreat(ssn, n, p, s, c, i, t, d) → ∃Y3, Y4 : Cust(ssn, n, p, 0.7, s, c, Y3), Treat(ssn, Y4, i, t, d)

New target tuples generated from source tuple t1 (SOURCE #1, CONFIDENCE 0.7):
Customers:   t8  124  W. Smith  324-3455  0.7  Pico  LA  Null
Treatments:  t9  124  Null  Med  Lapar.  03/11/2013

g1 = <124 → {t8.ssn^new, t9.ssn^new} by {t1.ssn}>
g2 = <W. Smith → {t8.name^new} by {t1.name}> ...
(the superscript “new” marks new cells)

KEY INTUITION: we do not disrupt key-fkey equality in the following
User Inputs
• In the presence of inconsistencies, user inputs are crucial. The user may
  • change the value of a cell group
  • refuse a cell group
• We model user interaction using a partial function over cell groups

g1 <123 → {t4.ph, t5.ph}>
      SSN  Name       Phone
t2    222  L. Lennon  123
t3    222  L. Lennon  123

g2 <L1 → {t4.ssn}>
      SSN  Name       Phone
t2    L1   L. Lennon  123
t3    222  L. Lennon  000

Value shown on the slide: 555
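One way to picture a partial function over cell groups is a callback that returns nothing where it is undefined, a new value where the user overrides the group, and a refusal marker where the user rejects it. The Python sketch below is only an illustration: `REFUSE`, `apply_user`, and `example_user` are hypothetical names, and the policy (replace the phone value with 555, refuse the SSN change) is an assumption about how the slide's values might be used.

```python
REFUSE = object()  # sentinel: the user rejects the cell group

def apply_user(cell_group, user):
    """Apply a partial user function to a cell group.

    None from `user` means the function is undefined here (keep the group),
    REFUSE discards the group, and any other result replaces its value.
    """
    answer = user(cell_group)
    if answer is None:
        return cell_group
    if answer is REFUSE:
        return None
    return {**cell_group, "value": answer}

def example_user(cell_group):
    """Hypothetical policy, defined on two cell groups only."""
    if cell_group["cells"] == {("t4", "ph"), ("t5", "ph")}:
        return "555"   # change the value of the phone cell group
    if cell_group["cells"] == {("t4", "ssn")}:
        return REFUSE  # refuse changes to the SSN
    return None        # undefined elsewhere

g1 = {"value": "123", "cells": frozenset({("t4", "ph"), ("t5", "ph")})}
g2 = {"value": "L1", "cells": frozenset({("t4", "ssn")})}
g3 = {"value": "Sky", "cells": frozenset({("t4", "str")})}

for g in (g1, g2, g3):
    print(apply_user(g, example_user))
```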
Non-trivial extension
• Data cleaning semantics has some nice properties
  • scenario C always has a solution for <I, J>
  • the chase always terminates (it never fails)
• Adding TGDs and User Inputs
  • the concept of upgrade changes significantly
  • requires completely reworking upgrades
Upgrades
• Must take into account many issues
  • some target cells are “better” than others
  • source cells may be authoritative
  • compare instances with different new values
  • compare instances with different numbers of tuples
  • some cells may be changed by users
A Few Results
• Conservative extension of data exchange
  • Every (core) solution of a data exchange scenario corresponds to a (minimal) solution of its associated mapping scenario, and vice versa
• Given an MC scenario, if Σt is a set of weakly-acyclic tgds, then the chase terminates
  • in essence we may re-use termination conditions for data exchange
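Weak acyclicity is the standard termination condition from data exchange: build a graph over relation positions, add a regular edge when a variable is copied from a body position to a head position, add a special edge from that body position to every position holding an existential variable, and require that no cycle goes through a special edge. The Python sketch below checks this under simplifying assumptions: the tgd encoding (lists of `(relation, [variables])` atoms) is hypothetical, and every argument is assumed to be a variable.

```python
def weakly_acyclic(tgds):
    """tgds: list of (body, head); each is a list of atoms (relation, [variables]).

    Variables that appear only in the head are treated as existential.
    """
    edges = set()  # (from_position, to_position, is_special)
    for body, head in tgds:
        body_pos, head_pos = {}, {}
        for rel, args in body:
            for i, var in enumerate(args):
                body_pos.setdefault(var, set()).add((rel, i))
        for rel, args in head:
            for i, var in enumerate(args):
                head_pos.setdefault(var, set()).add((rel, i))
        existential = [v for v in head_pos if v not in body_pos]
        for var, sources in body_pos.items():
            if var not in head_pos:
                continue  # only variables exported to the head create edges
            for p in sources:
                for q in head_pos[var]:
                    edges.add((p, q, False))      # regular edge
                for y in existential:
                    for q in head_pos[y]:
                        edges.add((p, q, True))   # special edge

    successors = {}
    for p, q, _ in edges:
        successors.setdefault(p, set()).add(q)

    def reaches(start, goal):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return False

    # A special edge p -> q lies on a cycle exactly when q can reach p.
    return not any(special and reaches(q, p) for p, q, special in edges)

# id4 (Treat[SSN] ⊆ Customers[SSN]) written as a target tgd.
id4 = ([("Treat", ["ssn", "sal", "ins", "tr", "d"])],
       [("Cust", ["ssn", "n", "p", "pc", "s", "c", "cc"])])
# Classic non-terminating example: R(x, y) → ∃z R(y, z).
cyclic = ([("R", ["x", "y"])], [("R", ["y", "z"])])

print(weakly_acyclic([id4]), weakly_acyclic([cyclic]))
```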
Chase Tree
Chase algorithm for chasing egds and tgds
[Figure: chase tree rooted at J. Nodes R1 (e0, b1), R2 (e0, b2), R3 (e0, f), R4 (e1, b1), R5 (e1, b2), R6 (e1, f), R10 (e1, b1), R11 (e1, b2), R12 (e1, f), R13 (e0, b1), R14 (e0, b2), R15 (e0, f); branches labeled “the e0-e1 sequence” and “the e1-e0 sequence”]
Different orders of application give different results
Scalability Techniques
• Chase implementation based on equivalence classes
• Delta Databases
  • a representation system for chase trees
• Cost managers
  • pluggable strategies to prune the chase tree
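The slide does not say how the equivalence classes are implemented. One common choice, assumed here purely for illustration, is a union-find structure over cells: every time an egd equates two cells, their classes are merged, and each resulting class can then be repaired as a unit. The class name `CellClasses` and the `cc` cells below are hypothetical.

```python
class CellClasses:
    """Union-find over cells; each class collects cells forced to be equal."""

    def __init__(self):
        self.parent = {}

    def find(self, cell):
        """Return the representative of the class containing `cell`."""
        self.parent.setdefault(cell, cell)
        root = cell
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[cell] != root:  # path compression
            self.parent[cell], cell = root, self.parent[cell]
        return root

    def union(self, a, b):
        """Merge the classes of cells a and b."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def classes(self):
        """Return the current classes as a list of sets of cells."""
        groups = {}
        for cell in list(self.parent):
            groups.setdefault(self.find(cell), set()).add(cell)
        return list(groups.values())

cells = CellClasses()
cells.union(("t4", "cc"), ("t5", "cc"))   # an egd equates t4.cc and t5.cc
cells.union(("t5", "cc"), ("t8", "cc"))   # a later step adds t8.cc to the class
cells.find(("t10", "cc"))                 # t10.cc stays in its own class
print(cells.classes())
```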
Scalability
[Figure: DOCTORS-MC; x-axis ticks 100000 K, 400000 K, 700000 K, 1000000 K; y-axis in sec., ticks 0, 2500, 5000; series LLUNATIC-FR-S1, LLUNATIC-FR-S5, LLUNATIC-FR-S10, LLUNATIC-FR-S50]
Quality of Repairs
[Figure: HOSPITAL-MC NORM, max. rep-rate(Rep, DBexp); series LLUNATIC-FR-S1 and PIPELINE; three panels labeled “5k, 6%-10%”, “10k, 6%-10%”, and “25k, 6%-10%”; x-axis ticks 1% to 5% in each panel; y-axis ticks -30%, 0%, 30%, 60%]
That’s all Folks!