Top Banner
Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai Doan
17

Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

Dec 30, 2015

Download

Documents

Justin Glenn
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

Learning Source Mappings

Zachary G. IvesUniversity of Pennsylvania

CIS 650 – Database & Information Systems

October 27, 2008

LSD Slides courtesy AnHai Doan

Page 2: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

2

Administrivia

Midterm due Thursday 5-10 pages (single-spaced, 10-12 pt)

Page 3: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

3

Semantic Mappings between Schemas Mediated & source schemas = XML DTDs

house

location contact

house

address

name phone

num-baths

full-baths half-baths

contact-info

agent-name agent-phone

1-1 mapping non 1-1 mapping

Page 4: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

4

Suppose user wants to integrate 100 data sources

1. User manually creates mappings for a few sources, say 3 shows LSD these mappings

2. LSD learns from the mappings “Multi-strategy” learning incorporates many types of

info in a general way Knowledge of constraints further helps

3. LSD proposes mappings for remaining 97 sources

The LSD (Learning Source Descriptions) Approach

Page 5: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

5

listed-price $250,000 $110,000 ...

address price agent-phone description

Example

location Miami, FL Boston, MA ...

phone(305) 729 0831(617) 253 1429 ...

commentsFantastic houseGreat location ...

realestate.com

location listed-price phone comments

Schema of realestate.com

If “fantastic” & “great”

occur frequently in data values =>

description

Learned hypotheses

price $550,000 $320,000 ...

contact-phone(278) 345 7215(617) 335 2315 ...

extra-infoBeautiful yardGreat beach ...

homes.com

If “phone” occurs in the name =>

agent-phone

Mediated schema

Page 6: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

6

LSD’s Multi-Strategy Learning

Use a set of base learners each exploits well certain types of information

Match schema elements of a new source apply the base learners combine their predictions using a meta-learner

Meta-learner uses training sources to measure base learner

accuracy weighs each learner based on its accuracy

Page 7: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

7

Base Learners Input

schema information: name, proximity, structure, ...

data information: value, format, ... Output

prediction weighted by confidence score Examples

Name learner agent-name => (name,0.7), (phone,0.3)

Naive Bayes learner “Kent, WA” => (address,0.8), (name,0.2) “Great location” => (description,0.9), (address,0.1)

Page 8: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

8

<location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>

<location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>

Training the Learners

Naive Bayes Learner

(location, address)(listed-price, price)(phone, agent-phone)(comments, description) ...

(“Miami, FL”, address)(“$ 250,000”, price)(“(305) 729 0831”, agent-phone)(“Fantastic house”, description) ...

realestate.com

Name Learner

address price agent-phone description

Schema of realestate.com

Mediated schema

location listed-price phone comments

Page 9: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

9

<extra-info>Beautiful yard</><extra-info>Great beach</><extra-info>Close to Seattle</>

<day-phone>(278) 345 7215</><day-phone>(617) 335 2315</><day-phone>(512) 427 1115</>

<area>Seattle, WA</><area>Kent, WA</><area>Austin, TX</>

Applying the Learners

Name LearnerNaive Bayes

Meta-Learner

(address,0.8), (description,0.2)(address,0.6), (description,0.4)(address,0.7), (description,0.3)

(address,0.6), (description,0.4)

Meta-LearnerName LearnerNaive Bayes

(address,0.7), (description,0.3)

(agent-phone,0.9), (description,0.1)

address price agent-phone description

Schema of homes.com Mediated schema

area day-phone extra-info

Page 10: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

10

Domain Constraints Impose semantic regularities on sources

verified using schema or data

Examples a = address & b = address a = b a = house-id a is a key a = agent-info & b = agent-name b is nested

in a

Can be specified up front when creating mediated schema independent of any actual source schema

Page 11: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

11

area: address contact-phone: agent-phoneextra-info: description

area: address contact-phone: agent-phoneextra-info: address

area: (address,0.7), (description,0.3)contact-phone: (agent-phone,0.9), (description,0.1)extra-info: (address,0.6), (description,0.4)

The Constraint Handler

Can specify arbitrary constraints User feedback = domain constraint

ad-id = house-id Extended to handle domain heuristics

a = agent-phone & b = agent-name a & b are usually close to each other

0.30.10.40.012

0.70.90.60.378

0.70.90.40.252

Domain Constraintsa = address & b = adderss a = b

Predictions from Meta-Learner

Page 12: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

12

Putting It All Together: LSD System

L1 L2 Lk

Mediated schema

Source schemas

Data listings

Training datafor base learners Constraint Handler

Mapping Combination

User Feedback

Domain Constraints

Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner Meta-learner

uses stacking [Ting&Witten99, Wolpert92] returns linear weighted combination of base learners’ predictions

Matching PhaseTraining Phase

Page 13: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

13

Empirical Evaluation

Four domains Real Estate I & II, Course Offerings, Faculty

Listings

For each domain create mediated DTD & domain constraints choose five sources extract & convert data listings into XML mediated DTDs: 14 - 66 elements, source DTDs:

13 – 48 Ten runs for each experiment - in each run:

manually provide 1-1 mappings for 3 sources ask LSD to propose mappings for remaining 2 sources accuracy = % of 1-1 mappings correctly identified

Page 14: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

14

LSD Matching Accuracy

0

10

20

30

40

50

60

70

80

90

100

Real Estate I Real Estate II CourseOfferings

FacultyListings

LSD’s accuracy: 71 - 92% Best single base learner: 42 - 72%+ Meta-learner: + 5 - 22%+ Constraint handler: + 7 - 13%+ XML learner: + 0.8 - 6%

Ave

rage

Mat

chin

g A

cccu

racy

(%

)

Page 15: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

15

LSD Summary

Applies machine learning to schema matching use of multi-strategy learning Domain & user-specified constraints

Probably the most flexible means of doing schema matching today in a semi-automated way

Complementary project: CLIO (IBM Almaden) uses key and foreign-key constraints to help the user build mappings

Page 16: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

Since LSD…

A lot more work on the following: Alternative schemes for putting together info from

base learners Hierarchical learners

Compare two trees: parent nodes are likely to be the same if child nodes are similar; child nodes are likely to be the same if parent nodes are similar

Using mass collaboration – humans do the work

And a lot of work on entity resolution or record matching Uses similar ideas to try to determine when two

records are referring to the same entity

16

Page 17: Learning Source Mappings Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems October 27, 2008 LSD Slides courtesy AnHai.

17

Jumping Up a Level

We’ve now seen how heterogeneous data makes a huge difference … In the need for relating different kinds of

attributes Mapping languages Mapping tools Query reformulation

… and in query processing Adaptive query processing

Next time we’ll go even further, and start to consider search – focusing on Google