Data Integration: A “Killer App” for Multi-Strategy Learning
Pedro Domingos
Joint work with AnHai Doan & Alon Levy
Department of Computer Science & Engineering, University of Washington

Dec 24, 2015

Pedro Domingos

Joint work with AnHai Doan & Alon Levy

Department of Computer Science & Engineering

University of Washington

Data Integration: A “Killer App” for Multi-Strategy Learning

Overview

Data integration & XML
Schema matching
Multi-strategy learning
Prototype system & experiments
Related work
Future work
Summary

Data Integration

Query: Find houses with four bathrooms and price under $500,000

[Figure: the query is posed against a mediated schema. Three sources (superhomes.com, realestate.com, homeseekers.com) each have their own source schema and are each accessed through a wrapper.]

Why Data Integration Matters

Very active area in database & AI
– research / workshops
– start-ups

Large organizations
– multiple databases with differing schemas

Data warehousing
The Web: HTML sources
The Web: XML sources

XML

Extensible Markup Language
– introduced in 1996

The standard for data publishing & exchange
– replaces HTML & proprietary formats
– embraced by database/web/e-commerce communities

XML versus HTML
– both use tags to mark up data elements
– HTML tags specify format
– XML tags define meaning
– relationships among elements provided via nesting

Example

HTML

<h1> Residential Listings </h1>
<ul> House For Sale
  <li> location: Seattle, WA, USA
  <li> agent-phone: (206) 729 0831
  <li> listed-price: $250,000
  <li> comments: Fantastic house ...
</ul>
<hr>
<ul> House For Sale ... </ul>
...

XML

<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>
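
To make the point about tags carrying meaning concrete, here is a minimal Python sketch that reads the XML listing with the standard-library `xml.etree.ElementTree` module. The elided parts of the slide ("...") are simply left out, and the variable names are illustrative, not part of the original talk.

```python
import xml.etree.ElementTree as ET

# The XML listing from the slide; the elided second house is omitted.
LISTING = """
<residential-listings>
  <house>
    <location>
      <city>Seattle</city>
      <state>WA</state>
      <country>USA</country>
    </location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house ...</comments>
  </house>
</residential-listings>
"""

root = ET.fromstring(LISTING)
house = root.find("house")

# Nesting tells us that city belongs to location, which belongs to house.
city = house.find("location/city").text
price = house.findtext("listed-price")

# Leaf elements are the ones without child elements.
leaf_tags = [el.tag for el in house.iter() if len(el) == 0]

print(city, price)
print(leaf_tags)
```

The HTML version of the same listing offers no such handle: the price is only recoverable by parsing the text "listed-price: $250,000" inside a list item.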

XML DTD

Document Type Definition
– BNF grammar
– constraints on element structure: type, order, # of times

A DTD can be visualized as a tree

A real-estate DTD

<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>
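
The tree view of the real-estate DTD can be sketched in a few lines of Python. This is a deliberately simplified reader, written for illustration: it assumes every content model is a flat, comma-separated sequence (as in the three declarations on the slide) and does not handle choices (`|`), nested groups, or `#PCDATA`. The function names `parse_dtd` and `show` are hypothetical.

```python
import re

DTD = """
<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>
"""

# Matches declarations of the form <!ELEMENT name (child, child?, ...)>
ELEMENT_RE = re.compile(r"<!ELEMENT\s+([\w-]+)\s+\(([^)]*)\)>")


def parse_dtd(dtd_text):
    """Map each declared element to a list of (child, occurrence) pairs."""
    tree = {}
    for name, content in ELEMENT_RE.findall(dtd_text):
        children = []
        for part in content.split(","):
            part = part.strip()
            # "?" = optional, "*" = zero or more, "+" = one or more.
            occurrence = part[-1] if part[-1] in "?*+" else ""
            children.append((part.rstrip("?*+"), occurrence))
        tree[name] = children
    return tree


def show(tree, name, depth=0, occurrence=""):
    """Return the indented lines of the tree rooted at `name`."""
    lines = ["  " * depth + name + occurrence]
    for child, occ in tree.get(name, []):
        lines.extend(show(tree, child, depth + 1, occ))
    return lines


dtd_tree = parse_dtd(DTD)
print("\n".join(show(dtd_tree, "residential-listings")))
```

Elements that are never declared (city, state, agent-phone, ...) show up as leaves, which are the elements the matching task focuses on.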

Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

[Figure: two schema trees, each rooted at house, with mappings drawn between their elements. Element names appearing in the figure: location, contact-info, agent-name, agent-phone, num-baths, amenities; address, contact, name, phone, full-baths, half-baths, handicap-equipped.]

Map of the Problem

[Figure: taxonomy of source descriptions: schema matching, data translation, scope, completeness, reliability, query capability. Schema matching is further divided into leaf elements vs. higher-level elements, and 1-1 mappings vs. complex mappings.]

Current State of Affairs

Largely done by hand
– labor intensive & error prone
– key bottleneck in building applications

Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data

Need automatic approaches to scale up!

Our Approach

Use machine learning to match schemas

Basic idea

1. create training data
– manually map a set of sources to mediated schema

2. train system on training data
– learns from
  – name of schema elements
  – format of values
  – frequency of words & symbols
  – characteristics of value distribution
  – proximity, position, structure, ...

3. system proposes mappings for subsequent sources

Example

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description

Values by source element
– location: Seattle, WA; Seattle, WA; Dallas, TX; ...
– listed-price: $250,000; $162,000; $180,000; ...
– agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
– comments: Fantastic house ...; Great ...; Hurry! ...; ...

Multi-Strategy Learning

Use a set of base learners
– each exploits certain types of information

Match schema elements of a new source
– apply the learners
– combine their predictions using a meta-learner

Meta-learner
– measures base learner accuracy on training data
– weighs each learner based on its accuracy

Learners

Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...

Output
– prediction weighted by confidence score

Example learners
– name matcher
  – agent-name => (name,0.7), (phone,0.3)
– Naive Bayes
  – “Seattle, WA” => (address,0.8), (name,0.2)
  – “Great location ...” => (description,0.9), (address,0.1)

Training the Learners

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $ 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description

source elements: location, listed-price, agent-phone, comments

Training examples for the Name Matcher
– (location, address)
– (agent-phone, phone)
– (listed-price, price)
– (comments, description)
– ...

Training examples for Naive Bayes
– (“Seattle, WA”, address)
– (“(206) 729 0831”, phone)
– (“$ 250,000”, price)
– (“Fantastic house ...”, description)
– ...
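
As a rough sketch of how these two training sets could be derived from one manually mapped source, the Python below pairs each source tag with its mediated-schema element (for the name matcher) and each data value with the mediated-schema element of its tag (for Naive Bayes). The `<listings>` wrapper element and the function name `build_training_sets` are assumptions made for this example; the slides do not describe how LSD does this internally.

```python
import xml.etree.ElementTree as ET

# One listing from the slide. The <listings> root is an assumption added so
# that the fragment parses as a single XML document.
SOURCE = """
<listings>
  <house>
    <location> Seattle, WA </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $ 250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
</listings>
"""

# Manual mapping: source element -> mediated-schema element (from the slide).
MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}


def build_training_sets(xml_text, mapping):
    """Return (name examples, content examples) for the base learners."""
    # Name matcher: learns from (source tag, mediated element) pairs.
    name_examples = list(mapping.items())
    # Content learners: learn from (data value, mediated element) pairs.
    content_examples = []
    for house in ET.fromstring(xml_text).iter("house"):
        for element in house:
            if element.tag in mapping and element.text:
                label = mapping[element.tag]
                content_examples.append((element.text.strip(), label))
    return name_examples, content_examples


name_examples, content_examples = build_training_sets(SOURCE, MAPPING)
print(name_examples)
print(content_examples)
```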

Applying the Learned Models

homes.com

mediated schema: address, phone, price, description

[Figure: each value of the source element area (Seattle, WA; Kent, WA; Austin, TX; Seattle, WA) is passed to the Name Matcher and Naive Bayes. The meta-learner combines their predictions for each value (address, address, description, address), and the combiner merges these into a single prediction for area: address.]
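
The slide does not say how the combiner merges the per-value predictions into one label for the whole element. One plausible reading, shown here purely as an assumption, is a majority vote over the meta-learner's outputs; `combine_instances` is a hypothetical name.

```python
from collections import Counter


def combine_instances(instance_labels):
    """Majority vote over per-value predictions for one source element.

    Returns the winning label and the share of values that voted for it.
    """
    votes = Counter(instance_labels)
    label, count = votes.most_common(1)[0]
    return label, count / len(instance_labels)


# Meta-learner outputs for the four values of "area" shown in the figure.
label, share = combine_instances(
    ["address", "address", "description", "address"]
)
print(label, share)
```

Averaging the confidence scores across values instead of counting votes would be an equally reasonable design; the vote is used here only because it is the simplest to read.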

The LSD System

Base learners/modules
– name matcher
– Naive Bayes
– WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98]
– county-name recognizer

Meta-learner
– stacking [Ting&Witten99, Wolpert92]

Name Matcher

Matches based on names
– including all names on path from root to current node
– allowing synonyms

Good for ...
– specific, descriptive names: agent-phone, listed-price

Bad for ...
– vacuous names: item, listings
– partially specified, ambiguous names: office (for “office phone”)
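
A toy name matcher in Python illustrates both the strengths and the weaknesses listed above. It is an illustrative stand-in, not LSD's matcher: it scores token overlap (Jaccard similarity) between the words on the element's path and each mediated-schema name, after replacing synonyms. The synonym table and the names `tokens` and `name_match` are hypothetical.

```python
import re

MEDIATED = ["address", "phone", "price", "description"]

# Hypothetical synonym table; a real one would be domain specific.
SYNONYMS = {
    "location": "address",
    "telephone": "phone",
    "cost": "price",
    "comments": "description",
}


def tokens(path):
    """Split a path such as 'house/agent-phone' into canonical word tokens."""
    words = re.split(r"[^a-z]+", path.lower())
    return {SYNONYMS.get(w, w) for w in words if w}


def name_match(path, labels=MEDIATED):
    """Return (label, confidence) pairs, highest confidence first."""
    source = tokens(path)
    scores = {}
    for label in labels:
        target = tokens(label)
        scores[label] = len(source & target) / len(source | target)
    total = sum(scores.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return [(label, 1 / len(labels)) for label in labels]
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [(label, score / total) for label, score in ranked]


print(name_match("house/agent-phone"))  # descriptive name
print(name_match("item"))               # vacuous name
```

A descriptive name such as agent-phone yields a confident prediction, while a vacuous name such as item shares no token with any mediated-schema element and gets a uniform, uninformative distribution.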

Naive Bayes Learner

Exploits frequencies of words & symbols

Good for ...
– elements with words/symbols that are strongly indicative
– examples:
  – “fantastic” & “great” in house descriptions
  – $ in prices, parentheses in phone numbers

Bad for ...
– short, numeric elements: num-baths, num-bedrooms
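
The idea can be sketched as a small multinomial Naive Bayes classifier with add-one smoothing, written from scratch in Python. This is a generic textbook formulation offered as an illustration, not the implementation used in LSD; the tokenizer, the class name, and the choice of smoothing are assumptions. The training values are the ones shown on the earlier Example slide.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Words, digit runs, and individual symbols ("$", "(", ")") are tokens.
    return re.findall(r"[a-z]+|\d+|[^\w\s]", text.lower())


class NaiveBayes:
    def fit(self, examples):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        for text, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        """Return (label, confidence) pairs, highest confidence first."""
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / n)  # class prior
            for tok in tokenize(text):
                score += math.log((counts[tok] + 1) / denom)  # add-one
            log_scores[label] = score
        # Normalise log scores into confidences that sum to one.
        top = max(log_scores.values())
        exp = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        z = sum(exp.values())
        return sorted(((lab, v / z) for lab, v in exp.items()),
                      key=lambda kv: -kv[1])


TRAIN = [
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house ...", "description"), ("Great ...", "description"),
]
model = NaiveBayes().fit(TRAIN)
print(model.predict("$180,000")[0])
print(model.predict("(206) 321 4571")[0])
```

The "$" and parenthesis tokens do most of the work here, which matches the slide's observation. A bare value such as "3" contains no indicative token, so every class scores almost the same.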

WHIRL Nearest-Neighbor Classifier

Similarity-based
– stores all examples seen so far
– classifies a new example based on similarity to training examples
– IR document similarity metric

Good for ...
– long, textual elements: house description, names
– limited, descriptive set of values: color (blue, red, ...)

Bad for ...
– short, numeric elements: num-baths, num-bedrooms
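
The flavor of similarity-based classification can be shown with a plain TF-IDF, cosine-similarity nearest-neighbor classifier in Python. This is a simplified stand-in chosen for illustration and should not be read as WHIRL itself; the class name, the IDF formula, and the training sentences (invented for this example) are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class TfidfNearestNeighbor:
    def fit(self, examples):
        counted = [(Counter(tokenize(t)), label) for t, label in examples]
        n = len(counted)
        # Document frequency: in how many stored examples a token occurs.
        df = Counter(tok for counts, _ in counted for tok in counts)
        self.idf = {tok: math.log(1 + n / d) for tok, d in df.items()}
        self.vectors = [(self._weigh(c), label) for c, label in counted]
        return self

    def _weigh(self, counts):
        """Unit-length TF-IDF vector; tokens never seen in training drop out."""
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def predict(self, text):
        """Return (label of most similar stored example, cosine similarity)."""
        query = self._weigh(Counter(tokenize(text)))
        best_label, best_sim = None, 0.0
        for vec, label in self.vectors:
            sim = sum(w * vec.get(t, 0.0) for t, w in query.items())
            if sim > best_sim:
                best_label, best_sim = label, sim
        return best_label, best_sim


# Made-up training values, for illustration only.
TRAIN = [
    ("blue", "color"),
    ("red", "color"),
    ("Fantastic house with great view", "description"),
    ("Great location close to beach", "description"),
]
knn = TfidfNearestNeighbor().fit(TRAIN)
print(knn.predict("great house near the beach"))
print(knn.predict("red"))
print(knn.predict("3"))
```

The last call returns no label at all: a short numeric value shares no word with any stored example, which is the weakness the slide points out.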

County-Name Recognizer

Stores all county names, obtained from the Web
Verifies if the input name is a county name
Essential to matching a county-name element
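
A recognizer of this kind amounts to a dictionary lookup. The Python sketch below uses a three-entry sample in place of the full list, and the normalization rule and the confidence measure (fraction of values recognized) are assumptions made for the example.

```python
# Tiny illustrative sample; the real module stores all county names.
COUNTY_NAMES = {"king", "pierce", "snohomish"}


def normalize(value):
    """Lower-case the value and drop a trailing ' county' if present."""
    name = value.strip().lower()
    if name.endswith(" county"):
        name = name[: -len(" county")].strip()
    return name


def is_county(value):
    return normalize(value) in COUNTY_NAMES


def county_confidence(values):
    """Fraction of an element's values that are known county names."""
    values = list(values)
    if not values:
        return 0.0
    return sum(is_county(v) for v in values) / len(values)


print(is_county("King County"))
print(county_confidence(["King", "Pierce County", "Seattle"]))
```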

Meta-Learner: Stacking

Training
– uses training data to learn weights
– one for each (base learner, mediated-schema element)

Combining predictions
– for each mediated-schema element
  – computes weighted sum of base-learner confidence scores
– picks mediated-schema element with highest sum
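
The combination step can be written out directly in Python. The weights and confidence scores below are made-up numbers chosen for illustration; in the approach described above the weights would be learned from training data, a step this sketch does not cover. The function name `combine` is hypothetical.

```python
def combine(predictions, weights, labels):
    """Weighted sum of base-learner confidences, one weight per
    (base learner, mediated-schema element) pair."""
    scores = {}
    for label in labels:
        scores[label] = sum(
            weights[(learner, label)] * confidences.get(label, 0.0)
            for learner, confidences in predictions.items()
        )
    best = max(scores, key=scores.get)
    return best, scores


LABELS = ["address", "description"]

# Hypothetical learned weights.
WEIGHTS = {
    ("name_matcher", "address"): 0.3,
    ("name_matcher", "description"): 0.4,
    ("naive_bayes", "address"): 0.7,
    ("naive_bayes", "description"): 0.6,
}

# Hypothetical base-learner outputs for one value of a source element.
PREDICTIONS = {
    "name_matcher": {"address": 0.5, "description": 0.5},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}

best, scores = combine(PREDICTIONS, WEIGHTS, LABELS)
print(best, scores)
```

With these numbers the sums are 0.3·0.5 + 0.7·0.8 = 0.71 for address and 0.4·0.5 + 0.6·0.2 = 0.32 for description, so address is picked.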

Experiments

Sources               Coverage    # of Matchable Leaf Elements   Best Single Learner   LSD
realestate.yahoo      USA         31                             63%                   77%
homeseekers.com       USA         31                             52%                   64%
nkymls.com            Kentucky    28                             64%                   75%
texasproperties.com   Texas       42                             59%                   62%
windermere.com        Northwest   35                             55%                   63%
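
For a quick read of the table, the short Python snippet below computes how many percentage points LSD gains over the best single learner on each source. The figures are copied from the table; the variable names are illustrative.

```python
# (source, best single learner %, LSD %) copied from the table above.
RESULTS = [
    ("realestate.yahoo", 63, 77),
    ("homeseekers.com", 52, 64),
    ("nkymls.com", 64, 75),
    ("texasproperties.com", 59, 62),
    ("windermere.com", 55, 63),
]

gains = {source: lsd - best for source, best, lsd in RESULTS}
mean_gain = sum(gains.values()) / len(gains)

for source, gain in gains.items():
    print(f"{source:22s} +{gain} points")
print(f"mean gain: {mean_gain:.1f} points")
```

The gain ranges from 3 points (texasproperties.com) to 14 points (realestate.yahoo), with a mean of 9.6 points over the five sources.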

Reasons for Incorrect Matchings

Unfamiliarity
– suburb
– solution: add a suburb-name recognizer

Insufficient information
– correctly identified the general type
– failed to pinpoint the exact type
– <agent-name> Richard Smith </agent-name>
  <phone> (206) 234 5412 </phone>
– solution: add a proximity learner

Experiments: Summary

Multi-strategy learning
– better performance than any single learner

Accuracy of 100% unlikely to be reached
– difficult even for humans

Lots of room for improvement
– more learners
– better learning algorithms

Related Work

Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
– utilize only schema information

Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability

Future Work

[Figure: taxonomy of source descriptions: schema matching, data translation, scope, completeness, reliability, query capability. Schema matching is further divided into leaf elements vs. higher-level elements, and 1-1 mappings vs. complex mappings.]

Future Work

Improve matching accuracy
– more learners, more domains

Incorporate domain knowledge
– semantic integrity constraints
– concept hierarchy of mediated-schema elements

Learn with structured data

Learning with Structured Data

Each example with >1 level of structure
Generative model for XML
XML classifier
XML: “killer app” for relational learning

Summary

Schema matching
– automated by learning

Multi-strategy learning is essential
– handles different types of data
– incorporates different types of domain knowledge
– easy to incorporate new learners
– alleviates effects of noise & dirty data

Implemented LSD
– promising results with initial experiments