
Pedro Domingos

Joint work with AnHai Doan & Alon Levy

Department of Computer Science & Engineering

University of Washington

Data Integration: A “Killer App” for Multi-Strategy Learning

2

Overview

Data integration & XML
Schema matching
Multi-strategy learning
Prototype system & experiments
Related work
Future work
Summary

3

Data Integration

[Figure: the user query "Find houses with four bathrooms and price under $500,000" is posed against the mediated schema. Three sources (superhomes.com, realestate.com, homeseekers.com) each have their own source schema and are each accessed through a wrapper.]

4

Why Data Integration Matters

Very active area in database & AI
  – research / workshops
  – start-ups
Large organizations
  – multiple databases with differing schemas
Data warehousing
The Web: HTML sources
The Web: XML sources

5

XML

Extensible Markup Language
  – introduced in 1996
The standard for data publishing & exchange
  – replaces HTML & proprietary formats
  – embraced by database/web/e-commerce communities
XML versus HTML
  – both use tags to mark up data elements
  – HTML tags specify format
  – XML tags define meaning
  – relationships among elements provided via nesting

6

Example

HTML

<h1> Residential Listings </h1>
<ul>House For Sale
  <li> location: Seattle, WA, USA
  <li> agent-phone: (206) 729 0831
  <li> listed-price: $250,000
  <li> comments: Fantastic house ...
</ul>
<hr>
<ul> House For Sale
  ...
</ul>
...

XML

<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>
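Because XML tags carry meaning, a program can pull out each value together with the element path that describes it. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name leaf_values is an illustrative assumption, and the listing is the one from the slide with the elided parts dropped.

```python
import xml.etree.ElementTree as ET

XML_LISTING = """
<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
</residential-listings>
"""


def leaf_values(xml_text):
    """Return (path, text) pairs for every leaf element.

    The path is the chain of tag names from the root down to the leaf,
    which is exactly the nesting information HTML does not expose.
    """
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(node, path):
        here = path + [node.tag]
        children = list(node)
        if not children:
            pairs.append(("/".join(here), (node.text or "").strip()))
        for child in children:
            walk(child, here)

    walk(root, [])
    return pairs


LEAVES = leaf_values(XML_LISTING)
for path, value in LEAVES:
    print(path, "=>", value)
```

Doing the same for the HTML version would require guessing which list item holds the price, since the tags there only describe layout.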

7

XML DTD

Document Type Definition
  – BNF grammar
  – constraints on element structure: type, order, # of times

A real-estate DTD

<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>

A DTD can be visualized as a tree
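The order and repetition constraints in a DTD content model behave like a regular expression over the sequence of child tags. The sketch below hand-translates the three declarations above into Python regular expressions; the dictionary CONTENT_MODELS and the function children_allowed are illustrative assumptions, not a general DTD validator.

```python
import re

# Content models from the real-estate DTD, hand-translated into regular
# expressions over child tag names, each name followed by one space.
# "*" means zero or more, "?" means optional, as in the DTD itself.
CONTENT_MODELS = {
    "residential-listings": r"(house )*",
    "house": r"(location )?agent-phone listed-price (comments )?",
    "location": r"city state (country )?",
}


def children_allowed(parent, child_tags):
    """Check a sequence of child tags against the parent's content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return re.fullmatch(CONTENT_MODELS[parent], sequence) is not None


print(children_allowed("house", ["agent-phone", "listed-price"]))
print(children_allowed("house", ["listed-price", "agent-phone"]))
```

The first call satisfies the declared order with both optional children omitted; the second swaps two required children and is rejected.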

8

Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

[Figure: two DTD trees, each rooted at house, with semantic mappings drawn between their elements. Element names appearing in the figure: address, location, contact-info, contact, agent-name, agent-phone, name, phone, num-baths, full-baths, half-baths, amenities, handicap-equipped.]

9

Map of the Problem

[Figure: map of the problem. Labels in the diagram:
  source descriptions: schema matching, data translation, scope, completeness, reliability, query capability
  leaf elements vs. higher-level elements
  1-1 mappings vs. complex mappings]

10

Current State of Affairs

Largely done by hand
  – labor intensive & error prone
  – key bottleneck in building applications
Will only be exacerbated
  – data sharing & XML become pervasive
  – proliferation of DTDs
  – translation of legacy data
Need automatic approaches to scale up!

11

Our Approach

Use machine learning to match schemas
Basic idea
  1. create training data
     – manually map a set of sources to mediated schema
  2. train system on training data
     – learns from
       – name of schema elements
       – format of values
       – frequency of words & symbols
       – characteristics of value distribution
       – proximity, position, structure, ...
  3. system proposes mappings for subsequent sources

12

Example

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

mediated schema: address, phone, price, description

Values extracted for each source element:
  location:     Seattle, WA | Seattle, WA | Dallas, TX | ...
  listed-price: $250,000 | $162,000 | $180,000 | ...
  agent-phone:  (206) 729 0831 | (206) 321 4571 | (214) 722 4035 | ...
  comments:     Fantastic house ... | Great ... | Hurry! ... | ...

13

Multi-Strategy Learning

Use a set of base learners
  – each exploits certain types of information
Match schema elements of a new source
  – apply the learners
  – combine their predictions using a meta-learner
Meta-learner
  – measures base learner accuracy on training data
  – weighs each learner based on its accuracy

14

Learners

Input
  – schema information: name, proximity, structure, ...
  – data information: value, format, ...
Output
  – prediction weighted by confidence score
Example learners
  – name matcher
    – agent-name => (name,0.7), (phone,0.3)
  – Naive Bayes
    – “Seattle, WA” => (address,0.8), (name,0.2)
    – “Great location ...” => (description,0.9), (address,0.1)

15

Training the Learners

realestate.com

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $ 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

Source elements mapped by hand to the mediated schema: location, listed-price, agent-phone, comments

Training examples for the Name Matcher
  (location, address)
  (agent-phone, phone)
  (listed-price, price)
  (comments, description)
  ...

Training examples for Naive Bayes
  (“Seattle, WA”, address)
  (“(206) 729 0831”, phone)
  (“$ 250,000”, price)
  (“Fantastic house ...”, description)
  ...
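The two training sets on this slide can be derived mechanically once a source has been mapped by hand. The sketch below assumes the manual mapping is available as a dictionary and the extracted data as (tag, value) pairs; the names MAPPING, INSTANCES, name_examples and value_examples are illustrative, not part of LSD.

```python
# Manual mapping for one training source (values taken from the slide).
MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}

# A few (source element, value) instances extracted from the source.
INSTANCES = [
    ("location", "Seattle, WA"),
    ("agent-phone", "(206) 729 0831"),
    ("listed-price", "$ 250,000"),
    ("comments", "Fantastic house ..."),
]


def name_examples(mapping):
    """Training pairs for a name-based learner: (source name, mediated name)."""
    return sorted(mapping.items())


def value_examples(instances, mapping):
    """Training pairs for a content-based learner: (value, mediated name)."""
    return [(value, mapping[tag]) for tag, value in instances if tag in mapping]


print(name_examples(MAPPING))
print(value_examples(INSTANCES, MAPPING))
```

Each base learner sees only the view of the training data it can exploit: names for the name matcher, data values for Naive Bayes.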

16

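Applying the Learned Models

[Figure: applying the learned models to a new source, homes.com. The values of the source element area (Seattle, WA; Kent, WA; Austin, TX; Seattle, WA) are each passed to the Name Matcher and Naive Bayes, and the Meta-learner merges their predictions into one mediated-schema label per value (address, address, description, address). A Combiner then merges these into a single label for area: address.]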
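The last step in the figure turns several per-value predictions into one prediction for the whole source element. The slide does not say how the Combiner does this, so the sketch below assumes a plain majority vote; averaging confidence scores would be an equally plausible reading.

```python
from collections import Counter


def combine_instances(instance_labels):
    """Pick the label predicted for the most instances (majority vote).

    This is an assumed combination rule, used here only for illustration.
    """
    return Counter(instance_labels).most_common(1)[0][0]


# Per-value predictions for the element "area", as shown in the figure.
print(combine_instances(["address", "address", "description", "address"]))
```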

17

The LSD System

Base learners/modules
  – name matcher
  – Naive Bayes
  – WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98]
  – county-name recognizer
Meta-learner
  – stacking [Ting&Witten99, Wolpert92]

18

Name Matcher

Matches based on names
  – including all names on path from root to current node
  – allowing synonyms
Good for ...
  – specific, descriptive names: agent-phone, listed-price
Bad for ...
  – vacuous names: item, listings
  – partially specified, ambiguous names: office (for “office phone”)
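A name matcher of this kind can be approximated by comparing word tokens from the root-to-node path, after mapping synonyms to a common form. The sketch below uses Jaccard overlap as the similarity measure; the synonym table, the scoring rule and the function names are all assumptions made for illustration, not the matcher used in LSD.

```python
import re

# Hypothetical synonym table; a real system would need a much larger one.
SYNONYMS = {"location": "address", "comments": "description", "cost": "price"}


def tokens(name):
    """Split a name or root-to-node path into normalized word tokens."""
    words = re.split(r"[^a-z]+", name.lower())
    return {SYNONYMS.get(w, w) for w in words if w}


def name_match(source_path, mediated_names):
    """Score each mediated-schema element by token overlap with the source path.

    Returns {mediated name: confidence}, with confidences summing to 1.
    When no tokens overlap at all, every element gets the same confidence.
    """
    src = tokens(source_path)
    raw = {}
    for m in mediated_names:
        med = tokens(m)
        raw[m] = len(src & med) / len(src | med)
    total = sum(raw.values())
    if total == 0:
        return {m: 1.0 / len(mediated_names) for m in mediated_names}
    return {m: s / total for m, s in raw.items()}


print(name_match("house/agent-phone", ["name", "phone", "address"]))
print(name_match("item", ["address", "phone"]))
```

The second call shows the weakness noted on the slide: a vacuous name such as item shares no tokens with anything, so the matcher can only return a uniform guess.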

19

Naive Bayes Learner

Exploits frequencies of words & symbols
Good for ...
  – elements with words/symbols that are strongly indicative
  – examples:
    – “fantastic” & “great” in house descriptions
    – $ in prices, parentheses in phone numbers
Bad for ...
  – short, numeric elements: num-baths, num-bedrooms
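To make the role of indicative words and symbols concrete, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing. The tokenizer, the class name NaiveBayes and the tiny training set are assumptions for illustration; tokens never seen in training are simply ignored at prediction time.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Words, digit runs, and individual symbols all count as tokens."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())


class NaiveBayes:
    def fit(self, examples):
        """examples: list of (value text, mediated-schema label)."""
        self.label_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for text, label in examples:
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        """Return {label: posterior probability} for one data value."""
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / n)  # class prior
            for tok in tokenize(text):
                if tok in self.vocab:  # skip tokens unseen in training
                    score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}


TRAIN = [
    ("$250,000", "price"), ("$162,000", "price"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("Fantastic house", "description"), ("Great location", "description"),
]
MODEL = NaiveBayes().fit(TRAIN)
print(MODEL.predict("$180,000"))
```

The value $180,000 never appears in training, but the $ and comma tokens it shares with the price examples are enough to push the posterior toward price.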

20

WHIRL Nearest-Neighbor Classifier

Similarity-based
  – stores all examples seen so far
  – classifies a new example based on similarity to training examples
  – IR document similarity metric
Good for ...
  – long, textual elements: house description, names
  – limited, descriptive set of values: color (blue, red, ...)
Bad for ...
  – short, numeric elements: num-baths, num-bedrooms
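The idea of classifying by IR-style similarity can be sketched with TF-IDF vectors and cosine similarity. This is a simplified stand-in, not the WHIRL system itself: it returns the label of the single nearest stored example, and the function names and the exact IDF formula are assumptions.

```python
import math
from collections import Counter


def unit(vec):
    """Scale a sparse vector (dict) to unit length; leave zero vectors alone."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec


def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors for a list of token lists."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}
    vectors = [unit({t: c * idf[t] for t, c in Counter(doc).items()})
               for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def nearest_label(query, examples):
    """examples: list of (text, label). Return the label of the closest one."""
    docs = [text.lower().split() for text, _ in examples]
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(query.lower().split())
    q = unit({t: c * idf.get(t, 0.0) for t, c in q_counts.items()})
    best = max(range(len(examples)), key=lambda i: cosine(q, vectors[i]))
    return examples[best][1]


EXAMPLES = [
    ("fantastic house with great view", "description"),
    ("blue", "color"),
    ("red", "color"),
    ("great location close to schools", "description"),
]
print(nearest_label("great house", EXAMPLES))
print(nearest_label("blue", EXAMPLES))
```

A value such as 3 for num-baths shares no informative tokens with anything stored, which is why this style of learner does poorly on short numeric elements.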

21

County-Name Recognizer

Stores all county names, obtained from the Web
Verifies if the input name is a county name
Essential to matching a county-name element
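A recognizer like this is essentially a dictionary lookup. The sketch below assumes a small hard-coded sample of county names and reports the fraction of an element's values that it recognizes; the set KNOWN_COUNTIES and the function county_confidence are illustrative names only.

```python
# Tiny illustrative sample; the recognizer on the slide stores all county
# names obtained from the Web.
KNOWN_COUNTIES = {"king", "pierce", "snohomish", "travis"}


def county_confidence(values):
    """Return the fraction of values that are known county names."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if v.strip().lower() in KNOWN_COUNTIES)
    return hits / len(values)


print(county_confidence(["King", "Pierce", "Seattle, WA"]))
```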

22

Meta-Learner: Stacking

Training
  – uses training data to learn weights
  – one for each (base learner, mediated-schema element)
Combining predictions
  – for each mediated-schema element
    – computes weighted sum of base-learner confidence scores
  – picks mediated-schema element with highest sum
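The combination step can be sketched as follows. The weights here are made-up numbers standing in for what stacking would learn from training data, and the learner names and the function combine are assumptions; only the weighted-sum-then-argmax rule comes from the slide.

```python
def combine(predictions, weights):
    """Combine base-learner predictions by a weighted sum per element.

    predictions: {learner: {mediated element: confidence}}
    weights:     {(learner, mediated element): weight}
    Returns (best element, {element: weighted sum}).
    """
    totals = {}
    for learner, scores in predictions.items():
        for label, conf in scores.items():
            w = weights.get((learner, label), 0.0)
            totals[label] = totals.get(label, 0.0) + w * conf
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical predictions for one value, and hypothetical learned weights.
PREDICTIONS = {
    "name_matcher": {"address": 0.4, "name": 0.6},
    "naive_bayes": {"address": 0.8, "name": 0.2},
}
WEIGHTS = {
    ("name_matcher", "address"): 0.3, ("name_matcher", "name"): 0.6,
    ("naive_bayes", "address"): 0.9, ("naive_bayes", "name"): 0.5,
}
BEST, TOTALS = combine(PREDICTIONS, WEIGHTS)
print(BEST, TOTALS)
```

Keeping one weight per (learner, element) pair lets the meta-learner trust Naive Bayes for address while still relying on the name matcher for elements where names are more telling.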

23

Experiments

Source                 Coverage    # of Matchable Leaf Elements    Best Single Learner    LSD
realestate.yahoo       USA         31                              63%                    77%
homeseekers.com        USA         31                              52%                    64%
nkymls.com             Kentucky    28                              64%                    75%
texasproperties.com    Texas       42                              59%                    62%
windermere.com         Northwest   35                              55%                    63%
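The gain from combining learners can be read off the table with a little arithmetic. The snippet below only re-computes figures already in the table; the variable and function names are illustrative.

```python
# (best single learner accuracy %, LSD accuracy %) per source, from the table.
RESULTS = {
    "realestate.yahoo": (63, 77),
    "homeseekers.com": (52, 64),
    "nkymls.com": (64, 75),
    "texasproperties.com": (59, 62),
    "windermere.com": (55, 63),
}


def summarize(results):
    """Return (mean best-single accuracy, mean LSD accuracy, per-source gain)."""
    n = len(results)
    avg_best = sum(best for best, _ in results.values()) / n
    avg_lsd = sum(lsd for _, lsd in results.values()) / n
    gains = {src: lsd - best for src, (best, lsd) in results.items()}
    return avg_best, avg_lsd, gains


AVG_BEST, AVG_LSD, GAINS = summarize(RESULTS)
print(AVG_BEST, AVG_LSD, GAINS)
```

Averaged over the five sources, the best single learner reaches 58.6% and LSD 68.2%, with per-source gains ranging from 3 to 14 percentage points.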

24

Reasons for Incorrect Matchings

Unfamiliarity
  – suburb
  – solution: add a suburb-name recognizer
Insufficient information
  – correctly identified the general type
  – failed to pinpoint the exact type
  – <agent-name>Richard Smith</agent-name>
    <phone> (206) 234 5412 </phone>
  – solution: add a proximity learner

25

Experiments: Summary

Multi-strategy learning
  – better performance than any single learner
Accuracy of 100% unlikely to be reached
  – difficult even for humans
Lots of room for improvement
  – more learners
  – better learning algorithms

26

Related Work

Rule-based approaches
  – TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
  – utilize only schema information
Learner-based approaches
  – SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
  – employ a single learner, limited applicability

27

Future Work

[Figure: the map of the problem, shown again. Labels in the diagram:
  source descriptions: schema matching, data translation, scope, completeness, reliability, query capability
  leaf elements vs. higher-level elements
  1-1 mappings vs. complex mappings]

28

Future Work

Improve matching accuracy
  – more learners, more domains
Incorporate domain knowledge
  – semantic integrity constraints
  – concept hierarchy of mediated-schema elements
Learn with structured data

29

Learning with Structured Data

Each example with >1 level of structure
Generative model for XML
XML classifier
XML: “killer app” for relational learning

30

Summary

Schema matching
  – automated by learning
Multi-strategy learning is essential
  – handles different types of data
  – incorporates different types of domain knowledge
  – easy to incorporate new learners
  – alleviates effects of noise & dirty data
Implemented LSD
  – promising results with initial experiments
