
Pedro Domingos

Joint work with AnHai Doan & Alon Levy
Department of Computer Science & Engineering
University of Washington

Data Integration: A “Killer App” for Multi-Strategy Learning

Overview

– Data integration & XML
– Schema matching
– Multi-strategy learning
– Prototype system & experiments
– Related work
– Future work
– Summary

Data Integration

Find houses with four bathrooms and price under $500,000

[Architecture diagram: the user's query is posed against a mediated schema; wrappers translate between the mediated schema and the source schemas of superhomes.com, realestate.com, and homeseekers.com]

Why Data Integration Matters

Very active area in database & AI
– research / workshops
– start-ups

Large organizations
– multiple databases with differing schemas

Data warehousing
The Web: HTML sources
The Web: XML sources

XML

Extensible Markup Language
– introduced in 1996

The standard for data publishing & exchange
– replaces HTML & proprietary formats
– embraced by database/web/e-commerce communities

XML versus HTML
– both use tags to mark up data elements
– HTML tags specify format
– XML tags define meaning
– relationships among elements provided via nesting

Example

XML:

<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>

HTML:

<h1> Residential Listings </h1>
<ul>House For Sale
  <li> location: Seattle, WA, USA
  <li> agent-phone: (206) 729 0831
  <li> listed-price: $250,000
  <li> comments: Fantastic house ...
</ul>
<hr>
<ul> House For Sale ... </ul>
...

XML DTD

Document Type Descriptor
– BNF grammar
– constraints on element structure: type, order, # of times

A DTD can be visualized as a tree

A real-estate DTD:

<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>
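The nesting that carries the relationships among elements is easy to recover programmatically. As a minimal sketch (not from the slides), Python's standard `xml.etree.ElementTree` can enumerate every leaf element with its root-to-leaf path and value, which is exactly the raw material the learners described later work from:

```python
import xml.etree.ElementTree as ET

# A listing in the real-estate DTD above.
LISTING = """
<residential-listings>
  <house>
    <location>
      <city>Seattle</city>
      <state>WA</state>
      <country>USA</country>
    </location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house ...</comments>
  </house>
</residential-listings>
"""

def leaf_elements(xml_text):
    """Return (root-to-leaf path, text value) pairs for every leaf element."""
    root = ET.fromstring(xml_text)
    leaves = []
    def walk(node, path):
        children = list(node)
        if not children:
            leaves.append((path, (node.text or "").strip()))
        for child in children:
            walk(child, path + "/" + child.tag)
    walk(root, root.tag)
    return leaves
```

For the listing above, `leaf_elements` yields pairs such as `("residential-listings/house/location/city", "Seattle")`.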

Semantic Mappings between Schemas

Mediated & source schemas = XML DTDs

[Figure: two DTD trees — one house element with children location, contact-info (name, phone), num-baths, amenities (handicap-equipped); the other with children address, contact (agent-name, agent-phone), full-baths, half-baths — with semantically matching elements linked across the trees]

Map of the Problem

source descriptions
– schema matching: leaf elements vs. higher-level elements; 1-1 mappings vs. complex mappings
– data translation
– scope
– completeness
– reliability
– query capability

Current State of Affairs

Largely done by hand
– labor intensive & error prone
– key bottleneck in building applications

Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data

Need automatic approaches to scale up!

Our Approach

Use machine learning to match schemas

Basic idea
1. create training data
– manually map a set of sources to mediated schema
2. train system on training data
– learns from
  – name of schema elements
  – format of values
  – frequency of words & symbols
  – characteristics of value distribution
  – proximity, position, structure, ...
3. system proposes mappings for subsequent sources

Example

realestate.com source data:

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

Mediated schema: address, phone, price, description

Columns of source values:
– location: Seattle, WA / Seattle, WA / Dallas, TX / ...
– listed-price: $250,000 / $162,000 / $180,000 / ...
– agent-phone: (206) 729 0831 / (206) 321 4571 / (214) 722 4035 / ...
– comments: Fantastic house ... / Great ... / Hurry! ... / ...

Multi-Strategy Learning

Use a set of base learners
– each exploits certain types of information

Match schema elements of a new source
– apply the learners
– combine their predictions using a meta-learner

Meta-learner
– measures base learner accuracy on training data
– weighs each learner based on its accuracy
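A minimal sketch of the last two bullets (simplified; LSD's actual meta-learner, described later, learns one weight per learner per mediated-schema element): score each base learner by its accuracy on held-out training matches.

```python
def learner_weights(learner_predictions, true_labels):
    """Weigh each base learner by its accuracy on training data.

    learner_predictions: {learner name: list of predicted elements}
    true_labels:         list of correct mediated-schema elements
    """
    weights = {}
    for learner, preds in learner_predictions.items():
        correct = sum(p == t for p, t in zip(preds, true_labels))
        weights[learner] = correct / len(true_labels)
    return weights
```

A learner that matched all training elements correctly gets weight 1.0; one that matched two of three gets 2/3, and so its later predictions count for less.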

Learners

Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...

Output
– prediction weighted by confidence score

Example learners
– name matcher
  – agent-name => (name, 0.7), (phone, 0.3)
– Naive Bayes
  – “Seattle, WA” => (address, 0.8), (name, 0.2)
  – “Great location ...” => (description, 0.9), (address, 0.1)

Training the Learners

realestate.com source data:

<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...

Mediated schema: address, phone, price, description
Manually mapped source elements: location, listed-price, agent-phone, comments

Name Matcher training pairs:
(location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...

Naive Bayes training pairs:
(“Seattle, WA”, address), (“(206) 729 0831”, phone), (“$250,000”, price), (“Fantastic house ...”, description), ...

Applying the Learned Models

homes.com source element: area, with values Seattle, WA / Kent, WA / Austin, TX / Seattle, WA

Mediated schema: address, phone, price, description

Each value is fed to the Name Matcher and Naive Bayes; the meta-learner combines their predictions per value (address, address, description, address), and a combiner picks the overall match for the element: address.

The LSD System

Base learners/modules
– name matcher
– Naive Bayes
– WHIRL nearest-neighbor classifier [Cohen & Hirsh, KDD-98]
– county-name recognizer

Meta-learner
– stacking [Ting & Witten 99; Wolpert 92]

Name Matcher

Matches based on names
– including all names on path from root to current node
– allowing synonyms

Good for ...
– specific, descriptive names: agent-phone, listed-price

Bad for ...
– vacuous names: item, listings
– partially specified, ambiguous names: office (for “office phone”)
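The slides give no code for the name matcher; as a rough illustration (the synonym table and the token-overlap scoring are invented, not LSD's actual algorithm), one can tokenize the root-to-element path, normalize via synonyms, and score each mediated-schema element by token overlap:

```python
# Hypothetical synonym table mapping source vocabulary to mediated terms.
SYNONYMS = {"location": "address", "area": "address", "comments": "description"}

def tokens(name):
    """Split a path like 'house/agent-phone' into synonym-normalized tokens."""
    return {SYNONYMS.get(t, t)
            for part in name.lower().split("/")
            for t in part.split("-")}

def name_matcher(source_name, mediated_names):
    """Score each mediated element by Jaccard token overlap, normalized
    so the scores act as confidences summing to (at most) 1."""
    src = tokens(source_name)
    raw = {m: len(src & tokens(m)) / len(src | tokens(m)) for m in mediated_names}
    total = sum(raw.values()) or 1.0
    return {m: score / total for m, score in raw.items()}
```

This captures why the learner does well on descriptive names like agent-phone and fails on vacuous ones like item, which share no tokens with any mediated element.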

Naive Bayes Learner

Exploits frequencies of words & symbols

Good for ...
– elements with words/symbols that are strongly indicative
– examples:
  – “fantastic” & “great” in house descriptions
  – $ in prices, parentheses in phone numbers

Bad for ...
– short, numeric elements: num-baths, num-bedrooms
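As a from-scratch sketch of the idea (not LSD's implementation), a word-frequency Naive Bayes classifier over element values, with add-one smoothing:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesMatcher:
    """Classify an element's values by the words/symbols they contain."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.totals = Counter()                  # label -> total word count

    def train(self, examples):
        """examples: (value string, mediated-schema element) pairs."""
        for value, label in examples:
            for w in value.lower().split():
                self.word_counts[label][w] += 1
                self.totals[label] += 1

    def predict(self, value):
        """Return the label with the highest smoothed log-likelihood."""
        vocab = {w for counts in self.word_counts.values() for w in counts}
        scores = {}
        for label in self.word_counts:
            s = 0.0
            for w in value.lower().split():
                s += math.log((self.word_counts[label][w] + 1) /
                              (self.totals[label] + len(vocab)))
            scores[label] = s
        return max(scores, key=scores.get)
```

Indicative tokens like “fantastic” or “$250,000” dominate the likelihood, which is why the learner works on descriptions and prices but not on short numeric values it has never seen.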

WHIRL Nearest-Neighbor Classifier

Similarity-based
– stores all examples seen so far
– classifies a new example based on similarity to training examples
– IR document similarity metric

Good for ...
– long, textual elements: house descriptions, names
– limited, descriptive set of values: color (blue, red, ...)

Bad for ...
– short, numeric elements: num-baths, num-bedrooms
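A simplified sketch of the same idea (WHIRL itself does much more): store training values, and label a new value with the label of its most similar stored value under TF-IDF cosine similarity, the standard IR document metric.

```python
import math
from collections import Counter

class NearestNeighborMatcher:
    """1-nearest-neighbor over element values with TF-IDF cosine similarity."""

    def __init__(self):
        self.examples = []  # list of (word Counter, label)
        self.idf = {}

    def train(self, examples):
        """examples: (value string, mediated-schema element) pairs."""
        for value, label in examples:
            self.examples.append((Counter(value.lower().split()), label))
        docs = [counts for counts, _ in self.examples]
        self.idf = {w: math.log(len(docs) / sum(1 for d in docs if w in d))
                    for d in docs for w in d}

    def _vec(self, counts):
        return {w: tf * self.idf.get(w, 0.0) for w, tf in counts.items()}

    def predict(self, value):
        """Return the label of the most cosine-similar training value."""
        q = self._vec(Counter(value.lower().split()))
        q_norm = math.sqrt(sum(x * x for x in q.values()))
        def cosine(v):
            dot = sum(q.get(w, 0.0) * x for w, x in v.items())
            v_norm = math.sqrt(sum(x * x for x in v.values()))
            return dot / (q_norm * v_norm) if q_norm and v_norm else 0.0
        _, best_label = max(self.examples, key=lambda e: cosine(self._vec(e[0])))
        return best_label
```

Long textual values share many weighted words with their neighbors, while a short numeric value like “3” shares almost nothing, matching the good/bad cases on the slide.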

County-Name Recognizer

– Stores all county names, obtained from the Web
– Verifies if the input name is a county name
– Essential to matching a county-name element
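The recognizer is essentially set membership; a minimal sketch (the county list here is a tiny hypothetical stand-in for the full list LSD harvested from the Web):

```python
# Hypothetical sample; LSD stored all county names obtained from the Web.
COUNTY_NAMES = {"king", "pierce", "snohomish", "travis", "kenton"}

def county_name_recognizer(value):
    """Confidence 1.0 if the value is a known county name, else 0.0."""
    return 1.0 if value.strip().lower() in COUNTY_NAMES else 0.0
```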

Meta-Learner: Stacking

Training
– uses training data to learn weights
– one for each (base learner, mediated-schema element) pair

Combining predictions
– for each mediated-schema element, computes weighted sum of base-learner confidence scores
– picks mediated-schema element with highest sum
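The combining step can be sketched directly from the bullets above (how the per-pair weights are learned is the stacking part, omitted here; the learner and element names in the example are hypothetical):

```python
def stack_predictions(predictions, weights):
    """Combine base-learner confidences with learned per-pair weights.

    predictions: {learner: {mediated element: confidence score}}
    weights:     {(learner, mediated element): weight from training data}
    Returns the mediated-schema element with the highest weighted sum.
    """
    totals = {}
    for learner, scores in predictions.items():
        for element, conf in scores.items():
            w = weights.get((learner, element), 0.0)
            totals[element] = totals.get(element, 0.0) + w * conf
    return max(totals, key=totals.get)
```

Because there is a separate weight per (learner, element) pair, a learner can be trusted for phone numbers yet discounted for descriptions.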

Experiments

Source                Coverage    # Matchable     Best Single   LSD
                                  Leaf Elements   Learner
realestate.yahoo      USA         31              63%           77%
homeseekers.com       USA         31              52%           64%
nkymls.com            Kentucky    28              64%           75%
texasproperties.com   Texas       42              59%           62%
windermere.com        Northwest   35              55%           63%

Reasons for Incorrect Matchings

Unfamiliarity
– suburb
– solution: add a suburb-name recognizer

Insufficient information
– correctly identified the general type
– failed to pinpoint the exact type, e.g.:
  <agent-name> Richard Smith </agent-name>
  <phone> (206) 234 5412 </phone>
– solution: add a proximity learner

Experiments: Summary

Multi-strategy learning
– better performance than any single learner

Accuracy of 100% unlikely to be reached
– difficult even for humans

Lots of room for improvement
– more learners
– better learning algorithms

Related Work

Rule-based approaches
– TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98]
– utilize only schema information

Learner-based approaches
– SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
– employ a single learner, limited applicability

Future Work

[Map-of-the-problem figure, repeated — source descriptions: schema matching (leaf elements vs. higher-level elements; 1-1 mappings vs. complex mappings), data translation, scope, completeness, reliability, query capability]

Future Work

Improve matching accuracy
– more learners, more domains

Incorporate domain knowledge
– semantic integrity constraints
– concept hierarchy of mediated-schema elements

Learn with structured data

Learning with Structured Data

– Each example has more than one level of structure
– Generative model for XML
– XML classifier
– XML: a “killer app” for relational learning

Summary

Schema matching
– automated by learning

Multi-strategy learning is essential
– handles different types of data
– incorporates different types of domain knowledge
– easy to incorporate new learners
– alleviates effects of noise & dirty data

Implemented LSD
– promising results with initial experiments
