Data Integration: A “Killer App” for Multi-Strategy Learning
Pedro Domingos
Joint work with AnHai Doan & Alon Levy
Department of Computer Science & Engineering
University of Washington
Dec 24, 2015

Overview
Data integration & XML
Schema matching
Multi-strategy learning
Prototype system & experiments
Related work
Future work
Summary

Data Integration
[Figure: the query “Find houses with four bathrooms and price under $500,000” is posed against the mediated schema, which connects through three wrappers to the source schemas of superhomes.com, realestate.com, and homeseekers.com.]

Why Data Integration Matters
Very active area in database & AI
– research / workshops
– start-ups
Large organizations
– multiple databases with differing schemas
Data warehousing
The Web: HTML sources
The Web: XML sources

XML
Extensible Markup Language
– introduced in 1996
The standard for data publishing & exchange
– replaces HTML & proprietary formats
– embraced by database/web/e-commerce communities
XML versus HTML
– both use tags to mark up data elements
– HTML tags specify format
– XML tags define meaning
– relationships among elements provided via nesting

Example
HTML
<h1> Residential Listings </h1>
<ul> House For Sale
  <li> location: Seattle, WA, USA
  <li> agent-phone: (206) 729 0831
  <li> listed-price: $250,000
  <li> comments: Fantastic house ...
</ul>
<hr>
<ul> House For Sale
  ...
</ul>
...
XML
<residential-listings>
  <house>
    <location>
      <city> Seattle </city>
      <state> WA </state>
      <country> USA </country>
    </location>
    <agent-phone> (206) 729 0831 </agent-phone>
    <listed-price> $250,000 </listed-price>
    <comments> Fantastic house ... </comments>
  </house>
  ...
</residential-listings>
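Because XML tags carry meaning, a program can pull out fields by name rather than by position on the page. A minimal sketch using Python's standard `xml.etree.ElementTree`, assuming a completed version of the XML listing with the elided parts dropped; the `listing_fields` helper is illustrative only:

```python
import xml.etree.ElementTree as ET

# A complete, well-formed variant of the slide's XML listing (elisions dropped).
XML_LISTING = """
<residential-listings>
  <house>
    <location>
      <city>Seattle</city>
      <state>WA</state>
      <country>USA</country>
    </location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$250,000</listed-price>
    <comments>Fantastic house</comments>
  </house>
</residential-listings>
"""

def listing_fields(xml_text):
    """Return one dict per <house>, keyed by leaf tag name."""
    root = ET.fromstring(xml_text)
    houses = []
    for house in root.iter("house"):
        # A leaf is an element with no child elements.
        leaves = {el.tag: (el.text or "").strip()
                  for el in house.iter() if len(el) == 0}
        houses.append(leaves)
    return houses

houses = listing_fields(XML_LISTING)
print(houses[0]["city"], houses[0]["listed-price"])
```

The same lookup against the HTML version would need page-specific text parsing, which is the gap wrappers have to fill.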

XML DTD
Document Type Definition
– BNF grammar
– constraints on element structure: type, order, # of times
A real-estate DTD
<!ELEMENT residential-listings (house*)>
<!ELEMENT house (location?, agent-phone, listed-price, comments?)>
<!ELEMENT location (city, state, country?)>
A DTD can be visualized as a tree
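Each content model in the real-estate DTD behaves like a regular expression over child tag names, so the type, order, and count constraints can be checked mechanically. A rough sketch, assuming the three declarations are hand-translated into Python regular expressions; `CONTENT_MODELS` and `conforms` are illustrative names, this is not a real DTD validator, and elements without a declaration are treated as unconstrained:

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models: each child tag is followed by a comma,
# so "house*" becomes "(house,)*" and "location?" becomes "(location,)?".
CONTENT_MODELS = {
    "residential-listings": r"(house,)*",
    "house": r"(location,)?agent-phone,listed-price,(comments,)?",
    "location": r"city,state,(country,)?",
}

def conforms(element):
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + "," for child in element)
        if re.fullmatch(model, children) is None:
            return False
    return all(conforms(child) for child in element)

# Optional children omitted: still valid.
good = ET.fromstring(
    "<house><agent-phone>(206) 729 0831</agent-phone>"
    "<listed-price>$250,000</listed-price></house>")
# Required children in the wrong order: invalid.
bad = ET.fromstring(
    "<house><listed-price>$250,000</listed-price>"
    "<agent-phone>(206) 729 0831</agent-phone></house>")
print(conforms(good), conforms(bad))
```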

Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure: two DTD trees, each rooted at house, with semantic mappings drawn between corresponding elements. Element names shown: location, address, contact-info, contact, agent-name, agent-phone, name, phone, num-baths, amenities, full-baths, half-baths, handicap-equipped.]

Map of the Problem
[Figure: taxonomy of the problem. Source descriptions: schema matching, data translation, scope, completeness, reliability, query capability. Elements: leaf elements, higher-level elements. Mappings: 1-1 mappings, complex mappings.]

Current State of Affairs
Largely done by hand
– labor intensive & error prone
– key bottleneck in building applications
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
Need automatic approaches to scale up!

Our Approach
Use machine learning to match schemas
Basic idea
1. create training data
– manually map a set of sources to mediated schema
2. train system on training data
– learns from
  – name of schema elements
  – format of values
  – frequency of words & symbols
  – characteristics of value distribution
  – proximity, position, structure, ...
3. system proposes mappings for subsequent sources

Example
realestate.com
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
mediated schema: address, phone, price, description
Values collected for each source element
– location: Seattle, WA; Seattle, WA; Dallas, TX; ...
– listed-price: $250,000; $162,000; $180,000; ...
– agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
– comments: Fantastic house ...; Great ...; Hurry! ...; ...

Multi-Strategy Learning
Use a set of base learners
– each exploits certain types of information
Match schema elements of a new source
– apply the learners
– combine their predictions using a meta-learner
Meta-learner
– measures base learner accuracy on training data
– weighs each learner based on its accuracy

Learners
Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...
Output
– prediction weighted by confidence score
Example learners
– name matcher
  – agent-name => (name,0.7), (phone,0.3)
– Naive Bayes
  – “Seattle, WA” => (address,0.8), (name,0.2)
  – “Great location ...” => (description,0.9), (address,0.1)

Training the Learners
realestate.com
<house>
  <location> Seattle, WA </location>
  <agent-phone> (206) 729 0831 </agent-phone>
  <listed-price> $ 250,000 </listed-price>
  <comments> Fantastic house ... </comments>
</house>
...
mediated schema: address, phone, price, description
source elements: location, listed-price, agent-phone, comments
Name Matcher training examples
(location, address)
(agent-phone, phone)
(listed-price, price)
(comments, description)
...
Naive Bayes training examples
(“Seattle, WA”, address)
(“(206) 729 0831”, phone)
(“$ 250,000”, price)
(“Fantastic house ...”, description)
...
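The construction of these training pairs can be sketched as follows. This assumes the manual mapping is available as a dict from source tag to mediated-schema element and that the listings sit under a hypothetical `<listings>` root; the name matcher gets one (tag, label) pair per mapped tag, and Naive Bayes gets one (value, label) pair per data instance:

```python
import xml.etree.ElementTree as ET

# Manual mapping for realestate.com, as on the slide.
MAPPING = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}

# The <listings> wrapper is an assumption so the sample parses as one document.
LISTINGS = """
<listings>
  <house>
    <location>Seattle, WA</location>
    <agent-phone>(206) 729 0831</agent-phone>
    <listed-price>$ 250,000</listed-price>
    <comments>Fantastic house</comments>
  </house>
</listings>
"""

def training_examples(xml_text, mapping):
    """Build (tag name, label) and (data value, label) training pairs."""
    name_examples = sorted(mapping.items())
    value_examples = []
    for el in ET.fromstring(xml_text).iter():
        if el.tag in mapping:
            value_examples.append(((el.text or "").strip(), mapping[el.tag]))
    return name_examples, value_examples

names, values = training_examples(LISTINGS, MAPPING)
print(names)
print(values)
```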

Applying the Learned Models
[Figure: for the new source homes.com, the element area has the values Seattle, WA; Kent, WA; Austin, TX; Seattle, WA. The Name Matcher and Naive Bayes feed the meta-learner, whose predictions for the four values are address, address, description, address. The combiner then maps area to address in the mediated schema (address, phone, price, description).]
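The slide does not say how the combiner merges the per-value predictions into one label for the element. One plausible reading, offered only as an assumption, is a majority vote over the meta-learner's outputs:

```python
from collections import Counter

def combine(instance_labels):
    """Pick the label predicted for the most instances (ties: first seen)."""
    counts = Counter(instance_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Per-value predictions for the "area" element, as on the slide.
area_label = combine(["address", "address", "description", "address"])
print(area_label)
```

Averaging the confidence scores across values would be an equally reasonable choice when the meta-learner exposes them.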

The LSD System
Base learners/modules
– name matcher
– Naive Bayes
– WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98]
– county-name recognizer
Meta-learner
– stacking [Ting&Witten99, Wolpert92]

Name Matcher
Matches based on names
– including all names on path from root to current node
– allowing synonyms
Good for ...
– specific, descriptive names: agent-phone, listed-price
Bad for ...
– vacuous names: item, listings
– partially specified, ambiguous names: office (for “office phone”)
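A toy name matcher along these lines, assuming names are compared by token overlap (Jaccard similarity) after synonym substitution. The similarity measure, the `SYNONYMS` table, and the normalisation are all illustrative choices, not details taken from LSD:

```python
import re

# Hypothetical synonym table; the slide only says synonyms are allowed.
SYNONYMS = {"location": "address", "comments": "description"}

def tokens(name):
    """Split a tag name or root-to-node path into canonical lowercase tokens."""
    words = re.findall(r"[a-z]+", name.lower())
    return {SYNONYMS.get(w, w) for w in words}

def name_matcher(source_name, labels):
    """Score each label by Jaccard overlap, normalised to sum to 1."""
    src = tokens(source_name)
    scores = {}
    for label in labels:
        lab = tokens(label)
        scores[label] = len(src & lab) / len(src | lab)
    total = sum(scores.values())
    if total == 0:
        # No overlap at all: fall back to a uniform, uninformative prediction.
        return {label: 1 / len(labels) for label in labels}
    return {label: s / total for label, s in scores.items()}

LABELS = ["address", "phone", "price", "description"]
print(name_matcher("house/agent-phone", LABELS))    # descriptive name
print(name_matcher("house/listings/item", LABELS))  # vacuous name
```

The vacuous path ends up with the same score for every label, which is the failure mode listed under "Bad for ...".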

Naive Bayes Learner
Exploits frequencies of words & symbols
Good for ...
– elements with words/symbols that are strongly indicative
– examples:
  – “fantastic” & “great” in house descriptions
  – $ in prices, parentheses in phone numbers
Bad for ...
– short, numeric elements: num-baths, num-bedrooms
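A compact multinomial Naive Bayes over words and symbols illustrates why tokens such as $ or parentheses are so informative. This is a generic textbook formulation with add-one smoothing, assumed here for illustration; the tokenizer and class layout are not taken from LSD:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split into lowercase words, digit runs, and single symbols like $ or (."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class NaiveBayes:
    def fit(self, examples):
        """examples: list of (text, label) pairs."""
        self.label_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for text, label in examples:
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        """Return {label: posterior probability} using add-one smoothing."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalise in log space for numerical stability.
        top = max(log_scores.values())
        exp = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(exp.values())
        return {l: v / z for l, v in exp.items()}

TRAIN = [("Seattle, WA", "address"), ("(206) 729 0831", "phone"),
         ("$ 250,000", "price"), ("Fantastic house", "description")]
model = NaiveBayes().fit(TRAIN)
prediction = model.predict("(214) 722 4035")
print(max(prediction, key=prediction.get))
```

With only one training value per label, the parentheses are the sole evidence tying the unseen number to phone, so the margin is small; more training instances would sharpen it.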

WHIRL Nearest-Neighbor Classifier
Similarity-based
– stores all examples seen so far
– classifies a new example based on similarity to training examples
– IR document similarity metric
Good for ...
– long, textual elements: house description, names
– limited, descriptive set of values: color (blue, red, ...)
Bad for ...
– short, numeric elements: num-baths, num-bedrooms
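The idea of classifying by IR-style document similarity can be sketched with a one-nearest-neighbor lookup over TF-IDF vectors and cosine similarity. This is a simplified stand-in, not WHIRL itself; the smoothed idf formula and the tiny example set are assumptions made for illustration:

```python
import math
import re
from collections import Counter

def tf_idf_vectors(texts):
    """Return one {token: tf-idf weight} dict per text, plus the idf table."""
    docs = [Counter(re.findall(r"\w+", t.lower())) for t in texts]
    df = Counter(tok for doc in docs for tok in doc)
    # Assumed smoothed idf variant; always positive.
    idf = {tok: math.log(1 + len(docs) / n) for tok, n in df.items()}
    return [{t: c * idf[t] for t, c in doc.items()} for doc in docs], idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

def nearest_label(query, examples):
    """Label of the stored example most similar to the query."""
    vectors, idf = tf_idf_vectors([text for text, _ in examples])
    q_counts = Counter(re.findall(r"\w+", query.lower()))
    # Tokens never seen in the stored examples get weight 0.
    q = {t: c * idf.get(t, 0.0) for t, c in q_counts.items()}
    best = max(range(len(examples)), key=lambda i: cosine(q, vectors[i]))
    return examples[best][1]

EXAMPLES = [("Fantastic house with great view", "description"),
            ("blue", "color"), ("red", "color"),
            ("Richard Smith", "name")]
print(nearest_label("Great house, hurry!", EXAMPLES))
```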

County-Name Recognizer
Stores all county names, obtained from the Web
Verifies if the input name is a county name
Essential to matching a county-name element
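A recognizer of this kind reduces to a dictionary lookup. A minimal sketch with a tiny hand-picked set of county names standing in for the full list; the normalisation rules are assumptions:

```python
# Tiny illustrative gazetteer; a real recognizer would load a full list.
COUNTY_NAMES = {"king", "pierce", "snohomish", "travis"}

def county_recognizer(value):
    """Return a confidence that the value is a county name: 1.0 or 0.0."""
    name = value.strip().lower()
    if name.endswith(" county"):
        name = name[: -len(" county")]
    return 1.0 if name in COUNTY_NAMES else 0.0

print(county_recognizer("King County"), county_recognizer("Seattle, WA"))
```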

Meta-Learner: Stacking
Training
– uses training data to learn weights
– one for each (base learner, mediated-schema element)
Combining predictions
– for each mediated-schema element
  – computes weighted sum of base-learner confidence scores
– picks mediated-schema element with highest sum
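The combining step can be written out directly. In this sketch the weights and confidence scores are made-up numbers for illustration, and learning the weights from training data is not shown:

```python
def stack(predictions, weights):
    """Combine base-learner confidence scores with per-(learner, label) weights.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}
    """
    labels = {label for scores in predictions.values() for label in scores}
    totals = {
        label: sum(weights.get((learner, label), 0.0) * scores.get(label, 0.0)
                   for learner, scores in predictions.items())
        for label in labels
    }
    best = max(totals, key=totals.get)
    return best, totals

# Hypothetical scores and weights, for illustration only.
PREDICTIONS = {
    "name_matcher": {"address": 0.6, "description": 0.4},
    "naive_bayes": {"address": 0.3, "description": 0.7},
}
WEIGHTS = {
    ("name_matcher", "address"): 0.8, ("name_matcher", "description"): 0.2,
    ("naive_bayes", "address"): 0.4, ("naive_bayes", "description"): 0.5,
}
best, totals = stack(PREDICTIONS, WEIGHTS)
print(best, totals)
```

Keeping one weight per (learner, element) pair lets a learner count heavily for the elements it predicts well and lightly for the rest.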

Experiments
Source                 Coverage    # of matchable    Best single    LSD
                                   leaf elements     learner
realestate.yahoo       USA         31                63%            77%
homeseekers.com        USA         31                52%            64%
nkymls.com             Kentucky    28                64%            75%
texasproperties.com    Texas       42                59%            62%
windermere.com         Northwest   35                55%            63%
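The gap between the two accuracy columns can be summarised with a few lines of arithmetic. The figures below are copied from the table; the averages are derived here and are not stated on the slide:

```python
# (best single learner %, LSD %) per source, copied from the table.
RESULTS = {
    "realestate.yahoo": (63, 77),
    "homeseekers.com": (52, 64),
    "nkymls.com": (64, 75),
    "texasproperties.com": (59, 62),
    "windermere.com": (55, 63),
}

gains = {src: lsd - single for src, (single, lsd) in RESULTS.items()}
mean_single = sum(s for s, _ in RESULTS.values()) / len(RESULTS)
mean_lsd = sum(l for _, l in RESULTS.values()) / len(RESULTS)
mean_gain = sum(gains.values()) / len(gains)

print(gains)
print(f"{mean_single:.1f} {mean_lsd:.1f} {mean_gain:.1f}")
```

On these five sources the gain ranges from 3 to 14 percentage points, averaging 9.6.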

Reasons for Incorrect Matchings
Unfamiliarity
– suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified the general type
– failed to pinpoint the exact type
– <agent-name>Richard Smith</agent-name> <phone> (206) 234 5412 </phone>
– solution: add a proximity learner

Experiments: Summary
Multi-strategy learning
– better performance than any single learner
Accuracy of 100% unlikely to be reached
– difficult even for humans
Lots of room for improvement
– more learners
– better learning algorithms

Related Work
Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
– utilize only schema information
Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability

Future Work
[Figure: the taxonomy from “Map of the Problem”, repeated. Source descriptions: schema matching, data translation, scope, completeness, reliability, query capability. Elements: leaf elements, higher-level elements. Mappings: 1-1 mappings, complex mappings.]

Future Work
Improve matching accuracy
– more learners, more domains
Incorporate domain knowledge
– semantic integrity constraints
– concept hierarchy of mediated-schema elements
Learn with structured data

Learning with Structured Data
Each example with >1 level of structure
Generative model for XML
XML classifier
XML: “killer app” for relational learning

Summary
Schema matching
– automated by learning
Multi-strategy learning is essential
– handles different types of data
– incorporates different types of domain knowledge
– easy to incorporate new learners
– alleviates effects of noise & dirty data
Implemented LSD
– promising results with initial experiments