Methods for Data Integration Amit Shvarchenberg and Rafi Sayag
Amit Shvarchenberg and Rafi Sayag. Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan Department of Computer Science University of Illinois,
Transcript
Slide 1
Amit Shvarchenberg and Rafi Sayag
Slide 2
Based on a paper by:
Robin Dhamankar, Yoonkyong Lee, AnHai Doan
Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA
fdhamanka,ylee11,[email protected]
Alon Halevy, Pedro Domingos
Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
falon,[email protected]
Slide 3
Introduction
Today there are many databases around the world, and it is often necessary to combine two or more similar databases into a single database.
In the past, many of these integrations were done manually.
The iMAP system offers a semi-automatic method of matching information from different sources.
Slide 4
The Real-Estate-Agents Example

Schema S

HOUSES
location      price     agent-id
Raleigh, NC   360,000   32
Atlanta, GA   430,000   15

AGENTS
id   name         city      state   fee-rate
32   Mike Brown   Athens    GA      0.03
15   Jean Laup    Raleigh   NC      0.04

Schema T

LISTING
area          list-price   agent-address   agent-name
Denver, CO    550000       Boulder, CO     Laura Smith
Atlanta, GA   370800       Athens, GA      Mike Brown
Slide 5
The Big Merge
Slide 6
Making Tuples Using SQL
area = SELECT location FROM HOUSES
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
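The three mappings on this slide can be tried out together. The following is a minimal sketch, assuming an in-memory SQLite database, underscore column names (agent_id, fee_rate) in place of the hyphenated ones, and the sample rows from the real-estate example. SQLite's `||` operator stands in for concat, and the ", " separator is an assumption.

```python
import sqlite3

# Assumption: underscore names replace the hyphenated names on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
    CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
    INSERT INTO houses VALUES ('Raleigh, NC', 360000, 32),
                              ('Atlanta, GA', 430000, 15);
    INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                              (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# One query that applies all three mappings and produces LISTING-shaped tuples.
listing = conn.execute("""
    SELECT h.location                 AS area,
           h.price * (1 + a.fee_rate) AS list_price,
           a.city || ', ' || a.state  AS agent_address,
           a.name                     AS agent_name
    FROM houses AS h
    JOIN agents AS a ON h.agent_id = a.id
    ORDER BY h.location
""").fetchall()

for row in listing:
    print(row)
```

Each printed tuple has the shape (area, list_price, agent_address, agent_name) of the LISTING table.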
Slide 7
How Do We Match?
The process of creating mappings typically proceeds in two steps.
First step: schema matching, where we find matches between elements of the two schemas.
Second step: we elaborate the matches to create query expressions that enable automated data translation or exchange.
Slide 8
Schema Matches
There are two kinds of schema matches.
1-1 matches.
[Tables HOUSES, AGENTS and LISTING, as on Slide 4]
Slide 10
Complex Matches
Complex matches specify that some combination of attributes in one schema corresponds to a combination in the other.
[Tables HOUSES, AGENTS and LISTING, as on Slide 4]
Slide 14
The Solution: The iMAP System
We will describe the iMAP system, which semi-automatically discovers complex matches for relational data in a single table.
In some cases iMAP is able to find matches that combine attributes from multiple tables.
Slide 15
The iMAP Architecture
Slide 16
Match Generator
Input: target schema and source schema.
Output: match candidates.
Slide 17
How the Match Generator Works
The match generator uses a search method that explores the space of possible match candidates.
The searchers use prior knowledge of possible match types, together with heuristic methods.
Slide 18
The Internals of a Searcher
Applying search to candidate generation involves three major issues: the search strategy, the evaluation of candidate matches, and the termination condition.
Slide 19
Search Strategy
The search space can be very large or even unbounded.
We need to search such spaces efficiently.
iMAP addresses this problem using a search technique called beam search.
Slide 20
Beam Search
Beam search uses a scoring function to evaluate each match candidate.
At each level of the search tree, it keeps only the k highest-scoring match candidates.
Because the number of candidates kept per level is bounded by k, the searcher stays efficient even when the search space is very large.
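The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not iMAP's implementation: candidates are tuples of source column names joined with ", ", the scoring function is a plain string similarity against sample target values, and the search stops as soon as a level brings no improvement. The data, the scoring function and the stopping rule are all hypothetical choices made for the sketch.

```python
from difflib import SequenceMatcher

# Hypothetical sample data: source rows (AGENTS) and target values (agent-address).
AGENTS = [
    {"name": "Mike Brown", "city": "Athens", "state": "GA"},
    {"name": "Jean Laup", "city": "Raleigh", "state": "NC"},
]
TARGET_VALUES = ["Athens, GA", "Raleigh, NC"]
COLUMNS = ("name", "city", "state")


def render(candidate, row):
    """Value a candidate (tuple of column names) produces for one source row."""
    return ", ".join(row[col] for col in candidate)


def score(candidate):
    """Assumed scoring function: mean best string similarity to any target value."""
    total = 0.0
    for row in AGENTS:
        value = render(candidate, row)
        total += max(SequenceMatcher(None, value, t).ratio() for t in TARGET_VALUES)
    return total / len(AGENTS)


def expand(candidate):
    """Children of a candidate: append one column that is not used yet."""
    return [candidate + (col,) for col in COLUMNS if col not in candidate]


def beam_search(start, expand, score, k, max_levels):
    """Keep only the k best candidates per level; stop when the score stops improving."""
    rank = lambda cand: (-score(cand), cand)  # ties are broken by candidate order
    beam = sorted(start, key=rank)[:k]
    best = beam[0]
    for _ in range(max_levels):
        children = {child for cand in beam for child in expand(cand)}
        if not children:
            break
        beam = sorted(children, key=rank)[:k]
        if score(beam[0]) <= score(best):
            break  # assumed termination condition: no improvement at this level
        best = beam[0]
    return best


best = beam_search([(c,) for c in COLUMNS], expand, score, k=2, max_levels=3)
print(best, score(best))
```

With k=2 only two candidates survive each level, so the number of scored candidates grows with k times the number of columns rather than with the full set of column combinations.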
Slide 21
Implemented Searchers on iMAP
Slide 22
Example: Unit Conversion Searcher
The unit conversion searcher can identify a conversion between two different types of measurement unit.
It does so by looking at the names and data of the attributes (e.g., "hours", "kg", "$", etc.).
Slide 23
Example: Unit Conversion Searcher (cont.)
The searcher finds the best conversion from a set of conversion functions between the units. In this case weight_pounds = 2.2 * weight_kg.

Source table:
product   kg
apple     10

Target table before the merge:
Fruits and vegetables   pounds
banana                  5

Target table after the merge:
Fruits and vegetables   pounds
banana                  5
apple                   22
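A minimal sketch of such a lookup follows, assuming the usual factor of about 2.2 pounds per kilogram. The conversion table, the substring test on attribute names and the function names are hypothetical. The sketch inspects attribute names only, whereas the searcher described here also looks at the data.

```python
# Hypothetical set of known conversion functions, stored as multiplication factors.
CONVERSIONS = {
    ("kg", "pounds"): 2.2046,
    ("pounds", "kg"): 1 / 2.2046,
    ("hours", "minutes"): 60.0,
    ("minutes", "hours"): 1 / 60.0,
}
UNITS = sorted({unit for pair in CONVERSIONS for unit in pair})


def detect_unit(attribute_name):
    """Return the first known unit that appears in the attribute name, if any."""
    name = attribute_name.lower()
    for unit in UNITS:
        if unit in name:
            return unit
    return None


def find_conversion(source_attr, target_attr):
    """Return the factor f such that target = f * source, or None if unknown."""
    src, tgt = detect_unit(source_attr), detect_unit(target_attr)
    if src is None or tgt is None or src == tgt:
        return None
    return CONVERSIONS.get((src, tgt))


factor = find_conversion("weight_kg", "weight_pounds")
converted = [round(value * factor, 1) for value in [10]]
print(factor, converted)
```

Returning None for unknown or identical units lets a caller fall back to other searchers instead of proposing a meaningless conversion.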
Slide 24
Similarity Estimator
Input: match candidates.
Output: similarity matrix.
The similarity matrix stores the similarity score of (target attribute, match candidate) pairs.
Slide 25
Similarity Estimator
The similarity estimator gets the results from all the searchers.
Then it gathers the data and calculates a final score for each match.
Slide 26
Similarity Estimator (cont.)
The similarity estimator uses two methods to score match pairs: a name-based evaluator and a Naive Bayes evaluator.
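As an illustration of the name-based side only, the sketch below fills a small similarity matrix from attribute names. The tokenization and the use of a generic string-similarity ratio are assumptions made for the sketch. They are not taken from iMAP, and the Naive Bayes evaluator is not shown.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    """Lower-case a name and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", name.lower())


def name_similarity(a, b):
    """Assumed name-based score in [0, 1]: string similarity of normalized names."""
    return SequenceMatcher(None, " ".join(tokens(a)), " ".join(tokens(b))).ratio()


def similarity_matrix(targets, candidates):
    """Nested dict: matrix[target attribute][match candidate] -> score."""
    return {t: {c: name_similarity(t, c) for c in candidates} for t in targets}


matrix = similarity_matrix(
    targets=["list-price", "agent-name"],
    candidates=["price", "name", "city"],
)
for target, row in matrix.items():
    print(target, row)
```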
Slide 27
Match Selector
Input: similarity matrix.
Output: 1-1 and complex matches.
Slide 28
Match Selector
The match selector examines the similarity matrix and outputs the best matches under certain conditions.
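One simple way to read this step is sketched below: take the highest-scoring candidate for each target attribute and accept it only above a threshold. The threshold stands in for the "certain conditions" and is an assumption, as are the scores in the example matrix.

```python
def select_matches(similarity, threshold=0.5):
    """For each target attribute, keep its best candidate if it scores high enough."""
    selected = {}
    for target, scores in similarity.items():
        if not scores:
            continue
        candidate, best = max(scores.items(), key=lambda item: item[1])
        if best >= threshold:
            selected[target] = candidate
    return selected


# Hypothetical similarity matrix: target attribute -> {match candidate: score}.
similarity = {
    "area": {"location": 0.9, "city": 0.4},
    "list-price": {"price": 0.6, "price * (1 + fee-rate)": 0.8},
    "agent-name": {"location": 0.2},
}
matches = select_matches(similarity)
print(matches)
```

Target attributes whose best candidate falls below the threshold are left unmatched rather than forced onto a weak candidate.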
Slide 29
Exploiting Domain Knowledge
Exploiting domain knowledge has been shown to be beneficial for 1-1 matching.
For complex matching it can be even more crucial, since detecting unlikely matches early saves valuable processing.
Slide 30
Domain Constraints
Constraints are either present in the schema or provided by an expert or the user.
iMAP considers three kinds of constraints: two attributes are unrelated, a constraint on a single attribute, and multiple schema attributes are unrelated.
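A rough sketch of how such constraints could prune candidates is given below. It rests on an interpretation that the slide does not spell out: a set of unrelated attributes (two or more) may not all be used in the same match, and a single-attribute constraint is a check on the values a candidate produces for its target attribute. The candidate structure, the constraint values and all names are hypothetical.

```python
from statistics import mean


def violates_unrelated(used_attrs, unrelated_sets):
    """True if the candidate combines every attribute of some unrelated set."""
    used = set(used_attrs)
    return any(group <= used for group in unrelated_sets)


def violates_value_check(target_attr, values, value_checks):
    """True if the target attribute has a check and the produced values fail it."""
    check = value_checks.get(target_attr)
    return check is not None and not check(values)


def prune(candidates, unrelated_sets, value_checks):
    """Drop candidates that break any constraint, keeping the original order."""
    kept = []
    for cand in candidates:
        if violates_unrelated(cand["uses"], unrelated_sets):
            continue
        if violates_value_check(cand["target"], cand["values"], value_checks):
            continue
        kept.append(cand)
    return kept


# Hypothetical constraints: price and id are unrelated; list prices are not tiny.
unrelated_sets = [frozenset({"price", "id"})]
value_checks = {"list_price": lambda values: mean(values) >= 1000}

candidates = [
    {"target": "list_price", "uses": ("price", "fee_rate"), "values": [370800.0, 447200.0]},
    {"target": "list_price", "uses": ("price", "id"), "values": [360032.0, 430015.0]},
    {"target": "list_price", "uses": ("fee_rate",), "values": [0.03, 0.04]},
]
kept = prune(candidates, unrelated_sets, value_checks)
print([cand["uses"] for cand in kept])
```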
Slide 31
Sources for Domain Constraints
Past complex matches
Overlap data
External data
Slide 32
Past Complex Matches
We often find that we map the same or similar schemas repeatedly.
iMAP can extract a template expression from such matches.
Example: given the past match price = pr * (1+0.6), iMAP will extract VAR * (1 + CONST) and ask the numeric searcher to look for matches for that template.
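The extraction step in this example can be imitated with a small text rewrite. The rule used here is an assumption chosen so that it reproduces the example: attribute names become VAR, decimal literals become CONST, and integer literals such as 1 are kept. How iMAP actually generalizes expressions is not described on the slide.

```python
import re

# Decimal literal, integer literal, or identifier (tried in that order).
TOKEN = re.compile(r"(?P<num>\d+\.\d+)|(?P<int>\d+)|(?P<id>[A-Za-z_]\w*)")


def extract_template(match):
    """Turn 'target = expression' into a template over VAR and CONST."""
    _, rhs = match.split("=", 1)

    def generalize(m):
        if m.lastgroup == "num":
            return "CONST"  # assumed rule: decimal literals are tunable constants
        if m.lastgroup == "id":
            return "VAR"    # attribute names become variables
        return m.group()    # integer literals are kept as written

    template = TOKEN.sub(generalize, rhs.strip())
    # Normalize spacing around the arithmetic operators.
    return re.sub(r"\s*([+\-*/])\s*", r" \1 ", template)


template = extract_template("price = pr * (1+0.6)")
print(template)
```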
Slide 33
Overlap Data
In some cases, the source and the target share some of the same data.
This can be used as information for the matching process.
Searchers that exploit overlap data: the overlap text searcher, the overlap numeric searcher, and the overlap category and schema mismatch searcher.
Slide 34
External Data
External data is used as additional constraints on the attributes of a schema.
It is usually provided by experts.
It can be very useful in schema matching.
Slide 35
Why do we need it?
Slide 36
Generating Explanations in iMAP
iMAP's goal is to provide a design environment where a human user can quickly generate a mapping between a pair of schemas.
For a user to know which match to choose, it is necessary to supply an explanation for each of the matches.
Slide 37
User Questions
iMAP considers three questions that might be asked by a user:
Why does the match exist?
Why doesn't the match exist?
Why is one match better than the other?
Slide 38
Explanation Generation
iMAP keeps track of the decision-making process as a dependency graph:
Each node is either a schema attribute, an assumption, a candidate match or a piece of domain knowledge.
An edge between two nodes means that one node led to the other.
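A dependency graph of this kind can be sketched as follows. The class, the node names and the explanation format are hypothetical. The sketch only shows how walking the edges backwards from a candidate match could answer "why does this match exist?".

```python
class DependencyGraph:
    """Nodes carry a kind; an edge cause -> effect records that cause led to effect."""

    def __init__(self):
        self.kinds = {}
        self.causes = {}

    def add_node(self, name, kind):
        self.kinds[name] = kind
        self.causes.setdefault(name, [])

    def add_edge(self, cause, effect):
        self.causes[effect].append(cause)

    def explain(self, node):
        """List everything that led to `node`, indented by distance from it."""
        lines, seen = [], set()

        def visit(name, depth):
            if name in seen:
                return
            seen.add(name)
            lines.append(f"{'  ' * depth}{name} [{self.kinds[name]}]")
            for cause in self.causes[name]:
                visit(cause, depth + 1)

        visit(node, 0)
        return lines


# Hypothetical trace for one candidate match.
graph = DependencyGraph()
match = "list-price = price * (1 + fee-rate)"
graph.add_node(match, "candidate match")
graph.add_node("price", "schema attribute")
graph.add_node("fee-rate", "schema attribute")
graph.add_node("template VAR * (1 + CONST)", "domain knowledge")
for cause in ["price", "fee-rate", "template VAR * (1 + CONST)"]:
    graph.add_edge(cause, match)

explanation = graph.explain(match)
print("\n".join(explanation))
```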