Methods for Data Integration Amit Shvarchenberg and Rafi Sayag

Amit Shvarchenberg and Rafi Sayag. Based on a paper by: Robin Dhamankar, Yoonkyong Lee, AnHai Doan Department of Computer Science University of Illinois,

Dec 16, 2015


Landen Philips
Transcript
  • Slide 1
  • Amit Shvarchenberg and Rafi Sayag
  • Slide 2
  • Based on a paper by:
    Robin Dhamankar, Yoonkyong Lee, AnHai Doan. Department of Computer Science, University of Illinois, Urbana-Champaign, IL, USA. fdhamanka,ylee11,[email protected]
    Alon Halevy, Pedro Domingos. Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA. falon,[email protected]
  • Slide 3
  • Introduction
    Many databases exist around the world today, and it is often necessary to combine two or more similar databases into a single one. In the past, many of these integrations were done manually. The iMAP system offers a semi-automatic method for matching information from different sources.
  • Slide 4
  • The Real-Estate-Agents Example
    Schema S, table HOUSES:
      location    | price   | agent-id
      Raleigh, NC | 360,000 | 32
      Atlanta, GA | 430,000 | 15
    Schema S, table AGENTS:
      id | name       | city    | state | fee-rate
      32 | Mike Brown | Athens  | GA    | 0.03
      15 | Jean Laup  | Raleigh | NC    | 0.04
    Schema T, table LISTING:
      area        | list-price | agent-address | agent-name
      Denver, CO  | 550000     | Boulder, CO   | Laura Smith
      Atlanta, GA | 370800     | Athens, GA    | Mike Brown
  • Slide 5
  • The Big Merge
  • Slide 6
  • Making Tuples Using SQL
    area = SELECT location FROM HOUSES
    agent-address = SELECT concat(city, state) FROM AGENTS
    list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
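The three mappings can be tried out on the sample rows of the Real-Estate-Agents example. The following is a minimal sketch using Python's built-in sqlite3 module. The underscore column names (agent_id, fee_rate), the `||` operator used in place of concat, and the ", " separator are assumptions made so that the queries run in SQLite; they are not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE houses (location TEXT, price REAL, agent_id INTEGER);
    CREATE TABLE agents (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
    INSERT INTO houses VALUES ('Raleigh, NC', 360000, 32), ('Atlanta, GA', 430000, 15);
    INSERT INTO agents VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                              (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# 1-1 match: area = location
areas = [row[0] for row in conn.execute(
    "SELECT location FROM houses ORDER BY rowid")]

# Complex match on one table: agent-address = concat(city, state)
addresses = [row[0] for row in conn.execute(
    "SELECT city || ', ' || state FROM agents ORDER BY id")]

# Complex match across two tables: list-price = price * (1 + fee-rate)
list_prices = [round(row[0], 2) for row in conn.execute(
    "SELECT price * (1 + fee_rate) FROM houses JOIN agents ON agent_id = id "
    "ORDER BY price")]
```

The last query is the interesting one: it needs a join between HOUSES and AGENTS, which is why a match of this kind combines attributes from more than one table.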
  • Slide 7
  • How Do We Match?
    The process of creating mappings typically proceeds in two steps. In the first step, schema matching, we find matches between elements of the two schemas. In the second step, we elaborate the matches to create query expressions that enable automated data translation or exchange.
  • Slide 8
  • Schema Matches
    There are two kinds of schema matches: 1-1 matches and complex matches. A 1-1 match relates a single attribute in one schema to a single attribute in the other, for example area in LISTING and location in HOUSES.
  • Slide 10
  • Complex Matches
    Complex matches specify that some combination of attributes in one schema corresponds to a combination in the other, for example agent-address = concat(city, state) or list-price = price * (1 + fee-rate).
  • Slide 14
  • The Solution: The iMAP System
    We will describe the iMAP system, which semi-automatically discovers complex matches for relational data in a single table. In some cases iMAP is also able to find matches that combine attributes from multiple tables.
  • Slide 15
  • The iMAP Architecture
  • Slide 16
  • Match Generator Input: target schema and source schema. Output: match candidates.
  • Slide 17
  • How Match Generator Works
    The match generator uses search to go through the space of possible match candidates. The searchers use prior knowledge of possible match types together with heuristic methods.
  • Slide 18
  • The Internals of a Searcher
    Applying search to candidate generation involves three major issues:
    - Search strategy
    - Evaluation of candidate matches
    - Termination condition
  • Slide 19
  • Search Strategy
    The search space can be very large or even unbounded, so we need to search it efficiently. iMAP addresses this problem using a search technique called beam search.
  • Slide 20
  • Beam Search
    Beam search uses a scoring function to evaluate each match candidate. At each level of the search tree, it keeps only the k highest-scoring candidates. This lets the searcher work efficiently even when the search space is very large or unbounded.
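The beam search idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not iMAP's implementation: the candidate representation (a tuple of column names to concatenate), the `expand` and `score` functions, the aligned sample rows, and the fixed `max_levels` termination condition are all hypothetical.

```python
from difflib import SequenceMatcher

def beam_search(start, expand, score, k, max_levels):
    """Keep only the k highest-scoring candidates at each level of the search tree."""
    beam = sorted(((score(c), c) for c in start), reverse=True)[:k]
    best = list(beam)
    for _ in range(max_levels - 1):
        children = {child for _, cand in beam for child in expand(cand)}
        if not children:
            break
        beam = sorted(((score(c), c) for c in children), reverse=True)[:k]
        best = sorted(best + beam, reverse=True)[:k]
    return best

# Hypothetical sample rows, with the target value aligned to each source row.
rows = [
    {"name": "Mike Brown", "city": "Athens", "state": "GA", "target": "Athens, GA"},
    {"name": "Jean Laup", "city": "Raleigh", "state": "NC", "target": "Raleigh, NC"},
]
columns = ("name", "city", "state")

def expand(candidate):
    """Extend a concatenation candidate by one more unused column."""
    return [candidate + (col,) for col in columns if col not in candidate]

def score(candidate):
    """Average string similarity between the concatenated columns and the target."""
    ratios = [
        SequenceMatcher(None, ", ".join(r[c] for c in candidate), r["target"]).ratio()
        for r in rows
    ]
    return sum(ratios) / len(ratios)

best = beam_search([(c,) for c in columns], expand, score, k=2, max_levels=2)
```

With k=2 the search never scores every possible concatenation; it only expands the two most promising candidates of each level, which is what keeps the search tractable.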
  • Slide 21
  • Implemented Searchers on iMAP
  • Slide 22
  • Example: Unit Conversion Searcher
    The unit conversion searcher can identify a conversion between two different measurement units. It does so by looking at the names and data of the attributes (e.g., "hours", "kg", "$", etc.).
  • Slide 23
  • Example: Unit Conversion Searcher (cont.)
    The searcher finds the best conversion from a set of conversion functions between the units. In this case, since 1 kg is about 2.2 pounds, weight_kg = weight_pounds / 2.2.
    Tables before the merge:
      product | pounds
      apple   | 10

      Fruits and vegetables | kg
      banana                | 5
    Table after the merge:
      Fruits and vegetables | kg
      banana                | 5
      apple                 | about 4.5
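A name-based version of this lookup can be sketched as follows. The conversion table, the token-based unit detection, and the function names are assumptions made for illustration; the sketch uses the approximation 1 kg ≈ 2.2 pounds and does not look at the data, which a real searcher would also do.

```python
# Hypothetical table of factors: value_in_target = factor * value_in_source
CONVERSIONS = {
    ("pounds", "kg"): 1 / 2.2,
    ("kg", "pounds"): 2.2,
    ("hours", "minutes"): 60.0,
    ("minutes", "hours"): 1 / 60.0,
}

def detect_unit(attribute_name):
    """Return the first known unit mentioned in an attribute name, else None."""
    units = {unit for pair in CONVERSIONS for unit in pair}
    tokens = attribute_name.lower().replace("-", "_").split("_")
    return next((t for t in tokens if t in units), None)

def find_conversion(source_attr, target_attr):
    """Return (source_unit, target_unit, factor), or None if no conversion applies."""
    src, tgt = detect_unit(source_attr), detect_unit(target_attr)
    if src is None or tgt is None or src == tgt:
        return None
    factor = CONVERSIONS.get((src, tgt))
    return None if factor is None else (src, tgt, factor)

conversion = find_conversion("weight_pounds", "weight_kg")
apple_kg = 10 * conversion[2]  # 10 pounds of apples expressed in kg
```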
  • Slide 24
  • Similarity Estimator
    Input: match candidates. Output: similarity matrix. The similarity matrix stores a similarity score for each pair of target attribute and match candidate.
  • Slide 25
  • Similarity Estimator The similarity estimator gets the results from all the searchers. Then it gathers the data and calculates a final score for each match
  • Slide 26
  • Similarity Estimator (cont.)
    The similarity estimator uses two methods to score match pairs:
    - Name-based evaluator
    - Naive Bayes evaluator
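A name-based evaluator can be approximated with plain string similarity. The sketch below is an assumption about how such a score could be computed, using difflib from the Python standard library; combining evaluator scores by a simple average is likewise a hypothetical choice, and the Naive Bayes evaluator is not shown.

```python
from difflib import SequenceMatcher

def name_score(target_attr, source_attrs):
    """Best string similarity between the target name and any source attribute name."""
    return max(
        SequenceMatcher(None, target_attr.lower(), s.lower()).ratio()
        for s in source_attrs
    )

def final_score(evaluator_scores):
    """Combine the scores of several evaluators (here: a plain average)."""
    return sum(evaluator_scores) / len(evaluator_scores)

# Candidates for the target attribute list-price, described by the source
# attributes that each candidate formula uses.
good = name_score("list-price", ["price", "fee-rate"])
poor = name_score("list-price", ["city", "state"])
```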
  • Slide 27
  • Match Selector Input: Similarity matrix. Output: 1-1 and complex matches.
  • Slide 28
  • Match Selector Match Selector examines the score matrix and outputs the best matches under certain conditions.
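One simple selection rule is to keep, for each target attribute, its highest-scoring candidate, provided the score clears a threshold. The sketch below assumes that rule; the dictionary layout of the similarity matrix, the threshold value, and the scores themselves are hypothetical.

```python
def select_matches(similarity, threshold=0.5):
    """Pick, for each target attribute, its best candidate at or above the threshold."""
    best = {}
    for (target, candidate), score in similarity.items():
        if score >= threshold and score > best.get(target, ("", 0.0))[1]:
            best[target] = (candidate, score)
    return {target: candidate for target, (candidate, _) in best.items()}

# Hypothetical similarity matrix: (target attribute, match candidate) -> score
similarity = {
    ("area", "location"): 0.9,
    ("area", "city"): 0.6,
    ("list-price", "price"): 0.7,
    ("list-price", "price * (1 + fee-rate)"): 0.8,
    ("agent-name", "state"): 0.2,
}
matches = select_matches(similarity)
```

In this run agent-name receives no match at all, because its only candidate scores below the threshold.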
  • Slide 29
  • Exploiting Domain Knowledge
    Exploiting domain knowledge was shown to be beneficial in 1-1 matching. In complex matching it can be even more crucial, since early detection of unlikely matches saves valuable processing.
  • Slide 30
  • Domain Constraints
    Constraints are either present in the schema or provided by an expert or the user. iMAP considers three kinds of constraints:
    - Two attributes are unrelated
    - A constraint on a single attribute
    - Multiple schema attributes are unrelated
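The first kind of constraint lends itself to early pruning: a candidate that combines attributes declared unrelated can be discarded before it is scored. The sketch below assumes that a candidate is represented by the tuple of source attributes it uses; the constraint shown (name and fee-rate are unrelated) is a made-up example, and the other two kinds of constraints are not covered.

```python
def violates_unrelated(candidate_attrs, unrelated_groups):
    """True if the candidate uses all attributes of some group declared unrelated."""
    attrs = set(candidate_attrs)
    return any(set(group) <= attrs for group in unrelated_groups)

def prune(candidates, unrelated_groups):
    """Drop candidates that combine attributes declared unrelated."""
    return [c for c in candidates if not violates_unrelated(c, unrelated_groups)]

# Hypothetical constraint and candidates
unrelated = [("name", "fee-rate")]
candidates = [("city", "state"), ("name", "fee-rate"), ("price", "fee-rate")]
kept = prune(candidates, unrelated)
```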
  • Slide 31
  • Sources For Domain Constraints
    - Past complex matches
    - Overlap data
    - External data
  • Slide 32
  • Past Complex Matches
    We often find that we map the same or similar schemas repeatedly. iMAP can extract a template expression from such matches.
    Example: given the past match price = pr * (1 + 0.6), iMAP will extract VAR * (1 + CONST) and ask the numeric searcher to look for matches for that template.
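Template extraction of this kind can be sketched with Python's ast module (ast.unparse requires Python 3.9 or later). The rule used here is an assumption chosen to reproduce the example: every attribute name becomes VAR, every non-integer constant becomes CONST, and integer constants such as 1 are kept as they are.

```python
import ast

class _Templatize(ast.NodeTransformer):
    """Replace attribute names with VAR and non-integer constants with CONST."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=ast.Load()), node)

    def visit_Constant(self, node):
        if isinstance(node.value, float):
            return ast.copy_location(ast.Name(id="CONST", ctx=ast.Load()), node)
        return node

def extract_template(expression):
    """Turn the right-hand side of a past match into a template expression."""
    tree = ast.parse(expression, mode="eval")
    return ast.unparse(_Templatize().visit(tree).body)

template = extract_template("pr * (1+0.6)")
```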
  • Slide 33
  • Overlap Data
    In some cases, the source and the target share some of the same data. This can be used as information for the matching process. Searchers that exploit overlap data:
    - Overlap text searcher
    - Overlap numeric searcher
    - Overlap category and schema mismatch searcher
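When the same entities appear in both schemas, candidate formulas can be checked directly against the shared rows. The following is a minimal sketch of that idea for numeric attributes; the overlapping rows, the candidate formulas, and the agreement measure are assumptions for illustration and do not describe iMAP's actual searchers.

```python
import math

# Hypothetical tuples describing the same houses in both schemas.
overlap_rows = [
    {"price": 360000, "fee_rate": 0.03, "list_price": 370800},
    {"price": 430000, "fee_rate": 0.04, "list_price": 447200},
]

# Candidate formulas for the target attribute list_price.
candidates = {
    "price": lambda r: r["price"],
    "price + fee_rate": lambda r: r["price"] + r["fee_rate"],
    "price * (1 + fee_rate)": lambda r: r["price"] * (1 + r["fee_rate"]),
}

def agreement(formula, rows, target="list_price"):
    """Fraction of overlapping rows on which the formula reproduces the target value."""
    hits = sum(math.isclose(formula(r), r[target], rel_tol=1e-9) for r in rows)
    return hits / len(rows)

scores = {name: agreement(f, overlap_rows) for name, f in candidates.items()}
best = max(scores, key=scores.get)
```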
  • Slide 34
  • External Data
    External data is used as additional constraints on the attributes of a schema. It is usually provided by experts and can be very useful in schema matching.
  • Slide 35
  • Why do we need it?
  • Slide 36
  • Generating Explanations in iMAP
    iMAP's goal is to provide a design environment where a human user can quickly generate a mapping between a pair of schemas. For a user to know which match to choose, it is necessary to supply an explanation for each of the matches.
  • Slide 37
  • User Questions
    iMAP considers three questions that a user might ask:
    - Why does a match exist?
    - Why does a match not exist?
    - Why is one match better than another?
  • Slide 38
  • Explanation Generation
    iMAP keeps track of the decision-making process as a dependency graph. Each node is a schema attribute, an assumption, a candidate match, or a piece of domain knowledge. An edge between two nodes means that one node led to the other.
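A dependency graph like this can be walked backwards from a match to produce a "why does this match exist" explanation. The sketch below is hypothetical: the node labels, the dictionary representation (each node mapped to the nodes that led to it), and the indented text output are assumptions, not iMAP's actual format.

```python
# Hypothetical dependency graph: node -> nodes that led to it
depends_on = {
    "match: list-price = price * (1 + fee-rate)": [
        "candidate: price * (1 + fee-rate)",
        "domain knowledge: past match VAR * (1 + CONST)",
    ],
    "candidate: price * (1 + fee-rate)": [
        "attribute: price",
        "attribute: fee-rate",
    ],
}

def explain(node, graph, depth=0):
    """List a node followed by everything it depends on, indented by depth."""
    lines = ["  " * depth + node]
    for cause in graph.get(node, []):
        lines.extend(explain(cause, graph, depth + 1))
    return lines

explanation = explain("match: list-price = price * (1 + fee-rate)", depends_on)
```

Printing `"\n".join(explanation)` gives an indented trace from the chosen match down to the schema attributes and domain knowledge it rests on.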
  • Slide 39
  • Explanation Generation Example