Generic Schema Matching using Cupid
Jayant Madhavan, University of Washington
Philip A. Bernstein, Microsoft Research
Erhard Rahm, University of Leipzig
VLDB 2001, Roma, Italy
September 11th 2001
Schema Matching
[Figure: two purchase-order schemas and the correspondences between their elements. Schema PO contains POLines, Item, Line, Qty, UoM, POShipTo, City, Street. Schema PurchaseOrder contains Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, City, Street, Name. Correspondences shown: POShipTo ↔ DeliverTo, Line ↔ ItemNumber, Qty ↔ Quantity, UoM ↔ UnitofMeasure.]
The Problem
• Given two schemas, obtain a mapping between them that identifies corresponding elements
• A hard problem
– Naming and structural differences in schemas
– Similar, but non-identical concepts modeled
– Multiple data models: SQL DDL, XML, ODMG…

– Minimize user involvement (semi-automatic)
– Data model independent matching (generic)
Motivation
• Important component in many applications
– Data Integration
– Data Migration
– E-Commerce
• Model Management [Bernstein, Halevy, Pottinger ’00]
– Algebra for manipulating models and mappings
– Match, Merge, Compose …
• Taxonomy-based survey [Rahm, Bernstein ’00]
Related Work
• Hybrid approaches for schema integration
– DIKE [Palopoli, Sacca, Ursino, Terracina]
– MOMIS [Bergamaschi, Castano, Vincini]
• Linguistic and Instance based
– SEMINT, DELTA [Clifton, Hausman, Rosenthal, Li]
• Instance based Multi-strategy learning
– LSD [Doan, Domingos, Halevy]

• Taxonomy of schema matching approaches
• Cupid system that exploits linguistic, data-type, structure and referential integrity information
– New algorithm that exploits schema structure
• Experimental validation and comparison with other systems
Cupid architecture
[Figure: Schema 1 and Schema 2 are the inputs. Linguistic Matching, supported by a Thesaurus, produces LSIM. Structure Matching produces SSIM and WSIM. Generate Mapping produces the Output Mapping.]
Linguistic Matching
– Tokenization of names
  POOrderNum → PO, Order, Num
– Expansion of short-forms, acronyms
  PO → Purchase, Order; Num → Number
– Clustering of schema elements based on keywords and data-types
  Street, City, POAddress → Address
– Thesaurus of synonyms, hypernyms, acronyms
– Linguistic Similarity coefficient (lsim) ∈ [0,1]
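The tokenization, expansion and thesaurus steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abbreviation table, the thesaurus entry and its strength of 0.7, and the averaging rule in `lsim` are hypothetical choices, not values or formulas taken from the slides, and data types are ignored.

```python
import re

# Hypothetical lookup tables; a real thesaurus would be far larger.
ABBREVIATIONS = {"po": ["purchase", "order"], "num": ["number"], "qty": ["quantity"]}
THESAURUS = {frozenset(("invoice", "bill")): 0.7}  # assumed synonym strength


def tokenize(name):
    """Split a name into tokens at case changes and digit runs."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def expand(tokens):
    """Lower-case tokens and expand known short forms."""
    expanded = []
    for token in tokens:
        key = token.lower()
        expanded.extend(ABBREVIATIONS.get(key, [key]))
    return expanded


def token_sim(a, b):
    """1.0 for identical tokens, else the thesaurus strength (0.0 if absent)."""
    return 1.0 if a == b else THESAURUS.get(frozenset((a, b)), 0.0)


def lsim(name1, name2):
    """Assumed scoring rule: average best-match token similarity, both directions."""
    t1 = set(expand(tokenize(name1)))
    t2 = set(expand(tokenize(name2)))
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))


print(tokenize("POOrderNum"))
print(lsim("POOrderNum", "PurchaseOrderNumber"))
```

With these tables, `POOrderNum` and `PurchaseOrderNumber` expand to the same token set, so their score is the maximum of the [0,1] range, while unrelated names such as `City` and `Street` score 0.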

• Atomic elements are similar if
– Linguistically and data-type similar
– Their ancestors are similar
• Compound elements (non-leaf) are similar if
– Linguistically similar
– Subtrees rooted at the elements are similar
• Mutually recursive
– Leaves determine internal node similarity
– Similarity of internal nodes leads to increase in leaf similarity
Structure Match details
• Subtrees are similar if
– Immediate children are similar
– Leaf sets are similar
• Subtree Similarity (nodes s and t)
– Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa
– Less sensitive to variation in intermediate structure
• Pruning the number of comparisons
– Elements must have comparable number of leaves
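The leaf-based subtree similarity and the pruning rule can be sketched as follows. The tree encoding, the `threshold` of 0.5, the pruning `factor` of 2 and the exact-name `leaf_sim` stand-in are all assumptions made for illustration. In Cupid itself the leaf similarities would come from the linguistic and structural phases, and the example trees are only loosely based on the purchase-order schemas.

```python
def leaves(node):
    """Return the leaf names under a (name, children) node."""
    name, children = node
    if not children:
        return [name]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


def comparable(n1, n2, factor=2.0):
    """Pruning test: leaf counts must be within an assumed factor of each other."""
    return max(n1, n2) <= factor * min(n1, n2)


def ssim(s, t, leaf_sim, threshold=0.5):
    """Fraction of leaves in s and t that map to some leaf in the other subtree."""
    ls, lt = leaves(s), leaves(t)
    if not comparable(len(ls), len(lt)):
        return 0.0  # pruned: subtree sizes differ too much
    linked_s = sum(1 for x in ls if any(leaf_sim(x, y) >= threshold for y in lt))
    linked_t = sum(1 for y in lt if any(leaf_sim(x, y) >= threshold for x in ls))
    return (linked_s + linked_t) / (len(ls) + len(lt))


def leaf_sim(x, y):
    """Stand-in leaf similarity: exact name match only."""
    return 1.0 if x == y else 0.0


# Hypothetical subtrees; DeliverTo has an extra intermediate Address node.
PO_SHIP_TO = ("POShipTo", [("City", []), ("Street", [])])
DELIVER_TO = ("DeliverTo", [("Address", [("Street", []), ("City", [])])])
ITEM = ("Item", [("Line", []), ("Qty", []), ("UoM", [])])

print(ssim(PO_SHIP_TO, DELIVER_TO, leaf_sim))
print(ssim(ITEM, DELIVER_TO, leaf_sim))
```

Because only leaves are compared, the extra `Address` level under `DeliverTo` does not lower the score, which is the sense in which the measure is less sensitive to intermediate structure.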
Referential Integrity
• Join nodes added to the schema tree for each referential integrity constraint
• Views can be similarly used
[Figure: Schema A has the elements Purchase Order (Order ID, Product Name, Customer ID) and Customer (Customer ID, Name, Address), plus a join node Order-Customer-fk. Schema B has the element Customer-Purchase-Order.]
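One way to picture the join nodes is the sketch below. It assumes that a join node's children are the qualified columns of both tables taking part in the foreign key. That choice, along with the dictionary encoding and the function name `add_join_nodes`, is an assumption for illustration rather than something the slide specifies.

```python
def add_join_nodes(tables, foreign_keys):
    """Return a schema tree with one extra node per referential constraint.

    tables: dict mapping table name -> list of column names
    foreign_keys: dict mapping constraint name -> (referencing table, referenced table)
    """
    tree = {name: list(columns) for name, columns in tables.items()}
    for fk_name, (source, target) in foreign_keys.items():
        # Assumption: the join node groups the columns of both tables,
        # qualified by table name so that shared column names stay distinct.
        tree[fk_name] = [
            f"{table}.{column}"
            for table in (source, target)
            for column in tables[table]
        ]
    return tree


schema_a = {
    "Purchase Order": ["Order ID", "Product Name", "Customer ID"],
    "Customer": ["Customer ID", "Name", "Address"],
}
constraints = {"Order-Customer-fk": ("Purchase Order", "Customer")}

tree = add_join_nodes(schema_a, constraints)
print(tree["Order-Customer-fk"])
```

The resulting `Order-Customer-fk` node gives Schema A a single element that spans both tables, which can then be compared against an element such as `Customer-Purchase-Order` in Schema B.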
Cupid architecture
[Figure: Schema 1 and Schema 2 are the inputs. Linguistic Matching, supported by a Thesaurus, produces Lsim. Structure Matching produces Ssim and Wsim. Generate Mapping produces the Output Mapping.]

Linguistic Similarity (Lsim)
Element 1        Element 2      Lsim
InvoiceTo        BillTo         0.7
UoM              UnitMeasure    0.9
City             City           1.0

Structural (Ssim), Weighted (Wsim) similarity
Element 1        Element 2      Ssim   Wsim
InvoiceTo        BillTo         0.8    0.7
UoM              UnitMeasure    0.7    0.8
InvoiceTo/City   BillTo/City    0.8    0.9
Mapping Generation
• Individual mapping elements computed from Wsim values
– Consider only mapping pairs that have Wsim greater than threshold
– For each element of target find most similar source element
– Mappings that are not accepted but have high similarity are returned to help the user modify the map
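The three steps above can be sketched as a small function. The threshold of 0.5, the dictionary encoding of Wsim values, the function name `generate_mapping` and the sample scores are assumptions for illustration. Ties between equally similar source elements are not handled.

```python
def generate_mapping(wsim, threshold=0.5):
    """Pick the best source element for each target element.

    wsim: dict mapping (source, target) -> weighted similarity in [0, 1]
    Returns (accepted, rejected): accepted maps target -> source, rejected
    lists the remaining above-threshold pairs, best first.
    """
    best = {}
    for (source, target), score in wsim.items():
        if score <= threshold:
            continue  # keep only pairs with Wsim greater than the threshold
        if target not in best or score > best[target][1]:
            best[target] = (source, score)
    accepted = {target: source for target, (source, _) in best.items()}
    # Above-threshold pairs that lost to a better candidate are kept as hints.
    rejected = sorted(
        (
            (score, source, target)
            for (source, target), score in wsim.items()
            if score > threshold and accepted[target] != source
        ),
        reverse=True,
    )
    return accepted, rejected


# Hypothetical Wsim values, not taken from the slides.
scores = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "UnitofMeasure"): 0.8,
    ("Qty", "UnitofMeasure"): 0.6,
    ("Line", "ItemNumber"): 0.7,
    ("City", "Quantity"): 0.1,
}
accepted, rejected = generate_mapping(scores)
print(accepted)
print(rejected)
```

Here `Qty` to `UnitofMeasure` clears the threshold but loses to `UoM`, so it ends up in the rejected list that a user could consult when correcting the map.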
Cupid Architecture
[Figure: the same pipeline (Schema 1 and Schema 2, Linguistic Matching with Thesaurus producing Lsim, Structure Matching producing Ssim and Wsim, Generate Mapping, Output Mapping), with an additional "Input hint" element.]
Experimental Validation
• DIKE
– Graph Matching of ER models
– No Lsim component (LSPD entries)
• MOMIS
– Class Level Matching of OO descriptions
– Word senses manually chosen from WordNet

• Cupid is less sensitive to name variations due to token level manipulations
• MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques
• MOMIS has an interface to WordNet
– Word senses need to be chosen manually
– Choosing a single sense is not always possible
• Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid)

• DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements
• Leaf structure for sub-tree similarity relaxes requirements on intermediate structure match
• Class-level structural similarity in MOMIS can be restrictive while matching schemas with different nesting
• Context-dependent matching in Cupid resolves mapping ambiguity
• Linguistic similarity with complete path names (and no structural similarity) is insufficient
Summary
• Taxonomy of schema matching approaches
• Cupid system that performs linguistic and structural matching
• New algorithm for exploiting schema structure
• Comparative evaluation
Future Work
• Towards a more robust solution
– Auto-tuning parameters
– Thesaurus Generation and Evolution
– More scalability testing
• Schema matching component architecture
– Easily extensible by adding multiple techniques
– Data Instances for matching
– Mapping, Expression and Query Discovery
• Model Management
Model Management
• Other recent publications
– A Model Theory for Generic Schema Management, DBPL 2001
– Generic Model Management: A Database Infrastructure for Schema Manipulation, CoopIS 2001
– A Vision for Management of Complex Models, Sigmod Record, Dec 2000
– Data Warehouse Scenarios for Model Management, ER