A Metadata Integration Assistant Generator for Heterogeneous Databases
Young-Kwang Nam, Joseph Goguen, Guilian Wang
Jan 05, 2016
Data Integration in Synthetic Scientific Applications
[Figure: applications send a query to the integration system, which presents a global unified schema/ontology over the local schemas/ontologies of data sources 1 … n and returns an integrated result without inconsistency, etc.]
Why Difficult: Data Heterogeneity
• Platform & System Heterogeneity
– OS, hardware
– DBMSs, concurrency control and recovery capabilities
• Syntactic & Structural Heterogeneity
– Machine-readable aspects of representation
– Data models, schemas
• Semantic Heterogeneity
– Naming conflicts: synonyms, homonyms
– Scaling & precision conflicts
– Sampling rates, error distribution, etc.
More Difficult: Flexible Integration
• No all-encompassing system satisfies everyone:
– frequent update of sources
– frequent change of user requirements
– non-published data from one’s own lab
• Simplicity and readability are more desirable than completeness or exhaustiveness to domain scientists
• Domain knowledge is crucial for
– solving heterogeneities
– query optimization
• Desirable to support domain scientists to do data integration on their own
A Common Data Integration Architecture
[Figure: a query goes to the mediator, which reaches data sources 1 … n through one wrapper per source and returns the result as an integrated view, either materialized or virtual.]
Structural vs. Semantic (w.r.t. Mediation Level)
• Structural approach (mediated schema approach)
– integration by generating a mediated schema that characterizes a set of data sources
• Semantic approach (ontology-based approach)
– difficult to integrate structural aspects of sources from a semantic perspective, due to the semantics embedded in local schemas and implicit assumptions
– integration by sharing a common ontology among the different data sources
Global-as-view vs. Local-as-view (w.r.t. Mapping Direction)
• Global-as-view approach
– each item in the global schema/ontology is a view (query) over the source schemas/ontologies
– $\mathit{query}(G) = \mathit{query}(f(S_1, S_2, \ldots, S_n))$
– straightforward query rewriting
• Local-as-view approach
– each source is a view (query) over the global schema/ontology
– $\mathit{query}(G) = \mathit{query}(f_1^{-1}(S_1), f_2^{-1}(S_2), \ldots, f_n^{-1}(S_n))$
– easy to add or remove sources
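To make the difference in mapping direction concrete, here is a toy Python sketch. The two sources, their field names, and the price units are all invented for illustration. The local-as-view side is simplified to one attribute mapping per source, so it is not a general view-based query rewriter.

```python
# Hypothetical sources with different field names and price units.
S1 = [{"title": "Book A", "cost_cents": 6595}]
S2 = [{"name": "Book B", "price": 39.95}]


def gav_global_books(s1, s2):
    """Global-as-view: the global relation is written as one query f(S1, S2)."""
    rows = [{"title": r["title"], "price": r["cost_cents"] / 100} for r in s1]
    rows += [{"title": r["name"], "price": r["price"]} for r in s2]
    return rows


# Local-as-view (simplified): each source is described on its own, as a mapping
# from global attribute to (source field, optional conversion function).
LAV_VIEWS = {
    "S1": {"title": ("title", None), "price": ("cost_cents", lambda v: v / 100)},
    "S2": {"title": ("name", None), "price": ("price", None)},
}


def lav_answer(sources, attrs):
    """Answer a projection over the global schema from the per-source descriptions."""
    out = []
    for name, rows in sources.items():
        view = LAV_VIEWS[name]
        if not all(a in view for a in attrs):
            continue  # this source cannot supply every requested attribute
        for r in rows:
            row = {}
            for a in attrs:
                field, conv = view[a]
                row[a] = conv(r[field]) if conv else r[field]
            out.append(row)
    return out
```

In the global-as-view function, adding a third source means editing `gav_global_books` itself. In the local-as-view table it means adding one entry to `LAV_VIEWS`, which is the "easy to add or remove sources" property.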
Representative Systems
• TSIMMIS (Stanford & IBM, 1995)
• MedMaker (Stanford, 1996)
• MIX (SDSC&UCSD, 2000)
• IM (AT&T, 1996)
• Clio+Garlic (IBM, 2000)
• DIXSE (UT, 2001)
• XYLEME (2001)
• HERMES (UMD, 1994)
• SIMS (USC, 1996)
• Observer (UG, 1996)
• Infosleuth (MCC, 1997)
• COIN (MIT, 1999)
• Ontobroker (Ger., 2000)
• KIND (SDSC&UCSD, 2001)
Our Approach
• Virtual Integration: retrieve data and resolve conflicts at query time, easy maintenance
• Structural Approach: take users’ knowledge on data semantics hidden in structural information as input to achieve semantic mediation
• Local-as-view: easily adds or removes sources, convenient to fit applications
• GUI for specifying semantic mappings by assigning the same index to nodes (paths) that have the same meaning
• Automatically generates the DDXMI used for query decomposition
• Semantic functions
Current Prototype Architecture
[Figure: a user query (XML query) enters the query generator/collector, which consults the DDXMI (column or path for each DB) to produce query1 … queryn; XML/DB engine 1 … n run them over XML/DB1 … XML/DBn, and result1 … resultn are returned to the collector.]
Distributed Database XML Metadata Interface (DDXMI)
• Includes database or XML document names and location information
• Contains table column or XML path information
• Names the function or operation used to resolve semantic issues for table columns or XML elements and attributes
DDXMI DTD
<!ELEMENT DDXMIA (DDXMI.header, DDXMI.isequivalent, documentspec)>
<!ELEMENT DDXMI.header (documentation,version,date,authorization)>
<!ELEMENT documentation (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT authorization (#PCDATA)>
<!ELEMENT DDXMI.isequivalent (source,destination*)*>
<!ELEMENT source (#PCDATA)>
<!ELEMENT destination (#PCDATA)>
<!ELEMENT documentspec (document, (elementname,operation*)*)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT elementname (#PCDATA)>
<!ELEMENT operation (#PCDATA)>
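As a rough illustration of what a document following this DTD looks like, the Python sketch below builds one with the standard `xml.etree.ElementTree` module. The function name `build_ddxmi` and its argument shapes are assumptions made for this example, not part of the prototype. The code does not validate against the DTD; it only emits child elements in the order the DTD declares.

```python
import xml.etree.ElementTree as ET


def build_ddxmi(header, equivalences, document, elements):
    """Build a DDXMIA tree (hypothetical helper).

    header:       dict with documentation, version, date, authorization
    equivalences: list of (source_path, [destination_path, ...])
    document:     document name or location
    elements:     list of (element_name, [operation_name, ...])
    """
    root = ET.Element("DDXMIA")

    hdr = ET.SubElement(root, "DDXMI.header")
    for tag in ("documentation", "version", "date", "authorization"):
        ET.SubElement(hdr, tag).text = header[tag]

    # (source, destination*)* : each source is followed by its destinations.
    eq = ET.SubElement(root, "DDXMI.isequivalent")
    for source, destinations in equivalences:
        ET.SubElement(eq, "source").text = source
        for dest in destinations:
            ET.SubElement(eq, "destination").text = dest

    # (document, (elementname, operation*)*)
    spec = ET.SubElement(root, "documentspec")
    ET.SubElement(spec, "document").text = document
    for name, operations in elements:
        ET.SubElement(spec, "elementname").text = name
        for op in operations:
            ET.SubElement(spec, "operation").text = op
    return root


# Illustrative values only.
ddxmi = build_ddxmi(
    header={"documentation": "book mappings", "version": "1.0",
            "date": "unknown", "authorization": "none"},
    equivalences=[("/book", ["/bookstore/book"])],
    document="book3.xml",
    elements=[("/bookstore/book/author/name", ["fstring", "lstring"])],
)
xml_text = ET.tostring(ddxmi, encoding="unicode")
```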
How to generate DDXMI
• Define a master DTD (global schema), based on application requirements, that selects elements or tables from the distributed systems
• Parse the master DTD and generate, for each element, the path from the root to that element
• Assign the master index number to each site element node that has the same meaning as the master DTD node
• A function name may be attached to some nodes
• Generate the DDXMI file automatically by collecting the nodes that share the same index number
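The path-generation step can be sketched in Python as below. Two assumptions are made. First, the DTD is not parsed here; the master element tree is written out by hand as nested `(name, children)` tuples. Second, the numbering rule (a child's index is its parent's index followed by the child's 1-based position) is inferred from the master index listing in these slides, not taken from the prototype's code.

```python
def index_paths(name, children, index="1", prefix=""):
    """Return (index number, root-to-element path) rows in document order."""
    path = f"{prefix}/{name}"
    rows = [(index, path)]
    for position, (child, grandchildren) in enumerate(children, start=1):
        # Child index = parent index followed by the child's position.
        rows += index_paths(child, grandchildren, f"{index}{position}", path)
    return rows


# Master tree for the book example; the child order is an assumption
# chosen so that the indices match the master index listing.
MASTER_TREE = ("book", [
    ("price", []),
    ("author", [("full_name", [("first_name", []), ("last_name", [])])]),
    ("title", []),
    ("year", []),
    ("publisher", []),
    ("editor", [("affiliation", []), ("full_name", [])]),
])

master_rows = index_paths(*MASTER_TREE)
```

This concatenation scheme becomes ambiguous once an element has more than nine children, so a real implementation would need a separator or fixed-width positions.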
Generate Master Index
Site 1: Book1 DTD Tree
Site1 Index (Book1 path information; columns: index number, path, optional function name)
0 book1.xml
1 /bib/book
11 /bib/book/price
12 /bib/book/author
1211 /bib/book/author/first
1212 /bib/book/author/last
13 /bib/book/title
15 /bib/book/publisher
16 /bib/book/editor
161 /bib/book/editor/affiliation
162 /bib/book/editor/last
162 /bib/book/editor/first
Master Index
0 book.xml
1 /book
11 /book/price
12 /book/author
121 /book/author/full_name
1211 /book/author/full_name/first_name
1212 /book/author/full_name/last_name
13 /book/title
14 /book/year
15 /book/publisher
16 /book/editor
161 /book/editor/affiliation
162 /book/editor/full_name
Site 2: Book2 DTD Tree
Site2 Index (Book2 path information)
0 book2.xml
1 /arts/book
12 /arts/book/author
1211 /arts/book/author/firstname
1212 /arts/book/author/lastname
13 /arts/book/title
15 /arts/book/publisher
Master Index: identical to the listing under Site 1
Site 3: Book3 DTD Tree
Site3 Index (Book3 path information)
0 book3.xml
1 /bookstore/book
11 /bookstore/book/price
12 /bookstore/book/author
1211 /bookstore/book/author/name
1212 /bookstore/book/author/name
13 /bookstore/book/title
Master Index: identical to the listing under Site 1
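The "collect by same index number" step can be sketched as follows. The index rows are copied from the listings above; the function name `collect_equivalences` and the dictionary output format are assumptions made for this example. A site index number that appears on several rows yields several destinations for one master path.

```python
from collections import defaultdict

# Master index rows (index number -> master path).
MASTER_INDEX = {
    "1": "/book",
    "11": "/book/price",
    "12": "/book/author",
    "121": "/book/author/full_name",
    "1211": "/book/author/full_name/first_name",
    "1212": "/book/author/full_name/last_name",
    "13": "/book/title",
    "14": "/book/year",
    "15": "/book/publisher",
    "16": "/book/editor",
    "161": "/book/editor/affiliation",
    "162": "/book/editor/full_name",
}

# Site 1 rows (index number, site path); index 162 is assigned to two nodes.
SITE1_INDEX = [
    ("1", "/bib/book"),
    ("11", "/bib/book/price"),
    ("12", "/bib/book/author"),
    ("1211", "/bib/book/author/first"),
    ("1212", "/bib/book/author/last"),
    ("13", "/bib/book/title"),
    ("15", "/bib/book/publisher"),
    ("16", "/bib/book/editor"),
    ("161", "/bib/book/editor/affiliation"),
    ("162", "/bib/book/editor/last"),
    ("162", "/bib/book/editor/first"),
]


def collect_equivalences(master_index, site_index):
    """Group site paths by index number and pair them with the master path."""
    grouped = defaultdict(list)
    for index, path in site_index:
        grouped[index].append(path)
    return {master_index[i]: paths
            for i, paths in grouped.items() if i in master_index}


site1_equivalences = collect_equivalences(MASTER_INDEX, SITE1_INDEX)
```

Master paths with no site counterpart, such as `/book/year` for Site 1, are simply absent from the result.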
XML Query Languages
• XQL: takes a document point of view
• XML-QL: takes a database point of view
• Quilt: draws from both areas
– proposed by Don Chamberlin, Jonathan Robie, and Daniela Florescu
– Kweelt (University of Washington), an XML query engine based on Quilt, is used in our prototype
• The XQuery proposal follows Quilt closely
How to generate site queries
• Parse the master query, a query over the global schema
• When a path is encountered, get the corresponding path name from the DDXMI file, depending on the kind of path, and substitute it
• If there is no corresponding path in the DDXMI, treat it as a null value
– no query is generated for that site
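The substitution step might look like the Python sketch below. It assumes the master query has already been parsed into its parts (root path, filter path, return path), so no Quilt parser is involved. The function name, its parameters, and the exact text of the generated query are illustrative assumptions, not the prototype's output format.

```python
def generate_site_query(doc, mapping, filter_path, filter_value,
                        return_path, root="/book"):
    """Translate a simple filter-and-return master query for one site.

    mapping maps master paths to site paths (as read from the DDXMI).
    Returns None when any needed master path has no site counterpart.
    """
    needed = [root, filter_path, return_path]
    if any(path not in mapping for path in needed):
        return None  # no counterpart in the DDXMI: no query for this site

    site_root = mapping[root]

    def relative(master_path):
        # Cut the site root prefix, since $book is already bound to it.
        # Assumes every mapped site path starts with the site root.
        return mapping[master_path][len(site_root) + 1:]

    return (
        f'FOR $book IN document("{doc}"){site_root}'
        f'[{relative(filter_path)} = "{filter_value}"] '
        f'RETURN <book>$book/{relative(return_path)}</book>'
    )


BOOK1 = {"/book": "/bib/book",
         "/book/publisher": "/bib/book/publisher",
         "/book/title": "/bib/book/title"}
BOOK3 = {"/book": "/bookstore/book",
         "/book/title": "/bookstore/book/title"}

q1 = generate_site_query("book1.xml", BOOK1, "/book/publisher",
                         "Addison-Wesley", "/book/title")
q3 = generate_site_query("book3.xml", BOOK3, "/book/publisher",
                         "Addison-Wesley", "/book/title")
```

Here `q3` is `None` because Book3 has no publisher path, which corresponds to the "no query is generated for that site" case.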
How to get site element names
[Figure: master index tree (book with children price, author, publisher, year, title, editor; author/full_name with first_name and last_name; editor with affiliation and full_name) mapped to the site index tree (bookstore/book/price_info/price).]
DDXMI
<source>book</source> <destination>bookstore/book</destination>
<source>book/price</source> <destination>bookstore/book/price_info/price</destination>
[In Quilt Query]
1. book → bookstore/book
2. price → bookstore/book/price_info/price, cut to price_info/price
1:1 Mapping Example
FOR $book IN document("book.xml")//book[publisher = "Addison-Wesley"]
RETURN <book>$book/title</book>
[Figure: master index tree; Book1 (bib/book with publisher, title), Book2 (arts/book with publisher, title), Book3 (bookstore/book with title).]
Query Execution Result
1:N Mapping Example
FOR $edi IN document("book.xml")//book/editor
RETURN <editor>$edi/full_name</editor>
[Figure: master index tree; Book1 (bib/book/editor with last, first), Book2 (arts/book), Book3 (bookstore/book).]
DDXMI
<source>/book/editor/full_name</source>
<destination>/bib/book/editor/last,/bib/book/editor/first</destination>
Query Execution Result
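A 1:N entry like the one in this example, where a single destination element lists several comma-separated site paths, could be read and expanded as in the Python sketch below. The wrapper element in `FRAGMENT` follows the DDXMI DTD. The helper names and the shape of the expanded return expression are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<DDXMI.isequivalent>
<source>/book/editor/full_name</source>
<destination>/bib/book/editor/last,/bib/book/editor/first</destination>
</DDXMI.isequivalent>"""


def read_equivalences(xml_text):
    """Map each source path to the list of destination paths that follow it."""
    mapping = {}
    current = None
    for child in ET.fromstring(xml_text):
        if child.tag == "source":
            current = child.text.strip()
            mapping[current] = []
        elif child.tag == "destination" and current is not None:
            # One destination element may hold several comma-separated paths.
            mapping[current] += [p.strip() for p in child.text.split(",")]
    return mapping


def expand_return(variable, site_root, destinations):
    """Rewrite one master return path as several site paths under `variable`."""
    return ", ".join(f"{variable}/{d[len(site_root) + 1:]}" for d in destinations)


equivalences = read_equivalences(FRAGMENT)
returned = expand_return("$edi", "/bib/book/editor",
                         equivalences["/book/editor/full_name"])
```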
N:1 Mapping Example
FOR $a IN document("book.xml")//book//author
RETURN <author> $a/last_name,$a/first_name </author>
[Figure: master index tree; Book1 (bib/book/author with last, first), Book2 (arts/book/author with lastname, firstname), Book3 (bookstore/book/author/name).]
<operation>lstring</operation>
<operation>fstring</operation>
Query Generation Result
import split as UDF_split;
FUNCTION fstring($str){ split(" ",$str)[1]}
FUNCTION lstring($str){ split(" ",$str)[2]}
FOR $a IN document("book3.xml") //book//author
RETURN <author> fstring($a/name), lstring($a/name)</author>
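For readers unfamiliar with Quilt, the two conversion functions can be mirrored in Python as below. This assumes that the Quilt positions `[1]` and `[2]` are 1-based, and that a name is stored as a first name and a last name separated by one space. The sample name is made up.

```python
def fstring(name):
    """First token of a space-separated name (Quilt position [1])."""
    return name.split(" ")[0]


def lstring(name):
    """Second token of a space-separated name (Quilt position [2])."""
    return name.split(" ")[1]


sample = "Jane Doe"  # hypothetical author/name value
author = (fstring(sample), lstring(sample))
```

As in the Quilt version, a name with a middle part would return the middle part from `lstring`, and a single-token name has no second position, so real data would need a more careful splitting rule.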
Query Execution Result
Semantic Function Involved Example
FOR $book IN document("book.xml")//book
RETURN <book> $book/title,$book/author,$book/price </book>
<operation>div(100)</operation>
[Figure: master index tree; Book1 (bib/book/price), Book2 (arts/book), Book3 (bookstore/book/price).]
Query Execution Result
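One way to interpret an operation such as `div(100)` is to parse the name and its numeric argument and apply it to the site value, for example to convert a price stored in cents to the master schema's unit. The Python sketch below assumes that reading. The registry `OPERATIONS` and the function `apply_operation` are hypothetical, and only `div` comes from the slides.

```python
import re

# Registry of semantic functions; only "div" appears in the example.
OPERATIONS = {
    "div": lambda value, arg: value / arg,
}

_OPERATION_PATTERN = re.compile(r"(\w+)\((-?\d+(?:\.\d+)?)\)")


def apply_operation(spec, value):
    """Apply an operation written as name(number), e.g. "div(100)", to value."""
    match = _OPERATION_PATTERN.fullmatch(spec.strip())
    if match is None:
        raise ValueError(f"unrecognized operation: {spec!r}")
    name, arg = match.group(1), float(match.group(2))
    if name not in OPERATIONS:
        raise ValueError(f"unknown operation name: {name!r}")
    return OPERATIONS[name](value, arg)


converted = apply_operation("div(100)", 6595)  # hypothetical price in cents
```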
Remaining Issues
• Handle attributes: one DTD has an attribute but others don’t, or an attribute in one DTD appears as an element in others
• Find a more efficient way to generate the DDXMI file automatically when there are many paths in the master DTD
– e.g., tree-to-tree mapping: if two paths are indicated as the same and have the same children, then the index numbers should be generated automatically
• Migrate to XML schemas instead of DTDs
• Support JOIN and PRODUCT generated by queries
• Move to XQuery and a query engine with distributed query support
• Integrate the individual site query results into a single data source ready for further analysis
• Provide mechanisms for removing redundancy
• Justify the semantics of the generated queries
Conclusion
• Our prototype uses distributed metadata to generate a GUI tool for describing mappings between the master and local databases by assigning index numbers and specifying conversion function names
• It uses Quilt as its XML query language
• A DDXMI file is generated from the mappings and is used to translate queries over the virtual master database into sub-queries to the local databases
• An experiment testing feasibility is reported, in which 3 different bibliography databases are integrated
• Implemented with Java Webserver and JavaCC
• Move to real applications, e.g., in the context of the NSF project SEEK (Science Environment for Ecological Knowledge)