A Metadata Integration Assistant Generator for Heterogeneous Databases

Young-Kwang Nam

Joseph Goguen

Guilian Wang


Data Integration in Synthetic Scientific Applications

[Figure: applications send a query against a global unified schema/ontology to the integration system, which draws on data sources 1 … n, each with its own local schema/ontology, and returns an integrated result without inconsistency, etc.]

Why Difficult: Data Heterogeneity

• Platform & System Heterogeneity
  – OS, Hardware
  – DBMSs, Concurrency control and recovery capabilities

• Syntactic & Structural Heterogeneity
  – Machine readable aspects of representation
  – Data models, Schemas

• Semantic Heterogeneity
  – Naming conflicts: synonyms, homonyms
  – Scaling & precision conflicts
  – Sampling rates, error distribution, etc.

More Difficult: Flexible Integration

• No all-encompassing system satisfies everyone:
  – frequent update of sources
  – frequent change of user requirements
  – non-published data from one’s own lab

• Domain scientists value simplicity and readability over completeness or exhaustiveness

• Domain knowledge is crucial for
  – solving heterogeneities
  – query optimization

• Desirable to let domain scientists do data integration on their own

A Common Data Integration Architecture

[Figure: a query goes to the mediator, which reaches data sources 1 … n through one wrapper per source and returns the result as an integrated view, materialized or virtual]

Structural vs. Semantic wrt Mediation Level

• Structural approach (Mediated schema approach)
  – integration by generating a mediated schema that characterizes a set of data sources

• Semantic approach (Ontology-based approach)
  – difficult to integrate structural aspects of sources from a semantic perspective due to inherent embedded semantics within local schemas & implicit assumptions
  – integration by sharing a common ontology among the different data sources

Global-as-view vs. Local-as-view wrt Mapping Direction

• Global-as-view approach
  – each item in Global schema/ontology as a view (query) over source schemas/ontologies
  – query(G) = query(f(S1, S2, …, Sn))
  – straightforward query rewriting

• Local-as-view approach
  – each source as a view/query over global schema/ontology
  – query(G) = query(f1^-1(S1), f2^-1(S2), …, fn^-1(Sn))
  – easy adding or removing sources

Representative Systems

• TSIMMIS (Stanford & IBM, 1995)

• MedMaker (Stanford, 1996)

• MIX (SDSC&UCSD, 2000)

• IM (AT&T, 1996)

• Clio+Garlic (IBM, 2000)

• DIXSE (UT, 2001)

• XYLEME (2001)

• HERMES (UMD, 1994)

• SIMS (USC, 1996)

• Observer (UG, 1996)

• Infosleuth (MCC, 1997)

• COIN (MIT, 1999)

• Ontobroker (Ger., 2000)

• KIND (SDSC&UCSD, 2001)

Our Approach

• Virtual Integration: retrieve data and resolve conflicts at query time, easy maintenance

• Structural Approach: take users’ knowledge on data semantics hidden in structural information as input to achieve semantic mediation

• Local-as-view: easily adds or removes sources, convenient to fit applications

• GUI for specifying semantic mappings by assigning the same index to nodes (paths) that have the same meaning

• Automatically generate DDXMI for query decomposition

• Semantic functions

Current Prototype Architecture

[Figure: a user query (XML query) enters the query generator/collector, which consults the DDXMI (column or path for each DB) to produce query1 … queryn; each query runs on its own XML/DB engine over XML/DB1 … XML/DBn, and result1 … resultn flow back to the collector]

Distributed Database XML Metadata Interface (DDXMI)

• Includes database or XML document name or location information

• Contains table column or XML path information

• Records function or operation names for resolving semantic issues about table columns or XML elements and attributes

DDXMI DTD

<!ELEMENT DDXMIA (DDXMI.header, DDXMI.isequivalent, documentspec)>
<!ELEMENT DDXMI.header (documentation,version,date,authorization)>
<!ELEMENT documentation (#PCDATA)>
<!ELEMENT version (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT authorization (#PCDATA)>
<!ELEMENT DDXMI.isequivalent (source,destination*)*>
<!ELEMENT source (#PCDATA)>
<!ELEMENT destination (#PCDATA)>
<!ELEMENT documentspec (document, (elementname,operation*)*)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT elementname (#PCDATA)>
<!ELEMENT operation (#PCDATA)>
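To make the DTD concrete, here is a minimal Python sketch that reads the source/destination pairs out of a small DDXMI document. The sample document is hypothetical: its header values and the pairing of fstring with book3.xml are illustrative assumptions, and only the element names come from the DTD. The function name read_equivalences is also an assumption, not part of the prototype.

```python
import xml.etree.ElementTree as ET

# Hypothetical DDXMI instance shaped after the DTD; header values are made up.
SAMPLE = """<DDXMIA>
  <DDXMI.header>
    <documentation>illustrative sample</documentation>
    <version>1.0</version>
    <date>unspecified</date>
    <authorization>none</authorization>
  </DDXMI.header>
  <DDXMI.isequivalent>
    <source>/book</source>
    <destination>/bib/book</destination>
    <source>/book/editor/full_name</source>
    <destination>/bib/book/editor/last,/bib/book/editor/first</destination>
    <source>/book/year</source>
  </DDXMI.isequivalent>
  <documentspec>
    <document>book3.xml</document>
    <elementname>/bookstore/book/author/name</elementname>
    <operation>fstring</operation>
  </documentspec>
</DDXMIA>"""


def read_equivalences(xml_text):
    """Map each <source> to the <destination> values that follow it."""
    root = ET.fromstring(xml_text)
    pairs = {}
    current = None
    for section in root:
        if section.tag != "DDXMI.isequivalent":
            continue
        for child in section:
            if child.tag == "source":
                current = child.text.strip()
                pairs[current] = []
            elif child.tag == "destination" and current is not None:
                pairs[current].append(child.text.strip())
    return pairs


if __name__ == "__main__":
    for source, destinations in read_equivalences(SAMPLE).items():
        print(source, "->", destinations)
```

A source with no destination, such as /book/year here, comes back with an empty list, which matches the (source,destination*)* content model.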

How to generate DDXMI

• Define a Master DTD (global schema) based on application requirements for choosing elements or tables from the distributed systems

• Parse the master DTD and generate a path for each element from root to current element

• Assign the master index number to the site element node that has the same meaning as the master DTD node

• May include a function name for some nodes

• Generate the DDXMI file automatically by collecting the paths that share the same index number
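The collection step in the last bullet can be sketched in Python as follows. The index data is the master index and the Site 1 (book1.xml) index from the slides; the function names collect_mappings and render_isequivalent, and the choice to join several site paths with a comma in one destination, are assumptions made for illustration rather than the prototype's actual code.

```python
# Master index and Site 1 index, as (index number, path) pairs.
MASTER_INDEX = [
    ("1", "/book"),
    ("11", "/book/price"),
    ("12", "/book/author"),
    ("121", "/book/author/full_name"),
    ("1211", "/book/author/full_name/first_name"),
    ("1212", "/book/author/full_name/last_name"),
    ("13", "/book/title"),
    ("14", "/book/year"),
    ("15", "/book/publisher"),
    ("16", "/book/editor"),
    ("161", "/book/editor/affiliation"),
    ("162", "/book/editor/full_name"),
]

SITE1_INDEX = [
    ("1", "/bib/book"),
    ("11", "/bib/book/price"),
    ("12", "/bib/book/author"),
    ("1211", "/bib/book/author/first"),
    ("1212", "/bib/book/author/last"),
    ("13", "/bib/book/title"),
    ("15", "/bib/book/publisher"),
    ("16", "/bib/book/editor"),
    ("161", "/bib/book/editor/affiliation"),
    ("162", "/bib/book/editor/last"),
    ("162", "/bib/book/editor/first"),
]


def collect_mappings(master_index, site_index):
    """Pair every master path with the site paths sharing its index number."""
    site_paths = {}
    for index, path in site_index:
        site_paths.setdefault(index, []).append(path)
    return [(path, site_paths.get(index, [])) for index, path in master_index]


def render_isequivalent(mappings):
    """Emit a DDXMI.isequivalent fragment; unmapped sources get no destination."""
    lines = ["<DDXMI.isequivalent>"]
    for source, destinations in mappings:
        lines.append(f"  <source>{source}</source>")
        if destinations:
            lines.append(f"  <destination>{','.join(destinations)}</destination>")
    lines.append("</DDXMI.isequivalent>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_isequivalent(collect_mappings(MASTER_INDEX, SITE1_INDEX)))
```

Because index 162 is assigned to two Site 1 nodes, /book/editor/full_name collects two site paths, which is the 1:N case.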

Generate Master Index

Site 1 : Book1 DTD Tree (GUI fields per node: index number, function name)

Site1 Index (Book1 path information)
Index   Path
0       book1.xml
1       /bib/book
11      /bib/book/price
12      /bib/book/author
1211    /bib/book/author/first
1212    /bib/book/author/last
13      /bib/book/title
15      /bib/book/publisher
16      /bib/book/editor
161     /bib/book/editor/affiliation
162     /bib/book/editor/last
162     /bib/book/editor/first

Master Index
Index   Path
0       book.xml
1       /book
11      /book/price
12      /book/author
121     /book/author/full_name
1211    /book/author/full_name/first_name
1212    /book/author/full_name/last_name
13      /book/title
14      /book/year
15      /book/publisher
16      /book/editor
161     /book/editor/affiliation
162     /book/editor/full_name
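The master index numbers look as if each node's number is its parent's number followed by the node's position among its siblings (book = 1, its second child author = 12, and so on). That reading is inferred from the listed numbers and is an assumption, not a documented rule. A minimal Python sketch under that assumption, with a hand-written tree standing in for the parsed master DTD:

```python
# Master tree as (element name, children); a stand-in for the parsed master DTD.
MASTER_TREE = ("book", [
    ("price", []),
    ("author", [("full_name", [("first_name", []), ("last_name", [])])]),
    ("title", []),
    ("year", []),
    ("publisher", []),
    ("editor", [("affiliation", []), ("full_name", [])]),
])


def index_paths(node, index="1", parent_path=""):
    """Return (index, path) rows, numbering children 1, 2, ... under the parent.

    Assumed scheme: child index = parent index + sibling position. It becomes
    ambiguous once a node has more than nine children.
    """
    name, children = node
    path = f"{parent_path}/{name}"
    rows = [(index, path)]
    for position, child in enumerate(children, start=1):
        rows.extend(index_paths(child, index + str(position), path))
    return rows


if __name__ == "__main__":
    for index, path in index_paths(MASTER_TREE):
        print(index, path)
```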

Site 2 : Book2 DTD Tree

Site2 Index
Index   Path
0       book2.xml
1       /arts/book
12      /arts/book/author
1211    /arts/book/author/firstname
1212    /arts/book/author/lastname
13      /arts/book/title
15      /arts/book/publisher

Site 3 : Book3 DTD Tree

Site3 Index
Index   Path
0       book3.xml
1       /bookstore/book
11      /bookstore/book/price
12      /bookstore/book/author
1211    /bookstore/book/author/name
1212    /bookstore/book/author/name
13      /bookstore/book/title

XML Query Languages

• XQL : takes a document point of view

• XML-QL : takes a database point of view

• Quilt : draws from both areas
  – proposed by Don Chamberlin, Jonathan Robie, and Daniela Florescu
  – Kweelt (University of Washington), an XML query engine based on Quilt, used in our prototype

• XQuery proposal follows Quilt closely

How to generate site queries

• Parse the master query, a query over the global schema

• When a path is encountered, depending on its kind, get the corresponding path name from the DDXMI file and substitute it

• If there is no corresponding path in the DDXMI, treat it as a null value: no query is generated for that site
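The substitution rule above can be sketched in Python for a query with one FOR path and one RETURN path. This is a simplified stand-in, not the prototype's parser: the function names, the document name site.xml, and the choice to emit the site's absolute path after document(...) are all assumptions. The mapping uses the bookstore/book/price_info/price paths from the slides.

```python
# Master path -> site path, as it would be read from a DDXMI file.
MAPPING = {
    "/book": "/bookstore/book",
    "/book/price": "/bookstore/book/price_info/price",
}


def relative_to(path, base):
    """Strip the FOR path prefix so the remainder can follow the bound variable."""
    prefix = base.rstrip("/") + "/"
    return path[len(prefix):] if path.startswith(prefix) else None


def generate_site_query(site_doc, mapping, var, master_for, master_return, tag):
    """Rewrite a one-path master query for a site, or return None if unmapped."""
    site_for = mapping.get(master_for)
    site_return = mapping.get(master_return)
    if site_for is None or site_return is None:
        return None  # no corresponding path: no query for this site
    relative = relative_to(site_return, site_for)
    if relative is None:
        return None
    return (f'FOR ${var} IN document("{site_doc}"){site_for}\n'
            f'RETURN <{tag}>${var}/{relative}</{tag}>')


if __name__ == "__main__":
    print(generate_site_query("site.xml", MAPPING, "book",
                              "/book", "/book/price", "book"))
    # /book/year has no entry in MAPPING, so no query is produced.
    print(generate_site_query("site.xml", MAPPING, "book",
                              "/book", "/book/year", "book"))
```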

How to get site element names

[Figure: master index tree (book with children price, author, publisher, year, title, editor; author/full_name with first_name and last_name; editor with affiliation and full_name) beside a site index tree (bookstore/book/price_info/price)]

DDXMI
<source>book</source> <destination>bookstore/book</destination>
<source>book/price</source> <destination>bookstore/book/price_info/price</destination>

[In Quilt Query]
1. book → bookstore/book
2. price → bookstore/book/price_info/price (written as price_info/price relative to book)

1:1 Mapping Example

FOR $book IN document("book.xml")//book[publisher = "Addison-Wesley"]
RETURN <book>$book/title</book>

[Figure: master index tree beside the Book1 (bib/book), Book2 (arts/book), and Book3 (bookstore/book) trees; publisher and title appear under Book1 and Book2, title only under Book3]

Query Execution Result

1:N Mapping Example

FOR $edi IN document("book.xml")//book/editor
RETURN <editor>$edi/full_name</editor>

[Figure: master index tree beside the Book1, Book2, and Book3 trees; the master editor/full_name corresponds to editor/last and editor/first in Book1 (bib/book), with no editor shown for Book2 (arts/book) or Book3 (bookstore/book)]

DDXMI
<source>/book/editor/full_name</source>
<destination>/bib/book/editor/last,/bib/book/editor/first</destination>
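For the 1:N case, one master path expands into several site expressions. The Python sketch below assumes the destination is a comma-separated list of absolute site paths, as in the DDXMI fragment on this slide. The function name and the output shape are illustrative assumptions, not the prototype's generated query.

```python
def expand_destination(destination, base, var):
    """Turn a comma-separated destination into expressions relative to $var."""
    prefix = base.rstrip("/") + "/"
    expressions = []
    for path in destination.split(","):
        path = path.strip()
        if not path.startswith(prefix):
            raise ValueError(f"{path} is not below {base}")
        expressions.append(f"${var}/{path[len(prefix):]}")
    return ",".join(expressions)


if __name__ == "__main__":
    body = expand_destination(
        "/bib/book/editor/last,/bib/book/editor/first",
        "/bib/book/editor",
        "edi",
    )
    print(f'FOR $edi IN document("book1.xml")/bib/book/editor\n'
          f'RETURN <editor>{body}</editor>')
```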


N:1 Mapping Example

FOR $a IN document("book.xml")//book//author
RETURN <author> $a/last_name,$a/first_name </author>

[Figure: master index tree beside the site trees; author has children last and first in Book1 (bib/book), lastname and firstname in Book2 (arts/book), and a single name element in Book3 (bookstore/book)]

DDXMI operations on the Book3 author/name element:
<operation>lstring</operation>
<operation>fstring</operation>

Query Generation Result

import split as UDF_split;

FUNCTION fstring($str){ split(" ",$str)[1]}

FUNCTION lstring($str){ split(" ",$str)[2]}

FOR $a IN document("book3.xml") //book//author

RETURN <author> fstring($a/name), lstring($a/name)</author>
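For readers less familiar with Quilt, the two functions above can be mirrored in Python. This sketch assumes the positional predicates [1] and [2] are 1-based, as in XPath, so fstring takes the first space-separated token and lstring the second. Returning an empty string when the token is missing is an assumption added here.

```python
def fstring(name):
    """First space-separated token, like split(" ", $str)[1]."""
    parts = name.split(" ")
    return parts[0] if parts else ""


def lstring(name):
    """Second space-separated token, like split(" ", $str)[2]."""
    parts = name.split(" ")
    return parts[1] if len(parts) > 1 else ""


if __name__ == "__main__":
    full_name = "Jane Doe"  # made-up sample value
    print(fstring(full_name), lstring(full_name))
```

Since lstring picks the second token rather than the last one, a name with a middle name would yield the middle name. The Quilt definition above behaves the same way.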


Semantic Function Involved Example

FOR $book IN document("book.xml")//book
RETURN <book> $book/title,$book/author,$book/price </book>

<operation>div(100)</operation>

[Figure: master index tree beside the site trees; price appears under Book1 (bib/book) and Book3 (bookstore/book), and not under Book2 (arts/book)]
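The slides do not spell out what div(100) does. One plausible reading, taken here as an assumption, is a scaling conversion that divides a site's price by 100 (for example, cents to dollars). The Python sketch below parses an operation string of the form name or name(argument) and applies that one operation. The parser, the function names, and the sample value are all hypothetical.

```python
import re
from decimal import Decimal

# Matches "name" or "name(argument)", e.g. "fstring" or "div(100)".
OPERATION_RE = re.compile(r"^(\w+)(?:\((.*)\))?$")


def parse_operation(text):
    """Split an <operation> value into (name, argument or None)."""
    match = OPERATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized operation: {text!r}")
    return match.group(1), match.group(2)


def apply_operation(text, value):
    """Apply a semantic function to a site value (only div is sketched)."""
    name, argument = parse_operation(text)
    if name == "div":
        # Assumed meaning: divide the site value by the argument.
        return Decimal(value) / Decimal(argument)
    raise ValueError(f"unsupported operation: {name}")


if __name__ == "__main__":
    print(apply_operation("div(100)", "2995"))  # made-up site price
```

Decimal is used instead of float so that scaled prices keep exact decimal digits.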


Remaining Issues

• Handle attributes: one DTD has an attribute but others don’t, or an attribute in one DTD is an element in others

• More efficient way of generating the DDXMI file automatically when there are many paths in the master DTD
  – e.g., tree:tree mapping: if two paths are indicated as the same and have the same children, then the index numbers should be generated automatically

• Migrate to XML schemas, instead of DTDs

• Support JOIN, PRODUCT generated by queries

• Move to XQuery and a query engine with distributed query support

• Integrate the individual site query results and return them as a single data source ready for further analysis

• Provide mechanisms for removing redundancy

• Justify the semantics of the query generated

Conclusion

• Our prototype uses distributed metadata to generate a GUI tool for describing mappings between master and local databases by assigning index numbers and specifying conversion function names

• Uses Quilt as its XML query language

• A DDXMI file is generated based on the mappings, and is used to translate queries over the virtual master database into sub-queries to local databases

• An experiment testing feasibility is reported in which 3 different bibliography databases are integrated

• Implemented with Java Webserver and JavaCC

• Move to real applications, e.g. in the context of NSF project SEEK (Science Environment for Ecological Knowledge)
