Tutorial #5: Scientific Data Integration and Mediation
San Diego Supercomputer Center
U.C. San Diego
Bertram Ludäscher
Ilkay Altintas
Amarnath Gupta
Kai Lin
Scientific Data-Mediation AHM'03 National Partnership for Advanced Computational Infrastructure 2
Acknowledgements
• National Science Foundation (NSF) – www.nsf.gov
• GEOsciences Network (NSF) – www.geongrid.org
• Biomedical Informatics Research Network (NIH) – www.nbirn.net
• Science Environment for Ecological Knowledge (NSF) – seek.ecoinformatics.org
• Scientific Data Management Center (DOE) – sdm.lbl.gov/sdmcenter/
“gluing” together multiple data sources
bridging information and knowledge gaps computationally
Information Integration from a DB Perspective
• Information Integration Problem
– Given: data sources S1, ..., Sk (DBMS, web sites, ...) and user questions Q1, ..., Qn that can be answered using the Si
– Find: the answers to Q1, ..., Qn
• The Database Perspective:
– source = “database”: Si has a schema (relational, XML, OO, ...)
– Si can be queried
– define virtual (or materialized) integrated views V over S1, ..., Sk using database query languages (SQL, XQuery, ...)
– questions become queries Qi against V(S1, ..., Sk)
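To make the database perspective concrete, here is a minimal Python sketch of a virtual integrated view. The two sources, their field names, and the functions `V` and `Q` are hypothetical stand-ins and do not come from the tutorial; the point is only that the user question is written against V, not against S1 or S2.

```python
# Hypothetical sources: S1 lists image files, S2 lists segmented structures per file.
S1 = [{"file": "img01", "species": "rat"},
      {"file": "img02", "species": "mouse"}]
S2 = [{"file": "img01", "structure": "amygdala"},
      {"file": "img01", "structure": "hippocampus"},
      {"file": "img02", "structure": "cerebellum"}]

def V():
    """Virtual integrated view V(S1, S2): a join on 'file', computed on demand."""
    for a in S1:
        for b in S2:
            if a["file"] == b["file"]:
                yield {**a, **b}

def Q(structure):
    """A user question phrased as a query against V."""
    return sorted({row["file"] for row in V() if row["structure"] == structure})

files_with_amygdala = Q("amygdala")
```

Because V is a generator, nothing is materialized; a materialized view would instead store `list(V())` and refresh it when the sources change.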
Standard (XML-Based) Mediator Architecture
[Figure: the USER/Client sends a query Q(G(S1,..., Sk)) to the MEDIATOR and exchanges XML queries & results with it. The mediator holds the integrated global (XML) view G, given by the integrated view definition G(..) ← S1(..) … Sk(..). Each source S1, S2, ..., Sk exports an (XML) view through a wrapper; wrappers are implemented as web services.]
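The architecture can be sketched in a few lines of Python using the standard `xml.etree.ElementTree` module. Everything here is an illustrative assumption: the function names, the `record` element, and the choice of a plain union as the integrated view definition are not taken from the tutorial's prototype.

```python
import xml.etree.ElementTree as ET

def wrap_rows(source_name, rows):
    """Wrapper: export a source's native rows as an XML view."""
    root = ET.Element("source", name=source_name)
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(rec, key).text = str(value)
    return root

def global_view(xml_views):
    """Integrated view definition G(..) <- S1(..) ... Sk(..); here simply a union."""
    g = ET.Element("G")
    for view in xml_views:
        for rec in view.findall("record"):
            g.append(rec)
    return g

def query(g, tag, value):
    """Client query Q(G(...)): records whose <tag> child has the given text."""
    return [rec for rec in g.findall("record") if rec.findtext(tag) == value]

s1 = wrap_rows("S1", [{"cell": "Purkinje", "region": "cerebellum"}])
s2 = wrap_rows("S2", [{"cell": "pyramidal", "region": "hippocampus"},
                      {"cell": "granule", "region": "cerebellum"}])
G = global_view([s1, s2])
hits = [rec.findtext("cell") for rec in query(G, "region", "cerebellum")]
```

In a deployed mediator the wrappers would sit behind web services and the view definition would be a declarative query, not Python code.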
• Data Integration Approaches:
– Let’s just share data, e.g., link everything from a web page!
– ... or better put everything into a relational or XML database
– ... and do remote access using the Grid
– ... or just use Web services!
• Nice try. But:
– “Find the files where the amygdala was segmented.”
– “Which other structures were segmented in the same files?”
– “Did the volume of any of those structures differ much from normal?”
– “What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?”
Structural / XML-Based Mediation
Abstract XML-Based Mediator Architecture
[Figure: the USER/Client poses a query Q o V(S_1,...,S_k) to the MEDIATOR and exchanges XML queries & results with it. The mediator holds the integrated XML view V, given by the integrated view definition IVD(S1,...,Sn). Each source S_1, S_2, ..., S_k exports an XML view through a wrapper.]
Extensible Markup Language (XML)
• (meta)language for marking up text & data with user-definable tags
– (X)HTML, XSLT, XML Schema, ...
– MathML, BioML, GeoML, NeuroML, ...
– XML-RPC, SOAP, ...
• semistructured tree data model
– flexible: marked-up text, web pages, databases, ...
• container model:
– “boxes within boxes”
... in their wonderful book called SemWeb Tractat by B. Schatz and T.B. Lee, the authors show how ...
... in their wonderful book called <title>SemWeb Tractat</title> by B. Schatz and T.B. Lee, the authors show how ...
... in their wonderful book called <title>SemWeb Tractat</title> by <author>B. Schatz</author> and <author>T.B. Lee</author>, the authors show how ...
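Example: Relational Data => XML
Relation R:
A   B   C
a1  b1  c1
a2  b2  c2
a3  b3  c3
XML encoding:
<R>
  <tuple> <A>a1</A> <B>b1</B> <C>c1</C> </tuple>
  <tuple> <A>a2</A> <B>b2</B> <C>c2</C> </tuple>
  …
</R>
Tree view: the root R has three tuple children; each tuple has children A, B, C with leaf values (a1, b1, c1), (a2, b2, c2), (a3, b3, c3).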
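The encoding above is mechanical, as the following minimal Python sketch shows. The function name `relation_to_xml` is an illustrative assumption; the element names R, tuple, A, B, C follow the slide.

```python
import xml.etree.ElementTree as ET

def relation_to_xml(name, columns, rows):
    """Encode a relation as <name><tuple><col>value</col>...</tuple>...</name>."""
    root = ET.Element(name)
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for col, val in zip(columns, row):
            ET.SubElement(tup, col).text = val
    return root

R = relation_to_xml("R", ["A", "B", "C"],
                    [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")])
xml_text = ET.tostring(R, encoding="unicode")
```

Column names become tag names, so this sketch assumes that every column name is a legal XML element name.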
Tag Names & Nesting => XML DTDs (Grammars)
XML DTD:
<!ELEMENT bibliography (paper*)>
<!ELEMENT paper (authors,fullPaper?,title,booktitle)>
<!ELEMENT authors (author+)>
Grammar Rules:
bibliography → paper*
paper → authors fullPaper? title booktitle
authors → author+
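Since each DTD content model is a regular expression over child tag names, checking the nesting of one element reduces to a regex match. The sketch below assumes a simple encoding (each child tag followed by a comma) and a hypothetical helper `children_ok`; it is not a DTD validator, only an illustration of the "DTD = grammar" reading.

```python
import re

# Content models from the slide, written as regexes over child-tag sequences.
# Each child tag is encoded as "tag," so that tag names cannot run together.
CONTENT_MODELS = {
    "bibliography": r"(paper,)*",
    "paper": r"authors,(fullPaper,)?title,booktitle,",
    "authors": r"(author,)+",
}

def children_ok(parent, child_tags):
    """True if the sequence of child tags matches the parent's content model."""
    encoded = "".join(tag + "," for tag in child_tags)
    return re.fullmatch(CONTENT_MODELS[parent], encoded) is not None

valid_paper = children_ok("paper", ["authors", "title", "booktitle"])
misordered_paper = children_ok("paper", ["title", "authors", "booktitle"])
```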
XML DTDs vs. XML Schema
• XML DTDs
– set of allowed tag names
– their nesting structure (via grammar rules)
• XML Schema
– tag names and nesting structure
– user-defined complex data types
– subtyping (no multiple inheritance): RESTRICT and EXTEND
– separate “namespace” for type names and tag (=element) names
– ...
XML Schema: User-Defined Type/Class Hierarchy
XML Schema Declarations (“home-style” syntax)
Complex Type Declarations
XML Schema (“home-style”)
Complex Types
Simple Type Declarations
XML Schema: Substitution Groups
Elements of a substitution group (hexagons) and associated complex types (boxes)
XML Schema Declarations (W3C syntax)
v mother(X0, X2)&father(X1, X2)&male(X1)&neq(X0, X1)
v father(X0, X2)&mother(X1, X2)&male(X1)&neq(X0, X1)
v mother(X0, X2)&mother(X1, X2)&male(X1)&neq(X0, X1)
Example (Cont’d)
• ?- plan(brother(X0,X1)).
brother(X0, X1)
==Bp ordered LQP==>
parentDb(father(X1, X2) & father(X0, X2))
& genderDb(male(X1)) & mediator(neq(X0, X1))
v parentDb(father(X1, X2) & mother(X0, X2))
& genderDb(male(X1)) & mediator(neq(X0, X1))
v parentDb(mother(X1, X2)&father(X0,X2))
& genderDb(male(X1)) & z_mediator(neq(X0, X1))
v parentDb(mother(X1, X2)&mother(X0, X2))
& genderDb(male(X1))&z_mediator(neq(X0, X1))
Computing Feasible Plans (Goal Ordering)
• A conjunctive query Q is an expression of the form
– q(X) ← p1(X1), ..., pn(Xn)
– order of subgoals p_i is irrelevant
• An ordered plan P is an expression of the form
– q(X) ← [p1(X1), ..., pn(Xn)]
– order of subgoals p_i is important
• Problem:
– given Q, compute P which is feasible, i.e., observes the limited query capabilities of sources
– Here: binding patterns, i.e., predicates’ arguments can be
• “b” – bound
• “f” – free
• “_” – bound or free
A Simple Algorithm for Ordering Goals
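The slide's own algorithm is not reproduced in this transcript. As a stand-in, the following Python sketch shows one common greedy approach: repeatedly schedule any subgoal whose “b” positions are already bound. The function names and the binding patterns in the example are assumptions. The sketch also simplifies by letting an “f” position accept a bound argument (the mediator could filter afterwards), so “f” and “_” behave alike here.

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def callable_now(args, pattern, bound):
    """True if every 'b' position holds a constant or an already bound variable."""
    return all(p != "b" or not is_var(a) or a in bound
               for a, p in zip(args, pattern))

def order_goals(subgoals, patterns, bound=()):
    """Greedy goal ordering.

    subgoals: list of (predicate, args); patterns: predicate -> allowed patterns.
    Returns an ordered plan, or None if no subgoal can be scheduled at some point.
    """
    bound, todo, plan = set(bound), list(subgoals), []
    while todo:
        for goal in todo:
            pred, args = goal
            if any(callable_now(args, p, bound) for p in patterns[pred]):
                plan.append(goal)
                todo.remove(goal)  # safe: we leave the for loop immediately
                bound.update(a for a in args if is_var(a))
                break
        else:
            return None  # no remaining subgoal is callable: infeasible
    return plan

# Hypothetical capabilities: male can be enumerated, father needs one bound argument.
PATTERNS = {"male": ["f"], "father": ["bf", "fb"], "neq": ["bb"]}
BROTHER = [("neq", ("X0", "X1")), ("father", ("X1", "X2")),
           ("father", ("X0", "X2")), ("male", ("X1",))]
plan = order_goals(BROTHER, PATTERNS)
```

Under the simplification above, executing a subgoal only adds bindings, so a subgoal that is callable now stays callable later; that is why a greedy choice does not lose feasible orders in this sketch.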
Query Containment
• A query Q1 is contained in Q2, denoted Q1 ⊆ Q2
– if for all possible database instances, the set of answers to Q1 is contained in the set of answers to Q2.
• Q1 and Q2 are called equivalent
– if Q1 ⊆ Q2 and Q2 ⊆ Q1.
• Query containment is undecidable for many languages, e.g., for the relational calculus (SQL).
• For conjunctive queries, the problem is NP-complete (and thus decidable)
– Since query sizes tend to be “small” (in particular, when compared to database sizes), query containment is still of use in practice (indeed, it is one of the most fundamental tools for logic-based query optimization).
Query Containment
• Q1(Xs,Ys) is contained in Q2(Xs,Zs) iff
ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) => (EXISTS Zs: Q2(Xs,Zs))
• iff we can refute its negation
• iff
NOT (ALL Xs: (EXISTS Ys: Q1(Xs,Ys)) => (EXISTS Zs: Q2(Xs,Zs))) |= []
• iff
EXISTS Xs: (EXISTS Ys: Q1(Xs,Ys)) AND NOT (EXISTS Zs: Q2(Xs,Zs)) |= []
• iff
– canonical_db(Q1) AND NOT (EXISTS Zs: Q2(Xs,Zs)) |= []
• create database from Q1 (freeze its variables into constants), then run Q2 as a query: containment holds iff Q2 returns the frozen head tuple of Q1
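The canonical-database test can be sketched in Python as follows. This is not the tutorial's Prolog program; the query representation (head arguments plus a list of body atoms), the `$` prefix used to freeze variables, and all function names are assumptions made for illustration.

```python
def is_var(term):
    """Prolog convention: variables start with an uppercase letter."""
    return term[:1].isupper()

def homomorphisms(atoms, facts, subst):
    """Yield every substitution that maps all atoms onto facts."""
    if not atoms:
        yield subst
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        s = dict(subst)
        if all((s.setdefault(a, f) == f) if is_var(a) else (a == f)
               for a, f in zip(args, fargs)):
            yield from homomorphisms(rest, facts, s)

def contained_in(q1, q2):
    """Decide Q1 ⊆ Q2 for conjunctive queries given as (head_args, body_atoms)."""
    head1, body1 = q1
    head2, body2 = q2

    def freeze(term):
        # Turn a variable of Q1 into a fresh constant; '$X' is not a variable.
        return "$" + term if is_var(term) else term

    canonical_db = {(p, tuple(freeze(a) for a in args)) for p, args in body1}
    goal = tuple(freeze(a) for a in head1)
    # Run Q2 on the canonical database and look for Q1's frozen head tuple.
    return any(tuple(s.get(a, a) for a in head2) == goal
               for s in homomorphisms(list(body2), canonical_db, {}))

Q1 = (("X",), [("father", ("X", "Y")), ("male", ("X",))])
Q2 = (("X",), [("father", ("X", "Z"))])
Q3 = (("X",), [("father", ("X", "Y")), ("father", ("X", "Z"))])
```

Here `contained_in(Q1, Q2)` holds while the converse does not, and Q2 and Q3 contain each other, which is the situation query minimization exploits.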
Query Containment Algorithm (in Prolog)
• Applications:
– query minimization (conjunctive query is minimal if no conjunct can be dropped)
– semantic query optimization
• Q denial
• here: denial is an integrity constraint and states what must not hold
• example: denial = false ← mother(X,M), father(Y,M)
Example
• 50% of the clauses of the executable plan are irrelevant ...
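One plausible reading of the 50% figure, sketched in Python: apply the example denial (nobody is both a mother and a father) to the four disjuncts of the brother plan. A disjunct is dropped when the body of the denial can be mapped into it. The function name `matches` and the data layout are assumptions, and the sketch assumes the denial contains only variables.

```python
def matches(pattern_atoms, atoms, subst=None):
    """True if all pattern atoms map into `atoms` under one consistent substitution.

    Assumes every argument in pattern_atoms is a variable.
    """
    subst = subst or {}
    if not pattern_atoms:
        return True
    (pred, args), rest = pattern_atoms[0], pattern_atoms[1:]
    for apred, aargs in atoms:
        if apred == pred and len(aargs) == len(args):
            s = dict(subst)
            if all(s.setdefault(p, a) == a for p, a in zip(args, aargs)):
                if matches(rest, atoms, s):
                    return True
    return False

def disjunct(parent_of_x1, parent_of_x0):
    """One branch of the brother plan, e.g. father(X1,X2) & mother(X0,X2) & ..."""
    return [(parent_of_x1, ("X1", "X2")), (parent_of_x0, ("X0", "X2")),
            ("male", ("X1",)), ("neq", ("X0", "X1"))]

PLAN = [disjunct("father", "father"), disjunct("father", "mother"),
        disjunct("mother", "father"), disjunct("mother", "mother")]

# denial = false <- mother(X,M), father(Y,M)
DENIAL = [("mother", ("X", "M")), ("father", ("Y", "M"))]

relevant = [d for d in PLAN if not matches(DENIAL, d)]
```

Only the father/father and mother/mother branches survive, because in the two mixed branches X2 would have to be both a mother and a father.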
Mediator Demo
• Computer Science Challenges:
– Given a query Q over virtual integrated database V, how to come up with Q’ over the source schemas? (cf. Garlic, DiscoveryLink, ...)
• query rewriting of Q(V) into Q’(SRCs) using unfolding and normalization
• computation of feasible orders (NP-complete!?) while minimizing number of “chunks” sent to sources
• semantic query optimization (reasoning over plans!); e.g. conjunctive query containment is NP-complete [Chandra-Merlin-77]
• A Quick Demo of the current prototype:
– Find 3D reconstructions of cells found in ‘cerebellar cortex’:
• ?- ccdbData('cerebellar cortex').
• Join everything reachable along ‘cerebellar-cortex’.(has-a)* in UMLS
• ... with concept markup in CCDB
• ... retrieve (links to) results
• ... also show on SmartAtlas tool
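The demo query combines a transitive closure over has-a with a join against concept-annotated data. A minimal Python sketch of that pattern follows; the has-a fragment, the dataset names, and the function names are invented for illustration and are not actual UMLS or CCDB content.

```python
from collections import deque

# Hypothetical fragment of a has-a hierarchy (a stand-in for UMLS).
HAS_A = {
    "cerebellar cortex": ["Purkinje cell layer", "granular layer"],
    "Purkinje cell layer": ["Purkinje cell"],
    "granular layer": ["granule cell"],
}

# Hypothetical concept markup of datasets (a stand-in for CCDB).
CCDB = {"recon_17": "Purkinje cell",
        "recon_42": "granule cell",
        "recon_99": "pyramidal cell"}

def reachable(concept):
    """All concepts reachable via (has-a)*, including the start concept."""
    seen, queue = {concept}, deque([concept])
    while queue:
        for part in HAS_A.get(queue.popleft(), []):
            if part not in seen:
                seen.add(part)
                queue.append(part)
    return seen

def ccdb_data(concept):
    """Datasets whose concept markup lies anywhere below `concept`."""
    parts = reachable(concept)
    return sorted(name for name, marked in CCDB.items() if marked in parts)

results = ccdb_data("cerebellar cortex")
```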
Mediator Demo
From XML-Based to Logic and Model-Based (“Semantic”) Mediation
What’s the Problem with XML & Complex Multiple-Worlds?
• XML is Syntax
– DTDs talk about element nesting
– XML Schema schemas give you data types
– need anything else? => write comments!
• Domain Semantics is complex:
– implicit assumptions, hidden semantics
=> sources seem unrelated to the non-expert
• Need Structure and Semantics beyond XML trees!
– employ richer OO models
– make domain semantics and “glue knowledge” explicit
– use ontologies to fix terminology and conceptualization
– avoid ambiguities by using formal semantics
From XML-Based to Model-Based Mediation
• Data and Knowledge Sharing Potential:
Database Mediation + Knowledge Representation
________________________
= Model-Based Mediation
• Basic Ideas:
– turn primary data sources into knowledge sources
– employ secondary glue knowledge sources
Ontologies
• So what is an Ontology?
– definition of things that are relevant to your application
– representation of terminological knowledge (“TBox”)
– explicit specification of a conceptualization
– concept hierarchy (“is-a”)
– further semantic relationships between concepts
– abstractions of relational schemas, (E)ER, UML classes, XML Schemas
• Examples:
– NCMIR ANATOM
– GO (Gene Ontology)
– UMLS (Unified Medical Language System)
– CYC
Formalism for Ontologies: Description Logic
• DL definition of “Happy Father” (Example from Ian Horrocks, U Manchester, UK)
Description Logics
• Terminological Knowledge (TBox)
– Concept Definition (naming of concepts):
– Axiom (constraining of concepts):
=> a mediator’s “glue knowledge source”
• Assertional Knowledge (ABox)
– the marked neuron in image 27
=> the concrete instances/individuals of the concepts/classes that your sources export
Querying vs. Reasoning
• Querying:
– given a DB instance I (= logic interpretation), evaluate a query expression φ (e.g. SQL, FO formula, Prolog program, ...)
– boolean query: check if I |= φ (i.e., if I is a model of φ)
– (ternary) query: { (X, Y, Z) | I |= φ(X,Y,Z) }
=> check happyFathers in a given database
• Reasoning:
– check if I |= φ implies I |= ψ for all databases I,
– i.e., if φ => ψ
– undecidable for FO, F-logic, etc.
– Description Logics are decidable fragments
=> concept subsumption, concept hierarchy, classification
=> semantic tableaux, resolution, specialized algorithms
• Really? Why?
– authority based: <VIP> said so
– faith based: don’t know but firmly believe
– query statement Q = ... derived it from DB I
– query Q = ... derived it from DB I and KB T using derivation D
=> logic-based systems often “come with explanations” (“computations as proofs”)
Formalizing Glue Knowledge: Domain Map for SYNAPSE and NCMIR
Domain Map = labeled graph with concepts ("classes") and roles ("associations")
• additional semantics: expressed as logic rules (F-logic)
Domain Map (DM)
Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).
Domain Expert Knowledge
DM in Description Logic
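A domain map in the sense above can be held as a set of labeled edges. The Python sketch below encodes part of the expert knowledge quoted on this slide as (concept, role, concept) triples and derives one consequence from it. The role names `has`, `contains`, and `isa`, and both function names, are illustrative assumptions; the tutorial itself expresses such rules in F-logic or Description Logic.

```python
# Part of the expert knowledge as (concept, role, concept) triples.
DOMAIN_MAP = [
    ("Purkinje cell", "has", "dendrite"),
    ("pyramidal cell", "has", "dendrite"),
    ("dendrite", "has", "branch"),
    ("branch", "contains", "spine"),
    ("spine", "isa", "ion regulating component"),
    ("spine", "has", "ion binding protein"),
]

def parts_of(concept, roles=("has", "contains")):
    """Concepts reachable from `concept` by following part-like roles."""
    found, frontier = set(), [concept]
    while frontier:
        current = frontier.pop()
        for subj, role, obj in DOMAIN_MAP:
            if subj == current and role in roles and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found

def has_ion_regulating_part(concept):
    """Rule: a concept qualifies if one of its parts is an ion regulating component."""
    return any((part, "isa", "ion regulating component") in DOMAIN_MAP
               for part in parts_of(concept))

purkinje_parts = parts_of("Purkinje cell")
```

With such a rule, a source that only registers data under "spine" becomes discoverable by a query about Purkinje cells, which is the kind of link the glue knowledge is meant to supply.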
Source Contextualization & DM Refinement
In addition to registering (“hanging off”) data relative to existing concepts, a source may also refine the mediator’s domain map ...
=> sources can register new concepts at the mediator ...
Example: ANATOM Domain Map
Browsing Registered Data with Domain Maps
Process Maps with Abstractions and Elaborations: From Terminological to Procedural Glue
deduction
– deductive database/logic programming technology, AI “stuff” ...
– Semantic Web technology
• Scientific Workflow Management
– more procedural than database mediation (often the scientist is the query planner)
– deployment using web services
B R E A K
... followed by demos ...
GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries” (GSC)
[Figure: concept hierarchies for Composition, Genesis, Fabric, and Texture]
“smart discovery & querying” via multiple, independent concept hierarchies (controlled vocabularies)
• data at different description levels can be found and processed
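A thematic query over several independent hierarchies can be sketched in Python as below. The vocabulary terms, the dataset names, and the functions `is_a` and `thematic_query` are hypothetical and much smaller than any real GSC classification; only two of the four hierarchies are shown.

```python
# Hypothetical controlled vocabularies: child term -> parent term, one per hierarchy.
HIERARCHIES = {
    "composition": {"basaltic": "mafic", "granitic": "felsic"},
    "genesis": {"volcanic": "igneous", "plutonic": "igneous",
                "clastic": "sedimentary"},
}

# Hypothetical datasets, each described at whatever level its provider chose.
DATASETS = {
    "map_A": {"composition": "granitic", "genesis": "plutonic"},
    "map_B": {"composition": "felsic", "genesis": "igneous"},
    "map_C": {"composition": "basaltic", "genesis": "volcanic"},
}

def is_a(term, ancestor, parents):
    """True if term equals ancestor or lies below it in the hierarchy."""
    while term is not None:
        if term == ancestor:
            return True
        term = parents.get(term)
    return False

def thematic_query(**wanted):
    """Datasets that satisfy every requested term, each in its own hierarchy."""
    return sorted(name for name, desc in DATASETS.items()
                  if all(is_a(desc.get(h), term, HIERARCHIES[h])
                         for h, term in wanted.items()))

felsic_igneous = thematic_query(composition="felsic", genesis="igneous")
```

The query asks at a general level (felsic, igneous) and still finds map_A, which was described with the more specific terms granitic and plutonic.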
GEON SMART Metadata: Multihierarchical Rock Classification for “Thematic Queries”