Data Integration Methods Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems February 16, 2004
2
Administrivia
Next reading assignment on scalable query reformulation algorithms:
Pottinger and Halevy – MiniCon
Write-up: summarize the main ideas of this paper
3
Today’s Trivia Question
4
A Problem
We’ve seen that even with normalization and the same needs, different people will arrive at different schemas
In fact, most people also have different needs!
Often people build databases in isolation, then want to share their data
  Different systems within an enterprise
  Different information brokers on the Web
  Scientific collaborators
  Researchers who want to publish their data for others to use
This is the goal of data integration: tie together different sources, controlled by many people, under a common schema
5
Example
We want to build the UltimateMovieGuide™
Given:
TV Guide schema:
  ShowingAt(time, title, year)
  Shows(title, year, genre, rating)
  Director(title, year, name)
  Starring(title, year, name)
GoodMovies.com:
  FourStarMovie(title, year, genre)
  DecentMovie(title, year)
OscarWinners:
  WinningDirector(director, title, year)
Documentaries.org:
  Documentary(title, year, director, producer)
6
Integrating Data
Several steps:
  Getting data out from a data source – it may have its own query/retrieval interface
    Sometimes it will need a query before it returns answers, e.g., a Web form
  Getting all of the data into the same data model and format
  Translating the data into the same schema
  Answering queries
How might we handle this?
7
Data Warehouses – Offline Replication
Get experts together, define a schema they think best captures all the info
Define a database with this schema
Define procedural mappings in an “ETL tool” to import the data
Perhaps perform “data cleaning”
Periodically copy all of the data from the data sources
  Note that the sources and the warehouse are basically independent at this point
[Figure labels: Remote, Autonomous Data Sources; Data Warehouse; Query; Results]
8
Pros and Cons of Data Warehouses
Need to spend time to design the physical database layout, as well as logical
  This actually takes a lot of effort!
Data is generally not up-to-date (lazy or offline refresh)
Queries over the warehouse don’t disrupt the data sources
Can run very heavy-duty computations, including data mining and cleaning
9
An Alternative – Mediators or Virtual Integration Systems
Get experts together, define a schema they think best captures all the info
Define it as a virtual mediated schema
Create declarative mappings specifying how to get data from each source into the mediated schema
Evaluate queries over the mediated schema “on the fly” using the current data at the sources
[Figure labels: Data Integration System; Mediated Schema; Schema Mappings; Source Catalog; Remote, Autonomous Data Sources; Query; Results]
10
Core Question: How Do We Define and Use Mappings?
Queries must be directly composed with mappings
Leads to use of views as the means of specifying mappings
… So in which direction do we specify views?
  1. Mediated relations as views over source relations
  2. Source relations as views over mediated relations
TSIMMIS chooses option 1
Information Manifold chooses option 2
Neither is perfect or comprehensive, as we’ll see
11
The Job of Mappings
Between different data sources:
  May have different numbers of tables – different decompositions
  Attributes may be broken down differently (“rating” vs. “EbertThumb” and “RoeperThumb”)
  Metadata in one relation may be data in another
  Values may not exactly correspond (“shows” vs. “movies”)
  It may be unclear whether a value is the same (“COPPOLA” vs. “Francis Ford Coppola”)
  May have different, but synonymous terms (ImdbID “123456” vs. SSN “987-45-3210”)
  Might have sub/superclass relationships
12
General Techniques
Value-value correspondences accomplished using concordance tables
  Join through a table mapping values to values
  Imdb_Actor(ID, SAG_actor_name)
Table-multitable correspondences accomplished using joins (in one direction), projections (in other direction)
  Key question: what happens if a needed attribute is missing? (e.g., DecentMovie has no genre)
Super/subclass relationships generally must be captured using selection (in one direction), union (in other direction)
… And sometimes we just can’t specify the correspondence!
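To make the missing-attribute question concrete, here is a minimal Python sketch. It assumes a target relation with attributes (title, year, genre); the rows are hypothetical, and only the relation names FourStarMovie and DecentMovie come from the GoodMovies.com source. Padding with None is one possible choice, not the only one.

```python
# Hypothetical rows; only the relation names come from the GoodMovies.com source.
four_star_movie = [("Movie A", 2001, "drama")]   # FourStarMovie(title, year, genre)
decent_movie = [("Movie B", 2002)]               # DecentMovie(title, year)


def to_shows(four_star, decent):
    """Map both source relations into an assumed (title, year, genre) target."""
    rows = [(t, y, g) for (t, y, g) in four_star]
    # DecentMovie has no genre, so the attribute is padded with None here.
    rows += [(t, y, None) for (t, y) in decent]
    return rows


def to_decent(shows):
    """The other direction is a projection that simply drops genre."""
    return [(t, y) for (t, y, _genre) in shows]


shows = to_shows(four_star_movie, decent_movie)
```

Going from the target back to the DecentMovie shape loses nothing that DecentMovie could hold, while the forward direction has to invent a placeholder for genre.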
13
Some Examples of Mappings
Show(ID, Title, Year, Lang, Genre)
Movie(ID, Title, Year, Genre, Director, Star1, Star2)
EnglishMovie(Title, Year, Genre, Rating)
Docu(ID, Title, Year)
Participant(ID, Name, Role)
ImdbID  CastOf
1234    Catwoman
Name       CastOf
Berry, H.  Monster’s Ball
PieceOfArt(I, T, Y, “English”, G) :- EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)
Movie(I, T, Y, “doc”, D, S1, S2) :- Docu(I, T, Y), Participant(I, D, “Dir”), Participant(I, S1, “Cast1”), Participant(I, S2, “Cast2”)
T1 T2
Need a concordance table from ImdbIDs to actress names
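A minimal Python sketch of joining through a concordance table, using the two sample rows from this slide. The dictionary imdb_actor plays the role of Imdb_Actor(ID, SAG_actor_name); the pairing of ID 1234 with “Berry, H.” is an assumption made for illustration.

```python
imdb_cast = [("1234", "Catwoman")]             # (ImdbID, CastOf)
sag_cast = [("Berry, H.", "Monster's Ball")]   # (Name, CastOf)

# Assumed concordance table: ImdbID -> SAG actor name.
imdb_actor = {"1234": "Berry, H."}


def unified_cast(imdb_rows, sag_rows, concordance):
    """Translate ImdbIDs to names via the concordance table, then union."""
    translated = {
        (concordance[actor_id], title)
        for actor_id, title in imdb_rows
        if actor_id in concordance   # IDs with no known counterpart are dropped
    }
    return translated | set(sag_rows)


cast = unified_cast(imdb_cast, sag_cast, imdb_actor)
```

Without the concordance table there is no way to tell that the two rows describe the same person.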
14
TSIMMIS and Information Manifold
Focus: Web-based queryable sources
  CGI forms, online databases, maybe a few RDBMSs
  Each needs to be mapped into the system – not as easy as web search – but the benefits are significant vs. query engines
A few parenthetical notes:
  Part of a slew of works on wrappers, source profiling, etc.
  The creation of mappings can be partly automated – systems such as LSD, Cupid, Clio, … do this
  Today most people look at integrating large enterprises (that’s where the $$$ is!) – Nimble, BEA, IBM
15
TSIMMIS
“The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew
An instance of a “global-as-view” mediation system
One of the first systems to support semi-structured data, which predated XML by several years
16
Semi-structured Data: OEM
Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity
Proposal: Object Exchange Model, OEM
OID: <label, type, value>
1: show {
  2: id { 15 },
  3: title { Catwoman },
  4: year { 2004 },
  5: lang { English },
  6: genre { fantasy },
  7: criticsrating {
    8: stars { 0.5 },
    9: source { Bob } } }
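One way to picture the OID: <label, type, value> structure is as a small recursive record. The Python sketch below encodes the show object above; the class name OemObject, the helper children, and the type names ("set", "integer", "string", "real") are assumptions for illustration, not TSIMMIS code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class OemObject:
    oid: int
    label: str
    type: str                             # "set" for complex objects, else an atomic type
    value: Union[str, int, float, list]   # atomic value, or a list of child OemObjects


show = OemObject(1, "show", "set", [
    OemObject(2, "id", "integer", 15),
    OemObject(3, "title", "string", "Catwoman"),
    OemObject(4, "year", "integer", 2004),
    OemObject(5, "lang", "string", "English"),
    OemObject(6, "genre", "string", "fantasy"),
    OemObject(7, "criticsrating", "set", [
        OemObject(8, "stars", "real", 0.5),
        OemObject(9, "source", "string", "Bob"),
    ]),
])


def children(obj, label):
    """Return the sub-objects of a complex object that carry the given label."""
    if obj.type != "set":
        return []
    return [child for child in obj.value if child.label == label]
```

Asking for a label that a source never supplied just returns an empty list, which is how the irregularity shows up in practice.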
17
Queries in TSIMMIS
Specified in OQL-style language called Lorel
  OQL was an object-oriented query language
  Lorel is, in many ways, a predecessor to XQuery
Based on path expressions over OEM structures:
  select show
  where show.title = “Star Wars” and show.genre = “sci-fi”
This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated:
  for $s in AllData()/show
  where $s/title/text() = “Star Wars” and $s/genre/text() = “sci-fi”
  return $s
18
Query Answering in TSIMMIS
Basically, it’s view unfolding, i.e., composing a query with a view
  The query is the one being asked
  The views are the MSL templates for the wrappers
Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
  Common for web forms (see Amazon, Google, …)
  XQuery functions (XQuery’s version of views) support parameters as well, so we’ll see these in action
19
A Wrapper Definition in MSL
Wrappers have templates and binding patterns ($T) in MSL:
S :- S: <show {<genre $G>}> // $$ = “select * from movie where title=“ $T //
This reformats a SQL query over Movie(title, year, genre)
In XQuery, this might look like:
define function GetShow($t AS xsd:string) as show {
  for $s in sql(“Amazon.DB”, “select * from movie where title=‘” + $t + “’”)
  return <show><title>{$t}</title> {$s/year, $s/genre}</show>
}
[Figure: the movie table, with columns including year and genre]
GetShow’s results are unioned with those of the other wrappers to form the view AllData()
…
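The same wrapper idea can be sketched in Python: a function that needs its title parameter bound before it can query the source, and that reshapes each row into a show record. The in-memory SQLite database stands in for the real source; make_demo_db, get_show and the sample rows are assumptions for illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the source database (sample rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movie (title TEXT, year INTEGER, genre TEXT)")
    conn.executemany(
        "INSERT INTO movie VALUES (?, ?, ?)",
        [("Star Wars", 1977, "sci-fi"), ("Catwoman", 2004, "fantasy")],
    )
    return conn


def get_show(conn, title):
    """Wrapper: the title must be bound before the source returns anything."""
    rows = conn.execute(
        "SELECT year, genre FROM movie WHERE title = ?", (title,)
    )
    return [{"show": {"title": title, "year": y, "genre": g}} for (y, g) in rows]


conn = make_demo_db()
shows = get_show(conn, "Star Wars")
```

The sketch passes the title as a bound SQL parameter rather than concatenating it into the query string, which avoids quoting problems in the title value.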
20
How to Answer the Query
Given our query:
  for $s in AllData()/show
  where $s/title/text() = “Star Wars” and $s/genre/text() = “sci-fi”
  return $s
Find all wrapper definitions that:
  Contain enough output “structure” to match the conditions of the query
  Or have already tested the conditions for us!
21
Query Composition with Views
We find all views that define book with author and title, and we compose the query with each:
define function GetBook($x AS xsd:string) as book {
  for $b in sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”)
  return <book> {$b/title} <author>{$x}</author></book>
}

for $b in AllData()/book
where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”
return $b
[Figure: the book table, with columns title and author]
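A rough Python sketch of what composing this query with the GetBook view amounts to: the author condition supplies the wrapper’s parameter, and the remaining title condition is applied to whatever the wrapper returns. The function names and the sample rows are hypothetical.

```python
def get_book(author, source_rows):
    """Stand-in for the GetBook wrapper: needs an author before it returns rows."""
    return [
        {"title": title, "author": author}
        for (title, row_author) in source_rows
        if row_author == author
    ]


def answer(conditions, source_rows):
    """Unfold the query into the wrapper: bind its parameter, filter the rest."""
    if "author" not in conditions:
        raise ValueError("the wrapper needs a bound author")
    residual = {k: v for k, v in conditions.items() if k != "author"}
    books = get_book(conditions["author"], source_rows)
    return [b for b in books if all(b.get(k) == v for k, v in residual.items())]


# Hypothetical contents of the source's book table: (title, author).
rows = [("DB2 UDB", "Chamberlin"), ("Other Book", "Chamberlin"), ("Some Book", "Aho")]
result = answer({"title": "DB2 UDB", "author": "Chamberlin"}, rows)
```

A query with no author condition cannot use this wrapper at all, which is the binding-pattern restriction in action.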
22
Example on Board
23
Virtues of TSIMMIS
Early adopter of semistructured data, greatly predating XML
  Can support data from many different kinds of sources
  Obviously, doesn’t fully solve heterogeneity problem
Presents a mediated schema that is the union of multiple views
  Query answering based on view unfolding
Easily composed in a hierarchy of mediators
24
Limitations of TSIMMIS’ Approach
Some data sources may contain data with certain ranges or properties
  “Books by Aho”, “Students at UPenn”, …
  If we ask a query for students at Columbia, don’t want to bother querying students at Penn…
  How do we express these?
Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema
25
An Alternate Approach: The Information Manifold (Levy et al.)
When you integrate something, you have some conceptual model of the integrated domain
Define that as a basic frame of reference, everything else as a view over it
“Local as View” using mappings that are conjunctive queries
May have overlapping/incomplete sources
  Define each source as the subset of a query over the mediated schema – the “open world assumption”
  We can use selection or join predicates to specify that a source contains a range of values:
    ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
26
The Local-as-View Model
The basic model is the following:
  “Local” sources are views over the mediated schema
  Sources have the data – mediated schema is virtual
  Sources may not have all the data from the domain – “open-world assumption”
The system must use the sources (views) to answer queries over the mediated schema
27
Answering Queries Using Views
Assumption: conjunctive queries, set semantics
Suppose we have a mediated schema:
  show(ID, title, year, genre), rating(ID, stars, source)
A conjunctive query might be:
  q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
Recall intuitions about this class of queries:
  Adding a conjunct to a query removes answers from the result but never adds any
  Any conjunctive query with at least the same constraints & conjuncts will give valid answers
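The first intuition can be checked on a toy instance. The Python sketch below evaluates q(t) and a variant with one extra conjunct (g = "drama") over hypothetical show and rating relations; Python sets give set semantics for free.

```python
# Hypothetical instance of the mediated schema.
show = {
    (1, "Movie A", 1997, "drama"),
    (2, "Movie B", 1997, "comedy"),
    (3, "Movie C", 1998, "drama"),
}
rating = {(1, 5, "Bob"), (2, 5, "Ann"), (3, 5, "Bob")}


def q(show, rating):
    """q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997"""
    return {
        t
        for (i, t, y, g) in show
        for (i2, stars, s) in rating
        if i == i2 and stars == 5 and y == 1997
    }


def q_drama(show, rating):
    """The same query with one more conjunct: g = "drama"."""
    return {
        t
        for (i, t, y, g) in show
        for (i2, stars, s) in rating
        if i == i2 and stars == 5 and y == 1997 and g == "drama"
    }


answers = q(show, rating)
restricted = q_drama(show, rating)
```

On this instance the restricted query returns a strict subset of the original answers; on any instance it can never return a title the original query does not.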
28
Query Answering
Suppose we have the query:
  q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
and sources:
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
  TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
  movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
  critics(i,r,s) ⊆ rating(i, r, s)
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
We want to compose the query with the source mappings – but they’re in the wrong direction!
29
Inverse Rules
We can take every mapping and “invert” it, though sometimes we may have insufficient information:
If
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
then we can also infer that:
  show(i, ???, ???, ???) ⊇ 5star(i)
But how to handle the absence of the missing attributes?
  We know that there must be AT LEAST one instance of ??? for each attribute for each show ID
  So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…
30
But NULLs Lose Information
Suppose we take these rules and ask for:
  q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
If we look at the rule:
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
“By inspection,” q(t) ⊇ goodMovies(t,y)
But if we apply our inversion procedure, we get:
  show(i, t, y, g) ⊇ goodMovies(t,y), i = NULL, g = “drama”
  rating(i, r, s) ⊇ goodMovies(t,y), i = NULL, r = 5, s = NULL
We need “a special NULL” so we can figure out which IDs and ratings match up
31
The Solution: “Skolem Functions”
Skolem functions:
  Conceptual “perfect” hash functions
  Each function returns a unique, deterministic value for each combination of input values
  Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
Skolem functions won’t ever be part of the answer set or the computation – they don’t produce real values
  They’re just a way of logically generating “special NULLs”
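A minimal Python sketch of how Skolem terms act as “special NULLs” when inverting the goodMovies mapping. Representing a Skolem term as a tagged tuple, and the names skolem, f_id and f_source, are assumptions for illustration; the sample titles are hypothetical.

```python
def skolem(name, *args):
    """A Skolem term: equal inputs give equal terms, different names never collide."""
    return ("skolem", name, args)


def invert_good_movies(good_movies):
    """Invert goodMovies(t,y) ⊆ show(i,t,y,"drama"), rating(i,5,s), y = 1997."""
    show, rating = set(), set()
    for (t, y) in good_movies:
        i = skolem("f_id", t, y)       # unknown ID, but the SAME one in both facts
        s = skolem("f_source", t, y)   # unknown rating source
        show.add((i, t, y, "drama"))
        rating.add((i, 5, s))
    return show, rating


def q(show, rating):
    """q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997"""
    return {
        t
        for (i, t, y, g) in show
        for (i2, stars, s) in rating
        if i == i2 and stars == 5 and y == 1997
    }


show, rating = invert_good_movies({("Movie A", 1997), ("Movie B", 1997)})
answers = q(show, rating)
```

Because the ID term is built from (t, y), each show fact joins only with the rating fact produced from the same goodMovies tuple, which is exactly what a plain NULL could not express. The Skolem terms are used for the join but never appear in the answers.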
32
Query Answering Using Inverse Rules
Invert all rules using the procedures described
Take the query and the possible rule expansions and execute them in a Datalog interpreter
  In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources
  Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)
More efficient, but equivalent, algorithms now exist:
  Bucket algorithm [Levy et al.], which we discuss next
  MiniCon [Pottinger & Halevy] (next time)
  Also related: “chase and backchase” [Popa, Tannen, Deutsch]
33
The Bucket Algorithm
Given a query Q with relations and predicates
  Create a bucket for each subgoal in Q
  Iterate over each view (source mapping)
    If source includes bucket’s subgoal:
      Create mapping between Q’s vars and the view’s var at the same position
      If satisfiable with substitutions, add to bucket
  Do cross-product of buckets, see if result is contained (exptime, but queries are probably relatively small)
  For each result, do a containment check to make sure the rewriting is contained within the query
34
Let’s Try a Bucket Example
Query
  q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997
Sources
  5star(i) ⊆ show(i, t, y, g), rating(i, 5, s)
  TVguide(t,y,g,r) ⊆ show(i, t, y, g), rating(i, r, “TVGuide”)
  movieInfo(i,t,y,g) ⊆ show(i, t, y, g)
  critics(i,r,s) ⊆ rating(i, r, s)
  goodMovies(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1997
  good98(t,y) ⊆ show(i, t, y, “drama”), rating(i, 5, s), y = 1998
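The bucket-filling phase for this example can be sketched in Python. This is a simplified illustration, not the published algorithm in full: it covers only the first phase (no cross-product, no containment check), handles only equality constraints on variables, and uses an assumed encoding in which variables are bare strings and string constants carry double quotes.

```python
def is_var(term):
    """Variables are bare strings; constants are ints or double-quoted strings."""
    return isinstance(term, str) and not term.startswith('"')


def value_of(term, constraints):
    """Return the constant a term is pinned to, or None if it is unconstrained."""
    return constraints.get(term) if is_var(term) else term


def compatible(q_atom, q_head, q_cons, v_atom, v_head, v_cons):
    """Can the view subgoal v_atom cover the query subgoal q_atom?"""
    (q_pred, q_args), (v_pred, v_args) = q_atom, v_atom
    if q_pred != v_pred or len(q_args) != len(v_args):
        return False
    for q_term, v_term in zip(q_args, v_args):
        # A variable the query returns must be exported by the view's head.
        if q_term in q_head and v_term not in v_head:
            return False
        q_val, v_val = value_of(q_term, q_cons), value_of(v_term, v_cons)
        if q_val is not None and v_val is not None and q_val != v_val:
            return False   # e.g. y = 1997 in the query against y = 1998 in the view
    return True


def fill_buckets(query, views):
    """Return {query subgoal: [names of views that can cover it]}."""
    q_head, q_body, q_cons = query
    buckets = {atom: [] for atom in q_body}
    for name, (v_head, v_body, v_cons) in views.items():
        for atom in q_body:
            if any(compatible(atom, q_head, q_cons, v, v_head, v_cons) for v in v_body):
                buckets[atom].append(name)
    return buckets


SHOW = ("show", ("i", "t", "y", "g"))
DRAMA = ("show", ("i", "t", "y", '"drama"'))
FIVE = ("rating", ("i", 5, "s"))

query = (("t",), (SHOW, FIVE), {"y": 1997})
views = {
    "5star": (("i",), [SHOW, FIVE], {}),
    "TVguide": (("t", "y", "g", "r"), [SHOW, ("rating", ("i", "r", '"TVGuide"'))], {}),
    "movieInfo": (("i", "t", "y", "g"), [SHOW], {}),
    "critics": (("i", "r", "s"), [("rating", ("i", "r", "s"))], {}),
    "goodMovies": (("t", "y"), [DRAMA, FIVE], {"y": 1997}),
    "good98": (("t", "y"), [DRAMA, FIVE], {"y": 1998}),
}
buckets = fill_buckets(query, views)
```

Under these assumptions, 5star is kept out of the show bucket because it does not export the title, and good98 is kept out because its year constraint conflicts with the query’s. The later phases would still have to combine bucket entries and discard combinations that are not contained in the query.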
35
Example of Containment Testing
Suppose we have two queries:
q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C), Course(C, “DB & Info Systems”)
q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)
Intuitively, q1 must contain the same or fewer answers vs. q2:
  It has all of the same conditions, plus extra conjuncts (i.e., it’s more restricted)
  There’s no union or any other way it can add more data
We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}
36
Checking Containment via Canonical Databases
To test for q1 ⊆ q2:
  Create a “canonical DB” that contains a tuple for each subgoal in q1
  Execute q2 over it
  If q2 returns a tuple that matches the head of q1, then q1 ⊆ q2
(This is an NP-complete problem in the size of the query. Testing for full first-order logic queries is undecidable!!!)
Let’s see this for our example…
37
Example Canonical DB
q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C), Course(C, “DB & Info Systems”)
q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)
Student   Takes   Course                    inCSE
S  N      S  C    C  X                      C
                  C  “DB & Info Systems”
Need to get tuple <S,C> in executing q2 over this database
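The canonical-database test can be sketched in Python for this example. The encoding is an assumption: a query is a pair (head arguments, list of subgoals), variables are bare strings, and string constants carry double quotes. The search is a plain backtracking evaluation of q2 over the frozen subgoals of q1.

```python
def is_var(term):
    """Variables are bare strings; double-quoted strings are constants."""
    return isinstance(term, str) and not term.startswith('"')


def freeze(term):
    """Turn a q1 variable into a fresh constant of the canonical database."""
    return ("frozen", term) if is_var(term) else term


def match(term, value, binding):
    """Bind a q2 variable to a canonical-DB value, or compare a constant."""
    if is_var(term):
        return binding.setdefault(term, value) == value
    return term == value


def contained_in(q1, q2):
    """True if q1 ⊆ q2 for conjunctive queries given as (head_args, body)."""
    head1, body1 = q1
    head2, body2 = q2
    canonical_db = {(pred, tuple(freeze(a) for a in args)) for pred, args in body1}
    target = tuple(freeze(a) for a in head1)

    def search(goals, binding):
        if not goals:
            # q2 produced a tuple; it must match the frozen head of q1.
            return tuple(binding.get(a, a) for a in head2) == target
        (pred, args), rest = goals[0], goals[1:]
        for fact_pred, fact_args in canonical_db:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            new_binding = dict(binding)
            if all(match(a, f, new_binding) for a, f in zip(args, fact_args)):
                if search(rest, new_binding):
                    return True
        return False

    return search(list(body2), {})


common = [("Student", ("S", "N")), ("Takes", ("S", "C")), ("Course", ("C", "X"))]
q1 = (("S", "C"), common + [("inCSE", ("C",)),
                            ("Course", ("C", '"DB & Info Systems"'))])
q2 = (("S", "C"), common)
```

Running q2 over q1’s canonical database yields the tuple <S, C>, so q1 ⊆ q2; the reverse test fails because q2’s canonical database has no inCSE tuple.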
38
Next Time
We’ll look at the state-of-the-art in query reformulation, the MiniCon algorithm
  Eliminates the need for the containment check
  Eliminates many cross-product comparisons
  This – and the Chase&Backchase strategy of Tannen et al – are the two methods most used in virtual data integration today
Please read the MiniCon paper (Pottinger & Halevy)