Data Integration Methods Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems February 16, 2004.

Data Integration Methods

Zachary G. IvesUniversity of Pennsylvania

CIS 650 – Database & Information Systems

February 16, 2004

2

Administrivia

Next reading assignment on scalable query reformulation algorithms:

Pottinger and Halevy – MiniCon

Write-up: summarize the main ideas of this paper

3

Today’s Trivia Question

4

A Problem

We’ve seen that even with normalization and the same needs, different people will arrive at different schemas

In fact, most people also have different needs! Often people build databases in isolation, then want

to share their data Different systems within an enterprise Different information brokers on the Web Scientific collaborators Researchers who want to publish their data for others to

use This is the goal of data integration: tie together

different sources, controlled by many people, under a common schema

5

Example

We want to build the UltimateMovieGuideTM

Given: TV Guide schema:

ShowingAt(time, title, year)Shows(title, year, genre, rating)Director(title, year, name)Starring(title, year, name)

GoodMovies.com:FourStarMovie(title, year, genre)DecentMovie(title, year)

OscarWinners:WinningDirector(director, title, year)

Documentaries.org:Documentary(title, year, director, producer)

6

Integrating Data

Several steps: Getting data out from a data source – it may

have its own query/retrieval interface Sometimes it will need a query before it returns

answers, e.g., a Web form

Getting all of the data into the same data model and format

Translating the data into the same schema Answering queries

How might we handle this?

7

Data Warehouses – Offline Replication

Get experts together, define a schema they think best captures all the info

Define a database with this schema

Define procedural mappings in an “ETL tool” to import the data

Perhaps perform “data cleaning”

Periodically copy all of the data from the data sources Note that the sources and

the warehouse are basically independent at this point

Remote,AutonomousData Sources

Data Warehouse

Query Results

8

Pros and Cons of Data Warehouses

Need to spend time to design the physical database layout, as well as logical This actually takes a lot of effort!

Data is generally not up-to-date (lazy or offline refresh)

Queries over the warehouse don’t disrupt the data sources

Can run very heavy-duty computations, including data mining and cleaning

9

An Alternative – Mediators or Virtual Integration Systems

Get experts together, define a schema they think best captures all the info

Define as a virtual mediated schema Create declarative

mappings specifying how to get data from each source into the warehouse

Evaluate queries over the mediated schema “on the fly” using the current data at the sources

Data Integration System

Mediated Schema

Remote,AutonomousData Sources

Schema Mappings

SourceCatalog

Query Results

10

Core Question: How Do We Define and Use Mappings?

Queries must be directly composed with mappings

Leads to use of views as the means of specifying mappings

… So which direction do we specify views?1. Mediated relations as views over source relations2. Source relations as views over mediated relations

TSIMMIS chooses option 1 Information Mainfold chooses option 2 Neither is perfect or comprehensive, as we’ll

see

11

The Job of Mappings

Between different data sources: May have different numbers of tables – different

decompositions Attributes may be broken down differently (“rating” vs.

“EbertThumb” and “RoeperThumb”) Metadata in one relation may be data in another Values may not exactly correspond (“shows” vs.

“movies”) It may be unclear whether a value is the same

(“COPPOLA” vs. “Francis Ford Coppola”) May have different, but synonymous terms

(ImdbID “123456” SSN “987-45-3210”) Might have sub/superclass relationships

12

General Techniques

Value-value correspondences accomplished using concordance tables Join through a table mapping values to values Imdb_Actor(ID, SAG_actor_name)

Table-multitable correspondences accomplished using joins (in one direction), projections (in other direction) Key question: what happens if a needed attribute is missing?

(e.g., DecentMovie has no genre) Super/subclass relationships generally must be

captured using selection (in one direction), union (in other direction)

… And sometimes we just can’t specify the correspondence!

13

Some Examples of Mappings

Show(ID, Title, Year, Lang, Genre)

Movie(ID, Title, Year, Genre, Director, Star1, Star2)

EnglishMovie(Title, Year, Genre, Rating)

Docu(ID, Title, Year)Participant(ID, Name, Role)

ImdbID

CastOf

1234 Catwoman

Name CastOf

Berry, H.

Monster’s Ball

PieceOfArt(I, T, Y, “English”, “G”) :- EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)

Movie(I, T, Y, “doc”, D, S1, S2) :- Docu(I, T, Y), Participant(I, D, “Dir”), Participant(I, S1, “Cast1”), Participant(I, S2, “Cast2”)

T1 T2

Need a concordance table from ImdbIDs to actress names

14

TSIMMIS and Information Manifold

Focus: Web-based queryable sources CGI forms, online databases, maybe a few RDBMSs Each needs to be mapped into the system – not as

easy as web search – but the benefits are significant vs. query engines

A few parenthetical notes: Part of a slew of works on wrappers, source profiling,

etc. The creation of mappings can be partly automated –

systems such as LSD, Cupid, Clio, … do this Today most people look at integrating large

enterprises (that’s where the $$$ is!) – Nimble, BEA, IBM

15

TSIMMIS

“The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew

An instance of a “global-as-view” mediation system

One of the first systems to support semi-structured data, which predated XML by several years

16

Semi-structured Data: OEM

Observation: given a particular schema, its attributes may be unavailable from certain sources – inherent irregularity

Proposal: Object Exchange Model, OEM

OID: <label, type, value>

1: show { 2: id { 15 }, 3: title { Catwoman }, 4: year { 2004 }, 5: lang { English }, 6: genre { fantasy }, 7: criticsrating { 8: stars { 0.5 }, 9: source { Bob } }}

17

Queries in TSIMMIS

Specified in OQL-style language called Lorel OQL was an object-oriented query language Lorel is, in many ways, a predecessor to XQuery

Based on path expressions over OEM structures:select showwhere show.title = “Star Wars” and show.genre = “sci-fi”

This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Previous query restated =

for $s in AllData()/showwhere $s/title/text() = “Star Wars” and $s/genre/text() = “sci-fi”return $s

18

Query Answering in TSIMMIS

Basically, it’s view unfolding, i.e., composing a query with a view The query is the one being asked The views are the MSL templates for the

wrappers Some of the views may actually require

parameters, e.g., an author name, before they’ll return answers Common for web forms (see Amazon, Google, …) XQuery functions (XQuery’s version of views) support

parameters as well, so we’ll see these in action

19

A Wrapper Definition in MSL

Wrappers have templates and binding patterns ($T) in MSL:

S :- S: <show {<genre $G>}> // $$ = “select * from movie where title=“ $T //

This reformats a SQL query over Movie(title, year, genre)

In XQuery, this might look like:define function GetShow($t AS xsd:string) as show {

for $s in sql(“Amazon.DB”, “select * from movie where title=‘” + $t +”’”)

return <show><title>{$t}</title> {$s/year, $s/genre}</show>

}

movie

year genre

… …

…

The union of GetShow’s results is unioned with others to form the view AllData()

…

20

How to Answer the Query

Given our query:for $s in AllData()/showwhere $s/title/text() = “Star Wars” and $s/genre/text() = “sci-fi”return $s

Find all wrapper definitions that: Contain output enough “structure” to match

the conditions of the query Or have already tested the conditions for us!

21

Query Composition with Views

We find all views that define book with author and title, and we compose the query with each:

define function GetBook($x AS xsd:string) as book {for $b in

sql(“Amazon.DB”, “select * from book where author=‘” + $x + “’”)

return <book> {$b/title} <author>{$x}</author></book>}for $b in AllData()/book

where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”return $b

book

title author

… …

22

Example on Board

23

Virtues of TSIMMIS

Early adopter of semistructured data, greatly predating XML Can support data from many different kinds of

sources Obviously, doesn’t fully solve heterogeneity

problem

Presents a mediated schema that is the union of multiple views Query answering based on view unfolding

Easily composed in a hierarchy of mediators

24

Limitations of TSIMMIS’ Approach

Some data sources may contain data with certain ranges or properties

“Books by Aho”, “Students at UPenn”, … If we ask a query for students at Columbia, don’t

want to bother querying students at Penn… How do we express these?

Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema

25

An Alternate Approach:The Information Manifold (Levy et al.)

When you integrate something, you have some conceptual model of the integrated domain

Define that as a basic frame of reference, everything else as a view over it

“Local as View” using mappings that are conjunctive queries

May have overlapping/incomplete sources Define each source as the subset of a query over the

mediated schema – the “open world assumption” We can use selection or join predicates to specify that a

source contains a range of values:ComputerBooks(…) Books(Title, …, Subj), Subj =

“Computers”

26

The Local-as-View Model

The basic model is the following: “Local” sources are views over the mediated

schema Sources have the data – mediated schema is

virtual Sources may not have all the data from the

domain – “open-world assumption”

The system must use the sources (views) to answer queries over the mediated schema

27

Answering Queries Using Views

Assumption: conjunctive queries, set semantics Suppose we have a mediated schema:

show(ID, title, year, genre), rating(ID, stars, source)

A conjunctive query might be: q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997

Recall intuitions about this class of queries: Adding a conjunct to a query removes answers

from the result but never adds anyAny conjunctive query with at least the same

constraints & conjuncts will give valid answers

28

Query Answering

Suppose we have the query:q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997

and sources:5star(i) show(i, t, y, g), rating(i, 5, s)TVguide(t,y,g,r) show(i, t, y, g), rating(i, r, “TVGuide”)movieInfo(i,t,y,g) show(i, t, y, g)critics(i,r, s) rating(i, r, s)goodMovies(t,y) show(i, t, y, “drama”), rating(i, 5, s),

y = 1997

We want to compose the query with the source mappings – but they’re in the wrong direction!

29

Inverse Rules

We can take every mapping and “invert” it, though sometimes we may have insufficient information:

If5star(i) show(i, t, y, g), rating(i, 5, s)

then we can also infer that:show(i,??? ,??? ,??? ,???) 5star(i)

But how to handle the absence of the missing attributes? We know that there must be AT LEAST one instance

of ??? for each attribute for each show ID So we might simply insert a NULL and define that NULL

means “unknown” (as opposed to “missing”)…

30

But NULLs Lose Information

Suppose we take these rules and ask for: q(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997

If we look at the rule:goodMovies(t,y) show(i, t, y, “drama”), rating(i, 5, s), y

= 1997

“By inspection,” q(t) goodMovies(t,y)

But if apply our inversion procedure, we get:show(i, t, y, g) goodMovies(t,y), i = NULL, g = “drama”rating(i, r, s) goodMovies(t,y), i = NULL, r = 5, s =

NULL

We need “a special NULL” so we can figure out which IDs and ratings match up

31

The Solution: “Skolem Functions”

Skolem functions: Conceptual “perfect” hash functions Each function returns a unique, deterministic value

for each combination of input values Every function returns a non-overlapping set of

values (Skolem function F will never return a value that matches any of Skolem function G’s values)

Skolem functions won’t ever be part of the answer set or the computation – it doesn’t produce real values They’re just a way of logically generating “special

NULLs”

32

Query Answering Using Inverse Rules

Invert all rules using the procedures describedTake the query and the possible rule expansions and

execute them in a Datalog interpreter In the previous query, we expand with all combinations of

expansions of book and of author – every possible way of combining and cross-correlating info from different sources

Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)

More efficient, but equivalent, algorithms now exist: Bucket algorithm [Levy et al.], which we discuss next MiniCon [Pottinger & Halevy] (next time) Also related: “chase and backchase” [Popa, Tannen,

Deutsch]

33

The Bucket Algorithm

Given a query Q with relations and predicates Create a bucket for each subgoal in Q Iterate over each view (source mapping)

If source includes bucket’s subgoal: Create mapping between q’s vars and the view’s var

at the same position If satisfiable with substitutions, add to bucket

Do cross-product of buckets, see if result is contained (exptime, but queries are probably relatively small)

For each result, do a containment check to make sure the rewriting is contained within the query

34

Let’s Try a Bucket Example

Queryq(t) :- show(i, t, y, g), rating(i, 5, s), y = 1997

Sources5star(i) show(i, t, y, g), rating(i, 5, s)TVguide(t,y,g,r) show(i, t, y, g), rating(i, r,

“TVGuide”)movieInfo(i,t,y,g) show(i, t, y, g)critics(i,r, s) rating(i, r, s)goodMovies(t,y) show(i, t, y, “drama”), rating(i,

5, s), y = 1997 good98(t,y) show(i, t, y, “drama”), rating(i, 5, s),

y = 1998

35

Example of Containment Testing

Suppose we have two queries:

q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C),

Course(C, “DB & Info Systems”)

q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)

Intuitively, q1 must contain the same or fewer answers vs. q2: It has all of the same conditions, except one extra conjunction

(i.e., it’s more restricted) There’s no union or any other way it can add more data

We can say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}

36

Checking Containment via Canonical Databases

To test for q1 µ q2: Create a “canonical DB” that contains a tuple for

each subgoal in q1 Execute q2 over it If q2 returns a tuple that matches the head of q1,

then q1 µ q2

(This is an NP-complete algorithm in the size of the query. Testing for full first-order logic queries is undecidable!!!)

Let’s see this for our example…

37

Example Canonical DB

q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C), Course(C, “DB & Info Systems”)

q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)

Student Takes Course inCSE

S N S C C X

C DB & Info

Systems

S

Need to get tuple <S,C> in executing q2 over this database

38

Next Time

We’ll look at the state-of-the-art in query reformulation, the MiniCon algorithm Eliminates the need for the containment check Eliminates many cross-product comparisons This – and the Chase&Backchase strategy of

Tannen et al – are the two methods most used in virtual data integration today

Please read the MiniCon paper (Pottinger & Halevy)

Data Integration Methods Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems February 16, 2004.

Documents

data cleaning

integrating data

current data

data mining

data model

data different systems

goal of data integration

cons of data warehouses