iTrails: Pay-as-you-go Information Integration in Dataspaces Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi ETH Zurich VLDB 2007 Anat Heilper Jan. 2009 CS Seminar in Databases (236826) 1

Dec 20, 2015

iTrails: Pay-as-you-go Information Integration in Dataspaces

Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi

ETH Zurich

VLDB 2007

Anat Heilper

Jan. 2009

CS Seminar in Databases (236826)

Problem: Querying heterogeneous data sources

Data sources: Laptop, Email Server, Web Server, DB Server

Query: What is the impact of the global depression in Israel?

Systems: ? ? ? ?

Solution 1: Use a Search Engine

Data sources: Laptop, Email Server, Web Server, DB Server (each contributes text and links)

Query: global depression Israel

System: Graph IR Search Engine, e.g., TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]

Query semantics are not precise!

Solution 2: Use an Information Integration System

[Figure: a query and its result pass through a query interface over a global schema, which is mapped to the source schemas of data sources 1, 2, and 3. Schema elements shown: price index, countries, unemployment, crime rate.]

Too much effort to provide schema mappings!

Querying heterogeneous data sources: two opposite approaches

• Schema-first approach (SFA)
– Semantically integrated view over the data sources
– Mappings between source schemas and mediated schema
– Queries have clearly defined semantics
– Expensive to construct and maintain
– Not all data sources have schemas

• No-schema approach (NSA)
– Keyword search
– Requires good result ranking methods
– Performs no integration
– Query semantics are not well defined

Motivation of iTrails

Can we find an integration solution in between these two extremes?

[Figure: a Dataspace System (?) sits between a Graph IR Search Engine, whose sources contribute text and links, and a Data Integration System, whose sources carry schemas such as Temps, Cities, CO2, and Sunspots.]

Pay-as-you-go Information Integration: the more effort you pay, the more query power you have.

iTrails Core Idea: Add Integration Hints Incrementally

1) Provide a search service over the data
– Use a general graph data model (iDM)
– Handles unstructured documents, XML, and relations

2) Add integration semantics via hints (trails)

3) If more semantics are needed, apply trails
– Smooth transition between search and data integration
– Semantics added incrementally to improve precision / recall

Example of an iDM

X1 = { .name = ‘home’, .tuple = {.owner = ‘root’, .lastmodified = ‘05.01.2000’}, .content = ‘’ }
X2 = { .name = ‘mike’, .tuple = {.owner = ‘root’, .lastmodified = ‘04.17.2008’}, .content = ‘’ }
. . .
X5 = { .name = ‘SIGMOD42.pdf’, .tuple = {size = 10k, .owner = ‘mike’, .lastmodified = ‘04.01.2007’}, .content = ‘@PDF . . .’ }
. . .

[Figure: graph of resource views with nodes home (1), mike (2), papers, projects, PIM, QP, SIGMOD42.pdf (5), SIGMOD44.pdf, VLDB12.pdf, VLDB10.pdf.]

General graph data model - iDM

iDM (iMeMeX Data Model) represents every structural component of the input data as a node.

Supports unstructured, semi-structured, and structured data, e.g., files and folders, XML, and relations

iMeMeX – integrated MeMeX

Vannevar Bush introduced the concept of the “memex” in 1945: "device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility."

Bush predicted: "Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified."

Data model

Data is represented by a directed graph $G = (RV, E)$

$RV = \{V_1, \ldots, V_n\}$: the set of nodes, termed resource views

$E$: ordered pairs $(V_i, V_j)$ of resource views

$V_i \rightsquigarrow V_j$: $V_j$ is reachable from $V_i$ by traversing the edges in $E$
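The graph definition above can be sketched in a few lines of Python. This is only an illustration: the class names `ResourceView` and `Graph`, the field name `tuple_`, and the choice to count only paths of one or more edges as "reachable" are assumptions, not part of iDM or iMeMex.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceView:
    # Hypothetical stand-in for an iDM resource view (name, tuple, content).
    name: str
    tuple_: tuple = ()   # sequence of (attribute, value) pairs
    content: str = ""


@dataclass
class Graph:
    rv: list                                   # resource views V1..Vn
    edges: set = field(default_factory=set)    # ordered pairs (Vi, Vj)

    def reachable(self, start):
        """Return every Vj reachable from start via one or more edges."""
        seen, stack = set(), [start]
        while stack:
            v = stack.pop()
            for a, b in self.edges:
                if a == v and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen


# Illustrative three-node chain: home -> mike -> papers.
home, mike, papers = ResourceView("home"), ResourceView("mike"), ResourceView("papers")
g = Graph([home, mike, papers], {(home, mike), (mike, papers)})
print(sorted(v.name for v in g.reachable(home)))
```

The `seen` set guards against revisiting nodes, so the traversal also terminates on graphs with cycles.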

Resource view

A resource view Vi has three components: name, tuple, and content

Component     Type
Vi.name       string
Vi.tuple      sequence of attribute-value pairs ((att0, val0), (att1, val1), ...)
Vi.content    text

Example: { .name = ‘SIGMOD42.pdf’, .tuple = {size = 10k, .owner = ‘mike’, .lastmodified = ‘04.01.2007’}, .content = ‘@PDF . . .’ }

Query model

Query expression:
– Query Q selects nodes $R := Q(G) \subseteq G.RV$
– Example: //mike/papers

Component projection:
– For $C \in \{.name, .tuple.\langle att_i \rangle, .content\}$, the projection of the set of resource views selected by query Q onto C is the set of components $R' := \{V_i.C \mid V_i \in Q(G)\}$
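As a rough illustration of component projection, the sketch below projects a set of already-selected resource views onto one component. The `ResourceView` class and the `project` function are hypothetical names, and skipping views that lack the requested attribute is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceView:
    # Hypothetical stand-in for an iDM resource view.
    name: str
    tuple_: tuple = ()   # sequence of (attribute, value) pairs
    content: str = ""


def project(views, component):
    """Return R' = {V.C | V in views} for C in .name, .content, .tuple.<att>."""
    result = set()
    for v in views:
        if component == ".name":
            result.add(v.name)
        elif component == ".content":
            result.add(v.content)
        elif component.startswith(".tuple."):
            att = component[len(".tuple."):]
            # Assumption: views without the attribute contribute nothing.
            result.update(val for a, val in v.tuple_ if a == att)
        else:
            raise ValueError(f"unknown component: {component}")
    return result


x2 = ResourceView("mike", (("owner", "root"), ("lastmodified", "04.17.2008")))
x5 = ResourceView(
    "SIGMOD42.pdf",
    (("size", "10k"), ("owner", "mike"), ("lastmodified", "04.01.2007")),
    "@PDF ...",
)
print(sorted(project([x2, x5], ".tuple.lastmodified")))
```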

Component projection example

Example: //mike//PIM/*.tuple.lastmodified

X1 = { .name = ‘home’, .tuple = {.owner = ‘root’, .lastmodified = ‘05.01.2000’}, .content = ‘’ }
X2 = { .name = ‘mike’, .tuple = {.owner = ‘root’, .lastmodified = ‘04.17.2008’}, .content = ‘’ }
. . .
X5 = { .name = ‘SIGMOD42.pdf’, .tuple = {size = 10k, .owner = ‘mike’, .lastmodified = ‘04.01.2007’}, .content = ‘@PDF . . .’ }
. . .

[Figure: graph of resource views with nodes home (1), mike (2), papers, projects, PIM, QP, SIGMOD42.pdf (5), SIGMOD44.pdf, VLDB12.pdf, VLDB10.pdf.]

Syntax of query expression

QUERY_EXPRESSION := (PATH | KT_PREDICATE) (union QUERY_EXPRESSION)*

PATH := (LOCATION_STEP)+

LOCATION_STEP := LS_SEP NAME_PREDICATE (`[` KT_PREDICATE `]`)?

LS_SEP := `//` | `/`

NAME_PREDICATE := `*` | (`*`)? VALUE (`*`)?

KT_PREDICATE := (KEYWORD | TUPLE) (LOGOP KT_PREDICATE)*

KEYWORD := `"` VALUE (WHITESPACE VALUE)* `"` | VALUE (WHITESPACE KEYWORD)*

TUPLE := ATTRIBUTE_IDENTIFIER OPERATOR VALUE

OPERATOR := `=` | `<` | `>`

LOGOP := `AND` | `OR`
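To make the PATH production concrete, here is a minimal sketch that splits a path expression into its location steps. It covers only PATH (not keyword-only queries or `union`), leaves the bracketed predicate as an unparsed string, and the names `STEP` and `parse_path` are illustrative, not part of iMeMex.

```python
import re

# One LOCATION_STEP: separator, name predicate, optional [KT_PREDICATE].
STEP = re.compile(r"(//|/)([^/\[\]]+)(?:\[([^\]]*)\])?")


def parse_path(path):
    """Split a PATH into (separator, name_predicate, kt_predicate) triples."""
    steps, pos = [], 0
    while pos < len(path):
        m = STEP.match(path, pos)
        if m is None:
            raise ValueError(f"cannot parse location step at offset {pos}: {path!r}")
        sep, name, pred = m.groups()
        steps.append((sep, name.strip(), pred.strip() if pred is not None else None))
        pos = m.end()
    if not steps:
        raise ValueError("a PATH needs at least one location step")
    return steps


print(parse_path("//mike/papers"))
print(parse_path("//Temperatures/*[celsius > 10]"))
```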

Semantics of example query expressions:

– All nodes in the graph
– All nodes in the graph that have ‘a’ in their content
– All nodes in the graph that have ‘a’ and ‘b’ in their content
– All nodes in the graph with .name == ‘A’
– All nodes with .name == ‘B’ that have an incoming edge from a node W with W.name == ‘A’

Logical algebra for query expressions

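Operator          Name                  Semantics
$G$               All resource views    $\{V \mid V \in G.RV\}$
$\sigma_P(I)$     Selection             $\{V \mid V \in I \wedge P(V)\}$
$\mu(I)$          Shallow unnest        $\{W \mid (V, W) \in G.E \wedge V \in I\}$
$\omega(I)$       Deep unnest           $\{W \mid V \rightsquigarrow W \wedge V \in I\}$
$I_1 \cap I_2$    Intersection          $\{V \mid V \in I_1 \wedge V \in I_2\}$
$I_1 \cup I_2$    Union                 $\{V \mid V \in I_1 \vee V \in I_2\}$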
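The operators can be sketched over plain Python sets. In this illustration nodes are identified by their name strings and the edge set is made up for the example; the function names are assumptions, and intersection and union map directly to Python's `&` and `|`.

```python
def select(views, pred):
    """Selection: keep the views for which the predicate holds."""
    return {v for v in views if pred(v)}


def shallow_unnest(edges, views):
    """Shallow unnest: direct successors of the input views."""
    return {w for (v, w) in edges if v in views}


def deep_unnest(edges, views):
    """Deep unnest: everything reachable from the input views."""
    result, frontier = set(), set(views)
    while frontier:
        frontier = shallow_unnest(edges, frontier) - result
        result |= frontier
    return result


# Illustrative graph (assumed hierarchy): home -> mike -> papers -> PIM.
rv = {"home", "mike", "papers", "PIM"}
edges = {("home", "mike"), ("mike", "papers"), ("papers", "PIM")}

# //mike/papers: children of nodes named mike, intersected with nodes named papers.
mike_nodes = select(rv, lambda v: v == "mike")
result = shallow_unnest(edges, mike_nodes) & select(rv, lambda v: v == "papers")
print(result)
```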

What have we seen so far?

Problem: querying heterogeneous data sources

Find a solution between SFA and NSA
– Generic graph data model to describe the data
– Queries describe paths in the graph

How do iTrails help?

Queries are modified by hints (trails), which add or modify the search paths to look at.

Example: yesterday → //*[date = today() – 1]

iTrails: Defining Trails

Basic Form of a Trail

QL [.CL] → QR [.CR]

Intuition: when I query for QL [.CL], you should also query for QR [.CR]

QL, QR: queries (keyword and path expressions)

.CL, .CR: attribute projections

iTrails: Defining Trails

Unidirectional trail: QL [.CL] → QR [.CR]
– Intuition: when querying for QL [.CL], also query for QR [.CR]

Bidirectional trail: QL [.CL] ↔ QR [.CR]
– Example: ψi := //*.tuple.date ↔ //*.tuple.modified

QL, QR: queries (keyword and path expressions)

.CL, .CR: attribute projections

Query examples: global warming zurich, or //Temperatures/*[celsius > 10]
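One possible in-memory representation of a trail is sketched below. The `Trail` class, its field names, and the `directions` helper are hypothetical; the sketch only shows that a bidirectional trail can be treated as two unidirectional ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    # Hypothetical representation of QL[.CL] -> QR[.CR].
    left: str                          # QL
    right: str                         # QR
    left_proj: Optional[str] = None    # .CL
    right_proj: Optional[str] = None   # .CR
    bidirectional: bool = False

    def directions(self):
        """Expand the trail into one or two unidirectional trails."""
        forward = Trail(self.left, self.right, self.left_proj, self.right_proj)
        if not self.bidirectional:
            return [forward]
        backward = Trail(self.right, self.left, self.right_proj, self.left_proj)
        return [forward, backward]


date_modified = Trail("//*", "//*", ".tuple.date", ".tuple.modified", bidirectional=True)
for t in date_modified.directions():
    print(f"{t.left}{t.left_proj} -> {t.right}{t.right_proj}")
```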

Trail Examples: Global Warming Zurich

Query: global warming zurich

Trail for implicit meaning: when querying for global warming, also query for temperature data above 10 degrees

global warming → //Temperatures/*[celsius > 10]

Trail for an entity: when querying for zurich, also query for references to Zurich as a region

zurich → //*[region = “ZH”]

Temperatures
city      celsius   date      region
Bern      20        24-Sep    BE
Uster     15        24-Sep    ZH
Zurich    14        25-Sep    ZH
Zurich    9         26-Sep    ZH

Trail Example: Deep Web Bookmarks

Trail for a bookmark: when querying for train home, also query the train company's website with origin = Tel Aviv Uni and destination = Haifa Hof Hacarmel

Query: train home

train home → //trainCompany.com//*[origin = “Tel Aviv Uni” and dest = “Haifa Hof Hacarmel”]

Data source: Web Server

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search

Trail for thesauri: when querying for car, also query for auto

car → auto

Trails for dictionary: when querying for car, also query for carro, and vice versa

car → automobile
automobile → car

Data sources: Laptop, Email Server

Trail Examples: Schema Equivalences

Trail for schema match on names: when querying for Employee.empName, also query for Person.name

//Employee//*.tuple.empName → //Person//*.tuple.name

Trail for schema match on salaries: when querying for Employee.salary, also query for Person.income

//Employee//*.tuple.salary → //Person//*.tuple.income

Relations on the DB Server:
Employee(empId, empName, salary)
Person(SSN, name, age, income)

How are Trails Created?

Given by the user
– Explicitly
– Via relevance feedback

(Semi-)automatically
– Automatic schema matching
– Ontologies and thesauri (e.g., WordNet)
– User communities (e.g., trails on gene data, bookmarks)

Uncertainty and Trails

Probabilistic trails:
– Model uncertain trails
– Probabilities are used to rank trails

QL [.CL] → QR [.CR], 0 ≤ p ≤ 1
– Example: car → auto, p = 0.9

The probability p reflects the likelihood that results obtained by the trail are correct.
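Ranking trails by probability can be sketched as a simple sort. Only `car → auto, p = 0.9` comes from the slide; the other two trails and their probabilities are made-up values for illustration, and `rank_trails` is a hypothetical helper.

```python
def rank_trails(trails):
    """Sort (left, right, p) trails by decreasing probability p."""
    for left, right, p in trails:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {left} -> {right}: {p}")
    return sorted(trails, key=lambda t: t[2], reverse=True)


trails = [
    ("car", "truck", 0.2),        # illustrative value
    ("car", "auto", 0.9),         # example from the slide
    ("car", "automobile", 0.7),   # illustrative value
]
print(rank_trails(trails))
```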

Uncertainty and Trails (continued)

Scored trails:
– Give higher value to certain trails
– Scoring factors boost the scores of results obtained by the trail

QL [.CL] →sf QR [.CR], sf ≥ 1. Examples:
– T1: weather →sf //Temperatures/*, sf ≥ 1
– T2: yesterday →sf //*[date = today() – 1], sf ≥ 1

Intuition: sf reflects the relevance of the trail.
– Results obtained by the trail are scored sf times higher than results obtained without the trail.
– If no scoring factor is available, sf = 1
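A minimal sketch of how a scoring factor might be applied when results are merged. The slide does not say how a result found both with and without the trail should be scored, so taking the maximum is an assumption; `merge_scores` and the sample scores are hypothetical.

```python
def merge_scores(original, via_trail, sf=1.0):
    """Merge result scores, boosting trail results by the scoring factor sf.

    original, via_trail: dicts mapping a result id to its base score.
    """
    if sf < 1.0:
        raise ValueError("scoring factor is expected to be at least 1")
    merged = dict(original)
    for result, score in via_trail.items():
        boosted = sf * score
        # Assumption: a result found both ways keeps its higher score.
        merged[result] = max(merged.get(result, 0.0), boosted)
    return merged


print(merge_scores({"doc_a": 0.5}, {"doc_b": 0.4}, sf=2.0))
```

With the default `sf = 1`, trail results are scored like any other result.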

Rewriting Queries with Trails

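Query: weather yesterday (a query tree with the leaves weather and yesterday)

Trail T2: yesterday → //*[date = today() – 1]

(1) Matching: the left side of T2 matches the leaf yesterday

(2) Transformation: the right side of T2 is transformed into the subtree //*[date = today() – 1]

(3) Merging: the matched leaf yesterday is replaced by the union of yesterday and //*[date = today() – 1]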

Replacing Trails

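Trails that use replace instead of union semantics

Query: weather yesterday

Trail T2 (replace semantics): yesterday → //*[date = today() – 1]

(1) Matching: T2 matches the leaf yesterday

(2) Transformation

(3) Merging: the matched leaf yesterday is replaced by //*[date = today() – 1], so the query tree has the leaves weather and //*[date = today() – 1]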
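The matching, transformation, and merging steps, with either union or replace semantics, can be sketched on a tiny query tree. Several simplifications are assumed here: a leaf matches only if it equals the trail's left side exactly, the transformation step is the identity, and the root operator of the example query is taken to be an intersection of the two keywords. `apply_trail` is a hypothetical helper.

```python
def apply_trail(tree, left, right, replace=False):
    """Apply the trail left -> right to a query tree.

    tree: a leaf (str) or a tuple (operator, [children]).
    """
    if isinstance(tree, str):
        if tree != left:          # (1) matching: exact match only (assumption)
            return tree
        # (2) transformation is the identity in this sketch
        # (3) merging: either replace the leaf or union it with the right side
        return right if replace else ("union", [tree, right])
    op, children = tree
    return (op, [apply_trail(c, left, right, replace) for c in children])


# Assumption: the keyword query is an intersection of its two keywords.
query = ("intersect", ["weather", "yesterday"])
t2_right = "//*[date = today() - 1]"

print(apply_trail(query, "yesterday", t2_right))                # union semantics
print(apply_trail(query, "yesterday", t2_right, replace=True))  # replace semantics
```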

Problem: Recursive Matches (1/2)

T2: yesterday → //*[date = today() – 1]

After T2 is applied, the query tree has the leaves weather, yesterday, and //*[date = today() – 1].

The new query still matches T2, so T2 could be applied again, adding another //*[date = today() – 1] leaf each time.

Infinite recursion

Problem: Recursive Matches (2/2)

Trails may be mutually recursive

T3: //*.tuple.date → //*.tuple.modified

T10: //*.tuple.modified → //*.tuple.date

Starting from the query tree with the leaves weather, yesterday, and //*[date = today() – 1]:
– T3 matches and adds //*[modified = today() – 1]
– T10 matches the new leaf and adds //*[date = today() – 1]
– We again match T3 and enter an infinite loop

Algorithm to solve recursion - MMCA

Multiple Match Coloring Algorithm (MMCA):
– Keep a history of all trails matched or introduced
– Given a set of trails Y, for every trail t in Y:
– Apply t to Q iteratively and color the query tree nodes in Q according to the trails that have already touched those nodes
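The coloring idea can be sketched in a heavily simplified form. This is not the paper's algorithm: the query tree is flattened to a list of leaves, leaves are short labels standing in for query expressions, a trail matches a leaf only on exact label equality, and the names `mmca`, `max_levels`, and the labels are all assumptions. It does show why coloring stops mutually recursive trails such as T3 and T10.

```python
def mmca(query_leaves, trails, max_levels=10):
    """Simplified coloring sketch.

    query_leaves: list of leaf labels.
    trails: dict trail_name -> (left_label, right_label).
    Returns (final leaf labels, list of (level, trail, matched leaf)).
    """
    leaves = [[label, frozenset()] for label in query_leaves]  # [label, colors]
    applications = []
    for level in range(1, max_levels + 1):
        new_leaves = []
        for leaf in leaves:
            label, colors = leaf
            # A trail may touch a leaf only if the leaf does not carry its color.
            matched = [n for n, (left, _) in trails.items()
                       if left == label and n not in colors]
            for name in matched:
                # New leaves inherit the history of the leaf they came from.
                new_leaves.append([trails[name][1], colors | {name}])
                applications.append((level, name, label))
            leaf[1] = colors | set(matched)
        if not new_leaves:
            break
        leaves.extend(new_leaves)
    return [label for label, _ in leaves], applications


trails = {
    "T1": ("weather", "temps"),
    "T2": ("yesterday", "date"),
    "T3": ("date", "modified"),
    "T10": ("modified", "date"),   # mutually recursive with T3
}
labels, applied = mmca(["weather", "yesterday"], trails)
print(labels)
print(applied)
```

In this run T10 reintroduces a `date` leaf, but that leaf already carries T3's color, so T3 is not applied to it again and the rewrite terminates.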

Multiple Match Coloring Algorithm: example

T1: weather → //Temperatures/*
T2: yesterday → //*[date = today() – 1]
T3: //*.tuple.date → //*.tuple.modified
T4: //*.tuple.date → //*.tuple.received

Query: weather yesterday

First level: T1 and T2 match. The leaf weather is merged with //Temperatures/*, and the leaf yesterday is merged with //*[date = today() – 1].

Second level: T3 and T4 match the new leaf //*[date = today() – 1], which is merged with //*[modified = today() – 1] and //*[received = today() – 1].

Multiple Match Coloring Algorithm (continued)

MMCA is exponential in the number of levels
– Any of the trails can be applied to every leaf, and each trail can generate additional leaves.

Solution: trail pruning
– Number of levels: punish recursive rewrites
– Top-K trails matched in each level, ranked by probability / certainty / weight
– Other: timeout, progressively compute query results
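Top-K pruning within one level can be sketched with the standard library. The function name `prune_matches` and the sample probabilities are illustrative; ranking by probability alone is one of the options the slide lists.

```python
import heapq


def prune_matches(matches, k):
    """Keep the k most probable trail matches of one rewrite level.

    matches: list of (trail_name, probability) pairs.
    """
    return heapq.nlargest(k, matches, key=lambda m: m[1])


level_matches = [("T1", 0.9), ("T2", 0.3), ("T3", 0.6)]  # illustrative values
print(prune_matches(level_matches, 2))
```

A cap on the number of levels would be applied separately, by stopping the rewrite loop after a fixed depth.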

iTrails Evaluation in iMeMex

Main questions in the evaluation
– Quality: top-K precision and recall
– Performance: use of materialization
– Scalability: query-rewrite time vs. number of trails

iTrails Evaluation in iMeMex

Scenario 1: few high-quality trails
– Closer to information integration use cases
– Obtained real datasets and indexed them
– 18 hand-crafted trails
– 14 hand-crafted queries

Scenario 2: many low-quality trails
– Closer to search use cases
– Randomly generated up to 10,000 trails and queries with a mutual uniform match probability of 1%

iTrails Evaluation in iMeMex: Scenario 1

Configured iMeMex to act in three modes
– Baseline: graph / IR search engine
– iTrails: rewrite search queries with trails
– Perfect Query: semantics-aware query

Data: shipped to a central index from the Laptop, Email Server, Web Server, and DB Server

[Table: sizes of the data sources in MB]

Trails and queries used in Scenario 1

– Max original tree size: 14
– Max final tree size after applying trails: 35
– Max number of trails applied: 5

Quality: Top-K Precision and Recall (k=20)

Scenario 1: few high-quality trails (18 trails)

[Figure: top-K precision and recall per query, compared against the perfect query]

– The search engine misses relevant results
– The search query is partially semantics-aware
– Perfect Query always has precision and recall equal to 1
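For readers unfamiliar with the metric, a common way to compute precision and recall over the top k results is sketched below. The exact definition used in the evaluation is not given on the slide, so this formula, the function name, and the sample data are assumptions.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall over the first k entries of a ranked result list."""
    top = ranked[:k]
    hits = sum(1 for r in top if r in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ranked = ["a", "b", "c", "d"]     # illustrative ranked results
relevant = {"a", "c", "e"}        # illustrative ground truth
print(precision_recall_at_k(ranked, relevant, k=2))
```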

Performance: Use of Materialization

Trail merging adds overhead to query execution

Trail materialization improves performance for almost all queries

Scenario 1: few high-quality trails (18 trails)

Scalability: Query-rewrite Time vs. Number of Trails – scenario 2

• Without pruning, query plan sizes grow exponentially
• Query-rewrite time can be controlled with pruning

Summary

First framework to explore pay-as-you-go information integration in dataspaces

iTrails: generic method to model semantic relationships gradually

iTrails are used to rewrite queries

Algorithm to control recursive query rewrites

Personal opinion - advantages

The method is incremental
– Integrators can collect statistics, find the most common queries, and define trails for popular queries first.
– Dynamic system: if the popular queries change over time, trails for less popular queries can be disabled to reduce system workload.

Trails can be defined independently by domain experts for each data domain.

Personal opinion - disadvantages

Trails are global: every rewritten query is evaluated over every data source.
– A trail can have different meanings for different data sources.

For good-quality query results, trails have to be defined manually, which is a problem for large systems.
– Solution: use machine learning techniques to improve automatic trail creation.

Overlaps and inconsistencies in trails are possible, since a query returns the union of the results satisfying all trails.
– Solution: trail mining and weighting would be helpful here.

Questions?

Bibliography

Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. ETH Zurich, 8092 Zurich, Switzerland. dbis.ethz.ch | iMeMex.org

Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), David Maier (Portland State University). From Databases to Dataspaces: A New Abstraction for Information Management.

Wikipedia, dataspace: http://en.wikipedia.org/wiki/Data_Spaces; memex: http://en.wikipedia.org/wiki/Vannevar_Bush

iMeMex information: http://imemex.ethz.ch/

Backup slides

Multiple Match Coloring Algorithm Analysis

Algorithm runtime:
– L: number of leaves in query Q
– M: maximum number of leaves in the query introduced by a trail
– N: number of trails
– $d \in \{1, \ldots, N\}$: number of levels

Theorem: the maximum number of trail applications performed by MMCA and the maximum number of leaves in the merged query tree are both bounded by $O(L \cdot M^d)$

MMCA run time analysis ($O(L \cdot M^d)$)

If trail t is matched in query Q, it colors the matched leaf nodes of Q. A subtree containing only these nodes is not matched again by t.

Worst case: in each level, only one of the trails matches for each of the leaves.

1st run: the trail match adds M new leaves for each of those leaves, for a total of LM new nodes plus L old nodes. This gives $L(M+1)$ leaves and L trail applications for the first level.

2nd run: t does not match any of the leaves anymore (they were colored in the 1st run). However, all leaves may be matched against the remaining N − 1 colors. In the worst case, again, only one of the trails matches for each of the existing leaf nodes.

In the d-th level, this leads to $L(M+1)^{d-1}$ trail applications and a total of $L(M+1)^d$ leaves.
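The worst-case counting argument above can be checked numerically with a short sketch. The function name and the sample values of L, M, and d are arbitrary; the code only replays the per-level counts stated in the analysis.

```python
def mmca_worst_case(L, M, d):
    """Replay the worst case: every leaf is matched by exactly one trail per level.

    Returns (trail applications per level, number of leaves after level d).
    """
    leaves = L
    applications_per_level = []
    for _ in range(d):
        applications_per_level.append(leaves)   # one application per existing leaf
        leaves = leaves * (M + 1)               # each leaf keeps itself and gains M new leaves
    return applications_per_level, leaves


per_level, total_leaves = mmca_worst_case(L=2, M=3, d=2)
print(per_level, total_leaves)
```

For L = 2, M = 3, d = 2 the second level has $L(M+1)^{d-1} = 8$ applications and the tree ends with $L(M+1)^d = 32$ leaves, matching the counts in the analysis.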

iDM: Lazily Computed Graph

– Nodes and edges are lazily computed
– Each node is a Resource View

iDM: Lazily Computed Graph

iDM is not a static model
– Every component of every Resource View may be created on demand
– Every Resource View may be created on demand

Behind the scenes, obtaining the content may:
– Read a file on the filesystem
– Access a page on the web
– Fetch the data from an index structure

Behind the scenes, obtaining the group may:
– Get the children of a folder in the filesystem
– Look up an edge replica
– Obtain the sections of a document
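On-demand components can be illustrated with Python's `functools.cached_property`. The class `LazyResourceView` and its loader callbacks are hypothetical and say nothing about how iMeMex implements laziness; the sketch only shows that content and group are computed on first access and then reused.

```python
from functools import cached_property


class LazyResourceView:
    """Hypothetical resource view whose components are computed on first access."""

    def __init__(self, name, load_content, load_group):
        self.name = name
        self._load_content = load_content   # e.g. read a file or fetch a page
        self._load_group = load_group       # e.g. list the children of a folder

    @cached_property
    def content(self):
        return self._load_content()

    @cached_property
    def group(self):
        return self._load_group()


calls = []


def load_content():
    calls.append("content")   # record each real load for demonstration
    return "hello"


view = LazyResourceView("notes.txt", load_content, lambda: [])
print(calls)          # nothing loaded yet
print(view.content)   # triggers the loader once
print(view.content)   # served from the cached value
print(calls)
```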

How to implement iDM: Architectural Perspective

[Figure: iMeMex PDSMS architecture. An Application Layer (Search & Browse, Office Tools, Email, ...) sits above the iMeMex PDSMS, which sits above a Data Source Layer (File System, IMAP, DBMS, ...). Inside the PDSMS, three parts are labeled: complex operators (query algebra), indexes and replicas access (warehousing), and data source access (mediation). Components shown include the iQL, iDM, and data source query processors, physical algebra operators, data cleaning operators, a result cache, data stores, indexes and replicas, catalogs, content converters, and data source plugins.]

Data management approaches

Integration solution:   Search               Dataspaces           Data Integration

Integration effort      Low                  Pay-as-you-go        High
Query semantics         Precision / Recall   Precision / Recall   Precise
Need for schema         Schema-never         Schema-later         Schema-first

Canonical form

The canonical form $\Gamma(Q)$ of a query Q is obtained by decomposing Q into location step separators (LS_SEP) and predicates (P) according to the grammar. $\Gamma(Q)$ is constructed by the following recursion:

$$
\mathit{tree} := \begin{cases}
G & \text{if tree is empty} \\
\omega(\mathit{tree}) & \text{if LS\_SEP = // and not first location step} \\
\mu(\mathit{tree}) & \text{if LS\_SEP = / and not first location step} \\
\mathit{tree} \cap \sigma_P(G) & \text{otherwise}
\end{cases}
$$
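One reading of this recursion is sketched below: walk the location steps in order, start from G, apply the deep or shallow unnest for every step after the first, and intersect with the selection for each predicate. This is an interpretation rather than the paper's definition, the deep-unnest symbol ω and the intersection are reconstructed, and the function name and predicate notation are assumptions. The result is built as a string for readability.

```python
def canonical_form(steps):
    """Build a canonical-form expression string.

    steps: list of (separator, predicate) pairs, e.g. ("//", "name=mike").
    """
    tree = None
    for sep, pred in steps:
        if tree is None:
            tree = "G"                      # tree is empty: start from G
        elif sep == "//":
            tree = f"ω({tree})"             # deep unnest (reconstructed symbol)
        elif sep == "/":
            tree = f"μ({tree})"             # shallow unnest
        else:
            raise ValueError(f"unknown separator: {sep}")
        tree = f"({tree} ∩ σ[{pred}](G))"   # otherwise: intersect with the selection
    return tree


print(canonical_form([("//", "name=mike"), ("/", "name=papers")]))
```

For //mike/papers this yields the children of nodes named mike, intersected with nodes named papers, which agrees with the query semantics described earlier for a path of the form //A/B.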
