iTrails: Pay-as-you-go Information Integration in Dataspaces. Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi.


ETH Zurich

VLDB 2007

Anat Heilper

Jan. 2009

CS Seminar in Databases (236826)


Problem: Querying Heterogeneous Data Sources

Data sources: Laptop, Email Server, Web Server, DB Server

Query: What is the impact of the global depression in Israel?

Systems: ? ? ? ?

Solution 1: Use a Search Engine

Data sources: Laptop, Email Server, Web Server, DB Server (each contributes text and links)

System: Graph IR search engine, e.g. TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]

Query: global depression Israel

Query semantics are not precise!

Solution 2: Use an Information Integration System

Figure: a query interface over a global schema returns the query result; the global schema is mapped to the source schemas of data sources 1, 2, and 3 (example relations: price index, countries, unemployment, crime rate).

Too much effort to provide schema mappings!

Querying heterogeneous data sources

Two opposite approaches:

• Schema-first approach (SFA)
– Semantically integrated view over the data sources
– Mappings between source schemas and mediated schema
– Queries have clearly defined semantics
– Expensive to construct and maintain
– Not all data sources have schemas

• No-schema approach (NSA)
– Keyword search
– Requires good result ranking methods
– Performs no integration
– Query semantics are not well defined

Motivation of iTrails

Can we find an integration solution in between these two extremes?

Figure: a dataspace system (?) sits between a graph IR search engine (sources contribute text and links) and a data integration system (sources contribute relations such as Temps, Cities, CO2, Sunspots).

Pay-as-you-go information integration: the more effort you pay, the more query power you have.

iTrails Core Idea: Add Integration Hints Incrementally

1) Provide search service over the data
– Use general graph data model (iDM)
– Handles unstructured documents, XML, and relations

2) Add integration semantics via hints (trails)

3) If more semantics are needed, apply trails
– Smooth transition between search and data integration
– Semantics added incrementally to improve precision / recall

Example of an iDM

X1 = { .name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = '' }
X2 = { .name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = '' }
. . .
X5 = { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
. . .

Figure: folder hierarchy with the nodes home, mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf; the labels 1, 2, and 5 mark the nodes described by X1, X2, and X5.

General graph data model - iDM

iDM (iMeMeX Data Model) represents every structural component of the input data as a node.

Supports unstructured, semi-structured, and structured data, e.g., files and folders, XML, relations

iMeMeX – integrated MeMeX

Vannevar Bush introduced the concept of the "memex" in 1945: a "device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility."

Bush predicted: "Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified."

Data model

Data is represented by a directed graph G = (RV, E)

RV: set of nodes {V1, . . . , Vn}, termed resource views

E: set of ordered pairs (Vi, Vj) of resource views

Vi ⇝ Vj: Vj is reachable from Vi by traversing the edges in E

Resource view

A resource view Vi has three components: name, tuple, and content

Component     Type
Vi.name       string
Vi.tuple      sequence of attribute-value pairs ((att0, val0), (att1, val1), ...)
Vi.content    text

Example: { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
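The data model and the resource view components can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names `ResourceView` and `Graph`, the field name `tuple_`, and the simplified three-node hierarchy are hypothetical and not part of iMeMex.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceView:
    """One node of the graph with its three components."""
    name: str
    tuple_: dict = field(default_factory=dict)  # attribute -> value pairs
    content: str = ""


class Graph:
    """Directed graph G = (RV, E); a hypothetical helper for illustration."""

    def __init__(self):
        self.rv = []      # resource views V1..Vn
        self.edges = {}   # index of Vi -> indexes of its direct successors

    def add(self, view, parent=None):
        self.rv.append(view)
        idx = len(self.rv) - 1
        self.edges[idx] = []
        if parent is not None:
            self.edges[parent].append(idx)
        return idx

    def reachable(self, i, j):
        """True if Vj can be reached from Vi by traversing one or more edges."""
        stack, seen = list(self.edges[i]), set()
        while stack:
            k = stack.pop()
            if k == j:
                return True
            if k not in seen:
                seen.add(k)
                stack.extend(self.edges[k])
        return False


# Simplified hierarchy (assumption): home -> mike -> SIGMOD42.pdf
g = Graph()
home = g.add(ResourceView("home", {"owner": "root", "lastmodified": "05.01.2000"}))
mike = g.add(ResourceView("mike", {"owner": "root", "lastmodified": "04.17.2008"}), parent=home)
pdf = g.add(ResourceView("SIGMOD42.pdf",
                         {"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"},
                         "@PDF ..."), parent=mike)
```

Storing edges as index lists keeps the reachability check a plain depth-first traversal.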

Query model

Query expression:
– Query Q selects a set of nodes R := Q(G) ⊆ G.RV
– Example: //mike/papers

Component projection:
– C ∈ {.name, .tuple.<att_i>, .content}: projection of the set of resource views selected by query Q onto component C, i.e. the set of components R' := {Vi.C | Vi ∈ Q(G)}

Component projection example

Example: //mike//PIM/*.tuple.lastmodified

X1 = { .name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = '' }
X2 = { .name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = '' }
. . .
X5 = { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
. . .

Figure: the same folder hierarchy as in the iDM example (home, mike, papers, PIM, SIGMOD42.pdf, SIGMOD44.pdf, QP, VLDB12.pdf, VLDB10.pdf, projects, PIM, SIGMOD42.pdf), with labels 1, 2, and 5 marking X1, X2, and X5.
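Component projection can be sketched as a function over the resource views a query has already selected. The dictionary layout and the function name `project` are assumptions for illustration; skipping views that lack the requested attribute is also an assumption, not something the slides state.

```python
def project(views, component):
    """Return the set {V.C | V in views} for one component C.

    Each view is a dict with the keys "name", "tuple", and "content".
    Views without the requested tuple attribute are skipped (assumption).
    """
    out = set()
    for v in views:
        if component == ".name":
            out.add(v["name"])
        elif component == ".content":
            out.add(v["content"])
        elif component.startswith(".tuple."):
            att = component[len(".tuple."):]
            if att in v["tuple"]:
                out.add(v["tuple"][att])
        else:
            raise ValueError(f"unknown component: {component}")
    return out


# X5 from the example, assumed to be among the views selected by //mike//PIM/*
x5 = {"name": "SIGMOD42.pdf",
      "tuple": {"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"},
      "content": "@PDF ..."}
selected = [x5]
dates = project(selected, ".tuple.lastmodified")
```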

Syntax of query expression

QUERY_EXPRESSION := (PATH | KT_PREDICATE) (union QUERY_EXPRESSION)*

PATH := (LOCATION_STEP)+

LOCATION_STEP := LS_SEP NAME_PREDICATE (`[` KT_PREDICATE `]`)?

LS_SEP := `//` | `/`

NAME_PREDICATE := `*` | (`*`)? VALUE (`*`)?

KT_PREDICATE := (KEYWORD | TUPLE) (LOGOP KT_PREDICATE)*

KEYWORD := `"` VALUE (WHITESPACE VALUE)* `"` | VALUE (WHITESPACE KEYWORD)*

TUPLE := ATTRIBUTE_IDENTIFIER OPERATOR VALUE

OPERATOR := `=` | `<` | `>`

LOGOP := `AND` | `OR`
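To make the PATH part of the grammar concrete, here is a minimal sketch that splits a path into location steps. It is a simplification under assumptions: the function name `parse_path` is hypothetical, the bracketed predicate is returned as an unparsed string, and keyword queries and `union` are not handled.

```python
import re

# One location step: LS_SEP, NAME_PREDICATE, optional [KT_PREDICATE]
STEP = re.compile(r"(//|/)([^/\[\]]+)(?:\[([^\]]*)\])?")


def parse_path(path):
    """Split a PATH into (separator, name_predicate, kt_predicate) triples."""
    steps, pos = [], 0
    for m in STEP.finditer(path):
        if m.start() != pos:          # text between steps that is not a step
            raise ValueError(f"cannot parse path at offset {pos}: {path!r}")
        steps.append((m.group(1), m.group(2).strip(), m.group(3)))
        pos = m.end()
    if not steps or pos != len(path):
        raise ValueError(f"not a valid path: {path!r}")
    return steps


steps = parse_path("//Temperatures/*[celsius>10]")
```

The third element of each triple is `None` when the location step has no bracketed predicate.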

Semantics of query expressions

Query expression    Semantics
//*                 All nodes in the graph
a                   All nodes in the graph that have 'a' in their content
a b                 All nodes in the graph that have both 'a' and 'b' in their content
//A                 All nodes V in the graph such that V.name = 'A'
//A/B               All nodes V such that V.name = 'B' and there is an edge (W, V) with W.name = 'A'

Logical algebra for query expressions

Operator     Name                 Semantics
G            All resource views   {V | V ∈ G.RV}
σ_P(I)       Selection            {V | V ∈ I ∧ P(V)}
μ(I)         Shallow unnest       {W | (V, W) ∈ G.E ∧ V ∈ I}
ω(I)         Deep unnest          {W | V ⇝ W ∧ V ∈ I}
I1 ∩ I2      Intersection         {V | V ∈ I1 ∧ V ∈ I2}
I1 ∪ I2      Union                {V | V ∈ I1 ∨ V ∈ I2}
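The algebra operators map directly onto set operations. The sketch below is an illustration under assumptions: the node numbering, the simplified folder hierarchy, and all function names are hypothetical, and intersection and union are simply Python's `&` and `|` on sets.

```python
# Simplified example graph (assumption): node id -> name, node id -> children
NAMES = {1: "home", 2: "mike", 3: "papers", 4: "PIM",
         5: "SIGMOD42.pdf", 6: "projects", 7: "PIM"}
EDGES = {1: {2}, 2: {3, 6}, 3: {4}, 4: {5}, 6: {7}}


def all_views():
    """G: all resource views."""
    return set(NAMES)


def select(inp, pred):
    """Selection: views in the input that satisfy predicate P."""
    return {v for v in inp if pred(v)}


def shallow_unnest(inp):
    """Shallow unnest: direct successors of the input views."""
    return {w for v in inp for w in EDGES.get(v, ())}


def deep_unnest(inp):
    """Deep unnest: everything reachable from the input views."""
    out, frontier = set(), shallow_unnest(inp)
    while frontier:
        out |= frontier
        frontier = shallow_unnest(frontier) - out
    return out


def named(name):
    return select(all_views(), lambda v: NAMES[v] == name)


# //mike/papers : children of mike that are named papers
mike_papers = shallow_unnest(named("mike")) & named("papers")
# //mike//PIM : descendants of mike that are named PIM
mike_pim = deep_unnest(named("mike")) & named("PIM")
```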

Example


What have we seen so far?

Problem: querying heterogeneous data sources

Find a solution between SFA and NSA
– Generic graph data model to describe the data
– Queries describe paths in the graph

How do iTrails help?

Queries are modified by hints (trails), which add or modify the search paths to look at.

Example: yesterday → //*[date = today() – 1]

iTrails: Defining Trails

Basic Form of a Trail

QL[.CL] → QR[.CR]

Intuition: when I query for QL[.CL], you should also query for QR[.CR]

QL, QR: queries (keyword and path expressions)

.CL, .CR: attribute projections

iTrails: Defining Trails

Unidirectional trail: QL[.CL] → QR[.CR]
– Intuition: when querying for QL[.CL], also query for QR[.CR]

Bidirectional trail: QL[.CL] ↔ QR[.CR]
– Example: ψi := //*.tuple.date ↔ //*.tuple.modified

QL, QR: queries (keyword and path expressions)
– Query examples: global warming zurich, or //Temperatures/*[celsius > 10]

.CL, .CR: attribute projections
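A trail definition can be captured in a small record type. This is a sketch under assumptions: the class `Trail`, its field names, and the `directed` helper are hypothetical, and a bidirectional trail is treated simply as a pair of unidirectional trails.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trail:
    left: str                          # QL: keyword or path expression
    right: str                         # QR: keyword or path expression
    left_proj: Optional[str] = None    # .CL: optional attribute projection
    right_proj: Optional[str] = None   # .CR: optional attribute projection
    bidirectional: bool = False

    def directed(self):
        """Expand the trail into one or two unidirectional trails."""
        forward = Trail(self.left, self.right, self.left_proj, self.right_proj)
        if not self.bidirectional:
            return [forward]
        backward = Trail(self.right, self.left, self.right_proj, self.left_proj)
        return [forward, backward]


# The bidirectional example from the slide: date <-> modified
date_modified = Trail("//*", "//*", ".tuple.date", ".tuple.modified", bidirectional=True)
# A unidirectional keyword-to-path trail
warming = Trail("global warming", "//Temperatures/*[celsius > 10]")
```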

Trail Examples: Global Warming Zurich

Query: global warming zurich

Trail for implicit meaning: when querying for global warming, also query for temperature data above 10 degrees

global warming → //Temperatures/*[celsius > 10]

Trail for an entity: when querying for zurich, also query for references to Zurich as a region

zurich → //*[region = "ZH"]

Temperatures
city      date      region    celsius
Bern      24-Sep    BE        20
Uster     24-Sep    ZH        15
Zurich    25-Sep    ZH        14
Zurich    26-Sep    ZH        9

Trail Example: Deep Web Bookmarks

Trail for a bookmark: when querying for train home, also query the train company's website with origin = Tel Aviv Uni and destination = Haifa Hof Hacarmel

Query: train home

train home → //trainCompany.com//*[origin = "Tel Aviv Uni" and dest = "Haifa Hof Hacarmel"]

Data source: Web Server

Trail Examples: Thesauri, Dictionaries, Language-agnostic Search

Trail for thesauri: when querying for car, also query for auto

car → auto

Trails for a dictionary: when querying for car, also query for carro, and vice versa

car ↔ automobile, i.e. car → automobile and automobile → car

Data sources: Laptop, Email Server

Trail Examples: Schema Equivalences

Trail for schema match on names: when querying for Employee.empName, also query for Person.name

//Employee//*.tuple.empName → //Person//*.tuple.name

Trail for schema match on salaries: when querying for Employee.salary, also query for Person.income

//Employee//*.tuple.salary → //Person//*.tuple.income

Relations on the DB Server:
Employee(empId, empName, salary)
Person(SSN, name, age, income)

How are Trails Created?

Given by the user
– Explicitly
– Via relevance feedback

(Semi-)automatically
– Automatic schema matching
– Ontologies and thesauri (e.g., WordNet)
– User communities (e.g., trails on gene data, bookmarks)

Uncertainty and Trails

Probabilistic trails:
– Model uncertain trails
– Probabilities are used to rank trails

QL[.CL] → QR[.CR] with probability p, 0 ≤ p ≤ 1
– Example: car → auto, p = 0.9

The probability p reflects the likelihood that results obtained via the trail are correct.

Uncertainty and Trails (continued)

Scored trails:
– Give a higher value to certain trails
– Scoring factors boost the scores of results obtained via the trail

QL[.CL] → QR[.CR] with scoring factor sf ≥ 1. Examples:
– T1: weather → //Temperatures/*, sf ≥ 1
– T2: yesterday → //*[date = today() – 1], sf ≥ 1

Intuition: sf reflects the relevance of the trail.
– Results obtained via the trail are scored sf times higher than the results obtained without the trail.
– If no scoring factor is available, sf = 1
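How probabilities and scoring factors might be used can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, trails are plain `(name, p)` pairs, and taking the maximum when a result is found both with and without the trail is a choice made here, not something the slides specify.

```python
def rank_trails(trails):
    """Order (name, probability) pairs by decreasing probability p."""
    return sorted(trails, key=lambda t: t[1], reverse=True)


def merge_scores(original, via_trail, sf=1.0):
    """Combine result scores; results obtained via the trail are boosted by sf.

    original, via_trail: dict result id -> score.
    A result found both ways keeps the higher score (assumption).
    """
    merged = dict(original)
    for result, score in via_trail.items():
        merged[result] = max(merged.get(result, 0.0), sf * score)
    return merged


ranked = rank_trails([("car->auto", 0.9), ("car->truck", 0.2), ("car->carro", 0.6)])
scores = merge_scores({"a": 0.5}, {"a": 0.5, "b": 0.25}, sf=2.0)
```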

Rewriting Queries with Trails

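Query: weather yesterday, i.e. the query tree weather ∩ yesterday

Trail T2: yesterday → //*[date = today() – 1]

(1) Matching: the left side of T2 matches the leaf yesterday in the query tree

(2) Transformation: the matched leaf is unioned with the right side of the trail: yesterday ∪ //*[date = today() – 1]

(3) Merging: the transformed subtree is merged back into the query: weather ∩ (yesterday ∪ //*[date = today() – 1])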

Replacing Trails

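Trails that use replace semantics instead of union semantics

Query: weather yesterday, i.e. the query tree weather ∩ yesterday

Trail T2 (replacing): yesterday is replaced by //*[date = today() – 1]

(1) Matching: the left side of T2 matches the leaf yesterday in the query tree

(2) Transformation: the matched leaf is replaced by the right side of the trail: //*[date = today() – 1]

(3) Merging: the transformed subtree is merged back into the query: weather ∩ //*[date = today() – 1]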
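The three rewriting steps, for both union and replace semantics, can be sketched on a query tree made of nested tuples. This is a simplification under assumptions: the function name `apply_trail` is hypothetical, inner nodes are `("and", ...)` or `("or", ...)` tuples, leaves are strings, and matching is exact string equality on leaves rather than matching on query subtrees.

```python
def apply_trail(tree, left, right, replace=False):
    """Rewrite a query tree with the trail left -> right."""
    if isinstance(tree, str):
        if tree != left:                     # (1) matching: leaf vs. left side
            return tree
        if replace:                          # (2) transformation, replace semantics
            return right
        return ("or", tree, right)           # (2) transformation, union semantics
    op, *children = tree
    # (3) merging: rebuild the inner node around the transformed children
    return (op, *(apply_trail(c, left, right, replace) for c in children))


query = ("and", "weather", "yesterday")
date_path = "//*[date = today() - 1]"

unioned = apply_trail(query, "yesterday", date_path)
replaced = apply_trail(query, "yesterday", date_path, replace=True)
# Applying the same trail again matches again, so the tree keeps growing.
twice = apply_trail(unioned, "yesterday", date_path)
```

The last line shows why naive repeated application needs some form of control: the rewritten query still contains the leaf that triggered the trail.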

Problem: Recursive Matches (1/2)

T2: yesterday → //*[date = today() – 1]

After T2 matches once: weather ∩ (yesterday ∪ //*[date = today() – 1])

The new query still matches T2, so T2 could be applied again:

weather ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[date = today() – 1])

T2 matches again, and so on: infinite recursion

Problem: Recursive Matches (2/2)

Trails may be mutually recursive

T3: //*.tuple.date → //*.tuple.modified

T10: //*.tuple.modified → //*.tuple.date

Query after applying T2: weather ∩ (yesterday ∪ //*[date = today() – 1])

T3 matches: weather ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[modified = today() – 1])

T10 matches: weather ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[modified = today() – 1] ∪ //*[date = today() – 1])

We again match T3 and enter an infinite loop

Algorithm to solve recursion - MMCA

Multiple Match Coloring Algorithm (MMCA):
– Keep a history of all trails matched or introduced
– Given a set of trails Y, for every trail t in Y: apply t to Q iteratively and color the query tree nodes in Q according to the trails that have already touched those nodes
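The coloring idea can be sketched as follows. This is not the authors' implementation; it is a minimal illustration under assumptions: leaves carry a set of trail names as colors, matching is exact string equality, a new leaf inherits the colors of the leaf that produced it plus the trail that introduced it, and the names `Leaf`, `rewrite_level`, `mmca`, and `leaves` are hypothetical.

```python
from typing import NamedTuple


class Leaf(NamedTuple):
    expr: str
    colors: frozenset = frozenset()   # trails that already touched this leaf


def rewrite_level(node, trails):
    """Apply every matching, not-yet-used trail once; return (tree, changed)."""
    if not isinstance(node, Leaf):
        op, *children = node
        results = [rewrite_level(c, trails) for c in children]
        return (op, *(r[0] for r in results)), any(r[1] for r in results)
    hits = [(name, right) for name, (left, right) in trails.items()
            if left == node.expr and name not in node.colors]
    if not hits:
        return node, False
    touched = node.colors | {name for name, _ in hits}
    new = [Leaf(right, node.colors | {name}) for name, right in hits]
    return ("or", Leaf(node.expr, touched), *new), True


def mmca(query, trails, max_levels):
    for _ in range(max_levels):
        query, changed = rewrite_level(query, trails)
        if not changed:
            break
    return query


def leaves(node):
    if isinstance(node, Leaf):
        return [node]
    return [leaf for child in node[1:] for leaf in leaves(child)]


D = "//*[date = today() - 1]"
M = "//*[modified = today() - 1]"
trails = {"T2": ("yesterday", D), "T3": (D, M), "T10": (M, D)}
result = mmca(("and", Leaf("weather"), Leaf("yesterday")), trails, max_levels=10)
```

With the mutually recursive trails T3 and T10, the rewrite stops after three levels because the last introduced leaf already carries the color T3.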

Multiple Match Coloring Algorithm: example

T1: weather → //Temperatures/*

T2: yesterday → //*[date = today() – 1]

T3: //*.tuple.date → //*.tuple.modified

T4: //*.tuple.date → //*.tuple.received

Query: weather ∩ yesterday

First level (T1 matches, T2 matches):
(weather ∪ //Temperatures/*) ∩ (yesterday ∪ //*[date = today() – 1])

Second level (T3, T4 match):
(weather ∪ //Temperatures/*) ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[modified = today() – 1] ∪ //*[received = today() – 1])

Multiple Match Coloring Algorithm (cont.)

MMCA is exponential in the number of levels
– Any of the trails can be applied to every leaf, and each trail can generate additional leaves

Solution: trail pruning
– Limit the number of levels: punish recursive rewrites
– Keep only the top-K trails matched in each level, ranked by probability / certainty / weight
– Other: timeout, progressively compute query results
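Top-K pruning within one level can be sketched with the standard library. The function name `prune_matches` and the `(trail_name, probability)` representation are assumptions for illustration; limiting the number of levels would be a separate loop bound around the rewrite.

```python
import heapq


def prune_matches(matches, k):
    """Keep the k matched trails with the highest probability.

    matches: list of (trail_name, probability) pairs for one level.
    """
    return heapq.nlargest(k, matches, key=lambda m: m[1])


level_matches = [("T1", 0.9), ("T2", 0.4), ("T3", 0.7)]
kept = prune_matches(level_matches, k=2)
```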

iTrails Evaluation in iMeMex

Main questions in the evaluation
– Quality: top-K precision and recall
– Performance: use of materialization
– Scalability: query-rewrite time vs. number of trails

iTrails Evaluation in iMeMex

Scenario 1: Few high-quality trails
– Closer to information integration use cases
– Obtained real datasets and indexed them
– 18 hand-crafted trails
– 14 hand-crafted queries

Scenario 2: Many low-quality trails
– Closer to search use cases
– Randomly generated up to 10,000 trails and queries with a mutual uniform match probability of 1%

iTrails Evaluation in iMeMex: Scenario 1

Configured iMeMex to act in three modes
– Baseline: graph / IR search engine
– iTrails: rewrite search queries with trails
– Perfect Query: semantics-aware query

Data: shipped to a central index

Figure: data source sizes in MB for Laptop, Email Server, Web Server, and DB Server.

Trails and queries used in Scenario 1

– Max original tree size: 14
– Max final tree size after applying trails: 35
– Max number of trails applied: 5

Quality: Top-K Precision and Recall (k=20)

Scenario 1: few high-quality trails (18 trails)

Figure: top-K precision and recall per query, including the perfect query.

– The search engine misses relevant results
– The search query is partially semantics-aware
– The perfect query always has precision and recall equal to 1
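For reference, top-K precision and recall can be computed as below. This sketch uses the standard information-retrieval definitions as an assumption; the exact definitions used in the evaluation are not given on the slides, and the function name is hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k=20):
    """Precision and recall of the first k ranked results.

    ranked: list of result ids, best first; relevant: set of relevant ids.
    """
    top = ranked[:k]
    hits = sum(1 for r in top if r in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=2)
```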

Performance: Use of Materialization

Trail merging adds overhead to query execution

Trail materialization improves performance for almost all queries

Scenario 1: few high-quality trails (18 trails)

Scalability: Query-rewrite Time vs. Number of Trails – scenario 2

• Without pruning: exponential growth in the query plan sizes
• Query-rewrite time can be controlled with pruning

Summary

First framework to explore pay-as-you-go information integration in dataspaces

iTrails: generic method to model semantic relationships gradually

iTrails are used to rewrite queries

Algorithm to control recursive query rewrites

Personal opinion - advantages

The method is incremental
– Integrators can collect statistics, find the most common queries, and define trails for popular queries first.
– Dynamic system: if popular queries change over time, trails for less popular queries can be disabled to reduce system workload.

Trails can be defined independently by domain experts for each data domain.

Personal opinion - disadvantages

Trails are global: every rewritten query is evaluated over every data source.
– A trail can have a different meaning for different data sources.

For good query result quality, trails have to be defined manually, which is a problem for large systems.
– Solution: use machine learning techniques to improve automatic trail creation.

Overlaps and inconsistencies between trails are possible, since a query returns the union of the results satisfying all trails.
– Solution: trail mining and weighting would be helpful here.

Questions?


Bibliography

– iTrails: Pay-as-you-go Information Integration in Dataspaces. Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi. ETH Zurich, 8092 Zurich, Switzerland. dbis.ethz.ch | iMeMex.org

– From Databases to Dataspaces: A New Abstraction for Information Management. Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), David Maier (Portland State University).

– Wikipedia. Dataspace: http://en.wikipedia.org/wiki/Data_Spaces, memex: http://en.wikipedia.org/wiki/Vannevar_Bush

– iMeMex information: http://imemex.ethz.ch/

Backup slides


Multiple Match Coloring Algorithm Analysis

Algorithm runtime:
– L: number of leaves in query Q
– M: maximum number of leaves introduced into the query by a trail
– N: number of trails
– d ∈ {1, . . . , N}: number of levels

Theorem: the maximum number of trail applications performed by MMCA and the maximum number of leaves in the merged query tree are both bounded by O(L · M^d)

MMCA run time analysis (O(L · M^d))

If trail t is matched in query Q, it colors the matched leaf nodes of Q. A subtree containing only these nodes is not matched again by t.

Worst case: in each level, one of the trails matches for each of the leaves.

1st run: the trail introduces M new leaves for each of the L leaves
– Total of L·M new nodes plus L old nodes: L(M+1) leaves and L trail applications for the first level

2nd run: t does not match any of the leaves anymore (they were colored in the 1st run)
– However, all leaves may be matched against the remaining N − 1 colors
– Worst case, again, one of the trails matches for each of the existing leaf nodes

d-th level: L(M+1)^(d−1) trail applications and a total of L(M+1)^d leaves
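The worst-case growth can be checked with a few lines of arithmetic. This sketch only simulates the counting argument above (one trail application per existing leaf in each level, M new leaves per application); the function name `worst_case` is hypothetical.

```python
def worst_case(L, M, d):
    """Return (leaves after d levels, trail applications per level)."""
    leaves, per_level = L, []
    for _ in range(d):
        per_level.append(leaves)   # one trail application per existing leaf
        leaves += leaves * M       # each application introduces M new leaves
    return leaves, per_level


leaves, applications = worst_case(L=2, M=3, d=2)
```

For L = 2, M = 3, d = 2 the second level performs 2 · 4 = 8 applications and the tree ends with 2 · 4² = 32 leaves, matching L(M+1)^(d−1) and L(M+1)^d.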

iDM: Lazily Computed Graph

Nodes and edges are lazily computed

Each node is a Resource View

iDM: Lazily Computed Graph

iDM is not a static model
– Every component of every Resource View may be created on demand
– Every Resource View may be created on demand

Behind the scenes, obtaining the content may:
– Read a file on the filesystem
– Access a page on the web
– Fetch the data from an index structure

Behind the scenes, obtaining the group may:
– Get the children of a folder in the filesystem
– Look up an edge replica
– Obtain the sections of a document
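On-demand computation of components can be sketched with `functools.cached_property`. This is an illustration under assumptions: the class `LazyResourceView` and its loader callbacks are hypothetical, and caching the value after the first access is a design choice made here, not something the slides state.

```python
from functools import cached_property


class LazyResourceView:
    """Resource view whose content and group are computed on first access."""

    def __init__(self, name, load_content, load_group):
        self.name = name
        self._load_content = load_content   # e.g. reads a file or fetches a page
        self._load_group = load_group       # e.g. lists the children of a folder

    @cached_property
    def content(self):
        return self._load_content()

    @cached_property
    def group(self):
        return self._load_group()


calls = []


def fake_read():
    calls.append("content")   # records that the loader actually ran
    return "@PDF ..."


view = LazyResourceView("SIGMOD42.pdf", fake_read, lambda: [])
```

Nothing is loaded when the view is constructed; the loader runs the first time `view.content` is read.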

How to implement iDM: Architectural Perspective

Figure: architecture of the iMeMex PDSMS, placed between the application layer (Search & Browse, Email, Office Tools, ...) and the data source layer (File System, IMAP, DBMS, ...).

Layers of the PDSMS:
– Complex operators (query algebra)
– Indexes & replicas access (warehousing)
– Data source access (mediation)

Components shown in the figure: iQL query processor, iDM query processor, data source query processor, operators, physical algebra, catalogs, result cache, data cleaning, indexes & replicas, data store, content converters, data source plugins.

Data management approaches

Features              Search                Dataspaces            Data Integration
Integration effort    Low                   Pay-as-you-go         High
Query semantics       Precision / Recall    Precision / Recall    Precise
Need for schema       Schema-never          Schema-later          Schema-first

Canonical form

The canonical form Γ(Q) of a query Q is obtained by decomposing Q into location step separators and predicates (P) according to the grammar. Γ(Q) is constructed by the following recursion:

tree :=  G                 if tree is empty
         ω(tree)           if LS_SEP = // and not the first location step
         μ(tree)           if LS_SEP = / and not the first location step
         tree ∩ σ_P(G)     otherwise
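The recursion can be sketched as a loop that builds the canonical form as a string. This is one reading of the recursion under assumptions: the function name `canonical_form` is hypothetical, each location step is given as a `(separator, predicate)` pair, the separator of the first step is ignored, and the result is not simplified (for example, G ∩ σ_P(G) is not reduced to σ_P(G)).

```python
def canonical_form(steps):
    """Build the canonical form for a list of (LS_SEP, predicate) pairs."""
    tree = "G"                                   # tree is empty: start from G
    for i, (sep, pred) in enumerate(steps):
        if i > 0:                                # not the first location step
            tree = f"ω({tree})" if sep == "//" else f"μ({tree})"
        tree = f"{tree} ∩ σ[{pred}](G)"          # otherwise: intersect with selection
    return tree


form = canonical_form([("//", "name=A"), ("/", "name=B")])
```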
